Batch Cataloging of Scanned Documents via OCR?
munwin99 asks: "I am looking for some software to process a batch of images (scanned forms). We want to use Gallery to view the images, and be able to search them by 3 or 4 attributes. We want to get these attributes from the form (date, name, etc). We want it to check a section of the scanned form, read the info from that section(s), and dump the retrieved info into Gallery (using OCR / ICR). Is there any (preferably) free or open source software that can do this? Supported OSes should include either Windows, Linux or Mac OS X. Even Gallery is optional, if someone has a better suggestion."
Custom Layout (Score:5, Informative)
Re:Forget Gallery (Score:2)
Hire a kid from high school (Score:3, Insightful)
Not a troll, a job application (Score:1)
Re:Hire a kid from high school (Score:1)
Plus, it creates jobs for those who have trouble finding something to fit in with studying.
Hack your value/key pairs into EXIF data (Score:2, Offtopic)
Anyway this is probably how you'd want to go about this:
1. Scan doc to file
2.
Open Source OCR (Score:2, Informative)
The OCR / document image layout analysis world is dominated by a handful of commercial companies. There is a dearth of OCR and document analysis code available in the open source community. What is available on any sort of 'free' basis is, I would suggest, of little use other than as a starting point for some serious development of your own.
The big names commercially are:
ABBYY [abbyy.com]
Re:Open Source OCR (Score:2)
cheap way to buy ABBYY (Score:2)
Here's what you do: Buy a 5.0 license of ABBYY FineReader off eBay. You can get one for about $10.
Then buy the upgrade version of the latest ABBYY FineReader for $150.
That's still $160 in total, but it's considerably cheaper than paying the new price of $500-600. The ABBYY FineReader docs say specifically that the upgrade software will work successfully on ALL prior versions of FineReader.
As far as feeding into a database, I'm afr
Re:Here's what we did (Score:1)
What OCR software did you use? I haven't had very good luck with this. (The documents are already scanned into PDFs when I receive them, so I have no control over the quality.)
Re:Here's what we did (Score:1)
You Get What You Pay For (Score:3, Informative)
The big players are ABBYY and ScanSoft. Both have extensive feature lists, from handy GUIs to form/document layout to Asian language support. They also come with a hefty price tag. Their Windows support is best, but they offer software for other platforms too. Single-user applications are reasonably priced, in the two-digit range. However, we decided not to integrate with either of them, in part because of the price tag for high-volume server-side processing. If you only have a hundred or two forms to do at a time, a workstation solution may be your best bet.
We chose to integrate with Transym [transym.com], a cheap but pretty good engine. It does a good job at what it tries to do, which is recognize standard printed text. We then take that text and extract meaningful data, like dates and names, from the output text + position information. Pretty much every other cheap/free package we looked at had pretty lousy performance on our straightforward documents (primarily typed paragraphs).
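To illustrate the "output text + position information" step, here is a minimal sketch in Python. It is not Transym's API or anyone's actual pipeline: the Word and Zone types, the pixel coordinates, the line_height bucketing and the MM/DD/YYYY date format are all assumptions made up for the example. The idea is only that each field on a fixed-layout form lives in a known rectangle, so you collect the OCR words that fall inside it and then parse them.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Word:
    """One OCR'd word plus its position (hypothetical shape of engine output)."""
    text: str
    x: float  # left edge, in pixels
    y: float  # top edge, in pixels


@dataclass
class Zone:
    """Rectangle on the form where one field is expected (assumed layout)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, w: Word) -> bool:
        return self.left <= w.x < self.right and self.top <= w.y < self.bottom


def text_in_zone(words, zone, line_height=20):
    """Join the words inside a zone in reading order.

    Words are bucketed into lines by y // line_height so that small
    vertical jitter does not scramble the left-to-right order.
    """
    inside = [w for w in words if zone.contains(w)]
    inside.sort(key=lambda w: (int(w.y // line_height), w.x))
    return " ".join(w.text for w in inside)


# Assumes US-style MM/DD/YYYY dates; adjust for your forms.
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def parse_date(text):
    """Return a date if the text contains a valid MM/DD/YYYY, else None."""
    m = DATE_RE.search(text)
    if not m:
        return None
    month, day, year = (int(g) for g in m.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. 13/45/2005 from a misread digit
        return None


# Made-up OCR output for a two-field form.
words = [
    Word("Name:", 10, 10), Word("Jane", 80, 12), Word("Doe", 130, 10),
    Word("Date:", 10, 50), Word("03/14/2005", 80, 50),
]
NAME_ZONE = Zone(left=70, top=0, right=300, bottom=30)
DATE_ZONE = Zone(left=70, top=40, right=300, bottom=70)

name = text_in_zone(words, NAME_ZONE)
form_date = parse_date(text_in_zone(words, DATE_ZONE))
print(name, form_date)
```

The extracted values can then go into whatever you search with, whether that is Gallery metadata or a plain database table.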
ICR (recognizing handwriting) and IMR (mark recognition) are another matter. There are very few players in this arena. They work best when the domain is well-defined (the U.S. Postal Service, for instance, does pretty well at recognizing ZIP codes). If you're trying to recognize dates and check boxes, the form definition software that ABBYY and ScanSoft provide probably fits your needs best.
Finally, you need to consider how reliable you want your process to be and how much quality control you want. Even the best OCR engine makes errors, and ICR is quite a bit behind that. You can't blindly trust OCR output unless you're willing to deal with incomplete data. If you're going to have a human verify the computer's work for only a few fields, you may not be gaining significant efficiency.
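One common way to handle the quality-control trade-off is to send only the doubtful fields to a human. The sketch below assumes your OCR engine reports a per-field confidence score between 0 and 1, which not every engine does; the Field type, the split_for_review helper and the 0.90 threshold are hypothetical and would need tuning against your own forms.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    value: str
    confidence: float  # assumed 0.0-1.0 score from the OCR engine


def split_for_review(fields, threshold=0.90):
    """Separate fields that can be trusted from those a person should check."""
    accepted, needs_review = [], []
    for f in fields:
        # Empty values are always reviewed, whatever the score says.
        if f.value.strip() and f.confidence >= threshold:
            accepted.append(f)
        else:
            needs_review.append(f)
    return accepted, needs_review


# Made-up results for one scanned form.
fields = [
    Field("name", "Jane Doe", 0.97),
    Field("date", "03/14/2OO5", 0.62),  # letter O misread for zero
    Field("form_id", "", 0.99),
]
accepted, needs_review = split_for_review(fields)
print([f.name for f in accepted])      # ['name']
print([f.name for f in needs_review])  # ['date', 'form_id']
```

If most forms end up in the review pile anyway, that is a sign the efficiency gain may not be there, as noted above.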
(I don't claim to have evaluated every potential option. There may be software we missed, software we didn't evaluate because it didn't meet our integration needs, and software that's come to light after we did our search.)
Re: (Score:2, Informative)
Comment removed (Score:3, Informative)
Zylab (Score:2)
Contact Amazon (Score:2)
Docubase (Score:2)