Automated OCR for Forms Processing?

Oscar Carrillo asks: "We are working on a large NIH grant that collects tons of data, much of it in the form of questionnaires. The forms will be available on the web, but it's mostly not feasible to have the subjects sit in front of the computer all day (not to mention that people get annoyed sitting in front of a computer all day). The study is being conducted at several universities and institutions around the country. Using Linux/JSP/Struts/PostgreSQL will take care of most of our needs. But it would save a lot of data entry if all forms could be scanned at each site, the images uploaded to the website, and then automatically put through OCR (Optical Character Recognition) to extract only the relevant raw data that subjects wrote. Does anyone know of something that can handle this? Are there any open source projects that can handle this? Any good commercial alternatives?"

  • Bubbles (Score:3, Insightful)

    by adamy ( 78406 ) on Tuesday July 16, 2002 @02:33PM (#3896177) Homepage Journal
    Remember those old annoying CTBS tests and SAT stuff? If these surveys are multiple choice, use the old #2 lead pencil and scan 'em in that way. Your data will already be entered. Most universities have the facilities for this already.

    Do not count on handwriting recognition to be successful for the people who fill out the surveys. While it works fine for typeset and computer-generated print, it won't work across many different handwriting styles and many different idiomatic expressions.

    • I believe this [scantron.com] is what you are thinking about.

      Or maybe this [google.com]?

      Due to the hardware involved, I imagine this isn't something that some OSS coder is going to slap together. NIH should have a reader/software somewhere.

      Nature of questions would help answer the question a bit better.

  • by Hee Hee Hee ( 310695 ) on Tuesday July 16, 2002 @02:39PM (#3896241)
    Check out NLM's DocMorph at docmorph.nlm.nih.gov/docmorph/default.htm. It's a site put up by the NIH (coincidentally) where you post your scanned image and they post an OCR'ed document, in the format you choose, for a short period. It does a fairly good job for the price (free).
    • NLM's site does OCR (reading "machine-printed" text). The poster wants to do ICR (read "handwritten text"). There are very few companies that do this; and AFAIK there are no public domain/OSS/freeware solutions.

      How many pages are we talking about? And at what resolution are they scanned? Anything below 256DPI for handwritten is not worth it.

  • For one of our clients, OCR forms made some sense, but the problem was that a computer form was vastly easier to use for our purposes. If someone typed in a vendor name, the computer form made an educated guess after the first few characters; if the guess was incorrect, you just kept on typing, and if it was correct you just tabbed over. An OCR form system would have to be able to correlate "Sqishy Soft", "Squish-Soft", "SS LLC", "SquishySofware" and every permutation and bad spelling with vendor #1212, whereas the computer form would just let you pick vendor "Squishy Software INC" from the list.
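
To make the correlation problem concrete, here is a minimal sketch using Python's standard `difflib`. The vendor table, the alias list, and the 0.6 cutoff are all assumptions for illustration; only vendor #1212 and the misspellings come from the comment above. Note that an abbreviation like "SS LLC" cannot be recovered by string similarity at all and needs a hand-maintained alias.

```python
import difflib

# Hypothetical vendor table; only vendor #1212 comes from the comment above.
VENDORS = {
    "Squishy Software INC": 1212,
    "Acme Hardware Co": 1300,
}
# Abbreviations share almost no characters with the full name, so they
# need an explicit, hand-maintained alias list (also hypothetical).
ALIASES = {"ss llc": 1212}


def normalize(name):
    """Lowercase the name and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def match_vendor(raw, cutoff=0.6):
    """Return a vendor number for a recognized name, or None for human review."""
    alias = ALIASES.get(raw.strip().lower())
    if alias is not None:
        return alias
    by_key = {normalize(name): number for name, number in VENDORS.items()}
    hits = difflib.get_close_matches(normalize(raw), list(by_key), n=1, cutoff=cutoff)
    return by_key[hits[0]] if hits else None
```

Anything that returns `None` would still have to go to a person, which is the commenter's point: the on-screen form avoids the problem entirely by letting the user pick from the list.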

    For offsite form entry, Psion Revos worked wonderfully, but we've moved everything to the Sharp Zaurus due to the uncertain future of Psion. We've kept the Psions, but just replace them when they break.

    Sorry for the ramble; it's lunch time.

  • by tps12 ( 105590 ) on Tuesday July 16, 2002 @02:47PM (#3896326) Homepage Journal
    it's mostly not feasible to have the subjects sit in front of the computer all day

    Then I guess somebody forgot to tell my boss.
  • Solutions (Score:3, Interesting)

    by Wrexen ( 151642 ) on Tuesday July 16, 2002 @02:50PM (#3896354) Homepage
    Does anyone know of something that can handle this?

    High school students? Technology isn't the answer to everything, and if these are handwritten you're not going to have very much success trying to automate the recognition. My name is Fod Na1oyyy, etc
  • Fla. (Score:3, Funny)

    by Strange Ranger ( 454494 ) on Tuesday July 16, 2002 @02:58PM (#3896437)

    Doesn't the State of Florida have a forms tallying system they're looking to unload?
  • by Pauly ( 382 ) on Tuesday July 16, 2002 @03:05PM (#3896497)
    Having worked at one of the world's largest OCR/Forms processing vendors, take it from me: don't do this.

    OCR forms processing does:

    • waste money and time
    • create unnecessary pain
    • require high-quality and expensive printed forms
    • require high-quality and expensive scanning equipment
    • introduce more human error

    OCR forms processing does NOT:

    • "save a lot of data entry"
    • do anything automatically (unless your forms are all checkboxes)
    • save money or time
    That said, if you have a lot of questions to be answered, a well-designed form using as few handwritten responses as possible (all checkboxes is best) may be viable.

    Frankly, most of the large projects I worked on could have gotten the task done easier and cheaper writing an app to run on low-end Palms given to each interviewee. Seriously.

    If you would like more concrete advice or contacts with people in the industry, email me.

  • by j-turkey ( 187775 ) on Tuesday July 16, 2002 @03:49PM (#3896937) Homepage
    I am not aware of any open-source automated ICR for forms processing. I can, however, offer a few commercial alternatives.

    I have used TELEform [cardiff.com], by Cardiff [cardiff.com] and was somewhat impressed. It can take multiple different inputs (scanner, fax, email, web POST, etc), run ICR on them, and store the data in a number of different RDBMS. I believe that this is a Windows-only package.

    Another package, which I cannot personally recommend because I have no experience with it, is 170 MarkView [170systems.com] from 170 Systems [170systems.com], which basically does the same thing.

    I have used TELEform in a medical/clinical setting, where doctors fill out prescriptions and fax them in, and character recognition is run server-side, where the results are verified by my data staff. It works pretty well, but you need to keep in mind that handwriting recognition is not infallible, and if you are interested in any level of accuracy, I would recommend that a human verify each comb-box where there is handwritten text. Most verifiers that I've seen are pretty good, and you can glance at each field and just pound the tab key to scream through the fields.

    As far as statistics for accuracy with TELEform, the numbers that I reported are as follows (the numbers represent the percentage of fields with ICR errors of the specified type):

    ICR Error type:
    ==========
    Handwriting 3.7%
    Combination Handwriting/OCR 3.2%
    OCR 2.9%
    Total 9.9%

    You can take the first three numbers with a grain of salt (the breakdown by kind of ICR error is subjective and somewhat anecdotal), but the total is accurate -- expect approximately 9.9% of your fields to come back with errors, and around 6% if you are really careful in designing your forms and train your users on how to write on those forms. These numbers are consistent for all of the ICR systems I have used.
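
To see what a 9.9% field error rate means for verification workload, here is a rough back-of-the-envelope sketch. The form count and fields-per-form figures are hypothetical, and the second function assumes field errors are independent of one another, which real forms will only approximate.

```python
def expected_field_errors(forms, fields_per_form, field_error_rate):
    """Expected number of fields that come back wrong across a batch."""
    return forms * fields_per_form * field_error_rate


def chance_form_has_error(fields_per_form, field_error_rate):
    """Probability that a form has at least one bad field.

    Assumes errors in different fields are independent.
    """
    return 1 - (1 - field_error_rate) ** fields_per_form


# Hypothetical batch: 1,000 forms with 50 fields each, at the 9.9% rate above.
batch_errors = expected_field_errors(1000, 50, 0.099)
form_risk = chance_form_has_error(50, 0.099)
```

Under those assumptions roughly 4,950 fields need correcting, and over 99% of forms contain at least one error, which is why verifying every handwritten field is the safer default.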

    I hope this helps...


    -Turkey
  • Two Words (Score:3, Informative)

    by 4/3PI*R^3 ( 102276 ) on Tuesday July 16, 2002 @04:10PM (#3897102)
    ...WORK STUDIES
    ...GRADUATE STUDENTS
    ...SLAVE LABOR

    You will spend less time and get better results hiring starving, desperate for money, college students to do data entry.

    • There are companies that will accept your boxes of paper and return electronic data. (They have rooms full of people typing away madly, I suppose.)

      Just google for data entry outsourcing [google.com].

      • Yeah, the least expensive ones ship it off (electronically via high-speed scanner) to India or some other country where labor is dirt cheap. I've used them in the past.

        OCR is just too unreliable. The best of them are only around 99% accurate with typewritten pages, which still means lots of errors per page. OCR of handwritten input is a joke. HUMAN interpretation of handwritten data isn't even totally accurate due to the piss poor handwriting some people have.

        If you can minimize the amount of "essay" data and maximize multiple choice with fill-in-the-dot you will be able to lower your error rate.
  • ... what sort of data you're expecting on the forms.

    If it's handwritten, just forget it. You'll have enough problems getting people able to read it, much less computers. The postal service does do some of this, but they have a secret: they know all the valid addresses and can do cross-referencing between different parts if they really have to.

    If it's typed you might be able to OCR it, but don't count on it being truly reliable and plan on saving the image as well - you're going to have to be able to go back to it.

    If it's filled and printed you might be able to put something together, but if you have that why not have them send the data electronically instead?

    If it's fill and print but you can't communicate it electronically, see if you can generate barcodes. If this is the case, I assume it's generated by an application instead of an HTML form, since an HTML form could be communicated back to the server.

    The main situation I know of where OCRing worked well was for an imaging system - the company wanted to store images of all the work order pages for each customer (to include signatures & handwritten notes), tied back to the database. Since the initial work orders were printed, all that needed to be OCRed was the work order number, which both included check digits and needed to match against a known work order already in the database. Even then, there were provisions for dealing with the ones that weren't recognized. Barcodes weren't used because the imaging system was separate from the creation system, which didn't have the capability to generate them.
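
The work-order approach can be sketched as two checks: the recognized number must pass its check digit, and it must already exist in the database. The comment does not say which check-digit scheme was used, so the Luhn algorithm below is a stand-in assumption, and `known_orders` is a hypothetical set standing in for the database lookup.

```python
def luhn_valid(number):
    """Return True if the digit string passes the Luhn check."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for position, ch in enumerate(reversed(number)):
        digit = int(ch)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def accept_work_order(ocr_text, known_orders):
    """Accept a recognized number only if it is well formed and already on file."""
    candidate = ocr_text.strip()
    return luhn_valid(candidate) and candidate in known_orders
```

Anything that fails either check would be routed to the manual queue the commenter mentions, rather than being guessed at.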

  • Captiva (Score:2, Insightful)

    by HockeyP9 ( 583682 )
    I'm not sure, but I believe a company called Captiva can do the type of capturing you're talking about. I know many companies use it to process tax forms and such. It's definitely not cheap or open source, though.
  • and know an outstanding programmer who works with a number of OS platforms and whom I would call an expert on OCR, forms recognition, etc. Check out http://www.microimagesys.com and contact Mr. Lunglhofer. Also, look at Kofax for your image and OCR retrieval from scanned documents. I am not 100% sure Adobe has a *nix version, but I create a considerable number of e-forms in Adobe (and learned this from Mr. Lunglhofer). These forms are used in an enormous variety of electronic, web-based, and non-web applications. Ask him what he would suggest and see what kind of product he could provide for you.
  • having worked as a support engineer (until a recent layoff) for eiStream (link [eistream.com]), I know that they have software capable of all aspects of what you seek.

    the WMS division of the company produces software that focuses on digitization of processed forms, OCR, workflow management and the like. the trickiest part of this process, however, may be getting their off-the-shelf product to incorporate all your specific requirements into one application/process. i would also be interested to see how they handle your request, as it nearly fits an as-yet (and perpetually) unreleased new product. regardless, it shouldn't be impossible to accomplish what you seek with what they have available.

    as with many companies, they have a professional services group willing to come on site to customize the application to your project, performance tune the hardware and software, and help you learn the product some...

    the core OCR software can be sampled with a 60 day trial download (link [eistream.com]) of the Imaging Pro (Windows users only...) software. the obvious downside to this company (other than my layoff) is that the required server hardware and licensing costs can become cumbersome. it really all depends on the scale/scope of the project and how many documents you seek to simultaneously process... the PSO group can scope that for you beforehand though if you want. oh yeah, and it's nowhere near open source...

  • Consider offshore data entry. Many large operations (for example, payroll firms) still do this.

    Accurate OCR on forms requires a lot of custom work. The costs would be prohibitive for a one-time study.

    OCR is only accurate if the problem is constrained, for example if you know the entry is an address or phone number.

    Free form cursive is "impossible" to do accurately. Free form printed is very, very hard. You'd probably have to restrict people to writing letters in consecutive boxes, which most people hate.

    A lot of work will go into designing the questions and the layout of the form itself. And how will your scanner be fed, mechanically or by hand? A degree of skew will ruin your data unless your system is designed to handle it. (Ever fill out a form with large dots in each corner?)
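
As a rough illustration of why those corner dots matter, here is a minimal sketch that estimates skew from two detected marks and maps a field position from the blank form onto the skewed scan. It assumes the mark centers have already been located by some other step, that the scan and the template share the same scale, and that pixel coordinates have y increasing downward; all names are hypothetical.

```python
import math


def skew_angle(top_left, top_right):
    """Angle in radians of the line through the two top corner marks."""
    dx = top_right[0] - top_left[0]
    dy = top_right[1] - top_left[1]
    return math.atan2(dy, dx)


def to_scan_coords(template_point, top_left, angle):
    """Map a template offset (measured from the top-left mark) to scan pixels.

    The offset is rotated by the measured skew and then shifted to the
    position where the top-left mark was actually found.
    """
    x, y = template_point
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return (top_left[0] + x * cos_a - y * sin_a,
            top_left[1] + x * sin_a + y * cos_a)
```

A real system would also correct for scale and stretch using the remaining marks; this sketch handles rotation and offset only.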

    After all that, you'd still need humans & a data entry system to check the borderline cases. A good system would have a data correction station that flashes portions of images and the system's best guesses for the operator to choose among or replace outright.

    • Several companies, including a major airline I did some work for (with Micro Image Systems) a few months ago, stopped using Mexican data entry and went to OCR. They save $300K a year with the new system. Yes, they have a manual corrective system in place, but it is far more cost-effective and accurate than their former manual system. They also no longer save paper copies of every shipping bill, saving on storage space and on the record-keeping staff who go into the boxes to retrieve things, etc.
  • I would agree with other posters that OCR will not solve your problem. I would also agree with you that having people sit in front of a computer for long periods is not a good idea.

    I have created a system to help review proposed changes to the Convention on International Trade in Endangered Species (CITES). It sends out a formatted email form to reviewers (many of whom are in developing countries). The reviewers reply with their answers, and a simple RegEx sucks the answers into a database.

    Benefits of this approach:
    • Iterative- users can return to their email program several times before completing and sending the form. You don't have to complete in one sitting (our questionnaire is about 150 detailed questions long)
    • Offline- for users that don't have a dedicated net connection (yes they exist)
    • Transaction based- you know that an email has been successfully sent
    • Lowtech- a text based email is a kind of lowest common denominator. You don't have problems with plug-ins or JRE versions with this solution
    This kind of thing is easy to build. The tricky part is predicting what users will do with it. In our forms the questions look like-
    ||Please tell us your shoe size::
    ||What color are your shoes?::
    The directions originally told users to enter data between the :: after a question and the || before the next question. Result: many of them answered like
    ||Please tell us your shoe size:My size is twelve:
    and caused the RegEx to miss the answer.
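
For illustration, a more forgiving parser might look like the sketch below. It is not the commenter's actual RegEx, which is not shown; it assumes questions never contain a colon and that answers never contain a `|` character.

```python
import re

# A question runs from "||" to its first colon; everything up to the
# next "||" (or the end of the message) is treated as the answer area.
FIELD = re.compile(r"\|\|(?P<question>[^:|]+):(?P<rest>[^|]*)")


def parse_reply(body):
    """Return {question: answer} for every field found in an email body."""
    answers = {}
    for match in FIELD.finditer(body):
        question = match.group("question").strip()
        # Handles both "question:: answer" and "question:answer:".
        answer = match.group("rest").strip().strip(":").strip()
        answers[question] = answer
    return answers
```

Stripping stray colons from both ends of the answer area means the intended layout and the "typed between the colons" variant both parse, at the cost of losing any colon a user deliberately put at the very start or end of an answer.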

    Best of luck
  • If the forms are mostly checkboxes, you can probably scan each one as a picture, then look in the right areas for crud in checkboxes. Might need some alignment with known markings in corners. If there is some writing or text (serial number, name), enter that manually while displaying the picture on screen. This is also a good time to ask about questions which seem to have no or multiple boxes checked. "Please clarify question 3C."

    Sometimes simple brute force does wonders.
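
The brute-force idea can be sketched in a few lines. This assumes the scan has already been aligned and loaded as rows of 0-255 grayscale values, and that the box coordinates are known from the form layout; the thresholds are guesses that would need tuning against real scans.

```python
def fill_ratio(image, box, threshold=128):
    """Fraction of pixels inside box that are darker than threshold.

    image is a list of rows of 0-255 grayscale values; box is
    (left, top, right, bottom) with right and bottom exclusive.
    """
    left, top, right, bottom = box
    dark = 0
    for row in image[top:bottom]:
        dark += sum(1 for value in row[left:right] if value < threshold)
    area = (right - left) * (bottom - top)
    return dark / area if area else 0.0


def read_checkboxes(image, boxes, checked_at=0.25, unsure_at=0.10):
    """Classify each named box as 'checked', 'empty', or 'review'."""
    results = {}
    for name, box in boxes.items():
        ratio = fill_ratio(image, box)
        if ratio >= checked_at:
            results[name] = "checked"
        elif ratio >= unsure_at:
            results[name] = "review"
        else:
            results[name] = "empty"
    return results
```

Boxes that land in the "review" band, and questions with no or multiple boxes checked, are the ones to show to a person alongside the scanned picture.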
  • I was recently involved in an implementation similar to this one, and there are a number of factors that might/might not make it worthwhile for you to use OCR:

    1) What's the daily/weekly volume of forms/pages? If it's really only a few hundred or even several thousand pages per week, it may not be worth it. In my project, we had several deadlines where we had to turn around ten thousand forms (5-8 pages each) in 3 days - using OCR to increase the throughput quickly became cheaper than hiring temps to do data entry.

    2) What kind of data are you collecting? As several others have mentioned, form UI design is intimately connected with the type of data you need and the quality tolerances for the data. If most of the questions can be answered in checkboxes, then OCR is a good bet and will probably save you some time. Actually the vendors refer to OMR (optical mark recognition, or checkboxes/bubbles), OCR, and ICR (intelligent character recognition, usually handprint). OMR accuracy is >99%, OCR is something like 95-99% given a good font, and ICR is somewhat less. We had great success with OMR/checkboxes, but ICR recognition of handprinted address data, free response questions, etc. was quite poor. Even if each form contains a mixture of checkboxes and free response questions, it may be worth it to implement an OCR-type solution because "heads-up" data entry of the free response questions is quicker than having to look down at a stack of papers and leaf through them.

    3) Are the forms already designed and printed, or do you have some freedom in working with designers to adapt them for OCR use? Using an OCR solution has the side benefit of forcing you to be really careful about how you ask questions and how you lay out the form - in other words, it really forces you to constrain certain lines and boxes, making things that much more explicit for the user. We went through 6 or 7 revisions of one of our forms before we found a format that didn't confuse anyone. I've got to think that many times human data entry operators don't bother to type in the extra notes/margin comments/etc. on manually entered forms.

    4) How many sites are there? In our case, we had 1 site where all of the forms were mailed. We had a dedicated form processing area with 5 big workstations, $30K of software (we used ReadSoft, more on that later), and $15-20K for three medium-sized Kodak scanners. The workflow is basically scan-->perform OCR/ICR/OMR recognition -->manually validate any fields that were not recognized --> write to a flat file or write directly to database using ODBC.

    SCAN - If you have lots of sites, paying $2-3K for a low-end duplex scanner for each site may not be realistic. I'm increasingly interested in the low-end HP printer/scanner machines, as these allow you to scan to a network location or email (starting at about $700).

    RECOGNITION - ideally this would happen in one location, because you're generally paying the bulk of the license fee for this.

    VALIDATION - both ReadSoft (http://www.readsoft.net) and Captiva (http://www.captivacorp.com) claim to support both fat/proprietary and web-based clients for the validation by data entry people. I've never tested the web clients, but that might be a real selling point.

    UPLOAD - we wrote the data to a flat file and then used a Perl script to post each line of data to a JSP form - so each form was available both on the web and on paper - this turned out to be quite nice.
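
The upload step might look roughly like the sketch below. The original was a Perl script that is not shown, so this is a Python stand-in under stated assumptions: the flat file is tab-delimited with a header row, the header names match the JSP form's field names, and the URL is supplied by the caller.

```python
import csv
import io
import urllib.parse
import urllib.request


def build_requests(flat_file_text, url):
    """Turn each data row of a tab-delimited file into a form-encoded POST."""
    reader = csv.DictReader(io.StringIO(flat_file_text), delimiter="\t")
    requests = []
    for row in reader:
        # urlencode produces the same body a browser would send for the form.
        body = urllib.parse.urlencode(row).encode("ascii")
        requests.append(urllib.request.Request(url, data=body, method="POST"))
    return requests


def post_all(requests):
    """Send each request in order and return the HTTP status codes."""
    statuses = []
    for request in requests:
        with urllib.request.urlopen(request) as response:
            statuses.append(response.status)
    return statuses
```

Posting through the same form the web users see means validated paper data passes through the same server-side checks as data typed in directly.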
    As far as vendors, ReadSoft and Captiva both seem to be pretty strong. Cardiff TELEform is another decent product, although it has an integrated form-builder that is pretty limiting compared to ReadSoft's ability to adapt to pre-existing paper forms.

    I looked long and hard at sourceforge and elsewhere, and though there are some OCR projects, there doesn't seem to be anything focused on form processing. Given the progress on the XForms standard and an alpha implementation of it in the development build of Cocoon, I'd love to see a project get started to build a forms OCR project into Cocoon. Anyone?

  • Wouldn't Scantron sheets (you know, the forms that require you to fill in little bubbles with a number 2 pencil) be your best bet? Considering this is an NIH-funded project, I am assuming that you're working either at a university or at the NIH (with many universities nearby). I don't know of one college, university, or medical school that does not have the facility and equipment to scan in those Scantron sheets for their exams or applications. I am certain this is the best option for you, since even corporations are still using them for stuff like proxy voting for the next stockholders meeting---really, if it's cost effective for public corporations, I don't see why it wouldn't be for you (unless I'm missing something).
