OCR applications/services?

EdwardB

Every year, we do a ton of scanning/rescanning of hard-copy curriculum. Or worse, staff do it themselves on an MFP with the worst possible scan settings. Does anyone regularly use, or can anyone recommend, an OCR application or even a service that very accurately creates optimized PDFs?

I was just looking at Amazon's Textract AI service; it looks interesting.
 
There are probably over a hundred OCR apps out there, but under the hood, meaning the core OCR engine, I am aware of fewer than ten. So most apps use one of the common engines. The screenshot below is from an app that processes thousands of pages per day from every imaginable scanner source, sometimes operated by rookie users. It uses a modified version of the well-known Tesseract engine, which was originally developed by HP and later by Google. More about that here: Tesseract (software) - Wikipedia.
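
For anyone who wants to see what the stock engine does before trying any app built on it, here is a minimal sketch that drives the plain Tesseract command-line tool from Python to produce a searchable PDF. It is not the modified engine or the app described above. The function names, the default language, and the 300 dpi default are assumptions made for illustration, and it assumes a `tesseract` binary is installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_pdf_command(image, out_base, lang="eng", dpi=300):
    """Build the argv for stock Tesseract to write <out_base>.pdf with a text layer."""
    return [
        "tesseract",
        str(image),
        str(out_base),   # Tesseract appends the .pdf extension itself
        "-l", lang,      # recognition language (assumed default: English)
        "--dpi", str(dpi),  # resolution hint for scans that carry no DPI metadata
        "pdf",           # built-in config that selects searchable-PDF output
    ]


def ocr_to_pdf(image, out_base, **options):
    """Run the command if tesseract is installed; return the PDF path, else None."""
    if shutil.which("tesseract") is None:
        return None
    subprocess.run(tesseract_pdf_command(image, out_base, **options), check=True)
    return Path(f"{out_base}.pdf")
```

Calling `ocr_to_pdf("scan.png", "scan_ocr")` on a machine with Tesseract installed should leave a `scan_ocr.pdf` next to the script. Keeping the command builder separate from the runner makes it easy to inspect exactly what would be executed.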

About your expectation "very accurately": it depends. Feel free to PM a link to a sample file and I can return the OCR'ed result, or you can get a trial version of the app and try it yourself. Automatic OCR won't be as accurate as human-corrected OCR, or as cloud AI services that 'read' your text and use their harvested models to do more than OCR alone can.
About your comment "create optimized PDFs": yes. Additional processing can optionally re-compress images, compress uncompressed PDF streams, linearize the file, clean out unused streams, and collect garbage.
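
As a rough illustration of that kind of post-processing, the sketch below builds a command for the open-source `qpdf` tool, which can linearize a PDF and compress its streams. This is a stand-in rather than the app's own optimizer. The function names are hypothetical, `qpdf` is assumed to be installed, and image re-compression is deliberately left out.

```python
import shutil
import subprocess


def qpdf_optimize_command(src, dst):
    """Build the argv for a qpdf pass that linearizes and compresses a PDF."""
    return [
        "qpdf",
        "--linearize",                # lay the file out for fast first-page display
        "--object-streams=generate",  # pack objects into compressed object streams
        "--compress-streams=y",       # compress streams stored uncompressed
        str(src),
        str(dst),
    ]


def optimize_pdf(src, dst):
    """Run the qpdf pass if qpdf is installed; return True when it ran."""
    if shutil.which("qpdf") is None:
        return False
    # qpdf writes only objects still reachable from the document,
    # so unreferenced leftovers are dropped as part of the rewrite.
    subprocess.run(qpdf_optimize_command(src, dst), check=True)
    return True
```

Running `optimize_pdf("scan_ocr.pdf", "scan_ocr_small.pdf")` writes a new file and leaves the input untouched, so the two sizes can be compared afterwards.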

Benefits:
- No limit on the number of pages per day, month, or year (some other vendors impose one)
- No extra fees based on usage (some other vendors charge them)
- Confidential & private data stays confidential and private. App can run offline, even if your internet is down. No AI or Cloud service. No data goes to a shady corporation that uses your data to train their models and then resells it to others.
- Relatively easy to use. The UI below speaks for itself.
Disclaimer: I am the lead developer.

My personal comments about other vendors:
1-
Before they changed their business model, ABBYY FineReader was great: unlimited use, extremely good layout analysis, and near-perfect OCR results. Some people still have and use a version from 15-20+ years ago, which probably beats Tesseract depending on the content. You cannot buy it anymore as far as I know. They switched to forced cloud/AI and have severely restrictive limits on the number of pages per month.
2-
I am against AI/cloud-based solutions because of privacy concerns. Would your customers want their data to go to an unknown third party?

Best Regards.

[Screenshot of the app's UI: 1747932028823.png]
 
Tesseract is one thing I am looking at. I remember way back when it was bleeding edge; I had no idea until recently that it was still being actively developed.

As for AI/cloud, we are talking about scanned school curriculum... Our senior network admin wanted to lock down our file share so that not just any staff member could access it. My response was "What would happen? Someone might read the documents and learn something?" LOL
 
Understood. Your case is unique, and indeed funny: "someone might read the documents and learn something".
However, for many others, especially in Europe where our software is used the most, data protection and confidentiality are critical.
 
   