this post was submitted on 07 Oct 2023

Technology

I've just scanned a section of a book (in French) that unfortunately uses a very fine typeface and a lot of italics that seem to confuse the OCR.

I'm on Linux, so I'm switching between gscan2pdf (which makes use of the remarkable unpaper program) and Master PDF Editor (a proprietary program) to clean up and deskew the scans before OCRing them, since each program has its own strengths and weaknesses. I did this, got the scanned pages looking pretty good, and then OCRed them with Tesseract (an option within gscan2pdf). I also tried GOCR, which produced garbage-level results.

Tesseract didn't do too badly, but it occasionally mixes lines of text together, despite my trying to get them as straight as possible (and doing what I thought was a pretty good job!). It also puts spaces in the middle of words and sentences, like this: "J e t'ai m e", which is annoying to have to go through and fix, especially since there are a lot of them! Can anyone recommend a better approach, some different software maybe, or is this the best I can reasonably hope for?
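
In case it helps anyone with the same problem, part of the stray-space cleanup can at least be scripted. This is a rough heuristic of my own (not anything Tesseract provides): it merges runs of adjacent single-letter tokens back into one word, so it would wrongly join genuine one-letter French words like "à" or "y" when they happen to sit next to each other.

```python
# Rough heuristic for the stray-space problem: merge runs of adjacent
# single-letter tokens back into one word. Caution: genuine one-letter
# French words ("à", "y") also get merged if they end up adjacent.
def collapse_singles(text):
    out, run = [], []
    for tok in text.split(' '):
        if len(tok) == 1 and tok.isalpha():
            run.append(tok)          # likely a fragment of a broken-up word
        else:
            if run:                  # flush any pending run of single letters
                out.append(''.join(run))
                run = []
            out.append(tok)
    if run:
        out.append(''.join(run))
    return ' '.join(out)

print(collapse_singles("J e t'ai m e"))  # "Je t'ai me"
```

It won't catch splits that land next to longer fragments (the real text here is presumably "Je t'aime"), so some hand-fixing would still be needed.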

all 9 comments
[email protected] 4 points 1 year ago

I use ocrmypdf after getting a bit frustrated with gscan2pdf. There's a simple UI available, but I just wrote a tiny script that does the OCR, deskew, etc. in one operation, with wildcard file selection.
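
A minimal sketch of that kind of script (--deskew, --clean and -l are real ocrmypdf options; the filenames and the fra language code are just examples, and --clean calls out to unpaper, the same tool gscan2pdf uses):

```python
# Sketch of a tiny batch runner around ocrmypdf: deskew, clean and
# OCR every matching PDF in one pass. Assumes ocrmypdf is installed;
# --deskew/--clean/-l are real flags, the paths are examples.
import glob
import subprocess
import sys

def build_cmd(pdf):
    out = pdf[:-len(".pdf")] + "_ocr.pdf"
    return ["ocrmypdf", "--deskew", "--clean", "-l", "fra", pdf, out]

if __name__ == "__main__":
    pattern = sys.argv[1] if len(sys.argv) > 1 else "*.pdf"
    for pdf in glob.glob(pattern):
        subprocess.run(build_cmd(pdf), check=True)
```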

I also installed a JBIG2 compressor, which really shrinks the images. My processed docs are generally 40% to 80% smaller, and it seems to get better Tesseract output than gscan2pdf does.

[email protected] 2 points 1 year ago (last edited 1 year ago)

OCRmyPDF is what I use as well; I've had good luck with it on board-game rulebooks that sometimes come with missing or partial embedded text. Combined with recoll and the Emacs pdf-tools mode, I have it all indexed and at my fingertips.

[email protected] 4 points 1 year ago

Have a look at Kraken, which has many state-of-the-art models for both HTR (handwritten text recognition) and OCR.

[email protected] 3 points 1 year ago

@hedge If the font is "strange", you might try Ocular OCR. It's intended for historical books and can learn new fonts.

[email protected] 2 points 1 year ago

Thanks to @[email protected], @[email protected], & @[email protected] for their responses. I forgot that Tesseract is mainly used from the command line, something which, despite being a Linux person, I'm not super proficient with. It looks like gscan2pdf and Master PDF Editor got different OCR results despite, I think, both using the same version of Tesseract.

[email protected] 2 points 1 year ago

> despite, I think, both using the same version of Tesseract.

So the difference must be in the settings; you could reproduce either result by using tesseract directly.
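
For example (a sketch with hypothetical filenames; -l and --psm are real Tesseract flags, and --psm 6, "assume a single uniform block of text", sometimes helps when lines get merged):

```python
# Sketch: call Tesseract directly so every setting is explicit and
# identical between runs. -l picks the language model, --psm the page
# segmentation mode; filenames and the PSM value are just examples.
import subprocess

def tesseract_cmd(image, out_base, lang="fra", psm=6):
    return ["tesseract", image, out_base, "-l", lang, "--psm", str(psm)]

# Uncomment to actually run it:
# subprocess.run(tesseract_cmd("page.png", "page"), check=True)
```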

[email protected] 2 points 1 year ago

I've read good things about Donut, although I haven't used it myself yet.