Could someone explain what this does? The explanation paragraph does not make se...

Andoryuuta · on Nov 6, 2021

As far as I understand, it takes a PDF that contains only image data (e.g. a scan of a piece of paper) and uses OCR to recognize the text, then overlays the text on top of the image in the output PDF.

It would allow you to take a physically scanned document and create a PDF with selectable text you could copy+paste, search over, etc.

dredmorbius · on Nov 6, 2021

Text is behind the image, as the first 'graph of TFA notes:

the text will be added to each page invisibly "behind" the images.

perth · on Nov 6, 2021

It calls a series of open source tools that together produce a PDF with text embedded behind an image overlay, where the image overlay is the original PDF. It's been a while since I really looked into this, but to name a few:

ImageMagick to convert the pages to images

Tesseract-ocr by Google to transcribe the text in the images, which puts its output into individual PDF files

Pdfunite to stitch the PDFs back together into a single file

I’m sure I’m missing a few; IIRC it can also call a tool that straightens the pages.

EDIT: Messed around and remembered the stuff:

where a.pdf is a 2 page PDF:

>convert a.pdf a.png

makes a-0.png and a-1.png

OCRs each image (the output config name is lowercase "pdf"):

>for x in {0..1} ; do tesseract a-$x.png a_ocr-$x pdf ; done

combines them into 1 PDF:

>pdfunite a_ocr-{0..1}.pdf a_ocr_combined.pdf
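
The three commands above can be wrapped into one function. This is only a sketch of the same pipeline, not what the tool under discussion actually runs. The names `ocr_pdf`, `run`, and `DRY_RUN` are made up for illustration, the page count is passed in by hand, and `-density 300` is an added assumption (ImageMagick rasterizes PDFs at 72 DPI by default, which tends to be too coarse for OCR). It also assumes the input has at least two pages, so that ImageMagick numbers the images as a-0.png, a-1.png, and so on.

```shell
#!/bin/sh
# Sketch only: the convert -> tesseract -> pdfunite steps from the comment above.
# ocr_pdf, run and DRY_RUN are hypothetical names, not part of any of the tools.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: ocr_pdf INPUT.pdf PAGE_COUNT   (PAGE_COUNT >= 2 assumed)
# Variables are global because plain POSIX sh has no "local".
ocr_pdf() {
  input=$1
  pages=$2
  base=${input%.pdf}

  # 1. Rasterize every page: a.pdf -> a-0.png, a-1.png, ...
  #    -density 300 is an assumption, not something the thread specifies.
  run convert -density 300 "$input" "$base.png" || return 1

  # 2. OCR each page image. The trailing "pdf" config makes tesseract
  #    write a_ocr-N.pdf with the recognized text embedded.
  set --   # reuse the positional parameters as the list of per-page PDFs
  i=0
  while [ "$i" -lt "$pages" ]; do
    run tesseract "$base-$i.png" "${base}_ocr-$i" pdf || return 1
    set -- "$@" "${base}_ocr-$i.pdf"
    i=$((i + 1))
  done

  # 3. Join the per-page PDFs into one file.
  run pdfunite "$@" "${base}_ocr_combined.pdf"
}

# Preview the commands without running anything:
#   DRY_RUN=1 ocr_pdf a.pdf 2
# Actually run them (needs ImageMagick, Tesseract and Poppler installed):
#   ocr_pdf a.pdf 2
```

Building the file list while looping avoids hardcoding `{0..1}`, so the same function works for any page count you pass in.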

jhvkjhk · on Nov 6, 2021

It will make the “text” on a pdf with only images searchable and selectable.

gorgoiler · on Nov 6, 2021

It converts images to text.

The input is a scanned PDF. The output is the same PDF with the recognized text on top, in a transparent font.

Copy and paste now works because when you click the PDF you are selecting the transparent text.
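
One way to check whether a PDF already has such a selectable text layer is to try extracting text from it. A minimal sketch, assuming Poppler's `pdftotext` is installed; `has_text_layer` is a hypothetical helper name, and a scan with a few stray characters of embedded text would still count as "has text" here.

```shell
#!/bin/sh
# Sketch only: has_text_layer is a made-up helper, not part of Poppler.

# Succeeds (exit 0) if the PDF yields at least one non-whitespace character.
has_text_layer() {
  # "-" makes pdftotext write the extracted text to stdout.
  pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Example:
#   if has_text_layer a_ocr_combined.pdf; then echo "searchable"; else echo "image only"; fi
```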

galactus · on Nov 6, 2021

I came here to say this. I read it 4 times before giving up.