
Could someone explain what this does? The explanation paragraph does not make sense to me grammatically.


As far as I understand, it takes a PDF that contains only image data (e.g. a scan of a piece of paper) and uses OCR to recognize the text, then overlays the text on top of the image in the output PDF.

It would allow you to take a physically scanned document and create a PDF with selectable text you could copy+paste, search over, etc.


Text is behind the image, as the first 'graph of TFA notes:

the text will be added to each page invisibly "behind" the images.


It calls a series of open source tools to produce a PDF with text embedded behind an image overlay, where the image overlay is the original PDF. It has been a while since I really looked into this, but to name a few:

ImageMagick to convert the pages to images

Tesseract-ocr by Google to transcribe the text in the images, which puts its output into single-page PDF files

Pdfunite to stitch together the pdfs back into a whole file

I’m sure I’m missing a few; IIRC it can call a tool that straightens the pages as well.

EDIT: Messed around and remembered the stuff:

where a.pdf is a 2 page PDF:

>convert a.pdf a.png

makes a-0.png and a-1.png

OCRs each image (the output config name is lowercase "pdf"):

>for x in {0..1} ; do tesseract a-$x.png a_ocr-$x pdf ; done ;

combines them into 1 PDF:

>pdfunite a_ocr-{0..1}.pdf a_ocr_combined.pdf
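
For more than two pages, the three steps above can be generated rather than typed out. The following is only a rough sketch, not what the tool itself does: `ocr_plan` is a hypothetical helper name, and it assumes the PDF has at least two pages (so that convert numbers its output a-0.png, a-1.png, ...) and that file names contain no spaces. It only prints the commands, so nothing runs until you pipe the output to a shell.

```shell
#!/bin/sh
# Dry run: print the convert / tesseract / pdfunite commands for BASE.pdf.
# ocr_plan is a hypothetical helper; it assumes PAGES >= 2 and no spaces in BASE.
ocr_plan() {
  base=$1
  pages=$2
  parts=""
  # Step 1: split the PDF into one PNG per page (BASE-0.png, BASE-1.png, ...).
  echo "convert $base.pdf $base.png"
  x=0
  while [ "$x" -lt "$pages" ]; do
    # Step 2: OCR each page image into its own single-page PDF.
    echo "tesseract $base-$x.png ${base}_ocr-$x pdf"
    parts="$parts ${base}_ocr-$x.pdf"
    x=$((x + 1))
  done
  # Step 3: stitch the per-page PDFs back together in page order.
  echo "pdfunite$parts ${base}_ocr_combined.pdf"
}

# Print the plan for the 2-page a.pdf from the example above.
ocr_plan a 2
```

Once the printed commands look right, `ocr_plan a 2 | sh` would actually run them, provided ImageMagick, Tesseract and pdfunite are installed.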


It will make the “text” on a pdf with only images searchable and selectable.


It converts images to text.

The input is a scanned PDF. The output is the same PDF with the recognized text on top, in a transparent font.

Copy and paste now works because when you click the PDF you are selecting the transparent text.


I came here to say this. I read it 4 times before giving up.



