
It calls a series of open source tools to produce a PDF with the text embedded behind an image overlay, where the image overlay is the original PDF. It's been a while since I really looked into this, but to name a few:

ImageMagick to convert the pages to images

Tesseract-ocr by Google to transcribe the text in the images, which writes its output to individual PDF files

Pdfunite to stitch the PDFs back together into a whole file

I’m sure I’m missing a few; IIRC it can also call a tool that straightens the pages.

EDIT: Messed around and remembered the stuff:

Where a.pdf is a 2-page PDF:

>convert a.pdf a.png

makes a-0.png and a-1.png

OCRs each image:

>for x in {0..1} ; do tesseract "a-$x.png" "a_ocr-$x" pdf ; done

combines them into 1 PDF:

>pdfunite a_ocr-{0..1}.pdf a_ocr_combined.pdf
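
If you want the same three steps for any page count, here is a rough sketch as a shell function. The names ocr_pdf and DRY_RUN are made up for illustration and are not part of any of the tools. It assumes convert, tesseract and pdfunite are on your PATH, that you pass the page count yourself, and that the PDF has at least 2 pages (ImageMagick only adds the -0, -1 suffixes for multi-page input).

```shell
# Sketch only: ocr_pdf and DRY_RUN are hypothetical names.
# Usage: ocr_pdf a.pdf 2     (input file, number of pages)
# The body runs in a subshell "( ... )" so its variables do not leak
# into the calling shell; "exit" below only leaves that subshell.
ocr_pdf() (
    input=$1
    pages=$2
    base=${input%.pdf}

    # With DRY_RUN=1, print each command instead of running it.
    run=""
    [ "${DRY_RUN:-0}" = "1" ] && run="echo"

    # Assumption: multi-page input, so page images are named base-N.png.
    [ "$pages" -ge 2 ] || { echo "need at least 2 pages" >&2; exit 1; }

    # Step 1: rasterize the PDF into base-0.png, base-1.png, ...
    $run convert "$input" "$base.png" || exit 1

    # Step 2: OCR each page image into its own searchable PDF,
    # collecting the output names in page order.
    set --
    i=0
    while [ "$i" -lt "$pages" ]; do
        $run tesseract "$base-$i.png" "${base}_ocr-$i" pdf || exit 1
        set -- "$@" "${base}_ocr-$i.pdf"
        i=$((i + 1))
    done

    # Step 3: merge the per-page PDFs into one file.
    $run pdfunite "$@" "${base}_ocr_combined.pdf"
)
```

Running DRY_RUN=1 ocr_pdf a.pdf 2 just prints the four commands so you can check them first. One thing worth trying if the OCR quality is poor: ImageMagick rasterizes PDFs at 72 dpi by default, so adding something like -density 300 before the input file in the convert step will probably help.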


