It calls a series of open-source tools to produce a PDF with the text embedded behind an image overlay, where the image overlay is the original PDF. It's been a while since I really looked into this, but to name a few:
ImageMagick to convert the pages to images
Tesseract-ocr by Google to transcribe the text in the images, which puts its output into single-page PDF files
Pdfunite to stitch together the pdfs back into a whole file
I’m sure I’m missing a few; IIRC it can also call a tool that straightens the pages.
EDIT: Messed around and remembered the stuff:
where a.pdf is a 2 page PDF:
>convert a.pdf a.png
makes a-0.png and a-1.png
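OCRs each image (the last argument is the name of tesseract's `pdf` config, lowercase, so it can be found on case-sensitive filesystems):
>for x in {0..1} ; do tesseract a-$x.png a_ocr-$x pdf ; done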
combines them into 1 PDF:
>pdfunite a_ocr-{0..1}.pdf a_ocr_combined.pdf
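If you don't want to hard-code the page count, the three steps above can be wrapped up roughly like this. It's only a sketch of the same convert / tesseract / pdfunite chain, not what the actual tool does internally. `ocr_pdf` is a name I made up, `-density 300` is my assumption for a reasonable OCR resolution, and the zero-padded `page-%03d` naming is just there so the shell glob keeps pages in order (fine up to 1000 pages). On ImageMagick 7 you may need `magick` instead of `convert`.

```shell
# Hypothetical helper: ocr_pdf INPUT.pdf OUTPUT.pdf
# Runs in a subshell so the temp dir is cleaned up however it exits.
ocr_pdf() (
    input=$1
    output=$2
    workdir=$(mktemp -d) || exit 1
    trap 'rm -rf "$workdir"' EXIT

    # 1. Rasterize every page to page-000.png, page-001.png, ...
    #    (-density 300 is an assumed setting, tune to taste)
    convert -density 300 "$input" "$workdir/page-%03d.png" || exit 1

    # 2. OCR each page image; the "pdf" config makes tesseract
    #    write <outputbase>.pdf next to the image
    for img in "$workdir"/page-*.png; do
        tesseract "$img" "${img%.png}" pdf || exit 1
    done

    # 3. Stitch the per-page PDFs back together in page order
    pdfunite "$workdir"/page-*.pdf "$output"
)
```

With that defined, `ocr_pdf a.pdf a_ocr_combined.pdf` should give the same result as the manual commands above, just at a higher rasterization resolution.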