I have to keep hundreds of documents I receive each year for five or ten years, some indefinitely: invoices, statements, contracts, etc. Quite a few of these documents are 1 to 2 cm thick. My accountant organizes them in folders sorted by the source of the document and collected into separate file cabinets and drawers.
During COVID, I stopped having my accountant come to my home office to do this work. After a great deal of experimentation, I resorted to purchasing an expensive Epson document scanner (approximately $600) that reliably does full duplex scans very quickly. The documents are stamped with a serial number (e.g. 2021-207) and stored in a single directory on my system with a file name that corresponds to the serial number (e.g. 2021-207.pdf).
The Epson hardware is great; the software is just barely adequate. It processes one document at a time, and the Epson application's OCR running on an M1 Mac mini can't keep up with the scanner, which slows the whole process down somewhat. I would like to batch process the scanned documents in the background to convert the PDFs generated by the scanner into "searchable" PDFs. (PDF files produced by scanners have an image layer but no text layer underneath it. Optional OCR done post-scan then adds the text layer.) I've tried a number of other OCR applications. One of the best is Adobe Acrobat Pro; unfortunately, it has the Adobe Pro kind of price and does a million things I don't need.
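A minimal sketch of that background batch step, assuming the open-source OCRmyPDF command-line tool (`ocrmypdf`) is installed. That tool choice is an assumption, as are the directory names `scans` and `searchable` and the helper names; none of them come from the setup described above. The idea is to scan at full speed first, then OCR only the files that do not yet have a searchable counterpart:

```python
import subprocess
from pathlib import Path


def pending_scans(inbox: Path, outbox: Path) -> list[Path]:
    """Return scans in `inbox` that have no same-named file in `outbox` yet."""
    return sorted(p for p in inbox.glob("*.pdf") if not (outbox / p.name).exists())


def ocr_command(src: Path, dst: Path) -> list[str]:
    """Build the OCRmyPDF invocation for one file (assumes `ocrmypdf` is on PATH)."""
    # --skip-text leaves pages that already carry a text layer untouched.
    return ["ocrmypdf", "--skip-text", str(src), str(dst)]


def run_batch(inbox: Path, outbox: Path) -> None:
    """OCR every pending scan, one at a time, reporting failures without stopping."""
    outbox.mkdir(parents=True, exist_ok=True)
    for src in pending_scans(inbox, outbox):
        result = subprocess.run(ocr_command(src, outbox / src.name))
        if result.returncode != 0:
            print(f"OCR failed for {src.name} (exit code {result.returncode})")
```

Calling `run_batch(Path("scans"), Path("searchable"))` from a scheduled job would keep the scanner and the OCR decoupled. Because the originals are never overwritten, a failed or poor OCR run can simply be repeated.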
Back to my filing system. I keep each year's physical documents in one file drawer sorted by serial number. Because the documents are stamped with a serial number before being scanned, I can always find the physical document easily if I am looking at the PDF. Furthermore, because my PDFs are searchable, I can quickly locate a bill or a tax document by a relevant name or even a particular amount (like, where did this $1,808.17 discrepancy come from?).
Is this perfect? No, far from it. Many little irritations afflict the actual process. Many statements have large amounts of small, barely readable disclosures and footnotes, sometimes in faint small fonts. This text is largely useless, slows down the OCR, and increases the file sizes. Barcodes, QR codes, and Data Matrix codes often appear on the first page of these documents or even on every page. It would be great if these were somehow scanned and used to tag the documents. The Epson software insists on embedding spaces in the generated file names and doesn't allow me to use auto-generated ISO dates in file names, which makes working with the files from the command line less than ideal. (File names like "2021-Aug-07 122.pdf" are user friendly but not so friendly for scripts or sort commands.) I use several different configurations for the scanner, which the software supports, but I have to pay attention to pick landscape and double-sided when needed.
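The file name irritation can be worked around after the fact. Here is a small sketch that converts a name like "2021-Aug-07 122.pdf" into one with an ISO date and no spaces. The exact name pattern is inferred from that single example, and the function name and the underscore separator are hypothetical choices, so treat this as illustrative only:

```python
import re
from datetime import date

# English month abbreviations, spelled out to avoid locale-dependent parsing.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}

# Assumed pattern: year, month abbreviation, day, a space, a counter, ".pdf".
SCAN_NAME = re.compile(r"^(\d{4})-([A-Za-z]{3})-(\d{2}) (\d+)\.pdf$")


def script_friendly_name(name: str) -> str:
    """Turn '2021-Aug-07 122.pdf' into '2021-08-07_122.pdf'."""
    match = SCAN_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized scan file name: {name!r}")
    year, month_abbr, day, counter = match.groups()
    month = MONTHS.get(month_abbr.capitalize())
    if month is None:
        raise ValueError(f"unknown month in file name: {name!r}")
    # date() rejects impossible dates such as Feb 30.
    iso_date = date(int(year), month, int(day)).isoformat()
    return f"{iso_date}_{counter}.pdf"


print(script_friendly_name("2021-Aug-07 122.pdf"))  # 2021-08-07_122.pdf
```

Names in this form sort chronologically with a plain `sort` and need no quoting in the shell.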
Thank you, HN, for the many suggested solutions to the OCR issues. It gives me hope that I'll be able to wire together something better than what I've got now.