ocr – Voyageur's corner

Tesseract is one of the best open-source OCR software available, and I recently took over ebuilds maintainership for it. Current development is still quite active, and since last stable release they added a new OCR engine based on LSTM neural networks. This engine is available in an alpha release, and initial numbers show a much faster OCR pass, with fewer errors.

Sounds interesting? If you want to try it, this alpha release is now in tree (along with a live ebuild). I insist on the alpha tag, this is for testing, not for production; so the ebuild masked by default, and you will have to add to your package.unmask file:
=app-text/tesseract-4.00.00_alpha*
The ebuild also includes some additional changes, like current documentation generated with USE=doc (available in stable release too), and updated linguas.

Testing with paperwork

The initial reason I took over tesseract is that I also maintain paperwork ebuilds, a personal document manager, to handle scanned documents and PDFs (which is heavy tesseract user). It recently got a new 1.1 release, if you want to give it a try!

Tag: ocr

app-text/tesseract 4.0 alpha ebuild available for testing

Testing with paperwork