While it is possible to extract text from a PDF by selecting it and using copy/paste, it doesn’t always work as planned, and you can lose formatting along the way. Able2Extract Professional 9 can extract the text with much of its formatting intact, and more. Built into the Pro version is a rather impressive OCR feature which can extract text from images.
Installing
Installing Able2Extract is easy enough. You download the Ubuntu/Debian .DEB file, double-click it and let it install. If you have a key to unlock it then you can enter that after the install.
Usage
On first use, you’re taken step-by-step through opening a file and converting it to text. In short, you work across the menu from left to right.
In steps:
• Open a file (PDF or text)
• Select an area (all or an area)
• Select an output format (HTML, image, and LibreOffice Calc and Writer are supported)
• Save.
Using the OCR took a little while to figure out, but you just convert the image to a PDF, or print the image to a PDF.
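If you would rather script the image-to-PDF step than print to PDF, here is a minimal sketch in Python using only the standard library. It is not part of Able2Extract, and the function names (jpeg_info, jpeg_to_pdf, convert_file) are made up for illustration. It assumes the image is an 8-bit greyscale or RGB JPEG and maps one pixel to one PDF point; other image formats would need a different tool.

```python
import struct
from pathlib import Path


def jpeg_info(data: bytes) -> tuple[int, int, int]:
    """Return (width, height, components) read from the JPEG frame header."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG file")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("corrupt JPEG marker")
        marker = data[pos + 1]
        if marker == 0xFF:  # padding byte before a marker
            pos += 1
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        # SOF0..SOF15 hold the dimensions; C4, C8 and CC are other segments.
        if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
            _bits, height, width, comps = struct.unpack(
                ">BHHB", data[pos + 4:pos + 10])
            return width, height, comps
        pos += 2 + length  # skip this segment
    raise ValueError("no frame header found")


def jpeg_to_pdf(jpeg: bytes) -> bytes:
    """Wrap the JPEG bytes, unchanged, in a one-page PDF."""
    width, height, comps = jpeg_info(jpeg)
    colorspace = {1: "DeviceGray", 3: "DeviceRGB"}.get(comps)
    if colorspace is None:
        raise ValueError("only greyscale and RGB JPEGs are handled here")
    # Page content: scale the unit image square up to the full page.
    content = f"q {width} 0 0 {height} 0 0 cm /Im0 Do Q".encode()
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        (f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 {width} {height}] "
         f"/Resources << /XObject << /Im0 4 0 R >> >> "
         f"/Contents 5 0 R >>").encode(),
        (f"<< /Type /XObject /Subtype /Image /Width {width} "
         f"/Height {height} /ColorSpace /{colorspace} "
         f"/BitsPerComponent 8 /Filter /DCTDecode "
         f"/Length {len(jpeg)} >>\nstream\n").encode() + jpeg + b"\nendstream",
        f"<< /Length {len(content)} >>\nstream\n".encode()
        + content + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset needed by the xref table
        out += f"{number} 0 obj\n".encode() + body + b"\nendobj\n"
    xref_at = len(out)
    out += f"xref\n0 {len(objects) + 1}\n".encode()
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += f"{offset:010d} 00000 n \n".encode()
    out += (f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\n"
            f"startxref\n{xref_at}\n%%EOF\n").encode()
    return bytes(out)


def convert_file(src: str, dst: str) -> None:
    """Read a JPEG from src and write the wrapping PDF to dst."""
    Path(dst).write_bytes(jpeg_to_pdf(Path(src).read_bytes()))
```

Calling convert_file("scan.jpg", "scan.pdf") (both file names are placeholders) should give a PDF that can then be opened for OCR. The JPEG data is embedded as-is rather than re-encoded, so the scan loses no quality in the conversion.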
PDF to Text
Upon opening FCM#94 (previous page, top right image) I skipped to page 13 and selected the first three columns of text. This also made it select the image, so I went with that and clicked the OpenOffice (surely it should say LibreOffice?) button. From the popup I clicked the ‘Convert’ button below Writer to get an ODT file.
The ODT file was saved, then automatically opened in LibreOffice Writer.
While the output (previous page, bottom right image) isn’t identical to the PDF, it has kept the header and text colours, which is nice. Even the dotted vertical lines were kept. The ‘drop cap on’ did knock it out of whack for those two lines, but the output as a whole is still very usable.
One thing I did notice is that, even with small PDF files like FCM (10 MB), it takes a few seconds to skip through the PDF.
Anyway, getting text from a PDF isn’t that impressive. Time to give the OCR a run for its money.
Image To Text
Seeing that it could do Calc, I decided to get a bit cheeky and convert a table from an image to Calc format.
Would it be able to read the text from the image, make it editable and keep it within a decent table format?
The answer is a resounding yes! While some text is a bit off, it has to be said that the original was a PDF printed, scanned, and turned into a PDF again, so the quality was a bit ropey.
It would certainly be easy to convert that Calc output into a table that would resemble the original.
What about an image of text to editable text?
Yep!
I like how it converts it to editable text, does a good job of it, and even keeps headers in bold. It’s not just a dump of plain text. It really does try to copy the format of the original.
Conclusion
Of course, it’s not infallible. Give it a coloured background with white text and I’m pretty sure it’ll fail, but so will the vast majority of OCR applications. I was particularly impressed with how few errors there were when converting a good-quality image to editable text.
If you have high-quality images that you need converted back to text, then this application is definitely one to consider, and kudos to Investintech for making a Linux version of their app available.
Linux System Requirements
OS: Linux Fedora 20 or newer, Ubuntu 13.10 or newer, 32-bit edition
RAM: 512+ MB of free memory available for the software
Hard Drive Space: 250 MB of disk space for the program components
Monitor: 1366 (Width) x 768 (Height) screen resolution
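The figures above are easy to check before installing. Here is a rough Python sketch, not an official Investintech tool: the function names are hypothetical, the screen resolution has to be supplied by hand, and the memory check assumes a kernel whose /proc/meminfo has a MemAvailable line (older kernels lack it).

```python
import shutil

# Minimums taken from the requirements listed above.
MIN_FREE_RAM_MB = 512
MIN_FREE_DISK_MB = 250
MIN_SCREEN = (1366, 768)


def mem_available_mb(meminfo_text: str) -> int:
    """Pull MemAvailable (reported in kB) out of /proc/meminfo text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) // 1024
    raise ValueError("MemAvailable not found (older kernel?)")


def free_disk_mb(path: str = "/") -> int:
    """Free space, in MB, on the filesystem that holds path."""
    return shutil.disk_usage(path).free // (1024 * 1024)


def unmet_requirements(ram_mb: int, disk_mb: int,
                       screen: tuple[int, int]) -> list[str]:
    """Return a message for each requirement that is not met."""
    problems = []
    if ram_mb < MIN_FREE_RAM_MB:
        problems.append(f"free RAM {ram_mb} MB < {MIN_FREE_RAM_MB} MB")
    if disk_mb < MIN_FREE_DISK_MB:
        problems.append(f"free disk {disk_mb} MB < {MIN_FREE_DISK_MB} MB")
    if screen[0] < MIN_SCREEN[0] or screen[1] < MIN_SCREEN[1]:
        problems.append(
            f"screen {screen[0]}x{screen[1]} < "
            f"{MIN_SCREEN[0]}x{MIN_SCREEN[1]}")
    return problems
```

On a Linux machine, the text from /proc/meminfo and the result of free_disk_mb() can be passed straight in, along with the screen size as a (width, height) pair. An empty list back means all three checks passed. The OS version and 32-bit requirement are not checked here.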
Download trial from: http://www.investintech.com/prod_downloadsa2e_pro.htm
COMPETITION:
To win one of five lifetime keys to Able2Extract Professional 9, all you have to do is answer the following question:
What does OCR stand for?
Email your answer to: misc@fullcirclemagazine.org
Deadline for entries is Sunday 19th April. Five winners will be drawn at random.