Turn those scans into something editable
The topic of OCR software came up on the MacGroup iBBS the other day, and my response to the member who asked for impressions of different software got me thinking about OCR software for the Mac (thanks George! I was stuck for a topic!).
Lately, I've been using Adobe Acrobat Pro for most of my OCR needs, because most of the time I'm taking something and scanning it to a PDF anyway. But what about those times you want to edit what you've scanned? Acrobat's a poor choice if you need to do editing afterwards, as PDFs are meant to be an end product. Even Adobe will tell you that to change a PDF it's best to go back to the source document. Clearly, Acrobat isn't the answer in that case. So what is the answer on the Mac?
I've used a couple of different programs over the years. The first one I ever used was ABBYY Fine Reader. It worked pretty well, but ABBYY got out of the retail OCR business with regard to Macs (they have a *lot* of software for Windows though!). They kept a hand in with a lite version that they offered to scanner manufacturers as something they could bundle with their scanners to show that they were Mac-friendly, but you couldn't buy it. I got a copy (version 5) with a scanner at one point, but it's pretty old now and not Intel-native.
Then I moved on to READ I.R.I.S. It was pretty good software, and has been kept up-to-date, but I didn't like some of the restrictions (there's a page count restriction and a few other things they held back for the "Corporate" edition). But what finally turned me off was their support – or lack of it.
Whenever a new patch version comes out, you're entitled to get it for free (as opposed to a major new release, which they charge an upgrade for, like most companies). But to get it, you have to email support and ask them for a URL to download it. You'd think they have 5 customers!
I found that to be annoying, but nothing as annoying as when I needed a refund. I received an email from them that the latest version (12) of the software was out, so I dutifully ordered a copy on CD. Of course, I didn't bother to look and see that they had only upgraded the Windows version and not the Mac version. I wrote and asked if I could swap, and was told the Mac version update was "several months" away, so I should ask for a refund (they have a 30-day refund policy). OK, I figured I was out shipping, but I should have noticed it was only the Windows version. So not a big deal, right?
I went through their procedure to get an RMA number from the web site. After a week went by and I didn't get a response, I tried calling the sales line. After leaving a few messages and *still* not getting a response, I called again and asked for support. I finally got a live person who told me to call Sales. After I told him I had been trying to contact Sales for weeks, and was close to missing my 30 days for a refund, he grudgingly agreed to take my info for a refund. With the last syllable out of my mouth for my address, he hung up on me!
I did get the refund, but I was not pleased that I had to beg for it. Then, a little over a month after I got the refund, I got a dunning notice from them looking for the money! After a couple of phone calls they decided I didn't actually owe them any money (how nice), but the whole time I got the impression that they thought I was a deadbeat. So now that version 12 for the Mac is finally out, I'm not exactly jumping on it.
So what to do? There is another product called Omnipage, but it's $499! That's a bit out of my price range since I don't need to use the software all that often. The upgrade price from an older version is $20 more than READ I.R.I.S. And George mentioned he's had bad experiences with their support.
Well, ABBYY is back in the game with ABBYY Fine Reader Express for $99.99. It appears you can only buy it direct from them as a downloadable product, so I don't know that you'll be able to find it discounted anywhere. Still, George inspired me to take a look at what was out there, and the software had been good in the past, so I decided to take a chance on it. (I can't seem to find the return policy anywhere on the site, but they have one, because their Knowledge Base alludes to it.)
After downloading and running the installer, I was asked to enter the serial number from my web purchase (also sent in an email) and the software was up and running. The interface is pretty simple:
My scanner showed up in the drop-down, and I tried out a magazine page. It did pretty well, getting probably 95% of it right. I was a bit surprised when I opened the RTF file though, as it was one long column instead of three like on the page. One of the selling points of the software is that it preserves the layout.
But my default for RTF files is TextEdit, and it's not the most sophisticated of programs, so I opened the file in the latest version of Apple's Pages. Still only one column. Then I opened it in Word 2008 and lo and behold – the page looked just like the original, with a large heading area at the top and three columns of text below. Bold items were bold, headings large – Fine Reader did a fine job.
In addition to reading from your scanner, you can choose a file from your system. So even if for some reason your scanner doesn't have a TWAIN driver, or you're using a photo from your camera (which ABBYY claims to support), you can still use the software. So I tried a copy of MacNews from 2007. Here's one of the original pages:
It's our own Calvin trying to get you to back up your data! He has some good articles here on the blog about it too, so check them out.
Here's the OCR version:
That's pretty well done! It didn't get the large, styled "I" at the beginning of the text, and a couple of words are run together, but all in all, I could easily start working with that.
It didn't do as well with a page with screen shots on it, because it tried to translate the tiny text on the screen! I went back to the software, where you can label parts of the document as pure graphics or as a table, and then the OCR engine will handle them accordingly. As you can see with the page above, it had no trouble with pure photos, but anything with text it will try to read unless you tell it otherwise. After I marked the screen shots as graphics, they were properly placed in the document.
I also tried converting this issue of MacNews into HTML, and it did a credible job again. I would need to go back and mark all the graphics again, but otherwise it did well; it even preserved the links in the document. The PDF was turned into one long page, which might not be what you'd like though.
I tried a 149-page PDF and it took about 20 minutes to convert. It needed to be gone through for the graphics, like the MacNews PDF, but the text was about 95% correct.
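If you'd rather put numbers on those impressions than eyeball them like I did, here's a rough Python sketch. The first function just does the arithmetic on the timing above (149 pages in about 20 minutes works out to roughly 8 seconds a page). The second is one possible way to estimate accuracy by comparing the OCR text against a hand-corrected copy, word by word. The function names and the sample sentences are my own inventions for illustration; none of this comes from Fine Reader, and my 95% figures were estimates by eye, not measured this way.

```python
import difflib


def seconds_per_page(pages: int, minutes: float) -> float:
    """Average conversion time per page, in seconds."""
    return minutes * 60 / pages


def word_accuracy(reference: str, ocr_output: str) -> float:
    """Fraction of the reference words that also appear, in order, in the OCR text."""
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    if not ref_words:
        return 0.0
    # autojunk=False so common words in long documents are not ignored
    matcher = difflib.SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


# The 149-page PDF that took about 20 minutes to convert
print(f"{seconds_per_page(149, 20):.1f} seconds per page")

# Made-up sample text (NOT real Fine Reader output), with two words run together
reference = "It is time to back up your data before the drive fails"
ocr_output = "It is time to backup your data before the drive fails"
print(f"{word_accuracy(reference, ocr_output):.0%} of words matched")
```

Note that counting by words is harsh on run-together words: one missing space costs you two words in the reference. Comparing character by character would be more forgiving, so treat whatever number you get as a ballpark.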
It's a bit disappointing that only Word seems to recognize the layout data in the RTF files – I would have hoped that Pages would too, but at least you can get the text and fonts and do the layout yourself.
One other plus – it's the only software I have that will read multiple pages from my scanner's feeder. Even Acrobat won't do that correctly. I've had to bring up the scan interface separately and save off a PDF or image file(s) for any other software, but ABBYY Fine Reader read all the pages of a test.
There's a link in the program menu to check for updates (right now there aren't any) so I'm hopeful I won't need to beg anyone for them. Overall, I think that ABBYY Fine Reader Express is a good choice if you need to convert documents to editable form.