Problems with "Fat Years" (盛世) PDF. Need OCR?

Paul Duke

进士
Hi, I'm trying to read "The Fat Years" in Chinese and I'm having trouble copying and pasting text from the PDF (which apparently is an official version posted by the author) into Wenlin. When I copy and paste I get a mess of unreadable codes. I have tried many software applications to read and re-save PDFs, but nothing works. I get the same mess every time.

I tried using the OCR function on Pleco on my iPhone but the text is simply too small for Pleco to recognize it. Hmm, maybe if I had an iPad?

Years ago I bought a scanner that came with excellent OCR software which handled Chinese very well. I don't have access to a scanner anymore, but I searched around for OCR software that can do the job with image files. It seems they are all pretty expensive. There's a program called ABBYY FineReader but it is $99 in the app store. I recall that my scanner cost less than that and the software came with it for free!

Any suggestions of how to inexpensively perform OCR on a PDF file so that I can convert it to a workable text file and move it into a Chinese dictionary program, Pleco and/or Wenlin?

If you don't know the "Fat Years" here's a NY Times profile of the author:
http://www.nytimes.com/2011/07/30/world ... lobal-home

thanks

PD
 

Paul Duke

进士
Well, a quick follow-up. The ABBYY FineReader I mentioned looks very professional and says it handles 170 languages, but Chinese is not one of them...
 

mikelove

皇帝
Staff member
If you convert the pages in the PDF file to a .jpg or other image format then our OCR system should be able to handle it - direct PDF support is on our to-do list but it's rather hairy so it's taking a while. Aside from that I'm afraid I don't have a whole lot of suggestions, though.
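For anyone who wants to try the "convert pages to images, then OCR" route on a desktop machine rather than in Pleco, here is a minimal sketch. It assumes two free command-line tools are installed, neither of which is mentioned above: `pdftoppm` (part of Poppler) to render pages as JPEGs and `tesseract` with the `chi_sim` language data to recognize them. The function names, the 300 dpi setting, and the file names are only illustrative; use `chi_tra` instead if the text is in traditional characters.

```python
import subprocess
from pathlib import Path


def build_render_command(pdf_path, out_prefix, dpi=300):
    """Command that renders every PDF page to a JPEG via Poppler's pdftoppm.

    Output files are named <out_prefix>-<page>.jpg (page numbers may be
    zero-padded, depending on the page count).
    """
    return ["pdftoppm", "-jpeg", "-r", str(dpi), pdf_path, out_prefix]


def build_ocr_command(image_path, lang="chi_sim"):
    """Command that OCRs one image with tesseract.

    tesseract appends ".txt" to the output base, so page-1.jpg -> page-1.txt.
    """
    out_base = str(Path(image_path).with_suffix(""))
    return ["tesseract", image_path, out_base, "-l", lang]


def ocr_pdf(pdf_path, out_prefix):
    """Render the PDF to images, then OCR each image in page order."""
    subprocess.run(build_render_command(pdf_path, out_prefix), check=True)
    prefix = Path(out_prefix)
    for image in sorted(prefix.parent.glob(prefix.name + "-*.jpg")):
        subprocess.run(build_ocr_command(str(image)), check=True)
```

Calling `ocr_pdf("book.pdf", "page")` (hypothetical file names) should leave one `.txt` file per page next to the images. Small print usually benefits from a higher `dpi` value, at the cost of larger image files.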
 
Check the PDF and go to File > Properties, then click the "Fonts" tab. Your PDF probably has the characters saved as vectors instead of fonts, in which case the text is impossible to extract. You'll have to use an image-based tool such as OCR.
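If your viewer has no font properties dialog, a rough check can be scripted. The sketch below is only a heuristic and the function name is made up for illustration: it scans the raw bytes of the file for font dictionaries (`/Type /Font`) and for `/ToUnicode` entries, which are what let a viewer map glyphs back to real characters. It assumes the relevant objects are stored uncompressed; many newer PDFs pack them into compressed object streams, in which case both counts can come back as zero even though fonts are present.

```python
import re


def font_report(pdf_bytes):
    """Count font dictionaries and ToUnicode references in raw PDF bytes.

    Heuristic only: objects inside compressed object streams are not seen.
    """
    # \b keeps "/Type /FontDescriptor" from being counted as a font.
    fonts = len(re.findall(rb"/Type\s*/Font\b", pdf_bytes))
    to_unicode = len(re.findall(rb"/ToUnicode\b", pdf_bytes))
    return {"fonts": fonts, "to_unicode": to_unicode}


def describe(report):
    """Turn the counts into a cautious, human-readable guess."""
    if report["fonts"] == 0:
        return "no fonts found: text may be outlines or images (or compressed)"
    if report["to_unicode"] == 0:
        return "fonts found but no ToUnicode maps: copy and paste is likely garbled"
    return "fonts with ToUnicode maps found: text extraction may work"
```

To try it, read the file with `open("book.pdf", "rb").read()` (hypothetical file name) and pass the bytes to `font_report`. Either of the first two outcomes points toward OCR as the practical way to get usable text.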
 

Paul Duke

进士
Hi thanks much for the note. I am reading on the Mac using Preview, and that doesn't appear to have File>Properties as an option, but undoubtedly you are right.

And just as you suggested, after a lot of searching around on the 'net a few weeks back I was able to OCR it pretty successfully using a website service.

Thanks so much as always!

PD
 

gato

状元
See attached for a Word doc version of the book converted from the PDF posted by the author.
 

Attachments

  • sheng shi 2013 CN.doc
    437 KB · Views: 960