Still OCR won't recognize large PDF file

Shun · May 9, 2018

Hi Mike,

I took photographs of a book, then combined the resulting JPG files into one PDF file using Acrobat Pro (the PDF file size equals the total size of the JPEGs) and copied the PDF file to Pleco. Still OCR can recognize the individual JPG files, but strangely enough, when I open the PDF, the Chinese characters aren't tappable. I tried both the 甲 and the 乙 recognizers, but unfortunately neither of them worked. The PDF file size is about 143 MB. I doubt that Pleco needs to read the entire decoded PDF file into memory, so shouldn't Still OCR be able to handle PDFs of any size? I will send you a link to the PDF file by email.

Thanks, Shun

mikelove · May 9, 2018

Does it help if you downsample the JPGs? They're coming through as pretty enormous in this PDF file. We use slightly different methods to determine how much to downscale pages in PDFs versus standalone images before we render them, and it's possible we're not downscaling them enough in this particular PDF.
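
For anyone wanting to try the downsampling step in bulk, here is a minimal sketch, not anything Pleco does internally. It assumes Python with the third-party Pillow library installed; the function names, the 1600-pixel width, and the quality setting are all hypothetical choices for illustration, so adjust them until the text is still sharp enough to recognize.

```python
def target_size(width, height, max_width):
    """Return (w, h) scaled to fit max_width, keeping the aspect ratio.

    Images already narrower than max_width are left at their original size.
    """
    if width <= max_width:
        return (width, height)
    new_height = max(1, round(height * max_width / width))
    return (max_width, new_height)


def downscale_jpeg(src_path, dst_path, max_width=1600, quality=85):
    """Write a downscaled copy of one JPEG (requires Pillow).

    max_width and quality are illustrative defaults, not recommended values.
    """
    from PIL import Image  # imported here so target_size works without Pillow

    with Image.open(src_path) as im:
        size = target_size(im.width, im.height, max_width)
        im.resize(size, Image.LANCZOS).save(dst_path, "JPEG", quality=quality)
```

For example, `target_size(4000, 3000, 1600)` gives `(1600, 1200)`. The downscaled JPEGs can then be combined into a PDF the same way as before.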

Shun · May 9, 2018

Downscaling to a width of 900 pixels created a 16 MB file, which worked, but the result was a bit too blurry. So I'm resorting to OCR'ing the individual JPEGs for now.

Thanks!

pdwalker · May 10, 2018

(heh. First world problems)
