Strange font/character handling issue in file reader

Hi, I am attempting to read The Fat Years from a PDF downloaded from bannedbook.org. The PDF looks normal on a MacBook, on Windows, and even visually in Pleco's file reader.

The first sentence looks like this inside Pleco's document reader: 一个月不见了

However, if I tap on each character, Pleco's pop-up displays: 一个月不见乴 (note the last character is wrong)

Why would the document reader display 了 but the pop-up for that character show 乴?

Outside of Pleco, this character 了 is also converted into 乴 by Calibre's PDF-to-.mobi conversion. In Adobe Reader on both Mac and Windows, the character looks like 了. However, copying and pasting the text on either platform turns 了 into 乴. What could be going on that would allow the on-screen rendering to appear correct while the underlying character is wrong?

I've tried inspecting the fonts in the PDF, and while I can see that the PDF fonts are definitely double-byte, the actual font names are garbled, so I can't tell which fonts are in use.
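
To pin down exactly which character is stored in the text layer, it can help to paste the copied text into a small script and print the code points. This is only a minimal sketch in Python; the function names `describe` and `first_difference` are made up for illustration and are not part of Pleco or any PDF tool.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def first_difference(seen, copied):
    """Return (index, seen_char, copied_char) for the first mismatch, or None."""
    for i, (a, b) in enumerate(zip(seen, copied)):
        if a != b:
            return i, a, b
    return None


if __name__ == "__main__":
    seen = "一个月不见了"    # what the reader shows on screen
    copied = "一个月不见乴"  # what copy/paste (or a tap) returns
    for row in describe(copied):
        print(*row)
    print(first_difference(seen, copied))
```

If the two strings differ at some position, the glyph drawn on screen and the character stored for copying are not the same code point, which is consistent with what the pop-up shows.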
 

Shun

状元
Hi,

This happens a lot with PDFs because of how their text is encoded. What you can do is open the same PDF in Pleco using the Still OCR feature, which recognizes all characters visually and lets you read them correctly. You can leaf through the pages using the buttons at the upper right. For a long book this might get a little cumbersome, so you may want to split the PDF up first using another tool. (There are free, open-source tools for this.)

Regards, Shun
 

etm001

状元
"What could be going on that would allow on screen rendering to appear correct, with the underlying character wrong?"

I don't have an answer to your problem, but I'll share something similar that happens to me (and it's really frustrating). On a Mac, I use the "Save to PDF..." option in TextEdit. I discovered that since El Capitan (AFAIK), the Unicode values for various characters are wrong. They display correctly in Mac OS, and they even display correctly in Pleco Reader. However, tap-selecting these characters fails in Pleco (you can't even tap them). I submitted feedback to Apple ages ago, but the problem persists.
 

Shun

状元
Edit: This is what Mike said six months ago: "This is an issue with the Mac OS X PDF converter, we've had a few other reports of it now too - basically it's exporting some characters as Kangxi Radicals (which use different character encoding numbers) instead of as actual characters. Best fix at the moment is to generate your PDF some other way, or use another format. Now that we know this is a problem we'll probably add some code to automatically detect those + convert them to regular characters in our next major update but that's a few months away."

So maybe the next version of Pleco will be able to handle such cases, possibly even for PDFs from converters other than the built-in Mac OS X one.

Original thread: http://plecoforums.com/threads/partially-tappable-pdfs-in-reader.5058/
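
For the Kangxi Radicals case Mike describes, the conversion can be sketched in a few lines of Python. This is only an illustration, not Pleco's code: it assumes the stray characters fall in the Kangxi Radicals block (U+2F00 to U+2FD5) and relies on Unicode NFKC normalization, which maps each of those radicals to its ordinary character. The names `is_kangxi_radical` and `fix_kangxi` are made up for the example.

```python
import unicodedata

# Unicode "Kangxi Radicals" block: U+2F00 .. U+2FD5
KANGXI_START, KANGXI_END = 0x2F00, 0x2FD5


def is_kangxi_radical(ch):
    """True if ch is a code point in the Kangxi Radicals block."""
    return KANGXI_START <= ord(ch) <= KANGXI_END


def fix_kangxi(text):
    """Replace Kangxi radicals with regular characters; leave the rest alone."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if is_kangxi_radical(ch) else ch
        for ch in text
    )


if __name__ == "__main__":
    # U+2F00 (radical "one") and U+2F49 (radical "moon") around a normal 个
    broken = "\u2f00\u4e2a\u2f49"
    print(fix_kangxi(broken))
```

Normalizing only the characters in that block keeps the rest of the text untouched. Note that this would not help with the 了 / 乴 case from the first post, since 乴 is an ordinary character rather than a Kangxi radical.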
 
Thanks all... yeah, Chinese computing isn't bulletproof. That's the lesson I've taken away from everything I've encountered, including this problem.

One thing to emphasize about the problem I've seen here: if I copy and paste the PDF text, this conversion error happens on both Mac OS Yosemite and Windows 8.1. Whatever is going wrong with the encoding is cross-platform and not specific to Mac OS X. I think maybe the original fonts or the way the document was turned into a PDF were wonky.

The Chinese government banned the book, so the author released the book for free. God knows what happened on his end to create the PDF file...
 

Shun

状元
You almost have to be a programmer to really understand what's going on in this case. But the Still OCR feature is a great workaround; I can highly recommend it for short PDF documents that are only partially tappable.
 