Cantonese User Dictionary, but packaged as a <plecopack>

I know you've seen this question before (going back to at least 2003, in what might be the eighth thread ever on this forum!), but I at least have a new twist on it!

I'm currently a student at 香港中文大學 (The Chinese University of Hong Kong) studying Cantonese, and I have full rights to use a number of sources that together amount to a little over 25,000 entries with Cantonese romanizations. Being able to dump these into Pleco would be immensely valuable. Since custom user dictionaries are currently feature-limited for non-Mandarin languages, I thought I could possibly insert them the same way Pleco does for the releases that you (re)publish.

I'm technically capable and am totally okay if any proposed solution breaks in the future on an arbitrary upgrade without notice. My current ill-advised attempt:
- Insert a proxy between Pleco and `filelist.php`.
- Serve my dictionary by inserting a bonus item into the response.
- Create a properly formatted dictionary.

"Properly formatted" notes:
- I have the ability to put my source data into the same format as CC-Canto, but that's not the format you're delivering things in.
- I could set it up as a PDB and reverse-engineer the CC-Canto one, but I'm unfamiliar with the format and the available tooling appears limited.
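
For reference, here is a minimal Python sketch of what "the same format as CC-Canto" means on my side. It assumes the plain-text layout `TRADITIONAL SIMPLIFIED [pinyin] {jyutping} /gloss 1/gloss 2/`, which is my understanding of the published CC-Canto text file; the `CantoEntry` name and the sample line are made up for illustration, and none of this touches the PDB format.

```python
import re
from typing import NamedTuple, Optional

# Assumed CC-Canto-style text layout (NOT Pleco's PDB format):
#   TRADITIONAL SIMPLIFIED [pinyin] {jyutping} /gloss 1/gloss 2/
LINE_RE = re.compile(
    r"^(?P<trad>\S+)\s+(?P<simp>\S+)\s+"
    r"\[(?P<pinyin>[^\]]*)\]\s+"
    r"\{(?P<jyutping>[^}]*)\}\s+"
    r"/(?P<glosses>.*)/\s*$"
)


class CantoEntry(NamedTuple):  # hypothetical container, for illustration only
    traditional: str
    simplified: str
    pinyin: str
    jyutping: str
    glosses: list


def parse_line(line: str) -> Optional[CantoEntry]:
    """Parse one entry line; return None for blanks, comments, or mismatches."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = LINE_RE.match(line)
    if match is None:
        return None
    # Glosses are slash-separated; drop empty pieces from doubled slashes.
    glosses = [g for g in match.group("glosses").split("/") if g]
    return CantoEntry(
        match.group("trad"),
        match.group("simp"),
        match.group("pinyin"),
        match.group("jyutping"),
        glosses,
    )


# Made-up sample line in the assumed layout.
entry = parse_line("你好 你好 [ni3 hao3] {nei5 hou2} /hello/hi/")
```

Getting my spreadsheets into this shape is the easy part; the open question is everything downstream of it.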

Only two questions:
1. Is Pleco expecting the content of the database to be encrypted? (In other words, should I stop now because there is no way to do this without your private key?)
2. Is there an existing tool that you're using for reading/producing PDB files or are you running something of your own that has been around since 2000-ish?

I am trying to avoid imposing on you. Please take your time if you elect to respond; I know you're prepping for iOS 14, new screen resolutions, and more. If this thread makes you want to nuke my license from orbit, please ignore my questions. I'm a happy customer no matter what.


***

A trip down memory lane:
 

mikelove

皇帝
Staff member
Heh, the 'nuke from orbit' bit certainly had nothing to do with people politely asking questions. It's mostly about people who simply don't like the app, won't accept / aren't satisfied with any help we offer, and, rather than requesting a refund from Apple (as we repeatedly nudge them to do), simply go on sending angry emails about the things they don't like. (A few of these eventually degenerate into ad hominem attacks against me personally and conspiracy-theory accusations about Apple / Google / the PRC / the Trump administration / etc.; some people feel that if they're upset enough in a customer support situation they can pretty much just let fly with whatever ugly stuff is on their mind.)

Anyway, as far as your question goes: I'm afraid there's not really any way to do this, no. Our PDB encoder is totally proprietary (not even based on anything open-source), and while the format does not require encryption (open-source databases like CC-CEDICT are not in fact encrypted), it's a hairy 20-year-old undocumented binary format and it'd take a loooong time to make any headway with it. Plus, even if you did, we've thrown away about 90% of it for 4.0 anyway. The lack of Cantonese support in user dictionaries isn't because we intentionally block it; we simply never got around to it. User dictionaries use a totally different format (nice open friendly SQLite), and we never bothered implementing a Cantonese search index for it.
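
To make the "search index" point concrete, here is a rough Python sketch of what a Cantonese lookup over a SQLite-backed dictionary could look like in principle. Everything in it is an assumption for illustration: the `entries` table, its columns, the index name, and the sample rows are invented, and this is not the schema Pleco's user dictionaries actually use.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical dictionary schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries (headword TEXT, jyutping TEXT, definition TEXT)"
    )
    # The index on the romanization column is the 'Cantonese search index' here.
    conn.execute("CREATE INDEX idx_entries_jyutping ON entries (jyutping)")
    conn.executemany(
        "INSERT INTO entries VALUES (?, ?, ?)",
        [
            ("你好", "nei5 hou2", "hello"),
            ("你", "nei5", "you"),
            ("香港", "hoeng1 gong2", "Hong Kong"),
        ],
    )
    return conn


def search_by_jyutping(conn: sqlite3.Connection, prefix: str) -> list:
    """Return headwords whose Jyutping starts with the given prefix."""
    # Note: '%' and '_' inside the prefix would act as LIKE wildcards.
    rows = conn.execute(
        "SELECT headword FROM entries WHERE jyutping LIKE ? ORDER BY jyutping",
        (prefix + "%",),
    )
    return [row[0] for row in rows]


conn = build_demo_db()
matches = search_by_jyutping(conn, "nei5")
```

A real index would also need tone-insensitive matching, syllable segmentation, and so on, which is where the actual work lies.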

With 4.0 we support not only Cantonese user dictionaries but user dictionaries with whatever arbitrary language / indexing system you like. We wrapped a bunch of configuration screens around ICU, so if you want to load a custom tokenizer, a custom collation sequence, a series of custom transliterations applied before indexing, etc., that should all be totally doable. There's also some proprietary stuff for Asian languages, like a system to map one romanization system to another (find all syllables in one romanization system that begin with a particular sequence, map each of them to its equivalent in the other system, and search for exact matches on all of those). But until that's ready, there's not much I can suggest for our current app.
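
The romanization-mapping idea can be sketched in a few lines of Python. This is only an illustration of the three steps as described (prefix match in one system, map to the other system, exact-match lookup), not the 4.0 implementation; the tiny Jyutping-to-Yale table, the Yale-keyed index, and the function names are all hypothetical, and tones are left out to keep it short.

```python
# Hypothetical toy table: Jyutping syllable -> Yale syllable (tones omitted).
JYUTPING_TO_YALE = {
    "zi": "ji",
    "zo": "jo",
    "zou": "jou",
    "zoeng": "jeung",
    "ci": "chi",
    "coeng": "cheung",
    "jyut": "yut",
}

# Hypothetical dictionary index keyed by exact Yale syllable.
YALE_INDEX = {
    "ji": ["字"],
    "jo": ["左"],
    "jou": ["做"],
    "jeung": ["張"],
    "chi": ["次"],
}


def expand_prefix(prefix: str) -> list:
    """Steps 1 and 2: find Jyutping syllables starting with the prefix,
    then map each one to its Yale equivalent."""
    return sorted(
        yale
        for jyutping, yale in JYUTPING_TO_YALE.items()
        if jyutping.startswith(prefix)
    )


def search_by_prefix(prefix: str) -> list:
    """Step 3: exact-match lookups in the Yale-keyed index for every
    mapped syllable."""
    results = []
    for yale in expand_prefix(prefix):
        results.extend(YALE_INDEX.get(yale, []))
    return results


yale_candidates = expand_prefix("zo")
hits = search_by_prefix("zo")
```

The point of doing it this way is that the index only ever has to store one romanization; a query typed in another system gets translated into a set of exact keys instead of requiring a second index.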
 
Ugh. You're trying really hard to sell me on dropping my iOS 13.5 install for a future Pleco 4.X which requires iOS 14 because of Catalyst. :p (Pleco, Anki, and Twitter combine for a significant majority of my screen time; when 4.X ships I'll upgrade immediately.)

I'll try to wait as patiently as I can for 4.X; let me know if you need somebody to put the Cantonese stuff through its paces. I'm happy to test the romanization transformation stuff as well; I've been meaning to implement a transformation from Jyutping to Barnett-Chao as a joke.

Check out some Barnett-Chao examples for some absolute nonsense:

***

Separately, one of the most significant difficulties I've had in advancing past an early-intermediate level in Cantonese is identifying which word to use in which context (and, for bonus points, how it is pronounced in that context). All of the existing Cantonese dictionaries (except probably Wenlin's ABC) appear to be primarily wordlists with pronunciations, including the spreadsheets I have access to. These are immensely valuable, but I feel like more can be done.

香港中文大學 is considering compiling a more lexicologically aware database that plays a bit more nicely with Cantonese. In particular, it would put 口語 (colloquial speech) vs. 書面語 (written language) on one axis, with register (日常, everyday, to 正式, formal) on a separate axis. Time and location axes would also be interesting from a language preservation perspective (though that goal seems pretty well beyond the scope of Pleco).

Only mentioning this now so it bounces around in the back of your head in case that project moves forward or I decide I want something like that to have better tools to teach kids Cantonese or you decide that you want to diversify Pleco's language options. (Korean and Japanese both might have use for a similar approach.)
 

mikelove

皇帝
Staff member
Joke romanizations: yep, cf. my tweet on Quacking Pinyin.

Language diversification: definitely on my mind, all of this stuff should also translate to Japanese and Korean (though may need a few extra coding bits for some of the more arcane kanji/kana mixed search stuff).
 