Cantonese User Dictionary, but packaged as a <plecopack>

nathanhammond · Oct 10, 2020

I know you've seen this question before (starting in at least 2003 with what might possibly be the eighth ever thread on this forum!), but I at least have a new twist on it!

I'm currently a student at 香港中文大學 studying Cantonese, and I have full rights to use a number of sources that together amount to a little over 25,000 entries with Cantonese romanizations. Being able to dump these into Pleco would be immensely valuable. Since custom user dictionaries are currently feature-limited for non-Mandarin, I thought I could possibly insert them the same way Pleco does for the releases that you (re)publish.

I'm technically capable and am totally okay if any proposed solution breaks in the future on an arbitrary upgrade without notice. My current ill-advised attempt:
- Insert proxy between Pleco and `filelist.php`.
- Serve my dictionary by inserting a bonus item into the response.
- Create a properly formatted dictionary.

"Properly formatted" notes:
- I have the ability to put my source data into the same format as CC-Canto, but that's not the format you're delivering things in.
- I could set it up as a PDB and reverse CC-Canto but I'm unfamiliar with the format and the tooling available appears limited.
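For anyone following along, here is a minimal sketch of what "the same format as CC-Canto" could look like in practice. It assumes the line layout is the CC-CEDICT one with Jyutping added in braces (`Traditional Simplified [pinyin] {jyutping} /definition/.../`); that is my understanding of the text distribution, so check it against an actual CC-Canto file. The function names and the sample entry are made up for illustration, and none of this touches Pleco's PDB format.

```python
import re

# Assumed layout: TRAD SIMP [pinyin] {jyutping} /def 1/def 2/
LINE_RE = re.compile(
    r"^(?P<trad>\S+) (?P<simp>\S+) "
    r"\[(?P<pinyin>[^\]]*)\] \{(?P<jyutping>[^}]*)\} "
    r"/(?P<defs>.*)/$"
)


def format_entry(trad, simp, pinyin, jyutping, definitions):
    """Build one CC-Canto-style line from its fields (hypothetical helper)."""
    return f"{trad} {simp} [{pinyin}] {{{jyutping}}} /{'/'.join(definitions)}/"


def parse_entry(line):
    """Split one CC-Canto-style line back into a dict of fields."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a CC-Canto-style line: {line!r}")
    fields = match.groupdict()
    # Definitions are slash-separated inside the outer pair of slashes.
    fields["definitions"] = fields.pop("defs").split("/")
    return fields


if __name__ == "__main__":
    line = format_entry(
        "廣東話", "广东话", "Guang3 dong1 hua4", "gwong2 dung1 waa2",
        ["Cantonese language"],
    )
    print(line)
    print(parse_entry(line))
```

Round-tripping spreadsheet rows through something like this is an easy way to catch malformed romanizations before trying to import anything.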

Only two questions:
1. Is Pleco expecting the content of the database to be encrypted? (In other words, should I stop now because there is no way to do this without your private key?)
2. Is there an existing tool that you're using for reading/producing PDB files or are you running something of your own that has been around since 2000-ish?

I am trying to avoid imposing on you. Please take your time if you elect to respond; I know you're prepping for iOS 14, new screen resolutions, and more. If this thread makes you want to nuke my license from orbit, please ignore my questions. I'm a happy customer no matter what.

https://twitter.com/i/web/status/1310373764279001089

***

A trip down memory lane:

Cantonese?

Hi Mike. I have version 2.0.x sitting on my PC but have been too lazy to install it. I'm wondering if you currently support, or plan to support, Cantonese romanizations in the dictionary. If not, would it be possible to add a feature where the user can input and store his own romanization for...

www.plecoforums.com

Trying to add a customized Cantonese Dictionary...

Hi, I'm looking to add cantonese cedict which I got from to pleco since I couldn't find the Early Access 廣州話方言詞典 which you mentioned in the update. The problem is the CEDict is in the format: which doesn't fit Pleco's format: Editing Traditional and Simplified is simple enough, how can I...

www.plecoforums.com

Cantonese user dictionary entries

I'm trying to create user dictionaries in Cantonese however there is nowhere on the user dictionary form to enter the jyutping pronunciation. Is this possible to do? Please can someone help?

www.plecoforums.com

Cantonese user dictionary?

Is it possible to create Cantonese user dictionaries? Is there a technical reason why user dictionaries behave differently than built-in ones (e.g. a faster read-only database?), and will this be changed in a future update? Thanks!

www.plecoforums.com

mikelove · Oct 10, 2020

Heh, the 'nuke from orbit' bit certainly had nothing to do with people politely asking questions - it's mostly people who simply don't like the app, won't accept / aren't satisfied with any help we offer, and rather than requesting a refund from Apple (as we repeatedly nudge them to do) they simply go on sending angry emails about the things they don't like. (a few of which eventually degenerate into ad hominem attacks against me personally, conspiracy theory accusations about Apple / Google / the PRC / the Trump administration / etc - some people feel like if they're upset enough in a customer support situation they can pretty much just let fly with whatever ugly stuff is on their mind)

Anyway, as far as your question: I'm afraid there's not really any way to do this, no. Our PDB encoder is totally proprietary - not even based on anything open-source - and while it does not require encryption (and open-source databases like CC-CEDICT are not in fact encrypted), it's a hairy 20-year-old undocumented binary format and it'd take a loooong time to make any headway with it, plus even if you did we've thrown away about 90% of it for 4.0 anyway. The lack of Cantonese support in user dictionaries isn't because we intentionally block it, we simply never got around to implementing it - they use a totally different format (nice open friendly SQLite) and we never bothered implementing a Cantonese search index for it.

With 4.0 we support not only Cantonese user dictionaries but user dictionaries with whatever arbitrary language / indexing system you like - we wrapped a bunch of configuration screens around ICU, so if you want to load a custom tokenizer, custom collation sequence, series of custom transliterations applied before indexing, etc, that should all be totally doable; also some proprietary stuff for Asian languages like a system to map one romanization system to another (find all syllables in one romanization system that begin with a particular sequence, map each of them to its equivalent in the other system, and search for exact matches on all of those). But until that's ready, there's not much I can suggest for our current app.
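The romanization-mapping idea described above (expand a typed prefix into every matching syllable in one system, map each to the other system, then do exact-match lookups) can be sketched roughly as follows. This is only an illustration of the technique, not Pleco's implementation: the function names are hypothetical, and the table and index are a handful of toneless Jyutping/Yale pairs picked as sample data.

```python
# Tiny sample table: toneless Jyutping syllable -> toneless Yale syllable.
# Illustrative data only; a real table would cover every syllable and tone.
JYUTPING_TO_YALE = {
    "zi": "ji",
    "ze": "je",
    "zoeng": "jeung",
    "coeng": "cheung",
    "jyut": "yut",
}

# Hypothetical search index keyed by the *target* romanization (Yale).
YALE_INDEX = {
    "ji": ["字"],
    "je": ["姐"],
    "jeung": ["張"],
    "cheung": ["長"],
    "yut": ["月"],
}


def expand_prefix(prefix, table):
    """Find every source syllable starting with `prefix` and return
    the equivalent target-system syllables, sorted and de-duplicated."""
    return sorted(
        {target for source, target in table.items() if source.startswith(prefix)}
    )


def search(prefix, table, index):
    """Exact-match lookup in `index` for each syllable the prefix expands to."""
    results = []
    for syllable in expand_prefix(prefix, table):
        results.extend(index.get(syllable, []))
    return results


if __name__ == "__main__":
    # Typing the Jyutping prefix "z" searches a Yale-keyed index.
    print(expand_prefix("z", JYUTPING_TO_YALE))
    print(search("z", JYUTPING_TO_YALE, YALE_INDEX))
```

The point of expanding first is that the index itself only ever needs exact-match keys in one romanization; all the cross-system fuzziness lives in the table.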

nathanhammond · Oct 10, 2020

Ugh. You're trying really hard to sell me on dropping my iOS 13.5 install for a future Pleco 4.X which requires iOS 14 because of Catalyst.

(Pleco, Anki, and Twitter combine for a significant majority of my screen time; when 4.X ships I'll upgrade immediately.)

I'll try to wait as patiently as I can for 4.X; let me know if you need somebody to put Cantonese stuff through its paces. I'm happy to do the same for the romanization transformation stuff; I've been meaning to implement a transformation from Jyutping to Barnett-Chao as a joke.

Check out some Barnett-Chao examples for some absolute nonsense:

Barnett–Chao Romanisation - Wikipedia

en.wikipedia.org

***

Separately, one of the most significant difficulties I've had in advancing past early-intermediate Cantonese is identifying which word to use in which context (and, for bonus points, how it is pronounced in that context). All of the existing Cantonese dictionaries (except probably Wenlin's ABC) appear to be primarily wordlists with pronunciations, including the spreadsheets I have access to. These are immensely valuable, but I feel like more can be done.

香港中文大學 is considering compiling a more lexicologically aware database that plays a bit more nicely with Cantonese. In particular, 口語 (spoken) vs. 書面語 (written) on one axis, with register (日常, everyday, to 正式, formal) on a separate axis. Time and location axes would also be interesting from a language preservation perspective (though that goal seems pretty well beyond the scope of Pleco).

Only mentioning this now so it bounces around in the back of your head in case that project moves forward or I decide I want something like that to have better tools to teach kids Cantonese or you decide that you want to diversify Pleco's language options. (Korean and Japanese both might have use for a similar approach.)

mikelove · Oct 10, 2020

Joke romanizations: yep, cf. my tweet on Quacking Pinyin.

Language diversification: definitely on my mind, all of this stuff should also translate to Japanese and Korean (though may need a few extra coding bits for some of the more arcane kanji/kana mixed search stuff).

pdwalker · Oct 21, 2020

nathanhammond said:
Check out some Barnett-Chao examples for some absolute nonsense:

Barnett–Chao Romanisation - Wikipedia

en.wikipedia.org

Good Lord! That's the most insane system I've seen.

At least with Mike's Quackyin system, you know he's not serious.
