MoE Minnan and Hakka dictionaries

Abun

榜眼
Is there still any demand for the additional items?
I've been caught up with various professional studies, which have taken up some of the free time I previously had for this kind of thing.
Yes, there definitely is! I just figured you were busy and didn't want to put you under pressure. After all, you are doing this in your free time, and even without the result being a lot of use for yourself! Big thanks for that!

Well for what it's worth, we've mostly finished implementing support for arbitrary pronunciation systems for 4.0, so it's almost certain that will be included.
Awesome, that should make things a lot easier I guess!
 

举人
Without dragging out a dead conversation, how is this dictionary installed? I downloaded the files from GitHub, yet Pleco has issues importing them. Is there a thread anywhere that explains the process? Ah, Minnan; it reminds me of taking Minnan classes at NTU (not very successfully) a few years back.
 

alex_hk90

状元
Settings > Manage Dictionaries > Add User > Existing, then select the .pqb file.
You need to have the paid "Flashcard System" add-on to have user dictionaries.
 
Not at all sure that this is the right place to ask, but what's the current progress in Pleco providing Minnan/Taiwanese dictionary & flashcard production support along the same lines as has been provided for Cantonese?
 

mikelove

皇帝
Staff member
Nothing to report on this one either - the user dictionary system in 4.0 will theoretically support arbitrary romanization systems but it may be a bit of a kludge at least at first.
 

daoge

Member
Nearly one thousand new entries have been added to the MoE Minnan dictionary since the last time @alex_hk90 posted his flashcards in 2015, so I thought I'd drop a link to an updated version in case anyone else is still interested:


Both files contain 15002 entries from dict-twblg.json in addition to 6793 entries from mdict-twblg-ext.json.
I also corrected a few issues with the old flashcards:
  • Words ending in a stop are now correctly marked as tone 4 instead of tone 1 (the example for '色' now correctly reads 'ang5-sik4' instead of 'ang5-sik1').
  • English words in definitions should no longer have tone numbers (the entry for '三文魚' now reads 'salmon' instead of 'salmon1')
  • Synonyms now appear as links
  • Entries display the reading if available (i.e. 白 or 文)
  • I added some space between the definitions and examples, which should make entries a bit easier to read

The diacritic to tone number conversion isn't perfect, so I still had to manually clean up a handful of entries. Ideally the romanizations could be imported as-is, but I haven't had any luck getting the diacritics to display in Pleco.
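
For anyone curious, the diacritic-to-number step can be sketched roughly like this. This is not the script I actually used, just a minimal illustration that assumes standard Tai-lo tone marks (acute = 2, grave = 3, circumflex = 5, macron = 7, vertical line above = 8) and that unmarked syllables are tone 4 when they end in a stop (p, t, k, h) and tone 1 otherwise. The function names are made up for the example.

```python
import re
import unicodedata

# Combining marks used for Tai-lo tones, mapped to tone numbers.
TONE_MARKS = {
    "\u0301": "2",  # acute
    "\u0300": "3",  # grave
    "\u0302": "5",  # circumflex
    "\u0304": "7",  # macron
    "\u030d": "8",  # vertical line above
}
STOP_FINALS = ("p", "t", "k", "h")


def syllable_to_numeric(syllable: str) -> str:
    """Strip the tone mark from one syllable and append a tone number."""
    tone = None
    letters = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)
    base = "".join(letters)
    if tone is None:
        # Unmarked syllables: tone 4 if they end in a stop, otherwise tone 1.
        tone = "4" if base.lower().endswith(STOP_FINALS) else "1"
    return base + tone


def tailo_to_numeric(romanization: str) -> str:
    """Convert a whole romanization string, keeping hyphens and spaces."""
    decomposed = unicodedata.normalize("NFD", romanization)
    # After NFD, a syllable is a run of ASCII letters plus combining marks.
    return re.sub(
        r"[A-Za-z\u0300-\u036f]+",
        lambda m: syllable_to_numeric(m.group(0)),
        decomposed,
    )


print(tailo_to_numeric("âng-sik"))  # ang5-sik4
```

Note that this should only be run on the romanization fields. Applied to a whole definition, it would also tag English words (which is exactly how 'salmon1' happened).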


Edit: I also took a crack at converting the Maryknoll dictionary, so here are the results:


I noticed a few issues with the spreadsheet (which the website warns is not fully proof-read), so these are parsed from the PDFs. This dictionary has 55,525 entries, and although they're quite terse, it's much easier to search than the MoE Minnan dictionary since it's English-Taiwanese. It also doubles as a decent English-Chinese dictionary, since every entry contains a Chinese definition.

If anyone spots any issues, let me know!
 
Many thanks to @alex_hk90 and @daoge for putting together the database into a Pleco-compatible dictionary. It’s really nice to have it in Pleco.

I tried looking up some random entries that we already have in CC-CEDICT by typing their Tai-lo pronunciations and for some reason it won’t find the entry for “huanbeh” (番麥). Any ideas why?

https://www.moedict.tw/'番麥
 

daoge

Member
Unfortunately, searching for entries by Tai-lo is quite finicky. Here are a few tips/observations:
  1. Leave off final consonants. For example, I get no results for "kam a bit", but searching for "kam a bi" shows me the entry for 柑仔蜜 (kam1-a2-bit8).
  2. Try including spaces. E.g. "kam a bi" works, but "kamabi" does not. Sometimes it works either way: I get results for both "huanbe" and "huan be".
  3. There's something funky going on with tone numbers. Tones 1-3 correctly filter the search results, but anything else yields unexpected results. E.g. searching for "huan1" only shows entries pronounced "huan1", but searching for "huan5" shows entries pronounced "huan5", "huan7", "huann5", "huann7", and "huann2".
  4. Sometimes Pleco gets stuck in a state where it won't let you cycle the search type. If you're not getting any results, (1) Clear the search field (2) Tap the dictionary group button until it shows "C" (3) Press and hold "C", then select "MoE Minnan" from the menu (4) Enter your search again. I've found it helpful to enable the "Sticky Selection" option in Settings > Manage Dictionaries > MoE Minnan, which prevents Pleco from automatically switching to an English full-text search as you're typing.
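
Tips 1 and 2 can be applied mechanically. Here is a rough sketch (the function name is made up, and it simply encodes the workarounds above rather than anything documented by Pleco) that turns a numbered Tai-lo reading into a search string: tone numbers dropped, syllables separated by spaces, and final stops (p, t, k, h) left off.

```python
import re


def pleco_query(tailo_numeric: str) -> str:
    """Build a search string from a reading such as 'kam1-a2-bit8'."""
    syllables = re.split(r"[-\s]+", tailo_numeric.strip())
    cleaned = []
    for syl in syllables:
        syl = re.sub(r"\d", "", syl).lower()  # drop tone numbers
        if len(syl) > 1 and syl[-1] in "ptkh":
            syl = syl[:-1]  # leave off the final stop (tip 1)
        if syl:
            cleaned.append(syl)
    return " ".join(cleaned)  # spaces between syllables (tip 2)


print(pleco_query("kam1-a2-bit8"))  # kam a bi
```

Dropping every tone number is the conservative choice given tip 3; tones 1-3 could be kept if you want them to filter the results.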

I hope that helps!
 

Abun

榜眼
I noticed that problem before and settled on similar workarounds to @daoge. It looks like Pleco’s string processing before matching against dictionaries – the same one which allows fuzzy searches in Pinyin and the like – is working against us here. This explains final consonants (apart from n and ng) giving funky results – Pleco probably recognises these as impossible finals in Pinyin and assumes that they must be the initials of a new syllable. It also explains tone 5 in particular being odd, because that’s interpreted as neutral tone; and higher numerals are probably not expected at all (they tend to be a bit more reliable than 5 for me). I’m not entirely sure why tone 4 wouldn’t work but I suspect in this case it’s the final plosive which poses the bigger problem.

Maybe it’s possible to turn off string preprocessing for C-E/C-C dictionaries with a flag character, so the string is matched directly? If it is, I haven’t found it yet. And even if it exists, I probably wouldn’t use it all that often, because matching the raw string means I would have to type the dictionary lemma exactly as it is stored. No more fuzzy search.
 

mikelove

皇帝
Staff member
Yeah, Pleco is pretty much still treating this as pinyin and so a lot of stuff is going to behave oddly like that - a big project in Pleco 4.0 has been making everything related to Pinyin generalizable to other syllable systems (which also has the happy side effect of making it easy to make dictionaries for entirely different languages), even to the point that you could theoretically have a dictionary database using POJ and a search type with a mapping table to convert Tai-Lo searches into that (same basic idea as how we handle Zhuyin/Yale Cantonese now).
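
To illustrate the mapping-table idea (purely as a sketch: this is not Pleco's implementation, and the rule list and function names are assumptions made for the example), a Tai-lo search string with tone numbers could be rewritten into POJ spelling with a handful of substitutions. The rules below cover the well-known spelling differences between the two systems (ts/tsh vs. ch/chh, ua/ue vs. oa/oe, ing/ik vs. eng/ek, oo vs. o͘, final nn vs. ⁿ) and are not claimed to be exhaustive.

```python
import re

# Ordered (pattern, replacement) rules applied to one lowercase syllable.
RULES = [
    (re.compile(r"^tsh"), "chh"),
    (re.compile(r"^ts"), "ch"),
    (re.compile(r"ua"), "oa"),
    (re.compile(r"ue"), "oe"),
    (re.compile(r"ing(?=\d?$)"), "eng"),
    (re.compile(r"ik(?=\d?$)"), "ek"),
    (re.compile(r"oo"), "o\u0358"),          # o with dot above right
    (re.compile(r"nn(?=h?\d?$)"), "\u207f"),  # nasalization, syllable-final only
]


def syllable_to_poj(syllable: str) -> str:
    for pattern, replacement in RULES:
        syllable = pattern.sub(replacement, syllable)
    return syllable


def tailo_to_poj(query: str) -> str:
    """Rewrite each syllable (letters plus optional tone digit) in a query."""
    return re.sub(
        r"[a-z]+\d?",
        lambda m: syllable_to_poj(m.group(0)),
        query.lower(),
    )


print(tailo_to_poj("ang5-sik4"))  # ang5-sek4
```

Anchoring the nn rule to the end of the syllable keeps words like "nng7" (initial n plus syllabic ng) from being mangled.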

So this should be thoroughly fixable in 4.0, but there's not much we can do in the current release - overrides fix pronunciation but then the search engine simply ignores the field entirely. And indeed it probably would not have been possible to do this well in anything but a horrifyingly complicated multi-year bottom-up rewrite like 4.0 is since we had like a decade-and-a-half worth of accumulated pinyin-related assumptions to work through :)

Maybe consider making a Tai-lo user dictionary as an English-to-Chinese instead of a Chinese-to-English one? The "English" should at least avoid mucking up letters, and our current English collator totally ignores spaces.
 

Abun

榜眼
So this should be thoroughly fixable in 4.0, but there's not much we can do in the current release - overrides fix pronunciation but then the search engine simply ignores the field entirely. And indeed it probably would not have been possible to do this well in anything but a horrifyingly complicated multi-year bottom-up rewrite like 4.0 is since we had like a decade-and-a-half worth of accumulated pinyin-related assumptions to work through :)
Yeah I figured that it would be pretty much impossible to work around at least under the current system. Very much looking forward to 4.0! ;)

Maybe consider making a Tai-lo user dictionary as an English-to-Chinese instead of a Chinese-to-English one? The "English" should at least avoid mucking up letters, and our current English collator totally ignores spaces.
Maybe, but that would mean this dictionary wouldn’t be searchable by character, right? I guess you could just put the character lemma within the definition text and then let users find it using full-text search (is that possible for custom E-C dictionaries?)… But then that search wouldn’t be able to differentiate between the head word and a random word appearing in the definition or an example. So I think such an E-C dictionary wouldn’t be viable as a substitute for the current one, but only as an index for where you can look up the characters.
 

daoge

Member
Hi everyone, I came up with a workaround that should make searching a bit more ergonomic. I created a separate dictionary with Mandarin head words and the corresponding Taiwanese words and romanizations in the definition. This makes it possible to find a Taiwanese word by searching in Mandarin or by doing a full-text search of the pronunciation (remember to enable full-text searches in the dictionary settings):
1.png
2.png
3.png

Once you look up a word, you can click on a link to take you to the actual definition.
Here's the pqb: dict-twblg-index.pqb.zip

I also updated the original dictionary to include alternate pronunciations. For example, the entry for 頭毛 (thau5-mng5) now includes an alternate pronunciation (thau5-moo1) in the definition text. I've updated the links in my original post to point to the new version. I hope it's helpful!
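
In case anyone wants to build a similar index, here is a rough sketch of the idea. It is not the script I used: the input rows are a hypothetical list of (Mandarin word, Taiwanese word, reading) tuples, and the output assumes a tab-separated import layout of headword, pronunciation, definition with the pronunciation column left empty. Check that against your own Pleco import settings before relying on it.

```python
from collections import defaultdict


def build_index(rows):
    """Group Taiwanese words under their Mandarin head word.

    rows: iterable of (mandarin, taiwanese, reading) tuples (hypothetical input).
    Returns one tab-separated line per Mandarin head word.
    """
    grouped = defaultdict(list)
    for mandarin, taiwanese, reading in rows:
        grouped[mandarin].append(f"{taiwanese} ({reading})")

    lines = []
    for headword, items in grouped.items():
        definition = "; ".join(items)
        # headword <TAB> pronunciation (left empty) <TAB> definition
        lines.append(f"{headword}\t\t{definition}")
    return lines


# Illustrative rows only, not taken from the real data files.
sample = [
    ("頭髮", "頭毛", "thau5-mng5"),
    ("頭髮", "頭毛", "thau5-moo1"),
    ("番茄", "柑仔蜜", "kam1-a2-bit8"),
]
for line in build_index(sample):
    print(line)
```

Because the readings end up in the definition text, they are only reachable through full-text search, which is why that option needs to be enabled.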
 
That's a great idea, thanks! I installed the dictionary, but for some reason the entries don't show up when I search by pinyin or hanzi. The middle option from your screenshot (#a tok a), however, works for me. Any ideas?

EDIT: actually, I just realized that when you look up a word by inserting hanzi in the search box you'd need to use traditional characters in order for the entry to show up in MIX among results (duh.. it's a Taiwanese dictionary :) ).

How difficult would it be to modify the dictionary so that MIX's entries would be included among other dictionaries?

For example: let's say you search by pinyin, e.g. "waiguoren", then tap on the entry. This is where you get a bunch of definitions from other dictionaries. How much work would it be to include MIX as well, so that when I tap on e.g. the "waiguoren / 外国人 / 外國人" entry I'd see the MIX definition, as in the picture on the far right of your screenshot?
 

daoge

Member
Including simplified headwords should be pretty straightforward, and adding pinyin is also on my to-do list. Whenever I get around to it I'll post an update :)

There's still a lot more interesting data I'd like to incorporate—MoEDict includes English translations for some entries, so it'd be pretty cool to make MIX searchable by English too.

Edit: @goldyn chyld Here you go: dict-twblg-index_pinyin.zip

I let Pleco automatically fill in the missing Pinyin using this method:
https://www.plecoforums.com/threads...a-a-text-editor-and-computer.5141/#post-40849
There are bound to be some errors, but it should allow entries from MIX to be included with results from other dictionaries (and you can always manually correct any mistakes you might find).
 

David

举人
Thank you for this contribution. I've been using this recently and it has been helpful.

Question: (other than in each entry's pinyin field) why were the tone diacritics converted to tone numbers? Were they not supported by the font? I'm not as good at reading the numbers as I am at reading the tone marks.
 