The MoE dictionary is now open source

Yiliya

榜眼
It took me a while to even see the difference there, but yes you're right it doesn't. The rule I have currently in place for that character is to convert 麼 to 麽 when it is not pronounced mó. I think I misinterpreted some of the information earlier in the thread about this one. What would you consider the rule for simplifying (or not) this character 麼? Is it always simplified to 麽 or are there cases when it is not?
麼 (麻+幺) is always simplified to 么, except when it's pronounced mó. The simplified variant in this case is 麽 (麻+么). MoE has about a dozen words with 麼/mó.

I haven't been following the thread that closely so if you could summarise all the oversights since the last version (MoEDict-04c-Simp02a) I can look at doing a quick update of the Simplified conversion. :)
麼/mó is the only oversight I noticed since the last version.
 

alex_hk90

状元
麼 (麻+幺) is always simplified to 么, except when it's pronounced mó. The simplified variant in this case is 麽 (麻+么). MoE has about a dozen words with 麼/mó.
OK, so to make sure I've got this right, I should replace the current rule:
- Replace 麼 with 麽 when not pronounced mó;
with the following rules:
- Replace 麼 with 么 when not pronounced mó;
- Replace 麼 with 麽 when pronounced mó.
Does that look right to you?
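
A minimal sketch of the two rules above, assuming each headword comes with one pinyin syllable per character and that tone marks use precomposed letters. The function name and arguments are hypothetical and are not taken from the actual conversion scripts.

```python
from typing import Sequence

ME_TRAD = "\u9ebc"  # 麼 (麻+幺), the traditional character
ME_SIMP = "\u4e48"  # 么, the simplified form for every reading except mó
MO_SIMP = "\u9ebd"  # 麽 (麻+么), the simplified form for the reading mó


def simplify_me(text: str, pinyin: Sequence[str]) -> str:
    """Simplify every 麼 in `text`, given one pinyin syllable per character."""
    if len(text) != len(pinyin):
        raise ValueError("need exactly one pinyin syllable per character")
    out = []
    for char, syllable in zip(text, pinyin):
        if char == ME_TRAD:
            # Pronounced mó -> 麽; any other reading -> 么.
            out.append(MO_SIMP if syllable == "mó" else ME_SIMP)
        else:
            out.append(char)
    return "".join(out)


if __name__ == "__main__":
    print(simplify_me("什" + ME_TRAD, ["shén", "me"]))
    print(simplify_me("幺" + ME_TRAD, ["yāo", "mó"]))
```

Keeping the reading check next to the character check means no separate pass is needed for the dozen or so mó entries.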

麼/mó is the only oversight I noticed since the last version.
Cool. :)
 

alex_hk90

状元
Small update (MoEDict-04c-Simp02b) to the simplified conversion to fix 麼 as described above by Yiliya.
Pleco flashcards (165810 entries):
EDIT: For latest version see this post.
Pleco user dictionary (165810 entries):
EDIT: For latest version see this post.

If this fixes the issue I'll upload the amended supporting files here as well (even though the changes are really minor). :)
 

Yiliya

榜眼
alex_hk90,
After using the conversion for a couple of months I've noticed many more problems.

It seems the database you're using for the trad-to-simp one-to-one conversion is nowhere near as extensive as Unihan. Many somewhat rare (and some not so rare) characters didn't get converted. Take 滷, for example: both 滷 and 鹵 should simplify to 卤, but your script only converts 鹵.

And there are many more missing. E.g.
鮦=鲖
鴷=䴕
鵁=䴔
鵷=鹓
鷫=鹔

Etc, etc. Unihan covers all of these.

I'm not sure if you have the time and interest to basically do it all over again, now using Unihan for one-to-one conversions. But it's something to consider if you ever get to it.

Basically, you would need to download Unihan.zip and extract Unihan_Variants.txt from it. Then simply make a list of all the lines that have kSimplifiedVariant.
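
A rough sketch of that filtering step, assuming Unihan.zip has already been downloaded to the working directory. The function names are hypothetical. Fields in Unihan_Variants.txt are separated by whitespace (tabs in the real file) and comment lines start with `#`.

```python
import io
import zipfile


def simplified_variant_lines(lines):
    """Yield only the data lines whose property is kSimplifiedVariant."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[1] == "kSimplifiedVariant":
            yield line


def read_from_zip(zip_path="Unihan.zip", member="Unihan_Variants.txt"):
    """Read the variants file straight out of the archive and filter it."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8")
            return list(simplified_variant_lines(text))
```

The result is still in code-point form (`U+XXXX`), so it needs one more pass before it can be used as a character table.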
 

alex_hk90

状元
Thanks for testing, Yiliya.

I didn't want to mix multiple sources for the conversion, as that would require checking the exceptions (Conversion-If, Conversion-IfNot) for overlaps/duplicates with the one-to-one mapping (Conversion-Always). However, considering how few entries there are in the exception tables, it actually won't be much of an issue to replace the Conversion-Always table with the data from Unihan and then check for the exceptions.

Having a quick look at it, it should be easy to filter Unihan_Variants.txt for just the kSimplifiedVariant lines, but then I'll need a mapping table to the actual Unicode characters.
At the moment a line looks like this:
U+346F kSimplifiedVariant U+3454
The codes need replacing so it's like this:
㑯 kSimplifiedVariant 㑔
If you have an easy way of doing this I should be able to re-do the conversion using this data relatively easily (just takes an hour or so to run, then another hour or so to import to Pleco).
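
One possible way to do that replacement, as a sketch: a regular expression picks out each `U+XXXX` code and Python's built-in `chr` turns it into the character. The function name is hypothetical.

```python
import re

# Matches Unicode code points written as U+ followed by 4 to 6 hex digits.
CODE_POINT = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def codes_to_chars(line: str) -> str:
    """Replace every U+XXXX code in `line` with the character it names."""
    return CODE_POINT.sub(lambda m: chr(int(m.group(1), 16)), line)


if __name__ == "__main__":
    print(codes_to_chars("U+346F kSimplifiedVariant U+3454"))
```

Everything that is not a code point, including the property name and the separators, passes through unchanged.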
 

Yiliya

榜眼
Wenlin can convert codes to actual Unicode.

Anyway, I've already done it for you. Here's the full table.

Four characters in the table have two simplified variants. Namely:
徵 kSimplifiedVariant 征;徵
鍾 kSimplifiedVariant 钟;锺
願 kSimplifiedVariant 愿;(rare CJK Extension C character, which the forum software doesn't support)
餘 kSimplifiedVariant 余;馀

We can already process 徵 depending on the pinyin, so this line should be deleted (like all the other If/IfNot exceptions). As for 锺: 曾作"鍾"的简化字, 后停用 (formerly used as the simplified form of 鍾, later discontinued). The second simplified variant for 願 is too rare, so it can be deleted as well. And we already discussed 余/馀 in the past. So what I propose is:
鍾=钟
願=愿
餘=余
Exactly the same as our previous conversion. Again, I'd like to stress that the pinyin-dependent simplifications, like 著, should be removed from this table by hand. I didn't want to mess with the original table, so I'm leaving it to you.
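
The clean-up described above could be sketched like this, assuming table lines in the `trad kSimplifiedVariant a;b` form shown in this post. All names are hypothetical, and the pinyin-dependent set lists only the two characters mentioned here; the real If/IfNot tables contain more.

```python
# Preferred choice for characters that have two simplified variants.
PREFERRED = {"鍾": "钟", "願": "愿", "餘": "余"}

# Handled by the pinyin-dependent If/IfNot tables, so excluded here.
# Only the examples named in the post; not the complete exception list.
PINYIN_DEPENDENT = {"徵", "著"}


def build_always_table(lines):
    """Build a one-to-one trad -> simp dict from the converted table lines."""
    table = {}
    for line in lines:
        fields = line.split(None, 2)
        if len(fields) != 3 or fields[1] != "kSimplifiedVariant":
            continue  # not a simplified-variant line
        trad, _, variants = fields
        if trad in PINYIN_DEPENDENT:
            continue  # left to the exception tables
        choices = variants.strip().split(";")
        # Use the preferred form if one is listed, else the first variant.
        table[trad] = PREFERRED.get(trad, choices[0])
    return table
```

Keeping the preferences in a small dict leaves the original table file untouched.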

Also, I'm not sure whether Pleco can handle CJK Extension B and beyond. Maybe the lines containing such characters should be filtered out.
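
A sketch of such a filter: CJK Extension B and every later extension lie above U+FFFF, outside the Basic Multilingual Plane, so checking code points is enough. Whether Pleco actually needs this is not established here, and the function names are hypothetical.

```python
def is_bmp_only(text: str) -> bool:
    """True if every character is in the Basic Multilingual Plane."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def drop_supplementary(pairs: dict) -> dict:
    """Keep only trad -> simp pairs where both sides are BMP characters."""
    return {
        trad: simp
        for trad, simp in pairs.items()
        if is_bmp_only(trad) and is_bmp_only(simp)
    }


if __name__ == "__main__":
    sample = {
        "\u9f8d": "\u9f99",          # 龍 -> 龙, both in the BMP
        "\U00020000": "\U00020001",  # made-up pair from Extension B, not a real mapping
    }
    print(drop_supplementary(sample))
```

CJK Extension A (U+3400 to U+4DBF) is inside the BMP, so pairs such as the U+346F example are kept.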

One more thing, it seems like they removed 滷=卤 from this latest version of Unihan, and I suggest adding it back. 滷 is definitely not used on the mainland, and the dictionaries seem to agree that it does simplify to 卤.
 

Attachments

  • simp.txt
    143.5 KB · Views: 819

Yiliya

榜眼
Been looking through the table. Another problem is this:
託 kSimplifiedVariant 讬

讬 is non-standard. 託 officially simplifies to 托, which is what our current conversion already does. Maybe there would be some merit in comparing the two databases to see if there are any other discrepancies? Unihan is very extensive, but it sometimes lists non-standard simplified variants.

If you can't be bothered, though, another solution would be to simply use your current database for the bulk of the conversion and use Unihan only for the missing characters. Your current database is obviously pretty limited; I just noticed that it doesn't even have 纍=累.
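
That fallback approach might look like the sketch below: the existing table always wins, and Unihan only contributes characters it does not already cover. The names are hypothetical and the sample entries are just the ones mentioned in these posts.

```python
def fill_missing(current: dict, unihan: dict):
    """Merge two trad -> simp tables, letting `current` win on conflicts.

    Returns the merged table and the sorted list of characters that
    came from `unihan` only.
    """
    merged = dict(unihan)
    merged.update(current)  # existing entries override Unihan's
    added = sorted(set(unihan) - set(current))
    return merged, added


if __name__ == "__main__":
    current = {"託": "托", "鹵": "卤"}
    unihan = {"託": "讬", "纍": "累"}
    merged, added = fill_missing(current, unihan)
    print(merged["託"], added)
```

Here 託 keeps the standard 托 from the existing table, while 纍 is picked up from Unihan.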
 

alex_hk90

状元
Thanks for this. If I get time over the weekend I'll have a look at using the Unihan data you have converted. I'll probably do as you suggested and use it only for cases not covered in the current tables. :)
 

alex_hk90

状元
EDIT: An official Pleco version of MoEDict is now available.
EDIT: For latest version (MoEDict-05 and MoEDict-05-Simp03) see this post.

OK, based on Yiliya's testing and suggestions in the above posts, I've used the Unihan data to augment the "Conversion-Always" table with an additional 662 one-to-one Traditional to Simplified character conversions.

I've attached some rough notes I kept on the process "20130831 Conversion-Simp03-Unihan.txt" and also the latest general notes for the conversion "20130714 Conversion-Simp02b.txt" and associated tables "Conversion-Simp02b.zip". The Unihan notes assume you have already run through the general notes first. If this version is better then I'll probably combine the Unihan notes into the general notes to make it simpler as one process.

Anyway, here are the links to the resulting new version (MoEDict-04c-Simp03):
Pleco flashcards (165810 entries):
http://www.mediafire.com/download/t1wslmd8h8gc5za/MoEDict-04c-Simp03-cards.txt
https://dl.dropboxusercontent.com/u/8391622/Chinese/Pleco/MoEDict-04c-Simp03-cards.txt.7z
Pleco user dictionary (165810 entries):
http://www.mediafire.com/download/cy2calcwp3u5yla/MoEDict-04c-Simp03.pqb.7z
https://dl.dropboxusercontent.com/u/8391622/Chinese/Pleco/MoEDict-04c-Simp03.pqb.7z

Hope this fixes the issues mentioned in the posts above. :)
 

Attachments

  • 20130831 Conversion-Simp03-Unihan.txt
    3 KB · Views: 1,001
  • 20130714 Conversion-Simp02b.txt
    11.9 KB · Views: 1,005
  • Conversion-Simp02b.zip
    33.1 KB · Views: 778
I just wanted to say thank you - this dictionary has proved invaluable as a reference over the past few months, and I'm looking forward to many years of it being of great help. I am so grateful you put in the work to create it, and then posted it here.
 
Wow, this dictionary is amazing, thank you so much!

However, I have a question - when I just browse through all of the entries, at the beginning there are many entries where the headword doesn't consist of Chinese characters, but only of some weird numbers inside brackets.

Ex:
{8e40}
{8e41}
{8e43}
{8e4f}
{91ea}
{91ed}

When I first imported the dictionary, I thought the whole thing was messed up, but all the words that I've looked up so far haven't had this problem - they've all had correct headwords and definitions. Upon closer inspection I saw that all of the entries whose headwords are just numbers in brackets were variant characters, where the entry was always:


「x」的異體字。
*Where x = any one Chinese character

So, I'm assuming these are just rare variant characters that Pleco doesn't support, is that right? My only other thought would be that I used an online program to extract the file, and then transferred it to my phone... This wouldn't be the reason though, would it? Because there are only maybe 100 or so of these entries before I get to entries that look perfectly normal...
 

alex_hk90

状元
Glad to hear you're finding the dictionary useful. :)

The headwords that are just hex numbers in brackets are characters for which no equivalent Unicode character has been identified. If you look back through the thread at the parts which mention "db2unicode.pl" (a script that converts these codes to Unicode: https://github.com/g0v/moedict-epub/blob/master/db2unicode.pl), there is some discussion on this. It uses the following mapping table, but I don't know how the contributors constructed it (maybe just by inspection?): https://github.com/g0v/moedict-epub/blob/master/sym.txt

In short, yes, you are right that these are rare variant characters. It's not that Pleco doesn't support them, though; their codes simply haven't been converted to actual displayable characters.
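
To illustrate the general idea only (this is not the logic of db2unicode.pl, and it does not read the real sym.txt format), here is a sketch that swaps `{xxxx}` placeholders for characters where a mapping exists and leaves the rest as they are. The function name and the sample mapping are hypothetical.

```python
import re

# A placeholder is four lowercase hex digits inside curly brackets, e.g. {8e40}.
PLACEHOLDER = re.compile(r"\{([0-9a-f]{4})\}")


def replace_placeholders(text: str, mapping: dict) -> str:
    """Replace {xxxx} codes found in `mapping`; keep unknown codes as-is."""
    def lookup(match):
        return mapping.get(match.group(1), match.group(0))
    return PLACEHOLDER.sub(lookup, text)


if __name__ == "__main__":
    # Stand-in mapping for demonstration; NOT the real character for 8e40.
    demo_mapping = {"8e40": "甲"}
    print(replace_placeholders("{8e40} {91ea}", demo_mapping))
```

Codes with no known Unicode equivalent stay in bracket form, which matches what shows up at the start of the headword list.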
 

abhoriel

Member
Thanks again for setting up this dictionary. I use it a lot despite mainly dealing with simplified characters, so the conversion was essential!
 
Sorry to be both dense and late to the game: I have downloaded the .7z file from Mediafire referenced above for the MoE with no problem, and even transferred it successfully to my Samsung Android phone. Now how do I install it so that Pleco recognizes it? I tried using "Add zip file" but it does not recognize the file as such. Thanks for your suggestions, and your hard work on this!

Nathan
 
I have successfully converted it to a .zip file, but still cannot open it using "add zip file" in Pleco, so I think I must be barking up the wrong tree. Thanks for any suggestions!

Nathan
 

alex_hk90

状元
Hi Nathan,

I didn't know you could import a Zip archive. The way I import a user dictionary is to extract the archive first and then import the .pqb file via Settings - Dictionary - Manage Dictionaries - Add User - Load Existing.

Hope this helps. :)
 