The MoE dictionary is now open source

alex_hk90

状元
You thought I forgot? I did! Not sure if there is still interest in this or not, but I have corrected the list (from page 3 of this thread) below. As I understand it, this is supposed to be one traditional form on the left with its two simplified forms on the right. I am going by PRC pronunciation since these are for use in the PRC, and am therefore pretty much only noting things as they apply to the PRC. If I say something is not a character in the PRC, that means it is not one of the characters approved for general daily use. I did not cite sources as in previous posts on this topic in this thread, since my time is pulling me elsewhere. If something has its own entry in Xinhua Zidian but says 同 (same as) whatever, then it is not (officially) a simplified or variant character; it is a standard PRC character. Below this corrected list I have cut and pasted a new list of the only 11 out of 37 entries from this list that seem to warrant T: S, S status. If you feel I am in error, please say so.

NB: There are more T: S, S situations like this. I may not get around to it till June. Should I edit this post (no one will be notified by the system) or make a new post on this thread?

I'm still interested in this - thank you again for all the information! :)

What I'm planning to do is compile a Traditional to Simplified mapping/table/database that works something like the following:
1. Convert all one-to-one mappings from Traditional to Simplified.
- These are the cases where a Traditional character is always converted to the same Simplified character.
- For these, a single data table of Traditional to Simplified pairings is all that should be required.
2. Convert the one-to-many mappings from Traditional to Simplified.
- These are the cases like the ones you have gone through above, where a Traditional character has multiple possible Simplified characters depending on pronunciation or other reasons.
- For cases of "do not simplify for a certain pronunciation/situation, do simplify for others", this might need two data tables (the second one listing the cases not to simplify).
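
To make the split between 1. and 2. concrete, here is a minimal Python sketch of how a raw list of (Traditional, Simplified) pairs could be divided into the two groups. The function name `split_mappings` and the sample pairs are only illustrative assumptions, not data from any of the sources discussed in this thread.

```python
from collections import defaultdict


def split_mappings(pairs):
    """Split (traditional, simplified) pairs into one-to-one and one-to-many tables.

    Hypothetical helper: a character with exactly one target goes into the
    simple table, anything with several targets needs case-by-case rules.
    """
    grouped = defaultdict(list)
    for trad, simp in pairs:
        if simp not in grouped[trad]:
            grouped[trad].append(simp)
    one_to_one = {t: s[0] for t, s in grouped.items() if len(s) == 1}
    one_to_many = {t: s for t, s in grouped.items() if len(s) > 1}
    return one_to_one, one_to_many


# Illustrative sample only; a real table would have thousands of rows.
SAMPLE_PAIRS = [
    ("書", "书"),
    ("馬", "马"),
    ("乾", "干"),
    ("乾", "乾"),  # kept as-is for some pronunciations
    ("著", "着"),
    ("著", "著"),  # kept as-is for some pronunciations
]

ONE_TO_ONE, ONE_TO_MANY = split_mappings(SAMPLE_PAIRS)
```

With this sample, `ONE_TO_ONE` holds 書 and 馬, while 乾 and 著 end up in `ONE_TO_MANY` and would need pronunciation-based rules.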

Does anyone know a good source for the simple cases in 1. above? I had a quick glance through the Wikipedia page, but it talked about lists and I didn't immediately notice a simple database that could be used (even if it incorrectly includes the cases in 2., they can be filtered out relatively easily). :)
 

Yiliya

榜眼
alex_hk90 said:
Does anyone know a good source for the simple cases in 1. above?
Unihan?

Download Unihan.zip. The file you need is Unihan_Variants.txt. Simply make a list of all the entries that have kSimplifiedVariant and you're set.

Making pinyin-based exceptions for 著 (don't simplify when it's zhù) and 乾 (don't simplify when it's qián) will make the final results 99% correct. That should be enough for the vast majority of users. Besides, you can always rebuild the database once you have a better conversion method. I'm sure most people would rather have 99% correct 简体字 (simplified characters) than wait an eternity for 100% correct ones (I'm not even sure it's possible at all without proofreading the dictionary).
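
For anyone who wants to try this, below is a rough Python sketch of the approach. It assumes the usual tab-separated layout of Unihan_Variants.txt (code point, field name, space-separated target code points, `#` for comments); the sample lines are hand-written for illustration rather than copied from the real file, and the names `parse_simplified_variants`, `simplify_char` and `KEEP_TRADITIONAL` are made up for this sketch.

```python
import unicodedata


def parse_simplified_variants(lines):
    """Collect kSimplifiedVariant entries as {traditional: [simplified, ...]}."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 3 or fields[1] != "kSimplifiedVariant":
            continue
        source = chr(int(fields[0][2:], 16))  # "U+99AC" -> "馬"
        # Drop any "<source" suffix defensively before decoding each target.
        targets = [chr(int(v.split("<")[0][2:], 16)) for v in fields[2].split()]
        table[source] = targets
    return table


# The two pinyin-based exceptions mentioned above (assumed tone-marked pinyin).
KEEP_TRADITIONAL = {
    ("著", unicodedata.normalize("NFC", "zhù")),
    ("乾", unicodedata.normalize("NFC", "qián")),
}


def simplify_char(char, pinyin, table):
    """Return the simplified form unless the (char, pinyin) pair is an exception."""
    if (char, unicodedata.normalize("NFC", pinyin)) in KEEP_TRADITIONAL:
        return char
    targets = [t for t in table.get(char, []) if t != char]
    return targets[0] if targets else char


# Hand-written sample lines in the assumed file layout.
SAMPLE_LINES = [
    "# illustrative sample, not the real file",
    "U+99AC\tkSimplifiedVariant\tU+9A6C",
    "U+4E7E\tkSimplifiedVariant\tU+4E7E U+5E72",
    "U+8457\tkSimplifiedVariant\tU+7740",
    "U+99AC\tkTraditionalVariant\tU+99AC",
]
TABLE = parse_simplified_variants(SAMPLE_LINES)
```

To run it against the real data, the lines could be read with something like `open("Unihan_Variants.txt", encoding="utf-8")` and passed to `parse_simplified_variants`.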
 

alex_hk90

状元
Thanks for the link. :)

Looking at earlier posts in this thread I noticed a link posted by mikelove to the SayJack tables (http://www.sayjack.com/chinese/traditional-to-simplified-chinese-conversion-table/). So I've parsed this data into a useful format, and also the table of characters not simplified (http://www.sayjack.com/chinese/chinese-characters-both-traditional-and-simplified/).

I'll have a look at using this to convert the headwords to simplified - with the second list it will also be possible to check for characters that are not specifically dealt with and so might need manual consideration / use of other tables (like the Unihan one above).
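
The check described here could look roughly like the following Python sketch: any Han character in a headword that appears in neither the conversion table nor the "not simplified" list gets flagged for manual review. The function names and the sample data are hypothetical, and the code point ranges cover only the main CJK blocks.

```python
def is_cjk_ideograph(ch):
    """Rough test for Han characters (main block, Extension A, supplementary planes)."""
    cp = ord(ch)
    return (
        0x3400 <= cp <= 0x4DBF
        or 0x4E00 <= cp <= 0x9FFF
        or 0x20000 <= cp <= 0x2FFFF
    )


def unhandled_characters(headwords, conversion_table, unchanged_chars):
    """Return Han characters covered by neither table, sorted for stable output."""
    known = set(conversion_table) | set(unchanged_chars)
    return sorted(
        {
            ch
            for headword in headwords
            for ch in headword
            if is_cjk_ideograph(ch) and ch not in known
        }
    )


# Illustrative data only.
NEEDS_REVIEW = unhandled_characters(
    ["書馬", "餸菜"],
    {"書": "书", "馬": "马"},
    {"菜"},
)
```

Here 餸 would be reported, since it is in neither sample table.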
 

feng

榜眼
The system didn't notify me of the posts that came after me, as it has in the past, so apologies to all for not replying earlier. Not sure why that would happen . . .

audreyt said:
Please refer to http://www.audreyt.org/newdict/astral.html for their original (character) form. Modern systems should have fonts for them; if not, HanaMinB from http://fonts.jp/hanazono/ contains all currently coded Han characters.
Thanks, Audreyt.org! And thanks to Audrey as well :D. FYI, I am using OS X 10.7 with Firefox 20.0, and it did not display the characters for U+2B624 and U+2B5B8, though I did find them elsewhere. I will look at them tomorrow and report back tomorrow night. I downloaded Hanazono once and it didn't help. I must have not done something I was supposed to with it.

mikelove said:
Actually it does, but so few people have compatible fonts installed on their system that I figured it was easier to just print the codes.
How do we install these fonts? I went to some page recently that said to download thus and such if I can't read their characters, but it was of no help. I probably did something wrong.

alex_hk90 said:
Does anyone know a good source for the simple cases in 1. above?
Lists 1 and 2 are just the officially declared simplifications. Even then, one needs to read the footnotes, as they contain much of the information I have given above. List 3 is superfluous except for a footnote or two. The List of Variants is often overlooked. It is tiresome and tricky, but can yield gold if you are patient. Study it, you ought!

Of course I don't know if iOS and Android do what they should with font issues (my browser doesn't care), such as the 穴寶蓋 on top of 空 and lots of other characters: under 宀, Taiwan has 儿 and the PRC has 八字底. Look at 底 itself: is that a 丶 or a 短橫 on the bottom? It depends which character set you are talking about. There are also characters that differ but are noted nowhere; they just exist and you need to find them (e.g. 致/致, which I don't remember being noted anywhere; they look the same in my browser, but Taiwan uses 夊 and the PRC uses 攵). It's a matter of how small a difference in the character sets you want to tolerate. There is lots more of this stuff.

What you get from websites and the books I have seen is not overly thorough. I am trying to remedy that, though it is part of a larger project, not the main theme. I should note that I am talking about the official forms. Obviously, many people in Taiwan write things differently from the official standard (and some of that is happening in the PRC, too).

Yiliya said:
(I'm not even sure it's possible at all without proofreading the dictionary)
You are entirely correct :) (though not all of the 11 I listed require that).
 

feng

榜眼
The four that were listed as codes originally:
擣:捣、扌寿:The PRC considers 擣 to be a variant, using 搗 as the traditional form; hence this is not a T: S, S. BTW, Taiwan uses both 搗 and 擣, but only 搗 is on their first list of common characters (4,808). That is a larger problem with this sort of thing: the PRC limits itself to 7,008 characters, outside of historical reproduction; Taiwan has three lists totaling about 30,000 characters that are considered OK, and anyway, Taiwanese publishers print characters as they please.

愿:愿、願(with 页): This is on List 1: 愿 stands for itself and 願; the other character is a hypothetical character based on the simplification of 頁. This is not a T: S, S.

餸:餸、饣送:This character is very uncommon. It does not appear in any of the one-volume dictionaries I looked in. Hanyu Da Zidian gives the usual annoying "方言" (dialect) without saying the 言 of which 方. Pronounced ㄙㄨㄥˋ (sòng), meaning food other than the staple food (“主食以外的菜肴”). Anyway, it is just a matter of simplifying the 飠, so this too is not a T: S, S.

騃:呆、马矣:騃 is officially a variant character in the PRC, so the simplification is hypothetical. It is also not on the first list of common characters for Taiwan. Not a T: S, S.
 

HuShifang

秀才
This is truly awesome - thanks alex_hk90.

I was wondering, have any other Android users had problems loading the .pqb of the dictionary? I downloaded the .7z from the Dropbox link and unzipped it, but when I try to open MoEDict-04a.pqb from my local storage as an existing user dictionary in Pleco, I get an error message stating that it's "Not a Dictionary Backup: Sorry, but this file does not appear to be a valid user dictionary database. (you may see this error if you've already installed another copy of the same dictionary)". Of course, I hadn't already installed it. (I'm working on a Nexus 10 tablet running fully up-to-date Jelly Bean)
 

alex_hk90

状元
I think you might need to add it as a new user dictionary rather than an existing one.

An update on the Simplified headwords: I haven't had a chance to look at it much recently, but I wasn't far off completing a conversion, so I might be able to produce something this weekend.
 

alex_hk90

状元
Alright, I've had a first go at adding simplified headwords for the MoEDict Pleco conversion and the following are the results (MoEDict-04a-Simp01).
Pleco user dictionary (164935 entries):
EDIT: See this post for the latest (and final from me) version of the Pleco conversion.
Pleco flashcards (165810 entries):
EDIT: See this post for the latest (and final from me) version of the Pleco conversion.

The Traditional to Simplified conversion was done using information from the SayJack website and information feng posted in this thread (thanks again feng :)). The information was processed into the three files attached:
Conversion-Always = simple one-to-one mapping cases (always replace the Traditional with the Simplified for these cases);
Conversion-If = convert if the Pinyin matches (replace Traditional with Simplified for {Traditional, Pinyin});
Conversion-IfNot = convert if the Pinyin does not match (replace Traditional with Simplified for {Traditional, !Pinyin}).
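
For illustration, here is a small Python sketch of how three tables like these could be applied to a headword character by character. This is not the Bash/SQL process actually used for the conversion; the in-memory table layouts, the function name `simplify_headword` and the sample rows (including tone-marked pinyin as the matching key) are all assumptions made for the example.

```python
def simplify_headword(headword, syllables, always, convert_if, convert_if_not):
    """Convert a Traditional headword using three rule tables.

    always:         {traditional: simplified}
    convert_if:     {(traditional, pinyin): simplified}
    convert_if_not: {traditional: (set_of_excluded_pinyin, simplified)}
    Assumes one pinyin syllable per character.
    """
    result = []
    for char, pinyin in zip(headword, syllables):
        if (char, pinyin) in convert_if:
            # Conversion-If: convert only when the pinyin matches.
            result.append(convert_if[(char, pinyin)])
        elif char in convert_if_not:
            # Conversion-IfNot: convert unless the pinyin matches.
            excluded, simplified = convert_if_not[char]
            result.append(char if pinyin in excluded else simplified)
        else:
            # Conversion-Always, falling back to the unchanged character.
            result.append(always.get(char, char))
    return "".join(result)


# Hypothetical sample rows, not the contents of the attached files.
ALWAYS = {"淨": "净", "書": "书"}
CONVERT_IF = {("菸", "yān"): "烟"}
CONVERT_IF_NOT = {"乾": ({"qián"}, "干"), "著": ({"zhù"}, "着")}

CLEAN = simplify_headword("乾淨", ["gān", "jìng"], ALWAYS, CONVERT_IF, CONVERT_IF_NOT)
COSMOS = simplify_headword("乾坤", ["qián", "kūn"], ALWAYS, CONVERT_IF, CONVERT_IF_NOT)
```

Checking the If table first means a character listed there is left alone whenever its pinyin does not match, which is the intended behaviour for that table.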

These tables were imported into a database and used to generate simplified versions of the headwords for MoEDict. If anyone would like details of exactly how this was done then let me know, but be warned that it's a rather messy and inefficient combination of Bash scripting and SQL.

Keep in mind this is a first attempt, so expect there to be errors/omissions/etc. There were also a few cases at the end of the Conversion-Always table that I didn't really know how to deal with nicely, so I just put both possibilities (as I understood them):
---
Traditional,Simplified,Source,Notes
夥,伙/-,feng,夥 is only used in the PRC when it means 多.
鹼,碱/硷,feng,Both characters exist in the PRC.
餘,余/馀,feng,馀 is only supposed to be used when necessary to prevent confusion.
摺,折/摺,feng,摺 is only used in the PRC when necessary to avoid confusion.
---
If someone understands these better (for the particular usages in MoEDict) then this could probably be improved. And of course if anyone has any other additions/corrections/improvements then I'd be happy to implement them.

Hope this is useful for someone. :)
 

Attachments

  • Conversion-Always.txt (44.5 KB)
  • Conversion-If.txt (92 bytes)
  • Conversion-IfNot.txt (129 bytes)

Yiliya

榜眼
Seems to be working nicely, for the most part.

As to your question: I know for sure that 硷 is a non-standard (perhaps older) simplification that is no longer used. Baidu automatically converts it to 碱. BTW, there seems to be something wrong with this conversion: I can't seem to find the entry for 鹼.
 

alex_hk90

状元
Thanks for checking. :)

If 硷 is no longer used, then I can change the simplification for 鹼 to just "鹼,碱".

The entry for 鹼 comes up for me as "鹼[碱/硷]" or "碱/硷[鹼]" (depending on the setting of the Traditional characters / Character Set option), not sure why it isn't for you. :confused:
 

alex_hk90

状元
Yiliya said:
No, it really doesn't come up. Neither 鹼, nor 碱. Strange.
I just checked again and there is something strange about it - it shows if you specifically set MOE in the dictionary selector button in the top-right corner of the search, but does not if you are just cycling through the dictionaries by tapping that button. Hmm, maybe it's to do with having multiple characters for the simplified? When I get some time I'll change it to a single simplified character (碱) and see if the problem persists.
 

mikelove

皇帝
Staff member
The problem here is that slashes aren't supported in user dictionaries - actually nowadays they're not even supported in our own dictionaries anymore, we tag entries as "character variants" of each other and the software automatically figures out how to insert slashes. So it may be displaying correctly when you just look at MOE, but it isn't searching them, and it's stripping them out of merged results because it assumes you're using a very old (pre-2.4) Pleco-supplied dictionary database that still used them and doesn't want to end up with a garbled display.

Anyway, we don't support variants in user dictionaries yet, and probably won't until next year - we're still tweaking the design even for our own databases and don't want to commit to it for user-created data until we're happy with it - so for right now I'd suggest creating two entries and linking one to the other. (the PUA codes for a link are U+EAB8 to start / U+EABB to stop - can be in either fanti or jianti)
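
As a rough illustration of the "two entries, one linking to the other" idea, here is a Python sketch. Only the two PUA code points come from the post above; the entry layout (simplified, traditional, pinyin, definition), the "See ..." wording and the function names are assumptions for this example and not a description of Pleco's actual import format.

```python
LINK_START = "\uEAB8"  # PUA code point that starts a link (from the post above)
LINK_STOP = "\uEABB"   # PUA code point that ends a link (from the post above)


def make_link(headword):
    """Wrap a headword in the link start/stop code points."""
    return f"{LINK_START}{headword}{LINK_STOP}"


def split_variant_entry(traditional, simplified_variants, pinyin, definition):
    """Build one full entry plus short cross-reference entries for the other variants.

    Hypothetical layout: each entry is (simplified, traditional, pinyin, definition).
    """
    primary, *others = simplified_variants
    entries = [(primary, traditional, pinyin, definition)]
    for variant in others:
        entries.append((variant, traditional, pinyin, "See " + make_link(primary)))
    return entries


# Placeholder definition text; the real text would come from MoEDict.
ENTRIES = split_variant_entry("鹼", ["碱", "硷"], "jiǎn", "(definition text)")
```

This avoids putting a slash in the headword field at all, at the cost of one extra short entry per variant.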
 

HuShifang

秀才
Alex - thanks; I think there's actually a problem with my installation of Pleco - (importing from the .txt into a newly created user dictionary wound up not working, but I've also discovered that I can't load *any* .pqb files.) I'm going to email Mike about it... Regardless, thanks again for all of this!
 

alex_hk90

状元
Thanks for the explanation. :) I might just duplicate the entries for the time being (have one for S1[T] and another for S2[T]) and see how that works; otherwise I'm not sure how to deal with the cases where it's a multi-character headword.
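
One possible way to handle the multi-character case when duplicating entries is to enumerate every combination of the per-character alternatives. The Python sketch below does this with `itertools.product`; the function name `expand_simplified` and the sample data are only illustrative.

```python
from itertools import product


def expand_simplified(traditional, options):
    """List every simplified spelling of a headword.

    options maps a Traditional character to its list of Simplified
    alternatives; characters without an entry are kept unchanged.
    """
    choices = [options.get(ch, [ch]) for ch in traditional]
    return ["".join(combo) for combo in product(*choices)]


# Illustrative alternatives only.
FORMS = expand_simplified("鹼性", {"鹼": ["碱", "硷"]})
```

Each resulting form would then get its own duplicate entry. Note that the number of forms multiplies with every ambiguous character in the headword, so this only stays manageable while such characters are rare.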

Also, I realised I forgot to apply the latest Unicode conversion for the self-looping variants - I'll do that as well. :)
 

Yiliya

榜眼
I think a better idea would be to just get rid of the variants completely.

I've done a bit of research using my collection of both printed and electronic dictionaries.

About 硷:
现代汉语规范词典第二版: 同“碱”。现在一般写作“碱”。 (Same as "碱"; nowadays generally written "碱".)
It's no longer used, as you can see. There are many variants of 鹼/碱, BTW, including 礆, 鹻, 堿 etc. 硷 is not special in any way.

As for 夥, 馀 and 摺, it's a bit more complicated. The dictionaries (I have the latest printed editions of 现代汉语词典 and 现代汉语规范词典) do say that they're sometimes used to avoid confusion, but at the same time those very same dictionaries don't provide any examples other than single characters. That is, there are no actual words that use them. I guess one would use them in a Classical Chinese text or something. So just go ahead and convert all the words with 夥, 餘 and 摺 to 伙, 余 and 折, and you can also add an entry for 馀 saying "「餘」的簡體字。", just to be extra safe.
 

Yiliya

榜眼
Is this project still alive?

Anyway, I just found another exception: 菸 should NOT be simplified to 烟 when it's NOT pronounced yān. The character has a separate meaning when pronounced yu (the tone depends on the area and era, seems to be yū in the PRC, yú in Taiwan, and even yù in older dictionaries).
 

alex_hk90

状元
Hi, yes the project is still alive. Thank you for the additional information. I haven't had much free time lately but when I get a chance I will update the conversion with the information you have provided. :)
 
dude, alex, thanks...a lot :)

I'm surprised it took me *this* long to find this thread...but well worth it

any other [good] community pqd's hanging around?!
 

alex_hk90

状元
You're welcome. :) Though your thanks should really be directed to others such as: the Taiwan Ministry of Education for the data, the team at 3du.tw for converting it into such a usable format, Yiliya for posting it on here, feng for the information on the simplified conversion and of course mikelove and Pleco for allowing user dictionaries. :)

As for other community user dictionaries, some of the ones I have on my phone are: F3K (Frequent 3000 characters); AVC (Audio-Visual Chinese textbook word list); YEDict (Cantonese dictionary, though Pleco now has the pronunciations, at least in PLC).
 