The MoE dictionary is now open source

alex_hk90

状元
Minor update (MoEDict-04b) with the new Unicode mapping table from audreyt a couple of pages back (fixing some self-looping variants):
Pleco flashcards (165810 entries):
EDIT: See this post for the latest (and final from me) version of the Pleco conversion.
Pleco user dictionary: [EDIT: still struggling to import more than 160,000 or so of the entries, going to try this in batches to identify which ones are going wrong - not sure why this has changed since the previous version though]
EDIT: See this post for the latest (and final from me) version of the Pleco conversion.

Minor update for the Simplified conversion (MoEDict-04b-Simp02) with the additional information from Yiliya and feng and the removal of any slashes "/" in the titles (which should fix the search issue mentioned):
Pleco flashcards (165810 entries):
EDIT: See this post for the latest (and final from me) version of the Pleco conversion.
Pleco user dictionary: [EDIT: still struggling to import more than 65,000 or so of the entries, going to try this in batches to identify which ones are going wrong - again not sure why this has changed since the previous version though]
EDIT: See this post for the latest (and final from me) version of the Pleco conversion.

For the time being I have just kept the more commonly used simplification for the remaining "T : S, S" cases to resolve the search issue, but maybe a couple of these variants should have entries added (if they don't already exist, I haven't actually checked yet).

I've attached the resulting conversion lists used, and also the instructions I kept for reproducing the traditional to simplified conversion (as previously mentioned, it's not the most efficient to say the least but it does seem to work).

Hope this is helpful for someone. :)
 

Attachments

  • Conversion-Always.txt (44.6 KB)
  • Conversion-If.txt (109 bytes)
  • Conversion-IfNot.txt (128 bytes)
  • 20130619 Conversion.txt (11.6 KB)
You're welcome. :) Though your thanks should really be directed to others such as: the Taiwan Ministry of Education for the data, the team at 3du.tw for converting it into such a usable format, Yiliya for posting it on here, feng for the information on the simplified conversion and of course mikelove and Pleco for allowing user dictionaries. :)

...and modest.
 

KaiRong

Member
Hello everybody,

I would like to use the MoE dictionary as a user dictionary. Do I need to install the dictionary's flashcards as well, or does the user dictionary work without them? :) Sorry for the lame question; I bought the Pleco bundle a while ago, but I've never used user dictionaries before... :(
 

alex_hk90

状元
You can either import the flashcards into a user dictionary yourself, or just use the user dictionary file (which contains the same data, except that someone else has already imported the flashcards into the dictionary). :)
 

alex_hk90

状元
EDIT: For latest version see this post.

Finally fixed the import issue, which was caused by the use of the square brackets in the unconverted non-Unicode characters conflicting with the Pleco flashcard format (which uses square brackets for the traditional Hanzi when importing cards with both simplified and traditional).
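
For anyone reproducing this, the bracket clash can be sidestepped before export. Below is a minimal sketch, not the actual 04c SQL: it assumes, hypothetically, that an unconverted character appears in a headword as an ASCII-bracketed code such as "[9a52]", and it rewrites those brackets to curly braces so that Pleco does not read them as a traditional-headword marker. The placeholder shape and both function names are illustrative assumptions.

```python
import re

# Assumed (hypothetical) placeholder shape: an ASCII-bracketed code, e.g. "[9a52]".
PLACEHOLDER = re.compile(r"\[([^\[\]]+)\]")


def escape_placeholders(headword: str) -> str:
    """Rewrite [code] placeholders as {code} so no square brackets remain."""
    return PLACEHOLDER.sub(r"{\1}", headword)


def has_square_brackets(headword: str) -> bool:
    """Report whether a headword could still clash with the flashcard format."""
    return "[" in headword or "]" in headword


escaped = escape_placeholders("[9a52]部")
```

Running has_square_brackets over every headword before export would flag any entry that still needs attention.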

This is probably going to be the last update from me as I don't feel there's really anything left to do, so I've tried to collect all the latest files here if anyone else wants to reproduce the steps and make their own changes/additions.

MoEDict Pleco Conversion 04c SQL with comments:
(attached as a text file due to forum post size limitation, original filename "20130622-MoEDict-Pleco-04c.sql")

Resulting output (MoEDict-04c) from the above (traditional characters only, as the original data):
Pleco flashcards (165810 entries):
https://dl.dropboxusercontent.com/u/8391622/Chinese/Pleco/MoEDict-04c-cards.txt.7z
http://www.mediafire.com/download/6sp74rid3j24xjj/MoEDict-04c-cards.txt
Pleco user dictionary (165810 entries):
http://www.mediafire.com/download/teu4lmw55u39igi/MoEDict-04c.pqb.7z
https://dl.dropboxusercontent.com/u/8391622/Chinese/Pleco/MoEDict-04c.pqb.7z
(Note: 7-zip, p7zip, ZArchiver (Android) or similar can be used to extract/decompress the 7z archives.)

Not as well structured as the above, but I've also attached the notes (which consist mainly of a combination of Bash scripts and SQL) for adding simplified headwords ("20130622 Conversion-Simp02a.txt"). Unlike the SQL for the above, they are not automated: you need to read through and copy sections to Bash or sqlite3 as appropriate, in the right order. The CSV files with the conversion tables (using data from SayJack and from feng and Yiliya on this thread) referenced in the notes are attached in the archive file ("Conversion-Simp02a.zip").

Resulting output (MoEDict-04c-Simp02a) from going through the above mentioned conversion notes (using the same data, in fact via importing the resulting flashcards, but with simplified headwords added):
Pleco flashcards (165810 entries):
http://www.mediafire.com/download/x29tsik3ydk8ck7/MoEDict-04c-Simp02a-cards.txt
https://dl.dropboxusercontent.com/u/8391622/Chinese/Pleco/MoEDict-04c-Simp02a-cards.txt.7z
Pleco user dictionary (165810 entries):
http://www.mediafire.com/download/bazuuwdu54djz8u/MoEDict-04c-Simp02a.pqb.7z
https://dl.dropboxusercontent.com/u/8391622/Chinese/Pleco/MoEDict-04c-Simp02a.pqb.7z

I would recommend only installing one of the two versions (either with or without the simplified headwords) or otherwise changing the icon abbreviation of one of the versions (they are both set to MOE currently).

Hopefully that should be everything. :)
 

Attachments

  • 20130622-MoEDict-Pleco-04c.sql.txt (12.2 KB)
  • 20130622 Conversion-Simp02a.txt (11.8 KB)
  • Conversion-Simp02a.zip (33.1 KB)

feng

榜眼
Yiliya said:
As to your question. I know for sure that 硷 is a non-standard (perhaps older) simplification. No longer used. Baidu automatically converts it to 碱.
If you check the 第一表 of the 《簡化字總表》you will see that 鹼 is simplified as 硷. That is the official line of the government of the People's Republic of China. It is true that dictionaries tend to just define it as 同 "碱", though. This is doubly interesting, since the PRC's 異體字表 has both 鹼 and 碱 as official (the PRC yitizi list lists PRC traditional characters, so they have 鹼). In any case, both 硷 and 碱 are part of the PRC's 7008 general use characters. I am surprised to see Xinhua Zidian go its own way on this. I had not previously noted them doing so. Baidu (or any other web simplifier) is not a source I would turn to for weighty matters such as this.

Yiliya said:
Anyway, I just found another exception: 菸 should NOT be simplified to 烟 when it's NOT pronounced yān. The character has a separate meaning when pronounced yu (the tone depends on the area and era, seems to be yū in the PRC, yú in Taiwan, and even yù in older dictionaries).
One can find quite a number of archaic pronunciations and definitions for many common characters, really. I guess it would be worth noting here, that while the PRC's lists for simplification and yitizi are ordered alphabetically, they are not meant to be simplifications only for those pronunciations (or definitions). They simply list things under their most common pronunciation for convenient look up. The handful of exceptions to what I just said are noted on the appropriate lists (and earlier on this thread).



EDIT: Here's a quandary for you electronic cross-strait lexicographers: 鹼:硷,碱 is another TSS, but only from the perspective of Taiwan and history, not from the perspective of the PRC.
http://dict.variants.moe.edu.tw/yitia/fra/fra04750.htm
I will go double check and report back tonight if I find disagreement with the above.
 

Yiliya

榜眼
feng said:
Baidu (or any other web simplifier) is not a source I would turn to for weighty matters such as this.
That's why I looked it up in 现代汉语词典 and 现代汉语规范词典. Please take your time to read all of the posts before replying. I wrote:
About 硷:
现代汉语规范词典第二版: 同“碱”。现在一般写作“碱”。
No longer used, as you can see. There are many variants of 鹼/碱, BTW, including 礆, 鹻, 堿 etc. 硷 is not special in any way.

feng said:
One can find quite a number of archaic pronunciations and definitions for many common characters, really. I guess it would be worth noting here, that while the PRC's lists for simplification and yitizi are ordered alphabetically, they are explicitly not meant to be simplifications only for those pronunciations (or definitions). They simply list things under their most common pronunciation for convenient look up. The handful of exceptions to what I just said are noted on the appropriate lists (and earlier on this thread).
金山词霸's 高级汉语词典 has the following:


〈动〉
枯萎 [wither; be withered]
盛夏日方中而灌之, 瓜不旋踵而菸败。 --宋·司马光《论张尧佐除宣徽使状》
又如: 菸邑(菸桯, 菸萎。 枯萎); 菸黄(萎黄; 枯萎); 菸败(枯萎衰败)
另见 yān 烟

As you can see, the character is NOT simplified in this particular sense.
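
A pronunciation-dependent exception like this one can be expressed as a small rule table that is consulted before the unconditional mapping. The following is only a sketch under assumptions: the table layout, the function name and the sample entries are hypothetical and are not taken from alex_hk90's conversion lists.

```python
# Hypothetical rule table: traditional character -> (simplified form,
# pinyin readings for which the simplification applies).
CONDITIONAL_RULES = {
    "菸": ("烟", {"yān"}),
}

# Hypothetical unconditional mapping, applied whatever the reading is.
ALWAYS_RULES = {
    "貝": "贝",
}


def simplify_char(char: str, pinyin: str) -> str:
    """Simplify one headword character, honouring reading-specific rules."""
    if char in CONDITIONAL_RULES:
        simplified, readings = CONDITIONAL_RULES[char]
        # Convert only when the entry's reading is one that triggers the rule.
        return simplified if pinyin in readings else char
    return ALWAYS_RULES.get(char, char)
```

With these sample rules, 菸 read as yān becomes 烟, while 菸 under any other reading is left untouched. The comparison is an exact string match, so the pinyin in the data and in the rule table would need to use the same tone-mark encoding.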
 

feng

榜眼
To all: I have edited my previous post to reflect another T: S, S case.


Yiliya, no offense was intended. I did read your mention of other sources, but as 硷 is both valid and official in the PRC as the form of 鹼, that was my point in mentioning Baidu for simplifying characters (you may wish to reread my quotation of you in the previous post). The web simplifiers generally incorrectly simplify certain characters. The people who make them are not as careful or knowledgeable as the people who make the standard dictionaries (which have 硷(鹼)同 “碱”).

As for yu1, regardless of what Jinshan Ciba has (it's also not a standard sort of dictionary), I am just pointing out what the PRC's official line is. I am certainly not disagreeing with you about the history of the character. Nonetheless, I can not find the pronunciation of yu1 in Xinhua Zidian or other standard one volume dictionaries of modern Chinese from the PRC. No doubt one can find yu1 菸 in dictionaries of classical Chinese (I'll try to remember to check later today) in the PRC as most of them use traditional characters, as equivalents or main entries, but publishing related to old books and such is a usage outside of the simplification process, so in that sense no rules apply.

I have seen non-official characters used in PRC publications, both simplified and traditional non-official characters, on rare occasions, but since my participation in this thread was predicated on persons who expressed an interest in understanding some of the finer points of the simplification process, I felt it best to give the party line (pun intended), especially since the PRC government has a degree of control over the publishing industry's character choices that Taiwan does not even attempt.
 

Yiliya

榜眼
What is relevant is that everyone everywhere in the PRC uses 碱; 硷 is just a rare, non-standard variant. Baidu, the biggest PRC site, doesn't even let you search for it. What would be the point of including it in our conversion? Variant Chinese characters are too numerous to account for each and every one of them.

As for the PRC pronunciation of 菸, it also appears in 漢語大詞典, with the following:


〔《集韻》衣虛切,平魚,影。〕
枯萎。參見「菸邑」。
 

feng

榜眼
碱: I don't disagree with you about the practical reality of the situation. If you want practical, then some of the other T: S, S discussed on previous pages need not be there either.

菸:Now I understand. The MoE dictionary has it, so you guys have to do or not do something with it. Sorry for the confusion. I was recently approached by someone to do some work similar to this. One of the problems I brought up was that while the PRC has rules, if one enlarges the character set one works with beyond the 7008 characters that the PRC has for general use (which does not have 菸), it becomes confusing because things are proscribed, yet they exist.

菸 is a proscribed variant in the PRC. That means it is not considered a simplification by the PRC; it is considered a variant character and they have chosen not to use it. The PRC's List of Variants proscribes 1,027 characters as no longer valid for general use. This means that it is not in regular dictionaries, but yes, Hanyu Dacidian, Hanyu Dazidian (2nd ed.), 《故訓匯纂》, Ci Yuan, and some other such dictionaries from the PRC do have 菸 (I checked today :p ). Depending on which dictionary you look in, some or all of those 1,027 characters are going to come up because they were all used at some point to some extent. The PRC rules for variants operate within the 7008 characters for general use. For history or classical literature, publishers can use other characters. I mean, 廁 is a proscribed variant. The PRC uses 厠 which is then simplified to 厕 via List 2 (貝/贝). That doesn't mean all three versions can't be found in a PRC dictionary of sufficient size. Even the List 1 and 2 simplifications, which are intended to be used for any relevant character one encounters, are not actually used exclusively by all dictionaries. These large, or specifically classical, dictionaries are not for normal use and for that reason don't have to follow the rules.

Taiwan's web-based variant dictionary has more than 76,000 variants, in addition to nearly 30,000 characters that Taiwan considers kosher. It's all but endless.
 
@alex_hk90: Thanks again for all the effort you put into tweaking the MoE database to work so seamlessly in Pleco! MoE is an invaluable reference, and to have it in Pleco makes it so much more awesome. (Btw, the simplified headwords feature you added has worked great for me so far, too.) :)
 

feng

榜眼
Alex or anyone, is it possible to get a complete list of the head characters (and also the variants) used in the MoE dictionary?
 

alex_hk90

状元
Sure, I've attached a dump of the "entries" table from "dict-revised.sqlite3", which has schema:
Code:
CREATE TABLE "entries" (
    "id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    "title" varchar(255),
    "radical" varchar(255),
    "stroke_count" integer,
    "non_radical_stroke_count" integer,
    "dict_id" integer
);

The attached archive has two versions, one is from the original file and one is after running the "db2unicode" script to convert some (most?) of the missing characters to Unicode.

The second column ("title") contains the head characters.

Hope this helps. :)
 

Attachments

  • dict-revised_entries.zip (1.9 MB)

feng

榜眼
Thank you, Alex! I will get Excel when I get a new computer later this year. I will enjoy working with this list.

EDIT: I just opened it in TextEdit. It has all 163,093 entries. I had assumed that since you were working with the simplified head character equivalents that you had a list of just the head characters? I guess not? In any case, the list is ordered by bushou, which is fun. Thanks again!
 

alex_hk90

状元
You're welcome. :) This is from the original, unprocessed data. If you want I can output the processed data (grouped by Hanzi/pinyin)? The way I did the conversion was to do various search-replaces on the title column, so there isn't just a list of the characters per se (though it is easy to produce one as the data is in a very nice format).
 

Yiliya

榜眼
Hey, Alex. Found one more oversight.

麼 (mó) should have been simplified to 麽, as in 幺麽 (trad: 么麼). Your script simply ignores this hanzi/pinyin combo for some reason.

But oh well.
 

feng

榜眼
alex_hk90 said:
You're welcome. :) This is from the original, unprocessed data. If you want I can output the processed data (grouped by Hanzi/pinyin)? The way I did the conversion was to do various search-replaces on the title column, so there isn't just a list of the characters per se (though it is easy to produce one as the data is in a very nice format).
If it's not too time-consuming for you, it would be nice to have a list of the characters. Their website says they have 11,930 characters (plus 1,848 variants), but I asked them about this because of the way they worded it, and it turns out that it is 11,930 character pronunciations. So 菸 is counted twice, as are plenty of other characters with multiple pronunciations, common or uncommon, and I am wondering just how many characters they really have. Thanks again. I look forward to using this when I get Pleco.
 

alex_hk90

状元
Yiliya said:
Hey, Alex. Found one more oversight.

麼 (mó) should have been simplified to 麽, as in 幺麽 (trad: 么麼). Your script simply ignores this hanzi/pinyin combo for some reason.

But oh well.
It took me a while to even see the difference there, but yes, you're right: the script doesn't convert it. The rule I currently have in place for that character is to convert 麼 to 麽 when it is not pronounced mó. I think I misinterpreted some of the information earlier in the thread about this one. What would you consider the rule for simplifying (or not) this character 麼? Is it always simplified to 麽, or are there cases when it is not?

I haven't been following the thread that closely so if you could summarise all the oversights since the last version (MoEDict-04c-Simp02a) I can look at doing a quick update of the Simplified conversion. :)

feng said:
If it's not too time-consuming for you, it would be nice to have a list of the characters. Their website says they have 11,930 characters (plus 1,848 variants), but I asked them about this because of the way they worded it, and it turns out that it is 11,930 character pronunciations. So 菸 is counted twice, as are plenty of other characters with multiple pronunciations, common or uncommon, and I am wondering just how many characters they really have. Thanks again. I look forward to using this when I get Pleco.

I'll have a look at doing this. It shouldn't be difficult to make such a list (can just split all the headwords into individual characters and filter for only the unique ones, probably using "select distinct" or similar in SQL). It would be harder to differentiate between characters and variants, as this would involve checking the linked definition and rely on some consistent data to identify variants (which the dictionary might have, I haven't checked).
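
The approach described above could look roughly like the sketch below, which uses Python's sqlite3 module against a toy in-memory copy of the "entries" table. The sample rows and the function name are made up for illustration; for the real data one would connect to "dict-revised.sqlite3" instead. Note that splitting a title character by character would also break apart any non-Unicode placeholder, so those titles would need separate handling.

```python
import sqlite3


def distinct_title_chars(conn: sqlite3.Connection) -> set:
    """Collect every distinct character that occurs in any entry title."""
    chars = set()
    query = "SELECT DISTINCT title FROM entries WHERE title IS NOT NULL"
    for (title,) in conn.execute(query):
        # Iterating over a Python string yields one Unicode code point at a time.
        chars.update(title)
    return chars


# Toy data standing in for the real "entries" table (only two columns kept).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "entries" ('
    '"id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, '
    '"title" varchar(255))'
)
conn.executemany(
    "INSERT INTO entries (title) VALUES (?)",
    [("菸",), ("菸邑",), ("菸",)],
)
chars = distinct_title_chars(conn)
```

The de-duplication of characters happens in the Python set rather than in SQL, since SQLite has no built-in way to split a string into rows without a recursive query.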
 

alex_hk90

状元
feng said:
If it's not too time-consuming for you, it would be nice to have a list of the characters. Their website says they have 11,930 characters (plus 1,848 variants), but I asked them about this because of the way they worded it, and it turns out that it is 11,930 character pronunciations. So 菸 is counted twice, as are plenty of other characters with multiple pronunciations, common or uncommon, and I am wondering just how many characters they really have. Thanks again. I look forward to using this when I get Pleco.

I can't guarantee that this is all the characters, but I've output just the distinct/unique entries that have single-character titles (split into Unicode and non-Unicode, the latter being those without a Unicode conversion). There are 9635 on the Unicode list and 1765 on the non-Unicode list (so a total of 11400). To see whether this is really a complete list, I would have to check through all the remaining multi-character title entries to see if they contain characters not on these lists (doable but a bit time-consuming).
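
The completeness check mentioned above amounts to a set difference. Here is a minimal sketch, assuming the titles have already been loaded into a Python list and contain only real Unicode characters (non-Unicode placeholders would need their own handling); the function name and sample titles are hypothetical.

```python
def chars_missing_from_singles(titles):
    """Return characters used in multi-character titles that have no
    single-character entry of their own."""
    singles = {t for t in titles if len(t) == 1}
    in_compounds = {ch for t in titles if len(t) > 1 for ch in t}
    return in_compounds - singles


sample_titles = ["菸", "菸邑", "枯萎"]
missing = chars_missing_from_singles(sample_titles)
```

An empty result on the real data would suggest that the single-character lists are complete; anything returned would be a character that only ever appears inside longer headwords.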
 

Attachments

  • dict-revised-chars-unicode.txt (206.1 KB)
  • dict-revised-chars-non_unicode.txt (43 KB)