You're welcome.
abhoriel said: thanks a lot for this! It's a really beautiful dictionary.
Simplified headwords would indeed be a nice addition though.
sandwich said: Edit: A couple of other things: some characters have "(變)" plus an additional pronunciation, and 兒 has incorrect zhuyin; it shows "˙" instead of "ㄦ". (See 蔓兒 for both.)
Edit2: OK, looks like the 兒 thing is a Pleco bug, since other dictionaries have the same issue.
I'm not sure if you can import the Zhuyin using flashcards; at least I don't know how to do it.
sandwich said: Just noticed a couple of characters with wrong readings when the pronunciation system is set to zhuyin. 欸 is "?", and 誒 is "ề". They should be ㄟˋ and ㄝˋ respectively. No idea how often this is a problem (and I don't care who is to blame), but it might be an idea to give Pleco the zhuyin and let it sort out generating the pinyin.
Is the entry for 呆 a typo, or is ai2 (not dai2) really an alternate pronunciation? A benefit of not merging them is that the alternate pronunciations are searchable by Pinyin. Regarding the feasibility of merging them, as long as there is no ambiguity about which entry is being referred to, it should be possible (if a little tricky).
sandwich said: Also, are entries that have "(又音)" in their pronunciation a non-issue? (See 欸 and 呆.) Wrt 又音, there's no way to merge them into their original entry, right? Given that some are already like this. (See 腌臢.)
Yeah, I noticed the "(變)" ones - it's from the original data. I was considering splitting them like the "(又音)" ones:
sandwich said: Edit: A couple of other things: some characters have "(變)" plus an additional pronunciation, and 兒 has incorrect zhuyin; it shows "˙" instead of "ㄦ". (See 蔓兒 for both.)
Edit2: OK, looks like the 兒 thing is a Pleco bug, since other dictionaries have the same issue.
Ah, I just kinda assumed you could swap them and it would work. Obviously not. I haven't noticed any other similar issues yet, so maybe these should just be special-cased to 'ei' and 'eh'?
alex_hk90 said: I'm not sure if you can import the Zhuyin using flashcards; at least I don't know how to do it.
I'm not sure what you are asking here. Personally I would assume the dictionary is right. Unfortunately the (又音) entries don't have usage notes, so if they weren't searchable there would be no point in having them. My main thought was that they are kinda hidden from view and the actual info they refer to is in the other entry. (I don't know how the multiple entry stuff works, so maybe this is just a temporary non-issue on iOS?)
alex_hk90 said: Is the entry for 呆 a typo, or is ai2 (not dai2) really an alternate pronunciation? A benefit of not merging them is that the alternate pronunciations are searchable by Pinyin. Regarding the feasibility of merging them, as long as there is no ambiguity about which entry is being referred to, it should be possible (if a little tricky).
Ah, sorry. Missed that you had already discussed it.
alex_hk90 said: Yeah, I noticed the "(變)" ones - it's from the original data. I was considering splitting them like the "(又音)" ones:
http://www.plecoforums.com/viewtopic.ph ... =30#p29339 (point 3.)
http://www.plecoforums.com/viewtopic.ph ... =75#p29433
I haven't checked to see if there is a pattern as to why sometimes multiple pronunciations are in the Pinyin field and sometimes it has been split into multiple entries. It's probably possible to either merge or split all of them, but I didn't want to do that until I had worked out what should be done and if there was a reason/pattern to the split/combined entries.
In all honesty I haven't tried it - I don't know anything about Zhuyin.
sandwich said: Ah, I just kinda assumed you could swap them and it would work. Obviously not. I haven't noticed any other similar issues yet, so maybe these should just be special-cased to 'ei' and 'eh'?
It was just an aside on that particular entry; I would've thought dai2 was a more obvious alternate pronunciation for dai1 than ai2 would be.
sandwich said: I'm not sure what you are asking here. Personally I would assume the dictionary is right.
I'm not sure what (temporary non-)issue you are talking about?
sandwich said: Unfortunately the (又音) entries don't have usage notes, so if they weren't searchable there would be no point in having them. My main thought was that they are kinda hidden from view and the actual info they refer to is in the other entry. (I don't know how the multiple entry stuff works, so maybe this is just a temporary non-issue on iOS?)
No worries - do you have any suggestions of how to resolve these?
sandwich said: Ah, sorry. Missed that you had already discussed it.
Variants are important for people reading genuine old books (not modern editions) or doing research into characters.
audreyt said: The missing entries are all variant characters; they have no distinct semantics, and it's safe to discard them.
alex_hk90 said:- Sometimes the Pinyin has a note before it, like <又音> or <讀音>; this should probably be moved into the main body of the definition as well,...
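The note-moving idea quoted above could be sketched roughly as follows. This is only an illustration: the exact layout of the Pinyin field in the source data is not shown in this thread, so the bracket styles accepted here and the helper name `split_note` are assumptions, and only the note labels mentioned in this thread (又音, 讀音, 變) are recognised.

```python
import re

# Assumed layout: an optional note in <...> or (...) before the reading.
# Only labels mentioned in this thread are matched.
NOTE_RE = re.compile(r"^\s*[<(（]\s*(又音|讀音|變)\s*[>)）]\s*")


def split_note(pinyin_field):
    """Split a leading note off a Pinyin field.

    Returns (note, pinyin); note is None when the field has no leading note.
    """
    match = NOTE_RE.match(pinyin_field)
    if match is None:
        return None, pinyin_field.strip()
    return match.group(1), pinyin_field[match.end():].strip()


def move_note_to_definition(pinyin_field, definition):
    """Return (clean_pinyin, definition) with any note prepended to the definition."""
    note, pinyin = split_note(pinyin_field)
    if note is None:
        return pinyin, definition
    return pinyin, f"({note}) {definition}"


if __name__ == "__main__":
    print(move_note_to_definition("<又音>ái", "same as dāi"))
    print(move_note_to_definition("dāi", "dull"))
```

Keeping the cleaned reading in the Pinyin field means the entry stays searchable by pronunciation, which is the benefit of separate entries mentioned earlier in the thread.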
又音: "also pronounced". Often these are the same as PRC pronunciations, and they are the analogue of Xinhua Zidian's 舊讀 (which are often Taiwan pronunciations). I am not really sure that each dictionary did this for the purpose of cross-listing the other's pronunciation; rather, there are historical issues with the change in pronunciation on either side of the straits. In fact, though I can't think of an example, I am sure that some of them are not such. In any case, they don't have usage notes because they are not, in my experience, related to usage. It's a "You say tomato. I say tomato" kind of a thing.
sandwich said: Unfortunately the (又音) entries don't have usage notes, so if they weren't searchable there would be no point in having them.
Yes, really. Taiwan has some old pronunciations that the PRC no longer uses.
alex_hk90 said: Is the entry for 呆 a typo, or is ai2 (not dai2) really an alternate pronunciation?
* 著: zhuo2 is also in Xinhua Zidian
Yiliya said: 著 (don't simplify when it's pronounced zhù), a very common character that causes most of the confusion in Trad -> Simp conversions
徵 (don't simplify when it's pronounced zhǐ), this is a rare usage, but still, MoE has it
於 (don't simplify when it's pronounced wū), also rare
Also, the 幺/么/麼/麽 confusion. Basically, Trad 么 = Simp 幺 (yāo), Trad 麼 = Simp 么 (me) OR 麽 (mó). This way, 么麼 (yāomó) gets simplified to 幺麽.
Another thing to consider is that the MoE dictionary uses a number of archaic traditional characters throughout the whole dictionary, case in point - 祕 (instead of the nowadays commonly accepted 秘).
At least on a Mac, STHeiti does not display 敢 with a 丅 on top of the 耳, but rather a 乛. There are a number of characters like this. PRC traditional varies occasionally from Taiwan traditional _by font_ no matter what one types (and varies plenty of times for other reasons). Nearly all computer fonts for Chinese on English-language operating systems are entirely PRC, even for traditional characters. Taiwan's MoE has fonts for free, and most computers have "BiaoKai", which seems to be the same, but I won't swear to it.
mikelove said: but iOS actually includes both simplified- and traditional-styled variants of its built-in STHeiti font (with the attendant character / punctuation / etc changes)
There are some errors in that list, unless it is a display issue. One example: 锺 is not a character in the PRC, though it is typeable. The PRC uses 钟 for both 鐘 and 鍾. There are other such examples there.
mikelove said: FWIW, here's a longer list of TC->multiple SC mappings, including a couple of Extension B/C ones which might be considered more "variants":
Interesting.
feng said: Yes, really. Taiwan has some old pronunciations that the PRC no longer uses.
Is there an official source that states all this information (on the Traditional to Simplified conversions), or is it just common knowledge?
feng said: * 著: zhuo2 is also in Xinhua Zidian
* Although 么 (yao1) is the official standard in Taiwan, even 《新編國語日報辭典》(Taiwan's equivalent of 《現代漢語詞典》 or 《新華字典》) uses 幺, since 么 is a variant, historically.
* 祕 is the official standard in Taiwan. 秘 is for the PRC (and a variant in origin).
feng said: 2) I hope you can do this and that you can save the bo-po-mo (re comment above somewhere). I for one am very sick of always looking at Hanyu Pinyin.
The data has columns for bopomofo:
sqlite> select bopomofo, bopomofo2, pinyin
...> from heteronyms
...> where rowid > 100000 and rowid < 100010;
ㄈㄨˇ ㄊㄧㄢˊ|fǔ tián|fǔ tián
ㄩㄥˇ|yǔng|yǒng
ㄩㄥˇ ㄐㄩˋ|yǔng jiù|yǒng jù
ㄩㄥˇ ㄌㄨˋ|yǔng lù|yǒng lù
ㄩㄥˇ ㄉㄠˋ|yǔng dàu|yǒng dào
ㄅㄥˊ|béng|béng
ㄅㄥˊ ㄩㄥˋ|béng yùng|béng yòng
ㄋㄧㄥˋ|nìng|nìng
ㄋㄧㄥˊ|níng|níng
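For anyone who would rather pull these columns out from a script than from the sqlite shell, here is a minimal sketch using Python's standard `sqlite3` module. The table name `heteronyms` and the three column names come from the query above; the database file itself is not named in this thread, so the sketch builds a tiny in-memory stand-in from three of the rows shown, and `fetch_readings` is just an illustrative helper name.

```python
import sqlite3


def fetch_readings(conn, low, high):
    """Return (bopomofo, bopomofo2, pinyin) rows with low < rowid < high."""
    cur = conn.execute(
        "SELECT bopomofo, bopomofo2, pinyin FROM heteronyms "
        "WHERE rowid > ? AND rowid < ? ORDER BY rowid",
        (low, high),
    )
    return cur.fetchall()


# Stand-in for the real database: same table and column names as the query
# above, filled with three rows copied from the output shown in this thread.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE heteronyms (bopomofo TEXT, bopomofo2 TEXT, pinyin TEXT)"
)
sample = [
    ("ㄩㄥˇ", "yǔng", "yǒng"),
    ("ㄅㄥˊ", "béng", "béng"),
    ("ㄋㄧㄥˋ", "nìng", "nìng"),
]
conn.executemany("INSERT INTO heteronyms VALUES (?, ?, ?)", sample)

rows = fetch_readings(conn, 0, 10)
for bopomofo, bopomofo2, pinyin in rows:
    # Same pipe-separated layout as the sqlite shell output.
    print("|".join((bopomofo, bopomofo2, pinyin)))
```

Against the real file you would replace `":memory:"` with its path and drop the `CREATE TABLE` / `INSERT` lines.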
feng said:1) I find Taiwan's MoE very responsive. Recently, I email two different offices at the MoE (including the one that controls this dictionary and others) once or twice a month for some research I am doing and they are great. Academia Sinica can go . . . but anyway MoE is responsive.
feng said:2) I hope you can do this and that you can save the bo-po-mo (re comment above somewhere). I for one am very sick of always looking at Hanyu Pinyin.
All official. But in lots of different places. The PRC has various official lists. Taiwan has one main official list and then other lists for very uncommon characters. They go about it in very different ways, with issues to be figured out even within both sets of lists -- and between Taiwan and the PRC is a larger issue to figure out all the different forms and disappearing characters and such.
alex_hk90 said:
feng said: * 著: zhuo2 is also in Xinhua Zidian
* Although 么 (yao1) is the official standard in Taiwan, even 《新編國語日報辭典》(Taiwan's equivalent of 《現代漢語詞典》 or 《新華字典》) uses 幺, since 么 is a variant, historically.
* 祕 is the official standard in Taiwan. 秘 is for the PRC (and a variant in origin).
Is there an official source that states all this information (on the Traditional to Simplified conversions), or is it just common knowledge?
I am in the middle of researching all this as a small part of a larger project I am working on. Looking at your list again: 鉋 刨,铇 : 铇 does not exist in the PRC, officially. As with the character in the previous post, it is something that theoretically should exist based on List 2 (Zong Biao), but the PRC's Yitizi List proscribes 鉋 altogether, which makes whatever List 2 might do to it moot. 卻 卻,却: the PRC proscribes 卻. Still others there. And, as you of course know, there are several funky characters with weird rules about simplification such as 線 (can't type the simplified form; I don't mean 綫/线) and 馀 and 摺 and others (only simplify if your life depends on it). Funny, with three of ten appendices dealing with simplification issues I never thought to make a list for multiple simplified forms. Actually, I think the list is a bit smaller than what you gave. I will take a look at it some more on Monday (or Tuesday . . .).
mikelove said: Do you have another list that you'd recommend alex_hk90 (or whoever eventually converts this to simplified) use?
You emailed onile@mail.naer.edu.tw? Them is the dictionary people. Actually, http://email.moe.gov.tw/EDU_WEB/sendmail/send.php?sGo=1
mikelove said: That hasn't been my experience with them, sadly.
Reason number 783 to get Pleco! :idea:
mikelove said: Pleco has BoPoMoFo support built-in for all dictionaries (auto-converts from Pinyin), so it's not really necessary to extract it specifically from this one.
Thanks for all the information - interesting stuff.
feng said: All official. But in lots of different places. The PRC has various official lists. Taiwan has one main official list and then other lists for very uncommon characters. They go about it in very different ways, with issues to be figured out even within both sets of lists -- and between Taiwan and the PRC is a larger issue to figure out all the different forms and disappearing characters and such.
The parts I wrote about variants are not 'official'. You have to research to get that gold, though it is easy to do the larger part of that on Taiwan's online 《異體字字典》, which is quick, easy, and free. One can (and if it is important, must) look in serious paper character dictionaries that at least attempt to base themselves on historical principles, such as 《正中形音義綜合大字典》 or 《漢語大字典》(第二版), and other sorts of dictionaries, either ancient or modern about the ancient. I figure it beats having a heroin habit! Though it cost me as much to buy all those books.
Sorry, I don't have any sort of hand-held device, so I can't give an opinion regarding your specific question about placement, other than to say I am in a long term love affair with bo-po-mo :mrgreen:
feng said:I am in the middle of researching all this as a small part of a larger project I am working on. Looking at your list again: 鉋 刨,铇 : 铇 does not exist in the PRC, officially. As with the character in the previous post, it is something that theoretically should exist based on List 2 (Zong Biao), but the PRC's Yitizi List proscribes 鉋 altogether, which makes whatever List 2 might do to it moot. 卻 卻,却: the PRC proscribes 卻. Still others there. And, as you of course know, there are several funky characters with weird rules about simplification such as 線 (can't type the simplified form; I don't mean 綫/线) and 馀 and 摺 and others (only simplify if your life depends on it). Funny, with three of ten appendices dealing with simplification issues I never thought to make a list for multiple simplified forms. Actually, I think the list is a bit smaller than what you gave. I will take a look at it some more on Monday (or Tuesday . . .).
I was actually staying away from List 1 and List 2 in general (other than to note the simplifications), except where they mess things up (outside of mere simplification) in Taiwan's list of 4,808 common characters that I am using as the corpus for my project. The problem is that even outside of almost 5,000 characters there are still lots of issues like this, so to make a comprehensive list would require of me rather more effort. I have been focusing on the list of 4,808 characters as they represent nearly all the characters one would ever want for anything outside of the numerous rare forms encountered in Chinese history or classical literature. Ideally I want to modify the list for a future project, adding and deleting a hundred or so characters each way to make it "perfect" for daily use. The list provides me with a set of parameters that allow me to finish my project before the next Mayan cycle rolls around (how's that for a Chinese reference? It is.).
I am afraid I may not have answered your question or made much sense. Don't be shy about refocusing my attention.
There are some fickle simplifications, which I mentioned in my last two or three replies to this thread. There are also, as mentioned in those same replies, some incorrect simplifications going around. Getting it 99% right is easy; getting it 100% right takes more effort, as there are nitpicking little exceptions to worry about.
alex_hk90 said: 2. Convert all one-to-many Traditional to Simplified characters using a list of ({Traditional, Pinyin}, Simplified) pairs, assuming that {Traditional, Pinyin} to Simplified is a one-to-one mapping (i.e. there are no Traditional characters which convert to more than one Simplified character once you consider the pronunciation).
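The ({Traditional, Pinyin}, Simplified) lookup described in step 2 could be sketched like this. It is only an illustration under stated assumptions: the two small tables contain nothing but the characters discussed earlier in this thread (著, 徵, 於, 么, 麼), the function names are made up for the example, and the headword and its Pinyin are assumed to align one character per space-separated syllable. A real table would need the researched exceptions mentioned above.

```python
# Default Traditional -> Simplified mappings (only characters from this thread).
DEFAULT_SIMPLIFIED = {
    "著": "着",
    "徵": "征",
    "於": "于",
    "么": "幺",  # Trad 么 (yāo) = Simp 幺
    "麼": "么",  # Trad 麼 (me) = Simp 么
}

# Pronunciation-specific exceptions, keyed by (Traditional, Pinyin).
EXCEPTIONS = {
    ("著", "zhù"): "著",  # not simplified when pronounced zhù
    ("徵", "zhǐ"): "徵",  # not simplified when pronounced zhǐ
    ("於", "wū"): "於",   # not simplified when pronounced wū
    ("麼", "mó"): "麽",   # Trad 麼 (mó) = Simp 麽
}


def simplify_char(trad, pinyin):
    """Simplify one character, consulting the pronunciation-specific table first."""
    if (trad, pinyin) in EXCEPTIONS:
        return EXCEPTIONS[(trad, pinyin)]
    return DEFAULT_SIMPLIFIED.get(trad, trad)


def simplify_word(headword, pinyin):
    """Simplify a headword given its Pinyin, one syllable per character."""
    syllables = pinyin.split()
    if len(syllables) != len(headword):
        raise ValueError("headword and Pinyin do not align one-to-one")
    return "".join(simplify_char(c, s) for c, s in zip(headword, syllables))


if __name__ == "__main__":
    print(simplify_word("么麼", "yāo mó"))
    print(simplify_word("著", "zhù"), simplify_word("著", "zhuó"))
```

Checking the exception table before the default table is what keeps the easy 99% and the fiddly 1% separate: new exceptions can be added as they are researched without touching the bulk mapping.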
The bottom of the page at the original site has "中華民國教育部 版權所有 (c) 2000 Ministry of Education, R.O.C. All rights reserved." Is that open source now?
goldyn chyld said: Speaking of Tw 異體字字典, you can find its wordlist here: https://github.com/kcwu/moedict-variants
I wonder if it'd be possible to make it work in Pleco. But it seems quite complicated, esp. since they often use an image to display a rare character...
Yeah, it does seem to be the case that it is that last ~1% or so that is the issue. What I'm looking for is a list/table/database that includes all of these, so I can just do a search/replace (more or less) on the Traditional to get the correct Simplified. Do you know if such a list exists? Unfortunately I don't have the time to do the research and collate from different sources.
There are some fickle simplifications, which I mentioned in my last two or three replies to this thread. There are also, as mentioned in those same replies, some incorrect simplifications going around. Getting it 99% right is easy; getting it 100% right takes more effort, as there are nitpicking little exceptions to worry about.
Mike, new forum look is great!
The U+2ABCD notation means "Unicode, codepoint 0x2ABCD" and refers to characters outside the Basic Multilingual Plane (sometimes referred to as "Astral Characters").
Unfortunately, this forum software does not support such characters...
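To make the U+ notation concrete, here is a small Python sketch that reports a character's code point and whether it lies outside the Basic Multilingual Plane. The helper name `describe` is made up for the example, and U+2ABCD is used only because it is the code point quoted above. The extra UTF-16 code unit and fourth UTF-8 byte that astral characters need are a common reason software fails to handle them, though nothing in this thread says that is what happens with this forum.

```python
def describe(ch):
    """Summarise how a single character is identified and encoded."""
    cp = ord(ch)
    return {
        "notation": f"U+{cp:04X}",                        # e.g. U+5152 or U+2ABCD
        "astral": cp > 0xFFFF,                            # True outside the BMP
        "utf16_units": len(ch.encode("utf-16-be")) // 2,  # 2 means a surrogate pair
        "utf8_bytes": len(ch.encode("utf-8")),
    }


if __name__ == "__main__":
    print(describe("兒"))            # a BMP character from this thread
    print(describe(chr(0x2ABCD)))    # the astral code point quoted above
```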
Maybe both? Edit the original post and then make a new one pointing back to it.
NB: There are more T: S, S situations like this. I may not get around to it till June. Should I edit this post (no one will be notified by the system) or make a new post on this thread?