CC-CEDICT Updated

mikelove

皇帝
Staff member
Since it's been 3 months, here's an updated CC-CEDICT. This file is for iOS; Android will get an updated version along with the big beta release shortly, though this one should also work on Android (it'll just make merged searches a bit slower, since it's missing the extra index table for them).

http://d26r4pfa6wznjb.cloudfront.net/p2cedict-120626.zip

See the instructions at http://www.pleco.com/ipdirectdownload.html for how to install.

The important change here (and the reason we're posting it in the forums before it shows up in the "Updates" tab) is that we made a tweak to the file format to work around the issue of multi-word full-text searches not always returning results correctly; "washing machine", for example, should now work more reliably. Please let us know if it does. (Sorry for not fixing that sooner.)
 

mikelove

皇帝
Staff member
goldyn chyld said:
Thanks! Will it get updated regularly from now on?

It may not be updated again until the big iPhone update is actually out, depending on the timing of that - while the iPhone/Android codebase is still split it's actually quite a lot more labor-intensive to do this than it would be otherwise, and we'd really rather not take time away from converting other, brand new dictionaries in order to put out CC-CEDICT updates.
 
Is there a chance users could update CC-CEDICT by themselves? I.e. download the latest version of CEDICT and then import it through iTunes somehow into the Pleco app?
 

mikelove

皇帝
Staff member
goldyn chyld said:
Is there a chance users could update CC-CEDICT by themselves? I.e. download the latest version of CEDICT and then import it through iTunes somehow into the Pleco app?

Not really - you could import it as a user dictionary but it would be very slow, and actually with the new version of Pleco the difference between a user-created and a Pleco-created version of CC-CEDICT will get considerably greater since we've put a lot of effort into working around some of CC-CEDICT's problems (like the lack of good variant tagging) with automated scripts.
 
Hey Mike,

I was wondering if the new version of Pleco would support variant word compounds or pronunciations?

For example, in CEDICT we have: "打抱不平 打抱不平 [da3 bao4 bu4 ping2] /to come to the aid of sb suffering an injustice/to fight for justice/also written 抱打不平[bao4 da3 bu4 ping2]/". But if you look up the "also written" counterpart, 抱打不平, it won't show up even if you look it up by pinyin.

An example of a "variant" pronunciation: 蜗牛. In CEDICT we have: 蝸牛 蜗牛 [wo1 niu2] /snail/Taiwan pr. [gua1 niu2]/, yet if you look up "gua niu" in Pleco, nothing will show up.

Could anything be done about that?
 

Yiliya

榜眼
Another suggestion.

For whatever reason, CC-CEDICT uses fullwidth Latin letters (which are used in Japanese IMEs, but NOT Chinese ones) instead of the usual halfwidth. So there are problems with looking up some words, e.g.:
ＡＡ制
Ｖ沟
Ｋ书

I suggest you convert all fullwidth letters to halfwidth when building the database, like so:
AA制
V沟
K书

I haven't come across a Chinese person typing any of these words in fullwidth (and most Chinese IMEs don't even support fullwidth!), so I really can't see why they would do this.
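
A minimal sketch of the conversion suggested above, written in Python rather than in whatever the real build script uses. It assumes that only the fullwidth ASCII range (U+FF01 to U+FF5E) needs mapping; `to_halfwidth` is a made-up name for illustration, not anything from Pleco or CC-CEDICT.

```python
# Fullwidth ASCII variants occupy U+FF01..U+FF5E and sit at a fixed
# offset of 0xFEE0 from their ordinary ASCII counterparts.
FULLWIDTH_START = 0xFF01
FULLWIDTH_END = 0xFF5E
OFFSET = 0xFEE0


def to_halfwidth(text: str) -> str:
    """Map fullwidth ASCII variants to plain ASCII; leave all other characters alone."""
    return "".join(
        chr(ord(ch) - OFFSET) if FULLWIDTH_START <= ord(ch) <= FULLWIDTH_END else ch
        for ch in text
    )


if __name__ == "__main__":
    # Headwords written with escapes so the fullwidth letters are unambiguous.
    for word in ["\uff21\uff21制", "\uff36沟", "\uff2b书"]:
        print(to_halfwidth(word))
```

Mapping only this one range keeps the Chinese characters and any fullwidth punctuation outside it untouched, which a blanket Unicode normalization might not.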
 

mikelove

皇帝
Staff member
goldyn chyld said:
I was wondering if the new version of Pleco would support variant word compounds or pronunciations?

Yes, we have a very robust system for handling those now, though it may not support them fully in CC-CEDICT right away since the tagging there is maddeningly inconsistent - we've put in a lot of time just getting it to deal with stuff like variant characters (so you don't get two results every time you type in a word starting with 台), effort that sadly was rejected by the CC-CEDICT editors and therefore now exists only in our increasingly-elaborate CC-CEDICT fork. (which we've designed to allow for syncing of new / updated entries from the website, but that requires enough manual checking / editing that it makes CC updates more complicated than they'd otherwise be)

But getting it to detect / handle every one of these variant links correctly would be even more work; "also written"s often do have separate entries associated with them, which themselves contain slightly different information and therefore can't be reduced to a simple variant link unless we hand check them to make sure the information is in fact the same, and "Taiwan pr."s sometimes have their own entry as well.
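
To illustrate only the detection half of this, here is a rough Python sketch that pulls "also written" and "Taiwan pr." notes out of a CC-CEDICT line. The regular expressions are assumptions based on the two entries quoted in this thread, not a description of Pleco's scripts, and the inconsistent tagging mentioned above means real data would need more patterns plus the hand checking.

```python
import re

# Assumed line layout: TRADITIONAL SIMPLIFIED [pin1 yin1] /sense 1/sense 2/
ENTRY_RE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")
# Patterns modelled on the two examples quoted in this thread only.
ALSO_WRITTEN_RE = re.compile(r"also written ([^\[\s]+)\[([^\]]+)\]")
TAIWAN_PR_RE = re.compile(r"Taiwan pr\. \[([^\]]+)\]")


def parse_entry(line: str) -> dict:
    """Split one CC-CEDICT line into fields and collect variant notes."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a CC-CEDICT entry line: {line!r}")
    traditional, simplified, pinyin, body = match.groups()
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pinyin": pinyin,
        "senses": body.split("/"),
        # List of (headword, pinyin) pairs found in "also written" notes.
        "also_written": ALSO_WRITTEN_RE.findall(body),
        # List of alternative pinyin readings found in "Taiwan pr." notes.
        "taiwan_pr": TAIWAN_PR_RE.findall(body),
    }


if __name__ == "__main__":
    entry = parse_entry("蝸牛 蜗牛 [wo1 niu2] /snail/Taiwan pr. [gua1 niu2]/")
    print(entry["taiwan_pr"])
```

Whether an extracted variant can become a simple link, or has to stay a separate entry, is exactly the part this sketch does not decide.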

Yiliya said:
I suggest you convert all fullwidth letters to halfwidth when building the database, like so:
AA制
V沟
K书

We actually support them in searches now; basically the system treats the full-width character as the "headword" and the half-width letter as the "Pinyin," and since it allows mixing those two fields you can end up searching both ways. However, this is currently a little buggy in some versions of Pleco.
 
mikelove said:
goldyn chyld said:
I was wondering if the new version of Pleco would support variant word compounds or pronunciations?

Yes, we have a very robust system for handling those now, though it may not support them fully in CC-CEDICT right away since the tagging there is maddeningly inconsistent - we've put in a lot of time just getting it to deal with stuff like variant characters (so you don't get two results every time you type in a word starting with 台), effort that sadly was rejected by the CC-CEDICT editors and therefore now exists only in our increasingly-elaborate CC-CEDICT fork. (which we've designed to allow for syncing of new / updated entries from the website, but that requires enough manual checking / editing that it makes CC updates more complicated than they'd otherwise be)

But getting it to detect / handle every one of these variant links correctly would be even more work; "also written"s often do have separate entries associated with them, which themselves contain slightly different information and therefore can't be reduced to a simple variant link unless we hand check them to make sure the information is in fact the same, and "Taiwan pr."s sometimes have their own entry as well.

Hmm, I can imagine it's a lot of work... I remember the huge batch you submitted to the reviewing queue on CC-CEDICT a while ago (in fact, it's still there), but I just don't know how it should be handled, especially since the CEDICT format prefers not to list all the "also written" word compounds with variant characters, but rather lists the character as a variant (instead of the whole compound). I'll try and talk to other editors about your submission again, but I'm afraid they might not have time to deal with it (or won't even feel like it because of the dictionary format). Although personally I'd rather list those variant word compounds than just character variants, of which unfortunately only DICO (and perhaps a few other software dictionaries) can actually make use of...

Btw, are there any plans to update the CC-CEDICT database in Pleco in the near future?
 

mikelove

皇帝
Staff member
goldyn chyld said:
Hmm, I can imagine it's a lot of work... I remember the huge batch you submitted to the reviewing queue on CC-CEDICT a while ago (in fact, it's still there), but I just don't know how it should be handled, especially since the CEDICT format prefers not to list all the "also written" word compounds with variant characters, but rather lists the character as a variant (instead of the whole compound). I'll try and talk to other editors about your submission again, but I'm afraid they might not have time to deal with it (or won't even feel like it because of the dictionary format). Although personally I'd rather list those variant word compounds than just character variants, of which unfortunately only DICO (and perhaps a few other software dictionaries) can actually make use of...

It's about 3x as big now, actually - we basically just have a big CC-CEDICT format diff file (+s and -s) which we apply as the first step in our CC-to-Pleco-format converter script, spitting out an error whenever there's a line that can't be found (since then it needs to be checked and updated). But it's no big deal as long as you guys don't revise those entries too extensively.
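
For anyone curious what that first step might look like, here is a hedged Python sketch (the real converter is a script we haven't shown, so everything here is an assumption): each patch line is `+` or `-` followed by a full CC-CEDICT line, and a `-` line that can't be found is collected as an error for manual review. `apply_patch` and the sample entries in the test are made up for illustration.

```python
def apply_patch(entries, patch_lines):
    """Apply +/- patch lines to a list of CC-CEDICT lines.

    Returns (patched_entries, errors), where errors holds every "-" line
    that was not found and therefore needs to be checked by hand.
    """
    result = list(entries)
    errors = []
    for raw in patch_lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # ignore blank lines in the patch file
        op, text = line[0], line[1:]
        if op == "-":
            try:
                result.remove(text)  # removes the first exact match
            except ValueError:
                errors.append(text)  # upstream entry changed or vanished
        elif op == "+":
            result.append(text)
        else:
            raise ValueError(f"unrecognised patch line: {line!r}")
    return result, errors
```

Matching on the exact line text is what makes upstream revisions show up as errors: if an entry was edited on the CC-CEDICT side, the old `-` line no longer matches and gets flagged.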

goldyn chyld said:
Btw, are there any plans to update the CC-CEDICT database in Pleco in the near future?

The Android one is a lot newer; the iPhone one may not get updated again until our big app update, since we'd rather not have to maintain two different versions of our CC-CEDICT database at the same time and the formats now differ pretty significantly between our still-a-bit-experimental database engine on Android and our old-but-more-reliable engine on iPhone.
 

Yiliya

榜眼
mikelove said:
Yiliya said:
I suggest you convert all fullwidth letters to halfwidth when building the database, like so:
AA制
V沟
K书

We actually support them in searches now; basically the system treats the full-width character as the "headword" and the half-width letter as the "Pinyin," and since it allows mixing those two fields you can end up searching both ways. However, this is currently a little buggy in some versions of Pleco.

Doesn't seem to work (on Android), I can't find anything by typing V沟, although typing Vgou does get the word.
 

mikelove

皇帝
Staff member
Yiliya said:
Doesn't seem to work (on Android), I can't find anything by typing V沟, although typing Vgou does get the word.

Yeah, that's one of said versions - works better on iPhone but should work again on Android shortly.
 

jr4

Member
Would love an updated CC-CEDICT, although I would love some better Chinese-Chinese dictionaries even more :)
 

mikelove

皇帝
Staff member
jr4 said:
Would love an updated CC-CEDICT, although I would love some better Chinese-Chinese dictionaries even more :)

In what respect? What sort of better C-C dictionaries are you looking for?
 

Yiliya

榜眼
Minor complaint about pinyin formatting.

It's true that CC-CEDICT's pinyin normally includes no spaces, but it has capital letters. When there's a capital letter in the middle of the word, it is assumed that a space is put before it. E.g. the pinyin for 中华人民共和国 is [Zhong1 hua2 Ren2 min2 Gong4 he2 guo2], but when you see it on their website, you get Zhōng​huá​ Rén​mín​ Gòng​hé​guó, which is very easy on the eyes. However, Pleco gives Zhōng​huá​Rén​mín​Gòng​hé​guó, which is a bit strange. Or another case, 小日本儿 [xiao3 Ri4 ben3 r5], which should be xiǎo​ Rì​běnr, not xiǎo​Rì​běnr.

Thankfully this should be easy to fix with a small alteration to your script.
 

mikelove

皇帝
Staff member
Yiliya said:
It's true that CC-CEDICT's pinyin normally includes no spaces, but it has capital letters. When there's a capital letter in the middle of the word, it is assumed that a space is put before it. E.g. the pinyin for 中华人民共和国 is [Zhong1 hua2 Ren2 min2 Gong4 he2 guo2], but when you see it on their website, you get Zhōng​huá​ Rén​mín​ Gòng​hé​guó, which is very easy on the eyes. However, Pleco gives Zhōng​huá​Rén​mín​Gòng​hé​guó, which is a bit strange. Or another case, 小日本儿 [xiao3 Ri4 ben3 r5], which should be xiǎo​ Rì​běnr, not xiǎo​Rì​běnr.

Thankfully this should be easy to fix with a small alteration to your script.

Yes - about 1 line of Perl; thanks for pointing this out, not sure how we failed to think of it before. (though FWIW it sure would be nice if they'd do this intelligently rather than spa cing ev e ry syl lab le )
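
The actual fix is in Perl and isn't shown here; as a rough Python equivalent of the rule described above, this sketch joins the syllables and puts a space before any non-initial syllable that starts with a capital letter. `join_pinyin` is a hypothetical name, and the conversion from tone numbers to tone marks is deliberately left out.

```python
def join_pinyin(syllables: str) -> str:
    """Join space-separated pinyin syllables into words.

    A space is kept only before a syllable that starts with a capital
    letter (and is not the first syllable); everything else is run together.
    """
    pieces = []
    for index, syllable in enumerate(syllables.split()):
        if index > 0 and syllable[:1].isupper():
            pieces.append(" ")
        pieces.append(syllable)
    return "".join(pieces)


if __name__ == "__main__":
    print(join_pinyin("Zhong1 hua2 Ren2 min2 Gong4 he2 guo2"))
    print(join_pinyin("xiao3 Ri4 ben3 r5"))
```

This relies entirely on the capitalization in the source entry, so a multi-word entry with no capitals would still come out as one run.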
 

mikelove

皇帝
Staff member
An experimental release of the latest version of CC-CEDICT is available here. It's experimental because we don't have a way to release an official CC-CEDICT update whose entry IDs won't end up out of sync with those in the Android version (and in our forthcoming iOS update). Because of that, flashcards that you create in it will have their definitions stored in the flashcard database rather than being linked to (and hence getting updates from) the original dictionary. Other than that it should work fine and give you ~10 months' worth of updates since our last iOS CC-CEDICT release (put out right before we started aggressively modifying it to work with our new variant handling system).

To install, copy into your Pleco files directory (e.g. by downloading this from the in-app web browser), go into Settings / File Browser and tap on the file.
 
Great! Thanks for that, Mike.

(I assume not displaying apostrophes in the definitions is due to it being an experimental version?)
 