MakePlecoDict: CEDICT Questions

adamlau

探花
I'm using CEDICT in UTF-8. The original file must be reformatted before it can be used as input, which means:

1. The file must be re-encoded from UTF-8 to UTF-16, Little Endian
2. Spaces must be replaced with the tab character
3. Both start and end definition delimiters (/) must be removed
4. Brackets should be placed around alternative simplified/traditional characters and phrases

Is there a formatting script or batch file you could include with makeplecodict.exe?
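
For anyone who wants to script the four steps above, here is a minimal Python sketch. It is not an official Pleco tool. It assumes each CEDICT line looks like "Traditional Simplified [pinyin] /def 1/def 2/", that the output should be headword, Pinyin and definition separated by tabs, and that the bracketed form goes directly after the first headword; check all of that against Sample.txt. The names convert_line and convert_file are made up for this example, and whether MakePlecoDict expects a byte-order mark is also unknown (none is written here).

```python
import re

# Assumed CEDICT line shape: "Traditional Simplified [pin1 yin1] /def 1/def 2/"
CEDICT_LINE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def convert_line(line):
    """Return one tab-separated entry, or None for comments and unparsable lines."""
    if line.startswith("#"):
        return None
    match = CEDICT_LINE.match(line)
    if match is None:
        return None
    first, second, pinyin, definitions = match.groups()
    # Step 4: bracket the alternative form, but only when it differs.
    headword = first if first == second else f"{first}[{second}]"
    # Step 3: the regex already dropped the leading and trailing "/".
    # Step 2: tabs (not spaces) separate the three fields.
    return "\t".join([headword, pinyin, definitions])


def convert_file(src_path, dst_path):
    """Step 1: read UTF-8, write UTF-16 Little Endian with CR/LF line ends."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-16-le", newline="\r\n") as dst:
        for line in src:
            entry = convert_line(line.rstrip("\r\n"))
            if entry is not None:
                dst.write(entry + "\n")
```

Calling convert_file("cedict_ts.u8", "cedict.utx") (both file names are placeholders) would write one converted entry per line and silently skip anything the pattern does not match, so it is worth comparing the line counts of the two files afterwards.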
 

mikelove

皇帝
Staff member
We don't use such a script ourselves; our own encoder actually uses a different (and much more complicated) input format. The engine is the same, but we simplified the front end so we wouldn't need to supply a 10-page file format manual. You're welcome to write your own conversion script, though. We might post a minor bug-fix update to MakePlecoDict at some point, but it's going to be a long time before we release a user-friendly, well-supported dictionary converter. There's just not that much demand for one, and there are too many other things we're working on.
 

adamlau

探花
MakePlecoDict: Oversized char block in line...

cedict.utx (UTF-16, Little Endian, CR/LF Pairs) available here:

http://www.nnotary.com/documents/cedict.utx

cedict.utx follows the format of the MakePlecoDict Sample.txt. When running MakePlecoDict.exe on cedict.utx (based on CEDICT UTF-8, 3 September 2005) stdout displays the following:

Oversized char block in line 10
Oversized char block in line 32
Oversized char block in line 255
Oversized char block in line 758
Oversized char block in line 826
Oversized char block in line 1215
Oversized char block in line 1427
Oversized char block in line 1764
Oversized char block in line 1910
Oversized char block in line 2997
Oversized char block in line 3123

Are these messages errors or are they merely informational i.e. can the created database be used within PlecoDict?
 

mikelove

皇帝
Staff member
This means that there are too many characters before the first tab in the line - the database should work OK (it just won't index anything past the 14th character in the line) but it might mean that you haven't correctly separated the characters/Pinyin/definition.
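
To see which entries trigger the message before running the converter, a small check along these lines may help. It is only a sketch: it assumes the limit is 14 characters before the first tab, as described above, and it counts Python string characters, which may not match exactly how MakePlecoDict counts (for example, whether brackets are included). The name oversized_lines is made up for this example.

```python
MAX_INDEXED_CHARS = 14  # limit quoted in the reply above


def oversized_lines(lines, limit=MAX_INDEXED_CHARS):
    """Yield (line_number, headword) where the text before the first tab exceeds limit."""
    for number, line in enumerate(lines, start=1):
        # Everything up to the first tab is treated as the character block.
        headword = line.rstrip("\r\n").split("\t", 1)[0]
        if len(headword) > limit:
            yield number, headword


sample = [
    "中國[中国]\tZhong1 guo2\tChina",
    "字" * 15 + "\tzi4\tfifteen characters before the first tab",
]
for number, headword in oversized_lines(sample):
    print(f"line {number}: {len(headword)} characters before the first tab")
```

A line with no tab at all is counted as one long character block, which also catches entries where the fields were never separated.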
 

mikelove

皇帝
Staff member
No, that limitation's going to be in there for a while. It's not a big deal: all it means is that if you're entering a phrase longer than 14 characters, you can't search for any character past the 14th one. I don't imagine that's likely to cause anyone any problems, since very few Chinese words/phrases even go past 4 characters.

And yes, we converted that file to the CEDICT custom database with our own tools without any problems.
 

adamlau

探花
The limitation does matter, because the CEDICT UTF-8 file contains both Traditional and Simplified characters. Depending on whether the whitespace between the two character-set forms is counted, a phrase is limited to a maximum of 6 characters (if the whitespace counts as a character) or 7 characters (if it doesn't). Idioms are affected...
 

mikelove

皇帝
Staff member
No, because PlecoDict indexes simplified and traditional characters independently, so you'd get the first 14 characters of the simplified and the first 14 characters of the traditional part. Though even with 8-character idioms it's highly doubtful you'd need to search past the 4th or 5th character.
 

adamlau

探花
The cedict.pdb database I successfully created and synced (after first deleting the PlecoDict-supplied version) did not take on the yellow CED icon. Will this be corrected by 1.0.2 official?
 

mikelove

皇帝
Staff member
No, because unfortunately there isn't any facility in MakePlecoDict for assigning icons to dictionaries.
 