MakePlecoDict: CEDICT Questions

adamlau · Sep 11, 2005

I'm using CEDICT in UTF-8. The original file must be correctly formatted before it is passed to MakePlecoDict, which means:

1. The file must be re-encoded from UTF-8 to UTF-16, Little Endian
2. Spaces must be replaced with the tab character
3. Both start and end definition delimiters (/) must be removed
4. Brackets should be placed around alternative simplified/traditional characters and phrases

Is there a formatting script or batch file you could include with makeplecodict.exe?
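For anyone who wants to script the four steps above, here is a minimal Python sketch, not an official tool. It assumes each CEDICT line looks like `FORM1 FORM2 [pin1 yin1] /def 1/def 2/`, and that the target layout is `headword[alternate]<TAB>pinyin<TAB>definition`. That layout, the choice of which form goes inside the brackets, and the names `convert_line` and `convert_file` are all assumptions made for illustration; check them against the MakePlecoDict Sample.txt before relying on the output.

```python
import re

# Assumed CEDICT line shape: "FORM1 FORM2 [pin1 yin1] /def 1/def 2/"
CEDICT_LINE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def convert_line(line):
    """Return a tab-separated entry, or None for comments and unparsable lines."""
    if line.startswith("#"):
        return None
    match = CEDICT_LINE.match(line)
    if match is None:
        return None
    first, second, pinyin, definition = match.groups()
    # Assumption: the alternate form goes in brackets, and only when it differs.
    headword = first if first == second else f"{first}[{second}]"
    # The outer "/" delimiters are dropped; inner ones still separate senses.
    return "\t".join((headword, pinyin, definition))


def convert_file(src_path, dst_path):
    """Read UTF-8 CEDICT and write UTF-16 Little Endian with CR/LF line endings."""
    # Note: "utf-16-le" writes no byte order mark; whether the tool wants one
    # is not something this sketch knows.
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-16-le", newline="\r\n") as dst:
        for line in src:
            entry = convert_line(line.rstrip("\r\n"))
            if entry is not None:
                dst.write(entry + "\n")
```

Calling `convert_file("cedict_ts.u8", "cedict.utx")` (both file names are placeholders) would write one entry per line. Lines that do not match the assumed shape are skipped silently, so it is worth comparing the input and output line counts.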

mikelove · Sep 11, 2005

We don't use such a script ourselves; our own encoder actually uses a different (and much more complicated) input format. The engine is the same, but we simplified the front end so we wouldn't need to supply a 10-page file format manual. You're welcome to write your own conversion script, though. We might post a minor bug-fix update to MakePlecoDict at some point, but it's going to be a long time before we release a user-friendly, well-supported dictionary converter - there's just not that much demand for one, and there are too many other things we're working on.

adamlau · Sep 11, 2005

MakePlecoDict: Oversized char block in line...

cedict.utx (UTF-16, Little Endian, CR/LF Pairs) available here:

http://www.nnotary.com/documents/cedict.utx

cedict.utx follows the format of the MakePlecoDict Sample.txt. When running MakePlecoDict.exe on cedict.utx (based on CEDICT UTF-8, 3 September 2005) stdout displays the following:

Oversized char block in line 10
Oversized char block in line 32
Oversized char block in line 255
Oversized char block in line 758
Oversized char block in line 826
Oversized char block in line 1215
Oversized char block in line 1427
Oversized char block in line 1764
Oversized char block in line 1910
Oversized char block in line 2997
Oversized char block in line 3123

Are these messages errors, or are they merely informational, i.e. can the created database still be used within PlecoDict?

mikelove · Sep 11, 2005

This means that there are too many characters before the first tab in the line - the database should work OK (it just won't index anything past the 14th character in the line) but it might mean that you haven't correctly separated the characters/Pinyin/definition.
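To find the affected entries before running the converter, a rough check along these lines may help. It is only a sketch: the function name `oversized_lines` is made up here, and it counts Python code points, which may not be exactly how MakePlecoDict counts characters.

```python
def oversized_lines(lines, limit=14):
    """Return 1-based numbers of lines whose text before the first tab exceeds limit."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        # Drop a leading byte order mark and the line ending before measuring.
        text = line.lstrip("\ufeff").rstrip("\r\n")
        first_field = text.split("\t", 1)[0]
        if len(first_field) > limit:
            flagged.append(number)
    return flagged


# Example use against a UTF-16 Little Endian file (the file name is a placeholder):
# with open("cedict.utx", encoding="utf-16-le") as handle:
#     print(oversized_lines(handle))
```

A line with no tab at all is measured in full, so a badly separated entry should show up in the result as well.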

Anonymous · Sep 12, 2005

mikelove said:
...the database...just won't index anything past the 14th character in the line...

Is this a limitation to be lifted in the official PlecoDict 1.0.2?

...but it might mean that you haven't correctly separated the characters/Pinyin/definition.

If time permits, please check cedict.utx for proper formatting:

http://www.nnotary.com/documents/cedict.utx

mikelove · Sep 12, 2005

No, that limitation's going to be in there for a while - it's not a big deal, all it means is that if you're entering a phrase longer than 14 characters you can't search for any character past the 14th one. Which I don't imagine is likely to cause anyone any problems, since very few Chinese words/phrases even go past 4 characters.

And yes, we converted that file to the CEDICT custom database with our own tools without any problems.

adamlau · Sep 12, 2005

The limitation does have an impact, because CEDICT UTF-8 lists both Traditional and Simplified characters. Depending on whether the whitespace between the two character-set forms is counted, phrase input is limited to a maximum of 6 characters (if the whitespace counts as a character) or 7 characters. Idioms are affected...

mikelove · Sep 12, 2005

No, because PlecoDict indexes simplified and traditional characters independently, so you'd get the first 14 characters of the simplified and the first 14 characters of the traditional part. Though even with 8-character idioms it's highly doubtful you'd need to search past the 4th or 5th character.

adamlau · Sep 12, 2005

The cedict.pdb database I successfully created and synced (after first deleting the PlecoDict-supplied version) did not take on the yellow CED icon. Will this be corrected by 1.0.2 official?

mikelove · Sep 12, 2005

No, because unfortunately there's not any facility in MakeDict for assigning icons to dictionaries.
