Segmenting CEDICT-Tatoeba Dictionary Generator for any language


I rewrote the «CEDICT Tatoeba example sentences dictionary generator» and added word segmentation. A search for "中" will no longer produce example sentences that merely contain 中国人; those sentences appear only when you display the dictionary definition for 中国人. The script scans each sentence for words in both the forward and the backward direction to maximize word discovery. (Whenever a string of Chinese characters serves as both the ending of one CEDICT keyword and the beginning of another, the segmentation depends on the direction in which you scan the sentence. Both results are useful, so the script checks for both.)
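For anyone curious how scanning in both directions plays out, here is a minimal sketch using greedy longest-match segmentation. It is not the code from the attached script: the function names, the tiny VOCAB set (standing in for the CEDICT keywords), and the fallback to single characters are my own assumptions for illustration.

```python
# Hypothetical stand-in for the set of CEDICT keywords.
VOCAB = {"中", "中国", "中国人", "我", "是", "研究", "研究生", "生命", "起源"}


def forward_segment(sentence, vocab, max_len):
    """Greedy longest match, scanning left to right."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def backward_segment(sentence, vocab, max_len):
    """Greedy longest match, scanning right to left."""
    words = []
    j = len(sentence)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = sentence[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    words.reverse()  # restore reading order
    return words


def words_in_sentence(sentence, vocab):
    """Union of the dictionary words found by both scan directions."""
    max_len = max(map(len, vocab), default=1)
    found = set(forward_segment(sentence, vocab, max_len))
    found |= set(backward_segment(sentence, vocab, max_len))
    # Drop leftover single characters that are not dictionary entries.
    return {word for word in found if word in vocab}


print(words_in_sentence("我是中国人", VOCAB))
print(words_in_sentence("研究生命起源", VOCAB))
```

In the first sentence both directions agree on 我 / 是 / 中国人, so the sentence is not filed under 中. In the second, the forward scan finds 研究生 while the backward scan finds 研究 and 生命, so the sentence would be attached to all of those entries.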

Tatoeba languages other than English can be selected. The CEDICT definition will remain in English. The export of the dictionary from Python takes about 15 minutes, and its import into Pleco will take quite a bit longer. Here is an example dictionary entry for French from the 118,000 current CEDICT terms:


Hanzi: 宁愿[寧願]
Pinyin: ning4 yuan4

- would rather
- better

Je préférerais démissionner que travailler sous ses ordres.

Je préférerais rester ici.

Je préfère commander une bière.

Je préférerais que vous preniez un jour de congé.


Je préférerais rester plutôt que de m'en aller.

Je préfère prendre des médicaments plutôt que d'avoir une piqûre.

Nous préférerions manger des escargots demain.

Definitions from CC-CEDICT, example sentences from


Many of the sentences are quite colloquial; it's just nice to get a bit more context and everyday usage than most dictionaries offer. CEDICT also includes a nice selection of Internet slang and modern usage.

I have attached the Python script, which includes some instructions. I uploaded the English, French, and German files here (the same link as for the version from 2 years back):

One could combine this script with an HSK level detection algorithm, so that beginners see only sentences with easier vocabulary.
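A rough sketch of what such a filter might look like, assuming the sentences have already been segmented into word lists and that you have some word-to-level table. None of this is in the attached script: the function names are hypothetical, and the levels in SAMPLE_LEVELS are placeholder values, not taken from an official HSK list.

```python
# Placeholder word -> level table; real values would come from an HSK word list.
SAMPLE_LEVELS = {"我": 1, "是": 1, "中国人": 1, "宁愿": 5, "研究": 4}


def sentence_level(words, hsk_levels, unknown_level=7):
    """Level of the hardest word; words missing from the table count as unknown_level."""
    return max((hsk_levels.get(word, unknown_level) for word in words), default=0)


def filter_by_level(segmented_sentences, hsk_levels, max_level):
    """Keep only the sentences whose hardest word is at or below max_level."""
    return [
        words
        for words in segmented_sentences
        if sentence_level(words, hsk_levels) <= max_level
    ]


sentences = [
    ["我", "是", "中国人"],
    ["我", "宁愿", "研究"],
]
print(filter_by_level(sentences, SAMPLE_LEVELS, max_level=2))
```

Treating unlisted words as harder than every listed level is a conservative choice; punctuation would need to be stripped from the word lists first, otherwise it would count as an unknown word.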




