20120601 Parse Cantonese CEDICT File to Pleco User Dictionary Format
---
Dictionary Sources:
http://writecantonese8.wordpress.com/2012/02/04/cantonese-cedict/
http://www.mdbg.net/chindict/chindict.php?page=cedict
1. Parse Cantonese CEDICT (YEDICT) file to a single-delimiter format ("@" chosen as it is not otherwise used in the file).
From:
{traditional} {simplified} [{cantonese}] [{mandarin}] /{definitions}
(.+?)\s(.+?)\s\[(.*?)\]\s\[(.*?)\]\s/(.+)\n
揸 揸 [ja1] [zha1] /to hold/to drive/to pilot/to make a decision/
劇本 剧本 [kek6 bun2] [ju4 ben3] /script for play, opera, movie etc/screenplay/
To:
{traditional}@{simplified}@[{cantonese}]@[{mandarin}]@/{definitions}
\1@\2@[\3]@[\4]@/\5\n
揸@揸@[ja1]@[zha1]@/to hold/to drive/to pilot/to make a decision/
劇本@剧本@[kek6 bun2]@[ju4 ben3]@/script for play, opera, movie etc/screenplay/
i.e.
sed -r 's_(.+?)\s(.+?)\s\[(.*?)\]\s\[(.*?)\]\s/(.+)_\1@\2@[\3]@[\4]@/\5_g' <yedict_20100108.u8 >yedict_20100108.atsv
---
2. Clean up misformatted entries.
Using a spreadsheet (LibreOffice Calc), filter for entries which do not have the expected format:
- "[" at start of Cantonese
- "[" at start of Mandarin
- "/" at start of definition
Fix these few entries manually (in the original text file); a command-line version of the same filter is sketched below.
Note: really this is better done using a database (like SQLite, see below).
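The filter as a one-liner, assuming gawk and the @-delimited file from step 1 (prints line number and row for anything malformed):
gawk -F'@' '$3 !~ /^\[/ || $4 !~ /^\[/ || $5 !~ /^\// { print NR ": " $0 }' yedict_20100108.atsv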
---
3. Clean up definition entries.
Note: to import into an SQLite database, " must be escaped as "" (this specifically affects entries with unclosed quotes).
a. Add missing Pinyin where possible, from (Mandarin) CEDICT (and cjklib?).
Use the following substitution for CEDICT (which has only Mandarin, no Cantonese pronunciation):
sed -r 's_(.+?)\s(.+?)\s\[(.*?)\]\s/(.+)_\1@\2@[\3]@/\4_g' <cedict_ts.u8 >cedict_ts.atsv
Then for YEDICT entries where Mandarin blank ("[]"), cross-reference with CEDICT,
looking up (Mandarin) Pinyin where Simplified matches.
SQLite command to insert into new table with cross-referenced matches:
insert into yedict_new (simplified, traditional, mandarin, cantonese, definition)
select yedict.simplified, yedict.traditional, cedict.mandarin, yedict.cantonese, yedict.definition
from yedict join cedict on yedict.simplified = cedict.simplified
where yedict.mandarin = '[]';
But note this will insert duplicates where a simplified form has multiple CEDICT matches, so these need fixing manually (in practice only a few hundred matches out of over 6,000 missing); a query to spot them in advance is sketched below.
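To see which simplified forms will fan out into duplicate rows, a query along these lines:
select simplified, count(*) from cedict group by simplified having count(*) > 1;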
Note: the vast majority of entries without Pinyin are Cantonese-specific words and terms, often including a character not used in Mandarin and so without a well-defined pronunciation. Other cases are phrases made up of words that do have Mandarin pronunciations, but where the phrase itself is not in CEDICT and so is probably rarely (if ever) used in Mandarin. To deal with these, some kind of procedure is needed for splitting the entries and then finding the Pinyin for the individual components; a crude sketch follows.
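A greedy longest-prefix segmentation sketch, as a starting point only: assumes gawk in a UTF-8 locale (so length()/substr() work per character) and the .atsv files produced above; it just prints the simplified form plus candidate Pinyin for manual review:
gawk -F'@' '
NR == FNR {                                  # pass 1: CEDICT, keyed on simplified
    gsub(/[][]/, "", $3)                     # strip the [ ] around the Pinyin
    py[$2] = $3
    next
}
$4 == "[]" {                                 # pass 2: YEDICT rows missing Pinyin
    w = $2; out = ""
    while (length(w) > 0) {
        for (n = length(w); n > 0; n--)      # longest matching prefix wins
            if (substr(w, 1, n) in py) break
        if (n == 0) { out = ""; break }      # unsegmentable: give up on this row
        out = out (out == "" ? "" : " ") py[substr(w, 1, n)]
        w = substr(w, n + 1)
    }
    if (out != "") print $2 "\t" out
}' cedict_ts.atsv yedict_20100108.atsv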
Note 2: there are also quite a few entries with incomplete Pinyin; these could be found by checking that the number of syllables matches the number of characters (see the sketch below).
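A rough version of that check, assuming gawk in a UTF-8 locale (so length() counts characters) and the original trad@simp@[cantonese]@[mandarin]@/defs field order:
gawk -F'@' '{
    mand = $4; gsub(/[][]/, "", mand)              # strip the [ ] around the Pinyin
    if (mand != "" && split(mand, s, " ") != length($1))
        print NR ": " $0                           # syllable count != character count
}' yedict_20100108.atsv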
b. Standardise Cantonese romanisations (to Jyutping, from Yale).
Complication: only the Yale entries must be converted, without touching those already in Jyutping.
So only convert Cantonese fields containing no numeric tone digits (1-6).
Probably easier to first create a new file containing only the entries to be converted, convert everything in that file, then recombine.
Can use cjknife (from cjklib) for the actual conversion:
cjknife -s CantoneseYale -t Jyutping -m "[cantonese]"
Identify and transfer Yale (non-Jyutping) Cantonese entries:
insert into yedict_yale select * from yedict_final
where not (cantonese like '%1%' OR cantonese like '%2%' OR cantonese like '%3%'
    OR cantonese like '%4%' OR cantonese like '%5%' OR cantonese like '%6%');
Output yedict_yale to yedict_yale.atsv, then process line by line in bash:
while read -r line;\
do YALE=`echo "$line" | sed -r "s_(.+?)@(.+?)@\[(.+?)\]@\[(.+?)\]@(.+)_\4_g"`;\
JP=`cjknife -s CantoneseYale -t Jyutping -m "$YALE"`;\
ROW=`echo "$line" | sed -r "s_(.+?)@(.+?)@\[(.+?)\]@\[(.+?)\]@(.+)_\1@\2@\[\3\]@\[$JP\]@\5_g"`;\
echo "$ROW" >> yedict_jpnew.atsv;\
done < yedict_yale.atsv;
Note: this works but is very slow (several subprocesses per line, plus cjknife start-up each time); significant efficiency improvements are almost certainly possible, e.g. the variant below.
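A lighter version, assuming (as the sed above does) that the fourth @-field holds the Yale Cantonese: let read split the fields, so cjknife is the only subprocess per line, and redirect once rather than appending per line:
while IFS='@' read -r f1 f2 f3 canto defs; do
    yale=${canto#[}; yale=${yale%]}                    # strip the [ ] brackets
    jp=$(cjknife -s CantoneseYale -t Jyutping -m "$yale")
    printf '%s@%s@%s@[%s]@%s\n' "$f1" "$f2" "$f3" "$jp" "$defs"
done < yedict_yale.atsv > yedict_jpnew.atsv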
Convert Yale to Jyutping in the definitions themselves - only around 100, so can be done manually using the regex search '(.+?)/(.+?)\[(.+)/'.
Note: it appears that even in the other 200,000+ entries there is some use of Yale; could script a check and replace as follows (a scan for candidates is sketched after this list):
- look for definitions containing the pattern "...[Cantonese]..."
- check Cantonese for tone numbers (1-6)
- if not found then convert from Yale to Jyutping, and reinsert back in place
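A sketch of the first two steps, assuming the definition is the last, fifth @-field; it only prints candidates for review, leaving the conversion and reinsertion to be done as above:
gawk -F'@' '{
    d = $5
    while (match(d, /\[[^]]+\]/)) {                    # each bracketed romanisation
        r = substr(d, RSTART + 1, RLENGTH - 2)
        if (r !~ /[1-6]/) print NR ": " r              # no tone digits: likely Yale
        d = substr(d, RSTART + RLENGTH)
    }
}' yedict_recombined.atsv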
Reimport and recombine (all quite simple):
- Import into new table yedict_jpnew.
- Insert the entries already in Jyutping from yedict_final into yedict_recombined.
insert into yedict_recombined select * from yedict_final
where (cantonese like '%1%' OR cantonese like '%2%' OR cantonese like '%3%'
    OR cantonese like '%4%' OR cantonese like '%5%' OR cantonese like '%6%');
- Insert from yedict_jpnew to yedict_recombined.
insert into yedict_recombined select * from yedict_jpnew;
c. Parse definitions and add Pleco definition formatting.
Replace separators (/) with (Pleco) newline and (standard Unicode) bullet point, with exceptions for first and last separator.
sed can accept Unicode characters directly, or escaped as UTF-8 byte sequences (http://www.utf8-chartable.de/unicode-utf8-table.pl)
e.g. bullet character [•] (U+2022) = '\xe2\x80\xa2'
Pleco dictionary formatting special codes (unofficial, subject to change) [private use Unicode]:
---
EAB1 = new line
EAB2/EAB3 = bold
EAB4/EAB5 = italic
EAB8/EABB = "copy-whatever's-in-this-to-the-Input-Field hyperlinks"
coloured text:
"EAC1 followed by two characters with the highest-order bit 1 and the lowest-order 12 bits representing the first/second halves of a 24-bit RGB color value to start the range, EAC2 to end. So to render a character in green, for example, you'd want EAC1 800F 8F00, then the character, then EAC2."
---
UTF-8: U+EAB1 = '\xee\xaa\xb1'
Deal with first and last separators (replace first with bullet-space only, last with nothing):
sed -r 's_(.+?)@/(.+)/_\1@\xe2\x80\xa2\x20\2_g' <yedict_recombined.atsv >yedict_plecodefstemp.atsv
Middle separators (replace with newline-bullet-space):
sed -r 's_/_\xee\xaa\xb1\xe2\x80\xa2\x20_g' <yedict_plecodefstemp.atsv >yedict_plecodefs.atsv
Note: known issue where a slash is used as part of a definition (e.g. "...a girl/woman"), but difficult to see an alternative to manual corrections here.
Also a known issue with some Cantonese fields which use slashes to indicate alternate pronunciations; could improve the above by applying the second substitution only to the definition field rather than to entire rows (awk can do this per field - see the sketch below - without resorting to row-by-row processing).
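A field-restricted version of the middle-separator substitution, assuming gawk (for the \x escapes) and that the definition is the last, fifth @-field; it replaces the second sed above, still running after the first/last separators have been handled:
gawk -F'@' 'BEGIN { OFS = "@" } {
    gsub(/\//, "\xee\xaa\xb1\xe2\x80\xa2 ", $5)        # newline-bullet-space
    print
}' yedict_plecodefstemp.atsv > yedict_plecodefs.atsv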
d. Deal with multiple pronunciations.
Currently left as "py1 | py2" - split and duplicate? Probably not worth it.
Note: after the final export, remember to clean up (i.e. reverse) the escaped " (back to " from "").
sed -r 's_""_"_g' <yedict_plecodefs.atsv >yedict_plecodefsfinal.atsv
---
4. Convert to Pleco user dictionary (flashcard) text format.
From:
{traditional}@{simplified}@[{cantonese}]@[{mandarin}]@{definitions}
(.+?)@(.+?)@\[(.+?)\]@\[(.+?)\]@(.+)\n
揸@揸@[ja1]@[zha1]@/to hold/to drive/to pilot/to make a decision/
劇本@剧本@[kek6 bun2]@[ju4 ben3]@/script for play, opera, movie etc/screenplay/
To:
{simplified}[{traditional}]{ TAB }{mandarin}{ TAB }[{cantonese}]{ NEW LINE }{definitions}
\2[\1]\t\4\t[\3]\xee\xaa\xb1\5\n
揸[揸] zha1 [ja1]/to hold/to drive/to pilot/to make a decision/
剧本[劇本] ju4 ben3 [kek6 bun2]/script for play, opera, movie etc/screenplay/
(For readability these examples show the definitions as raw slash-delimited text; by this stage they actually contain the Pleco formatting codes from step 3c, and the field separators shown as spaces are tabs.)
i.e.
sed -r 's_(.+?)@(.+?)@\[(.+?)\]@\[(.+?)\]@(.+)_\2[\1]\t\4\t[\3]\xee\xaa\xb1\5_g' <yedict_plecodefsfinal.atsv >yedict_plecocards.txt
---
5. Import to Pleco User Dictionary (SQLite) database format.
Very slow using Pleco itself, so create the dictionary file (in Pleco) then manually add the entries from the CSV-esque file using SQLite Browser?
This looks plausible but not that straightforward, with a sortkey and sort tables to work out.
Relative difficulty is intentional to deter use of pirated dictionaries.
Perhaps consider using the Android SDK and a virtual machine (emulator) to do this?
At the least, need to split into smaller Pleco flashcard format text files for the import, say 10,000 entries at a time (see below).
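GNU split can do the chunking, e.g.:
split -l 10000 -d yedict_plecocards.txt yedict_plecocards_
This produces yedict_plecocards_00, yedict_plecocards_01, etc.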