20120601 Parse Cantonese CEDICT File to Pleco User Dictionary Format
---
Dictionary Sources:
http://writecantonese8.wordpress.com/2012/02/04/cantonese-cedict/
http://www.mdbg.net/chindict/chindict.php?page=cedict
1. Parse Cantonese CEDICT (YEDICT) file to a single-delimiter format ("@" chosen as it is not otherwise used in the file).
From:
{traditional} {simplified} [{cantonese}] [{mandarin}] /{definitions}
(.+?)\s(.+?)\s\[(.*?)\]\s\[(.*?)\]\s/(.+)\n
揸 揸 [ja1] [zha1] /to hold/to drive/to pilot/to make a decision/
劇本 剧本 [kek6 bun2] [ju4 ben3] /script for play, opera, movie etc/screenplay/
To:
{traditional}@{simplified}@[{cantonese}]@[{mandarin}]@/{definitions}
\1@\2@[\3]@[\4]@/\5\n
揸@揸@[ja1]@[zha1]@/to hold/to drive/to pilot/to make a decision/
劇本@剧本@[kek6 bun2]@[ju4 ben3]@/script for play, opera, movie etc/screenplay/
i.e.
sed -r 's_(.+?)\s(.+?)\s\[(.*?)\]\s\[(.*?)\]\s/(.+)_\1@\2@[\3]@[\4]@/\5_g' <yedict_20100108.u8 >yedict_20100108.atsv
---
2. Clean up misformatted entries.
Using a spreadsheet (LibreOffice Calc), filter for entries which do not have the expected format:
- "[" at start of Cantonese
- "[" at start of Mandarin
- "/" at start of definition
Fix these few entries manually (in the original text file); a command-line version of the same filter is sketched below.
Note: really this is better done using a database (like SQLite, see below).
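The filter as a one-liner, assuming gawk and the @-delimited file from step 1 (prints line number and row for anything malformed):
gawk -F'@' '$3 !~ /^\[/ || $4 !~ /^\[/ || $5 !~ /^\// { print NR ": " $0 }' yedict_20100108.atsv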
---
3. Clean up definition entries.
Note: to import into an SQLite database, " must be escaped as "" (this specifically affects entries with unclosed quotes).
a. Add missing Pinyin where possible, from (Mandarin) CEDICT (and cjklib?).
Use the following substitution for CEDICT (which has only Mandarin, no Cantonese pronunciation):
sed -r 's_(.+?)\s(.+?)\s\[(.*?)\]\s/(.+)_\1@\2@[\3]@/\4_g' <cedict_ts.u8 >cedict_ts.atsv
Then for YEDICT entries where Mandarin blank ("[]"), cross-reference with CEDICT,
looking up (Mandarin) Pinyin where Simplified matches.
SQLite command to insert into new table with cross-referenced matches:
insert into yedict_new (simplified, traditional, mandarin, cantonese, definition)
select yedict.simplified, yedict.traditional, cedict.mandarin, yedict.cantonese, yedict.definition
from yedict join cedict on yedict.simplified = cedict.simplified
where yedict.mandarin = '[]';
But note this will insert duplicates where a simplified form has multiple CEDICT matches, so these need fixing manually (in practice only a few hundred matches out of over 6,000 missing); a query to spot them in advance is sketched below.
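To see which simplified forms will fan out into duplicate rows, a query along these lines:
select simplified, count(*) from cedict group by simplified having count(*) > 1;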
Note: the vast majority of entries without Pinyin are Cantonese-specific words and terms, often including a character not used in Mandarin and so without a well-defined pronunciation. Other cases are phrases made up of words that do have Mandarin pronunciations, but where the phrase itself is not in CEDICT and so is probably rarely (if ever) used in Mandarin. To deal with these, some kind of procedure is needed for splitting the entries and then finding the Pinyin for the individual components; a crude sketch follows.
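A greedy longest-prefix segmentation sketch, as a starting point only: assumes gawk in a UTF-8 locale (so length()/substr() work per character) and the .atsv files produced above; it just prints the simplified form plus candidate Pinyin for manual review:
gawk -F'@' '
NR == FNR {                                  # pass 1: CEDICT, keyed on simplified
    gsub(/[][]/, "", $3)                     # strip the [ ] around the Pinyin
    py[$2] = $3
    next
}
$4 == "[]" {                                 # pass 2: YEDICT rows missing Pinyin
    w = $2; out = ""
    while (length(w) > 0) {
        for (n = length(w); n > 0; n--)      # longest matching prefix wins
            if (substr(w, 1, n) in py) break
        if (n == 0) { out = ""; break }      # unsegmentable: give up on this row
        out = out (out == "" ? "" : " ") py[substr(w, 1, n)]
        w = substr(w, n + 1)
    }
    if (out != "") print $2 "\t" out
}' cedict_ts.atsv yedict_20100108.atsv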
Note 2: there are also quite a few entries with incomplete Pinyin; these could be found by checking that the number of syllables matches the number of characters (see the sketch below).
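A rough version of that check, assuming gawk in a UTF-8 locale (so length() counts characters) and the original trad@simp@[cantonese]@[mandarin]@/defs field order:
gawk -F'@' '{
    mand = $4; gsub(/[][]/, "", mand)              # strip the [ ] around the Pinyin
    if (mand != "" && split(mand, s, " ") != length($1))
        print NR ": " $0                           # syllable count != character count
}' yedict_20100108.atsv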
b. Standardise Cantonese romanisations (to Jyutping, from Yale).
Complication: only the Yale entries must be converted, without touching those already in Jyutping.
So only convert Cantonese fields containing no numeric tone digits (1-6).
Probably easier to first create a new file containing only the entries to be converted, convert everything in that file, then recombine.
Can use cjknife (from cjklib) for the actual conversion:
cjknife -s CantoneseYale -t Jyutping -m "[cantonese]"
Identify and transfer Yale (non-Jyutping) Cantonese entries:
insert into yedict_yale select * from yedict_final
where not (cantonese like '%1%' OR cantonese like '%2%' OR cantonese like '%3%'
    OR cantonese like '%4%' OR cantonese like '%5%' OR cantonese like '%6%');
Output yedict_yale to yedict_yale.atsv, then process line by line in bash:
while read -r line;\
do YALE=`echo "$line" | sed -r "s_(.+?)@(.+?)@\[(.+?)\]@\[(.+?)\]@(.+)_\4_g"`;\
JP=`cjknife -s CantoneseYale -t Jyutping -m "$YALE"`;\
ROW=`echo "$line" | sed -r "s_(.+?)@(.+?)@\[(.+?)\]@\[(.+?)\]@(.+)_\1@\2@\[\3\]@\[$JP\]@\5_g"`;\
echo "$ROW" >> yedict_jpnew.atsv;\
done < yedict_yale.atsv;
Note: this works but is very slow (several subprocesses per line, plus cjknife start-up each time); significant efficiency improvements are almost certainly possible, e.g. the variant below.
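A lighter version, assuming (as the sed above does) that the fourth @-field holds the Yale Cantonese: let read split the fields, so cjknife is the only subprocess per line, and redirect once rather than appending per line:
while IFS='@' read -r f1 f2 f3 canto defs; do
    yale=${canto#[}; yale=${yale%]}                    # strip the [ ] brackets
    jp=$(cjknife -s CantoneseYale -t Jyutping -m "$yale")
    printf '%s@%s@%s@[%s]@%s\n' "$f1" "$f2" "$f3" "$jp" "$defs"
done < yedict_yale.atsv > yedict_jpnew.atsv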
Convert Yale to Jyutping in the definitions themselves - only around 100, so can be done manually using the regex search '(.+?)/(.+?)\[(.+)/'.
Note: it appears that even in the other 200,000+ entries there is some use of Yale; could script a check and replace as follows (a scan for candidates is sketched after this list):
- look for definitions containing the pattern "...[Cantonese]..."
- check Cantonese for tone numbers (1-6)
- if not found then convert from Yale to Jyutping, and reinsert back in place
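A sketch of the first two steps, assuming the definition is the last, fifth @-field; it only prints candidates for review, leaving the conversion and reinsertion to be done as above:
gawk -F'@' '{
    d = $5
    while (match(d, /\[[^]]+\]/)) {                    # each bracketed romanisation
        r = substr(d, RSTART + 1, RLENGTH - 2)
        if (r !~ /[1-6]/) print NR ": " r              # no tone digits: likely Yale
        d = substr(d, RSTART + RLENGTH)
    }
}' yedict_recombined.atsv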
Reimport and recombine (all quite simple):
- Import into new table yedict_jpnew.
- Insert the entries already in Jyutping from yedict_final into yedict_recombined.
insert into yedict_recombined select * from yedict_final
where (cantonese like '%1%' OR cantonese like '%2%' OR cantonese like '%3%'
    OR cantonese like '%4%' OR cantonese like '%5%' OR cantonese like '%6%');
- Insert from yedict_jpnew to yedict_recombined.
insert into yedict_recombined select * from yedict_jpnew;
c. Parse definitions and add Pleco definition formatting.
Replace separators (/) with (Pleco) newline and (standard Unicode) bullet point, with exceptions for first and last separator.
sed can accept Unicode characters directly, or escaped as UTF-8 byte sequences (http://www.utf8-chartable.de/unicode-utf8-table.pl)
e.g. bullet character [•] (U+2022) = '\xe2\x80\xa2'
Pleco dictionary formatting special codes (unofficial, subject to change) [private use Unicode]:
---
EAB1 = new line
EAB2/EAB3 = bold
EAB4/EAB5 = italic
EAB8/EABB = "copy-whatever's-in-this-to-the-Input-Field hyperlinks"
coloured text:
"EAC1 followed by two characters with the highest-order bit 1 and the lowest-order 12 bits representing the first/second halves of a 24-bit RGB color value to start the range, EAC2 to end. So to render a character in green, for example, you'd want EAC1 800F 8F00, then the character, then EAC2."
---
UTF-8: U+EAB1 = '\xee\xaa\xb1'
Deal with first and last separators (replace first with bullet-space only, last with nothing):
sed -r 's_(.+?)@/(.+)/_\1@\xe2\x80\xa2\x20\2_g' <yedict_recombined.atsv >yedict_plecodefstemp.atsv
Middle separators (replace with newline-bullet-space):
sed -r 's_/_\xee\xaa\xb1\xe2\x80\xa2\x20_g' <yedict_plecodefstemp.atsv >yedict_plecodefs.atsv
Note: known issue where a slash is used as part of a definition (e.g. "...a girl/woman"), but difficult to see an alternative to manual corrections here.
Also a known issue with some Cantonese fields which use slashes to indicate alternate pronunciations; could improve the above by applying the second substitution only to the definition field rather than to entire rows (awk can do this per field - see the sketch below - without resorting to row-by-row processing).
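A field-restricted version of the middle-separator substitution, assuming gawk (for the \x escapes) and that the definition is the last, fifth @-field; it replaces the second sed above, still running after the first/last separators have been handled:
gawk -F'@' 'BEGIN { OFS = "@" } {
    gsub(/\//, "\xee\xaa\xb1\xe2\x80\xa2 ", $5)        # newline-bullet-space
    print
}' yedict_plecodefstemp.atsv > yedict_plecodefs.atsv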
d. Deal with multiple pronunciations.
Currently left as "py1 | py2" - split and duplicate? Probably not worth it.
Note: after the final export, remember to clean up (i.e. reverse) the escaped " (back to " from "").
sed -r 's_""_"_g' <yedict_plecodefs.atsv >yedict_plecodefsfinal.atsv
---
4. Convert to Pleco user dictionary (flashcard) text format.
From:
{traditional}@{simplified}@[{cantonese}]@[{mandarin}]@{definitions}
(.+?)@(.+?)@\[(.+?)\]@\[(.+?)\]@(.+)\n
揸@揸@[ja1]@[zha1]@/to hold/to drive/to pilot/to make a decision/
劇本@剧本@[kek6 bun2]@[ju4 ben3]@/script for play, opera, movie etc/screenplay/
To:
{simplified}[{traditional}]{ TAB }{mandarin}{ TAB }[{cantonese}]{ NEW LINE }{definitions}
\2[\1]\t\4\t[\3]\xee\xaa\xb1\5\n
揸[揸] zha1 [ja1]/to hold/to drive/to pilot/to make a decision/
剧本[劇本] ju4 ben3 [kek6 bun2]/script for play, opera, movie etc/screenplay/
(For readability these examples show the definitions as raw slash-delimited text; by this stage they actually contain the Pleco formatting codes from step 3c, and the field separators shown as spaces are tabs.)
i.e.
sed -r 's_(.+?)@(.+?)@\[(.+?)\]@\[(.+?)\]@(.+)_\2[\1]\t\4\t[\3]\xee\xaa\xb1\5_g' <yedict_plecodefsfinal.atsv >yedict_plecocards.txt
---
5. Import to Pleco User Dictionary (SQLite) database format.
Very slow using Pleco itself, so create the dictionary file (in Pleco) then manually add the entries from the CSV-esque file using SQLite Browser?
This looks plausible but not that straightforward, with a sortkey and sort tables to work out.
Relative difficulty is intentional to deter use of pirated dictionaries.
Perhaps consider using the Android SDK and a virtual machine (emulator) to do this?
At the least, need to split into smaller Pleco flashcard format text files for the import, say 10,000 entries at a time (see below).
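GNU split can do the chunking, e.g.:
split -l 10000 -d yedict_plecocards.txt yedict_plecocards_
This produces yedict_plecocards_00, yedict_plecocards_01, etc.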