EDIT: Work has progressed significantly; here is the link to the GitHub repository, which was kindly created by @alex_hk90: https://github.com/alexhk90/Pleco-User-Dictionaries/tree/master/MoE-Minnan
Opening a proper thread to continue the discussion about the possible inclusion of the online Minnan and Hakka dictionaries published by the Taiwanese Ministry of Education, which started here: http://plecoforums.com/threads/official-moedict-pleco-release.4915/.
Linking the databases I found again for reference:
For Minnan: The first link provided by @audreyt under "collection" doesn't seem to work (it takes you to a page containing a poem). There is another one in the parsing area which does work, though (https://github.com/g0v/moedict-data-twblg). The dict-twbl.json file in that repository looks usable, although the documentation seems to suggest that there are more up-to-date versions, which I can't find.
For Hakka: http://www.audreyt.org/newdict/hakka.tar.gz ultimately contains a folder with individual HTML documents for each entry; I don't know how easy that will be to work with.
@alex_hk90 said: "As mentioned before, I don't know anything about Minnan, but the JSON file you have linked to looks pretty clean - it shouldn't be too difficult to use that and convert it to Pleco flashcards / user dictionary format. Whether it will make any sense, given the above discussion about romanisation etc., is another question."

I guess that would depend on what qualifies as making sense to you. I personally would love to be able to use the MoE dict in Pleco.
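For what it's worth, the conversion to Pleco's flashcard text format (one entry per line, tab-separated: headword, pronunciation, definition) could look roughly like this. Note this is only a sketch: the field names "title", "trs" and "def" are my guesses at the JSON structure and would need to be checked against the actual dict-twblg.json file.

```javascript
// Rough sketch: convert dictionary entries to Pleco flashcard text format
// (headword<TAB>pronunciation<TAB>definition, one entry per line).
// The field names below are assumptions, not the file's confirmed schema.
function toPlecoLines(entries) {
  return entries
    .map(e => [e.title, e.trs, e.def].join("\t"))
    .join("\n");
}

// Hypothetical sample entry, for illustration only:
const sample = [{ title: "臺灣", trs: "Tâi-uân", def: "Taiwan" }];
console.log(toPlecoLines(sample));
```

The real script would read the JSON file and loop over all entries, but the per-entry shape would stay the same.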
As for the romanization issue, I think converting the diacritics to numbers would probably be easier here than it is for Pinyin. The reason is that syllables are always linked with a hyphen, so detecting a syllable boundary is as easy as searching for the character "-".
I must admit, I have so far only scratched the surface of PHP and JS, and I don't know anything about non-web-based programming at all. But speaking in pseudo-code, I guess such a script could look roughly like this:
1. Create an array that assigns an index to each dictionary entry.
2. For each entry:
   a. Read the pronunciation info into a string.
   b. Explode the string into a second array, using "-" as the separator; this splits the syllables apart.
   c. Search each syllable for combining diacritics. If one is found, delete it and append the corresponding tone number to the end of the syllable.
   d. Implode the resulting array back into a string, re-adding the separating "-".
   e. Write the changed string back into the pronunciation info of the entry.
The result of this process should be that diacritics are replaced with numbers at the end of each syllable. The 1st and 4th tones are not marked with diacritics and would therefore get no number. Maybe this can be fixed by detecting final consonants (the 4th tone is an entering tone, so it always ends in -p, -t, -k or -h (glottal stop)), but I don't think that's absolutely necessary. @alex_hk90, do you think that is reasonable?
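The steps above could be sketched in JS roughly as follows. This assumes Tâi-lô romanisation; the table mapping combining diacritics to tone numbers is my own and should be double-checked, and the unmarked-syllable handling (tone 4 vs. tone 1) uses the final-consonant test described above:

```javascript
// Tone marks as Unicode combining characters -> tone numbers (Tâi-lô).
// This mapping is an assumption and should be verified against the data.
const TONE_MARKS = {
  "\u0301": "2", // acute
  "\u0300": "3", // grave
  "\u0302": "5", // circumflex
  "\u0304": "7", // macron
  "\u030D": "8", // vertical line above
};

function convertSyllable(syllable) {
  // NFD decomposition separates base letters from combining diacritics,
  // so the tone mark becomes its own character we can detect and remove.
  let tone = "";
  let out = "";
  for (const ch of syllable.normalize("NFD")) {
    if (TONE_MARKS[ch]) {
      tone = TONE_MARKS[ch]; // remember the tone, drop the mark
    } else {
      out += ch;
    }
  }
  out = out.normalize("NFC");
  if (!tone) {
    // Unmarked syllable: entering tone (4) ends in -p/-t/-k/-h, else tone 1.
    tone = /[ptkh]$/.test(out) ? "4" : "1";
  }
  return out + tone;
}

function convertPronunciation(pron) {
  // Split on the hyphen that joins syllables, convert each, rejoin.
  return pron.split("-").map(convertSyllable).join("-");
}
```

For example, convertPronunciation("tâi-uân") should yield "tai5-uan5".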