Official MoEDict Pleco Release

Abun

榜眼
Do you have links to the data in a usable format and some general information? I don't really know anything about Minnan or Hakka.
To be honest, I have no idea what data formats they are stored in or how easy it is to export them. I quickly scanned the source code for hints, but couldn't find any. Then again, I believe database query code doesn't usually show up in the source code anyway (at least if it's written in PHP, it shouldn't, right?). In any case, I guess the format is likely to be similar to the Mandarin database, right? Anyway, the web pages are:
http://twblg.dict.edu.tw/holodict_new/index.htm (Minnan)
http://hakka.dict.edu.tw/hakkadict/index.htm (Hakka)
Considering the stuff you got converted already, you might well succeed where I fail :D

The page for the 台日大辭典 is http://taigi.fhl.net/dict/ although sadly only a translated version (into Taiwanese Minnan) can be searched. You do get a link to a scan of the corresponding original page with every entry though.
 

alex_hk90

状元
Thanks @Abun.

If they are in the same format as MoEDict then the g0v.tw team has already done much of the hard work for importing into a usable format:
http://www.plecoforums.com/threads/the-moe-dictionary-is-now-open-source.3606/#post-29296
(In fact @audreyt mentions the Hakka dictionary in the above post.)

Can you find the relevant links in the following page?
https://g0v.hackpad.com/3du.tw-ZNwaun62BP4
 

Abun

榜眼
For Minnan: The first link provided by @audreyt under "collection" doesn't seem to work (it takes you to a page containing a poem). There is another one under the parsing area which does work, though (https://github.com/g0v/moedict-data-twblg). The dict-twbl.json file in that directory looks usable, although the documentation seems to suggest that there are more up-to-date versions, which I can't find.
For Hakka, the archive at http://www.audreyt.org/newdict/hakka.tar.gz ultimately contains a folder with an individual HTML document for each entry; I don't know how easy that is to work with.

For Minnan at least, one would also have to decide how to deal with non-standard-Unicode characters such as 亻因. Many of these are supported in the newer extensions (C and later), but a few are not. For Hakka I can't say how big this problem is because I can't speak Hakka.
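
As a rough illustration of how one might find the affected characters before deciding what to do with them, here is a minimal Python sketch that reports which Unicode block each character falls in. The ranges are the standard Unicode block boundaries; the function names are made up for this example, and whether a given font or app can actually draw a character is a separate question this does not answer.

```python
# Unicode block ranges relevant to rare Han characters.
# Private Use Area is included because characters missing from Unicode
# are sometimes stored there (an assumption about the data, not a
# confirmed property of these dictionaries).
CJK_BLOCKS = [
    (0x3400, 0x4DBF, "Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xE000, 0xF8FF, "Private Use Area"),
    (0x20000, 0x2A6DF, "Extension B"),
    (0x2A700, 0x2B73F, "Extension C"),
    (0x2B740, 0x2B81F, "Extension D"),
    (0x2B820, 0x2CEAF, "Extension E"),
    (0x2CEB0, 0x2EBEF, "Extension F"),
]


def cjk_block(ch):
    """Return the block name for a single character, or None if it is
    outside all of the ranges listed above."""
    code = ord(ch)
    for start, end, name in CJK_BLOCKS:
        if start <= code <= end:
            return name
    return None


def unusual_characters(text):
    """List (character, block) pairs for everything that is not in the
    basic CJK Unified Ideographs block."""
    found = []
    for ch in text:
        block = cjk_block(ch)
        if block is not None and block != "CJK Unified Ideographs":
            found.append((ch, block))
    return found
```

Running `unusual_characters` over every headword and definition would give a list of the characters that need the extended font or some other workaround.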

Btw mikelove, is it possible to define two fonts for characters, one as the main font and the other as a fallback in case the main one doesn't support a character? I'm asking because I have so far been unable to find a single font which covers both the basic CJK range and the newer CJK extensions. Usually there is only a supplementary file which contains just the extensions, so in order to display a text which contains both "normal" and extension characters, you have to switch fonts for the new ones.

EDIT: I just discovered that the JSON file for Minnan seems to use combining diacritics instead of precomposed accented vowels (e.g. a followed by the combining macron U+0304, rather than the single precomposed character ā, U+0101, for the letter a with a macron). At least, if I copy and paste the text into a text editor, I can delete the diacritic and the letter stays. I guess that would greatly simplify the program needed to convert the romanization with diacritics into romanization with tone numbers.
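
To make the idea concrete, here is a minimal Python sketch of such a conversion. It assumes the romanization is Tâi-lô and that the tone marks map to numbers as in the table below (unmarked syllables are tone 4 if they end in p, t, k or h, otherwise tone 1); both the mapping and the function names are assumptions for illustration, not something taken from the data set's documentation. Normalizing to NFD first means it works whether the input uses combining diacritics or precomposed letters.

```python
import unicodedata

# Assumed mapping from combining tone marks to Tâi-lô tone numbers.
TONE_MARKS = {
    "\u0301": "2",  # combining acute accent
    "\u0300": "3",  # combining grave accent
    "\u0302": "5",  # combining circumflex
    "\u0304": "7",  # combining macron
    "\u030d": "8",  # combining vertical line above
    "\u030b": "9",  # combining double acute
}


def syllable_to_numeric(syllable):
    """Strip the tone mark from one syllable and append a tone number."""
    if not syllable:
        return ""
    tone = ""
    letters = []
    # NFD splits precomposed letters into base letter + combining mark.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)
    base = unicodedata.normalize("NFC", "".join(letters))
    if not tone:
        # Unmarked syllables: tone 4 with a stop final, otherwise tone 1.
        tone = "4" if base[-1:].lower() in ("p", "t", "k", "h") else "1"
    return base + tone


def word_to_numeric(word):
    """Convert a hyphen-joined word syllable by syllable."""
    return "-".join(syllable_to_numeric(s) for s in word.split("-"))


print(word_to_numeric("t\u00e2i-g\u00ed"))  # tai5-gi2
```

Punctuation and capitalization are not handled here; a real converter would need to split those off before looking at the final letter.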
 

mikelove

皇帝
Staff member
We do that now, actually - download the Extended Chinese Font in Add-ons and it'll use that to draw characters from the newer extensions.
 

Abun

榜眼
Ah, I hadn't realized that, thanks! Does the list cover the extensions completely, though? In a quick test, I was able to find (敖 over 力, Ext. B) and (亻因, Ext. C), but not (辶日, Ext. B) or (魚隶, Ext. C, although there is a version which has the four dots of 魚 replaced with 大; not sure if that's a variant character). And are there plans to include them in the normal search methods? I could only access them by going to the char page of one of the components and scrolling down until I found the character I was looking for.
 

mikelove

皇帝
Staff member
Were you looking these up in the dictionary? Have you downloaded the extended version of the Unihan database as well?
 

mikelove

皇帝
Staff member
We don't cover them 100% in the dictionary yet, even with Extended Unihan - some extended characters may only be viewable in documents or in user dictionary entries. If you copy one of these characters to the clipboard and open up the Clip Reader does it display correctly in there?
 

Abun

榜眼
I entered them into the search bar with a Minnan keyboard. That worked, even for the characters I couldn't find in the char lists before (i.e. I get a result, although the character isn't displayed in the search bar itself).
 
image.jpg
Works on iOS....
image.jpg
 

alex_hk90

状元
As mentioned before I don't know anything about Minnan but the JSON file you have linked to looks pretty clean - shouldn't be too difficult to use that and convert to Pleco flashcards / user dictionary format. Whether it will make any sense given the above discussion about romanisation / etc. is another question.
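
A rough sketch of what that conversion could look like in Python is below. It assumes the JSON is a list of MoEDict-style entries with the keys `title`, `heteronyms`, `trs` (romanization), `definitions` and `def`, and that the import file should be tab-separated as headword, pronunciation, definition; both are assumptions that would need checking against the actual file and against Pleco's import settings. The sample entry is invented for the example.

```python
import json


def clean(field):
    """Keep a field on one line, since tabs and newlines are separators."""
    return " ".join(str(field).split())


def entries_to_tsv(entries):
    """Turn MoEDict-style entries (assumed schema) into tab-separated lines:
    headword <TAB> pronunciation <TAB> definitions joined with '; '."""
    lines = []
    for entry in entries:
        title = entry.get("title", "")
        for heteronym in entry.get("heteronyms", []):
            reading = heteronym.get("trs", "")
            defs = [d.get("def", "") for d in heteronym.get("definitions", [])]
            definition = "; ".join(d for d in defs if d)
            if title and definition:
                lines.append("\t".join(clean(f) for f in (title, reading, definition)))
    return "\n".join(lines)


# Invented sample entry, standing in for json.load() on the real file.
sample = json.loads(
    '[{"title": "\\u98df", "heteronyms": [{"trs": "tsia\\u030dh",'
    ' "definitions": [{"def": "to eat"}, {"def": "to drink"}]}]}]'
)
print(entries_to_tsv(sample))
```

For the real data one would replace `sample` with the result of `json.load` on the downloaded file and write the returned string to a UTF-8 text file.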

We are going a bit off-topic here so maybe there should be a new thread for further discussion on this?
 

mikelove

皇帝
Staff member
"Soon" for us is a pretty broad term :) Other interesting dictionary held up by some issues we discovered with one part of it which we're rapidly addressing now.
 

mikelove

皇帝
Staff member
That's the one I was alluding to, yes, though certainly not the only interesting thing in the pipeline...
 