Official MoEDict Pleco Release

Abun · Aug 25, 2015

alex_hk90 said:
Do you have links to the data in a usable format and some general information? I don't really know anything about Minnan or Hakka.

To be honest, I have no idea what data formats they are stored in or how easy it is to export them. I quickly scanned the page source for hints but couldn't find any. Then again, I believe database query code doesn't usually show up in the page source anyway (at least if it's written in PHP, it shouldn't, right?). In any case, I guess it's likely that the format will be similar to the Mandarin database, right? Anyway, the web pages are:
http://twblg.dict.edu.tw/holodict_new/index.htm (Minnan)
http://hakka.dict.edu.tw/hakkadict/index.htm (Hakka)
Considering the stuff you got converted already, you might well succeed where I fail.

The page for the 台日大辭典 is http://taigi.fhl.net/dict/ although sadly only a translated version (into Taiwanese Minnan) can be searched. You do get a link to a scan of the corresponding original page with every entry though.

alex_hk90 · Aug 25, 2015

Abun said:
To be honest, I have no idea what data formats they are stored in or how easy it is to export them. I quickly scanned the page source for hints but couldn't find any. Then again, I believe database query code doesn't usually show up in the page source anyway (at least if it's written in PHP, it shouldn't, right?). In any case, I guess it's likely that the format will be similar to the Mandarin database, right? Anyway, the web pages are:
http://twblg.dict.edu.tw/holodict_new/index.htm (Minnan)
http://hakka.dict.edu.tw/hakkadict/index.htm (Hakka)
Considering the stuff you got converted already, you might well succeed where I fail.

The page for the 台日大辭典 is http://taigi.fhl.net/dict/ although sadly only a translated version (into Taiwanese Minnan) can be searched. You do get a link to a scan of the corresponding original page with every entry though.

Thanks @Abun.

If they are in the same format as MoEDict then the g0v.tw team has already done much of the hard work for importing into a usable format:
http://www.plecoforums.com/threads/the-moe-dictionary-is-now-open-source.3606/#post-29296
(In fact @audreyt mentions the Hakka dictionary in the above post.)

Can you find the relevant links in the following page?
https://g0v.hackpad.com/3du.tw-ZNwaun62BP4

Abun · Aug 26, 2015

alex_hk90 said:
Can you find the relevant links in the following page?
https://g0v.hackpad.com/3du.tw-ZNwaun62BP4

For Minnan: The first link provided by @audreyt under "collection" doesn't seem to work (it takes you to a page containing a poem). There is another one under the parsing area which does work, though (https://github.com/g0v/moedict-data-twblg). The dict-twbl.json file in that directory looks usable, although the documentation seems to suggest that there are more up-to-date versions, which I can't find.
For Hakka, http://www.audreyt.org/newdict/hakka.tar.gz ultimately contains a folder with an individual HTML document for each entry; I don't know how easy that is to work with.

For Minnan at least, one would also have to decide how to deal with non-standard-Unicode characters such as 亻因. Many of these are supported in the newer extensions (C and later), but a few are not. For Hakka I can't say how big this problem is because I can't speak Hakka.
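To get a rough idea of how many entries are affected, one could check which Unicode block each character in the data falls into. What follows is only a minimal Python sketch: the names `CJK_BLOCKS`, `cjk_block` and `extension_chars` are made up for illustration, and the table stops at Extension E. Characters that are spelled out as two components side by side (like 亻因 above) are two ordinary characters and would not be flagged by this check.

```python
# Code point ranges of the CJK ideograph blocks (per the Unicode code charts).
# Extensions after E are left out of this sketch.
CJK_BLOCKS = [
    (0x3400, 0x4DBF, "Extension A"),
    (0x4E00, 0x9FFF, "Basic block"),
    (0x20000, 0x2A6DF, "Extension B"),
    (0x2A700, 0x2B73F, "Extension C"),
    (0x2B740, 0x2B81F, "Extension D"),
    (0x2B820, 0x2CEAF, "Extension E"),
]


def cjk_block(char):
    """Return the name of the CJK block containing char, or None."""
    code_point = ord(char)
    for start, end, name in CJK_BLOCKS:
        if start <= code_point <= end:
            return name
    return None


def extension_chars(text):
    """Return the sorted, de-duplicated extension-block characters in text."""
    found = set()
    for char in text:
        block = cjk_block(char)
        if block is not None and block.startswith("Extension"):
            found.add(char)
    return sorted(found)


if __name__ == "__main__":
    sample = "人" + chr(0x20000) + chr(0x2A736)
    for char in extension_chars(sample):
        print("U+%04X" % ord(char), cjk_block(char))
```

Running `extension_chars` over every headword in the data would give the list of characters that need a font with extension coverage.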

Btw mikelove, is it possible to define two fonts for characters, one as the main font and the other as a fallback in case the main one doesn't support a character? The reason I'm asking is that I have so far been unable to find a single font which covers both the basic CJK range and the newer CJK extensions. Usually there is only a supplementary font file which contains just the extensions. So in order to display a text which contains both "normal" and extension characters, you have to switch fonts for the new ones.

EDIT: Just discovered that the list in the JSON file for Minnan seems to use combining diacritics instead of hardcoded marked vowels (e.g. ā (a + combining diacritic ¯) instead of ā (hardcoded ā) for the letter a with a macron). At least if I copy and paste the code into a text editor, I can delete the diacritic and the letter stays. I guess that would greatly simplify the program needed to convert the romanization with diacritics into romanization with numbers.
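A minimal sketch of such a conversion in Python, under a few assumptions that are not confirmed anywhere in this thread: that the romanization is Tâi-lô, that the tone marks map to numbers as in the `TONE_MARKS` table below, and that unmarked syllables are tone 4 when they end in p, t, k or h and tone 1 otherwise. The function names are made up for illustration. Decomposing with NFD first means it doesn't matter whether the input uses combining diacritics or hardcoded marked vowels.

```python
import re
import unicodedata

# Assumed Tai-lo mapping from combining tone marks to tone numbers.
TONE_MARKS = {
    "\u0301": "2",  # combining acute accent
    "\u0300": "3",  # combining grave accent
    "\u0302": "5",  # combining circumflex accent
    "\u0304": "7",  # combining macron
    "\u030D": "8",  # combining vertical line above
}


def syllable_to_numbered(syllable):
    """Strip the tone mark from one syllable and append a tone number."""
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = None
    letters = []
    for char in decomposed:
        if char in TONE_MARKS:
            tone = TONE_MARKS[char]
        else:
            letters.append(char)
    base = unicodedata.normalize("NFC", "".join(letters))
    if not base:
        return base
    if tone is None:
        # Unmarked: assume tone 4 for stop finals, tone 1 otherwise.
        tone = "4" if base[-1].lower() in "ptkh" else "1"
    return base + tone


def to_numbered(text):
    """Convert hyphen- or space-separated syllables, keeping separators."""
    parts = re.split(r"([-\s]+)", text)
    # Even indices hold syllables, odd indices hold the separators.
    return "".join(
        part if index % 2 else syllable_to_numbered(part)
        for index, part in enumerate(parts)
    )


if __name__ == "__main__":
    print(to_numbered("Ta\u0302i-gi\u0301"))
```

Punctuation attached to a syllable is not handled here and would need to be split off before the conversion.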

mikelove · Aug 26, 2015

We do that now, actually - download the Extended Chinese Font in Add-ons and it'll use that to draw characters from the newer extensions.

Abun · Aug 26, 2015

mikelove said:
We do that now, actually - download the Extended Chinese Font in Add-ons and it'll use that to draw characters from the newer extensions.

Ah, I hadn't realized that, thanks! Does the list cover the complete extensions though? In a quick test, I was able to find (敖 over 力, Ext. B) and (亻因, Ext. C), but not (辶日, Ext. B) or (魚隶, Ext. C, although there is a version which has the four dots of 魚 replaced with 大; not sure if that's a variant character). And are there plans to include them in the normal search methods? I could only access them by going to the char page of one of the components and scrolling down until I found the character I was looking for.

mikelove · Aug 26, 2015

Were you looking these up in the dictionary? Have you downloaded the extended version of the Unihan database as well?

Abun · Aug 26, 2015

mikelove said:
Were you looking these up in the dictionary? Have you downloaded the extended version of the Unihan database as well?

I have.

mikelove · Aug 26, 2015

We don't cover them 100% in the dictionary yet, even with Extended Unihan - some extended characters may only be viewable in documents or in user dictionary entries. If you copy one of these characters to the clipboard and open up the Clip Reader does it display correctly in there?

Abun · Aug 26, 2015

mikelove said:
We don't cover them 100% in the dictionary yet, even with Extended Unihan - some extended characters may only be viewable in documents or in user dictionary entries. If you copy one of these characters to the clipboard and open up the Clip Reader does it display correctly in there?

I entered them into the search bar with a Minnan keyboard. That worked, even for the characters I couldn't find in the char lists before (i.e. I get a result, although the character isn't displayed in the searchbar itself).

ACardiganAndAFrown · Aug 26, 2015

Works on iOS....

alex_hk90 · Aug 30, 2015

Abun said:
For Minnan: The first link provided by @audreyt under "collection" doesn't seem to work (it takes you to a page containing a poem). There is another one under the parsing area which does work, though (https://github.com/g0v/moedict-data-twblg). The dict-twbl.json file in that directory looks usable, although the documentation seems to suggest that there are more up-to-date versions, which I can't find.
For Hakka, http://www.audreyt.org/newdict/hakka.tar.gz ultimately contains a folder with an individual HTML document for each entry; I don't know how easy that is to work with.

For Minnan at least, one would also have to decide how to deal with non-standard-Unicode characters such as 亻因. Many of these are supported in the newer extensions (C and later), but a few are not. For Hakka I can't say how big this problem is because I can't speak Hakka.

Btw mikelove, is it possible to define two fonts for characters, one as the main font and the other as a fallback in case the main one doesn't support a character? The reason I'm asking is that I have so far been unable to find a single font which covers both the basic CJK range and the newer CJK extensions. Usually there is only a supplementary font file which contains just the extensions. So in order to display a text which contains both "normal" and extension characters, you have to switch fonts for the new ones.

EDIT: Just discovered that the list in the JSON file for Minnan seems to use combining diacritics instead of hardcoded marked vowels (e.g. ā (a + combining diacritic ¯) instead of ā (hardcoded ā) for the letter a with a macron). At least if I copy and paste the code into a text editor, I can delete the diacritic and the letter stays. I guess that would greatly simplify the program needed to convert the romanization with diacritics into romanization with numbers.

As mentioned before, I don't know anything about Minnan, but the JSON file you have linked to looks pretty clean, so it shouldn't be too difficult to convert it to Pleco flashcards / user dictionary format. Whether it will make any sense given the above discussion about romanisation etc. is another question.
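As a rough illustration of what that conversion could look like, here is a Python sketch. Everything specific in it is an assumption rather than something checked against the real file: the field names (`title`, `heteronyms`, `trs`, `definitions`, `def`) are guessed by analogy with the Mandarin MoEDict data, the sample entry is made up, and the output is a plain tab-separated headword / pronunciation / definition line, which may need adjusting to whatever the Pleco importer actually expects.

```python
import json

# Made-up sample in the ASSUMED schema; the real file may differ.
SAMPLE_JSON = """
[
  {"title": "食",
   "heteronyms": [
     {"trs": "tsia\u030dh",
      "definitions": [{"def": "to eat"}, {"def": "to drink"}]}
   ]}
]
"""


def clean(text):
    """Collapse tabs and newlines so one entry stays on one line."""
    return " ".join(text.split())


def entries_to_tsv(entries):
    """Build one 'headword<TAB>reading<TAB>definition' line per reading."""
    lines = []
    for entry in entries:
        title = clean(entry.get("title", ""))
        for heteronym in entry.get("heteronyms", []):
            reading = clean(heteronym.get("trs", ""))
            senses = [
                clean(sense.get("def", ""))
                for sense in heteronym.get("definitions", [])
            ]
            definition = "; ".join(sense for sense in senses if sense)
            # Skip entries that have no headword or no definition text.
            if title and definition:
                lines.append("\t".join((title, reading, definition)))
    return "\n".join(lines)


if __name__ == "__main__":
    print(entries_to_tsv(json.loads(SAMPLE_JSON)))
```

For the real data, `json.loads(SAMPLE_JSON)` would be replaced by loading the downloaded file with `json.load`, and the readings could be converted to numbered tones before writing them out.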

We are going a bit off-topic here so maybe there should be a new thread for further discussion on this?

Abun · Aug 31, 2015

alex_hk90 said:
We are going a bit off-topic here so maybe there should be a new thread for further discussion on this?

You're right. I opened one here: http://plecoforums.com/threads/moe-minnan-and-hakka-dictionaries.4938/

ACardiganAndAFrown · Sep 1, 2015

mikelove said:
两岸词典 should be available soon too

Soon?

mikelove said:
and we've also got another interesting free dictionary we'll be launching probably any day now.

Any day?

mikelove · Sep 1, 2015

"Soon" for us is a pretty broad term

Other interesting dictionary held up by some issues we discovered with one part of it which we're rapidly addressing now.

ACardiganAndAFrown · Sep 26, 2015

mikelove said:
"Soon" for us is a pretty broad term
Other interesting dictionary held up by some issues we discovered with one part of it which we're rapidly addressing now.

Is CC-Canto the other interesting dictionary or just something completely unrelated?

mikelove · Sep 27, 2015

That's the one I was alluding to, yes, though certainly not the only interesting thing in the pipeline...

goldyn chyld · Jan 14, 2016

mikelove said:
两岸词典 should be available soon too

Any update on this?

mikelove · Jan 14, 2016

Out next week I think. (along with a ton of other stuff)

mikelove · Jan 22, 2016

Next week, now - sorry, 3 different members of my family came down with colds.
