Official MoEDict Pleco Release

Discussion in 'Pleco for Android' started by mikelove, Aug 5, 2015.

  1. Abun

    Abun 进士

    To be honest, I have no idea what data formats they are stored in or how easy it is to export them. I tried to quickly scan the source code for hints, but I couldn't find any. Then again, I believe database query code doesn't usually show up in the source code anyway (at least if it's written in PHP, it shouldn't, right?). In any case, I guess it's likely that the format will be similar to the Mandarin database, right? Anyway, the web pages are:
    http://twblg.dict.edu.tw/holodict_new/index.htm (Minnan)
    http://hakka.dict.edu.tw/hakkadict/index.htm (Hakka)
    Considering the stuff you got converted already, you might well succeed where I fail :D

    The page for the 台日大辭典 is http://taigi.fhl.net/dict/ although sadly only a translated version (into Taiwanese Minnan) can be searched. You do get a link to a scan of the corresponding original page with every entry though.
     
    Last edited: Aug 25, 2015
  2. alex_hk90

    alex_hk90 状元

    Thanks @Abun.

    If they are in the same format as MoEDict then the g0v.tw team has already done much of the hard work for importing into a usable format:
    http://www.plecoforums.com/threads/the-moe-dictionary-is-now-open-source.3606/#post-29296
    (In fact @audreyt mentions the Hakka dictionary in the above post.)

    Can you find the relevant links in the following page?
    https://g0v.hackpad.com/3du.tw-ZNwaun62BP4
     
  3. Abun

    Abun 进士

    For Minnan: The first link provided by @audreyt under "collection" doesn't seem to work (it takes you to a page containing a poem). There is another one under the parsing area which does work, though (https://github.com/g0v/moedict-data-twblg). The dict-twbl.json file in that directory looks usable, although the documentation seems to suggest that there are more up-to-date versions, which I can't find.
    For Hakka, the archive at http://www.audreyt.org/newdict/hakka.tar.gz ultimately contains a folder with an individual HTML document for each entry; I don't know how easy that is to work with.

    For Minnan at least, one would also have to decide how to deal with non-standard-Unicode characters such as 亻因. Many of these are supported in the newer extensions (C and later), but a few are not. For Hakka I can't say how big this problem is because I can't speak Hakka.
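    As a rough sketch of how one might find out which characters in a converted entry fall into the CJK extensions (and so may need an extended font), the snippet below classifies characters by Unicode block. The helper names `cjk_block` and `extension_chars` are made up for illustration, and only blocks up to Extension F are listed. Characters that are not encoded at all and are written as a sequence of components (like 亻因 typed as two characters) will not be detected this way.

```python
# Unicode block ranges for CJK unified ideographs (basic block and Extensions A-F).
CJK_BLOCKS = [
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("Extension A", 0x3400, 0x4DBF),
    ("Extension B", 0x20000, 0x2A6DF),
    ("Extension C", 0x2A700, 0x2B73F),
    ("Extension D", 0x2B740, 0x2B81F),
    ("Extension E", 0x2B820, 0x2CEAF),
    ("Extension F", 0x2CEB0, 0x2EBEF),
]


def cjk_block(ch):
    """Return the name of the CJK block containing ch, or None if it is not in one."""
    cp = ord(ch)
    for name, lo, hi in CJK_BLOCKS:
        if lo <= cp <= hi:
            return name
    return None


def extension_chars(text):
    """List (character, block) pairs for every character in a CJK extension block."""
    found = []
    for ch in text:
        block = cjk_block(ch)
        if block is not None and block.startswith("Extension"):
            found.append((ch, block))
    return found
```

    Running `extension_chars` over every headword in a data file would give a quick list of the characters worth checking against a given font.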

    Btw mikelove, is it possible to define two fonts for characters, one as the main font and the other as a fallback in case the main one doesn't support a character? The reason I'm asking is that I have so far been unable to find a single font which covers both the standard CJK range and the newer CJK extensions. Usually there is only a supplementary file which contains just the extensions. So in order to display a text which contains both "normal" and extension characters, you have to switch fonts for the new ones.

    EDIT: Just discovered that the list in the JSON file for Minnan seems to use combining diacritics instead of precomposed marked vowels (e.g. ā as "a" + combining macron instead of the single precomposed character ā). At least if I copy and paste the text into a text editor, I can delete the diacritic and the letter stays. I guess that would greatly simplify the program needed to convert the romanization with diacritics into romanization with numbers.
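    A minimal sketch of such a conversion, assuming the romanization is Tâi-lô style and covering only the five commonly marked tones (2, 3, 5, 7, 8); the function names are hypothetical, and the tone table would need extending for other marks or for POJ-specific characters. Normalizing to NFD first means the same code works whether the data uses combining diacritics or precomposed letters. Unmarked syllables are assigned tone 4 if they end in p, t, k or h, and tone 1 otherwise.

```python
import re
import unicodedata

# Combining diacritic -> tone number (assumed Tai-lo conventions).
TONE_MARKS = {
    "\u0301": "2",  # combining acute accent
    "\u0300": "3",  # combining grave accent
    "\u0302": "5",  # combining circumflex accent
    "\u0304": "7",  # combining macron
    "\u030D": "8",  # combining vertical line above
}

# Anything that is not a Latin letter or a combining diacritic separates syllables.
SEPARATOR = re.compile(r"([^A-Za-z\u00C0-\u024F\u0300-\u036F]+)")


def syllable_to_numeric(syllable):
    """Strip the tone diacritic from one syllable and append its tone number."""
    tone = None
    kept = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    base = unicodedata.normalize("NFC", "".join(kept))
    if not base:
        return base
    if tone is None:
        # Unmarked: checked syllables (stop finals) are tone 4, the rest tone 1.
        tone = "4" if base[-1].lower() in "ptkh" else "1"
    return base + tone


def to_numeric(text):
    """Convert every syllable in text, leaving hyphens, spaces and punctuation as they are."""
    parts = SEPARATOR.split(text)
    # With one capture group, even indices are syllables and odd indices are separators.
    return "".join(
        syllable_to_numeric(p) if i % 2 == 0 and p else p
        for i, p in enumerate(parts)
    )
```

    For example, `to_numeric("Tâi-uân")` gives `Tai5-uan5`, regardless of which form of â the input uses.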
     
    Last edited: Aug 26, 2015
    alex_hk90 likes this.
  4. mikelove

    mikelove 皇帝 Staff Member

    We do that now, actually - download the Extended Chinese Font in Add-ons and it'll use that to draw characters from the newer extensions.
     
  5. Abun

    Abun 进士

    Ah, I hadn't realized that, thanks! Does the list cover the complete extensions, though? In a quick test, I was able to find (敖 over 力, Ext. B) and (亻因, Ext. C), but not (辶日, Ext. B) or (魚隶, Ext. C, although there is a version which has the four dots of 魚 replaced with 大; not sure if that's a variant character). And are there plans to include them in the normal search methods? I could only access them by going to the char page of one of the components and scrolling down until I found the character I was looking for.
     
  6. mikelove

    mikelove 皇帝 Staff Member

    Were you looking these up in the dictionary? Have you downloaded the extended version of the Unihan database as well?
     
  7. Abun

    Abun 进士

    I have.
     
  8. mikelove

    mikelove 皇帝 Staff Member

    We don't cover them 100% in the dictionary yet, even with Extended Unihan - some extended characters may only be viewable in documents or in user dictionary entries. If you copy one of these characters to the clipboard and open up the Clip Reader does it display correctly in there?
     
    Abun likes this.
  9. Abun

    Abun 进士

    I entered them into the search bar with a Minnan keyboard. That worked, even for the characters I couldn't find in the char lists before (i.e. I get a result, although the character isn't displayed in the search bar itself).
     
  10. image.jpg
    Works on iOS....
     
  11. alex_hk90

    alex_hk90 状元

    As mentioned before I don't know anything about Minnan but the JSON file you have linked to looks pretty clean - shouldn't be too difficult to use that and convert to Pleco flashcards / user dictionary format. Whether it will make any sense given the above discussion about romanisation / etc. is another question.

    We are going a bit off-topic here so maybe there should be a new thread for further discussion on this?
     
  12. Abun

    Abun 进士

  13. Soon?

    Any day?
     
    giokve likes this.
  14. mikelove

    mikelove 皇帝 Staff Member

    "Soon" for us is a pretty broad term :) The other interesting dictionary is held up by some issues we discovered with one part of it, which we're rapidly addressing now.
     
  15. Is CC-Canto the other interesting dictionary or just something completely unrelated?
     
  16. mikelove

    mikelove 皇帝 Staff Member

    That's the one I was alluding to, yes, though certainly not the only interesting thing in the pipeline...
     
  17. goldyn chyld

    goldyn chyld 状元

    Any update on this? :)
     
  18. mikelove

    mikelove 皇帝 Staff Member

    Out next week I think. (along with a ton of other stuff)
     
    Wan, Abun, giokve and 2 others like this.
  19. mikelove

    mikelove 皇帝 Staff Member

    Next week, now - sorry, 3 different members of my family came down with colds.
     
