MoE Minnan and Hakka dictionaries

Discussion in 'Future Products' started by Abun, Aug 31, 2015.

  1. alex_hk90

    alex_hk90 状元

    New version (MoE-Minnan-v02) addressing some of these points:
    - Pleco flashcards (14,005 entries): [EDIT: superseded by new version - see Pleco-User-Dictionaries/MoE-Minnan [GitHub] for latest version]

    I'm not really sure how to address these ones:
    Anyone have any suggestions on these?
     
    Last edited: Sep 28, 2015
  2. Abun

    Abun 进士

    I use Tai-lo when I write for myself, although of course I have no problem whatsoever reading POJ, seeing as Tai-lo is based on POJ. In my opinion Tai-lo has a number of small advantages. Most of these are personal taste, but I think for this kind of project, Tai-lo has one objective advantage over POJ, which is precisely the one addressed in chapter 3 of the article you linked: there are numerous versions of POJ, and although the differences are rather small (for example in the placement of tone diacritics, but also things like the placement of the nasalization marker when combined with the glottal stop -h), they can make database queries very difficult. You would either have to write a script which catches all possible spellings or write out a set of spelling rules for the user to follow. The latter approach is how this (http://210.240.194.97/iug/Ungian/soannteng/chil/taihoa.asp) and this (http://taigi.fhl.net/dict/) database work, but it might be difficult to inform the user of such rules in Pleco. For Tai-lo this problem is pretty much non-existent because there is only one version (well, the case sensitivity is no clearer than in POJ, but seeing that Tai-lo doesn't use -N as a nasalization marker, you can just do case-insensitive queries without issues).

    I guess you could call it an attempt at soft standardization. The ministry does publish the lists and the dictionary, and quite possibly it is also responsible for the "characters only" policy. However, the characters themselves are not decided on by ministry officials but by conferences of scholars (although I can't tell you who decides who participates in those conferences, to be honest). The MoE set is not enforced in publications, so authors can use whatever way of writing they prefer. The only possible exception is textbooks in public schools (and even there I'm not sure. I've seen two or three and they used the MoE set and Tai-lo, but it may not be obligatory. I do know that teachers are not prohibited from using other ways of writing in class, though). The MoE characters and Tai-lo are obligatory in the official Taiwanese language tests which have to be taken in order to qualify for teaching Taiwanese in public schools and (as far as I know) also for certain types of government officials.

    Interesting that you know some Taiwanese but only limited Mandarin; you don't meet many people like this anymore (especially outside missionary circles). Seeing as you seem particularly familiar with the Maryknoll Society's work, I guess you take classes with them? That wouldn't happen to have been in Taipei and within the past year? If so, it's actually possible that we have met before :)

    Ah, I see how the MoE dictionary is of limited use to you then... I'm much hoping that it is pioneer work which might clear the way for more dictionaries to be implemented into Pleco, though. Personally, I would be most interested in the Taiwanese-Japanese dictionary (the second of the two linked above), simply because it is just massive :D
     
    Last edited: Sep 2, 2015
  3. Abun

    Abun 进士

    Wow that was quick! It's late in the evening where I'm situated, but I'm going to have a detailed look at it tomorrow :)

    As for the remaining issues, it seems the diacritics issue is Pleco having problems with these characters for some reason. At least the problem persists even if I install a font like TNR which I know supports combining diacritics. I guess the problem would disappear once diacritics-to-number conversion is implemented. That might mitigate the problem with the @ as well. The other two problems appear to me like they don't have anything to do with the dictionary but with Pleco itself. Maybe mikelove knows more about it?
     
  4. alex_hk90

    alex_hk90 状元

    For this diacritics-to-number conversion, I see you have had an initial go in your earlier posts, but is there a simple list of rules / definitive definitions of the diacritics somewhere I could look at to implement this as part of the JSON to Pleco flashcards conversion script?
     
  5. I don't think this needs to be done manually, tbh, if Pleco's copy of HanaMin is up to date. Maybe @mikelove knows what's going on.
     
  6. jasonmcdowell

    jasonmcdowell Member

    I haven't taken any classes with the Maryknoll Society, but they have been the best source that I've found for materials to study Taiwanese from English. I visited the Maryknoll office in Taipei a few years ago, but I haven't had much personal contact with them. My connection to Taiwanese is marrying a Taiwanese-American woman I met in college. Her family in Los Angeles uses Taiwanese at home, and I started learning while we were still dating. Right now I'm in a race to see when our 1-year-old baby is going to overtake me in Taiwanese fluency. Much more recently, I've learned a little Mandarin too, but when I visit Taiwan, I mostly use Taiwanese to talk with people and get food, etc.

    You can find the tone numbers here: https://en.wikipedia.org/wiki/Taiwanese_Romanization_System#Tones
    1 tong (東)
    2 tóng (黨)
    3 tòng (棟)
    4 tok (督) (this tone has no diacritic mark, but the syllable ends in p, t, k, or h)
    5 tông (同)
    6 (there is no 6th tone, it merged with the 2nd tone)
    7 tōng (洞)
    8 to̍k (毒)
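    A minimal sketch of that mapping in Python (the function name is my own): Unicode NFD decomposition splits a precomposed letter like "ó" into its base letter plus a combining mark, so the tone can be read off the mark, with the unmarked tones 1 and 4 distinguished by the syllable final.

```python
import unicodedata

# Combining diacritic -> Tai-lo tone number (tones 1 and 4 carry no mark).
TONE_MARKS = {
    "\u0301": 2,  # acute
    "\u0300": 3,  # grave
    "\u0302": 5,  # circumflex
    "\u0304": 7,  # macron
    "\u030D": 8,  # vertical line above
}

def tone_number(syllable):
    """Return the tone number of a single Tai-lo syllable."""
    # NFD splits precomposed letters like "ó" into "o" + combining acute.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    # No diacritic: tone 4 if the syllable ends in a stop, otherwise tone 1.
    return 4 if syllable[-1].lower() in "ptkh" else 1

print(tone_number("tóng"))  # 2
print(tone_number("to̍k"))   # 8
print(tone_number("tok"))   # 4
```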
     
  7. alex_hk90

    alex_hk90 状元

    Thanks - on first glance it doesn't look too difficult to automate the diacritics to numeric tone conversion. :D

    EDIT: Can you check if this is an accurate representation of the mapping?
    Code:
    {
      "vowels": {
        "a": {
          "á": "acute",
          "à": "grave",
          "â": "circumflex",
          "ā": "macron",
          "a̍": "vertical",
          "a": "none"
        },
        "e": {},
        "i": {},
        "o": {},
        "u": {}
      },
      "diacritics": {
        "acute": 2,
        "grave": 3,
        "circumflex": 5,
        "macron": 7,
        "vertical": 8,
        "none": [
          { "p": 4, "t": 4, "k": 4, "h": 4 },
          1
        ]
      }
    }
    I've only done "a" so far but I think the other vowels should be the same?

    EDIT: Above has been superseded (no need to separate out each vowel as combining diacritic marks used).
     
    Last edited: Sep 28, 2015
  8. Abun

    Abun 进士

    Haha I guess that's a better learning method than classes anyways :D

    This is actually a gross oversimplification, as it only holds completely true for the "literary readings". However, because early Chinese dialectologists almost exclusively concerned themselves with the more "respectable" reading pronunciations, this view has spread far, and even some scholars who should know better still repeat it. I guess a detailed discussion of that would be off-topic here, but we can open a new thread if you like ;) For our purpose here, suffice it to say that the 6th tone disappeared in most variants of Minnan, including those recorded in the MoE dictionary.


    Looks correct to me. Depending on how you plan to implement it, you might have to check whether a second "o" follows the marked "o", because "oo" exists as a vowel distinct from "o". You will also definitely have to add checks for "m" and "n" with diacritics, because Minnan has vocalic "m" and "ng" (in the latter case, the diacritic is put on the "n").
     
  9. alex_hk90

    alex_hk90 状元

    Thanks - hopefully I'll have time to have a go this evening, as I'm going travelling over this weekend so it will have to wait until next week otherwise.

    EDIT: Started to look at this but should have read your earlier posts in more detail - missed that they were Unicode combining accents so didn't need to list out all the vowels like I did. :oops:
     
    Last edited: Sep 3, 2015
  10. Abun

    Abun 进士

    Don't worry, I didn't have much time to test the second version and won't get much this weekend, either...
     
  11. alex_hk90

    alex_hk90 状元

    Thanks for posting this, helped me think about how to do it in Python.
    I've got it to a stage where it could be working, but I'm not sure I've included all the possible punctuation - might need to do some kind of check against the raw data here. Also, because I'm checking for all the different separators and punctuation for every word, it's not the most efficient.
    Anyway, with the script I've written I get the following results with a few test lines:
    Code:
    1: Khuànn-tio̍h tsit khuán lâng tō gê
    Khuann3-tioh8 tsit4 khuan2 lang5 to7 ge5
    2: Kè-á tu khah kuè--khì--leh.
    Ke3-a2 tu1 khah4 kue3--khi3--leh4.
    3: Kā phue̍h tsānn--khí-lâi.
    Ka7 phueh8 tsann7--khi2-lai5.
    4: Hit nn̄g uân oo-oo ê mi̍h-kiānn sī siánn-mih?
    Hit4 nng7 uan5 oo1-oo1 e5 mih8-kiann7 si7 siann2-mih4?
    5: Honnh, guân-lâi sī án-ne--ooh!
    Honnh4, guan5-lai5 si7 an2-ne1--ooh4!
    6: Tsa-bóo khiā tsit pîng, tsa-poo khiā hit pîng.
    Tsa1-boo2 khia7 tsit4 ping5, tsa1-poo1 khia7 hit4 ping5.
    Does that look right to you?
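    For reference, a minimal Python sketch (not alex_hk90's actual script) that reproduces the conversions above: each run of Latin letters plus combining marks is treated as one syllable, converted, and written back with the tone digit appended, while spaces, hyphens and punctuation pass through unchanged.

```python
import re
import unicodedata

# Combining diacritic -> Tai-lo tone digit.
TONES = {"\u0301": "2", "\u0300": "3", "\u0302": "5", "\u0304": "7", "\u030D": "8"}

def convert_line(text):
    """Replace tone diacritics with trailing tone numbers in a line of
    Tai-lo, leaving separators and punctuation untouched."""
    def convert_syllable(match):
        decomposed = unicodedata.normalize("NFD", match.group(0))
        base = "".join(c for c in decomposed if c not in TONES)
        marks = [TONES[c] for c in decomposed if c in TONES]
        # No mark: tone 4 before a stop final (p/t/k/h), otherwise tone 1.
        tone = marks[0] if marks else ("4" if base[-1].lower() in "ptkh" else "1")
        return unicodedata.normalize("NFC", base) + tone
    # One "syllable" = a run of Latin letters plus combining marks.
    return re.sub(r"[A-Za-z\u00C0-\u024F\u0300-\u036F]+", convert_syllable, text)

print(convert_line("Kā phue̍h tsānn--khí-lâi."))
# Ka7 phueh8 tsann7--khi2-lai5.
```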
     
  12. Abun

    Abun 进士

    Looks correct to me :)

    I also tested the second version some more. The problems you have addressed (the ones with the "example" and "type" fields) were successfully fixed as far as I can see and I didn't find any new ones either (well, I did find a spelling mistake in one entry, but that's a problem of the source, not the conversion).

    I was also able to pinpoint the behaviour of @ in searches: adding @ returns all entries that contain the string which follows after @ but don't begin with it. So searching for "@in" returns "lin", "thinn", "iau-kin", "so-inn" etc., but not "in", "inn" and so on. The syllable-separating dash is not taken into account for some reason, so "pai-ni" is returned as well. I don't know whether this is something that can be fixed, though.
     
  13. alex_hk90

    alex_hk90 状元

    Thanks. :)

    Next version(s) ready:
    - Pleco flashcards (14,005 entries) with diacritic tones (as source data): [EDIT: superseded by new version - see Pleco-User-Dictionaries/MoE-Minnan [GitHub] for latest version]
    - Pleco flashcards (14,005 entries) with numeric tones: [EDIT: superseded by new version - see Pleco-User-Dictionaries/MoE-Minnan [GitHub] for latest version]

    The main change is addition of the numeric tone version, currently only for the headwords but the code has been modularised so it should not be too difficult to apply to the definitions and examples as well, as long as they can be reliably identified from the longer string.

    Once the diacritic to numeric tone conversion has been applied to the definitions and examples as well as the headwords, what else is left to do?

    EDIT: Another new version (MoE-Minnan-v04), with numeric tones for (hopefully) all Romanisation (there could be one or two remaining bugs with numeral placement, let me know if you find anything):
    - Pleco flashcards (14,005 entries) with diacritic tones (as source data): [EDIT: see Pleco-User-Dictionaries/MoE-Minnan [GitHub] for latest version]
    - Pleco flashcards (14,005 entries) with numeric tones: [EDIT: see Pleco-User-Dictionaries/MoE-Minnan [GitHub] for latest version]

    If this works then I think we're pretty much done? :)
     
    Last edited: Sep 28, 2015
  14. Abun

    Abun 进士

    Importing atm. Are there any special conditions in particular which you suspect might cause bugs and which I should look out for?

    Pretty much, I guess^^ I can only think of two things which we might consider including, but neither of them is absolutely essential if you ask me.
    The first would be information about 異體字 (variant characters), but those seem to be located in another json file (https://github.com/g0v/moedict-data-twblg/blob/master/x-異用字.json). Does that make things difficult?
    The information in the "reading" field of the main json could be included as well, but I think it would have to be recognizable as what it is at the very first glance so it doesn't create confusion. The MoE website does this by displaying it inside a square behind the header character, which I don't think is possible in Pleco because there don't seem to be any special (squared, circled or whatever) forms of these characters in Unicode (the MoE used image files). Maybe putting it at the top of the entry in full-width angled brackets (【】) would be visually clear enough... But in my opinion it's not an absolute must anyways :) @audreyt's 萌典 doesn't display it, either.

    I also realized that there are a few entries in the dict on the original MoE website which don't appear in the json file, specifically those which are marked as "additional" (附錄) on the website. These entries contain more specialized vocabulary, for example family names, toponyms, non-Sinitic loanwords, the 24 節氣 (立春, 雨水 etc.), as well as a few additional terms for certain family members. I can't find corresponding files in @audreyt's directory, though. Considering the 萌典 doesn't list them, it's quite possible the files don't exist. I still think they are quite interesting, though (especially the loanwords). Considering the number of entries isn't very high (maybe 300~400 in total) I'm thinking of just making a list myself. Could you tell me what kind of format would be usable for you?
    Especially the loanwords deserve some attention here, I guess. Most of them are Japanese (or English borrowed via Japanese). So I think we would need information not only on the meaning and PoS, but also on the original word that was loaned. On the other hand, these words were not assigned characters by the MoE. So I suggest either leaving the character line blank or copying the romanization into it if Pleco doesn't like a blank character line. The entries might look something like:

    an3-nai7
    <動> 招待、引導
    原: 日: 案內(あんない, annai)​

    kha1-me2-lah4
    <名> 開麥拉、照相機
    原: 日: カメラ(kamera) > 英: camera​

    A problem may be pseudo-English loans, though (such as oo-tóo-bái (機車) > オートバイ(ōtobai) > "autobike"). I'm not quite sure how best to indicate this.
     
    Last edited: Sep 8, 2015
  15. alex_hk90

    alex_hk90 状元

    Cases where there is a lot of switching between Chinese and Latin characters without separating vocab might trip up the logic - I think I have got enough recursion in the logic to catch all likely cases, but some might have slipped through the net.

    :)

    Multiple files don't really make a big difference here - might take a bit longer to run the script unless I set up a temporary relational database, but nothing conceptually difficult about it.
    How is that file structured and how would you suggest combining it with the current version?

    That would be quite easy to do, I can add it to the next version and see if it's better with or without it (or if it might be better somewhere else, like at the end of the definition).

    If you're doing it manually then anything close to Pleco flashcard format would be best (have a look at the MoE-Minnan ones for an example):
    Hanzi{TAB}Pinyin{TAB}Definition
    For new lines in definition you need to use a particular Unicode private use character: 
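    A sketch of that output step, assuming (hanzi, romanization, definition) tuples as input; the private-use code point below is only a placeholder for the actual character mentioned above, which does not survive quoting here.

```python
# PLACEHOLDER only - substitute the private-use character Pleco actually
# expects for embedded newlines (given in the post above).
PLECO_NEWLINE = "\uE000"

def rows_to_flashcards(rows):
    """Format (hanzi, romanization, definition) tuples as Pleco flashcard
    lines: Hanzi{TAB}Pinyin{TAB}Definition, one entry per line."""
    lines = []
    for hanzi, romanization, definition in rows:
        # Real line breaks inside a definition must become the special
        # private-use character, since "\n" separates whole entries.
        definition = definition.replace("\n", PLECO_NEWLINE)
        lines.append("\t".join((hanzi, romanization, definition)))
    return "\n".join(lines)
```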
     
  16. Abun

    Abun 进士

    You mean a lot of changing between Characters and Latin letters within a single entry? The most extreme examples are probably single character entries which have a lot of meanings. I checked a few and couldn't see anything out of the ordinary :) However it did give me the idea that the translation of examples could be put in parentheses to make it look neater xD

    I could do that, but I doubt that's particularly practical. Wouldn't it be a bit like a website which is styled in inline HTML instead of using a stylesheet, i.e. written with only the immediate output in mind? It would be rather inflexible in terms of possible changes to layout because every line would have to be checked by hand, wouldn't it?
    For the moment I have made an Excel sheet which contains the family names and the loanwords (the toponyms are a bit more numerous than I expected, so I'm leaving them out for now). Then I made a Pleco-friendly txt out of the loanword part, which did work, but as expected it took quite a lot of time because I had to do the layout by hand on every single line. I wonder if it wouldn't be more practical to use some sort of database format and then use an algorithm to compile the txt. For now I used xlsx, simply because Excel is easier to work with than writing a json by hand in a text editor, but I realize that xlsx is maybe not ideal for our purpose...
     

    Attached Files:

  17. alex_hk90

    alex_hk90 状元

    Yeah, switching between Characters and Latin without punctuation between them.
    And bracketing the translation of examples is probably doable - I might have a look at that for the next version.

    An Excel sheet is fine, as you can easily output that to CSV or tab-delimited format (pretty much Pleco flashcards) anyway, both of which can then be read into a script or database.

    Let me know what you think the priority items should be for the next version. :)
     
  18. Abun

    Abun 进士

    Yeah, that should occur most often in single-character entries with a lot of meanings, because usually each meaning would have at least one example, and with single characters (i.e. not full words), the examples would not be full sentences. I haven't found any problems so far, though :)

    Uploaded an excel sheet with the family names, loan words and 24節氣 (https://www.dropbox.com/s/415y5jd5jt4cg3d/MoE_Dict_appendix.xlsx?dl=0). I'm working on the toponyms but as I said, there are more of those than I thought. Moreover, I will be very busy during this weekend and the coming week and probably won't get a lot done there (if anything), so it might take a bit.

    Also, just as a disclaimer: the information in the "type" columns, as well as that on the source words in Western languages, is not copied from the MoE but has been added by me - for the sake of consistency in the case of "type", and of completeness in the case of the etymology. I am torn on whether this addition is justified when weighing it against being faithful to the source. (Btw, I have a similar conflict with the JapKanji column, as some of the kanji there are presented in their traditional form instead of the modern Japanese simplified one, for example 櫻 instead of 桜. I decided to go with the source there, though.) What's your opinion?

    My priority would be: appendix > 異體字 > "reading" > parentheses around example translations.
     
  19. Abun

    Abun 进士

    Just noticed that I haven't answered your question concerning the structure of the 異體字 json file yet :oops:
    It's actually very minimalistic: it lists the 異體字 for the entries by referencing their ID (it doesn't actually explain the numbers, but I checked a few examples; the number corresponds to the "id" field in the main json file).
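    Based on that description, the join could be sketched like this; the exact record shapes are assumptions (I haven't verified the variant file's field names beyond the shared "id").

```python
def attach_variants(main_entries, variant_records):
    """Attach variant-character (異體字) records to main dictionary entries
    by matching the shared "id" field. Unmatched records are ignored."""
    by_id = {entry["id"]: entry for entry in main_entries}
    for record in variant_records:
        entry = by_id.get(record["id"])
        if entry is not None:
            entry.setdefault("variants", []).append(record)
    return main_entries
```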
     
  20. Abun

    Abun 进士

    Just finished the work on the toponyms and updated the xls file on Dropbox (link doesn't seem to have changed: https://www.dropbox.com/s/415y5jd5jt4cg3d/MoE_Dict_appendix.xlsx?dl=0).

    In some of the tables for toponyms there was a distinction between 讀音一 and 讀音二. In most cases, the second one contains an older version of the name which does not necessarily match the characters (for example Ku7-tsam7 (舊站) for the train station which today is called Tsiau1-ping5 (沼平)). In a limited number of cases, this field contains a dialectal variant pronunciation as well, but that seems to be a mistake to me. In any case, I stored 讀音二 in a column "alt". In terms of output, I suggest adding it to the normal pronunciation line after a slash (ex. "Tsiau1-ping5/Ku7-tsam7") since that would make it possible to find the entry by searching for the 讀音二.
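    The slash-joined field suggested above is trivial to build during conversion; a sketch (the function name is my own):

```python
def pronunciation_field(reading_1, reading_2=None):
    """Join 讀音一 and 讀音二 with a slash so a search for either
    pronunciation finds the entry; 讀音二 is optional."""
    return reading_1 + "/" + reading_2 if reading_2 else reading_1

print(pronunciation_field("Tsiau1-ping5", "Ku7-tsam7"))  # Tsiau1-ping5/Ku7-tsam7
```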
     
