MoE Minnan and Hakka dictionaries

alex_hk90

状元
Awesome! I imported it as a dictionary as well as flashcards and tested a few things that I thought might cause problems, and here's what I found (in descending order of importance):
  • Tone diacritics are not displayed anywhere (be it the dictionary window, a single opened entry or a flashcard test). The sole exception is the edit window, where everything is displayed. The vertical stroke above the vowel for the 8th tone shows up as an "unknown character" square in the default font, but in TNR it's fine.
  • The "example" field is missing (or maybe that was intentional in this first version because there were problems?)
  • The "type" field returns nothing for idioms, resulting in empty angled brackets (ex. 驚驚袂著等)
  • There seem to be discrepancies as to whether or not the @ has to be included in the search. For example, "in" (亻因) can only be found if searching for "in" (without the @), but "@in" (with the @) returns entries as well.
  • Searches for characters from the Unicode extensions (ex. (亻因), (敖 over 力) etc.) return only the Unicode entry, although the entries exist in the dictionary (a problem with Pleco's search algorithm?)
  • The characters of a few entries cannot be displayed, ex. peh (足百). This was to be expected, though; these characters are not yet encoded in Unicode, not even in the extensions, so I guess there's not much that can be done there.
Another thought that came to me concerning the layout: maybe in the final version it would be nice to have index numbers (maybe ①, ② etc. as with other existing dictionaries) for each heteronym within an entry. That would make the structure clearer at first glance.

New version (MoE-Minnan-v02) addressing some of these points:
- Pleco flashcards (14,005 entries): [EDIT: superseded by new version - see Pleco-User-Dictionaries/MoE-Minnan [GitHub] for latest version]

I'm not really sure how to address these ones:
  • Tone diacritics are not displayed anywhere (be it the dictionary window, a single opened entry or a flashcard test). The sole exception is the edit window, where everything is displayed. The vertical stroke above the vowel for the 8th tone shows up as an "unknown character" square in the default font, but in TNR it's fine.
  • There seem to be discrepancies as to whether or not the @ has to be included in the search. For example, "in" (亻因) can only be found if searching for "in" (without the @), but "@in" (with the @) returns entries as well.
  • Searches for characters from the Unicode extensions (ex. (亻因), (敖 over 力) etc.) return only the Unicode entry, although the entries exist in the dictionary (a problem with Pleco's search algorithm?)
  • The characters of a few entries cannot be displayed, ex. peh (足百). This was to be expected, though; these characters are not yet encoded in Unicode, not even in the extensions, so I guess there's not much that can be done there.

Anyone have any suggestions on these?
 

Abun

榜眼
Do you all prefer using Tâi-lô or POJ? I've only ever practiced with POJ, though I included several other romanizations in my previous dictionary work.
I use Tai-lo when I write for myself, although of course I have no problem whatsoever reading POJ, seeing as Tai-lo is based on POJ. In my opinion Tai-lo has a number of small advantages. Most of these are personal taste, but I think for this kind of project Tai-lo has one objective advantage over POJ, which is precisely the one addressed in chapter 3 of the article you linked: there are numerous versions of POJ, and although the differences are rather small (for example in the placement of tone diacritics, but also things like the placement of the nasalization marker when combined with the glottal stop -h), they can make database queries very difficult. You would either have to write a script which catches all possible spellings or write out a set of spelling rules for the user to follow. The latter is how this (http://210.240.194.97/iug/Ungian/soannteng/chil/taihoa.asp) and this (http://taigi.fhl.net/dict/) database work, but it might be difficult to inform the user of such rules in Pleco. For Tai-lo this problem is pretty much non-existent because there is only one version (well, the case sensitivity is no clearer than in POJ, but seeing that Tai-lo doesn't use -N as a nasalization marker, you can just do case-insensitive queries without issues).

I don't understand the scope of the MoE dataset. Is the regular MoE dataset a Chinese-Chinese dictionary, and does the MoE Minnan dataset show Hanzi characters for Minnan words written in Tâi-lô? I thought that Minnan does not have standardized Hanzi? Is this an attempt at standardization by the Ministry of Education?
I guess you could call it an attempt at soft standardization. The ministry does publish the lists and the dictionary, and quite possibly it is also responsible for the "characters only" policy. However, the characters themselves are not decided on by ministry officials but by conferences of scholars (although I can't tell you who decides on who participates in those conferences, to be honest). The MoE set is not enforced in publications, so authors can use whatever way of writing they prefer. The only possible exception is textbooks in public schools (and even there I'm not sure. I've seen two or three and they used the MoE set and Tai-lo, but it may not be obligatory. I do know that teachers are not prohibited from using other ways of writing in class, though). The MoE characters and Tai-lo are obligatory in the official Taiwanese language tests which have to be taken in order to get the qualification for teaching Taiwanese in public schools and (as far as I know) also for certain types of government officials.

My Taiwanese is much better than my Mandarin, and my Taiwanese is only at a beginning conversational level. I can probably only recognize 10 - 20 hanzi, so while I'm planning to start practicing hanzi using Pleco, the Maryknoll Taiwanese-English dataset will be most helpful to me.
Interesting that you know some Taiwanese but only limited Mandarin; you don't meet many people like this anymore (especially outside missionary circles). Seeing as you seem particularly familiar with the Maryknoll Society's work, I guess you take classes with them? That wouldn't happen to have been in Taipei and within the past year? If so, it's actually possible that we have met before :)

I think I have read that user dictionaries in Pleco do not do full-text search, which would make looking up English words or Chinese words in the Maryknoll data difficult. Maryknoll also published some English-Taiwanese PDF files, but they don't have an Excel spreadsheet or any other easily parsed format for the English-Taiwanese dataset. I've considered trying to parse the PDF files, but it would be a lot of work to get everything perfect.
Ah, I see how the MoE dictionary is of limited use to you then... I'm very much hoping that it is pioneering work which might clear the way for more dictionaries to be implemented in Pleco, though. Personally, I would be most interested in the Taiwanese-Japanese dictionary (the second of the two linked above), simply because it is just massive :D
 

Abun

榜眼
New version (MoE-Minnan-v02) addressing some of these points:
- Pleco flashcards (14,005 entries): https://www.dropbox.com/s/0noa28nwyrwfo7o/MoE-Minnan-flashcards-v02.txt.7z?dl=0
Wow that was quick! It's late in the evening where I'm situated, but I'm going to have a detailed look at it tomorrow :)

As for the remaining issues, it seems the diacritics issue is Pleco having problems with these characters for some reason. At least the problem persists even if I install a font like TNR which I know supports combining diacritics. I guess the problem would disappear after diacritics-to-number conversion is implemented. That might mitigate the problem with the @ as well. The other two problems look to me like they have nothing to do with the dictionary but with Pleco itself. Maybe mikelove knows more about it?
 

alex_hk90

状元
Wow that was quick! It's late in the evening where I'm situated, but I'm going to have a detailed look at it tomorrow :)

As for the remaining issues, it seems the diacritics issue is Pleco having problems with these characters for some reason. At least the problem persists even if I install a font like TNR which I know supports combining diacritics. I guess the problem would disappear after diacritics-to-number conversion is implemented. That might mitigate the problem with the @ as well. The other two problems look to me like they have nothing to do with the dictionary but with Pleco itself. Maybe mikelove knows more about it?

For this diacritics-to-number conversion, I see you have had an initial go in your earlier posts, but is there a simple list of rules / definitive definitions of the diacritics somewhere I could look at to implement this as part of the JSON to Pleco flashcards conversion script?
 
Ah, it seems the newest Unicode extension slipped past me and my HanaMin was not up to date. Thanks for pointing it out. How do you change the font for the extension characters in Pleco, though? I can't find an option for that, only for the regular Chinese font...

I don't think this needs to be done manually, tbh, if Pleco's copy of HanaMin is up to date. Maybe @mikelove knows what's going on.
 
Seeing as you seem particularly familiar with the Maryknoll Society's work, I guess you take classes with them?

I haven't taken any classes with the Maryknoll Society, but they have been the best source that I've found for materials to study Taiwanese from English. I visited the Maryknoll office in Taipei a few years ago, but I haven't had much personal contact with them. My connection to Taiwanese is marrying a Taiwanese-American woman I met in college. Her family in Los Angeles uses Taiwanese at home, and I started learning while we were still dating. Right now I'm in a race to see when our 1-year-old baby is going to overtake me in Taiwanese fluency. Much more recently, I've learned a little Mandarin too, but when I visit Taiwan, I mostly use Taiwanese to talk with people and get food, etc.

For this diacritics-to-number conversion, I see you have had an initial go in your earlier posts, but is there a simple list of rules / definitive definitions of the diacritics somewhere I could look at to implement this as part of the JSON to Pleco flashcards conversion script?

You can find the tone numbers here: https://en.wikipedia.org/wiki/Taiwanese_Romanization_System#Tones
1 tong (東)
2 tóng (黨)
3 tòng (棟)
4 tok (督) (this tone has no diacritic mark, but the syllable ends in p, t, k, or h)
5 tông (同)
6 (there is no 6th tone, it merged with the 2nd tone)
7 tōng (洞)
8 to̍k (毒)
 

alex_hk90

状元
You can find the tone numbers here: https://en.wikipedia.org/wiki/Taiwanese_Romanization_System#Tones
1 tong (東)
2 tóng (黨)
3 tòng (棟)
4 tok (督) (this tone has no diacritic mark, but the syllable ends in p, t, k, or h)
5 tông (同)
6 (there is no 6th tone, it merged with the 2nd tone)
7 tōng (洞)
8 to̍k (毒)

Thanks - at first glance it doesn't look too difficult to automate the diacritics-to-numeric-tone conversion. :D

EDIT: Can you check if this is an accurate representation of the mapping?
Code:
{
  "vowels":
  {
  "a":
  {
  "á": "acute",
  "à": "grave",
  "â": "circumflex",
  "ā": "macron",
  "a̍": "vertical",
  "a": "none"
  },
  "e":
  {
  },
  "i":
  {
  },
  "o":
  {
  },
  "u":
  {
  }
  },
  "diacritics":
  {
  "acute": 2,
  "grave": 3,
  "circumflex": 5,
  "macron": 7,
  "vertical": 8,
  "none":
  [
  {
  "p": 4,
  "t": 4,
  "k": 4,
  "h": 4
  },
  1
  ]
  }
}

I've only done "a" so far but I think the other vowels should be the same?

EDIT: Above has been superseded (no need to separate out each vowel, as combining diacritic marks are used).
 

Abun

榜眼
I haven't taken any classes with the Maryknoll Society, but they have been the best source that I've found for materials to study Taiwanese from English. I visited the Maryknoll office in Taipei a few years ago, but I haven't had much personal contact with them. My connection to Taiwanese is marrying a Taiwanese-American woman I met in college. Her family in Los Angeles uses Taiwanese at home, and I started learning while we were still dating. Right now I'm in a race to see when our 1-year-old baby is going to overtake me in Taiwanese fluency. Much more recently, I've learned a little Mandarin too, but when I visit Taiwan, I mostly use Taiwanese to talk with people and get food, etc.
Haha I guess that's a better learning method than classes anyways :D

6 (there is no 6th tone, it merged with the 2nd tone)
This is actually a gross oversimplification, as it only holds completely true for the "literary readings". However, since early Chinese dialectologists almost exclusively concerned themselves with the more "respectable" reading pronunciations, this view has spread far, and even some scholars who should know better still repeat it. I guess a detailed discussion of that would be off-topic here, but we can open a new thread if you like ;) For our purpose here, suffice it to say that the 6th tone disappeared in most variants of Minnan, including those recorded in the MoE dictionary.


Thanks - on first glance it doesn't look too difficult to automate the diacritics to numeric tone conversion. :D

EDIT: Can you check if this is an accurate representation of the mapping?
Code:
{
  "vowels":
  {
  "a":
  {
  "á": "acute",
  "à": "grave",
  "â": "circumflex",
  "ā": "macron",
  "a̍": "vertical",
  "a": "none"
  },
  "e":
  {
  },
  "i":
  {
  },
  "o":
  {
  },
  "u":
  {
  }
  },
  "diacritics":
  {
  "acute": 2,
  "grave": 3,
  "circumflex": 5,
  "macron": 7,
  "vertical": 8,
  "none":
  [
  {
  "p": 4,
  "t": 4,
  "k": 4,
  "h": 4
  },
  1
  ]
  }
}

I've only done "a" so far but I think the other vowels should be the same?
Looks correct to me. Depending on how you plan to implement it, you might have to check whether a second "o" follows the marked "o", because "oo" exists as a vowel distinct from "o". You will definitely have to add checks for "m" and "n" with diacritics as well, though, because Minnan has vocalic "m" and "ng" (in the latter case, the diacritic is placed on the "n").
 

alex_hk90

状元
Looks correct to me. Depending on how you plan to implement it, you might have to check whether a second "o" follows the marked "o", because "oo" exists as a vowel distinct from "o". You will definitely have to add checks for "m" and "n" with diacritics as well, though, because Minnan has vocalic "m" and "ng" (in the latter case, the diacritic is placed on the "n").

Thanks - hopefully I'll have time to have a go this evening, as I'm going travelling over this weekend so it will have to wait until next week otherwise.

EDIT: Started to look at this but should have read your earlier posts in more detail - I missed that they were Unicode combining accents, so I didn't need to list out all the vowels like I did. :oops:
 

Abun

榜眼
Thanks - hopefully I'll have time to have a go this evening, as I'm going travelling over this weekend so it will have to wait until next week otherwise.
Don't worry, I didn't have much time to test the second version and won't get much time this weekend, either...
 

alex_hk90

状元
Ah right, I hadn't thought about that.

I played around with JavaScript a little and managed to write a script that converts text from an input line in the way I imagine it should work. It's probably much messier than it needs to be (I decided it's safer to declare a new variable every time something is edited, just in case, but that's probably not necessary), but at least it works :D I don't know if it's feasible to use JavaScript for such a purpose (since it's database work, I guess PHP is the language of choice?), but maybe the overall structure is still useful.

Code:
<!DOCTYPE html>
<html>
  <head>
  <title>Tâi-lô conversion script</title>
  <meta charset="utf-8" />

  <script>
      function numfunc(inputForm) {
      // store input string in variable inp
      var inp = input.tlinput.value;

      /* Replace spaces with "-q" and double hyphens with "-x" respectively.
      The "-" is detected when splitting in the next step; the letters are used
      to recognize the original spacing character and re-insert it later after
      re-joining. A hyphen is also added in front of punctuation marks in order
      to separate them from the preceding syllable. */
      var inputTrans  = inp.replace(/ /g, "-q");
      inputTrans  = inputTrans.replace(/--/g, "-x");
      inputTrans  = inputTrans.replace(/,/g, "-,");
      inputTrans  = inputTrans.replace(/!/g, "-!");
      inputTrans  = inputTrans.replace(/\./g, "-.");
      inputTrans  = inputTrans.replace(/\?/g, "-?");

      // Split into Array
      var inpArray = inputTrans.split("-");

      // Declare empty output array
      var outpArray = [];

      // For-loop goes through every element in inpArray (every syllable)
      for (i = 0; i < inpArray.length; i++) {
          /* If statements check existence of a combining diacritic in the string
          (acute = 2, grave = 3, circumflex = 5, macron = 7, vertical line
          above = 8), delete it and place the corresponding number at the
          end of the string*/
          if (inpArray[i].search("́") >= 0) {
              outpArray[i] = inpArray[i].replace("́", "");
              outpArray[i] += "2";
          } else if (inpArray[i].search("̀") >= 0) {
              outpArray[i] = inpArray[i].replace("̀", "");
              outpArray[i] += "3";
          } else if (inpArray[i].search("̂") >= 0) {
              outpArray[i] = inpArray[i].replace("̂", "");
              outpArray[i] += "5";
          } else if (inpArray[i].search("̄") >= 0) {
              outpArray[i] = inpArray[i].replace("̄", "");
              outpArray[i] += "7";
          } else if (inpArray[i].search("̍") >= 0) {
              outpArray[i] = inpArray[i].replace("̍", "");
              outpArray[i] += "8";
          } else {
              /* For all elements without diacritic marks, add 4 if they have a
              入聲 coda, output them as is if they are punctuation and add 1 in all
              other cases */
              if (inpArray[i].substring(inpArray[i].length - 1) == "p" ||
                   inpArray[i].substring(inpArray[i].length - 1) == "t" ||
                   inpArray[i].substring(inpArray[i].length - 1) == "k" ||
                   inpArray[i].substring(inpArray[i].length - 1) == "h"
                  ) {
                  outpArray[i] = inpArray[i] + "4";
              } else if (inpArray[i] == "." || inpArray[i] == "," ||
                              inpArray[i] == "?" || inpArray[i] == "!" ||
                              inpArray[i] == ""  || inpArray[i] == "q" ||
                              inpArray[i] == "x"
                             ) {
                  outpArray[i] = inpArray[i];
              } else {
                  outpArray[i] = inpArray[i] + "1";
              }
          }
      }

      // Join output array to a string
      var output = outpArray.join("-");
 
      /* Replace "-q" and "-x" with a spacebar and double hyphen respectively and
      delete the seperating hyphen in front of punctuation */
      output = output.replace(/-q/g, " ");
      output  = output.replace(/-x/g, "--");
      output  = output.replace(/-,/g, ",");
      output  = output.replace(/-!/g, "!");
      output  = output.replace(/-\./g, ".");
      output  = output.replace(/-\?/g, "?");
  
      // Insert output in the "output" paragraph
      document.getElementById("output").innerHTML = output;
      }
  </script>

  </head>

  <body>
  <!-- Input form -->
  <form id="input" action="" onsubmit="numfunc()" method="get">
  Input Romanization here:<br />
  <input type="text" name="tlinput" /><br />
  <input type="button" value="Click to output" onclick="numfunc(this.inputForm)" />
  </form>

  <!-- Output -->
  <p>Output with numbers:</p>
  <p id="output"></p>
  </body>
</html>

EDIT: Just thought of one problem: this script doesn't take punctuation into account, so if a syllable is followed by a punctuation mark, the number is added after it (e.g. "Tsa-bóo khiā tsit pîng, tsa-poo khiā hit pîng." --> Tsa-boo2 khia7 tsit ping,5 tsa-poo khia7 hit ping.5).
EDIT2: Streamlined it a bit in terms of the number of different variables. Implemented numbering for the 1st and 4th tones as well. Also taught it to recognize certain punctuation marks ("," and "!" to be precise) and add the numbers in front of them instead of after them. "." and "?" continue to be a problem because they are special characters in the regex, so I can't use the same method as for "," and "!".
EDIT3: Now working for "." and "?" as well. I just forgot that I have to escape them with \ :oops:

Thanks for posting this, it helped me think about how to do it in Python.
I've got it to a stage where it could be working, but I'm not sure I've included all the possible punctuation - I might need to do some kind of check against the raw data here. Also, because I'm checking for all the different separators and punctuation for every word, it's not the most efficient.
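The core of the approach looks roughly like this (a simplified sketch rather than the actual script, and the function names are just illustrative; it assumes the source text uses the Unicode combining diacritics U+0301, U+0300, U+0302, U+0304 and U+030D, and that toneless syllables are tone 4 when they end in p/t/k/h and tone 1 otherwise, as described above):
Code:
import re
import unicodedata

# Combining diacritic -> tone number (acute = 2, grave = 3, circumflex = 5,
# macron = 7, vertical line above = 8).
DIACRITIC_TONES = {
    "\u0301": "2",
    "\u0300": "3",
    "\u0302": "5",
    "\u0304": "7",
    "\u030D": "8",
}

def syllable_to_numeric(syl):
    """Convert one Tâi-lô syllable from diacritic to numeric tone."""
    # Decompose so any precomposed vowels become base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", syl)
    tone = None
    letters = []
    for ch in decomposed:
        if ch in DIACRITIC_TONES:
            tone = DIACRITIC_TONES[ch]
        else:
            letters.append(ch)
    base = unicodedata.normalize("NFC", "".join(letters))
    if tone is None:
        stripped = base.rstrip(".,!?")
        if not stripped:
            return base  # pure punctuation or empty (e.g. between "--")
        tone = "4" if stripped[-1] in "ptkh" else "1"
    # Put the number before any trailing punctuation.
    m = re.match(r"^(.*?)([.,!?]*)$", base)
    return m.group(1) + tone + m.group(2)

def line_to_numeric(line):
    """Convert a whole line, keeping spaces and hyphens as separators."""
    return "".join(
        part if part in (" ", "-") else syllable_to_numeric(part)
        for part in re.split(r"([ -])", line)
    )

print(line_to_numeric("Kā phue̍h tsānn--khí-lâi."))
# -> Ka7 phueh8 tsann7--khi2-lai5.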
Anyway, with the script I've written I get the following results with a few test lines:
Code:
1: Khuànn-tio̍h tsit khuán lâng tō gê
Khuann3-tioh8 tsit4 khuan2 lang5 to7 ge5
2: Kè-á tu khah kuè--khì--leh.
Ke3-a2 tu1 khah4 kue3--khi3--leh4.
3: Kā phue̍h tsānn--khí-lâi.
Ka7 phueh8 tsann7--khi2-lai5.
4: Hit nn̄g uân oo-oo ê mi̍h-kiānn sī siánn-mih?
Hit4 nng7 uan5 oo1-oo1 e5 mih8-kiann7 si7 siann2-mih4?
5: Honnh, guân-lâi sī án-ne--ooh!
Honnh4, guan5-lai5 si7 an2-ne1--ooh4!
6: Tsa-bóo khiā tsit pîng, tsa-poo khiā hit pîng.
Tsa1-boo2 khia7 tsit4 ping5, tsa1-poo1 khia7 hit4 ping5.
Does that look right to you?
 

Abun

榜眼
Anyway, with the script I've written I get the following results with a few test lines:
Code:
1: Khuànn-tio̍h tsit khuán lâng tō gê
Khuann3-tioh8 tsit4 khuan2 lang5 to7 ge5
2: Kè-á tu khah kuè--khì--leh.
Ke3-a2 tu1 khah4 kue3--khi3--leh4.
3: Kā phue̍h tsānn--khí-lâi.
Ka7 phueh8 tsann7--khi2-lai5.
4: Hit nn̄g uân oo-oo ê mi̍h-kiānn sī siánn-mih?
Hit4 nng7 uan5 oo1-oo1 e5 mih8-kiann7 si7 siann2-mih4?
5: Honnh, guân-lâi sī án-ne--ooh!
Honnh4, guan5-lai5 si7 an2-ne1--ooh4!
6: Tsa-bóo khiā tsit pîng, tsa-poo khiā hit pîng.
Tsa1-boo2 khia7 tsit4 ping5, tsa1-poo1 khia7 hit4 ping5.
Does that look right to you?
Looks correct to me :)

I also tested the second version some more. The problems you have addressed (the ones with the "example" and "type" fields) were successfully fixed as far as I can see and I didn't find any new ones either (well, I did find a spelling mistake in one entry, but that's a problem of the source, not the conversion).

I was also able to pinpoint the behaviour of @ in searches: adding @ returns all entries that contain the string that follows the @ but don't begin with it. So searching for "@in" returns "lin", "thinn", "iau-kin", "so-inn" etc., but not "in", "inn" and so on. The syllable-separating dash is not taken into account for some reason, so "pai-ni" is returned as well. I don't know whether this is something that can be fixed, though.
 

alex_hk90

状元
Looks correct to me :)

I also tested the second version some more. The problems you have addressed (the ones with the "example" and "type" fields) were successfully fixed as far as I can see and I didn't find any new ones either (well, I did find a spelling mistake in one entry, but that's a problem of the source, not the conversion).

I was also able to pinpoint the behaviour of @ in searches: adding @ returns all entries that contain the string that follows the @ but don't begin with it. So searching for "@in" returns "lin", "thinn", "iau-kin", "so-inn" etc., but not "in", "inn" and so on. The syllable-separating dash is not taken into account for some reason, so "pai-ni" is returned as well. I don't know whether this is something that can be fixed, though.

Thanks. :)

Next version(s) ready:
- Pleco flashcards (14,005 entries) with diacritic tones (as source data): [EDIT: superseded by new version - see Pleco-User-Dictionaries/MoE-Minnan [GitHub] for latest version]
- Pleco flashcards (14,005 entries) with numeric tones: [EDIT: superseded by new version - see Pleco-User-Dictionaries/MoE-Minnan [GitHub] for latest version]

The main change is the addition of the numeric tone version, currently only for the headwords, but the code has been modularised so it should not be too difficult to apply to the definitions and examples as well, as long as they can be reliably identified within the longer string.

Once the diacritic to numeric tone conversion has been applied to the definitions and examples as well as the headwords, what else is left to do?

EDIT: Another new version (MoE-Minnan-v04), with numeric tones for (hopefully) all Romanisation (there could be one or two remaining bugs with numeral placement, let me know if you find anything):
- Pleco flashcards (14,005 entries) with diacritic tones (as source data): [EDIT: see Pleco-User-Dictionaries/MoE-Minnan [GitHub] for latest version]
- Pleco flashcards (14,005 entries) with numeric tones: [EDIT: see Pleco-User-Dictionaries/MoE-Minnan [GitHub] for latest version]

If this works then I think we're pretty much done? :)
 

Abun

榜眼
Another new version, with numeric tones for (hopefully) all Romanisation (there could be one or two remaining bugs with numeral placement, let me know if you find anything):
- Pleco flashcards (14,005 entries) with diacritic tones (as source data): MoE-Minnan-v04
- Pleco flashcards (14,005 entries) with numeric tones: MoE-Minnan-v04-numeric
Importing atm. Are there any special conditions in particular which you suspect might cause bugs and which I should look out for?

If this works then I think we're pretty much done? :)
Pretty much, I guess^^ I can only think of two things which we might consider including, but neither of them is absolutely essential if you ask me.
The first would be information about 異體字, but that seems to be located in another json file (https://github.com/g0v/moedict-data-twblg/blob/master/x-異用字.json). Does that make things difficult?
The information in the "reading" field of the main json could be included as well, but I think it would have to be recognizable as what it is at first glance so it doesn't create confusion. The MoE website does this by displaying it inside a square behind the header character, which I don't think is possible in Pleco because there don't seem to be any special (squared, circled or whatever) forms of these characters in Unicode (the MoE used image files). Maybe putting it at the top of the entry in full-width angle brackets (【】) would be visually clear enough... But in my opinion it's not an absolute must anyway :) @audreyt's 萌典 doesn't display it, either.

I also realized that there are a few entries in the dict on the original MoE website which don't appear in the json file, specifically those which are marked as "additional" (附錄) on the website. These entries contain more specialized vocabulary, for example family names, toponyms, non-Sinitic loanwords, the 24 節氣 (立春, 雨水 etc.), as well as a few additional terms for certain family members. I can't find corresponding files in @audreyt's directory, though. Considering the 萌典 doesn't list them, it's quite possible the files don't exist. I still think they are quite interesting, though (especially the loanwords). Considering the number of entries isn't very high (3~400 in total maybe), I'm thinking of just making a list myself. Could you give me information about what kind of format would be usable for you?
Especially the loanwords deserve some attention here, I guess. Most of them are Japanese (or English borrowed via Japanese), so I think we would need information not only on the meaning and PoS, but also on the original word that was borrowed. On the other hand, these words were not assigned characters by the MoE, so I suggest either leaving the character line blank or copying the romanization in there if Pleco doesn't like a blank character line. The entries might look something like:

an3-nai7
<動> 招待、引導
原: 日: 案內(あんない, annai)​

kha1-me2-lah4
<名> 開麥拉、照相機
原: 日: カメラ(kamera) > 英: camera​

A problem may be pseudo-English loans, though (such as oo-tóo-bái (機車) > オートバイ(ōtobai) > "autobike"). I'm not quite sure how best to indicate this.
 

alex_hk90

状元
Importing atm. Are there any special conditions in particular which you suspect might cause bugs and which I should look out for?
Cases where there is a lot of switching between Chinese and Latin characters without separating vocab might trip up the logic - I think I have got enough recursion in the logic to catch all likely cases, but some might have slipped through the net.
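Roughly speaking, the idea is to pick out the runs of Latin letters and combining diacritics with a regular expression and run the tone conversion only on those, leaving the Hanzi and punctuation untouched. This is just a simplified sketch of the general approach rather than the exact code, and it leans on the syllable converter sketched earlier in the thread:
Code:
import re
import unicodedata

# Runs of romanized text: Latin letters and combining diacritics,
# optionally chained with hyphens or spaces. Hanzi and punctuation
# fall outside the match and are left untouched.
TAILO_RUN = re.compile(r"[A-Za-z\u0300-\u030F]+(?:[- ][A-Za-z\u0300-\u030F]+)*")

def convert_mixed(text, convert_run):
    """Apply convert_run (e.g. the diacritic-to-number converter sketched
    earlier) to each romanized run inside a mixed Chinese/Latin string."""
    # Decompose first so precomposed letters like "à" also match the class.
    decomposed = unicodedata.normalize("NFD", text)
    return TAILO_RUN.sub(lambda m: convert_run(m.group(0)), decomposed)

# Illustrative only (mixing Hanzi with one of the test lines above):
# convert_mixed("例：Khuànn-tio̍h tsit khuán lâng tō gê。", line_to_numeric)
# -> "例：Khuann3-tioh8 tsit4 khuan2 lang5 to7 ge5。"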

Pretty much, I guess^^ I can only think of two things which we might consider including, but neither of them is absolutely essential if you ask me.
:)

The first would be information about 異體字, but that seems to be located in another json file (https://github.com/g0v/moedict-data-twblg/blob/master/x-異用字.json). Does that make things difficult?
Multiple files don't really make a big difference here - might take a bit longer to run the script unless I set up a temporary relational database, but nothing conceptually difficult about it.
How is that file structured and how would you suggest combining it with the current version?

The information in the "reading" field of the main json could be included as well, but I think they would have to be recognizable as what they are at the very first glance so they don't create confusion. The MoE website does this by displaying them inside a square behind the header character which I don't think is possible in Pleco because it doesn't seem there are any special (squared, circled or whatever) forms of these characters in Unicode (the MoE used image files). Maybe putting it at the top of the entry full space angled brackets (【】) would be visually clear enough... But in my opinion it's not an absolute must anyways :) @audreyt's 萌典 doesn't display it, either.
That would be quite easy to do, I can add it to the next version and see if it's better with or without it (or if it might be better somewhere else, like at the end of the definition).

I also realized that there are a few entries in the dict on the original MoE website which don't appear in the json file, specifically those which are marked as "additional" (附錄) on the website. These entries contain more specialized vocabulary, for example family names, toponyms, non-Sinitic loanwords, the 24 節氣 (立春, 雨水 etc.), as well as a few additional terms for certain family members. I can't find corresponding files in @audreyt's directory, though. Considering the 萌典 doesn't list them, it's quite possible the files don't exist. I still think they are quite interesting, though (especially the loanwords, because some of those are quite interesting). Considering the number of entries isn't very high (3~400 in total maybe), I'm thinking of just making a list myself. Could you give me information about what kind of format would be usable for you?
Especially the loanwords deserve some attention here, I guess. Most of them are Japanese (or English borrowed via Japanese), so I think we would need information not only on the meaning and PoS, but also on the original word that was borrowed. On the other hand, these words were not assigned characters by the MoE, so I suggest either leaving the character line blank or copying the romanization in there if Pleco doesn't like a blank character line. The entries might look something like:

an3-nai7
<動> 招待、引導
原: 日: 案內(あんない, annai)​

kha1-me2-lah4
<名> 開麥拉、照相機
原: 日: カメラ(kamera) > 英: camera​

A problem may be pseudo-English loans, though (such as oo-tóo-bái (機車) > オートバイ(ōtobai) > "autobike"). I'm not quite sure how best to indicate this.
If you're doing it manually then anything close to Pleco flashcard format would be best (have a look at the MoE-Minnan ones for an example):
Hanzi{TAB}Pinyin{TAB}Definition
For new lines in definition you need to use a particular Unicode private use character: 
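In a script, writing one of those lines might look roughly like this (just a sketch; the private-use codepoint U+EAB1 below is the one I believe Pleco uses for line breaks in flashcard files, but double-check it against a file that is known to import correctly, and the output file name is only a placeholder):
Code:
# Sketch of writing one appendix entry in Pleco flashcard format:
# Hanzi{TAB}Pinyin{TAB}Definition, one entry per line.
PLECO_NEWLINE = "\ueab1"  # assumed private-use "new line" character

def flashcard_line(hanzi, reading, definition_parts):
    """Join one entry into a single tab-separated flashcard line."""
    return "\t".join([hanzi, reading, PLECO_NEWLINE.join(definition_parts)])

# Loanword example from above; the MoE assigns no characters, so the
# romanization is duplicated into the character field as suggested.
with open("MoE-Minnan-appendix.txt", "w", encoding="utf-8") as out:
    out.write(flashcard_line(
        "an3-nai7",
        "an3-nai7",
        ["<動> 招待、引導", "原: 日: 案內(あんない, annai)"],
    ) + "\n")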
 

Abun

榜眼
Cases where there is a lot of switching between Chinese and Latin characters without separating vocab might trip up the logic - I think I have got enough recursion in the logic to catch all likely cases, but some might have slipped through the net.
You mean a lot of switching between characters and Latin letters within a single entry? The most extreme examples are probably single-character entries which have a lot of meanings. I checked a few and couldn't see anything out of the ordinary :) However, it did give me the idea that the translations of examples could be put in parentheses to make it look neater xD

If you're doing it manually then anything close to Pleco flashcard format would be best (have a look at the MoE-Minnan ones for an example):
Hanzi{TAB}Pinyin{TAB}Definition
For new lines in definition you need to use a particular Unicode private use character: 
I could do that, but I doubt that's particularly practical. Wouldn't it be a bit like a website which is styled directly in the HTML instead of using a stylesheet, i.e. written with only the immediate output in mind? It would be rather inflexible in terms of possible changes to the layout because every line would have to be checked by hand, wouldn't it?
For the moment I made an Excel sheet which contains the family names and the loanwords (the toponyms are a bit more numerous than I expected, so I left them out for now). Then I made a Pleco-friendly txt out of the loanword part, which did work, but as expected it took quite a lot of time because I had to do the layout by hand on every single line. I wonder if it wouldn't be more practical to use some sort of database format and then use a script to compile the txt. For now I used xlsx, simply because Excel is easier to work with than writing JSON by hand in a text editor, but I realize that xlsx is maybe not ideal for our purpose...
 

Attachments

  • MoEDict_loans.txt
    18.8 KB

alex_hk90

状元
You mean a lot of switching between characters and Latin letters within a single entry? The most extreme examples are probably single-character entries which have a lot of meanings. I checked a few and couldn't see anything out of the ordinary :) However, it did give me the idea that the translations of examples could be put in parentheses to make it look neater xD
Yeah, switching between characters and Latin without punctuation between them.
And bracketing the translation of examples is probably doable - I might have a look at that for the next version.

I could do that, but I doubt that's particularly practical. Wouldn't it be a bit like a website which is styled directly in the HTML instead of using a stylesheet, i.e. written with only the immediate output in mind? It would be rather inflexible in terms of possible changes to the layout because every line would have to be checked by hand, wouldn't it?
For the moment I made an Excel sheet which contains the family names and the loanwords (the toponyms are a bit more numerous than I expected, so I left them out for now). Then I made a Pleco-friendly txt out of the loanword part, which did work, but as expected it took quite a lot of time because I had to do the layout by hand on every single line. I wonder if it wouldn't be more practical to use some sort of database format and then use a script to compile the txt. For now I used xlsx, simply because Excel is easier to work with than writing JSON by hand in a text editor, but I realize that xlsx is maybe not ideal for our purpose...
An Excel sheet is fine, as you can easily output that to CSV or tab-delimited format (pretty much Pleco flashcards) anyway, both of which can then be read into a script or database.

Let me know what you think the priority items should be for the next version. :)
 

Abun

榜眼
Yeah, switching between characters and Latin without punctuation between them.
And bracketing the translation of examples is probably doable - I might have a look at that for the next version.
Yeah, that should occur most often in single-character entries with a lot of meanings, because usually each meaning would have at least one example, and with single characters (i.e. not full words) the examples would not be full sentences. I haven't found any problems so far, though :)

An Excel sheet is fine, as you can easily output that to CSV or tab-delimited format (pretty much Pleco flashcards) anyway, both of which can then be read into a script or database.
Uploaded an Excel sheet with the family names, loanwords and 24節氣 (https://www.dropbox.com/s/415y5jd5jt4cg3d/MoE_Dict_appendix.xlsx?dl=0). I'm working on the toponyms, but as I said, there are more of those than I thought. Moreover, I will be very busy during this weekend and the coming week and probably won't get a lot done there (if anything), so it might take a bit.

Also, just as a disclaimer: the information in the "type" columns, as well as the information on the source words in Western languages, is not copied from the MoE but has been added by me ("type" for the sake of consistency, the etymology for completeness). I am torn on whether this addition is justified when weighed against being faithful to the source. (Btw, I have a similar conflict with the JapKanji column, as some of the Kanji there are presented in their traditional form instead of the modern Japanese simplified one, for example 櫻 instead of 桜. I decided to go with the source there, though.) What's your opinion?

Let me know what you think the priority items should be for the next version. :)
My priority would be: appendix > 異體字 > "reading" > parentheses around example translations.
 

Abun

榜眼
Just noticed that I haven't answered your question concerning the structure of the 異體字 json file yet :oops:
It's actually very minimalistic: it lists the 異體字 for the entries by referencing their ID (it doesn't actually explain the numbers, but I checked a few examples; the number corresponds to the "id" field in the main json file).
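If I've understood the format correctly, merging it into the main data could then be something along these lines (just a rough idea, I haven't tried it; the name of the main json file and the key names I use for the variant file are guesses and would need checking against the actual files):
Code:
import json

# Assumed structure: the main file is a list of entries with an "id"
# field; the variant file is a list of {"id": ..., "異用字": ...}
# records. Both of these are guesses.
with open("dict-twblg.json", encoding="utf-8") as f:    # main dictionary (name is a guess)
    entries = json.load(f)
with open("x-異用字.json", encoding="utf-8") as f:       # variant characters
    variants = json.load(f)

variants_by_id = {rec["id"]: rec["異用字"] for rec in variants}

for entry in entries:
    if entry["id"] in variants_by_id:
        # Keep the variant characters on the entry so the flashcard
        # converter can append them to the definition later.
        entry["異體字"] = variants_by_id[entry["id"]]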
 

Abun

榜眼
Just finished the work on the toponyms and updated the xls file on Dropbox (link doesn't seem to have changed: https://www.dropbox.com/s/415y5jd5jt4cg3d/MoE_Dict_appendix.xlsx?dl=0).

In some of the tables for toponyms there was a distinction between 讀音一 and 讀音二. In most cases, the second one contains an older version of the name which does not necessarily match the characters (for example Ku7-tsam7 (舊站) for the train station which today is called Tsiau1-ping5 (沼平)). In a limited number of cases, this field contains a dialectal variant pronunciation as well, but that seems like a mistake to me. In any case, I stored 讀音二 in a column "alt". In terms of output, I suggest adding it to the normal pronunciation line after a slash (ex. "Tsiau1-ping5/Ku7-tsam7"), since that would make it possible to find the entry by searching for the 讀音二.
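In script terms that would presumably just be something like (the function name is only illustrative):
Code:
def pronunciation_field(reading_1, reading_2=""):
    # Join 讀音一 and 讀音二 with a slash so the entry can be found by
    # searching for either reading.
    return f"{reading_1}/{reading_2}" if reading_2 else reading_1

# pronunciation_field("Tsiau1-ping5", "Ku7-tsam7") -> "Tsiau1-ping5/Ku7-tsam7"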
 