Romanizations for Hokkien / Hakka / Wu / Sichuanese

mikelove

皇帝
Staff member
We've decided we need to ship some sort of official support for a couple of Chinese topolects in Pleco 4.0 - rather than expecting users to go through dozens of pages of reference documentation to figure out how to do so themselves - and I'm looking for some advice on which romanization systems to default to and how to use them.

Hokkien: it seems like Tâi-lô is the predominant system, and for the most part we can approach it like we do Mandarin and (Yale) Cantonese - search with suffix tone numbers and display with diacritics. Does that make sense?

Hakka: there seems to be more active competition here, but since, as with Hokkien, the Taiwan MOE seems to publish a lot of material for this and it generally seems like most of the interest in working with Hakka languages is coming from users in Taiwan, we should probably follow their 臺灣客家語拼音方案 scheme, correct? The Wikipedia article on that shows floating diacritic marks after syllables, but then the actual MOE Hakka dictionary uses superscript numbers, 2 per syllable, which to me look tidier - would the latter system likely be acceptable to most users? And for searching, would we want to let you enter zero or two digits and ignore suffixes of just one digit?

Wu: again more competition but I get the impression that the most favored system at the moment is Wugniu? And that between sandhi chains and other complications online dictionaries generally don't bother to make the tones searchable? Is it worth the trouble to implement a sandhi notation system like Wiktionary's? (it seems like they've got about the only open-source dictionary data for it)

Sichuanese: it seems like the system to use here is Sichuanese Pinyin, which again we can treat like regular Pinyin but with more syllables and superscripts instead of diacritics?
 
Sichuanese: it seems like the system to use here is Sichuanese Pinyin, which again we can treat like regular Pinyin but with more syllables and superscripts instead of diacritics?
+ merger of erhua into the syllable, i.e.: 包儿 ---> ber1 -or- 猫儿 ---> mer1 (as opposed to MSM bao1 r5 and mao1 r5)

Not sure if 2 characters corresponding to a single syllable is tricky or not.
 

mikelove

皇帝
Staff member
Not a major problem, no - we can either make a table that includes every possible erhua-ended syllable or just skip the whole process and accept any syllable that looks valid. (we don't do that with Pinyin for performance reasons, but it's less of a problem when you're not worried about scaling up to 40+ dictionaries)
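As a rough illustration of the "accept anything that looks valid" route, here is a hypothetical sketch of merging an erhua suffix into a toneless Sichuanese Pinyin base syllable. The rule shown (keep the onset, replace the rime with "er") matches the bao → ber / mao → mer examples above, but it is a deliberate simplification: medials and real Sichuanese phonology would need a fuller table.

```python
def merge_erhua(base: str) -> str:
    """Keep the onset of a toneless syllable and replace the rime with 'er'.

    A deliberate simplification: medials and dialect-specific rimes are
    ignored, so this only demonstrates the table-free validation idea.
    """
    i = 0
    # Skip leading consonant letters until the first vowel of the rime.
    while i < len(base) and base[i] not in "aeiou":
        i += 1
    return base[:i] + "er"

# merge_erhua("bao") == "ber"   (包儿)
# merge_erhua("mao") == "mer"   (猫儿)
```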
 

Abun

榜眼
Great to hear that you're giving thought to how to integrate more non-Standard varieties of Chinese! It would be awesome to see them supported out-of-the-box in Pleco!

My two cents when it comes to Hokkien (about the others I'm afraid I don't know enough):
Hokkien: it seems like Tâi-lô is the predominant system, and for the most part we can approach it like we do Mandarin and (Yale) Cantonese - search with suffix tone numbers and display with diacritics. Does that make sense?
How dominant Tâi-lô is depends a bit on how you look at it. Pe̍h-ōe-jī (POJ) has a much longer history and has even been used as a script proper (rather than just as a crutch for learning/clarifying the pronunciation of characters) in a number of publications, including Bible translations, newspapers and, I’m told, personal correspondence. So I’d wager there are probably more sources in POJ than in Tâi-lô, although I freely admit I don’t have numeric data to back up that claim. One other point to consider is that there are parts of the Taiwanese Hokkien world that remain highly suspicious of any language-planning attempts from the government, given its past history of repression, and that therefore dislike Tâi-lô on principle. I don’t have any data on how large this group could be, though.

That said, if you have to choose one, I personally would still suggest Tâi-lô for a couple of reasons:
  • visual clarity, especially when it comes to Tâi-lô <oo> vs POJ <o͘>
  • standardisation: POJ is much like Wade-Giles – it isn’t so much one system as a family of closely related ones. The general features have stabilised over the last 150 or so years, but some details (e.g. tone mark placement on certain finals, relative placement of <ⁿ> and final <h> etc.) can differ from publication to publication.
  • regularity: POJ is already fairly phonemic, but it does contain certain inconsistencies (e.g. it uses <u> for the onglide /w/, but /wa/ and /we/ are written <oa> and <oe>, which in most POJ variants also exhibit a rather arcane behaviour with regards to how the tone diacritic is placed). Tâi-lô, while not entirely phonemic either, tends to be more regular: /wa/ and /we/ are <ua> and <ue>, with the diacritic always placed on the core vowel.
Still, there will be plenty of people who prefer POJ, or even more niche transcription systems such as Bbánlám pìngyīm, and who might consider it worth the effort to implement conversion to their favourite scheme themselves. Do you plan on supporting something like that?
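For what it’s worth, the regular correspondences between the two systems are mechanical enough that a conversion sketch is short. The sketch below operates on tone-number input (diacritic handling and proper syllable tokenisation are left out) and covers only the best-known POJ → Tâi-lô substitutions:

```python
# Ordering matters: 'chh' must be replaced before 'ch'.
POJ_TO_TAILO = [
    ("chh", "tsh"),      # aspirated affricate
    ("ch", "ts"),
    ("o\u0358", "oo"),   # o͘ (o + combining dot above right)
    ("oa", "ua"),
    ("oe", "ue"),
    ("eng", "ing"),
    ("ek", "ik"),
    ("\u207f", "nn"),    # superscript n marking nasalisation
]

def poj_to_tailo(syllable: str) -> str:
    for poj, tailo in POJ_TO_TAILO:
        syllable = syllable.replace(poj, tailo)
    return syllable

# poj_to_tailo("chhoe7") == "tshue7"   (POJ chhōe, Tâi-lô tshuē 'to look for')
```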

If so, there is one complication which might be worth considering while planning how to structure the raw data: Both Tâi-lô and POJ, like Hanyu Pinyin, show the underlying tone for all syllables and leave it to the reader to figure out where and how to apply tone sandhi. Some more niche systems (particularly those which try to indicate the tone with otherwise unused letters instead of diacritics or numbers, e.g. Phofsit Daibuun) write the surface tone instead – so most words can be spelled in two different ways, depending on whether they are sandhied or not in a given context. As a result, conversion between these two types of systems is only possible if the data contains information on which syllables are subject to sandhi and which are not. Indeed, if it is at all possible to obtain, such data would be valuable even outside of conversion, e.g. showing sandhi domains for learners or at some point (fingers crossed) for TTS.

An additional complication is of course Hokkien varieties outside of mainstream Xiamen/Taiwan Hokkien. Although I guess the further you differ from that mainstream, the less relevant the data in POJ/Tâi-lô dictionaries will be, so users interested in such variants will probably have to manage their own user dicts anyways.
 

mikelove

皇帝
Staff member
Thanks for this detailed reply.

We have a lot of code already written for conversions between romanizations, so I think we can support POJ pretty easily if a lot of people request it. But Tâi-lô being consistent and standardized has a lot to recommend it as an internal storage format. We also already support separate fields for sandhi readings in Mandarin and Cantonese and can easily offer one of those for Hokkien too.
 

Hydramus

Member
Hi, this is fantastic news.
I was trying to work out a way to automate Hong Kong Hakka romanisation in Pleco and potentially publish it, but I don't know how. I'm happy to help with what I can.
For material you might find interesting on Hong Kong Hakka:

Hong Kong Hakka Wiki Entry
Hong Kong Hakka Dictionary

I do understand the sentiment that the Taiwan MOE scheme has the most published material, and since Hakka is also an official language in Taiwan, I'd fully support it being the main one that you guys focus on in Pleco (and I guess this might help me learn it!).

On the romanisation, this is the tricky part. I am less familiar with it, but which dialect is considered more commonplace and should be used here? Which one is common in learning material or children's books, for example? I find that diacritics might be more accessible for some, while the numbers are more precise but have a learning curve. Would it be possible for searching to be flexible enough to accept zero, one or two digits? I would likely search without the digits, like I do for Cantonese.
 

mikelove

皇帝
Staff member
Thanks. Do you know if HK Hakka can be expressed accurately using the MOE system?

For sandhi in Mandarin and tone changes in Cantonese, we've adopted the approach of letting you search for either or neither - we certainly could support entering both tone numbers, but I'm not sure how many people would be interested in looking up words that way or how much a search might be narrowed down by that versus only entering one.
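A sketch of that "either or neither" indexing approach, under the assumption that each entry simply stores one search key with tone digits and one without (the function name here is made up for illustration):

```python
import re

def index_keys(reading: str) -> set:
    """Search keys stored for one reading: with tone digits and without."""
    compact = reading.replace(" ", "").lower()
    toneless = re.sub(r"\d", "", compact)
    return {compact, toneless}

# index_keys("fung1 ngin2") == {"fung1ngin2", "fungngin"}
# so a query of either "fung1ngin2" or "fungngin" finds the entry
```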
 

Abun

榜眼
Thanks for this detailed reply.

We have a lot of code already written for conversions between romanizations, so I think we can support POJ pretty easily if a lot of people request it. But Tâi-lô being consistent and standardized has a lot to recommend it as an internal storage format. We also already support separate fields for sandhi readings in Mandarin and Cantonese and can easily offer one of those for Hokkien too.
Thanks for your reply! I agree with Tâi-lô being more suitable at the very least internally. And if anybody doesn’t like it for any reason, conversion (or if possible even better: the ability to define custom conversion procedures) would probably be greatly appreciated.

When it comes to sandhi, could you elaborate on what those sandhi reading fields look like? Because a field for alternative pronunciations of the same head word would not solve the main issue with Hokkien.

The Hokkien situation is like this: All tones basically have two pitch contours: One which is used under sandhi conditions and one otherwise. Sandhi conditions mean, simplified somewhat, that the syllable is followed by at least one other non-neutral tone syllable within its prosodic phrase (or in other words, the final full-tone syllable within any given prosodic phrase is the only one in that phrase which is not sandhi’ed, all others before it are). Note that unlike in Mandarin or Wu, Hokkien sandhi doesn’t care at all which tones the surrounding syllables carry; there is one sandhi pitch contour for each tone category, the tones around do not matter.

This also means that in order to pronounce a given phrase correctly, you need information on how it is broken up into sandhi phrases. To my knowledge, no transcription system systematically provides this information, unfortunately. Most of them (like Tâi-lô and POJ) write the underlying tone and expect the reader to apply sandhi entirely by themselves (at least almost: they indicate neutral-tone syllables with a double hyphen, and since neutral tones can normally only occur at the end of a sandhi phrase after the unsandhi’ed syllable, that does give an occasional indication). In scientific literature on the topic, many researchers use the pound character # to indicate a sandhi phrase boundary.

So if the internal representation is to be future-proof, be it for conversion, TTS or simply as a sandhi indication for users, it needs information about which syllables belong to the same sandhi phrase – unless, perhaps, you can use AI: the National Taipei University of Technology’s TTS engine sounds quite good even with user input (https://web.archive.org/web/20240426145828/http://tts001.iptcloud.net:8804/ – I’ve had problems with the site being unreachable sometimes, hence the archived link), and the private company Ithuan (https://ithuan.tw/) seems to be making advances there as well, though I couldn’t test their engine.

Luckily, head-words in a dictionary are kind of the easiest case here because single words pretty much always go in the same sandhi phrase and can’t be split up further. So you can generally assume that the only syllable which is sensitive to the context is the last. All other syllables will always be sandhi’ed because they have at least one syllable following. However there are some exceptional entries which have internal sandhi boundaries imposed by their syntactic structure. One obvious example are idioms, including four-character fixed expressions such as chengyu. There are also a couple of other, normal words with an internal sandhi boundary, usually these are subject-verb compounds. For example, the Taiwanese Hokkien words for ‘earthquake’ and ‘headache’ are 地動 tē-tāng or tuē-tāng, and 頭疼 thâu-thiànn, each with the first syllable unsandhi’ed. And of course when it comes to example sentences, most of them will definitely contain more than one sandhi phrase.
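For reference, the regular Taiwanese sandhi circle can be written as a simple lookup: 1→7, 7→3, 3→2, 2→1, 5→7 (northern) or 3 (southern), with the checked tones 4 and 8 swapping when they end in -p/-t/-k (syllables in -h behave differently and are omitted here, as are neutral tones). A minimal sketch applying it to every non-final syllable of a headword, assuming hyphen-separated tone-number input:

```python
# Regular (non -h-final) sandhi circle; tone 5 uses the northern value 7.
SANDHI = {"1": "7", "7": "3", "3": "2", "2": "1", "5": "7", "4": "8", "8": "4"}

def apply_sandhi(word: str) -> str:
    """Sandhi every syllable except the last -- the usual headword pattern."""
    sylls = word.split("-")
    head = [s[:-1] + SANDHI[s[-1]] for s in sylls[:-1]]
    return "-".join(head + sylls[-1:])

# apply_sandhi("tai5-uan5") == "tai7-uan5"
```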

In any case, I’m looking forward to being able to use Pleco for non-Mandarin/Cantonese regional languages as well, that would be a huge improvement to an already great app for me!
 

Hydramus

Member
Yeah, it will work the same as the Taiwan MOE system: with the diacritics, a single number representing each of the six tones (see below, from my notes), or the double-number notation for the pitch contour. I wonder if we could one day have the diagrams in the app to explain these too.
Tone table
Ref: http://www.hkilang.org/v2/發音字典/
Ref: p. 433, 香港客家话研究

Tonal mark | Simplified number notation | Directional tone notation (see graph) | Chinese example
fúng       | fung1                      | fung33                                |
fūng       | fung2                      | fung11                                |
fǔng       | fung3                      | fung31                                |
fùng       | fung4                      | fung55                                |
sǐt        | sit5                       | sit3                                  |
sìt        | sit6                       | sit5                                  |
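Restating the table above programmatically (purely the values given there, for this Hong Kong Hakka variety), a simplified-number syllable can be converted to the directional pitch-contour notation like so:

```python
# Simplified number -> directional (pitch contour) notation, per the table.
HK_HAKKA_CONTOURS = {"1": "33", "2": "11", "3": "31", "4": "55", "5": "3", "6": "5"}

def to_contour(syllable: str) -> str:
    """Swap a trailing simplified tone digit for its pitch contour."""
    return syllable[:-1] + HK_HAKKA_CONTOURS[syllable[-1]]

# to_contour("fung1") == "fung33"
# to_contour("sit5") == "sit3"
```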

香港客家话研究 is a great book for reference if you ever want more detail but I own a copy too so can help where I can and I also know the author.
I have an old diagram from an old website back in the day that looks like this (notice the discrepancies due to dialects)

[attached image: tone diagram]
 

mikelove

皇帝
Staff member
Thanks very much for all of this detailed info.

Sandhi at the moment is just a separate reading field with different tones. We auto-generate it for Mandarin since the algorithm for that is quite straightforward - it's indexed separately for search, so users can search for words with or without sandhi and can display either (or both). We don't auto-generate it for Cantonese but the behavior is otherwise the same.

My thinking was that the same approach would work here - the main Hokkien field gives you the underlying tones, the sandhi one gives you the tones with all of these transformations applied. As with Cantonese, we would not attempt to do this automatically, but it would be a separate field users could supply (importing it or editing an entry to add it). Is there a reason this would be unsuitable for Hokkien?
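For comparison, the Mandarin auto-generation can indeed be quite simple if one only handles the main rule, third-tone sandhi (3 3 → 2 3). A naive left-to-right sketch, ignoring 不/一 tone changes and prosodic phrasing (which need more context than the reading string alone):

```python
def mandarin_sandhi(reading: str) -> str:
    """Apply third-tone sandhi left to right: a 3 before another 3 becomes 2."""
    sylls = reading.split()
    for i in range(len(sylls) - 1):
        if sylls[i].endswith("3") and sylls[i + 1].endswith("3"):
            sylls[i] = sylls[i][:-1] + "2"
    return " ".join(sylls)

# mandarin_sandhi("ni3 hao3") == "ni2 hao3"
```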
 

Abun

榜眼
Words consisting of more than one syllable probably need at least two additional sandhi fields.

Let’s take a multi-syllabic word like 明白 ‘to understand’. Tâi-lô spells this as bîng-pi̍k, with underlying 5th and 8th tone respectively. If you use a separate field for sandhi, you would probably have to represent this as:
  • underlying: bing5-pik8
  • in_isolation: bing7-pik8/bing3-pik8 (even in isolation, the first syllable is in a sandhi context on account of the following second syllable; this happens for almost all words, though exceptions such as 地動 or 頭疼 exist). The 5th tone is a bit troublesome because in some dialects its sandhi form sounds identical to unsandhi’ed 7th, while in others it’s identical to unsandhi’ed 3rd, hence the two different readings.
  • after_sandhi: bing7-pik4/bing3-pik4 (the word might appear in a context where the last syllable must also undergo sandhi – in this case this could be caused by a following object for instance: 我無明白你咧講啥物 = Mandarin 我不明白你在說什么. Sandhi’ed tone 8 sounds identical to unsandhi’ed tone 4)
The alternative would be a single field somewhat like this (I’ll use XML but of course another format could use a similar structure):

XML:
<headword>
  <sandhiing>bing5</sandhiing><final>pik8</final>
</headword>

Or you could use some special character to denote sandhi boundaries within a headword, e.g. # like in many papers on Hokkien tone sandhi:

XML:
<headword>bing5-pik8</headword>

Whereas a word like 地動 where the first syllable never sandhis would be:

XML:
<headword>te7#-tang7</headword>

Then the sandhi readings (both the full sandhi form and with the last syllable remaining unsandhied) could be automatically generated from that information, including the dialect-dependent sandhi for the 5th tone. An added benefit would be that all of these tagging strategies could theoretically also be used identically in examples if desired.
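A sketch of that auto-generation, assuming the ‘#’-after-a-syllable boundary notation proposed above and hyphen-separated tone numbers. The northern value is used for tone 5, so the dialect split described earlier would need an extra parameter:

```python
# Regular sandhi circle, northern value for tone 5; -h finals omitted.
SANDHI = {"1": "7", "7": "3", "3": "2", "2": "1", "5": "7", "4": "8", "8": "4"}

def reading(annotated: str, sandhi_final: bool = False) -> str:
    """Derive a surface reading from an annotated underlying form.

    A '#' after a syllable marks a sandhi-phrase boundary, so that syllable
    keeps its underlying tone. With sandhi_final=True the last syllable is
    sandhi'ed too (the "after_sandhi" case).
    """
    sylls = annotated.split("-")
    out = []
    for i, syll in enumerate(sylls):
        protected = syll.endswith("#")
        syll = syll.rstrip("#")
        last = i == len(sylls) - 1
        if protected or (last and not sandhi_final):
            out.append(syll)
        else:
            out.append(syll[:-1] + SANDHI[syll[-1]])
    return "-".join(out)

# reading("bing5-pik8")       == "bing7-pik8"   (in isolation)
# reading("bing5-pik8", True) == "bing7-pik4"   (fully sandhi'ed)
# reading("te7#-tang7")       == "te7-tang7"    (internal boundary kept)
```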

Note that unfortunately, there are more dialect differences than just the sandhi pattern of the 5th tone (rhyme alternations are most common), and even mainstream Taiwanese sometimes accepts two different pronunciations (e.g. 火 can be hué or hé). So you might still need a separate field at least for the different mainstream pronunciations, if not for more fringe dialects.
 

mikelove

皇帝
Staff member
Thanks. I don't know if we'd necessarily support your third "after_sandhi" field, at least not initially, since it seems like scope creep to look at tone changes next to other words - in practice, I think anybody who doesn't already know the tones for a word is probably going to look it up without tones rather than assuming they have them down correctly.

For alternate pronunciations our standard approach is variant entries - you make another card/entry and insert a {{}} link to the original entry in the entry body field. (which the software is smart enough to collapse / consolidate in search results)
 

shiki

秀才
Is there an actual reason why extended Bopomofo can't be used for Hokkien or Hakka? I don't know either of these languages, but since I do plan on studying them, and as someone learning Mandarin with Bopomofo already, I would definitely like to learn them with Bopomofo if possible. I understand that the phonetics of Mandarin and these languages are very different and that there aren't a lot of materials using this system, but it seems usable. As a fan of Bopomofo, I'm very curious whether this is possible.

 

mikelove

皇帝
Staff member
We can potentially support it as a display / search format, this is mostly about deciding what to use internally in databases; it's a lot easier for us to work with a reading system that consists entirely of ASCII characters and then convert to / from that. (this is how we support regular Bopomofo in the dictionary now)
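To illustrate the "ASCII internally, convert for display" idea with ordinary Mandarin Bopomofo (extended Bopomofo for Hokkien or Hakka would just mean a larger table), here is a toy converter covering only a handful of mappings:

```python
# Tiny, illustrative mapping tables -- a real converter needs the full sets.
INITIALS = {"b": "ㄅ", "p": "ㄆ", "m": "ㄇ", "h": "ㄏ", "n": "ㄋ"}
FINALS = {"a": "ㄚ", "ao": "ㄠ", "ai": "ㄞ", "i": "ㄧ"}
TONES = {"1": "", "2": "ˊ", "3": "ˇ", "4": "ˋ"}  # tone 1 is unmarked

def to_bopomofo(syllable: str) -> str:
    """Convert an ASCII pinyin-with-tone-digit syllable to Bopomofo."""
    tone = TONES[syllable[-1]]
    body = syllable[:-1]
    ini = body[0] if body[0] in INITIALS else ""
    return INITIALS.get(ini, "") + FINALS[body[len(ini):]] + tone

# to_bopomofo("hao3") == "ㄏㄠˇ"
```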
 

shiki

秀才
Okay.
Not sure if this'll be of any use in helping get this implemented into Pleco, but at the bottom of this project page it says
"Because Taibun is MIT-licensed, any developer can essentially do whatever they want with it as long as they include the original copyright and licence notice in any copies of the source code. Note, that the data used by the package is licensed under a different copyright.

The data is licensed under CC BY-SA 4.0"

 