Romanizations for Hokkien / Hakka / Wu / Sichuanese

mikelove

皇帝
Staff member
We've decided we need to ship some sort of official support for a couple of Chinese topolects in Pleco 4.0 - rather than expecting users to go through dozens of pages of reference documentation to figure out how to do so themselves - and I'm looking for some advice on which romanization systems to default to and how to use them.

Hokkien: it seems like Tâi-lô is the predominant system, and for the most part we can approach it like we do Mandarin and (Yale) Cantonese - search with suffix tone numbers and display with diacritics. Does that make sense?

Hakka: there seems to be more active competition here, but since, as with Hokkien, the Taiwan MOE seems to publish a lot of material for this and it generally seems like most of the interest in working with Hakka languages is coming from users in Taiwan, we should probably follow their 臺灣客家語拼音方案 scheme, correct? The Wikipedia article on that shows floating diacritic marks after syllables, but then the actual MOE Hakka dictionary uses superscript numbers, 2 per syllable, which to me look tidier - would the latter system likely be acceptable to most users? And for searching, would we want to let you enter zero or two digits and ignore suffixes of just one digit?

Wu: again more competition but I get the impression that the most favored system at the moment is Wugniu? And that between sandhi chains and other complications online dictionaries generally don't bother to make the tones searchable? Is it worth the trouble to implement a sandhi notation system like Wiktionary's? (it seems like they've got about the only open-source dictionary data for it)

Sichuanese: it seems like the system to use here is Sichuanese Pinyin, which again we can treat like regular Pinyin but with more syllables and superscripts instead of diacritics?
 
Sichuanese: it seems like the system to use here is Sichuanese Pinyin, which again we can treat like regular Pinyin but with more syllables and superscripts instead of diacritics?
+ merger of erhua into the syllable, i.e.: 包儿 ---> ber1 -or- 猫儿 ---> mer1 (as opposed to MSM bao1 r5 and mao1 r5)

Not sure if 2 characters corresponding to a single syllable is tricky or not.
 

mikelove

皇帝
Staff member
Not a major problem, no - we can either make a table that includes every possible erhua-ended syllable or just skip the whole process and accept any syllable that looks valid. (we don't do that with Pinyin for performance reasons, but it's less of a problem when you're not worried about scaling up to 40+ dictionaries)
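As a rough illustration of the "accept anything that looks valid" route, here is a hypothetical sketch of merging an erhua suffix into a toneless Sichuanese Pinyin base syllable. The rule shown (keep the onset, replace the rime with "er") matches the bao → ber / mao → mer examples above, but it is a deliberate simplification: medials and real Sichuanese phonology would need a fuller table.

```python
def merge_erhua(base: str) -> str:
    """Keep the onset of a toneless syllable and replace the rime with 'er'.

    A deliberate simplification: medials and dialect-specific rimes are
    ignored, so this only demonstrates the table-free validation idea.
    """
    i = 0
    # Skip leading consonant letters until the first vowel of the rime.
    while i < len(base) and base[i] not in "aeiou":
        i += 1
    return base[:i] + "er"

# merge_erhua("bao") == "ber"   (包儿)
# merge_erhua("mao") == "mer"   (猫儿)
```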
 

Abun

榜眼
Great to hear that you're giving thought to how to integrate more non-Standard varieties of Chinese! It would be awesome to see them supported out-of-the-box in Pleco!

My two cents when it comes to Hokkien (about the others I'm afraid I don't know enough):
Hokkien: it seems like Tâi-lô is the predominant system, and for the most part we can approach it like we do Mandarin and (Yale) Cantonese - search with suffix tone numbers and display with diacritics. Does that make sense?
How dominant Tâi-lô is depends a bit on how you look at it. Pe̍h-ōe-jī (POJ) has a much longer history and has even been used as a script proper (rather than just as a crutch for learning/clarifying the pronunciation of characters) in a number of publications, including Bible translations, newspapers and, I’m told, personal correspondence. So I’d wager there are probably more sources in POJ than in Tâi-lô, although I freely admit I don’t have numeric data to back up that claim. One other point to consider is that there are parts of the Taiwanese Hokkien world that remain highly suspicious of any language-planning attempts from the government, given its past history of repression, and that therefore dislike Tâi-lô on principle. I don’t have any data on how large this group could be, though.

That said, if you have to choose one, I personally would still suggest Tâi-lô for a couple of reasons:
  • visual clarity, especially when it comes to Tâi-lô <oo> vs POJ <o͘>
  • standardisation: POJ is much like Wade-Giles – it isn’t so much one system as a family of closely related ones. The general features have stabilised over the last 150 or so years, but some details (e.g. tone mark placement on certain finals, relative placement of <ⁿ> and final <h> etc.) can differ from publication to publication.
  • regularity: POJ is already fairly phonemic, but it does contain certain inconsistencies (e.g. it uses <u> for the onglide /w/, but /wa/ and /we/ are written <oa> and <oe>, which in most POJ variants also exhibit a rather arcane behaviour with regards to how the tone diacritic is placed). Tâi-lô, while not entirely phonemic either, tends to be more regular: /wa/ and /we/ are <ua> and <ue>, with the diacritic always placed on the core vowel.
Still, there will be plenty of people who prefer POJ, or even more niche transcription systems such as Bbánlám pìngyīm, and who might consider it worth the effort to implement conversion to their favourite scheme themselves. Do you plan on supporting something like that?
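For what it’s worth, the regular correspondences between the two systems are mechanical enough that a conversion sketch is short. The sketch below operates on tone-number input (diacritic handling and proper syllable tokenisation are left out) and covers only the best-known POJ → Tâi-lô substitutions:

```python
# Ordering matters: 'chh' must be replaced before 'ch'.
POJ_TO_TAILO = [
    ("chh", "tsh"),      # aspirated affricate
    ("ch", "ts"),
    ("o\u0358", "oo"),   # o͘ (o + combining dot above right)
    ("oa", "ua"),
    ("oe", "ue"),
    ("eng", "ing"),
    ("ek", "ik"),
    ("\u207f", "nn"),    # superscript n marking nasalisation
]

def poj_to_tailo(syllable: str) -> str:
    for poj, tailo in POJ_TO_TAILO:
        syllable = syllable.replace(poj, tailo)
    return syllable

# poj_to_tailo("chhoe7") == "tshue7"   (POJ chhōe, Tâi-lô tshuē 'to look for')
```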

If so, there is one complication which might be worth considering while planning how to structure the raw data: Both Tâi-lô and POJ, like Hanyu Pinyin, show the underlying tone for all syllables and leave it to the reader to figure out where and how to apply tone sandhi. Some more niche systems (particularly those which try to indicate the tone with otherwise unused letters instead of diacritics or numbers, e.g. Phofsit Daibuun) write the surface tone instead – so most words can be spelled in two different ways, depending on whether they are sandhied or not in a given context. As a result, conversion between these two types of systems is only possible if the data contains information on which syllables are subject to sandhi and which are not. Indeed, if it is at all possible to obtain, such data would be valuable even outside of conversion, e.g. showing sandhi domains for learners or at some point (fingers crossed) for TTS.

An additional complication is of course Hokkien varieties outside of mainstream Xiamen/Taiwan Hokkien. Although I guess the further you differ from that mainstream, the less relevant the data in POJ/Tâi-lô dictionaries will be, so users interested in such variants will probably have to manage their own user dicts anyways.
 

mikelove

皇帝
Staff member
Thanks for this detailed reply.

We have a lot of code already written for conversions between romanizations, so I think we can support POJ pretty easily if a lot of people request it. But Tâi-lô being consistent and standardized has a lot to recommend it as an internal storage format. We also already support separate fields for sandhi readings in Mandarin and Cantonese and can easily offer one of those for Hokkien too.
 

Hydramus

Member
Hi, this is fantastic news.
I was trying to work out a way to automate Hong Kong Hakka romanisation in Pleco and potentially publish it, but I don't know how. I'm happy to help with what I can.
For material you might find interesting on Hong Kong Hakka:

Hong Kong Hakka Wiki Entry
Hong Kong Hakka Dictionary

I do understand the sentiment that the Taiwan MOE scheme has the most published material, and since Hakka is also an official language in Taiwan, I'd fully support it being the main one that you guys focus on in Pleco (and I guess this might help me learn it!).

On the romanisation, this is the tricky part. I am less familiar with it, but which dialect is considered more commonplace and should be used here? Which one is common in learning material or children's books, for example? I find that diacritics might be more accessible for some, while the numbers are more precise but have a learning curve. Would it be possible for searching to be flexible enough to accept zero, one or two digits? I would likely search without the digits, like I do for Cantonese.
 

mikelove

皇帝
Staff member
Thanks. Do you know if HK Hakka can be expressed accurately using the MOE system?

For sandhi in Mandarin and tone changes in Cantonese, we've adopted the approach of letting you search for either or neither - we certainly could support entering both tone numbers, but I'm not sure how many people would be interested in looking up words that way or how much a search might be narrowed down by that versus only entering one.
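A sketch of that "either or neither" indexing approach, under the assumption that each entry simply stores one search key with tone digits and one without (the function name here is made up for illustration):

```python
import re

def index_keys(reading: str) -> set:
    """Search keys stored for one reading: with tone digits and without."""
    compact = reading.replace(" ", "").lower()
    toneless = re.sub(r"\d", "", compact)
    return {compact, toneless}

# index_keys("fung1 ngin2") == {"fung1ngin2", "fungngin"}
# so a query of either "fung1ngin2" or "fungngin" finds the entry
```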
 

Abun

榜眼
Thanks for this detailed reply.

We have a lot of code already written for conversions between romanizations, so I think we can support POJ pretty easily if a lot of people request it. But Tâi-lô being consistent and standardized has a lot to recommend it as an internal storage format. We also already support separate fields for sandhi readings in Mandarin and Cantonese and can easily offer one of those for Hokkien too.
Thanks for your reply! I agree with Tâi-lô being more suitable at the very least internally. And if anybody doesn’t like it for any reason, conversion (or if possible even better: the ability to define custom conversion procedures) would probably be greatly appreciated.

When it comes to sandhi, could you elaborate on what those sandhi reading fields look like? Because a field for alternative pronunciations of the same head word would not solve the main issue with Hokkien.

The Hokkien situation is like this: All tones basically have two pitch contours: One which is used under sandhi conditions and one otherwise. Sandhi conditions mean, simplified somewhat, that the syllable is followed by at least one other non-neutral tone syllable within its prosodic phrase (or in other words, the final full-tone syllable within any given prosodic phrase is the only one in that phrase which is not sandhi’ed, all others before it are). Note that unlike in Mandarin or Wu, Hokkien sandhi doesn’t care at all which tones the surrounding syllables carry; there is one sandhi pitch contour for each tone category, the tones around do not matter.

This also means that in order to pronounce a given phrase correctly, you need information on how it is broken up into sandhi phrases. To my knowledge, no transcription system systematically provides this information, unfortunately. Most of them (like Tâi-lô and POJ) write the underlying tone and expect the reader to apply sandhi entirely by themselves (at least almost: they indicate neutral-tone syllables with a double hyphen, and since neutral tones can normally only occur at the end of a sandhi phrase after the unsandhi’ed syllable, that does give an occasional indication). In scientific literature on the topic, many researchers use the pound character # to indicate a sandhi phrase boundary.

So if the internal representation is to be future-proof, be it for conversion, TTS or simply as a sandhi indication for users, it needs information about which syllables belong to the same sandhi phrase – unless, perhaps, you can use AI: the National Taipei University of Technology’s TTS engine sounds quite good even with user input (https://web.archive.org/web/20240426145828/http://tts001.iptcloud.net:8804/ – I’ve had problems with the site being unreachable sometimes, hence the archived link), and the private company Ithuan (https://ithuan.tw/) seems to be making advances there as well, though I couldn’t test their engine.

Luckily, head-words in a dictionary are kind of the easiest case here because single words pretty much always go in the same sandhi phrase and can’t be split up further. So you can generally assume that the only syllable which is sensitive to the context is the last. All other syllables will always be sandhi’ed because they have at least one syllable following. However there are some exceptional entries which have internal sandhi boundaries imposed by their syntactic structure. One obvious example are idioms, including four-character fixed expressions such as chengyu. There are also a couple of other, normal words with an internal sandhi boundary, usually these are subject-verb compounds. For example, the Taiwanese Hokkien words for ‘earthquake’ and ‘headache’ are 地動 tē-tāng or tuē-tāng, and 頭疼 thâu-thiànn, each with the first syllable unsandhi’ed. And of course when it comes to example sentences, most of them will definitely contain more than one sandhi phrase.
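For reference, the regular Taiwanese sandhi circle can be written as a simple lookup: 1→7, 7→3, 3→2, 2→1, 5→7 (northern) or 3 (southern), with the checked tones 4 and 8 swapping when they end in -p/-t/-k (syllables in -h behave differently and are omitted here, as are neutral tones). A minimal sketch applying it to every non-final syllable of a headword, assuming hyphen-separated tone-number input:

```python
# Regular (non -h-final) sandhi circle; tone 5 uses the northern value 7.
SANDHI = {"1": "7", "7": "3", "3": "2", "2": "1", "5": "7", "4": "8", "8": "4"}

def apply_sandhi(word: str) -> str:
    """Sandhi every syllable except the last -- the usual headword pattern."""
    sylls = word.split("-")
    head = [s[:-1] + SANDHI[s[-1]] for s in sylls[:-1]]
    return "-".join(head + sylls[-1:])

# apply_sandhi("tai5-uan5") == "tai7-uan5"
```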

In any case, I’m looking forward to being able to use Pleco for non-Mandarin/Cantonese regional languages as well, that would be a huge improvement to an already great app for me!
 

Hydramus

Member
Yeah, it will work the same as the Taiwan MOE system: with the diacritics, a single number representing each of the six tones (see below, from my notes), or the double-number notation for the pitch contour. I wonder if we could one day have the diagrams in the app to explain these too.
Tone table
Ref: http://www.hkilang.org/v2/發音字典/
Ref: p. 433, 香港客家话研究

Tonal mark | Simplified number notation | Directional tone notation (see graph) | Chinese example
fúng       | fung1                      | fung33                                |
fūng       | fung2                      | fung11                                |
fǔng       | fung3                      | fung31                                |
fùng       | fung4                      | fung55                                |
sǐt        | sit5                       | sit3                                  |
sìt        | sit6                       | sit5                                  |
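Restating the table above programmatically (purely the values given there, for this Hong Kong Hakka variety), a simplified-number syllable can be converted to the directional pitch-contour notation like so:

```python
# Simplified number -> directional (pitch contour) notation, per the table.
HK_HAKKA_CONTOURS = {"1": "33", "2": "11", "3": "31", "4": "55", "5": "3", "6": "5"}

def to_contour(syllable: str) -> str:
    """Swap a trailing simplified tone digit for its pitch contour."""
    return syllable[:-1] + HK_HAKKA_CONTOURS[syllable[-1]]

# to_contour("fung1") == "fung33"
# to_contour("sit5") == "sit3"
```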

香港客家话研究 is a great book for reference if you ever want more detail but I own a copy too so can help where I can and I also know the author.
I have an old diagram from an old website back in the day that looks like this (notice the discrepancies due to dialects)

[attached image: tone diagram]
 

mikelove

皇帝
Staff member
Thanks very much for all of this detailed info.

Sandhi at the moment is just a separate reading field with different tones. We auto-generate it for Mandarin since the algorithm for that is quite straightforward - it's indexed separately for search, so users can search for words with or without sandhi and can display either (or both). We don't auto-generate it for Cantonese but the behavior is otherwise the same.

My thinking was that the same approach would work here - the main Hokkien field gives you the underlying tones, the sandhi one gives you the tones with all of these transformations applied. As with Cantonese, we would not attempt to do this automatically, but it would be a separate field users could supply (importing it or editing an entry to add it). Is there a reason this would be unsuitable for Hokkien?
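For comparison, the Mandarin auto-generation can indeed be quite simple if one only handles the main rule, third-tone sandhi (3 3 → 2 3). A naive left-to-right sketch, ignoring 不/一 tone changes and prosodic phrasing (which need more context than the reading string alone):

```python
def mandarin_sandhi(reading: str) -> str:
    """Apply third-tone sandhi left to right: a 3 before another 3 becomes 2."""
    sylls = reading.split()
    for i in range(len(sylls) - 1):
        if sylls[i].endswith("3") and sylls[i + 1].endswith("3"):
            sylls[i] = sylls[i][:-1] + "2"
    return " ".join(sylls)

# mandarin_sandhi("ni3 hao3") == "ni2 hao3"
```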
 

Abun

榜眼
Words consisting of more than one syllable probably need at least two additional sandhi fields.

Let’s take a multi-syllabic word like 明白 ‘to understand’. Tâi-lô spells this as bîng-pi̍k, with underlying 5th and 8th tone respectively. If you use a separate field for sandhi, you would probably have to represent this as:
  • underlying: bing5-pik8
  • in_isolation: bing7-pik8/bing3-pik8 (even in isolation, the first syllable is in a sandhi context on account of the following second syllable; this happens for almost all words, though exceptions such as 地動 or 頭疼 exist). The 5th tone is a bit troublesome because in some dialects its sandhi form sounds identical to unsandhi’ed 7th, while in others it’s identical to unsandhi’ed 3rd, hence the two different readings.
  • after_sandhi: bing7-pik4/bing3-pik4 (the word might appear in a context where the last syllable must also undergo sandhi – in this case this could be caused by a following object for instance: 我無明白你咧講啥物 = Mandarin 我不明白你在說什么. Sandhi’ed tone 8 sounds identical to unsandhi’ed tone 4)
The alternative would be a single field somewhat like this (I’ll use XML but of course another format could use a similar structure):

XML:
<headword>
  <sandhiing>bing5</sandhiing><final>pik8</final>
</headword>

Or you could use some special character to denote sandhi boundaries within a headword, e.g. # like in many papers on Hokkien tone sandhi:

XML:
<headword>bing5-pik8</headword>

Whereas a word like 地動 where the first syllable never sandhis would be:

XML:
<headword>te7#-tang7</headword>

Then the sandhi readings (both the full sandhi form and with the last syllable remaining unsandhied) could be automatically generated from that information, including the dialect-dependent sandhi for the 5th tone. An added benefit would be that all of these tagging strategies could theoretically also be used identically in examples if desired.
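A sketch of that auto-generation, assuming the ‘#’-after-a-syllable boundary notation proposed above and hyphen-separated tone numbers. The northern value is used for tone 5, so the dialect split described earlier would need an extra parameter:

```python
# Regular sandhi circle, northern value for tone 5; -h finals omitted.
SANDHI = {"1": "7", "7": "3", "3": "2", "2": "1", "5": "7", "4": "8", "8": "4"}

def reading(annotated: str, sandhi_final: bool = False) -> str:
    """Derive a surface reading from an annotated underlying form.

    A '#' after a syllable marks a sandhi-phrase boundary, so that syllable
    keeps its underlying tone. With sandhi_final=True the last syllable is
    sandhi'ed too (the "after_sandhi" case).
    """
    sylls = annotated.split("-")
    out = []
    for i, syll in enumerate(sylls):
        protected = syll.endswith("#")
        syll = syll.rstrip("#")
        last = i == len(sylls) - 1
        if protected or (last and not sandhi_final):
            out.append(syll)
        else:
            out.append(syll[:-1] + SANDHI[syll[-1]])
    return "-".join(out)

# reading("bing5-pik8")       == "bing7-pik8"   (in isolation)
# reading("bing5-pik8", True) == "bing7-pik4"   (fully sandhi'ed)
# reading("te7#-tang7")       == "te7-tang7"    (internal boundary kept)
```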

Note that unfortunately, there are more dialect differences than just the sandhi pattern of the 5th tone (rhyme alternations are most common), and even mainstream Taiwanese sometimes accepts two different pronunciations (e.g. 火 can be hué or hé). So you might still need a separate field at least for the different mainstream pronunciations, if not for more fringe dialects.
 

mikelove

皇帝
Staff member
Thanks. I don't know if we'd necessarily support your third "after_sandhi" field, at least not initially, since it seems like scope creep to look at tone changes next to other words - in practice, I think anybody who doesn't already know the tones for a word is probably going to look it up without tones rather than assuming they have them down correctly.

For alternate pronunciations our standard approach is variant entries - you make another card/entry and insert a {{}} link to the original entry in the entry body field. (which the software is smart enough to collapse / consolidate in search results)
 

shiki

秀才
Is there an actual reason why extended Bopomofo can't be used for Hokkien or Hakka? I don't know either of these languages, but since I do plan on studying them, and as someone learning Mandarin with Bopomofo already, I would definitely like to learn them with Bopomofo if possible. I understand that the phonetics of Mandarin and these languages are very different and that there aren't a lot of materials using this system, but it seems usable. As a fan of Bopomofo, I'm very curious whether this is possible.

 

mikelove

皇帝
Staff member
We can potentially support it as a display / search format, this is mostly about deciding what to use internally in databases; it's a lot easier for us to work with a reading system that consists entirely of ASCII characters and then convert to / from that. (this is how we support regular Bopomofo in the dictionary now)
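To illustrate the "ASCII internally, convert for display" idea with ordinary Mandarin Bopomofo (extended Bopomofo for Hokkien or Hakka would just mean a larger table), here is a toy converter covering only a handful of mappings:

```python
# Tiny, illustrative mapping tables -- a real converter needs the full sets.
INITIALS = {"b": "ㄅ", "p": "ㄆ", "m": "ㄇ", "h": "ㄏ", "n": "ㄋ"}
FINALS = {"a": "ㄚ", "ao": "ㄠ", "ai": "ㄞ", "i": "ㄧ"}
TONES = {"1": "", "2": "ˊ", "3": "ˇ", "4": "ˋ"}  # tone 1 is unmarked

def to_bopomofo(syllable: str) -> str:
    """Convert an ASCII pinyin-with-tone-digit syllable to Bopomofo."""
    tone = TONES[syllable[-1]]
    body = syllable[:-1]
    ini = body[0] if body[0] in INITIALS else ""
    return INITIALS.get(ini, "") + FINALS[body[len(ini):]] + tone

# to_bopomofo("hao3") == "ㄏㄠˇ"
```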
 

shiki

秀才
Okay.
Not sure if this'll be of any use in helping get this implemented into Pleco, but at the bottom of this project page it says
"Because Taibun is MIT-licensed, any developer can essentially do whatever they want with it as long as they include the original copyright and licence notice in any copies of the source code. Note, that the data used by the package is licensed under a different copyright.

The data is licensed under CC BY-SA 4.0"

 