Associating imported custom entries with existing entries not working well

manishearth · Mar 1, 2022

Hi!

I'm working on a custom dictionary based on cross-dialectal data from Wiktionary, with the intent of making it easier to work with lects based on existing knowledge of a lect or using the existing dictionaries. Wiktionary has highly useful "dialectal synonyms" lists on various entries showing the common ways a word is said in different lects (in other words, it has mappings from the "standard written" form to the spoken form).

Here's what I have so far, generated for Mandarin/Cantonese/Taishanese and specifically for Beijing/Hong Kong/Taishan:

I'm currently using the "import text file" support that Pleco has for importing entries into a custom user dictionary. It's slow but it overall seems to work, and as far as I can tell this is the recommended way to do this¹.

For the imported entries to associate with existing entries (instead of creating new ones), I had to include pinyin pronunciations (or jyutping if the pinyin wasn't available) so that things would associate correctly (and I'm consistently using traditional characters; this is what Wiktionary does anyway, so there should not be ambiguity).

However, this still seems to create separate dictionary entries in some cases. For one, it creates a separate entry in some, but not all, cases where the simplified and traditional forms differ (it happens for 什麼 and 現在 but not for 來 or 聽日):

(In the csv file, this has an entry "現在 xiànzài Pronunciation: Mandarin:...", though the tabs got converted to spaces when pasting here.)

There's also some brokenness here:

(This has a csv entry "喺 {hai2} Pronunciation: Cantonese: ....")

Despite the jyutping tone mark not showing up in the search (this seems to be another bug²), it does show up on the entry itself, but for whatever reason the entry isn't associated with the existing one (maybe because my entry only has jyutping, not pinyin?)

I'm not really sure what to do here; it seems like there are a couple bugs or tricky bits in how Pleco associates imported dictionary entries with existing entries but it's not super clear, and I can't find much in the way of prior discussion. Help would be appreciated.

¹If there's a better way, I'm super down to try that instead. In the long run I plan to release an open source tool so that you can make these on your own (if you'd like to try it out now, feel free to post on my profile to ask for a specific set of lects and I can make you one). I'm also happy to help make this a part of the main app, provided it can be done in a way that works with Wiktionary's CC-BY-SA licensing. It would be really cool if this could be a configurable dictionary, the way the built-in Unicode/UNI dictionary is, where you select the lects you care about and it only shows up on entries where that matters. Perhaps there could be separate pronunciation and synonym dictionaries.

²This bug has also hit me a bunch when importing flashcards from my vocab Google sheet, which only has jyutping, with the "create new if not found" option enabled; the resultant entries show up without tone marks and look weird.

manishearth · Mar 1, 2022

I realized that I need to use brackets around traditional characters so that they create the right entry, but the same problems are still occurring.
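For reference, the line format I'm generating can be sketched in Python. This is a minimal sketch, not Pleco's documented format: the `import_line` helper is hypothetical, and it assumes one tab-separated line per entry (headword, pronunciation, definition), square brackets marking a traditional-only headword, and curly braces around jyutping, as in the examples quoted in this thread.

```python
def import_line(traditional, definition, pinyin=None, jyutping=None):
    """Build one tab-separated import line: headword, pronunciation, definition.

    Hypothetical helper; the field layout is an assumption based on the
    entries quoted in this thread, not on Pleco documentation.
    """
    # Assumption: square brackets mark the headword as traditional,
    # leaving the simplified form empty.
    headword = f"[{traditional}]"

    # Prefer pinyin when available; otherwise fall back to jyutping,
    # which is assumed to be written inside curly braces.
    if pinyin:
        pronunciation = pinyin
    elif jyutping:
        pronunciation = "{" + jyutping + "}"
    else:
        pronunciation = ""

    return "\t".join([headword, pronunciation, definition])


if __name__ == "__main__":
    print(import_line("現在", "Pronunciation: Mandarin:...", pinyin="xiànzài"))
    print(import_line("喺", "Pronunciation: Cantonese: ....", jyutping="hai2"))
```

Writing the lines with real tab characters (rather than pasting them through something that converts tabs to spaces) keeps the three fields unambiguous.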

Shun · Mar 1, 2022

Hi,

this looks like an interesting project, though I can't really help here, as I don't have much experience with Pleco's Cantonese support. All I can say is that the Hanzi Simplified and Traditional, pinyin, and the Cantonese pronunciation fields all need to match exactly for them to be subsumed under one dictionary entry. Simplified should be empty in your case, without Pleco automatically deriving it from Traditional.

Pardon my ignorance, do you already know what the "plus-minus" symbol next to one of the Cantonese pronunciations stands for?

Regards,

Shun

mikelove · Mar 1, 2022

This is way beyond what we intended the current user dictionary system to be able to do, honestly, so it may not be something we can really offer you a good fix for; indeed even just the use of formatting codes has always been a completely experimental / unsupported feature (and in fact breaks badly in quite a lot of cases).

@Shun's basic point about everything needing to match to show up in the same entry is correct. User dictionaries don't reliably fill in missing simplified / traditional / pinyin like flashcards do, so one thing that might help would be to import this to a flashcard file first - in which case Pleco will try a lot harder to line these up with dictionary entries - and then dump those flashcards to your user dictionary, rather than importing it as a user dictionary directly.
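As a rough illustration of why a missing field leads to a separate entry, here is a toy model in Python. It is only a sketch of the "all fields must match exactly" rule described above, not Pleco's actual matching code; the `EntryKey` type, the `same_entry` function, and the sample values are all hypothetical.

```python
from typing import NamedTuple


class EntryKey(NamedTuple):
    """Hypothetical key: the fields that must all match for two entries to merge."""
    simplified: str
    traditional: str
    pinyin: str
    jyutping: str


def same_entry(a: EntryKey, b: EntryKey) -> bool:
    # Toy rule: exact equality on every field, with no filling in of blanks.
    return a == b


# Made-up example values for illustration only.
existing = EntryKey("现在", "現在", "xiànzài", "jin6zoi6")
imported = EntryKey("", "現在", "xiànzài", "")

# The imported entry leaves two fields blank, so it does not match.
print(same_entry(existing, imported))  # False

# Filling in the blank fields first makes the keys identical.
filled = imported._replace(simplified="现在", jyutping="jin6zoi6")
print(same_entry(existing, filled))  # True
```

Under this model, the flashcard route helps because the blanks get filled in before the comparison happens.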

manishearth · Mar 1, 2022

Hmmm, I see, thanks. Yeah I'm aware the formatting stuff is experimental; it works right now so I'm using it.

Would it be possible to get the additional set of entry-matching options in the dictionary import feature? AIUI they roughly work the same otherwise. It seems like this is annoying for cases less weird than mine as well; e.g. when importing vocabulary from a personal vocabulary spreadsheet.

I could import these as flashcards, but how do I dump them into a dictionary? Would this be by using the options "definition source: file only, store in user dict, fill in missing fields, missing entries: create blank, ambiguous entries: use first", and ensuring that there is only a single user dictionary "above the line" in the dictionaries menu? I'm a bit worried about messing up existing flashcards and user dictionaries; I do use my "regular" user dict.

I may also try getting the simplified data; I don't have it just yet, but I know how to get it and can include it.

mikelove · Mar 1, 2022

manishearth said:
Would it be possible to get the additional set of entry-matching options in the dictionary import feature? AIUI they roughly work the same otherwise. It seems like this is annoying for cases less weird than mine as well; e.g. when importing vocabulary from a personal vocabulary spreadsheet.

In 4.0 user dictionary databases are basically just flashcard databases with some tables missing, so this should be doable once that's out.

manishearth said:
I could import these as flashcards, but how do I dump them into a dictionary?

There's actually a batch command for that - "Convert all custom to user dict." This will dump them to the first editable dictionary in your dictionary sort order.

manishearth said:
I'm a bit worried about messing up existing flashcards and user dictionaries; I do use my "regular" user dict.

Yeah, I'd recommend backing everything else up before trying this.

manishearth · Mar 1, 2022

mikelove said:
There's actually a batch command for that - "Convert all custom to user dict." This will dump them to the first editable dictionary in your dictionary sort order.

Oh, sweet, thanks.
