Wikidata-generated dictionary

I don't know if this is interesting to people, but this is a project I started a while ago to generate a dictionary from anything that has a Wikidata item, currently with about 140k entries: https://github.com/danielt998/Wikid...or/blob/master/output/output_pleco_format.txt
It has a lot of entries for names of places, people, etc. that won't be in other dictionaries. Unfortunately, it also has quite a lot of entries that make no sense in a dictionary and may produce some 'noise'. It would be good to know whether people find this useful; if so, I may take a look in future at filtering out some of the entries that aren't useful.
 

Shun

状元
Hi Daniel,

I think that's a wonderful idea! Wikipedia should be well-maintained, which leads to complete, mostly error-free and up-to-date data. I will definitely turn to this list for phonetic and semantic transcriptions of names.

Thanks a lot,

Shun
 
I'm glad it's useful :)
The most frustrating thing is the pinyin: I've only included cases where every character has a single pinyin pronunciation, which filters out about half of the entries. Still including the rest but leaving the pinyin empty would be possible, but I don't want to mislead anyone by giving them an incorrect pronunciation.
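For what it's worth, that unambiguous-only filter could be sketched roughly like this (a minimal stand-in: the readings map is hardcoded here, whereas the real project would build it from CC-CEDICT, and all names are hypothetical):

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class UnambiguousPinyin {
    // Hypothetical readings table; in practice this would be built from CC-CEDICT.
    static final Map<Character, List<String>> READINGS = Map.of(
        '北', List.of("bei3"),
        '京', List.of("jing1"),
        '行', List.of("xing2", "hang2")  // ambiguous: two readings
    );

    // Returns pinyin only when every character has exactly one known reading.
    static Optional<String> pinyinIfUnambiguous(String word) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            List<String> r = READINGS.get(word.charAt(i));
            if (r == null || r.size() != 1) {
                return Optional.empty();  // missing or ambiguous -> skip the entry
            }
            if (sb.length() > 0) sb.append(' ');
            sb.append(r.get(0));
        }
        return Optional.of(sb.toString());
    }

    public static void main(String[] args) {
        System.out.println(pinyinIfUnambiguous("北京").orElse("(skipped)"));  // bei3 jing1
        System.out.println(pinyinIfUnambiguous("北行").orElse("(skipped)"));  // (skipped)
    }
}
```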

Some other things I would like to do include:
* updating the data, as this is using 5+ year old data - this'll significantly increase the size (it wouldn't surprise me if it doubles the number of entries, but it will involve parsing a multi-terabyte JSON file!)
* providing a version of the data that is simplified-only or traditional-only - some entries only have one or the other, and I have only done automatic transliteration (in both directions) in cases that are unambiguous
* creating a curation script which allows me to quickly filter out entries we don't want by hand
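On the multi-terabyte file: the Wikidata JSON dump is one giant array, but each entity sits on its own line, so it can be streamed line by line instead of being parsed as a single JSON document. A rough sketch (not the project's actual code; the sample string stands in for the real dump):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class StreamDump {
    // Yields one JSON object per entity without holding the whole dump in memory.
    static List<String> entityLines(String dump) throws IOException {
        List<String> entities = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(dump))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.equals("[") || line.equals("]") || line.isEmpty()) continue;
                if (line.endsWith(",")) line = line.substring(0, line.length() - 1);
                // `line` is now one complete entity object; hand it to a real
                // JSON parser (e.g. Jackson) in an actual implementation.
                entities.add(line);
            }
        }
        return entities;
    }

    public static void main(String[] args) throws IOException {
        String sample = "[\n{\"id\":\"Q1\"},\n{\"id\":\"Q2\"}\n]";
        System.out.println(entityLines(sample));  // [{"id":"Q1"}, {"id":"Q2"}]
    }
}
```

In a real run you'd wrap the (usually gzip/bzip2-compressed) dump file in a decompressing reader instead of a StringReader.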

Also, there's a flag (currently disabled) to include zh-HK and zh-MO, which doubles the output size if enabled, but I don't know whether Cantonese names are different enough to be worth including (I could also generate a separate dict for Cantonese learners at some point).
 

Shun

状元
These considerations make a lot of sense.

Usually, transliterations should only consist of strings of single characters anyway. Of course it is always possible that an alternative pronunciation of a character is used in the transliteration instead of the main one. But if you used Wikipedia's pinyin, we should be fine.

Perhaps you can work with split versions of the current Wikipedia dump, or filter the JSON stream as it comes in.

Cheers, Shun
 
>But if you used Wikipedia's pinyin, we should be fine.

Unfortunately, there's generally no pinyin in Wikidata, so I've had to generate it automatically from CC-CEDICT. (Come to think of it, there is a pinyin property that some Wikidata items have, but only really for things with native Chinese names, not transliterations - I can try to incorporate that where it exists when I next run the pre-processing script.)
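For anyone curious, CC-CEDICT's plain-text format is straightforward to parse: each line is traditional form, simplified form, the pinyin in brackets, then slash-delimited glosses. A rough sketch of pulling those fields apart:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CedictLine {
    // CC-CEDICT lines look like:
    //   漢語 汉语 [Han4 yu3] /Chinese language/
    static final Pattern LINE =
        Pattern.compile("^(\\S+) (\\S+) \\[([^\\]]+)\\] /(.+)/$");

    // Returns {traditional, simplified, pinyin, glosses} or null if malformed.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) return null;
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String[] p = parse("漢語 汉语 [Han4 yu3] /Chinese language/");
        System.out.println(p[0] + " / " + p[1] + " / " + p[2] + " / " + p[3]);
    }
}
```

Comment lines in the real file start with `#` and would need to be skipped before parsing.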
 
Also, a separate project idea I have is to use Wikipedia instead of Wikidata and extract anything using a {{zh}} or similar template, which sometimes does contain pinyin.
 

Shun

状元
> > But if you used Wikipedia's pinyin, we should be fine.
>
> Unfortunately, there's generally no pinyin in Wikidata, so I've had to generate it automatically from CC-CEDICT. (Come to think of it, there is a pinyin property that some Wikidata items have, but only really for things with native Chinese names, not transliterations - I can try to incorporate that where it exists when I next run the pre-processing script.)

Yeah, you're right. So 90% of the pronunciations should still be correct. ;)
 

Shun

状元
> Also, a separate project idea I have is to use Wikipedia instead of Wikidata and extract anything using a {{zh}} or similar template, which sometimes does contain pinyin.
Definitely. Perhaps if you limit yourself to those terms whose pinyin is definitely correct, you would still end up with a long list that is worth studying.
 
> Yeah, you're right. So 90% of the pronunciations should still be correct. ;)

Of the ones in the file I posted, it should be 100%, as I only generate an entry in cases where every character has a single pinyin reading (unless a reading is missing from CC-CEDICT, which seems unlikely). The question is whether to include ones where the pinyin is ambiguous and we have to guess.
 

Shun

状元
> > Yeah, you're right. So 90% of the pronunciations should still be correct. ;)
>
> Of the ones in the file I posted, it should be 100%, as I only generate an entry in cases where every character has a single pinyin reading (unless a reading is missing from CC-CEDICT, which seems unlikely). The question is whether to include ones where the pinyin is ambiguous and we have to guess.

Oh, that's a great idea! Yes, CC-CEDICT should be quite comprehensive.

Ambiguous cases would probably be error-prone. Perhaps Google Cloud contains some useful data, since Google Translate always displays the pinyin when you enter Hanzi into it. But accessing that through an API probably wouldn't be free.
 
If anyone was interested, they could change this line to include the guesses too; it just takes the first pinyin it finds for that character. (Though I think some pinyin readings are more or less common in names/transliterations, and there are some 'rules' about which to use - or we could use statistical data to pick the most common pinyin reading, if such a thing exists.)
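A rough sketch of what that fallback flag could look like (hypothetical names; the hardcoded readings map stands in for CC-CEDICT data, with the first listed reading treated as the guess):

```java
import java.util.List;
import java.util.Map;

public class GuessPinyin {
    // Hypothetical readings table standing in for CC-CEDICT.
    static final Map<Character, List<String>> READINGS = Map.of(
        '李', List.of("li3"),
        '行', List.of("xing2", "hang2"));  // ambiguous

    // With includeGuesses=true, ambiguous characters fall back to the first
    // listed reading instead of dropping the entry; returns null if skipped.
    static String pinyin(String word, boolean includeGuesses) {
        StringBuilder sb = new StringBuilder();
        for (char c : word.toCharArray()) {
            List<String> r = READINGS.get(c);
            if (r == null || (r.size() > 1 && !includeGuesses)) return null;
            if (sb.length() > 0) sb.append(' ');
            sb.append(r.get(0));  // a frequency table could pick better than "first"
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(pinyin("李行", false));  // null: ambiguous, skipped
        System.out.println(pinyin("李行", true));   // li3 xing2 (a guess)
    }
}
```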
 

Shun

状元
Wow, I like Java. Yes, usually a Hanzi used in transliterations will only have a single pinyin reading each time, so one could infer the pronunciation of terms where it isn't known from other terms where it is. So there are definitely some good possibilities. :)
 
Hmm, interestingly ICU apparently has some data for the most common reading: https://www.ibm.com/docs/en/ignm/7.0.0?topic=transliteration-chinese-overview so I could use that (though I'd probably still keep it optional whether to include only the unambiguous ones).

(I don't know how hard it'd be to get access to InfoSphere Global Name Management, or what the licence of the generated data would be, but ICU itself is free/open source if we use it directly.)
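For reference, ICU's Han-to-pinyin rules are exposed directly through its Transliterator API, so no IBM product is needed; a minimal sketch using ICU4J (assumes the com.ibm.icu:icu4j dependency is on the classpath):

```java
// Requires the ICU4J library (com.ibm.icu:icu4j) on the classpath.
import com.ibm.icu.text.Transliterator;

public class IcuPinyin {
    // ICU's Han-Latin transform picks the most common reading per character,
    // per the caveat quoted below from the IBM documentation.
    static String toPinyin(String hanzi) {
        Transliterator t = Transliterator.getInstance("Han-Latin");
        return t.transliterate(hanzi);
    }

    public static void main(String[] args) {
        System.out.println(toPinyin("北京"));  // pinyin with tone marks
    }
}
```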
 

Shun

状元
Nice, thanks, I hadn't heard of the ICU project. However, as you say, they also caution that their pronunciations for names are decided on a frequency basis:

"The International Components for Unicode (ICU) open source project has a set of system rules that transliterate commonly used Chinese characters into Mandarin Pinyin representations. Each character has only one output form. In the case of characters with multiple pronunciations, the most common one is selected. The Global Name Management transliteration process uses the ICU internal rule set for most Chinese characters. Exceptions are handled by special rules."

So I think inference from other names' pronunciations confirmed by Wikipedia would probably be the most correct option. But of course it's up to you. :)

Cheers, Shun
 