Wikidata-generated dictionary

I don't know if this is interesting to people, but this is a project I started a while ago to generate a dictionary from anything that has a Wikidata item, currently with about 140k entries: https://github.com/danielt998/Wikid...or/blob/master/output/output_pleco_format.txt
It has a lot of entries for names of places, people, etc. that won't be in other dictionaries. Unfortunately, it also has quite a lot of entries that make no sense in a dictionary and may produce some 'noise'. It would be good to know whether people find this useful; if so, I may take a look in future at filtering out some of the entries that aren't useful.
 

Shun

状元
Hi Daniel,

I think that's a wonderful idea! Wikipedia should be well-maintained, which leads to complete, mostly error-free and up-to-date data. I will definitely turn to this list for phonetic and semantic transcriptions of names.

Thanks a lot,

Shun
 
I'm glad it's useful :)
The most frustrating thing is the pinyin: I've only included cases where every character has a single pinyin pronunciation, which filters out about half of the entries. Still including the rest but leaving the pinyin empty would be possible, but I don't want to mislead anyone by giving them an incorrect pronunciation.
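For what it's worth, that unambiguous-only filter could be sketched roughly like this (a minimal stand-in: the readings map is hardcoded here, whereas the real project would build it from CC-CEDICT, and all names are hypothetical):

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class UnambiguousPinyin {
    // Hypothetical readings table; in practice this would be built from CC-CEDICT.
    static final Map<Character, List<String>> READINGS = Map.of(
        '北', List.of("bei3"),
        '京', List.of("jing1"),
        '行', List.of("xing2", "hang2")  // ambiguous: two readings
    );

    // Returns pinyin only when every character has exactly one known reading.
    static Optional<String> pinyinIfUnambiguous(String word) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            List<String> r = READINGS.get(word.charAt(i));
            if (r == null || r.size() != 1) {
                return Optional.empty();  // missing or ambiguous -> skip the entry
            }
            if (sb.length() > 0) sb.append(' ');
            sb.append(r.get(0));
        }
        return Optional.of(sb.toString());
    }

    public static void main(String[] args) {
        System.out.println(pinyinIfUnambiguous("北京").orElse("(skipped)"));  // bei3 jing1
        System.out.println(pinyinIfUnambiguous("北行").orElse("(skipped)"));  // (skipped)
    }
}
```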

Some other things I would like to do include:
* updating the data, as this is using 5+ year old data - this'll significantly increase the size (it wouldn't surprise me if it doubles the number of entries, but it will involve parsing a multi-terabyte JSON file!)
* providing a version of the data that is simplified-only or traditional-only - some entries only have one or the other, and I have only done automatic transliteration (in both directions) in cases that are unambiguous
* creating a curation script which allows me to quickly filter out entries we don't want by hand
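On the multi-terabyte file: the Wikidata JSON dump is one giant array, but each entity sits on its own line, so it can be streamed line by line instead of being parsed as a single JSON document. A rough sketch (not the project's actual code; the sample string stands in for the real dump):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class StreamDump {
    // Yields one JSON object per entity without holding the whole dump in memory.
    static List<String> entityLines(String dump) throws IOException {
        List<String> entities = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(dump))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.equals("[") || line.equals("]") || line.isEmpty()) continue;
                if (line.endsWith(",")) line = line.substring(0, line.length() - 1);
                // `line` is now one complete entity object; hand it to a real
                // JSON parser (e.g. Jackson) in an actual implementation.
                entities.add(line);
            }
        }
        return entities;
    }

    public static void main(String[] args) throws IOException {
        String sample = "[\n{\"id\":\"Q1\"},\n{\"id\":\"Q2\"}\n]";
        System.out.println(entityLines(sample));  // [{"id":"Q1"}, {"id":"Q2"}]
    }
}
```

In a real run you'd wrap the (usually gzip/bzip2-compressed) dump file in a decompressing reader instead of a StringReader.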

Also, there's a flag (currently disabled) to include zh-HK and zh-MO, which doubles the output size if enabled, but I don't know whether Cantonese names are different enough to be worth including (I could also generate a separate dict for Cantonese learners at some point).
 

Shun

状元
These considerations make a lot of sense.

Usually, transliterations should only consist of strings of single characters anyway. Of course it is always possible that an alternative pronunciation of a character is used in the transliteration instead of the main one. But if you used Wikipedia's pinyin, we should be fine.

Perhaps you can work with split versions of the current Wikipedia dump, or filter the JSON stream as it comes in.

Cheers, Shun
 
>But if you used Wikipedia's pinyin, we should be fine.

Unfortunately, there's generally no pinyin in Wikidata, so I've had to generate it automatically from CC-CEDICT. (Come to think of it, there is a pinyin property that some Wikidata items have, but only really for things with native Chinese names, not transliterations - I can try to incorporate that where it exists when I next run the pre-processing script.)
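For anyone curious, CC-CEDICT's plain-text format is straightforward to parse: each line is traditional form, simplified form, the pinyin in brackets, then slash-delimited glosses. A rough sketch of pulling those fields apart:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CedictLine {
    // CC-CEDICT lines look like:
    //   漢語 汉语 [Han4 yu3] /Chinese language/
    static final Pattern LINE =
        Pattern.compile("^(\\S+) (\\S+) \\[([^\\]]+)\\] /(.+)/$");

    // Returns {traditional, simplified, pinyin, glosses} or null if malformed.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) return null;
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String[] p = parse("漢語 汉语 [Han4 yu3] /Chinese language/");
        System.out.println(p[0] + " / " + p[1] + " / " + p[2] + " / " + p[3]);
    }
}
```

Comment lines in the real file start with `#` and would need to be skipped before parsing.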
 
Also, a separate project idea I have is to use Wikipedia instead of Wikidata and extract anything using a {{zh}} or similar template, which sometimes does contain pinyin.
 

Shun

状元
> > But if you used Wikipedia's pinyin, we should be fine.
>
> Unfortunately, there's generally no pinyin in Wikidata, so I've had to generate it automatically from CC-CEDICT. (Come to think of it, there is a pinyin property that some Wikidata items have, but only really for things with native Chinese names, not transliterations - I can try to incorporate that where it exists when I next run the pre-processing script.)

Yeah, you're right. So 90% of the pronunciations should still be correct. ;)
 

Shun

状元
> Also, a separate project idea I have is to use Wikipedia instead of Wikidata and extract anything using a {{zh}} or similar template, which sometimes does contain pinyin.
Definitely. Perhaps if you limit yourself to those terms whose pinyin is definitely correct, you would still end up with a long list that is worth studying.
 
> Yeah, you're right. So 90% of the pronunciations should still be correct. ;)

Of the ones in the file I posted, it should be 100%, as I only generate an entry in cases where every character has a single pinyin reading (unless a reading is missing from CC-CEDICT, which seems unlikely). The question is whether to include ones where the pinyin is ambiguous and we have to guess.
 

Shun

状元
> > Yeah, you're right. So 90% of the pronunciations should still be correct. ;)
>
> Of the ones in the file I posted, it should be 100%, as I only generate an entry in cases where every character has a single pinyin reading (unless a reading is missing from CC-CEDICT, which seems unlikely). The question is whether to include ones where the pinyin is ambiguous and we have to guess.

Oh, that's a great idea! Yes, CC-CEDICT should be quite comprehensive.

Ambiguous cases would probably be error-prone. Perhaps Google Cloud contains some useful data, since Google Translate always displays the pinyin when you enter Hanzi into it. But accessing that through an API probably wouldn't be free.
 
If anyone was interested, they could change this line to include the guesses too; it just takes the first pinyin it finds for that character. (Though I think some pinyin readings are more or less common in names/transliterations, and there are some 'rules' about which to use - or we could use statistical data to pick the most common pinyin reading, if such a thing exists.)
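A rough sketch of what that fallback flag could look like (hypothetical names; the hardcoded readings map stands in for CC-CEDICT data, with the first listed reading treated as the guess):

```java
import java.util.List;
import java.util.Map;

public class GuessPinyin {
    // Hypothetical readings table standing in for CC-CEDICT.
    static final Map<Character, List<String>> READINGS = Map.of(
        '李', List.of("li3"),
        '行', List.of("xing2", "hang2"));  // ambiguous

    // With includeGuesses=true, ambiguous characters fall back to the first
    // listed reading instead of dropping the entry; returns null if skipped.
    static String pinyin(String word, boolean includeGuesses) {
        StringBuilder sb = new StringBuilder();
        for (char c : word.toCharArray()) {
            List<String> r = READINGS.get(c);
            if (r == null || (r.size() > 1 && !includeGuesses)) return null;
            if (sb.length() > 0) sb.append(' ');
            sb.append(r.get(0));  // a frequency table could pick better than "first"
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(pinyin("李行", false));  // null: ambiguous, skipped
        System.out.println(pinyin("李行", true));   // li3 xing2 (a guess)
    }
}
```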
 

Shun

状元
Wow, I like Java. Yes, usually a Hanzi used in transliterations will only have a single pinyin reading each time, so one could infer the pronunciation of terms where it isn't known from other terms where it is. So there are definitely some good possibilities. :)
 
Hmm, interestingly ICU apparently has some data for the most common reading: https://www.ibm.com/docs/en/ignm/7.0.0?topic=transliteration-chinese-overview so I could use that (though I'd probably still keep it optional whether to include only the unambiguous ones).

(I don't know how hard it'd be to get access to InfoSphere Global Name Management, or what the licence of the generated data would be, but ICU itself is free/open source if we use it directly.)
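For reference, ICU's Han-to-pinyin rules are exposed directly through its Transliterator API, so no IBM product is needed; a minimal sketch using ICU4J (assumes the com.ibm.icu:icu4j dependency is on the classpath):

```java
// Requires the ICU4J library (com.ibm.icu:icu4j) on the classpath.
import com.ibm.icu.text.Transliterator;

public class IcuPinyin {
    // ICU's Han-Latin transform picks the most common reading per character,
    // per the caveat quoted below from the IBM documentation.
    static String toPinyin(String hanzi) {
        Transliterator t = Transliterator.getInstance("Han-Latin");
        return t.transliterate(hanzi);
    }

    public static void main(String[] args) {
        System.out.println(toPinyin("北京"));  // pinyin with tone marks
    }
}
```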
 

Shun

状元
Nice, thanks, I hadn't heard of the ICU project. However, as you say, they also caution that their pronunciations for names are decided on a frequency basis:

"The International Components for Unicode (ICU) open source project has a set of system rules that transliterate commonly used Chinese characters into Mandarin Pinyin representations. Each character has only one output form. In the case of characters with multiple pronunciations, the most common one is selected. The Global Name Management transliteration process uses the ICU internal rule set for most Chinese characters. Exceptions are handled by special rules."

So I think inference from other names' pronunciations confirmed by Wikipedia would probably be the most correct option. But of course it's up to you. :)

Cheers, Shun
 