Order of the definitions in dictionaries

#1
Hello,

One question regarding the CC dictionary... or dictionaries in general.
Say I look up 石 in the dictionary.
I see 2 CC entries.
One of them has 1. rock 2. stone 3. stone inscription 4. one of the eight ancient musical instruments
Another one, separated from that, has Shi (with capital S) - surname Shi

I am wondering why there are 2 entries for the same character. It feels like it would have been better to use a single entry combining all 5 potential meanings.
The reason I ask is that the first entry seems to sort its 4 definitions in order of usefulness (rock/stone are the most common meanings).

My friend is trying to use this dictionary to display the most common meanings first. It's easy if there is 1 entry with all definitions sorted out inside. But it's much more difficult if there are multiple entries...
 
#2
Hi François,

you may already know that some characters share the same written form but have different pronunciations and different etymologies: homographs like 行 xíng and 行 háng, or 乐 yuè and 乐 lè. Some of the dictionaries used by Pleco group a character's entries not by pronunciation but by etymology, so that 数 shǔ and 数 shù appear in the same dictionary entry because they are etymologically related, while 行 xíng and 行 háng are two entirely separate dictionary entries.

For CC-CEDICT, I think this is just because of the way the data was grouped by its makers. 石 shí and the name 石 Shí have two slightly different pronunciations, so they chose to make two separate dictionary entries for them, even though they very likely are etymologically related. Pleco will then have to display them separately, because both the headwords in Simplified/Traditional and the pronunciations have to match for Pleco to list them together. The makers of Pleco could recombine these entries if they wanted to, but then they would have to find all the homographs and leave those separate, too—which would be the dictionary makers' task more than Pleco's.
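
To illustrate the matching rule described above, here is a minimal Python sketch. It assumes the usual CC-CEDICT line layout (`Traditional Simplified [pinyin] /sense/sense/`) and groups entries under the exact headword-plus-pinyin key. The sample lines are abridged stand-ins for the two 石 entries, and `parse_line` and `group_entries` are hypothetical helper names, not anything taken from Pleco.

```python
import re
from collections import defaultdict

# Assumed CC-CEDICT layout: Traditional Simplified [pin1 yin1] /sense 1/sense 2/
LINE_RE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")

# Abridged sample lines modeled on the two 石 entries (illustrative only).
SAMPLE = """\
石 石 [shi2] /rock/stone/stone inscription/
石 石 [Shi2] /surname Shi/
"""

def parse_line(line):
    """Return (traditional, simplified, pinyin, senses), or None if not an entry."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # skip blank lines and comment lines
    match = LINE_RE.match(line)
    if match is None:
        return None
    trad, simp, pinyin, body = match.groups()
    return trad, simp, pinyin, body.split("/")

def group_entries(text):
    """Group senses under the exact (traditional, simplified, pinyin) key."""
    groups = defaultdict(list)
    for line in text.splitlines():
        parsed = parse_line(line)
        if parsed is None:
            continue
        trad, simp, pinyin, senses = parsed
        groups[(trad, simp, pinyin)].extend(senses)
    return dict(groups)

groups = group_entries(SAMPLE)
for key, senses in groups.items():
    print(key, senses)
```

Because the key compares the pinyin string exactly, `shi2` and `Shi2` end up in two separate groups, which matches the two listings seen for 石.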

So perhaps your friend could try a dictionary besides CC-CEDICT?

Cheers,

Shun
 
#3
石 shí and the name 石 Shí have two slightly different pronunciations
Uh, no.

This actually relates to the rules of Pinyin. Similarly to English, Pinyin dictates that proper nouns be capitalized. Just like Stone, as a surname, would be capitalized in English but would be written lowercase as a regular noun - 石 is also capitalized appropriately.
 
#4
Got it! Yeah, given those 破音字 and cases such as names, I can understand the need to differentiate the entries. It completely makes sense. It's just that, since definitions within an entry seem to be ordered by frequency of usage, I was wondering if there's any way to get usage frequency across different entries. But it doesn't seem easy. I'll check whether other dictionaries provide frequency of usage even for 破音字 or across entries. Thanks!
 
#5
This actually relates to the rules of Pinyin. Similarly to English, Pinyin dictates that proper nouns be capitalized. Just like Stone, as a surname, would be capitalized in English but would be written lowercase as a regular noun - 石 is also capitalized appropriately.
You're right, of course, I just took a technical point of view. Seen from the point of view of dictionary organization, the two pronunciations are treated separately—even if the difference is just due to pinyin convention.

Got it! Yeah, given those 破音字 and cases such as names, I can understand the need to differentiate the entries. It completely makes sense. It's just that, since definitions within an entry seem to be ordered by frequency of usage, I was wondering if there's any way to get usage frequency across different entries. But it doesn't seem easy. I'll check whether other dictionaries provide frequency of usage even for 破音字 or across entries.
Oh yes, 破音字 they're called. I'm not sure how Pleco stores its internal usage frequency data; I believe it's stored independently from the dictionaries. The usage frequencies from the BCC (BLCU Chinese Corpus) may be a good fit for what your friend is doing. Have a look at these two threads:

https://plecoforums.com/threads/wor...haracter-corpus-bcc-blcu-chinese-corpus.5859/
https://plecoforums.com/threads/integrating-bcc-corpus-data-into-dictionary.6123/

You're welcome!
 

mikelove

皇帝
Staff member
#6
We group all dictionaries this way at the moment - lowercase / caps entries are separated because that is the convention in a lot of dictionaries (like CC-CEDICT) and it seems to us like a pretty sensible one since it avoids bloating entries for common characters with obscure senses relating to particular place names / surnames / etc.

However, this will be fully customizable in 4.0, so if you prefer to have lowercase and capitalized pinyin grouped in the same listing you'll be able to do that.
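
In the meantime, anyone working with the raw dictionary data outside Pleco could approximate a combined listing themselves. The following is only a rough Python sketch, not a description of how Pleco 4.0 works: `ENTRIES` is hypothetical pre-parsed data, `merge_ignoring_case` is a made-up helper name, and putting lowercase-pinyin senses ahead of capitalized ones is a simple heuristic based on the observation above that capitalized entries tend to hold rarer name senses. It is not a real frequency measure.

```python
from collections import defaultdict

# Hypothetical, already-parsed entries: (headword, pinyin, senses).
ENTRIES = [
    ("石", "Shi2", ["surname Shi"]),
    ("石", "shi2", ["rock", "stone", "stone inscription"]),
]

def merge_ignoring_case(entries):
    """Merge entries whose pinyin differs only by capitalization.

    Senses from lowercase-pinyin entries come first; senses from
    capitalized (proper-noun) entries are appended after them.
    """
    # key -> (senses from lowercase entries, senses from capitalized entries)
    buckets = defaultdict(lambda: ([], []))
    for headword, pinyin, senses in entries:
        common, proper = buckets[(headword, pinyin.lower())]
        target = proper if pinyin[:1].isupper() else common
        target.extend(senses)
    return {key: common + proper for key, (common, proper) in buckets.items()}

merged = merge_ignoring_case(ENTRIES)
print(merged)
```

Readings that genuinely differ (the 破音字 cases such as 行 xíng and 行 háng) still produce different keys here, so they stay in separate listings.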
 