Displaying frequency information

Bob P

秀才
To help me focus on learning common characters, I like to know how frequent a given word or character is. When I encounter a new character or word, I often have no idea whether it is worth the time to learn it. It would be great if Pleco just showed me that directly. Is there any plan to add this in the future? For me, I'd like a simple stack-ranked number (i.e. a unique frequency rank for each word, not like the Fenn frequency buckets), and it doesn't have to be super precise (i.e. I'm not too interested in the debates about how to measure frequency).

Ok, feature request aside, has anyone created a user dict that does this? If not, would others find this valuable? It seems like it would be a simple thing to do, and we could include multiple different frequency measures, HSK number, etc. I'd like to do it for multi-character words as well as single characters.

FWIW, so far, I've found that the Unihan database provides several unsatisfying measures: Frequency (http://www.unicode.org/reports/tr38/tr38-21.html#kFrequency), Fenn (http://www.unicode.org/reports/tr38/tr38-21.html#kFenn), Grade Level (http://www.unicode.org/reports/tr38/tr38-21.html#kGradeLevel) and Hanyu Pinlu (http://www.unicode.org/reports/tr38/tr38-21.html#kHanyuPinlu). I've used them all for a while; Fenn is the closest to useful, and it is based on 100-year-old data.
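For anyone who wants to look at those four fields side by side, here is a minimal Python sketch. It assumes the Unihan data files use tab-separated lines of the form `U+XXXX<TAB>kField<TAB>value` with `#` comment lines; the function name `parse_unihan` and the sample values are made up for illustration and are not real Unihan data.

```python
from collections import defaultdict

# The four Unihan fields mentioned above.
FIELDS = {"kFrequency", "kFenn", "kGradeLevel", "kHanyuPinlu"}


def parse_unihan(lines, fields=FIELDS):
    """Map each character to {field: value} for the requested fields.

    Assumes each data line looks like "U+5B9E<TAB>kFrequency<TAB>1".
    """
    data = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # skip anything that is not a three-column record
        codepoint, field, value = parts
        if field in fields:
            char = chr(int(codepoint[2:], 16))  # "U+5B9E" -> the character
            data[char][field] = value
    return dict(data)


# Made-up sample lines (hypothetical values, only to show the shape).
sample = [
    "# illustrative lines, not real Unihan values",
    "U+5B9E\tkFrequency\t1",
    "U+5B9E\tkGradeLevel\t2",
    "U+6CE1\tkFrequency\t4",
]

for char, values in parse_unihan(sample).items():
    print(char, values)
```

With a real file you would pass an open file handle (opened with `encoding="utf-8"`) instead of `sample`.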
 

John.

秀才
There is a user dictionary with 99121 entries showing the frequency of a word according to the SUBTLEX-CH frequency list. This frequency list is based on the subtitles of 6243 movies and TV show episodes. According to the authors, word frequencies based on subtitles "are a good estimate of daily language exposure". The pinyin of the entries is not always correct, but if you search for a word in Chinese characters it's not a problem.

Source: Cai Q, Brysbaert M (2010) SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles. PLoS ONE 5(6): e10729. doi:10.1371/journal.pone.0010729

Edit: Most pinyin errors were corrected. The dictionary is attached as a .zip file.
 

Attachments

  • example entry frequency dictionary.png (92.4 KB)
  • SUBTLEX-CH improved pinyin.zip (16.3 MB)

Bob P

秀才
Thanks, John! This is awesome and pretty close to what I'm looking for. Did you make the Pleco version of it?

Unfortunately there are two problems, though. First, the biggest blocker is that traditional characters are not supported. I primarily use traditional characters, so that is a non-starter.

The other problem is that the frequencies for individual characters in this database are strictly word frequencies. For example, the 100th most common character according to http://www.hanzicraft.com/ (which uses Jun Da's http://lingua.mtsu.edu/chinese-computing/statistics/char/list.php?Which=MO) is "实", but its rank in this database is 6202! On the other hand, "泡" is the 2000th most common character in Jun Da and is ranked 2231 in this database! What's happening? The database only counts a single character when it appears as a word on its own, while I was looking for character frequency (although showing both would be preferable). So it can't answer the question "is this character in the top 2000 most commonly used characters?", i.e. it doesn't help evaluate which characters are important to learn.

More details: I used a SQL browser to inspect the db and show only single characters, and "实" is ranked 1520 among those. However, there are 176 entries that contain "实", many with higher frequencies than the single character. Clearly the 1520 reflects usage of that character only when it stands alone as a word, not when it is part of a compound.

Fortunately, character frequencies are trivial to compute from the original source, unlike word frequencies, so they could be added fairly easily. The lack of a traditional character source is the bigger problem.
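As a rough sketch of what I mean (in Python), assuming you already have a word-to-count mapping from a word frequency list: add each word's count to every character it contains, then rank the totals. The counts below are made up for illustration, and `char_counts` and `rank` are just names I picked.

```python
from collections import Counter


def char_counts(word_counts):
    """Sum each word's count into every character occurrence in that word."""
    totals = Counter()
    for word, count in word_counts.items():
        for ch in word:
            totals[ch] += count
    return totals


def rank(counts):
    """Return {item: rank}, rank 1 = most frequent; ties broken by item."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {item: i for i, (item, _) in enumerate(ordered, start=1)}


# Made-up word counts, NOT taken from SUBTLEX-CH.
words = {"实": 10, "其实": 50, "实在": 30, "泡": 20, "的": 100}

word_rank = rank(words)               # rank of "实" as a standalone word
char_rank = rank(char_counts(words))  # rank of "实" counting compounds too

print(word_rank["实"], char_rank["实"])
```

In this toy data "实" is last as a standalone word but second as a character once the compounds are counted, which is the same kind of gap as the 6202 vs. 100 difference above.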
 
Hi Furio, I had a little trouble following that thread -- did you end up getting a solution that imported into Pleco and worked correctly? Also, it looks like you have both of the problems I mentioned above with the SUBTLEX-CH solution.
Perhaps http://lingua.mtsu.edu/chinese-computing/statistics/index.html or http://lingua.mtsu.edu/chinese-computing/statistics/char/list.php?Which=MO are more useful?
It's very easy to adapt the file for Pleco.

# Char Freq. Percentile Pinyin English
0001 的 7922684 4.09432531783 de/di2/di4 (possessive particle)/of, really and truly, aim/clear
0100 实 368494 41.7703310946 shi2 real/true/honest/really/solid
9933 鴒 1 100 ling2 (wagtail not in CEDICT)
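A minimal Python sketch of one way to adapt rows like these, assuming Pleco's text import accepts one entry per line as `headword<TAB>pinyin<TAB>definition` (please check that against Pleco's own import documentation). The function name `to_pleco_line` and the wording of the definition text are my own choices, and only the first listed reading is used as the pinyin field.

```python
def to_pleco_line(row):
    """Turn one frequency-list row into a tab-separated import line.

    Expected columns: rank, character, count, percentile, pinyin, English.
    The English gloss may contain spaces, so split at most five times.
    """
    parts = row.split(None, 5)
    if len(parts) < 5:
        raise ValueError(f"unexpected row: {row!r}")
    rank, char, count, percentile, pinyin = parts[:5]
    english = parts[5] if len(parts) > 5 else ""
    first_reading = pinyin.split("/")[0]  # e.g. "de/di2/di4" -> "de"
    definition = (
        f"freq rank {int(rank)}, count {count}, "
        f"percentile {float(percentile):.2f}"
    )
    if english:
        definition += f"; {english}"
    return "\t".join([char, first_reading, definition])


rows = [
    "# Char Freq. Percentile Pinyin English",
    "0100 实 368494 41.7703310946 shi2 real/true/honest/really/solid",
    "9933 鴒 1 100 ling2 (wagtail not in CEDICT)",
]

for row in rows:
    if row.startswith("#"):
        continue  # skip the header line
    print(to_pleco_line(row))
```

Putting the rank in the definition text means it shows up wherever the entry is displayed, without needing any special frequency field.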
 