Frequencies in the Pleco Chinese dictionary

#1
The Pleco dictionary shows frequencies from 1 to 5. How many words are in each category?

How were the frequencies measured? I am familiar with some frequency research from years ago. In particular, I would like to know whether the frequencies are based on research on written or spoken language.
Additionally, I would like to request frequencies for spoken language, not just for written language. Frequencies for words, not just for characters, would also be helpful.

Can anyone help me with these questions please?
 

mikelove

皇帝
Staff member
#2
That's actually not our data, it's a field from the Unihan database, documented here. However, we are working to add some better frequency data of our own (we use some internally for search result ranking but it's not quite polished enough yet to display).
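On the "how many in each category" question, one could count the classes directly from the Unihan data. The sketch below assumes the 1 to 5 values are the Unihan `kFrequency` field (the reply above does not name the field) and that each data line has the tab-separated layout `code point, field name, value`. The function name and the default field are illustrative only.

```python
from collections import Counter
from typing import Iterable


def count_by_frequency_class(lines: Iterable[str], field: str = "kFrequency") -> Counter:
    """Count how many characters fall into each frequency class."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        # Assumed layout: code point, field name, value.
        if len(parts) == 3 and parts[1] == field:
            counts[parts[2]] += 1
    return counts
```

Passing an open text file (UTF-8) as `lines` would return a counter keyed by the class value as a string, for example `counts["1"]` for the most frequent class.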
 
#3
This book has a frequency study of characters as well as of 16,000 spoken-language (kouyu 口语) terms, though the two overlap.
 

#4
I found a corpus of words used on the web at http://corpus.leeds.ac.uk/frqc/lcmc.num.
I also tried to use https://en.wiktionary.org/wiki/Appendix:Mandarin_Frequency_lists, but a number of words appear more than once, so it cannot be used as is.
Attached is an image for
爱护[愛護] ai4hu4 Rank: 4070 - Freq.: 0.03‰

** For other sorted files, see the following posts **

Edit:
Dec 1: changed the layout to use ‰ instead of /1000
Dec 2: deleted non-word characters and rebuilt the pqb dictionary; added the 25,000-word giga-zh file; rank is now shown as a number only, without the total number of lemmas

Note: the copyright notice at http://corpus.leeds.ac.uk/list.html states:
The lists are distributed under the Creative Commons (CC BY) Attribution license.
 

#5
I always like lists of terms. These two links are good, especially for learning or reviewing commonly used terms, for foreigners and natives alike. The first link is more of a frequency list. The second is like an unsorted dictionary. When I see any list of terms, I tend to want to sort or arrange it in dictionary order, not frequency order, so people can find the term they want easily. If I have to scan through a list of 5,000 terms or more, I can go crazy. Chinese dictionary users spend too much time scanning, which wastes time. I would prefer a Chinese dictionary designed with fixed positions, like dictionaries for alphabetic scripts (字母文字) such as English.
The second list repeats each term in simplified and traditional characters. That is convenient for newcomers. To save time, I also prefer the 简[簡] display format.
Thanks for the two links; I took screenshots of them for my records.
 
#6
I found a corpus of words used on the web at http://corpus.leeds.ac.uk/frqc/lcmc.num.
I also tried to use https://en.wiktionary.org/wiki/Appendix:Mandarin_Frequency_lists, but a number of words appear more than once, so it cannot be used as is.
Attached is an image for
爱好[愛好] ai4hao4 Rank: 3379/45000 Freq.: 0.031 ‰
Hi Furio Petrossi,

I hope you don't mind my uploading a version of your file Freq.txt without the following entries:

2 , Rank: 1/45000 Freq.: 75.23 ‰
4 " Rank: 3/45000 Freq.: 17.91 ‰
13 : Rank: 12/45000 Freq.: 5.10 ‰
25 ? Rank: 24/45000 Freq.: 2.98 ‰
35 ! Rank: 34/45000 Freq.: 2.15 ‰
40 ; Rank: 39/45000 Freq.: 2.06 ‰
43 ) Rank: 42/45000 Freq.: 1.87 ‰
44 ( Rank: 43/45000 Freq.: 1.87 ‰
84 《 Rank: 83/45000 Freq.: 1.11 ‰
85 》 Rank: 84/45000 Freq.: 1.10 ‰
186 - Rank: 185/45000 Freq.: 0.505 ‰
795 1991年[1991年] 1¿9¿9¿1¿ nian2 Rank: 794/45000 Freq.: 0.149 ‰
1072 - Rank: 1071/45000 Freq.: 0.111 ‰
1139 1990年[1990年] 1¿9¿9¿0¿ nian2 Rank: 1138/45000 Freq.: 0.103 ‰
1149 1. Rank: 1148/45000 Freq.: 0.102 ‰
1297 2. Rank: 1296/45000 Freq.: 0.091 ‰
1605 1989年[1989年] 1¿9¿8¿9¿ nian2 Rank: 1604/45000 Freq.: 0.072 ‰
1608 10月[10月] 1¿0¿ yue4 Rank: 1607/45000 Freq.: 0.072 ‰
1814 3. Rank: 1813/45000 Freq.: 0.064 ‰
1836 1988年[1988年] 1¿9¿8¿8¿ nian2 Rank: 1835/45000 Freq.: 0.063 ‰
1875 3月[3月] 3¿ yue4 Rank: 1874/45000 Freq.: 0.062 ‰
1880 4月[4月] 4¿ yue4 Rank: 1879/45000 Freq.: 0.061 ‰
1950 1月[1月] 1¿ yue4 Rank: 1949/45000 Freq.: 0.059 ‰
2048 8月[8月] 8¿ yue4 Rank: 2047/45000 Freq.: 0.056 ‰
2074 6月[6月] 6¿ yue4 Rank: 2073/45000 Freq.: 0.055 ‰
2122 2月[2月] 2¿ yue4 Rank: 2121/45000 Freq.: 0.054 ‰
2159 H Rank: 2158/45000 Freq.: 0.052 ‰
2167 ① Rank: 2166/45000 Freq.: 0.052 ‰
2299 C[C] Rank: 2298/45000 Freq.: 0.048 ‰
2301 ② Rank: 2300/45000 Freq.: 0.048 ‰
2390 7月[7月] 7¿ yue4 Rank: 2389/45000 Freq.: 0.046 ‰
2413 11月[11月] 1¿1¿ yue4 Rank: 2412/45000 Freq.: 0.046 ‰
2471 12月[12月] 1¿2¿ yue4 Rank: 2470/45000 Freq.: 0.045 ‰
2541 9月[9月] 9¿ yue4 Rank: 2540/45000 Freq.: 0.043 ‰
2550 5月[5月] 5¿ yue4 Rank: 2549/45000 Freq.: 0.043 ‰
2651 4. Rank: 2650/45000 Freq.: 0.042 ‰
2960 . Rank: 2959/45000 Freq.: 0.037 ‰
3041 DNA Rank: 3040/45000 Freq.: 0.035 ‰
3190 1992年[1992年] 1¿9¿9¿2¿ nian2 Rank: 3189/45000 Freq.: 0.034 ‰
3196 A[A] Rank: 3195/45000 Freq.: 0.034 ‰
3226 1985年[1985年] 1¿9¿8¿5¿ nian2 Rank: 3225/45000 Freq.: 0.033 ‰
3547 ③ Rank: 3546/45000 Freq.: 0.030 ‰
3653 1986年[1986年] 1¿9¿8¿6¿ nian2 Rank: 3652/45000 Freq.: 0.029 ‰
3770 1980年[1980年] 1¿9¿8¿0¿ nian2 Rank: 3769/45000 Freq.: 0.027 ‰
3984 1987年[1987年] 1¿9¿8¿7¿ nian2 Rank: 3983/45000 Freq.: 0.026 ‰
4182 ④ Rank: 4181/45000 Freq.: 0.024 ‰
4249 ℃ Rank: 4248/45000 Freq.: 0.024 ‰
4469 5. Rank: 4468/45000 Freq.: 0.022 ‰
4556 10万[10萬] 1¿0¿ wan4 Rank: 4555/45000 Freq.: 0.021 ‰
4591 / Rank: 4590/45000 Freq.: 0.021 ‰
4800 × Rank: 4799/45000 Freq.: 0.020 ‰
4915 30日[30日] 3¿0¿ ri4 Rank: 4914/45000 Freq.: 0.019 ‰
4936 15日[15日] 1¿5¿ ri4 Rank: 4935/45000 Freq.: 0.019 ‰
4964 1984年[1984年] 1¿9¿8¿4¿ nian2 Rank: 4963/45000 Freq.: 0.019 ‰
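The removals listed above (punctuation, digits, dates, Latin letters, symbols) could also be done automatically. This is only a minimal sketch, assuming a "word" means a token written entirely in hanzi from the basic CJK Unified Ideographs block. The names `is_word` and `filter_entries` and the `(rank, frequency, word)` tuple layout are hypothetical, not taken from the attached files.

```python
import re

# Tokens made up only of characters in the basic CJK Unified Ideographs block.
# Assumption: everything else (punctuation, digits, Latin letters, symbols,
# mixed tokens such as "1991年") is treated as a non-word and dropped.
HANZI_ONLY = re.compile(r"[\u4e00-\u9fff]+")


def is_word(token: str) -> bool:
    """Return True if the whole token consists of hanzi."""
    return HANZI_ONLY.fullmatch(token) is not None


def filter_entries(entries):
    """Keep (rank, frequency, word) tuples whose word is written only in hanzi."""
    return [entry for entry in entries if is_word(entry[2])]
```

Note that this rule is stricter than a manual check: it would also drop any legitimate word that mixes hanzi with letters or digits, so the result would still need a quick review.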
 


#7
I hope you don't mind my uploading a version of your file Freq.txt without the following entries
Hello! I had started deleting some "strange words", but your work is better! :) Now I'll also update the dictionary Freq.pqb, but if you can do it, even better!
Bye,
Furio

Note: For the first item, I find
的[的] de5 Rank: 2/45000 Freq.: 51.04 ‰
more consistent with the other entries than
的[的] de5 Rank: 2/45000 Freq.: 5.10 %
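For anyone who wants both notations, the conversion is just a division by ten, since per mille means parts per thousand and percent means parts per hundred. A tiny sketch, with hypothetical function names:

```python
def per_mille_to_percent(per_mille: float) -> float:
    """Convert a per-mille value (parts per thousand) to percent."""
    return per_mille / 10.0


def format_both(per_mille: float) -> str:
    """Show the same frequency in per mille and in percent, two decimals each."""
    return f"{per_mille:.2f} ‰ = {per_mille_to_percent(per_mille):.2f} %"
```

With the value from the note, `format_both(51.04)` gives `51.04 ‰ = 5.10 %`.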

Note 2: I don't know why, but 化 hua4 also appears three times in the original file. The same happens with
一[一] yi1 Rank: 151/45000 Freq.: 0.593 ‰, Rank: 449/45000 Freq.: 0.249 ‰, and others. See FreqSorted.txt in the next message.
 
#8
I tend to want to sort or arrange the list in dictionary
For sorting I'm using MS Excel: 1) import the txt file or copy and paste it, 2) sort, 3) copy and paste into a new text file, or export it as a tab-separated txt file.
If you need a particular sort order or subcategory, we can try to solve your problem...
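The same import, sort, export steps can be scripted instead of done in Excel. A minimal sketch, assuming each line is tab-separated with the headword in column 0 and numbered pinyin in column 1 (the real column layout of the attached files may differ, and `sort_tsv_by_column` is a made-up name):

```python
def sort_tsv_by_column(lines, column):
    """Sort tab-separated lines by the text in the given zero-based column."""
    # Split each non-empty line into its tab-separated fields.
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    # Plain string comparison; Python's sort is stable, so ties keep their order.
    rows.sort(key=lambda row: row[column])
    return ["\t".join(row) for row in rows]
```

Sorting on the pinyin column this way is a simple character-by-character comparison of strings such as `ai4hao4` and `ai4hu4`, so it is only a rough dictionary order, not a proper syllable-and-tone collation.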

[Some changes, Dec 2]
[Added FreqSortPinYin.txt, Dec 3]
 

#9
Note 2: I don't know why, but 化 hua4 also appears three times in the original file. The same happens with
一[一] yi1 Rank: 151/45000 Freq.: 0.593 ‰, Rank: 449/45000 Freq.: 0.249 ‰, and others. See FreqSorted.txt in the next message.
Hello Furio Petrossi,

I've just had a look at your files Freq.txt & FreqSorted.txt and at the frequency list in http://corpus.leeds.ac.uk/frqc/lcmc.num.

In both of your files, 化 appears three times by itself, while in the Leeds list it comes up just once.
On the other hand, both 变化 and 文化 are missing from your files, while both are present in the Leeds list.

Your files:
化[化] hua4 Rank: 340/45000 Freq.: 0.309 ‰
化[化] hua4 Rank: 348/45000 Freq.: 0.304 ‰
化[化] hua4 Rank: 667/45000 Freq.: 0.177 ‰

Leeds list:
340 309.43 变化
348 304.44 文化
667 177.68 化

Maybe something went wrong in the process of formatting the list.
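A check like the one above can be automated by comparing the two lists rank by rank. The sketch below assumes the two line layouts shown in this post: `word[trad] pinyin Rank: N/total ...` for the converted file and `rank frequency word` for the Leeds list. The regular expression and function names are my own and only cover entries with a single pinyin token.

```python
import re

# Assumed layout of a converted entry, e.g.
# "化[化] hua4 Rank: 340/45000 Freq.: 0.309 ‰"
ENTRY = re.compile(r"^(?P<word>[^\[\s]+)\[[^\]]*\]\s+\S+\s+Rank:\s*(?P<rank>\d+)/")


def parse_converted(lines):
    """Map rank -> headword for lines in the converted file's layout."""
    result = {}
    for line in lines:
        match = ENTRY.match(line.strip())
        if match:
            result[int(match.group("rank"))] = match.group("word")
    return result


def parse_source(lines):
    """Map rank -> word for 'rank frequency word' lines."""
    result = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 3 and parts[0].isdigit():
            result[int(parts[0])] = parts[2]
    return result


def find_mismatches(converted, source):
    """Return (rank, converted_word, source_word) where the two lists disagree."""
    return [
        (rank, converted[rank], source[rank])
        for rank in sorted(converted)
        if rank in source and converted[rank] != source[rank]
    ]
```

Run on the six lines quoted above, this reports ranks 340 and 348, where the converted file has 化 but the Leeds list has 变化 and 文化, and it accepts rank 667.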
 
#11
I tried to rebuild everything, but I need help checking it: can you help me?
I also added the 25,000-word giga-zh file. The ranks are a little different because the methodology used in http://corpus.leeds.ac.uk/list.html was different.
Everything is in my first message (edited).
Bye, and thank you!
Hello Furio,

If you tell me exactly what you would like me to do, and upload the files needed, I'll be glad to give you a hand over the weekend.
 
#13
I tried to rebuild everything, but I need help checking it: can you help me?
I also added the 25,000-word giga-zh file. The ranks are a little different because the methodology used in http://corpus.leeds.ac.uk/list.html was different.
Everything is in my first message (edited).
Bye, and thank you!
Hi Furio,
It seems to me that everything is OK now with the files Freq.txt, giga-zh.txt, and giga-zh-sorted.
Let me upload the file giga-zh-sorted by pinyin.txt, just in case you or someone else is interested in it.
 


#15
This is great Furio and sobriaebritas! After importing into Pleco, how can I sort this by the rank/frequency fields? Pleco's 'sort by' fields aren't showing those options.
 

mikelove

皇帝
Staff member
#17
No, we haven't yet come up with a way of generating it we're confident enough in to expose - Chinese is a ridiculously hard language to do this with reliably.

We do however support flashcard category 'tags' now, so if you wanted an easier way to display it yourself you could generate / import a list of word frequencies grouped into classes and have the tags for those classes display at the top of an entry.
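Grouping a ranked list into classes, as suggested above, might look like the following sketch. The class boundaries (1000, 5000, 15000) are arbitrary placeholders, the function names are hypothetical, and nothing here produces Pleco's import format, which would need to be checked separately.

```python
def frequency_class(rank, boundaries=(1000, 5000, 15000)):
    """Return 1 for the most frequent class; higher numbers mean rarer words."""
    for class_number, upper in enumerate(boundaries, start=1):
        if rank <= upper:
            return class_number
    # Ranks beyond the last boundary fall into one final class.
    return len(boundaries) + 1


def group_by_class(ranked_words, boundaries=(1000, 5000, 15000)):
    """Group (rank, word) pairs into {class_number: [words in input order]}."""
    groups = {}
    for rank, word in ranked_words:
        groups.setdefault(frequency_class(rank, boundaries), []).append(word)
    return groups
```

Each resulting group could then be written out as its own category or tag, so the class shows up on the entry even though the raw rank does not.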
 