Frequencies in Pleco Chinese dictionary

Discussion in 'Chinese Language' started by godlich, May 11, 2015.

  1. godlich

    godlich Member

    The Pleco dictionary shows frequencies from 1 to 5. How many words are in each category?

    How were the frequencies measured? I am familiar with some frequency research from years ago. I would especially like to know whether the frequencies are based on research into written or spoken language.
    Additionally, I would like to request frequencies for spoken language, not just for written language. Frequencies for words, not just for characters, would also be helpful.

    Can anyone help me with these questions please?
     
  2. mikelove

    mikelove 皇帝 Staff Member

    That's actually not our data, it's a field from the Unihan database, documented here. However, we are working to add some better frequency data of our own (we use some internally for search result ranking but it's not quite polished enough yet to display).
     
  3. Sy

    Sy 进士

    This book has a frequency study of characters as well as of 16,000 kouyu 口语 (spoken-language) terms, though the two overlap.
     

    Attached Files:

    Last edited: Oct 6, 2015
  4. I found at http://corpus.leeds.ac.uk/frqc/lcmc.num a corpus-based list of words used on the web.
    I also tried to use https://en.wiktionary.org/wiki/Appendix:Mandarin_Frequency_lists but a number of words appear more than once, so it cannot be used as is.
    Attached is an image for:
    爱护[愛護] ai4hu4 Rank: 4070 - Freq.: 0.03‰

    ** For other sorted files, see the following posts **

    Edit:
    Dec. 1: changed layout to use ‰ instead of /1000
    Dec. 2: non-word characters deleted; dictionary pqb rebuilt; 25,000-word giga-zh file added; entries now show only the rank number, without the total number of lemmas

    Note: Copyright note in http://corpus.leeds.ac.uk/list.html
    The lists are distributed under the Creative Commons (CC BY) Attribution license.
     

    Attached Files:

    Last edited: Dec 3, 2015
    sobriaebritas and alex_hk90 like this.
  5. Sy

    Sy 进士

    I always like lists of terms. These 2 links are good, especially for learning or reviewing commonly used terms, for foreigners and natives alike. The first link is more of a frequency list. The second one is like an unsorted dictionary. When I see any list of terms, I tend to want to sort or arrange it in dictionary order,
    not frequency order, so people can find the term they want easily. If I scan through a list of 5,000 terms or more, I
    can go crazy. Chinese dictionary users spend too much time scanning, which wastes time. I would prefer to design a Chinese dictionary with fixed positions, like those of English and other alphabetic scripts (字母文字).
    The second list repeats each term in simplified and traditional characters. That is convenient for newcomers. To save time, I too prefer the 简[簡] display format.
    Thanks for the 2 links; I took screenshots of them for my records.
     
    sobriaebritas likes this.
  6. sobriaebritas

    sobriaebritas 探花

    Hi Furio Petrossi,

    I hope you don't mind my uploading a version of your file Freq.txt without the following entries:

    2 , Rank: 1/45000 Freq.: 75.23 ‰
    4 " Rank: 3/45000 Freq.: 17.91 ‰
    13 : Rank: 12/45000 Freq.: 5.10 ‰
    25 ? Rank: 24/45000 Freq.: 2.98 ‰
    35 ! Rank: 34/45000 Freq.: 2.15 ‰
    40 ; Rank: 39/45000 Freq.: 2.06 ‰
    43 ) Rank: 42/45000 Freq.: 1.87 ‰
    44 ( Rank: 43/45000 Freq.: 1.87 ‰
    84 《 Rank: 83/45000 Freq.: 1.11 ‰
    85 》 Rank: 84/45000 Freq.: 1.10 ‰
    186 - Rank: 185/45000 Freq.: 0.505 ‰
    795 1991年[1991年] 1¿9¿9¿1¿ nian2 Rank: 794/45000 Freq.: 0.149 ‰
    1072 - Rank: 1071/45000 Freq.: 0.111 ‰
    1139 1990年[1990年] 1¿9¿9¿0¿ nian2 Rank: 1138/45000 Freq.: 0.103 ‰
    1149 1. Rank: 1148/45000 Freq.: 0.102 ‰
    1297 2. Rank: 1296/45000 Freq.: 0.091 ‰
    1605 1989年[1989年] 1¿9¿8¿9¿ nian2 Rank: 1604/45000 Freq.: 0.072 ‰
    1608 10月[10月] 1¿0¿ yue4 Rank: 1607/45000 Freq.: 0.072 ‰
    1814 3. Rank: 1813/45000 Freq.: 0.064 ‰
    1836 1988年[1988年] 1¿9¿8¿8¿ nian2 Rank: 1835/45000 Freq.: 0.063 ‰
    1875 3月[3月] 3¿ yue4 Rank: 1874/45000 Freq.: 0.062 ‰
    1880 4月[4月] 4¿ yue4 Rank: 1879/45000 Freq.: 0.061 ‰
    1950 1月[1月] 1¿ yue4 Rank: 1949/45000 Freq.: 0.059 ‰
    2048 8月[8月] 8¿ yue4 Rank: 2047/45000 Freq.: 0.056 ‰
    2074 6月[6月] 6¿ yue4 Rank: 2073/45000 Freq.: 0.055 ‰
    2122 2月[2月] 2¿ yue4 Rank: 2121/45000 Freq.: 0.054 ‰
    2159 H Rank: 2158/45000 Freq.: 0.052 ‰
    2167 ① Rank: 2166/45000 Freq.: 0.052 ‰
    2299 C[C] Rank: 2298/45000 Freq.: 0.048 ‰
    2301 ② Rank: 2300/45000 Freq.: 0.048 ‰
    2390 7月[7月] 7¿ yue4 Rank: 2389/45000 Freq.: 0.046 ‰
    2413 11月[11月] 1¿1¿ yue4 Rank: 2412/45000 Freq.: 0.046 ‰
    2471 12月[12月] 1¿2¿ yue4 Rank: 2470/45000 Freq.: 0.045 ‰
    2541 9月[9月] 9¿ yue4 Rank: 2540/45000 Freq.: 0.043 ‰
    2550 5月[5月] 5¿ yue4 Rank: 2549/45000 Freq.: 0.043 ‰
    2651 4. Rank: 2650/45000 Freq.: 0.042 ‰
    2960 . Rank: 2959/45000 Freq.: 0.037 ‰
    3041 DNA Rank: 3040/45000 Freq.: 0.035 ‰
    3190 1992年[1992年] 1¿9¿9¿2¿ nian2 Rank: 3189/45000 Freq.: 0.034 ‰
    3196 A[A] Rank: 3195/45000 Freq.: 0.034 ‰
    3226 1985年[1985年] 1¿9¿8¿5¿ nian2 Rank: 3225/45000 Freq.: 0.033 ‰
    3547 ③ Rank: 3546/45000 Freq.: 0.030 ‰
    3653 1986年[1986年] 1¿9¿8¿6¿ nian2 Rank: 3652/45000 Freq.: 0.029 ‰
    3770 1980年[1980年] 1¿9¿8¿0¿ nian2 Rank: 3769/45000 Freq.: 0.027 ‰
    3984 1987年[1987年] 1¿9¿8¿7¿ nian2 Rank: 3983/45000 Freq.: 0.026 ‰
    4182 ④ Rank: 4181/45000 Freq.: 0.024 ‰
    4249 ℃ Rank: 4248/45000 Freq.: 0.024 ‰
    4469 5. Rank: 4468/45000 Freq.: 0.022 ‰
    4556 10万[10萬] 1¿0¿ wan4 Rank: 4555/45000 Freq.: 0.021 ‰
    4591 / Rank: 4590/45000 Freq.: 0.021 ‰
    4800 × Rank: 4799/45000 Freq.: 0.020 ‰
    4915 30日[30日] 3¿0¿ ri4 Rank: 4914/45000 Freq.: 0.019 ‰
    4936 15日[15日] 1¿5¿ ri4 Rank: 4935/45000 Freq.: 0.019 ‰
    4964 1984年[1984年] 1¿9¿8¿4¿ nian2 Rank: 4963/45000 Freq.: 0.019 ‰
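
    Entries like the ones above could also be filtered out automatically. The following is only a minimal sketch, not the method actually used for the uploaded file. It assumes that each line of Freq.txt starts with the headword (followed by "[" or a space), and that a headword counts as a real word only if every character is a CJK ideograph in the range U+4E00 to U+9FFF. The function names and the sample lines are made up for illustration.

```python
import re

# Assumption: a "real" word consists only of CJK unified ideographs.
CJK_ONLY = re.compile(r"[\u4e00-\u9fff]+")


def is_word(headword: str) -> bool:
    """True if the headword contains nothing but CJK ideographs."""
    return CJK_ONLY.fullmatch(headword) is not None


def filter_entries(lines):
    """Keep only the lines whose headword passes is_word()."""
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        # Assumption: the headword ends at the first "[" or whitespace.
        headword = re.split(r"[\[\s]", stripped, maxsplit=1)[0]
        if is_word(headword):
            kept.append(stripped)
    return kept


# Illustrative sample lines, simplified from the entries quoted in this thread.
sample = [
    "的[的] de5 Rank: 2/45000 Freq.: 51.04 ‰",
    ", Rank: 1/45000 Freq.: 75.23 ‰",
    "1991年[1991年] nian2 Rank: 794/45000 Freq.: 0.149 ‰",
    "DNA Rank: 3040/45000 Freq.: 0.035 ‰",
    "爱护[愛護] ai4hu4 Rank: 4070/45000 Freq.: 0.03 ‰",
]

for entry in filter_entries(sample):
    print(entry)
```

    With this rule, punctuation, Latin letters, circled numbers and mixed entries such as 1991年 are all dropped, because at least one of their characters falls outside the ideograph range. A looser rule would be needed to keep words that legitimately mix digits or letters with characters.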
     

    Attached Files:

  7. Hello! I started deleting some "strange words", but your work is better! :) Now I'll also change the dictionary Freq.pqb, but if you can do it, even better!
    Bye,
    Furio

    Note: For the first item
    的[的] de5 Rank: 2/45000 Freq.: 51.04 ‰
    is, for me, more consistent than
    的[的] de5 Rank: 2/45000 Freq.: 5,10 %

    Note 2: I don't know why, but in the original file too 化 hua4 appears three times, and the same goes for
    一[一] yi1 (Rank: 151/45000 Freq.: 0.593 ‰, Rank: 449/45000 Freq.: 0.249 ‰) and others... See FreqSorted.txt in the next message.
     
    Last edited: Dec 2, 2015
  8. For sorting I'm using MS Excel: 1) import the txt file or copy and paste it, 2) sort, 3) copy and paste into a new text file, or export it as a TAB-separated txt file.
    If you need a particular sort order or subcategory, we can try to solve your problem...
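
    The same kind of sort can be done without Excel. This is a rough sketch only: it assumes a TAB-separated file in which each line holds the headword, the pinyin and the rank, in that order. That column layout, the function name sort_tsv and the sample rows are assumptions for illustration, not the layout of the attached files.

```python
def sort_tsv(text: str, column: int, numeric: bool = False) -> str:
    """Sort TAB-separated lines by one column and return the new text."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    if numeric:
        # Compare as numbers, so that rank 667 sorts after rank 340.
        rows.sort(key=lambda row: float(row[column]))
    else:
        # Plain string comparison, e.g. for sorting by pinyin.
        rows.sort(key=lambda row: row[column])
    return "\n".join("\t".join(row) for row in rows) + "\n"


# Illustrative rows: headword, pinyin, rank.
sample = "文化\twen2hua4\t348\n化\thua4\t667\n变化\tbian4hua4\t340\n"

by_rank = sort_tsv(sample, column=2, numeric=True)
by_pinyin = sort_tsv(sample, column=1)
print(by_rank)
print(by_pinyin)
```

    To use it on a real file, read the file as UTF-8 text, pass the contents to sort_tsv, and write the result to a new file.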

    [Some changes Dec., 2]
    [Added FreqSortPinYin.txt Dec, 3]
     

    Attached Files:

    Last edited: Dec 3, 2015
  9. sobriaebritas

    sobriaebritas 探花

    Hello Furio Petrossi,

    I've just had a look at your files Freq.txt & FreqSorted.txt and at the frequency list in http://corpus.leeds.ac.uk/frqc/lcmc.num.

    In both of your files, 化 appears three times by itself, while in the Leeds list it comes up just once.
    On the other hand, both 变化 and 文化 are missing in your files, while in the Leeds list both are present.

    Your files:
    化[化] hua4 Rank: 340/45000 Freq.: 0.309 ‰
    化[化] hua4 Rank: 348/45000 Freq.: 0.304 ‰
    化[化] hua4 Rank: 667/45000 Freq.: 0.177 ‰

    Leeds list:
    340 309.43 变化
    348 304.44 文化
    667 177.68 化

    Maybe something went wrong in the process of formatting the list.
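
    A quick check for this kind of problem is to count how often each headword occurs after conversion. The sketch below assumes that every line of the Leeds list has the form "rank frequency word", as in the three lines quoted above, and that dividing the list value by 1000 gives the ‰ figure (309.43 becomes 0.309 ‰, which matches the files). All function names are made up for illustration, and this is not the script that produced Freq.txt.

```python
from collections import Counter


def parse_leeds_line(line: str):
    """Split a 'rank frequency word' line into typed fields."""
    rank, freq, word = line.split(maxsplit=2)
    return int(rank), float(freq), word.strip()


def to_per_mille(freq: float) -> float:
    # Assumption inferred from the thread: list value / 1000 = value in ‰.
    return freq / 1000


def duplicated_headwords(entries):
    """Return the headwords that occur more than once."""
    counts = Counter(word for _rank, _freq, word in entries)
    return sorted(word for word, n in counts.items() if n > 1)


leeds = ["340 309.43 变化", "348 304.44 文化", "667 177.68 化"]
entries = [parse_leeds_line(line) for line in leeds]

for rank, freq, word in entries:
    # Note: rounding here may differ from truncated values in the files.
    print(f"{word}\tRank: {rank}\tFreq.: {to_per_mille(freq):.3f} ‰")

print("Duplicates:", duplicated_headwords(entries))
```

    Run on the correctly parsed list, the check reports no duplicates. Run on the three 化 entries from the converted files, it would report 化, which is a sign that part of the headword was lost somewhere during formatting.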
     
  10. I tried to rebuild everything, but I need help checking it: can you help me?
    I also added the 25,000-word giga-zh file. The ranks are a little different, because the methodology used in http://corpus.leeds.ac.uk/list.html was different.
    Everything is in my first message (edited).
    Bye and Thank you!
     
  11. sobriaebritas

    sobriaebritas 探花

    Hello Furio,

    If you tell me exactly what you would like me to do, and upload the files needed, I'll be glad to give you a hand over the weekend.
     
  12. Peter

    Peter 进士

    alex_hk90 and sobriaebritas like this.
  13. sobriaebritas

    sobriaebritas 探花

    Hi Furio,
    It seems to me that everything is OK now with the files Freq.txt, giga-zh.txt, giga-zh-sorted.
    Let me upload the file giga-zh-sorted by pinyin.txt, just in case you or someone else is interested in it.
     

    Attached Files:

  14. Thank you very much, sobriaebritas!
     
  15. lovepleco

    lovepleco Member

    This is great Furio and sobriaebritas! After importing into Pleco, how can I sort this by the rank/frequency fields? Pleco's 'sort by' fields aren't showing those options.
     
