Word frequency list based on a 15 billion character corpus: BCC (BLCU Chinese Corpus)

#1
The Beijing Language and Culture University (BLCU) created a balanced corpus of 15 billion characters. It’s based on news (人民日报 (People's Daily) 1946-2018, 人民日报海外版 (People's Daily Overseas Edition) 2000-2018), literature (books by 472 authors, including a significant portion of non-Chinese writers), non-fiction books, blog and Weibo entries, as well as classical Chinese. Frequency lists derived from the corpus can be downloaded here: http://bcc.blcu.edu.cn/downloads/resources/BCC_LEX_Zh.zip

The ZIP file contains a global frequency list based on the whole corpus and frequency lists based on specific categories (e.g. news, literature...) of the corpus. These text files can easily be turned into a Pleco user dictionary.

The corpus is much larger than the CCL (470 million characters), the CNC (100 million characters), the SUBTLEX-CH (47 million characters) and the LCMC (less than 2 million characters). It seems as if the frequency lists derived from this corpus might be the most reliable frequency lists currently available. More detailed information about the corpus can be found in this paper.

If you have problems with the original files, the UTF-8 versions attached below may work. Due to technical limitations, the global and blogs lists only include the 1048576 most frequent words.

I also created two Pleco user dictionaries showing the frequency of a word as definition. The first is based on the 100 000 most frequent words from the literature frequency list, the second is based on the 100 000 most frequent words from the news frequency list. They are attached below as ZIP files. If you are interested in a frequency user dictionary based on spoken language, see this post.
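For anyone who wants to build a similar user dictionary themselves, here is a minimal Python sketch. It rests on two assumptions that are not confirmed in this thread: that each line of a frequency list is a word followed by whitespace and a count, and that Pleco imports user dictionary entries from tab-separated lines of the form headword, pronunciation, definition (the pronunciation is left empty here). The function name `to_pleco_lines` is made up for illustration.

```python
def to_pleco_lines(freq_lines, limit=100_000):
    """Yield tab-separated lines (headword, empty pronunciation, definition).

    ASSUMPTION: each input line is "<word><whitespace><count>".
    ASSUMPTION: the import format is headword<TAB>pronunciation<TAB>definition.
    """
    emitted = 0
    for line in freq_lines:
        if emitted >= limit:
            break
        parts = line.rsplit(None, 1)  # split off the last whitespace-separated field
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        word, count = parts
        yield f"{word}\t\tfreq: {count}"
        emitted += 1


# Tiny in-memory sample with made-up counts, just to show the shape of the output.
sample = ["的\t100\n", "not a valid line\n", "了\t50\n", "是\t40\n"]
for entry in to_pleco_lines(sample, limit=2):
    print(entry)
```

With a real file you would pass the open file object (opened with `encoding="utf-8"`) instead of `sample` and write the yielded lines to a new text file.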

Maybe this is useful to some of you.
 

Attachments

#3
Thank you very much! Frequency lists are very useful. Small caveat: For some reason, I could only open the text files in MS Word for Windows, not on the Mac; every other plain text editor with Chinese encoding support, as well as Pleco's Document Reader, failed (error message or freeze). Do you think I have permission to repost them here as Unicode UTF-8?
 
#4
That's really strange. It works for me to open all the files in Word on Windows using GB-18030. I just tried opening them in Pleco on Android using the "GB (mainland)" encoding, which works as well. If it doesn't work at all for you I could send you the files converted into UTF-8, if you think this might help.
 
#6
I have not found any copyright notice regarding the frequency lists, which is why I only posted the download link. But since they are offering it as a free download they are probably not going to mind a repost in UTF-8.
 
#7
I agree; too bad my Word 2016 for Windows, version 1805, keeps locking up while saving as Unicode (file too long). I'm sorry, perhaps someone else can help out. :)
 
#8
The problem seems to be that two of these lists (global and blogs) exceed 1048576 lines (which is the maximum number of rows Excel can deal with). But since presumably no one cares about words that are not among the 1048576 most frequent ones, I just had Excel ignore the rest. This way, converting the files into UTF-8 was not a problem. I added the UTF-8 versions to the original post. Does it work now on iOS?
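As an alternative to going through Excel, the conversion can also be done line by line, which avoids any row limit. This is only a sketch: it assumes the original files are GB 18030 encoded (as reported above), and the function name and file paths are placeholders.

```python
def convert_to_utf8(src_path, dst_path, src_encoding="gb18030"):
    """Re-encode a text file line by line and return the number of lines written.

    Streaming keeps memory use low, so the size of the list does not matter.
    ASSUMPTION: the source file is GB 18030 encoded; undecodable bytes are
    replaced rather than aborting the conversion.
    """
    lines = 0
    with open(src_path, "r", encoding=src_encoding, errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
            lines += 1
    return lines


# Example call (hypothetical file names):
# convert_to_utf8("global_gb18030.txt", "global_utf8.txt")
```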
 
#9
I see, perhaps Word has a similar line limit. Many thanks! Pleco on iOS now opens them without a hitch.

Edit: Wenlin on Mac is able to open the full unconverted file, as well.
 
#10
Thank you, John! This data looks very interesting!

I've just started to look closer at the data in Excel and something doesn't seem right.
The most common character in the global list is 第 with 2,002,074,595 occurrences.
However, 第 is not nearly as common in any of the individual lists; it sits at around 1300th place in most of them.
Just looking at the top 30 or so characters, the count for each character in the global list is equal to the sum of its counts in the five individual lists. So there does seem to be something fishy about this first entry 第 in the global list.
I'm also wondering where the entry for "~" (row 29 in the attached txt file) comes from.
Any ideas?
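
The comparison described above (global count versus the sum of the individual counts) could be automated roughly like this. It is a sketch with made-up toy numbers, not the real BCC data, and `find_mismatches` is a hypothetical helper name.

```python
def find_mismatches(global_counts, category_counts):
    """Return {word: (global_count, summed_count)} for words where the two differ.

    global_counts: dict mapping word -> count from the global list
    category_counts: list of dicts, one per individual list
    """
    mismatches = {}
    for word, global_count in global_counts.items():
        summed = sum(counts.get(word, 0) for counts in category_counts)
        if global_count != summed:
            mismatches[word] = (global_count, summed)
    return mismatches


# Toy data (invented numbers): only 第 disagrees with the sum of its parts.
global_list = {"的": 30, "了": 12, "第": 99}
categories = [{"的": 10, "了": 5, "第": 1}, {"的": 20, "了": 7, "第": 2}]
print(find_mismatches(global_list, categories))
```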
 

Attachments

#11
I’ve been wondering about this too. As you can see here, the corpus does not only consist of texts from the categories that have their own frequency list in the BCC_LEX_Zh.zip file, but of texts belonging to other categories as well. There are kinds of texts, for example legal texts, that feature lots and lots of 第 (see the image of a segment of the Chinese property law). So maybe one of those other categories contains many such texts. But since, as you say, adding up a word's counts in the specific frequency lists seems to give the count shown in the global list, this still wouldn't explain why 第 shows up at the top of the global list.

Because these frequency lists are officially released by the university I almost can't believe it's an error. After all nothing is more eye-catching than a word other than 的 in the first line of a frequency list. But I don't really have an explanation for it either.

The ~ comes from weibo. It's excessively used there and thus number 6 on the weibo frequency list.
 

Attachments

#12
Hi leguan and John,

I've found one possibility. I searched through the corpus using the search word "第" and found that there are a lot of page indications in brackets spread throughout many of the texts. In this screenshot, I could find three of them. It might be that due to this effect, there are more 第s than 的s. These should be easy to filter out, though.

026177D7-6DC8-4E7C-A6FE-6E95767CB94E.png
 
#13
Very interesting. This might help to explain it. Maybe they filtered them out in all of the frequency lists except the global one?
 
#14
Good thought; that may well be. Though on second thought, if the corpus has 15 billion characters, and the global frequency list lists 2 billion occurrences of 第, that number must be a mistake, since about every 7.5th character would then have to be a 第. It doesn't matter, I'll just ignore the 第. :)
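
The arithmetic behind that estimate, as a quick check in Python. The two figures are the ones quoted in this thread (corpus size and the 第 count from the global list); nothing else is assumed.

```python
corpus_chars = 15_000_000_000   # corpus size quoted in the first post
di_count = 2_002_074_595        # count listed for 第 in the global list

share = di_count / corpus_chars   # fraction of all characters that would be 第
one_in = corpus_chars / di_count  # i.e. one 第 every `one_in` characters

print(f"{share:.1%}")   # 13.3%
print(f"{one_in:.1f}")  # 7.5
```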
 
#15
I did not consider this; you are right. I would love to just send them an e-mail asking about this issue, but the "联系我们" ("Contact us") link on their website only leads to an error message. Well, ignoring the 第 will work too.
 
#16
Hi Shun and John,
Thank you both for your replies and great detective work!
I'm thinking that maybe the easiest solution might be to ignore the global corpus included in the package and just make a new one by adding the counts for each word on the five individual files.
I'll look further into this later today!
BR
 
#18
Hi leguan,

Thank you for your efforts! Sounds like a quick Python job using the "dictionary" data structure (for example). I'll try my hand at it, as well. :)

Best, Shun
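
A minimal sketch of the dictionary idea, using `collections.Counter` (a dict subclass from the standard library). This is not the script attached later in the thread; it assumes each line is a word followed by whitespace and a count, and the function names are made up for illustration.

```python
from collections import Counter


def parse_counts(lines):
    """Parse "<word><whitespace><count>" lines into a Counter (assumed format)."""
    counts = Counter()
    for line in lines:
        parts = line.rsplit(None, 1)
        if len(parts) == 2 and parts[1].isdigit():
            counts[parts[0]] += int(parts[1])
    return counts


def merge_counts(*lists):
    """Add up the counts of every word across several frequency lists."""
    total = Counter()
    for lines in lists:
        total.update(parse_counts(lines))  # Counter.update adds counts
    return total


# Toy example with invented numbers standing in for two of the five lists.
news = ["的\t10\n", "第\t1\n"]
literature = ["的\t20\n", "了\t7\n"]
combined = merge_counts(news, literature)
print(combined["的"], combined["第"], combined["了"])
```

For the real files, each argument would be an open file object (opened with `encoding="utf-8"`) instead of a list of strings.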
 
#19
Hi leguan and John,

I wrote a little Python script to combine the five separate frequency lists into a new global list. Here are the list and the Python 3 source (you need to remove the .txt suffix). I used the UTF-8 files converted by John.

Regards, Shun


Edit: I just noticed the combined file isn't sorted by frequency anymore. I can do that later, or perhaps leguan would like to help with this last bit? :)
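
For the sorting step, something along these lines should be enough. It is a sketch that takes any word-to-count dictionary; the tie-break by word and the tab-separated output format are my own choices, not something taken from the original files.

```python
def sort_by_frequency(counts):
    """Return (word, count) pairs, highest count first; ties are ordered by word."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def format_lines(sorted_pairs):
    """Turn (word, count) pairs into tab-separated text lines."""
    return [f"{word}\t{count}" for word, count in sorted_pairs]


# Toy example with invented numbers.
combined = {"了": 7, "的": 30, "第": 1}
for line in format_lines(sort_by_frequency(combined)):
    print(line)
```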
 

Attachments
