Word frequency list based on a 15 billion character corpus: BCC (BLCU Chinese Corpus)

leguan · Jun 16, 2018

Sorry, I just sorted the non-mac version. Hope it's still helpful though

Shun · Jun 16, 2018

Hi leguan,

thanks & you're welcome! I am happy with it; the values correspond to those of my unsorted list.

Peter · Jun 17, 2018

Many thanks. Anyone else notice the inclusion of Hiragana characters in the corpus?

Shun · Jun 17, 2018

You're welcome. I noticed in other digitized Chinese texts (ones that may have been OCR'ed) that there are Hiragana characters interspersed in them, possibly as markers of characters the OCR engine didn't recognize. So it could be that these characters just made their way into the corpus and the frequency list, since the generation of the frequency list is fully automated. Just a possibility.

sobriaebritas · Jun 18, 2018

Peter said:
Many thanks. Anyone else notice the inclusion of Hiragana characters in the corpus?

Hello Peter,

[Edit: I've also found 110883 lines with no Hanzi and 4721 with Hanzi and Latin (numbers included) mixed]
I've found these 85:

のいなとしたてっにでんすはうかぷるこさまがだあおもりらくをきれちぁよどそけみつぃめねひえやゃせじごわばずぜざづろぞげぶょふぐびへぴゆぱべほぬゝぎむぢゞゅぇぉぼぽぅぺゐゎゑ
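
A scan like this one can be sketched in a few lines of Python. This is only a minimal sketch, not the method actually used above: it assumes each entry has the form "word count" separated by whitespace, and it treats the Unicode Hiragana block (U+3040 to U+309F) as the definition of "Hiragana". The function name `hiragana_counts` and the sample entries are made up for illustration.

```python
from collections import Counter

def hiragana_counts(lines):
    """For each Hiragana character, count how many entries contain it."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip empty lines
        word = parts[0]
        # Assumption: "Hiragana" means the Unicode block U+3040..U+309F,
        # which also covers the iteration marks ゝ and ゞ.
        counts.update({ch for ch in word if "\u3040" <= ch <= "\u309f"})
    return counts

# Illustrative entries only, not taken from the BCC list.
sample = ["のだ 12", "中国 99", "ゝ 3", "東京タワー 5"]
counts = hiragana_counts(sample)
print(len(counts), "distinct Hiragana characters:", "".join(sorted(counts)))
```

Counting each character once per entry (rather than once per occurrence) gives the number of affected entries, which is usually the more useful figure when deciding what to filter.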

Shun · Jun 18, 2018

Hi sobriaebritas,

thank you, that is a lot of instances. I looked at the original global GB18030 text file using Wenlin, and it gave me the following message:

It seems to have the same number of Hiragana characters as in your listing. Perhaps they just weren't very careful about the corpus sources, i.e. they included some Japanese source material? Though then there would have to be Katakana and Kanji-only characters in it, as well. I know that John has sent an E-mail to the creator asking him about the 第, and if he answers, perhaps he could also ask about the Japanese characters.

sobriaebritas · Jun 18, 2018

Shun said:
Though then there would have to be Katakana and Kanji-only characters in it, as well.

Hello Shun,
There seem to be the following 88 Katakana:
ンノスヽイルトラシリィッアクサドァマタチナコレジツカロフキプバブテメセグネニハオムミダソズエパャザビデウピュケゼヒガゲゴヘェヾモョベワヤボポヅペギホヌゾヂォヴユヨゥヶヮヲヱヵヰ

[Edit: I've also found 110883 lines with no Hanzi and 4721 with Hanzi and Latin (numbers included) mixed]
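
Counts like the ones in this edit could be reproduced roughly as follows. This is a sketch under assumptions: "Hanzi" is taken to mean the basic CJK Unified Ideographs block only (extension blocks are ignored), and "Latin (numbers included)" is taken to cover both ASCII and fullwidth letters and digits. The function `classify` and its category names are hypothetical.

```python
import re

# Assumption: basic CJK Unified Ideographs block only.
HANZI = re.compile(r"[\u4e00-\u9fff]")
# ASCII letters/digits plus their fullwidth forms (U+FF10-FF19, U+FF21-FF3A, U+FF41-FF5A).
LATIN_OR_DIGIT = re.compile(r"[A-Za-z0-9\uff10-\uff19\uff21-\uff3a\uff41-\uff5a]")

def classify(word):
    """Put a word into one of three rough script categories."""
    has_hanzi = bool(HANZI.search(word))
    has_latin = bool(LATIN_OR_DIGIT.search(word))
    if not has_hanzi:
        return "no_hanzi"
    if has_latin:
        return "hanzi_latin_mixed"
    return "other"  # Hanzi without Latin letters or digits

for w in ["Ｈ5Ｎ1型", "のだ", "中国", "ABC"]:
    print(w, classify(w))
```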

Shun said:
I know that John has sent an E-mail to the creator asking him about the 第, and if he answers, perhaps he could also ask about the Japanese characters.

Thank you for the information, Shun. And thank you to John indeed!

Shun · Jun 18, 2018

Hello sobriaebritas,

thanks! So with this many extraneous characters, either this is all just noise in the data, or they deliberately included Japanese texts. One can hardly include Japanese texts in a Chinese corpus by accident. If it's noise, it would have to come from OCR. But then why keep it in the frequency list?

sobriaebritas · Jun 18, 2018

I'd just like to share the attached file: "global_wordfreq.release (no Hiragana-Katakana)". 1704877 entries in a UTF-8 text file.
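
For anyone who wants to regenerate such a file, here is a minimal sketch of one way to do it. It is not necessarily how the attached file was produced. It assumes the source file is GB18030-encoded (as mentioned earlier in the thread) with one entry per line, and it drops every line containing a character from the Unicode Hiragana or Katakana blocks. The function name and the file paths in the usage note are placeholders.

```python
import re

# Hiragana (U+3040-U+309F) and Katakana (U+30A0-U+30FF) blocks.
# Caveat: the Katakana block also contains the middle dot ・ (U+30FB) and the
# long-vowel mark ー (U+30FC); halfwidth Katakana (U+FF66-U+FF9F) is not covered.
KANA = re.compile(r"[\u3040-\u30ff]")

def strip_kana_entries(src_path, dst_path, src_encoding="gb18030"):
    """Copy src to dst as UTF-8, skipping lines that contain kana.

    Returns the number of lines kept.
    """
    kept = 0
    with open(src_path, encoding=src_encoding) as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not KANA.search(line):
                dst.write(line)
                kept += 1
    return kept
```

Usage would look like `strip_kana_entries("global_wordfreq.release.txt", "no_kana.txt")`, where both file names are placeholders for wherever the files actually live.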

Shun · Jun 18, 2018

Thank you very much! A clean file is the main thing.

sobriaebritas · Jun 19, 2018

Hello everybody,

The attached zip file contains the following text files (no Hiragana, Katakana):

global_wordfreq.release (Hanzi only).txt
global_wordfreq.release (Hanzi-Arabic Numbers).txt
global_wordfreq.release (Hanzi-Arabic Numbers-Latin).txt
global_wordfreq.release (Hanzi-Latin).txt

I think the text file "global_wordfreq.release (Hanzi-Arabic Numbers-Latin).txt" is the only one that contains near-duplicate entries like these, which differ only in fullwidth versus halfwidth characters:
Ｈ5Ｎ1型 5300
Ｈ５Ｎ１型 1563
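
If one wanted to merge such entries, Unicode NFKC normalization maps fullwidth letters and digits to their ASCII counterparts, so both spellings end up under one key. The following is a sketch, not something that was applied to the attached files; the function name `merge_width_variants` is made up. Note that NFKC also rewrites other compatibility characters (for example circled numbers), which may or may not be desired.

```python
import unicodedata
from collections import Counter

def merge_width_variants(entries):
    """Sum counts of entries that become identical after NFKC normalization."""
    merged = Counter()
    for word, count in entries:
        merged[unicodedata.normalize("NFKC", word)] += count
    return merged

# The two entries quoted above.
entries = [("Ｈ5Ｎ1型", 5300), ("Ｈ５Ｎ１型", 1563)]
merged = merge_width_variants(entries)
print(merged)  # Counter({'H5N1型': 6863})
```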

BenJackson · Jan 8, 2020

I was just revisiting these BLCU lists because I'm nearing my original SUBTLEX-WF goals, and I know my news comprehension is not on par with my general listening. I filtered the top 1000 from the news list vs my Pleco DB to see what I was missing, and some of the words were pretty surprising, like 毛主席 being almost identical in frequency to 你! And 无产阶级 squeezed in between 项目 and 每... So I checked the description: based on news (人民日报 1946-2018, ...) ohhhhh.
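
A comparison of this kind can be sketched as below. This is only an illustration, not the actual procedure: it assumes the frequency list is available as (word, count) pairs and the known vocabulary as a plain set of words (exporting it from a flashcard database is outside the scope of the sketch). The function name and the counts in the example are invented, not taken from the BCC list.

```python
def missing_top_words(freq_entries, known_words, top_n=1000):
    """Return the top_n most frequent words that are not in known_words,
    in descending order of frequency."""
    ranked = sorted(freq_entries, key=lambda entry: entry[1], reverse=True)[:top_n]
    return [word for word, _ in ranked if word not in known_words]

# Invented counts, for illustration only.
entries = [("你", 100), ("毛主席", 99), ("项目", 50), ("每", 40)]
known = {"你", "每"}
print(missing_top_words(entries, known, top_n=3))
```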

Seems like a different time window is going to be more useful to me, assuming I can find it.

Shun · Jan 8, 2020

Good point! If you wish, we could always create a small news corpus (and frequency list) of our own. A couple of hundred articles should already be enough.

Shun · Jan 9, 2020

Thanks! Which newspapers would you like me (or others) to collect articles from? I feel «People's Daily» or «Global Times» would be a bit unrepresentative. But perhaps «Caijing»/«财经» could be more interesting? (with a broader purview that includes politics and society, a bit like «The Economist»)

According to this page, copyrights shouldn't be a problem, especially if our aim is just a frequency list:

https://linguistics.stackexchange.c...ight-issues-when-making-a-corpus-from-the-web

BenJackson · Jan 14, 2020

I'm not qualified to build such a list, but if I did I'd probably favor accessibility over having the ideal source. E.g. if Xinhua had an RSS feed, that would be ideal.

Shun · Jan 14, 2020

Hello Ben,

I think that's quite an easy thing to program with Python, a very high-level language. I would do it roughly like this:

1. Read the BCC corpus frequency list into a dictionary.
2. Concatenate all the news/magazine articles as plain text, then build a dictionary of all the words in them up to 8 characters long, counting their occurrences with the help of the BCC frequency list (which tells us which combinations of characters are real expressions).
3. Store N-grams of at least two characters that don't exist in the BLCU list in a separate list, which one could then scan for valid expressions.

This shouldn't take more than 50-100 lines of Python code, maybe less.
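
The steps above can be sketched as follows. This is a rough sketch under several assumptions, not a finished tool: the frequency list is assumed to have one "word count" pair per line separated by whitespace, the text is first split into runs of Hanzi (basic CJK block only) so that N-grams never span punctuation, and every substring is counted independently, so a short word inside a longer one (主席 inside 毛主席) is counted as well. The function names `load_freq_list` and `count_words` are made up.

```python
import re
from collections import Counter

HANZI_RUN = re.compile(r"[\u4e00-\u9fff]+")  # assumption: basic CJK block only

def load_freq_list(path, encoding="utf-8"):
    """Read 'word count' lines into a dict (assumed file format)."""
    freq = {}
    with open(path, encoding=encoding) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2 and parts[1].isdigit():
                freq[parts[0]] = int(parts[1])
    return freq

def count_words(text, known_words, max_len=8):
    """Count substrings of up to max_len Hanzi.

    Returns (known_counts, unknown_ngrams): substrings found in known_words,
    and substrings of length >= 2 that are not in known_words.
    """
    known_counts = Counter()
    unknown_ngrams = Counter()
    for run in HANZI_RUN.findall(text):
        for start in range(len(run)):
            longest = min(max_len, len(run) - start)
            for length in range(1, longest + 1):
                chunk = run[start:start + length]
                if chunk in known_words:
                    known_counts[chunk] += 1
                elif length >= 2:
                    unknown_ngrams[chunk] += 1
    return known_counts, unknown_ngrams

# Tiny illustrative example; a real run would pass load_freq_list(...) as known_words.
known = {"毛主席", "主席", "项目", "说", "很", "好"}
known_counts, unknown_ngrams = count_words("毛主席说,项目很好。", known)
print(known_counts.most_common())
```

Most of the unknown N-grams will be fragments that straddle word boundaries, so the second list is only useful after sorting by count and reviewing the top of it by hand.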

The advantage of sourcing articles from an RSS/Atom feed would of course be automation.
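
As a sketch of what that automation could look like, the following parses an RSS 2.0 document with the standard library and pulls out title, link, and description for each item. The sample feed is invented; no real feed URL is assumed here, and the function name `extract_items` is hypothetical. Feeds often carry only summaries, so fetching the full article text from each link would be a separate step.

```python
import xml.etree.ElementTree as ET

def extract_items(rss_xml):
    """Return a list of dicts with title, link and description per <item>."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
    return items

# Invented sample feed. In practice the XML would be downloaded first,
# e.g. with urllib.request.urlopen(feed_url).read().
SAMPLE = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>示例标题</title><link>https://example.com/a</link>
<description>示例摘要</description></item>
</channel></rss>"""

items = extract_items(SAMPLE)
for entry in items:
    print(entry["title"], entry["link"])
```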

According to «China Whisper», these would be the Top 10 most read Chinese newspapers:

1. Reference News 参考消息
2. People’s Daily 人民日报
3. The Global Times 环球时报
4. Southern Weekly 南方周末
5. Southern Metropolitan Daily 南方都市报
6. The China Youth Daily 中国青年报
7. Qilu Evening News 齐鲁晚报
8. Xinmin Evening News 新民晚报
9. Yangtse Evening News 扬子晚报
10. West China City News 华西都市报

I think if we use good newspapers, that could be sufficient for obtaining a good list of media vocabulary. I can try putting something together with RSS; People's Daily has a working feed, for example. We can still change our sources later.

I can start on it soon; I am always open to any input.

Regards,

Shun

Shun · Jan 15, 2020

Hello all,

I've created a new thread for it.

Cheers,

Shun
