Word frequency list based on a 15 billion character corpus: BCC (BLCU Chinese Corpus)

Shun

状元
Hi leguan,

thanks & you're welcome! I am happy with it; the values correspond to those of my unsorted list.
 

Shun

状元
You're welcome. I noticed in other digitized Chinese texts (ones that may have been OCR'ed) that there are Hiragana characters interspersed in them, possibly as markers for characters the OCR engine didn't recognize. So it could be that these characters just made their way into the corpus and the frequency list, since the generation of the frequency list is fully automated. Just a possibility.
 
Many thanks. Anyone else notice the inclusion of Hiragana characters in the corpus?
Hello Peter,

[Edit: I've also found 110883 lines with no Hanzi and 4721 with Hanzi and Latin (numbers included) mixed]
I've found these 85:

の い な と し た て っ に で ん す は う か ぷ る こ さ ま が だ あ お も り ら く を き れ ち ぁ よ ど そ け み つ ぃ め ね ひ え や ゃ せ じ ご わ ば ず ぜ ざ づ ろ ぞ げ ぶ ょ ふ ぐ び へ ぴ ゆ ぱ べ ほ ぬ ゝ ぎ む ぢ ゞ ゅ ぇ ぉ ぼ ぽ ぅ ぺ ゐ ゎ ゑ
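
For anyone who wants to reproduce a count like this, one possible approach is to test every character against the Unicode Hiragana (U+3040 to U+309F) and Katakana (U+30A0 to U+30FF) blocks. This is only a minimal sketch: the sample entries are made up, and the "word count" line layout is assumed from the entries quoted elsewhere in this thread.

```python
def kana_in(text):
    """Return the sets of Hiragana and Katakana characters found in text."""
    # Unicode block ranges; the iteration marks ゝ ゞ ヽ ヾ fall inside them too.
    hiragana = {ch for ch in text if "\u3040" <= ch <= "\u309f"}
    katakana = {ch for ch in text if "\u30a0" <= ch <= "\u30ff"}
    return hiragana, katakana

# Hypothetical sample entries, not taken from the real frequency list.
sample = ["的 100", "の 42", "カラオケ 7", "中国 90"]
hira, kata = kana_in("".join(sample))
print(len(hira), "Hiragana:", " ".join(sorted(hira)))
print(len(kata), "Katakana:", " ".join(sorted(kata)))
```

Note that the Katakana block also contains the middle dot ・ and the long-vowel mark ー, so a count made this way may include them.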
 

Shun

状元
Hi sobriaebritas,

thank you, that is a lot of instances. I looked at the original global GB18030 text file using Wenlin, and it gave me the following message:

Screen Shot 2018-06-18 at 14.32.31.png

It seems to have the same number of Hiragana characters as in your listing. Perhaps they just weren't very careful about the corpus sources, i.e. they included some Japanese source material? Though then there would have to be Katakana and Kanji-only characters in it, as well. I know that John has sent an E-mail to the creator asking him about the 第, and if he answers, perhaps he could also ask about the Japanese characters.
 
Though then there would have to be Katakana and Kanji-only characters in it, as well.
Hello Shun,
There seem to be the following 88 Katakana:
ン ノ ス ヽ イ ル ト ラ シ リ ィ ッ ア ク サ ド ァ マ タ チ ナ コ レ ジ ツ カ ロ フ キ プ バ ブ テ メ セ グ ネ ニ ハ オ ム ミ ダ ソ ズ エ パ ャ ザ ビ デ ウ ピ ュ ケ ゼ ヒ ガ ゲ ゴ ヘ ェ ヾ モ ョ ベ ワ ヤ ボ ポ ヅ ペ ギ ホ ヌ ゾ ヂ ォ ヴ ユ ヨ ゥ ヶ ヮ ヲ ヱ ヵ ヰ


I know that John has sent an E-mail to the creator asking him about the 第, and if he answers, perhaps he could also ask about the Japanese characters.
Thank you for the information, Shun. And thank you to John indeed!
 

Shun

状元
Hello sobriaebritas,

thanks! With this many extraneous characters, either this is all just noise in the data, or they deliberately included Japanese texts. One almost can't include Japanese texts in a Chinese corpus by accident. If it's noise, it would have to come from OCR errors. But then why keep it in the frequency list? :)
 
Hello everybody,

The attached zip file contains the following text files (with no Hiragana or Katakana):

global_wordfreq.release (Hanzi only).txt
global_wordfreq.release (Hanzi-Arabic Numbers).txt
global_wordfreq.release (Hanzi-Arabic Numbers-Latin).txt
global_wordfreq.release (Hanzi-Latin).txt

I think the text file "global_wordfreq.release (Hanzi-Arabic Numbers-Latin).txt" is the only one that contains duplicate entries like these:
H5N1型 5300
H5N1型 1563
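
Duplicates of this kind could be located with a few lines of Python. The sketch below assumes each line is a word followed by whitespace and a count, as in the two entries above; the third sample line is made up for illustration.

```python
from collections import defaultdict

def find_duplicates(lines):
    """Map each word that occurs on more than one line to all of its counts."""
    counts = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdecimal():
            continue  # skip anything that is not a "word count" line
        counts[parts[0]].append(int(parts[1]))
    return {word: c for word, c in counts.items() if len(c) > 1}

# Two entries quoted in the post plus one hypothetical entry.
lines = ["H5N1型 5300", "中国 90", "H5N1型 1563"]
duplicates = find_duplicates(lines)
print(duplicates)
```

Whether the counts of a duplicated word should be added together is a separate question that this sketch does not try to answer.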
 


BenJackson

举人
I was just revisiting these BLCU lists because I'm nearing my original SUBTLEX-WF goals, and I know my news comprehension is not on par with my general listening. I filtered the top 1000 from the news list vs my Pleco DB to see what I was missing, and some of the words were pretty surprising, like 毛主席 being almost identical in frequency to 你! And 无产阶级 squeezed in between 项目 and 每... So I checked the description: based on news (人民日报 1946-2018, ...) ohhhhh.

Seems like a different time window is going to be more useful to me, assuming I can find it.
 

Shun

状元
Thanks! Which newspapers would you like me (or others) to collect articles from? I feel «People's Daily» or «Global Times» would be a bit unrepresentative. But perhaps «Caijing»/«财经» could be more interesting? (with a broader purview that includes politics and society, a bit like «The Economist»)

According to this page, copyright shouldn't be a problem, especially if our aim is just a frequency list:

https://linguistics.stackexchange.com/questions/9232/do-i-have-copyright-issues-when-making-a-corpus-from-the-web
 

BenJackson

举人
I'm not qualified to build such a list, but if I did I'd probably favor accessibility over having the ideal source. E.g. if Xinhua had an RSS feed, that would be ideal.
 

Shun

状元
Hello Ben,

I think that's quite an easy thing to program with Python, a very high-level language. I would do it roughly like this:
  1. I would read in the BCC corpus frequency list as a dictionary, then
  2. Having concatenated all the news/magazine articles as plain text, I would build a dictionary of all the words up to 8 characters long that occur in the articles, counting their number of occurrences with the help of the BCC frequency list (which tells us which combinations of characters are real expressions).
  3. N-grams of at least two characters that don't exist in the BCC list could be stored in a separate list, which one could then scan for valid expressions.
This shouldn't take more than 50-100 lines of Python code, maybe less.
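
The three steps could be sketched as follows. This is just a rough outline under some assumptions: the frequency list is taken to have one "word count" pair per line, the sample counts and the sample sentence are made up, and "Hanzi" is narrowed to the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
from collections import Counter

MAX_LEN = 8  # longest word length considered, as in step 2

def load_known_words(lines):
    """Step 1: read 'word count' lines into a dictionary."""
    known = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[1].isdecimal():
            known[parts[0]] = int(parts[1])  # a later duplicate overwrites an earlier one
    return known

def is_hanzi(text):
    """True if every character lies in the basic CJK Unified Ideographs block."""
    return all("\u4e00" <= ch <= "\u9fff" for ch in text)

def count_ngrams(text, known):
    """Steps 2 and 3: count known words and collect unknown Hanzi n-grams."""
    found, unknown = Counter(), Counter()
    for start in range(len(text)):
        for length in range(1, MAX_LEN + 1):
            gram = text[start:start + length]
            if len(gram) < length:
                break  # ran past the end of the text
            if gram in known:
                found[gram] += 1
            elif length >= 2 and is_hanzi(gram):
                unknown[gram] += 1
    return found, unknown

# Hypothetical list entries and article text, for illustration only.
bcc_lines = ["毛主席 120", "主席 300", "项目 250"]
known = load_known_words(bcc_lines)
found, unknown = count_ngrams("毛主席的项目", known)
print(found.most_common())
print(unknown.most_common(5))
```

Because every substring is tested, a shorter word is also counted when it appears inside a longer one (主席 inside 毛主席 here), and the list of unknown n-grams grows quickly, so in practice one would probably keep only those above some minimum count before scanning them by hand.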

The advantage of sourcing articles from an RSS/Atom feed would of course be automation.
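
To give an idea of how little code the feed side needs, here is a minimal sketch that pulls titles and links out of an RSS 2.0 document using only the standard library. The sample feed and its URLs are invented; no real newspaper feed address is assumed here.

```python
import xml.etree.ElementTree as ET

def rss_items(xml_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]

# Made-up feed content standing in for a downloaded feed.
SAMPLE = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>First article</title><link>https://example.com/1</link></item>
<item><title>Second article</title><link>https://example.com/2</link></item>
</channel></rss>"""

items = rss_items(SAMPLE)
for title, link in items:
    print(title, link)
```

For a live feed, the XML would first have to be downloaded, for example with urllib.request. Atom feeds use namespaced entry elements instead of item, so they would need slightly different handling.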

According to «China Whisper», these would be the Top 10 most read Chinese newspapers:

1. Reference News 参考消息
2. People’s Daily 人民日报
3. The Global Times 环球时报
4. Southern Weekly 南方周末
5. Southern Metropolitan Daily 南方都市报
6. The China Youth Daily 中国青年报
7. Qilu Evening News 齐鲁晚报
8. Xinmin Evening News 新民晚报
9. Yangtse Evening News 扬子晚报
10. West China City News 华西都市报

I think if we use good newspapers, that could be sufficient for obtaining a good list of media vocabulary. I can try putting something together with RSS; People's Daily has a working feed, for example. We can still change our sources later.

I can start on it soon; I am always open to input.

Regards,

Shun
 