Word frequency list based on a 15 billion character corpus: BCC (BLCU Chinese Corpus)

#22
Hi leguan,

thanks & you're welcome! I am happy with it; the values correspond to those of my unsorted list.
 
#24
You're welcome. I noticed in other digitized Chinese texts (ones that may have been OCR'ed) that there are Hiragana characters interspersed in them, possibly as markers for characters the OCR engine didn't recognize. So it could be that these characters simply made their way into the corpus and then into the frequency list, since the frequency list is generated fully automatically. Just a possibility.
 
#25
Many thanks. Anyone else notice the inclusion of Hiragana characters in the corpus?
Hello Peter,

[Edit: I've also found 110883 lines with no Hanzi and 4721 with Hanzi and Latin (numbers included) mixed]
I've found these 85:

の い な と し た て っ に で ん す は う か ぷ る こ さ ま が だ あ お も り ら く を き れ ち ぁ よ ど そ け み つ ぃ め ね ひ え や ゃ せ じ ご わ ば ず ぜ ざ づ ろ ぞ げ ぶ ょ ふ ぐ び へ ぴ ゆ ぱ べ ほ ぬ ゝ ぎ む ぢ ゞ ゅ ぇ ぉ ぼ ぽ ぅ ぺ ゐ ゎ ゑ
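
In case anyone wants to reproduce this kind of check, here is a rough sketch of one way to do it. It is not necessarily how the list above was produced. It assumes the list is a plain-text file encoded in GB18030 and simply counts every character that falls in the Unicode Hiragana (U+3040 to U+309F) or Katakana (U+30A0 to U+30FF) block. The function name `kana_counts` and the file name in the usage comment are placeholders of my own.

```python
from collections import Counter

# Unicode block ranges (iteration marks such as ゝ ゞ ヽ ヾ fall inside them).
HIRAGANA = range(0x3040, 0x30A0)  # U+3040..U+309F
KATAKANA = range(0x30A0, 0x3100)  # U+30A0..U+30FF


def kana_counts(lines):
    """Return two Counters: Hiragana and Katakana characters seen in `lines`."""
    hira, kata = Counter(), Counter()
    for line in lines:
        for ch in line:
            cp = ord(ch)
            if cp in HIRAGANA:
                hira[ch] += 1
            elif cp in KATAKANA:
                kata[ch] += 1
    return hira, kata


# Usage (file name is an assumption, adjust to your copy of the list):
# with open("global_wordfreq.release.txt", encoding="gb18030") as f:
#     hira, kata = kana_counts(f)
# print(len(hira), "distinct Hiragana;", len(kata), "distinct Katakana")
```

Note that the Katakana block also contains ・ and ー, so the distinct-character totals can differ slightly depending on what you choose to count.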
 
#26
Hi sobriaebritas,

thank you, that is a lot of instances. I looked at the original global GB18030 text file using Wenlin, and it gave me the following message:

[Screenshot: Screen Shot 2018-06-18 at 14.32.31.png]

It seems to have the same number of Hiragana characters as in your listing. Perhaps they just weren't very careful about the corpus sources, i.e. they included some Japanese source material? Though then there would have to be Katakana and Kanji-only characters in it, as well. I know that John has sent an E-mail to the creator asking him about the 第, and if he answers, perhaps he could also ask about the Japanese characters.
 
#27
Though then there would have to be Katakana and Kanji-only characters in it, as well.
Hello Shun,
There seem to be the following 88 Katakana:
ン ノ ス ヽ イ ル ト ラ シ リ ィ ッ ア ク サ ド ァ マ タ チ ナ コ レ ジ ツ カ ロ フ キ プ バ ブ テ メ セ グ ネ ニ ハ オ ム ミ ダ ソ ズ エ パ ャ ザ ビ デ ウ ピ ュ ケ ゼ ヒ ガ ゲ ゴ ヘ ェ ヾ モ ョ ベ ワ ヤ ボ ポ ヅ ペ ギ ホ ヌ ゾ ヂ ォ ヴ ユ ヨ ゥ ヶ ヮ ヲ ヱ ヵ ヰ

[Edit: I've also found 110883 lines with no Hanzi and 4721 with Hanzi and Latin (numbers included) mixed]
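
For anyone who wants to run a similar tally, the sketch below shows one possible way to sort entries by script. It rests on several assumptions: each line has the form "word count" separated by whitespace (as in entries like "H5N1型 5300"), "Hanzi" means the CJK Unified Ideographs block plus Extension A only, and "Latin" means ASCII letters and digits only. The names `classify` and `tally` are made up for illustration, and the counts you get will depend on these definitions.

```python
import re
from collections import Counter

# Assumption: Hanzi = CJK Unified Ideographs + Extension A (BMP only).
HANZI = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf]")
# Assumption: Latin = ASCII letters and digits (fullwidth forms not counted).
LATIN_OR_DIGIT = re.compile(r"[A-Za-z0-9]")


def classify(word):
    """Label a word as 'no_hanzi', 'hanzi_latin_mixed' or 'hanzi_only'."""
    if not HANZI.search(word):
        return "no_hanzi"
    if LATIN_OR_DIGIT.search(word):
        return "hanzi_latin_mixed"
    return "hanzi_only"


def tally(lines):
    """Count entries per label; only the word field is inspected, not the count."""
    counts = Counter()
    for line in lines:
        parts = line.rsplit(None, 1)  # split off the trailing frequency
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        counts[classify(parts[0])] += 1
    return counts
```

Under these definitions an entry such as の falls under "no_hanzi", together with pure Latin or numeric entries.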

I know that John has sent an E-mail to the creator asking him about the 第, and if he answers, perhaps he could also ask about the Japanese characters.
Thank you for the information, Shun. And thank you to John indeed!
 
#28
Hello sobriaebritas,

thanks! So with this many extraneous characters, either this is all just noise in the data, or they deliberately included Japanese texts. One can hardly include Japanese texts in a Chinese corpus by accident. If it is noise, it would have to be OCR noise. But then why keep it in the frequency list? :)
 
#31
Hello everybody,

The attached zip file contains the following text files (no Hiragana, Katakana):

global_wordfreq.release (Hanzi only).txt
global_wordfreq.release (Hanzi-Arabic Numbers).txt
global_wordfreq.release (Hanzi-Arabic Numbers-Latin).txt
global_wordfreq.release (Hanzi-Latin).txt

I think the text file "global_wordfreq.release (Hanzi-Arabic Numbers-Latin).txt" is the only one that contains duplicate entries like these:
H5N1型 5300
H5N1型 1563
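
A quick way to list such duplicates could look like the sketch below. It assumes each line is "word count" separated by whitespace, and the function name `find_duplicates` is just a placeholder. It only reports the repeated entries with all their counts; whether the counts should be summed or one of them discarded is a separate question that the data alone does not answer.

```python
from collections import defaultdict


def find_duplicates(lines):
    """Map each word that occurs on more than one line to its list of counts."""
    seen = defaultdict(list)
    for line in lines:
        parts = line.rsplit(None, 1)  # word, frequency
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        seen[parts[0]].append(int(parts[1]))
    return {word: counts for word, counts in seen.items() if len(counts) > 1}


# Usage (assuming the file is UTF-8; use encoding="gb18030" for the original):
# with open("global_wordfreq.release (Hanzi-Arabic Numbers-Latin).txt",
#           encoding="utf-8") as f:
#     for word, counts in find_duplicates(f).items():
#         print(word, counts)
```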
 
