You're welcome. I noticed in other digitized Chinese texts—ones that may have been OCR'ed—that there are Hiragana characters interspersed in them, possibly as markers of characters the OCR engine didn't recognize. So it could be that these characters just made their way into the corpus and the frequency list, since the generation of the frequency list is fully automated. Just a possibility.
Thank you, that is a lot of instances. I looked at the original global GB18030 text file in Wenlin, and it gave me the following message:
It seems to have the same number of Hiragana characters as in your listing. Perhaps they just weren't very careful about the corpus sources, i.e. they included some Japanese source material? Though then there would have to be Katakana and Japanese-only Kanji in it as well. I know that John has sent an e-mail to the creator asking him about the 第, and if he answers, perhaps he could also ask about the Japanese characters.
Thanks! With this many extraneous characters, either this is all just noise in the data, or they deliberately included Japanese texts; one can hardly include Japanese texts in a Chinese corpus by accident. If it's noise, it would have to come from the OCR stage. But why keep it in the frequency list, then?
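For anyone who wants to check their own copy of a list for these entries, it's easy to scan for codepoints in the Hiragana and Katakana Unicode blocks (U+3040–U+309F and U+30A0–U+30FF). A minimal sketch, assuming the list is a plain UTF-8 text file with one entry per line (the function names here are made up):

```python
def contains_kana(text):
    """Return True if any character falls in the Hiragana or Katakana blocks."""
    return any(0x3040 <= ord(ch) <= 0x30FF for ch in text)

def find_kana_entries(path):
    """Collect all lines of a UTF-8 word list that contain Japanese kana."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if contains_kana(line)]
```

A list with a lot of hits from `find_kana_entries` but none from an analogous check of the CJK blocks alone would support the OCR-noise theory.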
I was just revisiting these BLCU lists because I'm nearing my original SUBTLEX-WF goals, and I know my news comprehension is not on par with my general listening. I filtered the top 1000 from the news list vs my Pleco DB to see what I was missing, and some of the words were pretty surprising, like 毛主席 being almost identical in frequency to 你! And 无产阶级 squeezed in between 项目 and 每... So I checked the description: based on news (人民日报 1946-2018, ...) ohhhhh.
Seems like a different time window is going to be more useful to me, assuming I can find it.
Thanks! Which newspapers would you like me (or others) to collect articles from? I feel «People's Daily» or «Global Times» would be a bit unrepresentative. But perhaps «Caijing»/«财经» could be more interesting? (with a broader purview that includes politics and society, a bit like «The Economist»)
According to this page, copyright shouldn't be a problem, especially if our aim is just a frequency list:
I think that's quite an easy thing to program in Python, a very high-level language. I would do it roughly like this:
I would read in the BCC corpus frequency list as a dictionary. Then, having concatenated all the news/magazine articles as plain text, I would build a dictionary of all the words in the articles up to 8 characters long, counting their occurrences with the help of the BCC frequency list (which tells us which character combinations are real expressions).
Any N-grams of at least two characters that don't exist in the BLCU list could be stored in a separate list, which one could then scan for legitimate expressions.
This shouldn't take more than 50-100 lines of Python code, maybe less.
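The steps above could be sketched roughly like this. This is only a sketch under assumptions: I'm assuming the BCC list is tab-separated `word<TAB>count` lines (check the actual file format first), and the function names are my own invention:

```python
from collections import Counter

MAX_LEN = 8  # longest word we consider, in characters

def load_bcc_list(path):
    """Read the BCC frequency list (assumed 'word<TAB>count' per line) into a dict."""
    freq = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 2:
                freq[parts[0]] = int(parts[1])
    return freq

def count_words(text, known_words):
    """Count every substring up to MAX_LEN characters that the reference list
    recognizes as a word; collect unknown N-grams (>= 2 chars) separately so
    they can be scanned for legitimate expressions later."""
    counts = Counter()
    unknown = Counter()
    for i in range(len(text)):
        for length in range(1, MAX_LEN + 1):
            gram = text[i:i + length]
            if len(gram) < length:  # ran off the end of the text
                break
            if gram in known_words:
                counts[gram] += 1
            elif length >= 2:
                unknown[gram] += 1
    return counts, unknown
```

Note this counts every known substring at every position rather than segmenting the text into non-overlapping words, so short words embedded in longer ones are counted too; a proper segmenter would refine this, but as a first pass it matches the description above.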
The advantage of sourcing articles from an RSS/Atom feed would of course be automation.
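As a sketch of the RSS side (assuming a standard RSS 2.0 feed; the URL would be whichever feed we settle on), the Python standard library is enough to pull article titles and links:

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen  # only needed for fetching a live feed

def parse_rss_items(xml_text):
    """Extract (title, link) pairs from an RSS 2.0 feed document."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

def fetch_feed(url):
    """Download a feed and return its items; requires network access."""
    with urlopen(url) as resp:
        return parse_rss_items(resp.read().decode("utf-8"))
```

A cron job calling `fetch_feed` daily and appending the linked article texts to the corpus would give us the automation for free.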
According to «China Whisper», these would be the Top 10 most read Chinese newspapers:
1. Reference News 参考消息
2. People’s Daily 人民日报
3. The Global Times 环球时报
4. Southern Weekly 南方周末
5. Southern Metropolitan Daily 南方都市报
6. The China Youth Daily 中国青年报
7. Qilu Evening News 齐鲁晚报
8. Xinmin Evening News 新民晚报
9. Yangtse Evening News 扬子晚报
10. West China City News 华西都市报
I think if we use good newspapers, that could be sufficient for obtaining a good list of media vocabulary. I can try putting something together with RSS; People's Daily has a working feed, for example. We can still change our sources later.
I can start soon; I'm always open to input.