Hello Ben,
I think that's quite easy to program in Python, a very high-level language. I would approach it roughly like this:
- Read the BCC corpus frequency list into a dictionary.
- Concatenate all the news/magazine articles as plain text, then build a dictionary of all the character sequences up to 8 characters long, counting their occurrences with the help of the BCC frequency list (which tells us which character combinations are real expressions).
- N-grams of at least two characters that are not in the BLCU list could be stored in a separate list, which one could then scan for legitimate expressions.
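A minimal sketch of those steps, assuming the frequency list is a tab-separated word/count file (the exact BCC file format may differ):

```python
from collections import Counter

def load_freq_list(path):
    """Load a tab-separated word<TAB>frequency file into a dict.
    Assumes one entry per line; adjust if the BCC file differs."""
    freq = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) >= 2:
                freq[parts[0]] = int(parts[1])
    return freq

def count_ngrams(text, known_words, max_len=8):
    """Count every substring of `text` up to max_len characters.
    Substrings found in `known_words` (the frequency list) are counted
    in `known`; multi-character substrings missing from the list go
    into `unknown` for later manual scanning."""
    known, unknown = Counter(), Counter()
    for i in range(len(text)):
        for n in range(1, max_len + 1):
            gram = text[i:i + n]
            if len(gram) < n:  # ran past the end of the text
                break
            if gram in known_words:
                known[gram] += 1
            elif n >= 2:
                unknown[gram] += 1
    return known, unknown
```

For example, `count_ngrams("人民日报", {"人民", "日报"}, max_len=2)` would count 人民 and 日报 as known expressions and 民日 as an unknown 2-gram.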
This shouldn't take more than 50-100 lines of Python code, maybe less.
The advantage of sourcing articles from an RSS/Atom feed would of course be automation.
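Extracting article text from an RSS 2.0 feed needs nothing beyond the standard library; a sketch (the sample XML below is made up, and real feeds may use namespaces or HTML in the description field that would need extra cleanup):

```python
import xml.etree.ElementTree as ET

def extract_items(rss_xml):
    """Pull (title, description) pairs out of an RSS 2.0 document.
    Assumes plain RSS <item> elements without XML namespaces."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        desc = item.findtext("description", default="")
        items.append((title, desc))
    return items

# Fetching the feed itself could then be done with
# urllib.request.urlopen(feed_url).read() on a schedule.
```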
According to "China Whisper", these are the top 10 most-read Chinese newspapers:
1. Reference News 参考消息
2. People’s Daily 人民日报
3. The Global Times 环球时报
4. Southern Weekly 南方周末
5. Southern Metropolitan Daily 南方都市报
6. The China Youth Daily 中国青年报
7. Qilu Evening News 齐鲁晚报
8. Xinmin Evening News 新民晚报
9. Yangtse Evening News 扬子晚报
10. West China City News 华西都市报
I think if we use good newspapers, that should be sufficient for building a solid list of media vocabulary. I can try putting something together with RSS; People's Daily has a working feed, for example. We can always change our sources later.
I can start on it soon; I'm always open to input.
Regards,
Shun