Media-related vocabulary gathering project

Shun

状元
Dear @BenJackson, dear @JD & dear all,

The acquisition of media-related vocabulary from newspapers or television news programmes usually comes rather late in one's Chinese learning career and is not covered by most of the learning materials I've seen so far. This can be a frustrating experience, especially for more advanced learners, who until now have not been able to focus on media-specific vocabulary easily. @BenJackson and others have pointed to this gap and have tried to close it by gathering vocabulary from media corpus frequency lists.

One problem that @BenJackson has run into with the BLCU media corpus (see thread) was that its sources reach back to 1946 and it therefore contains many vocabulary items which have since lost currency. With a small corpus of 650 articles from People's Daily, downloaded using a Python script, I hope to start providing a more modern frequency list of media-related vocabulary.

The frequency list has the following features:
  • It uses all sections of the 人民日报 / People's Daily newspaper, including the sports section.
  • All articles in their RSS feeds from the 12th to the 15th of January 2020 are included. I could run the script every two days and collect articles over a longer period in order to obtain more data.
  • I provide two frequency lists:
    • One list ("peoples_daily_bcc_freqlist.txt") only contains expressions that also appear in the BCC corpus frequency list. This list should only contain lexical expressions.
    • The other list ("peoples_daily_non_bcc_freqlist.txt") only contains expressions that do not appear in the BCC corpus frequency list and that were found using an N-gram search algorithm. It therefore includes not only single words but also common word combinations. Filtering these out would take significant manual work, but they can be a valuable resource in themselves for practising speaking and writing, since they are common building blocks of sentences.
  • Of course, most vocabulary items in the lists are not media-specific. I would assume that vocabulary with a frequency between 20 and 200 contains the most useful "gems". I suggest that learners skim the list for words they don't yet know and that seem likely to appear in the media.
  • The non-BCC list includes expressions up to 12 characters in length.
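For the curious, the core of the list generation can be sketched like this (a simplified Python illustration, not the actual script; the function names and the idea of the BCC list as a plain set of expressions are my assumptions):

```python
from collections import Counter

def ngram_counts(text, max_len=12):
    """Count every contiguous substring of 1 to max_len characters."""
    counts = Counter()
    for start in range(len(text)):
        for end in range(start + 1, min(start + max_len, len(text)) + 1):
            counts[text[start:end]] += 1
    return counts

def split_by_reference(counts, reference_vocab):
    """Split the counts into expressions found / not found in a reference list."""
    in_ref = {w: c for w, c in counts.items() if w in reference_vocab}
    not_in_ref = {w: c for w, c in counts.items() if w not in reference_vocab}
    return in_ref, not_in_ref
```

With real articles one would first strip punctuation, then write each dictionary out sorted by descending count to obtain the two frequency list files.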
Tell me if you'd like me to add other newspapers. Adding new sources is easy, especially if there is an RSS/Atom feed.
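Adding a feed really is just a matter of downloading and parsing it. A minimal standard-library sketch (the URL in the comment is a placeholder, and real feeds may structure their items differently):

```python
import urllib.request
import xml.etree.ElementTree as ET

def parse_rss_items(xml_text):
    """Extract (title, link) pairs from an RSS 2.0 feed document."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title", default=""),
             item.findtext("link", default=""))
            for item in root.iter("item")]

def fetch_feed(url):
    """Download a feed as text, e.g. parse_rss_items(fetch_feed("http://example.com/rss.xml"))."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")
```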

Enjoy the lists,

Shun
 

Attachments


hoshi

Member
I think this is an all-inclusive vocabulary list rather than a media-related one.
The KEY Chinese-English dictionary used to have 287,000 entries and is now in the 300,000 range. I wonder how they source that vocabulary.
 

Shun

状元
Hi hoshi,

Thanks for your feedback! Of course, at least 95% of the vocabulary used in media has uses elsewhere, as well. So it's the other 5% we are interested in, and these <5% will differ for each learner.

Good point on the KEY dictionary, I also think it has a very good range of original vocabulary; surely the makers of the KEY dictionary also use corpora to find new candidates to add.

Especially the non-BCC file could be used to discover new vocab, but certainly it requires a good bit of manual work. Personally, I can't really study a list of, say, 1,000 words without seeing any context they're being used in. So it's probably better just to collect news articles and look for interesting words in those—they should be more memorable that way.

Cheers, Shun
 

BenJackson

秀才
Interesting list. I'd love to see the corpus to understand why 国 and 中 are the 2nd and 3rd most common words (with 中国 at 38th, too). In non-news corpora I'd expect that to be a word segmentation error, but spot checks of other characters I rarely see standing alone show that in news articles they do stand alone frequently. So already I'm learning something.

Especially the non-BCC file could be used to discover new vocab, but certainly it requires a good bit of manual work. Personally, I can't really study a list of, say, 1,000 words without seeing any context they're being used in.
Yes, most of the lists I end up studying now I generate from specific works so I end up finding all of the anomalous results (or non-word N-grams) in the source text to understand why they happened. Once, with 《长安十二时辰》, I ended up spoiling the ending by Google image searching one of the N-grams to confirm it was a name!
 

Shun

状元
Hi BenJackson,

thanks, I'm glad to hear you're already benefitting from the list! My N-gram algorithm works in such a way that a word like 中国 counts once toward the number of occurrences of 中国 and also once each toward the occurrences of 中 and 国 on their own, so every character is counted twice. To make sure I'm not infringing on any copyrights, I am sending you the corpus as a private message.
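A tiny demonstration of that double counting (toy code in the same substring-counting style, not the real script):

```python
from collections import Counter

def all_ngrams(text, max_len=2):
    """Count every contiguous substring up to max_len characters."""
    counts = Counter()
    for i in range(len(text)):
        for j in range(i + 1, min(i + max_len, len(text)) + 1):
            counts[text[i:j]] += 1
    return counts

# A single occurrence of 中国 also yields one count each for 中 and 国
counts = all_ngrams("中国")
```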

Yes, most of the lists I end up studying now I generate from specific works so I end up finding all of the anomalous results (or non-word N-grams) in the source text to understand why they happened. Once, with 《长安十二时辰》, I ended up spoiling the ending by Google image searching one of the N-grams to confirm it was a name!
Oh yes, there are also a couple of very long names of government agencies, etc. I like the simple idea of collecting a number of texts one has already read and then using a Python script with an N-gram generator to create a small corpus and frequency list from them. That can be a useful learning aid. Perhaps I could clean my Python script up to allow others on the forums to run it. Or are you already using a specific software tool for the N-grams?

Cheers, Shun
 

BenJackson

秀才
Oh yes, there are also a couple of very long names of government agencies, etc.
I had to extend my maximum substring to 8 to account for things like 中央纪委国家监委 (which I could almost guess the meaning of, but I was wrong about who was being supervised!).

Or are you already using a specific software tool for the N-grams?
I've been working on Chinese text analysis tools in C++ since I started studying. Here are my results against the corpus you sent. For other people to have context, that news corpus is about the size of 3-4 novels.
  • The "char_freq" is the raw frequency of the characters (columns are char, freq, cumulative freq, raw count). There are 3478 unique characters, of which about 500 appear only once. 98% coverage only requires 1600 characters (and the characters at that cutoff still appear about 20x in the corpus).
  • The "char_arf" is the "Average Reduced Frequency" [1], a weighted frequency which favors characters that are distributed over the text rather than clumped in one spot. The way the algorithm works, the ARF of a well-distributed character will be about 1/2 of its raw count, while a poorly distributed character's ARF can be much lower. As an example, 游 is at position 38 (appearing 1843 times) in the raw frequency list, but at 150 in the ARF list. This must be due to it appearing frequently in a smaller set of articles.
  • The "word_freq" is the raw frequency of the words. This is generated by splitting the input sentences using a Viterbi algorithm and the Jieba medium dictionary (jieba/结巴 [2] is a GitHub project; I'm only using their dictionaries, not their algorithm). You can think of this algorithm as trying "every" way to split a sentence and picking the one that is most "probable" (judged by having the highest product of individual word frequencies). There are 24,500 unique words (but you would probably disagree with jieba's idea of exactly what counts as a word, since it includes number+MW combos like 一个), 10,000 (!) of which appear only once. Knowing every word that appeared at least 2x would get you to 96.5% comprehension.
  • The "word_arf" is the ARF using the same input set of words as "word_freq".
  • The "substrings" are how I try to find words that are missing from my dictionaries. It's an N-gram analysis with a bunch of fudge factors (like an arbitrary list of characters that aren't allowed in an N-gram, like 的, plus a pass that de-duplicates overlapping substrings by only letting each input character count against a single N-gram in the final output). Each line is a raw count, the word, and then it is checked against CEDICT and jieba to see if it exists in those dictionaries. As an example, 中国特色社会主义 was found in CEDICT, but not jieba, so the splitting results (for word frequencies above) would count those as individual words (中国/特色/社会主义). Normally what I do here is look at this list and check the words in context in my source document and decide which ones go in my supplemental dictionary, and then I re-run the analysis to improve the word splitting. For example, 习近平总书记 appears, but I'm happy to learn that as 习近平/总书记, so I would leave that out. Something like 高质量发展 I would Google, usually there's a Baidu page or something, and decide if it's important. This news corpus is an interesting test, because a lot of the high-frequency N-grams are compound words. When you analyze a novel, you always find a lot of proper names. This makes me want to go back and see if word-splitting the N-gram results could automatically figure out things like 游戏/产业.
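To make the "try every split, keep the most probable" idea concrete, here is a Python sketch of Viterbi-style segmentation over a toy frequency dictionary (my actual implementation is in C++ against the Jieba dictionaries; the names and the unseen-character floor here are illustrative):

```python
import math

def viterbi_segment(sentence, word_freq):
    """Split a sentence into the most probable word sequence.

    best[i] holds (log-probability, split point) for the prefix of length i;
    unseen single characters get a tiny floor frequency so the DP can always
    reach the end of the sentence.
    """
    total = sum(word_freq.values())
    max_word_len = max(len(w) for w in word_freq)
    best = [(-math.inf, 0)] * (len(sentence) + 1)
    best[0] = (0.0, 0)
    for end in range(1, len(sentence) + 1):
        for start in range(max(0, end - max_word_len), end):
            word = sentence[start:end]
            freq = word_freq.get(word, 1e-6 if len(word) == 1 else 0)
            if freq <= 0:
                continue
            score = best[start][0] + math.log(freq / total)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers to recover the split
    words, end = [], len(sentence)
    while end > 0:
        start = best[end][1]
        words.append(sentence[start:end])
        end = start
    return list(reversed(words))
```

For example, `viterbi_segment("中国特色社会主义", {"中国": 50, "特色": 30, "社会主义": 20})` produces the 中国/特色/社会主义 split mentioned above.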

[1] Savicky, Petr & Hlaváčová, Jaroslava. (2002). Measures of Word Commonness. Journal of Quantitative Linguistics. 9. 215-231. 10.1076/jqul.9.3.215.14124.
[2] https://github.com/fxsjy/jieba
 

Attachments

Shun

状元
Impressive work, thank you very much for the explanation! I will have a good look at your references. Your techniques really squeeze the most information possible on word commonness out of corpora.
 

Weyland

举人
@Shun Maybe I'm thinking too idealistically, but wouldn't the next step (apart from creating a new post-2020 list) be a data set of the rise and decline in the prevalence of words, like you'd have with Google Search Trends?

[Attached image: a Google Search Trends graph]

I've been using that frequency list for Chengyu, and what I've found is that some of the Chengyu aren't that frequent in daily life; rather, they are political slogans. For example, 胡锦涛 (Hu Jintao) promoted the use of 以人为本 ("put people first"), making it roughly the 10th most used idiom in the list, even though it isn't at all prevalent in daily life.
 

BenJackson

秀才
I've been using that frequency list for Chengyu, and what I've found is that some of the Chengyu aren't that frequent in daily life; rather, they are political slogans.
Every frequency-based list has biases that become apparent once you study it. For example, SUBTLEX is based on subtitles, which sounds like an unbiased source for a spoken Chinese corpus. In fact, you end up with words popular in dramatic TV (murders, courtroom dramas) and reality TV. Here are some words in SUBTLEX-CH-WF starting at position 1259:

Code:
"Total word count: 33,546,516"                       
"Context number: 6,243"                       
Word    WCount    W/million    logW    W-CD    W-CD%    logW-CD
愚蠢    2276       67.85      3.3572   1536    24.6     3.1864
炸弹    2268       67.61      3.3556    665    10.65    2.8228
帅      2268       67.61      3.3556   1374    22.01    3.138
踢      2266       67.55      3.3553   1234    19.77    3.0913
联邦    2261       67.4       3.3543    894    14.32    2.9513
评委    2253       67.16      3.3528    323     5.17    2.5092
客气    2251       67.1       3.3524   1589    25.45    3.2011
造成    2251       67.1       3.3524   1446    23.16    3.1602
The W-CD% column tells you what fraction of the source documents contain the word; this is related to the linguistic measure "dispersion". In this sample you can see that in this region of overall word frequency, words appear in about 25% of the corpus documents. There are some clear outliers, like 评委 (judging panel), which is almost exactly as common as 客气 but appears in only a fifth as many documents (implying that where it does occur, such as in reality TV shows, it is about 5x more common). Similarly, 炸弹 (bomb) is repeated a lot in the 10% of documents where it occurs.

This is why I started looking into things like "Average Reduced Frequency" (see footnote above) which de-rates words which appear a lot in relatively few documents. In the case of SUBTLEX, a good blended measure is to multiply individual word frequency times W-CD% (the percent of documents), which has the effect of significantly reducing the rank of words like 评委 and 淘汰 which are probably from reality shows.
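For anyone who wants to try it, here is my reading of the ARF formula from the Savický & Hlaváčová paper (a sketch, not my production code; `positions` are the token indices at which the word occurs). Under this formula a perfectly evenly spaced word keeps its full raw frequency, and clumping is discounted sharply:

```python
def arf(positions, corpus_len):
    """Average Reduced Frequency for a word occurring at `positions`.

    v = corpus_len / f is the gap a perfectly even spacing would have;
    each actual gap contributes at most v, so clumped occurrences
    are discounted.
    """
    f = len(positions)
    v = corpus_len / f
    pos = sorted(positions)
    gaps = [pos[i] - pos[i - 1] for i in range(1, f)]
    gaps.append(corpus_len - pos[-1] + pos[0])  # cyclic wrap-around gap
    return sum(min(d, v) for d in gaps) / v
```

For example, ten occurrences evenly spaced through a 100-token corpus give `arf(range(0, 100, 10), 100) == 10.0`, while the same ten tokens clumped at the start give only 1.9. The blended SUBTLEX measure mentioned above is even simpler: word frequency times W-CD%.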

A reference on (linguistic) dispersion is: Gries, Stefan. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics. 13. 403-437. 10.1075/ijcl.13.4.02gri.
 

Shun

状元
@Shun Maybe I'm thinking too idealistically, but wouldn't the next step (apart from creating a new post-2020 list) be a data set of the rise and decline in the prevalence of words, like you'd have with Google Search Trends?
@Weyland Do you perhaps know the Google Ngram Viewer? Because every book in the Google Books corpus has a year of publication, it's easy to date each occurrence of a word, so the viewer can graph a word's frequency of use over time, similar to Google Search Trends. In other corpora, it's of course much harder to tag words with the year they were used.
 

Weyland

举人
@Weyland Do you perhaps know the Google Ngram Viewer? Because every book in the Google Books corpus has a year of publication, it's easy to date each occurrence of a word, so the viewer can graph a word's frequency of use over time, similar to Google Search Trends. In other corpora, it's of course much harder to tag words with the year they were used.
I do now, though its Chinese capabilities are lacking: for some reason, it can't detect 无论如何, even though that word, going by the BCC list, is the most common.

[Attached image: Google Ngram Viewer showing no result for 无论如何]

Still, having something like this, with enough material, could show you which words are going out of fashion.
 

BenJackson

秀才
I decided to start scraping the RSS myself because I suspect the initial sample size is just too small. Evidence in favor: The currently available RSS goes back about 3 weeks (although there are a handful of items going back to 2017, probably from abandoned RSS feeds). If I dump a list of "high frequency words I don't know", the top two are 防控 (take defensive measures) and 疫情 (epidemic situation). 疫情 is 26th overall even in the ARF (meaning it is spread among many articles -- 1329 of 5070). It's 11th overall (!!) in the total word frequency. Just above 中国.

Top substrings (not existing words) are 疫情防控 (shocking) and 新型冠状病毒 (this is literally the title of https://www.who.int/zh/emergencies/diseases/novel-coronavirus-2019 ).

So this seems like a great way to get current event vocabulary, but it's going to take a while to settle out.
 

Shun

状元
Very nice, congratulations! I'd love to see the frequency lists once they're all polished. :)
 

Weyland

举人
@Shun @BenJackson So this all probably means that 2020s vocabulary, at least at the start of the decade, is going to be focused on the coronavirus. You could save yourself a lot of time, @Shun, if you just datamine 《末世凡人》 for a word list.
 

Shun

状元
@Weyland That would probably take some OCR to read from the 《末世凡人》漫画 speech bubbles, or did you have a different method in mind? :)
 

Weyland

举人
@Weyland That would probably take some OCR to read from the 《末世凡人》漫画 speech bubbles, or did you have a different method in mind? :)
I was... joking. With the coronavirus dominating the news as it does, you might as well swap the news content out for post-apocalyptic literature and end up with similar results, except perhaps for idiomatic slogans. Also, science-fiction and fantasy stories are probably not a good representation of word frequency, given how much story-specific jargon they use.
 

Shun

状元
True, and a third factor: Reality is almost always more complex than fiction. We're entering the domain of "digital humanities" here. One could, for example, scan a large number of Chinese books available on the Internet and see how much their vocabularies overlap. I think one could even figure out which books are likely to have been written by the same author, and things like that.
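Measuring that vocabulary overlap is a one-liner once each book has been segmented into a word list; a sketch using the Jaccard index (my own toy illustration, assuming segmentation has already been done):

```python
def jaccard_overlap(vocab_a, vocab_b):
    """Shared fraction of two vocabularies: |A ∩ B| / |A ∪ B|."""
    a, b = set(vocab_a), set(vocab_b)
    if not (a or b):
        return 0.0
    return len(a & b) / len(a | b)
```

Books by the same author should score noticeably higher against each other than against the rest of the collection.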
 

BenJackson

秀才
Update on this ongoing experiment:
  • The 人民网 RSS feeds add about 1000 articles/day. There are about 5000 available at any one time. Even though the dates go back quite a ways on some feeds (and most go back 3 weeks), the actual volume of articles is much higher than the static snapshot suggests.
  • I've been gathering articles for about a week and currently have 12853. Extracting just headlines and article bodies, it's about 30x more text than Shun's original (sorry I mis-named it "shen" above :)).
  • If anything, the (linguistic) 疫情 situation is getting worse. That word is literally as common as 是 and 一. It appeared 67489 times in those 12853 articles! I imagine including headlines is biasing this a bit. (As a quick experiment I checked: It appears in 2518 of the 12853 headlines, however outside of headlines it is proportionally even more common, putting it above 了 and just below 和!)
  • There's an interesting quirk where lots of articles start out with a header like 新华社罗马1月17日电(记者陈占杰)(news agency/location/date/"electronic"/author). The effect of putting "电" for electronic right after the date is causing 日电 ("NEC corp") to seem like a hot topic.
I have also tried another experiment with word splitting. The most effective algorithm requires word frequency data, but the best word frequency lists I have contain lots of "non dictionary" words that are not things I'd ever want to study. For example, Jieba has lots of phrases which are number+MW. I have tried to apply the Jieba frequencies to the CEDICT dictionary (arbitrarily assigning CEDICT as an authority of "what's a word"). There are some really interesting subtleties here (because the dictionaries don't match exactly) but as far as producing lists of words to study, I think it's worthwhile.
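The CEDICT filtering step can be sketched like this (both formats are the standard public ones: CC-CEDICT lines look like `繁體 简体 [pin1 yin1] /gloss/`, and Jieba's dictionary lines look like `word freq tag`; this is an illustration, not my actual C++ code):

```python
def load_cedict_words(lines):
    """Collect simplified headwords from CC-CEDICT ('trad simp [pinyin] /defs/')."""
    words = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        parts = line.split(" ", 2)
        if len(parts) >= 2:
            words.add(parts[1])  # second field is the simplified form
    return words

def cedict_filtered_freqs(jieba_lines, cedict_words):
    """Keep Jieba frequencies ('word freq tag') only for CEDICT headwords."""
    freqs = {}
    for line in jieba_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] in cedict_words:
            freqs[fields[0]] = int(fields[1])
    return freqs
```

Anything Jieba knows but CEDICT doesn't (like 一个) simply drops out of the resulting study list.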

This means that the actual words in the attached files are not directly comparable to the ones I provided above. Every word in the files attached to this message should be defined in CEDICT. Otherwise the definitions of the different lists are the same as above (N-gram analysis not provided this time).

Example of what happens in an article with 疫情:
近日,中国美术家协会向全国美术工作者发出倡议,号召中国美术界心手相牵,众志成城,为武汉加油,为湖北加油,全力投入防控疫情的严峻斗争。号召大家拿起手中的画笔,以笔作枪,用美术作品凝聚起人民群众抗击疫情的强大精神力量,与全国人民一起共抗疫情。同时,特邀美术家们为英模们画肖像、画速写,记录这些新时代最美中国人。 “面对疫情,医务人员、疾控工作者、媒体工作者都是勇敢的逆行者,他们或救死扶伤,或记录现场的勇士,是人民健康的守护者,是真相的探寻者,是不忘初心、牢记使命的践行者。他们就是疫情面前的最美中国人!是时代英雄!”全国政协委员、中国美术家协会分党组书记、驻会副主席徐里接受电话采访时表示。 当前新型冠状病毒疫情防控形势依然严峻,各地医护人员、军队医疗队纷纷放弃春节假期,主动请缨,迅速集结,驰援湖北。各大媒体记者也冲向疫情防控的第一线,多方面、多角度为民众报道抗击新型冠状病毒疫情的进展情况。
 

Attachments
