Media-related vocabulary gathering project

Shun

状元
Many thanks, your lists look very clean and realistic! Now if we tagged each occurrence of a word with a date and scraped articles for a year, we could find some "words of the month" for each month or draw interesting frequency graphs. :) (There may even be seasonality in some expressions?)

Thanks also for your example from an article. Of course, an important function of such articles is to boost readers' morale, see them through this dark stretch, and unite them. I see the word 英雄 (hero) is quite frequent, in 1603rd place.
 

BenJackson

举人
I'm actually scraping the data into a database, which is very helpful for de-duping articles. Besides the natural duplication I see due to checking the whole feed each day, I also see them posting the same article on more than one of their sites, and even occasionally on multiple days (sometimes attributed to different authors!). So I do have all of the metadata (like dates) so more analysis will be possible later.

In fact, pretty soon I will need to get smarter about the analysis step. Right now I just extract all of the titles and article bodies into a giant flat text file and treat it like one giant novel. If I keep going for a year, I'm definitely going to have to switch to incrementally analyzing new articles and keeping partially tabulated data in the DB, which will naturally produce the time series you are thinking of.
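The incremental version might keep running tallies keyed by word and by (word, month), so the monthly time series falls out for free. A sketch, with in-memory Counters standing in for DB tables:

```python
from collections import Counter
from datetime import date

# Running tallies: total count per word, and count per (word, month).
# In a real setup these would be tables in the database, updated as
# each newly scraped article arrives.
totals = Counter()
by_month = Counter()

def add_article(words, published: date):
    """Fold one new article's word list into the running tallies."""
    month = published.strftime("%Y-%m")
    for w in words:
        totals[w] += 1
        by_month[(w, month)] += 1

add_article(["疫情", "防控", "疫情"], date(2020, 3, 1))
add_article(["疫情", "英雄"], date(2020, 4, 15))
```

Querying `by_month` for one word across months is then exactly the "words of the month" / seasonality graph discussed above.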

Then I start daydreaming about using AWS Lambda to do the scraping and updating, keeping the DB in the cloud, etc. Then I realize why people charge money for pre-existing corpuses and I consider just buying one!

Oh BTW, 英雄 is actually even more common in SUBTLEX at 1079. People love their heroes!
 

Shun

状元
Interesting, nice project design! An incremental DB sounds good. I think whether you should go all-out with AWS depends on whether you see it as enough of a challenge for yourself.

Thanks for your observation on the frequency of 英雄 in SUBTLEX. This gives me the feeling that the Chinese think in terms of heroes more than most Westerners do, perhaps owing to their regard for their many legendary figures (such as, more recently, Lei Feng).
 

BenJackson

举人
Your theory has some merit (the counts are numbers of articles):

1582336249940.png
 

Shun

状元
Thanks for the graph; the theory may indeed have something going for it.
 

BenJackson

举人
In celebration of gathering 100,000 articles, here's an update.

Some notes:
  • We are still in the 疫情 ("epidemic situation") era for sure. But there are hints that it is less widespread than its raw frequency suggests, since it appears in relatively few documents (still 67%!) compared to neighbors like 有 (90%).
  • I found a bug in my ARF analysis (see above) that inflated the prominence of infrequent words. I would not recommend using the previously uploaded ARF files for any serious analysis, although the high-frequency data is still basically fine. I have not re-run ARF because I think it's a better measure for something like a novel (self-contained, with no clear breakdown into documents) than for these news articles, to which other dispersion measures apply very well.
  • This still counts titles and body together as one unit.
  • This still uses CC-CEDICT (with frequencies blended from Jieba) as the splitting dictionary.
  • There are some very short articles, including ones with effectively only a title (online these are articles with mostly images or a video). This means the document percentage drops off quite a bit faster than other corpuses.
  • This time I present the data in SUBTLEX-CH format, since I wanted to present overall vs document frequency without inventing a new file format.
  • This corpus is actually bigger than the SUBTLEX corpus!
  • If you were going to study from a frequency list based on these, I recommend re-sorting by the product WCount * W-CD, or even just by W-CD, as better ways to broaden your vocabulary than sorting by raw frequency.
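Assuming a SUBTLEX-CH-style table with WCount (total occurrences) and W-CD (number of documents containing the word) columns, that re-sort is a one-liner. A sketch (the row format here is made up for illustration):

```python
def resort(rows):
    """Sort frequency rows by WCount * W-CD, highest first.

    Each row is a dict with 'Word', 'WCount' (total occurrences),
    and 'W-CD' (number of documents the word appears in).
    """
    return sorted(rows,
                  key=lambda r: int(r["WCount"]) * int(r["W-CD"]),
                  reverse=True)

# A word that is frequent AND widespread outranks one that is merely frequent.
rows = [
    {"Word": "疫情", "WCount": "90000", "W-CD": "67000"},
    {"Word": "有",   "WCount": "80000", "W-CD": "90000"},
]
print([r["Word"] for r in resort(rows)])  # → ['有', '疫情']
```

The product rewards words that are both common and evenly dispersed, which is exactly the "broaden your vocabulary" goal; sorting by W-CD alone weights dispersion even harder.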
 

Attachments

  • RENMINWANG.zip
    723.7 KB · Views: 608

Shun

状元
Hi BenJackson,

Thank you, and congratulations! So if you started gathering 人民网 articles about 120 days ago, that would mean they published about 800 new articles per day, including very short ones. That's quite a lot.

Cheers, Shun
 

BenJackson

举人
Wanted to update at 250,000 articles, but I got busy and let it go a few thousand over. Here's updated data through this morning.

COVID-related words are dropping off a bit now. For example, here's a plot of the number of articles per day mentioning 疫情:

疫情 articles per day.png


Note that the leftmost edge is not reliable because that's when I started gathering, and near the right edge you can see the anomaly caused by Chinese New Year.

Maybe in the future I will run without most or all of the 2020 data.
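A plot like the one above boils down to counting, per day, how many articles mention the term at least once. A toy sketch (in a real run the rows would come out of the database):

```python
from collections import Counter

# Toy (date, body) pairs standing in for database rows.
articles = [
    ("2020-02-01", "疫情 防控 措施"),
    ("2020-02-01", "经济 发展"),
    ("2020-02-02", "疫情 最新 通报"),
]

def articles_per_day(term, rows):
    """Count, per day, how many articles mention `term` at least once."""
    counts = Counter()
    for day, body in rows:
        if term in body:
            counts[day] += 1
    return dict(sorted(counts.items()))

print(articles_per_day("疫情", articles))
# → {'2020-02-01': 1, '2020-02-02': 1}
```

Counting documents rather than occurrences is what makes the curve comparable to the W-CD (document dispersion) numbers in the frequency lists.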
 

Attachments

  • RENMINWANG.zip
    1 MB · Views: 545

richardpohl

Member
Hello there,

I just found this wonderful project by chance. Is it discontinued now? I see the last update is from 2021. I wish to see data from more recent (post-COVID) years. In any case, thanks for the very interesting frequency lists!

Richard
 

Shun

状元
Hello Richard,

You're welcome! That was just some experimentation on my side. I see that 人民日报's RSS feeds are still being updated today, so it would be possible to generate newer frequency lists. However, you would probably prefer lists made with @BenJackson's much more sophisticated program over ones made with my Python version.

Shun
 

BenJackson

举人
Hello there,

I just found this wonderful project by chance. Is it discontinued now? I see the last update is from 2021. I wish to see data from more recent (post-COVID) years. In any case, thanks for the very interesting frequency lists!

Richard
The job has been dutifully pulling from the 人民网 RSS feed the whole time. Up to 756,586 articles now. I will try to remember to update the frequency list this weekend. If there's a clear 疫情 remission I will make a frequency list that slices off the early years.
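Slicing off the early years can be just a date filter when pulling article bodies out for analysis. A sketch (the table layout is hypothetical; ISO date strings compare correctly as text):

```python
import sqlite3

# Hypothetical minimal table: publication date plus body text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (published TEXT, body TEXT)")
conn.executemany("INSERT INTO articles VALUES (?, ?)", [
    ("2020-05-01", "疫情 防控"),
    ("2023-02-10", "发展 互动"),
])

# Only feed post-cutoff articles into the frequency analysis.
cutoff = "2023-01-01"
recent = conn.execute(
    "SELECT body FROM articles WHERE published >= ?", (cutoff,)).fetchall()

print([b for (b,) in recent])  # → ['发展 互动']
```

Everything downstream (segmentation, tallying) runs unchanged on the filtered set, which is how a "since 2023" list can be produced alongside the full-corpus one.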
 

BenJackson

举人
Ok, the official number of articles is up to 759,632. The database is 4,015,181,824 bytes (about 4 GB)!

Here's an updated graph of 疫情 articles/day:


chart.png

Also attached are updated frequency lists for the entire corpus and for the corpus since 2023-01-01. In the latter, the dreaded 疫情 has dropped into the 800s! Still fun to see other things, like 发展 remaining high (at number 6).
 

Attachments

  • RENMINWANG-ALL.zip
    1 MB · Views: 10
  • RENMINWANG-SINCE-2023.zip
    828.8 KB · Views: 10

Shun

状元
Congratulations! It's a little treasure trove of useful everyday media vocabulary, such as 互动、防范、配套、采购、枢纽、流域.

I'm just wondering, how long does it take your efficient C++ program to churn through all 4 gigabytes of text data to generate the RENMINWANG-ALL frequency lists? Maybe half a minute?

Thanks again!

Shun
 

BenJackson

举人
I'm just wondering, how long does it take your efficient C++ program to churn through all 4 gigabytes of text data to generate the RENMINWANG-ALL frequency lists? Maybe half a minute?
About 14 minutes, but it's entirely single-threaded. When I wrote it, and the data was about a third of the size, 5 minutes seemed fast enough that I didn't bother to parallelize it.
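If it ever feels too slow, the counting step parallelizes naturally: split the articles into chunks, tally each chunk in a worker, and merge the tallies. A sketch (toy whitespace tokenizer standing in for real dictionary-based segmentation):

```python
from collections import Counter
from multiprocessing import Pool

def count_chunk(texts):
    """Tally words in one chunk of articles (toy whitespace tokenizer)."""
    counts = Counter()
    for t in texts:
        counts.update(t.split())
    return counts

def parallel_counts(all_texts, workers=4):
    """Round-robin the corpus into chunks, count in parallel, merge."""
    chunks = [all_texts[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        parts = pool.map(count_chunk, chunks)
    total = Counter()
    for part in parts:
        total.update(part)
    return total

if __name__ == "__main__":
    texts = ["疫情 防控", "疫情 发展", "发展 互动"]
    print(parallel_counts(texts, workers=2))
```

Word counting is embarrassingly parallel because Counter merges are associative; the merge at the end costs only one pass over the vocabularies.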
 