Media-related vocabulary gathering project

Shun

状元
Many thanks, your lists look very clean and realistic! Now if we tagged each occurrence of a word with a date and scraped articles for a year, we could find some "words of the month" for each month or draw interesting frequency graphs. :) (There may even be a seasonality for some expressions?)

Thanks also for your example from an article. Of course, an important function of such articles is to boost readers' morale, see them through this dark stretch, and unite them. I see the word 英雄 (hero) is quite frequent, in 1603rd place.
 

BenJackson

举人
I'm actually scraping the data into a database, which is very helpful for de-duping articles. Besides the natural duplication I see due to checking the whole feed each day, I also see them posting the same article on more than one of their sites, and even occasionally on multiple days (sometimes attributed to different authors!). So I do have all of the metadata (like dates) so more analysis will be possible later.
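As a rough illustration of the de-duping idea (not the actual scraper): the sketch below assumes SQLite and uses a hash of the whitespace-stripped body as the duplicate key, so the same article posted on another site or under another author name is stored only once. The table layout and the names `body_key`, `open_db` and `add_article` are made up for this example.

```python
import hashlib
import sqlite3


def body_key(body: str) -> str:
    """Hash of the body with all whitespace removed, used as the duplicate key."""
    normalized = "".join(body.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) articles table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               body_hash TEXT PRIMARY KEY,
               url TEXT, author TEXT, published TEXT, title TEXT, body TEXT)"""
    )
    return conn


def add_article(conn, url, author, published, title, body) -> bool:
    """Insert an article; return False if the same body was already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?, ?, ?)",
        (body_key(body), url, author, published, title, body),
    )
    conn.commit()
    return cur.rowcount == 1
```

Because the key depends only on the body, re-checking the whole feed every day is harmless: repeated inserts are simply ignored, while the metadata of the first sighting (including the date) is kept.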

In fact, pretty soon I will need to get smarter about the analysis step. Right now I just extract all of the titles and article bodies into a giant flat text file and treat it like a giant novel. If I keep going for a year I'm definitely going to have to change to just incrementally analyze new articles and keep partially tabulated data in the DB, which will naturally produce the time series you are thinking of.
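One way the incremental step could look, as a sketch only: each new article's word counts are folded into a running per-month table, so the time series falls out of a simple query. This assumes the text has already been split into words; the table `monthly_counts` and the function names are hypothetical.

```python
import sqlite3
from collections import Counter


def open_counts(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) running-totals table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS monthly_counts (
               month TEXT, word TEXT, count INTEGER NOT NULL DEFAULT 0,
               PRIMARY KEY (month, word))"""
    )
    return conn


def add_article_counts(conn, published: str, words) -> None:
    """Fold one new article's word counts into the running monthly totals."""
    month = published[:7]  # "2020-02-20" -> "2020-02"
    for word, n in Counter(words).items():
        # Create the row if it is missing, then add this article's count.
        conn.execute(
            "INSERT OR IGNORE INTO monthly_counts (month, word) VALUES (?, ?)",
            (month, word),
        )
        conn.execute(
            "UPDATE monthly_counts SET count = count + ? WHERE month = ? AND word = ?",
            (n, month, word),
        )
    conn.commit()


def series(conn, word: str):
    """Return [(month, count), ...] for one word, oldest month first."""
    return conn.execute(
        "SELECT month, count FROM monthly_counts WHERE word = ? ORDER BY month",
        (word,),
    ).fetchall()
```

With this layout only the new articles need to be analyzed on each run; the existing totals are never recomputed.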

Then I start daydreaming about using AWS Lambda to do the scraping and updating, keeping the DB in the cloud, etc. Then I realize why people charge money for pre-existing corpuses and I consider just buying one!

Oh BTW, 英雄 is actually even more common in SUBTLEX at 1079. People love their heroes!
 

Shun

状元
Interesting, nice project design! An incremental DB sounds good. I think whether you should go all-out with AWS depends on if you see it as enough of a challenge for yourself.

Thanks for your observation on the frequency of 英雄 in SUBTLEX. This gives me the feeling that the Chinese think in terms of heroes more than most Westerners do, perhaps due to their regard for their many legends (such as, more recently, Lei Feng).
 

BenJackson

举人
Your theory has some merit (counts are articles):

1582336249940.png
 

Shun

状元
Thanks for the graph; the theory may indeed have something going for it.
 

BenJackson

举人
In celebration of gathering 100,000 articles, here's an update.

Some notes:
  • We are still in the 疫情 era for sure. But there are hints that it is less general than its frequency suggests: it appears in 67% of documents (still a lot!), compared to 90% for frequency neighbors like 有.
  • I found a bug in my ARF analysis (see above) which increased the prominence of infrequent words. I would not recommend using the old ARF files as uploaded for any serious analysis, although the high-frequency data is still basically fine. I have not re-run using ARF because I think it's a better measure for something like a novel (self contained, and without clear breakdown into documents) as opposed to these news articles, to which other dispersion methods apply very well.
  • This still counts titles and body together as one unit.
  • This still uses CC-CEDICT (with frequencies blended from Jieba) as the splitting dictionary.
  • There are some very short articles, including ones with effectively only a title (online these are articles with mostly images or a video). This means the document percentage drops off quite a bit faster than in other corpora.
  • This time I present the data in SUBTLEX-CH format, since I wanted to present overall vs document frequency without inventing a new file format.
  • This corpus is actually bigger than the SUBTLEX corpus!
  • If you were going to study from a frequency list based on these, I recommend re-sorting by the product of "WCount * W-CD" or even just "W-CD" as better ways to broaden your vocabulary rather than straight frequency.
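
To make the re-sorting suggestion concrete, here is a small sketch that computes the two columns from scratch and sorts by their product. It assumes each document is already a list of words; "WCount" is taken to be the total number of occurrences and "W-CD" the number of documents containing the word. The function names are invented for the example.

```python
from collections import Counter


def tabulate(docs):
    """docs: iterable of word lists. Returns (WCount, W-CD) as Counters."""
    wcount, wcd = Counter(), Counter()
    for words in docs:
        wcount.update(words)    # every occurrence counts
        wcd.update(set(words))  # at most once per document
    return wcount, wcd


def rank_by_product(wcount, wcd):
    """Words sorted by WCount * W-CD, highest first (ties broken by the word itself)."""
    return sorted(wcount, key=lambda w: (-wcount[w] * wcd[w], w))


# Toy corpus: 疫情 is frequent but concentrated in one document,
# 发展 is slightly rarer but spread over three documents.
docs = [["疫情"] * 4, ["发展"], ["发展"], ["发展"]]
wcount, wcd = tabulate(docs)
by_frequency = [w for w, _ in wcount.most_common()]
by_product = rank_by_product(wcount, wcd)
```

In the toy corpus, straight frequency puts 疫情 first, while the product puts 发展 first, which is the kind of shift toward widely used words the re-sorting is meant to produce.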
 

Attachments

  • RENMINWANG.zip
    723.7 KB · Views: 651

Shun

状元
Hi BenJackson,

thank you, congratulations! So if you started the gathering of 人民网 articles about 120 days ago, that would mean they published about 800 new articles per day, including very short articles. That's quite a lot.

Cheers, Shun
 

BenJackson

举人
Wanted to update at 250,000 articles, but I got busy and let it go a few thousand over. Here's the updated data through this morning.

COVID-related words are dropping off a bit now. For example, here is a plot of the number of articles per day that include 疫情:

疫情 articles per day.png


Note that the leftmost edge is not reliable because it's when I started gathering, and on the right edge (near the end) you can see the anomaly caused by Chinese New Year.
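
A count like the one plotted can be sketched in a few lines. This version is an assumption-laden illustration, not the code behind the plot: it takes (ISO date, text) pairs, uses a plain substring match instead of word splitting, and also records the total number of articles per day. Dividing by that total is one possible way to make dips in publishing volume (such as a holiday) less prominent; whether it would remove the anomaly in this plot is untested.

```python
from collections import Counter


def term_articles_per_day(articles, term):
    """articles: iterable of (iso_date, text) pairs.

    Returns [(day, articles_containing_term, total_articles), ...] sorted by day.
    """
    total, hits = Counter(), Counter()
    for day, text in articles:
        total[day] += 1
        if term in text:  # plain substring match, no word segmentation
            hits[day] += 1
    return [(day, hits[day], total[day]) for day in sorted(total)]


# Made-up sample data for illustration only.
sample = [
    ("2021-02-10", "疫情防控"),
    ("2021-02-10", "经济发展"),
    ("2021-02-12", "春节"),
    ("2021-02-10", "疫情"),
]
rows = term_articles_per_day(sample, "疫情")
```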

Maybe in the future I will run without most or all of the 2020 data.
 

Attachments

  • RENMINWANG.zip
    1 MB · Views: 588

richardpohl

Member
Hello there,

I just found this wonderful project by chance. Is it discontinued now? I see the last update is from 2021. I would like to see data from more recent (post-COVID) years. In any case, thanks for the very interesting frequency lists!

Richard
 

Shun

状元
Hello Richard,

you're welcome! That was just some experimentation (on my side). I see that 人民日报's RSS feeds are still updated today, so it would be possible to generate newer frequency lists. However, you would probably prefer lists made with @BenJackson's much more sophisticated program to ones made using my Python version.

Shun
 

BenJackson

举人
The job has been dutifully pulling from the 人民网 RSS feed the whole time. Up to 756586 articles now. I will try to remember to update the frequency list this weekend. If there's a clear 疫情 remission I will make a frequency list that slices off the early years.
 

BenJackson

举人
Ok, the official number of articles is up to 759632. The database is 4,015,181,824 bytes!

Here's an updated graph of 疫情 articles/day:


chart.png

Also attached are updated frequency lists for the entire corpus and for the corpus since 2023-01-01. In the latter, the dreaded 疫情 has dropped into the 800s! Still fun to see other things, like 发展 remaining high (at number 6).
 

Attachments

  • RENMINWANG-ALL.zip
    1 MB · Views: 77
  • RENMINWANG-SINCE-2023.zip
    828.8 KB · Views: 62

Shun

状元
Congratulations! It's a little treasure trove of useful everyday media vocabulary, such as 互动、防范、配套、采购、枢纽、流域.

I'm just wondering, how long does it take your efficient C++ program to churn through all 4 gigabytes of text data to generate the RENMINWANG-ALL frequency lists? Maybe half a minute?

Thanks again!

Shun
 

BenJackson

举人
About 14 minutes, but it's entirely single-threaded. When I wrote it, the data was about a third of the size, and 5 minutes seemed fast enough that I didn't bother to parallelize it.
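
For anyone curious what parallelizing a job like this might involve, here is a generic sketch in Python (not the C++ program, and with no claim about the speedup it would give): the documents are split into slices, each slice is counted independently, and the partial counts are merged. It assumes documents are already lists of words; all function names are made up.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor


def count_chunk(docs):
    """Count word occurrences in one slice of the corpus."""
    counts = Counter()
    for words in docs:
        counts.update(words)
    return counts


def chunked(items, n):
    """Split a list into at most n consecutive slices of roughly equal size."""
    size = max(1, -(-len(items) // n))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def merge(partials):
    """Add up the partial counts from all slices."""
    total = Counter()
    for part in partials:
        total.update(part)
    return total


def parallel_count(docs, workers=4):
    """Count each slice in its own process, then merge the results."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return merge(pool.map(count_chunk, chunked(docs, workers)))
```

Plain word counts merge cleanly because addition does not care about order. When running this as a script, `parallel_count` should be called under an `if __name__ == "__main__":` guard so that worker processes can start safely on platforms that spawn rather than fork.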
 