Media-related vocabulary gathering project

Shun

Many thanks, your lists look very clean and realistic! Now if we tagged each occurrence of a word with a date and scraped articles for a year, we could identify "words of the month" or draw interesting frequency graphs. :) (There may even be seasonality in some expressions?)

Thanks also for your example from an article. Of course, an important function of such articles is to boost readers' morale, see them through this dark stretch, and unite them. I see the word 英雄 (hero) is quite frequent, in 1603rd place.
 

BenJackson

I'm actually scraping the data into a database, which is very helpful for de-duping articles. Besides the natural duplication I see due to checking the whole feed each day, I also see them posting the same article on more than one of their sites, and even occasionally on multiple days (sometimes attributed to different authors!). So I do have all of the metadata (like dates), and more analysis will be possible later.
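For the curious, here is a minimal sketch of the de-dup idea, with an illustrative schema and names rather than my actual code: key each article on a hash of its normalized body, so reposts under different URLs or authors collapse into one row.

Code:
import hashlib
import sqlite3

conn = sqlite3.connect("articles.db")  # illustrative file name
conn.execute("""
    CREATE TABLE IF NOT EXISTS articles (
        body_hash TEXT PRIMARY KEY,  -- hash of the normalized body text
        url       TEXT,
        title     TEXT,
        body      TEXT,
        pub_date  TEXT               -- kept so time-series analysis stays possible
    )
""")

def store_article(url, title, body, pub_date):
    # Normalize whitespace before hashing so trivial formatting differences
    # between reposts do not defeat the duplicate check.
    normalized = " ".join(body.split())
    body_hash = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    # INSERT OR IGNORE silently skips rows whose body_hash already exists.
    conn.execute(
        "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?, ?)",
        (body_hash, url, title, body, pub_date),
    )
    conn.commit()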

In fact, pretty soon I will need to get smarter about the analysis step. Right now I just extract all of the titles and article bodies into a giant flat text file and treat it like a giant novel. If I keep going for a year I'm definitely going to have to change to just incrementally analyze new articles and keep partially tabulated data in the DB, which will naturally produce the time series you are thinking of.
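Roughly what I have in mind for the incremental step (again a sketch with illustrative table and column names, not a finished design): fold each new article into per-word, per-day tallies kept in the DB, and the monthly time series then falls out of a simple GROUP BY.

Code:
import sqlite3
from collections import Counter

conn = sqlite3.connect("articles.db")  # illustrative file name
conn.execute("""
    CREATE TABLE IF NOT EXISTS word_stats (
        word      TEXT,
        day       TEXT,
        wcount    INTEGER,   -- total occurrences of the word on that day
        doc_count INTEGER,   -- number of articles containing the word that day
        PRIMARY KEY (word, day)
    )
""")

def tabulate_article(pub_date, words):
    # words = the segmented tokens of one new article (title + body)
    for word, n in Counter(words).items():
        conn.execute(
            """INSERT INTO word_stats (word, day, wcount, doc_count)
               VALUES (?, ?, ?, 1)
               ON CONFLICT(word, day) DO UPDATE SET
                   wcount    = wcount + excluded.wcount,
                   doc_count = doc_count + 1""",
            (word, pub_date, n),
        )
    conn.commit()

A month-by-month frequency graph for any word is then just SELECT substr(day, 1, 7), SUM(wcount) FROM word_stats WHERE word = ? GROUP BY 1.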

Then I start daydreaming about using AWS Lambda to do the scraping and updating, keeping the DB in the cloud, etc. Then I realize why people charge money for pre-existing corpuses and I consider just buying one!

Oh BTW, 英雄 is actually even more common in SUBTLEX, at rank 1079. People love their heroes!
 

Shun

Interesting, nice project design! An incremental DB sounds good. I think whether you should go all-out with AWS depends on whether you see it as enough of a challenge for yourself.

Thanks for your observation on the frequency of 英雄 in SUBTLEX. This gives me the feeling that the Chinese think in terms of heroes more than most Westerners do, perhaps due to their regard for their many legendary figures (such as, more recently, Lei Feng).
 

BenJackson

Your theory has some merit (counts are articles):

[Attached chart: article counts for 英雄]
 

Shun

Thanks for the graph; the theory may indeed have something going for it.
 

BenJackson

In celebration of gathering 100,000 articles, here's an update.

Some notes:
  • We are still in the 疫情 (epidemic) era for sure. But there are hints that it is less pervasive than its raw frequency suggests, since it appears in relatively few documents (still 67%!) compared to neighbors on the list like 有 (90%).
  • I found a bug in my ARF (average reduced frequency) analysis (see above) that inflated the prominence of infrequent words. I would not recommend using the old ARF files as uploaded for any serious analysis, although the high-frequency data is still basically fine. I have not re-run ARF because I think it is a better measure for something like a novel (self-contained, with no clear breakdown into documents), whereas these news articles lend themselves well to other dispersion measures.
  • This still counts titles and body together as one unit.
  • This still uses CC-CEDICT (with frequencies blended from Jieba) as the splitting dictionary (a rough sketch of that step follows these notes).
  • There are some very short articles, including ones with effectively only a title (online these are articles consisting mostly of images or a video). This means the document percentage drops off quite a bit faster than in other corpora.
  • This time I present the data in SUBTLEX-CH format, since I wanted to show overall vs. document frequency without inventing a new file format.
  • This corpus is actually bigger than the SUBTLEX corpus!
  • If you were going to study from a frequency list based on these, I recommend re-sorting by the product WCount * W-CD, or even just by W-CD, as a better way to broaden your vocabulary than straight frequency (see the re-sorting sketch below).
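
To give an idea of the splitting step mentioned above, here is a simplified sketch using jieba's standard API; the dictionary file name is hypothetical and stands in for a CC-CEDICT word list merged with jieba's frequency data.

Code:
import jieba

# Hypothetical file: one "word frequency" entry per line, built by merging
# CC-CEDICT headwords with frequencies taken from jieba's own dictionary.
jieba.set_dictionary("cedict_with_jieba_freqs.txt")

def split_words(text):
    # Default dictionary-based segmentation; drop whitespace-only tokens.
    return [w for w in jieba.cut(text) if w.strip()]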
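And a sketch of the re-sort suggested in the last note, assuming the data is read as a tab-separated table with the SUBTLEX-CH column names used above (the file name is just illustrative; adjust it and the column names to whatever is in the zip).

Code:
import csv

with open("renminwang_frequencies.txt", encoding="utf-8") as f:  # illustrative name
    rows = list(csv.DictReader(f, delimiter="\t"))

# Weight raw frequency by the number of documents a word appears in, so words
# that are common overall but concentrated in a few articles sink in the list.
rows.sort(key=lambda r: int(r["WCount"]) * int(r["W-CD"]), reverse=True)

for r in rows[:20]:
    print(r["Word"], r["WCount"], r["W-CD"])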
 

Attachments

  • RENMINWANG.zip
    723.7 KB

Shun

Hi BenJackson,

Thank you, and congratulations! So if you started gathering 人民网 (People's Daily Online) articles about 120 days ago, that would mean they publish roughly 800 new articles per day (100,000 ÷ 120 ≈ 830), including very short articles. That's quite a lot.

Cheers, Shun
 

BenJackson

I wanted to update at 250,000 articles, but I got busy and let it go a few thousand over. Here's updated data through this morning.

COVID-related words are dropping off a bit now. For example, here is a plot of the number of articles per day containing 疫情:

疫情 articles per day.png


Note that the leftmost edge is not reliable because that's when I started gathering, and near the right edge you can see the anomaly caused by Chinese New Year.
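
For reference, a plot like this comes straight out of the article database; here is a sketch of the query, with illustrative table and column names (it matches only the body, whereas my real counts include titles too).

Code:
import sqlite3

conn = sqlite3.connect("articles.db")  # illustrative file name
rows = conn.execute(
    """SELECT pub_date, COUNT(*)
       FROM articles
       WHERE body LIKE ?
       GROUP BY pub_date
       ORDER BY pub_date""",
    ("%疫情%",),
).fetchall()

for day, n in rows:
    print(day, n)  # one (date, article count) pair per day, ready to plot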

Maybe in the future I will run without most or all of the 2020 data.
 

Attachments

  • RENMINWANG.zip
    1 MB