Shun
状元
Dear @BenJackson, dear @JD & dear all,
the acquisition of media-related vocabulary from newspapers or television news programmes usually comes rather late in one's Chinese learning career, and most learning materials I've seen so far don't cover it. This can be frustrating, especially for more advanced learners, who until now have had no easy way to focus on media-specific vocabulary. @BenJackson and others have pointed this out and tried to remedy it by gathering vocabulary from media corpus frequency lists.
One problem that @BenJackson ran into with the BLCU media corpus (see thread) is that its sources reach back to 1946, so it contains many vocabulary items that have since lost currency. Using a small corpus of 650 articles from People's Daily, downloaded with a Python script, I hope to start providing a more up-to-date frequency list of media-related vocabulary.
The frequency list has the following features:
- It uses all sections of the 人民日报 / People's Daily newspaper, including the sports section.
- All articles in their RSS feeds from the 12th to the 15th of January 2020 are included. I could run the script every two days and collect articles over a longer period to obtain more data.
- I provide two frequency lists:
  - One list ("peoples_daily_bcc_freqlist.txt") only contains expressions that also appear in the BCC corpus frequency list. This list should only contain lexical expressions.
  - The other list ("peoples_daily_non_bcc_freqlist.txt") only contains expressions that do not appear in the BCC corpus frequency list; these were found using an N-gram search algorithm. It therefore includes not only single expressions but also common word combinations. Filtering these out would take significant manual work, but they can be a valuable resource in themselves for practising speaking and writing, as they are common building blocks of sentences.
- Of course, most vocabulary items in the lists are not media-specific. I would assume that vocabulary with a frequency between 20 and 200 contains the most useful "gems". I suggest that learners skim the list for words that they don't yet know and that seem likely to appear in the media.
- The non-BCC list includes expressions up to 12 characters in length.
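For anyone curious how lists like these can be produced, here is a rough sketch of the general idea: count character N-grams, split them by whether they occur in a reference word list, and pull out a frequency band. This is not the actual script used for the attached lists. The function names are made up for illustration, the reference list is assumed to be available as a plain Python set of words, and the 12-character limit and the 20 to 200 band are simply taken from the description above.

```python
from collections import Counter

MAX_LEN = 12  # assumption: matches the 12-character limit mentioned above


def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def cjk_runs(text):
    """Yield maximal runs of Chinese characters, so N-grams never span punctuation."""
    run = []
    for ch in text:
        if is_cjk(ch):
            run.append(ch)
        elif run:
            yield "".join(run)
            run = []
    if run:
        yield "".join(run)


def count_ngrams(texts, min_len=2, max_len=MAX_LEN):
    """Count every character N-gram of length min_len..max_len in the texts."""
    counts = Counter()
    for text in texts:
        for run in cjk_runs(text):
            for n in range(min_len, min(max_len, len(run)) + 1):
                for i in range(len(run) - n + 1):
                    counts[run[i:i + n]] += 1
    return counts


def split_by_reference(counts, reference_words):
    """Split counts into expressions found / not found in a reference word set."""
    in_ref = {w: c for w, c in counts.items() if w in reference_words}
    not_in_ref = {w: c for w, c in counts.items() if w not in reference_words}
    return in_ref, not_in_ref


def frequency_band(counts, low=20, high=200):
    """Return (expression, count) pairs with low <= count <= high, most frequent first."""
    items = [(w, c) for w, c in counts.items() if low <= c <= high]
    return sorted(items, key=lambda wc: (-wc[1], wc[0]))
```

On a real corpus one would call count_ngrams on the article texts, pass the result and the reference word set to split_by_reference, and then skim frequency_band of either half. Note that raw N-gram counting also produces fragments that cut across word boundaries, which is one reason a list of expressions outside the reference list needs manual skimming.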
Enjoy the lists,
Shun