Common Idioms; A Collection by Grade [HSK / old HSK / 中考 / 高考 / ...]

#1
Pleco's addition of the Duogongneng Chengyu Cidian⁻¹⁻ brings an overwhelming amount of material: (over?) 8000 idioms with their descriptions, origins, synonyms, antonyms, example sentences, and more. But if you've ever looked for a Chinese idiom that resembles one of our Western ones, or just felt like trying one out, you might have found that general knowledge of the eight-thousand-and-some idioms is, to say the least, spotty.

So, if you feel like updating your own language arsenal with some shiny idioms, but you don't know which ones are known amongst most Chinese, then fear not; I've created a flashcard list of 605 idioms that people, if they passed the Gaokao, should be familiar with.

HSK idioms⁻²⁻ - 112 cards
pre-2010 HSK idioms⁻³⁻ - 92 cards
Zhongkao idioms - 156 cards
Gaokao idioms - 245 cards


In the past I used "A Concise Dictionary of Chinese Idioms" by Sinolingua to find some apt idioms, but this concise dictionary (with its 2000+ entries) goes beyond what the average Chinese speaker knows. It does come with a nice collection of example sentences, including conversational ones where applicable.

I will update these flashcards later, as I feel that these 605 flashcards don't reflect the most common idioms, but rather what is taught by Chinese schools after middle school (+ofc HSK). So, if someone has a better list to reorganize/add to these, please do tell, because it's kind of odd that idioms like 人山人海 and 入乡随俗 aren't included in these sets (too basic? probably).


-1- The dictionary is available for purchase for 20 US dollars.
-2- If you're using the HSK6 materials by BCLUP then check out this Flashcard collection
-3- Apart from the idioms, the old HSK included an extra 4100 words; here they are in Flashcard format.
 


#2
Hi Weyland,

thanks for the great idea and work! My immediate reaction was to use the BCC corpus frequency list (see this post: https://plecoforums.com/threads/wor...haracter-corpus-bcc-blcu-chinese-corpus.5859/) to sort your idioms by frequency in general usage.

In the file "Weyland idioms with frequencies.txt" I've simply added the number of times each idiom was used in the BCC corpus. In the file "Weyland idioms with frequencies sorted.txt", I've sorted them by frequency, starting with the most common one. I've also attached the Python script I did it with.
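For anyone who wants to reproduce this step without the attachment, here is a minimal sketch of the idea (not the attached script itself). It assumes the frequency list has one `word<TAB>count` entry per line; the function names and the counts in the usage example are made up for illustration.

```python
def load_frequencies(lines):
    """Parse 'word<TAB>count' lines into a dict; malformed lines are skipped.

    The tab-separated layout is an assumption about the frequency file.
    """
    freqs = {}
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        freqs[parts[0]] = int(parts[1])
    return freqs


def annotate(idioms, freqs):
    """Pair each idiom with its corpus count (0 if it is not in the list)."""
    return [(idiom, freqs.get(idiom, 0)) for idiom in idioms]


def sort_by_frequency(annotated):
    """Return the pairs ordered from most to least frequent."""
    return sorted(annotated, key=lambda pair: pair[1], reverse=True)


# Usage with invented counts (not real BCC figures):
sample_lines = ["人山人海\t5000", "入乡随俗\t1200", "not a valid line"]
freqs = load_frequencies(sample_lines)
ranked = sort_by_frequency(annotate(["入乡随俗", "人山人海", "不在表中"], freqs))
for idiom, count in ranked:
    print(f"{idiom}\t{count}")
```

In practice the lines would come from the frequency file opened with `encoding="utf-8"`, and the idioms from the exported flashcard list.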

Did you enter all the 605 idioms in your list by hand, or did you have another source? It may well be that 入乡随俗 is too basic, or maybe it's just a whim of theirs not to include it. :)

Cheers,

Shun
 


#3
The Gaokao and Zhongkao collections are based on test-preparation apps for the tests of the same names. The apps just said 常见成语 (common idioms), so I didn't question it.

But, now that I think about it: there is probably an agenda. The language they teach will shape the language that's spoken in the future.

With the usage counts of these idioms ranging from the hundreds to the hundreds of thousands... I think we need a better frequency-based list for idioms.

@Shun would you be willing to write a script to generate some frequency based idiom lists based on the Duogongneng Chengyu Cidian?

I was going to create a similar list for 熟语, but I don't think Pleco differentiates between an idiom(atic phrase) from ancient Chinese like 《史记》 or a passage from 《北京青年报》two decades ago.
 
#4
@Shun would you be willing to write a script to generate some frequency based idiom lists based on the Duogongneng Chengyu Cidian?
Unfortunately, the dictionary entries from the Duogongneng Chengyu Cidian, or any other Pleco dictionary, can't be exported as a list for licensing reasons (you could easily import them as flashcards and then export them), so I cannot grade these by frequency. Just joking: If you wish, you could copy the entire 8000 definitions into a list (from the printed Duogongneng Chengyu Cidian), then I could do it. Compared to studying and using them all, that would probably still be only 5% of the work. :)

With the usage counts of these idioms ranging from the hundreds to the hundreds of thousands... I think we need a better frequency-based list for idioms.
I tend to disagree: why should it be different for idioms, when word frequencies can range from the 10s to many billions? Some of the idioms may simply have fallen out of use. The BCC corpus should reflect present-day usage quite well.

I was going to create a similar list for 熟语, but I don't think Pleco differentiates between an idiom(atic phrase) from ancient Chinese like 《史记》 or a passage from 《北京青年报》two decades ago.
Can you explain your thoughts on this in a bit more detail? It sounds interesting. :)
 
#6
Hi Peter,

great work! I've noticed long ago on the forums that you must be a great programmer. :) Did you import the entire global BCC list as flashcards using only the DGNCYCD, skipping those entries that don't exist in the DGNCYCD, and then re-export it without the definitions? If so, Weyland, I'm sorry about not thinking of that option when you asked.

Cheers,

Shun
 
#7
Can you explain your thoughts on this in a bit more detail? It sounds interesting.
While Pleco does somewhat differentiate between 成语 and 熟语 by having the tags "IDIOM", "WELL-KNOWN PHRASE", "COMMON PHRASE", "COLLOQUIAL", "FIGURATIVE COLLOQUIAL", the system doesn't lend itself to much clarity.


Try this. Taken from `blog_lit_news_tech_weibo_freq.release_sorted.txt`
I've taken the liberty of organizing the list into sets of 200. That way, people can tag the flashcard categories and see whether the idiom they've just looked up is worth memorizing.
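Splitting a frequency-sorted list into sets of 200 could be done along these lines. This is only a sketch: the function names and the label format are hypothetical, and the labels are simply one way to name the flashcard categories.

```python
def rank_buckets(sorted_idioms, size=200):
    """Split an already frequency-sorted list into consecutive sets of `size`."""
    return [sorted_idioms[i:i + size] for i in range(0, len(sorted_idioms), size)]


def bucket_label(index, size=200):
    """Build a category label such as '0001-0200' for the bucket at `index`."""
    start = index * size + 1
    end = (index + 1) * size
    return f"{start:04d}-{end:04d}"


# Usage with placeholder entries standing in for the sorted idioms:
placeholder = [f"idiom{n}" for n in range(1, 451)]
for index, bucket in enumerate(rank_buckets(placeholder)):
    print(bucket_label(index), len(bucket))
```

The last set may hold fewer than 200 entries, so its label names the nominal range rather than the actual count.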

On further inspection, though, the list does lack idioms/phrases that contain a comma but which are covered in the 多功能成语词典, e.g.:

"养兵千日,用兵一时"
"过五关,斩六将"

Also, does the BCC corpus take sources such as novels or casual conversations into account? An idiom like "有眼不识泰山" is rather common, yet it is only #4438 on the list. Though it might just be that I have too big an obsession with 客套话, or rather self-deprecating language.
 


#8
While Pleco does somewhat differentiate between 成语 and 熟语 by having the tags "IDIOM", "WELL-KNOWN PHRASE", "COMMON PHRASE", "COLLOQUIAL", "FIGURATIVE COLLOQUIAL", the system doesn't lend itself to much clarity.
Pleco displays the data it gets from the dictionaries, so the quality of such tags always just depends on which dictionaries you have installed and on the dictionaries themselves.

On further inspection, though, the list does lack idioms/phrases that contain a comma but which are covered in the 多功能成语词典, e.g.:

"养兵千日,用兵一时"
"过五关,斩六将"
That is of course not ideal. I would assume that they used an N-gram search to compute the frequency list (which could include commas), though I can't tell why they excluded such longer expressions.
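Before concluding that a comma-containing idiom is truly absent, it may be worth checking whether the list stores it with a different comma or with none at all. The sketch below is an assumption about how such a check could look, not a description of how the BCC list was built; the helper names and the count in the usage example are invented.

```python
COMMAS = (",", "，")  # half-width and full-width comma


def comma_variants(idiom):
    """Return the idiom with a half-width comma, a full-width comma, and no comma."""
    parts = [idiom]
    for comma in COMMAS:
        parts = [piece for part in parts for piece in part.split(comma)]
    parts = [part for part in parts if part]
    return [",".join(parts), "，".join(parts), "".join(parts)]


def lookup(idiom, freqs):
    """Return (matched_form, count) for the first variant found, else (None, 0)."""
    for form in comma_variants(idiom):
        if form in freqs:
            return form, freqs[form]
    return None, 0


# Usage with an invented count (not a real BCC figure):
freqs = {"养兵千日，用兵一时": 7}
print(lookup("养兵千日,用兵一时", freqs))
print(lookup("过五关,斩六将", freqs))
```

If none of the variants is found, the idiom is probably missing from the list itself rather than just spelled differently.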

Also, does the BCC corpus take sources such as novels or casual conversations into account? An idiom like "有眼不识泰山" is rather common, yet it is only #4438 on the list. Though it might just be that I have too big an obsession with 客套话, or rather self-deprecating language.
The file "blog_lit_news_tech_weibo_freq.release_sorted.txt" used by Peter and me is based on a literature corpus, among others (hence the "lit" in the name). But I don't know the corpus well at all; it would be better if "John." from the thread I referenced chimed in on this. Here I quote his excellent description:

The Beijing Language and Culture University created a balanced corpus of 15 billion characters. It’s based on news (人民日报 1946-2018,人民日报海外版 2000-2018), literature (books by 472 authors, including a significant portion of non-Chinese writers), non-fiction books, blog and weibo entries as well as classical Chinese. Frequency lists derived from the corpus can be downloaded here: http://bcc.blcu.edu.cn/downloads/resources/BCC_LEX_Zh.zip

The ZIP file contains a global frequency list based on the whole corpus and frequency lists based on specific categories (e.g. news, literature...) of the corpus. These text files can easily be turned into a Pleco user dictionary.

The corpus is much larger than the CCL (470 million characters), the CNC (100 million characters), the SUBTLEX-CH (47 million characters) and the LCMC (less than 2 million characters). It seems as if the frequency lists derived from this corpus might be the most reliable frequency lists currently available. More detailed information about the corpus can be found in this paper.

If you have problems with the original files the UTF-8 versions attached below may work. Due to technical limitations the lists global and blogs only include the 1048576 most frequent words.

I also created two Pleco user dictionaries showing the frequency of a word as definition. The first is based on the 100 000 most frequent words from the literature frequency list, the second is based on the 100 000 most frequent words from the news frequency list. They are attached below as ZIP files. If you are interested in a frequency user dictionary based on spoken language, see this post.

Maybe this is useful to some of you.
 
#9
Using this frequency list has helped A LOT! Especially when learning other words and characters. For example, if a character's dictionary entry refers to an idiom that ranks #6000+ out of 8000, then I know that memorizing that tidbit, or using that idiom in a sentence, will make whatever sentence I come up with "crash and burn".

Or, when I'm watching a Chinese series that is laden with proverbs and popular phrases, I needn't feel so bad about not knowing a certain idiom when my Chinese peers would likewise need a dictionary to understand it. Whenever an idiom has variants, though, it's vital to check the frequency of the specific variant, as a variant of a commonly used idiom might only show up in the 5000s, if that. Still, most (if not all) Chinese speakers would see the relation between the two.
 