Common Idioms; A Collection by Grade [HSK / old HSK / 中考 / 高考 / ...]

Weyland

举人
Pleco's addition of the Duogongneng Chengyu Cidian⁻¹⁻ brings an overwhelming collection of 8000 (or more?) idioms, complete with descriptions, origins, synonyms, antonyms, example sentences, and more. But if you've ever looked for a Chinese idiom that resembles one of our Western ones, or just felt like trying one out, you might have found that general knowledge of those eight-thousand-and-some idioms is, to say the least, spotty.

So, if you feel like updating your own language arsenal with some shiny idioms but don't know which ones most Chinese people know, fear not: I've created a flashcard list of 605 idioms that anyone who has passed the Gaokao should be familiar with.

HSK idioms⁻²⁻ - 112 cards
pre-2010 HSK idioms⁻³⁻ - 92 cards
Zhongkao idioms - 156 cards
Gaokao idioms - 245 cards


In the past I used "A Concise Dictionary of Chinese Idioms" by Sinolingua to find some apt idioms, but this concise dictionary (with its 2000+ entries) goes beyond what the average Chinese speaker knows. It does come with a nice collection of example sentences, including conversational ones where applicable.

I will update these flashcards later, as I feel that these 605 flashcards don't reflect the most common idioms, but rather what is taught by Chinese schools after middle school (plus, of course, the HSK). So, if someone has a better list to reorganize or add to these, please do tell. It's kind of odd that idioms like 人山人海 and 入乡随俗 aren't included in these sets (too basic? probably).


-1- The dictionary is available for purchase for US$20.
-2- If you're using the HSK6 materials by BLCUP, then check out this flashcard collection.
-3- Apart from the idioms, the old HSK included an extra 4100 words; here they are in flashcard format.
 


Shun

状元
Hi Weyland,

thanks for the great idea and work! My immediate reaction was to use the BCC corpus frequency list (see this post: https://plecoforums.com/threads/word-frequency-list-based-on-a-15-billion-character-corpus-bcc-blcu-chinese-corpus.5859/ ) to sort your idioms by frequency in general usage.

In the file "Weyland idioms with frequencies.txt" I've simply added the number of times an idiom was used in the BCC corpus. In the file "Weyland idioms with frequencies sorted.txt", I've sorted them by frequency, starting with the most common one. I also attach the Python script I did it with.
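
For anyone who wants to try the same thing, here is a minimal sketch of how such a lookup and sort could work. It is not the attached script; it assumes each line of the BCC frequency file holds a word and its count separated by whitespace, and the function names and sample counts are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def parse_frequencies(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'word<whitespace>count' lines into a dict, skipping malformed lines."""
    freqs: Dict[str, int] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # assumption: anything else is a header or broken line
        freqs[parts[0]] = int(parts[1])
    return freqs


def rank_idioms(idioms: Iterable[str], freqs: Dict[str, int]) -> List[Tuple[str, int]]:
    """Attach a count to each idiom (0 if absent) and sort, most frequent first."""
    counted = [(idiom, freqs.get(idiom, 0)) for idiom in idioms]
    return sorted(counted, key=lambda pair: pair[1], reverse=True)


# Made-up counts, only to show the shape of the data.
sample_lines = ["人山人海\t5000", "入乡随俗\t1200", "有眼不识泰山\t300"]
freqs = parse_frequencies(sample_lines)
ranked = rank_idioms(["有眼不识泰山", "入乡随俗", "人山人海", "过五关,斩六将"], freqs)

for idiom, count in ranked:
    print(f"{idiom}\t{count}")
```

With a real file you would pass the open file object to `parse_frequencies` instead of `sample_lines`. Idioms that do not appear in the frequency list end up at the bottom with a count of 0, which makes gaps in the list easy to spot.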

Did you enter all the 605 idioms in your list by hand, or did you have another source? It may well be that 入乡随俗 is too basic, or maybe it's just a whim of theirs not to include it. :)

Cheers,

Shun
 


Weyland

举人
The Gaokao and Zhongkao collections are based on test preparation apps for the tests of the same name. The apps just labelled them 常见成语 (common idioms), so I didn't question it.

But, now that I think about it: there is probably an agenda. The language they teach will shape the language that's spoken in the future.

With the usage counts of some of these idioms ranging from the hundreds to the hundreds of thousands... I think we need a better frequency-based list for idioms.

@Shun would you be willing to write a script to generate some frequency based idiom lists based on the Duogongneng Chengyu Cidian?

I was going to create a similar list for 熟语, but I don't think Pleco differentiates between an idiom(atic phrase) from ancient Chinese like 《史记》 or a passage from 《北京青年报》two decades ago.
 

Shun

状元
@Shun would you be willing to write a script to generate some frequency based idiom lists based on the Duogongneng Chengyu Cidian?
Unfortunately, the dictionary entries from the Duogongneng Chengyu Cidian, or any other Pleco dictionary, can't be exported as a list for licensing reasons (you could easily import them as flashcards and then export them), so I cannot grade these by frequency. Just joking: If you wish, you could copy the entire 8000 definitions into a list (from the printed Duogongneng Chengyu Cidian), then I could do it. Compared to studying and using them all, that would probably still be only 5% of the work. :)

With the usage counts of some of these idioms ranging from the hundreds to the hundreds of thousands... I think we need a better frequency-based list for idioms.
I tend to disagree: why should it be different with idioms, when word frequencies can range from the tens to many billions? Some of the idioms may simply have fallen out of use. The BCC corpus should reflect present-day usage quite well.

I was going to create a similar list for 熟语, but I don't think Pleco differentiates between an idiom(atic phrase) from ancient Chinese like 《史记》 or a passage from 《北京青年报》two decades ago.
Can you explain your thoughts on this in a bit more detail? It sounds interesting. :)
 

Shun

状元
Hi Peter,

great work! I've noticed long ago on the forums that you must be a great programmer. :) Did you import the entire global BCC list as flashcards using only the DGNCYCD, skipping those entries that don't exist in the DGNCYCD, and then re-export it without the definitions? If so, Weyland, I'm sorry about not thinking of that option when you asked.

Cheers,

Shun
 

Weyland

举人
Can you explain your thoughts on this in a bit more detail? It sounds interesting.
While Pleco does somewhat differentiate between 成语 and 熟语 by having the tags "IDIOM", "WELL-KNOWN PHRASE", "COMMON PHRASE", "COLLOQUIAL", "FIGURATIVE COLLOQUIAL", the system doesn't lend itself to much clarity.


Try this. Taken from `blog_lit_news_tech_weibo_freq.release_sorted.txt`.
I've taken the liberty of organizing the list into sets of 200. This way people can tag the flashcard categories and see whether the idiom they've just looked up is worth memorizing.
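
A rough sketch of how ranks could be mapped to sets of 200 (the function names, the label format and the sample data are illustrative assumptions, not what was used for the attached list):

```python
from typing import Dict, List


def band_label(rank: int, band_size: int = 200) -> str:
    """Return a label such as '0001-0200' for a 1-based frequency rank."""
    start = ((rank - 1) // band_size) * band_size + 1
    end = start + band_size - 1
    return f"{start:04d}-{end:04d}"


def group_into_bands(ranked_idioms: List[str], band_size: int = 200) -> Dict[str, List[str]]:
    """Group idioms (already sorted, most frequent first) into rank bands."""
    bands: Dict[str, List[str]] = {}
    for rank, idiom in enumerate(ranked_idioms, start=1):
        bands.setdefault(band_label(rank, band_size), []).append(idiom)
    return bands


# Tiny demonstration with a band size of 2 instead of 200.
demo = group_into_bands(["人山人海", "入乡随俗", "有眼不识泰山"], band_size=2)
for label, idioms in demo.items():
    print(label, idioms)
```

Each band label can then serve as a flashcard category name, so a glance at the category tells you roughly how common an idiom is.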

On further inspection, though, the list lacks idioms/phrases that contain a comma but which are covered in the 多功能成语词典, e.g.:

"养兵千日,用兵一时"
"过五关,斩六将"

Also, does the BCC corpus take sources such as novels or casual conversations into account? An idiom like "有眼不识泰山" is rather common, yet it only ranks #4438 on the list. Though it might just be that I have too big an obsession with 客套话, or rather self-deprecating language.
 


Shun

状元
While Pleco does somewhat differentiate between 成语 and 熟语 by having the tags "IDIOM", "WELL-KNOWN PHRASE", "COMMON PHRASE", "COLLOQUIAL", "FIGURATIVE COLLOQUIAL", the system doesn't lend itself to much clarity.
Pleco displays the data it gets from the dictionaries, so the quality of such tags always just depends on which dictionaries you have installed and on the dictionaries themselves.

On further inspection, though, the list lacks idioms/phrases that contain a comma but which are covered in the 多功能成语词典, e.g.:

"养兵千日,用兵一时"
"过五关,斩六将"
That is of course not ideal. I would assume that they used an N-gram search to compute the frequency list (which could include commas), though I can't tell why they excluded such longer expressions.

Also, does the BCC corpus take sources such as novels or casual conversations into account? An idiom like "有眼不识泰山" is rather common, yet it only ranks #4438 on the list. Though it might just be that I have too big an obsession with 客套话, or rather self-deprecating language.
The file "blog_lit_news_tech_weibo_freq.release_sorted.txt" used by Peter and me is based on a literature corpus, among others (hence the "lit" in the name). But I don't know the corpus well at all; it would be better if "John." from the thread I referenced chimed in on this. Here I quote his excellent description:

The Beijing Language and Culture University created a balanced corpus of 15 billion characters. It’s based on news (人民日报 1946-2018,人民日报海外版 2000-2018), literature (books by 472 authors, including a significant portion of non-Chinese writers), non-fiction books, blog and weibo entries as well as classical Chinese. Frequency lists derived from the corpus can be downloaded here: http://bcc.blcu.edu.cn/downloads/resources/BCC_LEX_Zh.zip

The ZIP file contains a global frequency list based on the whole corpus and frequency lists based on specific categories (e.g. news, literature...) of the corpus. These text files can easily be turned into a Pleco user dictionary.

The corpus is much larger than the CCL (470 million characters), the CNC (100 million characters), the SUBTLEX-CH (47 million characters) and the LCMC (less than 2 million characters). It seems as if the frequency lists derived from this corpus might be the most reliable frequency lists currently available. More detailed information about the corpus can be found in this paper.

If you have problems with the original files the UTF-8 versions attached below may work. Due to technical limitations the lists global and blogs only include the 1048576 most frequent words.

I also created two Pleco user dictionaries showing the frequency of a word as definition. The first is based on the 100 000 most frequent words from the literature frequency list, the second is based on the 100 000 most frequent words from the news frequency list. They are attached below as ZIP files. If you are interested in a frequency user dictionary based on spoken language, see this post.

Maybe this is useful to some of you.
 

Weyland

举人
Using this frequency list has helped A LOT, especially when learning other words and characters! For example, if I look up a character and its dictionary entry refers to an idiom ranked #6000+ out of 8000, then I know that any phrase I build around that tidbit or that idiom will "crash and burn".

Or, when I'm watching a Chinese series that tends to be laden with proverbs and popular phrases, I shouldn't feel so bad for not knowing a certain idiom when my Chinese peers would likewise have to use a dictionary to even understand it. Though, whenever there is a variant of said idiom, it's vital to check the frequency of that specific variant, as a variant of a commonly used idiom might only show up in the 5000s, if that. However, most (if not all) Chinese would see the relation between the two.
 

Shun

状元
That's pretty huge. I see yours has about 1,500 pages, so 40 idioms fit on each page on average. The following chengyu dictionary covers only 17,112 idioms on 1,640 pages, so it is able to go to greater lengths to describe each idiom, with 10 idioms per page on average:


Perhaps as a nice Christmas present?
 

mikelove

皇帝
Staff member
I’m probably not going to pursue any more new dictionary licenses until next year, actually, because based on these interview comments by the chair of the US House of Representatives’ antitrust committee it seems like after 11 long years, we may finally soon get a reprieve from Apple’s 30% commissions, and if that does happen it’ll dramatically shake up (in a good way) the world of app content licensing.
 

Shun

状元
That would of course be excellent. Passing the money from Apple to smaller app vendors, allowing them to grow further, certainly promises to create more value and a healthier economy overall. Concentration of too much capital in a single place is never good.
 

Weyland

举人
I’m probably not going to pursue any more new dictionary licenses until next year
Personally, I was just flabbergasted that there is an idiom dictionary with 61 thousand entries. The Duogongneng Chengyu Cidian is plenty for now. The thing I'm worrying about now, though, is whether or not the idiom dictionary has too niche a following. Mike, is there currently a dictionary that is inching its way up on the chopping block?
 

mikelove

皇帝
Staff member
Mike, is there currently a dictionary that is inching its way up on the chopping block?
A few, plus there are a couple of updates / expansions of existing licenses under discussion, but we were already intentionally holding back somewhat so we could focus on 4.0 (Chinese History + graded readers have done very well for us but they most definitely delayed work on the new app a bit).
 

Weyland

举人
Hmmm... anyone taking bets? I'm betting against better judgement that the Traditional Chinese Medicine dictionaries will be done away with.

we were already intentionally holding back somewhat so we could focus on 4.0
"could focus on 4.0", past tense. Pleco 4.0 ETA -> Tomorrow.

EDIT: No wait! Tomorrow falls on the weekend. But Sunday is Father's Day, so it will release then, in the spirit of a gift for Papa Xi.
 

Weyland

举人
Try to exercise some patience. Good things come to those who wait!
I know, I know. Please take it with a grain of salt, as it was meant in good humour.

Just trying to find an outlet to procrastinate while I'm putting together this PSC wordlist. I would've brushed up my Python skills or, if push came to shove, asked you to help me get it somewhat in order. But as far as I can see, the lists on 普通话学习网, 普通话学习app, and several others I could find through Google all have several mistakes per 100 words, so I'm doing it by hand.

Ever since I created that old-HSK flashcard collection and found out that 100% of its vocabulary is included in the PSC, I've been looking for a complete list to export to Pleco. But all those I found had similar mistakes. Now that we have the preliminary HSK 3.0 list I'm even more motivated, because if 100% of its vocabulary carries over to the PSC, then by memorizing the HSK list, which is 11,092 entries strong, you've already progressed through 65% of the PSC vocabulary, which is 17,055 entries strong.
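
The 65% figure is just the ratio of the two list sizes, under the stated assumption that every HSK entry also appears in the PSC list. A quick check (variable names are only for illustration):

```python
# Entry counts as quoted above.
hsk3_entries = 11_092
psc_entries = 17_055

# Assumption: the HSK list is fully contained in the PSC list,
# so coverage is simply the ratio of the two sizes.
coverage = hsk3_entries / psc_entries
remaining = psc_entries - hsk3_entries

print(f"Coverage: {coverage:.1%}")
print(f"PSC entries still to learn: {remaining}")
```

If the overlap turns out to be less than 100%, the real coverage would be correspondingly lower.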
 