18,896 HSK sentences

Dear all,

here’s a file of 18,896 quite realistic-sounding HSK sentences, in ascending order of complexity. I imported them from a tatoeba sentence list some time ago. They’re useful for reinforcing one’s feeling for sentence structure and grammatical constructions. They work best with the Self-graded studying mode, or even just by browsing through them in Card Info.

Caveat: The generated pinyin is sometimes off, and the English often sounds as if it was not translated by a native English speaker.

Hope you like it, cheers, Shun

Edit: The file in message number 24 is ordered by sentence length. I'm also attaching it here.


Thanks a lot for your list. :) Good English translations and realistic-sounding Chinese. However, I will try using regex to change them to the format

"Hanzi - pinyin - translation" instead of

"Word Hanzi - Hanzi with word in pinyin - translation".

If I manage it, I will upload the converted list here.
It was converted from your list using Excel. The idea is that you want to test whether you can write the hanzi. But just listening to the pinyin of the word gives you no context, so it is an unrealistic test. The sentence gives you the context (you can choose not to display the sentence initially, but I prefer to read the sentence and then write the hanzi, rather than listening only to the pronunciation).
The converter does two passes through your list trying to use as many list entries as there are in my 65535 words with definitions in all the Pleco dictionaries I have.
On each pass it chooses the least used sentence (so far, and if one exists) to distribute usage of the sentences evenly.
Currently, it uses the 40,000 highest-frequency words that exist as Pleco entries, with frequencies blended 50%/50% from the Subtitles and Leiden Weibo corpora.
Of the 40,000 top words, only 13,670 appear in the sentence list you compiled. The average number of sentences per word is thus 23,084/13,670, i.e. about 1.7, or roughly 2 sentences/word. HSK words are prioritized. There are about 500 HSK words which don't exist in the sentences, so I'm planning to add 500 sentences to your list and rebuild.
Thank you for your kind words. :)
I can make alternative versions if you have different parameters you prefer. e.g. number of sentences/word, which words are included as candidates, etc.
So if I understand it correctly, you're not just picking the most "special" word in each sentence, i.e. the least common word, but something more complex. Anyway, the goal is to get the most useful word.
You're welcome! I guess most of the HSK words are already a fine starting point. You seem to have done the optimal word selection already, as I see it. Have a nice day then!
You are quite correct, it doesn't look for a "special" word in each sentence. Rather, it makes multiple passes through the HSK and high-frequency words, ordered by HSK level and frequency. On each pass it chooses, among the sentences containing the word, the one with the lowest usage so far (i.e. usage by the current word and all other words that have been processed). I have ordered the sentences by length from short to long, so that it is biased toward choosing shorter sentences. The reason for this is that I prefer shorter contexts to longer ones, as the primary reason for the sentence is to give the context - reading comprehension is just a byproduct of incorporating the context.
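
For anyone who wants to try something similar outside Excel, here is a rough Python sketch of the selection idea described above. It is not the actual converter. The function name `assign_sentences` is made up for illustration. It also assumes that a word "exists" in a sentence if it occurs as a plain substring (no word segmentation), and that a word never gets the same sentence twice.

```python
def assign_sentences(words, sentences, passes=2):
    """Greedy sketch: on each pass, give every word its least-used sentence.

    words     -- assumed to be pre-sorted by priority (HSK level, then frequency)
    sentences -- list of Chinese sentences (plain strings)
    """
    # Shortest first, so that ties in usage go to the shorter sentence.
    ordered = sorted(sentences, key=len)
    usage = {s: 0 for s in ordered}   # how often each sentence has been picked
    chosen = {}                       # word -> list of sentences picked for it

    for _ in range(passes):
        for word in words:
            already = chosen.get(word, [])
            # Assumption: plain substring match, and no repeats per word.
            candidates = [s for s in ordered if word in s and s not in already]
            if not candidates:
                continue  # no (further) sentence exists for this word
            # min() returns the first minimum, i.e. the shortest least-used one.
            best = min(candidates, key=lambda s: usage[s])
            usage[best] += 1
            chosen.setdefault(word, []).append(best)
    return chosen
```

Called as `assign_sentences(["我", "好"], ["我很好", "你好", "我是学生"])`, the first pass gives 我 the sentence 我很好, and 好 then gets the still-unused 你好 rather than reusing 我很好, which is the "spread usage evenly" effect.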