18,896 HSK sentences

Discussion in 'Flashcard Exchange' started by Shun, Sep 30, 2017.

  1. Shun

    Shun 状元

    Dear all,

    here’s a file of 18,896 quite realistic-sounding HSK sentences, in ascending order of complexity. I imported them from a public Anki flashcard set some time ago. They’re useful for reinforcing one’s feeling for sentence structure and grammatical constructions. They work best with the Self-graded studying mode or even just by browsing through in Card Info.

    Caveat: The generated pinyin is sometimes off, and the English often reads as though it was not translated by a native English speaker.

    Hope you like it, cheers, Shun
     

    Attached Files:

    Last edited: Oct 19, 2017 at 9:02 AM
  2. leguan

    leguan 举人

    I like it. ;-)
     

    Attached Files:

  3. leguan

    leguan 举人

  4. Shun

    Shun 状元

    Thanks a lot for your list. :) Good English translations and realistic-sounding Chinese. However, I will try using regex to change them to the format

    "Hanzi - pinyin - translation" instead of

    "Word Hanzi - Hanzi with word in pinyin - translation".

    If I manage it, I will upload the converted list here.
     
  5. leguan

    leguan 举人

    It is your list, not mine ;-)
     
  6. leguan

    leguan 举人

    The order is intentional. Please see the screen shots above.
     
  7. leguan

    leguan 举人

    It was converted from your list using Excel. The idea is to test whether you can write the hanzi. Just listening to the pinyin of the word gives you no context, so that is an unrealistic setting. The sentence supplies the context (you can choose not to display the sentence initially, but I prefer to read the sentence and then write the hanzi, rather than relying on the pronunciation alone).
     
  8. Shun

    Shun 状元

    I see! Funny though that it's got 23,000 lines. Your format is definitely very useful!
     
    Last edited: Oct 19, 2017 at 9:35 AM
  9. leguan

    leguan 举人

    If you don't know the hanzi from the Mandarin context, you can look at the English translation as a hint.
     
  10. Shun

    Shun 状元

    Very interesting! That way you can focus solely on writing. I like it.
     
  11. leguan

    leguan 举人

    The converter makes two passes through your list, trying to match as many list entries as it can against my 65535 words that have definitions in the Pleco dictionaries I have.
    On each pass it chooses the least used sentence (so far, and if one exists) to distribute usage of the sentences evenly.
     
  12. Shun

    Shun 状元

    So not all the English translations are bad, only some. ;)
     
  13. Shun

    Shun 状元

    This is really neat, you must be very much into IT as well! :) So thanks again.
     
  14. leguan

    leguan 举人

    Currently, it uses the 40,000 highest-frequency words that exist as Pleco entries, blended 50%/50% from the Subtitles and Leiden Weibo corpora.
    Of the 40,000 top words, only 13670 appear in the sentence list you compiled. The average number of sentences per word is thus 23084/13670, i.e. approximately 1.7 sentences per word. HSK words are prioritized. About 500 HSK words do not appear in the sentences, so I'm planning to add 500 sentences to your list and rebuild.
     
  15. leguan

    leguan 举人

    Thank you for your kind words. :)
    I can make alternative versions if you have different parameters you prefer. e.g. number of sentences/word, which words are included as candidates, etc.
     
  16. Shun

    Shun 状元

    So if I understand it correctly, you're not just picking the most "special" word in each sentence, i.e. the least common word, but something more complex. Anyway, the goal is to get the most useful word.
     
  17. Shun

    Shun 状元

    You're welcome! I guess most of the HSK words are already a fine starting point. You seem to have done the optimal word selection already, as I see it. Have a nice day then!
     
  18. leguan

    leguan 举人

    Thank you very much too for posting the list :)
    I can now practice reading comprehension and writing in context in an enjoyable way!
     
  19. Shun

    Shun 状元

    Glad you like it. I have to go check Anki's free database again. :)
     
  20. leguan

    leguan 举人

    You are quite correct: it doesn't look for a "special" word in each sentence. Rather, it makes multiple passes through the HSK and high-frequency words, ordered by HSK and frequency. On each pass it chooses, among the sentences that contain the word, the one with the lowest usage so far (i.e. usage by the current word and all other words that have been processed). I have ordered the sentences by length from short to long, so that it is biased towards choosing shorter sentences. I prefer shorter contexts to longer ones because the primary purpose of the sentence is to give the context; reading comprehension is just a byproduct of incorporating the context.
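
    The selection described in this post can be sketched roughly as follows. This is only an illustration in Python, not the actual Excel converter: the function name assign_sentences, the plain substring test for "the word exists in the sentence", and the rule that a word never receives the same sentence twice are all assumptions made for the example.

```python
from collections import defaultdict

def assign_sentences(words, sentences, passes=2):
    """Greedy sketch: pair each word with its least-used containing sentence.

    Assumptions (not from the original converter): `words` is already
    sorted by HSK and frequency, a word "exists" in a sentence if it is a
    substring, and a (word, sentence) pair is never produced twice.
    """
    # Short-to-long order, so that ties in usage favour shorter sentences.
    ordered = sorted(sentences, key=len)
    usage = defaultdict(int)  # sentence -> times chosen so far, by any word
    seen = set()              # (word, sentence) pairs already produced
    cards = []
    for _ in range(passes):
        for word in words:
            candidates = [s for s in ordered
                          if word in s and (word, s) not in seen]
            if not candidates:
                continue  # no (further) sentence exists for this word
            # min() returns the first minimum, i.e. the shortest candidate
            # among the least-used ones.
            best = min(candidates, key=lambda s: usage[s])
            usage[best] += 1
            seen.add((word, best))
            cards.append((word, best))
    return cards

if __name__ == "__main__":
    for word, sentence in assign_sentences(["你", "好"], ["你好", "好吧", "你来"]):
        print(word, sentence)
```

    With the toy input above, 好 gets 好吧 on the first pass because 你好 has already been used once by 你, which is the even distribution of usage the post describes. Real Chinese text would need proper word segmentation rather than substring matching.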
     
