79,000 Chinese-English, French, German, Italian, Japanese, and Spanish sentences

Shun

状元
Hi shaoguan!

They aren't ordered by HSK level yet. I attach the "hsk new.txt" file, which is an export of Pleco's built-in HSK list. I also attach a newer source file, with separate functions and constants for the languages, so you don't have to enter them everywhere. The attached source code may still be a work in progress; if so, you can use the other source file you already have. The (large) BCC corpus files would also be needed for the rating code, but you should easily be able to leave them out to begin with.

Have fun, Shun
 

Attachments

Shun

状元
shaoguan said:
What do you think would be faster?
Use a regex to remove the pinyin and recreate the original sentences in the file "sentence_contextual_tatoeba_cn_fra_folded by HSK rating - random sentence selection.txt"
or
order the sentences in the file "sentences_cmn_fra_simplified_folded.txt" by HSK level?

You're welcome! I think the second option would be a bit easier. It will just take the computer a while to process. Just tell me if you would like me to do the whole thing.
 

shaoguan

秀才
Thanks for your offer!
For the BCC corpus: I have to generate the file "global_wordfreq.release (Hanzi only).txt", right?

Well, if it doesn't take you too much time, I accept your offer to do the whole thing :D
 

Shun

状元
You're welcome! I will try to do it soon and post the file here. :)

I think the BCC file is on the forums, too. (Search for "BCC")
 

Shun

状元
Hello shaoguan,

I attach the HSK-rated Chinese-French file, ready for importing into Pleco, and the Python script I made it with. The HSK rating could still be improved, but the separation between HSK levels 3, 4, and 5 in particular seems quite sensible to me. Sentences that are rated as HSK level 1 or 2 are usually harder than their indicated level.

(Actually, I did the same thing two years ago, but now we have a more flexible, cleaner script to do it.)
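
For anyone curious how such a rating can work in principle, here is a minimal sketch. It is not the attached script: it assumes that a sentence's level is simply the highest HSK level among its words, it uses a naive longest-match segmenter, and it ignores the BCC frequency data entirely. The names `segment`, `rate_sentence`, and `TOY_LEVELS`, the tiny word list, and the fallback level 7 for unknown words are all made up for illustration.

```python
import unicodedata

# Toy word list (hypothetical); a real run would load the HSK export instead.
TOY_LEVELS = {"我": 1, "喜欢": 1, "学习": 1, "汉语": 1, "经济": 4}


def segment(sentence, vocab, max_len=4):
    """Greedy longest-match segmentation against a known vocabulary.

    Assumption: this is a stand-in for a real word segmenter.
    Characters that match no vocabulary entry become one-character tokens.
    """
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, fall back to a single character.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            chunk = sentence[i:i + size]
            if size == 1 or chunk in vocab:
                words.append(chunk)
                i += size
                break
    return words


def is_punctuation_or_space(token):
    """True if every character is whitespace or Unicode punctuation."""
    return all(
        ch.isspace() or unicodedata.category(ch).startswith("P")
        for ch in token
    )


def rate_sentence(sentence, hsk_levels, unknown_level=7):
    """Rate a sentence by the hardest word it contains.

    Assumption: words missing from the list count as `unknown_level`,
    i.e. harder than any listed level.
    """
    levels = [
        hsk_levels.get(word, unknown_level)
        for word in segment(sentence, hsk_levels)
        if not is_punctuation_or_space(word)
    ]
    return max(levels, default=unknown_level)


def sort_by_rating(sentences, hsk_levels):
    """Order sentences from easiest to hardest; ties keep their input order."""
    return sorted(sentences, key=lambda s: rate_sentence(s, hsk_levels))
```

Under this rule a single rare word pushes the whole sentence into a higher level, which is one reason a real script would want additional signals such as word frequency.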

Enjoy, have fun,

Shun
 

Attachments

shaoguan

秀才
Thanks, Shun.
I am creating an Anki deck with these pretty well-written sentences.
If anybody is interested, I will share it!

Greetings!
 

Toom

秀才
Thanks a lot for your hard work and those decks! I'm still a little bit confused by all the decks. The sentences are from Tatoeba, just like the 18,896 HSK sentences. So does that mean that they contain those?

I'd like the deck which contains the sentences sorted by HSK levels (HSK2 contains sentences with HSK1-2 words, HSK3 the sentences with HSK1-3 excluding those already in the HSK2, ...). I found one on this thread but it only contains the contextual sentences with one word in English. Is there a deck with the full sentence in Chinese?

Thanks
 

Shun

状元
Hello Toom,

Thanks for the praise! The 18,896 HSK sentences are taken from dict.cn (as I found out by googling), and they are independent of the Tatoeba sentences, i.e. there is no overlap.

Yes, there is. I suggest you download the full sentences ordered by HSK difficulty from here:


The pinyin line is filled with a scrambled version of the sentence. Since Pleco (or any computer program) isn't able to derive the correct segmented pinyin from a Hanzi sentence, the data should be more valuable this way, and it will allow you to put the sentence back together as an exercise.
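
To illustrate the scrambling idea, here is a small sketch. It is only an assumption of how such a line could be produced, not the code used for the file: it takes a sentence that has already been split into words and shuffles the words until the order differs from the original. The function name `scramble` and the `seed` parameter are invented for this example.

```python
import random


def scramble(words, seed=None):
    """Return the words in a shuffled order that differs from the input.

    Assumption: `words` is an already segmented sentence. If fewer than
    two distinct words exist, no different order is possible, so the
    input order is returned unchanged.
    """
    rng = random.Random(seed)  # local generator, so a seed gives repeatable output
    original = list(words)
    shuffled = list(words)
    if len(set(original)) > 1:
        # Reshuffle until the order actually changes.
        while shuffled == original:
            rng.shuffle(shuffled)
    return shuffled


words = ["我", "喜欢", "学习", "汉语"]
print(" ".join(scramble(words)))
```

The learner then sees the shuffled words and tries to rebuild the original sentence before revealing the answer.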

Sorry about the confusion. A good rule of thumb is to download the files from the last post in a particular thread. :)

Feel free to keep asking if that still isn't quite what you were looking for.

Have fun,

Shun
 

Toom

秀才
Exactly what I was looking for. Thanks a lot!
 