79,000 Chinese-English, French, German, Italian, Japanese, and Spanish sentences

Shun

状元
Hi shaoguan!

They aren't ordered by HSK level yet. I attach the "hsk new.txt" file. It's an export from Pleco's built-in HSK list. I also attach a newer source file, with separated functions and constants for the languages so you don't have to enter them everywhere. Perhaps the attached source code is a work in progress. If so, you can use the other source file you already have. The (large) BCC corpus files would also be needed for the rating code, but you should also easily be able to leave them out to begin with.

Have fun, Shun
 

Attachments

  • hsk new.txt
    33.4 KB · Views: 527
  • Fold_rate_match_in_separated_functions.py.txt
    18.2 KB · Views: 517

Shun

状元
What do you think would be faster?
Use a regex to remove the pinyin and recreate the original sentences in the file "sentence_contextual_tatoeba_cn_fra_folded by HSK rating - random sentence selection.txt"
or
Order the sentences in the file "sentences_cmn_fra_simplified_folded.txt" by HSK level

You're welcome! I think the second option would be a bit easier. It will just take the computer a while to process. Just tell me if you would like me to do the whole thing.
 

shaoguan

举人
Thanks for your offer!
For the BCC corpus: I have to generate the file "global_wordfreq.release (Hanzi only).txt", right?

Well, if it doesn't take you too much time, I accept your offer to do the whole thing :D
 

Shun

状元
You're welcome! I will try to do it soon and post the file here. :)

I think the BCC file is on the forums, too. (Search for "BCC")
 

Shun

状元
Hello shaoguan,

I attach the HSK rated Chinese-French file, ready for importing into Pleco, and the Python script I did it with. Note that the HSK rating feature could still be improved, but especially the separation between HSK levels 3, 4, and 5 seems quite sensible to me. Sentences that are rated as HSK level 1 or 2 are usually harder than their indicated level.
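To give a rough idea of how such a rating can work, here is a minimal sketch. It is not the attached HSK_BCC_Rater.py: it leaves out the BCC frequency data entirely, and the function names, the "word, tab, level" input format, and the fallback level 7 for unlisted characters are all assumptions made for illustration. It finds words by greedy longest match against the HSK list and rates a sentence by the highest level among them.

```python
def load_hsk_levels(lines):
    """Map each word to its HSK level.

    Assumes lines of the form 'word<TAB>level'; the real Pleco export
    may be laid out differently and would need its own parsing.
    """
    levels = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2 and parts[1].isdigit():
            levels[parts[0]] = int(parts[1])
    return levels


def rate_sentence(sentence, levels, unknown_level=7):
    """Return the highest HSK level among the words found in the sentence.

    Words are found by greedy longest match against the word list.
    A Chinese character not covered by any listed word counts as
    unknown_level (an assumed value, meaning "beyond the list").
    """
    max_len = max((len(word) for word in levels), default=1)
    rating = 0
    i = 0
    while i < len(sentence):
        # Skip punctuation, Latin letters, digits, and whitespace.
        if not ("\u4e00" <= sentence[i] <= "\u9fff"):
            i += 1
            continue
        # Try the longest candidate word first, then shorter ones.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            word = sentence[i:i + size]
            if word in levels:
                rating = max(rating, levels[word])
                i += size
                break
        else:
            rating = max(rating, unknown_level)
            i += 1
    return rating
```

Sorting a list of sentences by difficulty is then just `sorted(sentences, key=lambda s: rate_sentence(s, levels))`. Rating by the hardest word alone ignores grammar and sentence length, which may be one reason why sentences rated HSK 1 or 2 tend to feel harder than their level.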

(Actually, I did the same thing two years ago, but now we have a more flexible, cleaner script to do it.)

Enjoy, have fun,

Shun
 

Attachments

  • HSK_BCC_Rater.py.txt
    5.9 KB · Views: 450
  • sentences_hsk_graded_chinese_to_fra.txt
    1.2 MB · Views: 2,009

shaoguan

举人
Thanks Shun.
I am creating an Anki deck with these well-written sentences.
If anybody is interested, I will share it!

Greetings!
 

Toom

秀才
Thanks a lot for your hard work and those decks! I'm still a little bit confused by all the decks. The sentences are from Tatoeba, just like the 18,896 HSK sentences. So does that mean that they contain those?

I'd like the deck which contains the sentences sorted by HSK levels (HSK2 contains sentences with HSK1-2 words, HSK3 the sentences with HSK1-3 excluding those already in the HSK2, ...). I found one on this thread but it only contains the contextual sentences with one word in English. Is there a deck with the full sentence in Chinese?

Thanks
 

Shun

状元
Hello Toom,

thanks for the praise! The 18,896 HSK sentences are taken from dict.cn (as I found out by googling), and they are independent of the Tatoeba sentences, i.e. there is no overlap.

Yes, there is. I suggest you download the full sentences ordered by HSK difficulty from here:


The pinyin line is filled with a scrambled version of the sentence. Since Pleco (or any computer program) isn't able to derive the correct segmented pinyin from a Hanzi sentence, the data should be more valuable this way, and it lets you put the sentence back together as an exercise.

Sorry about the confusion. A good rule of thumb is to download the files from the last post in a particular thread. :)

Feel free to keep asking if that still isn't quite what you were looking for.

Have fun,

Shun
 

Toom

秀才
Exactly what I was looking for. Thanks a lot!
 

Ria

Member
Hello,
thank you for those useful word lists! My import of the English is still running, but the txt looked very good.

May I ask if this is difficult to set up? I am a language enthusiast, so I am interested in many of the sentence translations available on there.
I noticed they have Chinese-Esperanto, Shanghainese-Mandarin, and even Shanghainese-English.

I would very much like to be able to convert some myself :)

Please advise me!
 

Shun

状元
Hi Ria,

you're welcome! Indeed, it's easy to create flashcards for any combination of languages with Tatoeba's data.

All you need is to have Python installed on your Mac, PC, or Linux system, and to download the raw Tatoeba data ("Sentences" and "Links") from


Once you have these, you only need to get the Python source code for the sentence pair generation and put the Tatoeba input files in the same directory as the Python script. I added the Python source to the beginning of this thread, so you can use this relatively simple bit of code. The only things you may need to adjust in the script are the language codes that the script looks for and the file names used for input and output.
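In case it helps to see the idea before opening the script, the pairing step can be sketched roughly as follows. This is a simplified illustration, not the script from the start of the thread: the function names are made up here, and it assumes the "Sentences" file has tab-separated columns id, language code, text, and the "Links" file has tab-separated columns sentence id, translation id.

```python
import csv


def read_tsv(path):
    """Yield the rows of a tab-separated file as lists of strings."""
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside sentences untouched.
        yield from csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)


def build_pairs(sentence_rows, link_rows, src_lang="cmn", dst_lang="fra"):
    """Return (source_text, target_text) pairs for two language codes.

    sentence_rows: rows of (id, lang, text), assumed "Sentences" layout.
    link_rows: rows of (id, translation_id), assumed "Links" layout.
    """
    src, dst = {}, {}
    for row in sentence_rows:
        if len(row) < 3:
            continue  # skip malformed lines
        sentence_id, lang, text = row[0], row[1], row[2]
        if lang == src_lang:
            src[sentence_id] = text
        elif lang == dst_lang:
            dst[sentence_id] = text

    pairs = []
    for row in link_rows:
        if len(row) < 2:
            continue
        first_id, second_id = row[0], row[1]
        # Keep only links that point from the source to the target language.
        if first_id in src and second_id in dst:
            pairs.append((src[first_id], dst[second_id]))
    return pairs
```

With the two files in the same directory (the names `sentences.csv` and `links.csv` are assumed here), `build_pairs(read_tsv("sentences.csv"), read_tsv("links.csv"), "cmn", "fra")` would give the Chinese-French pairs, and changing the two language codes gives any other combination. Both sentence dictionaries are kept in memory, which is simple but does need a fair amount of RAM for the full Tatoeba export.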

If you have a Mac, I suggest using Homebrew (https://brew.sh) to install Python; on Windows, you could use the regular Python installer from https://www.python.org/

PyCharm is a nice Integrated Development Environment (IDE) for Python; however, it is not mandatory to use it. You could also run the Python scripts from the command line and use a popular text editor like Visual Studio Code to edit them.

Hope this helps for a start,

Shun
 