79,000 Chinese-English, French, German, Italian, Japanese, and Spanish sentences

Shun

状元
Hi shaoguan!

They aren't ordered by HSK level yet. I attach the "hsk new.txt" file, which is an export from Pleco's built-in HSK list. I also attach a newer source file, with separated functions and constants for the languages so you don't have to enter them everywhere. The attached source code may still be a work in progress; if so, you can use the other source file you already have. The (large) BCC corpus files would also be needed for the rating code, but you should easily be able to leave them out to begin with.

Have fun, Shun
 

Attachments

  • hsk new.txt
    33.4 KB · Views: 602
  • Fold_rate_match_in_separated_functions.py.txt
    18.2 KB · Views: 591

Shun

状元
What do you think would be faster?
Use a regex to remove the pinyin and recreate the original sentences in the file "sentence_contextual_tatoeba_cn_fra_folded by HSK rating - random sentence selection.txt"
or
order the sentences in the file "sentences_cmn_fra_simplified_folded.txt" by HSK level?

You're welcome! I think the second option would be a bit easier. It will just take the computer a while to process. Just tell me if you would like me to do the whole thing.
 

shaoguan

举人
Thanks for your offer!
For the BCC corpus: I have to generate the file "global_wordfreq.release (Hanzi only).txt", right?

Well, if it doesn't take you too much time, I accept your offer to do the whole thing :D
 

Shun

状元
You're welcome! I will try to do it soon and post the file here. :)

I think the BCC file is on the forums, too. (Search for "BCC")
 

Shun

状元
Hello shaoguan,

I attach the HSK rated Chinese-French file, ready for importing into Pleco, and the Python script I did it with. Note that the HSK rating feature could still be improved, but especially the separation between HSK levels 3, 4, and 5 seems quite sensible to me. Sentences that are rated as HSK level 1 or 2 are usually harder than their indicated level.
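To illustrate the general idea of such a rating, here is a minimal sketch. It is not the attached HSK_BCC_Rater.py and it leaves out the BCC frequency data entirely. It assumes that a sentence's level is the highest HSK level among its words, that words are found by greedy longest-match against the HSK vocabulary, and that any Hanzi not in the list counts as a hypothetical "level 7". The names `segment`, `hsk_level`, and `LEVELS`, and the few levels in the sample dictionary, are made up for illustration.

```python
def segment(sentence, vocab, max_len=4):
    """Greedy longest-match segmentation against a known vocabulary."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            chunk = sentence[i:i + size]
            if chunk in vocab or size == 1:
                words.append(chunk)
                i += size
                break
    return words


def hsk_level(sentence, levels, unknown_level=7):
    """Return the highest HSK level among the words in the sentence."""
    rating = 0
    for word in segment(sentence, levels):
        if word in levels:
            rating = max(rating, levels[word])
        elif "\u4e00" <= word <= "\u9fff":
            # A single Hanzi outside the list (assumption: treat as hardest).
            rating = max(rating, unknown_level)
        # Punctuation and other non-Hanzi characters are ignored.
    return rating


# Illustrative sample only; a real run would load the full HSK list.
LEVELS = {"我": 1, "我们": 1, "你": 1, "爱": 1, "环境": 3, "保护": 4}

sentences = ["我们保护环境。", "我爱你。"]
for s in sorted(sentences, key=lambda s: hsk_level(s, LEVELS)):
    print(hsk_level(s, LEVELS), s)
```

Taking the maximum means a single hard word decides the level of the whole sentence, and grammar is not considered at all, which may be one reason why sentences rated level 1 or 2 can feel harder than their label.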

(Actually, I did the same thing two years ago, but now we have a more flexible, cleaner script to do it.)

Enjoy, have fun,

Shun
 

Attachments

  • HSK_BCC_Rater.py.txt
    5.9 KB · Views: 509
  • sentences_hsk_graded_chinese_to_fra.txt
    1.2 MB · Views: 3,453

shaoguan

举人
Thanks, Shun.
I am creating an Anki deck with these well-written sentences.
If anybody is interested, I will share it!

Greetings!
 

Toom

秀才
Thanks a lot for your hard work and those decks! I'm still a little bit confused by all the decks. The sentences are from Tatoeba, just like the 18,896 HSK sentences. So does that mean these decks contain those sentences?

I'd like the deck which contains the sentences sorted by HSK levels (HSK2 contains sentences with HSK1-2 words, HSK3 the sentences with HSK1-3 excluding those already in the HSK2, ...). I found one on this thread but it only contains the contextual sentences with one word in English. Is there a deck with the full sentence in Chinese?

Thanks
 

Shun

状元
Hello Toom,

thanks for the praise! The 18,896 HSK sentences are taken from dict.cn (as I found out by googling), and they are independent of the Tatoeba sentences, i.e. there is no overlap.

Yes, there is. I suggest you download the full sentences, ordered by HSK difficulty, from here:


The pinyin line is filled with a scrambled sentence, but since Pleco (or any computer program) isn't able to derive the correct segmented pinyin from a Hanzi sentence, the data should be more valuable this way, and it will allow you to put it back together as an exercise.
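A scrambled line like this could be produced along the following lines. This is only a sketch: the thread does not say which units are shuffled or how, so it assumes the sentence is already split into segments (here: words), and the helper name `scramble` is hypothetical.

```python
import random


def scramble(segments, rng=None):
    """Return the segments in shuffled order, leaving the input untouched."""
    if rng is None:
        rng = random.Random()
    shuffled = list(segments)
    rng.shuffle(shuffled)
    return shuffled


words = ["我们", "保护", "环境"]
# A seeded generator makes the result repeatable between runs.
mixed = scramble(words, random.Random(0))
print(" ".join(mixed))
```

Note that a shuffle can occasionally return the original order, especially for very short sentences, so a real script might want to reshuffle in that case.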

Sorry about the confusion. A good rule of thumb is to download the files from the last post in a particular thread. :)

Feel free to keep asking if that still isn't quite what you were looking for.

Have fun,

Shun
 

Toom

秀才
Exactly what I was looking for. Thanks a lot!
 

Ria

Member
Hello,
thank you for those useful word lists! My import of the English file is still running, but the txt looked very good.

May I ask if this is difficult to set up? I am a language enthusiast, so I am interested in many of the sentence translations available on there.
I noticed they have Chinese-Esperanto, Shanghainese-Mandarin, and even Shanghainese-English.

I would very much like to be able to convert some myself :)

Please advise me!
 

Shun

状元
Hi Ria,

you're welcome! Indeed, it's easy to create flashcards for any combination of languages with Tatoeba's data.

All you need is to have Python installed on your Mac, PC, or Linux system, and to download the raw Tatoeba data ("Sentences" and "Links") from


Once you have these, all you need is the Python source code for the sentence pair generation. Put the Tatoeba input files in the same directory as the Python script. I added the Python source to the beginning of this thread, so you can use this relatively simple bit of code. The only things you may need to adjust in the script are the language codes that the script looks for and the file names used for input and output.
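For orientation, the core of such a pairing step might look like the sketch below. It is not the script from the beginning of the thread. It assumes the "Sentences" file has tab-separated lines of the form id, language code, text, and the "Links" file has tab-separated pairs of sentence id and translation id; the function names and the sample data are made up for illustration.

```python
def read_sentences(lines, languages):
    """Map sentence id -> (language code, text) for the wanted languages."""
    sentences = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) == 3 and parts[1] in languages:
            sentences[parts[0]] = (parts[1], parts[2])
    return sentences


def sentence_pairs(sentences, link_lines, source, target):
    """Collect (source text, target text) for every source -> target link."""
    pairs = []
    for line in link_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue
        first = sentences.get(parts[0])
        second = sentences.get(parts[1])
        if first and second and first[0] == source and second[0] == target:
            pairs.append((first[1], second[1]))
    return pairs


# Tiny in-memory sample; real data would be read from the downloaded
# files, e.g. with open(path, encoding="utf-8").
SENTENCES = ["1\tcmn\t我爱你。", "2\tfra\tJe t'aime.", "3\teng\tI love you."]
LINKS = ["1\t2", "2\t1", "1\t3"]

pairs = sentence_pairs(read_sentences(SENTENCES, {"cmn", "fra"}), LINKS, "cmn", "fra")
for chinese, french in pairs:
    print(f"{chinese}\t{french}")
```

Changing the two language codes passed to `sentence_pairs` is all it takes to switch to another language combination in this sketch; the output layout (one tab-separated pair per line) is likewise just an assumption and may need adjusting for import.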

If you have a Mac, I suggest using Homebrew (https://brew.sh) to install Python; on Windows, you could use the regular Python installer from https://www.python.org/

PyCharm is a nice Integrated Development Environment (IDE) for Python; however, it is not mandatory to use it. You could also run the Python scripts from the command line and use a popular text editor like Visual Studio Code to edit them.

Hope this helps for a start,

Shun
 

Kypselus

Member
Hi All,

Thank you, everyone, for sharing your word lists. I am new to learning Chinese and to using Pleco, and I am having a little trouble with the flashcards. I would appreciate any advice.

I have downloaded the file from post #88:


and I see from the .txt file that the cards are organised as sentences, which is exactly what I have been looking for. However, when I run a New Test in the app using the list, I am only tested on individual words. Have I set up the import file incorrectly? Or perhaps my test settings are wrong? If anyone knows how to resolve this so that the entire sentence appears on the card, I would be very grateful for the help.

Kind regards,
 

Shun

状元
Hi Kypselus,

assuming that you're using Pleco 3.2, did you perhaps have the "Show" setting set to "Headword"?

IMG_9293.png

You could set it to "Pron + Defn" if you wanted to be tested on the missing Chinese Hanzi word.

If you just want to translate entire sentences, you'd have to use a different sentence file from earlier in this thread and import that, such as this one:


Just experiment with the "Show" setting, and you should be fine.

Enjoy, Shun
 

VLP

Member
Hi all,
I’d like to thank everyone here especially @Shun and @leguan for the tremendous work that has been put throughout the years, constantly improving these flashcards.


I imported the flashcards into Pleco Legacy without any problem. However, I use Pleco 4.0 daily, so I imported them there as well, and it doesn't seem to work as intended.

After tweaking the settings in many ways, I can’t seem to display the Chinese headword with proper colors and tones. Did I miss anything? Is this Pleco 4.0 issue solvable? I’d really love to hear if anyone managed to import these in 4.0 without issue.

See:
IMG_3872.png


Here, if the Mandarin field starts with pinyin with tone numbers, the headword is colored; otherwise it isn't.
IMG_3882.PNG
IMG_3878.png

(Original after importing, Chinese headwords are white w/o tones)

IMG_3873.png
IMG_3874.png

(edited manually, adding "bei3jing1" at the beginning of the Mandarin segment, then tabbing to the sentence)


Am I missing something while importing? Is there a workaround or a way I’d be able to make the flashcards look exactly as in Pleco Legacy but in 4.0?

Thanks in advance for any response,
Victor
 

Shun

状元
Hello Victor,

thanks for the praise! I hope @leguan reads it, too. Looking at your first screenshot, it seems like the Pleco 4 beta colors the first n-p characters, starting from position p. (n: length of the headword, p: position of the headword in the Mandarin field)

Since Pleco 4 allows for much more customization of flashcards than Pleco 3.2, with cloze testing specifically supported, I am guessing that Mike will focus on that instead of making our old non-standard Pleco 3.2 cards work on Pleco 4.0. But @leguan and I can definitely support version 4 with new cards. If anyone has already figured out exactly how the Pleco 4 data format and flashcard system work for things like cloze testing, do write it here, before the official documentation is released. Then we should be able to convert the lists to the new format easily.

You're welcome, have fun,

Shun
 

mikelove

皇帝
Staff member
Using the Mandarin field for sentences doesn't really work in 4.0, yeah - it only ever worked in 3.0 as a hack.

The next beta has an easy UI for building custom test types which will show arbitrary fields in arbitrary orders, so once that's out, you could use a batch command to put the sentence text into a newly created custom field and then design a test type to test you on that field, but doing it in the current beta would involve a lot of mucking around with presentations and test stages.
 