79,000 Chinese-English, French, German, Italian, Japanese, and Spanish sentences

Salut shaoguan!

They aren't ordered by HSK level yet. I attach the "hsk new.txt" file. It's an export from Pleco's built-in HSK list. I also attach a newer source file, with separated functions and constants for the languages so you don't have to enter them everywhere. Perhaps the attached source code is a work in progress. If so, you can use the other source file you already have. The (large) BCC corpus files would also be needed for the rating code, but you should also easily be able to leave them out to begin with.

Have fun, Shun
 


What do you think would be faster?
Using a regex to remove the pinyin and recreate the original sentences in the file "sentence_contextual_tatoeba_cn_fra_folded by HSK rating - random sentence selection.txt"
or
ordering the sentences in the file "sentences_cmn_fra_simplified_folded.txt" by HSK level?

You're welcome! I think the second option would be a bit easier. It will just take the computer a while to process. Just tell me if you would like me to do the whole thing.
 
Thanks for your offer!
For the BCC corpus: I have to generate the file "global_wordfreq.release (Hanzi only).txt", right?

Well, if it doesn't take you too much time, I accept your offer to do the whole thing :D
 
You're welcome! I will try to do it soon and post the file here. :)

I think the BCC file is on the forums, too. (Search for "BCC")
 
Hello shaoguan,

I attach the HSK rated Chinese-French file, ready for importing into Pleco, and the Python script I did it with. Note that the HSK rating feature could still be improved, but especially the separation between HSK levels 3, 4, and 5 seems quite sensible to me. Sentences that are rated as HSK level 1 or 2 are usually harder than their indicated level.

(Actually, I did the same thing two years before, but now we have a more flexible, cleaner script to do it.)

Enjoy, have fun,

Shun
 


Thanks Shun.
I am creating an Anki deck with these well-written sentences.
If anybody is interested, I will share it!

Greets !
 
Thanks a lot for your hard work and those decks! I'm still a little bit confused by all the decks. The sentences are from Tatoeba, just like the 18,896 HSK sentences. So does that mean that they contain those?

I'd like the deck which contains the sentences sorted by HSK levels (HSK2 contains sentences with HSK1-2 words, HSK3 the sentences with HSK1-3 excluding those already in the HSK2, ...). I found one on this thread but it only contains the contextual sentences with one word in English. Is there a deck with the full sentence in Chinese?
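For reference, the bucketing rule described above can be sketched in a few lines of Python. This is only a minimal sketch and not the script attached to this thread: the greedy longest-match segmentation, the function names, the `unknown_level` parameter, and the tiny `SAMPLE_HSK` word list (with illustrative levels that have not been checked against the official lists) are all assumptions. A sentence is rated with the highest HSK level among its words, so each sentence lands in exactly one level.

```python
def is_hanzi(word):
    """True if every character is in the basic CJK Unified Ideographs block."""
    return bool(word) and all("\u4e00" <= ch <= "\u9fff" for ch in word)


def segment(sentence, vocabulary, max_len=4):
    """Greedy longest-match segmentation against a known word list."""
    words, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            chunk = sentence[i:i + size]
            if size == 1 or chunk in vocabulary:
                words.append(chunk)
                i += size
                break
    return words


def hsk_level(sentence, hsk_words, unknown_level=7):
    """Rate a sentence by the highest HSK level among its words.

    Words missing from the list count as `unknown_level` (hypothetical
    choice); punctuation and non-Hanzi text are ignored.
    """
    levels = [
        hsk_words.get(word, unknown_level)
        for word in segment(sentence, hsk_words)
        if is_hanzi(word)
    ]
    return max(levels, default=unknown_level)


# Illustrative word list only; a real run would load a full HSK list.
SAMPLE_HSK = {"我": 1, "喜欢": 1, "茶": 1, "这": 1, "个": 1, "环境": 3}

sentences = ["我喜欢这个环境。", "我喜欢茶。"]
# Sort so that the easiest sentences come first.
ordered = sorted(sentences, key=lambda s: hsk_level(s, SAMPLE_HSK))
for s in ordered:
    print(hsk_level(s, SAMPLE_HSK), s)
```

Merging levels 1 and 2 into a single "HSK2" group, as described above, would then only be a matter of mapping level 1 to 2 before grouping.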

Thanks
 
Hello Toom,

thanks for the praise! The 18,896 HSK sentences are taken from dict.cn (as I found out by googling), and they are independent of the Tatoeba sentences, i.e. there is no overlap.

Yes, there is. I suggest you download the full sentences ordered by HSK difficulty from here:


The pinyin line is filled with a scrambled sentence, but since Pleco (or any computer program) isn't able to derive the correct segmented pinyin from a Hanzi sentence, the data should be more valuable this way, and it will allow you to put it back together as an exercise.

Sorry about the confusion. A good rule of thumb is to download the files from the last post in a particular thread. :)

Feel free to keep asking if that still isn't quite what you were looking for.

Have fun,

Shun
 
Exactly what I was looking for. Thanks a lot!
 
Hello,
thank you for those useful word lists! My import of the English file is still running, but the txt looked very good.

May I ask if this is difficult to set up? I am a language enthusiast, so I am interested in many of the sentence translations available on there.
I noticed they have Chinese-Esperanto, Shanghainese-Mandarin, and even Shanghainese-English.

I would very much like to be able to convert some myself :)

Please advise me!
 
Hi Ria,

you're welcome! Indeed, it's easy to create flashcards for any combination of languages with Tatoeba's data.

All you need is to have Python installed on your Mac, PC, or Linux system, and to download the raw Tatoeba data ("Sentences" and "Links") from


Once you have these, you only need the Python source code for the sentence pair generation. Put the Tatoeba input files in the same directory as the Python script. I added the Python source to the beginning of this thread, so you can use this relatively simple bit of code. The only things you may need to adjust in the script are the language codes that the script looks for and the file names used for input and output.
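To give an idea of what the pair generation involves, here is a minimal sketch. It is not the script from the beginning of this thread. It assumes the Tatoeba exports are tab-separated, with one "id, language code, text" record per line in the Sentences file and one "sentence id, translation id" record per line in the Links file; the function name `build_pairs` and the file names in the usage note are assumptions.

```python
import csv
import io


def build_pairs(sentences_file, links_file, src_lang="cmn", dst_lang="fra"):
    """Return (source_text, translation_text) pairs for one language pair."""
    src, dst = {}, {}
    # Collect the sentences of both languages, keyed by sentence id.
    for row in csv.reader(sentences_file, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 3:
            continue  # skip malformed lines
        sent_id, lang, text = row[0], row[1], row[2]
        if lang == src_lang:
            src[sent_id] = text
        elif lang == dst_lang:
            dst[sent_id] = text

    pairs = []
    # Keep only the links that go from a source sentence to a target sentence.
    for row in csv.reader(links_file, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue
        if row[0] in src and row[1] in dst:
            pairs.append((src[row[0]], dst[row[1]]))
    return pairs


# Tiny in-memory sample standing in for the real downloads.
sample_sentences = io.StringIO(
    "1\tcmn\t我们试试看！\n2\tfra\tEssayons !\n3\teng\tLet's try!\n4\tcmn\t你好。\n"
)
sample_links = io.StringIO("1\t2\n1\t3\n2\t1\n3\t1\n4\t3\n")

for chinese, french in build_pairs(sample_sentences, sample_links):
    print(chinese, french, sep="\t")
```

With the real data, you would pass opened files instead of the in-memory samples, for example `open("sentences.csv", encoding="utf-8", newline="")` and `open("links.csv", encoding="utf-8", newline="")` (assumed file names), and change `src_lang` and `dst_lang` to the language codes you want.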

If you have a Mac, I suggest using Homebrew (https://brew.sh) to install Python; on Windows, you could use the regular Python installer from https://www.python.org/

PyCharm is a nice Integrated Development Environment (IDE) for Python; however, it is not mandatory to use it. You could also run the Python scripts from the command line and use a popular text editor like Visual Studio Code to edit them.

Hope this helps for a start,

Shun
 
Hi All,

Thank you everyone for sharing your word lists. I am new to learning Chinese and to using Pleco, and I am having a little trouble with the flashcards. I would appreciate any advice.

I have downloaded the file from post #88:


and I see from the .txt file that the cards are organised as sentences, which is exactly what I have been looking for. However, when I run a New Test in the app using the list, I am only tested on individual words. Have I set up the import file incorrectly? Or perhaps my test settings are wrong? If anyone knows how to resolve this so that the entire sentence appears on the card, I would be very grateful for the help.

Kind regards,
 
Hi Kypselus,

assuming that you're using Pleco 3.2, did you perhaps have the "Show" setting set to "Headword"?

IMG_9293.png

You could set it to "Pron + Defn" if you wanted to be tested on the missing Chinese Hanzi word.

If you just want to translate entire sentences, you'd have to use a different sentence file from earlier in this thread and import that, such as this one:


Just experiment with the "Show" setting, and you should be fine.

Enjoy, Shun
 
Hi all,
I’d like to thank everyone here especially @Shun and @leguan for the tremendous work that has been put throughout the years, constantly improving these flashcards.


I imported the flashcards in Pleco Legacy without any problem. However, I use Pleco 4.0 daily, so I imported them there, and it doesn’t seem to work as intended.

After tweaking the settings in many ways, I can’t seem to display the Chinese headword with proper colors and tones. Did I miss anything? Is this Pleco 4.0 issue solvable? I’d really love to hear if anyone managed to import these in 4.0 without issue.

See:
IMG_3872.png


Here, if the Mandarin field starts with pinyin w/ tone numbers, the headword will be colored; otherwise, it won’t.
IMG_3882.PNG
IMG_3878.png

(Original after importing, Chinese headwords are white w/o tones)

IMG_3873.png
IMG_3874.png

(edited manually, adding "bei3jing1" at the beginning of the Mandarin segment, then tabbing to the sentence)


Am I missing something while importing? Is there a workaround or a way I’d be able to make the flashcards look exactly as in Pleco Legacy but in 4.0?

Thanks in advance for any response,
Victor
 
Hello Victor,

thanks for the praise! I hope @leguan reads it, too. Looking at your first screenshot, it seems like the Pleco 4 beta colors the first n-p characters, starting from position p. (n: length of the headword, p: position of the headword in the Mandarin field)

Since Pleco 4 allows for much more customization of flashcards than Pleco 3.2, with cloze testing specifically supported, I am guessing that Mike will focus on that instead of making our old non-standard Pleco 3.2 cards work on Pleco 4.0. But @leguan and I can definitely support version 4 with new cards. If anyone has already figured out exactly how the Pleco 4 data format and flashcard system work for things like cloze testing, do write it here, before the official documentation is released. Then we should be able to convert the lists to the new format easily.

You're welcome, have fun,

Shun
 
Using the Mandarin field for sentences doesn't really work in 4.0, yeah - it only ever worked in 3.0 as a hack.

The next beta has an easy UI for building custom test types which will show arbitrary fields in arbitrary orders, so once that's out, you could use a batch command to put the sentence text into a newly created custom field and then design a test type to test you on that field, but doing it in the current beta would involve a lot of mucking around with presentations and test stages.
 