79,000 Chinese-English, French, German, Italian, Japanese, and Spanish sentences

Shun · Dec 9, 2020

Salut shaoguan!

They aren't ordered by HSK level yet. I attach the "hsk new.txt" file. It's an export from Pleco's built-in HSK list. I also attach a newer source file, with separated functions and constants for the languages so you don't have to enter them everywhere. Perhaps the attached source code is a work in progress. If so, you can use the other source file you already have. The (large) BCC corpus files would also be needed for the rating code, but you should also easily be able to leave them out to begin with.

Have fun, Shun

shaoguan · Dec 9, 2020

I will try my best.
Thanks for your help, Shun!

Shun · Dec 9, 2020

shaoguan said:
Which do you think would be faster?
Using regex to remove the pinyin and recreate the original sentences in the file "sentence_contextual_tatoeba_cn_fra_folded by HSK rating - random sentence selection.txt"
or
ordering the sentences in the file "sentences_cmn_fra_simplified_folded.txt" by HSK level?

You're welcome! I think the second option would be a bit easier. It will just take the computer a while to process. Just tell me if you would like me to do the whole thing.

shaoguan · Dec 10, 2020

Thanks for your offer!
For the BCC corpus: I have to generate the file "global_wordfreq.release (Hanzi only).txt", right?

Well, if it doesn't take you too much time, I accept your offer to do the whole thing.

Shun · Dec 10, 2020

You're welcome! I will try to do it soon and post the file here.

I think the BCC file is on the forums, too. (Search for "BCC")

Shun · Dec 10, 2020

Hello shaoguan,

I attach the HSK rated Chinese-French file, ready for importing into Pleco, and the Python script I did it with. Note that the HSK rating feature could still be improved, but especially the separation between HSK levels 3, 4, and 5 seems quite sensible to me. Sentences that are rated as HSK level 1 or 2 are usually harder than their indicated level.
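
As a rough illustration of how such a rating could work (this is not the attached script, only a minimal sketch): each sentence is split into words by greedy longest-match against an HSK word list, and the sentence gets the highest level found among its words. The names `segment`, `rate_sentence`, and `unknown_level`, and the sample word list in the usage note, are assumptions made for this example. The BCC frequency data mentioned earlier is not used here.

```python
def segment(sentence, vocab, max_len=4):
    """Greedy longest-match segmentation against a known vocabulary.

    Characters that do not start any known word are emitted one by one.
    """
    words = []
    i = 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if piece in vocab or size == 1:
                words.append(piece)
                i += size
                break
    return words


def rate_sentence(sentence, hsk_levels, unknown_level=7):
    """Return the highest HSK level among the words of a sentence.

    hsk_levels maps word -> level. Words that are not in the list count
    as unknown_level (an assumed placeholder); punctuation is skipped.
    """
    levels = [
        hsk_levels.get(word, unknown_level)
        for word in segment(sentence, hsk_levels)
        if word.isalnum()  # drops punctuation such as "。"
    ]
    return max(levels, default=unknown_level)
```

With a word list loaded into a dictionary such as `{"我": 1, "喜欢": 1, "经济": 4}`, the sentences could then be ordered with `sorted(sentences, key=lambda s: rate_sentence(s, hsk_levels))`. A single hard word decides the rating in this sketch, which is one reason a simple approach like this can misjudge short sentences.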

(Actually, I did the same thing two years ago, but now we have a more flexible, cleaner script to do it.)

Enjoy, have fun,

Shun

shaoguan · Dec 10, 2020

Thanks, Shun.
I am creating an Anki deck with these pretty well-written sentences.
If anybody is interested, I will share it!

Greets!

Toom · Dec 31, 2020

Thanks a lot for your hard work and those decks! I'm still a little bit confused by all the decks. The sentences are from Tatoeba, just like the 18,896 HSK sentences. So does that mean that they contain those?

I'd like the deck which contains the sentences sorted by HSK levels (HSK2 contains sentences with HSK1-2 words, HSK3 the sentences with HSK1-3 excluding those already in the HSK2, ...). I found one on this thread but it only contains the contextual sentences with one word in English. Is there a deck with the full sentence in Chinese?

Thanks

Shun · Dec 31, 2020

Hello Toom,

thanks for the praise! The 18,896 HSK sentences are taken from dict.cn (as I found out by googling), and they are independent of the Tatoeba sentences, i.e. there is no overlap.

There is one. I suggest you download the full sentences ordered by HSK difficulty from here:

Randomly Ordered Sentences Game

Dear all, dear @leguan, I've just had a promising idea for a new way of learning sentence structure using Pleco Flashcards and the Tatoeba sentences. It reminds me of similar questions in IQ tests. The game is simple: You are shown a sentence in your language and the corresponding Chinese...

plecoforums.com

The pinyin line is filled with a scrambled sentence, but since Pleco (or any computer program) isn't able to derive the correct segmented pinyin from a Hanzi sentence, the data should be more valuable this way, and it will allow you to put it back together as an exercise.
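
To illustrate the idea of a scrambled sentence, here is a minimal sketch that shuffles the characters of a Hanzi sentence. Whether the actual files were scrambled character by character or word by word is not stated here, so shuffling by character is an assumption, and `scramble_sentence` is a hypothetical helper name.

```python
import random


def scramble_sentence(sentence, rng=None):
    """Return the characters of a sentence in random order.

    rng is an optional random.Random instance, so that the shuffle can be
    made reproducible with a fixed seed.
    """
    rng = rng or random.Random()
    chars = list(sentence)
    rng.shuffle(chars)
    return "".join(chars)
```

The scrambled string contains exactly the same characters as the original, so it could be placed in the pinyin field of a card and put back in order as an exercise.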

Sorry about the confusion. A good rule of thumb is to download the files from the last post in a particular thread.

Feel free to keep asking if that still isn't quite what you were looking for.

Have fun,

Shun

Toom · Dec 31, 2020

Exactly what I was looking for. Thanks a lot!

Shun · Dec 31, 2020

Toom said:
Exactly what I was looking for. Thanks a lot!

Great! You're very welcome!

Ria · Feb 15, 2021

Hello,
thank you for those useful word lists! My import of the English file is still running, but the txt looked very good.

May I ask if this is difficult to set up? I am a language enthusiast, so I am interested in many of the sentence translations available on there.
I noticed they have Chinese-Esperanto, Shanghainese-Mandarin, and even Shanghainese-English.

I would very much like to be able to convert some myself.

Please advise me!

Shun · Feb 15, 2021

Hi Ria,

you're welcome! Indeed, it's easy to create flashcards for any combination of languages with Tatoeba's data.

All you need is to have Python installed on your Mac, PC, or Linux system, and to download the raw Tatoeba data ("Sentences" and "Links") from

Download sentences - Tatoeba

tatoeba.org

Once you have these, you only need the Python source code for the sentence pair generation. Put the Tatoeba input files in the same directory as the Python script. I added the Python source to the beginning of this thread, so you can use this relatively simple bit of code. The only things you may need to adjust in the script are the language codes that it looks for and the file names used for input and output.
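
For orientation, the pairing step can be sketched roughly as follows. This is not the script attached to this thread, only a minimal illustration under a few assumptions: that the sentences file is tab-separated with the columns id, language code, and text; that the links file is tab-separated with the columns sentence id and translation id; and that the output uses a tab-separated headword, pronunciation, definition layout with the pronunciation left empty. The function names are hypothetical.

```python
def load_sentences(path, languages):
    """Read the sentences file and keep only the requested languages.

    Returns a dict mapping sentence id -> (language code, text).
    """
    sentences = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3 and fields[1] in languages:
                sentences[fields[0]] = (fields[1], fields[2])
    return sentences


def build_pairs(sentences, links_path, source_lang, target_lang):
    """Follow the links file and collect (source text, target text) pairs."""
    pairs = []
    with open(links_path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue
            source = sentences.get(fields[0])
            target = sentences.get(fields[1])
            if (source and target
                    and source[0] == source_lang
                    and target[0] == target_lang):
                pairs.append((source[1], target[1]))
    return pairs


def write_flashcards(pairs, out_path):
    """Write one card per line: sentence, empty pronunciation, translation."""
    with open(out_path, "w", encoding="utf-8") as f:
        for source_text, target_text in pairs:
            f.write(f"{source_text}\t\t{target_text}\n")
```

A Chinese-French run would then call `load_sentences("sentences.csv", {"cmn", "fra"})`, pass the result to `build_pairs(sentences, "links.csv", "cmn", "fra")`, and hand the pairs to `write_flashcards`. The file names and language codes in that call are exactly the parts you would change for another language pair; check them against the files you actually downloaded.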

If you have a Mac, I suggest using Homebrew (https://brew.sh) to install Python; on Windows, you could use the regular Python installer from https://www.python.org/

PyCharm is a nice Integrated Development Environment (IDE) for Python; however, it is not mandatory to use it. You could also run the Python scripts from the command line and use a popular text editor like Visual Studio Code to edit them.

Hope this helps for a start,

Shun

Kypselus · Dec 22, 2024

Hi All,

Thank you everyone for sharing your word lists. I am new to learning Chinese and to using Pleco, and I am having a little trouble with the flashcards. I would appreciate any advice.

I have downloaded the file from post #88:

79,000 Chinese-English, French, German, Italian, Japanese, and Spanish sentences

Hi leguan, that sounds excellent, I'm all for a clear thread structure. So I will use another thread to answer your post and edit in a link to the new thread here: HSK Difficulty Thread Best, Shun

www.plecoforums.com

and I see from the .txt file that the cards are organised as sentences, which is exactly what I have been looking for. However, when I run a New Test in the app using the list, I am only tested on individual words. Have I set up the import file incorrectly? Or perhaps my test settings are wrong? If anyone knows how to make the entire sentence appear on the card, I would be very grateful for the help.

Kind regards,

Shun · Dec 22, 2024

Hi Kypselus,

assuming that you're using Pleco 3.2, did you perhaps have the "Show" setting set to "Headword"?

You could set it to "Pron + Defn" if you wanted to be tested on the missing Chinese Hanzi word.

If you just want to translate entire sentences, you'd have to import a different sentence file from earlier in this thread, such as this one:

79,000 Chinese-English, French, German, Italian, Japanese, and Spanish sentences

Dear all, here is an archive containing translated Chinese sentences in the following language pairs, ready for importing into Pleco: Chinese-English 41,955 sentences Chinese-French 15,740 sentences Chinese-German 4,566 sentences Chinese-Italian 3,800 sentences...

plecoforums.com

Just experiment with the "Show" setting, and you should be fine.

Enjoy, Shun

VLP · Mar 23, 2025

Hi all,
I’d like to thank everyone here, especially @Shun and @leguan, for the tremendous work that has been put in over the years, constantly improving these flashcards.

I imported the flashcards into Pleco Legacy without any problem. However, I use Pleco 4.0 daily, so I imported them there as well, and they don’t seem to work as intended.

After tweaking the settings in many ways, I still can’t get the Chinese headword to display with proper colors and tones. Did I miss anything? Is this Pleco 4.0 issue solvable? I’d really love to hear if anyone has managed to import these into 4.0 without issue.

See:

Here, if the Mandarin field starts with pinyin with tone numbers, the headword will be colored; otherwise it won’t.

(Original after importing, Chinese headwords are white w/o tones)

(edited manually, adding "bei3jing1" at the beginning of the Mandarin segment, then tabbing to the sentence)

Am I missing something while importing? Is there a workaround or a way I’d be able to make the flashcards look exactly as in Pleco Legacy but in 4.0?

Thanks in advance for any response,
Victor

Shun · Mar 23, 2025

Hello Victor,

thanks for the praise! I hope @leguan reads it, too. Looking at your first screenshot, it seems like the Pleco 4 beta colors the first n-p characters, starting from position p. (n: length of the headword, p: position of the headword in the Mandarin field)

Since Pleco 4 allows for much more customization of flashcards than Pleco 3.2, with cloze testing specifically supported, I am guessing that Mike will focus on that instead of making our old non-standard Pleco 3.2 cards work on Pleco 4.0. But @leguan and I can definitely support version 4 with new cards. If anyone has already figured out exactly how the Pleco 4 data format and Flashcards system work for things like cloze testing, do write it here, before the official documentation is released. Then we should be able to convert the lists to the new format easily.

You're welcome, have fun,

Shun

mikelove · Mar 25, 2025

Using the Mandarin field for sentences doesn't really work in 4.0, yeah - it only ever worked in 3.0 as a hack.

The next beta has an easy UI for building custom test types which will show arbitrary fields in arbitrary orders, so once that's out, you could use a batch command to put the sentence text into a newly created custom field and then design a test type to test you on that field, but doing it in the current beta would involve a lot of mucking around with presentations and test stages.
