79,000 Chinese-English, French, German, Italian, Japanese, and Spanish sentences

#1
Dear all,

here is an archive containing translated Chinese sentences in the following language pairs, ready for importing into Pleco:

Chinese-English 41,955 sentences
Chinese-French 15,740 sentences
Chinese-German 4,566 sentences
Chinese-Italian 3,800 sentences
Chinese-Japanese 3,936 sentences
Chinese-Spanish 8,995 sentences
Chinese-Russian 5,165 sentences

Most of the sentences are quite simple and taken from real-life situations, so if you're a novice to upper-intermediate learner and your mother tongue is English, French, German, Spanish, Japanese, or Italian, they're likely to be useful for reinforcing your feel for idiomatic expressions and sentence structure. They work best with the Self-graded study mode.

I converted the list of sentences and the list of sentence correspondences into the Pleco format using Python, converted the text to Simplified Chinese with the Python package "hanziconv", segmented the Hanzi with Stanford's Chinese Treebank word segmenter, and finally added pinyin using Pleco's "Fill in missing fields" feature under Import Flashcards. The files are in Simplified Chinese; if you require Traditional Chinese, simply activate the "Fill in missing fields" switch when importing.
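For the curious, the first step of that pipeline can be sketched roughly like this. This is a minimal, hypothetical reconstruction, not the actual script: it assumes Pleco's tab-separated flashcard text format (headword, pinyin, definition), and `to_simplified` is a tiny stand-in for hanziconv's `HanziConv.toSimplified`:

```python
def to_simplified(text):
    # Stand-in for hanziconv's HanziConv.toSimplified(); here just a
    # tiny illustrative mapping, NOT the real package.
    table = str.maketrans({"們": "们", "個": "个", "國": "国"})
    return text.translate(table)

def make_pleco_rows(pairs):
    """Turn (hanzi, translation) pairs into Pleco-importable lines.

    Pleco's flashcard text format is tab-separated:
        headword <TAB> pinyin <TAB> definition
    The pinyin column is left empty here; Pleco's
    "Fill in missing fields" option adds it on import.
    """
    rows = []
    for hanzi, translation in pairs:
        rows.append("\t".join([to_simplified(hanzi), "", translation]))
    return rows

rows = make_pleco_rows([("我們是朋友。", "We are friends.")])
# -> ["我们是朋友。\t\tWe are friends."]
```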

The source data comes from the excellent site Tatoeba.org, which covers many languages, so you could get sentence pairs for almost any other language combination. If someone asks for a particular language pair, such as Chinese-Swedish, I can easily convert it for them.

The sentences and translations were (and are) made available under the following Creative Commons-Attribution license:

https://creativecommons.org/licenses/by/2.0/


Edit: @leguan The word segmentation for Japanese and English is better now, because I first converted the sentences to Simplified Chinese and then applied the Stanford word segmenter. So please re-download.


Hope you like it, cheers,

Shun


Edit: For @leguan's highly useful sentence contextual flashcards based on these lists and the HSK list, please head down to message #42, or use this link:

https://plecoforums.com/threads/79-...apanese-and-spanish-sentences.5925/post-45063
 

Attachments

#2
Hi Shun,
Great work!

I am very keen to add the Chinese-English sentences to the sentence contextual flashcards I made from the "18,896 HSK sentences" you posted in September 2017. As you mentioned above, these Tatoeba sentences do indeed seem very simple and taken from real-life situations, which I believe is ideal for creating a minimal context that lets the user disambiguate which word is being tested!

https://plecoforums.com/threads/18-896-hsk-sentences.5615/#post-42743

I have a couple of questions, though. In your earlier sentence lists, the pinyin for "words" in the example sentences did not have space characters within them. This made identifying word matches in the sentences much more reliable. For example, in the sentence below, the pinyin for the words soldier, stand guard, and coffin have no internal spaces: "shi4bing1", "shou3wei4", and "ling2jiu4".

四个士兵守卫灵柩.
si4 ge4 shi4bing1 shou3wei4 ling2jiu4
Four soldiers stood guard over the coffin.

I am thus wondering if the pinyin package you used is able to output pinyin word by word, as in the format above.
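The word-by-word format above amounts to grouping per-character pinyin along segmentation boundaries. A minimal sketch of the idea, where `CHAR_PINYIN` is a hypothetical lookup standing in for a real dictionary-backed converter:

```python
# Hypothetical per-character pinyin lookup; a real run would use a
# dictionary-backed converter (or Pleco's own fill-in feature).
CHAR_PINYIN = {
    "四": "si4", "个": "ge4", "士": "shi4", "兵": "bing1",
    "守": "shou3", "卫": "wei4", "灵": "ling2", "柩": "jiu4",
}

def word_pinyin(words):
    """Join pinyin syllables inside each segmented word, with spaces
    only between words: ["士兵"] -> "shi4bing1"."""
    return " ".join(
        "".join(CHAR_PINYIN[ch] for ch in word) for word in words
    )

# Segmented output such as a word segmenter might give:
words = ["四", "个", "士兵", "守卫", "灵柩"]
print(word_pinyin(words))  # si4 ge4 shi4bing1 shou3wei4 ling2jiu4
```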

Also, I can't remember if I mentioned that I have lived in Japan for the last 19 years, working at a Japanese company, so my Japanese is much better than my Chinese. As such, I am also very keen to make a Chinese-Japanese version of the sentence contextual flashcards mentioned above. Could I take you up on your offer and ask you to convert the Chinese-Japanese sentences? There is no hurry, though - first I will work on adding the new Chinese-English sentences to my earlier flashcards.

I don't know how many other users are interested, but after I create the updated Chinese-English (and hopefully Chinese-Japanese) flashcards, I will post them here.

Kind regards,

Leguan
 
#3
Hi Shun,
Oh sorry, I read your first post more carefully and discovered that you had already covered my first question above, i.e., pinyin is "done syllable by syllable, without any word recognition". Does that mean that word recognition is not possible? For creating the "sentence contextual" flashcards I mentioned above, word recognition is probably essential to avoid incorrect matches.
Thanks again!
 
#4
Hi Leguan,

thanks a lot for your interest and compliment!

Yes, I just googled "python hanzi to pinyin" and used the most sensible search result I found. That package doesn't seem to support word segmentation. I see now that I can use the Stanford Word Segmenter from the nltk package, which puts spaces between the Hanzi, and then do the pinyin conversion without adding spaces afterwards. Actually, it would be best to use Pleco's Hanzi-to-pinyin conversion, because it can draw on its large set of dictionaries to pick the right pinyin syllables for multi-syllable words. I will experiment with it and post the results here.

Since it is so easy to do, I'm already sending you the Chinese-Japanese sentence pairs with the unsegmented pinyin, just so you get an impression of the translation quality. Later, I will try out the nltk package. I just hope that Pleco preserves the spaces between words in the Hanzi-to-pinyin conversion when importing into Flashcards; in that case the results should be pretty good.

I guess it can take time for other users' interest to grow, but you can't know unless you post, so you definitely should. :) I am interested, anyway.

I'm also attaching the Python script I used for matching the sentences up (it uses a lot of RAM), for the sake of transparency.
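The matching step that script performs can be sketched like this. This is a hypothetical reconstruction, not the attached script itself; it assumes Tatoeba's export layout, i.e. a tab-separated sentences.csv with (id, language, text) rows and a links.csv with (sentence id, translation id) pairs:

```python
def pair_sentences(sentence_rows, link_rows, src="cmn", dst="eng"):
    """Match Tatoeba sentences with their translations.

    sentence_rows: iterable of (id, lang, text) tuples
    link_rows:     iterable of (sentence_id, translation_id) tuples

    All sentences are held in an in-memory dict for fast lookup,
    which is why a run over the full export eats a lot of RAM.
    """
    by_id = {sid: (lang, text) for sid, lang, text in sentence_rows}
    pairs = []
    for sid, tid in link_rows:
        s, t = by_id.get(sid), by_id.get(tid)
        if s and t and s[0] == src and t[0] == dst:
            pairs.append((s[1], t[1]))
    return pairs

# Tiny synthetic example in place of the real CSV files:
sentences = [("1", "cmn", "你好。"), ("2", "eng", "Hello."), ("3", "fra", "Salut.")]
links = [("1", "2"), ("1", "3")]
print(pair_sentences(sentences, links))  # [('你好。', 'Hello.')]
```

On the real exports, the rows would come from `csv.reader(f, delimiter="\t")` over the downloaded files.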

Kind regards,

Shun
 

Attachments

#5
leguan said:
> Hi Shun,
> Oh sorry, I read your first post more carefully and discovered that you had already covered my first question above, i.e., pinyin is "done syllable by syllable, without any word recognition". Does that mean that word recognition is not possible? For creating the "sentence contextual" flashcards I mentioned above, word recognition is probably essential to avoid incorrect matches.
> Thanks again!
No no, it's definitely possible; I just started out with this quick solution. I'll try it right now!
 
#6
Hi Leguan,

it worked great: Pleco added pinyin while retaining the Hanzi spaces. I am sending you the Traditional Chinese file; if you import it in Pleco with "Fill in missing fields", the Simplified part will be filled in automatically, of course.

Other languages will follow later!

Kind regards,

Shun
 

Attachments

#7
Hi Shun,
Thank you very much indeed. Yes, that is fantastic!
Thank you also very much for creating the CN-JPN flashcards and sharing your python script.
I will post my results as soon as I have included these Tatoeba sentences in some of my "sentence contextual" flashcards or created new Japanese ones!
I eagerly await your Chinese-English flashcards with PKU segmenting :) - again, no hurry, as I will be away for a week to visit my mother.

Once again thank you so much!
Leguan
 
#8
Hi Leguan,

I'm glad to hear it! The English version should be forthcoming in a few hours, but as you say, no hurry. :)

You're very welcome & enjoy the time off,

Shun
 
#11
Hi @leguan,

I thought you might also find uses for a Japanese-English version of the Tatoeba sentence list. It has a whopping 198,210 sentence pairs, which of course won't work in Pleco, but will in Anki. My pleasure!

Cheers,

Shun
 

Attachments

#12
Shun said:
> Hi Leguan,
>
> I thought you may also find uses for a Japanese-English version of the Tatoeba sentence list. It has a whopping 198,210 sentence pairs, which of course won't work in Pleco, but will in Anki. My pleasure!
>
> Cheers,
>
> Shun
Wow, that's fantastic timing! Thank you very much, Shun! :D
Yes, that is also very useful to me! Actually, I think it is possible to work with Japanese in Pleco to some extent if we don't use the pinyin field in flashcards or dictionary entries, or at least don't use it as intended ;)

I have already made two dictionaries from Chinese and Japanese word frequency lists that list all the words containing a specific hanzi (or kanji), sorted in frequency order. For the Chinese dictionary, you can use Pleco's "Popup Definition" functionality to look at all the definitions of words using a specific Chinese character in order of frequency, which is very useful for getting an overview of the vocabulary set for that character. For the Japanese dictionary, I can't review the definitions in Pleco yet because I don't have a Japanese dictionary in Pleco, but that will be my next project :)

In short, I think it may well be possible to create sentence contextual writing practice flashcards that are also usable in Pleco, but I'll have to investigate further whether there are any limitations with respect to using the pinyin field in flashcards.

On another note, please find attached the sentence contextual writing practice flashcards as promised above.
(Please refer to your earlier thread (https://plecoforums.com/threads/18-896-hsk-sentences.5615/) for information and screenshots showing how to use these cards; the format and usage are the same as for the flashcards based on the "18,896 HSK sentences" you posted earlier.)

I have added all of the "Chinese-English 41,955" sentences to my earlier sentence list, and the new flashcards are now based on a total of 63,557 sentences. I have updated my Excel spreadsheet to look for high-frequency words (i.e. the top 100K words in the recently discovered BCC corpus, and the top 60K words in the LWC and SUBTLEX corpora).

The spreadsheet finds matches of these words in the sentences in a rather crude but reasonably reliable manner, by looking for sentences in which BOTH the pinyin and the Chinese characters for a word can be found. (It is not sophisticated enough to check that the word and pinyin are in the same position in the sentence, so there will likely be a few incorrect matches, but I did not encounter any such cases in practice with my earlier flashcards, so I think it is an acceptable hack. In any case, if such an issue is encountered, it is just a matter of deleting the flashcard for which it occurred.)
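The check described above can be sketched in a few lines of Python. This is a hypothetical illustration of the idea, not the spreadsheet's actual formulas; the word's pinyin is matched as a whole space-delimited token, which is why run-together pinyin fails:

```python
def crude_match(word_hanzi, word_pinyin, sent_hanzi, sent_pinyin):
    """Accept a sentence as testing a word if BOTH the characters and
    the word's pinyin (as one space-delimited token, punctuation
    stripped) occur in the sentence. Positions are not cross-checked,
    so rare false positives are possible."""
    tokens = [t.strip('!?.,;:"') for t in sent_pinyin.split()]
    return word_hanzi in sent_hanzi and word_pinyin in tokens

print(crude_match("主意", "zhu3yi4", "好主意!", "hao3 zhu3yi4!"))  # True
print(crude_match("主意", "zhu3yi4", "好主意!", "hao3zhu3yi4!"))   # False: pinyin ran together
```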

Some statistics regarding these flashcards:
Total number of flashcards: 55,478
Total number of unique sentences: 39,240
Total number of unique words tested: 20,248
Maximum sentence length: 35 Chinese characters
Average sentence length: 22.76 Chinese characters

Some further statistics for another set of flashcards which also include the "Chinese-Japanese 3,936" sentences:
(Since the number of Chinese-Japanese sentences is only a small proportion of the total number of sentences, the statistics below also give a good idea of the breakdown for the relative proportion of HSK and non-HSK words in the Chinese-English flashcard set)

Total number of HSK words: 4,595
Total number of sentences testing HSK words: 27,234
Average number of sentences per HSK word: 5.92

Total number of non-HSK words: 15,653
Total number of sentences testing non-HSK words: 30,606
Average number of sentences per non-HSK word: 1.95

As you can see from the above, the idea and strategy was to include more sentences for HSK words compared with non-HSK words.
In addition, the maximum number of sentences per word of a given HSK grade is limited to (HSK grade × 2) − 1, as we need a lot more practice with higher-grade words than with lower-grade ones :)

Unfortunately, my word matching strategy only works if there are spaces between words. For example, the pinyin for the sentence "好主意!" is "hao3zhu3yi4!" without any spaces, so it cannot match the word "主意"; the pinyin would have to be "hao3 zhu3yi4!" to achieve a successful match. Since there are a lot of cases where the pinyin runs together like this, the number of word matches is not as high as it could be. But in any case, a lot more words and sentences are being tested than in the earlier flashcards I made based on the "18,896 HSK sentences" you posted, so I'm quite happy with this new expanded set of flashcards! :)

Hope they may be of use to anyone who wants to practice writing Chinese characters within the context of a sentence, as is invariably the case in real life.

Kind regards,
leguan
 

Attachments

#13
Hi leguan,

thank you very much, your lists look really nice and cleanly made! Hats off!

I think the best way for me to use them is to show just the Chinese sentence with the pinyin in it (i.e. what's stored in Pleco under pinyin) and let me figure out the rest, which allows me to:

1. Figure out the right Hanzi from the sentence alone
2. Determine the sentence meaning in English

both at the same time.

I acknowledge the pinyin problem. I've heard that some research groups are now successfully using semantic analysis to translate between languages (including Chinese, I think), which would certainly include better Chinese word segmentation. Let's hope these tools become available to everyone soon.

I will tell you how I'm faring once I've done a couple hundred repetitions with the list.

Kind regards,

Shun
 
#14
Hi Shun,

I'm happy to hear you like them, and agree with your way of using them too!
I use them the same way, and just use "Reveal Definition" to show the sentence meaning in English if I have difficulty understanding the Chinese and need a further hint :)

That's also interesting to hear that there is hope for improved Chinese word segmentation. It can only help us build even better tools for learning Mandarin!

Yes, I would like to hear your thoughts after using the flashcards a bit - I can try to refine them based on your feedback!

Best regards,
leguan
 
#15
Hi Shun,
Just a note regarding using non-pinyin text in the flashcard pinyin field: characters like 、。?! etc. are deleted by Pleco on import. This makes some sentences hard to decipher correctly. However, I discovered just now that single-byte characters like , . ? ! etc. remain intact. I will therefore convert such "disappearing" double-byte characters to equivalent single-byte characters and repost the updated flashcards, hopefully by tomorrow.

Best regards
leguan
 
#17
Hi Shun,
Thank you very much for doing the replacements on the English file and for noting that the Chinese quotation marks “” also need replacing!

I have also replaced the following characters in both files and removed the trailing spaces in the pinyin that were added during the replacement process:

?→?
!→!
。→.
、→,
“→"
”→"
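In Python, such a replacement table can be applied in one pass with `str.translate`. A minimal sketch (the full-width comma, semicolon, and colon entries are additions beyond the table above, and `fix_pinyin_punct` is a hypothetical name):

```python
# Map full-width punctuation, which Pleco drops from the pinyin
# field, to half-width equivalents.
FULLWIDTH_TO_ASCII = str.maketrans({
    "？": "?", "！": "!", "。": ".", "、": ",",
    "“": '"', "”": '"', "，": ",", "；": ";", "：": ":",
})

def fix_pinyin_punct(text):
    # rstrip() also removes the trailing spaces mentioned above.
    return text.translate(FULLWIDTH_TO_ASCII).rstrip()

print(fix_pinyin_punct("hao3 zhu3yi4！"))  # hao3 zhu3yi4!
```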

Please find attached the updated files. Have a nice day!

Best regards,
leguan
 

Attachments

#18
Hi leguan,

great, thank you! However, don't forget the Chinese comma "," which needs to be replaced by the Western comma ",". :) No big deal!

You too, best regards,

Arthur


PS: I started practicing with it and already love it.
 
#19
Hi leguan,

what I thought might be nice would be to separate the HSK and the Tatoeba lists, because they differ quite markedly in difficulty, instead of merging everything into one list. But it is really useful already!

Kind regards,

Shun
 
#20
Hi Shun,

Happy to hear you are enjoying studying with them:)

Thank you for pointing out the overlooked Chinese comma issue.
Please find attached the flashcards with 6,306(!) Chinese commas ("，") replaced.
That is a lot of sentences which will now be easier to parse correctly in our brains!
I also discovered Chinese semicolons (；) and colons (：) and have replaced them with Western semicolons and colons as well.

Your observation that the HSK and Tatoeba lists differ markedly in difficulty is very enlightening.
Yes, splitting them into two separate lists is a very good idea!

I have therefore created two new flashcard sets as attached with the following characteristics:

<<Tatoeba Set>>
Total number of original sentences in list: 41,588 (a)
Total number of flashcards: 25,900
Total number of unique sentences: 20,788 (b)
Percentage of original sentences utilized (= (b)/(a)) = 49.6%
(Note the lower percentage than the HSK set due to the less well segmented pinyin)
Total number of unique words tested: 10,823

Total number of HSK words: 3,056
Total number of flashcards testing HSK words: 11,877
Average number of flashcards per HSK word: 3.88

Total number of non-HSK words: 7,767
Total number of flashcards testing non-HSK words: 14,023
Average number of flashcards per non-HSK word: 1.80

<<HSK Set>>
Total number of original sentences in list: 18,261 (a)
Total number of flashcards: 38,773
Total number of unique sentences: 17,823 (b)
Percentage of original sentences utilized (= (b)/(a)) = 97.6%
(Note the higher percentage than the Tatoeba set due to the better segmented pinyin)
Average number of flashcards per unique sentence: 2.17
Total number of unique words tested: 15,617
Average number of flashcards per unique word tested: 2.48

Total number of HSK words: 4,279
Total number of flashcards testing HSK words: 21,303
Average number of flashcards per HSK word: 4.97

Total number of non-HSK words: 11,339
Total number of flashcards testing non-HSK words: 17,470
Average number of flashcards per non-HSK word: 1.54


I have also rebuilt the original Tatoeba+HSK+α ENG and ENG+JPN lists as follows:
(Some people might prefer these lists and others might prefer the originals; this new build has more sentences per word than the earlier lists. Pick your poison :))

<Tatoeba+HSK+α English Only> (Rebuilt)
Total number of original sentences in list: 63,556 (a)
Total number of flashcards: 68,917
Total number of unique sentences: 43,572 (b)
Percentage of original sentences utilized (= (b)/(a)) = 68.5%
(Better than Tatoeba, worse than HSK)
Average number of flashcards per unique sentence: 1.58
Total number of unique words tested: 20,247
Average number of flashcards per unique word tested: 3.40

Total number of HSK words: 4,594
Total number of flashcards testing HSK words: 36,690
Average number of flashcards per HSK word: 7.98

Total number of non-HSK words: 15,653
Total number of flashcards testing non-HSK words: 35,181
Average number of flashcards per non-HSK word: 2.24

<Tatoeba+HSK+α English + Japanese> (Rebuilt)
Total number of original sentences in list: 67,458 (a)
Total number of flashcards: 71,873
Total number of unique sentences: 45,998 (b)
Percentage of original sentences utilized (= (b)/(a)) = 68.1%
(Better than Tatoeba, worse than HSK)
Average number of flashcards per unique sentence: 1.56
Total number of unique words tested: 20,247
Average number of flashcards per unique word tested: 3.54

Total number of HSK words: 4,594
Total number of flashcards testing HSK words: 36,789
Average number of flashcards per HSK word: 8.00

Total number of non-HSK words: 15,653
Total number of flashcards testing non-HSK words: 35,181
Average number of flashcards per non-HSK word: 2.24

Enjoy:)

Best regards,
leguan
 

Attachments
