79,000 Chinese-English, French, German, Italian, Japanese, and Spanish sentences

#21
Hi leguan,

wow, you are like an angel! :) Now I can choose whether I would like to focus on more advanced vocabulary (in the HSK sentence set) or on sentence structure and everyday communication (Tatoeba). Thank you kindly!

Good point also on the semicolons and colons. It's too bad I couldn't get any better word segmentation on the Tatoeba lists; I will try to improve it by reading up on the topic. Segmenting and adding pinyin in one go would surely be preferable to using two different tools (the Stanford segmenter, which puts spaces between the Hanzi, and Pleco's built-in pinyin completion function).

Let's hope that others will jump on the contextual writing practice bandwagon as time goes on.

One piece of feedback on the system: after about a day of successful intermittent study with the contextual writing practice system, I can already say that when I want to test myself on the same words in the Chinese-English direction using a filter, I set the Flashcards module to "Remap cards to dicts" so I can see a full dictionary definition of each word, which of course conveys more information than the HSK/Tatoeba sentences alone would. One already studies the words' context very well in the English-Chinese direction, so that should be quite sufficient. It's also great to see more usage examples from a good dictionary to complement the context of the HSK/Tatoeba sentences.

Perhaps one last wish for now (no hurry): if it's easy enough, could you perhaps also make a Tatoeba Chinese-to-German contextual writing practice list? Then I could also give it to my younger students, who are German native speakers (and to their teacher); they would surely love it if I explain it to them properly. It's often better to test against your mother tongue than through English.

Would you like me to create yet more language pairs with my Python script? Tatoeba offers all of the following languages, and any language can be paired with any other language:

https://tatoeba.org/eng/stats/sentences_by_language
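In case anyone wants to build their own pairs, the core logic can be sketched in a few lines of Python. This is just an illustration, not my actual script, and it assumes Tatoeba's tab-separated `sentences.csv` and `links.csv` exports:

```python
def build_pairs(sentences_path, links_path, src_lang, tgt_lang):
    """Pair src_lang sentences with their tgt_lang translations, using
    Tatoeba's tab-separated exports:
      sentences.csv: sentence_id <TAB> language <TAB> text
      links.csv:     sentence_id <TAB> translation_id
    """
    texts, langs = {}, {}
    with open(sentences_path, encoding="utf-8") as f:
        for line in f:
            sid, lang, text = line.rstrip("\n").split("\t", 2)
            texts[sid], langs[sid] = text, lang
    pairs = []
    with open(links_path, encoding="utf-8") as f:
        for line in f:
            a, b = line.rstrip("\n").split("\t")
            # Keep only links whose endpoints match the requested languages.
            if langs.get(a) == src_lang and langs.get(b) == tgt_lang:
                pairs.append((texts[a], texts[b]))
    return pairs
```

Since any sentence can link to translations in many languages, the same function handles any pair, e.g. `build_pairs("sentences.csv", "links.csv", "cmn", "deu")` for Chinese-German.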

Best regards, have a nice evening,

Shun
 
#22
Hi Shun,

Thank you very much for your reply and your kind offer to create more language pairs.

I'm also happy to hear that these flashcards are of use to you in your studies. :)

Yes, I will be happy to make the Chinese-German flashcards. However, there are a number of manual steps I will have to perform to incorporate the German sentences, so it might take me a little while.

I'll be happy to hear whether they are of use to your students. If my VBA code were not so embarrassingly untidy, I would be happy to share the spreadsheet with you as well, but it will probably take me much longer to get round to that. :rolleyes:

Best regards
leguan
 
#23
Hi leguan,

that's very kind, thank you. No rush; it'll be done when it's done. I have some Pleco students, and I will show it to them in the hope that they will pass it on to others. I will be able to give it to them in mid-October.

Oh, I have programmed in VBA once as well (to do a data file conversion). I hear it's a fine language for getting things done quickly. If you want, please feel free to send the Excel file and any other necessary files over to me (in that case, I'll send you my e-mail address by PM); trying to make sense of the VBA code would be an excellent exercise for me. I won't disseminate it, and I could send you back a cleaned-up version later if I manage it.

Cheers and best regards,

Shun
 
Last edited:
#24
How do I open the files as notecards to be tested? I downloaded the files but it seems like they can only be opened in the Pleco reader. It'd be nice to test them rather than scroll through them.

Thank you so much
 
#25
Hi broadfootb,

yeah, you have to choose Import Cards from the sidebar's Import / Export menu item. There, you need to choose the file (it is in the "Inbox") and the following settings:

[Screenshot: Import flashcards.jpg]

That will make them accessible to your Flashcards module and allow you to be tested on them.

Hope this helps,

Shun
 
#27
I've finished cleaning up the code and have completed incorporating the Chinese-German flashcards in my Excel spreadsheet.

So hopefully I will shortly be able to post the Chinese-German flashcards as well as updated flashcards for the four sets I posted above.

Best regards
leguan
 
#29
Awesome! I will take any Chinese-German sources for Pleco I can get my hands on. Shun, if you can get Chinese-Russian sentences, I'd greatly appreciate it.

Is it possible to do it for Thai or Laotian? I'd certainly appreciate it.

Thanks for your help.
 
#30
Great! Here are the Tatoeba Chinese-Russian and Russian-German sentences. Would you like Thai-Chinese, Thai-English, or Thai-German; Laotian-Chinese, Laotian-English, or Laotian-German; or all of them? :) Edit: I see Thai and Lao only have 600 and 40 sentences, respectively. At least they're represented on Tatoeba, and I can create the lists anyway.

@leguan: Many thanks, I'm very much looking forward to them!
 

Attachments

#31
Chinese-Thai and Chinese-Lao sounds good.

Thanks for the Chinese-Russian sentences. I haven't ever seen anything Russian language related for Pleco so this is awesome.
 
#32
Hi broadfootb and leguan,

I'm glad to hear it! Here are the Chinese-Thai and Chinese-Lao sentences. Enjoy!

@leguan: While making the Chinese-Thai and Chinese-Lao sentences, I've just found out that Pleco naturally has segmenting built in as well, and it seems to be quite good. Perhaps using the Stanford segmenter wasn't even helpful: perhaps Pleco just stripped out the space characters between the Stanford-segmented words, so it was all Pleco's segmenting, or perhaps the Stanford segmenter even made things worse than Pleco's segmenter alone. I will investigate.

Edit: I see Pleco segments unsegmented Chinese sentences the same way, so it stripped the Stanford segmenter's spaces, and the latter didn't contribute anything. Therefore, the files I uploaded are the best I can currently offer.

Best regards,

Shun
 

Attachments

#33
I went through the Chinese-Lao deck today. It contains two Chinese-Russian cards, and the translation of one card was incorrect, but otherwise it was a good review of Lao and Chinese. I will check out the Chinese-Thai and Chinese-Russian decks later this week. I have my hands full with the Chinese-English deck, and I might take a crack at the Chinese-Japanese deck. :cool:

I just love the idea of using Pleco to learn languages other than Cantonese, German, Italian, Spanish, French, or Mandarin. This is so cool, and it's nice of Shun and leguan to help out the Pleco community like this.
 
#34
Hi Shun and broadfootb,

Here are Chinese-German flashcards as well as the updated Tatoeba English, HSK English, Tatoeba+HSK+α English, and Tatoeba+HSK+α English+Japanese flashcards. Sorry for the delay!

<Tatoeba Chinese-German>
Total number of original sentences in list: 4,538 (a)
Total number of flashcards: 5,740
Total number of unique sentences: 3,411 (b)
Percentage of original sentences utilized (= (b)/(a)) = 75.1%
Average number of flashcards per unique sentence: 1.68
Total number of unique words tested: 3,388
Average number of flashcards per unique word tested: 1.69

Total number of HSK words: 1,387
Total number of flashcards testing HSK words: 2,631
Average number of flashcards per HSK word: 1.89

Total number of non-HSK words: 2,001
Total number of flashcards testing non-HSK words: 3,110
Average number of flashcards per non-HSK word: 1.55


<Tatoeba English> (Rebuilt again with improved pinyin matching)
Total number of original sentences in list: 41,587 (a)
Total number of flashcards: 39,828
Total number of unique sentences: 28,689 (b)
Percentage of original sentences utilized (= (b)/(a)) = 69.0%
Average number of flashcards per unique sentence: 1.38
Total number of unique words tested: 13,405
Average number of flashcards per unique word tested: 2.97

Total number of HSK words: 3,353
Total number of flashcards testing HSK words: 15,456
Average number of flashcards per HSK word: 4.60

Total number of non-HSK words: 10,052
Total number of flashcards testing non-HSK words: 24,372
Average number of flashcards per non-HSK word: 2.42


<HSK English> (Rebuilt again with improved pinyin matching)
Total number of original sentences in list: 18,261 (a)
Total number of flashcards: 40,054
Total number of unique sentences: 17,750 (b)
Percentage of original sentences utilized (= (b)/(a)) = 97.2%
Average number of flashcards per unique sentence: 2.26
Total number of unique words tested: 15,561
Average number of flashcards per unique word tested: 2.57

Total number of HSK words: 4,236
Total number of flashcards testing HSK words: 22,541
Average number of flashcards per HSK word: 5.32

Total number of non-HSK words: 11,325
Total number of flashcards testing non-HSK words: 17,514
Average number of flashcards per non-HSK word: 1.55


<Tatoeba+HSK+α English> (Rebuilt again with improved pinyin matching)
Total number of original sentences in list: 63,556 (a)
Total number of flashcards: 86,721
Total number of unique sentences: 51,279 (b)
Percentage of original sentences utilized (= (b)/(a)) = 80.7%
Average number of flashcards per unique sentence: 1.69
Total number of unique words tested: 21,896
Average number of flashcards per unique word tested: 3.96

Total number of HSK words: 4,602
Total number of flashcards testing HSK words: 41,102
Average number of flashcards per HSK word: 8.93

Total number of non-HSK words: 17,294
Total number of flashcards testing non-HSK words: 45,620
Average number of flashcards per non-HSK word: 2.64


<Tatoeba+HSK+α English + Japanese> (Rebuilt again with improved pinyin matching)
Total number of original sentences in list: 67,458 (a)
Total number of flashcards: 88,477
Total number of unique sentences: 53,601 (b)
Percentage of original sentences utilized (= (b)/(a)) = 79.4%
Average number of flashcards per unique sentence: 1.65
Total number of unique words tested: 22,187
Average number of flashcards per unique word tested: 3.98

Total number of HSK words: 4,605
Total number of flashcards testing HSK words: 41,642
Average number of flashcards per HSK word: 9.04

Total number of non-HSK words: 17,583
Total number of flashcards testing non-HSK words: 46,835
Average number of flashcards per non-HSK word: 2.66

I have also discovered and cleaned up a lot of residual double-byte punctuation symbols that remained in some of my earlier posted flashcard sets. This has resulted in much better sentence utilization ratios for some of the sets. :) For example, sentence utilization for the Tatoeba English flashcards has gone up from 49.6% to a much better 69.0%. :)
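For illustration, the kind of cleanup involved can be sketched in a few lines (in Python rather than my VBA, and with a punctuation set chosen only as an example):

```python
import re

# Common full-width (double-byte) punctuation that can sneak into word
# fields and break exact matching against dictionary headwords.
# (Example set; extend as needed for your data.)
FULLWIDTH_PUNCT = "，。！？；：、“”‘’（）《》【】…～·"

def strip_fullwidth_punct(text: str) -> str:
    """Remove full-width punctuation from a candidate word or sentence key."""
    return re.sub("[" + re.escape(FULLWIDTH_PUNCT) + "]", "", text)
```

Running every candidate word through such a filter before matching is what raises the utilization ratio: words that previously failed to match only because of a stray 。 or ， now match.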

The conclusion is that the Tatoeba word segmentation is still not as good as the segmentation in the HSK list, but not as bad as I incorrectly surmised :oops: - sorry about that!

Enjoy!
 

Attachments

#35
Hi leguan,

thank you very much for this excellent work! I think this is going to be a hit with my students.

No problem, the Tatoeba word segmentation is actually Pleco's segmentation, so it can't be bad. ;) I will try the jieba package @elsey.jack mentioned in another thread. But to my knowledge, jieba also only puts spaces between the Hanzi, like the Java-based Stanford segmenter. Getting the right pinyin for multi-syllable Hanzi words then still requires something with access to large dictionaries, like Pleco. But as I mentioned, Pleco strips the spaces between the Hanzi before it does its own dictionary-based segmentation. A pre-segmenter would only help if we had an excellent Hanzi segmentation algorithm, plus a feature in Pleco that respects the whitespace between words and simply fills in the pinyin its dictionaries offer for each word between the spaces.

Thanks again! I will report back how I'm faring.

Shun
 

mikelove

皇帝
Staff member
#36
No problem, the Tatoeba word segmentation is actually Pleco's segmentation. So it can't be bad.
Heh - not something we've put a lot of time into. Any particular category of thing it seemed to get wrong a lot? Was it mostly failing on stuff not in the dictionary (proper nouns, e.g.) or was it also frequently breaking, say, 3-character sequences that were supposed to be 2 + 1 into 1 + 2 instead?
 
#37
Thanks for asking! Not exactly that, I'm going through my recently studied cards and found the following specific cases:

It recognized 组装 zu3zhuang1 as zu3 zhuang1.
It recognized 缺陷 que1xian4 as que1 (the xian4 was lost; it was at the end of the sentence "这个系统有些明显的缺陷。").
It recognized 系统 xi4tong3 as xi4 tong3.
It recognized 面临 mian4lin2 as mian4 lin2.

Edit: At least for 系统, I can confirm that it isn't the Stanford segmenter's fault, because the Stanford segmenter didn't split it up. If you'd like to test these particular sentences, I'll look them up and post them here.

Many things probably can't be avoided without semantic or tree analysis, like:

这本书里错误多得以至于老师把它叫作“醉鬼版”。

The 得以 in that sentence is treated as one word, even though the two characters shouldn't go together there.

But other than that, the segmentation usually works fine already. It would be very nice, however, to have an option for it not to strip whitespace between the Hanzi, thereby letting the special treebank-based word segmenters handle that part of the task. This would open Pleco's segmentation up to the whole world of Chinese segmenters. :)

Cheers and thanks,

Shun
 
#38
Hello Mike,

as proof, I have reimported the four sentences without the pinyin, with "Fill in missing fields" enabled, and got the same errors again. I don't see how these cases could arise; perhaps it's just a small bug in the current algorithm that needs to be corrected.

Cheers,

Shun
 

Attachments

mikelove

皇帝
Staff member
#39
@Shun - this might actually be dictionary-related; what's your current dictionary search order in Manage Dictionaries? There are a few dictionaries which erroneously include spaces in the pinyin for those single words.
 
#40
Oh yes, thanks for the hint! Here's my dictionary search order:

[Screenshots of the dictionary search order: IMG_3886.PNG through IMG_3890.PNG]


In the case of 组装 zu3zhuang1, I saw that the «FLTRP Chinesisch-Deutsch» dictionary had "zu3 zhuang1", and the NHD entry, where the two syllables are written together, is farther down. For 系统, the FHD is again the culprit. I should move that dictionary much farther down in my search order and then redo the segmentation/pinyin completion. But leguan has already invested so much work that he would have to do it all again, too. Perhaps sometime later, if we feel like it. The sentences are already very useful as they are now.

In the Dictionary screen, I see that Pleco usually indicates when dictionaries farther down have different pinyin. But it strips the whitespace in the pinyin, so there is no indication of a missing or added space in the pinyin of dictionaries farther down. That is probably the right behavior; otherwise one would get a lot of unnecessary pinyin indications.

Edit: After moving down all the FHD dictionaries, unfortunately they still come first because they have a different Traditional Chinese entry for 系统, for example (係統 instead of 系統). The Xiandai Hanyu Da Cidian also has that different Traditional form. I see I would have to disable all of these dictionaries for the error to go away, or just work with a relatively small, reliable subset of dictionaries for this pinyin completion task. More isn't always better. :)
 