79,000 Chinese-English, French, German, Italian, Japanese, and Spanish sentences

Shun · Dec 19, 2018

Hi leguan,

that sounds excellent, I'm all for a clear thread structure. So I will use another thread to answer your post and edit in a link to the new thread here:

HSK Difficulty Thread

Best,

Shun

leguan · Jan 6, 2019

I have just updated the Tatoeba+HSK+α English flashcards to correct an issue in which rogue forward slash characters in the pinyin of a number of target words prevented flashcard generation for those words.

For details of the update, and to download the updated version of the Tatoeba+HSK+α English flashcards please scroll to the bottom of post #42 of this thread (see link below), or just download the updated file also attached to this message.

https://plecoforums.com/threads/79-...apanese-and-spanish-sentences.5925/post-45063

Shun · Jan 6, 2019

Dear all,

I've also finished a first Python version of @leguan's sentence contextual flashcard generation program. This program does the following:

1. It collects all the words from the BCC, Leiden, and SUBTLEX corpora that are known to Pleco's dictionaries and orders them by frequency (175,000 words).
2. It matches all the words from the merged corpus file to the sentences in which they occur. The sentences can be from any source, but are from Tatoeba's Chinese-English list in this case.
3. Using the indexed word list (i.e., the words with their matching sentence IDs added) and the numbered Tatoeba sentence list, it generates Pleco sentence flashcards. The headword is the Hanzi word we are looking for; the pronunciation field is filled with the sentence in which the headword occurs, with the headword replaced by its pinyin; and the definition is filled with the English translation of the sentence (multiple translations of the same Chinese sentence are folded together).
4. It doesn't use the same sentence more than twice, and it doesn't generate more than seven different sentences for one HSK word (across all HSK levels). In addition, it takes a random sample of 7000 mostly non-HSK words from the first 40,000 most common words in the corpora which occur in the Tatoeba sentences, and generates sentences with them. These words can't be too hard, since the Tatoeba sentences aren't that hard, either.
5. It orders the finished sentence list by HSK level. The HSK level of the sentence is currently determined not by the HSK level of the headword (if there is one), but by the HSK level of the entire sentence, as calculated by the HSK rating program (see thread "Automatically Assessing the HSK Difficulty Level of Arbitrary Chinese Sentences").
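For readers who want to experiment, here is a minimal sketch of the core steps described above: folding translations, replacing the headword with its pinyin, and capping sentence reuse. It is not the attached source code. The function names, the `pinyin` dictionary, the sample sentences, and the tab-separated output layout are all assumptions made for illustration, and the plain substring match stands in for whatever word matching the real program uses.

```python
from collections import defaultdict

def fold_translations(pairs):
    """Group English translations by Chinese sentence, keeping first-seen order."""
    folded = {}
    for zh, en in pairs:
        folded.setdefault(zh, [])
        if en not in folded[zh]:
            folded[zh].append(en)
    return folded

def make_cards(words, pinyin, pairs, max_per_word=7, max_sentence_uses=2):
    """Build (headword, cloze sentence, definition) tuples.

    words  -- headwords, assumed to be already ordered by frequency
    pinyin -- hypothetical dict mapping each headword to its pinyin
    pairs  -- (Chinese sentence, English translation) tuples
    """
    folded = fold_translations(pairs)
    uses = defaultdict(int)  # how often each sentence has been used so far
    cards = []
    for word in words:
        made = 0
        for zh, translations in folded.items():
            if made >= max_per_word:
                break
            if word not in zh or uses[zh] >= max_sentence_uses:
                continue
            # Plain substring match; real text would need proper word matching.
            cloze = zh.replace(word, " " + pinyin[word] + " ")
            cards.append((word, cloze, " / ".join(translations)))
            uses[zh] += 1
            made += 1
    return cards

# Made-up sample data, not taken from the Tatoeba lists.
pairs = [
    ("我喜欢喝茶。", "I like drinking tea."),
    ("我喜欢喝茶。", "I like to drink tea."),
    ("他不喜欢咖啡。", "He doesn't like coffee."),
]
pinyin = {"我": "wǒ", "喜欢": "xǐhuan", "茶": "chá"}
cards = make_cards(["喜欢", "茶"], pinyin, pairs)
for headword, cloze, definition in cards:
    # One card per line, fields separated by tabs (assumed import layout).
    print("\t".join((headword, cloze, definition)))
```

With the default limits, a sentence that has already served two headwords is skipped for any further headword, which is why the order of `words` matters.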

It doesn't order the sentences randomly, because Pleco's Flashcards will already do that. More languages will follow. This program does something very similar to leguan's, now just additionally separated by HSK levels. There are still many things I/we could improve. I hope you like the sentences; feedback is welcome.
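To illustrate ordering by the level of the whole sentence rather than of the headword, here is a small sketch. The rating rule used here (a sentence is as hard as the hardest listed word it contains) is only an assumed stand-in and is not necessarily how the HSK rating program works; the `hsk_level` table, the sample sentences, and the `default` level for sentences with no listed word are likewise made up.

```python
def sentence_level(sentence, hsk_level, default=7):
    """Rate a sentence by the hardest listed word found in it (assumed heuristic)."""
    levels = [level for word, level in hsk_level.items() if word in sentence]
    return max(levels) if levels else default

# Hypothetical word-to-level table; the levels are illustrative only.
hsk_level = {"我": 1, "喜欢": 1, "茶": 1, "咖啡": 2, "环境": 3}

sentences = ["这里的环境很好。", "我喜欢喝茶。", "他不喜欢咖啡。"]

# Rate the original sentences (before the headword is replaced by pinyin),
# then sort so that the easiest sentences come first.
ordered = sorted(sentences, key=lambda s: sentence_level(s, hsk_level))
for s in ordered:
    print(sentence_level(s, hsk_level), s)
```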

I attach the current source code and the Chinese-English output.

Edit: Added German, French, Italian, Japanese, Russian, and Spanish.

Enjoy,

Shun

leguan · Jan 6, 2019

Excellent work, Shun!
The tools you have developed are great not only because they, if my understanding is correct, fully automate the sentence contextual flashcard generation process, but also because their modularity allows easy adaptation to a user's preferences and use in other related projects. They indeed form a textbook on how to use Python to create sentence-based learning tools and are surely a great asset for all!

Shun · Jan 6, 2019

Hi leguan,

many thanks! Python seems to be quite ideal for this type of task. Thanks also for the excellent "sentence contextual" idea of replacing one Chinese word with its pinyin. As stated previously, it trains your Hanzi writing skills, and at the same time your Chinese reading skills and general passive vocabulary knowledge.

Best, Shun

pdwalker · Jan 13, 2019

This just keeps getting better.

Thanks Shun.

Shun · Jan 13, 2019

Hi pdwalker,

many thanks, you're welcome! I'm also very open to any additional requests you may have after having studied some.

Best,

Shun

Shun · Feb 25, 2019

Hello all,

at @Akpierce1776's request, I am happy to upload a Chinese-English Tatoeba sentence contextual flashcard list, graded by HSK from levels 2 to 6 instead of 3 to 6 as before, now with 23,866 sentences in all. There are just under 900 sentences that were determined to be HSK level 2. This list is based on the newest Tatoeba sentence data from February 24, 2019, which contains about 5-6% more sentences than the old lists did. If anyone would like me to make HSK 2-6 sentence lists from the newest data in another language, please just tell me.

For those who are already in the process of studying with the older lists, I don't think it's worth starting over just because of these 5-6%, though.

Enjoy,

Shun

PS: And, as always, thanks to @leguan for the great idea!

agewisdom · Feb 25, 2019

@Shun The HSK 2-6 segregated cards are pretty fantastic work! Many thanks. I'll update my post and spread the word around soon.
Phew, it's pretty hard work reviewing these cards.

BTW - Is there any way to get PLECO to voice the entire sentence rather than just the pinyin word?

Shun · Feb 25, 2019

Hi agewisdom,

thank you very much! I guess so.

There is a way to have it pronounce almost the entire sentence: You select the sentence and tap on the speaker button. The pinyin part will be pronounced when you haven’t selected anything.

Enjoy (and say thank you to leguan equally),

Shun

Akpierce1776 · Feb 25, 2019

Shun said:
Hello all,

at @Akpierce1776's request, I am happy to upload a Chinese-English Tatoeba sentence contextual flashcard list, graded by HSK from levels 2 to 6 instead of 3 to 6 as before, now with 23,866 sentences in all.

Thanks so much!! Greatly appreciated!

Shun · Feb 25, 2019

You're very welcome!

Akpierce1776 · Feb 25, 2019

Shun said:
If anyone would like me to make HSK 2-6 sentence lists from the newest data in another language, please just tell me.

If you are able to easily do the Spanish version of these, that would be fantastically helpful to practice my Spanish as well.

Shun · Feb 25, 2019

This is the newest Spanish-English list, which I would recommend over Chinese-Spanish if refreshing Spanish is your goal, because you can use it in both directions with Anki, unlike the sentence contextual flashcards.

Akpierce1776 · Feb 25, 2019

I did not realize these could be imported to Anki easily. Even the contextual sentences set?

agewisdom · Feb 25, 2019

Shun said:
Hi agewisdom,
There is a way to have it pronounce almost the entire sentence: You select the sentence and tap on the speaker button. The pinyin part will be pronounced when you haven’t selected anything.
Shun

I don't know why but that doesn't work. It still only pronounces the pinyin word.

Alternatively, is there a way to change the default to speak the entire sentence instead?

leguan · Feb 25, 2019

agewisdom said:
I don't know why but that doesn't work. It still only pronounces the pinyin word.

Alternatively, is there a way to change the default to speak the entire sentence instead?

Unfortunately, I believe not. At least, not now.

I guess we might be able to realise this once Pleco 4.0 has been launched.

However, even if it is achievable, it seems likely that it will require rebuilding the flashcards with an extra attribute holding the full sentence without pinyin (or possibly in full pinyin).

agewisdom · Feb 25, 2019

leguan said:
Unfortunately, I believe not. At least, not now.

I guess we might be able to realise this once Pleco 4.0 has been launched. However, even if it is achievable, it seems likely that it will require rebuilding the flashcards with an extra full sentence without pinyin (or possibly, full pinyin) attribute.

Thanks @leguan for letting me know.

A big THANK YOU for sharing your deck and flashcard concept. I tried it out and darn it... It's really HARD! But I'm learning a lot through it. It's just that I don't know some of the other characters, which makes it hard to GUESS the pinyin character accordingly. Every time I go through the flashcards, it's just like reading Chinese text. I stutter and fumble a lot, but at least it's in manageable chunks.

leguan · Feb 25, 2019

Hi agewisdom,

You're very welcome! I'm very happy to hear that the flashcards are useful to you!

Have you tried using Shun's latest HSK level graded decks? With his HSK level graded decks you can reduce the overall difficulty of the sentences by choosing not to include higher HSK levels when you start a new test, which should make it easier to guess the pinyin word.

agewisdom · Feb 25, 2019

Yes, I used his folded graded decks. Even HSK 2 is a bit tough. The pinyin words are HSK 2 but some of the other sentences aren't, which was what prompted me to ask whether it was possible to hear the entire sentence rather than just the pinyin.

Still, it's excellent practice, except that there's a bit more work to look up some of the unfamiliar characters in the sentences.
