Hi Shun,
Thank you very much for your continued efforts to improve the pinyin segmentation and for your concern regarding my workload!
Until now, I have maintained a single Excel spreadsheet containing all of the sentence lists, and have created flashcards from it by programmatically selecting sentences from the required lists. However, this method has a few drawbacks: duplicate flashcards are hard to prevent, and sentence utilization is suboptimal, particularly for flashcard sets based on a small proportion of the total sentences (e.g. Chinese-German). I have therefore created a separate spreadsheet for each sentence list, and have created new "sentence contextual writing" flashcard sets based on these.
Yesterday, I also discovered two other related issues regarding Pleco's treatment of text in the pinyin field upon importation.
Before I go into the details, we should first be clear that these sentence contextual flashcards are based on a "hack" of Pleco's functionality. The pinyin field was surely never intended to contain mixtures of pinyin, Chinese characters, and other languages' alphanumeric text. This means that these flashcards could be rendered unusable at any time in the future by an update to Pleco's behaviour regarding the pinyin field! -> prays to God of Pleco...
EDIT: See Shun's following post for good news regarding this!
Here are the issues:
Issue 1: Alphabetic characters in the "Pinyin" field can also "disappear", or be converted into pinyin. For example, the English word "to" could become "tong", etc.
Issue 2: Full-width numeric characters (e.g. ０, １, ・・・) also seem to "disappear" on importation.
It seems that Issue 2 can be dealt with by converting the full-width numeric characters into half-width ones. Unfortunately, Issue 1 appears to occur regardless of whether the characters are half-width or full-width.
In my new sets of flashcards I have thus removed all of the sentences (totalling about 1000) that contain alphabetic characters in the definition field and have converted all of the full-width numeric characters to half-width.
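For anyone who wants to apply the same clean-up to their own sentence lists, here is a minimal Python sketch of the two steps described above. It is only an illustration and not the spreadsheet logic I actually use: the function names are made up for this example, and it assumes each sentence is available as a plain string.

```python
import re

# Map full-width digits (U+FF10 to U+FF19) to their half-width ASCII forms.
FULLWIDTH_DIGITS = str.maketrans("０１２３４５６７８９", "0123456789")

# Matches any half-width or full-width Latin letter.
LATIN_LETTER = re.compile(r"[A-Za-zＡ-Ｚａ-ｚ]")


def to_halfwidth_digits(text: str) -> str:
    """Replace full-width digits with half-width digits."""
    return text.translate(FULLWIDTH_DIGITS)


def has_latin_letters(text: str) -> bool:
    """Return True if the text contains any alphabetic (Latin) character."""
    return LATIN_LETTER.search(text) is not None


def clean_sentences(sentences):
    """Drop sentences containing Latin letters; normalize digits in the rest."""
    return [to_halfwidth_digits(s) for s in sentences if not has_latin_letters(s)]


cleaned = clean_sentences(["我有３本书", "我喜欢CD", "他１０岁"])
```

In this example the sentence containing "CD" is dropped, and the digits in the other two sentences are converted to half-width.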
The result of all this is as follows:
<Tatoeba Chinese-German>
Total number of original sentences in list: 4,538 → 4,411 (a)
Total number of flashcards: 5,740 → 8,392
Total number of unique sentences: 3,411 → 4,215 (b)
Percentage of original sentences utilized (= (b)/(a)) = 75.1% → 95.5%
Average number of flashcards per unique sentence: 1.68 → 1.99
Total number of unique words tested: 3,388 → 3,474
Average number of flashcards per unique words tested: 1.69 → 2.42
Total number of HSK words: 1,387 → 1,544
Total number of flashcards testing HSK words: 2,631 → 4,631
Average number of flashcards per HSK word: 1.89 → 3.00
Total number of non-HSK words: 2,001 → 1,930
Total number of flashcards testing non-HSK words: 3,110 → 3,397
Average number of flashcards per non-HSK word: 1.55 → 1.76
COMMENT: The big gain in sentence utilization comes partly from Shun's improved pinyin segmentation and partly from removing the non-Chinese-German sentences from the spreadsheet used to generate these flashcards, which made sure that 100% of the matches were relevant. Previously, more than 90% of the matches in my spreadsheet (up to a maximum of 30 per word) were related to other flashcard sets, which limited the number of relevant sentence matches available to this Chinese-German flashcard set. The downside of indexing on a per-flashcard-set basis is that each sentence list needs to be indexed separately; for the larger flashcard sets this is usually an overnight job!
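For reference, the derived figures in these lists (utilization percentage and the per-sentence and per-word averages) are simple ratios of the raw counts. A minimal Python sketch, using the new Chinese-German counts as input; the function and key names are just for illustration:

```python
def utilization_stats(original_sentences, flashcards, unique_sentences, unique_words):
    """Compute the derived ratios from the raw counts of one flashcard set."""
    return {
        # (b)/(a): share of the original sentence list that ended up on a card
        "utilization_pct": 100 * unique_sentences / original_sentences,
        "cards_per_sentence": flashcards / unique_sentences,
        "cards_per_word": flashcards / unique_words,
    }


# New Chinese-German counts: (a) = 4,411, 8,392 flashcards, (b) = 4,215, 3,474 words
stats = utilization_stats(4411, 8392, 4215, 3474)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

This reproduces the Chinese-German figures above to within rounding.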
<Tatoeba English>
Total number of original sentences in list: 41,587 → 40,792 (a)
Total number of flashcards: 39,828 → 46,603
Total number of unique sentences: 28,689 → 31,837 (b)
Percentage of original sentences utilized (= (b)/(a)) = 69.0% → 78.0%
Average number of flashcards per unique sentence: 1.38 → 1.62
Total number of unique words tested: 13,405 → 13,259
Average number of flashcards per unique words tested: 2.97 → 3.51
Total number of HSK words: 3,353 → 3,440
Total number of flashcards testing HSK words: 15,456 → 21,870
Average number of flashcards per HSK word: 4.60 → 6.35
Total number of non-HSK words: 10,052 → 9,821
Total number of flashcards testing non-HSK words: 24,372 → 24,732
Average number of flashcards per non-HSK word: 2.42 → 2.52
COMMENT: As well as the gains attributable to Shun's improved pinyin segmentation, some of the sentence utilization gain for this flashcard set can be attributed to increasing the maximum number of flashcards per non-HSK word from six to seven, the maximum BCC corpus ranking from 100,000 to 160,000, and the maximum non-BCC corpus (i.e., LWC and SUBTLEX) ranking from 60,000 to 100,000.
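To show how these two limits interact, here is a rough Python sketch of a selection pass with a per-word cap and a corpus-rank cutoff. This is a simplified illustration, not my actual spreadsheet logic: the function name, the tuple layout, and the single shared cap are all assumptions made for the example.

```python
from collections import defaultdict


def select_cards(candidates, max_per_word, max_rank):
    """Pick (word, sentence) cards from candidate matches.

    candidates: iterable of (word, corpus_rank, sentence) tuples (hypothetical layout).
    Words ranked beyond max_rank are skipped, each word gets at most
    max_per_word cards, and the same (word, sentence) pair is never used twice.
    """
    per_word = defaultdict(int)
    seen = set()
    cards = []
    for word, rank, sentence in candidates:
        if rank > max_rank:
            continue  # word is too rare according to the corpus ranking
        if per_word[word] >= max_per_word:
            continue  # this word already has its full quota of cards
        if (word, sentence) in seen:
            continue  # avoid duplicate flashcards
        seen.add((word, sentence))
        per_word[word] += 1
        cards.append((word, sentence))
    return cards


candidates = [
    ("学习", 500, "s1"),
    ("学习", 500, "s2"),
    ("学习", 500, "s3"),
    ("罕见", 200000, "s4"),
]
cards = select_cards(candidates, max_per_word=2, max_rank=160000)
```

Raising either limit lets more candidate matches through, which is why both the number of flashcards and the number of sentences utilized go up.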
<HSK English>
Total number of original sentences in list: 18,261 → 18,036 (a)
Total number of flashcards: 40,054 → 57,460
Total number of unique sentences: 17,750 → 17,819 (b)
Percentage of original sentences utilized (= (b)/(a)) = 97.2% → 98.8%
Average number of flashcards per unique sentence: 2.26 → 3.24
Total number of unique words tested: 15,561 → 15,823
Average number of flashcards per unique words tested: 2.57 → 3.63
Total number of HSK words: 4,236 → 4,318
Total number of flashcards testing HSK words: 22,541 → 28,259
Average number of flashcards per HSK word: 5.32 → 6.54
Total number of non-HSK words: 11,325 → 11,505
Total number of flashcards testing non-HSK words: 17,514 → 29,199
Average number of flashcards per non-HSK word: 1.55 → 2.54
COMMENT: The sentence utilization gain here can mainly be attributed to increasing the maximum number of flashcards per HSK and non-HSK tested words.
<Tatoeba+HSK+α English>
Total number of original sentences in list: 63,556 → 62,520 (a)
Total number of flashcards: 86,721 → 88,604
Total number of unique sentences: 51,279 → 51,701 (b)
Percentage of original sentences utilized (= (b)/(a)) = 80.7% → 82.7%
Average number of flashcards per unique sentence: 1.69 → 1.71
Total number of unique words tested: 21,896 → 21,532
Average number of flashcards per unique words tested: 3.96 → 4.11
Total number of HSK words: 4,602 → 4,607
Total number of flashcards testing HSK words: 41,102 → 41,382
Average number of flashcards per HSK word: 8.93 → 8.98
Total number of non-HSK words: 17,294 → 17,157
Total number of flashcards testing non-HSK words: 45,620 → 47,222
Average number of flashcards per non-HSK word: 2.64 → 2.75
COMMENT: For this set I increased the maximum number of flashcards per non-HSK word from six to seven - this change resulted in an addition of 2665 new flashcards.
Once again, I hope these flashcard sets will be useful for those who want to practice writing Chinese characters from the sound (pinyin) of a word in the context of a sentence, as when listening to the sentence spoken in Chinese, while getting reading comprehension practice at the same time.
As an aside, when I study with these flashcards, I use "Self-grading" and only give myself a "remembered perfectly" grade if I can perform all three of the following:
1. perfectly write the Chinese characters for the tested word
2. fully understand the sentence
3. read the sentence out loud with correct pronunciation/tones for all words in the sentence.
Enjoy
[2019/01/06 UPDATE for Tatoeba+HSK+α English]
<Tatoeba+HSK+α English>
Total number of original sentences in list: 62,520 (a)
Total number of flashcards: 88,604 → 93,380
Total number of unique sentences: 51,701 → 53,353 (b)
Percentage of original sentences utilized (= (b)/(a)) = 82.7% → 85.3%
Average number of flashcards per unique sentence: 1.71 → 1.75
Total number of unique words tested: 21,532 → 22,816
Average number of flashcards per unique words tested: 4.11 → 4.09
Total number of HSK words tested: 4,607 → 4,809
Total number of flashcards testing HSK words: 41,382 → 43,131
Average number of flashcards per HSK word: 8.98 → 8.97
Total number of non-HSK words: 17,157 → 18,007
Total number of flashcards testing non-HSK words: 47,222 → 50,248
Average number of flashcards per non-HSK word: 2.75 → 2.79
COMMENT: For this set I removed some rogue forward slash characters that somehow had been introduced into the pinyin for 5347 (including 214 HSK) words and regenerated this flashcard set. This has resulted in an additional 4776 flashcards, an additional 1284 words tested (including an additional 202 HSK words), and an increase in the sentence utilization ratio to 85.3%.
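In case it helps anyone checking their own word lists for the same problem, a clean-up like this can be sketched in a few lines of Python. This is an illustration only: it assumes the word list is a dictionary mapping each word to its pinyin string, and the function name is made up.

```python
def clean_entries(entries):
    """Remove forward slashes from pinyin strings.

    entries: dict mapping word -> pinyin (hypothetical layout).
    Returns the cleaned dict and the number of entries that were changed.
    """
    cleaned = {}
    changed = 0
    for word, pinyin in entries.items():
        fixed = pinyin.replace("/", "")
        if fixed != pinyin:
            changed += 1
        cleaned[word] = fixed
    return cleaned, changed


cleaned, changed = clean_entries({"你好": "ni3/hao3", "谢谢": "xie4xie5"})
```

Counting the changed entries gives a quick check of how many words were affected before regenerating the flashcards.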