This allows for sentence reuse across different language pairs, but it also means that you can get 1:n (one-to-many) correspondences.
...and then, based on the HSK level counts for a sentence, assign an HSK level using a good formula (perhaps the weighted average HSK level of all the words in the sentence plus 1), with HSK 6 still being the maximum. Or we could even introduce an HSK 7 level for sentences that are more advanced still.
Perhaps there is even some Digital Humanities paper about assessing the difficulty of Chinese sentences that we could learn a thing or two from.
Yeah, there's a lot of material. Thanks! I already found one; it's called:
"Automatic Difficulty Assessment for Chinese Texts" by John Lee, Meichun Liu, Chun Yin Lam, Tak On Lau, Bing Li, and Keying Li, from City University of Hong Kong.
Hi pdwalker and leguan,
I've just folded leguan's Chinese-English Tatoeba list from message #42 (46'604 lines) into a new file with 45'002 lines, where different English translations of identical Hanzi-Pinyin combinations are grouped together, with a " / " separator between the variants. The script takes just a second to run. I've attached the simple script and the resulting folded list.
It would be even more effective if we did this folding before replacing one Chinese word with its pinyin using Leguan's program. So this is just a start!
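For anyone who wants to try the folding idea without the attachment, here is a minimal sketch of how it could work. It is not the attached script: the function name `fold_translations` and the tab-separated Hanzi / Pinyin / English line layout are assumptions made for illustration only.

```python
def fold_translations(lines, sep="\t"):
    """Group English translations that share the same Hanzi-Pinyin pair.

    Assumes each line looks like: hanzi<TAB>pinyin<TAB>english
    (a hypothetical layout; adjust `sep` and the field order to the real file).
    """
    folded = {}  # dicts keep insertion order, so first-seen order is preserved
    for line in lines:
        line = line.rstrip("\n")
        parts = line.split(sep, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines in this sketch
        hanzi, pinyin, english = parts
        variants = folded.setdefault((hanzi, pinyin), [])
        if english not in variants:  # drop exact duplicate translations
            variants.append(english)
    # Join the variants of each Hanzi-Pinyin pair with " / "
    return [
        sep.join((hanzi, pinyin, " / ".join(variants)))
        for (hanzi, pinyin), variants in folded.items()
    ]


if __name__ == "__main__":
    sample = [
        "你好\tni3 hao3\tHello",
        "你好\tni3 hao3\tHi",
        "谢谢\txie4 xie4\tThanks",
    ]
    for row in fold_translations(sample):
        print(row)
```

Keying on the Hanzi-Pinyin pair rather than on the Hanzi alone keeps sentences apart when the same characters carry different readings.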
Best,
Shun
Hi pdwalker and leguan,
It would be even more effective if we did this folding before replacing one Chinese word with its pinyin using Leguan's program. So this is just a start!
I agree. But since reindexing the sentences is a rather time-consuming process, I am not currently planning to rebuild the sentence contextual flashcards to incorporate "folding". Creating graded flashcard sets, on the other hand, just requires plugging in grading data for the current sentences and some additional sentence selection logic...
Hello leguan, pdwalker, agewisdom,
谢谢! So I've created an HSK rating script. It takes the average HSK level of all recognized HSK words in a sentence, averages that with the highest HSK level found in the sentence, shifts the result by +0.75, rounds it to an integer, and bounds it to levels 3 through 6.
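In rough outline, the rating could be sketched as below. This is an illustration under stated assumptions rather than the script itself: the name `rate_sentence`, the pre-segmented word list, the `hsk_levels` lookup dict, and round-half-up rounding are all hypothetical choices.

```python
import math


def rate_sentence(words, hsk_levels, shift=0.75, lowest=3, highest=6):
    """Estimate an HSK level for a sentence that is already split into words.

    `hsk_levels` is an assumed dict mapping a word to its HSK level (1-6).
    Words missing from the dict are ignored; returns None if none are known.
    """
    levels = [hsk_levels[w] for w in words if w in hsk_levels]
    if not levels:
        return None
    mean_level = sum(levels) / len(levels)
    # Average the mean level with the hardest word, then shift the scale
    score = (mean_level + max(levels)) / 2 + shift
    # Round half up (assumption; Python's round() would round half to even)
    rounded = math.floor(score + 0.5)
    # Bound the result to the allowed range of levels
    return max(lowest, min(highest, rounded))


if __name__ == "__main__":
    toy_levels = {"我": 1, "喜欢": 1, "哲学": 6}  # made-up toy values
    print(rate_sentence(["我", "喜欢", "哲学"], toy_levels))
```

With these toy values the mean is about 2.67 and the maximum is 6, so the score is about 4.33 + 0.75 = 5.08, which rounds to level 5.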
Best regards,
Shun