Shun
状元
(continued from the thread "79,000 Chinese-English, French, German, Italian, Japanese, and Spanish sentences")
Dear @leguan,
I believe your current intention is to develop "difficulty" analysis tools as a first step. Using the current sentence contextual flashcards is a good way to evaluate the analysis tools (i.e. proof of concept) but not ideal for creating HSK graded sentence contextual flashcard sets, since the sentence contextual flashcards have been specifically optimized for HSK6+ students. That is one reason why so few lower-level HSK flashcards remain after categorizing sentences by difficulty.
Maybe there is a small misunderstanding, but did you assume that it is too difficult for an HSK 4 student, let's say, to guess the right Hanzi for the pinyin in the context of a sentence? In my opinion, if they know the word that needs to be figured out, they can do this quite easily without being at HSK6+ level.
I believe that many sentences in the Tatoeba set are indeed HSK 3 or 4. Only in the HSK sentence set (which was also attributed to Tatoeba) do we find fewer sentences below level 5 or 6. We also see this difference in the way the algorithm graded the sentences: there are many more HSK 6 sentences in the HSK set.
Of course, it would be a bit cleaner to run the HSK rating algorithm on the original Tatoeba sentences for each language, but I did it with your sentences to retain the sentence contextual feature.
Another point readers should note is that the tested words in the "by HSK levels" lists are not limited to those HSK levels. For example, 盒子 in "请看这里的这个he2zi5. Please look at this box here." is categorized as HSK 3, but it is an HSK 4 word.
Yeah, there is always the possibility of finding words in a sentence that are of a higher HSK level than the sentence is rated at. But I've seen quite a few HSK tests with plenty of unknown words in the sentences, so even in the official HSK tests, the examinee has to guess at meanings they may not yet have learned in order to answer all the questions.
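To make this concrete, here is a minimal Python sketch of how one could flag cards that contain words above the level the card was rated at. It is only an illustration, not the rating algorithm used for these lists: the `HSK_LEVEL` table, the function name `words_above_card_level`, and the pre-segmented word list are all assumptions, and apart from 盒子 (HSK 4, as noted above) the level values are placeholders.

```python
# Hypothetical word-to-level table. A real one would be loaded from the
# HSK word lists; apart from 盒子 these values are placeholders.
HSK_LEVEL = {"请": 1, "看": 1, "这里": 1, "的": 1, "这个": 1, "盒子": 4}


def words_above_card_level(words, card_level, levels, unknown_level=7):
    """Return the words whose HSK level is higher than the card's rating.

    `words` is the sentence already split into words (segmentation is
    assumed to happen elsewhere). Words missing from `levels` are treated
    as `unknown_level`, i.e. above every HSK level.
    """
    return [w for w in words if levels.get(w, unknown_level) > card_level]


# The example sentence from above, with the tested word restored to Hanzi.
sentence_words = ["请", "看", "这里", "的", "这个", "盒子"]

# Rated as an HSK 3 card, the tested word 盒子 is flagged.
print(words_above_card_level(sentence_words, 3, HSK_LEVEL))
```

A check like this could be run over a graded list to see how many cards are affected before deciding whether it matters in practice.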
So I will include the BCC data as a next step, and I'll keep everyone posted. Once we are all happy with the resulting lists, I guess we can include them in the main 79,000 flashcards thread again.
Best regards,
Shun