Automatically Assessing the HSK Difficulty Level of Arbitrary Chinese Sentences

#1
(continued from the thread "79,000 Chinese-English, French, German, Italian, Japanese, and Spanish sentences")

Dear @leguan,

I believe your current intention is to develop "difficulty" analysis tools as a first step. Using the current sentence contextual flashcards is a good way to evaluate the analysis tools (i.e., a proof of concept) but not ideal for creating HSK-graded sentence contextual flashcard sets, since the sentence contextual flashcards have been specifically optimized for HSK6+ students. That is one reason there are so few lower-level HSK flashcards remaining after categorizing sentences by difficulty.
Maybe there is a small misunderstanding, but did you assume that it is too difficult for an HSK 4 student, let's say, to guess the right Hanzi for the pinyin in the context of a sentence? In my opinion, if they know the word that needs to be figured out, they can also do this quite easily without being at HSK6+ level.

I believe that many sentences in the Tatoeba set are indeed HSK 3 or 4. Only in the HSK sentence set (which was also attributed to Tatoeba) do we find fewer sentences below level 5 or 6. We see this difference also in how the algorithm graded the sentences (many more HSK 6 sentences in the HSK set).

Of course, it would be a bit cleaner to run the HSK rating algorithm on the original Tatoeba sentences for each language, but I did it with your sentences to retain the sentence contextual feature.

Another point that should be noted by readers is that the tested words in the "by HSK levels" lists are not limited to those HSK levels. For example, 盒子 in "请看这里的这个he2zi5。/ Please look at this box here." is categorized as HSK 3, but 盒子 is an HSK 4 word, etc.
Yeah, there is always the possibility of finding words in a sentence of a higher HSK level than the sentence's rating. But I've seen quite a few HSK tests where the sentences also contained plenty of unknown words; even in the official HSK tests, the examinee is asked to guess at meanings they may not yet have learned in order to answer all questions.

So I will include the BCC data as a next step, and I'll keep everyone posted. Once we are all happy with the resulting lists, I guess we can again include them in the main 79,000 flashcards thread.

Best regards,

Shun
 
#2
...did you assume that it is too difficult for an HSK 4 student, let's say, to guess the right Hanzi for the pinyin in the context of a sentence? In my opinion, if they know the word that needs to be figured out, they can also do this quite easily without being at HSK6+ level.
No, I did not make that assumption - in fact, I fully agree with your opinion. :) I just wanted to point out that some readers may not expect "HSK3" flashcards to test higher than HSK3 words, and that it would be a good idea to make this point clear to them in advance so that there is no misunderstanding about what is being "tested".

Of course, it would be a bit cleaner to run the HSK rating algorithm on the original Tatoeba sentences for each language, ...
So I will include the BCC data as a next step, and I'll keep everyone posted. Once we are all happy with the resulting lists, I guess we can again include them in the main 79,000 flashcards thread.
Shun
With all due respect, I do not think it is a good approach to use the existing sentence contextual flashcards to create new graded sentence contextual flashcards. Rather, I believe the (much more) preferable approach is to create new sentence contextual flashcards from the original sentence lists based on sentence "difficulty" rating data.
 
#3
No, I did not make that assumption - in fact, I fully agree with your opinion. :) I just wanted to point out that some readers may not expect "HSK3" flashcards to test higher than HSK3 words, and that it would be a good idea to make this point clear to them in advance so that there is no misunderstanding about what is being "tested".
Absolutely, I agree. Since we're still very much in the alpha testing phase, we can point it out to users when the cards look good. If we were to work with the sentence contextual flashcards lists, I could also make sure in the algorithm that the HSK word we are looking for never exceeds the HSK level of the category. But then I would need to work with the finished sentence contextual flashcard lists (see below).

I think we can also delete the current lists when new lists are uploaded, to avoid confusion.

With all due respect, I do not think it is a good approach to use the existing sentence contextual flashcards to create new graded sentence contextual flashcards. Rather, I believe the (much more) preferable approach is to create new sentence contextual flashcards from the original sentence lists based on sentence "difficulty" rating data.
Yes, let's do it properly. I just did it with the sentence contextual flashcards as a start. I also hope I will be able to offload some of the work from you by writing a Python script that will perform at least part of your excellent procedure.
 
#4
Hi leguan and pdwalker,

I have now included the BCC data in some sensible way, though it is still simple.

The strongest correlation I could find between BCC corpus frequencies and sentence difficulty is this: if a word is quite rare and wasn't found in the HSK vocabulary, it is quite probable that the HSK score is still too low, since up to now, only the HSK levels of words that were found in the HSK vocabulary have counted toward the score. So I add 1 to the HSK score if I find a relatively rare word (BCC frequency less than 100,000) in the sentence that isn't in the HSK vocabulary. This of course still leaves the door open to names, which aren't difficult by themselves; however, I haven't encountered any such sentences so far.

So I adapted the algorithm to include the BCC corpus frequency data in the following, still simple way: In the first pass, I check the HSK levels of the words in the sentences that occur in the HSK vocabulary. After that, I create a string of all characters/words that do not occur in the HSK vocabulary. I then create a list of the BCC corpus frequencies of the characters/words found in this string, using the first 100,000 BCC words (down to a frequency of about 2,500—the highest frequency, as we know, is 943,370,349 for 的). If the minimum frequency of this list is less than 100,000, I add 1 to the HSK score.
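In Python, that pass could be sketched roughly like this (an illustration, not the actual script; in particular, how the per-word HSK levels are aggregated into the base score is my guess, and the variable names are made up):

```python
def rate_sentence(words, hsk_levels, bcc_freq, threshold=100_000):
    """Rate a segmented sentence; returns a rough HSK score.

    hsk_levels: word -> HSK level (1-6)
    bcc_freq:   word -> BCC corpus frequency (occurrence count)
    """
    # Pass 1: HSK levels of the words found in the HSK vocabulary
    levels = [hsk_levels[w] for w in words if w in hsk_levels]
    score = max(levels) if levels else 0  # aggregation rule is illustrative

    # Pass 2: look up the non-HSK words in the BCC frequency data
    freqs = [bcc_freq[w] for w in words
             if w not in hsk_levels and w in bcc_freq]

    # If the rarest non-HSK word is below the frequency threshold, add 1
    if freqs and min(freqs) < threshold:
        score += 1
    return score
```
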

With that, I obtain the following distribution of levels in the Tatoeba Chinese-English list:
Level 3: 12,293 sentences
Level 4: 16,266 sentences
Level 5: 9,320 sentences
Level 6: 2,117 sentences

For 5,601 of these 40,000 sentences, the HSK score was increased by 1 based on the BCC data. This looks like a reasonable number.

I attach the complete script and the resulting "Tatoeba Chinese-English by HSK" list. I will try to get a better feel for the BCC corpus frequency data, to make the contribution of the BCC corpus to the HSK more granular.

Best,

Shun
 

Attachments

#7
If we were to work with the sentence contextual flashcards lists, I could also make sure in the algorithm that the HSK word we are looking for never exceeds the HSK level of the category. But then I would need to work with the finished sentence contextual flashcard lists (see below).
There is no need to worry about this because the sentence contextual flashcard generation process already specifically determines how many sentences are included for each word based on the word's HSK grade and frequency of use ranking.

Logic can just be added to the sentence contextual flashcard generation module to make sure that only sentences appropriate to each grade are included in the flashcard set generated for that grade based on sentence difficulty analysis data.
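As a rough sketch of that added logic (the names are illustrative; the real generation module of course works on richer records than plain pairs):

```python
def sentences_for_grade(rated_sentences, grade):
    """rated_sentences: iterable of (sentence, difficulty) pairs.

    Keep only sentences whose difficulty rating does not exceed the
    target grade of the flashcard set being generated.
    """
    return [s for s, difficulty in rated_sentences if difficulty <= grade]
```
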

In any case, I do believe that it is best to keep sentence difficulty analysis and sentence contextual flashcard generation as independent processes in order to maintain modularity and to ensure that difficulty analysis data can be flexibly applied in sentence contextual flashcard generation as well as other projects.

Best regards
leguan
 
#9
hi shun, the trad txt still contains one or two simplified chars, e.g. 该 :)
also, would it be possible for you to substitute pinyin with zhuyin, and tone numbers with tone marks?
would be amazing if you could, but hey! no hurry
8-D
 
#10
Hi leguan,

There is no need to worry about this because the sentence contextual flashcard generation process already specifically determines how many sentences are included for each word based on the word's HSK grade and frequency of use ranking.
Oh yes, of course; I forgot about that.

Logic can just be added to the sentence contextual flashcard generation module to make sure that only sentences appropriate to each grade are included in the flashcard set generated for that grade based on sentence difficulty analysis data.
That would work, but wouldn't that also mean we lose maybe 20-30% of all sentences, since all the sentences that were given the wrong HSK rating by my algorithm would have to be discarded by your algorithm? Wouldn't it then be preferable if I supplied you with a granular score for each sentence (like HSK 4.81, with decimals), and your algorithm could then combine my score with the score of the headword, averaging the two and rounding to arrive at the final score of the sentence, while making sure that the headword the user has to enter isn't of a higher HSK level than the sentence rating? See below.
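To make the proposal concrete, here is a small Python sketch of that combination rule (the function name is made up, and note that Python's round() uses banker's rounding at exact .5, which a real script might want to handle explicitly):

```python
def final_level(sentence_score, headword_level):
    """Average the fractional sentence score with the tested headword's
    HSK level and round, but never rate the sentence below the headword,
    so the word the user has to enter is never harder than the rating."""
    combined = round((sentence_score + headword_level) / 2)
    return max(combined, headword_level)
```

For instance, a sentence scored 4.81 that tests an HSK 4 word would end up rated `final_level(4.81, 4) == 4`.
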

In any case, I do believe that it is best to keep sentence difficulty analysis and sentence contextual flashcard generation as independent processes in order to maintain modularity and to ensure that difficulty analysis data can be flexibly applied in sentence contextual flashcard generation as well as other projects.
In light of my considerations above, I almost have the feeling it would be best to integrate both algorithms as tightly as possible, to avoid losing any sentences or rating information. Difficulty analysis alone could still be maintained in a separate script.

Well, I am happy either way. Would you like me to provide you with a sentence list that has a granular HSK score (with two decimal places) as a fourth field on each line and that includes the BCC in its rating determination, just so you'd have something to tinker with?

Best regards,

Shun
 
#11
Hi rizen,

Your wish is my command! I added the Zhuyin file to the post above. I converted the pinyin with tone numbers to zhuyin using this JavaScript converter:

https://toshuo.com/chinese-tools/pinyin-to-zhuyin-live-converter/

I've also run it through a Simplified-to-Traditional converter, though it sometimes gave me bad results. (For example, it converted 了 to 瞭, which is a possible replacement for 了, but uncommon, so I corrected it.) If you encounter any other issues with the Traditional text, I can try to find another converter.
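For what it's worth, such one-off corrections can be scripted as a post-processing step over the converter's output. This is a naive sketch (the correction table is just the one example from this post, and a blanket replacement like this would also clobber legitimate uses of 瞭, e.g. in 瞭望, so real handling would need to be context-sensitive):

```python
def fix_conversion(text, corrections=None):
    """Apply simple string replacements to a converter's Traditional output.

    Naive global replacement; each entry maps an over-eager conversion
    back to the intended character.
    """
    if corrections is None:
        corrections = {"瞭": "了"}  # example from this thread only
    for wrong, right in corrections.items():
        text = text.replace(wrong, right)
    return text
```
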

Enjoy,

Shun


 
#14
Hi Shun,
Sorry for my greatly delayed reply!

That would work, but wouldn't that also mean we lose maybe 20-30% of all sentences, since all the sentences that were given the wrong HSK rating by my algorithm would have to be discarded by your algorithm?
Shun
Yes, you are quite right - incorrectly rated sentences would be lost if they were discarded solely on the basis of the sentence difficulty rating.

Wouldn't it then be preferable if I supplied you with a granular score for each sentence (like HSK 4.81, with decimals), and your algorithm could then combine my score with the score of the headword, averaging the two and rounding to arrive at the final score of the sentence, while making sure that the headword the user has to enter isn't of a higher HSK level than the sentence rating?
I think this is a very good way to strike a balance between sentence and tested word difficulty, and at the same time reduce the influence of (possibly inaccurate) sentence difficulty ratings.

Such averaging can be added to the sentence contextual flashcard generation algorithm with the sentence difficulty rating information as an additional input just as you have proposed.

I also agree that limiting the tested word difficulty to less than or equal to the sentence difficulty is a very reasonable and good approach.

In any case, individual preferences can be straightforwardly and flexibly accommodated in the flashcard generation algorithm.

Would you like me to provide you with a sentence list that has a granular HSK score (with two decimal places) as a fourth field on each line and that includes the BCC in its rating determination, just so you'd have something to tinker with?
Yes, that would be great! To be honest, though, I am not sure when or if I will get round to working on the generation of graded sentence contextual flashcard sets. But, your rating information will certainly be very useful if I do!

In any case, I believe that your difficulty rating information could also be utilized in countless other sentence-based learning tools, so it would be a great resource not only for me but for anyone else interested in building such tools!

Best regards
leguan
 
#15
Hi leguan,

Sorry for my greatly delayed reply!
That's fine!

Then I am including an extended English rated sentence list, with fractional HSK numbers. I removed the lower limit of HSK 3 so you'd have the full data. You'd almost certainly have to shift the HSK scale up a bit to get more realistic results.

I added more information, too, so one can check what the algorithm does. The format is:

1. Chinese sentence
2. Pinyin
3. English translation
4. Fractional HSK score
5. Diagnostics (space-separated): a list of each HSK word with its HSK rating, an indicator of whether the BCC corpus raised the HSK level by one, and a string of all the non-HSK words in the sentence

The fields are separated by tab characters, one sentence per line.

Later, you can just remove the last tab and everything after it.
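For example, dropping the diagnostic field could be done with a few lines of Python (a sketch assuming the separators are literal tab characters and the diagnostics sit in a single fifth field):

```python
def strip_diagnostics(line):
    """Keep only the first four tab-separated fields:
    sentence, pinyin, translation, fractional HSK score."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(fields[:4])
```
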

In any case, I believe that your difficulty rating information could also be utilized in countless other sentence-based learning tools, so it would be a great resource not only for me but for anyone else interested in building such tools!
Thanks, though I think it's far from the cutting edge of science; it just produces some quite useful results. Perhaps it will already turn up if someone googles "assessing Chinese sentence difficulty". ;)

On our past efforts in general: I look at them as a useful and inspiring coding exercise; perhaps you're already so far along that it feels like work for you. For me, it doesn't feel like work but, as I've said, like an enjoyable learning activity. Of course it's all voluntary. Seen another way, that's even more incentive for me to try to replicate in Python what you did using Excel/VBA. Would you perhaps be willing to send me just the raw input data files you worked with, not necessarily the source code? This would save me a lot of importing into and exporting out of Pleco, and it would enable me to work independently on the Python version. If so, I could PM you a file request link you could upload them to, but of course, no hurry!

Thanks and best regards,

Shun
 

Attachments

#18
hi shun. a couple of requests. on https://plecoforums.com/threads/79-...-italian-japanese-and-spanish-sentences.5925/ you mention that tatoeba.org can provide traditional chars "as source" (instead of machine-converted). did i understand that correctly? btw... i would love to have this file in a version with no pinyin/zhuyin (so just [traditional cn] -> [en]) and randomly ordered (no levels/ranking, although that totally defeats the purpose of the present thread! albeit not that of the original thread). thanks, happy new 2019 to you and all :)
 
#19
Hi rizen,

happy new 2019! I just did it, but the thing is that the original sentences aren't all in Traditional; they are part Traditional, part Simplified. So, to get all-Traditional sentences, I have to run them through a Traditional converter. For this list, I left them as they were. I will send it to you by private message, without pinyin/zhuyin and in random order.

Best,

Shun
 