Hi Weyland,
Thank you for some very good points. Let me address them one by one:
Well, I did give it a download and took a look at it before, and decided it wasn't for me. (As my biggest issue is post-HSK active vocabulary).
Do you really think the sentences are too easy for someone at HSK level 6? I feel they contain plenty of vocabulary that isn't in any of the six HSK levels, especially the ones from dict.cn. Even if one can often guess what a Chinese word one has never seen before means, these sentences still allow one to practice proper word usage precisely. At least from my former classmates at university, quite a few of whom have successfully taken the HSK 6, I can tell that a lot of practice in the feeling for vocabulary and the proper use of vocabulary is still necessary. I'd say HSK 6 gives you a skeleton of hard vocabulary, but there are many more basic words related to them that need to be practiced before you are able to form sentences that are indistinguishable from a native speaker's, i.e. that are in every way adequate to a situation and sound very natural. For example, if we have the sentence:
她多次遭到同事侮慢。
She suffered many slights from colleagues.
If you're at HSK level 6, you can deduce that "suffer, incur" corresponds to 遭到, or that 侮慢 must have to do with 侮辱, but let's be honest, we probably didn't yet know that these two words are the proper words to use in exactly this situation: slightly abstract, slightly euphemistic, referring to something unpleasant, but not wanting to express oneself too bluntly. So that is something you can learn, and it is quite useful.
It may well be that not every Chinese learner cares about such niceties, though to me at least, being able to find precisely the right word in Chinese seems pretty important.
Some of the issues that aren't accounted for:
- Words in quotations "x" should not be separated from their quotations. Sentences that have multiple quotes in them should probably not be included.
Yes, if the content of a quotation is made up of just one word, I wouldn't separate them, either. Why would you remove a sentence with multiple quotations? It could still be a good sentence.
- Is separating the measure words, e.g. "一项", into “一” and "项" on purpose?
Yes, I would keep it that way, so the learner can't see immediately what the measure word is, and needs to recognize it and also assign it to the right noun phrase, in case there is more than one measure word.
- Symbols, such as exclamations, question marks and commas, should probably be featured in the scrambled-up sentences. Chinese has multiple versions of the comma, not all of them carry over to the scrambled-up sentences.
Indeed, thanks, I noticed that, too. This is definitely something I need to fix.
Edit: I verified that my lists do contain all punctuation marks the way they're supposed to, but once the cards are imported into Pleco, some of them are lost. Perhaps Pleco does some parsing in the pronunciation field. Another reason to look forward to Pleco 4, where, as we know, custom fields can be created.
- You should probably give idioms another pass-over. e.g. 过河拆桥's sentence is just “过河拆桥。”, which is the sentence, but goes against the spirit of this game.
Oh yes, a one-idiom sentence would need to be scrambled.
- When sentences feature multiple non-HSK words they should probably not be part of an HSK list. That, or, you could give a quick word+definition as part of the English translation.
I rated the list by HSK levels, though they aren't strictly meant for HSK learners of those levels. I just use the HSK level vocabulary to rate them. Instead of HSK, I should perhaps introduce a simple scale with "Novice", "Beginner", "Lower Intermediate", "Intermediate", "Upper Intermediate", "Advanced", "Expert", and "Native-level", or similar.
A "word+definition" line below the English translation would look nice, but it would give you the answer right from the start, which I'd rather not do. But fortunately with Pleco, the student can always tap on any word or phrase once they see the answer, and add it to their flashcards.
- Not a big fan of short sentences. Apart from the idioms, "你在撒谎." and "我胃疼." seem like a waste. Maybe have sentence length correspond to HSK level? Since I shared that Purple Culture database with you it shouldn't be too difficult to have 15+ character sentences for HSK5-6.
I agree, you're absolutely right. I should add a check for whether a sentence consists of a single four-character idiom, and if so, scramble the idiom itself. If it doesn't, I could drop the sentence if it is five characters long or shorter.
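A minimal sketch of that check in Python. The names `classify`, `PUNCTUATION` and the `idioms` set are hypothetical, and I'm assuming here that punctuation doesn't count toward the length:

```python
# Assumed punctuation set; extend as needed for the actual data.
PUNCTUATION = "，。！？、；：,.!?;:"

def classify(sentence, idioms, min_length=6):
    """Return 'idiom', 'drop', or 'keep' for one sentence.

    `idioms` is a (hypothetical) set of known four-character idioms.
    """
    # Strip punctuation and whitespace before measuring the length.
    core = "".join(
        ch for ch in sentence if ch not in PUNCTUATION and not ch.isspace()
    )
    if len(core) == 4 and core in idioms:
        return "idiom"  # scramble the idiom character by character
    if len(core) < min_length:
        return "drop"   # five characters or shorter
    return "keep"

print(classify("过河拆桥。", {"过河拆桥"}))
print(classify("我胃疼.", {"过河拆桥"}))
```

With these inputs the first sentence is classified as an idiom to scramble and the second one is dropped.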
- To continue the point above. If the scrambled-up sentence, as is the case with "我胃疼.", only consists of two pieces, it's probably not worth including. Heck, anything less than 5 (excluding the symbols) is probably not worth including. Maybe for HSK1 "你好吗?" would work? I guess?
Exactly, yes, though something like “你好吗?” is already covered by study materials for beginners, so it's probably best if I leave out such short sentences.
- Can we avoid sentences with transliterated names, or names at all?
Yeah, in Tatoeba, there are just a handful of names that are repeated across all sentences. I should at least not scramble them, i.e. leave them together.
- IF the HSK example sentences are going to include idioms, can we use the frequency list we created to make sure they're part of the 500-600 most common idioms? While you can guess the meaning of 厚颜无耻(freq. 1800-2000), it doesn't make sense to include it in a list for let's say HSK5 that only includes one idiom (讨价还价).
I agree, yeah, I will integrate your frequency-sorted list of idioms into the rating code. Then we could even separate out all sentences with idioms, so users can practice sentences with idioms specifically.
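Roughly, the idiom filter could look like the following sketch. The function name `split_by_idiom`, the `idiom_rank` mapping and the rank numbers in the example are made up for illustration; the real ranks would come from the frequency list:

```python
def split_by_idiom(sentences, idiom_rank, max_rank=600):
    """Sort segmented sentences into three groups.

    sentences:  list of sentences, each a list of pieces (words)
    idiom_rank: dict mapping idiom -> frequency rank (1 = most common)
    Returns (plain, with_idioms, rejected).
    """
    plain, with_idioms, rejected = [], [], []
    for pieces in sentences:
        found = [p for p in pieces if p in idiom_rank]
        if not found:
            plain.append(pieces)          # no idiom at all
        elif all(idiom_rank[p] <= max_rank for p in found):
            with_idioms.append(pieces)    # only common idioms
        else:
            rejected.append(pieces)       # contains a rare idiom
    return plain, with_idioms, rejected

# Illustrative ranks only, not taken from the actual list.
ranks = {"讨价还价": 300, "厚颜无耻": 1900}
data = [["我", "胃", "疼"], ["他", "喜欢", "讨价还价"], ["他", "厚颜无耻"]]
plain, with_idioms, rejected = split_by_idiom(data, ranks)
```

The `with_idioms` group is what could be offered as a separate idiom practice list.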
- One of the sentences: "A n n 是 啦啦队队长." chooses 啦啦 (which means gossiping) as a word, as opposed to 啦啦队 (which means cheerleading). How does your code sort through which word to pick? Long words should probably take priority, after HSK specific words, to avoid such problems.
Yes, it starts by looking for all words of four characters, then three, two, and one, using the HSK lists and the BCC frequency list together. I saw that my BCC frequency list doesn't have 啦啦队, so that's the reason.
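To illustrate the effect, here is a sketch of one common way to do longest-word-first matching (greedy forward maximum matching). This is not necessarily how the actual code is structured, and `segment_forward` and the tiny lexicons are assumptions for the example:

```python
def segment_forward(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation, longest word first."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                pieces.append(candidate)
                i += size
                break
    return pieces

# Without 啦啦队 in the lexicon, 啦啦 wins and 队 is left over.
print(segment_forward("是啦啦队队长", {"啦啦", "队长"}))
# Adding 啦啦队 fixes the split.
print(segment_forward("是啦啦队队长", {"啦啦", "队长", "啦啦队"}))
```

So the fix in this case is simply a more complete word list, since the longer word already takes priority once it is present.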
- 《Whatever is inside the brackets should probably remain inside the brackets》... Same goes for all brackets.
Perhaps I'll scramble the entire sentence if the character string inside brackets is five characters long or less, or if it's longer, I will keep it inside the brackets and scramble that.
- “不” seems to always accompany another character(s) if that character is a verb, sometimes it messes up. 不知道 is an entry on Pleco, and will stay together. But, 不希望 will be divided up into "不希" and "望", which doesn't make sense.
Yes, thanks, that isn't good. I think I could try searching for substrings from right to left instead of from left to right. Then the word 希望 would get caught before 不希.
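A sketch of the right-to-left idea (often called backward maximum matching). The function name `segment_backward` and the small lexicon are assumptions for the example, and I assume 不希 is present in the word list, since that is what produces the bad split:

```python
def segment_backward(text, lexicon, max_len=4):
    """Greedy right-to-left segmentation, longest word first."""
    pieces = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                pieces.append(candidate)
                end -= size
                break
    pieces.reverse()  # pieces were collected from the end of the sentence
    return pieces

# 希望 is matched before 不希 can claim the 希.
print(segment_backward("我不希望", {"不希", "希望"}))
```

Scanning from the right will probably introduce a few wrong splits of its own in other sentences, so it would be worth comparing both directions on the full list before switching.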
- If the sentence has a question mark and a 吗, then let's not separate them.
Yes, it makes sense not to make the user do anything that's obvious.
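This could be a small post-processing step before shuffling, sketched here under the assumption that the sentence has already been split into a list of pieces (the function name is hypothetical):

```python
QUESTION_MARKS = ("?", "？")  # half-width and full-width

def merge_ma_with_question_mark(pieces):
    """Glue a question mark onto a preceding piece that ends in 吗."""
    merged = []
    for piece in pieces:
        if piece in QUESTION_MARKS and merged and merged[-1].endswith("吗"):
            merged[-1] += piece
        else:
            merged.append(piece)
    return merged

print(merge_ma_with_question_mark(["你", "好", "吗", "?"]))
```

Question marks that don't follow 吗 are left as separate pieces.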
That's it for now. I'm off to make myself some grub.
Thank you very much! Guten Appetit!
Cheers,
Shun