feng
榜眼
Howdy,
I have some questions vaguely in the nature of computers and Chinese. I am hopeful that someone can at least point me in the right direction.
1) Short version: is there a not too onerous way -- for someone with the right computer skills -- to dump several thousand characters into Google and have it spit out the number of pages it finds for each? Something like:
的 15,710,000,000
一 9,350,000,000
睘 528,000
㗊 29,900
Long version: I was thinking of making a character dictionary for the use of beginning through advanced/fluent students to learn and review. The best I have found is Oxford's Concise (I think I have the second edition, old), which is not a character dictionary only, but that is how I use it for review. I thought I might, if not do better, at least do something I thought was better :mrgreen:
The Olympics would be in Beijing again by the time I finished it, but it might be worth a try. Thinking of about 5,000 characters.
Doing such a project would require (I think require is the right word here) that the dictionary be ordered by character frequency so that it could actually be used by people from their first day of learning Chinese. Back in the 1980s both Taiwan and China did their lists (see James Dew's _6000 Words_), but one cannot get hold of them and Dew's lists are incomplete. Even online I find Chinese people doing research into this who cannot find the PRC book, which sounded good (except for the haphazardly massaged numbers: x% of this genre, y% of that genre, z% of another genre). The Taiwan one may not have been published in more than an informal way, for all I know. It doesn't sound like it was necessarily a book. Yes, there's Jun Da's page (kind of screwy numbers, and missing characters despite an enormous corpus and a very long list of characters), and one in Hong Kong based on I know not what.
I thought Google would be a good choice since even for Chinese it has the widest coverage, and anyway Baidu doesn't tell you the number of pages above 100,000,000, which means you can't put in order the characters occurring on the web more than 100,000,000 times.
There are issues in how you search, of course. If one searches in PRC (simplified) characters, it gets kind of goofy when you go to make a book that also has traditional characters, since what may be close to 200 traditional characters disappear in the simplified set, though far fewer than that among the ones in common use.
The web is also an imperfect way to search if one wants to search in traditional characters. Even if you put characters in quotes, it doesn't solve all your problems, but it makes for fewer problems. I could search just under .tw domains which presumably would be (nearly) exclusively (TW) traditional characters (which differ from PRC and HK traditional characters in certain cases), though using Taiwan would shrink the corpus dramatically.
I am especially looking for thoughts on this sort of automated searching. Can it be done? Is it overly time-consuming for someone with the appropriate skill set? This would be a great advertisement for Pleco if Pleco did it: "The Pleco Character Frequency List (aka The Pleco List)".
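To make the idea concrete, here is a minimal sketch in Python of the sorting half of the job only. It assumes some lookup function that returns a page count for one character; `lookup`, `sample_lookup`, and `SAMPLE` are hypothetical names made up for illustration, and the stand-in below just replays the four figures quoted above rather than querying any search engine. How (or whether) real counts can be fetched automatically is exactly the open question.

```python
import re
from typing import Callable, Iterable, List, Tuple


def parse_count(text: str) -> int:
    """Turn a result count such as '15,710,000,000' into an integer."""
    digits = re.sub(r"\D", "", text)  # drop commas, spaces, and other non-digits
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


def rank_by_hits(chars: Iterable[str],
                 lookup: Callable[[str], int]) -> List[Tuple[str, int]]:
    """Look up each distinct character once and sort by count, highest first."""
    counts = {ch: lookup(ch) for ch in chars}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical stand-in for a real search: the four counts from the post.
SAMPLE = {
    "的": "15,710,000,000",
    "一": "9,350,000,000",
    "睘": "528,000",
    "㗊": "29,900",
}


def sample_lookup(ch: str) -> int:
    """Return the recorded count for ch (assumption: counts were saved by hand)."""
    return parse_count(SAMPLE[ch])


if __name__ == "__main__":
    for ch, hits in rank_by_hits(["㗊", "一", "睘", "的"], sample_lookup):
        print(f"{ch} {hits:,}")
```

Passing the lookup in as a parameter means the same sorting code would work whether the counts come from a saved file, a hand-typed table, or some automated source, and it would work unchanged for multi-character words.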
Extra credit question (though it is not actually a question): if this can be done, one could dump a whole dictionary full of multi-character words in and get them sorted by frequency. I don't know if there are intellectual property issues involved for other dictionaries, but since Pleco owns Wei Dongya's 98,000 word dictionary, that will do :lol:
Google is about as close as one can probably get to a natural weighting (not focused on just one genre like modern fiction or ancient poetry, as some lists are; nor artificially weighted by some all-knowing professor) of different usages for both characters and multi-character words since it has such a large corpus and pulls from such a diverse range of web pages.
2) Encoding: two questions
Are characters that look the same, but aren't the same and are typed in different input sets (made-up word), encoded the same? To give you an obscure example: 离 is a character in its own right aside from being the simplified form of 離。Are they encoded the same? I guess yes because certain searches in Chinese bring back all Japanese results, but I ask because I have occasionally had difficulty searching in a word processing document for even an identical character typed with two different input systems, even though both were typing in traditional characters.
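One way to check this for any given pair is to print the Unicode code points, sketched below in Python. In Unicode, 离 has a single code point (U+79BB) whichever role it plays, and 離 has its own (U+96E2). The second helper tests one possible cause of the failed searches, namely that one input system produced a "compatibility" duplicate of the character; that is only a guess at the cause, and `code_point` and `same_after_normalizing` are names made up for this example.

```python
import unicodedata


def code_point(ch: str) -> str:
    """Return the Unicode code point of a single character, e.g. 'U+79BB'."""
    return f"U+{ord(ch):04X}"


def same_after_normalizing(a: str, b: str) -> bool:
    """True if two strings become identical once put into NFC form.

    NFC replaces CJK compatibility ideographs (look-alike duplicates kept in
    Unicode for round-tripping older encodings) with the ordinary character.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print("离", code_point("离"))  # one code point, whatever role the character plays
    print("離", code_point("離"))  # a different character with its own code point

    # U+F900 is a compatibility duplicate of 豈 (U+8C48): they look alike on
    # screen but compare unequal until normalized.
    duplicate, ordinary = "\uf900", "\u8c48"
    print(duplicate == ordinary)
    print(same_after_normalizing(duplicate, ordinary))
```

If two characters that look identical print different code points, a plain text search for one will not find the other.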
My other question is about these little things that I see from time to time after typing an uncommon character, or when I see an uncommon character on the web. Here is an example:
http://www.china-language.gov.cn/wenzig ... i/014c.htm
Just above "12 画" you will see . I have the printed list and nothing seems to be missing in the vicinity of on that web page, so I am wondering what it is. I guess it has something to do with character encoding, but that list has no archaic characters.
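If the mystery glyph can be copied out of the page, a short Python sketch like the one below would at least report what it is. It assumes, without any confirmation, that the culprit might be a private use code point (a slot that fonts and websites can fill with their own glyphs, so it shows up blank or as a box elsewhere); `inspect_char` is a name invented for this example.

```python
import unicodedata


def inspect_char(ch: str) -> dict:
    """Report the code point, general category, and name of one character."""
    category = unicodedata.category(ch)
    return {
        "code_point": f"U+{ord(ch):04X}",
        "category": category,
        # 'Co' is the Unicode general category for private use characters.
        "private_use": category == "Co",
        "name": unicodedata.name(ch, "(no name)"),
    }


if __name__ == "__main__":
    # U+E000 is the first private use code point; 离 is an ordinary ideograph.
    for ch in ["\ue000", "离"]:
        print(inspect_char(ch))
```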
I thank you for your time and patience.