character frequency on Google and character encoding

feng

榜眼
Howdy,

I have some questions vaguely in the nature of computers and Chinese. I am hopeful that someone can at least point me in the right direction.

1) Short version: is there a not too onerous way -- for someone with the right computer skills -- to dump several thousand characters into Google and have it spit out the number of pages it finds for each? Something like:
的 15,710,000,000
一 9,350,000,000
睘 528,000
㗊 29,900

Long version: I was thinking of making a character dictionary for beginning through advanced/fluent students to learn and review from. The best I have found is Oxford's Concise (I think I have the second edition, which is old), which is not solely a character dictionary, but that is how I use it for review. I thought I might, if not do better, at least do something I thought was better :mrgreen:
The Olympics would be in Beijing again by the time I finished it, but it might be worth a try. Thinking of about 5,000 characters.

Doing such a project would require (I think require is the right word here) that the dictionary be ordered by character frequency so that it could actually be used by people from their first day of learning Chinese. Back in the 1980s both Taiwan and China compiled frequency lists (see James Dew's _6000 Words_), but one cannot get hold of them, and Dew's lists are incomplete. Even online I find Chinese people doing research into this who cannot find the PRC book, which sounded good (except for the haphazardly massaged numbers: x% of this genre, y% of that genre, z% of another genre). The Taiwan one may not have been published in more than an informal way, for all I know; it doesn't sound like it was necessarily a book. Yes, there's Jun Da's page (kind of screwy numbers, and missing characters despite an enormous corpus and a very long list of characters), and one from Hong Kong based on I know not what.

I thought Google would be a good choice since even for Chinese it has the widest coverage, and anyway Baidu doesn't tell you the number of pages past 100,000,000, which means you can't rank the characters occurring on the web more than 100,000,000 times.

There are issues in how you search, of course. If one searches in PRC (simplified) characters, it gets kind of goofy when you go to make a book that also has traditional characters: close to 200 traditional characters disappear into a single simplified form, though far fewer than that are in common use.

The web is also an imperfect place to search if one wants to search in traditional characters. Putting characters in quotes doesn't solve all your problems, but it makes for fewer of them. I could search just under .tw domains, which presumably would be (nearly) exclusively (TW) traditional characters (these differ from PRC and HK traditional forms in certain cases), though restricting to Taiwan would shrink the corpus dramatically.

I am especially looking for thoughts on this sort of automated searching. Can it be done? Is it overly time-consuming for someone with the appropriate skill set? This would be a great advertisement for Pleco if Pleco did it: "The Pleco Character Frequency List (aka The Pleco List)"

Extra credit question (though it is not actually a question): if this can be done, one could dump a whole dictionary full of multi-character words in and get them sorted by frequency. I don't know if there are intellectual property issues involved for other dictionaries, but since Pleco owns Wei Dongya's 98,000 word dictionary, that will do :lol:

Google is about as close as one can probably get to a natural weighting (not focused on just one genre like modern fiction or ancient poetry, as some lists are; nor artificially weighted by some all-knowing professor) of different usages for both characters and multi-character words since it has such a large corpus and pulls from such a diverse range of web pages.

2) Encoding: two questions
Are characters that look the same, but aren't the same and are typed in different input sets (made-up word), encoded the same? To give you an obscure example: 离 is a character in its own right aside from being the simplified form of 離. Are they encoded the same? I guess yes, because certain searches in Chinese bring back all Japanese results, but I ask because I have occasionally had difficulty searching in a word processing document for even an identical character typed with two different input systems, even though they were both typed in traditional characters.

My other question is about these little  things that I see from time to time after typing an uncommon character, or when I see an uncommon character on the web. Here is an example:
http://www.china-language.gov.cn/wenzig ... i/014c.htm
Just above "12 画" you will see . I have the printed list and nothing seems to be missing in the vicinity of  on that web page, so I am wondering what it is. I guess it has something to do with character encoding, but that list has no archaic characters.

I thank you for your time and patience.
 

mikelove

皇帝
Staff member
feng said:
1) Short version: is there a not too onerous way -- for someone with the right computer skills -- to dump several thousand characters into Google and have it spit out the number of pages it finds for each?

Not easily (we've actually tried this ourselves) - if you write a script to query one character at a time it'll start putting up error messages after 100 queries or so and won't let you do another search for a couple of hours. Spreading this out to a few hundred searches a day would probably be a bad idea, since they'll almost certainly tweak their algorithm / re-index something in the interim and you'll end up with frequencies that aren't accurate relative to each other (a character might have 20% more results on one day than another).
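
For anyone who wants to experiment anyway, here's a minimal sketch of that sort of script using Google's Custom Search JSON API (the scriptable route, as opposed to scraping the regular results page). The API key and search engine ID are placeholders you'd create in Google's developer console, totalResults is only Google's rough page estimate, and all of the caveats here still apply:

Code:
import time
import requests  # third-party HTTP library

API_KEY = "YOUR_API_KEY"            # placeholder - create in the Google developer console
CX = "YOUR_SEARCH_ENGINE_ID"        # placeholder custom search engine ID

def estimated_hits(term, site=None):
    """Ask the Custom Search JSON API for Google's estimated result count."""
    query = f'"{term}"' + (f" site:{site}" if site else "")
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": query, "num": 1},
        timeout=30,
    )
    resp.raise_for_status()
    return int(resp.json()["searchInformation"]["totalResults"])

characters = ["的", "一", "睘", "㗊"]      # or read several thousand from a file
counts = {}
for ch in characters:
    counts[ch] = estimated_hits(ch)        # estimated_hits(ch, site=".tw") to restrict to .tw domains
    time.sleep(2)                          # spread queries out; the free quota is roughly 100 queries/day

for ch, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{ch}\t{n:,}")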

You've also got to deal with things that might distort results for a given character even across repeated queries - heavy use in pages that they index a lot, for example (e.g. a character in the name of a popular website or product or a popular discussion forum software package); the hit count is telling you the number of pages but not really the prominence or uniqueness of those pages. And of course there are regional variants (which they sometimes merge and sometimes don't), common typos, puns, etc. to juggle too.

You might find Google Books' ngram data useful, though - a lot more rigorous and consistent than search results, albeit over a much smaller data set.
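
If you go that route, the raw 1-gram files are downloadable; a rough sketch of tallying per-character totals from one shard is below. The filename is just an example, and the column layout assumed here (ngram, year, match_count, volume_count) is the 2012-era format, so check the README of whichever dump you actually grab:

Code:
import gzip
from collections import Counter

totals = Counter()

# Example filename only - substitute whichever shard(s) you download.
with gzip.open("googlebooks-chi-sim-all-1gram-20120701-0.gz", "rt", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue                       # skip malformed lines
        ngram, year, match_count = fields[0], fields[1], int(fields[2])
        if len(ngram) == 1:                # single characters only
            totals[ngram] += match_count

for ch, n in totals.most_common(20):
    print(f"{ch}\t{n:,}")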

feng said:
Extra credit question: if this can be done, one could dump a whole dictionary full of multi-character words in and get them sorted by frequency. I don't know if there are intellectual property issues involved for other dictionaries, but since Pleco owns Wei Dongya's 98,000 word dictionary, that will do

We don't actually own it, we just have a perpetual license to do more-or-less whatever we want with it (a license that I believe will be officially extended to the new 3rd edition sometime in the next week or two). But we wouldn't have the legal power to release it as open source even if we wanted to; the rights only extend to our own use of it. (We do retain ownership of all of our modifications, though, so the few thousand new entries we've added are ours, as are the new examples etc. that we're working on.)

We've actually got a basic frequency table built into our Android app (and iOS in its next update) to use for sorting search results; it was aggregated from a combination of some corpus analysis (take a whole bunch of data and count the words in it), some prominent vocabulary lists (like HSK), and a couple of other sources. But it's not really good enough to be applied to an entire dictionary - it can do a decent job of comparing search results (figure out which word with a particular Pinyin is the most common, figure out which full-text result for "learn" is the most common word, etc), but we wouldn't want to rely on it to, say, suggest the first 200 words that you ought to learn.
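
Just to illustrate the general recipe (this is not the actual pipeline): count tokens in an already-segmented corpus, then nudge the ranking with curated lists like HSK. The file names and flat bonus below are made up:

Code:
from collections import Counter

def corpus_counts(path):
    """Count whitespace-separated tokens in an already-segmented corpus file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return counts

def load_word_list(path):
    """One word per line, e.g. a combined HSK list."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

corpus = corpus_counts("segmented_corpus.txt")   # placeholder path
hsk = load_word_list("hsk_words.txt")            # placeholder path

BONUS = 1000   # arbitrary boost for appearing on a curated list
scores = {w: c + (BONUS if w in hsk else 0) for w, c in corpus.items()}

for w in sorted(scores, key=scores.get, reverse=True)[:50]:
    print(w, scores[w])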

feng said:
Are characters that look the same, but aren't the same and are typed in different input sets (made-up word), encoded the same? To give you an obscure example: 离 is a character in its own right aside from being the simplified form of 離. Are they encoded the same? I guess yes, because certain searches in Chinese bring back all Japanese results, but I ask because I have occasionally had difficulty searching in a word processing document for even an identical character typed with two different input systems (even though they were both typed in traditional characters).

Nope, different encodings - they're separate code points in Unicode, and Google just treats them as variants of each other for the purposes of generating search results.
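
This is easy to confirm: the two characters occupy separate Unicode code points, and they map (or fail to map) to the legacy charsets independently. A quick check, with the caveat that legacy-charset coverage varies:

Code:
for ch in ("离", "離"):
    print(ch, f"U+{ord(ch):04X}", end="  ")
    for charset in ("gbk", "big5"):
        try:
            print(f"{charset}={ch.encode(charset).hex()}", end="  ")
        except UnicodeEncodeError:
            print(f"{charset}=not encodable", end="  ")
    print()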

feng said:
My other question is about these little  things that I see from time to time after typing an uncommon character, or when I see an uncommon character on the web. Here is an example:
http://www.china-language.gov.cn/wenzig ... i/014c.htm
Just above "12 画" you will see . I have the printed list and nothing seems to be missing in the vicinity of  on that web page, so I am wondering what it is. I guess it has something to do with character encoding, but that list has no archaic characters.

That seems to be a "private use" character - most likely a rare character that their Chinese text editing system assigned to a custom character code. Not much that can be done with those unless you know exactly which software they were created by, though.
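
If you can copy the mystery glyph out of the page, a couple of lines will tell you whether it sits in one of Unicode's Private Use Areas; the string below with the escaped code point is just a made-up example:

Code:
import unicodedata

def is_private_use(ch):
    """True if the character's Unicode category is Co (private use, e.g. U+E000-U+F8FF)."""
    return unicodedata.category(ch) == "Co"

text = "12画\ue234"       # hypothetical example containing one PUA code point
for ch in text:
    if is_private_use(ch):
        print(f"private-use code point: U+{ord(ch):04X}")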
 

feng

榜眼
Thank you! As always, you are a wealth of useful information. I will now investigate the ngram data. [Edit: I investigated it. The biggest problem is that it has only simplified characters.]

Emp :D said:
Spreading this out to a few hundred searches a day would probably be a bad idea, since they'll almost certainly tweak their algorithm / re-index something in the interim and you'll end up with frequencies that aren't accurate relative to each other (a character might have 20% more results on one day than another).
20%?! You mean they react to your code? Or no, they change stuff so much that day to day it is just not good to compare their results at different times, even if you enter 100 characters a day by hand, over 50 days, for 5,000?

Emp :D said:
Nope, different encodings - Google just treats them as variants of each other for the purposes of generating search results.
Is there any way to make Google or other search engines differentiate in their searches between identical or equivalent characters that originated in a particular encoding? Am I correct in understanding that Unicode goes not by simplified or traditional, but by whether something was originally GB or BIG5, etc.?

OK, no Google for me. Thanks! Good to know.

Emp :D said:
That seems to be a "private use" character - most likely a rare character that their Chinese text editing system assigned to a custom character code.
Ummmmm, even though the placement seems strange and there appears to be nothing missing nearby in the printed version?
 

mikelove

皇帝
Staff member
feng said:
20%?! You mean they react to your code? Or no, they change stuff so much that day to day it is just not good to compare their results at different times, even if you enter 100 characters a day by hand, over 50 days, for 5,000?

The latter - they tweak their algorithm often and there's no guarantee that results from different days will match up.

feng said:
Is there any way to make Google or other search engines differentiate in their search between same or equivalent characters originally in a particular encoding? Am I correct in understanding that Unicode goes not by simplified or traditional, but by whether something was originally GB or BIG5, etc.?

You can limit by region / language, which is not done 100% accurately but would eliminate most of the non-Taiwan hits anyway.
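
With the Custom Search API sketched earlier, that's the lr and cr parameters (e.g. lang_zh-TW and countryTW), though how strictly they actually filter is another matter:

Code:
import requests

params = {
    "key": "YOUR_API_KEY",             # placeholder
    "cx": "YOUR_SEARCH_ENGINE_ID",     # placeholder
    "q": '"離"',
    "lr": "lang_zh-TW",                # language restrict: Traditional Chinese
    "cr": "countryTW",                 # country restrict: Taiwan
}
resp = requests.get("https://www.googleapis.com/customsearch/v1", params=params, timeout=30)
resp.raise_for_status()
print(resp.json()["searchInformation"]["totalResults"])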

feng said:
Ummmmm, even though the placement seems strange and there appears to be nothing missing nearby in the printed version?

Might be an internal formatting code, then.
 