You're welcome. I noticed in other digitized Chinese texts (ones that may have been OCR'ed) that there are Hiragana characters interspersed in them, possibly as markers of characters the OCR engine didn't recognize. So it could be that these characters just made their way into the corpus and the frequency list, since the generation of the frequency list is fully automated. Just a possibility.
Thank you, that is a lot of instances. I looked at the original global GB18030 text file using Wenlin, and it gave me the following message:
It seems to have the same number of Hiragana characters as in your listing. Perhaps they just weren't very careful about the corpus sources, i.e. they included some Japanese source material? Though then there would have to be Katakana and Kanji-only characters in it, as well. I know that John has sent an E-mail to the creator asking him about the 第, and if he answers, perhaps he could also ask about the Japanese characters.
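For anyone who wants to check the Katakana question themselves, here is a rough sketch of how one could count kana in the file. It assumes the corpus is a plain text file in GB18030 and simply tests each character against the Unicode Hiragana (U+3040 to U+309F) and Katakana (U+30A0 to U+30FF) blocks. The function names and the file name in the usage note are made up for illustration, and this is not how Wenlin or the frequency list's author counted.

```python
from collections import Counter

# Unicode block ranges, inclusive. Note that the Katakana block also
# contains a few marks such as the middle dot (U+30FB), so the count
# is coarse.
HIRAGANA = (0x3040, 0x309F)
KATAKANA = (0x30A0, 0x30FF)


def count_kana(text):
    """Count characters in the Hiragana and Katakana blocks (hypothetical helper)."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        if HIRAGANA[0] <= cp <= HIRAGANA[1]:
            counts["hiragana"] += 1
        elif KATAKANA[0] <= cp <= KATAKANA[1]:
            counts["katakana"] += 1
    return counts


def count_kana_in_file(path, encoding="gb18030"):
    """Read a text file (GB18030 assumed) and count kana in it."""
    with open(path, encoding=encoding, errors="replace") as f:
        return count_kana(f.read())
```

Calling `count_kana_in_file("corpus.txt")` (placeholder file name) would return the two totals. If Hiragana shows up in large numbers but Katakana is close to zero, that would fit the OCR-marker idea better than the Japanese-source idea.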
Thanks! So with this many extraneous characters, either this is all just noise in the data, or they deliberately included Japanese texts. One almost can't include Japanese texts in a Chinese corpus by accident. If it's noise in the data, it would have to come from OCR errors. But then why keep it in the frequency list?