How can you SORT Chinese characters...single and multiple

Yes! And the sort order is a problem for the Unicode Consortium too!

Just as an illustration of another aspect of the problem:

ABC (Wenlin) Entries with identical spelling (including tones) are arranged by order of frequency (Xiandai Hanyu Pinlü Cidian and Zhongwen Shumianyu Pinlü Cidian):

淹浸 ¹yānjìn
烟禁 ²yānjìn
严谨 ¹yánjǐn
严紧 ²yánjǐn
严禁 yánjìn*
掩襟 yǎnjīn(r)
演进 yǎnjìn
演进到 yǎnjìndào

Microsoft (Word, Excel, etc.) Entries with identical spelling (including tones) are arranged by number of strokes

烟禁 ²yānjìn
淹浸 ¹yānjìn
严紧 ²yánjǐn
严谨 ¹yánjǐn
严禁 yánjìn*
掩襟 yǎnjīn(r)
演进 yǎnjìn
演进到 yǎnjìndào
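
To make the difference between the two tie-breaking rules concrete, here is a minimal Python sketch. It is only an illustration, not how Wenlin or Microsoft actually implement their sorts. The tone-numbered pinyin strings, the frequency ranks (taken from the superscripts ¹ and ² above) and the per-character stroke counts are typed in by hand as assumptions.

```python
from typing import NamedTuple, Tuple


class Entry(NamedTuple):
    hanzi: str
    pinyin: str               # tone-numbered pinyin, used as the primary sort key
    freq_rank: int            # 1 = most frequent among identical spellings (assumed)
    strokes: Tuple[int, ...]  # stroke count of each character (entered by hand)


# The four entries from the lists above whose spelling ties with another entry.
ENTRIES = [
    Entry("严紧", "yan2jin3", 2, (7, 10)),
    Entry("烟禁", "yan1jin4", 2, (10, 13)),
    Entry("严谨", "yan2jin3", 1, (7, 13)),
    Entry("淹浸", "yan1jin4", 1, (11, 10)),
]


def by_frequency(entries):
    """Pinyin first, then frequency rank (the ABC/Wenlin-style rule)."""
    return sorted(entries, key=lambda e: (e.pinyin, e.freq_rank))


def by_strokes(entries):
    """Pinyin first, then stroke counts character by character."""
    return sorted(entries, key=lambda e: (e.pinyin, e.strokes))


print([e.hanzi for e in by_frequency(ENTRIES)])
print([e.hanzi for e in by_strokes(ENTRIES)])
```

Both functions agree on the pinyin order and differ only in how they break ties, which is exactly where the two lists above diverge.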
 
bigram is a word as you know.
Hi Sy,
I beg to differ, unless you use "bigram" and "disyllabic word" as synonyms. But then again, "bigram" and "disyllabic word" do not refer to the same thing. Or, to put it in another way, all disyllabic words are bigrams, but not all bigrams are disyllabic words. For instance, 我也 is a bigram (as I understand this term), but I wouldn't say it's a word.
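
The distinction can be sketched in a few lines of Python, assuming "bigram" means any two adjacent characters in a text. The sample sentence and the tiny `LEXICON` set are hypothetical, chosen only to illustrate the point.

```python
def char_bigrams(text):
    """Return every pair of adjacent characters in the text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Hypothetical mini-lexicon of two-character words.
LEXICON = {"知道", "事实"}

bigrams = char_bigrams("我也知道事实")
words = [b for b in bigrams if b in LEXICON]

print(bigrams)  # every adjacent pair, whether or not it is a word
print(words)    # only the pairs that the lexicon lists as words
```

Every word in `words` is also in `bigrams`, but pairs such as 我也 appear only in `bigrams`.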
 
I aim to sort both.
E.g. 中, 中心, 中间, 中国, etc.
The sort can be word by word, with each term on its own line,
or by word with subheadings for the bigrams.
Do you mean something like this?


事实
事实层次
事实动词
事实婚姻
事实俱在
事实清单
事实如此
事实上
事实上公司
事实胜于雄辩
事实问题
事实修正
事实意义
事实昭彰
事实真相
澄清事实
事实
既成事实
经验事实
歪曲事实
违反事实
诬捏事实
隐匿事实
重要事实
事实
施事实词
依事实宣告无罪
以事实为根据
-------------------------------
and then the same with
事变
事畜
事端
事儿
.....
.....
.....
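
A rough Python sketch of this kind of grouping, under the assumption that the three blocks above correspond to terms that begin with the headword, end with it, or contain it in the middle. The function name `group_by_position` is made up for illustration, and the terms keep their input order inside each group, since a real pinyin sort would need reading data that is not included here.

```python
def group_by_position(head, terms):
    """Split terms by where the headword occurs: start, end, or middle."""
    groups = {"initial": [], "final": [], "medial": []}
    for term in terms:
        if term == head or head not in term:
            continue  # skip the bare headword and unrelated terms
        if term.startswith(head):
            groups["initial"].append(term)
        elif term.endswith(head):
            groups["final"].append(term)
        else:
            groups["medial"].append(term)
    return groups


# A few terms taken from the list above, plus one unrelated term (事变).
TERMS = ["事实上", "澄清事实", "施事实词", "事实真相", "既成事实", "以事实为根据", "事变"]

groups = group_by_position("事实", TERMS)
for position, members in groups.items():
    print(position, members)
```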
 
In English, we sort anything under the sun, but how do you sort Chinese characters?
Hi Sy,
Do you know the pdf file I've attached to this message? Have you ever read it? I thought it might be of interest to you. It's already 30 years old, though.
(The Need for an Alphabetically Arranged General Usage Dictionary of Mandarin Chinese: A Review Article of Some Recent Dictionaries and Current Lexicographical Projects)
 

Attachments

  • spp001_chinese_dictionary.pdf
(The Need for an Alphabetically Arranged General Usage Dictionary of Mandarin Chinese
Romanization or Pinyin is OK for "Mandarin", but creates problems for Cantonese or Japanese people...
Moreover, the spoken language changes. For now Pinyin is close to the pronunciation of standard Chinese, but some words are commonly pronounced in different ways. Luckily it is not like the difference between written and spoken language in English!
Perhaps the problem cannot have a unique solution for printed dictionaries, but for Pleco and digital dictionaries it is an opportunity!
 

Sy

进士
image.jpeg
As for your English comments (creative way to post!), I think there is not such a stark contrast between Chinese and English dictionaries. In 新華字典 the characters are in fact in a fixed position, ordered by pronunciation just as in an English dictionary. If one does not know at least one pronunciation of that character (very unlikely), one can simply use the 部首 index. Having to use such an index is a slight inconvenience, but I think one generally knows the pronunciation of the character one wishes to look up.

When you talk about the Hong Kong phonebook only ordering the first two characters, you mean the surname and one of two characters in the given name (if it is a two character given name)? Any list for Chinese names is a problem outside of, though made easier by, characters. Chinese surnames are few in number, and the number of common surnames fewer still.

OK, 20,000 magazine names. What's wrong with a phonetic ordering of Chinese characters? Pleco uses Hanyu pinyin as its romanization system. Even with 200,000+ different words it does not seem that one can get a terribly long list of possibilities typing in pinyin. Typing "yiyi" or "shishi" is perhaps among the worst cases, and even those are not too bad. Most combinations of syllables seem to offer few choices.
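
As a rough illustration of why typed pinyin narrows things down so quickly, here is a small Python sketch. It is not how Pleco works internally. The mini-lexicon and the helper names `build_index` and `candidates` are assumptions made up for the example, and a real dictionary would have far more entries per syllable string.

```python
from collections import defaultdict

# Hypothetical mini-lexicon: (word, toneless pinyin).
LEXICON = [
    ("事实", "shishi"), ("实施", "shishi"), ("时事", "shishi"), ("史诗", "shishi"),
    ("意义", "yiyi"), ("异议", "yiyi"),
    ("演进", "yanjin"), ("严禁", "yanjin"),
    ("中心", "zhongxin"),
]


def build_index(lexicon):
    """Map each toneless pinyin string to the words spelled that way."""
    index = defaultdict(list)
    for word, pinyin in lexicon:
        index[pinyin].append(word)
    return dict(index)


def candidates(index, typed):
    """Return the words a user would have to choose among after typing."""
    return index.get(typed, [])


index = build_index(LEXICON)
print(candidates(index, "shishi"))    # a "bad" case: several homophones
print(candidates(index, "zhongxin"))  # a typical case: a single choice
```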

I can't see how organizing a dictionary or list by 形碼 would give you the fixed position you seek. It would be a kind of index, just like bushou.

Sorry I have not been able to help.
 

朱真明

进士
At the end of the day I really think that this is just grasping at straws. If I look up the word "realistic" in an English dictionary, I will first have to find the "R" section, then "Re" and "Rea", and finally the "Real" section, and then I will have to scan through all of these words...

Real
Real ale
Real estate
Re-align
Realise
Realism
Realist

before reaching realistic. This order is not set and can vary dictionary to dictionary depending on how many words are in that dictionary. This scenario is not much different to that of 衣, 依, and 醫. That is, once you have reached the "yi 1" section you will have to scan through a list of varying 字 before reaching the one you are looking for. It's not really a big deal.
 

Sy

进士
Reply to no. 28
If you are scanning, it is not a good ordering system.
In an English dictionary, I don't have to scan. Thus, it is a good system.
In a Chinese dictionary, I scan a long list and don't find the entry.
Then I have to scan again before I realize that the character or term is NOT in the list.
This causes loss of time and aggravation.
If I design a new system, I would like to avoid this problem for both human and machine search.
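
For the machine side of this point, a minimal Python sketch may help. It assumes the list is kept in one strict order, so a binary search can either find an entry or show that it is absent without scanning the whole list. The word list and the helper name `find` are only illustrative.

```python
import bisect

# Any list works as long as it is sorted under a single, strict ordering.
WORDS = sorted([
    "real", "real ale", "real estate", "realise",
    "realism", "realist", "realistic",
])


def find(sorted_words, word):
    """Return the index of word, or None if it is not in the list."""
    i = bisect.bisect_left(sorted_words, word)
    if i < len(sorted_words) and sorted_words[i] == word:
        return i
    return None  # absence is known after about log2(n) comparisons


print(find(WORDS, "realistic"))  # found: prints its position
print(find(WORDS, "reality"))    # not in the list: prints None
```

The same idea carries over to Chinese entries once every entry has a key that never ties with another entry's key.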
 

Sy

进士
Just as an illustration of another aspect of the problem:

ABC (Wenlin) Entries with identical spelling (including tones) are arranged by order of frequency (Xiandai Hanyu Pinlü Cidian and Zhongwen Shumianyu Pinlü Cidian):

淹浸 ¹yānjìn
烟禁 ²yānjìn
严谨 ¹yánjǐn
严紧 ²yánjǐn
严禁 yánjìn*
掩襟 yǎnjīn(r)
演进 yǎnjìn
演进到 yǎnjìndào

Microsoft (Word, Excel, etc.) Entries with identical spelling (including tones) are arranged by number of strokes

烟禁 ²yānjìn
淹浸 ¹yānjìn
严紧 ²yánjǐn
严谨 ¹yánjǐn
严禁 yánjìn*
掩襟 yǎnjīn(r)
演进 yǎnjìn
演进到 yǎnjìndào

I did not go to page 2, so I missed the posts.
Now I have caught up.

In the above lists in pinyin order, the Chinese characters stay together.
I have seen pinyin-ordered lists in which the Chinese characters are separated.
If one does a machine sort, one will probably see the Chinese characters separated.
 

Sy

进士
Hi Sy,
I beg to differ, unless you use "bigram" and "disyllabic word" as synonyms. But then again, "bigram" and "disyllabic word" do not refer to the same thing. Or, to put it in another way, all disyllabic words are bigrams, but not all bigrams are disyllabic words. For instance, 我也 is a bigram (as I understand this term), but I wouldn't say it's a word.


TRUE,
I AGREE.
 

朱真明

进士
Reply to no. 28
If you are scanning, it is not a good ordering system.
In an English dictionary, I don't have to scan. Thus, it is a good system.

I'm pretty sure that in my comment I showed that you do scan in an English dictionary.
 

Sy

进士
Do you mean something like this?


事实
事实层次
事实动词
事实婚姻
事实俱在
事实清单
事实如此
事实上
事实上公司
事实胜于雄辩
事实问题
事实修正
事实意义
事实昭彰
事实真相
澄清事实
事实
既成事实
经验事实
歪曲事实
违反事实
诬捏事实
隐匿事实
重要事实
事实
施事实词
依事实宣告无罪
以事实为根据
-------------------------------
and then the same with
事变
事畜
事端
事儿
.....
.....
.....

Reply to no. 24
This attachment is similar to the Commercial Press (商务) 《汉英词典》 (Chinese-English Dictionary), without the antonym-entry format (反义词条格式).

image.jpeg
 

Sy

进士
Hi Sy,
Do you know the pdf file I've attached to this message? Have you ever read it? I thought it might be of interest to you. It's already 30 years old, though.
(The Need for an Alphabetically Arranged General Usage Dictionary of Mandarin Chinese: A Review Article of Some Recent Dictionaries and Current Lexicographical Projects)


Sobri.....
I cannot read it now due to my ignorance.
I wish it were shown directly.
I will try to decode it later.
I am sure it will be very interesting. Thanks.
 

Sy

进士
I'm pretty sure that in my comment I showed that you do scan in an English dictionary.

真明: you make me think.
Thanks so much.
I admire how Romanized languages sort.
In the best language system, one should not have to scan.
A word should be in a FIXED LOCATION in a dictionary.
If one wants to scan and waste time, that is one's choice.
I like automation. Manual mode is too slow.
Scanning makes me dizzy (头晕眼花).
 

朱真明

进士
A word should be in a FIXED LOCATION in a dictionary.

A word or character is always in a fixed location inside a dictionary, it just varies depending on how you go about identifying where the fixed location is or which dictionary you are using. For example the character 真 is on page 726 of 遠東拼音漢英辭典. Because this dictionary is ordered by pinyin, I can first look for "Z" and then "Zh", "Zhe" finally to "Zhen". At that point I would be at page 725, from there I just need to scan through 貞, 珍, 針, and 砧 before reaching 真. Alternatively I could go to the back of the dictionary and look up the word by stroke order or radical which would then tell me the exact page number.

Regardless of which method you used the character 真 is still in a fixed location in this dictionary. You will not find it anywhere else. If I were to use another dictionary of course it will be in a slightly different location. But this is the same in English. If you were to compare one word in two different dictionaries and check what words are listed before that word and after that word, each dictionary would be different. This is because they contain different numbers of words and maybe sometimes conjugation and pluralization aren't taken into consideration. In order to find the word you will still have to go through the "pronunciation sections" method in order to find the fixed location of the word. This isn't even considering accents, regional variants, alternate pronunciations and so forth.

Natural language is subject to many variations due to numerous cultural influences. Automation is only suitable for logically consistent languages like mathematical language or programming language.
 

Sy

进士
Hi Sy,
Do you know the pdf file I've attached to this message? Have you ever read it? I thought it might be of interest to you. It's already 30 years old, though.
(The Need for an Alphabetically Arranged General Usage Dictionary of Mandarin Chinese: A Review Article of Some Recent Dictionaries and Current Lexicographical Projects)


I just copied and pasted your reference to do a Google search.
I found it and read the 20-plus pages.
Mair emphasizes the pinyin approach, with a lot of historical background. 30 years later, pinyin cannot solve the present problem. 王力… et al. wrote the article I posted here.
They had an opposing view.
 

feng

榜眼
Romanization or Pinyin is OK for "Mandarin", but creates problems for Cantonese or Japanese people...
I don't understand your point. All the Japanese dictionaries for native speakers that I am aware of are arranged by kana (i.e. by pronunciation). A syllabary is just an alphabet with a different name due to linguists loving to name things ;)
Aside from Cantonese's lack of standardization, what is the issue with using romanization for Cantonese (a language of which I am ignorant)?

Moreover, the spoken language changes. For now Pinyin is close to the pronunciation of standard Chinese, but some words are commonly pronounced in different ways. Luckily it is not like the difference between written and spoken language in English!
Perhaps the problem cannot have a unique solution for printed dictionaries, but for Pleco and digital dictionaries it is an opportunity!
If one knows standard Mandarin and the basic rules of juyin, or a given system of romanization, it is hard to spell things wrong. There are barely 400 syllables in actual use, not counting tones. What do you mean when saying that some words are commonly pronounced differently? You mean characters? Multi-character words? Taiwan vs PRC pronunciation? Could you give a couple of examples please?

Sy: Love your posting style; even better with the paper still on the clipboard!
Frankly, I think your fundamental question has been answered by more than one person on this thread. One can not expect to go to a restaurant and get a meal one likes without perusing the menu, ordering, and then waiting for the food to be made. One can not go to a library and get the right book without consulting the catalog and/or browsing the shelves.
I agree with you that it makes no sense that PRC dictionaries ordered by Pinyin then inexplicably throw the characters in at random (or is there some logic?) under the same tone, rather than ordering them by stroke count which has been the practice of the last 400 years. Of course, "yi" is the most populous syllable in Hanyu pinyin, so that somewhat exaggerates the problem.
Is your interest in 22,000 characters, more than three quarters of which practically no one has ever seen, theoretical or practical? In other words, what is it you want to do with these uncommon characters? Counting variants, there are well over 100,000 characters, but arguably less than 30,000 basic characters (i.e. non-variants) with only 5,000 or so of those known by educated people (I've been testing!), so what is your need for a lightning fast, all perfect lookup method for rare characters?


"The Need for an Alphabetically Arranged General Usage Dictionary of Mandarin Chinese" is not worth the time of day, IMHO. Neither is the dictionary it spawned :confused:
 

Sy

进士
A word or character is always in a fixed location inside a dictionary, it just varies depending on how you go about identifying where the fixed location is or which dictionary you are using. For example the character 真 is on page 726 of 遠東拼音漢英辭典. Because this dictionary is ordered by pinyin, I can first look for "Z" and then "Zh", "Zhe" finally to "Zhen". At that point I would be at page 725, from there I just need to scan through 貞, 珍, 針, and 砧 before reaching 真. Alternatively I could go to the back of the dictionary and look up the word by stroke order or radical which would then tell me the exact page number.

Regardless of which method you used the character 真 is still in a fixed location in this dictionary. You will not find it anywhere else. If I were to use another dictionary of course it will be in a slightly different location. But this is the same in English. If you were to compare one word in two different dictionaries and check what words are listed before that word and after that word, each dictionary would be different. This is because they contain different numbers of words and maybe sometimes conjugation and pluralization aren't taken into consideration. In order to find the word you will still have to go through the "pronunciation sections" method in order to find the fixed location of the word. This isn't even considering accents, regional variants, alternate pronunciations and so forth.

Natural language is subject to many variations due to numerous cultural influences. Automation is only suitable for logically consistent languages like mathematical language or programming language.

真明 and all,

In English, if a dictionary has only 3 words, namely
Cat
Dog
Pig
then Dog is indexed between Cat and Pig, in its fixed position.
Dog cannot come after Pig.
In your example, 贞珍针砧真,
I or anyone else could index them as 真针珍贞砧.
Thus, 真 has no fixed position.
I wish I knew how to express it more clearly.
Another thing: when you go to the back of the dictionary to use another system, you cause delay by introducing multiple systems for dictionary lookup.
When I use the English system, I use only one system: alphabetic sort.
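
The point about 真 can be sketched in Python. Sorting by pinyin alone leaves the five characters tied, so the result depends on the order they arrived in. Adding tie-breakers gives each character one fixed position. The stroke counts are entered by hand, and the choice of stroke count followed by Unicode code point as tie-breakers is only an assumption for illustration, not an established standard.

```python
# character -> (tone-numbered pinyin, stroke count); values entered by hand
CHARS = {
    "贞": ("zhen1", 6),
    "珍": ("zhen1", 9),
    "针": ("zhen1", 7),
    "砧": ("zhen1", 10),
    "真": ("zhen1", 10),
}


def pinyin_key(ch):
    """Pinyin only: all five characters compare as equal."""
    return CHARS[ch][0]


def full_key(ch):
    """Pinyin, then stroke count, then code point: no two keys are equal."""
    pinyin, strokes = CHARS[ch]
    return (pinyin, strokes, ord(ch))


list_a = list("贞珍针砧真")
list_b = list("真针珍贞砧")

# Python's sort is stable, so tied items keep their input order.
print(sorted(list_a, key=pinyin_key), sorted(list_b, key=pinyin_key))

# With the full key, both inputs end up in the same order.
print(sorted(list_a, key=full_key), sorted(list_b, key=full_key))
```

Note that 砧 and 真 have the same stroke count here, so stroke count alone would still leave one tie. The last component of the key is what settles it.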
 

朱真明

进士
Maybe you should add the word "Realistic" into your three word list, it might clarify things a little better.

What I'm getting at, is that you have created an unrealistic scenario that is incapable of honestly reflecting reality.

If you are including all of the dictionaries in a language, then without a doubt no word ever has a fixed location. Its location is determined by the dictionary it is found in. I have already shown this and if you own two English dictionaries then I encourage you to test out the method. Furthermore, not all English dictionaries are ordered alphabetically, have you ever used a specialist dictionary before? Some of them are organised by category which assumes that you are already familiar with that field of knowledge.

Chinese is more flexible, it is not a phonetic or syllabary based writing system therefore if you don't know the pronunciation of the word there are other means by which you can look it up in a dictionary. You do not need to use multiple systems to find the word, you just need to use the appropriate one. Meaning that having multiple systems to look up words is not a burden but actually a freedom.

I honestly think that it has already been proven that you will not be able to eliminate scanning or searching for words in paperback dictionaries, if you have any evidence to the contrary then please present it.

All in all, what you desire has already been achieved in electronic dictionaries. As we move into the electronic age and paper-books are slowly reduced in favour of electronic versions, dictionaries would probably be the first to go. Meaning this type of idealism is relatively pointless.

Anyway, I did enjoy the conversation but still wonder, in your research have you got anything that can contribute towards the development of systems for organising Chinese characters?
 