How can you SORT Chinese characters...single and multiple

Sy · Jan 3, 2016

feng said:
I don't understand your point. All the Japanese dictionaries for native speakers that I am aware of are arranged by kana (i.e. by pronunciation). A syllabary is just an alphabet with a different name, due to linguists loving to name things.
Aside from Cantonese's lack of standardization, what is the issue with using romanization for Cantonese (a language of which I am ignorant)?

If one knows standard Mandarin and the basic rules of zhuyin, or a given system of romanization, it is hard to spell things wrong. There are barely 400 syllables in actual use, not counting tones. What do you mean when saying that some words are commonly pronounced differently? You mean characters? Multi-character words? Taiwan vs PRC pronunciation? Could you give a couple of examples please?

Sy: Love your posting style; even better with the paper still on the clipboard!
Frankly, I think your fundamental question has been answered by more than one person on this thread. One can not expect to go to a restaurant and get a meal one likes without perusing the menu, ordering, and then waiting for the food to be made. One can not go to a library and get the right book without consulting the catalog and/or browsing the shelves.
I agree with you that it makes no sense that PRC dictionaries ordered by Pinyin then inexplicably throw the characters in at random (or is there some logic?) under the same tone, rather than ordering them by stroke count which has been the practice of the last 400 years. Of course, "yi" is the most populous syllable in Hanyu pinyin, so that somewhat exaggerates the problem.
Is your interest in 22,000 characters, more than three quarters of which practically no one has ever seen, theoretical or practical? In other words, what is it you want to do with these uncommon characters? Counting variants, there are well over 100,000 characters, but arguably less than 30,000 basic characters (i.e. non-variants) with only 5,000 or so of those known by educated people (I've been testing!), so what is your need for a lightning fast, all perfect lookup method for rare characters?

"The Need for an Alphabetically Arranged General Usage Dictionary of Mandarin Chinese" is not worth the time of day, IMHO. Neither is the dictionary it spawned.

Feng and all,
I use screenshots of my handwriting because that is easier for me. When I type on the iPad, it changes my spelling (for example, "in" becomes "I") and makes many other changes.
I am not a fast typist.
Another thing: when I type a reply to you, I have to scroll to the top of the page to read your post and then scroll back down to continue answering.
With a screenshot, I read your post on the iPad and write my response on paper without scrolling up and down the page.
I want to discuss all your points properly. That may take many postings. Likewise, I want to discuss with the others in the same way.
I may not have time to do it all at once or right away.
In a pinyin arrangement, too many characters share the same sound. As you said, there are only about 400 sounds for all those thousands of single characters.
There are also homophonous words (同音词), which make it difficult for pinyin to become a written language in its own right.
For now, pinyin is really only for phonetic annotation (注音).
Till next time.

feng · Jan 3, 2016

Sy,

I really do like your posting style. I would do it myself, but I am too lazy.

What I was trying to say in my last post when referring to "yi" is that since 1615 dictionaries have typically (and still do for Taiwan and those I have seen from Hong Kong) listed your three example characters as 衣依醫, that is by stroke count. The problem you point to is a PRC problem.
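
The ordering described here, pronunciation first and stroke count as the tie-breaker, can be sketched roughly as follows. This is only an illustration: the `ENTRIES` table and the `dictionary_key` helper are hypothetical names, and the stroke counts are typed in by hand for the three example characters.

```python
# Hypothetical mini-table: (character, syllable, tone, stroke count).
# Stroke counts are entered by hand for this illustration.
ENTRIES = [
    ("醫", "yi", 1, 18),
    ("依", "yi", 1, 8),
    ("衣", "yi", 1, 6),
]


def dictionary_key(entry):
    """Order by syllable, then tone, then stroke count."""
    _char, syllable, tone, strokes = entry
    return (syllable, tone, strokes)


# Characters that share syllable and tone fall back to stroke count.
ordered = "".join(entry[0] for entry in sorted(ENTRIES, key=dictionary_key))
print(ordered)
```

With a key like this, characters under the same syllable and tone are no longer in arbitrary order, although two characters with the same stroke count would still need a further rule.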

As for your cat-dog-pig dictionary, see how many Americans (and likely other English speakers) can find voila or onomatopoeia in the dictionary. They know both words, but I bet you they can't spell them -- and in English you have no other recourse if you can't spell it (until Google came along).

I am of the opinion that 朱真明 has answered your questions fully. As for my own attempts, you make me feel I am in a dialogue with myself. I commented on the putative Li Wang essay, I asked you questions about what you are trying to do . . . and I got little or nothing by way of direct response. I wish you all the best with whatever it is you are doing.

Sy · Jan 4, 2016

I would like ALL members to read this first part of an article
ON THE RADICAL INDEX SYSTEM
Somehow the image is not clear here, but it is clearer in my photo album.
I don't know why. I hope you can read it OK.

I will reply to Feng's post later.

Sy · Jan 4, 2016

Here is another piece of writing for those who like to read.

Sy · Jan 4, 2016

Last reference that may be of interest.

To post a full image:
TAKE A PICTURE OF THE WRITING
WRITE MY NOTES
SELECT "UPLOAD A FILE"
FIND THE PICTURE JUST TAKEN IN THE ALBUM
POST REPLY

Sy · Jan 5, 2016

As I mentioned, I would come back to answer this part of Mr. Feng's post.

Sy · Jan 6, 2016

Feng and all:
To deal with same-sound entries in a pinyin index, dictionaries use the following 3 methods for a
detailed index.
1/ 札字法: order by the strokes of 札. The first stroke is horizontal, the second is vertical, the third is a slash to the left, the fourth is a slash to the right, and the fifth is crooked or bent.
2/ 江山千古，曲法....use the first stroke to index
3/ 寒来暑往，曲法.......same as above
Confused? There is no standard.
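
A small sketch of why competing first-stroke orders give "no standard". It assumes, as a reading of the post rather than something stated in it, that each four-character mnemonic spells out its own stroke order through the first stroke of each character. The `FIRST_STROKE` table, the `ORDERS` names, and the single-letter stroke codes are hypothetical and entered by hand.

```python
# First-stroke class of each sample character, entered by hand.
# h = horizontal, s = vertical, p = left slash, d = dot / right slash, z = bend
FIRST_STROKE = {"江": "d", "山": "s", "千": "p", "古": "h"}

ORDERS = {
    # From the post: horizontal, vertical, left slash, right slash, bend.
    "札": "hspdz",
    # Assumption: the order of the first strokes of 江, 山, 千, 古.
    "江山千古": "dsph",
    # Assumption: the order of the first strokes of 寒, 来, 暑, 往.
    "寒来暑往": "dhsp",
}


def sort_by_first_stroke(chars, order):
    """Sort characters by where their first stroke falls in the given order."""
    return sorted(chars, key=lambda ch: order.index(FIRST_STROKE[ch]))


sample = ["江", "山", "千", "古"]
results = {
    name: "".join(sort_by_first_stroke(sample, order))
    for name, order in ORDERS.items()
}
for name, result in results.items():
    print(name, "->", result)
```

Under these assumptions the same four characters come out in three different sequences, which is the inconsistency complained about here.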

Years ago, a friend working at the Library of Congress said the magic number of characters is about 22,000, enough to index books, personal names and geographic names.
Unicode now has about that number, including Japanese and Korean characters.
I collected a list published by China that had about 21,000 characters. They have since increased it to 27,000. I stay with a maximum of 22,000, not for myself but as a reference for others. I like real-time computer speed; I don't need lightning speed.
山西 and 陕西 have the same sound and differ only in tone, so their pinyin spellings are identical when tone marks are left out. The spelling of 陕西 was changed as an exception.

Abun · Jan 6, 2016

For a print dictionary, I don't think it's possible to ever eliminate the necessity of some index, for a simple reason: there are two main scenarios where you would look something up in a Chinese dictionary: a) you know the pronunciation but not the character or meaning, or b) you know the character but not the pronunciation or meaning. (Of course there is the scenario where you know both pronunciation and character and just look for the meaning, but then you have the luxury of being able to choose either a or b for looking the word up.) In a good dictionary it should be possible to find a word in either of those scenarios. However, as far as I can see, it's impossible to find an unambiguous way to order words which results in the same order for both input methods. So the compiler has to choose ordering either by pronunciation or by character shape. In both cases there would have to be an index to allow the user to look a word up with the respective other method. A hybrid method, such as ordering by pronunciation but ordering homophones by character shape, doesn't solve the problem, because you wouldn't be able to find the list of homophones without knowing the pronunciation of the character you're looking for.

I also second the argument made before that it is impossible to completely eliminate the necessity to scan. I think there might have been a misunderstanding caused by different interpretations of "fixed position": as far as I understand the OP, s/he means "unambiguous" (relative to other words), while other contributors understood it as fixed in terms of absolute position.
Even if the position of a word in a dictionary is unambiguous in the ordering system (as it is for example in English dictionaries ordered by spelling, apart from a few homographs), it is still improbable that I find it at first glance. More likely, I open the dictionary at the approximate location I expect to find it according to the first one or two letters (e.g. towards the beginning for c-, towards the end for u-) and then have to compare with other words to know in which direction and by how much I was off. Unless I'm already very close, I will again take a guess at how many pages I am off and then repeat the process until I've found the right page. And when I do, I still have to compare with adjacent words to find the exact location. I know where to expect to find the word instead of having to scan at random, that's true, but I do have to scan.
While an absolutely fixed position should in theory indeed be possible to find without scanning, it presupposes that I as the user know that absolute position beforehand. I'm afraid I fail to see how that could be possible.
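
The guess, compare and adjust process described above is close to a binary search. The sketch below is a simplified stand-in, not a model of how a person actually flips pages: it always probes the middle of the remaining range, and the `lookup` function and the word list are hypothetical.

```python
def lookup(words, target):
    """Binary search over a sorted word list.

    Returns (index, probes); index is -1 when the word is absent.
    """
    low, high = 0, len(words) - 1
    probes = 0
    while low <= high:
        mid = (low + high) // 2
        probes += 1  # one "open the book and compare" step
        if words[mid] == target:
            return mid, probes
        if words[mid] < target:
            low = mid + 1   # target lies further back in the book
        else:
            high = mid - 1  # target lies nearer the front
    return -1, probes


words = sorted(["cat", "dog", "pig", "set", "run", "one", "under"])
index, probes = lookup(words, "dog")
print(words[index], "found after", probes, "comparisons")
```

Even with an unambiguous order, finding a word takes more than one comparison, because the absolute position is not known in advance.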

Sy · Jan 7, 2016

Feng said: "If one knows standard Mandarin and the basic rules of zhuyin, or a given system of romanization, it is hard to spell things wrong. There are barely 400 syllables in actual use, not counting tones. What do you mean when saying that some words are commonly pronounced differently? You mean characters? Multi-character words? Taiwan vs PRC pronunciation? Could you give a couple of examples please?"

Feng and all,
Sorry, I did not address the question in the second paragraph, as quoted above.

The trouble is that Mandarin is not spoken by everyone; therefore, romanization cannot serve as the written language. For example, Cantonese speakers would not be able to read romanized (pinyin) writing. I agree with you that having only 400 spellings is a plus; it cuts down on spelling mistakes. In Wang Li's essay, one of the problems he raised was that of homophonous words (同音词). If two words, whether single-character or multi-character, have the same sound, one cannot tell which word is meant. For single characters, the problem is worse.
I cannot easily give you many examples; however, I can look them up in my references. Off the top of my head, I can only give SHANXI, which can mean 山西 or 陕西. In this case, they changed the spelling of 陕西 to SHAANXI.
I did not mean to leave you hanging.
I don't know whether I have answered you fully; if not, please rephrase the question.
We will continue the discussion.

Sy · Jan 7, 2016

Abun and All

alex_hk90 · Jan 7, 2016

Sy said:
真明 and all,

In English, if a dictionary has only 3 words, namely
Cat
Dog
Pig
then Dog is indexed between Cat and Pig in its fixed position.
Dog cannot come after Pig.
In your example, 贞珍针砧真,
I or anyone else can index them as 真针珍贞砧.
Thus, 真 has no fixed position.
I wish I knew how to express it more clearly.
Another thing: when you go to the back of the book to use another system, you cause delay by introducing multiple systems for dictionary lookup.
When I use the English system, I use only one system: alphabetic sort.

This phenomenon is inevitable for any language (including both English and Chinese) with homographs (same spelling, multiple meanings) and/or homophones (same pronunciation, multiple meanings).
To take a similar example as you have done, with 3 words in English, namely:
set;
run;
one;
there are dozens of different meanings and no consistent place across different dictionaries:
set: Wiktionary, TheFreeDictionary, Merriam-Webster;
run: Wiktionary, TheFreeDictionary, Merriam-Webster;
one: Wiktionary, TheFreeDictionary, Merriam-Webster.
So to find a particular meaning you still have to first look up by spelling, then by type of word, then by meaning.
Yes it might not be as common in English as in Chinese, but it is still fairly common, and for fairly common words as well.

Abun · Jan 14, 2016

@Sy: So your idea is to use a code point system which includes phonetic, graphic and semantic information? Interesting idea; that way not only would different characters have an unambiguous code, but indeed each subentry would. But if the code includes 音 (sound), 形 (form) and 義 (meaning), how would it be possible to find the desired entry if you know only one of the three (unless there are indices again, of course)? And how would you handle multi-character entries? Those have their own 義, but their 音 and 形 cannot be covered with the same encoding method as single-character entries (unless maybe you use only the 音 and 形 information of the first character). Or is what you envision more a 字典 (character dictionary) than a 辭典 (word dictionary)?

Sy · Jan 14, 2016

Sy · Jan 14, 2016

ABUN ... CONTINUED, TO ALL
One more thing: if I have the meaning in English in another column, I can pull out the Chinese FORM and SOUND.
In an English-to-Chinese dictionary, there is no index problem.
In a Chinese-to-English dictionary, I have to use another indexable language for assistance, like English here. Later, there may be a solution.
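
The idea of pulling out form and sound from an English meaning column can be sketched as one table with a separate index per column. This is a rough illustration only: `RECORDS` and `build_index` are hypothetical, and the English glosses are supplied here for the example rather than taken from the thread.

```python
from collections import defaultdict

# Hypothetical records: (form, sound, English meaning).
# The English glosses are assumptions added for this illustration.
RECORDS = [
    ("香菇", "xiang1gu1", "shiitake mushroom"),
    ("木耳", "mu4er3", "wood ear fungus"),
    ("草菇", "cao3gu1", "straw mushroom"),
]


def build_index(records, field):
    """Map each value found in column `field` to the records containing it."""
    index = defaultdict(list)
    for record in records:
        index[record[field]].append(record)
    return index


by_form = build_index(RECORDS, 0)
by_sound = build_index(RECORDS, 1)
by_meaning = build_index(RECORDS, 2)

# Knowing only the English meaning, recover the Chinese form and sound.
form, sound, _meaning = by_meaning["straw mushroom"][0]
print(form, sound)
```

Each column still needs its own index, so this does not remove the need for several lookup paths; it only keeps them over the same set of records.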

Sy · Jan 15, 2016

I went to the flash cards about food and borrowed a list of edible mushrooms from Miguel (ref #3). This is a short list for illustration. Thank you, Miguel.

草菇 [草菇] cao3gu1
春菇 [春菇] chun1gu1
刺芹菇 [刺芹菇] ci4qin2gu1
金针菇 [金針菇] jin1zhen1gu1
口蘑 [口蘑] kou3mo2
木耳 [木耳] mu4er3
平菇 [平菇] ping2gu1
食菌 [食菌] shi2jun4
松蕈 [松蕈] song1xun4
香菇 [香菇] xiang1gu1
银耳 [銀耳] yin2er3
猪苓 [豬苓] zhu1ling2

Sy · Jan 15, 2016

In my ref #55 above, I struggled to put up this chart without grid lines.
I manually edited the mushroom list and deleted some entries,
leaving just the Chinese names and pinyin with tones,
then sorted by pinyin. The result is shown above.
I did this to illustrate the sorting of Chinese terms in pinyin order.
One time, I went to a book sale and wanted to order a magazine; the seller could not find the magazine's
name or the cost of a subscription.
I keep thinking that if the seller had indexed a master magazine list, he would not have had such trouble finding the information.
I wish I knew an easier way to convert Chinese-character magazine names to pinyin in rows, as in Excel, so that I could sort a list of thousands of names. Please advise me of your steps. Many thanks.
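
One possible way to sort rows by pinyin outside Excel, offered as a minimal sketch and not a full answer to the question. It assumes the pinyin for each name is already available in a table; the `PINYIN` dictionary here is hypothetical and simply reuses readings from the mushroom list. For thousands of names that table would have to come from a dictionary file or a conversion tool.

```python
import re

# Hypothetical pronunciation table; readings copied from the mushroom list.
PINYIN = {
    "香菇": "xiang1gu1",
    "木耳": "mu4er3",
    "草菇": "cao3gu1",
    "银耳": "yin2er3",
    "口蘑": "kou3mo2",
}


def pinyin_key(name):
    """Split numbered pinyin into (letters, tone) pairs, one per syllable.

    Comparing syllable by syllable keeps the letters ahead of the tone digit.
    """
    syllables = re.findall(r"([a-z]+)([1-5])", PINYIN[name])
    return [(letters, int(tone)) for letters, tone in syllables]


def sort_by_pinyin(names):
    return sorted(names, key=pinyin_key)


names = ["香菇", "木耳", "草菇", "银耳", "口蘑"]
for name in sort_by_pinyin(names):
    print(name, PINYIN[name])
```

The sorted rows can then be written back out, for example as a CSV file that Excel opens directly. Names that share the same pinyin still need a second rule to fix their relative order.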

Sy · Jan 16, 2016

Alex (ref 51) and all,
In my previous posts I wrote about "no fixed position".
I repeat it here with another example.
See the front section of the 新华字典 (Xinhua Dictionary) on using radicals to find a character.
On the page for the three-dot water radical,
under 2 strokes:
汀
汁
汇
氿
汈
汉
氾
I can order the characters as 汇, 汁, 汀, 氿, 氾, 汉, 汈.
Also please refer to ref no. 43: 涂建国 says there is no fixed order (无定序), etc.
Also see Feng and 朱真明.
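
The "no fixed position" point can be shown with a short sketch: sorting by stroke count alone leaves these seven characters tied, so any listing is as valid as another. Adding a tie-breaker fixes the order. The tie-breaker used here, the Unicode code point, is an assumption chosen for illustration and not a convention from any dictionary; `RESIDUAL_STROKES` and both function names are hypothetical.

```python
# Hypothetical table: each character has 2 strokes after the water radical.
RESIDUAL_STROKES = {ch: 2 for ch in "汀汁汇氿汈汉氾"}


def by_strokes_only(chars):
    # sorted() is stable, so tied characters simply keep their input order.
    return sorted(chars, key=lambda ch: RESIDUAL_STROKES[ch])


def by_strokes_then_code_point(chars):
    # Assumed tie-breaker: the Unicode code point, unique for each character.
    return sorted(chars, key=lambda ch: (RESIDUAL_STROKES[ch], ord(ch)))


listing_a = list("汀汁汇氿汈汉氾")  # order copied from the post
listing_b = list("汇汁汀氿氾汉汈")  # the alternative order from the post

print(by_strokes_only(listing_a) == by_strokes_only(listing_b))
print(by_strokes_then_code_point(listing_a) == by_strokes_then_code_point(listing_b))
```

With stroke count alone the two listings stay different; with the extra key they collapse to a single order. Whether a reader could predict that order without an index is a separate question.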

朱真明 · Jan 17, 2016

Sy said:
Alex (ref 51) and all,
In my previous posts I wrote about "no fixed position".
I repeat it here with another example.

I think everybody here already understands your issue, but you haven't addressed the counter-argument that this phenomenon you are trying to deal with is actually an inevitable one that cannot be dealt with.

I made the claim that you will find this phenomenon in English as well as in every so-called "Natural Language". If you are familiar with linguistics, especially socio-linguistics and historical linguistics, you would know that they are different from computational linguistics. That is, natural language isn't entirely logical due to its complicated history of cultural influences.

My argument is (and I think "feng" "abun" and "alex" are in general agreement) that what you desire to achieve is logically unachievable and that you should probably just accept the idiosyncrasies of natural language.

Are you willing to argue against the claims made? Do you have an argument that shows what you want is logically valid?

I understand you probably created this discussion thread hoping for some practical advice on how to achieve what you want but I think if we cannot sort out the fundamentals of the issue then practical advice isn't going to be that practical.

alex_hk90 · Jan 17, 2016

Sy said:
In my previous posts I wrote about "no fixed position". [...]
I can order the characters as 汇, 汁, 汀, 氿, 氾, 汉, 汈.

I agree that you can order those characters in multiple ways - what I was trying to show is that you can order words in multiple ways in English as well, and hence I gave a few examples in a few dictionaries where the particular meanings of words had "no fixed position". For all intents and purposes, these are different 'words' with "no fixed positions". I don't think there are many, if any, languages where every word has a "fixed position" in a dictionary.

朱真明 said:
My argument is (and I think "feng" "abun" and "alex" are in general agreement) that what you desire to achieve is logically unachievable and that you should probably just accept the idiosyncrasies of natural language.

Exactly.

Sy · Jan 24, 2016

朱真明 said:
I understand you probably created this discussion thread hoping for some practical advice on how to achieve what you want but I think if we cannot sort out the fundamentals of the issue then practical advice isn't going to be that practical.

In reality, I never thought of starting a thread to ask for advice.
Right now I don't have a firm answer either. I am trying to design something that
may answer some questions about the sorting problem. Please refer to the printed attachment
I posted. People voiced their problems. These problems have existed for a long time; I did not make them up.
It is too complex to reveal the solution in a few postings.
So I would rather do the easier sorting of pinyin terms, as I did in the MUSHROOM
POST previously.
I am trying to learn the flash card and OCR systems here to see if I can find an easy
way to compile the pinyin listing.
When I have a firm solution, I shall be back on the character sort issue.
I consider sorting characters to be like sorting a can of worms.
