iPhone Feature Requests

gato · May 22, 2010

So the point here is that IMEs list character associations by estimated word frequency, whereas dictionaries display words in pinyin order. Seems a fair enough point.

Not sure that Pleco has the data available for word frequency to implement association, though. Word frequency data seems to be subject to copyright. Google was embarrassed a few years ago after its use of Sohu IME's word frequency data in the Google pinyin IME was publicized.

mikelove · May 22, 2010

rypervenche - that makes sense, but in that case wouldn't it be better if we just sorted the search results by frequency? We could even have an option to only do that for 1-character searches so that you'd still be able to browse alphabetically by Pinyin in other cases. We'd need to add frequency data in either case, but it seems like in general that would achieve the same end more efficiently.

Unfortunately, that frequency data, while it is integrated into the iPhone, isn't something we can really access, and as gato points out we don't have that data ourselves either - generating some from a large corpus is a possibility, as is providing a way for users to supply that sort of data on their own (e.g. by flagging words that are in a particular flashcard list, so you could import all of HSK and have HSK words come up first), but it's also something we might be able to license; it doesn't seem like there's a good free multi-character wordlist available online (or else Google probably would have used it themselves).

mikelove · May 23, 2010

Had some trouble sleeping, so I combined the contents of about half a dozen free internet Chinese word frequency lists, weighted them roughly evenly, filtered out words that didn't appear in more than one, and came up with a set of about 40,000 that seems like it might actually work - thanks very much for the continued prodding on this.
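For anyone curious what that kind of merge could look like, here is a minimal sketch. It is not the actual script used for the list: the function name, the `min_sources` parameter, and the choice to normalise each list to relative frequencies before averaging are all assumptions made for illustration.

```python
from collections import defaultdict

def merge_frequency_lists(lists, min_sources=2):
    """Merge several {word: count} dicts into one ranked list of words.

    Hypothetical helper: each source is normalised to relative
    frequencies so that every list carries roughly equal weight,
    and words seen in fewer than `min_sources` lists are dropped.
    """
    scores = defaultdict(float)   # averaged relative frequency per word
    sources = defaultdict(int)    # number of lists each word appears in
    for counts in lists:
        total = sum(counts.values())
        if total == 0:
            continue  # skip empty lists rather than divide by zero
        for word, count in counts.items():
            scores[word] += (count / total) / len(lists)
            sources[word] += 1
    kept = [w for w in scores if sources[w] >= min_sources]
    # Highest score first; ties broken by the word itself for stable output
    return sorted(kept, key=lambda w: (-scores[w], w))

# Toy data, not real corpus counts
news = {"中国": 50, "发展": 30, "我们": 20}
fiction = {"我们": 60, "什么": 30, "中国": 10}
merged = merge_frequency_lists([news, fiction])
```

Adjusting the per-list weight (here a flat `1 / len(lists)`) would be one way to correct for a bias toward any single kind of source.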

Here's a list of the first 500 words from that list in order - does anyone notice any particularly big omissions / particularly out-of-place words in there? (perhaps I should post the first few thousand as flashcard lists)

gato · May 23, 2010

It seems to be a frequency list derived from PRC newspapers. There's a pretty heavy bias towards official language.

Here are the top two-character words on that list:

中国
我们
没有
什么
发展
这个
现在
他们
一个
就是
可以
自己
那个
问题
工作
经济
国家
记者
企业

It might be good enough to be used for an IME, however.

The two-character word (bigram) frequency list from general fiction available here seems to be closer to the frequency experience in everyday language. I noticed that the author copyright notice on the site restricts commercial use of the list. But I suppose you could always go through electronic texts of contemporary novels on your own to generate a frequency list.
http://lingua.mtsu.edu/chinese-computin ... m/form.php
Bigram frequency list for the general fiction sub-corpus
1. 一个 57186 4.21834619678
2. 什么 45946 7.66008832863
3. 没有 40408 5.89442863226
4. 自己 35411 8.22590596472
5. 我们 33208 4.37625854551
6. 他们 31875 4.56612189359
7. 知道 22390 7.78745189666
8. 起来 21237 5.37620423292
9. 这个 21137 3.83614462594
10. 时候 19493 7.78539684589
11. 这样 19203 5.23431393801
12. 怎么 17118 7.3134229469
13. 已经 16604 8.44503875864
14. 现在 16279 5.7008234155
15. 出来 14315 4.45046966356
16. 不能 13410 4.15616711186
17. 还是 13211 3.71041616582
18. 不知 12635 4.47906889549
19. 可以 12462 6.5032612462
20. 女人 12426 5.08744035395
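Generating a bigram list from your own texts is straightforward in principle. The sketch below is only an illustration, assuming the input is plain Unicode text and that "bigram" means any two adjacent Chinese characters; the function names are made up here, and this is not how the list above was built.

```python
from collections import Counter

def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def bigram_counts(text):
    """Count every pair of adjacent Chinese characters in `text`.

    No word segmentation is done, so overlapping pairs are all counted
    and pairs that span punctuation or Latin text are skipped.
    """
    counts = Counter()
    for a, b in zip(text, text[1:]):
        if is_cjk(a) and is_cjk(b):
            counts[a + b] += 1
    return counts

counts = bigram_counts("我不知道，你知道")
top = counts.most_common(1)
```

Because nothing is segmented into words, a three-character word like 不知道 contributes both 不知 and 知道 to the counts.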

mfcb · May 23, 2010

i may be wrong, but for the application of a frequency list it's pretty unimportant where it comes from and whether it is exact. in fact i could not choose any because i sometimes read novels, but also sometimes i read news.

i expect from this data, that when i enter "guo" i get to see the most frequent words starting with guo, and i guess, that it's not so important if some of the words swap positions...

as for frequency lists in IMEs, they are self-adjusting. it really helps when writing messages, as the vocab you used comes up prominently, hehe, so maybe mixing the "standard" frequency list with "my" flashcards would create my personal frequency list, best option?

am i missing something important?
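One simple way to read the "standard list plus my flashcards" idea is sketched below. It assumes a baseline `{word: frequency}` dict and a set of flashcard words, and it simply puts flashcard words first, then falls back to the baseline frequency. The function and parameter names are hypothetical, and a real IME would adapt to usage in a more gradual way.

```python
def personal_rank(candidates, base_freq, flashcard_words):
    """Order candidate words: flashcard words first, then by frequency.

    `base_freq` maps word -> baseline frequency (missing words count as 0);
    `flashcard_words` is the set of words the user has flagged.
    """
    return sorted(
        candidates,
        key=lambda w: (w not in flashcard_words, -base_freq.get(w, 0), w),
    )

# Toy numbers, not real frequency data
base_freq = {"国家": 900, "过去": 700, "果汁": 50}
ranked = personal_rank(["国家", "过去", "果汁"], base_freq, {"果汁"})
```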

hairyleprechaun · May 23, 2010

I am also a strong proponent of frequency-based character predictions for the handwriting recognizer. I input Chinese daily, both using handwriting and Pinyin IMEs, on a few different devices (laptop and desktop PCs, HP iPAQ PDA, iPhone, Nokia mobile phone) and I find that having good frequency-based predictions makes Chinese input much faster for everyday Chinese writing. A few people in this forum have mentioned using Apple's Chinese IME over Hanwang in Pleco because of the lack of predictions. Actually, I too am not using the Hanwang handwriting recognizer in Pleco because of its lack of predictions even though I love its superior (when compared to Apple's handwriting recognizer) character recognition, the ability to set the background to transparent for seeing instant dictionary results, and the time-saving gestures added by Pleco. However, from my experience, Apple's handwriting recognizer and Pinyin IME both have very poor character predictions. I am not sure where Apple came up with such poor character frequency charts, because using a Pinyin IME on many of the mainland China Nokias (one of the best Pinyin IMEs Nokia has is on the 5300 which oddly enough happens to be much better than the IME included on some of their more expensive models) or using the Wefit IME on mobile phones (currently used on my jailbroken iPhone) or the CE-Star IME (currently used on my PDA) all have excellent frequency-based predictions.

mikelove · May 23, 2010

gato - good point, but I think that bias could be corrected by adjusting the weightings (not all of those sources were newspaper-based) and/or factoring in an additional list or two from more colloquial sources.

That list you link to is somewhat biased too, though - 起来 I suspect is in there because of its verb-complement usage rather than the literal meaning "get up," so definitely not as commonly seen in official language, 怎么 likewise is rather colloquial, and 不知 is there because of 不知道 because the list was a pure bigram breakdown rather than a more intelligent word-based segmentation. Also not sure why 女人 is so high, that definitely seems like a literature-specific quirk - not that it's not important, but I don't see it ranking ahead of more basic words like 如果 or 因为 or 问题.

mfcb - yes, exactitude isn't necessarily important if we're just using this for input and/or results sorting.

hairyleprechaun - definitely, some of the frequency orderings in Apple's Pinyin input are rather appalling actually.

So here's a thought on how we might handle result list sorting in a future release (experimentally in 2.1.x and more officially in 2.2):

Single character - entries for that exact character sorted by most common pronunciation (per Unihan / Hanyu Pinlu), then words beginning with that character sorted by length and (within the same length) frequency. 一直 per rypervenche's example would therefore be something like the 7th or 8th item listed on the results page for 一. (Apple's list on my phone at least goes 共个些定起下样直 while the list I posted goes 个些定样直下起般)

> button on a single character would still jump you to all of the words beginning with that character / pronunciation, alphabetically sorted. The change wouldn't bother most people, since before this they were only getting those first few characters and no others.

Single Pinyin syllable - characters sorted by frequency, followed by words sorted by length and frequency. Though there's also a case to be made for sorting multi-syllable results alphabetically I suppose...

Multiple characters - never likely to have more than a few results, probably sort by length and frequency still though.

Multiple Pinyin syllables - definitely want to sort these by frequency.

Words not found - break down into multiple but length-matched result sections, except for the last section which would be open-ended; 普利科 for example would give you all of the single-character entries for 普, then all single-character entries for 利, then all entries starting with 科 (length-sorted), while 词典软 would give you all of the exact 词典 and then all of the starting-with-软 results including 软件.
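To make the proposed ordering concrete, here is a rough sketch of the length-then-frequency sort and the "words not found" breakdown. It is only an illustration under simplifying assumptions: the dictionary is modelled as a plain set of headwords (one entry per headword), the query is split by greedy longest match, and all function names are invented here rather than taken from Pleco.

```python
def sort_by_length_then_frequency(words, freq):
    """Shorter words first; within the same length, most frequent first."""
    return sorted(words, key=lambda w: (len(w), -freq.get(w, 0), w))

def segment(query, entries):
    """Greedy longest-match split; unknown text falls back to single chars."""
    pieces, i = [], 0
    while i < len(query):
        for j in range(len(query), i, -1):
            if query[i:j] in entries or j == i + 1:
                pieces.append(query[i:j])
                i = j
                break
    return pieces

def fallback_sections(query, entries, freq):
    """Result sections for a query with no exact match.

    Every piece except the last yields only its exact entry; the last
    piece is open-ended and yields every headword starting with it.
    """
    pieces = segment(query, entries)
    if not pieces:
        return []
    sections = [[p] if p in entries else [] for p in pieces[:-1]]
    tail = [w for w in entries if w.startswith(pieces[-1])]
    sections.append(sort_by_length_then_frequency(tail, freq))
    return sections

# Toy headword set and made-up frequencies
entries = {"普", "利", "科", "科学", "科学家", "词", "词典", "软", "软件"}
freq = {"科学": 500, "科学家": 100, "软件": 300}
sections_a = fallback_sections("普利科", entries, freq)
sections_b = fallback_sections("词典软", entries, freq)
```

With this toy data, 普利科 breaks into sections for 普, 利, and everything starting with 科, while 词典软 gives the exact 词典 entry followed by 软 and 软件.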

Or would it be better if we simply built this into the handwriting input system but left the search result lists untouched?

gato · May 23, 2010

I would be careful with displaying the results by frequency. People looking up something in a dictionary are more often than not looking for a less common word. What they are looking for probably isn't going to be near the top in a frequency-sorted list. Sorting by frequency instead of by pinyin might make things harder to find.

I could see it being used when displaying the results of a full text search. The usage then is more similar to a search engine. In other cases, I would just limit it to the IME.

rypervenche · May 23, 2010

mikelove: I actually like everything you said. I think you should go with what you stated in your post rather than only put it in the handwriting tool, for the simple reason that not everyone will use the handwriting tool. If I need to for some reason use pinyin or bopomofo, I will lose out on the frequency feature. I think having the option to sort either by pinyin or by frequency would be a handy option. Something on the search screen would be most useful I think.

gato: That is not necessarily true gato. It may be for advanced learners or fluent speakers, but for beginners and intermediate (or on the off chance you just simply don't know a common or easy phrase) sorting by frequency would still be a good idea. If you feel that the word/phrase is a rare one, then you know to head toward the bottom of the list.

mikelove · May 24, 2010

gato - I don't think we'd make it a default option, at least not right away - just something people could turn on in Settings, though probably at the most basic level of them in 2.2 when we roll out our simple / advanced settings dichotomy. Behavior changes like that definitely aren't something you want to spring on people who might like things the way they are now. With customizable toolbars we could always make it something button-invokable too.

rypervenche - great! Just seems more logical this way, though handwriting suggestions might make sense in the text editing screen even if we stuck with the current system for handwriting input in dictionary searches.

The imprecise nature of most Chinese frequency lists (no way around it, really, spoken / written / formal / informal / technical / business / Taiwan / mainland / etc Chinese often involve very different vocabularies) means that your scrolling-to-the-bottom idea might not be such a slam-dunk, though; for a search with a hundred results, whether a particular word comes up at #30 or #90 may be largely a matter of chance. Which suggests we might want to make the sort order independently configurable for different search types, or have an option to sort alphabetically when the number of results is past a certain threshold, or maybe have a separate frequency-sorted section for results among the, say, 5000 most common words and then list the remaining results (or, better still, all results) in alphabetical order after those.
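The last of those options could be sketched roughly as follows. This is just one possible reading, assuming a `rank` dict (1 = most common word) and a `pinyin` dict used as the alphabetical sort key; both structures, the function name, and the choice to list only the remaining results in the second section are hypothetical.

```python
def hybrid_order(results, rank, pinyin, top_n=5000):
    """Split results into a frequency-sorted section and an alphabetical one.

    Words ranked within `top_n` come first, most common first; everything
    else follows in Pinyin order. Unranked words are treated as rare.
    """
    common = [w for w in results if rank.get(w, top_n + 1) <= top_n]
    rest = [w for w in results if rank.get(w, top_n + 1) > top_n]
    common.sort(key=lambda w: rank[w])
    rest.sort(key=lambda w: pinyin.get(w, w))
    return common, rest

# Toy ranks and Pinyin keys for illustration only
rank = {"国家": 120, "过去": 300, "果汁": 4000, "蝈蝈": 30000}
pinyin = {"国家": "guo2jia1", "过去": "guo4qu4", "果汁": "guo3zhi1", "蝈蝈": "guo1guo"}
common, rest = hybrid_order(["果汁", "国家", "过去", "蝈蝈"], rank, pinyin, top_n=1000)
```

Keeping the cutoff small means chance differences in rank only affect the short first section, while rarer words stay findable alphabetically.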

hairyleprechaun · May 24, 2010

I definitely prefer to have frequency based character prediction added to the handwriting recognizer. However, the idea of a toggle button to switch the search results to/from frequency based is quite intriguing.

mikelove · May 24, 2010

Interesting... well I guess we could make it an option in both places, at least initially - can always take one away / bury one in Advanced Settings in 2.2 once we get some better feedback on it.

Vzzzbx · May 26, 2010

Haven't seen this mentioned anywhere: A way to flag flashcards during testing would be super. Occasionally I'll hit one that I really, really want to analyse later (or just edit), but when I'm halfway through a hundred-card test I don't want to go out and back in.

mikelove · May 26, 2010

Already implemented, just tap on the middle one of the three tabs to the left of the answer buttons (triangle icon), then tap on the category button to assign the card to a new category (or remove it from one it's already in). You can save yourself a tap by having that button default to a specific category through Flashcard Testing / Commands / Buttons / Category button default, in which case it acts as a toggle (so if the card is already in the category, the icon changes and tapping on the button will remove it from that category).

Vzzzbx · May 27, 2010

Wow. Best software ever.

Tony · Jun 5, 2010

One tiny feature request: is it possible to turn off the iPhone/iPad scrolling feature on those character screens that don't have any scrolling (specifically, the screens with magnified characters)? I like to practice writing the character on those pages and get a little distracted with the character moving around as I'm writing with my finger on the screen.

Thanks!
Tony

mikelove · Jun 6, 2010

Good idea - we'll try to get that in for 2.1.1. Thanks!

Tony · Jun 7, 2010

mikelove said:
Good idea - we'll try to get that in for 2.1.1. Thanks!

Awesome! Thanks, Mike.

YoshiCookie · Jun 7, 2010

#1

I don't know how/if this is possible, but if there was a way to put an add-on into Safari so that there could be a pop-up, in-Safari mini-dictionary (ala nciku.com's Tooltip Dictionary), that would be amazing.

I guess you have the web browser in the program, but it seems like a bit hidden and the interface isn't so pretty. And does it have a pop-up dictionary or do you have to copy it to the pasteboard? (I can't remember)

#2

This sounds really advanced to me, but it would be amazing for beginning users if you could take a photograph of a Chinese character in real-life (using the camera) and have the dictionary recognize it and bring up the definition, etc.

#3

We've talked about this, but it would be great to have additional dictionary resources, like a Chinese-Japanese dictionary. Or have all user dictionaries online and accessible as long as you have an internet connection.

#4

I'm excited for 2.2. I hope the interface gets even better! I'm hoping for customizable fonts. I wish there were more character etymological information, the kind that you find in Wenlin, or Chinese Text Project, or sites that show examples of primitive forms of the characters (like how 其 was originally a picture of a basket, then they added a table underneath (丌) to clarify... then it was a sound loan character... now the "basket" meaning is written as 箕)... that kind of detailed information on-the-go.

Wenlin is working on their version 4... it's taking them forever and a half... and only after that are they considering an iPhone app. You guys at Pleco seem a lot faster.

#5

You should be able to switch Trad./Simp. at any time. For example, when I'm at the "Stroke Order Diagram" page, I have to back up, then switch to the other set, then go back forward. I constantly look at both sets, so it would be nice to be able to switch back and forth more easily. I wish the UNICODE database were more... well, UNIFIED. Every single variant (Japanese variant, historical variant, older form, simplified form) is a separate entry with no reference to the others.

Thanks for the best product... ever!!!!

numble · Jun 8, 2010

YoshiCookie said:
This sounds really advanced to me, but it would be amazing for beginning users if you could take a photograph of a Chinese character in real-life (using the camera) and have the dictionary recognize it and bring up the definition, etc.

I suspect that this is in the works. I may be wrong though. There are some iPhone Apps that do this already.

There are even Apps that will let you take photos of a whole document and convert that into editable text. I don't know how good those Apps are, but that would be amazing for copying in newspaper or magazine articles into the reader to read.

I still am waiting/hoping for MP3 lyrics support in the reader.
