A few questions about Pleco functions and dictionaries

timseb

进士
Hi!

I have been using Pleco for a while and I also have the professional bundle. I have a few things I have not quite got a grip on, and will number my questions:

1. Is it possible to import a character or word and get *all* available definitions in the dictionary, and not just one of them? For example, if I import 划, so far I have only found the options to either pick an entry manually or make Pleco pick the first one. The first option is workable if you only import a few characters and know what you're after, but if you import a list of a few thousand characters, it's just not an option. I would really like to pick, for example, the ABC Dictionary, and get both a card for huá and one for huà in this case. Is this possible, or can it be made possible? I guess 尽 would be another well-known example, but there are of course a lot of them.

2. When exporting cards, is it possible to get all traditional variants of a character provided by the dictionary instead of just a single one?

3. How many entries are there in the PLC dictionary?

4. Are the KEY E-C Chinese definitions exportable? CC, PLC and ABC are, for example, while OCC is not. It would be great to know before I buy it.

Thanks!
 

mikelove

皇帝
Staff member
1) Not at the moment; best bet would be to create a separate record in your import file with each pronunciation. (it'll match against the pronunciation field if it's there)

2) No. To be honest, there are a lot of really weird archaic variants in there - we're mostly optimizing around search, so we'd rather include a rare variant than not - and we don't have great data on which ones are in common use (character frequency itself is a poor proxy, since in a lot of cases a character might be common in other senses but rare as a traditional version of this particular one), so we don't think it would be very useful data to export.

If you want an exhaustive mapping of characters to traditional variants, the Unihan database does a pretty good job with that (there's a rough sketch of how you might pull it out at the end of this post).

3) About 120,000.

4) Yes, they are.
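For the Unihan route in point 2, an untested sketch of the idea - it assumes the standard Unihan_Variants.txt layout, where each line is a codepoint, a field name and a value separated by tabs (e.g. U+4E11, kTraditionalVariant, "U+4E11 U+919C") - would be something like this in Python:

def traditional_variants(path="Unihan_Variants.txt"):
    # Map each character to the list of traditional variants Unihan records for it.
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue  # skip comments and blank lines
            codepoint, field, value = line.rstrip("\n").split("\t", 2)
            if field == "kTraditionalVariant":
                char = chr(int(codepoint[2:], 16))  # "U+4E11" -> 丑
                mapping[char] = [chr(int(v[2:], 16)) for v in value.split()]
    return mapping

# traditional_variants().get("丑") should list both 丑 and 醜, for example.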
 

timseb

进士
Thank you. Excellent answers.

Do you know if it's possible to make a list out of the Unihan database, or do you know of a list of that kind? A list to import into Pleco, one that would turn 差 into something like the list below, for example, and perhaps with the corresponding traditional variant for matching. Is it even possible to just export all dictionary entries in some weird way?

差 chà
差 chāi
差 chài
差 cī
差 chā

I tried doing this with Chinese Text Analyzer, which worked OK but with problems. It gives both jìn and jǐn for 尽, but only xuè for 血. For 差 I only get cha1, cha4 and chai1. I'm guessing there are multiple reasons for this, among them that it's not really based on a character dictionary.

That last answer was very welcome indeed, I'm buying that dictionary right away!
 

mikelove

皇帝
Staff member
It does have all of that information, yes: the Unihan page for 差 lists those pronunciations and even a few more. You'd basically want to a) download the Unihan database, b) extract whichever reading field you wanted from the Readings file (kHanyuPinyin or kXHC1983 or whatever), and c) run it through a script to convert the U+ codepoint into a character and put each reading and that character on a separate line.
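Untested, but step c) doesn't need to be anything fancier than this kind of Python sketch (it assumes the usual Unihan_Readings.txt layout - codepoint, field name and value separated by tabs, with the reading fields holding space-separated "dictionary-location:reading,reading" entries - and the output file name is just an example):

FIELD = "kHanyuPinyin"  # or "kXHC1983", etc.

with open("Unihan_Readings.txt", encoding="utf-8") as f, \
     open("readings.txt", "w", encoding="utf-8") as out:
    for line in f:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        codepoint, field, value = line.rstrip("\n").split("\t", 2)
        if field != FIELD:
            continue
        char = chr(int(codepoint[2:], 16))  # "U+5DEE" -> 差
        readings = []
        # Drop the dictionary-location prefixes and keep each distinct reading.
        for entry in value.split():
            for reading in entry.split(":")[-1].split(","):
                if reading and reading not in readings:
                    readings.append(reading)
        for reading in readings:
            out.write(char + "\t" + reading + "\n")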
 

timseb

进士

Thank you.

I have been using the Unihan website for a few months and really like it, but did not know it could be turned into lists manually. I am not very technical, but also not a complete beginner, so I should be able to make this work! Will be back in a few hours if I don't. I have no idea about scripts though. :rolleyes:
 

timseb

进士
After looking at my downloaded Unihan data for a while and still not understanding a single thing, I gave up. I did however use a generator to get all readings for the characters in question. The list is here. The crux is now how to get each reading on a new row, combined with the character. I'm guessing for people who do coding, the answer is glaringly obvious, but I can't seem to figure it out.

A lot of these readings are either super obscure or plain wrong sometimes(?), but I'm thinking that's not a problem since the dictionary will filter them out anyway.
 

mikelove

皇帝
Staff member
You'd basically want a regular expression search, something like:

^([^ \n]*) ([^ \n]*) ([^ \n]*)
to
$1\t$2\n$1\t$3

repeating until it has nothing more to replace. (in some text editors you might have to replace the $'s with \'s) Make sure that the readings and characters are separated with tabs and not spaces in the final import file.
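If you're comfortable running a small script instead of doing the replacement by hand, here's a rough, untested Python version of the same idea. The only change to the expression is that the separators accept a tab as well as a space (and the character groups exclude tabs), so lines that are already half-split from an earlier pass keep matching; the input file name is just a placeholder for whatever your list is called.

import re

# Matches "character<space/tab>reading<space/tab>reading" at the start of a line.
PATTERN = re.compile(r"^([^ \t\n]*)[ \t]([^ \t\n]*)[ \t]([^ \t\n]*)", re.MULTILINE)

def one_reading_per_line(text):
    # Repeatedly split the first two readings of a line onto separate lines;
    # any further readings stay on the second new line and are handled in the
    # next pass, until there is nothing left to replace.
    while True:
        text, count = PATTERN.subn(r"\1\t\2\n\1\t\3", text)
        if count == 0:
            break
    # Lines that only ever had one reading never matched above and still use
    # a space, so turn those into tabs too for the import file.
    return re.sub(" +", "\t", text)

with open("my_list.txt", encoding="utf-8") as f:
    print(one_reading_per_line(f.read()))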
 

timseb

进士
Thank you. I think I'm moving in the right direction, but I misunderstood something along the way. The search-and-replace (in Notepad++) gave me tab separations between the readings and duplicated lines, but there is still more than one reading on some rows; sometimes a character gets only one row but several readings. The list now looks like this:

一 yi1 yi2
一 yi1 yi4
人 ren2 ren
下 xia4 xia
上 shang4 shang3
上 shang4 shang

EDIT: I might have solved it. I turned all tabs into spaces, and then used your search-and-replace once again. Now each reading has its own row!
 
Last edited:

timseb

进士
Hello.

The solution Mike mentioned above is the best I've come across so far, but I would really love to get the following:

My list of traditional characters in the first column, the Tōngyòng Guīfàn Hànzì Zìdiǎn standard variant in the second, and the Tōngyòng Guīfàn Hànzì Zìdiǎn readings in the third. I could then import this to Pleco for almost perfect matching to the KEY dictionary (which seems to be the best for hanzi definitions). This would give me the common mainland readings and variants (long term this would work best for me I think). I use Pleco for most things I do, but *not* for flash card studying, where I use Anki. I am not importing it directly into Anki (I haven't found a way to make that work for characters with multiple meanings/readings) so I'm still doing it manually by exporting the KEY definitions to a spreadsheet which I'm copying manually into Anki.

Any recommendations, or help from anyone who knows coding that could make this so much easier, would be hugely appreciated. To explain, this is what I would want as a flashcard for, say, these two characters:

EXAMPLE #1:

FRONT:

BACK:
1. lā 2. lá
(1)
v 1 pull, drag 2 haul, transport (in a vehicle) 3 move (troops) 4 {music} play (certain instruments, like húqin 胡琴 "violin", shǒufēngqín 手風琴/手风琴 "accordion", etc.) 5 draw out, space out, extend 6 help, give/lend a hand 7 implicate, involve, drag in 8 {regional} bring up (a child), raise 9 solicit (like customers through advertising) 10 press (as into military service) 11 {colloquial} defecate 12 {colloquial} make (a list, as in lā ge dānzi 拉個單子/拉个单子, "make a list") 13 {phon} (used to transliterate "-la-", "-ra-", etc.)
(2)
v slash/slit/cut (open, out, etc.)

Traditional:

EXAMPLE #2:

FRONT:

BACK:
niǎo
n bird

Now, for some reason the Tōngyòng Guīfàn Hànzì Zìdiǎn pinyin is not on the main page of the character's Unihan entry, so I don't know how it deals with a character like 着. The readings are under kTGHZ2013 in the downloaded archive.

English is not my mother tongue and I wish I was more proficient, so if anything (or much of it) is unclear, please tell me so. I would really like to find a reasonable solution to this technical hindrance.

For Pleco import, I guess this should be the format:
丑[丑] chou3
醜[丑] chou3
六[六] liu4
拉[拉] la1
拉[拉] la2

...and so on.
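To make it clearer what I'm hoping someone could help with, I imagine the script would have to do roughly the following with the Unihan files mentioned earlier. This is only my guess at what's needed (the kSimplifiedVariant and kTGHZ2013 field names are from the downloaded archive, the file names for my own list and the output are made up, and the readings would come out with tone marks rather than numbers, which I hope is fine for import):

def load_field(path, wanted):
    # Collect one Unihan field into a {character: value} dictionary.
    data = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            codepoint, field, value = line.rstrip("\n").split("\t", 2)
            if field == wanted:
                data[chr(int(codepoint[2:], 16))] = value
    return data

simplified_of = load_field("Unihan_Variants.txt", "kSimplifiedVariant")
readings_of = load_field("Unihan_Readings.txt", "kTGHZ2013")

with open("my_traditional_list.txt", encoding="utf-8") as f, \
     open("pleco_import.txt", "w", encoding="utf-8") as out:
    for line in f:
        trad = line.strip()
        if not trad:
            continue
        # kSimplifiedVariant holds codepoints like "U+4E11"; characters with no
        # entry are assumed to be unchanged in simplified.
        variants = simplified_of.get(trad, "").split()
        simp = chr(int(variants[0][2:], 16)) if variants else trad
        # kTGHZ2013 entries look like "location:reading"; keep each distinct reading.
        seen = []
        for entry in readings_of.get(simp, "").split():
            reading = entry.split(":")[-1]
            if reading not in seen:
                seen.append(reading)
                out.write(trad + "[" + simp + "]\t" + reading + "\n")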
 
Last edited:

timseb

进士
I just want to say that I have been able to do this manually with decent results, so if it's not solvable that's totally OK. I just have one last question, Mike. When I export Oxford flashcards I don't get the definitions, which I know is because it's a protected dictionary. I do however get the Simplified, Traditional and the Pinyin. Does that mean this information about the dictionary is not protected? Would that in turn mean I could get a list from you (txt, for example), with all KEY entries with Traditional, Simplified and Pinyin, but not definitions? I know it's a long shot, but I think that would mean I could import the entire list into Pleco, and thereby not miss any entries.
 
Last edited:

mikelove

皇帝
Staff member
No, I'm afraid not; while individual words are not protected the overall collection is. (in fact many would argue that the single most valuable piece of a dictionary is its list of words)
 

timseb

进士

Actually, after I posted this I realized that it would in effect mean I could extract the entire dictionary and upload it for free for everyone, which made the question kind of silly. I'm spending my days trying to come up with a solution to this, but I just can't. The kTGHZ2013 field is amazing for finding out the most common readings, but it only works for simplified characters. Maybe this problem just doesn't have a solution at all.

EDIT: I noticed that kTGHZ2013 gives ji2 as the only common reading for 㴔, while kHanyuPinyin (which tends to have all(?) possible readings) tells me 31683.060:xī,yì,sè. Does that mean kTGHZ2013 is not as reliable as I'd hoped? I can find ji2 in Xiandai Guifan but not in KEY, ABC and some others. Perhaps I should buy the Hanyu Da Cidian? That's by far the most authoritative dictionary, right?
 
Last edited:

timseb

进士
I'm thinking of buying the Hanyu Da Cidian. How does it treat what ABC calls "meaningless bound form" characters, for example 鵪? Does it do like MoeDict and just point me to 鵪鶉, or does it define the character as well?
 

timseb

进士
It usually defines them, digging back to their historical meaning. In the case of 鵪 it's:

Very promising! I noticed that the dictionary is a mix of simplified and traditional. What is the reason behind this? Would you say that affects the "authoritativeness" at all (if I understand it right, the source for a lot of the characters used is Pleco rather than the dictionary itself)? I will use it for traditional characters.
 

mikelove

皇帝
Staff member
It's the way they wrote it, basically - I think a lot of the traditional is out of fidelity to the original sources.

That being said, it's mostly intended as a word reference rather than a character reference, so if you're looking for a definitive ruling on specific characters you'd probably be better off with a printed 字典 focused specifically on that.
 

timseb

进士

Thank you for all your answers! They have been very clarifying.
 

Fernando

榜眼
I was also thinking about getting the 漢語大詞典, but I remember balking when I saw that it was a mix of simplified and traditional characters. For a work of that scope and ambition they should provide a traditional-only edition and then from that derive one with definitions in simplified, since conversion from traditional to simplified is trivial as opposed to the other way around.

On the same note: for some dictionaries, e.g. New Century, in which traditional is "provided by Pleco", how exactly is that done? If it's automatic wouldn't it be prone to errors?
 

mikelove

皇帝
Staff member
On the same note: for some dictionaries, e.g. New Century, in which traditional is "provided by Pleco", how exactly is that done? If it's automatic wouldn't it be prone to errors?

Automatic with some manual editing, yes.

For a while we released any titles like this - originally in simplified and too big for us to reasonably convert to traditional by hand - as simplified-only, but traditional users were outraged by this and seemed to be almost unanimously of the opinion that a conversion with errors was better than no conversion at all, so that's what we do now. (but we disclose this fact in the "Add-ons" screen for that very reason)
 

Fernando

榜眼
I learn traditional and traditional only, and I'm not sure how automatic conversions can help the cause when they undermine what traditional characters have going for them: precision, and the very fact that simplified characters are derived from them.

As for traditional dictionaries not derived from simplified, what have we got? Only ABC, Oxford and MOE? Or really only MOE?
 