A few questions about Pleco functions and dictionaries

timseb

进士
Hi!

I have been using Pleco for a while and I also have the professional bundle. I have a few things I have not quite got a grip on, and will number my questions:

1. Is it possible to import a character or word and get *all* available definitions in the dictionary, and not just one of them? For example, if I import 划, so far I have only found the options to either pick an entry manually or make Pleco pick the first one. The first option is workable if you only import a few characters and know what you're after, but if you import a list of a few thousand characters, it's just not an option. I would really like to pick, for example, the ABC Dictionary, and get both a card for huá and one for huà in this case. Is this possible, or can it be made possible? I guess 尽 would be another well-known example, but there are of course a lot of them.

2. When exporting cards, is it possible to get all traditional variants of a character provided by the dictionary instead of just a single one?

3. How many entries are there in the PLC dictionary?

4. Are the KEY E-C Chinese definitions exportable? CC, PLC and ABC are, for example, while OCC is not. It would be great to know before I buy it.

Thanks!
 

mikelove

皇帝
Staff member
1) Not at the moment; best bet would be to create a separate record in your import file with each pronunciation. (it'll match against the pronunciation field if it's there)

2) No. To be honest, there are a lot of really weird archaic variants in there - we're mostly optimizing around search, so we'd rather include a rare variant than not - and we don't have great data on which ones are in common use (character frequency itself is a poor proxy, since in a lot of cases a character might be common in other senses but rare as a traditional version of this particular one), so we don't think it would be very useful data to export.

If you want an exhaustive mapping of characters to traditional variants, the Unihan database does a pretty good job with that (there's a rough sketch of how you might pull it out at the end of this post).

3) About 120,000.

4) Yes, they are.
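For the Unihan route in point 2, an untested sketch of the idea - it assumes the standard Unihan_Variants.txt layout, where each line is a codepoint, a field name and a value separated by tabs (e.g. U+4E11, kTraditionalVariant, "U+4E11 U+919C") - would be something like this in Python:

def traditional_variants(path="Unihan_Variants.txt"):
    # Map each character to the list of traditional variants Unihan records for it.
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue  # skip comments and blank lines
            codepoint, field, value = line.rstrip("\n").split("\t", 2)
            if field == "kTraditionalVariant":
                char = chr(int(codepoint[2:], 16))  # "U+4E11" -> 丑
                mapping[char] = [chr(int(v[2:], 16)) for v in value.split()]
    return mapping

# traditional_variants().get("丑") should list both 丑 and 醜, for example.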
 

timseb

进士
Thank you. Excellent answers.

Do you know if it's possible to make a list out of the Unihan database, or do you know of a list of that kind? A list to import into Pleco, one that would turn 差 into something like the list below, for example, and perhaps with the corresponding traditional variant for matching. Is it even possible to just export all dictionary entries in some weird way?

差 chà
差 chāi
差 chài
差 cī
差 chā

I tried doing this with Chinese Text Analyzer, which worked OK but with problems. It gives both jìn and jǐn for 尽, but only xuè for 血. For 差 I only get cha1, cha4 and chai1. I'm guessing there are multiple reasons for this, among them that it's not really based on a character dictionary.

That last answer was very welcome indeed, I'm buying that dictionary right away!
 

mikelove

皇帝
Staff member
It does have all of that information, yes: the Unihan page for 差 lists those pronunciations and even a few more. You'd basically want to a) download the Unihan database, b) extract whichever reading field you wanted from the Readings file (kHanyuPinyin or kXHC1983 or whatever), and c) run it through a script to convert the U+ codepoint into a character and put each reading and that character on a separate line.
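Untested, but step c) doesn't need to be anything fancier than this kind of Python sketch (it assumes the usual Unihan_Readings.txt layout - codepoint, field name and value separated by tabs, with the reading fields holding space-separated "dictionary-location:reading,reading" entries - and the output file name is just an example):

FIELD = "kHanyuPinyin"  # or "kXHC1983", etc.

with open("Unihan_Readings.txt", encoding="utf-8") as f, \
     open("readings.txt", "w", encoding="utf-8") as out:
    for line in f:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        codepoint, field, value = line.rstrip("\n").split("\t", 2)
        if field != FIELD:
            continue
        char = chr(int(codepoint[2:], 16))  # "U+5DEE" -> 差
        readings = []
        # Drop the dictionary-location prefixes and keep each distinct reading.
        for entry in value.split():
            for reading in entry.split(":")[-1].split(","):
                if reading and reading not in readings:
                    readings.append(reading)
        for reading in readings:
            out.write(char + "\t" + reading + "\n")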
 

timseb

进士

Thank you.

I have been using the Unihan website for a few months and really like it, but did not know it could be turned into lists manually. I am not very technical, but also not a complete beginner, so I should be able to make this work! Will be back in a few hours if I don't. I have no idea about scripts though. :rolleyes:
 

timseb

进士
After looking at my downloaded Unihan data for a while and still not understanding a single thing, I gave up. I did however use a generator to get all readings for the characters in question. The list is here. The crux is now how to get each reading on a new row, combined with the character. I'm guessing for people who do coding, the answer is glaringly obvious, but I can't seem to figure it out.

A lot of these readings are either super obscure or plain wrong sometimes(?), but I'm thinking that's not a problem since the dictionary will filter them out anyway.
 

mikelove

皇帝
Staff member
You'd basically want a regular expression search, something like:

^([^ \n]*) ([^ \n]*) ([^ \n]*)
to
$1\t$2\n$1\t$3

repeating until it has nothing more to replace. (in some text editors you might have to replace the $'s with \'s) Make sure that the readings and characters are separated with tabs and not spaces in the final import file.
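If you're comfortable running a small script instead of doing the replacement by hand, here's a rough, untested Python version of the same idea. The only change to the expression is that the separators accept a tab as well as a space (and the character groups exclude tabs), so lines that are already half-split from an earlier pass keep matching; the input file name is just a placeholder for whatever your list is called.

import re

# Matches "character<space/tab>reading<space/tab>reading" at the start of a line.
PATTERN = re.compile(r"^([^ \t\n]*)[ \t]([^ \t\n]*)[ \t]([^ \t\n]*)", re.MULTILINE)

def one_reading_per_line(text):
    # Repeatedly split the first two readings of a line onto separate lines;
    # any further readings stay on the second new line and are handled in the
    # next pass, until there is nothing left to replace.
    while True:
        text, count = PATTERN.subn(r"\1\t\2\n\1\t\3", text)
        if count == 0:
            break
    # Lines that only ever had one reading never matched above and still use
    # a space, so turn those into tabs too for the import file.
    return re.sub(" +", "\t", text)

with open("my_list.txt", encoding="utf-8") as f:
    print(one_reading_per_line(f.read()))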
 

timseb

进士
Thank you. I think I'm moving in the right direction, but I misunderstood something along the way. The search-and-replace (in Notepad++) gave me tab separations between the readings and duplicated lines, but there is still more than one reading on some rows; sometimes a character gets only one row but several readings. The list now looks like this:

一 yi1 yi2
一 yi1 yi4
人 ren2 ren
下 xia4 xia
上 shang4 shang3
上 shang4 shang

EDIT: I might have solved it. I turned all tabs into spaces, and then used your search-and-replace once again. Now each reading has its own row!
 
Last edited:

timseb

进士
Hello.

The solution Mike mentioned above is the best I've come across so far, but I would really love to get the following:

My list of traditional characters in the first column, the Tōngyòng Guīfàn Hànzì Zìdiǎn standard variant in the second, and the Tōngyòng Guīfàn Hànzì Zìdiǎn readings in the third. I could then import this to Pleco for almost perfect matching to the KEY dictionary (which seems to be the best for hanzi definitions). This would give me the common mainland readings and variants (long term this would work best for me I think). I use Pleco for most things I do, but *not* for flash card studying, where I use Anki. I am not importing it directly into Anki (I haven't found a way to make that work for characters with multiple meanings/readings) so I'm still doing it manually by exporting the KEY definitions to a spreadsheet which I'm copying manually into Anki.

Any recommendations, or help from anyone who knows coding that could make this so much easier, would be hugely appreciated. To explain, this is what I would want as a flashcard for, say, these two characters:

EXAMPLE #1:

FRONT:

BACK:
1. lā 2. lá
(1)
v 1 pull, drag 2 haul, transport (in a vehicle) 3 move (troops) 4 {music} play (certain instruments, like húqin 胡琴 "violin", shǒufēngqín 手風琴/手风琴 "accordion", etc.) 5 draw out, space out, extend 6 help, give/lend a hand 7 implicate, involve, drag in 8 {regional} bring up (a child), raise 9 solicit (like customers through advertising) 10 press (as into military service) 11 {colloquial} defecate 12 {colloquial} make (a list, as in lā ge dānzi 拉個單子/拉个单子, "make a list") 13 {phon} (used to transliterate "-la-", "-ra-", etc.)
(2)
v slash/slit/cut (open, out, etc.)

Traditional:

EXAMPLE #2:

FRONT:

BACK:
niǎo
n bird

Now, for some reason the Tōngyòng Guīfàn Hànzì Zìdiǎn pinyin is not on the main page of the character's Unihan entry, so I don't know how it deals with a character like 着. The readings are under kTGHZ2013 in the downloaded archive.

English is not my mother tongue and I wish I was more proficient, so if anything (or much of it) is unclear, please tell me so. I would really like to find a reasonable solution to this technical hindrance.

For Pleco import, I guess this should be the format:
丑[丑] chou3
醜[丑] chou3
六[六] liu4
拉[拉] la1
拉[拉] la2

...and so on.
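To make it clearer what I'm hoping someone could help with, I imagine the script would have to do roughly the following with the Unihan files mentioned earlier. This is only my guess at what's needed (the kSimplifiedVariant and kTGHZ2013 field names are from the downloaded archive, the file names for my own list and the output are made up, and the readings would come out with tone marks rather than numbers, which I hope is fine for import):

def load_field(path, wanted):
    # Collect one Unihan field into a {character: value} dictionary.
    data = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            codepoint, field, value = line.rstrip("\n").split("\t", 2)
            if field == wanted:
                data[chr(int(codepoint[2:], 16))] = value
    return data

simplified_of = load_field("Unihan_Variants.txt", "kSimplifiedVariant")
readings_of = load_field("Unihan_Readings.txt", "kTGHZ2013")

with open("my_traditional_list.txt", encoding="utf-8") as f, \
     open("pleco_import.txt", "w", encoding="utf-8") as out:
    for line in f:
        trad = line.strip()
        if not trad:
            continue
        # kSimplifiedVariant holds codepoints like "U+4E11"; characters with no
        # entry are assumed to be unchanged in simplified.
        variants = simplified_of.get(trad, "").split()
        simp = chr(int(variants[0][2:], 16)) if variants else trad
        # kTGHZ2013 entries look like "location:reading"; keep each distinct reading.
        seen = []
        for entry in readings_of.get(simp, "").split():
            reading = entry.split(":")[-1]
            if reading not in seen:
                seen.append(reading)
                out.write(trad + "[" + simp + "]\t" + reading + "\n")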
 
Last edited:

timseb

进士
I just want to say that I have been able to do this manually with decent results, so if it's not solvable that's totally OK. I just have one last question, Mike. When I export Oxford flashcards I don't get the definitions, which I know is because it's a protected dictionary. I do however get the Simplified, Traditional and the Pinyin. Does that mean this information about the dictionary is not protected? Would that in turn mean I could get a list from you (txt, for example), with all KEY entries with Traditional, Simplified and Pinyin, but not definitions? I know it's a long shot, but I think that would mean I could import the entire list into Pleco, and thereby not miss any entries.
 
Last edited:

mikelove

皇帝
Staff member
No, I'm afraid not; while individual words are not protected the overall collection is. (in fact many would argue that the single most valuable piece of a dictionary is its list of words)
 

timseb

进士

Actually, after I posted this I realized that it would in effect mean I could extract the entire dictionary and upload it for free for everyone, which made the question kind of silly. I'm spending my days trying to come up with a solution to this, but I just can't. The kTGHZ2013 field is amazing for finding out the most common readings, but it only works for simplified characters. Maybe this problem just doesn't have a solution at all.

EDIT: I noticed that kTGHZ2013 gives ji2 as the only common reading for 㴔, while kHanyuPinyin (which tends to have all(?) possible readings) tells me 31683.060:xī,yì,sè. Does that mean kTGHZ2013 is not as reliable as I'd hoped? I can find ji2 in Xiandai Guifan but not in KEY, ABC and some others. Perhaps I should buy the Hanyu Da Cidian? That's by far the most authoritative dictionary, right?
 
Last edited:

timseb

进士
I'm thinking of buying the Hanyu Da Cidian. How does it treat what ABC calls "meaningless bound form" characters, for example 鵪? Does it do like MoeDict and just point me to 鵪鶉, or does it define the character as well?
 

timseb

进士
It usually defines them, digging back to their historical meaning. In the case of 鵪 it's:

Very promising! I noticed that the dictionary is a mix of simplified and traditional. What is the reason behind this? Would you say that affects the "authoritativeness" at all (if I understand it right, the source for a lot of the characters used is Pleco rather than the dictionary itself)? I will use it for traditional characters.
 

mikelove

皇帝
Staff member
It's the way they wrote it, basically - I think a lot of the traditional is out of fidelity to the original sources.

That being said, it's mostly intended as a word reference rather than a character reference, so if you're looking for a definitive ruling on specific characters you'd probably be better off with a printed 字典 focused specifically on that.
 

timseb

进士

Thank you for all your answers! They have been very clarifying.
 

Fernando

榜眼
I was also thinking about getting the 漢語大詞典, but I remember balking when I saw that it was a mix of simplified and traditional characters. For a work of that scope and ambition they should provide a traditional-only edition and then from that derive one with definitions in simplified, since conversion from traditional to simplified is trivial as opposed to the other way around.

On the same note: for some dictionaries, e.g. New Century, in which traditional is "provided by Pleco", how exactly is that done? If it's automatic wouldn't it be prone to errors?
 

mikelove

皇帝
Staff member
On the same note: for some dictionaries, e.g. New Century, in which traditional is "provided by Pleco", how exactly is that done? If it's automatic wouldn't it be prone to errors?

Automatic with some manual editing, yes.

For a while we released any titles like this - originally in simplified and too big for us to reasonably convert to traditional by hand - as simplified-only, but traditional users were outraged by this and seemed to be almost unanimously of the opinion that a conversion with errors was better than no conversion at all, so that's what we do now. (but we disclose this fact in the "Add-ons" screen for that very reason)
 

Fernando

榜眼
I learn traditional and traditional only, and I'm not sure how automatic conversions can help the cause when they undermine what traditional characters have going for them: precision, and the very fact that simplified characters are derived from them.

As for traditional dictionaries not derived from simplified, what have we got? Only ABC, Oxford and MOE? Or really only MOE?
 