Does anyone have a Pleco-ready Soothill dictionary?

kongmu

Member
I've read the posts in the last year or so from folks interested in being able to import the Soothill/Hodous Chinese Buddhist dictionary into Pleco, and I, too, am very interested. I am fairly ignorant as to how to convert the .pdf, .xml, or Stardict files (found at http://buddhistinformatics.ddbc.edu.tw/ ... saries.php) into a .txt file that Pleco can read correctly.

Has anyone already done this? And is such a file available either through the document exchange part of this forum, or through other means?

I know Mike has suggested he could do such a conversion, but he is a man with many things on his plate, and this is one of the lower priorities, I'm sure.

Thanks for the help.

--Kongmu 釋空目
 

laobaigou

举人
Has anyone gotten to this yet? Should be really simple ( vi(m)+a little Perl script, voila! ). No pinyin unfortunately, so format would, I guess, be 'char(s)<tab><tab>text'. The defs therein are a bit truncated from the printed version of Soothill/Hodous which I have, and seems to include a few items not in the printed version. This appears to be covered by the creative commons license which states: 'You are free: to Share — to copy, distribute and transmit the work, to Remix — to adapt the work, to make commercial use of the work'. Another idea would be to take the other version of this that appears on this site and take the pinyin (and the jiantizi ) where it exists and merge it into this version. Again not too difficult. Maybe the amount of text is too much for the iTouch and iPhone version, but should be fine for the iPad I'd think.
 

laobaigou

举人
Just for fun I did this. Has 16792 entries v. 52144 entries for the other version on this site. No pinyin, no jiantizi. Have no idea how many entries are in the printed version; whether this version is larger or just different. Still wondering about folding in the pinyin and the jiantizi from the other version.
 

mikelove

皇帝
Staff member
laobaigou said:
Just for fun I did this. Has 16792 entries v. 52144 entries for the other version on this site. No pinyin, no jiantizi. Have no idea how many entries are in the printed version; whether this version is larger or just different. Still wondering about folding in the pinyin and the jiantizi from the other version.

That would be best - with the new merged dictionary search system you're really going to want user-created dictionaries to have jianti and Pinyin (though they can live without fanti), otherwise they'll end up with their results separated from built-in dictionary ones. Part of our hesitation at doing official conversions of some dictionaries like Soothill has been their inconsistent inclusion of jianti + Pinyin - we don't want to be in the position of dropping support for a dictionary that people value after we've previously added it, but we also don't want to have to take responsibility for adding jianti + Pinyin ourselves in free dictionaries.
 

laobaigou

举人
Fiddled with this some more. Of the 16792 entries in this version of the dictionary, 927 don't have pinyin from the larger version. I thought there would be a more complete mapping so about 5% do not have pinyin or simplified chars. I don't think I want to slog thru' 927 entries filling in pinyin, also what about the missing jiantizi? So one line will have:
'伊罗婆那[伊羅婆那]<tab>pinyin<tab>definition', but other lines will be:
'伊羅缽龍王<tab><tab>definition'. Could stick the fantizi in '[..]', but would '[伊羅缽龍王]<tab><tab>definition' even make sense?
what about: 'fantizi [jiantizi]<tab>pinyin<tab>definition'. It isn't clear to me why fantizi is the one in brackets. Mike could you explain?
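
For anyone attempting the same merge, here is a minimal Python sketch of the idea, not the script I actually used. It assumes both versions are already loaded into dicts keyed by the fantizi headword; `merge_entries`, `small` and `large` are made-up names for illustration, and the output simply follows the two line layouts shown above.

```python
def merge_entries(small, large):
    """Fold jiantizi and pinyin from the larger list into the smaller one.

    small: dict mapping fantizi headword -> definition (hypothetical layout)
    large: dict mapping fantizi headword -> (jiantizi, pinyin) (hypothetical layout)
    Returns the tab-separated lines plus the headwords that found no match.
    """
    lines = []
    missing = []
    for fanti, definition in small.items():
        match = large.get(fanti)
        if match is not None:
            jianti, pinyin = match
            # Matched entry: jiantizi[fantizi]<tab>pinyin<tab>definition
            lines.append(f"{jianti}[{fanti}]\t{pinyin}\t{definition}")
        else:
            # Unmatched entry: keep the headword as-is, leave pinyin empty
            missing.append(fanti)
            lines.append(f"{fanti}\t\t{definition}")
    return lines, missing
```

The `missing` list is what would still need Pinyin and jiantizi filled in by hand or with another tool.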
 

mikelove

皇帝
Staff member
laobaigou said:
Fiddled with this some more. Of the 16792 entries in this version of the dictionary, 927 don't have pinyin from the larger version. I thought there would be a more complete mapping so about 5% do not have pinyin or simplified chars. I don't think I want to slog thru' 927 entries filling in pinyin, also what about the missing jiantizi?

There are a number of tools (Wenlin, Adso) that can make a reasonably good attempt at filling in the missing Pinyin / jianti for you - very accurate if you don't mind taking a little time to manually disambiguate the problem characters.

laobaigou said:
what about: 'fantizi [jiantizi]<tab>pinyin<tab>definition'. It isn't clear to me why fantizi is the one in brackets. Mike could you explain?

We had to pick one of them to go second and fanti are far less popular among our customers, not to mention the fact that virtually all of our licensed dictionaries are delivered primarily or even exclusively in jianti and we have to add fanti ourselves. Jianti also tend to be a bit more standardized - fanti have a lot of special cases like 匯/彙 and 台/臺 and 嘆/歎 that collapse to a single consistent character in jianti. (the decades in which most/all mainland computers supported the 6763 characters in GB2312 and nothing else really helped to rein in character variants enormously - not saying whether that was a good or a bad thing for the language but it certainly makes a lexicographer's job easier)
 

laobaigou

举人
Well.... I'd like to do that, but I don't really have the software, the knowledge (to disambiguate alt pinyin for some of the chars) nor the time at present. I just was trying to reduce it all to a software problem which I can solve.... and it was 95% successful anyway! :?
 

kongmu

Member
Well, this is what I get for not checking my original post in well over a year...! Laobaigou, I'd be very interested in seeing your hard-fought efforts in making the Soothill dictionary Pleco-friendly. Is it shareable by you?
 

kongmu

Member
Thank you very much.

If you would like, you can simply attach the file to a response here (there should be an "Upload a file" button to the bottom right of the reply field.) Otherwise, I can message you with my email address.

Thanks again in advance....:)
 
Hello,

After having converted the soothill-hodous.ddbc.tei.p5.xml.zip to text (attached), I have had a try at converting the fantizi into pinyin with Wenlin and I get what you can see in the attached image.

Soothill Pinyin Disambiguation.jpg

By the way, I think there are some entries with errors, namely:

[口*普][口*隆]

[口*企]吒

[口*尸]剌拏伐底

[馬*夌]

[怡-台+追]惕鬼

[月*冊]

阿缽羅[口*底]訶諦

不喞[口*留]

曷剌[羊*兒]

痾[口*路]祗

盧[口*尸]胝訶目多

乞[口*栗]雙提贊

颯破[木*致]迦

喪[貝*親]
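
The bracketed pieces in these headwords look like component-composition placeholders standing in for characters missing from the encoding (that reading is my assumption). A rough Python sketch for flagging such entries automatically, with `has_placeholder` being a made-up helper name:

```python
import re

# A bracketed group that contains at least one of the operators * + -
# between other characters, e.g. [口*普] or [怡-台+追].
COMPOSED = re.compile(r"\[[^\[\]]+[*+\-][^\[\]]+\]")


def has_placeholder(headword):
    """Return True if the headword contains a bracketed composition placeholder."""
    return bool(COMPOSED.search(headword))
```

A plain 'jiantizi[fantizi]' headword is not flagged, because its bracketed part contains none of the three operators.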
 

Attachments

  • ddbc.soothill-hodous txt.zip
    870 KB · Views: 1,041

laobaigou

举人
Looking at what I wrote above over a year ago, I have the small one (16792 entries) I worked on, but now I'm wondering where the large one (52144 entries) came from? I just don't remember, and unfortunately, I didn't write it down. I think what I did before was to grab the pinyin from the larger version and insert it into the smaller version. Why is the smaller version better than the larger version? I saved it on my computer as 'big-soothill' and didn't further identify it.
 
Looking at what I wrote above over a year ago, I have the small one (16792 entries) I worked on, but now I'm wondering where the large one (52144 entries) came from? I just don't remember, and unfortunately, I didn't write it down. I think what I did before was to grab the pinyin from the larger version and insert it into the smaller version. Why is the smaller version better than the larger version? I saved it on my computer as 'big-soothill' and didn't further identify it.

Hi, laobaigou,

Do you mean that you have both the small and the large versions, and that you have the small version with pinyin?
 

kongmu

Member
Hi all.

So, I've been playing around with the Soothill dictionary as well, trying to get DDBC's Stardict version Pleco-friendly. The HTML version that Sobriaebritas kindly uploaded was great, but the definitions looked odd, as if something wasn't right, most likely because of the XML markup left in the text file.

You can click this link to download my attempt at converting the Stardict file (for some reason this forum site says the 2.4 MB file is too big to upload here). The dictionary does indeed have 16,792 entries - I don't think there is a "larger" version available.

In this version, I removed the markup and added Pinyin via Wenlin. However, the original Stardict file has Sanskrit text in the entries. Not sure why, but although this Sanskrit script reads fine in the Pleco preview right before you import the entries, once imported all the Sanskrit is stripped and you are left with small boxes in the entries (any thoughts from anyone as to why?).

So, other than this anomaly the dictionary is quite useable now.

If I figure out how to get rid of the Sanskrit script out from the txt file, or work from the HTML file as Sobriaebritas did, I'll post something new. At the moment, I'm trying to convert the classic 丁福保 - 佛學大辭典 (also open source) to Pleco. Running into other problems, but still have confidence it can work...!
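
On getting the Sanskrit script out of the txt file: one possible approach, sketched in Python, is to delete every character in the Unicode Devanagari blocks. This assumes the Sanskrit is stored as ordinary Devanagari code points (U+0900 to U+097F, plus the Devanagari Extended block); `strip_devanagari` is a made-up name, and I haven't run this against the real file.

```python
import re

# Devanagari (U+0900-U+097F) and Devanagari Extended (U+A8E0-U+A8FF)
DEVANAGARI = re.compile(r"[\u0900-\u097F\uA8E0-\uA8FF]+")


def strip_devanagari(text):
    """Remove Devanagari characters and tidy the doubled spaces left behind."""
    without = DEVANAGARI.sub("", text)
    # Only runs of plain spaces are collapsed, so tab separators stay intact.
    return re.sub(r" {2,}", " ", without)
```

Running each line of the file through this would leave the headword, Pinyin and definition fields where they were.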
 
If I figure out how to get rid of the Sanskrit script out from the txt file

Hi kongmu (et al.),

Thanks a lot. I've just downloaded your file, and got rid of the Sanskrit. I'm going to upload it here, but first I'd like to know whether you prefer to keep the tones indicated by numbers (instead of diacritics) and the stroke order of the Chinese entries (instead of the phonetic order).
 

mikelove

皇帝
Staff member
You need to keep them as numbers for the importer to work correctly - it can usually pick up tone marks too, but numbers work more reliably and will still be replaced by tone marks if the software is configured that way.
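
If a file already has tone marks, something along these lines could turn them back into numbers before importing. It is only a rough Python sketch: it assumes syllables are separated by spaces, leaves neutral-tone syllables without a number, and does not address how ü should be spelled for the importer. The function names are made up for illustration.

```python
import unicodedata

# Combining marks used for the four tones: macron, acute, caron, grave
TONE_MARKS = {"\u0304": "1", "\u0301": "2", "\u030C": "3", "\u0300": "4"}


def syllable_to_number(syllable):
    """Move the tone mark of one syllable to a trailing digit (yī -> yi1)."""
    tone = ""
    kept = []
    # NFD splits each accented vowel into base letter + combining marks.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # NFC recomposes what is left, so the diaeresis on ü survives.
    return unicodedata.normalize("NFC", "".join(kept)) + tone


def pinyin_to_numbers(text):
    """Convert space-separated tone-marked syllables to tone numbers."""
    return " ".join(syllable_to_number(s) for s in text.split())
```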
 
Thank you for the tip, Mike.

So I attach kongmu's file without the Devanagari script and with the entries arranged in "phonetic order". (I think the first six entries should be modified a bit.)
 

Attachments

  • ddbc.soothill-hodous complete for pleco no Devanagari.zip
    954.4 KB · Views: 805

kongmu

Member
Wow. That's great, Sobriaebritas; thank you very much (not sure how you were able to do that so fast! :)

I also managed to process 丁福保's《佛學大辭典》, which can be downloaded here. I ended up using an older version of the dictionary that is floating around the web (found here). The updated one from the DDBC website turned out to be too problematic for my meager skills at this sort of thing. For example, their Stardict file contains actual small jpgs of Sanskrit syllables in a few hundred entries, which end up showing up (along with the Chinese characters in the respective entries) as what I'm assuming is some kind of HTML code in the compiled txt file. The only way I could figure out how to make those entries readable was to manually cut and paste from another dictionary program, without the Sanskrit syllables. Anyway, out of 30,000+ entries, there's only a discrepancy of about 1,000 between the two Dingfubao versions, so for my purposes, this older one is OK.

And, for what it is worth, I also processed Foguang Shan's large Buddhist dictionary, likewise obtained from the Stardict site above. The site says that it is also "free to use", like the Soothill and Dingfubao (I haven't actually confirmed this, however). It does appear on a few other online dictionary sites for free use (here, and here, for example), but I haven't come across an actual downloadable file other than on the Stardict dictionary site. Anyway, Foguang Shan's dictionary can be downloaded here for Pleco.

If anyone knows that this dictionary is not "free to use" (Mike?), please do let me know; I'll edit this post right away.
 