User Dictionary Specification

alex_hk90 · Jun 2, 2012

Hi all,

I'm not sure if this is the right place for this as it should apply to all versions, but I use the Android version so I put it here rather than 'Future Products'.

This is mainly for the developers, but perhaps users who have used the user dictionary feature will also know.

So my questions are based on the specification for user dictionaries in Pleco:

1. Since the import speed is painfully slow, does anyone know what the specification is for the SQLite file ([userdict].pqb) itself?
Most of the database tables and columns are pretty obvious, but "sortkey" in the "entries" table and the related "posdex_hz" (Hanzi) and "posdex_py" (Pinyin) sorting / lookup tables weren't immediately obvious to me. I could probably work it out but thought I'd ask first if anyone already has the specification. With this it would be easy to import large databases directly into the SQLite (pqb) file, instead of going through the slow import method on the phone itself.

2. What are the codes for formatting the definitions?
I found this post: http://www.plecoforums.com/viewtopic.php?f=13&t=1406&p=10399#p10399, is the list there complete?

3. Full text search - could this be supported in the Android version?
I found this post: http://www.plecoforums.com/viewtopic.php?f=17&t=2686, but it relates to iOS, so I don't know if it also applies to the Android version (I have no idea what version of SQLite it uses or what features it may have).

Alternatively, would it be possible to create 'proper' Pleco dictionaries which do allow full text search?
I'm guessing due to licensing there's some kind of encryption on the paid ones, but is there no way we can make our own ones? If the process and/or specification is relatively simple I could probably code a quick program to do it (for instance, convert from flashcard-style text file to Pleco dictionary database format).

Thanks in advance.

mikelove · Jun 2, 2012

alex_hk90 said:
1. Since the import speed is painfully slow, does anyone know what the specification is for the SQLite file ([userdict].pqb) itself?
Most of the database tables and columns are pretty obvious, but "sortkey" in the "entries" table and the related "posdex_hz" (Hanzi) and "posdex_py" (Pinyin) sorting / lookup tables weren't immediately obvious to me. I could probably work it out but thought I'd ask first if anyone already has the specification. With this it would be easy to import large databases directly into the SQLite (pqb) file, instead of going through the slow import method on the phone itself.

It's a syllable-by-syllable alternating mix of characters and Pinyin, with the Pinyin using full-width characters so that they'll sort later than Chinese characters (and hence a longer Pinyin syllable will sort after a shorter one). Bit of a kludge but it was very useful back on Palm when we couldn't easily specify custom SQLite collations.

However, in general I'd advise against putting a lot of effort into developing your own converter, because the format is likely to change significantly over the next few releases. If you do do one, pay close attention to the indexes too since they're not automatically updated when the database is.
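For anyone who wants to experiment anyway, the sort-key idea described above can be sketched roughly as follows. This is only an illustration, not Pleco's actual code: the names `to_fullwidth` and `make_sortkey`, the character-then-Pinyin order within each syllable, and the way tone digits are handled are all assumptions. The only things taken from the description are that characters and Pinyin alternate syllable by syllable and that the Pinyin is written in full-width characters.

```python
# Hypothetical sketch of a "character + full-width Pinyin" sort key.
# The exact layout Pleco uses is not documented in this thread.

FULLWIDTH_OFFSET = 0xFEE0  # U+0021..U+007E map to U+FF01..U+FF5E


def to_fullwidth(text):
    """Convert printable ASCII characters to their full-width Unicode forms."""
    return "".join(
        chr(ord(ch) + FULLWIDTH_OFFSET) if "!" <= ch <= "~" else ch
        for ch in text
    )


def make_sortkey(hanzi, syllables):
    """Interleave each character with its full-width Pinyin syllable."""
    if len(hanzi) != len(syllables):
        raise ValueError("need exactly one Pinyin syllable per character")
    return "".join(ch + to_fullwidth(syl) for ch, syl in zip(hanzi, syllables))


if __name__ == "__main__":
    short = make_sortkey("大家", ["da", "jia"])
    longer = make_sortkey("大夫", ["dai", "fu"])
    # Full-width letters have higher code points than CJK ideographs, so the
    # key with the shorter first syllable compares as smaller.
    print(short, longer, short < longer)
```

Because full-width Latin letters (U+FF01 onwards) come after the CJK Unified Ideographs block in code-point order, a plain binary string comparison puts "da" + next character before "dai", which matches the "longer syllable sorts after a shorter one" behaviour described above.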

alex_hk90 said:
2. What are the codes for formatting the definitions?
I found this post: viewtopic.php?f=13&t=1406&p=10399#p10399, is the list there complete?

Yes - there are a couple of others but they rely on certain other internal codes that you don't have access to, so those are all that could feasibly be employed by an end user.

alex_hk90 said:
3. Full text search - could this be supported in the Android version?
I found this post: viewtopic.php?f=17&t=2686, but it relates to iOS, so I don't know if it also applies to the Android version (I have no idea what version of SQLite it uses or what features it may have).

That one's been on the to-do list for YEARS (HW60 will attest to that...) - it's doable and in fact it's gotten considerably easier with the move to iOS/Android, but there are a number of technical issues we'd like to sort out concerning full-text searches of our own dictionaries before we get them working in user ones. (some of which we've worked out for the Android release that we're about to start beta testing - British versus American spelling, e.g.)

Frankly, though, we're not that enthusiastic about having Pleco be a general-purpose, dump-whatever-database-into-it-you-want dictionary engine because of piracy; a better format means more people loading dictionaries that aren't legally licensed, and the availability of those dictionaries makes our software more appealing to pirates (most of whom currently prefer something that supports StarDict or some other format that they can get a lot of pirated dictionaries in). That's not a reason for us to never produce a desktop importer, or to block other people from producing one, but it does make it hard to justify making that a priority over other feature requests.

alex_hk90 · Jun 2, 2012

mikelove said:
Frankly, though, we're not that enthusiastic about having Pleco be a general-purpose, dump-whatever-database-into-it-you-want dictionary engine because of piracy; a better format means more people loading dictionaries that aren't legally licensed, and the availability of those dictionaries makes our software more appealing to pirates (most of whom currently prefer something that supports StarDict or some other format that they can get a lot of pirated dictionaries in). That's not a reason for us to never produce a desktop importer, or to block other people from producing one, but it does make it hard to justify making that a priority over other feature requests.

Thank you for the prompt and detailed response.

I can certainly understand the concerns from your point of view, though it can be a bit frustrating for the end-user. My opinion has always been that people who are willing to pay will do so even if pirated versions are available, whereas people who are not will simply not use the software in question if there is no free or pirated version available. Of course I have no idea if this is the situation in reality.

mikelove · Jun 2, 2012

alex_hk90 said:
Thank you for the prompt and detailed response. I can certainly understand the concerns from your point of view, though it can be a bit frustrating for the end-user. My opinion has always been that people who are willing to pay will do so even if pirated versions are available, whereas people who are not will simply not use the software in question if there is no free or pirated version available. Of course I have no idea if this is the situation in reality.

The problem for us is the "in between" group - people who are generally willing to pay for our software, but would use a pirated version if it was easily available. This unfortunately describes a very large portion of expats in China (who are generally law-abiding but spend every day surrounded by people casually pirating stuff). More dyed-in-the-wool pirates using it makes the pirated version more widely available, which means more temptation for otherwise-honest people to pirate it as well.

Also, plenty of people who would never pirate our software would have fewer hang-ups about pirating a dictionary made by a big faceless publisher, particularly one who stubbornly refuses to license out their dictionaries; we need to respect the IP rights of companies that don't do business with us just as much as we respect those of companies that do, and that means making sure that widely-pirated dictionaries from licensing-averse publishers like 商务印书馆 aren't illegitimately available in Pleco (which anyway would only lower the odds of our getting those dictionaries in the future).

alex_hk90 · Jun 5, 2012

mikelove said:
The problem for us is the "in between" group - people who are generally willing to pay for our software, but would use a pirated version if it was easily available. This unfortunately describes a very large portion of expats in China (who are generally law-abiding but spend every day surrounded by people casually pirating stuff). More dyed-in-the-wool pirates using it makes the pirated version more widely available, which means more temptation for otherwise-honest people to pirate it as well.

Also, plenty of people who would never pirate our software would have fewer hang-ups about pirating a dictionary made by a big faceless publisher, particularly one who stubbornly refuses to license out their dictionaries; we need to respect the IP rights of companies that don't do business with us just as much as we respect those of companies that do, and that means making sure that widely-pirated dictionaries from licensing-averse publishers like 商务印书馆 aren't illegitimately available in Pleco (which anyway would only lower the odds of our getting those dictionaries in the future).

I'm not convinced that people using pirated dictionaries would have a significant impact on sales of the legally available dictionaries and other add-ons. However I do see what you mean in terms of the possible knock-on effect, that if the software does attract more hardcore pirates they might then reverse-engineer the legally available dictionaries and release a pirated version of the whole thing, which is more likely to have an impact on sales. I guess your current policy is the safest and most sensible.

Back on topic, there seems to be an issue with the user dictionary import function when trying to import many (i.e. 100,000+) entries. The first time I tried it cut off at around 89,000, then trying to add to this it keeps on cutting off before processing the whole file (at below 20,000 new entries). I haven't been watching it because it's so slow, but when I come back to the phone it's stopped the import function and switched to the dictionary screen. It's added some of the entries (starting from the beginning of the file, thankfully, so it's easy to make a new file with just the later ones that didn't add), but didn't continue to the end of the file.

Also, is there a performance issue when using a user dictionary with many (i.e. 200,000+) entries? Would it be better to instead split this into several smaller user dictionaries?

Thanks in advance.

mikelove · Jun 7, 2012

alex_hk90 said:
I'm not convinced that people using pirated dictionaries would have a significant impact on sales of the legally available dictionaries and other add-ons. However I do see what you mean in terms of the possible knock-on effect, that if the software does attract more hardcore pirates they might then reverse-engineer the legally available dictionaries and release a pirated version of the whole thing, which is more likely to have an impact on sales. I guess your current policy is the safest and most sensible.

More that they'd reverse-engineer the copy-protection system, which is as secure as we can make it but isn't bulletproof. (even Google and Amazon can't come up with hacker-proof DRM, we just have the benefit of being a much lower-profile target) They wouldn't really need to crack the dictionaries (most of which can be had in non-Pleco pirated versions now anyway), just the app, and they'd need to do that so that they could then access their own user dictionaries.

alex_hk90 said:
Back on topic, there seems to be an issue with the user dictionary import function when trying to import many (i.e. 100,000+) entries. The first time I tried it cut off at around 89,000, then trying to add to this it keeps on cutting off before processing the whole file (at below 20,000 new entries). I haven't been watching it because it's so slow, but when I come back to the phone it's stopped the import function and switched to the dictionary screen. It's added some of the entries (starting from the beginning of the file, thankfully, so it's easy to make a new file with just the later ones that didn't add), but didn't continue to the end of the file.

Probably that an entry's too long - it can get a bit buggy with those; check to make sure you don't have any lines longer than 4000 characters or so.
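A quick way to run that check on a desktop machine before importing is a small script like the one below. It is a minimal sketch that assumes the import file is UTF-8 text with one entry per line; the function name `find_long_lines` is made up for this example, and 4000 is only the rough figure mentioned above, not an exact limit.

```python
MAX_CHARS = 4000  # rough figure from the post above; not an exact limit


def find_long_lines(path, limit=MAX_CHARS, encoding="utf-8"):
    """Return (line_number, length) for every line longer than `limit`."""
    long_lines = []
    with open(path, encoding=encoding) as handle:
        for number, line in enumerate(handle, start=1):
            # Measure the entry itself, without the trailing line break.
            length = len(line.rstrip("\r\n"))
            if length > limit:
                long_lines.append((number, length))
    return long_lines
```

Calling `find_long_lines("import.txt")` (the file name is a placeholder) returns an empty list if every line is within the limit; otherwise it lists the line numbers to inspect or shorten.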

alex_hk90 said:
Also, is there a performance issue when using a user dictionary with many (i.e. 200,000+) entries? Would it be better to instead split this into several smaller user dictionaries?

Relatively minimal as long as you "Lock" it via Manage Dicts - doing that enables some performance improvements that are impossible in a continuously-edited dictionary. Should be faster than having multiple small ones, actually.

HW60 · Jun 7, 2012

mikelove said:
Frankly, though, we're not that enthusiastic about having Pleco be a general-purpose, dump-whatever-database-into-it-you-want dictionary engine because of piracy; a better format means more people loading dictionaries that aren't legally licensed, and the availability of those dictionaries makes our software more appealing to pirates

I export all my flashcards and import them again into a user dictionary, and this user dict (more than 10000 entries) is at the top of my dictionary list. This makes it very convenient to see whether I should already know a seemingly new word, which of my other flashcards use the same character, which Chinese and Japanese words with the same character I should know, etc. There are some problems though:

- I would appreciate a full-text search (a full-text definition search in flashcards would do the same job). As I have only custom cards, I would be searching my own information. Maybe you can add a function "convert custom flashcard db to user dict with full text search".
- When the simplified and traditional characters are different, Pleco does not realise that the particular character is already in my flashcard database (without traditional character information). Only when I add the character to my flashcards (+), convert the new entry to a custom entry and delete the traditional character does Pleco find out that it is a duplicate card.
- For more than 50% of my flashcards I would appreciate a search on the pronunciation field even if it is not Pinyin but Japanese Hiragana or Katakana starting with @ - even without @, that worked very well in Windows Mobile. Maybe you can implement a full-text pronunciation search ...

alex_hk90 · Jun 7, 2012

mikelove said:
Probably that an entry's too long - it can get a bit buggy with those; check to make sure you don't have any lines longer than 4000 characters or so.

I've checked where it stopped and there's nothing special about the final entry successfully imported and the next one not imported. In fact, what I've been doing is then truncating the file so that it starts with the next one that wasn't imported, and importing the new truncated file, which works fine (including the first one which wasn't imported in the larger file). To be honest I don't know if this is a Pleco issue or maybe a phone issue (I don't know how Android multi-tasking works but maybe something else is dragging the 'focus' away from Pleco).
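The manual truncation step can be scripted on a desktop machine. The sketch below assumes the import file is UTF-8 text with one entry per line, and that the number of entries already imported is known (for example from the entry count shown for the user dictionary); `write_remainder` is a name invented for this example.

```python
def write_remainder(src_path, dst_path, imported_count, encoding="utf-8"):
    """Copy every line after the first `imported_count` lines to dst_path.

    Returns the number of lines written to the new file.
    """
    written = 0
    # newline="" keeps the original line endings untouched.
    with open(src_path, encoding=encoding, newline="") as src:
        with open(dst_path, "w", encoding=encoding, newline="") as dst:
            for index, line in enumerate(src):
                if index >= imported_count:
                    dst.write(line)
                    written += 1
    return written
```

For instance, `write_remainder("all.txt", "rest.txt", 89000)` would write everything from entry 89,001 onwards to `rest.txt` (both file names are placeholders).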

mikelove said:
Relatively minimal as long as you "Lock" it via Manage Dicts - doing that enables some performance improvements that are impossible in a continuously-edited dictionary. Should be faster than having multiple small ones, actually.

Thanks, I'll try this.

mikelove · Jun 7, 2012

HW60 said:
- I would appreciate a full-text search (a full-text definition search in flashcards would do the same job). As I have only custom cards, I would be searching my own information. Maybe you can add a function "convert custom flashcard db to user dict with full text search".

Yes, you've been asking for that one since the WM days. Still on our radar, just not sure when.

HW60 said:
- When the simplified and traditional characters are different, Pleco does not realise that the particular character is already in my flashcard database (without traditional character information). Only when I add the character to my flashcards (+), convert the new entry to a custom entry and delete the traditional character does Pleco find out that it is a duplicate card.

We're considering adding an option to change the way duplicates are detected (e.g. to only use one character set) - would that help?

HW60 said:
- For more than 50% of my flashcards I would appreciate a search on the pronunciation field even if it is not Pinyin but Japanese Hiragana or Katakana starting with @ - even without @, that worked very well in Windows Mobile. Maybe you can implement a full-text pronunciation search ...

Might happen with 'custom fields' but not until then.

alex_hk90 said:
I've checked where it stopped and there's nothing special about the final entry successfully imported and the next one not imported. In fact, what I've been doing is then truncating the file so that it starts with the next one that wasn't imported, and importing the new truncated file, which works fine (including the first one which wasn't imported in the larger file). To be honest I don't know if this is a Pleco issue or maybe a phone issue (I don't know how Android multi-tasking works but maybe something else is dragging the 'focus' away from Pleco).

That might pause the import but wouldn't stop it altogether - when you reopened Pleco the import would continue.

HW60 · Jun 7, 2012

mikelove said:
HW60 said:

- When the simplified and traditional characters are different, Pleco does not realise that the particular character is already in my flashcard database (without traditional character information). Only when I add the character to my flashcards (+), convert the new entry to a custom entry and delete the traditional character does Pleco find out that it is a duplicate card.

We're considering adding an option to change the way duplicates are detected (e.g. to only use one character set) - would that help?

Yes - using only one character set would solve most of the problems. There is still a problem with Japanese flashcards: cards with the same kanji but different kana are treated as duplicates in Pleco.

mikelove · Jun 8, 2012

Update: full-text user dictionary search is now in, at least in a basic way. (it turned out that this was the optimal time to do it because we were mucking around with so much other search code anyway) It may be a little buggy and it may even stay a little buggy for a while, but it's essentially working anyway. (strictly for English at the moment, though)

alex_hk90 · Jun 8, 2012

mikelove said:
Update: full-text user dictionary search is now in, at least in a basic way. (it turned out that this was the optimal time to do it because we were mucking around with so much other search code anyway) It may be a little buggy and it may even stay a little buggy for a while, but it's essentially working anyway. (strictly for English at the moment, though)

Thanks very much!

HW60 · Jun 8, 2012

mikelove said:
Update: full-text user dictionary search is now in, at least in a basic way. (it turned out that this was the optimal time to do it because we were mucking around with so much other search code anyway) It may be a little buggy and it may even stay a little buggy for a while, but it's essentially working anyway. (strictly for English at the moment, though)

Just to be sure: "is in now" means in Pleco 2.3.8 to come? Strictly for English includes German, but no Hiragana?

mikelove · Jun 8, 2012

HW60 said:
Just to be sure: "is in now" means in Pleco 2.3.8 to come? Strictly for English includes German, but no Hiragana?

Yes and yes. Though I've got to double-check on German because SQLite's default tokenizer might not deal with the umlauts and ßes correctly.

alex_hk90 · Jun 8, 2012

I've finally finished importing the Cantonese dictionary file I found (with almost 220,000 entries) and so locked the database - it now works pretty much as well as the internal dictionaries, so thanks for the suggestion.

Before locking it, it seemed to be a bit inconsistent - sometimes it would be really slow and cause the 'Wait / Force Close' dialog box (choosing Wait would then be fine as long as you waited long enough), and other times it would work fine regardless. Looking forward to having full text search in these dictionaries.

HW60 · Jun 8, 2012

mikelove said:
HW60 said:

Just to be sure: "is in now" means in Pleco 2.3.8 to come? Strictly for English includes German, but no Hiragana?

Yes and yes. Though I've got to double-check on German because SQLite's default tokenizer might not deal with the umlauts and ßes correctly.

That is really good news! In HDD I have no problems with umlauts, Pleco finds ü and ß.

The search string for full-text search is compared with the definition field, which normally does not include Hiragana. Why is it "strictly for English at the moment" then? Do you think Pleco 2.3.8 will be released this year?

mikelove · Jun 8, 2012

HW60 said:
The search string for full-text search is compared with the definition field, which normally does not include Hiragana. Why is it "strictly for English at the moment" then? Do you think Pleco 2.3.8 will be released this year?

Not just this year but this month, probably in beta (at last) next week. "Strictly English" meant that it doesn't search for Chinese characters like our other full-text search does. (that requires a more complicated tokenizer)

dustpuppy · Jun 10, 2012

alex_hk90: you have a Cantonese dictionary in Pleco format? I'm extremely interested in trying it out if you're willing to share. I've been looking for a Cantonese dictionary on my phone for ages.

alex_hk90 · Jun 10, 2012

dustpuppy said:
alex_hk90: you have a Cantonese dictionary in Pleco format? I'm extremely interested in trying it out if you're willing to share. I've been looking for a Cantonese dictionary on my phone for ages.

The following site has a lot of Cantonese resources, including instructions for having a Cantonese-English dictionary on Android and iOS:
http://writecantonese8.wordpress.com/
I used the 'Cantonese CEDICT' file from that website as the data to convert into Pleco format, and while I'd be happy to share it I don't want to infringe on any intellectual property rights and I'm not clear on the copyright of that file (or, more specifically, the sources of it).

If you want I can post some notes I kept while converting the file (which is in CEDICT format) so you can do it yourself, but it's not that straightforward and the import takes ages.
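These are not the notes mentioned above, just a rough illustration of what the text-conversion part of such a job involves. The parser handles the standard CC-CEDICT line layout (`Traditional Simplified [pin1 yin1] /def 1/def 2/`); the Cantonese file may differ from that. The output layout (headword, pronunciation and definition separated by tabs, with the headword written as `simplified[traditional]`) is an assumption about the import format and should be checked against Pleco's own import documentation. All function names are invented for this example.

```python
import re

# Standard CC-CEDICT line: Traditional Simplified [pin1 yin1] /def 1/def 2/
CEDICT_LINE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_cedict_line(line):
    """Return (traditional, simplified, pinyin, [definitions]) or None."""
    if line.startswith("#"):  # comment / header line
        return None
    match = CEDICT_LINE.match(line.strip())
    if match is None:  # blank or malformed line
        return None
    traditional, simplified, pinyin, defs = match.groups()
    return traditional, simplified, pinyin, defs.split("/")


def to_tab_separated(entry):
    """Format a parsed entry as one tab-separated line (assumed layout)."""
    traditional, simplified, pinyin, definitions = entry
    headword = simplified
    if traditional != simplified:
        headword = f"{simplified}[{traditional}]"
    return "\t".join([headword, pinyin, "; ".join(definitions)])
```

Each line of the source file would go through `parse_cedict_line`, and every result that is not `None` would be written out with `to_tab_separated`, one entry per line.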

dustpuppy · Jun 10, 2012

If you could post the instructions, I'd be very grateful.
