Need suggestions for a Grammar dictionary DB.

Smoodo

举人
Hi there,

I've been compiling a Chinese grammar DB little by little on my Palm, and I would like to look into integrating it with PlecoDict or the current version of your software. I don't have a laptop here in China, so I want to ask a few theoretical questions.

What would happen if I added dictionary entries that started with three periods, then a character, then three periods and the rest of the pattern?

Is the dictionary doing straight string matching? Since Chinese characters are 2 bytes, what happens to a period? Is it a single byte? Does it matter?

I really think that this is possible, and I would like to release a free (open-source) dictionary and make sure that its format is compatible with PlecoDict.

Best Regards,

Brandon
 

mikelove

皇帝
Staff member
Hello,

First off - great idea! I'm looking forward to seeing this when it's done.

In response to your question, PlecoDict actually uses UTF-16 Unicode for text encoding, so letters, punctuation marks, and Chinese characters are all double-byte (we use data compression so that this doesn't result in a lot of wasted space). And while there are lots of indexes etc involved, the dictionary does basically do straight string matching so it should be able to search for 3 periods and a character without any problems.
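As a rough illustration of the point about encoding, the sketch below uses Python's own UTF-16 codec (not PlecoDict's code, which isn't public) to show that a period and a Chinese character each take two bytes, and that a plain prefix match on the encoded text finds an entry starting with three periods. The function names and the sample entry are made up for the example.

```python
def utf16_width(text: str) -> int:
    """Bytes needed for `text` in UTF-16 (little-endian, no byte-order mark)."""
    return len(text.encode("utf-16-le"))


def prefix_match(entry: str, query: str) -> bool:
    """Straight string matching: does the encoded entry start with the encoded query?"""
    return entry.encode("utf-16-le").startswith(query.encode("utf-16-le"))


if __name__ == "__main__":
    for ch in [".", "a", "的"]:
        print(repr(ch), utf16_width(ch), "bytes")

    entry = "...的..."  # hypothetical pattern entry
    print(prefix_match(entry, "...的"))  # query typed with the leading periods
    print(prefix_match(entry, "的"))     # query typed without them
```

Note that the second query does not match, which is the usability issue with storing the periods as part of the searchable entry.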

The main problem with the period-character-period type of entry you describe is that until the user entered the first character the dictionary would have no way of knowing whether they were trying to do a Pinyin or character search - it would still work once they input the character, but it could be a little slow. Also the indexing system is mainly designed to handle character entries, so the extra periods on the beginning of the entry could also cause a minor performance hit (simply because from the software's perspective you'd have hundreds or thousands of entries with the same 3 starting characters).

So I think a better method would be to take advantage of our new dictionary encoder's ability to use different characters for indexing than it does for display - this would allow you to display the periods in the entry text but omit them from searches, so when people wanted to search for a pattern they would simply enter the characters and the pattern would come up. This would be both faster for the user (since they don't have to go to the trouble of entering periods) and easier for the software to handle.
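A minimal sketch of that idea, assuming the placeholder is written as three periods and the index key is simply the display text with the placeholders removed. The real encoder's behaviour and format are not described in this thread, so every name and the sample patterns here are hypothetical.

```python
PLACEHOLDER = "..."  # assumed marker for the variable part of a pattern


def index_key(display_text: str) -> str:
    """Key used for searching: the display text with placeholders stripped."""
    return display_text.replace(PLACEHOLDER, "")


def build_index(entries):
    """Map each search key to the display texts that share it."""
    index = {}
    for display_text in entries:
        index.setdefault(index_key(display_text), []).append(display_text)
    return index


def search(index, query):
    """Return the display texts whose search key starts with the query."""
    hits = []
    for key in sorted(index):
        if key.startswith(query):
            hits.extend(index[key])
    return hits


if __name__ == "__main__":
    # Sample patterns chosen only for illustration.
    index = build_index(["...的...", "越...越...", "因为...所以..."])
    print(search(index, "越"))
    print(search(index, "的"))
```

The user types only the characters, and the entry is still shown with its periods.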

Anyway, best of luck with this and let me know if you have any more questions,

Michael Love
 

Smoodo

举人
Indexing questions.

Hi,

Thanks for the quick response. I've been thinking about your reply and need more clarification.

You mentioned a new encoder. Is it already posted, or can you email it to me?

If I index the patterns as you say, where the user would just start entering characters, it seems that from a usage perspective it would make grammar lookups clumsy. I say this because many patterns are just that: patterns. There has to be some way to allow the software to differentiate between patterns, not only to save search time, but also to make it elegant for the user to use and for the software to interpret.

I think it would be good design to have some kind of marker that distinguishes a dictionary DB from a pattern-lookup DB. The two are different: in a dictionary, your search engine tries to match the first index entry it encounters that conforms to the key, the only distinction being that you might use a separate index for Pinyin-to-character lookups. A pattern-lookup DB, by contrast, has categories of entries. I see two immediate ones, illustrated in the first two cases below. The latter two cases are just simple expansions of the first two.

First I will define some things to use to explain my thinking.

Variable Definition:
Code:
// cNPM: a non-lookup placemarker, i.e. something unique that your program
// can detect and therefore not do any extraneous lookups. (Here I'm
// indirectly asking you about the character vs. display indexing
// mentioned in your reply.)
const UTF16 cNPM = "??";

UTF16 A, B, C;   // Each can be a unique character


Case 1: The simplest case. Let the user select the dictionary, because PlecoDict doesn't have enough data to make an intelligent decision as to whether the character is a regular lookup or a grammar pattern.
Code:
[A]

Case 2: This case is definitely a pattern. A non-lookup placemarker (NPM) alerts PlecoDict to search only the dictionaries that support NPMs and to skip those that don't (which I know is most of the dictionaries for Pleco, so this is a huge saving in time and space).
Code:
cNPM[A]

or 

cNPM[A]cNPM[B]

Case 3:
Expands on Case 2
Code:
cNPM[A][B]cNPM[C]

Case 4:
Expands on Case 1
Code:
[A][B]cNPM[C]


I realize that I am making a lot of educated guesses about PlecoDict. I did look at the code for MakeDict, but I imagine that the new encoder is not one and the same. If you would please give me some feedback about this, I'll start arranging the data in a way that will make it easy for your encoder utility to compile it.

Best Regards,

Brandon

If you do need to email me, please send to:

bjackson at willowbendstudios dot com
 

mikelove

皇帝
Staff member
First off, we're still doing some tweaking/debugging on the encoder, so I haven't got a good version to send you right now. I'm afraid it wouldn't do you much good anyway, as we're not making it open-source like we did MakeDict. Pleco-generated and user-generated databases now use pretty much the same format, and while keeping the source closed won't completely eliminate the risk of someone cracking our databases, it should at least make it a good deal harder. (It also puts us in a better position legally to argue that we made every reasonable effort to protect our licensors' data.)

As far as pattern detection goes, though, we've actually been reworking a lot of that code in the last few days, so some of my previous answers are no longer valid. (I probably should have waited until we were finished before posting my initial reply, sorry.) It now looks like what you're proposing is actually pretty close to where we'll end up. There'll be a mechanism for a database index to declare certain ranges of characters that it does or doesn't cover, obviously with some exceptions and special cases for things like punctuation marks that separate Pinyin syllables or the occasional alphanumeric in a Chinese character index. This would allow us to designate some special character like # or * as a pattern-matching symbol and tell anyone wishing to create a pattern-matching database to use that character (which the encoder would either auto-detect or be instructed to use by a flag or something like that). Neither a standard Pinyin index nor a standard character index would cover this pattern-matching symbol, so the software would quickly be able to figure out that the search query was for your database.
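To make the range idea concrete, here is a hedged sketch of how a query might be routed when each index declares the character ranges it covers. This is only an illustration of the mechanism described above, not Pleco's implementation: the index names, the specific ranges, and the choice of # as the pattern symbol are all assumptions, and the exceptions for punctuation and stray alphanumerics are left out.

```python
PATTERN_SYMBOL = "#"  # assumed; the post mentions "# or *" as possibilities

# Each hypothetical index declares inclusive code-point ranges it covers.
INDEXES = {
    "pinyin": [(ord("a"), ord("z")), (ord("0"), ord("9"))],
    "hanzi": [(0x4E00, 0x9FFF)],  # CJK Unified Ideographs block
    "pattern": [(0x4E00, 0x9FFF), (ord(PATTERN_SYMBOL), ord(PATTERN_SYMBOL))],
}


def covers(ranges, ch: str) -> bool:
    """True if the character falls inside any declared range."""
    return any(lo <= ord(ch) <= hi for lo, hi in ranges)


def candidate_indexes(query: str, indexes=INDEXES):
    """Names of the indexes whose ranges cover every character of the query."""
    return [
        name
        for name, ranges in indexes.items()
        if all(covers(ranges, ch) for ch in query)
    ]


if __name__ == "__main__":
    print(candidate_indexes("#的"))  # only the pattern index covers '#'
    print(candidate_indexes("的"))   # ambiguous: character or pattern lookup
    print(candidate_indexes("de"))   # Pinyin only
```

A query containing the pattern symbol is ruled out of the Pinyin and character indexes after inspecting its characters, without searching them, which corresponds to Case 2 earlier in the thread. A bare character remains ambiguous, as in Case 1.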

Of course this isn't set in stone and there are some finer points of it that still need to be worked out, but it's looking like this is where we're going to end up, so as far as arranging your database there really isn't much you can or need to do until we actually get around to releasing the encoder.
 

Smoodo

举人
Life on the bleeding edge! Nothing like it.

That's alright! I've worked on the bleeding edge before and definitely understand where you're coming from. I understand that elements under the hood are churning. It's nice to have a chance to give input that potentially improves your design. When you do have some stable specifications, please let me know. In the back of my mind, I'm wondering how much flexibility (granularity) I should maintain when compiling this stuff into a database. I suppose I can always normalize the database later when you have specs. I'm playing with the space vs. time tradeoff of managing data and lookups.

It's a pleasure to help with good projects. If you do feel like sending any executables or code snippets, it's fine with me. I know what unstable means. If it would require an NDA, I'm not opposed to that either. My only stipulation would be that the grammar database remain open and free in some form or another, which doesn't conflict with anything. :)

Thanks for all of your feedback. Just from your previous replies, I can see that there is a lot of creative energy flowing over there.
 