I don't understand 'duplicate checking' during Flashcard import

Jim Kay · May 3, 2013

For me, a duplicate Flashcard is one where the Chinese is the same on both AND the pronunciation is also the same on both. Comparing the definitions is clearly too much for a computer to do, so a prompt is appropriate.

BUT, I seem to see two very serious problems:

1. Pleco seems to take ONLY the first Chinese character from the new card and then search for any existing (currently being imported?) card with the same first character. This causes a vast number of false duplicate hits.

2. There are entirely different formats used for displaying the card being imported and the card it is suspected of being a duplicate of. These different formats make the manual decision process much too difficult to use when wading through a very large number of false positive hits.

(I use Excel and a sort on my card data (Chinese first, then Bopomofo) along with duplicate flagging to scrub my data of any and all TRUE duplicates, but Pleco continues to flag a very large percentage of false positives.)
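The same scrub can be sketched in a few lines of Python. This is only an illustration of the "same Chinese AND same pronunciation" rule described above, not Pleco's own check; it assumes a tab-separated list with the Chinese in the first column and the pronunciation in the second, and the function name `find_true_duplicates` is made up for the example.

```python
from collections import defaultdict

def find_true_duplicates(lines):
    """Group 1-based line numbers by (headword, pronunciation).

    Returns only the groups that contain more than one line.
    """
    seen = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # not enough columns to compare
        # The whole headword is compared, not just its first character.
        key = (fields[0].strip(), fields[1].strip().lower())
        seen[key].append(lineno)
    return {key: nums for key, nums in seen.items() if len(nums) > 1}

# Sample rows (assumed layout: Chinese, pronunciation, definition).
cards = [
    "好处[好處]\thao3chu4\tgood points, profits, benefits",
    "共[共]\tgong4\tcommon, same, together",
    "好处[好處]\thao3chu4\tadvantage",
    "好[好]\thao3\tgood",
]
duplicates = find_true_duplicates(cards)
```

Here the first and third rows are reported as duplicates of each other, while the fourth row, which only shares a first character with them, is not.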

The list I'm playing with now has a bit over 1,500 items. The next list I'm working on will be around 16,000 items. That one is a little more than half typed. I have to be really bored to work on it.

mikelove · May 4, 2013

Jim Kay said:
For me, a duplicate Flashcard is one where the Chinese is the same on both AND the pronunciation is also the same on both. Comparing the definitions is clearly too much for a computer to do, so a prompt is appropriate.

That's how our duplicate checking works, yes.

Jim Kay said:
1. Pleco seems to take ONLY the first Chinese character from the new card and then search for any existing (currently being imported?) card with the same first character. This causes a vast number of false duplicate hits.

Certainly not the case, and this suggests some sort of issue with your import file - something that's preventing it from seeing the later characters. Are you sure the file is formatted correctly? Try exporting a few cards created in Pleco to compare with. You want the file to be arranged like this:
simplified[traditional]<tab>pinyin<tab>definition
One entry per line. BoPoMoFo in place of Pinyin usually works, but it's not 100% reliable and can also be encoded oddly in some cases, so I'd recommend running it through an automated converter to get Pinyin - you can still view the text within Pleco in BoPoMoFo if you like. You can omit simplified or traditional, but that may cause problems in matching against cards created from within the dictionary, since they'll include both (and the software only flags a duplicate if everything matches).
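A quick way to test a file against that layout before importing is a small checker like the one below. It is a rough sketch based only on the format described above; the function name `check_line` and the wording of the messages are invented for the example, and it does not reproduce whatever validation Pleco itself performs.

```python
import re

# Headword such as 电话筒[電話筒]: simplified, then traditional in square brackets.
HEADWORD = re.compile(r"^(?P<simplified>[^\[\]\t]+)\[(?P<traditional>[^\[\]\t]+)\]$")

def check_line(line):
    """Return a list of problems found in one import line (empty if none)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 3:
        # Missing tabs are the most likely cause of mis-parsed entries.
        return [f"expected 3 tab-separated fields, found {len(fields)}"]
    headword, pinyin, definition = fields
    problems = []
    if not HEADWORD.match(headword):
        # Allowed by the format, but may not match dictionary-created cards.
        problems.append("headword lacks simplified[traditional] form")
    if not pinyin.strip():
        problems.append("pronunciation field is empty")
    if not definition.strip():
        problems.append("definition field is empty")
    return problems

ok = check_line("钢[鋼]\tgang1\tsteel")
no_tabs = check_line("钢[鋼]gang1steel")
```

A line with its tabs intact returns an empty list; a line whose tabs were lost comes back with a field-count problem.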

Jim Kay said:
2. There are entirely different formats used for displaying the card being imported and the card it is suspected of being a duplicate of. These different formats make the manual decision process much too difficult to use when wading through a very large number of false positive hits.

There are some slight variations, but again this sounds like an issue in your import file.

Jim Kay · May 4, 2013

Here are a few cards from my input file:

电话筒[電話筒]dian4hua4tong3N: telephone receiver
好处[好處]hao3chu4"good points, profits, benefits"
共[共]gong4"common, same, together"
钢[鋼]gang1steel
像[像]xiang4"image, portrait, look like, seem as"
分开[分開]fen1kai1to separate
痛[痛]tong4N: pain
他在[他在]ta1zai4he/she is at/in

(These are NOT selected from the 'duplicate' process.) Unless you see something I don't see, the format is exactly what you have described above. (Seeing those quote marks reminds me I forgot about copy/paste and used the old familiar save-as.)

Jim Kay · May 4, 2013

My original input does have only Bopomofo, but I've put the CEdict into a SQL database and I'm using a lookup on the character to select the correct pinyin, matching both the character and the pronunciation. I use this database for finding the correct simplified form too.

mikelove · May 4, 2013

I assume that the copy-and-pasted text doesn't include the tabs that are in the original file?

If so, the only other explanation I can think of is that there's an issue with the text encoding format - is this in UTF-8 or something else? Is the importer likewise configured to use UTF-8?
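Both questions (tabs and encoding) can be checked outside Pleco with a short script. The sketch below is only a diagnostic aid under the assumption that the file is meant to be tab-separated UTF-8; `inspect_bytes` is a hypothetical helper name, and the script says nothing about how the importer itself decodes the file.

```python
import codecs

def inspect_bytes(data):
    """Report basic encoding facts about the raw bytes of an import file."""
    report = {
        "has_bom": data.startswith(codecs.BOM_UTF8),
        "valid_utf8": True,
        "lines_without_tab": [],
    }
    try:
        # "utf-8-sig" accepts UTF-8 with or without a leading byte order mark.
        text = data.decode("utf-8-sig")
    except UnicodeDecodeError:
        report["valid_utf8"] = False
        return report
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.strip() and "\t" not in line:
            report["lines_without_tab"].append(lineno)
    return report

sample = "钢[鋼]\tgang1\tsteel\n痛[痛]tong4N: pain\n".encode("utf-8")
result = inspect_bytes(sample)
```

For a real file, the bytes would come from something like `open(path, "rb").read()`. In the sample above, the second line is listed under `lines_without_tab` because its tabs are missing.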

Jim Kay · May 4, 2013

I imported the tab-separated file using Excel which, naturally enough, splits the file on the tabs and, in the process, deletes them.

I paste the 'copy' into a rather advanced editor and save it with an explicit specification of UTF-8.

The importer is configured for UTF-8 as well, and the 'sample' the import shows me looks just fine. I even scrolled through it a bit to be sure.

mikelove · May 4, 2013

Hmm... well could you possibly PM me a sample of the file, then? Along with indicating which specific entries are being mistakenly flagged as duplicates? Perhaps there's a more subtle formatting issue here that I haven't thought of.

Jim Kay · May 4, 2013

This is becoming very interesting. There does seem to be a problem, but not the problem I thought it was.

First, I am using 'category' in a way that is very different from the Pleco 'design.' I have a number of separate lists that I refer to as 'lessons' and I study them separately and for different purposes. But for Pleco, one flashcard is just like any other and it isn't really intended for different categories to be completely independent. So, if I have a card in one category and I try to import another card exactly like it into another category, it's flagged as a duplicate.

I realized this when I tried to import the same set of cards with only a different category name as the first card. ALL of the cards were being flagged as duplicates.

But then I deleted every single flashcard in my system and I even deleted all of my categories (leaving only the permanent 'uncategorized' behind, containing no cards). The import STILL flagged every single card as a suspected duplicate.

So it seems that deleting all of my cards does NOT take them out of the database. And with absolutely NO cards left, the database STILL retains all of the card information. Thus importing the same cards again will produce a slew of duplicate messages even though the 'duplicates' are not there anymore. It becomes necessary to delete the entire database to get rid of the phantom duplicates.

So deleting a card from the last category where it exists, really does get rid of the card (I can't find it again.) But the ghost of the card remains in the database. The card does NOT magically appear in 'uncategorized' as one might think, the card is gone.

Previously, I had tested the Import with a few scattered cards which I then deleted, not knowing those cards remained in the database. So when I imported the full list, those ghost entries triggered duplicate prompts.

I was assuming 'duplicate' meant one card in the import was the same as another card in the same import. Obviously that's not the case.

So the question has become this: Shouldn't deleting a card from the very last user category where it resides either dump it into 'uncategorized' or delete it from the database? (And, by implication, shouldn't deleting a card that exists ONLY in 'uncategorized' delete it from the catalog as well?)

Jim

mikelove · May 4, 2013

Jim Kay said:
First, I am using 'category' in a way that is very different from the Pleco 'design.' I have a number of separate lists that I refer to as 'lessons' and I study them separately and for different purposes. But for Pleco, one flashcard is just like any other and it isn't really intended for different categories to be completely independent. So, if I have a card in one category and I try to import another card exactly like it into another category, it's flagged as a duplicate.

For that scenario I think you'd want to "allow" rather than "prompt" for duplicates. Alternatively, you might find what you're looking for with our "scorefile" system - that way you can use categories the way they were intended to be used in Pleco but still keep multiple review histories for a particular word depending on how you're reviewing it.

Jim Kay said:
But then I deleted every single flashcard in my system and I even deleted all of my categories (leaving only the permanent 'uncategorized' behind, containing no cards). The import STILL flagged every single card as a suspected duplicate.

That's certainly not the intended behavior... try doing a Search Cards for "All cards" - do the cards come up then?

Jim Kay · May 4, 2013

Nice fix! That was very fast indeed.

(I tried to run a test that would answer your question, but the fix made that impossible, and unnecessary.)

Thanks,
Jim

mikelove · May 4, 2013

Great! Still not sure why they'd go missing, though - seems like you might have somehow ended up with an orphaned category (one that's a child of a now deleted category).

Jim Kay · May 5, 2013

I only created one category and no child categories. Oh well.
