Integrating BCC Corpus Data into Dictionary

#1
Hello Mike,

It occurred to me that it may be worthwhile to add an indicator of a word's frequency in the upper right corner of a dictionary definition, using the frequency data in the BCC corpus, so that the user can see at a glance how common a word is. The frequency information could be given either as an index score between 1 and 100, or as a color-coded level from 1 to 6, with 1 being the most common and 6 the least common. The BCC corpus seems to have pretty loose licensing terms.
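To make the two proposed scales concrete, here is a minimal Python sketch of how raw BCC occurrence counts could be mapped onto them. The thresholds and the logarithmic scaling are entirely hypothetical choices for illustration; only the rough maximum count comes from the thread:

```python
import math

MAX_COUNT = 900_000_000  # roughly the top count in the BCC global corpus

def commonness_score(count, max_count=MAX_COUNT):
    """Map a raw occurrence count to an index score from 1 to 100,
    100 being the most common (log scale)."""
    if count <= 0:
        return 1
    frac = math.log(count) / math.log(max_count)
    return max(1, min(100, round(frac * 100)))

def commonness_level(count, max_count=MAX_COUNT, levels=6):
    """Map a raw occurrence count to a level from 1 (most common)
    to `levels` (least common), again on a log scale."""
    if count <= 0:
        return levels
    frac = math.log(count) / math.log(max_count)
    return min(levels, max(1, levels - int(frac * levels)))
```

With these made-up buckets, a word with a million occurrences lands at level 2, while one with only a few hundred lands at level 5.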

Pleco already seems to be using frequency data to sort the search results. Adding them meaningfully to dictionary definitions would be even better, I believe. That is something which printed dictionaries can’t do. Of course, there should be a toggle in the dictionary settings to turn the frequency display off.

Do you think this would be a viable idea?

Regards,

Shun
 

mikelove

皇帝
Staff member
#2
Couldn't you do this now with flashcard category tags? Wouldn't need any help from us.

I'm honestly a little wary of adding built-in frequency listings because I don't think they're a very good way to learn Chinese; even a really excellent corpus will probably be several years out of date for slang vocabulary, so a term that comes up as uncommon may actually be quite common now (or vice versa) - people are constantly repurposing old words - plus I don't believe they're accurate to a sufficient degree of granularity to be useful for prioritizing studies; you're going to study all of the 1000 most common words eventually, and doing 100-200 before 200-300 is probably less helpful than, say, learning a bunch of words related to the same subject / context at the same time.
 
#3
Thanks for your quick answer.

Couldn't you do this now with flashcard category tags? Wouldn't need any help from us.
You're right; in theory, yes, though it would mean importing about 1.7 million flashcards, or 100,000 if one trimmed it to the first 100,000 words or so. In any case, it probably would be a lot of data to carry around, even if it were built into the app.

I'm honestly a little wary of adding built-in frequency listings because I don't think they're a very good way to learn Chinese; even a really excellent corpus will probably be several years out of date for slang vocabulary, so a term that comes up as uncommon may actually be quite common now (or vice versa) - people are constantly repurposing old words - plus I don't believe they're accurate to a sufficient degree of granularity to be useful for prioritizing studies; you're going to study all of the 1000 most common words eventually, and doing 100-200 before 200-300 is probably less helpful than, say, learning a bunch of words related to the same subject / context at the same time.
I see this feature as being useful not primarily as a learning aid, but rather for confirming whether one's personal impression of a word's approximate commonness is about right, and perhaps for deciding whether a word is worth adding to one's flashcards at one's current stage. One would certainly want to avoid adding arcane or very rare words.

The most common 1,000 words would all fall into the highest level of commonness anyway; I was thinking more of distinguishing between various levels of intermediate or advanced words. When one word has 500,000 occurrences and another has, say, 200,000, I do think that reflects a meaningful difference in actual frequency of usage in the whole language. (The maximum number of occurrences for a word is more than 900 million in the BCC global corpus.) Even more so if one word has 1 million occurrences and another 100,000, for example. If a slang term were rated wrongly (which I doubt, since the BCC appears to be up-to-date and comprehensive), or if the ratings drifted a few years later for certain kinds of words, it would be unpleasant, but probably a rare occurrence.

Perhaps one could instead place a word's commonness rating at the top of the WORDS tab listing? Then it wouldn't be plainly visible, and it wouldn't demotivate learners who are studying vocabulary for a particular subject and see a low commonness rating for a word they have to study. Perhaps that would be something for Pleco in 10 years' time, or possibly even a free or paid add-on?

But of course, you'd have to feel good about it. If you don't, life goes on, and I could indeed try out the flashcard category tag option instead.

Regards,

Shun
 
#4
Learning from word frequency lists without context is bad.

However, word lists are a useful tool for deciding "is this word really worth knowing at my current level?" That is how I use frequency lists. I remember finding this to be especially useful when moving from curated learning texts to native material.
 
#5
I very much agree with Peter and Shun (and Mike :)).
In fact, as I highly value such functionality, I built my own set of such flashcards a while back. They use Pleco's tagging functionality to display the BCC, Leiden Weibo Corpus and SUBTLX corpus rankings for every word, and I notice no slowdown in performance :)

Generating such flashcards is a relatively trivial task: just add each word as an empty flashcard to a Pleco flashcard category based on the word's ranking range, and add tags in Pleco for each ranking range. I'm sorry, but due to possible licensing implications, I am not keen on sharing my own flashcards... (As per Mike's suggestion above, you can easily make your own.)

Request for the Gods of Pleco (i.e., Mike :)): Displaying tags in the popup window would make such functionality even more useful. Of course, you are quite right that there are limitations to using such frequency lists in your studies. In particular, I fully agree one shouldn't make a study plan based on them :)

Notes for attached screenshot:
G=BCC Global, B=Blog, N=News, T=Technical, W=Weibo, L=Literature, S=SUBTLX, w=Leiden
 


#7
Learning from word frequency lists without context is bad.

However, word lists are a useful tool for deciding "is this word really worth knowing at my current level?" That is how I use frequency lists. I remember finding this to be especially useful when moving from curated learning texts to native material.
Thanks, this is very much the use I am envisioning. It's a natural progression from using HSK-rated sentence lists to having all words frequency-rated. Since I'm not really inclined to work with curated texts anymore, frequency lists are becoming more important to me. I also agree that if such a feature were built into Pleco, it would be prone to being misused and misinterpreted by some users (it wouldn't always be authoritative, and Pleco needs to be as authoritative as possible), so it's probably preferable to keep it as a DIY option.

I very much agree with Peter and Shun (and Mike :)).
In fact, as I highly value such functionality, I built my own set of such flashcards a while back. They use Pleco's tagging functionality to display the BCC, Leiden Weibo Corpus and SUBTLX corpus rankings for every word, and I notice no slowdown in performance :)
Great! Then it surely is using some very efficient algorithm to find the flashcards. I remember you already had this setup a long time ago.

Generating such flashcards is a relatively trivial task: just add each word as an empty flashcard to a Pleco flashcard category based on the word's ranking range, and add tags in Pleco for each ranking range. I'm sorry, but due to possible licensing implications, I am not keen on sharing my own flashcards... (As per Mike's suggestion above, you can easily make your own.)
Indeed, it's very straightforward. Sharing is needed only for more complicated things, and you're surely right about licensing.

Request for the Gods of Pleco (i.e., Mike :)): Displaying tags in the popup window would make such functionality even more useful. Of course, you are quite right that there are limitations to using such frequency lists in your studies. In particular, I fully agree one shouldn't make a study plan based on them :)
Yeah, studying things in isolation is never good. Thanks for the screenshots!
 
#8
Great! Then it surely is using some very efficient algorithm to find the flashcards. I remember you already had this setup a long time ago.

Thanks for the screenshots!
You're most welcome!

Oh, I forgot to mention that, to reduce the number of flashcards required, you can import all the corpus entries into Pleco using the "skip missing entries" option and then immediately export the created flashcards, which gives you a list of all the corpus entries that have dictionary entries.

In my case this reduces the number of flashcards to about 200K or so for the BCC "Global" corpus, for example.
 
#9
Oh, I forgot to mention that, to reduce the number of flashcards required, you can import all the corpus entries into Pleco using the "skip missing entries" option and then immediately export the created flashcards, which gives you a list of all the corpus entries that have dictionary entries.

In my case this reduces the number of flashcards to about 200K or so for the BCC "Global" corpus, for example.
Thanks for this tip! But I think I will import them all without skipping any, down to a frequency of 100 (427,651 expressions), and put them in a user dictionary with the frequency as the definition. This way, I can make fuller use of the BCC corpus, and for any expression unknown to Pleco's dictionaries I will know that it exists but is very rare.
 

mikelove

皇帝
Staff member
#12
You're right; in theory, yes, though it would mean importing about 1.7 million flashcards, or 100,000 if one trimmed it to the first 100,000 words or so. In any case, it probably would be a lot of data to carry around, even if it were built into the app.
Would you even need 100,000, though? 20,000 is around native speaker vocabulary level; anything less common than that is uncommon enough to be lumped together as specialized vocabulary.

4.0 is *way* faster at this stuff - can import and fully index (even full-text!) CC-CEDICT with its > 100,000 entries on an iPhone 6s in about 1 minute - and will also make it much easier to pull data from a user dictionary into a tab and possibly even a tag indicator (if not in 4.0 then in 4.something at least), so I do believe that what you're talking about would be achievable without official support from us.

(incidentally, it can also pull data from a user dictionary and then sort/group a list of words using that data, or extract a list of words from a user dictionary definition and then pull up a list of those words as search results, so both frequency sorting and other fixed-list stuff like synonyms / character breakdowns / etc should be user-replaceable in 4.0; we haven't talked about it as much, but the same ridiculous amount of customization we've brought to flashcard testing is also coming to searches)

@leguan - tags in popup already done for 4.0.
 
#13
Would you even need 100,000, though? 20,000 is around native speaker vocabulary level; anything less common than that is uncommon enough to be lumped together as specialized vocabulary.
Since the corpus contains many repetitions in the form of compound expressions like "国民政府", the first 20,000 N-grams aren't really the same as the core vocabulary of roughly 20,000 words that the average native speaker may have. But yes, there are still quite a few words in the first 20,000 most frequent lines of BCC's frequency list that I've never seen. I will just put in whatever fits. Importing 427,000 expressions is rather a lot, but if it doesn't bring Pleco down and might produce an occasional match, why not. :)

4.0 is *way* faster at this stuff - can import and fully index (even full-text!) CC-CEDICT with its > 100,000 entries on an iPhone 6s in about 1 minute - and will also make it much easier to pull data from a user dictionary into a tab and possibly even a tag indicator (if not in 4.0 then in 4.something at least), so I do believe that what you're talking about would be achievable without official support from us.
Such efficient code is really impressive. It took me about 4 hours to import the 427,000 expressions on Pleco 3.2, but everything already worked as advertised, tags included. I can imagine that with version 4.0, it will be much more feasible.

(incidentally, it can also pull data from a user dictionary and then sort/group a list of words using that data, or extract a list of words from a user dictionary definition and then pull up a list of those words as search results, so both frequency sorting and other fixed-list stuff like synonyms / character breakdowns / etc should be user-replaceable in 4.0; we haven't talked about it as much, but the same ridiculous amount of customization we've brought to flashcard testing is also coming to searches)

@leguan - tags in popup already done for 4.0.
This is really great and will please leguan.
 
#16
Dear all,

I'm writing to report that the system works as expected; the only disadvantage is that my Flashcards database has grown to 250 MB, and the BCC Frequencies user dictionary is 66 MB.

The question of whether pinyin is available for a word is also handled well. If Pleco has a pinyin transcription for an expression, the tags show up (because the BCC entry also gets the pinyin transcription upon import); if the BCC user dictionary is the only dictionary that knows the word and no pinyin is available, Pleco still lets me tap on the word and displays the pop-up definition for it without the pinyin.

Regards,

Shun
 
#17
Dear Shun,

Thank you for the update! Very interesting, indeed!

Am I correct in understanding that Pleco assigns the categories defined in the imported file to dictionary entries?
(I've never considered that tags could be added to dictionary entries before!)
 
#18
Dear leguan,

You're very welcome!

It's really quite simple: the words from the corpus are added to the Flashcards database upon import, user dictionary entries are automatically generated for the imported words, and those entries are then linked to the newly created flashcards in the Flashcards database. If I tap on a word that is known only to the BCC corpus and open the Dictionary view, I see the user dictionary entry as well as the tag, because the word is also found in Flashcards and has the tag there. If I hadn't used a user dictionary entry, I wouldn't have been able to tap on the word from anywhere in the app. So it still uses Flashcards, but is assisted by the user dictionary.

I'm already hard at work on the sentence contextual Python clone. The indexing of 175,000 words contained in 40,000 sentences takes about an hour.
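The indexing step could look something like this simple inverted index, mapping each word to the sentences that contain it. This is only an illustration of the general technique, not the actual tool described above; it also assumes the sentences are already segmented into words, which for real Chinese text would require a segmenter:

```python
from collections import defaultdict

def build_index(segmented_sentences):
    """segmented_sentences: a list of word lists, one per sentence.
    Returns a dict mapping each word to the set of sentence ids
    in which it occurs."""
    index = defaultdict(set)
    for sent_id, words in enumerate(segmented_sentences):
        for word in words:
            index[word].add(sent_id)
    return index

def lookup(index, word):
    """Return the sorted sentence ids containing `word`."""
    return sorted(index.get(word, ()))
```

For example, `lookup(build_index([["我", "喜欢", "你"], ["你", "好"]]), "你")` returns `[0, 1]`.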

Regards,

Shun
 
#19
Thank you very much for your detailed explanation! Yes, that makes sense. Also, by importing the cards as a user dictionary, you gain additional benefits without losing anything! So if my understanding is correct, it seems there are no significant downsides :)

In my case, though, if I create such a user dictionary, I'd like to include all of the individual corpus frequency rankings in a single dictionary entry for each term, so it will require a little more work :rolleyes:
 
#20
I'm already hard at work on the sentence contextual Python clone. The indexing of 175,000 words contained in 40,000 sentences takes about an hour.
Great! It does indeed seem that, as we expected, Python is quite a bit quicker than VBA!
 