Tatoeba example sentences together with CC-CEDICT

Shun

状元
Hi all,

I have created a user dictionary that adds up to fifteen Chinese-English Tatoeba example sentences, where available, to each of the roughly 118,000 CC-CEDICT dictionary entries. Here's an excerpt from a dictionary entry:

IMG_3957.png


The zipped user dictionary (45 MB) and the text file (9.8 MB) can be downloaded through this link:


One can get something very similar by visiting Tatoeba.org and entering a Chinese expression, though it is surely a lot more convenient to have these sentences right inside Pleco.

@Natasha's question inspired me to do this. It is quicker to access example sentences this way than through the Search feature in Organize Flashcards.

Attribution: I used data from CC-CEDICT and Tatoeba.org for this dictionary. I have attached the Python script I used to build it. Feedback is welcome.

Enjoy,

Shun
 


乔米

秀才
Hey Shun,

I was looking to plug this data into my fork of a CEDICT-compatible Chinese plugin for Anki: https://github.com/joeminicucci/chinese-support-redux.

First of all, can you advise me on how to get the data on my own from tatoeba.org? In the downloads section I was unable to get Mandarin Chinese sentences with accompanying translations from the following Tatoeba data set categories:

1. "Sentences"
2. Transcriptions

These sets did not look correct, so I performed a search using the following query:

That appeared to return the proper data. Is there a way to grab all of it through their API, or do I need to scrape it?

After that, I looked at your data set as a precursor to writing a Python parser that adds it to my custom CEDICT database. I noticed an unprintable character that appears several times throughout the file, as seen in the screenshot below (note that I am viewing it in VS Code with UTF-8 encoding enabled):

1607491174755.png



Questions:

1. What is the best way to get this data from the source? After a little more investigation I can see that you merged the Tatoeba eng sentences with the cmn sentences using the links data set. Do you have the code you used to merge them via the translation links, i.e. to produce the sentences_cmn_eng_simplified_folded.txt file you fed into your Python parser?
2. What is the unprintable character in your data set?


Thanks for your time!
乔米
 

Shun

状元
Hello 乔米,

Sure; from


you need to get the "sentences" and "links" files. The sentences file has the structure

Sentence id [tab] Lang [tab] Text

and the links file

Sentence id [tab] Translation id

So you can filter the sentences for the "Chinese Mandarin" language code 'cmn' and whatever other language you wish, then find all the sentence pairs using the links file, which links sentences of the same meaning with different IDs and languages together. I think it's a nice small exercise, but sure, you can find the Python source code here:


(Edit: I noticed that the way I wrote the Python script two years ago isn't very memory-efficient. I would do it differently now: filter the sentences first, then read them into a dictionary, and then combine them.)
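
For anyone who wants to try the exercise, here is a minimal sketch of the filter-first approach, not the original script. It assumes the two tab-separated layouts described above; the function names and the file names sentences.csv and links.csv in the comments are only illustrative assumptions.

```python
def filter_sentences(lines, langs):
    """Keep only sentences in the wanted languages.

    Each line is expected as: sentence id [tab] lang [tab] text.
    Returns {sentence_id: (lang, text)}.
    """
    sentences = {}
    for line in lines:
        parts = line.rstrip("\r\n").split("\t", 2)
        if len(parts) == 3 and parts[1] in langs:
            sentences[parts[0]] = (parts[1], parts[2])
    return sentences


def pair_translations(sentences, link_lines, src="cmn", dst="eng"):
    """Combine filtered sentences into {source text: [translations]}.

    Each link line is expected as: sentence id [tab] translation id.
    Links that point at sentences dropped by the filter are skipped.
    """
    pairs = {}
    for line in link_lines:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) != 2:
            continue
        a = sentences.get(parts[0])
        b = sentences.get(parts[1])
        if a and b and a[0] == src and b[0] == dst:
            pairs.setdefault(a[1], []).append(b[1])
    return pairs


def build_pairs(sentences_path, links_path):
    """Hypothetical helper: stream both files from disk line by line.

    Example call (file names are assumptions):
        build_pairs("sentences.csv", "links.csv")
    """
    with open(sentences_path, encoding="utf-8") as f:
        sentences = filter_sentences(f, {"cmn", "eng"})
    with open(links_path, encoding="utf-8") as f:
        return pair_translations(sentences, f)
```

Because the sentences are filtered while streaming, only the cmn and eng rows are ever held in memory, and the links file is read line by line.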

The special character in my text files is Pleco's "newline" character. An ordinary line break can't be used inside an entry because it is already reserved as the record separator in the TSV file format. You can find more info on these special characters here:
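
If you parse the text file outside Pleco, something along these lines may help. It is only a sketch: it assumes the marker is the private-use code point U+EAB1 (please verify this against the actual file, for example with the first helper below), and both function names are made up for illustration.

```python
import unicodedata

# Assumption: Pleco's in-entry newline marker is the private-use
# code point U+EAB1. Check your own file before relying on this.
PLECO_NEWLINE = "\uEAB1"


def private_use_chars(text):
    """List the distinct private-use characters found in text."""
    return sorted({ch for ch in text if unicodedata.category(ch) == "Co"})


def parse_entry(line, marker=PLECO_NEWLINE):
    """Split one tab-separated entry and restore real line breaks."""
    fields = line.rstrip("\r\n").split("\t")
    return [field.replace(marker, "\n") for field in fields]
```

Running private_use_chars over the whole file first shows which special characters actually occur, so the marker does not have to be taken on faith.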


No problem, hope this helps,

Shun
 

乔米

秀才
Shun,

Thanks so much for your assistance. I'm definitely open to coding exercises, but I have so much other code to get to that I really appreciate this push forward! This will save me a lot of time, and I will include my modifications and optimizations in the utilities I use for developing the Anki add-on.

Also, thanks for the heads-up on Tatoeba's data organization scheme as well as the source code. I'll DM you once I get all of my ducks in a row on GitHub.

Thanks so much!!
 

Shun

状元
Hi 乔米,

you're welcome! That's fine, I didn't know your situation. Thanks for the offer!

Have a great day,

Shun
 