Tatoeba example sentences together with CC-CEDICT

Shun

状元
Hi all,

I have created a user dictionary that includes up to fifteen Chinese-English Tatoeba example sentences, where available, for each of the roughly 118,000 CC-CEDICT dictionary entries. Here's an excerpt from a dictionary entry:

[Screenshot: excerpt from a dictionary entry showing Tatoeba example sentences]


The zipped user dictionary (45 MB) and the text file (9.8 MB) can be downloaded through this link:


One can get something very similar by visiting Tatoeba.org and entering a Chinese expression, though it is certainly a lot more convenient to have these sentences right inside Pleco.

@Natasha's question inspired me to do this. Accessing example sentences this way is quicker than going through the Search feature in Organize Flashcards.

Attribution: I used data from CC-CEDICT and Tatoeba.org for this dictionary. I have attached the Python script I used to generate it. Feedback is welcome.

Enjoy,

Shun
 

Attachments

  • Generate_user_dictionary.py.txt
    5 KB

乔米

秀才
Hey Shun,

I was looking to use this data to plug into my fork of a CEDICT-compatible Chinese plugin for Anki: https://github.com/joeminicucci/chinese-support-redux.

First of all, can you advise me on how to get the data on my own from tatoeba.org? In the downloads section, I was unable to get Mandarin Chinese sentences with accompanying translations from the following Tatoeba data set categories:

1. "Sentences"
2. Transcriptions

These sets did not look correct, so I performed a search using the following query:

That search appeared to return the proper data. Is there a way to grab all of it through their API, or do I need to scrape it?

After that, I looked at your data set as a precursor to writing a Python parser to add it to my custom CEDICT DB. I noticed there is an unprintable character scattered throughout the file, as seen in the screenshot below (note that I am viewing it in VS Code with UTF-8 encoding enabled):

[Screenshot: the unprintable character appearing in the data file, viewed in VS Code]



Questions:

1. What is the best way to get this data from the source? I did a little more investigation and can see that you merged the Tatoeba eng sentences with the cmn sentences using the links data set. Do you have the code you used to merge them via the translation links, i.e. to produce the sentences_cmn_eng_simplified_folded.txt that you fed into your Python parser?
2. What is the unprintable character in your data set?


Thanks for your time!
乔米
 

Shun

状元
Hello 乔米,

Sure; from


you need to get the "sentences" and "links" files. The sentences file has the structure

Sentence id [tab] Lang [tab] Text

and the links file

Sentence id [tab] Translation id

So you can filter the sentences for the "Chinese Mandarin" language code 'cmn' and whatever other language you wish, then find all the sentence pairs using the links file, which ties together sentences that share a meaning but have different IDs and languages. I think it's a nice small exercise, but sure, you can find the Python source code here:


(Edit: I noticed the way I wrote the Python script two years ago isn't very memory-efficient. I would do it differently now: filter the sentences first, read them into a dictionary, and then combine them.)
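Here is a minimal sketch of that filter-first approach, not the attached script. The file names sentences.csv and links.csv should be what Tatoeba's exports unpack to, and the output name sentences_cmn_eng.tsv is just an example:

# Minimal sketch: build Mandarin-English sentence pairs from the
# Tatoeba "sentences" and "links" exports. Adjust the file names to
# wherever you unpacked the downloads.

def load_sentences(path, lang):
    """Read the sentences file (id<TAB>lang<TAB>text), keeping one language."""
    sentences = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 3 and parts[1] == lang:
                sentences[parts[0]] = parts[2]
    return sentences

cmn = load_sentences("sentences.csv", "cmn")  # Mandarin Chinese
eng = load_sentences("sentences.csv", "eng")  # English

# The links file (id<TAB>translation id) ties together sentences that
# share a meaning; keep only the Mandarin-to-English pairs.
with open("links.csv", encoding="utf-8") as links, \
        open("sentences_cmn_eng.tsv", "w", encoding="utf-8") as out:
    for line in links:
        sid, _, tid = line.rstrip("\n").partition("\t")
        if sid in cmn and tid in eng:
            out.write(f"{cmn[sid]}\t{eng[tid]}\n")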

The special character in my text files is a "newline" character for Pleco, used because an actual line break is already reserved as the record separator in the TSV file format. You can find more info on it here:


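To illustrate, here is how one might substitute that character when generating a definition field. Treat the codepoint as an assumption: U+EAB1 is my reading of Pleco's private-use newline character, so verify it against Pleco's file-format documentation before relying on it.

# Illustration only: swap real line breaks inside a definition for
# Pleco's private-use "newline" character before writing a TSV field.
# U+EAB1 is assumed to be that codepoint; confirm it against Pleco's
# file-format documentation.
PLECO_NEWLINE = "\ueab1"

def pleco_field(text):
    """Make a multi-line definition safe to store in one TSV field."""
    return text.replace("\r\n", "\n").replace("\n", PLECO_NEWLINE)

definition = "to study; to learn\n我每天学习中文。 I study Chinese every day."
print(pleco_field(definition))  # one physical line; Pleco renders two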
No problem, hope this helps,

Shun
 

乔米

秀才
Shun,

Thanks so much for your assistance. I'm definitely open to coding exercises, but I have so much other code to get after that I appreciate this push forward! This will save me a lot of time, and I will include my modifications and optimizations in the utilities I use for developing the Anki add-on.

Also, thanks for the heads-up on Tatoeba's data organization scheme, as well as for the source code. I'll DM you once I get all of my ducks in a row on GitHub.

Thanks so much!!
 

Shun

状元
Hi 乔米,

You're welcome! That's fine; I didn't know your situation. Thanks for the offer!

Have a great day,

Shun
 