TOCFL Levels 1-5 in tab-separated text files - Please help me convert

Mer

Member
Here are the TOCFL levels as 5 tab-separated text files.
As is, they can be used in Anki, but they need to be reformatted for Pleco.
If someone knows how to convert them to Pleco format, that would be awesome.

I modified the lists to include bigrams only and removed the parts of speech, because it's pretty easy to work out the part of speech from context. I removed the non-bigrams because I know most of them already. Sorry about that to anyone who may have wanted them, but there weren't that many anyway.

Thanks

PS:
If you're studying traditional characters in Taiwan, I recommend using the TOCFL lists instead of Practical Audio Visual Chinese, because after book 3 in the series the books start including >50% outliers, which really hinders progress. Books 1, 2 and 3 are quite good though, with book 1 being a great beginner book.
 

Attachments

  • Anki TOCFL bigrams L01.txt
    10.6 KB · Views: 1,191
  • Anki TOCFL bigrams L02.txt
    9.4 KB · Views: 1,028
  • Anki TOCFL bigrams L03.txt
    16.3 KB · Views: 1,066
  • Anki TOCFL bigrams L04.txt
    106.2 KB · Views: 1,263
  • Anki TOCFL bigrams L05.txt
    108 KB · Views: 1,207

Mer

Member
Here they are in Pleco format. I figured out the formatting: just add "// Title Bla bla bla" as the first line of the tab-separated file.
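
For anyone who wants to script that step, here is a minimal sketch. It assumes the header goes on its own line above the entries; the function name `add_pleco_header` and the title text are made up for illustration and are not part of Pleco or of the attached files.

```shell
# Hypothetical helper (illustration only): write a "// <title>" header line,
# then copy the tab-separated list from standard input unchanged.
add_pleco_header() {
    printf '// %s\n' "$1"   # $1 is the title to show in the header line
    cat                     # pass the original entries through as they are
}
```

It could be used along the lines of `add_pleco_header "TOCFL L01" < "Anki TOCFL bigrams L01.txt" > "Pleco TOCFL bigrams L01.txt"`, where the output file name is just an example.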

 

Attachments

  • Pleco TOCFL bigrams L01-05.zip
    127.4 KB · Views: 1,820
  • example.jpg
    95.4 KB · Views: 3,712

Shun

状元
Hi Mer,

Thanks for these useful lists. It's good to see the slight differences in Taiwanese Mandarin. Since Pleco has three fields per card (Hanzi, pinyin and the English definition), you would need another <tab> character after the pinyin instead of a space. This should be possible to correct using regular expressions. Perhaps someone knows how to do it?

Cheers, Shun
 

alex_hk90

状元
This should get you fairly close:
Code:
sed 's/\(\w*\)\s\(\w*\)\s\(.*\)/\1\t\2\t\3/g' Input.txt > Output.txt
I tried with one of the files and it looks more or less there with just a bit of manual clean-up required after.
 

Shun

状元
Thanks a lot! I tried running sed on OS X both with and without the -E option. If I run it without the -E option, it seems not to change anything; if I use the -E option, I get this error:

sed -E 's/\(\w*\)\s\(\w*\)\s\(.*\)/\1\t\2\t\3/g' test.txt > test-out.txt
sed: 1: "s/\(\w*\)\s\(\w*\)\s\(. ...": \1 not defined in the RE
 

alex_hk90

状元
Mac has a different (BSD-based) version of sed - have a look at installing GNU sed and that should work. :)
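
For reference, the -E error above is expected: with -E, `\(` and `\)` match literal parentheses rather than forming groups, so `\1` has nothing to refer to. Without -E, the stock sed on OS X most likely does not understand the GNU shorthands `\w`, `\s` and `\t`, so nothing matches. A rough sketch that avoids those shorthands and sticks to POSIX features is below. It assumes every line looks like "hanzi, whitespace, pinyin, whitespace, definition" with no spaces inside the pinyin; the function name `to_three_fields` is made up for illustration, and this has not been checked against the attached files.

```shell
# Sketch only: assumes each line is "hanzi<whitespace>pinyin<whitespace>definition".
# Lines with fewer than three whitespace-separated parts are left unchanged.
to_three_fields() {
    tab=$(printf '\t')   # a literal tab, since \t in the replacement is a GNU extension
    # Group 1 = hanzi, group 2 = pinyin, group 3 = rest of the line (the definition).
    sed 's/^\([^[:space:]]\{1,\}\)[[:space:]]\{1,\}\([^[:space:]]\{1,\}\)[[:space:]]\{1,\}\(.*\)$/\1'"$tab"'\2'"$tab"'\3/'
}
```

It reads standard input and writes standard output, so it could be run as `to_three_fields < Input.txt > Output.txt`. A "// Title" header line would also be split by this pattern, so it is safer to convert the entries first and add the header afterwards.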
 

Shun

状元
Something to keep in mind. :) I'll do it in 3 weeks on my experimental machine which has Xcode on it. (then I'll have enough time) Or if you like, you could PM the output files to me for cleanup & I'll post them here.
 

alex_hk90

状元
I don't think you can attach files to a PM so I've uploaded them here. I'll remove the attachment (EDIT: now removed) when you've posted the cleaned-up version. :)
 

Shun

状元
Perfect, thanks! I couldn't replace a "comma + any letter" sequence by "comma + <space> + any letter", but it should definitely be usable now. (EDIT: See last post.)
 

alex_hk90

状元
This should do it:
Code:
sed 's@\([a-zA-Z]\),\([a-zA-Z]\)@\1, \2@g' Pleco\ TOCFL\ bigrams\ L05b.txt > Pleco\ TOCFL\ bigrams\ L05c.txt
 

Shun

状元
Excellent, thanks for making it work on BSD, here's the final, cleaned-up version:
 

Attachments

  • Pleco TOCFL bigrams L01c-L05c.zip
    126.6 KB · Views: 1,267