Importing a large amount of text from the Bible

#1
Hi there,

I wondered if there is a way to import a large amount of text into some kind of software that can pick out the individual words. My plan would then be to import them into Pleco as a flashcard deck similar to the HSK lists already on Pleco. The end result would be that, as with the HSK word lists, when you look up a word you can see what its source is. I am specifically looking at words from the Bible. I know there is a list of the characters in the Bible out there, but it’s the words I want to go for. Sorry if this kind of question has been asked hundreds of times already, but I couldn’t find anything on the forum.

Many thanks
 
#2
Hi Christopher,

that’s an interesting question. I’m almost inclined to believe that the best way to get a Chinese Bible concordance would be to read through the Chinese Bible and add interesting words to Flashcards, and to wait for version 4.0 of Pleco, which will very likely add contextual information to flashcards added from texts in the Document Reader.

Right now, you could run the Chinese Bible through the Stanford segmenter and then import the words into Pleco Flashcards from a large text file list. But then you would be missing the context. A better option would be to write a Python script that prepares flashcards with meta information.
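A minimal sketch of such a script, assuming the segmenter has already produced one verse per line with words separated by spaces. The function name `build_flashcards` is made up for illustration, and the output layout (a `//` category header, then headword, pronunciation and definition separated by tabs) is my assumption about Pleco's text import format, so check it against a small test file first. The "meta information" here is simply the first verse in which each word occurs:

```python
def build_flashcards(segmented_lines, category="Chinese Bible"):
    """Return import text with one card per distinct word.

    Assumed layout (not verified against Pleco): a "//Category" header line,
    then "headword<TAB>pronunciation<TAB>definition" rows. The pronunciation
    field is left empty and the definition field holds the context line.
    """
    first_context = {}
    for line in segmented_lines:
        words = line.split()
        sentence = "".join(words)  # the verse without the segmenter's spaces
        for word in words:
            # Skip punctuation and anything that is not purely Chinese characters.
            if not all("\u4e00" <= ch <= "\u9fff" for ch in word):
                continue
            # Remember only the first verse in which the word appears.
            first_context.setdefault(word, sentence)

    rows = ["//" + category]
    for word, sentence in first_context.items():
        rows.append(f"{word}\t\t{sentence}")
    return "\n".join(rows) + "\n"
```

Writing the returned string to a UTF-8 text file gives something to try importing; if the context should not end up in the definition field, the same dictionary can be written out in any other layout.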

If you wish, you can send me your version of the Chinese Bible text by PM, and I will segment it for you just as a start.

Regards,

Shun
 

mikelove

皇帝
Staff member
#3
Yes, we're hoping to have something like that in 4.0 or 4.1, but there isn't really any built-in way to do it in the current version of Pleco.
 
#5
Dear Shun, thank you for your very helpful reply and I concur with Rizen, concerning your kindness.

Sorry, what do you mean by PM? I hope I’m not being stupid asking that question. It reminds me of the time in French class at school when I plucked up all my courage to ask the teacher, in French, what S.V.P. meant, which was written on the blackboard. Of course, all my classmates turned around and shouted in unison, “S’il vous plaît!”

I’m interested in the simplified version of the New Chinese Version 新译本。

Many thanks
 
#6
Dear Christopher,

thank you, I am always glad to help out.

By PM I meant “private message”.

Do you have the 新译本 as an electronic text? As I’ve mentioned, at this point I could only separate out the words in the text so you could then eliminate the words you do not wish to include, and after that, Pleco could add Chinese dictionary definitions, but still without the meta information. Adding all the places where they appear in the text could be automated, but it would still mean a good bit of manual work.
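To give an idea of what automating "all the places where they appear" could look like, here is a rough sketch. It assumes the text is already segmented and that each verse comes with a reference label; the function name `build_concordance` and the reference format are made up for illustration:

```python
from collections import defaultdict


def build_concordance(verses):
    """Map each word to the list of verse references in which it occurs.

    `verses` is an iterable of (reference, segmented_text) pairs, where the
    segmented text has words separated by spaces.
    """
    index = defaultdict(list)
    for reference, segmented in verses:
        # dict.fromkeys drops repeats within a verse but keeps word order.
        for word in dict.fromkeys(segmented.split()):
            index[word].append(reference)
    return dict(index)


verses = [
    ("创 1:1", "起初 神 创造 天 地"),
    ("创 1:3", "神 说 要 有 光"),
]
index = build_concordance(verses)
print(index["神"])  # every reference where 神 occurs
```

The manual work that remains is mostly in getting clean verse references out of the source file, which depends on how that file is laid out.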

Regards,

Shun
 
#7
Thanks Shun - OK, I get the PM.

I have it on a pdf? Would that work?

When you say, " at this point I could only separate out the words in the text so you could then eliminate the words you do not wish to include" Shun, that sounds like an incredible amount of work considering how big the bible is.

Just to clarify, what I'm really wanting is for me to be able to look up a word in pleco and see whether that word is found in the 新译本 or not, just like now when I look up a word I can see which level of HSK it is in, this is incredibly helpful to know what to study and what to leave aside for now. Is that what you understood I meant from the first post?

Many thanks
 
#8
> I have it on a pdf? Would that work?
That should work, there should be ways to extract the text from it.

> When you say, " at this point I could only separate out the words in the text so you could then eliminate the words you do not wish to include" Shun, that sounds like an incredible amount of work considering how big the bible is.
Oh, that is done automatically by a segmenter. So the total workload for me is perhaps 3 minutes.

> Just to clarify, what I'm really wanting is for me to be able to look up a word in pleco and see whether that word is found in the 新译本 or not, just like now when I look up a word I can see which level of HSK it is in, this is incredibly helpful to know what to study and what to leave aside for now. Is that what you understood I meant from the first post?
That would be possible! A Yes/No indicator would definitely be possible with the help of a flashcard category tag.

If you can send it over, I will send back a word list for you to import into Pleco!

Regards,

Shun
 
#9
Hi Christopher,

if I understand you correctly, you are particularly interested in learning words that appear in the Bible. Just learning words that happen to appear in there somewhere is going to be very inefficient. It would be much more efficient to study words according to the frequency with which they are used in the Bible, starting of course with the most frequent ones. There is no better or faster way than this to improve your ability to read the Chinese Bible and talk about Bible topics in Chinese.

Based on a TXT file of the 圣经新译本 I found online (I don't know how reliable this version is, but I read a few lines here and there and nothing seemed odd), I created a list of the 5,000 most frequently used words in this Bible, sorted by frequency. The number next to each word tells you how often the word appears in the Bible. Many names of biblical figures aren't included in this list, since they are not part of the dictionary that was used in creating it. However, since these names are just transcriptions, that shouldn't be an issue.
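For anyone who wants to build a similar list themselves, counting is the easy part once the text is segmented. This is only a sketch, not the tool that produced the attached list; it assumes space-separated words, and `is_chinese` and `frequency_list` are names made up here:

```python
from collections import Counter


def is_chinese(token):
    """True if every character is in the basic CJK Unified Ideographs block."""
    return bool(token) and all("\u4e00" <= ch <= "\u9fff" for ch in token)


def frequency_list(segmented_text, top_n=5000):
    """Return the top_n (word, count) pairs, most frequent first."""
    # Punctuation, digits and Latin text are dropped before counting.
    words = [t for t in segmented_text.split() if is_chinese(t)]
    return Counter(words).most_common(top_n)


for word, count in frequency_list("神 说 ， 光 神 。 光 神", top_n=2):
    print(f"{word}\t{count}")
```

Unlike the list above, this counts every token the segmenter emits, so transcribed names would appear if the segmenter recognizes them.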

If you export all the words you already know from Pleco, copy them into an Excel file, copy this frequency list beneath them and choose "remove duplicates", you will be left with a list of the most frequent words of the Chinese Bible which you don't already know.
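The same filtering can be done without a spreadsheet. A small sketch, assuming the frequency list has been read into (word, count) pairs and the known words into a plain list; `unknown_words` is a hypothetical helper name:

```python
def unknown_words(frequency_rows, known_words):
    """Keep only the frequency rows whose word is not already known.

    The original (frequency) order of the rows is preserved.
    """
    known = set(known_words)
    return [(word, count) for word, count in frequency_rows if word not in known]


rows = [("神", 3), ("光", 2), ("说", 1)]
print(unknown_words(rows, ["说", "神"]))
```

This compares headwords only, so it also works when the known-word export carries extra columns, as long as those are stripped before calling it.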
If you knew only the 2,500 most frequent words on this list, you would already know about 93 percent of the word occurrences in this Bible, which would allow you to read it, though you would have to use a dictionary quite frequently, usually several times per page. If you know the 5,000 most frequent ones, your comprehension rate will be around 98 percent, which will allow you to read it with little or no use of a dictionary.
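Coverage figures like these come from dividing the occurrences of the top N words by the occurrences of all words. A sketch of that arithmetic, assuming (word, count) pairs for the complete word list rather than just the top 5,000 (otherwise the result is overstated); `coverage` is a name made up here and the sample counts are invented:

```python
def coverage(frequency_rows, top_n):
    """Fraction of all word occurrences covered by the top_n most frequent words."""
    total = sum(count for _, count in frequency_rows)
    if total == 0:
        return 0.0
    ranked = sorted(frequency_rows, key=lambda row: row[1], reverse=True)
    covered = sum(count for _, count in ranked[:top_n])
    return covered / total


# Invented counts, only to show the call.
rows = [("神", 50), ("说", 30), ("光", 15), ("渊", 5)]
print(f"{coverage(rows, 2):.0%}")
```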
 

#10
Thanks so much John for doing this and going the extra mile (pun intended). I was about to send Shun a link I’d found but you beat me to it! That frequency list will be so helpful.

Shun, would there still be a possibility of creating a flashcard category tag for this list at all?

Many thanks - I really didn’t have much hope this would produce much when I posted this initially so as we British say, I’m flabbergasted!
 
#12
Hi Christopher,

there is. Please find attached a list based on John's list, which you can import straight into Pleco. Just make sure you choose the following settings:

- Duplicate entries: Allow
- Missing entries: Create blank
- Ambiguous entries: Use first
- Fill in missing fields: Enabled
- Definition source: Prefer File
- Store in user dict: Disabled

After you've done that, you can go into Organize Flashcards, tap the "i" button to the right of the imported category (or tap-hold the category name on Android) and add a tag name "Chinese Bible" or similar. From then on, any word you encounter that is also in the Bible frequency list will have the tag.

I can safely say this: We strive to be a good Pleco community! Thanks for the PDF file.

Regards,

Shun
 

#15
Would it also be possible to make the same kind of list (flashcard category tag) for this version too: 中文标准译本? This is just the New Testament.

If it is possible, I can copy the text from a digital book I have into a Word document. Would that work? Many thanks for your time.
 
#16
Yes, you just have to segment it (because Chinese doesn’t mark word boundaries), replace blanks with newlines and remove duplicates, counting the number of occurrences if you wish. There is no fully automated process for this; you have to perform or program a few steps manually.
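Roughly, those steps could be programmed like this. The segmenter below is a toy forward maximum matching routine against a word set, used here only as a stand-in so the example runs on its own; a real segmenter is far more accurate. The names `segment` and `word_list` and the three-word dictionary are made up for illustration:

```python
from collections import Counter


def segment(text, dictionary, max_len=4):
    """Split text by always taking the longest dictionary word at each position.

    Characters that start no dictionary word are emitted one at a time.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def word_list(words):
    """One distinct word per line with its count, in order of first appearance."""
    counts = Counter(words)
    return "\n".join(f"{word}\t{count}" for word, count in counts.items())


dictionary = {"起初", "创造", "天地"}
print(word_list(segment("起初神创造天地", dictionary)))
```

Words missing from the dictionary fall apart into single characters, which is one reason names tend to go missing from dictionary-based word lists.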

I could do that one for you. Do you mean the version of the Bible you provided the link to previously?
 
#17
Actually, there is a fully automated process to do it. You can use the tool "Chinese Text Analyser" (there's a free trial available) to easily get a list of all words used in any given TXT file, sorted by frequency. Just open the file (for example, the 中文标准译本 copied into a TXT file) with Chinese Text Analyser and choose "File > Export > to File > Word List". If you want all words used in a text, sorted by frequency, choose Words: "All", Sort by: "Frequency (Descending)" and Rows: "All". Beneath that, you can choose which information you want included in the list. Finally, click "Okay" and save your list.
You can also import all the words you already know into the program. After that, for any file you open, the program can tell you what percentage of the words used in the file you already know. Keep in mind that these results are, for a number of reasons, not very accurate, but they can still be very helpful.
 