Importing a large amount of text from the Bible

Hi there,

Is there a way to import a large amount of text into some kind of software that can pick out the individual words? My plan would then be to import those words into Pleco as a flashcard deck, similar to the HSK lists already in Pleco. The end result I'm after is that, as with the HSK word lists, when I look up a word I can see its source. I am specifically looking at words from the Bible. I know there is a list of the characters in the Bible out there, but it's the words I want to go for. Sorry if this kind of question has been asked hundreds of times already, but I couldn't find anything on the forum.

Many thanks
 

Shun

状元
Hi Christopher,

That's an interesting question. I'm almost inclined to believe that the best way to get Chinese Bible concordances would be to read through the Chinese Bible and add interesting words to Flashcards, and to wait for version 4.0 of Pleco, which will very likely add contextual information to flashcards added from texts in the Document Reader.

Right now, you could run the Chinese Bible through the Stanford segmenter and then import the words into Pleco Flashcards from a large text file list. But then you would be missing the context. A better option would be to write a Python script that prepares flashcards with meta information.
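As a rough illustration of what such a script might do, here is a minimal sketch that records, for each word, the verses it appears in. It assumes the text has already been segmented (words separated by spaces) and that each verse comes with a reference label. The function names, the reference labels and the two sample lines are all made up for the example, and the tab-separated output is just one possible layout, not a format Pleco is known to require.

```python
from collections import defaultdict


def build_word_index(verses):
    """Map each word to the verse references where it occurs.

    `verses` is an iterable of (reference, segmented_text) pairs, where
    segmented_text already has spaces between words (segmenter output).
    """
    index = defaultdict(list)
    for reference, segmented_text in verses:
        for word in segmented_text.split():
            # Record a reference only once per word, even if the word repeats.
            if reference not in index[word]:
                index[word].append(reference)
    return dict(index)


def format_card(word, references, max_refs=5):
    """Return one tab-separated line: the word, then up to max_refs references."""
    return f"{word}\t{', '.join(references[:max_refs])}"


if __name__ == "__main__":
    # Illustrative sample input only; real input would come from the segmenter.
    sample = [
        ("Gen 1:1", "起初 神 创造 天 地"),
        ("John 1:1", "太初 有 道 道 与 神 同在"),
    ]
    index = build_word_index(sample)
    for word, refs in index.items():
        print(format_card(word, refs))
```

Capping the number of references per word keeps very common words from producing unmanageably long lines.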

If you wish, you can send me your version of the Chinese Bible text by PM, and I will segment it for you just as a start.

Regards,

Shun
 

mikelove

皇帝
Staff member
Yes, we're hoping to have something like that in 4.0 or 4.1 but not really any built-in way to do it in the current version of Pleco.
 
Dear Shun, thank you for your very helpful reply and I concur with Rizen, concerning your kindness.

Sorry, what do you mean by PM? I hope I'm not being stupid asking that question. It reminds me of the time in French class at school when I plucked up all my courage to ask the teacher, in French, what the S.V.P. written on the blackboard meant. Of course, all my classmates turned around and shouted in unison, “S'il vous plaît!”

I’m interested in the simplified version of the New Chinese Version 新译本。

Many thanks
 

Shun

状元
Dear Christopher,

thank you, I am always glad to help out.

By PM I meant "private message".

Do you have the 新译本 as an electronic text? As I’ve mentioned, at this point I could only separate out the words in the text so you could then eliminate the words you do not wish to include, and after that, Pleco could add Chinese dictionary definitions, but still without the meta information. Adding all the places where they appear in the text could be automated, but it would still mean a good bit of manual work.

Regards,

Shun
 
Thanks Shun - OK, I get the PM.

I have it as a PDF. Would that work?

When you say, "at this point I could only separate out the words in the text so you could then eliminate the words you do not wish to include", Shun, that sounds like an incredible amount of work considering how big the Bible is.

Just to clarify, what I really want is to be able to look up a word in Pleco and see whether that word is found in the 新译本 or not, just as I can now see which HSK level a word belongs to when I look it up. That is incredibly helpful for knowing what to study and what to leave aside for now. Is that what you understood from my first post?

Many thanks
 

Shun

状元
I have it as a PDF. Would that work?

That should work; there are ways to extract the text from it.

When you say, "at this point I could only separate out the words in the text so you could then eliminate the words you do not wish to include", Shun, that sounds like an incredible amount of work considering how big the Bible is.

Oh, that is done automatically by a segmenter. So the total workload for me is perhaps 3 minutes.

Just to clarify, what I really want is to be able to look up a word in Pleco and see whether that word is found in the 新译本 or not, just as I can now see which HSK level a word belongs to when I look it up. That is incredibly helpful for knowing what to study and what to leave aside for now. Is that what you understood from my first post?

That would be possible! A Yes/No indicator would definitely be possible with the help of a flashcard category tag.

If you can send it over, I will send back a word list for you to import into Pleco!

Regards,

Shun
 

John.

秀才
Hi Christopher,

If I understand you correctly, you are particularly interested in learning words that appear in the Bible. Just learning words that happen to appear in there somewhere is going to be very inefficient. It would be much more efficient to study words according to the frequency with which they are used in the Bible, starting of course with the most frequent ones. There is no better or faster way than this to improve your ability to read the Chinese Bible and talk about Bible topics in Chinese.

Based on a TXT file of the 圣经新译本 I found online (I don't know how reliable this version is, but I read a few lines here and there and nothing seemed odd), I created a list of the 5,000 most frequently used words in this Bible, sorted by frequency. The number next to each word tells you how often the word appears in the Bible. Many names of biblical figures aren't included in this list, since they are not part of the dictionary that was used to create it. However, since these names are just transcriptions, it shouldn't be an issue.

If you export all the words you already know from Pleco, copy them into an Excel file, copy this frequency list beneath them and choose "Remove Duplicates", the entries left beneath your known words will be the most frequent words of the Chinese Bible that you don't already know.
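For anyone who prefers a script to Excel, the same filtering can be sketched in a few lines of Python. This assumes each line of the frequency list holds a word followed by whitespace and its count, and that the known words are available as a plain list; the function name is hypothetical and the actual layout of the attached file may differ.

```python
def unknown_words(frequency_lines, known_words):
    """Return (word, count) pairs that are not in known_words, keeping list order."""
    known = set(known_words)
    result = []
    for line in frequency_lines:
        parts = line.split()
        # Skip blank or malformed lines (assumed layout: word, whitespace, count).
        if len(parts) < 2 or not parts[1].isdigit():
            continue
        word, count = parts[0], int(parts[1])
        if word not in known:
            result.append((word, count))
    return result


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    frequency_list = ["的 100", "神 90", "耶和华 80"]
    print(unknown_words(frequency_list, known_words=["的"]))
```

Because the input order is preserved, the remaining words stay sorted by frequency.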
If you know only the 2,500 most frequent words on this list, you will already know about 93 percent of all the words in the running text of this Bible. That will allow you to read it, but you will have to use a dictionary quite frequently, usually several times per page. If you know the 5,000 most frequent ones, your comprehension rate will be around 98 percent, which will allow you to read it while rarely or never using a dictionary.
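Coverage figures like these can be estimated from word counts. The sketch below is only an illustration of the arithmetic, with a hypothetical function name: it divides the occurrences of the top N words by the occurrences of all words. For a meaningful result, the counts must include every word in the text, not just the top 5,000.

```python
def coverage(counts, top_n):
    """Share of all word occurrences accounted for by the top_n most frequent words."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    return sum(ordered[:top_n]) / total


if __name__ == "__main__":
    # Toy counts for a four-word vocabulary, not real data.
    toy_counts = [50, 30, 15, 5]
    print(coverage(toy_counts, top_n=2))
```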
 

Attachments

  • 圣经新译本 5000 most frequently used words.txt
    51.2 KB · Views: 401
Thanks so much John for doing this and going the extra mile (pun intended). I was about to send Shun a link I’d found but you beat me to it! That frequency list will be so helpful.

Shun, would there still be a possibility of creating a flashcard category tag for this list?

Many thanks - I really didn’t have much hope this would produce much when I posted this initially so as we British say, I’m flabbergasted!
 

Shun

状元
Hi Christopher,

There is. Please find attached a list based on John's list, which you can open with Pleco and import directly. Just make sure you choose the following settings:

- Duplicate entries: Allow
- Missing entries: Create blank
- Ambiguous entries: Use first
- Fill in missing fields: Enabled
- Definition source: Prefer File
- Store in user dict: Disabled

After you've done that, you can go into Organize Flashcards, tap the "i" button to the right of the imported category (or tap-hold the category name on Android) and add a tag name "Chinese Bible" or similar. From then on, any word you encounter that is also in the Bible frequency list will have the tag.
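For other word lists, an import-ready file can be sketched along the following lines. This is not necessarily how the attached file was made. It rests on two assumptions that should be checked against Pleco's own documentation: that a line beginning with `//` names the category for the cards that follow, and that a card line may contain only the headword when "Fill in missing fields" is enabled. The function name and default category name are hypothetical.

```python
def to_pleco_import(frequency_lines, category="Chinese Bible"):
    """Build import-file text: a category header, then one headword per line."""
    lines = [f"//{category}"]  # assumed category header syntax
    seen = set()
    for line in frequency_lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]  # drop the frequency count, keep the word
        if word in seen:
            continue  # keep each headword once
        seen.add(word)
        lines.append(word)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    text = to_pleco_import(["的 100", "神 90"])
    with open("bible_words_for_pleco.txt", "w", encoding="utf-8") as handle:
        handle.write(text)
```

Writing the file as UTF-8 keeps the Chinese characters intact.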

I can safely say this: We strive to be a good Pleco community! Thanks for the PDF file.

Regards,

Shun
 

Attachments

  • 圣经新译本 5000 most frequently used words - ready for Pleco import.txt
    56.1 KB · Views: 461
Thank you so much Shun, this is just what I wanted. There is no doubt this is going to save me a lot of time and keep the motivation up.
 
Would it also be possible to make the same kind of list (flashcard category tag) for the 中文标准译本 too? This version is just the New Testament.

If it is possible, I can copy the text from a digital book I have into a Word document. Would that work? Many thanks for your time.
 

Shun

状元
Yes, you just have to segment it (because Chinese doesn't mark word boundaries), replace blanks with newlines and remove duplicates, counting the number of occurrences if you wish. There is no fully automated process for this; you have to perform or program a few steps manually.
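The counting part of those steps might look like this in Python. The sketch assumes the segmenter has already produced text with spaces between words, and it drops any token that contains no Chinese character (punctuation, digits and the like). The function name is hypothetical and the sample sentence is made up.

```python
from collections import Counter


def word_frequencies(segmented_text):
    """Count words in space-separated segmenter output, most frequent first."""
    words = segmented_text.split()
    # Keep only tokens containing at least one CJK ideograph (U+4E00 to U+9FFF).
    words = [w for w in words if any("\u4e00" <= ch <= "\u9fff" for ch in w)]
    return Counter(words).most_common()


if __name__ == "__main__":
    for word, count in word_frequencies("神 爱 世人 ， 神 是 爱"):
        print(f"{word}\t{count}")
```

Splitting on whitespace replaces the "blanks to newlines" step, and `Counter` handles both removing duplicates and counting occurrences.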

I could do that one for you. Do you mean the version of the Bible you provided the link to previously?
 

John.

秀才
Actually, there is a fully automated process to do it. You can use the tool "Chinese Text Analyser" (there's a free trial available) to easily get a list of all words used in any given TXT file, sorted by frequency. Just open the file (for example, the 中文标准译本 copied into a TXT file) with Chinese Text Analyser and choose "File > Export > to File > Word List". If you want all words used in a text, sorted by frequency, just choose Words: "All", Sort by: "Frequency (Descending)" and Rows: "All". Below that, you can choose which information to include in the list. Finally, click "Okay" and you can save your list.
You can also import all the words you already know into the program. After that, the program can tell you, for any file you open with it, what percentage of the words used in the file you already know. Keep in mind that these results are, for a number of reasons, not very accurate, but they can still be very helpful.
 

岩恩

秀才
Hi, everyone. I hope it's OK that I am bringing this older thread back up. It is perfect for what I am looking for (great posts all round, by the way), and I just have two additional requests.
First, I would like to ask for help, if possible, in doing the same thing Christopher asked about, but for the 新标点和合本.
Second, once the file is set up as was done above, is it possible to integrate it into Skritter as well?
I would very much love to learn how to write the words through their app too.

Thanks!
 

pdwalker

状元
Firstly, you'd have to find the text for 新标点和合本.

Secondly, wouldn't it be better to ask how to get information into Skritter in the Skritter forums? While I imagine there are some people here familiar with Skritter, you'd probably have better luck asking for that information in a more appropriate place.
 