Problem with Jieba for Python

sobriaebritas · Apr 14, 2022

Hello everybody,

This is my first time using Python, so I am a complete newbie. After installing Jieba for Python, I ran into a weird problem. So far, I have only managed to do what you can see in the first screenshot (Python-Jieba (1).jpg).

Encouraged by that, I tried to segment the same text by opening a text file (UTF-8, no BOM), but got this message (please see Python-Jieba (2).jpg):

  File "P:\Program Files (X64-X86)\Python\Python310\LuXun.py", line 4, in <module>
    with open('P:/Jieba/jieba-0.42.1/build/lib/jieba/LuXun.txt') as myfile: data = myfile.read()
UnicodeDecodeError: 'gbk' codec can't decode byte 0xac in position 22: illegal multibyte sequence

The "illegal multibyte sequence (byte 0xac)" seems to be the letter "i" at position 22 on line 4. But the character code of that "i" is the same as the one for the other occurrences of "i" on that line, namely 0069 (Unicode (hex), Simp. Chinese GB (hex)). You can see that in the third screenshot (Python-Jieba (3).jpg).

As I said at the beginning, I am a complete newbie in Python. Does anyone have any idea what I may be doing wrong?

Thank you for any comments and/or suggestions!

Shun · Apr 14, 2022

Hi sobriaebritas,

that's good to hear. I think that's a relatively easy one. Where you have

Python:

open('P:/Jieba/jieba-0.42.1/build/lib/jieba/LuXun.txt')

you should have

Python:

open('P:/Jieba/jieba-0.42.1/build/lib/jieba/LuXun.txt', encoding='utf-8-sig')

This is because Python on Windows does not default to UTF-8 when it opens a text file. It uses the system's locale encoding instead (GBK in your case, as the error message shows), so to be on the safe side you should specify the encoding when you open the stream. You could also just use 'utf-8', but 'utf-8-sig' reads UTF-8 text both with and without a signature (BOM).
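As an illustration, here is a minimal sketch of reading a file with an explicit encoding. The helper name read_text, the sample string, and the temporary file names are assumptions made up for this example; they are not from the original script. Note that the position reported in a UnicodeDecodeError is a byte offset into the data being decoded (here, LuXun.txt), while "line 4" in the traceback refers to the script, so byte 0xac is not a character on line 4 of LuXun.py.

```python
import os
import tempfile

# Sample text (assumption: any non-ASCII string shows the same effect).
text = "从去年起"


def read_text(path, encoding="utf-8-sig"):
    """Read a text file with an explicit encoding, not the platform default."""
    with open(path, encoding=encoding) as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.txt")  # UTF-8 without a BOM
    bom_path = os.path.join(tmp, "bom.txt")      # UTF-8 with a BOM

    with open(plain_path, "wb") as f:
        f.write(text.encode("utf-8"))
    with open(bom_path, "wb") as f:
        f.write(b"\xef\xbb\xbf" + text.encode("utf-8"))

    # 'utf-8-sig' returns the same text for both files.
    from_plain = read_text(plain_path)
    from_bom = read_text(bom_path)

    # Plain 'utf-8' keeps the BOM as a leading '\ufeff' character.
    with_utf8_only = read_text(bom_path, encoding="utf-8")

    # Decoding UTF-8 bytes as GBK either raises or yields garbled text,
    # depending on the bytes involved; it never gives back the original.
    try:
        as_gbk = read_text(plain_path, encoding="gbk")
    except UnicodeDecodeError:
        as_gbk = None

print(from_plain == from_bom == text)
```

Passing the encoding explicitly makes the script behave the same on Windows, Linux, and macOS, regardless of the system locale.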

If it still doesn't work, you could look up the other encodings supported by Python.

Hope this helps,

Shun

sobriaebritas · Apr 14, 2022

Hello, Shun,

Thank you for answering and for helping me solve this issue.
Unfortunately, another one has popped up: TypeError: write() argument must be str, not bytes

I changed the call to f.write(str(new_text.encode('utf-8'))):

seg_list = jieba.cut(data, cut_all=False)
new_text = ("".join(seg_list))
f = open('testtext22.txt', 'w')
f.write(str(new_text.encode('utf-8')))
f.close()

but all I get is a testtext22.txt file (in P:\Program Files (X64-X86)\Python\Python310) that looks like this: b'\xe4\xbb\x8e\xe5\x8e\xbb\xe5\xb9\xb4\xe8\xb5\xb7, ...

Anyway, as a newbie, I guess I just have to take it easy.

Thank you again, Shun.

Shun · Apr 14, 2022

Hello sobriaebritas,

you're welcome, glad it worked.

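When you join a list of small strings, what you should get is one longer string. I assume that jieba gives you ordinary Python strings (which are Unicode in Python 3), so you should be getting one long Unicode string in the variable new_text. Thus, you should be able to say simply f.write(new_text), without encode() and without str(). Calling str() on a bytes object just gives you its printed form, which is the b'\xe4...' text you are seeing. Also, it would be good to add the encoding="utf-8-sig" part to the f = open(...) call (when writing, this puts a signature at the start of the file; use "utf-8" if you don't want one).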
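A minimal sketch of the write step, under some assumptions: the segments list below is a hypothetical stand-in for what jieba.cut(data, cut_all=False) yields (a sequence of str tokens), so the example runs without jieba installed, and the output goes to a temporary directory rather than the Python install folder. A space is used as the separator because joining with an empty string would simply glue the words back together.

```python
import os
import tempfile

# Hypothetical stand-in for list(jieba.cut(data, cut_all=False)).
segments = ["从", "去年", "起"]

# Join with a visible separator so the word boundaries survive.
new_text = " ".join(segments)

# What went wrong before: str() of a bytes object is its repr, "b'...'".
wrong = str(new_text.encode("utf-8"))

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "testtext22.txt")

    # Write the str directly and let open() do the encoding.
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(new_text)

    # Read it back with the same encoding to check the round trip.
    with open(out_path, encoding="utf-8") as f:
        round_trip = f.read()

print(round_trip)
```

Using with open(...) also closes the file automatically, so the separate f.close() call is no longer needed.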

Taking it easy certainly is a good idea if it helps you to notice every detail. But often in programming, when you get a complicated error, I find that you have to look at just one aspect of your program, and after having solved that aspect, all the other very complicated-seeming things will resolve themselves.

No problem,

Shun
