ChatGPT Example Sentences User Dictionary

A while back, I posted a user dictionary of ChatGPT-generated example sentences. I've created a new version of the user dictionary with a greatly expanded list of words.

The new version features:
  • 18,000 words and hundreds of thousands of example sentences
  • Covers the entire HSK vocabulary list (including levels 7-9)
  • Includes examples for thousands of Chengyu
  • 20 example sentences for each word
To use it, download the dictionary import file from https://github.com/simon-crosby/pleco-generated-user-dictionary, create a new user dictionary in Pleco, and import the file.
 

Proe24

Member
Is there a way to get the example sentences formatted like other dictionaries, where the Chinese appears in blue with the pinyin under it and the English translation under that? I've attached a picture as an example; you can see how the top is formatted versus how the bottom is.
 

Attachments

  • Selected photo.jpeg

Shun

状元
Hi Proe24,

I think this will be possible once the final version of Pleco 4.0 comes out, along with a manual describing the text import data format.
@simon_crosby's file, like all other lists currently on the forums, is still in the traditional Pleco 3.2-compatible format, which only offered a few options, such as text styles and colors.

Cheers, Shun
 

cowabunga

秀才
I'm curious what native speakers or otherwise native-level Chinese learners feel about ChatGPT's accuracy in Chinese, especially regarding grammar. I saw one article on this topic when I was looking for something like Grammarly for Chinese. If I recall correctly, the native-speaker blogger who wrote the article concluded that none of the publicly available Chinese grammar correction tools were accurate enough to be useful. On a side note, even Grammarly for English often gives wrong suggestions, at least for more advanced sentence structures, or ones that are colloquial or perhaps uncommon in modern usage.

My limited understanding of AI and my anecdotal experience have given me almost no confidence in LLM accuracy when it comes to many things, including grammar. For example, if I make the LLM second-guess its answer (whether I know the correct answer/grammar or not), it seems to almost always agree with me.

Perhaps I need to improve my prompt construction skills, but how strictly do ChatGPT or other public, free LLMs actually stick to instructions (say, for example, to only use sentences found in native content online), rather than just trying to give me what they think I want to hear?
 

Shun

状元
Hi cowabunga,

I had similar concerns, but here's what I thought: There could be a difference between ML-powered sentence translation and ChatGPT-powered example sentence generation in two or more languages. When translating sentences written by humans, it's probably much harder for ChatGPT to properly grasp their meaning down to the pragmatic level, which would be necessary for a good translation. In the case of LLM generation of sentence pairs, ChatGPT can start from scratch: it generates a sentence at random using the rules it has learned, first in one language, then in the other. In this case, it knows exactly which meaning it wants to express in both languages.

ChatGPT of course isn't as creative in generating its meanings as a human, but for language learning, this may be desirable, as it keeps the sentence difficulty down.

Cheers,

Shun
 

Shun

状元
Hello @mikelove,

In the .23 beta, there may be a bug in the conversion code from the 3.2 dictionary text import format to the 4.0 format. The sentences in Simon's import file include the following characters:

[Screenshot: Import file.png]


But the .23 beta inserts a space character where the "Boldface End" character is:

[Screenshot: Superfluous space.PNG]



This wouldn't matter as much if the Pop-up definition automatically stretched across the space character, but it doesn't. So to get a definition for the full expression, one currently needs to move the end of the selection manually each time, without any certainty that it's going to find a definition.

Thus I suggest the following two fixes:
  1. The conversion code shouldn't add a space when it encounters a "Boldface End" character.
  2. The Pop-up definition code could ignore spaces, which of course don't exist between Hanzi characters anyway. It could still respect tabs, punctuation, or newlines.

Thanks a lot,

Shun
 

mikelove

皇帝
Staff member
Thanks, but both of these fixes pose problems.

Adding a space after the end of bold text is necessary to prevent weird behavior with Markdown ** spans in the middle of words - it doesn't matter in every case, but coming up with code to re-check whether a particular ** is going to behave like it's supposed to without a space would be complicated, and frankly there isn't enough content out there using our formatting flags to justify investing days and days of programming time in it; it's easier to just manually fix the small number of files floating around in which it's a problem.

And while the popup definition does already ignore spaces in some situations (e.g. PDF / OCR files, where they can occur randomly), it breaks up words on them in dictionary entries because a number of dictionaries - English-to-Chinese ones, for example - separate words with just whitespace. I suppose in theory we could make this an option that you could configure on a dictionary-by-dictionary basis, but it's getting awfully specialized.

Also, even in this specific dictionary, it seems like in most cases you would want the reader to terminate at the end of the bold span, since that ought to be a word break, with the bolded text being a standalone word; the bolded word in that example should really be 火车站, and it should be listed under that rather than 火车. In an ambiguous segmentation case - 研究所+有 versus 研究+所有, e.g. - if just 研究 is bolded and it's followed by a space, the reader ought to favor the second segmentation.

Anyway, the best solution here would probably be to export this dictionary in Markdown, remove all of those spaces after the **s, and reimport it. Or you could play around with doing it with a batch command, or simply find-and-replace the private use Pleco characters in the original dictionary with their Markdown equivalents.
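
For anyone who wants to try scripting that last option, a minimal sketch in Python might look like the following. It is only a sketch under stated assumptions: that the export/import file is plain UTF-8 text, that the old bold start/end flags are the private use code points U+EAB2 / U+EAB3 (an assumption - check the actual characters in your own file first), and that the filenames are placeholders.

Code:
import re

BOLD_ON = "\uEAB2"   # assumed Pleco "boldface start" flag - verify in your file
BOLD_OFF = "\uEAB3"  # assumed Pleco "boldface end" flag - verify in your file

def pua_to_markdown(text: str) -> str:
    # Replace the old private use bold flags with Markdown ** spans.
    return text.replace(BOLD_ON, "**").replace(BOLD_OFF, "**")

def drop_space_after_bold(text: str) -> str:
    # Remove a single space following a closing **, but only when a Hanzi
    # comes next, so ordinary English spacing in definitions is left alone.
    return re.sub(r"\*\* (?=[\u4e00-\u9fff])", "**", text)

with open("dictionary_export.txt", encoding="utf-8") as f:  # placeholder filename
    text = f.read()

text = drop_space_after_bold(pua_to_markdown(text))

with open("dictionary_fixed.txt", "w", encoding="utf-8") as f:  # placeholder filename
    f.write(text)

The Hanzi check only covers the basic CJK Unified Ideographs range, so widen it if your sentences use characters outside that block.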
 

Shun

状元
Thank you very much for your detailed answer.

I wasn't familiar with the requirements of the Markdown format, and I now seem to understand your rationale. I'm sorry to have pressed a point like this with my inadequate background knowledge. :)

So I'll gladly retract my suggestions, and of course, any repairs to dictionaries using the new Pleco 4.0 format should be easy for me to make.

I had thought from earlier posts (a long time ago) that Pleco 4.0's internal dictionary styling format used a subset of HTML. But maybe it still does - you will surely explain the details later, when the new Pleco actually comes out. Right now, we're just beta-testing.

Best,

Shun
 

mikelove

皇帝
Staff member
No worries and certainly no need to apologize - it's always better to ask.

Internally, we use an even more proprietary format in 4.0 than we did in 3.0 - though it can be converted to/from Markdown - but for user content authoring we currently only support Markdown.

We do, however, have an HTML parser too - we use it in 4.0 for HTML-based document reading (EPUB plus iOS' decoding of DOCX/RTF files) and for HTML-based flashcard/presentation templates - so in theory we could add an option at some point to author user dictionary entries in HTML too. But Markdown is a lot easier to work with, so we'd kind of like to explore the limits of what we can do with that - and determine what we might actually be gaining with HTML support - before we consider making that move.
 