German decompounding?

mikelove · Oct 19, 2022

Quick question for German speakers: would decompounding of long German nouns be a helpful full text search feature?

My sense is that when you’re just looking up a translation it maybe isn’t that important - if you’re looking for a particular compound you’d just enter that compound, and you don’t need “Hund” to match “Hundehütte” any more than an English speaker needs “dog” to match “doghouse”. But I understand that compounds in German can sometimes bring in a lot more content (adjectives, most importantly), so I’m wondering whether a dictionary is likely to contain compound words that you might want to find by searching for just one part.

Thanks!

Shun · Oct 19, 2022

Hello Mike,

to give you a comprehensive answer, I have read some more about decompounding here:

Multilingual search: Decompounding with language-specific lexicons (www.algolia.com)

It explains: "Since most ecommerce search is keyword based, once we’ve removed the stop words für (for) and meine (my), breaking the compound Hundehütte into Hunde + hütte should provide similar results as when querying Hütte für meinen Hund."

So decompounding automatically searches for both "Hund" and "Hütte" instead of "Hundehütte" when I enter that compound word.

I understand that decompounding will try to split only non-lexical compounds, i.e. compounds that were formed ad hoc by the person who is searching; so "Fahrkarte" (= ticket in public transport) would not be split because that compound is lexical and fixed, and the user would have no interest in getting search results for either "fahren" or "Karte".
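To make that concrete, here is a minimal sketch of lexicon-based splitting as I understand it. It is not the method from the Algolia article; the tiny word list, the set of linking elements and the function names are all assumptions made up for illustration. A compound that is itself in the lexicon is treated as lexical and left alone.

```python
from typing import List, Optional, Set

# Hypothetical toy lexicon; a real one would come from the dictionary data.
LEXICON: Set[str] = {
    "hund", "hütte", "fahrkarte", "karte",
    "kindheit", "erinnerungen", "nacht", "lektüre",
}

# Linking elements (Fugenelemente) that may sit between two parts.
# The empty string means "no linking element". This list is an assumption.
LINKERS = ("", "s", "e", "n", "es", "en")


def _split(word: str, lexicon: Set[str]) -> Optional[List[str]]:
    """Recursively split `word` into lexicon entries, or return None."""
    if word in lexicon:
        return [word]
    # Try the longest possible first part before shorter ones.
    for i in range(len(word) - 1, 0, -1):
        head, rest = word[:i], word[i:]
        if head not in lexicon:
            continue
        for linker in LINKERS:
            if not rest.startswith(linker):
                continue
            tail = rest[len(linker):]
            if not tail:
                continue
            parts = _split(tail, lexicon)
            if parts is not None:
                return [head] + parts
    return None


def split_compound(word: str, lexicon: Set[str] = LEXICON) -> List[str]:
    """Return the parts of an ad-hoc compound, or the word itself if it is
    lexical (already in the lexicon) or cannot be split."""
    lowered = word.lower()
    parts = _split(lowered, lexicon)
    return parts if parts is not None else [lowered]


print(split_compound("Hundehütte"))  # ['hund', 'hütte']
print(split_compound("Fahrkarte"))   # ['fahrkarte'] (lexical, not split)
```

Whether a word counts as lexical depends entirely on what is in the lexicon, which is why the quality of the word list matters more than the splitting logic.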

Perhaps decompounding is most important for E-commerce search engines because in that case, you want to make sure to catch and display all eligible products that a user might want instead of just searching for the compound word. In Pleco, however, I suspect you'd be getting unwanted search results in too many cases—I'm not sure about that, though. It's hard to say without trying it out. A knowledgeable user would instinctively use separate words instead of long compounds, because they know that Pleco may have a hard time separating them. Almost all of the compound words I tried to come up with are lexical words—non-lexical compounds are quite rare in my everyday use. A few examples of lexical compounds:

Brieftasche
Taschentuch
Flaschenöffner
Kopfhörer
Bildschirmständer

Decompounding wouldn't do anything with these, which is correct. A few non-lexical, or at least less-lexical compound words:

Leseerfahrungen
Belüftungsschacht
Sanitärinstallation
Beatmungsapparat
Flaschenöffner-Herstellungsprozess / Flaschenöffnerherstellungsprozess (here, you'd be more likely to use the first version with a dash, because that makes clear where it combines two lexical compounds. So no more decompounding would be needed here, either.)
Kindheitserinnerungen
Nachtlektüre

I think the cases where decompounding produces more useful search results than a plain search are few and far between. I would definitely make it an opt-in feature if it were implemented.

What would seem more important and useful to me are Boolean searches, or simply multi-word full text searches where each of the separate words can occur anywhere in definition (right now, as we know, one can only search for an exact multi-word expression), so I could also enter the components of longer words if I wanted to. I remember you had said a long time ago that Boolean searches were on the To-do list. I'm quite confident that's coming.

On your remark that adjectives are common candidates for compounding, did you mean adjectives formed from nouns? So that if I search for "Wasser" (water), it will also find "wässrig" (watery)? In general, I don't see a lot of compounding happening with adjectives.

Hope this helps,

Shun

mikelove · Oct 19, 2022

Thanks for the detailed and thoughtful reply!

None of the words in your second group appear in any of our current German dictionaries, so that's a good sign.

Regarding adjectives, I was thinking of compound nouns with an adjective at the beginning - my understanding was that in some cases something like "Chinesisches Wörterbuch" might instead be a single Chinesischeswörterbuch (so while you'd probably be clear that you were looking for a Chinesisches Wörterbuch, you wouldn't necessarily type it in as one word and would nevertheless like it to match that one word). My information is probably incorrect on that though, and in any case if somebody did something odd like that in a dictionary we offered we could simply split it back into 2 words and not need to add a whole extra algorithm for that sort of case.

And yes, searching words in any order is indeed coming, though my current plan is to leave the default as exact phrase search since I think that's more likely to be useful in most cases.

Shun · Oct 19, 2022

You're very welcome!

I'd say adjective-noun compound words are somewhat frequent, like "Grossanlass" or "Kleinkind", but perhaps a bit less frequent than noun-noun compound words.

In conclusion, if you are able to include a good decompounding algorithm, that may indeed help in cases such as the ones from my second group. One would have to try it out.

mikelove said:
And yes, searching words in any order is indeed coming, though my current plan is to leave the default as exact phrase search since I think that's more likely to be useful in most cases.

That's good to hear. Searching for multiple words in any order will be useful especially when searching through individual example sentences, I believe, because that should give one a much greater chance of finding complex constructions with other words in between.
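To illustrate the difference I have in mind, here is a small sketch comparing exact phrase matching with matching all words in any order. It says nothing about how Pleco's engine works; the function names and the example sentence are made up for illustration.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Split text into case-folded word tokens, ignoring punctuation."""
    return re.findall(r"\w+", text.casefold())


def phrase_match(query: str, text: str) -> bool:
    """Exact phrase search: the query tokens must appear consecutively."""
    q, t = tokenize(query), tokenize(text)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


def any_order_match(query: str, text: str) -> bool:
    """Any-order search: every query token must occur somewhere in the text."""
    tokens = set(tokenize(text))
    return all(term in tokens for term in tokenize(query))


sentence = "Er hat den Hund gestern in die Hütte gebracht."
print(phrase_match("Hütte Hund", sentence))     # False
print(any_order_match("Hütte Hund", sentence))  # True
```

With the any-order variant, other words may stand between the search terms, which is what makes it useful for example sentences.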

mikelove · Oct 21, 2022

Shun said:
In conclusion, if you are able to include a good decompounding algorithm, that may indeed help in cases such as the ones from my second group. One would have to try it out.

Thanks. I think it's enough of a 'nice to have' that we can probably punt automated decompounding to a later 4.x update, though in the meantime we might look for some of the more common ones that actually occur in our various German dictionaries and manually add them to the inflection mapping table.

Shun · Oct 21, 2022

That makes sense. The simpler, the better.

n9 · Oct 22, 2022

I don't think it would be worth the effort. There are much more important issues: fixing the many, many cases of words where umlauts and ß are converted to their non-umlaut versions (ä->ae, ö->oe, ü->ue, ß->ss), and making sure that the search finds them either way (it's quicker to type "ae" on a phone).
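One way to get "finds them either way" is to fold both the query and the indexed text to the same key before comparing. The following is only a sketch of that idea, not a description of Pleco's engine; `fold_key` and `search` are hypothetical names.

```python
from typing import Iterable, List

# Map umlauts and ß to their common ASCII transcriptions.
_FOLD = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def fold_key(text: str) -> str:
    """Return a comparison key in which 'ö' and 'oe' (etc.) are identical."""
    return text.translate(_FOLD).casefold()


def search(query: str, entries: Iterable[str]) -> List[str]:
    """Return the entries whose folded form contains the folded query."""
    key = fold_key(query)
    return [entry for entry in entries if key in fold_key(entry)]


print(search("oeff", ["öffnen", "Ofen"]))    # ['öffnen']
print(search("Straße", ["Strasse", "Stra"]))  # ['Strasse']
```

Because only the comparison key is folded, the stored headwords can keep whichever spelling the source data uses.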

Shun · Oct 23, 2022

True, I didn't even realize this. The main (and perhaps only?) culprit seems to be HanDeDict:

If I enter "öffen" instead of "oeffen", the DHD search results won't come up.

The easiest way of resolving this may just be to replace "oe" -> "ö", "ae" -> "ä" and "ue" -> "ü" (plus the capitalized versions) in the DHD source data, without making any changes to the engine.

If you enter "ü", "ö" or "ä", right now it will be treated exactly like an "u", "o" or "a", and vice versa. (at least with my settings)

Cheers,

Shun

mikelove · Oct 23, 2022

That one we've already fixed, or at least we think we have - we no longer use the English collation algorithm for everything that isn't Chinese but instead use dedicated ones for each language, along with giving each language its own separate search tab.

danm · Jan 26, 2023

Relating to the umlaut/ß issues, in DeHanDict the ß seems to be erroneously broken down into sz instead of ss, making it impossible to find a word like Straße in the current version without knowing you have to search for Strasze. Maybe this is something that can be fixed before Pleco 4?

mikelove · Jan 26, 2023

That actually seems to be the case in the original data too, unfortunately, or at least the only copy of it we have. (it's no longer online as far as I can tell)

If you can find the original text somewhere it should be easy enough to import into Pleco as a user dictionary, but the only way we could fix it on our end would be to manually go through entry by entry and figure out which sz's ought to be ss'es, which would be rather labor-intensive.

danm · Jan 26, 2023

If it helps, sz is very rare in German; it pretty much only occurs at the seam of a compound. The notable exceptions I can think of are Szene and Szenario (ß will never be at the start of a word, anyway). A conservative correction that should catch a lot of false entries with minimal false positives (none that I can think of) would be something like "replace all sz at the end of a word or with <=2 letters following".
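As a rough sketch of that rule (the function name is made up, and this only handles lowercase "sz"), a single regular expression with a lookahead should do it:

```python
import re

# "sz" followed by at most two word characters and then a word boundary,
# i.e. "sz" at the end of a word or with <= 2 letters following.
_SZ_NEAR_END = re.compile(r"sz(?=\w{0,2}\b)")


def restore_eszett(text: str) -> str:
    """Replace word-final (or nearly word-final) 'sz' with 'ß'."""
    return _SZ_NEAR_END.sub("ß", text)


print(restore_eszett("Strasze"))  # Straße
print(restore_eszett("Fusz"))     # Fuß
print(restore_eszett("Szene"))    # Szene (unchanged)
print(restore_eszett("Auszeit"))  # Auszeit (unchanged, three letters follow)
```

Trying it on a few more words, a short compound like Auszug (aus + Zug) would also be rewritten, so the output probably still needs a quick review pass. Longer words such as Fuszball are left untouched by design.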

But yeah, don't sweat it, DHD/HDD aren't the most reliable dictionaries anyway.
