Adding support for inflection between different grammatical genders #105

BHK4321 · 2025-04-09T08:29:56Z

Solving issue #98
Added the filter_ files for specific lexemes to be merged.
Updated ParseWikidata.java for merging the lexemes with differing genders for a combined inflection space

BHK4321 · 2025-04-09T09:31:02Z

@nciric
@grhoten
The tests aren't giving the expected results. Please suggest some changes.

grhoten

This is good progress. Thanks.

Please review the comments for additional changes to consider.

grhoten · 2025-04-09T17:17:40Z

inflection/tools/dictionary-parser/src/main/resources/org/unicode/wikidata/filter_de.properties

+L484250=L252570
+L44834=L494386
+L860063=L931664
+L2272=L295129


This looks like the right kind of data.

It looks like there are DOS newlines here. Can you please commit these files as plain text to match the newlines of the other properties files?

grhoten · 2025-04-09T17:21:35Z

inflection/tools/dictionary-parser/src/main/java/org/unicode/wikidata/ParseWikidata.java

    private final TreeSet<String> rareLemmas = new TreeSet<>();
    private final TreeSet<String> omitLemmas = new TreeSet<>();
+    private final Map<String, List<String>> mergeMap = new HashMap<>();
+    private final TreeSet<String> differ = new TreeSet<>();


I think you meant defer or deferred. The word differ implies that you're computing the difference.

grhoten · 2025-04-09T17:30:23Z

inflection/tools/dictionary-parser/src/main/java/org/unicode/wikidata/ParseWikidata.java

+                        if (mergeMap.containsKey(key)) {
+                            mergeMap.get(key).add(value);
+                        } else {
+                            List<String> list = new ArrayList<>();
+                            list.add(value);
+                            mergeMap.put(key, list);
+                        }


You might want to consider using computeIfAbsent. You might be able to replace this whole if statement with this line. I think that it's faster too because it only looks up the key once.

mergeMap.computeIfAbsent(key, v -> new ArrayList<>()).add(value);
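For context, here is a minimal, self-contained sketch of that one-liner in use. The class name MergeMapSketch, the helper addMerge, and the ids L1, L2 and L3 are made up for illustration; they are not part of ParseWikidata.java.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: shows computeIfAbsent replacing the containsKey/get/put pattern.
final class MergeMapSketch {
    /** Appends value to the list stored under key, creating the list on first use. */
    static void addMerge(Map<String, List<String>> mergeMap, String key, String value) {
        mergeMap.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    public static void main(String[] args) {
        Map<String, List<String>> mergeMap = new HashMap<>();
        addMerge(mergeMap, "L1", "L2"); // first use creates the list
        addMerge(mergeMap, "L1", "L3"); // second use reuses it
        System.out.println(mergeMap);   // prints {L1=[L2, L3]}
    }
}
```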

grhoten · 2025-04-09T17:36:13Z

inflection/tools/dictionary-parser/src/main/java/org/unicode/wikidata/ParseWikidata.java

 */
+
 public final class ParseWikidata {
+    static final class Pair<K, V> {


Can you replace this new class with the existing AbstractMap.SimpleEntry? I like code reuse.
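As a rough illustration of what that reuse could look like, the sketch below stores a value together with a line number in an AbstractMap.SimpleEntry. It uses String in place of the real Lexeme type, and the ids, lemmas and line numbers are invented for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: SimpleEntry used as a ready-made pair type.
final class SimpleEntrySketch {
    /** Builds a map from a lexeme id to a (lemma, line number) pair. */
    static Map<String, SimpleEntry<String, Integer>> index() {
        Map<String, SimpleEntry<String, Integer>> lexemeMap = new TreeMap<>();
        lexemeMap.put("L1", new SimpleEntry<>("Autor", 10));   // hypothetical data
        lexemeMap.put("L2", new SimpleEntry<>("Autorin", 42)); // hypothetical data
        return lexemeMap;
    }

    public static void main(String[] args) {
        SimpleEntry<String, Integer> pair = index().get("L2");
        // getKey() and getValue() take the place of a custom Pair's accessors.
        System.out.println(pair.getKey() + " @ line " + pair.getValue());
    }
}
```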

grhoten · 2025-04-09T17:40:41Z

inflection/tools/dictionary-parser/src/main/java/org/unicode/wikidata/ParseWikidata.java

+            if (value.isEmpty())
+                return;


Please use curly braces around if statements. It helps with readability, and it protects against misunderstandings if the indentation is misaligned.

grhoten · 2025-04-09T17:54:51Z

inflection/tools/dictionary-parser/src/main/java/org/unicode/wikidata/ParseWikidata.java

+            String key = entry.getKey();
+            List<String> value = entry.getValue();
+            value.add(0, key);
+            if (value.isEmpty())
+                return;
+            Lexeme mergedLexeme = lexemeMap.get(value.get(0)).getKey();
+            for (int i = 1; i < value.size(); i++) {
+                mergedLexeme = mergeLexemes(mergedLexeme, lexemeMap.get(value.get(i)).getKey());
+            }


I'm unsure why you're inserting the key into the list of values, and then you're skipping over it.

Would this work instead?

Lexeme mergedLexeme = lexemeMap.computeIfAbsent(entry.getKey(), key -> {
    throw new IllegalArgumentException(key + ": id not found");
}).getKey();
for (var value : entry.getValue()) {
    mergeLexemes(mergedLexeme, lexemeMap.computeIfAbsent(value, key -> {
        throw new IllegalArgumentException(key + ": id not found");
    }).getKey());
}
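If the throwing lambda ends up repeated, one possible alternative is a small lookup helper. This is only a sketch: the class LookupSketch and the method require are hypothetical names, and it uses a plain get with a null check rather than computeIfAbsent.

```java
import java.util.Map;

// Illustrative only: a fail-fast lookup that never mutates the map.
final class LookupSketch {
    /** Returns the value stored for id, or throws if the id is unknown. */
    static <V> V require(Map<String, V> map, String id) {
        V value = map.get(id);
        if (value == null) {
            throw new IllegalArgumentException(id + ": id not found");
        }
        return value;
    }
}
```

Each call site would then read along the lines of require(lexemeMap, value).getKey().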

grhoten · 2025-04-09T17:57:50Z

inflection/tools/dictionary-parser/src/main/java/org/unicode/wikidata/ParseWikidata.java

+    private Lexeme mergeLexemes(Lexeme lexeme1, Lexeme lexeme2) {
+        // Combine forms
+
+        lexeme1.forms.addAll(lexeme2.forms);
+
+        for (Map.Entry<String, List<String>> entry : lexeme2.claims.entrySet()) {
+            lexeme1.claims.merge(entry.getKey(), entry.getValue(), (v1, v2) -> {
+                v1.addAll(v2);
+                return v1;
+            });
+        }
+        return lexeme1;
+    }


This is almost right. Some or all of the claims have to be moved to each form. The part of speech doesn't have to move, but the gender, animacy and other grammatical properties must be moved from the lexeme level to each form. The gender is typically added at the lexeme level instead of each form.

After that operation occurs on lexeme2, then you can append the lexeme2 forms to lexeme1.

You might want to create another function to move those claims to each form, including lexeme1 before it's called with mergeLexemes.
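To make the idea concrete, here is a rough sketch under simplified assumptions. The Lexeme and LexemeForm classes below are stand-ins with only the fields needed for the example, not the real parser types, and the "gender" claim key is a placeholder. For brevity the sketch moves the claims of both lexemes inside mergeLexemes; moving lexeme1's claims before the call works just as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: move lexeme-level claims onto each form, then merge the forms.
final class MergeSketch {
    /** Hypothetical stand-in for the parser's form type. */
    static final class LexemeForm {
        final String text;
        final Map<String, List<String>> claims = new TreeMap<>();
        LexemeForm(String text) { this.text = text; }
    }

    /** Hypothetical stand-in for the parser's lexeme type. */
    static final class Lexeme {
        final List<LexemeForm> forms = new ArrayList<>();
        final Map<String, List<String>> claims = new TreeMap<>();
    }

    /** Copies every lexeme-level claim onto each form, then clears the lexeme-level claims. */
    static void moveLexemeClaimsToForms(Lexeme lexeme) {
        for (LexemeForm form : lexeme.forms) {
            for (Map.Entry<String, List<String>> entry : lexeme.claims.entrySet()) {
                form.claims.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                        .addAll(entry.getValue());
            }
        }
        lexeme.claims.clear();
    }

    /** Appends the forms of lexeme2 to lexeme1 once the claims live on the forms. */
    static Lexeme mergeLexemes(Lexeme lexeme1, Lexeme lexeme2) {
        moveLexemeClaimsToForms(lexeme1); // no-op if already moved and cleared
        moveLexemeClaimsToForms(lexeme2);
        lexeme1.forms.addAll(lexeme2.forms);
        return lexeme1;
    }
}
```

After the merge, each form keeps the gender of the lexeme it came from, so a masculine and a feminine form can sit side by side in one lexeme.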

grhoten · 2025-04-09T18:05:41Z

inflection/tools/dictionary-parser/src/main/java/org/unicode/wikidata/ParseWikidata.java

+            for (int i = 1; i < value.size(); i++) {
+                mergedLexeme = mergeLexemes(mergedLexeme, lexemeMap.get(value.get(i)).getKey());
+            }
+            analyzeLexeme(lexemeMap.get(mergedLexeme.id).getValue(), mergedLexeme);


Can you get the pair once so that you don't have to call lexemeMap.get twice?

grhoten · 2025-04-09T18:08:07Z

inflection/tools/dictionary-parser/src/main/java/org/unicode/wikidata/ParseWikidata.java

+                        rareLemmas.add(key);
+                    } else if ("omit".equals(value)) {
+                        omitLemmas.add(key);
+                    } else if (value.startsWith("L")) {


How about value.matches("L[0-9]+")? This ensures that the L is followed by numbers.
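A small sketch of the difference, with a precompiled pattern as an optional refinement. The class name LexemeIdSketch and the method isLexemeId are made up for the example.

```java
import java.util.regex.Pattern;

// Illustrative only: a full-match check for lexeme ids such as L2272.
final class LexemeIdSketch {
    // Compiled once; String.matches would recompile the pattern on every call.
    private static final Pattern LEXEME_ID = Pattern.compile("L[0-9]+");

    /** Returns true only when the whole value is an L followed by one or more digits. */
    static boolean isLexemeId(String value) {
        return LEXEME_ID.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLexemeId("L2272")); // true
        System.out.println(isLexemeId("Lorem")); // false, although startsWith("L") is true
        System.out.println(isLexemeId("L"));     // false, no digits
    }
}
```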

grhoten · 2025-04-09T18:09:13Z

inflection/tools/dictionary-parser/src/main/java/org/unicode/wikidata/ParseWikidata.java

 import java.util.TreeMap;
+import java.util.HashMap;
 import java.util.TreeSet;
+// import javafx.util.Pair;


Please don't leave in commented out code. It won't add value.

BHK4321 · 2025-04-09T21:03:01Z

Hello @grhoten,
Thanks for reviewing.
I made the changes as requested.
Initial Test Results:

Results after changes:

I am still quite unsure about the moveLexemeClaimsToForms function.
Please suggest if any more changes are required.

grhoten

Looking better.

grhoten · 2025-04-10T05:47:01Z

inflection/tools/dictionary-parser/src/main/java/org/unicode/wikidata/ParseWikidata.java

+         for(LexemeForm form : lexeme.forms){
+            for (Map.Entry<String, List<String>> entry : lexeme.claims.entrySet()) {


Please format these for loops the same. I prefer the spaces of the second for loop.

Since you're moving the claims, I'd expect that claims on the lexeme would be empty before the exit of this function. If you don't do that, then the analyzeLexeme will mix up these properties on each form when that's not desirable.

grhoten · 2025-04-10T05:47:49Z

inflection/tools/dictionary-parser/src/main/java/org/unicode/wikidata/ParseWikidata.java

+         for(LexemeForm form : lexeme.forms){
+            for (Map.Entry<String, List<String>> entry : lexeme.claims.entrySet()) {
+                String key = entry.getKey();
+                if (!key.equals("PartOfSpeech")) {


I think lexeme.lexicalCategory has this important detail. You should be able to skip this check. Sorry for the confusion.

grhoten · 2025-04-10T05:53:16Z

inflection/tools/dictionary-parser/src/main/java/org/unicode/wikidata/ParseWikidata.java

+                    if (value.matches("L[0-9]+")) {
+                        mergeMap.computeIfAbsent(key, v -> new ArrayList<>()).add(value);
+                        defferedLexemes.add(key);
+                        defferedLexemes.add(value);


This looks a lot better. Can you also make the values be split by commas?

Oh, and I think you meant deferredLexemes instead of defferedLexemes.

Hi @grhoten,

This looks a lot better. Can you also make the values be split by commas?

Are you talking about the values in the filter_.properties file, or should I change the mergeMap to be Map<String, String>
instead of Map<String, List<String>> and then store the lexemes as, for example, "L123,L1234,L12345"?

I apologize for the spelling mistake. It is fixed now.

The mergeMap is the right type. Instead of performing a single add, you can use addAll on a list.

var values = Arrays.asList(value.split(","));
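Put together, the comma-separated case might look like the sketch below. The class MergeListSketch and the method addMerges are hypothetical names, and trimming whitespace around each id is an assumption that the review did not ask for.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: accept "L2,L3" as a value and add every id to the merge list.
final class MergeListSketch {
    /** Splits value on commas and appends each trimmed id to the list stored under key. */
    static void addMerges(Map<String, List<String>> mergeMap, String key, String value) {
        List<String> ids = new ArrayList<>();
        for (String id : value.split(",")) {
            ids.add(id.trim()); // assumption: tolerate "L2, L3" as well as "L2,L3"
        }
        mergeMap.computeIfAbsent(key, k -> new ArrayList<>()).addAll(ids);
    }

    public static void main(String[] args) {
        Map<String, List<String>> mergeMap = new HashMap<>();
        addMerges(mergeMap, "L1", "L2, L3");
        System.out.println(mergeMap); // prints {L1=[L2, L3]}
    }
}
```

One thing to keep in mind: a whole-value check such as value.matches("L[0-9]+") no longer matches a string like "L2,L3", so any id validation would need to run per element.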

grhoten · 2025-04-10T05:54:14Z

inflection/tools/dictionary-parser/src/main/java/org/unicode/wikidata/ParseWikidata.java

+        for (Map.Entry<String, List<String>> entry : mergeMap.entrySet()) {
+            SimpleEntry<Lexeme, Integer> pair = lexemeMap.computeIfAbsent(entry.getKey(), key -> {
+                throw new IllegalArgumentException(key + ": id not found");
+            });
+            Lexeme mergedLexeme = pair.getKey();
+            int lineNumber = pair.getValue();
+            for (var value : entry.getValue()) {
+                mergeLexemes(mergedLexeme, lexemeMap.computeIfAbsent(value, key -> {
+                    throw new IllegalArgumentException(key + ": id not found");
+                }).getKey());
+            }
+            analyzeLexeme(lineNumber, mergedLexeme);
+        }


This is much clearer. Thanks.

grhoten · 2025-04-10T06:00:48Z

inflection/tools/dictionary-parser/src/main/java/org/unicode/wikidata/ParseWikidata.java

+    private Lexeme mergeLexemes(Lexeme lexeme1, Lexeme lexeme2) {
+        moveLexemeClaimsToForms(lexeme2);
+        // Combine forms
+        lexeme1.forms.addAll(lexeme2.forms);
+        return lexeme1;
+    }


Looks good.

grhoten · 2025-04-10T15:24:14Z

inflection/tools/dictionary-parser/src/main/java/org/unicode/wikidata/ParseWikidata.java

+                form.claims.computeIfAbsent(key, k -> new ArrayList<>()).addAll(entry.getValue());
+            }
+         }
+         if (lexeme.claims != null) {


If lexeme.claims can be null, then the for loop must be included in this if statement. I don't think it's ever null though. Running this tool with a recent version of Wikidata will confirm that.

I checked for that edge case, and the tool also runs without the 'if'.
I ran it without the if condition, clearing the claims directly without any check.

BHK4321 · 2025-04-10T16:07:09Z

@grhoten
If any further changes are needed, please let me know.
Thanks for your time.

Rereviewing

grhoten · 2025-04-10T17:09:49Z

inflection/tools/dictionary-parser/src/main/resources/org/unicode/wikidata/filter_de.properties

+#organisatorin = organisator
+#Eigentümerin = Eigentümer
+#Autorin = Autor
+#Teilnehmerin = Teilnehmer
+#Freundin = Freund
+#Ehefrau = Ehemann
+#Benutzerin =Benutzer
+#Organspenderin = Organspender
+#Besucherin = Besucher


I think you have the order reversed. What does the number of failures for German look like now? If it's far fewer failures than the number in the language list, then that's a very compelling reason to merge these changes.

Optionally, please add filter_fr.properties too with the necessary changes to reduce the French test failures. If it's too hard, it can be added in a separate pull request. Please tell me which way you decide before I approve these changes.

I think we're almost ready to merge.

Sure, @grhoten.
Here are the final results.
There were 91 + 41 + 7 = 139 failing test cases in total for de + fr + it:

After parsing the dictionary for all three languages together, I found 63 failing tests in total.

For Italian,
the failing test now passes.

For German,
the remaining failures are the tests that use pronouns with different genders. The human-noun tests all pass.

For French,
I am getting this result:

Overall, many of the failing tests look like this:

So the final report for this pull request is that 139 - 63 = 76 previously failing tests now pass.
Thank you.

Oh yeah, French requires some code changes to get that to work. We can add French later.

grhoten · 2025-04-10T22:06:17Z

These changes are looking good. The other failures seem to be around git LFS. I'm not sure why those pipelines are failing.

BHK4321 added 2 commits April 9, 2025 13:55

Update ParseWikidata.java

4f60493

Add files via upload

997ed77

BHK4321 marked this pull request as draft April 9, 2025 09:10

BHK4321 marked this pull request as ready for review April 9, 2025 09:10

BHK4321 marked this pull request as draft April 9, 2025 09:10

BHK4321 changed the title ~~Work in Progress issue #98~~ Adding support for inflection between different grammatical genders Apr 9, 2025

BHK4321 marked this pull request as ready for review April 9, 2025 09:31

Update filter_de.properties

e93c26e

grhoten requested changes Apr 9, 2025

BHK4321 added 5 commits April 10, 2025 00:14

Delete inflection/tools/dictionary-parser/src/main/resources/org/unic…

ceadffb

…ode/wikidata/filter_de.properties

Add files via upload

0c17121

Delete inflection/tools/dictionary-parser/src/main/resources/org/unic…

afc262c

…ode/wikidata/filter_it.properties

Add files via upload

46da771

Update ParseWikidata.java

f1346a2

grhoten previously requested changes Apr 10, 2025

Update ParseWikidata.java

600ebe4

grhoten reviewed Apr 10, 2025

BHK4321 added 2 commits April 10, 2025 20:56

Adding all values from a comma separated list.

885d82a

Removing unnecessary "if"

afb8da2

grhoten self-requested a review April 10, 2025 17:04

grhoten reviewed Apr 10, 2025

Update filter_de.properties

7c10548

grhoten merged commit 64cb2ca into unicode-org:main Apr 10, 2025
1 of 3 checks passed

grhoten mentioned this pull request Apr 11, 2025

Update dictionary-parser to merge lexemes of separate genders #98

Closed

BHK4321 mentioned this pull request Apr 12, 2025

Integrate Unicode Inflection into Unicode Message Format #87

Open

BHK4321 deleted the patch-1 branch April 13, 2025 12:24

BHK4321 restored the patch-1 branch April 13, 2025 12:29

BHK4321 deleted the patch-1 branch April 13, 2025 12:29
