Skip to content

Conversation

@BHK4321
Copy link
Contributor

@BHK4321 BHK4321 commented Apr 9, 2025

Solving issue #98
Added the filter_ files for specific lexemes to be merged.
Updated ParseWikidata.java for merging the lexemes with differing genders for a combined inflection space

@BHK4321 BHK4321 marked this pull request as draft April 9, 2025 09:10
@BHK4321 BHK4321 marked this pull request as ready for review April 9, 2025 09:10
@BHK4321 BHK4321 marked this pull request as draft April 9, 2025 09:10
@BHK4321 BHK4321 changed the title Work in Progress issue #98 Adding support for inflection between different grammatical genders Apr 9, 2025
@BHK4321
Copy link
Contributor Author

BHK4321 commented Apr 9, 2025

@nciric
@grhoten
The tests aren't giving results as expected please suggest me some changes.

@BHK4321 BHK4321 marked this pull request as ready for review April 9, 2025 09:31
Copy link
Member

@grhoten grhoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good progress. Thanks.

Please review the comments for additional changes to consider.

L484250=L252570
L44834=L494386
L860063=L931664
L2272=L295129
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like the right kind of data.

It looks like there are DOS newlines here. Can you please commit these files as plain text to match the newlines of the other properties files?

private final TreeSet<String> rareLemmas = new TreeSet<>();
private final TreeSet<String> omitLemmas = new TreeSet<>();
private final Map<String, List<String>> mergeMap = new HashMap<>();
private final TreeSet<String> differ = new TreeSet<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you meant defer or deferred. The word differ implies that you're computing the difference.

Comment on lines 126 to 132
if (mergeMap.containsKey(key)) {
mergeMap.get(key).add(value);
} else {
List<String> list = new ArrayList<>();
list.add(value);
mergeMap.put(key, list);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to consider using computeIfAbsent. You might be able to replace this whole if statement with this line. I think that it's faster too because it only looks up the the key once.

mergeMap.computeIfAbsent(key, v -> new ArrayList<>()).add(value);

*/

public final class ParseWikidata {
static final class Pair<K, V> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you replace this new class with the existing AbstractMap.SimpleEntry? I like code reuse.

Comment on lines 288 to 289
if (value.isEmpty())
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use curly braces around if statements. It helps with readability, and it protects against misunderstandings if the indentation is misaligned.

Comment on lines 285 to 293
String key = entry.getKey();
List<String> value = entry.getValue();
value.add(0, key);
if (value.isEmpty())
return;
Lexeme mergedLexeme = lexemeMap.get(value.get(0)).getKey();
for (int i = 1; i < value.size(); i++) {
mergedLexeme = mergeLexemes(mergedLexeme, lexemeMap.get(value.get(i)).getKey());
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure why you're inserting the key into the list of values, and then you're skipping over it.

Would this work instead?

Lexeme mergedLexeme = lexemeMap.computeIfAbsent(entry.getKey(), key -> {
                throw new IllegalArgumentException(key + ": id not found");
            }).getKey();
for (var value : entry.getValue()) {
    mergeLexemes(mergedLexeme, lexemeMap.computeIfAbsent(value, key -> {
                throw new IllegalArgumentException(key + ": id not found");
            }).getKey());
}

Comment on lines 268 to 280
private Lexeme mergeLexemes(Lexeme lexeme1, Lexeme lexeme2) {
// Combine forms

lexeme1.forms.addAll(lexeme2.forms);

for (Map.Entry<String, List<String>> entry : lexeme2.claims.entrySet()) {
lexeme1.claims.merge(entry.getKey(), entry.getValue(), (v1, v2) -> {
v1.addAll(v2);
return v1;
});
}
return lexeme1;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is almost right. Some or all of the claims have to be moved to each form. The part of speech doesn't have to move, but the gender, animacy and other grammatical properties must be moved from the lexeme level to each form. The gender is typically added at the lexeme level instead of each form.

After that operation occurs on lexeme2, then you can append the lexeme2 forms to lexeme1.

You might want to create another function to move those claims to each form, including lexeme1 before it's called with mergeLexemes.

for (int i = 1; i < value.size(); i++) {
mergedLexeme = mergeLexemes(mergedLexeme, lexemeMap.get(value.get(i)).getKey());
}
analyzeLexeme(lexemeMap.get(mergedLexeme.id).getValue(), mergedLexeme);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you get the pair once so that you don't have to call lexemeMap.get twice?

rareLemmas.add(key);
} else if ("omit".equals(value)) {
omitLemmas.add(key);
} else if (value.startsWith("L")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about value.matches("L[0-9]+")? This ensures that the L is followed by numbers.

import java.util.TreeMap;
import java.util.HashMap;
import java.util.TreeSet;
// import javafx.util.Pair;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't leave in commented out code. It won't add value.

@BHK4321
Copy link
Contributor Author

BHK4321 commented Apr 9, 2025

Hello @grhoten,
Thanks for reviewing,
I made the changes as requested,
Initial Test Results:

Screenshot 2025-04-10 022537

Results after changes:

Screenshot 2025-04-10 022552

I am still quite unsure about the moveLexemeClaimsToForms function.
Please suggest if any more changes are required.

Copy link
Member

@grhoten grhoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking better.

Comment on lines 249 to 250
for(LexemeForm form : lexeme.forms){
for (Map.Entry<String, List<String>> entry : lexeme.claims.entrySet()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please format these for loops the same. I prefer the spaces of the second for loop.

Since you're moving the claims, I'd expect that claims on the lexeme would be empty before the exit of this function. If you don't do that, then the analyzeLexeme will mix up these properties on each form when that's not desirable.

for(LexemeForm form : lexeme.forms){
for (Map.Entry<String, List<String>> entry : lexeme.claims.entrySet()) {
String key = entry.getKey();
if (!key.equals("PartOfSpeech")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think lexeme.lexicalCategory has this important detail. You should be able to skip this check. Sorry for the confusion.

Comment on lines 102 to 105
if (value.matches("L[0-9]+")) {
mergeMap.computeIfAbsent(key, v -> new ArrayList<>()).add(value);
defferedLexemes.add(key);
defferedLexemes.add(value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a lot better. Can you also make the values be split by commas?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and I think you meant deferredLexemes instead of defferedLexemes.

Copy link
Contributor Author

@BHK4321 BHK4321 Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @grhoten,

This looks a lot better. Can you also make the values be split by commas?

Are you talking about the values in the fillter_.properties file or i should change the mergeMap to be Map<String , String>
instead of Map<String , List<String>> and then store the lexemes as for example: "L123,L1234,L12345".

Copy link
Contributor Author

@BHK4321 BHK4321 Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apologize for the spelling mistake. Fixed it now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mergeMap is the right type. Instead of performing a single add, you can use addAll on a list.

var values = Arrays.asList(value.split(","));

Comment on lines +269 to +281
for (Map.Entry<String, List<String>> entry : mergeMap.entrySet()) {
SimpleEntry<Lexeme, Integer> pair = lexemeMap.computeIfAbsent(entry.getKey(), key -> {
throw new IllegalArgumentException(key + ": id not found");
});
Lexeme mergedLexeme = pair.getKey();
int lineNumber = pair.getValue();
for (var value : entry.getValue()) {
mergeLexemes(mergedLexeme, lexemeMap.computeIfAbsent(value, key -> {
throw new IllegalArgumentException(key + ": id not found");
}).getKey());
}
analyzeLexeme(lineNumber, mergedLexeme);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much clearer. Thanks.

Comment on lines 261 to 266
private Lexeme mergeLexemes(Lexeme lexeme1, Lexeme lexeme2) {
moveLexemeClaimsToForms(lexeme2);
// Combine forms
lexeme1.forms.addAll(lexeme2.forms);
return lexeme1;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

form.claims.computeIfAbsent(key, k -> new ArrayList<>()).addAll(entry.getValue());
}
}
if (lexeme.claims != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If lexeme.claims can be null, then the for loop must be included in this if statement. I don't think it's ever null though. Running this tool with a recent version of Wikidata will confirm that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just looked for the edge case , but the tool runs without the 'if' also.
I ran it without the if condition, i mean directly clearing the claims without any check.

Screenshot 2025-04-10 210503

@BHK4321
Copy link
Contributor Author

BHK4321 commented Apr 10, 2025

@grhoten
If there are any more changes possible please do suggest.
Thanks for giving time.

@grhoten grhoten self-requested a review April 10, 2025 17:04
@grhoten grhoten dismissed their stale review April 10, 2025 17:05

Rereviewing

Comment on lines 3 to 11
#organisatorin = organisator
#Eigentümerin = Eigentümer
#Autorin = Autor
#Teilnehmerin = Teilnehmer
#Freundin = Freund
#Ehefrau = Ehemann
#Benutzerin =Benutzer
#Organspenderin = Organspender
#Besucherin = Besucher
Copy link
Member

@grhoten grhoten Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you have the order reversed. What does the the number of failures for German look like now? If it's lot less failures than the number in the language list, then that's a very compelling reason to merge these changes.

Optionally, please add filter_fr.properties too with the necessary changes to reduce the French test failures. If it's too hard, it can be added in a separate pull request. Please tell me which way you decide before I approve these changes.

I think we're almost ready to merge.

Copy link
Contributor Author

@BHK4321 BHK4321 Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, @grhoten
Here are the final results,
There were total 91 + 41 + 7 = 139 failing test cases for de + fr + it:

Screenshot 2025-04-10 233052

and right here i parsed the dictionary for all the three languages together and i found total 63 failing tests,

Screenshot 2025-04-10 233406

  • For italian,
    The failing test passed.

  • For German,
    The test cases failing are the ones in which pronouns with different genders are used , rest the human-nouns tests are completely working fine.

  • For French,
    I am getting this result:

Screenshot 2025-04-10 233752

  • Overall, there are many failing test which look like this:

Screenshot 2025-04-10 233940

So for the final report for this Pull Request 139 - 63 = 76 tests passed.
Thankyou.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, French requires some code changes to get to that to work. We can add French later.

@grhoten
Copy link
Member

grhoten commented Apr 10, 2025

These changes are looking good. The other failures seem to be around git LFS. I'm not sure why those pipelines are failing.

@grhoten grhoten merged commit 64cb2ca into unicode-org:main Apr 10, 2025
1 of 3 checks passed
@BHK4321 BHK4321 deleted the patch-1 branch April 13, 2025 12:24
@BHK4321 BHK4321 restored the patch-1 branch April 13, 2025 12:29
@BHK4321 BHK4321 deleted the patch-1 branch April 13, 2025 12:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants