-
-
Notifications
You must be signed in to change notification settings - Fork 17
Adding support for inflection between different grammatical genders #105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
grhoten
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good progress. Thanks.
Please review the comments for additional changes to consider.
| L484250=L252570 | ||
| L44834=L494386 | ||
| L860063=L931664 | ||
| L2272=L295129 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like the right kind of data.
It looks like there are DOS newlines here. Can you please commit these files as plain text to match the newlines of the other properties files?
| private final TreeSet<String> rareLemmas = new TreeSet<>(); | ||
| private final TreeSet<String> omitLemmas = new TreeSet<>(); | ||
| private final Map<String, List<String>> mergeMap = new HashMap<>(); | ||
| private final TreeSet<String> differ = new TreeSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (mergeMap.containsKey(key)) { | ||
| mergeMap.get(key).add(value); | ||
| } else { | ||
| List<String> list = new ArrayList<>(); | ||
| list.add(value); | ||
| mergeMap.put(key, list); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to consider using computeIfAbsent. You might be able to replace this whole if statement with this line. I think that it's faster too because it only looks up the the key once.
mergeMap.computeIfAbsent(key, v -> new ArrayList<>()).add(value);
| */ | ||
|
|
||
| public final class ParseWikidata { | ||
| static final class Pair<K, V> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you replace this new class with the existing AbstractMap.SimpleEntry? I like code reuse.
| if (value.isEmpty()) | ||
| return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use curly braces around if statements. It helps with readability, and it protects against misunderstandings if the indentation is misaligned.
| String key = entry.getKey(); | ||
| List<String> value = entry.getValue(); | ||
| value.add(0, key); | ||
| if (value.isEmpty()) | ||
| return; | ||
| Lexeme mergedLexeme = lexemeMap.get(value.get(0)).getKey(); | ||
| for (int i = 1; i < value.size(); i++) { | ||
| mergedLexeme = mergeLexemes(mergedLexeme, lexemeMap.get(value.get(i)).getKey()); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unsure why you're inserting the key into the list of values, and then you're skipping over it.
Would this work instead?
Lexeme mergedLexeme = lexemeMap.computeIfAbsent(entry.getKey(), key -> {
throw new IllegalArgumentException(key + ": id not found");
}).getKey();
for (var value : entry.getValue()) {
mergeLexemes(mergedLexeme, lexemeMap.computeIfAbsent(value, key -> {
throw new IllegalArgumentException(key + ": id not found");
}).getKey());
}
| private Lexeme mergeLexemes(Lexeme lexeme1, Lexeme lexeme2) { | ||
| // Combine forms | ||
|
|
||
| lexeme1.forms.addAll(lexeme2.forms); | ||
|
|
||
| for (Map.Entry<String, List<String>> entry : lexeme2.claims.entrySet()) { | ||
| lexeme1.claims.merge(entry.getKey(), entry.getValue(), (v1, v2) -> { | ||
| v1.addAll(v2); | ||
| return v1; | ||
| }); | ||
| } | ||
| return lexeme1; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is almost right. Some or all of the claims have to be moved to each form. The part of speech doesn't have to move, but the gender, animacy and other grammatical properties must be moved from the lexeme level to each form. The gender is typically added at the lexeme level instead of each form.
After that operation occurs on lexeme2, then you can append the lexeme2 forms to lexeme1.
You might want to create another function to move those claims to each form, including lexeme1 before it's called with mergeLexemes.
| for (int i = 1; i < value.size(); i++) { | ||
| mergedLexeme = mergeLexemes(mergedLexeme, lexemeMap.get(value.get(i)).getKey()); | ||
| } | ||
| analyzeLexeme(lexemeMap.get(mergedLexeme.id).getValue(), mergedLexeme); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you get the pair once so that you don't have to call lexemeMap.get twice?
| rareLemmas.add(key); | ||
| } else if ("omit".equals(value)) { | ||
| omitLemmas.add(key); | ||
| } else if (value.startsWith("L")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about value.matches("L[0-9]+")? This ensures that the L is followed by numbers.
| import java.util.TreeMap; | ||
| import java.util.HashMap; | ||
| import java.util.TreeSet; | ||
| // import javafx.util.Pair; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't leave in commented out code. It won't add value.
…ode/wikidata/filter_de.properties
…ode/wikidata/filter_it.properties
|
Hello @grhoten, Results after changes: I am still quite unsure about the |
grhoten
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking better.
| for(LexemeForm form : lexeme.forms){ | ||
| for (Map.Entry<String, List<String>> entry : lexeme.claims.entrySet()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please format these for loops the same. I prefer the spaces of the second for loop.
Since you're moving the claims, I'd expect that claims on the lexeme would be empty before the exit of this function. If you don't do that, then the analyzeLexeme will mix up these properties on each form when that's not desirable.
| for(LexemeForm form : lexeme.forms){ | ||
| for (Map.Entry<String, List<String>> entry : lexeme.claims.entrySet()) { | ||
| String key = entry.getKey(); | ||
| if (!key.equals("PartOfSpeech")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think lexeme.lexicalCategory has this important detail. You should be able to skip this check. Sorry for the confusion.
| if (value.matches("L[0-9]+")) { | ||
| mergeMap.computeIfAbsent(key, v -> new ArrayList<>()).add(value); | ||
| defferedLexemes.add(key); | ||
| defferedLexemes.add(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks a lot better. Can you also make the values be split by commas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, and I think you meant deferredLexemes instead of defferedLexemes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @grhoten,
This looks a lot better. Can you also make the values be split by commas?
Are you talking about the values in the fillter_.properties file or i should change the mergeMap to be Map<String , String>
instead of Map<String , List<String>> and then store the lexemes as for example: "L123,L1234,L12345".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I apologize for the spelling mistake. Fixed it now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mergeMap is the right type. Instead of performing a single add, you can use addAll on a list.
var values = Arrays.asList(value.split(","));
| for (Map.Entry<String, List<String>> entry : mergeMap.entrySet()) { | ||
| SimpleEntry<Lexeme, Integer> pair = lexemeMap.computeIfAbsent(entry.getKey(), key -> { | ||
| throw new IllegalArgumentException(key + ": id not found"); | ||
| }); | ||
| Lexeme mergedLexeme = pair.getKey(); | ||
| int lineNumber = pair.getValue(); | ||
| for (var value : entry.getValue()) { | ||
| mergeLexemes(mergedLexeme, lexemeMap.computeIfAbsent(value, key -> { | ||
| throw new IllegalArgumentException(key + ": id not found"); | ||
| }).getKey()); | ||
| } | ||
| analyzeLexeme(lineNumber, mergedLexeme); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much clearer. Thanks.
| private Lexeme mergeLexemes(Lexeme lexeme1, Lexeme lexeme2) { | ||
| moveLexemeClaimsToForms(lexeme2); | ||
| // Combine forms | ||
| lexeme1.forms.addAll(lexeme2.forms); | ||
| return lexeme1; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
| form.claims.computeIfAbsent(key, k -> new ArrayList<>()).addAll(entry.getValue()); | ||
| } | ||
| } | ||
| if (lexeme.claims != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If lexeme.claims can be null, then the for loop must be included in this if statement. I don't think it's ever null though. Running this tool with a recent version of Wikidata will confirm that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
@grhoten |
| #organisatorin = organisator | ||
| #Eigentümerin = Eigentümer | ||
| #Autorin = Autor | ||
| #Teilnehmerin = Teilnehmer | ||
| #Freundin = Freund | ||
| #Ehefrau = Ehemann | ||
| #Benutzerin =Benutzer | ||
| #Organspenderin = Organspender | ||
| #Besucherin = Besucher |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you have the order reversed. What does the the number of failures for German look like now? If it's lot less failures than the number in the language list, then that's a very compelling reason to merge these changes.
Optionally, please add filter_fr.properties too with the necessary changes to reduce the French test failures. If it's too hard, it can be added in a separate pull request. Please tell me which way you decide before I approve these changes.
I think we're almost ready to merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, @grhoten
Here are the final results,
There were total 91 + 41 + 7 = 139 failing test cases for de + fr + it:
and right here i parsed the dictionary for all the three languages together and i found total 63 failing tests,
-
For italian,
The failing test passed. -
For German,
The test cases failing are the ones in which pronouns with different genders are used , rest the human-nouns tests are completely working fine. -
For French,
I am getting this result:
- Overall, there are many failing test which look like this:
So for the final report for this Pull Request 139 - 63 = 76 tests passed.
Thankyou.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, French requires some code changes to get to that to work. We can add French later.
|
These changes are looking good. The other failures seem to be around git LFS. I'm not sure why those pipelines are failing. |







Solving issue #98
Added the filter_ files for specific lexemes to be merged.
Updated ParseWikidata.java for merging the lexemes with differing genders for a combined inflection space