Coreferee causing duplication - Possible Source Code Issue #10979
Hello all, thank you for your assistance. I am struggling to identify the source of a problem. I am using Coreferee 1.2 with Python 3.9, spaCy 3.2.1, spacy-transformers 1.1.6 and the English large and trf models. I am using Coreferee for coreference resolution and have written a resolution step that captures not only the token but also its NER children. However, I have run into a problem: in rare cases Coreferee appears to pull in the children.tokens twice. For example, 'John Roger Wilhelm Smith' resolves to => 'John Roger John Roger Wilhelm Smith', or 'The United Kingdom of Great Britain' resolves to => 'The United The United Kingdom of Great Britain'. My code follows:
The code accomplishes the goal of capturing the cluster without using the outdated NeuralCoref, and its output can still be used later to form NER grams for an LDA topic model. However, this one bug shows up from time to time in the resolved text. I have checked all my other cleaning processes and they are not the problem, so I can only assume it is something built into Coreferee. Any help resolving this problem would be most appreciated. I would also welcome any thoughts on optimization. I'm pretty new to coding. Thank you,
Replies: 1 comment
Hi @Weiprecht, I'm glad you're finding Coreferee useful. While I wouldn't like to rule out the issue you describe being somehow caused by Coreferee, I consider it quite unlikely because Coreferee is providing references to the tokens heading the phrases you are extracting and you are building these phrases in your code. So although at first glance I couldn't find a problem in your code, that would be the first place I'd search for one.

There's a semi-standard way of fulfilling the requirement you have, and it's documented at the end of this section of the Coreferee Readme. Could you please try switching to this solution? If you still have problems, please report them here.

Best wishes, Richard
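
Since the original code is not reproduced in this thread, the following is only a sketch of one way phrase-building code could produce this kind of duplication, not a diagnosis of the actual bug. It assumes the phrase is assembled by concatenating the subtrees of several head tokens, two of which overlap. `Tok`, `naive_phrase` and `phrase_for` are hypothetical names, and `Tok` is a plain-Python stand-in for a spaCy token, so no spaCy or Coreferee calls are involved.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Tok:
    """Hypothetical stand-in for a parsed token: position, text, child tokens."""
    i: int
    text: str
    children: List["Tok"] = field(default_factory=list)


def subtree_indices(tok: Tok) -> Set[int]:
    """Return the positions of tok and all of its descendants."""
    seen = {tok.i}
    for child in tok.children:
        seen |= subtree_indices(child)
    return seen


def naive_phrase(heads: List[Tok], tokens: List[Tok]) -> str:
    """Concatenate each head's subtree in turn; overlapping subtrees repeat."""
    parts: List[str] = []
    for head in heads:
        parts.extend(tokens[i].text for i in sorted(subtree_indices(head)))
    return " ".join(parts)


def phrase_for(heads: List[Tok], tokens: List[Tok]) -> str:
    """Merge all subtree positions into one set first, so each token appears once."""
    indices: Set[int] = set()
    for head in heads:
        indices |= subtree_indices(head)
    return " ".join(tokens[i].text for i in sorted(indices))


# Assumed toy parse: "Roger" governs "John"; "Smith" governs "Roger" and "Wilhelm".
john = Tok(0, "John")
roger = Tok(1, "Roger", [john])
wilhelm = Tok(2, "Wilhelm")
smith = Tok(3, "Smith", [roger, wilhelm])
tokens = [john, roger, wilhelm, smith]

# Passing both "Roger" and "Smith" as heads makes their subtrees overlap.
print(naive_phrase([roger, smith], tokens))
print(phrase_for([roger, smith], tokens))
```

If the real code walks token children in a comparable way, collecting token positions in a set and sorting them before joining the text may be enough to rule out this class of problem.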