Spans with the same text but different labels share the same extension value #8901
Hello. I don't know if this is a bug or just something that I do not fully understand. We found this issue at work, but I have prepared a small toy example here that reproduces the behavior. I have a doc with a bunch of spans. Some spans can share the same text but have different labels (this comes from auto-generated BRAT annotation that accompanies the text).
At first I thought that span1 and span2 were different objects, since they have different ids and hashes and are not equal to each other.
The problem arises when I want to create a span extension that stores a list of objects connected to each span.
What surprised me was that span2, which shares the same text as span1, was also linked to the same list. Span3's "relations" is still None.
So, my question is: why do spans that share the same text, but are considered different Python objects, also share the same extension list? Is there a way to implement the extension so that span1 and span2 each hold their own list of "relations"?
Replies: 1 comment 2 replies
This was a bit of a surprise to me, but I confirmed it is happening and figured out why.
When you create a Span extension, the data is actually saved on the Doc, in Doc.user_data (a dict). The key is a tuple that includes the field name and the span start and end, but not the span object id or other info. So two spans with the same start and end will have the same data.
I guess this is a design decision, but it is surprising that distinct objects share data, and we should probably highlight this more in the docs. On the other hand, I'm having a hard time imagining concrete cases where it would make sense for spans over the same text to have different user data. Could you tell us some more about your use case?
Also, though it's a bit awkward, I think you can work around this in various ways, like setting a dictionary in the attribute where the key is the object id, or using a custom getter to wrap similar behavior.
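To make the mechanism in the reply concrete, here is a minimal pure-Python sketch. It is a toy model, not spaCy's actual code: ToyDoc, ToySpan, get_ext, set_ext, add_relation and relations_of are all hypothetical names invented for illustration, and the example text and labels are made up. The only thing taken from the reply is the idea that extension data lives in one dict on the doc, keyed by the field name plus the span start and end. The second half sketches the suggested dictionary workaround; it keys the inner dictionary by label rather than by object id, which assumes labels are what distinguish spans over the same offsets.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Doc: owns all extension data in one dict."""
    text: str
    user_data: Dict[Tuple, Any] = field(default_factory=dict)


@dataclass(eq=False)  # eq=False: instances compare and hash by identity
class ToySpan:
    """Hypothetical stand-in for a Span: a labelled view onto a ToyDoc."""
    doc: ToyDoc
    start: int
    end: int
    label: str

    def _key(self, name: str) -> Tuple:
        # The key uses the field name and the offsets only. The label and
        # the object's identity are left out, as the reply describes.
        return (name, self.start, self.end)

    def get_ext(self, name: str, default: Any = None) -> Any:
        return self.doc.user_data.get(self._key(name), default)

    def set_ext(self, name: str, value: Any) -> None:
        self.doc.user_data[self._key(name)] = value


doc = ToyDoc("Alice met Bob")
span1 = ToySpan(doc, 0, 5, "PERSON")
span2 = ToySpan(doc, 0, 5, "AGENT")    # same offsets, different label
span3 = ToySpan(doc, 10, 13, "PERSON")  # different offsets

# Distinct objects, yet both resolve to the same entry in doc.user_data.
span1.set_ext("relations", ["met"])
shared = span2.get_ext("relations") is span1.get_ext("relations")
untouched = span3.get_ext("relations")


# Workaround sketch: store a dict in the attribute and key it by something
# that tells the spans apart (the label here; an assumption of this sketch).
def add_relation(span: ToySpan, relation: str) -> None:
    per_label = span.get_ext("relations_by_label")
    if per_label is None:
        per_label = {}
        span.set_ext("relations_by_label", per_label)
    per_label.setdefault(span.label, []).append(relation)


def relations_of(span: ToySpan) -> List[str]:
    per_label = span.get_ext("relations_by_label") or {}
    return per_label.get(span.label, [])


add_relation(span1, "met")
```

In this model span1 and span2 end up pointing at the same list because their keys are identical, while span3 has no entry. With the workaround, relations_of(span1) and relations_of(span2) differ even though both spans still share one underlying dictionary. The same wrapping could presumably sit behind a custom getter, as the reply suggests.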