Getting char spans to token spans #6093
This is mainly a general question. I saw in the documentation that there is … By the way, one way to achieve what I am asking above is to match the substring within the main string using spaCy's phrase matcher and take the matched tokens. Another option is to tokenize both the main string and the substring, where the substring is obtained from the character span, i.e. string[start:end], and then look for where the tokens match to get the start and end token indices of the substring. Let me know if this feature exists. If not, I guess some of the ideas above may help a bit :-)
Kind regards,
Walter
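The second idea above (tokenize the main string, then pick the tokens that fall inside the character range) can be sketched in plain Python. This is only an illustration under stated assumptions: `tokenize_with_offsets` and `char_range_to_token_range` are hypothetical helpers, and the regex tokenizer is a simple stand-in that does not reproduce spaCy's tokenization rules.

```python
import re


def tokenize_with_offsets(text):
    """Stand-in tokenizer (NOT spaCy's): word runs and single punctuation marks.

    Returns a list of (token_text, start_char, end_char) tuples.
    """
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def char_range_to_token_range(tokens, start_char, end_char):
    """Return (first_token, last_token_exclusive) for the tokens that lie
    fully inside [start_char, end_char), or None if there are none."""
    inside = [i for i, (_, s, e) in enumerate(tokens)
              if s >= start_char and e <= end_char]
    if not inside:
        return None
    return inside[0], inside[-1] + 1


text = "Hello world. This is a test!"
tokens = tokenize_with_offsets(text)

# Character offsets of the substring, as in string[start:end].
start = text.index("Hello world")
end = start + len("Hello world")

token_range = char_range_to_token_range(tokens, start, end)
```

With a real pipeline the token offsets would come from the tokenizer itself rather than from a regex, so the result depends on how that tokenizer splits the text.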
Replies: 4 comments

Hmm, I'm not sure I understand exactly what you're asking. Can you give a concrete example of what you're trying to do? I think you might be misunderstanding what …

Hello @adrianeboyd, I know what you mean. I realized that char_span does not return the same thing as Span from spacy.tokens, although they seemed similar. Going back to my question: I am curious whether the following already exists, and if so, please point me to the corresponding documentation.
Let's say we create the following document:
and from it we want the first two tokens, which correspond to the substring "Hello world". That is achieved by:
Now I want to know the start and end character indices of the substring "Hello world". I would do that by:
Then I can slice the initial string, now that I have the indices of the substring:
Here is the part that I am wondering about. Let's reverse the operation: say that what I have is the index_start and index_end of the substring "Hello world", and I want its corresponding tokens within … I am wondering if there is already something in spaCy that does this in a straightforward way, or where I can get the tokens of the substring in …
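
I think using `Doc.char_span` is what you want, then. `span.start` and `span.end` provide the token start and end indices:

    import spacy

    nlp = spacy.blank("en")
    test_string = "Hello world. This is a test!"
    doc = nlp(test_string)

    # Characters 0-11 cover "Hello world", i.e. the first two tokens.
    span = doc.char_span(0, 11)
    assert span.start == 0
    assert span.end == 2

The main restriction in spacy v2 with `Doc.char_span` is that it returns `None` if the start and end character indices don't line up with a token boundary:

    # Offset 10 falls inside the token "world", so no span is returned.
    span = doc.char_span(0, 10)
    assert span is None

This will become more flexible with an `alignment_mode` option in spacy v3 where you can choose to have the character offsets snap to the nearest token boundaries inside or outside the provided character…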
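To make the idea of snapping character offsets to token boundaries concrete, here is a minimal plain-Python sketch. It is not spaCy's implementation: `snap_to_tokens` is a hypothetical helper, the mode names `"strict"`, `"contract"` and `"expand"` are labels chosen for this sketch, and the token offsets are written out by hand on the assumption that the tokenizer splits the final period off "world.".

```python
def snap_to_tokens(token_offsets, start_char, end_char, mode="strict"):
    """Map a character range to a (start_token, end_token) pair.

    token_offsets: sorted list of (start_char, end_char), one per token.
    mode (names are assumptions of this sketch):
      "strict"   - both offsets must sit exactly on token boundaries
      "contract" - keep only tokens fully inside the range
      "expand"   - keep every token that overlaps the range
    Returns None when no span can be formed.
    """
    if mode == "strict":
        starts = {s: i for i, (s, _) in enumerate(token_offsets)}
        ends = {e: i + 1 for i, (_, e) in enumerate(token_offsets)}
        if start_char in starts and end_char in ends:
            first, last = starts[start_char], ends[end_char]
            if first < last:
                return first, last
        return None
    if mode == "contract":
        hits = [i for i, (s, e) in enumerate(token_offsets)
                if s >= start_char and e <= end_char]
    elif mode == "expand":
        hits = [i for i, (s, e) in enumerate(token_offsets)
                if s < end_char and e > start_char]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return (hits[0], hits[-1] + 1) if hits else None


# Assumed token offsets for "Hello world. This is a test!":
# Hello, world, ., This, is, a, test, !
offsets = [(0, 5), (6, 11), (11, 12), (13, 17),
           (18, 20), (21, 22), (23, 27), (27, 28)]

exact = snap_to_tokens(offsets, 0, 11)                 # aligned range
misaligned = snap_to_tokens(offsets, 0, 10)            # ends inside "world"
inner = snap_to_tokens(offsets, 0, 10, mode="contract")
outer = snap_to_tokens(offsets, 0, 10, mode="expand")
```

In this sketch the misaligned range yields nothing in strict mode, shrinks to the first token when contracting, and grows to cover both tokens when expanding.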

@adrianeboyd Perfect! That's what I was wondering about! I was getting None back because the offsets did not fully line up with the span of a token. So, yes, the … Thank you for your guidance @adrianeboyd! I am going to close this issue.