Getting char spans to token spans #6093

walter-hernandez · 2020-09-19T00:10:54Z

This is mainly a general question. I saw in the documentation that there is char_span, which takes the start and end character offsets of a labelled piece of text. I was wondering whether spaCy has an equivalent function that does the reverse: something like a token_span function that takes the start character index, the end character index, and a label as input and returns the token start and end.

By the way, one way to achieve what I am asking above is to match the substring within the main string using spaCy's PhraseMatcher and take the tokens of the match. Another option is to tokenize both the main string and the substring, which is obtained from the character span as string[start:end], and then look for where the tokens match to get the start and end tokens of the substring.
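For illustration, the PhraseMatcher workaround described above could look roughly like the sketch below. It assumes a blank English pipeline and a recent spaCy version in which matcher.add accepts a list of pattern docs; the example text, the variable names and the "SUBSTRING" key are made up for this example. Note that a substring tokenized on its own may not always be tokenized the same way as it is in context, so this is a workaround rather than a general solution.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
text = "Hello world. This is a test!"
doc = nlp(text)

# Character offsets of the substring we want to locate (assumed to be known).
start_char, end_char = 0, 11
substring = text[start_char:end_char]

# Use the tokenized substring as a phrase pattern.
matcher = PhraseMatcher(nlp.vocab)
matcher.add("SUBSTRING", [nlp.make_doc(substring)])

# Each match is (match_id, token_start, token_end). Keep only the matches
# whose character offsets agree with the offsets we started from.
token_spans = [
    (start, end)
    for _, start, end in matcher(doc)
    if doc[start:end].start_char == start_char
    and doc[start:end].end_char == end_char
]
```

The offset check matters when the same phrase occurs more than once in the text, since the matcher alone cannot tell the occurrences apart.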

Let me know if this feature exists. If not, I guess some of the ideas above may help a bit :-)

Kind regards,

Walter

Answered by adrianeboyd

Sep 21, 2020

adrianeboyd · 2020-09-21T12:01:07Z

Hmm, I'm not sure I understand exactly what you're asking. Can you give a concrete example of what you're trying to do?

I think you might be misunderstanding what Doc.char_span does. It doesn't search for an existing annotation in the doc, it gives you back a new Span object that has the provided label.

walter-hernandez · 2020-09-21T12:32:45Z

Hello @adrianeboyd

I know what you mean. I realized that char_span does not return the same as Span from spacy.tokens, although they seemed similar. Now, back to what I am curious about (if it already exists, please point me to the corresponding documentation):

Let's say we create the following document:

test_string = "Hello world. This is a test!"
doc = nlp(test_string)

and from it, we want the first two tokens, which correspond to the substring "Hello world". That is achieved by:
doc[0:2]

Now, I want to know the start and end character indices of the substring "Hello world". I would do that by:

from spacy.tokens import Span
span = Span(doc, 0, 2, label="TEST")
index_start = span.start_char
index_end = span.end_char

Then, I can slice the initial string now that I have the indexes of the substring:
test_string[index_start:index_end]

Here is the part I am wondering about. Let's reverse the operation: say that what I have is the index_start and index_end of the substring "Hello world", and I want its corresponding tokens within doc. The way I would do it is either to use PhraseMatcher to find the substring in test_string and get the tokens that way, or to tokenize the substring, iterate over it, and see where its tokens match the tokens of test_string.

I am wondering whether there is already something in spaCy that does this in a straightforward way, or some way to get the tokens of the substring in test_string directly, like I did with Span from spacy.tokens. Given that the start and end indices of the substring "Hello world" in test_string are 0 and 11, the function could look something like:

index_start = 0
index_end = 11
span = Span_index(doc, index_start, index_end, label="TEST")
start_token = span.start_token
end_token = span.end_token

adrianeboyd · 2020-09-21T15:22:29Z

I think using Doc.char_span is what you want, then. span.start and span.end provide the token start and end indices:

import spacy
nlp = spacy.blank("en")

test_string = "Hello world. This is a test!"

doc = nlp(test_string)
span = doc.char_span(0, 11)
assert span.start == 0
assert span.end == 2
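As a rough sketch of the round trip asked about above, Doc.char_span also accepts a label, so the character offsets, the label and the token indices can all be read from the same Span. The label "TEST" and the variable names here are only illustrative.

```python
import spacy

nlp = spacy.blank("en")
text = "Hello world. This is a test!"
doc = nlp(text)

# Character offsets in, labelled Span out.
span = doc.char_span(0, 11, label="TEST")

# Token indices of the span (end is exclusive, as in doc[start:end]).
start_token, end_token = span.start, span.end
tokens = [token.text for token in doc[start_token:end_token]]

# The span still knows its character offsets, so the original
# substring can be recovered from the raw text.
recovered = text[span.start_char:span.end_char]
```

Here span.start and span.end play the role of the start_token and end_token attributes in the hypothetical Span_index example from the question.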

The main restriction in spaCy v2 with Doc.char_span is that it returns None if the start and end character indices don't line up with token boundaries:

# "Hello worl" ends in the middle of the token "world"
span = doc.char_span(0, 10)
assert span is None

This will become more flexible with an alignment_mode option in spacy v3 where you can choose to have the character offsets snap to the nearest token boundaries inside or outside the provided character offsets.
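A minimal sketch of how the alignment_mode option behaves, assuming spaCy v3 or later is installed. The offsets 0 and 10 are chosen for illustration because they end in the middle of the token "world".

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello world. This is a test!")

# Default ("strict"): offsets must line up with token boundaries.
strict = doc.char_span(0, 10)

# "contract": keep only the tokens that lie completely inside the offsets.
contracted = doc.char_span(0, 10, alignment_mode="contract")

# "expand": include every token that is at least partially covered.
expanded = doc.char_span(0, 10, alignment_mode="expand")

print(strict)           # None
print(contracted.text)  # Hello
print(expanded.text)    # Hello world
```

Which mode is appropriate depends on the data: "expand" tends to suit annotations that cut a token short, while "contract" avoids pulling in text the annotation never covered.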

walter-hernandez · 2020-09-26T15:08:52Z

@adrianeboyd Perfect! That's what I was wondering!

I was getting None back because the offsets did not fully match the token boundaries. So, yes, the alignment_mode option in spaCy v3 is going to be a welcome feature for me!

Thank you for your guidance @adrianeboyd! I am going to close this issue.
