I think using Doc.char_span is what you want, then. span.start and span.end provide the token start and end indices:

import spacy
nlp = spacy.blank("en")

test_string = "Hello world. This is a test!"

doc = nlp(test_string)
span = doc.char_span(0, 11)
assert span.start == 0
assert span.end == 2

The main restriction of Doc.char_span in spaCy v2 is that it returns None if the start or end character index doesn't line up with a token boundary:

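# Offset 8 falls in the middle of the token "world", so no span can be built.
# (Offsets 0-12 would not show this: 12 is the end of the "." token.)
span = doc.char_span(0, 8)
assert span is None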

This will become more flexible in spaCy v3 with an alignment_mode option, where you can choose to have the character offsets snap to the nearest token boundaries inside or outside the provided character…
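
As a rough sketch of how that option behaves, assuming spaCy v3 is installed and that alignment_mode accepts "strict" (the default), "contract" and "expand", the misaligned offsets 0 and 8 (which end in the middle of "world") could be handled like this:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello world. This is a test!")

# Character offset 8 falls inside the token "world", so the default
# strict mode finds no matching token boundary and returns None.
strict_span = doc.char_span(0, 8)

# "contract" keeps only the tokens that lie completely inside the range.
inside_span = doc.char_span(0, 8, alignment_mode="contract")

# "expand" keeps every token that is at least partially covered.
outside_span = doc.char_span(0, 8, alignment_mode="expand")
```

Under these assumptions, strict_span should be None, inside_span should cover only "Hello" (token indices 0 to 1), and outside_span should cover "Hello world" (token indices 0 to 2). Note that "contract" can still return None when no token fits entirely inside the range, so it is worth checking for that before reading span.start and span.end.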

Answer selected by ines
Labels
feat / doc Feature: Doc, Span and Token objects