Several questions when trying to get the start/end token index of a span given the character offset of it? #3304
Hi, I am trying to get the start/end token indices of a span given its character offset. I have searched for some posts and solutions (#1264), but they don't work in my case. I have a dataset in the following format:
`A_offset` is the character offset of the start of span A in `Text`. Now I want to get the start and end token indices of span A (or B, or the Pronoun) in a doc.

First case: suppose we want the inclusive token indices of span A. The returned span gives me …, and then it can't split …

Second case: the split is inconsistent. So I tried inserting a mark around each span and then calculating the token indices of the span from the marks. For a sentence like this: …

When I didn't insert the mark, … Could you kindly look into this?

Environment: …
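In case it helps to see the alignment problem in isolation, here is a minimal pure-Python sketch of mapping a character span to token indices (the function name and the sample tokens are made up for illustration; spaCy exposes each token's start offset as `token.idx`, and newer versions also provide `Doc.char_span` for exactly this):

```python
def char_span_to_token_span(token_offsets, span_start, span_end):
    """Map a character span [span_start, span_end) to inclusive token
    indices, given (token_text, start_char) pairs like spaCy's
    (token.text, token.idx).  Returns None when the span boundaries
    don't line up with token boundaries -- i.e. a tokenization mismatch,
    which is the failure mode described above."""
    start_tok = end_tok = None
    for i, (text, start) in enumerate(token_offsets):
        end = start + len(text)
        if start == span_start:
            start_tok = i
        if end == span_end:
            end_tok = i
    if start_tok is None or end_tok is None:
        return None
    return start_tok, end_tok

# Tokens for "Alice met Bob" with their character offsets
tokens = [("Alice", 0), ("met", 6), ("Bob", 10)]
print(char_span_to_token_span(tokens, 0, 5))   # span "Alice" -> (0, 0)
print(char_span_to_token_span(tokens, 6, 13))  # span "met Bob" -> (1, 2)
print(char_span_to_token_span(tokens, 1, 5))   # mid-token start -> None
```

The `None` case is the same situation the replies below diagnose: the data's span boundaries assume a different tokenization than the one spaCy produces.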
Replies: 3 comments
I hope I understand your question correctly – but I think the problem here is that the tokenization of your data doesn't match spaCy's tokenization. For example, it seems like …

Modifying the tokenizer is a step in the right direction – but in your example, you're creating a blank tokenizer from scratch with only one suffix rule. So your new tokenizer won't have any of the other tokenization data, like tokenizer exceptions, available, and it'll produce very different results. You probably want to pass in the existing defaults instead. The upcoming …
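To illustrate why a from-scratch tokenizer with a single suffix rule behaves so differently, here is a rough sketch with plain `re` (the helper and the stand-in "default" patterns are illustrative only, not spaCy's actual internals):

```python
import re

def combine_suffixes(suffixes):
    """Sketch of the idea behind spacy.util.compile_suffix_regex:
    combine many suffix patterns into one regex anchored at the end
    of the string.  (The real helper does more, e.g. escaping.)"""
    return re.compile("|".join(f"(?:{s})$" for s in suffixes))

# A tokenizer built "from scratch" with only one suffix rule...
only_dash = combine_suffixes([r"-+"]).search

# ...versus one that keeps the existing rules and adds to them.
defaults = [r"\.", r",", r"!"]          # stand-in for nlp.Defaults.suffixes
extended = combine_suffixes(defaults + [r"-+"]).search

print(only_dash("end."))    # None -- "." is no longer recognized as a suffix
print(extended("end."))     # match object for "."
print(extended("well-"))    # match object for "-"
```

The first search function silently loses the default punctuation rules, which is why extending the defaults (as in the reply below) is the safer approach.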
Thank you for answering my question! I realized that I created a blank tokenizer from scratch. I am eager to know how to tweak the existing tokenizer's suffix rules instead.
The `nlp.tokenizer.suffix_search` attribute is writable, so you should be able to do something like:

```python
suffixes = nlp.Defaults.suffixes + (r'''-+$''',)
nlp.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search
```

The `nlp.tokenizer.suffix_search` attribute should be a function which takes a unicode string and returns a regex match object or `None`. Usually we use the `.search` attribute of a compiled regex object, but you can use some other function that behaves the same way.

The other way you could customize this is to update the `English.Defaults.suffixes` tuple. However, this won't change a tokenizer that you load, because when you load a tokenizer, it'll read the suffix regex from the saved data.

The default list of suffix expressions can be found here: https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py
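As a small illustration of that contract – any callable that takes a string and returns a match object or `None` will do – here is a hand-written suffix search function (hypothetical example, not part of spaCy):

```python
import re

def my_suffix_search(text):
    """Behaves like the .search attribute of a compiled regex:
    takes a string, returns a match object or None.  The rule here
    (split off trailing runs of hyphens) is just an example."""
    return re.search(r"-+$", text)

print(my_suffix_search("well-"))   # match object covering the trailing "-"
print(my_suffix_search("well"))    # None -- no suffix to split off
```

A function like this could then, in principle, be assigned to `nlp.tokenizer.suffix_search` in place of a compiled pattern's `.search`.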