using regex on the whole doc. the char map of the documentation #11307

joseberlines · 2022-08-15T08:12:29Z

Which page or section is this issue related to?

https://spacy.io/usage/rule-based-matching#regex-text
Under "How can I expand the match to a token sequence", we find:

chars_to_tokens = {}
for token in doc:
    for i in range(token.idx, token.idx + len(token.text)):
        chars_to_tokens[i] = token.i

I have the impression that this code might not work properly when the regular expression is complicated and matches spaces. Spaces are also characters of the text, but this code does not map them to any token.
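To make the gap concrete, here is a minimal sketch that builds the same map without spaCy. The `tokens` list is a hypothetical stand-in for a `Doc`: it splits on whitespace only, which is an assumption for illustration and not how spaCy's tokenizer actually behaves.

```python
import re

text = "the United States of America"

# Stand-in for a spaCy Doc (assumption): (token.i, token.idx, token.text)
# triples produced by a plain whitespace split.
tokens = [(i, m.start(), m.group()) for i, m in enumerate(re.finditer(r"\S+", text))]

# Same loop as the docs snippet, using the stand-in triples.
chars_to_tokens = {}
for i, idx, tok_text in tokens:
    for pos in range(idx, idx + len(tok_text)):
        chars_to_tokens[pos] = i

# Offsets that have no entry in the map.
unmapped = [pos for pos in range(len(text)) if pos not in chars_to_tokens]
```

With this input, `unmapped` holds only the offsets of the spaces between tokens, so a lookup with `chars_to_tokens[pos]` raises `KeyError` exactly when `pos` points at whitespace.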

I fixed the problem by adding the following, which fills in the missing keys of the dict:

# Map each unmapped offset (whitespace) to the index of the preceding token.
last_offset = max(chars_to_tokens)
for j in range(last_offset):
    if j not in chars_to_tokens:
        chars_to_tokens[j] = chars_to_tokens[j - 1]

Answered by polm

polm · 2022-08-15T09:49:13Z

Note that, as mentioned in the docs, a span with leading or trailing whitespace is a problem. The docs don't state that explicitly in that particular section, but spaCy doesn't represent entities that start or end with whitespace. (If you have an example where leading or trailing whitespace is significant, let us know; I've never seen one before.) That's why the sample code only lets you find token boundaries (non-whitespace).
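As a rough illustration of that point, the sketch below looks up only the first and last character of a regex match, so spaces inside the match never need an entry in the map. The whitespace-split `tokens` list and the helper `match_to_token_range` are hypothetical names used for illustration, not part of spaCy.

```python
import re

text = "the United States of America"

# Stand-in for a spaCy Doc (assumption): whitespace-split tokens with offsets.
tokens = [(i, m.start(), m.group()) for i, m in enumerate(re.finditer(r"\S+", text))]
chars_to_tokens = {
    pos: i for i, idx, tok in tokens for pos in range(idx, idx + len(tok))
}

def match_to_token_range(match):
    """Return (first_token, last_token) for a non-empty match, or None if
    either end of the match falls on whitespace."""
    first = chars_to_tokens.get(match.start())
    last = chars_to_tokens.get(match.end() - 1)
    if first is None or last is None:
        return None
    return first, last

inner = re.search(r"nited Sta", text)  # contains a space, both ends inside tokens
edge = re.search(r"United ", text)     # ends on a space

inner_range = match_to_token_range(inner)
edge_range = match_to_token_range(edge)
```

Here `inner_range` covers the tokens "United" and "States" even though the match contains a space, while `edge_range` is `None` because the match ends on whitespace, which is the case the answer describes as a problem.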
