Suggestions for using spaCy for text transformation #8652

gdevy · 2021-07-08T18:20:29Z


Hey all!

I'm looking for thoughts on how best to implement text transformation, either as part of spaCy or on top of it. The goal is to replace certain substrings (Tokens, entities, arbitrary Spans) of an input text with another string, both to conceal personal information and to normalize the text for downstream tasks (outside the spaCy realm). The reason I want to do this with spaCy is that it provides many tools for identifying the substrings I'm interested in: NER, POS, Matcher, PhraseMatcher, etc.

I've found some comments from Ines about spaCy's view on text transformations, and they make it clear that I don't want to be mutating the underlying text. So my approach has been to annotate substrings using the spaCy pipeline, including custom pipes, and then use the resulting Doc to create the processed text.

Here is a toy example of the current implementation I have to give you a concrete idea:

1. Identify a substring of interest
   - Custom spaCy pipe that finds a phone number using some regex and retokenizes the whole number into a single token.
   - Add an annotation Token._.is_phone_number and set it True for those tokens (default False).
2. Substitute substrings using a function like this
   - After all the pipelines are applied, pass the document through a chain of transforming functions (all with the same signatures as below)

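import re
from typing import Tuple
from spacy.tokens import Doc

def mask_phone_number(doc: Doc, text: str) -> Tuple[Doc, str]:
    for token in doc:
        if token._.is_phone_number:
            # Escape the token text: "+", "(" and ")" are regex metacharacters.
            text = re.sub(re.escape(token.text), '[PHONE]', text)

    return doc, text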
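
The detection step described in the list above might look roughly like the following. This is only a sketch under assumptions: the regex `PHONE_RE` is a deliberately simple, hypothetical pattern, and a blank English pipeline stands in for a real one. Only the component name `phonenumber_detection` and the extension `is_phone_number` come from the post.

```python
import re

import spacy
from spacy.language import Language
from spacy.tokens import Doc, Token

# Hypothetical pattern: only matches numbers shaped like 555-123-4567.
PHONE_RE = re.compile(r"\d{3}-\d{3}-\d{4}")

# force=True lets the snippet be re-run without a "already registered" error.
Token.set_extension("is_phone_number", default=False, force=True)


@Language.component(
    "phonenumber_detection",
    assigns=["token._.is_phone_number"],
    retokenizes=True,
)
def phonenumber_detection(doc: Doc) -> Doc:
    spans = []
    for match in PHONE_RE.finditer(doc.text):
        # char_span returns None when the match does not align with token boundaries.
        span = doc.char_span(match.start(), match.end())
        if span is not None:
            spans.append(span)
    with doc.retokenize() as retokenizer:
        for span in spans:
            # Merge the number into one token and flag it in the same step.
            retokenizer.merge(span, attrs={"_": {"is_phone_number": True}})
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("phonenumber_detection")
doc = nlp("Call 555-123-4567 today.")
flagged = [token.text for token in doc if token._.is_phone_number]
```

Matches that do not line up with token boundaries are silently skipped here; a real component would probably want to log or handle them.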

So the idea is that these transformation functions can be chained together, with the Doc and text passed from one to the next and the text variable accumulating changes (without changing the underlying Doc).

spacy_nlp.add_pipe('phonenumber_detection')   # add the custom pipeline from step 1

doc = spacy_nlp('some text')

transformations = [
    mask_phone_number,
    ...
]

text = doc.text
for transformation in transformations:
    doc, text = transformation(doc, text)
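
One caveat with string-based substitution is that it replaces every occurrence of the matched text, including occurrences that were never flagged. An alternative is to substitute by character offset. The sketch below is plain Python and assumes the offsets would come from spaCy (for example `token.idx` and `token.idx + len(token.text)`); the name `apply_replacements` is hypothetical.

```python
from typing import List, Tuple

# (start_char, end_char, replacement), with offsets into the original text.
Replacement = Tuple[int, int, str]


def apply_replacements(text: str, replacements: List[Replacement]) -> str:
    """Apply non-overlapping replacements by character offset."""
    # Work from the end of the string so that earlier offsets stay valid.
    for start, end, new in sorted(replacements, reverse=True):
        text = text[:start] + new + text[end:]
    return text


text = "Call 555-123-4567, not 555-123-4567-0."
# Only the first number (characters 5 to 17) was flagged.
masked = apply_replacements(text, [(5, 17, "[PHONE]")])
```

Because the offsets refer to the original `doc.text`, this works best when all replacements are collected first and applied in a single pass, rather than after the text has already been changed by an earlier transformation.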

I hope this makes it clear that the goal is for it to be extensible and composable (and testable). I can write custom spaCy pipes to identify the areas I'm interested in transforming, and then build up a list of transformations to configure the functionality depending on the application. A neat bonus is that the substring annotation logic (custom spaCy pipes) is separate from the transformation logic, which keeps the spaCy pipes nicely separated and easy to understand, and in turn lets the transformation logic leverage annotations from several pipes.

I was decently satisfied with this approach, but with spaCy 3.0 introducing better ways to validate pipelines through the assigns and requires descriptions, I am wondering if I should take more advantage of the machinery already built into spaCy.
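
As a rough illustration of that validation machinery, the sketch below declares `assigns` and `requires` on two stub components and asks `nlp.analyze_pipes()` for unmet requirements. The component names `demo_phone_detector` and `demo_phone_masker` are hypothetical, and the bodies are intentionally empty: only the metadata matters here.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc, Token

Token.set_extension("is_phone_number", default=False, force=True)


@Language.component("demo_phone_detector", assigns=["token._.is_phone_number"])
def demo_phone_detector(doc: Doc) -> Doc:
    return doc  # stub: detection logic would go here


@Language.component("demo_phone_masker", requires=["token._.is_phone_number"])
def demo_phone_masker(doc: Doc) -> Doc:
    return doc  # stub: transformation logic would go here


nlp = spacy.blank("en")

# With only the masker in the pipeline, its requirement is reported as unmet.
nlp.add_pipe("demo_phone_masker")
problems_before = nlp.analyze_pipes()["problems"]

# Adding the detector ahead of it satisfies the requirement.
nlp.add_pipe("demo_phone_detector", first=True)
problems_after = nlp.analyze_pipes()["problems"]
```

The analysis only compares the declared metadata; it does not check that a component actually sets the attributes it claims to assign.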

One potential idea is to put the transformation logic inside spaCy pipes and accumulate the transformed text in a custom Doc extension of type string, like Doc._.transformed_text.

To be honest, that feels a little clumsy and a little naive, so I am just looking to see what other people think about this. I would be happy to hear if anyone has tackled this (maybe in a similar way), if there are existing projects, and what potential pitfalls you see.
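
A minimal sketch of what accumulating text in `Doc._.transformed_text` could look like. The two normalization components and the helper `current_text` are hypothetical stand-ins, chosen only to show that each pipe reads the text produced so far and that `doc.text` itself is never modified.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# None means "no transformation has run yet".
Doc.set_extension("transformed_text", default=None, force=True)


def current_text(doc: Doc) -> str:
    """Return the text accumulated so far, falling back to the original."""
    if doc._.transformed_text is None:
        return doc.text
    return doc._.transformed_text


@Language.component("demo_lowercase_text", assigns=["doc._.transformed_text"])
def demo_lowercase_text(doc: Doc) -> Doc:
    doc._.transformed_text = current_text(doc).lower()
    return doc


@Language.component("demo_collapse_spaces", assigns=["doc._.transformed_text"])
def demo_collapse_spaces(doc: Doc) -> Doc:
    # Collapse runs of whitespace into single spaces.
    doc._.transformed_text = " ".join(current_text(doc).split())
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("demo_lowercase_text")
nlp.add_pipe("demo_collapse_spaces")
doc = nlp("Hello   World")
```

One thing to keep in mind with this design is that token offsets still refer to `doc.text`, so any pipe that edits `transformed_text` by position has to account for changes made by earlier pipes.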

polm · 2021-07-09T03:55:35Z


Looks like an overall good structure, and I think having a transformed_text Doc extension is perfectly reasonable.

One thing I would recommend is maybe rethinking your token attributes. If the only reason you set a token attribute is to mask the token later, you might be better off just setting a masked_text token attribute and then using that (when present) to build your output. This allows you to combine detection and replacement, making your code more compact, and might help reduce the number of extensions you need.
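
A sketch of that suggestion, under assumptions: the component `demo_mask_numbers`, its rule (mask any token made of four or more digits), and the helper `render` are hypothetical. Only the `masked_text` extension name comes from the comment above.

```python
import re

import spacy
from spacy.language import Language
from spacy.tokens import Doc, Token

# None means "leave this token as it is".
Token.set_extension("masked_text", default=None, force=True)

# Hypothetical rule for illustration: tokens consisting of 4+ digits.
DIGITS_RE = re.compile(r"\d{4,}")


@Language.component("demo_mask_numbers", assigns=["token._.masked_text"])
def demo_mask_numbers(doc: Doc) -> Doc:
    # Detection and replacement happen in one place.
    for token in doc:
        if DIGITS_RE.fullmatch(token.text):
            token._.masked_text = "[NUMBER]"
    return doc


def render(doc: Doc) -> str:
    """Rebuild the text, using masked_text wherever it is set."""
    parts = []
    for token in doc:
        masked = token._.masked_text
        parts.append(token.text if masked is None else masked)
        parts.append(token.whitespace_)  # keep the original spacing
    return "".join(parts)


nlp = spacy.blank("en")
nlp.add_pipe("demo_mask_numbers")
doc = nlp("My PIN is 123456 today.")
masked = render(doc)
```

Since the output is rebuilt token by token, only the flagged tokens change, and `doc.text` stays untouched.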

1 reply

gdevy Jul 9, 2021
Author

Thanks, polm. I appreciate your thoughts on this!

That is a good point, I can definitely see that getting cluttered. I mostly think of keeping detection and replacement separate as a benefit, because the replacement logic can then get more intelligent (for example, combining multiple annotations to decide priority). You're right, though: in some cases I do just make the extension the desired substitution.
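
To make the priority idea concrete, here is a plain-Python sketch of replacement logic that resolves overlapping annotations before substituting. The `Annotation` type, its `priority` field, and the function names are all hypothetical; in practice the offsets and labels would come from annotations set by spaCy pipes.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    start: int        # character offset where the span begins
    end: int          # character offset just past the span
    replacement: str
    priority: int     # higher wins when spans overlap


def resolve(annotations: List[Annotation]) -> List[Annotation]:
    """Keep the highest-priority annotation wherever spans overlap."""
    chosen: List[Annotation] = []
    for cand in sorted(annotations, key=lambda a: (-a.priority, a.start)):
        if all(cand.end <= kept.start or cand.start >= kept.end for kept in chosen):
            chosen.append(cand)
    return sorted(chosen, key=lambda a: a.start)


def apply_annotations(text: str, annotations: List[Annotation]) -> str:
    # Replace from right to left so earlier offsets remain valid.
    for ann in reversed(resolve(annotations)):
        text = text[: ann.start] + ann.replacement + text[ann.end :]
    return text


text = "Reach Dr. Jane Doe at 555-123-4567."
annotations = [
    Annotation(10, 18, "[PERSON]", priority=1),  # "Jane Doe"
    Annotation(6, 18, "[DOCTOR]", priority=2),   # "Dr. Jane Doe"
    Annotation(22, 34, "[PHONE]", priority=1),   # "555-123-4567"
]
result = apply_annotations(text, annotations)
```

spaCy also ships `spacy.util.filter_spans`, which resolves overlapping Span objects by preferring the longest one; the sketch above uses an explicit priority instead.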
