Suggestions for using spaCy for text transformation #8652
gdevy started this conversation in Help: Best practices
Replies: 1 comment 1 reply
-
Looks like an overall good structure, and I think having a
One thing I would recommend is maybe rethinking your token attributes. If the only reason you set a token attribute is to mask the token later, you might be better off just setting a
-
Hey all!
I'm looking for some thoughts on how best to implement text transformation, either as part of spaCy or on top of it. The goal is to replace certain substrings (Tokens, entities, arbitrary Spans) of an input text with another string, both to conceal personal information and to apply some normalization for downstream tasks (outside the spaCy realm). The reason I want to do this with spaCy is that it provides many tools for identifying the substrings I'm interested in: NER, POS, `Matcher`, `PhraseMatcher`, etc.
Some comments I've found from Ines about spaCy's view on text transformations make it clear that I don't want to be mutating the underlying text. So my approach has been to annotate the substrings using the spaCy pipeline, including custom pipes, and then use the resulting Doc to create the processed text.
Here is a toy example of the current implementation I have to give you a concrete idea:
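The following is only an illustrative sketch of this kind of setup, not the code from the original post. It assumes a regex-based custom pipe (`phone_number_flagger`, a hypothetical name) that sets `Token._.is_phone_number`, plus two hypothetical transformation functions. To keep offsets valid after earlier replacements, the sketch accumulates a list of per-token output strings instead of a single string; that is a choice made for this sketch.

```python
import re
from typing import Callable, List

import spacy
from spacy.language import Language
from spacy.tokens import Doc, Token

# Extension name from the post; force=True only so the snippet can be re-run.
Token.set_extension("is_phone_number", default=False, force=True)

# Toy pattern (assumption): matches numbers shaped like 555-123-4567.
PHONE_RE = re.compile(r"\d{3}-\d{3}-\d{4}")


@Language.component("phone_number_flagger")
def phone_number_flagger(doc: Doc) -> Doc:
    """Annotation only: flag every token that belongs to a phone number."""
    for match in PHONE_RE.finditer(doc.text):
        span = doc.char_span(match.start(), match.end())
        if span is not None:  # None when the match does not align with tokens
            for token in span:
                token._.is_phone_number = True
    return doc


# A transformation receives the Doc and the current per-token output strings
# and returns new output strings. The Doc itself is never modified.
Transformation = Callable[[Doc, List[str]], List[str]]


def mask_phone_numbers(doc: Doc, pieces: List[str]) -> List[str]:
    """Replace each contiguous run of flagged tokens with one placeholder."""
    out = list(pieces)
    in_run = False
    for token in doc:
        if token._.is_phone_number:
            out[token.i] = "" if in_run else "[PHONE]"
            # The run continues only if no whitespace follows this token.
            in_run = not token.whitespace_
        else:
            in_run = False
    return out


def lowercase_unchanged(doc: Doc, pieces: List[str]) -> List[str]:
    """Toy normalization: lowercase pieces no earlier step has replaced."""
    return [p.lower() if p == t.text else p for p, t in zip(pieces, doc)]


def apply_transformations(doc: Doc, transformations: List[Transformation]) -> str:
    """Run the transformations in order and join the result into a string."""
    pieces = [token.text for token in doc]
    for transform in transformations:
        pieces = transform(doc, pieces)
    return "".join(p + t.whitespace_ for p, t in zip(pieces, doc))


nlp = spacy.blank("en")
nlp.add_pipe("phone_number_flagger")
doc = nlp("Call 555-123-4567 now.")
result = apply_transformations(doc, [mask_phone_numbers, lowercase_unchanged])
print(result)
```

Because every transformation reads annotations from the same untouched Doc, the list passed to `apply_transformations` can be reordered or shortened per application, and each function can be unit-tested on its own.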
`Token._.is_phone_number` and set it to `True` for those tokens (default `False`).
So the idea is that these transformation functions can be chained together, with the Doc and the text passed from one to the next and the `text` variable accumulating changes (without changing the underlying Doc).
I hope this makes it clear that the goal is for the design to be extensible and composable (and testable). I can write custom spaCy pipes to identify the areas I'm interested in transforming, and then build up a list of `transformations` to configure the functionality for each application. A neat bonus is that the substring annotation logic (custom spaCy pipes) is separate from the transformation logic, which keeps the spaCy pipes cleanly separated and easy to understand, and in turn lets the transformation logic leverage annotations from several pipes.
I was reasonably satisfied with this approach, but with spaCy 3.0 introducing better ways to validate pipelines through the `assigns` and `requires` descriptions, I am wondering whether I should take more advantage of the machinery already built into spaCy.
The one idea I have is to put the transformation logic inside spaCy pipes and accumulate the transformed text in a custom Doc extension of type string, like `Doc._.transformed_text`.
To be honest, that feels a little clumsy and a little naive, so I'm looking to see what other people think. I would be happy to hear whether anyone has tackled this (maybe in a similar way), whether there are existing projects, and what potential pitfalls you guys see in this.
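For concreteness, the pipe-based variant could be sketched roughly as below. The component names (`phone_flagger_with_meta`, `phone_masker`), the regex, and the `[PHONE]` placeholder are assumptions made for illustration; `assigns`, `requires`, and `analyze_pipes` are the spaCy 3 pipeline-validation features referred to above.

```python
import re

import spacy
from spacy.language import Language
from spacy.tokens import Doc, Token

# force=True only so the snippet can be re-run in the same session.
Token.set_extension("is_phone_number", default=False, force=True)
Doc.set_extension("transformed_text", default=None, force=True)

# Toy pattern (assumption): matches numbers shaped like 555-123-4567.
PHONE_RE = re.compile(r"\d{3}-\d{3}-\d{4}")


@Language.component("phone_flagger_with_meta", assigns=["token._.is_phone_number"])
def phone_flagger_with_meta(doc: Doc) -> Doc:
    """Annotation pipe: flag tokens that are part of a phone number."""
    for match in PHONE_RE.finditer(doc.text):
        span = doc.char_span(match.start(), match.end())
        if span is not None:
            for token in span:
                token._.is_phone_number = True
    return doc


@Language.component(
    "phone_masker",
    requires=["token._.is_phone_number"],
    assigns=["doc._.transformed_text"],
)
def phone_masker(doc: Doc) -> Doc:
    """Transformation pipe: write the masked text to Doc._.transformed_text."""
    parts = []
    in_run = False
    for token in doc:
        if token._.is_phone_number:
            if not in_run:
                parts.append("[PHONE]")
            parts.append(token.whitespace_)
            # The run continues only if no whitespace follows this token.
            in_run = not token.whitespace_
        else:
            parts.append(token.text_with_ws)
            in_run = False
    doc._.transformed_text = "".join(parts)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("phone_flagger_with_meta")
nlp.add_pipe("phone_masker")

# Lists, per component, any required annotation no earlier pipe assigns.
problems = nlp.analyze_pipes()["problems"]

doc = nlp("Call 555-123-4567 now.")
print(doc._.transformed_text)
```

One thing this sketch sidesteps: once the string has been edited, token character offsets no longer line up with it, so a second transforming pipe that reads `Doc._.transformed_text` would need some other bookkeeping (for example, per-token replacements that are joined only at the end).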