Document alignment #5020

kormilitzin · 2020-02-14T11:55:00Z

I'm wondering if there is a simple way to find correspondence (at the token level) between two Docs, where one is the transformation of another.

For example:

import spacy

def clean_text(text):
    # do_something_to_text is a placeholder for the actual cleaning step
    cleaned = do_something_to_text(text)
    return cleaned

nlp = spacy.load('en_core_web_sm')
raw_text = "Who is  ~£ Shaka\n\nKhan?"
cleaned_text = clean_text(raw_text)
print(cleaned_text)
> "Who is Shaka Khan?"


doc_raw = nlp(raw_text)
doc_clean = nlp(cleaned_text)

[(token.text, token.idx, token.i) for token in doc_raw]

[('Who', 0, 0), ('is', 4, 1), (' ', 7, 2), ('~£', 8, 3), ('Shaka', 11, 4), ('\n\n', 16, 5), ('Khan', 18, 6), ('?', 22, 7)]

whereas for the cleaned document:

[(token.text, token.idx, token.i) for token in doc_clean]

[('Who', 0, 0), ('is', 4, 1), ('Shaka', 7, 2), ('Khan', 13, 3), ('?', 17, 4)]

What I want is to find the correspondence between the tokens:

token in the raw --> token in transformed
0 --> 0
1 --> 1
4 --> 2
6 --> 3
7 --> 4

It is not obvious to me how to implement this easily (apart from directly comparing tokens and their POS tags), but since both Docs come from essentially the same original document, I hope it is still possible to find a mapping from one Doc to the other.

Any ideas? Thanks.

adrianeboyd · 2020-02-14T12:06:56Z

Ideally do_something_to_text() would save a mapping at the character level that lets you reconstruct the alignment. If not, you have to rely on heuristics to align things.
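
A minimal sketch of the character-level idea, in plain Python. The cleaning rule, the `KEEP` set and the helper names `clean_with_offsets` and `align_tokens` are assumptions made up for illustration, not part of spaCy or of the original `do_something_to_text()`. The point is only that the cleaner records, for every character it keeps, where that character came from, so tokens can be aligned through their character offsets.

```python
import string

# Hypothetical cleaning rule for illustration: keep letters, digits and basic
# punctuation, drop everything else, and collapse whitespace runs to one space.
KEEP = set(string.ascii_letters + string.digits + ".,;:!?'\"-")


def clean_with_offsets(text):
    """Return (cleaned_text, offsets), where offsets[k] is the index in the
    raw text of the k-th character of cleaned_text."""
    chars, offsets = [], []
    for i, ch in enumerate(text):
        if ch.isspace():
            # Emit a single space per whitespace run, never at the start.
            if chars and chars[-1] != " ":
                chars.append(" ")
                offsets.append(i)
        elif ch in KEEP:
            chars.append(ch)
            offsets.append(i)
        # Any other character is dropped and gets no offset entry.
    if chars and chars[-1] == " ":  # strip a trailing space
        chars.pop()
        offsets.pop()
    return "".join(chars), offsets


def align_tokens(raw_spans, clean_spans, offsets):
    """Map raw-token index -> cleaned-token index, using the raw position of
    each cleaned token's first character. Spans are (start, end) offsets."""
    mapping = {}
    for j, (c_start, _c_end) in enumerate(clean_spans):
        raw_pos = offsets[c_start]
        for i, (r_start, r_end) in enumerate(raw_spans):
            if r_start <= raw_pos < r_end:
                mapping[i] = j
                break
    return mapping


raw_text = "Who is  ~£ Shaka\n\nKhan?"
cleaned_text, offsets = clean_with_offsets(raw_text)

# Character spans copied from the token offsets quoted in the question.
raw_spans = [(0, 3), (4, 6), (7, 8), (8, 10), (11, 16), (16, 18), (18, 22), (22, 23)]
clean_spans = [(0, 3), (4, 6), (7, 12), (13, 17), (17, 18)]

print(cleaned_text)
print(align_tokens(raw_spans, clean_spans, offsets))
```

With real Docs, the spans could be built as `[(t.idx, t.idx + len(t)) for t in doc]` for `doc_raw` and `doc_clean`. Raw tokens that were dropped entirely (the whitespace tokens and `~£` here) simply do not appear as keys in the mapping.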

I know that @tamuhey recently developed a new library to align slightly differing tokenizations: https://github.com/tamuhey/tokenizations. It might not be exactly what you're looking for (it only looks at the surface strings to do the alignment), but it should point you in the right direction in terms of the kinds of algorithms used to do this.
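
To give a rough feel for surface-string alignment without a saved mapping, here is a small sketch that uses `difflib.SequenceMatcher` from the Python standard library. This is a stand-in chosen for illustration, not the algorithm used by the tokenizations library, and the helper name `align_by_surface` is made up. It only pairs tokens whose text is identical in both sequences, so tokens that the cleaning step modified would be left unaligned.

```python
from difflib import SequenceMatcher


def align_by_surface(raw_tokens, clean_tokens):
    """Map raw-token index -> cleaned-token index for tokens whose surface
    strings match exactly, in order, according to difflib's matching blocks."""
    matcher = SequenceMatcher(a=raw_tokens, b=clean_tokens, autojunk=False)
    mapping = {}
    for block in matcher.get_matching_blocks():
        # Each block says raw[a:a+size] == clean[b:b+size].
        for k in range(block.size):
            mapping[block.a + k] = block.b + k
    return mapping


# Token texts taken from the two Docs in the question.
raw_tokens = ["Who", "is", " ", "~£", "Shaka", "\n\n", "Khan", "?"]
clean_tokens = ["Who", "is", "Shaka", "Khan", "?"]

surface_mapping = align_by_surface(raw_tokens, clean_tokens)
print(surface_mapping)
```

With Docs, the inputs would be `[t.text for t in doc_raw]` and `[t.text for t in doc_clean]`. Because this is a heuristic, repeated tokens or heavy rewriting can produce alignments that differ from what a saved character-level mapping would give.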
