Document alignment #5020
Replies: 1 comment
I know that @tamuhey recently developed a new library to align slightly differing tokenizations: https://github.com/tamuhey/tokenizations. It might not be exactly what you're looking for (it only looks at the surface strings to do the alignment), but it should point you in the right direction in terms of the kinds of algorithms used to do this.
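To illustrate what a surface-string alignment can look like, here is a minimal sketch that uses Python's standard `difflib.SequenceMatcher` rather than the library linked above, so it is not that library's algorithm. The helper name `align_tokens` is hypothetical, and the token lists are copied from the example in the question. It assumes that matching tokens have identical text in both Docs.

```python
from difflib import SequenceMatcher


def align_tokens(raw_tokens, clean_tokens):
    """Map each index in raw_tokens to its index in clean_tokens, or None.

    Hypothetical helper: aligns purely on surface strings.
    """
    matcher = SequenceMatcher(a=raw_tokens, b=clean_tokens, autojunk=False)
    raw_to_clean = [None] * len(raw_tokens)
    # Each matching block is a run of identical tokens in both sequences.
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            raw_to_clean[block.a + offset] = block.b + offset
    return raw_to_clean


# Token texts taken from the doc_raw / doc_clean example in the question.
raw = ["Who", "is", " ", "~£", "Shaka", "\n\n", "Khan", "?"]
clean = ["Who", "is", "Shaka", "Khan", "?"]

mapping = align_tokens(raw, clean)
print(mapping)  # [0, 1, None, None, 2, None, 3, 4]
```

Raw tokens that were removed by cleaning map to `None`. If cleaning also rewrites token text (for example lowercasing), exact string matching like this would likely need a normalization step first.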
I'm wondering if there is a simple way to find correspondence (at the token level) between two Docs, where one is the transformation of another.
For example:
[(token.text, token.idx, token.i) for token in doc_raw]
[('Who', 0, 0), ('is', 4, 1), (' ', 7, 2), ('~£', 8, 3), ('Shaka', 11, 4), ('\n\n', 16, 5), ('Khan', 18, 6), ('?', 22, 7)]
whereas for the cleaned document:
[(token.text, token.idx, token.i) for token in doc_clean]
[('Who', 0, 0), ('is', 4, 1), ('Shaka', 7, 2), ('Khan', 13, 3), ('?', 17, 4)]
What I mean is to find the correspondence between the tokens of the two Docs.
It is not obvious to me how to implement this easily (apart from a straightforward comparison of tokens and their POS tags), but since both Docs derive from essentially the same original document, I hope it is still possible to find the mapping from one Doc to the other.
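For reference, the straightforward token comparison could be sketched roughly as below. This is only an illustration under a strong assumption: the cleaning step only deletes whole tokens and never reorders or rewrites them, so the cleaned tokens appear in the raw Doc in the same order. The function name `greedy_align` is made up for this example, and POS tags are not used.

```python
def greedy_align(raw_tokens, clean_tokens):
    """Return (raw_index, clean_index) pairs via a single forward scan.

    Assumes clean_tokens is an in-order subsequence of raw_tokens.
    """
    pairs = []
    raw_i = 0
    for clean_i, text in enumerate(clean_tokens):
        # Skip raw tokens (whitespace, junk) that the cleaning step removed.
        while raw_i < len(raw_tokens) and raw_tokens[raw_i] != text:
            raw_i += 1
        if raw_i == len(raw_tokens):
            break  # no match left; the assumption above does not hold
        pairs.append((raw_i, clean_i))
        raw_i += 1
    return pairs


# Token texts from doc_raw and doc_clean above.
raw = ["Who", "is", " ", "~£", "Shaka", "\n\n", "Khan", "?"]
clean = ["Who", "is", "Shaka", "Khan", "?"]

pairs = greedy_align(raw, clean)
print(pairs)  # [(0, 0), (1, 1), (4, 2), (6, 3), (7, 4)]
```

This breaks down as soon as a cleaned token has no identical counterpart in the raw Doc, which is why I am looking for something more robust.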
Any ideas? Thanks.