Can't figure out why entities are misaligned #10445

Amalkatrazz · 2022-03-06T14:21:59Z

The title is pretty self-explanatory, so I'll go straight to the details.

The text string is as follows:

This text contains some entities:

{'entities': [(2522, 2535, 'DOC', 'OpenVMS Alpha'), (2689, 2700, 'DOC', 'OpenVMS I64'), (4650, 4678, 'DOC', 'VSI OpenVMS Calling Standard'), (3568, 3669, 'DOC', 'Porting Applications from VSI OpenVMS Alpha to VSI OpenVMS Industry Standard 64 for Integrity Servers'), (4871, 4917, 'DOC', 'VAX MACRO and Instruction Set Reference Manual'), (5196, 5259, 'DOC', 'OpenVMS System Messages: Companion Guide for Help Message Users'), (4205, 4270, 'DOC', 'Migrating an Application from OpenVMS VAX to OpenVMS Alpha Manual'), (3750, 3808, 'DOC', 'Migrating an Environment from OpenVMS VAX to OpenVMS Alpha')]}

(The entity names are given here for convenience only and are not fed to spaCy.)

I ran tests on multiple texts like this one, and in each case spaCy picks up only some of the delimited entities and throws a misalignment warning for most of them.

Entity boundaries are generated quite simply: the start offset comes from string.find(), and the end offset from string.find() + len(entity).
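A minimal sketch of the boundary computation described above, in plain Python. The helper name `find_entity_span` and the sample sentence are hypothetical and not taken from the original post. Note that `str.find()` returns only the first occurrence and returns -1 when nothing is found, so both cases are worth guarding against.

```python
def find_entity_span(text, entity, label="DOC"):
    """Return (start, end, label) for the first occurrence of entity, or None."""
    start = text.find(entity)
    if start == -1:  # str.find() signals "not found" with -1
        return None
    end = start + len(entity)  # end offset is exclusive, as in slicing
    return (start, end, label)


# Hypothetical sample text, used only for illustration.
text = "Refer to the VSI OpenVMS Calling Standard for details."
span = find_entity_span(text, "VSI OpenVMS Calling Standard")

# Round-trip check: slicing with the offsets must reproduce the entity exactly.
start, end, label = span
assert text[start:end] == "VSI OpenVMS Calling Standard"
```

The round-trip assertion is cheap to run on every entity, provided it is run against the very same string object that is later passed to spaCy.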

When tested on a short, single-sentence text, the method above works just fine, and spacy.training.offsets_to_biluo_tags shows proper alignment.
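For reference, a small sketch of what such an alignment check can look like, assuming spaCy v3 and a blank English pipeline (tokenizer only). The sentence and offsets are made up for illustration. Tokens that an entity only partially covers are tagged "-", which is what triggers the misalignment warning.

```python
import warnings

import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")  # tokenizer only; no trained model is required
text = "See the OpenVMS Alpha manual."  # hypothetical sample sentence
doc = nlp.make_doc(text)

good = [(8, 21, "DOC")]  # exactly covers "OpenVMS Alpha"
bad = [(9, 21, "DOC")]   # starts one character late, inside the token "OpenVMS"

good_tags = offsets_to_biluo_tags(doc, good)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the misalignment warning in this demo
    bad_tags = offsets_to_biluo_tags(doc, bad)

print(good_tags)  # aligned entity: B-/L- tags on the two entity tokens
print(bad_tags)   # misaligned entity: "-" on the partially covered tokens
```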

On longer texts, the first few entities are picked up correctly, but further down the text the chance of misalignment increases.

I do not think the method for detecting entity boundaries is wrong per se. After all, I even checked some of the offsets manually: the entity at 5196-5259 is indeed at 5196-5259.

What am I doing wrong?

ljvmiranda921 · 2022-03-07T06:49:24Z

Hi @Amalkatrazz ,

You might want to double-check your character indices. When I tested the text from pastebin, the alignment worked only for the first two examples, OpenVMS Alpha and OpenVMS I64, and failed for the later ones. Take this for example:

text = "..."  # copied from pastebin
expected = "VSI OpenVMS Calling Standard"
actual = text[4650:4678]  # Result:  " version. In addition, the m"
assert actual == expected  # Fails

I also tried using your method of finding the character indices, and I'm getting different values.

text = "..." # copied from pastebin
text.find("VSI OpenVMS Calling Standard")  # returns 4816 instead of 4650

My hunch is that copying the text from one place to another may have changed a few symbols, such as -- and —. I also noticed that you have a lot of newlines, so maybe it's a Windows <-> Unix thing (like \r\n vs. \n)? This might not be apparent in the first few entities, but it showed up for the later ones. Double-check the character indices again, especially if you're reading the text from a file. It might even be better to preprocess the text a bit so that it's manageable once you apply the alignment.
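To illustrate how a newline difference alone could produce this kind of growing drift, here is a small sketch in plain Python. It only demonstrates the hunch above and does not confirm it as the cause. The sample text and the helper `normalize_newlines` are hypothetical.

```python
# Hypothetical sample: the same content with Unix (\n) and Windows (\r\n) line endings.
unix_text = "Line one\nLine two\nSee the VSI OpenVMS Calling Standard.\n"
windows_text = unix_text.replace("\n", "\r\n")

entity = "VSI OpenVMS Calling Standard"
unix_start = unix_text.find(entity)
windows_start = windows_text.find(entity)

# Each line break before the entity adds one extra character in the CRLF
# version, so the offsets drift further apart the deeper into the text you go.
drift = windows_start - unix_start


def normalize_newlines(text):
    """Convert \\r\\n and bare \\r line endings to \\n."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


# After normalisation, both versions give the same offset again.
assert normalize_newlines(windows_text).find(entity) == unix_start
```

Whatever preprocessing is applied, the offsets have to be computed on the same string that is later handed to spaCy; offsets computed on one variant and applied to another will be off.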

2 replies

Amalkatrazz Mar 7, 2022
Author

Hi, and thanks for your suggestions! I have solved the problem by approaching it from a completely different angle. Instead of running str.find() on the raw text, I did the following:

(a) tokenised the text with spaCy and made a list of its tokens (List A);
(b) tokenised each entity on the list with spaCy and made a list of tokens for each (List B);
(c) found where in List A each List B was contained in a completely matching sequence;
(d) retrieved token.idx values from there and applied appropriate summations.

Now all entities are aligned!
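Steps (a) to (d) could be sketched roughly as follows. This is one interpretation rather than the poster's actual code. It assumes spaCy v3 with a blank English pipeline, the helper name `find_token_spans` and the sample text are hypothetical, and the "summation" in step (d) is read here as the last token's idx plus its length.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only


def find_token_spans(text, entity_strings, label="DOC"):
    """Return (start_char, end_char, label) for every token-level match."""
    doc = nlp.make_doc(text)                        # (a) tokenise the text
    doc_words = [token.text for token in doc]
    spans = []
    for entity in entity_strings:
        ent_words = [token.text for token in nlp.make_doc(entity)]  # (b)
        n = len(ent_words)
        if n == 0:
            continue
        for i in range(len(doc_words) - n + 1):     # (c) find matching runs
            if doc_words[i:i + n] == ent_words:
                first, last = doc[i], doc[i + n - 1]
                start = first.idx                   # (d) char offset of first token
                end = last.idx + len(last)          #     end of the last token
                spans.append((start, end, label))
    return spans


# Hypothetical sample text, used only for illustration.
text = "See the OpenVMS Alpha manual, then the OpenVMS I64 guide."
spans = find_token_spans(text, ["OpenVMS Alpha", "OpenVMS I64"])
print([text[start:end] for start, end, _ in spans])
```

Because the offsets come from token boundaries of the same `Doc`, they are aligned with the tokenization by construction.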

ljvmiranda921 Mar 7, 2022

Glad you found a way! If you want to further optimize that workflow, perhaps you can use the Matcher for your third step (or specifically, use this workflow for matching). You can add more flexibility in your rules and it may even be easier to debug :)
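One possible reading of that suggestion, sketched with spaCy's PhraseMatcher. Choosing PhraseMatcher over the rule-based Matcher is an assumption made here for exact phrase lists, not something the reply specifies. The sample text and entity titles are hypothetical, and spaCy v3 is assumed.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer only
matcher = PhraseMatcher(nlp.vocab)

# Hypothetical entity titles; each is tokenised once and used as a pattern.
titles = ["OpenVMS Alpha", "OpenVMS I64"]
matcher.add("DOC", [nlp.make_doc(title) for title in titles])

text = "See the OpenVMS Alpha manual, then the OpenVMS I64 guide."
doc = nlp.make_doc(text)

# as_spans=True returns Span objects, which carry character offsets and
# the match key ("DOC") as their label.
entities = [
    (span.start_char, span.end_char, span.label_)
    for span in matcher(doc, as_spans=True)
]
print(entities)
```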
