Can't figure out why entities are misaligned #10445
Replies: 1 comment 2 replies
-
Hi @Amalkatrazz, you might want to double-check your character indices. When I tested the text from pastebin, the alignment worked only for the first two examples:

text = "..." # copied from pastebin
expected = "VSI OpenVMS Calling Standard"
actual = text[4650:4678] # Result: " version. In addition, the m"
assert actual == expected # Fails

I also tried using your method of finding the character indices, and I'm getting different values:

text = "..." # copied from pastebin
text.find("VSI OpenVMS Calling Standard") # returns 4816 instead of 4650

My hunch is that copying the text from one place to another may have changed a few symbols like |
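To see which spans have drifted, a small check along these lines might help. This is only a sketch: `check_offsets` is a hypothetical helper name, the sample text is a toy string rather than the pastebin content, and it assumes the entities are `(start, end, label, name)` tuples as in the question.

```python
def check_offsets(text, entities):
    """Return details for every span whose slice does not match its expected name.

    Each result is (start, end, expected, actual, found_at), where found_at is
    the index reported by text.find(expected), or -1 if the name is absent.
    """
    mismatches = []
    for start, end, _label, expected in entities:
        actual = text[start:end]
        if actual != expected:
            mismatches.append((start, end, expected, actual, text.find(expected)))
    return mismatches


# Toy data (an assumption for illustration, not the pastebin text).
sample_text = "See the VSI OpenVMS Calling Standard for details."
sample_entities = [
    (8, 36, "DOC", "VSI OpenVMS Calling Standard"),  # offsets match this string
    (4, 32, "DOC", "VSI OpenVMS Calling Standard"),  # stale offsets, shifted by 4
]

mismatches = check_offsets(sample_text, sample_entities)
for start, end, expected, actual, found_at in mismatches:
    print(f"{start}-{end}: expected {expected!r}, got {actual!r}, find() says {found_at}")
```

Running this against the exact string object that is later passed to spaCy should show whether the offsets were computed on a slightly different copy of the text.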
-
So, the title is pretty self-explanatory, I'll go straight to the details.
The text string is as follows:
https://pastebin.com/fwMCsmwZ
This text contains some entities:
{'entities': [(2522, 2535, 'DOC', 'OpenVMS Alpha'), (2689, 2700, 'DOC', 'OpenVMS I64'), (4650, 4678, 'DOC', 'VSI OpenVMS Calling Standard'), (3568, 3669, 'DOC', 'Porting Applications from VSI OpenVMS Alpha to VSI OpenVMS Industry Standard 64 for Integrity Servers'), (4871, 4917, 'DOC', 'VAX MACRO and Instruction Set Reference Manual'), (5196, 5259, 'DOC', 'OpenVMS System Messages: Companion Guide for Help Message Users'), (4205, 4270, 'DOC', 'Migrating an Application from OpenVMS VAX to OpenVMS Alpha Manual'), (3750, 3808, 'DOC', 'Migrating an Environment from OpenVMS VAX to OpenVMS Alpha')]}
(Names of entities are given here for convenience only and are not fed to spaCy.)
I ran my tests on multiple texts like this one, and in each case spaCy picks up only some of the delimited entities and throws a misalignment warning for most of them.
Entity boundaries are generated quite simply: the 1st one by using string.find(), and the 2nd one by string.find() + len(entity).
When tested on a short, single-sentence text, the method above works just fine, and training.offsets_to_biluo_tags shows proper alignment. On longer texts, the first few entities are picked up correctly, but further down the text the chance of misalignment increases.
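For reference, the boundary method described above can be sketched roughly as follows. `find_span` is a hypothetical name, the `DOC` label is taken from the entity list, and the text is a toy string rather than the real document.

```python
def find_span(text, entity, label="DOC"):
    """Locate the first occurrence of entity and return (start, end, label)."""
    start = text.find(entity)
    if start == -1:
        # str.find returns -1 when the substring is missing; fail loudly here
        # instead of silently producing a bogus span.
        raise ValueError(f"entity not found in text: {entity!r}")
    end = start + len(entity)
    return (start, end, label)


# Toy text (an assumption for illustration only).
toy_text = "OpenVMS Alpha and OpenVMS I64 are covered here."
spans = [find_span(toy_text, name) for name in ("OpenVMS Alpha", "OpenVMS I64")]

# Round-trip check: slicing with the computed offsets gives back the entity.
for (start, end, _label), name in zip(spans, ("OpenVMS Alpha", "OpenVMS I64")):
    assert toy_text[start:end] == name
```

Two caveats with this approach: `str.find` only reports the first occurrence, and the offsets are valid only for the exact string they were computed on, so any change to the text afterwards (re-encoding, whitespace cleanup, copy and paste) can shift them.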
I do not think the method for detecting entity boundaries is wrong per se. After all, I even checked some of them manually. Entity at 5196-5259 is indeed at 5196-5259.
What am I doing wrong?