Dealing with hash issue in preparing .spacy object for spancat training #11827
Hi, I am currently setting up an experiment with the SpanCat component. I was able to train an initial model on my data with a cloned repo, so the basic settings should be fine. Right now, I have an issue converting another set of IOB data into a `.spacy` object. Here is what the IOB data looks like:
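(For illustration, a fragment in this format consists of the token text followed by one tab-separated label column per nesting level; the tokens and labels below are hypothetical, GENIA-style:)

```
IL-2	B-protein	B-DNA	O	O
gene	O	I-DNA	O	O
expression	O	O	O	O
```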
Here is the code I used to convert the IOB data to the `.spacy` object (again, thank you for sharing the example projects for this component, they are extremely helpful!):

```python
from pathlib import Path
from typing import List

import typer
from spacy.tokens import Doc, DocBin, SpanGroup
from spacy.training.converters import conll_ner_to_docs
from wasabi import msg

DOC_DELIMITER = "-DOCSTART- -X- O O\n"


def parse_genia(data: str,
                span_key: str,
                num_levels: int = 4,
                doc_delimiter: str = DOC_DELIMITER) -> List[Doc]:
    """Parse the GENIA dataset into spaCy Docs.

    Our strategy here is to reuse the conll -> ner method from
    spaCy and re-apply it n times. We don't want to write our
    own CoNLL/IOB parser.

    Parameters
    ----------
    data: str
        The raw string input as read from the IOB file
    num_levels: int, default is 4
        Represents how many times a label has been nested. In
        GENIA, a label was nested four times at maximum.

    Returns
    -------
    List[Doc]
    """
    docs = data.split("\n\n")  # separate into sents
    iob_per_level = []
    for level in range(num_levels):
        doc_list = []
        for doc in docs:  # iterate over each chunk
            tokens = [t for t in doc.split("\n") if t]
            token_list = []
            for token in tokens:  # iterate over tokens
                annot = token.split("\t")  # list of annotations
                # First element is always the token text
                text = annot[0]
                # Subsequent columns are the annotations per level
                label = annot[level + 1]
                # "text label" as format
                _token = " ".join([text, label])
                token_list.append(_token)
            doc_list.append("\n".join(token_list))
        annotations = doc_delimiter.join(doc_list)
        iob_per_level.append(annotations)
    # We then copy all the entities from doc.ents into doc.spans
    # later on. But first, let's have "canonical" docs to copy into.
    # conll_ner_to_docs internally identifies whether sentence
    # segmentation is done.
    docs_per_level = [list(conll_ner_to_docs(iob)) for iob in iob_per_level]
    docs_with_spans: List[Doc] = []
    for docs in zip(*docs_per_level):
        spans = [ent for doc in docs for ent in doc.ents]
        doc = docs[0]
        group = SpanGroup(doc, name=span_key, spans=spans)
        doc.spans[span_key] = group
        docs_with_spans.append(doc)
    return docs_with_spans


def parse_engagement_v2(data: str,
                        span_key: str,
                        num_levels: int = 4,
                        doc_delimiter: str = DOC_DELIMITER) -> List[Doc]:
    """Parse the ENGAGEMENT dataset into spaCy Docs.

    This is a modified version of the genia_preprocess code:
    1) I included the doc delimiter to reflect natural doc boundaries.

    Our strategy here is to reuse the conll -> ner method from
    spaCy and re-apply it n times. We don't want to write our
    own CoNLL/IOB parser.

    Parameters
    ----------
    data: str
        The raw string input as read from the IOB file
    num_levels: int, default is 4
        Represents how many times a label has been nested. In
        GENIA, a label was nested four times at maximum.

    Returns
    -------
    List[Doc]
    """
    docs = data.split(doc_delimiter)  # separate into docs rather than sents
    iob_per_level = []
    for level in range(num_levels):
        doc_list = []
        for doc in docs:  # iterate over each chunk
            sent_list = []
            for sent in doc.split("\n\n"):
                tokens = [t for t in sent.split("\n") if t]
                token_list = []
                for token in tokens:  # iterate over tokens
                    annot = token.split("\t")  # list of annotations
                    # First element is always the token text
                    text = annot[0]
                    # text = text.replace("#", "_")  # tested whether "#" was doing the trick
                    # Subsequent columns are the annotations per level
                    label = annot[level + 1]
                    # "text label" as format
                    _token = " ".join([text, label])
                    token_list.append(_token)
                sent_list.append("\n".join(token_list))
            doc_list.append("\n\n".join(sent_list))
        annotations = doc_delimiter.join(doc_list)
        iob_per_level.append(annotations)
    # We then copy all the entities from doc.ents into doc.spans
    # later on. But first, let's have "canonical" docs to copy into.
    # conll_ner_to_docs internally identifies whether sentence
    # segmentation is done.
    docs_per_level = [list(conll_ner_to_docs(iob)) for iob in iob_per_level]
    docs_with_spans: List[Doc] = []
    for docs in zip(*docs_per_level):
        for d in docs:
            print(d.ents)  # debugging: inspect the entities found per level
        spans = [ent for doc in docs for ent in doc.ents]
        doc = docs[0]
        group = SpanGroup(doc, name=span_key, spans=spans)
        doc.spans[span_key] = group
        docs_with_spans.append(doc)
    return docs_with_spans


def main(input_path: Path, output_path: Path, span_key: str):
    msg.good("Processing the Engagement dataset")
    with input_path.open("r", encoding="utf-8") as f:
        data = f.read()
    docs = parse_engagement_v2(data, span_key=span_key, num_levels=3)
    # docs = parse_genia(data, span_key=span_key)
    doc_bin = DocBin(docs=docs)
    doc_bin.to_disk(output_path)
    msg.good("Done processing the Engagement dataset")


if __name__ == "__main__":
    typer.run(main)
```

When I run this script, I get:
One of my guesses is that some of the labels never occur in the first layer of the IOB data, and this may prevent spaCy from assigning a hash to those labels. So I randomly switched positions in the IOB data, but that did not work either (maybe just by sheer probability). If the absence of a label from the spaCy Vocab is the issue, is there any way to set a default tag set on the spancat layer so that we do not have to deal with this? If that is not the source of the issue, I would appreciate any guesses as to why adding additional tags to the dataset caused one. Thank you so much again for the awesome ecosystem you have been building for the NLP community. I really could not do my work without spaCy.
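(For reference on the hashing side: spaCy interns strings, including span labels, as 64-bit hashes in a vocab's StringStore, and a hash can only be resolved back to a string by a StringStore that has seen that string. A minimal sketch; the label name is hypothetical:)

```python
import spacy

nlp = spacy.blank("en")

label = "CLAIM"  # hypothetical span label
key = nlp.vocab.strings.add(label)  # interning returns the 64-bit hash

# Resolving the hash only works in a StringStore that contains the
# string; a Doc built from a *different* vocab may fail to resolve it.
assert nlp.vocab.strings[key] == label
```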
The problem is that these docs are coming back with different vocabs:

```python
docs_per_level = [list(conll_ner_to_docs(iob)) for iob in iob_per_level]
```

I haven't tested this, but I think you can fix it by providing a single model (so they all use the same model vocab) to use for all the conversions:

```python
nlp = spacy.blank("en")  # or some appropriate blank model
docs_per_level = [list(conll_ner_to_docs(iob, model=nlp)) for iob in iob_per_level]
```
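As a sanity check that the fix takes effect, one could assert that every converted Doc ends up sharing a single Vocab before the entities are copied into span groups. A minimal sketch following the suggestion above (untested; the toy IOB lines and label are hypothetical):

```python
import spacy
from spacy.training.converters import conll_ner_to_docs

# Two toy single-token IOB "levels" standing in for iob_per_level
# from the conversion script above.
iob_per_level = ["Transcripts B-CLAIM", "Transcripts O"]

nlp = spacy.blank("en")  # one shared pipeline -> one shared StringStore
docs_per_level = [list(conll_ner_to_docs(iob, model=nlp))
                  for iob in iob_per_level]

# All Docs should now share one Vocab, so span label hashes resolve
# consistently when entities are copied across levels.
first_vocab = docs_per_level[0][0].vocab
assert all(doc.vocab is first_vocab
           for docs in docs_per_level
           for doc in docs)
```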