Spancat not being initialized before training #11636
Unanswered
NixBiks asked this question in Help: Coding & Implementations
-
Ahaaa - I had to add a label to the
=================================== train ===================================
Running command: spacy train configs/sent_classifier.cfg --output training/ --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy --gpu-id -1
ℹ Saving to output directory: training
ℹ Using CPU
=========================== Initializing pipeline ===========================
[2022-10-12 18:06:48,653] [INFO] Set up nlp object from config
[2022-10-12 18:06:48,659] [INFO] Pipeline: ['sentencizer', 'domain_entity_ruler', 'spancat']
[2022-10-12 18:06:48,825] [INFO] Created vocabulary
[2022-10-12 18:06:48,826] [INFO] Finished initializing nlp object
[2022-10-12 18:06:50,673] [INFO] Initialized pipeline components: ['domain_entity_ruler', 'spancat']
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: ['sentencizer', 'domain_entity_ruler', 'spancat']
ℹ Frozen components: ['sentencizer']
ℹ Set annotations on update for: ['sentencizer']
ℹ Initial learn rate: 0.001
E    #       LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE
---  ------  ------------  ----------  ----------  ----------  ------
  0       0          0.00       29.73       19.30       64.71    0.30
  0     200          0.00       29.73       19.30       64.71    0.30
  0     400          0.00       29.73       19.30       64.71    0.30
  0     600          0.00       29.73       19.30       64.71    0.30
  0     800          0.00       29.73       19.30       64.71    0.30
  0    1000          0.00       29.73       19.30       64.71    0.30
  1    1200          0.00       29.73       19.30       64.71    0.30
  1    1400          0.00       29.73       19.30       64.71    0.30
  1    1600          0.00       29.73       19.30       64.71    0.30
Epoch 2:   0%|          | 0/200 [00:00<?, ?it/s]
✔ Saved pipeline to output directory
training/model-last

FYI my convert script now looks like this:

# Imports and the Typer app are assumed here; the posted excerpt omitted them.
# `models` and `report_to_text` are project-local helpers that are not shown.
import random

import spacy
import typer
from spacy.tokens import DocBin, SpanGroup
from spacy.util import load_config

app = typer.Typer()


@app.command()
def convert_for_spancat(
    input_file: str = typer.Option(..., "--input"),
    train_file: str = typer.Option(..., "--train"),
    dev_file: str = typer.Option(..., "--dev"),
    config_file: str = typer.Option(..., "--config"),
    split: float = typer.Option(..., "--eval-split"),
):
    """Prepare data for spancat."""
    random.seed(42)
    input_data = models.FilePayload.parse_file(input_file).__root__
    config = load_config(config_file)
    lang_cls = spacy.util.get_lang_class(config["nlp"]["lang"])
    nlp = lang_cls.from_config(config, disable=["spancat"])  # disabling since spancat is not initialized
    nlp.initialize()
    train = DocBin()
    dev = DocBin()
    n = len(input_data)
    # The first `split` fraction of the shuffled examples goes to train, the rest to dev.
    split_index = int(n * split)
    random.shuffle(input_data)
    span_key = config["components"]["spancat"]["spans_key"]
    for index, example in enumerate(input_data):
        full_text = report_to_text(title=example.title, html=example.html)
        doc = nlp(full_text)
        group = SpanGroup(doc, name=span_key, spans=[])
        # Label every sentence that mentions the deal value and collect it in the span group.
        for span in doc.sents:
            if example.deal_value is not None and example.deal_value in span.text:
                span.label_ = "HAS_DEAL_VALUE"
                group.append(span)
        if len(group) == 0 and example.deal_value is not None:
            typer.echo(f"Warning: no spans found for {example.title}")
        doc.spans[span_key] = group
        if index < split_index:
            train.add(doc)
        else:
            dev.add(doc)
    train.to_disk(train_file)
    dev.to_disk(dev_file)
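
One detail in the script above that may be worth double-checking: documents with `index < int(n * split)` go to the training set, so the value passed as `--eval-split` effectively acts as the training fraction. If the flag is meant to be the held-out fraction, the split could be written with an explicitly named parameter. This is only a sketch; `split_train_dev` and `dev_fraction` are hypothetical names, not part of the original script or of spaCy.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_dev(
    items: Sequence[T], dev_fraction: float, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of `items` and hold out `dev_fraction` of them for dev."""
    if not 0.0 <= dev_fraction <= 1.0:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(items)
    # A local Random instance keeps the shuffle reproducible without touching global state.
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    # The first n_dev shuffled items become dev, the remainder train.
    return shuffled[n_dev:], shuffled[:n_dev]


train_items, dev_items = split_train_dev(range(10), dev_fraction=0.2)
print(len(train_items), len(dev_items))  # 8 2
```

Naming the parameter after the set it controls makes it harder to pass `0.2` and end up training on only 20% of the data.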
-
I have a config file like this
And then I have a script like this to convert my data to spaCy training data
But when I run the following command
then I get the following error
I've been searching around for a solution, but to me it looks like my training and dev data are saved correctly, i.e. doc.spans[span_key] = group is set for all documents in the DocBins.
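
One way to check what actually ended up in the saved data is to count the span labels stored under the spans key. The following is a minimal sketch, not the poster's code: `count_span_labels` is a hypothetical helper, and the key `"sc"` and the example sentence are assumptions chosen for the demo (the real key comes from the `spans_key` setting in the config).

```python
from collections import Counter
from typing import Iterable

import spacy
from spacy.tokens import Doc, DocBin, Span


def count_span_labels(docs: Iterable[Doc], spans_key: str) -> Counter:
    """Count span labels under `spans_key`, flagging missing keys and unlabeled spans."""
    counts: Counter = Counter()
    for doc in docs:
        if spans_key not in doc.spans:
            counts["<missing key>"] += 1
            continue
        for span in doc.spans[spans_key]:
            # An unlabeled span has an empty string as its label_.
            counts[span.label_ or "<no label>"] += 1
    return counts


# Demo on an in-memory DocBin with one labeled and one unlabeled span.
nlp = spacy.blank("en")
doc = nlp("The deal was worth 42 million euros. It closed in May.")
doc.spans["sc"] = [Span(doc, 0, 8, label="HAS_DEAL_VALUE"), Span(doc, 8, 13)]

doc_bin = DocBin(docs=[doc])
restored = DocBin().from_bytes(doc_bin.to_bytes()).get_docs(nlp.vocab)
label_counts = count_span_labels(restored, "sc")
print(label_counts)
```

For files on disk, the same helper can be fed with `DocBin().from_disk("corpus/train.spacy").get_docs(nlp.vocab)`. A non-zero `<no label>` or `<missing key>` count would point at the conversion step rather than the training config.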