spaCy spancat won’t learn (zero F-score) while NER on same data scores 0.40 — Prodigy-generated KPI/target corpus #13861
Unanswered
marsy-41 asked this question in Help: Coding & Implementations
I am trying to train a spaCy v3.8.7 spancat model on ~100 sustainability reports (annotated with Prodigy) to extract KPIs and targets.
An NER pipeline trained on the same data reaches F ≈ 0.40, but spancat stays at 0.00 no matter what I try.
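For the NER baseline I converted the same annotations to doc.ents, roughly like this (a sketch; the JSONL path is shortened here, and filter_spans drops overlapping spans, which NER cannot represent):

import json, spacy
from pathlib import Path
from spacy.tokens import DocBin
from spacy.util import filter_spans

nlp = spacy.blank("en")
db = DocBin()
for rec in map(json.loads, Path("annotations.jsonl").read_text().splitlines()):
    doc = nlp.make_doc(rec["text"])
    spans = [
        doc.char_span(s["start"], s["end"], s["label"], alignment_mode="contract")
        for s in rec["spans"]
    ]
    # NER entities must be token-aligned and non-overlapping
    doc.ents = filter_spans([s for s in spans if s is not None])
    db.add(doc)
db.to_disk("train_ner.spacy")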

I ran the same code on an Nvidia V100 (32 GB) GPU node, so memory should not be the problem. I also tried the transformer-based spancat, with the same result.
The data also looks fine when inspected with spacy debug data:
============================ Data file validation ============================
✔ Pipeline can be initialized with data
✔ Corpus is loadable
=============================== Training stats ===============================
Language: en
Training pipeline: tok2vec, spancat
1604 training docs
401 evaluation docs
✔ No overlap between training and evaluation data
⚠ Low number of examples to train a new pipeline (1604)
============================== Vocab & Vectors ==============================
ℹ 144646 total word(s) in the data (11307 unique)
ℹ No word vectors present in the package
============================ Span Categorization ============================
Spans Key   Labels
sc          {'KPI', 'Target'}
ℹ Span characteristics for spans_key 'sc'
ℹ SD = Span Distinctiveness, BD = Boundary Distinctiveness
Span Type      Length    SD      BD      N
Target          11.89   0.77    1.46    447
KPI             15.80   0.60    1.29    885
Wgt. Average    14.48   0.66    1.35      -
ℹ Over 90% of spans have lengths of 1 -- 34 (min=1, max=162). The most
common span lengths are: 2 (1.8%), 3 (2.1%), 4 (2.48%), 5 (2.1%), 6 (2.7%), 7
(3.6%), 8 (3.38%), 9 (4.13%), 10 (5.33%), 11 (5.03%), 12 (3.53%), 13 (5.18%), 14
(3.83%), 15 (5.11%), 16 (4.43%), 17 (3.98%), 18 (4.5%), 19 (3.98%), 20 (3.15%),
21 (3.45%), 22 (2.25%), 23 (2.4%), 24 (2.48%), 25 (1.95%), 26 (1.35%), 29
(2.03%), 30 (2.1%), 31 (0.98%), 34 (0.98%). If you are using the n-gram
suggester, note that omitting infrequent n-gram lengths can greatly improve
speed and memory usage.
⚠ Spans may not be distinct from the rest of the corpus
✔ Boundary tokens are distinct from the rest of the corpus
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
================================== Summary ==================================
✔ 6 checks passed
⚠ 2 warnings
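Given the span statistics above (mean KPI length 15.8 tokens, max 162) and the n-gram range suggester in my config (min_size = 2, max_size = 12), I also checked what fraction of gold spans the suggester can propose at all, since the n-gram range suggester only ever generates candidates whose token length falls inside that window. A small diagnostic sketch, assuming the train.spacy produced by the script below (relative path for brevity):

import spacy
from spacy.tokens import DocBin

MIN_SIZE, MAX_SIZE = 2, 12  # values from [components.spancat.suggester]

nlp = spacy.blank("en")
docs = list(DocBin().from_disk("sample_training_data_span/train.spacy").get_docs(nlp.vocab))
lengths = [len(span) for doc in docs for span in doc.spans["sc"]]
covered = sum(MIN_SIZE <= n <= MAX_SIZE for n in lengths)
print(f"{covered}/{len(lengths)} gold spans ({covered/len(lengths):.0%}) "
      f"fit the suggester window [{MIN_SIZE}, {MAX_SIZE}]")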
Here is the code I use, as well as the config file. I played around with various suggester span sizes, both to (1) avoid memory overload and (2) fit our actual span lengths.
Python version: 3.12.11
1. create_docbin.py
# create_docbin.py ── Python ≥3.8 · spaCy ≥3.7
import json, random, spacy
from pathlib import Path
from spacy.tokens import DocBin

IN = Path("/Users/mgulenko/Dropbox/Research/KPIs/Python/spacy_spancat/annotations.jsonl")  # JSONL with {"text", "spans"}
OUT = Path("/Users/mgulenko/Dropbox/Research/KPIs/Python/spacy_spancat/sample_training_data_span")  # target dir
OUT.mkdir(parents=True, exist_ok=True)
SAMPLE, DEV_RATIO = 300, 0.20  # ≤300 docs → 80/20 split
random.seed(42)

nlp = spacy.blank("en")  # tokenizer only
docs = []
for rec in map(json.loads, IN.read_text().splitlines()):
    doc = nlp.make_doc(rec["text"])
    spans = [
        doc.char_span(s["start"], s["end"], s["label"], alignment_mode="contract")
        for s in rec["spans"]
    ]
    if None not in spans:  # drop the whole doc if any span fails token alignment
        doc.spans["sc"] = spans  # one span group called "sc"
        docs.append(doc)

if not docs:
    raise SystemExit("No valid examples")

sample = random.sample(docs, min(SAMPLE, len(docs)))
split = int(len(sample) * (1 - DEV_RATIO))
DocBin(docs=sample[:split]).to_disk(OUT / "train.spacy")
DocBin(docs=sample[split:]).to_disk(OUT / "dev.spacy")
print(f"{split} train · {len(sample)-split} dev docs → {OUT}")
2. config.cfg
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
[system]
gpu_allocator = null
seed = 0
[nlp]
lang = "en"
pipeline = ["tok2vec","spancat"]
batch_size = 5
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@Tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}
[components]
[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5
[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"
[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128
[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null
[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"
[components.spancat.suggester]
@misc = "spacy.ngram_range_suggester.v1"
min_size = 2
max_size = 12
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = false
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null
[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0
[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001
[training.score_weights]
spans_sc_f = 1.0
spans_sc_p = 0.0
spans_sc_r = 0.0
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
[initialize.tokenizer]
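Before training, I also resolve the config in Python to catch registry problems (for example a mistyped registry name such as "@tokenizers") early; a sketch using spaCy's config utilities:

from spacy.util import load_config, load_model_from_config

# Resolving [nlp] and [components] raises immediately if a registered
# function name cannot be found; it does not validate [training].
config = load_config("config.cfg")
nlp = load_model_from_config(config, auto_fill=True, validate=True)
print(nlp.pipe_names)  # expected: ['tok2vec', 'spancat']

(python -m spacy debug config config.cfg does a fuller validation of the whole file, including the training section.)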
3. train_model.py
import spacy
import torch  # not used directly; ensures torch is installed for gpu_allocator="pytorch"
from pathlib import Path
from spacy.cli.train import train

output_dir = Path("/Users/mgulenko/Dropbox/Research/KPIs/Python/spacy_spancat/training_data_span")
output_dir.mkdir(exist_ok=True)

train(
    config_path="config.cfg",
    output_path="training_data_span/output",
    overrides={
        "paths.train": str(output_dir / "train.spacy"),
        "paths.dev": str(output_dir / "dev.spacy"),
        "training.gpu_allocator": "pytorch",
    },
)
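After training finishes, I probe the best model directly to see whether spancat predicts anything at all. A quick sketch; the input sentence is made up, and as far as I understand spancat stores its per-span probabilities in the span group's attrs:

import spacy

nlp = spacy.load("training_data_span/output/model-best")
# Hypothetical input, just to probe the pipeline:
doc = nlp("We aim to cut Scope 1 emissions by 42% by 2030 against a 2020 baseline.")
for span in doc.spans["sc"]:
    print(span.label_, repr(span.text))
print(doc.spans["sc"].attrs.get("scores"))  # per-span scores, if present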
I attached 300 examples of our data, in case that helps. Any help would be greatly appreciated!
sample_100.json