accessing `meta.json` from a factory while packaging a pipeline #10675

DSLituiev · 2022-04-20T02:45:12Z

DSLituiev
Apr 20, 2022

I would like to access nlp.meta in a post-processing function / factory that I add to a pipeline.

I tried two ways of adding this component factory, using either spacy package or spacy assemble, and neither passes the right nlp object to it.

From within spacy package, the factory sees the wrong meta, even if I provide the meta file with the -m argument.

I have a post-processing pipe defined in a separate code file:

postprocess.py:

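import sys

from spacy.language import Language


@Language.factory("pathology_cui")
class Pathology:

    def __init__(self, nlp: Language, name: str = "pathology_cui",
                 #log_level: StrictStr
                ):
        self.data = {}
        self.margin = 7
        # Debug output: show which nlp object and meta the factory receives.
        print(f"NLP: {nlp}", file=sys.stderr)
        print(f"nlp.meta: {nlp.meta}", file=sys.stderr)
        # Per-entity-type scores that training writes to meta.json.
        self.performance = nlp.meta["performance"]["ents_per_type"]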
   ...

To add it to a pre-trained NER pipeline, I load the pipeline, add a pathology_cui step that accesses the nlp.meta["performance"] data of the NER pipe, and save the result to a new folder:

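import os
from pathlib import Path

import spacy
from postprocess import Pathology  # importing registers the factory

input_dir = ...
output_dir = ...

nlp = spacy.load(input_dir)
data_path = Path("data")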

pathology_cui = nlp.add_pipe("pathology_cui", last=True)
pathology_cui.from_disk(data_path)

os.makedirs(f"{output_dir}/pathology_cui", exist_ok = True)
nlp.to_disk(output_dir)

Manual assembly

Here the first step works, but packaging fails:

import os
import re
import sys
from pathlib import Path
import spacy

from postprocess import Pathology

input_dir = ...
output_dir = ...

nlp = spacy.load(input_dir)
data_path = Path("data")

pathology_cui = nlp.add_pipe("pathology_cui", last=True)
pathology_cui.from_disk(data_path)

os.makedirs(f"{output_dir}/pathology_cui", exist_ok = True)
nlp.to_disk(output_dir)

This step works, and I can verify that the nlp.meta["performance"] entry exists.

However, when I try to package the result, I get an exception because the pathology_cui factory receives an nlp.meta that has no performance entry.

NLP: <spacy.lang.en.English object at 0x7fdad19613a0>
nlp.meta: {'lang': 'en', 'name': 'pipeline', 'version': '0.0.0', 'spacy_version': '>=3.2.1,<3.3.0', 'description': '', 'author': '', 'email': '', 'url': '', 'license': '', 'spacy_git_version': '800737b41', 'vectors': {'width': 0, 'vectors': 0, 'keys': 0, 'name': None, 'mode': 'default'}, 'labels': {'tok2vec': [], 'ner': []}, 'pipeline': ['tok2vec', 'ner'], 'components': ['tok2vec', 'ner'], 'disabled': []}
Traceback (most recent call last):
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/spacy/cli/_util.py", line 71, in setup_cli
    command(prog_name=COMMAND)
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/spacy/cli/package.py", line 48, in package_cli
    package(
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/spacy/cli/package.py", line 104, in package
    meta = get_meta(input_dir, meta)
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/spacy/cli/package.py", line 275, in get_meta
    nlp = util.load_model_from_path(Path(model_path))
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/spacy/util.py", line 488, in load_model_from_path
    nlp = load_model_from_config(config, vocab=vocab, disable=disable, exclude=exclude)
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/spacy/util.py", line 525, in load_model_from_config
    nlp = lang_cls.from_config(
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/spacy/language.py", line 1714, in from_config
    orig_pipeline = config.pop("components", {})
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/spacy/language.py", line 789, in add_pipe
    pipe_component = self.create_pipe(
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/spacy/language.py", line 671, in create_pipe
    resolved = registry.resolve(cfg, validate=validate)
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/thinc/config.py", line 729, in resolve
    resolved, _ = cls._make(
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/thinc/config.py", line 778, in _make
    filled, _, resolved = cls._fill(
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/thinc/config.py", line 850, in _fill
    getter_result = getter(*args, **kwargs)
  File "postprocess.py", line 54, in __init__
    self.performance = nlp.meta["performance"]["ents_per_type"]
KeyError: 'performance'

Declarative assembly

I saw that there is assemble functionality, but I am not quite sure how to point the config to a pre-trained NER model. I've tried the following in config-pathology_cui.cfg:

[components]

[components.ner]
source="ucsf_tumor_pathology_raw"

[components.pathology_cui]
factory = "pathology_cui"

CFG=config-pathology_cui.cfg

python -m spacy assemble \
    $CFG \
    $OUTPUT_DIR

Error

Traceback (most recent call last):
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/spacy/cli/_util.py", line 71, in setup_cli
    command(prog_name=COMMAND)
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/spacy/cli/assemble.py", line 45, in assemble_cli
    nlp = load_model_from_config(config, auto_fill=True)
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/spacy/util.py", line 525, in load_model_from_config
    nlp = lang_cls.from_config(
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/spacy/language.py", line 1780, in from_config
    nlp.add_pipe(
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/spacy/language.py", line 789, in add_pipe
    pipe_component = self.create_pipe(
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/spacy/language.py", line 671, in create_pipe
    resolved = registry.resolve(cfg, validate=validate)
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/thinc/config.py", line 729, in resolve
    resolved, _ = cls._make(
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/thinc/config.py", line 778, in _make
    filled, _, resolved = cls._fill(
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/thinc/config.py", line 850, in _fill
    getter_result = getter(*args, **kwargs)
  File "/Users/dlituiev/anaconda3/envs/nlp3.9/lib/python3.9/site-packages/en_ucsf_tumor_pathology_raw/postprocess.py", line 54, in __init__
    self.performance = nlp.meta["performance"]["ents_per_type"]
KeyError: 'performance'

adrianeboyd · 2022-04-20T09:44:12Z

adrianeboyd
Apr 20, 2022

I'm not entirely sure I understand what you're trying to do with nlp.meta, but I think a callback would probably make more sense than a custom component. There are three pipeline creation callbacks and two for initialization (https://spacy.io/usage/training#custom-code-nlp-callbacks), plus one for training (https://spacy.io/api/data-formats/#config-training).

2 replies

DSLituiev Apr 20, 2022
Author

Thank you for your response. I would like to take per-class performance scores and append them to the respective entities.
This requires an nlp.meta that is produced as a result of training, so I don't see how a training or initialization callback would help here.

What I was counting on is that the nlp within the factory's __init__ would be the pre-trained NER pipeline. In my config, the order is [nlp] pipeline = ["tok2vec","ner","pathology_cui"], so I assumed the order is correct. When I manually assemble the pipeline, I also put the pathology_cui factory after the ner. Am I missing something about the ordering here?

adrianeboyd Apr 21, 2022

I think I'd need to see a minimal example to understand the details about what "appending to an entity" means, but it's possible if you store self.meta = nlp.meta and move the code above from __init__ to __call__ you'll have access to the full nlp.meta, which hasn't been deserialized at the point when __init__ is called.

Here's more about the order things happen in: https://spacy.io/usage/training/#config-lifecycle
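
The ordering can be illustrated without spaCy at all. The following is a toy simulation, not spaCy's actual loader: ToyPipeline, EagerComponent, LazyComponent, and the TUMOR label with its score are hypothetical, and the sketch assumes, as described above, that components are constructed first and the contents of meta.json are merged into the existing meta dict afterwards.

```python
class EagerComponent:
    """Reads the performance scores in __init__, before meta.json is loaded."""

    def __init__(self, meta):
        self.performance = meta["performance"]["ents_per_type"]


class LazyComponent:
    """Keeps a reference to the meta dict and reads it only when called."""

    def __init__(self, meta):
        self.meta = meta

    def __call__(self):
        return self.meta["performance"]["ents_per_type"]


class ToyPipeline:
    """Hypothetical stand-in for a loader: build components, then load meta."""

    def __init__(self, component_cls, saved_meta):
        self.meta = {"lang": "en", "name": "pipeline"}  # defaults only
        self.component = component_cls(self.meta)  # step 1: __init__ runs
        self.meta.update(saved_meta)  # step 2: meta.json is merged in place


# Made-up scores standing in for what training would write to meta.json.
saved_meta = {"performance": {"ents_per_type": {"TUMOR": {"f": 0.9}}}}

try:
    ToyPipeline(EagerComponent, saved_meta)
    eager_error = None
except KeyError as err:
    eager_error = err  # same failure mode as in the tracebacks above

lazy = ToyPipeline(LazyComponent, saved_meta)
scores = lazy.component()  # sees the merged meta because the dict is shared
```

The lazy variant only works because the loader in this toy updates the same dict object rather than replacing it; a component holding a reference to a dict that is later swapped out would still see the defaults.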

DSLituiev · 2022-04-21T18:05:03Z

DSLituiev
Apr 21, 2022
Author

Thank you for your patience. I've created a gist with all code here, except the trained NER pipeline that I am trying to augment (I am working with my institutional privacy office to release that one as well).

> I'd need to see a minimal example to understand the details about what "appending to an entity" means

Appending performance metrics to the entities as custom Span._.metric fields happens in postprocess.py:L110.

> but it's possible if you store self.meta = nlp.meta and move the code above from __init__ to __call__ you'll have access to the full nlp.meta

I am currently reading nlp.meta in __init__ (which fails when using the command-line tools). Do you suggest taking the nlp object from the incoming doc argument in __call__, with something like performance = doc.nlp.meta["performance"]["ents_per_type"]? That code does not work verbatim. Is there a way to get the parent nlp object from a Doc?

The current version works when I run assemble-manual.py to assemble the pipeline in Python. However, when I try to package it, or assemble it with spacy assemble config-assemble.cfg, I get the error I reported above.

2 replies

adrianeboyd Apr 22, 2022

What I meant was to store a reference to the nlp.meta object in __init__ and then only access the intended values in __call__. meta.json is deserialized after all the pipeline objects are initialized.

from spacy.language import Language


@Language.factory("meta_reader")
class MetaReader:
    def __init__(self, nlp, name):
        self.meta = nlp.meta
        print("init", self.meta["labels"])

    def __call__(self, doc):
        print("call", self.meta["labels"])
        return doc

With en_core_web_sm:

init {'ner': []}
call {'ner': ['CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART']}
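
Applied to the component from the original question, the same pattern might look like the sketch below. It is an adaptation under stated assumptions, not the code from the gist: the factory name pathology_cui_lazy, the class name LazyPathology, and the scores assigned to nlp.meta by hand are illustrative, and the pipeline is a blank English one rather than the trained NER model.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Span

# Custom attribute that will hold the per-type scores for each entity.
if not Span.has_extension("metric"):
    Span.set_extension("metric", default=None)


@Language.factory("pathology_cui_lazy")  # hypothetical factory name
class LazyPathology:
    def __init__(self, nlp: Language, name: str = "pathology_cui_lazy"):
        self.name = name
        # Only keep a reference here; meta.json may not be loaded yet.
        self.meta = nlp.meta

    def __call__(self, doc):
        # Look the scores up at call time, when the meta is complete.
        per_type = self.meta.get("performance", {}).get("ents_per_type", {})
        for ent in doc.ents:
            ent._.metric = per_type.get(ent.label_)
        return doc


nlp = spacy.blank("en")
component = nlp.add_pipe("pathology_cui_lazy", last=True)

# Stand-in for the scores that training writes to meta.json (made-up values).
nlp.meta["performance"] = {
    "ents_per_type": {"ORG": {"p": 0.9, "r": 0.8, "f": 0.85}}
}

# Set an entity by hand, since the blank pipeline has no trained NER.
doc = nlp.make_doc("Acme hired Bob")
doc.set_ents([Span(doc, 0, 1, label="ORG")])
doc = component(doc)
metric = doc.ents[0]._.metric
```

Using .get() with empty defaults means a pipeline without performance data leaves Span._.metric as None instead of raising a KeyError; whether that silent fallback is desirable depends on the use case.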

DSLituiev Apr 22, 2022
Author

Thank you. That worked.
