Description:
I'm encountering a problem with sentence segmentation when integrating spacy_llm components into a spaCy pipeline that is based on en_core_web_trf.
Observed Behavior:
- Sentence segmentation fails when spacy_llm components are added to the pipeline.
- The issue does not occur when using spacy_llm components in a blank pipeline.
Environment:
Using the latest versions of spaCy and en_core_web_trf.
- Config File (Example):

```ini
[paths]
examples = "examples.json"

[nlp]
lang = "en"
pipeline = ["transformer", "tagger", "parser", "lemmatizer", "llm", "llm_rel"]

[components]

[components.transformer]
source = "en_core_web_trf"

[components.tagger]
source = "en_core_web_trf"

[components.parser]
source = "en_core_web_trf"

[components.lemmatizer]
source = "en_core_web_trf"

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v3"
labels = ["DISH", "INGREDIENT", "EQUIPMENT", "PERSON", "LOCATION"]
description = "Entities are the names of food dishes, ingredients, and any kind of cooking equipment. Adjectives, verbs, adverbs are not entities. Pronouns are not entities."

[components.llm.task.label_definitions]
DISH = "Known food dishes, e.g. Lobster Ravioli, garlic bread"
INGREDIENT = "Individual parts of a food dish, including herbs and spices."
EQUIPMENT = "Any kind of cooking equipment, e.g. oven, cooking pot, grill"

[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "examples.json"

[components.llm.model]
@llm_models = "spacy.Ollama.3.1.8b"

[components.llm_rel]
factory = "llm_rel"

[components.llm_rel.task]
@llm_tasks = "spacy.REL.v1"
labels = "LivesIn,Visits"

[components.llm_rel.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "examples.jsonl"

[components.llm_rel.model]
@llm_models = "spacy.Ollama.3.1.8b"
```
Steps to Reproduce:
- Load the en_core_web_trf pipeline with the modified config.
- Process a text with the modified pipeline.
- Observe the lack of sentence segmentation.
Troubleshooting:
- Tried explicitly adding sentencizer to the pipeline.
- Experimented with different component orders.
- Verified config loading process.
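To see which components require `doc.sents` (and whether anything upstream actually assigns it), spaCy's `nlp.analyze_pipes` can be run on the assembled pipeline. A minimal sketch using a blank pipeline, which stands in for the trf-based one just to illustrate the call:

```python
import spacy

# Blank pipeline standing in for the assembled one; the same call works on any nlp object.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# analyze_pipes reports, per component, which Doc attributes it assigns and requires,
# and lists unmet requirements under "problems".
analysis = nlp.analyze_pipes(pretty=False)
print(analysis["summary"]["sentencizer"]["assigns"])  # includes "doc.sents"
print(analysis["problems"])
```

Running this on the real assembled pipeline should show whether the llm components declare a requirement on `doc.sents` that nothing before them satisfies.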
If I run this code:

```python
self.nlp = spacy.load('en_core_web_trf')
self.nlp = assemble(config_path=self.config_path, overrides={"paths.examples": str(self.examples_path)})
print("config: ", self.nlp.config.to_str())
print("PIPELINE: ", self.nlp.pipeline)
```
- It gives me the following pipeline configuration:

```
PIPELINE: [('transformer', <spacy_curated_transformers.pipeline.transformer.CuratedTransformer object at 0x7f2b36231960>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x7f2b64d6e1a0>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7f2af922fed0>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7f2b34e629c0>), ('llm', <spacy_llm.pipeline.llm.LLMWrapper object at 0x7f2b2ae4e2c0>), ('llm_rel', <spacy_llm.pipeline.llm.LLMWrapper object at 0x7f2b52016740>)]
```
But while processing text, it gives me the following error:

```
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe('sentencizer'). Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting doc[i].is_sent_start.
```
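One workaround that might be worth a try: insert a rule-based `sentencizer` ahead of the llm components so that `doc.sents` is populated independently of the parser. I've only verified the segmentation itself on a blank pipeline; whether the llm components then pick up the boundaries in the assembled trf pipeline is an assumption. A sketch:

```python
import spacy

# Sketch on a blank pipeline. On the assembled pipeline the equivalent would be
# nlp.add_pipe("sentencizer", before="llm"), where "llm" is the component name
# from the config above -- that placement is an assumption, not a verified fix.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("Boil the pasta. Sear the lobster in a cast-iron pan.")
sents = [s.text for s in doc.sents]
print(sents)  # two sentences, so iterating doc.sents no longer raises E030
```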
I would appreciate any guidance or assistance in resolving this issue. Thank you!