Description:
I'm encountering a problem with sentence segmentation when integrating spacy_llm components into a spaCy pipeline that is based on en_core_web_trf.
Observed Behavior:
- Sentence segmentation fails when spacy_llm components are added to the pipeline.
- The issue does not occur when using spacy_llm components in a blank pipeline.
Environment:
Using the latest versions of spaCy and en_core_web_trf.
- Config File (Example):

```ini
[paths]
examples = "examples.json"

[nlp]
lang = "en"
pipeline = ["transformer", "tagger", "parser", "lemmatizer", "llm", "llm_rel"]

[components]

[components.transformer]
source = "en_core_web_trf"

[components.tagger]
source = "en_core_web_trf"

[components.parser]
source = "en_core_web_trf"

[components.lemmatizer]
source = "en_core_web_trf"

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v3"
labels = ["DISH", "INGREDIENT", "EQUIPMENT", "PERSON", "LOCATION"]
description = "Entities are the names of food dishes, ingredients, and any kind of cooking equipment. Adjectives, verbs, adverbs are not entities. Pronouns are not entities."

[components.llm.task.label_definitions]
DISH = "Known food dishes, e.g. Lobster Ravioli, garlic bread"
INGREDIENT = "Individual parts of a food dish, including herbs and spices."
EQUIPMENT = "Any kind of cooking equipment, e.g. oven, cooking pot, grill"

[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "examples.json"

[components.llm.model]
@llm_models = "spacy.Ollama.3.1.8b"

[components.llm_rel]
factory = "llm_rel"

[components.llm_rel.task]
@llm_tasks = "spacy.REL.v1"
labels = "LivesIn,Visits"

[components.llm_rel.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "examples.jsonl"

[components.llm_rel.model]
@llm_models = "spacy.Ollama.3.1.8b"
```
Steps to Reproduce:
- Load the en_core_web_trf pipeline with the modified config.
- Process a text with the modified pipeline.
- Observe the lack of sentence segmentation.
Troubleshooting:
- Tried explicitly adding sentencizer to the pipeline.
- Experimented with different component orders.
- Verified config loading process.
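To see which components require `doc.sents` (and whether anything upstream actually assigns it), spaCy's `nlp.analyze_pipes` can be run on the assembled pipeline. A minimal sketch using a blank pipeline, which stands in for the trf-based one just to illustrate the call:

```python
import spacy

# Blank pipeline standing in for the assembled one; the same call works on any nlp object.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# analyze_pipes reports, per component, which Doc attributes it assigns and requires,
# and lists unmet requirements under "problems".
analysis = nlp.analyze_pipes(pretty=False)
print(analysis["summary"]["sentencizer"]["assigns"])  # includes "doc.sents"
print(analysis["problems"])
```

Running this on the real assembled pipeline should show whether the llm components declare a requirement on `doc.sents` that nothing before them satisfies.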
If I run this code:

```python
self.nlp = spacy.load('en_core_web_trf')
self.nlp = assemble(config_path=self.config_path, overrides={"paths.examples": str(self.examples_path)})
print("config: ", self.nlp.config.to_str())
print("PIPELINE: ", self.nlp.pipeline)
```
- It gives me the following pipeline configuration:

```
PIPELINE: [('transformer', <spacy_curated_transformers.pipeline.transformer.CuratedTransformer object at 0x7f2b36231960>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x7f2b64d6e1a0>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7f2af922fed0>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7f2b34e629c0>), ('llm', <spacy_llm.pipeline.llm.LLMWrapper object at 0x7f2b2ae4e2c0>), ('llm_rel', <spacy_llm.pipeline.llm.LLMWrapper object at 0x7f2b52016740>)]
```
But while processing text, it gives me the following error:

```
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe('sentencizer'). Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting doc[i].is_sent_start.
```
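One workaround that might be worth a try: insert a rule-based `sentencizer` ahead of the llm components so that `doc.sents` is populated independently of the parser. I've only verified the segmentation itself on a blank pipeline; whether the llm components then pick up the boundaries in the assembled trf pipeline is an assumption. A sketch:

```python
import spacy

# Sketch on a blank pipeline. On the assembled pipeline the equivalent would be
# nlp.add_pipe("sentencizer", before="llm"), where "llm" is the component name
# from the config above -- that placement is an assumption, not a verified fix.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("Boil the pasta. Sear the lobster in a cast-iron pan.")
sents = [s.text for s in doc.sents]
print(sents)  # two sentences, so iterating doc.sents no longer raises E030
```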
I would appreciate any guidance or assistance in resolving this issue. Thank you!