Setting span.id_ for doc.sents spans #13835

ghinch · 2025-06-20T15:32:46Z

I want to set a unique ID for each of the sentence spans in doc.sents. However, if I iterate over doc.sents and set a value on each span, the value does not persist. Is there a better way to do this? E.g.:

for sent in doc.sents:
  sent.id_ = "<unique_id>"

...

for sent in doc.sents:
  print(sent.id_)

> ""

Answered by weezymatt

weezymatt · 2025-06-21T23:19:30Z

Hi @ghinch,

There is no built-in attribute for sentences. However, the key point is that doc.sents yields Span objects, so you can attach the ID to the corresponding Span for each sentence if you need to store it within a spaCy object. A similar issue was answered here by one of spaCy's maintainers.

The spaCy documentation on custom extension attributes explains how to register new attributes. Here is an implementation that uses an extension attribute:

import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

Span.set_extension("unique_id", default=-1)
doc = nlp("This is sentence one. This is sentence two.")

for sent_i, sent in enumerate(doc.sents):
    sent._.unique_id = sent_i

You can change the ID value to whatever fits your application. This solution works because sentence segmentation is stable for a given doc, so every pass over doc.sents yields spans covering the same slices, and the extension value is looked up by slice rather than by the individual Span object. You can verify this by iterating over the sentences again:

for sent in doc.sents:
    print(sent._.unique_id)

Output:

0
1
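To make the difference with id_ concrete, here is a minimal sketch. It assumes a blank English pipeline with the rule-based sentencizer instead of en_core_web_sm (so no model download is needed), and the variable names are only illustrative. As far as I can tell, each pass over doc.sents builds new Span objects, which is why a value set on Span.id_ is gone on the next pass, while extension values are kept in doc.user_data:

```python
import spacy
from spacy.tokens import Span

# Assumption: blank pipeline + sentencizer stands in for en_core_web_sm
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# force=True so the snippet can be re-run in the same session
Span.set_extension("unique_id", default=-1, force=True)
doc = nlp("This is sentence one. This is sentence two.")

first_pass = list(doc.sents)
second_pass = list(doc.sents)
# Two passes give two different Python objects for the same sentence
fresh_objects = first_pass[0] is not second_pass[0]

# Set both the built-in id_ and the extension on the first sentence
first_pass[0].id_ = "sent-0"
first_pass[0]._.unique_id = 0

# Look at the same sentence again through a new pass over doc.sents
later = list(doc.sents)[0]
id_seen_later = later.id_            # set on the old Span object only
ext_seen_later = later._.unique_id   # read back from the Doc

# The extension value lives on the Doc, under a key that mentions its name
stored_on_doc = any("unique_id" in key for key in doc.user_data)
```

Here id_seen_later comes back empty, matching what you saw, while ext_seen_later is still 0.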

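Now, the preferred solution combines extension attributes with nlp.add_pipe:

import spacy
from spacy.language import Language
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

# force=True avoids an error if "unique_id" was already registered above
Span.set_extension("unique_id", default=-1, force=True)

# Custom component that numbers the sentences of every processed doc
@Language.component("label_sents")
def assign_sentence_ids(doc):
    for sent_i, sent in enumerate(doc.sents):
        sent._.unique_id = sent_i
    return doc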

Then you can run:

nlp.add_pipe("label_sents", last=True)

doc = nlp("Sentence one. Sentence two.")
for sent in doc.sents:
    print(sent._.unique_id)

Of course, you can also keep the IDs outside of spaCy, for example in a Python list or a pandas DataFrame. Hope this was helpful.
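For the plain-Python route, a rough sketch could look like the following. The "sent-N" ID format, the record fields, and the use of a blank pipeline with the sentencizer are all my own assumptions for illustration. The idea is to store each sentence's character offsets next to its ID and rebuild the Span with doc.char_span when needed:

```python
import spacy

# Assumption: blank pipeline + sentencizer stands in for en_core_web_sm
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
doc = nlp("This is sentence one. This is sentence two.")

# One record per sentence; the "sent-N" ID scheme is hypothetical
sentence_records = [
    {
        "id": f"sent-{i}",
        "start_char": sent.start_char,
        "end_char": sent.end_char,
        "text": sent.text,
    }
    for i, sent in enumerate(doc.sents)
]

# Index the records by ID for direct lookup
by_id = {rec["id"]: rec for rec in sentence_records}

# Recover the Span for a given ID from its stored character offsets
rec = by_id["sent-1"]
span = doc.char_span(rec["start_char"], rec["end_char"])
print(span.text)
```

The same list of dicts can be passed to pandas.DataFrame(sentence_records) if you prefer a table.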

2 replies

ghinch Jun 22, 2025
Author

@weezymatt many thanks for this thorough response. Is there a reason to not use the Span.id_ attribute for this unique id? I have not seen id_ have a value set by anything else in my implementation so far.

weezymatt Jun 22, 2025

@ghinch you're very welcome! The limited documentation and source code (see __cinit__) lead me to believe the main use involves categorical information.

One reason not to use the Span.id_ attribute is that setting it adds the string to doc.vocab.strings and stores the resulting hash on the span. This is certainly valid for coreference or nested entities. If your use case involves the entire span (i.e., a sentence after segmentation), extension attributes are a better fit. Otherwise, you would be creating a new string store entry and hash for each sentence, which is inefficient compared to extension attributes.

This is also based on my limited research, so I may have missed something.
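A small sketch of what I mean by the hashing, again assuming a blank pipeline with the sentencizer and a made-up ID string. Setting id_ should register the string in the shared string store, and Span.id then holds the corresponding hash:

```python
import spacy

# Assumption: blank pipeline + sentencizer stands in for en_core_web_sm
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
doc = nlp("This is sentence one. This is sentence two.")

# Hypothetical ID string that does not occur anywhere in the text
label = "hypothetical-sentence-id-0"
known_before = label in doc.vocab.strings

sent = next(doc.sents)
sent.id_ = label  # adds the string to doc.vocab.strings

known_after = label in doc.vocab.strings
# Span.id is the integer hash of the string set through id_
same_hash = sent.id == doc.vocab.strings[label]
```

With one distinct ID per sentence, every ID becomes its own entry in the vocab's string store, which is the overhead I was referring to.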
