Connect list of spacy.tokens.span.Span in one spacy.tokens.doc.Doc #10641
-
Hi,

**Question**

How can I connect a list of `spacy.tokens.span.Span` objects into one `spacy.tokens.doc.Doc`?

**Example**

**Initialization**

For splitting I use a pipe installed on `nlp`:

```python
import spacy
from spacy_langdetect import LanguageDetector

self.spacy_languages = {
    'ru': 'ru_core_news_sm',
    'en': 'en_core_web_sm'
}
self.lemmatizers = {key: spacy.load(value, disable=['parser', 'ner'])
                    for key, value in self.spacy_languages.items()}

# Register the language detector as a spaCy pipeline component factory
spacy.Language.factory("language_detector", func=lambda nlp, name: LanguageDetector())
self.lemmatizers['en'].add_pipe('sentencizer')
self.lemmatizers['en'].add_pipe("language_detector", last=True)
```

**Tokenizing**

Then I tokenize the text and create a dict mapping each detected language to its sentences:

```python
tokens = self.lemmatizers['en'](text)
texts = {}
for sent in tokens.sents:
    lang = sent._.language['language']
    if lang in texts:
        texts[lang].append(sent)
    else:
        texts[lang] = [sent]
```

**Post-processing**

So as output I have lists of `Span` objects grouped by language. I run the models on all languages except 'en', because the 'en' text is already lemmatized:

```python
for lang, sents in texts.items():
    if lang not in self.lemmatizers.keys():
        raise UnknownLanguageError(lang, self.lemmatizers.keys())
    if lang == 'en':
        raise NotImplementedError("https://github.com/explosion/spaCy/discussions/10641")
    else:
        texts[lang] = self.lemmatizers[lang](' '.join([sent.text for sent in sents]))
return texts
```
-
First note you can't join Spans/Docs from different languages. That doesn't seem to be an issue here, but I want to be clear.

What you can do is convert your Spans to Docs with `Span.as_doc` and then combine the Docs with `Doc.from_docs`. If you run into speed problems, be careful to check the documentation of those functions for tips on that.
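Here's a minimal sketch of that approach. It uses a blank English pipeline with a sentencizer so it runs without a model download; the sample text is just a placeholder, and in the setup above you'd use the relevant entry from `self.lemmatizers` instead. Note that all Docs passed to `Doc.from_docs` must share the same `Vocab`:

```python
import spacy
from spacy.tokens import Doc

# Blank pipeline + sentencizer just for this sketch; in the setup above
# this would be one of the language-specific pipelines in self.lemmatizers.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("First sentence. Second sentence. Third sentence.")
spans = list(doc.sents)  # the Spans to recombine

# Span.as_doc copies each sentence Span into its own standalone Doc ...
docs = [span.as_doc() for span in spans]
# ... and Doc.from_docs concatenates Docs that share the same Vocab.
combined = Doc.from_docs(docs, ensure_whitespace=True)
print(combined.text)
```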
However, note that I'm not sure the above has much advantage over just joining strings and running the pipeline again, if your language detection pipeline is configured to run a minimum number of components. See the speed FAQ #8402 for notes about disabling components you aren't using.
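For the string-joining alternative, `nlp.select_pipes` is the standard way to temporarily disable components while reprocessing. Again a sketch, continuing from the snippet above; which components you disable depends on your pipeline, and `"sentencizer"` here is just an example:

```python
# Re-run the pipeline on the joined text, with components you don't
# need temporarily disabled (see the speed FAQ for more on this).
joined = " ".join(span.text for span in spans)
with nlp.select_pipes(disable=["sentencizer"]):
    redone = nlp(joined)
print(redone.text)
```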