Connect list of spacy.tokens.span.Span in one spacy.tokens.doc.Doc #10641
-
Hi,

**Question**

How can I connect a list of `spacy.tokens.span.Span` objects into one `spacy.tokens.doc.Doc`?

**Example**

**Initialization**

For splitting I use a pipe installed on `nlp`:

```python
import spacy
from spacy_langdetect import LanguageDetector

self.spacy_languages = {
    'ru': 'ru_core_news_sm',
    'en': 'en_core_web_sm'
}
self.lemmatizers = {key: spacy.load(value, disable=['parser', 'ner'])
                    for key, value in self.spacy_languages.items()}

# Register the language detector as a spaCy pipeline component factory
spacy.Language.factory("language_detector", func=lambda nlp, name: LanguageDetector())
self.lemmatizers['en'].add_pipe('sentencizer')
self.lemmatizers['en'].add_pipe("language_detector", last=True)
```

**Tokenizing**

Then I tokenize the text and create a dict mapping each detected language to its sentences:

```python
tokens = self.lemmatizers['en'](text)
texts = {}
for sent in tokens.sents:
    lang = sent._.language['language']
    if lang in texts:
        texts[lang].append(sent)
    else:
        texts[lang] = [sent]
```

**Post-processing**

So as output I have lists of `Span` objects grouped by language. I run the models on all languages except 'en', because the 'en' text is already lemmatized:

```python
for lang, sents in texts.items():
    if lang not in self.lemmatizers.keys():
        raise UnknownLanguageError(lang, self.lemmatizers.keys())
    if lang == 'en':
        raise NotImplementedError("https://github.com/explosion/spaCy/discussions/10641")
    else:
        texts[lang] = self.lemmatizers[lang](' '.join([sent.text for sent in sents]))
return texts
```
-
First note you can't join Spans/Docs from different languages. That doesn't seem to be an issue here, but I want to be clear.

What you can do is convert your Spans to Docs with `Span.as_doc` and then combine the Docs with `Doc.from_docs`. If you run into speed problems, be careful to check the documentation of those functions for tips on that.
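Here's a minimal sketch of that approach. It uses a blank English pipeline with a sentencizer so it runs without a model download; the sample text is just a placeholder, and in the setup above you'd use the relevant entry from `self.lemmatizers` instead. Note that all Docs passed to `Doc.from_docs` must share the same `Vocab`:

```python
import spacy
from spacy.tokens import Doc

# Blank pipeline + sentencizer just for this sketch; in the setup above
# this would be one of the language-specific pipelines in self.lemmatizers.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("First sentence. Second sentence. Third sentence.")
spans = list(doc.sents)  # the Spans to recombine

# Span.as_doc copies each sentence Span into its own standalone Doc ...
docs = [span.as_doc() for span in spans]
# ... and Doc.from_docs concatenates Docs that share the same Vocab.
combined = Doc.from_docs(docs, ensure_whitespace=True)
print(combined.text)
```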
However, note that I'm not sure the above has much advantage over just joining strings and running the pipeline again, if your language detection pipeline is configured to run a minimum number of components. See the speed FAQ #8402 for notes about disabling components you aren't using.
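For the string-joining alternative, `nlp.select_pipes` is the standard way to temporarily disable components while reprocessing. Again a sketch, continuing from the snippet above; which components you disable depends on your pipeline, and `"sentencizer"` here is just an example:

```python
# Re-run the pipeline on the joined text, with components you don't
# need temporarily disabled (see the speed FAQ for more on this).
joined = " ".join(span.text for span in spans)
with nlp.select_pipes(disable=["sentencizer"]):
    redone = nlp(joined)
print(redone.text)
```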