Add option to TextSplitter to return individual sentences. Adding general SaT model support. #408

hhuangMITRE · 2025-09-23T20:25:25Z

Issues:

Add option to TextSplitter to return individual sentences openmpf#1965

Related PRs:

Add option to TextSplitter to return individual sentences. Adding general SaT model support. openmpf-python-component-sdk#93


jrobble

@jrobble reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @hhuangMITRE)

a discussion (no related file):
Mention SaT here:

    def split_input_text(self, text: str, from_lang: Optional[str],
                         from_lang_confidence: Optional[float]) -> SplitTextResult:
        """
        Splits up the given text in to chunks that are under TranslationClient.DETECT_MAX_CHARS.
        Each chunk will contain one or more complete sentences as reported
        by the (WtP or spaCy) sentence splitter.
        """

Mention SaT here:


class SentenceSplitter:
    """
    Class to divide large sections of text at sentence breaks using WtP and spaCy.
    It is only used when the text to translate exceeds
    the translation endpoint's character limit.
    """
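For readers following the docstrings quoted above, the chunking they describe (each chunk holds one or more complete sentences and stays under a character limit) can be sketched roughly as below. This is only an illustration, not the component's code: `pack_sentences` and `max_chars` are hypothetical names, and the sentences are assumed to come from whichever splitter (SaT, WtP, or spaCy) is configured.

```python
from typing import Iterable, List


def pack_sentences(sentences: Iterable[str], max_chars: int) -> List[str]:
    """Greedily join whole sentences into chunks of at most max_chars characters.

    Hypothetical helper for illustration. A single sentence longer than
    max_chars is emitted on its own as an oversized chunk; real code would
    need a fallback for that case.
    """
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Try to extend the current chunk with the next whole sentence.
        candidate = sentence if not current else current + " " + sentence
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # The sentence does not fit: close the chunk and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

With a limit of 7 characters, `pack_sentences(["Aa.", "Bb.", "Cc."], 7)` packs the first two sentences together and leaves the third in its own chunk.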

a discussion (no related file):
Once the NLLB component lands, include it in this PR. It will need to be updated to mention SaT.

python/AzureTranslation/README.md line 108 at r1 (raw file):

  More advanced SaT/WtP models that use GPU resources (up to ~8 GB for WtP) are also available. See list of
  model names
  [here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#available-models). The

This link only lists SaT. Also include a link to an older release for WtP models.

python/AzureTranslation/README.md line 112 at r1 (raw file):

  Review list of languages supported by SaT/WtP
  [here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#supported-languages).

I think this link is only for SaT. Also include a link to an older release for WtP model languages.

jrobble · 2025-09-26T15:41:11Z

As I mentioned in a separate chat, single-sentence splitting in NLLB with wtp-bert-mini takes this:

pt_text="""Teimam de facto estes em que são indispensaveis os vividos raios do
nosso desanuviado sol, ou a face desassombrada da lua no firmamento
peninsular, onde não tem, como a de Londres--_a romper a custo um
plumbeo céo_--para verterem alegrias na alma e mandarem aos semblantes o
reflexo d'ellas; imaginam fatalmente perseguidos de _spleen_,
irremediavelmente lugubres e soturnos, como se a cada momento saíssem
das galerias subterraneas de uma mina de _pit-coul_, os nossos alliados
inglezes.

Como se enganam ou como pretendem enganar-nos!

É esta uma illusão ou má fé, contra a qual ha muito reclama debalde a
indelevel e accentuada expressão de beatitude, que transluz no rosto
illuminado dos homens de além da Mancha, os quaes parece caminharem
entre nós, envolvidos em densa atmosphera de perenne contentamento,
satisfeitos do mundo, satisfeitos dos homens e, muito especialmente,
satisfeitos de si.
"""

and breaks it down into individual words:

#85 128.8 INFO:nlp_text_splitter:Setup WtP model: wtp-bert-mini
#85 128.8 INFO:NllbTranslationComponent:Text to translate is larger than the 360 character limit, splitting into smaller sentences.
#85 129.1 INFO:NllbTranslationComponent:Input text split into 86 sentences.
#85 129.1 INFO:NllbTranslationComponent:Translating sentences...
#85 131.6 DEBUG:NllbTranslationComponent:Translated:
#85 131.6 Teimam
#85 131.6 to:
#85 131.6 They 're scared .
#85 133.2 DEBUG:NllbTranslationComponent:Translated:
#85 133.2 de
#85 133.2 to:
#85 133.2 of
#85 134.8 DEBUG:NllbTranslationComponent:Translated:
#85 134.8 facto
#85 134.8 to:
#85 134.8 fact
#85 136.3 DEBUG:NllbTranslationComponent:Translated:
#85 136.3 estes
#85 136.3 to:
#85 136.3 these
#85 137.8 DEBUG:NllbTranslationComponent:Translated:
#85 137.8 em
#85 137.8 to:
#85 137.8 in
#85 139.4 DEBUG:NllbTranslationComponent:Translated:
#85 139.4 que
#85 139.4 to:
#85 139.4 than
#85 140.9 DEBUG:NllbTranslationComponent:Translated:
#85 140.9 são
#85 140.9 to:
#85 140.9 are
#85 142.5 DEBUG:NllbTranslationComponent:Translated:
#85 142.5 indispensaveis
#85 142.5 to:
#85 142.5 The Commission
#85 144.0 DEBUG:NllbTranslationComponent:Translated:
#85 144.0 os
#85 144.0 to:
#85 144.0 the
#85 145.6 DEBUG:NllbTranslationComponent:Translated:
#85 145.6 vividos
#85 145.6 to:
#85 145.6 lived
#85 149.7 DEBUG:NllbTranslationComponent:Translated:
#85 149.7 raios do
#85 149.7 nosso desanuviado
#85 149.7 to:
#85 149.7 The lightning of our desnuviado .
#85 151.2 DEBUG:NllbTranslationComponent:Translated:
#85 151.2 sol,
#85 151.2 to:
#85 151.2 
#85 153.0 DEBUG:NllbTranslationComponent:Translated:
#85 153.0 ou a
#85 153.0 to:
#85 153.0 or a
#85 154.6 DEBUG:NllbTranslationComponent:Translated:
#85 154.6 face
#85 154.6 to:
#85 154.6 face

Try using SaT. Determine if this behavior is a result of our text splitter logic or the model itself.
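One way to separate the two suspects is to run the same text through each splitting path and compare how finely each one cuts it. The sketch below is only an assumption about how that check could look: `fragment_stats` and `naive_split` are hypothetical helpers, and `split` can be any callable, for example a thin wrapper around the raw model or around our text splitter.

```python
import re
from statistics import mean
from typing import Callable, Dict, List


def fragment_stats(text: str, split: Callable[[str], List[str]]) -> Dict[str, float]:
    """Summarize how finely a splitter cuts the text (hypothetical helper)."""
    fragments = [f for f in split(text) if f.strip()]
    words = [len(f.split()) for f in fragments]
    return {
        "fragments": len(fragments),
        "mean_words": mean(words) if words else 0.0,
        # Fragments that are a single word are a strong sign of over-splitting.
        "single_word": sum(1 for w in words if w == 1),
    }


def naive_split(text: str) -> List[str]:
    """Baseline: split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())
```

If the raw model yields a sensible `mean_words` on `pt_text` while our splitter path reports mostly single-word fragments (the log above shows 86 "sentences"), the problem is in our logic rather than in the model.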


hhuangMITRE

Reviewable status: 1 of 10 files reviewed, 5 unresolved discussions (waiting on @hhuangMITRE and @jrobble)

a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

Mention SaT here:

    def split_input_text(self, text: str, from_lang: Optional[str],
                         from_lang_confidence: Optional[float]) -> SplitTextResult:
        """
        Splits up the given text in to chunks that are under TranslationClient.DETECT_MAX_CHARS.
        Each chunk will contain one or more complete sentences as reported
        by the (WtP or spaCy) sentence splitter.
        """

Mention SaT here:


class SentenceSplitter:
    """
    Class to divide large sections of text at sentence breaks using WtP and spaCy.
    It is only used when the text to translate exceeds
    the translation endpoint's character limit.
    """

Done.

a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

Once the NLLB component lands, include it in this PR. It will need to be updated to mention SaT.

Updated NLLB with new tests as well. Also, I didn't see a LICENSE file so I did my best to add one in.

a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

As I mentioned in a separate chat, single-sentence splitting in NLLB with wtp-bert-mini takes this:

pt_text="""Teimam de facto estes em que são indispensaveis os vividos raios do
nosso desanuviado sol, ou a face desassombrada da lua no firmamento
peninsular, onde não tem, como a de Londres--_a romper a custo um
plumbeo céo_--para verterem alegrias na alma e mandarem aos semblantes o
reflexo d'ellas; imaginam fatalmente perseguidos de _spleen_,
irremediavelmente lugubres e soturnos, como se a cada momento saíssem
das galerias subterraneas de uma mina de _pit-coul_, os nossos alliados
inglezes.

Como se enganam ou como pretendem enganar-nos!

É esta uma illusão ou má fé, contra a qual ha muito reclama debalde a
indelevel e accentuada expressão de beatitude, que transluz no rosto
illuminado dos homens de além da Mancha, os quaes parece caminharem
entre nós, envolvidos em densa atmosphera de perenne contentamento,
satisfeitos do mundo, satisfeitos dos homens e, muito especialmente,
satisfeitos de si.
"""

and breaks it down into individual words:

#85 128.8 INFO:nlp_text_splitter:Setup WtP model: wtp-bert-mini
#85 128.8 INFO:NllbTranslationComponent:Text to translate is larger than the 360 character limit, splitting into smaller sentences.
#85 129.1 INFO:NllbTranslationComponent:Input text split into 86 sentences.
#85 129.1 INFO:NllbTranslationComponent:Translating sentences...
#85 131.6 DEBUG:NllbTranslationComponent:Translated:
#85 131.6 Teimam
#85 131.6 to:
#85 131.6 They 're scared .
#85 133.2 DEBUG:NllbTranslationComponent:Translated:
#85 133.2 de
#85 133.2 to:
#85 133.2 of
#85 134.8 DEBUG:NllbTranslationComponent:Translated:
#85 134.8 facto
#85 134.8 to:
#85 134.8 fact
#85 136.3 DEBUG:NllbTranslationComponent:Translated:
#85 136.3 estes
#85 136.3 to:
#85 136.3 these
#85 137.8 DEBUG:NllbTranslationComponent:Translated:
#85 137.8 em
#85 137.8 to:
#85 137.8 in
#85 139.4 DEBUG:NllbTranslationComponent:Translated:
#85 139.4 que
#85 139.4 to:
#85 139.4 than
#85 140.9 DEBUG:NllbTranslationComponent:Translated:
#85 140.9 são
#85 140.9 to:
#85 140.9 are
#85 142.5 DEBUG:NllbTranslationComponent:Translated:
#85 142.5 indispensaveis
#85 142.5 to:
#85 142.5 The Commission
#85 144.0 DEBUG:NllbTranslationComponent:Translated:
#85 144.0 os
#85 144.0 to:
#85 144.0 the
#85 145.6 DEBUG:NllbTranslationComponent:Translated:
#85 145.6 vividos
#85 145.6 to:
#85 145.6 lived
#85 149.7 DEBUG:NllbTranslationComponent:Translated:
#85 149.7 raios do
#85 149.7 nosso desanuviado
#85 149.7 to:
#85 149.7 The lightning of our desnuviado .
#85 151.2 DEBUG:NllbTranslationComponent:Translated:
#85 151.2 sol,
#85 151.2 to:
#85 151.2 
#85 153.0 DEBUG:NllbTranslationComponent:Translated:
#85 153.0 ou a
#85 153.0 to:
#85 153.0 or a
#85 154.6 DEBUG:NllbTranslationComponent:Translated:
#85 154.6 face
#85 154.6 to:
#85 154.6 face

Try using SaT. Determine if this behavior is a result of our text splitter logic or the model itself.

This is mainly due to the text splitter not recognizing newlines, which we've fixed in the most recent update.
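To illustrate the newline issue on hard-wrapped input like `pt_text`, here is a minimal sketch of one possible normalization step. It is not the change made in this PR: `unwrap_hard_wraps` is a hypothetical helper, and it assumes that a single newline is only a line wrap while a blank line marks a paragraph break.

```python
import re


def unwrap_hard_wraps(text: str) -> str:
    """Join hard-wrapped lines with spaces, keeping blank-line paragraph breaks.

    Hypothetical pre-processing step: a lone newline is treated as a line
    wrap, and one or more blank lines as a paragraph boundary.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # str.split() with no arguments collapses any run of whitespace,
    # including the newlines left inside each paragraph.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```

Running the sentence splitter on the unwrapped text keeps wrapped fragments such as "raios do" and "nosso desanuviado" in the same line, so line breaks alone no longer look like boundaries.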

python/AzureTranslation/README.md line 108 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

This link only lists SaT. Also include a link to an older release for WtP models.

Done!

python/AzureTranslation/README.md line 112 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

I think this link is only for SaT. Also include a link to an older release for WtP model languages.

Done!

Updating WtP models. Adding sentence splitting option.

8b7b5ad

hhuangMITRE requested a review from jrobble September 23, 2025 20:25

hhuangMITRE self-assigned this Sep 23, 2025

hhuangMITRE mentioned this pull request Sep 23, 2025

Add option to TextSplitter to return individual sentences. Adding general SaT model support. openmpf/openmpf-python-component-sdk#93

Open

hhuangMITRE changed the title ~~Add option to TextSplitter to return individual sentences. Adding SaT model support.~~ Add option to TextSplitter to return individual sentences. Adding general SaT model support. Sep 23, 2025

jrobble requested changes Sep 25, 2025


regexer and others added 5 commits October 14, 2025 03:04

Update LlamaVideoSummarization to use TIMELINE_CHECK_ACCEPTABLE_THRES…

0138e9b

…HOLD (#409) * Validate timestamps. --------- Co-authored-by: jrobble <[email protected]>

Merge branch 'develop' into feature/nlp-text-splitter-sentence-mode-s…

df3a979

…at-model-update

Updating documentation. Adding license file for NLLB component.

b40f3e8

Adding support for new text splitter. Merging develop changes.

315bf6d

Adding support for new text splitter. Merging develop changes.

ae281c3

hhuangMITRE commented Oct 16, 2025
