hhuangMITRE
Contributor

@hhuangMITRE hhuangMITRE commented Sep 23, 2025

@hhuangMITRE hhuangMITRE requested a review from jrobble September 23, 2025 20:25
@hhuangMITRE hhuangMITRE self-assigned this Sep 23, 2025
@hhuangMITRE hhuangMITRE changed the title from "Add option to TextSplitter to return individual sentences. Adding SaT model support." to "Add option to TextSplitter to return individual sentences. Adding general SaT model support." Sep 23, 2025
Member

@jrobble jrobble left a comment

@jrobble reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @hhuangMITRE)


a discussion (no related file):
Mention SaT here:

    def split_input_text(self, text: str, from_lang: Optional[str],
                         from_lang_confidence: Optional[float]) -> SplitTextResult:
        """
        Splits up the given text in to chunks that are under TranslationClient.DETECT_MAX_CHARS.
        Each chunk will contain one or more complete sentences as reported
        by the (WtP or spaCy) sentence splitter.
        """

Mention SaT here:


class SentenceSplitter:
    """
    Class to divide large sections of text at sentence breaks using WtP and spaCy.
    It is only used when the text to translate exceeds
    the translation endpoint's character limit.
    """

a discussion (no related file):
Once the NLLB component lands, include it in this PR. It will need to be updated to mention SaT.


python/AzureTranslation/README.md line 108 at r1 (raw file):

  More advanced SaT/WtP models that use GPU resources (up to ~8 GB for WtP) are also available. See list of
  model names
  [here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#available-models). The

This link only lists SaT models. Also include a link to an older release that lists the WtP models.


python/AzureTranslation/README.md line 112 at r1 (raw file):

  Review list of languages supported by SaT/WtP
  [here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#supported-languages).

I think this link only covers SaT. Also include a link to an older release that lists the languages supported by the WtP models.

@jrobble
Member

jrobble commented Sep 26, 2025

Via separate chat I mentioned that when single-sentence splitting is used in NLLB with wtp-bert-mini, it takes this:

pt_text="""Teimam de facto estes em que são indispensaveis os vividos raios do
nosso desanuviado sol, ou a face desassombrada da lua no firmamento
peninsular, onde não tem, como a de Londres--_a romper a custo um
plumbeo céo_--para verterem alegrias na alma e mandarem aos semblantes o
reflexo d'ellas; imaginam fatalmente perseguidos de _spleen_,
irremediavelmente lugubres e soturnos, como se a cada momento saíssem
das galerias subterraneas de uma mina de _pit-coul_, os nossos alliados
inglezes.

Como se enganam ou como pretendem enganar-nos!

É esta uma illusão ou má fé, contra a qual ha muito reclama debalde a
indelevel e accentuada expressão de beatitude, que transluz no rosto
illuminado dos homens de além da Mancha, os quaes parece caminharem
entre nós, envolvidos em densa atmosphera de perenne contentamento,
satisfeitos do mundo, satisfeitos dos homens e, muito especialmente,
satisfeitos de si.
"""

and breaks it down into individual words:

#85 128.8 INFO:nlp_text_splitter:Setup WtP model: wtp-bert-mini
#85 128.8 INFO:NllbTranslationComponent:Text to translate is larger than the 360 character limit, splitting into smaller sentences.
#85 129.1 INFO:NllbTranslationComponent:Input text split into 86 sentences.
#85 129.1 INFO:NllbTranslationComponent:Translating sentences...
#85 131.6 DEBUG:NllbTranslationComponent:Translated:
#85 131.6 Teimam
#85 131.6 to:
#85 131.6 They 're scared .
#85 133.2 DEBUG:NllbTranslationComponent:Translated:
#85 133.2 de
#85 133.2 to:
#85 133.2 of
#85 134.8 DEBUG:NllbTranslationComponent:Translated:
#85 134.8 facto
#85 134.8 to:
#85 134.8 fact
#85 136.3 DEBUG:NllbTranslationComponent:Translated:
#85 136.3 estes
#85 136.3 to:
#85 136.3 these
#85 137.8 DEBUG:NllbTranslationComponent:Translated:
#85 137.8 em
#85 137.8 to:
#85 137.8 in
#85 139.4 DEBUG:NllbTranslationComponent:Translated:
#85 139.4 que
#85 139.4 to:
#85 139.4 than
#85 140.9 DEBUG:NllbTranslationComponent:Translated:
#85 140.9 são
#85 140.9 to:
#85 140.9 are
#85 142.5 DEBUG:NllbTranslationComponent:Translated:
#85 142.5 indispensaveis
#85 142.5 to:
#85 142.5 The Commission
#85 144.0 DEBUG:NllbTranslationComponent:Translated:
#85 144.0 os
#85 144.0 to:
#85 144.0 the
#85 145.6 DEBUG:NllbTranslationComponent:Translated:
#85 145.6 vividos
#85 145.6 to:
#85 145.6 lived
#85 149.7 DEBUG:NllbTranslationComponent:Translated:
#85 149.7 raios do
#85 149.7 nosso desanuviado
#85 149.7 to:
#85 149.7 The lightning of our desnuviado .
#85 151.2 DEBUG:NllbTranslationComponent:Translated:
#85 151.2 sol,
#85 151.2 to:
#85 151.2 
#85 153.0 DEBUG:NllbTranslationComponent:Translated:
#85 153.0 ou a
#85 153.0 to:
#85 153.0 or a
#85 154.6 DEBUG:NllbTranslationComponent:Translated:
#85 154.6 face
#85 154.6 to:
#85 154.6 face 

Try using SaT. Determine if this behavior is a result of our text splitter logic or the model itself.
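One way to approach that comparison is to run the same text through each candidate splitter and compare simple statistics on the resulting segments. The sketch below is only an illustration: `split_stats`, `looks_over_split`, and the two stand-in splitters are hypothetical names, not part of the component or of wtpsplit. In practice `split_fn` could wrap the raw model's split call on one run and the component's text splitter on another, so that a difference between the two points at our logic rather than the model.

```python
from statistics import median
from typing import Callable, Dict, List


def split_stats(text: str, split_fn: Callable[[str], List[str]]) -> Dict[str, float]:
    """Summarize how finely split_fn cuts the text (hypothetical helper)."""
    segments = [s for s in split_fn(text) if s.strip()]
    word_counts = [len(s.split()) for s in segments]
    return {
        "segments": len(segments),
        "median_words": median(word_counts) if word_counts else 0,
        "single_word_segments": sum(1 for n in word_counts if n == 1),
    }


def looks_over_split(stats: Dict[str, float], min_median_words: int = 3) -> bool:
    """Flag output whose typical 'sentence' is only a word or two long.

    The threshold of 3 words is an arbitrary assumption for illustration.
    """
    return stats["segments"] > 1 and stats["median_words"] < min_median_words


# Stand-in splitters for demonstration only; swap in the real callables.
def word_level_splitter(text: str) -> List[str]:
    return text.split()


def paragraph_splitter(text: str) -> List[str]:
    return [p for p in text.split("\n\n") if p.strip()]


sample = "Como se enganam ou como pretendem enganar-nos!\n\nÉ esta uma illusão ou má fé."

for name, fn in [("word-level", word_level_splitter), ("paragraph", paragraph_splitter)]:
    stats = split_stats(sample, fn)
    print(name, stats, "over-split" if looks_over_split(stats) else "ok")
```

A log like the one above (86 "sentences", mostly single words) would be flagged by this kind of check, which makes it easy to add as a regression test once the cause is known.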

Contributor Author

@hhuangMITRE hhuangMITRE left a comment

Reviewable status: 1 of 10 files reviewed, 5 unresolved discussions (waiting on @hhuangMITRE and @jrobble)


a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

Mention SaT here:

    def split_input_text(self, text: str, from_lang: Optional[str],
                         from_lang_confidence: Optional[float]) -> SplitTextResult:
        """
        Splits up the given text in to chunks that are under TranslationClient.DETECT_MAX_CHARS.
        Each chunk will contain one or more complete sentences as reported
        by the (WtP or spaCy) sentence splitter.
        """

Mention SaT here:


class SentenceSplitter:
    """
    Class to divide large sections of text at sentence breaks using WtP and spaCy.
    It is only used when the text to translate exceeds
    the translation endpoint's character limit.
    """

Done.


a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

Once the NLLB component lands, include it in this PR. It will need to be updated to mention SaT.

Updated NLLB with new tests as well. Also, I didn't see a LICENSE file so I did my best to add one in.


a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

Via separate chat I mentioned that when using single-sentence splitting with NLLB with wtp-bert-mini it takes the Portuguese sample text and breaks it down into individual words (the sample text and log are quoted in full in the original comment above).

Try using SaT. Determine if this behavior is a result of our text splitter logic or the model itself.

This is mainly due to the text splitter not recognizing newlines, which we've fixed in the most recent update.
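For readers following along, here is a minimal sketch of one way hard-wrapped input can be handled before sentence splitting. It is not the change made in this PR (that diff is not shown in this conversation), and `unwrap_hard_wraps` is a hypothetical name. The assumption is that a single newline inside a paragraph is only a line wrap, while a blank line marks a real paragraph break.

```python
import re


def unwrap_hard_wraps(text: str) -> str:
    """Join hard-wrapped lines within each paragraph.

    Assumption: a blank line separates paragraphs, and a lone newline
    inside a paragraph is just a wrap, not a sentence boundary.
    """
    # Split on blank lines (optionally containing whitespace).
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Collapse all whitespace runs, including newlines, inside a paragraph.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


wrapped = (
    "Como se enganam ou\n"
    "como pretendem enganar-nos!\n"
    "\n"
    "É esta uma illusão\n"
    "ou má fé."
)
print(unwrap_hard_wraps(wrapped))
```

Keeping the blank-line breaks means the splitter still sees paragraph boundaries, while the wrapped lines no longer look like separate segments.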


python/AzureTranslation/README.md line 108 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

This link only lists SaT models. Also include a link to an older release that lists the WtP models.

Done!


python/AzureTranslation/README.md line 112 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

I think this link only covers SaT. Also include a link to an older release that lists the languages supported by the WtP models.

Done!

3 participants