Add option to TextSplitter to return individual sentences. Adding general SaT model support. #408
base: develop
@jrobble reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @hhuangMITRE)
a discussion (no related file):
Mention SaT here:
    def split_input_text(self, text: str, from_lang: Optional[str],
                         from_lang_confidence: Optional[float]) -> SplitTextResult:
        """
        Splits up the given text in to chunks that are under TranslationClient.DETECT_MAX_CHARS.
        Each chunk will contain one or more complete sentences as reported
        by the (WtP or spaCy) sentence splitter.
        """
Mention SaT here:
    class SentenceSplitter:
        """
        Class to divide large sections of text at sentence breaks using WtP and spaCy.
        It is only used when the text to translate exceeds
        the translation endpoint's character limit.
        """
a discussion (no related file):
Once the NLLB component lands, include it in this PR. It will need to be updated to mention SaT.
python/AzureTranslation/README.md
line 108 at r1 (raw file):
More advanced SaT/WtP models that use GPU resources (up to ~8 GB for WtP) are also available. See list of model names [here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#available-models). The
This link only lists SaT. Also include a link to an older release for WtP models.
python/AzureTranslation/README.md
line 112 at r1 (raw file):
Review list of languages supported by SaT/WtP [here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#supported-languages).
I think this link is only for SaT. Also include a link to an older release for WtP model languages.
Via separate chat I mentioned that when using single-sentence splitting with NLLB with wtp-bert-mini it takes this:
and breaks it down into individual words:
Try using SaT. Determine if this behavior is a result of our text splitter logic or the model itself.
…HOLD (#409) * Validate timestamps. --------- Co-authored-by: jrobble <[email protected]>
Reviewable status: 1 of 10 files reviewed, 5 unresolved discussions (waiting on @hhuangMITRE and @jrobble)
a discussion (no related file):
Previously, jrobble (Jeff Robble) wrote…
Mention SaT here:
    def split_input_text(self, text: str, from_lang: Optional[str],
                         from_lang_confidence: Optional[float]) -> SplitTextResult:
        """
        Splits up the given text in to chunks that are under TranslationClient.DETECT_MAX_CHARS.
        Each chunk will contain one or more complete sentences as reported
        by the (WtP or spaCy) sentence splitter.
        """
Mention SaT here:
    class SentenceSplitter:
        """
        Class to divide large sections of text at sentence breaks using WtP and spaCy.
        It is only used when the text to translate exceeds
        the translation endpoint's character limit.
        """
Done.
a discussion (no related file):
Previously, jrobble (Jeff Robble) wrote…
Once the NLLB component lands, include it in this PR. It will need to be updated to mention SaT.
Updated NLLB with new tests as well. Also, I didn't see a LICENSE file so I did my best to add one in.
a discussion (no related file):
Previously, jrobble (Jeff Robble) wrote…
Via separate chat I mentioned that when using single-sentence splitting with NLLB with wtp-bert-mini it takes this:
pt_text="""Teimam de facto estes em que são indispensaveis os vividos raios do nosso desanuviado sol, ou a face desassombrada da lua no firmamento peninsular, onde não tem, como a de Londres--_a romper a custo um plumbeo céo_--para verterem alegrias na alma e mandarem aos semblantes o reflexo d'ellas; imaginam fatalmente perseguidos de _spleen_, irremediavelmente lugubres e soturnos, como se a cada momento saíssem das galerias subterraneas de uma mina de _pit-coul_, os nossos alliados inglezes. Como se enganam ou como pretendem enganar-nos! É esta uma illusão ou má fé, contra a qual ha muito reclama debalde a indelevel e accentuada expressão de beatitude, que transluz no rosto illuminado dos homens de além da Mancha, os quaes parece caminharem entre nós, envolvidos em densa atmosphera de perenne contentamento, satisfeitos do mundo, satisfeitos dos homens e, muito especialmente, satisfeitos de si. """
and breaks it down into individual words:
#85 128.8 INFO:nlp_text_splitter:Setup WtP model: wtp-bert-mini
#85 128.8 INFO:NllbTranslationComponent:Text to translate is larger than the 360 character limit, splitting into smaller sentences.
#85 129.1 INFO:NllbTranslationComponent:Input text split into 86 sentences.
#85 129.1 INFO:NllbTranslationComponent:Translating sentences...
#85 131.6 DEBUG:NllbTranslationComponent:Translated:
#85 131.6 Teimam
#85 131.6 to:
#85 131.6 They 're scared .
#85 133.2 DEBUG:NllbTranslationComponent:Translated:
#85 133.2 de
#85 133.2 to:
#85 133.2 of
#85 134.8 DEBUG:NllbTranslationComponent:Translated:
#85 134.8 facto
#85 134.8 to:
#85 134.8 fact
#85 136.3 DEBUG:NllbTranslationComponent:Translated:
#85 136.3 estes
#85 136.3 to:
#85 136.3 these
#85 137.8 DEBUG:NllbTranslationComponent:Translated:
#85 137.8 em
#85 137.8 to:
#85 137.8 in
#85 139.4 DEBUG:NllbTranslationComponent:Translated:
#85 139.4 que
#85 139.4 to:
#85 139.4 than
#85 140.9 DEBUG:NllbTranslationComponent:Translated:
#85 140.9 são
#85 140.9 to:
#85 140.9 are
#85 142.5 DEBUG:NllbTranslationComponent:Translated:
#85 142.5 indispensaveis
#85 142.5 to:
#85 142.5 The Commission
#85 144.0 DEBUG:NllbTranslationComponent:Translated:
#85 144.0 os
#85 144.0 to:
#85 144.0 the
#85 145.6 DEBUG:NllbTranslationComponent:Translated:
#85 145.6 vividos
#85 145.6 to:
#85 145.6 lived
#85 149.7 DEBUG:NllbTranslationComponent:Translated:
#85 149.7 raios do
#85 149.7 nosso desanuviado
#85 149.7 to:
#85 149.7 The lightning of our desnuviado .
#85 151.2 DEBUG:NllbTranslationComponent:Translated:
#85 151.2 sol,
#85 151.2 to:
#85 151.2
#85 153.0 DEBUG:NllbTranslationComponent:Translated:
#85 153.0 ou a
#85 153.0 to:
#85 153.0 or a
#85 154.6 DEBUG:NllbTranslationComponent:Translated:
#85 154.6 face
#85 154.6 to:
#85 154.6 face
Try using SaT. Determine if this behavior is a result of our text splitter logic or the model itself.
This is mainly due to the text splitter not recognizing newlines, which we've addressed in the most recent update.
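The thread does not show how the newline handling was changed. As a rough sketch of one generic approach (an assumption, not necessarily what this PR does), hard-wrapped line breaks can be collapsed into spaces before the text reaches the sentence splitter, while blank lines are kept as paragraph breaks. The function name `unwrap_hard_newlines` is hypothetical.

```python
import re


def unwrap_hard_newlines(text: str) -> str:
    """Replace single newlines with spaces while keeping blank-line paragraph breaks."""
    # Split on blank lines (optionally containing whitespace) to find paragraphs.
    paragraphs = re.split(r'\n\s*\n', text.strip())
    # Within each paragraph, collapse every run of whitespace into one space.
    return '\n\n'.join(' '.join(p.split()) for p in paragraphs)
```

With this kind of preprocessing, a wrapped fragment such as "raios do" followed by a newline and "nosso desanuviado" would reach the splitter as a single line. Whether that is desirable depends on the input, since some texts use single newlines as real sentence boundaries.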
python/AzureTranslation/README.md
line 108 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
This link only lists SaT. Also include a link to an older release for WTP models.
Done!
python/AzureTranslation/README.md
line 112 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
I think this link is only for SaT. Also include a link to an older release for WTP model languages.
Done!