Add option to TextSplitter to return individual sentences. Adding general SaT model support. #93

hhuangMITRE · 2025-09-23T20:24:00Z

Issues:

Add option to TextSplitter to return individual sentences openmpf#1965

Related PRs:

Add option to TextSplitter to return individual sentences. Adding general SaT model support. openmpf-components#408

jrobble

@jrobble reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @hhuangMITRE)

a discussion (no related file):
Mention SaT here:

# To hold spaCy, WtP, and other potential sentence detection models in cache

Mention SaT here:

            log.warning(
                "Invalid model setting '%s'. Only `cpu` and `cuda` "
                        "(or `gpu`) WtP model options available at this time. "
                        "Defaulting to `cpu` mode.", model_setting)

Mention SaT in install.sh and LICENSE.
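
As an illustration of the second request, here is a minimal sketch of how the warning might read once SaT is named alongside WtP. The helper name `resolve_device` and the treatment of `gpu` as an alias for `cuda` are assumptions made for this example, not the component's actual code.

```python
import logging

log = logging.getLogger(__name__)


def resolve_device(model_setting: str) -> str:
    """Hypothetical helper: map a model setting string to a device name."""
    setting = model_setting.strip().lower()
    if setting in ("cuda", "gpu"):
        # Assumption: `gpu` is accepted as an alias for `cuda`.
        return "cuda"
    if setting != "cpu":
        # Same wording as the reviewed warning, with SaT mentioned as well.
        log.warning(
            "Invalid model setting '%s'. Only `cpu` and `cuda` "
            "(or `gpu`) WtP/SaT model options available at this time. "
            "Defaulting to `cpu` mode.", model_setting)
    return "cpu"
```

With this sketch, `resolve_device("tpu")` logs the warning and falls back to `"cpu"`.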

detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 83 at r1 (raw file):

            self._update_wtp_model(model_name, model_setting, default_lang)
            self.split = self._split_wtp
            log.info("Setup WtP model: %s", model_name)

Generally, 'f' strings are preferred since they keep the variable name inline with the text. It makes things easier to read.
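
A small sketch of the suggested change, assuming the message text stays the same. The `setup_message` helper and the placeholder model name are hypothetical and exist only to make the example self-contained.

```python
import logging

log = logging.getLogger(__name__)


def setup_message(model_name: str) -> str:
    # The f-string keeps the variable next to the text it belongs to.
    return f"Setup WtP model: {model_name}"


model_name = "example-model"  # placeholder value, not a real model name
log.info(setup_message(model_name))
```

One trade-off to keep in mind: the `%s` form lets `logging` skip string formatting when the log level is disabled, while an f-string is always evaluated.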

detection/nlp_text_splitter/tests/test_text_splitter.py line 68 at r1 (raw file):

        self.assertEqual(2, len(actual))
        self.assertEqual('Hello, what is your name? ', actual[0])
        self.assertEqual('My name is John.', actual[1])

These asserts are the same as in the test_sat_basic_sentence_split test above. I would feel better if we could prove that the different splitting behaviors return different results.
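
One way such a test could look, sketched with a stand-in splitter rather than the real TextSplitter. `toy_split` uses a regex instead of a WtP or SaT model, and `SplitMode.DEFAULT` is a hypothetical name for the chunking mode; only `SplitMode.SENTENCE` appears in the reviewed code.

```python
import re
import unittest
from enum import Enum, auto


class SplitMode(Enum):
    DEFAULT = auto()   # hypothetical name for the chunking mode
    SENTENCE = auto()


def toy_split(text: str, limit: int, mode: SplitMode) -> list[str]:
    """Stand-in splitter: a regex, not a WtP/SaT model."""
    # Each piece ends after sentence punctuation plus trailing whitespace,
    # so ''.join(pieces) reproduces the input exactly.
    sentences = re.findall(r'.+?(?:[.?!]\s*|\Z)', text, flags=re.S)
    if mode is SplitMode.SENTENCE:
        return sentences
    # Chunking mode: pack whole sentences into chunks of at most `limit` chars.
    chunks, current = [], ''
    for sentence in sentences:
        if current and len(current) + len(sentence) > limit:
            chunks.append(current)
            current = ''
        current += sentence
    if current:
        chunks.append(current)
    return chunks


class TestSplitModesDiffer(unittest.TestCase):
    def test_modes_return_different_results(self):
        input_text = 'Hello, what is your name? My name is John.'
        chunked = toy_split(input_text, 500, SplitMode.DEFAULT)
        by_sentence = toy_split(input_text, 500, SplitMode.SENTENCE)
        # Both modes must preserve the input text.
        self.assertEqual(input_text, ''.join(chunked))
        self.assertEqual(input_text, ''.join(by_sentence))
        # The modes differ: one chunk versus one entry per sentence.
        self.assertEqual(1, len(chunked))
        self.assertEqual(2, len(by_sentence))
        self.assertNotEqual(chunked, by_sentence)
```

The point of the sketch is the pair of length assertions: the same input and the same 500-character limit produce a different number of pieces depending on the mode, which is the evidence the comment asks for.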

detection/nlp_text_splitter/tests/test_text_splitter.py line 104 at r1 (raw file):

            500,
            len,
            self.sat_model,split_mode=SplitMode.SENTENCE))

Formatting nitpick: Move split_mode to next line.

detection/nlp_text_splitter/tests/test_text_splitter.py line 106 at r1 (raw file):

            self.sat_model,split_mode=SplitMode.SENTENCE))
        self.assertEqual(input_text, ''.join(actual))
        self.assertEqual(2, len(actual))

These asserts are the same as above. I would feel better if we could prove that the different splitting behaviors return different results.

hhuangMITRE

Reviewable status: 0 of 8 files reviewed, 4 unresolved discussions (waiting on @hhuangMITRE and @jrobble)

a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

Mention SaT here:

# To hold spaCy, WtP, and other potential sentence detection models in cache

Mention SaT here:

            log.warning(
                "Invalid model setting '%s'. Only `cpu` and `cuda` "
                        "(or `gpu`) WtP model options available at this time. "
                        "Defaulting to `cpu` mode.", model_setting)

Mention SaT in install.sh and LICENSE.

Done.

detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 83 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Generally, 'f' strings are preferred since they keep the variable name inline with the text. It makes things easier to read.

Updated, thanks!

detection/nlp_text_splitter/tests/test_text_splitter.py line 68 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

These asserts are the same as in the test_sat_basic_sentence_split test above. I would feel better if we could prove that the different splitting behaviors return different results.

I've added the new test cases. There are also some new differences in translation, which I've added to the other PR.

detection/nlp_text_splitter/tests/test_text_splitter.py line 104 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Formatting nitpick: Move split_mode to next line.

Done!

detection/nlp_text_splitter/tests/test_text_splitter.py line 106 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

These asserts are the same as above. I would feel better if we could prove that the different splitting behaviors return different results.

I've tweaked the test. Right now, SaT seems more sensitive to splitting.

hhuangMITRE added 4 commits September 23, 2025 04:42

Updating WtP models. Adding sentence splitting option.

709f33a

Updating WtP models. Adding sentence splitting option.

40d4bb7

Updating WtP models. Adding sentence splitting option.

be38c78

Minor bugfix

b008096

hhuangMITRE requested a review from jrobble September 23, 2025 20:24

hhuangMITRE self-assigned this Sep 23, 2025

hhuangMITRE changed the title ~~Feature/nlp text splitter sentence mode sat model update~~ Add option to TextSplitter to return individual sentences. Sep 23, 2025

hhuangMITRE changed the title ~~Add option to TextSplitter to return individual sentences.~~ Add option to TextSplitter to return individual sentences. Adding SaT model support. Sep 23, 2025

hhuangMITRE mentioned this pull request Sep 23, 2025

Add option to TextSplitter to return individual sentences. Adding general SaT model support. openmpf/openmpf-components#408

Open

hhuangMITRE changed the title ~~Add option to TextSplitter to return individual sentences. Adding SaT model support.~~ Add option to TextSplitter to return individual sentences. Adding general SaT model support. Sep 23, 2025

jrobble requested changes Sep 25, 2025

hhuangMITRE added 4 commits October 14, 2025 00:57

Adding newline processing to text splitter.

cfd4e90

Adding newline processing to text splitter.

b4daeca

Adding newline processing to text splitter.

5674a7b

Adding newline processing to text splitter.

a74d7e7

hhuangMITRE commented Oct 16, 2025
