-
Notifications
You must be signed in to change notification settings - Fork 2
Add option to TextSplitter to return individual sentences. Adding general SaT model support. #93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Add option to TextSplitter to return individual sentences. Adding general SaT model support. #93
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jrobble reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @hhuangMITRE)
a discussion (no related file):
Mention SaT here:
# To hold spaCy, WtP, and other potential sentence detection models in cache
Mention SaT here:
log.warning(
"Invalid model setting '%s'. Only `cpu` and `cuda` "
"(or `gpu`) WtP model options available at this time. "
"Defaulting to `cpu` mode.", model_setting)
Mention SaT in install.sh
and LICENSE
.
detection/nlp_text_splitter/nlp_text_splitter/__init__.py
line 83 at r1 (raw file):
self._update_wtp_model(model_name, model_setting, default_lang) self.split = self._split_wtp log.info("Setup WtP model: %s", model_name)
Generally, 'f' strings are preferred since they keep the variable name inline with the text. It makes things easier to read.
detection/nlp_text_splitter/tests/test_text_splitter.py
line 68 at r1 (raw file):
self.assertEqual(2, len(actual)) self.assertEqual('Hello, what is your name? ', actual[0]) self.assertEqual('My name is John.', actual[1])
These asserts as the same as above test_sat_basic_sentence_split
test. I would feel better if we can prove that the different splitting behaviors return different results.
detection/nlp_text_splitter/tests/test_text_splitter.py
line 104 at r1 (raw file):
500, len, self.sat_model,split_mode=SplitMode.SENTENCE))
Formatting nitpick: Move split_mode
to next line.
detection/nlp_text_splitter/tests/test_text_splitter.py
line 106 at r1 (raw file):
self.sat_model,split_mode=SplitMode.SENTENCE)) self.assertEqual(input_text, ''.join(actual)) self.assertEqual(2, len(actual))
These asserts as the same as above. I would feel better if we can prove that the different splitting behaviors return different results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 8 files reviewed, 4 unresolved discussions (waiting on @hhuangMITRE and @jrobble)
a discussion (no related file):
Previously, jrobble (Jeff Robble) wrote…
Mention SaT here:
# To hold spaCy, WtP, and other potential sentence detection models in cache
Mention SaT here:
log.warning( "Invalid model setting '%s'. Only `cpu` and `cuda` " "(or `gpu`) WtP model options available at this time. " "Defaulting to `cpu` mode.", model_setting)
Mention SaT in
install.sh
andLICENSE
.
Done.
detection/nlp_text_splitter/nlp_text_splitter/__init__.py
line 83 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Generally, 'f' strings are preferred since they keep the variable name inline with the text. It makes things easier to read.
Updated, thanks!
detection/nlp_text_splitter/tests/test_text_splitter.py
line 68 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
These asserts as the same as above
test_sat_basic_sentence_split
test. I would feel better if we can prove that the different splitting behaviors return different results.
I've added in the new test cases. There's also some new differences in translation which I've added to the other PR.
detection/nlp_text_splitter/tests/test_text_splitter.py
line 104 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Formatting nitpick: Move
split_mode
to next line.
Done!
detection/nlp_text_splitter/tests/test_text_splitter.py
line 106 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
These asserts as the same as above. I would feel better if we can prove that the different splitting behaviors return different results.
I've tweaked the test, right now SaT seems more sensitive to splitting it seems.
Issues:
Related PRs:
This change is