7 changes: 4 additions & 3 deletions LICENSE
@@ -25,13 +25,14 @@ The nlp_text_splitter utility uses the following sentence detection libraries:

*****************************************************************************

The WtP, "Where the Point", sentence segmentation library falls under the MIT License:
The WtP ("Where's the Point") and SaT ("Segment any Text") sentence segmentation
library falls under the MIT License:

https://github.com/bminixhofer/wtpsplit/blob/main/LICENSE
https://github.com/segment-any-text/wtpsplit/blob/main/LICENSE

MIT License

Copyright (c) 2024 Benjamin Minixhofer
Copyright (c) 2024 Benjamin Minixhofer, Markus Frohmann, Igor Sterner

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
53 changes: 43 additions & 10 deletions detection/nlp_text_splitter/README.md
@@ -1,25 +1,57 @@
# Overview

This directory contains the source code, test examples, and installation script
for the OpenMPF NlpTextSplitter tool, which uses WtP and spaCy libraries
to detect sentences in a given chunk of text.
for the OpenMPF NlpTextSplitter tool, which uses **SaT (Segment any Text)**,
**WtP (Where's the Point)**, and **spaCy** to detect sentences in a given chunk of text.

# Background

Our primary motivation for creating this tool was to find a lightweight, accurate
sentence detection capability to support a large variety of text processing tasks
including translation and tagging.

Through preliminary investigation, we identified the [WtP library ("Where's the
Point")](https://github.com/bminixhofer/wtpsplit) and [spaCy's multilingual sentence
Through preliminary investigation, we identified the [WtP/SaT library ("Where's the
Point"/"Segment any Text")](https://github.com/bminixhofer/wtpsplit) and [spaCy's multilingual sentence
detection model](https://spacy.io/models) for identifying sentence breaks
in a large section of text.

WtP models are trained to split up multilingual text by sentence without the need for an
input language tag. The disadvantage is that the most accurate WtP models will need ~3.5
GB of GPU memory. On the other hand, spaCy has a single multilingual sentence detection
that appears to work better for splitting up English text in certain cases. Unfortunately
this model lacks support handling for Chinese punctuation.
GB of GPU memory. SaT is the newer successor to WtP from the same authors and
generally offers better accuracy/efficiency.

On the other hand, spaCy has a single multilingual sentence detection model
that appears to work better for splitting up English text in certain cases.

This component has been updated to use the Azure Translation Component's NewLineBehavior class,
which replaces newlines with whitespace or removes them altogether based on the detected script.

The script/character encoding must be considered because different languages assign
different meanings to whitespace between words. For instance, in Chinese,

`电脑` means `computer` but `电 脑` means `electricity brain`.

When calling the NLP text splitter, users can adjust the following parameters to control for sentence
splitting behaviors:

- `split_mode`: set to `DEFAULT` for splitting by chunk size and `SENTENCE` when splitting by sentences

- `newline_behavior`: controls how newlines are handled in the submitted input text. Options include:
  - `GUESS` to choose `' '` for space-separated languages and `''` for Chinese/Japanese/Korean.
  - `SPACE` to always replace with a single space.
  - `REMOVE` to always remove (no space).
  - `NONE` to leave newlines unchanged.
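A minimal sketch of how `GUESS` might pick between a space and outright removal, assuming a simple Unicode-range check for CJK text (the actual NewLineBehavior class may use fuller script detection; the function names here are illustrative):

```python
import re

# Rough CJK detection: Han, Hiragana, Katakana, and Hangul ranges.
# The real NewLineBehavior class may detect scripts more thoroughly.
_CJK_RE = re.compile(r'[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]')

def guess_newline_replacement(text: str) -> str:
    """Return '' for CJK-dominant text, ' ' for space-separated scripts."""
    cjk = len(_CJK_RE.findall(text))
    other = sum(1 for c in text if c.isalpha()) - cjk
    return '' if cjk > other else ' '

def apply_newline_behavior(text: str, behavior: str = 'GUESS') -> str:
    """Apply one of the four newline behaviors described above."""
    if behavior == 'NONE':
        return text
    if behavior == 'SPACE':
        replacement = ' '
    elif behavior == 'REMOVE':
        replacement = ''
    else:  # GUESS
        replacement = guess_newline_replacement(text)
    return text.replace('\n', replacement)
```

For example, `apply_newline_behavior('电\n脑', 'GUESS')` joins the characters with no space, while the same call on English text inserts a single space.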

For instance:
```
result = list(TextSplitter.split(input_text,
                                 ...
                                 self.sat_model,
                                 split_mode='DEFAULT',
                                 newline_behavior='NONE'))
```
This attempts to split using a SaT model with the default chunking parameters and no newline adjustments.
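To illustrate the difference between the two split modes, here is a toy sketch (not the component's actual implementation; `group_sentences` and its parameters are illustrative) that returns already-detected sentences one-by-one for `SENTENCE` and packs them into size-limited chunks for `DEFAULT`:

```python
from typing import Iterable, List

def group_sentences(sentences: Iterable[str],
                    split_mode: str = 'DEFAULT',
                    max_chars: int = 100) -> List[str]:
    """Toy illustration of the two split modes: SENTENCE yields each
    sentence on its own; DEFAULT packs sentences into chunks no longer
    than max_chars (a single over-long sentence is kept whole)."""
    sentences = list(sentences)
    if split_mode == 'SENTENCE':
        return sentences
    chunks, current = [], ''
    for sentence in sentences:
        candidate = (current + ' ' + sentence).strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the chunk; start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

For example, two short sentences come back merged as one chunk in `DEFAULT` mode but as two separate strings in `SENTENCE` mode.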


# Installation

@@ -40,12 +72,13 @@ Please note that several customizations are supported:
setup a PyTorch installation with CUDA (GPU) libraries.

- `--wtp-models-dir|-m <wtp-models-dir>`: Add this parameter to
change the default WtP model installation directory
change the default WtP/SaT model installation directory
(default: `/opt/wtp/models`).

- `--install-wtp-model|-w <model-name>`: Add this parameter to specify
additional WTP models for installation. This parameter can be provided
multiple times to install more than one model.
additional WtP/SaT models for installation. Accepts both WtP names
(e.g., `wtp-bert-mini`) and SaT names (e.g., `sat-3l-sm`).
This parameter can be provided multiple times to install more than one model.

- `--install-spacy-model|-s <model-name>`: Add this parameter to specify
additional spaCy models for installation. This parameter can be provided
26 changes: 18 additions & 8 deletions detection/nlp_text_splitter/install.sh
@@ -7,11 +7,11 @@
# under contract, and is subject to the Rights in Data-General Clause #
# 52.227-14, Alt. IV (DEC 2007). #
# #
# Copyright 2024 The MITRE Corporation. All Rights Reserved. #
# Copyright 2025 The MITRE Corporation. All Rights Reserved. #
#############################################################################

#############################################################################
# Copyright 2024 The MITRE Corporation #
# Copyright 2025 The MITRE Corporation #
# #
# Licensed under the Apache License, Version 2.0 (the "License"); #
# you may not use this file except in compliance with the License. #
@@ -37,7 +37,7 @@ main() {
fi
eval set -- "$options"
local wtp_models_dir=/opt/wtp/models
local wtp_models=("wtp-bert-mini")
local wtp_models=("wtp-bert-mini" "sat-3l-sm")
local spacy_models=("xx_sent_ud_sm")
while true; do
case "$1" in
@@ -107,10 +107,20 @@ download_wtp_models() {

for model_name in "${model_names[@]}"; do
echo "Downloading the $model_name model to $wtp_models_dir."
local wtp_model_dir="$wtp_models_dir/$model_name"
local model_dir="$wtp_models_dir/$model_name"

# Decide which HF org to use based on model prefix.
# - WtP: benjamin/<model>
# - SaT: segment-any-text/<model>
local hf_owner="benjamin"
case "$model_name" in
sat-*) hf_owner="segment-any-text" ;;
esac

python3 -c \
"from huggingface_hub import snapshot_download; \
snapshot_download('benjamin/$model_name', local_dir='$wtp_model_dir')"
snapshot_download(repo_id='${hf_owner}/${model_name}', local_dir='${model_dir}')"

done
}
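The owner-selection logic above can be factored into a small helper and exercised on its own (a sketch; the function name is illustrative):

```shell
#!/bin/bash

# Map a model name to the Hugging Face org that hosts it:
# SaT models live under segment-any-text, WtP models under benjamin.
hf_owner_for_model() {
    case "$1" in
        sat-*) echo "segment-any-text" ;;
        *)     echo "benjamin" ;;
    esac
}

hf_owner_for_model "sat-3l-sm"      # prints segment-any-text
hf_owner_for_model "wtp-bert-mini"  # prints benjamin
```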

@@ -149,12 +159,12 @@ Options
--text-splitter-dir, -t <path>: Path to text splitter source code. (defaults to the
same directory as this script)
--gpu, -g: Install the GPU version of PyTorch
--wtp-models-dir , -m <path>: Path where WTP models will be stored.
--wtp-models-dir, -m <path>: Path where WtP/SaT models will be stored.
(defaults to /opt/wtp/models)
--install-wtp-model, -w <name>: Name of a WTP model to install in addtion to wtp-bert-mini.
--install-wtp-model, -w <name>: Name of a WtP or SaT model to install in addition to 'wtp-bert-mini' and 'sat-3l-sm'.
This option can be provided more than once to specify
multiple models.
--install-spacy-model | -s <name>: Names of a spaCy model to install in addtion to
--install-spacy-model | -s <name>: Name of a spaCy model to install in addition to
xx_sent_ud_sm. The option can be provided more than once
to specify multiple models.
"