7 changes: 4 additions & 3 deletions LICENSE
@@ -25,13 +25,14 @@ The nlp_text_splitter utility uses the following sentence detection libraries:

*****************************************************************************

The WtP, "Where the Point", sentence segmentation library falls under the MIT License:
The WtP ("Where's the Point") and SaT ("Segment any Text") sentence segmentation
library falls under the MIT License:

https://github.com/bminixhofer/wtpsplit/blob/main/LICENSE
https://github.com/segment-any-text/wtpsplit/blob/main/LICENSE

MIT License

Copyright (c) 2024 Benjamin Minixhofer
Copyright (c) 2024 Benjamin Minixhofer, Markus Frohmann, Igor Sterner

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
53 changes: 43 additions & 10 deletions detection/nlp_text_splitter/README.md
@@ -1,25 +1,57 @@
# Overview

This directory contains the source code, test examples, and installation script
for the OpenMPF NlpTextSplitter tool, which uses WtP and spaCy libraries
to detect sentences in a given chunk of text.
for the OpenMPF NlpTextSplitter tool, which uses **SaT (Segment any Text)**,
**WtP (Where's the Point)**, and **spaCy** to detect sentences in a given chunk of text.

# Background

Our primary motivation for creating this tool was to find a lightweight, accurate
sentence detection capability to support a large variety of text processing tasks
including translation and tagging.

Through preliminary investigation, we identified the [WtP library ("Where's the
Point")](https://github.com/bminixhofer/wtpsplit) and [spaCy's multilingual sentence
Through preliminary investigation, we identified the [WtP/SaT library ("Where's the
Point"/"Segment any Text")](https://github.com/bminixhofer/wtpsplit) and [spaCy's multilingual sentence
detection model](https://spacy.io/models) for identifying sentence breaks
in a large section of text.

WtP models are trained to split up multilingual text by sentence without the need for an
input language tag. The disadvantage is that the most accurate WtP models will need ~3.5
GB of GPU memory. On the other hand, spaCy has a single multilingual sentence detection
that appears to work better for splitting up English text in certain cases. Unfortunately
this model lacks support handling for Chinese punctuation.
GB of GPU memory. SaT is the newer successor to WtP from the same authors and
generally offers better accuracy/efficiency.

On the other hand, spaCy has a single multilingual sentence detection model
that appears to work better for splitting up English text in certain cases.

This component has been updated to use the Azure Translation Component's NewLineBehavior class,
which replaces newlines with whitespace or removes them altogether based on the detected script.

The script/character encoding must be considered because different languages assign
different meanings to whitespace between words. For instance, in Chinese,

`电脑` means `computer` but `电 脑` means `electricity brain`.

When calling the NLP text splitter, users can adjust the following parameters to control for sentence
splitting behaviors:

- `split_mode`: set to `DEFAULT` for splitting by chunk size and `SENTENCE` when splitting by sentences

- `newline_behavior`: controls how newlines are handled in the submitted input text. Options include:
  - `GUESS` to choose `' '` for space-separated languages and `''` for Chinese/Japanese/Korean.
  - `SPACE` to always replace with a single space.
  - `REMOVE` to always remove (no space).
  - `NONE` to leave newlines unchanged.
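A minimal sketch of how `GUESS` might pick between a space and outright removal, assuming a simple Unicode-range check for CJK text (the actual NewLineBehavior class may use fuller script detection; the function names here are illustrative):

```python
import re

# Rough CJK detection: Han, Hiragana, Katakana, and Hangul ranges.
# The real NewLineBehavior class may detect scripts more thoroughly.
_CJK_RE = re.compile(r'[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]')

def guess_newline_replacement(text: str) -> str:
    """Return '' for CJK-dominant text, ' ' for space-separated scripts."""
    cjk = len(_CJK_RE.findall(text))
    other = sum(1 for c in text if c.isalpha()) - cjk
    return '' if cjk > other else ' '

def apply_newline_behavior(text: str, behavior: str = 'GUESS') -> str:
    """Apply one of the four newline behaviors described above."""
    if behavior == 'NONE':
        return text
    if behavior == 'SPACE':
        replacement = ' '
    elif behavior == 'REMOVE':
        replacement = ''
    else:  # GUESS
        replacement = guess_newline_replacement(text)
    return text.replace('\n', replacement)
```

For example, `apply_newline_behavior('电\n脑', 'GUESS')` joins the characters with no space, while the same call on English text inserts a single space.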

For instance:
```
result = list(TextSplitter.split(input_text,
                                 ...
                                 self.sat_model,
                                 split_mode='DEFAULT',
                                 newline_behavior='NONE'))
```
This attempts to split using a SaT model with the default chunking parameters and no newline adjustments.
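To illustrate the difference between the two split modes, here is a toy sketch (not the component's actual implementation; `group_sentences` and its parameters are illustrative) that returns already-detected sentences one-by-one for `SENTENCE` and packs them into size-limited chunks for `DEFAULT`:

```python
from typing import Iterable, List

def group_sentences(sentences: Iterable[str],
                    split_mode: str = 'DEFAULT',
                    max_chars: int = 100) -> List[str]:
    """Toy illustration of the two split modes: SENTENCE yields each
    sentence on its own; DEFAULT packs sentences into chunks no longer
    than max_chars (a single over-long sentence is kept whole)."""
    sentences = list(sentences)
    if split_mode == 'SENTENCE':
        return sentences
    chunks, current = [], ''
    for sentence in sentences:
        candidate = (current + ' ' + sentence).strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the chunk; start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

For example, two short sentences come back merged as one chunk in `DEFAULT` mode but as two separate strings in `SENTENCE` mode.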


# Installation

@@ -40,12 +72,13 @@ Please note that several customizations are supported:
setup a PyTorch installation with CUDA (GPU) libraries.

- `--wtp-models-dir|-m <wtp-models-dir>`: Add this parameter to
change the default WtP model installation directory
change the default WtP/SaT model installation directory
(default: `/opt/wtp/models`).

- `--install-wtp-model|-w <model-name>`: Add this parameter to specify
additional WTP models for installation. This parameter can be provided
multiple times to install more than one model.
additional WtP/SaT models for installation. Accepts both WtP names
(e.g., `wtp-bert-mini`) and SaT names (e.g., `sat-3l-sm`).
This parameter can be provided multiple times to install more than one model.

- `--install-spacy-model|-s <model-name>`: Add this parameter to specify
additional spaCy models for installation. This parameter can be provided
26 changes: 18 additions & 8 deletions detection/nlp_text_splitter/install.sh
@@ -7,11 +7,11 @@
# under contract, and is subject to the Rights in Data-General Clause #
# 52.227-14, Alt. IV (DEC 2007). #
# #
# Copyright 2024 The MITRE Corporation. All Rights Reserved. #
# Copyright 2025 The MITRE Corporation. All Rights Reserved. #
#############################################################################

#############################################################################
# Copyright 2024 The MITRE Corporation #
# Copyright 2025 The MITRE Corporation #
# #
# Licensed under the Apache License, Version 2.0 (the "License"); #
# you may not use this file except in compliance with the License. #
@@ -37,7 +37,7 @@ main() {
fi
eval set -- "$options"
local wtp_models_dir=/opt/wtp/models
local wtp_models=("wtp-bert-mini")
local wtp_models=("wtp-bert-mini" "sat-3l-sm")
local spacy_models=("xx_sent_ud_sm")
while true; do
case "$1" in
@@ -107,10 +107,20 @@ download_wtp_models() {

for model_name in "${model_names[@]}"; do
echo "Downloading the $model_name model to $wtp_models_dir."
local wtp_model_dir="$wtp_models_dir/$model_name"
local model_dir="$wtp_models_dir/$model_name"

# Decide which HF org to use based on model prefix.
# - WtP: benjamin/<model>
# - SaT: segment-any-text/<model>
local hf_owner="benjamin"
case "$model_name" in
sat-*) hf_owner="segment-any-text" ;;
esac

python3 -c \
"from huggingface_hub import snapshot_download; \
snapshot_download('benjamin/$model_name', local_dir='$wtp_model_dir')"
snapshot_download(repo_id='${hf_owner}/${model_name}', local_dir='${model_dir}')"

done
}
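The owner-selection logic above can be factored into a small helper and exercised on its own (a sketch; the function name is illustrative):

```shell
#!/bin/bash

# Map a model name to the Hugging Face org that hosts it:
# SaT models live under segment-any-text, WtP models under benjamin.
hf_owner_for_model() {
    case "$1" in
        sat-*) echo "segment-any-text" ;;
        *)     echo "benjamin" ;;
    esac
}

hf_owner_for_model "sat-3l-sm"      # prints segment-any-text
hf_owner_for_model "wtp-bert-mini"  # prints benjamin
```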

@@ -149,12 +159,12 @@ Options
--text-splitter-dir, -t <path>: Path to text splitter source code. (defaults to the
same directory as this script)
--gpu, -g: Install the GPU version of PyTorch
--wtp-models-dir , -m <path>: Path where WTP models will be stored.
--wtp-models-dir, -m <path>: Path where WtP/SaT models will be stored.
(defaults to /opt/wtp/models)
--install-wtp-model, -w <name>: Name of a WTP model to install in addtion to wtp-bert-mini.
--install-wtp-model, -w <name>: Name of a WtP or SaT model to install in addition to 'wtp-bert-mini' and 'sat-3l-sm'.
This option can be provided more than once to specify
multiple models.
--install-spacy-model | -s <name>: Names of a spaCy model to install in addtion to
--install-spacy-model | -s <name>: Name of a spaCy model to install in addition to
xx_sent_ud_sm. The option can be provided more than once
to specify multiple models.
"