12 changes: 6 additions & 6 deletions .github/workflows/python.yml
@@ -8,26 +8,26 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
python-version: ["3.9", "3.10", "3.11", "3.12"]

steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install ruff==0.1.5 pytest==7.4.3
pip install ruff pytest
pip install -e .[onnx-cpu]
pip install torch --index-url https://download.pytorch.org/whl/cpu
- name: Lint with ruff
run: |
# stop the build if there are Python syntax errors or undefined names
ruff --output-format=github --select=E9,F63,F7,F82 --target-version=py37 .
ruff check --output-format=github --select=E9,F63,F7,F82 --target-version=py39 .
# default set of ruff rules with GitHub Annotations
ruff --output-format=github --target-version=py37 .
ruff check --output-format=github --target-version=py39 .
- name: Test with pytest
run: |
pytest test.py
3 changes: 2 additions & 1 deletion .gitignore
@@ -20,4 +20,5 @@ data_subset/**
*.txt
*.log
xlmr-*/**
**/checkpoint-*/**
**/checkpoint-*/**
.venv/
84 changes: 83 additions & 1 deletion README.md
@@ -141,9 +141,91 @@ Since SaT are trained to predict newline probability, they can segment text into
sat.split(text, do_paragraph_segmentation=True)
```

## (NEW! v2.2+) Length-Constrained Segmentation

Control segment lengths with the `min_length` and `max_length` parameters. This is useful when you need segments within specific size limits (e.g., for embedding models, storage, or downstream processing).

### Basic Usage

```python
# segments will be at most 100 characters
sat.split(text, max_length=100)

# segments will be at least 20 characters (best effort) and at most 100 characters (strict)
sat.split(text, min_length=20, max_length=100)

# use different algorithms: "viterbi" (optimal, default) or "greedy" (faster)
sat.split(text, max_length=100, algorithm="greedy")
```
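Since `max_length` is strict while `min_length` is best effort, a quick sanity check on the returned segments can be useful. A minimal sketch (assuming `segments` is the list returned by `sat.split`):

```python
segments = sat.split(text, min_length=20, max_length=100)

# max_length is a hard constraint, so this should always hold
assert all(len(s) <= 100 for s in segments)

# min_length is best effort, so some segments may still fall below it
short = [s for s in segments if len(s) < 20]
print(f"{len(short)} segment(s) below min_length")
```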

### Priors for Length Preference

Use priors to influence segment length distribution. Available priors:

| Prior | Best For |
|-------|----------|
| `"uniform"` (default) | Just enforce max_length, let model decide |
| `"gaussian"` | Prefer segments around a target length (intuitive) |
| `"lognormal"` | Natural-feeling output |
| `"clipped_polynomial"` | Must be very close to target length |

```python
# Gaussian prior (recommended): prefer segments around target_length
sat.split(text, max_length=100, prior_type="gaussian",
prior_kwargs={"target_length": 50, "spread": 10})

# Log-normal prior: better models natural sentence length distribution
sat.split(text, max_length=100, prior_type="lognormal",
prior_kwargs={"target_length": 70, "spread": 0.5})

# Clipped polynomial: hard cutoff at ±spread from target
sat.split(text, max_length=100, prior_type="clipped_polynomial",
prior_kwargs={"target_length": 60, "spread": 25})
```
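To build intuition for the table above, here is a minimal sketch of how Gaussian and log-normal length priors can be scored in the log domain. The function names and exact formulas are illustrative assumptions, not wtpsplit's internals; `target_length` and `spread` mirror the `prior_kwargs` shown above:

```python
import math

def gaussian_log_prior(length: int, target_length: float, spread: float) -> float:
    # Quadratic penalty, symmetric around target_length: segments much
    # shorter or longer than the target are equally discouraged.
    return -0.5 * ((length - target_length) / spread) ** 2

def lognormal_log_prior(length: int, target_length: float, spread: float) -> float:
    # Penalty in log-space: asymmetric, tolerating overshoot more than
    # very short segments, which tends to feel natural for sentences.
    return -0.5 * ((math.log(length) - math.log(target_length)) / spread) ** 2
```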

### Language-Aware Defaults

Pass `lang_code` to use language-specific defaults for `target_length` and `spread`:

```python
# German has longer average sentences → auto-uses target_length=90, spread=35
sat.split(text, max_length=150, prior_type="gaussian",
prior_kwargs={"lang_code": "de"})

# Chinese has shorter sentences → auto-uses target_length=45, spread=15
sat.split(text, max_length=100, prior_type="gaussian",
prior_kwargs={"lang_code": "zh"})
```

When using LoRA with a language, this happens automatically:

```python
sat = SaT("sat-3l", style_or_domain="ud", language="de")
sat.split(text, max_length=150, prior_type="gaussian") # auto-uses German defaults
```

### How It Works

The Viterbi algorithm finds globally optimal segmentation points that balance:
- The model's sentence boundary predictions (where natural splits occur)
- Your length preferences (via the prior, if one is provided), as sketched below
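
The following is a simplified, illustrative sketch of such a dynamic program, not the library's actual implementation. It assumes `probs[i]` holds the model's boundary probability after character `i`, and `log_prior` is a length prior like the ones sketched earlier:

```python
import math

def viterbi_segment(text, probs, max_length, log_prior=lambda n: 0.0):
    # best[i]: best score for segmenting text[:i]; back[i]: previous cut point.
    n = len(text)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for i in range(1, n + 1):
        # A boundary at the end of the text is free; elsewhere, pay the
        # log-probability the model assigns to a boundary after char i-1.
        boundary = 0.0 if i == n else math.log(max(probs[i - 1], 1e-12))
        for j in range(max(0, i - max_length), i):
            score = best[j] + boundary + log_prior(i - j)
            if score > best[i]:
                best[i], back[i] = score, j
    # Walk the back pointers to recover the cut points.
    cuts, i = [], n
    while i > 0:
        cuts.append(i)
        i = back[i]
    cuts.reverse()
    return [text[a:b] for a, b in zip([0] + cuts[:-1], cuts)]
```

Scoring every candidate segment of length up to `max_length` makes the result globally optimal under the combined objective, at O(n * max_length) cost; the `"greedy"` algorithm trades that optimality for speed.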

**Text Reconstruction:**
```python
# With constraints (max_length or min_length):
original_text = "".join(segments)

# Without constraints (SaT default):
original_text = "\n".join(segments)
```

> **Note**: When `max_length` is set, the `threshold` parameter is ignored. The Viterbi/greedy algorithms use raw model probabilities directly instead of threshold-based filtering.
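
For example (the threshold value is illustrative):

```python
sat.split(text, threshold=0.25)                   # threshold applies
sat.split(text, threshold=0.25, max_length=100)   # threshold is ignored
```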

For more details, see the [Length Constraints Documentation](./docs/LENGTH_CONSTRAINTS.md).

## Adaptation

SaT can be domain- and style-adapted via LoRA. We provide trained LoRA modules for Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speecjes) sentence styles in 81 languages for `sat-3l`and `sat-12l`. Additionally, we provide LoRA modules for legal documents (laws and judgements) in 6 languages, code-switching in 4 language pairs, and tweets in 3 languages. For details, we refer to our [paper](https://arxiv.org/abs/2406.16678).
SaT can be domain- and style-adapted via LoRA. We provide trained LoRA modules for Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speeches) sentence styles in 81 languages for `sat-3l` and `sat-12l`. Additionally, we provide LoRA modules for legal documents (laws and judgements) in 6 languages, code-switching in 4 language pairs, and tweets in 3 languages. For details, we refer to our [paper](https://arxiv.org/abs/2406.16678).

We also provide verse segmentation modules for 16 genres for `sat-12l-no-limited-lookahead`.
