12 changes: 6 additions & 6 deletions .github/workflows/python.yml
@@ -8,26 +8,26 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
python-version: ["3.9", "3.10", "3.11", "3.12"]

steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install ruff==0.1.5 pytest==7.4.3
pip install ruff pytest
pip install -e .[onnx-cpu]
pip install torch --index-url https://download.pytorch.org/whl/cpu
- name: Lint with ruff
run: |
# stop the build if there are Python syntax errors or undefined names
ruff --output-format=github --select=E9,F63,F7,F82 --target-version=py37 .
ruff check --output-format=github --select=E9,F63,F7,F82 --target-version=py39 .
# default set of ruff rules with GitHub Annotations
ruff --output-format=github --target-version=py37 .
ruff check --output-format=github --target-version=py39 .
- name: Test with pytest
run: |
pytest test.py
3 changes: 2 additions & 1 deletion .gitignore
@@ -20,4 +20,5 @@ data_subset/**
*.txt
*.log
xlmr-*/**
**/checkpoint-*/**
**/checkpoint-*/**
.venv/
84 changes: 83 additions & 1 deletion README.md
@@ -141,9 +141,91 @@ Since SaT are trained to predict newline probability, they can segment text into
sat.split(text, do_paragraph_segmentation=True)
```

## (NEW! v2.2+) Length-Constrained Segmentation

Control segment lengths with the `min_length` and `max_length` parameters. This is useful when you need segments within specific size limits (e.g., for embedding models, storage, or downstream processing).

### Basic Usage

```python
# segments will be at most 100 characters
sat.split(text, max_length=100)

# segments will be at least 20 characters (best effort) and at most 100 characters (strict)
sat.split(text, min_length=20, max_length=100)

# use different algorithms: "viterbi" (optimal, default) or "greedy" (faster)
sat.split(text, max_length=100, algorithm="greedy")
```
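Since `max_length` is strict while `min_length` is best effort, a quick sanity check on the returned segments can be useful. A minimal sketch (assuming `segments` is the list returned by `sat.split`):

```python
segments = sat.split(text, min_length=20, max_length=100)

# max_length is a hard constraint, so this should always hold
assert all(len(s) <= 100 for s in segments)

# min_length is best effort, so some segments may still fall below it
short = [s for s in segments if len(s) < 20]
print(f"{len(short)} segment(s) below min_length")
```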

### Priors for Length Preference

Use priors to influence segment length distribution. Available priors:

| Prior | Best For |
|-------|----------|
| `"uniform"` (default) | Just enforce max_length, let model decide |
| `"gaussian"` | Prefer segments around a target length (intuitive) |
| `"lognormal"` | Natural-feeling output |
| `"clipped_polynomial"` | Must be very close to target length |

```python
# Gaussian prior (recommended): prefer segments around target_length
sat.split(text, max_length=100, prior_type="gaussian",
prior_kwargs={"target_length": 50, "spread": 10})

# Log-normal prior: better models natural sentence length distribution
sat.split(text, max_length=100, prior_type="lognormal",
prior_kwargs={"target_length": 70, "spread": 0.5})

# Clipped polynomial: hard cutoff at ±spread from target
sat.split(text, max_length=100, prior_type="clipped_polynomial",
prior_kwargs={"target_length": 60, "spread": 25})
```
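To build intuition for the table above, here is a minimal sketch of how Gaussian and log-normal length priors can be scored in the log domain. The function names and exact formulas are illustrative assumptions, not wtpsplit's internals; `target_length` and `spread` mirror the `prior_kwargs` shown above:

```python
import math

def gaussian_log_prior(length: int, target_length: float, spread: float) -> float:
    # Quadratic penalty, symmetric around target_length: segments much
    # shorter or longer than the target are equally discouraged.
    return -0.5 * ((length - target_length) / spread) ** 2

def lognormal_log_prior(length: int, target_length: float, spread: float) -> float:
    # Penalty in log-space: asymmetric, tolerating overshoot more than
    # very short segments, which tends to feel natural for sentences.
    return -0.5 * ((math.log(length) - math.log(target_length)) / spread) ** 2
```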

### Language-Aware Defaults

Pass `lang_code` to use language-specific defaults for `target_length` and `spread`:

```python
# German has longer average sentences → auto-uses target_length=90, spread=35
sat.split(text, max_length=150, prior_type="gaussian",
prior_kwargs={"lang_code": "de"})

# Chinese has shorter sentences → auto-uses target_length=45, spread=15
sat.split(text, max_length=100, prior_type="gaussian",
prior_kwargs={"lang_code": "zh"})
```

When using LoRA with a language, this happens automatically:

```python
sat = SaT("sat-3l", style_or_domain="ud", language="de")
sat.split(text, max_length=150, prior_type="gaussian") # auto-uses German defaults
```

### How It Works

The Viterbi algorithm finds globally optimal segmentation points that balance:
- The model's sentence boundary predictions (where natural splits occur)
- Your length preferences (via the prior, if one is provided), as sketched below
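
The following is a simplified, illustrative sketch of such a dynamic program, not the library's actual implementation. It assumes `probs[i]` holds the model's boundary probability after character `i`, and `log_prior` is a length prior like the ones sketched earlier:

```python
import math

def viterbi_segment(text, probs, max_length, log_prior=lambda n: 0.0):
    # best[i]: best score for segmenting text[:i]; back[i]: previous cut point.
    n = len(text)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for i in range(1, n + 1):
        # A boundary at the end of the text is free; elsewhere, pay the
        # log-probability the model assigns to a boundary after char i-1.
        boundary = 0.0 if i == n else math.log(max(probs[i - 1], 1e-12))
        for j in range(max(0, i - max_length), i):
            score = best[j] + boundary + log_prior(i - j)
            if score > best[i]:
                best[i], back[i] = score, j
    # Walk the back pointers to recover the cut points.
    cuts, i = [], n
    while i > 0:
        cuts.append(i)
        i = back[i]
    cuts.reverse()
    return [text[a:b] for a, b in zip([0] + cuts[:-1], cuts)]
```

Scoring every candidate segment of length up to `max_length` makes the result globally optimal under the combined objective, at O(n * max_length) cost; the `"greedy"` algorithm trades that optimality for speed.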

**Text Reconstruction:**
```python
# With constraints (max_length or min_length):
original_text = "".join(segments)

# Without constraints (SaT default):
original_text = "\n".join(segments)
```

> **Note**: When `max_length` is set, the `threshold` parameter is ignored. The Viterbi/greedy algorithms use raw model probabilities directly instead of threshold-based filtering.
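
For example (the threshold value is illustrative):

```python
sat.split(text, threshold=0.25)                   # threshold applies
sat.split(text, threshold=0.25, max_length=100)   # threshold is ignored
```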

For more details, see the [Length Constraints Documentation](./docs/LENGTH_CONSTRAINTS.md).

## Adaptation

SaT can be domain- and style-adapted via LoRA. We provide trained LoRA modules for Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speecjes) sentence styles in 81 languages for `sat-3l`and `sat-12l`. Additionally, we provide LoRA modules for legal documents (laws and judgements) in 6 languages, code-switching in 4 language pairs, and tweets in 3 languages. For details, we refer to our [paper](https://arxiv.org/abs/2406.16678).
SaT can be domain- and style-adapted via LoRA. We provide trained LoRA modules for Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speeches) sentence styles in 81 languages for `sat-3l` and `sat-12l`. Additionally, we provide LoRA modules for legal documents (laws and judgements) in 6 languages, code-switching in 4 language pairs, and tweets in 3 languages. For details, we refer to our [paper](https://arxiv.org/abs/2406.16678).

We also provide verse segmentation modules for 16 genres for `sat-12l-no-limited-lookahead`.
