
Conversation

@harikesavan (Collaborator)

…rithms

Copilot AI (Contributor) left a comment


Pull request overview

This PR adds length-constrained segmentation capabilities to wtpsplit, allowing users to specify minimum and maximum segment lengths with configurable prior functions and algorithms (Viterbi or greedy).

Key Changes

  • New constraint-based segmentation with min_length and max_length parameters
  • Three prior function types: uniform, Gaussian, and clipped polynomial
  • Support for both Viterbi (optimal) and greedy (fast) segmentation algorithms
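
Taken together, a call using the new parameters might look like this; a minimal sketch with illustrative values (the model name and argument values mirror the examples later in this thread):

```python
from wtpsplit import SaT

sat = SaT("sat-3l")
segments = sat.split(
    "Some long input text that should be segmented into sentence-like chunks.",
    min_length=25,          # best-effort lower bound on segment length
    max_length=100,         # strictly enforced upper bound
    prior_type="gaussian",  # length prior: "uniform", "gaussian", or "clipped_polynomial"
    algorithm="viterbi",    # globally optimal DP; "greedy" is the faster alternative
)
```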

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 19 comments.

Summary per file:

| File | Description |
| --- | --- |
| wtpsplit/utils/priors.py | New module implementing three prior function types for length preferences |
| wtpsplit/utils/constraints.py | New module with constrained segmentation algorithms (Viterbi and greedy) |
| wtpsplit/__init__.py | Integrated new parameters into WtP and SaT split methods; added _enforce_segment_constraints helper; version downgraded to 2.1.6; removed merge_lora parameter |
| test_constraints.py | Comprehensive test suite covering constraints, priors, algorithms, and edge cases |
| test.py | Added integration tests for length-constrained segmentation features |
Comments suppressed due to low confidence (1)

wtpsplit/__init__.py:1067

  • The error message contains a typo: "If you want to split on such newlines, set split_on_input_newlines=False." This should say "set split_on_input_newlines=True" because the condition at line 1055 checks if split_on_input_newlines is True to perform the splitting.
                    warnings.warn(
                        "split_on_input_newlines=False will lead to newlines in the output "
                        "if they were present in the input. Within the model, such newlines are "
                        "treated as spaces. "
                        "If you want to split on such newlines, set split_on_input_newlines=False."


boundaries = constrained_segmentation(
    paragraph_probs, prior_fn, min_length=min_length, max_length=max_length, algorithm=algorithm
)
indices = [b - 1 for b in boundaries]

Copilot AI (Nov 25, 2025):

Same potential index calculation issue: indices = [b - 1 for b in boundaries]. This pattern is repeated throughout the codebase and needs to be verified for correctness.

Comment on lines 383 to 384
min_length: int = 1,
max_length: int = None,

Copilot AI (Nov 25, 2025):

Missing input validation for max_length parameter. The parameter accepts int = None as type but doesn't validate that when an integer is provided, it should be positive. Negative or zero values for max_length would cause logical errors in the segmentation algorithm. Consider adding validation: if max_length is not None and max_length <= 0: raise ValueError("max_length must be positive").
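
A minimal sketch of that validation as a standalone helper (hypothetical name, not code from the PR):

```python
from typing import Optional

def check_max_length(max_length: Optional[int]) -> None:
    # Reject zero or negative values early, as suggested above.
    if max_length is not None and max_length <= 0:
        raise ValueError(f"max_length must be positive, got {max_length}")
```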

prior_type: str = "uniform",
prior_kwargs: dict = None,
algorithm: str = "viterbi",
):

Copilot AI (Nov 25, 2025):

Missing input validation for algorithm parameter. The parameter accepts any string but only "viterbi" and "greedy" are valid. Invalid values would only be caught when constrained_segmentation is called. Consider adding early validation: if algorithm not in ("viterbi", "greedy"): raise ValueError(f"algorithm must be 'viterbi' or 'greedy', got '{algorithm}'").

Suggested change:

):
    if algorithm not in ("viterbi", "greedy"):
        raise ValueError(f"algorithm must be 'viterbi' or 'greedy', got '{algorithm}'")

# merge lora weights into transformer for 0 efficiency overhead
if merge_lora:
    self.model.model.merge_adapter("sat-lora")
self.model.model.merge_adapter("sat-lora")

Copilot AI (Nov 25, 2025):

The merge_lora conditional has been removed, making merge_adapter() always execute. This changes the behavior for users who may have set merge_lora=False previously. This is a behavioral breaking change related to the removed parameter.

if len(last) < min_length:
    prev = result[-2]
    if max_length is None or len(prev) + len(last) <= max_length:
        result[-2] = prev + last

Copilot AI (Nov 25, 2025):

The same segment concatenation issue exists here: result[-2] = prev + last merges segments without preserving any delimiter or whitespace between them, which can corrupt the text output.

Suggested change (replacing result[-2] = prev + last):

# Insert a space between segments if needed
if prev and last and not prev[-1].isspace() and not last[0].isspace():
    result[-2] = prev + " " + last
else:
    result[-2] = prev + last

from wtpsplit.utils.constraints import constrained_segmentation
from wtpsplit.utils.priors import create_prior_function

__version__ = "2.1.6"

Copilot AI (Nov 25, 2025):

The version has been downgraded from "2.1.7" to "2.1.6", which is unusual for a PR that adds new features. When adding new functionality like length-constrained segmentation, the version should typically be incremented (e.g., to "2.2.0" or "2.1.8"), not decremented. This could cause confusion for users and break dependency management.

Suggested change
__version__ = "2.1.6"
__version__ = "2.2.0"

import numpy as np

def create_prior_function(name, kwargs):
    if name == "uniform":

Copilot AI (Nov 25, 2025):

Missing docstring for create_prior_function. This is a public function imported and used throughout the codebase. It should have documentation explaining:

  • What each prior type does
  • The expected kwargs for each prior type
  • The return value (a prior function)
  • Example usage
Suggested change (docstring inserted at the top of the function body, before if name == "uniform":):

    """
    Returns a prior function for the specified prior type.

    Parameters
    ----------
    name : str
        The name of the prior type. Supported values:
        - "uniform": Returns 1.0 for all lengths up to `max_length`, else 0.0.
        - "clipped_polynomial": Returns max(1.0 - alpha * ((length - mu) ** 2), 0.0).
        - "gaussian": Returns exp(-0.5 * ((length - mu) / sigma) ** 2).
    kwargs : dict
        Keyword arguments for the prior type:
        - For "uniform":
          max_length (float or int, optional): Maximum allowed length. If provided, prior is 0.0 for lengths > max_length.
        - For "clipped_polynomial":
          alpha (float, optional): Controls the steepness. Default is 0.5.
          mu (float, optional): Center of the polynomial. Default is 3.0.
        - For "gaussian":
          mu (float, optional): Mean of the Gaussian. Default is 20.0.
          sigma (float, optional): Standard deviation. Default is 5.0.

    Returns
    -------
    prior : function
        A function prior(length) -> float, which computes the prior probability for the given length.

    Raises
    ------
    ValueError
        If an unknown prior type is specified.

    Examples
    --------
    >>> prior = create_prior_function("uniform", {"max_length": 10})
    >>> prior(5)
    1.0
    >>> prior(15)
    0.0
    >>> prior = create_prior_function("clipped_polynomial", {"alpha": 0.2, "mu": 5})
    >>> prior(5)
    1.0
    >>> prior(10)
    0.0
    >>> prior = create_prior_function("gaussian", {"mu": 10, "sigma": 2})
    >>> prior(10)
    1.0
    >>> prior(14)
    0.1353352832366127
    """

# Accumulate following segments
while j < len(filtered) and len(merged) < min_length:
    next_seg = filtered[j]
    if max_length is None or len(merged) + len(next_seg) <= max_length:

Copilot AI (Nov 25, 2025):

The _enforce_segment_constraints function concatenates segments directly without any delimiter (e.g., merged += next_seg on line 82). This can merge segments that should have space or other separators between them, potentially corrupting the text. For example, "Hello." + "World." would become "Hello.World." instead of "Hello. World."

Suggested change:

if max_length is None or len(merged) + len(next_seg) <= max_length:
    # Insert a space if needed between segments
    if merged and not merged.endswith((' ', '\n')) and not next_seg.startswith((' ', '\n')):
        merged += ' '

prior_type: str = "uniform",
prior_kwargs: dict = None,
algorithm: str = "viterbi",
):

Copilot AI (Nov 25, 2025):

Missing input validation for conflicting constraints. If a user provides min_length > max_length, this creates an impossible constraint but no error is raised. The code should validate this and raise a clear error: if max_length is not None and min_length > max_length: raise ValueError("min_length cannot be greater than max_length").

Suggested change:

):
    if max_length is not None and min_length > max_length:
        raise ValueError("min_length cannot be greater than max_length")

Comment on lines 520 to 523
if prior_kwargs is None:
    prior_kwargs = {}
else:
    prior_kwargs = prior_kwargs.copy()

Copilot AI (Nov 25, 2025):

Performance concern: The prior_kwargs.copy() operation is performed on every paragraph iteration (lines 522-523, 558-559, 1003-1004, 1038-1039). For large documents with many paragraphs, this creates unnecessary dict copies. Consider moving the copy operation outside the loop or checking if the dict is actually being modified before copying.
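
A sketch of the suggested hoisting, assuming the copy exists only to inject max_length (the loop and variable names are illustrative):

```python
from wtpsplit.utils.constraints import constrained_segmentation
from wtpsplit.utils.priors import create_prior_function

# Build the prior function once, before iterating over paragraphs ...
effective_kwargs = dict(prior_kwargs or {})
if max_length is not None:
    effective_kwargs["max_length"] = max_length
prior_fn = create_prior_function(prior_type, effective_kwargs)

# ... then reuse it inside the loop without any further dict copies.
for paragraph_probs in all_paragraph_probs:
    boundaries = constrained_segmentation(
        paragraph_probs, prior_fn, min_length=min_length,
        max_length=max_length, algorithm=algorithm,
    )
```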

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 11 comments.



Comment on lines +235 to +241
def test_tiny_fragments_merging(self, wtp_model):
    """Tiny fragments should be merged to meet min_length."""
    text = "A. B. C. D. E. F. G. H. I. J."
    splits = wtp_model.split(text, min_length=10, threshold=0.005)

    for segment in splits:
        assert len(segment) >= 10, f"Segment '{segment}' is too short"

Copilot AI (Nov 28, 2025):

The test at line 238 asserts that all segments are >= 10 characters when using min_length=10. However, based on the documentation and implementation, min_length is "best effort" - segments may be shorter if merging would violate max_length or if the segment is inherently too short. This test could fail in edge cases and doesn't match the documented behavior. Consider either:

  1. Allowing for exceptions where segments can be shorter (see the sketch below)
  2. Using a test case where min_length can always be satisfied
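
A relaxed assertion along the lines of option 1 could look like this (hypothetical helper; the tolerance value is illustrative):

```python
def assert_mostly_min_length(splits, min_length, tolerance=0.2):
    # Accept best-effort behavior: at most `tolerance` of segments may fall short.
    short = [s for s in splits if len(s) < min_length]
    assert len(short) <= tolerance * len(splits), f"Too many short segments: {short}"
```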

Comment on lines +243 to +249
def test_very_short_sentences(self, wtp_model):
    """Very short sentences should be merged when needed."""
    text = "Hi. Bye. Go. Stop. Run. Walk. Jump. Sit."
    splits = wtp_model.split(text, min_length=15, threshold=0.005)

    for segment in splits:
        assert len(segment) >= 15, f"Segment '{segment}' is too short"

Copilot AI (Nov 28, 2025):

Similar to the previous test, line 248 asserts all segments are >= 15 characters. Since min_length is documented as "best effort", this test might fail in edge cases where the algorithm cannot satisfy the minimum without violating other constraints. Consider relaxing the assertion to account for best-effort behavior, or verifying that most (but not necessarily all) segments meet the minimum.

Comment on lines 264 to 273
def test_min_length_constraint_wtp():
    """Test minimum length constraint with WtP"""
    wtp = WtP("wtp-bert-mini", ort_providers=["CPUExecutionProvider"])

    text = "Short. Test. Hello. World. This is longer."
    splits = wtp.split(text, min_length=15, threshold=0.005)

    # All segments should be >= 15 characters
    for segment in splits:
        assert len(segment) >= 15, f"Segment '{segment}' is shorter than min_length"

Copilot AI (Nov 28, 2025):

Similar to the comprehensive test suite, line 273 asserts that all segments are >= 15 characters when using min_length=15. However, min_length is documented as "best effort" and segments may be shorter if constraints cannot be satisfied. This test may fail in edge cases and doesn't align with the documented behavior. Consider either:

  1. Allowing for exceptions where segments can be shorter
  2. Using test data where min_length can always be guaranteed

Comment on lines 840 to 854
# Input validation
if max_length is not None and min_length > max_length:
    raise ValueError(
        f"min_length ({min_length}) cannot be greater than max_length ({max_length})"
    )
if min_length < 1:
    raise ValueError(f"min_length must be >= 1, got {min_length}")
if max_length is not None and max_length < 1:
    raise ValueError(f"max_length must be >= 1, got {max_length}")
valid_priors = ["uniform", "gaussian", "clipped_polynomial"]
if prior_type not in valid_priors:
    raise ValueError(f"Unknown prior_type: '{prior_type}'. Must be one of {valid_priors}")
valid_algorithms = ["viterbi", "greedy"]
if algorithm not in valid_algorithms:
    raise ValueError(f"Unknown algorithm: '{algorithm}'. Must be one of {valid_algorithms}")

Copilot AI (Nov 28, 2025):

[nitpick] There's code duplication between WtP's split method (lines 323-337) and SaT's split method (lines 840-854). The same input validation logic is repeated. Consider extracting this validation into a shared helper function to reduce duplication and ensure consistency.
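
A sketch of such a helper, lifted directly from the duplicated block quoted above (the function name is hypothetical):

```python
def _validate_constraint_args(min_length, max_length, prior_type, algorithm):
    # Shared input validation for WtP.split() and SaT.split().
    if max_length is not None and min_length > max_length:
        raise ValueError(
            f"min_length ({min_length}) cannot be greater than max_length ({max_length})"
        )
    if min_length < 1:
        raise ValueError(f"min_length must be >= 1, got {min_length}")
    if max_length is not None and max_length < 1:
        raise ValueError(f"max_length must be >= 1, got {max_length}")
    valid_priors = ["uniform", "gaussian", "clipped_polynomial"]
    if prior_type not in valid_priors:
        raise ValueError(f"Unknown prior_type: '{prior_type}'. Must be one of {valid_priors}")
    valid_algorithms = ["viterbi", "greedy"]
    if algorithm not in valid_algorithms:
        raise ValueError(f"Unknown algorithm: '{algorithm}'. Must be one of {valid_algorithms}")
```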

Comment on lines 294 to 303
def test_min_max_constraints_together():
    """Test both constraints simultaneously"""
    wtp = WtP("wtp-bert-mini", ort_providers=["CPUExecutionProvider"])

    text = "Hello world. " * 15
    splits = wtp.split(text, min_length=25, max_length=65, threshold=0.005)

    # All segments should satisfy both constraints
    for segment in splits:
        assert 25 <= len(segment) <= 65, f"Segment '{segment}' violates constraints"

Copilot AI (Nov 28, 2025):

Line 303 asserts that all segments satisfy both constraints 25 <= len(segment) <= 65. This strict assertion doesn't account for the "best effort" nature of min_length. While max_length is always strictly enforced, min_length may not be achievable in all cases. The test might fail for certain inputs where min_length cannot be satisfied without violating max_length.

Comment on lines +416 to +417
# Use >= to handle min_length == max_length case
if next_split >= curr_idx + min_length:

Copilot AI (Nov 28, 2025):

[nitpick] In the fallback case when dp[n] == -float("inf"), the code uses >= in the comparison at line 417, with a comment explaining it handles the min_length == max_length case. However, the regular condition at line 390-391 uses end_j = i - min_length which would already exclude the boundary case. The logic appears correct but the consistency between the fallback and the main algorithm could be clearer. Consider verifying that both code paths handle edge cases identically, especially when min_length == max_length.

Suggested change (replacing the two quoted lines above):

# Use strict boundary to match main algorithm (min_length == max_length case)
if next_split - curr_idx >= min_length:

# If still too short, try merging with previous non-empty segment
if len(merged) < min_length and result:
    # Find a previous non-empty segment that can accommodate the merge
    for prev_idx in range(len(result) - 1, -1, -1):

Copilot AI (Nov 28, 2025):

[nitpick] In the _enforce_segment_constraints_simple function, line 214 checks if not seg or not seg.strip(): which handles both empty strings and whitespace-only strings. However, the code then appends seg (which could be whitespace-only) to the result. This is intentional to preserve structure, but on line 264, there's a check if result[prev_idx] and result[prev_idx].strip(): that requires non-whitespace content. This could lead to inconsistent behavior where whitespace-only segments are preserved in some cases but not used as merge candidates in others. Consider documenting this behavior more clearly or ensuring consistent handling of whitespace-only segments throughout the function.

Suggested change:

for prev_idx in range(len(result) - 1, -1, -1):
    # Only merge with previous segments that contain non-whitespace content.
    # Whitespace-only segments are preserved for structure, but not used as merge candidates.

# Find a good split point (prefer splitting at whitespace)
split_at = max_length
# Look for whitespace near the end to split at
for j in range(max_length - 1, max(0, max_length - 20), -1):

Copilot AI (Nov 28, 2025):

[nitpick] In _enforce_segment_constraints, line 85 searches for whitespace to split at, looking back up to 20 characters from max_length. The magic number 20 should be defined as a named constant to make the code more maintainable and allow easy adjustment if needed. For example: WHITESPACE_SEARCH_WINDOW = 20.

from wtpsplit import WtP, SaT
from wtpsplit.utils.constraints import (
    constrained_segmentation,
    _enforce_segment_constraints,

Copilot AI (Nov 28, 2025):

Import of '_enforce_segment_constraints' is not used.

Suggested change (delete this line):

_enforce_segment_constraints,

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

wtpsplit/__init__.py:1062

  • Syntax Error (in Python 3).
l                    else:


Comment on lines +424 to +429
while curr_idx < n:
    next_split = min(curr_idx + max_length, n)
    # Use >= to handle min_length == max_length case
    if next_split >= curr_idx + min_length:
        indices.append(next_split)
        curr_idx = next_split

Copilot AI (Dec 4, 2025):

The condition on line 427 checks if next_split >= curr_idx + min_length but the comment on line 426 says "Use >= to handle min_length == max_length case". However, this check appears inconsistent with the dynamic programming loop bounds where end_j = i - min_length (line 398). When min_length == max_length, this could potentially create segments that violate the equality constraint. Consider verifying this edge case is handled correctly.

Suggested change:

if min_length == max_length:
    # Strictly enforce equal-sized segments
    while curr_idx < n:
        next_split = min(curr_idx + min_length, n)
        indices.append(next_split)
        curr_idx = next_split
else:
    while curr_idx < n:
        next_split = min(curr_idx + max_length, n)
        # Use >= to handle min_length == max_length case
        if next_split >= curr_idx + min_length:
            indices.append(next_split)
            curr_idx = next_split

sentence = sentence[:-1]
new_sentences.extend(sentence.split("\n"))
sentences = new_sentences
l else:

Copilot AI (Dec 4, 2025):

There is a lowercase 'l' character at the beginning of this line that appears to be a typo. This will cause a syntax error. The line should start with proper indentation and the else statement, not "l".

Suggested change
l else:
else:

README.md Outdated

## Adaptation

SaT can be domain- and style-adapted via LoRA. We provide trained LoRA modules for Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speecjes) sentence styles in 81 languages for `sat-3l`and `sat-12l`. Additionally, we provide LoRA modules for legal documents (laws and judgements) in 6 languages, code-switching in 4 language pairs, and tweets in 3 languages. For details, we refer to our [paper](https://arxiv.org/abs/2406.16678).

Copilot AI (Dec 4, 2025):

Typo in comment: "speecjes" should be "speeches".

Suggested change
SaT can be domain- and style-adapted via LoRA. We provide trained LoRA modules for Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speecjes) sentence styles in 81 languages for `sat-3l`and `sat-12l`. Additionally, we provide LoRA modules for legal documents (laws and judgements) in 6 languages, code-switching in 4 language pairs, and tweets in 3 languages. For details, we refer to our [paper](https://arxiv.org/abs/2406.16678).
SaT can be domain- and style-adapted via LoRA. We provide trained LoRA modules for Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speeches) sentence styles in 81 languages for `sat-3l`and `sat-12l`. Additionally, we provide LoRA modules for legal documents (laws and judgements) in 6 languages, code-switching in 4 language pairs, and tweets in 3 languages. For details, we refer to our [paper](https://arxiv.org/abs/2406.16678).

Comment on lines +1063 to +1067
warnings.warn(
    "split_on_input_newlines=False will lead to newlines in the output "
    "if they were present in the input. Within the model, such newlines are "
    "treated as spaces. "
    "If you want to split on such newlines, set split_on_input_newlines=True."

Copilot AI (Dec 4, 2025):

The warning message states "If you want to split on such newlines, set split_on_input_newlines=False" but this appears to be incorrect logic. If the user wants to split on newlines, they should set split_on_input_newlines=True, not False. The message should say "set split_on_input_newlines=True".

Comment on lines 322 to 343
# Input validation
if max_length is not None and min_length > max_length:
    raise ValueError(
        f"min_length ({min_length}) cannot be greater than max_length ({max_length})"
    )
if min_length < 1:
    raise ValueError(f"min_length must be >= 1, got {min_length}")
if max_length is not None and max_length < 1:
    raise ValueError(f"max_length must be >= 1, got {max_length}")
valid_priors = ["uniform", "gaussian", "clipped_polynomial", "lognormal"]
if prior_type not in valid_priors:
    raise ValueError(f"Unknown prior_type: '{prior_type}'. Must be one of {valid_priors}")
valid_algorithms = ["viterbi", "greedy"]
if algorithm not in valid_algorithms:
    raise ValueError(f"Unknown algorithm: '{algorithm}'. Must be one of {valid_algorithms}")

if max_length is not None and threshold is not None:
    warnings.warn(
        "Both 'threshold' and 'max_length' are set. When using length-constrained "
        "segmentation (max_length), the threshold parameter is ignored.",
        UserWarning,
    )

Copilot AI (Dec 4, 2025):

The validation logic is duplicated in both the WtP.split() method (lines 322-343) and the SaT.split() method (lines 847-868). Consider extracting this validation logic into a shared helper function to avoid code duplication and ensure consistency. This would improve maintainability and reduce the risk of the validation logic diverging between the two classes.

Comment on lines 476 to 482
if prior_kwargs is None:
    prior_kwargs = {}
else:
    prior_kwargs = prior_kwargs.copy()
if max_length is not None:
    prior_kwargs["max_length"] = max_length
prior_fn = create_prior_function(prior_type, prior_kwargs)

Copilot AI (Dec 4, 2025):

[nitpick] The prior_kwargs dictionary is copied and potentially modified multiple times in nested if blocks (lines 476-482, 508-513, 983-991, 1017-1025). This creates unnecessary object creation and complexity. Consider consolidating the prior_kwargs initialization logic to avoid redundant copying and mutation.

Comment on lines +112 to +113
seg_len = len(segment.strip()) if strip_whitespace else len(segment)
if seg_len < min_length and i + 1 < len(boundaries):

Copilot AI (Dec 4, 2025):

[nitpick] The variable name seg_len is calculated but then the same calculation is repeated inline in the following conditional check. Consider using the already-calculated seg_len variable instead of recalculating len(segment.strip()) if strip_whitespace else len(segment) on line 112.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 16 out of 24 changed files in this pull request and generated 5 comments.



# "onnxruntime>=1.13.1", # can make conflicts between onnxruntime and onnxruntime-gpu
"transformers>=4.22.2",
"huggingface-hub",
"transformers>=4.22.2,<5.0", # v5.0 has breaking changes; adapters library needs update first

Copilot AI (Dec 5, 2025):

The dependency version constraint "transformers>=4.22.2,<5.0" references v5.0 as having breaking changes. However, as of the knowledge cutoff (January 2025), transformers v5.0 has not been released yet. The latest stable version is 4.x series. This comment may be speculative or based on unreleased information. Consider verifying whether v5.0 actually exists and has the mentioned breaking changes, or update the comment to reflect current state.

Suggested change
"transformers>=4.22.2,<5.0", # v5.0 has breaking changes; adapters library needs update first
"transformers>=4.22.2,<5.0", # upper bound is precautionary in case v5.0 introduces breaking changes; adapters library may need update if/when v5.0 is released

"transformers>=4.22.2",
"huggingface-hub",
"transformers>=4.22.2,<5.0", # v5.0 has breaking changes; adapters library needs update first
"huggingface-hub<1.0", # v1.0 has breaking changes (HfFolder removed)

Copilot AI (Dec 5, 2025):

The dependency version constraint "huggingface-hub<1.0" mentions that v1.0 has breaking changes (HfFolder removed). However, as of the knowledge cutoff (January 2025), huggingface-hub is still in the 0.x series (latest ~0.27.x). This comment may be speculative or based on unreleased information. Consider verifying the current state of huggingface-hub and whether this constraint and comment are still accurate.

Suggested change
"huggingface-hub<1.0", # v1.0 has breaking changes (HfFolder removed)
"huggingface-hub<1.0", # Pin to <1.0 until v1.0 is released and compatibility is confirmed

@@ -0,0 +1,288 @@
# Length-Constrained Segmentation

This supplementary document explains the theory and implementation of length-constrained segmentation in wtpsplit (NB: auto-generated).

Copilot AI (Dec 5, 2025):

[nitpick] The documentation states "(NB: auto-generated)" but this appears to be a manually written, well-structured markdown document. If this is indeed auto-generated, consider clarifying what tool generates it and when. If it's manually written, remove the "auto-generated" note to avoid confusion.

Suggested change
This supplementary document explains the theory and implementation of length-constrained segmentation in wtpsplit (NB: auto-generated).
This supplementary document explains the theory and implementation of length-constrained segmentation in wtpsplit.

"skops",
"pandas>=1",
"cached_property", # for Py37
"mosestokenizer",

Copilot AI (Dec 5, 2025):

The cached_property dependency has been removed from setup.py, but the code still imports it from the third-party package (from cached_property import cached_property) in wtpsplit/utils/__init__.py. Since python_requires=">=3.9" has been set, this should be updated to use the standard library version: from functools import cached_property. This will cause an import error for users installing the package.
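
The described fix is a one-line change in wtpsplit/utils/__init__.py (cached_property has been in the standard library since Python 3.8):

```python
from functools import cached_property
```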

"split_on_input_newlines=False will lead to newlines in the output "
"if they were present in the input. Within the model, such newlines are "
"treated as spaces. "
"If you want to split on such newlines, set split_on_input_newlines=True."

Copilot AI (Dec 5, 2025):

[nitpick] The warning message states "If you want to split on such newlines, set split_on_input_newlines=True" but this warning is shown when split_on_input_newlines=False. The warning should say "set split_on_input_newlines=True" which is correct, but the context suggests the user already has it set to False and probably wants to keep it that way. Consider rephrasing to make it clearer, e.g., "If you want to split on newlines instead, set split_on_input_newlines=True."

Suggested change
"If you want to split on such newlines, set split_on_input_newlines=True."
"If you want to split on newlines instead, set split_on_input_newlines=True."

@markus583 (Collaborator)

Over time, this mutated a little into a more comprehensive v2.2.0 update! I expanded the core features from @harikesavan (adding language priors, a lognormal prior, among others), tried to make sure we cover all edge cases, and ensured that text can be fully recovered after splitting (hence the very comprehensive test suite...). I also added some docs, and bumped Python to >= 3.9 since our dependencies require that anyway.

I thought a full changelog would be useful, so here it is (auto-generated, though):
@bminixhofer

Maybe also interesting to @harikesavan @igorsterner


📋 Changelog: wtpsplit v2.1.7 → v2.2.0

26 files changed, +3,122 / -101 lines


🎯 Major Feature: Length-Constrained Segmentation

Control segment lengths with min_length and max_length parameters.

New Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| min_length | int | 1 | Minimum length (best effort) |
| max_length | int \| None | None | Maximum length (strict) |
| prior_type | str | "uniform" | Prior distribution |
| prior_kwargs | dict \| None | None | Prior configuration |
| algorithm | str | "viterbi" | "viterbi" (optimal) or "greedy" (faster) |

Prior Functions

| Prior | Best For | Key Parameters |
| --- | --- | --- |
| "uniform" (default) | Just enforce max_length | (none) |
| "gaussian" | Prefer target length | target_length, spread |
| "lognormal" | Natural distribution | target_length, spread (0.3-0.7) |
| "clipped_polynomial" | Hard enforcement | target_length, spread |

Language-Aware Defaults (70+ languages)

Automatic target_length/spread based on language:

  • East Asian (zh, ja, ko): shorter (45-55 chars)
  • Germanic (de, nl, en): medium-long (75-90 chars)
  • Romance/Slavic (fr, es, ru): medium-long (78-85 chars)
# Auto-applies when using LoRA with language
sat = SaT("sat-3l", style_or_domain="ud", language="de")
sat.split(text, max_length=150, prior_type="gaussian")  # German defaults

# Or explicit
sat.split(text, max_length=100, prior_type="gaussian", prior_kwargs={"lang_code": "zh"})

Text Reconstruction

# With constraints: "".join(segments) == original
# Without constraints: "\n".join(segments) == original
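
A runnable sketch of that invariant (the model name and text are illustrative):

```python
from wtpsplit import SaT

sat = SaT("sat-3l")
text = "First sentence here. Second sentence follows. A third one ends it."
segments = sat.split(text, max_length=30)
assert "".join(segments) == text  # constrained mode: concatenation recovers the input exactly
```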

🆕 New Files

| File | Lines | Description |
| --- | --- | --- |
| wtpsplit/utils/constraints.py | 494 | Viterbi DP & greedy algorithms, constraint enforcement |
| wtpsplit/utils/priors.py | 198 | Prior functions + 70+ language defaults |
| test_length_constraints.py | 1,164 | 98 test cases |
| length_constrained_segmentation_demo.py | 450 | Interactive demo |
| docs/LENGTH_CONSTRAINTS.md | 288 | Math & implementation docs |

📝 Key Modifications

wtpsplit/__init__.py (+276 lines)

  • New parameters for WtP.split() / SaT.split()
  • Input validation, warnings when threshold ignored with max_length
  • Type hints fixed with from __future__ import annotations

wtpsplit/utils/__init__.py

  • Bug fix: from cached_property import ... → from functools import cached_property

README.md (+84 lines)

  • New section: "(NEW! v2.2+) Length-Constrained Segmentation"

🔧 Build & CI Changes

setup.py

python_requires=">=3.9"                    # NEW: drops 3.7, 3.8
"transformers>=4.22.2,<5.0"               # Pinned (v5.0 breaking)
"huggingface-hub<1.0"                     # Pinned (HfFolder removed)
# Removed: "cached_property" (stdlib in 3.9+)

pyproject.toml

  • target-version: py38 → py39, py310, py311, py312
  • [tool.ruff.per-file-ignores] → [tool.ruff.lint.per-file-ignores]

.github/workflows/python.yml

  • Python matrix: removed 3.8, added 3.12
  • Updated to actions/checkout@v4, actions/setup-python@v5
  • ruff → ruff check --target-version=py39

requirements.txt

  • Pinned huggingface-hub==0.25.2
  • Removed cached_property

⚠️ Breaking Changes

| Change | Impact |
| --- | --- |
| Python ≥3.9 | Drops 3.7, 3.8 |
| transformers <5.0 | v5.0 has breaking API |
| huggingface-hub <1.0 | v1.0 removes HfFolder |

🧪 Tests

| File | Tests |
| --- | --- |
| test.py | +6 new constraint tests |
| test_length_constraints.py | 98 tests (NEW) |
| Total | 130 tests ✅ |

@bminixhofer (Collaborator) left a comment

Thanks, this will be a great feature! A couple of comments:

  • What do we need _enforce_segment_constraints_simple for? I only see it used in tests.
  • How did we arrive at LANG_SENTENCE_STATS? Is it LLM generated? If it is, I'd rather not have it since it's likely nonsense for a couple of languages. It would be good to derive this from corpus data instead as noted in the TODO.
  • I don't understand all the priors and why we need them. Why does lognormal lead to "natural-feeling output"? Did we ever check the empirical distribution of sentence lengths? I also don't understand how clipped_polynomial behaves. It would be nice to have a figure showing the priors (see the sketch after this list).
  • I don't quite understand _enforce_segment_constraints. Why do we need to manually ensure segments don't exceed the maximum length, shouldn't the prior take care of this (since segments > max length have zero probability)? I also don't see why we need to manually merge shorter segments, since the prior should also account for this.
  • I didn't check the details of the Viterbi implementation. Is this LLM generated as well or manually written? We should have a pretty high confidence it's correct before merging in either case. Maybe @markus583 @harikesavan one of you is already certain it's correct. If not, we still need to check it in more detail.
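
For the figure requested above, a quick sketch using the formulas quoted in the create_prior_function docstring earlier in this thread (parameter values chosen only for illustration; lognormal omitted since its exact parametrization isn't shown here):

```python
import numpy as np
import matplotlib.pyplot as plt

lengths = np.arange(1, 61)
uniform = np.where(lengths <= 40, 1.0, 0.0)                       # max_length=40
gaussian = np.exp(-0.5 * ((lengths - 20) / 5) ** 2)               # mu=20, sigma=5
clipped_poly = np.maximum(1.0 - 0.01 * (lengths - 20) ** 2, 0.0)  # alpha=0.01, mu=20

plt.plot(lengths, uniform, label="uniform (max_length=40)")
plt.plot(lengths, gaussian, label="gaussian (mu=20, sigma=5)")
plt.plot(lengths, clipped_poly, label="clipped_polynomial (alpha=0.01, mu=20)")
plt.xlabel("segment length")
plt.ylabel("prior(length)")
plt.legend()
plt.show()
```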

@igorsterner (Collaborator)

Hey! Thanks for all your work on this feature, and your huge efforts to maintain this project!

Just thought I'd chime in after Benjamin to add my two cents. I can definitely see the attraction of enforcing a max length, and hence why the global optimization via Viterbi is good (and better than greedy, like the LCM people did). Awesome. If you need me to dive in deep to review something, just let me know.

But regarding user-specified priors on sentence length, I am more concerned. I agree with Benjamin's comments that with the current approach one should use corpus statistics rather than heuristics. But I feel that even better would be to make the whole thing learnable, in which case one wouldn't need these priors. I think the typical approach would be a segment-level variant of a CRF, which would aim to do something like rank all the possible mean-pooled segments (and one could of course make segments of particular sizes impossible). If it's trained end to end, I think it could keep everything as one language-agnostic model. I'm sure there's another paper in doing better length-constrained sentence segmentation, but big performance gains would probably only come on edge cases. Regardless, I think there needs to be evaluation against at least one benchmark before merging big changes here.

Just my thoughts! Happy to chat. Hope you're all doing well :))
