
Conversation

@harikesavan (Collaborator)

…rithms

Copilot AI (Contributor) left a comment


Pull request overview

This PR adds length-constrained segmentation capabilities to wtpsplit, allowing users to specify minimum and maximum segment lengths with configurable prior functions and algorithms (Viterbi or greedy).

Key Changes

  • New constraint-based segmentation with min_length and max_length parameters
  • Three prior function types: uniform, Gaussian, and clipped polynomial
  • Support for both Viterbi (optimal) and greedy (fast) segmentation algorithms
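
Taken together, a call using the new parameters might look like this; a minimal sketch with illustrative values (the model name and argument values mirror the examples later in this thread):

```python
from wtpsplit import SaT

sat = SaT("sat-3l")
segments = sat.split(
    "Some long input text that should be segmented into sentence-like chunks.",
    min_length=25,          # best-effort lower bound on segment length
    max_length=100,         # strictly enforced upper bound
    prior_type="gaussian",  # length prior: "uniform", "gaussian", or "clipped_polynomial"
    algorithm="viterbi",    # globally optimal DP; "greedy" is the faster alternative
)
```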

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 19 comments.

Summary per file:

| File | Description |
| --- | --- |
| wtpsplit/utils/priors.py | New module implementing three prior function types for length preferences |
| wtpsplit/utils/constraints.py | New module with constrained segmentation algorithms (Viterbi and greedy) |
| wtpsplit/__init__.py | Integrated new parameters into WtP and SaT split methods; added _enforce_segment_constraints helper; version downgraded to 2.1.6; removed merge_lora parameter |
| test_constraints.py | Comprehensive test suite covering constraints, priors, algorithms, and edge cases |
| test.py | Added integration tests for length-constrained segmentation features |
Comments suppressed due to low confidence (1)

wtpsplit/__init__.py:1067

  • The error message contains a typo: "If you want to split on such newlines, set split_on_input_newlines=False." This should say "set split_on_input_newlines=True" because the condition at line 1055 checks if split_on_input_newlines is True to perform the splitting.
                    warnings.warn(
                        "split_on_input_newlines=False will lead to newlines in the output "
                        "if they were present in the input. Within the model, such newlines are "
                        "treated as spaces. "
                        "If you want to split on such newlines, set split_on_input_newlines=False."


boundaries = constrained_segmentation(
    paragraph_probs, prior_fn, min_length=min_length, max_length=max_length, algorithm=algorithm
)
indices = [b - 1 for b in boundaries]

Copilot AI (Nov 25, 2025):

Same potential index calculation issue: indices = [b - 1 for b in boundaries]. This pattern is repeated throughout the codebase and needs to be verified for correctness.

Comment on lines 383 to 384
min_length: int = 1,
max_length: int = None,

Copilot AI (Nov 25, 2025):

Missing input validation for max_length parameter. The parameter accepts int = None as type but doesn't validate that when an integer is provided, it should be positive. Negative or zero values for max_length would cause logical errors in the segmentation algorithm. Consider adding validation: if max_length is not None and max_length <= 0: raise ValueError("max_length must be positive").
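
A minimal sketch of that validation as a standalone helper (hypothetical name, not code from the PR):

```python
from typing import Optional

def check_max_length(max_length: Optional[int]) -> None:
    # Reject zero or negative values early, as suggested above.
    if max_length is not None and max_length <= 0:
        raise ValueError(f"max_length must be positive, got {max_length}")
```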

prior_type: str = "uniform",
prior_kwargs: dict = None,
algorithm: str = "viterbi",
):

Copilot AI (Nov 25, 2025):

Missing input validation for algorithm parameter. The parameter accepts any string but only "viterbi" and "greedy" are valid. Invalid values would only be caught when constrained_segmentation is called. Consider adding early validation: if algorithm not in ("viterbi", "greedy"): raise ValueError(f"algorithm must be 'viterbi' or 'greedy', got '{algorithm}'").

Suggested change:

):
    if algorithm not in ("viterbi", "greedy"):
        raise ValueError(f"algorithm must be 'viterbi' or 'greedy', got '{algorithm}'")

# merge lora weights into transformer for 0 efficiency overhead
if merge_lora:
    self.model.model.merge_adapter("sat-lora")
self.model.model.merge_adapter("sat-lora")

Copilot AI (Nov 25, 2025):

The merge_lora conditional has been removed, making merge_adapter() always execute. This changes the behavior for users who may have set merge_lora=False previously. This is a behavioral breaking change related to the removed parameter.

if len(last) < min_length:
    prev = result[-2]
    if max_length is None or len(prev) + len(last) <= max_length:
        result[-2] = prev + last

Copilot AI (Nov 25, 2025):

The same segment concatenation issue exists here: result[-2] = prev + last merges segments without preserving any delimiter or whitespace between them, which can corrupt the text output.

Suggested change (replacing result[-2] = prev + last):

# Insert a space between segments if needed
if prev and last and not prev[-1].isspace() and not last[0].isspace():
    result[-2] = prev + " " + last
else:
    result[-2] = prev + last

from wtpsplit.utils.constraints import constrained_segmentation
from wtpsplit.utils.priors import create_prior_function

__version__ = "2.1.6"

Copilot AI (Nov 25, 2025):

The version has been downgraded from "2.1.7" to "2.1.6", which is unusual for a PR that adds new features. When adding new functionality like length-constrained segmentation, the version should typically be incremented (e.g., to "2.2.0" or "2.1.8"), not decremented. This could cause confusion for users and break dependency management.

Suggested change
__version__ = "2.1.6"
__version__ = "2.2.0"

import numpy as np

def create_prior_function(name, kwargs):
    if name == "uniform":

Copilot AI (Nov 25, 2025):

Missing docstring for create_prior_function. This is a public function imported and used throughout the codebase. It should have documentation explaining:

  • What each prior type does
  • The expected kwargs for each prior type
  • The return value (a prior function)
  • Example usage
Suggested change (docstring inserted at the top of the function body, before if name == "uniform":):

    """
    Returns a prior function for the specified prior type.

    Parameters
    ----------
    name : str
        The name of the prior type. Supported values:
        - "uniform": Returns 1.0 for all lengths up to `max_length`, else 0.0.
        - "clipped_polynomial": Returns max(1.0 - alpha * ((length - mu) ** 2), 0.0).
        - "gaussian": Returns exp(-0.5 * ((length - mu) / sigma) ** 2).
    kwargs : dict
        Keyword arguments for the prior type:
        - For "uniform":
          max_length (float or int, optional): Maximum allowed length. If provided, prior is 0.0 for lengths > max_length.
        - For "clipped_polynomial":
          alpha (float, optional): Controls the steepness. Default is 0.5.
          mu (float, optional): Center of the polynomial. Default is 3.0.
        - For "gaussian":
          mu (float, optional): Mean of the Gaussian. Default is 20.0.
          sigma (float, optional): Standard deviation. Default is 5.0.

    Returns
    -------
    prior : function
        A function prior(length) -> float, which computes the prior probability for the given length.

    Raises
    ------
    ValueError
        If an unknown prior type is specified.

    Examples
    --------
    >>> prior = create_prior_function("uniform", {"max_length": 10})
    >>> prior(5)
    1.0
    >>> prior(15)
    0.0
    >>> prior = create_prior_function("clipped_polynomial", {"alpha": 0.2, "mu": 5})
    >>> prior(5)
    1.0
    >>> prior(10)
    0.0
    >>> prior = create_prior_function("gaussian", {"mu": 10, "sigma": 2})
    >>> prior(10)
    1.0
    >>> prior(14)
    0.1353352832366127
    """

# Accumulate following segments
while j < len(filtered) and len(merged) < min_length:
    next_seg = filtered[j]
    if max_length is None or len(merged) + len(next_seg) <= max_length:

Copilot AI (Nov 25, 2025):

The _enforce_segment_constraints function concatenates segments directly without any delimiter (e.g., merged += next_seg on line 82). This can merge segments that should have space or other separators between them, potentially corrupting the text. For example, "Hello." + "World." would become "Hello.World." instead of "Hello. World."

Suggested change:

if max_length is None or len(merged) + len(next_seg) <= max_length:
    # Insert a space if needed between segments
    if merged and not merged.endswith((' ', '\n')) and not next_seg.startswith((' ', '\n')):
        merged += ' '

prior_type: str = "uniform",
prior_kwargs: dict = None,
algorithm: str = "viterbi",
):

Copilot AI (Nov 25, 2025):

Missing input validation for conflicting constraints. If a user provides min_length > max_length, this creates an impossible constraint but no error is raised. The code should validate this and raise a clear error: if max_length is not None and min_length > max_length: raise ValueError("min_length cannot be greater than max_length").

Suggested change:

):
    if max_length is not None and min_length > max_length:
        raise ValueError("min_length cannot be greater than max_length")

Comment on lines 520 to 523
if prior_kwargs is None:
    prior_kwargs = {}
else:
    prior_kwargs = prior_kwargs.copy()

Copilot AI (Nov 25, 2025):

Performance concern: The prior_kwargs.copy() operation is performed on every paragraph iteration (lines 522-523, 558-559, 1003-1004, 1038-1039). For large documents with many paragraphs, this creates unnecessary dict copies. Consider moving the copy operation outside the loop or checking if the dict is actually being modified before copying.
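
A sketch of the suggested hoisting, assuming the copy exists only to inject max_length (the loop and variable names are illustrative):

```python
from wtpsplit.utils.constraints import constrained_segmentation
from wtpsplit.utils.priors import create_prior_function

# Build the prior function once, before iterating over paragraphs ...
effective_kwargs = dict(prior_kwargs or {})
if max_length is not None:
    effective_kwargs["max_length"] = max_length
prior_fn = create_prior_function(prior_type, effective_kwargs)

# ... then reuse it inside the loop without any further dict copies.
for paragraph_probs in all_paragraph_probs:
    boundaries = constrained_segmentation(
        paragraph_probs, prior_fn, min_length=min_length,
        max_length=max_length, algorithm=algorithm,
    )
```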

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 11 comments.



Comment on lines +235 to +241
def test_tiny_fragments_merging(self, wtp_model):
    """Tiny fragments should be merged to meet min_length."""
    text = "A. B. C. D. E. F. G. H. I. J."
    splits = wtp_model.split(text, min_length=10, threshold=0.005)

    for segment in splits:
        assert len(segment) >= 10, f"Segment '{segment}' is too short"

Copilot AI (Nov 28, 2025):

The test at line 238 asserts that all segments are >= 10 characters when using min_length=10. However, based on the documentation and implementation, min_length is "best effort" - segments may be shorter if merging would violate max_length or if the segment is inherently too short. This test could fail in edge cases and doesn't match the documented behavior. Consider either:

  1. Allowing for exceptions where segments can be shorter (see the sketch below)
  2. Using a test case where min_length can always be satisfied
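
A relaxed assertion along the lines of option 1 could look like this (hypothetical helper; the tolerance value is illustrative):

```python
def assert_mostly_min_length(splits, min_length, tolerance=0.2):
    # Accept best-effort behavior: at most `tolerance` of segments may fall short.
    short = [s for s in splits if len(s) < min_length]
    assert len(short) <= tolerance * len(splits), f"Too many short segments: {short}"
```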

Comment on lines +243 to +249
def test_very_short_sentences(self, wtp_model):
    """Very short sentences should be merged when needed."""
    text = "Hi. Bye. Go. Stop. Run. Walk. Jump. Sit."
    splits = wtp_model.split(text, min_length=15, threshold=0.005)

    for segment in splits:
        assert len(segment) >= 15, f"Segment '{segment}' is too short"

Copilot AI (Nov 28, 2025):

Similar to the previous test, line 248 asserts all segments are >= 15 characters. Since min_length is documented as "best effort", this test might fail in edge cases where the algorithm cannot satisfy the minimum without violating other constraints. Consider relaxing the assertion to account for best-effort behavior, or verifying that most (but not necessarily all) segments meet the minimum.

Comment on lines 264 to 273
def test_min_length_constraint_wtp():
    """Test minimum length constraint with WtP"""
    wtp = WtP("wtp-bert-mini", ort_providers=["CPUExecutionProvider"])

    text = "Short. Test. Hello. World. This is longer."
    splits = wtp.split(text, min_length=15, threshold=0.005)

    # All segments should be >= 15 characters
    for segment in splits:
        assert len(segment) >= 15, f"Segment '{segment}' is shorter than min_length"

Copilot AI (Nov 28, 2025):

Similar to the comprehensive test suite, line 273 asserts that all segments are >= 15 characters when using min_length=15. However, min_length is documented as "best effort" and segments may be shorter if constraints cannot be satisfied. This test may fail in edge cases and doesn't align with the documented behavior. Consider either:

  1. Allowing for exceptions where segments can be shorter
  2. Using test data where min_length can always be guaranteed

Comment on lines 840 to 854
# Input validation
if max_length is not None and min_length > max_length:
    raise ValueError(
        f"min_length ({min_length}) cannot be greater than max_length ({max_length})"
    )
if min_length < 1:
    raise ValueError(f"min_length must be >= 1, got {min_length}")
if max_length is not None and max_length < 1:
    raise ValueError(f"max_length must be >= 1, got {max_length}")
valid_priors = ["uniform", "gaussian", "clipped_polynomial"]
if prior_type not in valid_priors:
    raise ValueError(f"Unknown prior_type: '{prior_type}'. Must be one of {valid_priors}")
valid_algorithms = ["viterbi", "greedy"]
if algorithm not in valid_algorithms:
    raise ValueError(f"Unknown algorithm: '{algorithm}'. Must be one of {valid_algorithms}")

Copilot AI (Nov 28, 2025):

[nitpick] There's code duplication between WtP's split method (lines 323-337) and SaT's split method (lines 840-854). The same input validation logic is repeated. Consider extracting this validation into a shared helper function to reduce duplication and ensure consistency.
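
A sketch of such a helper, lifted directly from the duplicated block quoted above (the function name is hypothetical):

```python
def _validate_constraint_args(min_length, max_length, prior_type, algorithm):
    # Shared input validation for WtP.split() and SaT.split().
    if max_length is not None and min_length > max_length:
        raise ValueError(
            f"min_length ({min_length}) cannot be greater than max_length ({max_length})"
        )
    if min_length < 1:
        raise ValueError(f"min_length must be >= 1, got {min_length}")
    if max_length is not None and max_length < 1:
        raise ValueError(f"max_length must be >= 1, got {max_length}")
    valid_priors = ["uniform", "gaussian", "clipped_polynomial"]
    if prior_type not in valid_priors:
        raise ValueError(f"Unknown prior_type: '{prior_type}'. Must be one of {valid_priors}")
    valid_algorithms = ["viterbi", "greedy"]
    if algorithm not in valid_algorithms:
        raise ValueError(f"Unknown algorithm: '{algorithm}'. Must be one of {valid_algorithms}")
```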

Comment on lines 294 to 303
def test_min_max_constraints_together():
    """Test both constraints simultaneously"""
    wtp = WtP("wtp-bert-mini", ort_providers=["CPUExecutionProvider"])

    text = "Hello world. " * 15
    splits = wtp.split(text, min_length=25, max_length=65, threshold=0.005)

    # All segments should satisfy both constraints
    for segment in splits:
        assert 25 <= len(segment) <= 65, f"Segment '{segment}' violates constraints"

Copilot AI (Nov 28, 2025):

Line 303 asserts that all segments satisfy both constraints 25 <= len(segment) <= 65. This strict assertion doesn't account for the "best effort" nature of min_length. While max_length is always strictly enforced, min_length may not be achievable in all cases. The test might fail for certain inputs where min_length cannot be satisfied without violating max_length.

Comment on lines +416 to +417
# Use >= to handle min_length == max_length case
if next_split >= curr_idx + min_length:

Copilot AI (Nov 28, 2025):

[nitpick] In the fallback case when dp[n] == -float("inf"), the code uses >= in the comparison at line 417, with a comment explaining it handles the min_length == max_length case. However, the regular condition at line 390-391 uses end_j = i - min_length which would already exclude the boundary case. The logic appears correct but the consistency between the fallback and the main algorithm could be clearer. Consider verifying that both code paths handle edge cases identically, especially when min_length == max_length.

Suggested change (replacing the two quoted lines above):

# Use strict boundary to match main algorithm (min_length == max_length case)
if next_split - curr_idx >= min_length:

# If still too short, try merging with previous non-empty segment
if len(merged) < min_length and result:
    # Find a previous non-empty segment that can accommodate the merge
    for prev_idx in range(len(result) - 1, -1, -1):

Copilot AI (Nov 28, 2025):

[nitpick] In the _enforce_segment_constraints_simple function, line 214 checks if not seg or not seg.strip(): which handles both empty strings and whitespace-only strings. However, the code then appends seg (which could be whitespace-only) to the result. This is intentional to preserve structure, but on line 264, there's a check if result[prev_idx] and result[prev_idx].strip(): that requires non-whitespace content. This could lead to inconsistent behavior where whitespace-only segments are preserved in some cases but not used as merge candidates in others. Consider documenting this behavior more clearly or ensuring consistent handling of whitespace-only segments throughout the function.

Suggested change:

for prev_idx in range(len(result) - 1, -1, -1):
    # Only merge with previous segments that contain non-whitespace content.
    # Whitespace-only segments are preserved for structure, but not used as merge candidates.

# Find a good split point (prefer splitting at whitespace)
split_at = max_length
# Look for whitespace near the end to split at
for j in range(max_length - 1, max(0, max_length - 20), -1):

Copilot AI (Nov 28, 2025):

[nitpick] In _enforce_segment_constraints, line 85 searches for whitespace to split at, looking back up to 20 characters from max_length. The magic number 20 should be defined as a named constant to make the code more maintainable and allow easy adjustment if needed. For example: WHITESPACE_SEARCH_WINDOW = 20.

from wtpsplit import WtP, SaT
from wtpsplit.utils.constraints import (
    constrained_segmentation,
    _enforce_segment_constraints,

Copilot AI (Nov 28, 2025):

Import of '_enforce_segment_constraints' is not used.

Suggested change (delete this line):

_enforce_segment_constraints,

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

wtpsplit/__init__.py:1062

  • Syntax Error (in Python 3).
l                    else:


Comment on lines +424 to +429
while curr_idx < n:
    next_split = min(curr_idx + max_length, n)
    # Use >= to handle min_length == max_length case
    if next_split >= curr_idx + min_length:
        indices.append(next_split)
        curr_idx = next_split

Copilot AI (Dec 4, 2025):

The condition on line 427 checks if next_split >= curr_idx + min_length but the comment on line 426 says "Use >= to handle min_length == max_length case". However, this check appears inconsistent with the dynamic programming loop bounds where end_j = i - min_length (line 398). When min_length == max_length, this could potentially create segments that violate the equality constraint. Consider verifying this edge case is handled correctly.

Suggested change:

if min_length == max_length:
    # Strictly enforce equal-sized segments
    while curr_idx < n:
        next_split = min(curr_idx + min_length, n)
        indices.append(next_split)
        curr_idx = next_split
else:
    while curr_idx < n:
        next_split = min(curr_idx + max_length, n)
        # Use >= to handle min_length == max_length case
        if next_split >= curr_idx + min_length:
            indices.append(next_split)
            curr_idx = next_split

sentence = sentence[:-1]
new_sentences.extend(sentence.split("\n"))
sentences = new_sentences
l else:

Copilot AI (Dec 4, 2025):

There is a lowercase 'l' character at the beginning of this line that appears to be a typo. This will cause a syntax error. The line should start with proper indentation and the else statement, not "l".

Suggested change
l else:
else:

README.md Outdated

## Adaptation

SaT can be domain- and style-adapted via LoRA. We provide trained LoRA modules for Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speecjes) sentence styles in 81 languages for `sat-3l`and `sat-12l`. Additionally, we provide LoRA modules for legal documents (laws and judgements) in 6 languages, code-switching in 4 language pairs, and tweets in 3 languages. For details, we refer to our [paper](https://arxiv.org/abs/2406.16678).

Copilot AI (Dec 4, 2025):

Typo in comment: "speecjes" should be "speeches".

Suggested change
SaT can be domain- and style-adapted via LoRA. We provide trained LoRA modules for Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speecjes) sentence styles in 81 languages for `sat-3l`and `sat-12l`. Additionally, we provide LoRA modules for legal documents (laws and judgements) in 6 languages, code-switching in 4 language pairs, and tweets in 3 languages. For details, we refer to our [paper](https://arxiv.org/abs/2406.16678).
SaT can be domain- and style-adapted via LoRA. We provide trained LoRA modules for Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speeches) sentence styles in 81 languages for `sat-3l`and `sat-12l`. Additionally, we provide LoRA modules for legal documents (laws and judgements) in 6 languages, code-switching in 4 language pairs, and tweets in 3 languages. For details, we refer to our [paper](https://arxiv.org/abs/2406.16678).

Comment on lines +1063 to +1067
warnings.warn(
    "split_on_input_newlines=False will lead to newlines in the output "
    "if they were present in the input. Within the model, such newlines are "
    "treated as spaces. "
    "If you want to split on such newlines, set split_on_input_newlines=True."

Copilot AI (Dec 4, 2025):

The warning message states "If you want to split on such newlines, set split_on_input_newlines=False" but this appears to be incorrect logic. If the user wants to split on newlines, they should set split_on_input_newlines=True, not False. The message should say "set split_on_input_newlines=True".

Comment on lines 322 to 343
# Input validation
if max_length is not None and min_length > max_length:
    raise ValueError(
        f"min_length ({min_length}) cannot be greater than max_length ({max_length})"
    )
if min_length < 1:
    raise ValueError(f"min_length must be >= 1, got {min_length}")
if max_length is not None and max_length < 1:
    raise ValueError(f"max_length must be >= 1, got {max_length}")
valid_priors = ["uniform", "gaussian", "clipped_polynomial", "lognormal"]
if prior_type not in valid_priors:
    raise ValueError(f"Unknown prior_type: '{prior_type}'. Must be one of {valid_priors}")
valid_algorithms = ["viterbi", "greedy"]
if algorithm not in valid_algorithms:
    raise ValueError(f"Unknown algorithm: '{algorithm}'. Must be one of {valid_algorithms}")

if max_length is not None and threshold is not None:
    warnings.warn(
        "Both 'threshold' and 'max_length' are set. When using length-constrained "
        "segmentation (max_length), the threshold parameter is ignored.",
        UserWarning,
    )

Copilot AI (Dec 4, 2025):

The validation logic is duplicated in both the WtP.split() method (lines 322-343) and the SaT.split() method (lines 847-868). Consider extracting this validation logic into a shared helper function to avoid code duplication and ensure consistency. This would improve maintainability and reduce the risk of the validation logic diverging between the two classes.

Comment on lines 476 to 482
if prior_kwargs is None:
    prior_kwargs = {}
else:
    prior_kwargs = prior_kwargs.copy()
if max_length is not None:
    prior_kwargs["max_length"] = max_length
prior_fn = create_prior_function(prior_type, prior_kwargs)

Copilot AI (Dec 4, 2025):

[nitpick] The prior_kwargs dictionary is copied and potentially modified multiple times in nested if blocks (lines 476-482, 508-513, 983-991, 1017-1025). This creates unnecessary object creation and complexity. Consider consolidating the prior_kwargs initialization logic to avoid redundant copying and mutation.

Comment on lines +112 to +113
seg_len = len(segment.strip()) if strip_whitespace else len(segment)
if seg_len < min_length and i + 1 < len(boundaries):

Copilot AI (Dec 4, 2025):

[nitpick] The variable name seg_len is calculated but then the same calculation is repeated inline in the following conditional check. Consider using the already-calculated seg_len variable instead of recalculating len(segment.strip()) if strip_whitespace else len(segment) on line 112.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 16 out of 24 changed files in this pull request and generated 5 comments.



# "onnxruntime>=1.13.1", # can make conflicts between onnxruntime and onnxruntime-gpu
"transformers>=4.22.2",
"huggingface-hub",
"transformers>=4.22.2,<5.0", # v5.0 has breaking changes; adapters library needs update first

Copilot AI (Dec 5, 2025):

The dependency version constraint "transformers>=4.22.2,<5.0" references v5.0 as having breaking changes. However, as of the knowledge cutoff (January 2025), transformers v5.0 has not been released yet. The latest stable version is 4.x series. This comment may be speculative or based on unreleased information. Consider verifying whether v5.0 actually exists and has the mentioned breaking changes, or update the comment to reflect current state.

Suggested change
"transformers>=4.22.2,<5.0", # v5.0 has breaking changes; adapters library needs update first
"transformers>=4.22.2,<5.0", # upper bound is precautionary in case v5.0 introduces breaking changes; adapters library may need update if/when v5.0 is released

"transformers>=4.22.2",
"huggingface-hub",
"transformers>=4.22.2,<5.0", # v5.0 has breaking changes; adapters library needs update first
"huggingface-hub<1.0", # v1.0 has breaking changes (HfFolder removed)

Copilot AI (Dec 5, 2025):

The dependency version constraint "huggingface-hub<1.0" mentions that v1.0 has breaking changes (HfFolder removed). However, as of the knowledge cutoff (January 2025), huggingface-hub is still in the 0.x series (latest ~0.27.x). This comment may be speculative or based on unreleased information. Consider verifying the current state of huggingface-hub and whether this constraint and comment are still accurate.

Suggested change
"huggingface-hub<1.0", # v1.0 has breaking changes (HfFolder removed)
"huggingface-hub<1.0", # Pin to <1.0 until v1.0 is released and compatibility is confirmed

@@ -0,0 +1,288 @@
# Length-Constrained Segmentation

This supplementary document explains the theory and implementation of length-constrained segmentation in wtpsplit (NB: auto-generated).

Copilot AI (Dec 5, 2025):

[nitpick] The documentation states "(NB: auto-generated)" but this appears to be a manually written, well-structured markdown document. If this is indeed auto-generated, consider clarifying what tool generates it and when. If it's manually written, remove the "auto-generated" note to avoid confusion.

Suggested change
This supplementary document explains the theory and implementation of length-constrained segmentation in wtpsplit (NB: auto-generated).
This supplementary document explains the theory and implementation of length-constrained segmentation in wtpsplit.

"skops",
"pandas>=1",
"cached_property", # for Py37
"mosestokenizer",

Copilot AI (Dec 5, 2025):

The cached_property dependency has been removed from setup.py, but the code still imports it from the third-party package (from cached_property import cached_property) in wtpsplit/utils/__init__.py. Since python_requires=">=3.9" has been set, this should be updated to use the standard library version: from functools import cached_property. This will cause an import error for users installing the package.
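
The described fix is a one-line change in wtpsplit/utils/__init__.py (cached_property has been in the standard library since Python 3.8):

```python
from functools import cached_property
```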

"split_on_input_newlines=False will lead to newlines in the output "
"if they were present in the input. Within the model, such newlines are "
"treated as spaces. "
"If you want to split on such newlines, set split_on_input_newlines=True."

Copilot AI (Dec 5, 2025):

[nitpick] The warning message states "If you want to split on such newlines, set split_on_input_newlines=True" but this warning is shown when split_on_input_newlines=False. The warning should say "set split_on_input_newlines=True" which is correct, but the context suggests the user already has it set to False and probably wants to keep it that way. Consider rephrasing to make it clearer, e.g., "If you want to split on newlines instead, set split_on_input_newlines=True."

Suggested change
"If you want to split on such newlines, set split_on_input_newlines=True."
"If you want to split on newlines instead, set split_on_input_newlines=True."

@markus583 (Collaborator)

Over time, this mutated a little into a more comprehensive v2.2.0 update! I expanded the core features from @harikesavan (adding language priors, a lognormal prior, among others), tried to make sure we cover all edge cases, and ensured that text can be fully recovered after splitting (hence the very comprehensive test suite...). I also added some docs, and bumped Python to >= 3.9 since our dependencies require that anyway.

I thought a full changelog would be useful, so here it is (auto-generated, though):
@bminixhofer

Maybe also interesting to @harikesavan @igorsterner


📋 Changelog: wtpsplit v2.1.7 → v2.2.0

26 files changed, +3,122 / -101 lines


🎯 Major Feature: Length-Constrained Segmentation

Control segment lengths with min_length and max_length parameters.

New Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| min_length | int | 1 | Minimum length (best effort) |
| max_length | int \| None | None | Maximum length (strict) |
| prior_type | str | "uniform" | Prior distribution |
| prior_kwargs | dict \| None | None | Prior configuration |
| algorithm | str | "viterbi" | "viterbi" (optimal) or "greedy" (faster) |

Prior Functions

| Prior | Best For | Key Parameters |
| --- | --- | --- |
| "uniform" (default) | Just enforce max_length | (none) |
| "gaussian" | Prefer target length | target_length, spread |
| "lognormal" | Natural distribution | target_length, spread (0.3-0.7) |
| "clipped_polynomial" | Hard enforcement | target_length, spread |

Language-Aware Defaults (70+ languages)

Automatic target_length/spread based on language:

  • East Asian (zh, ja, ko): shorter (45-55 chars)
  • Germanic (de, nl, en): medium-long (75-90 chars)
  • Romance/Slavic (fr, es, ru): medium-long (78-85 chars)
# Auto-applies when using LoRA with language
sat = SaT("sat-3l", style_or_domain="ud", language="de")
sat.split(text, max_length=150, prior_type="gaussian")  # German defaults

# Or explicit
sat.split(text, max_length=100, prior_type="gaussian", prior_kwargs={"lang_code": "zh"})

Text Reconstruction

# With constraints: "".join(segments) == original
# Without constraints: "\n".join(segments) == original
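
A runnable sketch of that invariant (the model name and text are illustrative):

```python
from wtpsplit import SaT

sat = SaT("sat-3l")
text = "First sentence here. Second sentence follows. A third one ends it."
segments = sat.split(text, max_length=30)
assert "".join(segments) == text  # constrained mode: concatenation recovers the input exactly
```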

🆕 New Files

| File | Lines | Description |
| --- | --- | --- |
| wtpsplit/utils/constraints.py | 494 | Viterbi DP & greedy algorithms, constraint enforcement |
| wtpsplit/utils/priors.py | 198 | Prior functions + 70+ language defaults |
| test_length_constraints.py | 1,164 | 98 test cases |
| length_constrained_segmentation_demo.py | 450 | Interactive demo |
| docs/LENGTH_CONSTRAINTS.md | 288 | Math & implementation docs |

📝 Key Modifications

wtpsplit/__init__.py (+276 lines)

  • New parameters for WtP.split() / SaT.split()
  • Input validation, warnings when threshold ignored with max_length
  • Type hints fixed with from __future__ import annotations

wtpsplit/utils/__init__.py

  • Bug fix: from cached_property import ... → from functools import cached_property

README.md (+84 lines)

  • New section: "(NEW! v2.2+) Length-Constrained Segmentation"

🔧 Build & CI Changes

setup.py

python_requires=">=3.9"                    # NEW: drops 3.7, 3.8
"transformers>=4.22.2,<5.0"               # Pinned (v5.0 breaking)
"huggingface-hub<1.0"                     # Pinned (HfFolder removed)
# Removed: "cached_property" (stdlib in 3.9+)

pyproject.toml

  • target-version: py38 → py39, py310, py311, py312
  • [tool.ruff.per-file-ignores] → [tool.ruff.lint.per-file-ignores]

.github/workflows/python.yml

  • Python matrix: removed 3.8, added 3.12
  • Updated to actions/checkout@v4, actions/setup-python@v5
  • ruff → ruff check --target-version=py39

requirements.txt

  • Pinned huggingface-hub==0.25.2
  • Removed cached_property

⚠️ Breaking Changes

| Change | Impact |
| --- | --- |
| Python ≥3.9 | Drops 3.7, 3.8 |
| transformers <5.0 | v5.0 has breaking API |
| huggingface-hub <1.0 | v1.0 removes HfFolder |

🧪 Tests

| File | Tests |
| --- | --- |
| test.py | +6 new constraint tests |
| test_length_constraints.py | 98 tests (NEW) |
| Total | 130 tests ✅ |

@bminixhofer (Collaborator) left a comment

Thanks, this will be a great feature! A couple of comments:

  • What do we need _enforce_segment_constraints_simple for? I only see it used in tests.
  • How did we arrive at LANG_SENTENCE_STATS? Is it LLM generated? If it is, I'd rather not have it since it's likely nonsense for a couple of languages. It would be good to derive this from corpus data instead as noted in the TODO.
  • I don't understand all the priors and why we need them. Why does lognormal lead to "natural-feeling output"? Did we ever check the empirical distribution of sentence lengths? I also don't understand how clipped_polynomial behaves. It would be nice to have a figure showing the priors (see the sketch after this list).
  • I don't quite understand _enforce_segment_constraints. Why do we need to manually ensure segments don't exceed the maximum length, shouldn't the prior take care of this (since segments > max length have zero probability)? I also don't see why we need to manually merge shorter segments, since the prior should also account for this.
  • I didn't check the details of the Viterbi implementation. Is this LLM generated as well or manually written? We should have a pretty high confidence it's correct before merging in either case. Maybe @markus583 @harikesavan one of you is already certain it's correct. If not, we still need to check it in more detail.
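
For the figure requested above, a quick sketch using the formulas quoted in the create_prior_function docstring earlier in this thread (parameter values chosen only for illustration; lognormal omitted since its exact parametrization isn't shown here):

```python
import numpy as np
import matplotlib.pyplot as plt

lengths = np.arange(1, 61)
uniform = np.where(lengths <= 40, 1.0, 0.0)                       # max_length=40
gaussian = np.exp(-0.5 * ((lengths - 20) / 5) ** 2)               # mu=20, sigma=5
clipped_poly = np.maximum(1.0 - 0.01 * (lengths - 20) ** 2, 0.0)  # alpha=0.01, mu=20

plt.plot(lengths, uniform, label="uniform (max_length=40)")
plt.plot(lengths, gaussian, label="gaussian (mu=20, sigma=5)")
plt.plot(lengths, clipped_poly, label="clipped_polynomial (alpha=0.01, mu=20)")
plt.xlabel("segment length")
plt.ylabel("prior(length)")
plt.legend()
plt.show()
```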

@igorsterner (Collaborator)

Hey! Thanks for all your work on this feature, and your huge efforts to maintain this project!

Just thought I'd chime in after Benjamin to add my two cents. I can definitely see the attraction of enforcing a max length, and hence why the global optimization via Viterbi is good (and better than greedy, like the LCM people did). Awesome. If you need me to dive in deep to review something, just let me know.

But regarding user-specified priors on sentence length, I am more concerned. I agree with Benjamin's comments that with the current approach one should use corpus statistics rather than heuristics. But I feel that even better would be to make the whole thing learnable, in which case one wouldn't need these priors. I think the typical approach would be a segment-level variant of a CRF, which would aim to do something like rank all the possible mean-pooled segments (and one could of course make segments of particular sizes impossible). If it's trained end to end, I think it could keep everything as one language-agnostic model. I'm sure there's another paper in doing better length-constrained sentence segmentation, but big performance gains would probably only come on edge cases. Regardless, I think there needs to be evaluation against at least one benchmark before merging big changes here.

Just my thoughts! Happy to chat. Hope you're all doing well :))
