Add length-constrained segmentation with configurable priors and algorithms #164
base: main
Conversation
Pull request overview
This PR adds length-constrained segmentation capabilities to wtpsplit, allowing users to specify minimum and maximum segment lengths with configurable prior functions and algorithms (Viterbi or greedy).
Key Changes
- New constraint-based segmentation with `min_length` and `max_length` parameters
- Three prior function types: uniform, Gaussian, and clipped polynomial
- Support for both Viterbi (optimal) and greedy (fast) segmentation algorithms
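A minimal usage sketch of the API this PR describes (illustrative only; the model choice and all parameter values here are assumptions, not tested defaults):

```python
from wtpsplit import SaT

# Sketch of the constrained-split interface described in this PR.
sat = SaT("sat-3l")
text = "Short. Test. Hello. World. This is a longer sentence."

segments = sat.split(
    text,
    min_length=15,          # best-effort minimum segment length (characters)
    max_length=80,          # hard maximum segment length
    prior_type="gaussian",  # "uniform", "gaussian", or "clipped_polynomial"
    prior_kwargs={"mu": 40, "sigma": 10},
    algorithm="viterbi",    # "viterbi" (optimal) or "greedy" (fast)
)
for seg in segments:
    print(repr(seg))
```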
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 19 comments.
| File | Description |
|---|---|
| `wtpsplit/utils/priors.py` | New module implementing three prior function types for length preferences |
| `wtpsplit/utils/constraints.py` | New module with constrained segmentation algorithms (Viterbi and greedy) |
| `wtpsplit/__init__.py` | Integrated new parameters into WtP and SaT split methods; added `_enforce_segment_constraints` helper; version downgraded to 2.1.6; removed `merge_lora` parameter |
| `test_constraints.py` | Comprehensive test suite covering constraints, priors, algorithms, and edge cases |
| `test.py` | Added integration tests for length-constrained segmentation features |
Comments suppressed due to low confidence (1)
wtpsplit/__init__.py:1067

- The error message contains a typo: "If you want to split on such newlines, set split_on_input_newlines=False." This should say "set split_on_input_newlines=True" because the condition at line 1055 checks if `split_on_input_newlines` is True to perform the splitting.
```python
warnings.warn(
    "split_on_input_newlines=False will lead to newlines in the output "
    "if they were present in the input. Within the model, such newlines are "
    "treated as spaces. "
    "If you want to split on such newlines, set split_on_input_newlines=False."
```
```python
boundaries = constrained_segmentation(
    paragraph_probs, prior_fn, min_length=min_length, max_length=max_length, algorithm=algorithm
)
indices = [b - 1 for b in boundaries]
```
Copilot AI · Nov 25, 2025
Same potential index calculation issue: indices = [b - 1 for b in boundaries]. This pattern is repeated throughout the codebase and needs to be verified for correctness.
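For intuition, a toy illustration of the conversion being questioned, assuming `boundaries` holds exclusive end offsets (whether the PR's arrays actually follow this convention is exactly what needs verifying):

```python
# Toy illustration (assumed convention, not verified against the PR):
# two segments covering positions 0..2 and 3..4 of a 5-element probs array.
boundaries = [3, 5]                   # exclusive end offsets
indices = [b - 1 for b in boundaries]
assert indices == [2, 4]              # 0-based position of each segment's last element
```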
wtpsplit/__init__.py (Outdated)
```python
min_length: int = 1,
max_length: int = None,
```
Copilot AI · Nov 25, 2025
Missing input validation for the max_length parameter. The annotation `int = None` permits None, but when an integer is provided there is no check that it is positive. Negative or zero values for max_length would cause logical errors in the segmentation algorithm. Consider adding validation: `if max_length is not None and max_length <= 0: raise ValueError("max_length must be positive")`.
```python
prior_type: str = "uniform",
prior_kwargs: dict = None,
algorithm: str = "viterbi",
):
```
Copilot AI · Nov 25, 2025
Missing input validation for algorithm parameter. The parameter accepts any string but only "viterbi" and "greedy" are valid. Invalid values would only be caught when constrained_segmentation is called. Consider adding early validation: if algorithm not in ("viterbi", "greedy"): raise ValueError(f"algorithm must be 'viterbi' or 'greedy', got '{algorithm}'").
Suggested change:

```diff
 ):
+    if algorithm not in ("viterbi", "greedy"):
+        raise ValueError(f"algorithm must be 'viterbi' or 'greedy', got '{algorithm}'")
```
wtpsplit/__init__.py (Outdated)
```diff
-# merge lora weights into transformer for 0 efficiency overhead
-if merge_lora:
-    self.model.model.merge_adapter("sat-lora")
+self.model.model.merge_adapter("sat-lora")
```
Copilot AI · Nov 25, 2025
The merge_lora conditional has been removed, making merge_adapter() always execute. This changes the behavior for users who may have set merge_lora=False previously. This is a behavioral breaking change related to the removed parameter.
wtpsplit/__init__.py (Outdated)
```python
if len(last) < min_length:
    prev = result[-2]
    if max_length is None or len(prev) + len(last) <= max_length:
        result[-2] = prev + last
```
Copilot AI · Nov 25, 2025
The same segment concatenation issue exists here: result[-2] = prev + last merges segments without preserving any delimiter or whitespace between them, which can corrupt the text output.
Suggested change:

```diff
-result[-2] = prev + last
+# Insert a space between segments if needed
+if prev and last and not prev[-1].isspace() and not last[0].isspace():
+    result[-2] = prev + " " + last
+else:
+    result[-2] = prev + last
```
wtpsplit/__init__.py (Outdated)
```python
from wtpsplit.utils.constraints import constrained_segmentation
from wtpsplit.utils.priors import create_prior_function

__version__ = "2.1.6"
```
Copilot AI · Nov 25, 2025
The version has been downgraded from "2.1.7" to "2.1.6", which is unusual for a PR that adds new features. When adding new functionality like length-constrained segmentation, the version should typically be incremented (e.g., to "2.2.0" or "2.1.8"), not decremented. This could cause confusion for users and break dependency management.
Suggested change:

```diff
-__version__ = "2.1.6"
+__version__ = "2.2.0"
```
```python
import numpy as np


def create_prior_function(name, kwargs):
    if name == "uniform":
```
Copilot AI · Nov 25, 2025
Missing docstring for `create_prior_function`. This is a public function being imported and used throughout the codebase. It should have documentation explaining:
- What each prior type does
- The expected kwargs for each prior type
- The return value (a prior function)
- Example usage
Suggested change:

```diff
-    if name == "uniform":
+    """
+    Returns a prior function for the specified prior type.
+
+    Parameters
+    ----------
+    name : str
+        The name of the prior type. Supported values:
+        - "uniform": Returns 1.0 for all lengths up to `max_length`, else 0.0.
+        - "clipped_polynomial": Returns max(1.0 - alpha * ((length - mu) ** 2), 0.0).
+        - "gaussian": Returns exp(-0.5 * ((length - mu) / sigma) ** 2).
+    kwargs : dict
+        Keyword arguments for the prior type:
+        - For "uniform":
+            max_length (float or int, optional): Maximum allowed length. If provided, prior is 0.0 for lengths > max_length.
+        - For "clipped_polynomial":
+            alpha (float, optional): Controls the steepness. Default is 0.5.
+            mu (float, optional): Center of the polynomial. Default is 3.0.
+        - For "gaussian":
+            mu (float, optional): Mean of the Gaussian. Default is 20.0.
+            sigma (float, optional): Standard deviation. Default is 5.0.
+
+    Returns
+    -------
+    prior : function
+        A function prior(length) -> float, which computes the prior probability for the given length.
+
+    Raises
+    ------
+    ValueError
+        If an unknown prior type is specified.
+
+    Examples
+    --------
+    >>> prior = create_prior_function("uniform", {"max_length": 10})
+    >>> prior(5)
+    1.0
+    >>> prior(15)
+    0.0
+    >>> prior = create_prior_function("clipped_polynomial", {"alpha": 0.2, "mu": 5})
+    >>> prior(5)
+    1.0
+    >>> prior(10)
+    0.0
+    >>> prior = create_prior_function("gaussian", {"mu": 10, "sigma": 2})
+    >>> prior(10)
+    1.0
+    >>> prior(14)
+    np.exp(-0.5 * ((14 - 10) / 2) ** 2)
+    """
```
wtpsplit/__init__.py (Outdated)
```python
# Accumulate following segments
while j < len(filtered) and len(merged) < min_length:
    next_seg = filtered[j]
    if max_length is None or len(merged) + len(next_seg) <= max_length:
```
Copilot AI · Nov 25, 2025
The _enforce_segment_constraints function concatenates segments directly without any delimiter (e.g., merged += next_seg on line 82). This can merge segments that should have space or other separators between them, potentially corrupting the text. For example, "Hello." + "World." would become "Hello.World." instead of "Hello. World."
Suggested change:

```diff
 if max_length is None or len(merged) + len(next_seg) <= max_length:
+    # Insert a space if needed between segments
+    if merged and not merged.endswith((' ', '\n')) and not next_seg.startswith((' ', '\n')):
+        merged += ' '
```
```python
prior_type: str = "uniform",
prior_kwargs: dict = None,
algorithm: str = "viterbi",
):
```
Copilot AI · Nov 25, 2025
Missing input validation for conflicting constraints. If a user provides min_length > max_length, this creates an impossible constraint but no error is raised. The code should validate this and raise a clear error: if max_length is not None and min_length > max_length: raise ValueError("min_length cannot be greater than max_length").
Suggested change:

```diff
 ):
+    if max_length is not None and min_length > max_length:
+        raise ValueError("min_length cannot be greater than max_length")
```
wtpsplit/__init__.py (Outdated)
```python
if prior_kwargs is None:
    prior_kwargs = {}
else:
    prior_kwargs = prior_kwargs.copy()
```
Copilot AI · Nov 25, 2025
Performance concern: The prior_kwargs.copy() operation is performed on every paragraph iteration (lines 522-523, 558-559, 1003-1004, 1038-1039). For large documents with many paragraphs, this creates unnecessary dict copies. Consider moving the copy operation outside the loop or checking if the dict is actually being modified before copying.
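A sketch of the hoisting this comment suggests, reusing the `create_prior_function` / `constrained_segmentation` signatures shown elsewhere in this review (the loop variable names are hypothetical, and the real method body differs):

```python
from wtpsplit.utils.constraints import constrained_segmentation
from wtpsplit.utils.priors import create_prior_function

# Build the prior once, before the per-paragraph loop, instead of
# copying and mutating prior_kwargs on every iteration.
effective_kwargs = dict(prior_kwargs or {})
if max_length is not None:
    effective_kwargs["max_length"] = max_length
prior_fn = create_prior_function(prior_type, effective_kwargs)

for paragraph_probs in all_paragraph_probs:  # hypothetical iterable
    boundaries = constrained_segmentation(
        paragraph_probs, prior_fn,
        min_length=min_length, max_length=max_length, algorithm=algorithm,
    )
```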
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 11 comments.
```python
def test_tiny_fragments_merging(self, wtp_model):
    """Tiny fragments should be merged to meet min_length."""
    text = "A. B. C. D. E. F. G. H. I. J."
    splits = wtp_model.split(text, min_length=10, threshold=0.005)

    for segment in splits:
        assert len(segment) >= 10, f"Segment '{segment}' is too short"
```
Copilot AI · Nov 28, 2025
The test at line 238 asserts that all segments are >= 10 characters when using min_length=10. However, based on the documentation and implementation, min_length is "best effort" - segments may be shorter if merging would violate max_length or if the segment is inherently too short. This test could fail in edge cases and doesn't match the documented behavior. Consider either:
- Allowing for exceptions where segments can be shorter
- Using a test case where min_length can always be satisfied
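One possible relaxation along the lines suggested above (a sketch: the fixture and threshold mirror the test under review, but the tolerance of a single short segment is an assumption):

```python
def test_tiny_fragments_merging_best_effort(self, wtp_model):
    """min_length is best-effort: tolerate isolated segments that could
    not be merged without violating other constraints."""
    text = "A. B. C. D. E. F. G. H. I. J."
    splits = wtp_model.split(text, min_length=10, threshold=0.005)

    short = [s for s in splits if len(s) < 10]
    assert len(short) <= 1, f"Too many under-length segments: {short}"
```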
```python
def test_very_short_sentences(self, wtp_model):
    """Very short sentences should be merged when needed."""
    text = "Hi. Bye. Go. Stop. Run. Walk. Jump. Sit."
    splits = wtp_model.split(text, min_length=15, threshold=0.005)

    for segment in splits:
        assert len(segment) >= 15, f"Segment '{segment}' is too short"
```
Copilot AI · Nov 28, 2025
Similar to the previous test, line 248 asserts all segments are >= 15 characters. Since min_length is documented as "best effort", this test might fail in edge cases where the algorithm cannot satisfy the minimum without violating other constraints. Consider relaxing the assertion to account for best-effort behavior, or verifying that most (but not necessarily all) segments meet the minimum.
```python
def test_min_length_constraint_wtp():
    """Test minimum length constraint with WtP"""
    wtp = WtP("wtp-bert-mini", ort_providers=["CPUExecutionProvider"])

    text = "Short. Test. Hello. World. This is longer."
    splits = wtp.split(text, min_length=15, threshold=0.005)

    # All segments should be >= 15 characters
    for segment in splits:
        assert len(segment) >= 15, f"Segment '{segment}' is shorter than min_length"
```
Copilot AI · Nov 28, 2025
Similar to the comprehensive test suite, line 273 asserts that all segments are >= 15 characters when using min_length=15. However, min_length is documented as "best effort" and segments may be shorter if constraints cannot be satisfied. This test may fail in edge cases and doesn't align with the documented behavior. Consider either:
- Allowing for exceptions where segments can be shorter
- Using test data where min_length can always be guaranteed
```python
# Input validation
if max_length is not None and min_length > max_length:
    raise ValueError(
        f"min_length ({min_length}) cannot be greater than max_length ({max_length})"
    )
if min_length < 1:
    raise ValueError(f"min_length must be >= 1, got {min_length}")
if max_length is not None and max_length < 1:
    raise ValueError(f"max_length must be >= 1, got {max_length}")
valid_priors = ["uniform", "gaussian", "clipped_polynomial"]
if prior_type not in valid_priors:
    raise ValueError(f"Unknown prior_type: '{prior_type}'. Must be one of {valid_priors}")
valid_algorithms = ["viterbi", "greedy"]
if algorithm not in valid_algorithms:
    raise ValueError(f"Unknown algorithm: '{algorithm}'. Must be one of {valid_algorithms}")
```
Copilot AI · Nov 28, 2025
[nitpick] There's code duplication between WtP's split method (lines 323-337) and SaT's split method (lines 840-854). The same input validation logic is repeated. Consider extracting this validation into a shared helper function to reduce duplication and ensure consistency.
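A sketch of the extraction this comment proposes (the helper name `_validate_split_constraints` is hypothetical; the body reuses the validation block quoted above):

```python
def _validate_split_constraints(min_length, max_length, prior_type, algorithm):
    """Shared input validation for WtP.split and SaT.split (hypothetical helper)."""
    if max_length is not None and min_length > max_length:
        raise ValueError(
            f"min_length ({min_length}) cannot be greater than max_length ({max_length})"
        )
    if min_length < 1:
        raise ValueError(f"min_length must be >= 1, got {min_length}")
    if max_length is not None and max_length < 1:
        raise ValueError(f"max_length must be >= 1, got {max_length}")
    valid_priors = ["uniform", "gaussian", "clipped_polynomial"]
    if prior_type not in valid_priors:
        raise ValueError(f"Unknown prior_type: '{prior_type}'. Must be one of {valid_priors}")
    valid_algorithms = ["viterbi", "greedy"]
    if algorithm not in valid_algorithms:
        raise ValueError(f"Unknown algorithm: '{algorithm}'. Must be one of {valid_algorithms}")
```

Both split methods would then call this helper once at the top, keeping the checks in one place.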
```python
def test_min_max_constraints_together():
    """Test both constraints simultaneously"""
    wtp = WtP("wtp-bert-mini", ort_providers=["CPUExecutionProvider"])

    text = "Hello world. " * 15
    splits = wtp.split(text, min_length=25, max_length=65, threshold=0.005)

    # All segments should satisfy both constraints
    for segment in splits:
        assert 25 <= len(segment) <= 65, f"Segment '{segment}' violates constraints"
```
Copilot AI · Nov 28, 2025
Line 303 asserts that all segments satisfy both constraints 25 <= len(segment) <= 65. This strict assertion doesn't account for the "best effort" nature of min_length. While max_length is always strictly enforced, min_length may not be achievable in all cases. The test might fail for certain inputs where min_length cannot be satisfied without violating max_length.
```python
# Use >= to handle min_length == max_length case
if next_split >= curr_idx + min_length:
```
Copilot AI · Nov 28, 2025
[nitpick] In the fallback case when dp[n] == -float("inf"), the code uses >= in the comparison at line 417, with a comment explaining it handles the min_length == max_length case. However, the regular condition at line 390-391 uses end_j = i - min_length which would already exclude the boundary case. The logic appears correct but the consistency between the fallback and the main algorithm could be clearer. Consider verifying that both code paths handle edge cases identically, especially when min_length == max_length.
Suggested change:

```diff
-# Use >= to handle min_length == max_length case
-if next_split >= curr_idx + min_length:
+# Use strict boundary to match main algorithm (min_length == max_length case)
+if next_split - curr_idx >= min_length:
```
```python
# If still too short, try merging with previous non-empty segment
if len(merged) < min_length and result:
    # Find a previous non-empty segment that can accommodate the merge
    for prev_idx in range(len(result) - 1, -1, -1):
```
Copilot AI · Nov 28, 2025
[nitpick] In the _enforce_segment_constraints_simple function, line 214 checks if not seg or not seg.strip(): which handles both empty strings and whitespace-only strings. However, the code then appends seg (which could be whitespace-only) to the result. This is intentional to preserve structure, but on line 264, there's a check if result[prev_idx] and result[prev_idx].strip(): that requires non-whitespace content. This could lead to inconsistent behavior where whitespace-only segments are preserved in some cases but not used as merge candidates in others. Consider documenting this behavior more clearly or ensuring consistent handling of whitespace-only segments throughout the function.
Suggested change:

```diff
 for prev_idx in range(len(result) - 1, -1, -1):
+    # Only merge with previous segments that contain non-whitespace content.
+    # Whitespace-only segments are preserved for structure, but not used as merge candidates.
```
```python
# Find a good split point (prefer splitting at whitespace)
split_at = max_length
# Look for whitespace near the end to split at
for j in range(max_length - 1, max(0, max_length - 20), -1):
```
Copilot AI · Nov 28, 2025
[nitpick] In _enforce_segment_constraints, line 85 searches for whitespace to split at, looking back up to 20 characters from max_length. The magic number 20 should be defined as a named constant to make the code more maintainable and allow easy adjustment if needed. For example: WHITESPACE_SEARCH_WINDOW = 20.
test_length_constraints.py (Outdated)
```python
from wtpsplit import WtP, SaT
from wtpsplit.utils.constraints import (
    constrained_segmentation,
    _enforce_segment_constraints,
```
Copilot AI · Nov 28, 2025
Import of '_enforce_segment_constraints' is not used.
Suggested change:

```diff
-    _enforce_segment_constraints,
```
Pull request overview
Copilot reviewed 9 out of 10 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (1)
wtpsplit/__init__.py:1062

- Syntax error (in Python 3): `l else:`
```python
while curr_idx < n:
    next_split = min(curr_idx + max_length, n)
    # Use >= to handle min_length == max_length case
    if next_split >= curr_idx + min_length:
        indices.append(next_split)
        curr_idx = next_split
```
Copilot AI · Dec 4, 2025
The condition on line 427 checks if next_split >= curr_idx + min_length but the comment on line 426 says "Use >= to handle min_length == max_length case". However, this check appears inconsistent with the dynamic programming loop bounds where end_j = i - min_length (line 398). When min_length == max_length, this could potentially create segments that violate the equality constraint. Consider verifying this edge case is handled correctly.
Suggested change:

```diff
-while curr_idx < n:
-    next_split = min(curr_idx + max_length, n)
-    # Use >= to handle min_length == max_length case
-    if next_split >= curr_idx + min_length:
-        indices.append(next_split)
-        curr_idx = next_split
+if min_length == max_length:
+    # Strictly enforce equal-sized segments
+    while curr_idx < n:
+        next_split = min(curr_idx + min_length, n)
+        indices.append(next_split)
+        curr_idx = next_split
+else:
+    while curr_idx < n:
+        next_split = min(curr_idx + max_length, n)
+        # Use >= to handle min_length == max_length case
+        if next_split >= curr_idx + min_length:
+            indices.append(next_split)
+            curr_idx = next_split
```
wtpsplit/__init__.py (Outdated)
```python
sentence = sentence[:-1]
new_sentences.extend(sentence.split("\n"))
sentences = new_sentences
l else:
```
Copilot AI · Dec 4, 2025
There is a lowercase 'l' character at the beginning of this line that appears to be a typo. This will cause a syntax error. The line should start with proper indentation and the else statement, not "l".
Suggested change:

```diff
-l else:
+else:
```
README.md (Outdated)
```markdown
## Adaptation

SaT can be domain- and style-adapted via LoRA. We provide trained LoRA modules for Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speecjes) sentence styles in 81 languages for `sat-3l`and `sat-12l`. Additionally, we provide LoRA modules for legal documents (laws and judgements) in 6 languages, code-switching in 4 language pairs, and tweets in 3 languages. For details, we refer to our [paper](https://arxiv.org/abs/2406.16678).
```
Copilot AI · Dec 4, 2025
Typo in comment: "speecjes" should be "speeches".
Suggested change:

```diff
-SaT can be domain- and style-adapted via LoRA. We provide trained LoRA modules for Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speecjes) sentence styles in 81 languages for `sat-3l`and `sat-12l`. Additionally, we provide LoRA modules for legal documents (laws and judgements) in 6 languages, code-switching in 4 language pairs, and tweets in 3 languages. For details, we refer to our [paper](https://arxiv.org/abs/2406.16678).
+SaT can be domain- and style-adapted via LoRA. We provide trained LoRA modules for Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speeches) sentence styles in 81 languages for `sat-3l`and `sat-12l`. Additionally, we provide LoRA modules for legal documents (laws and judgements) in 6 languages, code-switching in 4 language pairs, and tweets in 3 languages. For details, we refer to our [paper](https://arxiv.org/abs/2406.16678).
```
```python
warnings.warn(
    "split_on_input_newlines=False will lead to newlines in the output "
    "if they were present in the input. Within the model, such newlines are "
    "treated as spaces. "
    "If you want to split on such newlines, set split_on_input_newlines=True."
```
Copilot AI · Dec 4, 2025
The warning message states "If you want to split on such newlines, set split_on_input_newlines=False" but this appears to be incorrect logic. If the user wants to split on newlines, they should set split_on_input_newlines=True, not False. The message should say "set split_on_input_newlines=True".
```python
# Input validation
if max_length is not None and min_length > max_length:
    raise ValueError(
        f"min_length ({min_length}) cannot be greater than max_length ({max_length})"
    )
if min_length < 1:
    raise ValueError(f"min_length must be >= 1, got {min_length}")
if max_length is not None and max_length < 1:
    raise ValueError(f"max_length must be >= 1, got {max_length}")
valid_priors = ["uniform", "gaussian", "clipped_polynomial", "lognormal"]
if prior_type not in valid_priors:
    raise ValueError(f"Unknown prior_type: '{prior_type}'. Must be one of {valid_priors}")
valid_algorithms = ["viterbi", "greedy"]
if algorithm not in valid_algorithms:
    raise ValueError(f"Unknown algorithm: '{algorithm}'. Must be one of {valid_algorithms}")

if max_length is not None and threshold is not None:
    warnings.warn(
        "Both 'threshold' and 'max_length' are set. When using length-constrained "
        "segmentation (max_length), the threshold parameter is ignored.",
        UserWarning,
    )
```
Copilot AI · Dec 4, 2025
The validation logic is duplicated in both the WtP.split() method (lines 322-343) and the SaT.split() method (lines 847-868). Consider extracting this validation logic into a shared helper function to avoid code duplication and ensure consistency. This would improve maintainability and reduce the risk of the validation logic diverging between the two classes.
wtpsplit/__init__.py (Outdated)
```python
if prior_kwargs is None:
    prior_kwargs = {}
else:
    prior_kwargs = prior_kwargs.copy()
if max_length is not None:
    prior_kwargs["max_length"] = max_length
prior_fn = create_prior_function(prior_type, prior_kwargs)
```
Copilot AI · Dec 4, 2025
[nitpick] The prior_kwargs dictionary is copied and potentially modified multiple times in nested if blocks (lines 476-482, 508-513, 983-991, 1017-1025). This creates unnecessary object creation and complexity. Consider consolidating the prior_kwargs initialization logic to avoid redundant copying and mutation.
```python
seg_len = len(segment.strip()) if strip_whitespace else len(segment)
if seg_len < min_length and i + 1 < len(boundaries):
```
Copilot AI · Dec 4, 2025
[nitpick] The variable name seg_len is calculated but then the same calculation is repeated inline in the following conditional check. Consider using the already-calculated seg_len variable instead of recalculating len(segment.strip()) if strip_whitespace else len(segment) on line 112.
Pull request overview
Copilot reviewed 16 out of 24 changed files in this pull request and generated 5 comments.
| # "onnxruntime>=1.13.1", # can make conflicts between onnxruntime and onnxruntime-gpu | ||
| "transformers>=4.22.2", | ||
| "huggingface-hub", | ||
| "transformers>=4.22.2,<5.0", # v5.0 has breaking changes; adapters library needs update first |
Copilot AI · Dec 5, 2025
The dependency version constraint "transformers>=4.22.2,<5.0" references v5.0 as having breaking changes. However, as of the knowledge cutoff (January 2025), transformers v5.0 has not been released yet. The latest stable version is 4.x series. This comment may be speculative or based on unreleased information. Consider verifying whether v5.0 actually exists and has the mentioned breaking changes, or update the comment to reflect current state.
| "transformers>=4.22.2,<5.0", # v5.0 has breaking changes; adapters library needs update first | |
| "transformers>=4.22.2,<5.0", # upper bound is precautionary in case v5.0 introduces breaking changes; adapters library may need update if/when v5.0 is released |
| "transformers>=4.22.2", | ||
| "huggingface-hub", | ||
| "transformers>=4.22.2,<5.0", # v5.0 has breaking changes; adapters library needs update first | ||
| "huggingface-hub<1.0", # v1.0 has breaking changes (HfFolder removed) |
Copilot AI · Dec 5, 2025
The dependency version constraint "huggingface-hub<1.0" mentions that v1.0 has breaking changes (HfFolder removed). However, as of the knowledge cutoff (January 2025), huggingface-hub is still in the 0.x series (latest ~0.27.x). This comment may be speculative or based on unreleased information. Consider verifying the current state of huggingface-hub and whether this constraint and comment are still accurate.
| "huggingface-hub<1.0", # v1.0 has breaking changes (HfFolder removed) | |
| "huggingface-hub<1.0", # Pin to <1.0 until v1.0 is released and compatibility is confirmed |
```diff
@@ -0,0 +1,288 @@
+# Length-Constrained Segmentation
+
+This supplementary document explains the theory and implementation of length-constrained segmentation in wtpsplit (NB: auto-generated).
```
Copilot AI · Dec 5, 2025
[nitpick] The documentation states "(NB: auto-generated)" but this appears to be a manually written, well-structured markdown document. If this is indeed auto-generated, consider clarifying what tool generates it and when. If it's manually written, remove the "auto-generated" note to avoid confusion.
Suggested change:

```diff
-This supplementary document explains the theory and implementation of length-constrained segmentation in wtpsplit (NB: auto-generated).
+This supplementary document explains the theory and implementation of length-constrained segmentation in wtpsplit.
```
| "skops", | ||
| "pandas>=1", | ||
| "cached_property", # for Py37 | ||
| "mosestokenizer", |
Copilot AI · Dec 5, 2025
The cached_property dependency has been removed from setup.py, but the code still imports it from the third-party package (from cached_property import cached_property) in wtpsplit/utils/__init__.py. Since python_requires=">=3.9" has been set, this should be updated to use the standard library version: from functools import cached_property. This will cause an import error for users installing the package.
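The change this comment describes, for reference (`functools.cached_property` has been in the standard library since Python 3.8, so it is available under the new `python_requires=">=3.9"`):

```python
# wtpsplit/utils/__init__.py
# Before (third-party package, now absent from setup.py):
#   from cached_property import cached_property
# After: the standard library equivalent.
from functools import cached_property
```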
| "split_on_input_newlines=False will lead to newlines in the output " | ||
| "if they were present in the input. Within the model, such newlines are " | ||
| "treated as spaces. " | ||
| "If you want to split on such newlines, set split_on_input_newlines=True." |
Copilot AI · Dec 5, 2025
[nitpick] The warning message states "If you want to split on such newlines, set split_on_input_newlines=True" but this warning is shown when split_on_input_newlines=False. The warning should say "set split_on_input_newlines=True" which is correct, but the context suggests the user already has it set to False and probably wants to keep it that way. Consider rephrasing to make it clearer, e.g., "If you want to split on newlines instead, set split_on_input_newlines=True."
| "If you want to split on such newlines, set split_on_input_newlines=True." | |
| "If you want to split on newlines instead, set split_on_input_newlines=True." |
Over time, this mutated a little into a more comprehensive v2.2.0 update! I expanded the core features from @harikesavan (adding language priors and a lognormal prior, among others), tried to make sure we are covering all edge cases, and made sure text can be fully recovered after splitting (hence the very comprehensive test suite...). I also added some docs, and bumped Python to >= 3.9 since our dependencies require that anyway. I thought a full changelog would be useful, so here it is (auto-generated, though)! Maybe also interesting to @harikesavan @igorsterner.

📋 Changelog: wtpsplit v2.1.7 → v2.2.0

26 files changed, +3,122 / -101 lines

🎯 Major Feature: Length-Constrained Segmentation

Control segment lengths with the new parameters.

New Parameters

Prior Functions

Language-Aware Defaults (70+ languages)

Automatic:

```python
# Auto-applies when using LoRA with language
sat = SaT("sat-3l", style_or_domain="ud", language="de")
sat.split(text, max_length=150, prior_type="gaussian")  # German defaults

# Or explicit
sat.split(text, max_length=100, prior_type="gaussian", prior_kwargs={"lang_code": "zh"})
```

Text Reconstruction (see the checking sketch after this changelog):

```python
# With constraints: "".join(segments) == original
# Without constraints: "\n".join(segments) == original
```

🆕 New Files

📝 Key Modifications

| Change | Impact |
|---|---|
| Python ≥3.9 | Drops 3.7, 3.8 |
| transformers <5.0 | v5.0 has breaking API |
| huggingface-hub <1.0 | v1.0 removes HfFolder |

🧪 Tests

| File | Tests |
|---|---|
| `test.py` | +6 new constraint tests |
| `test_length_constraints.py` | 98 tests (NEW) |
| Total | 130 tests ✅ |
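A quick sketch to check the reconstruction invariants stated in the changelog (assumes a loaded WtP/SaT model; `max_length=80` is an arbitrary choice, and the invariants themselves are the changelog's claims, not independently verified here):

```python
def check_roundtrip(model, text):
    # Invariants from the changelog above.
    constrained = model.split(text, max_length=80)
    assert "".join(constrained) == text          # with constraints
    unconstrained = model.split(text)
    assert "\n".join(unconstrained) == text      # without constraints
```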
bminixhofer left a comment:
Thanks, this will be a great feature! A couple of comments:

- What do we need `_enforce_segment_constraints_simple` for? I only see it used in tests.
- How did we arrive at `LANG_SENTENCE_STATS`? Is it LLM-generated? If it is, I'd rather not have it, since it's likely nonsense for a couple of languages. It would be good to derive this from corpus data instead, as noted in the TODO.
- I don't understand all the priors and why we need them. Why does `lognormal` lead to "natural-feeling output"? Did we ever check the empirical distribution of sentence lengths? I also don't understand how `clipped_polynomial` behaves. It would be nice to have a figure showing the priors (see the plotting sketch after this comment).
- I don't quite understand `_enforce_segment_constraints`. Why do we need to manually ensure segments don't exceed the maximum length? Shouldn't the prior take care of this (since segments > max length have zero probability)? I also don't see why we need to manually merge shorter segments, since the prior should also account for this.
- I didn't check the details of the Viterbi implementation. Is this LLM-generated as well or manually written? We should have pretty high confidence it's correct before merging, in either case. Maybe @markus583 @harikesavan one of you is already certain it's correct. If not, we still need to check it in more detail.
Hey! Thanks for all your work on this feature, and your huge efforts to maintain this project! Just thought I'd chime in after Benjamin to add my two cents.

I can definitely see the attraction of enforcing a max length, and hence why the global optimization via Viterbi is good (and better than greedy, like the LCM people did). Awesome. If you need me to dive in deep to review something, just let me know.

But regarding user-specified priors on sentence length, I am more concerned. I agree with Benjamin's comments that with the current approach one should use corpus statistics rather than heuristics. But I feel that even better would be to make the whole thing learnable, in which case one wouldn't need these priors. I think the typical approach would be a segment-level variant of a CRF, which would aim to do something like rank all the possible mean-pooled segments (and one could of course make segments of particular sizes impossible). If it's trained end to end, I think it could keep everything as one language-agnostic model.

I'm sure there's another paper in doing better length-constrained sentence segmentation. But big performance gains would probably only come on edge cases. Regardless, I think there needs to be evaluation against at least one benchmark before merging big changes here.

Just my thoughts! Happy to chat. Hope you're all doing well :))