⚡️ Speed up function split_sentences by 47%
#54
📄 47% (0.47x) speedup for `split_sentences` in `guardrails/utils/tokenization_utils.py`
⏱️ Runtime: 49.9 milliseconds → 34.0 milliseconds (best of 43 runs)
📝 Explanation and details
The optimized code achieves a 47% speedup through regex precompilation and pattern consolidation, which address the primary performance bottlenecks identified in the profiling data.
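As a rough illustration of the precompilation idea, the sketch below contrasts passing a pattern string to `re.split` on every call with reusing a pattern compiled once at module load. The pattern and both function names are hypothetical and chosen only for illustration; the actual `_QUESTION_SPLIT_RE` in the PR may look different. Python's `re` module keeps an internal cache of compiled patterns, so the per-call cost in the slow variant is mostly the cache lookup and argument handling rather than a full recompile.

```python
import re

# Hypothetical pattern; the real _QUESTION_SPLIT_RE in the PR may differ.
_QUESTION_SPLIT_RE = re.compile(r"\?\s+")


def split_on_questions_slow(text: str) -> list[str]:
    # The pattern string goes through re's module-level cache on every call.
    return re.split(r"\?\s+", text)


def split_on_questions_fast(text: str) -> list[str]:
    # Reuses the pattern object compiled once at import time.
    return _QUESTION_SPLIT_RE.split(text)


if __name__ == "__main__":
    sample = "Is it? Yes. Why? Because."
    print(split_on_questions_fast(sample))
```

Both variants return the same list; only the per-call overhead differs.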
Key optimizations applied:
- Precompiled static regexes: The original code recompiled the same regex patterns on every call. The optimized version precompiles frequently used patterns such as `_QUESTION_SPLIT_RE` and `_DOT_SPLIT_RE` at module load, eliminating repeated compilation overhead.
- Abbreviation pattern consolidation: The biggest performance gain comes from combining all 43 abbreviation patterns into a single regex using `r"|".join(abbreviations)`. This reduces ~4,300 individual `re.sub()` calls (in the profiler) to just one, cutting abbreviation processing time from 58.2% to 8% of total runtime.
- Per-call regex compilation: For patterns that depend on the dynamic `separator` parameter, regexes are compiled once per function call rather than on every substitution. This includes coordinating conjunction and preposition patterns.
- Optimized `split_sentences()`: Precompiles both the initial sentence-splitting regex and the final separator-splitting regex, reducing regex compilation overhead in the main entry point.

Performance characteristics by test type:
The optimization maintains identical behavior and output while dramatically reducing the regex engine overhead that dominated the original implementation's runtime.
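The abbreviation consolidation described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the abbreviation list, the `<DOT>` placeholder, and both function names are assumptions. It replaces a loop of one `re.sub()` call per abbreviation with a single precompiled alternation.

```python
import re

# Hypothetical abbreviation list; the PR combines 43 such patterns.
ABBREVIATIONS = ["Mr", "Mrs", "Dr", "Prof", "etc"]

# One alternation built with "|".join(...) and compiled once at import time.
_ABBREV_RE = re.compile(
    r"\b(" + "|".join(re.escape(a) for a in ABBREVIATIONS) + r")\."
)


def protect_loop(text: str, placeholder: str = "<DOT>") -> str:
    # Original style: one re.sub() call per abbreviation.
    for abbr in ABBREVIATIONS:
        text = re.sub(r"\b" + re.escape(abbr) + r"\.", abbr + placeholder, text)
    return text


def protect_combined(text: str, placeholder: str = "<DOT>") -> str:
    # Consolidated style: a single pass over the text.
    return _ABBREV_RE.sub(lambda m: m.group(1) + placeholder, text)


if __name__ == "__main__":
    sample = "Dr. Smith met Mrs. Jones, etc. Then they left."
    print(protect_combined(sample))
```

The two functions agree on this sample, but a single alternation is not automatically equivalent to sequential substitutions when patterns overlap or when one replacement creates text that a later pattern would match, so output equality should be checked against the existing tests.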
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, `git checkout codeflash/optimize-split_sentences-mh1ox4ck` and push.