⚡️ Speed up function split_sentences by 47% #54
📄 47% (0.47x) speedup for split_sentences in guardrails/utils/tokenization_utils.py
⏱️ Runtime: 49.9 milliseconds → 34.0 milliseconds (best of 43 runs)
📝 Explanation and details
The optimized code achieves a 47% speedup through regex precompilation and pattern consolidation, which address the main bottlenecks identified in the profiling data.

Key optimizations applied:

- Precompiled static regexes: The original code recompiled the same regex patterns on every call. The optimized version precompiles frequently used patterns such as _QUESTION_SPLIT_RE and _DOT_SPLIT_RE at module load, eliminating repeated compilation overhead.
- Abbreviation pattern consolidation: The biggest performance gain comes from combining all 43 abbreviation patterns into a single regex using r"|".join(abbreviations). This replaces the ~4,300 individual re.sub() calls recorded in the profiler with a single combined substitution, cutting abbreviation processing time from 58.2% to 8% of total runtime.
- Per-call regex compilation: For patterns that depend on the dynamic separator parameter, regexes are compiled once per function call rather than on every substitution. This includes the coordinating-conjunction and preposition patterns.
- Optimized split_sentences(): Precompiles both the initial sentence-splitting regex and the final separator-splitting regex, reducing regex compilation overhead in the main entry point.

Performance characteristics by test type:

The optimization maintains identical behavior and output while substantially reducing the regex engine overhead that dominated the original implementation's runtime.
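The consolidation and precompilation ideas described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from this PR: the abbreviation list, the `<DOT>` placeholder, and the function names are all hypothetical stand-ins, and only `r"|".join(...)` is taken from the description.

```python
import re

# Hypothetical sample list; the PR description mentions 43 patterns.
ABBREVIATIONS = [r"Mr\.", r"Mrs\.", r"Dr\.", r"e\.g\.", r"i\.e\."]

# Hypothetical marker that hides a dot from the sentence splitter.
PLACEHOLDER = "<DOT>"


def _mask_dots(match):
    """Replace every dot in the matched abbreviation with the placeholder."""
    return match.group(0).replace(".", PLACEHOLDER)


def protect_abbreviations_loop(text):
    """Baseline shape: one re.sub() call per abbreviation pattern."""
    for pattern in ABBREVIATIONS:
        text = re.sub(pattern, _mask_dots, text)
    return text


# Compiled once at import time: all patterns joined into one alternation.
_ABBREV_RE = re.compile(r"|".join(ABBREVIATIONS))


def protect_abbreviations_combined(text):
    """Consolidated shape: a single sub() call over the combined pattern."""
    return _ABBREV_RE.sub(_mask_dots, text)


def make_separator_splitter(separator):
    """Compile a separator-dependent pattern once, then reuse its split()."""
    sep_re = re.compile(re.escape(separator))
    return sep_re.split


sample = "Dr. Smith met Mrs. Jones, e.g. at noon."
loop_result = protect_abbreviations_loop(sample)
combined_result = protect_abbreviations_combined(sample)
split_on_pipe = make_separator_splitter("|")
```

One caveat with this approach: alternation tries branches left to right, so when patterns overlap (one abbreviation being a prefix of another), the order of the joined list can change which branch matches. The sample list above has no such overlap, so both versions agree on it.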
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, git checkout codeflash/optimize-split_sentences-mh1ox4ck and push.