⚡️ Speed up function `split_sentences` by 47% #54

codeflash-ai · 2025-10-22T07:47:17Z

📄 47% (0.47x) speedup for `split_sentences` in `guardrails/utils/tokenization_utils.py`

⏱️ Runtime : 49.9 milliseconds → 34.0 milliseconds (best of 43 runs)
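
As a rough illustration of how the figures above relate, the sketch below takes the fastest of several timed runs and derives a speedup percentage from two runtimes. The helper names `best_of` and `speedup_percent` are hypothetical, and the formula `(before - after) / after` is an assumption that happens to agree with the reported numbers; it is not taken from Codeflash's own code.

```python
import timeit


def best_of(func, runs=43):
    """Return the fastest single-call wall time, in seconds, over `runs` runs."""
    # timeit.repeat returns one total per repetition; number=1 means one call each.
    return min(timeit.repeat(func, number=1, repeat=runs))


def speedup_percent(before, after):
    """Speedup relative to the optimized runtime (assumed definition)."""
    return (before - after) / after * 100.0


# Runtimes reported in this PR, in milliseconds.
reported = speedup_percent(49.9, 34.0)
print(f"{reported:.1f}% faster")
```

Under this definition, 49.9 ms against 34.0 ms gives just under 47%, which matches the headline figure after rounding.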

📝 Explanation and details

The optimized code achieves a roughly 47% speedup (49.9 ms down to 34.0 ms) through regex precompilation and pattern consolidation, which address the main performance bottlenecks identified in the profiling data.

Key optimizations applied:

1. **Precompiled static regexes**: The original code passed the same pattern strings to the `re` module on every call. The optimized version precompiles frequently used patterns such as `_QUESTION_SPLIT_RE` and `_DOT_SPLIT_RE` at module load, eliminating the repeated compilation overhead.
2. **Abbreviation pattern consolidation**: The biggest gain comes from combining all 43 abbreviation patterns into a single regex using `r"|".join(abbreviations)`. In the profiler run, this replaces about 4,300 individual `re.sub()` calls (43 per pass) with one `re.sub()` call per pass, cutting abbreviation processing from 58.2% to 8% of total runtime.
3. **Per-call regex compilation**: Patterns that depend on the dynamic `separator` parameter are compiled once per function call rather than on every substitution. This includes the coordinating conjunction and preposition patterns.
4. **Optimized `split_sentences()`**: Both the initial sentence-splitting regex and the final separator-splitting regex are precompiled, reducing compilation overhead in the main entry point.
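
The techniques in the list above can be sketched as follows. This is a minimal illustration, not the code from `tokenization_utils.py`: the abbreviation list, the `<DOT>` placeholder, and all function names here are hypothetical stand-ins, and the real module's patterns may differ.

```python
import re

# Hypothetical subset of abbreviation patterns (the real list has 43 entries).
ABBREVIATIONS = [r"Dr\.", r"Mr\.", r"Mrs\.", r"St\.", r"e\.g\.", r"i\.e\."]

# Hypothetical token used to hide dots that must not end a sentence.
DOT_PLACEHOLDER = "<DOT>"


def _mask_dots(match):
    """Replace every dot inside the matched abbreviation with the placeholder."""
    return match.group(0).replace(".", DOT_PLACEHOLDER)


def protect_abbreviations_loop(text):
    """Baseline: one re.sub() call per abbreviation pattern."""
    for abbr in ABBREVIATIONS:
        text = re.sub(r"\b" + abbr, _mask_dots, text)
    return text


# Consolidated: every pattern joined into one alternation, compiled at import.
_ABBREVIATION_RE = re.compile(r"\b(?:" + r"|".join(ABBREVIATIONS) + r")")


def protect_abbreviations_combined(text):
    """Optimized: a single precompiled substitution covers all abbreviations."""
    return _ABBREVIATION_RE.sub(_mask_dots, text)


def split_on_separator(text, separator):
    """Compile the separator-dependent pattern once per call, then reuse it."""
    splitter = re.compile(re.escape(separator))
    return [part for part in splitter.split(text) if part]


sample = "Dr. Smith and Mr. Jones went to St. Louis, e.g. by train."
print(protect_abbreviations_combined(sample))
```

One caveat with this approach: a combined alternation matches left to right and picks the first alternative that fits, so when patterns overlap, the result can differ from applying them one at a time. The patterns in this sketch do not overlap, so both versions agree on the sample text.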

Performance characteristics by test type:

- **Simple sentences**: 40-45% faster due to reduced regex compilation overhead
- **Large texts with many sentences**: 60-65% faster, benefiting most from precompilation savings
- **Abbreviation-heavy texts**: 20-30% faster, where the abbreviation consolidation provides the largest absolute time savings
- **Complex nested structures**: 45-65% faster, as the precompiled patterns handle these efficiently

The optimization maintains identical behavior and output while dramatically reducing the regex engine overhead that dominated the original implementation's runtime.

✅ Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 102 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime

import re

# imports
import pytest  # used for our unit tests
from guardrails.utils.tokenization_utils import split_sentences

# unit tests

# ------------------- BASIC TEST CASES -------------------

def test_single_sentence():
    # Single sentence, no punctuation at end
    codeflash_output = split_sentences("Hello world") # 123μs -> 87.3μs (41.9% faster)

def test_single_sentence_with_period():
    # Single sentence ending with a period
    codeflash_output = split_sentences("Hello world.") # 140μs -> 99.0μs (42.4% faster)

def test_two_sentences_period():
    # Two sentences separated by a period
    codeflash_output = split_sentences("Hello world. This is great.") # 169μs -> 118μs (43.6% faster)

def test_two_sentences_question():
    # Two sentences separated by a question mark
    codeflash_output = split_sentences("How are you? I'm fine.") # 169μs -> 119μs (41.8% faster)

def test_three_sentences_varied_punctuation():
    # Three sentences with different punctuation
    codeflash_output = split_sentences("Hi! How are you? I'm fine.") # 186μs -> 131μs (42.4% faster)

def test_sentences_with_exclamation():
    # Exclamation mark as sentence end
    codeflash_output = split_sentences("Wow! That's amazing!") # 165μs -> 118μs (39.8% faster)

def test_sentences_with_mixed_punctuation():
    # Mixed punctuation
    codeflash_output = split_sentences("Wait. What? No!") # 175μs -> 123μs (42.1% faster)

def test_sentences_with_newlines():
    # Sentences separated by newlines
    codeflash_output = split_sentences("Hello world.\nThis is great.") # 169μs -> 118μs (43.7% faster)

def test_sentences_with_multiple_spaces():
    # Sentences with extra spaces between them
    codeflash_output = split_sentences("Hello world.   This is great.") # 171μs -> 119μs (44.2% faster)

def test_sentence_with_abbreviation():
    # Sentence with abbreviation that shouldn't split
    codeflash_output = split_sentences("I met Dr. Smith. He is nice.") # 179μs -> 129μs (38.0% faster)

def test_sentence_with_multiple_abbreviations():
    # Multiple abbreviations in one sentence
    codeflash_output = split_sentences("Dr. Smith and Mr. Jones went to St. Louis.") # 189μs -> 142μs (33.3% faster)

def test_sentence_with_eg_abbreviation():
    # e.g. abbreviation should not split
    codeflash_output = split_sentences("Many fruits, e.g. apples, are healthy.") # 170μs -> 122μs (39.5% faster)

def test_sentence_with_ie_abbreviation():
    # i.e. abbreviation should not split
    codeflash_output = split_sentences("This is a test, i.e. an example.") # 168μs -> 122μs (37.7% faster)

def test_sentence_with_et_al_abbreviation():
    # et al. abbreviation should not split
    codeflash_output = split_sentences("Smith et al. found the results.") # 168μs -> 120μs (40.3% faster)

def test_sentence_with_period_in_parentheses():
    # Period inside parentheses should not split
    codeflash_output = split_sentences("This is a test (see Fig. 2). Next sentence.") # 193μs -> 135μs (43.0% faster)

def test_sentence_with_period_in_quotes():
    # Period inside quotes should not split
    codeflash_output = split_sentences('He said "Hello world." Then he left.') # 164μs -> 111μs (47.0% faster)

def test_sentence_with_period_in_brackets():
    # Period inside brackets should not split
    codeflash_output = split_sentences("This is [a test. Really]. Next sentence.") # 189μs -> 133μs (42.5% faster)

def test_sentence_with_list_abbreviation():
    # List abbreviation should not split
    codeflash_output = split_sentences("The items are listed in Fig. 3.") # 168μs -> 121μs (38.6% faster)

def test_sentence_with_multiple_abbreviations_and_sentence():
    # Multiple abbreviations and a sentence split
    codeflash_output = split_sentences("Dr. Smith, i.e. the director, arrived. He spoke.") # 198μs -> 144μs (36.9% faster)

# ------------------- EDGE TEST CASES -------------------

def test_empty_string():
    # Empty string should return empty list
    codeflash_output = split_sentences("") # 116μs -> 82.3μs (41.5% faster)

def test_only_punctuation():
    # Only punctuation should be a single sentence
    codeflash_output = split_sentences("!!!") # 131μs -> 91.1μs (44.1% faster)

def test_multiple_punctuation_marks():
    # Multiple punctuation marks at the end
    codeflash_output = split_sentences("Hello!!!") # 137μs -> 97.8μs (40.3% faster)

def test_multiple_sentence_endings():
    # Multiple sentence-ending punctuation between sentences
    codeflash_output = split_sentences("Hello!! How are you??") # 165μs -> 116μs (42.9% faster)

def test_sentence_with_abbreviation_at_end():
    # Abbreviation at end of sentence
    codeflash_output = split_sentences("He works at Acme Co.") # 144μs -> 105μs (37.3% faster)

def test_sentence_with_single_letter_abbreviation():
    # Single letter abbreviation
    codeflash_output = split_sentences("A. Smith went home.") # 154μs -> 107μs (43.1% faster)

def test_sentence_with_multiple_single_letter_abbreviations():
    # Multiple single letter abbreviations
    codeflash_output = split_sentences("A. B. Smith went home.") # 158μs -> 109μs (44.6% faster)

def test_sentence_with_period_and_no_space():
    # Period at end, no space after
    codeflash_output = split_sentences("Hello world.This is great.") # 153μs -> 106μs (44.3% faster)

def test_sentence_with_period_and_tab():
    # Period at end, tab after
    codeflash_output = split_sentences("Hello world.\tThis is great.") # 169μs -> 117μs (44.1% faster)

def test_sentence_with_period_and_newline():
    # Period at end, newline after
    codeflash_output = split_sentences("Hello world.\nThis is great.") # 168μs -> 117μs (43.3% faster)

def test_sentence_with_nested_parentheses():
    # Nested parentheses
    codeflash_output = split_sentences("This is a test (see Fig. 2 (details in Table 1)). Next sentence.") # 218μs -> 149μs (46.1% faster)

def test_sentence_with_nested_brackets():
    # Nested brackets
    codeflash_output = split_sentences("This is a test [see Fig. 2 [details in Table 1]]. Next sentence.") # 217μs -> 149μs (44.9% faster)

def test_sentence_with_nested_quotes():
    # Nested quotes
    codeflash_output = split_sentences('He said "She said \'Hello.\'". Then he left.') # 187μs -> 129μs (45.2% faster)

def test_sentence_with_conjunction_at_start():
    # Sentence starting with conjunction should not be split
    codeflash_output = split_sentences("He went home. And he slept.") # 169μs -> 119μs (42.3% faster)

def test_sentence_with_preposition_at_start():
    # Sentence starting with preposition should not be split
    codeflash_output = split_sentences("He went home. In the morning, he woke up.") # 181μs -> 124μs (45.6% faster)

def test_sentence_with_period_in_middle_of_word():
    # Period in the middle of a word should not split
    codeflash_output = split_sentences("The domain is example.com. Next sentence.") # 180μs -> 125μs (44.2% faster)

def test_sentence_with_multiple_abbreviations_and_punctuation():
    # Multiple abbreviations and punctuation
    codeflash_output = split_sentences("Dr. Smith, Ph.D., arrived at 10 a.m. He spoke.") # 197μs -> 140μs (40.6% faster)

def test_sentence_with_no_space_after_punctuation():
    # No space after punctuation
    codeflash_output = split_sentences("Hello world!How are you?I'm fine.") # 161μs -> 111μs (44.0% faster)

def test_sentence_with_multiple_newlines():
    # Multiple newlines between sentences
    codeflash_output = split_sentences("Hello world.\n\nThis is great.") # 170μs -> 119μs (43.1% faster)

def test_sentence_with_windows_line_endings():
    # Windows line endings
    codeflash_output = split_sentences("Hello world.\r\nThis is great.") # 168μs -> 118μs (42.7% faster)

def test_sentence_with_mismatched_brackets():
    # Mismatched brackets (should not crash)
    codeflash_output = split_sentences("This is a test [see Fig. 2. Next sentence.") # 211μs -> 153μs (37.8% faster)

def test_sentence_with_mismatched_parentheses():
    # Mismatched parentheses (should not crash)
    codeflash_output = split_sentences("This is a test (see Fig. 2. Next sentence.") # 211μs -> 154μs (36.7% faster)

def test_sentence_with_mismatched_quotes():
    # Mismatched quotes (should not crash)
    codeflash_output = split_sentences('He said "Hello world. Then he left.') # 179μs -> 123μs (44.6% faster)

def test_sentence_with_multiple_sentence_ending_punctuations():
    # Sentence ending with multiple punctuation marks
    codeflash_output = split_sentences("Hello world?! Next sentence.") # 169μs -> 118μs (42.9% faster)

def test_sentence_with_punctuation_inside_quotes():
    # Punctuation inside quotes should not split
    codeflash_output = split_sentences('He said "Wow! Amazing." Then he left.') # 172μs -> 120μs (43.2% faster)

def test_sentence_with_punctuation_inside_brackets():
    # Punctuation inside brackets should not split
    codeflash_output = split_sentences("This is [a test! Really]. Next sentence.") # 190μs -> 133μs (42.5% faster)

def test_sentence_with_punctuation_inside_parentheses():
    # Punctuation inside parentheses should not split
    codeflash_output = split_sentences("This is a test (Wow! Really). Next sentence.") # 193μs -> 134μs (43.5% faster)

def test_sentence_with_abbreviation_and_period():
    # Abbreviation followed by period
    codeflash_output = split_sentences("He is a Jr. He is young.") # 164μs -> 119μs (37.1% faster)

def test_sentence_with_abbreviation_and_no_space():
    # Abbreviation followed by no space
    codeflash_output = split_sentences("He is a Jr.He is young.") # 153μs -> 107μs (43.0% faster)

def test_sentence_with_abbreviation_and_newline():
    # Abbreviation followed by newline
    codeflash_output = split_sentences("He is a Jr.\nHe is young.") # 164μs -> 117μs (39.4% faster)

def test_sentence_with_multiple_abbreviations_and_no_space():
    # Multiple abbreviations, no space after
    codeflash_output = split_sentences("Dr.Smith went to St.Louis.") # 155μs -> 106μs (46.2% faster)

def test_sentence_with_multiple_abbreviations_and_newline():
    # Multiple abbreviations, newline after
    codeflash_output = split_sentences("Dr. Smith went to St. Louis.\nHe arrived.") # 194μs -> 142μs (36.6% faster)

def test_sentence_with_abbreviation_and_exclamation():
    # Abbreviation followed by exclamation
    codeflash_output = split_sentences("He is a Jr! He is young.") # 171μs -> 119μs (43.7% faster)

def test_sentence_with_abbreviation_and_question():
    # Abbreviation followed by question
    codeflash_output = split_sentences("He is a Jr? Is he young?") # 170μs -> 118μs (44.2% faster)

def test_sentence_with_abbreviation_and_multiple_punctuation():
    # Abbreviation followed by multiple punctuation
    codeflash_output = split_sentences("He is a Jr!! He is young.") # 171μs -> 120μs (42.5% faster)

# ------------------- LARGE SCALE TEST CASES -------------------

def test_large_text_many_sentences():
    # Large text with many sentences
    text = " ".join([f"Sentence {i}." for i in range(100)])
    expected = [f"Sentence {i}." for i in range(100)]
    codeflash_output = split_sentences(text) # 2.28ms -> 1.40ms (62.6% faster)

def test_large_text_with_abbreviations():
    # Large text with many abbreviations
    text = " ".join([f"Dr. Smith {i} went to St. Louis." for i in range(100)])
    expected = [f"Dr. Smith {i} went to St. Louis." for i in range(100)]
    codeflash_output = split_sentences(text) # 4.86ms -> 4.07ms (19.4% faster)

def test_large_text_with_nested_parentheses():
    # Large text with nested parentheses
    text = " ".join([f"Sentence {i} (see Fig. {i} (details in Table {i})). Next sentence {i}." for i in range(50)])
    expected = []
    for i in range(50):
        expected.append(f"Sentence {i} (see Fig. {i} (details in Table {i})).")
        expected.append(f"Next sentence {i}.")
    codeflash_output = split_sentences(text) # 4.15ms -> 2.51ms (65.1% faster)

def test_large_text_with_nested_brackets():
    # Large text with nested brackets
    text = " ".join([f"Sentence {i} [see Fig. {i} [details in Table {i}]]. Next sentence {i}." for i in range(50)])
    expected = []
    for i in range(50):
        expected.append(f"Sentence {i} [see Fig. {i} [details in Table {i}]].")
        expected.append(f"Next sentence {i}.")
    codeflash_output = split_sentences(text) # 4.18ms -> 2.55ms (63.9% faster)

def test_large_text_with_varied_punctuation():
    # Large text with varied punctuation
    text = " ".join([f"Sentence {i}! How are you? I'm fine." for i in range(30)])
    expected = []
    for i in range(30):
        expected.append(f"Sentence {i}!")
        expected.append("How are you?")
        expected.append("I'm fine.")
    codeflash_output = split_sentences(text) # 1.26ms -> 789μs (59.9% faster)

def test_large_text_with_newlines_and_tabs():
    # Large text with newlines and tabs
    text = "\n".join([f"Sentence {i}.\tNext sentence {i}." for i in range(50)])
    expected = []
    for i in range(50):
        expected.append(f"Sentence {i}.")
        expected.append(f"Next sentence {i}.")
    codeflash_output = split_sentences(text) # 2.47ms -> 1.54ms (60.1% faster)

def test_large_text_with_abbreviations_and_punctuation():
    # Large text with abbreviations and punctuation
    text = " ".join([f"Dr. Smith, Ph.D., arrived at 10 a.m. He spoke." for i in range(30)])
    expected = []
    for i in range(30):
        expected.append("Dr. Smith, Ph.D., arrived at 10 a.m.")
        expected.append("He spoke.")
    codeflash_output = split_sentences(text) # 2.14ms -> 1.48ms (44.8% faster)

def test_large_text_with_mixed_abbreviations_and_sentences():
    # Large text with mixed abbreviations and sentences
    text = " ".join([f"Dr. Smith, i.e. the director, arrived. He spoke." for i in range(30)])
    expected = []
    for i in range(30):
        expected.append("Dr. Smith, i.e. the director, arrived.")
        expected.append("He spoke.")
    codeflash_output = split_sentences(text) # 2.21ms -> 1.71ms (28.8% faster)

def test_large_text_with_no_sentence_endings():
    # Large text with no sentence-ending punctuation
    text = " ".join([f"Sentence {i}" for i in range(100)])
    expected = [" ".join([f"Sentence {i}" for i in range(100)])]
    codeflash_output = split_sentences(text) # 863μs -> 457μs (88.6% faster)

def test_large_text_with_only_punctuation():
    # Large text with only punctuation
    text = "." * 100
    expected = ["." * 100]
    codeflash_output = split_sentences(text) # 223μs -> 123μs (81.4% faster)

def test_large_text_with_edge_cases():
    # Large text with edge cases (abbreviations, parentheses, brackets, quotes, newlines)
    text = " ".join([
        f'Dr. Smith (see Fig. {i} [details in Table {i}]) said "Hello world." Next sentence {i}.' for i in range(30)
    ])
    expected = []
    for i in range(30):
        expected.append(f'Dr. Smith (see Fig. {i} [details in Table {i}]) said "Hello world."')
        expected.append(f"Next sentence {i}.")
    codeflash_output = split_sentences(text) # 2.34ms -> 1.39ms (68.3% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import re

# imports
import pytest  # used for our unit tests
from guardrails.utils.tokenization_utils import split_sentences

# unit tests

# --------------------------
# Basic Test Cases
# --------------------------

def test_single_sentence():
    # Basic: One sentence, no punctuation inside
    codeflash_output = split_sentences("This is a sentence.") # 150μs -> 106μs (40.8% faster)

def test_two_sentences_period():
    # Basic: Two sentences split by period
    codeflash_output = split_sentences("Hello world. Goodbye world.") # 166μs -> 117μs (42.1% faster)

def test_two_sentences_exclamation():
    # Basic: Two sentences split by exclamation
    codeflash_output = split_sentences("Wow! Amazing.") # 158μs -> 111μs (42.0% faster)

def test_two_sentences_question():
    # Basic: Two sentences split by question mark
    codeflash_output = split_sentences("Is this working? Yes.") # 164μs -> 114μs (44.6% faster)

def test_multiple_sentences_mixed_punctuation():
    # Basic: Multiple sentences with mixed punctuation
    text = "Hello! How are you? I'm fine. Thanks."
    expected = ["Hello!", "How are you?", "I'm fine.", "Thanks."]
    codeflash_output = split_sentences(text) # 210μs -> 146μs (44.0% faster)

def test_sentence_with_abbreviation():
    # Basic: Sentence with abbreviation (should not split after 'e.g.')
    text = "This is an example, e.g. a test. Next sentence."
    expected = ["This is an example, e.g. a test.", "Next sentence."]
    codeflash_output = split_sentences(text) # 191μs -> 138μs (37.9% faster)

def test_sentence_with_title_abbreviation():
    # Basic: Sentence with title abbreviation (should not split after 'Dr.')
    text = "Dr. Smith went home. He was tired."
    expected = ["Dr. Smith went home.", "He was tired."]
    codeflash_output = split_sentences(text) # 179μs -> 130μs (38.1% faster)

def test_sentence_with_period_inside_parentheses():
    # Basic: Period inside parentheses should not split
    text = "This is a test (e.g. with an example). It works."
    expected = ["This is a test (e.g. with an example).", "It works."]
    codeflash_output = split_sentences(text) # 200μs -> 139μs (43.4% faster)

def test_sentence_with_quotes():
    # Basic: Sentence with quotes containing period
    text = 'He said, "This is great." Then he left.'
    expected = ['He said, "This is great."', "Then he left."]
    codeflash_output = split_sentences(text) # 166μs -> 114μs (45.7% faster)

# --------------------------
# Edge Test Cases
# --------------------------

def test_empty_string():
    # Edge: Empty string input
    codeflash_output = split_sentences("") # 116μs -> 81.3μs (43.1% faster)

def test_only_punctuation():
    # Edge: Only punctuation as input
    codeflash_output = split_sentences("!") # 126μs -> 88.9μs (41.8% faster)
    codeflash_output = split_sentences(".") # 103μs -> 65.5μs (58.4% faster)
    codeflash_output = split_sentences("?") # 100μs -> 62.6μs (60.5% faster)

def test_no_sentence_endings():
    # Edge: Input with no sentence-ending punctuation
    codeflash_output = split_sentences("This is a test without punctuation") # 136μs -> 95.2μs (43.3% faster)

def test_multiple_spaces_between_sentences():
    # Edge: Multiple spaces between sentences
    text = "Hello.   World!   How are you?"
    expected = ["Hello.", "World!", "How are you?"]
    codeflash_output = split_sentences(text) # 187μs -> 129μs (45.1% faster)

def test_multiple_punctuation_marks():
    # Edge: Multiple punctuation marks at sentence end
    text = "What?! No way!! Really..."
    expected = ["What?!", "No way!!", "Really..."]
    codeflash_output = split_sentences(text) # 186μs -> 128μs (44.8% faster)

def test_sentence_with_nested_parentheses():
    # Edge: Nested parentheses should not split
    text = "This is a test (with (nested) parentheses). Next sentence."
    expected = ["This is a test (with (nested) parentheses).", "Next sentence."]
    codeflash_output = split_sentences(text) # 197μs -> 134μs (46.7% faster)

def test_sentence_with_nested_brackets():
    # Edge: Nested brackets should not split
    text = "Check this [with [nested] brackets]. Next."
    expected = ["Check this [with [nested] brackets].", "Next."]
    codeflash_output = split_sentences(text) # 186μs -> 127μs (46.3% faster)

def test_sentence_with_single_letter_abbreviation():
    # Edge: Single letter abbreviation (should not split after 'A.')
    text = "A. Smith went home. B. Jones stayed."
    expected = ["A. Smith went home.", "B. Jones stayed."]
    codeflash_output = split_sentences(text) # 184μs -> 126μs (46.3% faster)

def test_sentence_with_period_in_middle():
    # Edge: Period in middle of sentence (should not split)
    text = "This is a test. However, e.g. this example is valid."
    expected = ["This is a test.", "However, e.g. this example is valid."]
    codeflash_output = split_sentences(text) # 195μs -> 140μs (39.6% faster)

def test_sentence_with_line_breaks():
    # Edge: Sentences separated by line breaks
    text = "First sentence.\nSecond sentence!  Third sentence?"
    expected = ["First sentence.", "Second sentence!", "Third sentence?"]
    codeflash_output = split_sentences(text) # 197μs -> 133μs (48.6% faster)

def test_sentence_with_separator_in_text():
    # Edge: Text containing separator string should not break incorrectly
    sep = "abcdsentenceseperatordcba"
    text = f"This sentence mentions {sep}. Next sentence."
    expected = [f"This sentence mentions {sep}.", "Next sentence."]
    codeflash_output = split_sentences(text) # 177μs -> 122μs (44.7% faster)

def test_sentence_with_coordinating_conjunction():
    # Edge: No break before 'and'
    text = "He left. And he never returned."
    expected = ["He left.", "And he never returned."]
    codeflash_output = split_sentences(text) # 172μs -> 119μs (44.3% faster)

def test_sentence_with_preposition():
    # Edge: No break before 'in'
    text = "He left. In the morning, he returned."
    expected = ["He left.", "In the morning, he returned."]

    codeflash_output = split_sentences(text) # 176μs -> 122μs (44.8% faster)

def test_sentence_with_multiple_abbreviations():
    # Edge: Multiple abbreviations in one sentence
    text = "Dr. Smith, Ph.D., went home. Next sentence."
    expected = ["Dr. Smith, Ph.D., went home.", "Next sentence."]
    codeflash_output = split_sentences(text) # 190μs -> 135μs (41.0% faster)

def test_sentence_with_period_at_end_and_space():
    # Edge: Sentence ending with period and trailing spaces
    text = "Hello world.   "
    expected = ["Hello world."]
    codeflash_output = split_sentences(text) # 144μs -> 102μs (41.5% faster)

def test_sentence_with_period_and_newline():
    # Edge: Sentence ending with period and newline
    text = "Hello world.\n"
    expected = ["Hello world."]
    codeflash_output = split_sentences(text) # 141μs -> 99.3μs (43.0% faster)

def test_sentence_with_multiple_newlines():
    # Edge: Multiple newlines between sentences
    text = "Hello world.\n\nGoodbye world."
    expected = ["Hello world.", "Goodbye world."]
    codeflash_output = split_sentences(text) # 169μs -> 117μs (44.3% faster)

def test_sentence_with_abbreviation_and_newline():
    # Edge: Abbreviation at end of line should not split
    text = "This is e.g.\na test. Next sentence."
    expected = ["This is e.g. a test.", "Next sentence."]
    codeflash_output = split_sentences(text) # 183μs -> 132μs (38.5% faster)

def test_sentence_with_quote_and_period_inside():
    # Edge: Quoted sentence with period inside quotes
    text = 'He said, "Wait." Then he left.'
    expected = ['He said, "Wait."', "Then he left."]
    codeflash_output = split_sentences(text) # 160μs -> 110μs (45.6% faster)

def test_sentence_with_single_quotes():
    # Edge: Sentence with single quotes
    text = "She said, 'Hello.' Goodbye."
    expected = ["She said, 'Hello.'", "Goodbye."]
    codeflash_output = split_sentences(text) # 156μs -> 107μs (45.9% faster)

def test_sentence_with_multiple_abbreviation_types():
    # Edge: Multiple types of abbreviations
    text = "Mr. Smith and Mrs. Jones went to St. Louis. They met Dr. Brown."
    expected = ["Mr. Smith and Mrs. Jones went to St. Louis.", "They met Dr. Brown."]
    codeflash_output = split_sentences(text) # 227μs -> 173μs (31.1% faster)

def test_sentence_with_multiple_periods_in_abbreviation():
    # Edge: Abbreviation with multiple periods
    text = "E.g. this is an example. I.e. another one."
    expected = ["E.g. this is an example.", "I.e. another one."]
    codeflash_output = split_sentences(text) # 194μs -> 144μs (34.2% faster)

# --------------------------
# Large Scale Test Cases
# --------------------------

def test_large_text_many_sentences():
    # Large: 100 sentences separated by periods
    text = " ".join([f"Sentence {i}." for i in range(100)])
    expected = [f"Sentence {i}." for i in range(100)]
    codeflash_output = split_sentences(text) # 2.27ms -> 1.40ms (62.0% faster)

def test_large_text_mixed_punctuation():
    # Large: 100 sentences with mixed punctuation
    punct = [".", "!", "?"]
    text = " ".join([f"Sentence {i}{punct[i%3]}" for i in range(100)])
    expected = [f"Sentence {i}{punct[i%3]}" for i in range(100)]
    codeflash_output = split_sentences(text) # 2.28ms -> 1.39ms (64.1% faster)

def test_large_text_with_abbreviations():
    # Large: Sentences with abbreviations scattered
    text = " ".join([f"Dr. Smith did e.g. test {i}. Next sentence {i}." for i in range(50)])
    expected = [f"Dr. Smith did e.g. test {i}. Next sentence {i}." for i in range(50)]
    codeflash_output = split_sentences(text) # 3.48ms -> 2.75ms (26.9% faster)

def test_large_text_with_nested_parentheses_and_brackets():
    # Large: Sentences with nested parentheses/brackets
    text = " ".join([f"Test {i} (with [nested {i}]) end. Next {i}." for i in range(50)])
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, `git checkout codeflash/optimize-split_sentences-mh1ox4ck` and push.


codeflash-ai bot requested a review from mashraf-222 October 22, 2025 07:47

codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 22, 2025
