@codeflash-ai codeflash-ai bot commented Oct 22, 2025

📄 47% (0.47x) speedup for split_sentences in guardrails/utils/tokenization_utils.py

⏱️ Runtime : 49.9 milliseconds → 34.0 milliseconds (best of 43 runs)
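
The headline figure can be reproduced from the two runtimes above. This is a minimal sketch that assumes the speedup is defined as original runtime divided by optimized runtime, minus one; the function name `speedup_fraction` is hypothetical and not part of Codeflash.

```python
def speedup_fraction(original_ms: float, optimized_ms: float) -> float:
    """Return the relative speedup, assuming speedup = original / optimized - 1."""
    if optimized_ms <= 0:
        raise ValueError("optimized runtime must be positive")
    return original_ms / optimized_ms - 1.0


# Runtimes reported in this PR: 49.9 ms before, 34.0 ms after.
headline = speedup_fraction(49.9, 34.0)
print(f"{headline:.2f}x, {headline:.0%}")  # prints "0.47x, 47%"
```

Under that assumed definition, the result agrees with the "47% (0.47x)" figure in the PR title.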

📝 Explanation and details

The optimized code achieves a 47% speedup through regex precompilation and pattern consolidation, which address the main performance bottlenecks identified in the profiling data.

Key optimizations applied:

  1. Precompiled static regexes: The original code recompiled the same regex patterns on every call. The optimized version precompiles frequently-used patterns like _QUESTION_SPLIT_RE and _DOT_SPLIT_RE at module load, eliminating repeated compilation overhead.

  2. Abbreviation pattern consolidation: The biggest performance gain comes from combining all 43 abbreviation patterns into a single regex using r"|".join(abbreviations). This reduces the ~4,300 individual re.sub() calls recorded by the profiler to one per invocation, cutting abbreviation processing time from 58.2% to 8% of total runtime.

  3. Per-call regex compilation: For patterns that depend on the dynamic separator parameter, regexes are compiled once per function call rather than on every substitution. This includes coordinating conjunction and preposition patterns.

  4. Optimized split_sentences(): Precompiles both the initial sentence-splitting regex and the final separator-splitting regex, reducing regex compilation overhead in the main entry point.
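
The techniques listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the abbreviation subset, the `<DOT>` placeholder, the conjunction list, and all function names are hypothetical. Only the ideas (module-level `re.compile`, joining patterns with `r"|".join(...)`, and compiling separator-dependent patterns once per call) come from the description.

```python
import re

# Hypothetical subset of abbreviation patterns (the PR mentions 43 in total).
ABBREVIATIONS = [r"Dr", r"Mr", r"Mrs", r"St", r"e\.g", r"i\.e"]
PLACEHOLDER = "<DOT>"  # hypothetical marker that hides a period from the splitter


def protect_one_by_one(text: str) -> str:
    """Baseline: one re.sub() call per abbreviation pattern."""
    for abbr in ABBREVIATIONS:
        text = re.sub(rf"\b({abbr})\.", r"\g<1>" + PLACEHOLDER, text)
    return text


# Consolidated: all patterns joined into one alternation, compiled once at import.
_ABBREVIATION_RE = re.compile(r"\b(" + r"|".join(ABBREVIATIONS) + r")\.")


def protect_combined(text: str) -> str:
    """Single substitution pass using the precompiled alternation."""
    return _ABBREVIATION_RE.sub(r"\g<1>" + PLACEHOLDER, text)


def drop_separator_before_conjunctions(text: str, separator: str) -> str:
    """Separator-dependent pattern: compiled once per call, then reused."""
    pattern = re.compile(re.escape(separator) + r"\s*(and|or|but)\b")
    return pattern.sub(r" \1", text)


sample = "Dr. Smith and Mrs. Jones, e.g. in St. Louis."
print(protect_combined(sample))
```

For this sample, both variants produce the same string. In general a combined alternation is equivalent to the sequential loop only when the individual patterns do not interact, for example when one substitution cannot create or destroy a match for another, so a consolidation like this is worth checking against the original on real inputs.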

Performance characteristics by test type:

  • Simple sentences: 40-45% faster due to reduced regex compilation overhead
  • Large texts with many sentences: 60-65% faster, benefiting most from precompilation savings
  • Abbreviation-heavy texts: 20-30% faster, where the abbreviation consolidation provides the largest absolute time savings
  • Complex nested structures: 45-65% faster, as the precompiled patterns handle these efficiently

The optimization maintains identical behavior and output while dramatically reducing the regex engine overhead that dominated the original implementation's runtime.
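
A claim of identical behavior can be checked by running both versions over the same inputs and comparing results. The sketch below is an assumption about how such a check might look, not Codeflash's actual harness; the two toy splitters and the helper `find_mismatches` are hypothetical stand-ins for the original and optimized `split_sentences`.

```python
import re
from typing import Callable, Iterable, List


def split_original(text: str) -> List[str]:
    """Toy baseline: passes the pattern string to re.split() on every call."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


# Toy optimized variant: the same pattern, compiled once at module load.
_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")


def split_optimized(text: str) -> List[str]:
    return [s for s in _SPLIT_RE.split(text) if s]


def find_mismatches(
    original: Callable[[str], List[str]],
    optimized: Callable[[str], List[str]],
    inputs: Iterable[str],
) -> List[str]:
    """Return every input for which the two implementations disagree."""
    return [text for text in inputs if original(text) != optimized(text)]


inputs = ["", "Hello world", "Hi! How are you? I'm fine.", "Hello world.\n\nGoodbye."]
print(find_mismatches(split_original, split_optimized, inputs))  # prints []
```

An empty result only shows agreement on the inputs tried, so the strength of the check depends on how well those inputs cover the function's behavior.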

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 102 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import re

# imports
import pytest  # used for our unit tests
from guardrails.utils.tokenization_utils import split_sentences

# unit tests

# ------------------- BASIC TEST CASES -------------------

def test_single_sentence():
    # Single sentence, no punctuation at end
    codeflash_output = split_sentences("Hello world") # 123μs -> 87.3μs (41.9% faster)

def test_single_sentence_with_period():
    # Single sentence ending with a period
    codeflash_output = split_sentences("Hello world.") # 140μs -> 99.0μs (42.4% faster)

def test_two_sentences_period():
    # Two sentences separated by a period
    codeflash_output = split_sentences("Hello world. This is great.") # 169μs -> 118μs (43.6% faster)

def test_two_sentences_question():
    # Two sentences separated by a question mark
    codeflash_output = split_sentences("How are you? I'm fine.") # 169μs -> 119μs (41.8% faster)

def test_three_sentences_varied_punctuation():
    # Three sentences with different punctuation
    codeflash_output = split_sentences("Hi! How are you? I'm fine.") # 186μs -> 131μs (42.4% faster)

def test_sentences_with_exclamation():
    # Exclamation mark as sentence end
    codeflash_output = split_sentences("Wow! That's amazing!") # 165μs -> 118μs (39.8% faster)

def test_sentences_with_mixed_punctuation():
    # Mixed punctuation
    codeflash_output = split_sentences("Wait. What? No!") # 175μs -> 123μs (42.1% faster)

def test_sentences_with_newlines():
    # Sentences separated by newlines
    codeflash_output = split_sentences("Hello world.\nThis is great.") # 169μs -> 118μs (43.7% faster)

def test_sentences_with_multiple_spaces():
    # Sentences with extra spaces between them
    codeflash_output = split_sentences("Hello world.   This is great.") # 171μs -> 119μs (44.2% faster)

def test_sentence_with_abbreviation():
    # Sentence with abbreviation that shouldn't split
    codeflash_output = split_sentences("I met Dr. Smith. He is nice.") # 179μs -> 129μs (38.0% faster)

def test_sentence_with_multiple_abbreviations():
    # Multiple abbreviations in one sentence
    codeflash_output = split_sentences("Dr. Smith and Mr. Jones went to St. Louis.") # 189μs -> 142μs (33.3% faster)

def test_sentence_with_eg_abbreviation():
    # e.g. abbreviation should not split
    codeflash_output = split_sentences("Many fruits, e.g. apples, are healthy.") # 170μs -> 122μs (39.5% faster)

def test_sentence_with_ie_abbreviation():
    # i.e. abbreviation should not split
    codeflash_output = split_sentences("This is a test, i.e. an example.") # 168μs -> 122μs (37.7% faster)

def test_sentence_with_et_al_abbreviation():
    # et al. abbreviation should not split
    codeflash_output = split_sentences("Smith et al. found the results.") # 168μs -> 120μs (40.3% faster)

def test_sentence_with_period_in_parentheses():
    # Period inside parentheses should not split
    codeflash_output = split_sentences("This is a test (see Fig. 2). Next sentence.") # 193μs -> 135μs (43.0% faster)

def test_sentence_with_period_in_quotes():
    # Period inside quotes should not split
    codeflash_output = split_sentences('He said "Hello world." Then he left.') # 164μs -> 111μs (47.0% faster)

def test_sentence_with_period_in_brackets():
    # Period inside brackets should not split
    codeflash_output = split_sentences("This is [a test. Really]. Next sentence.") # 189μs -> 133μs (42.5% faster)

def test_sentence_with_list_abbreviation():
    # List abbreviation should not split
    codeflash_output = split_sentences("The items are listed in Fig. 3.") # 168μs -> 121μs (38.6% faster)

def test_sentence_with_multiple_abbreviations_and_sentence():
    # Multiple abbreviations and a sentence split
    codeflash_output = split_sentences("Dr. Smith, i.e. the director, arrived. He spoke.") # 198μs -> 144μs (36.9% faster)

# ------------------- EDGE TEST CASES -------------------

def test_empty_string():
    # Empty string should return empty list
    codeflash_output = split_sentences("") # 116μs -> 82.3μs (41.5% faster)

def test_only_punctuation():
    # Only punctuation should be a single sentence
    codeflash_output = split_sentences("!!!") # 131μs -> 91.1μs (44.1% faster)

def test_multiple_punctuation_marks():
    # Multiple punctuation marks at the end
    codeflash_output = split_sentences("Hello!!!") # 137μs -> 97.8μs (40.3% faster)

def test_multiple_sentence_endings():
    # Multiple sentence-ending punctuation between sentences
    codeflash_output = split_sentences("Hello!! How are you??") # 165μs -> 116μs (42.9% faster)

def test_sentence_with_abbreviation_at_end():
    # Abbreviation at end of sentence
    codeflash_output = split_sentences("He works at Acme Co.") # 144μs -> 105μs (37.3% faster)

def test_sentence_with_single_letter_abbreviation():
    # Single letter abbreviation
    codeflash_output = split_sentences("A. Smith went home.") # 154μs -> 107μs (43.1% faster)

def test_sentence_with_multiple_single_letter_abbreviations():
    # Multiple single letter abbreviations
    codeflash_output = split_sentences("A. B. Smith went home.") # 158μs -> 109μs (44.6% faster)

def test_sentence_with_period_and_no_space():
    # Period at end, no space after
    codeflash_output = split_sentences("Hello world.This is great.") # 153μs -> 106μs (44.3% faster)

def test_sentence_with_period_and_tab():
    # Period at end, tab after
    codeflash_output = split_sentences("Hello world.\tThis is great.") # 169μs -> 117μs (44.1% faster)

def test_sentence_with_period_and_newline():
    # Period at end, newline after
    codeflash_output = split_sentences("Hello world.\nThis is great.") # 168μs -> 117μs (43.3% faster)

def test_sentence_with_nested_parentheses():
    # Nested parentheses
    codeflash_output = split_sentences("This is a test (see Fig. 2 (details in Table 1)). Next sentence.") # 218μs -> 149μs (46.1% faster)

def test_sentence_with_nested_brackets():
    # Nested brackets
    codeflash_output = split_sentences("This is a test [see Fig. 2 [details in Table 1]]. Next sentence.") # 217μs -> 149μs (44.9% faster)

def test_sentence_with_nested_quotes():
    # Nested quotes
    codeflash_output = split_sentences('He said "She said \'Hello.\'". Then he left.') # 187μs -> 129μs (45.2% faster)

def test_sentence_with_conjunction_at_start():
    # Sentence starting with conjunction should not be split
    codeflash_output = split_sentences("He went home. And he slept.") # 169μs -> 119μs (42.3% faster)

def test_sentence_with_preposition_at_start():
    # Sentence starting with preposition should not be split
    codeflash_output = split_sentences("He went home. In the morning, he woke up.") # 181μs -> 124μs (45.6% faster)

def test_sentence_with_period_in_middle_of_word():
    # Period in the middle of a word should not split
    codeflash_output = split_sentences("The domain is example.com. Next sentence.") # 180μs -> 125μs (44.2% faster)

def test_sentence_with_multiple_abbreviations_and_punctuation():
    # Multiple abbreviations and punctuation
    codeflash_output = split_sentences("Dr. Smith, Ph.D., arrived at 10 a.m. He spoke.") # 197μs -> 140μs (40.6% faster)

def test_sentence_with_no_space_after_punctuation():
    # No space after punctuation
    codeflash_output = split_sentences("Hello world!How are you?I'm fine.") # 161μs -> 111μs (44.0% faster)

def test_sentence_with_multiple_newlines():
    # Multiple newlines between sentences
    codeflash_output = split_sentences("Hello world.\n\nThis is great.") # 170μs -> 119μs (43.1% faster)

def test_sentence_with_windows_line_endings():
    # Windows line endings
    codeflash_output = split_sentences("Hello world.\r\nThis is great.") # 168μs -> 118μs (42.7% faster)

def test_sentence_with_mismatched_brackets():
    # Mismatched brackets (should not crash)
    codeflash_output = split_sentences("This is a test [see Fig. 2. Next sentence.") # 211μs -> 153μs (37.8% faster)

def test_sentence_with_mismatched_parentheses():
    # Mismatched parentheses (should not crash)
    codeflash_output = split_sentences("This is a test (see Fig. 2. Next sentence.") # 211μs -> 154μs (36.7% faster)

def test_sentence_with_mismatched_quotes():
    # Mismatched quotes (should not crash)
    codeflash_output = split_sentences('He said "Hello world. Then he left.') # 179μs -> 123μs (44.6% faster)

def test_sentence_with_multiple_sentence_ending_punctuations():
    # Sentence ending with multiple punctuation marks
    codeflash_output = split_sentences("Hello world?! Next sentence.") # 169μs -> 118μs (42.9% faster)

def test_sentence_with_punctuation_inside_quotes():
    # Punctuation inside quotes should not split
    codeflash_output = split_sentences('He said "Wow! Amazing." Then he left.') # 172μs -> 120μs (43.2% faster)

def test_sentence_with_punctuation_inside_brackets():
    # Punctuation inside brackets should not split
    codeflash_output = split_sentences("This is [a test! Really]. Next sentence.") # 190μs -> 133μs (42.5% faster)

def test_sentence_with_punctuation_inside_parentheses():
    # Punctuation inside parentheses should not split
    codeflash_output = split_sentences("This is a test (Wow! Really). Next sentence.") # 193μs -> 134μs (43.5% faster)

def test_sentence_with_abbreviation_and_period():
    # Abbreviation followed by period
    codeflash_output = split_sentences("He is a Jr. He is young.") # 164μs -> 119μs (37.1% faster)

def test_sentence_with_abbreviation_and_no_space():
    # Abbreviation followed by no space
    codeflash_output = split_sentences("He is a Jr.He is young.") # 153μs -> 107μs (43.0% faster)

def test_sentence_with_abbreviation_and_newline():
    # Abbreviation followed by newline
    codeflash_output = split_sentences("He is a Jr.\nHe is young.") # 164μs -> 117μs (39.4% faster)

def test_sentence_with_multiple_abbreviations_and_no_space():
    # Multiple abbreviations, no space after
    codeflash_output = split_sentences("Dr.Smith went to St.Louis.") # 155μs -> 106μs (46.2% faster)

def test_sentence_with_multiple_abbreviations_and_newline():
    # Multiple abbreviations, newline after
    codeflash_output = split_sentences("Dr. Smith went to St. Louis.\nHe arrived.") # 194μs -> 142μs (36.6% faster)

def test_sentence_with_abbreviation_and_exclamation():
    # Abbreviation followed by exclamation
    codeflash_output = split_sentences("He is a Jr! He is young.") # 171μs -> 119μs (43.7% faster)

def test_sentence_with_abbreviation_and_question():
    # Abbreviation followed by question
    codeflash_output = split_sentences("He is a Jr? Is he young?") # 170μs -> 118μs (44.2% faster)

def test_sentence_with_abbreviation_and_multiple_punctuation():
    # Abbreviation followed by multiple punctuation
    codeflash_output = split_sentences("He is a Jr!! He is young.") # 171μs -> 120μs (42.5% faster)

# ------------------- LARGE SCALE TEST CASES -------------------

def test_large_text_many_sentences():
    # Large text with many sentences
    text = " ".join([f"Sentence {i}." for i in range(100)])
    expected = [f"Sentence {i}." for i in range(100)]
    codeflash_output = split_sentences(text) # 2.28ms -> 1.40ms (62.6% faster)

def test_large_text_with_abbreviations():
    # Large text with many abbreviations
    text = " ".join([f"Dr. Smith {i} went to St. Louis." for i in range(100)])
    expected = [f"Dr. Smith {i} went to St. Louis." for i in range(100)]
    codeflash_output = split_sentences(text) # 4.86ms -> 4.07ms (19.4% faster)

def test_large_text_with_nested_parentheses():
    # Large text with nested parentheses
    text = " ".join([f"Sentence {i} (see Fig. {i} (details in Table {i})). Next sentence {i}." for i in range(50)])
    expected = []
    for i in range(50):
        expected.append(f"Sentence {i} (see Fig. {i} (details in Table {i})).")
        expected.append(f"Next sentence {i}.")
    codeflash_output = split_sentences(text) # 4.15ms -> 2.51ms (65.1% faster)

def test_large_text_with_nested_brackets():
    # Large text with nested brackets
    text = " ".join([f"Sentence {i} [see Fig. {i} [details in Table {i}]]. Next sentence {i}." for i in range(50)])
    expected = []
    for i in range(50):
        expected.append(f"Sentence {i} [see Fig. {i} [details in Table {i}]].")
        expected.append(f"Next sentence {i}.")
    codeflash_output = split_sentences(text) # 4.18ms -> 2.55ms (63.9% faster)

def test_large_text_with_varied_punctuation():
    # Large text with varied punctuation
    text = " ".join([f"Sentence {i}! How are you? I'm fine." for i in range(30)])
    expected = []
    for i in range(30):
        expected.append(f"Sentence {i}!")
        expected.append("How are you?")
        expected.append("I'm fine.")
    codeflash_output = split_sentences(text) # 1.26ms -> 789μs (59.9% faster)

def test_large_text_with_newlines_and_tabs():
    # Large text with newlines and tabs
    text = "\n".join([f"Sentence {i}.\tNext sentence {i}." for i in range(50)])
    expected = []
    for i in range(50):
        expected.append(f"Sentence {i}.")
        expected.append(f"Next sentence {i}.")
    codeflash_output = split_sentences(text) # 2.47ms -> 1.54ms (60.1% faster)

def test_large_text_with_abbreviations_and_punctuation():
    # Large text with abbreviations and punctuation
    text = " ".join([f"Dr. Smith, Ph.D., arrived at 10 a.m. He spoke." for i in range(30)])
    expected = []
    for i in range(30):
        expected.append("Dr. Smith, Ph.D., arrived at 10 a.m.")
        expected.append("He spoke.")
    codeflash_output = split_sentences(text) # 2.14ms -> 1.48ms (44.8% faster)

def test_large_text_with_mixed_abbreviations_and_sentences():
    # Large text with mixed abbreviations and sentences
    text = " ".join([f"Dr. Smith, i.e. the director, arrived. He spoke." for i in range(30)])
    expected = []
    for i in range(30):
        expected.append("Dr. Smith, i.e. the director, arrived.")
        expected.append("He spoke.")
    codeflash_output = split_sentences(text) # 2.21ms -> 1.71ms (28.8% faster)

def test_large_text_with_no_sentence_endings():
    # Large text with no sentence-ending punctuation
    text = " ".join([f"Sentence {i}" for i in range(100)])
    expected = [" ".join([f"Sentence {i}" for i in range(100)])]
    codeflash_output = split_sentences(text) # 863μs -> 457μs (88.6% faster)

def test_large_text_with_only_punctuation():
    # Large text with only punctuation
    text = "." * 100
    expected = ["." * 100]
    codeflash_output = split_sentences(text) # 223μs -> 123μs (81.4% faster)

def test_large_text_with_edge_cases():
    # Large text with edge cases (abbreviations, parentheses, brackets, quotes, newlines)
    text = " ".join([
        f'Dr. Smith (see Fig. {i} [details in Table {i}]) said "Hello world." Next sentence {i}.' for i in range(30)
    ])
    expected = []
    for i in range(30):
        expected.append(f'Dr. Smith (see Fig. {i} [details in Table {i}]) said "Hello world."')
        expected.append(f"Next sentence {i}.")
    codeflash_output = split_sentences(text) # 2.34ms -> 1.39ms (68.3% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import re

# imports
import pytest  # used for our unit tests
from guardrails.utils.tokenization_utils import split_sentences

# unit tests

# --------------------------
# Basic Test Cases
# --------------------------

def test_single_sentence():
    # Basic: One sentence, no punctuation inside
    codeflash_output = split_sentences("This is a sentence.") # 150μs -> 106μs (40.8% faster)

def test_two_sentences_period():
    # Basic: Two sentences split by period
    codeflash_output = split_sentences("Hello world. Goodbye world.") # 166μs -> 117μs (42.1% faster)

def test_two_sentences_exclamation():
    # Basic: Two sentences split by exclamation
    codeflash_output = split_sentences("Wow! Amazing.") # 158μs -> 111μs (42.0% faster)

def test_two_sentences_question():
    # Basic: Two sentences split by question mark
    codeflash_output = split_sentences("Is this working? Yes.") # 164μs -> 114μs (44.6% faster)

def test_multiple_sentences_mixed_punctuation():
    # Basic: Multiple sentences with mixed punctuation
    text = "Hello! How are you? I'm fine. Thanks."
    expected = ["Hello!", "How are you?", "I'm fine.", "Thanks."]
    codeflash_output = split_sentences(text) # 210μs -> 146μs (44.0% faster)

def test_sentence_with_abbreviation():
    # Basic: Sentence with abbreviation (should not split after 'e.g.')
    text = "This is an example, e.g. a test. Next sentence."
    expected = ["This is an example, e.g. a test.", "Next sentence."]
    codeflash_output = split_sentences(text) # 191μs -> 138μs (37.9% faster)

def test_sentence_with_title_abbreviation():
    # Basic: Sentence with title abbreviation (should not split after 'Dr.')
    text = "Dr. Smith went home. He was tired."
    expected = ["Dr. Smith went home.", "He was tired."]
    codeflash_output = split_sentences(text) # 179μs -> 130μs (38.1% faster)

def test_sentence_with_period_inside_parentheses():
    # Basic: Period inside parentheses should not split
    text = "This is a test (e.g. with an example). It works."
    expected = ["This is a test (e.g. with an example).", "It works."]
    codeflash_output = split_sentences(text) # 200μs -> 139μs (43.4% faster)

def test_sentence_with_quotes():
    # Basic: Sentence with quotes containing period
    text = 'He said, "This is great." Then he left.'
    expected = ['He said, "This is great."', "Then he left."]
    codeflash_output = split_sentences(text) # 166μs -> 114μs (45.7% faster)

# --------------------------
# Edge Test Cases
# --------------------------

def test_empty_string():
    # Edge: Empty string input
    codeflash_output = split_sentences("") # 116μs -> 81.3μs (43.1% faster)

def test_only_punctuation():
    # Edge: Only punctuation as input
    codeflash_output = split_sentences("!") # 126μs -> 88.9μs (41.8% faster)
    codeflash_output = split_sentences(".") # 103μs -> 65.5μs (58.4% faster)
    codeflash_output = split_sentences("?") # 100μs -> 62.6μs (60.5% faster)

def test_no_sentence_endings():
    # Edge: Input with no sentence-ending punctuation
    codeflash_output = split_sentences("This is a test without punctuation") # 136μs -> 95.2μs (43.3% faster)

def test_multiple_spaces_between_sentences():
    # Edge: Multiple spaces between sentences
    text = "Hello.   World!   How are you?"
    expected = ["Hello.", "World!", "How are you?"]
    codeflash_output = split_sentences(text) # 187μs -> 129μs (45.1% faster)

def test_multiple_punctuation_marks():
    # Edge: Multiple punctuation marks at sentence end
    text = "What?! No way!! Really..."
    expected = ["What?!", "No way!!", "Really..."]
    codeflash_output = split_sentences(text) # 186μs -> 128μs (44.8% faster)

def test_sentence_with_nested_parentheses():
    # Edge: Nested parentheses should not split
    text = "This is a test (with (nested) parentheses). Next sentence."
    expected = ["This is a test (with (nested) parentheses).", "Next sentence."]
    codeflash_output = split_sentences(text) # 197μs -> 134μs (46.7% faster)

def test_sentence_with_nested_brackets():
    # Edge: Nested brackets should not split
    text = "Check this [with [nested] brackets]. Next."
    expected = ["Check this [with [nested] brackets].", "Next."]
    codeflash_output = split_sentences(text) # 186μs -> 127μs (46.3% faster)

def test_sentence_with_single_letter_abbreviation():
    # Edge: Single letter abbreviation (should not split after 'A.')
    text = "A. Smith went home. B. Jones stayed."
    expected = ["A. Smith went home.", "B. Jones stayed."]
    codeflash_output = split_sentences(text) # 184μs -> 126μs (46.3% faster)

def test_sentence_with_period_in_middle():
    # Edge: Period in middle of sentence (should not split)
    text = "This is a test. However, e.g. this example is valid."
    expected = ["This is a test.", "However, e.g. this example is valid."]
    codeflash_output = split_sentences(text) # 195μs -> 140μs (39.6% faster)

def test_sentence_with_line_breaks():
    # Edge: Sentences separated by line breaks
    text = "First sentence.\nSecond sentence!  Third sentence?"
    expected = ["First sentence.", "Second sentence!", "Third sentence?"]
    codeflash_output = split_sentences(text) # 197μs -> 133μs (48.6% faster)

def test_sentence_with_separator_in_text():
    # Edge: Text containing separator string should not break incorrectly
    sep = "abcdsentenceseperatordcba"
    text = f"This sentence mentions {sep}. Next sentence."
    expected = [f"This sentence mentions {sep}.", "Next sentence."]
    codeflash_output = split_sentences(text) # 177μs -> 122μs (44.7% faster)

def test_sentence_with_coordinating_conjunction():
    # Edge: No break before 'and'
    text = "He left. And he never returned."
    expected = ["He left.", "And he never returned."]
    codeflash_output = split_sentences(text) # 172μs -> 119μs (44.3% faster)

def test_sentence_with_preposition():
    # Edge: No break before 'in'
    text = "He left. In the morning, he returned."
    expected = ["He left.", "In the morning, he returned."]

    codeflash_output = split_sentences(text) # 176μs -> 122μs (44.8% faster)

def test_sentence_with_multiple_abbreviations():
    # Edge: Multiple abbreviations in one sentence
    text = "Dr. Smith, Ph.D., went home. Next sentence."
    expected = ["Dr. Smith, Ph.D., went home.", "Next sentence."]
    codeflash_output = split_sentences(text) # 190μs -> 135μs (41.0% faster)

def test_sentence_with_period_at_end_and_space():
    # Edge: Sentence ending with period and trailing spaces
    text = "Hello world.   "
    expected = ["Hello world."]
    codeflash_output = split_sentences(text) # 144μs -> 102μs (41.5% faster)

def test_sentence_with_period_and_newline():
    # Edge: Sentence ending with period and newline
    text = "Hello world.\n"
    expected = ["Hello world."]
    codeflash_output = split_sentences(text) # 141μs -> 99.3μs (43.0% faster)

def test_sentence_with_multiple_newlines():
    # Edge: Multiple newlines between sentences
    text = "Hello world.\n\nGoodbye world."
    expected = ["Hello world.", "Goodbye world."]
    codeflash_output = split_sentences(text) # 169μs -> 117μs (44.3% faster)

def test_sentence_with_abbreviation_and_newline():
    # Edge: Abbreviation at end of line should not split
    text = "This is e.g.\na test. Next sentence."
    expected = ["This is e.g. a test.", "Next sentence."]
    codeflash_output = split_sentences(text) # 183μs -> 132μs (38.5% faster)

def test_sentence_with_quote_and_period_inside():
    # Edge: Quoted sentence with period inside quotes
    text = 'He said, "Wait." Then he left.'
    expected = ['He said, "Wait."', "Then he left."]
    codeflash_output = split_sentences(text) # 160μs -> 110μs (45.6% faster)

def test_sentence_with_single_quotes():
    # Edge: Sentence with single quotes
    text = "She said, 'Hello.' Goodbye."
    expected = ["She said, 'Hello.'", "Goodbye."]
    codeflash_output = split_sentences(text) # 156μs -> 107μs (45.9% faster)

def test_sentence_with_multiple_abbreviation_types():
    # Edge: Multiple types of abbreviations
    text = "Mr. Smith and Mrs. Jones went to St. Louis. They met Dr. Brown."
    expected = ["Mr. Smith and Mrs. Jones went to St. Louis.", "They met Dr. Brown."]
    codeflash_output = split_sentences(text) # 227μs -> 173μs (31.1% faster)

def test_sentence_with_multiple_periods_in_abbreviation():
    # Edge: Abbreviation with multiple periods
    text = "E.g. this is an example. I.e. another one."
    expected = ["E.g. this is an example.", "I.e. another one."]
    codeflash_output = split_sentences(text) # 194μs -> 144μs (34.2% faster)

# --------------------------
# Large Scale Test Cases
# --------------------------

def test_large_text_many_sentences():
    # Large: 100 sentences separated by periods
    text = " ".join([f"Sentence {i}." for i in range(100)])
    expected = [f"Sentence {i}." for i in range(100)]
    codeflash_output = split_sentences(text) # 2.27ms -> 1.40ms (62.0% faster)

def test_large_text_mixed_punctuation():
    # Large: 100 sentences with mixed punctuation
    punct = [".", "!", "?"]
    text = " ".join([f"Sentence {i}{punct[i%3]}" for i in range(100)])
    expected = [f"Sentence {i}{punct[i%3]}" for i in range(100)]
    codeflash_output = split_sentences(text) # 2.28ms -> 1.39ms (64.1% faster)

def test_large_text_with_abbreviations():
    # Large: Sentences with abbreviations scattered
    text = " ".join([f"Dr. Smith did e.g. test {i}. Next sentence {i}." for i in range(50)])
    expected = [f"Dr. Smith did e.g. test {i}. Next sentence {i}." for i in range(50)]
    codeflash_output = split_sentences(text) # 3.48ms -> 2.75ms (26.9% faster)

def test_large_text_with_nested_parentheses_and_brackets():
    # Large: Sentences with nested parentheses/brackets
    text = " ".join([f"Test {i} (with [nested {i}]) end. Next {i}." for i in range(50)])
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-split_sentences-mh1ox4ck and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 22, 2025 07:47
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 22, 2025