@codeflash-ai codeflash-ai bot commented Oct 26, 2025

⚡️ This pull request contains optimizations for PR #857

If you approve this dependent PR, these changes will be merged into the original PR branch feat/hypothesis-tests.

This PR will be automatically closed if the original PR is merged.


📄 32% (0.32x) speedup for _compare_hypothesis_tests_semantic in codeflash/verification/equivalence.py

⏱️ Runtime: 4.67 milliseconds → 3.53 milliseconds (best of 284 runs)

📝 Explanation and details

The optimized code achieves a 32% speedup by eliminating redundant data structures and reducing iteration overhead through two key optimizations:

1. Single-pass aggregation instead of list accumulation:

  • Original: Uses defaultdict(list) to collect all FunctionTestInvocation objects per test function, then later iterates through these lists to compute failure flags with any(not ex.did_pass for ex in orig_examples)
  • Optimized: Uses plain dicts with 2-element lists [count, had_failure] to track both example count and failure status in a single pass, eliminating the need to store individual test objects or re-scan them

2. Reduced memory allocation and access patterns:

  • Original: Creates and stores complete lists of test objects (up to 9,458 objects in large test cases), then performs expensive any() operations over these lists
  • Optimized: Uses compact 2-item lists per test function, avoiding object accumulation and expensive linear scans
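The two aggregation strategies described above can be sketched side by side. This is a minimal illustration, not the code from `codeflash/verification/equivalence.py`: the `Invocation` dataclass, its `key` field, and both function names are hypothetical stand-ins chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Invocation:
    """Hypothetical stand-in for FunctionTestInvocation: a group key plus a pass flag."""
    key: tuple
    did_pass: bool


def summarize_with_lists(results):
    """Original-style: store every invocation per key, then re-scan each list."""
    groups = defaultdict(list)
    for r in results:
        groups[r.key].append(r)
    return {
        key: (len(examples), any(not ex.did_pass for ex in examples))
        for key, examples in groups.items()
    }


def summarize_single_pass(results):
    """Optimized-style: keep only [count, had_failure] per key, updated in place."""
    groups = {}
    for r in results:
        entry = groups.setdefault(r.key, [0, False])
        entry[0] += 1
        if not r.did_pass:
            entry[1] = True
    return {key: (count, had_failure) for key, (count, had_failure) in groups.items()}


results = [
    Invocation(("mod", "Cls", "test_a", "func"), True),
    Invocation(("mod", "Cls", "test_a", "func"), False),
    Invocation(("mod", "Cls", "test_b", "func"), True),
]
print(summarize_single_pass(results))
```

Both functions produce the same per-key summary; the second never holds more than two values per key, which is the memory saving the explanation refers to.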

The line profiler shows the key performance gains:

  • Lines with any(not ex.did_pass...) in original (10.1% and 10.2% of total time) are completely eliminated
  • The setdefault() operations replace the more expensive defaultdict(list).append() calls
  • Overall reduction from storing ~9,458 objects to just tracking summary statistics
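A rough way to reproduce this kind of comparison locally is a small `timeit` harness. The sketch below is an assumption-laden approximation: it times two simplified aggregation loops over `(key, did_pass)` pairs rather than the real function, and `build_workload`, `list_accumulation`, `single_pass`, and `best_of` are hypothetical names.

```python
import timeit
from collections import defaultdict


def build_workload(num_groups, examples_per_group):
    """Return (key, did_pass) pairs; every 10th group gets one failing example."""
    return [
        ((f"test_func{g}",), not (g % 10 == 0 and e == 0))
        for g in range(num_groups)
        for e in range(examples_per_group)
    ]


def list_accumulation(pairs):
    """Store every flag per key, then scan each list with any()."""
    groups = defaultdict(list)
    for key, did_pass in pairs:
        groups[key].append(did_pass)
    return {k: any(not p for p in v) for k, v in groups.items()}


def single_pass(pairs):
    """Track [count, had_failure] per key while iterating once."""
    groups = {}
    for key, did_pass in pairs:
        entry = groups.setdefault(key, [0, False])
        entry[0] += 1
        if not did_pass:
            entry[1] = True
    return {k: v[1] for k, v in groups.items()}


def best_of(fn, pairs, repeat=3, number=10):
    """Best average time per call, in seconds."""
    return min(timeit.repeat(lambda: fn(pairs), repeat=repeat, number=number)) / number


for label, pairs in [
    ("200 groups x 1 example", build_workload(200, 1)),
    ("1 group x 500 examples", build_workload(1, 500)),
]:
    t_list = best_of(list_accumulation, pairs)
    t_single = best_of(single_pass, pairs)
    print(f"{label}: list={t_list * 1e6:.1f}us single={t_single * 1e6:.1f}us")
```

Timings depend on the machine and interpreter, so no expected numbers are given. Note that `dict.setdefault(key, [0, False])` builds the default list on every call, even when the key already exists; this may be one reason the generated tests with hundreds of examples under a single test function show a slowdown, though that is a guess rather than something the profiler output here confirms.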

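Best performance gains occur in test cases with:

  • Many distinct test functions (up to 105% faster for test_large_scale_all_fail and 75% faster for test_large_scale_some_failures, which use 100 and 500 test functions with one example each)
  • Mixed pass/fail scenarios where the original's any() operations were most expensive

The gain does not come from large per-function example counts: the two generated tests that put hundreds of examples into a single test function (test_large_number_of_examples_same_outcome and test_large_number_of_examples_with_failure) ran 14.1% and 13.6% slower.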

The optimization maintains identical behavior while reducing the memory held per test function group from O(examples) to O(1). Total work is still linear in the number of examples, but the second scan over each group's stored examples is gone.
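The comparison rule that the generated tests describe in their docstrings can be sketched on top of the per-group summaries. This is a reading of those docstrings, not the actual implementation: `summarize` and `compare_semantic` are hypothetical names, inputs are simplified to `(key, did_pass)` pairs, and the boolean return value is an assumption.

```python
def summarize(pairs):
    """Collapse (key, did_pass) pairs into {key: [count, had_failure]}."""
    summary = {}
    for key, did_pass in pairs:
        entry = summary.setdefault(key, [0, False])
        entry[0] += 1
        entry[1] = entry[1] or not did_pass
    return summary


def compare_semantic(orig_pairs, cand_pairs):
    """Return True when every test function present on both sides agrees
    on whether it had at least one failing example."""
    orig = summarize(orig_pairs)
    cand = summarize(cand_pairs)
    # Test functions that appear on only one side are ignored.
    for key in orig.keys() & cand.keys():
        if orig[key][1] != cand[key][1]:
            return False
    # Example counts are tracked but deliberately not compared.
    return True


key = ("mod", "Cls", "test_func", "func")
print(compare_semantic([(key, True), (key, False)], [(key, False), (key, True)]))
print(compare_semantic([(key, False), (key, True)], [(key, True), (key, True)]))
```

Only the failure flag drives the verdict here, which matches the docstrings stating that differing example counts and test functions missing from one side should still pass.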

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 44 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 🔮 Hypothesis Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
from collections import defaultdict

# imports
import pytest
from codeflash.verification.equivalence import \
    _compare_hypothesis_tests_semantic


class DummyLogger:
    def debug(self, msg):
        pass  # No-op for testing

logger = DummyLogger()

class FunctionTestInvocation:
    """A minimal stub for the FunctionTestInvocation class."""
    class Id:
        def __init__(self, test_module_path, test_class_name, test_function_name, function_getting_tested):
            self.test_module_path = test_module_path
            self.test_class_name = test_class_name
            self.test_function_name = test_function_name
            self.function_getting_tested = function_getting_tested

    def __init__(self, test_module_path, test_class_name, test_function_name, function_getting_tested, did_pass):
        self.id = self.Id(test_module_path, test_class_name, test_function_name, function_getting_tested)
        self.did_pass = did_pass

# unit tests

# --- Basic Test Cases ---

def make_inv(module, cls, func, tested, did_pass):
    """Helper to create a FunctionTestInvocation instance."""
    return FunctionTestInvocation(module, cls, func, tested, did_pass)

def test_identical_pass_results():
    """Both original and candidate have the same passing test functions and examples."""
    orig = [
        make_inv('mod', 'Cls', 'test_func', 'func', True),
        make_inv('mod', 'Cls', 'test_func', 'func', True),
    ]
    cand = [
        make_inv('mod', 'Cls', 'test_func', 'func', True),
        make_inv('mod', 'Cls', 'test_func', 'func', True),
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 13.1μs -> 11.7μs (11.5% faster)

def test_identical_fail_results():
    """Both original and candidate have the same failing test functions and examples."""
    orig = [
        make_inv('mod', 'Cls', 'test_func', 'func', False),
        make_inv('mod', 'Cls', 'test_func', 'func', False),
    ]
    cand = [
        make_inv('mod', 'Cls', 'test_func', 'func', False),
        make_inv('mod', 'Cls', 'test_func', 'func', False),
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 12.3μs -> 10.8μs (14.3% faster)

def test_pass_and_fail_mixed():
    """Both original and candidate have mixed pass/fail, but match per test function."""
    orig = [
        make_inv('mod', 'Cls', 'test_func', 'func', True),
        make_inv('mod', 'Cls', 'test_func', 'func', False),
    ]
    cand = [
        make_inv('mod', 'Cls', 'test_func', 'func', False),
        make_inv('mod', 'Cls', 'test_func', 'func', True),
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 12.1μs -> 10.4μs (15.8% faster)

def test_different_example_counts():
    """Candidate has more/fewer examples but same pass/fail status per test function."""
    orig = [
        make_inv('mod', 'Cls', 'test_func', 'func', True),
    ]
    cand = [
        make_inv('mod', 'Cls', 'test_func', 'func', True),
        make_inv('mod', 'Cls', 'test_func', 'func', True),
        make_inv('mod', 'Cls', 'test_func', 'func', True),
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 11.4μs -> 10.2μs (11.7% faster)

def test_multiple_test_functions():
    """Multiple test functions, all passing in both original and candidate."""
    orig = [
        make_inv('mod', 'Cls', 'test_func1', 'func', True),
        make_inv('mod', 'Cls', 'test_func2', 'func', True),
    ]
    cand = [
        make_inv('mod', 'Cls', 'test_func1', 'func', True),
        make_inv('mod', 'Cls', 'test_func2', 'func', True),
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 12.7μs -> 10.7μs (19.3% faster)

# --- Edge Test Cases ---

def test_missing_test_function_in_candidate():
    """Test function exists in original but not in candidate. Should ignore and pass."""
    orig = [
        make_inv('mod', 'Cls', 'test_func1', 'func', True),
        make_inv('mod', 'Cls', 'test_func2', 'func', True),
    ]
    cand = [
        make_inv('mod', 'Cls', 'test_func1', 'func', True),
        # test_func2 missing
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 11.4μs -> 9.74μs (16.7% faster)

def test_missing_test_function_in_original():
    """Test function exists in candidate but not in original. Should ignore and pass."""
    orig = [
        make_inv('mod', 'Cls', 'test_func1', 'func', True),
    ]
    cand = [
        make_inv('mod', 'Cls', 'test_func1', 'func', True),
        make_inv('mod', 'Cls', 'test_func2', 'func', True),
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 11.0μs -> 9.65μs (14.0% faster)

def test_original_failed_candidate_passed():
    """Original has a failing example, candidate only passes. Should fail."""
    orig = [
        make_inv('mod', 'Cls', 'test_func', 'func', False),
        make_inv('mod', 'Cls', 'test_func', 'func', True),
    ]
    cand = [
        make_inv('mod', 'Cls', 'test_func', 'func', True),
        make_inv('mod', 'Cls', 'test_func', 'func', True),
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 16.1μs -> 14.7μs (9.66% faster)

def test_original_passed_candidate_failed():
    """Original passes, candidate has a failing example. Should fail."""
    orig = [
        make_inv('mod', 'Cls', 'test_func', 'func', True),
        make_inv('mod', 'Cls', 'test_func', 'func', True),
    ]
    cand = [
        make_inv('mod', 'Cls', 'test_func', 'func', False),
        make_inv('mod', 'Cls', 'test_func', 'func', True),
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 15.4μs -> 13.6μs (13.8% faster)

def test_all_empty_lists():
    """Both original and candidate are empty lists."""
    orig = []
    cand = []
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 7.22μs -> 6.66μs (8.33% faster)

def test_empty_candidate_nonempty_original():
    """Candidate is empty, original is not. Should pass, as missing candidate functions are ignored."""
    orig = [
        make_inv('mod', 'Cls', 'test_func', 'func', True),
    ]
    cand = []
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 8.56μs -> 8.39μs (2.04% faster)

def test_empty_original_nonempty_candidate():
    """Original is empty, candidate is not. Should pass, as missing original functions are ignored."""
    orig = []
    cand = [
        make_inv('mod', 'Cls', 'test_func', 'func', True),
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 8.12μs -> 8.07μs (0.669% faster)

def test_original_and_candidate_different_test_functions():
    """Original and candidate have completely different test functions. Should pass."""
    orig = [
        make_inv('mod', 'Cls', 'test_func1', 'func', True),
    ]
    cand = [
        make_inv('mod', 'Cls', 'test_func2', 'func', True),
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 9.03μs -> 8.94μs (0.939% faster)

def test_original_and_candidate_different_module_paths():
    """Test functions differ by module path; should be treated as different keys."""
    orig = [
        make_inv('mod1', 'Cls', 'test_func', 'func', True),
    ]
    cand = [
        make_inv('mod2', 'Cls', 'test_func', 'func', True),
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 9.07μs -> 8.84μs (2.58% faster)

def test_original_and_candidate_different_class_names():
    """Test functions differ by class name; should be treated as different keys."""
    orig = [
        make_inv('mod', 'Cls1', 'test_func', 'func', True),
    ]
    cand = [
        make_inv('mod', 'Cls2', 'test_func', 'func', True),
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 8.83μs -> 8.64μs (2.28% faster)

def test_original_and_candidate_different_function_getting_tested():
    """Test functions differ by function_getting_tested; should be treated as different keys."""
    orig = [
        make_inv('mod', 'Cls', 'test_func', 'func1', True),
    ]
    cand = [
        make_inv('mod', 'Cls', 'test_func', 'func2', True),
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 8.87μs -> 8.56μs (3.66% faster)

# --- Large Scale Test Cases ---

def test_large_scale_all_pass():
    """Large number of test functions and examples, all passing."""
    orig = [make_inv('mod', 'Cls', f'test_func{i}', 'func', True) for i in range(200)]
    cand = [make_inv('mod', 'Cls', f'test_func{i}', 'func', True) for i in range(200)]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 279μs -> 166μs (68.0% faster)

def test_large_scale_some_failures():
    """Large number of test functions, some with failures, candidate matches failures."""
    orig = []
    cand = []
    for i in range(500):
        # Every 10th test function fails
        did_pass = not (i % 10 == 0)
        orig.append(make_inv('mod', 'Cls', f'test_func{i}', 'func', did_pass))
        cand.append(make_inv('mod', 'Cls', f'test_func{i}', 'func', did_pass))
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 687μs -> 393μs (75.0% faster)

def test_large_scale_mismatch_failures():
    """Large number of test functions, some with failures, candidate does not match failures."""
    orig = []
    cand = []
    for i in range(300):
        # Every 7th test function fails in original, but candidate always passes
        orig.append(make_inv('mod', 'Cls', f'test_func{i}', 'func', not (i % 7 == 0)))
        cand.append(make_inv('mod', 'Cls', f'test_func{i}', 'func', True))
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 213μs -> 203μs (4.76% faster)

def test_large_scale_different_example_counts():
    """Large number of test functions, candidate has more examples per function, but all pass."""
    orig = []
    cand = []
    for i in range(100):
        orig.extend([make_inv('mod', 'Cls', f'test_func{i}', 'func', True) for _ in range(2)])
        cand.extend([make_inv('mod', 'Cls', f'test_func{i}', 'func', True) for _ in range(5)])
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 259μs -> 222μs (16.7% faster)

def test_large_scale_missing_functions():
    """Large number of test functions, candidate missing some functions."""
    orig = [make_inv('mod', 'Cls', f'test_func{i}', 'func', True) for i in range(900)]
    cand = [make_inv('mod', 'Cls', f'test_func{i}', 'func', True) for i in range(0, 900, 2)]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 795μs -> 526μs (51.2% faster)

def test_large_scale_all_fail():
    """All test functions fail in both original and candidate."""
    orig = [make_inv('mod', 'Cls', f'test_func{i}', 'func', False) for i in range(100)]
    cand = [make_inv('mod', 'Cls', f'test_func{i}', 'func', False) for i in range(100)]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 162μs -> 79.4μs (105% faster)

def test_large_scale_original_failed_candidate_passed():
    """Large number of test functions, original has failures, candidate passes."""
    orig = [make_inv('mod', 'Cls', f'test_func{i}', 'func', False) for i in range(100)]
    cand = [make_inv('mod', 'Cls', f'test_func{i}', 'func', True) for i in range(100)]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 76.4μs -> 70.3μs (8.60% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import sys
import types
from collections import namedtuple

# imports
import pytest  # used for our unit tests
from codeflash.verification.equivalence import \
    _compare_hypothesis_tests_semantic


# Mocks for external dependencies and data types
class DummyLogger:
    def debug(self, msg):
        pass  # No-op for testing

# Minimal substitute for codeflash.models.models.FunctionTestInvocation
FunctionTestInvocationID = namedtuple(
    "FunctionTestInvocationID",
    ["test_module_path", "test_class_name", "test_function_name", "function_getting_tested"]
)
class FunctionTestInvocation:
    def __init__(self, id, did_pass, loop_index=None, iteration_id=None):
        self.id = id
        self.did_pass = did_pass
        self.loop_index = loop_index
        self.iteration_id = iteration_id
logger = DummyLogger()


# Helper to create a FunctionTestInvocation for test cases
def make_invocation(module, cls, func, tested_func, did_pass, loop_index=None, iteration_id=None):
    return FunctionTestInvocation(
        FunctionTestInvocationID(module, cls, func, tested_func),
        did_pass,
        loop_index,
        iteration_id
    )

# ------------------- UNIT TESTS -------------------

# BASIC TEST CASES

def test_all_tests_passed_same_functions():
    """All test functions pass in both original and candidate."""
    orig = [
        make_invocation("mod", "Cls", "test_func1", "funcA", True),
        make_invocation("mod", "Cls", "test_func2", "funcB", True)
    ]
    cand = [
        make_invocation("mod", "Cls", "test_func1", "funcA", True),
        make_invocation("mod", "Cls", "test_func2", "funcB", True)
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 15.0μs -> 12.3μs (21.9% faster)

def test_different_example_counts_same_outcome():
    """Test functions have different number of examples, but same pass/fail outcome."""
    orig = [
        make_invocation("mod", "Cls", "test_func", "funcA", True),
        make_invocation("mod", "Cls", "test_func", "funcA", True)
    ]
    cand = [
        make_invocation("mod", "Cls", "test_func", "funcA", True)
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 12.3μs -> 10.7μs (15.2% faster)

def test_failure_propagation():
    """If original has a failure, candidate must also have a failure for same test function."""
    orig = [
        make_invocation("mod", "Cls", "test_func", "funcA", True),
        make_invocation("mod", "Cls", "test_func", "funcA", False)
    ]
    cand = [
        make_invocation("mod", "Cls", "test_func", "funcA", False),
        make_invocation("mod", "Cls", "test_func", "funcA", False)
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 12.7μs -> 10.6μs (19.9% faster)

def test_failure_mismatch():
    """If original has a failure, candidate must also have a failure, else return False."""
    orig = [
        make_invocation("mod", "Cls", "test_func", "funcA", True),
        make_invocation("mod", "Cls", "test_func", "funcA", False)
    ]
    cand = [
        make_invocation("mod", "Cls", "test_func", "funcA", True),
        make_invocation("mod", "Cls", "test_func", "funcA", True)
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 16.7μs -> 14.8μs (12.7% faster)

def test_pass_mismatch():
    """If original passes, candidate must also pass, else return False."""
    orig = [
        make_invocation("mod", "Cls", "test_func", "funcA", True),
        make_invocation("mod", "Cls", "test_func", "funcA", True)
    ]
    cand = [
        make_invocation("mod", "Cls", "test_func", "funcA", False),
        make_invocation("mod", "Cls", "test_func", "funcA", False)
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 15.9μs -> 13.9μs (14.0% faster)

def test_multiple_test_functions():
    """Multiple test functions, all pass."""
    orig = [
        make_invocation("mod", "Cls", "test_func1", "funcA", True),
        make_invocation("mod", "Cls", "test_func2", "funcB", True)
    ]
    cand = [
        make_invocation("mod", "Cls", "test_func1", "funcA", True),
        make_invocation("mod", "Cls", "test_func2", "funcB", True)
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 12.8μs -> 10.4μs (22.6% faster)

# EDGE TEST CASES

def test_empty_lists():
    """Both original and candidate are empty."""
    codeflash_output = _compare_hypothesis_tests_semantic([], []) # 7.21μs -> 6.70μs (7.55% faster)

def test_original_empty_candidate_nonempty():
    """Original is empty, candidate is not."""
    cand = [
        make_invocation("mod", "Cls", "test_func", "funcA", True)
    ]
    codeflash_output = _compare_hypothesis_tests_semantic([], cand) # 8.85μs -> 8.31μs (6.50% faster)

def test_candidate_empty_original_nonempty():
    """Candidate is empty, original is not."""
    orig = [
        make_invocation("mod", "Cls", "test_func", "funcA", True)
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, []) # 8.71μs -> 8.62μs (1.08% faster)

def test_missing_test_function_in_candidate():
    """Test function present in original, missing in candidate."""
    orig = [
        make_invocation("mod", "Cls", "test_func1", "funcA", True),
        make_invocation("mod", "Cls", "test_func2", "funcB", True)
    ]
    cand = [
        make_invocation("mod", "Cls", "test_func1", "funcA", True)
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 11.7μs -> 9.77μs (19.7% faster)

def test_missing_test_function_in_original():
    """Test function present in candidate, missing in original."""
    orig = [
        make_invocation("mod", "Cls", "test_func1", "funcA", True)
    ]
    cand = [
        make_invocation("mod", "Cls", "test_func1", "funcA", True),
        make_invocation("mod", "Cls", "test_func2", "funcB", True)
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 11.3μs -> 9.62μs (17.1% faster)

def test_different_test_functions():
    """Original and candidate have completely different test functions."""
    orig = [
        make_invocation("mod", "Cls", "test_func1", "funcA", True)
    ]
    cand = [
        make_invocation("mod", "Cls", "test_func2", "funcB", True)
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 9.04μs -> 8.72μs (3.66% faster)

def test_loop_index_and_iteration_id_ignored():
    """loop_index and iteration_id should not affect grouping."""
    orig = [
        make_invocation("mod", "Cls", "test_func", "funcA", True, loop_index=1, iteration_id="a"),
        make_invocation("mod", "Cls", "test_func", "funcA", True, loop_index=2, iteration_id="b")
    ]
    cand = [
        make_invocation("mod", "Cls", "test_func", "funcA", True, loop_index=3, iteration_id="c")
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 11.2μs -> 9.32μs (19.7% faster)

def test_mixed_pass_fail_examples():
    """Test function with mixed pass/fail examples, candidate matches failure presence."""
    orig = [
        make_invocation("mod", "Cls", "test_func", "funcA", True),
        make_invocation("mod", "Cls", "test_func", "funcA", False),
        make_invocation("mod", "Cls", "test_func", "funcA", True)
    ]
    cand = [
        make_invocation("mod", "Cls", "test_func", "funcA", True),
        make_invocation("mod", "Cls", "test_func", "funcA", False)
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 12.4μs -> 10.3μs (20.4% faster)

def test_mixed_pass_fail_examples_mismatch():
    """Test function with mixed pass/fail examples, candidate does not match failure presence."""
    orig = [
        make_invocation("mod", "Cls", "test_func", "funcA", True),
        make_invocation("mod", "Cls", "test_func", "funcA", False)
    ]
    cand = [
        make_invocation("mod", "Cls", "test_func", "funcA", True),
        make_invocation("mod", "Cls", "test_func", "funcA", True)
    ]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 16.0μs -> 14.4μs (10.6% faster)

# LARGE SCALE TEST CASES

def test_large_number_of_examples_same_outcome():
    """Test function with many examples, all pass."""
    orig = [make_invocation("mod", "Cls", "test_func", "funcA", True) for _ in range(500)]
    cand = [make_invocation("mod", "Cls", "test_func", "funcA", True) for _ in range(800)]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 282μs -> 329μs (14.1% slower)

def test_large_number_of_examples_with_failure():
    """Test function with many examples, at least one fails."""
    orig = [make_invocation("mod", "Cls", "test_func", "funcA", True) for _ in range(499)]
    orig.append(make_invocation("mod", "Cls", "test_func", "funcA", False))
    cand = [make_invocation("mod", "Cls", "test_func", "funcA", True) for _ in range(700)]
    cand.append(make_invocation("mod", "Cls", "test_func", "funcA", False))
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 258μs -> 299μs (13.6% slower)

def test_large_number_of_test_functions():
    """Many distinct test functions, all pass."""
    orig = [make_invocation("mod", "Cls", f"test_func{i}", f"func{i}", True) for i in range(300)]
    cand = [make_invocation("mod", "Cls", f"test_func{i}", f"func{i}", True) for i in range(300)]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 425μs -> 246μs (72.4% faster)

def test_large_number_of_test_functions_with_some_failures():
    """Many distinct test functions, some have failures, candidate matches failure presence."""
    orig = []
    cand = []
    for i in range(300):
        did_fail = (i % 50 == 0)
        orig.append(make_invocation("mod", "Cls", f"test_func{i}", f"func{i}", True))
        cand.append(make_invocation("mod", "Cls", f"test_func{i}", f"func{i}", True))
        if did_fail:
            orig.append(make_invocation("mod", "Cls", f"test_func{i}", f"func{i}", False))
            cand.append(make_invocation("mod", "Cls", f"test_func{i}", f"func{i}", False))
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 422μs -> 248μs (70.0% faster)

def test_large_scale_failure_mismatch():
    """Many test functions, candidate fails where original passes."""
    orig = [make_invocation("mod", "Cls", f"test_func{i}", f"func{i}", True) for i in range(300)]
    cand = [make_invocation("mod", "Cls", f"test_func{i}", f"func{i}", False) for i in range(300)]
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 227μs -> 212μs (7.04% faster)

def test_large_scale_failure_missing_in_candidate():
    """Many test functions, some with failures in original, candidate missing those failures."""
    orig = []
    cand = []
    for i in range(300):
        orig.append(make_invocation("mod", "Cls", f"test_func{i}", f"func{i}", True))
        cand.append(make_invocation("mod", "Cls", f"test_func{i}", f"func{i}", True))
        if i % 50 == 0:
            orig.append(make_invocation("mod", "Cls", f"test_func{i}", f"func{i}", False))
    codeflash_output = _compare_hypothesis_tests_semantic(orig, cand) # 226μs -> 212μs (6.61% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-pr857-2025-10-26T20.37.41 and push.

@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Oct 26, 2025
@KRRT7 KRRT7 merged commit a6e8cdd into feat/hypothesis-tests Oct 26, 2025
20 of 23 checks passed
@codeflash-ai codeflash-ai bot deleted the codeflash/optimize-pr857-2025-10-26T20.37.41 branch October 26, 2025 22:50