@codeflash-ai codeflash-ai bot commented Oct 22, 2025

📄 31% (0.31x) speedup for extract_internal_format in guardrails/schema/rail_schema.py

⏱️ Runtime: 926 microseconds → 705 microseconds (best of 106 runs)

📝 Explanation and details

The optimized code achieves a 31% speedup through three string-processing and lookup optimizations:

1. Replaced split() with partition()

  • Original: format.split("; ") and internal.split(": ") scan the whole string and build a list with one entry per segment
  • Optimized: format.partition("; ") and internal.partition(": ") stop at the first delimiter and always return a 3-tuple (head, separator, tail)
  • This is faster when only the first occurrence of a delimiter matters, and the cost no longer grows with the number of segments

2. Eliminated redundant string operations

  • Original: "; ".join(custom_rest) reconstructs the custom format string from the list produced by split()
  • Optimized: Assigns the tail returned by partition() directly, so no join is needed

3. Avoided redundant dictionary lookups

  • Original: Calls RailTypes.get(internal_type) twice (once for the check, once for assignment)
  • Optimized: Stores the result in rail_type variable and reuses it
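The three changes listed above can be sketched side by side. This is a minimal illustration and not the guardrails source: `KNOWN_TYPES`, `parse_with_split`, and `parse_with_partition` are hypothetical names, the registry contents are made up, and the functions return a plain tuple rather than a `Format` object.

```python
from typing import Optional, Tuple

# Hypothetical stand-in for the RailTypes registry (assumption, not guardrails code).
KNOWN_TYPES = {"json": "JSON_TYPE", "date": "DATE_TYPE"}

# (internal_type, internal_format_attr, custom_format)
Parsed = Tuple[Optional[str], str, str]


def parse_with_split(fmt: str) -> Parsed:
    """Split on every delimiter, then join the tails back together."""
    internal, *custom_rest = fmt.split("; ")
    internal_type, *attr_rest = internal.split(": ")
    if not KNOWN_TYPES.get(internal_type):  # first lookup
        return (None, "", fmt)
    return (
        KNOWN_TYPES.get(internal_type),  # second lookup of the same key
        ": ".join(attr_rest),
        "; ".join(custom_rest),
    )


def parse_with_partition(fmt: str) -> Parsed:
    """Stop at the first delimiter; partition() returns (head, sep, tail)."""
    internal, _, custom_format = fmt.partition("; ")
    internal_type, _, format_attr = internal.partition(": ")
    rail_type = KNOWN_TYPES.get(internal_type)  # single lookup, reused below
    if not rail_type:
        return (None, "", fmt)
    return (rail_type, format_attr, custom_format)
```

Because joining the tail of `split()` with the same delimiter always reproduces the tail from `partition()`, the two functions should agree on every input, including strings with repeated `": "` or `"; "` delimiters.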

The gains are largest in the large-scale test cases: inputs with hundreds of "; "-separated segments run 226% to 404% faster, because the optimized code no longer builds and rejoins a list whose length grows with the number of segments. Short inputs, and long inputs with only a few segments, improve more modestly, by roughly 14% to 44% in the generated tests. The replay test improves by 14.6%.
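To see why the segment count matters, the sketch below rebuilds the input used by the large-scale generated test and compares how many pieces each approach produces. The `compare_timings` helper is a hypothetical addition for local experimentation; its results depend on the machine, so no timings are claimed here.

```python
import timeit

# Same shape as the large-scale generated test: one internal part plus 999 custom segments.
custom_formats = [f"key{i}=val{i}" for i in range(1, 1000)]
format_str = "json: pretty; " + "; ".join(custom_formats)

# split() allocates one substring per segment; partition() always returns a 3-tuple.
pieces_from_split = format_str.split("; ")
pieces_from_partition = format_str.partition("; ")


def compare_timings(number: int = 2000) -> tuple:
    """Return (split_seconds, partition_seconds) for extracting the custom format."""
    split_s = timeit.timeit(
        lambda: "; ".join(format_str.split("; ")[1:]), number=number
    )
    partition_s = timeit.timeit(
        lambda: format_str.partition("; ")[2], number=number
    )
    return split_s, partition_s
```

Here `len(pieces_from_split)` is 1000 while `len(pieces_from_partition)` is 3, and both approaches recover the same custom format string. Calling `compare_timings()` gives a rough local comparison of the two extraction paths.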

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 48 Passed |
| ⏪ Replay Tests | ✅ 255 Passed |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pytest
from guardrails.schema.rail_schema import extract_internal_format


# --- Function and dependencies to test ---
# Minimal mock for RailTypes and Format
class RailTypes:
    _types = {
        'json': 'JSON_TYPE',
        'xml': 'XML_TYPE',
        'csv': 'CSV_TYPE',
        'yaml': 'YAML_TYPE',
        'custom': 'CUSTOM_TYPE'
    }
    @classmethod
    def get(cls, key):
        return cls._types.get(key)

class Format:
    def __init__(self):
        self.internal_type = None
        self.internal_format_attr = ""
        self.custom_format = ""

# --- Unit Tests ---
# 1. Basic Test Cases

def test_basic_json_format():
    # Basic input with known type and attribute
    codeflash_output = extract_internal_format("json: pretty; indent=2"); fmt = codeflash_output # 9.31μs -> 7.85μs (18.6% faster)

def test_basic_xml_format():
    # Known type with multiple format attributes
    codeflash_output = extract_internal_format("xml: compact; encoding=UTF-8"); fmt = codeflash_output # 8.37μs -> 6.80μs (23.0% faster)

def test_basic_no_custom_format():
    # Known type, no custom format after semicolon
    codeflash_output = extract_internal_format("csv: header"); fmt = codeflash_output # 8.39μs -> 7.12μs (17.9% faster)

def test_basic_custom_type():
    # Known custom type with attribute and custom format
    codeflash_output = extract_internal_format("custom: foo; bar=baz"); fmt = codeflash_output # 8.41μs -> 7.17μs (17.3% faster)

def test_basic_yaml_format():
    # Known type with attribute, no custom format
    codeflash_output = extract_internal_format("yaml: block"); fmt = codeflash_output # 8.30μs -> 6.97μs (19.1% faster)

# 2. Edge Test Cases

def test_unknown_type():
    # Unknown type, should treat whole input as custom_format
    codeflash_output = extract_internal_format("unknown: something; extra=info"); fmt = codeflash_output # 8.02μs -> 7.03μs (14.0% faster)

def test_no_colon_in_type():
    # Input with no colon, should treat as unknown type
    codeflash_output = extract_internal_format("jsonpretty; indent=2"); fmt = codeflash_output # 8.31μs -> 6.96μs (19.3% faster)

def test_empty_string():
    # Empty input string
    codeflash_output = extract_internal_format(""); fmt = codeflash_output # 8.03μs -> 6.78μs (18.4% faster)

def test_only_type_no_attr():
    # Only known type, no attribute or custom format
    codeflash_output = extract_internal_format("json"); fmt = codeflash_output # 8.11μs -> 6.73μs (20.6% faster)

def test_multiple_colons_in_internal():
    # Internal part has multiple colons
    codeflash_output = extract_internal_format("json: pretty: compact; foo=bar"); fmt = codeflash_output # 8.47μs -> 7.04μs (20.3% faster)

def test_multiple_semicolons():
    # Multiple custom formats separated by semicolons
    codeflash_output = extract_internal_format("xml: strict; encoding=UTF-8; version=1.0; standalone=yes"); fmt = codeflash_output # 8.68μs -> 6.77μs (28.2% faster)

def test_trailing_semicolon():
    # Trailing semicolon in input
    codeflash_output = extract_internal_format("csv: header;"); fmt = codeflash_output # 8.47μs -> 7.21μs (17.5% faster)

def test_leading_semicolon():
    # Leading semicolon in input (should treat as unknown type)
    codeflash_output = extract_internal_format("; json: pretty"); fmt = codeflash_output # 8.36μs -> 7.07μs (18.2% faster)

def test_spaces_in_type_and_attr():
    # Type and attribute with extra spaces
    codeflash_output = extract_internal_format("json:   pretty   ;   indent = 2  "); fmt = codeflash_output # 8.66μs -> 7.17μs (20.7% faster)

def test_semicolon_in_attr():
    # Semicolon without a trailing space inside the attribute; only "; " (semicolon plus space) acts as a delimiter
    codeflash_output = extract_internal_format("json: foo;bar; baz=qux"); fmt = codeflash_output # 8.23μs -> 6.98μs (18.0% faster)

# 3. Large Scale Test Cases

def test_large_number_of_custom_formats():
    # Large number of custom formats (up to 999)
    custom_formats = [f"key{i}=val{i}" for i in range(1, 1000)]
    format_str = "json: pretty; " + "; ".join(custom_formats)
    codeflash_output = extract_internal_format(format_str); fmt = codeflash_output # 38.9μs -> 7.92μs (391% faster)

def test_large_internal_format_attr():
    # Large internal format attribute string
    large_attr = "a" * 500
    codeflash_output = extract_internal_format(f"csv: {large_attr}; foo=bar"); fmt = codeflash_output # 8.79μs -> 7.42μs (18.3% faster)

def test_large_unknown_type():
    # Large input with unknown type, should treat whole input as custom_format
    unknown_type = "unknown"
    large_attr = "x" * 500
    custom_formats = [f"key{i}=val{i}" for i in range(1, 500)]
    format_str = f"{unknown_type}: {large_attr}; " + "; ".join(custom_formats)
    codeflash_output = extract_internal_format(format_str); fmt = codeflash_output # 25.8μs -> 7.89μs (226% faster)

def test_large_only_type():
    # Large string with only type, no colon
    large_type = "json" * 100
    codeflash_output = extract_internal_format(large_type); fmt = codeflash_output # 9.03μs -> 7.68μs (17.5% faster)

def test_large_multiple_colons_and_semicolons():
    # Large input with many colons and semicolons in internal part
    internal = "xml: " + ": ".join([f"attr{i}" for i in range(1, 50)])
    custom = "; ".join([f"custom{i}=val{i}" for i in range(1, 50)])
    format_str = f"{internal}; {custom}"
    codeflash_output = extract_internal_format(format_str); fmt = codeflash_output # 12.0μs -> 7.30μs (64.1% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest
from guardrails.schema.rail_schema import extract_internal_format


# Mock RailTypes and Format for testing purposes
class MockRailTypes:
    # Simulate a registry of types
    _types = {
        "int": "IntegerType",
        "float": "FloatType",
        "str": "StringType",
        "bool": "BooleanType",
        "date": "DateType",
    }

    @classmethod
    def get(cls, name):
        return cls._types.get(name, None)

class Format:
    def __init__(self):
        self.custom_format = ""
        self.internal_type = None
        self.internal_format_attr = ""

# -------------------- UNIT TESTS --------------------

# Basic Test Cases

def test_basic_int_type_with_attr_and_custom():
    # Standard case: type, format attribute, and custom format
    codeflash_output = extract_internal_format("int: 32bit; custom1; custom2"); fmt = codeflash_output # 9.20μs -> 7.46μs (23.2% faster)

def test_basic_float_type_with_attr():
    # Type with attribute, no custom format
    codeflash_output = extract_internal_format("float: scientific"); fmt = codeflash_output # 6.32μs -> 4.38μs (44.2% faster)

def test_basic_str_type_no_attr():
    # Type only, no attribute, no custom format
    codeflash_output = extract_internal_format("str"); fmt = codeflash_output # 8.43μs -> 6.75μs (24.8% faster)

def test_basic_bool_type_with_custom():
    # Type with custom format, no attribute
    codeflash_output = extract_internal_format("bool; custom_flag"); fmt = codeflash_output # 6.11μs -> 4.37μs (39.6% faster)

def test_basic_date_type_with_attr_and_custom():
    # Date type with attribute and custom format
    codeflash_output = extract_internal_format("date: YYYY-MM-DD; timezone: UTC"); fmt = codeflash_output # 6.11μs -> 4.39μs (39.4% faster)

# Edge Test Cases

def test_edge_unrecognized_type():
    # Type not in RailTypes registry
    codeflash_output = extract_internal_format("unknown: something; custom"); fmt = codeflash_output # 8.50μs -> 7.18μs (18.2% faster)

def test_edge_empty_string():
    # Empty string input
    codeflash_output = extract_internal_format(""); fmt = codeflash_output # 8.01μs -> 6.82μs (17.6% faster)

def test_edge_only_semicolon():
    # Input is just a semicolon
    codeflash_output = extract_internal_format(";"); fmt = codeflash_output # 8.05μs -> 6.66μs (20.8% faster)

def test_edge_multiple_colons_in_internal():
    # Multiple colons in internal part
    codeflash_output = extract_internal_format("int: 32bit: signed; custom1"); fmt = codeflash_output # 8.72μs -> 7.15μs (21.9% faster)

def test_edge_multiple_semicolons():
    # Multiple semicolons, some empty
    codeflash_output = extract_internal_format("str: utf8; ; custom1;; custom2"); fmt = codeflash_output # 8.87μs -> 7.31μs (21.4% faster)

def test_edge_no_type_just_custom():
    # No type, just custom format
    codeflash_output = extract_internal_format("custom_only"); fmt = codeflash_output # 8.15μs -> 6.96μs (17.0% faster)

def test_edge_type_with_colon_but_no_attr():
    # Type with colon but no attribute
    codeflash_output = extract_internal_format("float: ; custom"); fmt = codeflash_output # 6.25μs -> 4.42μs (41.6% faster)

def test_edge_internal_type_with_spaces():
    # Type name with leading/trailing spaces
    codeflash_output = extract_internal_format(" int : 64bit ; custom"); fmt = codeflash_output # 8.66μs -> 7.18μs (20.5% faster)

def test_edge_internal_type_case_sensitivity():
    # Type name with different case
    codeflash_output = extract_internal_format("Int: 32bit; custom"); fmt = codeflash_output # 8.41μs -> 7.00μs (20.1% faster)

def test_edge_custom_format_with_colons_and_semicolons():
    # Custom format contains colons and semicolons
    codeflash_output = extract_internal_format("str: ascii; custom: value; another: test; ;"); fmt = codeflash_output # 8.91μs -> 7.07μs (25.9% faster)

# Large Scale Test Cases

def test_large_scale_many_custom_formats():
    # Many custom formats (up to 999)
    custom_formats = [f"custom{i}" for i in range(1, 1000)]
    format_str = "int: 64bit; " + "; ".join(custom_formats)
    codeflash_output = extract_internal_format(format_str); fmt = codeflash_output # 38.8μs -> 7.79μs (398% faster)

def test_large_scale_long_internal_format_attr():
    # Very long internal format attribute
    long_attr = "x" * 500
    codeflash_output = extract_internal_format(f"str: {long_attr}; custom1"); fmt = codeflash_output # 9.04μs -> 7.32μs (23.6% faster)

def test_large_scale_large_input_string():
    # Large input string with many semicolons and colons
    internal_type = "float"
    internal_attr = "precision: high"
    custom_formats = [f"custom{i}: val{i}" for i in range(1, 500)]
    format_str = f"{internal_type}: {internal_attr}; " + "; ".join(custom_formats)
    codeflash_output = extract_internal_format(format_str); fmt = codeflash_output # 23.9μs -> 4.75μs (404% faster)

def test_large_scale_all_types():
    # Test all types in registry with custom formats
    for type_name, type_value in MockRailTypes._types.items():
        codeflash_output = extract_internal_format(f"{type_name}: attr; customA; customB"); fmt = codeflash_output # 18.2μs -> 13.9μs (31.1% faster)

def test_large_scale_no_custom_formats():
    # Large input with no custom formats, just type and attribute
    for type_name, type_value in MockRailTypes._types.items():
        codeflash_output = extract_internal_format(f"{type_name}: long_attr_name_with_lots_of_text"); fmt = codeflash_output # 16.8μs -> 13.5μs (25.0% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
⏪ Replay Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| test_pytest_testsunit_teststest_guard_log_py_testsintegration_teststest_guard_py_testsunit_testsvalidator__replay_test_0.py::test_guardrails_schema_rail_schema_extract_internal_format | 479μs | 418μs | 14.6%✅ |

To edit these changes, run `git checkout codeflash/optimize-extract_internal_format-mh1t26ud`, make your edits, and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 22, 2025 09:43
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 22, 2025