⚡️ Speed up function `extract_internal_format` by 31% #62

codeflash-ai · 2025-10-22T09:43:12Z

📄 31% (0.31x) speedup for `extract_internal_format` in `guardrails/schema/rail_schema.py`

⏱️ Runtime : 926 microseconds → 705 microseconds (best of 106 runs)
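The headline percentage appears to be the time saved relative to the optimized runtime. That reading reproduces both the 31% figure and the per-test figures in the generated tests. A minimal sketch of the arithmetic (the helper name `speedup_percent` is hypothetical and not part of Codeflash):

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Time saved, expressed as a percentage of the optimized runtime."""
    return (original - optimized) / optimized * 100


# Overall runtime reported for this PR, in microseconds.
overall = speedup_percent(926, 705)

# One of the per-test timings (9.31 microseconds -> 7.85 microseconds).
single_test = speedup_percent(9.31, 7.85)

print(f"overall: {overall:.1f}%, single test: {single_test:.1f}%")
```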

📝 Explanation and details

The optimized code achieves a 31% speedup through three string-processing and lookup optimizations:

1. Replaced split() with partition()

- Original: `format.split("; ")` and `internal.split(": ")` scan the whole string and build a list with one entry per segment
- Optimized: `format.partition("; ")` and `internal.partition(": ")` stop at the first delimiter and always return a 3-tuple (head, separator, tail)
- Only the first field is needed, so this avoids allocating a string for every later segment
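The difference between the two approaches can be sketched in isolation. The helper names below are hypothetical and only illustrate the pattern; they are not taken from `rail_schema.py`:

```python
def head_and_rest_split(text: str, sep: str = "; ") -> tuple[str, str]:
    # split() finds every separator and builds a full list,
    # which then has to be joined back together.
    head, *rest = text.split(sep)
    return head, sep.join(rest)


def head_and_rest_partition(text: str, sep: str = "; ") -> tuple[str, str]:
    # partition() stops at the first separator; the tail is returned as-is.
    head, _, tail = text.partition(sep)
    return head, tail


sample = "xml: strict; encoding=UTF-8; version=1.0"
print(head_and_rest_split(sample))
print(head_and_rest_partition(sample))
```

Both helpers return the same pair for any input, because joining the split segments with the same separator restores the original tail.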

2. Eliminated redundant string operations

- Original: `"; ".join(custom_rest)` rebuilds the custom format string from the list produced by `split()`
- Optimized: assigns the tail returned by `partition()` directly, so no join is needed

3. Avoided redundant dictionary lookups

- Original: Calls `RailTypes.get(internal_type)` twice (once for the check, once for the assignment)
- Optimized: Stores the result in a `rail_type` variable and reuses it
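Taken together, the three changes might look like the sketch below. It is reconstructed from the description above rather than copied from `guardrails/schema/rail_schema.py`; `lookup_type`, `_KNOWN_TYPES`, and this `Format` dataclass are hypothetical stand-ins for the real `RailTypes.get` and `Format`.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for the real RailTypes registry.
_KNOWN_TYPES = {"string": "STRING", "date": "DATE"}


def lookup_type(name: str) -> Optional[str]:
    return _KNOWN_TYPES.get(name)


@dataclass
class Format:
    internal_type: Optional[str] = None
    internal_format_attr: str = ""
    custom_format: str = ""


def extract_with_split(format_str: str) -> Format:
    """Shape of the original approach: split, join, and two lookups."""
    fmt = Format()
    internal, *custom_rest = format_str.split("; ")
    fmt.custom_format = "; ".join(custom_rest)
    internal_type, *attr_rest = internal.split(": ")
    if not lookup_type(internal_type):  # first lookup
        fmt.custom_format = format_str  # unknown type: keep the whole input
        return fmt
    fmt.internal_type = lookup_type(internal_type)  # second lookup
    fmt.internal_format_attr = ": ".join(attr_rest)
    return fmt


def extract_with_partition(format_str: str) -> Format:
    """Shape of the optimized approach: partition and one cached lookup."""
    fmt = Format()
    internal, _, custom = format_str.partition("; ")
    internal_type, _, attr = internal.partition(": ")
    rail_type = lookup_type(internal_type)  # single lookup, reused below
    if not rail_type:
        fmt.custom_format = format_str
        return fmt
    fmt.internal_type = rail_type
    fmt.internal_format_attr = attr
    fmt.custom_format = custom
    return fmt


print(extract_with_partition("date: %Y-%m-%d; my-validator"))
```

Under these assumptions both functions return equal `Format` objects, including for inputs with several colons or semicolons, which is the property the regression tests are meant to check against the real function.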

The gains are largest on inputs with many semicolon-separated segments, where the original code split and re-joined every segment: the generated tests with roughly 500 to 1,000 segments run 226% to 404% faster, and one with about 50 segments runs 64% faster. Simple inputs gain between 14% and 44% in the generated tests. Long strings that contain few delimiters gain about as much as simple inputs (17.5% to 23.6%), since there is little list building or joining to avoid. The replay test, which exercises the function through the existing test suite, improves by 14.6%.
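How the gap grows with the number of segments can be checked with a small `timeit` comparison. This is a rough sketch with hypothetical helper names, not the Codeflash harness; absolute timings depend on the machine and Python version.

```python
import timeit


def head_and_rest_split(text: str) -> tuple[str, str]:
    head, *rest = text.split("; ")
    return head, "; ".join(rest)


def head_and_rest_partition(text: str) -> tuple[str, str]:
    head, _, tail = text.partition("; ")
    return head, tail


def build_input(n_segments: int) -> str:
    # Same shape as the large-scale generated tests: one typed prefix
    # followed by n_segments custom key=value entries.
    extras = "; ".join(f"key{i}=val{i}" for i in range(1, n_segments + 1))
    return "json: pretty; " + extras


def compare(n_segments: int, repeats: int = 2000) -> tuple[float, float]:
    """Return total seconds for (split, partition) over `repeats` calls."""
    text = build_input(n_segments)
    t_split = timeit.timeit(lambda: head_and_rest_split(text), number=repeats)
    t_part = timeit.timeit(lambda: head_and_rest_partition(text), number=repeats)
    return t_split, t_part


if __name__ == "__main__":
    for n in (1, 10, 999):
        t_split, t_part = compare(n)
        print(f"{n:>4} segments: split={t_split:.4f}s partition={t_part:.4f}s")
```

The split-based helper does work proportional to the number of segments, while the partition-based one only searches up to the first delimiter and copies the tail once, which is consistent with the large-input figures reported above.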

✅ Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 48 Passed |
| ⏪ Replay Tests | ✅ 255 Passed |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime

import pytest
from guardrails.schema.rail_schema import extract_internal_format


# --- Function and dependencies to test ---
# Minimal mock for RailTypes and Format
class RailTypes:
    _types = {
        'json': 'JSON_TYPE',
        'xml': 'XML_TYPE',
        'csv': 'CSV_TYPE',
        'yaml': 'YAML_TYPE',
        'custom': 'CUSTOM_TYPE'
    }
    @classmethod
    def get(cls, key):
        return cls._types.get(key)

class Format:
    def __init__(self):
        self.internal_type = None
        self.internal_format_attr = ""
        self.custom_format = ""

# --- Unit Tests ---
# 1. Basic Test Cases

def test_basic_json_format():
    # Basic input with known type and attribute
    codeflash_output = extract_internal_format("json: pretty; indent=2"); fmt = codeflash_output # 9.31μs -> 7.85μs (18.6% faster)

def test_basic_xml_format():
    # Known type with multiple format attributes
    codeflash_output = extract_internal_format("xml: compact; encoding=UTF-8"); fmt = codeflash_output # 8.37μs -> 6.80μs (23.0% faster)

def test_basic_no_custom_format():
    # Known type, no custom format after semicolon
    codeflash_output = extract_internal_format("csv: header"); fmt = codeflash_output # 8.39μs -> 7.12μs (17.9% faster)

def test_basic_custom_type():
    # Known custom type with attribute and custom format
    codeflash_output = extract_internal_format("custom: foo; bar=baz"); fmt = codeflash_output # 8.41μs -> 7.17μs (17.3% faster)

def test_basic_yaml_format():
    # Known type with attribute, no custom format
    codeflash_output = extract_internal_format("yaml: block"); fmt = codeflash_output # 8.30μs -> 6.97μs (19.1% faster)

# 2. Edge Test Cases

def test_unknown_type():
    # Unknown type, should treat whole input as custom_format
    codeflash_output = extract_internal_format("unknown: something; extra=info"); fmt = codeflash_output # 8.02μs -> 7.03μs (14.0% faster)

def test_no_colon_in_type():
    # Input with no colon, should treat as unknown type
    codeflash_output = extract_internal_format("jsonpretty; indent=2"); fmt = codeflash_output # 8.31μs -> 6.96μs (19.3% faster)

def test_empty_string():
    # Empty input string
    codeflash_output = extract_internal_format(""); fmt = codeflash_output # 8.03μs -> 6.78μs (18.4% faster)

def test_only_type_no_attr():
    # Only known type, no attribute or custom format
    codeflash_output = extract_internal_format("json"); fmt = codeflash_output # 8.11μs -> 6.73μs (20.6% faster)

def test_multiple_colons_in_internal():
    # Internal part has multiple colons
    codeflash_output = extract_internal_format("json: pretty: compact; foo=bar"); fmt = codeflash_output # 8.47μs -> 7.04μs (20.3% faster)

def test_multiple_semicolons():
    # Multiple custom formats separated by semicolons
    codeflash_output = extract_internal_format("xml: strict; encoding=UTF-8; version=1.0; standalone=yes"); fmt = codeflash_output # 8.68μs -> 6.77μs (28.2% faster)

def test_trailing_semicolon():
    # Trailing semicolon in input
    codeflash_output = extract_internal_format("csv: header;"); fmt = codeflash_output # 8.47μs -> 7.21μs (17.5% faster)

def test_leading_semicolon():
    # Leading semicolon in input (should treat as unknown type)
    codeflash_output = extract_internal_format("; json: pretty"); fmt = codeflash_output # 8.36μs -> 7.07μs (18.2% faster)

def test_spaces_in_type_and_attr():
    # Type and attribute with extra spaces
    codeflash_output = extract_internal_format("json:   pretty   ;   indent = 2  "); fmt = codeflash_output # 8.66μs -> 7.17μs (20.7% faster)

def test_semicolon_in_attr():
    # Semicolon inside attribute value (should split only on first semicolon)
    codeflash_output = extract_internal_format("json: foo;bar; baz=qux"); fmt = codeflash_output # 8.23μs -> 6.98μs (18.0% faster)

# 3. Large Scale Test Cases

def test_large_number_of_custom_formats():
    # Large number of custom formats (up to 999)
    custom_formats = [f"key{i}=val{i}" for i in range(1, 1000)]
    format_str = "json: pretty; " + "; ".join(custom_formats)
    codeflash_output = extract_internal_format(format_str); fmt = codeflash_output # 38.9μs -> 7.92μs (391% faster)

def test_large_internal_format_attr():
    # Large internal format attribute string
    large_attr = "a" * 500
    codeflash_output = extract_internal_format(f"csv: {large_attr}; foo=bar"); fmt = codeflash_output # 8.79μs -> 7.42μs (18.3% faster)

def test_large_unknown_type():
    # Large input with unknown type, should treat whole input as custom_format
    unknown_type = "unknown"
    large_attr = "x" * 500
    custom_formats = [f"key{i}=val{i}" for i in range(1, 500)]
    format_str = f"{unknown_type}: {large_attr}; " + "; ".join(custom_formats)
    codeflash_output = extract_internal_format(format_str); fmt = codeflash_output # 25.8μs -> 7.89μs (226% faster)

def test_large_only_type():
    # Large string with only type, no colon
    large_type = "json" * 100
    codeflash_output = extract_internal_format(large_type); fmt = codeflash_output # 9.03μs -> 7.68μs (17.5% faster)

def test_large_multiple_colons_and_semicolons():
    # Large input with many colons and semicolons in internal part
    internal = "xml: " + ": ".join([f"attr{i}" for i in range(1, 50)])
    custom = "; ".join([f"custom{i}=val{i}" for i in range(1, 50)])
    format_str = f"{internal}; {custom}"
    codeflash_output = extract_internal_format(format_str); fmt = codeflash_output # 12.0μs -> 7.30μs (64.1% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest
from guardrails.schema.rail_schema import extract_internal_format


# Mock RailTypes and Format for testing purposes
class MockRailTypes:
    # Simulate a registry of types
    _types = {
        "int": "IntegerType",
        "float": "FloatType",
        "str": "StringType",
        "bool": "BooleanType",
        "date": "DateType",
    }

    @classmethod
    def get(cls, name):
        return cls._types.get(name, None)

class Format:
    def __init__(self):
        self.custom_format = ""
        self.internal_type = None
        self.internal_format_attr = ""

# -------------------- UNIT TESTS --------------------

# Basic Test Cases

def test_basic_int_type_with_attr_and_custom():
    # Standard case: type, format attribute, and custom format
    codeflash_output = extract_internal_format("int: 32bit; custom1; custom2"); fmt = codeflash_output # 9.20μs -> 7.46μs (23.2% faster)

def test_basic_float_type_with_attr():
    # Type with attribute, no custom format
    codeflash_output = extract_internal_format("float: scientific"); fmt = codeflash_output # 6.32μs -> 4.38μs (44.2% faster)

def test_basic_str_type_no_attr():
    # Type only, no attribute, no custom format
    codeflash_output = extract_internal_format("str"); fmt = codeflash_output # 8.43μs -> 6.75μs (24.8% faster)

def test_basic_bool_type_with_custom():
    # Type with custom format, no attribute
    codeflash_output = extract_internal_format("bool; custom_flag"); fmt = codeflash_output # 6.11μs -> 4.37μs (39.6% faster)

def test_basic_date_type_with_attr_and_custom():
    # Date type with attribute and custom format
    codeflash_output = extract_internal_format("date: YYYY-MM-DD; timezone: UTC"); fmt = codeflash_output # 6.11μs -> 4.39μs (39.4% faster)

# Edge Test Cases

def test_edge_unrecognized_type():
    # Type not in RailTypes registry
    codeflash_output = extract_internal_format("unknown: something; custom"); fmt = codeflash_output # 8.50μs -> 7.18μs (18.2% faster)

def test_edge_empty_string():
    # Empty string input
    codeflash_output = extract_internal_format(""); fmt = codeflash_output # 8.01μs -> 6.82μs (17.6% faster)

def test_edge_only_semicolon():
    # Input is just a semicolon
    codeflash_output = extract_internal_format(";"); fmt = codeflash_output # 8.05μs -> 6.66μs (20.8% faster)

def test_edge_multiple_colons_in_internal():
    # Multiple colons in internal part
    codeflash_output = extract_internal_format("int: 32bit: signed; custom1"); fmt = codeflash_output # 8.72μs -> 7.15μs (21.9% faster)

def test_edge_multiple_semicolons():
    # Multiple semicolons, some empty
    codeflash_output = extract_internal_format("str: utf8; ; custom1;; custom2"); fmt = codeflash_output # 8.87μs -> 7.31μs (21.4% faster)

def test_edge_no_type_just_custom():
    # No type, just custom format
    codeflash_output = extract_internal_format("custom_only"); fmt = codeflash_output # 8.15μs -> 6.96μs (17.0% faster)

def test_edge_type_with_colon_but_no_attr():
    # Type with colon but no attribute
    codeflash_output = extract_internal_format("float: ; custom"); fmt = codeflash_output # 6.25μs -> 4.42μs (41.6% faster)

def test_edge_internal_type_with_spaces():
    # Type name with leading/trailing spaces
    codeflash_output = extract_internal_format(" int : 64bit ; custom"); fmt = codeflash_output # 8.66μs -> 7.18μs (20.5% faster)

def test_edge_internal_type_case_sensitivity():
    # Type name with different case
    codeflash_output = extract_internal_format("Int: 32bit; custom"); fmt = codeflash_output # 8.41μs -> 7.00μs (20.1% faster)

def test_edge_custom_format_with_colons_and_semicolons():
    # Custom format contains colons and semicolons
    codeflash_output = extract_internal_format("str: ascii; custom: value; another: test; ;"); fmt = codeflash_output # 8.91μs -> 7.07μs (25.9% faster)

# Large Scale Test Cases

def test_large_scale_many_custom_formats():
    # Many custom formats (up to 999)
    custom_formats = [f"custom{i}" for i in range(1, 1000)]
    format_str = "int: 64bit; " + "; ".join(custom_formats)
    codeflash_output = extract_internal_format(format_str); fmt = codeflash_output # 38.8μs -> 7.79μs (398% faster)

def test_large_scale_long_internal_format_attr():
    # Very long internal format attribute
    long_attr = "x" * 500
    codeflash_output = extract_internal_format(f"str: {long_attr}; custom1"); fmt = codeflash_output # 9.04μs -> 7.32μs (23.6% faster)

def test_large_scale_large_input_string():
    # Large input string with many semicolons and colons
    internal_type = "float"
    internal_attr = "precision: high"
    custom_formats = [f"custom{i}: val{i}" for i in range(1, 500)]
    format_str = f"{internal_type}: {internal_attr}; " + "; ".join(custom_formats)
    codeflash_output = extract_internal_format(format_str); fmt = codeflash_output # 23.9μs -> 4.75μs (404% faster)

def test_large_scale_all_types():
    # Test all types in registry with custom formats
    for type_name, type_value in MockRailTypes._types.items():
        codeflash_output = extract_internal_format(f"{type_name}: attr; customA; customB"); fmt = codeflash_output # 18.2μs -> 13.9μs (31.1% faster)

def test_large_scale_no_custom_formats():
    # Large input with no custom formats, just type and attribute
    for type_name, type_value in MockRailTypes._types.items():
        codeflash_output = extract_internal_format(f"{type_name}: long_attr_name_with_lots_of_text"); fmt = codeflash_output # 16.8μs -> 13.5μs (25.0% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

⏪ Replay Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `test_pytest_testsunit_teststest_guard_log_py_testsintegration_teststest_guard_py_testsunit_testsvalidator__replay_test_0.py::test_guardrails_schema_rail_schema_extract_internal_format` | 479μs | 418μs | 14.6% ✅ |

To edit these changes, run `git checkout codeflash/optimize-extract_internal_format-mh1t26ud`, commit your edits, and push.

codeflash-ai bot requested a review from mashraf-222 October 22, 2025 09:43

codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 22, 2025
