@codeflash-ai codeflash-ai bot commented Dec 23, 2025

📄 40% (0.40x) speedup for _get_all_json_refs in src/algorithms/search.py

⏱️ Runtime: 848 microseconds → 606 microseconds (best of 250 runs)

📝 Explanation and details

The optimized code achieves a roughly 40% speedup (848 μs → 606 μs) by replacing recursive calls with an iterative approach using an explicit stack. Here's why this matters:

Key Optimization: Recursion → Iteration

What changed:

  • Original: Recursively calls _get_all_json_refs() and uses refs.update() to merge results from child nodes
  • Optimized: Uses a while loop with an explicit stack to traverse the JSON structure iteratively
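The PR text does not show the function body, so the following is only a minimal sketch of the two shapes described above. The names `get_refs_recursive` and `get_refs_iterative` are hypothetical, and the exact branching of the real `_get_all_json_refs` may differ; the only behavior taken from the tests is that a `$ref` counts when its value is a string.

```python
from typing import Any, NewType

JsonRef = NewType("JsonRef", str)


def get_refs_recursive(item: Any) -> set[JsonRef]:
    """Hypothetical recursive shape: every call builds and returns its own set."""
    refs: set[JsonRef] = set()
    if isinstance(item, dict):
        for key, value in item.items():
            if key == "$ref" and isinstance(value, str):
                refs.add(JsonRef(value))
            elif isinstance(value, (dict, list)):
                # Child results are merged back into the parent's set.
                refs.update(get_refs_recursive(value))
    elif isinstance(item, list):
        for element in item:
            refs.update(get_refs_recursive(element))
    return refs


def get_refs_iterative(item: Any) -> set[JsonRef]:
    """Hypothetical iterative shape: one shared set and one explicit stack."""
    refs: set[JsonRef] = set()
    stack: list[Any] = [item]
    while stack:
        current = stack.pop()
        if isinstance(current, dict):
            for key, value in current.items():
                if key == "$ref" and isinstance(value, str):
                    refs.add(JsonRef(value))
                elif isinstance(value, (dict, list)):
                    stack.append(value)
        elif isinstance(current, list):
            # Primitives pushed here are popped and ignored by both branches.
            stack.extend(current)
    return refs
```

Because the result is a set, the different visiting order of the stack-based version does not change the output.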

Why it's faster:

  1. Eliminated Function Call Overhead: The original code made 2,420 recursive calls (visible in line profiler as 2,480 total hits with recursive update operations). Each function call in Python involves:

    • Creating a new stack frame
    • Parameter passing
    • Return value handling
    • Set allocation for each recursive call's refs
  2. Avoided Repeated Set Operations: The original used refs.update() to merge child results back into parent sets. The line profiler shows:

    • 380 + 768 + 1,272 = 2,420 refs.update() calls consuming ~3.6ms (30% of total time)
    • The optimized version eliminates all these merge operations by maintaining a single refs set
  3. Better Memory Locality: The iterative approach maintains one refs set and one stack list, improving cache efficiency compared to multiple temporary sets across recursive calls.
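To get a feel for where call counts like the ones above come from, the sketch below counts how many calls a naive recursive traversal would make for a given structure. It assumes the traversal recurses into every dict or list value of a dict and into every element of a list; `count_recursive_calls` and `build_chain` are hypothetical helpers, not part of the PR, and the real function may recurse at different points.

```python
from typing import Any


def count_recursive_calls(item: Any) -> int:
    """Count the calls a naive recursive traversal would make for `item`.

    Assumption: each call allocates its own result set, and every call except
    the outermost one is merged into its parent with set.update().
    """
    calls = 1  # the call for `item` itself
    if isinstance(item, dict):
        for value in item.values():
            if isinstance(value, (dict, list)):
                calls += count_recursive_calls(value)
    elif isinstance(item, list):
        for element in item:
            calls += count_recursive_calls(element)
    return calls


def build_chain(depth: int) -> dict:
    """Build {"root": {...}} holding a chain of `depth` nested dicts with one '$ref' each."""
    data: dict = {"root": {}}
    current = data["root"]
    for i in range(depth):
        current["$ref"] = f"#/deep/{i}"
        if i < depth - 1:
            current["next"] = {}
            current = current["next"]
    return data


chain_calls = count_recursive_calls(build_chain(100))
print(chain_calls)  # 101: the outer dict plus 100 chained dicts
```

Under these assumptions the iterative version would handle the same structure with a single function call, a single set, and no merges.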

Performance by Test Case Type

  • Small/shallow structures (empty dicts, single refs): 16-23% slower due to stack initialization overhead outweighing recursion savings
  • Medium depth structures (3-5 levels): 2-16% faster as benefits start outweighing overhead
  • Large/deeply nested structures: 17-255% faster
    • test_large_nested_structure: 101μs → 28.7μs (255% faster) - the recursive version suffers worst with deep nesting (100 levels deep)
    • test_large_flat_dict_of_refs: 36.2μs → 30.8μs (17% faster)
    • test_large_mixed_structure: 38.8μs → 27.5μs (41% faster)
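The reported percentages appear consistent with the ratio original / optimized − 1; this is inferred from the numbers in this report, not from Codeflash documentation. A small sketch, with `speedup_percent` as a hypothetical helper name:

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Return the speedup in percent; negative values mean the optimized run is slower."""
    return (original / optimized - 1.0) * 100.0


# Overall runtime from the report header: 848 us -> 606 us.
overall = speedup_percent(848, 606)
print(round(overall, 1))  # 39.9

# A small input that got slower: 583 ns -> 750 ns.
small = speedup_percent(583, 750)
print(round(small, 1))  # -22.3
```

The displayed timings are rounded, so recomputing from them can differ slightly from the reported figure (for example, 101 μs → 28.7 μs gives about 252% rather than 255%).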

Impact Analysis

Based on the function name _get_all_json_refs (JSON schema reference extraction), this is likely used in:

  • Schema validation pipelines
  • API documentation generators
  • OpenAPI/JSON Schema parsers

These tools often process complex, deeply nested schemas, where this optimization could have a significant cumulative impact. If the roughly 40% average speedup carried over, a schema that took 1 second to process would take about 715 ms (1 s / 1.40), a saving that compounds across large codebases or high-throughput validation scenarios.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 56 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | ✅ 4 Passed |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests
from __future__ import annotations

from typing import Any, NewType

# imports
import pytest  # used for our unit tests
from src.algorithms.search import _get_all_json_refs

JsonRef = NewType("JsonRef", str)

# unit tests

# 1. BASIC TEST CASES


def test_empty_dict_returns_empty_set():
    # An empty dict should yield no refs
    codeflash_output = _get_all_json_refs({})  # 583ns -> 750ns (22.3% slower)


def test_empty_list_returns_empty_set():
    # An empty list should yield no refs
    codeflash_output = _get_all_json_refs([])  # 500ns -> 625ns (20.0% slower)


def test_no_ref_key_returns_empty_set():
    # Dict with no '$ref' key should yield no refs
    data = {"foo": 1, "bar": {"baz": 2}}
    codeflash_output = _get_all_json_refs(data)  # 1.38μs -> 1.54μs (10.8% slower)


def test_single_ref_at_root():
    # Dict with one '$ref' at top level
    data = {"$ref": "#/definitions/foo"}
    expected = {JsonRef("#/definitions/foo")}
    codeflash_output = _get_all_json_refs(data)  # 875ns -> 1.04μs (16.0% slower)


def test_single_ref_nested():
    # Dict with '$ref' nested inside another dict
    data = {"type": "object", "properties": {"foo": {"$ref": "#/definitions/foo"}}}
    expected = {JsonRef("#/definitions/foo")}
    codeflash_output = _get_all_json_refs(data)  # 1.88μs -> 1.96μs (4.29% slower)


def test_multiple_refs_flat():
    # Dict with multiple '$ref's at the same level
    data = {"a": {"$ref": "#/a"}, "b": {"$ref": "#/b"}}
    expected = {JsonRef("#/a"), JsonRef("#/b")}
    codeflash_output = _get_all_json_refs(data)  # 1.79μs -> 1.75μs (2.40% faster)


def test_multiple_refs_nested():
    # Dict with multiple '$ref's at different nesting levels
    data = {
        "properties": {
            "foo": {"$ref": "#/definitions/foo"},
            "bar": {"type": "object", "items": {"$ref": "#/definitions/bar"}},
        }
    }
    expected = {JsonRef("#/definitions/foo"), JsonRef("#/definitions/bar")}
    codeflash_output = _get_all_json_refs(data)  # 2.50μs -> 2.54μs (1.65% slower)


def test_ref_in_list():
    # '$ref' inside a list of dicts
    data = {
        "anyOf": [
            {"$ref": "#/definitions/foo"},
            {"type": "string"},
            {"$ref": "#/definitions/bar"},
        ]
    }
    expected = {JsonRef("#/definitions/foo"), JsonRef("#/definitions/bar")}
    codeflash_output = _get_all_json_refs(data)  # 2.33μs -> 2.21μs (5.66% faster)


def test_ref_in_list_of_lists():
    # '$ref' inside nested lists
    data = {
        "allOf": [
            [{"$ref": "#/definitions/foo"}, {"$ref": "#/definitions/bar"}],
            [{"$ref": "#/definitions/baz"}],
        ]
    }
    expected = {
        JsonRef("#/definitions/foo"),
        JsonRef("#/definitions/bar"),
        JsonRef("#/definitions/baz"),
    }
    codeflash_output = _get_all_json_refs(data)  # 2.79μs -> 2.50μs (11.6% faster)


def test_ref_value_is_not_string():
    # '$ref' value is not a string (should be ignored)
    data = {"$ref": 123, "nested": {"$ref": None}}
    codeflash_output = _get_all_json_refs(data)  # 1.54μs -> 1.67μs (7.50% slower)


def test_ref_key_in_list_of_dicts():
    # List of dicts, some with '$ref'
    data = [{"$ref": "#/a"}, {"not_a_ref": 1}, {"$ref": "#/b"}]
    expected = {JsonRef("#/a"), JsonRef("#/b")}
    codeflash_output = _get_all_json_refs(data)  # 2.21μs -> 2.04μs (8.13% faster)


# 2. EDGE TEST CASES


def test_ref_key_as_property_name():
    # '$ref' used as a property name, not as a reference
    data = {"properties": {"$ref": {"type": "string"}}}
    # Only count '$ref' if its value is a string
    codeflash_output = _get_all_json_refs(data)  # 1.67μs -> 1.83μs (9.06% slower)


def test_ref_key_with_non_str_value_deeply_nested():
    # Deeply nested dict with non-str '$ref' value
    data = {"a": {"b": {"c": {"$ref": ["not", "a", "string"]}}}}
    codeflash_output = _get_all_json_refs(data)  # 2.58μs -> 2.46μs (5.04% faster)


def test_ref_key_with_empty_string_value():
    # '$ref' with empty string as value
    data = {"$ref": ""}
    expected = {JsonRef("")}
    codeflash_output = _get_all_json_refs(data)  # 875ns -> 1.04μs (16.0% slower)


def test_ref_key_with_whitespace_string():
    # '$ref' with whitespace string as value
    data = {"$ref": " "}
    expected = {JsonRef(" ")}
    codeflash_output = _get_all_json_refs(data)  # 833ns -> 1.00μs (16.7% slower)


def test_ref_key_with_duplicate_values():
    # Duplicate '$ref' values in different locations
    data = {
        "a": {"$ref": "#/dup"},
        "b": {"nested": {"$ref": "#/dup"}},
        "c": {"$ref": "#/unique"},
    }
    expected = {JsonRef("#/dup"), JsonRef("#/unique")}
    codeflash_output = _get_all_json_refs(data)  # 2.42μs -> 2.38μs (1.77% faster)


def test_input_is_primitive():
    # Input is a primitive (not a dict or list)
    for primitive in [None, 42, 3.14, "hello", True, False]:
        codeflash_output = _get_all_json_refs(
            primitive
        )  # 1.46μs -> 2.12μs (31.4% slower)


def test_ref_key_in_mixed_types():
    # '$ref' in a dict inside a list, inside a dict, etc.
    data = {"x": [{"$ref": "#/foo"}, 123, ["not a dict", {"$ref": "#/bar"}]]}
    expected = {JsonRef("#/foo"), JsonRef("#/bar")}
    codeflash_output = _get_all_json_refs(data)  # 2.75μs -> 2.38μs (15.8% faster)


def test_ref_key_with_non_ascii_characters():
    # '$ref' with non-ascii/unicode string value
    data = {"$ref": "#/définitions/фу"}
    expected = {JsonRef("#/définitions/фу")}
    codeflash_output = _get_all_json_refs(data)  # 833ns -> 1.00μs (16.7% slower)


def test_ref_key_with_long_string():
    # '$ref' with a very long string value
    long_str = "#/" + "a" * 500
    data = {"$ref": long_str}
    expected = {JsonRef(long_str)}
    codeflash_output = _get_all_json_refs(data)  # 792ns -> 1.00μs (20.8% slower)


def test_dict_with_ref_and_other_types():
    # Dict with '$ref' and other types as values
    data = {"$ref": "#/foo", "other": 123, "more": None, "nested": {"$ref": "#/bar"}}
    expected = {JsonRef("#/foo"), JsonRef("#/bar")}
    codeflash_output = _get_all_json_refs(data)  # 1.79μs -> 1.92μs (6.57% slower)


# 3. LARGE SCALE TEST CASES


def test_large_flat_dict_of_refs():
    # Large dict with many '$ref' keys at the top level
    data = {f"item{i}": {"$ref": f"#/def/{i}"} for i in range(100)}
    expected = {JsonRef(f"#/def/{i}") for i in range(100)}
    codeflash_output = _get_all_json_refs(data)  # 36.2μs -> 30.8μs (17.3% faster)


def test_large_nested_structure():
    # Large nested structure with refs at various depths
    data = {"root": {}}
    current = data["root"]
    expected = set()
    # Build a chain of nested dicts, each with a '$ref'
    for i in range(100):
        ref_str = f"#/deep/{i}"
        current["$ref"] = ref_str
        expected.add(JsonRef(ref_str))
        if i < 99:
            current["next"] = {}
            current = current["next"]
    codeflash_output = _get_all_json_refs(data)  # 101μs -> 28.7μs (255% faster)


def test_large_list_of_dicts_with_refs():
    # Large list of dicts, each with a '$ref'
    data = [{"$ref": f"#/item/{i}"} for i in range(100)]
    expected = {JsonRef(f"#/item/{i}") for i in range(100)}
    codeflash_output = _get_all_json_refs(data)  # 33.5μs -> 25.3μs (32.3% faster)


def test_large_mixed_structure():
    # Large, deeply nested structure with dicts and lists
    data = []
    expected = set()
    # Each element is a dict with a '$ref' and a list of dicts with '$ref's
    for i in range(50):
        ref1 = f"#/outer/{i}"
        ref2 = f"#/inner/{i}"
        d = {"$ref": ref1, "list": [{"$ref": ref2}]}
        expected.add(JsonRef(ref1))
        expected.add(JsonRef(ref2))
        data.append(d)
    codeflash_output = _get_all_json_refs(data)  # 38.8μs -> 27.5μs (40.9% faster)


def test_performance_large_structure():
    # Performance: ensure function can handle a large, wide structure efficiently
    # (not a strict performance test, but ensures no recursion errors or timeouts)
    data = {"root": [{"$ref": f"#/x/{i}"} for i in range(500)]}
    expected = {JsonRef(f"#/x/{i}") for i in range(500)}
    codeflash_output = _get_all_json_refs(data)  # 155μs -> 112μs (37.5% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from __future__ import annotations

from typing import Any, NewType

# imports
import pytest  # used for our unit tests
from src.algorithms.search import _get_all_json_refs

JsonRef = NewType("JsonRef", str)

# unit tests

# 1. Basic Test Cases


def test_empty_dict_returns_empty_set():
    # Test with an empty dict
    codeflash_output = _get_all_json_refs({})
    result = codeflash_output  # 583ns -> 709ns (17.8% slower)


def test_no_refs_in_dict_returns_empty_set():
    # Test with dict that has no $ref keys
    data = {"foo": 1, "bar": {"baz": 2}}
    codeflash_output = _get_all_json_refs(data)
    result = codeflash_output  # 1.33μs -> 1.54μs (13.6% slower)


def test_single_ref_at_top_level():
    # Test with a single $ref at the top level
    data = {"$ref": "#/definitions/foo"}
    codeflash_output = _get_all_json_refs(data)
    result = codeflash_output  # 917ns -> 1.12μs (18.5% slower)


def test_single_ref_nested_dict():
    # Test with a single $ref nested inside a dict
    data = {"properties": {"foo": {"$ref": "#/definitions/bar"}}}
    codeflash_output = _get_all_json_refs(data)
    result = codeflash_output  # 1.71μs -> 1.75μs (2.40% slower)


def test_multiple_refs_at_various_levels():
    # Test with multiple $ref at different nesting levels
    data = {
        "properties": {
            "foo": {"$ref": "#/definitions/bar"},
            "baz": {"type": "string"},
            "qux": {"items": [{"$ref": "#/definitions/quux"}, {"type": "number"}]},
        },
        "$ref": "#/definitions/root",
    }
    codeflash_output = _get_all_json_refs(data)
    result = codeflash_output  # 3.71μs -> 3.29μs (12.7% faster)
    expected = {
        JsonRef("#/definitions/bar"),
        JsonRef("#/definitions/quux"),
        JsonRef("#/definitions/root"),
    }


def test_ref_in_list():
    # Test with $ref inside a list
    data = [{"$ref": "#/definitions/foo"}, {"not_a_ref": 1}]
    codeflash_output = _get_all_json_refs(data)
    result = codeflash_output  # 1.83μs -> 1.79μs (2.29% faster)


def test_multiple_refs_in_list():
    # Test with multiple $ref inside a list
    data = [
        {"$ref": "#/definitions/foo"},
        {"$ref": "#/definitions/bar"},
        {"other": [{"$ref": "#/definitions/baz"}]},
    ]
    codeflash_output = _get_all_json_refs(data)
    result = codeflash_output  # 2.71μs -> 2.33μs (16.1% faster)
    expected = {
        JsonRef("#/definitions/foo"),
        JsonRef("#/definitions/bar"),
        JsonRef("#/definitions/baz"),
    }


def test_duplicate_refs():
    # Test with duplicate $ref values
    data = {
        "a": {"$ref": "#/definitions/foo"},
        "b": {"$ref": "#/definitions/foo"},
        "c": [{"$ref": "#/definitions/foo"}],
    }
    codeflash_output = _get_all_json_refs(data)
    result = codeflash_output  # 2.46μs -> 2.38μs (3.49% faster)


# 2. Edge Test Cases


def test_ref_with_non_string_value():
    # Test with $ref whose value is not a string (should be ignored)
    data = {"$ref": {"not": "a string"}, "nested": {"$ref": 123}}
    codeflash_output = _get_all_json_refs(data)
    result = codeflash_output  # 1.79μs -> 2.00μs (10.4% slower)


def test_ref_as_property_name():
    # Test where $ref is a property name, but not a reference (e.g., in a schema property)
    data = {
        "properties": {"$ref": {"type": "string"}, "foo": {"$ref": "#/definitions/bar"}}
    }
    codeflash_output = _get_all_json_refs(data)
    result = codeflash_output  # 2.25μs -> 2.33μs (3.56% slower)


def test_non_dict_non_list_input():
    # Test with input that is not dict or list (should return empty set)
    for value in [None, 42, "string", 3.14, True, False]:
        codeflash_output = _get_all_json_refs(value)  # 1.50μs -> 2.04μs (26.6% slower)


def test_deeply_nested_refs():
    # Test with refs nested several levels deep
    data = {"a": {"b": {"c": {"d": {"$ref": "#/definitions/deep"}}}}}
    codeflash_output = _get_all_json_refs(data)
    result = codeflash_output  # 2.25μs -> 2.17μs (3.88% faster)


def test_empty_list_and_dict():
    # Test with empty lists and dicts at various places
    data = {"a": [], "b": {}, "c": [{"d": []}, {"e": {}}]}
    codeflash_output = _get_all_json_refs(data)
    result = codeflash_output  # 2.50μs -> 2.33μs (7.16% faster)


def test_ref_with_empty_string():
    # Test with $ref as empty string (should be included)
    data = {"$ref": ""}
    codeflash_output = _get_all_json_refs(data)
    result = codeflash_output  # 917ns -> 1.12μs (18.5% slower)


def test_ref_in_list_of_lists():
    # Test with $ref inside nested lists
    data = [
        [{"$ref": "#/definitions/foo"}],
        [],
        [{"bar": [{"$ref": "#/definitions/bar"}]}],
    ]
    codeflash_output = _get_all_json_refs(data)
    result = codeflash_output  # 2.92μs -> 2.54μs (14.8% faster)
    expected = {JsonRef("#/definitions/foo"), JsonRef("#/definitions/bar")}


def test_dict_with_ref_and_other_keys():
    # Test with dict that has $ref and other keys
    data = {"$ref": "#/definitions/foo", "other": "value"}
    codeflash_output = _get_all_json_refs(data)
    result = codeflash_output  # 1.08μs -> 1.42μs (23.5% slower)


# 3. Large Scale Test Cases


def test_large_flat_list_of_refs():
    # Test with a large flat list of dicts each with unique $ref
    N = 500
    data = [{"$ref": f"#/definitions/foo{i}"} for i in range(N)]
    codeflash_output = _get_all_json_refs(data)
    result = codeflash_output  # 155μs -> 113μs (36.8% faster)
    expected = {JsonRef(f"#/definitions/foo{i}") for i in range(N)}


def test_large_nested_structure_with_refs():
    # Test with a large nested structure containing $ref at different depths
    N = 100
    data = []
    for i in range(N):
        # Each element is a dict with a nested list containing a ref
        data.append(
            {
                "level1": {
                    "level2": [{"$ref": f"#/definitions/bar{i}"}, {"not_a_ref": i}]
                }
            }
        )
    codeflash_output = _get_all_json_refs(data)
    result = codeflash_output  # 113μs -> 83.1μs (36.0% faster)
    expected = {JsonRef(f"#/definitions/bar{i}") for i in range(N)}


def test_large_tree_of_refs():
    # Test with a tree structure where refs are at leaves
    def make_tree(depth, width, ref_prefix):
        if depth == 0:
            return {"$ref": f"{ref_prefix}/leaf"}
        return {
            f"child_{i}": make_tree(depth - 1, width, f"{ref_prefix}/{i}")
            for i in range(width)
        }

    # depth=3, width=3 gives 27 leaves
    data = make_tree(3, 3, "#/definitions")
    codeflash_output = _get_all_json_refs(data)
    result = codeflash_output  # 15.2μs -> 12.0μs (25.9% faster)

    # Collect expected refs
    def get_expected_refs(depth, width, ref_prefix):
        if depth == 0:
            return {JsonRef(f"{ref_prefix}/leaf")}
        refs = set()
        for i in range(width):
            refs.update(get_expected_refs(depth - 1, width, f"{ref_prefix}/{i}"))
        return refs

    expected = get_expected_refs(3, 3, "#/definitions")


def test_large_list_with_some_non_refs():
    # Test with a large list where only some elements contain $ref
    N = 500
    data = []
    expected = set()
    for i in range(N):
        if i % 10 == 0:
            data.append({"$ref": f"#/definitions/foo{i}"})
            expected.add(JsonRef(f"#/definitions/foo{i}"))
        else:
            data.append({"not_a_ref": i})
    codeflash_output = _get_all_json_refs(data)
    result = codeflash_output  # 129μs -> 100.0μs (29.2% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from enum import EnumCheck
from src.algorithms.search import _get_all_json_refs


def test__get_all_json_refs():
    _get_all_json_refs([{EnumCheck.CONTINUOUS: {}}])


def test__get_all_json_refs_2():
    _get_all_json_refs({"$ref": []})


def test__get_all_json_refs_3():
    _get_all_json_refs({"$ref": ""})


def test__get_all_json_refs_4():
    _get_all_json_refs([(v1 := {}), {"\x00\x00\x00\x00": 0}, v1])

🔎 Concolic Coverage Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| codeflash_concolic_y60jt975/tmp0f2e05ub/test_concolic_coverage.py::test__get_all_json_refs | 1.46μs | 1.50μs | -2.80% ⚠️ |
| codeflash_concolic_y60jt975/tmp0f2e05ub/test_concolic_coverage.py::test__get_all_json_refs_2 | 958ns | 1.08μs | -11.6% ⚠️ |
| codeflash_concolic_y60jt975/tmp0f2e05ub/test_concolic_coverage.py::test__get_all_json_refs_3 | 958ns | 1.12μs | -14.8% ⚠️ |
| codeflash_concolic_y60jt975/tmp0f2e05ub/test_concolic_coverage.py::test__get_all_json_refs_4 | 1.79μs | 1.71μs | 4.92% ✅ |

To edit these changes, run git checkout codeflash/optimize-_get_all_json_refs-mji1cxhy, then push.

@codeflash-ai codeflash-ai bot requested a review from KRRT7 December 23, 2025 03:39
@codeflash-ai codeflash-ai bot added the "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) labels Dec 23, 2025
@KRRT7 KRRT7 closed this Dec 23, 2025
@codeflash-ai codeflash-ai bot deleted the codeflash/optimize-_get_all_json_refs-mji1cxhy branch December 23, 2025 05:48