@codeflash-ai codeflash-ai bot commented Jun 23, 2025

📄 27,577% (275.77x) speedup for find_leaf_nodes in src/dsa/nodes.py

⏱️ Runtime : 164 milliseconds → 592 microseconds (best of 548 runs)

📝 Explanation and details

Analysis

The bottleneck is clear from line profiling:
The nested loop (for node in nodes, then for edge in edges) takes O(N*M) time, where N = len(nodes) and M = len(edges); the profiling lines confirm this.

Here, almost 100% of time is spent scanning all edges for each node, which is inefficient.
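
The original function is not shown in this conversation. As a rough illustration of the nested-loop shape described above, here is a minimal sketch, assuming nodes are dicts with an "id" key and edges are dicts with a "source" key (the shapes used by the generated tests). The name `find_leaf_nodes_nested` and the early `break` are assumptions, not the repository's code.

```python
from typing import Dict, List


def find_leaf_nodes_nested(nodes: List[Dict], edges: List[Dict]) -> List[Dict]:
    """Hypothetical baseline: rescan the edge list for every node."""
    leaves = []
    for node in nodes:  # N iterations
        has_outgoing = False
        for edge in edges:  # up to M iterations per node
            if edge["source"] == node["id"]:
                has_outgoing = True
                break  # assumed early exit once an outgoing edge is found
        if not has_outgoing:
            leaves.append(node)
    return leaves


print(find_leaf_nodes_nested(
    [{"id": 1}, {"id": 2}, {"id": 3}],
    [{"source": 1, "target": 2}, {"source": 2, "target": 3}],
))  # [{'id': 3}]
```

A node with no outgoing edge forces a full scan of the edge list, so graphs with many leaves are the worst case for this shape.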

Recommendation

  • Optimize by preprocessing:
    Build a set of all node ids that appear as "source" in any edge.
    Then any node whose id is not in this set is a leaf node.
    This reduces total cost to O(N+M) instead of O(N*M).

You have no relevant installed libraries, so no library-based optimizations apply here.


Optimized Code
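
The diff itself is not reproduced in this conversation view. The following is a minimal sketch consistent with the explanation in this comment, not necessarily the exact code on the branch. It keeps the `find_leaf_nodes(nodes, edges)` call shape used by the tests; the `List[Dict]` annotations are an assumption.

```python
from typing import Dict, List


def find_leaf_nodes(nodes: List[Dict], edges: List[Dict]) -> List[Dict]:
    # One pass over edges: collect every id that has an outgoing edge. O(M)
    outgoing_sources = {edge["source"] for edge in edges}
    # One pass over nodes: keep those that never appear as a source. O(N)
    return [node for node in nodes if node["id"] not in outgoing_sources]


print(find_leaf_nodes(
    [{"id": 1}, {"id": 2}, {"id": 3}],
    [{"source": 1, "target": 2}, {"source": 1, "target": 3}],
))  # [{'id': 2}, {'id': 3}]
```

One caveat of this sketch: a set requires hashable values, so ids such as lists or dicts would raise a TypeError here, whereas an equality-based scan would accept them.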

Explanation

  • Build outgoing_sources set in one pass over edges: O(M)
  • List-comprehension scans nodes and checks set membership: O(N)
  • Set membership is O(1) on average, much faster than the previous O(M) scan for each node.

Total complexity: O(N + M)
Memory: One set of unique source ids (at most len(edges) entries, fewer when several edges share a source).


Summary of Changes

  • Eliminated the inner loop (avoid scanning all edges per node).
  • Used a set for constant-time presence tests.
  • No new dependencies, so the code stays portable to any Python 3.8+ interpreter (your type hints are unchanged).

This will make your code orders of magnitude faster without changing results or altering function signatures.
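
One way to sanity-check the "without changing results" claim is to compare a scan-based version against a set-based version on many small random graphs. The sketch below does that. Both helper functions (`leaves_nested`, `leaves_with_set`) are hypothetical stand-ins written for this example, and the graph sizes and seed are arbitrary choices.

```python
import random
from typing import Dict, List


def leaves_nested(nodes: List[Dict], edges: List[Dict]) -> List[Dict]:
    # Stand-in for the scan-per-node approach: O(N*M).
    return [
        node for node in nodes
        if not any(edge["source"] == node["id"] for edge in edges)
    ]


def leaves_with_set(nodes: List[Dict], edges: List[Dict]) -> List[Dict]:
    # Stand-in for the set-based approach: O(N + M).
    sources = {edge["source"] for edge in edges}
    return [node for node in nodes if node["id"] not in sources]


rng = random.Random(0)  # fixed seed so the run is repeatable
for _ in range(200):
    n = rng.randint(0, 20)
    m = rng.randint(0, 40)
    nodes = [{"id": i} for i in range(n)]
    # Ids are drawn from 0..n+4, so some edges reference ids missing from nodes.
    edges = [
        {"source": rng.randrange(n + 5), "target": rng.randrange(n + 5)}
        for _ in range(m)
    ]
    assert leaves_nested(nodes, edges) == leaves_with_set(nodes, edges)

print("all random graphs agree")
```

Both versions iterate over nodes in list order, so the comparison checks ordering as well as membership.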

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 47 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 2 Passed |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
from typing import Dict, List

# imports
import pytest  # used for our unit tests
from src.dsa.nodes import find_leaf_nodes

# unit tests

# -------------------- BASIC TEST CASES --------------------

def test_single_node_no_edges():
    # Single node, no edges; should be a leaf
    nodes = [{"id": 1}]
    edges = []
    codeflash_output = find_leaf_nodes(nodes, edges) # 417ns -> 584ns (28.6% slower)

def test_two_nodes_one_edge():
    # Node 1 points to node 2; node 2 is the only leaf
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_leaf_nodes(nodes, edges) # 750ns -> 750ns (0.000% faster)

def test_three_nodes_chain():
    # 1 -> 2 -> 3; only node 3 is a leaf
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 2}, {"source": 2, "target": 3}]
    codeflash_output = find_leaf_nodes(nodes, edges) # 1.00μs -> 791ns (26.4% faster)

def test_three_nodes_star():
    # 1 -> 2, 1 -> 3; nodes 2 and 3 are leaves
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 2}, {"source": 1, "target": 3}]
    codeflash_output = sorted(find_leaf_nodes(nodes, edges), key=lambda x: x["id"]) # 958ns -> 792ns (21.0% faster)

def test_disconnected_nodes():
    # 1 -> 2; 3 is disconnected; leaves: 2, 3
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 875ns -> 792ns (10.5% faster)

def test_all_nodes_leaves():
    # No edges; all nodes are leaves
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = []
    codeflash_output = sorted(find_leaf_nodes(nodes, edges), key=lambda x: x["id"]) # 542ns -> 708ns (23.4% slower)

def test_no_nodes():
    # No nodes, no edges; should return []
    codeflash_output = find_leaf_nodes([], []) # 250ns -> 375ns (33.3% slower)

def test_no_edges_some_nodes():
    # Multiple nodes, no edges; all are leaves
    nodes = [{"id": i} for i in range(5)]
    codeflash_output = sorted(find_leaf_nodes(nodes, []), key=lambda x: x["id"]) # 708ns -> 833ns (15.0% slower)

# -------------------- EDGE TEST CASES --------------------

def test_self_loop():
    # Node with a self-loop should NOT be a leaf
    nodes = [{"id": 1}]
    edges = [{"source": 1, "target": 1}]
    codeflash_output = find_leaf_nodes(nodes, edges) # 500ns -> 666ns (24.9% slower)

def test_multiple_edges_from_one_node():
    # Node 1 points to 2, 3, 4; only 2, 3, 4 are leaves
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
    edges = [{"source": 1, "target": 2}, {"source": 1, "target": 3}, {"source": 1, "target": 4}]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 1.21μs -> 875ns (38.1% faster)

def test_cycle():
    # 1 -> 2, 2 -> 3, 3 -> 1; all in cycle, no leaves
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 2}, {"source": 2, "target": 3}, {"source": 3, "target": 1}]
    codeflash_output = find_leaf_nodes(nodes, edges) # 1.00μs -> 833ns (20.0% faster)

def test_duplicate_edges():
    # Duplicate edges should not affect result
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}, {"source": 1, "target": 2}]
    codeflash_output = find_leaf_nodes(nodes, edges) # 833ns -> 792ns (5.18% faster)

def test_edge_to_nonexistent_node():
    # Edge points to node not in nodes list; shouldn't affect leaves
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 3}]
    # Node 1 is not a leaf, node 2 is a leaf
    codeflash_output = find_leaf_nodes(nodes, edges) # 667ns -> 750ns (11.1% slower)

def test_edge_from_nonexistent_node():
    # Edge from non-existent node; shouldn't affect leaves
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 3, "target": 1}]
    # Both 1 and 2 have no outgoing edges, so both are leaves
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 750ns -> 750ns (0.000% faster)

def test_edge_with_extra_keys():
    # Edges and nodes with extra irrelevant keys
    nodes = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
    edges = [{"source": 1, "target": 2, "weight": 5}]
    codeflash_output = find_leaf_nodes(nodes, edges) # 750ns -> 708ns (5.93% faster)

def test_node_with_non_integer_id():
    # Node ids are strings
    nodes = [{"id": "a"}, {"id": "b"}]
    edges = [{"source": "a", "target": "b"}]
    codeflash_output = find_leaf_nodes(nodes, edges) # 875ns -> 709ns (23.4% faster)

def test_empty_edges_with_nonempty_nodes():
    # All nodes are leaves when edges is empty
    nodes = [{"id": 10}, {"id": 20}]
    codeflash_output = sorted(find_leaf_nodes(nodes, []), key=lambda x: x["id"]) # 500ns -> 667ns (25.0% slower)

def test_multiple_components():
    # Two disconnected components: 1->2, 3->4; leaves: 2, 4
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
    edges = [{"source": 1, "target": 2}, {"source": 3, "target": 4}]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 1.12μs -> 875ns (28.6% faster)

def test_node_id_is_zero():
    # Node id is 0; should be handled correctly
    nodes = [{"id": 0}, {"id": 1}]
    edges = [{"source": 0, "target": 1}]
    codeflash_output = find_leaf_nodes(nodes, edges) # 750ns -> 750ns (0.000% faster)

# -------------------- LARGE SCALE TEST CASES --------------------

def test_large_chain():
    # Large chain: 0->1->2->...->n-1; only last node is leaf
    n = 1000
    nodes = [{"id": i} for i in range(n)]
    edges = [{"source": i, "target": i+1} for i in range(n-1)]
    codeflash_output = find_leaf_nodes(nodes, edges) # 15.7ms -> 61.0μs (25572% faster)

def test_large_star():
    # Node 0 points to all others; all others are leaves
    n = 1000
    nodes = [{"id": i} for i in range(n)]
    edges = [{"source": 0, "target": i} for i in range(1, n)]
    leaves = [{"id": i} for i in range(1, n)]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 30.3ms -> 58.6μs (51687% faster)

def test_large_forest():
    # 500 chains of 2 nodes each: 0->1, 2->3, ..., 998->999; all odd ids are leaves
    n = 1000
    nodes = [{"id": i} for i in range(n)]
    edges = [{"source": i, "target": i+1} for i in range(0, n, 2)]
    leaves = [{"id": i+1} for i in range(0, n, 2)]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 11.9ms -> 50.6μs (23378% faster)

def test_large_no_edges():
    # 1000 nodes, no edges; all are leaves
    n = 1000
    nodes = [{"id": i} for i in range(n)]
    codeflash_output = find_leaf_nodes(nodes, []); result = codeflash_output # 39.2μs -> 33.5μs (17.2% faster)

def test_large_all_edges_from_nonexistent_nodes():
    # All edges from nodes not in the nodes list; all nodes are leaves
    n = 1000
    nodes = [{"id": i} for i in range(n)]
    edges = [{"source": i+n, "target": (i+n+1)} for i in range(n)]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 31.8ms -> 60.9μs (52129% faster)

def test_large_all_nodes_in_cycle():
    # n nodes in a cycle; no leaves
    n = 1000
    nodes = [{"id": i} for i in range(n)]
    edges = [{"source": i, "target": (i+1)%n} for i in range(n)]
    codeflash_output = find_leaf_nodes(nodes, edges) # 15.9ms -> 61.3μs (25859% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

import pytest  # used for our unit tests
from src.dsa.nodes import find_leaf_nodes

# unit tests

# ---------------------------
# Basic Test Cases
# ---------------------------

def test_single_node_no_edges():
    # Single node, no edges: node is a leaf
    nodes = [{"id": 1, "label": "A"}]
    edges = []
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 458ns -> 583ns (21.4% slower)

def test_two_nodes_one_edge():
    # Two nodes, one edge from 1 to 2: node 2 is a leaf
    nodes = [{"id": 1, "label": "A"}, {"id": 2, "label": "B"}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 750ns -> 709ns (5.78% faster)

def test_three_nodes_chain():
    # Three nodes in a chain: 1->2->3, only 3 is a leaf
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 2}, {"source": 2, "target": 3}]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 917ns -> 792ns (15.8% faster)

def test_three_nodes_star():
    # One node points to two others: 1->2, 1->3; 2 and 3 are leaves
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 2}, {"source": 1, "target": 3}]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 958ns -> 833ns (15.0% faster)

def test_all_are_leaves():
    # No edges: all nodes are leaves
    nodes = [{"id": i} for i in range(5)]
    edges = []
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 750ns -> 833ns (9.96% slower)

# ---------------------------
# Edge Test Cases
# ---------------------------

def test_empty_nodes_and_edges():
    # No nodes, no edges: result is empty
    nodes = []
    edges = []
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 250ns -> 416ns (39.9% slower)

def test_edges_with_missing_nodes():
    # Edges refer to node ids not in nodes list: should ignore them
    nodes = [{"id": 1}]
    edges = [{"source": 2, "target": 3}]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 583ns -> 667ns (12.6% slower)

def test_self_loop():
    # Node with self-loop is not a leaf
    nodes = [{"id": 1}]
    edges = [{"source": 1, "target": 1}]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 542ns -> 666ns (18.6% slower)

def test_multiple_edges_from_one_node():
    # Node 1 points to 2, 3, 4; only 2,3,4 are leaves
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
    edges = [
        {"source": 1, "target": 2},
        {"source": 1, "target": 3},
        {"source": 1, "target": 4},
    ]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 1.29μs -> 917ns (40.8% faster)

def test_disconnected_graph():
    # Two disconnected subgraphs: 1->2 and 3->4, leaves are 2 and 4
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
    edges = [{"source": 1, "target": 2}, {"source": 3, "target": 4}]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 1.17μs -> 833ns (40.0% faster)

def test_multiple_edges_to_same_node():
    # 1->3, 2->3; 3 is the only leaf (it has no outgoing edges), 1 and 2 are not leaves
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 3}, {"source": 2, "target": 3}]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 959ns -> 792ns (21.1% faster)

def test_duplicate_edges():
    # 1->2 twice, 2 is a leaf, 1 is not
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}, {"source": 1, "target": 2}]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 792ns -> 750ns (5.60% faster)

def test_node_with_incoming_and_outgoing_edges():
    # 1->2, 2->3; only 3 is a leaf
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 2}, {"source": 2, "target": 3}]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 959ns -> 792ns (21.1% faster)

def test_node_with_only_incoming_edges():
    # 1->2, 3->2; 2 has only incoming, 1 and 3 have only outgoing
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 2}, {"source": 3, "target": 2}]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 1.00μs -> 833ns (20.0% faster)

def test_node_with_no_edges_in_large_graph():
    # Node 5 is isolated, should be a leaf
    nodes = [{"id": i} for i in range(1, 6)]
    edges = [{"source": 1, "target": 2}, {"source": 2, "target": 3}, {"source": 3, "target": 4}]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 1.38μs -> 1.00μs (37.5% faster)

def test_nodes_with_non_integer_ids():
    # Node ids are strings
    nodes = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
    edges = [{"source": "a", "target": "b"}, {"source": "b", "target": "c"}]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 1.12μs -> 792ns (42.0% faster)

def test_nodes_with_extra_fields():
    # Nodes have extra fields, should be preserved in output
    nodes = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 750ns -> 708ns (5.93% faster)

# ---------------------------
# Large Scale Test Cases
# ---------------------------

def test_large_chain():
    # 1000 nodes in a chain: only the last node is a leaf
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = [{"source": i, "target": i+1} for i in range(N-1)]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 15.8ms -> 61.0μs (25756% faster)

def test_large_star():
    # Node 0 points to all others: all others are leaves
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = [{"source": 0, "target": i} for i in range(1, N)]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 30.4ms -> 58.1μs (52287% faster)

def test_large_no_edges():
    # 1000 nodes, no edges: all are leaves
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = []
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 39.3μs -> 32.6μs (20.7% faster)

def test_large_half_leaves():
    # First 500 nodes point to next 500 nodes, so last 500 are leaves
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = [{"source": i, "target": i+500} for i in range(500)]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 11.6ms -> 48.1μs (24006% faster)

def test_large_sparse_graph():
    # 1000 nodes, 10 edges (0->1, 1->2, ..., 9->10); every node except 0..9 is a leaf
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = [{"source": i, "target": (i+1)%N} for i in range(10)]
    expected_leaves = [{"id": i} for i in range(N) if i not in range(10)]
    codeflash_output = find_leaf_nodes(nodes, edges); result = codeflash_output # 373μs -> 38.2μs (877% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from src.dsa.nodes import find_leaf_nodes

def test_find_leaf_nodes():
    find_leaf_nodes([{'id': '\x00'}], [(v1 := {'source': '', 'id': '\x00'}), v1, {'source': '\x01'}])

def test_find_leaf_nodes_2():
    find_leaf_nodes([{'id': ''}], [{'source': ''}])

To edit these changes git checkout codeflash/optimize-find_leaf_nodes-mc8qw2dn and push.

@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Jun 23, 2025
@codeflash-ai codeflash-ai bot requested a review from KRRT7 June 23, 2025 06:58
@KRRT7 KRRT7 closed this Jun 23, 2025
@codeflash-ai codeflash-ai bot deleted the codeflash/optimize-find_leaf_nodes-mc8qw2dn branch June 23, 2025 23:31