@codeflash-ai codeflash-ai bot commented Oct 22, 2025

📄 46% (0.46x) speedup for parse_topic_name in chromadb/ingest/impl/utils.py

⏱️ Runtime : 3.05 milliseconds → 2.08 milliseconds (best of 161 runs)

📝 Explanation and details

The optimization achieves a 46% speedup by precompiling the regex pattern once at module scope instead of passing the pattern string to re.match() on every function call.

Key optimization:

  • Moved the regex pattern compilation from inside the function to module scope as _topic_pattern = re.compile(...)
  • Changed re.match(topic_regex, topic_name) to _topic_pattern.match(topic_name)
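A minimal before/after sketch of the change listed above. The function bodies are assumptions reconstructed from the regex and the "Invalid topic name:" error message that appear in the generated tests; they are not copied from chromadb/ingest/impl/utils.py, and the `_before` / `_after` suffixes are hypothetical names used only so both variants can sit side by side.

```python
import re
from typing import Tuple

# Pattern as it appears in the generated tests.
topic_regex = r"persistent:\/\/(?P<tenant>.+)\/(?P<namespace>.+)\/(?P<topic>.+)"

# Compiled once at import time (name taken from the PR description).
_topic_pattern = re.compile(topic_regex)


def parse_topic_name_before(topic_name: str) -> Tuple[str, str, str]:
    """Assumed original shape: the pattern string goes through re.match()."""
    match = re.match(topic_regex, topic_name)
    if not match:
        raise ValueError(f"Invalid topic name: {topic_name}")
    return match.group("tenant"), match.group("namespace"), match.group("topic")


def parse_topic_name_after(topic_name: str) -> Tuple[str, str, str]:
    """Assumed optimized shape: the precompiled pattern is used directly."""
    match = _topic_pattern.match(topic_name)
    if not match:
        raise ValueError(f"Invalid topic name: {topic_name}")
    return match.group("tenant"), match.group("namespace"), match.group("topic")
```

Under these assumptions both variants return `("tenant1", "namespace1", "topic1")` for `"persistent://tenant1/namespace1/topic1"` and raise `ValueError` for an empty string; only the route to the compiled pattern differs.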

Why this is faster:
In the original code, every call passes the pattern string to re.match(), which must look the pattern up in the re module's internal cache (and compile it on a cache miss) before it can match. The line profiler shows the re.match() call taking 77% of total runtime (6.66ms out of 8.65ms). The optimized version reduces this to 52.7% (2.30ms out of 4.36ms) by calling the precompiled pattern's match() method directly.
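The per-call overhead can be checked locally with a small timeit comparison. This is a sketch, not the Codeflash benchmark harness: the constant and function names are hypothetical, and absolute timings will vary by machine and Python version. Note that CPython's re module caches compiled patterns, so what re.match() pays on each call is largely a cache lookup plus an extra function call rather than a full recompile.

```python
import re
import timeit

# Pattern as it appears in the generated tests.
TOPIC_REGEX = r"persistent:\/\/(?P<tenant>.+)\/(?P<namespace>.+)\/(?P<topic>.+)"
COMPILED = re.compile(TOPIC_REGEX)
SAMPLE = "persistent://tenant1/namespace1/topic1"


def via_module_function():
    # Goes through re.match(), which resolves the pattern string on every call.
    return re.match(TOPIC_REGEX, SAMPLE)


def via_compiled_pattern():
    # Calls the already compiled pattern object directly.
    return COMPILED.match(SAMPLE)


def compare(number: int = 100_000) -> dict:
    """Return total seconds spent in each variant over `number` calls."""
    return {
        "re.match": timeit.timeit(via_module_function, number=number),
        "compiled.match": timeit.timeit(via_compiled_pattern, number=number),
    }


if __name__ == "__main__":
    for label, seconds in compare().items():
        print(f"{label}: {seconds:.4f}s")
```

Both variants produce identical match groups; only the time per call should differ.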

Performance characteristics:

  • Valid topics: 40-55% faster on average
  • Invalid topics: roughly 55-83% faster (a match that fails early does little work, so the fixed per-call overhead is a larger share of the total)
  • Large-scale operations: Maintains consistent speedup (38% faster for 1000 iterations)
  • Long inputs: Smaller but still measurable gains (roughly 10-39% for very long strings), because matching time grows with input length while the saved overhead stays fixed

The optimization is particularly effective for applications that parse many topic names, as the pattern is compiled once at import time and every call then skips the per-call pattern lookup.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 3052 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 2 Passed |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import re
from typing import Tuple

# imports
import pytest  # used for our unit tests
from chromadb.ingest.impl.utils import parse_topic_name

topic_regex = r"persistent:\/\/(?P<tenant>.+)\/(?P<namespace>.+)\/(?P<topic>.+)"

# unit tests

# ---------------------------
# Basic Test Cases
# ---------------------------

def test_basic_valid_topic():
    # Test a normal, valid topic name
    topic = "persistent://tenant1/namespace1/topic1"
    codeflash_output = parse_topic_name(topic) # 3.56μs -> 2.46μs (44.8% faster)

def test_basic_valid_topic_with_special_chars():
    # Test valid topic name with special characters
    topic = "persistent://ten-ant_2/nam-espace_3/top.ic-4"
    codeflash_output = parse_topic_name(topic) # 3.39μs -> 2.27μs (48.9% faster)

def test_basic_valid_topic_with_numbers():
    # Test valid topic name with numbers
    topic = "persistent://tenant123/namespace456/topic789"
    codeflash_output = parse_topic_name(topic) # 3.35μs -> 2.15μs (55.9% faster)

def test_basic_valid_topic_with_spaces():
    # Test valid topic name with spaces
    topic = "persistent://tenant a/namespace b/topic c"
    codeflash_output = parse_topic_name(topic) # 3.30μs -> 2.15μs (53.7% faster)

# ---------------------------
# Edge Test Cases
# ---------------------------

def test_edge_missing_prefix():
    # Topic missing the 'persistent://' prefix should raise ValueError
    topic = "tenant1/namespace1/topic1"
    with pytest.raises(ValueError):
        parse_topic_name(topic) # 2.55μs -> 1.40μs (81.7% faster)

def test_edge_missing_tenant():
    # Topic missing tenant part should raise ValueError
    topic = "persistent:///namespace1/topic1"
    with pytest.raises(ValueError):
        parse_topic_name(topic) # 3.05μs -> 1.94μs (57.1% faster)

def test_edge_missing_namespace():
    # Topic missing namespace part should raise ValueError
    topic = "persistent://tenant1//topic1"
    with pytest.raises(ValueError):
        parse_topic_name(topic) # 3.09μs -> 1.97μs (57.3% faster)

def test_edge_missing_topic():
    # Topic missing topic part should raise ValueError
    topic = "persistent://tenant1/namespace1/"
    with pytest.raises(ValueError):
        parse_topic_name(topic) # 3.06μs -> 1.95μs (57.0% faster)

def test_edge_empty_string():
    # Empty string should raise ValueError
    topic = ""
    with pytest.raises(ValueError):
        parse_topic_name(topic) # 2.49μs -> 1.40μs (78.0% faster)

def test_edge_only_prefix():
    # Only prefix, no tenant/namespace/topic
    topic = "persistent://"
    with pytest.raises(ValueError):
        parse_topic_name(topic) # 2.49μs -> 1.36μs (82.4% faster)

def test_edge_extra_slashes():
    # Extra slash-separated segment after the topic
    topic = "persistent://tenant1/namespace1/topic1/extra"
    # Matches, but the greedy groups give tenant='tenant1/namespace1', namespace='topic1', topic='extra'
    codeflash_output = parse_topic_name(topic) # 3.80μs -> 2.64μs (43.8% faster)

def test_edge_minimal_valid():
    # Minimal valid topic names (single char)
    topic = "persistent://t/n/s"
    codeflash_output = parse_topic_name(topic) # 3.20μs -> 2.08μs (53.8% faster)

def test_edge_unicode_characters():
    # Unicode characters in tenant, namespace, topic
    topic = "persistent://租户/命名空间/主题"
    codeflash_output = parse_topic_name(topic) # 3.97μs -> 3.03μs (31.2% faster)

def test_edge_slash_in_topic():
    # Extra slashes are accepted, but the greedy groups give tenant='tenant1/namespace1/topic', namespace='with', topic='slash'
    topic = "persistent://tenant1/namespace1/topic/with/slash"
    codeflash_output = parse_topic_name(topic) # 3.31μs -> 2.23μs (48.4% faster)

def test_edge_leading_trailing_spaces():
    # Leading/trailing spaces in tenant, namespace, topic
    topic = "persistent:// tenant / namespace / topic "
    codeflash_output = parse_topic_name(topic) # 3.27μs -> 2.13μs (53.2% faster)

def test_edge_prefix_case_sensitive():
    # Prefix is case sensitive, so 'Persistent://' should fail
    topic = "Persistent://tenant1/namespace1/topic1"
    with pytest.raises(ValueError):
        parse_topic_name(topic) # 2.53μs -> 1.38μs (82.7% faster)

def test_edge_non_string_input():
    # Non-string input should raise TypeError
    with pytest.raises(TypeError):
        parse_topic_name(None) # 2.59μs -> 1.34μs (93.9% faster)
    with pytest.raises(TypeError):
        parse_topic_name(123) # 1.38μs -> 705ns (95.9% faster)

def test_edge_tenant_namespace_topic_with_slashes():
    # Extra slashes in the name (the greedy tenant group absorbs the leading segments)
    topic = "persistent://ten/ant/name/space/topic"
    # Matches with tenant='ten/ant/name', namespace='space', topic='topic'
    codeflash_output = parse_topic_name(topic) # 3.42μs -> 2.39μs (43.2% faster)


def test_edge_tenant_namespace_topic_with_only_slashes():
    # Only three slashes after the prefix: too short for three non-empty groups plus two separators
    topic = "persistent://///"
    with pytest.raises(ValueError):
        parse_topic_name(topic) # 3.07μs -> 1.78μs (72.6% faster)

# ---------------------------
# Large Scale Test Cases
# ---------------------------

def test_large_scale_long_names():
    # Very long tenant, namespace, topic names
    tenant = "t" * 500
    namespace = "n" * 400
    topic = "top" * 100
    topic_name = f"persistent://{tenant}/{namespace}/{topic}"
    codeflash_output = parse_topic_name(topic_name) # 11.9μs -> 10.6μs (12.1% faster)

def test_large_scale_many_topics():
    # Test parsing many valid topic names in a loop
    for i in range(1, 1001):  # 1000 topics
        topic_name = f"persistent://tenant{i}/namespace{i}/topic{i}"
        codeflash_output = parse_topic_name(topic_name) # 1.06ms -> 769μs (38.3% faster)

def test_large_scale_many_invalid_topics():
    # Test many invalid topic names in a loop
    for i in range(1, 1001):
        topic_name = f"tenant{i}/namespace{i}/topic{i}"  # missing prefix
        with pytest.raises(ValueError):
            parse_topic_name(topic_name)

def test_large_scale_topic_with_many_slashes():
    # Name with many slash-separated segments
    topic_name = "persistent://tenant/ns/" + "/".join([f"subtopic{i}" for i in range(100)])
    expected_topic = "subtopic99"  # greedy groups leave only the last segment in `topic`
    codeflash_output = parse_topic_name(topic_name) # 5.19μs -> 3.74μs (38.9% faster)

def test_large_scale_unicode_names():
    # Large unicode names
    tenant = "租" * 100
    namespace = "命" * 100
    topic = "题" * 100
    topic_name = f"persistent://{tenant}/{namespace}/{topic}"
    codeflash_output = parse_topic_name(topic_name) # 6.96μs -> 5.73μs (21.6% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import re
from typing import Tuple

# imports
import pytest  # used for our unit tests
from chromadb.ingest.impl.utils import parse_topic_name

topic_regex = r"persistent:\/\/(?P<tenant>.+)\/(?P<namespace>.+)\/(?P<topic>.+)"

# unit tests

# ----------------------
# Basic Test Cases
# ----------------------

def test_basic_valid_topic():
    # Standard topic name
    topic = "persistent://tenant1/namespace1/topic1"
    codeflash_output = parse_topic_name(topic) # 3.46μs -> 2.29μs (51.4% faster)

def test_basic_valid_topic_with_special_chars():
    # Topic name with allowed special characters
    topic = "persistent://tenant-2/namespace_2/topic.2"
    codeflash_output = parse_topic_name(topic) # 3.25μs -> 2.22μs (46.2% faster)

def test_basic_valid_topic_with_numbers():
    # Topic name with numbers
    topic = "persistent://tenant123/namespace456/topic789"
    codeflash_output = parse_topic_name(topic) # 3.27μs -> 2.12μs (54.8% faster)

def test_basic_valid_topic_with_mixed_case():
    # Topic name with mixed case
    topic = "persistent://TenAnt/NameSpace/ToPiC"
    codeflash_output = parse_topic_name(topic) # 3.26μs -> 2.19μs (48.6% faster)

# ----------------------
# Edge Test Cases
# ----------------------

def test_missing_persistent_prefix():
    # Missing 'persistent://' prefix
    topic = "tenant1/namespace1/topic1"
    with pytest.raises(ValueError):
        parse_topic_name(topic) # 2.56μs -> 1.43μs (79.2% faster)

def test_missing_tenant():
    # Missing tenant part
    topic = "persistent:///namespace1/topic1"
    with pytest.raises(ValueError):
        parse_topic_name(topic) # 3.09μs -> 1.92μs (61.1% faster)

def test_missing_namespace():
    # Missing namespace part
    topic = "persistent://tenant1//topic1"
    with pytest.raises(ValueError):
        parse_topic_name(topic) # 3.14μs -> 1.95μs (61.0% faster)

def test_missing_topic():
    # Missing topic part
    topic = "persistent://tenant1/namespace1/"
    with pytest.raises(ValueError):
        parse_topic_name(topic) # 3.29μs -> 2.01μs (63.5% faster)

def test_empty_string():
    # Empty input string
    topic = ""
    with pytest.raises(ValueError):
        parse_topic_name(topic) # 2.34μs -> 1.39μs (67.7% faster)

def test_only_prefix():
    # Only the persistent prefix
    topic = "persistent://"
    with pytest.raises(ValueError):
        parse_topic_name(topic) # 2.50μs -> 1.37μs (82.9% faster)

def test_extra_slashes():
    # Extra slash-separated segment after the topic
    topic = "persistent://tenant1/namespace1/topic1/extra"
    # Matches; the greedy groups give tenant='tenant1/namespace1', namespace='topic1', topic='extra'
    codeflash_output = parse_topic_name(topic) # 3.69μs -> 2.58μs (42.6% faster)

def test_slash_in_tenant_namespace_topic():
    # Extra slashes throughout the name
    topic = "persistent://ten/ant/name/space/to/pic"
    # Greedy groups parse this as tenant='ten/ant/name/space', namespace='to', topic='pic'
    codeflash_output = parse_topic_name(topic) # 3.17μs -> 2.11μs (49.8% faster)

def test_leading_trailing_spaces():
    # Leading/trailing spaces
    topic = " persistent://tenant1/namespace1/topic1 "
    with pytest.raises(ValueError):
        parse_topic_name(topic) # 2.47μs -> 1.36μs (81.6% faster)

def test_unicode_characters():
    # Unicode characters in tenant, namespace, topic
    topic = "persistent://租户/命名空间/主题"
    codeflash_output = parse_topic_name(topic) # 4.03μs -> 2.97μs (35.9% faster)

def test_empty_tenant_namespace_topic():
    # All parts empty
    topic = "persistent://///"
    with pytest.raises(ValueError):
        parse_topic_name(topic) # 2.40μs -> 1.35μs (77.1% faster)

def test_tenant_namespace_topic_are_spaces():
    # All parts are spaces
    topic = "persistent:// / / "
    codeflash_output = parse_topic_name(topic) # 3.33μs -> 2.33μs (42.6% faster)

def test_topic_with_newline():
    # Trailing newline: '.' does not match a newline, so the match succeeds with topic='topic1'
    topic = "persistent://tenant1/namespace1/topic1\n"
    codeflash_output = parse_topic_name(topic) # 3.28μs -> 2.28μs (43.8% faster)

def test_topic_with_tabs():
    # Topic part contains a tab
    topic = "persistent://tenant1/namespace1/topic1\t"
    codeflash_output = parse_topic_name(topic) # 3.20μs -> 2.17μs (47.6% faster)

def test_topic_with_url_encoded_chars():
    # Topic part contains URL encoded characters
    topic = "persistent://tenant1/namespace1/topic%201"
    codeflash_output = parse_topic_name(topic) # 3.26μs -> 2.22μs (47.2% faster)

def test_topic_with_multiple_colons():
    # Multiple colons in tenant/namespace/topic
    topic = "persistent://ten:ant/name:space/to:pic"
    codeflash_output = parse_topic_name(topic) # 3.14μs -> 2.10μs (49.0% faster)

def test_topic_with_empty_topic_and_valid_tenant_namespace():
    # Empty topic part but valid tenant and namespace
    topic = "persistent://tenant1/namespace1/"
    with pytest.raises(ValueError):
        parse_topic_name(topic) # 3.04μs -> 1.98μs (53.5% faster)

# ----------------------
# Large Scale Test Cases
# ----------------------

def test_large_topic_names():
    # Large tenant, namespace, and topic names
    tenant = "t" * 500
    namespace = "n" * 400
    topic = "x" * 300
    topic_str = f"persistent://{tenant}/{namespace}/{topic}"
    codeflash_output = parse_topic_name(topic_str) # 11.2μs -> 10.2μs (9.83% faster)

def test_many_unique_topics():
    # Test parsing many unique topic names
    for i in range(1000):  # 1000 is within the allowed limit
        tenant = f"tenant{i}"
        namespace = f"namespace{i}"
        topic = f"topic{i}"
        topic_str = f"persistent://{tenant}/{namespace}/{topic}"
        codeflash_output = parse_topic_name(topic_str) # 1.05ms -> 761μs (38.3% faster)

def test_large_topic_with_slashes():
    # Large topic part with many slashes
    topic = "/".join([f"subtopic{i}" for i in range(200)])
    topic_str = f"persistent://tenant1/namespace1/{topic}"
    codeflash_output = parse_topic_name(topic_str) # 5.37μs -> 4.00μs (34.2% faster)

def test_large_unicode_topic():
    # Large unicode topic part
    tenant = "租户" * 100
    namespace = "命名空间" * 100
    topic = "主题" * 100
    topic_str = f"persistent://{tenant}/{namespace}/{topic}"
    codeflash_output = parse_topic_name(topic_str) # 10.8μs -> 9.52μs (13.7% faster)

def test_large_topic_with_special_chars():
    # Large topic with special characters
    tenant = "t" * 100 + "!@#$%^&*()"
    namespace = "n" * 100 + "~`[]{}"
    topic = "x" * 100 + "<>?/|\\"
    topic_str = f"persistent://{tenant}/{namespace}/{topic}"
    codeflash_output = parse_topic_name(topic_str) # 4.11μs -> 2.97μs (38.6% faster)

# ----------------------
# Mutation Testing Guards
# ----------------------

def test_mutation_guard_tenant_namespace_topic_order():
    # Ensure order of return values is correct
    topic = "persistent://first/second/third"
    codeflash_output = parse_topic_name(topic); result = codeflash_output # 3.13μs -> 2.07μs (51.5% faster)

def test_mutation_guard_strict_prefix():
    # Ensure strict matching of 'persistent://' prefix
    topic = "persiStent://tenant1/namespace1/topic1"
    with pytest.raises(ValueError):
        parse_topic_name(topic) # 2.50μs -> 1.43μs (75.2% faster)

def test_mutation_guard_no_partial_match():
    # Ensure partial matches do not pass
    topic = "persistent://tenant1/namespace1"
    with pytest.raises(ValueError):
        parse_topic_name(topic) # 3.16μs -> 2.03μs (55.7% faster)

def test_mutation_guard_invalid_characters():
    # Ensure invalid characters do not break parsing
    topic = "persistent://ten\0ant/nam\0espace/top\0ic"
    codeflash_output = parse_topic_name(topic) # 3.49μs -> 2.42μs (44.1% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from chromadb.ingest.impl.utils import parse_topic_name
import pytest

def test_parse_topic_name():
    parse_topic_name('persistent://\x00/0//0')

def test_parse_topic_name_2():
    with pytest.raises(ValueError, match='Invalid\\ topic\\ name:\\ '):
        parse_topic_name('')
🔎 Concolic Coverage Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| codeflash_concolic_aqrniplu/tmpgk4oon_r/test_concolic_coverage.py::test_parse_topic_name | 3.89μs | 2.36μs | 64.7%✅ |
| codeflash_concolic_aqrniplu/tmpgk4oon_r/test_concolic_coverage.py::test_parse_topic_name_2 | 2.32μs | 1.52μs | 53.0%✅ |

To edit these changes, git checkout codeflash/optimize-parse_topic_name-mh1pv4hw and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 22, 2025 08:13
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 22, 2025