⚡️ Speed up method `Select.from_dict` by 9% #12

codeflash-ai · 2025-10-22T05:59:37Z

📄 9% (0.09x) speedup for `Select.from_dict` in `chromadb/execution/expression/operator.py`

⏱️ Runtime: 2.76 milliseconds → 2.52 milliseconds (best of 128 runs)

📝 Explanation and details

The optimized code achieves a **9% speedup** on the aggregate benchmark by replacing a chain of sequential if-elif conditions with a single dictionary lookup for special key mapping.

**Key optimization:**

- **Dictionary lookup vs. sequential comparisons**: Instead of checking each special key (`#id`, `#document`, etc.) with separate if-elif statements, the code now uses a pre-built `special_keys` dictionary and performs a single `k in special_keys` lookup followed by direct dictionary access.

**Why this is faster:**

- Dictionary lookups in Python are O(1) on average, while the original if-elif chain needs up to 5 string comparisons in the worst case, which is any key that matches none of the special keys (that is, every ordinary metadata key).
- The `in` operator on a dictionary performs one hash table lookup, which costs less than several consecutive string equality checks.
- Each key now costs 1 hash lookup plus, for special keys only, 1 dictionary access, instead of up to 5 string comparisons.
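The two shapes described above can be sketched roughly as follows. This is an illustration only, not the code from `operator.py`: the `Key` dataclass, the module-level constants, and the function names `keys_via_chain` and `keys_via_lookup` are hypothetical stand-ins, and the placement of the dictionary inside the function is an assumption based on the "dictionary creation overhead" mentioned in this description.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Key:
    """Hypothetical stand-in for chromadb's Key; not the real class."""
    name: str


# Hypothetical constants mirroring the five special keys named in this PR.
ID = Key("#id")
DOCUMENT = Key("#document")
EMBEDDING = Key("#embedding")
METADATA = Key("#metadata")
SCORE = Key("#score")


def keys_via_chain(names: Iterable[str]) -> Set[Key]:
    """Shape of the original: up to five string comparisons per key."""
    result: Set[Key] = set()
    for k in names:
        if k == "#id":
            result.add(ID)
        elif k == "#document":
            result.add(DOCUMENT)
        elif k == "#embedding":
            result.add(EMBEDDING)
        elif k == "#metadata":
            result.add(METADATA)
        elif k == "#score":
            result.add(SCORE)
        else:
            # Ordinary metadata keys fall through all five comparisons.
            result.add(Key(k))
    return result


def keys_via_lookup(names: Iterable[str]) -> Set[Key]:
    """Shape of the optimized version: one membership test per key."""
    # Assumption: the mapping is built once per call, before the loop.
    special_keys = {
        "#id": ID,
        "#document": DOCUMENT,
        "#embedding": EMBEDDING,
        "#metadata": METADATA,
        "#score": SCORE,
    }
    result: Set[Key] = set()
    for k in names:
        if k in special_keys:
            result.add(special_keys[k])
        else:
            result.add(Key(k))
    return result
```

Both functions return the same set for any list of strings. Only the per-key dispatch differs, which is why the effect grows with the number of keys in a single call.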

**Performance characteristics:**

- **Large-scale improvements**: The largest gains (roughly 9-20% faster) appear on test cases with about 1000 keys. The best result, 20.4%, comes from a list made up entirely of repeated special keys.
- **Small overhead for simple cases**: Most tests with only a handful of keys run roughly 4-20% slower, which amounts to well under 1μs per call. This is attributed to the dictionary creation overhead, which is amortized only when a single call processes many keys.
- **Best suited for**: Workloads that pass many keys in a single `from_dict()` call. Repeated calls with short key lists pay the setup cost each time, as the second calls in the determinism tests show (13.0% and 4.53% slower).

The optimization maintains identical functionality while trading a small constant-time setup cost per call for a lower per-key cost on larger key sets.
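The setup-cost trade-off can be explored with a small timing harness like the one below. It is a sketch under stated assumptions, not a measurement of chromadb itself: the string values stand in for `Key` objects, `convert_per_call`, `convert_hoisted`, and `best_time` are hypothetical names, and the hoisted module-level variant is shown only for comparison, not as something this PR implements.

```python
import timeit
from typing import Callable, List, Set

# Hypothetical module-level mapping, built once at import time.
_SPECIAL_KEYS = {
    "#id": "ID",
    "#document": "DOCUMENT",
    "#embedding": "EMBEDDING",
    "#metadata": "METADATA",
    "#score": "SCORE",
}


def convert_per_call(keys: List[str]) -> Set[str]:
    """Builds the mapping on every call (the assumed shape of this PR)."""
    special_keys = {
        "#id": "ID",
        "#document": "DOCUMENT",
        "#embedding": "EMBEDDING",
        "#metadata": "METADATA",
        "#score": "SCORE",
    }
    return {special_keys[k] if k in special_keys else k for k in keys}


def convert_hoisted(keys: List[str]) -> Set[str]:
    """Reuses the module-level mapping, so no per-call setup cost."""
    return {_SPECIAL_KEYS[k] if k in _SPECIAL_KEYS else k for k in keys}


def best_time(fn: Callable[[List[str]], Set[str]], keys: List[str],
              number: int = 1000, repeat: int = 5) -> float:
    """Best-of-`repeat` average seconds per call, using timeit.repeat."""
    runs = timeit.repeat(lambda: fn(keys), number=number, repeat=repeat)
    return min(runs) / number


if __name__ == "__main__":
    small: List[str] = []
    large = [f"meta_{i}" for i in range(1000)]
    for label, keys in (("empty", small), ("1000 keys", large)):
        per_call = best_time(convert_per_call, keys)
        hoisted = best_time(convert_hoisted, keys)
        print(f"{label}: per-call {per_call * 1e6:.2f}μs, "
              f"hoisted {hoisted * 1e6:.2f}μs")
```

Absolute numbers depend on the machine and Python version, so no expected output is given. The point of the comparison is that a fixed setup cost matters most when the key list is short.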

✅ Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | ✅ 44 Passed |
| 🌀 Generated Regression Tests | ✅ 83 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | ✅ 2 Passed |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `test_api.py::TestRoundTripConversion.test_select_round_trip` | 4.07μs | 4.49μs | -9.35%⚠️ |
| `test_api.py::TestSelectFromDict.test_empty_keys` | 3.29μs | 3.78μs | -13.1%⚠️ |
| `test_api.py::TestSelectFromDict.test_metadata_keys` | 4.91μs | 5.38μs | -8.79%⚠️ |
| `test_api.py::TestSelectFromDict.test_mixed_keys` | 4.64μs | 5.16μs | -10.1%⚠️ |
| `test_api.py::TestSelectFromDict.test_special_keys` | 5.25μs | 5.98μs | -12.3%⚠️ |
| `test_api.py::TestSelectFromDict.test_unexpected_keys` | 4.88μs | 5.75μs | -15.1%⚠️ |
| `test_api.py::TestSelectFromDict.test_validation` | 4.21μs | 5.14μs | -18.1%⚠️ |
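The Speedup column appears to be computed relative to the optimized runtime, as (original - optimized) / optimized. This formula is inferred from the reported numbers rather than taken from Codeflash documentation, and the helper name `speedup_pct` is hypothetical. A minimal check against two of the reported figures:

```python
def speedup_pct(original: float, optimized: float) -> float:
    """Speedup in percent, relative to the optimized runtime (inferred formula)."""
    return (original - optimized) / optimized * 100.0


# test_select_round_trip: 4.07μs -> 4.49μs, reported as -9.35%
print(f"{speedup_pct(4.07, 4.49):.2f}%")  # -9.35%

# Aggregate runtime: 2.76 ms -> 2.52 ms, reported as 9%
print(f"{speedup_pct(2.76, 2.52):.1f}%")  # 9.5%
```

Other rows match to within a few tenths of a percentage point, which is consistent with the percentages being computed from unrounded timings while the table shows rounded ones.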

🌀 Generated Regression Tests and Runtime

from dataclasses import dataclass, field
from typing import Any, Dict, Set, Union

# imports
import pytest  # used for our unit tests
from chromadb.execution.expression.operator import Select


# Minimal Key class for testing
class Key:
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        if isinstance(other, Key):
            return self.name == other.name
        return False

    def __hash__(self):
        return hash(self.name)

    def __repr__(self):
        return f"Key({self.name!r})"

# Predefined Key constants
Key.ID = Key("#id")
Key.DOCUMENT = Key("#document")
Key.EMBEDDING = Key("#embedding")
Key.METADATA = Key("#metadata")
Key.SCORE = Key("#score")
from chromadb.execution.expression.operator import Select

# unit tests

# -------- Basic Test Cases --------

def test_basic_predefined_keys():
    # Test with all predefined keys
    data = {"keys": ["#id", "#document", "#embedding", "#metadata", "#score"]}
    codeflash_output = Select.from_dict(data); result = codeflash_output # 5.41μs -> 5.79μs (6.63% slower)
    expected = {Key.ID, Key.DOCUMENT, Key.EMBEDDING, Key.METADATA, Key.SCORE}

def test_basic_metadata_keys():
    # Test with regular metadata keys
    data = {"keys": ["title", "author", "date"]}
    codeflash_output = Select.from_dict(data); result = codeflash_output # 4.93μs -> 5.20μs (5.28% slower)
    expected = {Key("title"), Key("author"), Key("date")}

def test_basic_mixed_keys():
    # Test with both predefined and metadata keys
    data = {"keys": ["#document", "title", "author"]}
    codeflash_output = Select.from_dict(data); result = codeflash_output # 4.74μs -> 4.93μs (3.76% slower)
    expected = {Key.DOCUMENT, Key("title"), Key("author")}

def test_basic_empty_keys():
    # Test with empty keys list
    data = {"keys": []}
    codeflash_output = Select.from_dict(data); result = codeflash_output # 2.84μs -> 3.32μs (14.5% slower)

def test_basic_keys_as_tuple():
    # Test with keys provided as a tuple
    data = {"keys": ("#score", "foo")}
    codeflash_output = Select.from_dict(data); result = codeflash_output # 4.55μs -> 4.89μs (7.05% slower)
    expected = {Key.SCORE, Key("foo")}

def test_basic_keys_as_set():
    # Test with keys provided as a set
    data = {"keys": {"#embedding", "bar"}}
    codeflash_output = Select.from_dict(data); result = codeflash_output # 4.58μs -> 4.88μs (6.21% slower)
    expected = {Key.EMBEDDING, Key("bar")}

def test_basic_no_keys_field():
    # Test with no 'keys' field (should default to empty set)
    data = {}
    codeflash_output = Select.from_dict(data); result = codeflash_output # 2.79μs -> 3.32μs (16.2% slower)

# -------- Edge Test Cases --------

def test_edge_non_dict_input():
    # Test with non-dict input
    with pytest.raises(TypeError):
        Select.from_dict(["keys", ["#document"]]) # 1.44μs -> 1.45μs (0.619% slower)

def test_edge_keys_not_iterable():
    # Test with keys not being a list/tuple/set
    with pytest.raises(TypeError):
        Select.from_dict({"keys": "#document"}) # 1.91μs -> 1.94μs (1.65% slower)

def test_edge_keys_contains_non_string():
    # Test with keys containing a non-string value
    with pytest.raises(TypeError):
        Select.from_dict({"keys": ["#document", 123]}) # 2.69μs -> 3.12μs (13.7% slower)

def test_edge_unexpected_extra_field():
    # Test with unexpected extra field in dict
    with pytest.raises(ValueError):
        Select.from_dict({"keys": ["title"], "extra": "value"}) # 5.41μs -> 6.02μs (10.1% slower)

def test_edge_duplicate_keys():
    # Test with duplicate keys (should be deduped in set)
    data = {"keys": ["#score", "#score", "title", "title"]}
    codeflash_output = Select.from_dict(data); result = codeflash_output # 6.88μs -> 6.57μs (4.67% faster)
    expected = {Key.SCORE, Key("title")}

def test_edge_empty_string_key():
    # Test with empty string as key (allowed, creates Key(""))
    data = {"keys": [""]}
    codeflash_output = Select.from_dict(data); result = codeflash_output # 3.83μs -> 4.39μs (12.8% slower)
    expected = {Key("")}

def test_edge_keys_is_none():
    # Test with keys set to None (should raise TypeError)
    with pytest.raises(TypeError):
        Select.from_dict({"keys": None}) # 2.10μs -> 2.02μs (3.51% faster)

def test_edge_keys_is_dict():
    # Test with keys set to a dict (should raise TypeError)
    with pytest.raises(TypeError):
        Select.from_dict({"keys": {"foo": "bar"}}) # 1.99μs -> 2.12μs (6.28% slower)

def test_edge_keys_contains_special_chars():
    # Test with keys containing special characters (allowed)
    data = {"keys": ["@foo", "bar/baz", "qux!"]}
    codeflash_output = Select.from_dict(data); result = codeflash_output # 5.17μs -> 5.59μs (7.44% slower)
    expected = {Key("@foo"), Key("bar/baz"), Key("qux!")}

def test_edge_keys_contains_unicode():
    # Test with unicode string keys
    data = {"keys": ["😀", "标题", "#score"]}
    codeflash_output = Select.from_dict(data); result = codeflash_output # 4.99μs -> 5.27μs (5.30% slower)
    expected = {Key("😀"), Key("标题"), Key.SCORE}

def test_edge_keys_is_empty_dict():
    # Test with keys as an empty dict (should raise TypeError)
    with pytest.raises(TypeError):
        Select.from_dict({"keys": {}}) # 1.92μs -> 1.92μs (0.470% faster)

def test_edge_keys_is_integer():
    # Test with keys as an integer (should raise TypeError)
    with pytest.raises(TypeError):
        Select.from_dict({"keys": 123}) # 1.93μs -> 1.97μs (2.48% slower)

def test_edge_keys_contains_none():
    # Test with keys containing None (should raise TypeError)
    with pytest.raises(TypeError):
        Select.from_dict({"keys": ["title", None]}) # 2.86μs -> 3.54μs (19.2% slower)

# -------- Large Scale Test Cases --------

def test_large_scale_many_metadata_keys():
    # Test with 1000 metadata keys
    keys = [f"meta_{i}" for i in range(1000)]
    data = {"keys": keys}
    codeflash_output = Select.from_dict(data); result = codeflash_output # 279μs -> 253μs (10.3% faster)
    expected = {Key(f"meta_{i}") for i in range(1000)}

def test_large_scale_many_predefined_and_metadata_keys():
    # Test with 995 metadata keys + 5 predefined keys
    keys = [f"meta_{i}" for i in range(995)] + ["#id", "#document", "#embedding", "#metadata", "#score"]
    data = {"keys": keys}
    codeflash_output = Select.from_dict(data); result = codeflash_output # 281μs -> 255μs (10.4% faster)
    expected = {Key(f"meta_{i}") for i in range(995)} | {Key.ID, Key.DOCUMENT, Key.EMBEDDING, Key.METADATA, Key.SCORE}

def test_large_scale_all_keys_are_duplicates():
    # Test with 1000 duplicate keys
    data = {"keys": ["title"] * 1000}
    codeflash_output = Select.from_dict(data); result = codeflash_output # 425μs -> 385μs (10.3% faster)
    expected = {Key("title")}

def test_large_scale_keys_with_long_strings():
    # Test with long string keys
    keys = [("a" * 100) + str(i) for i in range(1000)]
    data = {"keys": keys}
    codeflash_output = Select.from_dict(data); result = codeflash_output # 310μs -> 284μs (9.25% faster)
    expected = {Key(("a" * 100) + str(i)) for i in range(1000)}

def test_large_scale_keys_with_special_and_predefined():
    # Test with mix of special chars and predefined keys
    keys = [f"@meta_{i}" for i in range(995)] + ["#score", "#id", "#embedding", "#document", "#metadata"]
    data = {"keys": keys}
    codeflash_output = Select.from_dict(data); result = codeflash_output # 284μs -> 253μs (12.3% faster)
    expected = {Key(f"@meta_{i}") for i in range(995)} | {Key.SCORE, Key.ID, Key.EMBEDDING, Key.DOCUMENT, Key.METADATA}
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from dataclasses import dataclass, field
from typing import Any, Dict, Set, Union

# imports
import pytest  # used for our unit tests
from chromadb.execution.expression.operator import Select


# Minimal Key class definition for testing
class Key:
    def __init__(self, name: str):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, Key) and self.name == other.name

    def __hash__(self):
        return hash(self.name)

    def __repr__(self):
        return f"Key({self.name!r})"

# Initialize predefined key constants
Key.ID = Key("#id")
Key.DOCUMENT = Key("#document")
Key.EMBEDDING = Key("#embedding")
Key.METADATA = Key("#metadata")
Key.SCORE = Key("#score")
from chromadb.execution.expression.operator import Select

# unit tests

# ---------------------- BASIC TEST CASES ----------------------

def test_basic_select_special_keys():
    # Test with only special keys
    d = {"keys": ["#document", "#score"]}
    codeflash_output = Select.from_dict(d); sel = codeflash_output # 4.46μs -> 4.96μs (10.0% slower)

def test_basic_select_metadata_keys():
    # Test with only metadata keys
    d = {"keys": ["title", "author"]}
    codeflash_output = Select.from_dict(d); sel = codeflash_output # 4.44μs -> 4.97μs (10.7% slower)

def test_basic_select_mixed_keys():
    # Test with mixed special and metadata keys
    d = {"keys": ["#document", "title", "author"]}
    codeflash_output = Select.from_dict(d); sel = codeflash_output # 4.71μs -> 5.09μs (7.53% slower)

def test_basic_empty_keys():
    # Test with empty keys list
    d = {"keys": []}
    codeflash_output = Select.from_dict(d); sel = codeflash_output # 2.72μs -> 3.39μs (19.8% slower)

def test_basic_keys_as_tuple():
    # Test with keys as tuple
    d = {"keys": ("#embedding", "foo")}
    codeflash_output = Select.from_dict(d); sel = codeflash_output # 4.36μs -> 4.80μs (9.19% slower)

def test_basic_keys_as_set():
    # Test with keys as set
    d = {"keys": {"#score", "bar"}}
    codeflash_output = Select.from_dict(d); sel = codeflash_output # 4.51μs -> 4.96μs (9.07% slower)

def test_basic_missing_keys_field():
    # Test with missing keys field (should default to empty)
    d = {}
    codeflash_output = Select.from_dict(d); sel = codeflash_output # 2.72μs -> 3.30μs (17.5% slower)

# ---------------------- EDGE TEST CASES ----------------------

def test_edge_non_dict_input():
    # Test with non-dict input
    with pytest.raises(TypeError):
        Select.from_dict(["keys", ["#document"]]) # 1.46μs -> 1.44μs (1.46% faster)

def test_edge_keys_not_list_tuple_set():
    # Test with keys field not a list/tuple/set
    with pytest.raises(TypeError):
        Select.from_dict({"keys": "#document"}) # 1.94μs -> 1.93μs (0.414% faster)

def test_edge_keys_contains_non_string():
    # Test with keys containing non-string
    with pytest.raises(TypeError):
        Select.from_dict({"keys": ["#document", 123]}) # 2.70μs -> 3.20μs (15.8% slower)

def test_edge_unexpected_keys_in_dict():
    # Test with unexpected keys in the input dict
    with pytest.raises(ValueError) as excinfo:
        Select.from_dict({"keys": ["#document"], "foo": "bar"}) # 5.06μs -> 5.67μs (10.7% slower)

def test_edge_duplicate_keys():
    # Test with duplicate keys (should deduplicate in set)
    d = {"keys": ["#document", "#document", "title", "title"]}
    codeflash_output = Select.from_dict(d); sel = codeflash_output # 6.39μs -> 6.96μs (8.18% slower)

def test_edge_special_key_case_sensitivity():
    # Test with special key in wrong case (should be treated as metadata)
    d = {"keys": ["#Document"]}
    codeflash_output = Select.from_dict(d); sel = codeflash_output # 3.74μs -> 4.25μs (12.0% slower)

def test_edge_empty_string_key():
    # Test with empty string as key (should be accepted as metadata)
    d = {"keys": [""]}
    codeflash_output = Select.from_dict(d); sel = codeflash_output # 3.65μs -> 3.98μs (8.10% slower)

def test_edge_all_special_keys():
    # Test with all special keys
    d = {"keys": ["#id", "#document", "#embedding", "#metadata", "#score"]}
    codeflash_output = Select.from_dict(d); sel = codeflash_output # 5.31μs -> 5.13μs (3.63% faster)

def test_edge_keys_with_spaces():
    # Test with metadata keys containing spaces
    d = {"keys": ["first name", "last name"]}
    codeflash_output = Select.from_dict(d); sel = codeflash_output # 4.24μs -> 4.68μs (9.54% slower)

def test_edge_keys_with_unicode():
    # Test with metadata keys containing unicode characters
    d = {"keys": ["ключ", "标题"]}
    codeflash_output = Select.from_dict(d); sel = codeflash_output # 4.07μs -> 4.58μs (11.2% slower)

# ---------------------- LARGE SCALE TEST CASES ----------------------

def test_large_scale_many_metadata_keys():
    # Test with a large number of metadata keys
    keys = [f"meta_{i}" for i in range(1000)]
    d = {"keys": keys}
    codeflash_output = Select.from_dict(d); sel = codeflash_output # 279μs -> 251μs (11.0% faster)
    expected = {Key(k) for k in keys}

def test_large_scale_many_special_keys():
    # Test with repeated special keys (should deduplicate)
    keys = ["#id", "#document", "#embedding", "#metadata", "#score"] * 200
    d = {"keys": keys}
    codeflash_output = Select.from_dict(d); sel = codeflash_output # 134μs -> 111μs (20.4% faster)
    expected = {Key.ID, Key.DOCUMENT, Key.EMBEDDING, Key.METADATA, Key.SCORE}

def test_large_scale_mixed_keys():
    # Test with mix of special and metadata keys
    keys = ["#document", "#score"] + [f"meta_{i}" for i in range(998)]
    d = {"keys": keys}
    codeflash_output = Select.from_dict(d); sel = codeflash_output # 286μs -> 257μs (11.6% faster)
    expected = {Key.DOCUMENT, Key.SCORE} | {Key(f"meta_{i}") for i in range(998)}

def test_large_scale_performance():
    # Test performance for large input (not strict timing, but should not hang)
    keys = [f"key_{i}" for i in range(1000)]
    d = {"keys": keys}
    codeflash_output = Select.from_dict(d); sel = codeflash_output # 284μs -> 255μs (11.5% faster)

# ---------------------- DETERMINISM TEST CASES ----------------------

def test_determinism_same_input_same_output():
    # Test that the same input always produces the same output
    d = {"keys": ["#document", "title"]}
    codeflash_output = Select.from_dict(d); sel1 = codeflash_output # 4.60μs -> 5.16μs (10.8% slower)
    codeflash_output = Select.from_dict(d); sel2 = codeflash_output # 1.81μs -> 2.08μs (13.0% slower)

def test_determinism_set_equality():
    # Test that set equality works regardless of insertion order
    d1 = {"keys": ["#document", "title"]}
    d2 = {"keys": ["title", "#document"]}
    codeflash_output = Select.from_dict(d1); sel1 = codeflash_output # 3.90μs -> 4.37μs (10.8% slower)
    codeflash_output = Select.from_dict(d2); sel2 = codeflash_output # 1.96μs -> 2.05μs (4.53% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from chromadb.execution.expression.operator import Select
import pytest

def test_Select_from_dict():
    with pytest.raises(ValueError, match="Unexpected\\ keys\\ in\\ Select\\ dict:\\ \\{''\\}"):
        Select.from_dict({'keys': ('\x00\x00\x00'), '': 0})

def test_Select_from_dict_2():
    Select.from_dict({'keys': ['#id']})

def test_Select_from_dict_3():
    with pytest.raises(TypeError, match='Select\\ key\\ must\\ be\\ a\\ string,\\ got\\ int'):
        Select.from_dict({'keys': (0)})

def test_Select_from_dict_4():
    with pytest.raises(TypeError, match='Select\\ keys\\ must\\ be\\ a\\ list/tuple/set,\\ got\\ str'):
        Select.from_dict({'\x00\x00\x00\x00': '', 'keys': ''})

def test_Select_from_dict_5():
    with pytest.raises(ValueError, match="Unexpected\\ keys\\ in\\ Select\\ dict:\\ \\{'\\\\x00\\\\x00\\\\x00\\\\x00'\\}"):
        Select.from_dict({'keys': ['#metadata'], '\x00\x00\x00\x00': 0})

🔎 Concolic Coverage Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `codeflash_concolic_aqrniplu/tmp90wvxlim/test_concolic_coverage.py::test_Select_from_dict_2` | 4.92μs | 5.71μs | -13.8%⚠️ |
| `codeflash_concolic_aqrniplu/tmp90wvxlim/test_concolic_coverage.py::test_Select_from_dict_4` | 2.45μs | 2.39μs | 2.85%✅ |
| `codeflash_concolic_aqrniplu/tmp90wvxlim/test_concolic_coverage.py::test_Select_from_dict_5` | 5.61μs | 6.63μs | -15.4%⚠️ |

To edit these changes, run `git checkout codeflash/optimize-Select.from_dict-mh1l2n7v`, commit your changes, and push.


codeflash-ai bot requested a review from mashraf-222 October 22, 2025 05:59

codeflash-ai bot added the "⚡️ codeflash" label (Optimization PR opened by Codeflash AI) Oct 22, 2025
