@codeflash-ai codeflash-ai bot commented Oct 22, 2025

📄 31% (0.31x) speedup for maybe_set_tenant_and_database in chromadb/auth/utils/__init__.py

⏱️ Runtime: 2.02 milliseconds → 1.54 milliseconds (best of 52 runs)

📝 Explanation and details

The optimization replaces an expensive set-based approach with a single-pass linear scan in the _singleton_tenant_database_if_applicable function.

Key Changes:

  • Eliminated set creation: The original code converted user_databases to a set (set(user_databases)), which is an O(n) operation that always visits every element and carries significant overhead for large lists
  • Eliminated list conversion: Removed list(user_databases_set)[0] which adds another conversion step
  • Single-pass algorithm: Replaced with a linear scan that tracks a single unique database value, immediately breaking on wildcards ("*") or multiple unique values
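
The change described above can be sketched roughly as follows. This is a minimal illustration, not the actual chromadb source: the helper names `singleton_database_set` and `singleton_database_scan` are hypothetical, and each takes the database list directly instead of a `UserIdentity`.

```python
from typing import List, Optional


def singleton_database_set(databases: Optional[List[str]]) -> Optional[str]:
    """Set-based approach (assumed shape of the original logic)."""
    if not databases:
        return None
    unique = set(databases)  # hashes every element, even when the answer is already known
    if len(unique) == 1 and "*" not in unique:
        return list(unique)[0]  # second conversion just to read the only element
    return None


def singleton_database_scan(databases: Optional[List[str]]) -> Optional[str]:
    """Single-pass approach: stop at the first wildcard or second distinct value."""
    if not databases:
        return None
    found: Optional[str] = None
    for db in databases:
        if db == "*":
            return None  # a wildcard means there is no single database to fill in
        if found is None:
            found = db
        elif db != found:
            return None  # a second distinct value means the choice is ambiguous
    return found
```

Both helpers should return the same result for any list of strings; only the amount of work done on the way differs.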

Why it's faster:

  • Avoids hash table overhead: Set creation involves hashing each element and building a hash table structure
  • Early termination: The loop breaks immediately when finding "*" or a second unique database, avoiding processing the entire list
  • Reduced memory allocations: No intermediate data structures are created
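
To make the early-termination point concrete, the sketch below counts how many list elements a single-pass scan inspects before it can decide. The function `scan_with_count` is a hypothetical instrumented helper written for this illustration; it is not part of chromadb.

```python
from typing import List, Optional, Tuple


def scan_with_count(databases: List[str]) -> Tuple[Optional[str], int]:
    """Return (singleton database or None, number of elements inspected)."""
    found: Optional[str] = None
    inspected = 0
    for db in databases:
        inspected += 1
        if db == "*":
            return None, inspected  # wildcard: stop immediately
        if found is None:
            found = db
        elif db != found:
            return None, inspected  # second distinct value: stop immediately
    return found, inspected


unique = [f"db{i}" for i in range(1000)]
identical = ["dbX"] * 1000
wildcard_first = ["*"] + unique

print(scan_with_count(unique))          # (None, 2)
print(scan_with_count(identical))       # ('dbX', 1000)
print(scan_with_count(wildcard_first))  # (None, 1)
```

A set-based version would touch all elements in each of these three cases, which is why the gap is largest for long lists of distinct names and reverses for long lists of identical names.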

Performance characteristics by test case:

  • Large unique database lists (1000+ databases): Massive speedup (~1800%) because the algorithm terminates after finding the second unique value rather than processing all 1000 items
  • Single database cases: Moderate speedup (~30-45%) from avoiding set/list conversions
  • Wildcard databases: Good speedup (~18-45%) from early termination when "*" is encountered
  • Large duplicate lists: Slower (~80%) because the Python-level loop must compare every duplicate one by one, whereas the original collapsed them in a single C-level set() call

The optimization excels when databases contain many unique values or wildcards (common in real-world auth scenarios) but performs worse only in the edge case of very large lists with all identical values.
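
The trade-off can be checked locally with a small `timeit` harness along these lines. The functions `via_set` and `via_scan` are hypothetical stand-ins for the two approaches, not the chromadb implementation, and absolute timings will vary by machine and Python version, so no expected numbers are given.

```python
import timeit
from typing import Dict, List, Optional


def via_set(dbs: List[str]) -> Optional[str]:
    # Assumed shape of the original: build a set, then read its only element.
    unique = set(dbs)
    if len(unique) == 1 and "*" not in unique:
        return list(unique)[0]
    return None


def via_scan(dbs: List[str]) -> Optional[str]:
    # Single pass that stops at a wildcard or a second distinct value.
    found: Optional[str] = None
    for db in dbs:
        if db == "*":
            return None
        if found is None:
            found = db
        elif db != found:
            return None
    return found


def compare(dbs: List[str], number: int = 2000) -> Dict[str, float]:
    """Total seconds taken by each approach over `number` calls."""
    return {
        fn.__name__: timeit.timeit(lambda: fn(dbs), number=number)
        for fn in (via_set, via_scan)
    }


if __name__ == "__main__":
    workloads = {
        "1000 unique": [f"db{i}" for i in range(1000)],
        "1000 identical": ["dbX"] * 1000,
        "999 unique + wildcard": [f"db{i}" for i in range(999)] + ["*"],
        "single database": ["db1"],
    }
    for label, dbs in workloads.items():
        timings = compare(dbs)
        print(f"{label:>22}: set={timings['via_set']:.4f}s scan={timings['via_scan']:.4f}s")
```

Running it on a list shaped like a real deployment's database grants is a quick way to see which side of the trade-off applies.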

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 6 Passed |
| 🌀 Generated Regression Tests | 3541 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 2 Passed |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| auth/test_auth_utils.py::test_doesnt_overrite_from_auth | 1.56μs | 1.59μs | -2.20%⚠️ |
| auth/test_auth_utils.py::test_doesnt_overrite_from_auth_when_ambiguous | 3.65μs | 3.14μs | 15.9%✅ |
| auth/test_auth_utils.py::test_errors_when_provided_tenant_and_database_dont_match_from_auth | 5.26μs | 4.23μs | 24.3%✅ |
| auth/test_auth_utils.py::test_sets_tenant_and_database_when_none_or_default_provided | 4.07μs | 2.98μs | 36.5%✅ |

🌀 Generated Regression Tests and Runtime
from typing import List, Optional, Tuple

# imports
import pytest
from chromadb.auth.utils.__init__ import maybe_set_tenant_and_database

# --- Begin: Minimal stubs for dependencies ---

# Simulate chromadb.auth.UserIdentity
class UserIdentity:
    def __init__(self, tenant: Optional[str], databases: Optional[List[str]]):
        self.tenant = tenant
        self.databases = databases

# Simulate chromadb.config
DEFAULT_TENANT = "default-tenant"
DEFAULT_DATABASE = "default-database"

# Simulate chromadb.errors
class ChromaAuthError(Exception):
    pass

# unit tests

# -----------------------
# Basic Test Cases
# -----------------------

def test_no_overwrite_returns_none():
    # If overwrite flag is False, always returns input tenant/database unchanged
    user = UserIdentity(tenant="tenant1", databases=["db1"])
    t, d = maybe_set_tenant_and_database(user, False) # 894ns -> 954ns (6.29% slower)

def test_singleton_tenant_and_database_autofill():
    # Single tenant and single database, autofills both
    user = UserIdentity(tenant="tenant1", databases=["db1"])
    t, d = maybe_set_tenant_and_database(user, True) # 2.27μs -> 1.63μs (39.3% faster)

def test_multiple_databases_returns_none_for_database():
    # Multiple databases, should not autofill database
    user = UserIdentity(tenant="tenant1", databases=["db1", "db2"])
    t, d = maybe_set_tenant_and_database(user, True) # 1.73μs -> 1.74μs (0.689% slower)

def test_multiple_tenants_returns_none_for_tenant():
    # Tenant is "*", should not autofill tenant
    user = UserIdentity(tenant="*", databases=["db1"])
    t, d = maybe_set_tenant_and_database(user, True) # 2.12μs -> 1.47μs (44.5% faster)

def test_wildcard_database_returns_none_for_database():
    # Database is "*", should not autofill database
    user = UserIdentity(tenant="tenant1", databases=["*"])
    t, d = maybe_set_tenant_and_database(user, True) # 1.78μs -> 1.50μs (18.6% faster)

def test_no_tenant_no_database():
    # No tenant or database, returns input unchanged
    user = UserIdentity(tenant=None, databases=None)
    t, d = maybe_set_tenant_and_database(user, True) # 988ns -> 1.08μs (8.69% slower)

def test_user_provided_tenant_and_database_match():
    # Provided tenant/database match resolved, returns them
    user = UserIdentity(tenant="tenant1", databases=["db1"])
    t, d = maybe_set_tenant_and_database(user, True, user_provided_tenant="tenant1", user_provided_database="db1") # 2.75μs -> 2.09μs (31.2% faster)


def test_user_provided_tenant_and_database_none():
    # Provided tenant/database are None, should autofill with resolved values
    user = UserIdentity(tenant="tenant1", databases=["db1"])
    t, d = maybe_set_tenant_and_database(user, True, user_provided_tenant=None, user_provided_database=None) # 3.01μs -> 2.16μs (39.7% faster)


def test_tenant_is_none_database_is_empty_list():
    # Tenant is None, database is empty list
    user = UserIdentity(tenant=None, databases=[])
    t, d = maybe_set_tenant_and_database(user, True) # 1.43μs -> 1.40μs (2.29% faster)

def test_tenant_is_empty_string_database_is_none():
    # Tenant is empty string, database is None
    user = UserIdentity(tenant="", databases=None)
    t, d = maybe_set_tenant_and_database(user, True) # 1.10μs -> 1.17μs (6.00% slower)

def test_database_list_contains_duplicates():
    # Database list has duplicates, but only one unique value
    user = UserIdentity(tenant="tenant1", databases=["db1", "db1", "db1"])
    t, d = maybe_set_tenant_and_database(user, True) # 2.41μs -> 1.80μs (33.7% faster)

def test_database_list_contains_wildcard_and_single_value():
    # Database list has "*" and another value, should not autofill
    user = UserIdentity(tenant="tenant1", databases=["db1", "*"])
    t, d = maybe_set_tenant_and_database(user, True) # 1.74μs -> 1.64μs (5.98% faster)

def test_tenant_is_wildcard_database_is_none():
    # Tenant is "*", database is None
    user = UserIdentity(tenant="*", databases=None)
    t, d = maybe_set_tenant_and_database(user, True) # 1.11μs -> 1.13μs (2.03% slower)

def test_user_provided_tenant_and_database_both_none_and_no_auth():
    # Provided tenant/database None, no resolved value, should remain None
    user = UserIdentity(tenant=None, databases=None)
    t, d = maybe_set_tenant_and_database(user, True, user_provided_tenant=None, user_provided_database=None) # 1.34μs -> 1.36μs (1.40% slower)

def test_user_provided_tenant_and_database_both_default_and_no_auth():
    # Provided tenant/database default, no resolved value, should remain default
    user = UserIdentity(tenant=None, databases=None)
    t, d = maybe_set_tenant_and_database(user, True, user_provided_tenant=DEFAULT_TENANT, user_provided_database=DEFAULT_DATABASE) # 1.60μs -> 1.66μs (3.50% slower)



def test_user_provided_tenant_and_database_are_none_and_resolved_are_none():
    # Provided tenant/database are None, resolved are None, should stay None
    user = UserIdentity(tenant=None, databases=None)
    t, d = maybe_set_tenant_and_database(user, True) # 1.39μs -> 1.39μs (0.216% faster)

# -----------------------
# Large Scale Test Cases
# -----------------------

def test_large_number_of_databases_no_autofill():
    # User has access to 1000 unique databases, should not autofill database
    dbs = [f"db{i}" for i in range(1000)]
    user = UserIdentity(tenant="tenant1", databases=dbs)
    t, d = maybe_set_tenant_and_database(user, True) # 35.7μs -> 1.89μs (1793% faster)

def test_large_number_of_databases_with_duplicates_single_unique():
    # User has 1000 databases, all the same, should autofill
    dbs = ["dbX"] * 1000
    user = UserIdentity(tenant="tenant1", databases=dbs)
    t, d = maybe_set_tenant_and_database(user, True) # 5.60μs -> 32.2μs (82.6% slower)

def test_large_number_of_databases_with_wildcard():
    # User has 999 unique + "*", should not autofill
    dbs = [f"db{i}" for i in range(999)] + ["*"]
    user = UserIdentity(tenant="tenant1", databases=dbs)
    t, d = maybe_set_tenant_and_database(user, True) # 33.8μs -> 1.76μs (1817% faster)

def test_large_number_of_users_with_varied_tenants_and_databases():
    # Simulate 100 users, each with unique tenant and singleton database
    for i in range(100):
        user = UserIdentity(tenant=f"tenant{i}", databases=[f"db{i}"])
        t, d = maybe_set_tenant_and_database(user, True) # 57.0μs -> 42.0μs (35.5% faster)

def test_large_number_of_users_with_wildcard_tenants():
    # Simulate 100 users, each with wildcard tenant and singleton database
    for i in range(100):
        user = UserIdentity(tenant="*", databases=[f"db{i}"])
        t, d = maybe_set_tenant_and_database(user, True) # 55.2μs -> 41.5μs (33.0% faster)

def test_large_number_of_users_with_wildcard_databases():
    # Simulate 100 users, each with singleton tenant and wildcard database
    for i in range(100):
        user = UserIdentity(tenant=f"tenant{i}", databases=["*"])
        t, d = maybe_set_tenant_and_database(user, True) # 44.7μs -> 41.7μs (7.17% faster)

def test_large_number_of_users_with_multiple_databases():
    # Each user has singleton tenant, 10 databases, should not autofill database
    for i in range(100):
        dbs = [f"db{i}_{j}" for j in range(10)]
        user = UserIdentity(tenant=f"tenant{i}", databases=dbs)
        t, d = maybe_set_tenant_and_database(user, True) # 85.6μs -> 47.0μs (82.0% faster)

def test_large_number_of_users_no_overwrite():
    # Overwrite flag is False, should always return (None, None)
    for i in range(100):
        user = UserIdentity(tenant=f"tenant{i}", databases=[f"db{i}"])
        t, d = maybe_set_tenant_and_database(user, False) # 24.5μs -> 25.6μs (4.26% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from typing import List, Optional, Tuple

# imports
import pytest
from chromadb.auth.utils.__init__ import maybe_set_tenant_and_database

# Mocks for dependencies
DEFAULT_TENANT = "default-tenant"
DEFAULT_DATABASE = "default-database"

class ChromaAuthError(Exception):
    pass

class UserIdentity:
    def __init__(self, tenant: Optional[str] = None, databases: Optional[List[str]] = None):
        self.tenant = tenant
        self.databases = databases or []

# unit tests

# -------------------
# Basic Test Cases
# -------------------

def test_no_overwrite_returns_inputs():
    """If overwrite flag is False, function returns provided tenant/database unchanged."""
    user = UserIdentity(tenant="tenantA", databases=["dbA"])
    # Inputs provided, but overwrite is False, so nothing changes
    codeflash_output = maybe_set_tenant_and_database(user, False, "tenantB", "dbB"); result = codeflash_output # 1.24μs -> 1.25μs (1.20% slower)
    # Inputs not provided, should remain None
    codeflash_output = maybe_set_tenant_and_database(user, False); result = codeflash_output # 462ns -> 477ns (3.14% slower)

def test_singleton_tenant_and_database_fills_defaults():
    """If user has one tenant and one database, and no inputs, fills both."""
    user = UserIdentity(tenant="tenantA", databases=["dbA"])
    codeflash_output = maybe_set_tenant_and_database(user, True); result = codeflash_output # 2.25μs -> 1.53μs (46.8% faster)


def test_singleton_tenant_and_database_with_matching_inputs():
    """If user provides matching tenant/database, returns them unchanged."""
    user = UserIdentity(tenant="tenantA", databases=["dbA"])
    codeflash_output = maybe_set_tenant_and_database(user, True, "tenantA", "dbA"); result = codeflash_output # 2.93μs -> 2.12μs (38.5% faster)

def test_singleton_tenant_only():
    """If user has singleton tenant, but multiple databases, only tenant is filled."""
    user = UserIdentity(tenant="tenantA", databases=["dbA", "dbB"])
    codeflash_output = maybe_set_tenant_and_database(user, True); result = codeflash_output # 1.94μs -> 1.90μs (2.43% faster)

def test_singleton_database_only():
    """If user has no tenant, but singleton database, only database is filled."""
    user = UserIdentity(tenant=None, databases=["dbA"])
    codeflash_output = maybe_set_tenant_and_database(user, True); result = codeflash_output # 2.08μs -> 1.45μs (44.1% faster)

def test_no_tenant_no_database():
    """If user has no tenant and no database, returns inputs or None."""
    user = UserIdentity()
    codeflash_output = maybe_set_tenant_and_database(user, True); result = codeflash_output # 1.00μs -> 1.06μs (5.82% slower)
    codeflash_output = maybe_set_tenant_and_database(user, True, "foo", "bar"); result = codeflash_output # 794ns -> 847ns (6.26% slower)

def test_wildcard_tenant_and_database():
    """If tenant or database is '*', do not fill from auth."""
    user = UserIdentity(tenant="*", databases=["*"])
    codeflash_output = maybe_set_tenant_and_database(user, True); result = codeflash_output # 1.63μs -> 1.34μs (21.3% faster)

def test_multiple_databases_prevents_filling_database():
    """If user has multiple databases, do not fill database."""
    user = UserIdentity(tenant="tenantA", databases=["dbA", "dbB"])
    codeflash_output = maybe_set_tenant_and_database(user, True); result = codeflash_output # 1.61μs -> 1.70μs (4.89% slower)

def test_multiple_tenants_prevents_filling_tenant():
    """If user has tenant='*', do not fill tenant."""
    user = UserIdentity(tenant="*", databases=["dbA"])
    codeflash_output = maybe_set_tenant_and_database(user, True); result = codeflash_output # 2.06μs -> 1.42μs (45.3% faster)

# -------------------
# Edge Test Cases
# -------------------




def test_no_error_when_provided_matches_default_and_auth_is_none():
    """If provided tenant/database is default, and auth returns None, do not error."""
    user = UserIdentity()
    # new_tenant and new_database are None, so no error
    codeflash_output = maybe_set_tenant_and_database(user, True, DEFAULT_TENANT, DEFAULT_DATABASE); result = codeflash_output # 1.69μs -> 1.71μs (0.645% slower)

def test_database_list_with_wildcard_and_single():
    """If user has ['dbA', '*'], do not fill database."""
    user = UserIdentity(tenant="tenantA", databases=["dbA", "*"])
    codeflash_output = maybe_set_tenant_and_database(user, True); result = codeflash_output # 1.89μs -> 1.75μs (8.04% faster)

def test_database_list_with_duplicates():
    """If user has duplicate databases, but only one unique and not '*', fill database."""
    user = UserIdentity(tenant="tenantA", databases=["dbA", "dbA"])
    codeflash_output = maybe_set_tenant_and_database(user, True); result = codeflash_output # 2.35μs -> 1.76μs (33.7% faster)

def test_database_list_with_duplicates_and_wildcard():
    """If user has duplicate databases and wildcard, do not fill database."""
    user = UserIdentity(tenant="tenantA", databases=["dbA", "dbA", "*"])
    codeflash_output = maybe_set_tenant_and_database(user, True); result = codeflash_output # 1.76μs -> 1.75μs (0.859% faster)

def test_none_inputs_are_handled():
    """If user_provided_tenant and user_provided_database are None, function fills as appropriate."""
    user = UserIdentity(tenant="tenantA", databases=["dbA"])
    codeflash_output = maybe_set_tenant_and_database(user, True, None, None); result = codeflash_output # 2.10μs -> 1.51μs (38.7% faster)


def test_empty_database_list():
    """If user has empty database list, do not fill database."""
    user = UserIdentity(tenant="tenantA", databases=[])
    codeflash_output = maybe_set_tenant_and_database(user, True); result = codeflash_output # 1.64μs -> 1.66μs (1.21% slower)

def test_none_tenant_and_empty_database_list():
    """If both tenant and database list are empty/None, returns None, None."""
    user = UserIdentity(tenant=None, databases=[])
    codeflash_output = maybe_set_tenant_and_database(user, True); result = codeflash_output # 1.14μs -> 1.16μs (2.24% slower)

# -------------------
# Large Scale Test Cases
# -------------------

def test_large_number_of_databases():
    """If user has many databases, do not fill database."""
    dbs = [f"db{i}" for i in range(1000)]
    user = UserIdentity(tenant="tenantA", databases=dbs)
    codeflash_output = maybe_set_tenant_and_database(user, True); result = codeflash_output # 35.6μs -> 1.81μs (1861% faster)

def test_large_number_of_databases_with_duplicates():
    """If user has 1000 duplicate databases, should fill database if all are the same and not '*'."""
    dbs = ["dbA"] * 1000
    user = UserIdentity(tenant="tenantA", databases=dbs)
    codeflash_output = maybe_set_tenant_and_database(user, True); result = codeflash_output # 5.87μs -> 32.2μs (81.8% slower)

def test_large_number_of_databases_with_wildcard():
    """If user has many databases including '*', do not fill database."""
    dbs = [f"db{i}" for i in range(999)] + ["*"]
    user = UserIdentity(tenant="tenantA", databases=dbs)
    codeflash_output = maybe_set_tenant_and_database(user, True); result = codeflash_output # 33.5μs -> 1.81μs (1748% faster)

def test_large_number_of_unique_databases():
    """If user has 1000 unique databases, do not fill database."""
    dbs = [f"db{i}" for i in range(1000)]
    user = UserIdentity(tenant="tenantA", databases=dbs)
    codeflash_output = maybe_set_tenant_and_database(user, True); result = codeflash_output # 32.9μs -> 1.70μs (1840% faster)

def test_large_number_of_users_with_singleton_tenant_and_database():
    """Test 1000 users each with a unique singleton tenant/database."""
    for i in range(1000):
        user = UserIdentity(tenant=f"tenant{i}", databases=[f"db{i}"])
        codeflash_output = maybe_set_tenant_and_database(user, True); result = codeflash_output # 535μs -> 409μs (30.7% faster)

def test_large_number_of_users_with_wildcard_tenant():
    """Test 1000 users each with wildcard tenant, should not fill tenant."""
    for i in range(1000):
        user = UserIdentity(tenant="*", databases=[f"db{i}"])
        codeflash_output = maybe_set_tenant_and_database(user, True); result = codeflash_output # 519μs -> 392μs (32.4% faster)

def test_large_number_of_users_with_wildcard_database():
    """Test 1000 users each with wildcard database, should not fill database."""
    for i in range(1000):
        user = UserIdentity(tenant=f"tenant{i}", databases=["*"])
        codeflash_output = maybe_set_tenant_and_database(user, True); result = codeflash_output # 434μs -> 405μs (7.20% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from chromadb.auth import UserIdentity
from chromadb.auth.utils.__init__ import maybe_set_tenant_and_database
from chromadb.errors import ChromaAuthError  # needed by the pytest.raises checks below
import pytest

def test_maybe_set_tenant_and_database():
    maybe_set_tenant_and_database(UserIdentity('', tenant='', databases=None, attributes=None), False, user_provided_tenant='\x00', user_provided_database='default_database')

def test_maybe_set_tenant_and_database_2():
    with pytest.raises(ChromaAuthError, match='Database\\ \x01\\ does\\ not\\ match\\ \x00\\ from\\ the\\ server\\.\\ Are\\ you\\ sure\\ the\\ database\\ is\\ correct\\?'):
        maybe_set_tenant_and_database(UserIdentity('', tenant=None, databases=['\x00'], attributes=None), True, user_provided_tenant=None, user_provided_database='\x01')

def test_maybe_set_tenant_and_database_3():
    with pytest.raises(ChromaAuthError, match='Tenant\\ \x01\\ does\\ not\\ match\\ \x00\\ from\\ the\\ server\\.\\ Are\\ you\\ sure\\ the\\ tenant\\ is\\ correct\\?'):
        maybe_set_tenant_and_database(UserIdentity('', tenant='\x00', databases=None, attributes={}), True, user_provided_tenant='\x01', user_provided_database=None)

def test_maybe_set_tenant_and_database_4():
    maybe_set_tenant_and_database(UserIdentity('', tenant='\x00', databases=['\x00'], attributes={'': 0}), True, user_provided_tenant='', user_provided_database=None)
🔎 Concolic Coverage Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| codeflash_concolic_aqrniplu/tmp_tyj47ma/test_concolic_coverage.py::test_maybe_set_tenant_and_database | 1.58μs | 1.50μs | 5.27%✅ |
| codeflash_concolic_aqrniplu/tmp_tyj47ma/test_concolic_coverage.py::test_maybe_set_tenant_and_database_4 | 3.06μs | 2.22μs | 37.6%✅ |

To edit these changes, run `git checkout codeflash/optimize-maybe_set_tenant_and_database-mh1y2v4p`, commit your edits, and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 22, 2025 12:03
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 22, 2025