
Conversation


@codeflash-ai codeflash-ai bot commented Oct 22, 2025

📄 7% (0.07x) speedup for MistralEmbeddingFunction.build_from_config in chromadb/utils/embedding_functions/mistral_embedding_function.py

⏱️ Runtime : 182 microseconds → 170 microseconds (best of 45 runs)

📝 Explanation and details

The optimized code achieves a 7% speedup through two key micro-optimizations:

1. Import Caching in `__init__`:
The original code runs `from mistralai import Mistral` on every instantiation. The optimization caches the imported `Mistral` class in the module's global namespace under `globals()["_mistral_client_mod"]`. After the first import, subsequent instantiations skip the import machinery entirely, reducing overhead during repeated object creation.

2. Local Variable Caching in `build_from_config`:
Instead of calling `config.get()` twice (which involves an attribute lookup each time), the optimization stores `config.get` in a local variable `get` and calls it directly. This eliminates repeated attribute lookups, a common Python performance pattern.
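As a sketch of the local-alias pattern (this `build_from_config` is a simplified stand-in, not the actual chromadb implementation):

```python
def build_from_config(config):
    get = config.get  # bind the bound method once, locally
    model = get("model")
    api_key_env_var = get("api_key_env_var")
    # Mirror the original's assertion-based validation
    assert model is not None and api_key_env_var is not None
    return model, api_key_env_var
```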

Why These Work:

  • Import statements involve module lookup and namespace resolution overhead
  • Attribute access (`config.get`) requires dictionary lookups in Python's object model
  • Local variable access is faster than attribute access in Python
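The attribute-access claim can be checked with a rough micro-benchmark (illustrative only; absolute numbers vary by machine and interpreter):

```python
import timeit

cfg = {"model": "m", "api_key_env_var": "E"}

# Two config.get calls via attribute lookup each iteration
attr = timeit.timeit("cfg.get('model'); cfg.get('api_key_env_var')",
                     globals={"cfg": cfg}, number=200_000)
# Same two calls through a pre-bound local alias
local = timeit.timeit("g('model'); g('api_key_env_var')",
                      globals={"g": cfg.get}, number=200_000)
print(f"attribute lookup: {attr:.4f}s  local alias: {local:.4f}s")
```

The alias typically wins by a small constant factor, which is why the overall speedup here is single-digit percent.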

Test Case Performance:
The optimizations show strongest gains (30-40% faster) in test cases that create multiple MistralEmbeddingFunction instances, such as test_basic_valid_config and large-scale tests. Error path tests show minimal impact since they don't reach the optimized construction code, which is expected and preserves the original error handling behavior.

Correctness verification report:

Test                          | Status
⚙️ Existing Unit Tests         | 🔘 None Found
🌀 Generated Regression Tests  | 49 Passed
⏪ Replay Tests                | 🔘 None Found
🔎 Concolic Coverage Tests     | 3 Passed
📊 Tests Coverage              | 100.0%
🌀 Generated Regression Tests and Runtime
import os
# Patch import for mistralai and os.getenv for testing
import sys
from typing import Any, Dict, List

# function to test
import numpy as np
# imports
import pytest  # used for our unit tests
from chromadb.utils.embedding_functions.mistral_embedding_function import \
    MistralEmbeddingFunction

# Minimal stub for Documents and EmbeddingFunction types for testing
Documents = List[str]
Embeddings = List[np.ndarray]

# Minimal stub for mistralai.Mistral and its embeddings.create method
class DummyEmbeddingData:
    def __init__(self, embedding):
        self.embedding = embedding

class DummyEmbeddingsResponse:
    def __init__(self, data):
        self.data = data

class DummyMistralClient:
    def __init__(self, api_key):
        self.api_key = api_key
    class embeddings:
        @staticmethod
        def create(model, inputs):
            # Return one dummy embedding per input; each embedding is [0.0, 1.0, ..., len(inputs)-1]
            return DummyEmbeddingsResponse([DummyEmbeddingData([float(i) for i in range(len(inputs))]) for _ in inputs])


def patch_mistralai_and_env(monkeypatch, api_key_value):
    # Patch os.getenv to return the api_key_value
    monkeypatch.setattr(os, "getenv", lambda key: api_key_value if key == "MISTRAL_API_KEY" else None)
    # Patch mistralai import
    sys.modules["mistralai"] = type("mistralai", (), {"Mistral": DummyMistralClient})

# unit tests

# ----------- BASIC TEST CASES -----------

def test_basic_valid_config(monkeypatch):
    """Test build_from_config with valid config dictionary."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); ef = codeflash_output # 5.13μs -> 3.82μs (34.5% faster)

def test_basic_valid_config_different_env(monkeypatch):
    """Test build_from_config with a different env var name."""
    # Patch getenv for custom env var
    monkeypatch.setattr(os, "getenv", lambda key: "OTHER_KEY" if key == "OTHER_ENV" else None)
    sys.modules["mistralai"] = type("mistralai", (), {"Mistral": DummyMistralClient})
    config = {"model": "other-model", "api_key_env_var": "OTHER_ENV"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); ef = codeflash_output # 3.87μs -> 2.93μs (31.9% faster)

def test_basic_call_returns_embeddings(monkeypatch):
    """Test that __call__ returns correct number of embeddings."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); ef = codeflash_output # 3.80μs -> 2.66μs (42.7% faster)
    docs = ["hello", "world"]
    embeddings = ef(docs)
    for emb in embeddings:
        pass

# ----------- EDGE TEST CASES -----------

def test_missing_model_key(monkeypatch):
    """Test config missing 'model' key raises AssertionError."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {"api_key_env_var": "MISTRAL_API_KEY"}
    with pytest.raises(AssertionError):
        MistralEmbeddingFunction.build_from_config(config) # 1.08μs -> 1.13μs (4.53% slower)

def test_missing_api_key_env_var_key(monkeypatch):
    """Test config missing 'api_key_env_var' key raises AssertionError."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {"model": "test-model"}
    with pytest.raises(AssertionError):
        MistralEmbeddingFunction.build_from_config(config) # 1.08μs -> 1.16μs (6.73% slower)

def test_missing_both_keys(monkeypatch):
    """Test config missing both keys raises AssertionError."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {}
    with pytest.raises(AssertionError):
        MistralEmbeddingFunction.build_from_config(config) # 1.08μs -> 1.10μs (1.64% slower)

def test_env_var_not_set(monkeypatch):
    """Test ValueError raised if API key env var is not set."""
    # Patch getenv to always return None
    monkeypatch.setattr(os, "getenv", lambda key: None)
    sys.modules["mistralai"] = type("mistralai", (), {"Mistral": DummyMistralClient})
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    with pytest.raises(ValueError) as excinfo:
        MistralEmbeddingFunction.build_from_config(config) # 4.29μs -> 3.09μs (38.6% faster)


def test_call_with_non_string_documents(monkeypatch):
    """Test __call__ raises ValueError if input contains non-string."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); ef = codeflash_output # 4.34μs -> 3.95μs (9.89% faster)
    docs = ["hello", 42, "world"]
    with pytest.raises(ValueError) as excinfo:
        ef(docs)


def test_call_with_special_characters(monkeypatch):
    """Test __call__ with documents containing special/unicode characters."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); ef = codeflash_output # 4.88μs -> 3.93μs (24.2% faster)
    docs = ["你好", "¡Hola!", "🙂"]
    embeddings = ef(docs)
    for emb in embeddings:
        pass

# ----------- LARGE SCALE TEST CASES -----------

def test_large_scale_999_documents(monkeypatch):
    """Test __call__ with 999 documents (max allowed for performance test)."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); ef = codeflash_output # 4.01μs -> 2.91μs (38.0% faster)
    docs = [f"doc_{i}" for i in range(999)]
    embeddings = ef(docs)
    for emb in embeddings:
        pass

def test_large_scale_config(monkeypatch):
    """Test build_from_config with large config dictionary containing extra keys."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {
        "model": "large-model",
        "api_key_env_var": "MISTRAL_API_KEY",
        "extra1": "value1",
        "extra2": 12345,
        "extra3": [1,2,3]
    }
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); ef = codeflash_output # 4.61μs -> 3.40μs (35.3% faster)

def test_large_scale_long_strings(monkeypatch):
    """Test __call__ with documents containing long strings."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); ef = codeflash_output # 3.85μs -> 2.81μs (37.2% faster)
    docs = ["a" * 1000, "b" * 999, "c" * 998]
    embeddings = ef(docs)
    for emb in embeddings:
        pass

def test_large_scale_non_ascii(monkeypatch):
    """Test __call__ with many non-ASCII documents."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); ef = codeflash_output # 3.80μs -> 2.73μs (39.4% faster)
    docs = ["😀" * 10 for _ in range(500)]
    embeddings = ef(docs)
    for emb in embeddings:
        pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import os
# Patch for import mistralai.Mistral
import sys
import types
from typing import Any, Dict, List

# function to test
import numpy as np
# imports
import pytest  # used for our unit tests
from chromadb.utils.embedding_functions.mistral_embedding_function import \
    MistralEmbeddingFunction

# Minimal stub for Documents and EmbeddingFunction for test purposes
Documents = List[str]
Embeddings = List[np.ndarray]

# 1. Basic Test Cases

def test_basic_valid_config():
    """Test build_from_config with a valid config dictionary."""
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); func = codeflash_output

def test_basic_custom_env_var():
    """Test build_from_config with a custom API key env var."""
    config = {"model": "another-model", "api_key_env_var": "CUSTOM_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); func = codeflash_output

def test_basic_call_returns_embeddings():
    """Test that __call__ returns correct number and type of embeddings."""
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); func = codeflash_output
    docs = ["hello", "world"]
    embeddings = func(docs)
    for emb in embeddings:
        pass

# 2. Edge Test Cases

def test_missing_model_key():
    """Test config missing 'model' key triggers assertion."""
    config = {"api_key_env_var": "MISTRAL_API_KEY"}
    with pytest.raises(AssertionError):
        MistralEmbeddingFunction.build_from_config(config) # 1.18μs -> 1.27μs (7.54% slower)

def test_missing_api_key_env_var_key():
    """Test config missing 'api_key_env_var' key triggers assertion."""
    config = {"model": "test-model"}
    with pytest.raises(AssertionError):
        MistralEmbeddingFunction.build_from_config(config) # 1.14μs -> 1.20μs (5.32% slower)

def test_missing_both_keys():
    """Test config missing both keys triggers assertion."""
    config = {}
    with pytest.raises(AssertionError):
        MistralEmbeddingFunction.build_from_config(config) # 1.06μs -> 1.10μs (3.56% slower)

def test_env_var_not_set():
    """Test when the api_key_env_var is not set in the environment."""
    config = {"model": "test-model", "api_key_env_var": "NOT_SET_VAR"}
    if "NOT_SET_VAR" in os.environ:
        del os.environ["NOT_SET_VAR"]
    with pytest.raises(ValueError) as excinfo:
        MistralEmbeddingFunction.build_from_config(config) # 6.10μs -> 5.25μs (16.2% faster)

def test_non_string_documents():
    """Test __call__ raises ValueError when input contains non-string items."""
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); func = codeflash_output
    docs = ["valid", 123, "another"]
    with pytest.raises(ValueError) as excinfo:
        func(docs)

def test_empty_documents_list():
    """Test __call__ with empty list returns empty embeddings list."""
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); func = codeflash_output
    docs = []
    embeddings = func(docs)

def test_model_and_api_key_env_var_are_none():
    """Test config with None values for keys triggers assertion."""
    config = {"model": None, "api_key_env_var": None}
    with pytest.raises(AssertionError):
        MistralEmbeddingFunction.build_from_config(config) # 1.23μs -> 1.33μs (8.16% slower)

def test_model_and_api_key_env_var_are_empty_strings():
    """Test config with empty strings for keys triggers ValueError for env var."""
    config = {"model": "", "api_key_env_var": ""}
    if "" in os.environ:
        del os.environ[""]
    with pytest.raises(ValueError) as excinfo:
        MistralEmbeddingFunction.build_from_config(config) # 5.99μs -> 5.15μs (16.1% faster)

# 3. Large Scale Test Cases

def test_large_number_of_documents():
    """Test __call__ with a large list of documents."""
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); func = codeflash_output
    docs = [f"doc {i}" for i in range(500)]  # 500 is a reasonable large number
    embeddings = func(docs)
    for emb in embeddings:
        pass

def test_large_scale_custom_env_var():
    """Test large scale with custom env var."""
    config = {"model": "big-model", "api_key_env_var": "CUSTOM_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); func = codeflash_output
    docs = ["a"] * 999  # Just under 1000
    embeddings = func(docs)
    for emb in embeddings:
        pass

def test_large_scale_empty_strings():
    """Test large scale with many empty string documents."""
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); func = codeflash_output
    docs = [""] * 1000
    embeddings = func(docs)
    for emb in embeddings:
        pass

# Extra: Determinism test

def test_deterministic_output_for_same_input():
    """Test that calling with same input gives same output (since Dummy returns deterministic embeddings)."""
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); func = codeflash_output
    docs = ["repeat", "repeat"]
    emb1 = func(docs)
    emb2 = func(docs)
    for a, b in zip(emb1, emb2):
        pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from chromadb.utils.embedding_functions.mistral_embedding_function import MistralEmbeddingFunction
import pytest

def test_MistralEmbeddingFunction_build_from_config():
    with pytest.raises(ValueError, match='The\\ mistralai\\ python\\ package\\ is\\ not\\ installed\\.\\ Please\\ install\\ it\\ with\\ `pip\\ install\\ mistralai`'):
        MistralEmbeddingFunction.build_from_config({'model': 0, 'api_key_env_var': 0})

def test_MistralEmbeddingFunction_build_from_config_2():
    with pytest.raises(AssertionError, match='This\\ code\\ should\\ not\\ be\\ reached'):
        MistralEmbeddingFunction.build_from_config({})
🔎 Concolic Coverage Tests and Runtime
Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup
codeflash_concolic_aqrniplu/tmphq2edxdh/test_concolic_coverage.py::test_MistralEmbeddingFunction_build_from_config | 115μs | 117μs | -1.45% ⚠️
codeflash_concolic_aqrniplu/tmphq2edxdh/test_concolic_coverage.py::test_MistralEmbeddingFunction_build_from_config_2 | 1.24μs | 1.38μs | -9.88% ⚠️

To edit these changes, `git checkout codeflash/optimize-MistralEmbeddingFunction.build_from_config-mh1t5g4r` and push.

Codeflash

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 22, 2025 09:45
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 22, 2025