
Conversation


@codeflash-ai codeflash-ai bot commented Oct 22, 2025

📄 7% (0.07x) speedup for MistralEmbeddingFunction.build_from_config in chromadb/utils/embedding_functions/mistral_embedding_function.py

⏱️ Runtime : 182 microseconds → 170 microseconds (best of 45 runs)

📝 Explanation and details

The optimized code achieves a 7% speedup through two key micro-optimizations:

1. Import Caching in `__init__`:
The original code runs `from mistralai import Mistral` on every instantiation. The optimization caches the imported `Mistral` class in the module's global namespace under `globals()["_mistral_client_mod"]`. After the first import, subsequent instantiations skip the import machinery entirely, reducing overhead during repeated object creation.

2. Local Variable Caching in `build_from_config`:
Instead of calling `config.get()` twice (which involves an attribute lookup each time), the optimization stores `config.get` in a local variable `get` and calls it directly. This eliminates repeated attribute lookups, a common Python performance pattern.
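As a sketch of the local-alias pattern (this `build_from_config` is a simplified stand-in, not the actual chromadb implementation):

```python
def build_from_config(config):
    get = config.get  # bind the bound method once, locally
    model = get("model")
    api_key_env_var = get("api_key_env_var")
    # Mirror the original's assertion-based validation
    assert model is not None and api_key_env_var is not None
    return model, api_key_env_var
```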

Why These Work:

  • Import statements involve module lookup and namespace resolution overhead
  • Attribute access (`config.get`) requires dictionary lookups in Python's object model
  • Local variable access is faster than attribute access in Python
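The attribute-access claim can be checked with a rough micro-benchmark (illustrative only; absolute numbers vary by machine and interpreter):

```python
import timeit

cfg = {"model": "m", "api_key_env_var": "E"}

# Two config.get calls via attribute lookup each iteration
attr = timeit.timeit("cfg.get('model'); cfg.get('api_key_env_var')",
                     globals={"cfg": cfg}, number=200_000)
# Same two calls through a pre-bound local alias
local = timeit.timeit("g('model'); g('api_key_env_var')",
                      globals={"g": cfg.get}, number=200_000)
print(f"attribute lookup: {attr:.4f}s  local alias: {local:.4f}s")
```

The alias typically wins by a small constant factor, which is why the overall speedup here is single-digit percent.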

Test Case Performance:
The optimizations show strongest gains (30-40% faster) in test cases that create multiple MistralEmbeddingFunction instances, such as test_basic_valid_config and large-scale tests. Error path tests show minimal impact since they don't reach the optimized construction code, which is expected and preserves the original error handling behavior.

Correctness verification report:

Test                          | Status
⚙️ Existing Unit Tests         | 🔘 None Found
🌀 Generated Regression Tests  | 49 Passed
⏪ Replay Tests                | 🔘 None Found
🔎 Concolic Coverage Tests     | 3 Passed
📊 Tests Coverage              | 100.0%
🌀 Generated Regression Tests and Runtime
import os
# Patch import for mistralai and os.getenv for testing
import sys
from typing import Any, Dict, List

# function to test
import numpy as np
# imports
import pytest  # used for our unit tests
from chromadb.utils.embedding_functions.mistral_embedding_function import \
    MistralEmbeddingFunction

# Minimal stub for Documents and EmbeddingFunction types for testing
Documents = List[str]
Embeddings = List[np.ndarray]

# Minimal stub for mistralai.Mistral and its embeddings.create method
class DummyEmbeddingData:
    def __init__(self, embedding):
        self.embedding = embedding

class DummyEmbeddingsResponse:
    def __init__(self, data):
        self.data = data

class DummyMistralClient:
    def __init__(self, api_key):
        self.api_key = api_key
    class embeddings:
        @staticmethod
        def create(model, inputs):
            # Return one dummy embedding per input; each embedding is [0.0, 1.0, ..., len(inputs)-1]
            return DummyEmbeddingsResponse([DummyEmbeddingData([float(i) for i in range(len(inputs))]) for _ in inputs])


def patch_mistralai_and_env(monkeypatch, api_key_value):
    # Patch os.getenv to return the api_key_value
    monkeypatch.setattr(os, "getenv", lambda key: api_key_value if key == "MISTRAL_API_KEY" else None)
    # Patch mistralai import
    sys.modules["mistralai"] = type("mistralai", (), {"Mistral": DummyMistralClient})

# unit tests

# ----------- BASIC TEST CASES -----------

def test_basic_valid_config(monkeypatch):
    """Test build_from_config with valid config dictionary."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); ef = codeflash_output # 5.13μs -> 3.82μs (34.5% faster)

def test_basic_valid_config_different_env(monkeypatch):
    """Test build_from_config with a different env var name."""
    # Patch getenv for custom env var
    monkeypatch.setattr(os, "getenv", lambda key: "OTHER_KEY" if key == "OTHER_ENV" else None)
    sys.modules["mistralai"] = type("mistralai", (), {"Mistral": DummyMistralClient})
    config = {"model": "other-model", "api_key_env_var": "OTHER_ENV"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); ef = codeflash_output # 3.87μs -> 2.93μs (31.9% faster)

def test_basic_call_returns_embeddings(monkeypatch):
    """Test that __call__ returns correct number of embeddings."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); ef = codeflash_output # 3.80μs -> 2.66μs (42.7% faster)
    docs = ["hello", "world"]
    embeddings = ef(docs)
    for emb in embeddings:
        pass

# ----------- EDGE TEST CASES -----------

def test_missing_model_key(monkeypatch):
    """Test config missing 'model' key raises AssertionError."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {"api_key_env_var": "MISTRAL_API_KEY"}
    with pytest.raises(AssertionError):
        MistralEmbeddingFunction.build_from_config(config) # 1.08μs -> 1.13μs (4.53% slower)

def test_missing_api_key_env_var_key(monkeypatch):
    """Test config missing 'api_key_env_var' key raises AssertionError."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {"model": "test-model"}
    with pytest.raises(AssertionError):
        MistralEmbeddingFunction.build_from_config(config) # 1.08μs -> 1.16μs (6.73% slower)

def test_missing_both_keys(monkeypatch):
    """Test config missing both keys raises AssertionError."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {}
    with pytest.raises(AssertionError):
        MistralEmbeddingFunction.build_from_config(config) # 1.08μs -> 1.10μs (1.64% slower)

def test_env_var_not_set(monkeypatch):
    """Test ValueError raised if API key env var is not set."""
    # Patch getenv to always return None
    monkeypatch.setattr(os, "getenv", lambda key: None)
    sys.modules["mistralai"] = type("mistralai", (), {"Mistral": DummyMistralClient})
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    with pytest.raises(ValueError) as excinfo:
        MistralEmbeddingFunction.build_from_config(config) # 4.29μs -> 3.09μs (38.6% faster)


def test_call_with_non_string_documents(monkeypatch):
    """Test __call__ raises ValueError if input contains non-string."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); ef = codeflash_output # 4.34μs -> 3.95μs (9.89% faster)
    docs = ["hello", 42, "world"]
    with pytest.raises(ValueError) as excinfo:
        ef(docs)


def test_call_with_special_characters(monkeypatch):
    """Test __call__ with documents containing special/unicode characters."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); ef = codeflash_output # 4.88μs -> 3.93μs (24.2% faster)
    docs = ["你好", "¡Hola!", "🙂"]
    embeddings = ef(docs)
    for emb in embeddings:
        pass

# ----------- LARGE SCALE TEST CASES -----------

def test_large_scale_999_documents(monkeypatch):
    """Test __call__ with 999 documents (max allowed for performance test)."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); ef = codeflash_output # 4.01μs -> 2.91μs (38.0% faster)
    docs = [f"doc_{i}" for i in range(999)]
    embeddings = ef(docs)
    for emb in embeddings:
        pass

def test_large_scale_config(monkeypatch):
    """Test build_from_config with large config dictionary containing extra keys."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {
        "model": "large-model",
        "api_key_env_var": "MISTRAL_API_KEY",
        "extra1": "value1",
        "extra2": 12345,
        "extra3": [1,2,3]
    }
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); ef = codeflash_output # 4.61μs -> 3.40μs (35.3% faster)

def test_large_scale_long_strings(monkeypatch):
    """Test __call__ with documents containing long strings."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); ef = codeflash_output # 3.85μs -> 2.81μs (37.2% faster)
    docs = ["a" * 1000, "b" * 999, "c" * 998]
    embeddings = ef(docs)
    for emb in embeddings:
        pass

def test_large_scale_non_ascii(monkeypatch):
    """Test __call__ with many non-ASCII documents."""
    patch_mistralai_and_env(monkeypatch, "FAKE_API_KEY")
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); ef = codeflash_output # 3.80μs -> 2.73μs (39.4% faster)
    docs = ["😀" * 10 for _ in range(500)]
    embeddings = ef(docs)
    for emb in embeddings:
        pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import os
# Patch for import mistralai.Mistral
import sys
import types
from typing import Any, Dict, List

# function to test
import numpy as np
# imports
import pytest  # used for our unit tests
from chromadb.utils.embedding_functions.mistral_embedding_function import \
    MistralEmbeddingFunction

# Minimal stub for Documents and EmbeddingFunction for test purposes
Documents = List[str]
Embeddings = List[np.ndarray]

# 1. Basic Test Cases

def test_basic_valid_config():
    """Test build_from_config with a valid config dictionary."""
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); func = codeflash_output

def test_basic_custom_env_var():
    """Test build_from_config with a custom API key env var."""
    config = {"model": "another-model", "api_key_env_var": "CUSTOM_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); func = codeflash_output

def test_basic_call_returns_embeddings():
    """Test that __call__ returns correct number and type of embeddings."""
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); func = codeflash_output
    docs = ["hello", "world"]
    embeddings = func(docs)
    for emb in embeddings:
        pass

# 2. Edge Test Cases

def test_missing_model_key():
    """Test config missing 'model' key triggers assertion."""
    config = {"api_key_env_var": "MISTRAL_API_KEY"}
    with pytest.raises(AssertionError):
        MistralEmbeddingFunction.build_from_config(config) # 1.18μs -> 1.27μs (7.54% slower)

def test_missing_api_key_env_var_key():
    """Test config missing 'api_key_env_var' key triggers assertion."""
    config = {"model": "test-model"}
    with pytest.raises(AssertionError):
        MistralEmbeddingFunction.build_from_config(config) # 1.14μs -> 1.20μs (5.32% slower)

def test_missing_both_keys():
    """Test config missing both keys triggers assertion."""
    config = {}
    with pytest.raises(AssertionError):
        MistralEmbeddingFunction.build_from_config(config) # 1.06μs -> 1.10μs (3.56% slower)

def test_env_var_not_set():
    """Test when the api_key_env_var is not set in the environment."""
    config = {"model": "test-model", "api_key_env_var": "NOT_SET_VAR"}
    if "NOT_SET_VAR" in os.environ:
        del os.environ["NOT_SET_VAR"]
    with pytest.raises(ValueError) as excinfo:
        MistralEmbeddingFunction.build_from_config(config) # 6.10μs -> 5.25μs (16.2% faster)

def test_non_string_documents():
    """Test __call__ raises ValueError when input contains non-string items."""
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); func = codeflash_output
    docs = ["valid", 123, "another"]
    with pytest.raises(ValueError) as excinfo:
        func(docs)

def test_empty_documents_list():
    """Test __call__ with empty list returns empty embeddings list."""
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); func = codeflash_output
    docs = []
    embeddings = func(docs)

def test_model_and_api_key_env_var_are_none():
    """Test config with None values for keys triggers assertion."""
    config = {"model": None, "api_key_env_var": None}
    with pytest.raises(AssertionError):
        MistralEmbeddingFunction.build_from_config(config) # 1.23μs -> 1.33μs (8.16% slower)

def test_model_and_api_key_env_var_are_empty_strings():
    """Test config with empty strings for keys triggers ValueError for env var."""
    config = {"model": "", "api_key_env_var": ""}
    if "" in os.environ:
        del os.environ[""]
    with pytest.raises(ValueError) as excinfo:
        MistralEmbeddingFunction.build_from_config(config) # 5.99μs -> 5.15μs (16.1% faster)

# 3. Large Scale Test Cases

def test_large_number_of_documents():
    """Test __call__ with a large list of documents."""
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); func = codeflash_output
    docs = [f"doc {i}" for i in range(500)]  # 500 is a reasonable large number
    embeddings = func(docs)
    for emb in embeddings:
        pass

def test_large_scale_custom_env_var():
    """Test large scale with custom env var."""
    config = {"model": "big-model", "api_key_env_var": "CUSTOM_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); func = codeflash_output
    docs = ["a"] * 999  # Just under 1000
    embeddings = func(docs)
    for emb in embeddings:
        pass

def test_large_scale_empty_strings():
    """Test large scale with many empty string documents."""
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); func = codeflash_output
    docs = [""] * 1000
    embeddings = func(docs)
    for emb in embeddings:
        pass

# Extra: Determinism test

def test_deterministic_output_for_same_input():
    """Test that calling with same input gives same output (since Dummy returns deterministic embeddings)."""
    config = {"model": "test-model", "api_key_env_var": "MISTRAL_API_KEY"}
    codeflash_output = MistralEmbeddingFunction.build_from_config(config); func = codeflash_output
    docs = ["repeat", "repeat"]
    emb1 = func(docs)
    emb2 = func(docs)
    for a, b in zip(emb1, emb2):
        pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from chromadb.utils.embedding_functions.mistral_embedding_function import MistralEmbeddingFunction
import pytest

def test_MistralEmbeddingFunction_build_from_config():
    with pytest.raises(ValueError, match='The\\ mistralai\\ python\\ package\\ is\\ not\\ installed\\.\\ Please\\ install\\ it\\ with\\ `pip\\ install\\ mistralai`'):
        MistralEmbeddingFunction.build_from_config({'model': 0, 'api_key_env_var': 0})

def test_MistralEmbeddingFunction_build_from_config_2():
    with pytest.raises(AssertionError, match='This\\ code\\ should\\ not\\ be\\ reached'):
        MistralEmbeddingFunction.build_from_config({})
🔎 Concolic Coverage Tests and Runtime
Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup
codeflash_concolic_aqrniplu/tmphq2edxdh/test_concolic_coverage.py::test_MistralEmbeddingFunction_build_from_config | 115μs | 117μs | -1.45% ⚠️
codeflash_concolic_aqrniplu/tmphq2edxdh/test_concolic_coverage.py::test_MistralEmbeddingFunction_build_from_config_2 | 1.24μs | 1.38μs | -9.88% ⚠️

To edit these changes, `git checkout codeflash/optimize-MistralEmbeddingFunction.build_from_config-mh1t5g4r` and push.

Codeflash

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 22, 2025 09:45
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 22, 2025