
Conversation

codeflash-ai bot commented on Dec 23, 2025

📄 310,613% (3,106.13x) speedup for numpy_matmul in src/numerical/linear_algebra.py

⏱️ Runtime: 2.60 seconds → 836 microseconds (best of 250 runs)

📝 Explanation and details

The optimized code achieves a ~3000x speedup by replacing a naive triple-nested loop implementation with NumPy's highly optimized np.dot() function.

Key Changes:

  1. Replaced manual nested loops with np.dot(): The original code performs element-by-element matrix multiplication using three nested Python loops, which is extremely slow due to Python's interpreter overhead. The optimized version delegates this to NumPy's np.dot(), which uses optimized C/Fortran libraries (BLAS/LAPACK) with vectorized operations.

  2. Added .astype(np.float64): This ensures the output dtype matches the original behavior, where np.zeros() creates float64 arrays by default, maintaining compatibility when inputs are integers or other types. (Both versions are sketched below.)
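
For reference, here is a minimal before/after sketch, reconstructed from the description above and the behavior the generated tests exercise. The PR does not include the source itself, so treat this as an approximation; the numpy_matmul_naive name is ours:

```python
import numpy as np

# Reconstructed naive version: an np.zeros() accumulator filled by three
# nested Python loops, one multiply-add per innermost iteration.
def numpy_matmul_naive(A, B):
    rows_A, cols_A = A.shape  # unpacking raises ValueError for non-2D input
    rows_B, cols_B = B.shape
    if cols_A != rows_B:
        raise ValueError("Incompatible matrices")
    C = np.zeros((rows_A, cols_B))  # float64 by default
    for i in range(rows_A):
        for j in range(cols_B):
            for k in range(cols_A):
                C[i, j] += A[i, k] * B[k, j]
    return C

# Optimized version: same validation, but the multiplication is delegated
# to BLAS via np.dot(); .astype(np.float64) preserves the original dtype.
def numpy_matmul(A, B):
    rows_A, cols_A = A.shape
    rows_B, cols_B = B.shape
    if cols_A != rows_B:
        raise ValueError("Incompatible matrices")
    return np.dot(A, B).astype(np.float64)
```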

Why This Is Faster:

  • Vectorization: NumPy's np.dot() operates on entire arrays at once using CPU vector instructions (SIMD), whereas Python loops process one element at a time.
  • Compiled code: NumPy's underlying implementation is in highly optimized C/Fortran, avoiding Python's interpretation overhead.
  • Cache efficiency: Optimized BLAS implementations use cache-aware algorithms that minimize memory access latency.

The line profiler shows the bottleneck shifted from the innermost loop (63.2% of time in original) to the single np.dot() call (95.9% of optimized time), but the absolute time dropped from 8.5 seconds to 2.8 milliseconds.
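
The gap is easy to reproduce locally. A minimal sketch using only timeit and a naive loop like the one above, assuming a 100x100 workload (absolute numbers will vary with your machine and BLAS build):

```python
import timeit

import numpy as np

def naive_matmul(A, B):
    # Triple nested Python loop, mirroring the original implementation.
    C = np.zeros((A.shape[0], B.shape[1]))
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            for k in range(A.shape[1]):
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.rand(100, 100)
B = np.random.rand(100, 100)

slow = timeit.timeit(lambda: naive_matmul(A, B), number=1)    # run once; it is slow
fast = timeit.timeit(lambda: np.dot(A, B), number=100) / 100  # average of 100 runs
print(f"naive: {slow:.3f} s, np.dot: {fast * 1e6:.1f} us, ratio: {slow / fast:,.0f}x")
```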

Performance Impact:
Based on the annotated tests, the optimization delivers massive speedups for larger matrices:

  • Small matrices (2x2): ~150-200% faster
  • Medium matrices (100x100): ~2 million % faster
  • Large rectangular matrices (100x500 by 500x50): ~2.4 million % faster
  • Vector operations: ~14,000% faster

The speedup scales dramatically with matrix size because the cubic complexity (O(n³)) of the nested loops becomes increasingly dominant, while np.dot() maintains efficiency through optimized algorithms.
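
To put rough numbers on that scaling: a 100x100 product performs 100³ = 10⁶ multiply-add steps, and a 500x500 product performs 500³ ≈ 1.25 × 10⁸. In the loop version each step pays Python's full interpretation overhead, so wall-clock time grows cubically with that count, while np.dot() pushes the same arithmetic through vectorized, cache-blocked BLAS kernels whose per-element cost is orders of magnitude lower, which is why the relative gap widens with matrix size.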

Workload Considerations:
Since function references aren't available, the impact depends on usage patterns. If this function is called frequently or with large matrices (common in numerical/scientific computing, machine learning pipelines, or data processing), this optimization would significantly reduce computation time in those hot paths.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 37 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests:
import numpy as np

# imports
import pytest
from src.numerical.linear_algebra import numpy_matmul

# unit tests

# --------- Basic Test Cases ---------


def test_matmul_identity_matrix():
    # Multiplying by identity should return the original matrix
    A = np.array([[1, 2], [3, 4]])
    I = np.eye(2)
    codeflash_output = numpy_matmul(A, I)
    result = codeflash_output  # 6.62μs -> 2.79μs (137% faster)
    assert np.allclose(result, A)


def test_matmul_zero_matrix():
    # Multiplying any matrix by a zero matrix should yield a zero matrix
    A = np.array([[5, 7], [1, -3]])
    Z = np.zeros((2, 2))
    codeflash_output = numpy_matmul(A, Z)
    result = codeflash_output  # 6.38μs -> 2.75μs (132% faster)
    assert np.allclose(result, np.zeros((2, 2)))


def test_matmul_basic_2x2():
    # Basic 2x2 multiplication
    A = np.array([[1, 2], [3, 4]])
    B = np.array([[2, 0], [1, 2]])
    expected = np.array([[4, 4], [10, 8]])
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 6.29μs -> 2.54μs (148% faster)
    assert np.allclose(result, expected)


def test_matmul_rectangular():
    # Multiplying 2x3 by 3x2
    A = np.array([[1, 2, 3], [4, 5, 6]])
    B = np.array([[7, 8], [9, 10], [11, 12]])
    expected = np.array([[58, 64], [139, 154]])
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 8.08μs -> 2.54μs (218% faster)
    assert np.allclose(result, expected)


def test_matmul_single_element():
    # 1x1 matrix multiplication
    A = np.array([[7]])
    B = np.array([[3]])
    expected = np.array([[21]])
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 2.54μs -> 2.54μs (0.000% faster)
    assert np.allclose(result, expected)


# --------- Edge Test Cases ---------


def test_matmul_incompatible_shapes():
    # Should raise ValueError if shapes are not compatible
    A = np.array([[1, 2, 3], [4, 5, 6]])
    B = np.array([[1, 2], [3, 4]])  # 2x2, incompatible with A's 2x3
    with pytest.raises(ValueError, match="Incompatible matrices"):
        numpy_matmul(A, B)  # 1.21μs -> 792ns (52.5% faster)


def test_matmul_empty_matrices():
    # Multiplying compatibly-shaped empty matrices (0x2 by 2x0) yields an empty 0x0 result
    A = np.zeros((0, 2))
    B = np.zeros((2, 0))
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 1.50μs -> 2.62μs (42.9% slower)
    assert result.shape == (0, 0)


def test_matmul_row_vector_by_column_vector():
    # 1xN by Nx1 should yield a 1x1 matrix (dot product)
    A = np.array([[1, 2, 3]])
    B = np.array([[4], [5], [6]])
    expected = np.array([[32]])  # 1*4 + 2*5 + 3*6 = 32
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 3.83μs -> 2.50μs (53.4% faster)
    assert np.allclose(result, expected)


def test_matmul_column_vector_by_row_vector():
    # Nx1 by 1xN should yield a NxN matrix (outer product)
    A = np.array([[1], [2], [3]])
    B = np.array([[4, 5, 6]])
    expected = np.array([[4, 5, 6], [8, 10, 12], [12, 15, 18]])
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 6.75μs -> 2.54μs (166% faster)
    assert np.allclose(result, expected)


def test_matmul_negative_numbers():
    # Test with negative numbers
    A = np.array([[-1, -2], [-3, -4]])
    B = np.array([[2, -2], [-2, 2]])
    expected = np.array(
        [[-1 * 2 + -2 * -2, -1 * -2 + -2 * 2], [-3 * 2 + -4 * -2, -3 * -2 + -4 * 2]]
    )
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 6.25μs -> 2.42μs (159% faster)
    assert np.allclose(result, expected)


def test_matmul_float_precision():
    # Test with floating point numbers
    A = np.array([[0.1, 0.2], [0.3, 0.4]])
    B = np.array([[1.5, 2.5], [3.5, 4.5]])
    expected = np.matmul(A, B)
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 5.79μs -> 2.21μs (162% faster)
    assert np.allclose(result, expected)


def test_matmul_large_values():
    # Test with very large numbers to check for overflow
    A = np.array([[1e100, 2e100], [3e100, 4e100]])
    B = np.array([[5e100, 6e100], [7e100, 8e100]])
    expected = np.matmul(A, B)
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 5.42μs -> 2.21μs (145% faster)
    assert np.allclose(result, expected)


def test_matmul_small_values():
    # Test with very small numbers
    A = np.array([[1e-100, 2e-100], [3e-100, 4e-100]])
    B = np.array([[5e-100, 6e-100], [7e-100, 8e-100]])
    expected = np.matmul(A, B)
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 5.67μs -> 2.21μs (157% faster)
    assert np.allclose(result, expected)


# --------- Large Scale Test Cases ---------


def test_matmul_large_square():
    # Large 100x100 matrix multiplication
    np.random.seed(0)
    A = np.random.rand(100, 100)
    B = np.random.rand(100, 100)
    expected = np.matmul(A, B)
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 389ms -> 19.6μs (1983384% faster)
    assert np.allclose(result, expected)


def test_matmul_large_rectangular():
    # Large 100x200 by 200x50 matrix multiplication
    np.random.seed(1)
    A = np.random.rand(100, 200)
    B = np.random.rand(200, 50)
    expected = np.matmul(A, B)
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 389ms -> 21.6μs (1802876% faster)
    assert np.allclose(result, expected)


def test_matmul_large_vector():
    # Large vector dot product (1x1000 by 1000x1)
    np.random.seed(2)
    A = np.random.rand(1, 1000)
    B = np.random.rand(1000, 1)
    expected = np.matmul(A, B)
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 400μs -> 2.79μs (14246% faster)
    assert np.allclose(result, expected)


def test_matmul_large_outer_product():
    # Outer product of large vectors (1000x1 by 1x1000)
    np.random.seed(3)
    A = np.random.rand(1000, 1)
    B = np.random.rand(1, 1000)
    expected = np.matmul(A, B)
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 441ms -> 656μs (67125% faster)
    assert np.allclose(result, expected)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import numpy as np

# imports
import pytest  # used for our unit tests
from src.numerical.linear_algebra import numpy_matmul

# unit tests

# ---------------- BASIC TEST CASES ----------------


def test_matmul_identity_matrix():
    # Multiplying any matrix by the identity matrix should return the original matrix
    A = np.array([[1, 2], [3, 4]])
    I = np.eye(2)
    codeflash_output = numpy_matmul(A, I)
    result = codeflash_output  # 11.0μs -> 2.88μs (281% faster)
    assert np.allclose(result, A)


def test_matmul_zero_matrix():
    # Multiplying any matrix by a zero matrix should return a zero matrix
    A = np.array([[1, 2], [3, 4]])
    Z = np.zeros((2, 2))
    codeflash_output = numpy_matmul(A, Z)
    result = codeflash_output  # 6.96μs -> 2.71μs (157% faster)
    assert np.allclose(result, np.zeros((2, 2)))


def test_matmul_basic_2x2():
    # Basic 2x2 matrix multiplication
    A = np.array([[1, 2], [3, 4]])
    B = np.array([[2, 0], [1, 2]])
    expected = np.matmul(A, B)
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 6.92μs -> 2.58μs (168% faster)
    assert np.allclose(result, expected)


def test_matmul_rectangular_2x3_3x2():
    # 2x3 and 3x2 matrix multiplication
    A = np.array([[1, 2, 3], [4, 5, 6]])
    B = np.array([[7, 8], [9, 10], [11, 12]])
    expected = np.matmul(A, B)
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 8.21μs -> 2.46μs (234% faster)
    assert np.allclose(result, expected)


def test_matmul_vector_as_matrix():
    # Multiplying a 1xN matrix by a Nx1 matrix (dot product)
    A = np.array([[1, 2, 3]])
    B = np.array([[4], [5], [6]])
    expected = np.matmul(A, B)
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 3.79μs -> 2.46μs (54.3% faster)
    assert np.allclose(result, expected)


# ---------------- EDGE TEST CASES ----------------


def test_matmul_incompatible_shapes():
    # Matrices with incompatible shapes should raise ValueError
    A = np.array([[1, 2, 3], [4, 5, 6]])
    B = np.array([[1, 2], [3, 4]])
    with pytest.raises(ValueError):
        numpy_matmul(A, B)  # 1.42μs -> 792ns (78.8% faster)


def test_matmul_empty_matrices():
    # Multiplying empty matrices should work if dimensions are compatible (0xN * NxM -> 0xM)
    A = np.zeros((0, 3))
    B = np.zeros((3, 2))
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 1.38μs -> 2.46μs (44.1% slower)
    assert result.shape == (0, 2)


def test_matmul_one_element():
    # 1x1 matrix multiplied by 1x1 matrix
    A = np.array([[7]])
    B = np.array([[3]])
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 2.67μs -> 2.50μs (6.68% faster)
    assert np.allclose(result, [[21]])


def test_matmul_negative_numbers():
    # Multiplying matrices with negative numbers
    A = np.array([[1, -2], [-3, 4]])
    B = np.array([[-2, 1], [0, -1]])
    expected = np.matmul(A, B)
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 6.50μs -> 2.46μs (164% faster)
    assert np.allclose(result, expected)


def test_matmul_large_and_small_values():
    # Test with very large and very small (close to zero) values
    A = np.array([[1e10, 1e-10], [1e-10, 1e10]])
    B = np.array([[1e-10, 1e10], [1e10, 1e-10]])
    expected = np.matmul(A, B)
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 5.96μs -> 2.25μs (165% faster)
    assert np.allclose(result, expected)


def test_matmul_float_precision():
    # Test with floats to check precision
    A = np.array([[0.1, 0.2], [0.3, 0.4]])
    B = np.array([[0.5, 0.6], [0.7, 0.8]])
    expected = np.matmul(A, B)
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 5.67μs -> 2.21μs (157% faster)
    assert np.allclose(result, expected)


def test_matmul_non_square_rectangular():
    # Test non-square multiplication (3x2 * 2x4)
    A = np.array([[1, 2], [3, 4], [5, 6]])
    B = np.array([[7, 8, 9, 10], [11, 12, 13, 14]])
    expected = np.matmul(A, B)
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 13.8μs -> 2.54μs (441% faster)
    assert np.allclose(result, expected)


# ---------------- LARGE SCALE TEST CASES ----------------


def test_matmul_large_square():
    # Large square matrices (100x100)
    np.random.seed(0)
    A = np.random.rand(100, 100)
    B = np.random.rand(100, 100)
    expected = np.matmul(A, B)
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 393ms -> 20.6μs (1910049% faster)
    assert np.allclose(result, expected)


def test_matmul_large_rectangular():
    # Large rectangular matrices (100x500 * 500x50)
    np.random.seed(1)
    A = np.random.rand(100, 500)
    B = np.random.rand(500, 50)
    expected = np.matmul(A, B)
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 982ms -> 41.5μs (2370118% faster)
    assert np.allclose(result, expected)


def test_matmul_large_vector():
    # Large vector (1x1000 * 1000x1)
    np.random.seed(2)
    A = np.random.rand(1, 1000)
    B = np.random.rand(1000, 1)
    expected = np.matmul(A, B)
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 401μs -> 2.71μs (14735% faster)
    assert np.allclose(result, expected)


# ---------------- ADDITIONAL EDGE CASES ----------------


def test_matmul_with_zeros_and_ones():
    # Multiplying ones and zeros
    A = np.ones((3, 3))
    B = np.zeros((3, 3))
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 14.8μs -> 2.54μs (484% faster)
    assert np.allclose(result, np.zeros((3, 3)))


def test_matmul_with_different_dtypes():
    # Multiplying integer and float matrices
    A = np.array([[1, 2], [3, 4]], dtype=int)
    B = np.array([[0.5, 1.5], [2.5, -0.5]], dtype=float)
    expected = np.matmul(A, B)
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 8.83μs -> 2.58μs (242% faster)
    assert np.allclose(result, expected)


def test_matmul_broadcast_not_supported():
    # Should not support broadcasting like numpy.matmul does for 1D arrays
    A = np.array([1, 2, 3])
    B = np.array([4, 5, 6])
    with pytest.raises(ValueError):
        numpy_matmul(A, B)  # 2.88μs -> 1.29μs (123% faster)


def test_matmul_single_row_and_column():
    # 1xN and Nx1 should return a 1x1 matrix (dot product)
    A = np.array([[1, 2, 3, 4, 5]])
    B = np.array([[6], [7], [8], [9], [10]])
    expected = np.matmul(A, B)
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 5.33μs -> 2.62μs (103% faster)
    assert np.allclose(result, expected)


def test_matmul_single_column_and_row():
    # Nx1 and 1xM should return NxM matrix (outer product)
    A = np.array([[1], [2], [3]])
    B = np.array([[4, 5, 6]])
    expected = np.matmul(A, B)
    codeflash_output = numpy_matmul(A, B)
    result = codeflash_output  # 6.88μs -> 2.58μs (166% faster)
    assert np.allclose(result, expected)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-numpy_matmul-mji0ap57`, make your edits, and push.


codeflash-ai bot requested a review from KRRT7 on Dec 23, 2025 at 03:09
codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels on Dec 23, 2025
KRRT7 closed this on Dec 23, 2025
codeflash-ai bot deleted the codeflash/optimize-numpy_matmul-mji0ap57 branch on Dec 23, 2025 at 05:48