
Conversation

brchristian
Contributor

In src/datasets/load.py, we can use unpacking rather than concatenating two lists for improved time and memory performance. It’s a small improvement in absolute terms, but a consistent and measurable one:

- ALL_ALLOWED_EXTENSIONS = list(_EXTENSION_TO_MODULE.keys()) + [".zip"]
+ ALL_ALLOWED_EXTENSIONS = [*_EXTENSION_TO_MODULE.keys(), ".zip"]

Benchmarking shows roughly a 32.3% time improvement and a 30.6% memory reduction.
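The difference is visible in the bytecode: the concatenation form materializes an intermediate list from `keys()` plus a one-element list before `+` allocates the final result, while the unpacking form extends a single list in place. A quick way to see this (a sketch, using a toy two-entry stand-in for `_EXTENSION_TO_MODULE`; `dis` output varies between Python versions):

```python
import dis

# Toy stand-in for _EXTENSION_TO_MODULE, just for illustration
_EXTENSION_TO_MODULE = {".csv": "csv", ".json": "json"}

def concat():
    # Builds list(keys()), builds [".zip"], then + allocates a third list
    return list(_EXTENSION_TO_MODULE.keys()) + [".zip"]

def unpack():
    # Builds one list and extends/appends into it directly
    return [*_EXTENSION_TO_MODULE.keys(), ".zip"]

# Both produce the same list; only the intermediate allocations differ
assert concat() == unpack()

dis.dis(concat)
dis.dis(unpack)
```

On CPython 3.9+ the unpacking version compiles to `BUILD_LIST`/`LIST_EXTEND` with no temporary lists, which is where the time and memory savings come from.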

Example benchmarking script:

#!/usr/bin/env python3
"""
Benchmark script to test performance of list(_EXTENSION_TO_MODULE.keys()) vs [*_EXTENSION_TO_MODULE.keys()]
"""
import time
import tracemalloc
from statistics import mean, stdev

# Simulate _EXTENSION_TO_MODULE - based on actual size from datasets
_EXTENSION_TO_MODULE = {
    f".ext{i}": f"module{i}" for i in range(20)  # Realistic size
}

def method_old():
    """Current implementation using list()"""
    return list(_EXTENSION_TO_MODULE.keys()) + [".zip"]

def method_new():
    """Proposed implementation using unpacking"""
    return [*_EXTENSION_TO_MODULE.keys(), ".zip"]

def benchmark_time(func, iterations=100000):
    """Benchmark execution time"""
    times = []
    for _ in range(10):  # Multiple runs for accuracy
        start = time.perf_counter()
        for _ in range(iterations):
            func()
        end = time.perf_counter()
        times.append((end - start) / iterations * 1_000_000)  # microseconds
    
    return mean(times), stdev(times)

def benchmark_memory(func):
    """Benchmark peak memory usage"""
    tracemalloc.start()
    func()
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

if __name__ == "__main__":
    print("Benchmarking list() vs unpacking performance...\n")
    
    # Time benchmarks
    old_time, old_std = benchmark_time(method_old)
    new_time, new_std = benchmark_time(method_new)
    
    print("Time Performance (µs per operation):")
    print(f"  list() approach:     {old_time:.3f} ± {old_std:.3f}")
    print(f"  unpacking approach:  {new_time:.3f} ± {new_std:.3f}")
    print(f"  Improvement:         {((old_time - new_time) / old_time * 100):.1f}% faster")
    
    # Memory benchmarks
    old_mem = benchmark_memory(method_old)
    new_mem = benchmark_memory(method_new)
    
    print("\nMemory Usage (bytes):")
    print(f"  list() approach:     {old_mem}")
    print(f"  unpacking approach:  {new_mem}")
    print(f"  Reduction:           {old_mem - new_mem} bytes ({((old_mem - new_mem) / old_mem * 100):.1f}% less)")
    
    # Verify identical results
    assert method_old() == method_new(), "Results should be identical!"
    print("\n✓ Both methods produce identical results")

Results:

Benchmarking list() vs unpacking performance...

Time Performance (µs per operation):
  list() approach:     0.213 ± 0.020
  unpacking approach:  0.144 ± 0.002
  Improvement:         32.3% faster

Memory Usage (bytes):
  list() approach:     392
  unpacking approach:  272
  Reduction:           120 bytes (30.6% less)

✓ Both methods produce identical results
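As a cross-check, the standard library's `timeit` module can run the same comparison without a hand-rolled timing loop (a sketch; the 20-entry dict mirrors the stand-in used in the script above, and absolute numbers will vary by machine and Python version):

```python
import timeit

# Same 20-entry stand-in for _EXTENSION_TO_MODULE as in the benchmark script
_EXTENSION_TO_MODULE = {f".ext{i}": f"module{i}" for i in range(20)}

N = 100_000  # iterations per measurement

concat_s = timeit.timeit(
    "list(_EXTENSION_TO_MODULE.keys()) + ['.zip']",
    globals=globals(), number=N,
)
unpack_s = timeit.timeit(
    "[*_EXTENSION_TO_MODULE.keys(), '.zip']",
    globals=globals(), number=N,
)

print(f"concat:  {concat_s / N * 1e6:.3f} µs/op")
print(f"unpack:  {unpack_s / N * 1e6:.3f} µs/op")
```

In informal runs this shows the same direction of improvement as the script above, though the exact percentage depends on dict size and interpreter version.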

Use unpacking rather than concatenation for improved time and memory performance.