Looking at the performance profile of my aoc day 3 solution #8687
This leads to a speedup for AoC code from ~19s to ~300ms
Signed-off-by: Matthieu Pizenberg <[email protected]>
Roc Interpreter Performance Analysis

Date: December 16, 2025 (Updated)

Executive Summary

The Roc interpreter shows significant performance overhead compared to Python for a simple fold+append benchmark. Through systematic profiling and optimization, we've identified and addressed the major bottlenecks.

Completed Optimizations
- Method closure caching: evaluated_method_cache with an 88% hit rate (74% reduction)
- HashMap pre-allocation: eliminated the 100-200 warmup allocations
- Tuple scratch buffers: eliminated 66,033 allocations (16.8% reduction)

Remaining Bottlenecks

The actual list operations (callLowLevelBuiltin, which handles list_append) account for only 0.63% of the runtime; the remaining cost is in the interpreter machinery around them, mainly memory management, method/dot access handling, and type-system work (see the categorized breakdown below).

Optimization Progress

✅ Optimization 1: Method Closure Caching
Problem: Method closures were being re-evaluated on every call, so each rest.append(...) paid the full cost of resolveMethodFunction.
Solution: Added an evaluated_method_cache so a resolved method closure is evaluated once and reused on subsequent calls.
Results: 88% cache hit rate (74% reduction).
Code location: interpreter.zig
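As a rough sketch of this caching pattern (not the interpreter's actual code: MethodKey, EvaluatedMethod, and fakeEval are hypothetical stand-ins for the interpreter's own type/ident indices and closure values):

```zig
const std = @import("std");

// Hypothetical stand-ins; the real interpreter keys on its own type and ident indices.
const MethodKey = struct { receiver_type: u32, method_name: u32 };
const EvaluatedMethod = struct { closure_id: u32 };

const MethodCache = struct {
    map: std.AutoHashMap(MethodKey, EvaluatedMethod),

    fn init(gpa: std.mem.Allocator) MethodCache {
        return .{ .map = std.AutoHashMap(MethodKey, EvaluatedMethod).init(gpa) };
    }

    fn deinit(self: *MethodCache) void {
        self.map.deinit();
    }

    // Evaluate the method closure only on a cache miss; later calls with the
    // same (receiver type, method name) reuse the cached closure.
    fn getOrEvaluate(
        self: *MethodCache,
        key: MethodKey,
        comptime evaluate: fn (MethodKey) EvaluatedMethod,
    ) !EvaluatedMethod {
        const gop = try self.map.getOrPut(key);
        if (!gop.found_existing) gop.value_ptr.* = evaluate(key);
        return gop.value_ptr.*;
    }
};

// Placeholder standing in for actually building the method's closure value.
fn fakeEval(key: MethodKey) EvaluatedMethod {
    return .{ .closure_id = key.method_name };
}

test "method closure is evaluated at most once per (type, name)" {
    var cache = MethodCache.init(std.testing.allocator);
    defer cache.deinit();

    const key = MethodKey{ .receiver_type = 7, .method_name = 3 };
    const first = try cache.getOrEvaluate(key, fakeEval);
    const second = try cache.getOrEvaluate(key, fakeEval); // cache hit
    try std.testing.expectEqual(first.closure_id, second.closure_id);
}
```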
✅ Optimization 2: HashMap Pre-allocation
Problem: Hot HashMaps (translate_cache, poly_cache, method caches) started with capacity 0, causing multiple small reallocations during warmup.
Solution: Pre-allocate capacity for frequently-used HashMaps when the interpreter is initialized:

```zig
try result.translate_cache.ensureTotalCapacity(512);
try result.poly_cache.ensureTotalCapacity(128);
try result.method_resolution_cache.ensureTotalCapacity(64);
try result.evaluated_method_cache.ensureTotalCapacity(64);
try result.flex_type_context.ensureTotalCapacity(256);
try result.rigid_subst.ensureTotalCapacity(128);
try result.instantiate_scratch.ensureTotalCapacity(64);
```

Results: Eliminated the 100-200 small warmup reallocations.
Code location: interpreter.zig

✅ Optimization 3: Tuple Scratch Buffers
Problem: Each tuple finalization allocated 3 temporary arrays (layouts, values, rt_vars), immediately freed after use. For 22,011 tuples, this caused 66,033 allocations.
Solution: Added reusable scratch buffers to the interpreter struct:

```zig
tuple_scratch_layouts: std.array_list.Managed(Layout),
tuple_scratch_values: std.array_list.Managed(StackValue),
tuple_scratch_rt_vars: std.array_list.Managed(types.Var),
```

Each tuple clears and reuses these buffers instead of allocating:

```zig
self.tuple_scratch_layouts.clearRetainingCapacity();
try self.tuple_scratch_layouts.resize(total_count);
var elem_layouts = self.tuple_scratch_layouts.items;
```

Results: Eliminated 66,033 allocations (16.8% reduction).
Code location: interpreter.zig
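As a self-contained illustration of the reuse pattern (a sketch only, not the interpreter code; Layout is a placeholder element type and std.array_list.Managed is assumed to match the std version the interpreter builds against):

```zig
const std = @import("std");

// Placeholder element type; the interpreter stores Layout/StackValue/types.Var here.
const Layout = struct { tag: u8 };

const TupleScratch = struct {
    layouts: std.array_list.Managed(Layout),

    fn init(gpa: std.mem.Allocator) TupleScratch {
        return .{ .layouts = std.array_list.Managed(Layout).init(gpa) };
    }

    fn deinit(self: *TupleScratch) void {
        self.layouts.deinit();
    }

    // Reuse the same backing allocation for every tuple: after warmup,
    // clearRetainingCapacity + resize only allocates when a tuple is larger
    // than anything seen before.
    fn reserve(self: *TupleScratch, count: usize) ![]Layout {
        self.layouts.clearRetainingCapacity();
        try self.layouts.resize(count);
        return self.layouts.items;
    }
};

test "scratch buffer is reused across tuples" {
    var scratch = TupleScratch.init(std.testing.allocator);
    defer scratch.deinit();

    const first = try scratch.reserve(3);
    first[0] = .{ .tag = 1 };
    const second = try scratch.reserve(2); // capacity retained, no new allocation
    try std.testing.expectEqual(@as(usize, 2), second.len);
}
```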
Current Profile (After All Optimizations)

Key Metrics

Remaining Bottleneck Categories

Why
| Size | Roc (s) | Python (s) | Ratio |
|---|---|---|---|
| 1000 | ~0.5s | ~0.001s | ~500x slower |
Tracy Profiling Results
Top 20 Functions by Time
| Rank | Function | % Time | Calls | Mean (ns) | Notes |
|---|---|---|---|---|---|
| 1 | applyContinuation | 73.3% | 716,580 | 10,559 | Main dispatch loop |
| 2 | alloc | 25.6% | 563,830 | 4,680 | Memory allocation |
| 3 | scheduleExprEval | 25.5% | 625,541 | 4,203 | Expression scheduling |
| 4 | evalWithExpectedType | 24.0% | 114,874 | 21,601 | Type-directed eval |
| 5 | resolveMethodFunction | 23.6% | 102,797 | 23,755 | Method lookup |
| 6 | free | 22.8% | 563,394 | 4,182 | Memory deallocation |
| 7 | cont.binop_apply | 16.9% | 68,870 | 25,270 | Binary operators |
| 8 | sched.call | 15.2% | 23,041 | 68,154 | Call scheduling |
| 9 | instantiateType | 14.0% | 35,064 | 41,311 | Type instantiation |
| 10 | cont.dot_access_collect_args | 13.5% | 12,034 | 116,017 | Dot access args |
| 11 | cont.reassign_value | 13.1% | 34,955 | 38,670 | Var reassignment |
| 12 | cont.dot_access_resolve | 8.7% | 33,927 | 26,505 | Dot access resolution |
| 13 | cont.tuple_collect | 8.4% | 66,033 | 13,069 | Tuple construction |
| 14 | sched.call.anno_check | 6.0% | 22,034 | 28,236 | Annotation checking |
| 15 | typesStore.resolveVar | 4.8% | 3,204,487 | 154 | 3.2M calls! |
| 16 | evalLookupLocal | 4.5% | 225,921 | 2,044 | Local var lookup |
| 17 | cont.call_invoke_closure | 4.9% | 23,041 | 21,809 | Closure invocation |
| 18 | unifyGuarded | 3.6% | 103,270 | 3,566 | Type unification |
| 19 | cont.bind_decl | 3.6% | 34,015 | 10,829 | Declaration binding |
| 20 | translateTypeVar | 3.6% | 528,012 | 696 | Type translation |
Categorized Time Breakdown
| Category | % of Total | Key Functions |
|---|---|---|
| Memory Management | 48.4% | alloc (25.6%), free (22.8%) |
| Method/Dot Access | 45.8% | resolveMethodFunction (23.6%), cont.dot_access_* (22.2%) |
| Type System | ~25% | instantiateType (14%), unify* (~11%), translateTypeVar (3.6%) |
| Tuple Operations | ~11% | cont.tuple_collect (8.4%), tuple_collect.* (3%) |
| Binary Operators | 16.9% | cont.binop_apply |
| Actual List Ops | 0.63% | callLowLevelBuiltin |
Detailed Analysis
1. Memory Management (48.4%)
Observation: 563,830 allocations and 563,394 frees for processing 1,000 elements.
Per-iteration allocations: ~563 allocations per element processed.
Root causes:
- Every tuple (new_max, rest.append(...)) allocates
- Every closure invocation may allocate for bindings
- Pattern matching creates copies via pushCopy
- Type instantiation allocates new type variables
Evidence:
alloc: 563,830 calls, 4,680 ns mean
free: 563,394 calls, 4,182 ns mean
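As a rough cross-check (assuming the Tracy means above are inclusive wall-clock times): 563,830 × 4,680 ns ≈ 2.6 s in alloc plus 563,394 × 4,182 ns ≈ 2.4 s in free, about 5.0 s combined, which lines up with the 48.4% memory-management share in the categorized breakdown above.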
2. Method Resolution (23.6%)
Observation: resolveMethodFunction called 102,797 times (103x per iteration).
Problem: Every call to rest.append(...) triggers method resolution:
- Look up the method name "append"
- Find the implementation for the receiver type
- Resolve any polymorphic constraints
Evidence:
resolveMethodFunction: 102,797 calls, 23,755 ns mean = 2.44 seconds total
Expected: Method resolution should be cached after first lookup for the same type.
3. Type Instantiation (14.0%)
Observation: instantiateType called 35,064 times with 41,311 ns mean.
Problem: Polymorphic functions like append : List a, a -> List a require type instantiation on each call. The interpreter appears to re-instantiate types even for monomorphic call sites.
Evidence:
instantiateType: 35,064 calls, 41,311 ns mean = 1.45 seconds total
typesStore.resolveVar: 3,204,487 calls (3.2 million!)
4. Dot Access Overhead (22.2%)
Observation: cont.dot_access_collect_args and cont.dot_access_resolve together take 22.2%.
Problem: The dot access path (rest.append(...)) involves:
- Evaluate receiver (rest)
- Collect arguments
- Resolve method
- Create closure call
This multi-step process is expensive for simple method calls.
Evidence:
cont.dot_access_collect_args: 12,034 calls, 116,017 ns mean
cont.dot_access_resolve: 33,927 calls, 26,505 ns mean
5. Type System Operations (~25% combined)
Observation: Unification and type operations account for significant overhead.
| Function | Calls | Time % |
|---|---|---|
| unifyGuarded | 103,270 | 3.6% |
| unifyVars | 58,269 | 2.8% |
| unifyFlatType | 57,168 | 2.5% |
| unifyStructure | 58,168 | 2.6% |
| unifyNominalType | 35,137 | 1.1% |
| unifyTuple | 22,000 | 1.0% |
Problem: The interpreter performs runtime type checking/unification that a compiled version would resolve at compile time.
6. The Non-Problem: List Copying
Observation: callLowLevelBuiltin (which handles list_append) is only 0.63%.
The initial hypothesis was wrong: I suspected O(n²) list copying due to refcount > 1 preventing in-place mutation. While this may still occur, it's not the dominant cost.
Evidence:
callLowLevelBuiltin: 114,797 calls, 568 ns mean = 0.065 seconds
copyToPtr: 403,917 calls, 114 ns mean = 0.046 seconds
Optimization Recommendations
High Priority (>10% impact potential)
1. Cache Method Resolution
Current: resolveMethodFunction called 102,797 times for same method
Target: Cache by (receiver_type, method_name) -> resolved_function
Expected impact: -20% runtime
2. Reduce Allocations
Current: 563K allocations for 1K elements
Target:
- Pool allocators for common sizes (see the sketch after this list)
- Stack allocation for small temporaries
- Avoid copying in pattern matching when value is consumed
Expected impact: -30% runtime
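For the pool-allocator idea, a minimal sketch assuming Zig's std.heap.MemoryPool fits the interpreter's fixed-size temporaries (Binding is a hypothetical placeholder for something like a closure binding):

```zig
const std = @import("std");

// Hypothetical fixed-size temporary; closure bindings and small boxed values
// are the kind of allocation a pool could serve.
const Binding = struct { slot: u32, value: i64 };

test "pool reuses freed nodes instead of hitting the general allocator" {
    // MemoryPool hands out fixed-size nodes and keeps an internal free list,
    // so create/destroy in a hot loop stops being one malloc/free pair each.
    var pool = std.heap.MemoryPool(Binding).init(std.testing.allocator);
    defer pool.deinit();

    var i: usize = 0;
    while (i < 1000) : (i += 1) {
        const b = try pool.create();
        b.* = .{ .slot = @intCast(i), .value = 42 };
        pool.destroy(b); // returned to the pool's free list, not freed
    }
}
```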
3. Cache Type Instantiation
Current: instantiateType called 35K times
Target: Cache instantiated types for monomorphic call sites
Expected impact: -10% runtime
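A minimal sketch of such a cache, guarded so it only applies when the call site's argument types are fully concrete (all types below are hypothetical stand-ins, not the interpreter's real representation):

```zig
const std = @import("std");

// Hypothetical stand-ins for the interpreter's type representation.
const TypeVar = u32;
const Content = union(enum) { flex, concrete: u32 };

fn isFullyConcrete(args: []const Content) bool {
    for (args) |c| {
        if (c == .flex) return false; // still polymorphic: do not cache
    }
    return true;
}

// Reuse a previous instantiation for a call site only when every argument
// type is concrete; otherwise fall back to a fresh instantiation.
fn instantiateCached(
    cache: *std.AutoHashMap(u32, TypeVar),
    call_site: u32,
    args: []const Content,
    comptime instantiate: fn (u32) TypeVar,
) !TypeVar {
    if (isFullyConcrete(args)) {
        const gop = try cache.getOrPut(call_site);
        if (!gop.found_existing) gop.value_ptr.* = instantiate(call_site);
        return gop.value_ptr.*;
    }
    return instantiate(call_site);
}

fn fakeInstantiate(site: u32) TypeVar {
    return site + 1000; // placeholder for the real instantiation work
}

test "a concrete call site reuses its instantiation" {
    var cache = std.AutoHashMap(u32, TypeVar).init(std.testing.allocator);
    defer cache.deinit();

    const concrete = [_]Content{ .{ .concrete = 1 }, .{ .concrete = 2 } };
    const a = try instantiateCached(&cache, 7, &concrete, fakeInstantiate);
    const b = try instantiateCached(&cache, 7, &concrete, fakeInstantiate); // cache hit
    try std.testing.expectEqual(a, b);
}
```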
Medium Priority (5-10% impact potential)
4. Optimize Dot Access Path
Current: Multi-continuation chain for method calls
Target: Fast path for simple method calls
Expected impact: -10% runtime
5. Reduce Type Variable Resolution
Current: 3.2M calls to typesStore.resolveVar
Target: Cache resolved variables, avoid repeated resolution
Expected impact: -5% runtime
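If typesStore behaves like a union-find over type variables, path compression during resolution is the standard way to collapse these repeated lookups (the store may already do some of this). A generic sketch with made-up types:

```zig
const std = @import("std");

// Sketch of a union-find style variable store: each slot either redirects to
// another variable or is a root.
const VarStore = struct {
    parent: []u32, // parent[i] == i means i is a root

    // Resolve with path compression: every variable touched on the way up is
    // re-pointed directly at the root, so repeated resolve calls on the same
    // chain become O(1) instead of re-walking the chain.
    fn resolve(self: *VarStore, v: u32) u32 {
        var root = v;
        while (self.parent[root] != root) root = self.parent[root];
        var cur = v;
        while (self.parent[cur] != cur) {
            const next = self.parent[cur];
            self.parent[cur] = root;
            cur = next;
        }
        return root;
    }
};

test "path compression flattens redirect chains" {
    var parents = [_]u32{ 1, 2, 3, 3 }; // 0 -> 1 -> 2 -> 3 (root)
    var store = VarStore{ .parent = &parents };

    try std.testing.expectEqual(@as(u32, 3), store.resolve(0));
    // After one resolve the chain is flattened: 0 now points straight at 3.
    try std.testing.expectEqual(@as(u32, 3), parents[0]);
}
```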
6. Binary Operator Optimization
Current: cont.binop_apply at 16.9%
Target: Inline common operations (int comparison, addition)
Expected impact: -10% runtime
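A sketch of the fast-path idea with hypothetical Value/BinOp types: handle the int/int case inline and return null to fall back to the general continuation machinery for everything else (the real boolean representation would differ).

```zig
const std = @import("std");

// Hypothetical runtime value and operator types for the sketch.
const Value = union(enum) { int: i64, other: void };
const BinOp = enum { add, lt };

// Try to evaluate a binary op without scheduling a continuation.
// Returns null when the operands need the general (slow) path.
fn binopFastPath(op: BinOp, lhs: Value, rhs: Value) ?Value {
    if (lhs != .int or rhs != .int) return null;
    return switch (op) {
        .add => Value{ .int = lhs.int + rhs.int },
        .lt => Value{ .int = @intFromBool(lhs.int < rhs.int) }, // simplified bool-as-int
    };
}

test "int/int operations skip the continuation machinery" {
    const sum = binopFastPath(.add, .{ .int = 2 }, .{ .int = 3 }).?;
    try std.testing.expectEqual(@as(i64, 5), sum.int);

    // A non-int operand falls back to the general path (null here).
    try std.testing.expect(binopFastPath(.add, .{ .other = {} }, .{ .int = 3 }) == null);
}
```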
Lower Priority
7. Tuple Collection Optimization
- Pre-allocate tuple memory
- Avoid intermediate allocations
8. Pattern Matching Optimization
- Move semantics instead of copy
- Avoid refcount increment when source is consumed
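A tiny sketch of the move-instead-of-copy idea (RcValue and the last_use flag are hypothetical; the real interpreter would have to know from its pattern analysis that the source is consumed):

```zig
const std = @import("std");

// Hypothetical refcounted value for the sketch.
const RcValue = struct {
    refcount: u32,
    payload: i64,
};

// Bind a value to a pattern variable. If this is the source's last use,
// transfer ownership: no copy and no refcount increment.
fn bindValue(src: *RcValue, last_use: bool) *RcValue {
    if (!last_use) src.refcount += 1; // shared: account for the new reference
    return src;
}

test "a consuming bind does not touch the refcount" {
    var v = RcValue{ .refcount = 1, .payload = 99 };

    const moved = bindValue(&v, true); // move: last use of the source
    try std.testing.expectEqual(@as(u32, 1), moved.refcount);

    const shared = bindValue(&v, false); // share: refcount goes up
    try std.testing.expectEqual(@as(u32, 2), shared.refcount);
}
```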
Relevant Code Locations
| Function | File | Line |
|---|---|---|
| resolveMethodFunction | interpreter.zig | 7822 |
| instantiateType | interpreter.zig | 9131 |
| applyContinuation | interpreter.zig | 12983 |
| scheduleExprEval | interpreter.zig | 10292 |
| pushCopy | interpreter.zig | 893 |
| copyToPtr | StackValue.zig | 272 |
| callLowLevelBuiltin | interpreter.zig | 1158 |
| cont.dot_access_resolve | interpreter.zig | 15272 |
| cont.dot_access_collect_args | interpreter.zig | 15644 |
Next Steps
Completed ✅
- Investigate resolveMethodFunction: found that method closures were being re-evaluated on every call
- Add method resolution caching: implemented evaluated_method_cache with 88% hit rate (74% reduction)
- Pre-allocate HashMaps: eliminated 100-200 warmup allocations
- Implement tuple scratch buffers: eliminated 66,033 allocations (16.8% reduction)
Current Investigation 🔍
High Priority: Type Variable Explosion
Problem: Creating 578,522 type variables for processing 1,000 elements
Breakdown:
- typesStore.fresh: 271,813 calls
- typesStore.freshFromContent: 306,709 calls
- Total: 578,522 type variables created
Questions to investigate:
- Why do we create 578 type variables per element processed?
- Are these type variables truly necessary for correctness?
- Can we cache or reuse type variables for monomorphic cases?
- Is polymorphic instantiation creating excessive type variables?
Expected: For a simple fold_rev with tuples, we might expect:
- ~22,000 type variables for tuples (one per tuple)
- ~35,000 for function instantiation
- Total expected: ~60,000-80,000
Actual: 578,522 (7-10x more than expected!)
Remaining Targets 🎯
Medium Priority: Remaining Memory Allocations (~327K)
With tuple allocations eliminated, remaining sources:
- Type store growth: ~60,000-100,000 allocations (efficient exponential growth)
- HashMap growth: ~100-200 allocations (now pre-allocated)
- Other operations: ~200,000+ allocations (to be profiled)
Potential approach: Cache instantiated types for identical (function_type, argument_types) pairs when types are fully concrete.
Medium Priority: Binary Operator Overhead (2.8%)
Problem: cont.binop_apply has high per-call cost.
Solution: Inline common operations (int comparison, addition) without continuation overhead.
Lower Priority
- Tuple operations: Pre-allocate tuple memory, avoid intermediate copies
- typesStore.resolveVar: 2.8M calls - consider caching resolved results
Appendix: Raw Tracy Data
See bench_fold_append_tracy.csv for complete profiling data.
Data Collection Method
```sh
# Build with Tracy
zig build roc -Dtracy=$HOME/git/wolfpld/tracy -Dtracy-callstack=true

# Run benchmark
cd benchmarks
roc build bench_fold_append.roc
echo 1000 | ./bench_fold_append

# Export from Tracy GUI to CSV
# save bench_fold_append.tracy from the GUI, and then
tracy-csvexport bench_fold_append.tracy > bench_fold_append_tracy.csv
```
Thank you for your contribution! Sometimes PRs end up staying open for a long time without activity, which can make the list of open PRs get long and time-consuming to review. To keep things manageable for reviewers, this bot automatically closes PRs that haven’t had activity in 60 days. This PR hasn’t had activity in 30 days, so it will be automatically closed if there is no more activity in the next 30 days. Keep in mind that PRs marked
WIP: do NOT merge
I have set up a small benchmark to look at how Roc performs compared to Python. The details of this benchmark are provided in this message on Zulip: #advent of code > 2025 Day 3 @ 💬
Out of curiosity, I've looked into using Tracy to get a better view of the bottlenecks, and it's clearly still memory: allocation/deallocation account for roughly 40% of the runtime. Tracy is pretty cool!
In the next comment below, I'll add the current state of my analysis, written with the help of Claude.