Lesson 12: Dynamic Compilers #562

sampsyo · 2025-08-24T22:55:39Z

sampsyo
Aug 24, 2025
Maintainer

Here's the thread for the dynamic compilers task, which involves doing some speculative transformations on Bril IR!

jku20 · 2025-11-10T03:27:40Z

jku20
Nov 10, 2025

The implementation of the trace can be found here in my fork of bril. The bulk of the implementation of the speculation pass can be found here.

Summary: I added a -t flag to brili which outputs a trace. I additionally wrote a pass which processes these traces and inserts them as straight-line code into the profiled Bril IR. There is little post-processing on these traces: jmps are removed and branches are converted to guards. I tested the pass using the ackermann, check-primes, and karatsuba benchmarks from Bril's core benchmark suite, with scripts that run each benchmark on a matrix of inputs. The resulting IR was tested both on the inputs used to generate the trace and on inputs not used to generate it; in every case, behavior was unchanged after the traces were inserted. I found a significant performance penalty (measured in dynamic instruction count) from these insertions. Because the traces aren't optimized (in fact, they make br instructions that take the false path slower, because converting such a br into a guard requires negating the condition first), the only way a trace can improve performance is by containing many jmps. This only happened in contrived tests written explicitly to show a speedup.
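A minimal Python sketch of the post-processing described above, assuming the trace is available as a list of (instruction, taken label) pairs in Bril's JSON form. The function name, the pair representation, the `bailout` label, and the `__neg` temporaries are illustrative assumptions, not taken from the linked fork.

```python
def trace_to_guarded(trace, bail_label="bailout"):
    """Turn a recorded trace into straight-line code.

    `trace` is a list of (instr, taken_label) pairs (a hypothetical format):
    `instr` is a Bril instruction as a JSON-style dict, and `taken_label` is
    the label control actually went to (None for non-control-flow ops).
    """
    out = []
    for i, (instr, taken) in enumerate(trace):
        op = instr.get("op")
        if op == "jmp":
            # The trace already continues at the jump target, so drop the jmp.
            continue
        if op == "br":
            cond = instr["args"][0]
            true_label = instr["labels"][0]
            if taken != true_label:
                # guard bails out when its argument is false, so a branch
                # that took the false path needs an extra `not` first.
                neg = f"__neg{i}"  # hypothetical fresh temporary name
                out.append({"op": "not", "dest": neg, "type": "bool", "args": [cond]})
                cond = neg
            out.append({"op": "guard", "args": [cond], "labels": [bail_label]})
            continue
        out.append(instr)
    return out
```

In this sketch a branch that took the true path costs one instruction (the guard), a branch that took the false path costs two, and only a removed jmp saves anything, which matches the observation that unoptimized traces rarely win.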

More Details on Tracing: Adding optimizations to the inserted traces would be possible, but the biggest influence on performance seemed to be how the profiler chose the traces in the first place. My profiler is quite simple. It attempts to create intraprocedural traces of at most 5 instructions, starting at the first executed instruction and rejecting instructions which could have "side effects" (e.g. ret, call, print). It additionally imposes the (not strictly necessary) constraint that all traces be disjoint, which avoids ambiguity when traces start on the same instruction and curbs the otherwise ballooning code size that inserting the traces was causing. The number 5 is arbitrary, but testing larger numbers didn't yield significantly different traces; for the benchmarks I tested, an unhandled instruction was what most often stopped a trace. This method occasionally finds a common path, but just as often it finds short traces with lots of ignored instructions, and inserting those into the IR using speculation can only slow down the program.

Results: Except for programs with many chained jumps, this implementation of speculative execution increases the dynamic instruction count, as the table below shows. Each entry lists the dynamic instruction count, with the input used to produce that count in parentheses:

benchmark	baseline (trace input)	speculative (trace input)	baseline (new input 1)	speculative (new input 1)	baseline (new input 2)	speculative (new input 2)
ackermann	1464231 (3 6)	2326900 (3 6)	231 (2 2)	376 (2 2)	20700 (3 3)	33029 (3 3)
check-primes	8468 (50)	12683 (50)	9754 (55)	14604 (55)	26011 (100)	38806 (100)
karatsuba	1548 (43210 98765)	2106 (43210 98765)	537 (100 100)	734 (100 100)	1935 (12345 678910)	2625 (12345 678910)

These show roughly a 36-63% increase in dynamic instruction count: about 36% for karatsuba, 50% for check-primes, and 60% for ackermann. The results are mostly unsurprising. My explanation for why ackermann performs notably worse than the other two benchmarks is that the tracing algorithm happened to find a really small trace in the recursive case, which inserted proportionally more extra instructions than in the other tests. Due to the simplicity of my tracing algorithm and the fairly deterministic nature of the benchmarks I was tracing, I found the traces were pretty stable across inputs above a given size threshold. These results suggest that the best place for immediate improvement is smarter trace collection, such as a heuristic for when to start traces that favors longer, explicitly hot paths or traces which the compiler can optimize. One possible heuristic would be to start traces on br instructions, with the goal of inserting traces in the hot path of a loop. Additionally, static optimization of the inserted traces would certainly help performance, as there are currently very few ways for an inserted trace to reduce dynamic instruction count.
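For reference, the relative overheads can be recomputed directly from the table above. This is only an arithmetic sketch; the dictionary layout and the `overhead_percent` helper are my own naming, and the numbers are copied from the table.

```python
# (baseline, speculative) dynamic instruction counts, copied from the table:
# first pair is the trace input, the other two are the new inputs.
counts = {
    "ackermann": [(1464231, 2326900), (231, 376), (20700, 33029)],
    "check-primes": [(8468, 12683), (9754, 14604), (26011, 38806)],
    "karatsuba": [(1548, 2106), (537, 734), (1935, 2625)],
}


def overhead_percent(base, spec):
    """Relative increase in dynamic instructions, in percent."""
    return 100.0 * (spec - base) / base


overheads = {
    name: [round(overhead_percent(base, spec), 1) for base, spec in pairs]
    for name, pairs in counts.items()
}

for name, values in overheads.items():
    print(name, values)
```

Within each benchmark the overhead barely moves across inputs, which is consistent with the traces being stable once the input is large enough.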

The Hardest Part: I found the most challenging part to be dealing with basic block terminators. My code initially made the incorrect assumption that traces should fall through to the instruction following them. This is clearly incorrect but took a bit to debug. Additionally, I'd never programmed in TypeScript before (and had extremely minimal JavaScript experience), so hacking the interpreter had a bit of a learning curve.

Star(s)?: I think this is star worthy work. It implements tracing and then the insertion of these traces into the profiled programs. It then does a straightforward description of this implementation and a brief analysis of quantitative results. The analysis then makes an argument for admittedly somewhat obvious improvements to the implementation. As I didn't actually implement these, I don't think the work is multi-star worthy, but I also think there is enough here to disqualify no stars, hence one star.

1 reply

sampsyo Nov 14, 2025
Maintainer Author

Very impressive, @jku20! Especially because you finished this before we even did the lesson about tracing. :) I think it's a cool idea to collect a lot of small traces; I think it will be just as common (or more) for people to approach this task by just attempting to take one big trace. Going this route led you to think about the space of possible heuristics for finding good traces to collect, which is great!

Mond45 · 2025-11-18T05:42:01Z

Mond45
Nov 18, 2025

Source: https://github.com/Mond45/bril/tree/trace

What I Did

For tracing, I modified the reference TypeScript Bril interpreter to output a trace for every executed instruction. The trace starts from the beginning of the main function and continues until either 16 instructions or a side-effect instruction (ret, print, or call) is executed. For control-flow instructions, jumps are removed from the trace, and branches are converted into guard instructions. I decided not to implement the interprocedural version of the tracer mainly due to the complications around handling variable names and return values across function calls. Since I only implemented the intraprocedural version, the tracer simply counts executed instructions to determine the point to continue from after successful speculation.

For the trace injector, I built a TypeScript script that wraps the input trace with speculate and commit, and adds the labels for points to continue from when speculation succeeds or fails.
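The wrapping step could look roughly like the Python sketch below (the actual injector is a TypeScript script; this is not its code). It assumes Bril's JSON function format, that the guards inside the trace already target `bail_label`, and that `resume_index` is the position in the original body where execution continues after a successful commit. All names here are hypothetical.

```python
def inject_trace(func, trace, resume_index,
                 bail_label="trace_bail", resume_label="trace_resume"):
    """Return a copy of `func` with a speculative fast path in front.

    Layout: speculate, trace, commit, jmp resume | bail label | original body,
    with the resume label inserted into the body at `resume_index`.
    """
    body = list(func["instrs"])
    # Mark the point where the fast path rejoins the original code.
    body.insert(resume_index, {"label": resume_label})
    fast_path = (
        [{"op": "speculate"}]
        + list(trace)
        + [{"op": "commit"}, {"op": "jmp", "labels": [resume_label]}]
    )
    # A failing guard rolls back and lands on bail_label, i.e. at the start
    # of the untouched original body.
    return {**func, "instrs": fast_path + [{"label": bail_label}] + body}
```

Keeping the whole original body after the bail label means a failed guard simply re-runs the function from the top, which is what makes the transformation safe for inputs the trace was not recorded on.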

I also wrote another script for creating traces and generating the optimized programs by injecting the built traces.

The Hardest Part

I initially aimed to implement the interprocedural tracer, but ultimately gave up because of limited time and the increasing complexity of the implementation details. I also spent quite some time debugging my test creation script, where it failed weirdly because I forgot to close the intermediate trace output file. The intraprocedural tracer and the injector were fairly straightforward to implement, though they required some effort to navigate and get used to the TypeScript Bril implementation.

Testing

I tested the optimized programs with injected traces on a selected subset of core Bril benchmarks. These tests were chosen such that each source contains only a main function. To avoid overfitting, the script builds traces using one set of argument values and then tests the optimized programs using a different set, comparing them against the unoptimized versions using turnt.

The benchmarks tested were:

benchmarks/core/factors.bril
benchmarks/core/gcd.bril
benchmarks/core/geometric-sum.bril
benchmarks/core/loopfact.bril
benchmarks/core/reverse.bril
benchmarks/core/squares.bril
benchmarks/core/sum-bits.bril
benchmarks/core/sum-digits.bril

The dynamic instruction counts for the unoptimized version versus the injected-trace version are shown below:

Benchmark	Baseline	Traced
factors	72	77
gcd	64	75
geometric-sum	35	40
loopfact	77	82
reverse	46	51
squares	83	86
sum-bits	73	80
sum-digits	114	119

Conclusion

I believe I deserve a Michelin star for this task. Even though the work was not that ambitious, I managed to correctly implement the tracer and the optimizer, thoroughly verify their correctness, and build a script to automate the test creation process.

1 reply

keikun555 Dec 4, 2025
Collaborator

Good job! It would be good to include data on whether the optimizations broke the programs!

jeffreyqdd · 2025-11-19T04:03:11Z

jeffreyqdd
Nov 19, 2025

Source Code: https://github.com/jeffreyqdd/toy_trace_jit

What I Did

Tracing

This was a really interesting project, and I realized how precise I needed to be with tracing. For starters, I modified the reference TypeScript Bril interpreter to output a trace when I provide the "--trace" flag. For fun, I made the trace span the entire program execution. A sample trace output from the modified Bril interpreter looks like:

tracer: {"dest":"a","op":"const","pos":{"col":5,"row":30},"type":"int","value":0}&*&{"col":5,"row":30}&*&{"a":0}&*&null
tracer: {"dest":"b","op":"const","pos":{"col":5,"row":31},"type":"int","value":10}&*&{"col":5,"row":31}&*&{"a":0,"b":10}&*&null
tracer: {"dest":"c","op":"const","pos":{"col":5,"row":32},"type":"int","value":20}&*&{"col":5,"row":32}&*&{"a":0,"b":10,"c":20}&*&null
tracer: {"args":["a","b","c"],"dest":"result","funcs":["heavy_branching_loop"],"op":"call","pos":{"col":5,"row":33},"type":"int"}&*&{"col":5,"row":33}&*&{"a":0,"b":10,"c":20}&*&"heavy_branching_loop"

So each instruction execution would give me 4 pieces of information delimited by &*&:

instruction in JSON format
extracted position (row is used as the mock program counter)
environment (local values)
function call
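A small Python sketch of how one such line could be split back into its four fields. The `tracer: ` prefix and the `&*&` delimiter come from the sample output above; the function name and the returned dictionary keys are my own choices.

```python
import json

PREFIX = "tracer: "
DELIM = "&*&"


def parse_trace_line(line):
    """Parse one tracer line into its four fields, or return None if the
    line is not tracer output (e.g. ordinary program output)."""
    if not line.startswith(PREFIX):
        return None
    fields = line[len(PREFIX):].rstrip("\n").split(DELIM)
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    instr, pos, env, callee = (json.loads(field) for field in fields)
    return {
        "instr": instr,    # the instruction in JSON form
        "pc": pos["row"],  # the row is used as the mock program counter
        "env": env,        # local values after the instruction
        "callee": callee,  # function name for calls, otherwise None
    }


sample = (
    'tracer: {"dest":"a","op":"const","pos":{"col":5,"row":30},'
    '"type":"int","value":0}&*&{"col":5,"row":30}&*&{"a":0}&*&null'
)
record = parse_trace_line(sample)
```

Splitting on the delimiter is safe here only because the values involved are numbers and booleans; a string containing `&*&` would break this simple approach.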

JIT

The interesting question now is: "How do we detect hot parts of the code?" To tackle this question, I defined min_length and min_repeat. A continuous segment must be at least min_length and must repeat min_repeat times. To model this, I constructed a tree structure where each node is a PC value, and its children are a list of tuples (child PC, taken count). The "hottest" trace would be to greedily start at a node with a visited count > min_repeat and take its children with the highest visited count, inserting guards if there is not a single child. The result should be a blazing fast branchless path through the hottest code segments.
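A rough Python sketch of the greedy selection described above, assuming the trace has been reduced to a list of PC values. It uses per-PC successor counts rather than the exact tree structure from my code, and the function names are illustrative only.

```python
from collections import Counter, defaultdict


def build_counts(pcs):
    """Count how often each PC is visited and how often each successor follows it."""
    visits = Counter(pcs)
    successors = defaultdict(Counter)
    for cur, nxt in zip(pcs, pcs[1:]):
        successors[cur][nxt] += 1
    return visits, successors


def hottest_path(pcs, min_length, min_repeat):
    """Greedily follow the most-taken successor from the hottest PC.

    Returns a list of (pc, needs_guard) pairs, or None if nothing qualifies.
    A guard is needed wherever more than one successor was observed.
    """
    visits, successors = build_counts(pcs)
    hot = [pc for pc, n in visits.items() if n > min_repeat]
    if not hot:
        return None
    pc = max(hot, key=lambda p: visits[p])
    path, seen = [], set()
    while pc not in seen:  # stop when the path would loop back on itself
        seen.add(pc)
        children = successors.get(pc)
        path.append((pc, children is not None and len(children) > 1))
        if not children:
            break
        pc = children.most_common(1)[0][0]
    return path if len(path) >= min_length else None
```

For example, on the PC sequence 1, 2, 3, 1, 2, 3, 1, 2, 4 with `min_repeat=2`, this picks the path 1, 2, 3 and marks PC 2 as needing a guard because it was followed by both 3 and 4.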

For simplicity, I opted not to have a hot-path segment execute instructions that could have side effects (call, ret, print).

After generating the guarded trace, I iteratively move guards closer to the start of the speculation block to reduce the number of instructions that are flushed when a guard fails.
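The guard-hoisting step can be sketched in Python as repeated adjacent swaps. This is an illustration under the assumption that the trace is a list of Bril JSON instructions with no labels or side effects inside the speculative block; it is not the exact code from the repository.

```python
def hoist_guards(trace):
    """Move each guard as early as possible without crossing the
    instruction that defines its condition, another guard, or speculate."""
    instrs = list(trace)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(instrs)):
            cur, prev = instrs[i], instrs[i - 1]
            if cur.get("op") != "guard":
                continue
            if prev.get("op") in ("guard", "speculate"):
                continue  # keep guard order; never leave the speculative block
            if prev.get("dest") == cur["args"][0]:
                continue  # the guard needs this value, so it cannot move up
            instrs[i - 1], instrs[i] = cur, prev
            changed = True
    return instrs
```

Because the instructions a guard moves past neither define its condition nor have side effects, a failing guard now discards less speculative work while a passing guard behaves exactly as before.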

Hardest Part
I realized that loops were ESPECIALLY ANNOYING. Consider PC=1 being a "loop header instruction" and PC=2,3 being "loop body instructions". The hot-code algorithm would say "1,2,3" is traceable. HOWEVER, it would fail to distinguish that the values for each "1,2,3" run-through would be different. This was where the environment values emitted by the tracer proved to be helpful. I revised my tree structure to change the hash of the node from just hash(node) to hash(hash(node) + hash(environment)). I understand that this is a very nuclear option, but I care about my tracing JIT not being overly optimistic with speculation. If my JIT tracing were too optimistic, it would incur the cost of guards and aborting speculative execution.

Testing

I had a toy example that looked like

@heavy_branching_loop(a: int, b: int, c: int): int {
    literal_5: int = const 5;
    cond1: bool = lt a literal_5;
    literal_1: int = const 1;
    return_value: int = const 0;
    br cond1 .branch1 .branch2;
    # ... a lot of branches depending on cond1
    ret result;
}

... a lot of calls where cond1 would evaluate to true and 1 where cond1 would evaluate to false

unoptimized	optimized
total_dyn_inst: 301	total_dyn_inst: 248

Unfortunately, even with my very conservative optimization threshold, a few benchmarks still performed considerably worse. I'm not too surprised. The JIT tries to aggressively use speculation on the longest repeated sequence, and the cost incurred for a "mis-predict" is very high (this reminds me of pipeline flushing on incorrect branch-predicts).

benchmark	unoptimized	optimized
bbs.bril	total_dyn_inst: 125	total_dyn_inst: 125
karatsuba.bril	total_dyn_inst: 1548	total_dyn_inst: 1888
binary search	total_dyn_inst: 358	total_dyn_inst: 519

In hindsight, min_repeat is the incorrect heuristic to implement this optimization. Instead, I should look at the relative percentage of a path taken and only generate a "branchless" trace when I detect a supermajority.
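A tiny Python sketch of what that relative-percentage check might look like. The 0.9 threshold and the function name are placeholders I picked for illustration, not values I have tuned.

```python
from collections import Counter


def dominant_successor(successor_counts, threshold=0.9):
    """Return the successor PC taken in at least `threshold` of the observed
    executions, or None if no successor is that dominant."""
    total = sum(successor_counts.values())
    if total == 0:
        return None
    pc, count = successor_counts.most_common(1)[0]
    return pc if count / total >= threshold else None
```

With this check, a branch taken 19 times out of 20 would be speculated on, while one taken 2 times out of 3 would be left alone, regardless of how many times it executed in absolute terms.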

Conclusion

I believe I deserve a Michelin star because my code systematically identifies hot code sections and inserts "optimized" traces to reduce dynamic instruction counts (in one case 😞). The result was a trace-based compiler that works for programs that make repeated function calls with similar arguments. This is most definitely not multi-Michelin-star-worthy, since I did not support interprocedural calls or the "undoing" of operations with side effects.

Unfortunately, I was not feeling ambitious enough to write this in Rust.

1 reply

keikun555 Dec 4, 2025
Collaborator

Nice! Yes it's difficult to optimize programs with trace analyses, but I think it was cool to see it optimize a program, even if it was more artificial!

SyphonArch · 2025-11-20T21:21:15Z

SyphonArch
Nov 20, 2025

Team & Code

Members: Jake Hyun.

The core lesson code resides in the lesson12 directory, and the tracing interpreter fork (branch trace) is available at the brili trace fork.

What I Did

I implemented a trace-based speculative optimizer. A modified fork of brili (trace branch) records a linear prefix of main until it hits a stop condition (call, memory op, print, speculate/commit, backedge, or length cap). The injector (trace_inject.py) wraps the trace with speculate / commit and then jumps to the continuation; the abort path falls back to the original body.
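The stop conditions can be illustrated with the Python sketch below. It assumes the interpreter hands over the dynamic stream of labels and instructions in Bril JSON form, treats re-entering an already seen label as a backedge, and uses a made-up length cap of 32; none of these names or values are taken from the actual fork.

```python
# Ops that end the trace: calls, printing, nested speculation, and the
# operations of Bril's memory extension.
STOP_OPS = {"call", "print", "speculate", "commit",
            "alloc", "free", "load", "store", "ptradd"}


def record_prefix(executed, max_len=32):
    """Collect a linear prefix of the executed stream until a stop condition."""
    trace, seen_labels = [], set()
    for item in executed:
        if "label" in item:
            if item["label"] in seen_labels:
                break  # backedge: control returned to a label already visited
            seen_labels.add(item["label"])
            continue
        if item["op"] in STOP_OPS or len(trace) >= max_len:
            break
        trace.append(item)
    return trace
```

The instruction at which recording stops is where the continuation has to resume, which is why the placement of that point matters so much for correctness.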

I created and injected special functions into the Bril programs for both the trace and the trace metadata, and this proved to be a simple, readable design.

Testing and Results

Harness (test_tracing.py) runs original vs traced on each benchmark with identical #ARGS: and compares stdout + static/dynamic instruction counts from brili -p.

Example execution on core, float, long, and mixed benchmarks:

Target programs: 92
100%|█████████████████| 92/92 [01:02<00:00,  1.48it/s]
Successful optimizations: 90/92
Static Instr Ratio (GM traced/orig): 1.598x
Dynamic Instr Ratio (GM traced/orig): 1.066x
Wrote results to lesson12_results.json
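The GM figures above are geometric means of per-benchmark traced/orig ratios. A minimal Python sketch of that aggregation, with a helper name of my own choosing rather than the one used in test_tracing.py:

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive ratios, computed in log space."""
    if not ratios:
        raise ValueError("need at least one ratio")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

The geometric mean is the usual choice for ratios because a 2x slowdown and a 2x speedup cancel out to 1.0, which an arithmetic mean would not do.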

Plot

Bars above 1.0 show speculative overhead or frequent aborts; below 1.0 indicate successful fast-path execution reducing executed instructions.

Observations

The static instruction ratios are greater than one because of the duplicated prefix and the inserted guards. I wasn't ambitious on this assignment: no trace-local optimizations (such as dead code elimination inside the trace, constant folding of hoisted invariants, or guard specialization) were attempted. The dynamic instruction ratios mostly show a modest speculative overhead, which stems from the abort cost and the duplicate guards. Two benchmarks (sum-digits, grad_desc) unexpectedly execute fewer dynamic instructions under the traced version, suggesting that the fast path sometimes prunes work (or avoids extra loop backedges) even without local simplification. This behavior hints that carving out traces alone already opens windows for future passes (trace-specific CSE, invariant hoisting across the speculative region, and guard refinement).

The correctness coverage is ninety successes out of ninety-two benchmarks. The two failures (core/sqrt_bin_search.bril, long/function_call.bril) expose current limitations: the trace capture stopping heuristics can land awkwardly when early calls or nested control flow appear, producing an invalid injected fast path. Their presence underscores that while the simple approach works for most straight-line prefixes, a CFG/post-dominator aware continuation and more nuanced stop criteria would tighten reliability.

Hardest Part

Continuation placement was the most fragile part of the implementation and is also the part where I spent the most time debugging.

Generative AI

I used ChatGPT to port my evaluation harness from lesson 6/8 to this lesson, adapting it for the new optimization while maintaining similar structure and functionality. This accelerated iteration and helped spot subtle correctness issues.

Michelin Star

Not a star, but clean separation of profiling, injection, measurement for comprehensive evaluation.

1 reply

keikun555 Dec 4, 2025
Collaborator

Awesome, nice analysis on the data and why some benchmarks saw improvements, and why some failed!

SolidLao · 2025-11-20T21:32:51Z

SolidLao
Nov 20, 2025

Team Members

Ning Wang (nw366), Jiale Lao (jl4492)

Source Code

https://github.com/NingWang0123/cs6120/tree/main/assa10

Implementation Summary

We implemented a trace-based speculative optimizer for Bril that mimics tracing JIT compilation in a profile-guided AOT setting. The system has three main phases:

Trace Generation (tracer_dynamic.py): A modified interpreter that records execution traces by capturing instructions as they execute, converting branches to guards, and eliminating jumps.
Trace Injection (inject.py): Injects the extracted trace back into the original program using speculation primitives (speculate, guard, commit) to create a "fast path" while maintaining the original code as a fallback.
Speculative Execution (spec.py): An interpreter that supports the Bril speculation extension, handling state snapshots and rollback on guard failures.

The implementation extracts a single hot execution path, wraps it with speculation primitives, and measures performance using dynamic instruction counting.
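To illustrate the snapshot-and-rollback idea behind the speculation phase, here is a deliberately tiny Python interpreter for a handful of ops. It is a sketch under simplifying assumptions (a single function, only const, add, jmp, speculate, guard, and commit) and is not the code in spec.py.

```python
def run(instrs):
    """Execute a flat instruction list; return (environment, dynamic count)."""
    labels = {ins["label"]: i for i, ins in enumerate(instrs) if "label" in ins}
    env, snapshots, executed, pc = {}, [], 0, 0
    while pc < len(instrs):
        ins = instrs[pc]
        pc += 1
        if "label" in ins:
            continue  # labels are not counted as dynamic instructions
        executed += 1
        op = ins["op"]
        if op == "const":
            env[ins["dest"]] = ins["value"]
        elif op == "add":
            a, b = ins["args"]
            env[ins["dest"]] = env[a] + env[b]
        elif op == "jmp":
            pc = labels[ins["labels"][0]]
        elif op == "speculate":
            snapshots.append(dict(env))  # remember the state to roll back to
        elif op == "commit":
            snapshots.pop()              # speculation succeeded; keep the state
        elif op == "guard":
            if not env[ins["args"][0]]:
                env = snapshots.pop()    # roll back, then take the bailout label
                pc = labels[ins["labels"][0]]
        else:
            raise NotImplementedError(op)
    return env, executed
```

Note that the instructions executed before a failing guard still count toward the dynamic total even though their effects are discarded, which is one source of the overhead reported below.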

We did not implement trace optimization or the interprocedural version because of time constraints.

Testing Methodology

We tested the implementation on 4 benchmarks from the Bril core suite, each with multiple inputs:

Correctness Verification: Compare outputs of original vs. traced programs to ensure semantic equivalence
Performance Measurement: Count dynamic instructions executed using the speculation-aware interpreter
Multiple Input Testing: Test each benchmark with:
- The input used to generate the trace
- Different inputs that may follow different execution paths

Test Benchmarks

Benchmark	Description
grad_desc	Gradient descent optimization loop
loopfact	Iterative factorial calculation
collatz	Collatz conjecture sequence with complex control flow
check-primes	Prime number checker with nested loops

Evaluation Results

Performance Summary

Benchmark	Input	Baseline	Traced	Overhead	Overhead %	Correct
grad_desc	w=6000 t=2000 lr=100 steps=20	229	434	+205	+89.5%	✓
grad_desc	w=8000 t=3000 lr=150 steps=30	339	548	+209	+61.7%	✓
loopfact	input=8	116	222	+106	+91.4%	✓
loopfact	input=12	168	277	+109	+64.9%	✓
collatz	x=7	169	175	+6	+3.6%	✓
collatz	x=13	99	105	+6	+6.1%	✓
check-primes	n=20	329	341	+12	+3.6%	✓
check-primes	n=30	499	511	+12	+2.4%	✓

Aggregate Statistics:

Total Test Cases: 8
Correctness: 100% (all outputs match baseline)
Total Overhead: +665 instructions
Average Overhead: +83.1 instructions per test (+39.5%)

Key Findings

Correctness: All 8 test cases produced identical outputs to the baseline, confirming the speculation mechanism correctly preserves semantics ✓
Performance: The traced programs consistently execute more instructions than baselines due to:
- Speculation infrastructure overhead (speculate, guard, commit instructions)
- Guard checks on every iteration
- No trace optimization implemented (no constant folding, dead code elimination, etc.)

Visualization (visualize_simple.py was generated by a GPT-5 model)

Dynamic Instruction Counts Comparison

The visualization shows the dynamic instruction count for each test case, comparing baseline (green) vs. traced (orange) execution.

The Hardest Part

We initially tried to implement both trace optimization and the interprocedural version, but finally gave up because of time limits. This task requires a deeper understanding of speculation and its implementation details, which took us more time than previous tasks. Also, the overhead is large, which highlights the importance of trace optimization in real JITs.

Michelin Star

We believe we deserve a Michelin star because of our correct implementation, testing on 4 benchmarks with different inputs, and a good visualization. We did not have time for trace optimization or the interprocedural version, but we did understand and implement the core ideas of a dynamic compiler.

1 reply

keikun555 Dec 4, 2025
Collaborator

Good job, nice use of gen AI to make the graphs!

magg1egao · 2025-11-20T21:48:19Z

magg1egao
Nov 20, 2025

Team Members
Serena Zhang (syz8), Maggie Gao (mg2447), Jacqueline Wen (jw2347)

Source code URL

What We Did

For our tracing setup, we modified the Bril interpreter (brili.ts) to record at most 10 dynamic instructions, giving us a short, bounded trace. After generating this trace, our trace.cpp tool processed it by rewriting branches into guard instructions and reconstructing function calls and returns. We then wrapped the transformed trace in a speculative region. If all guards succeed at runtime, execution reaches the trace_success label and commits the speculative state. Otherwise, any mismatch triggers the guard_failed label, which falls back to the original program.

Testing

We evaluated our trace program on three tests: a custom sum_loop bril program and two existing benchmarks from the core folder in bril/benchmarks: armstrong and sum-divisors. Our created sum_loop program takes an input n and computes the sum of integers from 1 to n - 1. This gives us a simple, loop-heavy test.

For each test, we created four versions in the directory:

*.json: the original Bril program in JSON form.
*.trace: the execution trace generated from running the JSON program with a chosen input. We used the recommended inputs for armstrong and sum-divisors, which are 407 and 100 respectively. For sum_loop we used the input 15.
*_opt.json: the optimized Bril program produced by trace.cpp
*_opt.trace: the trace of the optimized program, which was run with the same inputs as the original.

In each case, we used the original (*.trace) and optimized (*_opt.trace) traces to confirm that both versions produce identical printed outputs (booleans or integers, depending on the program), giving us confidence that our optimization preserved the proper semantics of each program.

Here, we have a table comparing the dynamic instructions before and after tracing.

Benchmark	Baseline Dynamic Instrs	Traced Dynamic Instrs
sum_loop	81	92
armstrong	133	144
sum-divisors	159	167

sum-divisors - Testing on other inputs we get:

Input	Traced Dynamic Instrs
50	121
73	116
137	149
174	195

Hardest Part

The hardest part was properly reconstructing the control flow and behavior from the raw trace file. This is because it is fundamentally stream-based and gives relatively little context.

Michelin Star

We believe that we deserve a Michelin star because we produced a clean, working optimizer that correctly reconstructs control flow, handles calls and returns, inserts guards, and preserves program behavior across all our tests.

1 reply

keikun555 Dec 4, 2025
Collaborator

Good job! Yes it's difficult to actually get a more optimal program with this task, and that's okay :)

pedropontesgarcia · 2025-11-20T23:48:58Z

pedropontesgarcia
Nov 20, 2025

Dynamic compilation

By Helen and Pedro, at our Codeberg repo

We implemented a dynamic compilation flow by obtaining traces from the interpreter given a source line range and a function name, and then using those traces in combination with the speculative extension for Bril to run the transformed trace during interpretation.

Evaluation

Our JIT does not optimize, but we refined it as instructed on three benchmarks: graycode, sum-sq-diff, sum-check. We picked these since they are relatively branch-heavy and nontrivial, which helped us produce correct code for them including different inputs. Our workflow is:

Obtain a trace file specifying start, end and function to trace: sed '/^\s*$/d' ../benchmarks/core/graycode.bril | bril2json -p | brili -t --start=4 --end=25 --fn="xor" -- 10 2> graycode-trace
Use the trace to run the JIT: bril2json < ../benchmarks/core/graycode.bril | target/debug/bds jit -t graycode-trace | bril2json | brili -p 10
Compare the output with the reference: bril2json < ../benchmarks/core/graycode.bril | brili

Benchmark	Instrs w/o JIT	Instrs w/ JIT
`graycode`	677	777
`sum-sq-diff`	338	363
`sum-check`	88	119

Input for `graycode`	Reference output	JIT output
10	... 13	... 13
100	... 82	... 82
1000	... 532	... 532

We compared our outputs for these three benchmarks with the reference ones on a range of inputs and found them to be correct. Our JIT seemed to break some benchmarks, such as tail-call, so we decided to iterate on these three. Since the JIT extension is not part of our final project, we feel more comfortable having it as a proof-of-concept rather than striving for full correctness.

Star

We think so! We produced correct output on a range of inputs, and despite not managing to increase performance, we think we got the core of it working, so we are satisfied. We had to make interesting choices about tracing, inlining, and passing data between the interpreter and our compiler toolkit.

2 replies

mercure67 Nov 21, 2025

the breakage occurs on situations with multiple exit blocks, hence the selection of the chosen benchmarks. otherwise, the JIT simplifies minimally because we only run a basic DCE on it, which doesn't appear to affect the selected traces much. we didn't implement LVN because our LVN implementation would require some non-trivial refactoring to work with our trace format.

keikun555 Dec 4, 2025
Collaborator

Cool! I wonder what you felt was the most difficult part of the task?

YoruCathy · 2025-11-21T01:42:38Z

YoruCathy
Nov 21, 2025

Source code
Link
What I did
I implemented a simple trace-based speculative optimizer for Bril by modifying the reference interpreter to record a linearized trace of the hot path taken during execution. I extended brili to start tracing at the beginning of main, capturing each executed instruction, converting dynamic branches into guard instructions, and discarding jumps to produce straight-line code that assumes the observed control-flow behavior. I then wrote a trace injector that embeds this recorded trace back into the original program as a speculative fast path using Bril’s speculate, commit, and guard instructions, falling back to the original code on guard failure. All of this is in brili-trace.ts. I tried to do the bonus part but didn't end up implementing it due to time constraints. I evaluated correctness and performance on several benchmarks by comparing outputs and dynamic instruction counts across the original and speculative versions.
Evaluation
To check correctness, I ran each benchmark with two inputs: one used to generate the trace and one unseen input. For both inputs, I ran the original and speculative programs and compared their outputs using diff. A program was considered correct if the outputs matched on both inputs. All three benchmarks passed these checks, confirming that the speculative fast paths and fallbacks preserved program behavior. Performance was measured using Bril’s -p flag, which reports the dynamic instruction counts. For each benchmark and input, I recorded the instruction counts of both the original and speculative versions and compared them to quantify any speedup or overhead introduced by the speculative fast path.
The results suggest that the optimization does not break the programs, which is good, but it does not provide speedups either. Instead, it introduces a small overhead. My guess is that this is because the benchmarks are small and the speculative scaffolding probably costs more than it saves. Or there could be issues with my implementation that I'm not aware of.

The results are as follows:

Benchmark	Input	Original Dyn Inst	Speculative Dyn Inst	Correct?	Change
factors	60 (A)	72	72	PASS	0
	97 (B)	870	881	PASS	+11
perfect	28 (A)	58	63	PASS	+5
	12 (B)	37	67	PASS	+30
sum-of-cubes	10 (A)	8	11	PASS	+3
	20 (B)	8	11	PASS	+3

AI usage
I used GPT to help me debug my code, and also to write the run_perf.sh script that runs the benchmarks and summarizes the results in results.txt. I also used it to write a readme for the code here.

Star?
I think my work deserves a star for an implementation that does not break the programs and for the evaluation.

1 reply

keikun555 Dec 4, 2025
Collaborator

Good job! I like the thorough evaluation which includes the delta as well as the correctness of the optimization

tobiwg · 2025-11-21T04:49:51Z

tobiwg
Nov 21, 2025

Team: Adnan Al Armouti, Tobias Weinberg

Approach

For this assignment, we implemented a trace-based speculative optimizer for Bril.
Our pipeline had three main pieces:

We extended the interpreter to record a straight-line trace during execution.
We performed simple optimizations on the trace (constant folding + DCE).
We stitched the optimized trace back into main using the speculation extension (speculate, guard, commit) with a bailout block containing the original program.

Whenever a trace included instructions unsafe to run while speculating — such as call, print, or ret — we synthesized a failing guard to conservatively force an immediate bailout.
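The failing-guard synthesis can be sketched in Python as follows. The helper name, the `__never` temporary, and the `bailout` label are illustrative assumptions; the trace is taken to be a list of Bril JSON instructions.

```python
UNSAFE_OPS = {"call", "print", "ret"}


def cut_at_unsafe(trace, bail_label="bailout"):
    """Copy the trace up to the first unsafe instruction, then force an abort."""
    out = []
    for instr in trace:
        if instr.get("op") in UNSAFE_OPS:
            # const false followed by guard: the guard always fails, so
            # speculation is rolled back and control goes to the bailout block.
            out.append({"op": "const", "dest": "__never", "type": "bool", "value": False})
            out.append({"op": "guard", "args": ["__never"], "labels": [bail_label]})
            break
        out.append(instr)
    return out
```

This keeps the transformation trivially correct, but any trace that reaches an unsafe instruction is guaranteed to abort, so all of the work done on the fast path before that point is wasted.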

Correctness

Across multiple inputs, our transformed programs produced correct outputs.
Even when the speculative fast path aborted instantly, the bailout path reliably matched baseline behavior.

Results

We filtered for benchmarks where both baseline and JIT produced valid numeric dynamic instruction counts. These were the programs where our pass ran end-to-end successfully:

Benchmark	Baseline	JIT	Δ
arithmetic-series	7	17	+1
binary-fmt	100	128	+28
binpow	105	126	+21
bitshift	167	189	+22
braille	325	365	+40
combination	178	213	+35
fact	229	258	+29
fitsinside	10	18	+8
hanoi	99	129	+30
mccarthy91	1385	1566	+181
recfact	104	127	+23
rectangles-area-difference	14	24	+10
sqrt_bin_search	744	1105	+361
sum-divisible-by-m	16	24	+8
sum-of-cubes	8	19	+11

In every benchmark, the JIT version executed more dynamic instructions than the baseline.

Discussion

The slowdowns are an expected consequence of how our system behaves:

The speculative fast path was almost never taken.
Because most traces included function calls or printed values, our transformer inserted a failing guard (const false → guard) to avoid mis-speculation.
As a result, speculation always aborted, and execution immediately jumped to the bailout block.
The speculative overhead — speculate, synthetic guards, extra labels, and additional returns — executed on every run, even though the fast path was never used.
Since the traced regions were small and had no stable hot-path behavior, there was no opportunity for genuine optimization.

Although our system didn’t deliver speedups, it confirmed that our speculation framework was structurally correct, and it made clear how fragile speculative optimization is without good guard predicates.

Takeaways

Speculation only helps when the traced path is both hot and predictable.
Conservative guard synthesis guarantees correctness but eliminates any chance of a fast-path win.
Even a correct speculative framework can easily regress performance when no meaningful work is eliminated.
The assignment was valuable for understanding why real-world tracing JITs invest heavily in profiling, guard generation, and deoptimization logic.

GenAI Disclaimer

We used ChatGPT (GPT-5) to help debug issues in our toolchain and to sanity-check our Bril transformations. It was particularly helpful for explaining parts of the speculation extension and for drafting shell commands, but it also hallucinated Bril syntax, made up interpreter flags, and occasionally mixed up behaviors across different versions of our tooling. This caused several detours until we verified what was actually happening by reading the Bril docs, checking the reference interpreter, and inspecting execution traces manually. Once we learned to fact-check everything, ChatGPT was a useful debugging assistant.

1 reply

keikun555 Dec 4, 2025
Collaborator

Awesome!

These were the programs where our pass ran end-to-end successfully

This makes it sound like there are programs that didn't run the pass successfully. I wonder if that's the case, and if so, whether you've looked into why the pass didn't run successfully on those programs!

Also not that important, but I think arithmetic-series should have +10!

tf-mac · 2025-11-21T04:53:25Z

tf-mac
Nov 21, 2025

Source: HERE

Implementation:

I implemented a straightforward tracer. I modified brili to print out the trace of a given run when (as above) a -t flag is passed. Then I wrote a separate program that adds this trace to the original benchmark, and I ran the result to check correctness.

Testing:

I only had time to test this on one benchmark and one input: the fizzbuzz benchmark with input 10. On input 10, there were 362 dynamic instructions (compared to 370 normally), but on input 11 there were 400 dynamic instructions, which implies this was a very narrowly scoped optimization.

Michelin Star:

I think I generated a tracing program and successfully benchmarked it, earning a star.

1 reply

sampsyo Dec 22, 2025
Maintainer Author

Hi, @tf-mac! Glad this worked out as anticipated. Too bad about only getting this to work on one benchmark, but that's life. Your writeup is also somewhat sparse here—it would have been great to know more about what you had to decide along the way to your implementation.

arw274 · 2025-11-21T04:58:11Z

arw274
Nov 21, 2025

Source

Link
I worked with @natetyoung on this assignment.

What We Did

We modified brili to implement a trace-based speculative optimizer. Our optimizer detects "hot" loops by checking whether a current label has been encountered earlier in the execution. If so, it begins collecting the trace as it walks through subsequent lines, and stops tracing when it encounters the same label once more. We bail if we ever encounter a call while tracing. We then stitch a trace back in after the original start label, handling jmp's and br's while stitching. The hardest part of this was probably hacking brili.ts and getting it to trace properly according to labels.
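
The loop-detection rule above can be sketched as follows. This is a Python illustration under stated assumptions, not the actual brili.ts change: `events` is a hypothetical stream of `{"label": name}` markers interleaved with executed Bril instructions in JSON form, and `collect_loop_trace` is a made-up name.

```python
def collect_loop_trace(events):
    """Return (header_label, trace) for the first repeated label, or None.

    Tracing starts the second time a label is reached and stops when that
    same label is reached once more. A call while tracing aborts the trace.
    """
    seen = set()
    header = None
    trace = []
    for ev in events:
        if "label" in ev:
            name = ev["label"]
            if header is None:
                if name in seen:
                    header = name        # second visit: start tracing here
                seen.add(name)
            elif name == header:
                return header, trace     # back at the header: one iteration
            continue
        if header is not None:
            if ev.get("op") == "call":
                return None              # bail on calls
            trace.append(ev)
    return None                          # no loop closed during this run


# One loop iteration as the interpreter would see it, executed twice.
cond = {"op": "lt", "dest": "c", "type": "bool", "args": ["i", "n"]}
br = {"op": "br", "args": ["c"], "labels": ["body", "done"]}
inc = {"op": "add", "dest": "i", "type": "int", "args": ["i", "one"]}
jmp = {"op": "jmp", "labels": ["loop"]}
iteration = [{"label": "loop"}, cond, br, {"label": "body"}, inc, jmp]

print(collect_loop_trace(iteration * 2 + [{"label": "loop"}]))
```

The returned trace still contains the `br` and `jmp`; turning those into guards and dropping the jumps is the stitching step.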

Evaluation

We tested it out on multiple benchmarks using multiple inputs. We generate the trace for the optimization from one of the inputs, and test the optimized and non-optimized programs on other inputs to check for correctness. For measuring performance, we just used the -p flag to count dynamic instructions. Doing this optimization increased the dynamic instruction count across inputs and several benchmarks from core. We summarize some results in the table below.

Benchmark	Input	Baseline	Optimized
Quadratic	-5, 8, 21	785	939
Quadratic	1, 0, -4	173	219
Quadratic	1, 4, 2	139	179
Quadratic	1, 2, 4	59	73
Triangle	42	218	305
Triangle	256	1288	1803
Triangle	999	5003	7004

Star

We believe that we deserve a Michelin star for a well-tested implementation that maintained correctness on multiple benchmarks.

1 reply

keikun555 Dec 4, 2025
Collaborator

Cool! I wonder if the evaluation could include whether the produced optimized programs gave the same output as the baseline!

maheshejs · 2025-11-21T04:58:26Z

maheshejs
Nov 21, 2025

Interprocedural Optimization: Ambitiously Inlining Calls Speculatively!
Code: https://github.com/maheshejs/cs6120pr

What I did

I implemented in Racket an interprocedural optimization to inline calls speculatively using the speculative extension. This speculative inlining is only performed for short functions, where I define short functions as functions which do not call other functions, do not have control flow, and are called in the main function.

Implementation

My approach uses a two-pass transformation with dynamic trace collection:

Speculation Setup (expose-speculated-calls): I instrument all calls in main with speculative regions. For each call, I create parameter-passing variables (cc-param-N for each argument), initialize a flag cc-inl = false to indicate inlining hasn't succeeded yet, and wrap the original call in a speculate/guard/commit block. The guard checks cc-inl and branches to either skip the original call (if inlining succeeded) or execute it as a fallback.
Trace Collection: I modified the Bril interpreter (brili.ts) to trace function executions. When a function is called for the first time, the interpreter:
- Records the function name and begins tracing
- Renames all variables in the traced function with a cc-rename- prefix to avoid clobbering caller variables
- Inserts id instructions to copy parameters from cc-param-N variables to the renamed parameters
- Records only value operations (instructions with destinations), stopping when it encounters control flow, calls, or returns
- On return, appends an id instruction to copy the renamed return value to cc-ret
- Finally appends cc-inl = true to signal successful inlining
Trace-Based Inlining (trace-inline-calls): Using the execution traces, I inline the bodies of eligible functions. For each icall instruction, I look up the function in the traces. If the trace exists and ends with cc-inl = true (indicating it's a straight-line function without control flow or nested calls), I replace the call with the inlined body. The inlined code uses the renamed variables to avoid conflicts with the caller's namespace.

Calling Convention

My speculative inlining uses a lightweight calling convention with specially-named variables:

cc-param-N: Copies of call arguments (e.g., cc-param-0, cc-param-1) created in the caller before speculation begins. The inlined function reads these to access its parameters.
cc-rename-*: All variables inside the inlined function are renamed with this prefix (e.g., cc-rename-mod, cc-rename-q) to create a separate namespace and prevent clobbering caller variables.
cc-ret: Holds the return value from the inlined function. At the end of the inlined body, an id instruction copies the renamed return variable to cc-ret, which is then copied to the original destination.
cc-inl: A boolean flag that starts as false. Successfully inlined functions set this to true at the end. The guard instruction checks this flag: if true, we commit and skip the original call; if false, we abort speculation and execute the original call as a fallback.

This convention ensures complete isolation between the caller and inlined callee through explicit variable renaming, while the cc-inl flag provides a mechanism to fall back to the original call if the function wasn't traceable or contained unsupported constructs.
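
Under these conventions, building the inlined body for one callee might look like the sketch below. It is written in Python for illustration only (the actual pass is in Racket and the tracing lives in brili.ts), it shows the body before any LVN/DCE cleanup, and the names `rename` and `inline_body` as well as the `mod` definition are assumptions.

```python
def rename(var):
    """Move a callee variable into the cc-rename- namespace."""
    return f"cc-rename-{var}"


def inline_body(func):
    """Return the inlined instruction list for `func`, or None if unsupported.

    `func` is a Bril function in JSON form.
    """
    out = []
    # Copy each cc-param-N into the callee's renamed parameter.
    for n, param in enumerate(func.get("args", [])):
        out.append({"op": "id", "dest": rename(param["name"]),
                    "type": param["type"], "args": [f"cc-param-{n}"]})
    for instr in func["instrs"]:
        op = instr.get("op")
        if op == "ret":
            if instr.get("args"):
                # Copy the renamed return value into cc-ret.
                out.append({"op": "id", "dest": "cc-ret", "type": func["type"],
                            "args": [rename(instr["args"][0])]})
            break
        if op is None or op == "call" or "dest" not in instr:
            return None  # label, nested call, or effect/control instruction
        new = dict(instr)
        new["dest"] = rename(instr["dest"])
        if "args" in instr:
            new["args"] = [rename(a) for a in instr["args"]]
        out.append(new)
    # Signal that inlining succeeded.
    out.append({"op": "const", "dest": "cc-inl", "type": "bool", "value": True})
    return out


# A hypothetical straight-line modulo helper.
mod = {
    "name": "mod", "type": "int",
    "args": [{"name": "a", "type": "int"}, {"name": "b", "type": "int"}],
    "instrs": [
        {"op": "div", "dest": "q", "type": "int", "args": ["a", "b"]},
        {"op": "mul", "dest": "aq", "type": "int", "args": ["b", "q"]},
        {"op": "sub", "dest": "mod", "type": "int", "args": ["a", "aq"]},
        {"op": "ret", "args": ["mod"]},
    ],
}

for instr in inline_body(mod):
    print(instr)
```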

For example, the main function of the Armstrong benchmark:

@main(input : int) {
  zero : int = const 0;
  ten : int = const 10;
  sum : int = const 0;
  digits : int = call @getDigits input;
  tmp : int = id input;
.loop:
  b : bool = gt tmp zero;
  br b .body .done;
.body:
  digit : int = call @mod tmp ten;
  pow : int = call @power digit digits;
  sum : int = add sum pow;
  tmp : int = div tmp ten;
  jmp .loop;
.done:
  res : bool = eq input sum;
  print res;
}

is transformed, after the optimizations are applied, to:

@main(input: int) {
  zero: int = const 0;
  ten: int = const 10;
  sum: int = const 0;
  cc-inl: bool = const false;
  speculate;
  guard cc-inl .label.1;
  commit;
  br cc-inl .label.2 .label.1;
.label.1:
  digits: int = call @getDigits input;
.label.2:
  tmp: int = id input;
.loop:
  b: bool = gt tmp zero;
  br b .body .done;
.body:
  speculate;
  cc-rename-q: int = div tmp ten;
  cc-rename-aq: int = mul ten cc-rename-q;
  cc-rename-mod: int = sub tmp cc-rename-aq;
  cc-inl: bool = const true;
  digit: int = id cc-rename-mod;
  guard cc-inl .label.3;
  commit;
  br cc-inl .label.4 .label.3;
.label.3:
  digit: int = call @mod tmp ten;
.label.4:
  cc-inl: bool = const false;
  speculate;
  guard cc-inl .label.5;
  commit;
  br cc-inl .label.6 .label.5;
.label.5:
  pow: int = call @power digit digits;
.label.6:
  sum: int = add sum pow;
  tmp: int = div tmp ten;
  jmp .loop;
.done:
  res: bool = eq input sum;
  print res;
}

Testing
I tested speculative inlining using:

turnt with Bril examples and handwritten cases,
brili to interpret transformed programs and confirm correctness, and
brench to run on all Bril core benchmarks, checking correctness and dynamic instruction counts.

Results
Compared to my task 4’s global LVN baseline:

Overall, speculative inlining improves about 43% of the benchmarks. bbs performs worst for two reasons: it fails to inline lsb, which calls modulo, and the calling overhead of the successfully inlined square and mod calls offsets the gains from inlining.

Hardest Part
For this task, the main challenge was reasoning about proper calling conventions to inline calls speculatively while avoiding clobbering variables. Some other bugs I ran across involved recursive calls (which are excluded), empty traces, and speculative inlines absent from the traces because the interpreter never took the path containing that call. To debug them, I had to inspect each failing transformed benchmark, which turned out to be long and tedious.

Conclusion
I believe I deserve Michelin stars :) for successfully implementing an interprocedural optimization in speculative inlining, fully integrated with all my global optimizations (global LVN+DCE). I was able to support all the core Bril benchmarks.

GenAI

I used Cursor to modify the Bril interpreter for tracing call instructions. GenAI was especially useful for producing working code quickly.

1 reply

sampsyo Dec 22, 2025
Maintainer Author

This is seriously impressive, @maheshejs. Your design with a special "calling convention" for collapsed function calls is really cool and a good way to make this project more flexible and realistic than the basic intraprocedural tracing approach. And the results definitely indicate that it paid off. Super awesome!! 👏👏👏

zc579 · 2025-11-21T05:45:58Z

zc579
Nov 21, 2025

Implementation
I implemented a minimal tracing pipeline by writing a custom interpreter that records the exact dynamic execution path of a Bril program. The tracer executes the program normally, flattens control flow, fully inlines calls, and writes all executed instructions into a linear, executable Trace.json with input values baked in. I then wrote a small injector that embeds this trace back into the original benchmark as a speculative fast path using speculate, guard, and commit, falling back to the untouched slow path on guard failure.

Testing
I tested the system on the gpf benchmark with input 13195. The tracer produced the correct output 29 and matched the original program’s dynamic instruction count of 501, confirming accurate path recording. After injecting the trace, the optimized program also produced 29 on the training input, but this time the dynamic instruction count changed to 501.

star
I claim a star for correct implementation.

1 reply

sampsyo Dec 22, 2025
Maintainer Author

Hi, @zc579—cool that you got this to work once (for one benchmark with a single input). To be confident that you got things correct, however, it would be important to make sure that (a) it works on multiple benchmarks, hopefully a large swath of core benchmarks, and (b) it works even when you change the input, to be sure that your fallback on guard failure works.

Nikil-Shyamsunder · 2025-11-21T06:53:33Z

Nikil-Shyamsunder
Nov 21, 2025

What I Did

First, I modified the Bril interpreter to support a --trace flag that outputs detailed execution information. For each dynamic instruction executed, a line is printed in the format trace (line N): {JSON instruction}, where N is the instruction's line number within the function and the JSON contains the full instruction details. To track control flow, after each branch instruction, the interpreter prints trace: <label_name> to indicate which branch was taken.
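
As an illustration of consuming this output, here is a small parser for the two line shapes described above. It is a sketch rather than the code from process_traces.py: the record layout and the name `parse_trace` are assumptions, and any line that matches neither shape is treated as ordinary program output.

```python
import json
import re

# "trace (line N): {JSON instruction}"
INSTR_RE = re.compile(r"^trace \(line (\d+)\): (.*)$")
# "trace: <label_name>" printed after a branch
LABEL_RE = re.compile(r"^trace: (\S+)$")


def parse_trace(lines):
    """Return ("instr", line_no, instr) and ("taken", label) records."""
    records = []
    for line in lines:
        line = line.strip()
        m = INSTR_RE.match(line)
        if m:
            records.append(("instr", int(m.group(1)), json.loads(m.group(2))))
            continue
        m = LABEL_RE.match(line)
        if m:
            records.append(("taken", m.group(1)))
        # Anything else is the program's own output and is skipped.
    return records


sample = [
    'trace (line 3): {"op": "br", "args": ["c"], "labels": ["then", "else"]}',
    "trace: else",
    "42",
]
for record in parse_trace(sample):
    print(record)
```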

Next, I wrote a shell script (gen_traces.sh) that automatically generates execution traces starting from main for every benchmark in the core suite and stores them in a raw_traces directory. Since this project involved a lot of string processing and manipulation, I decided against continuing in C++ and switched to Python for the remaining implementation.

I then wrote a Python script (process_traces.py) that processes the raw traces with the following logic:

Caps traces at 15 instructions OR stops at the first call, ret, or print instruction (anything with side effects)
Removes all jmp instructions from the trace
Converts br (branch) instructions into guard instructions, using the trace information to determine which condition to guard on
If the false branch was taken, inserts a not instruction to negate the condition before the guard
Prepends a speculate instruction at the beginning
Appends commit and jmp success instructions at the end
Adds an end_speculation label where failed guards will jump to
Stores the line number of the last traced instruction as the final line of the processed trace file, which tells us where to place the success label in the original code
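
The steps above can be sketched as a single function. This is an illustration, not the contents of process_traces.py: the input shape (pairs of instruction and taken label), the fresh-name scheme for negated conditions, and the choice to count jmp instructions toward the cap are all assumptions, and the bookkeeping for the last traced line number is omitted.

```python
MAX_LEN = 15
STOP_OPS = {"call", "ret", "print"}


def process_trace(records):
    """Turn a raw trace into a speculative block.

    `records` is a list of (instr, taken_label) pairs, where taken_label is
    None for everything except `br`.
    """
    body = []
    count = 0
    for instr, taken in records:
        if count == MAX_LEN or instr["op"] in STOP_OPS:
            break
        count += 1
        if instr["op"] == "jmp":
            continue  # jumps vanish in straight-line code
        if instr["op"] == "br":
            cond = instr["args"][0]
            if taken == instr["labels"][1]:
                # False branch taken: guard on the negated condition.
                neg = f"not_{cond}_{count}"  # hypothetical fresh name
                body.append({"op": "not", "dest": neg,
                             "type": "bool", "args": [cond]})
                cond = neg
            body.append({"op": "guard", "args": [cond],
                         "labels": ["end_speculation"]})
            continue
        body.append(instr)
    return ([{"op": "speculate"}] + body +
            [{"op": "commit"},
             {"op": "jmp", "labels": ["success"]},
             {"label": "end_speculation"}])


raw = [
    ({"op": "const", "dest": "x", "type": "int", "value": 1}, None),
    ({"op": "br", "args": ["c"], "labels": ["t", "f"]}, "f"),
    ({"op": "jmp", "labels": ["z"]}, None),
    ({"op": "print", "args": ["x"]}, None),
]
for instr in process_trace(raw):
    print(instr)
```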

Finally, I wrote a merging script (merge_traces.py) that combines the processed traces with the original Bril programs. It inserts the speculative code at the start of main, places the end_speculation label after the trace, and uses the stored line number to position the success label at the correct continuation point in the original code.

Evaluation and Results

I wrote an evaluation script (eval.py) that compares the baseline and optimized versions of each benchmark by:

Running both versions with their specified arguments
Comparing dynamic instruction counts using the -p profiling flag
Verifying that outputs match exactly
Generating a CSV file with baseline, optimized, and overhead percentage data
Creating a bar chart visualization comparing instruction counts on a log scale

41 out of 55 core benchmarks work correctly (producing the right output), which I'm pretty happy about! However, unfortunately, none of them actually show performance improvements with this optimization approach. The optimized versions all have higher instruction counts than the baseline due to the speculation overhead.

I think there are a few reasons for this not-so-good performance. First, I'm just capturing traces from the start of main rather than identifying hot paths or frequently executed code. The 15-instruction cap might be too short to capture meaningful work, or traces might be getting cut off at inconvenient points. The 14 failing benchmarks are mostly ones that involve loops where the loop environment changes over iterations. For example, if a loop variable gets incremented and then used in a branch condition, my speculative trace from the first iteration won't match subsequent iterations, causing guards to fail. I didn't have time to solve this, but I think the fix would involve tracking the full environment state and somehow encoding those dependencies as additional guard conditions.

Hardest Part
The hardest part was figuring out exactly what information I needed to extract from traces. My initial attempt just captured the instructions themselves, which obviously wasn't enough. Then I realized I needed to know which branches were taken, so I added the trace: <label> output after branch instructions.

But even that wasn't sufficient! It took me some time to realize that I needed something like a "program counter" - the line number of the last instruction in my trace - to know where execution should resume when speculation succeeds. This required going back and adding more machinery to the Bril interpreter to track and output instruction indices, which was more invasive than I initially planned.

AI Statement

I used AI assistance on this assignment, primarily for writing string parsing and data processing code. In particular, I used it for the shell script that runs benchmarks and captures traces, and for the evaluation script that compares results, computes statistics, and generates visualizations. I also don't know TypeScript that well, so it helped me out a lot with modifying the brili.ts file. AI was quite useful for avoiding the need to learn Python's string parsing libraries or matplotlib from scratch. I found that giving clear, specific prompts about what strings to match and how to process them typically produced correct code quickly.

Conclusion
While I didn't achieve the speedups I was hoping for, I think I deserve a Michelin star for successfully implementing a complex trace-based optimization that is correct on a large set of the core benchmarks. I built a complete pipeline from tracing through optimization to evaluation. The implementation is solid; it just needs better heuristics for where and when to speculate.

1 reply

keikun555 Dec 4, 2025
Collaborator

Awesome! Nice analysis on the evaluation and why some benchmarks failed.

Smubge · 2025-11-24T05:21:02Z

Smubge
Nov 24, 2025

code
bril repo

Cynthia Shao and Jonathan Brown worked together on this task.

Summary

We implemented tracing for Bril. For our testing, we used brench to validate correctness and dynamic instruction count comparisons. Overall this assignment was very fun and gave us great insight on how a tracing JIT works.

Tracing

We implemented tracing within the bril-ts interpreter by adding the ‘-t’ flag. We added simple checks to only trace if the function is main, and if we haven’t reached a label that is a backedge.

Our optimizer only optimizes main, and does not support calls, memory, or other extensions beyond simple bril.
Our tracing logic is simple: we directly insert the traced output of our interpreter into the main function, replacing branches with guards, adding a speculate at the beginning of the insertion, and a commit at the end. We go through all the traced instructions, and if there is an instruction that produces side effects or returns, we only add that at the end of the trace insertion after inserting a commit instruction. Following this insertion, we add a label of the original code. If the trace ends in a jump, we add that jump as the guard.

This strategy mainly optimizes loops that have a heavy setup and first iteration, as we know the possible input of the entry of the loop, and thus can optimize the first iteration.

We run LVN and DCE on the resulting trace-inserted code, and we see varying decreases in dynamic instruction count!

Testing and Benchmarking

For testing, we used Turnt and brench. We used brench to get the dynamic instruction counts and validate correctness, and Turnt for snapshot testing.

Here are the benchmark comparisons:

We added support for separate argument inputs into the tracing optimization function via comments at the top of the Bril benchmarks. Thus, the correctness and snapshot testing should be able to guard against cases where the tracing optimization “overfits” and we end up with code that only works on one input.

Hardest Part

The hardest part was adding support across the stack! Creating the tracing optimization to wrap both the interpreter and apply all the possible optimizations to the trace inserted code was very difficult to keep track of, and it ended up that our benchmarks and testing suite did much of the heavy lifting.

Another difficulty we faced was debugging the Brench and Turnt outputs. Because the pipeline for optimizations was so deep, we had a hard time debugging correctness issues without much debugging support.

Self Assessment

We believe we deserve a Michelin star because we are proud of our work! We brainstormed an implementation of the tracing, and thoroughly tested it. We tested thoroughly using the Brench and Turnt with our test cases that had separate arguments for our tracing optimization and correctness checking. We also practiced good team coding practices with git, and learned a lot about tracing JITs, ensuring correctness across all programs! We believe we deserve a Michelin star for the tenacity and persistence put into the task.

1 reply

keikun555 Dec 4, 2025
Collaborator

Good job! It's nice to see that the pass optimized some benchmarks :)

itingtsai · 2025-11-27T12:38:58Z

itingtsai
Nov 27, 2025

Source Code

I implemented a trace-based speculative optimizer for Bril, by extending the reference interpreter (brili.ts) to record a linear trace starting from a specified loop header label. The tracer logs one full loop iteration: tracing begins when execution reaches the header label, and ends when a backedge returns to that label. On conditional branches, the tracer emits a guard for the taken direction or a guard on the inverted condition if not taken; all other instructions (jumps aside) are recorded directly. Calls, rets, or non-pure instructions terminate the trace early.

A Python trace optimizer (jit.py) reads the recorded trace, applies simplifications (e.g., eliminates redundant eq(i, 0) checks if a gt(i, 0) guard already ensures non-zero, removes intermediate jumps to linearize the trace), then injects the optimized trace back into the program using Bril’s speculation mechanism. The trace becomes a speculative fast path under a new label, wrapped in speculate / commit; the original code is renamed to a .fallback path, and failing guards redirect to it.

As a sanity check, I used a simple sum from 1 to input loop: sum = 0; i = input; while i > 0: sum += i; --i. Inside the loop there is an if (i == 0) branch that is never taken under normal execution when i > 0, so this matches a hot path scenario where speculation can safely remove the redundant check.
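
For reference, the test loop can be written out in Python as below. What the `i == 0` branch does when taken is not stated, so the early exit here is a placeholder; the point is only that the check can never succeed inside the loop, since the loop condition already guarantees `i > 0`.

```python
def sum_to(n):
    """Sum the integers from 1 to n with a countdown loop."""
    total = 0
    i = n
    while i > 0:
        if i == 0:   # never true here: the loop condition ensures i > 0
            break    # placeholder for the never-taken branch
        total += i
        i -= 1
    return total


for n in (0, 5, 100):
    print(n, sum_to(n))
```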

I compared dynamic instruction counts for the baseline and optimized versions over several inputs. For inputs 5 and 100, the optimized version executed fewer instructions (47 → 45; 807 → 710). For input 0, the guard immediately fails and speculation rolls back, resulting in overhead (7 → 10).

Input	Baseline Sum	Optimized Sum
0	0	0
5	15	15
100	5050	5050

Input	Baseline Total Dyn Inst	Optimized Total Dyn Inst
0	7	10
5	47	45
100	807	710

1 reply

sampsyo Dec 22, 2025
Maintainer Author

Cool! Thanks for the detailed description of how your tracing and "injection" machinery worked.

While it's cool that this seems to have worked for this specific loop, it would be good to know—on a wider variety of more realistic benchmarks—that this (a) still produces the right answer, (b) even when the input changes, so speculation failures are handled correctly, and (c) can even improve performance in at least a few cases.
