Replies: 14 comments 1 reply
Critique
I thought this paper was a fun read. Denali uses a code generator backed by a SAT solver to arrive at almost mathematically optimal machine code. It does this by constructing an equivalence graph (E-graph) from a set of hand-coded axioms, then performing a binary search over the question "is there a program that computes the goal term within K cycles?". This methodology (at least to me) is very sensible and provably correct, under the assumption that the hand-coded axioms are sufficiently comprehensive. However, I am unconvinced about the real-world performance impact of the resulting code sequence. The paper emphasizes optimal code with respect to an abstract machine model, but does not provide benchmarks on physical hardware to demonstrate the end-to-end speedup. I'm not sure about the state of micro-architectural optimizations when the paper was published, but I feel that concepts such as caching, data forwarding, hazards, and out-of-order execution would have a noticeable effect on the wall-clock execution of such programs. It might be the case that the fastest code is not the shortest in terms of abstract machine cycles. I was also excited about Denali's application in tracing JITs, but I'm not sure an optimization pass that takes multiple minutes is acceptable.
Question
Could super-optimizers be employed by JITs to emit faster machine code? A JIT would know which code segments are executed frequently enough to benefit from a mathematically optimal solution. My main concern is that this still takes too long / too much compute (even if running in a separate thread). This leads me to the next question: could super-optimizers be employed in an offline setting to enhance traditional compiler infrastructure? (For example, we could run super-optimizers on common code patterns to come up with peephole optimizations.)
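The binary-search-over-K loop described above can be sketched in miniature. This is a toy stand-in, not Denali: a brute-force enumeration replaces the SAT query, the three op names and the "compute 5*x" goal are invented, and every op costs one cycle on 8-bit values.

```python
from itertools import product

# Toy "machine": each op takes one cycle; values are 8-bit.
# These op names and the goal (compute 5*x) are invented for illustration.
OPS = {
    "dbl":  lambda v, x: (v * 2) & 0xFF,   # v := v + v
    "addx": lambda v, x: (v + x) & 0xFF,   # v := v + x  (x is the input)
    "inc":  lambda v, x: (v + 1) & 0xFF,   # v := v + 1
}

def goal(x):
    return (5 * x) & 0xFF  # the goal term: 5*x

def run(seq, x):
    """Execute an op sequence on input x."""
    v = x
    for op in seq:
        v = OPS[op](v, x)
    return v

def feasible(k):
    """Denali's core query: is there a program of at most k cycles that
    computes the goal?  Here answered by brute force, not by SAT."""
    for length in range(1, k + 1):
        for seq in product(OPS, repeat=length):
            if all(run(seq, x) == goal(x) for x in range(256)):
                return seq
    return None

def min_cycles(k_max=6):
    """Binary search for the smallest feasible cycle count."""
    lo, hi, best = 1, k_max, None
    while lo <= hi:
        mid = (lo + hi) // 2
        prog = feasible(mid)
        if prog:
            best, hi = prog, mid - 1
        else:
            lo = mid + 1
    return best

print(min_cycles())  # → ('dbl', 'dbl', 'addx'): (2x)*2 + x = 5x in 3 cycles
```

Since `feasible(k)` is monotone in k, the binary search finds the true minimum; in Denali each `feasible` call is a SAT instance rather than an enumeration.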
Critique
'Dead-architectures' {expletive} paper. Funny to see Compaq (formerly DEC) Alpha and Intel Itanium mentioned as the cutting edge. More seriously, it's a pretty cool paper, if a bit light on evaluation (which makes sense, considering how specific the use case is). I guess the effectiveness of the superoptimisation concept depends on how Amdahl's-law-bound real programs are: do they genuinely spend 90% of their time in one loop, or is the time more distributed? And if it is more distributed, do the superoptimisations make a tangible enough difference to warrant their runtime?
Questions
Has something like this been applied to embedded systems before? They tend to be more deterministic (thus constraining the search space), and shaving a few cycles off ISR handling or an RTOS context-switching procedure could be huge.
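The Amdahl's-law arithmetic behind that worry is easy to make concrete. A minimal sketch, assuming (hypothetically) that superoptimisation doubles the speed of the covered code; the 90%/10% coverage figures are the ones posed in the question, not numbers from the paper:

```python
def amdahl(p, s):
    """Overall speedup when a fraction p of runtime is accelerated by factor s."""
    return 1 / ((1 - p) + p / s)

# If 90% of the time really is in one superoptimised loop (2x faster):
print(round(amdahl(0.90, 2.0), 2))  # → 1.82

# If the hot time is distributed and the optimised fragments cover only 10%:
print(round(amdahl(0.10, 2.0), 2))  # → 1.05
```

So the whole-program payoff collapses quickly as coverage drops, which is exactly why the "one hot loop" assumption matters.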
Question
Forgetting about optimization for a second: what is the most practical way to incorporate formal methods into real engineering workflows, and are formally correct code generators a better path to correctness than verifying hand-written implementations after the fact?
Critique
The paper presents Denali, a superoptimizer that combines E-graph-based equivalence reasoning with SAT-based scheduling to generate near-optimal straight-line code. While the approach is elegant, it relies on a large set of hand-written axioms, and it's unclear how maintainable or scalable that approach would be for modern ISAs. The E-graph matching phase also seems like it could grow uncontrollably, and the paper gives little sense of how often heuristic cutoffs risk missing better implementations. The "correct-by-design" claim also feels fragile, since any incorrect axiom or encoding silently breaks correctness. And it's not obvious how frequently developers actually need to hand-optimize new critical assembly loops today, which makes it harder to gauge the practical impact of a tool like this.
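The growth concern can be made concrete with a minimal sketch. This is not Denali's E-graph (an E-graph shares subterms to stay compact); it is a naive term-by-term closure under two invented axioms, which shows how even a tiny axiom set multiplies the set of equivalent terms:

```python
def step(term):
    """All terms reachable from `term` by one application of a toy axiom."""
    out = set()
    if isinstance(term, tuple):
        op, a, b = term
        if op in ('+', '*'):                 # axiom 1: commutativity
            out.add((op, b, a))
        if op == '*' and b == '4':           # axiom 2: strength reduction, t*4 == t<<2
            out.add(('<<', a, '2'))
        for ra in step(a):                   # axioms also apply inside subterms
            out.add((op, ra, b))
        for rb in step(b):
            out.add((op, a, rb))
    return out

def closure(term, max_terms=1000):
    """The equivalence class of `term` under the axioms -- what an E-graph
    represents compactly, here materialised term by term (with a cutoff)."""
    seen, frontier = {term}, {term}
    while frontier and len(seen) < max_terms:
        frontier = {t for f in frontier for t in step(f)} - seen
        seen |= frontier
    return seen

cls = closure(('*', ('+', 'x', 'y'), '4'))
print(len(cls))  # → 6 distinct but equivalent terms, from just two axioms
```

With associativity, distributivity, and dozens of architectural axioms in play, the class sizes explode, which is why heuristic cutoffs (like `max_terms` here) become unavoidable and why missed-better-implementation risk is worth quantifying.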
Critique
I found this paper on Denali to be really cool. It's impressive that the system can rival or even surpass expert-level human coding in specific cases. However, my primary critique is the paper's limited benchmarking suite. The authors present only a few benchmarks, such as byte swap and checksum. While these are instructive for demonstrating Denali's capabilities, the small and relatively simple set of examples makes it difficult to assess the system's general performance and practicality.
Questions
How does Denali's performance scale with the complexity of the input code?
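For readers who haven't seen the byte-swap benchmark, the flavour of straight-line code being searched for looks like this. This is a generic shift-and-mask byte swap checked against a naive reference, not Denali's actual output:

```python
def bswap32_naive(x):
    """Reference: reorder the four bytes of a 32-bit word."""
    return int.from_bytes(x.to_bytes(4, "big"), "little")

def bswap32_fast(x):
    """Shift-and-mask byte swap -- the kind of short, branch-free
    sequence a superoptimizer searches for."""
    return (((x & 0x000000FF) << 24) |
            ((x & 0x0000FF00) << 8)  |
            ((x & 0x00FF0000) >> 8)  |
            ((x & 0xFF000000) >> 24))

# Spot-check the two implementations against each other.
for v in (0, 0x12345678, 0xDEADBEEF, 0xFFFFFFFF):
    assert bswap32_fast(v) == bswap32_naive(v)

print(hex(bswap32_fast(0x12345678)))  # → 0x78563412
```

Even this small example hints at the scaling question: every extra input bit-width, register, or instruction multiplies the search space, so it would be valuable to know where Denali's minutes-long runtimes start to blow up.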
Critique
Denali introduces an optimizer that generates close-to-optimal machine code for small but critical code snippets where typical compiler optimizations fall short. Denali uses SAT solving to search for code that comes mathematically closest to the optimal cycle count on modern architectures. My main concern is that Denali relies heavily on manually written axioms, which does not scale. When new architectures are inevitably developed, these axioms would likely need to be updated or rewritten, making Denali very high-maintenance.
Critique
I sort of find it funny that this paper answers a question similar to the one I asked in class when we discussed function inlining: could we try all possible combinations of inlining until we found an optimal one? The problem of "superoptimizing" makes total sense, and I'm also happy with the approach of automatically proving the correctness of potential reorderings, as opposed to the test-case approach from prior work.
Question
Since the early 2000s we have developed WAY more powerful techniques (SMT) and systems (Lean, Z3) for automatic theorem proving and verification. Do these make a difference at all for superoptimizing with verification? How is this done these days?
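The verification step those tools automate can be sketched as follows. This is a brute-force stand-in that enumerates all 8-bit inputs; a real pipeline would hand the same question to an SMT solver such as Z3 over bit-vector theory, which proves it symbolically for 32- or 64-bit words:

```python
MASK = 0xFF  # model 8-bit machine words

def equivalent(f, g):
    """Decide whether two candidate code fragments compute the same function.
    Here: exhaustive enumeration (feasible only at 8 bits); in practice: an
    SMT query over bit-vectors, which scales to realistic word sizes."""
    return all(f(x) & MASK == g(x) & MASK for x in range(256))

# A classic peephole rewrite: multiply-by-two vs. left shift.
print(equivalent(lambda x: x * 2, lambda x: x << 1))            # → True

# And a rewrite that is NOT valid: (x+1)*2 vs. x*2 + 1.
print(equivalent(lambda x: (x + 1) * 2, lambda x: x * 2 + 1))   # → False
```

The contrast with the test-case approach from prior work is that an SMT "unsat" result rules out *every* counterexample, not just the inputs someone thought to try.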
Critique
A big part of making Denali work seems to be coming up with good axioms (both mathematical and architectural) so that the matching algorithm stays tractable and actually finds good results. I'll propose a similar critique of the SAT-solving section of the paper. Maybe I've been blinded by hindsight, as these techniques (as others above mention) have become much more popular nowadays, but it feels like finding these constraints is the hard part, and yet the authors give no methodical way of finding them and make no strong argument for why they can't. Really, this issue seems pretty glossed over to me.
Question
Section 7 on additional constraints gives an example of a GMA that might require additional constraints, but doesn't explain them concretely. What might the concrete constraint for the given example be? In general, how would this and other constraints change based on the architecture Denali is targeting?
Questions
What are some similarities/differences between modern superoptimizers and Denali? How would Denali perform if rebuilt today?
Reading: Denali: A Goal-Directed Superoptimizer
Discussion Leads: Thomas McFarland (@tf-mac), Nipat Chenthanakij (@Mond45), Vesal Bakhtazad (@VesalBakhtazad)