Rendered at 05:56:31 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.
titzer 16 hours ago [-]
The section on dynamic compilers is more or less all about trace compilation. Generally, trace compilation is a dead end and has been abandoned repeatedly. The more important concepts here are type feedback and speculation and deoptimization, as well as making fast compilers and tiering.
The course overall looks good, and it's great that so much is available online, so well done, Adrian.
samps 15 hours ago [-]
Thanks, Ben. I admit I mostly think tracing is just a mind-expanding concept to learn about, even if history has proven it’s not very practical as an organizing principle. But as you say, I’d love to offer more context on “what actually seems to work” industrially.
titzer 15 hours ago [-]
Yeah, it is conceptually interesting. I like giving students perspective and in 770 (https://www.cs.cmu.edu/~wasm/cs17-770/fall2025/) I might spend half or less of a lecture on tracing and I don't pull punches on how I think it ends up not really working well in real systems. It's a good opportunity to talk about program behavior and the cost/benefit of speculation.
We spend a lot more time on type feedback, ICs, and deoptimization which are the more universal concepts that can be applied to multiple different compiler designs.
vsmenon 4 hours ago [-]
To be fair, the listed Self paper did age quite well!
mlazos 10 hours ago [-]
I work on PyTorch torch.compile and it’s a tracing compiler as well. Perhaps this domain is very narrow though; it is also a very not conventional compiler, you’d probably be deeply offended by some of the stuff we do! ;)
jlebar 14 hours ago [-]
> Generally, trace compilation is a dead end and has been abandoned repeatedly.
JAX is a tracing compiler!
(I know, I know, it sits in an extremely different part of the problem space than TraceMonkey or LuaJIT. Still.)
titzer 12 hours ago [-]
Interesting. I think numerical computing is a narrow enough domain where programs have very well-behaved control flow, which avoids most of the problems of trace compilation. Loops over branchy code, which are really common in general programs, are very difficult to make work well with tracing.
Numerical programs being very stable in terms of control is what enables GPU parallelization and loop optimizations in the long tradition of Fortran compilers. Optimizations like loop tiling, interchange, strip mining, etc aren't going to be easy to do with trace compilation.
Anyway my comment was more directed toward trace compilation in the context of dynamic languages, and there I think it's pretty well established it only works well for small programs.
nwallin 2 hours ago [-]
My feeling about JavaScript is statistically speaking, random JavaScript code is more likely to be spaghetti nonsense than, say, the equivalent Python code. This feeling isn't based on empirical data, tbh it's probably as much anti-JS bias as it is experience with poorly written JavaScript.
Is your criticism of tracing specific to messy, confusing code, with lots of edge cases in the main loop, or does it also hold true for well written code?
I have no experience with compiler design, didn't even take a compilers course in college.
foobazgt 2 hours ago [-]
Nah, as titzer said, it's really about general-purpose code being branchy (a good match for CPUs) vs numeric code being heavily parallelized and non-branchy (a good match for SIMD/GPUs). Only a small subset of very specific languages / instructions target the latter (e.g. CUDA and SSE4).
achierius 12 hours ago [-]
ML compilers in particular go beyond even the level of stability you would expect from numerical programs. Due to how the SIMT model of thread/warp divergence works, the hardware heavily punishes unstable branches. E.g. if you have 32 threads taking a branch then recoalescing on a barrier -- if they all go the same direction then they can go down the execution pipe as a single bundle, but if 1 takes it while 31 don't, then that's 2x the ex-pipe usage by default (and if you have e.g. a computed-branch, performance goes out the window). Consequently, the whole stack is built around the expectation of stable control flow, even to the detriment of performance (from a local perspective).
ML frameworks even take advantage of this to compute, ahead-of-time, how much memory will be used at different points in the program graph, and thereafter schedule memcpy's to make space as necessary. Of course this only works for well-behaved program classes, but e.g. most LLM architectures fit into that category. Interestingly MoE models don't, since they require data-dependent control flow, thus the recent push towards accommodating dynamism in frameworks (like JAX, which until ~recently couldn't handle it at all).
zipy124 13 hours ago [-]
and PyPy right?
jcranmer 15 hours ago [-]
The TraceMonkey paper was on my qual reading list, and my quals happened to be around the time TraceMonkey was ripped out of SpiderMonkey. I was talking with one of the developers at the time (I think it was Jason Orendorff?), who said that tracing just doesn't work out, but there was limited circumstances where he thought it might work out... but I've completely forgotten what those circumstances were.
mikemike 10 hours ago [-]
Trace compilation is NOT a dead end.
LuaJIT works just fine. On big programs. On hundreds of millions of servers and devices.
I find it deeply saddening that even scholars keep repeating this trope. Ignorance is bliss.
titzer 7 hours ago [-]
Hi Mike, I think your work on luajit is really cool. I worked on JavaScript for some time and among all of the JS engines out there, a lot of ideas were tried. TraceMonkey in particular did explore trace compilation for JavaScript and benefited from a lot of fundamental research into trace compilation from Michael Franz's group and others. I've talked to many of those people. It just didn't work out for JS as there are too many diverse programs out there that aren't well-behaved. It's not enough to get great performance on some programs (and hopefully good performance on average programs) if there are pathological cases that ruin user experience. The internet is the greatest VM fuzzer yet invented and produces some stunningly awful things that present constant challenges to speculative dynamic optimization. It's well accepted in the JS space that method JITs have more stable, understandable performance, though they too can get into trouble.
Best of luck.
mikemike 6 hours ago [-]
I'm not sure one can infer anything from a single failed Mozilla project.
'Real-world' JavaScript is a challenge for any compiler, no matter the underlying technology.
The technical debt in SpiderMonkey (at that time, anyway) was eye-watering. Grafting anything on top of that was hard. They haven't even gotten to the point of implementing the crucial pieces of a trace compiler before the project folded.
Trace compilers make nice textbook exercises. Getting them into production is a different matter: region selection, side-exit handling, trace graph evolution, deep VM integration, code generation adapted to all of this … far from trivial, but doable.
Neither is it trivial to create a production-quality method-at-a-time JIT compiler and VM for a dynamic language.
Ceterum censeo: The fundamental papers on trace compilation are from Fisher (1981) and the teams around Multiflow (1990) and Dynamo (1999). Franz, Gal, et al can be credited for later re-popularizing the idea, but they haven't added anything of note.
panick21_ 7 hours ago [-]
Didn't you guys have this discussion like 10 years ago in Lambda the Ultimate comments?
abecedarius 14 hours ago [-]
Has LuaJIT been superseded?
giancarlostoro 16 hours ago [-]
Got any other recommended resources on building compilers?
yu3zhou4 12 hours ago [-]
PyTorch?
j2kun 15 hours ago [-]
I'm a bit confused about what makes this course "advanced." Most of the topics (dead code elimination, data flow, dominator analysis, SSA form) seem like they belong in a first course on compilers.
The short answer is that compilers is basically broken up into two courses, with the first course largely being the minimum necessary to build a compiler (lexing, parsing, codegen, register allocation), and the second course being how to build an optimizing compiler.
ebiederm 12 hours ago [-]
The academic literature on register allocation is scary.
First is presented a linear time optimal algorithm for graph coloring then it is claimed better can be done by a O(N^2) algorithm that uses a heuristic.
I do believe the dragon book got caught with the emperor's new register allocator and the literature hasn't really recovered yet.
mrkeen 10 hours ago [-]
I didn't believe the optimal algorithm is linear time, so I checked the source:
a simple register allocation algorithm can achieve most ot the performance of the more complex algorithms: linear-scan register allocation, developed by Poletto and Sarkar
"Most of the performance" is not optimal. Were you referring to a different source?
ebiederm 2 hours ago [-]
The big caveat in what I was saying is that it is only the pure graph coloring portion that has an optimal linear time algorithm.
Take code in static single assignment form, or another form where it is values that are tracked with live ranges. For a single basic block the interference graph of the live ranges is an interval graph. As the live ranges are just intervals from the instruction that produces the value to the last instruction that consumes the value.
It is annoying but totally doable to follow this into graph theory and see that interval graphs, are a subset of chordal graphs, and these graphs have an algorithm that colors the graph optimally in linear time.
Poke a little more and it turns out that algorithm is the same algorithm as linear-scan register allocation.
Add multiple basic blocks and it is trickier to see, but there are proofs in the academic literature that the interference graph remains an interval graph. Something about phi functions not counting.
Which is what I meant when I said a linear time algorithm.
Compare that to the O(N^2) heuristic from the paper register allocation by graph coloring.
The big change is going from a linear time optimal algorithm to a quadratic time heuristic and proclaiming the quadratic time heuristic is better. (What???)
In practice there are a lot of complications that are not accounted for by graph coloring. Instructions and function calls that take fixed registers. Spilling registers when you need more values alive then there are registers. Deliberately keeping a non-minimal number of values in registers to increase instruction level parallelism.
To the best of my knowledge all of those problems are very tractable if you start with a linear scan register allocator, as the code can be changed without requiring the register allocation to restart.
Code based on computing the interference graph has to reconpute the interference graph and restart whenever the code has to be changed. Making it much worse time wise. If for no other reason than computing the interference graph is O(N^2).
So yes more complications but only because the model described is overly simplified.
j2kun 12 hours ago [-]
Looks like there is quite a bit of overlap with regards to the optimizing parts between these two courses. I guess it's switching from the dragon book to academic papers that makes it advanced.
norir 11 hours ago [-]
I am actively opposed to this design for a first compiler. There is no need for a lexer with a recursive descent parser. Register allocation is also an unnecessary distraction. It is better for a first compiler to compile to a higher level language in which neither register assignment nor memory management are necessary.
Optimizing compilers are suboptimal in that they waste enormous amount of time optimizing code that can't or needn't be optimized and even where the optimizations are helpful, they are opaque and at risk of unexpectedly regressing both due to small changes at the source code level or changes in the compiler optimizer, both of which are quite insidious.
If instead of optimizing compilers, we had languages that allowed for seamless interop between low level and high level functions, then perhaps an llm becomes the optimizer (or you can invoke the compiler to optimize a specific function by source level rewrite). The benefit of this compared to a traditional optimizing compiler is that the optimization is done once per function and never repeated (until prompted) and the implementation is human readable and not buried in a binary. Moreover, and perhaps even more importantly, by not doing optimizations in the compiler, compilation times can be much faster, easily 100-1000x than state of the art optimizing compiler, while generating equivalent or even better runtime performance. As it has been said: premature optimization is the root of all evil.
mamcx 15 hours ago [-]
I have read TONS of material about it*, and none of that is part of the majority of that!
In fact, the "backend" be compiler or interpreter is nearly always left as "exercise to reader".
You can't imagine how much is left to be discovered, from how make a closure, track environment, do pattern matching, memory representation, etc.
EVERYTHING interesting is something you need to look for.
I think a lot of the non-professionals start with parsing and do not get exposed to backend. I have read two books about interpreters/compilers and they don't touch the backend very much.
Maybe this is introductory for backend?
DonaldPShimoda 15 hours ago [-]
That's part of it. I think another part is that it seems like the students are asked to read the papers behind a lot of the concepts, and academic literature is not generally very accessible to undergrads. (Not that they can't read it, but without someone guiding you through at least the first few papers, it can be a frustrating experience for many.)
vkazanov 13 hours ago [-]
What is advanced then? Good coverage of dce, data flow, ssa, intruction selection and reg alloc is actually like 98% of the backend.
How does this compare to Nora Sandler's "Writing a C compiler" in terms of the potential gains for the reader?
Jtsummers 10 hours ago [-]
Her book sits closer to what you'd get from an undergraduate level compilers course, this course covers more advanced material than she does. If you're a novice or a dabbler, start with her book or something else at that level, then try out this course and fill in the gaps as needed.
flash1 3 hours ago [-]
Great material! Thank you.
GL26 14 hours ago [-]
Saw a podcast that talked about the rust compiler, which apparently included machine learning algorithms at some points to determine whether or not you had code that could crash your system
afdbcreid 12 hours ago [-]
I've never heard about that and I'm pretty sure it's incorrect (although "machine learning" is a wide term), do you have a source for that?
saghm 11 hours ago [-]
I've also never heard this before. The closest I can think of is fuzzing test suites used to find panics in various crates, but that's neither machine learning nor in the compiler itself.
I know that datalog is used for the borrow checking logic, so I could also maybe imagine someone describing something like that in some hand wavy way like "proof by machine to detect up front whether the program is safe or might crash" and that getting misinterpreted, but that seems like a stretch
afdbcreid 7 hours ago [-]
Datalog is not used for borrow checking. It was used in old versions of the experimental next-gen borrow checker, Polonius. Current versions of Polonius (which is still experimental) are coded in Rust.
saghm 5 hours ago [-]
Oh interesting, I'm not sure why I misunderstood that. Thanks for the correction!
awesomeMilou 12 hours ago [-]
Is there also a self guided course for "basic compilers", before stepping into an advanced level?
Jtsummers 11 hours ago [-]
I'm a fan of the structure used in Essentials of Compilation [1][2] and Writing a C Compiler [3]. If you want to start with interpreters, I like Essentials of Programming Languages [4]. I have to admit, as popular as Crafting Interpreters is on this site and others, I'm not a fan. Others seem to love it so it's worth a shot, and freely available.
EOC and EOPL are a bit on the academic side, but, I think, they're highly approachable aside from the issues some people have with Scheme and Racket (the Python version of EOC would address that issue). Afterwards, I think the other, deeper and more academic texts on compilers become more approachable.
I'd say check out Crafting Interpreters [1]. It has 2 parts, the first in Java for doing a treewalk Interpreter in Java before going farther with a version written in C.
Not GP, but after doing Crafting Interpreters I was kinda left with a gap in my knowledge regarding the conversion of an AST into native code. Also kinda missing was optimization passes over an AST. I somewhat understand the idea, but it would definitely be nice to have a more guided book/article for this.
Crafting Interpreters is definitely a recommended read, but it stops at Interpreters (fair enough, the book is thick enough). Crafting Compilers would need at least 4-5 extra chapters IMO.
awesomeMilou 11 hours ago [-]
Hmm I would have hoped for something more formal and that's focused on compiled runtimes, instead of dynamic runtimes
Still, I appreciate you replying, I'm sure you meant to be helpful!
The course overall looks good, and it's great that so much is available online, so well done, Adrian.
We spend a lot more time on type feedback, ICs, and deoptimization which are the more universal concepts that can be applied to multiple different compiler designs.
JAX is a tracing compiler!
(I know, I know, it sits in an extremely different part of the problem space than TraceMonkey or LuaJIT. Still.)
Numerical programs being very stable in terms of control is what enables GPU parallelization and loop optimizations in the long tradition of Fortran compilers. Optimizations like loop tiling, interchange, strip mining, etc aren't going to be easy to do with trace compilation.
Anyway my comment was more directed toward trace compilation in the context of dynamic languages, and there I think it's pretty well established it only works well for small programs.
Is your criticism of tracing specific to messy, confusing code, with lots of edge cases in the main loop, or does it also hold true for well written code?
I have no experience with compiler design, didn't even take a compilers course in college.
ML frameworks even take advantage of this to compute, ahead-of-time, how much memory will be used at different points in the program graph, and thereafter schedule memcpy's to make space as necessary. Of course this only works for well-behaved program classes, but e.g. most LLM architectures fit into that category. Interestingly MoE models don't, since they require data-dependent control flow, thus the recent push towards accommodating dynamism in frameworks (like JAX, which until ~recently couldn't handle it at all).
LuaJIT works just fine. On big programs. On hundreds of millions of servers and devices.
I find it deeply saddening that even scholars keep repeating this trope. Ignorance is bliss.
Best of luck.
'Real-world' JavaScript is a challenge for any compiler, no matter the underlying technology.
The technical debt in SpiderMonkey (at that time, anyway) was eye-watering. Grafting anything on top of that was hard. They haven't even gotten to the point of implementing the crucial pieces of a trace compiler before the project folded.
Trace compilers make nice textbook exercises. Getting them into production is a different matter: region selection, side-exit handling, trace graph evolution, deep VM integration, code generation adapted to all of this … far from trivial, but doable.
Neither is it trivial to create a production-quality method-at-a-time JIT compiler and VM for a dynamic language.
Ceterum censeo: The fundamental papers on trace compilation are from Fisher (1981) and the teams around Multiflow (1990) and Dynamo (1999). Franz, Gal, et al can be credited for later re-popularizing the idea, but they haven't added anything of note.
The short answer is that compilers is basically broken up into two courses, with the first course largely being the minimum necessary to build a compiler (lexing, parsing, codegen, register allocation), and the second course being how to build an optimizing compiler.
First is presented a linear time optimal algorithm for graph coloring then it is claimed better can be done by a O(N^2) algorithm that uses a heuristic.
I do believe the dragon book got caught with the emperor's new register allocator and the literature hasn't really recovered yet.
Take code in static single assignment form, or another form where it is values that are tracked with live ranges. For a single basic block the interference graph of the live ranges is an interval graph. As the live ranges are just intervals from the instruction that produces the value to the last instruction that consumes the value.
It is annoying but totally doable to follow this into graph theory and see that interval graphs, are a subset of chordal graphs, and these graphs have an algorithm that colors the graph optimally in linear time.
Poke a little more and it turns out that algorithm is the same algorithm as linear-scan register allocation.
Add multiple basic blocks and it is trickier to see, but there are proofs in the academic literature that the interference graph remains an interval graph. Something about phi functions not counting.
Which is what I meant when I said a linear time algorithm.
Compare that to the O(N^2) heuristic from the paper register allocation by graph coloring.
The big change is going from a linear time optimal algorithm to a quadratic time heuristic and proclaiming the quadratic time heuristic is better. (What???)
In practice there are a lot of complications that are not accounted for by graph coloring. Instructions and function calls that take fixed registers. Spilling registers when you need more values alive then there are registers. Deliberately keeping a non-minimal number of values in registers to increase instruction level parallelism.
To the best of my knowledge all of those problems are very tractable if you start with a linear scan register allocator, as the code can be changed without requiring the register allocation to restart.
Code based on computing the interference graph has to reconpute the interference graph and restart whenever the code has to be changed. Making it much worse time wise. If for no other reason than computing the interference graph is O(N^2).
So yes more complications but only because the model described is overly simplified.
Optimizing compilers are suboptimal in that they waste enormous amount of time optimizing code that can't or needn't be optimized and even where the optimizations are helpful, they are opaque and at risk of unexpectedly regressing both due to small changes at the source code level or changes in the compiler optimizer, both of which are quite insidious.
If instead of optimizing compilers, we had languages that allowed for seamless interop between low level and high level functions, then perhaps an llm becomes the optimizer (or you can invoke the compiler to optimize a specific function by source level rewrite). The benefit of this compared to a traditional optimizing compiler is that the optimization is done once per function and never repeated (until prompted) and the implementation is human readable and not buried in a binary. Moreover, and perhaps even more importantly, by not doing optimizations in the compiler, compilation times can be much faster, easily 100-1000x than state of the art optimizing compiler, while generating equivalent or even better runtime performance. As it has been said: premature optimization is the root of all evil.
In fact, the "backend" be compiler or interpreter is nearly always left as "exercise to reader".
You can't imagine how much is left to be discovered, from how make a closure, track environment, do pattern matching, memory representation, etc.
EVERYTHING interesting is something you need to look for.
P.D: This only one of the years:https://gist.githubusercontent.com/mamcx/e1743571b9a1ea163a7...
Maybe this is introductory for backend?
I guess garbage collection is pretty advanced in the syllabus.
This course is just a second course on compilers for people who had an introduction. And a great one at that.
A good modern, practical and decently optimizing compiler can do just fine without all the things you've mentioned, including vectorization.
Besides, most programming language implementations never go beyond the basic SSA toolkit.
CS 6120: Advanced Compilers: The Self-Guided Online Course - https://news.ycombinator.com/item?id=39577878 - March 2024 (102 comments)
Advanced Compilers: Self-Guided Online Course - https://news.ycombinator.com/item?id=35130975 - March 2023 (82 comments)
Advanced Compilers: Self-Guided Online Course - https://news.ycombinator.com/item?id=25386756 - Dec 2020 (232 comments)
I know that datalog is used for the borrow checking logic, so I could also maybe imagine someone describing something like that in some hand wavy way like "proof by machine to detect up front whether the program is safe or might crash" and that getting misinterpreted, but that seems like a stretch
EOC and EOPL are a bit on the academic side, but, I think, they're highly approachable aside from the issues some people have with Scheme and Racket (the Python version of EOC would address that issue). Afterwards, I think the other, deeper and more academic texts on compilers become more approachable.
[1] https://mitpress.mit.edu/9780262047760/essentials-of-compila... - Racket version, has an open access version
[2] https://mitpress.mit.edu/9780262048248/essentials-of-compila... - Python version, has an open access version
[3] https://nostarch.com/writing-c-compiler - Your choice of implementation language
[4] https://mitpress.mit.edu/9780262062794/essentials-of-program... - Scheme, but works in Racket
1. https://craftinginterpreters.com/
Crafting Interpreters is definitely a recommended read, but it stops at Interpreters (fair enough, the book is thick enough). Crafting Compilers would need at least 4-5 extra chapters IMO.
Still, I appreciate you replying, I'm sure you meant to be helpful!