This is a guest post by Laurence Tratt, who is a programmer and Reader in Software Development in the Department of Informatics at King's College London where he leads the Software Development Team. He is also an EPSRC Fellow.
Programming language Virtual Machines (VMs) are familiar beasts: we use them to run apps on our phone, code inside our browsers, and programs on our servers. Traditional VMs are useful and widely used: nearly every working programmer is familiar with one or more of the “standard” Lua, Python, or Ruby VMs. However, such VMs are simplistic, containing only an interpreter (a simple implementation of a language). These often can’t run our programs as fast as we need; and, even when they can, they often waste huge amounts of server CPU time. We sometimes forget that servers consume a large, and growing, chunk of the world’s electricity output: slow language implementations are, quite literally, changing the world, and not in a good way.
More advanced VMs come with Just-In-Time (JIT) compilers (well known examples include LuaJIT, HotSpot (aka “the JVM”), PyPy, and V8). Such VMs observe a program’s run-time behaviour and use that to compile frequently executed parts of the program into machine code. When this works well, it leads to significant speed-ups (2x-10x is common; and 100x is not unknown). We all want to make these VMs even better than they currently are, but doing so is easier said than done.
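To make the "observe, then compile the hot parts" idea concrete, here is a deliberately tiny sketch. It is not how LuaJIT or any real VM is built: `ToyVM`, `HOT_THRESHOLD`, and the use of Python's `eval` as a stand-in for both the interpreter and the machine-code generator are all assumptions made purely for illustration.

```python
from collections import Counter

HOT_THRESHOLD = 3  # hypothetical: executions before an expression counts as "hot"


class ToyVM:
    """Toy model of a VM that interprets first and "compiles" hot code."""

    def __init__(self, threshold=HOT_THRESHOLD):
        self.threshold = threshold
        self.calls = Counter()   # how often each expression ran in the interpreter
        self.compiled = {}       # name -> "compiled" callable

    def interpret(self, source, arg):
        # Slow path: re-parse and evaluate the source text on every call.
        return eval(source, {}, {"x": arg})

    def run(self, name, source, arg):
        fast = self.compiled.get(name)
        if fast is not None:
            # Fast path: the expression was compiled earlier.
            return fast(arg)
        self.calls[name] += 1
        if self.calls[name] >= self.threshold:
            # Stand-in for machine code generation: parse once, reuse afterwards.
            self.compiled[name] = eval("lambda x: " + source, {})
        return self.interpret(source, arg)


vm = ToyVM()
results = [vm.run("poly", "x * x + 1", i) for i in range(5)]
```

In this sketch the first three calls go through the interpreter and the remaining ones use the compiled version. Real JIT compilers make the same kind of decision, but with far more sophisticated profiling and heuristics.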
Internally, VMs with JIT compilers are ferociously complex beasts, with many moving parts that interact in subtle ways that are hard to reason about. For example, any VM developer worth their salt will have tales about “deoptimisation bugs”: that is, bugs that occur when the optimised machine code is no longer useful and the VM has to fall back to a less optimised, but more general, version of the program (e.g. an interpreter). Have you ever thought about how to pick apart an inlined function’s stack? Perhaps it sounds quite easy. What happens when inlining has allowed the JIT compiler to remove memory allocations: how should the “missing” allocations be identified and “restored”? Even in just this single part of a VM, the complexity quickly becomes mind-boggling.
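The basic shape of deoptimisation can be sketched in a few lines, with the heavy caveat that this toy skips exactly the hard parts mentioned above (rebuilding the stacks of inlined functions and restoring removed allocations). The names `GuardFailed`, `specialised_int_add`, `generic_add`, and `Adder` are hypothetical and exist only for this illustration.

```python
class GuardFailed(Exception):
    """Raised when specialised code meets a case it was not compiled for."""


def generic_add(a, b):
    # The general, slower path (think: the interpreter).
    return a + b


def specialised_int_add(a, b):
    # The "optimised" path assumes both operands are ints; the guard checks
    # that assumption before doing any work.
    if type(a) is not int or type(b) is not int:
        raise GuardFailed
    return a + b


class Adder:
    def __init__(self):
        self.fast_path = specialised_int_add
        self.deopts = 0

    def __call__(self, a, b):
        if self.fast_path is not None:
            try:
                return self.fast_path(a, b)
            except GuardFailed:
                # Deoptimise: discard the specialised code and fall back.
                self.deopts += 1
                self.fast_path = None
        return generic_add(a, b)


add = Adder()
add(1, 2)      # served by the specialised path
add("a", "b")  # guard fails, so the Adder deoptimises and uses generic_add
```

Here the guard fails before any work has been done, so falling back is trivial. In a real VM the guard can fail halfway through optimised code, and the VM must reconstruct the state the interpreter would have been in at that point, which is where the bugs tend to hide.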
What does this have to do with Cloudflare or me? Well, as many of you know, Cloudflare is a heavy user of LuaJIT and they would like LuaJIT to perform even better than it does today. Although Cloudflare has developers with a deep understanding of LuaJIT, they’ve long wanted to give more back to the open-source community. Unfortunately, finding someone who’s able and willing to work on such a complicated program isn’t easy, to say the least. I suggested that our research group – the Software Development Team at King's College London – might provide a fruitful alternative route. Happily, Cloudflare agreed, and gave us funding for a project looking to improve LuaJIT's performance that started in late August.
Why our team? Well, partly because we've done a lot of VM work from data-structure optimisation to language composition to benchmarking, which helps us get people new to the field up and running quickly. Partly, I like to think, because we're open-minded about the projects we use as part of our research. Although we've not done a huge amount of direct LuaJIT research, we've always tried to stay abreast of its development (e.g. leading to us inviting Vyacheslav Egorov to talk about LuaJIT at the VM Summer School we ran in 2016), and we've often used it as a key part of cross-VM benchmarking.
Personally I've been aware of LuaJIT for many years: I submitted a very, very minor patch to make it run on OpenBSD in 2012 so that I could try out this VM that I'd heard others rave about; I used LuaJIT's clever assembly system in a small toy project; and I've included LuaJIT in more than one benchmarking suite (e.g. this paper). My first impression was that LuaJIT has an astonishingly small codebase and astonishingly fast warmup (roughly speaking, the time from the program starting to final machine code being generated). Both those points remain true today, and are a testament to Mike Pall’s vision and coding abilities. However, benchmarking LuaJIT against other VMs has shown weaknesses, which, I’ve come to realise, are widely known by those using LuaJIT on large systems. In a nutshell, larger programs exhibit substantial performance differences from one run to the next, and, over long periods, some programs don’t run as fast as other VMs can run them. I don’t see this as a criticism – every VM I’m familiar with has strengths and weaknesses – but rather as an opportunity to see if we can make things even better.
How are we going to go about trying to improve LuaJIT? First, we had one of those strokes of luck you can normally only dream about: after conversations with some friendly LuaJIT insiders, I was able to tempt Thomas Fransham, a LuaJIT expert, to work with us at King’s. Tom’s deep understanding of LuaJIT’s internals means we have someone who can immediately start turning ideas into reality. Second, we’re making use of our multi-year work on VM benchmarking which recently saw the light of day as the Virtual Machine Warmup Blows Hot and Cold paper and the Krun and warmup_stats systems. To cut a long and dense paper short, we found that, even in an idealised setting, widely studied benchmarks often don’t warm up as they’re supposed to on well known VMs. Sometimes they get slower over time; sometimes they never stabilise; sometimes they’re inconsistent from one execution to the next. Even in the best case, we found that, across all the VMs we studied, only 43.5% of cases warmed up as they were supposed to (at 51%, LuaJIT was a little better than some VMs). While this is somewhat embarrassing news for those of us who develop VMs, it’s forced me to make two hypotheses (a fancy word for “gut feelings”) which we’ll test in this project: that VMs have heuristics (e.g. “when to compile code”) which interact in ways we no longer understand; and that rigorous benchmarking before and after changes (whether those changes add or remove code) is the only way to make performance better and more predictable.
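To give a flavour of what "warming up as supposed to" means in practice, here is a crude sketch that labels a series of per-iteration timings. It is emphatically not the statistical analysis used in the paper or in warmup_stats: the functions `classify` and `consistent`, the window size, and the 5% tolerance are all assumptions chosen for illustration, and the timings are made-up numbers rather than measurements.

```python
from statistics import mean


def classify(timings, window=3, tolerance=0.05):
    """Label a run by comparing its first and last few iteration times."""
    if len(timings) < 2 * window:
        raise ValueError("need at least 2 * window timings")
    first = mean(timings[:window])
    last = mean(timings[-window:])
    change = (last - first) / first
    if change < -tolerance:
        return "warmup"    # iterations got faster, as we would hope
    if change > tolerance:
        return "slowdown"  # iterations got slower over time
    return "flat"          # no meaningful change


def consistent(executions, **kwargs):
    """True if every execution of a benchmark gets the same label."""
    return len({classify(t, **kwargs) for t in executions}) == 1


# Made-up per-iteration timings (seconds) for two executions of one benchmark.
run_a = [1.0, 0.9, 0.8, 0.5, 0.5, 0.5]
run_b = [0.5, 0.5, 0.5, 0.8, 0.9, 1.0]
labels = [classify(run_a), classify(run_b)]
same_behaviour = consistent([run_a, run_b])
```

Even a heuristic this simple makes the point: the two executions above would be labelled differently, which is the kind of inconsistency from one execution to the next that a single averaged number would hide. It cannot, however, detect a run that never stabilises, which needs a more careful analysis.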
As this might suggest, we’re not aiming to “just” improve LuaJIT. This project will also naturally help us advance our wider research interests centred around understanding how VM performance can be improved. I’m hopeful that what we learn from LuaJIT will, in the long run, also help other VMs improve. Indeed, we have big ideas for what the next generation of VMs might look like, and we’re bound to learn important lessons from this project. Our long-term bet is on meta-tracing: we think we can reduce its currently fearsome warmup costs through a combination of Intel’s newish Processor Tracing feature and some other tricks we have up our sleeves. That then opens up new possibilities to do things like parallel compilation that haven’t really been thought worthwhile before. While LuaJIT isn’t a meta-tracing JIT compiler, it is a tracing JIT compiler and, as the similar name suggests, most concepts carry over from one to the other.
To say that I’m extremely grateful to Cloudflare for their support is an understatement. It might not be obvious from the outside, but (apart from me) everyone in our research group is supported by external funding: without it, I can’t pay good people like Tom to help move the subject forward. I’m also pleased to say that, before I even asked, Cloudflare made clear that they wanted all the project’s results to be open. Any and all changes we make will be fully open-sourced under LuaJIT’s normal license, so as and when we make improvements to LuaJIT, the whole LuaJIT community will benefit. I also want us, more generally, to be good citizens within the wider LuaJIT community. We know we have a lot to learn from you, from ideas for benchmarking, to concrete experiences of performance problems. Don’t be shy about getting in touch!