Nea: A web server that never allocates - Folkert de Vries
Table of contents
- Meet Nea: A web server that never allocates, blending Rust's low-level performance with Roc's domain-specific platforms for a seamless web experience.
- Optimizing web servers with low-level programming can prevent costly crashes and improve user experience.
- Finding the perfect settings for your app is like chasing a mirage; they change constantly and never truly fit.
- When memory allocation fails, it's often due to earlier bugs that didn't clean up properly, making it tricky to track down.
- Allocate memory once, never give it back, and give each request its own fixed memory bucket for faster performance and predictable limits.
- Optimize your memory allocation by splitting requests into different size classes and using routing logic to avoid wasting space.
- Understanding setjmp and longjmp in C can demystify seemingly magical code behavior.
- Rust's async functions turn into state machines, but you need an async runtime to efficiently manage them.
- Unlock the magic of Rust async with my latest blog post—dive deep into how operating systems manage blocked calls efficiently!
- Build programs that never allocate memory.
Meet Nea: A web server that never allocates, blending Rust's low-level performance with Roc's domain-specific platforms for a seamless web experience.
Welcome everybody! The next speaker is Folkert. I know it's a bit challenging for me, but I'm really looking forward to this talk. It's about a web server that never allocates, which you might assume would be a Zig talk from a Zig speaker. However, Folkert actually comes from the Roc community. Also, I should mention that I use Rust, so let's see what people say in the chat.
Introduction: Could you introduce yourself a little bit, please?
- Sure! I'm Folkert, and I've been working on Roc for four and a half years now, which feels like a long time. I also work on various Rust projects. For instance, I, along with my colleagues, wrote an implementation of the Network Time Protocol (NTP), and I'm currently working on an implementation of ZP. I focus on low-level, performance-sensitive tasks.
Talk Introduction: Please give it up for Folkert, and let's get started.
- Thank you! Welcome to Nea, the web server that never allocates. So, what is Nea? Nea is primarily a proof of concept, and it makes more sense if you keep that in mind. It is partially funded by NLnet, a cool organization that distributes EU money for internet-related projects. If you're in the EU, this might be interesting for you.
Nea is written in Rust, so while we won't need extensive Rust knowledge, it will come up a few times. Don't be scared. Nea emerged from our desire to experiment with Roc platforms. In Roc, we have the Roc language to write applications, but these applications always build on a platform, typically written in a more low-level language.
Roc Platforms: We wanted to make Roc a language good at specific domains, allowing for very targeted platforms that make certain assumptions about the domain. These platforms function like a domain-specific runtime. While many languages have one runtime that must work for all cases, we can have specific runtimes for areas like game development, GUI development, or CLI applications. This approach allows for a cleaner API boundary.
Nea's Purpose: Nea is also a web server, born out of frustrations with web development, particularly from Richard, who might be here. I don't work with the web much, and the more I learn about it, the happier I am that I don't have to deal with it often. Nea is based on struggles in production when deploying web services.
Low-Level Systems Programming: Finally, we can solve the problems we encounter using low-level systems programming. This has been a journey for me, transitioning from a background in Haskell and functional programming to falling in love with reading assembly, becoming performance-obsessed, and learning all these low-level tricks.
Optimizing web servers with low-level programming can prevent costly crashes and improve user experience.
Today, we're looking at why we designed Nea and what low-level tricks we can use to fix these problems. What are these web server woes? What problems do you face when deploying a web server in the modern world? Ideally, a web server should be simple: accept a request, run some code on that request, get a response, and send it out. The only special thing about Nea here is that the request handler is Roc code. From your perspective, this Roc function is just C FFI: you don't need to know anything about Roc; we're just calling a foreign function that takes a request and returns a response.
However, this picture becomes much more complicated when deployed in the real world. Locally, resources are infinite—you have so much RAM that you could never possibly fill it with any workload you could dream of. But in the cloud, you pay for resources, and they are finite. You need to think carefully about how you're going to use these resources because you might run out. Running out of memory is particularly bad for your users. When your server runs out of memory, it gets killed, which is not a good time for your server or your users. They will notice and have to retry, which is poor UX. Over time, they might leave because it doesn't give a good impression and can slow them down, especially during complex transactions.
If you are a standard web developer using Ruby, Python, PHP, or whatever else, and you want to fix your memory consumption problems, there is not much you can do. You are at the mercy of your runtime, which determines everything. These runtimes are mediocre at everything: they have to be not terrible at anything you throw at them, but that means they're not very good at your specific application. This is exactly the problem we face. In theory, there might be some combination of settings for this runtime that is perfect for our application. These runtimes have a hundred knobs you can tweak toward a perfect configuration, but it's hard to find optimal settings because there are so many, and they all interact. Even when you do find them, your application or access patterns change, so the perfect settings are never perfect for very long.
Finding the perfect settings for your app is like chasing a mirage; they change constantly and never truly fit.
And even if you do find the perfect settings, they are still a heuristic: a soft promise rather than a hard guarantee. You can't engineer on top of a system like that; it's quicksand, and you sink away.
We have two goals with Nea. Goal number one is to get a firm upper bound on the amount of memory that our server is going to use. You always have a firm upper bound because that is what you pay for, but we want our server to work well within its bounds. It should never overstep or cross the line and should operate within its limits. Goal number two is that a particularly memory-hungry request should not influence any of the other requests that are currently in progress. Normally, when a process gets shut down due to one bad request that spikes memory usage or a bug, many innocent requests also get canceled when the whole server goes down. We want to prevent that, ensuring that good citizens continue to run uninterrupted.
In practice, this problem is quite hard. The second problem, figuring out which request is good or bad, is challenging because existing runtimes can't do this. They throw all the traffic into one big heap, and disentangling that is surprisingly similar to fallible allocation, which you might be more familiar with. For example, in Rust there are two relevant methods from the standard library, defined on a growable collection type. The first, reserve, reserves some additional memory, but this can fail: when you run out of memory and your allocator is full, it will abort your program.
I spoke to people working on getting Rust into the Windows kernel, and later to those working on Rust for the Linux kernel. For them, this is a major problem: unlike user applications, where resources seem infinite and crashes are somewhat acceptable, the kernel should not crash or abort just because some memory allocation fails. Operating systems need to handle this case properly. So they started experimenting with additional APIs that usually start with a try_ prefix, where you always get a return value indicating success or failure, and the function will never blow up your program.
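To make the two flavours concrete, here is a minimal sketch around Vec, the growable collection type in question; the helper function names are made up for illustration.

```rust
use std::collections::TryReserveError;

fn grow_infallible(buf: &mut Vec<u8>, extra: usize) {
    // If the allocator cannot satisfy this request, the process aborts here.
    buf.reserve(extra);
}

fn grow_fallible(buf: &mut Vec<u8>, extra: usize) -> Result<(), TryReserveError> {
    // On failure we get an Err back instead of an abort, so we can branch and try to recover.
    buf.try_reserve(extra)?;
    Ok(())
}
```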
When memory allocation fails, it's often due to earlier bugs that didn't clean up properly, making it tricky to track down.
Now, we can at least catch when allocation fails. We get some value back, and we can branch on that: we can switch on the two cases and try to recover. Let me illustrate this with the help of my friend the camel, our allocator. Normally, when we load allocations onto our allocator, everything is fine; everything is good. It's in its lane, flourishing, and over time these allocations get removed.
In the case of an out-of-memory event, what's really happening is that our allocator is already quite strained and under heavy load. When we add one more allocation, the malloc that broke the camel's back, that final straw isn't actually the problem. The final straw was a perfectly normal allocation; everything about it was good, it was just ill-timed and unlucky. If we catch this with one of these fallible allocation APIs, there's not much we can do, because the current request or allocation isn't at fault for running out of memory. The problem likely occurred much earlier, due to some bug where we didn't clean up memory or allocated more than we should have. But we can't track that down from the location where the problem shows up; these issues are temporally separated, which makes tracking the bug down really tricky in practice.
I show this with our allocator because, in practice, we get much more boring graphs. For example, we can see a server with high memory usage that spikes and then gets killed. After rebooting, it spikes again and gets killed. Eventually, the people who posted this image made their server use less memory and restart automatically instead of being killed. On the right of that image, you can see it stays below the red line artificially, but the server keeps rebooting as it runs out of memory. This is not good.
So, what can we do? What are the solutions here? These are really niche designs, not universal good answers, but these are the answers we came up with. First of all, our starting point is actually pretty good. Roc is a much faster language than the status quo in the web space; beating Ruby on speed is very easy. Roc is relatively fast because we're compiled and can use LLVM for all the optimizations. Also, we use reference counting as our garbage-collection approach, so we have no stop-the-world garbage collector; it is similar to how Rust and C++ deallocate when values go out of scope. For a language for a web server, we're actually looking pretty good.
We can do better, though. Problem one was getting a firm upper bound on the amount of memory used by our web server. We fix that by making one request for memory. We could argue about whether this counts as an allocation or not; I'm going to argue that because we never call malloc and never call free, we make zero allocations. In practice, we make one very large allocation using mmap or something similar: operating system primitives that hand you a large section of memory in one go. So perhaps we make one allocation rather than never allocating, but that is all we will ever do.
Allocate memory once, never give it back, and give each request its own fixed memory bucket for faster performance and predictable limits.
We never give that memory back, and we never ask for more; there is literally no code in the program that could ask for more. In theory, we could hard-code how much memory we are going to need as a compile-time constant. I believe that is what TigerBeetle at least used to do and might still do. The limitation is that if you want to change that number, you need to recompile. That might work for some, but it doesn't work for us, because we want Roc programmers to only need to know Roc and not have to install a whole toolchain for compiling native code. Therefore, we need to be a bit more dynamic.
In practice, we start the app, read some config, do some math, and allocate that number of bytes. That is all the allocation work we ever do. An interesting observation is that our memory consumption always looks maximal. When you mmap, the memory isn't really allocated yet; it only exists in the kernel's bookkeeping, and a page only becomes truly yours once you touch it. We will eventually touch all of these pages, so we will always show high memory usage. That's fine: we paid for that memory, so we may as well use it.
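Here is a minimal sketch of that startup step, using the libc crate's mmap directly. The function name and the exact flags are illustrative assumptions, not Nea's actual startup code.

```rust
use std::ptr;

/// Ask the operating system for `total_bytes` of anonymous memory in one go.
/// This is the only "allocation" the server ever performs.
fn reserve_all_memory(total_bytes: usize) -> *mut u8 {
    let ptr = unsafe {
        libc::mmap(
            ptr::null_mut(),
            total_bytes,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1, // anonymous mapping: no backing file
            0,
        )
    };
    assert!(ptr != libc::MAP_FAILED, "could not reserve memory");
    // The pages only become "real" once they are first touched, but every page
    // here will eventually be touched by some request.
    ptr.cast()
}
```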
Solution two is trickier and requires a couple more steps. Our problem was that a memory-hungry request should not impact any other request. We fix this by giving each request its own little allocator, its own region of memory that it can allocate into, and no one else can. This is a fixed amount of memory. In practice, these work like bump arenas, so allocation into them is very cheap. We don't free memory there; only at the end, when the request is over and the response has been sent, do we clear the whole thing in one go by resetting a pointer to an earlier address, which is very fast.
In practice, we call these little slots buckets. We now run into a hard limit on the number of concurrent requests we can serve, because we only have so many of these sections. We have one big allocation, chop it up into these buckets, and there are only so many of them. Every request needs one of these buckets to do its work, which is quite limiting. However, we gain a magic power: we can do math. We know how many buckets we have, and based on that and the memory an individual request needs, we can calculate the total memory required, or vice versa. If we know the total memory we have and the memory required per request, we can determine how many concurrent requests we can serve with a particular instance.
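A sketch of what such a bucket could look like, together with the math it enables. The types, method names, and numbers here are illustrative, not Nea's real implementation.

```rust
/// One request's fixed slice of the big allocation, used as a bump arena.
struct Bucket<'a> {
    memory: &'a mut [u8],
    used: usize,
}

impl<'a> Bucket<'a> {
    /// Bump-allocate `size` bytes; `None` means this request is over its budget.
    fn alloc(&mut self, size: usize, align: usize) -> Option<&mut [u8]> {
        let start = (self.used + align - 1) & !(align - 1); // round up; align must be a power of two
        let end = start.checked_add(size)?;
        if end > self.memory.len() {
            return None; // out of memory, but only for this one request
        }
        self.used = end;
        Some(&mut self.memory[start..end])
    }

    /// Called once the response has been sent: "free" everything in one go.
    fn reset(&mut self) {
        self.used = 0;
    }
}

/// The math the buckets enable: total memory, per-request budget, and maximum
/// concurrency determine each other.
fn max_concurrent_requests(total_bytes: usize, bytes_per_request: usize) -> usize {
    total_bytes / bytes_per_request
}
```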
Optimize your memory allocation by splitting requests into different size classes and using routing logic to avoid wasting space.
It is very unfortunate if you have huge outliers: particular requests or arguments that use significantly more memory than the vast majority. The buckets need to be large enough to accommodate them, which means wasting a lot of space on most requests. In practice, it might be advisable to split requests into different size classes, run multiple instances, and put some routing logic in front to use memory optimally.
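A toy illustration of that routing idea; the endpoints and bucket sizes are invented for the example, not taken from Nea.

```rust
/// Route a request to an instance whose bucket size fits its expected memory
/// use, instead of sizing every bucket for the worst case.
enum SizeClass {
    Small, // instances configured with, say, 64 KiB buckets
    Large, // instances configured with, say, 4 MiB buckets
}

fn route(path: &str) -> SizeClass {
    match path {
        // endpoints known to build big responses get the big buckets
        "/export" | "/report" => SizeClass::Large,
        _ => SizeClass::Small,
    }
}
```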
This has all been the happy case, where memory isn't a problem and the bucket is big enough. However, the tricky part is: what if it isn't? What if a request goes over its limit and exceeds its bounds? In this case, recovery is necessary. First, you can actually log this particular request, because it violated assumptions. It's not just a random memory allocation that was ill-timed; it's a problematic request. Knowing about it is helpful, as it allows tracking down and reproducing the issue.
The next step is to send a proper error 500 when this happens. Normally, when a web server gets killed, the connection just gets interrupted, but here we can properly send an internal server error back. Finally, when an allocation fails deep down the stack, you need to get back up the stack to a place where future requests can be accepted. This is done using setjmp and longjmp.
setjmp takes a snapshot of the current state of the program, including the stack pointer and instruction pointer. The program then continues, and later the snapshot can be restored with longjmp, which takes you back to where setjmp was called. setjmp returns zero when it takes the snapshot, so in the example the program prints zero and does not take the branch. There can be function calls in between, and the longjmp can happen in a deeply nested stack frame; that is why it is called a long jump: it can jump between functions and cross stack frames.
When longjmp is called, it jumps back to the line where setjmp was called, but this time setjmp returns the value that was given to longjmp, so the program can tell that the jump happened.
Understanding setjmp and longjmp in C can demystify seemingly magical code behavior.
By the way, that example contains undefined behavior; it comes from the internet and is longer than ten lines, so of course it does. In practice, you aren't allowed to use the return value of setjmp this way: it cannot be bound to a variable; it may only influence control flow, not data flow. The fact that the code comes from a university website doesn't make it any better.
Given that this is a systems programming conference, I included some assembly, since this is the only place where I can get away with that. I found setjmp and longjmp fascinating because they seemed almost magical, even after reading the documentation. The nice thing is that, with enough confidence, you can just go read the source. That is a bit tricky because libc is typically linked dynamically, so I wrote a small Zig program, told it to use musl, and used objdump to look at the assembly for setjmp.
In that assembly, we see various registers being stored into the jump buffer, a large array, at specific offsets. Conversely, longjmp reads the values back out of that array, restores them into the registers, and eventually jumps to a stored location. There is some arithmetic involved, but you can sit down and work through it, which makes the whole thing less mysterious, if still somewhat complex.
The program operates as follows: it enters a loop, takes a work item from the queue, and awaits the asynchronous work. Before handling the request, it takes a snapshot of the current state with setjmp, so that it can get back to the loop if an allocation fails. If setjmp returns zero, the program proceeds with the work. If allocation fails, it fails inside the roc_alloc function, which attempts to allocate memory from the array of buckets; if that allocation fails and a null pointer comes back, we jump back up the stack to the loop. This time setjmp returns one, we take the failure-handling branch, and logging, cleanup, and the error response are dealt with before continuing with the next request.
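A schematic sketch of that control flow, with hypothetical stand-ins for the queue, handler, and response machinery. setjmp and longjmp are declared via plain FFI here purely for illustration; as the next paragraph explains, doing this in Rust is subtle, which is why Nea actually uses inline-assembly variants instead.

```rust
use std::ffi::{c_int, c_void};

// Not exposed by the Rust standard library; declared here only for illustration.
extern "C" {
    fn setjmp(env: *mut c_void) -> c_int;
    fn longjmp(env: *mut c_void, val: c_int) -> !;
}

/// Opaque, generously oversized storage for the platform's jmp_buf.
#[repr(C, align(16))]
struct JumpBuf([u8; 512]);

// Hypothetical stand-ins for Nea's queue, handler, and response machinery.
struct Request;
fn next_request() -> Request { Request }
fn handle(_req: &Request, _env: *mut c_void) { /* calls roc_alloc deep down the stack */ }
fn send_internal_server_error(_req: &Request) {}

fn worker_loop() {
    let mut jump_buf = JumpBuf([0; 512]);
    loop {
        let request = next_request();
        let env = jump_buf.0.as_mut_ptr().cast::<c_void>();
        // Snapshot the stack and instruction pointers: returns 0 now, and returns
        // the value passed to longjmp if an allocation fails later on.
        if unsafe { setjmp(env) } == 0 {
            handle(&request, env);
        } else {
            // An allocation failed somewhere inside handle(): log it, send a 500,
            // reset this request's bucket, and move on to the next request.
            send_internal_server_error(&request);
        }
    }
}

/// What the allocation path does when the request's bucket is exhausted,
/// instead of letting a null pointer propagate.
unsafe fn bucket_exhausted(env: *mut c_void) -> ! {
    longjmp(env, 1) // back up the stack, straight into the else branch above
}
```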
There are caveats to using setjmp and longjmp. They skip destructors, which can be bad: you might, for example, leave a mutex locked. Also, returning twice, which is in effect what setjmp does when longjmp jumps back to it, is not something functions are normally allowed to do. LLVM has a special returns_twice annotation for this, but Rust does not, which makes the function so unsafe that even an unsafe block cannot really contain it. It is not exposed in the Rust standard library either; you have to declare it yourself via extern if you really want to use it. In practice, I use inline-assembly variants of these functions, so I know exactly what they are going to do and that they won't do any funny business.
Rust's async functions turn into state machines, but you need an async runtime to efficiently manage them.
A second topic, which is quite tangential but makes the whole solution a lot more complicated, is that we also want to do async I/O in this server. Async I/O makes sense for web servers because we don't want a whole thread to hang when a particular request handler makes a request to something in the outside world. In practice, the thing is that I don't understand how async works, and Rust doesn't understand how async works either. Rust only knows how to do a syntax transformation on code like this; it doesn't really know how to run it. When we have an async function and it does some await, all that the Rust compiler knows is how to turn this into a state machine that we can poll. We can continually ask it, "Are you done yet? Are you done yet?" and either it'll say, "Nope, still blocked, still waiting for something to complete," or it'll say, "Yes, actually I am done. Here is a value that I've produced at the end of my computation."
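To make the "state machine you can poll" point concrete, here is a small, self-contained example: nothing in the async fn runs until poll is called, and the do-nothing waker is just enough plumbing to call poll by hand. This is illustrative only; a real runtime supplies a waker that actually does something.

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A do-nothing waker: just enough machinery to call poll manually.
unsafe fn vtable_clone(_: *const ()) -> RawWaker {
    RawWaker::new(std::ptr::null(), &NOOP_VTABLE)
}
unsafe fn vtable_noop(_: *const ()) {}
static NOOP_VTABLE: RawWakerVTable =
    RawWakerVTable::new(vtable_clone, vtable_noop, vtable_noop, vtable_noop);

async fn add_one(x: u32) -> u32 {
    x + 1
}

fn main() {
    // Calling the async fn runs none of its body yet; it only builds the state machine.
    let mut future = pin!(add_one(41));
    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &NOOP_VTABLE)) };
    let mut cx = Context::from_waker(&waker);
    // A runtime's whole job is deciding when and how often to call poll.
    match future.as_mut().poll(&mut cx) {
        Poll::Ready(value) => println!("done: {value}"),
        Poll::Pending => println!("still blocked; poll again later"),
    }
}
```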
This is cool, but it still doesn't run, because we need some program that will call these poll functions in an efficient way. We could run them in a loop, but we don't really want to run them in a hot loop and continually poll them. That would be inefficient and unfortunate. Instead, Rust allows user code or user packages to write the program that does this polling. These programs or packages are called async runtimes, and they determine exactly what async is going to do in practice. We have several off-the-shelf options here: one called Tokio, which is for desktop applications (we actually use that in the Network Time Protocol client), and another called Embassy, which works on embedded hardware. It turns out that async await is a surprisingly cool idea on embedded systems, where we can wait for a particular event to fire or a timer to run out. Instead of setting all of that up manually every time, we can have the async machinery take care of it for us.
Unfortunately, we can use neither of these two solutions: the embedded one doesn't make good use of multiple threads, and Tokio will allocate, which we can't have. We need to control allocations, and Rust doesn't have custom allocators, sadly. Therefore, we need to build an async runtime ourselves. I'm not going to go into the technical details of that, but we wrote a whole blog post about it.
Unlock the magic of Rust async with my latest blog post—dive deep into how operating systems manage blocked calls efficiently!
The blog post itself isn't the important part; the code sample that goes with it is. To my knowledge, it is the first and only practical example of an async runtime in Rust that is simple enough to understand yet complicated enough to not be trivial. We delve into the details of how the operating system knows when a blocked call is ready for more reading or writing, and how the program notices this. That has always seemed magical to me. With async, we learn that if something blocks, the computer will do something else and jump to another piece of work. But how does it know to return to the previous task efficiently? Clearly, it shouldn't just randomly go back and check, as that would be terrible for throughput and wasteful. It turns out it cheats by adding an extra thread that can re-enqueue jobs into the work queue we've seen before.
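A minimal sketch of that re-enqueueing trick using the standard library's Wake trait. The task id, the queue shape, and the idea that an epoll-watching thread is the one calling wake() are illustrative assumptions, not Nea's actual runtime code.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::task::{Wake, Waker};

/// A waker that re-enqueues its task's id onto the shared work queue.
struct QueueWaker {
    task_id: usize,
    queue: Arc<Mutex<VecDeque<usize>>>,
}

impl Wake for QueueWaker {
    fn wake(self: Arc<Self>) {
        // A worker thread will pick the task up again and poll it.
        self.queue.lock().unwrap().push_back(self.task_id);
    }
}

fn main() {
    let queue = Arc::new(Mutex::new(VecDeque::new()));
    let waker: Waker = Arc::new(QueueWaker { task_id: 7, queue: Arc::clone(&queue) }).into();
    // In a real runtime, this wake() happens on the extra thread, once the OS
    // (via epoll or similar) reports that task 7's socket is ready again.
    waker.wake();
    assert_eq!(queue.lock().unwrap().pop_front(), Some(7));
}
```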
If you want to learn more about how Rust async works, I highly recommend the blog post I wrote. Moving forward, we need to make this prototype more robust: I believe I've removed all the race conditions, but you never know. We should also run actual Roc applications. We have some basic ones working, but the interaction between platform and application is still being explored, and we haven't run serious applications on this architecture yet. Next week, I will be running benchmarks at the local university, where they have equipment for measuring power draw. That should be fun because, in theory, we should be faster than any Rust solution that uses malloc, since we don't allocate while handling a request. However, finding a benchmark that isn't fake but clearly demonstrates this has been challenging; computers are fast and good at executing instructions.
In summary, custom allocators are a superpower. I am thrilled that Zig has brought this back into the spotlight because it wasn't widely discussed, especially from my Haskell background. Custom allocators are indeed a superpower, and I feel we haven't yet seen their full potential. It requires Zig being used in more places or other languages adopting these features to realize the true power of specialized memory allocation. With hard bounds, we can do math, which is incredibly useful. We can perform engineering instead of guesswork, ensuring our systems run reliably. Designing systems this way is highly beneficial.
Platforms combine convenience and performance, allowing very low-level code that manipulates the stack to coexist with a Roc application. The Roc application doesn't need to change even if we swap in a less fancy platform: for example, a standard platform with malloc and a normal web server. We could run the same program on both platforms to see how it behaves in practice and understand the trade-offs.
Build programs that never allocate memory.
Closing: "Keep building programs that never allocate."
Thanks: "Thank you very much, Folkert. That was a very nice way of ending your talk. Thanks."
Questions: "Do you have any questions from the audience? I think this is the kind of talk that really benefits from that."
Buckets and Memory Allocation: "At the beginning, you showed all the different buckets and commented on how they're all the same size. Have you ever considered over-allocating past the size of the actual memory and using something like zswap to compress all of the bytes that are generally zero, because those requests never touch them?"
Response: "I have a limit for how cursed this program should be, and this seems past that threshold. Yeah, it's probably not a good idea, but it's an interesting one at least."
Fetching Data from the Database: "What about fetching data from the database? You will have some types of requests where you only discover afterwards how much memory you're going to need. Do you have a strategy for that?"
Strategy: "In practice, the way you would use this is to start off by overestimating how much memory your requests use and gradually bring that number down over time. Most people don't think about how much memory an individual request takes. You can take an average, but it's unclear how much of that is the runtime versus useful work in your application. You could also partially solve the database case using pagination."
Constraints and New Patterns: "As soon as you have these constraints, you need to think about new patterns that make more sense within these constraints. We want to find out how we can make that work in a convenient way."
setjmp and longjmp in Rust: "I would call this usage of setjmp and longjmp highly questionable in Rust. I have multiple questions about that."
Response: "I agree with you. I was speaking with one of the Rust core developers last week, and they said we could make it less cursed paradoxically by adding more assembly. The main question is why don't you just use a result? And how do you handle when you, for example, open a file, run out of memory, and then do a long jump? That file will be leaked."
Handling Resources: "We need you to not do that. We make sure there are no Rust resources alive between our setjmp and our call into the Roc code. On the Roc side, we have to ensure things work out well. So far, we've been leaking things, and that's fine because we have the bump allocator. As long as you don't have any non-memory resources, you're good. If you want to open a file, we would need to figure something else out."
Result Usage: "Why not use a Result, where the allocation function returns a Result saying it couldn't allocate, and then you unwind the stack and run all your destructors?"
Response: "That allocation function is in Rust but called from Rock code. Unwinding over FFI boundaries is not allowed and not supported by Rock. It would be slow compared to doing the direct jump. We are in a very constrained environment when we run out of memory and are very deep down the stack. We don't want to unwind if we can help it because it adds a lot of complexity."
Desired Language Features: "You want a language that gives you custom allocators, allows you to do unsafe stuff, and lets you audit resource deallocation like destructors easily. Correct?"
Response: "Yes, that's it. Next question."
HTTP and HTTPS in Zig: "We also wanted to do HTTP and potentially HTTPS. When we started this project, the situation in Zig was not mature, and I am much more comfortable in Rust. In Zig, it would have required more work. I don't hate Zig; we use it a bunch and really like it, but we didn't pick it for this project."
Developer Experience: "I'm thinking about your developer experience. That's my perspective."
Final Thoughts: "The hard things were race conditions; the rest was just fun." Thank you very much, and let's give it up for the speaker.