Linking can be fast (if you cheat): Roc's Surgical Linker - Brendan Hansknecht

Linking can be fast if you cheat: learn how to make your static language feel as fast as Python!

Thank you very much. The next speaker is going to be Brendan Hansknecht, which I've been told is the Americanized pronunciation. The real pronunciation is more complicated. I'm Italian, and the English pronunciation is already way too hard for me, so I'll leave it at that. He's a core contributor to Roc. So Brendan, please introduce yourself if you don't mind.

"Hi everybody, I'm Brendon Hands Connect. Fundamentally, I'm someone who loves diving into details a little too much, especially when it comes to performance and disassembly aspects of that nature. I work on compilers a lot, both AI compilers and Rock itself. I love making languages faster, making the compiler faster, and just digging into these types of problems that require knowing the system details."

"Awesome. I've actually watched quite a few of your live streams. Tell people where to find you and to please follow, share, like, subscribe, etc."

"I don't actually stream that much; I may get back into it. I'm always on the Rock Zulip if you just want to chat and discuss details. If I live stream, it's 'B Hands Connect,' except it's not spelled like my name. It's simpler because I just use the word 'connect' like the English word 'connect' to make it easy to spell."

"Okay, so people, if you're interested enough, you'll be able to find him. Well, please give it up for Brendon, and let's hear it."

"Okay, so before we get started, two things I want to make sure of. First, do people like dark mode, or should I switch it to light mode?"

"Dark mode sounds good enough."

"Second, is this type of animation going to be terrible?"

"We're good."

"Okay, so we should be good there. One final question, general for everybody: How many of you would say that the linker is a significant chunk of time in your development loop? Like when you develop something, you're stuck waiting for it as part of the stops?"

"Okay, that's roughly half."

"Now, on the other side of it, how many of the people who didn't raise their hands, is it because you use a dynamic language with essentially no linker?"

"About half of that, so like a quarter of people just use dynamic languages. And then of the people that use a static language, about two-thirds of them run into problems with the linker."

"This talk is 'Linking Can Be Fast If You Cheat: Rock Surgical Linker.' I just want to preface this by saying this is cheating. Do not get angry at all the linker implementers because their linkers are too slow. Mold is an amazing product; Rui does a lot of great work, so please respect their work. But now, let's look at how we can get faster than that."

"A quick little background if you want to talk to me later: I work at Modular, mostly on the AI compiler there, which is called 'The Engine.' I also work on Rock, so if you want to chat about either of those things or system details, etc., come find me."

"We're going to give a little background, explain some linking stuff because I assume not everybody's familiar with how linkers actually work, go into why Rock is special, and how we can take advantage of that. This is the theoretical goal right here: We want to feel like a dynamic language while being a static language. We want to be as fast as Python when it, for example, parses a simple argument and prints out the help message. We want to have that experience where it's essentially instant and nice, but in a compiled language."

"In Rock, we start with using our basic CLI app, which is a little bloated and has some minor issues, but we'll talk about those some more later. The front end is fundamentally close-ish, and it's actually doing a ton of extra work it shouldn't be doing. I think it should be at least half that size, if not even smaller. We'll move on from there."

"Then we had LLVM, and you can already feel it—that's enough that it doesn't feel instant. It's not terrible, but it hurts. Then you go to the dev back end, and that's not an error; it takes like one and a half milliseconds. It's apparently really fast when you just dump bytes to the system.

=> 00:05:07

Optimizing your development tools can drastically improve performance, making your workflow smoother and faster.


That said, the dev backend is not optimized yet, and there is work to be done to make it better; despite that, it is fast, and execution speed is close enough. Given that this is a talk about linking, there is obviously a missing piece.

Looking at the linker: plain ld takes about 400 milliseconds to link this on a standard Linux machine with a 12-core Intel processor. There are faster linkers, like gold and lld, which are almost twice as fast, but that still is not great. Mold was almost tied with lld when I first measured it, but now it links this application in about half the time lld takes, which is impressive. That said, mold's parallelism really pays off on big applications; on smaller ones its performance is closer to lld's. So the iteration loop may still not be very nice, and you can still feel the linker.

Even with mold on the graph, it is something you can feel. One major issue is that while we control the Roc front end, the dev backend, and to some extent the execution speed, we do not control mold, so we would be stuck with that block. The question arises: how fast could we be? A useful baseline is simply copying the output file: with a cold cache that takes about 15-20 milliseconds; with a hot cache, around 5 milliseconds, though this can vary.

The surgical linker is faster than copying the file with a cold cache, and while it is not quite an order of magnitude faster than mold, it is pretty close. That is a significant gain. To understand why linking is slow and what it is actually doing, we need to discuss ELF (Executable and Linkable Format), which Linux uses. There is also Mach-O for macOS, but we will focus on ELF, which is well documented and straightforward.


=> 00:09:04

Understanding ELF headers and program headers is key to executing and organizing code efficiently.

ELF is what the main shipping surgical linker targets, so we'll just talk about that one; we also have something for Windows, and different pieces in progress. It's a pretty simple file format. It starts by saying, "Hi, I'm an ELF." Then, "here's how the program is laid out." Then there's a bunch of data that is your actual program and its stored data, plus some other headers that give metadata.

The ELF header gives you a lot, but it's just the high-level organization: "hey, we have 10 sections, they're over here," "we're little-endian, not big-endian," "this is a 32-bit ARM processor." Nothing too interesting, really. Then you get to the program headers. These are the really important pieces for actually executing something, and they are required in all executables and shared libraries. They basically say, "Hey, operating system, here's how you're going to load me into memory: I want this over there, I want this space filled with zeros, I want this space to be unique per thread" (that's TLS, thread-local storage). They also carry permissions: this is readable, this is writable, this is executable. I really would not like you to let someone write into my code segment, please, and that is exactly what these flags request.
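To make that concrete, here is a minimal sketch (my own illustration, not Roc code, assuming Linux and a 64-bit ELF) that walks the program headers using the structs from <elf.h>; error handling is trimmed for brevity:

```c
#include <elf.h>
#include <stdio.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    Elf64_Ehdr ehdr;                          /* "Hi, I'm an ELF" */
    fread(&ehdr, sizeof ehdr, 1, f);

    fseek(f, (long)ehdr.e_phoff, SEEK_SET);   /* program header table */
    for (int i = 0; i < ehdr.e_phnum; i++) {
        Elf64_Phdr ph;
        fread(&ph, sizeof ph, 1, f);
        /* PT_LOAD means "load me into memory"; p_flags carries the
         * readable/writable/executable requests described above. */
        printf("type=%u vaddr=0x%llx memsz=0x%llx flags=%c%c%c\n",
               ph.p_type,
               (unsigned long long)ph.p_vaddr,
               (unsigned long long)ph.p_memsz,
               (ph.p_flags & PF_R) ? 'R' : '-',
               (ph.p_flags & PF_W) ? 'W' : '-',
               (ph.p_flags & PF_X) ? 'X' : '-');
    }
    fclose(f);
    return 0;
}
```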

Then there's one special segment called dynamic. One of the really weird lies about linking is that it stops when you make an executable. There's actually a second linker that runs when you load your executable, normally called the loader, but it is literally doing the exact same thing as linking. Just an interesting note: linking doesn't even end when the executable is created.

Then you have the section headers, which are what the linker cares about. They are not needed in an executable, but they carry all of the really important information about each piece. For example: what constructors do I need to run when the application starts up? That's what the init array is; all of those have to be merged together so they can run at the beginning, before the program launches. The fini array contains your destructors. There's also a pre-init array, in case you need constructors before your constructors. Important things.
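As a small illustration of those arrays (mine, not from the talk): GCC and Clang let you register functions that end up in .init_array and .fini_array, and merging every object file's entries into one array is exactly the linker work described above:

```c
#include <stdio.h>

/* The compiler places pointers to these in .init_array / .fini_array;
 * the linker merges such entries from every object file so the C
 * runtime can call them before and after main. */
__attribute__((constructor))
static void ctor(void) { puts("constructor: runs before main"); }

__attribute__((destructor))
static void dtor(void) { puts("destructor: runs after main"); }

int main(void) {
    puts("main");
    return 0;
}
```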

Then there are some more minor things: here are all of the symbols, which can be variables, functions, and so on. Related to that, there's a table of strings, because names are dynamically sized and need to live in a separate table so we can tell where things are. Then there are relocations, which we'll get into right here. This is just the symbols and relocations from a Roc platform or application.

What we can see here: the first two entries are essentially the standard library, just a couple of standard library functions embedded in the application. Then there's an effect we're going to call that lets you print to standard out, theoretically writing a line to it. Then you can see roc_alloc and roc_realloc, which are special Roc functions. In this case, they're marked undefined. This is where the linker really comes in: at some point, someone needs to define them, right? I can't just run code with undefined functions.

The other important thing is relocations, which are basically the list of edits you need to make to finalize the application. They say things like, "Hey, in the text section right here, I need to call memcpy. I have no idea where that is, so please put the address of memcpy in for me," or "I need to call roc_realloc, put the address of roc_realloc in," or "I need to point to data that's in another section; I know it should exist, but I need you to update the pointer." Some of these get more complicated, but for our use case it's mostly these direct, simple relocations.
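At the byte level, applying one of these simple relocations is just arithmetic plus a write. A hedged sketch for the absolute 64-bit case on x86-64 (illustrative, not the surgical linker's actual code):

```c
#include <elf.h>
#include <stdint.h>
#include <string.h>

/* Apply one R_X86_64_64 relocation: write the resolved symbol address
 * (plus the addend) at the offset the relocation entry points to. */
static void apply_abs64(uint8_t *image, const Elf64_Rela *rela,
                        uint64_t symbol_addr) {
    if (ELF64_R_TYPE(rela->r_info) == R_X86_64_64) {
        uint64_t value = symbol_addr + (uint64_t)rela->r_addend;
        memcpy(image + rela->r_offset, &value, sizeof value);
    }
}
```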

This is where scale becomes important. This is "Hello, World!" in C; hopefully most of you can read that, it's not too hard. What does it end up linking to? Pass the verbose flag to the compiler and you see a lot of stuff come out that gets handed to the linker. Most of it is the CRT (the C runtime), which I should have double-checked.
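The program on the slide is presumably just the canonical one; compiling it with the verbose flag (e.g. gcc -v hello.c) prints the full linker invocation with all of those extra inputs:

```c
#include <stdio.h>

int main(void) {
    printf("Hello, World!\n");
    return 0;
}
```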

=> 00:12:39

Even a simple "Hello World" program involves processing megabytes of data and thousands of relocations, making it a surprisingly complex and time-consuming task.


That C runtime is crucial to the program's lifecycle, managing what happens before and after main. Besides the CRT, the link line includes the actual object file, libgcc (which supplies extra primitives), and the standard libc, which is typically dynamically linked.

Despite the simplicity of "Hello, World!", linking it means processing about 7 megabytes of data, including thousands of symbols and 9,000 relocations that have to be resolved. The linker tree-shakes away the unnecessary parts and then jams the remaining pieces together, and for each of the 9,000 relocations it must work out the correct pointer; some even have to become dynamic relocations applied at runtime.

Even for this basic application, the process is data-intensive and time-consuming: the linker must read and process all 7 megabytes, which is a fairly linear operation. This is why some linkers, like ld, are slow, while others, like mold, parallelize the work to speed it up.

More complex applications, such as a basic CLI built on a Roc platform, are even worse. A platform must define everything a user might need, from file I/O and directory access to environment variables and HTTP, which leads to bloated executables. The basic CLI platform is built in Rust, which is known for producing large artifacts; it comes out to about 45 megabytes with 400,000 relocations. Even with mold's parallel processing, linking it takes 80 milliseconds.


=> 00:16:23

Roc's unique stack design cuts down link time by pre-linking the platform, making the development loop faster and more efficient.

The reason it takes 80 milliseconds for even mold, with its parallel processing and all of its fancy tricks, is that the thing is ginormous. We have some ideas for cutting that down to save time, but fundamentally, anybody who builds a reasonably sized application is going to hit this problem. We're not talking about Chrome, which compiles into multiple gigabytes; we're talking about something whose final compiled artifact is just a few megabytes. It clearly adds up, and this is not even counting other shared dependencies like libc, which is definitely being called into.

The important question is: why is Roc special? Why do we get to cheat when other people don't? Roc is built on a stack, as Richard mentioned: the application, the platform, and the operating system. Yes, you could say there's also the host language and other elements, but I lump those into the platform. The critical piece is the separation between the Roc application and the platform. Roc code is 100% pure; it does nothing by itself. In the REPL, which can cheat and print things, you can watch an algorithm run, but fundamentally Roc alone just makes heat. We do force the platform to let us panic and allocate memory, so there are two built-in side effects, allocating and panicking; otherwise, we just make heat and hand something back to the platform, asking it to perform tasks.

This leads to something essential: if you look at a Roc application and all the external functions it cares about, they are the roc_ functions like roc_alloc and roc_dealloc, which are always required. Then you have the effects, which are platform-specific and determine what the application can actually do, like reading a file; that is entirely up to the platform. There is one more special category: things like memcpy and memset, which LLVM creates during optimization by replacing copy loops with calls, plus some compiler-rt functions that appear during complex math operations.
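For a feel of what "the platform provides the hooks" means in practice, here is a hedged sketch of the allocator functions a host might define in C; the signatures are illustrative from memory, not authoritative for current Roc:

```c
#include <stdlib.h>

/* Hypothetical host-side definitions of the always-required hooks. */
void *roc_alloc(size_t size, unsigned int alignment) {
    (void)alignment;               /* a real host would honor this */
    return malloc(size);
}

void *roc_realloc(void *ptr, size_t new_size, size_t old_size,
                  unsigned int alignment) {
    (void)old_size; (void)alignment;
    return realloc(ptr, new_size);
}

void roc_dealloc(void *ptr, unsigned int alignment) {
    (void)alignment;
    free(ptr);
}
```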

In our case, we go to Zig and ask it to bundle these functions. Zig provides an implementation of memcpy in assembly, and we include it directly in the Roc application. We never go to libc or compiler-rt for linking; those pieces are already inside the Roc application. The result is a slim set of dependencies: primarily just the platform.

What if we could link the platform first and add the Roc application later? Since the Roc application only depends on the platform, this should be feasible. That approach eliminates 400,000 relocations for the basic CLI, saving significant time; we also avoid scanning numerous libraries and shuffling megabytes of data around. And since the platform is known ahead of time, we can pre-compute many things. This sounds great, so why doesn't everyone do it? As long as we can define a nice API between these two components, many problems can be pre-handled, and anything already done doesn't cost time again.


=> 00:20:17

Pre-compute tasks to save time and streamline your workflow by defining a clear API between components.


A quick aside: you may have heard of dynamic linking, which is designed to do exactly what I just described. Shared libraries and an application are loaded together at run time, and the linking happens at load time. Fundamentally, you can define a simple API between the two components and build the Roc side as a shared library, and then you don't have to process the entire platform, which saves a lot of time. However, linkers like ld are still slow even when producing a library with no dependencies, because they tend to scan everything just in case. And we want a nice distributable executable that can be easily handed off, so dynamic linking is not a complete solution here. It doesn't solve all the problems, but it can help in other languages by shrinking the problem and allowing more pre-computation.

Now, let's discuss what we do in the surgical linker. It is split into two phases: preprocessing and surgery. Preprocessing is straightforward: you compile the application as if it were a dynamic library, because that is essentially what we want; compilers today don't generate exactly what we need, so we pretend we're dynamically linking. We build the platform that way and then preprocess it: all of its relocations are completed, and we scan everything ahead of time, saving all the necessary information about functions and their locations. We also record what the platform still needs patched in order to call into Roc, and we delete anything related to dynamic linking so the application no longer thinks it is dynamically linked.
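The metadata that preprocessing saves might look something like this; a hypothetical sketch of my own, since the talk doesn't show the actual format:

```c
#include <stdint.h>

/* Hypothetical shape of the preprocessing output: for each symbol on
 * the app/platform boundary, where it lives and which sites must be
 * patched during surgery. The boundary is tiny: hundreds of entries,
 * not hundreds of thousands. */
typedef struct {
    const char *name;          /* e.g. "roc_alloc" or an effect fn */
    uint64_t    vaddr;         /* resolved virtual address, if known */
    uint64_t   *patch_offsets; /* file offsets to rewrite later */
    uint32_t    n_patches;
} BoundarySymbol;

typedef struct {
    BoundarySymbol *symbols;
    uint32_t        n_symbols;
    uint64_t        app_offset; /* where the app's bytes get appended */
} SurgicalMetadata;
```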

In the surgery phase, you place the app at the end of the rest of the code and fix everything up. During preprocessing, having compiled as if dynamically linking, we shift the entire application body; this requires caution to avoid segfaults, since application code doesn't like to be moved. Doing it carefully creates space to add more program headers, which Roc will need. We collect all the dynamic information, such as function symbols and relocations, save it to metadata, and remove any indication that the binary was dynamically linked against Roc.


=> 00:23:47

Streamline your app's performance by compiling to machine code and optimizing metadata handling for a faster, smoother development experience.

From there, we compile our app to machine code using the dev backend so it's fast. We literally append the bytes to the end of the binary and mmap everything so access is fast; since we just created those bytes, everything stays hot in cache.

We then write some new program headers: one to load all of our code, another to load all of our data, and sometimes a third to load mutable data in specific cases, though Roc rarely needs that thanks to its purity. From there, we go through and fix up all the pieces. The relocations in the application are a super narrow set, because they are only the ones at the boundary line; even in a large application you might have hundreds or thousands of them, not hundreds of thousands. That makes it a tiny problem, and since so much was pre-read, we aren't scanning extra files or doing unnecessary work. We load our metadata and update the necessary spots, jumping around memory a bit, but that's a minor cost.
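The fix-up for one call site is tiny. A hedged x86-64 sketch (my illustration of the idea, not the linker's code): an E8 call instruction holds a 32-bit displacement relative to the next instruction, so patching it is one subtraction and a four-byte write into the memory-mapped binary:

```c
#include <stdint.h>
#include <string.h>

/* Rewrite the rel32 operand of an E8 call so it targets the freshly
 * appended app code. site_vaddr is the call instruction's virtual
 * address; file_off is where its opcode byte sits in the mapped file.
 * The instruction is 5 bytes, so the displacement is relative to
 * site_vaddr + 5. */
static void patch_call(uint8_t *image, uint64_t file_off,
                       uint64_t site_vaddr, uint64_t target_vaddr) {
    int32_t rel32 = (int32_t)(target_vaddr - (site_vaddr + 5));
    memcpy(image + file_off + 1, &rel32, sizeof rel32);
}
```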

Next, we update the section headers, which is optional but makes debugging nicer; section headers are only for linking, not for running the code, so they don't technically matter. At this point, we're done. We've skipped most of linking and performed a slim surgery to get what we needed. The whole thing is about 2,000 lines of code covering all the core concepts. We do rely on a library to parse object files, but writing the offsets and constants ourselves wouldn't add many more lines.

There are a few pieces missing. We don't support a certain category of relocations yet; those will make things a bit more complex, since they need to be fixed up dynamically at runtime. The Roc compiler doesn't currently generate them, but we have plans that will. Additionally, we don't support debug info (DWARF) yet, which is a whole other challenge we're not ready to tackle at the moment.

For reference, LLD is about 120,000 lines of code, and mold is around 600,000. I haven't dug in to see how those lines break down or whether they include vendored libraries, but it gives an idea of the difference in scale. The Roc front end should also get significantly faster, as it's currently doing extra work that shouldn't be needed. We're moving from a world where hitting "go" means sitting through a frustrating wait to one where hitting "go" means it's done almost instantly. It may not be quite as fast as Python yet, but with further optimization we could reach a development loop that feels like a dynamic language. More languages should strive for this: users clearly love the development loop of dynamic languages while also wanting type checking and the other benefits of static ones.

=> 00:27:24

Building software users love means focusing on the details that make development fast and pleasant.


Roc aims to compete with Python and offer that fluid experience. Zig has different goals, but it also emphasizes simplicity and fast compilation. This focus on user experience is often overlooked by big companies with infinite servers, remote builds, and complex caching systems; for the average user, though, it is crucial to building software that is a pleasure to use.

The fundamental message here is to cheat more; that is, pay attention to details. The solution might be simple, but it requires a deep understanding of the system to take advantage of it. This isn't really cheating; it's leveraging constraints effectively. A significant problem today is the tendency to build ever-higher abstractions without understanding the underlying details. Without that knowledge, it's hard to create something truly remarkable.

In the Q&A session, a question was raised about targeting a shared object and emitting dynamic relocations. The process involves building everything as Position-Independent Code (PIC) or Position-Independent Executable (PIE), skipping Thread-Local Storage (TLS). This approach can lead to emitting a general dynamic model, which is not efficient at runtime. The response highlighted that in Rock's case, many things are skipped, and they patch up calls directly. While they currently scan the assembly to find and update calls, this might change. The call cost isn't a significant issue for Rock because it deals with big IO tasks in an async model, unlike a normal CFFI scenario where calls happen frequently.

=> 00:31:23

Roc's async model avoids the overhead of frequent calls, making it efficient for big IO tasks.

Response: In our case, a lot of that doesn't exist because of how Roc works, so we do skip a lot of things. Fundamentally, we end up patching a lot of things into direct calls. Currently, we actually scan the assembly, find the calls, and update them; we may or may not keep that. For us, though, the call cost generally isn't that big a deal, because these are big IO tasks that are awaited in an async model. I don't think it becomes a trade-off we have to weigh nearly as much as if you were writing something with normal C FFI that gets called back and forth all the time; that's a very different model from the async state machine Roc is always building on. That said, I'd have to think more about your question before giving a good answer, because the details are hard to picture in my head on the spot.

Question: When you link, you get rid of all the relocations for DWARF. How are you going to handle that? Parsing DWARF is really slow, since everything is LEB-encoded, and you haven't included that yet. I wonder how much time you're actually going to spend on this and whether it will be worth it.

Response: Yeah, I know DWARF is going to be a big pain, as are certain dynamic relocations we'll have to emit: the section won't have enough space, so we'll have to shift the whole binary around at final compilation time. I know these will be big pain points, but even if they double the time the surgical linker takes, that time right now is really just the time to write a file to disk; there is barely any computation in the surgical linker, it's mostly the mmap flushing to disk. So I'm not super worried, although I agree those pieces will require much more complicated surgery and take significantly more time. We're still cutting out 400,000 relocations, so it makes sense.

Bonus comment: There is a light at the end of the tunnel. If you write a parser for DWARF, you can reuse it for Mach-O, because Mach-O doesn't have relocations for this; if you want the equivalent of dsymutil, you have to write it yourself.

Response: Interesting.

Moderator: Yeah, I think you two should have lunch together. I mean, everybody else in the room is thinking the same.

Question: Does it fundamentally not allow any kind of FFI with C?

Response: Roc makes the platform decide everything, so there is no arbitrary FFI. That is very intentional in Roc's design: we want everything to go through this effect model. You can't do arbitrary FFI or load weird extra modules around the effect system the platform controls; it's not extensible in a way that lets that happen. The platform has to decide how FFI works, which could even mean the platform allows calling into libffi and loading things dynamically to do that type of stuff. But it is a restriction Roc keeps to maintain its purity and avoid libraries that can do horrendous things. So: no dlopen, none of the dl* functions.

Moderator: Thanks. Any more questions? Raise your hands if so. I see one. Let's make it the last one.

Question: Have you thought about how you would do this for a system that has a fundamentally different model?

Response: I do think this can be done for more programming languages, but it's really about knowing where the API boundary is that you can cut things off at, and then tailoring to that. In our case, we have a super clear API boundary. There's nothing stopping you in Zig from structuring things so that the core business-logic part of your application depends on only a small set of things; you could always surgically link that in to make it fast. It's about a slim API that doesn't allow just anything. You could even still call into libc or other things of that nature, but you need to define those up front so you can avoid the extra cost of loading and scanning libc and all the other pieces. So I do think this can be expanded to a lot more languages, but it becomes a trickier, more project-dependent thing for the tool.

Moderator: Awesome, thank you. If you have any more questions, find Brendan afterwards and talk to him; you will have to wait for Jacob to be done first, I think. Now we're going into the lunch break, so there will definitely be time. Please give it up for Brendan. Thank you.