The release of Apple Silicon in the form of the M1 chip has definitely stirred things up a lot! The M1 boasts some scary performance gains over even Intel's most powerful i9 chips, while at the same time using much less energy. But while the hardware is something to look up to, the switch to the ARM architecture and instruction set unravels more and more problems in existing software every day.

In this short post, I'll describe the root cause of the thread-local storage (TLS) miscompilation (or should I say mislinking, since this is a linker problem) by the LLVM toolchain on macOS. For context, have a look at this outstanding issue in the Zig compiler: ziglang/zig#7527. I should point out here that Zig's stage1 (i.e., not-self-hosted) compiler uses LLVM for lowering the intermediate representation into the actual ISA. The self-hosted compiler in Debug mode will by default use our own, in-house incremental linker, whereas in Release mode the plan is to use LLVM to leverage the years of really clever micro-optimisations that went into it. (I promise I will write an update about the progress on the incremental Mach-O linker soon-ish, and I will make sure to include all the juicy details that I learnt/discovered along the way. In the meantime, please join me at my FOSDEM21 session, where I give an overview of the current state of the art.)
DISCLAIMER: Everything written here is the result of me reverse engineering different Mach-O binaries, which I crafted using different tools in order to pinpoint differences between programs that work and those that don't. For this reason, I don't expect everything I describe here to be technically accurate. Unfortunately, the docs on the modern Mach-O format are very scarce, so naturally, when writing the incremental linker in Zig, I've learnt most of the tricks and built up my understanding of the entire file format by reverse engineering different examples.

TL;DR: take everything you read here with a hefty pinch of salt!
Take this simple Zig source as an example:
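Something along these lines will do: a minimal program with a single zero-initialised thread-local variable `x` (illustrative, not necessarily the exact snippet from the post):

```zig
// A single thread-local variable; touching it from main is enough to pull
// in the TLS machinery described below. Zero-initialised, so it ends up in
// the __thread_bss section discussed later.
threadlocal var x: usize = 0;

pub fn main() void {
    x += 1;
}
```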
In this blog post, we will investigate how this source maps to the final Mach-O binary, and how the TLS variable `x` is managed. I should point out here that, currently, this example will lead to a segfault on Apple Silicon, for the reasons we'll discuss below.
Before we dig deeper into the root cause of the TLS miscompilation by the LLD, we should first get the basics out of the way. If there is TLS within a Mach-O binary, the `flags` field of the Mach header must contain the following flag:
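In `<mach-o/loader.h>` this flag is `MH_HAS_TLV_DESCRIPTORS`; redeclared here as a Zig constant, together with a hypothetical helper, just for illustration:

```zig
// MH_HAS_TLV_DESCRIPTORS from <mach-o/loader.h>: the binary defines
// thread-local variables, so dyld has to run its TLV setup at load time.
const MH_HAS_TLV_DESCRIPTORS: u32 = 0x800000;

fn binaryHasTls(header_flags: u32) bool {
    return (header_flags & MH_HAS_TLV_DESCRIPTORS) != 0;
}
```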
Next, the `__DATA` segment (or more generally, the read-write segment) will contain two additional sections: `__thread_vars` and `__thread_bss`. If you look closely, the former is the threaded equivalent of the `__data` section, while the latter is the threaded equivalent of the `__bss` section.
### `__thread_vars` section

This section is identified by the flag:
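That is `S_THREAD_LOCAL_VARIABLES` from `<mach-o/loader.h>`; redeclared here as a Zig constant for reference:

```zig
// Section type flag from <mach-o/loader.h>: the section contains
// thread-local variable descriptors, one per TLS variable.
const S_THREAD_LOCAL_VARIABLES: u32 = 0x13;
```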
The section itself (or rather its actual contents) should be 8-byte aligned, and is usually padded with zeros. Here, when the binary is loaded by the dynamic loader `dyld`, the `dyld_stub_binder` will write the address of the `tlv_get_address` symbol, which the compiled machine code responsible for initialising the thread-local variable `x` will branch to, but more on this later.
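For context, each entry (or cell) in `__thread_vars` is a TLV descriptor. Here is a sketch of its layout, assuming the three-word structure used by dyld's TLV runtime and expressed as a Zig `extern struct`:

```zig
// One __thread_vars entry, assuming the descriptor layout used by dyld's
// TLV support code (three pointer-sized words per TLS variable).
const TlvDescriptor = extern struct {
    thunk: usize, // bound at load time to the TLV getter (tlv_get_address in this post)
    key: usize, // key identifying this image's per-thread storage
    offset: usize, // offset of the variable within that storage
};
```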
For the `dyld_stub_binder` to know that it needs to populate the address of the `tlv_get_address` function in the space provided within the `__thread_vars` section, we use the Dynamic Loader Info (the `LC_DYLD_INFO_ONLY` load command), and in particular its Binding Info section. Within the Binding Info, we guide the `dyld_stub_binder` to the right segment and to an offset within that segment, which should equal a cell within the `__thread_vars` section. I am not going to dig into the actual opcodes used within the Binding Info to drive `dyld` in this blog post. Instead, I'd like to point everyone to a very good resource by Jonathan Levin here.
### `__thread_bss` section

This section is also 8-byte aligned and contains the zerofill for thread-local storage, much like the `__bss` section does for static global variables. The section is identified by the flag:
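That is `S_THREAD_LOCAL_ZEROFILL` from `<mach-o/loader.h>`, again redeclared as a Zig constant for reference:

```zig
// Section type flag from <mach-o/loader.h>: zerofill section that backs
// the thread-local variables themselves; it occupies no bytes in the file.
const S_THREAD_LOCAL_ZEROFILL: u32 = 0x12;
```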
Also, I should point out that this section doesn't have a data representation within the actual Mach-O file, i.e., its file offset points at the beginning of the binary. It is, however, required to point at an unoccupied region of virtual memory.
This is actually more straightforward than you may think. Within the symbol that makes use of the TLS, we call `tlv_get_address` to initialise the variable. How do we do that? Remember that when the binary was loaded, `dyld_stub_binder` was run, fetched the address of `tlv_get_address`, and saved it within the `__thread_vars` section. Therefore, all we need to do is fetch that address and branch to the actual symbol via it.
For the sake of example, assume that we are currently at address `0x100003CE0`, that the `__thread_vars` section is at a virtual address of `0x100049180`, and that the `dyld_stub_binder` will store the address of the `tlv_get_address` symbol in the first cell of the section, i.e., at the section's start address `0x100049180`. Then, all we need to do is: first, locate the memory page where `__thread_vars` resides; next, narrow down to the actual cell; and finally, load the address stored in that cell into the register we will branch through. How do we do this with the ARM64 ISA?
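Along the lines of the following four instructions (a reconstruction that matches the walkthrough below; the exact registers used in the original listing may differ):

```
0x100003ce0: adrp x0, 70          ; x0 = page containing __thread_vars, i.e. 0x100049000
0x100003ce4: add  x0, x0, #384    ; x0 = 0x100049000 + 0x180 = 0x100049180, the first cell
0x100003ce8: ldr  x8, [x0]        ; x8 = value stored in that cell = address of tlv_get_address
0x100003cec: blr  x8              ; branch to tlv_get_address
```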
Just to recap: if the address of the first instruction above is `0x100003CE0`, then `adrp` will compute `0x100003CE0 + 70 * 0x1000`, truncated to the nearest 4 KiB page, giving `0x100049000`. Next, we add `384` (`0x180` in hex) to the result, so `0x100049000 + 0x180 = 0x100049180`, which is, surprise surprise, the start address of the `__thread_vars` section and the address of the first cell of that section. Perfect! Then we simply dereference the value stored at `0x100049180`, which by this time will have been populated by the `dyld_stub_binder` and will hold the actual address of the `tlv_get_address` symbol, and store it in the `x8` register. Finally, we branch to it.
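To double-check the arithmetic, here is a tiny Zig `comptime` block using the values from the walkthrough above:

```zig
comptime {
    const pc: u64 = 0x100003CE0; // address of the adrp instruction
    const page = pc & ~@as(u64, 0xFFF); // adrp works in 4 KiB pages: 0x100003000
    const cell = page + 70 * 0x1000 + 0x180; // 70 pages forward, then add #384
    if (cell != 0x100049180) @compileError("unexpected __thread_vars cell address");
}
```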
Unfortunately, the output generated by the LLD uses an `ldr` instead of an `add` instruction, leading to a segfault. Why? First off, this is the output generated by the LLD:
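A reconstruction of the problematic sequence (the registers and the exact form of the listing are my assumption; the key point is the second instruction):

```
0x100003ce0: adrp x0, 70          ; x0 = 0x100049000
0x100003ce4: ldr  x0, [x0, #384]  ; BUG: loads the value stored at 0x100049180 instead of computing that address
0x100003ce8: ldr  x8, [x0]        ; one dereference too many
0x100003cec: blr  x8              ; branches into garbage
```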
In the above snippet, the LLD emits `ldr` where an `add` should be. The `add` is only meant to compute the address of the cell, `0x100049000 + 384 = 0x100049180`, so that the following instruction can load the `tlv_get_address` pointer stored there. By emitting a load instead, we end up with one dereference too many: by the time we reach the final branch, the register no longer holds the address of `tlv_get_address` but whatever the extra dereference produced, and branching there will most likely lead to a segfault or undefined behaviour.
I have to admit, I'm using my M1 MBA to drive the development of the incremental linker, and for some time now I've been struggling without any stack traces whenever I hit an assert or panicked somewhere in the codebase. I didn't investigate the exact cause until very recently, and it turns out the miscompilation of TLS by the LLD is to blame. This is because, in Zig, every thread keeps track of its current panic stage. We achieve this via TLS:
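Roughly along these lines (simplified; the exact declaration in `std/debug.zig` may differ between Zig versions):

```zig
// Each thread tracks how deep into a panic it currently is, so that a
// panic triggered while already panicking can be detected and handled.
threadlocal var panic_stage: usize = 0;
```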
Therefore, any panic would have to traverse the corrupted codepath that we examined in the snippet above leading to a segfault without printing any stack trace. Let me bring up an example. Consider this simple Zig snippet:
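Any program that panics will do; an illustrative stand-in (not necessarily the exact snippet from the post):

```zig
pub fn main() void {
    @panic("oh no!");
}
```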
Compiling and running the resulting binary will unfortunately currently result in:
As an experiment, I decided to hack a manual fix for this by rewriting the offending instruction from `ldr` to `add`, and the result is now as expected:
Clearly, a post-mortem fixup of the binary is not an option, since this would require disassembling the entire program in search of any mention of TLS. An intermediate solution for the time being will be for us not to use TLS on Apple Silicon until the problem is fixed in the LLD itself, or until our in-house linker performs well enough to serve as a drop-in replacement.