Tracing Kernel Functions: How the illumos AMD64 FBT Provider Intercepts Function Calls

For my money there is perhaps nothing more valuable than in situ kernel debugging. The ability to ask questions of any part of the kernel, on a live production system, without stopping or pausing anything, is quite frankly, amazing. One such avenue to in situ kernel debugging is the DTrace FBT provider.
The FBT provider, which stands for Function Boundary Tracing, allows one to trace all kernel function entry and return sites, their arguments and return values, and even grab the stack that led to the call. There is no overhead when probes are disabled, and minimal overhead on the specific entry or return site when enabled. DTrace is made up of several parts: the command, the user library, the kernel framework and VM, and the provider interface. Its authors made the smart decision of separating the general infrastructure of DTrace from the specifics of probe creation and firing: providers provide and fire the probe points, and the framework executes their actions. This allows providers to be developed separately from DTrace itself. If you want to learn more about DTrace I recommend reading Dynamic Instrumentation of Production Systems, presented at USENIX 2004.
For this post I want to focus on how FBT intercepts function calls. That is, when a user enables an FBT probe, how exactly does the probe end up firing when a kernel thread hits that entry or return site? For example, the following one-liner instruments all calls to the mac_ring_tx() function, responsible for delivering a network packet to the underlying NIC. It counts the number of packets traveling across each interface, along with the distribution of the packet sizes sent on each. As you can see, a simple function entry probe can be quite powerful, and I’m only touching the surface of what’s possible. But how exactly did the FBT provider accomplish this?
rpz@thunderhead:~$ pfexec dtrace -qn 'mac_ring_tx:entry { this->mip = (mac_impl_t *)arg0; @[this->mip->mi_name] = count(); @dist[this->mip->mi_name] = quantize(msgsize(args[2])); } END { printf("TOTAL PACKETS\n"); printa(@); printf("\nDISTRIBUTION\n"); printa(@dist); }'
^C
TOTAL PACKETS
igb1 43
ixgbe6 31046
ixgbe3 60370
ixgbe5 68938
aggr1014 99793
ixgbe2 137064
aggr1013 197244
DISTRIBUTION
igb1
value ------------- Distribution ------------- count
32 | 0
64 |@@@@@@@@@@@@@@@ 16
128 |@@@@@@@@@@@@@@@@@@@@@@ 24
256 |@@@ 3
512 | 0
ixgbe6
value ------------- Distribution ------------- count
16 | 0
32 | 10
64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 31036
128 | 0
ixgbe5
value ------------- Distribution ------------- count
16 | 0
32 | 134
64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 68804
128 | 0
aggr1014
value ------------- Distribution ------------- count
16 | 0
32 | 144
64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 99649
128 | 0
ixgbe3
value ------------- Distribution ------------- count
32 | 0
64 | 127
128 | 0
256 | 0
512 | 0
1024 | 170
2048 | 139
4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 58801
8192 |@ 1052
16384 | 12
32768 | 69
65536 | 0
ixgbe2
value ------------- Distribution ------------- count
16 | 0
32 | 128
64 | 160
128 | 0
256 | 0
512 | 1
1024 | 79
2048 | 107
4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 136106
8192 | 478
16384 | 0
32768 | 5
65536 | 0
aggr1013
value ------------- Distribution ------------- count
16 | 0
32 | 128
64 | 97
128 | 0
256 | 0
512 | 1
1024 | 249
2048 | 246
4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 194907
8192 | 1530
16384 | 12
32768 | 74
65536 | 0
packet distribution per interface
Instrumenting Function Calls and Returns
Before I can describe how a site is intercepted I need to give a primer on how the kernel text is instrumented. Upon first use DTrace loads the FBT provider, iterates all loaded kernel modules, and passes each one to fbt_provide_module(). This function uses the ELF section header data to iterate all functions, read their program text, and find the location of all entry and return sites. The FBT provider creates an FBT probe (fbt_probe_t) for each site, responsible for holding the information needed to instrument that site. This design produces zero overhead when no probes are in use, as no program text is modified during this stage.
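To make that concrete, here is a rough sketch of the kind of check fbt_provide_module() performs on each function's first byte. The constants mirror what I recall from fbt.c, but treat the whole thing as illustrative rather than the actual source.
#include <stdint.h>

#define FBT_PUSHL_EBP   0x55    /* opcode of "push %rbp" */
#define FBT_RET         0xc3    /* opcode of "ret" */

/*
 * Illustrative: an AMD64 function is an FBT entry-site candidate when its
 * first instruction is "push %rbp"; return sites are the "ret" bytes found
 * later in the function's text.
 */
static int
looks_like_entry_site(const uint8_t *text)
{
    return (text[0] == FBT_PUSHL_EBP);
}
entry-site check (illustrative sketch, not the illumos source)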
It’s not until the user executes a D script containing one or more FBT probes that the provider instruments the kernel module’s text. The DTrace framework calls fbt_enable(), passing in the FBT probe structure created earlier, which contains the information needed to alter the correct program text. If this is an entry probe, FBT replaces the push %rbp instruction; if it is a return probe, FBT replaces the ret instruction. In both cases the instruction is replaced with INT3, the breakpoint instruction (the resulting exception is #BP in Intel parlance). This design produces the minimal overhead needed to intercept the probe site: it modifies only the site requested, leaving all other text untouched.
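Conceptually, enabling the probe boils down to writing a single byte. A minimal sketch follows, assuming a pared-down probe record with a fbtp_patchpoint pointer, a saved original byte, and a patch byte; the real fbt_enable() has more bookkeeping around module state.
#include <stdint.h>

#define FBT_PATCHVAL    0xcc    /* INT3, the breakpoint instruction */

/* Hypothetical, pared-down probe record. */
typedef struct fbt_probe_sketch {
    uint8_t *fbtp_patchpoint;   /* address of the push %rbp or ret byte */
    uint8_t  fbtp_savedval;     /* original byte, kept for disable */
    uint8_t  fbtp_patchval;     /* byte written when enabled (0xcc) */
} fbt_probe_sketch_t;

/* Enabling a probe: overwrite the site with INT3. */
static void
sketch_enable(fbt_probe_sketch_t *fbt)
{
    *fbt->fbtp_patchpoint = fbt->fbtp_patchval;
}

/* Disabling a probe: restore the original instruction byte. */
static void
sketch_disable(fbt_probe_sketch_t *fbt)
{
    *fbt->fbtp_patchpoint = fbt->fbtp_savedval;
}
enabling and disabling a probe site (illustrative sketch)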
A breakpoint is what’s known as a software exception — a processor exception produced at the request of software, as opposed to an exception generated by a genuine program error, like divide by zero. All exceptions are really just interrupts, with the small difference being that an interrupt is typically generated external to the processor, whereas an exception is generated by the processor as a consequence of the code it is executing. Furthermore, when reading code, you’ll also see the breakpoint referred to as a trap. This is yet a further classification of the exception, and it dictates what actions the processor will take upon servicing it. All three terms are used interchangeably throughout the code, but the important thing to remember is that it’s an interrupt: it interrupts the normal flow of the program (in this case, the kernel). As with all other interrupts, the operating system must provide the processor with some action to take upon receiving a breakpoint. This action is defined in the Interrupt Descriptor Table (IDT), seen below.
usr/src/uts/intel/ia32/os/desctbls.c
set_gatesegd(&idt[T_BPTFLT],
    (kpti_enable == 1) ? &tr_brktrap : &brktrap,
    KCS_SEL, SDT_SYSIGT, TRP_UPL, idt_vector_to_ist(T_BPTFLT));
breakpoint IDT entry
Upon receiving a breakpoint the processor will call into the handler tr_brktrap. The tr_ prefix stands for trampoline, a mitigation used as part of KPTI. This trampoline provides Meltdown protection and ultimately leads to brktrap, which I’ll come back to later. The SDT_SYSIGT argument tells us this is an “interrupt gate”, meaning the processor will mask all maskable hardware interrupts (by clearing RFLAGS.IF) before invoking the handler — preventing another interrupt from interrupting the breakpoint. The TRP_UPL argument indicates the interrupt may originate from user or kernel space; in the case of the FBT provider it will always originate from kernel space. And finally, the last argument is an index into the Interrupt Stack Table (IST): a collection of dedicated stacks used during interrupt handling to avoid scribbling on the interrupted thread’s stack. In this case the T_BPTFLT handler maps to a dedicated stack.
Before the Handler
So I’m running a D script that contains an action for one or more FBT probes, and a kernel thread hits one of these probe sites. What’s next? Based on the IDT entry it should call the brktrap handler, but that’s not quite what happens. First, the processor does some work on the operating system’s behalf before calling the interrupt handler. It must switch to the stack specified in the IDT entry and push the following values onto it, in this order (see Intel Vol. 3, §6.14.4):
- The interrupted thread’s stack selector (SS).
- The interrupted thread’s stack pointer (RSP).
- The interrupted thread’s RFlags.
- The interrupted thread’s Code Segment selector (CS).
- The interrupted thread’s Instruction Pointer (RIP).
- The error code of the exception; since INT3 has no error code, this doesn’t apply.
This ends with the processor in interrupt context, with the interrupted thread’s primary CPU context loaded on the new stack, ready to execute the operating system’s brktrap handler — as depicted in the figure below.

INT3 thread/interrupt state pre-handler
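If it helps to see that frame as a data structure, here is a rough C view of what the processor leaves on the new stack. The struct and field names are mine; the kernel's own view of this layout is the tail of the regs structure shown later.
#include <stdint.h>

/*
 * Hypothetical view of the frame the CPU pushes when taking #BP through
 * an interrupt gate with an IST stack. The first field sits at the lowest
 * address: %rsp points at int_rip on entry to the handler, and no error
 * code is pushed for #BP.
 */
struct int3_hw_frame {
    uint64_t int_rip;       /* interrupted thread's RIP */
    uint64_t int_cs;        /* interrupted thread's CS */
    uint64_t int_rflags;    /* interrupted thread's RFLAGS */
    uint64_t int_rsp;       /* interrupted thread's RSP */
    uint64_t int_ss;        /* interrupted thread's SS */
};
C view of the hardware-pushed #BP frame (illustrative)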
The invoptrap Handler
The brktrap handler is actually a front for the real workhorse: the invoptrap handler. Different providers use different methods of trapping into the DTrace framework. Some use INT3 (#BP), others deliberately misuse the LOCK prefix (0xF0) in order to force an invalid opcode fault (#UD). I’m not sure why both methods exist or when a provider should prefer one over the other, perhaps fodder for a future post. But in the case of FBT, the first stop is brktrap.
usr/src/uts/intel/ia32/ml/exception.s
/*
* #BP
*/
ENTRY_NP(brktrap)
XPV_TRAP_POP
cmpw $KCS_SEL, 8(%rsp)
jne bp_user
/*
* This is a breakpoint in the kernel -- it is very likely that this
* is DTrace-induced. To unify DTrace handling, we spoof this as an
* invalid opcode (#UD) fault. Note that #BP is a trap, not a fault --
* we must decrement the trapping %rip to make it appear as a fault.
* We then push a non-zero error code to indicate that this is coming
* from #BP.
*/
decq (%rsp)
push $1 /* error code -- non-zero for #BP */
jmp ud_kernel
breakpoint handler
Lines 296-297 determine where the #BP originated: user-land or kernel-land. If the former, we jump to bp_user; otherwise we fall through. For those who aren’t familiar with assembly, I’ll break down line 296 a bit more. At the point of this instruction the handler’s RSP points to the 64-bit word containing the RIP of the interrupted thread. The 8(%rsp) operand is AT&T syntax for indirect memory addressing: it tells the CPU to read the value at %rsp + 8 (basically the parens are equivalent to dereferencing a pointer in C). Since the stack grows downward, and every value on this stack is 64 bits wide, adding 8 bytes references the value “above” RSP on the stack: the interrupted thread’s CS. Therefore, the cmpw instruction compares the 16-bit immediate $KCS_SEL against the low 16 bits of the interrupted thread’s saved CS value. In this case we know the #BP originated from a kernel thread, so execution falls through.
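In C terms, the check on line 296 amounts to something like the following. This is a loose analogy that treats the handler's stack pointer as a plain pointer; the KCS_SEL value here is illustrative, the real selector comes from the kernel's segment definitions.
#include <stdint.h>

#define KCS_SEL 0x30    /* illustrative value for the kernel code segment selector */

/*
 * Rough C equivalent of:  cmpw $KCS_SEL, 8(%rsp); jne bp_user
 * rsp points at the saved RIP; 8 bytes above it sits the saved CS.
 */
static int
bp_from_kernel(const uint64_t *rsp)
{
    uint16_t saved_cs = (uint16_t)rsp[1];   /* the word at %rsp + 8 */
    return (saved_cs == KCS_SEL);
}
the kernel-origin check in pseudo-C (illustrative)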
Lines 307-309 are well described by the comment above them. The bulk of the handling is done under the #UD handler, but in FBT’s case we are coming in via #BP. Rather than replicate the handler logic, we opt to make the #BP look like a #UD, then jump to the ud_kernel label. This is achieved by rolling the saved RIP back by one byte (the length of the INT3 instruction) so that it points at the trapping instruction, as it would for a #UD fault, and then pushing an error code onto the stack. The error code, while meaningless in this case, is needed to keep the stack layout consistent with other exceptions. All exceptions share common routines which expect the stack to be laid out in a particular way.
usr/src/uts/intel/ia32/ml/exception.s
ENTRY_NP(invoptrap)
XPV_TRAP_POP
cmpw $KCS_SEL, 8(%rsp)
jne ud_user
#if defined(__xpv)
movb $0, 12(%rsp) /* clear saved upcall_mask from %cs */
#endif
push $0 /* error code -- zero for #UD */
ud_kernel:
push $0xdddd /* a dummy trap number */
#UD handler
On line 345 we find the ud_kernel label. It sits a few instructions into the invoptrap handler, the main DTrace entry point. Like the brktrap handler, invoptrap looks at the interrupt origin to determine what to do. Since we originated from #BP we skip all that and head straight to line 346, where we push a dummy trap number. Like the dummy error code pushed in brktrap, this is done to make sure the stack has a consistent layout across all exceptions, so that macros like INTR_PUSH can function correctly. However, why we push a dummy value instead of the actual trap number, like all the other traps do, is a mystery to me.
usr/src/uts/intel/ia32/ml/exception.s
INTR_PUSH
#UD handler
As I mentioned, all exceptions use the same stack layout, but what exactly is that layout? The layout is defined by the regs structure, shown below. This structure includes all values pushed by the processor’s exception mechanism as well as the trap number, all general-purpose registers, the segment registers, and two special values named r_savfp and r_savpc. These last two fields contain copies of the interrupted thread’s frame pointer and program counter, respectively. The processor’s exception mechanism populates lines 101-105 of the structure on the stack, lines 95 & 100 are populated by the handler, and INTR_PUSH is responsible for populating the rest.
usr/src/uts/intel/amd64/sys/privregs.h
struct regs {
    /*
     * Extra frame for mdb to follow through high level interrupts and
     * system traps. Set them to 0 to terminate stacktrace.
     */
    greg_t r_savfp;     /* a copy of %rbp */
    greg_t r_savpc;     /* a copy of %rip */
    greg_t r_rdi;       /* 1st arg to function */
    greg_t r_rsi;       /* 2nd arg to function */
    greg_t r_rdx;       /* 3rd arg to function, 2nd return register */
    greg_t r_rcx;       /* 4th arg to function */
    greg_t r_r8;        /* 5th arg to function */
    greg_t r_r9;        /* 6th arg to function */
    greg_t r_rax;       /* 1st return register, # SSE registers */
    greg_t r_rbx;       /* callee-saved, optional base pointer */
    greg_t r_rbp;       /* callee-saved, optional frame pointer */
    greg_t r_r10;       /* temporary register, static chain pointer */
    greg_t r_r11;       /* temporary register */
    greg_t r_r12;       /* callee-saved */
    greg_t r_r13;       /* callee-saved */
    greg_t r_r14;       /* callee-saved */
    greg_t r_r15;       /* callee-saved */
    /*
     * fsbase and gsbase are sampled on every exception in DEBUG kernels
     * only. They remain in the non-DEBUG kernel to avoid any flag days.
     */
    greg_t __r_fsbase;  /* no longer used in non-DEBUG builds */
    greg_t __r_gsbase;  /* no longer used in non-DEBUG builds */
    greg_t r_ds;
    greg_t r_es;
    greg_t r_fs;        /* %fs is *never* used by the kernel */
    greg_t r_gs;
    greg_t r_trapno;
    /*
     * (the rest of these are defined by the hardware)
     */
    greg_t r_err;
    greg_t r_rip;
    greg_t r_cs;
    greg_t r_rfl;
    greg_t r_rsp;
    greg_t r_ss;
};
AMD64 regs structure shared across exception handlers
After the INTR_PUSH we have an entire regs structure populated on the stack, with RSP pointing to r_savfp.
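Put differently, at this point the handler could conceptually view its own stack as a struct regs. A small illustration (not code from the kernel; the include path is approximate):
#include <stdint.h>
#include <sys/privregs.h>   /* struct regs, as shown above (path approximate) */

/*
 * Illustrative only: after INTR_PUSH the handler's %rsp is the address of a
 * fully populated struct regs, with r_savfp as its first field.
 */
static uint64_t
interrupted_rip(uintptr_t handler_rsp)
{
    struct regs *rp = (struct regs *)handler_rsp;
    return (rp->r_rip);     /* RIP of the interrupted thread */
}
the handler's stack viewed as a regs structure (illustrative)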
usr/src/uts/intel/ia32/ml/exception.s
movq REGOFF_RIP(%rsp), %rdi
movq REGOFF_RSP(%rsp), %rsi
movq REGOFF_RAX(%rsp), %rdx
pushq (%rsi)
movq %rsp, %rsi
subq $8, %rsp
call dtrace_invop
ALTENTRY(dtrace_invop_callsite)
Lines 348-350 ready the first and third arguments for the dtrace_invop() call: the address of the instruction being intercepted, and the RAX value of the interrupted function (used for return probes). Line 349 looks like it’s placing the interrupted thread’s stack pointer in the second argument position, but really it’s staging the pointer in RSI for lines 351-352. On line 351 we dereference the RSP of the interrupted thread and place that value on the top of the handler’s stack. Which leads to the question: what was the last thing pushed on the stack of the interrupted thread? This depends on the probe. For entry probes we instrument the pushq %rbp instruction, thus the last thing on the stack is the return site of the function that called the interrupted function. If foo() called bar(), and bar’s entry point is instrumented by DTrace, then (%rsi) would be foo()+0xXXXX. And it turns out this is true for return probes as well, since illumos uses leave plus ret to exit a function and FBT instruments the ret instruction. Therefore, upon entry into the invop handler, the last thing on the interrupted thread’s stack is always the return instruction pointer. Moving on to line 352, we replace the interrupted thread’s stack pointer with our handler’s stack pointer, so that dtrace_invop() will have a pointer to the handler’s stack as its second argument.
Finally, before calling dtrace_invop(), we subtract 8 bytes from the stack to keep it at a 16-byte alignment (see Intel Vol. 3, §6.14.2). The generic dtrace_invop() function checks this instrumentation point against the SDT and FBT providers. In this case we are dealing with an FBT probe and will end up calling fbt_invop(). But before moving on, it’s also worth noting line 355, after the call to dtrace_invop(), where we create the symbol dtrace_invop_callsite. This becomes the return address upon entering dtrace_invop(), and its use becomes important when accessing function arguments, which I will cover in a follow-up post.
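For reference, dtrace_invop() takes exactly the three arguments staged above. Here is a minimal sketch of the dispatch it performs; the handler-chain type and field names are invented for illustration, not taken from the illumos source.
#include <sys/types.h>

/* Hypothetical handler-chain node; the real names in illumos may differ. */
typedef struct invop_hdlr {
    int (*ih_func)(uintptr_t, uintptr_t *, uintptr_t);
    struct invop_hdlr *ih_next;
} invop_hdlr_t;

static invop_hdlr_t *invop_hdlrs;   /* SDT, FBT, ... register themselves here */

/*
 * Sketch: ask each registered provider whether it owns this patch point.
 * A non-zero return tells the trap handler how to emulate the overwritten
 * instruction; zero means "not ours, treat this as a real #UD".
 */
int
dtrace_invop(uintptr_t addr, uintptr_t *stack, uintptr_t rax)
{
    invop_hdlr_t *hdlr;
    int rval;

    for (hdlr = invop_hdlrs; hdlr != NULL; hdlr = hdlr->ih_next) {
        if ((rval = hdlr->ih_func(addr, stack, rax)) != 0)
            return (rval);
    }
    return (0);
}
provider dispatch in dtrace_invop() (illustrative sketch)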
usr/src/uts/intel/dtrace/fbt.c
static int
fbt_invop(uintptr_t addr, uintptr_t *stack, uintptr_t rval)
{
    uintptr_t stack0, stack1, stack2, stack3, stack4;
    fbt_probe_t *fbt = fbt_probetab[FBT_ADDR2NDX(addr)];
    for (; fbt != NULL; fbt = fbt->fbtp_hashnext) {
FBT invop handler
The FBT probes are stored in a chained hash table named fbt_probetab. Line 85 hashes the instrumented address to determine the bucket the probe is in, and then line 87 loops through all entries in that bucket to find the one matching this address.
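The hash itself is simple address arithmetic. As a sketch of what FBT_ADDR2NDX() might look like — the shift amount and table size here are illustrative, not necessarily the values illumos uses:
#include <stdint.h>

#define FBT_PROBETAB_SIZE   0x8000      /* illustrative power-of-two table size */
#define FBT_ADDR2NDX(addr)  ((((uintptr_t)(addr)) >> 4) & (FBT_PROBETAB_SIZE - 1))

/* Probes whose patch points hash to the same index are chained together. */
address-to-bucket hash (illustrative sketch)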
usr/src/uts/intel/dtrace/fbt.c
        if ((uintptr_t)fbt->fbtp_patchpoint == addr) {
            if (fbt->fbtp_roffset == 0) {
                int i = 0;
                /*
                 * When accessing the arguments on the stack,
                 * we must protect against accessing beyond
                 * the stack. We can safely set NOFAULT here
                 * -- we know that interrupts are already
                 * disabled.
                 */
                DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT);
                CPU->cpu_dtrace_caller = stack[i++];
#ifdef __amd64
                /*
                 * On amd64, stack[0] contains the dereferenced
                 * stack pointer, stack[1] contains savfp,
                 * stack[2] contains savpc. We want to step
                 * over these entries.
                 */
                i += 2;
#endif
                stack0 = stack[i++];
                stack1 = stack[i++];
                stack2 = stack[i++];
                stack3 = stack[i++];
                stack4 = stack[i++];
                DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT |
                    CPU_DTRACE_BADADDR);
                dtrace_probe(fbt->fbtp_id, stack0, stack1,
                    stack2, stack3, stack4);
                CPU->cpu_dtrace_caller = 0;
FBT entry probe
In this case we found an entry probe matching the instrumented address. We know it’s an entry probe because fbtp_roffset is zero.
The use of CPU_DTRACE_NOFAULT is interesting, and I’ll be honest, I’m not sure I fully understand why it’s used here, but here’s my best guess. First off, the no-fault mechanism is used throughout DTrace to protect the operating system from bad loads in DTrace actions. This flag allows the page fault logic in locore.s to determine if the #PF originated from DTrace, and if so to skip over the offending instruction, set some DTrace CPU flags to indicate the faulting load, and move on with execution. This is different from a typical #PF, which would wind up in the global trap handler. As for its use here? I think it’s because we are directly accessing the handler’s stack via the stack pointer argument, as opposed to the typical way you access the stack in C: through local variables. These reads should always be valid, but perhaps that wasn’t always the case in the history of this code. In any event, if any of the references on lines 109-113 happen to go out of bounds, the #PF handler will catch it and simply move on (leaving whatever garbage value the variable was initialized to).
Now it’s time to start reading data off the handler’s stack. Remember, stack is pointing into the handler’s stack, at the point of the last push in the invoptrap handler. The last thing we saved on the stack was the caller of the instrumented function, found at location stack[0]. On line 99 we store that address in the CPU-local variable cpu_dtrace_caller, which is used in the DTrace stack() action and to populate the DTrace built-in caller variable. The next two values on the handler’s stack are the regs structure values r_savfp and r_savpc, the saved values of the frame pointer and program counter, respectively. We skip over them to get to the arguments of the instrumented function.
The System V AMD64 ABI dictates that the first six integer arguments are passed in specific registers. These stack[N] accesses read the saved copies of those registers out of the regs structure, and the resulting variables are passed to dtrace_probe() to act as a cache of the first five arguments to the instrumented function. But why five and not six? While illumos no longer supports a 32-bit kernel, back when it did the #UD handler set up stack[1] to point to the beginning of the first 10 arguments to the instrumented function, so we know it’s not a 32-bit limitation. It also doesn’t seem to be a SPARC limitation, as a quick search of the various SPARC ABIs shows they all allow the first six arguments to be passed in registers. My best guess is that it comes from a desire to keep all arguments to dtrace_probe() in registers — for efficiency, and perhaps to avoid polluting the stack, as some probe actions, such as stack(), are sensitive to stack depth. In any event, the final action is to call dtrace_probe(), the main entry point into the DTrace framework, which ultimately executes the actions linked with this DTrace probe.
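To keep the indexing straight, here is the slot layout of the stack argument as fbt_invop() sees it, written as a hypothetical enum. The names are mine, purely for illustration; the positions follow from the invoptrap push and the regs layout shown above.
/* Index layout of the "stack" argument inside fbt_invop() (illustrative). */
enum fbt_stack_slot {
    SLOT_CALLER = 0,    /* return address of the instrumented function's caller */
    SLOT_SAVFP  = 1,    /* r_savfp */
    SLOT_SAVPC  = 2,    /* r_savpc */
    SLOT_ARG0   = 3,    /* r_rdi */
    SLOT_ARG1   = 4,    /* r_rsi */
    SLOT_ARG2   = 5,    /* r_rdx */
    SLOT_ARG3   = 6,    /* r_rcx */
    SLOT_ARG4   = 7     /* r_r8  */
};
handler-stack slots as seen by fbt_invop() (illustrative)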
usr/src/uts/intel/dtrace/fbt.c
            } else {
#ifdef __amd64
                /*
                 * On amd64, we instrument the ret, not the
                 * leave. We therefore need to set the caller
                 * to assure that the top frame of a stack()
                 * action is correct.
                 */
                DTRACE_CPUFLAG_SET(CPU_DTRACE_NOFAULT);
                CPU->cpu_dtrace_caller = stack[0];
                DTRACE_CPUFLAG_CLEAR(CPU_DTRACE_NOFAULT |
                    CPU_DTRACE_BADADDR);
#endif
                dtrace_probe(fbt->fbtp_id, fbt->fbtp_roffset,
                    rval, 0, 0, 0);
                CPU->cpu_dtrace_caller = 0;
            }
            return (fbt->fbtp_rval);
        }
    }
    return (0);
}
FBT return probe
In this case we have a matching return probe. I’m not so sure I follow this comment. The caller’s return address is still on the interrupted thread’s stack regardless of whether we instrument the leave or ret instruction, but the point is that this value must be saved in case the user’s action uses stack(). Like the entry probe, the goal is to send a probe event to the DTrace framework, but this time, instead of passing arguments, we pass the return offset and return value. These values are accessible in the FBT return probe as the built-in variables arg0 and arg1, respectively.
In the case of a matching probe, we need to return the fbtp_rval value (line 140). This value was determined at probe creation time and tells the invoptrap handler how to return back to the interrupted thread. It depends on the type of probe (entry vs. return) and the architecture (32-bit vs. 64-bit). Given that illumos no longer ships a 32-bit kernel, we can ignore the architecture and assume AMD64. That means we return DTRACE_INVOP_PUSHL_EBP for an entry probe and DTRACE_INVOP_RET for a return probe. If we didn’t find a matching probe, we return zero (line 144), which tells the invop handler to treat this fault as a normal #UD and continue processing.
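For completeness, here is a hedged sketch of how the provider might record that decision at probe-creation time; the struct and the numeric constant values are invented for illustration, the real constants live in the DTrace headers.
#define DTRACE_INVOP_PUSHL_EBP  1       /* illustrative values */
#define DTRACE_INVOP_RET        5

typedef struct probe_sketch {
    int fbtp_rval;                      /* how to emulate the patched instruction */
} probe_sketch_t;

/* Entry probes patched a push %rbp; return probes patched a ret. */
static void
sketch_set_rval(probe_sketch_t *p, int is_entry_site)
{
    p->fbtp_rval = is_entry_site ? DTRACE_INVOP_PUSHL_EBP : DTRACE_INVOP_RET;
}
recording the emulation type at probe creation (illustrative sketch)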
usr/src/uts/intel/ia32/ml/exception.s
pushq (%rsi)
movq %rsp, %rsi
subq $8, %rsp
call dtrace_invop
ALTENTRY(dtrace_invop_callsite)
addq $16, %rsp
cmpl $DTRACE_INVOP_PUSHL_EBP, %eax
je ud_push
cmpl $DTRACE_INVOP_LEAVE, %eax
je ud_leave
cmpl $DTRACE_INVOP_NOP, %eax
je ud_nop
cmpl $DTRACE_INVOP_RET, %eax
je ud_ret
jmp ud_trap
Picking up on line 356 of the invoptrap handler, we add 16 bytes to the stack pointer, which undoes the pushes on lines 351 and 353, leaving RSP at the start of the regs structure. From there we jump to one of the labels depending on the value returned by fbt_invop().
usr/src/uts/intel/ia32/ml/exception.s
ud_push:
/*
* We must emulate a "pushq %rbp". To do this, we pull the stack
* down 8 bytes, and then store the base pointer.
*/
INTR_POP
subq $16, %rsp /* make room for %rbp */
pushq %rax /* push temp */
movq 24(%rsp), %rax /* load calling RIP */
addq $1, %rax /* increment over trapping instr */
movq %rax, 8(%rsp) /* store calling RIP */
movq 32(%rsp), %rax /* load calling CS */
movq %rax, 16(%rsp) /* store calling CS */
movq 40(%rsp), %rax /* load calling RFLAGS */
movq %rax, 24(%rsp) /* store calling RFLAGS */
movq 48(%rsp), %rax /* load calling RSP */
subq $8, %rax /* make room for %rbp */
movq %rax, 32(%rsp) /* store calling RSP */
movq 56(%rsp), %rax /* load calling SS */
movq %rax, 40(%rsp) /* store calling SS */
movq 32(%rsp), %rax /* reload calling RSP */
movq %rbp, (%rax) /* store %rbp there */
popq %rax /* pop off temp */
jmp tr_iret_kernel /* return from interrupt */
In this case we are leaving an entry probe and need to emulate the pushq %rbp instruction (which was overwritten with a breakpoint). The first line, INTR_POP, restores the GPRs and adjusts RSP to point to the values pushed by the processor when switching to the interrupt stack, namely the CS:RIP, RFLAGS, and SS:RSP values of the interrupted thread. The rest of the code, starting with line 373, is a bit odd and even conflicts somewhat with the comment above it (it’s pulling the stack down 16 bytes, not 8). This code doesn’t need to be this way, but I think it is this way as a consequence of a time in the past when the AMD64 interrupt handler did not use its own stack, or perhaps because the author of this routine started with a copy of the 32-bit routine. But the idea is that the iret instruction is going to pop values off the stack to restore the CS:RIP, RFLAGS, and SS:RSP registers to their original values. In order to emulate the frame pointer push we need to pull the stack down by enough bytes to store said pointer. To do that we first need to shift the values that were originally stashed on the stack by the processor when the interrupt was taken. On i386 this made sense, as the handler reused the kernel thread’s stack, so we needed to “pull it down”. But on AMD64 the handler has its own stack, so this is all unnecessary. That said, the AMD64 handler still pulls the stack down, to no real effect, but it also modifies the interrupted thread’s (referred to as the “caller” in this code) RSP to stash the RBP, and increments the interrupted thread’s RIP to the next instruction. Finally, we jump to tr_iret_kernel, which is really just an iret instruction (the authors of KPTI decided it would be wise to alias it so that future maintainers know the iret path was analyzed when working on the Meltdown mitigations — a blessing for those who will have to maintain this code for decades to come). After the iret has executed we are back in the kernel thread’s original execution context, and it’s as if FBT was never involved.
Here’s how the push emulation could be written in a simpler manner.
simple_ud_push:
/*
* Emulate "pushq %rbp" by stashing RBP on the caller's stack
* and incrementing the caller's RIP by one.
*/
INTR_POP
pushq %rax /* stash RAX to use as scratch */
movq (%rsp), %rax /* load calling RIP */
addq $1, %rax /* increment over trapping instr */
movq %rax, (%rsp) /* store calling RIP */
movq 24(%rsp), %rax /* load calling RSP */
subq $8, %rax /* make room for %rbp */
movq %rbp, (%rax) /* store %rbp there */
movq %rax, 24(%rsp) /* store calling RSP */
popq %rax /* restore RAX */
jmp tr_iret_kernel /* return from interrupt */
A simpler AMD64 push emulation
usr/src/uts/intel/ia32/ml/exception.s
ud_ret:
INTR_POP
pushq %rax /* push temp */
movq 32(%rsp), %rax /* load %rsp */
movq (%rax), %rax /* load calling RIP */
movq %rax, 8(%rsp) /* store calling RIP */
addq $8, 32(%rsp) /* adjust new %rsp */
popq %rax /* pop off temp */
jmp tr_iret_kernel /* return from interrupt */
#UD handler
Here we emulate the ret instruction. On line 426 we push RAX to the stack in order to use it as a scratch register. On lines 427-428 we use the interrupted thread’s RSP to get the return pointer. On line 429 we replace the stashed RIP on the handler’s stack with the one we just pulled from the interrupted thread’s stack (remember, iret is going to populate RIP from the handler’s stack). On line 430 we remove the stored RIP from the interrupted thread’s stack by adding 8 to its saved RSP. On line 431 we restore RAX. And finally we iret back into the interrupted thread.
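In pseudo-C, the net effect of ud_ret on the interrupted thread's state is simply the semantics of ret itself. This is a sketch over a hypothetical saved-state struct, not kernel code.
#include <stdint.h>

/* Hypothetical saved state of the interrupted thread. */
struct saved_state {
    uint64_t rip;
    uint64_t rsp;
};

/* Emulate "ret": pop the return address off the thread's stack into RIP. */
static void
emulate_ret(struct saved_state *st)
{
    st->rip = *(uint64_t *)st->rsp;     /* return address on top of stack */
    st->rsp += 8;                       /* pop it */
}
ret emulation in pseudo-C (illustrative)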
And with that we now know how a kernel function entry/return point is instrumented, intercepted, and emulated. While that’s interesting, it’s only part of the FBT story. One of the more powerful aspects of DTrace is the ability to grab the stack trace and arguments at the time of probe firing. These features are implemented in their own functions, but they rely on the invop handler’s help. Now that we understand the handler, we can see how these other DTrace features work, to be described in a follow-up post.