Tail Call Optimization and Memory Management with Trait Objects and Vtables in Rust

We have looked at static and dynamic dispatch in Rust. In this article, we will examine the assembly code that is generated when one trait inherits from another trait. By analyzing this code, we can see how the compiler reduces some of the overhead associated with dynamic dispatch using tail call optimization. We will also examine how the compiler frees memory when a trait object, wrapped in a Box, goes out of scope. To illustrate these concepts, we will work with the following example code:

pub trait Shape {
    type T;
    fn area(&self) -> Self::T;
}

pub trait Draw: Shape {
    fn draw(&self);
}

// The function operates with a trait object reference
pub fn draw_dynamic(a: &dyn Draw<T = f64>) {
    a.draw();
}

// Invokes the draw method and then the area method on the trait object
pub fn draw_and_report_area_dynamic(a: &dyn Draw<T = f64>) -> f64 {
    a.draw();
    a.area()
}

// The trait object is wrapped in a Box. This means that function will have
// to free the memory associated with the trait object in the function body.
pub fn draw_and_report_area_dynamic_box(a: Box<dyn Draw<T = f64>>) -> f64 {
    a.draw();
    a.area()
}

Trait object fat pointer and the vtable

Under the hood, trait objects are implemented using fat pointers. The first pointer points to the data of the concrete type that implements the Draw trait, and the second pointer points to the vtable (virtual function table) for the concrete type. The vtable is a data structure that stores pointers to the implementations of concrete type methods. The vtable also stores the concrete type destructor, alignment and size information. This information will come in handy when we look at how the compiler frees memory when a trait object is wrapped in a Box.

The following diagram shows the layout of a trait object fat pointer. The byte offsets of the fat pointer and the vtable are shown in the diagram.

Rust trait object fat pointer

Assembly output for draw_dynamic

The generated assembly code for the draw_dynamic function includes a jump to the vtable address of the draw function, rather than a call. This optimization, known as Tail Call Optimization (TCO), eliminates the need to create a new stack frame and return address for the last (tail) function in the call chain.

In this case, the draw function returns directly to the caller of draw_dynamic, using the return address that was already present on the stack. This optimization is possible because the rdi register is already pointing to the concrete object that implements the Draw trait, satisfying the self parameter requirement for the draw function.

; Inputs:
;   rdi: Address of a
;   rsi: Address of the vtable associated with a
example::draw_dynamic:
        jmp     qword ptr [rsi + 32] ; a.draw() is called via the vtable
                                     ; and the tail call optimization is
                                     ; applied.

Assembly output for draw_and_report_area_dynamic

The generated assembly code for the draw_and_report_area_dynamic function is more complex than that of draw_dynamic. It first obtains the addresses of the draw and area methods from the vtable using offsets of 32 and 24, respectively. It then calls the draw method using a call instruction, which creates a new stack frame and return address.

The area method, however, is called using a jump instruction, indicating that it is eligible for Tail Call Optimization (TCO). Since the area method is the last function call in the body of draw_and_report_area_dynamic, the compiler can optimize it away by replacing the call with a jump, eliminating the need for a new stack frame and return address. The return value of the area method is also the return value of draw_and_report_area_dynamic, so the optimization has no impact on the correctness of the code.

; Inputs:
;   rdi: Address of a
;   rsi: Address of the vtable associated with a
; Outputs:
;   xmm0: Result of function
example::draw_and_report_area_dynamic:
        push    r14 ; Save the value of r14
        push    rbx ; Save the value of rbx
        push    rax ; Save the value of rax
        mov     r14, rsi ; Save the address of the vtable
        mov     rbx, rdi ; Save the address of a

        ; ⭐ Call a.draw()
        ; This is not the last function call in the function body, so it cannot
        ; be optimized away with a tail call.

        call    qword ptr [rsi + 32] ; a.draw() is called via the vtable
        mov     rdi, rbx ; Restore the address of a
        mov     rax, r14 ; Restore the address of the vtable

        ; ⭐ Pop off the three pushes on the stack in preparation of the tail call
        ; The stack is left in a clean state so that the jump to a.area() can
        ; can be followed by a return instruction that will return from
        ; draw_and_report_area_dynamic. 

        add     rsp, 8 ; Adjust the stack pointer
        pop     rbx ; Restore the value of rbx
        pop     r14 ; Restore the value of r14

        ; ⭐ Tail call a.area()
        ; This is the last function call in the function body and its return
        ; value is also the return value of draw_and_report_area_dynamic, so it
        ; can be optimized away with a tail call.

        jmp     qword ptr [rax + 24] ; a.area() is called via the vtable
                                     ; and the tail call optimization is
                                     ; applied.

Assembly output for draw_and_report_area_dynamic_box

The function draw_and_report_area_dynamic_box is a function that takes a Box containing a trait object as an argument and returns the result of calling the area method of the trait object. The function starts by saving the values of the r15, r14, and rbx registers on the stack, and then adjusts the stack pointer to reserve space for temporary variables. It then calls the draw method of the trait object contained in the Box, followed by the area method. The result of the area method is saved on the stack. The function then calls the destructor of the trait object contained in the Box and frees the memory allocated for the trait object. The function then returns the result of the area method. There is unreachable code after the ret instruction, which seems to be a bug in the compiler.

; Inputs:
;   rdi: Address of a
;   rsi: Address of the vtable associated with a
; Outputs:
;   xmm0: Result of function
example::draw_and_report_area_dynamic_box:
        push    r15 ; Save the value of r15 on the stack
        push    r14 ; Save the value of r14 on the stack
        push    rbx ; Save the value of rbx on the stack
        sub     rsp, 32 ; Adjust the stack pointer to reserve space temporary variables
        mov     rbx, rsi ; Save the address of the vtable into rbx
        mov     r14, rdi ; Save the address of the Box into r14
        mov     qword ptr [rsp + 16], rdi ; Save the address of the Box on the stack
        mov     qword ptr [rsp + 24], rsi ; Save the address of the vtable on the stack

        ; ⭐ Call the draw method of the trait object contained in the Box

        call    qword ptr [rsi + 32] ; Call the draw method of the trait object contained in the Box
        mov     rdi, r14 ; Restore the address of the Box into rdi

        ; ⭐ Call the area method of the trait object contained in the Box

        call    qword ptr [rbx + 24] ; Call the area method of the trait object contained in the Box
        movsd   qword ptr [rsp + 8], xmm0 ; Save the result of the area method on the stack

        ; ⭐ Call the destructor of the trait object contained in the Box

        mov     rdi, r14 ; Load the address of the Box into rdi
        call    qword ptr [rbx] ; Call the destructor of the value contained in the Box

        ; ♻️ Free the memory allocated for the trait object contained in the Box

        mov     rsi, qword ptr [rbx + 8] ; Load the size of the value contained in the Box into rsi
        test    rsi, rsi ; Check if the size of the value contained in the Box is 0
        je      .LBB5_5 ; If the size of the value contained in the Box is 0, jump to .LBB5_5
        mov     rdx, qword ptr [rbx + 16] ; Load the alignment of the value contained in the Box into rdx
        mov     rdi, r14 ; Load the address of the Box into rdi
        call    qword ptr [rip + __rust_dealloc@GOTPCREL] ; Call the __rust_dealloc function
.LBB5_5:
        movsd   xmm0, qword ptr [rsp + 8] ; Load the result of the area method into xmm0
        add     rsp, 32 ; Adjust the stack pointer to free space for temporaty variables
        pop     rbx ; Restore the value of rbx from the stack
        pop     r14 ; Restore the value of r14 from the stack
        pop     r15 ; Restore the value of r15 from the stack
        ret ; Return the result of the area method

        ; 💀 UNREACHABLE CODE
        ; The following code is unreachable as the function has already returned.
        ; This seems to be a bug in the compiler.

        mov     r15, rax ; Load the address of the Box into r15
        mov     rsi, qword ptr [rbx + 8] ; Load the size of the value contained in the Box into rsi
        mov     rdx, qword ptr [rbx + 16] ; Load the alignment of the value contained in the Box into rdx
        mov     rdi, r14 ; Load the address of the Box into rdi
        call    alloc::alloc::box_free ; Call the box_free function
        jmp     .LBB5_7 ; Jump to the end of the function
        mov     r15, rax ; Load the address of the Box into r15
        lea     rdi, [rsp + 16] ; Load the address of the Box on the stack into rdi
        call    core::ptr::drop_in_place<alloc::boxed::Box<dyn example::Draw+T = f64>> ; Call the drop_in_place function
.LBB5_7: 
        mov     rdi, r15 ; Load the address of the Box into rdi
        call    _Unwind_Resume@PLT ; Call the _Unwind_Resume function
        ud2 ; Trap
        call    qword ptr [rip + core::panicking::panic_no_unwind@GOTPCREL] ; Call the panic_no_unwind function
        ud2 ; Trap

DW.ref.rust_eh_personality:
        .quad   rust_eh_personality

Rectangle - a concrete type implementing the Draw trait

Example of a concrete type implementation of the Draw trait. The vtable, destructor, area method and draw method implementations are also covered.

// This function invokes the three functions we have been dealing with
// In this case, the compiler is able to inline the functions and remove
// the dynamic dispatch.
pub fn draw_rectange(rectangle: Rectangle<f64>) -> (f64, f64) {
    draw_dynamic(&rectangle);
    let a = draw_and_report_area_dynamic(&rectangle);
    let b = draw_and_report_area_dynamic_box(Box::new(rectangle));
    (a, b)
}

pub fn draw_rectange(rectangle: Rectangle<f64>) {
    draw_dynamic(&rectangle);
    draw_and_report_area_dynamic(&rectangle);
    draw_and_report_area_dynamic_box(Box::new(rectangle));
}

pub struct Point<T> {
    x: T,
    y: T,
}

pub struct Rectangle<T> {
    top_left: Point<T>,
    bottom_right: Point<T>,
}

impl<T> Shape for Rectangle<T>
where
    T: std::ops::Sub<Output = T> + std::ops::Mul<Output = T> + Copy,
{
    type T = T;
    fn area(&self) -> T {
        let width = self.bottom_right.x - self.top_left.x;
        let height = self.top_left.y - self.bottom_right.y;
        width * height
    }
}

impl<T> Draw for Rectangle<T>
where
    T: std::ops::Sub<Output = T> + std::ops::Mul<Output = T> + Copy,
{
    fn draw(&self) {
        println!("Draw a rectangle");
    }
}

Vtable for Rectangle

See the definition of the vtable for the Rectangle type. Compare this definition with the trait object and vtable figure.

.L__unnamed_3:
        .quad   core::ptr::drop_in_place<example::Rectangle<f64>>         ; Destructor
        .asciz  " \000\000\000\000\000\000\000\b\000\000\000\000\000\000" ; Size: 32 (leading space)
                                                                          ; Alignment: 8 (\b character)
        .quad   <example::Rectangle<T> as example::Shape>::area           ; Address of the area method for Rectangle
        .quad   <example::Rectangle<T> as example::Draw>::draw            ; Address of the draw method for Rectangle

Destructor for Rectangle

The destructor for Rectangle is core::ptr::drop_in_place<example::Rectangle<f64>>. The defined destructor simply returns.

core::ptr::drop_in_place<example::Rectangle<f64>>:
        ret

Area method for Rectangle

The area method for Rectangle is <example::Rectangle<T> as example::Shape>::area. The area method is defined as follows:

<example::Rectangle<T> as example::Shape>::area:
        movupd  xmm1, xmmword ptr [rdi + 8] ; Load the bottom_right.x and bottom_right.y fields into xmm1
        movsd   xmm0, qword ptr [rdi + 24] ; Load the top_left.x field into xmm0
        movhpd  xmm0, qword ptr [rdi] ; Load the top_left.y field into higher xmm0
        subpd   xmm1, xmm0 ; Perform vector subtractions to compute the width and height
        movapd  xmm0, xmm1 ; Copy the width and height into xmm0
        unpckhpd        xmm0, xmm1 ; Unpack the width and height into xmm0 and xmm1
        mulsd   xmm0, xmm1 ; Perform vector multiplication to compute the area
        ret ; Return the result in xmm0

Draw method for Rectangle

The draw method for Rectangle is <example::Rectangle<T> as example::Draw>::draw. The draw method is defined as follows:

<example::Rectangle<T> as example::Draw>::draw:
        sub     rsp, 56 ; Allocate space for the arguments on the stack
        lea     rax, [rip + .L__unnamed_1] ; Load the address of the format string into rax
        mov     qword ptr [rsp + 8], rax ; Store the address of the format string into the stack
        mov     qword ptr [rsp + 16], 1 ; Store the number of arguments into the stack
        mov     qword ptr [rsp + 24], 0 ; Store the address of the first argument into the stack
        lea     rax, [rip + .L__unnamed_2] ; Load the address of the argument into rax
        mov     qword ptr [rsp + 40], rax ; Store the address of the argument into the stack
        mov     qword ptr [rsp + 48], 0 ; Store the address of the second argument into the stack
        lea     rdi, [rsp + 8] ; Load the address of the arguments into rdi
        call    qword ptr [rip + std::io::stdio::_print@GOTPCREL] ; Call the print function
        add     rsp, 56 ; Deallocate the space for arguments from the stack
        ret

Key takeaways

Experiment with the Compiler Explorer

Edit the code in the Compiler Explorer to see the difference in the assembly output.

pub fn draw_and_report_area_dynamic(a: &dyn Draw<T = f64>) -> f64 {
    let area = a.area();
    a.draw();
    area
}