Understanding Async Await in Rust: From State Machines to Assembly Code

Introduction

This article explores the inner workings of async/await in Rust. We will examine how async functions are implemented as state machines and how those state machines are compiled to assembly code. Rust's async functions let you write asynchronous code in a synchronous style. The compiler turns each async function into a state machine, conceptually an enum that implements the Future trait. The Future trait has a poll method that an executor calls repeatedly until it returns Poll::Ready. Each call to poll advances the state machine from its saved state until it either has to wait again, returning Poll::Pending, or reaches the final state and returns Poll::Ready. We will use an example async function to illustrate these concepts.

We recommend reading "Rust Closures Under the Hood: Comparing impl Fn and Box<dyn Fn>" first to understand the inner workings of closures in Rust. Closures are used in async functions to capture variables from the enclosing scope.

Async example

We will use the following async function as an example. The code is taken from our fork of the simple async local executor, a single-threaded polling executor. We will be working with the game-units.rs example.

Example of an async function

We start with the goto function, which moves a unit towards a target position. The function takes a unit reference and a target position, returns a future that will move the unit towards the target position at each step, and completes when the unit has reached that position.

// Await this function until the unit has reached the target position.
async fn goto(unit: UnitRef, pos: i32) {
    UnitGotoFuture {
        unit,
        target_pos: pos,
    }
    .await;
}

With this function, an async caller of the goto function can write code like this:

goto(unit.clone(), 10).await;
// The code here will execute after the unit has reached position 10

The code above will move the unit towards position 10 and wait until the unit has reached position 10 before continuing execution. We will see that this is achieved without blocking the thread.

Example of a future that implements a poll

The poll function is a method defined on the Future trait, implemented for the UnitGotoFuture struct. An async executor calls the poll function to determine whether the Future is Ready or Pending.

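The poll function takes two arguments: self, a pinned mutable reference (Pin<&mut Self>) to the future being polled, and _cx, a mutable reference to a Context object. The Context carries the Waker that a future can use to tell the executor it is ready to be polled again. This example does not use it (hence the leading underscore) and relies on the polling executor to call poll again.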

In this specific implementation of poll, the function first retrieves the current position of the Unit that the future is associated with, by borrowing it immutably with self.unit.borrow().pos. Then, it checks if the current position of the Unit is equal to the target position that the future is supposed to move towards. If so, the future is considered ready, and the Poll::Ready(()) value is returned.

If the current position of the Unit is not equal to the target position, the future updates the position of the Unit by borrowing it mutably with self.unit.borrow_mut().pos and adding or subtracting 1 based on the sign of the difference between the current and target positions. Finally, the future returns Poll::Pending to indicate that it is not yet ready to be completed.

Overall, the poll function is used to check the current state of a future and either return a value indicating that the future has been completed or indicate that it needs to continue executing and can be polled again later.

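use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::task::{Context, Poll};

#[derive(Default)]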
struct Unit {
    /// The 1-D position of the unit. In a real game, it would be 2D or 3D.
    pub pos: i32,
}
type UnitRef = Rc<RefCell<Unit>>;

/// A future that will move the unit towards `target_pos` at each step,
/// and complete when the unit has reached that position.
struct UnitGotoFuture {
    unit: UnitRef,
    target_pos: i32,
}
impl Future for UnitGotoFuture {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context) -> Poll<Self::Output> {
        let unit_pos = self.unit.borrow().pos;
        if unit_pos == self.target_pos {
            Poll::Ready(())
        } else {
            self.unit.borrow_mut().pos += (self.target_pos - unit_pos).signum();
            Poll::Pending
        }
    }
}

/// Helper async function to write unit behavior nicely
async fn goto(unit: UnitRef, pos: i32) {
    UnitGotoFuture {
        unit,
        target_pos: pos,
    }
    .await;
}
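
To make the polling behavior concrete, here is a minimal sketch that drives goto by hand instead of through an executor. The NoopWaker type and the count_pending_polls helper are assumptions made for this illustration and are not part of the original example; the rest mirrors the listing above.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct Unit {
    pos: i32,
}
type UnitRef = Rc<RefCell<Unit>>;

struct UnitGotoFuture {
    unit: UnitRef,
    target_pos: i32,
}

impl Future for UnitGotoFuture {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        let unit_pos = self.unit.borrow().pos;
        if unit_pos == self.target_pos {
            Poll::Ready(())
        } else {
            self.unit.borrow_mut().pos += (self.target_pos - unit_pos).signum();
            Poll::Pending
        }
    }
}

async fn goto(unit: UnitRef, pos: i32) {
    UnitGotoFuture { unit, target_pos: pos }.await;
}

/// Hypothetical waker that does nothing; sufficient when we poll in a loop.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Polls `goto` until it is ready and returns how many polls returned Pending.
fn count_pending_polls(start: i32, target: i32) -> u32 {
    let unit: UnitRef = Rc::new(RefCell::new(Unit { pos: start }));
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(goto(unit.clone(), target));
    let mut pending = 0;
    while fut.as_mut().poll(&mut cx).is_pending() {
        pending += 1;
    }
    // The unit only moved while the future was being polled.
    assert_eq!(unit.borrow().pos, target);
    pending
}

fn main() {
    println!("pending polls from 0 to 3: {}", count_pending_polls(0, 3));
}
```

Each Pending poll moves the unit by one step, so the number of Pending results equals the distance to the target; the thread is never blocked between polls.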

Desugaring the async example

Before we delve into how the async goto function is implemented in assembly, let's look at equivalent non-async Rust code that implements the same functionality. This will make the assembly code easier to follow.

The await on UnitGotoFuture splits the goto function into states that may be modeled using an enum that saves the execution point and resumes from the saved point when the executor calls the future's poll function.

// The state machine enum defines three states for the goto function:
// 1. Start: The initial state, where the function is called with the unit and target position.
// 2. Waiting: The state where the function is waiting for the UnitGotoFuture to complete.
// 3. Done: The final state, where the function has been completed.
#[repr(u8)]
enum GotoFuture {
    // 🚀 Initial state
    Start(UnitRef, i32) = 0,
    // 🕓 Waiting for UnitGotoFuture to complete
    Waiting(UnitGotoFuture) = 3,
    // ✅ Final state
    Done = 1,
}

// Implementing Future for GotoFuture
impl Future for GotoFuture {
    type Output = ();

    // The Future's poll method will be called by the async executor to check if the future is ready and if the execution can continue.
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // The loop is used to transition between states
        loop {
            match &mut *self {
                // 🚀 Start (0): In the start state, create a UnitGotoFuture and move to the waiting state
                GotoFuture::Start(unit, pos) => {
                    let fut = UnitGotoFuture {unit: unit.clone(), target_pos: *pos };
                    *self = GotoFuture::Waiting(fut);
                }
                // 🕓 Waiting (3): In the waiting state, poll the UnitGotoFuture and move to the done state if it's ready
                GotoFuture::Waiting(fut) => {
                    match Pin::new(fut).poll(cx) {
                        Poll::Ready(()) => *self = GotoFuture::Done,
                        Poll::Pending => return Poll::Pending,
                    }
                }
                // ✅ Done (1): In the done state, return ready
                GotoFuture::Done => return Poll::Ready(()),
            }
        }
    }
}

// The original async function is equivalent to creating a new GotoFuture instance in the start state
fn goto(unit: UnitRef, pos: i32) -> impl Future<Output = ()> {
    GotoFuture::Start(unit, pos)
}

The GotoFuture enum defines three states that correspond to the three stages of the async function's execution:

Start: The initial state, where the function is called with the unit and target position. It holds the unit reference and target position and is waiting to transition to the next state.
Waiting: The state where the function is waiting for the UnitGotoFuture to complete. It holds a UnitGotoFuture instance and polls it repeatedly until it returns Poll::Ready(()), indicating that it has completed its work.
Done: The final state where the function has been completed. It does not hold any additional information and immediately returns Poll::Ready(()) when polled.
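
The transitions between these three states can be observed by polling a hand-written state machine and recording its variant after every poll. The sketch below repeats the desugared enum in a self-contained form; state_name, trace_states, and NoopWaker are hypothetical helpers added only for this illustration.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct Unit {
    pos: i32,
}
type UnitRef = Rc<RefCell<Unit>>;

struct UnitGotoFuture {
    unit: UnitRef,
    target_pos: i32,
}
impl Future for UnitGotoFuture {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        let unit_pos = self.unit.borrow().pos;
        if unit_pos == self.target_pos {
            return Poll::Ready(());
        }
        self.unit.borrow_mut().pos += (self.target_pos - unit_pos).signum();
        Poll::Pending
    }
}

enum GotoFuture {
    Start(UnitRef, i32),
    Waiting(UnitGotoFuture),
    Done,
}
impl Future for GotoFuture {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        loop {
            match &mut *self {
                GotoFuture::Start(unit, pos) => {
                    let fut = UnitGotoFuture { unit: unit.clone(), target_pos: *pos };
                    *self = GotoFuture::Waiting(fut);
                }
                GotoFuture::Waiting(fut) => match Pin::new(fut).poll(cx) {
                    Poll::Ready(()) => *self = GotoFuture::Done,
                    Poll::Pending => return Poll::Pending,
                },
                GotoFuture::Done => return Poll::Ready(()),
            }
        }
    }
}

/// Hypothetical no-op waker for hand polling.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

fn state_name(fut: &GotoFuture) -> &'static str {
    match fut {
        GotoFuture::Start(..) => "Start",
        GotoFuture::Waiting(_) => "Waiting",
        GotoFuture::Done => "Done",
    }
}

/// Records the state before the first poll and after every poll.
fn trace_states(start: i32, target: i32) -> Vec<&'static str> {
    let unit: UnitRef = Rc::new(RefCell::new(Unit { pos: start }));
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = GotoFuture::Start(unit, target);
    let mut states = vec![state_name(&fut)];
    loop {
        let ready = Pin::new(&mut fut).poll(&mut cx).is_ready();
        states.push(state_name(&fut));
        if ready {
            return states;
        }
    }
}

fn main() {
    println!("{:?}", trace_states(0, 2));
}
```

The future leaves Start on the first poll, stays in Waiting while the inner future is pending, and reaches Done in the same poll that reports Poll::Ready.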

We will see shortly that the compiler-generated code for the goto function is similar to the state machine we just described. It implements the state machine using a closure and tracks its current state using a state variable.

Understanding the generated assembly

Now that we better understand the async function's state machine, let's look at the assembly code generated by the compiler for the goto async function.

Closure state machine

The compiler generates a closure to implement the state machine. The closure is then wrapped in a struct that implements the Future trait. The Future trait's poll method is implemented by calling the closure. The Future struct is then returned by the goto function.

The closure also contains a state variable that tracks the state of the state machine. The state variable is initialized to 0 in the Start state. When the poll method is called, the closure checks the state variable and takes appropriate actions based on the current state. The full state machine is shown in the following state diagram.

Async closure state machine

The following assembly shows how the closure checks the state variable and uses a jump table to jump to the appropriate block of code based on the current state. The jump table switch happens at the start of the closure.

    movzx eax, byte ptr [rdi + 28]          ; Load the state to determine which block to execute.
                                            ; The state is stored at offset 28 in the closure environment.
    lea rcx, [rip + .LJTI57_0]              ; Load the address of the jump table into rcx.
                                            ; The jump table is a list of offsets from the
                                            ; start of the jump table to each block.
    movsxd rax, dword ptr [rcx + 4*rax]     ; Get the jump offset from the entry
                                            ; corresponding to the state. The index in rax is
                                            ; multiplied by 4 because the jump table is an
                                            ; array of 32-bit jump offsets indexed by the state.
    add rax, rcx                            ; Add the jump offset to the start 
                                            ; of the jump table to get the address of the block to execute.
    jmp rax                                 ; Jump to the block to execute.

Here is the compiler-generated jump table. It contains the offsets from the start of the jump table to each block.

.LJTI57_0:
    .long .LBB57_4-.LJTI57_0 ; 🚀 Start (0): Entry point to the goto closure.
    .long .LBB57_3-.LJTI57_0 ; ✅ Done (1): Throw a panic if polled after completion of the future.
    .long .LBB57_2-.LJTI57_0 ;     UnitState2 (2): Throw a panic if polled after an earlier poll panicked.
    .long .LBB57_1-.LJTI57_0 ; 🕓 Waiting (3): Future is pending
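
The Done (1) entry of the jump table can be exercised from safe Rust: polling the future of an async fn again after it has returned Poll::Ready panics. The sketch below assumes the default panic=unwind setting so that catch_unwind can observe the panic; immediately_done, NoopWaker, and poll_after_completion_panics are hypothetical names used only here.

```rust
use std::future::Future;
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::Arc;
use std::task::{Context, Wake, Waker};

/// Hypothetical no-op waker for hand polling.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// An async fn with no await points; it completes on the first poll.
async fn immediately_done() {}

/// Returns true if polling a completed `async fn` future panics.
fn poll_after_completion_panics() -> bool {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(immediately_done());

    // The first poll runs the body to completion.
    assert!(fut.as_mut().poll(&mut cx).is_ready());

    // The second poll dispatches on the completed state and panics.
    catch_unwind(AssertUnwindSafe(|| {
        let _ = fut.as_mut().poll(&mut cx);
    }))
    .is_err()
}

fn main() {
    println!("panics when polled again: {}", poll_after_completion_panics());
}
```

A well-behaved executor never polls a future again after it has returned Poll::Ready, so this path is only reached through a bug in the caller.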

Wrapping a closure in a future

This section will examine the generated assembly for the goto function. The goto function just returns the Future object. Calling the goto function does not execute the body of the async function. The body runs only when the Future object is polled, for example when it is awaited.

The following code shows a rough Rust equivalent of the generated assembly for the goto async function. The poll_fn function in std::future creates a future whose poll method calls the given closure. Here, goto_closure stands for the compiler-generated goto::{{closure}}.

fn goto(unit: UnitRef, target_pos: i32) -> impl Future<Output = ()> {
    poll_fn(goto_closure)
}
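
As a runnable illustration of the same idea, the sketch below writes goto directly on top of std::future::poll_fn. This is only an analogy: the compiler does not call poll_fn, and this closure has no Done or panicked state. NoopWaker and poll_goto are hypothetical helpers for the demonstration.

```rust
use std::cell::RefCell;
use std::future::{poll_fn, Future};
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct Unit {
    pos: i32,
}
type UnitRef = Rc<RefCell<Unit>>;

/// `goto` expressed as a closure wrapped in a future by `poll_fn`.
/// The closure captures `unit` and `target_pos`, much like the
/// closure environment described in the text.
fn goto(unit: UnitRef, target_pos: i32) -> impl Future<Output = ()> {
    poll_fn(move |_cx: &mut Context<'_>| {
        let unit_pos = unit.borrow().pos;
        if unit_pos == target_pos {
            Poll::Ready(())
        } else {
            unit.borrow_mut().pos += (target_pos - unit_pos).signum();
            Poll::Pending
        }
    })
}

/// Hypothetical no-op waker for hand polling.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Polls the future to completion; returns (pending polls, final position).
fn poll_goto(start: i32, target: i32) -> (u32, i32) {
    let unit: UnitRef = Rc::new(RefCell::new(Unit { pos: start }));
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(goto(unit.clone(), target));
    let mut pending = 0;
    while fut.as_mut().poll(&mut cx).is_pending() {
        pending += 1;
    }
    let final_pos = unit.borrow().pos;
    (pending, final_pos)
}

fn main() {
    println!("{:?}", poll_goto(0, 2));
}
```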

The goto async function just returns the Future struct that wraps the goto::{{closure}}. The contents of the returned Future are shown below. They are essentially the closure environment of goto::{{closure}}. The returned closure environment contains the captured parameters at offsets 16 and 24. The state variable is stored at offset 28. The closure environment also contains local variables that are used to store the intermediate results.

Async closure environment

The following listing shows the assembly code of the goto async function. We see that the unit and target_pos parameters are stored at offsets 16 and 24, respectively. The state variable at offset 28 in the Future is initialized to 0 (Start). From the assembly, we see that none of the function body has been executed yet.

; Input:
;   rdi: goto::{{closure}} environment
;   rsi: unit
;   rdx: target_pos
; Output:
;   rax: goto::{{closure}} environment/Future
playground::goto:
    mov rax, rdi ; Return the address of the future in rax
    mov qword ptr [rdi + 16], rsi ; Save the unit.
    mov dword ptr [rdi + 24], edx ; Save the target_pos 
    mov byte ptr [rdi + 28], 0 ; 🚀 Start (0)
    ret ; Return
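
The same observation can be made from Rust without reading assembly. The following sketch creates the future, never polls it, and reports what is visible from outside. The create_without_polling helper is a hypothetical name, and the printed size depends on the compiler version and target, so it is shown rather than asserted to be a particular value.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::mem::size_of_val;
use std::pin::Pin;
use std::rc::Rc;
use std::task::{Context, Poll};

struct Unit {
    pos: i32,
}
type UnitRef = Rc<RefCell<Unit>>;

struct UnitGotoFuture {
    unit: UnitRef,
    target_pos: i32,
}
impl Future for UnitGotoFuture {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        let unit_pos = self.unit.borrow().pos;
        if unit_pos == self.target_pos {
            return Poll::Ready(());
        }
        self.unit.borrow_mut().pos += (self.target_pos - unit_pos).signum();
        Poll::Pending
    }
}

async fn goto(unit: UnitRef, pos: i32) {
    UnitGotoFuture { unit, target_pos: pos }.await;
}

/// Creates the future without polling it.
/// Returns (unit position, Rc strong count, size of the future in bytes).
fn create_without_polling() -> (i32, usize, usize) {
    let unit: UnitRef = Rc::new(RefCell::new(Unit { pos: 0 }));
    // Calling `goto` only moves the arguments into the returned future.
    let fut = goto(unit.clone(), 10);
    let observed = (
        unit.borrow().pos,        // still 0: the body has not run
        Rc::strong_count(&unit),  // 2: the future holds the cloned Rc
        size_of_val(&fut),        // arguments plus state, compiler-dependent
    );
    drop(fut);
    observed
}

fn main() {
    let (pos, strong, size) = create_without_polling();
    println!("pos = {pos}, strong count = {strong}, future size = {size} bytes");
}
```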

Role of the async executor

Rust requires an async executor to run the async functions. The executor is responsible for polling the future returned by the async function. The following sequence diagram shows how the executor polls the future returned by the goto async function. The executor calls the poll method of the Future trait. The poll method calls the goto::{{closure}} closure. The goto::{{closure}} closure checks the state variable and executes the appropriate code block based on the current state. The goto::{{closure}} closure then updates the state variable and returns the Poll object. The poll method returns the Poll object to the caller.

Async closure sequence diagram
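
The executor's side of this interaction can be sketched in a few lines. The code below is a deliberately simplified busy-polling loop, not the executor used in this article: run_to_completion, Countdown, two_steps, and NoopWaker are all hypothetical names, and a production executor would wait for a wake-up instead of polling again immediately.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical no-op waker; the loop below never waits to be woken.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical leaf future: Pending for `remaining` polls, then Ready.
struct Countdown {
    remaining: u32,
}
impl Future for Countdown {
    type Output = &'static str;
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.remaining == 0 {
            Poll::Ready("done")
        } else {
            self.remaining -= 1;
            Poll::Pending
        }
    }
}

/// Minimal polling loop: calls `poll` until the future is ready.
/// Returns the output and the number of `poll` calls that were made.
fn run_to_completion<F: Future>(fut: F) -> (F::Output, u32) {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(output) = fut.as_mut().poll(&mut cx) {
            return (output, polls);
        }
    }
}

/// Two await points: each Pending result suspends the whole async fn.
async fn two_steps() -> &'static str {
    Countdown { remaining: 2 }.await;
    Countdown { remaining: 1 }.await
}

fn main() {
    let (output, polls) = run_to_completion(two_steps());
    println!("output = {output}, polls = {polls}");
}
```

Every call to poll on the outer future resumes the async function at its saved state, which is the role the state variable plays in the generated code.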

Flow chart of the generated assembly of the `goto` closure

Now that we better understand the async function's state machine and async executors, let's look at a high-level flow chart of the generated assembly code. The compiler has inlined the poll for the UnitGotoFuture future into the goto::{{closure}}. The goto closure can also panic if the RefCell borrowing fails.

Async closure flow chart

Generated assembly of the `goto::{{closure}}`

The assembly code implements an asynchronous function goto::{{closure}}. The function receives a mutable reference to a closure environment and a Context object. The function returns Poll::Pending or Poll::Ready.

The code has been annotated with comments to explain the assembly code. The comments are prefixed with the state of the state machine. The state machine has four states: Start (0), Done (1), UnitState2 (2), and Waiting (3). The Start (0) state is the entry point to the function. The future is pending in the Waiting (3) state. The Done (1) state marks a completed future; polling in this state panics. The UnitState2 (2) state is recorded when a panic unwinds out of the closure; polling in this state panics with "`async fn` resumed after panicking".

Here is a high-level description of the generated assembly code. This will help us understand the assembly code better.

The assembly code starts with a dispatch sequence that jumps to the appropriate block of code depending on the current state of the goto::{{closure}}. It loads the current state of the future from the closure environment and uses it to look up the corresponding block of code in a jump table.

The function has three main blocks of code. The Start (0) block loads the unit and target_pos arguments from the closure environment into registers. It then saves them in the part of the closure environment that holds the UnitGotoFuture and jumps into the polling code shared with the Waiting (3) state.

The Waiting (3) state loads the Unit field from the closure environment and calls the inlined UnitGotoFuture::poll logic. It calculates the difference between the unit_pos and the target_pos and updates the unit_pos accordingly. The function jumps to an error handler if the unit has already been borrowed. If the unit has reached the target_pos, it sets the state to Done (1). Otherwise, it returns Poll::Pending.

After transitioning to Done (1), the closure decrements the unit's strong reference count. If the strong count reaches zero, it also decrements the weak count, and if that reaches zero as well, it frees the memory associated with the unit using the __rust_dealloc function. The function returns Poll::Ready to indicate the future has completed.

; Input:
;   rdi: Closure environment
;   rsi: &mut Context
; Output:
;   rax: Poll<()>

playground::goto::{{closure}}:
    push rbp
    push r15
    push r14
    push rbx
    push rax
    mov r15, rdi
    movzx eax, byte ptr [rdi + 28]       ; Load the state to determine which block to execute.
                                         ; The state is stored at offset 28 in the closure environment.
    lea rcx, [rip + .LJTI57_0]           ; Load the address of the jump table into rcx. The jump table is a list of offsets
                                         ; from the start of the jump table to each block.
    movsxd rax, dword ptr [rcx + 4*rax]  ; Get the jump offset from the entry corresponding to the state. 
                                         ; The index in rax is multiplied by 4 because the jump table is an array 
                                         ; of 32-bit jump offsets indexed by the state.
    add rax, rcx                         ; Add the jump offset to the start of the jump table to get the 
                                         ; address of the block to execute.
    jmp rax                              ; Jump to the block to execute.

; == 🚀 Start (0) block entry point ==
; The caller of the async function sets the state to 0 and initializes the closure environment.

.LBB57_4:
    mov rdi, qword ptr [r15 + 16] ; Load the unit from the closure environment into rdi
    mov eax, dword ptr [r15 + 24] ; Load the target_pos from the closure environment into eax
    mov qword ptr [r15], rdi ; Save the unit in the closure environment
    mov dword ptr [r15 + 8], eax ; Save the target_pos in the closure environment
    jmp .LBB57_5

; == 🕓 Waiting (3) block entry point ==
; Resume the future after a poll that returned Pending

.LBB57_1: 
    mov rdi, qword ptr [r15] ; Load the unit from the closure environment into rdi

.LBB57_5:
    ; Inlined call to UnitGotoFuture::poll
    mov rax, qword ptr [rdi + 16] ; Load the borrow flag from the unit's RefCell into rax
    movabs rcx, 9223372036854775807 ; Load the maximum signed 64-bit integer into rcx

    cmp rax, rcx ; Unsigned compare of the borrow flag with the max signed 64-bit value.

    jae .LBB57_6 ; If the flag is at or above that value, a shared borrow cannot be taken
                 ; (a negative flag means the unit is mutably borrowed), so jump to .LBB57_6.

    ;  The unit is not mutably borrowed.
    
    mov ebx, dword ptr [rdi + 24] ; Load the unit_pos from unit into ebx
    mov ebp, dword ptr [r15 + 8]  ; Load the target_pos from the closure environment into ebp
    mov ecx, ebp ; Set ecx to target_pos
    sub ecx, ebx ; Subtract unit_pos from target_pos and store the result in ecx
    jne .LBB57_8 ; If the difference is not equal to 0, jump as the unit has not reached the target position
    dec qword ptr [rdi] ; Decrement the strong reference count of the unit
    mov r14b, 1 ; Set the state to ✅ Done (1)
    jne .LBB57_17 ; Check if the strong reference count is not equal to 0. If it is not equal to 0, jump to .LBB57_17

    ; ♻️ Freeing Rc memory as the reference count is 0.
    dec qword ptr [rdi + 8] ; Decrement the weak reference count of the unit
    jne .LBB57_17 ; Jump if the weak reference count is not equal to 0. 
    mov esi, 32 ; Set the size of the memory to free to 32
    mov edx, 8 ; Set the alignment of the memory to free to 8
    call qword ptr [rip + __rust_dealloc@GOTPCREL] ; Call __rust_dealloc to free the memory
    jmp .LBB57_17

.LBB57_8:
    test rax, rax ; Check if the borrow flag is 0.
    jne .LBB57_9 ; If the borrow flag is not 0, jump to .LBB57_9 as the unit has already been borrowed.
    xor eax, eax ; Set eax to 0.
    ; signum function inlined - begin
    test ecx, ecx ; Test the sign of the difference (target_pos - unit_pos)
    setg al ; Set the value of al to 1 if target_pos is greater than unit_pos
               ; Set the value of al to 0 if target_pos is less than unit_pos (the equal case was handled above)
    lea eax, [rbx + 2*rax] ; Add 2 to unit_pos if unit_pos is less than target_pos
    dec eax ; Subtract 1 from the result of the previous addition (signum addition of 1 or -1)
    ; signum function inlined - end

    mov dword ptr [rdi + 24], eax ; Save the new unit_pos in the unit
    mov qword ptr [rdi + 16], 0 ; Set the borrow flag to 0.
    mov r14b, 3 ; Set the state to 🕓 Waiting (3)

.LBB57_17:
    cmp ebp, ebx ; Compare unit_pos with target_pos
    setne al ; Set the value of al to 1 (future not ready) if unit_pos is not equal to target_pos
               ; Set the value of al to 0 (future ready) if unit_pos is equal to target_pos 
    mov byte ptr [r15 + 28], r14b ; Save the state in the closure environment
    add rsp, 8 ; Free local variables.
    pop rbx
    pop r14
    pop r15
    pop rbp
    ret ; Return the future status.

.LBB57_6:
; 💀 Prepare the panic message.
    lea r8, [rip + .L__unnamed_21] 
    lea rcx, [rip + .L__unnamed_20] ; drop_in_place
    mov esi, 24 ; Length of the panic message
    lea rdi, [rip + .L__unnamed_19] ; Load the panic message. 24 bytes matches "already mutably borrowed" (failed borrow).
    jmp .LBB57_10

.LBB57_2:
    lea rdi, [rip + str.1] ; Load address of the string "`async fn` resumed after panicking"
    lea rdx, [rip + .L__unnamed_23]
    mov esi, 34
    call qword ptr [rip + core::panicking::panic@GOTPCREL]
    ud2

; == ✅ Done (1) block entry point: ==
; This block should never be reached as the future has already been completed.

.LBB57_3:
    lea rdi, [rip + str.2] ; Load address of the string "`async fn` resumed after completion"
    lea rdx, [rip + .L__unnamed_23]
    mov esi, 35
    call qword ptr [rip + core::panicking::panic@GOTPCREL]
    ud2

.LBB57_9:
    lea r8, [rip + .L__unnamed_22]
    lea rcx, [rip + .L__unnamed_8]
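    mov esi, 16 ; Length of the panic message
    lea rdi, [rip + .L__unnamed_7] ; Load the panic message. 16 bytes matches "already borrowed" (failed borrow_mut).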

.LBB57_10:
    mov rdx, rsp
    call qword ptr [rip + core::result::unwrap_failed@GOTPCREL]
    ud2
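    ; Unwind landing pad: runs if a panic unwinds through the closure.
    mov r14, rax ; Save the exception object
    mov rdi, qword ptr [r15]
    call core::ptr::drop_in_place<playground::UnitGotoFuture> ; Drop the pending UnitGotoFuture
    mov byte ptr [r15 + 28], 2 ; Set the state to UnitState2 (2): the closure panicked
    mov rdi, r14
    call _Unwind_Resume@PLT ; Continue unwinding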
    ud2

.LJTI57_0:
    .long .LBB57_4-.LJTI57_0 ; 🚀 Start (0) : Entry point to the goto closure.
    .long .LBB57_3-.LJTI57_0 ; ✅ Done (1) : Throw a panic if polled after completion of the future.
    .long .LBB57_2-.LJTI57_0 ;     UnitState2 (2) : Throw a panic if polled after an earlier poll panicked.
    .long .LBB57_1-.LJTI57_0 ; 🕓 Waiting (3) : Future is pending.

; 🗂️ Vtable for the goto closure. 
.L__unnamed_26:
 .quad core::ptr::drop_in_place<playground::goto::{{closure}}>     ; Destructor for the FnOnce trait object
 .asciz " \000\000\000\000\000\000\000\b\000\000\000\000\000\000"   ; Size of object: 32 bytes (Leading space)
                                                                        ; Alignment of the Future trait object: 8 bytes (\b)
 .quad playground::goto::{{closure}}                               ; call_once method of the FnOnce trait
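
The inlined signum sequence in the listing (setg, lea, dec) is a branch-free way of adding +1 or -1 to unit_pos. A minimal Rust sketch of the same arithmetic, assuming the two positions differ (the equal case takes a different branch in the assembly), is shown below; step_like_assembly and step_with_signum are hypothetical names.

```rust
/// Mirrors the assembly for a non-zero difference between the positions.
fn step_like_assembly(unit_pos: i32, target_pos: i32) -> i32 {
    let diff = target_pos - unit_pos;  // sub ecx, ebx
    let greater = (diff > 0) as i32;   // test ecx, ecx / setg al
    unit_pos + 2 * greater - 1         // lea eax, [rbx + 2*rax] / dec eax
}

/// The source-level expression used in UnitGotoFuture::poll.
fn step_with_signum(unit_pos: i32, target_pos: i32) -> i32 {
    unit_pos + (target_pos - unit_pos).signum()
}

fn main() {
    // Only pairs with unit_pos != target_pos are valid inputs here.
    for (unit_pos, target_pos) in [(0, 5), (5, 0), (-3, -2), (7, 6)] {
        assert_eq!(
            step_like_assembly(unit_pos, target_pos),
            step_with_signum(unit_pos, target_pos)
        );
    }
    println!("branch-free step matches signum for non-zero differences");
}
```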

Key takeaways

An await in an async function typically results in a poll call on the future. If the future is not ready, the poll method returns Poll::Pending. The poll method returns Poll::Ready if the future is ready.
The async function itself returns a future that wraps a closure. The poll method of the future calls the closure.
The closure is implemented as a state machine. The await points are represented as state transitions.
- The state is stored in a compiler-generated enum.
- A jump table is used to jump to the appropriate state.
The state machine is a closure that implements the Future trait.
Local variables that are live across an await point are stored in the closure environment. Many or large such variables make the closure environment, and therefore the future, larger.
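
The last takeaway can be checked with std::mem::size_of_val. In the sketch below, both async functions create the same 1024-byte buffer, but only the first keeps it alive across an await point. The function names are hypothetical, and exact sizes depend on the compiler version, so the program prints them instead of assuming particular values.

```rust
use std::future::ready;
use std::mem::size_of_val;

/// The buffer is used after the await, so it must be saved in the future.
async fn buffer_across_await(seed: u8) -> u8 {
    let buf = [seed; 1024];
    ready(()).await;
    buf[buf.len() - 1]
}

/// The buffer goes out of scope before the await; only one byte survives.
async fn buffer_before_await(seed: u8) -> u8 {
    let last = {
        let buf = [seed; 1024];
        buf[buf.len() - 1]
    };
    ready(()).await;
    last
}

/// Returns the sizes in bytes of the two futures, without polling them.
fn future_sizes() -> (usize, usize) {
    let across = buffer_across_await(7);
    let before = buffer_before_await(7);
    (size_of_val(&across), size_of_val(&before))
}

fn main() {
    let (across, before) = future_sizes();
    println!("buffer held across await: {across} bytes");
    println!("buffer dropped before await: {before} bytes");
}
```

Limiting the scope of large locals so that they end before an await is one way to keep futures small.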

Articles in the async/await series

Desugaring and assembly of async/await in Rust - goto
Nested async/await in Rust: Desugaring and assembly - patrol
Rust async executor - executor