Understanding Async Await in Rust: From State Machines to Assembly Code

Introduction

In this article, we will explore the inner workings of async/await in Rust. We will examine how async functions are implemented as state machines and how those state machines are lowered to assembly code. Rust's async functions let us write asynchronous code in a synchronous style. The compiler turns each async function into a state machine: an anonymous, enum-like type that implements the Future trait. The Future trait has a single method, poll, which the executor calls repeatedly until it returns Poll::Ready. Each call to poll advances the state machine until it reaches its final state and returns Poll::Ready. We will use an example async function to illustrate these concepts.

We recommend reading "Rust Closures Under the Hood: Comparing impl Fn and Box" to understand the inner workings of closures in Rust. Closures are used in async functions to capture variables from the enclosing scope.

Async example

We will use the following async function as an example. The code is taken from our fork of the simple async local executor, a single-threaded polling executor. We will be working with the game-units.rs example.

Example of an async function

We start with the goto function, which moves a unit towards a target position. The function takes a unit reference and a target position and returns a future that will move the unit towards the target position at each step, and complete when the unit has reached that position.

// Await this function until the unit has reached the target position.
async fn goto(unit: UnitRef, pos: i32) {
    UnitGotoFuture {
        unit,
        target_pos: pos,
    }
    .await;
}

With this function, an async caller of the goto function can write code like this:

goto(unit.clone(), 10).await;
// The code here will execute only after the unit has reached position 10

The code above will move the unit towards position 10 and wait until the unit has reached position 10 before continuing execution. We will see that this is achieved without blocking the thread.

Example of a future that implements a poll

The poll function is a method defined on the Future trait, which is implemented for the UnitGotoFuture struct. The poll function is called by an async executor to determine if the Future is Ready or Pending.

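The poll function takes two arguments: self, which is a pinned mutable reference (Pin<&mut Self>) to the future being polled, and _cx, which is a mutable reference to a Context object. The Context carries a Waker that a future can use to tell the executor that it is ready to be polled again. UnitGotoFuture does not use the Context, which is why the parameter is named _cx.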
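To make the role of the Context more concrete, here is a minimal sketch of a future that does use its waker. The names YieldOnce, CountingWaker and poll_yield_once are hypothetical and not part of the example project; the sketch only assumes the standard library's Wake trait.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical waker that only counts how often it is woken.
struct CountingWaker {
    wakes: AtomicUsize,
}

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.wakes.fetch_add(1, Ordering::SeqCst);
    }
}

/// Hypothetical future that returns Pending once, after asking to be polled again.
struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            // Tell the executor that this future wants another poll.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Polls YieldOnce to completion and returns (number of polls, number of wake-ups).
fn poll_yield_once() -> (usize, usize) {
    let counter = Arc::new(CountingWaker { wakes: AtomicUsize::new(0) });
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);
    let mut fut = YieldOnce { yielded: false };
    let mut polls = 0;
    loop {
        polls += 1;
        if Pin::new(&mut fut).poll(&mut cx).is_ready() {
            break;
        }
    }
    (polls, counter.wakes.load(Ordering::SeqCst))
}
```

A polling executor like the one used in this article polls every future on each step, so it can ignore wake-ups. An executor that sleeps between events would rely on them.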

In this specific implementation of poll, the function first retrieves the current position of the Unit that the future is associated with, by borrowing it immutably with self.unit.borrow().pos. Then, it checks if the current position of the Unit is equal to the target position that the future is supposed to move towards. If so, the future is considered ready and the Poll::Ready(()) value is returned.

If the current position of the Unit is not equal to the target position, the future updates the position of the Unit by borrowing it mutably with self.unit.borrow_mut().pos and adding or subtracting 1 based on the sign of the difference between the current and target positions. Finally, the future returns Poll::Pending to indicate that it is not yet ready to complete.

Overall, the poll function is used to check the current state of a future and either return a value indicating that the future has completed or indicate that it needs to continue executing and can be polled again later.

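use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::task::{Context, Poll};

#[derive(Default)]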
struct Unit {
    /// The 1-D position of the unit. In a real game, it would be 2D or 3D.
    pub pos: i32,
}
type UnitRef = Rc<RefCell<Unit>>;

/// A future that will move the unit towards `target_pos` at each step,
/// and complete when the unit has reached that position.
struct UnitGotoFuture {
    unit: UnitRef,
    target_pos: i32,
}
impl Future for UnitGotoFuture {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context) -> Poll<Self::Output> {
        let unit_pos = self.unit.borrow().pos;
        if unit_pos == self.target_pos {
            Poll::Ready(())
        } else {
            self.unit.borrow_mut().pos += (self.target_pos - unit_pos).signum();
            Poll::Pending
        }
    }
}

/// Helper async function to write unit behavior nicely
async fn goto(unit: UnitRef, pos: i32) {
    UnitGotoFuture {
        unit,
        target_pos: pos,
    }
    .await;
}
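To see the polling behaviour without a full executor, the listing can be driven by hand. The following is a minimal sketch: drive_goto and NoopWaker are hypothetical helpers, not part of the example project, and the waker does nothing because the future never uses it.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

#[derive(Default)]
struct Unit {
    pos: i32,
}
type UnitRef = Rc<RefCell<Unit>>;

struct UnitGotoFuture {
    unit: UnitRef,
    target_pos: i32,
}

impl Future for UnitGotoFuture {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        let unit_pos = self.unit.borrow().pos;
        if unit_pos == self.target_pos {
            Poll::Ready(())
        } else {
            self.unit.borrow_mut().pos += (self.target_pos - unit_pos).signum();
            Poll::Pending
        }
    }
}

async fn goto(unit: UnitRef, pos: i32) {
    UnitGotoFuture { unit, target_pos: pos }.await;
}

/// Hypothetical waker that does nothing; enough for a busy-polling loop.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical driver: polls `goto` until it is Ready.
/// Returns (number of polls, final position of the unit).
fn drive_goto(target: i32) -> (u32, i32) {
    let unit: UnitRef = Rc::new(RefCell::new(Unit::default()));
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(goto(unit.clone(), target));
    let mut polls = 0;
    loop {
        polls += 1;
        if fut.as_mut().poll(&mut cx).is_ready() {
            break;
        }
    }
    let pos = unit.borrow().pos;
    (polls, pos)
}
```

Calling drive_goto(3) takes four polls: three that each move the unit one step and return Pending, and a final one that finds the unit at the target and returns Ready.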

Desugaring the async example

Before we delve into the details of how the goto async function is implemented in assembly, let's look at equivalent non-async Rust code that implements the same functionality. This will make the assembly code easier to understand.

The await on UnitGotoFuture splits the goto function into states that may be modeled using an enum that saves the execution point and resumes from the saved point when the executor calls the future's poll function.

// The state machine enum defines three states for the goto function:
// 1. Start: The initial state, where the function is called with the unit and target position.
// 2. Waiting: The state where the function is waiting for the UnitGotoFuture to complete.
// 3. Done: The final state, where the function has completed.
#[repr(u8)]
enum GotoFuture {
    // 🚀 Initial state
    Start(UnitRef, i32) = 0,
    // 🕓 Waiting for UnitGotoFuture to complete
    Waiting(UnitGotoFuture) = 3,
    // ✅ Final state
    Done = 1,
}

// Implementing Future for GotoFuture
impl Future for GotoFuture {
    type Output = ();

    // The Future's poll method will be called from the async executor to check if the future is ready, and the execution can continue.
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // The loop is used to transition between states
        loop {
            match &mut *self {
                // 🚀 Start (0): In the start state, create a UnitGotoFuture and move to the waiting state
                GotoFuture::Start(unit, pos) => {
                    let fut = UnitGotoFuture {unit: unit.clone(), target_pos: *pos };
                    *self = GotoFuture::Waiting(fut);
                }
                // 🕓 Waiting (3): In the waiting state, poll the UnitGotoFuture and move to the done state if it's ready
                GotoFuture::Waiting(fut) => {
                    match Pin::new(fut).poll(cx) {
                        Poll::Ready(()) => *self = GotoFuture::Done,
                        Poll::Pending => return Poll::Pending,
                    }
                }
                // ✅ Done (1) : In the done state, return ready
                GotoFuture::Done => return Poll::Ready(()),
            }
        }
    }
}

// The original async function is equivalent to creating a new GotoFuture instance in the start state
fn goto(unit: UnitRef, pos: i32) -> impl Future<Output = ()> {
    GotoFuture::Start(unit, pos)
}

The GotoFuture enum defines three states that correspond to the three stages of the async function's execution:

  1. Start: The initial state, where the function is called with the unit and target position. It holds the unit reference and the target position, waiting to transition to the next state.
  2. Waiting: The state where the function is waiting for the UnitGotoFuture to complete. It holds a UnitGotoFuture instance and polls it repeatedly until it returns Poll::Ready(()), indicating that it has completed its work.
  3. Done: The final state, where the function has completed. It does not hold any additional information and immediately returns Poll::Ready(()) when polled.
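The three states can be observed by polling such a state machine by hand and recording its state after every poll. The sketch below is self-contained and simplified: it omits the explicit discriminants, and state_name, trace_states and NoopWaker are hypothetical helpers added only for tracing.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct Unit {
    pos: i32,
}
type UnitRef = Rc<RefCell<Unit>>;

struct UnitGotoFuture {
    unit: UnitRef,
    target_pos: i32,
}

impl Future for UnitGotoFuture {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        let unit_pos = self.unit.borrow().pos;
        if unit_pos == self.target_pos {
            Poll::Ready(())
        } else {
            self.unit.borrow_mut().pos += (self.target_pos - unit_pos).signum();
            Poll::Pending
        }
    }
}

enum GotoFuture {
    Start(UnitRef, i32),
    Waiting(UnitGotoFuture),
    Done,
}

impl GotoFuture {
    /// Hypothetical helper: the name of the current state, for tracing only.
    fn state_name(&self) -> &'static str {
        match self {
            GotoFuture::Start(..) => "Start",
            GotoFuture::Waiting(..) => "Waiting",
            GotoFuture::Done => "Done",
        }
    }
}

impl Future for GotoFuture {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        loop {
            match &mut *self {
                GotoFuture::Start(unit, pos) => {
                    let fut = UnitGotoFuture { unit: unit.clone(), target_pos: *pos };
                    *self = GotoFuture::Waiting(fut);
                }
                GotoFuture::Waiting(fut) => match Pin::new(fut).poll(cx) {
                    Poll::Ready(()) => *self = GotoFuture::Done,
                    Poll::Pending => return Poll::Pending,
                },
                GotoFuture::Done => return Poll::Ready(()),
            }
        }
    }
}

struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical driver: records the state before the first poll and after each poll.
fn trace_states(target: i32) -> Vec<&'static str> {
    let unit: UnitRef = Rc::new(RefCell::new(Unit { pos: 0 }));
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = GotoFuture::Start(unit, target);
    let mut trace = vec![fut.state_name()];
    loop {
        let done = Pin::new(&mut fut).poll(&mut cx).is_ready();
        trace.push(fut.state_name());
        if done {
            break;
        }
    }
    trace
}
```

For a target of 2 the trace is Start, Waiting, Waiting, Done. For a target of 0 the unit is already in place, so a single poll goes from Start straight through to Done.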

We will see shortly that the compiler-generated code for the goto function is similar to the state machine we just described. It uses a closure to implement the state machine and a state variable to track the current state.

Understanding the generated assembly

Now that we have a better understanding of the async function's state machine, let's look at the assembly code generated by the compiler for the goto async function.

Closure state machine

The compiler generates a closure that is used to implement the state machine. The closure is then wrapped in a struct that implements the Future trait. The Future trait's poll method is implemented by calling the closure. The Future struct is then returned by the goto function.

The closure also contains a state variable that is used to track the current state of the state machine. The state variable is initialized to 0 in the Start state. When the poll method is called, the closure checks the state variable and takes appropriate actions based on the current state. The full state machine is shown in the following state diagram.

Async closure state machine

The following assembly shows how the closure checks the state variable and uses a jump table to jump to the appropriate block of code based on the current state. The jump table switch happens at the start of the closure.

    movzx	eax, byte ptr [rdi + 28]    ; Load the state to determine which block to execute.
                                            ; The state is stored at offset 28 in the closure environment.
    lea	rcx, [rip + .LJTI57_0]              ; Load the address of the jump table into rcx.
                                            ; The jump table is a list of offsets from the
                                            ; start of the jump table to each block.
    movsxd	rax, dword ptr [rcx + 4*rax]; Get the jump offset from the entry
                                            ; corresponding to the state. The index in rax is
                                            ; multiplied by 4 because the jump table is an
                                            ; array of 32-bit jump offsets indexed by the state.
    add	rax, rcx                            ; Add the jump offset to the start
                                            ; of the jump table to get the address of the block to execute.
    jmp	rax                                 ; Jump to the block to execute.

Here is the compiler-generated jump table. The jump table contains the offsets from the start of the jump table to each block.

.LJTI57_0:
    .long	.LBB57_4-.LJTI57_0 ; 🚀 Start (0) : Entry point to the goto closure.
    .long	.LBB57_3-.LJTI57_0 ; ✅ Done (1) : Throw a panic if polled after completion of the future.
    .long	.LBB57_2-.LJTI57_0 ;     UnitState2 (2) : Throw a panic if polled after an earlier poll panicked.
    .long	.LBB57_1-.LJTI57_0 ; 🕓 Waiting (3) : Future is pending.
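In Rust terms, the dispatch behaves roughly like indexing an array of function pointers with the state byte. The following is only a model under that assumption: the handler names and the strings they return are hypothetical, and the real table stores 32-bit offsets relative to the table rather than pointers.

```rust
/// Hypothetical handlers standing in for the four assembly blocks.
fn start() -> &'static str {
    "run from the beginning"
}
fn done() -> &'static str {
    "panic: resumed after completion"
}
fn unit_state2() -> &'static str {
    "panic: resumed after panicking"
}
fn waiting() -> &'static str {
    "resume at the await point"
}

/// One entry per state, in the same order as .LJTI57_0.
const JUMP_TABLE: [fn() -> &'static str; 4] = [start, done, unit_state2, waiting];

/// Model of the dispatch: load the state byte, index the table, jump.
fn dispatch(state: u8) -> &'static str {
    JUMP_TABLE[state as usize]()
}
```

The order of the entries matters: state 0 is Start, state 1 is Done, state 2 is UnitState2 and state 3 is Waiting, which is why Waiting carries the discriminant 3 in the hand-written enum.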

Wrapping a closure in a future

In this section we will examine the generated assembly for the goto function. The goto function just returns the Future object. Calling the goto function does not execute the async function. The async function is executed when the Future object is awaited.

The following code shows the Rust equivalent of the generated assembly for the goto async function. The poll_fn function in std::future creates a future that wraps a closure; the closure provides the implementation of the Future trait's poll method.

fn goto(unit: UnitRef, target_pos: i32) -> impl Future<Output = ()> {
    poll_fn(goto_closure)
}
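A runnable version of this idea needs a concrete closure. The sketch below writes the movement logic directly inside a closure passed to std::future::poll_fn. It is an approximation of what the compiler generates, not the generated code itself, and lazy_then_complete and NoopWaker are hypothetical helpers.

```rust
use std::cell::RefCell;
use std::future::{poll_fn, Future};
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct Unit {
    pos: i32,
}
type UnitRef = Rc<RefCell<Unit>>;

/// Sketch of `goto` built on poll_fn; the closure plays the role of goto_closure.
fn goto(unit: UnitRef, target_pos: i32) -> impl Future<Output = ()> {
    poll_fn(move |_cx: &mut Context<'_>| {
        let unit_pos = unit.borrow().pos;
        if unit_pos == target_pos {
            Poll::Ready(())
        } else {
            unit.borrow_mut().pos += (target_pos - unit_pos).signum();
            Poll::Pending
        }
    })
}

struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical check: returns the unit position right after calling `goto`
/// (before any poll) and the position after polling to completion.
fn lazy_then_complete(target: i32) -> (i32, i32) {
    let unit: UnitRef = Rc::new(RefCell::new(Unit { pos: 0 }));
    let mut fut = Box::pin(goto(unit.clone(), target));
    // Nothing has run yet: creating the future does not move the unit.
    let pos_before_poll = unit.borrow().pos;
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    while fut.as_mut().poll(&mut cx).is_pending() {}
    let pos_after = unit.borrow().pos;
    (pos_before_poll, pos_after)
}
```

The first value stays at 0 for any target, which shows that calling goto only builds the future. The unit starts moving when the future is polled.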

The goto async function just returns the Future struct that wraps the goto::{{closure}}. The contents of the returned Future are shown below. They are essentially the closure environment of goto::{{closure}}. The returned closure environment contains the captured parameters at offsets 16 and 24. The state variable is stored at offset 28. The closure environment also holds the local variables that store intermediate results.

Async closure environment

The following function shows the assembly code of the goto async function. We see that the unit and target_pos parameters are being stored at offsets 16 and 24 respectively. The state variable is being initialized to 0 (Start) at offset 28 in the Future. From the assembly, we see that no code from the body of the async function has been executed yet.

; Input:
;   rdi: goto::{{closure}} environment
;   rsi: unit
;   rdx: target_pos
; Output:
;   rax: goto::{{closure}} environment/Future
playground::goto:
    mov	rax, rdi ; Set the return value (rax) to the future
    mov	qword ptr [rdi + 16], rsi ; Save the unit.
    mov	dword ptr [rdi + 24], edx ; Save the target_pos 
    mov	byte ptr [rdi + 28], 0 ; 🚀 Start (0)
    ret ; Return
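The offsets in this listing can be modelled with a #[repr(C)] struct. This is a hypothetical reconstruction for a 64-bit target: the field names are invented, usize stands in for the Rc pointer, and the fields at offsets 0 and 8 are the slots that the closure listing later in the article uses for the awaited UnitGotoFuture. The compiler is free to choose a different layout.

```rust
/// Hypothetical model of the closure environment; field names are illustrative.
#[repr(C)]
struct GotoEnv {
    awaited_unit: usize,     // offset 0: unit of the awaited UnitGotoFuture
    awaited_target_pos: i32, // offset 8: target_pos of the awaited UnitGotoFuture
    unit: usize,             // offset 16: captured `unit` parameter
    target_pos: i32,         // offset 24: captured `target_pos` parameter
    state: u8,               // offset 28: state variable
}

/// Returns the byte offset of `field` inside `env`.
fn offset_in<T>(env: &GotoEnv, field: &T) -> usize {
    field as *const T as usize - env as *const GotoEnv as usize
}

/// Returns (offset of unit, offset of target_pos, offset of state, size, alignment).
fn layout() -> (usize, usize, usize, usize, usize) {
    let env = GotoEnv {
        awaited_unit: 0,
        awaited_target_pos: 0,
        unit: 0,
        target_pos: 0,
        state: 0,
    };
    (
        offset_in(&env, &env.unit),
        offset_in(&env, &env.target_pos),
        offset_in(&env, &env.state),
        std::mem::size_of::<GotoEnv>(),
        std::mem::align_of::<GotoEnv>(),
    )
}
```

On a 64-bit target this model places the fields at offsets 16, 24 and 28 and has a size of 32 bytes with 8-byte alignment, matching the stores in the assembly.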

Role of the async executor

Rust requires an async executor to run async functions. The executor is responsible for polling the future returned by the async function. The following sequence diagram shows how the executor polls the future returned by the goto async function. The executor calls the poll method of the Future trait, which calls the goto::{{closure}} closure. The closure checks the state variable and executes the block of code for the current state. It then updates the state variable and returns a Poll value, which the poll method passes back to the executor.

Async closure sequence diagram
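The polling loop at the heart of such an executor can be sketched in a few lines. This is not the code of the simple async local executor; run_all, LocalFuture, NoopWaker and demo are hypothetical names, and the sketch simply polls every unfinished future once per round until all of them are done.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct Unit {
    pos: i32,
}
type UnitRef = Rc<RefCell<Unit>>;

struct UnitGotoFuture {
    unit: UnitRef,
    target_pos: i32,
}

impl Future for UnitGotoFuture {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        let unit_pos = self.unit.borrow().pos;
        if unit_pos == self.target_pos {
            Poll::Ready(())
        } else {
            self.unit.borrow_mut().pos += (self.target_pos - unit_pos).signum();
            Poll::Pending
        }
    }
}

async fn goto(unit: UnitRef, pos: i32) {
    UnitGotoFuture { unit, target_pos: pos }.await;
}

struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

type LocalFuture = Pin<Box<dyn Future<Output = ()>>>;

/// Hypothetical executor loop: polls each unfinished future once per round.
/// Returns the number of rounds needed to finish all of them.
fn run_all(mut tasks: Vec<LocalFuture>) -> u32 {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut rounds = 0;
    while !tasks.is_empty() {
        rounds += 1;
        // Keep only the futures that are still pending after this round.
        tasks.retain_mut(|task| task.as_mut().poll(&mut cx).is_pending());
    }
    rounds
}

/// Moves two units at the same time; returns (rounds, position of a, position of b).
fn demo() -> (u32, i32, i32) {
    let a: UnitRef = Rc::new(RefCell::new(Unit { pos: 0 }));
    let b: UnitRef = Rc::new(RefCell::new(Unit { pos: 0 }));
    let mut tasks: Vec<LocalFuture> = Vec::new();
    tasks.push(Box::pin(goto(a.clone(), 2)));
    tasks.push(Box::pin(goto(b.clone(), -4)));
    let rounds = run_all(tasks);
    let (pos_a, pos_b) = (a.borrow().pos, b.borrow().pos);
    (rounds, pos_a, pos_b)
}
```

Both units advance one step per round on a single thread. The first future finishes after three rounds and the second after five, so run_all returns 5.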

Flow chart of the generated assembly of the goto closure

Now that we have a better understanding of the async function's state machine and async executors, let's look at a high-level flow chart of the generated assembly code. We see that the compiler has inlined the poll for the UnitGotoFuture future into the goto::{{closure}}. We note that the goto closure can also panic if the RefCell borrowing fails.

Async closure flow chart
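The borrow failures that lead to these panic paths can be reproduced with a plain RefCell. The sketch below uses the non-panicking try_borrow and try_borrow_mut methods to show when a borrow would be refused; the function names are hypothetical.

```rust
use std::cell::RefCell;

/// While a shared borrow is held: can we borrow again (shared, mutable)?
fn borrow_while_shared() -> (bool, bool) {
    let cell = RefCell::new(0_i32);
    let _reader = cell.borrow();
    let shared_ok = cell.try_borrow().is_ok();
    let mutable_ok = cell.try_borrow_mut().is_ok();
    (shared_ok, mutable_ok)
}

/// While a mutable borrow is held: can we borrow again (shared, mutable)?
fn borrow_while_mutable() -> (bool, bool) {
    let cell = RefCell::new(0_i32);
    let _writer = cell.borrow_mut();
    let shared_ok = cell.try_borrow().is_ok();
    let mutable_ok = cell.try_borrow_mut().is_ok();
    (shared_ok, mutable_ok)
}
```

The poll function uses borrow() and borrow_mut(), which panic in exactly the cases where the try_ variants return an error. In this example nothing else holds a borrow while poll runs, so those paths are never taken, but the compiler still has to emit them.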

Generated assembly of the goto::{{closure}}

The assembly code implements the body of the async function as goto::{{closure}}. The function receives a mutable reference to the closure environment and a Context object, and returns Poll::Pending or Poll::Ready.

The code has been annotated with comments to explain the assembly code. The comments are prefixed with the state of the state machine. The state machine has four states: Start (0), Waiting (3), Done (1), and UnitState2 (2). The Start (0) state is the entry point to the function. The Waiting (3) state is the state where the future is pending. The Done (1) state is the state where the future is completed. The UnitState2 (2) state marks a future that panicked during an earlier poll: the unwind path at the end of the listing stores 2 in the state variable, and polling the future again panics with "`async fn` resumed after panicking".
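The panic in the Done (1) state can be triggered on purpose by polling a compiler-generated future once more after it has returned Poll::Ready. The sketch below uses a trivial async function instead of goto to stay short. The function names and NoopWaker are hypothetical, and it assumes the default panic strategy (unwinding), since catch_unwind cannot catch anything under panic = "abort".

```rust
use std::future::Future;
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::Arc;
use std::task::{Context, Wake, Waker};

struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical async function that completes on its first poll.
async fn finishes_immediately() {}

/// Polls the future to completion, then polls it once more.
/// Returns true if the extra poll panicked.
fn poll_after_completion_panics() -> bool {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(finishes_immediately());

    // First poll: the future runs to the end and moves to its Done state.
    let first = fut.as_mut().poll(&mut cx);
    assert!(first.is_ready());

    // Second poll: the state machine is already done, so it panics.
    let second = catch_unwind(AssertUnwindSafe(|| {
        let _ = fut.as_mut().poll(&mut cx);
    }));
    second.is_err()
}
```

Well-behaved executors never do this, which is why the Done (1) block in the listing is described as one that should never be reached.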

Here is a high level description of the generated assembly code. This will help us understand the assembly code better.

The closure starts with a dispatch sequence that jumps to the appropriate block of code depending on the current state of the goto::{{closure}}. It loads the current state of the future from the closure environment and uses it to look up the corresponding block of code in a jump table.

There are three blocks of code in the function. The Start (0) state loads the unit and target_pos parameters from offsets 16 and 24 of the closure environment into registers. It then saves them at offsets 0 and 8 of the closure environment and jumps into the code that it shares with the Waiting (3) state.

The Waiting (3) state loads the Unit field from the closure environment and calls the inlined UnitGotoFuture::poll logic on it. It calculates the difference between the unit_pos and the target_pos and updates the unit_pos accordingly. If the unit has already been borrowed, the function jumps to an error handler. If the unit has reached the target_pos, it sets the state to Done (1). Otherwise, it returns Poll::Pending.
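The position update is compiled without a branch. The sketch below places the source-level signum step next to a Rust mirror of the setg, lea and dec instructions from the listing, and checks that both agree whenever the two positions differ. The function names are hypothetical.

```rust
/// The step as written in the Rust source.
fn step_with_signum(unit_pos: i32, target_pos: i32) -> i32 {
    unit_pos + (target_pos - unit_pos).signum()
}

/// Mirror of the generated instructions. Only valid when the positions differ,
/// because the equal case is handled before this code is reached.
fn step_branchless(unit_pos: i32, target_pos: i32) -> i32 {
    let diff = target_pos - unit_pos; // sub ecx, ebx
    let greater = (diff > 0) as i32; // setg al
    unit_pos + 2 * greater - 1 // lea eax, [rbx + 2*rax] ; dec eax
}

/// Checks that both versions agree for all distinct positions in lo..=hi.
fn versions_agree(lo: i32, hi: i32) -> bool {
    (lo..=hi).all(|u| {
        (lo..=hi)
            .filter(|&t| t != u)
            .all(|t| step_with_signum(u, t) == step_branchless(u, t))
    })
}
```

When the target is ahead, the result is unit_pos + 2 - 1, a step of +1. When the target is behind, it is unit_pos + 0 - 1, a step of -1.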

After transitioning to the Done (1) state, the closure decrements the strong reference count of the unit. If the strong reference count reaches zero, it also decrements the weak reference count, and if that reaches zero as well, it frees the memory associated with the unit using the __rust_dealloc function. The function then returns Poll::Ready to indicate that the future has completed.
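The reference count decrement can be observed from safe Rust with Rc::strong_count. In the sketch below, strong_counts and NoopWaker are hypothetical helpers; the rest repeats the article's definitions so that the example is self-contained.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct Unit {
    pos: i32,
}
type UnitRef = Rc<RefCell<Unit>>;

struct UnitGotoFuture {
    unit: UnitRef,
    target_pos: i32,
}

impl Future for UnitGotoFuture {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        let unit_pos = self.unit.borrow().pos;
        if unit_pos == self.target_pos {
            Poll::Ready(())
        } else {
            self.unit.borrow_mut().pos += (self.target_pos - unit_pos).signum();
            Poll::Pending
        }
    }
}

async fn goto(unit: UnitRef, pos: i32) {
    UnitGotoFuture { unit, target_pos: pos }.await;
}

struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Returns the strong count of the unit before the first poll and after completion.
fn strong_counts() -> (usize, usize) {
    let unit: UnitRef = Rc::new(RefCell::new(Unit { pos: 0 }));
    let mut fut = Box::pin(goto(unit.clone(), 1));
    // The future owns the clone that was passed to goto.
    let before = Rc::strong_count(&unit);
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    while fut.as_mut().poll(&mut cx).is_pending() {}
    // The completed future has already dropped its clone, although `fut` is still alive.
    let after = Rc::strong_count(&unit);
    (before, after)
}
```

The count drops from 2 to 1 as soon as the future returns Poll::Ready, not when the future itself is dropped. Here the caller still holds a reference, so the memory is not freed.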

; Input:
;   rdi: Closure environment
;   rsi: &mut Context
; Output:
;   rax: Poll<()>

playground::goto::{{closure}}:
    push	rbp
    push	r15
    push	r14
    push	rbx
    push	rax
    mov	r15, rdi
    movzx	eax, byte ptr [rdi + 28] ; Load the state to determine which block to execute.
                                     ; The state is stored at offset 28 in the closure environment.
    lea	rcx, [rip + .LJTI57_0]       ; Load the address of the jump table into rcx. The jump table is a list of offsets
                                     ; from the start of the jump table to each block.
    movsxd	rax, dword ptr [rcx + 4*rax] ; Get the jump offset from the entry corresponding to the state.
                                         ; The index in rax is multiplied by 4 because the jump table is an array
                                         ; of 32-bit jump offsets indexed by the state.
    add	rax, rcx                         ; Add the jump offset to the start of the jump table to get the
                                         ; address of the block to execute.
    jmp	rax                              ; Jump to the block to execute.

; == 🚀 Start (0) block entry point ==
; The caller of the async function is responsible for setting the state to 0 and initializing the closure environment.

.LBB57_4:
    mov	rdi, qword ptr [r15 + 16] ; Load the unit from the closure environment into rdi
    mov	eax, dword ptr [r15 + 24] ; Load the target_pos from the closure environment into eax
    mov	qword ptr [r15], rdi ; Save the unit in the closure environment
    mov	dword ptr [r15 + 8], eax ; Save the target_pos in the closure environment
    jmp	.LBB57_5

; == 🕓 Waiting (3) block entry point ==
; Resume the future after a poll that returned Pending

.LBB57_1: 
    mov	rdi, qword ptr [r15] ; Load the unit from the closure environment into rdi

.LBB57_5:
    ; Inlined call to UnitGotoFuture::poll
    mov	rax, qword ptr [rdi + 16] ; Load the borrow flag from unit's RefCell into rax
    movabs	rcx, 9223372036854775807 ; Load the maximum signed 64-bit integer into rcx

    cmp	rax, rcx ; Compare the borrow flag with the maximum signed 64-bit value.

    jae	.LBB57_6 ; If the borrow flag, compared as an unsigned value, is greater than or equal to
                 ; the maximum signed 64-bit value, jump to .LBB57_6 as the unit cannot be borrowed.

    ;  The unit can be borrowed.
    
    mov	ebx, dword ptr [rdi + 24] ; Load the unit_pos from unit into ebx
    mov	ebp, dword ptr [r15 + 8]  ; Load the target_pos from the closure environment into ebp
    mov	ecx, ebp ; Set ecx to target_pos
    sub	ecx, ebx ; Subtract unit_pos from target_pos and store the result in ecx
    jne	.LBB57_8 ; If the difference is not equal to 0, jump as the unit has not reached the target position
    dec	qword ptr [rdi] ; Decrement the strong reference count of the unit
    mov	r14b, 1 ; Set the state to ✅ Done (1)
    jne	.LBB57_17 ; Check if the strong reference count is not equal to 0. If it is not equal to 0, jump to .LBB57_17

    ; ♻️ Freeing Rc memory as the reference count is 0.
    dec	qword ptr [rdi + 8] ; Decrement the weak reference count of the unit
    jne	.LBB57_17 ; Jump if the weak reference count is not equal to 0. 
    mov	esi, 32 ; Set the size of the memory to free to 32
    mov	edx, 8 ; Set the alignment of the memory to free to 8
    call	qword ptr [rip + __rust_dealloc@GOTPCREL] ; Call __rust_dealloc to free the memory
    jmp	.LBB57_17

.LBB57_8:
    test	rax, rax ; Check if the borrow flag is 0.
    jne	.LBB57_9 ; If the borrow flag is not 0, jump to .LBB57_9 as the unit has already been borrowed.
    xor	eax, eax ; Set eax to 0.
    ; signum function inlined - begin
    test	ecx, ecx ; Test the sign of the difference (target_pos - unit_pos) held in ecx
    setg	al ; Set the value of al to 1 if the difference is positive (target_pos is greater than unit_pos)
               ; Set the value of al to 0 otherwise (unit_pos is greater than target_pos, as equality was handled earlier)
    lea	eax, [rbx + 2*rax] ; eax = unit_pos + 2 * al: add 2 to unit_pos if unit_pos is less than target_pos
    dec	eax ; Subtract 1 from the result of the previous addition (net change of +1 or -1, the signum of the difference)
    ; signum function inlined - end

    mov	dword ptr [rdi + 24], eax ; Save the new unit_pos in the unit
    mov	qword ptr [rdi + 16], 0 ; Set the borrow flag to 0.
    mov	r14b, 3 ; Set the state to 🕓 Waiting (3)

.LBB57_17:
    cmp	ebp, ebx ; Compare unit_pos with target_pos
    setne	al ; Set the value of al to 1 (future not ready) if unit_pos is not equal to target_pos
               ; Set the value of al to 0 (future ready) if unit_pos is equal to target_pos	
    mov	byte ptr [r15 + 28], r14b ; Save the state in the closure environment
    add	rsp, 8 ; Free local variables.
    pop	rbx
    pop	r14
    pop	r15
    pop	rbp
    ret ; Return the future status.

.LBB57_6:
; 💀 Prepare the panic message.
    lea	r8, [rip + .L__unnamed_21] 
    lea	rcx, [rip + .L__unnamed_20] ; drop_in_place
    mov	esi, 24 ; Length of the panic message in bytes
    lea	rdi, [rip + .L__unnamed_19] ; Load the 24-byte message "already mutably borrowed".
    jmp	.LBB57_10

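; == UnitState2 (2) block entry point ==
; Reached only if the future is polled again after an earlier poll panicked.

.LBB57_2: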
    lea	rdi, [rip + str.1] ; Load address of string "`async fn` resumed after panicking"
    lea	rdx, [rip + .L__unnamed_23]
    mov	esi, 34
    call	qword ptr [rip + core::panicking::panic@GOTPCREL]
    ud2

; == ✅ Done (1) block entry point: ==
; This block should never be reached as the future has already completed.

.LBB57_3:
    lea	rdi, [rip + str.2] ; Load address of string "`async fn` resumed after completion"
    lea	rdx, [rip + .L__unnamed_23]
    mov	esi, 35
    call	qword ptr [rip + core::panicking::panic@GOTPCREL]
    ud2

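.LBB57_9:
; 💀 Prepare the panic message for the failed mutable borrow.
    lea	r8, [rip + .L__unnamed_22]
    lea	rcx, [rip + .L__unnamed_8]
    mov	esi, 16 ; Length of the panic message in bytes
    lea	rdi, [rip + .L__unnamed_7] ; Load the 16-byte message (presumably "already borrowed").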

.LBB57_10:
    mov	rdx, rsp
    call	qword ptr [rip + core::result::unwrap_failed@GOTPCREL]
    ud2
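    ; Unwind path: runs if a panic unwinds through the closure.
    mov	r14, rax ; Save the exception object
    mov	rdi, qword ptr [r15]
    call	core::ptr::drop_in_place<playground::UnitGotoFuture> ; Drop the awaited UnitGotoFuture
    mov	byte ptr [r15 + 28], 2 ; Set the state to UnitState2 (2)
    mov	rdi, r14
    call	_Unwind_Resume@PLT ; Continue unwinding
    ud2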

.LJTI57_0:
    .long	.LBB57_4-.LJTI57_0 ; 🚀 Start (0) : Entry point to the goto closure.
    .long	.LBB57_3-.LJTI57_0 ; ✅ Done (1) : Throw a panic if polled after completion of the future.
    .long	.LBB57_2-.LJTI57_0 ;     UnitState2 (2) : Throw a panic if polled after an earlier poll panicked.
    .long	.LBB57_1-.LJTI57_0 ; 🕓 Waiting (3) : Future is pending.

; 🗂️ Vtable for the goto closure. 
.L__unnamed_26:
	.quad	core::ptr::drop_in_place<playground::goto::{{closure}}>     ; Destructor for the FnOnce trait object
	.asciz	" \000\000\000\000\000\000\000\b\000\000\000\000\000\000"   ; Size of object: 32 bytes (Leading space)
                                                                        ; Alignment of the Future trait object: 8 bytes (\b)
	.quad	playground::goto::{{closure}}                               ; call_once method of the FnOnce trait 

Key takeaways

Articles in the async/await series

  1. Desugaring and assembly of async/await in Rust - goto
  2. Nested async/await in Rust: Desugaring and assembly - patrol
  3. Rust async executor - executor