Map Rust vector iteration to assembly
Vectors in Rust contain a pointer to a heap vector buffer (buf
) and a length (len
). The vector buffer is a pointer to a buffer of the type T
. The length is the number of elements in the vector. Here is an excerpt from the std::vec::Vec
implementation:
pub struct Vec<T, #[unstable(feature = "allocator_api", issue = "32838")] A: Allocator = Global> {
// Points to the start of the vector data and keeps the capacity information
buf: RawVec<T, A>,
// Length of the vector
len: usize,
}
Sample vector handling code
Let us explore the Rust to assembly mapping of the Vec
struct using a simple function, increment_by
that increments the elements of a vector by a specified value (num
). The sample also includes increment_example
that calls the increment_by
function.
// Increment each element of a i64 vector by num. A mutable
// reference to the vector is passed in.
pub fn increment_by(num: i64, list: &mut Vec<i64>) {
// Iterate over the vector and increment each element.
for item in list {
// Note that the item needs to be dereferenced
*item += num;
}
}
pub fn increment_example(inc_by: i64, array: [i64; 4]) -> Vec<i64> {
let mut list = array.to_vec();
increment_by(inc_by, &mut list);
list
}
Assembly code for increment_by
The generated assembly code for the increment_by
function is shown below. The generated assembly code seems quite complex. On close examination, it becomes clear that the code is optimized for operating on vectors of wildly different lengths.
Let us look at the assembly code that does the heavy lifting for a vector of a certain length. Once we understand these fragments, we will look at the complete assembly to understand how all other lengths are handled by combining these fragments.
Vector length less than 4
For vectors that are shorter than 4, the generated assembly code loops over each entry in the vector and increments it by the specified value.
.LBB0_11:
add qword ptr [rax], rdi ; Add num to the next entry in the vector
add rax, 8 ; Move to the next entry in the vector
cmp rax, rcx ; Check if we have reached the end of the vector
jne .LBB0_11 ; Jump if we have not reached the end of the vector
Vector length is 4
If the vector length is 4, the compiler unrolls the loop and generates the following assembly code. Vector instructions are used to perform two 64-bit operations in parallel. The movq
instruction is used to move the vector data from the vector buffer to a xmm
register. The addq
instruction is used to add the specified value to the vector data. The movq
instruction is then used to move the vector data back to the vector buffer.
.LBB0_7:
movdqu xmm1, xmmword ptr [rcx + 8*rsi] ; Load two 64-bit numbers from the vector
movdqu xmm2, xmmword ptr [rcx + 8*rsi + 16] ; Load two more 64-bit numbers from the vector
paddq xmm1, xmm0 ; Add the increment value to the first two 64-bit number
paddq xmm2, xmm0 ; Add the increment value to the 3rd and 4th of 64-bit number
movdqu xmmword ptr [rcx + 8*rsi], xmm1 ; Store the result back in the vector (1st and 2nd number)
movdqu xmmword ptr [rcx + 8*rsi + 16], xmm2 ; Store the result back in the vector (3rd and 4th number)
Vector length is a multiple of 8
For vectors with lengths that are a multiple of 8, the compiler handles 8 64-bit operations per iteration. The 8 64-bit numbers are copied to the xmm
registers using four movdqu
instructions. Four addq
instructions perform the 8 additions. Finally, the four movdqu
instructions copy the results back to the vector buffer.
.LBB0_5:
movdqu xmm1, xmmword ptr [rcx + 8*rsi] ; Load the first two 64-bit number
movdqu xmm2, xmmword ptr [rcx + 8*rsi + 16] ; Load the third and fourth 64-bit number
movdqu xmm3, xmmword ptr [rcx + 8*rsi + 32] ; Load the fifth and sixth 64-bit number
movdqu xmm4, xmmword ptr [rcx + 8*rsi + 48] ; Load the seventh and eighth 64-bit number
paddq xmm1, xmm0 ; Add the increment value to the first two 64-bit number
paddq xmm2, xmm0 ; Add the increment value to the third and fourth 64-bit number
movdqu xmmword ptr [rcx + 8*rsi], xmm1 ; Store the first two 64-bit number
movdqu xmmword ptr [rcx + 8*rsi + 16], xmm2 ; Store the third and fourth 64-bit number
paddq xmm3, xmm0 ; Add the increment value to the fifth and sixth 64-bit number
paddq xmm4, xmm0 ; Add the increment value to the seventh and eighth 64-bit number
movdqu xmmword ptr [rcx + 8*rsi + 32], xmm3 ; Store the fifth and sixth 64-bit number
movdqu xmmword ptr [rcx + 8*rsi + 48], xmm4 ; Store the seventh and eighth 64-bit number
add rsi, 8 ; Increment the 64-bit numbers index
add rax, -2 ; Decrement the count by 2 as 8 64-bit numbers have been processed.
jne .LBB0_5 ; Jump if there are still quad 64-bit numbers to process
test r10b, 1 ; Check if there are still entries to process
je .LBB0_8 ; Jump if less than 4 entries need to be processed.
Handling vectors of arbitrary length
Now that we have analyzed the special cases, let us look at the code below that handles vectors of arbitrary length. The code uses the assembly code blocks we described above. The full code of the function combines the three code blocks to handle vectors of arbitrary length. For example, a vector of length 41 will be handled as.
- Iterate until index 31 (32 entries = 4 * 8) with four loops through that process 8 operations per loop (4 sets of vector operations). The
.LBB0_5
label shows the code that performs the four loops. - Iterate until index 35 (32 + 4 entries) with the code that processes 4 operations with two vector operations
.LBB0_7
label shows this in the assembly of the function. - Iterate until index 38 (36 + 3 entries) with the code that processes one iteration per loop. Refer to
.LBB0_11
label in the following code.
; rsi contains the address of the vector to be incremented.
example::increment_by:
mov r9, qword ptr [rsi + 16] ; r9 now contains the length of the vector
test r9, r9 ; Check if the vector is empty
je .LBB0_12 ; Jump if the vector is empty
mov rcx, qword ptr [rsi] ; rcx now contains the address of the vector data
lea rax, [r9 - 1] ; rax now contains the index of the last entry in the vector.
movabs rdx, 2305843009213693951 ; Hex 1FFF_FFFF_FFFF_FFFF
and rdx, rax ; rdx contains the index of the last entry in the vector with 3 MSB cleared.
mov rax, rcx ; rax now contains the address of the vector data
cmp rdx, 3 ; Check if the vector has less than 4 entries
jb .LBB0_10 ; Jump if the vector has less than 4 entries
; == Handle vectors with 4 or more entries ==
add rdx, 1 ; Add 1 to the last entry index to get length.
mov r8, rdx ; r8 now contains the number of items to be processed.
and r8, -4 ; Mask of the lower 2 bits to get the count of entries that is divisible by 4.
movq xmm0, rdi ; Parameter num, the increment value, is now in xmm0.
pshufd xmm0, xmm0, 68 ; xmm0 upper 64 bits and lower 64 bits are set to num to facilitate vector operations.
lea rax, [r8 - 4] ; rax now is 4 less than the number of items to be processed.
mov r10, rax ; r10 is now 4 less than the number of items to be processed.
shr r10, 2 ; Divide r10 by 4 to get the index of the last quad 64-bit numbers.
add r10, 1 ; Add 1 to get the count of quad 64-bit numbers.
test rax, rax ; Check if any more sets of 4 entries are available
je .LBB0_3 ; Jump if no more sets of 4 entries are available
mov rax, r10 ; rax now contains the count of quad 64-bit numbers
and rax, -2 ; Mask of the lower 2 bits to get the count of quad 64-bit numbers that is divisible by 4.
xor esi, esi ; Set the quad 64-bit numbers index to 0
; == Process eight 64-bit additions per iteration ==
.LBB0_5:
movdqu xmm1, xmmword ptr [rcx + 8*rsi] ; Load the first two 64-bit number
movdqu xmm2, xmmword ptr [rcx + 8*rsi + 16] ; Load the third and fourth 64-bit number
movdqu xmm3, xmmword ptr [rcx + 8*rsi + 32] ; Load the fifth and sixth 64-bit number
movdqu xmm4, xmmword ptr [rcx + 8*rsi + 48] ; Load the seventh and eighth 64-bit number
paddq xmm1, xmm0 ; Add the increment value to the first two 64-bit number
paddq xmm2, xmm0 ; Add the increment value to the third and fourth 64-bit number
movdqu xmmword ptr [rcx + 8*rsi], xmm1 ; Store the first two 64-bit number
movdqu xmmword ptr [rcx + 8*rsi + 16], xmm2 ; Store the third and fourth 64-bit number
paddq xmm3, xmm0 ; Add the increment value to the fifth and sixth 64-bit number
paddq xmm4, xmm0 ; Add the increment value to the seventh and eighth 64-bit number
movdqu xmmword ptr [rcx + 8*rsi + 32], xmm3 ; Store the fifth and sixth 64-bit number
movdqu xmmword ptr [rcx + 8*rsi + 48], xmm4 ; Store the seventh and eighth 64-bit number
add rsi, 8 ; Increment the 64-bit numbers index
add rax, -2 ; Decrement the count by 2 as 8 64-bit numbers have been processed.
jne .LBB0_5 ; Jump if there are still quad 64-bit numbers to process
test r10b, 1 ; Check if there are still entries to process
je .LBB0_8 ; Jump if less than 4 entries need to be processed.
; == Block processing four 64-bit additions ==
.LBB0_7:
movdqu xmm1, xmmword ptr [rcx + 8*rsi] ; Load two 64-bit numbers from the vector
movdqu xmm2, xmmword ptr [rcx + 8*rsi + 16] ; Load two more 64-bit numbers from the vector
paddq xmm1, xmm0 ; Add the increment value to the first two 64-bit number
paddq xmm2, xmm0 ; Add the increment value to the 3rd and 4th of 64-bit number
movdqu xmmword ptr [rcx + 8*rsi], xmm1 ; Store the result back in the vector (1st and 2nd number)
movdqu xmmword ptr [rcx + 8*rsi + 16], xmm2 ; Store the result back in the vector (3rd and 4th number)
.LBB0_8:
cmp rdx, r8 ; Check if all entries have been processed
je .LBB0_12 ; Jump if all entries have been processed
lea rax, [rcx + 8*r8] ; rax now contains the address of the last entry
.LBB0_10:
lea rcx, [rcx + 8*r9] ; rcx now contains the address of the last entry in the vector
; rax points to the first entry in the vector
; == Process one 64-bit addition per iteration ==
.LBB0_11:
add qword ptr [rax], rdi ; Add num to the next entry in the vector
add rax, 8 ; Move to the next entry in the vector
cmp rax, rcx ; Check if we have reached the end of the vector
jne .LBB0_11 ; Jump if we have not reached the end of the vector
.LBB0_12:
ret
.LBB0_3:
xor esi, esi ; esi is now 0
test r10b, 1 ; Check if one set of quad 64-bit numbers is available
jne .LBB0_7 ; Jump if quad 64-bit numbers are available
jmp .LBB0_8 ; Jump if less than 4 64-bit numbers are available
The following diagrams help visualize how the code above connects the three basic blocks we have seen earlier.
Process eight 64-bit additions per iteration (LBB0_5
)
Block processing four 64-bit additions (LBB0_7
)
Process one 64-bit addition per iteration (LBB0_11
)
Assembly code for increment_example
pub fn increment_example(inc_by: i64, array: [i64; 4]) -> Vec<i64> {
let mut list = array.to_vec();
increment_by(inc_by, &mut list);
list
}
Looking at the code for increment_example
we can see that the generated assembly code is a lot simpler than that for increment_by
function. We also see that there is no call to increment_by
, the relevant code from the function has been inlined into increment_example
. We are seeing a lot of compiler optimizations at work here.
-
The compiler figures out that inlining the
increment_by
function would reduce a function call. -
Once the compiler decides to inline, it sees that the vector will only be of length 4. The compiler decides that it does not need to bring in the complexity of handling any arbitrary length vector. The compiler optimizes the inlined code to be a lot faster by just including the handling for a vector of length 4.
-
The compiler also unrolls the loop to process four 64-bit additions.
Also, note that the assembly code for increment_example
calls __rust_alloc
for allocating the vector data on the heap. If the memory allocation fails, the function throws an exception using the ud2
instruction.
example::increment_example:
push r15 ; Save r15 on the stack
push r14 ; Save r14 on the stack
push rbx ; Save rbx on the stack
mov r15, rdx ; r15 = address of the array
mov r14, rsi ; r14 = inc_by
mov rbx, rdi ; rbx = address of vector on the stack
; == Allocate memory for the vector buffer ==
mov edi, 32 ; edi = number of 64-bit numbers to process
mov esi, 8 ; esi = byte alignment is at 8 bytes
call qword ptr [rip + __rust_alloc@GOTPCREL] ; The allocated heap address is in rax
test rax, rax ; Check if the allocation failed
je .LBB1_1 ; Jump if the allocation failed
movups xmm0, xmmword ptr [r15] ; Load the first two 64-bit numbers from the array
movups xmm1, xmmword ptr [r15 + 16] ; Load the third and fourth 64-bit numbers from the array
movups xmmword ptr [rax + 16], xmm1 ; Store the third and fourth 64-bit numbers in the vector
movups xmmword ptr [rax], xmm0 ; Store the first two 64-bit numbers in the vector
add qword ptr [rax], r14 ; Add the increment value to the first 64-bit number
mov qword ptr [rbx], rax ; Store the address of the vector in the vector buffer
add qword ptr [rax + 8], r14 ; Add the increment value to the second 64-bit number
mov qword ptr [rbx + 8], 4 ; Store the vector length in the vector capacity field in RawVec.
add qword ptr [rax + 16], r14 ; Add the increment value to the third 64-bit number
mov qword ptr [rbx + 16], 4 ; Store the vector length in the vector.
add qword ptr [rax + 24], r14 ; Add the increment value to the fourth 64-bit number
mov rax, rbx ; Return the vector in rax
pop rbx ; Restore rbx
pop r14 ; Restore r14
pop r15 ; Restore r15
ret ; Return the vector in rax
.LBB1_1: ; Label for the allocation failure
mov edi, 32 ; edi = Size of the attempted allocation
mov esi, 8 ; esi = Alignment is at 8 bytes
; == Call the allocation failure handler ==
call qword ptr [rip + alloc::alloc::handle_alloc_error@GOTPCREL]
ud2 ; Use the ud2 instruction to throw an exception
View in the Compiler Explorer
Click on the Edit on Compiler Explorer to experiment with the generated assembly code. Two interesting experiments are:
- In the
increment_example
function, changearray: [i64; 4]
toarray: [i64; 20]
and see how the compiler handles the vector length. - In the
increment_by
function, change*item += num
to*item *= num
and see the dramatic impact on the generated code.- The compiler does not try to optimize for different vector lengths. It generates a single loop with the multiplication.