# Map Rust vector iteration to assembly

A Rust vector contains a pointer to a heap-allocated buffer together with that buffer's capacity (`buf`), and a length (`len`). The buffer holds elements of type `T`, and the length is the number of initialized elements in the vector. Here is an excerpt from the `std::vec::Vec` implementation:

``````
pub struct Vec<T, #[unstable(feature = "allocator_api", issue = "32838")] A: Allocator = Global> {
    // Points to the start of the vector data and keeps the capacity information
    buf: RawVec<T, A>,
    // Length of the vector
    len: usize,
}
``````
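The buffer pointer, capacity, and length are all observable through `Vec`'s public API. A minimal sketch:

``````
fn main() {
    // A Vec created with a capacity has room for 8 elements but a length of 0.
    let mut v: Vec<i64> = Vec::with_capacity(8);
    assert_eq!(v.len(), 0);
    assert!(v.capacity() >= 8);

    // Pushing elements bumps len; the heap buffer is reachable via as_ptr().
    v.push(1);
    v.push(2);
    assert_eq!(v.len(), 2);
    assert!(!v.as_ptr().is_null());
}
``````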

## Sample vector handling code

Let us explore the Rust-to-assembly mapping of the `Vec` struct using a simple function, `increment_by`, that increments each element of a vector by a specified value (`num`). The sample also includes `increment_example`, which calls the `increment_by` function.

``````
// Increment each element of an i64 vector by num. A mutable
// reference to the vector is passed in.
pub fn increment_by(num: i64, list: &mut Vec<i64>) {
    // Iterate over the vector and increment each element.
    for item in list {
        // Note that the item needs to be dereferenced
        *item += num;
    }
}

pub fn increment_example(inc_by: i64, array: [i64; 4]) -> Vec<i64> {
    let mut list = array.to_vec();
    increment_by(inc_by, &mut list);
    list
}
``````
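Before diving into the assembly, the behavior of the two functions is easy to confirm. The sketch below reproduces them and exercises both the fixed-size and arbitrary-length paths:

``````
pub fn increment_by(num: i64, list: &mut Vec<i64>) {
    for item in list {
        *item += num;
    }
}

pub fn increment_example(inc_by: i64, array: [i64; 4]) -> Vec<i64> {
    let mut list = array.to_vec();
    increment_by(inc_by, &mut list);
    list
}

fn main() {
    // Each element of the 4-element array is incremented by 10.
    assert_eq!(increment_example(10, [1, 2, 3, 4]), vec![11, 12, 13, 14]);

    // increment_by also works on vectors of arbitrary length.
    let mut long: Vec<i64> = (0..23).collect();
    increment_by(1, &mut long);
    assert_eq!(long[0], 1);
    assert_eq!(long[22], 23);
}
``````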

## Assembly code for `increment_by`

The generated assembly code for the `increment_by` function is shown below. At first glance it looks quite complex, but on closer examination it becomes clear that the code is optimized to handle vectors of widely different lengths efficiently.

Let us look at the assembly code that does the heavy lifting for a vector of a certain length. Once we understand these fragments, we will look at the complete assembly to understand how all other lengths are handled by combining these fragments.

### Vector length less than 4

For vectors with fewer than 4 elements, the generated assembly loops over each entry in the vector and increments it by the specified value.

``````
.LBB0_11:
add     qword ptr [rax], rdi    ; Add num to the next entry in the vector
add     rax, 8                  ; Move to the next entry in the vector
cmp     rax, rcx                ; Check if we have reached the end of the vector
jne     .LBB0_11                ; Jump if we have not reached the end of the vector
``````
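This fragment is essentially a pointer-walking loop. A rough Rust rendering of what it does (a sketch with a hypothetical `scalar_tail` helper, not compiler output):

``````
// Walk a pointer from the first entry to one past the last,
// adding num to each element, just like the .LBB0_11 loop.
unsafe fn scalar_tail(mut cur: *mut i64, end: *mut i64, num: i64) {
    while cur != end {
        *cur += num;      // add qword ptr [rax], rdi
        cur = cur.add(1); // add rax, 8
    }                     // cmp rax, rcx / jne .LBB0_11
}

fn main() {
    let mut v = vec![1i64, 2, 3];
    let range = v.as_mut_ptr_range();
    unsafe { scalar_tail(range.start, range.end, 5) };
    assert_eq!(v, vec![6, 7, 8]);
}
``````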

### Vector length is 4

If the vector length is 4, the compiler unrolls the loop and generates the following assembly code. Vector (SIMD) instructions perform two 64-bit additions in parallel. The `movdqu` instructions load the vector data from the vector buffer into `xmm` registers, the `paddq` instructions add the specified value to both 64-bit lanes at once, and two more `movdqu` instructions store the results back to the vector buffer.

``````
.LBB0_7:
movdqu  xmm1, xmmword ptr [rcx + 8*rsi]         ; Load two 64-bit numbers from the vector
movdqu  xmm2, xmmword ptr [rcx + 8*rsi + 16]    ; Load two more 64-bit numbers from the vector
paddq   xmm1, xmm0                              ; Add the increment value to the 1st and 2nd 64-bit numbers
paddq   xmm2, xmm0                              ; Add the increment value to the 3rd and 4th 64-bit numbers
movdqu  xmmword ptr [rcx + 8*rsi], xmm1         ; Store the result back in the vector (1st and 2nd numbers)
movdqu  xmmword ptr [rcx + 8*rsi + 16], xmm2    ; Store the result back in the vector (3rd and 4th numbers)
``````
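The same pairing can be sketched in scalar Rust. The hypothetical `add_in_pairs` helper below mirrors how each `paddq` touches two 64-bit lanes at once:

``````
// Process the data in chunks of two 64-bit values, mirroring the
// two-lane xmm loads, adds, and stores in the block above.
fn add_in_pairs(data: &mut [i64], num: i64) {
    for pair in data.chunks_exact_mut(2) {
        // One paddq adds num to both lanes in a single instruction.
        pair[0] = pair[0].wrapping_add(num);
        pair[1] = pair[1].wrapping_add(num);
    }
}

fn main() {
    let mut d = [1i64, 2, 3, 4];
    add_in_pairs(&mut d, 10);
    assert_eq!(d, [11, 12, 13, 14]);
}
``````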

### Vector length is a multiple of 8

For vectors with lengths that are a multiple of 8, the compiler handles eight 64-bit additions per iteration. The eight 64-bit numbers are loaded into `xmm` registers using four `movdqu` instructions, four `paddq` instructions perform the eight additions, and finally four more `movdqu` instructions copy the results back to the vector buffer.

``````
.LBB0_5:
movdqu  xmm1, xmmword ptr [rcx + 8*rsi]         ; Load the first two 64-bit numbers
movdqu  xmm2, xmmword ptr [rcx + 8*rsi + 16]    ; Load the third and fourth 64-bit numbers
movdqu  xmm3, xmmword ptr [rcx + 8*rsi + 32]    ; Load the fifth and sixth 64-bit numbers
movdqu  xmm4, xmmword ptr [rcx + 8*rsi + 48]    ; Load the seventh and eighth 64-bit numbers
paddq   xmm1, xmm0      ; Add the increment value to the first two 64-bit numbers
paddq   xmm2, xmm0      ; Add the increment value to the third and fourth 64-bit numbers
movdqu  xmmword ptr [rcx + 8*rsi], xmm1         ; Store the first two 64-bit numbers
movdqu  xmmword ptr [rcx + 8*rsi + 16], xmm2    ; Store the third and fourth 64-bit numbers
paddq   xmm3, xmm0      ; Add the increment value to the fifth and sixth 64-bit numbers
paddq   xmm4, xmm0      ; Add the increment value to the seventh and eighth 64-bit numbers
movdqu  xmmword ptr [rcx + 8*rsi + 32], xmm3    ; Store the fifth and sixth 64-bit numbers
movdqu  xmmword ptr [rcx + 8*rsi + 48], xmm4    ; Store the seventh and eighth 64-bit numbers
add     rsi, 8      ; Advance the element index by 8
add     rax, -2     ; Decrement the count by 2 as two groups of four (8 numbers) have been processed
jne     .LBB0_5     ; Jump if more pairs of 4-element groups remain
test    r10b, 1     ; Check if one last group of 4 elements remains
je      .LBB0_8     ; Jump if no group of 4 remains
``````

### Handling vectors of arbitrary length

Now that we have analyzed the special cases, let us look at the code below that handles vectors of arbitrary length. The full function combines the three code blocks described above. For example, a vector of length 23 is handled as follows.

1. Entries 0 through 15 (16 entries = 2 * 8) are processed by two iterations of the loop at the `.LBB0_5` label, each performing eight additions with four vector instructions.
2. Entries 16 through 19 (4 entries) are processed by the block at the `.LBB0_7` label, which performs four additions with two vector instructions.
3. Entries 20 through 22 (the remaining 3 entries) are processed one at a time by the scalar loop at the `.LBB0_11` label.
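The way a given length splits across the three blocks can be modeled with a small helper. `block_breakdown` below is a hypothetical sketch of the dispatch arithmetic (masking the length down to a multiple of 4, then pairing 4-element groups); it is not part of the generated code:

``````
// Hypothetical model of how increment_by's assembly splits a vector of
// length `len` across the three blocks. Returns the number of elements
// handled by the 8-per-iteration loop (.LBB0_5), the single 4-element
// block (.LBB0_7), and the scalar loop (.LBB0_11), respectively.
fn block_breakdown(len: usize) -> (usize, usize, usize) {
    let simd = len & !3;              // entries handled with SIMD (multiple of 4)
    let quads = simd / 4;             // number of 4-element groups
    let eights = (quads & !1) * 4;    // pairs of groups go through .LBB0_5
    let fours = if quads % 2 == 1 { 4 } else { 0 }; // odd group via .LBB0_7
    let scalar = len - simd;          // remainder via the scalar loop
    (eights, fours, scalar)
}

fn main() {
    assert_eq!(block_breakdown(23), (16, 4, 3)); // 2 * 8, then 4, then 3 scalar
    assert_eq!(block_breakdown(4), (0, 4, 0));   // the unrolled 4-element case
    assert_eq!(block_breakdown(100), (96, 4, 0));
}
``````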
``````
; rdi contains num, the increment value.
; rsi contains the address of the vector to be incremented.
example::increment_by:
mov     r9, qword ptr [rsi + 16] ; r9 now contains the length of the vector
test    r9, r9                  ; Check if the vector is empty
je      .LBB0_12                ; Jump if the vector is empty
mov     rcx, qword ptr [rsi]    ; rcx now contains the address of the vector data
lea     rax, [r9 - 1]           ; rax now contains the index of the last entry in the vector.
movabs  rdx, 2305843009213693951 ; Hex 1FFF_FFFF_FFFF_FFFF
and     rdx, rax                 ; rdx contains the index of the last entry in the vector with 3 MSB cleared.
mov     rax, rcx                 ; rax now contains the address of the vector data
cmp     rdx, 3                   ; Check if the vector has less than 4 entries
jb      .LBB0_10                 ; Jump if the vector has less than 4 entries

; == Handle vectors with 4 or more entries ==
add     rdx, 1          ; Add 1 to the last entry index to get length.
mov     r8, rdx         ; r8 now contains the number of items to be processed.
and     r8, -4          ; Mask off the lower 2 bits to get a count of entries divisible by 4.
movq    xmm0, rdi       ; Parameter num, the increment value, is now in xmm0.
pshufd  xmm0, xmm0, 68  ; xmm0 upper 64 bits and lower 64 bits are set to num to facilitate vector operations.
lea     rax, [r8 - 4]   ; rax now is 4 less than the number of items to be processed.
mov     r10, rax        ; r10 is now 4 less than the number of items to be processed.
shr     r10, 2          ; Divide r10 by 4 to get the index of the last quad 64-bit numbers.
add     r10, 1          ; Add 1 to get the count of quad 64-bit numbers.
test    rax, rax        ; Check if any more sets of 4 entries are available
je      .LBB0_3         ; Jump if no more sets of 4 entries are available
mov     rax, r10        ; rax now contains the count of quad 64-bit numbers
and     rax, -2         ; Mask off the lowest bit to get a count of 4-element groups divisible by 2.
xor     esi, esi        ; Set the quad 64-bit numbers index to 0

; == Process eight 64-bit additions per iteration ==
.LBB0_5:
movdqu  xmm1, xmmword ptr [rcx + 8*rsi]         ; Load the first two 64-bit numbers
movdqu  xmm2, xmmword ptr [rcx + 8*rsi + 16]    ; Load the third and fourth 64-bit numbers
movdqu  xmm3, xmmword ptr [rcx + 8*rsi + 32]    ; Load the fifth and sixth 64-bit numbers
movdqu  xmm4, xmmword ptr [rcx + 8*rsi + 48]    ; Load the seventh and eighth 64-bit numbers
paddq   xmm1, xmm0      ; Add the increment value to the first two 64-bit numbers
paddq   xmm2, xmm0      ; Add the increment value to the third and fourth 64-bit numbers
movdqu  xmmword ptr [rcx + 8*rsi], xmm1         ; Store the first two 64-bit numbers
movdqu  xmmword ptr [rcx + 8*rsi + 16], xmm2    ; Store the third and fourth 64-bit numbers
paddq   xmm3, xmm0      ; Add the increment value to the fifth and sixth 64-bit numbers
paddq   xmm4, xmm0      ; Add the increment value to the seventh and eighth 64-bit numbers
movdqu  xmmword ptr [rcx + 8*rsi + 32], xmm3    ; Store the fifth and sixth 64-bit numbers
movdqu  xmmword ptr [rcx + 8*rsi + 48], xmm4    ; Store the seventh and eighth 64-bit numbers
add     rsi, 8      ; Advance the element index by 8
add     rax, -2     ; Decrement the count by 2 as two groups of four (8 numbers) have been processed
jne     .LBB0_5     ; Jump if more pairs of 4-element groups remain
test    r10b, 1     ; Check if one last group of 4 elements remains
je      .LBB0_8     ; Jump if no group of 4 remains

; == Block processing four 64-bit additions ==
.LBB0_7:
movdqu  xmm1, xmmword ptr [rcx + 8*rsi]         ; Load two 64-bit numbers from the vector
movdqu  xmm2, xmmword ptr [rcx + 8*rsi + 16]    ; Load two more 64-bit numbers from the vector
paddq   xmm1, xmm0                              ; Add the increment value to the 1st and 2nd 64-bit numbers
paddq   xmm2, xmm0                              ; Add the increment value to the 3rd and 4th 64-bit numbers
movdqu  xmmword ptr [rcx + 8*rsi], xmm1         ; Store the result back in the vector (1st and 2nd number)
movdqu  xmmword ptr [rcx + 8*rsi + 16], xmm2    ; Store the result back in the vector (3rd and 4th number)
.LBB0_8:
cmp     rdx, r8                 ; Check if all entries have been processed
je      .LBB0_12                ; Jump if all entries have been processed
lea     rax, [rcx + 8*r8]       ; rax now contains the address of the first unprocessed entry
.LBB0_10:
lea     rcx, [rcx + 8*r9]       ; rcx now points one past the last entry (end of the vector)
; rax points to the first entry that still needs processing

; == Process one 64-bit addition per iteration ==
.LBB0_11:
add     qword ptr [rax], rdi    ; Add num to the next entry in the vector
add     rax, 8                  ; Move to the next entry in the vector
cmp     rax, rcx                ; Check if we have reached the end of the vector
jne     .LBB0_11                ; Jump if we have not reached the end of the vector
.LBB0_12:
ret
.LBB0_3:
xor     esi, esi                ; esi is now 0
test    r10b, 1                 ; Check if one set of quad 64-bit numbers is available
jne     .LBB0_7                 ; Jump if quad 64-bit numbers are available
jmp     .LBB0_8                 ; Jump if less than 4 64-bit numbers are available
``````

## Assembly code for `increment_example`

``````
pub fn increment_example(inc_by: i64, array: [i64; 4]) -> Vec<i64> {
    let mut list = array.to_vec();
    increment_by(inc_by, &mut list);
    list
}
``````

Looking at the assembly for `increment_example`, we can see that it is much simpler than the assembly for the `increment_by` function. There is also no call to `increment_by`: the relevant code has been inlined into `increment_example`. Several compiler optimizations are at work here.

• The compiler figures out that inlining the `increment_by` function eliminates the cost of a function call.

• Once the compiler decides to inline, it sees that the vector always has length 4. It therefore drops the machinery for handling arbitrary-length vectors and emits only the code needed for a vector of length 4.

• The compiler also unrolls the loop to process four 64-bit additions.

Also, note that the assembly code for `increment_example` calls `__rust_alloc` to allocate the vector data on the heap. If the allocation fails, the function calls the allocation error handler and then aborts via the `ud2` instruction.

``````
example::increment_example:
push    r15             ; Save r15 on the stack
push    r14             ; Save r14 on the stack
push    rbx             ; Save rbx on the stack
mov     r15, rdx        ; r15 = address of the array
mov     r14, rsi        ; r14 = inc_by
mov     rbx, rdi        ; rbx = address of vector on the stack

; == Allocate memory for the vector buffer ==
mov     edi, 32         ; edi = allocation size in bytes (4 elements * 8 bytes)
mov     esi, 8          ; esi = byte alignment is at 8 bytes
call    qword ptr [rip + __rust_alloc@GOTPCREL] ; The allocated heap address is in rax

test    rax, rax       ; Check if the allocation failed
je      .LBB1_1       ; Jump if the allocation failed
movups  xmm0, xmmword ptr [r15]         ; Load the first two 64-bit numbers from the array
movups  xmm1, xmmword ptr [r15 + 16]    ; Load the third and fourth 64-bit numbers from the array
movups  xmmword ptr [rax + 16], xmm1    ; Store the third and fourth 64-bit numbers in the vector
movups  xmmword ptr [rax], xmm0         ; Store the first two 64-bit numbers in the vector
add     qword ptr [rax], r14            ; Add the increment value to the first 64-bit number
mov     qword ptr [rbx], rax            ; Store the address of the vector in the vector buffer
add     qword ptr [rax + 8], r14        ; Add the increment value to the second 64-bit number
mov     qword ptr [rbx + 8], 4          ; Store the capacity (4) in the RawVec capacity field.
add     qword ptr [rax + 16], r14       ; Add the increment value to the third 64-bit number
mov     qword ptr [rbx + 16], 4         ; Store the length (4) in the vector's len field.
add     qword ptr [rax + 24], r14       ; Add the increment value to the fourth 64-bit number
mov     rax, rbx                        ; Return the vector in rax
pop     rbx                             ; Restore rbx
pop     r14                             ; Restore r14
pop     r15                             ; Restore r15
ret                                     ; Return the vector in rax

.LBB1_1:                                        ; Label for the allocation failure
mov     edi, 32                         ; edi = Size of the attempted allocation
mov     esi, 8                          ; esi = Alignment is at 8 bytes
; == Call the allocation failure handler ==
call    qword ptr [rip + alloc::alloc::handle_alloc_error@GOTPCREL]
ud2                                     ; Use the ud2 instruction to throw an exception
``````
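The 32-byte allocation size passed to `__rust_alloc` follows directly from the element type and count. A small sketch:

``````
use std::mem::size_of;

fn main() {
    // The assembly passes 32 to __rust_alloc: four i64 values of 8 bytes each.
    assert_eq!(size_of::<i64>() * 4, 32);

    // to_vec on a 4-element array produces a Vec with length 4 and room
    // for at least 4 elements on the heap.
    let v = [1i64, 2, 3, 4].to_vec();
    assert_eq!(v.len(), 4);
    assert!(v.capacity() >= 4);
}
``````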

## View in the Compiler Explorer

Click on the Edit on Compiler Explorer link to experiment with the generated assembly code. Two interesting experiments are:

• In the `increment_example` function, change `array: [i64; 4]` to `array: [i64; 20]` and see how the compiler handles the larger vector.
• In the `increment_by` function, change `*item += num` to `*item *= num` and see the dramatic impact on the generated code: the compiler no longer optimizes for different vector lengths and generates a single loop containing the multiplication.