Motivation for Loop Unrolling

Although the iteration utilities in Base and Base.Iterators are sufficiently performant for most common use cases, those who choose to dive into the world of low-level optimization will often discover type instabilities in unexpected situations. Here is a particularly simple example:

julia> Test.@inferred map(one, Tuple(1:31));
julia> Test.@inferred map(one, Tuple(1:32));
ERROR: return type NTuple{32, Int64} does not match inferred return type Tuple

This type instability is present in all maps over iterators with lengths greater than 31, regardless of whether they are statically sized. As with most type instabilities in Julia, this leads to memory allocations every time map is called with sufficiently long iterators.
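
These allocations can be checked with @allocated; in the following, alloc_map_one is a hypothetical helper defined only for this illustration (wrapping the measurement in a function avoids the instabilities of global scope, and the exact number of bytes allocated in the unstable case varies across Julia versions):

julia> alloc_map_one(itr) = @allocated map(one, itr)  # hypothetical helper for illustration
alloc_map_one (generic function with 1 method)
julia> alloc_map_one(Tuple(1:31)) == 0
true
julia> alloc_map_one(Tuple(1:32)) > 0
true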

Test.@inferred is helpful for checking whether the return type of a function call is stable, but looking directly at the generated LLVM code reveals just how different the two function calls above are:

julia> @code_llvm debuginfo=:none map(one, Tuple(1:31))
; Function Signature: map(typeof(Base.one), NTuple{31, Int64})
define void @julia_map_192471(ptr noalias nocapture noundef nonnull sret([31 x i64]) align 8 dereferenceable(248) %sret_return, ptr nocapture noundef nonnull readonly align 8 dereferenceable(248) %"t::Tuple") #0 {
top:
  call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 8 dereferenceable(248) %sret_return, ptr noundef nonnull align 8 dereferenceable(248) @"_j_const#1", i64 248, i1 false)
  ret void
}
julia> @code_llvm debuginfo=:none map(one, Tuple(1:32))
; Function Signature: map(typeof(Base.one), NTuple{32, Int64})
define nonnull ptr @julia_map_192476(ptr nocapture noundef nonnull readonly align 8 dereferenceable(256) %"t::Tuple") #0 {
top:
  %jlcallframe1 = alloca [3 x ptr], align 8
  %gcframe2 = alloca [3 x ptr], align 16
  call void @llvm.memset.p0.i64(ptr align 16 %gcframe2, i8 0, i64 24, i1 true)
  %thread_ptr = call ptr asm "movq %fs:0, $0", "=r"() #13
  %tls_ppgcstack = getelementptr i8, ptr %thread_ptr, i64 -8
  %tls_pgcstack = load ptr, ptr %tls_ppgcstack, align 8
  store i64 4, ptr %gcframe2, align 16
  %frame.prev = getelementptr inbounds ptr, ptr %gcframe2, i64 1
  %task.gcstack = load ptr, ptr %tls_pgcstack, align 8
  store ptr %task.gcstack, ptr %frame.prev, align 8
  store ptr %gcframe2, ptr %tls_pgcstack, align 8
  %"Memory{Any}[]" = call ptr @jl_alloc_genericmemory(ptr nonnull @"+Core.GenericMemory#192478.jit", i64 32)
  %.data_ptr = getelementptr inbounds { i64, ptr }, ptr %"Memory{Any}[]", i64 0, i32 1
  %0 = load ptr, ptr %.data_ptr, align 8
  %gc_slot_addr_0 = getelementptr inbounds ptr, ptr %gcframe2, i64 2
  store ptr %"Memory{Any}[]", ptr %gc_slot_addr_0, align 16
  %ptls_field = getelementptr inbounds ptr, ptr %tls_pgcstack, i64 2
  %ptls_load = load ptr, ptr %ptls_field, align 8
  %"new::Array" = call noalias nonnull align 8 dereferenceable(32) ptr @ijl_gc_pool_alloc_instrumented(ptr %ptls_load, i32 800, i32 32, i64 140155851839856) #11
  %"new::Array.tag_addr" = getelementptr inbounds i64, ptr %"new::Array", i64 -1
  store atomic i64 140155851839856, ptr %"new::Array.tag_addr" unordered, align 8
  %1 = getelementptr inbounds ptr, ptr %"new::Array", i64 1
  store ptr %0, ptr %"new::Array", align 8
  store ptr %"Memory{Any}[]", ptr %1, align 8
  %"new::Array.size_ptr" = getelementptr inbounds i8, ptr %"new::Array", i64 16
  store i64 32, ptr %"new::Array.size_ptr", align 8
  br label %L22

L22:                                              ; preds = %L22, %top
  %value_phi = phi i64 [ 1, %top ], [ %5, %L22 ]
  %2 = add nsw i64 %value_phi, -1
  %3 = getelementptr inbounds ptr, ptr %0, i64 %2
  %4 = getelementptr inbounds ptr, ptr %"Memory{Any}[]", i64 2
  %.not15 = icmp eq ptr %4, %0
  store atomic ptr @"jl_global#192488.jit", ptr %3 release, align 8
  %.not16.not = icmp eq i64 %value_phi, 32
  %5 = add nuw nsw i64 %value_phi, 1
  br i1 %.not16.not, label %L37, label %L22

L37:                                              ; preds = %L22
  store ptr %"new::Array", ptr %gc_slot_addr_0, align 16
  store ptr @"jl_global#192494.jit", ptr %jlcallframe1, align 8
  %6 = getelementptr inbounds ptr, ptr %jlcallframe1, i64 1
  store ptr @"jl_global#192495.jit", ptr %6, align 8
  %7 = getelementptr inbounds ptr, ptr %jlcallframe1, i64 2
  store ptr %"new::Array", ptr %7, align 8
  %jl_f__apply_iterate_ret = call nonnull ptr @jl_f__apply_iterate(ptr null, ptr nonnull %jlcallframe1, i32 3)
  %frame.prev26 = load ptr, ptr %frame.prev, align 8
  store ptr %frame.prev26, ptr %tls_pgcstack, align 8
  ret ptr %jl_f__apply_iterate_ret
}

The type instability (and all of the resulting LLVM code complexity) in the second function call can be eliminated by replacing map with unrolled_map:

julia> Test.@inferred unrolled_map(one, Tuple(1:32));
julia> @code_llvm debuginfo=:none unrolled_map(one, Tuple(1:32))
; Function Signature: unrolled_map(typeof(Base.one), NTuple{32, Int64})
define void @julia_unrolled_map_192499(ptr noalias nocapture noundef nonnull sret([32 x i64]) align 8 dereferenceable(256) %sret_return, ptr nocapture noundef nonnull readonly align 8 dereferenceable(256) %"itr::Tuple") #0 {
top:
  call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 8 dereferenceable(256) %sret_return, ptr noundef nonnull align 8 dereferenceable(256) @"_j_const#1", i64 256, i1 false)
  ret void
}

The minimum iterator length for type instability is not always 32; for instance, it can also be 14:

julia> first_11(itr) = itr[1:11]
first_11 (generic function with 1 method)
julia> Test.@inferred first_11(Tuple(1:13));
julia> Test.@inferred first_11(Tuple(1:14));
ERROR: return type NTuple{11, Int64} does not match inferred return type Tuple{Vararg{Int64}}

Note: Why is the function definition needed in this example?

On the first line of the example above, [1:11] is enclosed in a function so that it does not get evaluated in global scope. This turns the range 1:11 into a Core.Const, which the compiler can propagate into the call to getindex in order to infer the length of the result:

julia> @code_warntype first_11(Tuple(1:13))
MethodInstance for Main.first_11(::NTuple{13, Int64})
  from first_11(itr) @ Main string:2
Arguments
  #self#::Core.Const(Main.first_11)
  itr::NTuple{13, Int64}
Body::NTuple{11, Int64}
1 ─ %1 = Main.:(:)::Core.Const(Colon())
│   %2 = (%1)(1, 11)::Core.Const(1:11)
│   %3 = Base.getindex(itr, %2)::NTuple{11, Int64}
└──      return %3

In contrast, running Test.@inferred Tuple(1:13)[1:11] would amount to checking whether the compiler can compute the result type of getindex given only the argument types NTuple{13, Int64} and UnitRange{Int64}, which it cannot do:

julia> @code_warntype Tuple(1:13)[1:11]
MethodInstance for getindex(::NTuple{13, Int64}, ::UnitRange{Int64})
  from getindex(t::Tuple, r::AbstractUnitRange) @ Base range.jl:430
Arguments
  #self#::Core.Const(getindex)
  t::NTuple{13, Int64}
  r::UnitRange{Int64}
Locals
  #87::Base.var"#87#88"{NTuple{13, Int64}, UnitRange{Int64}}
  @_5::UNION{NOTHING, TUPLE{INT64, INT64}}
  @_6::Int64
  ri::Int64
Body::TUPLE{VARARG{INT64}}
1 ──        nothing
│           nothing
│           Core.NewvarNode(:(#87))
│    %4   = Base.require_one_based_indexing::Core.Const(Base.require_one_based_indexing)
│           (%4)(r)
│    %6   = Base.:<=::Core.Const(<=)
│    %7   = Base.length(r)::Int64
│    %8   = (%6)(%7, 10)::Bool
└───        goto #4 if not %8
2 ── %10  = Base.ntuple::Core.Const(ntuple)
│    %11  = Base.:(var"#87#88")::Core.Const(Base.var"#87#88")
│    %12  = Core.typeof(t)::Core.Const(NTuple{13, Int64})
│    %13  = Core.typeof(r)::Core.Const(UnitRange{Int64})
│    %14  = Core.apply_type(%11, %12, %13)::Core.Const(Base.var"#87#88"{NTuple{13, Int64}, UnitRange{Int64}})
│           (#87 = %new(%14, t, r))
│    %16  = #87::Base.var"#87#88"{NTuple{13, Int64}, UnitRange{Int64}}
│    %17  = Base.length(r)::Int64
│    %18  = (%10)(%16, %17)::TUPLE{VARARG{INT64}}
└───        return %18
3 ──        Core.Const(:(goto %77))
4 ┄─ %21  = Base.:(==)::Core.Const(==)
│    %22  = Base.first(r)::Int64
│    %23  = (%21)(%22, 1)::Bool
└───        goto #15 if not %23
5 ── %25  = Base.:(==)::Core.Const(==)
│    %26  = Base.last(r)::Int64
│    %27  = Base.length(t)::Core.Const(13)
│    %28  = (%25)(%26, %27)::Bool
└───        goto #8 if not %28
6 ── %30  = t::NTuple{13, Int64}
└───        return %30
7 ──        Core.Const(:(goto %33))
8 ┄─ %33  = Base.:(==)::Core.Const(==)
│    %34  = Base.last(r)::Int64
│    %35  = Base.:-::Core.Const(-)
│    %36  = Base.length(t)::Core.Const(13)
│    %37  = (%35)(%36, 1)::Core.Const(12)
│    %38  = (%33)(%34, %37)::Bool
└───        goto #11 if not %38
9 ── %40  = Base.front(t)::NTuple{12, Int64}
└───        return %40
10 ─        Core.Const(:(goto %43))
11 ┄ %43  = Base.:(==)::Core.Const(==)
│    %44  = Base.last(r)::Int64
│    %45  = Base.:-::Core.Const(-)
│    %46  = Base.length(t)::Core.Const(13)
│    %47  = (%45)(%46, 2)::Core.Const(11)
│    %48  = (%43)(%44, %47)::Bool
└───        goto #14 if not %48
12 ─ %50  = Base.front::Core.Const(Base.front)
│    %51  = Base.front(t)::NTuple{12, Int64}
│    %52  = (%50)(%51)::NTuple{11, Int64}
└───        return %52
13 ─        Core.Const(:(goto %55))
14 ┄        goto #22
15 ─ %56  = Base.:(==)::Core.Const(==)
│    %57  = Base.last(r)::Int64
│    %58  = Base.length(t)::Core.Const(13)
│    %59  = (%56)(%57, %58)::Bool
└───        goto #22 if not %59
16 ─ %61  = Base.:(==)::Core.Const(==)
│    %62  = Base.first(r)::Int64
│    %63  = (%61)(%62, 2)::Bool
└───        goto #19 if not %63
17 ─ %65  = Base.tail(t)::NTuple{12, Int64}
└───        return %65
18 ─        Core.Const(:(goto %68))
19 ┄ %68  = Base.:(==)::Core.Const(==)
│    %69  = Base.first(r)::Int64
│    %70  = (%68)(%69, 3)::Bool
└───        goto #22 if not %70
20 ─ %72  = Base.tail::Core.Const(Base.tail)
│    %73  = Base.tail(t)::NTuple{12, Int64}
│    %74  = (%72)(%73)::NTuple{11, Int64}
└───        return %74
21 ─        Core.Const(:(goto %77))
22 ┄ %77  = r::UnitRange{Int64}
│    %78  = Base.IteratorSize(%77)::Core.Const(Base.HasShape{1}())
│    %79  = (%78 isa Base.SizeUnknown)::Core.Const(false)
│    %80  = Base.eltype(t)::Core.Const(Int64)
│    %81  = Base._array_for(%80, %77, %78)::Vector{Int64}
│    %82  = Base.LinearIndices(%81)::LinearIndices{1, Tuple{Base.OneTo{Int64}}}
│           (@_6 = Base.first(%82))
│           (@_5 = Base.iterate(%77))
│    %85  = @_5::UNION{NOTHING, TUPLE{INT64, INT64}}
│    %86  = (%85 === nothing)::Bool
│    %87  = Base.not_int(%86)::Bool
└───        goto #27 if not %87
23 ┄ %89  = @_5::Tuple{Int64, Int64}
│           (ri = Core.getfield(%89, 1))
│    %91  = Core.getfield(%89, 2)::Int64
│    %92  = ri::Int64
│    %93  = Base.getindex(t, %92)::Int64
│           nothing
└───        goto #25 if not %79
24 ─        Core.Const(:(Base.push!(%81, %93)))
└───        Core.Const(:(goto %100))
25 ┄ %98  = @_6::Int64
│           Base.setindex!(%81, %93, %98)
│           nothing
│    %101 = @_6::Int64
│           (@_6 = Base.add_int(%101, 1))
│           (@_5 = Base.iterate(%77, %91))
│    %104 = @_5::UNION{NOTHING, TUPLE{INT64, INT64}}
│    %105 = (%104 === nothing)::Bool
│    %106 = Base.not_int(%105)::Bool
└───        goto #27 if not %106
26 ─        goto #23
27 ┄ %109 = Core._apply_iterate(Base.iterate, Core.tuple, %81)::TUPLE{VARARG{INT64}}
└───        return %109

Although itr[1:10] is always inferrable when itr is a Tuple, itr[1:11] has a type instability whenever itr contains more than 13 items. More generally, itr[1:N] seems to be unstable for all N > 10 whenever itr contains more than N + 2 items. This type instability can be fixed by replacing getindex with unrolled_take:

julia> unrolled_first_11(itr) = unrolled_take(itr, Val(11))
unrolled_first_11 (generic function with 1 method)
julia> Test.@inferred unrolled_first_11(Tuple(1:14));
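
Because the length of the result is encoded in the Val argument rather than computed from a runtime range, the call remains inferrable no matter how long the input iterator is; for instance:

julia> Test.@inferred unrolled_first_11(Tuple(1:100));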

Even when the final result of a function is inferred, there can be intermediate steps in the function with type instabilities that trigger allocations:

julia> function add_lengths(itr)
           length_sum = 0
           for n in 1:length(itr)
               length_sum += length(itr[n])
           end
       end
add_lengths (generic function with 1 method)
julia> Test.@inferred add_lengths(((1, 2), (1, 2, 3)))
julia> @allocated add_lengths(((1, 2), (1, 2, 3)))
64
julia> @code_warntype add_lengths(((1, 2), (1, 2, 3)))
MethodInstance for Main.add_lengths(::Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64, Int64}})
  from add_lengths(itr) @ Main REPL[1]:1
Arguments
  #self#::Core.Const(Main.add_lengths)
  itr::Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64, Int64}}
Locals
  @_3::UNION{NOTHING, TUPLE{INT64, INT64}}
  length_sum::Int64
  n::Int64
Body::Nothing
1 ─       (length_sum = 0)
│   %2  = Main.:(:)::Core.Const(Colon())
│   %3  = Main.length::Core.Const(length)
│   %4  = (%3)(itr)::Core.Const(2)
│   %5  = (%2)(1, %4)::Core.Const(1:2)
│         (@_3 = Base.iterate(%5))
│   %7  = @_3::Core.Const((1, 1))
│   %8  = (%7 === nothing)::Core.Const(false)
│   %9  = Base.not_int(%8)::Core.Const(true)
└──       goto #4 if not %9
2 ┄ %11 = @_3::Tuple{Int64, Int64}
│         (n = Core.getfield(%11, 1))
│   %13 = Core.getfield(%11, 2)::Int64
│   %14 = Main.:+::Core.Const(+)
│   %15 = length_sum::Int64
│   %16 = Main.length::Core.Const(length)
│   %17 = n::Int64
│   %18 = Base.getindex(itr, %17)::UNION{TUPLE{INT64, INT64}, TUPLE{INT64, INT64, INT64}}
│   %19 = (%16)(%18)::Int64
│         (length_sum = (%14)(%15, %19))
│         (@_3 = Base.iterate(%5, %13))
│   %22 = @_3::UNION{NOTHING, TUPLE{INT64, INT64}}
│   %23 = (%22 === nothing)::Bool
│   %24 = Base.not_int(%23)::Bool
└──       goto #4 if not %24
3 ─       goto #2
4 ┄       return nothing

The output of @code_warntype is quite cluttered, but the most important detail here is that the call to getindex does not get inferred because it can result in either a Tuple of length 2 or a Tuple of length 3. This type instability can be fixed by replacing getindex with unrolled_applyat:

julia> function unrolled_add_lengths(itr)
           length_sum = 0
           for n in 1:length(itr)
               length_sum += unrolled_applyat(length, n, itr)
           end
       end
unrolled_add_lengths (generic function with 1 method)
julia> @allocated unrolled_add_lengths(((1, 2), (1, 2, 3)))
0
julia> @code_warntype unrolled_add_lengths(((1, 2), (1, 2, 3)))
MethodInstance for Main.unrolled_add_lengths(::Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64, Int64}})
  from unrolled_add_lengths(itr) @ Main REPL[1]:1
Arguments
  #self#::Core.Const(Main.unrolled_add_lengths)
  itr::Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64, Int64}}
Locals
  @_3::UNION{NOTHING, TUPLE{INT64, INT64}}
  length_sum::Int64
  n::Int64
Body::Nothing
1 ─       (length_sum = 0)
│   %2  = Main.:(:)::Core.Const(Colon())
│   %3  = Main.length(itr)::Core.Const(2)
│   %4  = (%2)(1, %3)::Core.Const(1:2)
│         (@_3 = Base.iterate(%4))
│   %6  = @_3::Core.Const((1, 1))
│   %7  = (%6 === nothing)::Core.Const(false)
│   %8  = Base.not_int(%7)::Core.Const(true)
└──       goto #4 if not %8
2 ┄ %10 = @_3::Tuple{Int64, Int64}
│         (n = Core.getfield(%10, 1))
│   %12 = Core.getfield(%10, 2)::Int64
│   %13 = Main.:+::Core.Const(+)
│   %14 = length_sum::Int64
│   %15 = Main.unrolled_applyat::Core.Const(UnrolledUtilities.unrolled_applyat)
│   %16 = n::Int64
│   %17 = (%15)(Main.length, %16, itr)::Int64
│         (length_sum = (%13)(%14, %17))
│         (@_3 = Base.iterate(%4, %12))
│   %20 = @_3::UNION{NOTHING, TUPLE{INT64, INT64}}
│   %21 = (%20 === nothing)::Bool
│   %22 = Base.not_int(%21)::Bool
└──       goto #4 if not %22
3 ─       goto #2
4 ┄       return nothing
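
Conceptually, unrolled_applyat(f, n, itr) computes f(itr[n]) while unrolling over the possible values of n, so that f is always applied to an element whose type is statically known. A rough sketch of the underlying idea (not necessarily the package's actual implementation) is a recursive dispatch on the index:

julia> sketch_applyat(f, n, itr::Tuple) =  # hypothetical sketch, not the real implementation
           n == 1 ? f(first(itr)) : sketch_applyat(f, n - 1, Base.tail(itr))
sketch_applyat (generic function with 1 method)
julia> sketch_applyat(length, 2, ((1, 2), (1, 2, 3)))
3

With f = length, every branch returns an Int64, which is why the loop in unrolled_add_lengths is inferrable even though the elements of itr have different types.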

For a detailed breakdown of when the tools provided by this package can improve performance, see the User Guide.

What Does Loop Unrolling Do?

When a loop over N indices is unrolled, it gets compiled into N consecutive copies of the loop body, each of which uses a constant (Core.Const) index. For example, an unrolled loop that prints every integer from 1 to 33 is compiled into the following:

julia> @code_llvm debuginfo=:none unrolled_foreach(println, Tuple(1:33))
; Function Signature: unrolled_foreach(typeof(Base.println), NTuple{33, Int64})
define void @julia_unrolled_foreach_194895(ptr nocapture noundef nonnull readonly align 8 dereferenceable(264) %"itr::Tuple") #0 {
top:
  %"itr::Tuple[1]_ptr.unbox" = load i64, ptr %"itr::Tuple", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[1]_ptr.unbox")
  %"itr::Tuple[2]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 1
  %"itr::Tuple[2]_ptr.unbox" = load i64, ptr %"itr::Tuple[2]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[2]_ptr.unbox")
  %"itr::Tuple[3]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 2
  %"itr::Tuple[3]_ptr.unbox" = load i64, ptr %"itr::Tuple[3]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[3]_ptr.unbox")
  %"itr::Tuple[4]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 3
  %"itr::Tuple[4]_ptr.unbox" = load i64, ptr %"itr::Tuple[4]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[4]_ptr.unbox")
  %"itr::Tuple[5]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 4
  %"itr::Tuple[5]_ptr.unbox" = load i64, ptr %"itr::Tuple[5]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[5]_ptr.unbox")
  %"itr::Tuple[6]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 5
  %"itr::Tuple[6]_ptr.unbox" = load i64, ptr %"itr::Tuple[6]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[6]_ptr.unbox")
  %"itr::Tuple[7]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 6
  %"itr::Tuple[7]_ptr.unbox" = load i64, ptr %"itr::Tuple[7]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[7]_ptr.unbox")
  %"itr::Tuple[8]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 7
  %"itr::Tuple[8]_ptr.unbox" = load i64, ptr %"itr::Tuple[8]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[8]_ptr.unbox")
  %"itr::Tuple[9]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 8
  %"itr::Tuple[9]_ptr.unbox" = load i64, ptr %"itr::Tuple[9]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[9]_ptr.unbox")
  %"itr::Tuple[10]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 9
  %"itr::Tuple[10]_ptr.unbox" = load i64, ptr %"itr::Tuple[10]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[10]_ptr.unbox")
  %"itr::Tuple[11]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 10
  %"itr::Tuple[11]_ptr.unbox" = load i64, ptr %"itr::Tuple[11]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[11]_ptr.unbox")
  %"itr::Tuple[12]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 11
  %"itr::Tuple[12]_ptr.unbox" = load i64, ptr %"itr::Tuple[12]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[12]_ptr.unbox")
  %"itr::Tuple[13]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 12
  %"itr::Tuple[13]_ptr.unbox" = load i64, ptr %"itr::Tuple[13]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[13]_ptr.unbox")
  %"itr::Tuple[14]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 13
  %"itr::Tuple[14]_ptr.unbox" = load i64, ptr %"itr::Tuple[14]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[14]_ptr.unbox")
  %"itr::Tuple[15]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 14
  %"itr::Tuple[15]_ptr.unbox" = load i64, ptr %"itr::Tuple[15]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[15]_ptr.unbox")
  %"itr::Tuple[16]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 15
  %"itr::Tuple[16]_ptr.unbox" = load i64, ptr %"itr::Tuple[16]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[16]_ptr.unbox")
  %"itr::Tuple[17]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 16
  %"itr::Tuple[17]_ptr.unbox" = load i64, ptr %"itr::Tuple[17]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[17]_ptr.unbox")
  %"itr::Tuple[18]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 17
  %"itr::Tuple[18]_ptr.unbox" = load i64, ptr %"itr::Tuple[18]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[18]_ptr.unbox")
  %"itr::Tuple[19]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 18
  %"itr::Tuple[19]_ptr.unbox" = load i64, ptr %"itr::Tuple[19]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[19]_ptr.unbox")
  %"itr::Tuple[20]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 19
  %"itr::Tuple[20]_ptr.unbox" = load i64, ptr %"itr::Tuple[20]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[20]_ptr.unbox")
  %"itr::Tuple[21]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 20
  %"itr::Tuple[21]_ptr.unbox" = load i64, ptr %"itr::Tuple[21]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[21]_ptr.unbox")
  %"itr::Tuple[22]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 21
  %"itr::Tuple[22]_ptr.unbox" = load i64, ptr %"itr::Tuple[22]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[22]_ptr.unbox")
  %"itr::Tuple[23]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 22
  %"itr::Tuple[23]_ptr.unbox" = load i64, ptr %"itr::Tuple[23]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[23]_ptr.unbox")
  %"itr::Tuple[24]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 23
  %"itr::Tuple[24]_ptr.unbox" = load i64, ptr %"itr::Tuple[24]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[24]_ptr.unbox")
  %"itr::Tuple[25]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 24
  %"itr::Tuple[25]_ptr.unbox" = load i64, ptr %"itr::Tuple[25]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[25]_ptr.unbox")
  %"itr::Tuple[26]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 25
  %"itr::Tuple[26]_ptr.unbox" = load i64, ptr %"itr::Tuple[26]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[26]_ptr.unbox")
  %"itr::Tuple[27]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 26
  %"itr::Tuple[27]_ptr.unbox" = load i64, ptr %"itr::Tuple[27]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[27]_ptr.unbox")
  %"itr::Tuple[28]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 27
  %"itr::Tuple[28]_ptr.unbox" = load i64, ptr %"itr::Tuple[28]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[28]_ptr.unbox")
  %"itr::Tuple[29]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 28
  %"itr::Tuple[29]_ptr.unbox" = load i64, ptr %"itr::Tuple[29]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[29]_ptr.unbox")
  %"itr::Tuple[30]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 29
  %"itr::Tuple[30]_ptr.unbox" = load i64, ptr %"itr::Tuple[30]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[30]_ptr.unbox")
  %"itr::Tuple[31]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 30
  %"itr::Tuple[31]_ptr.unbox" = load i64, ptr %"itr::Tuple[31]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[31]_ptr.unbox")
  %"itr::Tuple[32]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 31
  %"itr::Tuple[32]_ptr.unbox" = load i64, ptr %"itr::Tuple[32]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[32]_ptr.unbox")
  %"itr::Tuple[33]_ptr" = getelementptr inbounds [33 x i64], ptr %"itr::Tuple", i64 0, i64 32
  %"itr::Tuple[33]_ptr.unbox" = load i64, ptr %"itr::Tuple[33]_ptr", align 8
  call void @j_println_194898(i64 signext %"itr::Tuple[33]_ptr.unbox")
  ret void
}

This LLVM code consists of 33 load instructions and 33 call instructions (each of which switches execution to println), along with 32 getelementptr instructions that each compute the address of a Tuple element at a constant index between 1 and 32 (the first element is loaded directly from the Tuple's base pointer). In more complex examples where the call instructions get inlined, these constant indices can be propagated into the LLVM code of the function being called. On the other hand, here is the LLVM code for the non-unrolled version of this loop:

julia> @code_llvm debuginfo=:none foreach(println, Tuple(1:33))
; Function Signature: foreach(typeof(Base.println), NTuple{33, Int64})
define void @julia_foreach_194905(ptr nocapture noundef nonnull readonly align 8 dereferenceable(264) %"itr::Tuple") #0 {
top:
  %"itr::Tuple[1]_ptr.unbox" = load i64, ptr %"itr::Tuple", align 8
  call void @j_println_194908(i64 signext %"itr::Tuple[1]_ptr.unbox")
  br label %pass

L35:                                              ; preds = %pass
  ret void

pass:                                             ; preds = %pass, %top
  %0 = phi i64 [ 1, %top ], [ %value_phi14, %pass ]
  %value_phi14 = phi i64 [ 2, %top ], [ %1, %pass ]
  %1 = add nuw nsw i64 %value_phi14, 1
  %2 = getelementptr inbounds i64, ptr %"itr::Tuple", i64 %0
  %.unbox = load i64, ptr %2, align 8
  call void @j_println_194908(i64 signext %.unbox)
  %exitcond.not = icmp eq i64 %1, 34
  br i1 %exitcond.not, label %L35, label %pass
}

This LLVM code loads the first value directly and then runs a loop in which a single getelementptr instruction with a non-constant integer index retrieves each remaining value. After every call, a conditional branch checks whether the last index has been reached.
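
One way to obtain this kind of fully unrolled code is to recurse over the elements of a Tuple, so that the compiler resolves the "loop" into a straight-line sequence of calls. Here is a minimal sketch of that idea (UnrolledUtilities may rely on a different mechanism, such as generated functions, especially for long iterators):

julia> sketch_foreach(f, itr::Tuple{}) = nothing  # hypothetical sketch, not the real implementation
sketch_foreach (generic function with 1 method)
julia> function sketch_foreach(f, itr::Tuple)
           f(first(itr))
           sketch_foreach(f, Base.tail(itr))
       end
sketch_foreach (generic function with 2 methods)
julia> sketch_foreach(println, (1, 2, 3))
1
2
3

Each recursive call operates on a Tuple whose length is part of its type, so every call to f ends up with a constant index, just like in the unrolled LLVM code above.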

Downsides of Loop Unrolling

Given the performance benefits of loop unrolling, it might seem at first that the standard library needs more of it. However, the standard library is not just meant for writing high-performance code with statically sized iterators; many of its use cases involve code that is only executed once or a handful of times. In such cases, most of the total execution time is spent on compilation, and minimizing run time makes no practical difference. Although unrolled functions can occasionally be faster to compile than non-unrolled functions, they are typically slower to compile, which means that using them instead of standard library functions can often increase total execution time:

julia> tup32 = ntuple(Returns((1, 2)), 32);
julia> @elapsed map(first, tup32)
0.009610456
julia> @elapsed unrolled_map(first, tup32)
0.014892017

The increase in compilation time is usually no more than a factor of 5 for small iterators, but it grows as iterator length increases:

julia> tup320 = ntuple(Returns((1, 2)), 320);
julia> @elapsed map(first, tup320)
0.009517612
julia> @elapsed unrolled_map(first, tup320)
0.40387804

Loop unrolling can also theoretically increase the run time of a function in addition to its compilation time, since unrolled machine code occupies more space and takes longer to fetch than non-unrolled code. In practice, though, the constant propagation enabled by unrolling usually compensates for this slowdown.

So, when type instabilities and memory allocations need to be removed (as is required for static compilation) and the cost to total execution time is more or less irrelevant, using unrolled functions is probably worthwhile. Otherwise, if a significant increase in compilation time (and potentially also run time) needs to be avoided, using standard library functions might be a better option.

It is usually a good idea to compare the performance of unrolled code against non-unrolled code before settling on a particular design. Many examples of such comparisons can be found in the tables of benchmarks that are automatically generated for this package.