When to Use UnrolledUtilities

The functions and types exported by this package tend to perform better than their counterparts from Base and Base.Iterators in the scenarios listed below. Additional examples and more precise measurements can be found in the automatically generated tables of benchmarks.

Outline:

  • When to Use Unrolled Functions
  • When to Use StaticOneTo and StaticBitVector

When to Use Unrolled Functions

Long iterators

  • map has an unstable return type for iterators with lengths greater than 32:

    julia> Test.@inferred map(one, Tuple(1:31));

    julia> Test.@inferred map(one, Tuple(1:32));
    ERROR: return type NTuple{32, Int64} does not match inferred return type Tuple

    julia> Test.@inferred unrolled_map(one, Tuple(1:32));
  • getindex has an unstable return type for Core.Const slices of length N > 10 from iterators with lengths greater than N + 2:

    julia> first_11(itr) = itr[1:11]
    first_11 (generic function with 1 method)

    julia> Test.@inferred first_11(Tuple(1:13));

    julia> Test.@inferred first_11(Tuple(1:14));
    ERROR: return type NTuple{11, Int64} does not match inferred return type Tuple{Vararg{Int64}}

    julia> unrolled_first_11(itr) = unrolled_take(itr, Val(11))
    unrolled_first_11 (generic function with 1 method)

    julia> Test.@inferred unrolled_first_11(Tuple(1:14));
  • For benchmarks that indicate performance improvements when using unrolled functions with long iterators, see Isolated Unrolled Functions
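The stability of `unrolled_take` comes from encoding the slice length in the type domain rather than in a runtime range. A minimal Base-only sketch of this idea (`take_sketch` is a hypothetical name used only for illustration, not part of the package) uses `ntuple` with a `Val` argument:

```julia
# Hypothetical sketch: encoding the slice length N in the type domain lets
# the compiler infer an NTuple{N} return type, mirroring unrolled_take.
take_sketch(itr, ::Val{N}) where {N} = ntuple(i -> itr[i], Val(N))

take_sketch(Tuple(1:14), Val(11))  # returns (1, 2, ..., 11)
```

Because `N` is a type parameter instead of a runtime value, the output length is known during compilation, so inference succeeds even for sources longer than `N + 2`.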

Iterators with elements of different types

  • in, any, all, foreach, and other functions in Base have intermediate type instabilities that trigger allocations for nonuniform iterators:

    julia> nonuniform_itr = ((1, 2), (1, 2, 3));
    julia> @allocated () in nonuniform_itr
    80

    julia> @allocated unrolled_in((), nonuniform_itr)
    0

    julia> @allocated any(isempty, nonuniform_itr)
    144

    julia> @allocated unrolled_any(isempty, nonuniform_itr)
    0
  • getindex has an unstable return type for nonuniform iterators when given non-constant (i.e., not Core.Const) indices, which can lead to intermediate type instabilities that trigger allocations:

    julia> function add_lengths(itr)
               length_sum = 0
               for n in 1:length(itr)
                   length_sum += length(itr[n])
               end
           end
    add_lengths (generic function with 1 method)

    julia> @allocated add_lengths(nonuniform_itr)
    160

    julia> function unrolled_add_lengths(itr)
               length_sum = 0
               for n in 1:length(itr)
                   length_sum += unrolled_applyat(length, n, itr)
               end
           end
    unrolled_add_lengths (generic function with 1 method)

    julia> @allocated unrolled_add_lengths(nonuniform_itr)
    0
    Note
    How can unrolled_applyat be stable if n isn't a Core.Const?

    For the example of add_lengths, the compiler must infer the return type of itr[::Int64] before it can compile the call to length. Since this return type depends on the index n, the compiler needs to insert a runtime lookup into the method table that determines which method of length to call, length(::Tuple{Int64, Int64}) or length(::Tuple{Int64, Int64, Int64}), and this triggers allocations.

    For the example of unrolled_add_lengths, the compiler instead infers the return types of itr[::Core.Const(1)], itr[::Core.Const(2)], and so on for every index into itr. Then, it compiles a call to length for each of these return types, and it inserts a runtime switch instruction that determines which result of length to return for a particular value of n. As long as length itself only returns one type (in this case, Int64), this ensures that unrolled_add_lengths has no intermediate type instabilities.

    In other words, unrolled_applyat combines multiple methods for length and getindex into a single method, replacing the inefficient method table lookup that switches between them with a simpler switch instruction.

    Tip
    When should getindex be replaced with unrolled_applyat?

    The specific example above can be simplified by using unrolled_sum, instead of using a for-loop in conjunction with unrolled_applyat:

    julia> nonuniform_itr = ((1, 2), (1, 2, 3));

    julia> @allocated unrolled_sum(length, nonuniform_itr)
    0

    However, there are often situations in which loops cannot be replaced with function calls, such as when the loops are parallelized over multiple threads. For example, since CUDA is unable to compile kernels with type instabilities, something like the switch instruction in unrolled_applyat is required in order to parallelize over nonuniform iterators on GPUs.

  • For benchmarks that indicate performance improvements when using unrolled functions with nonuniform iterators, see Isolated Unrolled Functions and Nested Unrolled Functions
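The switch-based dispatch described in the note above can be written out by hand for a two-element iterator (`applyat_sketch` is a hypothetical name; the package's `unrolled_applyat` generalizes this pattern to any length):

```julia
# Hypothetical sketch of the code structure unrolled_applyat generates: each
# branch indexes the tuple with a constant, so each call to f is compiled for
# a concrete element type, and the branch on n lowers to a switch instruction
# instead of a method table lookup.
applyat_sketch(f, n, itr::Tuple{Any, Any}) =
    n == 1 ? f(itr[1]) : f(itr[2])

applyat_sketch(length, 2, ((1, 2), (1, 2, 3)))  # returns 3
```

Since `length` returns `Int64` from both branches, the sketch has a concrete return type even though `n` is a runtime value.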

Reductions with intermediate values of different types

  • reduce and accumulate have unstable return types when the return type of the reduction operator is not constant, but only for iterators with lengths greater than 32:

    julia> Test.@inferred reduce(tuple, Tuple(1:32));

    julia> Test.@inferred reduce(tuple, Tuple(1:33));
    ERROR: return type Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Tuple{Int64, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64}, Int64} does not match inferred return type Union{Int64, Tuple{Any, Int64}}

    julia> Test.@inferred unrolled_reduce(tuple, Tuple(1:33));
  • For benchmarks that indicate performance improvements when using unrolled functions with nonuniform reductions, see Isolated Unrolled Functions
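One way to picture how a reduction can be unrolled during compilation is recursion on the tuple's structure, so that every intermediate value gets its own inferred type (`foldl_sketch` is a hypothetical name; the real `unrolled_reduce` uses additional machinery, including a disabled recursion limit, to stay inferable for long iterators):

```julia
# Hypothetical sketch: each recursive call sees a shorter tuple type, so the
# compiler infers a separate intermediate type at every step of the fold.
foldl_sketch(op, init, itr::Tuple{}) = init
foldl_sketch(op, init, itr::Tuple) =
    foldl_sketch(op, op(init, itr[1]), Base.tail(itr))

foldl_sketch(tuple, 1, (2, 3))  # returns ((1, 2), 3)
```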

Functions with recursion during compilation

  • Any function that recursively calls itself with different types of arguments can trigger a compilation heuristic called the "recursion limit", which causes the compiler to generate code with type instabilities and allocations

    • Before Julia 1.11, the default recursion limit applied to every function with at least 3 levels of recursion during compilation, including relatively simple functions like an analogue of length for nested Tuples:

      julia> nested_itr_of_depth_2 = ((1, 2), (3, 4));

      julia> nested_itr_of_depth_3 = (((1,), (2,)), ((3,), (4,)));

      julia> recursive_length(itr) = eltype(itr) <: Tuple ? sum(recursive_length, itr) : length(itr)
      recursive_length (generic function with 1 method)

      julia> Test.@inferred recursive_length(nested_itr_of_depth_2);

      julia> Test.@inferred recursive_length(nested_itr_of_depth_3);
    • As of Julia 1.11, the default recursion limit is no longer triggered for this simple example, but still applies to more complex recursive functions:

      julia> recursive_sum_with_logs(itr) =
                 eltype(itr) <: Tuple ?
                 sum(recursive_sum_with_logs, itr) +
                 sum(log ∘ recursive_sum_with_logs, itr) :
                 sum(itr)
      recursive_sum_with_logs (generic function with 1 method)

      julia> Test.@inferred recursive_sum_with_logs(nested_itr_of_depth_2);

      julia> Test.@inferred recursive_sum_with_logs(nested_itr_of_depth_3);
      ERROR: return type Float64 does not match inferred return type Any

      julia> unrolled_recursive_sum_with_logs(itr) =
                 eltype(itr) <: Tuple ?
                 unrolled_sum(unrolled_recursive_sum_with_logs, itr) +
                 unrolled_sum(log ∘ unrolled_recursive_sum_with_logs, itr) :
                 unrolled_sum(itr)
      unrolled_recursive_sum_with_logs (generic function with 1 method)

      julia> Test.@inferred unrolled_recursive_sum_with_logs(nested_itr_of_depth_3);
    Note
    How can the default recursion limit be avoided?

    The recursion limit of any function f can be disabled by evaluating the following code in the module where f is defined:

    @static if hasfield(Method, :recursion_relation)
        for method in methods(f)
            method.recursion_relation = Returns(true)
        end
    end

    However, the recursion limits of functions defined in Base cannot be modified from any module outside of Base. Since the default limit applies to all functions from Base, they can have unstable return types whenever they need more than 2 levels of recursion for compilation, even if the user-defined functions passed to them have no recursion limits. The only way to avoid the default limit of a function from Base is to replace it with a function whose limit has been disabled (such as an analogous function from UnrolledUtilities).

  • For benchmarks that indicate performance improvements when using unrolled functions with recursive operations, see Recursive Unrolled Functions
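For a concrete picture of how the `recursion_relation` snippet from the note above is used, it can be applied to a user-defined recursive function (`my_depth` is a hypothetical name used only for illustration):

```julia
# Hypothetical recursive function whose recursion limit we disable, so that
# the compiler keeps inferring it at arbitrary nesting depths instead of
# falling back to Any once the default limit is hit.
my_depth(x) = x isa Tuple ? 1 + maximum(my_depth, x; init = 0) : 0

# The snippet from the note, applied to my_depth:
@static if hasfield(Method, :recursion_relation)
    for method in methods(my_depth)
        method.recursion_relation = Returns(true)
    end
end

my_depth((((1,), 2), 3))  # returns 3
```

Because `my_depth` is defined in the same module where the loop runs, its methods can be modified; the same loop applied to a function from Base would not be possible from user code.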

When to Use StaticOneTo and StaticBitVector

Iterators of Ints from 1 to N

UnrolledUtilities.StaticOneTo — Type
StaticOneTo(N)

A lazy and statically sized analogue of Base.OneTo(N).

This iterator can only store the integers from 1 to N, so its output_type_for_promotion is NoOutputType(). An efficient method is provided for unrolled_take, but no other unrolled functions can use StaticOneTos as output types.


If an iterator only contains the integers from 1 to N ≥ 0, it is possible to provide the compiler with the values in the iterator in addition to their types by using a StaticOneTo, as opposed to a Tuple or something similar. This can allow the compiler to fully optimize out code that depends on those values, essentially moving the code's execution from run time to compilation time:

julia> @code_llvm debuginfo=:none mapreduce(abs2, +, (1, 2, 3))
; Function Signature: mapreduce(typeof(Base.abs2), typeof(Base.:(+)), Tuple{Int64, Int64, Int64})
define i64 @julia_mapreduce_195958(ptr nocapture noundef nonnull readonly align 8 dereferenceable(24) %"itr::Tuple") #0 {
top:
  %"itr::Tuple[2]_ptr" = getelementptr inbounds [3 x i64], ptr %"itr::Tuple", i64 0, i64 1
  %"itr::Tuple[3]_ptr" = getelementptr inbounds [3 x i64], ptr %"itr::Tuple", i64 0, i64 2
  %"itr::Tuple[1]_ptr.unbox" = load i64, ptr %"itr::Tuple", align 8
  %0 = mul i64 %"itr::Tuple[1]_ptr.unbox", %"itr::Tuple[1]_ptr.unbox"
  %"itr::Tuple[2]_ptr.unbox" = load i64, ptr %"itr::Tuple[2]_ptr", align 8
  %1 = mul i64 %"itr::Tuple[2]_ptr.unbox", %"itr::Tuple[2]_ptr.unbox"
  %2 = add i64 %1, %0
  %"itr::Tuple[3]_ptr.unbox" = load i64, ptr %"itr::Tuple[3]_ptr", align 8
  %3 = mul i64 %"itr::Tuple[3]_ptr.unbox", %"itr::Tuple[3]_ptr.unbox"
  %4 = add i64 %2, %3
  ret i64 %4
}
julia> @code_llvm debuginfo=:none mapreduce(abs2, +, StaticOneTo(3))
; Function Signature: mapreduce(typeof(Base.abs2), typeof(Base.:(+)), UnrolledUtilities.StaticOneTo{3})
define i64 @julia_mapreduce_196016() #0 {
top:
  ret i64 14
}

Standard library functions can sometimes take advantage of this optimization, but for most non-trivial operations it is necessary to use unrolled functions:

julia> @code_llvm debuginfo=:none mapreduce(log, +, StaticOneTo(3))
; Function Signature: mapreduce(typeof(Base.log), typeof(Base.:(+)), UnrolledUtilities.StaticOneTo{3})
define double @julia_mapreduce_196076() #0 {
top:
  %0 = call double @j_log_196082(double 2.000000e+00)
  %1 = fadd double %0, 0.000000e+00
  %2 = call double @j_log_196082(double 3.000000e+00)
  %3 = fadd double %1, %2
  ret double %3
}
julia> @code_llvm debuginfo=:none unrolled_mapreduce(log, +, StaticOneTo(3))
; Function Signature: unrolled_mapreduce(typeof(Base.log), typeof(Base.:(+)), UnrolledUtilities.StaticOneTo{3})
define double @julia_unrolled_mapreduce_196090() #0 {
top:
  ret double 0x3FFCAB0BFA2A2002
}

For benchmarks that indicate performance improvements when using StaticOneTos, see Very Long Iterators.

Note
Can the compiler infer iterator values in other scenarios?

The compiler can usually infer the values of iterators that only contain singletons when they are accessed using Core.Const indices, but this is not possible for non-singletons (e.g., integers) unless some special type of iterator is used (e.g., a StaticOneTo).
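The reason a `StaticOneTo` exposes its values to the compiler can be illustrated with a minimal Base-only sketch that stores the length `N` in the type domain and computes each value from its index (`OneToSketch` is a hypothetical name, far less featureful than the real `StaticOneTo`):

```julia
# Hypothetical sketch: with N in the type and every value equal to its index,
# the compiler knows the entire contents of the iterator during compilation,
# so computations over it can be constant-folded.
struct OneToSketch{N} end
Base.length(::OneToSketch{N}) where {N} = N
Base.eltype(::Type{<:OneToSketch}) = Int
Base.getindex(::OneToSketch, i::Integer) = Int(i)
Base.iterate(::OneToSketch{N}, i = 1) where {N} = i > N ? nothing : (i, i + 1)

sum(abs2, OneToSketch{3}())  # returns 14
```

An ordinary `Tuple` only exposes its element types, so a `Tuple` of `Int`s cannot be optimized this way; the values must be recoverable from the type itself.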

Long iterators of Bools that get modified across loop iterations

UnrolledUtilities.StaticBitVector — Type
StaticBitVector{N, [U]}(f)
StaticBitVector{N, [U]}([bit])

A statically sized analogue of BitVector with Unsigned chunks of type U, which can be constructed using either a function f(n) or a constant bit. By default, U is set to UInt8 and bit is set to false.

This iterator can only store Bools, so its output_type_for_promotion is a ConditionalOutputType. Efficient implementations are provided for all unrolled functions, though the methods for unrolled_map and unrolled_accumulate only apply when the first item in the output is a Bool.


Loops in Julia often allocate memory when a value larger than 32 bytes in size is modified across loop iterations (regardless of whether the loops are unrolled or not). Since Bools are represented by bytes, this limits certain types of loops to modifying bitmasks of no more than 32 Bools in order to avoid allocations. Unlike an iterator of Bools, though, a StaticBitVector stores 8 bits in every byte, which makes it possible to modify up to 256 bits at a time in loops without any allocations:

julia> random_bit_flips(itr) = reduce(
           (itr′, i) -> Base.setindex(itr′, !itr′[rand(1:i)], i),
           1:length(itr);
           init = itr,
       )
random_bit_flips (generic function with 1 method)

julia> @allocated random_bit_flips(ntuple(Returns(true), Val(32)))
0

julia> @allocated random_bit_flips(ntuple(Returns(true), Val(33)))
4224

julia> @allocated random_bit_flips(StaticBitVector{256}(true))
0

As with StaticOneTos, standard library functions can occasionally optimize StaticBitVectors as effectively as unrolled functions can, but most complex use cases require unrolled functions.

For benchmarks that indicate performance improvements when using long StaticBitVectors that get modified across loop iterations, see Nested Unrolled Closures.
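The space savings behind `StaticBitVector` can be illustrated with a Base-only sketch of bit packing, where 64 `Bool`s fit in a single 8-byte `UInt64` chunk instead of 64 separate bytes (`setbit` and `getbit` are hypothetical helper names, not the package's API):

```julia
# Hypothetical sketch of the bit packing a StaticBitVector performs: one
# UInt64 chunk stores 64 Bools, keeping loop state far below the 32-byte
# threshold that triggers allocations.
setbit(bits::UInt64, i, b::Bool) =
    b ? bits | (UInt64(1) << (i - 1)) : bits & ~(UInt64(1) << (i - 1))
getbit(bits::UInt64, i) = isodd(bits >> (i - 1))

bits = setbit(UInt64(0), 7, true)
getbit(bits, 7)  # returns true
```

A `StaticBitVector` applies the same idea across multiple `Unsigned` chunks while exposing the interface of an iterator of `Bool`s.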