At many points in my life, I have needed to manipulate bitstrings of constant weight (a constant number of 1s). A good example is in my package dynamite for numerically simulating continuous-time dynamics of large collections of spin 1/2 particles, specifically when the Hamiltonian preserves total magnetization/particle number. I realized from chatting with a colleague that some of the bit twiddling hacks I’ve come up with over the years might be helpful to other people, so I figured I would record them here. I should note that I don’t know if I am the first person to come up with the constructions I describe below, but a cursory search on the internet didn’t come up with them (and at the time of writing, they seem to be missing from what does come up for me, like this StackOverflow post or this blog post). So here goes!
If you just want to see the fastest code in this post for iterating through bitstrings of constant weight, jump here.
Some basic facts
Before we jump in, it’s good to establish some basic facts about strings of constant weight. We’ll use \(n\) for the bitstring length, and \(k\) for the weight. There are \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\) such strings, so wherever we’re using these, we probably are going to want \(k\) not to be too big. To help with intuition, here is a list of the strings for \(n=5\), \(k=3\):
00111
01011
01101
01110
10011
10101
10110
11001
11010
11100
Note that these are sorted in lexicographic order, or equivalently by their value when interpreted as big-endian integers. We will use this ordering throughout this post. When discussing bit positions, we will use zero-indexing from the right (so the zeroth bit is the rightmost one).
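As a quick check, we can brute-force that list (and the count) directly, since count_ones gives the weight of a string:

# enumerate all 5-bit values and keep the ones with exactly three 1s
weight3 = [x for x in 0:2^5-1 if count_ones(x) == 3]
@assert length(weight3) == binomial(5, 3)   # 10 strings, as promised
println(join((string(x, base=2, pad=5) for x in weight3), "\n"))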
Iterating through bitstrings: improving on Gosper’s hack
A common task is to simply iterate through all constant-weight strings efficiently. We'll do this by writing a function next_bitstring that takes an arbitrary bitstring as input and produces the lexicographically next string having the same weight. (Something I find neat about the algorithms in this section is that they need neither \(n\) nor \(k\) as inputs.)
It's easiest (at least for me) to first see an example of the transformation we want to make. Here are two bitstrings of the same weight that are neighbors in the lexicographic ordering (the ellipses correspond to arbitrary leading bits that don't change).
...01110000
...10000011
next_bitstring called on the first value should return the second one. Let's take a closer look at what's going on here.
In general, any bitstring with weight \(k \geq 1\) ends in a block of zeros of length \(t_0 \geq 0\). Immediately to the left of that is a block of ones of length \(t_1 \geq 1\). For the first value above, \(t_0 = 4\) and \(t_1 = 3\). What we need to do is scoot the 1 at position \(t_0 + t_1 - 1\) to the left by one position, and then move the remaining \(t_1 - 1\) 1s in the block all the way to the right. Here is some Julia code that does that in a straightforward (ish) way:

function next_bitstring_simple(x::T) where T
    # count how long the string of zeros on the right is
    t0 = trailing_zeros(x)

    # increment by 1 at position t0, which maps e.g. 011100 -> 100000
    x += T(1) << t0

    # now trailing_zeros(x) = t0 + t1, so it's easy to compute t1
    t1 = trailing_zeros(x) - t0

    # set the bottom t1-1 bits to 1
    x |= (T(1) << (t1-1)) - T(1)

    # (in Julia, the value of the last expression is automatically returned)
end
In Julia, the syntax function fn(x::T) where T allows us to access the type of the input x inside of the function definition.
Let’s try it!
(Hidden: some code to make integers print in binary)
Base.show(io::IO, x::UInt8) = print(io, string(x; base=2))
next_bitstring_simple(0b01110000)
10000011
Woohoo! It worked!
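For a slightly bigger sanity check, repeatedly calling it should walk through the whole \(n=5\), \(k=3\) list from the previous section, in order:

# start from the smallest weight-3 string of length 5 and step through all of them
let x = UInt8(0b00111)
    for _ in 1:binomial(5, 3)
        println(string(x, base=2, pad=5))
        x = next_bitstring_simple(x)
    end
end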
Now let’s look at an alternative way to do this, which is the one you might find if you search the web. It’s an absolute classic in the history of bit hacking called Gosper’s hack, which follows the same broad plan as our function above but uses only standard arithmetic functions and bitwise operations. It uses two main observations:
- c = x & -x is equivalent to c = 1 << trailing_zeros(x)
- Letting r = x + c, the expression ((r ⊻ x) >> 2) ÷ c yields a bitstring with the bottom \(t_1-1\) bits set to 1

In Julia, ⊻ is bitwise XOR.
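It's easy to convince yourself of both observations on the example value from above:

x = 0b01110000
c = x & -x
@assert c == 1 << trailing_zeros(x)        # observation 1
r = x + c
@assert ((r ⊻ x) >> 2) ÷ c == 0b00000011   # observation 2: the bottom t1-1 = 2 bits are set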
If you want to dive more into why those expressions work, check out this blog post. Let’s put these ideas together:
function next_bitstring_gosper(x)
    c = x & -x    # equivalent to: c = 1 << trailing_zeros(x)
    r = x + c     # maps e.g. 011100 -> 100000, as in next_bitstring_simple
    (((r ⊻ x) >> 2) ÷ c) | r    # sets bottom t1-1 bits to 1
end
In Julia (and C/C++), for x of unsigned type we have -x = typemax(x) - x + 1. This works fine for our bit manipulation needs.
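For example:

x = 0b01110000
@assert -x == typemax(UInt8) - x + 0x01   # unsigned negation just wraps around mod 2^8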
Now we can compare the performance of these two functions:
(Hidden: some code setting up a benchmark)
function run_benchmark(next_bitstring_fn, T, n, k)
    print("$(string(next_bitstring_fn)) ")

    # run once first to precompile the function
    iterate_all_bitstrings(next_bitstring_fn, T, 8, 4)

    @time result = iterate_all_bitstrings(next_bitstring_fn, T, n, k)
    @assert result == ((one(T) << k) - one(T)) << (n-k)
end

function iterate_all_bitstrings(next_bitstring_fn, ::Type{T}, n, k) where T
    x = (one(T) << k) - one(T)
    for _ in 1:(binomial(n, k)-1)
        x = next_bitstring_fn(x)
    end
    x
end

n = 48
k = 8
# this benchmark iterates through all bitstrings with length n and weight k
run_benchmark(next_bitstring_simple, UInt64, n, k)
run_benchmark(next_bitstring_gosper, UInt64, n, k)
next_bitstring_simple 2.551500 seconds (1 allocation: 16 bytes)
next_bitstring_gosper 2.957309 seconds (1 allocation: 16 bytes)
What the heck, the simple one is faster! So why is everyone posting on the internet about the Gosper one?
next_bitstring_simple is faster because it takes advantage of the trailing_zeros function, which on most modern CPUs has its own instruction (e.g. TZCNT for x86). Gosper's hack accomplishes something similar by dividing by the integer c, which is a power of two. But the compiler cannot infer that c is a power of two, so it has to use the regular division instruction, which is slow! Presumably, a machine instruction for trailing_zeros simply didn't exist when Gosper's hack was invented, and so the division trick was the most efficient way to do it at the time.
In C/C++, trailing_zeros is accessible via the GCC builtin __builtin_ctz for unsigned int, and related builtins for arguments of other sizes.
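One way to see the difference is to compare the code generated for dividing by a compile-time power of two versus a runtime value (the exact output will depend on your CPU and Julia/LLVM version):

# dividing by a literal power of two compiles down to (essentially) a right shift
code_native(x -> x ÷ UInt64(16), (UInt64,))
# dividing by a runtime value has to fall back to the general div instruction
code_native((x, c) -> x ÷ c, (UInt64, UInt64))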
But just because trailing_zeros is faster than division doesn't mean we should ignore the other cool stuff going on in Gosper's hack. Indeed, we can borrow other nice ideas from Gosper, and get the best of both worlds:
function next_bitstring_fast(x)
    t0 = trailing_zeros(x)
    tmp = x

    # this line maps e.g. ...011100 to ...100000
    # x & -x is equivalent to T(1) << t0, but faster
    x += (x & -x)

    # tmp ⊻ x is a block of t1+1 ones starting at t0
    # thus shifting it to the right by t0+2 gives t1-1 ones on the right
    x |= ((tmp ⊻ x) >> (t0 + 2))
end
run_benchmark(next_bitstring_fast, UInt64, n, k)
next_bitstring_fast 1.110415 seconds (1 allocation: 16 bytes)
Woohoo! It’s more than twice as fast as either of the other implementations!
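It's also worth double-checking that it computes the same thing as the simple version; for example, stepping both through all the length-16, weight-4 strings:

let x = UInt64(0b1111)
    for _ in 1:binomial(16, 4)-1
        @assert next_bitstring_fast(x) == next_bitstring_simple(x)
        x = next_bitstring_fast(x)
    end
end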
Julia's << and >> operators are defined a little differently than the same operators in e.g. C: Julia allows for negative shifts (the shift goes in the opposite direction), as well as shifts longer than the bit length of the shifted value (for which it returns 0). This means that << doesn't compile down to just the left-shift machine instruction. For UInt64 we can convince the compiler not to worry about shifts outside the range 0-63 by simply masking the shift: instead of 1 << t0 we do 1 << (t0 & 63). In my tests, there was a negligible difference in performance of next_bitstring_fast whether or not the mask was included, so I omitted it for simplicity. But I use it below when examining the assembly generated by the compiler. If you're interested, there is a discussion about this behavior here.
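For concreteness, here are the behaviors in question, plus what the mask changes:

@assert UInt64(1) << 64 == 0         # out-of-range shifts are defined to give 0
@assert UInt64(8) << -1 == 4         # a negative shift goes in the other direction
@assert UInt64(1) << (64 & 63) == 1  # with the mask, the count wraps mod 64, like the machine instruction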
Getting into the weeds: why is it so much faster?
Compared to next_bitstring_simple, we've made essentially two changes in next_bitstring_fast. Let's look at both of them.

First, we replaced 1 << t0 with x & -x. Interpreted literally, it's not clear at all why these should perform differently: they both seem to require two operations. In the first, it's loading the constant 1 and then shifting left; in the second, it's negating x and then performing a logical AND.

For the bit shift version (using a mask of 63 as described in the note above, so Julia compiles it down to just the left shift instruction):
code_native(t0 -> UInt64(1) << (t0 & 63), (Int,))
yields the assembly (setup and cleanup omitted)
mov eax, 1
shlx rax, rax, rdi
The first line loads 1 into the eax register, and the second line shifts it by the value in rdi. This is what we expected.
How about the other implementation?
code_native(x -> x & -x, (UInt64,))
yields
blsi rax, rdi
Whoah, what the heck is blsi? It turns out there is a single x86 instruction ("Extract Lowest Set Isolated Bit", part of the BMI1 extension) for the whole operation we were trying to do here! So, the difference between the two implementations is that the compiler recognizes that x & -x corresponds to the blsi instruction and just emits that. The compiler couldn't figure out that 1 << t0 was actually doing the same thing.
Now let's look at the second replacement we made in next_bitstring_fast. It's the part where we set the low \(t_1 - 1\) bits to 1. Originally we had x |= (1 << (trailing_zeros(x)-t0-1)) - 1. Including a mask on the shift again to make the compiler happy, we get the following.
code_native((x, t0) -> x | ((UInt64(1) << (63 & (trailing_zeros(x)-t0-1))) - 1), (UInt64, Int))
yields
tzcnt rax, rdi ; rax = trailing_zeros(x)
not esi ; t0 = ~t0, equiv to -t0-1
add eax, esi ; rax = trailing_zeros(x)-t0-1
mov rcx, -1 ; start with -1, the string of all 1s
shlx rax, rcx, rax ; left shift makes the t1-1 right bits 0
not rax ; invert, so t1-1 right bits are 1 and rest zero
or rax, rdi ; finally or it into the other value
We replaced this with x |= ((tmp ⊻ x) >> (t0 + 2)); let's see what it gives (again masking the shift):
code_native((x, tmp, t0) -> x | ((tmp ⊻ x) >> (63 & (t0 + 2))), (UInt64, UInt64, Int))
yields
xor rsi, rdi ; tmp ⊻= x
add dl, 2 ; t0 += 2
shrx rax, rsi, rdx ; tmp >> t0
or rax, rdi ; x |= tmp
Dang, that’s beautiful! And clearly faster than the other one.
Just for the satisfaction, let's look at the assembly for the whole function next_bitstring_fast in all its glory:
tzcnt rax, rdi
blsi rcx, rdi
add rcx, rdi
xor rdi, rcx
add al, 2
shrx rax, rdi, rax
or rax, rcx
Seven total instructions, and all of them are cheap operations. Pretty awesome!
Getting good performance on arbitrary precision integers
Let's see what happens when we apply these functions to Julia's arbitrary precision integer type, BigInt:
# test smaller since BigInt is slower
n = 36
k = 6
run_benchmark(next_bitstring_simple, BigInt, n, k)
run_benchmark(next_bitstring_gosper, BigInt, n, k)
run_benchmark(next_bitstring_fast, BigInt, n, k)
next_bitstring_simple 1.946280 seconds (29.22 M allocations: 564.698 MiB, 38.07% gc time)
next_bitstring_gosper 2.019920 seconds (29.22 M allocations: 549.837 MiB, 31.36% gc time)
next_bitstring_fast 1.550008 seconds (25.32 M allocations: 475.535 MiB, 31.94% gc time)
next_bitstring_fast is still the fastest, but not by much. We see from the output that it is doing a ton of memory allocations. What gives? Because BigInt is an arbitrary precision type, it is allocated on the heap; the way these functions are written, a new BigInt is being allocated for every new value produced during the computation! To get good performance we should instead work in-place as much as possible. Doing so involves digging under the hood a bit to access the in-place multi-precision arithmetic operations:
# the module containing low-level operations on BigInt
const MPZ = Base.GMP.MPZ

function next_bitstring_inplace(x::BigInt, tmp::BigInt = zero(BigInt))
    MPZ.set!(tmp, x)     # tmp = x
    MPZ.neg!(tmp)        # tmp = -x
    MPZ.and!(tmp, x)     # tmp = -x & x
    MPZ.add!(tmp, x)     # tmp = (-x & x) + x

    # return value of trailing_zeros is Int64 even for argument of type BigInt
    # thus t0 is allocated on the stack; we don't need to preallocate
    t0 = trailing_zeros(x)

    MPZ.xor!(x, tmp)              # x ⊻= tmp
    MPZ.fdiv_q_2exp!(x, t0 + 2)   # x >>>= t0 + 2
    MPZ.ior!(x, tmp)              # x |= tmp
end
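A quick check that the in-place version does the same thing as before:

x = BigInt(0b01110000)
next_bitstring_inplace(x)
@assert string(x, base=2) == "10000011"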
function preallocated_benchmark(n, k)
    tmp = zero(BigInt)   # preallocate tmp
    run_benchmark(x -> next_bitstring_inplace(x, tmp), BigInt, n, k)
end
preallocated_benchmark(n, k)
#11 0.142667 seconds (9 allocations: 176 bytes)
Woohoo! That's like 10x faster than any of the other functions were when applied to the type BigInt, and it is clear that no new memory is being allocated on each iteration.
Finally, I note that if you have a set value of n, you don't really need arbitrary precision integers: you know the maximum possible value, so the integers never need to grow dynamically, and you can avoid allocating them on the heap. Therefore, if performance is really critical I would probably set up a custom type using e.g. StaticArrays.jl; due to AVX instructions I bet the performance would be blazing. On the other hand, these functions are all so fast that if you are actually doing anything with the bitstrings, whatever you are doing will almost certainly be the dominant cost, not generating the bitstrings.
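To give a flavor of what such a custom type might look like, here is a rough sketch for a fixed width of 128 bits, stored as two 64-bit words in an SVector. The names Bits128, bits_trailing_zeros, and next_bitstring_fixed are made up for this sketch (they don't come from any package), and I haven't benchmarked it:

using StaticArrays

# two 64-bit words; word 1 is the least significant
const Bits128 = SVector{2, UInt64}

bits_trailing_zeros(x::Bits128) =
    x[1] != 0 ? trailing_zeros(x[1]) : 64 + trailing_zeros(x[2])

function next_bitstring_fixed(x::Bits128)
    t0 = bits_trailing_zeros(x)

    # lowest set bit, word by word (the multi-word analogue of x & -x)
    c = x[1] != 0 ? Bits128(x[1] & -x[1], zero(UInt64)) :
                    Bits128(zero(UInt64), x[2] & -x[2])

    # r = x + c, propagating the carry between the words
    lo = x[1] + c[1]
    hi = x[2] + c[2] + (lo < x[1])
    r = Bits128(lo, hi)

    # x ⊻ r is a block of t1+1 ones starting at t0; shift it right by t0 + 2
    d1, d2 = x[1] ⊻ r[1], x[2] ⊻ r[2]
    s = t0 + 2
    if s >= 64
        low, high = d2 >> (s - 64), zero(UInt64)
    else
        low, high = (d1 >> s) | (d2 << (64 - s)), d2 >> s
    end

    Bits128(r[1] | low, r[2] | high)
end

For example, next_bitstring_fixed(Bits128(0b00111, 0)) gives Bits128(0b01011, 0), matching the first step of the \(n=5\), \(k=3\) list above.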