Chapter 2 dynamite: massively parallel numerics for many-body quantum spin systems

2.1 Introduction

The spin-1/2 particle is one of the simplest and most famous quantum systems, and its dynamics and behavior were established early in the study of theoretical quantum mechanics. The Schrodinger equation for a single two-level system has a straightforward form, and in many cases can be solved analytically without much difficulty. However, the study of collections of such simple particles has proved much more difficult, and in many cases much more interesting. Emergent properties including novel phases of matter, some of which only arise in exotic circumstances such as out-of-equilibrium driven systems, have generated excitement and driven an intense push towards the deeper understanding of these systems’ behavior. Throughout most of the $20^{t h}$ century, advances in their study came largely from the development of increasingly clever analytic methods. But recently, concurrent, drastic improvements in both high-performance computing power and experimental control of individual quantum particles have enabled new forays into questions about many-body quantum mechanics which had remained stubbornly resistant to analysis.

Even with the power of modern computers, the study of many-body quantum systems remains challenging. For example, simply representing an arbitrary quantum operator with support on $L$ spin-1/2 particles into computer memory requires storing roughly $4^{L}$ complex numbers—a prospect which is infeasible even on a supercomputer, for $L$ of just a couple dozen. Despite this, great progress has been made through the design of numerical algorithms which take advantage of the fact that the space of physically relevant quantum operators and states is often much smaller than the space of all possible ones. A classic and widely-used example of such a technique is the tensor network formalism, and more specifically matrix product states, which can be used to efficiently store and manipulate quantum states of low entanglement. A number of numerical libraries have been developed to perform tensor network computations, and they have become crucial tools in the field of numerical many-body physics [FWS22, HP18]. However, quantum systems with large amounts of entanglement are not treatable with matrix product states—a set that includes some of the most exciting and mysterious phenomena, such as the many-body localization phase transition and the dynamics of quantum information scrambling. The numerical study of these systems requires storing the entire wavefunction explicitly; as a result, interrogating the phenomena that emerge only when $L$ is large requires leveraging all of the tools of modern supercomputing, both in algorithms and hardware.

In this chapter, we present dynamite, a library for the numerical study of systems of spin-1/2 particles. Its main functions are to compute the time evolution of pure states in closed quantum systems, and to compute the eigenvalues and eigenvectors of quantum operators. This is achieved via Krylov subspace algorithms, in which a low-dimensional subspace that is continually updated throughout the computation until covergence is achieved—without relying on particular features of the wavefunction like low entanglement. Built on top of the PETSc and SLEPc sparse linear algebra libraries, dynamite is designed from the ground up to leverage the massive parallelism available in modern high-performance computing environments, with the ability to harness large numbers of CPUs distributed across many compute nodes via MPI, and to perform highly optimized computations on the thousands of cores of datacenter GPUs via CUDA. Previously, many-body physics studies looking to achieve parallelism at this scale required custom implementations in low-level languages like C, with much optimization required to achieve good performance. dynamite abstracts away this optimization from the user, offering a clean, intuitive Python interface in which quantum operators and states can be built symbolically (Figure 2.1(a)). This symbolic representation is then passed to a highly optimized backend, where features like GPU acceleration and memory saving “matrix-free” methods (see Section 2.4) can be enabled with the push of a button, enabling computations that are simultaneously very fast and memory-efficient (Figure 2.1(c)). At the same time, dynamite is portable, running equally as well in a Jupyter notebook on a laptop as it does in a massively distributed compute job on a cluster. This, along with its simple interface, makes it also useful as a tool for quick exploratory computations in the lab. Both operators and states can easily be exported from their symbolic representation to numpy format as well, allowing easy interface with existing code bases.


from dynamite.operators import sigmax, sigmay, sigmaz

H = 0
for i, j in kagome_bonds:  # definition of kagome_bonds not shown
    # Heisenberg interaction between spins i and j
    H += sum(0.25*s(i)*s(j) for s in [sigmax, sigmay, sigmaz])

ground_state_energy = H.eigsolve()

(a) Code snippet solving for the ground state energy of the Heisenberg model on the Kagome lattice.

(b) A Kagome lattice geometry, for 30 spins on a torus. Light gray bonds represent “wrap-around” bonds due to the periodic boundary conditions.

N. spins	Dim. of $H$	Solve time
24	1,352,078	0:00:01.9
27	20,058,300	0:00:32.9
30	77,558,760	0:03:40.3

(c) Time to solve for the ground state, for various system sizes, on one Nvidia A100 GPU. Conservation laws were used to reduce the dimension of the Hilbert space, and the solve was performed in “matrix-free” mode. In all cases, Hilbert space size is reduced by working in the half-filling subspace (this model conserves total magnetization). For

N = 24

and

N = 27

, Hilbert space is reduced by another factor of 2 by applying a

Z_{2}

(spin-flip) symmetry. dynamite was built with real number scalars for these timing results.

Figure 2.1: An example application of dynamite: solving for the ground state of the Heisenberg model on the Kagome lattice. A full script for this example, including code to compute the lattice bonds, is distributed with the dynamite source code, in the examples/scripts/kagome/ directory.

This chapter is organized as follows. In Section 2.2, we summarize the core functionality of dynamite, and also note how it is distinguished from several other software packages. In Section 2.3, we walk through several examples of studies that can be performed by dynamite, presenting along the way some novel physics results that dynamite was used to discover. In Section 2.4, we describe the algorithms and computational techniques underlying dynamite. In Section 2.5 we provide performance results for its solvers, measuring both elapsed time and memory usage in a number of different computational scenarios. Finally, in Section 2.6 we conclude with some discussion and outlook for the future.

2.2 Core functionality

dynamite offers the following core features:

Symbolic operators on arbitrary geometries Symbolically build operators on collections of spin-1/2 particles with arbitrary connectivity, by summing $n$ -body products of single-site Pauli operators. When computations are performed, dynamite automatically converts this symbolic representation in Python into the numerical one appropriate for the selected backend.
Time evolution Given a time-independent Hamiltonian $H$ , time $t$ and a pure state $| ψ_{0} ⟩$ , compute $| ψ_{t} ⟩ = exp (- i H t) | ψ_{0} ⟩$ . Or, compute the imaginary time evolution $| ψ_{τ} ⟩ = exp (- H τ) | ψ_{0} ⟩$ . Piecewise-constant time-dependent Hamiltonians (such as Floquet systems) can also be used.
Eigensolving Compute an operator’s $k$ lowest eigenvalues and their associated eigenstates, for user-configurable $k$ . Or, additionally given a target energy $τ$ , compute the $k$ eigenvalues closest to $τ$ and their associated states.
Entropy Tracing out arbitrary sets of spins from Efficiently compute reduced density matrices from pure states, and the associated von Neumann or Renyi entropies.
Symmetry subspaces Reduce computation time and memory usage by specifying subspaces corresponding to symmetries conserved by the Hamiltonian. dynamite has optimized implementations of $U (1)$ (conservation of total spin) and $Z_{2}$ (spin flip) symmetries. It also allows users to define their own subspaces consisting of arbitrary sets of product states (for example, the allowed states in a system with Rydberg blockades).
Matrix-free methods Drastically reduce memory usage by running computations “matrix free,” where matrix elements are computed on the fly from the symbolic representation rather than being stored in memory. See Section 2.4.2 for details.
Extensive documentation, tutorials, and examples Full documentation of all of dynamite’s functionality can be found at https://dynamite.readthedocs.io/. That site also includes instructions for installing the package and running the tutorial Jupyter notebooks, as well as a link to a large set of example scripts.

Some related software packages, whose functionality intersects with dynamite’s in various ways, include: QuSpin [WB17], a python package for exact diagonalization and dynamics that uses OpenMP parallelization; SPINPACK, a C package for finding lowest-energy eigenstates that uses MPI (or hybrid MPI-Pthreads) parallelization [203]; and SpinED, a static executable for calculating lowest-energy eigenstates (single node support) [WES23]. To our knowledge dynamite is the first numerical package that allows users to symbolically build quantum operators as Python objects, and then run computations on them in a massively parallel, distributed memory or GPU accelerated setting, including features such as matrix-free computation.

2.3 Examples

In this section we present a series of examples representing a range of interesting physics problems that can be studied with dynamite. Each of the following examples has a corresponding entry in the examples/scripts directory of the dynamite source tree, where users can find the full example source code and accompanying documentation. Those full examples also include various extra performance optimizations and other tricks, so the authors recommend that readers interested in reproducing or modifying these examples use those scripts as a starting point. Two of the examples also correspond to new scientific insight that was gained through studies using dynamite; we discuss the results where applicable.

2.3.1 Floquet prethermalization

Related published works:

Exponentially slow heating [arxiv] [journal]

Long-range prethermal phases [arxiv] [journal]

Closed quantum systems subjected to a drive tend to heat up to infinite temperature, because the drive inputs energy to the system but no energy is ever dissipated. At first sight, this means that most driven closed quantum systems are not particularly interesting—the infinite temperature state corresponds with a density matrix that is the identity, and is thus featureless. One way of avoiding this infinite-temperature fate is by setting up the system to exhibit many-body localization (MBL), which prevents the system from thermalizing, even at infinite time (MBL is explored further in Section 2.3.2). However, another way to observe interesting physics is to look at the system’s behavior pre-thermal regime: the period of time before it has had the chance to fully heat up. Amazingly, analytic studies have shown that in certain Floquet systems—those in which the drive is periodic—this pre-thermal regime can exist for a time scale at least exponentially long in the frequency of the drive. [MKS16, KMS16, ADH+17a, EBN17, ADH+17b] The hope is that such a long prethermal regime provides a sufficient window of time to observe interesting phenomena. Here, we numerically study the time evolution of a set of related Hamiltonians, and observe that long-range order induced by an effective Hamiltonian in the prethermal regime can lead to a novel out-of-equilibrium phase of matter called a discrete time crystal—in which a discrete time translation symmetry is spontaneously broken.

We begin by considering a 1D spin chain with open boundary conditions, with a long- or short-range $Z Z$ interaction (either decaying as a power law, or nearest-neighbor). In addition we apply a nearest-neighbor $X X$ interaction and a uniform, static magnetic field $\to h$ . This interaction has the form:

H = J_{z} \sum i < j \frac{σ_{i}^{z} σ_{j}^{z}}{| i - j |^{α}} + J_{x} \sum ⟨ i, j ⟩ σ_{i}^{x} σ_{j}^{x} + \sum i \to h \cdot \to σ

(2.1)

where the angle brackets on the second term indicates it is only a nearest-neighbor interaction. We may consider the nearest-neighbor interaction to correspond to the case of $α = \infty$ . In dynamite, this Hamiltonian itself can be implemented in just a few lines of code, as shown in Code Listing 2.1.


def build_hamiltonian(alpha, Jz, Jx, h):
    # sums over all ranges of interaction
    # index_sum takes the interaction sigmaz(0)*sigmaz(r) and
    # translates it along the spin chain
    long_range_ZZ = op_sum(
        1/r**alpha * index_sum(0.25*sigmaz(0)*sigmaz(r))
        for r in range(1, config.L)
    )

    # an XX interaction on every neighboring pair of sites
    # the 0.25 is because spin operators are 1/2 times the Pauli
    nearest_neighbor_XX = index_sum(0.25*sigmax(0)*sigmax(1))

    # op_sum combines the three components of the magnetic field vector;
    # index_sum translates it to every site along the spin chain
    magnetic_field = index_sum(
        op_sum(hi*0.5*s() for hi, s in zip(h, [sigmax, sigmay, sigmaz]))
    )

    return Jz*long_range_ZZ + Jx*nearest_neighbor_XX + magnetic_field

Code Listing 2.1: Implementation of the long-range Hamiltonian in Eq. 2.1.

For the Floquet drive, after every period $T$ of time evolution the system will undergo a global $π$ -pulse which we denote as $X$ , rotating all spins by $180^{\circ}$ around the $x$ -axis. (Note that this is equivalent to flipping the direction of the magnetic field $\to h$ across the $x$ axis every time $T$ ).

There are a few quantities we would like to track during the evolution of this system. First, we’d like to measure the energy of the effective Hamiltonian $D^{*}$ under which the system evolves (in a frame that flips along with the $π$ -pulses $X$ ). The exact form of $D^{*}$ cannot be expressed concisely, but we can approximate it as the Hamiltonian of Eq. 2.1 averaged over a full cycle of length $2 T$ , which we denote as $D$ :

D = \frac{1}{2} (H + X^{†} H X)

(2.2)

In addition to the expectation value of $D$ , we compute the magnetization of each spin in the $Z$ direction as the expectation values of $S_{i}^{z}$ for all $i$ , as well as the entanglement entropy. With that, we can implement the Floquet evolution of our system, as shown in Code Listing 2.2. Note how $D$ and the global pi-pulse are extremely easy to implement in dynamite via its support for arbitrary operator arithmetic. Furthermore, both operators are very fast to compute because the arithmetic is implemented symbolically via Pauli operations, rather than matrix-matrix products.


# the Hamiltonian; see previous code listing
H = build_hamiltonian(args.alpha, 1, args.Jx, args.h_vec)

# pi pulse operator
X = index_product(sigmax())

# the averaged "effective" Hamiltonian
D = (H + X*H*X)/2

# we create Deff and the Sz operators before the iterations start,
# in order to cache their matrix form
Sz_ops = [0.5*sigmaz(i) for i in range(args.L)]

# a workspace vector to store the output of the evolution in
tmp = state.copy()

for cycle in range(args.n_cycles):
    H.evolve(state, result=tmp, t=args.T)  # evolve under H for a time T
    X.dot(tmp, result=state)               # apply the pi pulse

    # "state" now contains the state vector at this time step
    # now compute some stats...

    half_chain_entropy = entanglement_entropy(state, range(args.L//2))

    D_val = D.expectation(state)

    Sz_vals = []
    for i in range(args.L):
        Sz_vals.append(Sz_ops[i].expectation(state))

    # now perhaps print the results, or save them to a file, or...

Code Listing 2.2: Time evolution under the Floquet Hamiltonian.

Using dynamite to evolve this system yields the results plotted in Figure 2.2. Here, we have used a system size of $L = 22$ spins, with the parameters $α \in {1.13, \infty}$ (for long- and short-range interactions respectively), $J_{z} = 1$ , $J_{x} = 0.75$ , and $\to h = (0.21, 0.17, 0.13)$ .¹¹ 1 For the time evolution in the figure we have built the Hamiltonian with closed rather than open boundary conditions, for which we define the long range-interaction strength by replacing $| i - j |^{- α}$ with ${[(L / π) sin | i - j | π / L]}^{- α}$ , which matches the periodicity of the boundary conditions. The frequency $ω$ is the inverse of the time $T$ that is allowed to elapse before the spin flip operator $X$ is applied. In all cases the evolution starts from a product state; the precise product state is labeled in Figure 2.2 and is varied in the long-range case to explore both low-energy (“cold”) and high-energy (“hot”) initial states.

Figure 2.2: Time evolution in a Floquet driven system, exhibiting prethermal discrete time crystalline behavior. [MEK+20] The behavior in the limit

ω \to \infty

is computed by evolving directly under

D

, which is equal to

D^{*}

in this limit. (a-c) Expectation value of the energy with respect to

D

, the approximate effective Hamiltonian in the prethermal regime. The energy is approximately conserved during the prethermal regime, until the system heats to infinite temperature at a time scale exponentially long in

ω

. (d-f) Half-chain entanglement entropy. We observe that by the time

τ_{p r e}

, the system thermalizes with respect to the effective pre-thermal Hamilonian, at which time the entropy hits a plateau; only when the system leaves the prethermal regime does the entropy approach its infinite temperature value. (g-i) Total magnetization. Period-doubling behavior, corresponding to spontaneous breaking of the discrete time-translation symmetry, is observed; it is stable only in the cold long-range case where

D^{*}

supports ferromagnetic order.

Begin by observing the top row of plots of Figure 2.2, which show the expectation value of the approximate effective Hamiltonian $D$ throughout the time evolution. In every case (short-range interactions, and both temperatures of long-range interactions) we observe a clear signature of a pre-thermal regime corresponding to the effective Hamiltonian approximated by $D$ . The expectation value of $D$ is conserved for the period in which the system’s evolution is well-approximated by it, and only after a sufficiently long time does the system begin to heat up and eventually reach a thermal state in which $⟨ D ⟩ = 0$ . As expected, we observe that the length of the prethermal regime depends exponentially on the frequency $ω$ . The same pattern is shown in the entanglement entropy, in the middle row of plots, with the addition of the fact that we see the initial effective thermalization of the product state to the thermal state with respect to $D$ ! In particular, the initial product state has no entanglement, and after a time $τ_{p r e}$ (which does not depend on $ω$ ) the system reaches an entanglement entropy corresponding to a thermal state of the same temperature as the initial state (with respect to $D$ ). This is made particularly clear by comparing to the black dotted line, which plots the limit of $ω = \infty$ (implemented by just evolving under $D$ directly, which is exactly equal to $D^{*}$ in this limit). The “kink” corresponding to where the system has thermalized with respect to $D$ but has not yet left the prethermal regime is called the “prethermal plateau.” Finally, in the bottom row of plots, we observe the time-crystalline behavior that was alluded to at the beginning of this section, in the form of period-doubling: the magnetization only returns to its original value every other period of the drive, corresponding to a spontaneously broken time translation symmetry. Note in particular that in the short-range and hot long-range cases, this time crystalline phase is not stable: it disappears by the time $τ_{p r e}$ is reached, because in these cases the effective Hamiltonian $D^{*}$ is not capable of supporting long-range magnetic order (the time crystal “melts”). On the other hand, in the cold long-range case, $D^{*}$ supports stable long-range ferromagnetic order, and the time crystalline behavior survives all the way until the system leaves the prethermal regime and begins to thermalize to its final infinite temperature state—a length of time that is exponentially dependent on the frequency. For more details and corresponding analytic analysis, see the published manuscripts corresponding to this study. [MKE+19, MEK+20]

2.3.2 Many-body localization in the Heisenberg random field model

One of the main motivations for pushing numerics to the largest possible system sizes is to study emergent phenomena which simply cannot exist in just a few spins. A prototypical example is many-body localization (MBL): a surprising phenomenon in which certain many-body systems with sufficiently strong disorder fail to thermalize. In particular we will study the MBL transition, in which the system moves from thermalizing to localized as the disorder strength is increased. Characterization of this transition has remained elusive: finite size effects seem hard to avoid, and tensor network methods break down due to extensive entanglement in states near the transition. Thus, iterative methods like the Krylov subspace methods used by dynamite have proved to be one of the best tools for its study. [PML+18, LLA15] We refer readers interested in learning more about the physics of MBL to one of the excellent review papers on the subject. [AAB+19]

Here, we show how to implement a model of nearest-neighbor Heisenberg interactions on a 1D chain with open boundary conditions, with disorder implemented as random $Z$ fields on each site:

H = \sum ⟨ i, j ⟩ {\to S}_{i} \cdot {\to S}_{j} + \sum i h_{i} S_{i}^{z}

(2.3)

where $\to S = (S^{x}, S^{y}, S^{z})$ , the subscripts indicate the index of the spin in the chain, and the angle brackets indicate that the indices run over nearest neighbors. The values of $h_{i}$ are drawn from a uniform distribution $[- h, h]$ where $h$ is a parameter that controls the strength of the disorder. The implementation in dynamite is presented in Code Listing 2.3. This Hamiltonian conserves total magnetization $\sum S^{z}$ , which we use to reduce the size of the Hilbert space for our computations.


from dynamite.operators import sigmax, sigmay, sigmaz, op_sum, index_sum
from numpy import random

def build_hamiltonian(L, h):
    # Heisenberg interaction on all nearest neighbors
    H = index_sum(
        op_sum(0.25*s(0)*s(1) for s in [sigmax, sigmay, sigmaz]),
        size=L
    )

    # random fields
    H += op_sum(
        random.uniform(-h, h)*0.5*sigmaz(i)
        for i in range(L)
    )

Code Listing 2.3: Implementation of the nearest-neighbor Heisenberg model, plus on-site random fields.

In this example we identify the MBL transition via the half-chain entanglement entropy of eigenstates, $S_{L / 2}$ , which is simply the bipartite von Neumann entropy when half the spins are traced out. The MBL transition should correspond to a transition from volume law to area law entanglement. In the full example included with the dynamite source, we also examine an eigenvalue statistic called the “adjacent gap ratio.”

The key feature that makes MBL so interesting is that the transition from volume to area law entanglement does not only occur in the ground state, but in excited states as well. This presents a great use case for dynamite’s “target” eigenvalue solver, which finds the $k$ eigenvalues (and eigenvectors) closest to a target energy, where $k$ is user configurable. Code Listing 2.4 implements the following computation: 1) choose an energy in the middle of the spectrum, 2) solve for 32 eigenpairs near this point, and then 3) compute the entanglement entropy for all of those eigenpairs.

An extensive numerical study of the MBL problem is presented in Chapter 3.


from dynamite.subspaces import SpinConserve
from dynamite.computations import entanglement_entropy

# ...
# not shown: set L and h
# ...

H = build_hamiltonian(L, h)
H.subspace = SpinConserve(L, L//2)

# compute 32 eigenpairs near energy 0.1,
# which is in the middle of the spectrum
evals, evecs = H.eigsolve(target=0.1, nev=32, get_vecs=True)

# compute the half-chain entanglement entropy of each
entropies = []
for evec in evecs:
    entropies.append(
        entanglement_entropy(evec, range(L//2))
    )

Code Listing 2.4: Implementation of the Heisenberg + random field model. build_hamiltonian is implemented in Code Listing 2.3.

2.3.3 Many-body chaos in the Sachdev-Ye-Kitaev model

Related published work: [arxiv] [journal]

In spirit the Sachdev-Ye-Kitaev (SYK) model represents the opposite of the localization in the previous example: it is expected to scramble information highly efficiently via chaotic many-body dynamics. [KIT15, MS16] Indeed, it exhibits maximal chaos: the Lyapunov exponent, which characterizes the rate of chaos, saturates its upper bound of $2 π T$ , where $T$ is the temperature of the system. [MSS16] Its physics can be connected to the dynamics of quantum information in black holes, providing a testbed for exotic phenomena such as scrambling-based teleportation. [GJW17, MSY17, BGL+23, SKG+22]

The SYK model gives us a chance to look at how quantum systems other than spins can be explored with dynamite, by transforming them onto a spin system. The SYK model we’ll use consists of Majoranas interacting in 0D, with random couplings. Specifically it consists of every possible 4-body interaction among N Majoranas, with each term having a random coupling strength:

H = \sqrt{\frac{6}{N^{3}}} \sum i j k l J_{i j k l} χ_{i} χ_{j} χ_{k} χ_{l}

(2.4)

where $J_{i j k l}$ are randomly chosen from a Gaussian distribution with variance 1.

To map the Majoranas onto the spin systems that are natively supported in dynamite, we can use the following transformation. For the Majorana with index $i$ , let $q = ⌊ i / 2 ⌋$ . Then

χ_{i} = σ_{q}^{{x, y}} \prod m \in [0, q - 1] σ_{m}^{z}

(2.5)

where the first Pauli is $σ^{x}$ if $i$ is even and $σ^{y}$ if it’s odd. In words, the Majorana consists of a $σ^{x}$ or $σ^{y}$ with a string of $σ^{z}$ extending to the edge of the spin chain. Note that we get two Majoranas for each spin! This is straightforward to implement in dynamite, but is actually already built in in the dynamite.extras module so we don’t have to do it ourselves. Using that, an implementation of this Hamiltonian is shown in Code Listing 2.5. Note that this Hamiltonian has enough terms that building it can take an appreciable amount of time; a more efficient but less transparent version of the build_hamiltonian function can be found in the full SYK example, distributed with the source code of dynamite. The full example also takes advantage of a $Z_{2}$ symmetry in the Hamiltonian to speed up the computations.


from dynamite.extras import majorana
import numpy as np

def build_hamiltonian(N):
    # N is the number of Majoranas

    H = 0
    for i in range(N):
        for j in range(i+1, N):
            for k in range(j+1, N):
                for l in range(k+1, N):
                    Jijkl = np.random.normal()
                    H += Jijkl*majorana(i)*majorana(j)*majorana(k)*majorana(l)

    return np.sqrt(6/N**3) * H

Code Listing 2.5: Implementation of the SYK Hamiltonian (Eq. 2.4).

We can study the fast scrambling behavior of the SYK model by examining out of time order correlators (OTOCs). In particular, we will measure to what extent two local operators $V (0)$ and $W (t)$ anticommute for various times $t$ , where the anticommutator at time $t = 0$ is zero:

C (t) = ⟨ {| {W (t), V (0)} |}^{2} ⟩ .

(2.6)

It is helpful to reduce this to the following equivalent expression

C (t) = 2 R e [⟨ W (t) V (0) W (t) V (0) ⟩] + 1 / 2

(2.7)

which is the formulation of $C (t)$ that we will use here.

Defining $O (t) = W (t) V (0) W (t) V (0)$ , we are specifically interested in this operator’s expectation value with respect to thermal states, of various (inverse) temperatures $β$ . Now, dynamite’s speed comes from the fact that it works with pure states, rather than mixed states—so the obvious plan to just compute $T r [O (t) e^{- β H}]$ is out of the question. Instead, we can take advantage of an idea called “quantum typicality” to get an estimate of the expectation value more efficiently. [BG09, SGB14] Quantum typicality says that $T r [O (t) e^{- β H}]$ is well approximated by the expectation value of $e^{- β H / 2} O (t) e^{- β H / 2}$ with respect to random states (where we have split the thermal operator in half across the trace to get a more symmetric result). For simplicity we can set $W = χ_{0}$ and $V = χ_{1}$ . With that, for a uniformly random state $| ψ_{r} ⟩$ we will compute (writing things out in full):

⟨ ψ_{r} | e^{- β H / 2} e^{i H t} χ_{0} e^{- i H t} χ_{1} e^{i H t} χ_{0} e^{- i H t} χ_{1} e^{- β H / 2} | ψ_{r} ⟩

(2.8)

In Code Listing 2.6 we demonstrate how the expectation value of this operator can be computed in dynamite.


from dynamite import config
from dynamite.extras import majorana
import numpy as np

# ... define number of majoranas N, temperature beta, time t, etc ...

# number of spins is half the number of majoranas that live on them
# this configures the number of spins globally
config.L = (N+1)//2

W = majorana(0)
V = majorana(1)

H = build_hamiltonian(N)

# imaginary time evolution for e^-beta*H/2
ket = H.evolve(State(state=’random’), t=-1j*beta/2)
bra = ket.copy()

# need a temporary "workspace" vector as well
tmp = ket.copy()

for _ in range(2):  # need to apply this twice
    # V(0)
    V.dot(tmp, result=ket)

    # W(t)
    H.evolve(ket, t=t, result=tmp)
    W.dot(tmp, result=ket)
    H.evolve(ket, t=-t, result=tmp)

# finally take the inner product and get the result!
otoc_val = bra.dot(ket)
C_t = 2*otoc_val.real + 1/2

Code Listing 2.6: Computation of the expectation value of the out-of-time-order correlator with respect to a thermal state, via quantum typicality. Here we work through Eq. 2.8 from right to left, applying time evolution and operators as we go. Since the thermal imaginary time evolution is symmetric on both sides of the Equation, we “re-use” it rather than computing it twice.

In Figure 2.3, we compute $F (t)$ which is closely related to $C (t)$ : it is the “regularized” OTOC in which the thermal density matrix is spread equally across the operator, see the supplementary information of the published manuscript for details [KYK+21]. Note that there are a lot of independent parameters: the time $t$ , temperature $β$ , and system size $N$ can all be varied, and to achieve good statistical significance requires then averaging over both the disorder in $H$ and the random states used in the quantum typicality computation. Figure 2.3(a) shows the dependence of $F (t)$ on $t$ for various system sizes, at $β J = 10$ . (The plotted value is $~ F (t) = F (t) / F (0)$ ). There are two main observations to be made from this figure. First, the scrambling behavior is clear: $F (t)$ decays toward zero with a certain time scale, determined at early times by the Lyanpunov exponent $λ$ as $1 - ~ F (t) \sim e^{λ t} / N$ . Second, there is clear flow of $λ$ with system size. A finite-size rescaling procedure can be used to extrapolate $λ$ to the limit of infinite system size, at which point it can be plotted as a function of the temperature $β J$ (Figure 2.3(b)). The observed dependence of $λ$ tracks closely with the result of a semiclassical solution (the Schwinger-Dyson equations, labeled “SD” in the figure), and at large $β$ (low temperature), the theoretically predicted fast-scrambling limit of $λ \leq 2 π / β$ . For more details, including of the finite-size rescaling procedure and the prediction from Schwinger-Dyson equations, see the published manuscript presenting these results. [KYK+21]

The results summarized in Figure 2.3 and elucidated in [KYK+21] provided the first numerical evidence supporting the claim that the SYK model exhibits fast scrambling behavior at low temperature. Previous efforts at smaller system sizes were unable to observe this behavior due to finite size effects; this study was made possible by dynamite’s speed at very large system sizes which enabled analysis of Hamiltonians on up to 60 Majoranas. Due to the four-body all-to-all connectivity of the model (Eq. 2.4), the Hamiltonian has an extremely large number of terms in the SYK Hamiltonian (487,635 for a system size of 60 Majoranas!), explicitly storing the (sparse!) matrix in memory is infeasible for large problems: for 60 Majoranas it would require over 350 terabytes of RAM! Thus dynamite’s matrix-free (“shell”) mode, which only stores and manipulates the operator symbolically, was crucial for this study.

After these results were published, two studies followed in which I was not directly involved, but warrant mention for the curious reader. The many-body teleportation of quantum states via scrambling, a phenomenon that emphasizes the duality between quantum mechanical models like SYK and the dynamics of gravity models, in particular black holes and wormholes, was numerically observed with dynamite. [SKG+22] In another study, dynamite was used to show that sparse SYK, in which many of the $J_{i j k l}$ are zero, can reproduce many of the features of the regular SYK model, but with a much lower computational cost to the simulation. [CMP21]

Figure 2.3: Fast scrambling in the SYK model. [KYK+21] (a) The value of the regularized and normalized OTOC

~ F (t) = F (t) / F (0)

(essentially equivalent to

C (t)

) as a function of time

t

, for varying system sizes

N

. Clear scrambling behavior is observed in the exponential growth of

1 - ~ F (t)

at short times, from which a Lyapunov exponent

λ

can be extracted. (b) The result of extrapolating

λ

to infinite system size, for various values of the temperature

β

. The results agree with the semiclassical prediction from the Schwinger-Dyson equations, and at low temperature saturate the fast scrambling limit of

λ \leq 2 π / β

2.4 Methods

In this section, we give an overview of the numerical techniques used by dynamite, including their strengths and limitations. First we describe at a high level the intuition behind Krylov subspace algorithms. Next we describe the symbolic representation of operators used internally by dynamite, and how it is used to implement the “matrix-free” multiplication of states by operators. Finally we describe techniques used for parallelization, with a focus on the matrix-free operations.

2.4.1 Krylov subspace algorithms

The two main computationally intensive tasks that dynamite performs are time evolution and eigensolving. At first glance, for an operator $H$ , the computational task of quantum time evolution is to compute the unitary $U = exp (- i H t)$ , and for eigensolving it is to compute a matrix $A$ and diagonal matrix $Λ$ such that $A H = Λ H$ . Both $U$ and $(A, Λ)$ can be computed in a straightforward manner using standard linear algebra subroutines (for example, SciPy’s linalg.expm and linalg.eigh routines); however, as we explain below, doing so is exponentially more expensive than is necessary for most physics studies. For an quantum operator on a Hilbert space of dimension $N$ , matrices $U$ and $A$ both of dimension $N \times N$ , consisting of $4^{L}$ complex numbers for a generic system of $L$ spins-1/2 particles. The power of the Krylov subspace algorithms underlying dynamite stems from the fact that they avoid ever explicitly computing $U$ or $A$ , only requiring $O (2^{L} \cdot p o l y (L))$ memory to operate and having a corresponding improvement in runtime. Here we give a brief overview of how they work; for details please see the SLEPc Users Manual and Technical Reports. [RCD+23, HRT+07] Note that if the dynamite user wants to tune the solver by passing any of the “command line” options to SLEPc as described in those documents, it can be achieved via the slepc_args argument to dynamite.config.initialize().

An initial concern seems to be that the matrix $H$ itself is of dimension $N \times N$ , which ostensibly is an obstacle to achieving the efficiency just promised. However, $H$ has a crucial feature that $U$ and $A$ do not: it is sparse. This means that most of the matrix elements are zero, and the cost of storing and working with the matrix can be reduced dramatically by only storing the non-zero elements. In fact, the memory usage can be reduced even further by not storing any of the matrix elements at all, as we discuss below in Section 2.4.2. With that, we move on to discussing how we can use this sparse $H$ to achieve performant time evolution and eigensolving.

For time evolution, the key is that we are often more interested in the action of the matrix $U$ on a state vector than we are in the matrix itself. For an initial state $| ψ_{0} ⟩$ , we desire to directly compute the time evolved vector $| ψ_{t} ⟩ = exp (- i H t) | ψ_{0} ⟩$ without needing to compute the unitary as an intermediate step. The Taylor expansion of the matrix exponential suggests that for reasonably small $| | H t | |$ , $| ψ_{t} ⟩$ will be dominated by a linear combination of the vectors $H^{i} | ψ_{0} ⟩$ for $i \in [0, m]$ , where $m$ is a small integer, say 30. These vectors are easy to compute by repeatedly multiplication with $H$ . This intuition forms the core idea behind Krylov subspace methods: construct the subspace $K_{m} = s p a n ({| ψ_{0} ⟩, H | ψ_{0} ⟩, H^{2} | ψ_{0} ⟩, \dots, H^{m - 1} | ψ_{0} ⟩})$ , and solve the linear algebra problem on this much smaller subspace. For time evolution, this means projecting $H$ onto $K_{n}$ to yield a matrix $~ H$ of dimension only $m \times m$ , computing the matrix exponential $~ U = exp (- i ~ H t)$ (a trivial computation for a small matrix like $~ H$ ), and then computing $| ψ_{t} ⟩ \approx ~ U | ψ_{0} ⟩$ . Projected back into the full Hilbert space, for small $| | H t | |$ , $~ U$ should be a good approximation of $U$ , and thus the result should be a good approximation of $| ψ_{t} ⟩$ .

The case in which $| | H t | |$ is not sufficiently small presents the opportunity to discuss another powerful technique crucial to these algorithms, which is called restarting. Instead of computing a single Krylov subspace and attempting to perform the whole computation in it, the subspace is iteratively updated throughout the computation until convergence is achieved. The default restarting scheme for time evolution in dynamite is quite simple: the desired evolution of time $t$ is broken down into a series of shorter evolutions of time $Δ t$ , which is sufficiently small that $~ U$ is a good approximation of $U$ . After each evolution of $Δ t$ , the Krylov subspace is constructed anew, such that it tracks the evolution of the state vector through the Hilbert space. This is the algorithm used by the Expokit package, which is reproduced in SLEPc; [SID98, RCD+23] there is also a Krylov-Schur solver with a less intuitive restarting scheme available in SLEPc but we find it to be less stable.

For eigensolving, we take advantage of the fact that computing the matrix $A$ of all eigenvectors, which is enormous, may be overkill. Physics studies are often interested in only a few particular eigenstates, such as the ground state and first few excited states. Matrix-vector products are again useful here, with the simplest algorithm being the so-called “power method.” Consider the decomposition of a random vector in the basis of (unknown) eigenstates of $H$ : it will have the form $| ψ_{r} ⟩ = \sum_{i} c_{i} | ϕ_{i} ⟩$ , where each $| ϕ_{i} ⟩$ is an eigenstate. When $H$ is multiplied by this vector several times, the result is $H^{m} | ψ_{r} ⟩ = \sum_{i} c_{i} λ_{i}^{m} | ϕ_{i} ⟩$ —the eigenvectors with eigenvalues of largest absolute value have been enhanced exponentially; for sufficiently large $m$ the result will converge to the most extremal eigenvector. Once again, this process can be enhanced through the use of a Krylov subspace: instead of discarding the vectors $| ψ_{r} ⟩$ through $H^{m - 1} | ψ_{r} ⟩$ , they are used to construct a Krylov subspace. The matrix $H$ is projected into this subspace and the full spectrum is computed of the much smaller $~ H$ . As in the case of time evolution, a single iteration of this algorithm is unlikely to converge within a reasonable error tolerance, again calling for a restarting scheme. A great deal of effort has gone into designing optimal restarting schemes for Krylov eigensolving; they are out of the scope of this work but we refer interested readers to the SLEPc technical papers for a discussion of the solvers underlying dynamite [HRT+07].

Perhaps surprisingly, since it solves for extremal eigenvalues, this technique can also be used to compute sets of neighboring excited states anywhere in the spectrum. This is achieved via a spectral transform: rather than performing the solve on the matrix $H$ itself, it is performed on a function of $H$ which moves the desired eigenvalues to the outside of the spectrum. In dynamite, the default spectral transform for finding interior eigenpairs is shift-and-invert, which, for a target energy $σ$ , performs the transformation $H \to (H - σ)^{- 1}$ such that the eigenvalues closest to $σ$ lie on the outside of the transformed matrix. This technique has been used with great success in the study of many-body localization in particular, as discussed in Sec. 2.3.2.

Finally, we note an important feature of all of these techniques: they almost exclusively rely on the matrix-vector product as the definition of the matrix, rather than ever needing to access individual matrix elements themselves. This fact is crucial for the ability to use “matrix-free” implementations with these algorithms, as discussing in Sec. 2.4.2 below.

Error

Despite being frequently termed “exact diagonalization,” Krylov subspace algorithms are approximate. In dynamite, the error is managed entirely by the SLEPc library that implements the solvers; SLEPc applies techniques such as the “restarting” just described in order to converge results to within a requested tolerance $ϵ$ (which can be passed as an argument in dynamite). In this section, we make precise what dynamite’s error tolerance means, largely by describing the definitions of error used by SLEPc.

For eigensolving, there is a straightforward way to compute the error. Suppose, for a matrix $H$ , that $~ λ$ and $~ x$ are an approximate eigenvalue and eigenvector, respectively. Ideally, we would have $A ~ x = ~ λ ~ x$ exactly. To capture the essence of this goal, the residual vector is defined as

r = A ~ x - ~ λ ~ x

(2.9)

and we may define the error in terms of this residual vector. For the solver settings used by dynamite, convergence to a tolerance $ϵ$ is achieved when at least the requested number of eigenpairs have

| | r | |_{2} / | λ | < ϵ

(2.10)

This can be shown to correspond with the following two error bounds on $~ λ$ and $~ x$ :

| λ - ~ λ | \leq ϵ | ~ λ |

(2.11)

sin θ (x, ~ x) \leq \frac{| | r | |_{2}}{δ}

(2.12)

where $θ (x, ~ x)$ is the angle between the exact and approximate vectors and $δ$ is the distance from $~ λ$ and its closest neighboring eigenvalue. Note in particular that the second expression implies that nearly-degenerate eigenvalues can lead to considerable mixing of the corresponding eigenvectors. Also note that when a spectral transform is used (the target mode of eigensolving) the convergence criterion is evaluated for the transformed eigenproblem. Finally, we observe that the Krylov-Schur method, used by SLEPc for eigensolving, is known to sometimes “miss” degenerate (or very near degenerate) eigenvalues; it should not be relied upon to always return the correct multiplicity for a degenerate eigenspace. [ROM18] A future version of dynamite may explore switching to the LOBPCG solver (see Chap. 3) which does not suffer from this limitation. A detailed description of the convergence criteria for SLEPc’s eigensolvers, including flags that can be used to adjust these criteria, can be found in Section 2.5 of the SLEPc Users Manual. [RCD+23] (See the documentation of dynamite’s config.initialize() function for information on how to pass “command-line” flags to SLEPc).

For time evolution (action of the matrix exponential), unfortunately there is not such a straightforward way of computing the error. As before, for an approximate output vector $~ w \approx e^{- i H t} x$ we may define the residual vector

r = e^{- i H t} x - ~ w

(2.13)

but in this case there is no way to explicitly compute it. Instead we must rely on indirect methods of estimating and controlling the error. dynamite uses SLEPc’s implementation of the Expokit solver for the action of the matrix exponential; the solver uses internal error metrics to attempt to keep the magnitude of the residual vector smaller than the requested tolerance $ϵ$ . The details of how this is done is out of the scope of this manuscript, but can be found in the original work detailing the Expokit solver. [SID98] Note that there is no explicit guarantee that the error will remain below the tolerance, but the error is usually well-controlled in practice. Bounding the error of Krylov subspace techniques for quantum time evolution is a long-standing and ongoing area of research; improved error estimates could potentially be included in a future version of dynamite. [PL86, SAA92, SL96, HL97, SID98, MV03, JAK20, RFW+22]

2.4.2 Matrix-free computation

Most operators that one encounters in the study of many-body spin physics are sparse: an overwhelming fraction of the matrix elements are zero. It is for this reason that dynamite uses the sparse numerical linear algebra libraries PETSc and SLEPc for its solvers. By default, the solvers use the compressed sparse row (CSR) storage format for matrices, in which only nonzero matrix elements (and the necessary indices) are stored in memory. For an operator of dimension $2^{L} \times 2^{L}$ , this reduces the cost of storage (and matrix-vector multiplication) from $O (4^{L})$ to $O (2^{L} k)$ , where $k$ is the number of nonzero elements per row—and $k$ is polynomial in $L$ for virtually all physically relevant operators. Of course, $O (2^{L} k)$ still has the potential to be quite large, and frequently is in practice. For the case that even the sparse representation of an operator uses too much memory, dynamite implements a custom representation of operators which we call the “XZC” representation, that leverages the tensor product structure of spin chain operators to store them using as little as $O (k)$ memory.²² 2 The XZC representation is currently denoted “MSC” in various parts of the dynamite source code. The intuition is that the operator is represented symbolically as a decomposition into strings of Pauli matrices; instead of being stored, the elements of the large sparse matrix are computed “on-the-fly” as they are needed. In the field of numerical linear algebra, this type of technique is known as “matrix-free” computing; for historical reasons it is referred to as “shell matrices” in dynamite. It dramatically reduces the memory usage, enabling computations that wouldn’t be possible otherwise due to memory limitations—especially on GPUs, where VRAM is limited. Usually, this comes with a trade-off: the need to compute matrix elements on-the-fly adds to the computational cost, so the user pays for the reduction in memory usage with an increase in the runtime. However, CSR matrix operations are frequently limited by memory bandwidth rather than CPU usage—the speed is limited by how quickly the CPU can retrieve matrix elements from main memory. By reducing the amount of data transferred across the memory bus, it’s actually possible for matrix-free methods to be faster, while simultaneously decreasing the memory usage! Whether this is the case in practice depends on how fast the matrix elements can be generated in real time—and much effort in the development of dynamite has been directed at optimizing this process to an extreme. In Section 2.5, we compare the performance of shell and non-shell computations for various problems. Here we describe how it works, and how it interfaces with the solvers.

The fundamental idea behind the XZC representation is that any operator on a set of $L$ spins can be decomposed as a linear combination of products of single-site Pauli operators. Explicitly, for an operator on $L$ spins, which has $m$ terms in this decomposition, we may define a pair of $m \times L$ binary matrices $M^{(x)}$ and $M^{(z)}$ , and a length- $m$ complex-valued vector $C$ , to represent an operator $H$ as

H = \sum i H_{i} = \sum i C_{i} ⨂ j (σ_{j}^{x})^{M_{i j}^{(x)}} (σ_{j}^{z})^{M_{i j}^{(z)}}

(2.14)

where $σ_{j}^{x}$ and $σ_{j}^{z}$ are the Pauli $X$ and $Z$ operators on site $j$ . In words, each row of $M^{(x)}$ and $M^{(z)}$ , and corresponding element of $C$ , corresponds to one term of the operator. In each row, the $1$ ’s in each binary matrix represents the locations in which a $σ^{x}$ or $σ^{z}$ Pauli should be present (with $0$ ’s corresponding to the identity Pauli on that site). $σ^{y}$ operators are represented via the product $σ_{x} σ_{z}$ , when both $M_{i j}^{(x)}$ and $M_{i j}^{(z)}$ are equal to 1.

Because $L$ in practice never exceeds $64$ , $M^{(x)}$ and $M^{(z)}$ are stored as vectors of integers.³³ 3 dynamite uses 32-bit integers by default, but can be configured to use 64-bit integers if $L$ is to exceed 31. The bits of each integer represent the spin indices for which a $σ_{x}$ and/or $σ_{z}$ operator should be included for that term. For the $C_{i}$ , it is straightforward to see that each value must be either entirely real or entirely imaginary if the operator is Hermitian; this allows $C$ to be stored as a vector of real values (64-bit floating point numbers). All together, an operator with $m$ terms is stored as two length- $m$ vectors of integers plus one length- $m$ vector of real numbers, yielding an overall storage cost of $O (m)$ . The quantity $k$ mentioned in the previous paragraph corresponds to the number of unique rows of $M^{(x)}$ ; in most situations $m$ is proportional to $k$ .

We now describe how computations are implemented on this format—specifically matrix-vector multiplication, which is by far the most important matrix operation for the iterative solvers described in Section 2.4.1. By default, dynamite uses product states in the $Z$ basis as a basis for its representation of the Hilbert space. For the full Hilbert space of length $2^{L}$ , dynamite takes basis state of index $α$ to correspond to the product state $| α ⟩ = ⨂_{j} | α_{j} ⟩$ , where $α_{j}$ is the $j^{t h}$ bit of the binary representation of the integer $α$ (and with $| 0 ⟩$ corresponding to an “up” spin, with $σ_{z}$ eigenvalue $+ 1$ , and $| 1 ⟩$ corresponding to a “down” spin, with eigenvalue $- 1$ ). Then, an arbitrary state vector $| ψ ⟩$ is stored as an array $\to x$ of $2^{L}$ complex numbers $x_{α}$ , such that $| ψ ⟩ = \sum_{α} x_{α} | α ⟩$ . This yields a very straightforward implementation of multiplication of operators by states. The operator-state multiplication can be decomposed as as the sum over multiplications by each of the $H_{i}$ of Eq. 2.14:

H | ψ ⟩ = \sum i, α H_{i} x_{α} | α ⟩

(2.15)

It is straightforward to see that the product of a single term $H_{i}$ from Eq. 2.14 with a product state $| α ⟩$ yields another product state $| β ⟩$ multiplied by a coefficient:

H_{i} | α ⟩ = C_{i} ⨂ j (σ_{j}^{x})^{M_{i j}^{(x)}} (σ_{j}^{z})^{M_{i j}^{(z)}} | α_{j} ⟩ = (- 1)^{\sum_{j} M_{i j}^{(z)} α_{j}} C_{i} | M_{i, *}^{(x)} ⊻ α ⟩

(2.16)

where $M_{i, *}^{(x)}$ is the $i^{t h}$ row of $M$ . In words, the result product state is $| β ⟩$ , where $β$ is the bitwise XOR of the integer $α$ and the $i^{t h}$ row of $M^{(x)}$ , multiplied by the coefficient $C_{i}$ , with a sign flip if an odd number of bits are set in the bitwise AND of the $i^{t h}$ row of $M^{(z)}$ and $α$ .

This is very promising from an implementation perspective. As described above, the rows of $M^{(x)}$ and $M^{(z)}$ , as well as the product states $α$ , will be represented as integers, this operation can be performed extremely quickly using bitwise operations. Computing $β = M_{i, *}^{(x)} ⊻ α$ costs a single bitwise XOR. Computing the sign flip costs just a bitwise AND, a popcount operation to count the number of bits set⁴⁴ 4 popcount is a native machine instruction in most modern CPUs., and another bitwise AND to get the least significant bit of the result (the parity of the number of bits set). Finally flipping the sign of $C_{i}$ costs a single XOR instruction, and then a fused multiply-add (one machine instruction in modern CPUs) can be used to multiply $x_{α}$ by the sign-flipped $C_{i}$ and add it into the appropriate element $y_{β}$ of the output vector. Overall, computing the action of each term of $H_{i}$ on each basis state costs two bitwise XORs, two bitwise ANDs, a popcount, and a fused multiply-add—not very expensive at all!

With that, we have a straightforward algorithm for computing a matrix-vector multiplication using the XZC representation of operators, given as pseudocode in Code Listing 2.7. The code listing includes one small addition that has not yet been discussed, which is the conversion of indices to states and vice versa. As previously mentioned, when using the full Hilbert space this operation corresponds to the identity—index $α$ corresponds to product state $| α ⟩$ . However when the Hilbert space dimension is reduced into a symmetry subspace, this may not be the case. The functions state_to_idx and idx_to_state represent the conversion between indices and integers representing the corresponding product states.


# matvec() computes y += H*x, where H is represented in XZC form by:
# - Mx (array of integers)
# - Mz (array of integers)
# - C  (array of real numbers)
function matvec(Mx, Mz, C, x, y)
    for alpha in 0:dim(x)
        for i in 0:nterms
            in_state = idx_to_state(alpha)
            out_state = Mx[i] ^ in_state  # bitwise XOR
            beta = state_to_idx(out_state)

            # & means bitwise AND
            flip_sign = parity_of_set_bits(Mz[i] & in_state)

            if flip_sign
                coeff = -C[i]
            else
                coeff = C[i]
            end

            y[beta] += coeff*x[alpha]
        end
    end
end

Code Listing 2.7: Pseudocode for “matrix-free” matrix-vector multiplication, in the XZC representation. This is the basic structure used by dynamite, although with many optimizations layered on top.

2.5 Performance results

In this section we present performance results for dynamite, measuring both runtime and memory usage. We compare the performance of various computational settings, including single-core, multiple cores on the same compute node, multiple cores distributed across several compute nodes, and GPU computations. We also benchmark compressed sparse row (CSR) format for matrix storage versus dynamite’s custom matrix-free XZC format (see Section 2.4.2). We benchmark using three different Hamiltonians (introduced in the examples of Sec. 2.3) which are representative of the range of models dynamite is useful for. Importantly, they cover a wide variation in number of non-zero elements per row of the matrix, as described in Table 2.1—which is one of the primary factors influencing computational cost. The Heisenberg + random fields and SYK models also have conserved quantities that allow us to benchmark dynamite’s performance when working in subspaces.

All benchmarks in this section were performed on the Department of Energy’s Perlmutter supercomputer at NERSC. The CPU benchmarks were performed on nodes having two AMD EPYC 7763 processors, each with 64 cores, for a total of 128 cores per node. Each node has a total of 512 Gb DDR4 RAM, with 256 Gb attached to each processor. All GPU benchmarks were performed using one Nvidia A100 GPU with up to 80 Gb attached VRAM, connected to one AMD EPYC 7763 processor (of which a single core was used), with 256 Gb DDR4 RAM. For more details on Perlmutter’s specifications, see the cluster documentation. [16]

Model	Non-zero elements per row	Conserved quantities
Long-range (Eq. 2.1)	$2 L$	(none)
Heisenberg + random fields (Eq. 2.3)	$\leq L$	Total magnetization
SYK (Eq. 2.4)	$L^{4} / 24 + O (L^{3})$	Parity

Table 2.1: The models used for benchmarking in Sec. 2.5. The number of nonzeros per row for the Heisenberg model is variable because some of the off-diagonal terms cancel to zero; it is bounded by the given figure. Note that the long-range model, despite its all-to-all connectivity, has only a linear number of nonzero elements per row because the long-range

Z Z

terms all sum together on the diagonal.

In Tables 2.2 and 2.3 we report performance results for the eigsolve and evolve methods of dynamite respectively. For eigensolving, we measure the cost of solving for the ground state of the model, to the default tolerance of $10^{- 8}$ . For time evolution, we benchmark the evolution of a product state⁵⁵ 5 The performance of dynamite’s time evolution solver is not expected to depend on the initial state. to a time $t = 50 / | | H | |_{\infty}$ , again to within the default tolerance of $10^{- 8}$ . We normalize $t$ in this way because the computational cost of time evolution in dynamite scales with $| | H t | |$ ; the normalization accounts for the arbitrary energy scale in the definition of each Hamiltonian and allows a fair comparison across models and system sizes. We note that the default tolerances are almost certainly tighter than is necessary for most physics studies; in practice users may want to reduce the tolerance to improve speed. In this section, all reported timings are real elapsed (“wall”) time, and include all phases of the computation including setup steps like initialization of MPI and construction of the operators.

Model

Configuration

Eigensolve

Hardware

Matrix mode

Wall time

Memory [GB]

Heisenberg

+random

fields

dim: 2,704,156

nonzeros: 24

1 CPU core

CSR

0:00:12.808

1.618

Matrix-free

0:01:05.621

0.867

1 CPU node

(128 cores)

CSR

0:00:04.648

11.714

Matrix-free

0:00:03.965

11.248

GPU

CSR

0:00:06.309

\leq 80

Matrix-free

0:00:03.974

\leq 80

dim: 155,117,520

nonzeros: 30

1 CPU node

(128 cores)

CSR

0:02:37.938

141.844

Matrix-free

0:05:26.727

59.624

GPU

Matrix-free

0:02:00.793

\leq 80

Long

range

dim: 4,194,304

nonzeros: 44

1 CPU core

CSR

0:04:07.932

5.069

Matrix-free

0:06:22.741

1.311

1 CPU node

(128 cores)

CSR

0:00:17.864

19.154

Matrix-free

0:00:19.227

11.347

GPU

CSR

0:00:18.184

\leq 80

Matrix-free

0:00:09.749

\leq 80

dim: 268,435,456

nonzeros: 56

1 CPU node

(128 cores)

Matrix-free

0:40:37.351

89.840

GPU

Matrix-free

0:18:39.491

\leq 80

SYK

dim: 2,097,152

nonzeros: 7,547

1 CPU node

(128 cores)

Matrix-free

0:26:44.837

17.131

2 CPU nodes

(256 cores)

CSR

0:11:31.164

472.640

GPU

Matrix-free

0:28:45.358

\leq 80

Table 2.2: Cost of solving for the ground state, for various models and configurations. Exact memory usage not reported for GPU runs because tooling was not available to measure the peak GPU memory usage; bounds correspond to the amount of attached memory on the GPUs used for testing.

Model

Configuration

Time evolution

Hardware

Matrix mode

Wall time

Memory [GB]

Heisenberg

+random

fields

dim: 2,704,156

nonzeros: 24

1 CPU core

CSR

0:00:18.479

2.306

Matrix-free

0:01:35.774

1.573

1 CPU node

(128 cores)

CSR

0:00:02.773

12.702

Matrix-free

0:00:05.348

12.193

GPU

CSR

0:00:06.887

\leq 80

Matrix-free

0:00:04.113

\leq 80

dim: 155,117,520

nonzeros: 30

1 CPU node

(128 cores)

CSR

0:02:02.834

180.944

Matrix-free

0:04:22.168

97.270

GPU

Matrix-free

0:01:30.241

\leq 80

Long

range

dim: 4,194,304

nonzeros: 44

1 CPU core

CSR

0:01:09.491

6.146

Matrix-free

0:01:32.764

2.384

1 CPU node

(128 cores)

CSR

0:00:06.958

19.593

Matrix-free

0:00:06.524

12.343

GPU

CSR

0:00:19.050

\leq 80

Matrix-free

0:00:04.386

\leq 80

dim: 134,217,728

nonzeros: 54

1 CPU node

(128 cores)

CSR

0:02:45.128

331.246

Matrix-free

0:02:58.955

84.435

GPU

Matrix-free

0:01:10.321

\leq 80

SYK

dim: 2,097,152

nonzeros: 7,547

1 CPU node

(128 cores)

CSR

0:04:20.826

412.341

Matrix-free

0:02:42.996

16.776

2 CPU nodes

(256 cores)

CSR

0:02:42.236

470.659

Matrix-free

0:01:59.353

67.093

GPU

Matrix-free

0:02:45.066

\leq 80

Table 2.3: Cost of time evolution of a random state to a time

t = 50 / | | H | |_{\infty}

, for various models and configurations. Exact memory usage not reported for GPU runs because tooling was not available to measure the peak GPU memory usage; bounds correspond to the amount of attached memory on the GPUs used for testing.

From the data in the two tables we can make a few observations. In single-core CPU performance, we observe that the CSR implementation outperforms the matrix-free implementation by a considerable amount. This makes sense, considering that the matrix-free implementation has to do more computational work to compute the matrix elements on-the-fly instead of pulling them from memory. The tradeoff is clear, however, in the memory usage, which is always less for the matrix-free implementation than CSR. Furthermore we observe that, as expected, the reduction in memory usage for the matrix-free implementation depends strongly on the number of nonzero elements per row of the model (see Table 2.1). SYK, which has the most nonzeros per rows, has the matrix-free implementation showing a dramatic reduction in memory usage of over $24 \times$ compared with CSR, for $L = 22$ . This gap is expected to only increase at larger system sizes, but the memory usage for CSR matrices for SYK is so large for $L = 28$ spins that we are not able to collect performance data for that case: estimating the memory usage with dynamite’s Operator.estimate_memory() function suggests that it would require over 100 terabytes to store such a matrix!

The performance tradeoff picture changes, however, when we look at parallel performance—both in the case of multicore CPU and GPU computations. In several cases, the matrix-free implementation actually outperforms the CSR implementation! There is actually a very straightforward explanation for this behavior: the computation is not actually limited by a computational bottleneck, but by the memory bandwidth. Because the 128 cores on one node, and the many thousands of cores in the GPU, share memory, the memory bus is not sufficiently fast to keep all of them supplied with data—and in the CSR implementation, there is simply no way around this, because the matrix elements need to be pulled from memory. On the other hand, the matrix-free implementation puts significantly less pressure on the memory bus, and the addition of the extra cores gives enough computational power to handle the extra computation required to compute the matrix elements when needed. Essentially the only thing that needs to be drawn from memory in that case is the vector data itself—and even with that, dynamite’s matrix-free implementations are designed to make use of the on-chip memory cache as much as possible to reduce this. Furthermore, except perhaps in the case of large SYK Hamiltonians where there are a very large number of terms of the Hamiltonian, the XZC representation used to generate the matrix elements can fully remain in cache. Consistent with these facts is that the speed ratio between the CSR implementation and the matrix-free one scales with the number of non-zero elements per row—corresponding to the fact that memory bandwidth is more of a bottleneck when there are more matrix elements. To summarize, in the parallel setting, it is often preferable to use the matrix-free implementation, because it simultaneously is faster and uses less memory.

Finally, we may compare the performance across the different hardware types. As is expected, in all cases using a single core is the slowest. In most cases, the matrix-free GPU performance is fastest, varying from a few times faster than the performance of 128 CPU cores to roughly comparable speed, depending on the model and problem size. However, we note that 128 CPU cores consist of an entire CPU node on Perlmutter, while there are 4 A100 GPUs per GPU node. So, for tasks such as averaging over disorder realizations, where computations can be “embarrassingly parallelized” by running one instance on each GPU on a node, the total computational throughput of a GPU node will in all cases exceed that of a CPU node by several times. (We also note that Perlmutter’s 128 cores per CPU node is on the large end; many compute clusters have only 32, 48, or perhaps 64 cores per node). The conclusion is that, if GPUs are available, they are usually the fastest option. The only drawback to running on them is that there is hard limit to the system sizes they can run due to the limit of their total attached memory. On the Perlmutter GPUs, which have at most 80 GB of attached memory, even in matrix-free mode computations are limited to $L = 27$ spins with no subspaces in use, and perhaps a few more with the use of subspaces (e.g. $L = 30$ for the Heisenberg model with total magnetization conservation enabled), due to the memory cost just of storing state vectors. Parallelizing a single computation across several GPUs in dynamite is currently in development, but is not yet supported. Thus for reaching the very largest system sizes, the only option is to use CPU parallelism, perhaps across several nodes if necessary.

2.6 Discussion and outlook

In numerical studies of quantum many-body phenomena, one cannot avoid the information-theoretic fact that the number of classical bits required to store an arbitrary quantum state is exponential in the number of particles of the quantum system. In certain cases, such as those involving low amounts of entanglement, there are ways to approximate a subset of states efficiently, but the generic case does not permit such optimizations. Fortunately, with the tools of modern supercomputing it is possible to perform linear algebra on astoundingly large vector spaces, with dimension exceeding even one trillion—making it possible to directly numerically study many-body quantum systems of nontrivial size.

In this chapter we have presented dynamite, a software package for the numerical study of many-body systems of spin-1/2 particles (and other systems which can be mapped onto them). dynamite provides a straightforward and intuitive Python interface, yet the computations are performed by a highly optimized backend built from the ground up to enable massive parallelization on modern supercomputer clusters, whether via distributed memory CPU computing or acceleration on GPUs. Particularly notable is dynamite’s symbolic representation of quantum operators, by which computations can be performed “matrix-free,” reducing memory usage considerably while often also improving speed.

Moving forward, the most obvious next step is to implement multi-GPU computations in dynamite, which will help alleviate the main limitation of GPU computing in dynamite—the limited memory attached to a single GPU. This feature is already in development; multi-GPU computing only recently has become a serious tool in supercomputing and thus its support on the supercomputers to which we have access is imperfect, but we expect multi-GPU dynamite to be available in production once these issues have been worked out. We also note that large-memory GPUs are currently in high demand for machine learning applications as well, so improvements in the VRAM per GPU can be expected on the hardware side as well. Aside from that, a crucial feature of dynamite is its ability to take advantage of symmetries in the physical system to reduce the dimension of the Hilbert space on which computations will be performed. So, another promising path forward is to implement more of these conservation laws, perhaps including conservation of momentum corresponding to lattice symmetries. Finally, there are more improvements to be had on the algorithmic side. For example, leveraging recently discovered algorithms and new types of hardware has been shown to have the potential to accelerate certain computations dramatically [HMB+21].

In Chapter 3, we discuss how a relatively new algorithm called LOBPCG can be pushed to improve the performance of the mid-spectrum eigensolving that we explored in Sec. 2.3.2, with the help of a hand-tuned matrix-vector multiplication designed specifically for the Heisenberg model.