Note: The results in this Chapter have been improved since the publication of this dissertation. The updated results are presented in the following manuscript: [arXiv:2403.18006]
Quantum circuits for coherently performing arithmetic on superpositions of values have been a subject of intense study since the first quantum algorithms for number theory problems were discovered in the mid-1990s. The most famous such algorithms are Shor’s algorithms for integer factorization and discrete logarithm, because of their potentially catastrophic effect on digital security. [SHO97] Among other applications, quantum arithmetic has also recently become useful for implementing protocols that achieve various quantum cryptographic tasks—from “proofs of quantumness” which allow a quantum device to demonstrate its computational capability to a skeptical classical verifier, to practical applications like secure delegation of computations to an untrusted remote quantum server. [MAH18, GV19, BCM+21, BKV+20, KCV+22, BGK+23]
The standard way of performing multiplication (in both the classical and quantum settings) is via the “schoolbook” algorithm, which uses O(n^2) gates, where n is the size of the input. While the existence of faster classical algorithms for multiplication has been known for over half a century, these algorithms have overheads that make them useful only for multiplication of large values—the GNU multiple-precision arithmetic library uses a threshold of 2176 bit inputs to switch away from the schoolbook method. In the quantum setting, these overheads are generally made even worse by the reversibility constraints of quantum circuits. Despite these challenges, several works have explored and optimized the implementation of quantum circuits for sub-quadratic time multiplication, largely focusing on the Karatsuba algorithm, which has an asymptotic runtime of roughly O(n^{1.58}). [PRM18, DBC18, GID19a, LAJ+21, KPF06] For many years, a significant challenge came from the fact that these fast algorithms are recursive—and building reversible circuits for them while maintaining the speedup required storing intermediate data, ultimately using a superlinear number of qubits. Notable is a recent work which for the first time reduced the number of qubits for quantum Karatsuba multiplication to linear in the size of the inputs, by performing extra manipulations on the output register that enable a strategy akin to tail recursion, where intermediate results are summed directly into the output and do not need to be uncomputed later. [GID19a] That work provides hope for very efficient quantum integer multiplication, and encourages the search for improvement in the gate count and especially the qubit count (multiplying 2048 bit numbers with that algorithm still requires over 10,000 ancilla qubits).
In this work, we explore a new paradigm for the design of sub-quadratic quantum multiplication circuits. We show that by combining the fundamental ideas behind fast multiplication algorithms with an inherently quantum technique where arithmetic is performed in the phases of a quantum state, it is possible to design circuits for quantum multiplication that simultaneously achieve sub-quadratic asymptotic gate counts and a constant (and small) number of ancilla qubits: just two ancillas for multiplication of a quantum value by a classical one, and five ancillas for the multiplication of two quantum values. Our results yield a family of algorithms with varying tradeoffs between constant factors and asymptotic complexity; by estimating the gate counts for our circuits we find that for 2048-bit inputs, our O(n^{1.46}) algorithm is the optimal one for multiplication of a quantum integer by a classical one, and our O(n^{1.66}) one is optimal for multiplying two quantum integers. We also find that our algorithms begin to outperform the schoolbook algorithm in gate count quite early—for inputs with fewer than 100 bits in some cases. In terms of concrete comparisons of circuit sizes (which we note are difficult to make fairly in the abstract circuit model) we find that our algorithms dramatically reduce the number of qubits required for sub-quadratic quantum multiplication, while simultaneously achieving competitive gate counts. We also note that our circuits do not require intermediate measurements for measurement-based uncomputation (indeed, there are so few garbage bits produced that there is essentially nothing to uncompute!). In terms of depth, we find that surprisingly, we are limited to O(n) depth only because we do not know of a way to perform the quantum Fourier transform in less than O(n) depth with such limited space—even if we allow the use of O(n) ancillas and for the QFT to be approximate.
On the other hand, the core “phase arithmetic” portion of our algorithm can be performed in sub-linear depth using O(n) ancillas.
Directly using our circuits as a subroutine in Shor’s algorithm yields circuits that require only 2n + O(log n) qubits, where n is the length of the integer to be factored—while obtaining an asymptotic gate count of O(n^{2.46}) with practical constant factors. This is compared to the standard O(n^3) gate complexity of Shor’s algorithm.¹ (¹ It has been known for many years that it is theoretically possible to reduce the asymptotic complexity of Shor’s algorithm below O(n^3) [BCD+96], but previously the constant factors and number of qubits required seemed to make doing so impractical for reasonably sized inputs. See Table 1 of [GE21] for a recent comparison of proposals for the implementation of Shor’s algorithm.) We also explore the circuits’ application to the efficiently-verifiable cryptographic proof of quantumness introduced in Chapter 5; we see dramatic reductions in the qubit count along with competitive gate counts, when compared to previous implementation proposals. The new circuits also do not require the use of measurement-based uncomputation.
Throughout this manuscript we work in the abstract circuit model of quantum computation; a critical direction for future research is how to optimize our constructions to include considerations such as error correction, qubit routing, and noise.
In this work we will focus on implementing two related quantum arithmetic operations. The first is quantum-classical multiplication: computing the product of (a superposition of) integers stored in a quantum register with a classical integer a. We denote the unitary corresponding to this operation as U_{q×c}(a) (the subscript denotes “quantum” × “classical”, and the unitary is parameterized by a). It implements the following operation on product states (extended by linearity to superpositions):
U_{q×c}(a)|x⟩|w⟩ = |x⟩|w + ax⟩    (7.1)
We note that this unitary (controlled off of another qubit) is the fundamental operation used in implementations of Shor’s algorithm (see Section 7.4.3).
The second operation we study is a quantum-quantum multiplication: finding the product of an integer (or superposition thereof) in one quantum register with another integer (or superposition) in a second register. We denote this as U_{q×q}; it implements the transformation
U_{q×q}|x⟩|y⟩|w⟩ = |x⟩|y⟩|w + xy⟩    (7.2)
In Sec. 7.4.4 we also discuss the closely related unitary for which |x⟩ and |y⟩ of the previous equation are the same register, thus implementing a squaring operation:
U_{square}|x⟩|w⟩ = |x⟩|w + x^2⟩    (7.3)
This squaring operation is fundamental to implementing “proofs of quantumness” based on the cryptographic function f(x) = x^2 mod N, as discussed in Chapter 5.
We now discuss classical multiplication algorithms. The most straightforward algorithm for multiplying two numbers is known as the “schoolbook” method, because it is the one taught to young students when they first learn to multiply. The numbers are simply decomposed into individual digits and the individual products of those digits summed, scaled by powers of the base. In binary this can be expressed as
xy = ∑_{i,j} 2^{i+j} x_i y_j    (7.4)
(presumably most elementary school teachers use base 10). This algorithm’s time complexity is O(n^2); it is used widely for multiplication of small- to moderate-sized integers.
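As a concrete illustration, Eq. 7.4 can be evaluated bit by bit in a few lines of Python; the following sketch (the function name is ours, not from the text) makes the O(n^2) structure of the schoolbook method explicit.

```python
def schoolbook_multiply(x: int, y: int) -> int:
    """Compute x*y via Eq. 7.4: the sum of 2^(i+j) * x_i * y_j over all bit pairs."""
    n = max(x.bit_length(), y.bit_length())
    total = 0
    for i in range(n):              # loop over bits of x
        for j in range(n):          # loop over bits of y -- O(n^2) partial products
            x_i = (x >> i) & 1
            y_j = (y >> j) & 1
            total += (x_i * y_j) << (i + j)
    return total
```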
About half a century ago it was shown that it is possible to classically multiply integers in sub-quadratic time—asymptotically outperforming the schoolbook algorithm. We begin by describing the first sub-quadratic multiplication algorithm, called the Karatsuba algorithm, and then show that it is a special case of a broader class of algorithms called Toom-Cook. For an extended pedagogical exposition of the range of fast multiplication algorithms that have been discovered, we refer the reader to Knuth. [KNU98]
Consider two n-bit integers x and y to be multiplied together. Divide the bits of each into two pieces: x = 2^{n/2} x_1 + x_0 (and the same for y), where x_1 holds the “more significant” half of the bits and x_0 the “less significant” half. Written in this way, the product can be expressed
xy = (2^{n/2} x_1 + x_0)(2^{n/2} y_1 + y_0) = 2^n x_1 y_1 + 2^{n/2}(x_0 y_1 + x_1 y_0) + x_0 y_0    (7.5)
This formulation is effectively the schoolbook algorithm in base 2^{n/2}. We’ve replaced one product of size n with four products of size n/2, reflecting the quadratic scaling of the schoolbook algorithm.
The key observation behind the Karatsuba algorithm is that
x_0 y_1 + x_1 y_0 = (x_0 + x_1)(y_0 + y_1) − x_0 y_0 − x_1 y_1    (7.6)
—and we already need to find the products x0y0 and x1y1 anyway! Making this substitution we have
xy = 2^n x_1 y_1 + 2^{n/2}((x_0 + x_1)(y_0 + y_1) − x_0 y_0 − x_1 y_1) + x_0 y_0    (7.7)
That is, the multiplication of size n can be accomplished with only three multiplications of size n/2, and a few extra additions! The next key insight is to apply this fact recursively, using it again to compute each of the sub-multiplications of size n/2, and then again to compute the multiplications used there, et cetera. Complexity analysis shows that when applied recursively, Karatsuba multiplication computes the product in only O(n^{log_2 3}) = O(n^{1.58⋯}) operations, outperforming the schoolbook algorithm.
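The recursion can be sketched directly in Python (the function name and base-case cutoff below are our own choices, not from the text):

```python
def karatsuba(x: int, y: int, cutoff: int = 8) -> int:
    """Recursive Karatsuba multiplication, following Eqs. 7.6 and 7.7."""
    n = max(x.bit_length(), y.bit_length())
    if n <= cutoff:                                # small inputs: multiply directly
        return x * y
    half = n // 2
    x1, x0 = x >> half, x & ((1 << half) - 1)      # split into high/low halves
    y1, y0 = y >> half, y & ((1 << half) - 1)
    hi = karatsuba(x1, y1, cutoff)                 # x1*y1
    lo = karatsuba(x0, y0, cutoff)                 # x0*y0
    mid = karatsuba(x0 + x1, y0 + y1, cutoff) - hi - lo   # Eq. 7.6
    return (hi << (2 * half)) + (mid << half) + lo        # Eq. 7.7
```

Only three recursive calls are made per level, which is where the O(n^{log_2 3}) count comes from.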
The Toom-Cook algorithm uses the same intuition as Karatsuba, but splits each integer into k parts instead of just two. It also provides a broader intuition for why the Karatsuba algorithm works, by expressing the problem in terms of polynomial multiplication.
Consider an n-digit integer x divided into k chunks of size n/k, which we denote x_0, x_1, ⋯, x_{k−1}, with x_{k−1} being the most significant:
x = x_{k−1} 2^{(k−1)n/k} + ⋯ + x_1 2^{n/k} + x_0    (7.8)
This can be reinterpreted as a polynomial
x(w) = x_{k−1} w^{k−1} + ⋯ + x_1 w + x_0    (7.9)
evaluated at w = 2^{n/k}. From this perspective, we can recast a product of integers p = xy as a product of polynomials p(w) = x(w)y(w), evaluated at w = 2^{n/k}. The benefit of expressing the product this way is that for any point w_i, p(w_i) = x(w_i)y(w_i), and we can choose values of w_i such that x(w_i) and y(w_i) are only roughly n/k bits long and thus the product is faster to compute. This suggests the following plan: compute the value of p(w) at several points, use those points to reconstruct the coefficients of the polynomial p(w), and finally evaluate p(w) at w = 2^{n/k}, yielding the integer product p. x(w) and y(w) are polynomials of degree k−1, so their product p(w) is of degree 2(k−1); thus we must evaluate it at q = 2(k−1)+1 points to unambiguously reconstruct the coefficients.
It may be helpful at this point to recast the Karatsuba algorithm in this light. Karatsuba corresponds to the case of k = 2; the integer x is converted into the polynomial x(w) = x_1 w + x_0 (and similarly for y). Here q = 2(k−1)+1 = 3, and so we evaluate the product p(w_i) = x(w_i)y(w_i) at the three points w_i ∈ {0, 1, ∞}:
p(0) = x(0)y(0) = x_0 y_0    (7.10)
p(1) = x(1)y(1) = (x_0 + x_1)(y_0 + y_1)    (7.11)
p(∞) = x(∞)y(∞) = x_1 y_1    (7.12)
(where the value of the polynomial p at infinity is the limit of p(w)/w^{deg(p)} as w → ∞). The coefficients of the polynomial p(w) can be expressed in terms of its value at these three points as follows:
p(w) = p(∞) w^2 + (p(1) − p(∞) − p(0)) w + p(0).    (7.13)
Substituting Eqs. 7.10–7.12 into Eq. 7.13 and evaluating at w = 2^{n/2} yields Eq. 7.7, the expression we originally had for computing p in Karatsuba multiplication!
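One level of this evaluate-multiply-interpolate view can be checked directly (the helper name is ours); the identity is exact for any split point, since x = x_1·2^{half} + x_0 by construction.

```python
def karatsuba_one_level(x: int, y: int, n: int) -> int:
    """One Karatsuba level as polynomial evaluation at {0, 1, inf} (Eqs. 7.10-7.13)."""
    half = n // 2
    x1, x0 = x >> half, x & ((1 << half) - 1)
    y1, y0 = y >> half, y & ((1 << half) - 1)
    p0 = x0 * y0                      # p(0),   Eq. 7.10
    p1 = (x0 + x1) * (y0 + y1)        # p(1),   Eq. 7.11
    pinf = x1 * y1                    # p(inf), Eq. 7.12
    w = 1 << half                     # evaluate p(w) of Eq. 7.13 at w = 2^(n/2)
    return pinf * w * w + (p1 - pinf - p0) * w + p0

assert karatsuba_one_level(51234, 60321, 16) == 51234 * 60321
```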
The benefit of framing our multiplication in this way is that it becomes very straightforward to set k to values larger than 2. The Toom-Cook algorithm can be viewed as a five-step process, which is conveniently expressed in terms of linear algebra: [BOD07]
We divide the bits of x and y into k chunks each, which serve as the coefficients of a polynomial as in Eq. 7.9 above. We can represent this polynomial simply as a vector of the coefficients:
x = (x_{k−1}, ⋯, x_1, x_0)    (7.14)
In this representation, evaluating the polynomial at a point w_i corresponds to computing the inner product e(w_i)^⊺ x with the vector
e(w_i) = (w_i^{k−1}, ⋯, w_i, 1)    (7.15)
The next step is to evaluate those polynomials at several points. We define a vector x̃ of length q = 2(k−1)+1, which contains the evaluation of our polynomial at each of q points w_0, w_1, ⋯. With this we write x̃ = Ax, where each row of the matrix A is e(w_i) for one of the points w_i at which we desire to evaluate the polynomial. Later it will be helpful to write A as a q×q matrix; we allow multiplication by the length-k vector x by simply prepending zero entries to x for the higher-order terms. With that, A has the following structure:
A = ⎛ w_0^{q−1}      ⋯  w_0^2      w_0      1 ⎞
    ⎜ w_1^{q−1}      ⋯  w_1^2      w_1      1 ⎟
    ⎜     ⋮          ⋱     ⋮        ⋮       ⋮ ⎟
    ⎝ w_{q−1}^{q−1}  ⋯  w_{q−1}^2  w_{q−1}  1 ⎠    (7.16)
We perform that matrix-vector multiplication for both x and y.
Now that x and y are expressed as polynomials evaluated at several points, we reach the payoff: we multiply the polynomials by simply performing pointwise multiplication of the vectors:
p̃ = x̃ ∘ ỹ    (7.17)
where p̃ is the result polynomial evaluated at the points w_i, and ∘ denotes the pointwise product. Note the benefit here: we are performing q pointwise products of size n/k, instead of the k^2 products of that size that would be required for schoolbook multiplication. Since q = 2(k−1)+1, this is a very dramatic improvement! Furthermore, we will recursively apply the Toom-Cook algorithm to each of the pointwise multiplications of the vector elements.
Now that we have the vector p̃ representing the product polynomial evaluated at each of the q points w_i, we need to convert it to a vector of polynomial coefficients p. We know that p̃ = Ap; by appropriate choice of the w_i we can ensure that A is invertible. Then we can simply compute the matrix inverse A^{−1}, resulting in a straightforward computation p = A^{−1}p̃.
Finally, with the product polynomial expressed in terms of its coefficients p, the integer product can be recovered by evaluating that polynomial at w = 2^{n/k}. As before, this can be written as the inner product (e(2^{n/k}))^⊺ p, where
e(2^{n/k}) = (2^{(q−1)n/k}, ⋯, 2^{2n/k}, 2^{n/k}, 1)    (7.18)
With the algorithm laid out, let us analyze the asymptotic complexity of Toom-Cook for any k. The algorithm performs q = 2(k−1)+1 multiplications of size n/k at the top level, and then recurses. Straightforward complexity analysis yields a runtime of O(n^{log_k q}), which can be made arbitrarily close to O(n) as k is increased. Unfortunately, in practice the constant factors grow rapidly as k is increased, due to the cost of performing the matrix-vector products Ax, Ay, and A^{−1}p̃. In practice, the value of k is varied depending on the size of the integers to be multiplied, but regardless of the size of input, k is only practicable for small values (certainly less than 10). [165]
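The five steps can be sketched end-to-end in Python. The sketch below is ours (function name, choice of small integer evaluation points, and exact rational interpolation via `fractions.Fraction` standing in for A^{−1} are all our assumptions, not prescriptions from the text):

```python
from fractions import Fraction

def toom_cook(x: int, y: int, k: int = 3) -> int:
    """Toom-k integer multiplication via the five steps of Sec. 7.2.2:
    split, evaluate (A), pointwise multiply, interpolate (A^-1), recompose."""
    n = max(x.bit_length(), y.bit_length())
    q = 2 * (k - 1) + 1                     # number of evaluation points
    chunk = max(1, n // k)
    mask = (1 << chunk) - 1
    # Step 1: split into k chunks, least significant first (Eq. 7.8)
    xs = [(x >> (chunk * i)) & mask if i < k - 1 else x >> (chunk * i) for i in range(k)]
    ys = [(y >> (chunk * i)) & mask if i < k - 1 else y >> (chunk * i) for i in range(k)]
    ws = [i - q // 2 for i in range(q)]     # small integer points, e.g. -2..2 for k=3
    # Step 2: evaluate both polynomials at every point (apply A)
    ex = [sum(c * w**j for j, c in enumerate(xs)) for w in ws]
    ey = [sum(c * w**j for j, c in enumerate(ys)) for w in ws]
    # Step 3: pointwise products p(w_i) = x(w_i) * y(w_i)  (Eq. 7.17)
    ep = [a * b for a, b in zip(ex, ey)]
    # Step 4: interpolate the coefficients of p(w) (apply A^-1), done here with
    # exact Lagrange interpolation over the rationals
    coeffs = [Fraction(0)] * q
    for i, wi in enumerate(ws):
        basis = [Fraction(1)]               # coefficients of the i-th Lagrange basis poly
        for wj in ws:
            if wj == wi:
                continue
            scale = Fraction(1, wi - wj)    # multiply basis by (w - wj)/(wi - wj)
            new = [Fraction(0)] * (len(basis) + 1)
            for d, c in enumerate(basis):
                new[d] += -wj * c * scale
                new[d + 1] += c * scale
            basis = new
        for d in range(q):
            coeffs[d] += ep[i] * basis[d]
    # Step 5: recompose by evaluating p at w = 2^(n/k) (Eq. 7.18)
    return int(sum(c * (1 << (chunk * d)) for d, c in enumerate(coeffs)))
```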
Many proposals for the implementation of quantum arithmetic mirror classical circuits, with various clever optimizations to account for the fact that quantum operations must in general be reversible. [VBE96, ZAL98, CDK+04, DKR+06, HRS17, GID18, GID19a] However, one early work by Draper stands out as an example of a quantum arithmetic circuit that truly has no classical analogue. [DRA00] That work proposes to use the quantum Fourier transform (QFT) to compute addition, and to use that addition circuit as a building block for implementing multiplication.
The action of the quantum Fourier transform, with modulus 2^m, on a product state |x⟩ of m qubits can be written as follows (and by linearity extended to a superposition of inputs):
QFT_{2^m}|x⟩ = ∑_z exp(2πixz/2^m)|z⟩    (7.19)
(and the inverse quantum Fourier transform IQFT_{2^m} is the reverse of this operation). Draper observed that one can perform addition in the Fourier basis simply by applying a series of phase rotations to the qubits of the register holding z:
QFT_{2^m}|x + a⟩ = ∑_z exp(2πi(x + a)z/2^m)|z⟩ = ∑_z exp(2πiaz/2^m) exp(2πixz/2^m)|z⟩    (7.20)
The rightmost expression is just QFT_{2^m}|x⟩ with an extra phase shift exp(2πiaz/2^m).
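This Fourier-basis addition can be verified with a small dense simulation (a sketch assuming numpy; all names are ours):

```python
import numpy as np

def qft_matrix(m: int) -> np.ndarray:
    """Matrix of the QFT with modulus 2^m, normalized so it is unitary."""
    N = 1 << m
    z, x = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.exp(2j * np.pi * x * z / N) / np.sqrt(N)

def draper_add(state: np.ndarray, a: int, m: int) -> np.ndarray:
    """Add the classical constant a by diagonal phase rotations in the Fourier basis."""
    N = 1 << m
    F = qft_matrix(m)
    phases = np.exp(2j * np.pi * a * np.arange(N) / N)   # exp(2*pi*i*a*z / 2^m)
    return F.conj().T @ (phases * (F @ state))           # IQFT . phase . QFT

m, x, a = 5, 11, 7
state = np.zeros(1 << m, dtype=complex)
state[x] = 1.0                                           # prepare |x>
out = draper_add(state, a, m)
assert abs(out[(x + a) % (1 << m)]) > 0.999              # result is |x + a mod 2^m>
```

Because the phase operator is diagonal, no carries ever need to be stored, which is the origin of the ancilla-free property exploited below.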
This construction can be readily extended to implement the multiplication unitaries described in Sec. 7.2.1. We simply add the product (ax in the classical-quantum case, xy in the quantum-quantum case) to the |w⟩ register using Fourier addition. Denoting the unitaries that apply the appropriate phase in the Fourier basis with a tilde:
U = (I ⊗ IQFT) Ũ (I ⊗ QFT)    (7.21)
we can write down Ũ_{q×c} and Ũ_{q×q} (in terms of their action on product states):
Ũ_{q×c}(a)|x⟩|z⟩ = exp(2πiaxz/2^m)|x⟩|z⟩    (7.22)
Ũ_{q×q}|x⟩|y⟩|z⟩ = exp(2πixyz/2^m)|x⟩|y⟩|z⟩    (7.23)
Note that here, z represents each basis state resulting from applying the quantum Fourier transform to the output register; z is not the value itself that was already stored in the output register. We may decompose the quantum values as x = ∑_i 2^i x_i (and respectively for y and z), yielding:
exp(2πiaxz/2^m) = ∏_{i,j} exp(2πi a 2^{i+j}/2^m ⋅ x_i z_j)    (7.24)
exp(2πixyz/2^m) = ∏_{i,j,k} exp(2πi 2^{i+j+k}/2^m ⋅ x_i y_j z_k)    (7.25)
Since we have decomposed to individual bits, the product x_i z_j (or x_i y_j z_k in the quantum-quantum case) is simply 1 if all relevant bits are 1, or 0 otherwise. This means these individual phase factors can be implemented via (doubly-)controlled-phase rotations of phase
ϕ_{ij} = 2πa 2^{i+j}/2^m (mod 2π)    (7.26)
between qubits x_i and z_j in the classical-quantum case, and
ϕ_{ijk} = 2π 2^{i+j+k}/2^m (mod 2π)    (7.27)
between qubits x_i, y_j, and z_k in the quantum-quantum case. Importantly, these phases depend only on classical values, and can be computed during circuit compilation on a classical computer.
For the classical-quantum multiplication case, the total number of controlled-phase gates applied to implement multiplication is O(n^2), which matches the complexity achieved by converting the classical “schoolbook” algorithm for integer multiplication into a quantum circuit (but is worse than the sub-quadratic multiplications described earlier). This scaling arises in an interesting way: we have accrued an extra factor of n due to the fact that a rotation proportional to the product ax must be performed for each of the n bits of z. However, we have simultaneously managed to get rid of a factor of n, due to the fact that a is classical, so we can coalesce the rotations corresponding to each bit of a and do them as one gate. Unfortunately, for quantum-quantum multiplication, it is not possible to coalesce the rotations for each y_j in the same way, because y is stored in qubits. Thus we need O(n^3) doubly-controlled-phase gates, which is considerably worse than quantum versions of schoolbook multiplication. Yet there are still benefits to this strategy. First, the constants on the scaling are very good: implementing Ũ requires exactly n^2 or n^3 phase rotations (actually, even fewer, since some of the rotations are so small they can be dropped). Second, it uses zero ancilla qubits. This can be contrasted with traditional adder circuits, which need extra work space to perform carries. Finally, it can be massively parallelized, since all the phase rotations commute, yielding better circuit depth.
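To make the compile-time nature of the rotations concrete, here is a small sketch (names are ours) that tabulates the angles ϕ_{ij} of Eq. 7.26 and checks that, summed over the set bits of x and z, they reproduce the phase of Eq. 7.22 (mod 2π):

```python
import math

def phase_angles(a: int, m: int, n: int) -> dict:
    """Classically precompute the controlled-phase angles phi_ij of Eq. 7.26."""
    angles = {}
    for i in range(n):              # bit index of x
        for j in range(m):          # bit index of z
            angles[(i, j)] = (2 * math.pi * a * 2 ** (i + j) / 2 ** m) % (2 * math.pi)
    return angles

# Accumulating phi_ij over the set bits of x and z reproduces 2*pi*a*x*z/2^m
a, m, n = 5, 8, 8
x, z = 27, 44
total = sum(phi for (i, j), phi in phase_angles(a, m, n).items()
            if (x >> i) & 1 and (z >> j) & 1)
target = 2 * math.pi * a * x * z / 2 ** m
assert math.isclose(math.cos(total), math.cos(target), abs_tol=1e-9)
assert math.isclose(math.sin(total), math.sin(target), abs_tol=1e-9)
```

Every angle in the table is fixed before the circuit runs; the quantum circuit only applies them conditioned on the relevant qubits.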
In this work, we call this type of quantum circuit for multiplication “Fourier multiplication.”
Our main result is that it is possible to decompose the phases used in a Fourier multiplication via Toom-Cook-like transformations, ultimately yielding circuits with sub-quadratic gate counts for both classical-quantum and quantum-quantum multiplication. We first give the intuition behind the transformation, explaining it for k = 2 for both classical-quantum and quantum-quantum multiplication. Then, we describe the case of arbitrary k in Section 7.3.3. Next, we analyze the gate counts and ancilla usage of these algorithms. Finally, we describe how small modifications allow us to optimize the ancilla usage and depth, achieving the results stated in the introduction.
Throughout this work we will focus specifically on the implementation of an operation we call the “PhaseProduct:”
∑_{x,z} c_{x,z}|x⟩|z⟩  −−PhaseProduct(ϕ)−→  ∑_{x,z} exp(iϕxz) c_{x,z}|x⟩|z⟩    (7.28)
and a related operation which we call the “PhaseTripleProduct:”
∑_{x,y,z} c_{x,y,z}|x⟩|y⟩|z⟩  −−PhaseTripleProduct(ϕ)−→  ∑_{x,y,z} exp(iϕxyz) c_{x,y,z}|x⟩|y⟩|z⟩    (7.29)
Note that these operations are generalizations of Ũ_{q×c} and Ũ_{q×q} respectively—setting ϕ = 2πa/2^m in the former and ϕ = 2π/2^m in the latter recovers Eqs. 7.22 and 7.23. The generalization will be helpful when we apply the transformations recursively.
The fundamental idea is to decompose the phase ϕxz of Eq. 7.28 by using the Karatsuba decomposition (and later via Toom-Cook for general k). A first attempt at directly substituting Eq. 7.7 into the phase of Eq. 7.28 and decomposing into a product of phases yields
exp(iϕxz) = exp(iϕ 2^n x_1 z_1) ⋅ exp(iϕ 2^{n/2}((x_0 + x_1)(z_0 + z_1) − x_1 z_1 − x_0 z_0)) ⋅ exp(iϕ x_0 z_0)
This decomposition is not ideal—we still need to compute the value (x0+x1)(z0+z1)−x1z1−x0z0, which requires storing and reusing the partial products x0z0 and x1z1. Instead, we note that the prefactors on each phase are just some value, which is classically computed at circuit compilation time. We should endeavor to move as much of the arithmetic into these prefactors as possible, since they are set before the circuit runs and any complexity there does not affect the quantum circuit cost. Thus rearranging Eq. 7.7 to group by partial products, rather than powers of two, yields
exp(iϕxz) = exp(i(2^n − 2^{n/2})ϕ x_1 z_1) ⋅ exp(i 2^{n/2} ϕ (x_0 + x_1)(z_0 + z_1)) ⋅ exp(i(1 − 2^{n/2})ϕ x_0 z_0)    (7.30)
The implications of this expression are drastic. Each phase rotation on the right hand side consists of an entirely classical phase factor, multiplied by a product of two quantum integers. This is simply another application of PhaseProduct, but in which the integers are half as long! We can recursively apply the decomposition again to each of the sub-products, until the input integers are sufficiently small that we perform the phase products directly via individual two-qubit controlled-phase rotations.
Importantly, breaking the phase down this way does not require computing and storing any extra values in ancilla registers—the computation can be performed in-place. We do need the values (x_0 + x_1) and (z_0 + z_1), which we compute via regular quantum addition circuits. However, since addition is reversible, we can compute these values in-place (up to a single “overflow” qubit), temporarily overwriting x_1 and z_1 respectively.
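A classical sanity check of the recursion (our function names; `cmath` assumed) accumulates the phase via the decomposition of Eq. 7.30 and compares it against exp(iϕxz) computed directly:

```python
import cmath

def phase_product(phi: float, x: int, z: int, n: int, cutoff: int = 4) -> complex:
    """Accumulate exp(i*phi*x*z) via the recursive k=2 decomposition of Eq. 7.30."""
    if n <= cutoff:
        return cmath.exp(1j * phi * x * z)           # base case: direct rotations
    half = n // 2
    x1, x0 = x >> half, x & ((1 << half) - 1)
    z1, z0 = z >> half, z & ((1 << half) - 1)
    w = 1 << half                                    # w = 2^(n/2)
    # Three sub-phase-products with classically rescaled phases (Eq. 7.30)
    return (phase_product(phi * (w * w - w), x1, z1, half + 1, cutoff)
            * phase_product(phi * w, x0 + x1, z0 + z1, half + 1, cutoff)
            * phase_product(phi * (1 - w), x0, z0, half + 1, cutoff))

phi = 2 * cmath.pi / 2 ** 16
x, z = 201, 154
assert cmath.isclose(phase_product(phi, x, z, 8),
                     cmath.exp(1j * phi * x * z), abs_tol=1e-6)
```

Note that only the classical prefactors change between recursive calls; the quantum registers are never copied.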
In Algorithm 7.1, we explicitly record the steps of this algorithm, and then we move on to the case of quantum-quantum multiplication.
For quantum-quantum multiplication we desire to implement the phase shift ϕxyz from Eq. 7.29, where x, y, and z are all stored in quantum registers. In a less constrained setting (such as in classical computing or “digital” quantum arithmetic), the product of three integers would be computed by taking the product of two of the integers and then multiplying the third integer by the result—so no special algorithm for the product of three integers is required. However, we are decomposing a phase, and this is only possible over sums, not products. Thus we introduce modified versions of the Karatsuba algorithm, which compute the product of three integers in one step.
Fortunately, viewing Karatsuba as pointwise multiplication of polynomials (as discussed in Sec. 7.2.2) gives a fairly straightforward way to do this. Denote the integer product we desire as p = xyz. We will again consider the factors x, y, and z as polynomials and evaluate them at a set of points w_i. This time we will compute the pointwise triple product p(w_i) = x(w_i)y(w_i)z(w_i), and as before interpolate the polynomial p(w) and evaluate it at the appropriate w to recover the desired integer result p. The main difference from the two-integer case is that the degree of the polynomial p will be higher (3 in this case), so we will need to use more points w_i. Regular (two-integer) Karatsuba as described in Sec. 7.2.2 uses the points {0, 1, ∞}; fortunately, it is easy to include the point −1, corresponding to the linear combination x_0 − x_1 (resp. y and z), for a total of 3 + 1 = 4 points, which will uniquely specify our degree-3 polynomial p. With that, we can write down a modified Karatsuba decomposition which applies to triple products (here grouping by partial products, in view of the discussion in the classical-quantum case):
xyz = (2^{3n/2} − 2^{n/2}) x_1 y_1 z_1 + ½(2^n + 2^{n/2})(x_0 + x_1)(y_0 + y_1)(z_0 + z_1) + ½(2^n − 2^{n/2})(x_0 − x_1)(y_0 − y_1)(z_0 − z_1) + (1 − 2^n) x_0 y_0 z_0    (7.31)
As before, the factors in each of the sub-multiplications are half the size of the original ones (up to one bit of overflow)!
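Eq. 7.31 is an exact algebraic identity, which can be verified directly (the helper name is ours; the half-integer coefficients are exact because 2^{n/2}(2^{n/2} ± 1) is always even):

```python
def triple_product_k2(x: int, y: int, z: int, n: int) -> int:
    """Evaluate the four-point decomposition of Eq. 7.31; equals x*y*z exactly."""
    half = n // 2
    s = 1 << half                                   # s = 2^(n/2)
    x1, x0 = x >> half, x & (s - 1)
    y1, y0 = y >> half, y & (s - 1)
    z1, z0 = z >> half, z & (s - 1)
    plus = (x0 + x1) * (y0 + y1) * (z0 + z1)        # point w = 1
    minus = (x0 - x1) * (y0 - y1) * (z0 - z1)       # point w = -1
    return ((s ** 3 - s) * x1 * y1 * z1             # point w = inf
            + (s * s + s) // 2 * plus
            + (s * s - s) // 2 * minus
            + (1 - s * s) * x0 * y0 * z0)           # point w = 0

assert triple_product_k2(201, 154, 233, 8) == 201 * 154 * 233
```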
Using this to decompose the phase ϕxyz, we end up with an algorithm that is structurally identical to the recursive decomposition of ϕxz in the classical-quantum case. In Algorithm 7.2, we describe how to perform this decomposition explicitly.
Note that, as we show in Table 7.1, the k = 2 quantum-quantum case does not actually provide a sub-quadratic gate count—it requires O(n^2) two-qubit gates, matching the asymptotic cost of “digital” schoolbook multiplication. However, it does provide a drastic speedup over the O(n^3) gate count of the schoolbook algorithm for the PhaseTripleProduct operation represented in Eq. 7.29. In any case, it is an illustrative example that provides good intuition for the k > 2 cases, which do yield a sub-quadratic runtime; we discuss these next.
In this section we describe the general case of k>2 for both classical-quantum and quantum-quantum multiplication. As in the previous two sections, we would like to apply the Toom-Cook decomposition to the phase products ϕxz and ϕxyz. Using the notation introduced in Sec. 7.2.2, we can write the entire Toom-Cook process for computing an integer product xz as a linear algebra expression:
xz = (e(2^{n/k}))^⊺ A^{−1} (Ax ∘ Az)    (7.32)
As in the previous section, to decompose the triple product we modify the Toom-Cook algorithm to take the pointwise triple product:
xyz = (e(2^{n/k}))^⊺ A^{−1} (Ax ∘ Ay ∘ Az)    (7.33)
Note that, as in Section 7.3.2, the triple product increases the degree of the product polynomial, so we must evaluate the polynomials at more points. As a function of k, this number of points is q = 3(k−1)+1 in the three-integer (PhaseTripleProduct) case, whereas it is q = 2(k−1)+1 in the two-integer (PhaseProduct) case.
Inserting these expressions into the phases of Eq. 7.28 and 7.29, we have
exp(iϕ x[y]z) = exp(iϕ (e(2^{n/k}))^⊺ A^{−1} (Ax [∘ Ay] ∘ Az))    (7.34)
where we use square brackets to denote parts that only appear in the three-integer, PhaseTripleProduct case, but not in the two integer PhaseProduct case.
As in the k = 2 case, we’d like to decompose Eq. 7.34 in a way that allows us to include as much of the linear algebra as possible in the prefactors, which can be classically computed at circuit compilation time. To do so, we define ê^⊺ = (e(2^{n/k}))^⊺ A^{−1}, and decompose the phase across the inner product of ê^⊺ and the pointwise product Ax [∘ Ay] ∘ Az:
exp(iϕ x[y]z) = ∏_i exp(iϕ ê_i (Ax)_i [(Ay)_i] (Az)_i)    (7.35)
We see that now, as before, each of the phase factors in the product on the right-hand side has the same form as the left-hand side—a classical phase factor (in this case ϕê_i) multiplied by a product of integers. By choosing the points w_i appropriately, we can ensure that the values (Ax)_i (resp. y and z) are roughly 1/k the size of the input integers, up to overflow bits, reducing our larger multiplication to q much smaller ones.
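The prefactors ê_i can be computed classically as Lagrange weights. The sketch below (our names; exact rationals via `fractions.Fraction`; the specific points −2..2 are our choice) checks the identity underlying Eq. 7.35, ∑_i ê_i (Ax)_i (Az)_i = xz, for the k = 3 two-integer case:

```python
from fractions import Fraction

def e_hat(ws, target):
    """Lagrange weights: e_hat[i] = L_i(target), so sum_i e_hat[i]*p(ws[i]) = p(target)
    for any polynomial p of degree < len(ws). This is exactly the classically
    precomputed row vector (e(target))^T A^(-1) appearing in Eq. 7.35."""
    weights = []
    for i, wi in enumerate(ws):
        num, den = Fraction(1), Fraction(1)
        for j, wj in enumerate(ws):
            if j != i:
                num *= target - wj
                den *= wi - wj
        weights.append(num / den)
    return weights

k, n = 3, 24
chunk = n // k
q = 2 * (k - 1) + 1                                  # 5 points for PhaseProduct
ws = [-2, -1, 0, 1, 2]
x, z = 14316478, 9987301                             # arbitrary 24-bit test values
mask = (1 << chunk) - 1
xs = [(x >> (chunk * j)) & mask if j < k - 1 else x >> (chunk * j) for j in range(k)]
zs = [(z >> (chunk * j)) & mask if j < k - 1 else z >> (chunk * j) for j in range(k)]
Ax = [sum(c * w ** j for j, c in enumerate(xs)) for w in ws]   # evaluations x(w_i)
Az = [sum(c * w ** j for j, c in enumerate(zs)) for w in ws]
eh = e_hat(ws, 1 << chunk)
assert sum(e * a * b for e, a, b in zip(eh, Ax, Az)) == x * z
```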
The decomposition in Eq. 7.35 suggests a simple algorithm for computing PhaseProduct and PhaseTripleProduct efficiently, for general k. We lay out the steps explicitly in Algorithm 7.3. In that algorithm, the square brackets denote values that only appear when computing the PhaseTripleProduct. We note that Algorithm 7.3 is meant to be as simple as possible while exhibiting the asymptotic scaling we desire; in practice optimizations should be applied to reduce the qubit count and/or depth (see Secs. 7.3.4 and 7.3.4 below).
Finally, note that the multiplications by A^{−1} and (e(2^{n/k}))^⊺ in Eqs. 7.32 and 7.33 correspond to step 4 (interpolation) and step 5 (recomposition) of the classical Toom-Cook algorithm, respectively (see Sec. 7.2.2). Step 4 is the most complicated step of the whole algorithm in terms of arithmetic; it requires roughly twice as many additions and subtractions as step 2 in the two-integer product case, and three times as many in the triple-product case. By offloading that step to classical precomputation, we have cut the quantum computational cost of each level of recursion by roughly an additional 2/3 (PhaseProduct) or 3/4 (PhaseTripleProduct).
Table 7.1: Asymptotic gate counts.

PhaseProduct (classical-quantum):
  Algorithm  | Gate count
  Schoolbook | O(n^2)
  k=2        | O(n^{log_2 3}) = O(n^{1.58⋯})
  k=3        | O(n^{log_3 5}) = O(n^{1.46⋯})

PhaseTripleProduct (quantum-quantum):
  Algorithm  | Gate count
  Schoolbook | O(n^3)
  k=2        | O(n^{log_2 4}) = O(n^2)
  k=3        | O(n^{log_3 7}) = O(n^{1.77⋯})
  k=4        | O(n^{log_4 10}) = O(n^{1.66⋯})
We now analyze the asymptotic scaling of the gate counts for Algorithm 7.3. We state our results as the following Theorem, which yields the scaling exponents listed in Table 7.1. Its proof is similar to the proof of the asymptotic complexity of the classical Toom-Cook algorithms. [KNU98]
Theorem 6. Algorithm 7.3 can be implemented on a quantum device using O(n^{log_k q}) two-qubit gates, where n is the length of the input integers.
Proof. We first show that the recursive calls are performed on integers of size n/k + t(k), for some function t(k) which is independent of n. From this we write a recursion relation and solve it to prove the theorem.
The recursive calls are performed on values of the form (Ax)_i (resp. y and z). For the purposes of this proof, let the points w_i be any q integers in the range (−q, q). Then the relevant elements of A—corresponding to powers of each point w_i up to degree k—are bounded by q^k; since q is linear in k, this implies that the elements of A are at most O(k log k) bits long. The elements of x (resp. y and z) are of size n/k; thus a single product A_{ij} x_j has length n/k + O(k log k). The value (Ax)_i is the sum over k of these products, which adds at most log k bits, so we see that (Ax)_i can be stored in n/k + O(k log k) bits. Thus we see t(k) = max_i ⌈log (Ax)_i⌉ − n/k = O(k log k).
With that we may write a recursion relation for the gate count G. For one level of recursion, the algorithm performs at most 2qk additions and subtractions, and q recursive calls of size n/k+t(k) (one per iteration of the outer loop). With the additions and subtractions being performed in linear time dn for some constant d, we have
G(n) = dn + q G(n/k + t(k))    (7.36)
Noting that if G is a polynomial of degree ≤ 2, then G(n/k + t(k)) ≤ G(n/k) + O(n·t(k)), solving the recursion yields
G(n) = O(n^{log_k q})    (7.37)
which is what we desired to show. Finally, because q is at most 3(k−1)+1, it is straightforward to see that for all k ≥ 2 (which is all of the allowed values of k), the power of n is at most 2, and thus our assumption that G was a polynomial of degree at most two was justified. ∎
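One can numerically iterate the recursion of Eq. 7.36 to watch the exponent emerge (the constants and base case below are arbitrary choices of ours, with t set to 0 for simplicity):

```python
import math

def gate_count(n: int, k: int, q: int, d: int = 1, t: int = 0) -> int:
    """Iterate G(n) = d*n + q*G(n/k + t) down to a direct base case (Eq. 7.36)."""
    if n <= k:
        return n * n          # base case: perform the phase product directly
    return d * n + q * gate_count(n // k + t, k, q, d, t)

# For k = 3, q = 5, the growth ratio per factor-of-3 in n approaches q = 5,
# i.e. G(n) = O(n^(log_3 5)) = O(n^1.46...)
ratio = gate_count(3 ** 12, 3, 5) / gate_count(3 ** 11, 3, 5)
assert abs(math.log(ratio, 3) - math.log(5, 3)) < 0.05
```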
In the proof of Theorem 6, we showed that the value (Ax)_i (resp. y and z) may overflow the register it is overwriting by a number of bits t(k) = O(k log k) which is constant in n. It is straightforward to see, then, that the algorithm can be executed using O(log n) ancilla qubits—a constant number of ancillas are used at each of the log n levels of recursion. In this section we show that it is possible to do even better, using only a small constant number of qubits for the entire operation (2 for classical-quantum multiplication, and 5 for quantum-quantum).
The first key observation is that we may reduce the overflow to a single qubit, by adjusting slightly how we divide the factors into their k parts. If we set the size of the k−1 less-significant parts of the value to ⌊(n−t)/k⌋ bits, then the most-significant part will be at least t bits longer than all the others. Thus if we choose our wi such that the coefficient on that large piece is 1, and overwrite it with the linear combination (Ax)i (resp. y and z), it will overflow by at most one qubit. Since each of the x, y, and z will have such an overflow qubit, this leads to a total of 3 overflow qubits per level of recursion (or 2 for the classical-quantum case, in which we only have x and z).
While we now only need one overflow qubit per level of recursion, we still need to deal with the larger issue of overflow qubits accumulating through the recursive calls. To do so, we perform the portion of the multiplication that involves that extra overflow qubit directly, before moving down in the recursive tree. Because one of the factors in this partial multiplication is only a single qubit, the contribution to the overall runtime is negligible. Once it has been done, the overflow qubit is uncomputed, and can be reused at a lower level recursive call.
We make these two ideas explicit in Algorithm 7.4, which reduces the ancilla qubit count to at most 3, while maintaining the asymptotic gate count of Algorithm 7.3.
Theorem 7. Algorithm 7.4 can be implemented in O(n^{log_k q}) two-qubit gates, where n is the length of the input integers.
There are three important ways in which Algorithm 7.4 differs from Algorithm 7.3: there is only one overflow qubit per register, steps 2-5 are new, and the recursive call of step 6 is on a value of length ℓ′ rather than n/k + t(k). For the last of these, it is straightforward to see that ℓ′ ≤ n/k + t(k), so the change will not adversely affect the running time. Thus we just need to show that only one overflow qubit is required, and that the cost of steps 2-5 is negligible in the asymptotic limit.
First we show that only a single overflow qubit is indeed required to store (Ax)_i in the subregister x_{k−1} (resp. y and z). First observe that for w_i ∈ {0, ∞}, (Ax)_i ∈ {x_0, x_{k−1}}, so clearly no overflow qubits are needed. The other w_i are unit fractions with denominator α_i. For these, recall that (Ax)_i = Σ_j α_i^{k−1−j} x_j. Thus (Ax)_i is bounded from above by 2^{ℓ′} + 2^{ℓ+t′}. By definition ℓ′ − ℓ ≥ t′, so (Ax)_i requires at most ℓ′ + 1 bits to store, which is what we desired to show.
Now we examine the cost of steps 2-5. We begin with steps 2-4 in the case of PhaseProduct (used for classical-quantum multiplication), because it is simpler. In this case the parts of Algorithm 7.4 in square brackets are omitted, including all of step 3. In steps 2 and 4, we need to do a controlled application of phases proportional to (Az)_i and (Ax)_i mod 2^{ℓ′}. These values are both already contained in a register, so this can be easily achieved with a series of ℓ′ controlled-phase gates, where ℓ′ is within an additive constant of n/k. Thus in this case our recursion relation is simply (cf. Eq. 7.36)
G(n) = O(n) + qG(ℓ′)    (7.38)
For steps 2-4 of PhaseTripleProduct, we need to implement phases proportional to (Ay)_i(Az)_i, ((Ax)_i mod 2^{n/k})(Az)_i, and ((Ax)_i mod 2^{n/k})((Ay)_i mod 2^{n/k}). Once again all these values are readily available, but simply applying a series of doubly-controlled phase rotations would use n^2 gates—which is too many. Instead, we may apply a controlled version of the PhaseProduct algorithm, for some (perhaps different) k′! Doing so will require a total of O(n^{log_{k′} q′}) gates, and we get a recursion relation
G(n) = O(n^{log_{k′} q′}) + qG(ℓ′)    (7.39)
where again ℓ′ is within an additive constant of n/k. Thus as long as we set k′ such that
log_{k′}[2(k′−1)+1] < log_k[3(k−1)+1]    (7.40)
both Eqs. 7.38 and 7.39 have the solution
O(n^{log_k q})    (7.41)
which is what we desired to show.
The only thing that remains is step 5. We must show that uncomputing the overflow ancillas uses only a negligible number of gates. This is straightforward: the entire value (Ax)i requires only a number of gates linear in n to compute, so uncomputing the entire thing and recomputing (Ax)imod2ℓ′ would require only O(n) gates. (In practice the single ancilla can be uncomputed more efficiently than uncomputing the entire value). ∎
Remark: We note that k′=2 satisfies Eq. 7.40 for k≤5, which includes all of the k values we explicitly explore in this work.
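The remark is easy to confirm numerically; a quick sketch (a check of Eq. 7.40, not part of the original text) evaluates both sides for k′ = 2:

```python
import math

# Check for which k the choice k' = 2 satisfies Eq. 7.40:
#   log_{k'}[2(k'-1)+1] < log_k[3(k-1)+1]
kp = 2
lhs = math.log(2 * (kp - 1) + 1, kp)          # log_2 3 ~ 1.585
satisfied = [k for k in range(2, 10)
             if lhs < math.log(3 * (k - 1) + 1, k)]
print(satisfied)  # k' = 2 works exactly for k in {2, 3, 4, 5}
```

For k ≥ 6 the right-hand side drops below log_2 3, so a larger k′ would be needed there.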
Theorem 8. Algorithm 7.4 uses at most 2 ancillas for PhaseProduct, and at most 5 for PhaseTripleProduct.
Again we begin with the case of PhaseProduct (which implements classical-quantum multiplication). In step 1, one ancilla qubit is allocated for each of the x and z registers, for a total of two. Steps 2-4 do not require the allocation of any new ancilla qubits if they are implemented as a series of controlled-phase gates (as they are in the proof of Theorem 7). Because we uncompute the overflow ancillas in step 5, they can then be reused during the recursive call of step 6, so no additional ancilla qubits need to be allocated. Thus PhaseProduct requires a total of 2 ancillas.
For PhaseTripleProduct, 3 ancillas (one for each register) are allocated in step 1. But as described in the proof of Theorem 7, steps 2-4 require the application of a controlled version of PhaseProduct. We have just shown that this requires 2 ancillas. Once again, in step 5 we uncompute the overflow ancillas and they can be reused during the recursive call of Step 6. Thus we see that PhaseTripleProduct can be implemented with a total of 5 ancillas. ∎
In practice, we may be interested in the size of t′, which is roughly ℓ′−n/k, because although it is clear that their difference is constant, if that constant is large it could meaningfully affect the practical cost of performing the recursion (since the recursive products are of size ℓ′). In Table 7.2, we record the values of t′ for various k, and observe that they are very small in practice.
Table 7.2: Values of t′ for various k.

PhaseProduct (classical-quantum):
Algorithm variant | t′
k=2 | 0
k=3 | 3

PhaseTripleProduct (quantum-quantum):
Algorithm variant | t′
k=2 | 0
k=3 | 4
k=4 | 7
Finally, we discuss gate depth. The main results of this section are summarized in Table 7.3, and formalized in Theorem 9. Stated succinctly, we find that it is possible to perform the phase shifts implemented by PhaseProduct and PhaseTripleProduct in sub-linear gate depth. Unfortunately, this does not immediately imply that these algorithms can be used to achieve sub-linear depth multiplication in O(n) qubits—surprisingly, we are bottlenecked by the quantum Fourier transform and its inverse, which need to be performed before and after the phase products! The lowest-depth (approximate) quantum Fourier transform circuit of which we are aware that runs in linear space is the “standard” construction having depth O(n); fast algorithms with O(log n) depth are known, but they seem to require at least O(n log n) qubits [CW00]. An important and exciting direction for future research is whether it is possible to compute this (approximate) quantum Fourier transform in sub-linear depth and linear space. Despite this limitation, we believe it is worth exploring how the Phase[Triple]Product portion of the multiplication can be made as low-depth as possible; in practice the constants of the linear-space quantum Fourier transform circuit are good enough that we expect its depth to be comparable to the depth of the Phase[Triple]Product for relevant sizes of integers, despite its worse asymptotic scaling. The low-depth constructions given here use O(n) ancilla qubits (note that this is more than the O(1) ancillas we used in the previous section). We expect that in practice the qubit cost will be dominated by the O(n) qubits needed to perform sub-linear time addition [DKR+06]; the number of ancillas needed for the overflow qubits in this case is also linear in n but small. We begin by giving intuition for how low depth can be achieved, and then state Theorem 9.
In Algorithms 7.3 and 7.4, each iteration of the outer loop overwrites a single value x_h of length n/k with (Ax)_i (resp. y and z), recursively computes Phase[Triple]Product on that value, and then resets it to its original contents. However, we have k of these length-n/k subregisters at our disposal—we may as well overwrite more than just one of them, so that we can perform several recursive calls in parallel! The challenge is that we need to compute each of the (Ax)_i (resp. y and z) in-place, reversibly. We do not know of an algorithmic way to simultaneously compute many (Ax)_i in-place for arbitrary k; instead, in Tables 7.4, 7.5, and 7.6 we record explicit sequences of operations to do so, which were created by hand for the cases of k=3 (both PhaseProduct and PhaseTripleProduct) and k=4 (PhaseTripleProduct only). We also note that Algorithms 7.1 and 7.2, which cover the k=2 cases, already implicitly allow this parallelism. In all of these sequences we have endeavored to use only addition, sign inversion, and scaling by powers of two, since each of these operations should be very efficient to implement.
Since we have k subregisters at our disposal, and q points at which we need to evaluate (Ax)_i (resp. y and z), the best parallelism we can hope to achieve (in a single level of recursion) is n_p = ⌈q/k⌉ parallel groups of recursive calls. Recalling the dependence of q on k, this quantity is equal to n_p = ⌈(2(k−1)+1)/k⌉ = 2 for PhaseProduct (for all k≥2) and n_p = ⌈(3(k−1)+1)/k⌉ = 3 for PhaseTripleProduct (for all k≥3). (Note that the optimum is achieved by the sequences in Algs. 7.1 and 7.2, and Tables 7.4 and 7.5.) In view of Theorem 9 below, this corresponds to an asymptotic gate depth of O(n^{log_k 2}) for PhaseProduct and O(n^{log_k 3}) for PhaseTripleProduct. An avenue for further research is whether it is possible to saturate the hard lower bound of depth equal to the total gate count divided by n, using only O(n) ancillas, by parallelizing over multiple levels of recursion simultaneously.
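The parallel-group counts just derived can be spelled out in a two-line sketch (illustrative, not from the text):

```python
import math

def n_p(q, k):
    # best-case number of sequential groups when q recursive calls
    # are packed into k subregisters, ceil(q / k)
    return math.ceil(q / k)

# PhaseProduct has q = 2(k-1)+1; PhaseTripleProduct has q = 3(k-1)+1
print([n_p(2 * (k - 1) + 1, k) for k in range(2, 7)])  # all 2
print([n_p(3 * (k - 1) + 1, k) for k in range(3, 7)])  # all 3
```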
We formalize our results regarding gate depth in the following theorem, and give explicit asymptotic scaling of gate depth in Table 7.3.
Table 7.4: In-place sequence for computing the (Ax)_i in parallel, for k=3 PhaseProduct.

Operation | Register 0 | Register 1 | Register 2 |
(start) | |x0⟩ | |x1⟩ | |x2⟩ |
Add reg. 2 to reg. 1 | |x0⟩ | |x2+x1⟩ | |x2⟩ |
Add reg. 0 to reg. 1 | |x0⟩ | |x2+x1+x0⟩ | |x2⟩ |
Product on all registers | |x0⟩ | |x2+x1+x0⟩ | |x2⟩ |
Invert sign of reg. 1 | |x0⟩ | |−x2−x1−x0⟩ | |x2⟩ |
Add reg. 0 to reg. 1 | |x0⟩ | |−x2−x1⟩ | |x2⟩ |
Add 2× reg. 2 to reg. 1 | |x0⟩ | |x2−x1⟩ | |x2⟩ |
Add reg. 1 to reg. 0 | |x2−x1+x0⟩ | |x2−x1⟩ | |x2⟩ |
Add reg. 0 to reg. 1 | |x2−x1+x0⟩ | |2x2−2x1+x0⟩ | |x2⟩ |
Add 2× reg. 2 to reg. 1 | |x2−x1+x0⟩ | |4x2−2x1+x0⟩ | |x2⟩ |
Product on regs. 1 and 0 | |x2−x1+x0⟩ | |4x2−2x1+x0⟩ | |x2⟩ |
Invert sign of reg. 1 | |x2−x1+x0⟩ | |−4x2+2x1−x0⟩ | |x2⟩ |
Add 2× reg. 2 to reg. 1 | |x2−x1+x0⟩ | |−2x2+2x1−x0⟩ | |x2⟩ |
Add reg. 1 to 2× reg. 0 | |x0⟩ | |−2x2+2x1−x0⟩ | |x2⟩ |
Add reg. 0 to reg. 1 | |x0⟩ | |−2x2+2x1⟩ | |x2⟩ |
Add 2× reg. 2 to reg. 1 | |x0⟩ | |2x1⟩ | |x2⟩ |
Divide reg. 1 by two | |x0⟩ | |x1⟩ | |x2⟩ |
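Since the sequence above was constructed by hand, it is easy to check mechanically. The following sketch (a verification aid, not part of the dissertation) tracks each register of Table 7.4 as a vector of coefficients over (x0, x1, x2), checks the linear combinations present at the two product rows, and confirms that the final operations restore every register:

```python
from fractions import Fraction

def add(regs, src, dst, s=1, d=1):
    # "Add s x reg. src to d x reg. dst" means: dst <- d*dst + s*src
    regs[dst] = [d * b + s * a for a, b in zip(regs[src], regs[dst])]

def neg(regs, r):            # "Invert sign of reg. r"
    regs[r] = [-c for c in regs[r]]

def div(regs, r, m):         # "Divide reg. r by m" (logically, a bit shift)
    regs[r] = [Fraction(c, m) for c in regs[r]]

start = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]     # register i holds x_i
regs = [row[:] for row in start]
add(regs, 2, 1); add(regs, 0, 1)              # reg1 = x0 + x1 + x2
assert regs[1] == [1, 1, 1]                   # "Product on all registers" row
neg(regs, 1); add(regs, 0, 1); add(regs, 2, 1, s=2)
add(regs, 1, 0); add(regs, 0, 1); add(regs, 2, 1, s=2)
assert regs[0] == [1, -1, 1]                  # x2 - x1 + x0
assert regs[1] == [1, -2, 4]                  # 4x2 - 2x1 + x0 ("Product on regs. 1 and 0")
neg(regs, 1); add(regs, 2, 1, s=2); add(regs, 1, 0, d=2)
add(regs, 0, 1); add(regs, 2, 1, s=2); div(regs, 1, 2)
assert regs == start                          # every register restored
```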
Table 7.5: In-place sequence for computing the (Ax)_i in parallel, for k=3 PhaseTripleProduct.

Operation | Register 0 | Register 1 | Register 2 |
(start) | |x0⟩ | |x1⟩ | |x2⟩ |
Add reg. 0 to reg. 1 | |x0⟩ | |x0+x1⟩ | |x2⟩ |
Add reg. 2 to reg. 1 | |x0⟩ | |x0+x1+x2⟩ | |x2⟩ |
Product on all | |x0⟩ | |x0+x1+x2⟩ | |x2⟩ |
Add −1× reg. 2 to reg. 1 | |x0⟩ | |x0+x1⟩ | |x2⟩ |
Add 2× reg. 0 to −1× reg. 1 | |x0⟩ | |x0−x1⟩ | |x2⟩ |
Add reg. 1 to reg. 2 | |x0⟩ | |x0−x1⟩ | |x0−x1+x2⟩ |
Add reg. 2 to reg. 1 | |x0⟩ | |2x0−2x1+x2⟩ | |x0−x1+x2⟩ |
Add 2× reg. 0 to reg. 1 | |x0⟩ | |4x0−2x1+x2⟩ | |x0−x1+x2⟩ |
Product on regs. 1 and 2 | |x0⟩ | |4x0−2x1+x2⟩ | |x0−x1+x2⟩ |
Add reg. 0 to 2× reg. 2 | |x0⟩ | |4x0−2x1+x2⟩ | |3x0−2x1+2x2⟩ |
Add reg. 2 to −2× reg. 1 | |x0⟩ | |−5x0+2x1⟩ | |3x0−2x1+2x2⟩ |
Add 2× reg. 2 to reg. 1 | |x0⟩ | |x0−2x1+4x2⟩ | |3x0−2x1+2x2⟩ |
Add reg. 1 to −4× reg. 2 | |x0⟩ | |x0−2x1+4x2⟩ | |−11x0+6x1−4x2⟩ |
Add 2× reg. 1 to reg. 2 | |x0⟩ | |x0−2x1+4x2⟩ | |−9x0+2x1+4x2⟩ |
Add 8× reg. 0 to reg. 2 | |x0⟩ | |x0−2x1+4x2⟩ | |−x0+2x1+4x2⟩ |
Add 2× reg. 0 to reg. 2 | |x0⟩ | |x0−2x1+4x2⟩ | |x0+2x1+4x2⟩ |
Product on regs. 1 and 2 | |x0⟩ | |x0−2x1+4x2⟩ | |x0+2x1+4x2⟩ |
Add reg. 2 to −1× reg. 1 | |x0⟩ | |4x1⟩ | |x0+2x1+4x2⟩ |
Add −1× reg. 0 to reg. 2 | |x0⟩ | |4x1⟩ | |2x1+4x2⟩ |
Add −1× reg. 1 to 2× reg. 2 | |x0⟩ | |4x1⟩ | |8x2⟩ |
Divide reg. 1 by 4 | |x0⟩ | |x1⟩ | |8x2⟩ |
Divide reg. 2 by 8 | |x0⟩ | |x1⟩ | |x2⟩ |
Table 7.6: In-place sequence for computing the (Ax)_i in parallel, for k=4 PhaseTripleProduct.

Operation | Register 0 | Register 1 | Register 2 | Register 3 |
(start) | |x0⟩ | |x1⟩ | |x2⟩ | |x3⟩ |
Add reg. 0 to reg. 2 | |x0⟩ | |x1⟩ | |x0+x2⟩ | |x3⟩ |
Add reg. 3 to reg. 1 | |x0⟩ | |x1+x3⟩ | |x0+x2⟩ | |x3⟩ |
Add reg. 2 to reg. 1 | |x0⟩ | |x0+x1+x2+x3⟩ | |x0+x2⟩ | |x3⟩ |
Add reg. 1 to −2× reg. 2 | |x0⟩ | |x0+x1+x2+x3⟩ | |−x0+x1−x2+x3⟩ | |x3⟩ |
Product on all regs. | |x0⟩ | |x0+x1+x2+x3⟩ | |−x0+x1−x2+x3⟩ | |x3⟩ |
Add reg. 1 to reg. 2 | |x0⟩ | |x0+x1+x2+x3⟩ | |2x1+2x3⟩ | |x3⟩ |
Add −3× reg. 3 to reg. 2 | |x0⟩ | |x0+x1+x2+x3⟩ | |2x1−x3⟩ | |x3⟩ |
Add 3× reg. 0 to reg. 1 | |x0⟩ | |4x0+x1+x2+x3⟩ | |2x1−x3⟩ | |x3⟩ |
Add reg. 2 to 2× reg. 1 | |x0⟩ | |8x0+4x1+2x2+x3⟩ | |2x1−x3⟩ | |x3⟩ |
Add 3× reg. 3 to 2× reg. 2 | |x0⟩ | |8x0+4x1+2x2+x3⟩ | |4x1+x3⟩ | |x3⟩ |
Add −1× reg. 1 to 2× reg. 2 | |x0⟩ | |8x0+4x1+2x2+x3⟩ | |−8x0+4x1−2x2+x3⟩ | |x3⟩ |
Product on regs. 1 and 2 | |x0⟩ | |8x0+4x1+2x2+x3⟩ | |−8x0+4x1−2x2+x3⟩ | |x3⟩ |
Add reg. 1 to −1× reg. 2 | |x0⟩ | |8x0+4x1+2x2+x3⟩ | |16x0+4x2⟩ | |x3⟩ |
Divide reg. 2 by 4 | |x0⟩ | |8x0+4x1+2x2+x3⟩ | |4x0+x2⟩ | |x3⟩ |
Add 6× reg. 2 to reg. 1 | |x0⟩ | |32x0+4x1+8x2+x3⟩ | |4x0+x2⟩ | |x3⟩ |
Add −15× reg. 0 to 4× reg. 2 | |x0⟩ | |32x0+4x1+8x2+x3⟩ | |x0+4x2⟩ | |x3⟩ |
Add 15× reg. 3 to reg. 1 | |x0⟩ | |32x0+4x1+8x2+16x3⟩ | |x0+4x2⟩ | |x3⟩ |
Divide reg. 1 by 2 | |x0⟩ | |16x0+2x1+4x2+8x3⟩ | |x0+4x2⟩ | |x3⟩ |
Add −15× reg. 0 to reg. 1 | |x0⟩ | |x0+2x1+4x2+8x3⟩ | |x0+4x2⟩ | |x3⟩ |
Add reg. 1 to −2× reg. 2 | |x0⟩ | |x0+2x1+4x2+8x3⟩ | |−x0+2x1−4x2+8x3⟩ | |x3⟩ |
Product on regs. 1 and 2 | |x0⟩ | |x0+2x1+4x2+8x3⟩ | |−x0+2x1−4x2+8x3⟩ | |x3⟩ |
Add 6× reg. 1 to −2× reg. 2 | |x0⟩ | |x0+2x1+4x2+8x3⟩ | |8x0+8x1+32x2+32x3⟩ | |x3⟩ |
Divide reg. 2 by 8 | |x0⟩ | |x0+2x1+4x2+8x3⟩ | |x0+x1+4x2+4x3⟩ | |x3⟩ |
Add −3× reg. 2 to 2× reg. 1 | |x0⟩ | |−x0+x1−4x2+4x3⟩ | |x0+x1+4x2+4x3⟩ | |x3⟩ |
Add 3× reg. 0 to 4× reg. 1 | |x0⟩ | |−x0+4x1−16x2+16x3⟩ | |x0+x1+4x2+4x3⟩ | |x3⟩ |
Add 12× reg. 3 to reg. 2 | |x0⟩ | |−x0+4x1−16x2+16x3⟩ | |x0+x1+4x2+16x3⟩ | |x3⟩ |
Add −3× reg. 0 to 4× reg. 2 | |x0⟩ | |−x0+4x1−16x2+16x3⟩ | |x0+4x1+16x2+64x3⟩ | |x3⟩ |
Add 48× reg. 3 to reg. 1 | |x0⟩ | |−x0+4x1−16x2+64x3⟩ | |x0+4x1+16x2+64x3⟩ | |x3⟩ |
Product on regs. 1 and 2 | |x0⟩ | |−x0+4x1−16x2+64x3⟩ | |x0+4x1+16x2+64x3⟩ | |x3⟩ |
Add reg. 2 to reg. 1 | |x0⟩ | |8x1+128x3⟩ | |x0+4x1+16x2+64x3⟩ | |x3⟩ |
Divide reg. 1 by 8 | |x0⟩ | |x1+16x3⟩ | |x0+4x1+16x2+64x3⟩ | |x3⟩ |
Add −4× reg. 1 to reg. 2 | |x0⟩ | |x1+16x3⟩ | |x0+16x2⟩ | |x3⟩ |
Add −1× reg. 0 to reg. 2 | |x0⟩ | |x1+16x3⟩ | |16x2⟩ | |x3⟩ |
Divide reg. 2 by 16 | |x0⟩ | |x1+16x3⟩ | |x2⟩ | |x3⟩ |
Add −16× reg. 3 to reg. 1 | |x0⟩ | |x1⟩ | |x2⟩ | |x3⟩ |
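Hand-constructed sequences like the one above are easy to replay mechanically. This sketch (a verification aid, not part of the dissertation) encodes each non-product row of the k=4 table as a coefficient-vector update over (x0, x1, x2, x3) and confirms all registers end in their starting state:

```python
from fractions import Fraction

# Each row maps to (src, s, dst, d): reg[dst] <- d*reg[dst] + s*reg[src];
# ('div', r, m) divides register r by m. "Product" rows change no register.
ops = [(0,1,2,1), (3,1,1,1), (2,1,1,1), (1,1,2,-2),       # ... product on all regs.
       (1,1,2,1), (3,-3,2,1), (0,3,1,1), (2,1,1,2),
       (3,3,2,2), (1,-1,2,2),                              # ... product on regs. 1 and 2
       (1,1,2,-1), ('div',2,4), (2,6,1,1), (0,-15,2,4),
       (3,15,1,1), ('div',1,2), (0,-15,1,1), (1,1,2,-2),   # ... product on regs. 1 and 2
       (1,6,2,-2), ('div',2,8), (2,-3,1,2), (0,3,1,4),
       (3,12,2,1), (0,-3,2,4), (3,48,1,1),                 # ... product on regs. 1 and 2
       (2,1,1,1), ('div',1,8), (1,-4,2,1), (0,-1,2,1),
       ('div',2,16), (3,-16,1,1)]

start = [[Fraction(i == j) for j in range(4)] for i in range(4)]  # reg i holds x_i
regs = [row[:] for row in start]
for op in ops:
    if op[0] == 'div':
        _, r, m = op
        regs[r] = [c / m for c in regs[r]]
    else:
        src, s, dst, d = op
        regs[dst] = [d * b + s * a for a, b in zip(regs[src], regs[dst])]
assert regs == start    # every register restored to x_i
```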
First we prove the depth bound. If we use fast algorithms for the additions, we may perform them in time d log n for some constant d [DKR+06]. As in the proof of Theorem 6, we denote the number of “overflow” bits of each (Ax)_i as t(k), a function which does not depend on n. With this we can write a recursion relation
D(n) = d log n + n_p D(n/k + t)    (7.42)
We note that for any power p, O((n/k + t)^p) = O(n^p) + O(t^p). The second term is O(1) with respect to n so we may drop it; thus we may solve the recursion with
D(n) = O(n^{log_k n_p})    (7.43)
which is what we desired to show.
Next we show that this depth can be achieved using O(n) qubits. The logarithmic-depth addition uses O(n) ancillas which can be reused after the addition is complete. Thus we only need to show that the “overflow” qubits are at most O(n) across all levels of recursion.
Suppose we cut off the recursion (or at least the parallelism) when the size of the input integers is below some constant value n_cutoff. At the bottom level of recursion, there will be at most n/n_cutoff values being phase multiplied in parallel, each needing up to t overflow ancillas. The number of values being multiplied, and thus the number of overflow ancillas, at each level of the tree of recursive calls is at most half that of the level below it; thus by the geometric series, the total number of overflow ancillas across all levels is at most twice the number at the bottom level. Thus the total number of overflow ancillas is at most 2tn/n_cutoff, which is linear in n, which is what we desired to show. ∎
We note that for the values of k relevant in practice, the number of overflow ancillas t for a single value (Ax)_i is quite small (see Table 7.2). Since the total number of ancillas needed is bounded by 2tn/n_cutoff, and n_cutoff can be taken to be, say, 50 without drastic loss of parallelism, we expect the qubit usage from “overflow” ancillas to be not only linear in n, but in fact a quite small fraction of n in practice. It is likely that the number of ancillas used for these overflows will be dwarfed by the ancilla usage of sub-linear-time addition.
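A quick numeric illustration of the geometric-series bound (a sketch; t=4 is a value from Table 7.2, and n_cutoff=50 and n=2048 follow the discussion above):

```python
t, n_cutoff, n = 4, 50, 2048   # illustrative values from the discussion above
per_level = n // n_cutoff      # parallel values at the bottom of the recursion
total = 0
while per_level >= 1:          # each level up has at most half as many values
    total += t * per_level
    per_level //= 2
print(total)                   # 312, below the bound 2*t*(n // n_cutoff) = 320
assert total <= 2 * t * (n // n_cutoff)
```

For these numbers the overflow ancillas amount to roughly 15% of n, consistent with the claim that they are a small fraction of the total.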
In this section we show that our results imply a new upper bound for the gate cost of performing an exact quantum Fourier transform modulo a power of 2 with only O(1) ancillas. We note that in practice, the quantum Fourier transform should always be performed approximately, which can be achieved with the standard construction to within error ϵ in time O(n log(1/ϵ)). [NC11] However, we believe the following is at least of theoretical interest.
It was shown over two decades ago that the quantum Fourier transform with modulus 2^n, which we may denote QFT_{2^n}, can be implemented exactly via the following three steps for any positive integer m<n (stated here in the language we have been using in this work): [CW00]
1. Apply QFT_{2^{n−m}} to the first n−m qubits
2. Apply PhaseProduct(2π/2^n, |x⟩, |y⟩), where x is the value of the first n−m qubits and y is the value of the remaining m qubits
3. Apply QFT_{2^m} to the final m qubits
(Note that m=1 corresponds to the “standard” construction for the QFT.) By setting m=n/2 and performing steps 1 and 3 via recursive calls, the runtime of this algorithm becomes O(log n) times the cost of step 2. In the work introducing this construction, it was suggested to implement step 2 by adding a length-n ancilla register and computing the product xy in that register, then applying single-qubit phase rotations based on the bits of the product (and then uncomputing the product afterwards). Given our results, we may apply PhaseProduct directly without needing to allocate an extra register. This directly implies that the exact quantum Fourier transform can be implemented in sub-quadratic time without the need for more than the O(1) ancillas used by PhaseProduct (in fact, using just a single ancilla for the case of k=2).
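The three-step decomposition rests on a phase identity that is easy to verify numerically. The following sketch (a check we add for illustration, with arbitrary small n=5, m=2) writes x = a·2^m + b and y = d·2^{n−m} + c and confirms that the full QFT phase factors into the three steps:

```python
import cmath

n, m = 5, 2
N = 2**n
for x in range(N):
    a, b = divmod(x, 2**m)            # a: top n-m bits, b: bottom m bits
    for y in range(N):
        d, c = divmod(y, 2**(n - m))  # y = d*2^(n-m) + c
        # xy/N = ad (an integer, contributing no phase)
        #        + ac/2^(n-m) + bc/2^n + bd/2^m
        lhs = cmath.exp(2j * cmath.pi * x * y / N)
        rhs = (cmath.exp(2j * cmath.pi * a * c / 2**(n - m))  # step 1: QFT on top qubits
               * cmath.exp(2j * cmath.pi * b * c / N)         # step 2: PhaseProduct
               * cmath.exp(2j * cmath.pi * b * d / 2**m))     # step 3: QFT on bottom qubits
        assert abs(lhs - rhs) < 1e-9
print("identity verified")
```

Note that in this bookkeeping the register that held the high bits of x ends up holding the low bits of y, the reordering familiar from the bit-reversal of the standard QFT circuit.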
Table 7.7: Costs for multiplication of 2048-bit numbers.

Algorithm | Toffoli/CCRϕ gates (millions) | 2- and 1-qubit gates (millions) | Ancilla qubits

Classical-quantum multiply-add:
This work | 0.46 | 2.6 | 12
Digital Windowed [GID19b] | 1.8 | 2.5 | 4106
Digital Karatsuba [GID19b] | 5.6 | 34 | 12730
Digital Schoolbook [GID19b] | 6.4 | 38 | 2048∗

Quantum-quantum multiply-add:
This work | 54 | 26 | 30
Digital Karatsuba† [GID19a] | 15 | 30 | 13152
Digital Schoolbook† [GID19a] | 21 | 420 | 4096∗
Having examined the asymptotic costs of our algorithms, we now estimate the gate and qubit counts required to implement them in practice. There are a number of tunable parameters which affect the counts for different types of gates as well as the qubit counts; here we set the parameters to achieve a particular balance of gate counts and number of ancillas. The results for non-modular multiplication of 2048-bit numbers are summarized in Table 7.7. We find that these algorithms drastically reduce qubit counts while exhibiting competitive gate counts. Regarding gate counts, we note two things. First, making a direct comparison between our algorithms and others is not straightforward, because they use different gate sets—for example, a sizeable fraction of the two-qubit gates in our classical-quantum multiplication algorithm are controlled phase rotations, while in other implementations the two- and one-qubit gates might consist entirely of Clifford operations. Second, these estimates for our algorithms correspond to preliminary circuit constructions; it is likely that further optimizations of the implementation could yield considerable improvements, especially in the gate counts.
For each value of k, we numerically determine the “crossover” point: the value of n above which the variant with parameter k outperforms the one with k−1. Here, we define “outperform” as using fewer total CCRϕ, CRϕ, and Toffoli gates. In Table 7.8 we record the optimal crossover points that we find for each k. We note that these crossover points depend on the specific implementation and definition of cost, and so may vary, for example, if the circuits are compiled to target a certain physical gate set. Using these crossover points in our implementation of 2048-bit multiplication, we adjust k throughout the recursion to always use the optimal algorithm for the size of the factors, eventually switching to schoolbook multiplication when the factors have become sufficiently small.
Table 7.8: Crossover points for each algorithm variant.

PhaseProduct (classical-quantum):
Algorithm variant | Cutoff
Schoolbook | —
k=2 (Karatsuba) | 29
k=3 | 114

PhaseTripleProduct (quantum-quantum):
Algorithm variant | Cutoff
Schoolbook | —
k=2 | 9
k=3 | 58
k=4 | 262
For our estimates, we use the Cuccaro quantum adder to perform the additions required by our algorithms [CDK+04]. The Cuccaro adder requires one extra ancilla qubit, but it can be reused after the addition is complete. We consider this tradeoff worthwhile for its low gate counts. For sign inversions and multiplication/division by powers of 2, we note that both can be implemented “logically” with zero quantum cost. Sign inversion in two’s complement representation corresponds to flipping all bits of the integer and incrementing the value by 1. Flipping all bits can be performed logically by relabeling which quantum state corresponds to 0 and which to 1; the addition of one can easily be incorporated into the next addition following the sign flip via the input “carry” bit that is inherently part of Cuccaro addition. Multiplication and division by powers of 2 correspond to bit shifts and can be performed at zero quantum cost by relabeling qubits.
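The zero-cost sign inversion relies on the standard two's-complement identity −a = (~a) + 1 (mod 2^n), which can be checked exhaustively for small registers (a quick sketch, not from the text):

```python
n = 8
mask = (1 << n) - 1
for a in range(1 << n):
    # Flipping all bits (a ^ mask) then adding 1 negates a modulo 2^n, so the
    # bit flips can be absorbed into qubit relabeling and the +1 into the
    # carry-in of the next addition.
    assert ((a ^ mask) + 1) & mask == (-a) & mask
print("two's-complement negation identity holds for all 8-bit values")
```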
Finally, we note that in addition to the cost of the circuits implementing the phases, we must also apply a quantum Fourier transform beforehand and an inverse quantum Fourier transform afterward. If we apply the standard technique of dropping phase rotations below a certain threshold to implement an approximate QFT, the gate count of these operations is negligible (less than the least significant figure listed in Table 7.7) even when the error for the approximate QFT is set as low as 10^{−9}. This standard approximate QFT construction also requires no ancilla qubits.
In this section we describe various tweaks that can be made to our algorithms to optimize them for two exciting applications: Shor’s algorithm, and cryptographic proofs of quantumness. Before discussing those in depth, we begin with an aside about modular multiplication which will be relevant for both applications.
So far, we have discussed algorithms for non-modular multiplication—the inputs are each of length n, and the output is of length 2n. For both of the applications we discuss here, we are instead interested in multiplication modulo some integer N. In this section we show that with a small modification, our algorithms can output the product p in the form of an (approximate) binary fraction (p mod N)/N of precision m bits (stored on m qubits), while maintaining their asymptotic costs. Later we will show that this form, with m not much larger than n, is sufficient for the relevant applications. Here we discuss only the case of quantum-quantum multiplication to avoid unnecessarily complicated notation, but the results trivially carry over to the case of classical-quantum multiplication.
In the preceding sections, we have been working in Fourier space modulo 2^{2n}. In Fourier space, the modular multiplication result we desire has the following form (here dropping any constant w to which the result was added, though it can be trivially included):
QFT_{2^{2n}} |xy mod N⟩ = Σ_z exp(2πi (xy mod N) z / 2^{2n}) |z⟩    (7.44)
It is relatively well-known that if we instead work in Fourier space modulo N, we may apply the following identity:
exp(2πi (xy mod N) z / N) = exp(2πi xyz / N)    (7.45)
Intuitively, as the phase wraps around 2π, it automatically performs the modulo operation for us! This seems to suggest that applying our circuits to modular multiplication is trivial: simply change the modulus of the Fourier transforms.
The challenge is that circuits for performing the quantum Fourier transform with an arbitrary modulus are considerably more complex than for a power of two—and (to our knowledge) there do not exist constructions for the QFT with arbitrary modulus that operate in sub-quadratic time using only O(1) ancillas, which is what we desire. Instead, we may apply the phase of Eq. 7.45 but perform the final quantum Fourier transform modulo a power of two instead of N. This is simply quantum phase estimation for the phase 2πxy/N. It is well known that quantum phase estimation for a phase that cannot be represented exactly as a binary fraction yields a superposition which is exponentially heavily weighted on the binary strings that best approximate the value of the phase. [NC11]
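To illustrate the concentration claim, the following sketch (an illustration we add; x, y, N, and the 8-bit output register are arbitrary) applies the phase of Eq. 7.45 to a uniform superposition, performs an inverse QFT modulo 2^m classically, and finds the outcome distribution sharply peaked at the best m-bit approximation of (xy mod N)/N:

```python
import cmath

x, y, N = 5, 6, 7
M = 2**8                      # 8-bit output register, M = 2^m
theta = (x * y % N) / N       # fractional phase (xy mod N)/N; the integer
                              # part of xy/N wraps away, cf. Eq. 7.45
probs = []
for w in range(M):
    # amplitude of outcome |w> after the inverse QFT mod 2^m
    amp = sum(cmath.exp(2j * cmath.pi * (theta - w / M) * z)
              for z in range(M)) / M
    probs.append(abs(amp) ** 2)
best = max(range(M), key=lambda w: probs[w])
assert best == round(theta * M) % M    # peak at best 8-bit approximation
print(best, probs[best])               # most of the weight sits on this outcome
```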
We note that there is another construction by which we can use our circuits to compute modular multiplication exactly, with the same asymptotic gate count but now using 3n+O(1) qubits. It is a quantum version of a classical construction called Montgomery multiplication. It is particularly nice because classically, Montgomery multiplication has the downside of introducing a (known) extra factor into the result that needs to eventually be dealt with; for classical-quantum multiplication the inverse of this extra factor can be classically multiplied into the classical input, canceling it out and obviating the need to deal with it later. However, we do not pursue that method further here because the approximate version above uses fewer qubits and suffices for our needs.
Another optimization relevant for both the applications below involves simplifying quantum Fourier transforms in certain cases. If we know that the “output” register is starting in the |0⟩ state, applying the initial quantum Fourier transform will just yield a uniform superposition over all bitstrings, all with the same phase of +1. Thus we can skip the complication of a Fourier transform modulo 2m and simply apply a Hadamard gate to each of the qubits in parallel, yielding the same state in depth 1 instead of depth O(n). (Note that we do still have to perform the full inverse quantum Fourier transform at the end of the multiplication.)
With that, we move on to discussing applications.
The core of Shor’s algorithm for factoring an integer of n bits can be implemented via a series of in-place multiplications by classical constants, controlled off of a single qubit (see the “one controlling qubit” trick, Sec. 2.4 of [BEA03]). Written out, a single one of these multiplications consists of the in-place operation
|x⟩ → |cx mod N⟩    (7.46)
for a classical integer c. Here we apply the modular multiplication introduced in Sec. 7.4.1 to implement this operation to within an error ϵ using 2n + O(log 1/ϵ) qubits (while maintaining the subquadratic gate count of the multiplication algorithm).
To do so, we make the following observation: our classical-quantum multiplication algorithm does not require that the classical value c be an integer. It can be a floating-point number, which we may classically compute to whatever arbitrary precision we desire before using it to compute the values of the phase rotations in our multiplication algorithm. We use this fact to perform the following trick: compute the fractional value w = (cx mod N)·2^n/N up to some precision m as in Eq. 7.45, and then multiply by the value N/2^n to convert w into an integer. The accuracy of this operation depends on the precision to which we compute w; however, we will find that we only need O(log(1/ϵ)) extra bits to achieve an error of less than ϵ.
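A classical sketch of the precision argument (the numbers here are arbitrary, and exact rational arithmetic stands in for the fixed-point value): an m-bit truncation of w = (cx mod N)·2^n/N recovers the integer cx mod N exactly once m exceeds n by a few guard bits, since the rounding error is below N/2^m < 1/2:

```python
from fractions import Fraction

c, x, N = 567, 1234, 2047        # arbitrary small example
n = N.bit_length()               # 11
m = n + 8                        # n plus a few guard bits
target = (c * x) % N
# w = (cx mod N) * 2^n / N, truncated to m bits of precision:
w = Fraction((target << m) // N, 1 << (m - n))
recovered = round(w * N / 2**n)  # multiply back by N/2^n and round
assert recovered == target
print(target, recovered)
```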
Algorithm 7.5 applies this idea to approximately implement the unitary of Eq. 7.46. It is clear by inspection that it uses n + O(log 1/ϵ) ancilla qubits in time O(n^{log_k q}). In the following theorem we prove the error bound.
The final state |ψ⟩ produced by Algorithm 7.5 has inner product ⟨ψ|cx mod N⟩ ≥ 1 − O(ϵ).
We begin by showing that if Algorithm 7.5 were run with infinite precision (that is, setting m→∞), the correct state would result with zero error. Then we show that the difference between the unitary applied in the infinite precision case and the real case is small.
In the infinite precision case, after step 1 we have the state |x⟩|w⟩ for w = ((c−1)x mod N)·(2^n/N). In step 2 we add w·N/2^n = (c−1)x mod N to x, yielding exactly cx mod N or (cx mod N) + N. Step 3 uses an ancilla to reduce this to cx mod N; the ancilla now records whether N needed to be subtracted or not. Beauregard observed the identity (a+b) mod N ≥ a ⇔ a+b < N [BEA03]; in our case this implies that the truth value of the comparison cx mod N < w·N/2^n will equal the value of the ancilla, and thus can be used to uncompute it. Finally, for step 5, observe that (1−c^{−1})(cx) mod N = (c−1)x mod N, which scaled by 2^n/N is exactly w, and thus subtracting it from the second register resets that register to zero. Also note that the multiplicative inverse c^{−1} exists because c and N are co-prime (or, if they are not, gcd(c,N) provides a factor of N without requiring Shor’s algorithm at all!). We now have the state |cx mod N⟩|0⟩, which is what we desired. Now we turn to showing that the approximate circuit, in which m = n + O(log 1/ϵ), yields a unitary that is within ϵ of the circuit with m → ∞.
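As an aside, the comparison identity attributed to Beauregard is easy to confirm by brute force (our check; N=33 is arbitrary):

```python
# Beauregard's identity: for 0 <= a, b < N,  (a + b) mod N >= a  iff  a + b < N
N = 33
for a in range(N):
    for b in range(N):
        assert (((a + b) % N) >= a) == (a + b < N)
print("identity holds for all pairs")
```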
We begin with step 1. Here we must show that applying Hadamard gates to generate |x⟩∑_z|z⟩, then a phase rotation of ϕ = 2πwz, and then an inverse QFT modulo 2^m yields a state that is ϵ-close to |x⟩|w̃⟩. This can be seen immediately by observing that this process corresponds precisely to the quantum phase estimation algorithm for the phase 2πw, without the usual final measurement to extract an estimate of the phase. It is known that performing quantum phase estimation with an extra ⌈log(2 + 1/(2ϵ))⌉ qubits beyond the precision to which the phase is desired yields a probability of at least 1−ϵ of observing the value of the phase correctly up to the desired precision. [NC11] Here w̃ corresponds to this “correct” value of the phase which would be observed if the register were measured, so the population of the state |x⟩|w̃⟩ must be at least 1−ϵ and the overall error of this step is at most O(ϵ).
Now we move to step 2. Observe that even for the largest value of z, the difference in this phase when w is exact versus truncated to a total of m bits is bounded by 1/2^{m−n}, and by definition m−n = O(log 1/ϵ). Thus the difference in the phase is O(ϵ), and thus the angle between the state resulting from the exact rotation versus the truncated one is proportional to ϵ.
Steps 3 and 4 are not directly affected by the precision of the second register, so they do not contribute any additional error.
Finally, step 5 simply corresponds to the inverse of step 1. Since the “good” (error-free) portion of the state has |cxmodN⟩ in the first register exactly, this step exactly uncomputes the second register.
With that, we see that each individual step has error less than O(ϵ) when compared to the case of infinite m; the overall operation’s error is bounded by the sum of individual errors and thus is bounded by O(ϵ) as well, which is what we desired to show. ∎
For Shor’s algorithm, this operation must be repeated O(n) times. Using our fast multiplication algorithms of O(n^{log_k(2k−1)}) time (with k≥2), this yields an implementation of Shor’s algorithm that runs in O(n^{1+log_k(2k−1)}) time, significantly faster than the usual O(n^3) runtime for Shor’s algorithm. To be more concrete, it is O(n^{2.46...}) for k=3. Importantly, this runtime is achieved while maintaining a very lean qubit count: Algorithm 7.5 uses 2n + O(log 1/ϵ) total qubits, and since we must repeat it O(n) times we should set ϵ ∼ 1/n. Thus the total qubit count for this implementation of Shor’s algorithm is 2n + O(log n) qubits. To our knowledge this is the first implementation of Shor’s algorithm running faster than O(n^3) time within this space constraint.
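The claimed exponents follow directly from the formula (a quick arithmetic check, not part of the text):

```python
import math

# Runtime exponent 1 + log_k(2k-1) for Shor's algorithm built on
# classical-quantum multiplication with q = 2(k-1)+1 = 2k-1.
exponents = {k: 1 + math.log(2 * k - 1, k) for k in (2, 3, 4, 5)}
print(exponents)
assert abs(exponents[3] - 2.465) < 0.001   # O(n^{2.46...}) for k = 3
assert all(e < 3 for e in exponents.values())  # all beat O(n^3)
```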
Finally, we describe an alternative way of implementing Shor’s algorithm in fewer than O(n^3) total gates using our circuits. It takes advantage of the idea of “windowed arithmetic,” which has produced some of the fastest known circuits for Shor’s algorithm [GID19b, GE21]. In brief, instead of multiplying by each classical integer c_i in sequence, a QROM circuit is used to look up the product ∏_i c_i^{b_i} mod N over p values of i, from a classical lookup table of size 2^p indexed by the b_i. The QROM circuit puts this value into a quantum register, at which point a quantum-quantum multiplication circuit (which no longer must be controlled by another qubit) can be applied to compute the product. While such a construction reduces the total number of quantum multiplications which need to be performed, they are now quantum-quantum multiplications instead of controlled classical-quantum multiplications; in the context of our circuits it is not clear a priori whether the extra cost (including the cost of the QROM circuit) is worth it. Also, this construction seems to require at least 3n qubits, plus the ancilla cost of the QROM. Nevertheless, a valuable direction for further research would be to explicitly count the gates of this construction and compare its cost to that of the construction described above.
As discussed in previous chapters, much recent excitement has focused on creating and implementing efficiently-verifiable “proofs of quantumness,” in which an untrusted quantum “prover” can demonstrate its quantum capability to an efficient classical “verifier” [BCM+21, BKV+20, KCV+22, ZKL+21, LZG+22, YZ22, KLV+22]. These protocols are interesting not only for their ability to decisively demonstrate quantum computational advantage, but also because they can be modified to perform more complex quantum cryptographic tasks, such as certifiable generation of quantum random numbers, verifiable remote state preparation, and even the classically-verifiable delegation of arbitrary quantum computations to an untrusted device [BCM+21, BGK+23, GV19, MAH18]. Many of these protocols centrally rely on a cryptographic construction called a trapdoor claw-free function (TCF), and all known TCF constructions involve multiplication of integers in various contexts—making the circuits we develop in this work useful for the implementation of these protocols. In this section we focus on the implementation of the function f(x) = x^2 mod N introduced in Chapter 5, which seems to give the best circuit sizes of the currently known TCFs.
To compute x^2 mod N, we can directly apply our quantum-quantum multiplication circuits, with the inputs x and y being the same register. However, in the specific context of the proof of quantumness introduced in Chapter 5 there are certain further optimizations we can apply which reduce the circuit sizes considerably. With all of these optimizations applied, we find that the proof of quantumness protocols using x^2 mod N can be implemented with dramatically fewer intermediate measurements and qubits than in the circuits introduced in Chapter 5, while maintaining competitive gate counts. We also believe more optimized implementations could reduce the gate counts further. In Table 7.9 we summarize these costs for various sizes of the modulus N. We now describe these optimizations in detail.
Circuit | Qubits | Gates | Intermediate measurements
---|---|---|---
**n=128** (takes seconds on a desktop [44]) | | |
Fast | 272 | 2.3×10^5 | 0
Narrow | 204 | 4.2×10^5 | 0
Previous best | 942 | 1.3×10^5 | 3.4×10^4
**n=400** (takes hours on a desktop [44]) | | |
Fast | 814 | 1.8×10^6 | 0
Narrow | 627 | 3.9×10^6 | 0
Previous best | 3051 | 8.8×10^5 | 2.4×10^5
**n=829** (record for factoring [ZIM20]) | | |
Fast | 1676 | 5.5×10^6 | 0
Narrow | 1268 | 1.3×10^7 | 0
Previous best | 5522 | 3.0×10^6 | 8.0×10^5
**n=1024** (exceeds factoring record) | | |
Fast | 2068 | 7.9×10^6 | 0
Narrow | 1566 | 1.8×10^7 | 0
Previous best | 6801 | 4.3×10^6 | 1.1×10^6
First, we note that the output of the function is measured immediately after it is computed. Thus instead of building a quantum circuit to compute x^2 mod N in the form of an integer, we have the freedom to build a circuit that computes any bitstring that uniquely corresponds to such an integer. Classical post-processing can then be used to compute and return the integer value itself. In the context of our circuits, this suggests an immediate optimization: as discussed in Sec. 7.4.1, we can perform the modulo operation automatically in the phase, if we are willing to accept output in the form of a binary fraction representing the value (x^2 mod N)/N. As long as we have enough bits of precision to uniquely identify an integer from this binary fraction, we may directly measure this output instead of transforming it back to integer form as we did in Shor’s algorithm.
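To see concretely why a truncated binary fraction uniquely identifies the integer, here is a small classical check (a sketch; the only assumption is the precision bound m ≥ n + 1 for an n-bit modulus, which makes the truncation error N·2^{−m} less than 1/2):

```python
# Recover v = x^2 mod N from an m-bit truncation of the binary fraction v/N.
# The truncation error is below 2^-m, so the error after scaling by N is
# below N * 2^-m; with N < 2^n and m >= n + 1 this is under 1/2, and
# rounding recovers v exactly.
def recover(frac_bits: int, m: int, N: int) -> int:
    return round(frac_bits / 2 ** m * N)

N = 899   # an n = 10 bit modulus
m = 11    # n + 1 fractional bits of precision
for x in range(N):
    v = x * x % N
    frac_bits = (v << m) // N  # m-bit truncation of v / N
    assert recover(frac_bits, m, N) == v
print("all values recovered")
```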
Another optimization comes from an observation about the cryptography of trapdoor claw-free functions (TCFs). The classical hardness of the TCF-based “proof of quantumness” protocols comes from the fact that it is hard to find two inputs (x, x′) such that f(x) = f(x′), where f is a given instance of the TCF family. It is easy to show that this implies that it is also hard to find a pair of values (g(x), g(x′)) for any efficiently-invertible function g—because if such a pair could be found, the function could be immediately inverted to yield (x, x′)! Therefore the classical verifier should allow the prover’s input registers to be modified by the TCF circuit, as long as the prover informs the verifier of the transformation that was made so that the verifier can perform the correct checks on the modified values. This is relevant to our recursive circuits for multiplication, because it means we can drop any gates that serve simply to return the input to its original values. As an example, in the k=2 case, we may allow our circuit to end with the input registers set to |x_0⟩|x_0 + x_1⟩ (resp. z), as long as the verifier knows that this is the value that is being manipulated in the later rounds of the protocol. (In practice we can drop the final gates of all levels of recursion, not just the top level, so the final state will look more complicated—but still be trivially computable by the verifier.)
One more benefit comes from the fact that we are performing a square instead of a generic multiplication. A nice feature of the recursive algorithms is that if the top-level operation is to compute a square x^2 z, all of the lower levels of recursion will compute squaring operations as well (of the form (Ax)_i^2 (Az)_i). Therefore when we eventually reach sufficiently small n in the recursion and switch to the schoolbook algorithm, that operation will be performing a square as well. We note that the complexity of the schoolbook algorithm for applying a phase proportional to x^2 z is roughly half the cost of the same operation for xyz, because rotations of the form x_i x_j z_k and x_j x_i z_k are identical and can be combined. Also, phase rotations of the form x_i x_i z_k (in which both x bits are the same) correspond to a singly-controlled phase rotation, instead of a doubly-controlled one. These facts lead to a significant reduction in the cost of performing the controlled-phase rotations.
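These savings can be made concrete by counting the controlled-phase rotations in the schoolbook phase product for a fixed output bit z_k (an illustrative count only; it does not weight the cheaper singly-controlled rotations differently from the doubly-controlled ones):

```python
# Rotations needed to apply a phase proportional to the product, per
# fixed output bit z_k, in the schoolbook phase product on n-bit inputs.
def rotations_generic(n: int) -> int:
    # x*y*z: every ordered pair (i, j) needs a doubly-controlled rotation.
    return n * n

def rotations_square(n: int) -> int:
    # x^2*z: x_i x_j and x_j x_i merge (count unordered pairs i < j),
    # and the n diagonal terms x_i x_i need only singly-controlled rotations.
    return n * (n - 1) // 2 + n

n = 64
print(rotations_generic(n), rotations_square(n))
```

For n = 64 this gives 4096 versus 2080 rotations, roughly the factor-of-two reduction described above.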
If space is very limited, yet another optimization stems from the fact that we are measuring the output immediately as it is produced. Instead of computing all the bits of the output at once, we may instead compute subsets of the bits of the output sequentially. For example, one may let the output register be of length only roughly n/2, and perform a smaller quantum-quantum multiplication to compute the less-significant half of the output and measure it. Then the same process can be repeated for the more-significant half, re-using the same output qubits. This optimization reduces the total qubit count from 2n + O(1) to 1.5n + O(1). The gate count will be somewhat larger due to the fact that the recursive multiplication cannot be applied to the full output at once, but by making the decomposition x^2 = 2^n x_0^2 + 2^{n/2+1} x_0 x_1 + x_1^2 we note that the gate cost can be reduced to that of 6 quantum-quantum multiplications of size n/2 (as opposed to one quantum-quantum multiplication of size n). This type of optimization can be extended to reduce the qubit count as low as n + O(1), but we believe that taking it any further than just described is not likely to be worth the tradeoff in gate count in practice.
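The decomposition underlying this sequential strategy is just the splitting of x into two halves (taking x_0 to be the more-significant half, a labeling assumption); a quick numerical check:

```python
# Verify x^2 = 2^n * x0^2 + 2^(n/2 + 1) * x0 * x1 + x1^2, where x0 and x1
# are the more- and less-significant halves of the n-bit value x. Each of
# the three sub-products is a multiplication of (n/2)-bit values.
n = 16
x = 0xBEEF
x0, x1 = x >> (n // 2), x & ((1 << (n // 2)) - 1)
lhs = x * x
rhs = (1 << n) * x0 * x0 + (1 << (n // 2 + 1)) * x0 * x1 + x1 * x1
assert lhs == rhs
print("decomposition holds:", hex(lhs))
```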
Finally we note that there is another optimization, described when the “computational Bell test” protocol was originally introduced in Chapter 5, in which uncomputation of “garbage bits” can be replaced simply by measurement in the Hadamard basis. For the circuits we introduce here, this optimization is not actually very helpful, because there are only O(1) garbage bits introduced in the circuit (the overflow bits of each (Ax)_i, resp. z)! In fact, we estimate that the experimental cost of performing extra intermediate measurements, instead of the O(1) gates to uncompute these values directly, is almost certainly not worth it. We consider it a serious benefit of the circuits introduced here that they require no extra intermediate measurements at all (except the measurements explicitly required by the protocol).
Developing quantum circuits based on classical algorithms is a subtle pursuit. Peculiarities of the quantum setting, such as the need for reversibility and the limited number of qubits on near- and medium-term quantum devices, preclude the straightforward translation of classical circuits to quantum ones, especially for recursive algorithms. However, quantum mechanics also provides us with new computational techniques that do not have classical analogues. In this work we combine the algorithmic speedups from recursive fast multiplication algorithms, which are widely used in classical computing, with the inherently quantum technique of performing arithmetic in the phase of a quantum state and then using a quantum Fourier transform to recover the result. We show that this yields circuits for performing multiplication of numbers stored in quantum registers in a number of gates that is sub-quadratic in the size of the inputs, for both classical-quantum and quantum-quantum multiplication. This gate count scaling is achieved while using an extremely small number of qubits—as low as just two ancillas in addition to the registers storing the input and output. Our algorithms also offload a considerable portion of the work to classical pre-computation in the circuit compilation phase, significantly reducing the overhead of these asymptotically-faster algorithms. We show that this reduction of overhead makes the algorithms not just a theoretical novelty, but potentially the smallest known circuits for the multiplication of integers at sizes relevant in practice, for computations such as Shor’s algorithm and cryptographic protocols for demonstrating quantum advantage.
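The inherently quantum half of this combination—writing a value into the phases of a register and recovering it with an inverse quantum Fourier transform—can be illustrated with a direct state-vector simulation (a NumPy sketch of the principle, not an optimized circuit):

```python
import numpy as np

# Encode v = x*y in the phases of an m-qubit register: the Fourier-basis
# state |phi(v)> has amplitudes exp(2*pi*i * v * j / 2^m) / sqrt(2^m).
# An inverse QFT then maps it back to the computational basis state |v>.
m = 8
x, y = 13, 17
v = (x * y) % 2**m

j = np.arange(2**m)
state = np.exp(2j * np.pi * v * j / 2**m) / np.sqrt(2**m)

# Inverse QFT as a dense matrix (fine for a small simulation).
iqft = np.exp(-2j * np.pi * np.outer(j, j) / 2**m) / np.sqrt(2**m)
out = iqft @ state

print(int(np.argmax(np.abs(out))))  # 221 = 13*17 mod 2^8
```

In the circuits of this chapter the phases are accumulated by the controlled-phase rotations of the phase product rather than written directly, but the recovery step is exactly this inverse transform.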
There are several potential optimizations to our algorithm that warrant further exploration—although there exist subtle pitfalls which must be considered along the way. One open question is whether the techniques presented here can be applied to classical multiplication algorithms based on the fast Fourier transform, the foremost being the Schönhage-Strassen algorithm [SS71]. This algorithm replaces the matrix-vector products in, for example, A^{-1}(Ax ∘ Az) with discrete Fourier transforms, which can be computed in O(n log n) time! (The discrete Fourier transform here should not be confused with the quantum Fourier transform already present in our algorithms—implementing Schönhage-Strassen on a quantum computer would correspond to implementing a quantum circuit for a classical fast Fourier transform algorithm, which outputs bitstrings representing the classical discrete Fourier transform of the input.) The subtle but seemingly inescapable obstacle preventing this from fitting directly into our framework is that in Schönhage-Strassen, A, x, and z are defined over a ring with some modulus (different from the modulus N we discuss in Sec. 7.4.1), while the vector e against which the results are multiplied during the recomposition step is defined over all integers. Because they are defined over different spaces, the matrix-vector products are no longer associative—so precomputing e^⊺A^{-1} will no longer yield the correct answer! Another outstanding question is whether a quantum circuit for the discrete Fourier transform can be devised that uses only a constant (or even logarithmic) number of ancilla qubits.
Within the algorithms that we do present, we believe there is still much room for optimization. The fact that the phase product portion of our circuits can be computed in sub-linear depth raises the question of whether there is a corresponding sub-linear depth algorithm for the quantum Fourier transform which uses only a linear number of qubits. If such an algorithm were applied to our circuits, the entire multiplication could be performed in sub-linear depth. We also note that we do not explicitly give a sequence of additions and subtractions that yields a sub-linear depth circuit for quantum-quantum multiplication. For the k=4 PhaseTripleProduct, computing the 10 sub-products in three parallel groups would yield a sub-linear depth circuit; the challenge is coming up with a sequence of arithmetic operations that yields the appropriate linear combinations at the right times. We are virtually certain that such a sequence of operations can be devised; we have not explored this extensively due to the orthogonal issue of the lack of a sub-linear quantum Fourier transform. More broadly, in all our algorithms it is worth exploring whether there are better sequences that yield the right linear combinations in fewer addition operations than the sequences we propose. Perhaps there is an automated or algorithmic way to come up with these optimal sequences in the general case.
We also note that there are certain optimizations applicable in the “schoolbook” Fourier multiplication algorithm that don’t translate in an obvious way to our algorithms. [DRA00, BEA03] For example, a classic optimization is to drop very small rotations that have a negligible effect on the state. Unfortunately in our case there seem to be very few rotations small enough to be dropped, because the elements of ê are complicated linear combinations of various powers of two—and thus they usually contain at least one large term. One seemingly clever idea when using our circuits in the proofs of quantumness described in Section 7.4 would be to set the modulus N so that one or more of the elements of ê are close to a multiple of N. Then for those values, ê_i mod N would be very small and whole branches of the recursive tree could be ignored. Unfortunately there is an extremely subtle but critical danger in doing this: the elements of ê have low Hamming weight (number of bits set to 1), and thus so would N. The classical special number field sieve algorithm is much more efficient at factoring numbers with low Hamming weight than numbers of a general form, so this technique would likely destroy the classical hardness of the quantum advantage protocol! [LL93]
Finally, we note that in this work we leave largely unexplored a number of practical considerations, such as how this algorithm interacts with error correction, qubit routing, and decomposition to native gate sets. We believe that these questions are important to explore, because our estimates in the abstract circuit model suggest that our algorithms could meaningfully reduce the resources required for quantum computations with real-world impacts and applications. This is especially true if future works find further optimizations (which we believe are likely to exist). Taking a broader perspective, the techniques for multiplication of quantum integers presented here represent a fundamentally new direction for the implementation of quantum arithmetic, combining two ideas which previously seemed ill-suited to working together. Instead, we find that their combination yields the dramatic benefits of both.