Chapter 7 Quantum fast multiplication with very few qubits

Note: The results in this Chapter have been improved since the publication of this dissertation. The updated results are presented in the following manuscript: [arXiv:2403.18006]

7.1 Introduction

Quantum circuits for coherently performing arithmetic on superpositions of values have been a subject of intense study since the first quantum algorithms for number theory problems were discovered in the mid-1990s. The most famous such algorithms are Shor's algorithms for integer factorization and discrete logarithm, because of their potentially catastrophic effect on digital security. [SHO97] Among other applications, quantum arithmetic has also recently become useful for the implementation of protocols to achieve various quantum cryptographic tasks—from "proofs of quantumness" which allow a quantum device to demonstrate its computational capability to a skeptical classical verifier, to practical applications like secure delegation of computations to an untrusted remote quantum server. [MAH18, GV19, BCM+21, BKV+20, KCV+22, BGK+23]

The standard way of performing multiplication (in both the classical and quantum settings) is via the "schoolbook" algorithm, which uses $O(n^2)$ gates, where $n$ is the size of the input. While the existence of faster classical algorithms for multiplication has been known for over half a century, these algorithms have overheads that make them useful only for multiplication of large values—the GNU multiple-precision arithmetic library uses a threshold of 2176-bit inputs to switch away from the schoolbook method. In the quantum setting, these overheads have generally been made even worse by the reversibility constraints of quantum circuits. Despite these challenges, several works have explored and optimized the implementation of quantum circuits for sub-quadratic time multiplication, largely focusing on the Karatsuba algorithm, which has an asymptotic runtime of roughly $O(n^{\log_2 3}) \approx O(n^{1.58})$. [PRM18, DBC18, GID19a, LAJ+21, KPF06] For many years, a significant challenge came from the fact that these fast algorithms are recursive—and building reversible circuits for them while maintaining the speedup required storing intermediate data, ultimately using a superlinear number of qubits. Notable is a recent work which for the first time reduced the number of qubits for quantum Karatsuba multiplication to linear in the size of the inputs, by performing extra manipulations on the output register that enable a strategy akin to tail recursion, where intermediate results are summed directly into the output and do not need to be uncomputed later. [GID19a] That work provides hope for very efficient quantum integer multiplication, and encourages the search for improvements in the gate count and especially the qubit count (multiplying 2048-bit numbers with that algorithm still requires over 10,000 ancilla qubits).

In this work, we explore a new paradigm for the design of sub-quadratic quantum multiplication circuits. We show that by combining the fundamental ideas behind fast multiplication algorithms with an inherently quantum technique where arithmetic is performed in the phases of a quantum state, it is possible to design circuits for quantum multiplication that simultaneously achieve sub-quadratic asymptotic gate counts and a constant (and small) number of ancilla qubits: just two ancillas for multiplication of a quantum value by a classical one, and five ancillas for the multiplication of two quantum values. Our results yield a family of algorithms, parameterized by an integer $k$, with varying tradeoffs between constant factors and asymptotic complexity; by estimating the gate counts for our circuits we find that for 2048-bit inputs, one particular choice of $k$ is optimal for multiplication of a quantum integer by a classical one, and a different choice is optimal for multiplying two quantum integers. We also find that our algorithms begin to outperform the schoolbook algorithm in gate count quite early—for inputs with fewer than 100 bits in some cases. In terms of concrete comparisons of circuit sizes (which we note are difficult to make fairly in the abstract circuit model) we find that our algorithms dramatically reduce the number of qubits required for sub-quadratic quantum multiplication, while simultaneously achieving competitive gate counts. We also note that our circuits do not require intermediate measurements for measurement-based uncomputation (indeed, there are so few garbage bits produced that there is essentially nothing to uncompute!). In terms of depth, we find that surprisingly, we are limited to linear depth only because we do not know of a way to perform the quantum Fourier transform in sub-linear depth with such limited space—even if we allow the use of ancillas and for the QFT to be approximate.
On the other hand, the core "phase arithmetic" portion of our algorithm can be performed in sub-linear depth at the cost of additional ancillas.

Directly using our circuits as a subroutine in Shor's algorithm yields circuits that require a number of qubits only linear in $n$ (with a small constant factor), where $n$ is the length of the integer to be factored—while obtaining a sub-cubic asymptotic gate count with practical constant factors. This is compared to the standard $O(n^3)$ gate complexity of Shor's algorithm.1 We also explore the circuits' application to the efficiently-verifiable cryptographic proof of quantumness introduced in Chapter 5; we see dramatic reductions in the qubit count along with competitive gate counts, when compared to previous implementation proposals. The new circuits also do not require the use of measurement-based uncomputation.

1 It has been known for many years that it is theoretically possible to reduce the asymptotic complexity of Shor's algorithm below $O(n^3)$ [BCD+96], but previously the constant factors and number of qubits required seemed to make doing so impractical for reasonably sized inputs. See Table 1 of [GE21] for a recent comparison of proposals for the implementation of Shor's algorithm.

Throughout this manuscript we work in the abstract circuit model of quantum computation; a critical direction for future research is how to optimize our constructions to include considerations such as error correction, qubit routing, and noise.

7.2 Background

7.2.1 Quantum multiplication

In this work we will focus on implementing two related quantum arithmetic operations. The first is quantum-classical multiplication: computing the product of (a superposition of) integers stored in a quantum register with a classical integer $a$. We denote the unitary corresponding to this operation as $U_{qc}(a)$ (the subscript denotes "quantum" $\times$ "classical", and the unitary is parameterized by the classical value $a$). It implements the following operation on product states (extended by linearity to superpositions):

$U_{qc}(a)\,|x\rangle|z\rangle = |x\rangle|z + ax\rangle$   (7.1)

We note that this unitary (controlled off of another qubit) is the fundamental operation used in implementations of Shor’s algorithm (see Section 7.4.3).

The second operation we study is a quantum-quantum multiplication: finding the product of an integer (or superposition thereof) in one quantum register with another integer (or superposition) in a second register. We denote this as $U_{qq}$; it implements the transformation

$U_{qq}\,|x\rangle|y\rangle|z\rangle = |x\rangle|y\rangle|z + xy\rangle$   (7.2)

In Sec. 7.4.4 we also discuss the closely related unitary for which the $x$ and $y$ registers of the previous equation are the same register, thus implementing a squaring operation:

$|x\rangle|z\rangle \mapsto |x\rangle|z + x^2\rangle$   (7.3)

This squaring operation is fundamental to implementing "proofs of quantumness" based on the cryptographic function $x^2 \bmod N$, as discussed in Chapter 5.

7.2.2 Fast multiplication algorithms

We now discuss classical multiplication algorithms. The most straightforward algorithm for multiplying two numbers is known as the “schoolbook” method, because it is the one taught to young students when they first learn to multiply. The numbers are simply decomposed into individual digits and the individual products of those digits summed, scaled by powers of the base. In binary this can be expressed as

$x \cdot y = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} x_i\, y_j\, 2^{i+j}$   (7.4)

(presumably most elementary school teachers use base 10). This algorithm's time complexity is $O(n^2)$; it is used widely for multiplication of small- to moderate-sized integers.
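As a concrete reference point, the double sum of Eq. 7.4 can be transcribed directly. The following is a pedagogical sketch in Python, not an optimized routine (the function name is ours):

```python
# Schoolbook binary multiplication as in Eq. 7.4: decompose x and y into
# bits and sum the n^2 partial products x_i * y_j * 2^(i+j).

def schoolbook_multiply(x: int, y: int, n: int) -> int:
    """Multiply two n-bit integers via the O(n^2) schoolbook method."""
    x_bits = [(x >> i) & 1 for i in range(n)]
    y_bits = [(y >> j) & 1 for j in range(n)]
    total = 0
    for i in range(n):
        for j in range(n):
            # each term is x_i * y_j scaled by the base power 2^(i+j)
            total += (x_bits[i] * y_bits[j]) << (i + j)
    return total
```

The doubly-nested loop makes the $O(n^2)$ cost explicit: one partial product per pair of input bits.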

About half a century ago it was shown that it is possible to classically multiply integers in sub-quadratic time—asymptotically outperforming the schoolbook algorithm. We begin by describing the first sub-quadratic multiplication algorithm, called the Karatsuba algorithm, and then show that it is a special case of a broader class of algorithms called Toom-Cook. For an extended pedagogical exposition of the range of fast multiplication algorithms that have been discovered, we refer the reader to Knuth. [KNU98]

Karatsuba multiplication

Consider two $n$-bit integers $x$ and $y$ to be multiplied together. Divide the bits of each into two pieces: $x = 2^{n/2} x_H + x_L$ (and the same for $y$), where $x_H$ holds the bits of the "more significant" half and $x_L$ the "less significant" one. Written in this way, the product can be expressed

$x y = 2^n x_H y_H + 2^{n/2} \left( x_H y_L + x_L y_H \right) + x_L y_L$   (7.5)

This formulation is effectively the schoolbook algorithm in base $2^{n/2}$. We've replaced one product of size $n$ with four products of size $n/2$, reflecting the quadratic scaling of the schoolbook algorithm.

The key observation behind the Karatsuba algorithm is that

$x_H y_L + x_L y_H = (x_H + x_L)(y_H + y_L) - x_H y_H - x_L y_L$   (7.6)

—and we already need to find the products $x_H y_H$ and $x_L y_L$ anyway! Making this substitution we have

$x y = 2^n x_H y_H + 2^{n/2} \left[ (x_H + x_L)(y_H + y_L) - x_H y_H - x_L y_L \right] + x_L y_L$   (7.7)

That is, the multiplication of size $n$ can be accomplished with only three multiplications of size $n/2$, and a few extra additions! The next key insight is to apply this fact recursively, using it again to compute each of the sub-multiplications of size $n/2$, and then again to compute the multiplications used there, et cetera. Complexity analysis shows that when applied recursively, Karatsuba multiplication computes the product in only $O(n^{\log_2 3}) \approx O(n^{1.58})$ operations, outperforming the schoolbook algorithm.
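The recursion just described can be sketched in a few lines of Python (an illustrative implementation; the base-case cutoff is an arbitrary choice of ours):

```python
# Recursive Karatsuba multiplication following Eq. 7.7: three half-size
# products per level, giving O(n^{log2 3}) ~ O(n^1.58) bit operations.

def karatsuba(x: int, y: int) -> int:
    if x < 16 or y < 16:          # small inputs: multiply directly
        return x * y
    n = max(x.bit_length(), y.bit_length())
    half = n // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask
    hi = karatsuba(x_hi, y_hi)                            # x_H * y_H
    lo = karatsuba(x_lo, y_lo)                            # x_L * y_L
    # cross terms via Eq. 7.6: one extra multiplication instead of two
    mid = karatsuba(x_hi + x_lo, y_hi + y_lo) - hi - lo
    return (hi << (2 * half)) + (mid << half) + lo
```

Only three recursive calls appear per level, which is the entire source of the speedup.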

Toom-Cook multiplication

The Toom-Cook algorithm uses the same intuition as Karatsuba, but splits each integer into $k$ parts instead of just two. It also provides a broader intuition for why the Karatsuba algorithm works, by expressing the problem in terms of polynomial multiplication.

Consider an $n$-digit integer $x$ divided into $k$ chunks of size $n/k$, which we denote $x_{k-1}, \ldots, x_1, x_0$, with $x_{k-1}$ being the most significant:

$x = \sum_{i=0}^{k-1} x_i\, 2^{i n / k}$   (7.8)

This can be reinterpreted as a polynomial

$p_x(t) = \sum_{i=0}^{k-1} x_i\, t^i$   (7.9)

evaluated at $t = 2^{n/k}$. From this perspective, we can recast a product of integers $x y$ as a product of polynomials $p_x(t)\, p_y(t)$, evaluated at $t = 2^{n/k}$. The benefit of expressing the product this way is that for any point $t_j$, $p_{xy}(t_j) = p_x(t_j)\, p_y(t_j)$, and we can choose values of $t_j$ such that $p_x(t_j)$ and $p_y(t_j)$ are only roughly $n/k$ bits long and thus the product is faster to compute. This suggests the following plan: compute the value of $p_{xy}$ at several points, use those points to reconstruct the coefficients of the polynomial $p_{xy}$, and finally evaluate $p_{xy}$ at $t = 2^{n/k}$, yielding the integer product $x y$. $p_x$ and $p_y$ are polynomials of degree $k-1$, so their product is of degree $2k-2$, thus we must evaluate it at $2k-1$ points to unambiguously reconstruct the coefficients.

It may be helpful at this point to recast the Karatsuba algorithm in this light. Karatsuba corresponds to the case of $k = 2$; the integer $x$ is converted into the polynomial $p_x(t) = x_H t + x_L$ (and similarly for $y$). Here $2k - 1 = 3$, and so we evaluate the product at the three points $t \in \{0, 1, \infty\}$:

$p_{xy}(0) = x_L y_L$   (7.10)
$p_{xy}(1) = (x_H + x_L)(y_H + y_L)$   (7.11)
$p_{xy}(\infty) = x_H y_H$   (7.12)

(where the value of a polynomial $p$ at infinity is the limit of $p(t)/t^{\deg p}$ as $t \to \infty$, i.e. its leading coefficient). The coefficients of the polynomial $p_{xy}(t) = c_2 t^2 + c_1 t + c_0$ can be expressed in terms of its value at these three points as follows:

$c_0 = p_{xy}(0), \qquad c_1 = p_{xy}(1) - p_{xy}(0) - p_{xy}(\infty), \qquad c_2 = p_{xy}(\infty)$   (7.13)

Substituting Eqns. 7.10-7.12 into Eq. 7.13 and evaluating $p_{xy}$ at $t = 2^{n/2}$ yields Eq. 7.7, the expression we originally had for computing $xy$ in Karatsuba multiplication!
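The evaluate-interpolate-recompose view can be checked numerically; the following sketch (names ours) reproduces one level of Karatsuba exactly as Eqs. 7.10-7.13 describe:

```python
# Karatsuba as Toom-2: evaluate p_x, p_y at t in {0, 1, infinity},
# multiply pointwise, interpolate the quadratic p_xy, and recompose
# by evaluating at t = 2^(n/2).

def toom2_multiply(x: int, y: int, n: int) -> int:
    half = n // 2
    x_lo, x_hi = x & ((1 << half) - 1), x >> half
    y_lo, y_hi = y & ((1 << half) - 1), y >> half
    # Evaluate the product polynomial at the three points (Eqs. 7.10-7.12)
    p0   = x_lo * y_lo                     # p_xy(0)
    p1   = (x_lo + x_hi) * (y_lo + y_hi)   # p_xy(1)
    pinf = x_hi * y_hi                     # p_xy(inf): leading coefficient
    # Interpolate p_xy(t) = c2 t^2 + c1 t + c0 (Eq. 7.13)
    c0, c2 = p0, pinf
    c1 = p1 - p0 - pinf
    # Recompose: evaluate at t = 2^half, recovering Eq. 7.7
    return (c2 << (2 * half)) + (c1 << half) + c0
```

Each pointwise product would be computed recursively in the full algorithm; here they are ordinary integer products for clarity.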

The benefit of framing our multiplication in this way is that it becomes very straightforward to set $k$ to values larger than 2. The Toom-Cook algorithm can be viewed as a five-step process, which is conveniently expressed in terms of linear algebra: [BOD07]

1. Split

We divide the bits of $x$ and $y$ into $k$ chunks each, which serve as the coefficients of a polynomial as in Eq. 7.9 above. We can represent this polynomial simply as a vector of the coefficients:

$\vec{x} = (x_0, x_1, \ldots, x_{k-1})^T$   (7.14)

In this representation, evaluating the polynomial at a point $t_j$ corresponds to computing the inner product with a vector of powers of $t_j$:

$\vec{t}_j = \left(1, t_j, t_j^2, \ldots, t_j^{k-1}\right), \qquad p_x(t_j) = \vec{t}_j \cdot \vec{x}$   (7.15)
2. Evaluate

The next step is to evaluate those polynomials at several points. We define a length-$(2k-1)$ vector $\vec{X}$ which contains the evaluation of our polynomial at each of the $2k-1$ points $t_j$. With this we write $\vec{X} = A \vec{x}$, where each row of the matrix $A$ is $\vec{t}_j$ for one of the points at which we desire to evaluate the polynomial. Later it will be helpful to write $A$ as a $(2k-1) \times (2k-1)$ matrix; we'll allow multiplication by the length-$k$ vector $\vec{x}$ by simply appending zero entries to $\vec{x}$ for the higher-order terms. With that, $A$ has the following structure:

$A = \begin{pmatrix} 1 & t_0 & t_0^2 & \cdots & t_0^{2k-2} \\ 1 & t_1 & t_1^2 & \cdots & t_1^{2k-2} \\ \vdots & & & & \vdots \\ 1 & t_{2k-2} & t_{2k-2}^2 & \cdots & t_{2k-2}^{2k-2} \end{pmatrix}$   (7.16)

We perform that matrix-vector multiplication for both $\vec{x}$ and $\vec{y}$.

3. Multiply

Now that $x$ and $y$ are expressed as polynomials evaluated at several points, we reach the payoff: we multiply the polynomials by simply performing pointwise multiplication of the vectors:

$\vec{Z} = \vec{X} \odot \vec{Y}$   (7.17)

where $\vec{Z}$ is the result polynomial $p_{xy}$ evaluated at the points $t_j$, and $\odot$ denotes the pointwise product. Note the benefit here: we are performing $2k-1$ pointwise products of size $n/k$, instead of the $k^2$ products of that size that would be required for schoolbook multiplication. Since $2k - 1 < k^2$ this is a very dramatic improvement! Furthermore we will recursively apply the Toom-Cook algorithm to each of the pointwise multiplications of the vector elements.

4. Interpolate

Now that we have the vector $\vec{Z}$ representing the product polynomial evaluated at each of the points $t_j$, we need to convert it to a vector of polynomial coefficients $\vec{z}$. We know that $\vec{Z} = A \vec{z}$; by appropriate choice of the $t_j$ we can ensure that $A$ is invertible. Then we can simply compute the matrix inverse $A^{-1}$, resulting in a straightforward computation $\vec{z} = A^{-1} \vec{Z}$.

5. Recomposition

Finally, with the product polynomial expressed in terms of its coefficients $\vec{z}$, the integer product can be recovered by evaluating that polynomial at $t = 2^{n/k}$. As before this can be written as the inner product $x y = \vec{w} \cdot \vec{z}$, where

$\vec{w} = \left(1, 2^{n/k}, 2^{2n/k}, \ldots, 2^{(2k-2)n/k}\right)$   (7.18)

With the algorithm laid out, let us examine the asymptotic complexity of Toom-Cook for any $k$. The algorithm performs $2k-1$ multiplications of size $n/k$ at the top level, and then recurses. Straightforward complexity analysis yields a runtime of $O(n^{\log_k(2k-1)})$, which can be made arbitrarily close to linear as $k$ is increased. Unfortunately, in practice the constant factors grow rapidly as $k$ is increased, due to the cost of performing the matrix-vector products $A\vec{x}$, $A\vec{y}$, and $A^{-1}\vec{Z}$. In practice, the value of $k$ is varied depending on the size of the integers to be multiplied, but regardless of the size of input it is only practicable for small values of $k$ (certainly less than 10). [165]
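To make the five steps concrete, here is a single level of Toom-3 in Python. This is an illustrative sketch of ours, with an assumed point set $\{0, 1, -1, 2, \infty\}$ and exact rational linear algebra for the interpolation; a real implementation would hard-code the interpolation formulas:

```python
# One level of Toom-3 (k = 3): split into three chunks, evaluate at five
# points, multiply pointwise, interpolate by solving A z = Z exactly over
# the rationals, and recompose at t = 2^b.
from fractions import Fraction

def gauss_solve(A, rhs):
    """Exact Gauss-Jordan elimination over the rationals (5x5 system)."""
    m = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(5):
        piv = next(r for r in range(col, 5) if m[r][col] != 0)
        m[col], m[piv] = m[piv], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(5):
            if r != col and m[r][col] != 0:
                m[r] = [a - m[r][col] * b for a, b in zip(m[r], m[col])]
    return [m[r][5] for r in range(5)]

def toom3_multiply(x: int, y: int, n: int) -> int:
    b = n // 3                                     # chunk size (step 1: split)
    chunks = lambda v: [(v >> (b * i)) & ((1 << b) - 1) for i in range(2)] + [v >> (2 * b)]
    xs, ys = chunks(x), chunks(y)
    px = lambda c, t: c[0] + c[1] * t + c[2] * t * t
    pts = [0, 1, -1, 2]
    # Steps 2-3: evaluate and multiply pointwise; infinity gives the
    # product of leading coefficients.
    Z = [px(xs, t) * px(ys, t) for t in pts] + [xs[2] * ys[2]]
    # Step 4: interpolate the degree-4 product polynomial c0..c4.
    A = [[Fraction(t) ** i for i in range(5)] for t in pts]
    A.append([Fraction(int(i == 4)) for i in range(5)])    # row for t = infinity
    z = gauss_solve(A, [Fraction(v) for v in Z])
    # Step 5: recompose by evaluating at t = 2^b.
    return sum(int(c) << (b * i) for i, c in enumerate(z))
```

Five pointwise products replace the nine that base-$2^{n/3}$ schoolbook would need, matching the $2k-1$ versus $k^2$ count above.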

7.2.3 Fourier arithmetic

Many proposals for the implementation of quantum arithmetic mirror classical circuits, with various clever optimizations to account for the fact that quantum operations must in general be reversible. [VBE96, ZAL98, CDK+04, DKR+06, HRS17, GID18, GID19a] However, one early work by Draper stands out as an example of a quantum arithmetic circuit that truly has no classical analogue. [DRA00] That work proposes to use the quantum Fourier transform (QFT) to compute addition, and to use that addition circuit as a building block for implementing multiplication.

The action of the quantum Fourier transform, with modulus $2^n$, on a product state of $n$ qubits can be written as follows (and by linearity extended to a superposition of inputs):

$\mathrm{QFT}\,|z\rangle = \frac{1}{\sqrt{2^n}} \sum_{k=0}^{2^n - 1} e^{2\pi i z k / 2^n}\, |k\rangle$   (7.19)

(and the inverse quantum Fourier transform is the reverse of this operation). Draper observed that one can perform the addition $z \to z + a$ in the Fourier basis simply by applying a series of phase rotations to the qubits of the register holding $\mathrm{QFT}\,|z\rangle$:

$\frac{1}{\sqrt{2^n}} \sum_{k=0}^{2^n - 1} e^{2\pi i a k / 2^n}\, e^{2\pi i z k / 2^n}\, |k\rangle = \frac{1}{\sqrt{2^n}} \sum_{k=0}^{2^n - 1} e^{2\pi i (z + a) k / 2^n}\, |k\rangle$   (7.20)

The rightmost expression is just $\mathrm{QFT}\,|z + a\rangle$ (with addition modulo $2^n$); each basis state $|k\rangle$ has simply acquired an extra phase $e^{2\pi i a k / 2^n}$, which factors into one single-qubit rotation per qubit of the register.
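Draper's observation is easy to verify with a dense simulation (a numerical sketch using NumPy; function names ours):

```python
# Sanity check of Fourier addition (Eqs. 7.19-7.20): QFT the register,
# apply only phase rotations proportional to the classical addend a,
# inverse QFT, and read out z + a (mod 2^n).
import numpy as np

def qft_matrix(n: int) -> np.ndarray:
    N = 1 << n
    k = np.arange(N)
    return np.exp(2j * np.pi * np.outer(k, k) / N) / np.sqrt(N)

def fourier_add(z: int, a: int, n: int) -> int:
    N = 1 << n
    state = np.zeros(N, dtype=complex)
    state[z] = 1.0                        # computational-basis input |z>
    state = qft_matrix(n) @ state
    # Phase shift e^{2 pi i a k / N} on each basis state |k>; this factors
    # into one rotation per qubit, since k = sum_j 2^j k_j.
    state *= np.exp(2j * np.pi * a * np.arange(N) / N)
    state = qft_matrix(n).conj().T @ state  # inverse QFT
    return int(np.argmax(np.abs(state)))    # |z + a mod N> with certainty
```

The final state is again a computational basis state, so the readout is deterministic.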

This construction can be readily extended to implement the multiplication unitaries described in Sec. 7.2.1. We simply add the product ($a x$ in the classical-quantum case, $x y$ in the quantum-quantum case) to the output register using Fourier addition. Denoting the unitaries that apply the appropriate phase in the Fourier basis with a tilde:

$U_{qc}(a) = \mathrm{QFT}^{-1}\, \tilde{U}_{qc}(a)\, \mathrm{QFT}, \qquad U_{qq} = \mathrm{QFT}^{-1}\, \tilde{U}_{qq}\, \mathrm{QFT}$   (7.21)

we can write down $\tilde{U}_{qc}(a)$ and $\tilde{U}_{qq}$ (in terms of their action on product states):

$\tilde{U}_{qc}(a)\, |x\rangle|k\rangle = e^{2\pi i\, a x k / 2^n}\, |x\rangle|k\rangle$   (7.22)
$\tilde{U}_{qq}\, |x\rangle|y\rangle|k\rangle = e^{2\pi i\, x y k / 2^n}\, |x\rangle|y\rangle|k\rangle$   (7.23)

Note that here, $k$ indexes the basis states resulting from applying the quantum Fourier transform to the output register; $k$ is not the value $z$ itself that was already stored in the output register. We may decompose the quantum values into bits as $x = \sum_i 2^i x_i$ (and respectively for $y$ and $k$), yielding:

$\tilde{U}_{qc}(a)\, |x\rangle|k\rangle = \prod_{i,j} e^{2\pi i\, a\, 2^{i+j} x_i k_j / 2^n}\, |x\rangle|k\rangle$   (7.24)
$\tilde{U}_{qq}\, |x\rangle|y\rangle|k\rangle = \prod_{i,j,l} e^{2\pi i\, 2^{i+j+l} x_i y_j k_l / 2^n}\, |x\rangle|y\rangle|k\rangle$   (7.25)

Since we have decomposed to individual bits, the product $x_i k_j$ (or $x_i y_j k_l$ in the quantum-quantum case) is simply 1 if all relevant bits are 1, or 0 otherwise. This means these individual phase factors can be implemented via (doubly-)controlled-phase rotations of phase

$\theta_{ij} = 2\pi\, a\, 2^{i+j} / 2^n$   (7.26)

between qubits $x_i$ and $k_j$ in the classical-quantum case, and

$\theta_{ijl} = 2\pi\, 2^{i+j+l} / 2^n$   (7.27)

between qubits $x_i$, $y_j$, and $k_l$ in the quantum-quantum case. Importantly, these phases depend only on classical values, and can be computed during circuit compilation on a classical computer.

For the classical-quantum multiplication case, the total number of controlled-phase gates applied to implement multiplication is $O(n^2)$, which matches the complexity achieved by converting the classical "schoolbook" algorithm for integer multiplication into a quantum circuit (but is worse than the sub-quadratic multiplications described earlier). This scaling arises in an interesting way: we have accrued an extra factor of $n$ due to the fact that a rotation proportional to the product $a x$ must be performed for each of the $n$ bits of the Fourier register. However, we have simultaneously managed to get rid of a factor of $n$, due to the fact that $a$ is classical, so we can coalesce the rotations corresponding to each bit of $a$ and do them as one gate. Unfortunately, for quantum-quantum multiplication, it is not possible to coalesce the rotations for each bit of $y$ in the same way, because $y$ is stored in qubits. Thus we need $O(n^3)$ doubly-controlled-phase gates, which is considerably worse than quantum versions of schoolbook multiplication. Yet there are still benefits to this strategy. First, the constants on the scaling are very good: implementing $\tilde{U}_{qc}(a)$ or $\tilde{U}_{qq}$ requires exactly $n^2$ or $n^3$ phase rotations respectively (actually, even fewer since some of the rotations are so small they can be dropped). Second, it uses zero ancilla qubits. This can be contrasted with traditional adder circuits, which need extra work space to perform carries. Finally, it can be massively parallelized, since all the phase rotations commute, yielding better circuit depth.
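The coalescing argument amounts to the integer identity $\sum_{i,j} a\, 2^{i+j} x_i k_j = a x k$, which a quick classical check confirms (Python, names ours):

```python
# Check of the bit decomposition in Eq. 7.24: the total angle accumulated
# by the n^2 controlled rotations of Eq. 7.26 is the single "coalesced"
# phase proportional to a*x*k. We track the integer numerator (the factor
# multiplying 2*pi/2^n) to keep the check exact.

def coalesced_angle_numerator(a: int, x: int, k: int, n: int) -> int:
    total = 0
    for i in range(n):          # bits of x
        for j in range(n):      # bits of the Fourier register k
            if (x >> i) & 1 and (k >> j) & 1:
                total += a << (i + j)   # rotation numerator a * 2^(i+j)
    return total                 # equals a * x * k
```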

In this work, we call this type of quantum circuit for multiplication “Fourier multiplication.”

7.3 Results

Our main result is that it is possible to decompose the phases used in a Fourier multiplication via Toom-Cook-like transformations, ultimately yielding circuits with asymptotically fewer gates than the schoolbook constructions for both classical-quantum and quantum-quantum multiplication. We first give the intuition behind the transformation, explaining it for $k = 2$ for both classical-quantum and quantum-quantum multiplication. Then, we describe the case of arbitrary $k$ in Section 7.3.3. Next, we analyze the gate counts and ancilla usage of these algorithms. Finally we describe how small modifications allow us to optimize the ancilla usage and depth, achieving the results stated in the introduction.

Throughout this work we will focus specifically on the implementation of an operation we call the “PhaseProduct:”

$|x\rangle|y\rangle \;\mapsto\; e^{i \phi x y}\, |x\rangle|y\rangle$   (7.28)

and a related operation which we call the “PhaseTripleProduct:”

$|x\rangle|y\rangle|z\rangle \;\mapsto\; e^{i \phi x y z}\, |x\rangle|y\rangle|z\rangle$   (7.29)

Note that these operations are generalizations of $\tilde{U}_{qc}(a)$ and $\tilde{U}_{qq}$ respectively—setting $\phi = 2\pi a / 2^n$ in the former and $\phi = 2\pi / 2^n$ in the latter recovers Eqs. 7.22 and 7.23. The generalization will be helpful when we apply the transformations recursively.

7.3.1 $k = 2$ classical-quantum multiplication

The fundamental idea is to decompose the phase $e^{i \phi x y}$ of Eq. 7.28 by using the Karatsuba decomposition (and later via Toom-Cook for general $k$). A first attempt at directly substituting Eq. 7.7 into the phase of Eq. 7.28 and decomposing into a product of phases yields

$e^{i \phi x y} = e^{i \phi 2^n x_H y_H}\; e^{i \phi 2^{n/2} \left[ (x_H + x_L)(y_H + y_L) - x_H y_H - x_L y_L \right]}\; e^{i \phi x_L y_L}$

This decomposition is not ideal—we still need to compute the value $(x_H + x_L)(y_H + y_L) - x_H y_H - x_L y_L$, which requires storing and reusing the partial products $x_H y_H$ and $x_L y_L$. Instead, we note that the prefactors on each phase are just some classical value, which is computed at circuit compilation time. We should endeavor to move as much of the arithmetic into these prefactors as possible, since they are set before the circuit runs and any complexity there does not affect the quantum circuit cost. Thus rearranging Eq. 7.7 to group by partial products, rather than powers of two, yields

$e^{i \phi x y} = e^{i \phi \left(2^n - 2^{n/2}\right) x_H y_H}\; e^{i \phi 2^{n/2} (x_H + x_L)(y_H + y_L)}\; e^{i \phi \left(1 - 2^{n/2}\right) x_L y_L}$   (7.30)

The implications of this expression are drastic. Each phase rotation on the right hand side consists of an entirely classical phase factor, multiplied by a product of two quantum integers. This is simply another application of PhaseProduct, but in which the integers are half as long! We can recursively apply the decomposition again to each of the sub-products, until the input integers are sufficiently small that we perform the phase products directly via individual two-qubit controlled-phase rotations.
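The recursive bookkeeping can be sanity-checked classically. The sketch below (names ours) accumulates the classical prefactors exactly as the recursion of Eq. 7.30 dictates, and the total phase angle comes out to exactly $\phi x y$; we track the integer multiplier of $\phi$ to avoid floating-point error:

```python
# Recursive check of Eq. 7.30: each level splits the product into three
# sub-products with classically-computed prefactors. The function returns
# the integer c such that the total applied phase is c * phi.

def phase_product(x: int, y: int, n: int, pref: int = 1) -> int:
    if n <= 2:
        return pref * x * y                 # base case: direct rotations
    h = n // 2
    x_hi, x_lo = x >> h, x & ((1 << h) - 1)
    y_hi, y_lo = y >> h, y & ((1 << h) - 1)
    # the three prefactors of Eq. 7.30, with 2^{n} -> 2^{2h}
    total  = phase_product(x_hi, y_hi, h, pref * ((1 << 2 * h) - (1 << h)))
    total += phase_product(x_lo + x_hi, y_lo + y_hi, h + 1, pref * (1 << h))
    total += phase_product(x_lo, y_lo, h, pref * (1 - (1 << h)))
    return total
```

No intermediate products are ever stored; only the register contents (here, the sum $x_L + x_H$) change between recursive calls, mirroring the in-place structure of the quantum circuit.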

Importantly, breaking the phase down this way does not require computing and storing any extra values in ancilla registers—the computation can be performed in-place. We do need the values $x_H + x_L$ and $y_H + y_L$, which we compute via regular quantum addition circuits. Importantly however, since addition is reversible we can compute these values in-place (up to a single "overflow" qubit), temporarily overwriting $x_L$ and $y_L$ respectively.

In Algorithm 7.1, we explicitly record the steps of this algorithm, and then we move on to the case of quantum-quantum multiplication.

Input :  Quantum state $|x\rangle|y\rangle$
 Classical phase factor $\phi$
Output :  Quantum state $e^{i \phi x y}\, |x\rangle|y\rangle$
Divide the $x$ and $y$ registers in half, so $x = 2^{n/2} x_H + x_L$ and $y = 2^{n/2} y_H + y_L$ (here using little-endian bit order).
1 Apply PhaseProduct($x_L$, $y_L$, $\phi (1 - 2^{n/2})$) via recursive call
2 Apply PhaseProduct($x_H$, $y_H$, $\phi (2^n - 2^{n/2})$) via recursive call
3 Apply quantum addition circuit from $x_H$ to $x_L$, yielding the state $|x_H + x_L\rangle$ in the $x_L$ register (and resp. for the $y$ register)
4 Apply PhaseProduct($x_H + x_L$, $y_H + y_L$, $\phi 2^{n/2}$) via recursive call
5 Apply quantum subtraction circuit from $x_H$ to $x_L$, restoring it to $|x_L\rangle$ (and resp. for the $y$ register)
Algorithm 7.1 PhaseProduct($x$, $y$, $\phi$) for $k = 2$

7.3.2 $k = 2$ quantum-quantum multiplication

For quantum-quantum multiplication we desire to implement the phase shift $e^{i \phi x y z}$ from Eq. 7.29, where $x$, $y$, and $z$ are all stored in quantum registers. In a less constrained setting (such as in classical computing or "digital" quantum arithmetic), the product of three integers would be computed by taking the product of two of the integers and then multiplying the third integer by the result—so no special algorithm for the product of three integers is required. However, we are decomposing a phase, and this is only possible over sums, not products. Thus we introduce modified versions of the Karatsuba algorithm, which compute the product of three integers in one step.

Fortunately, viewing Karatsuba as pointwise multiplication of polynomials (as discussed in Sec. 7.2.2) gives a fairly straightforward way to do this. Denote the integer product we desire as $w = x y z$. We will again consider the factors $x$, $y$, and $z$ as polynomials and evaluate them at a set of points $t_j$. This time we will compute the pointwise triple product $p_w(t_j) = p_x(t_j)\, p_y(t_j)\, p_z(t_j)$, and as before interpolate the polynomial and evaluate it at the appropriate $t$ to recover the desired integer result $w$. The main difference from the two-integer case is that the degree of the product polynomial will be higher (3 in this case), so we will need to use more points $t_j$. Regular (two-integer) Karatsuba as described in Sec. 7.2.2 uses the points $\{0, 1, \infty\}$; fortunately, it is easy to include the point $t = -1$, corresponding to the linear combination $x_L - x_H$ (resp. $y_L - y_H$ and $z_L - z_H$), for a total of 4 points which will uniquely specify our degree-3 polynomial. With that, we can write down a modified Karatsuba decomposition which applies to triple products (here grouping by partial products, in view of the discussion in the classical-quantum case):

$x y z = \left(1 - 2^n\right) x_L y_L z_L + \frac{2^n + 2^{n/2}}{2} (x_L + x_H)(y_L + y_H)(z_L + z_H) + \frac{2^n - 2^{n/2}}{2} (x_L - x_H)(y_L - y_H)(z_L - z_H) + \left(2^{3n/2} - 2^{n/2}\right) x_H y_H z_H$   (7.31)

As before, the factors in each of the sub-multiplications are half the size of the original ones (up to one bit of overflow)!
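The four-point decomposition can be verified directly (a classical sketch in Python, names ours; the divisions by two are exact because $p(1) \pm p(-1)$ pairs terms of equal parity):

```python
# Check of the triple-product decomposition (Eq. 7.31), evaluating at
# {0, 1, -1, infinity} with B = 2^(n/2) and interpolating the cubic.

def triple_product_karatsuba(x: int, y: int, z: int, n: int) -> int:
    h = n // 2
    B = 1 << h
    lo = lambda v: v & (B - 1)
    hi = lambda v: v >> h
    p0   = lo(x) * lo(y) * lo(z)                                 # t = 0
    p1   = (lo(x) + hi(x)) * (lo(y) + hi(y)) * (lo(z) + hi(z))   # t = 1
    pm1  = (lo(x) - hi(x)) * (lo(y) - hi(y)) * (lo(z) - hi(z))   # t = -1
    pinf = hi(x) * hi(y) * hi(z)                                 # t = infinity
    # Interpolate c0 + c1 t + c2 t^2 + c3 t^3 and evaluate at B
    c0, c3 = p0, pinf
    c2 = (p1 + pm1) // 2 - c0
    c1 = (p1 - pm1) // 2 - c3
    return c0 + c1 * B + c2 * B * B + c3 * B ** 3
```

Collecting the coefficients of $p(0)$, $p(1)$, $p(-1)$, and $p(\infty)$ in the last line reproduces exactly the classical prefactors of Eq. 7.31.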

Using this to decompose the phase $e^{i \phi x y z}$, we end up with an algorithm that is structurally identical to the recursive decomposition of $e^{i \phi x y}$ in the classical-quantum case. In Algorithm 7.2, we describe how to perform this decomposition explicitly.

Input :  Quantum state $|x\rangle|y\rangle|z\rangle$
 Classical phase factor $\phi$
Output :  Quantum state $e^{i \phi x y z}\, |x\rangle|y\rangle|z\rangle$
Divide the $x$, $y$, and $z$ registers in half, so $x = 2^{n/2} x_H + x_L$, and similarly for $y$ and $z$ (here using little-endian bit order).
1 Apply PhaseTripleProduct($x_L$, $y_L$, $z_L$, $\phi (1 - 2^n)$) via recursive call
2 Apply PhaseTripleProduct($x_H$, $y_H$, $z_H$, $\phi (2^{3n/2} - 2^{n/2})$) via recursive call
3 Apply quantum addition circuit from $x_H$ to $x_L$, yielding the state $|x_L + x_H\rangle$ in the $x_L$ register (and resp. for the $y$ and $z$ registers)
4 Apply PhaseTripleProduct($x_L + x_H$, $y_L + y_H$, $z_L + z_H$, $\phi (2^n + 2^{n/2})/2$) via recursive call
5 Apply quantum circuit to subtract $2 x_H$ from $x_L + x_H$, yielding the state $|x_L - x_H\rangle$ in the $x_L$ register (and resp. for the $y$ and $z$ registers)
6 Apply PhaseTripleProduct($x_L - x_H$, $y_L - y_H$, $z_L - z_H$, $\phi (2^n - 2^{n/2})/2$) via recursive call
7 Apply quantum addition circuit from $x_H$ to $x_L$, restoring it to $|x_L\rangle$ (and resp. for the $y$ and $z$ registers)
Algorithm 7.2 PhaseTripleProduct($x$, $y$, $z$, $\phi$) for $k = 2$

Note that as we show in Table 7.1, the quantum-quantum case with $k = 2$ does not actually provide a sub-quadratic gate count—it requires $O(n^2)$ two-qubit gates, matching the asymptotic cost of "digital" schoolbook multiplication. However it does provide a drastic speedup over the $O(n^3)$ gate count of the schoolbook algorithm for the PhaseTripleProduct operation represented in Eq. 7.29. In any case, it's an illustrative example that provides good intuition for the cases which do yield a sub-quadratic runtime, which we discuss next.

7.3.3 Arbitrary $k$

In this section we describe the general case of arbitrary $k$ for both classical-quantum and quantum-quantum multiplication. As in the previous two sections, we would like to apply the Toom-Cook decomposition to the phase products $e^{i \phi x y}$ and $e^{i \phi x y z}$. Using the notation introduced in Sec. 7.2.2, we can write the entire Toom-Cook process for computing an integer product as a linear algebra expression:

$x y = \vec{w} \cdot A^{-1} \left( A \vec{x} \odot A \vec{y} \right)$   (7.32)

As in the previous section, to decompose the triple product we modify the Toom-Cook algorithm to take the pointwise triple product:

$x y z = \vec{w} \cdot A^{-1} \left( A \vec{x} \odot A \vec{y} \odot A \vec{z} \right)$   (7.33)

Note that as in Section 7.3.2, the triple product increases the degree of the product polynomial, so we must evaluate the polynomials at more points. As a function of $k$, this number of points is $3k - 2$ in the three-integer (PhaseTripleProduct) case (whereas it is $2k - 1$ in the two-integer (PhaseProduct) case).

Inserting these expressions into the phases of Eq. 7.28 and 7.29, we have

$e^{i \phi x y [z]} = \exp\left( i \phi\; \vec{w} \cdot A^{-1} \left( A \vec{x} \odot A \vec{y} \left[\, \odot A \vec{z} \right] \right) \right)$   (7.34)

where we use square brackets to denote parts that only appear in the three-integer, PhaseTripleProduct case, but not in the two integer PhaseProduct case.

As in the $k = 2$ case, we'd like to decompose Eq. 7.34 in a way that allows us to include as much linear algebra as possible in the prefactors, which can be classically computed at circuit compilation time. To do so, we define $\vec{c} = \vec{w}^{\,T} A^{-1}$, and decompose the phase across the inner product of $\vec{c}$ and the pointwise product $A \vec{x} \odot A \vec{y}\, [\odot A \vec{z}]$:

$e^{i \phi x y [z]} = \prod_j e^{i \phi\, c_j\, (A \vec{x})_j (A \vec{y})_j \left[ (A \vec{z})_j \right]}$   (7.35)

We see that now, as before, each of the phase factors in the product on the right hand side has the same form as the left hand side—a classical phase factor (in this case $\phi c_j$) multiplied by a product of quantum integers. By choosing the points $t_j$ appropriately we can ensure that the values $(A\vec{x})_j$ (resp. $(A\vec{y})_j$ and $(A\vec{z})_j$) are roughly $1/k$ the size of the input integers, up to overflow bits, reducing our larger multiplication to much smaller ones.

The decomposition in Eq. 7.35 suggests a simple algorithm for computing PhaseProduct and PhaseTripleProduct efficiently, for general $k$. We lay out the steps explicitly in Algorithm 7.3. In that algorithm, the square brackets denote values that only appear when computing the PhaseTripleProduct. We note that Algorithm 7.3 is meant to be as simple as possible while exhibiting the asymptotic scaling we desire; in practice optimizations should be applied to reduce the qubit count and/or depth (see Sec. 7.3.4 below).

Input :  Quantum state $|x\rangle|y\rangle[|z\rangle]$
 Classical phase factor $\phi$
 Tunable parameter $k$ (and corresponding data $A$, $\vec{c} = \vec{w}^{\,T} A^{-1}$, etc.)
Output :  Quantum state $e^{i \phi x y [z]}\, |x\rangle|y\rangle[|z\rangle]$
Divide the $x$, $y$[, and $z$] registers into $k$ subregisters each.
Let $m = 2k - 1$ [$m = 3k - 2$]
for $j$ in $0 \ldots m - 1$ do
       Let $i$ be an index such that $A_{ji} \neq 0$ /* such an index always exists */
       1 Use quantum addition/subtraction to overwrite $x_i$ (and a few "overflow" ancillas) with $(A\vec{x})_j$ (resp. $(A\vec{y})_j$[, $(A\vec{z})_j$])
       2 Apply Phase[Triple]Product($(A\vec{x})_j$, $(A\vec{y})_j$[, $(A\vec{z})_j$], $\phi c_j$) via recursive call
       3 Use quantum addition/subtraction to return $x_i$ to $|x_i\rangle$ (resp. $y_i$[, $z_i$])
      
end for
Algorithm 7.3 Phase[Triple]Product($x$, $y$, [$z$,] $\phi$)

Finally, note that the multiplications by $A^{-1}$ and $\vec{w}$ in Eqs. 7.32 and 7.33 correspond to step 4 (interpolation) and step 5 (recomposition) of the classical Toom-Cook algorithm respectively (see Sec. 7.2.2). Step 4 is the most complicated of the whole algorithm in terms of arithmetic; it requires roughly twice as many additions and subtractions as step 2 in the two-integer product case, and three times as many in the triple product case. By offloading that step to classical precomputation, we have cut the quantum computational cost of each level of recursion by an additional roughly 2/3 (PhaseProduct) or 3/4 (PhaseTripleProduct).

7.3.4 Analysis of asymptotic costs

Gate counts

(a) Classical-quantum multiplication

Algorithm     Gate count
Schoolbook    $O(n^2)$
$k = 2$       $O(n^{\log_2 3}) \approx O(n^{1.58})$
$k = 3$       $O(n^{\log_3 5}) \approx O(n^{1.46})$
$k = 4$       $O(n^{\log_4 7}) \approx O(n^{1.40})$

(b) Quantum-quantum multiplication

Algorithm     Gate count
Schoolbook    $O(n^3)$
$k = 2$       $O(n^2)$
$k = 3$       $O(n^{\log_3 7}) \approx O(n^{1.77})$
$k = 4$       $O(n^{\log_4 10}) \approx O(n^{1.66})$

Table 7.1: Asymptotic scaling of gate counts for Algorithm 7.3, for various $k$ relevant in practice. "Schoolbook" in this table refers to the phase decompositions of Eqs. 7.24 and 7.25.

We now analyze the asymptotic scaling of the gate counts for Algorithm 7.3. We state our results as the following Theorem, which yields the scaling exponents listed in Table 7.1. Its proof is similar to the proof of the asymptotic complexity of the classical Toom-Cook algorithms. [KNU98]

Theorem 6.

Algorithm 7.3 can be implemented on a quantum device using $O(n^{\log_k(2k-1)})$ two-qubit gates for PhaseProduct, and $O(n^{\log_k(3k-2)})$ for PhaseTripleProduct, where $n$ is the length of the input integers.

Proof.

We first show that the recursive calls are performed on integers of size $n/k + b_k$, for some $b_k$ which is independent of $n$. From this we write a recursion relation and solve it to prove the theorem.

The recursive calls are performed on values of the form $(A\vec{x})_j$ (resp. $(A\vec{y})_j$ and $(A\vec{z})_j$). For the purposes of this proof, let the points $t_j$ be any integers bounded in magnitude by $m$. Then, the relevant elements of $A$—corresponding to powers of each point up to degree $k-1$—are bounded by $m^{k-1}$; since $m$ is linear in $k$ this implies that the elements of $A$ are at most $O(k \log k)$ bits long. The elements of $\vec{x}$ (resp. $\vec{y}$ and $\vec{z}$) are of size $n/k$, thus a single product $A_{ji} x_i$ has length $n/k + O(k \log k)$. The value $(A\vec{x})_j$ is the sum over $k$ of these products, which adds at most $O(\log k)$ bits, so we see that $(A\vec{x})_j$ can be stored in $n/k + O(k \log k)$ bits. Thus we see $b_k = O(k \log k)$, which is independent of $n$.

With that we may write a recursion relation for the gate count $G(n)$. For one level of recursion, the algorithm performs at most $O(m)$ additions and subtractions, and $m$ recursive calls of size $n/k + b_k$ (one per iteration of the outer loop). With the additions and subtractions being performed in linear time $c n$ for some constant $c$, we have

$G(n) \leq m\, G(n/k + b_k) + c n$   (7.36)

Noting that if $q$ is a polynomial of degree $d$, then $q(n + O(1)) = q(n) + O(n^{d-1})$, solving the recursion yields

$G(n) = O\left(n^{\log_k m}\right)$   (7.37)

which is what we desired to show. Finally, because $m$ is at most $3k - 2$ it is straightforward to see that for all $k \geq 2$ (which is all of the allowed values of $k$), the power $\log_k m$ is at most 2, and thus our assumption that $G$ was a polynomial of degree at most two was justified. ∎
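For concreteness, the recursion can be unrolled as follows (a sketch suppressing the constant shift $b_k$, which the polynomial argument above absorbs):

```latex
\begin{align*}
G(n) &\le m\,G(n/k) + cn \\
     &\le m^2\,G(n/k^2) + cn\left(1 + \tfrac{m}{k}\right) \\
     &\;\;\vdots \\
     &\le m^{\log_k n}\,G(1) + cn \sum_{j=0}^{\log_k n - 1}\left(\tfrac{m}{k}\right)^{j}
      = O\!\left(n^{\log_k m}\right),
\end{align*}
```

where the final equality uses $m^{\log_k n} = n^{\log_k m}$ together with the fact that $m > k$, so the geometric sum is dominated by its last term, of order $n^{\log_k m - 1}$.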

Qubit counts

In the proof of Theorem 6, we showed that the value $(A\vec{x})_j$ (resp. $(A\vec{y})_j$ and $(A\vec{z})_j$) may overflow the register it is overwriting by a number of bits which is constant in $n$. It is straightforward to see, then, that the algorithm can be executed using $O(\log n)$ ancilla qubits—a constant number of ancillas are used at each of the $O(\log n)$ levels of recursion. In this section we show that it is possible to do even better, using only a small constant number of qubits for the entire operation (2 for classical-quantum multiplication, and 5 for quantum-quantum).

The first key observation is that we may reduce the overflow to a single qubit, by adjusting slightly how we divide the factors into their $k$ parts. If we set the size of the less-significant parts of the value to slightly fewer than $n/k$ bits, then the most-significant part will be at least $b_k$ bits longer than all the others. Thus if we choose our points $t_j$ such that the coefficient on that large piece is 1, and overwrite it with the linear combination $(A\vec{x})_j$ (resp. $(A\vec{y})_j$ and $(A\vec{z})_j$), it will overflow by at most one qubit. Since each of the $x$, $y$, and $z$ registers will have such an overflow qubit, this leads to a total of 3 overflow qubits per level of recursion (or 2 for the classical-quantum case, in which we only have $x$ and $y$).

While we now need only one overflow qubit per level of recursion, we still need to deal with the larger issue of overflow qubits accumulating through the recursive calls. To do so, we perform the portion of the multiplication that involves that extra overflow qubit directly, before moving down the recursive tree. Because one of the factors in this partial multiplication is only a single qubit, the contribution to the overall runtime is negligible. Once this has been done, the overflow qubit is uncomputed, and can be reused in a lower-level recursive call.
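The overflow-peeling step above can be sanity-checked classically. The sketch below (variable names and bit widths are our illustrative assumptions, not the chapter's notation) verifies that the Fourier-space phase of the full product splits exactly into a single-bit contribution, applied directly, and the phase on the truncated value, which the recursive call handles:

```python
import cmath

def phase(x, n):
    """Phase factor exp(2*pi*i * x / 2^n)."""
    return cmath.exp(2j * cmath.pi * x / 2 ** n)

def split_overflow(v, m):
    """Split v into its single overflow bit and its low m bits."""
    return v >> m, v & ((1 << m) - 1)

n = 16                       # modulus exponent for the Fourier-space phase
m = 7                        # subregister length; v occupies m + 1 bits
v, w = 0b10110101, 0b1011    # v has its overflow bit set

v_top, v_low = split_overflow(v, m)
assert v_top in (0, 1)

# The phase of the full product equals the directly-applied overflow
# contribution (a 1-bit factor, so only linear cost) times the phase on
# the m-bit remainder, which is what the recursive call computes.
direct = phase(v_top * (1 << m) * w, n)
recursed = phase(v_low * w, n)
full = phase(v * w, n)
assert abs(full - direct * recursed) < 1e-12
```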

We make these two ideas explicit in Algorithm 7.4, which reduces the ancilla qubit count to at most 3, while maintaining the asymptotic gate count of Algorithm 7.3.

Input :  Quantum state
 Classical phase factor
 Tunable parameter
Output :  Quantum state
Let .
Let the evaluation points be values from the set .
Let .
Let .
Let [].
Divide , [,] and registers into subregisters, where the least significant have length and the most significant contains the remaining bits whose size we denote .
for  in  do
       1 Use quantum addition/subtraction to overwrite and one “overflow” ancilla with (resp. [,] )
       Let represent the overflow qubit of the subregister (resp. , )
       2 Apply phase controlled by qubit
       3 [Apply phase controlled by qubit ]
       4 Apply phase controlled by qubit
       5 Uncompute the overflow qubit for each register
       6 Apply Phase[Triple]Product via recursive call
       7 Use quantum addition/subtraction to return to (resp. [,] )
      
end for
Algorithm 7.4 Phase[Triple]Product(, , [], ) with ancillas
Theorem 7.

Algorithm 7.4 can be implemented in two-qubit gates, where is the length of the input integers.

Proof.

There are three important ways in which Algorithm 7.4 differs from Algorithm 7.3: there is only one overflow qubit per register, steps 2-5 are new, and the recursive call of step 6 is on a value of length rather than . For the last one, it is straightforward to see that , so the latter change will not adversely affect the running time. Thus we just need to show that only one overflow qubit is required, and that the cost of steps 2-5 is negligible in the asymptotic limit.

First we show that only a single overflow qubit is indeed required to store in the subregister (resp. and ). First observe that for , so clearly no overflow qubits are needed. The other are unit fractions with denominator . For these, recall that . Thus is bounded from above by . By definition , so requires at most bits to store, which is what we desired to show.

Now we examine the cost of steps 2-5. We begin with steps 2-4, in the case of PhaseProduct (used for classical-quantum multiplication), because it is simpler. In this case the parts of Algorithm 7.4 in square brackets are omitted, including all of step 3. In steps 2 and 4, we need to do a controlled application of phases proportional to and . These values are both already contained in a register, so this can be easily achieved with a series of controlled-phase gates, where is within an additive constant of . Thus in this case our recursion relation is simply (cf. Eq. 7.36)
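The per-qubit decomposition behind this series of controlled-phase gates can be simulated classically. In the toy sketch below (the parameter values are arbitrary examples), the phase proportional to a product of a classical constant and a register value factors into one rotation per set bit of the register:

```python
import cmath

def phase(x, n):
    """Phase factor exp(2*pi*i * x / 2^n)."""
    return cmath.exp(2j * cmath.pi * x / 2 ** n)

n = 12       # register length / modulus exponent
a = 2741     # classical factor, known when the circuit is built
b = 1893     # value held in the quantum register

# exp(2*pi*i * a*b / 2^n) factors into one controlled rotation per bit
# of b: gate j applies angle 2*pi * a * 2^j / 2^n when bit j is set.
total = 1.0
for j in range(n):
    if (b >> j) & 1:
        total *= phase(a * (1 << j), n)

assert abs(total - phase(a * b, n)) < 1e-9
```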

(7.38)

For steps 2-4 of PhaseTripleProduct, we need to implement phases proportional to , , and . Once again all these values are readily available, but simply applying a series of doubly-controlled phase rotations would use gates—which is too large. Instead, we may apply a controlled version of the PhaseProduct algorithm, for some (perhaps different) ! Doing so will require a total of gates, and we get a recursion relation

(7.39)

where again is within an additive constant of . Thus as long as we set such that

(7.40)

both Eqs. 7.38 and 7.39 have the solution

(7.41)

which is what we desired to show.

The only thing that remains is step 5. We must show that uncomputing the overflow ancillas uses only a negligible number of gates. This is straightforward: the entire value requires only a number of gates linear in to compute, so uncomputing it entirely and recomputing it would require only gates. (In practice the single ancilla can be uncomputed more efficiently than by uncomputing the entire value.) ∎

Remark: We note that satisfies Eq. 7.40 for , which includes all of the values we explicitly explore in this work.

Theorem 8.

Algorithm 7.4 uses at most 2 ancillas for PhaseProduct, and at most 5 for PhaseTripleProduct.

Proof.

Again we begin with the case of PhaseProduct (which implements classical-quantum multiplication). In step 1, one ancilla qubit is allocated for the and registers respectively, for a total of two. Steps 2-4 do not require the allocation of any new ancilla qubits if they are implemented as a series of controlled-phase gates (as they are in the proof of Theorem 7). Because we uncompute the overflow ancillas in step 5, they can then be reused during the recursive call of step 6, thus requiring no additional ancilla qubits to be allocated. Thus PhaseProduct requires a total of 2 ancillas.

For PhaseTripleProduct, 3 ancillas (one for each register) are allocated in step 1. But as described in the proof of Theorem 7, steps 2-4 require the application of a controlled version of PhaseProduct. We have just shown that this requires 2 ancillas. Once again, in step 5 we uncompute the overflow ancillas and they can be reused during the recursive call of Step 6. Thus we see that PhaseTripleProduct can be implemented with a total of 5 ancillas. ∎

In practice, we may be interested in the size of , which is roughly , because although it is clear that their difference is constant, if that constant is large it could meaningfully affect the practical cost of performing the recursion (since the recursive products are of size ). In Table 7.2, we record the values of for various , and observe that they are very small in practice.

Algorithm variant
0
3
(a) Classical-quantum multiplication
Algorithm variant
0
4
7
(b) Quantum-quantum multiplication
Table 7.2: Number of bits by which the length of subregister must exceed the lengths of the other subregisters in Algorithm 7.4. This corresponds roughly to the “extra” length in addition to of values which are recursively multiplied.

Gate depth

Finally, we discuss gate depth. The main results of this section are summarized in Table 7.3, and formalized in Theorem 9. Stated succinctly, we find that it is possible to perform the phase shifts implemented by PhaseProduct and PhaseTripleProduct in sub-linear gate depth. Unfortunately, this does not immediately imply that these algorithms can be used to achieve sub-linear depth multiplication in qubits—surprisingly, we are bottlenecked by the quantum Fourier transform and its inverse, which need to be performed before and after the phase products! The lowest-depth (approximate) quantum Fourier transform circuit of which we are aware that runs in linear space is the “standard” construction having depth ; fast algorithms which have depth are known, but seem to require at least qubits [CW00]. An important and exciting direction for future research is whether it is possible to compute this (approximate) quantum Fourier transform in sub-linear depth and linear space. Despite this limitation, we believe it is worth exploring how the Phase[Triple]Product portion of the multiplication can be made as low-depth as possible; in practice, the constants of the linear-space quantum Fourier transform circuit are good enough that we expect its depth to be comparable to that of the Phase[Triple]Product for relevant sizes of integers, despite its worse asymptotic scaling. The low-depth constructions given here use ancilla qubits (note that this is more than the ancillas we used in the previous section). We expect that in practice the qubit cost will be dominated by the qubits needed to perform sub-linear time addition [DKR+06]; the number of ancillas needed for the overflow qubits in this case is also linear in but small. We begin by giving intuition for how low depth can be achieved, and then state Theorem 9.

In Algorithms 7.3 and 7.4, each iteration of the outer loop overwrites a single value of length with (resp. and ), recursively computes Phase[Triple]Product on that value, and then resets it to its original contents. However, we have of these length subregisters at our disposal—we may as well overwrite more than just one of them, so that we can perform several recursive calls in parallel! The challenge is that we need to compute each of the (resp. and ) in-place, reversibly. We do not know of an algorithmic way to simultaneously compute many in-place for arbitrary ; instead, in Tables 7.4, 7.5, and 7.6 we record explicit sequences of operations to do so, which were created by hand for the cases of (both PhaseProduct and PhaseTripleProduct) and (PhaseTripleProduct only). We also note that Algorithms 7.1 and 7.2, which cover the cases, already implicitly allow this parallelism. In all of these sequences we have endeavored to use only addition, sign inversion, and scaling by powers of two, since each of these operations should be very efficient to implement.

Since we have subregisters at our disposal, and points at which we need to evaluate (resp. and ), the best parallelism we can hope to achieve (in a single level of recursion) is parallel groups of recursive calls. Recalling the dependence of on , this quantity is equal to for PhaseProduct (for all ) and for PhaseTripleProduct (for all ). (Note that the optimum is achieved by the sequences in Algs. 7.1 and 7.2, and Tables 7.4 and 7.5). In view of Theorem 9 below, this corresponds to an asymptotic gate depth of for PhaseProduct and for PhaseTripleProduct. An avenue for further research is whether it is possible to saturate the hard lower bound of depth equal to total gate count over , using only ancillas, by parallelizing over multiple levels of recursion simultaneously.

We formalize our results regarding gate depth in the following theorem, and give explicit asymptotic scaling of gate depth in Table 7.3.

Algorithm Sequence Depth
Schoolbook
Alg. 7.1
Table 7.4
(a) Classical-quantum multiplication
Algorithm Sequence Depth
Schoolbook
Alg. 7.2
Table 7.5
Table 7.6
(b) Quantum-quantum multiplication
Table 7.3: Asymptotic scaling of gate depth. “Schoolbook” in this table refers to the phase decompositions of Eqs. 7.24 and 7.25. The form of the scaling relation comes from Theorem 9. All algorithms listed here use only qubits. * Sublinear depth should be achievable for by finding a sequence of operations using fewer parallel groups than that of Table 7.6.
Operation Register 0 Register 1 Register 2
(start)
Add reg. 2 to reg. 1
Add reg. 0 to reg. 1
Product on all registers
Invert sign of reg. 1
Add reg. 0 to reg. 1
Add reg. 2 to reg. 1
Add reg. 1 to reg. 0
Add reg. 0 to reg. 1
Add reg. 2 to reg. 1
Product on regs. 1 and 0
Invert sign of reg. 1
Add reg. 2 to reg. 1
Add reg. 1 to reg. 0
Add reg. 0 to reg. 1
Add reg. 2 to reg. 1
Divide reg. 1 by two
Table 7.4: low-depth classical-quantum multiplication algorithm. This table lists the quantum operations performed to implement the unitary . The registers are divided into subregisters as and (using little-endian notation, so is the least-significant subregister). In this table only the state of the sub-registers are shown; the same operations are applied to the register. “Product on registers” means to apply a phase corresponding to the product of the respective and registers, usually by recursively calling the same algorithm again. Registers containing values upon which the algorithm is applied recursively are highlighted in bold. The linear combinations used here for the products correspond to the evaluation points .
Operation Register 0 Register 1 Register 2
(start)
Add reg. 0 to reg. 1
Add reg. 2 to reg. 1
Product on all
Add reg. 2 to reg. 1
Add reg. 0 to reg. 1
Add reg. 1 to reg. 2
Add reg. 2 to reg. 1
Add reg. 0 to reg. 1
Product on regs. 1 and 2
Add reg. 0 to reg. 2
Add reg. 2 to reg. 1
Add reg. 2 to reg. 1
Add reg. 1 to reg. 2
Add reg. 1 to reg. 2
Add reg. 0 to reg. 2
Add reg. 0 to reg. 2
Product on regs. 1 and 2
Add reg. 2 to reg. 1
Add reg. 0 to reg. 2
Add reg. 1 to reg. 2
Divide reg. 1 by 4
Divide reg. 2 by 8
Table 7.5: quantum-quantum multiplication sequence. In this table only the state of the sub-registers are shown; the same operations are applied to the register. Registers containing values upon which the algorithm is applied again recursively are highlighted in bold. The linear combinations used here for the products correspond to the evaluation points .
Operation Register 0 Register 1 Register 2 Register 3
(start)
Add reg. 0 to reg. 2
Add reg. 3 to reg. 1
Add reg. 2 to reg. 1
Add reg. 1 to reg. 2
Product on all regs.
Add reg. 1 to reg. 2
Add reg. 3 to reg. 2
Add reg. 0 to reg. 1
Add reg. 2 to reg. 1
Add reg. 3 to reg. 2
Add reg. 1 to reg. 2
Product on regs. 1 and 2
Add reg. 1 to reg. 2
Divide reg. 2 by 4
Add reg. 2 to reg. 1
Add reg. 0 to reg. 2
Add reg. 3 to reg. 1
Divide reg. 1 by 2
Add reg. 0 to reg. 1
Add reg. 1 to reg. 2
Product on regs. 1 and 2
Add reg. 1 to reg. 2
Divide reg. 2 by 8
Add reg. 2 to reg. 1
Add reg. 0 to reg. 1
Add reg. 3 to reg. 2
Add reg. 0 to reg. 2
Add reg. 3 to reg. 1
Product on regs. 1 and 2
Add reg. 2 to reg. 1
Divide reg. 1 by 8
Add reg. 1 to reg. 2
Add reg. 0 to reg. 2
Divide reg. 2 by 16
Add reg. 3 to reg. 1
Table 7.6: quantum-quantum multiplication sequence. In this table only the state of the sub-registers are shown; the same operations are applied to the and registers. Registers containing values upon which the algorithm is applied again recursively are highlighted in bold. The linear combinations used here for the products correspond to the evaluation points .
Theorem 9.

The algorithms recorded in Algs. 7.1 and 7.2, and Tables 7.4, 7.5, and 7.6, can be executed in depth using qubits, where is the number of groups of parallel recursive calls in the algorithm.

Proof.

First we prove the depth bound. If we use fast algorithms for the additions, we may perform them in time for some constant  [DKR+06]. As in the proof of Theorem 6 we denote the number of “overflow” bits of each as , a function which does not depend on . With this we can write a recursion relation

(7.42)

We note that for any power , . The second term is with respect to , so we may drop it; the recursion is then solved by

(7.43)

which is what we desired to show.

Next we show that this depth can be achieved using qubits. The logarithmic-depth addition uses ancillas which can be reused after the addition is complete. Thus we only need to show that the “overflow” qubits are at most across all levels of recursion.

Suppose we cut off the recursion (or at least the parallelism) when the size of the input integers is below some constant value . At the bottom level of recursion, there will be at most values being phase multiplied in parallel, each needing up to overflow ancillas. The number of values being multiplied, and thus the number of overflow ancillas, at each level of the tree of recursive calls is at most half of the level below it; thus by geometric series, the total number of overflow ancillas at all levels is at most twice the number at the bottom level. Thus the total number of overflow ancillas is at most , which is linear in , as desired. ∎

Remark:

We note that for the values of relevant in practice, the number of overflow ancillas for a single value is quite small (see Table 7.2). Since the total number of ancillas needed is bounded by , and can be taken to be, say, without drastic loss of parallelism, we expect the qubit usage from “overflow” ancillas to not only be linear in , but in fact a quite small fraction of in practice. It is likely that the number of ancillas used for these overflows will be dwarfed by the ancilla usage from sub-linear-time addition.

7.3.5 Fast exact quantum Fourier transform

In this section we show that our results imply a new upper bound for the gate cost of performing an exact quantum Fourier transform modulo a power of 2 with only ancillas. We note that in practice, the quantum Fourier transform should always be performed approximately, which can be achieved with the standard construction to within error in time [NC11]. However, we believe the following is at least of theoretical interest.

It was shown over two decades ago that the quantum Fourier transform with modulus , which we may denote , can be implemented exactly via the following three steps for any positive integer (stated here in the language we have been using in this work): [CW00]

  1. Apply to the first qubits

  2. Apply PhaseProduct(, , ), where is the value of the first qubits and is the value of the remaining qubits

  3. Apply to the final qubits

(Note that corresponds to the “standard” construction for the QFT). By setting and performing steps 1 and 3 via recursive call, the runtime of this algorithm becomes times the cost of step 2. In the work introducing this construction, it was suggested to implement step 2 by adding a length ancilla register and computing the product in that register, then applying single-qubit phase rotations based on the bits of the product (and then uncomputing the product afterwards). Given our results, we may apply PhaseProduct directly without needing to allocate an extra register. This directly implies that the exact quantum Fourier transform can be implemented in sub-quadratic time without the need for more than the ancillas used by PhaseProduct (in fact, just using a single ancilla for the case of ).
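The three-step construction is the quantum analogue of the classical Cooley-Tukey factorization of the DFT, in which the PhaseProduct step appears as the cross-register "twiddle" phases. A numerical sketch (the index conventions below are our own choices for the classical analogue, not a claim about the circuit's qubit ordering):

```python
import cmath

def dft(x):
    """Direct DFT: X[k] = sum_j exp(2*pi*i*j*k/N) x[j]."""
    N = len(x)
    w = 2j * cmath.pi / N
    return [sum(cmath.exp(w * j * k) * x[j] for j in range(N))
            for k in range(N)]

def three_step_dft(x, A, B):
    """Size N = A*B DFT via: size-B transforms, cross-product
    ('PhaseProduct') phases exp(2*pi*i*j1*k2/N), size-A transforms."""
    N = A * B
    w = 2j * cmath.pi / N
    # Step 1: size-B DFT over j2 for each fixed low index j1.
    inner = [dft([x[j1 + A * j2] for j2 in range(B)]) for j1 in range(A)]
    # Step 2: the product-of-registers phase exp(2*pi*i * j1 * k2 / N).
    tw = [[cmath.exp(w * j1 * k2) * inner[j1][k2] for k2 in range(B)]
          for j1 in range(A)]
    # Step 3: size-A DFT over j1 for each k2; output index is B*k1 + k2.
    out = [0j] * N
    for k2 in range(B):
        col = dft([tw[j1][k2] for j1 in range(A)])
        for k1 in range(A):
            out[B * k1 + k2] = col[k1]
    return out

x = [complex(i * i % 7, -i) for i in range(16)]
ref = dft(x)
got = three_step_dft(x, 4, 4)
assert max(abs(a - b) for a, b in zip(ref, got)) < 1e-9
```

The middle step is exactly a phase proportional to the product of the two sub-indices, which is why an in-place PhaseProduct lets the recursion run without an ancilla register for the product.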

7.3.6 Practical circuit costs

Algorithm Gate count (millions) Ancilla qubits
Toffoli/ 2- and 1-qubit
Classical-quantum multiply-add
This work 0.46 2.6 12
Digital Windowed [GID19b] 1.8 2.5 4106
Digital Karatsuba [GID19b] 5.6 34 12730
Digital Schoolbook [GID19b] 6.4 38
Quantum-quantum multiply-add
This work 54 26 30
Digital Karatsuba [GID19a] 15 30 13152
Digital Schoolbook [GID19a] 21 420
Table 7.7: Circuit size estimates for multiplication of 2048-bit numbers. All estimates are in the “abstract circuit model” (no error correction or routing costs included and all qubit counts are logical), thus comparisons between gate counts especially should be considered rough. The “digital” algorithms refer to algorithms which compute the multiplication directly rather than in Fourier space; estimates were computed using the Q# code from [GID19a] and [GID19b]. The 2- and 1-qubit gate count column is in addition to the 3-qubit gates; it does not include decomposed 3-qubit gates. *These qubit counts are the values reported in [GID19a] and [GID19b]; we do not see why it would not be possible to perform schoolbook multiplication without the need for ancilla qubits. The resource estimates for digital quantum-quantum multiplication assume the output register is initialized to zero (that is, they only perform a multiplication, not a “multiply-add”).

Having examined the asymptotic costs of our algorithms, we estimate the gate and qubit counts required to implement them in practice. There are a number of tunable parameters which affect the gate counts for different types of gates as well as the qubit counts; here we set the parameters to achieve a particular balance of gate counts and number of ancillas. The results for non-modular multiplication of 2048-bit numbers are summarized in Table 7.7. We find that these algorithms drastically reduce qubit counts, while exhibiting competitive gate counts. Regarding gate counts, we note two things: first, that making a direct comparison between our algorithms and others is not straightforward, because they use different gate sets—for example, a sizeable fraction of the two-qubit gates in our classical-quantum multiplication algorithm are controlled phase rotations, while in other implementations the two- and one-qubit gates might consist entirely of Clifford operations. Second, we note that these estimates for our algorithms correspond to preliminary circuit constructions; it is likely that there are further optimizations to be had in the implementation which could yield considerable improvement in gate counts especially.

For each value of , we numerically determine the “crossover” point, the value of above which the algorithm outperforms . Here, we define “outperform” as using fewer total , , and Toffoli gates. In Table 7.8 we record the optimal crossover points that we find for each . We note that these crossover points are dependent on the specific implementation and definition of cost, and so may vary, for example, if the circuits are compiled to target a certain physical gate set. Using these crossover points in our implementation of 2048-bit multiplication, we adjust throughout the recursion to always use the optimal algorithm for the size of the factors, eventually switching to schoolbook when the factors have become sufficiently small.

Algorithm variant Cutoff
Schoolbook
(Karatsuba) 29
114
(a) Classical-quantum multiplication
Algorithm variant Cutoff
Schoolbook
9
58
262
(b) Quantum-quantum multiplication
Table 7.8: Crossover points of the algorithms. The “cutoff” column lists the smallest value of for which the given algorithm has a smaller gate count than the one immediately preceding it in the table. The “schoolbook” algorithm is the base case, and thus does not have a cutoff. Note that the values in this table are estimates and depend strongly on the precise implementation of the circuits.

For our estimates, we use the Cuccaro quantum adder to perform the additions required by our algorithms [CDK+04]. The Cuccaro adder requires one extra ancilla qubit, but it can be reused after the addition is complete. We consider this tradeoff worthwhile given the adder's low gate counts. For sign inversions and multiplication/division by powers of 2, we note that both can be implemented “logically” with zero quantum cost. Sign inversion in two's complement representation corresponds to flipping all bits of the integer and incrementing the value by 1. Flipping all bits can be performed logically by relabeling which quantum state corresponds to 0 and which to 1; the addition of one can easily be incorporated into the next addition following the sign flip via the input “carry” bit that is inherently a part of Cuccaro addition. Multiplication and division by powers of 2 correspond just to bit shifts and can be performed for zero quantum cost by relabeling qubits.
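The zero-cost tricks above can be checked with ordinary integer arithmetic; a small sketch (register width and values are arbitrary examples):

```python
n = 8
mask = (1 << n) - 1

def twos_complement_negate(x, n):
    """Negation as bitwise complement plus an increment; in the circuit,
    the increment rides in on the carry-in bit of the next addition."""
    return ((x ^ ((1 << n) - 1)) + 1) & ((1 << n) - 1)

x = 0b01011010
assert twos_complement_negate(x, n) == (-x) & mask

# Multiplication/division by powers of two are pure bit shifts, i.e. a
# relabeling of which wire holds which bit -- no gates required.
assert (x << 2) & mask == (x * 4) & mask
assert x >> 1 == x // 2
```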

Finally, we note that in addition to the cost of the circuits to implement the phases, we must also apply a quantum Fourier transform beforehand and an inverse quantum Fourier transform afterward. If we apply the standard technique of dropping phase rotations below a certain threshold to implement an approximate QFT, the gate count of these operations is negligible (less than the least significant figure listed in Table 7.7) even when the error for the approximate QFT is set as low as . This standard approximate QFT construction also requires no ancilla qubits.

7.4 Applications and optimizations

In this section we describe various tweaks that can be made to our algorithms to optimize them for two exciting applications: Shor’s algorithm, and cryptographic proofs of quantumness. Before discussing those in depth, we begin with an aside about modular multiplication which will be relevant for both applications.

7.4.1 Modular multiplication

So far, we have discussed algorithms for non-modular multiplication—the inputs are each of length , and the output is of length . For both the applications we discuss here, we are instead interested in multiplication modulo some integer . In this section we will show that with a small modification, our algorithms can output the product in the form of an (approximate) binary fraction of precision bits (stored on qubits), while maintaining their asymptotic costs. Later we will show that this form, with not much larger than , is sufficient for relevant applications. Here we discuss only the case of quantum-quantum multiplication to avoid unnecessarily complicated notation, but the results trivially carry over to the case of classical-quantum multiplication.

In the preceding sections, we have been working in Fourier space modulo . In Fourier space, the modular multiplication result we desire has the following form (here dropping any constant to which the result was added, but it can be trivially included):

(7.44)

It is relatively well-known that if we instead work in Fourier space modulo , we may apply the following identity:

(7.45)

Intuitively, as the phase wraps around , it automatically performs the modulo operation for us! This seems to suggest that applying our circuits to modular multiplication is trivial: simply change the modulus of the Fourier transforms.

The challenge is that circuits for performing the quantum Fourier transform with an arbitrary modulus are considerably more complex than for a power of two—and (to our knowledge) there do not exist constructions for the QFT with arbitrary modulus that operate in sub-quadratic time using only ancillas, which is what we desire. Instead, we may apply the phase of Eq. 7.45 but perform the final quantum Fourier transform modulo a power of two instead of . This is simply quantum phase estimation for the phase . It is well known that quantum phase estimation for a phase that cannot be represented exactly as a binary fraction yields a superposition which is exponentially heavily weighted on the binary strings that best approximate the value of the phase. [NC11]
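The wraparound identity of Eq. 7.45 is easy to sanity-check numerically (the modulus and factors below are arbitrary examples):

```python
import cmath

N = 21       # example modulus
x, y = 13, 17

# Working in Fourier space modulo N, the phase of the full product and
# the phase of the reduced product coincide: the 2*pi wraparound of the
# phase performs the reduction mod N automatically.
full = cmath.exp(2j * cmath.pi * x * y / N)
reduced = cmath.exp(2j * cmath.pi * ((x * y) % N) / N)
assert abs(full - reduced) < 1e-12
```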

We note that there is another construction by which we can use our circuits to compute modular multiplication exactly, with the same asymptotic gate count but now using qubits. It is a quantum version of a classical construction called Montgomery multiplication. Classically, Montgomery multiplication has the downside of introducing a (known) extra factor into the result, which eventually needs to be dealt with; the quantum version is particularly nice because for classical-quantum multiplication the inverse of this extra factor can be classically multiplied into the classical input, canceling it out and obviating the need to deal with it later. However, we do not pursue that method further here because the approximate version above uses fewer qubits and suffices for our needs.

7.4.2 Skipping unnecessary QFTs

Another optimization relevant for both the applications below involves simplifying quantum Fourier transforms in certain cases. If we know that the “output” register is starting in the state, applying the initial quantum Fourier transform will just yield a uniform superposition over all bitstrings, all with the same phase of . Thus we can skip the complication of a Fourier transform modulo and simply apply a Hadamard gate to each of the qubits in parallel, yielding the same state in depth instead of depth . (Note that we do still have to perform the full inverse quantum Fourier transform at the end of the multiplication.)
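A quick classical check that the Fourier transform of the all-zeros state is exactly the flat superposition produced by parallel Hadamards (the register size below is an arbitrary example):

```python
import itertools
import math

n = 4
N = 2 ** n

# QFT|0> has amplitude exp(2*pi*i*0*y/N)/sqrt(N) = 1/sqrt(N) on every y...
qft_amp = [1 / math.sqrt(N) for _ in range(N)]

# ...which is exactly the state produced by one Hadamard per qubit.
h = 1 / math.sqrt(2)
had_amp = [math.prod([h] * n) for _ in itertools.product((0, 1), repeat=n)]

assert all(abs(a - b) < 1e-12 for a, b in zip(qft_amp, had_amp))
```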

With that, we move on to discussing applications.

7.4.3 Shor’s algorithm

The core of Shor’s algorithm for factoring an integer of bits can be implemented via a series of in-place multiplications by classical constants, controlled off of a single qubit (see the “one controlling qubit” trick, Sec. 2.4 of [BEA03]). Written out, a single one of these multiplications consists of the in-place operation

(7.46)

for a classical integer . Here we apply the modular multiplication introduced in Sec. 7.4.1 to implement this operation to within an error using qubits (while maintaining the same subquadratic gate count of the multiplication algorithm).

To do so, we make the following observation: our classical-quantum multiplication algorithm does not require that the classical value be an integer. It can be a floating-point number, which we may classically compute to whatever precision we desire before using it to compute the values of the phase rotations in our multiplication algorithm. We use this fact to perform the following trick: compute the fractional value up to some precision as in Eq. 7.45, and then multiply by the value to convert the fractional value into an integer. The accuracy of this operation will depend on the precision to which we compute ; however, we will find that we only need extra bits to achieve an error of less than .
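The precision argument can be sketched with exact rational arithmetic. In the toy example below (the modulus, register size, and number of extra bits are our illustrative choices; here the extra bits are taken large enough that the final rounding error stays below one half unit):

```python
import math
from fractions import Fraction

N = 247      # toy odd modulus
c = 101      # classical multiplier
b = 198      # register value, b < 2**n
n = 8        # register length
p = 10       # extra precision bits, chosen here so that N / 2**p < 1/2

ratio = Fraction(c, N)
# Truncate the classically-computed ratio c/N to n + p fractional bits
# before baking it into the phase rotations.
truncated = Fraction(math.floor(ratio * 2 ** (n + p)), 2 ** (n + p))

exact = b * ratio        # b*(c/N) at infinite precision
approx = b * truncated

# Truncation error in b*(c/N) is below b * 2**-(n+p) < 2**-p ...
assert abs(exact - approx) < Fraction(1, 2 ** p)
# ... so multiplying the fractional part back by N recovers b*c mod N.
assert round((approx % 1) * N) == (b * c) % N
```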

Algorithm 7.5 applies this idea to approximately implement the unitary of Eq. 7.46. It is clear by inspection that it uses ancilla qubits in time . In the following theorem we prove the error bound.

Input :  Quantum state (extended to superpositions by linearity)
 Classical constant
 Error level
Output :  Quantum state (up to error )
Let
Allocate a register of ancillas initialized to
1 Compute for via classical-quantum multiplication, where is rounded to a precision of bits
2 Compute . State is now (approximately) or .
3 Using an ancilla qubit, compute whether the left register is greater than ; subtract controlled by the ancilla. State is now
4 Uncompute the ancilla qubit by computing whether via a comparison operator
5 Subtract the value from the second register, where is the multiplicative inverse of . State is now .
Algorithm 7.5 In-place classical-quantum modular multiplication
Theorem 10.

The final state produced by Algorithm 7.5 has inner product .

Proof.

We begin by showing that if Algorithm 7.5 were run with infinite precision (that is, setting ), the correct state would result with zero error. Then we show that the difference between the unitary applied in the infinite precision case and the real case is small.

In the infinite precision case, after step 1 we have the state for . In step 2 we add to , yielding or exactly. Step 3 uses an ancilla to reduce this to ; the ancilla now contains whether needed to be subtracted or not. Beauregard observed the identity [BEA03]; in our case, this implies that the truth value of the comparison will equal the value of the ancilla and thus can be used to uncompute it. Finally, for step 5, observe that and thus subtracting it from the second register resets that register to zero. Also note that the multiplicative inverse exists because and are co-prime (or, if they are not, provides a factor of without requiring Shor's algorithm at all!). We now have the state which is what we desired. Now we turn to showing that the approximate circuit, in which , yields a unitary that is within of the circuit with .

We begin with step 1. Here we must show that applying Hadamard gates to generate , then a phase rotation of , and then an inverse quantum Fourier transform yields a state that is close to . This can be seen immediately by observing that this process corresponds precisely to the quantum phase estimation algorithm for the phase , without the usual final measurement to extract an estimate of the phase. It is known that performing quantum phase estimation with an extra qubits beyond the precision to which the phase is desired yields a probability of at least of obtaining the value of the phase correctly up to the desired precision. [NC11] Here corresponds to this “correct” value of the phase which would be observed if the register were measured, so the population of the state must be at least and the overall error of this step is at most .

Now we move to step 2. Observe that even for the largest value of , the difference in this phase when is exact versus truncated to a total of bits is bounded by , and by definition . Thus the difference in the phase is , and thus the angle between the state resulting from the exact rotation versus the truncated one is proportional to .

Steps 3 and 4 are not directly affected by the precision of the second register, so they do not contribute any additional error.

Finally, step 5 simply corresponds to the inverse of step 1. Since the “good” (error-free) portion of the state has in the first register exactly, this step exactly uncomputes the second register.

With that, we see that each individual step has error less than when compared to the case of infinite ; the overall operation’s error is bounded by the sum of individual errors and thus is bounded by as well, which is what we desired to show. ∎
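The register bookkeeping of Algorithm 7.5 can be illustrated with an exact-arithmetic classical analogue (toy values; the quantum algorithm performs step 1 only approximately, which this sketch ignores):

```python
N = 247      # modulus to multiply under
c = 101      # classical constant with gcd(c, N) == 1
b = 57       # input register value, b < N

# Steps 1-4: out-of-place modular product into a fresh register.
reg1, reg2 = b, 0
reg2 = (reg2 + b * c) % N          # reg2 <- b*c mod N

# Step 5: erase the input register using the modular inverse of c,
# making the operation in-place overall.
c_inv = pow(c, -1, N)              # modular inverse (Python 3.8+)
reg1 = (reg1 - reg2 * c_inv) % N   # b - (b*c)*c^-1 = 0 mod N
assert reg1 == 0 and reg2 == (b * c) % N
```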

For Shor’s algorithm, this operation must be repeated times. Using our fast multiplication algorithms of time (with ), this yields an implementation of Shor’s algorithm that runs in time, significantly faster than the usual runtime for Shor’s algorithm. To be more concrete, it is for . Importantly, this runtime is achieved while maintaining a very lean qubit count: Algorithm 7.5 uses total qubits, and since we must repeat it times we should set . Thus the total qubit count for this implementation of Shor’s algorithm is qubits. To our knowledge this is the first implementation of Shor’s algorithm running faster than time within this space constraint.

Finally we describe an alternative way of implementing Shor’s algorithm in fewer than total gates using our circuits. It takes advantage of the idea of “windowed arithmetic”, which has produced some of the fastest known circuits for Shor’s algorithm [GID19b, GE21]. In brief, instead of multiplying by each classical integer in sequence, a QROM circuit is used to look up the product over values , from a classical lookup table of size indexed by the . The QROM circuit puts this value into a quantum register, at which point a quantum-quantum multiplication circuit (which no longer must be controlled by another qubit) can be applied to compute the product. While such a construction reduces the total number of quantum multiplications that need to be performed, the multiplications that remain are quantum-quantum multiplications instead of controlled classical-quantum multiplications; in the context of our circuits it’s not clear a priori whether the extra cost (including the cost of the QROM circuit) is worth it. Also, this construction seems to require at least qubits, plus the ancilla cost of the QROM. Nevertheless, a valuable direction for further research would be to explicitly count the gates of this construction and compare its cost to the other construction described above.
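The classical side of windowed arithmetic can be sketched as follows. This is a hypothetical illustration (function names and parameters are ours): each window of w exponent bits indexes a precomputed table of modular powers, which a QROM circuit would load before a single uncontrolled multiplication.

```python
def window_tables(a, N, n_exp_bits, w):
    # Precompute one lookup table of size 2^w per window of exponent bits.
    tables = []
    for i in range(0, n_exp_bits, w):
        base = pow(a, 2 ** i, N)
        # Entry j holds a^(j * 2^i) mod N for each w-bit window value j.
        tables.append([pow(base, j, N) for j in range(2 ** w)])
    return tables

def modexp_from_tables(tables, e, N, w):
    # Classical reconstruction of a^e mod N from the tables, mirroring the
    # sequence of looked-up multiplications the quantum circuit performs.
    acc = 1
    for i, table in enumerate(tables):
        acc = acc * table[(e >> (w * i)) & (2 ** w - 1)] % N
    return acc

assert modexp_from_tables(window_tables(7, 221, 8, 3), 200, 221, 3) == pow(7, 200, 221)
```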

7.4.4 Proofs of quantumness

As discussed in previous chapters, much recent excitement has focused on creating and implementing efficiently-verifiable “proofs of quantumness,” in which an untrusted quantum “prover” can demonstrate its quantum capability to an efficient classical “verifier” [BCM+21, BKV+20, KCV+22, ZKL+21, LZG+22, YZ22, KLV+22]. These protocols are interesting not only for their ability to decisively demonstrate quantum computational advantage, but also because they can be modified to perform more complex quantum cryptographic tasks, such as certifiable generation of quantum random numbers, verifiable remote state preparation, and even the classically-verifiable delegation of arbitrary quantum computations to an untrusted device [BCM+21, BGK+23, GV19, MAH18]. Many of these protocols centrally rely on a cryptographic construction called a trapdoor claw-free function (TCF), and all known TCF constructions involve multiplication of integers in various contexts—making the circuits we develop in this work useful for the implementation of these protocols. In this section we focus on the implementation of the function introduced in Chapter 5, which seems to give the best circuit sizes of the currently known TCFs.

To compute , we can directly apply our quantum-quantum multiplication circuits, with the inputs and being the same register. However, in the specific context of the proof of quantumness introduced in Chapter 5 there are certain further optimizations we can apply which reduce the circuit sizes considerably. With all of these optimizations applied, we find that the proof of quantumness protocols using can be implemented with dramatically fewer intermediate measurements and qubits than in the circuits introduced in Chapter 5, while maintaining competitive gate counts. We also believe more optimized implementations could reduce the gate counts considerably. In Table 7.9 we summarize these costs for various sizes of the modulus . We now describe these optimizations in detail.

Circuit          Total qubits   Gates (/Toffoli allowed)   Extra mid-circuit measurements

(takes seconds on a desktop [44])
Fast             272                                       0
Narrow           204                                       0
Previous best    942

(takes hours on a desktop [44])
Fast             814                                       0
Narrow           627                                       0
Previous best    3051

(record for factoring [ZIM20])
Fast             1676                                      0
Narrow           1268                                      0
Previous best    5522

(exceeds factoring record)
Fast             2068                                      0
Narrow           1566                                      0
Previous best    6801
Table 7.9: Resource estimates for the portion of a proof-of-quantumness protocol. These estimates do not include any error correction; depending on the individual gate fidelities it may be possible to mitigate error sufficiently using the post-selection scheme described in Chapter 5. Also note that the additional mid-circuit measurements performed in the cryptographic protocol after is computed are not included here.

First, we note that the output of the function is measured immediately after it is computed. Thus instead of building a quantum circuit to compute in the form of an integer, we have the freedom to build a circuit that computes any bitstring that uniquely corresponds to such an integer. Classical post-processing can then be used to compute and return the integer value itself. In the context of our circuits, this suggests an immediate optimization: as discussed in Sec. 7.4.1, we can perform the modulo operation automatically in the phase, if we are willing to accept output in the form of a binary fraction representing the value . As long as we have enough bits of precision to uniquely identify an integer from this binary fraction, we may directly measure this output instead of transforming it back to integer form as we did in Shor’s algorithm.
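The classical post-processing step can be sketched as follows (a minimal illustration with our own function names and a toy modulus): with one bit of precision beyond the bit-length of the modulus, the truncation error of the measured binary fraction stays below half a unit, so rounding recovers the integer uniquely.

```python
from math import ceil, log2

def bits_of_fraction(v, N, b):
    # The first b bits of the binary expansion of v / N, modeling the
    # measured output register.
    return [(v * 2 ** (i + 1) // N) % 2 for i in range(b)]

def recover_from_fraction(measured_bits, N):
    # Classical post-processing: rebuild the fraction and round to the
    # unique integer it identifies.
    frac = sum(bit * 2.0 ** -(i + 1) for i, bit in enumerate(measured_bits))
    return round(frac * N)

N = 221
b = ceil(log2(N)) + 1   # enough precision: truncation error below 1/(2N)
assert all(recover_from_fraction(bits_of_fraction(v, N, b), N) == v
           for v in range(N))
```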

Another optimization comes from an observation about the cryptography of trapdoor claw-free functions (TCFs). The classical hardness of the TCF-based “proof of quantumness” protocols comes from the fact that it is hard to find two inputs such that , where is a given instance of the TCF family. It is easy to show that this implies that it is also hard to find a pair of values for any efficiently-invertible function—because if such a pair could be found, the function could be immediately inverted to yield ! Therefore the classical verifier should allow the prover’s input registers to be modified by the TCF circuit, as long as the prover informs the verifier of the transformation that was made so that the verifier can perform the correct checks on the modified values. This is relevant to our recursive circuits for multiplication, because it means we can drop any gates that serve simply to return the input to its original values. As an example, in the case, we may allow our circuit to end with the input registers set to (resp. ), as long as the verifier knows that this is the value that is being manipulated in the later rounds of the protocol. (In practice we can drop the final gates of all levels of recursion, not just the top level, so the final state will look more complicated—but still be trivially computable by the verifier).

One more benefit comes from the fact that we are performing a square instead of a generic multiplication. A nice feature of the recursive algorithms is that if the top-level operation is to compute a square , all of the lower levels of recursion will compute squaring operations as well (of the form ). Therefore when we eventually reach sufficiently small in the recursion and switch to the schoolbook algorithm, that operation will be performing a square as well. We note that the cost of the schoolbook algorithm for applying a phase proportional to is roughly half that of the same operation for , because rotations of the form and are identical and can be combined. Also, phase rotations of the form (in which both bits are the same) correspond to a single-controlled phase rotation, instead of a doubly-controlled one. These facts lead to a significant reduction in the cost of performing the controlled-phase rotations.
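The counting argument behind this saving can be verified directly. Since each bit satisfies x_i^2 = x_i and the symmetric cross terms merge with a doubled angle, squaring an n-bit value needs n singly-controlled rotations plus n(n-1)/2 doubly-controlled ones, versus n^2 doubly-controlled rotations for a generic product. A minimal check (function name ours):

```python
from itertools import combinations

def square_from_bits(bits):
    # x^2 = sum_i 4^i x_i  +  sum_{i<j} 2^(i+j+1) x_i x_j
    # (little-endian bits; the diagonal terms use x_i^2 = x_i, and each
    # pair of cross terms is combined into one doubled-angle term).
    n = len(bits)
    diag = sum(4 ** i * bits[i] for i in range(n))
    cross = sum(2 ** (i + j + 1) * bits[i] * bits[j]
                for i, j in combinations(range(n), 2))
    return diag + cross

# Exhaustive check for all 8-bit inputs.
for x in range(256):
    bits = [(x >> i) & 1 for i in range(8)]
    assert square_from_bits(bits) == x * x
```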

If space is very limited, yet another optimization stems from the fact that we are measuring the output immediately as it is produced. Instead of computing all the bits of the output at once, we may instead compute subsets of the bits of the output sequentially. For example, one may let the output register be of length only roughly , and perform a smaller quantum-quantum multiplication to compute the less-significant half of the output and measure it. Then the same process can be repeated for the more-significant half, re-using the same output qubits. This optimization reduces the total qubit count from to . The gate count will be somewhat larger due to the fact that the recursive multiplication cannot be applied to the full output at once, but by making the decomposition we note that the gate cost can be reduced to that of 6 quantum-quantum multiplications of size (as opposed to one quantum-quantum multiplication of size ). This type of optimization can be extended to reduce the qubit count as low as , but we believe that, in practice, taking it any further than just described is unlikely to be worth the tradeoff in gate count.

Finally we note that there is another optimization, described when the “computational Bell test” protocol was originally introduced in Chapter 5, in which uncomputation of “garbage bits” can be replaced simply by measurement in the Hadamard basis. For the circuits we introduce here, this optimization is not actually very helpful, because there are only garbage bits introduced in the circuit (the overflow bits of each , resp. )! In fact, we estimate that the experimental cost of performing extra intermediate measurements instead of the gates to uncompute these values directly is almost certainly not worth it. We consider it a serious benefit of the circuits introduced here that they require no extra intermediate measurements at all (except the measurements explicitly required by the protocol).

7.5 Discussion and outlook

Developing quantum circuits based on classical algorithms is a subtle pursuit. Peculiarities of the quantum setting, such as the need for reversibility and the limited number of qubits on near- and medium-term quantum devices, preclude the straightforward translation of classical circuits to quantum ones, especially for recursive algorithms. However, quantum mechanics also provides us with new computational techniques that do not have classical analogues. In this work we combine the algorithmic speedups from recursive fast multiplication algorithms, which are widely used in classical computing, with the inherently quantum technique of performing arithmetic in the phase of a quantum state and then using a quantum Fourier transform to recover the result. We show that this yields circuits for performing multiplication of numbers stored in quantum registers in a number of gates that is sub-quadratic in the size of the inputs, for both classical-quantum and quantum-quantum multiplication. This gate count scaling is achieved while using an extremely small number of qubits—as low as just two ancillas in addition to the registers storing the input and output. Our algorithms also offload a considerable portion of the work to classical pre-computation in the circuit compilation phase, significantly reducing the overhead of these asymptotically-faster algorithms. We show that this reduction of overhead makes the algorithms not just a theoretical novelty, but potentially the smallest known circuits for the multiplication of integers at sizes relevant in practice, for computations such as Shor’s algorithm and cryptographic protocols for demonstrating quantum advantage.

There are several potential optimizations to our algorithm that warrant further exploration—although there exist subtle pitfalls which must be considered along the way. One open question is whether the techniques presented here can be applied to fast Fourier transform based classical multiplication algorithms, the foremost being the Schönhage–Strassen algorithm [SS71]. This algorithm replaces the matrix-vector products in, for example, with discrete Fourier transforms, which can be computed in time! (The discrete Fourier transform here should not be confused with the quantum Fourier transform already present in our algorithms—implementing Schönhage–Strassen on a quantum computer would correspond to implementing a quantum circuit for a classical fast Fourier transform algorithm, that outputs bitstrings representing the classical discrete Fourier transform of the input). The subtle but seemingly inescapable obstacle preventing this from fitting directly into our framework is that in Schönhage–Strassen, , , and are defined on a ring with some modulus (different from the modulus we discuss in Sec. 7.4.1), while the vector against which the results are multiplied during the recomposition step is defined over all integers. Because they are defined over different spaces, the matrix-vector products are no longer associative—so precomputing will no longer yield the correct answer! Another outstanding question is whether a quantum circuit for the discrete Fourier transform can be devised that uses only a constant (or even logarithmic) number of ancilla qubits.
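The associativity failure can be seen in a toy example (the matrix, weights, and modulus below are our own illustrative numbers): when the transform matrix lives over a ring with a modulus but the recomposition weights are plain integers, the reduction cannot be commuted past the weights, so precomputing the weighted transform gives a different answer.

```python
import numpy as np

m = 17
A = np.array([[3, 14], [9, 6]])   # "transform" matrix over Z_m
w = np.array([1, 32])             # integer recomposition weights
v = np.array([5, 11])             # input vector

# Correct order: apply the transform, reduce mod m, then recompose over Z.
mod_then_weight = int(w @ (A @ v % m))
# Precomputed order: reduce w^T A mod m first, then apply to v.
precomputed = int((w @ A % m) @ v)

# The two disagree: associativity fails across the two rings.
assert mod_then_weight != precomputed
```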

Within the algorithms that we do present, we believe there is still much room for optimization. The fact that the phase product portion of our circuits can be computed in sub-linear depth raises the question of whether there is a corresponding sub-linear depth algorithm for the quantum Fourier transform, which uses only a linear number of qubits. If such an algorithm were applied to our circuits, the entire multiplication could be performed in sub-linear depth. We also note that we do not explicitly give a sequence of additions and subtractions that yield a sub-linear depth circuit for quantum-quantum multiplication. For PhaseTripleProduct, computing the 10 sub-products in three parallel groups would yield a sub-linear depth circuit; the challenge is coming up with a sequence of arithmetic operations that yields the appropriate linear combinations at the right times. We are virtually certain that such a sequence of operations can be devised; we have not explored this extensively due to the orthogonal issue of the lack of a sub-linear quantum Fourier transform. More broadly, in all our algorithms it is worth exploring if there are better sequences that yield the right linear combinations in fewer addition operations than the sequences we propose. Perhaps there is an automated or algorithmic way to come up with these optimal sequences in the general case.

We also note that there are certain optimizations applicable in the “schoolbook” Fourier multiplication algorithm that don’t translate in an obvious way to our algorithms. [DRA00, BEA03] For example, a classic optimization is to drop very small rotations that have a negligible effect on the state. Unfortunately in our case there seem to be very few rotations small enough to be dropped, because the elements of are complicated linear combinations of various powers of two—and thus they usually contain at least one large term. One seemingly clever idea when using our circuits in the proofs of quantumness described in Section 7.4 would be to set the modulus so that one or more of the elements of are close to a multiple of . Then for those values, would be very small and whole branches of the recursive tree could be ignored. Unfortunately there is an extremely subtle but critical danger in doing this: the elements of have low Hamming weight (number of bits set to 1), and thus so would . The classical special number field sieve algorithm is much more efficient at factoring numbers with low Hamming weight than numbers of a general form, so this technique would likely destroy the classical hardness of the quantum advantage protocol! [LL93]

Finally, we note that in this work we leave largely unexplored a number of practical considerations, such as how this algorithm interacts with error correction, qubit routing, and decomposition to native gate sets. We believe that these questions are important to explore, because our estimates in the abstract circuit model suggest that our algorithms could meaningfully reduce the resources required for quantum computations with real-world impacts and applications. This is especially true if future works find further optimizations (which we believe are likely to exist). Taking a broader perspective, the techniques for multiplication of quantum integers presented here represent a fundamentally new direction for the implementation of quantum arithmetic, combining two ideas which previously seemed ill-suited to working together. Instead, we find that their combination yields the dramatic benefits of both.