Given a single-qubit gate operation Q on qubit k, if k < m, the operation is fully contained within a node. When k ⩾ m, the first and second elements of each pair are located on two different nodes and communication is required. Note that the distance between communicating processors in the virtual topology is 2^(k−m). We implement the communication scheme described in. Our enhancements to this scheme are described in Section 4. Given a local state vector of 2^m complex amplitudes, each node reserves an extra 2^(m−1) words of memory as temporary storage. Each local state vector is logically partitioned into two halves. Both nodes perform a pair-wise exchange of these halves: P_i sends its first half to P_j, while P_j sends its second half to P_i. Each node places the received half into its own temporary storage. Next, P_i applies Q to its second half and the temporary storage (which contains the second half of P_j), while P_j applies Q to its first half and the temporary storage (which contains the first half of P_i). This results in P_i updating P_j's second half, and P_j updating P_i's first half. This is followed by another pair-wise exchange, where P_i sends P_j's updated half back to P_j, while P_j sends P_i's updated half back to P_i. This completes the distributed state update.

For a 40-qubit system, when no communication is required, single- and two-qubit controlled gate operations are memory bandwidth bound and take 0.43 and 0.21 seconds, respectively. When communication is required, these gate operations become network bandwidth bound, and their run-time increases by 10×, which is commensurate with the memory-to-network bandwidth ratio on Stampede. Cache blocking optimization results in an additional ≈ 2.56× run-time reduction of these gate operations. Finally, using a 1024-node distributed simulation, we simulate the 40-qubit quantum Fourier transform, an important kernel of many quantum algorithms, in 997 seconds.
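The half-exchange scheme above can be sketched in a few lines. The following is an illustrative single-process mock-up, not the paper's code: Python lists stand in for the two nodes' local state vectors and for the network transfers, and the names (`apply_q`, `local_gate`, `distributed_gate`) are our own.

```python
def apply_q(q, a, b):
    """Apply a 2x2 single-qubit gate q to the amplitude pair (a, b)."""
    return (q[0][0] * a + q[0][1] * b,
            q[1][0] * a + q[1][1] * b)

def local_gate(state, k, q):
    """k < m: both elements of every pair (t, t + 2^k) are on the same node."""
    s = 1 << k  # stride between paired amplitudes
    for base in range(0, len(state), 2 * s):
        for t in range(base, base + s):
            state[t], state[t + s] = apply_q(q, state[t], state[t + s])

def distributed_gate(p_i, p_j, q):
    """k >= m: amplitudes at the same local offset on P_i and P_j form pairs.
    Mock-up of the pair-wise half-exchange scheme; list assignments stand in
    for the sends and receives between the node pair."""
    h = len(p_i) // 2
    # Exchange 1: P_i sends its first half to P_j, P_j sends its second half
    # to P_i; each node places the received half in its temporary storage.
    tmp_i = p_j[h:]   # P_i's temp storage: second half of P_j
    tmp_j = p_i[:h]   # P_j's temp storage: first half of P_i
    # Update: P_i applies Q to its own second half and its temp storage;
    # P_j applies Q to its temp storage and its own first half.
    for t in range(h):
        p_i[h + t], tmp_i[t] = apply_q(q, p_i[h + t], tmp_i[t])
        tmp_j[t], p_j[t] = apply_q(q, tmp_j[t], p_j[t])
    # Exchange 2: the updated remote halves are sent back.
    p_j[h:] = tmp_i   # P_i returns P_j's updated second half
    p_i[:h] = tmp_j   # P_j returns P_i's updated first half

# Demo with m = 2 local qubits per node and Q = X (NOT): applying the gate
# to the node-index qubit swaps the two nodes' state vectors.
X = [[0, 1], [1, 0]]
p_i = [complex(v) for v in (1, 2, 3, 4)]
p_j = [complex(v) for v in (5, 6, 7, 8)]
distributed_gate(p_i, p_j, X)  # now p_i == [5, 6, 7, 8], p_j == [1, 2, 3, 4]
```

A real implementation would replace the temp-storage assignments with message passing between the node pair; note that each node only ever buffers one half-state at a time, matching the 2^(m−1) words of reserved temporary storage.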
Table 1: Examples of TOP500 supercomputing systems, their memory capacity (in petabytes), and the largest quantum system they can simulate.

While the aggregate memory capacity of a particular HPC system is fixed, the quantum simulation time can be further improved. Herein we describe the implementation of qHiPSTER and the optimizations required to achieve high performance and high hardware efficiency on the Stampede supercomputer. Using 1024 nodes, the maximum available allocation, we simulate quantum circuits of up to 40 qubits.
While there exist a number of techniques to simulate specific classes of quantum circuits efficiently, simulation of generic quantum circuits on classical computers is very inefficient, due to the exponential overhead. Specifically, the fundamental challenge is that the size of the state, or the number of quantum amplitudes, grows exponentially with the number of qubits. Given n qubits, the size of the state vector is 2^n complex amplitudes, or 2^(n+4) bytes.¹ Thus, the memory capacity of the classical system imposes an upper bound on the size of the simulation. In addition, the size of the quantum circuit (its number of gates) can result in significant run-time requirements on the classical system.

¹ Here and in the rest of the paper we assume complex double precision, with an eight-byte real and an eight-byte imaginary part.
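The 2^(n+4)-byte bound makes the capacity limit easy to reproduce. A small sketch of the arithmetic (our own illustration; `max_qubits` is not from the paper):

```python
def max_qubits(capacity_bytes):
    """Largest n such that a state vector of 2**(n + 4) bytes
    (2**n amplitudes x 16 bytes each) fits in the given capacity."""
    return capacity_bytes.bit_length() - 1 - 4  # floor(log2(capacity)) - 4

# Example: one (binary) petabyte of aggregate memory supports 46 qubits,
# since 2**46 amplitudes * 16 bytes = 2**50 bytes.
print(max_qubits(2**50))  # -> 46
```

This counts only the state vector itself; the per-node temporary storage needed for communication (an extra half state) and any other workspace reduce the bound slightly in practice.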