Dear cuQuantum maintainers, this is a very general question regarding cuQuantum integration with other libraries. It was mentioned by @leofang in an earlier discussion thread that NVIDIA has internal code that integrates cuQuantum with PyTorch, and that the same could be done with JAX. Have any internal benchmarks been done on the speed difference between these two approaches? Is one likely to gain a lot on the speed front from using JAX + cuQuantum compared with PyTorch + cuQuantum, especially with cuStateVec? Or is the main bottleneck of state-vector simulation already alleviated by cuQuantum itself, so that the speed difference between the two approaches would be minimal? Thanks in advance for any insights on this.
Replies: 2 comments 5 replies
Hi @wcqc, can you clarify what is meant by "integration" in this context? Can you provide an example workload utilizing cuQuantum together with one of these libraries? It would be useful to know how elements of cuQuantum's software would collaborate with other libraries, to better steer any discussion regarding expectations around performance.
Sorry for the wrong link; indeed, discussion/72 was the correct thread.
If assumptions (1, 2) above are broken, does that mean we get less benefit from cuQuantum (as it currently stands), due to the extra overhead induced when assumptions (1, 2) no longer hold?
The Qiskit example is really useful here, since AFAIK it doesn't use JAX. Suppose we implement a hybrid model in which all the assumptions (1, 2, 3, 4) hold, once in Qiskit and once in JAX (+ cuQuantum) taking full advantage of JIT, XLA, etc. As rough guidance, do you see major speed advantages in taking the JAX route compared with Qiskit? We fully appreciate the caveat that the best comparison can only be obtained by coding up the models and benchmarking them, but any rough insights would be much appreciated at this stage.
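For concreteness, here is a minimal NumPy sketch (no cuQuantum or JAX; all names are illustrative) of the kind of parameterized-circuit workload we have in mind. The gate-application inner loop is the part a cuStateVec-style backend would offload to the GPU, whether driven from Qiskit, PyTorch, or JAX:

```python
import numpy as np

def apply_1q(state, gate, target, n_qubits):
    # Apply a single-qubit gate to one axis of the state tensor.
    # This inner kernel is what a cuStateVec-style backend would
    # replace with a GPU gate-application call.
    state = state.reshape([2] * n_qubits)
    state = np.tensordot(gate, state, axes=([1], [target]))
    state = np.moveaxis(state, 0, target)
    return state.reshape(-1)

def rx(theta):
    # Single-qubit rotation about X.
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

n = 3
thetas = [0.1, 0.2, 0.3]

# Forward pass: |0...0> through a layer of RX gates.
state = np.zeros(2**n, dtype=complex)
state[0] = 1.0
for q, th in enumerate(thetas):
    state = apply_1q(state, rx(th), q, n)

# Expectation of Z on qubit 0 (qubit 0 is the most-significant
# axis in this reshape layout).
probs = np.abs(state) ** 2
z0 = np.where(np.arange(2**n) < 2**(n - 1), 1.0, -1.0)
expval = float(probs @ z0)
```

The framework question then reduces to how efficiently each library can batch and fuse these kernel calls around the rest of the hybrid model.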
Regarding (1, 2): the more intermingled the model is, the more time you will spend in cuQuantum (say, cuStateVec) computing intermediate gradient data in the backward pass. The more the model looks like a traditional circuit, the more opportunities there are to optimize and pipeline the circuit gradient with fewer API calls and memory transactions.
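To make the backward-pass point concrete, here is a minimal NumPy sketch of adjoint-style circuit differentiation (illustrative only; this is not the cuStateVec API). Each gradient entry costs extra gate applications on intermediate state vectors, and those kernel calls are exactly where a GPU statevector backend would spend its time in the backward pass:

```python
import numpy as np

def apply_1q(state, gate, target, n_qubits):
    # Apply a single-qubit gate to one axis of the state tensor.
    state = state.reshape([2] * n_qubits)
    state = np.tensordot(gate, state, axes=([1], [target]))
    state = np.moveaxis(state, 0, target)
    return state.reshape(-1)

def rx(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def drx(theta):
    # d/dtheta of rx(theta).
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return 0.5 * np.array([[-s, -1j * c], [-1j * c, -s]])

n = 3
thetas = [0.1, 0.2, 0.3]

# Forward pass.
psi = np.zeros(2**n, dtype=complex)
psi[0] = 1.0
for q, th in enumerate(thetas):
    psi = apply_1q(psi, rx(th), q, n)

# Diagonal observable: Z on qubit 0 (most-significant axis here).
z0 = np.where(np.arange(2**n) < 2**(n - 1), 1.0, -1.0)

# Backward (adjoint) pass: every gradient entry costs extra gate
# applications on the two intermediate state vectors lam and b --
# these are the "intermediate gradient data" kernel calls.
lam = z0 * psi
b = psi.copy()
grads = [0.0] * n
for q in reversed(range(n)):
    g = rx(thetas[q])
    b = apply_1q(b, g.conj().T, q, n)        # rewind the circuit
    mu = apply_1q(b, drx(thetas[q]), q, n)   # insert gate derivative
    grads[q] = 2 * np.real(np.vdot(lam, mu))
    lam = apply_1q(lam, g.conj().T, q, n)
```

In a "traditional circuit" model, these rewind/derivative steps can be batched and pipelined with few API calls; the more the circuit is broken up by classical code, the more round trips the backward pass incurs.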