fix: Resolved #993 - Support >32 Qubit simulations on AMD GPUs via 3D grid folding#1016
Open
sun-9545sunoj wants to merge 4 commits intoquantumlib:mainfrom
Open
fix: Resolved #993 - Support >32 Qubit simulations on AMD GPUs via 3D grid folding#1016sun-9545sunoj wants to merge 4 commits intoquantumlib:mainfrom
sun-9545sunoj wants to merge 4 commits intoquantumlib:mainfrom
Conversation
- Implement 64-bit indexing and 2D grid dispatch for >32 qubit support - Add GPU-accelerated Hybrid Simulator (qsimh_base_cuda.cu) - Fix ROCm/HIP compilation issues and namespace collisions - Verify 34-qubit state vector and 43-qubit hybrid simulations
- Support grids larger than 65535^2 using blockIdx.z - Logic supports up to 65535^3 blocks (~2.8e14), theoretically enabling >60 qubit indexing - Verified backward compatibility with 34-qubit circuit
Contributor
There was a problem hiding this comment.
Code Review
This pull request introduces significant changes to support simulations with more than 32 qubits on AMD GPUs by implementing 3D grid folding and 64-bit indexing. The changes are comprehensive and address the core issue. My review focuses on ensuring the correctness and maintainability of these new features. I've found several critical issues in the CUDA kernels where the 3D block ID is not calculated correctly, which could lead to incorrect simulation results. Additionally, there are some correctness issues with command-line argument parsing and opportunities to improve maintainability by reducing code duplication.
…id folding - Implemented 64-bit indexing to prevent overflow beyond 31 qubits. - Added 3D grid dispatch to bypass hardware block limits. - Introduced GPU-accelerated hybrid simulator (qsimh). - Isolated namespaces in vectorspace_cuda.h for ROCm compatibility.
efa9d2f to
a53e0ff
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR provides a comprehensive solution to Issue #993, addressing the hipErrorInvalidConfiguration encountered when dispatching circuits beyond 31 qubits on AMD hardware.$x$ -dimension limit (65,535 blocks). Large workloads are now distributed across $(x, y, z)$ dimensions, supporting the massive thread counts required for high-qubit states.$2^{31}$ amplitudes, which occurs at the 32-qubit boundary.
By refactoring the dispatch logic and indexing, qsim can now successfully run simulations of 32+ qubits on high-memory devices like the AMD MI300X.
The Solution
Resolved Dispatch Limits:
Implemented 3D grid folding in CreateGrid to bypass the hardware-specific 1D
64-bit Indexing:
Replaced 32-bit signed integers with uint64_t for state-vector addressing. This prevents index overflow when the state space exceeds
Unlocked >32 Qubit Support:
Full State-Vector: Successfully verified 34 qubits (~128GB VRAM) on a single MI300X.
Hybrid Simulation: Introduced a GPU-accelerated hybrid simulator (qsimh_base_cuda.cu) to enable 32+ runs by partitioning the state space into manageable segments.
Verification & Benchmarks
Environment: AMD MI300X (192GB), ROCm 7.1.0.
Regression: Small-scale circuits (< 30 qubits) run with 100% accuracy.
Stress Test: Verified a 50-qubit hybrid simulation with 100% GPU utilization and sustained 750W power draw.
Correctness: Confirmed that 64-bit block IDs are correctly calculated across multi-dimensional grids.
Modified Files
apps/qsimh_base_cuda.cu (New Hybrid Simulator)
lib/cuda2hip.h (ROCm compatibility)
lib/simulator_cuda.h (3D Dispatch)
lib/simulator_cuda_kernels.h (64-bit Kernels)
lib/statespace_cuda.h (Grid Folding)
lib/statespace_cuda_kernels.h (Block ID Helpers)
lib/vectorspace_cuda.h (Namespace Isolation)