Hi all — I’m debugging an unexpected ordering/progress issue with peer-to-peer device copies and I’d like help understanding what CUDA/driver/runtime behavior could explain it.
Setup
- Single node, 2× H100 NVL GPUs, Linux, CUDA 12.9
- Two processes (one GPU per process)
- Using CUDA IPC to access peer GPU memory (the IPC handle is opened in the other process)
- A CPU “proxy” thread issues GPU-to-GPU copies, triggered from the GPU kernel (MSCCLPP port channel, putWithSignal):
  cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, copy_stream);
- Compute kernels are launched on a different non-default stream:
  Kernel0<<<..., compute_stream>>>(...); // some CUTLASS kernel A
  Kernel1<<<..., compute_stream>>>(...); // some CUTLASS kernel B
  Kernel2<<<..., compute_stream>>>(...); // some user reduction kernel
- copy_stream != compute_stream; both are non-default and non-blocking
- No synchronization between kernels (no cudaDeviceSynchronize, no cudaStreamSynchronize, etc.)
- Only a single sync at the very end of the overall loop
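To make the pattern concrete, here is a condensed sketch of the enqueue structure (simplified: kernel bodies, the MSCCLPP trigger path, error checking, and the handle exchange are all omitted; `grid`/`block` are placeholders):

```cuda
// Both streams are created non-blocking, so neither should implicitly
// synchronize with the default stream.
cudaStream_t compute_stream, copy_stream;
cudaStreamCreateWithFlags(&compute_stream, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&copy_stream,    cudaStreamNonBlocking);

// Proxy thread (fires while Kernel 0 is still running on compute_stream);
// src is local device memory, dst is peer memory mapped via CUDA IPC:
cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, copy_stream);

// Main thread, same process — no sync between these launches:
Kernel0<<<grid, block, 0, compute_stream>>>(...);  // CUTLASS kernel A
Kernel1<<<grid, block, 0, compute_stream>>>(...);  // CUTLASS kernel B
Kernel2<<<grid, block, 0, compute_stream>>>(...);  // user reduction kernel

// Single sync only at the very end of the overall loop:
cudaStreamSynchronize(compute_stream);
cudaStreamSynchronize(copy_stream);
```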
What I observe (problem case: Kernel 2 present)
Timeline (simplified):
- Launch Kernel 0 on compute_stream
- While Kernel 0 is still running, the proxy thread calls cudaMemcpyAsync(..., copy_stream)
  - The Nsight Systems CUDA API trace confirms the cudaMemcpyAsync call occurs during Kernel 0 execution
- Launch Kernel 1 on compute_stream
- Launch Kernel 2 on compute_stream
- The actual P2P memcpy on the GPU does not start until after Kernel 1 completes
  - In Nsight Systems, the memcpy shows up only once compute has progressed past Kernel 1 / later kernels (i.e., no overlap), even though it was enqueued much earlier

So: the copy is submitted (host side) while Kernel 0 is running, but the device-side execution of the P2P memcpy is picked up only after Kernel 1 and the subsequent work complete.
Control case (Kernel 2 removed → everything works)
If I remove Kernel 2 from the sequence (i.e., only Kernel 0 then Kernel 1), then the behavior is perfect:
- The cudaMemcpyAsync enqueued during Kernel 0 overlaps as expected (the copy begins promptly on copy_stream)
- No unexpected delay/serialization
- (The issue also does not happen if the kernel enqueue pattern is Kernel0, Kernel2, Kernel1, Kernel2)
- Without Kernel 2, the device-side P2P copy is picked up during Kernel 0 execution, shortly after the host-side cudaMemcpyAsync call
Notes
- The host-side cudaMemcpyAsync call is not delayed; it returns during kernel execution time
- The copy stream is different from the compute stream
- The behavior is repeatable/deterministic and reproduces across driver versions (590.48.01, 575.57.08)
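For reference, the peer memory is mapped with the standard CUDA IPC pattern, roughly as below (simplified sketch; the actual handle transport between the two processes is elided, and `bytes` is a placeholder):

```cuda
// Exporting process (owns the allocation on its GPU):
void* src_buf = nullptr;
cudaMalloc(&src_buf, bytes);
cudaIpcMemHandle_t handle;
cudaIpcGetMemHandle(&handle, src_buf);
// ... handle is shipped to the other process (socket / shared memory) ...

// Importing process (the one whose proxy thread issues the copy):
void* peer_ptr = nullptr;
cudaIpcOpenMemHandle(&peer_ptr, handle, cudaIpcMemLazyEnablePeerAccess);
// peer_ptr is then used as the dst (or src) of the cudaMemcpyAsync above
```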
Questions
- Under what conditions can a P2P D2D cudaMemcpyAsync (on a separate stream) be implicitly delayed until later computation completes?
- Are there known constraints/serialization behaviors for P2P copies when src/dst memory is accessed via CUDA IPC across processes (multi-process), even if the streams are different?