cudaMemcpyAsync (P2P D2D) serializes with kernel execution
22:56 07 Feb 2026

Hi all — I’m debugging an unexpected ordering/progress issue with peer-to-peer device copies and I’d like help understanding what CUDA/driver/runtime behavior could explain it.

Setup

  • Single node, 2 H100 NVL GPUs, Linux, CUDA 12.9

  • Two processes (one GPU per process)

  • Using CUDA IPC to access peer GPU memory (IPC handle opened in the other process)

  • A CPU “proxy” thread issues the GPU-to-GPU copies; it is triggered from the GPU kernel (MSCCLPP port channel, putWithSignal):

cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, copy_stream);
  • Compute kernels are launched on a different non-default stream:
Kernel0<<<..., compute_stream>>>(...); // some CUTLASS kernel A
Kernel1<<<..., compute_stream>>>(...); // some CUTLASS kernel B
Kernel2<<<..., compute_stream>>>(...); // some user reduction kernel

  • copy_stream != compute_stream, both are non-default and non-blocking

  • No sync between kernels (no cudaDeviceSynchronize, no cudaStreamSynchronize, etc.)

  • Only a sync at the very end of the overall loop
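
For reference, the overall enqueue pattern looks roughly like the sketch below (stream setup shown for completeness; grid/block dimensions, kernel arguments, and the IPC handle-opening code are elided, so this is a simplified illustration, not the actual code):

```cuda
cudaStream_t compute_stream, copy_stream;
cudaStreamCreateWithFlags(&compute_stream, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&copy_stream, cudaStreamNonBlocking);

// Main thread, compute side:
Kernel0<<<..., compute_stream>>>(...);   // CUTLASS kernel A
Kernel1<<<..., compute_stream>>>(...);   // CUTLASS kernel B
Kernel2<<<..., compute_stream>>>(...);   // user reduction kernel

// Proxy thread, fired while Kernel 0 is still running
// (dst is the peer pointer obtained via cudaIpcOpenMemHandle in this process):
cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, copy_stream);

// Only sync at the very end of the loop iteration:
cudaStreamSynchronize(compute_stream);
cudaStreamSynchronize(copy_stream);
```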


What I observe (problem case: Kernel 2 present)

Timeline (simplified):

  1. Launch Kernel 0 on compute_stream

  2. While Kernel 0 is still running, the proxy thread calls cudaMemcpyAsync(..., copy_stream)

    • Nsight Systems CUDA API trace shows the cudaMemcpyAsync call occurs during Kernel 0 execution
  3. Launch Kernel 1 on compute_stream

  4. Launch Kernel 2 on compute_stream

  5. The actual P2P memcpy on the GPU does not start until after Kernel 1 completes

    • In Nsight Systems, the memcpy shows up only once the compute has progressed past Kernel 1 / later kernels (i.e., no overlap), even though it was enqueued much earlier

So: the copy is submitted while Kernel 0 is running, but it’s executed only after the later computation (Kernel 1 / subsequent work) completes.
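
To cross-check the Nsight timeline, one way to timestamp the device-side copy independently of the host API trace is to bracket it with events on copy_stream. This is a sketch of the measurement idea, not code from the actual proxy thread:

```cuda
// Bracket the copy with events: since nothing else is queued on copy_stream,
// the "before" event completes immediately, so the elapsed time includes any
// stall before the copy actually starts on the device, plus the copy itself.
cudaEvent_t before, after;
cudaEventCreate(&before);
cudaEventCreate(&after);

cudaEventRecord(before, copy_stream);
cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, copy_stream);
cudaEventRecord(after, copy_stream);

// ... at the end-of-loop sync point:
cudaEventSynchronize(after);
float ms = 0.f;
cudaEventElapsedTime(&ms, before, after);  // stall + copy duration, in ms
```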

![image|690x489](upload://14KgyCdOn74rEY9ess4oTDwhGvP.png) The proxy thread fires the cudaMemcpyAsync(.., copy_stream) while Kernel 0 is executing

![image|690x392](upload://lyF95DOQcH4IWRzQT4kIzrTEBpX.png) The device-side execution of the P2P memcpy starts only after Kernel 1 completes


Control case (Kernel 2 removed → everything works)

If I remove Kernel 2 from the sequence (i.e., only Kernel 0 then Kernel 1), then the behavior is perfect:

  • The cudaMemcpyAsync enqueued during Kernel 0 overlaps as expected (copy begins promptly on copy_stream)

  • No unexpected delay/serialization

(The issue also does not happen if the kernel enqueue pattern is Kernel 0, Kernel 2, Kernel 1, Kernel 2)

![image|678x499](upload://aZKkAICivosLSWyQtaKQ15ze6CQ.png) Without Kernel 2 after the host-side cudaMemcpyAsync, the P2P memcpy is picked up during Kernel 0 execution.

Notes

  • The host-side cudaMemcpyAsync call itself is not delayed; it is issued (and returns) while Kernel 0 is still executing.

  • The copy stream is different from the compute stream.

  • Behavior is repeatable/deterministic across driver versions (590.48.01, 575.57.08)

Questions

  1. Under what conditions can a P2P D2D cudaMemcpyAsync (on a separate stream) be implicitly delayed until later computation completes?

  2. Are there known constraints/serialization behaviors for P2P copies when src/dst memory is accessed via CUDA IPC across processes (multi-process), even if streams are different?
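
For context on question 1: no explicit cross-stream dependency is expressed anywhere. If I wanted the copy to wait on Kernel 0, I would have written something like the sketch below, which I am explicitly not doing, so any serialization seen here must be implicit (driver/hardware-level):

```cuda
// Explicit ordering (NOT present in my code), for contrast:
cudaEvent_t k0_done;
cudaEventCreateWithFlags(&k0_done, cudaEventDisableTiming);
cudaEventRecord(k0_done, compute_stream);       // recorded after Kernel 0 launch
cudaStreamWaitEvent(copy_stream, k0_done, 0);   // copy would wait for Kernel 0
cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, copy_stream);
```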

asynchronous optimization cuda overlap cuda-streams