cudaMemcpyAsync (P2P D2D) serializes with kernel execution
22:56 07 Feb 2026

Hi all — I’m debugging an unexpected ordering/progress issue with peer-to-peer device copies and I’d like help understanding what CUDA/driver/runtime behavior could explain it.

Setup

  • Single node, 2 H100 NVL GPUs, Linux, CUDA 12.9

  • Two processes (one GPU per process)

  • Using CUDA IPC to access peer GPU memory (IPC handle opened in the other process)

  • A CPU “proxy” thread issues the GPU-to-GPU copies; it is triggered from the GPU kernel (MSCCLPP port channel, putWithSignal):

cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, copy_stream);
  • Compute kernels are launched on a different non-default stream:
Kernel0<<<..., compute_stream>>>(...); // some CUTLASS kernel A
Kernel1<<<..., compute_stream>>>(...); // some CUTLASS kernel B
Kernel2<<<..., compute_stream>>>(...); // some user reduction kernel

  • copy_stream != compute_stream, both are non-default and non-blocking

  • No sync between kernels (no cudaDeviceSynchronize, no cudaStreamSynchronize, etc.)

  • Only a sync at the very end of the overall loop
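
For reference, the overall enqueue pattern looks roughly like the sketch below (stream setup shown for completeness; grid/block dimensions, kernel arguments, and the IPC handle-opening code are elided, so this is a simplified illustration, not the actual code):

```cuda
cudaStream_t compute_stream, copy_stream;
cudaStreamCreateWithFlags(&compute_stream, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&copy_stream, cudaStreamNonBlocking);

// Main thread, compute side:
Kernel0<<<..., compute_stream>>>(...);   // CUTLASS kernel A
Kernel1<<<..., compute_stream>>>(...);   // CUTLASS kernel B
Kernel2<<<..., compute_stream>>>(...);   // user reduction kernel

// Proxy thread, fired while Kernel 0 is still running
// (dst is the peer pointer obtained via cudaIpcOpenMemHandle in this process):
cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, copy_stream);

// Only sync at the very end of the loop iteration:
cudaStreamSynchronize(compute_stream);
cudaStreamSynchronize(copy_stream);
```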


What I observe (problem case: Kernel 2 present)

Timeline (simplified):

  1. Launch Kernel 0 on compute_stream

  2. While Kernel 0 is still running, the proxy thread calls cudaMemcpyAsync(..., copy_stream)

    • Nsight Systems CUDA API trace shows the cudaMemcpyAsync call occurs during Kernel 0 execution
  3. Launch Kernel 1 on compute_stream

  4. Launch Kernel 2 on compute_stream

  5. The actual P2P memcpy on the GPU does not start until after Kernel 1 completes

    • In Nsight Systems, the memcpy shows up only once the compute has progressed past Kernel 1 / later kernels (i.e., no overlap), even though it was enqueued much earlier

So: the copy is submitted while Kernel 0 is running, but it’s executed only after the later computation (Kernel 1 / subsequent work) completes.
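
To cross-check the Nsight timeline, one way to timestamp the device-side copy independently of the host API trace is to bracket it with events on copy_stream. This is a sketch of the measurement idea, not code from the actual proxy thread:

```cuda
// Bracket the copy with events: since nothing else is queued on copy_stream,
// the "before" event completes immediately, so the elapsed time includes any
// stall before the copy actually starts on the device, plus the copy itself.
cudaEvent_t before, after;
cudaEventCreate(&before);
cudaEventCreate(&after);

cudaEventRecord(before, copy_stream);
cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, copy_stream);
cudaEventRecord(after, copy_stream);

// ... at the end-of-loop sync point:
cudaEventSynchronize(after);
float ms = 0.f;
cudaEventElapsedTime(&ms, before, after);  // stall + copy duration, in ms
```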

![image|690x489](upload://14KgyCdOn74rEY9ess4oTDwhGvP.png) The proxy thread fires the cudaMemcpyAsync(.., copy_stream) while Kernel 0 is executing

![image|690x392](upload://lyF95DOQcH4IWRzQT4kIzrTEBpX.png) The device-side execution of the P2P memcpy starts only after Kernel 1 completes


Control case (Kernel 2 removed → everything works)

If I remove Kernel 2 from the sequence (i.e., only Kernel 0 then Kernel 1), then the behavior is perfect:

  • The cudaMemcpyAsync enqueued during Kernel 0 overlaps as expected (copy begins promptly on copy_stream)

  • No unexpected delay/serialization

(The issue also does not happen if the kernel enqueue pattern is Kernel 0, Kernel 2, Kernel 1, Kernel 2)

![image|678x499](upload://aZKkAICivosLSWyQtaKQ15ze6CQ.png) Without Kernel 2 after the host-side cudaMemcpyAsync, the P2P memcpy is picked up during Kernel 0 execution.

Notes

  • The host-side cudaMemcpyAsync call itself is not delayed; it is issued (and returns) while Kernel 0 is still executing.

  • The copy stream is different from the compute stream.

  • Behavior is repeatable/deterministic across driver versions (590.48.01, 575.57.08)

Questions

  1. Under what conditions can a P2P D2D cudaMemcpyAsync (on a separate stream) be implicitly delayed until later computation completes?

  2. Are there known constraints/serialization behaviors for P2P copies when src/dst memory is accessed via CUDA IPC across processes (multi-process), even if streams are different?
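
For context on question 1: no explicit cross-stream dependency is expressed anywhere. If I wanted the copy to wait on Kernel 0, I would have written something like the sketch below, which I am explicitly not doing, so any serialization seen here must be implicit (driver/hardware-level):

```cuda
// Explicit ordering (NOT present in my code), for contrast:
cudaEvent_t k0_done;
cudaEventCreateWithFlags(&k0_done, cudaEventDisableTiming);
cudaEventRecord(k0_done, compute_stream);       // recorded after Kernel 0 launch
cudaStreamWaitEvent(copy_stream, k0_done, 0);   // copy would wait for Kernel 0
cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, copy_stream);
```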

asynchronous optimization cuda overlap cuda-streams