Latency of warp add reduction instruction
14:24 10 Mar 2026

The CUDA Programming Guide describes a warp instruction named __reduce_add_sync.
What is the latency of the function?
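
For context, here is a minimal sketch of how the latency could be estimated empirically: a single warp runs a chain of dependent __reduce_add_sync calls and times it with clock64(). The kernel name reduce_add_latency, the chain length kIters, and the seed and sink parameters are illustrative choices, not anything from the guide. The figure includes loop overhead and the compiler is free to schedule around the clock reads, so it should be read as a rough estimate only. It assumes a device of compute capability 8.0 or higher (e.g. built with nvcc -arch=sm_80).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kIters = 1024;  // length of the dependent chain (arbitrary choice)

// One warp executes kIters dependent reductions and records the elapsed cycles.
__global__ void reduce_add_latency(unsigned seed, long long *cycles, unsigned *sink)
{
    unsigned v = seed + threadIdx.x;

    long long start = clock64();
    #pragma unroll 1
    for (int i = 0; i < kIters; ++i) {
        // Each call consumes the previous result, so the calls cannot overlap.
        v = __reduce_add_sync(0xffffffffu, v);
    }
    long long stop = clock64();

    if (threadIdx.x == 0) {
        *cycles = stop - start;
        *sink = v;  // keep the result live so the loop is not optimized away
    }
}

int main()
{
    long long *d_cycles = nullptr;
    unsigned *d_sink = nullptr;
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMalloc(&d_sink, sizeof(unsigned));

    // A single full warp, so all 32 lanes named in the mask participate.
    reduce_add_latency<<<1, 32>>>(1u, d_cycles, d_sink);

    long long cycles = 0;
    cudaError_t err = cudaMemcpy(&cycles, d_cycles, sizeof(cycles),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("approx. cycles per __reduce_add_sync (incl. loop overhead): %.1f\n",
                static_cast<double>(cycles) / kIters);

    cudaFree(d_cycles);
    cudaFree(d_sink);
    return 0;
}
```

Inspecting the generated SASS (e.g. with cuobjdump -sass) would show whether the loop body really is a single reduction instruction between the two clock reads.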

Related sources:
This table within the guide describes throughput but not latency.
This question discusses the latency of __shfl as a function of its arguments, but does not provide numbers.

cuda gpu nvidia gpu-warp