Latency of warp add reduction instruction
The CUDA Programming Guide describes a warp-level instruction named __reduce_add_sync.
What is the latency of this instruction?
Related sources:
This table within the guide describes throughput but not latency.
This question discusses the latency of _shfl as a function of its arguments, but does not provide numbers.
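In the absence of published numbers, the latency could be estimated empirically. Below is a minimal microbenchmark sketch, not a definitive method: it times a chain of dependent __reduce_add_sync calls with clock64() in a single warp and divides by the chain length. The kernel name reduce_add_chain, the helper cycles_per_call, and the chain length kIters are hypothetical names chosen for illustration. It assumes a device of compute capability 8.0 or higher (for example, built with nvcc -arch=sm_80), and the result includes loop overhead, so it should be read as an upper bound.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kIters = 1024;  // hypothetical chain length, chosen for illustration

// Host helper (hypothetical): average cycles per call, loop overhead included.
double cycles_per_call(long long total_cycles, int iters) {
    return static_cast<double>(total_cycles) / iters;
}

// Each iteration consumes the previous result, so the calls cannot overlap
// and the elapsed time approximates kIters times the instruction latency.
__global__ void reduce_add_chain(unsigned seed, long long *cycles, unsigned *sink) {
    unsigned v = seed + threadIdx.x;
    long long start = clock64();
#pragma unroll 1
    for (int i = 0; i < kIters; ++i) {
        v = __reduce_add_sync(0xffffffffu, v);  // full-warp mask; sum wraps on overflow
    }
    long long stop = clock64();
    sink[threadIdx.x] = v;  // store the result so the chain is not optimized away
    if (threadIdx.x == 0) {
        *cycles = stop - start;
    }
}

int main() {
    long long *d_cycles = nullptr;
    unsigned *d_sink = nullptr;
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMalloc(&d_sink, 32 * sizeof(unsigned));

    // One block of exactly one warp, so nothing else competes for the SM.
    reduce_add_chain<<<1, 32>>>(1u, d_cycles, d_sink);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    long long h_cycles = 0;
    cudaMemcpy(&h_cycles, d_cycles, sizeof(h_cycles), cudaMemcpyDeviceToHost);
    printf("approx. %.1f cycles per __reduce_add_sync (loop overhead included)\n",
           cycles_per_call(h_cycles, kIters));

    cudaFree(d_cycles);
    cudaFree(d_sink);
    return 0;
}
```

The measured value is likely to vary between architectures and compiler versions, so inspecting the generated SASS (for example with cuobjdump -sass) to confirm that the loop contains the expected reduction instruction would be a sensible sanity check.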