Why are ReLU-based activations commonly used to approximate functions instead of arbitrary nonlinear functions?

I’m trying to understand function approximation in neural networks.

When we say that a neural network with ReLU activations can approximate a given function, the usual justification is that the function can be constructed (or approximated) from combinations of ReLU units (a small sketch of what I mean is below).

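For concreteness, here is a minimal sketch of that construction. It is not a trained network: it just fits the output-layer weights of a one-hidden-layer ReLU model by least squares, and the target sin(x), the knot positions, and the layer width are all made up for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.linspace(-3, 3, 200)
target = np.sin(x)             # arbitrary target function for illustration
knots = np.linspace(-3, 3, 8)  # points where the piecewise-linear fit may bend

# One shifted ReLU per knot, plus a bias and a linear term: this is the
# function class a single hidden ReLU layer (with fixed input weights) spans.
features = np.column_stack([relu(x[:, None] - knots), np.ones_like(x), x])

# Solve for the output-layer weights directly (a trained network would reach
# similar weights by gradient descent).
w, *_ = np.linalg.lstsq(features, target, rcond=None)
approx = features @ w

print("max abs error:", np.abs(approx - target).max())
```
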
My question is:
why do we specifically rely on ReLU (or similar standard activations) for this argument?

In other words, why not use some arbitrary nonlinear function as the activation?
Wouldn’t any nonlinearity also let us approximate complex functions? (A toy example of what I mean is below.)

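For example, I could define a non-standard activation in TensorFlow/Keras like this (the activation x + sin(x) and the layer sizes are just placeholders I made up, not something I specifically need):

```python
import tensorflow as tf

def odd_activation(x):
    # an arbitrary smooth nonlinearity, not one of the standard activations
    return x + tf.sin(x)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(32, activation=odd_activation),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```

Keras accepts any callable as an activation, so nothing stops me from doing this, which is why I'm asking what actually makes the standard choices preferable.
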
Is there something special about ReLU (or commonly used activations like sigmoid/tanh) that makes them preferable for function approximation, beyond just “being nonlinear”?

tensorflow machine-learning artificial-intelligence