Cybenko’s universal approximation theorem (in its later, generalized form) states, per Wikipedia, that
feedforward networks with non-polynomial activation functions are dense in the space of continuous functions between two Euclidean spaces, with respect to the compact convergence topology.
My understanding of this is that for any continuous function $f : \mathbb{R}^n \to \mathbb{R}^m$, a feedforward network with a single hidden layer and a non-polynomial activation can approximate $f$ to arbitrary precision on any compact subset of $\mathbb{R}^n$.
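In symbols (a hedged formalization of the statement above; the symbols $K$, $\sigma$, $A$, $b$, $C$, $k$, and $\varepsilon$ are notation I am introducing, not taken from the source): for any compact set $K \subset \mathbb{R}^n$, any continuous $f : K \to \mathbb{R}^m$, any continuous non-polynomial activation $\sigma$, and any $\varepsilon > 0$, there exist a hidden width $k$, weights $A \in \mathbb{R}^{k \times n}$, biases $b \in \mathbb{R}^k$, and output weights $C \in \mathbb{R}^{m \times k}$ such that

$$\sup_{x \in K} \lVert f(x) - C\,\sigma(Ax + b) \rVert < \varepsilon,$$

where $\sigma$ is applied element-wise.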
At a high level, the proof is along these lines:
- Any step in $\mathbb{R}^n$ can be exactly described by scaling and offsetting the vector Heaviside step function $H(\mathbf{w} \cdot \mathbf{x} + b)$; and therefore
- Any rectangular region in $\mathbb{R}^n$ can be exactly described by such functions.
- The mean value theorem for integrals states that any continuous function can be approximated to arbitrary precision by summing rectangular regions.
- Therefore, any continuous function can be approximated to arbitrary precision by some number of Heaviside step functions.
- We recall that $H(\mathbf{w} \cdot \mathbf{x} + b)$ is the definition of a perceptron unit.
- Therefore, any continuous function can be approximated to arbitrary precision by some number of perceptron units.
- A multi-layer perceptron with a single hidden layer consists of one layer of such perceptron units, with an input layer corresponding to $\mathbf{x}$ and an output layer corresponding to a linear combination of the perceptron units’ outputs.
- Therefore, any continuous function can be approximated to arbitrary precision by a multi-layer perceptron with a single hidden layer (a small numerical sketch of this construction follows below).
(This proof can be extended to any non-polynomial activation function.)
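To make the construction concrete, here is a minimal numerical sketch (my own illustration under assumptions, not part of the original proof): a single hidden layer of Heaviside units with hand-chosen weights approximating $\sin(x)$ on $[0, 2\pi]$ as a staircase of steps. The target function, the 200-unit grid, and the NumPy implementation are all arbitrary choices for illustration.

```python
import numpy as np

def heaviside_layer(x, w, b):
    """Hidden layer of Heaviside perceptron units H(w_i * x + b_i),
    evaluated at every point of the 1-D array x."""
    return np.heaviside(np.outer(x, w) + b, 1.0)  # shape: (len(x), num_units)

f = np.sin                                # target continuous function (illustrative)
lo, hi, num_units = 0.0, 2 * np.pi, 200

# One step unit per grid cell: unit i switches on at x = t[i].
t = np.linspace(lo, hi, num_units + 1)
w = np.ones(num_units)                    # hidden weights
b = -t[:-1]                               # hidden biases, giving H(x - t[i])
# Output weights: each unit contributes the *increment* of f over its cell,
# so the running sum of active units traces a staircase approximation of f.
c = np.diff(f(t))
c[0] += f(t[0])                           # include the starting value f(lo)

x = np.linspace(lo, hi, 2000)
approx = heaviside_layer(x, w, b) @ c     # output of the one-hidden-layer network

print("max |f - approx| =", np.max(np.abs(f(x) - approx)))
# With 200 units the error is on the order of the grid spacing (about 0.03 here);
# more units shrink it further, as the theorem promises.
```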
In practice, the number of hidden neurons required may be impractically large. For some families of functions, additional hidden layers can reduce the required number of neurons exponentially, because deeper layers reuse the features computed by earlier ones (hierarchical feature extraction).
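The saving from depth is easiest to see on a specific family of functions. Below is a hedged sketch (my own example, using ReLU rather than Heaviside activations, and following the well-known sawtooth construction from depth-separation results): composing a two-unit "tent map" block $k$ times produces a piecewise-linear function with $2^k$ pieces from only about $2k$ neurons, while a single-hidden-layer ReLU network needs roughly one hidden unit per piece to match it.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tent(y):
    """Tent map on [0, 1] built from two ReLU units: 2*relu(y) - 4*relu(y - 0.5)."""
    return 2.0 * relu(y) - 4.0 * relu(y - 0.5)

def deep_sawtooth(x, depth):
    """Compose the 2-unit tent block `depth` times (about 2*depth neurons in total)."""
    y = x
    for _ in range(depth):
        y = tent(y)
    return y

def count_linear_pieces(y, x):
    """Count linear pieces by detecting slope changes between grid intervals."""
    slopes = np.diff(y) / np.diff(x)
    return int(np.sum(~np.isclose(slopes[1:], slopes[:-1]))) + 1

for depth in range(1, 8):
    # Grid chosen so every breakpoint (a multiple of 2**-depth) lands on a grid point.
    x = np.linspace(0.0, 1.0, 2**depth * 64 + 1)
    pieces = count_linear_pieces(deep_sawtooth(x, depth), x)
    # A one-hidden-layer ReLU net with m units produces at most m + 1 linear pieces
    # in one dimension, so matching this shape needs at least `pieces - 1` units,
    # while the deep version above uses only about 2 * depth units.
    print(f"depth={depth}: ~{2 * depth} neurons deep, {pieces} linear pieces")
```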