Imagine you have a magic box that can draw any picture you want. You just need to tell it "draw a cat" or "draw a house" and it will draw it perfectly!
The Universal Approximation Theorem says our ⵟ-product networks are like that magic box: they can learn to do anything (well, any continuous function) if we give them enough "drawing tools" (neurons).
Even though we removed the activation functions (like ReLU), we didn't lose any power! The ⵟ-product is already "magical" enough on its own.
When we removed activation functions, critics asked: "Can you still learn complex functions?"
Traditional neural networks rely on activation functions (ReLU, sigmoid) to create non-linearity. Without them, a network would collapse into a composition of linear transformations, which can only represent linear functions, as the sketch below illustrates.
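A two-line NumPy check makes the collapse concrete (shapes and values here are arbitrary illustrative choices):

```python
import numpy as np

# Without non-linearity between layers, two stacked linear maps
# collapse into a single linear map: W2 (W1 x) == (W2 W1) x,
# so depth alone adds no expressive power.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)
```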
This theorem proves that the ⵟ-product's inherent geometric non-linearity is sufficient. We don't need separate activation functions because the ⵟ-product itself is non-linear.
The proof is elegant and leverages the kernel structure established by Theorem 1:
Step 1: Consider the ⵟ-product with bias: $g(\mathbf{x}; \mathbf{w}, b) = \frac{(\mathbf{w}^\top\mathbf{x} + b)^2}{\|\mathbf{w} - \mathbf{x}\|^2 + \varepsilon}$
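For reference, here is a minimal NumPy sketch of this unit (the helper name `g_unit` and the default `eps` value are our own illustrative choices, not fixed by the text):

```python
import numpy as np

def g_unit(x, w, b=0.0, eps=1e-6):
    """Minimal sketch of the biased unit g(x; w, b).

    Numerator: the squared affine response (w^T x + b)^2.
    Denominator: the squared distance ||w - x||^2, plus a small eps
    that keeps it strictly positive (the default value is illustrative).
    """
    s = np.dot(w, x) + b
    return s ** 2 / (np.sum((w - x) ** 2) + eps)
```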
Differentiating twice with respect to $b$ gives: $$\partial_b^2 g(\mathbf{x}; \mathbf{w}, b) = \frac{2}{\|\mathbf{x} - \mathbf{w}\|^2 + \varepsilon}$$ Since finite differences in $b$ remain in the span of these units, this second derivative lies in the closure of the span.
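The identity is easy to verify numerically: $g$ is quadratic in $b$, so a central second difference recovers the closed form up to rounding (the points and $\varepsilon$ below are arbitrary choices of ours):

```python
import numpy as np

# Check d^2 g / db^2 = 2 / (||x - w||^2 + eps) by a central second
# difference; g is quadratic in b, so the match is exact up to rounding.
rng = np.random.default_rng(1)
x, w = rng.normal(size=3), rng.normal(size=3)
b, h, eps = 0.7, 1e-4, 1e-6

def g(b):
    return (np.dot(w, x) + b) ** 2 / (np.sum((w - x) ** 2) + eps)

second_diff = (g(b + h) - 2 * g(b) + g(b - h)) / h ** 2
assert np.isclose(second_diff, 2.0 / (np.sum((x - w) ** 2) + eps), rtol=1e-4)
```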
Up to a constant factor, this is a (generalized) inverse multiquadric (IMQ) kernel, a well-studied kernel in approximation theory.
Step 2: The IMQ kernel has a strictly positive Fourier transform (related to the modified Bessel function $K_0$). This is a key property for density results, as the numerical check below illustrates.
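By Bochner's theorem, a strictly positive Fourier transform makes the kernel strictly positive definite, so a Gram matrix on distinct points has only positive eigenvalues. A quick sanity check (the point cloud and $\varepsilon$ are arbitrary illustrative choices):

```python
import numpy as np

# Bochner-style sanity check: the IMQ Gram matrix on distinct random
# points should be strictly positive definite.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise ||x_i - x_j||^2
K = 1.0 / (d2 + 1e-1)                                 # IMQ-type kernel matrix
assert np.linalg.eigvalsh(K).min() > 0
```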
Step 3: If a measure $\mu$ is orthogonal to all IMQ translates (i.e., $\int k(\mathbf{x}, \mathbf{w})\, d\mu(\mathbf{x}) = 0$ for all $\mathbf{w}$), then by the positivity of the Fourier transform, $\mu$ must be the zero measure.
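In Fourier terms this is a one-line argument: writing $k(\mathbf{x}, \mathbf{w}) = \varphi(\mathbf{w} - \mathbf{x})$ with the even profile $\varphi(\mathbf{t}) = 2/(\|\mathbf{t}\|^2 + \varepsilon)$,

$$0 = \int \varphi(\mathbf{w} - \mathbf{x})\, d\mu(\mathbf{x}) = (\varphi * \mu)(\mathbf{w}) \;\; \forall\, \mathbf{w} \quad\Longrightarrow\quad \hat{\varphi}\,\hat{\mu} \equiv 0 \quad\Longrightarrow\quad \hat{\mu} \equiv 0 \quad\Longrightarrow\quad \mu = 0,$$

since $\hat{\varphi} > 0$ everywhere.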
Step 4: By the Hahn-Banach theorem and the Riesz representation theorem, if the span of $\{g(\cdot; \mathbf{w}, b)\}$ were not dense, there would exist a non-zero continuous linear functional that vanishes on the span. This functional corresponds to a non-zero signed measure, which by Step 3 must be zero, a contradiction.
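Concretely, the steps chain together as follows (writing $\Lambda$ for the hypothetical annihilating functional and $\mu$ for its Riesz measure):

$$0 = \Lambda\big(g(\cdot; \mathbf{w}, b)\big) = \int g(\mathbf{x}; \mathbf{w}, b)\, d\mu(\mathbf{x}) \;\; \forall\, \mathbf{w}, b \quad\Longrightarrow\quad \partial_b^2 \int g(\mathbf{x}; \mathbf{w}, b)\, d\mu(\mathbf{x}) = \int \frac{2\, d\mu(\mathbf{x})}{\|\mathbf{x} - \mathbf{w}\|^2 + \varepsilon} = 0 \;\; \forall\, \mathbf{w},$$

which is exactly the orthogonality condition of Step 3, forcing $\mu = 0$.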
Therefore, the span is dense in $C(\mathcal{X})$. ∎
NMNs are as expressive as ReLU/sigmoid networks: a single hidden layer is sufficient in theory (though deeper networks may learn more efficiently), as the toy experiment below suggests.
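Here is a toy NumPy experiment: randomly placed ⵟ-product units with a least-squares linear readout, fit to a 1-D target (the unit count, $\varepsilon$, and the target $\sin(3x)$ are our illustrative choices, not the paper's setup):

```python
import numpy as np

# Toy check of single-hidden-layer expressiveness: 32 randomly placed
# units g(x; w_j, b_j) with a least-squares linear readout, fit to
# sin(3x) on [-2, 2]. All hyperparameters here are arbitrary.
rng = np.random.default_rng(3)
x = np.linspace(-2, 2, 400)[:, None]          # 400 scalar inputs
W = rng.uniform(-2, 2, size=(32, 1))          # hidden weights w_j
b = rng.normal(size=32)                       # hidden biases b_j

s = x @ W.T + b                               # affine responses w_j^T x + b_j
d2 = ((x[:, None, :] - W[None, :, :]) ** 2).sum(-1)
Phi = s ** 2 / (d2 + 1e-2)                    # one unit per column

y = np.sin(3 * x[:, 0])
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("max abs error:", np.abs(Phi @ coef - y).max())
```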
Unlike ReLU units, whose responses grow without bound, the ⵟ-product achieves density through localized geometric units, creating "vortex-like" territorial fields.
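A small numerical contrast (our illustrative setup): along a fixed ray, a ReLU response grows linearly, while a ⵟ-product unit's response saturates toward a direction-dependent constant, so each unit's influence stays bounded and concentrated near its weight vector:

```python
import numpy as np

# Tail behavior along a fixed ray: ReLU grows without bound; the
# unit's response approaches a direction-dependent constant.
w = np.array([1.0, 0.0])
u = np.array([1.0, 1.0]) / np.sqrt(2)         # ray direction
for r in (1.0, 10.0, 100.0):
    x = r * u
    unit = (w @ x) ** 2 / (np.sum((w - x) ** 2) + 1e-6)
    relu = max(w @ x, 0.0)
    print(f"r={r:6.1f}   unit={unit:7.3f}   relu={relu:8.3f}")
```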
No need for complex activation functions. The geometric operator itself provides the non-linearity, leading to simpler, more interpretable networks.
The proof connects NMNs to kernel methods, opening the door to using kernel-based optimization techniques and theoretical guarantees.
This is the fundamental existence theorem for NMNs. It answers the question: "Can we really learn complex functions without activation functions?"
The answer is a resounding yes. The ⵟ-product's geometric structure provides sufficient non-linearity to approximate any continuous function.
This theorem bridges the gap between theoretical possibility and practical feasibility, showing that activation-free networks are not just a curiosity; they're a viable alternative with the same expressive power.
Universal approximation theorems date back to the 1980s, with seminal work by Cybenko (1989) and Hornik et al. (1989) showing that single-hidden-layer networks with sigmoidal activations are universal approximators.
Our theorem extends this tradition, showing that geometric non-linearity (via the ⵟ-product) can replace functional non-linearity (via activations) without losing approximation power.