
Universal Approximation Theorem

🧒

Explain Like I'm 5

Imagine you have a magic box that can draw any picture you want. You just need to tell it "draw a cat" or "draw a house" and it will draw it perfectly!

The Universal Approximation Theorem says our -product networks are like that magic box — they can learn to do anything (well, any continuous function) if we give them enough "drawing tools" (neurons).

Even though we removed the activation functions (like ReLU), we didn't lose any power! The -product is already "magical" enough on its own.

Theorem: Let $\mathcal{X} \subset \mathbb{R}^d$ be compact. The class of single-hidden-layer -product networks $f(\mathbf{x}) = \sum_{i=1}^n \alpha_i \cdot g(\mathbf{x}; \mathbf{w}_i, b_i) + c$ is dense in $C(\mathcal{X})$ under the uniform norm. That is, NMNs can approximate any continuous function to arbitrary precision.
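Concretely, the network class in the theorem can be sketched in a few lines (a minimal NumPy sketch; the unit $g$ uses the biased form defined in Step 1 below, and the value of $\varepsilon$, the weights, and the dimensions here are illustrative assumptions, not values from the text):

```python
import numpy as np

EPS = 1e-2  # stabilizing constant epsilon (illustrative value)

def g(x, w, b):
    """Single -product unit with bias: (w.x + b)^2 / (||w - x||^2 + eps)."""
    return (w @ x + b) ** 2 / (np.sum((w - x) ** 2) + EPS)

def nmn(x, alphas, ws, bs, c):
    """Single-hidden-layer network: f(x) = sum_i alpha_i * g(x; w_i, b_i) + c."""
    return sum(a * g(x, w, b) for a, w, b in zip(alphas, ws, bs)) + c

# A tiny network with n = 3 hidden units in d = 2 dimensions.
rng = np.random.default_rng(0)
alphas = rng.standard_normal(3)
ws = rng.standard_normal((3, 2))
bs = rng.standard_normal(3)
x = np.array([0.5, -0.25])
y = nmn(x, alphas, ws, bs, c=0.1)
```

The theorem says that, as $n$ grows, sums of this form get uniformly close to any continuous target on a compact domain.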

🎯 The Problem This Solves

When we removed activation functions, critics asked: "Can you still learn complex functions?"

Traditional neural networks rely on activation functions (ReLU, sigmoid) to create non-linearity. Without them, a network collapses into a composition of linear transformations, which is itself a single linear map and can only learn linear functions.

This theorem proves that the -product's inherent geometric non-linearity is sufficient. We don't need separate activation functions because the -product itself is non-linear.

📐 The Mathematics In Depth

The proof is elegant and leverages the kernel structure established by Theorem 1:

Step 1: Recover IMQ Kernel

Consider the -product with bias: $g(\mathbf{x}; \mathbf{w}, b) = \frac{(\mathbf{w}^\top\mathbf{x} + b)^2}{\|\mathbf{w} - \mathbf{x}\|^2 + \varepsilon}$

Since the denominator does not depend on $b$, differentiating twice with respect to $b$ yields: $$\partial_b^2 g(\mathbf{x}; \mathbf{w}, b) = \frac{2}{\|\mathbf{x} - \mathbf{w}\|^2 + \varepsilon}$$
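This identity is easy to check numerically with a central second difference in $b$ (a small sketch; the specific $\mathbf{x}$, $\mathbf{w}$, $\varepsilon$, and step size are illustrative assumptions). Because the numerator of $g$ is exactly quadratic in $b$, the finite difference agrees with the IMQ value up to floating-point roundoff:

```python
import numpy as np

eps = 0.5  # illustrative epsilon

def g(x, w, b):
    # -product unit with bias: (w.x + b)^2 / (||w - x||^2 + eps)
    return (w @ x + b) ** 2 / (np.sum((w - x) ** 2) + eps)

x = np.array([0.3, -1.2])
w = np.array([0.8, 0.4])
b, h = 0.7, 1e-3

# Central second difference approximates d^2 g / d b^2.
d2g = (g(x, w, b + h) - 2 * g(x, w, b) + g(x, w, b - h)) / h ** 2
imq = 2.0 / (np.sum((x - w) ** 2) + eps)  # the IMQ kernel value
```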

This is the inverse multiquadric (IMQ) kernel — a well-studied kernel in approximation theory.

Step 2: Fourier Analysis

The IMQ kernel has a strictly positive Fourier transform (related to the modified Bessel function $K_0$). This is a key property for density results.

Step 3: Uniqueness via Orthogonality

If a measure $\mu$ is orthogonal to all IMQ translates (i.e., $\int k(\mathbf{x}, \mathbf{w}) d\mu(\mathbf{x}) = 0$ for all $\mathbf{w}$), then by the positivity of the Fourier transform, $\mu$ must be the zero measure.

Step 4: Density via Hahn-Banach/Riesz Duality

By the Hahn-Banach theorem and Riesz representation theorem, if the span of $\{g(\cdot; \mathbf{w}, b)\}$ is not dense, there exists a non-zero continuous linear functional that vanishes on the span. This functional corresponds to a measure, which by Step 3 must be zero — a contradiction.
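Written out, the contradiction argument chains the previous steps together (a sketch of the logic under the assumptions above, not the full proof):

```latex
\begin{align*}
&\text{Suppose } \overline{\operatorname{span}}\{g(\cdot;\mathbf{w},b)\} \subsetneq C(\mathcal{X})
  &&\text{(assumption for contradiction)}\\
&\exists\, \Lambda \neq 0:\ \Lambda\big(g(\cdot;\mathbf{w},b)\big) = 0
  \quad \forall\, \mathbf{w}, b &&\text{(Hahn--Banach)}\\
&\Lambda(f) = \int_{\mathcal{X}} f \, d\mu \ \text{ for a non-zero measure } \mu
  &&\text{(Riesz representation)}\\
&\partial_b^2 \int_{\mathcal{X}} g(\mathbf{x};\mathbf{w},b)\, d\mu(\mathbf{x})
  = \int_{\mathcal{X}} \frac{2\, d\mu(\mathbf{x})}{\|\mathbf{x}-\mathbf{w}\|^2+\varepsilon} = 0
  \quad \forall\, \mathbf{w} &&\text{(Step 1)}\\
&\Rightarrow \mu = 0 &&\text{(Step 3), a contradiction.}
\end{align*}
```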

Therefore, the span is dense in $C(\mathcal{X})$. ∎

🔬
Key Insight: The bias term $b$ is crucial! It allows the network to "shift" response fields and span the entire function space through differentiation. This is why we use $(\mathbf{w}^\top\mathbf{x} + b)^2$ rather than just $(\mathbf{w}^\top\mathbf{x})^2$.

💥 The Consequences

No Expressive Power Loss

NMNs are as expressive as ReLU/sigmoid networks. A single hidden layer is sufficient in theory (though deeper networks may learn more efficiently).
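This expressivity can be illustrated with a toy fit: draw random hidden units and solve for the output weights $\alpha_i$ and bias $c$ by least squares (a sketch only; the target function, feature count, and $\varepsilon$ are illustrative assumptions, not an experiment from the text):

```python
import numpy as np

rng = np.random.default_rng(42)
eps = 0.1

def g(x, w, b):
    # Scalar-input -product unit: (w*x + b)^2 / ((w - x)^2 + eps)
    return (w * x + b) ** 2 / ((w - x) ** 2 + eps)

# Target: a smooth continuous function on a compact interval.
xs = np.linspace(-1.0, 1.0, 80)
ys = np.sin(3 * xs)

# Random hidden units; only the output layer (alphas and c) is solved for.
n = 100
ws = rng.uniform(-2, 2, n)
bs = rng.uniform(-2, 2, n)
features = np.stack([g(xs, w, b) for w, b in zip(ws, bs)], axis=1)
design = np.hstack([features, np.ones((len(xs), 1))])  # last column = bias c

coef, *_ = np.linalg.lstsq(design, ys, rcond=None)
mse = np.mean((design @ coef - ys) ** 2)
```

With enough units, the span of -product features fits the target on the grid essentially exactly, as the density argument predicts.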

🎯
Geometric Localization

Unlike ReLU units, whose responses grow without bound, the -product achieves density through localized geometric units that create "vortex-like" territorial fields.

🔧
Simpler Architecture

No need for complex activation functions. The geometric operator itself provides the non-linearity, leading to simpler, more interpretable networks.

📊
Kernel Method Connection

The proof connects NMNs to kernel methods, opening the door to using kernel-based optimization techniques and theoretical guarantees.

🎓 What This Really Means

This is the fundamental existence theorem for NMNs. It answers the question: "Can we really learn complex functions without activation functions?"

The answer is a resounding yes. The -product's geometric structure provides sufficient non-linearity to approximate any continuous function.

This theorem bridges the gap between theoretical possibility and practical feasibility, showing that activation-free networks are not just a curiosity — they're a viable alternative with the same expressive power.

📜 Historical Context

Universal approximation theorems date back to the 1980s, with seminal work by Cybenko (1989) and Hornik et al. (1989) showing that single-hidden-layer networks with sigmoidal activations are universal approximators.

Our theorem extends this tradition, showing that geometric non-linearity (via the -product) can replace functional non-linearity (via activations) without losing approximation power.