Mathematical Guarantees

🧒 Explain Like I'm 5

When you build with LEGOs, you want to know your tower won't fall! 🏗️

  • 📐 Mercer Kernel: Our building block fits with ALL other math toys
  • 🎨 Universal Approximation: We can build ANY shape we want!
  • ⚖️ Self-Regulation: Our tower can't grow TOO tall and tip over
  • 🧈 Smooth: No sharp edges that could poke or break

These are like safety certificates saying our math building blocks are strong and reliable!

📜 Overview of Guarantees

The ⵟ-product comes with rigorous mathematical proofs ensuring it's a sound foundation for neural networks:

| Property | What It Means | Why It Matters |
|----------|---------------|----------------|
| Mercer Kernel | Symmetric & positive semi-definite | Connects to kernel methods & RKHS theory |
| Universal Approximation | Can approximate any continuous function | Same expressive power as standard NNs |
| Self-Regulation | Outputs bounded as inputs grow | No exploding activations |
| Stable Gradients | Gradients vanish for distant inputs | Natural gradient localization |
| Lipschitz Continuity | Small input Δ → small output Δ | Smooth loss landscape |
| Analyticity (C∞) | Infinitely differentiable | Safe for PINNs & higher-order methods |

🔷 1. Mercer Kernel Property

Theorem (Mercer Kernel): The ⵟ-product is symmetric and positive semi-definite: $$\text{ⵟ}(\mathbf{w}, \mathbf{x}) = \text{ⵟ}(\mathbf{x}, \mathbf{w})$$ $$\sum_{i,j} c_i c_j \text{ⵟ}(\mathbf{x}_i, \mathbf{x}_j) \geq 0 \quad \forall c_i \in \mathbb{R}$$

This places the ⵟ-product within kernel theory: there exists a feature space in which it acts as an inner product.
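As a minimal numerical sanity check of both properties, the sketch below builds a Gram matrix from random points and inspects its symmetry and eigenvalues. It assumes the ⵟ-product takes the form $\langle\mathbf{w},\mathbf{x}\rangle^2 / (\|\mathbf{w}-\mathbf{x}\|^2 + \epsilon)$; adjust if the definition used elsewhere in these docs differs.

```python
# Numerical sanity check of the Mercer property (symmetry + PSD Gram matrix).
# Assumes ⵟ(w, x) = <w, x>^2 / (||w - x||^2 + eps).
import numpy as np

def yat(w, x, eps=1e-6):
    """Assumed ⵟ-product: <w, x>^2 / (||w - x||^2 + eps)."""
    return np.dot(w, x) ** 2 / (np.linalg.norm(w - x) ** 2 + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))                       # 50 random points in R^8

# Gram matrix K[i, j] = ⵟ(x_i, x_j)
K = np.array([[yat(xi, xj) for xj in X] for xi in X])

print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())  # expected >= 0 up to float error
```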

🎯 2. Universal Approximation

Theorem (Universal Approximation): For any continuous function $f: K \to \mathbb{R}$ on compact $K$ and any $\varepsilon > 0$, there exists an NMN architecture $g$ such that: $$\sup_{\mathbf{x} \in K} |f(\mathbf{x}) - g(\mathbf{x})| < \varepsilon$$

NMNs can approximate any continuous function to arbitrary precision, matching the power of traditional neural networks while providing geometric benefits.
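The sketch below gives only the flavor of this result: it fits a smooth 1-D target by least squares over a fixed basis of ⵟ-features (scalar version of the same assumed definition, with a larger ε to keep the toy basis well-behaved). It is not the NMN architecture or its training procedure, just an illustration that ⵟ-responses can serve as approximating building blocks.

```python
# Toy least-squares fit of a 1-D target using fixed ⵟ-features as a basis.
import numpy as np

def yat(w, x, eps=0.1):
    # Scalar version of the assumed ⵟ-product; larger eps smooths the toy basis.
    return (w * x) ** 2 / ((w - x) ** 2 + eps)

x = np.linspace(-3, 3, 400)
target = np.sin(2 * x) + 0.3 * x              # arbitrary continuous target on a compact interval

anchors = np.linspace(-3, 3, 40)              # fixed "weight" scalars
Phi = np.stack([yat(a, x) for a in anchors], axis=1)
Phi = np.concatenate([Phi, np.ones((len(x), 1))], axis=1)   # bias column

coeffs, *_ = np.linalg.lstsq(Phi, target, rcond=None)
print("max abs fit error:", np.abs(Phi @ coeffs - target).max())
```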

⚖️ 3. Self-Regulation Property

Proposition (Self-Regulation): For fixed $\mathbf{w}$, outputs are globally bounded: $$\lim_{\|\mathbf{x}\| \to \infty} \text{ⵟ}(\mathbf{w}, \mathbf{x}) \leq \|\mathbf{w}\|^2$$ $$\max_{\mathbf{x}} \text{ⵟ}(\mathbf{w}, \mathbf{x}) = \frac{\|\mathbf{w}\|^4}{\epsilon}$$
💡 Why This Matters: Unlike ReLU, which can grow unboundedly, the ⵟ-product has built-in regularization. No BatchNorm is needed to control activation magnitudes!
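A quick sketch of this behavior, again assuming $\text{ⵟ}(\mathbf{w}, \mathbf{x}) = \langle\mathbf{w},\mathbf{x}\rangle^2 / (\|\mathbf{w}-\mathbf{x}\|^2 + \epsilon)$: push $\mathbf{x}$ outward along a fixed direction and compare the ⵟ response with a plain dot product.

```python
# Self-regulation sketch: as ||x|| grows along a fixed direction, the ⵟ response
# levels off toward a finite value (at most ||w||^2), while w·x grows without bound.
import numpy as np

def yat(w, x, eps=1e-6):
    return np.dot(w, x) ** 2 / (np.linalg.norm(w - x) ** 2 + eps)

rng = np.random.default_rng(0)
w = rng.normal(size=8)
u = rng.normal(size=8)
u /= np.linalg.norm(u)            # fixed unit direction to push x along

for scale in [1, 10, 100, 1000]:
    x = scale * u
    print(f"||x||={scale:5d}   ⵟ(w,x)={yat(w, x):8.3f}   "
          f"w·x={np.dot(w, x):10.1f}   ||w||^2={np.dot(w, w):.3f}")
```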

🧲 4. Stable Gradient Property

Proposition (Stable Learning): Gradients decay for inputs far from the weight: $$\lim_{\|\mathbf{x}\| \to \infty} \left\| \nabla_{\mathbf{x}} \text{ⵟ}(\mathbf{w}, \mathbf{x}) \right\| = 0$$

Learning focuses on relevant, nearby regions while distant points contribute minimal gradient signal — natural attention-like behavior!
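A small sketch of the same effect, estimating $\nabla_{\mathbf{x}}\text{ⵟ}(\mathbf{w}, \mathbf{x})$ with central finite differences under the assumed definition $\langle\mathbf{w},\mathbf{x}\rangle^2 / (\|\mathbf{w}-\mathbf{x}\|^2 + \epsilon)$:

```python
# Gradient localisation sketch: the gradient w.r.t. x, estimated by central
# finite differences, shrinks as x moves far away from w.
import numpy as np

def yat(w, x, eps=1e-6):
    return np.dot(w, x) ** 2 / (np.linalg.norm(w - x) ** 2 + eps)

def grad_x(w, x, h=1e-5):
    """Central-difference estimate of the gradient of ⵟ(w, ·) at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (yat(w, x + e) - yat(w, x - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
w = rng.normal(size=8)
u = rng.normal(size=8)
u /= np.linalg.norm(u)

for scale in [1, 10, 100, 1000]:
    print(f"||x||={scale:5d}   ||∇_x ⵟ(w,x)||={np.linalg.norm(grad_x(w, scale * u)):.6f}")
```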

🧈 5. Lipschitz Regularity

Proposition (Lipschitz): There exists a constant $L$ such that: $$|\text{ⵟ}(\mathbf{w}, \mathbf{x}_1) - \text{ⵟ}(\mathbf{w}, \mathbf{x}_2)| \leq L \|\mathbf{x}_1 - \mathbf{x}_2\|$$

Small input changes produce proportionally small output changes — crucial for optimization stability and adversarial robustness.
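The probe below samples random input pairs in a bounded region and records the ratio $|\Delta\text{ⵟ}| / \|\Delta\mathbf{x}\|$. The largest observed value is only a crude empirical lower bound on the true constant $L$ (which depends on $\epsilon$), not a proof; it again assumes the definition used above.

```python
# Empirical Lipschitz probe: on bounded inputs the difference ratio stays finite.
import numpy as np

def yat(w, x, eps=1e-6):
    return np.dot(w, x) ** 2 / (np.linalg.norm(w - x) ** 2 + eps)

rng = np.random.default_rng(0)
w = rng.normal(size=8)

worst = 0.0
for _ in range(10_000):
    x1, x2 = rng.uniform(-2.0, 2.0, size=(2, 8))
    ratio = abs(yat(w, x1) - yat(w, x2)) / np.linalg.norm(x1 - x2)
    worst = max(worst, ratio)

print("largest observed |Δⵟ| / ||Δx||:", worst)
```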

∞ 6. Analyticity (C∞)

Lemma (Analyticity): The ⵟ-product is infinitely differentiable: $$\text{ⵟ} \in C^{\infty}(\mathbb{R}^n \times \mathbb{R}^n)$$ All partial derivatives of all orders exist and are continuous.

Essential for Physics-Informed Neural Networks (PINNs) where we need to compute higher-order derivatives for differential equation solving.
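As a small illustration, the 1-D version of the assumed ⵟ-expression can be differentiated symbolically to any order; the second derivative below is the kind of term a PINN loss would require. The scalar form $(wx)^2 / ((w-x)^2 + \epsilon)$ is an assumption mirroring the vector definition used above.

```python
# Symbolic differentiation of the 1-D assumed ⵟ-expression with SymPy.
import sympy as sp

w, x = sp.symbols("w x", real=True)
eps = sp.Symbol("eps", positive=True)          # eps > 0 keeps the denominator nonzero
yat_1d = (w * x) ** 2 / ((w - x) ** 2 + eps)

d1 = sp.simplify(sp.diff(yat_1d, x))           # first derivative in x
d2 = sp.simplify(sp.diff(yat_1d, x, 2))        # second derivative, as a PINN residual would need
print(d1)
print(d2)
```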

🔗 Information-Theoretic Connections

Geometric-Information Duality: When applied to probability distributions: $$\text{ⵟ}(\mathbf{p}, \mathbf{q}) \propto \frac{\text{Signal}^2}{\text{Noise}}$$ This creates a bridge between geometric similarity and information-theoretic quantities such as KL divergence and cross-entropy.
🎯 The Complete Picture: These guarantees together establish that NMNs are theoretically sound, computationally stable, and practically powerful, eliminating the need for ad-hoc normalization and activation functions while maintaining full expressive power.