Theorem 4

Stable Learning & Gradient Localization

🧒

Explain Like I'm 5

Imagine you're learning to draw, and you have a teacher who gives you feedback. If you're drawing something very different from what the teacher wants, they might not give you much feedback (because it's too far off).

The ⵟ-product works the same way! When inputs are far away from the weight vector, the gradients (feedback) become very small. This means each neuron only "learns" from nearby examples, creating natural localization.

This is like having a teacher who focuses on helping you with things that are close to what you're trying to learn, rather than getting distracted by completely different things.

Proposition: For fixed $\mathbf{w}$, the gradient of the ⵟ-product with respect to the input vanishes for distant inputs: $$\lim_{\|\mathbf{x}\| \to \infty} \|\nabla_{\mathbf{x}} \text{ⵟ}(\mathbf{w}, \mathbf{x})\| = 0$$ Specifically, $\|\nabla_{\mathbf{x}} \text{ⵟ}\| = \mathcal{O}(1/k)$ as $\|\mathbf{x}\| = k \to \infty$.

🎯 The Problem This Solves

Traditional neurons have global gradient influence:

  • Linear neurons: Gradients are constant regardless of distance (no localization)
  • ReLU neurons: Gradients are either constant (positive side) or zero (negative side), binary rather than smooth

This means every training example affects every weight (a short comparison sketch follows the list below), leading to:

  • Slow convergence (too many conflicting signals)
  • Difficulty learning local patterns
  • Susceptibility to outliers
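
To make the contrast concrete, here is a minimal sketch (PyTorch, arbitrary weights and direction) comparing gradient magnitudes as an input moves away from the weight vector. It assumes the ⵟ-product takes the form $(\mathbf{w}^\top\mathbf{x})^2 / (\|\mathbf{w} - \mathbf{x}\|^2 + \varepsilon)$ that the gradient in the next subsection is derived from.

```python
import torch

def linear(w, x):
    # Linear neuron: gradient w.r.t. x is w, constant at any distance.
    return w @ x

def relu(w, x):
    # ReLU neuron: gradient is w on the positive side, zero on the negative side.
    return torch.relu(w @ x)

def t_product(w, x, eps=1e-6):
    # Assumed t-product form: (w.x)^2 / (||w - x||^2 + eps).
    return (w @ x) ** 2 / ((w - x).pow(2).sum() + eps)

torch.manual_seed(0)
w = torch.randn(8)
u = torch.randn(8)
u = u / u.norm()
if w @ u < 0:          # keep the ReLU on its active (positive) side
    u = -u

for k in [1.0, 10.0, 100.0, 1000.0]:
    for name, f in [("linear", linear), ("relu", relu), ("t-product", t_product)]:
        x = (k * u).requires_grad_(True)
        (g,) = torch.autograd.grad(f(w, x), x)
        print(f"{name:9s} |x|={k:7.0f}  |grad|={g.norm().item():.4g}")
    print()
```

The linear gradient norm stays at $\|\mathbf{w}\|$ regardless of distance, the ReLU gradient is either that same constant or exactly zero, and the ⵟ-product gradient keeps shrinking as the input recedes.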

๐Ÿ“ The Mathematics In Depth

Applying the quotient rule to $\text{ⵟ}(\mathbf{w}, \mathbf{x}) = (\mathbf{w}^\top\mathbf{x})^2 / (\|\mathbf{w} - \mathbf{x}\|^2 + \varepsilon)$ gives:

$$\nabla_{\mathbf{x}} \text{ⵟ}(\mathbf{w}, \mathbf{x}) = \frac{2(\mathbf{w}^\top\mathbf{x})\mathbf{w}}{\|\mathbf{w} - \mathbf{x}\|^2 + \varepsilon} - \frac{(\mathbf{w}^\top\mathbf{x})^2 \cdot 2(\mathbf{x} - \mathbf{w})}{(\|\mathbf{w} - \mathbf{x}\|^2 + \varepsilon)^2}$$

As $\|\mathbf{x}\| = k \to \infty$, the first term has a numerator of order $k$ against a denominator of order $k^2$, and the second term has a numerator of order $k^3$ against a denominator of order $k^4$. Each term therefore scales as $1/k$, so $\|\nabla_{\mathbf{x}} \text{ⵟ}\| = \mathcal{O}(1/k) \to 0$.
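
As a sanity check, the sketch below (arbitrary $\varepsilon$ and test vectors, using the same form of ⵟ as above) compares the closed-form gradient against a central finite-difference estimate, then tracks $k \cdot \|\nabla_{\mathbf{x}}\text{ⵟ}\|$ along a fixed direction; if the $\mathcal{O}(1/k)$ claim holds, that product should level off to a constant.

```python
import numpy as np

EPS = 1e-6  # arbitrary choice for illustration

def t_product(w, x):
    # Assumed form consistent with the gradient above.
    return (w @ x) ** 2 / (np.sum((w - x) ** 2) + EPS)

def analytic_grad(w, x):
    # The closed-form quotient-rule gradient, exactly as written above.
    d = np.sum((w - x) ** 2) + EPS
    return 2 * (w @ x) * w / d - 2 * (w @ x) ** 2 * (x - w) / d ** 2

def numeric_grad(w, x, h=1e-5):
    # Central finite differences, coordinate by coordinate.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (t_product(w, x + e) - t_product(w, x - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
w, x0 = rng.normal(size=8), rng.normal(size=8)
print("max |analytic - numeric|:",
      np.max(np.abs(analytic_grad(w, x0) - numeric_grad(w, x0))))

u = rng.normal(size=8)
u /= np.linalg.norm(u)
for k in [1e1, 1e2, 1e3, 1e4]:
    gnorm = np.linalg.norm(analytic_grad(w, k * u))
    print(f"k={k:8.0f}  ||grad||={gnorm:.3e}  k*||grad||={k * gnorm:.3e}")
```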

💥 The Consequences

🎯
Natural Localization

Each neuron creates a "territory" around its weight vector. Only nearby inputs contribute significant gradients, enabling local pattern learning.

∞
Infinitely Smooth (C∞)

The ⵟ-product is infinitely differentiable, which makes it well suited to physics-informed neural networks (PINNs) that require higher-order derivatives (a short autograd sketch follows these points).

📊
Lipschitz Continuity

The gradient is Lipschitz continuous, providing theoretical guarantees for optimization and stability during training.

🛡️
Robust to Outliers

Distant outliers contribute vanishingly small gradients, so they don't disrupt learning of local patterns.
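
To illustrate the smoothness point in practice, here is a minimal sketch (again assuming the $(\mathbf{w}^\top\mathbf{x})^2 / (\|\mathbf{w} - \mathbf{x}\|^2 + \varepsilon)$ form with $\varepsilon > 0$, and arbitrary values): because the expression is a ratio of polynomials with a strictly positive denominator, autograd can take derivatives of any order without hitting kinks.

```python
import torch

def t_product(w, x, eps=1e-6):
    # Assumed form: (w.x)^2 / (||w - x||^2 + eps); eps > 0 keeps the
    # denominator strictly positive, so the function is smooth everywhere.
    return (w @ x) ** 2 / ((w - x).pow(2).sum() + eps)

# Arbitrary illustrative values.
w = torch.tensor([1.0, -0.5, 2.0])
x = torch.tensor([0.3, 0.8, -1.2], requires_grad=True)

y = t_product(w, x)
# First-order gradient, keeping the graph so we can differentiate again.
(g1,) = torch.autograd.grad(y, x, create_graph=True)
# Differentiate the summed gradient: a second-order derivative
# (Hessian times a vector of ones); keep the graph for a third pass.
(g2,) = torch.autograd.grad(g1.sum(), x, create_graph=True)
# Third-order derivative of the summed quantity: finite and well defined.
(g3,) = torch.autograd.grad(g2.sum(), x)
print(g1, g2, g3, sep="\n")
```

A ReLU neuron, by contrast, has an identically zero second derivative almost everywhere, which is exactly why smooth activations matter for higher-order physics residuals.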

🎓 What This Really Means

This proposition shows that ⵟ-product neurons have spatial awareness. They naturally create "learning territories": regions of input space where they actively learn, with smooth falloff outside.

This is fundamentally different from linear or ReLU neurons, which have global influence. The localization property enables:

  • Faster convergence (fewer conflicting gradient signals)
  • Better generalization (local patterns are more robust)
  • Interpretability (each neuron "owns" a region of space)