Theorem 3: Self-Regulation & Bounded Outputs

🧒 Explain Like I'm 5

Imagine you have a volume knob that can go from 0 to 100. If you turn it all the way up, it might break your speakers! But what if the knob had a safety limit that prevents it from going too loud?

The ⵟ-product is like that safe volume knob. Even if you give it really big numbers, it automatically limits itself and never explodes. It's self-regulating!

This means we don't need extra "safety equipment" (like normalization layers) — the ⵟ-product is already safe by design.

Proposition: For any fixed weight vector $\mathbf{w}$, the ⵟ-product output remains bounded and converges as $\|\mathbf{x}\| \to \infty$: $$\lim_{\|\mathbf{x}\| \to \infty} \text{ⵟ}(\mathbf{w}, \mathbf{x}) = \|\mathbf{w}\|^2 \cos^2\theta$$ where $\theta$ is the angle between $\mathbf{w}$ and the direction of $\mathbf{x}$.

🎯 The Problem This Solves

Traditional neural networks suffer from unbounded growth:

  • ReLU: Output grows linearly with input magnitude, so outliers can blow up activations
  • Dot products: Their variance grows with dimension, requiring careful (Xavier/He) initialization
  • Internal Covariate Shift: Activation statistics drift during training, which is why normalization layers are usually needed

The ⵟ-product addresses all three problems naturally, without explicit normalization, as the short sketch below illustrates.
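
Before the formal derivation, a quick numerical contrast can make this concrete. The sketch below is a minimal illustration assuming the ⵟ-product formula given in the mathematics section that follows; the helper name `tz_product`, the ε value, the dimension, and the seed are illustrative choices, not part of the original definition.

```python
import numpy as np

def tz_product(w, x, eps=1e-6):
    # ⵟ(w, x) = (w·x)^2 / (||x - w||^2 + eps), per the formula in the next section
    return np.dot(w, x) ** 2 / (np.sum((x - w) ** 2) + eps)

rng = np.random.default_rng(0)
d = 64
w = rng.normal(size=d)
u = rng.normal(size=d)
u /= np.linalg.norm(u)                      # fixed direction, varying magnitude

for k in [1.0, 10.0, 100.0, 1e4, 1e6]:
    x = k * u
    print(f"k={k:>9.0f}  dot(w, x)={np.dot(w, x):>14.2f}  ⵟ(w, x)={tz_product(w, x):.6f}")
# The dot product grows linearly with k; the ⵟ-product saturates to a constant.
```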

📐 The Mathematics In Depth

As $\|\mathbf{x}\| \to \infty$, we can write $\mathbf{x} = k \mathbf{u}$ where $k = \|\mathbf{x}\|$ and $\mathbf{u}$ is a unit vector. Then:

$$\text{ⵟ}(\mathbf{w}, \mathbf{x}) = \frac{(\mathbf{w}^\top k\mathbf{u})^2}{\|k\mathbf{u} - \mathbf{w}\|^2 + \varepsilon} = \frac{k^2 (\mathbf{w}^\top\mathbf{u})^2}{k^2 - 2k(\mathbf{w}^\top\mathbf{u}) + \|\mathbf{w}\|^2 + \varepsilon}$$

Dividing numerator and denominator by $k^2$ and taking the limit:

$$\lim_{k \to \infty} \text{ⵟ}(\mathbf{w}, k\mathbf{u}) = \lim_{k \to \infty} \frac{(\mathbf{w}^\top\mathbf{u})^2}{1 - \frac{2(\mathbf{w}^\top\mathbf{u})}{k} + \frac{\|\mathbf{w}\|^2 + \varepsilon}{k^2}} = (\mathbf{w}^\top\mathbf{u})^2 = \|\mathbf{w}\|^2 \cos^2\theta$$

where $\cos\theta = \frac{\mathbf{w}^\top\mathbf{u}}{\|\mathbf{w}\|}$ is the cosine of the angle between $\mathbf{w}$ and $\mathbf{u}$.
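
As a sanity check on this limit, the following sketch evaluates $\text{ⵟ}(\mathbf{w}, k\mathbf{u})$ for increasing $k$ and compares it against $\|\mathbf{w}\|^2 \cos^2\theta$. It reuses the assumed `tz_product` helper from the earlier sketch; dimension and seed are arbitrary.

```python
import numpy as np

def tz_product(w, x, eps=1e-6):
    return np.dot(w, x) ** 2 / (np.sum((x - w) ** 2) + eps)

rng = np.random.default_rng(1)
d = 32
w = rng.normal(size=d)
u = rng.normal(size=d)
u /= np.linalg.norm(u)

cos_theta = np.dot(w, u) / np.linalg.norm(w)          # cos of the angle between w and u
limit = np.linalg.norm(w) ** 2 * cos_theta ** 2       # ||w||^2 cos^2(theta)

for k in [1e1, 1e3, 1e5, 1e7]:
    val = tz_product(w, k * u)
    print(f"k={k:.0e}  ⵟ(w, k·u)={val:.8f}  |gap to limit|={abs(val - limit):.2e}")
# The gap to ||w||^2 cos^2(theta) shrinks as k grows.
```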

💥 The Consequences

📉 No Exploding Activations

Outliers don't cause numerical instabilities: as input magnitude grows, the output approaches $\|\mathbf{w}\|^2 \cos^2\theta \le \|\mathbf{w}\|^2$, no matter how large the input gets.

🎯 Dimensional Self-Normalization

At initialization, both numerator and denominator scale as $\mathcal{O}(d)$, so their ratio remains $\mathcal{O}(1)$. No need for Xavier/He initialization!
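
A small experiment illustrates this scaling. It is a sketch under the assumption that both $\mathbf{w}$ and $\mathbf{x}$ have i.i.d. standard-normal entries; the sample count and dimensions are arbitrary.

```python
import numpy as np

def tz_product(w, x, eps=1e-6):
    return np.dot(w, x) ** 2 / (np.sum((x - w) ** 2) + eps)

rng = np.random.default_rng(2)
for d in [16, 256, 4096, 65536]:
    vals = [tz_product(rng.normal(size=d), rng.normal(size=d)) for _ in range(200)]
    print(f"d={d:>6}  mean ⵟ ≈ {np.mean(vals):.3f}")
# Both (w·x)^2 and ||x - w||^2 grow like O(d), so their ratio stays O(1) across dimensions.
```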

🔄 Mitigates Internal Covariate Shift

As inputs grow large, activation statistics depend only on angular distribution, not magnitude. This naturally mitigates the covariate shift problem.
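
One quick way to see this is to rescale an entire batch and compare activation statistics. The sketch below reuses the assumed `tz_product` helper; the batch size, input scale, and seed are illustrative assumptions.

```python
import numpy as np

def tz_product(w, x, eps=1e-6):
    return np.dot(w, x) ** 2 / (np.sum((x - w) ** 2) + eps)

rng = np.random.default_rng(3)
d, batch = 64, 512
w = rng.normal(size=d)
X = 50.0 * rng.normal(size=(batch, d))          # inputs already much larger than w

for scale in [1.0, 1000.0]:
    acts = np.array([tz_product(w, scale * x) for x in X])
    print(f"scale={scale:>6.0f}  mean={acts.mean():.4f}  std={acts.std():.4f}")
# A 1000x rescaling of the batch barely shifts the activation statistics,
# because for large inputs the output depends mainly on direction, not magnitude.
```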

💾 Memory Efficiency

15-25% reduction in memory from eliminating normalization layers (BatchNorm, LayerNorm). Simpler, faster training.

🎓 What This Really Means

This proposition shows that the ⵟ-product has built-in stability. Unlike ReLU or linear layers, it doesn't need external mechanisms to prevent numerical issues.

This is a geometric property — the inverse-square law in the denominator naturally creates a "safety valve" that prevents unbounded growth.

Practical Impact: No gradient explosion from large inputs. No need for gradient clipping in most cases. Simpler, more stable training dynamics.
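
To make the practical picture concrete, here is a minimal layer-style sketch in numpy. It is an illustration under our own assumptions (the class name `TzLayer`, the shapes, the ε value, and the plain standard-normal initialization are all ours), not a reference implementation.

```python
import numpy as np

class TzLayer:
    """Each row of W acts as a weight vector w, combined with every input x via the ⵟ-product."""

    def __init__(self, in_features, out_features, eps=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        # Plain standard-normal init: the O(d)/O(d) scaling of the ⵟ-product keeps
        # activations O(1) without Xavier/He-style variance correction.
        self.W = rng.normal(size=(out_features, in_features))
        self.eps = eps

    def __call__(self, X):
        # X: (batch, in_features) -> output: (batch, out_features)
        num = (X @ self.W.T) ** 2                          # (w·x)^2 for every pair
        diff = X[:, None, :] - self.W[None, :, :]          # x - w, broadcast over the batch
        den = np.sum(diff ** 2, axis=-1) + self.eps        # ||x - w||^2 + eps
        return num / den

layer = TzLayer(in_features=128, out_features=32)
X = 1e4 * np.random.default_rng(1).normal(size=(8, 128))   # deliberately huge inputs
print(layer(X).max())                                      # stays finite and modest
```

Even with inputs four orders of magnitude larger than the weights, the activations stay small, which is the bounded-output behavior the proposition describes.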