Imagine you have a volume knob that can go from 0 to 100. If you turn it all the way up, it might break your speakers! But what if the knob had a safety limit that prevents it from going too loud?
The โต-product is like that safe volume knob. Even if you give it really big numbers, it automatically limits itself and never explodes. It's self-regulating!
This means we don't need extra "safety equipment" (like normalization layers) โ the โต-product is already safe by design.
Traditional neural networks suffer from unbounded growth:
The โต-product solves all three problems naturally, without explicit normalization.
As $\|\mathbf{x}\| \to \infty$, we can write $\mathbf{x} = k \mathbf{u}$ where $k = \|\mathbf{x}\|$ and $\mathbf{u}$ is a unit vector. Then:
Dividing numerator and denominator by $k^2$ and taking the limit:
where $\cos\theta = \frac{\mathbf{w}^\top\mathbf{u}}{\|\mathbf{w}\|}$ is the cosine of the angle between $\mathbf{w}$ and $\mathbf{u}$.
Outliers don't cause numerical instabilities. The output is bounded by $\|\mathbf{w}\|^2$, regardless of input magnitude.
At initialization, both numerator and denominator scale as $\mathcal{O}(d)$, so their ratio remains $\mathcal{O}(1)$. No need for Xavier/He initialization!
As inputs grow large, activation statistics depend only on angular distribution, not magnitude. This naturally mitigates the covariate shift problem.
15-25% reduction in memory from eliminating normalization layers (BatchNorm, LayerNorm). Simpler, faster training.
This proposition shows that the โต-product has built-in stability. Unlike ReLU or linear layers, it doesn't need external mechanisms to prevent numerical issues.
This is a geometric property โ the inverse-square law in the denominator naturally creates a "safety valve" that prevents unbounded growth.