Explain Like I'm 5
Building with LEGOs takes time and table space! ⏱️🧱
- ⏱️ FLOPs = How many "clicks" to snap pieces together (time)
- 📦 Memory = How big a table you need (space)
- 🤝 Trade-off: NMN takes roughly twice as many clicks, but needs a smaller table!
It's like trading: do 2× the snapping, but your table can be up to 25% smaller!
🧮 Interactive FLOP Calculator
⚡ Layer Complexity Calculator
Complexity Analysis
Key Identity: We reuse computation via the expansion:
$$\|\mathbf{w} - \mathbf{x}\|^2 = \|\mathbf{w}\|^2 + \|\mathbf{x}\|^2 - 2\mathbf{w}^\top\mathbf{x}$$
Since $\mathbf{w}^\top\mathbf{x}$ is already computed for the numerator, the denominator
comes "almost free"!
Detailed FLOP Breakdown
| Operation | Linear + ReLU (FLOPs) | NMN Layer (FLOPs) |
|---|---|---|
| Matrix multiply | $2Bnd$ | $2Bnd$ (same) |
| Weight norms | ❌ | $nd$ (can cache) |
| Input norms | ❌ | $Bd$ |
| Distance computation | ❌ | $3Bn$ (from identity) |
| Square + Division | ❌ | $2Bn$ |
| ReLU | $Bn$ | ❌ |
| Total | $\approx 2Bnd$ | $\approx 4Bnd$ |
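As a stand-in for the interactive calculator above, here is a small sketch that plugs numbers into the itemized rows of this table (forward pass only; the variable names are mine):

```python
def flop_breakdown(B, n, d):
    """Itemized forward-pass FLOPs per the table above.
    B = batch size, d = input dimension, n = output units."""
    linear_relu = {"matmul": 2 * B * n * d, "relu": B * n}
    nmn = {
        "matmul":       2 * B * n * d,
        "weight_norms": n * d,          # cacheable (see Optimization Techniques)
        "input_norms":  B * d,
        "distances":    3 * B * n,      # via the norm identity
        "square_div":   2 * B * n,
    }
    return linear_relu, nmn

lin, nmn = flop_breakdown(B=32, n=4096, d=4096)
print(f"Linear+ReLU: {sum(lin.values()) / 1e9:.2f} GFLOPs")
print(f"NMN (itemized): {sum(nmn.values()) / 1e9:.2f} GFLOPs")
```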
💾 Memory Analysis
Linear + ReLU
Must store $Bn$ activation values for ReLU backward pass (to compute gradient mask).
NMN Layer
No activation storage needed; the gradient flows through the geometric computation directly.
The Trade-off: NMN uses ~2× FLOPs but saves 15-25% peak memory.
For large models (LLMs, vision transformers), memory is often the bottleneck, so the extra compute is worth it if you can fit larger batches or longer sequences!
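A back-of-the-envelope sketch of where the savings come from: the $Bn$ values per layer that ReLU keeps around for its backward pass. The model shape below is purely illustrative, not a measurement behind the 15-25% figure.

```python
def relu_activation_bytes(B, n, n_layers, bytes_per_value=2):
    """Activation memory a Linear+ReLU stack holds for the backward pass:
    B*n values per layer (the tensor needed to build the gradient mask)."""
    return B * n * n_layers * bytes_per_value

# Illustrative transformer-ish shape: batch 32 x 2048 tokens, width 4096, 32 layers, bf16
saved = relu_activation_bytes(B=32 * 2048, n=4096, n_layers=32)
print(f"~{saved / 2**30:.1f} GiB of ReLU activations that an NMN layer would not store")
```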
Optimization Techniques
- Norm Caching: $\|\mathbf{w}\|^2$ can be computed once and cached per layer (see the sketch after this list)
- Fused Kernels: Custom CUDA kernels can combine operations for better memory bandwidth
- Mixed Precision: BF16 shows excellent stability due to bounded outputs
- Gradient Checkpointing: Less relevant since we already save activation memory
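For example, the norm-caching bullet could look like this PyTorch-style sketch. The class and method names are hypothetical; during training the norms are simply recomputed (about $nd$ FLOPs) because the weights change every step, and the cache only kicks in once the module is in eval mode.

```python
import torch
import torch.nn as nn

class CachedWeightNorms(nn.Module):
    """Sketch of norm caching: ||w_i||^2 is reused across forward passes
    whenever the weights are frozen (eval/inference)."""

    def __init__(self, n_out, d_in):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_out, d_in) / d_in ** 0.5)
        self._w_sq_cache = None  # shape (n_out,) once populated

    def weight_sq_norms(self):
        if self.training or self._w_sq_cache is None:
            w_sq = (self.weight ** 2).sum(dim=1)  # ~n*d FLOPs
            self._w_sq_cache = w_sq.detach()      # reused on later eval-mode calls
            return w_sq
        return self._w_sq_cache
```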
Scaling Behavior
High-Dimensional Scaling:
In high dimensions ($d \to \infty$), for random unit vectors:
$$\mathbb{E}[\aleph(\mathbf{w}, \mathbf{x})] \approx \frac{1/d}{2 + \epsilon} = O(1/d)$$
The ℵ-product naturally handles the "curse of dimensionality" through its self-regulating denominator.
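The exact definition of the ℵ-product is not restated in this section, so the Monte Carlo check below assumes the form $\aleph(\mathbf{w},\mathbf{x}) = (\mathbf{w}^\top\mathbf{x})^2 / (\|\mathbf{w}-\mathbf{x}\|^2 + \epsilon)$, which matches the numerator/denominator structure and the square-and-divide step described above; under that assumption it reproduces the $O(1/d)$ decay.

```python
import numpy as np

def aleph(w, x, eps=1e-6):
    """Assumed form of the aleph-product: (w.x)^2 / (||w - x||^2 + eps)."""
    return (w @ x) ** 2 / (np.sum((w - x) ** 2) + eps)

rng = np.random.default_rng(0)
for d in (64, 256, 1024, 4096):
    vals = []
    for _ in range(2000):
        w = rng.standard_normal(d); w /= np.linalg.norm(w)
        x = rng.standard_normal(d); x /= np.linalg.norm(x)
        vals.append(aleph(w, x))
    # For random unit vectors E[(w.x)^2] = 1/d and ||w - x||^2 ~ 2, so expect ~ 1/(2d)
    print(f"d={d:5d}  mean={np.mean(vals):.2e}  1/(2d)={1 / (2 * d):.2e}")
```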