Explain Like I'm 5
We gave the ℵ-product brain some tests to see how smart it is:
- 🧩 Puzzle Test (XOR): A tricky puzzle that normal simple brains can't solve. Our brain solved it with just ONE tiny piece!
- 🔢 Number Drawing Test (MNIST): Looking at handwritten numbers and guessing what they are. Our brain learned clearer, sharper pictures!
- 🖼️ Picture Test (CIFAR/ImageNet): Recognizing cats, dogs, planes, and more. Our brain often did better, especially on harder tests!
- 📝 Writing Test (GPT-2): Predicting the next word in a sentence. Our brain was 11% better AND used less memory!
The best part? Our brain is simpler than other brains but works just as well or better!
🧩 The XOR Problem: A Classic Test
The XOR (exclusive or) problem is legendary in neural network history. It's a simple pattern that a single linear neuron cannot learn, proving the need for non-linearity. Here's the pattern:
| Input $x_1$ | Input $x_2$ | XOR Output | Pattern |
|---|---|---|---|
| 0 | 0 | 0 | Both same → 0 |
| 0 | 1 | 1 | Different → 1 |
| 1 | 0 | 1 | Different → 1 |
| 1 | 1 | 0 | Both same → 0 |
Why it's hard: You can't draw a single straight line to separate the 1s from the 0s. Traditional solutions require either multiple layers OR an activation function.

With the ℵ-product, $\aleph(\mathbf{w}, \mathbf{x}) = \frac{(\mathbf{w}^T\mathbf{x})^2}{\|\mathbf{w}-\mathbf{x}\|^2+\epsilon}$, a single neuron with weight vector $\mathbf{w} = (1, -1)$ separates the two classes (verified numerically in the sketch below):
- $(0,0)$ and $(1,1)$: $\mathbf{w}^T\mathbf{x} = 0$, so $\aleph(\mathbf{w}, \mathbf{x}) = 0$ ✓
- $(0,1)$: $\aleph(\mathbf{w}, \mathbf{x}) = \frac{(-1)^2}{5+\epsilon} > 0$ ✓
- $(1,0)$: $\aleph(\mathbf{w}, \mathbf{x}) = \frac{(1)^2}{1+\epsilon} > 0$ ✓
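A minimal numerical check of this single-neuron solution. This is an illustrative sketch: the helper name `aleph_product`, the value of `eps`, and the 0.1 decision threshold are my own choices, not code from the paper.

```python
import numpy as np

EPS = 1e-6  # illustrative choice; keeps the denominator strictly positive

def aleph_product(w, x, eps=EPS):
    """ℵ-product: squared dot product divided by squared distance to the weight vector."""
    return np.dot(w, x) ** 2 / (np.sum((w - x) ** 2) + eps)

w = np.array([1.0, -1.0])  # the single weight vector from the bullets above

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([x1, x2], dtype=float)
    score = aleph_product(w, x)
    # scores: 0.0, 0.2, ~1.0, 0.0 -> any threshold in (0, 0.2) separates XOR
    print(f"x=({x1},{x2})  ℵ-score={score:.3f}  prediction={int(score > 0.1)}")
```

The two "different" inputs score 0.2 and ~1.0 while the "same" inputs score exactly 0, so any threshold between 0 and 0.2 recovers XOR from a single neuron.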
🎮 Live XOR Training: Linear vs ℵ-Product
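The interactive demo is not reproduced in this text. As a rough offline stand-in, the sketch below trains one linear neuron and one ℵ-product neuron on the four XOR points with a sigmoid readout and SGD; every hyperparameter here is an illustrative assumption rather than the demo's actual configuration.

```python
import torch

EPS = 1e-6
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([0., 1., 1., 0.])

def aleph(w, x, eps=EPS):
    """ℵ-product of one weight vector against a batch of inputs."""
    return (x @ w) ** 2 / (((x - w) ** 2).sum(dim=1) + eps)

def train(score_fn, steps=2000, lr=0.1):
    w = torch.randn(2, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([w, b], lr=lr)
    for _ in range(steps):
        loss = torch.nn.functional.binary_cross_entropy_with_logits(score_fn(w) + b, y)
        opt.zero_grad(); loss.backward(); opt.step()
    preds = (torch.sigmoid(score_fn(w) + b) > 0.5).float()
    return (preds == y).float().mean().item()

print(f"linear neuron:    {train(lambda w: X @ w):.0%} accuracy")   # capped at 75% on XOR
print(f"ℵ-product neuron: {train(lambda w: aleph(w, X)):.0%} accuracy")
```

No matter how it is trained, the linear neuron has no separating hyperplane available, whereas the ℵ-product neuron admits exact solutions such as $\mathbf{w} = (1, -1)$.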
🔢 MNIST: Learning Digit Prototypes
MNIST is the "hello world" of machine learning: 60,000 handwritten digits for training and 10,000 for testing. What's remarkable isn't just the accuracy, but what the neurons learn:
Linear Model Prototypes
Conventional linear neurons learn diffuse, blurry prototypes. They try to capture all variations of a digit, resulting in smeared, hard-to-interpret weight patterns.
ℵ-Product Prototypes
NMN neurons learn sharp, geometrically coherent digit representations. Each weight vector looks like a clear, prototypical example of its digit class.
A surprising discovery: ℵ-product neurons exhibit superposition behavior. When prototypes are inverted ($\mathbf{w} \to -\mathbf{w}$):
- Dot product neurons: 91.88% → ~0.01% (complete failure!)
- ℵ-product neurons: 92.18% → 87.87% (robust!)
The squared numerator means the sign of the dot product doesn't matter, so both $\mathbf{w}$ and $-\mathbf{w}$ remain near-valid solutions. Only the distance term in the denominator shifts under inversion, which is why accuracy dips slightly instead of collapsing.
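A quick sanity check of the sign-invariance claim on random vectors (an illustrative sketch, not the MNIST evaluation code):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=784)   # stand-in for a learned digit prototype
x = rng.normal(size=784)   # stand-in for a flattened input image

numerator   = lambda w, x: np.dot(w, x) ** 2
denominator = lambda w, x: np.sum((w - x) ** 2)

print(numerator(w, x) == numerator(-w, x))      # True: (w·x)^2 is exactly sign-invariant
print(denominator(w, x) == denominator(-w, x))  # False in general: the distance term shifts
```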
🖼️ Vision Benchmarks: From CIFAR to ImageNet
We systematically compared standard architectures with their "Aether" variants (where standard layers are replaced with ℵ-product units).
| Architecture | CIFAR-10 | CIFAR-100 | STL-10 | Tiny-ImageNet |
|---|---|---|---|---|
| ResNet-18 | 94.23% | 72.15% | 78.42% | 56.89% |
| Aether-ResNet-18 | 92.37% | 74.83% | 80.91% | 59.34% |
| ViT-Small | 91.78% | 69.91% | 75.13% | 52.76% |
| Aether-ViT-Small | 92.45% | 70.58% | 78.89% | 51.42% |
On ImageNet-1K, the flagship large-scale benchmark:
- ResNet-50: 74.13%
- Aether-ResNet-50: 75.24% (+1.11 percentage points)
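For intuition only, here is a minimal sketch of what an ℵ-product drop-in for a fully connected layer could look like. The module name `AlephLinear`, the initialization, and the per-output-unit formulation are my own assumptions, not the released Aether code:

```python
import torch
import torch.nn as nn

class AlephLinear(nn.Module):
    """Each output unit scores the input with the ℵ-product against one prototype row of `weight`."""
    def __init__(self, in_features, out_features, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(0.02 * torch.randn(out_features, in_features))
        self.eps = eps

    def forward(self, x):                        # x: (batch, in_features)
        dot = x @ self.weight.t()                # (batch, out_features)
        # ||w_i - x||^2 expanded as ||x||^2 - 2 x·w_i + ||w_i||^2
        dist_sq = ((x * x).sum(-1, keepdim=True) - 2 * dot
                   + (self.weight * self.weight).sum(-1)).clamp(min=0.0)
        return dot ** 2 / (dist_sq + self.eps)   # non-negative scores, no activation needed

layer = AlephLinear(512, 10)
scores = layer(torch.randn(32, 512))             # -> shape (32, 10)
```

Swapping units like this in for standard linear layers, and dropping the activation that would normally follow, is the spirit of the "Aether" variants benchmarked above.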
📝 AetherGPT: Language Modeling at Scale
To prove the ℵ-product generalizes beyond vision, we adapted the GPT-2 architecture with ℵ-Attention and NMN layers.
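As a rough illustration of the idea rather than the AetherGPT implementation: one way ℵ-Attention could work is to score each query–key pair with the ℵ-product instead of the scaled dot product. Keeping the softmax over keys, and omitting causal masking and multiple heads, are simplifying assumptions on my part.

```python
import torch

def aleph_attention(q, k, v, eps=1e-6):
    """Single-head attention with ℵ-product similarities. q, k, v: (batch, seq, dim)."""
    dot = q @ k.transpose(-2, -1)                                    # (batch, seq_q, seq_k)
    # pairwise ||q_i - k_j||^2 via the expansion ||q||^2 - 2 q·k + ||k||^2
    dist_sq = ((q * q).sum(-1, keepdim=True) - 2 * dot
               + (k * k).sum(-1).unsqueeze(-2)).clamp(min=0.0)
    scores = dot ** 2 / (dist_sq + eps)                              # ℵ-product attention scores
    return torch.softmax(scores, dim=-1) @ v

out = aleph_attention(torch.randn(2, 16, 64), torch.randn(2, 16, 64), torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```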
Validation loss (lower is better):

| Precision | GPT-2 | Aether-GPT2 | Improvement |
|---|---|---|---|
| FP32 | 2.43 | 2.29 | 5.8% |
| BF16 | 3.03 | 2.69 | 11.2% |
By eliminating activation functions and normalization layers, Aether-GPT2 achieves:
- 15-25% reduction in peak memory usage
- No storage of intermediate activations for ReLU/GELU gradients
- No LayerNorm statistics to track
On identical hardware (Kaggle TPU v5-8, batch size 64, context length 1024):
- Linear baseline: 138k tokens/s, 4h 50m 10s total
- Aether-GPT2: 132k tokens/s, 5h 02m 31s total
About 4% slower in raw throughput, but the memory savings enable larger batch sizes or longer contexts, often a net win in practice.
🌀 Vortex Decision Boundaries
One of the most visually striking differences between linear and NMN classifiers is their decision boundaries:
Linear Classifiers
Create unbounded half-space partitions. Each class region extends to infinity in some direction. Points far from the training data can still be confidently (and incorrectly) classified.
ℵ-Product Classifiers
Create localized, vortex-like territories around learned prototypes. Each neuron has a bounded region of influence. Points far from all prototypes receive low confidence, a natural measure of uncertainty!
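A toy numerical illustration of the bounded-influence claim (my own example, not a figure from the paper): slide a test point away from a prototype along the prototype's own direction and compare the raw linear score with the ℵ-product score.

```python
import numpy as np

EPS = 1e-6
w = np.array([1.0, 1.0])                        # a learned prototype, ||w||^2 = 2

def aleph_product(w, x, eps=EPS):
    return np.dot(w, x) ** 2 / (np.sum((w - x) ** 2) + eps)

for t in [1, 2, 10, 100]:                       # x = t * w moves away from the prototype
    x = t * w
    print(f"t={t:>3}  linear={np.dot(w, x):7.1f}  aleph={aleph_product(w, x):12.1f}")
```

The linear score keeps growing with distance (2, 4, 20, 200), while the ℵ-product score spikes at the prototype itself (t = 1, where the distance term vanishes) and then falls back toward a small bounded value of roughly $\|\mathbf{w}\|^2 = 2$: the "bounded region of influence" described above.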
✅ Experimental Takeaways
| Experiment | Key Finding | Significance |
|---|---|---|
| XOR | Single neuron solution | Proves intrinsic non-linearity |
| MNIST | Sharper prototypes, inversion robustness | Better geometric representations |
| Vision | Outperforms on complex datasets | Scales to real-world tasks |
| Language | 11.2% improvement, 15-25% less memory | Domain-agnostic benefits |
| Boundaries | Localized vortex territories | Built-in uncertainty quantification |