โ† Back to Theory Results

Experiments & Results

🧒 Explain Like I'm 5

We gave the ⵟ-product brain some tests to see how smart it is:

  • 🧩 Puzzle Test (XOR): A tricky puzzle that normal simple brains can't solve. Our brain solved it with just ONE tiny piece!
  • 🔢 Number Drawing Test (MNIST): Looking at handwritten numbers and guessing what they are. Our brain learned clearer, sharper pictures!
  • 🖼️ Picture Test (CIFAR/ImageNet): Recognizing cats, dogs, planes, and more. Our brain often did better, especially on harder tests!
  • 📖 Writing Test (GPT-2): Predicting the next word in a sentence. Our brain was 11% better AND used less memory!

The best part? Our brain is simpler than other brains but works just as well or better!

🧩 The XOR Problem: A Classic Test

The XOR (exclusive or) problem is legendary in neural network history. It's a simple pattern that a single linear neuron cannot learn, proving the need for non-linearity. Here's the pattern:

| Input $x_1$ | Input $x_2$ | XOR Output | Pattern |
|---|---|---|---|
| 0 | 0 | 0 | Both same → 0 |
| 0 | 1 | 1 | Different → 1 |
| 1 | 0 | 1 | Different → 1 |
| 1 | 1 | 0 | Both same → 0 |

Why it's hard: You can't draw a single straight line to separate the 1s from the 0s. Traditional solutions require a hidden layer combined with a non-linear activation function.

The ⵟ-Product Solution: A single neuron with $\mathbf{w} = [1, -1]^T$ naturally solves XOR:
  • $(0,0)$ and $(1,1)$: $\mathbf{w}^T\mathbf{x} = 0$, so $\text{ⵟ}(\mathbf{w}, \mathbf{x}) = 0$ ✓
  • $(0,1)$: $\text{ⵟ}(\mathbf{w}, \mathbf{x}) = \frac{(-1)^2}{5+\epsilon} > 0$ ✓
  • $(1,0)$: $\text{ⵟ}(\mathbf{w}, \mathbf{x}) = \frac{(1)^2}{1+\epsilon} > 0$ ✓
💡 Why This Works: The weight vector $[1, -1]$ is orthogonal to both $(0,0)$ and $(1,1)$ (dot product = 0), but has non-zero dot products with $(0,1)$ and $(1,0)$. The denominators $5+\epsilon$ and $1+\epsilon$ are simply the squared distances $\|\mathbf{w}-\mathbf{x}\|^2$ to each input. The ⵟ-product's intrinsic non-linearity handles the rest!
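For the curious, here is a minimal numerical check of the worked values above. It assumes the ⵟ-product takes the form implied by those numbers, $\text{ⵟ}(\mathbf{w}, \mathbf{x}) = \frac{(\mathbf{w}^T\mathbf{x})^2}{\|\mathbf{w}-\mathbf{x}\|^2+\epsilon}$; the function name `yat_product` is just an ASCII stand-in, not the reference implementation.

```python
# Minimal sketch: a single yat-product neuron on XOR, assuming
# yat(w, x) = (w . x)^2 / (||w - x||^2 + eps) as implied by the values above.
import numpy as np

def yat_product(w, x, eps=1e-6):
    """Squared dot product, scaled by the inverse squared distance to the prototype."""
    return np.dot(w, x) ** 2 / (np.sum((w - x) ** 2) + eps)

w = np.array([1.0, -1.0])  # the weight vector from the text
for x, target in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
    score = yat_product(w, np.array(x, dtype=float))
    print(x, target, round(score, 3))
# (0,0) and (1,1) score exactly 0; (0,1) scores ~0.2 and (1,0) scores ~1.0,
# so any threshold between 0 and 0.2 separates the two classes with one neuron.
```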
🎮 Live XOR Training (interactive demo): a linear neuron (can't solve XOR) trained side by side with a ⵟ-product neuron (solves XOR!), with both losses plotted live.

🔢 MNIST: Learning Digit Prototypes

MNIST is the "hello world" of machine learning: 60,000 handwritten digits for training and 10,000 for testing. What's remarkable isn't just the accuracy, but what the neurons learn:

📊 Linear Model Prototypes

Conventional linear neurons learn diffuse, blurry prototypes. They try to capture all variations of a digit, resulting in smeared, hard-to-interpret weight patterns.

✨ ⵟ-Product Prototypes

NMN neurons learn sharp, geometrically coherent digit representations. Each weight vector looks like a clear, prototypical example of its digit class.
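To make the prototype interpretation concrete, here is a hedged PyTorch sketch of a ⵟ-product layer (the class name `YatProductLayer` is ours, and the formula is the same assumed form as in the XOR sketch, applied row-wise): each output unit keeps one weight row that can be reshaped to 28×28 and viewed as a digit prototype.

```python
# Hedged sketch of a yat-product layer: each output unit j computes
# (w_j . x)^2 / (||w_j - x||^2 + eps), so every weight row is a prototype
# that can be reshaped to 28x28 and inspected like the images described above.
import torch
import torch.nn as nn

class YatProductLayer(nn.Module):
    def __init__(self, in_features, out_features, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.eps = eps

    def forward(self, x):                               # x: (batch, in_features)
        dot = x @ self.weight.t()                       # (batch, out) = w_j . x
        # ||w_j - x||^2 expanded as ||w_j||^2 - 2 w_j.x + ||x||^2 (no pairwise diff tensor)
        w_sq = (self.weight ** 2).sum(dim=1)            # (out,)
        x_sq = (x ** 2).sum(dim=1, keepdim=True)        # (batch, 1)
        dist_sq = (w_sq - 2.0 * dot + x_sq).clamp_min(0.0)
        return dot ** 2 / (dist_sq + self.eps)

# A one-layer MNIST classifier: ten prototypes, one per digit.
model = YatProductLayer(28 * 28, 10)
scores = model(torch.rand(32, 28 * 28))                 # (32, 10) non-negative scores
```

Training this against cross-entropy is an ordinary loop; the point of the sketch is only that `model.weight` is directly viewable as ten prototype images.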

🔄 Superposition & Prototype Inversion

A surprising discovery: ⵟ-product neurons exhibit superposition behavior. When prototypes are inverted ($\mathbf{w} \to -\mathbf{w}$):

  • Dot product neurons: 91.88% → ~0.01% (complete failure!)
  • ⵟ-product neurons: 92.18% → 87.87% (robust!)

The squared numerator means the sign of the dot product doesn't matter: both $\mathbf{w}$ and $-\mathbf{w}$ are valid solutions! Only the $\|\mathbf{w}-\mathbf{x}\|^2$ denominator shifts under inversion, which is why accuracy dips slightly instead of collapsing.
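The sign-invariance is easy to check numerically. A hedged sketch, reusing the assumed ⵟ form from above: negating a linear layer's weights flips every score, while negating ⵟ prototypes leaves the squared numerator untouched and only perturbs the distance denominator.

```python
# Sketch of the inversion check (illustrative, random stand-ins for learned weights).
import torch

def yat(w, x, eps=1e-6):
    # scores[b, j] = (w_j . x_b)^2 / (||w_j - x_b||^2 + eps)
    return (x @ w.t()) ** 2 / (((x.unsqueeze(1) - w) ** 2).sum(-1) + eps)

x = torch.randn(4, 784)              # a small batch of flattened "images"
w = 0.01 * torch.randn(10, 784)      # ten prototypes (random here, learned in practice)

print(torch.allclose(x @ (-w).t(), -(x @ w.t())))   # True: linear scores flip sign
gap = (yat(w, x) - yat(-w, x)).abs().mean()
print(gap.item())                    # small but non-zero: only the denominator moved
```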

๐Ÿ–ผ๏ธ Vision Benchmarks: From CIFAR to ImageNet

We systematically compared standard architectures with their "Aether" variants, where standard layers are replaced with ⵟ-product units (a minimal construction sketch follows the ImageNet numbers below).

| Architecture | CIFAR-10 | CIFAR-100 | STL-10 | Tiny-ImageNet |
|---|---|---|---|---|
| ResNet-18 | 94.23% | 72.15% | 78.42% | 56.89% |
| Aether-ResNet-18 | 92.37% | 74.83% | 80.91% | 59.34% |
| ViT-Small | 91.78% | 69.91% | 75.13% | 52.76% |
| Aether-ViT-Small | 92.45% | 70.58% | 78.89% | 51.42% |
📈 Key Pattern: Aether variants tend to outperform baselines more significantly on more complex datasets (CIFAR-100, STL-10, Tiny-ImageNet) compared to simpler ones (CIFAR-10). This suggests the geometric awareness of the ⵟ-product helps more when the task is harder.

On ImageNet-1K, the flagship large-scale benchmark:

  • ResNet-50: 74.13%
  • Aether-ResNet-50: 75.24% (+1.11% improvement)
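As a rough illustration of what "replacing standard layers" can look like in code, here is a simplified sketch. It is our construction, not the paper's recipe: it only swaps fully connected layers and reuses the `YatProductLayer` sketched in the MNIST section, whereas the actual Aether models presumably also convert convolutions and attention.

```python
# Hedged sketch: build an "Aether-style" variant by swapping nn.Linear modules for
# yat-product layers. In ResNet-18 only the classifier head is a Linear, so this is
# a much smaller change than a full Aether conversion.
import torch.nn as nn
from torchvision.models import resnet18

def swap_linear_for_yat(module, yat_cls):
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, yat_cls(child.in_features, child.out_features))
        else:
            swap_linear_for_yat(child, yat_cls)
    return module

# YatProductLayer is the layer sketched in the MNIST section above.
aether_ish_resnet = swap_linear_for_yat(resnet18(num_classes=10), YatProductLayer)
```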

📖 AetherGPT: Language Modeling at Scale

To prove the ⵟ-product generalizes beyond vision, we adapted the GPT-2 architecture with ⵟ-Attention and NMN layers.
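The ⵟ-Attention mechanism is not spelled out in this section, so the sketch below is purely an assumption for orientation: it keeps the usual softmax-attention shape but scores each query/key pair with the ⵟ-product instead of a scaled dot product. The real AetherGPT formulation may differ (masking, scaling, and normalization details are omitted).

```python
# Assumed (not confirmed) reading of "yat-attention": replace the q.k similarity
# with yat(q, k) = (q . k)^2 / (||q - k||^2 + eps) before the softmax.
import torch
import torch.nn.functional as F

def yat_attention_scores(q, k, eps=1e-6):
    # q, k: (batch, heads, seq, head_dim)
    dot = q @ k.transpose(-2, -1)                            # (b, h, s, s)
    q_sq = (q ** 2).sum(-1, keepdim=True)                    # (b, h, s, 1)
    k_sq = (k ** 2).sum(-1, keepdim=True).transpose(-2, -1)  # (b, h, 1, s)
    dist_sq = (q_sq - 2.0 * dot + k_sq).clamp_min(0.0)       # ||q_i - k_j||^2
    return dot ** 2 / (dist_sq + eps)

q = k = torch.randn(1, 4, 8, 16)
v = torch.randn(1, 4, 8, 16)
out = F.softmax(yat_attention_scores(q, k), dim=-1) @ v      # (1, 4, 8, 16)
```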

Training Setup: 2.5 billion tokens from the FineWeb dataset, trained on a Kaggle TPU v5-8, with the same hyperparameters for a fair comparison.
🎯 Validation Loss (FP32)

GPT-2: 2.43
Aether-GPT2: 2.29
5.8% improvement

⚡ Validation Loss (BF16)

GPT-2: 3.03
Aether-GPT2: 2.69
11.2% improvement!

💾 Memory Efficiency

By eliminating activation functions and normalization layers, Aether-GPT2 achieves:

  • 15-25% reduction in peak memory usage
  • No storage of intermediate activations for ReLU/GELU gradients
  • No LayerNorm statistics to track
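A hedged sketch of where those savings come from, using the `YatProductLayer` from the MNIST section as a stand-in for the paper's NMN layers (the exact AetherGPT block layout is an assumption here): a conventional feed-forward block keeps GELU inputs and LayerNorm statistics around for the backward pass, while a ⵟ-only block simply has neither.

```python
# Illustrative comparison of feed-forward blocks (assumed layouts, not the exact
# AetherGPT architecture). The standard block must cache GELU inputs and LayerNorm
# statistics for backprop; the yat-only block has no such intermediates.
import torch.nn as nn

def standard_ffn(d_model, d_ff):
    return nn.Sequential(
        nn.LayerNorm(d_model),          # per-token statistics saved for backward
        nn.Linear(d_model, d_ff),
        nn.GELU(),                      # its input is stored for the GELU gradient
        nn.Linear(d_ff, d_model),
    )

def aether_style_ffn(d_model, d_ff):
    return nn.Sequential(
        YatProductLayer(d_model, d_ff), # layer sketched in the MNIST section;
        YatProductLayer(d_ff, d_model), # no activation, no normalization to cache
    )
```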
โฑ๏ธ Throughput Comparison

On identical hardware (Kaggle TPU v5-8, batch size 64, context length 1024):

  • Linear baseline: 138k tokens/s, 4h 50m 10s total
  • Aether-GPT2: 132k tokens/s, 5h 02m 31s total

About 4% slower in raw throughput, but the memory savings enable larger batch sizes or longer contexts, which is often a net win in practice.

🌀 Vortex Decision Boundaries

One of the most visually striking differences between linear and NMN classifiers is their decision boundaries:

📏 Linear Classifiers

Create unbounded half-space partitions. Each class region extends to infinity in some direction. Points far from the training data can still be confidently (and incorrectly) classified.

🌀 ⵟ-Product Classifiers

Create localized, vortex-like territories around learned prototypes. Each neuron has a bounded region of influence. Points far from all prototypes receive low confidence: a natural measure of uncertainty!

🔬 Interpretability Bonus: Because each ⵟ-product neuron creates a localized "territory," we can interpret each weight vector as a prototype of what that neuron "looks for." This makes NMNs inherently more interpretable than black-box neural networks.
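The contrast is easy to reproduce on a toy 2D example. A hedged sketch (same assumed ⵟ form, illustrative prototype values): the linear score keeps growing along its weight direction, while the ⵟ score spikes near the prototype and stays bounded far from it.

```python
# Toy visualisation of the two score surfaces over a 2D grid (illustrative values).
import numpy as np
import matplotlib.pyplot as plt

w, eps = np.array([1.0, 0.5]), 0.5        # eps enlarged so the contours stay readable
xs, ys = np.meshgrid(np.linspace(-4, 4, 200), np.linspace(-4, 4, 200))
pts = np.stack([xs, ys], axis=-1)         # (200, 200, 2) grid of input points

linear_score = pts @ w                                         # unbounded half-space score
yat_score = (pts @ w) ** 2 / (((pts - w) ** 2).sum(-1) + eps)  # peaks near the prototype

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
for ax, z, title in [(axes[0], linear_score, "linear score"), (axes[1], yat_score, "yat score")]:
    ax.contourf(xs, ys, z, levels=30)
    ax.plot(*w, "r*", markersize=12)      # the prototype location
    ax.set_title(title)
plt.show()
```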

✅ Experimental Takeaways

| Experiment | Key Finding | Significance |
|---|---|---|
| XOR | Single neuron solution | Proves intrinsic non-linearity |
| MNIST | Sharper prototypes, inversion robustness | Better geometric representations |
| Vision | Outperforms on complex datasets | Scales to real-world tasks |
| Language | 11.2% improvement, 15-25% less memory | Domain-agnostic benefits |
| Boundaries | Localized vortex territories | Built-in uncertainty quantification |