## Explain Like I'm 5
You know how, when someone starts a sentence, you can often guess the next word?
- "Once upon a ____": you'd guess "time"!
- GPT is a computer program that learned to do this by reading LOTS of books.
- AetherGPT plays the same guessing game, but using our โต-product magic!
The cool part? AetherGPT makes fewer mistakes and uses less memory!
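Here is that guessing game as actual code: a toy bigram counter that just tallies which word tends to follow which. This is only a cartoon of the idea (a real GPT learns far richer statistics from billions of words), and every name in it is made up for illustration:

```python
from collections import Counter, defaultdict

# Toy "guess the next word" model: count which word follows each word
# in a few example sentences, then predict the most frequent follower.
sentences = [
    "once upon a time there was a dragon",
    "once upon a time there lived a queen",
    "once upon a midnight dreary",
]

follows = defaultdict(Counter)
for s in sentences:
    words = s.split()
    for word, nxt in zip(words, words[1:]):
        follows[word][nxt] += 1

def guess_next(word: str) -> str:
    """Return the word seen most often right after `word`."""
    return follows[word].most_common(1)[0][0]

print(guess_next("a"))  # -> "time" (seen twice; every rival seen once)
```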
## AetherGPT Architecture
AetherGPT is a GPT-2 variant that replaces the standard attention and MLP blocks with โต-based components (a rough code sketch follows the list below):
- **โต-Attention:** Geometric query-key matching using alignment + proximity instead of a pure dot product.
- **NMN Feed-Forward:** Replaces MLP + GELU with NMN layers that have intrinsic non-linearity.
- **No LayerNorm:** Self-regulation eliminates the need for normalization layers entirely.
- **Same Parameters:** 124M parameters, matching GPT-2 small for a fair comparison.
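This section doesn't spell out the โต-product or the NMN layers, so the following is a minimal PyTorch sketch of what such a block *could* look like under explicit assumptions: cosine similarity stands in for "alignment", negative squared distance for "proximity", and a distance-to-prototype layer for the intrinsically non-linear NMN feed-forward. All class names, shapes, and scaling choices below are hypothetical, not the actual AetherGPT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricAttention(nn.Module):
    """Causal attention scored by alignment + proximity rather than a pure
    dot product (an illustrative reading of the โต-attention description)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (B, n_heads, T, d_head).
        q, k, v = (
            t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            for t in (q, k, v)
        )
        # "Alignment": cosine similarity between queries and keys.
        align = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
        # "Proximity": negative squared Euclidean distance (rescaling is a guess).
        qf = q.reshape(B * self.n_heads, T, self.d_head)
        kf = k.reshape(B * self.n_heads, T, self.d_head)
        prox = -torch.cdist(qf, kf).pow(2).view(B, self.n_heads, T, T) / self.d_head
        scores = align + prox
        # Causal mask: each position attends only to itself and the past.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(causal, float("-inf"))
        out = torch.softmax(scores, dim=-1) @ v
        return self.proj(out.transpose(1, 2).reshape(B, T, C))


class NMNFeedForward(nn.Module):
    """Placeholder for the NMN feed-forward: a distance-to-prototype layer,
    non-linear with no explicit activation function."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(d_hidden, d_model) * 0.02)
        self.out = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B = x.shape[0]
        # (B, T, d_hidden): negative distance from each token to each prototype.
        h = -torch.cdist(x, self.centers.unsqueeze(0).expand(B, -1, -1))
        return self.out(h)


class AetherBlock(nn.Module):
    """One transformer block with no LayerNorm anywhere, per the claim above."""

    def __init__(self, d_model: int, n_heads: int, d_hidden: int):
        super().__init__()
        self.attn = GeometricAttention(d_model, n_heads)
        self.ffn = NMNFeedForward(d_model, d_hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(x)
        return x + self.ffn(x)


# Smoke test with GPT-2-small-ish sizes.
block = AetherBlock(d_model=768, n_heads=12, d_hidden=3072)
print(block(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```

Note that the residual stream passes through no normalization at all; a standard GPT-2 block would wrap each sub-layer in a LayerNorm, which is exactly what the self-regulation claim above removes.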
## Training Dynamics
*Figure: training loss comparison (GPT-2 vs. AetherGPT).*
## Results Summary
| Metric | GPT-2 | AetherGPT | Improvement |
|---|---|---|---|
| Validation Loss (BF16) | 3.03 | 2.69 | 11.2% lower |
| Validation Loss (FP32) | 3.05 | 2.78 | 8.9% lower |
| Peak Memory | Baseline | 15-25% less | Lower |
| Throughput | Baseline | ~4% slower | Slightly worse |
| Normalization Layers | Required | None | Simpler |
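The percentage improvements follow directly from the loss values in the table; a quick sanity check of the arithmetic:

```python
# Relative improvement in validation loss, from the reported numbers.
for tag, gpt2, aether in [("BF16", 3.03, 2.69), ("FP32", 3.05, 2.78)]:
    print(f"{tag}: {(gpt2 - aether) / gpt2:.1%} lower")
# BF16: 11.2% lower
# FP32: 8.9% lower
```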
## Key Observations
## Training Configuration
- Dataset: OpenWebText (subset)
- Context Length: 1024 tokens
- Batch Size: 256 (effective)
- Learning Rate: 6e-4 with cosine decay
- Hardware: 8× A100 GPUs
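The training script itself isn't shown here; the sketch below is one plausible way to express these hyperparameters in code. The warmup and total step counts are hypothetical (the list above doesn't state them), as are all names:

```python
from dataclasses import dataclass
import math

@dataclass
class TrainConfig:
    dataset: str = "openwebtext"       # subset, per the list above
    context_length: int = 1024         # tokens per sequence
    effective_batch_size: int = 256    # micro-batch x grad-accum x 8 GPUs
    max_lr: float = 6e-4
    warmup_steps: int = 2_000          # hypothetical; not stated above
    max_steps: int = 100_000           # hypothetical; not stated above

def lr_at(step: int, cfg: TrainConfig) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < cfg.warmup_steps:
        return cfg.max_lr * step / cfg.warmup_steps
    progress = (step - cfg.warmup_steps) / (cfg.max_steps - cfg.warmup_steps)
    return 0.5 * cfg.max_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```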