โ† Back to Theory Experiments

Language Experiments

🧒 Explain Like I'm 5

You know how when someone starts a sentence, you can guess the next word? 🤔

  • 📖 "Once upon a ____" → You'd guess "time"!
  • 🤖 GPT is a computer that learned to do this by reading LOTS of books
  • ✨ AetherGPT does the same thing, but using our ⵟ-product magic!

The cool part? AetherGPT makes fewer mistakes and uses less memory! 🎉

🧠 AetherGPT Architecture

AetherGPT is a GPT-2 variant that replaces the standard attention and MLP blocks with ⵟ-based components:

  • 👀 ⵟ-Attention: Geometric query-key matching using alignment + proximity instead of a pure dot product (see the sketch after this list).
  • 🔢 NMN Feed-Forward: Replaces MLP + GELU with NMN layers that have intrinsic non-linearity.
  • 🚫 No LayerNorm: Self-regulation eliminates the need for normalization layers entirely.
  • 📊 Same Parameters: 124M parameters, matching GPT-2 small for a fair comparison.
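The ⵟ-product itself is defined on the theory pages rather than here, so the following PyTorch sketch is only a rough illustration of what "alignment + proximity" scoring could look like in place of the usual scaled dot product. The cosine-similarity alignment term, the squared-distance proximity term, the mixing weight alpha, and the function name geometric_attention_scores are all assumptions for illustration, not AetherGPT's actual ⵟ-attention.

import torch
import torch.nn.functional as F

def geometric_attention_scores(q, k, alpha=0.5):
    """Toy "alignment + proximity" score matrix (illustrative, not the real ⵟ-product).

    q, k: (batch, heads, seq, head_dim)
    alignment: cosine similarity between queries and keys
    proximity: negative mean squared distance, so nearby keys score higher
    alpha:     assumed mixing weight between the two terms
    """
    alignment = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
    diff = q.unsqueeze(-2) - k.unsqueeze(-3)   # (B, H, Sq, Sk, D)
    proximity = -diff.pow(2).mean(dim=-1)      # (B, H, Sq, Sk)
    return alpha * alignment + (1.0 - alpha) * proximity

# Usage: drop the scores into an otherwise standard softmax attention.
B, H, S, D = 2, 4, 8, 16
q, k, v = torch.randn(B, H, S, D), torch.randn(B, H, S, D), torch.randn(B, H, S, D)
weights = torch.softmax(geometric_attention_scores(q, k), dim=-1)
out = weights @ v                              # (B, H, S, D)

Both score terms here are bounded above (cosine by 1, negative distance by 0), which is the kind of property the BF16 observation below alludes to.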

📈 Training Dynamics

[Figure: training loss curves, GPT-2 (baseline) vs. AetherGPT]

📊 Results Summary

Metric                   GPT-2      AetherGPT      Improvement
Validation Loss (BF16)   3.03       2.69           ↓ 11.2%
Validation Loss (FP32)   3.05       2.78           ↓ 8.9%
Peak Memory              Baseline   15-25% less    ✓
Throughput               Baseline   ~4% slower     ~
Normalization Layers     Required   None           Simpler

For the loss rows, the improvement is the relative reduction in validation loss, e.g. (3.03 − 2.69) / 3.03 ≈ 11.2%.

🔬 Key Observations

📈 BF16 Performance Boost: AetherGPT improves more in BF16 (11.2%) than in FP32 (8.9%), suggesting that the geometric operations cope well with lower-precision floating point, possibly because the ⵟ-product is bounded.
💾 Memory Efficiency: The 15-25% savings in peak memory come from not having to store ReLU/GELU activations for the backward pass. This can enable larger batch sizes or longer context lengths on the same hardware.
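For context, here is a minimal sketch of how a BF16 vs. FP32 run is typically toggled in PyTorch mixed precision; the actual training script is not shown on this page, and model, optimizer, and the loss reshaping are placeholders.

import torch
import torch.nn.functional as F

def train_step(model, optimizer, inputs, targets, use_bf16=True):
    """One optimizer step, optionally under bfloat16 autocast (illustrative only).

    `model` is assumed to return logits of shape (batch, seq, vocab).
    """
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=use_bf16):
        logits = model(inputs)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    # bfloat16 keeps FP32's exponent range, so no loss scaling (GradScaler) is needed.
    loss.backward()
    optimizer.step()
    return loss.item()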

🔧 Training Configuration

Hardware & Setup
  • Dataset: OpenWebText (subset)
  • Context Length: 1024 tokens
  • Batch Size: 256 (effective)
  • Learning Rate: 6e-4 with cosine decay
  • Hardware: 8× A100 GPUs
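A minimal sketch of that setup as code, assuming a standard cosine schedule with linear warmup; the min_lr, warmup_steps, and max_steps values (and the TrainConfig / lr_at names) are illustrative assumptions, and only the values listed above come from the experiment.

import math
from dataclasses import dataclass

@dataclass
class TrainConfig:
    dataset: str = "openwebtext"   # subset of OpenWebText
    context_length: int = 1024     # tokens per sequence
    batch_size: int = 256          # effective (e.g. via gradient accumulation)
    max_lr: float = 6e-4           # peak learning rate
    min_lr: float = 6e-5           # assumed decay floor
    warmup_steps: int = 2_000      # assumed linear warmup length
    max_steps: int = 100_000       # assumed total optimizer steps

def lr_at(step: int, cfg: TrainConfig) -> float:
    """Cosine decay after linear warmup; warmup/min-lr/step counts are assumptions."""
    if step < cfg.warmup_steps:
        return cfg.max_lr * (step + 1) / cfg.warmup_steps
    progress = min(1.0, (step - cfg.warmup_steps) / max(1, cfg.max_steps - cfg.warmup_steps))
    return cfg.min_lr + 0.5 * (cfg.max_lr - cfg.min_lr) * (1.0 + math.cos(math.pi * progress))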