Explain Like I'm 5
Think of neural networks like building with different LEGO sets:
- 🧱 Regular LEGO (Linear + ReLU): Each brick does two things — one part measures, another part decides "keep or throw away." Lots of pieces!
- ⭐ NMN LEGO (ⵟ-Product): Each brick does everything at once! Measures AND decides in one magic piece!
We made three special bricks:
- 🔷 NMN Layer: For regular data (like a list of numbers)
- 🖼️ ⵟ-Conv: For pictures (slides a magic window)
- 👀 ⵟ-Attention: For sentences (decides what words to focus on)
Then we built awesome robots with these bricks: AetherResNet for pictures and AetherGPT for words!
🔷 The NMN Layer: Dense Geometric Computation
The NMN layer is the simplest building block — a drop-in replacement for
Linear + Activation. Each unit learns a prototype and responds based on both
alignment and proximity.
Key components:
- Weight vectors $\mathbf{w}_i$: Each acts as a learned "prototype" — what the neuron is looking for
- Bias terms $b_i$: Added to the numerator for expressivity (like a tunable threshold)
- Scaling factor $s$: Adaptive scale $s = \left(\frac{n}{\log(1+n)}\right)^\alpha$ with a learnable exponent $\alpha$
- Epsilon $\epsilon$: Stability constant preventing division by zero
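Putting these components together, a single unit's response takes roughly the following form. This is a sketch inferred from the components listed above; the exact placement of the bias and of the square may differ in the actual implementation:

$$y_i = s \cdot \frac{\left(\mathbf{w}_i^T \mathbf{x}\right)^2 + b_i}{\|\mathbf{w}_i - \mathbf{x}\|^2 + \epsilon}$$

The numerator grows when the input aligns with the prototype $\mathbf{w}_i$, while the denominator grows with the distance between them, so a unit fires strongly only for inputs that are both aligned with and close to its prototype.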
🖼️ ⵟ-Convolution: Geometric Feature Extraction
For spatially-structured data (images, audio spectrograms), we extend the ⵟ-product to convolutional operations:
Like standard convolution, the kernel slides across the input. But instead of just computing a dot product, we compute the full ⵟ-product at each location.
- Dual Sensitivity: Responds strongly only when a patch is both aligned with and close to the kernel, making it more selective than standard convolution.
- Intrinsic Non-linearity: No ReLU/GELU is needed after the convolution; the geometric ratio provides all the non-linearity.
- Localized Response: Like RBF kernels, responses decay for patches far from the learned kernel, providing natural attention-like behavior.
- Memory Efficient: No activation values to store for backprop; the gradient flows through the geometric computation directly.
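To make the sliding-window idea concrete, here is a minimal, hypothetical PyTorch sketch, not the nmn package's actual `YatConv2d` implementation. It assumes the same squared-alignment-over-squared-distance score as the dense layer sketch above, with stride 1, no padding, and no bias:

```python
import torch
import torch.nn.functional as F

def yat_conv2d_sketch(x, weight, eps=1e-6):
    """Illustrative ⵟ-style convolution (assumed form, not the nmn package's YatConv2d).

    x:      (B, C, H, W) input
    weight: (O, C, k, k) kernels, treated as learned patch prototypes
    Returns (B, O, H-k+1, W-k+1) feature maps (stride 1, no padding, no bias).
    """
    B, C, H, W = x.shape
    O, _, k, _ = weight.shape
    patches = F.unfold(x, kernel_size=k)            # (B, C*k*k, L): all sliding patches
    w = weight.view(O, -1)                          # (O, C*k*k): flattened kernels
    dot = torch.einsum('od,bdl->bol', w, patches)   # alignment: w · patch at every location
    dist2 = (w.pow(2).sum(1)[None, :, None]         # ‖w‖²
             + patches.pow(2).sum(1)[:, None, :]    # ‖patch‖²
             - 2 * dot)                             # ‖w - patch‖² via the norm identity
    scores = dot.pow(2) / (dist2 + eps)             # assumed ⵟ-score: alignment² over distance²
    return scores.reshape(B, O, H - k + 1, W - k + 1)
```

Note how the dot product is computed once and reused inside the distance term; the efficiency section below makes the same argument for the dense layer.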
👀 ⵟ-Attention: Geometric Query-Key Matching
For sequence modeling (text, time series), we adapt the ⵟ-product to the attention mechanism:
How it differs from standard attention:
| Aspect | Standard Attention | ⵟ-Attention |
|---|---|---|
| Score Computation | $Q \cdot K^T$ (dot product) | $Q \text{ ⵟ } K^T$ (alignment + distance) |
| Similarity Measure | Pure alignment | Alignment AND proximity |
| Score Scaling | $\frac{1}{\sqrt{d_k}}$ (manual) | Learnable $s$ |
| Distant Pairs | Can still have high scores | Naturally suppressed |
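As a rough illustration, the score matrix could be computed as follows. This is a sketch under the assumption that the attention score uses the same squared-alignment-over-squared-distance form as the dense layer; the real multi-head implementation in AetherGPT will differ:

```python
import torch

def yat_attention_scores(q, k, scale, eps=1e-6):
    """Hypothetical ⵟ-attention scores (assumed form, not the exact AetherGPT code).

    q, k:  (B, T, d) query and key tensors
    scale: learnable scalar s replacing the usual 1/sqrt(d_k)
    Returns (B, T, T) pre-softmax scores.
    """
    dot = q @ k.transpose(-2, -1)                # (B, T, T): alignment q·k for every pair
    q2 = q.pow(2).sum(-1, keepdim=True)          # (B, T, 1): ‖q‖²
    k2 = k.pow(2).sum(-1).unsqueeze(-2)          # (B, 1, T): ‖k‖²
    dist2 = q2 + k2 - 2 * dot                    # ‖q - k‖² for every query-key pair
    return scale * dot.pow(2) / (dist2 + eps)    # high only when aligned AND close
```

Distant query-key pairs get a large denominator and are suppressed automatically, which is the behavior summarized in the last row of the table; masking and softmax then proceed as in standard attention.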
🏔️ AetherResNet: Geometric Vision Architecture
AetherResNet adapts the ResNet architecture by replacing standard convolutions with our geometric alternatives.
Each residual block consists of:
- ⵟ-Conv Layer: Geometric feature extraction (replaces Conv + ReLU)
- Linear Conv Layer: Standard convolution for channel mixing
- Residual Connection: $\text{output} = F(\mathbf{x}) + \mathbf{x}$
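A hypothetical, shape-preserving version of such a block is sketched below. It reuses the `YatConv2d` import shown in the Getting Started section, assumes `YatConv2d` accepts a Conv2d-style `padding` argument, and ignores strides, downsampling, and channel projection that the real AetherResNet block would need:

```python
import torch.nn as nn
from nmn import YatConv2d  # same import as in the Getting Started section below

class AetherBlockSketch(nn.Module):
    """Illustrative residual block: ⵟ-Conv for features, linear conv for channel mixing."""
    def __init__(self, channels):
        super().__init__()
        # Assumption: YatConv2d accepts a Conv2d-style padding argument (spatial size preserved).
        self.yat_conv = YatConv2d(channels, channels, kernel_size=3, padding=1)
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)  # standard conv: channel mixing
    def forward(self, x):
        # No BatchNorm and no ReLU: the ⵟ-Conv supplies the non-linearity itself.
        return self.mix(self.yat_conv(x)) + x  # output = F(x) + x
```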
| Component | Standard ResNet | AetherResNet |
|---|---|---|
| Conv Layers | Conv2d → BatchNorm → ReLU | ⵟ-Conv → Linear Conv |
| Normalization | BatchNorm required | Not needed (self-regularizing) |
| Activation | ReLU between layers | None (intrinsic non-linearity) |
| Skip Connection | Identity or projection | Same |
Results: AetherResNet-18 outperforms standard ResNet-18 on CIFAR-100 (+2.68%), STL-10 (+2.49%), and Tiny-ImageNet (+2.45%).
📝 AetherGPT: Geometric Language Model
AetherGPT adapts GPT-2's transformer architecture with geometric operators:
- Multi-Head ⵟ-Attention: Replaces standard attention with geometric query-key matching
- NMN Feed-Forward: Replaces MLP + GELU with NMN layers
- No LayerNorm: Self-regulation makes normalization unnecessary
- Linear Output: Standard linear projection for next-token prediction
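Below is a deliberately simplified, single-head sketch of one such block. It is hypothetical code: it reuses the `NMNLayer` import from the Getting Started section, assumes `NMNLayer` applies over the last dimension like `nn.Linear`, and omits multi-head splitting and causal masking:

```python
import torch
import torch.nn as nn
from nmn import NMNLayer  # same import as in the Getting Started section below

class AetherGPTBlockSketch(nn.Module):
    """Illustrative block: single-head ⵟ-attention + NMN feed-forward, no LayerNorm."""
    def __init__(self, d_model, d_ff, eps=1e-6):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)   # query/key/value projections
        self.proj = nn.Linear(d_model, d_model)      # attention output projection
        self.scale = nn.Parameter(torch.ones(()))    # learnable score scale s
        self.eps = eps
        # Assumption: NMNLayer broadcasts over leading dimensions like nn.Linear does.
        self.ff_in = NMNLayer(in_features=d_model, out_features=d_ff)
        self.ff_out = NMNLayer(in_features=d_ff, out_features=d_model)

    def forward(self, x):                             # x: (B, T, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        dot = q @ k.transpose(-2, -1)                 # (B, T, T): query-key alignment
        dist2 = (q.pow(2).sum(-1, keepdim=True)
                 + k.pow(2).sum(-1).unsqueeze(-2)
                 - 2 * dot)                           # squared query-key distance
        att = torch.softmax(self.scale * dot.pow(2) / (dist2 + self.eps), dim=-1)
        x = x + self.proj(att @ v)                    # residual; no LayerNorm anywhere
        return x + self.ff_out(self.ff_in(x))         # NMN feed-forward + residual
```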
- Better Loss: 11.2% improvement in validation loss (BF16): 2.69 vs 3.03 for baseline GPT-2.
- Less Memory: 15-25% reduction in peak memory usage due to eliminated activation storage.
- No LayerNorm: The self-regularization property eliminates the need for normalization layers entirely.
- Comparable Speed: Only ~4% slower in raw throughput, often offset by larger batch sizes enabled by the memory savings.
⚡ Computational Efficiency
A common question: Is the ⵟ-product more expensive than linear layers? Here's the detailed analysis:
For input dimension $d$, output dimension $n$, batch size $B$:
- Linear Layer: $\Theta(Bnd)$ — one matrix multiplication
- NMN Layer: $\Theta(Bnd)$ — same asymptotic complexity!
The key insight: we reuse computations via the algebraic identity $\|\mathbf{w} - \mathbf{x}\|^2 = \|\mathbf{w}\|^2 + \|\mathbf{x}\|^2 - 2\mathbf{w}^T\mathbf{x}$. Since we already compute $\mathbf{w}^T\mathbf{x}$ for the numerator, we can reuse it in the denominator, and the squared norms $\|\mathbf{w}\|^2$ and $\|\mathbf{x}\|^2$ can be precomputed and cached.
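A compact sketch of how this reuse looks in code (hypothetical, following the response formula sketched in the NMN Layer section rather than the nmn package internals):

```python
import torch

def nmn_forward_sketch(x, w, b, scale, eps=1e-6):
    """Dense ⵟ-style forward pass that reuses w·x for the distance term (hypothetical code).

    x: (B, d) inputs, w: (n, d) prototypes, b: (n,) bias, scale: scalar s.
    """
    dot = x @ w.t()                                  # (B, n): the only matmul, Θ(Bnd)
    x2 = x.pow(2).sum(-1, keepdim=True)              # (B, 1): ‖x‖², recomputed per batch
    w2 = w.pow(2).sum(-1)                            # (n,):  ‖w‖², can be cached between steps
    dist2 = x2 + w2 - 2 * dot                        # ‖w - x‖² without a second matmul
    return scale * (dot.pow(2) + b) / (dist2 + eps)  # assumed numerator form (w·x)² + b
```

The only matrix multiplication is the logit computation; everything else is element-wise or a cheap reduction, which is why the asymptotic cost matches the linear layer.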
| Metric | Linear + ReLU | NMN Layer |
|---|---|---|
| FLOPs | $2Bnd + Bn$ | $\approx 4Bnd$ |
| Memory (Forward) | Store activations for ReLU | No activation storage needed |
| Normalization | Often needs BatchNorm/LayerNorm | Self-regularizing |
| Net Effect | Baseline | ~2× FLOPs, 15-25% less memory |
🚀 Getting Started with NMN
Ready to try it yourself? The NMN package is available on PyPI:
```bash
pip install nmn
```

```python
import torch
from nmn import NMNLayer, YatConv2d

# Dense NMN layer
layer = NMNLayer(in_features=64, out_features=128)
x = torch.randn(32, 64)   # batch of 32
output = layer(x)         # shape: (32, 128)

# Convolutional ⵟ-layer
conv = YatConv2d(in_channels=3, out_channels=64, kernel_size=3)
img = torch.randn(1, 3, 224, 224)
features = conv(img)      # shape: (1, 64, 222, 222)
```