
NMN Architecture Guide

🧒

Explain Like I'm 5

Think of neural networks like building with different LEGO sets:

  • 🧱 Regular LEGO (Linear + ReLU): Each brick does two things — one part measures, another part decides "keep or throw away." Lots of pieces!
  • NMN LEGO (ⵟ-Product): Each brick does everything at once! Measures AND decides in one magic piece!

We made three special bricks:

  • 🔷 NMN Layer: For regular data (like a list of numbers)
  • 🖼️ ⵟ-Conv: For pictures (slides a magic window)
  • 👀 ⵟ-Attention: For sentences (decides what words to focus on)

Then we built awesome robots with these bricks: AetherResNet for pictures and AetherGPT for words!

🔷 The NMN Layer: Dense Geometric Computation

The NMN layer is the simplest building block — a drop-in replacement for Linear + Activation. Each unit learns a prototype and responds based on both alignment and proximity.

NMN Layer Definition: For input $\mathbf{x} \in \mathbb{R}^d$ and weights $\{\mathbf{w}_i\}_{i=1}^n$ with biases $\{b_i\}$: $$h(\mathbf{x}) = s \cdot \sum_{i=1}^n \frac{(\mathbf{w}_i^T\mathbf{x} + b_i)^2}{\|\mathbf{w}_i - \mathbf{x}\|^2 + \epsilon}$$ where $s$ is a learnable scaling factor.

Key components:

  • Weight vectors $\mathbf{w}_i$: Each acts as a learned "prototype" — what the neuron is looking for
  • Bias terms $b_i$: Added to the numerator for expressivity (like a tunable threshold)
  • Scaling factor $s$: Adaptive, set as $s = \left(\frac{n}{\log(1+n)}\right)^\alpha$ with learnable $\alpha$
  • Epsilon $\epsilon$: Stability constant preventing division by zero
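
A minimal PyTorch sketch of this computation, assuming nothing about the package's real internals (the class name `NMNDense` and its initialization are illustrative); it returns the per-unit ⵟ-responses, which the definition above sums into $h(\mathbf{x})$:

```python
import math
import torch
import torch.nn as nn

class NMNDense(nn.Module):
    """Illustrative NMN layer: unit i responds with (w_i^T x + b_i)^2 / (||w_i - x||^2 + eps)."""
    def __init__(self, in_features, out_features, eps=1e-6):
        super().__init__()
        self.w = nn.Parameter(torch.randn(out_features, in_features) * 0.02)  # learned prototypes
        self.b = nn.Parameter(torch.zeros(out_features))                      # tunable thresholds
        self.alpha = nn.Parameter(torch.ones(1))                              # learnable exponent for s
        self.n, self.eps = out_features, eps

    def forward(self, x):                                  # x: (B, d)
        dot = x @ self.w.T + self.b                        # (B, n) alignment term
        dist_sq = torch.cdist(x, self.w).pow(2)            # (B, n) proximity term
        s = (self.n / math.log1p(self.n)) ** self.alpha    # adaptive scale s = (n / log(1+n))^alpha
        return s * dot.pow(2) / (dist_sq + self.eps)       # intrinsic non-linearity, no ReLU
```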
💡
Why This Works: Every unit automatically provides non-linearity through the geometric ratio. No separate activation needed! The Universal Approximation Theorem (Theorem 2.4) guarantees this layer can approximate any continuous function.
🎛️ Interactive: NMN Neuron Playground

🖼️ ⵟ-Convolution: Geometric Feature Extraction

For spatially structured data (images, audio spectrograms), we extend the ⵟ-product to convolutional operations:

ⵟ-Conv Definition: For kernel $K$ and input patch $I_{i,j}$ at location $(i,j)$: $$(\text{ⵟ-Conv}(K, I))_{i,j} = \frac{\langle K, I_{i,j} \rangle^2}{\|K - I_{i,j}\|^2 + \epsilon}$$

Like standard convolution, the kernel slides across the input. But instead of just computing a dot product, we compute the full ⵟ-product at each location.
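
A sketch of this operation built from standard `conv2d` calls (the function name `yat_conv2d` is illustrative; the package's `YatConv2d` may be implemented differently):

```python
import torch
import torch.nn.functional as F

def yat_conv2d(x, kernel, eps=1e-6):
    """Illustrative ⵟ-convolution: <K, patch>^2 / (||K - patch||^2 + eps) at each location.
    x: (B, C, H, W), kernel: (C_out, C, kH, kW)."""
    dot = F.conv2d(x, kernel)                                    # <K, patch> at every spatial location
    k_sq = kernel.pow(2).sum(dim=(1, 2, 3)).view(1, -1, 1, 1)    # ||K||^2 per output channel
    box = torch.ones_like(kernel[:1])                            # summing filter over one patch
    patch_sq = F.conv2d(x.pow(2), box)                           # ||patch||^2 at every location
    dist_sq = k_sq + patch_sq - 2 * dot                          # ||K - patch||^2 via the norm identity
    return dot.pow(2) / (dist_sq + eps)
```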

🎯
Dual Sensitivity

Responds strongly only when a patch is both aligned with AND close to the kernel. More selective than standard convolution.

🔄
Intrinsic Non-linearity

No need for ReLU/GELU after the convolution — the geometric ratio provides all the non-linearity needed.

📍
Localized Response

Like RBF kernels, responses decay for patches far from the learned kernel, providing natural attention-like behavior.

💾
Memory Efficient

No activation values to store for backprop — the gradient flows through the geometric computation directly.

👀 ⵟ-Attention: Geometric Query-Key Matching

For sequence modeling (text, time series), we adapt the ⵟ-product to the attention mechanism:

ⵟ-Attention Definition: $$\text{ⵟ-Attention}(Q, K, V) = \text{softmax}\left( s \cdot (Q \text{ ⵟ } K^T) \right) V$$ where $Q \text{ ⵟ } K^T$ applies the ⵟ-product element-wise between query and key vectors.
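
A single-head sketch of the score computation (a hedged illustration, not the actual AetherGPT kernel; a causal mask would be added to `scores` before the softmax in autoregressive models):

```python
import torch

def yat_attention(q, k, v, s=1.0, eps=1e-6):
    """Illustrative ⵟ-attention for q, k, v of shape (B, T, d)."""
    dot = q @ k.transpose(-2, -1)                            # (B, T, T) pairwise alignments
    q_sq = q.pow(2).sum(-1, keepdim=True)                    # (B, T, 1)
    k_sq = k.pow(2).sum(-1, keepdim=True).transpose(-2, -1)  # (B, 1, T)
    dist_sq = q_sq + k_sq - 2 * dot                          # pairwise ||q - k||^2
    scores = s * dot.pow(2) / (dist_sq + eps)                # element-wise ⵟ-product
    return torch.softmax(scores, dim=-1) @ v
```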

How it differs from standard attention:

| Aspect | Standard Attention | ⵟ-Attention |
| --- | --- | --- |
| Score Computation | $Q \cdot K^T$ (dot product) | $Q \text{ ⵟ } K^T$ (alignment + distance) |
| Similarity Measure | Pure alignment | Alignment AND proximity |
| Score Scaling | $\frac{1}{\sqrt{d_k}}$ (manual) | Learnable $s$ |
| Distant Pairs | Can still have high scores | Naturally suppressed |
🔬
Intuition: In ⵟ-Attention, a query strongly attends to a key only if they are BOTH aligned (pointing same direction) AND close in representation space. This is more selective than standard attention, which only considers alignment.

🏔️ AetherResNet: Geometric Vision Architecture

AetherResNet adapts the ResNet architecture by replacing standard convolutions with our geometric alternatives.

Block Structure

Each residual block consists of:

  1. ⵟ-Conv Layer: Geometric feature extraction (replaces Conv + ReLU)
  2. Linear Conv Layer: Standard convolution for channel mixing
  3. Residual Connection: $\text{output} = F(\mathbf{x}) + \mathbf{x}$
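
As a sketch, one such block might look like the following, assuming `YatConv2d` from the `nmn` package as in the usage example below, and assuming it accepts a `padding` argument like `nn.Conv2d`; the 1×1 mixing convolution is also an assumption, not confirmed by the source:

```python
import torch.nn as nn
from nmn import YatConv2d   # as in the usage example below

class AetherBlock(nn.Module):
    """Illustrative AetherResNet residual block: ⵟ-Conv, then a plain conv, plus the skip."""
    def __init__(self, channels):
        super().__init__()
        # assumes YatConv2d accepts padding like nn.Conv2d (an assumption)
        self.yat = YatConv2d(in_channels=channels, out_channels=channels, kernel_size=3, padding=1)
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)   # linear channel mixing
        # no BatchNorm and no ReLU anywhere in the block

    def forward(self, x):
        return self.mix(self.yat(x)) + x   # residual connection
```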
| Component | Standard ResNet | AetherResNet |
| --- | --- | --- |
| Conv Layers | Conv2d → BatchNorm → ReLU | ⵟ-Conv → Linear Conv |
| Normalization | BatchNorm required | Not needed (self-regularizing) |
| Activation | ReLU between layers | None (intrinsic non-linearity) |
| Skip Connection | Identity or projection | Same |

Results: AetherResNet-18 outperforms standard ResNet-18 on CIFAR-100 (+2.68%), STL-10 (+2.49%), and Tiny-ImageNet (+2.45%).

📝 AetherGPT: Geometric Language Model

AetherGPT adapts GPT-2's transformer architecture with geometric operators:

Transformer Block Structure
  1. Multi-Head ⵟ-Attention: Replaces standard attention with geometric query-key matching
  2. NMN Feed-Forward: Replaces MLP + GELU with NMN layers
  3. No LayerNorm: Self-regulation makes normalization unnecessary
  4. Linear Output: Standard linear projection for next-token prediction
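
Sketched in PyTorch, a block could be wired as below. `YatSelfAttention` is a hypothetical stand-in for the multi-head ⵟ-attention module, and the two-layer NMN feed-forward composition is an assumption; the real AetherGPT code may differ:

```python
import torch
import torch.nn as nn
from nmn import NMNLayer   # as in the usage example below

class YatSelfAttention(nn.Module):
    """Hypothetical multi-head ⵟ-attention with a causal mask and learnable scale s."""
    def __init__(self, d_model, n_heads, eps=1e-6):
        super().__init__()
        self.h, self.dk, self.eps = n_heads, d_model // n_heads, eps
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.s = nn.Parameter(torch.ones(1))                      # learnable score scale

    def forward(self, x):                                          # x: (B, T, d_model)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = [t.view(B, T, self.h, self.dk).transpose(1, 2) for t in (q, k, v)]
        dot = q @ k.transpose(-2, -1)
        dist_sq = q.pow(2).sum(-1, keepdim=True) + k.pow(2).sum(-1, keepdim=True).transpose(-2, -1) - 2 * dot
        scores = self.s * dot.pow(2) / (dist_sq + self.eps)       # ⵟ-product scores
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        y = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ v
        return self.out(y.transpose(1, 2).reshape(B, T, -1))

class AetherGPTBlock(nn.Module):
    """Illustrative transformer block: ⵟ-attention + NMN feed-forward, no LayerNorm anywhere."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = YatSelfAttention(d_model, n_heads)
        self.ff = nn.Sequential(                                   # no GELU between layers
            NMNLayer(in_features=d_model, out_features=d_ff),
            NMNLayer(in_features=d_ff, out_features=d_model),
        )

    def forward(self, x):
        x = x + self.attn(x)   # residual, no pre-LayerNorm
        x = x + self.ff(x)     # residual, no pre-LayerNorm
        return x
```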
📉
Better Loss

11.2% improvement in validation loss (BF16): 2.69 vs 3.03 for baseline GPT-2.

💾
Less Memory

15-25% reduction in peak memory usage due to eliminated activation storage.

🚫
No LayerNorm

Self-regularization property eliminates the need for normalization layers entirely.

⏱️
Comparable Speed

Only ~4% slower in raw throughput, often offset by larger batch sizes from memory savings.

⚡ Computational Efficiency

A common question: Is the ⵟ-product more expensive than linear layers? Here's the detailed analysis:

Complexity Analysis

For input dimension $d$, output dimension $n$, batch size $B$:

  • Linear Layer: $\Theta(Bnd)$ — one matrix multiplication
  • NMN Layer: $\Theta(Bnd)$ — same asymptotic complexity!

The key insight: we reuse computations using the algebraic identity:

$$\|\mathbf{w} - \mathbf{x}\|^2 = \|\mathbf{w}\|^2 + \|\mathbf{x}\|^2 - 2\mathbf{w}^T\mathbf{x}$$

Since we already compute $\mathbf{w}^T\mathbf{x}$ for the numerator, we can reuse it for the denominator. The squared norms $\|\mathbf{w}\|^2$ and $\|\mathbf{x}\|^2$ can be precomputed and cached.
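
A hedged sketch of how a forward pass can exploit this identity so that the single matmul $\mathbf{w}^T\mathbf{x}$ serves both numerator and denominator (an illustrative helper, not the package's kernel):

```python
import torch

def nmn_forward(x, w, b, s=1.0, eps=1e-6):
    """Compute per-unit ⵟ-responses with one matmul. x: (B, d), w: (n, d), b: (n,)."""
    dot = x @ w.T                              # Theta(B*n*d): the only matrix multiplication
    w_sq = w.pow(2).sum(-1)                    # (n,)  can be cached between steps
    x_sq = x.pow(2).sum(-1, keepdim=True)      # (B, 1)
    dist_sq = w_sq + x_sq - 2 * dot            # ||w_i - x||^2 without forming w_i - x explicitly
    return s * (dot + b).pow(2) / (dist_sq + eps)
```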

| Metric | Linear + ReLU | NMN Layer |
| --- | --- | --- |
| FLOPs | $2Bnd + Bn$ | $\approx 4Bnd$ |
| Memory (Forward) | Store activations for ReLU | No activation storage needed |
| Normalization | Often needs BatchNorm/LayerNorm | Self-regularizing |
| Net Effect | Baseline | ~2× FLOPs, 15-25% less memory |
📊
The Trade-off: NMN layers use roughly 2× the FLOPs of Linear+ReLU but save 15-25% memory by eliminating activation storage. At larger layer sizes, the memory savings become increasingly valuable, enabling larger batch sizes or longer contexts.

🚀 Getting Started with NMN

Ready to try it yourself? The NMN package is available on PyPI:

Installation

```
pip install nmn
```

Basic Usage

```python
import torch
from nmn import NMNLayer, YatConv2d

# Dense NMN layer
layer = NMNLayer(in_features=64, out_features=128)
x = torch.randn(32, 64)  # batch of 32
output = layer(x)  # shape: (32, 128)

# Convolutional ⵟ-layer
conv = YatConv2d(in_channels=3, out_channels=64, kernel_size=3)
img = torch.randn(1, 3, 224, 224)
features = conv(img)  # shape: (1, 64, 222, 222)
```
📚
Full Documentation: Check out the GitHub repository for complete examples, pre-trained models, and training scripts for AetherResNet and AetherGPT.