No More DeLuLu: A Kernel-Based Activation-Free Neural Networks

$$\text{ⵟ}(\mathbf{w}, \mathbf{x}) = \frac{\langle \mathbf{w}, \mathbf{x} \rangle^2}{\|\mathbf{w} - \mathbf{x}\|^2 + \epsilon}$$

The ⵟ-product unifies alignment and proximity in a single geometric operator, enabling neural networks without activation functions while maintaining universal approximation capabilities.

15-25% Memory Reduction
11.2% Loss Improvement
0 Activation Functions

The Problem with Traditional Neural Networks

Modern neural networks rely on a paradigm that separates geometry from non-linearity: linear transformations (dot products) followed by element-wise activation functions (ReLU, sigmoid).

Consider ReLU: it maps the entire spectrum of negative pre-activations—representing varying degrees of dissimilarity—to a uniform zero, discarding nuanced geometric relationships.

📐

Dot Product Limitation

Captures alignment but ignores spatial proximity. Vectors can be aligned yet arbitrarily far apart.

🎯

Activation Information Loss

ReLU collapses half-spaces to zero. Sigmoid saturates extremes. Geometric structure is lost.

🔄

Architectural Complexity

Requires normalization layers, attention mechanisms, and regularization to stabilize training.

Topological Distortion by Activations

[Figure: original manifold vs. after ReLU. ReLU creates sharp folds, with a discontinuous gradient at zero.]

The ⵟ-Product: Unifying Alignment & Proximity

Inspired by inverse-square laws in physics, the ⵟ-product creates a unified operator that captures both directional alignment and spatial proximity.

$$\text{ⵟ}(\mathbf{w}, \mathbf{x}) = \frac{\langle \mathbf{w}, \mathbf{x} \rangle^2}{\|\mathbf{w} - \mathbf{x}\|^2 + \epsilon}$$
  • Numerator ⟨w, x⟩²: the squared dot product captures alignment between the vectors.
  • Denominator ‖w − x‖² + ε: the squared distance captures proximity, with ε added for numerical stability.
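As a concrete illustration, here is a minimal NumPy sketch of the ⵟ-product (our own illustrative code, not the packaged implementation):

import numpy as np

def yat_product(w, x, eps=1e-5):
    # ⵟ(w, x) = <w, x>^2 / (||w - x||^2 + eps)
    num = np.dot(w, x) ** 2           # squared alignment
    den = np.sum((w - x) ** 2) + eps  # squared distance plus stability term
    return num / den

w = np.array([1.0, -1.0])
x = np.array([1.0, 0.0])
print(yat_product(w, x))  # aligned with and close to w, so the response is large (~1.0)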
Key theoretical properties:

  • Mercer Kernel (Theorem 2.1): Symmetric and positive semidefinite, connecting to established kernel theory.
  • Universal Approximation (Theorem 2.4): NMNs are dense in C(𝒳) under the uniform norm on compact domains.
  • Self-Regularization (Proposition 3.1): Responses remain bounded and gradients decay at infinity without explicit normalization.
  • Stable Gradients (Proposition 3.2): Gradients vanish for distant inputs, providing natural localization during training.

Interactive Visualizations

Explore how the ⵟ-product differs from traditional similarity measures.

Similarity Measures & Gradient Fields

Moving the anchor point shows how each measure responds: the top row plots similarity values, the bottom row plots the corresponding gradient fields. Notice how the ⵟ-product creates a localized potential well with vortex-like gradients.

  • Dot Product: w · x
  • Squared Euclidean Distance: ‖w − x‖²
  • ⵟ-Product: (w·x)² / (‖w − x‖² + ε)
  • Cosine Similarity: w·x / (‖w‖ ‖x‖)

XOR Problem: Single Neuron Solution

The classic XOR problem cannot be solved by a single linear neuron. The ⵟ-product's intrinsic non-linearity enables a single unit to solve it.

Linear Neuron

Cannot separate XOR

ⵟ-Product Neuron

Solves XOR with w = [1, -1]

| Input x | XOR output | w·x | ⵟ(w, x) |
|---|---|---|---|
| (0, 0) | 0 | 0 | 0 |
| (0, 1) | 1 | −1 | >0 |
| (1, 0) | 1 | 1 | >0 |
| (1, 1) | 0 | 0 | 0 |
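As a sanity check, a minimal NumPy script (our own illustration) reproduces the table:

import numpy as np

def yat_product(w, x, eps=1e-5):
    # ⵟ(w, x) = <w, x>^2 / (||w - x||^2 + eps)
    return np.dot(w, x) ** 2 / (np.sum((w - x) ** 2) + eps)

w = np.array([1.0, -1.0])  # the single ⵟ-product neuron's weights
for x, xor in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
    x = np.array(x, dtype=float)
    print(x, xor, round(float(yat_product(w, x)), 3))

The (0,0) and (1,1) inputs give exactly 0, while (0,1) and (1,0) give strictly positive responses, so any threshold between 0 and 0.2 separates the two classes.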

Decision Boundary Comparison

Linear neurons create hyperplane boundaries. ⵟ-product neurons create vortex-like territorial fields around learned prototypes.

Linear Model Boundaries

Unbounded half-space partitions

ⵟ-Product Boundaries

Localized vortex territories

Loss Landscape Comparison

The optimization landscape for the XOR problem. The dot product has a spurious minimum; the ⵟ-product creates exploitable valleys.

Dot Product + Sigmoid

ⵟ-Product

Neural Matter Network Architectures

NMN layers serve as drop-in replacements for Linear + Activation, providing intrinsic non-linearity through geometry.

NMN Layer

Dense
$$h(\mathbf{x}) = s \cdot \sum_{i=1}^n \frac{(\mathbf{w}_i^\top\mathbf{x} + b_i)^2}{\|\mathbf{w}_i - \mathbf{x}\|^2 + \epsilon}$$

Replaces Linear + ReLU with a single geometric operation.
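A minimal PyTorch sketch of this layer (our own illustration; the packaged YatNMN layer may differ in details such as how the scale s and the bias are handled). Each output unit computes one term of the sum above; summing the units would give the scalar h(x), while keeping them separate yields a drop-in replacement for Linear + ReLU:

import torch
import torch.nn as nn

class YatDenseSketch(nn.Module):
    # per-unit response: (w_i^T x + b_i)^2 / (||w_i - x||^2 + eps), scaled by s
    def __init__(self, in_features, out_features, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.scale = nn.Parameter(torch.ones(1))
        self.eps = eps

    def forward(self, x):                                     # x: (batch, in_features)
        num = (x @ self.weight.T + self.bias) ** 2            # (batch, out_features)
        diff = x.unsqueeze(1) - self.weight.unsqueeze(0)      # (batch, out_features, in_features)
        den = diff.pow(2).sum(dim=-1) + self.eps              # (batch, out_features)
        return self.scale * num / den                         # no activation function needed

layer = YatDenseSketch(128, 64)
print(layer(torch.randn(32, 128)).shape)  # torch.Size([32, 64])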

ⵟ-Conv

Convolution
$$(\text{ⵟ-Conv}(K, I))_{i,j} = \frac{\langle K, I_{i,j} \rangle^2}{\|K - I_{i,j}\|^2 + \epsilon}$$

Geometrically-aware feature extraction for spatial data.
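A rough PyTorch sketch of the ⵟ-Conv idea for a single filter, using unfold to extract patches (illustrative only; the library's YatConv layers are assumed to handle multiple filters, stride, and padding):

import torch
import torch.nn.functional as F

def yat_conv2d_sketch(image, kernel, eps=1e-5):
    # image: (1, C, H, W), kernel: (C, kH, kW) -> output: (1, 1, H-kH+1, W-kW+1)
    C, kH, kW = kernel.shape
    patches = F.unfold(image, (kH, kW))                         # (1, C*kH*kW, L)
    k = kernel.reshape(-1, 1)                                   # (C*kH*kW, 1)
    num = (patches * k).sum(dim=1, keepdim=True) ** 2           # <K, I_ij>^2 per patch
    den = ((patches - k) ** 2).sum(dim=1, keepdim=True) + eps   # ||K - I_ij||^2 + eps
    h_out = image.shape[2] - kH + 1
    w_out = image.shape[3] - kW + 1
    return (num / den).reshape(1, 1, h_out, w_out)

x = torch.randn(1, 3, 8, 8)
k = torch.randn(3, 3, 3)
print(yat_conv2d_sketch(x, k).shape)  # torch.Size([1, 1, 6, 6])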

ⵟ-Attention

Transformer
$$\text{ⵟ-Attn}(Q,K,V) = \text{softmax}(s \cdot Q \text{ⵟ} K^T) V$$

Query-key similarity through geometric alignment and proximity.
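A single-head sketch of this attention variant (our own illustration; the library's MultiHeadAttention is assumed to add projections, multiple heads, and masking on top of this core):

import torch
import torch.nn.functional as F

def yat_attention_sketch(Q, K, V, scale=1.0, eps=1e-5):
    # Q, K, V: (seq_len, d); scores are the ⵟ-product between each query and key
    dots = Q @ K.T                              # pairwise <q_i, k_j>
    dists = torch.cdist(Q, K) ** 2              # pairwise ||q_i - k_j||^2
    scores = scale * dots ** 2 / (dists + eps)
    return F.softmax(scores, dim=-1) @ V

Q, K, V = (torch.randn(5, 16) for _ in range(3))
print(yat_attention_sketch(Q, K, V).shape)  # torch.Size([5, 16])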

Implemented Architectures

| Architecture | Base Model | Design | Key Change |
|---|---|---|---|
| AetherResNet | ResNet | ⵟ-Conv in place of Linear Conv per block | No activation functions |
| AetherGPT | GPT-2 | MHA + NMN in place of Linear | No normalization layers |

Experimental Results

Vision Benchmarks

| Architecture | CIFAR-10 | CIFAR-100 | STL-10 | Tiny-ImageNet |
|---|---|---|---|---|
| ResNet-18 | 94.23% | 72.15% | 78.42% | 56.89% |
| Aether-ResNet-18 | 92.37% | 74.83% | 80.91% | 59.34% |
| ViT-Small | 91.78% | 69.91% | 75.13% | 52.76% |
| Aether-ViT-Small | 92.45% | 70.58% | 78.89% | 51.42% |

Language Modeling (Fineweb, 2.5B tokens)

| Model | FP32 Loss | BF16 Loss |
|---|---|---|
| GPT-2 Baseline | 2.43 | 3.03 |
| Aether-GPT2 | 2.29 | 2.69 |

Aether-GPT2 achieves an 11.2% loss improvement in BF16.

Training Dynamics

[Figure: training curves comparing Aether-GPT2 against the linear baseline.]

Validation and training loss over 35k steps. Aether-GPT2 (blue) consistently outperforms the linear baseline (green).

Learned Prototypes: Linear vs ⵟ-Product

[Figure: MNIST prototype comparison.]

Linear models produce diffuse, blurry prototypes. ⵟ-product neurons learn sharp, geometrically coherent digit representations.

Mathematical Foundations

Explore the rigorous theoretical guarantees behind Neural Matter Networks. Each theorem builds upon the previous, creating a complete mathematical framework.

Theorem 1

The ⵟ-Product is a Mercer Kernel

Theorem (Mercer's Condition): The kernel $k_{\text{ⵟ}}(\mathbf{x}, \mathbf{w}) = \frac{(\mathbf{x} \cdot \mathbf{w})^2}{\|\mathbf{x} - \mathbf{w}\|^2 + \varepsilon}$ is symmetric and positive semi-definite, hence a valid Mercer kernel on $\mathbb{R}^d$.

What Does This Mean?

A Mercer kernel is a special type of similarity function that has two critical properties:

  • Symmetry: $k(\mathbf{x}, \mathbf{w}) = k(\mathbf{w}, \mathbf{x})$ — the similarity between x and w is the same as between w and x.
  • Positive Semi-Definiteness (PSD): For any set of points, the kernel matrix has all non-negative eigenvalues.

Why Is This Important?

Being a Mercer kernel means the ⵟ-product implicitly computes an inner product in a high-dimensional feature space without ever explicitly computing that space. This is the famous "kernel trick" from machine learning theory.

💡
Key Insight: The -product is a product of two PSD kernels: the squared dot product $(\mathbf{x} \cdot \mathbf{w})^2$ (a polynomial kernel) and the inverse multiquadric $\frac{1}{\|\mathbf{x} - \mathbf{w}\|^2 + \varepsilon}$ (an RBF-like kernel). By the Schur product theorem, their product is also PSD.
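This can be spot-checked numerically on a finite sample (our own illustrative check, not a proof):

import numpy as np

def yat_gram(X, eps=1e-5):
    # Gram matrix K[i, j] = <x_i, x_j>^2 / (||x_i - x_j||^2 + eps)
    G = X @ X.T                                   # pairwise dot products
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * G        # pairwise squared distances
    return G ** 2 / (d2 + eps)

X = np.random.default_rng(0).normal(size=(50, 8))
eigs = np.linalg.eigvalsh(yat_gram(X))
print(eigs.min())  # non-negative up to numerical error, as Mercer's condition requires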

What Does This Offer?

  • Theoretical Foundation: Connects NMNs to 50+ years of kernel method research (SVMs, Gaussian Processes, etc.)
  • Reproducing Kernel Hilbert Space (RKHS): Guarantees existence of a rich function space for learning
  • Optimization Guarantees: Many kernel-based optimization results apply directly
Theorem 2

Universal Approximation Theorem

Theorem: Let $\mathcal{X} \subset \mathbb{R}^d$ be compact. The class of single-hidden-layer ⵟ-product networks $f(\mathbf{x}) = \sum_{i=1}^n \alpha_i \cdot g(\mathbf{x}; \mathbf{w}_i, b_i) + c$ is dense in $C(\mathcal{X})$ under the uniform norm. That is, NMNs can approximate any continuous function to arbitrary precision.

What Does This Mean?

The Universal Approximation Theorem (UAT) is the fundamental existence theorem for neural networks. It guarantees that given enough neurons, a network can approximate any continuous function as closely as desired.

This theorem proves that NMNs lose no expressive power by removing activation functions. The ⵟ-product's inherent non-linearity is sufficient.

The Proof Strategy

The proof is elegant and leverages the kernel structure:

  1. Recover IMQ kernel: By differentiating $g(\mathbf{x}; \mathbf{w}, b)$ twice with respect to bias $b$, we recover the inverse multiquadric (IMQ) kernel: $\partial_b^2 g = \frac{2}{\|\mathbf{x} - \mathbf{w}\|^2 + \varepsilon}$
  2. Fourier analysis: The IMQ kernel has a strictly positive Fourier transform (Bessel function)
  3. Uniqueness: Any measure orthogonal to all IMQ translates must be zero
  4. Density: By Hahn-Banach/Riesz duality, the span is dense in $C(\mathcal{X})$
🔬
Key Insight: The bias term $b$ is crucial — it allows the network to "shift" response fields and span the entire function space through differentiation. This is why we use $(\mathbf{w}^\top\mathbf{x} + b)^2$ rather than just $(\mathbf{w}^\top\mathbf{x})^2$.
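For completeness, proof step 1 can be written out explicitly; the denominator does not depend on $b$, so differentiating twice removes the numerator entirely:

$$g(\mathbf{x}; \mathbf{w}, b) = \frac{(\mathbf{w}^\top\mathbf{x} + b)^2}{\|\mathbf{x} - \mathbf{w}\|^2 + \varepsilon}, \qquad \partial_b g = \frac{2(\mathbf{w}^\top\mathbf{x} + b)}{\|\mathbf{x} - \mathbf{w}\|^2 + \varepsilon}, \qquad \partial_b^2 g = \frac{2}{\|\mathbf{x} - \mathbf{w}\|^2 + \varepsilon}.$$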

What Does This Offer?

  • No Power Loss: NMNs are as expressive as ReLU/Sigmoid networks
  • Geometric Localization: Unlike ReLU (unbounded growth), the ⵟ-product achieves density through localized geometric units
  • Practical Corollary: Single hidden layer is sufficient in theory — though deeper networks may learn more efficiently
Proposition 3

Self-Regulation & Bounded Outputs

Proposition: For any fixed weight vector $\mathbf{w}$, the ⵟ-product output remains bounded and converges as $\|\mathbf{x}\| \to \infty$: $$\lim_{\|\mathbf{x}\| \to \infty} \text{ⵟ}(\mathbf{w}, \mathbf{x}) = \|\mathbf{w}\|^2 \cos^2\theta$$ where $\theta$ is the angle between $\mathbf{w}$ and the direction of $\mathbf{x}$.

What Does This Mean?

Unlike ReLU which can grow unboundedly with input magnitude, or dot products which scale linearly, the -product naturally self-regulates. As inputs get very large, the output converges to a finite value that depends only on the direction, not magnitude.
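A small numerical illustration of this limit (our own check):

import numpy as np

def yat_product(w, x, eps=1e-5):
    return np.dot(w, x) ** 2 / (np.sum((w - x) ** 2) + eps)

w = np.array([2.0, 1.0])
u = np.array([1.0, 1.0]) / np.sqrt(2)    # fixed direction for x
limit = np.dot(w, u) ** 2                # equals ||w||^2 cos^2(theta) = 4.5 here
for t in [1e1, 1e3, 1e5]:
    print(t, yat_product(w, t * u))
# the response approaches 4.5 as t grows, no matter how large ||x|| becomes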

Corollary: Dimensional Self-Normalization

Corollary: At initialization with random i.i.d. weights, both the numerator $(\mathbf{w}^\top\mathbf{x})^2$ and denominator $\|\mathbf{w} - \mathbf{x}\|^2$ scale as $\mathcal{O}(d)$ with dimension $d$, so their ratio remains $\mathcal{O}(1)$.

This means NMNs don't need careful initialization schemes like Xavier or He — they're dimensionally stable by design!
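The O(1) scaling is easy to see empirically (our own sketch, with i.i.d. standard normal weights and inputs):

import numpy as np

rng = np.random.default_rng(0)
for d in [16, 256, 4096]:
    w, x = rng.normal(size=d), rng.normal(size=d)
    ratio = np.dot(w, x) ** 2 / np.sum((w - x) ** 2)
    print(d, round(ratio, 3))
# numerator and denominator both grow like O(d), so the ratio stays O(1) across dimensions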

Corollary: Mitigating Internal Covariate Shift

Corollary: As input magnitudes grow large, the mean and variance of neuron activations across a batch become independent of input magnitudes, depending only on angular distribution.

This explains why NMNs can often operate without batch normalization — the "covariate shift" problem that plagues deep networks is naturally mitigated.

Practical Impact: No gradient explosion from large inputs. No need for gradient clipping in most cases. Simpler, more stable training dynamics.

What Does This Offer?

  • No Exploding Activations: Outliers don't cause numerical instabilities
  • Reduced Normalization Needs: Less reliance on BatchNorm, LayerNorm
  • Memory Efficiency: 15-25% reduction from eliminating normalization layers
  • Robust Training: Stable even with varying input distributions
Proposition 4

Stable Learning & Gradient Localization

Proposition: The gradient of the ⵟ-product with respect to the input vanishes for distant inputs: $$\lim_{\|\mathbf{x}\| \to \infty} \|\nabla_{\mathbf{x}} \text{ⵟ}(\mathbf{w}, \mathbf{x})\| = 0$$ Specifically, $\|\nabla_{\mathbf{x}} \text{ⵟ}\| \sim \mathcal{O}(1/k)$ as $\|\mathbf{x}\| = k \to \infty$.

What Does This Mean?

Each ⵟ-product neuron creates a localized learning region. Points far from the weight vector contribute vanishingly small gradients. This is fundamentally different from:

  • Linear neurons: Gradients are constant regardless of distance (no localization)
  • ReLU neurons: Gradients are either constant (positive side) or zero (negative side)
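A tiny autograd check of this decay (our own illustration), pushing the input out along a fixed direction:

import torch

def yat_product(w, x, eps=1e-5):
    return (w @ x) ** 2 / ((w - x).pow(2).sum() + eps)

w = torch.tensor([2.0, 1.0])
u = torch.tensor([1.0, 1.0]) / 2 ** 0.5
for k in [1e1, 1e2, 1e3, 1e4]:
    x = (k * u).clone().requires_grad_(True)
    yat_product(w, x).backward()
    print(f"k = {k:>7.0f}   ||grad|| = {x.grad.norm().item():.2e}")
# the gradient norm shrinks roughly like 1/k, matching the O(1/k) rate stated above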

Regularity Properties

Analyticity (Lemma)

The ⵟ-product is $C^\infty$ — infinitely differentiable. Perfect for physics-informed neural networks (PINNs) where you need higher-order derivatives.

Lipschitz Continuity (Proposition)

The kernel is globally Lipschitz continuous: $|K(\mathbf{w}, \mathbf{x}) - K(\mathbf{w}, \mathbf{y})| \leq L \|\mathbf{x} - \mathbf{y}\|$. This bounds how fast outputs can change.

🎯
Geometric Intuition: Each neuron is like a "gravity well" — it strongly attracts nearby points but has negligible influence on distant ones. During training, each neuron learns to "own" a local region of the input space.

What Does This Offer?

  • Outlier Robustness: Distant outliers don't cause large, destabilizing gradient updates
  • Local Learning: Each neuron specializes in its region without interfering with others
  • Smoother Optimization: Lipschitz bounds provide theoretical guarantees on loss landscape
  • PINN-Friendly: $C^\infty$ smoothness enables any-order derivative computation
Theorems 5-6

Information-Geometric Foundations

Theorem (Minimal Similarity): For probability distributions $\mathbf{p}, \mathbf{q} \in \Delta^{n-1}$: $$\text{ⵟ}(\mathbf{p}, \mathbf{q}) = 0 \iff \text{supp}(\mathbf{p}) \cap \text{supp}(\mathbf{q}) = \emptyset$$ Furthermore, disjoint support implies $\text{KL}(\mathbf{p} \| \mathbf{q}) = \infty$.
Theorem (Maximal Similarity): $\text{ⵟ}(\mathbf{p}, \mathbf{q}) \to \infty \iff \mathbf{p} = \mathbf{q}$, and this corresponds to $\text{KL}(\mathbf{p} \| \mathbf{q}) = 0$.

What Does This Mean?

When the ⵟ-product is applied to probability distributions (like softmax outputs), it exhibits deep connections to information theory. The extremes of the ⵟ-product correspond precisely to extremes in KL divergence!

  • ⵟ(p, q) = 0: disjoint support, the distributions never overlap, corresponding to KL divergence = ∞.
  • ⵟ(p, q) → ∞: identical distributions, p = q everywhere, corresponding to KL divergence = 0.
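Both extremes are easy to verify numerically (our own sketch):

import numpy as np

def yat_product(p, q, eps=1e-12):
    return np.dot(p, q) ** 2 / (np.sum((p - q) ** 2) + eps)

p = np.array([0.5, 0.5, 0.0, 0.0])   # support on outcomes {0, 1}
q = np.array([0.0, 0.0, 0.3, 0.7])   # support on outcomes {2, 3}, disjoint from p
print(yat_product(p, q))   # 0.0, and KL(p || q) is infinite
print(yat_product(p, p))   # enormous (limited only by eps), and KL(p || p) = 0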

Duality of Orthogonality Concepts

The ⵟ-product unifies three distinct notions of "orthogonality":

| Type | Condition | Interpretation |
|---|---|---|
| Euclidean | $\mathbf{p} \cdot \mathbf{q} = 0$ | Vectors perpendicular in space |
| Combinatorial | $\text{supp}(\mathbf{p}) \cap \text{supp}(\mathbf{q}) = \emptyset$ | No shared non-zero entries |
| Information-Theoretic | $\text{KL}(\mathbf{p} \,\Vert\, \mathbf{q}) = \infty$ | Infinite surprise / divergence |

All three are equivalent when $\text{ⵟ}(\mathbf{p}, \mathbf{q}) = 0$!

🌉
Bridge Between Worlds: The ⵟ-product acts as a bridge connecting Euclidean geometry (dot products, distances) with probabilistic reasoning (KL divergence, entropy). This explains why NMNs work well with cross-entropy loss — there's deep mathematical harmony.

What Does This Offer?

  • Natural Compatibility: Works harmoniously with entropy-based loss functions
  • Interpretable Outputs: Similarity scores have information-theoretic meaning
  • Theoretical Unity: Unifies geometric and probabilistic perspectives in one framework
  • Distribution Modeling: Natural fit for generative models and density estimation
Theorem 7

Topological Organization: Neural Fiber Bundles

Theorem: Let $\mathcal{M} \subset \mathbb{R}^d$ be a smooth compact data manifold and $\{w_c\}_{c=1}^C$ be class prototypes. The classification rule $\hat{c}(x) = \arg\max_c \text{ⵟ}(w_c, x)$ partitions $\mathcal{M}$ into decision regions.

Separation Property: If prototypes are orthogonal ($\langle w_i, w_j \rangle = 0$ for $i \neq j$), then:
  1. Prototypes are maximally dissimilar: $\text{ⵟ}(w_i, w_j) = 0$
  2. Decision boundaries are spatially separated from prototype cores

What Does This Mean?

Traditional linear classifiers create hyperplane decision boundaries — flat cuts through space that extend infinitely. The ⵟ-product creates something fundamentally different: vortex-like territorial fields around each prototype.

[Figure: linear classifier with unbounded half-spaces (classes A, B) versus ⵟ-product with localized vortex territories (classes A, B, C).]

Why Orthogonal Prototypes?

When class prototypes are orthogonal, something magical happens:

  • Each prototype's "core" (where response is maximal) is completely separate from other prototypes
  • At point $x = w_i$: the response to class $i$ is $\|w_i\|^4/\varepsilon$ (very large), while response to class $j$ is exactly 0
  • Decision boundaries are forced into "neutral zones" between prototypes
🌀
Vortex Intuition: Imagine each prototype as a whirlpool. Points are "sucked in" toward their nearest prototype with strength proportional to alignment and inverse-square of distance. The decision boundary is where the "pull" from two prototypes exactly balances.
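A small illustration of the separation property with orthogonal prototypes (our own sketch):

import numpy as np

def yat_product(w, x, eps=1e-5):
    return np.dot(w, x) ** 2 / (np.sum((w - x) ** 2) + eps)

prototypes = 2.0 * np.eye(3)            # three orthogonal class prototypes w_1, w_2, w_3

def classify(x):
    scores = [yat_product(w, x) for w in prototypes]
    return int(np.argmax(scores)), [round(s, 3) for s in scores]

print(classify(prototypes[0]))              # class 0: response ||w_0||^4 / eps, other classes exactly 0
print(classify(np.array([1.8, 0.3, 0.0])))  # still class 0: the point sits in w_0's territory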

Fiber Bundle Structure

For those familiar with differential geometry: the classification can be viewed through the lens of fiber bundles. The data manifold $\mathcal{M}$ is the base space, each point $x$ has a "fiber" of response values $[\text{ⵟ}(w_1, x), ..., \text{ⵟ}(w_C, x)]$, and classification is projection to the dominant fiber component.

What Does This Offer?

  • Geometric Interpretability: Decision regions have intuitive spatial meaning
  • Natural Clustering: Each class "owns" a territory, not just a half-space
  • Robust Boundaries: Orthogonal prototypes guarantee maximum class separation
  • Manifold-Aware: Works naturally with curved data manifolds, not just flat spaces

The Complete Picture

These six theorem groups form a coherent mathematical framework:

1 Mercer: Valid kernel ⟹ rich function space
2 UAT: Can approximate any function
3 Self-Regulation: Bounded, stable outputs
4 Stable Gradients: Trainable via localized learning
5 Info Theory: Works with entropy losses
6 Topology: Geometric decision boundaries

Together, these guarantee that NMNs are theoretically sound, practically trainable, and geometrically interpretable — without a single activation function.

Quick Start

Installation

pip install nmn

# Framework-specific
pip install "nmn[torch]"    # PyTorch
pip install "nmn[keras]"    # Keras/TensorFlow
pip install "nmn[nnx]"      # Flax NNX (JAX)
pip install "nmn[all]"      # Everything
import torch
from nmn.torch.nmn import YatNMN

# Replace nn.Linear + activation with single layer
layer = YatNMN(
    in_features=128,
    out_features=64,
    epsilon=1e-5
)

x = torch.randn(32, 128)
y = layer(x)  # (32, 64) — inherently non-linear!

Keras

import keras
from nmn.keras.nmn import YatNMN

# Drop-in replacement for Dense + activation
layer = YatNMN(
    features=64,
    epsilon=1e-5
)

x = keras.ops.zeros((32, 128))
y = layer(x)  # (32, 64)

Flax NNX

import jax.numpy as jnp
from flax import nnx
from nmn.nnx.nmn import YatNMN

layer = YatNMN(
    in_features=128,
    out_features=64,
    rngs=nnx.Rngs(0)
)

x = jnp.zeros((32, 128))
y = layer(x)  # (32, 64)

Available Layers

  • YatNMN (Dense)
  • YatConv1D, YatConv2D, YatConv3D, YatConvTranspose (Conv)
  • MultiHeadAttention (Attention)
  • YatLSTMCell, YatGRUCell (RNN)

Citation

@article{bouhsine2025nomoredelulu,
  author = {Taha Bouhsine},
  title = {No More DeLuLu: A Kernel-Based Activation-Free Neural Networks},
  year = {2025},
  url = {https://github.com/azettaai/nmn}
}