
Theoretical Background

🧒

Explain Like I'm 5

Building a brain (neural network) is like building with different tools:

  • 📏 The measuring tool (dot product): Checks if two arrows point the same way. But it doesn't care how far apart they are!
  • 📐 The angle tool (cosine): Only cares about direction, ignores everything else.
  • 📍 The distance tool (Euclidean): Only cares about "how far," ignores direction.
  • 🔌 The switch (activation): Turns things on or off. But sometimes it's too harsh and forgets important details!

The problem? Each tool only does one thing. And the switch sometimes breaks our drawings by squishing or cutting parts off!

The ⵟ-product is like a super tool that measures direction AND distance at the same time — and it doesn't need a harsh switch!

📏 The Dot Product: Alignment Without Distance

The dot product is the workhorse of neural computation. For vectors $\mathbf{a} = [a_1, \ldots, a_n]$ and $\mathbf{b} = [b_1, \ldots, b_n]$:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = \|\mathbf{a}\| \|\mathbf{b}\| \cos(\theta)$$

What it tells us:

  • Positive: Vectors point in roughly the same direction
  • Zero: Vectors are perpendicular (orthogonal)
  • Negative: Vectors point in roughly opposite directions
⚠️
The Problem: The dot product conflates direction and magnitude. Two unit vectors have a maximum dot product of 1, but $[100, 0] \cdot [100, 0] = 10000$. This magnitude sensitivity must be compensated for elsewhere (e.g., normalization layers).
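A minimal NumPy sketch of both behaviours: the sign tracks alignment, while the value scales with magnitude (the vectors here are arbitrary illustrative examples):

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([0.7, 0.7])      # roughly aligned with a
c = np.array([0.0, 1.0])      # orthogonal to a
d = np.array([-1.0, 0.0])     # opposite to a

# Sign encodes alignment: positive, zero, negative
print(np.dot(a, b), np.dot(a, c), np.dot(a, d))   # 0.7  0.0  -1.0

# Magnitude sensitivity: same direction, wildly different values
print(np.dot(a, a))                    # 1.0 (unit vectors)
print(np.dot(100 * a, 100 * a))        # 10000.0
```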

🧭 Cosine Similarity: Pure Direction

To isolate directional information, we normalize by magnitude:

$$\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$$

This gives values from $-1$ (opposite) through $0$ (orthogonal) to $+1$ (identical direction).

Advantage

Scale-invariant: $[1, 2, 3]$ and $[100, 200, 300]$ have cosine similarity of 1.0 because they point in the same direction.

Limitation

Ignores distance entirely! Vectors at $[0.001, 0]$ and $[1000000, 0]$ also have similarity 1.0, despite being vastly separated in space.
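A small sketch of both properties, using a hypothetical cosine helper that implements the formula above:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product normalized by both magnitudes."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Scale invariance: same direction, similarity stays 1.0 (up to float rounding)
print(cosine(np.array([1.0, 2.0, 3.0]), np.array([100.0, 200.0, 300.0])))  # 1.0

# Distance blindness: far apart in space, still similarity 1.0
print(cosine(np.array([0.001, 0.0]), np.array([1_000_000.0, 0.0])))        # 1.0
```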

📍 Euclidean Distance: Proximity Without Direction

The Euclidean distance measures spatial separation:

$$d(\mathbf{p}, \mathbf{q}) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} = \|\mathbf{p} - \mathbf{q}\|$$

This is the foundation of clustering and nearest-neighbor algorithms (k-means, k-NN), RBF networks, and many loss functions (MSE).

⚠️
The Problem: Distance ignores alignment completely. Consider measuring distance from the origin: the points $[1, 0]$ and $[0, 1]$ are equidistant, but one is aligned with the x-axis and the other with the y-axis — very different orientations that distance alone cannot distinguish.
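A brief sketch of this orientation blindness, reusing the points from the example above:

```python
import numpy as np

origin = np.zeros(2)
p = np.array([1.0, 0.0])   # aligned with the x-axis
q = np.array([0.0, 1.0])   # aligned with the y-axis

# Equidistant from the origin...
print(np.linalg.norm(p - origin), np.linalg.norm(q - origin))  # 1.0  1.0

# ...yet pointing in orthogonal directions, which distance cannot see
print(np.dot(p, q))  # 0.0
```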

🖼️ Convolution: Localized Alignment

In convolutional neural networks (CNNs), the convolution operator slides a kernel across the input, computing dot products at each location:

$$(f * g)[n] = \sum_{m} f[m] \cdot g[n - m]$$

Each position in the output represents how well the local patch aligns with the kernel.

Key Roles of Convolution
  • Feature Detection: Kernels learn to detect edges, textures, shapes
  • Spatial Hierarchy: Stacking layers builds increasingly abstract features
  • Parameter Sharing: Same kernel applied everywhere → efficiency

The limitation: Standard convolution is essentially a local dot product, as the sketch below makes explicit. It inherits all the problems of dot products: no distance awareness and full sensitivity to magnitude.
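A minimal sketch of 1-D convolution as a sliding local dot product (the signal and kernel are arbitrary illustrative values; np.convolve is used only to cross-check the hand-rolled result):

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])   # a simple edge-like detector

# "Valid" convolution by hand: flip the kernel, slide it, take local dot products
flipped = kernel[::-1]
out = np.array([np.dot(signal[i:i + len(kernel)], flipped)
                for i in range(len(signal) - len(kernel) + 1)])
print(out)                                        # [2. 2. 2.]
print(np.convolve(signal, kernel, mode="valid"))  # same values
```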

🔌 Why We Need Non-Linearity

Here's a fundamental mathematical fact:

Theorem (Collapse of Linear Compositions): Any composition of linear functions is itself a linear function: $$f_n \circ f_{n-1} \circ \cdots \circ f_1 = \text{single linear function}$$ No matter how many layers, a purely linear network can only compute linear functions.

This means we NEED non-linearity to approximate complex functions. The question is: how do we introduce it?
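A quick numerical check of the collapse theorem, using two arbitrary weight matrices: composing them is indistinguishable from a single linear layer whose weight is their product.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))        # input vector
W1 = rng.normal(size=(3, 4))     # first "layer"
W2 = rng.normal(size=(2, 3))     # second "layer"

two_layers = W2 @ (W1 @ x)       # compose two linear maps
one_layer = (W2 @ W1) @ x        # a single equivalent linear map
print(np.allclose(two_layers, one_layer))  # True
```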

⚡ Traditional Activation Functions

The standard approach is to apply element-wise non-linear functions after linear layers:

📊
ReLU: $\max(0, x)$

Pros: Simple, fast, mitigates vanishing gradients
Cons: "Dead neurons" (zero gradient for negative inputs), discontinuous derivative at 0

📈
Sigmoid: $\frac{1}{1+e^{-x}}$

Pros: Smooth, bounded output [0,1]
Cons: Vanishing gradients at extremes, outputs not zero-centered

〰️
Tanh: $\frac{e^x - e^{-x}}{e^x + e^{-x}}$

Pros: Zero-centered, smooth
Cons: Still saturates at extremes

🔔
GELU, SiLU, Mish

Pros: Smooth approximations to ReLU
Cons: More computation, still element-wise
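For reference, a sketch of the element-wise definitions above (GELU is shown in its common tanh-based approximation):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def gelu(x):
    # Common tanh-based approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
for f in (relu, sigmoid, tanh, gelu):
    print(f.__name__, f(x))
```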

🔺 The Geometric Cost of Activation Functions

This is the core insight that motivates Neural Matter Networks: activation functions distort geometry.

ReLU Distortion

Consider a smooth manifold (like a Swiss roll) passing through a ReLU layer:

  • All negative values → 0 (entire half-space collapsed!)
  • Points that were distinct become identical
  • Local neighborhoods are destroyed
  • Information is irreversibly lost
Sigmoid/Tanh Saturation

Points with very positive or very negative values get "squashed" to the extremes:

  • Distinct points → approximately the same output
  • Distances between large values → compressed to near-zero
  • Fine-grained differences → lost in saturation regions
💥
Example: Imagine two inputs: one has dot product $-0.1$ with a weight (slightly misaligned) and another has dot product $-100$ (strongly opposed). After ReLU, both become exactly 0 — we've lost all information about the degree of misalignment!
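The same collapse, numerically (the two pre-activation values are taken from the example above):

```python
import numpy as np

pre_activations = np.array([-0.1, -100.0])   # slight vs. severe misalignment
post_relu = np.maximum(0.0, pre_activations)
print(post_relu)   # [0. 0.] -- the degree of misalignment is gone
```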
[Animated figure: gradient flow fields. Left panel: ReLU, half-space collapse. Right panel: ⵟ-product, vortex field. Particles flow along gradient directions, showing how ReLU discards all information below zero.]

🔄 Topological Distortions

Beyond metric distortion, activation functions can change the topology of representations:

Property | Before Activation | After Activation
Injectivity | Linear maps can be injective | ReLU collapses half-spaces to 0
Connectedness | Connected sets stay connected | Can be preserved or broken
Smoothness | Affine maps are infinitely smooth | ReLU has a discontinuous derivative
Neighborhoods | Local structure preserved | Neighbors can become identical

✨ The ⵟ-Product: A Better Way

This theoretical background reveals what we need: a computational primitive that provides non-linearity without the geometric destruction of activation functions.

The ⵟ-Product Solution: $$\text{ⵟ}(\mathbf{w}, \mathbf{x}) = \frac{\langle \mathbf{w}, \mathbf{x} \rangle^2}{\|\mathbf{w} - \mathbf{x}\|^2 + \epsilon}$$
  • Non-linear: Ratio of squared terms is inherently non-linear
  • Distance-aware: Denominator captures proximity
  • Alignment-aware: Numerator captures direction
  • Smooth: Infinitely differentiable everywhere
  • Injective for distinct inputs: Different (w,x) pairs → different outputs
🎯
Key Insight: The ⵟ-product achieves non-linearity through the geometric relationship between vectors, not through a separate, information-destroying activation function. This preserves the rich geometric structure that traditional activations discard.
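A minimal sketch of the ⵟ-product as defined above; the helper name and epsilon value are illustrative assumptions:

```python
import numpy as np

def t_product(w, x, eps=1e-6):
    """ⵟ(w, x) = <w, x>^2 / (||w - x||^2 + eps): alignment over separation."""
    return np.dot(w, x) ** 2 / (np.sum((w - x) ** 2) + eps)

w = np.array([1.0, 0.0])
print(t_product(w, np.array([0.9, 0.1])))    # aligned and close -> large response (~40)
print(t_product(w, np.array([100.0, 0.0])))  # aligned but far -> damped by distance (~1)
print(t_product(w, np.array([0.0, 1.0])))    # orthogonal -> near zero
```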

📋 Summary: Why Traditional Primitives Need Help

Primitive | Captures | Misses | Problem
Dot Product | Alignment + Magnitude | Distance | Scale-sensitive
Cosine Similarity | Pure Direction | Distance + Magnitude | Position-blind
Euclidean Distance | Spatial Separation | Direction | Orientation-blind
ReLU Activation | Positive values | Negative distinctions | Half-space collapse
Sigmoid/Tanh | Bounded output | Extreme distinctions | Saturation
ⵟ-Product | Alignment + Distance | (none) | Addresses all above!