
The ⵟ-Product is a Mercer Kernel

🧒 Explain Like I'm 5

Imagine you have a special measuring stick that tells you how "similar" two toys are. A good measuring stick should follow two rules:

  • 🔄 Fair both ways: If toy A is "very similar" to toy B, then toy B should also be "very similar" to toy A.
  • Makes sense together: If you measure lots of toys, the numbers should all "agree" with each other (no contradictions).

The ⵟ-product is like a magic measuring stick that follows both rules perfectly! This means we can trust it to compare things fairly, and mathematicians have already figured out lots of cool tricks we can use with measuring sticks like this.

Theorem (Mercer's Condition): The kernel $k_{\text{ⵟ}}(\mathbf{x}, \mathbf{w}) = \frac{(\mathbf{x} \cdot \mathbf{w})^2}{\|\mathbf{x} - \mathbf{w}\|^2 + \varepsilon}$ is symmetric and positive semi-definite, hence a valid Mercer kernel on $\mathbb{R}^d$.
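
To make the formula concrete, here is a minimal NumPy sketch of the kernel exactly as stated; the function name `product_kernel` and the value of ε are illustrative choices, not prescribed by the theorem.

```python
import numpy as np

def product_kernel(x, w, eps=1e-3):
    """The kernel from the theorem: (x . w)^2 / (||x - w||^2 + eps)."""
    return np.dot(x, w) ** 2 / (np.sum((x - w) ** 2) + eps)

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, 1.5, 2.5])
print(product_kernel(x, w))          # equals product_kernel(w, x) by symmetry
```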

🎯 The Problem This Solves

When we invented the ⵟ-product as a new way to measure similarity, we needed to answer a critical question: Is this mathematically legitimate?

Without being a valid kernel, the ⵟ-product would be just another arbitrary formula. By proving it's a Mercer kernel, we unlock 50+ years of kernel methods research — SVMs, Gaussian Processes, kernel PCA, and more — all of which now apply to NMNs.

📐 The Mathematics In Depth

A Mercer kernel is a function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ satisfying two properties:

1. Symmetry: $k(\mathbf{x}, \mathbf{w}) = k(\mathbf{w}, \mathbf{x})$ for all $\mathbf{x}, \mathbf{w}$
2. Positive Semi-Definiteness: For any $n$ points $\{x_1, ..., x_n\}$ and any coefficients $\{c_1, ..., c_n\}$: $$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j k(x_i, x_j) \geq 0$$
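
As a numerical sanity check on this definition, the sketch below evaluates the double sum (equivalently, the quadratic form $\mathbf{c}^\top K \mathbf{c}$ over the Gram matrix $K$) for random points and random real coefficients; a valid kernel should never produce a meaningfully negative value. The `product_kernel` helper, ε, and the data here are all illustrative.

```python
import numpy as np

def product_kernel(x, w, eps=1e-3):
    return np.dot(x, w) ** 2 / (np.sum((x - w) ** 2) + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                                      # n = 50 points in R^4
K = np.array([[product_kernel(a, b) for b in X] for a in X])      # Gram matrix K_ij

# The double sum above is exactly c^T K c; it should never be meaningfully negative.
quad_forms = [c @ K @ c for c in rng.normal(size=(1000, 50))]
print("smallest quadratic form:", min(quad_forms))
print("smallest eigenvalue of K:", np.linalg.eigvalsh(K).min())
```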

The proof proceeds in three steps:

Step 1: Decompose into known kernels

We write $\text{ⵟ}(\mathbf{w}, \mathbf{x}) = k_1(\mathbf{w}, \mathbf{x}) \cdot k_2(\mathbf{w}, \mathbf{x})$ where:

  • $k_1(\mathbf{w}, \mathbf{x}) = (\mathbf{w}^\top \mathbf{x})^2$ — the squared dot product (degree-2 polynomial kernel)
  • $k_2(\mathbf{w}, \mathbf{x}) = \frac{1}{\|\mathbf{w} - \mathbf{x}\|^2 + \varepsilon}$ — an inverse-quadratic kernel (a generalized inverse multiquadric, IMQ)
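
The factorization is easy to verify numerically. In the sketch below (illustrative names and ε, not the NMN implementation), the kernel evaluates to exactly $k_1 \cdot k_2$ at any pair of inputs:

```python
import numpy as np

def k1(w, x):                  # degree-2 polynomial kernel (no offset)
    return np.dot(w, x) ** 2

def k2(w, x, eps=1e-3):        # inverse-quadratic (generalized IMQ) component
    return 1.0 / (np.sum((w - x) ** 2) + eps)

def product_kernel(w, x, eps=1e-3):   # the full kernel from the theorem statement
    return np.dot(w, x) ** 2 / (np.sum((w - x) ** 2) + eps)

rng = np.random.default_rng(1)
w, x = rng.normal(size=3), rng.normal(size=3)
print(np.isclose(product_kernel(w, x), k1(w, x) * k2(w, x)))      # True: factorization is exact
```
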
Step 2: Verify each component is PSD
  • Polynomial kernel: $(\mathbf{w}^\top \mathbf{x})^2 = \langle \phi(\mathbf{w}), \phi(\mathbf{x}) \rangle$ where $\phi$ maps to the space of outer products. This is PSD by construction.
  • IMQ kernel: As a function of $\mathbf{w} - \mathbf{x}$ it has a positive Fourier transform (a modified-Bessel-type density, e.g. $K_0$ in two dimensions), so Bochner's theorem implies it is PSD; equivalently, $t \mapsto (t + \varepsilon)^{-1}$ is completely monotone, which gives positive definiteness in every dimension $d$.
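
These analytic arguments can be spot-checked numerically: the sketch below (illustrative code on random data) builds the Gram matrices of $k_1$ and $k_2$ and reports their smallest eigenvalues, which should be non-negative up to floating-point round-off.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
eps = 1e-3

G1 = (X @ X.T) ** 2                                               # Gram of k1 (squared dot product)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
G2 = 1.0 / (sq_dists + eps)                                       # Gram of k2 (inverse quadratic)

for name, G in [("k1", G1), ("k2", G2)]:
    print(f"{name}: min eigenvalue = {np.linalg.eigvalsh(G).min():.3e}")
```
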
Step 3: Apply Schur Product Theorem

The Schur product theorem states: if $K_1$ and $K_2$ are PSD kernel matrices, their element-wise (Hadamard) product $K_1 \circ K_2$ is also PSD.

Symmetry is immediate, since both $k_1$ and $k_2$ are symmetric in their arguments. And since both $k_1$ and $k_2$ are PSD, their pointwise product $\text{ⵟ} = k_1 \cdot k_2$ is PSD. ∎
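
The Schur product theorem itself is easy to illustrate numerically. The minimal sketch below uses two random PSD matrices rather than kernel Gram matrices; the Hadamard product's smallest eigenvalue stays non-negative up to round-off.

```python
import numpy as np

rng = np.random.default_rng(3)
MA = rng.normal(size=(30, 5))
MB = rng.normal(size=(30, 7))
A = MA @ MA.T                      # PSD by construction
B = MB @ MB.T                      # PSD by construction

H = A * B                          # element-wise (Hadamard) product
print("min eigenvalue of A:  ", np.linalg.eigvalsh(A).min())
print("min eigenvalue of B:  ", np.linalg.eigvalsh(B).min())
print("min eigenvalue of A∘B:", np.linalg.eigvalsh(H).min())      # also >= 0 up to round-off
```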

💥 The Consequences

🌐 Reproducing Kernel Hilbert Space (RKHS)

Every Mercer kernel defines an RKHS — a rich function space where learning has nice properties. The ⵟ-product implicitly maps data into this (generally infinite-dimensional) space.

🔮 The Kernel Trick

We can compute inner products in the high-dimensional feature space without ever computing the features explicitly. This is computationally efficient and theoretically powerful.
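
The polynomial component of the ⵟ-product makes this tangible: the explicit feature map $\phi(\mathbf{x}) = \mathrm{vec}(\mathbf{x}\mathbf{x}^\top)$ lives in $d^2$ dimensions, yet $(\mathbf{x} \cdot \mathbf{w})^2$ returns the same inner product using only a $d$-dimensional dot product. A minimal sketch (illustrative, not NMN code):

```python
import numpy as np

def phi(x):
    """Explicit feature map for (x . w)^2: all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(4)
x, w = rng.normal(size=6), rng.normal(size=6)

explicit = np.dot(phi(x), phi(w))      # inner product in the 36-dimensional feature space
implicit = np.dot(x, w) ** 2           # kernel evaluation in the 6-dimensional input space
print(explicit, implicit)              # identical up to floating-point error
```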

📊 Representer Theorem Applies

Optimal solutions to regularized learning problems lie in the span of kernel evaluations at the training points. This reduces an infinite-dimensional search over the RKHS to a finite-dimensional problem in the training data.
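
Concretely, for kernel ridge regression with the ⵟ-product kernel, the fitted function has the form $f(\mathbf{x}) = \sum_i \alpha_i \, k_{\text{ⵟ}}(\mathbf{x}_i, \mathbf{x})$, and the coefficients come from one linear solve over the training Gram matrix. A hedged sketch with made-up data and an arbitrary regularization constant $\lambda$:

```python
import numpy as np

def product_kernel(a, b, eps=1e-3):
    return np.dot(a, b) ** 2 / (np.sum((a - b) ** 2) + eps)

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 2))                                   # toy training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)                # toy training targets

K = np.array([[product_kernel(a, b) for b in X] for a in X])   # training Gram matrix
lam = 1e-2                                                     # arbitrary regularization strength
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)           # dual coefficients

x_new = rng.normal(size=2)
f_new = sum(a_i * product_kernel(x_i, x_new) for a_i, x_i in zip(alpha, X))
print("prediction at x_new:", f_new)
```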

🔗 Connection to SVMs & GPs

All kernel-based algorithms (Support Vector Machines, Gaussian Processes, kernel PCA) can now use the ⵟ-product as their kernel function.
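
For example, scikit-learn's SVC accepts a precomputed Gram matrix, so any Mercer kernel, including the ⵟ-product, can drive it. The dataset, ε, and C below are purely illustrative:

```python
import numpy as np
from sklearn.svm import SVC

def product_kernel(a, b, eps=1e-3):
    return np.dot(a, b) ** 2 / (np.sum((a - b) ** 2) + eps)

def gram(A, B):
    """Pairwise kernel matrix between the rows of A and the rows of B."""
    return np.array([[product_kernel(a, b) for b in B] for a in A])

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 3))
y = (X[:, 0] * X[:, 1] > 0).astype(int)                    # toy nonlinear labels

clf = SVC(kernel="precomputed", C=1.0).fit(gram(X, X), y)
print("train accuracy:", clf.score(gram(X, X), y))
```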

🎓 What This Really Means

This theorem is the foundation stone of NMN theory. It answers the question: "Why should we believe this strange formula has any mathematical meaning?"

By proving Mercer's condition, we establish that the ⵟ-product isn't just a heuristic — it's a principled similarity measure with deep connections to functional analysis, optimization theory, and statistical learning.

💡 Key Insight: The ⵟ-product combines the alignment sensitivity of polynomial kernels with the locality of RBF kernels. This hybrid nature is what makes it so effective for neural computation.

📜 Historical Context

Mercer's theorem dates back to 1909, when James Mercer proved that certain integral operators could be decomposed using orthonormal functions. This became the foundation of kernel methods in machine learning, popularized by SVMs in the 1990s.

By connecting NMNs to this rich history, we inherit decades of theoretical insights and practical algorithms — while introducing something genuinely new: activation-free neural networks.