🎯 The Problem This Solves
When we invented the ⵟ-product as a new way to measure similarity,
we needed to answer a critical question: Is this mathematically legitimate?
Unless it is a valid kernel, the ⵟ-product is just another
arbitrary formula. By proving it is a Mercer kernel, we unlock 50+ years of
kernel methods research — SVMs, Gaussian Processes, kernel PCA, and more — all of which
now apply to NMNs.
📐 The Mathematics In Depth
A Mercer kernel is a function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
satisfying two properties:
1. Symmetry: $k(\mathbf{x}, \mathbf{w}) = k(\mathbf{w}, \mathbf{x})$ for all
$\mathbf{x}, \mathbf{w}$
2. Positive Semi-Definiteness: For any $n$ points $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$ and any
real coefficients $\{c_1, \dots, c_n\}$:
$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0$$
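This condition can be probed numerically: sample some points, pick arbitrary coefficients, and evaluate the double sum, which should never come out negative. A minimal sketch in Python (the helper name and the dot-product test kernel are illustrative, not part of the theorem):

```python
import numpy as np

def mercer_sum(kernel, X, c):
    """Evaluate sum_i sum_j c_i c_j k(x_i, x_j) from the PSD condition."""
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return float(c @ K @ c)

# Sanity check with the ordinary dot-product kernel, which is known to be a Mercer kernel:
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
c = rng.normal(size=20)
print(mercer_sum(lambda w, x: float(w @ x), X, c))  # expected: >= 0 for any choice of c
```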
The proof proceeds in three steps:
Step 1: Decompose into known kernels
We write $\text{ⵟ}(\mathbf{w}, \mathbf{x}) = k_1(\mathbf{w}, \mathbf{x}) \cdot k_2(\mathbf{w},
\mathbf{x})$ where:
- $k_1(\mathbf{w}, \mathbf{x}) = (\mathbf{w}^\top \mathbf{x})^2$ — the squared dot product
(degree-2 polynomial kernel)
- $k_2(\mathbf{w}, \mathbf{x}) = \frac{1}{\|\mathbf{w} - \mathbf{x}\|^2 + \varepsilon}$ — the
inverse multiquadric (IMQ) kernel
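In code, the decomposition is a one-liner per factor. A minimal sketch, where the function name `tz_product` and the numeric value of $\varepsilon$ are illustrative placeholders:

```python
import numpy as np

def tz_product(w, x, eps=1e-3):
    """ⵟ-product of two vectors: (w·x)^2 / (||w - x||^2 + eps)."""
    k1 = float(np.dot(w, x)) ** 2                    # degree-2 polynomial factor (alignment)
    k2 = 1.0 / (float(np.sum((w - x) ** 2)) + eps)   # IMQ factor (locality)
    return k1 * k2

w = np.array([1.0, 0.5, -0.2])
x = np.array([0.9, 0.6, -0.1])
print(tz_product(w, x))  # large when w and x are both well aligned and close together
```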
Step 2: Verify each component is PSD
- Polynomial kernel: $(\mathbf{w}^\top \mathbf{x})^2 = \langle
\phi(\mathbf{w}), \phi(\mathbf{x}) \rangle$ where $\phi$ maps to the space of outer
products. This is PSD by construction.
- IMQ kernel: Has a strictly positive Fourier transform (expressible in terms of
modified Bessel functions of the second kind), which by Bochner's theorem implies PSD.
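For the polynomial factor, the feature map can be written down explicitly as $\phi(\mathbf{v}) = \mathrm{vec}(\mathbf{v}\mathbf{v}^\top)$, and the positivity of the IMQ factor can be spot-checked on a sample via the eigenvalues of its Gram matrix. A quick numerical check (names and the value of $\varepsilon$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
w, x = rng.normal(size=5), rng.normal(size=5)

# Polynomial factor: phi(v) = vec(v v^T), so <phi(w), phi(x)> = (w·x)^2
phi = lambda v: np.outer(v, v).ravel()
assert np.isclose(phi(w) @ phi(x), (w @ x) ** 2)

# IMQ factor: its Gram matrix on a sample should have no negative eigenvalues
X, eps = rng.normal(size=(30, 5)), 1e-3
D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
K2 = 1.0 / (D2 + eps)
print(np.linalg.eigvalsh(K2).min())  # expected: >= 0 up to rounding error
```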
Step 3: Apply Schur Product Theorem
The Schur product theorem states: if $K_1$ and $K_2$ are PSD kernel matrices,
their element-wise (Hadamard) product $K_1 \circ K_2$ is also PSD.
Symmetry is immediate, since both factors are symmetric in $\mathbf{w}$ and $\mathbf{x}$. And since both $k_1$ and $k_2$ are PSD, their pointwise product $\text{ⵟ} = k_1 \cdot k_2$ is also PSD. ∎
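The same eigenvalue check makes the Schur product step visible: the element-wise product of the two component Gram matrices stays PSD. A small numerical demonstration (sample sizes and $\varepsilon$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
X, eps = rng.normal(size=(40, 8)), 1e-3

K1 = (X @ X.T) ** 2                                         # polynomial factor Gram matrix
D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K2 = 1.0 / (D2 + eps)                                       # IMQ factor Gram matrix
K = K1 * K2                                                 # Hadamard product = ⵟ Gram matrix

for name, M in [("K1", K1), ("K2", K2), ("K1 ∘ K2", K)]:
    print(name, np.linalg.eigvalsh(M).min())                # each minimum eigenvalue ≈ >= 0
```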
💥 The Consequences
🌐
Reproducing Kernel Hilbert Space (RKHS)
Every Mercer kernel defines an RKHS — a function space in which evaluation is a continuous
linear functional and regularized learning problems have well-behaved solutions. The ⵟ-product
implicitly embeds data into this infinite-dimensional space.
🔮
The Kernel Trick
We can compute inner products in the high-dimensional feature space
without ever computing the features explicitly. This is computationally
efficient and theoretically powerful.
📊
Representer Theorem Applies
Optimal solutions to regularized learning problems lie in the span of kernel evaluations at the
training points. This reduces an infinite-dimensional search over the RKHS to a finite-dimensional
problem with one coefficient per training example.
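Kernel ridge regression is the textbook illustration: by the representer theorem the minimizer has the form $f(\mathbf{x}) = \sum_i \alpha_i\, \text{ⵟ}(\mathbf{x}_i, \mathbf{x})$ with $\boldsymbol{\alpha} = (K + \lambda I)^{-1}\mathbf{y}$. A sketch using the ⵟ-product as the kernel (the `tz_gram` helper, the toy data, and the regularization strength are all illustrative):

```python
import numpy as np

def tz_gram(A, B, eps=1e-3):
    """Gram matrix of the ⵟ-product between the rows of A and the rows of B."""
    dots = A @ B.T
    d2 = np.sum(A**2, axis=1)[:, None] - 2 * dots + np.sum(B**2, axis=1)[None, :]
    return dots**2 / (d2 + eps)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

lam = 1e-2
K = tz_gram(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # representer coefficients

X_new = rng.normal(size=(5, 4))
y_pred = tz_gram(X_new, X) @ alpha                     # f(x) = sum_i alpha_i * ⵟ(x_i, x)
print(y_pred)
```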
🔗
Connection to SVMs & GPs
All kernel-based algorithms (Support Vector Machines, Gaussian Processes, kernel PCA)
can now use the ⵟ-product as their kernel function.
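For instance, scikit-learn's SVC accepts a callable kernel that returns the Gram matrix between two sets of points, so the ⵟ-product can be plugged in directly. A minimal sketch (the `tz_gram` helper is repeated from the previous snippet so this one stands alone; the toy data and labels are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

def tz_gram(A, B, eps=1e-3):
    """Gram matrix of the ⵟ-product: (A_i · B_j)^2 / (||A_i - B_j||^2 + eps)."""
    dots = A @ B.T
    d2 = np.sum(A**2, axis=1)[:, None] - 2 * dots + np.sum(B**2, axis=1)[None, :]
    return dots**2 / (d2 + eps)

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))
y = (X[:, 0] * X[:, 1] > 0).astype(int)    # toy binary labels

clf = SVC(kernel=tz_gram).fit(X, y)        # callable kernel returns the Gram matrix
print(clf.score(X, y))
```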
🎓 What This Really Means
This theorem is the foundation stone of NMN theory. It answers the question:
"Why should we believe this strange formula has any mathematical meaning?"
By proving Mercer's condition, we establish that the ⵟ-product
isn't just a heuristic — it's a principled similarity measure with deep connections
to functional analysis, optimization theory, and statistical learning.
💡
Key Insight: The ⵟ-product combines
the alignment sensitivity of polynomial kernels with the locality
of RBF kernels. This hybrid nature is what makes it so effective for neural computation.
📜 Historical Context
Mercer's theorem dates back to 1909, when James Mercer proved that continuous, symmetric,
positive-definite kernels can be expanded in the orthonormal eigenfunctions of their associated
integral operators. This result became the foundation of kernel methods
in machine learning, popularized by SVMs in the 1990s.
By connecting NMNs to this rich history, we inherit decades of theoretical insights and
practical algorithms — while introducing something genuinely new: activation-free neural networks.