Imagine you have a special measuring stick that tells you how "similar" two toys are. A good measuring stick should follow two rules: it gives the same answer no matter which toy you hold up first, and its answers never contradict each other, no matter how many toys you compare at once.
The ℵ-product is like a magic measuring stick that follows both rules perfectly! This means we can trust it to compare things fairly, and mathematicians have already figured out lots of cool tricks we can use with measuring sticks like this.
When we invented the ℵ-product as a new way to measure similarity, we needed to answer a critical question: is it mathematically legitimate?
Without being a valid kernel, the ℵ-product would be just another arbitrary formula. By proving it is a Mercer kernel, we unlock 50+ years of kernel methods research (SVMs, Gaussian Processes, kernel PCA, and more), all of which now apply to NMNs.
A Mercer kernel is a function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ satisfying two properties:

1. **Symmetry:** $k(\mathbf{x}, \mathbf{y}) = k(\mathbf{y}, \mathbf{x})$ for all $\mathbf{x}, \mathbf{y} \in \mathcal{X}$.
2. **Positive semi-definiteness (PSD):** for any finite set of points $\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathcal{X}$, the Gram matrix $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ satisfies $\mathbf{c}^\top K \mathbf{c} \geq 0$ for every $\mathbf{c} \in \mathbb{R}^n$.
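To make these two conditions concrete, here is a minimal Python sketch (NumPy only) that checks them numerically. It uses a standard RBF kernel as the example, but the same check applies to any candidate kernel, including the ℵ-product:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel: k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 random points in R^3
K = rbf_kernel(X, X)                  # Gram matrix: K_ij = k(x_i, x_j)

# Property 1: symmetry, i.e. K equals its transpose.
print("symmetric:", np.allclose(K, K.T))

# Property 2: positive semi-definiteness, i.e. nonnegative eigenvalues
# (up to floating-point tolerance).
print("min eigenvalue:", np.linalg.eigvalsh(K).min())
```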
The proof proceeds in three steps:
We write $\aleph(\mathbf{w}, \mathbf{x}) = k_1(\mathbf{w}, \mathbf{x}) \cdot k_2(\mathbf{w}, \mathbf{x})$, where $k_1$ and $k_2$ are each symmetric, PSD kernels in their own right.
The Schur product theorem states: if $K_1$ and $K_2$ are PSD kernel matrices, their element-wise (Hadamard) product $K_1 \circ K_2$ is also PSD.
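For readers who want to see why this holds, one standard argument uses the spectral decompositions $K_1 = \sum_r \lambda_r \mathbf{u}_r \mathbf{u}_r^\top$ and $K_2 = \sum_s \mu_s \mathbf{v}_s \mathbf{v}_s^\top$ with $\lambda_r, \mu_s \geq 0$:

$$
\mathbf{c}^\top (K_1 \circ K_2)\, \mathbf{c}
= \sum_{r,s} \lambda_r \mu_s \sum_{i,j} c_i c_j\, u_{r,i} v_{s,i}\, u_{r,j} v_{s,j}
= \sum_{r,s} \lambda_r \mu_s \Big( \sum_i c_i\, u_{r,i} v_{s,i} \Big)^{2} \geq 0.
$$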
Since both $k_1$ and $k_2$ are PSD, every Gram matrix of their product $\aleph = k_1 \cdot k_2$ is the Hadamard product of two PSD matrices, and is therefore PSD; symmetry is immediate because each factor is symmetric. ∎
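A quick numerical sanity check of this closure property. This is a sketch: the linear and RBF kernels below are stand-ins for whatever $k_1$ and $k_2$ one factors out, not the ℵ-product's actual factors:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))

# Two Gram matrices that are PSD by construction, standing in for the
# Gram matrices of k1 and k2 (here: a linear and an RBF kernel).
K1 = X @ X.T
K2 = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

K = K1 * K2  # element-wise (Hadamard) product, as in the Schur theorem

# All three minimum eigenvalues are >= 0 (up to floating-point error).
for name, M in [("K1", K1), ("K2", K2), ("K1 o K2", K)]:
    print(name, "min eigenvalue:", np.linalg.eigvalsh(M).min())
```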
Every Mercer kernel defines an RKHS, a rich function space where learning has nice properties. The ℵ-product implicitly projects data into this infinite-dimensional space.
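Concretely, Mercer's theorem gives an eigen-expansion of the kernel, and this expansion is exactly the "implicit projection": the feature map $\Phi$ below is never computed, yet every kernel evaluation is an inner product in its image:

$$
k(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{\infty} \lambda_i\, \phi_i(\mathbf{x})\, \phi_i(\mathbf{y})
= \langle \Phi(\mathbf{x}), \Phi(\mathbf{y}) \rangle,
\qquad
\Phi(\mathbf{x}) = \big( \sqrt{\lambda_1}\, \phi_1(\mathbf{x}),\ \sqrt{\lambda_2}\, \phi_2(\mathbf{x}),\ \dots \big).
$$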
We can compute inner products in the high-dimensional feature space without ever computing the features explicitly. This is computationally efficient and theoretically powerful.
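A tiny self-contained illustration, using the degree-2 polynomial kernel because its feature map is known in closed form (the ℵ-product's implicit feature space is far larger, which is exactly why the trick matters):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input: R^2 -> R^3."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

explicit = phi(x) @ phi(y)   # inner product in the feature space
kernel   = (x @ y) ** 2      # kernel trick: phi is never constructed

print(explicit, kernel)      # both equal (x . y)^2 = 1
```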
Optimal solutions to regularized learning problems lie in the span of kernel evaluations at training points. This gives theoretical guarantees on generalization.
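As a sketch of the representer theorem in action, here is minimal kernel ridge regression: the fitted function is exactly $f(\mathbf{x}) = \sum_i \alpha_i\, k(\mathbf{x}_i, \mathbf{x})$, a weighted sum of kernel evaluations at the training points. An RBF kernel is used as a stand-in; any Mercer kernel could be swapped in:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(30, 1))               # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)    # noisy targets

def k(A, B, gamma=0.5):
    """RBF kernel on 1-D inputs."""
    return np.exp(-gamma * (A - B.T) ** 2)

K = k(X, X)                                        # (30, 30) Gram matrix
alpha = np.linalg.solve(K + 1e-2 * np.eye(30), y)  # ridge coefficients

# Representer theorem: the prediction at any new point is a weighted
# sum of kernel evaluations at the 30 training points.
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
print(k(X_new, X) @ alpha)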
All kernel-based algorithms (Support Vector Machines, Gaussian Processes, kernel PCA) can now use the ℵ-product as their kernel function.
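For example, scikit-learn's SVC accepts any callable that returns a Gram matrix. The `aleph_product` function below is a hypothetical stand-in (a product of linear and RBF kernels, which is Mercer by the Schur argument above), not the real implementation:

```python
import numpy as np
from sklearn.svm import SVC

def aleph_product(X, Y):
    """Hypothetical stand-in for a real aleph-product implementation:
    a product of a linear and an RBF kernel (PSD by the Schur theorem)."""
    linear = X @ Y.T
    rbf = np.exp(-0.5 * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
    return linear * rbf

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)    # toy labels

# SVC accepts any callable kernel mapping (X, Y) to a Gram matrix.
clf = SVC(kernel=aleph_product).fit(X, y)
print("training accuracy:", clf.score(X, y))
```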
This theorem is the foundation stone of NMN theory. It answers the question: "Why should we believe this strange formula has any mathematical meaning?"
By proving Mercer's condition, we establish that the ℵ-product isn't just a heuristic: it's a principled similarity measure with deep connections to functional analysis, optimization theory, and statistical learning.
Mercer's theorem dates back to 1909, when James Mercer proved that certain integral operators could be decomposed using orthonormal functions. This became the foundation of kernel methods in machine learning, popularized by SVMs in the 1990s.
By connecting NMNs to this rich history, we inherit decades of theoretical insights and practical algorithms, while introducing something genuinely new: activation-free neural networks.