
Squashing Functions

🧒 Explain Like I'm 5

Imagine you have a balloon 🎈 that can get REALLY big, but you need it to fit in a box.

  • 📦 Sigmoid: Squishes the balloon, but it's always at least half-full!
  • 💥 Softmax: Makes balloons compete — one WINS and gets all the air
  • 🧸 Softermax: A gentler version — balloons share more fairly
  • 🎈 Soft-sigmoid: Squishes gently from 0 up, not from 0.5!

The -product gives us numbers that are never negative (zero or bigger), so we need special squishing functions made just for that!

⚠️ The Problem with Standard Functions

The -product always outputs non-negative values ($\geq 0$). Standard activation functions have issues with this:

Standard Sigmoid

$\sigma(x) = \frac{1}{1 + e^{-x}}$
For $x \geq 0$: outputs are in $[0.5, 1)$
Problem: Zero maps to 0.5, not 0!

Standard Softmax

Uses $e^x$ which can explode for large inputs.
Creates "hard" distributions — one winner takes all.
Problem: Too aggressive for soft attention!
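
A quick numerical check of both issues (a NumPy sketch; the score values are made up for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Non-negative scores, the only kind the -product produces
scores = np.array([0.0, 0.5, 2.0, 10.0])

# Standard sigmoid never drops below 0.5 on this input range, and sigmoid(0) = 0.5
print(sigmoid(scores))                  # ≈ [0.5, 0.622, 0.881, 0.99995]

# Standard softmax makes a "hard" distribution: the largest score takes almost everything
big = np.array([1.0, 50.0, 100.0])
print(np.exp(big) / np.exp(big).sum())  # ≈ [0., 0., 1.]  (winner-take-all)

# ...and the raw exponential overflows float64 once a score exceeds ~709
print(np.exp(np.array([800.0])))        # [inf]
```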

✨ Alternative Squashing Functions

📊 Interactive: Squashing Function Comparison (plots of sigmoid, soft-sigmoid, and soft-tanh)

📐 Softermax (Competitive)

Definition (Softermax): $$\text{softermax}_n(x_k, \{x_i\}) = \frac{x_k^n}{\epsilon + \sum_i x_i^n}$$
  • ✅ No exponentials — numerically stable for large inputs
  • ✅ Power $n$ controls sharpness (like an inverse temperature: higher $n$ means sharper competition)
  • ✅ Direct, interpretable mapping from raw scores to normalized weights
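
A minimal NumPy sketch of this definition (the parameter names, the default $\epsilon$, and the example scores are my own assumptions, not fixed by the text):

```python
import numpy as np

def softermax(x, n=1.0, eps=1e-8):
    """Softermax over a vector of non-negative scores.

    x   : 1-D array of non-negative scores
    n   : power controlling sharpness (higher n -> harder competition)
    eps : small constant keeping the denominator away from zero
    """
    xn = np.power(x, n)
    return xn / (eps + xn.sum())

scores = np.array([1.0, 2.0, 3.0])
print(softermax(scores, n=1))   # ≈ [0.167, 0.333, 0.5]    soft sharing
print(softermax(scores, n=4))   # ≈ [0.010, 0.163, 0.827]  sharper competition
```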

📐 Soft-Sigmoid (Individualistic)

Definition (Soft-Sigmoid): $$\text{soft-sigmoid}_n(x) = \frac{x^n}{1 + x^n}$$
  • ✅ Maps non-negative inputs to $[0, 1)$
  • ✅ $f(0) = 0$ (unlike standard sigmoid where $f(0) = 0.5$)
  • ✅ Power $n$ controls transition steepness
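
A corresponding sketch (again with assumed parameter names and example inputs):

```python
import numpy as np

def soft_sigmoid(x, n=1.0):
    """Soft-sigmoid: x^n / (1 + x^n), mapping non-negative x into [0, 1) with f(0) = 0."""
    xn = np.power(x, n)
    return xn / (1.0 + xn)

x = np.array([0.0, 1.0, 3.0, 100.0])
print(soft_sigmoid(x))        # ≈ [0., 0.5, 0.75, 0.990]
print(soft_sigmoid(x, n=4))   # ≈ [0., 0.5, 0.988, 1.0]  steeper transition around x = 1
```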

📐 Soft-Tanh (Individualistic)

Definition (Soft-Tanh): $$\text{soft-tanh}_n(x) = \frac{x^n - 1}{1 + x^n} = 2 \cdot \text{soft-sigmoid}_n(x) - 1$$
  • ✅ Maps non-negative inputs to $[-1, 1)$
  • ✅ $f(0) = -1$, $f(1) = 0$, and $f(x) \to 1$ as $x \to \infty$
  • ✅ Useful when centered output is needed
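
And a sketch of soft-tanh written directly from the definition (example values are assumed):

```python
import numpy as np

def soft_tanh(x, n=1.0):
    """Soft-tanh: (x^n - 1) / (1 + x^n), i.e. 2 * soft-sigmoid_n(x) - 1, into [-1, 1)."""
    xn = np.power(x, n)
    return (xn - 1.0) / (1.0 + xn)

x = np.array([0.0, 1.0, 100.0])
print(soft_tanh(x))   # ≈ [-1., 0., 0.980]  (f(0) = -1, f(1) = 0, approaches 1 as x grows)
```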

🎯 When to Use Each

| Function | Type | Output Range | Best For |
|---|---|---|---|
| Softermax | Competitive | $[0, 1]$, sums to ~1 | Attention weights, class probabilities |
| Soft-Sigmoid | Individual | $[0, 1)$ | Gates, per-neuron confidence |
| Soft-Tanh | Individual | $[-1, 1)$ | Centered outputs, residual modulation |
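
To make the competitive-versus-individual distinction in the table concrete, here is a small side-by-side run (definitions repeated from the sketches above; the scores are illustrative):

```python
import numpy as np

def softermax(x, n=1.0, eps=1e-8):
    xn = np.power(x, n)
    return xn / (eps + xn.sum())

def soft_sigmoid(x, n=1.0):
    xn = np.power(x, n)
    return xn / (1.0 + xn)

def soft_tanh(x, n=1.0):
    return 2.0 * soft_sigmoid(x, n) - 1.0

scores = np.array([0.0, 1.0, 3.0])

# Competitive: the entries share a budget that sums to ~1
print(softermax(scores))     # ≈ [0., 0.25, 0.75]

# Individual: each entry is squashed on its own, independent of the others
print(soft_sigmoid(scores))  # ≈ [0., 0.5, 0.75]
print(soft_tanh(scores))     # ≈ [-1., 0., 0.5]
```
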
💡 Design Philosophy: These functions are purpose-built for -product scores. The power parameter $n$ acts like a "gravitational potential slope" — controlling how sharply neurons compete for territory.