Architecture
Layout
trnrand/
├── trnrand/
│   ├── __init__.py         # Re-exports all RNG operations
│   ├── generator.py        # Generator class, seeding, state management
│   ├── distributions.py    # uniform, normal, exponential, bernoulli, etc.
│   ├── quasi.py            # sobol, halton, latin_hypercube
│   └── nki/
│       ├── __init__.py     # Backend dispatch (set_backend / HAS_NKI)
│       └── dispatch.py     # Philox kernel scaffold for on-device RNG
├── tests/
├── examples/mc_integration.py
└── benchmarks/
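The backend dispatch noted for `nki/__init__.py` might follow a pattern like the one below. This is a hypothetical sketch, not trnrand's actual code: only the names `set_backend` and `HAS_NKI` come from the layout above. The backend strings (`"auto"`, `"cpu"`, `"nki"`), the helper `get_backend`, and the probe for a `neuronxcc.nki` module are assumptions made for illustration.

```python
import importlib.util


def _probe_nki():
    """Return True if an NKI module can be located (module path is an assumption)."""
    try:
        return importlib.util.find_spec("neuronxcc.nki") is not None
    except ImportError:
        # The parent package is missing or failed to import.
        return False


HAS_NKI = _probe_nki()
_VALID = ("auto", "cpu", "nki")  # hypothetical backend names
_backend = "auto"


def set_backend(name):
    """Select the RNG backend and reject requests that cannot be honoured."""
    global _backend
    if name not in _VALID:
        raise ValueError(f"unknown backend {name!r}; expected one of {_VALID}")
    if name == "nki" and not HAS_NKI:
        raise RuntimeError("NKI backend requested but NKI is not available")
    _backend = name


def get_backend():
    """Resolve 'auto' to a concrete backend at call time."""
    if _backend == "auto":
        return "nki" if HAS_NKI else "cpu"
    return _backend
```

Resolving `"auto"` lazily in `get_backend` keeps the module importable on hosts without the Neuron toolchain, which is presumably why the availability flag is exposed separately.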
Use cases across the suite
| Use case | trnrand function | Consumer |
|---|---|---|
| Noise injection (speech training) | normal() | trnfft |
| Stochastic trace estimation | normal(), sobol() | trnsolver |
| Weight initialization | truncated_normal() | trnfft/nn.py |
| Monte Carlo integration | sobol(), halton() | trnblas (DF-MP2) |
| Hyperparameter sweeps | sobol() | Ablation studies |
| Data augmentation | uniform(), bernoulli() | General |
NKI strategy
The Philox 4×32 counter-based RNG maps cleanly to Trainium:
- The GpSimd engine runs the integer multiply-XOR rounds (the Tensor Engine would be wasted on this).
- Parallel generation: each tile gets a disjoint counter range, so no cross-tile coordination is required.
- Deterministic: `(counter, key) → output`, so there is no state to synchronize across cores.
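Philox is preferred over Mersenne Twister precisely because it is counter-based: there is no sequential state to advance, so it parallelizes trivially. The same generator ships in cuRAND (as Philox4_32_10), and JAX's default Threefry generator is a counter-based design from the same family.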
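The counter-based structure can be illustrated with a small host-side sketch of Philox 4×32 with 10 rounds. The multipliers and key increments are the standard Philox4x32 constants. The function names `philox4x32_10` and `tile_words`, and the way a tile index is mapped to a counter range, are assumptions for illustration and are not the repository's `philox4x32_reference`.

```python
MASK32 = 0xFFFFFFFF
M0, M1 = 0xD2511F53, 0xCD9E8D57   # Philox4x32 round multipliers
W0, W1 = 0x9E3779B9, 0xBB67AE85   # Weyl increments used to bump the key


def philox4x32_10(counter, key, rounds=10):
    """Map a 4-word counter and 2-word key to four 32-bit output words."""
    c0, c1, c2, c3 = counter
    k0, k1 = key
    for r in range(rounds):
        if r > 0:
            # The key is bumped between rounds, not before the first one.
            k0 = (k0 + W0) & MASK32
            k1 = (k1 + W1) & MASK32
        p0 = M0 * c0                      # 32x32 -> 64-bit products
        p1 = M1 * c2
        hi0, lo0 = p0 >> 32, p0 & MASK32
        hi1, lo1 = p1 >> 32, p1 & MASK32
        c0, c1, c2, c3 = hi1 ^ c1 ^ k0, lo1, hi0 ^ c3 ^ k1, lo0
    return (c0, c1, c2, c3)


def tile_words(tile_index, blocks_per_tile, key):
    """Generate one tile's words from its own disjoint counter range."""
    start = tile_index * blocks_per_tile
    words = []
    for i in range(start, start + blocks_per_tile):
        counter = (i & MASK32, (i >> 32) & MASK32, 0, 0)
        words.extend(philox4x32_10(counter, key))
    return words


key = (0xDEADBEEF, 0x12345678)
tile0 = tile_words(0, 4, key)
tile1 = tile_words(1, 4, key)
# Tiles computed independently concatenate to the same stream as one big tile.
assert tile0 + tile1 == tile_words(0, 8, key)
```

Because each output depends only on `(counter, key)`, tiles can be generated in any order, or on different cores, and still reproduce the same stream.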
Box-Muller for normal()
The on-device normal path is a Box-Muller transform layered on the Philox uniform stream:
- Pairs of uniforms `(u1, u2)` → standard-normal pairs `(z1, z2)` via `r = √(-2 ln u1)`, `θ = 2π u2`, `z1 = r cos θ`, `z2 = r sin θ`.
- Runs on the Vector Engine, which has hardware `cos`/`sin`/`log`/`sqrt`.
- Box-Muller is preferred over Marsaglia polar here: Marsaglia avoids the trig calls but uses rejection sampling, which serializes branch-divergent lanes and kills SIMD throughput. Box-Muller has constant work per pair.
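The transform above can be sketched on the host in a few lines. This is a minimal illustration, not the NKI kernel: Python's `random.Random` stands in for the Philox uniform stream, and the helper names `box_muller` and `normal_samples` are hypothetical. Shifting `u1` into (0, 1] is one common way to keep `ln u1` finite; the on-device path may handle that edge differently.

```python
import math
import random


def box_muller(u1, u2):
    """Map u1 in (0, 1] and u2 in [0, 1) to two independent N(0, 1) samples."""
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)


def normal_samples(n_pairs, seed=0):
    """Draw 2 * n_pairs standard-normal samples with constant work per pair."""
    rng = random.Random(seed)  # host stand-in for the Philox uniform stream
    out = []
    for _ in range(n_pairs):
        u1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so log(u1) is finite
        u2 = rng.random()
        out.extend(box_muller(u1, u2))
    return out


samples = normal_samples(10_000)
mean = sum(samples) / len(samples)
var = sum((z - mean) ** 2 for z in samples) / len(samples)
```

Every pair costs exactly one `log`, one `sqrt`, one `cos`, and one `sin`, with no data-dependent branch, which is the property that makes the transform SIMD-friendly.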
Known gaps
- NKI Philox kernel awaits on-hardware validation. The CPU reference (`philox4x32_reference`, `philox_uniform_cpu`) is the conformance oracle; see `tests/test_nki_philox.py::TestPhiloxNKI`. Tracked as #1.
- Box-Muller kernel awaits on-hardware validation. Tracked as #2; same CPU-reference conformance pattern.
- Halton degrades above ~20 dimensions, a known algorithmic limitation. Sobol is preferred for `d > 10`.
- Quasi-random sequences are host-only. NKI scrambling for Sobol/Halton is a v0.3 follow-up.
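The Halton limitation can be made concrete with a plain, unscrambled Halton sketch. The helper names below are hypothetical and this is not trnrand's `halton()`, which may differ (for example by scrambling). For indices smaller than the base, the radical inverse is simply `i / base`, so two neighbouring high dimensions start out on a straight line.

```python
def radical_inverse(i, base):
    """Van der Corput radical inverse of integer i in the given base."""
    inv, f = 0.0, 1.0 / base
    while i > 0:
        inv += (i % base) * f
        i //= base
        f /= base
    return inv


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def halton_pair_correlation(base_a, base_b, n=64):
    """Correlation between two Halton dimensions over the first n points."""
    idx = range(1, n + 1)
    xs = [radical_inverse(i, base_a) for i in idx]
    ys = [radical_inverse(i, base_b) for i in idx]
    return pearson(xs, ys)


low = halton_pair_correlation(2, 3)     # dimensions 1 and 2
high = halton_pair_correlation(71, 73)  # dimensions 20 and 21 (20th, 21st primes)
```

With 64 points, `high` comes out essentially equal to 1 because every point satisfies `x = i/71`, `y = i/73`, while `low` should be far smaller in magnitude. The collinearity only breaks up once the index exceeds the base, which takes longer as the dimension grows.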