# Design Rationale
This page explains the key design decisions behind the BioSNICAR emulator architecture.
## Why an emulator at all?
The forward model takes ~50 ms per evaluation. At this speed:
- An MCMC posterior exploration (160,000 evaluations) takes ~2.2 hours
- A 5-dimensional parameter sweep at modest resolution takes days
- Real-time interactive exploration is impossible
The emulator reduces evaluation time to ~1 µs, making all of these practical.
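The arithmetic behind these figures is easy to check (numbers taken directly from the list above):

```python
# Back-of-envelope timings from the figures above.
forward_ms = 50        # forward model: ms per evaluation
emulator_us = 1        # emulator: microseconds per evaluation
n_mcmc = 160_000       # MCMC posterior evaluations

hours_forward = n_mcmc * forward_ms / 1000 / 3600   # ≈ 2.2 h
seconds_emulated = n_mcmc * emulator_us / 1e6       # ≈ 0.16 s

print(f"forward model: {hours_forward:.1f} h")
print(f"emulator:      {seconds_emulated:.2f} s")
```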
## Why MLP over alternatives?
| Approach | Why not |
|---|---|
| Gaussian Process | O(n³) scaling. Impractical for 5000+ training points. Prediction ~ms (too slow for MCMC). |
| RBF interpolation | Dense O(n²) memory, O(n³) solve. Stores all training points. |
| Polynomial regression | Cannot capture nonlinear spectral features (absorption bands). |
| Random forest / XGBoost | Step-function approximation. Not differentiable — breaks gradient-based optimisers. |
The MLP wins on all counts: training cost that scales linearly with the number of training points, microsecond inference, smooth differentiable output, compact storage (~100 KB), and pure-numpy inference.
## Why scikit-learn for training, numpy for inference?
- Build time uses `sklearn.neural_network.MLPRegressor`, which handles backprop, Adam, early stopping, and train/validation splits correctly.
- Inference extracts the weights as numpy arrays. The forward pass is ~4 matrix multiplications; no sklearn import is needed.
Users who load a pre-built .npz file never need scikit-learn. Deployment environments stay lightweight.
Why not PyTorch/TensorFlow? They’re ~500 MB+ dependencies, overkill for a 3-layer MLP.
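A minimal sketch of this split, using toy data and illustrative layer sizes (the real emulator's shapes and hyperparameters may differ). Training uses `MLPRegressor`; inference uses only the extracted `coefs_` and `intercepts_` arrays:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy stand-in for (input parameters -> PCA coefficients); shapes illustrative.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 5))
y = np.sin(X @ rng.normal(size=(5, 10)))  # 10 hypothetical output coefficients

# Build time: scikit-learn handles the training loop.
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                     early_stopping=True, random_state=0).fit(X, y)

# Extract the learned weights as plain numpy arrays.
weights = model.coefs_        # list of weight matrices
biases = model.intercepts_    # list of bias vectors

def predict_numpy(x, weights, biases):
    """Pure-numpy forward pass: ReLU hidden layers, linear output."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.maximum(a @ W + b, 0.0)    # ReLU
    return a @ weights[-1] + biases[-1]   # identity output layer

# The two paths agree to floating-point precision.
x = X[:3]
assert np.allclose(model.predict(x), predict_numpy(x, weights, biases))
```

The forward pass is just a handful of matrix multiplications, which is why deployment environments can drop the sklearn dependency entirely.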
## Why .npz, not pickle?
- Version-independent — numpy arrays don’t break across Python/sklearn versions
- No arbitrary code execution — safe to share (unlike pickle)
- Human-inspectable — can be loaded and examined with
np.load() - Compact — compressed numpy arrays, ~100–200 KB
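A sketch of the round trip. The array names and layer shapes below are hypothetical; the real file's keys may differ:

```python
import numpy as np

# Hypothetical layer shapes for a small 3-layer MLP emulator.
W1, b1 = np.zeros((5, 64)), np.zeros(64)
W2, b2 = np.zeros((64, 64)), np.zeros(64)
W3, b3 = np.zeros((64, 10)), np.zeros(10)

# Save: compressed, version-stable, no code execution on load.
np.savez_compressed("emulator.npz", W1=W1, b1=b1, W2=W2, b2=b2, W3=W3, b3=b3)

# Load and inspect: works in any environment with numpy alone.
data = np.load("emulator.npz")
print(sorted(data.files))   # array names stored in the file
print(data["W1"].shape)     # (5, 64)
```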
## Why PCA + MLP?
Albedo spectra produced by the forward model are effectively low-dimensional. Although the output is a 480-element vector, the spectra lie on a manifold described by 8–15 principal components, because ice absorption and scattering vary smoothly with wavelength. The neural network therefore predicts ~10 PCA coefficients rather than 480 spectral values, which regularises the output and prevents unphysical spectral artefacts.
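The compression step can be illustrated with toy spectra. The curves, wavelength grid, and component count below are illustrative, not the real training set:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy "spectra": smooth 480-point curves on a low-dimensional manifold.
rng = np.random.default_rng(0)
wav = np.linspace(0.2, 5.0, 480)
params = rng.uniform(size=(200, 3))
spectra = np.array([p[0] * np.exp(-p[1] * wav) + p[2] for p in params])

# Compress 480 spectral values down to ~10 PCA coefficients.
pca = PCA(n_components=10).fit(spectra)
coeffs = pca.transform(spectra)     # shape (200, 10): the MLP's target

# Reconstructing from 10 coefficients recovers the spectrum almost exactly,
# because the curves vary smoothly with the underlying parameters.
recon = pca.inverse_transform(coeffs)
print(np.abs(recon - spectra).max())
```

Because every predicted spectrum is a linear combination of a few smooth basis vectors, the output cannot contain high-frequency unphysical wiggles.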
## Why log-space sampling for impurities?
Impurity concentrations span orders of magnitude (0 to 500,000). With linear sampling over this range, only ~0.2% of points fall below 1000, so the emulator sees almost no clean-ice cases and fails for them. Log₁₀(x+1) sampling distributes points evenly across orders of magnitude, ensuring the emulator works well at both low and high concentrations.
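A quick numerical check of this claim. The concentration range comes from the text; the sample size is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
c_max = 500_000
n = 10_000

# Linear sampling: almost no points below 1000 (expect a fraction near 0.002).
linear = rng.uniform(0, c_max, n)
print((linear < 1000).mean())

# log10(x + 1) sampling: draw uniformly in log space, then invert.
# Roughly half the points land below 1000, covering the clean-ice regime.
logged = 10 ** rng.uniform(0, np.log10(c_max + 1), n) - 1
print((logged < 1000).mean())
```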