Fast Emulator: Design Rationale

This page explains the key design decisions behind the BioSNICAR emulator architecture.

Why an emulator at all?

The forward model takes ~50 ms per evaluation. At this speed:

  • An MCMC posterior exploration (160,000 evaluations) takes ~2.2 hours
  • A 5-dimensional parameter sweep at modest resolution takes days
  • Real-time interactive exploration is impossible

The emulator reduces evaluation time to ~1 µs, making all of these practical.
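The arithmetic behind these figures is easy to check. A minimal sketch, using the evaluation budget and per-call timings quoted above:

```python
# Back-of-envelope cost comparison: forward model vs. emulator.
N_EVALS = 160_000      # MCMC evaluation budget from the text
FORWARD_S = 50e-3      # ~50 ms per forward-model call
EMULATOR_S = 1e-6      # ~1 us per emulator call

forward_hours = N_EVALS * FORWARD_S / 3600
emulator_seconds = N_EVALS * EMULATOR_S

print(f"forward model: {forward_hours:.1f} h")    # ~2.2 h
print(f"emulator:      {emulator_seconds:.2f} s") # ~0.16 s
```

The same MCMC run drops from hours to a fraction of a second, which is what makes interactive exploration feasible.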

Why MLP over alternatives?

  • Gaussian Process: O(n³) training scaling is impractical for 5000+ training points, and ~ms prediction is too slow for MCMC.
  • RBF interpolation: dense O(n²) memory and O(n³) solve; stores all training points.
  • Polynomial regression: cannot capture nonlinear spectral features such as absorption bands.
  • Random forest / XGBoost: step-function approximation; not differentiable, which breaks gradient-based optimisers.

The MLP wins on every count: O(n) training scaling, microsecond inference, smooth differentiable output, compact storage (~100 KB), and pure-numpy inference.

Why scikit-learn for training, numpy for inference?

  • Build time uses sklearn.neural_network.MLPRegressor — it handles backprop, Adam, early stopping, and train/validation splits correctly.
  • Inference extracts weights as numpy arrays. The forward pass is ~4 matrix multiplications. No sklearn import needed.

Users who load a pre-built .npz file never need scikit-learn. Deployment environments stay lightweight.
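A minimal sketch of what that pure-numpy forward pass looks like. The layer sizes, weight names (`W1`, `b1`, …), and random values here are illustrative assumptions, standing in for arrays extracted from a trained `MLPRegressor`; only the structure (two hidden layers plus a PCA back-projection, four matrix multiplications in total) reflects the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes only: 5 inputs, two hidden layers of 32, 10 PCA
# coefficients, 480-channel spectrum. Real weights come from training.
W1, b1 = rng.normal(size=(5, 32)),  np.zeros(32)
W2, b2 = rng.normal(size=(32, 32)), np.zeros(32)
W3, b3 = rng.normal(size=(32, 10)), np.zeros(10)
components, mean = rng.normal(size=(10, 480)), np.zeros(480)  # PCA basis

def predict(x):
    """Pure-numpy forward pass: 3 MLP layers + PCA back-projection."""
    h = np.maximum(x @ W1 + b1, 0.0)   # ReLU, sklearn's default activation
    h = np.maximum(h @ W2 + b2, 0.0)
    coeffs = h @ W3 + b3               # ~10 PCA coefficients (linear output)
    return coeffs @ components + mean  # reconstruct the 480-element spectrum

spectrum = predict(np.ones(5))
print(spectrum.shape)  # (480,)
```

Nothing here imports scikit-learn, which is the point: once the weights are exported, inference needs only numpy.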

Why not PyTorch/TensorFlow? Each is a 500 MB+ dependency, overkill for a 3-layer MLP.

Why .npz, not pickle?

  • Version-independent — numpy arrays don’t break across Python/sklearn versions
  • No arbitrary code execution — safe to share (unlike pickle)
  • Human-inspectable — can be loaded and examined with np.load()
  • Compact — compressed numpy arrays, ~100–200 KB
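A round-trip sketch of the save/load path. The array names are hypothetical placeholders for emulator weights, and an in-memory buffer stands in for a `.npz` file on disk:

```python
import io
import numpy as np

# Hypothetical weight arrays standing in for a trained emulator.
weights = {
    "W1": np.zeros((5, 32)),
    "b1": np.zeros(32),
    "pca_components": np.zeros((10, 480)),
}

# Save compressed; a BytesIO buffer stands in for a .npz file on disk.
buf = io.BytesIO()
np.savez_compressed(buf, **weights)
buf.seek(0)

# np.load defaults to allow_pickle=False, so plain-array .npz files are
# opened without any possibility of code execution, and are easy to inspect.
with np.load(buf) as data:
    print(sorted(data.files))  # ['W1', 'b1', 'pca_components']
    restored = {k: data[k] for k in data.files}

assert restored["pca_components"].shape == (10, 480)
```

Because the archive is just named arrays, it survives Python and scikit-learn version changes that would break a pickle.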

Why PCA + MLP?

Albedo spectra produced by the forward model are effectively low-dimensional. Although the output is a 480-element vector, the spectra lie on a manifold described by 8–15 principal components, because ice absorption and scattering vary smoothly with wavelength. The neural network therefore predicts ~10 PCA coefficients rather than 480 spectral values, which regularises the output and prevents unphysical spectral artefacts.
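The low-dimensionality claim can be illustrated numerically. This sketch uses synthetic stand-ins for albedo spectra (sums of a few smooth basis shapes, an assumption mimicking smooth spectral variation) and computes PCA via numpy's SVD rather than scikit-learn, to stay dependency-light:

```python
import numpy as np

rng = np.random.default_rng(0)
wavelengths = np.linspace(0.0, 1.0, 480)

# Synthetic stand-ins for albedo spectra: random mixtures of 8 smooth
# basis shapes, mimicking smooth variation with wavelength.
n_spectra, n_modes = 500, 8
basis = np.array([np.cos(np.pi * k * wavelengths) for k in range(n_modes)])
spectra = rng.normal(size=(n_spectra, n_modes)) @ basis

# PCA via SVD of the centred data matrix.
centred = spectra - spectra.mean(axis=0)
s = np.linalg.svd(centred, compute_uv=False)
explained = np.cumsum(s**2) / np.sum(s**2)

# The first ~10 components explain essentially all of the variance,
# even though each spectrum has 480 channels.
print(explained[9])
```

On real forward-model output the cutoff would be found the same way: keep components until the cumulative explained variance crosses a chosen threshold.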

Why log-space sampling for impurities?

Impurity concentrations span orders of magnitude (0 to 500,000). With linear sampling in this range, fewer than 0.2% of points fall below 1000 — and the emulator fails for clean ice. Log₁₀(x+1) sampling distributes points evenly across orders of magnitude, ensuring the emulator works well at both low and high concentrations.
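The effect of the two sampling schemes can be demonstrated directly. A minimal sketch, using the 0–500,000 range and the 1000-concentration "clean ice" threshold from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
C_MAX = 500_000   # upper impurity concentration from the text
n = 100_000

# Linear sampling: almost nothing lands in the clean-ice regime.
linear = rng.uniform(0, C_MAX, n)
frac_linear = np.mean(linear < 1000)   # expected 1000/500000 = 0.2%

# log10(x + 1) sampling: uniform in log space, mapped back to concentration.
log_u = rng.uniform(0, np.log10(C_MAX + 1), n)
logsamp = 10**log_u - 1
frac_log = np.mean(logsamp < 1000)     # roughly half the points

print(f"linear: {frac_linear:.3%}, log: {frac_log:.1%}")
```

The +1 offset keeps x = 0 (perfectly clean ice) representable, since log₁₀(0) is undefined.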