GPs, BO, and the GOLLuM paper

The paper hinges on one idea: a Gaussian process is only as good as the geometry it sits on. This page builds the intuition layer by layer: it starts from what a GP is and what it does for Bayesian optimization, and ends with the move GOLLuM makes that turns an LLM into a calibrated optimizer for chemistry. Already comfortable with BO? There's a separate playground for that; this guide focuses on the GP and the representation question.

Tab 1 — A Gaussian process is a distribution over functions

Drop points on the plot. The GP fits a smooth function through them, and the shaded band tells you how confident it is. The kernel is the only knob: it answers the question of how similar two points' outputs should be, given how close their inputs are. This is the surrogate model BO uses to decide what to test next. Get this clear and the rest of the paper falls into place.

[Interactive plot. Legend: posterior mean, ±2σ uncertainty, observation, function samples.]
What you're seeing. A GP is a distribution over functions. Before any data, the kernel alone determines which functions are plausible, and the band is just prior uncertainty. Each observation collapses the band around it, as far as the kernel treats that observation as informative. At locations that are far from every observation relative to the length scale ℓ, the GP effectively forgets the data and uncertainty re-expands to the prior.
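To make the collapse and re-expansion concrete, here is a minimal NumPy sketch of the zero-mean GP posterior. It assumes a squared-exponential kernel and a tiny noise term for numerical stability; the function names (`rbf_kernel`, `gp_posterior`) and the toy data are illustrative, not taken from the page or the paper.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b (assumed kernel)."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, lengthscale=1.0, variance=1.0, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_test."""
    K = rbf_kernel(x_train, x_train, lengthscale, variance) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, lengthscale, variance)
    L = np.linalg.cholesky(K)
    # alpha = K^{-1} y, computed through the Cholesky factor
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    # Posterior variance = prior variance minus what the data explains
    v = np.linalg.solve(L, K_s)
    var = variance - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

# Hypothetical toy data: two observations, three query points
x_train = np.array([0.0, 1.0])
y_train = np.array([0.5, -0.3])
x_test = np.array([0.0, 0.5, 10.0])  # on a point, between points, far away
mean, std = gp_posterior(x_train, y_train, x_test, lengthscale=1.0)
print(mean, std)
```

The standard deviation is close to zero at the observed input, moderate between the two observations, and back near the prior value at x = 10, which is many length scales away from any data.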

Try this: click Show function samples with no data — those squiggles are draws from the prior. Add one point and they all pinch through it. Add a few more and the squiggles cluster tightly: the posterior is sharp. That tightening is the only thing BO needs — a place where the surrogate is confident, and a place where it isn't.
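The same experiment can be sketched offline. The snippet below draws functions from the prior and from the posterior after a single observation, assuming a squared-exponential kernel with unit length scale and unit variance; `sample_functions` and the grid are hypothetical names chosen for illustration.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b (assumed kernel)."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def sample_functions(x_grid, x_obs=None, y_obs=None, n_samples=5, jitter=1e-6, seed=0):
    """Draw function samples on x_grid from the prior, or from the posterior if data is given."""
    rng = np.random.default_rng(seed)
    K_ss = rbf_kernel(x_grid, x_grid)
    if x_obs is None or len(x_obs) == 0:
        mean, cov = np.zeros(len(x_grid)), K_ss
    else:
        K = rbf_kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
        K_s = rbf_kernel(x_obs, x_grid)
        solved = np.linalg.solve(K, K_s)  # K^{-1} K_s
        mean = solved.T @ y_obs
        cov = K_ss - K_s.T @ solved
    # Jitter keeps the covariance numerically positive definite for Cholesky
    L = np.linalg.cholesky(cov + jitter * np.eye(len(x_grid)))
    z = rng.standard_normal((len(x_grid), n_samples))
    return (mean[:, None] + L @ z).T  # shape: (n_samples, len(x_grid))

x_grid = np.linspace(-3.0, 3.0, 61)
prior_draws = sample_functions(x_grid)                     # no data: draws from the prior
x_obs, y_obs = x_grid[[30]], np.array([0.7])               # one observation on the grid
post_draws = sample_functions(x_grid, x_obs, y_obs)        # every draw pinches through it
print(np.ptp(prior_draws[:, 30]), np.ptp(post_draws[:, 30]))
```

At the observed grid point the prior draws are spread out, while the posterior draws all sit within a small tolerance of the observed value; away from it they fan out again.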
Kernel prior
Default in GOLLuM. Smooth but not infinitely so — the right amount of smoothness for most physical objectives.
Length scale ℓ: how far similarity reaches. Small ℓ → the GP only trusts very close neighbors. Large ℓ → it generalises across the whole space.
Output scale: how much the function is allowed to vary. It sets the height of the prior band.
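A small sketch of how the two sliders act on the kernel itself. The kernel is not named in this text, so the code assumes a Matérn 5/2 kernel, a common choice that is smooth but not infinitely so; `matern52` and the chosen values are illustrative only.

```python
import numpy as np

def matern52(r, lengthscale=1.0, variance=1.0):
    """Matern 5/2 covariance at input distance r (assumed kernel, not confirmed by the page)."""
    s = np.sqrt(5.0) * np.abs(r) / lengthscale
    return variance * (1.0 + s + s**2 / 3.0) * np.exp(-s)

# Length scale: correlation between two inputs one unit apart
corr_short = matern52(1.0, lengthscale=0.2)  # small l: nearly uncorrelated
corr_long = matern52(1.0, lengthscale=2.0)   # large l: strongly correlated

# Variance: prior band half-height is 2 * sqrt(k(0)) = 2 * sqrt(variance)
band_low = 2.0 * np.sqrt(matern52(0.0, variance=1.0))
band_high = 2.0 * np.sqrt(matern52(0.0, variance=4.0))

print(corr_short, corr_long, band_low, band_high)
```

With the small length scale the two inputs barely inform each other; with the large one they are strongly correlated. Quadrupling the variance doubles the height of the ±2σ prior band without changing how far similarity reaches.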
Why this matters for BO
BO picks the next experiment by maximising an acquisition function such as μ + κσ (or similar), where κ sets how much uncertainty is rewarded. If the GP's μ and σ are wrong, BO is wrong. Calibrated σ is the whole game: overconfident → BO ignores promising regions; underconfident → BO never converges.
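The μ + κσ rule is short enough to write out. The sketch below scores a handful of candidates and shows how an overconfident σ changes the pick; the candidate values, `ucb`, and `pick_next` are made up for illustration, and shrinking σ by a factor of ten is just one hypothetical way to mimic overconfidence.

```python
import numpy as np

def ucb(mu, sigma, kappa=2.0):
    """Upper-confidence-bound score: predicted value plus an exploration bonus."""
    return mu + kappa * sigma

def pick_next(candidates, mu, sigma, kappa=2.0):
    """Return the candidate with the highest UCB score."""
    return candidates[int(np.argmax(ucb(mu, sigma, kappa)))]

# Hypothetical surrogate output at four candidate experiments
candidates = np.array([0.0, 1.0, 2.0, 3.0])
mu = np.array([0.8, 0.9, 0.2, 0.5])
sigma = np.array([0.05, 0.05, 0.6, 0.1])  # candidate 2.0 is poorly explored

calibrated_pick = pick_next(candidates, mu, sigma)            # explores the uncertain region
overconfident_pick = pick_next(candidates, mu, 0.1 * sigma)   # sigma too small: pure exploitation
print(calibrated_pick, overconfident_pick)
```

With the original σ the uncertain candidate wins because its exploration bonus outweighs its low mean. With σ shrunk, the rule falls back to the highest-mean candidate and the unexplored region is never tested.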
Pin this: a GP turns observations into a calibrated surrogate μ(x), σ(x). The kernel is its only assumption about the world, and that assumption is implicitly geometric: two inputs with small distance should have similar outputs. The whole GOLLuM paper is an argument that picking a representation where this assumption actually holds matters more than predictive accuracy. Tab 2 shows why.