GPs, BO, and the GOLLuM paper

The paper hinges on one idea: a Gaussian process is only as good as the geometry it sits on. This page builds the intuition layer by layer, starting from what a GP is and what it does for Bayesian optimization, and ending with the move GOLLuM makes that turns an LLM into a calibrated optimizer for chemistry. Already comfortable with BO? There's a separate playground for that; this guide focuses on the GP and the representation question.

Tab 1 — A Gaussian process is a distribution over functions

Drop points on the plot. The GP fits a smooth function through them, and the shaded band tells you how confident it is. The kernel is the only knob: it encodes how similar two points' outputs should be, given how close their inputs are. This is the surrogate model BO uses to decide what to test next. Get this clear and the rest of the paper falls into place.

Plot legend: posterior mean · ±2σ uncertainty · observation · function samples

What you're seeing. A GP is a distribution over functions. Before any data, the kernel alone decides which functions are plausible, and the band is just prior uncertainty. Each observation collapses the band around it, out to a distance set by the lengthscale ℓ. Where a location is far from every observation compared to ℓ, the GP effectively forgets the data and uncertainty re-expands.

Try this: click Show function samples with no data — those squiggles are draws from the prior. Add one point and they all pinch through it. Add a few more and the squiggles cluster tightly: the posterior is sharp. That tightening is the only thing BO needs — a place where the surrogate is confident, and a place where it isn't.
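The pinching and re-expanding of the band can be reproduced outside the widget. The following is a minimal NumPy sketch of a zero-mean GP posterior with a Matérn-5/2 kernel, using the slider defaults shown on this page (ℓ = 0.15, σ² = 0.40). Treating the 0.005 observation noise as a variance is an assumption, and the function names and data points are illustrative, not taken from the paper.

```python
import numpy as np

def matern52(a, b, lengthscale=0.15, signal_var=0.40):
    """Matérn-5/2 kernel between 1D input arrays a (n,) and b (m,)."""
    s = np.sqrt(5.0) * np.abs(a[:, None] - b[None, :]) / lengthscale
    return signal_var * (1.0 + s + s**2 / 3.0) * np.exp(-s)

def gp_posterior(x_train, y_train, x_test, noise=0.005):
    """Posterior mean and standard deviation of a zero-mean GP.

    `noise` is treated as a noise variance (an assumption).
    """
    K = matern52(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = matern52(x_train, x_test)
    prior_var = matern52(x_test, x_test).diagonal()
    L = np.linalg.cholesky(K)                      # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha
    var = prior_var - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

# Two made-up observations; query at an observed point, between the
# observations, and far away from both.
x_train = np.array([0.2, 0.5])
y_train = np.array([0.3, -0.1])
x_test = np.array([0.2, 0.35, 0.9])
mean, std = gp_posterior(x_train, y_train, x_test)
print("mean:", mean)
print("std: ", std)
```

The standard deviation is smallest at the observed input, larger between the observations, and close to the prior value √σ² far from both, which is the behaviour the band shows.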


Kernel prior

type: the GOLLuM default (the Matérn-5/2 kernel named in Tab 3). Smooth but not infinitely so, which is the right amount of smoothness for most physical objectives.

lengthscale ℓ (slider starts at 0.15): how far similarity reaches. Small ℓ → the GP only trusts very close neighbors. Large ℓ → it generalises across the whole space.

signal variance σ² (slider starts at 0.40): how much the function is allowed to vary. The height of the prior band.

observation noise (slider starts at 0.005)

Why this matters for BO

BO picks the next experiment by maximising μ + κσ (or similar). If the GP's μ and σ are wrong, BO is wrong. Calibrated σ is the whole game — overconfident → BO ignores promising regions; underconfident → BO never converges.
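As a small illustration of the μ + κσ rule, here is a sketch with made-up posterior values for three candidates. The numbers and the helper name `ucb` are hypothetical; κ = 2 is just a common illustrative choice.

```python
import numpy as np

def ucb(mean, std, kappa=2.0):
    """Upper confidence bound: exploit high mean, explore high uncertainty."""
    return mean + kappa * std

# Made-up posterior for three candidates: a safe one, a middling one,
# and a poorly explored one with a low mean but wide uncertainty.
mean = np.array([0.50, 0.45, 0.20])
std = np.array([0.02, 0.10, 0.30])

next_idx = int(np.argmax(ucb(mean, std, kappa=2.0)))        # uncertain candidate wins
greedy_idx = int(np.argmax(ucb(mean, std, kappa=0.0)))      # pure exploitation
overconfident_idx = int(np.argmax(ucb(mean, 0.1 * std)))    # σ shrunk tenfold
print(next_idx, greedy_idx, overconfident_idx)
```

Shrinking σ by a factor of ten flips the choice from the unexplored candidate back to the safe one, which is the overconfidence failure described above.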

Pin this: a GP turns observations into a calibrated surrogate μ(x), σ(x). The kernel is its only assumption about the world, and that assumption is implicitly geometric: two inputs with small distance should have similar outputs. The whole GOLLuM paper is an argument that picking a representation where this assumption actually holds matters more than predictive accuracy. Tab 2 shows why.

Tab 2 — A GP is only as good as its embedding space

Same 20 experiments, same yields, same kernel. Two different representations. In the raw view, the input dimension is something brittle (e.g. an arbitrary parameter index). In the learned view, the LLM has been finetuned through the GP marginal likelihood — points with similar yields end up close. Slide between them. Watch ℓ/d̄ and the GP fit quality respond.

The geometric metric the paper introduces: ℓ / d̄, where ℓ = GP lengthscale and d̄ = mean pairwise distance between embeddings. Higher = the GP can correlate broadly across the space (smooth, GP-friendly). Lower = neighbours look unrelated, and the GP can't generalise. The paper finds this ratio correlates with BO success at r = 0.92, better than any predictive accuracy metric.
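A minimal sketch of how the ratio could be computed, assuming Euclidean distances between embeddings and a lengthscale already fitted elsewhere. The paper's exact preprocessing (normalisation, per-dimension lengthscales) is not specified on this page, so treat the helper names and the toy embeddings as illustrative.

```python
import numpy as np

def mean_pairwise_distance(X):
    """Mean Euclidean distance over all unordered pairs of rows in X (n, d)."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs**2).sum(-1))
    upper = np.triu_indices(len(X), k=1)   # each pair counted once
    return dists[upper].mean()

def geometry_ratio(lengthscale, X):
    """The ℓ / d̄ ratio for a fitted lengthscale and an embedding matrix."""
    return lengthscale / mean_pairwise_distance(X)

# Three toy 2D embeddings with pairwise distances 5, 5 and 10.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
d_bar = mean_pairwise_distance(X)
ratio = geometry_ratio(2.0, X)
print(d_bar, ratio)
```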

Representation slider: how organised is the embedding space? (raw → learned, starts at 0%)

0 = static LLM embeddings (no objective alignment). 100 = GOLLuM's finetuned space, where similar-yield points cluster.

Two panels, each reporting ℓ/d̄, GP fit RMSE, and BO Top-5% (50 iters): the raw embedding space (e.g. fixed LLM features) and the learned space (GOLLuM, GP-finetuned LLM).

What the paper found. Across 14 different representations of the Buchwald-Hartwig benchmark — molecular fingerprints, static embeddings from T5, OpenAI, Qwen, ModernBERT — the geometric ratio ℓ/d̄ predicted optimization performance with r = 0.92. R² of the surrogate (how well it predicts held-out yield) only correlated at r = 0.78. This is counterintuitive but important: a GP that's a worse predictor but on a smoother representation beats a GP with higher accuracy on a jagged one.

Why. The acquisition function isn't asking "how accurate is μ?" — it's asking "where should I look next?" That decision needs extrapolation, and extrapolation only works if the kernel's smoothness assumption matches the space. Static LLM embeddings cluster by textual similarity — but two reactions with nearly identical SMILES can have wildly different yields. Static embeddings put them next to each other. The GP gets garbage in, gives garbage out.

The receipt: ℓ/d̄ vs BO performance across 14 representations

Reconstruction of Figure 3c from the paper. Each dot is one representation (a fingerprint or LLM embedding family) evaluated on the Buchwald-Hartwig benchmark. The x-axis is the geometric ratio ℓ/d̄ measured on that representation. The y-axis is Top-5% coverage after 50 BO iterations. Hover for names. The line is a least-squares fit; the Pearson correlation is computed live.

n representations: 14. The paper reports r = 0.92 for ℓ/d̄ vs BO performance and r = 0.78 for surrogate R² vs BO performance.
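For readers who want to redo the fit offline, this is a sketch of the two quantities the plot reports: the Pearson correlation and a least-squares line. The arrays below are made-up placeholders, not the paper's 14 data points, and `pearson_and_fit` is a hypothetical helper name.

```python
import numpy as np

def pearson_and_fit(x, y):
    """Pearson r plus slope and intercept of the least-squares line y ≈ slope·x + intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    slope, intercept = np.polyfit(x, y, 1)
    return r, slope, intercept

# Placeholder values only: one (ℓ/d̄, Top-5% coverage) pair per representation.
ratio = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
coverage = np.array([0.10, 0.14, 0.22, 0.24, 0.31])
r, slope, intercept = pearson_and_fit(ratio, coverage)
print(f"r = {r:.3f}, R² of the line = {r**2:.3f}")
```

For a straight-line fit with an intercept, the R² of the fit equals r², so the two readouts carry the same information.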

The shape of this plot is the paper's central empirical claim. A simple geometric property of the embedding space predicts how good the optimizer will be, more reliably than how well the surrogate fits the data. Chemistry-specialized embeddings (DRFP, T5Chem-SMILES) sit top-right because chemists designed them with the right kind of locality. Most off-the-shelf LLM embeddings sit bottom-left: textual similarity does not align with reaction yield. The whole point of GOLLuM is to drag a generic LLM up and to the right by finetuning it through the GP marginal likelihood.

The paper's central diagnosis: static LLM embeddings fail at BO not because the model lacks chemistry knowledge — chemistry-pretrained T5Chem also fails — but because the embedding geometry isn't aligned with the objective. Tab 3 shows the fix: stop freezing the LLM. Let it learn.

Tab 3 — Make the LLM trainable, supervise it through the GP

Standard deep kernel learning: replace the GP's input x with g_φ(x), a neural feature extractor. GOLLuM's twist: g_φ is a pretrained large language model, and φ is updated through the GP's marginal likelihood (often via LoRA, so updates are cheap). The result: the LLM's embedding space gets reshaped to match what the GP needs.

The architecture

Text prompt (input): "Solvent: DMSO, Additive: EtISOX, Reactant: EtPhBr"

→ LLM (φ): T5 / Qwen / ModernBERT, LoRA-trainable (trains)

→ embedding x ∈ ℝᵈ, plus an optional linear projection P (trains)

→ GP (θ): Matérn-5/2 kernel, outputs μ(x), σ²(x) (trains)

↑ The gradient of the marginal likelihood ℒ(θ, φ) = log p(y | X, θ, φ) flows back through everything ↑

The training objective

No regression loss. No contrastive loss. Just one objective: the probability that the GP, with its current kernel and current embeddings, would have produced the experimental yields you've actually observed. Maximise that.

ℒ(θ, φ) = − ½ [ y^⊤ K_θ,φ⁻¹ y + log |K_θ,φ| + n log 2π ]

The kernel matrix K_θ,φ depends on the embeddings g_φ(x), so its gradient w.r.t. φ is well-defined. Backpropagation does the rest. Two terms balance:

data fit (y^⊤K⁻¹y): rewards embeddings where the GP can interpolate the seen yields.
complexity penalty (log |K|): punishes embeddings and kernels that are too flexible (too short lengthscales, too high signal variance), where the GP would have to be "luckier" to produce the data.

This is what makes the uncertainty calibrated: the marginal likelihood inherently trades off fit against confidence. A naïve regression objective would happily push σ to zero. Marginal likelihood doesn't let it.
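The objective above can be evaluated in a few lines once the kernel matrix is built. This sketch computes ℒ and its two competing terms with a Cholesky factorisation, which is the usual numerically stable route. The random embeddings and yields are stand-ins, and the added noise variance of 0.01 is an assumption.

```python
import numpy as np

def matern52(X, lengthscale=1.0, signal_var=1.0):
    """Matérn-5/2 kernel matrix for embeddings X of shape (n, d)."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    s = np.sqrt(5.0) * d / lengthscale
    return signal_var * (1.0 + s + s**2 / 3.0) * np.exp(-s)

def log_marginal_likelihood(K, y):
    """Return (ℒ, data fit y^T K^-1 y, log|K|) for a zero-mean GP."""
    L = np.linalg.cholesky(K)                          # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^-1 y
    data_fit = y @ alpha
    log_det = 2.0 * np.log(np.diag(L)).sum()
    lml = -0.5 * (data_fit + log_det + len(y) * np.log(2 * np.pi))
    return lml, data_fit, log_det

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))          # stand-in embeddings g_φ(x)
y = rng.normal(size=6)               # stand-in (centred) yields
K = matern52(X) + 1e-2 * np.eye(6)   # assumed noise variance on the diagonal
lml, data_fit, log_det = log_marginal_likelihood(K, y)
print(lml, data_fit, log_det)
```

In GOLLuM the same scalar would be differentiated with respect to both θ and φ by an autodiff framework; this sketch only evaluates it.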

Why this doesn't overfit on tiny datasets

Eight 1D points from a smooth function. Slide the lengthscale ℓ. Watch the two terms of the negative log marginal likelihood compete: data fit has a sweet spot (too short → kernel can't relate neighbours; too long → matrix becomes ill-conditioned and the term blows up), complexity penalty keeps decreasing in ℓ (simpler models are preferred). Their sum has a clean minimum. That minimum is what the joint training in GOLLuM finds.

Slider: lengthscale ℓ (starts at 0.20). Readouts: data fit (y^⊤K⁻¹y), complexity (log|K|), total −log likelihood, and the ℓ at the optimum.
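The lengthscale sweep can be reproduced offline. The sketch below uses eight points from sin(2πx) as an assumed stand-in for the widget's smooth function, with unit signal variance and a noise variance of 0.01 (both assumptions), and tabulates the two terms and their sum over a grid of ℓ.

```python
import numpy as np

def matern52(x, lengthscale, signal_var=1.0):
    """Matérn-5/2 kernel matrix for 1D inputs x of shape (n,)."""
    s = np.sqrt(5.0) * np.abs(x[:, None] - x[None, :]) / lengthscale
    return signal_var * (1.0 + s + s**2 / 3.0) * np.exp(-s)

def nll_terms(x, y, lengthscale, noise=1e-2):
    """Data fit, log-determinant, and total negative log marginal likelihood."""
    K = matern52(x, lengthscale) + noise * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = y @ alpha
    log_det = 2.0 * np.log(np.diag(L)).sum()
    total = 0.5 * (data_fit + log_det + len(x) * np.log(2 * np.pi))
    return data_fit, log_det, total

x = np.linspace(0.0, 1.0, 8)
y = np.sin(2 * np.pi * x)                 # assumed smooth test function
grid = np.geomspace(0.02, 2.0, 40)        # lengthscales to sweep
terms = np.array([nll_terms(x, y, ell) for ell in grid])
data_fits, log_dets, totals = terms.T
best_ell = grid[np.argmin(totals)]

for ell, fit, ld, tot in zip(grid[::8], data_fits[::8], log_dets[::8], totals[::8]):
    print(f"ℓ={ell:.3f}  data fit={fit:9.3f}  log|K|={ld:9.3f}  total={tot:9.3f}")
print("grid minimum at ℓ =", best_ell)
```

Plotting the three columns against ℓ shows how the terms pull in different directions and where their sum bottoms out on this particular dataset.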

Why this matters for GOLLuM. A vanilla regression loss (MSE) would happily push the LLM toward whatever embedding fits the seen yields, with nothing holding the geometry in check. Marginal likelihood does not allow that. If points with different yields are squashed together, the kernel matrix becomes near-degenerate and the data-fit term y^⊤K⁻¹y blows up; if every point is pushed far from every other, the log|K| term is at its largest. So the optimizer cannot cheat by squishing or scattering the space; it has to find an embedding where the data is genuinely well-modeled by a smooth GP. That is what makes 10 initial points enough to start reorganising a 7B-parameter LLM.

Three variants, increasing freedom

All optimise the same objective. They differ in what gets to move.

P_LLM

g(x) = P · LLM(t)

Frozen LLM, trainable linear projection P + ELU. Useful when the LLM is closed-source (e.g. OpenAI embeddings via API). You can only reshape the geometry that's already there, but for many tasks that's enough.

LLM_φ

g(x) = LLM_φ(t)

LoRA-finetuned LLM, no extra projection. Updates a tiny rank-r adapter inside the attention layers. Cheap, preserves pretrained knowledge, avoids catastrophic forgetting.

P_LLM_φ (main variant)

g(x) = P · LLM_φ(t)

Both. LoRA reshapes internal representations; projection P then carves the final geometry the GP wants. Best on hard tasks. The variant the paper recommends as default.
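To make the frozen-encoder variant concrete, here is a sketch of the projection head alone. It assumes the ELU is applied after the linear map P, which is one reading of "projection P + ELU". The random matrix stands in for frozen LLM embeddings, and the dimensions (16 → 4) are arbitrary.

```python
import numpy as np

def elu(z, alpha=1.0):
    """ELU activation: identity for z > 0, alpha·(exp(z) − 1) otherwise."""
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

def project(embeddings, P):
    """g(x) = ELU(P · LLM(t)) for a batch of frozen embeddings of shape (n, d_in)."""
    return elu(embeddings @ P.T)

rng = np.random.default_rng(0)
frozen = rng.normal(size=(5, 16))             # stand-in for frozen LLM outputs
P = rng.normal(size=(4, 16)) / np.sqrt(16)    # the only trainable part in P_LLM
features = project(frozen, P)                 # what the GP kernel would see
print(features.shape)
```

In the P_LLM variant only P would receive gradients from the marginal likelihood; in P_LLM_φ the encoder's LoRA adapters would be updated as well.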

The clever bit: finetuning a 7B-parameter LLM normally takes thousands of examples. Here, the GP marginal likelihood gives a strong enough signal that 10 initial points + 50 sequential trials are enough to reorganise the embedding space. The likelihood regularises itself: collapsing points with different yields onto each other makes K near-singular and blows up the data-fit term, while the log |K| term penalises the opposite extreme of needlessly flexible fits.

Tab 4 — The marginal likelihood is an implicit contrastive loss

This is the part that's hard to see from the equations but easy to see in motion. As the LLM trains through the GP marginal likelihood, high-yield experiments drift toward each other and away from low-yield ones — without anyone writing a contrastive loss term. It just falls out of the math.

Plot legend: high yield · medium · low yield. The animation reports the training iteration and the mean embedding distance for high ↔ high and high ↔ low pairs.

Why this happens

The data-fit term in the marginal likelihood is, after some algebra, a weighted sum of pairwise interactions:

ℒ_implicit ∝ Σ_i,j w_ij · ‖g_φ(x_i) − g_φ(x_j)‖²

Where the weights w_ij come from the inverse kernel matrix and depend on how similar the outputs y_i, y_j are. The optimizer wants to:

decrease distance between points with similar y
increase distance between points with different y

That's a contrastive objective. Nobody wrote it. It's already inside the GP's likelihood.
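One way to see the pull without deriving the weights w_ij is to compare the negative log marginal likelihood of two hand-made layouts of the same four points. This is a toy check, not the paper's experiment: it uses 1D embeddings, an RBF kernel in place of Matérn-5/2 for brevity, centred yields of ±1, and an assumed noise variance of 0.1.

```python
import numpy as np

def rbf(z, lengthscale=1.0, signal_var=1.0):
    """RBF kernel matrix for 1D embeddings z of shape (n,)."""
    d2 = (z[:, None] - z[None, :]) ** 2
    return signal_var * np.exp(-0.5 * d2 / lengthscale**2)

def nll_terms(z, y, noise=0.1):
    """Negative log marginal likelihood plus its data-fit and log|K| parts."""
    K = rbf(z) + noise * np.eye(len(z))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = y @ alpha
    log_det = 2.0 * np.log(np.diag(L)).sum()
    nll = 0.5 * (data_fit + log_det + len(y) * np.log(2 * np.pi))
    return nll, data_fit, log_det

y = np.array([1.0, 1.0, -1.0, -1.0])        # two high-yield, two low-yield points

clustered = np.array([0.0, 0.1, 3.0, 3.1])  # same-yield points are neighbours
mixed = np.array([0.0, 3.0, 0.1, 3.1])      # each high sits next to a low

nll_c, fit_c, logdet_c = nll_terms(clustered, y)
nll_m, fit_m, logdet_m = nll_terms(mixed, y)
print(f"clustered: nll={nll_c:.2f} data fit={fit_c:.2f} log|K|={logdet_c:.2f}")
print(f"mixed:     nll={nll_m:.2f} data fit={fit_m:.2f} log|K|={logdet_m:.2f}")
```

Both layouts occupy the same four positions, so log |K| is identical; the whole difference comes from the data-fit term, which is far smaller when similar yields sit together. A gradient step on the embeddings would therefore move from the mixed layout toward the clustered one.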

What the paper observed

On Buchwald-Hartwig the LLM latent space starts random. By iteration 25, iodide-based aryl halides (high reactivity, high yield) start grouping. By iteration 50, the space cleanly separates: iodides on one side, chlorides on the other. The model never saw "iodides have higher reactivity" written anywhere — it inferred it from yields.

Why it matters for BO

Once the latent space is organised this way, the kernel's smoothness assumption finally holds. The GP can extrapolate: a new candidate that lands near the iodide cluster inherits that cluster's high predicted yield, and BO will preferentially sample from that region. Better geometry → smoother GP fit → smarter acquisition.

The connection back to BO: this is exactly the property GP-based BO needs but rarely gets for free. Hand-engineered chemistry fingerprints (DRFP) get part of the way there because they were designed by chemists who knew which features predict yield. GOLLuM's training discovers a comparable structure from natural language inputs and 60 yield observations — and beats DRFP, because the LLM can encode interactions a fingerprint can't (e.g. solvent–base coupling).

Tab 5 — The data flow, drawn three ways

Concretely: input params → embedding → GP → prediction. Here's that pipeline drawn three ways: the forward pass for one prediction, the training pass that fits the joint model on a batch, and the outer BO loop that wraps both. Then the honest answer to "does this give us a clean BO model for arbitrary data?"

1 · Forward pass: predicting one candidate

Run at every BO iteration, once per candidate in the design space. This is what the acquisition function consumes.

params (text): "Solvent: DMSO, Base: K₃PO₄, Ligand: …"

→ LLM_φ: encoder (+ projection P)

→ embedding: x ∈ ℝᵈ

→ GP: μ(x), σ²(x) (prediction)

2 · Training pass: refit on all N observations

Done in batch. Yields enter only through the data-fit term y^⊤K⁻¹y of the marginal likelihood (no per-example regression loss).

N observations (batch): {params_i, y_i}, i = 1..N

→ LLM_φ: encode all N

→ kernel matrix K: N × N, K_ij = k(x_i, x_j)

→ marginal likelihood ℒ = log p(y | X, θ, φ) (scalar loss)

↑ gradient flows back through everything: ∂ℒ/∂θ updates the GP hyperparameters, ∂ℒ/∂φ updates the LLM's LoRA adapters and the projection P ↑

3 · Outer BO loop: what wraps everything

The flow each round of the campaign. After step 5, the dataset has grown by one and we go around again.

1. train: flow 2 above

→ 2. predict: flow 1 × all candidates

→ 3. acquire: argmax of μ + κσ

→ 4. experiment: run it, get yield

→ 5. append: N → N + 1

↻ loop back to step 1 until budget exhausted or "good enough" yield found
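The five steps can be sketched end to end on a toy problem. This is a simplified stand-in, not GOLLuM itself: the candidates are numbers rather than LLM embeddings, the "train" step is skipped (kernel hyperparameters stay fixed at assumed values), and the objective is a made-up quadratic playing the role of the experiment.

```python
import numpy as np

LENGTHSCALE, SIGNAL_VAR, NOISE = 0.2, 1.0, 1e-4   # fixed, illustrative values

def matern52(a, b):
    s = np.sqrt(5.0) * np.abs(a[:, None] - b[None, :]) / LENGTHSCALE
    return SIGNAL_VAR * (1.0 + s + s**2 / 3.0) * np.exp(-s)

def gp_predict(x_obs, y_obs, x_new):
    """Posterior mean and standard deviation of a zero-mean GP."""
    K = matern52(x_obs, x_obs) + NOISE * np.eye(len(x_obs))
    K_s = matern52(x_obs, x_new)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    v = np.linalg.solve(L, K_s)
    var = SIGNAL_VAR - np.sum(v**2, axis=0)
    return K_s.T @ alpha, np.sqrt(np.maximum(var, 0.0))

def run_bo(objective, candidates, init_idx, n_iter=10, kappa=2.0):
    observed = list(init_idx)
    yields = [objective(candidates[i]) for i in observed]
    for _ in range(n_iter):
        # 1 train: skipped in this sketch (GOLLuM would refit θ and φ here)
        remaining = [i for i in range(len(candidates)) if i not in observed]
        # 2 predict: posterior over every untested candidate
        mean, std = gp_predict(candidates[observed], np.array(yields),
                               candidates[remaining])
        # 3 acquire: argmax of μ + κσ
        pick = remaining[int(np.argmax(mean + kappa * std))]
        # 4 experiment and 5 append: the dataset grows by one
        observed.append(pick)
        yields.append(objective(candidates[pick]))
    return observed, yields

candidates = np.linspace(0.0, 1.0, 41)            # enumerated design space
toy_objective = lambda x: -(x - 0.7) ** 2         # hypothetical "experiment"
observed, yields = run_bo(toy_objective, candidates, init_idx=[0, 10, 40])
print("best candidate tried:", candidates[observed[int(np.argmax(yields))]])
```

Restricting the argmax to untested indices is what keeps every suggestion inside the enumerated design space.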

So — does this give us a clean BO model for arbitrary data?

Short answer: basically, yes, with a few asterisks. The paper's whole pitch is that this is the first framework where one architecture, with one set of hyperparameters, works across radically different domains. They tested it on 23 tasks.

Yes, the strong claim holds when…

You can describe your experiment in text. Lab notebook entries, recipe cards, parameter sheets all qualify.
Mixed variable types are fine. Categorical reagents + continuous temperatures + structural SMILES all flow through the LLM into the same embedding space. No bespoke kernel design.
Cold-start works. 10 random (often failed) initial points are enough — the marginal likelihood gives strong signal even on tiny datasets.
One architecture, no per-task tuning. Same hyperparameters won across organic synthesis, materials, process chemistry, molecular design.
You skip feature engineering. No fingerprints, no descriptors, no QM features. The LLM does that step for free.
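Turning a parameter sheet into the kind of prompt shown in the diagrams can be as simple as a template. The helper below is a hypothetical sketch that mirrors the "Name: value" format used on this page; the temperature entry is an invented example of a continuous parameter, and the paper's actual prompt template may differ.

```python
def to_prompt(params):
    """Render a mapping of parameter names to values as 'Name: value, ...' text."""
    return ", ".join(f"{name}: {value}" for name, value in params.items())

experiment = {
    "Solvent": "DMSO",
    "Additive": "EtISOX",
    "Reactant": "EtPhBr",
    "Temperature (°C)": 80,   # invented continuous parameter for illustration
}
prompt = to_prompt(experiment)
print(prompt)
```

Categorical and continuous values go through the same string template, which is why no bespoke kernel per variable type is needed.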

But the caveats are real…

Cubic GP scaling. Kernel matrix inversion is O(N³). Fine for ~hundreds of experiments per campaign. Past ~10k it breaks. Paper acknowledges this and points at sparse/variational GP approximations as future work.
Text has to actually carry the signal. Domains where the meaningful data is high-dimensional numerical (protein conformations, raw crystal structures, spectroscopy traces) don't compress cleanly into language. Hybrid encoders likely needed there.
You still need a defined design space. BO picks from candidates you enumerate. If your space is open-ended generation (design any molecule), you need a separate generator feeding into the loop.
You still need the experiment to be runnable. This is BO, not magic. Each iteration costs whatever an experiment costs.

The reframe: instead of "BO with chemistry features", "BO with materials features", "BO with process features", you get BO with text. The expensive, domain-specific descriptor engineering step gets absorbed into a step the LLM does for free. For any wet-lab campaign, process tuning, formulation optimization, or device parameter search where you can write the setup in plain English, this is essentially a drop-in optimizer.

Polaron's manufacturing-parameter problems are squarely in that regime: small experiment budgets, mixed categorical and continuous parameters, descriptions you'd write in a process-engineering notebook. This is exactly the use case GOLLuM was built for.

Tab 6 — Putting it together

How the five ideas combine into one optimization loop, plus the headline numbers from the paper.

Results at a glance

All numbers from the paper. Same architecture, same hyperparameters across all 23 tasks. Cold-start from 10 below-median (often failed) initial experiments. 50-experiment budget unless noted.

Scope & ranking

23 optimization tasks: across organic synthesis, materials science, process engineering, molecular design.

#1 average rank: against all baselines including BoChemian, LAPEFT, DRFP-GP, and direct-prompted GPT/Claude/Gemini.

36.5% top-5% coverage: mean across 23 tasks at 50 experiments. Next best: BoChemian 25.6%, LAPEFT 12.0%.

~50% fewer iterations: median 44% fewer iterations needed to match the best baseline's final performance.

Buchwald-Hartwig (pharma cross-coupling benchmark)

44% GOLLuM discovery rate: high-yield reactions found in 50 trials from a 10-failure cold start.

25% traditional BO (DRFP-GP): reaction fingerprints + GP, the strongest pre-GOLLuM baseline.

15–26% static LLM embeddings: T5, BERT, OpenAI, Qwen. Off-the-shelf LLM features without finetuning.

+79% relative gain (T5 → P_LLM_φ+T5): same encoder, finetuned through the GP marginal likelihood vs frozen.

Gains by domain

+90% process chemistry (relative): 13.4% → 25.4% Top-5% coverage. Largest margin.

+44% mixed-variable tasks: categorical + continuous parameters together, where fixed descriptors don't exist.

+28% organic chemistry: 23.9% → 30.6%. Modest but consistent.

+11% molecular property: 47.9% → 53.2%. Smallest margin, since established fingerprints are already strong here.

The geometric finding

0.92 Pearson r, ℓ/d̄ vs BO performance: across 14 representations. Smoother embedding geometry predicts better optimization.

0.78 Pearson r, surrogate R² vs BO performance: predictive accuracy of the GP matters less than embedding geometry. Tab 2 has the scatter.

10 + 50 data points used: 10 cold-start failures + 50 BO iterations are enough to finetune a 7B LLM. LoRA + marginal likelihood does the heavy lifting.

10–80% failure rate of direct LLM prompting: hallucinated SMILES, out-of-space suggestions, duplicates. This is why GOLLuM uses the LLM only as encoder, never as proposer.

The GOLLuM loop, in one breath

1. Describe each experiment in natural language: "Solvent: DMSO, base: K₃PO₄, ligand: AdBrettPhos…". No fingerprints, no descriptors.
2. The LLM encodes the text into a fixed-size embedding. This is the x the GP sees.
3. The GP fits μ(x), σ(x) over previously observed yields, using a Matérn-5/2 kernel on those embeddings.
4. Joint training step: backprop the GP's marginal likelihood through both the GP's hyperparameters and the LLM's LoRA adapters. The latent space reorganises so that similar-yield points cluster (Tab 4).
5. The acquisition function picks the next experiment by maximising μ + κσ (or EI) over the design space, using the now-better-shaped GP. This is plain BO from here.
6. Run the experiment, get a yield, append it to the dataset, repeat.

What changes vs. plain BO

Input: natural language instead of hand-engineered descriptors.
Surrogate: GP on top of a trainable encoder, not on raw features.
Training: marginal likelihood updates the LLM as well as the GP.
Result: same BO acquisition logic, much better geometry.

What changes vs. LLM-as-optimizer

The LLM never directly proposes an experiment. It only encodes.
Selection is done by a calibrated acquisition function on a real GP.
Hallucinations, format errors, premature stopping (10–80% failure rates with prompted GPT-4o) become impossible — the LLM has no way to suggest something outside the design space.
You get the LLM's prior knowledge and Bayesian reliability.

Why this is interesting beyond chemistry

The framework only requires that you can describe an experiment in text. Anything you'd write in a lab notebook becomes a valid optimization input — process parameters for a battery cell, manufacturing settings for a polymer composite, growth conditions for a crystal. You stop needing a separate descriptor pipeline for each domain. What stays the same: the GP, the acquisition function, the marginal-likelihood training, the BO loop. The expensive engineering work (designing fingerprints) gets absorbed into a step the LLM does for free.

The single sentence to remember: GOLLuM uses the GP's marginal likelihood as the training signal for the LLM, so the embedding geometry that BO depends on isn't designed by hand — it emerges from the optimization itself. Uncertainty stops being a flaw of the LLM; it becomes the gradient that fixes the LLM.