The paper hinges on one idea: a Gaussian process is only as good as the geometry it sits on.
This page builds the intuition layer-by-layer — starting from what a GP is, what it does for Bayesian optimization,
and ending with the move GOLLuM makes that turns an LLM into a calibrated optimizer for chemistry.
Already comfortable with BO? There's a separate playground for that; this guide focuses on the GP and the representation question.
Tab 1 — A Gaussian process is a distribution over functions
Drop points on the plot. The GP fits a smooth function through them, and the shaded band tells you how confident it is.
The kernel is the only knob: it answers the question of how similar two points' outputs should be, given how close their inputs are.
This is the surrogate model BO uses to decide what to test next. Get this clear and the rest of the paper falls into place.
What you're seeing. A GP is a distribution over functions. Before any data, the prior spreads probability over every function
the kernel allows, favouring those that match its smoothness, and the band is just prior uncertainty. Each observation collapses the band wherever the kernel
thinks data is informative. Where a location is far from every observation compared to ℓ, the GP forgets the data exists and uncertainty re-expands.
Try this: click Show function samples with no data — those squiggles are draws from the prior.
Add one point and they all pinch through it. Add a few more and the squiggles cluster tightly: the posterior is sharp. That tightening is
the only thing BO needs — a place where the surrogate is confident, and a place where it isn't.
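To make the band concrete, here is a minimal sketch of a zero-mean GP posterior in NumPy. It assumes a Matérn-5/2 kernel on 1D inputs and a tiny jitter term for numerical stability; the function names (`matern52`, `gp_posterior`) and the toy data are illustrative, not taken from the paper.

```python
import numpy as np

def matern52(a, b, lengthscale=1.0, variance=1.0):
    """Matérn-5/2 kernel between two sets of 1D inputs."""
    r = np.abs(a[:, None] - b[None, :])
    s = np.sqrt(5.0) * r / lengthscale
    return variance * (1.0 + s + s**2 / 3.0) * np.exp(-s)

def gp_posterior(x_train, y_train, x_test, lengthscale=1.0, variance=1.0, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP."""
    K = matern52(x_train, x_train, lengthscale, variance) + noise * np.eye(len(x_train))
    K_s = matern52(x_train, x_test, lengthscale, variance)
    K_ss = matern52(x_test, x_test, lengthscale, variance)
    L = np.linalg.cholesky(K)                      # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v**2, axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

# Toy observations (made-up values for illustration only).
x_train = np.array([-2.0, 0.0, 1.5])
y_train = np.array([0.5, -0.3, 0.8])

# One test point on top of an observation, one far away compared to the lengthscale.
x_test = np.array([0.0, 10.0])
mu, sigma = gp_posterior(x_train, y_train, x_test)
```

At the observed input the mean pinches through the data and σ is close to zero; at the far point the mean falls back to the prior mean and σ returns to the prior band.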
Kernel prior
Kernel (Matérn-5/2): default in GOLLuM. Smooth but not infinitely so, which is the right amount of smoothness for most physical objectives.
Lengthscale ℓ: how far similarity reaches. Small ℓ → GP only trusts very close neighbors. Large ℓ → it generalises across the whole space.
Signal variance: how much the function is allowed to vary. The height of the prior band.
Why this matters for BO
BO picks the next experiment by maximising μ + κσ (or similar). If the GP's μ and σ are wrong, BO is wrong.
Calibrated σ is the whole game — overconfident → BO ignores promising regions; underconfident → BO never converges.
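A small sketch of the μ + κσ rule, assuming the upper-confidence-bound form and three candidates with made-up μ and σ values. The last line shrinks σ tenfold to mimic an overconfident surrogate.

```python
import numpy as np

def ucb(mu, sigma, kappa=2.0):
    """Upper confidence bound: predicted value plus an exploration bonus."""
    return mu + kappa * sigma

# Illustrative numbers only: three candidate experiments.
mu = np.array([0.60, 0.55, 0.20])
sigma = np.array([0.02, 0.15, 0.30])

best_exploit = int(np.argmax(ucb(mu, sigma, kappa=0.0)))        # pure exploitation
best_balanced = int(np.argmax(ucb(mu, sigma, kappa=2.0)))       # mean plus uncertainty
overconfident = int(np.argmax(ucb(mu, 0.1 * sigma, kappa=2.0))) # sigma too small
```

With honest σ the rule prefers candidate 1, whose mean is slightly lower but whose upside is larger. With σ shrunk tenfold it falls back to candidate 0, which is the "ignores promising regions" failure described above.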
Pin this: a GP turns observations into a calibrated surrogate μ(x), σ(x). The kernel is its only
assumption about the world, and that assumption is implicitly geometric: two inputs with small distance should have similar outputs.
The whole GOLLuM paper is an argument that picking a representation where this assumption actually holds matters more
than predictive accuracy. Tab 2 shows why.
Tab 2 — A GP is only as good as its embedding space
Same 20 experiments, same yields, same kernel. Two different representations.
In the raw view, the input dimension is something brittle (e.g. an arbitrary parameter index).
In the learned view, the LLM has been finetuned through the GP marginal likelihood — points with similar yields end up close.
Slide between them. Watch ℓ/d̄ and the GP fit quality respond.
The geometric metric the paper introduces: ℓ / d̄, where ℓ = GP lengthscale and d̄ = mean pairwise distance between embeddings.
Higher = the GP can correlate broadly across the space (smooth, GP-friendly).
Lower = neighbours look unrelated, the GP can't generalise. The paper finds this ratio correlates with BO success at r = 0.92 — better than any predictive accuracy metric.
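A minimal sketch of how such a ratio could be computed. It assumes Euclidean distance averaged over unique pairs; the paper's exact normalisation is not reproduced here, and the lengthscale passed in is a placeholder for a fitted value.

```python
import numpy as np

def mean_pairwise_distance(Z):
    """Mean Euclidean distance over all unique pairs of rows in Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    dist = np.sqrt((diff**2).sum(-1))
    upper = np.triu_indices(len(Z), k=1)   # each pair counted once
    return dist[upper].mean()

def geometry_ratio(lengthscale, Z):
    """Lengthscale divided by mean pairwise distance (assumed form of the ratio)."""
    return lengthscale / mean_pairwise_distance(Z)

# Three toy 2D embeddings with pairwise distances 5, 10 and 5.
Z = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
ratio = geometry_ratio(5.0, Z)   # 5 / (20/3) = 0.75
```

A ratio near or above 1 means the kernel correlates points across most of the cloud; a ratio far below 1 means each point is effectively alone.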
0 = static LLM embeddings (no objective alignment). 100 = GOLLuM's finetuned space, where similar-yield points cluster.
Panel 1: Raw embedding space (e.g. fixed LLM features)
Panel 2: Learned space (GOLLuM, GP-finetuned LLM)
Each panel reports ℓ/d̄, GP fit RMSE, and BO Top-5% coverage after 50 iterations.
What the paper found. Across 14 different representations of the Buchwald-Hartwig benchmark — molecular fingerprints,
static embeddings from T5, OpenAI, Qwen, ModernBERT — the geometric ratio ℓ/d̄ predicted optimization performance with
r = 0.92. R² of the surrogate (how well it predicts held-out yield) only correlated at r = 0.78. This is counterintuitive
but important: a GP that's a worse predictor but on a smoother representation beats a GP with higher accuracy on a jagged one.
Why. The acquisition function isn't asking "how accurate is μ?" — it's asking "where should I look next?" That decision
needs extrapolation, and extrapolation only works if the kernel's smoothness assumption matches the space.
Static LLM embeddings cluster by textual similarity — but two reactions with nearly identical SMILES can have wildly different yields.
Static embeddings put them next to each other. The GP gets garbage in, gives garbage out.
The receipt: ℓ/d̄ vs BO performance across 14 representations
Reconstruction of Figure 3c from the paper. Each dot is one representation (a fingerprint or LLM embedding family) evaluated on the Buchwald-Hartwig benchmark.
The x-axis is the geometric ratio ℓ/d̄ measured on that representation. The y-axis is Top-5% coverage after 50 BO iterations.
Hover for names. The line is a least-squares fit; the Pearson correlation is computed live.
n representations: 14. The paper reports r = 0.92 for ℓ/d̄ vs BO performance and r = 0.78 for surrogate R² vs BO performance.
The shape of this plot is the paper's central empirical claim. A simple geometric property of the embedding space
predicts how good the optimizer will be, more reliably than how well the surrogate fits the data. Chemistry-specialized embeddings
(DRFP, T5Chem-SMILES) sit top-right because chemists designed them with the right kind of locality. Most off-the-shelf LLM embeddings
sit bottom-left: textual similarity does not align with reaction yield. The whole point of GOLLuM is to drag a generic LLM up and to the right
by finetuning it through the GP marginal likelihood.
The paper's central diagnosis: static LLM embeddings fail at BO not because the model lacks chemistry knowledge — chemistry-pretrained T5Chem
also fails — but because the embedding geometry isn't aligned with the objective. Tab 3 shows the fix: stop freezing the LLM. Let it learn.
Tab 3 — Make the LLM trainable, supervise it through the GP
Standard deep kernel learning: replace the GP's input x with gφ(x), a neural feature extractor.
GOLLuM's twist: gφ is a pretrained large language model, and φ is updated through the GP's marginal likelihood
(often via LoRA, so updates are cheap). The result: the LLM's embedding space gets reshaped to match what the GP needs.
↑ gradient of marginal likelihood ℒ(θ, φ) = log p(y | X, θ, φ) flows back through everything ↑
The training objective
No regression loss. No contrastive loss. Just one objective: the probability that the GP, with its current
kernel and current embeddings, would have produced the experimental yields you've actually observed. Maximise that.
ℒ(θ, φ) = −½ [ yᵀ K_{θ,φ}⁻¹ y + log |K_{θ,φ}| + n log 2π ]
The kernel matrix K_{θ,φ} depends on the embeddings gφ(x), so its gradient
w.r.t. φ is well-defined. Backpropagation does the rest. Two terms balance:
data fit (yᵀK⁻¹y): rewards embeddings where the GP can interpolate the seen yields.
complexity penalty (log |K|): punishes embeddings that over-fit (too short lengthscales, too high signal variance; the GP would have to be "luckier" to produce the data).
This is what makes the uncertainty calibrated: the marginal likelihood inherently trades off fit against
confidence. A naïve regression objective would happily push σ to zero. Marginal likelihood doesn't let it.
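The objective can be evaluated in a few lines. This sketch assumes a zero-mean GP and an RBF kernel with a small noise term on the diagonal (chosen for brevity; any positive-definite kernel matrix works), and uses a Cholesky factor rather than an explicit inverse. The function names and toy data are illustrative.

```python
import numpy as np

def rbf(x, lengthscale=1.0, variance=1.0):
    """RBF kernel matrix for 1D inputs."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def log_marginal_likelihood(K, y):
    """Return (log p(y | K), data-fit term, complexity term)."""
    n = len(y)
    L = np.linalg.cholesky(K)                          # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = float(y @ alpha)                        # y^T K^-1 y
    complexity = float(2.0 * np.log(np.diag(L)).sum()) # log |K|
    lml = -0.5 * (data_fit + complexity + n * np.log(2.0 * np.pi))
    return lml, data_fit, complexity

# Toy data (made-up values).
x = np.array([0.0, 1.0, 2.5])
y = np.array([0.2, 0.9, -0.4])
K = rbf(x, lengthscale=1.0) + 1e-2 * np.eye(len(x))
lml, data_fit, complexity = log_marginal_likelihood(K, y)
```

In GOLLuM's setting K would be built from the LLM embeddings, so the same scalar carries gradient information for both the kernel hyperparameters and the encoder.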
Why this doesn't overfit on tiny datasets
Eight 1D points from a smooth function. Slide the lengthscale ℓ. Watch the two terms of the negative log marginal likelihood compete:
data fit has a sweet spot (too short → kernel can't relate neighbours; too long → matrix becomes ill-conditioned and the term blows up),
complexity penalty keeps decreasing in ℓ (simpler models are preferred).
Their sum has a clean minimum. That minimum is what the joint training in GOLLuM finds.
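The sweep can be reproduced offline with a short script. This is a sketch under stated assumptions: eight points sampled from sin(x) as the smooth function, an RBF kernel with unit signal variance, and a fixed noise of 1e-2; none of these choices come from the paper.

```python
import numpy as np

def nll_terms(x, y, lengthscale, noise=1e-2):
    """Return (data_fit, complexity, total negative log marginal likelihood)."""
    d2 = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-0.5 * d2 / lengthscale**2) + noise * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = float(y @ alpha)                         # y^T K^-1 y
    complexity = float(2.0 * np.log(np.diag(L)).sum())  # log |K|
    total = 0.5 * (data_fit + complexity + len(x) * np.log(2.0 * np.pi))
    return data_fit, complexity, total

# Eight 1D points from a smooth toy function.
x = np.linspace(0.0, 5.0, 8)
y = np.sin(x)

# Sweep the lengthscale on a log grid and record all three quantities.
lengthscales = np.logspace(np.log10(0.05), np.log10(20.0), 60)
terms = np.array([nll_terms(x, y, ell) for ell in lengthscales])
best = int(np.argmin(terms[:, 2]))
print(f"optimum at lengthscale ~ {lengthscales[best]:.2f}")
```

Plotting the three columns of `terms` against `lengthscales` gives the same picture as the slider: the complexity column falls as ℓ grows, and the total bottoms out at an intermediate value.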
Why this matters for GOLLuM. A vanilla regression loss (MSE) would happily push the LLM to collapse every embedding
into the smallest cluster that fits the seen yields. Marginal likelihood will not let it: once points with different yields are squeezed together,
the kernel matrix becomes nearly degenerate and the data-fit term yᵀK⁻¹y grows sharply. So the optimizer cannot cheat by squishing the space; it has to find an embedding where the data is
genuinely well-modeled by a smooth GP. That is what makes 10 initial points enough to start reorganising a 7B-parameter LLM.
Three variants, increasing freedom
All optimise the same objective. They differ in what gets to move.
PLLM
g(x) = P · LLM(t)
Frozen LLM, trainable linear projection P + ELU. Useful when the LLM is closed-source (e.g. OpenAI embeddings via API).
You can only reshape the geometry that's already there, but for many tasks that's enough.
LLMφ
g(x) = LLMφ(t)
LoRA-finetuned LLM, no extra projection. Updates a tiny rank-r adapter inside the attention layers.
Cheap, preserves pretrained knowledge, avoids catastrophic forgetting.
PLLMφ (main)
g(x) = P · LLMφ(t)
Both. LoRA reshapes internal representations; projection P then carves the final geometry the GP wants.
Best on hard tasks. The variant the paper recommends as default.
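As a rough illustration of the simplest variant (frozen encoder, trainable projection), the sketch below applies a linear map followed by ELU to a batch of stand-in embeddings. The random matrix replaces real LLM outputs, the dimensions are arbitrary, and the absence of a bias term is an assumption based on the formula g(x) = P · LLM(t).

```python
import numpy as np

def elu(z, alpha=1.0):
    """Exponential linear unit; np.minimum avoids overflow in the unused branch."""
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

def project(embeddings, P):
    """Map frozen embeddings (one row per experiment) into the space the GP sees."""
    return elu(embeddings @ P)

rng = np.random.default_rng(0)
frozen = rng.normal(size=(4, 16))      # stand-in for 4 frozen LLM embeddings
P = rng.normal(size=(16, 3)) * 0.1     # trainable projection (hypothetical sizes)
features = project(frozen, P)          # shape (4, 3): the GP's inputs
```

Only `P` would receive gradients in this variant. The LoRA variants additionally update adapter weights inside the encoder, which this sketch does not attempt to show.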
The clever bit: finetuning a 7B-parameter LLM normally takes thousands of examples.
Here, the GP marginal likelihood gives a strong enough signal that 10 initial points + 50 sequential trials is enough to reorganise the embedding space.
The marginal likelihood regularises itself: you can't overfit by collapsing the space, because squeezing points with different yields together makes K nearly singular and inflates the data-fit term.
Tab 4 — The marginal likelihood is an implicit contrastive loss
This is the part that's hard to see from the equations but easy to see in motion. As the LLM trains through the GP marginal likelihood,
high-yield experiments drift toward each other and away from low-yield ones — without anyone writing a contrastive loss term.
It just falls out of the math.
Legend: high yield, medium, low yield. The readout tracks the training iteration and the mean distance for high ↔ high and high ↔ low pairs.
Why this happens
The data-fit term in the marginal likelihood is, after some algebra, a weighted sum of pairwise interactions:
ℒ_implicit ∝ Σ_{i,j} w_ij · ‖gφ(x_i) − gφ(x_j)‖²
where the weights w_ij come from the inverse kernel matrix and depend on how similar the
outputs y_i, y_j are. The optimizer wants to:
decrease distance between points with similar y
increase distance between points with different y
That's a contrastive objective. Nobody wrote it. It's already inside the GP's likelihood.
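The pull and push can be checked on the smallest possible case: two points. This sketch assumes a zero-mean GP, an RBF kernel with unit variance, centred outputs, and a fixed noise level; it only compares the log marginal likelihood at two embedding distances.

```python
import numpy as np

def log_marginal_likelihood(dist, y, lengthscale=1.0, noise=0.01):
    """Log marginal likelihood of two observations whose embeddings are `dist` apart."""
    rho = np.exp(-0.5 * (dist / lengthscale) ** 2)   # kernel value between the two points
    K = np.array([[1.0 + noise, rho], [rho, 1.0 + noise]])
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet + 2 * np.log(2 * np.pi))

similar = np.array([1.0, 1.0])      # two experiments with the same (centred) yield
different = np.array([1.0, -1.0])   # two experiments with opposite yields
close, far = 0.5, 3.0               # candidate embedding distances

similar_prefers_close = (
    log_marginal_likelihood(close, similar) > log_marginal_likelihood(far, similar)
)
different_prefers_far = (
    log_marginal_likelihood(far, different) > log_marginal_likelihood(close, different)
)
```

Both comparisons come out true: the likelihood is higher when same-yield points sit close together and when opposite-yield points sit far apart, which is the contrastive behaviour described above.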
What the paper observed
On Buchwald-Hartwig the LLM latent space starts random.
By iteration 25, iodide-based aryl halides (high reactivity, high yield) start grouping.
By iteration 50, the space cleanly separates: iodides on one side, chlorides on the other.
The model never saw "iodides have higher reactivity" written anywhere — it inferred it from yields.
Why it matters for BO
Once the latent space is organised this way, the kernel's smoothness assumption finally holds.
The GP can extrapolate: a new candidate that lands near the iodide cluster inherits a high predicted yield,
and BO will preferentially sample from that region. Better geometry → smoother GP fit → smarter acquisition.
The connection back to BO: this is exactly the property GP-based BO needs but rarely gets for free.
Hand-engineered chemistry fingerprints (DRFP) get part of the way there because they were designed by chemists who
knew which features predict yield. GOLLuM's training discovers a comparable structure from natural language inputs and 60 yield observations —
and beats DRFP, because the LLM can encode interactions a fingerprint can't (e.g. solvent–base coupling).
Tab 5 — The data flow, drawn three ways
Concretely: input params → embedding → GP → prediction.
Here's that pipeline drawn three ways: the forward pass for one prediction, the training pass that fits the joint model on a batch,
and the outer BO loop that wraps both. Then the honest answer to "does this give us a clean BO model for arbitrary data?"
1 · Forward pass: predicting one candidate
Run at every BO iteration, once per candidate in the design space. This is what the acquisition function consumes.
params (text: "Solvent: DMSO, Base: K₃PO₄, Ligand: …") → LLMφ encoder (+ projection P) → embedding x ∈ ℝᵈ → GP → prediction μ(x), σ²(x)
2 · Training pass: refit on all N observations
Done in batch. Yields enter only through the data-fit term of the marginal likelihood (no per-example regression loss).
N observations {params_i, y_i}, i = 1..N (batch) → LLMφ encodes all N → kernel matrix K (N × N, K_ij = k(x_i, x_j)) → marginal likelihood ℒ = log p(y | X, θ, φ) (scalar loss)
↑ gradient flows back through everything:
∂ℒ/∂θ updates the GP hyperparameters,
∂ℒ/∂φ updates the LLM's LoRA adapters and the projection P
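A dependency-free sketch of the training pass, under heavy simplifications: the encoder is replaced by fixed random vectors, only a linear projection P and the log-lengthscale are trained, the kernel is RBF, and gradients come from central finite differences instead of backpropagation. It shows the shape of the loop, not the paper's implementation.

```python
import numpy as np

def neg_log_marginal_likelihood(params, E, y, out_dim, noise=1e-2):
    """NLL of a zero-mean RBF GP on projected features Z = E @ P."""
    P = params[:-1].reshape(E.shape[1], out_dim)
    lengthscale = np.exp(params[-1])
    Z = E @ P
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * d2 / lengthscale**2) + noise * np.eye(len(y))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * (y @ alpha + 2.0 * np.log(np.diag(L)).sum() + len(y) * np.log(2 * np.pi))

def numerical_grad(f, params, eps=1e-5):
    """Central finite differences; a stand-in for autodiff in this sketch."""
    grad = np.zeros_like(params)
    for i in range(len(params)):
        step = np.zeros_like(params)
        step[i] = eps
        grad[i] = (f(params + step) - f(params - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
E = rng.normal(size=(12, 4))        # stand-in for 12 frozen text embeddings
y = np.sin(E[:, 0])                 # toy "yields" driven by one hidden direction
y = y - y.mean()
out_dim = 2
params = np.concatenate([rng.normal(size=4 * out_dim) * 0.5, [0.0]])

def loss(p):
    return neg_log_marginal_likelihood(p, E, y, out_dim)

initial = loss(params)
for _ in range(100):
    g = numerical_grad(loss, params)
    current, step = loss(params), 0.1
    while step > 1e-8:              # backtracking: accept only steps that lower the NLL
        candidate = params - step * g
        if loss(candidate) < current:
            params = candidate
            break
        step *= 0.5
final = loss(params)
```

The single scalar `loss` drives both the projection and the kernel hyperparameter, which mirrors the ∂ℒ/∂θ and ∂ℒ/∂φ arrows in the flow above. A real implementation would use an autodiff framework and LoRA adapters.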
3 · Outer BO loop: what wraps everything
The flow each round of the campaign. After step 5, the dataset has grown by one and we go around again.
1 train (flow 2 above) → 2 predict (flow 1 × all candidates) → 3 acquire (argmax of μ + κσ) → 4 experiment (run it, get yield) → 5 append (N → N + 1)
↻ loop back to step 1 until budget exhausted or "good enough" yield found
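The five steps can be sketched as a toy loop. Everything here is a stand-in: candidates are random 2D points rather than LLM embeddings, `run_experiment` is a hypothetical synthetic objective, and the "train" step is skipped by fixing the kernel hyperparameters. What carries over is the control flow.

```python
import numpy as np

def matern52(A, B, lengthscale=0.3):
    """Matérn-5/2 kernel with unit variance between two sets of row vectors."""
    d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    s = np.sqrt(5.0) * d / lengthscale
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

def gp_predict(X, y, X_cand, noise=1e-3):
    """Posterior mean and std at X_cand, using the sample mean of y as constant mean."""
    K = matern52(X, X) + noise * np.eye(len(X))
    K_s = matern52(X, X_cand)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - y.mean()))
    mu = K_s.T @ alpha + y.mean()
    v = np.linalg.solve(L, K_s)
    sigma = np.sqrt(np.maximum(1.0 - (v**2).sum(0), 0.0))
    return mu, sigma

def run_experiment(x):
    """Hypothetical stand-in for a lab experiment: a smooth toy 'yield'."""
    return float(np.exp(-((x - 0.7) ** 2).sum() / 0.1))

rng = np.random.default_rng(0)
candidates = rng.uniform(0, 1, size=(200, 2))           # enumerated design space
observed = list(rng.choice(200, size=5, replace=False)) # initial experiments
yields = [run_experiment(candidates[i]) for i in observed]

for _ in range(20):
    X, y = candidates[observed], np.array(yields)
    mu, sigma = gp_predict(X, y, candidates)   # steps 1-2 (hyperparameters fixed here)
    score = mu + 2.0 * sigma                   # step 3: acquisition
    score[observed] = -np.inf                  # never repeat an experiment
    nxt = int(np.argmax(score))
    observed.append(nxt)                       # steps 4-5: run it and append
    yields.append(run_experiment(candidates[nxt]))
```

In the full method, the first line inside the loop would be preceded by the joint refit of encoder and GP, and `candidates` would be the encoder's current embeddings of every text description, recomputed after each refit.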
So — does this give us a clean BO model for arbitrary data?
Short answer: basically, yes, with a few asterisks. The paper's whole pitch is that this is the first
framework where one architecture, one set of hyperparameters, works across radically different domains. They tested it on 23 tasks.
Yes, the strong claim holds when…
You can describe your experiment in text. Lab notebook entries, recipe cards, parameter sheets all qualify.
Mixed variable types are fine. Categorical reagents + continuous temperatures + structural SMILES all flow through the LLM into the same embedding space. No bespoke kernel design.
Cold-start works. 10 random (often failed) initial points are enough — the marginal likelihood gives strong signal even on tiny datasets.
One architecture, no per-task tuning. Same hyperparameters won across organic synthesis, materials, process chemistry, molecular design.
You skip feature engineering. No fingerprints, no descriptors, no QM features. The LLM does that step for free.
But the caveats are real…
Cubic GP scaling. Kernel matrix inversion is O(N³). Fine for ~hundreds of experiments per campaign. Past ~10k it breaks. Paper acknowledges this and points at sparse/variational GP approximations as future work.
Text has to actually carry the signal. Domains where the meaningful data is high-dimensional numerical (protein conformations, raw crystal structures, spectroscopy traces) don't compress cleanly into language. Hybrid encoders likely needed there.
You still need a defined design space. BO picks from candidates you enumerate. If your space is open-ended generation (design any molecule), you need a separate generator feeding into the loop.
You still need the experiment to be runnable. This is BO, not magic. Each iteration costs whatever an experiment costs.
The reframe: instead of "BO with chemistry features", "BO with materials features", "BO with process features",
you get BO with text. The expensive, domain-specific descriptor engineering step gets absorbed into a step the LLM does
for free. For any wet-lab campaign, process tuning, formulation optimization, or device parameter search where you can write the
setup in plain English, this is essentially a drop-in optimizer.
Polaron's manufacturing-parameter problems are squarely in that regime: small experiment budgets, mixed categorical and continuous
parameters, descriptions you'd write in a process-engineering notebook. This is exactly the use case GOLLuM was built for.
Tab 6 — Putting it together
How the five ideas combine into one optimization loop, plus the headline numbers from the paper.
Results at a glance
All numbers from the paper. Same architecture, same hyperparameters across all 23 tasks. Cold-start from 10 below-median (often failed) initial experiments. 50-experiment budget unless noted.
Scope & ranking
23 optimization tasks: across organic synthesis, materials science, process engineering, molecular design.
#1 average rank: against all baselines including BoChemian, LAPEFT, DRFP-GP, and direct-prompted GPT/Claude/Gemini.
36.5% top-5% coverage: mean across 23 tasks at 50 experiments. Next best: BoChemian 25.6%, LAPEFT 12.0%.
~50% fewer iterations: median 44% fewer iterations needed to match the best baseline's final performance.
0.92 Pearson r (ℓ/d̄ vs BO performance): across 14 representations. Smoother embedding geometry predicts better optimization.
0.78 Pearson r (surrogate R² vs BO performance): predictive accuracy of the GP matters less than embedding geometry. Tab 2 has the scatter.
10 + 50 data points used: 10 cold-start failures + 50 BO iterations is enough to finetune a 7B LLM. LoRA + marginal likelihood does the heavy lifting.
10–80% failure rate of direct LLM prompting: hallucinated SMILES, out-of-space suggestions, duplicates. Why GOLLuM uses the LLM only as encoder, never as proposer.
The GOLLuM loop, in one breath
Describe each experiment in natural language. "Solvent: DMSO, base: K₃PO₄, ligand: AdBrettPhos…" — no fingerprints, no descriptors.
LLM encodes the text into a fixed-size embedding. This is the x the GP sees.
GP fits μ(x), σ(x) over previously observed yields, using a Matérn-5/2 kernel on those embeddings.
Joint training step: backprop the GP's marginal likelihood through both the GP's hyperparameters and the LLM's LoRA adapters. The latent space reorganises so that similar-yield points cluster (Tab 4).
Acquisition function picks the next experiment by maximising μ + κσ (or EI) over the design space — using the now-better-shaped GP. This is plain BO from here.
Run the experiment, get a yield, append it to the dataset, repeat.
What changes vs. plain BO
Input: natural language instead of hand-engineered descriptors.
Surrogate: GP on top of a trainable encoder, not on raw features.
Training: marginal likelihood updates the LLM as well as the GP.
Result: same BO acquisition logic, much better geometry.
What changes vs. LLM-as-optimizer
The LLM never directly proposes an experiment. It only encodes.
Selection is done by a calibrated acquisition function on a real GP.
Hallucinations, format errors, premature stopping (10–80% failure rates with prompted GPT-4o) become impossible — the LLM has no way to suggest something outside the design space.
You get the LLM's prior knowledge and Bayesian reliability.
Why this is interesting beyond chemistry
The framework only requires that you can describe an experiment in text. Anything you'd write in a lab notebook
becomes a valid optimization input — process parameters for a battery cell, manufacturing settings for a polymer composite,
growth conditions for a crystal. You stop needing a separate descriptor pipeline for each domain.
What stays the same: the GP, the acquisition function, the marginal-likelihood training, the BO loop.
The expensive engineering work (designing fingerprints) gets absorbed into a step the LLM does for free.
The single sentence to remember:
GOLLuM uses the GP's marginal likelihood as the training signal for the LLM, so the embedding geometry that BO depends on
isn't designed by hand — it emerges from the optimization itself.
Uncertainty stops being a flaw of the LLM; it becomes the gradient that fixes the LLM.