Paper · NMI Core tech

GOLLuM

Large language models as uncertainty-calibrated optimizers for experimental discovery

Authors: Bojana Ranković, Ryan-Rhys Griffiths, Philippe Schwaller
Affiliations: EPFL (Institute of Chemical Sciences and Engineering) · NCCR Catalysis · Independent (San Francisco)
Venue: Nature Machine Intelligence
TL;DR
A language model is finetuned jointly with a Gaussian process through the GP's marginal likelihood, turning natural-language experiment descriptions into a calibrated Bayesian optimizer. It ranks first on average across 23 chemistry, materials, process, and molecular-design tasks using a single architecture with no per-task tuning, while cutting the experimental budget roughly in half.
Headline numbers

  • 23 optimization tasks: organic synthesis, materials, process engineering, molecular design.
  • #1 average rank: beats all baselines, including BoChemian, LAPEFT, DRFP-GP, and direct-prompted GPT/Claude/Gemini.
  • 44% discovery rate (Buchwald-Hartwig): vs 25% for traditional BO with reaction fingerprints, on a pharma cross-coupling benchmark.
  • ~50% fewer iterations: median 44% fewer experiments needed to match the best baseline's final performance.
  • 0.92 Pearson r (ℓ/d̄ vs BO success): the geometric ratio predicts optimization success better than surrogate R² does (r = 0.78).
  • 10 + 50 data points used: cold start from 10 failed experiments plus 50 BO iterations, enough to finetune a 7B LLM.
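
The ℓ/d̄ ratio in the headline numbers can be illustrated with a small sketch. It assumes, without confirmation from this summary, that ℓ is the GP kernel lengthscale and d̄ is the mean pairwise Euclidean distance between embeddings; the function names and toy values are hypothetical.

```python
import math
from itertools import combinations

def mean_pairwise_distance(embeddings):
    """Average Euclidean distance over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def geometric_ratio(lengthscale, embeddings):
    """Assumed reading of l / d-bar: lengthscale over mean pairwise distance."""
    return lengthscale / mean_pairwise_distance(embeddings)

# Toy 2-D embeddings (illustrative values only, not from the paper).
toy = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
ratio = geometric_ratio(lengthscale=5.0, embeddings=toy)
print(f"l / d-bar = {ratio:.2f}")
```

Under this reading, a ratio well below 1 would mean most points look uncorrelated to the kernel, and a ratio well above 1 would mean they all look alike.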
The architecture

  • Text prompt: "Solvent: DMSO, base: K₃PO₄, ligand: AdBrettPhos…"
  • LLM (φ): T5 by default, LoRA finetuned
  • Embedding x ∈ ℝᵈ, plus an optional linear projection P
  • GP surrogate: Matérn-5/2 kernel, outputs μ(x) and σ²(x)

Objective: ℒ(θ, φ) = log p(y | X, θ, φ). The gradient flows through the GP and the LLM together.
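
The objective log p(y | X, θ, φ) has a closed form for a GP, and it can be evaluated by hand on toy data. The following is a minimal sketch, assuming a zero-mean GP, unit signal variance, Gaussian observation noise, and θ reduced to a single lengthscale; the helper names and 1-D toy embeddings are hypothetical. In the method described here the embeddings would come from the LLM and gradients would be taken by automatic differentiation, which this stdlib-only sketch leaves out.

```python
import math

def matern52(a, b, lengthscale):
    """Matern-5/2 kernel with unit variance between two embedding vectors."""
    s = math.sqrt(5.0) * math.dist(a, b) / lengthscale
    return (1.0 + s + s * s / 3.0) * math.exp(-s)

def cholesky(A):
    """Lower-triangular Cholesky factor of a small positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def log_marginal_likelihood(X, y, lengthscale, noise=1e-2):
    """log p(y | X) for a zero-mean GP with kernel matrix K + noise * I."""
    n = len(X)
    K = [[matern52(X[i], X[j], lengthscale) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    # Forward substitution: solve L z = y, so that z.z equals y^T K^-1 y.
    z = []
    for i in range(n):
        z.append((y[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    quad = sum(v * v for v in z)
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * quad - 0.5 * logdet - 0.5 * n * math.log(2.0 * math.pi)

# Same six centered outcomes, two hypothetical embeddings of the experiments.
y = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]
X_separated = [[0.0], [0.1], [0.2], [3.0], [3.1], [3.2]]  # outcomes cluster
X_mixed = [[0.0], [3.0], [0.1], [3.1], [0.2], [3.2]]      # outcomes interleave

lml_separated = log_marginal_likelihood(X_separated, y, lengthscale=1.0)
lml_mixed = log_marginal_likelihood(X_mixed, y, lengthscale=1.0)
print(lml_separated > lml_mixed)
```

With the outcomes held fixed, the embedding that places similar outcomes close together scores the higher log marginal likelihood. That is one way to read the implicit contrastive effect this summary attributes to the objective.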

Three variants. PLLM: frozen LLM + trainable projection (useful for closed-source LLMs). LLMφ: LoRA-only finetuning. PLLMφ (default): both — best on hard tasks.
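
The three variants differ only in which parameter groups receive gradients. A minimal configuration sketch follows; the group names are hypothetical, the ASCII keys `LLM_phi` and `PLLM_phi` stand in for LLMφ and PLLMφ, and it is an assumption that the GP hyperparameters are trained in every variant.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    train_lora: bool        # update LoRA adapters inside the LLM
    train_projection: bool  # update the linear projection P

VARIANTS = {
    "PLLM": Variant(train_lora=False, train_projection=True),
    "LLM_phi": Variant(train_lora=True, train_projection=False),
    "PLLM_phi": Variant(train_lora=True, train_projection=True),  # default
}

def trainable_groups(name):
    """List the parameter groups that would receive gradients for a variant."""
    variant = VARIANTS[name]
    groups = ["gp_hyperparameters"]  # assumption: always trained
    if variant.train_lora:
        groups.append("lora_adapters")
    if variant.train_projection:
        groups.append("projection")
    return groups

for name in VARIANTS:
    print(name, trainable_groups(name))
```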

Top-5% coverage at 50 experiments (averaged across 23 tasks)

  • GOLLuM (PLLMφ+T5): 36.5%
  • BO + tailored descriptors: 29.7%
  • BoChemian (fixed LLM + GP): 25.6%
  • LAPEFT (post-hoc uncertainty): 12.0%
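
This summary does not define top-5% coverage. One plausible reading, sketched here as an assumption, is the fraction of the best 5% of candidates in the design space that the optimizer has evaluated once its budget is spent. The function name and toy data are hypothetical.

```python
def top_percent_coverage(values, evaluated, percent=5):
    """Fraction of the top `percent`% candidates (by value) that were evaluated."""
    n_top = max(1, len(values) * percent // 100)
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    top = set(ranked[:n_top])
    return len(top & set(evaluated)) / n_top

# Toy design space: 100 candidates whose objective equals their index.
yields = list(range(100))
picked = [99, 98, 3, 50]  # indices the optimizer chose to run
coverage = top_percent_coverage(yields, picked)
print(coverage)
```

Here the top 5% are candidates 95 to 99, and the optimizer found two of the five.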
Gains by domain (Top-5% coverage, relative)

  • Process chemistry: +90%
  • Mixed-variable tasks: +44%
  • Organic chemistry: +28%
  • Molecular property: +11%
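
The domain figures are relative gains, not percentage-point differences. This summary does not state the per-domain baselines, so the sketch below only illustrates the arithmetic, using the aggregate coverage numbers listed earlier in this document; the function name is hypothetical.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# Aggregate top-5% coverage values quoted in this summary.
gain_vs_descriptors = relative_gain(36.5, 29.7)
gain_vs_bochemian = relative_gain(36.5, 25.6)
print(f"{gain_vs_descriptors:.1f}% {gain_vs_bochemian:.1f}%")
```

On the aggregate numbers, a 6.8-point lead over BO with tailored descriptors corresponds to a relative gain of about 22.9%, and the 10.9-point lead over BoChemian to about 42.6%.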
Strengths & limitations

What's strong

  • Generalist — one architecture, one set of hyperparameters across 23 tasks. No per-domain feature engineering.
  • Sample-efficient — 10 cold-start failures + 50 trials is enough. Matches real wet-lab budgets.
  • Interpretable — latent space self-organizes by reactivity (iodides cluster, chlorides separate). Gives chemists a "why."
  • Structurally reliable — LLM only encodes, never proposes. Hallucinated SMILES, out-of-space suggestions impossible by construction.
  • Elegant theory — marginal likelihood doubles as an implicit contrastive loss. Falls out of the math, not hand-designed.

Caveats

  • Cubic GP scaling, O(N³) — fine for hundreds of experiments, breaks past ~10k. Paper points at sparse / variational GPs as future work.
  • Text has to carry the signal — chemistry, process, formulation work well. Protein conformations, raw spectra, 3D structures don't compress cleanly.
  • Design space must be enumerable — BO picks from candidates you list. Open-ended generation needs a separate proposer upstream.
  • Recommender, not simulator — value capture depends on customer wet-lab throughput to actually run the suggestions.
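
Because the LLM only encodes and the design space is enumerated, choosing the next experiment reduces to scoring a finite candidate list with the GP posterior. A minimal sketch, assuming an upper-confidence-bound acquisition (this summary does not say which acquisition function the paper uses); the function name, `beta`, and the toy posterior values are hypothetical.

```python
def select_next(mu, sigma, evaluated, beta=2.0):
    """Return the index of the unevaluated candidate with the highest UCB score."""
    best_idx, best_score = None, float("-inf")
    for i, (m, s) in enumerate(zip(mu, sigma)):
        if i in evaluated:
            continue  # never re-propose an experiment that was already run
        score = m + beta * s  # posterior mean plus an exploration bonus
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

# Toy GP posterior over three enumerated candidates.
mu = [0.2, 0.5, 0.4]
sigma = [0.1, 0.05, 0.3]
first = select_next(mu, sigma, evaluated=set())
second = select_next(mu, sigma, evaluated={first})
print(first, second)
```

Every returned index points at a listed candidate, which is why out-of-space suggestions cannot occur in this setup. It is also why open-ended generation would need a separate proposer to build the list.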
Quotes worth remembering
"By repurposing the uncertainty from a Gaussian process as a direct training signal, we combine the accessibility of natural language with the reliability required for real experimental campaigns."
"Representation geometry determines optimization success better than predictive accuracy of a surrogate model."
"The GP marginal likelihood objective acts as an implicit contrastive loss, actively separating successful and unsuccessful experiments over time."
Bottom line

The core technology behind the Bojana startup pitch. Real result, peer-reviewed in NMI, with thorough benchmarks and a clean theoretical story. The implicit-contrastive-loss observation is genuinely elegant. The generalist claim (one architecture across 23 tasks) is the actual product-level differentiator. The algorithm is open source and reproducible — moat is execution + team + customer integration + data flywheel, not patent IP. See tech-analysis.md for the full startup feasibility take.