BORA — paper summary

TL;DR

A hybrid framework (BORA) that dynamically switches between Bayesian optimization and frontier reasoning LLMs (o3, gpt-5, gemini-2.5-flash) during a closed-loop experiment. LLM/BO hybrids beat BO-only by a wide margin. But LLM-only with o3 matches or beats the hybrid on both benchmarks tested. Most direct competitor to GOLLuM in the folder: same problem space, very different architecture, different cost structure.

Why this matters for the startup eval: a credentialed materials lab posted this paper three months before the GOLLuM evaluation date, which makes it the single most important competitive signal in the folder. The architecture differs (prompted frontier LLMs vs. a finetuned encoder), but the problem space is the same.

Headline numbers

5

LLMs benchmarked

o4-mini, o3, gpt-5-mini, gpt-5, gemini-2.5-flash. All reasoning models.

2

benchmark problems

10D photocatalytic hydrogen evolution (chemistry) + 7D pétanque simulation (physics).

150

experiments per run

15 batches × 10 experiments. 5 batches warm-start by LLM, 10 batches hybrid.

624

optimization logs

20 repeat runs per condition. Released as data repository.

o3

best overall LLM

Tightest distribution, most consistent. o3-only beat o3/BO hybrid on both benchmarks.

In silico

no real wet-lab data

Both benchmarks are simulations. Photocatalysis ground truth from prior published BO data.

The BORA framework

Adaptive policy that picks one of three actions at each step, depending on whether BO is stalling and how high model uncertainty is.

Action a1

Vanilla BO

Standard GP-based Bayesian optimization picks the next batch.

Action a2

LLM full intervention

LLM analyzes the optimization so far, generates new hypotheses, proposes the next batch outright.

Action a3

LLM-guided BO

BO proposes candidate set, LLM filters for the most promising subset.
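
A minimal sketch of how a three-action policy like this could be wired up. The summary only says the choice depends on whether BO is stalling and on model uncertainty; it does not give the rule. The stall test, the `uncertainty_threshold`, and the mapping from conditions to actions below are all assumptions for illustration, not the paper's method.

```python
from enum import Enum


class Action(Enum):
    VANILLA_BO = "a1"        # GP-based BO picks the next batch
    LLM_INTERVENTION = "a2"  # LLM proposes the next batch outright
    LLM_GUIDED_BO = "a3"     # BO proposes candidates, LLM filters them


def is_stalling(best_so_far, window=3, min_gain=1e-6):
    """Hypothetical stall test: the running best has not improved
    over the last `window` batches."""
    if len(best_so_far) <= window:
        return False
    return best_so_far[-1] - best_so_far[-1 - window] <= min_gain


def choose_action(best_so_far, uncertainty, uncertainty_threshold=0.5):
    """Assumed mapping (not taken from the paper):
    - BO still improving          -> a1
    - stalling, high uncertainty  -> a2
    - stalling, low uncertainty   -> a3
    """
    if not is_stalling(best_so_far):
        return Action.VANILLA_BO
    if uncertainty > uncertainty_threshold:
        return Action.LLM_INTERVENTION
    return Action.LLM_GUIDED_BO
```

Here `best_so_far` is the running best objective value after each batch, and `uncertainty` is any scalar summary of model uncertainty scaled to [0, 1]; both are placeholders for whatever signals the real trust score uses.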

Batches 1–5

LLM warm-start. LLM reasons about the problem, optionally searches literature, proposes initial 50 experiments.

Batches 6–15

Adaptive hybrid. Trust score determines when to call the LLM vs let BO run. 100 experiments.
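
The experiment budget can be checked with a short sketch. The batch counts and batch size come from the summary above; the function and phase names are made up for illustration.

```python
BATCH_SIZE = 10
WARM_START_BATCHES = 5   # batches 1-5, proposed by the LLM
HYBRID_BATCHES = 10      # batches 6-15, adaptive LLM/BO


def build_schedule():
    """Return a list of (batch_number, phase, n_experiments) tuples."""
    schedule = []
    total_batches = WARM_START_BATCHES + HYBRID_BATCHES
    for batch in range(1, total_batches + 1):
        phase = "warm_start" if batch <= WARM_START_BATCHES else "hybrid"
        schedule.append((batch, phase, BATCH_SIZE))
    return schedule


schedule = build_schedule()
warm = sum(n for _, phase, n in schedule if phase == "warm_start")
hybrid = sum(n for _, phase, n in schedule if phase == "hybrid")
print(warm, hybrid, warm + hybrid)  # 50 100 150
```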

Built on Opsight, Liverpool's cloud-native hypothesis-driven optimization platform. Each experiment card is plain English. LLM can use Python REPL, Google search, internal scratchpad.

Photocatalysis benchmark (10D, max HER = 28.37 µmol·h⁻¹)

Configuration	Avg after 25 exp	Avg after 150 exp	Std (150)	Times reached max
BO only	0.6	11.6	4.4	0 / 20
o4-mini / BO hybrid	7.5	20.2	4.4	1 / 20
gemini-2.5-flash / BO hybrid	11.3	20.7	4.9	2 / 20
gpt-5-mini / BO hybrid	12.5	21.5	2.8	0 / 20
gpt-5 / BO hybrid	7.5	25.1	4.2	9 / 20
o3 / BO hybrid	10.5	25.3	2.9	4 / 20
o3 only (no BO)	—	26.2	2.9	9 / 20

o3 with no BO at all reaches the highest average (26.2) and ties the gpt-5 / BO hybrid for the most runs that reached the max (9 / 20). Worth pausing on.
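
To read the table on a common scale, the sketch below expresses each configuration's 150-experiment average as a fraction of the benchmark maximum (28.37 µmol·h⁻¹) and converts the max-found counts to rates. The numbers are copied from the table; the helper name `summarize` is an illustrative choice.

```python
HER_MAX = 28.37  # benchmark maximum HER, from the table title

# name -> (average HER after 150 experiments, runs out of 20 that reached the max)
results = {
    "BO only": (11.6, 0),
    "o4-mini / BO hybrid": (20.2, 1),
    "gemini-2.5-flash / BO hybrid": (20.7, 2),
    "gpt-5-mini / BO hybrid": (21.5, 0),
    "gpt-5 / BO hybrid": (25.1, 9),
    "o3 / BO hybrid": (25.3, 4),
    "o3 only (no BO)": (26.2, 9),
}


def summarize(results, her_max=HER_MAX, n_runs=20):
    """Return (name, fraction_of_max, hit_rate) rows, best average first."""
    rows = [(name, avg / her_max, hits / n_runs)
            for name, (avg, hits) in results.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


summary = summarize(results)
for name, frac, hit_rate in summary:
    print(f"{name:30s} {frac:6.1%} of max   hit rate {hit_rate:.0%}")
```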

Pétanque benchmark (7D physics simulation, max score = 100)

Configuration	Avg after 25 exp	Avg after 150 exp	Std (150)
BO only	9.3	52.6	26.8
o4-mini / BO hybrid	66.2	94.1	11.7
o3 / BO hybrid	59.9	98.2	2.8
gpt-5 / BO hybrid	63.9	99.2	0.9
o3 only (no BO)	—	99.5	1.2

Same story. On this clean-physics problem, o3-only is competitive with or better than the BO-augmented hybrids. BO alone is left far behind.
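
A rough way to gauge how large the o3-only lead over the o3 / BO hybrid is relative to run-to-run noise is Welch's t-statistic computed from the summary rows. This sketch assumes 20 runs per condition on both benchmarks (per the "20 repeat runs per condition" note), that the Std column is a sample standard deviation, and that run outcomes are roughly normal. None of this is an analysis from the paper.

```python
import math


def welch_t(mean_a, std_a, mean_b, std_b, n_a=20, n_b=20):
    """Welch's t-statistic from summary statistics (unequal variances)."""
    standard_error = math.sqrt(std_a ** 2 / n_a + std_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


# o3 only vs o3 / BO hybrid, averages and stds after 150 experiments
photocatalysis_t = welch_t(26.2, 2.9, 25.3, 2.9)
petanque_t = welch_t(99.5, 1.2, 98.2, 2.8)

print(round(photocatalysis_t, 2), round(petanque_t, 2))  # 0.98 1.91
```

Under those assumptions both statistics sit below the usual rough threshold of about 2, so on summary statistics alone the o3-only lead looks directional rather than clear-cut. The released logs would allow a proper per-run test.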

Strengths & limitations

What's strong

Modern frontier LLMs tested — o3, gpt-5, gemini-2.5-flash. Most up-to-date comparison in the literature.
Honest about failure modes — run-by-run analysis, named outliers, reports cases where the LLM went down the wrong path (e.g. "avoid base" hypothesis that lost the optimum).
Two complementary benchmarks — messy chemistry (multiple length scales, unclear ground truth) and clean physics (known theory).
Open data — 624 optimization logs released for inspection.
Compares LLM-only against hybrid — provides the apples-to-apples needed to answer "does the BO scaffolding add value when the LLM is good enough?"

Caveats

Pure simulation — no real wet-lab validation. Photocatalysis uses prior published BO data as ground truth; pétanque is a self-built physics model.
Only 2 benchmarks — not the 23 of GOLLuM. Generality claim is weaker.
Closed-source LLM dependency — needs API access to o3, gpt-5, gemini. Expensive at scale. Energy cost flagged but not deeply analyzed.
No calibrated uncertainty — prompted LLMs don't give principled error bars. The "trust score" is a heuristic, not Bayesian.
Reproducibility risk — model versions change. Photocatalysis previously used gpt-4o-mini, now superseded.
Literature search effect ≈ zero on photocatalysis — positive results bias means LLMs picked up bad hypotheses (e.g. dye sensitization) from papers that didn't apply.

Quotes worth remembering

"LLM/BO hybrids outperform BO-only approaches, particularly in early-stage exploration where the search is warm-started by LLM-driven hypotheses."

"Among the models tested, o3 delivered the strongest and most consistent optimisation performance after 150 experiments. LLM-only optimisations without the BO component also matched or surpassed hybrid methods in some settings."

"The strongest LLM-only performance was observed with a batch size of one, suggesting that experiment-by-experiment machine reasoning is a viable strategy for certain automated scientific optimisation tasks."

Bottom line for the GOLLuM startup thesis

The most uncomfortable paper in the folder. Same problem space as GOLLuM, different architecture, posted three months before the evaluation. The headline finding, that o3-only beats the o3/BO hybrid on both benchmarks, is the worst-case scenario for any team betting that the BO scaffold is the durable moat. Counterweights: these are simulations, not real wet-lab campaigns; calibrated uncertainty still matters when stakes are high; and o3-class API costs are high at scale (roughly $10-100 per experiment, vs. a one-time LoRA finetune). GOLLuM still wins on interpretability, cost per iteration, and principled uncertainty. But the gap is narrower than it looked a year ago and is still narrowing. Worth tracking this group, this paper, and o-series model improvements closely if the startup decision is yes.
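
For scale, the per-experiment API range quoted above can be turned into a per-run figure. In this sketch the $10-100 range is this note's rough estimate, not a number from the paper, and the finetune cost passed to `break_even_experiments` is a purely hypothetical placeholder. It also ignores any per-iteration cost on the finetuned side.

```python
import math

EXPERIMENTS_PER_RUN = 150
COST_LOW, COST_HIGH = 10, 100  # USD per experiment, rough estimate from this note


def run_cost(cost_per_experiment, n_experiments=EXPERIMENTS_PER_RUN):
    """API spend for one optimization run."""
    return cost_per_experiment * n_experiments


def break_even_experiments(finetune_cost, cost_per_experiment):
    """Experiments after which cumulative API spend reaches a one-time cost."""
    return math.ceil(finetune_cost / cost_per_experiment)


print(run_cost(COST_LOW), run_cost(COST_HIGH))  # 1500 15000

# Hypothetical one-time finetune cost, for illustration only.
PLACEHOLDER_FINETUNE_COST = 5000
print(break_even_experiments(PLACEHOLDER_FINETUNE_COST, COST_HIGH))  # 50
```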