Preprint · ChemRxiv · Jan 2026 · Direct competitor

BORA

Can We Automate Scientific Reasoning in Closed-Loop Experiments using Large Language Models?

Authors: Abdoulatif Cissé, Max E. Cooper, Mengjia Zhu, Xenophon Evangelopoulos, Andrew I. Cooper
Affiliation: Materials Innovation Factory and Department of Chemistry, University of Liverpool
Venue: ChemRxiv preprint (not yet peer reviewed) · Posted 30 January 2026
TL;DR
A hybrid framework (BORA) that dynamically switches between Bayesian optimization and frontier reasoning LLMs (o3, gpt-5, gemini-2.5-flash) during a closed-loop experiment. LLM/BO hybrids beat BO-only by a wide margin. But LLM-only with o3 matches or beats the hybrid on both benchmarks tested. Most direct competitor to GOLLuM in the folder: same problem space, very different architecture, different cost structure.
Why this matters for the startup eval: this paper exists, from a credentialed materials lab, posted three months before the GOLLuM evaluation date. It is the single most important competitive signal in the folder. Different architecture (prompted frontier LLMs vs finetuned encoder) but same problem space.
Headline numbers
5 LLMs benchmarked: o4-mini, o3, gpt-5-mini, gpt-5, gemini-2.5-flash. All reasoning models.
2 benchmark problems: 10D photocatalytic hydrogen evolution (chemistry) + 7D pétanque simulation (physics).
150 experiments per run: 15 batches × 10 experiments. 5 batches warm-start by LLM, 10 batches hybrid.
624 optimization logs: 20 repeat runs per condition. Released as data repository.
o3, best overall LLM: Tightest distribution, most consistent. o3-only beat o3/BO hybrid on both benchmarks.
In silico, no real wet-lab data: Both benchmarks are simulations. Photocatalysis ground truth from prior published BO data.
The BORA framework

Adaptive policy that picks one of three actions at each step, depending on whether BO is stalling and how high model uncertainty is.

Action a1, Vanilla BO: Standard GP-based Bayesian optimization picks the next batch.
Action a2, LLM full intervention: LLM analyzes the optimization so far, generates new hypotheses, proposes the next batch outright.
Action a3, LLM-guided BO: BO proposes candidate set, LLM filters for the most promising subset.
Batches 1–5, LLM warm-start: LLM reasons about the problem, optionally searches literature, proposes initial 50 experiments.
Batches 6–15, adaptive hybrid: Trust score determines when to call the LLM vs let BO run. 100 experiments.
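
The schedule and three-way action choice described above can be sketched roughly as follows. This is an illustrative sketch only: the summary does not state how the trust score is computed or which conditions map to which action, so the stall test, the scalar `gp_uncertainty`, the 0.5 threshold, and the mapping in `choose_action` are all hypothetical stand-ins rather than the paper's rule.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum

N_BATCHES = 15           # from the summary: 15 batches per run
BATCH_SIZE = 10          # from the summary: 10 experiments per batch
WARM_START_BATCHES = 5   # from the summary: batches 1-5 are LLM warm-start


class Action(Enum):
    VANILLA_BO = "a1"        # GP-based BO picks the next batch
    LLM_INTERVENTION = "a2"  # LLM proposes the next batch outright
    LLM_GUIDED_BO = "a3"     # BO proposes candidates, LLM filters them


@dataclass
class LoopState:
    batch_index: int          # 1-based index of the batch about to be chosen
    best_so_far: list[float]  # best objective value after each completed batch
    gp_uncertainty: float     # ASSUMPTION: scalar uncertainty summary in [0, 1]


def phase(batch_index: int) -> str:
    """Map a 1-based batch index to the phase named in the summary."""
    return "llm_warm_start" if batch_index <= WARM_START_BATCHES else "adaptive_hybrid"


def is_stalling(best_so_far: list[float], patience: int = 2, min_gain: float = 1e-3) -> bool:
    """ASSUMPTION: 'stalling' means the best value gained < min_gain over `patience` batches."""
    if len(best_so_far) <= patience:
        return False
    return best_so_far[-1] - best_so_far[-1 - patience] < min_gain


def choose_action(state: LoopState, uncertainty_threshold: float = 0.5) -> Action:
    """Hypothetical policy for the adaptive phase (batches 6-15)."""
    if not is_stalling(state.best_so_far):
        return Action.VANILLA_BO            # BO is still making progress
    if state.gp_uncertainty >= uncertainty_threshold:
        return Action.LLM_INTERVENTION      # stalled and the surrogate is unsure
    return Action.LLM_GUIDED_BO             # stalled but the surrogate is fairly confident
```

With these placeholder rules, a run whose best value is still improving stays on a1, and a stalled run goes to a2 or a3 depending on the uncertainty summary. The budget arithmetic matches the headline numbers: 15 × 10 = 150 experiments, split 50 warm-start and 100 adaptive.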

Built on Opsight, Liverpool's cloud-native hypothesis-driven optimization platform. Each experiment card is written in plain English. The LLM can use a Python REPL, Google search, and an internal scratchpad.

Photocatalysis benchmark (10D, max HER = 28.37 µmol·h⁻¹)
Configuration                | Avg after 25 exp | Avg after 150 exp | Std (150) | Times reached max
BO only                      | 0.6              | 11.6              | 4.4       | 0 / 20
o4-mini / BO hybrid          | 7.5              | 20.2              | 4.4       | 1 / 20
gemini-2.5-flash / BO hybrid | 11.3             | 20.7              | 4.9       | 2 / 20
gpt-5-mini / BO hybrid       | 12.5             | 21.5              | 2.8       | 0 / 20
gpt-5 / BO hybrid            | 7.5              | 25.1              | 4.2       | 9 / 20
o3 / BO hybrid               | 10.5             | 25.3              | 2.9       | 4 / 20
o3 only (no BO)              | -                | 26.2              | 2.9       | 9 / 20

o3 with no BO at all reaches the highest average after 150 experiments (26.2) and ties the gpt-5 / BO hybrid for the most runs reaching the maximum (9 / 20). Worth pausing on.
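
To put the 150-experiment averages on a common scale, one option is to express each as a fraction of the known maximum (28.37 µmol·h⁻¹). The snippet below is a minimal sketch using only the values from the table; the dictionary and function names are made up for illustration.

```python
MAX_HER = 28.37  # µmol·h⁻¹, benchmark maximum quoted in the table heading

# Average HER after 150 experiments, copied from the photocatalysis table.
avg_after_150 = {
    "BO only": 11.6,
    "o4-mini / BO hybrid": 20.2,
    "gemini-2.5-flash / BO hybrid": 20.7,
    "gpt-5-mini / BO hybrid": 21.5,
    "gpt-5 / BO hybrid": 25.1,
    "o3 / BO hybrid": 25.3,
    "o3 only (no BO)": 26.2,
}


def fraction_of_max(value: float, maximum: float = MAX_HER) -> float:
    """Return the average as a fraction of the best achievable value."""
    return value / maximum


fractions = {name: fraction_of_max(v) for name, v in avg_after_150.items()}

# Print configurations from best to worst.
for name, frac in sorted(fractions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:30s} {frac:6.1%}")
```

On this scale o3-only averages roughly 92% of the maximum, the o3 / BO hybrid roughly 89%, and BO-only roughly 41%.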

Pétanque benchmark (7D physics simulation, max score = 100)
Configuration       | Avg after 25 exp | Avg after 150 exp | Std (150)
BO only             | 9.3              | 52.6              | 26.8
o4-mini / BO hybrid | 66.2             | 94.1              | 11.7
o3 / BO hybrid      | 59.9             | 98.2              | 2.8
gpt-5 / BO hybrid   | 63.9             | 99.2              | 0.9
o3 only (no BO)     | -                | 99.5              | 1.2

Same story. On the clean-physics problem, LLM-only is competitive with or better than the BO-augmented configurations. BO alone is left behind.
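
A reader might want a rough sense of how large the o3-only vs o3 / BO gaps are relative to run-to-run spread. The sketch below computes a Welch t statistic from the summary figures in the two tables. It assumes the "Std (150)" column is a sample standard deviation over the 20 repeat runs per condition and that the run outcomes are roughly normal; neither assumption is confirmed by the summary, so treat this as a back-of-envelope check and not as the paper's analysis.

```python
import math


def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    var_a = std_a ** 2 / n_a  # squared standard error of mean A
    var_b = std_b ** 2 / n_b  # squared standard error of mean B
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df


N_RUNS = 20  # repeat runs per condition, per the headline numbers

# o3 only vs o3 / BO hybrid, average after 150 experiments.
photo_t, photo_df = welch_t(26.2, 2.9, N_RUNS, 25.3, 2.9, N_RUNS)
petanque_t, petanque_df = welch_t(99.5, 1.2, N_RUNS, 98.2, 2.8, N_RUNS)

print(f"photocatalysis: t = {photo_t:.2f}, df = {photo_df:.1f}")
print(f"pétanque:       t = {petanque_t:.2f}, df = {petanque_df:.1f}")
```

Under these assumptions t comes out near 0.98 for photocatalysis and 1.91 for pétanque, both below the two-sided 5% critical value of roughly 2.0 to 2.1 at these degrees of freedom. That is in line with the paper's own "matched or surpassed" wording: o3-only is at least as good as the hybrid, but the margin is small compared with the spread across runs.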

Strengths & limitations

What's strong

  • Modern frontier LLMs tested: o3, gpt-5, gemini-2.5-flash. Most up-to-date comparison in the literature.
  • Honest about failure modes: run-by-run analysis, named outliers, reports cases where the LLM went down the wrong path (e.g. "avoid base" hypothesis that lost the optimum).
  • Two complementary benchmarks: messy chemistry (multiple length scales, unclear ground truth) and clean physics (known theory).
  • Open data: 624 optimization logs released for inspection.
  • Compares LLM-only against hybrid: provides the apples-to-apples comparison needed to answer "does the BO scaffolding add value when the LLM is good enough?"

Caveats

  • Pure simulation: no real wet-lab validation. Photocatalysis uses prior published BO data as ground truth; pétanque is a self-built physics model.
  • Only 2 benchmarks: GOLLuM has 23, so the generality claim here is weaker.
  • Closed-source LLM dependency: needs API access to o3, gpt-5, gemini. Expensive at scale. Energy cost flagged but not deeply analyzed.
  • No calibrated uncertainty: prompted LLMs don't give principled error bars. The "trust score" is a heuristic, not Bayesian.
  • Reproducibility risk: model versions change. Photocatalysis used previous o-series gpt-4o-mini, now superseded.
  • Literature search effect ≈ zero on photocatalysis: positive-results bias in the literature meant the LLMs picked up bad hypotheses (e.g. dye sensitization) from papers that didn't apply.
Quotes worth remembering
"LLM/BO hybrids outperform BO-only approaches, particularly in early-stage exploration where the search is warm-started by LLM-driven hypotheses."
"Among the models tested, o3 delivered the strongest and most consistent optimisation performance after 150 experiments. LLM-only optimisations without the BO component also matched or surpassed hybrid methods in some settings."
"The strongest LLM-only performance was observed with a batch size of one, suggesting that experiment-by-experiment machine reasoning is a viable strategy for certain automated scientific optimisation tasks."
Bottom line for the GOLLuM startup thesis

The most uncomfortable paper in the folder. Same problem space as GOLLuM, different architecture, posted three months before the evaluation. The headline finding (o3-only beats the o3/BO hybrid on both benchmarks) is the worst-case scenario for any team betting that the BO scaffold is the durable moat. Counterweights: these are simulations, not real wet-lab campaigns; calibrated uncertainty still matters when stakes are high; o3-class models are expensive at scale (~$10-100 per experiment in API costs vs a one-time LoRA finetune). GOLLuM still wins on interpretability, on cost per iteration, and on principled uncertainty. But the gap is narrower than it looked a year ago and is narrowing further. Worth tracking this group, this paper, and o-series model improvements closely if the startup decision is yes.
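
For scale, the rough API cost figure above can be multiplied out over one 150-experiment campaign. The sketch below is plain arithmetic on this note's own estimate of $10-100 per experiment; that range is a ballpark from this summary, not a number reported in the paper, and the function name is made up for illustration.

```python
EXPERIMENTS_PER_RUN = 150                # 15 batches × 10 experiments
COST_PER_EXPERIMENT_USD = (10.0, 100.0)  # ASSUMPTION: this note's rough API cost range


def campaign_cost_range(n_experiments, cost_range):
    """Return (low, high) total cost for a campaign of n_experiments."""
    low, high = cost_range
    return n_experiments * low, n_experiments * high


low, high = campaign_cost_range(EXPERIMENTS_PER_RUN, COST_PER_EXPERIMENT_USD)
print(f"${low:,.0f} to ${high:,.0f} per 150-experiment run")
```

That puts a single LLM-driven campaign at roughly $1,500 to $15,000 in API spend under this estimate, a recurring cost that the one-time finetune comparison has to be weighed against.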