Paper · NMI · May 2024 · Lab predecessor

ChemCrow

Augmenting large language models with chemistry tools

Authors: Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D. White, Philippe Schwaller
Affiliations: EPFL/LIAC · NCCR Catalysis · U Rochester · FutureHouse · IBM Research Zurich
Venue: Nature Machine Intelligence (vol. 6, pp. 525–535) · Published 8 May 2024
TL;DR
GPT-4 augmented with 18 expert-designed chemistry tools and a ReAct-style chain-of-thought reasoning loop. It plans and executes chemical syntheses autonomously, including on real robotic hardware (IBM RoboRXN). The paper demonstrates end-to-end synthesis of DEET and three organocatalysts, reports the discovery of a novel chromophore, and shows unsafe requests being refused by built-in safety guardrails. It is a tool-using agent, not a probabilistic optimizer: a different paradigm from GOLLuM.
Headline numbers
  • 18 expert chemistry tools: spanning molecule, reaction, safety, and general categories.
  • 14 evaluation tasks: spanning synthesis, molecular design, and chemical logic. Graded by 4 expert chemists.
  • 4 chemicals synthesized: DEET (insect repellent) plus Schreiner's, Ricci's, and Takemoto's catalysts. Run on IBM RoboRXN.
  • 336 nm, novel chromophore discovered: absorption maximum of the synthesized candidate; the target was 369 nm. A random forest was trained, candidates were screened, and the top one was synthesized.
  • GPT-4 backbone LLM: temperature 0.1. Integration via LangChain. ReAct + MRKL prompting style.
  • Stopped, TNT analog request: safety guardrails refused. ControlledChemicalCheck and ExplosiveCheck worked as designed.
The reasoning loop
  1. Thought: the LLM reasons about the current state and plans the next step.
  2. Action: select a tool from the 18 available.
  3. Action input: format the input for that tool.
  4. Observation: read the tool output and incorporate it.

Iterates until a final answer is reached. ReAct-style scaffold. Tool outputs ground the LLM on facts it can't reliably memorize (molecular weights, prices, safety data, reaction outcomes).
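
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not ChemCrow's actual implementation (which is built on LangChain and calls GPT-4). The names `react_loop`, `scripted_llm`, and the toy `smiles_to_weight` lookup are all hypothetical stand-ins, and the "LLM" here is a hard-coded script so the example runs offline.

```python
from typing import Callable, Dict, List, Tuple

OBS_PREFIX = "Observation: "


def smiles_to_weight(smiles: str) -> str:
    """Toy stand-in for a molecular-weight tool (assumption: a lookup table, not RDKit)."""
    weights = {"CCO": "46.07"}  # ethanol, g/mol
    return weights.get(smiles, "unknown")


# Hypothetical tool registry: tool name -> callable taking and returning a string.
TOOLS: Dict[str, Callable[[str], str]] = {"SMILES2Weight": smiles_to_weight}


def scripted_llm(transcript: List[str]) -> Tuple[str, str, str]:
    """Stand-in for the LLM: returns (thought, action, action_input)."""
    observations = [line for line in transcript if line.startswith(OBS_PREFIX)]
    if not observations:
        return ("I need the molecular weight of CCO.", "SMILES2Weight", "CCO")
    answer = observations[-1][len(OBS_PREFIX):]
    return ("The tool output answers the question.", "Final Answer", answer)


def react_loop(question: str,
               llm: Callable[[List[str]], Tuple[str, str, str]],
               tools: Dict[str, Callable[[str], str]],
               max_steps: int = 5) -> Tuple[str, List[str]]:
    """Run Thought -> Action -> Action Input -> Observation until a final answer."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, action_input = llm(transcript)
        transcript.append(f"Thought: {thought}")
        if action == "Final Answer":
            return action_input, transcript
        transcript.append(f"Action: {action}")
        transcript.append(f"Action Input: {action_input}")
        tool = tools.get(action)
        observation = tool(action_input) if tool else f"Unknown tool: {action}"
        transcript.append(OBS_PREFIX + observation)
    return "No answer within step budget", transcript


answer, trace = react_loop("What is the molecular weight of ethanol?", scripted_llm, TOOLS)
print(answer)
```

The step budget (`max_steps`) is an assumption added here so the sketch always terminates. The key design point it illustrates is that the observation is appended to the transcript, so the next "thought" is conditioned on tool output rather than on recalled facts.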

The 18 tools

Molecule (8)

  • Name2SMILES — name → SMILES
  • SMILES2Weight — molecular weight via RDKit
  • SMILES2Price — purchasability + price
  • SMILES2CAS, Name2CAS — CAS lookup
  • Similarity — Tanimoto via ECFP2
  • ModifyMol — robust chemistry mutations
  • FuncGroups — functional-group detection
  • PatentCheck — patent existence
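
To make the Similarity tool concrete, here is a minimal sketch of the Tanimoto coefficient on fingerprints represented as sets of on-bit indices. The two fingerprints below are made-up bit sets for illustration only. In practice the bits would come from a circular fingerprint such as ECFP2 computed by a cheminformatics library like RDKit, which this sketch deliberately avoids so it runs with the standard library alone.

```python
from typing import Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


# Hypothetical on-bit indices, not derived from real molecules.
fp_query = {3, 17, 42, 101, 256}
fp_candidate = {3, 17, 42, 300}

print(tanimoto(fp_query, fp_candidate))  # 3 shared / 6 distinct = 0.5
```

A value of 1.0 means identical bit sets and 0.0 means no shared bits, which is why the score is a convenient grounding signal for "find molecules similar to X" requests.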

Reaction (4)

  • NameRXN — classify a reaction (RXN4Chem)
  • ReactionPredict — Molecular Transformer prediction
  • SynthesisPlan — retrosynthesis (RXNPlanner)
  • SynthesisExecute — run synthesis on RoboRXN

Safety (3)

  • ControlledChemicalCheck — chemical weapons watchlist
  • ExplosiveCheck — GHS-based explosive detection
  • SafetySummary — PubChem-based safety report
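
One way to picture how the safety tools gate execution is a check that runs before any synthesis call. The sketch below is an assumption about the control flow, not ChemCrow's code. The function names, the `SafetyVerdict` type, and the watchlist entries are hypothetical placeholders (the identifiers are deliberately fake, not real CAS numbers).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SafetyVerdict:
    allowed: bool
    reasons: List[str]


# Placeholder identifiers for illustration; NOT real watchlist entries.
CONTROLLED_WATCHLIST = {"CAS-PLACEHOLDER-001"}
EXPLOSIVE_FLAGGED = {"CAS-PLACEHOLDER-002"}


def safety_gate(cas: str) -> SafetyVerdict:
    """Collect every reason a compound should be blocked."""
    reasons = []
    if cas in CONTROLLED_WATCHLIST:
        reasons.append("on controlled-chemical watchlist")
    if cas in EXPLOSIVE_FLAGGED:
        reasons.append("flagged as explosive")
    return SafetyVerdict(allowed=not reasons, reasons=reasons)


def guarded_execute(cas: str, execute: Callable[[str], str]) -> str:
    """Only call the (hypothetical) execution tool if the gate allows it."""
    verdict = safety_gate(cas)
    if not verdict.allowed:
        return "Refused: " + "; ".join(verdict.reasons)
    return execute(cas)


print(guarded_execute("CAS-PLACEHOLDER-002", lambda c: f"synthesis started for {c}"))
print(guarded_execute("CAS-SAFE-EXAMPLE", lambda c: f"synthesis started for {c}"))
```

Putting the check in front of the execution call, rather than relying on the LLM to remember to run it, is the design choice this sketch is meant to highlight.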

General (4)

  • WebSearch — SerpAPI Google search
  • LitSearch — paper-qa with FAISS + OpenAI embeddings
  • Python REPL — code execution sandbox
  • Human — escalate to the user
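
The retrieval step behind a literature-search tool like LitSearch can be illustrated with a brute-force cosine-similarity lookup. This is a simplified stand-in: the summary above says the real tool uses paper-qa with FAISS and OpenAI embeddings, whereas the sketch below uses hand-written three-dimensional toy vectors and a plain sort. The passage names and vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          index: Dict[str, Sequence[float]],
          k: int = 2) -> List[str]:
    """Return the names of the k passages most similar to the query vector."""
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]), reverse=True)
    return ranked[:k]


# Toy "embeddings" (assumption: real ones would come from an embedding model).
toy_index = {
    "passage_thiourea": [0.9, 0.1, 0.0],
    "passage_repellent": [0.1, 0.9, 0.1],
    "passage_dyes": [0.0, 0.2, 0.9],
}

print(top_k([1.0, 0.0, 0.0], toy_index))
```

A vector index such as FAISS serves the same purpose as the `sorted` call here, but scales to far more passages than a linear scan can handle.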
What it successfully did
  • DEET (plan + execute): common insect repellent. Full autonomous synthesis on RoboRXN.
  • Schreiner's catalyst (plan + execute): thiourea organocatalyst for Diels-Alder reactions.
  • Ricci's catalyst (plan + execute): sibling thiourea organocatalyst.
  • Takemoto's catalyst (plan + execute): another bifunctional thiourea.
  • Novel chromophore (train + predict + synthesize): trained a random forest on absorption data, screened predicted candidates, and synthesized one with an absorption maximum of 336 nm (target 369 nm; RMSE 37 nm).
  • TNT-like compound (safety refusal): the user asked for a compound with properties similar to TNT. ChemCrow stopped, citing dual-use safety policy.
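
The screening step in the chromophore workflow amounts to ranking candidates by how close their predicted absorption maximum is to the target. Below is a minimal sketch under stated assumptions: the trained random forest is replaced by a hard-coded lookup table, and the candidate names and predicted values are invented for illustration (they are not from the paper). Only the 369 nm target and the 336 nm reported value come from the summary above.

```python
from typing import Callable, Iterable, List, Tuple


def screen_candidates(candidates: Iterable[str],
                      predict_nm: Callable[[str], float],
                      target_nm: float,
                      top_k: int = 1) -> List[Tuple[str, float]]:
    """Rank candidates by absolute distance between prediction and target."""
    scored = [(name, abs(predict_nm(name) - target_nm)) for name in candidates]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]


# Hypothetical predictions standing in for a trained regressor's output.
fake_predictions = {"candidate_A": 410.0, "candidate_B": 372.0, "candidate_C": 340.0}

best = screen_candidates(fake_predictions, fake_predictions.__getitem__, target_nm=369.0)
print(best)  # [('candidate_B', 3.0)]

# Gap between the target and the value reported for the synthesized compound.
print(abs(336 - 369))  # 33
```

The 33 nm gap between the target and the reported value is smaller than the 37 nm RMSE quoted above, so the miss is within the model's typical error.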
Evaluation findings

Four expert chemists graded ChemCrow vs raw GPT-4 across 14 tasks on three dimensions: chemical accuracy, quality of reasoning, task completion.

ChemCrow wins on...

  • Complex / novel tasks — synthesis planning, molecular design with constraints, chemical-logic reasoning.
  • Chemical factuality — tools provide exact answers where GPT-4 hallucinates.
  • Modularity — new tools can be added without retraining.
  • Safety — refuses to help with controlled chemicals and explosives.
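
The modularity point can be illustrated with a small tool registry: adding a capability means registering one more function and description, with no change to the model. This is a hypothetical sketch, not ChemCrow's or LangChain's API. `ToolRegistry`, `SuggestExperiment`, and the `condition=yield` input format are all invented for illustration, and the "optimizer" is a naive pick-the-best-so-far stand-in rather than a real optimization method.

```python
from typing import Callable, Dict


class ToolRegistry:
    """Maps tool names to callables plus the descriptions shown to the LLM."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[[str], str]] = {}
        self._descriptions: Dict[str, str] = {}

    def register(self, name: str, description: str, fn: Callable[[str], str]) -> None:
        self._tools[name] = fn
        self._descriptions[name] = description

    def describe(self) -> str:
        """Text block that would be pasted into the agent prompt."""
        return "\n".join(f"{name}: {desc}" for name, desc in self._descriptions.items())

    def call(self, name: str, tool_input: str) -> str:
        if name not in self._tools:
            return f"Unknown tool: {name}"
        return self._tools[name](tool_input)


def suggest_experiment(history: str) -> str:
    """Naive stand-in: parse 'condition=yield;...' and point at the best condition so far."""
    pairs = [item.split("=") for item in history.split(";") if item]
    best_condition, _ = max(pairs, key=lambda pair: float(pair[1]))
    return f"explore near {best_condition}"


registry = ToolRegistry()
registry.register("SuggestExperiment",
                  "Given past condition=yield pairs, propose where to try next.",
                  suggest_experiment)

print(registry.describe())
print(registry.call("SuggestExperiment", "60C=0.41;80C=0.63;100C=0.52"))
```

Because the agent only sees the name and description, swapping the naive stand-in for a more rigorous optimizer would not require touching the prompt scaffold.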

GPT-4 wins on...

  • Easy tasks where the answer is in training data — paracetamol, aspirin syntheses are memorized.
  • Fluency and completeness — responses look more polished even when wrong.
  • EvaluatorGPT (LLM-as-judge) prefers GPT-4 — authors flag this as fluency bias, not real quality.
Strengths & limitations

What's strong

  • Among the first published LLM agents to drive real robotic chemistry hardware, with end-to-end synthesis demos.
  • Tool grounding meaningfully reduces hallucination on harder tasks.
  • Safety guardrails actually work — ControlledChemicalCheck halted execution on the TNT analog.
  • Open source code release at ur-whitelab/chemcrow-public (subset of 12 tools).
  • Bridges chemists and non-chemists via natural language interface.

Caveats

  • GPT-4 dependency — closed source, API-driven, hard to reproduce across versions.
  • Hallucinations remain on edge cases where tools can't ground the LLM.
  • Tool set is narrow — 18 tools cover the cheminformatics surface but not multimodal data (images, spectra, 3D structures).
  • No calibrated uncertainty — no principled "I don't know" mechanism. Heuristic search, not Bayesian.
  • Evaluation by LLM (EvaluatorGPT) is unreliable — favors fluency over factuality. Authors note this themselves.
Quotes worth remembering
"ChemCrow not only aids expert chemists and lowers barriers for non-experts but also fosters scientific advancement by bridging the gap between experimental and computational chemistry."
"The integration of expert-designed tools can help mitigate the hallucination issues commonly associated with [LLMs], thus reducing the risk of inaccuracy."
"Our results indicate that when [an evaluator LLM] lacks the required understanding to answer a prompt, it also lacks information to evaluate the prompt completions and thus fails to provide a trustworthy assessment."
Bottom line

A different paradigm from GOLLuM. ChemCrow is a tool-using LLM agent for broad chemistry tasks — synthesis planning, molecular design, autonomous lab execution. GOLLuM is a calibrated probabilistic optimizer for sample-efficient experimental design. They could compose: a ChemCrow-style agent could call a GOLLuM-style optimizer as one of its tools. For the startup evaluation, ChemCrow shows the Schwaller lab's track record in this space and the "LLM-meets-physical-chemistry" direction. It's not the same product line; GOLLuM is the more defensible, more rigorous, less prompt-engineering-dependent technology.