ChemCrow — paper summary

Headline numbers

18

expert chemistry tools

Spanning molecule, reaction, safety, and general categories.

14

evaluation tasks

Spanning synthesis, molecular design, chemical logic. Graded by 4 expert chemists.

4

chemicals synthesized

DEET (insect repellent) + Schreiner's, Ricci's, Takemoto's catalysts. Run on IBM RoboRXN.

336 nm

novel chromophore discovered

Target absorption maximum was 369 nm. A random forest was trained, candidates were screened, and the top one was synthesized.

GPT-4

backbone LLM

Temperature 0.1. Integration via LangChain. ReAct + MRKL prompting style.

Stopped

TNT analog request

Safety guardrails refused. ControlledChemicalCheck and ExplosiveCheck worked as designed.

The reasoning loop

1. Thought: the LLM reasons about the current state and plans the next step.
2. Action: select a tool from the 18 available.
3. Action input: format the input for that tool.
4. Observation: read the tool output and incorporate it.

Iterates until a final answer is reached. ReAct-style scaffold. Tool outputs ground the LLM on facts it can't reliably memorize (molecular weights, prices, safety data, reaction outcomes).
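The loop above can be sketched in a few lines of Python. This is a minimal illustration, not ChemCrow's actual code: `scripted_llm` is a hypothetical stand-in for GPT-4, `smiles2weight` is a toy lookup rather than a real RDKit call, and the `Action:` / `Action Input:` text format is an assumed ReAct-style convention.

```python
import re
from typing import Callable, Dict

def smiles2weight(smiles: str) -> str:
    """Toy tool: a lookup table standing in for a real weight calculation."""
    table = {"CCO": "46.07"}  # ethanol, g/mol
    return table.get(smiles, "unknown")

# Hypothetical tool table; the real agent exposes 18 tools.
TOOLS: Dict[str, Callable[[str], str]] = {"SMILES2Weight": smiles2weight}

def scripted_llm(transcript: str) -> str:
    """Stand-in for the LLM: picks the next step from the transcript so far."""
    if "Observation:" not in transcript:
        return ("Thought: I need the molecular weight.\n"
                "Action: SMILES2Weight\n"
                "Action Input: CCO")
    observation = transcript.rsplit("Observation:", 1)[1].strip()
    return f"Thought: I have what I need.\nFinal Answer: {observation} g/mol"

ACTION_RE = re.compile(r"Action:\s*(.+)\nAction Input:\s*(.+)")

def run_agent(question: str,
              llm: Callable[[str], str],
              tools: Dict[str, Callable[[str], str]],
              max_steps: int = 5) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = llm(transcript)                      # Thought (+ Action or Final Answer)
        transcript += "\n" + step
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        match = ACTION_RE.search(step)
        if match is None:
            observation = "could not parse an action"
        else:
            name = match.group(1).strip()           # Action
            tool_input = match.group(2).strip()     # Action input
            tool = tools.get(name)
            observation = tool(tool_input) if tool else f"unknown tool {name}"
        transcript += f"\nObservation: {observation}"  # Observation fed back to the LLM
    return "stopped: step limit reached"

print(run_agent("What is the molecular weight of ethanol?", scripted_llm, TOOLS))
```

The step limit is a design choice of this sketch: without it, a model that never emits a final answer would loop indefinitely.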

The 18 tools

Molecule (8)

Name2SMILES — name → SMILES
SMILES2Weight — molecular weight via RDKit
SMILES2Price — purchasability + price
SMILES2CAS, Name2CAS — CAS lookup
Similarity — Tanimoto via ECFP2
ModifyMol — robust chemistry mutations
FuncGroups — functional-group detection
PatentCheck — patent existence
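To make the Similarity entry concrete, here is a minimal sketch of the Tanimoto coefficient, computed on fingerprints represented as sets of on-bit indices. The bit indices below are made up for illustration; in practice they would come from a fingerprinting library such as RDKit, which this sketch deliberately does not call.

```python
from typing import Set

def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto similarity: shared on-bits divided by total distinct on-bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

# Made-up bit indices, purely illustrative; not real ECFP output.
fp_query = {3, 17, 42, 101}
fp_candidate = {3, 17, 42, 256, 900}

print(tanimoto(fp_query, fp_candidate))  # 3 shared bits out of 6 distinct -> 0.5
```

A value of 1.0 means identical fingerprints, and 0.0 means no bits in common.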

Reaction (4)

NameRXN — classify a reaction (NameRXN software from NextMove)
ReactionPredict — Molecular Transformer prediction
SynthesisPlan — retrosynthesis (RXNPlanner)
SynthesisExecute — run synthesis on RoboRXN

Safety (3)

ControlledChemicalCheck — chemical weapons watchlist
ExplosiveCheck — GHS-based explosive detection
SafetySummary — PubChem-based safety report
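One way to picture how the safety tools gate execution is a check-then-run wrapper. The sketch below is an assumption about the general pattern, not ChemCrow's implementation: the watchlist entry is a placeholder, the function names are hypothetical, and the explosive check simply looks for GHS explosive hazard statements (the H200 series) in a set of hazard codes supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set

@dataclass
class SafetyVerdict:
    allowed: bool
    reason: Optional[str] = None

# Placeholder data; a real check would query curated lists and databases.
CONTROLLED_WATCHLIST = {"PLACEHOLDER-CAS-0001"}
EXPLOSIVE_HAZARD_CODES = {"H200", "H201", "H202", "H203", "H204", "H205"}

def controlled_check(cas: str, hazards: Set[str]) -> SafetyVerdict:
    """Block compounds whose identifier appears on the watchlist."""
    if cas in CONTROLLED_WATCHLIST:
        return SafetyVerdict(False, "on controlled-chemical watchlist")
    return SafetyVerdict(True)

def explosive_check(cas: str, hazards: Set[str]) -> SafetyVerdict:
    """Block compounds that carry any GHS explosive hazard statement."""
    hits = sorted(hazards & EXPLOSIVE_HAZARD_CODES)
    if hits:
        return SafetyVerdict(False, f"flagged as explosive ({', '.join(hits)})")
    return SafetyVerdict(True)

def guarded_execute(cas: str, hazards: Set[str],
                    execute: Callable[[str], str]) -> str:
    """Run every safety check first; call `execute` only if all of them pass."""
    for check in (controlled_check, explosive_check):
        verdict = check(cas, hazards)
        if not verdict.allowed:
            return f"stopped: {verdict.reason}"
    return execute(cas)

print(guarded_execute("SOME-CAS", {"H201"}, lambda cas: "synthesis started"))
print(guarded_execute("SOME-CAS", {"H315"}, lambda cas: "synthesis started"))
```

Placing the checks in front of the execution call, rather than relying on the LLM to remember to run them, means a refusal does not depend on prompt wording.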

General (4)

WebSearch — SerpAPI Google search
LitSearch — paper-qa with FAISS + OpenAI embeddings
Python REPL — code execution sandbox
Human — escalate to the user

What it successfully did

Plan + execute

DEET

Common insect repellent. Full autonomous synthesis on RoboRXN.

Plan + execute

Schreiner's catalyst

Thiourea organocatalyst for Diels-Alder reactions.

Plan + execute

Ricci's catalyst

Sibling thiourea organocatalyst.

Plan + execute

Takemoto's catalyst

Another bifunctional thiourea.

Train + predict + synth

Novel chromophore

Trained a random forest (RF) on absorption data, screened candidates, and synthesized the top one, which had an absorption maximum of 336 nm (target 369 nm; RMSE 37 nm).

Safety refusal

TNT-like compound

The user asked for a compound with properties similar to TNT. ChemCrow stopped, citing dual-use safety policy.

Evaluation findings

Four expert chemists graded ChemCrow vs raw GPT-4 across 14 tasks on three dimensions: chemical accuracy, quality of reasoning, task completion.

ChemCrow wins on...

Complex / novel tasks — synthesis planning, molecular design with constraints, chemical-logic reasoning.
Chemical factuality — tools provide exact answers where GPT-4 hallucinates.
Modularity — new tools can be added without retraining.
Safety — refuses to help with controlled chemicals and explosives.
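The modularity point can be illustrated with a small tool registry: adding a tool changes only the registry and the tool descriptions placed in the prompt, not the model. This is a hedged sketch with hypothetical names (`Tool`, `ToolRegistry`, `StringLength`); it is not the LangChain API that ChemCrow uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class Tool:
    name: str
    description: str          # shown to the LLM so it knows when to pick the tool
    run: Callable[[str], str]

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"tool already registered: {tool.name}")
        self._tools[tool.name] = tool

    def prompt_block(self) -> str:
        """Render one 'name: description' line per tool for the system prompt."""
        return "\n".join(f"{t.name}: {t.description}" for t in self._tools.values())

    def call(self, name: str, arg: str) -> str:
        return self._tools[name].run(arg)

registry = ToolRegistry()
# Toy tool for illustration only; it has no chemical meaning.
registry.register(Tool("StringLength",
                       "Returns the number of characters in a SMILES string.",
                       lambda s: str(len(s))))

print(registry.prompt_block())
print(registry.call("StringLength", "CCO"))  # "3"
```

Because the prompt block is regenerated from the registry, a newly registered tool becomes selectable on the next agent run with no retraining step.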

GPT-4 wins on...

Easy tasks where the answer is in training data — paracetamol, aspirin syntheses are memorized.
Fluency and completeness — responses look more polished even when wrong.
EvaluatorGPT (LLM-as-judge) prefers GPT-4 — authors flag this as fluency bias, not real quality.

Strengths & limitations

What's strong

Among the first published LLM agents to plan and run syntheses end to end on a robotic platform.
Tool grounding meaningfully reduces hallucination on harder tasks.
Safety guardrails actually work — ControlledChemicalCheck halted execution on the TNT analog.
Open-source code released at ur-whitelab/chemcrow-public (a subset of 12 tools).
Bridges chemists and non-chemists via a natural-language interface.

Caveats

GPT-4 dependency — closed source, API-driven, hard to reproduce across versions.
Hallucinations remain on edge cases where tools can't ground the LLM.
Tool set is narrow — 18 tools cover the cheminformatics surface but not multimodal data (images, spectra, 3D structures).
No calibrated uncertainty — no principled "I don't know" mechanism. Heuristic search, not Bayesian.
Evaluation by LLM (EvaluatorGPT) is unreliable — favors fluency over factuality. Authors note this themselves.

Quotes worth remembering

"ChemCrow not only aids expert chemists and lowers barriers for non-experts but also fosters scientific advancement by bridging the gap between experimental and computational chemistry."

"The integration of expert-designed tools can help mitigate the hallucination issues commonly associated with [LLMs], thus reducing the risk of inaccuracy."

"Our results indicate that when [an evaluator LLM] lacks the required understanding to answer a prompt, it also lacks information to evaluate the prompt completions and thus fails to provide a trustworthy assessment."

Bottom line

A different paradigm from GOLLuM. ChemCrow is a tool-using LLM agent for broad chemistry tasks — synthesis planning, molecular design, autonomous lab execution. GOLLuM is a calibrated probabilistic optimizer for sample-efficient experimental design. They could compose: a ChemCrow-style agent could call a GOLLuM-style optimizer as one of its tools. For the startup evaluation, ChemCrow shows the Schwaller lab's track record in this space and the "LLM-meets-physical-chemistry" direction. It's not the same product line; GOLLuM is the more defensible, more rigorous, less prompt-engineering-dependent technology.