LabV – The Material Intelligence Platform

Chemistry Duel: Human vs. Machine – Who Comes Out on Top?

A groundbreaking study reveals that AI models such as GPT-4 can outperform human chemists in speed and accuracy on many tasks. But when it comes to intuition, structural reasoning, and creativity, machines still fall short. Discover why the future of chemistry depends on a hybrid approach combining human insight with machine intelligence.

Man vs. Machine: Who Wins in the Chemistry Lab?

Artificial intelligence (AI) under scrutiny: A new study by Friedrich Schiller University Jena sheds light on the capabilities of modern language models in the field of chemistry. Led by Dr. Kevin M. Jablonka, researchers investigated how powerful models like GPT-4 really are in chemistry. The result? In many cases, the machines are faster and more precise than human specialists – but they also have dangerous weaknesses. The study was recently published in “Nature Chemistry”.

In a press release, Dr. Kevin Jablonka, head of the Carl Zeiss Foundation junior research group at Friedrich Schiller University Jena, explains: “The possibilities of artificial intelligence in chemistry are attracting increasing interest – so we wanted to find out how good these models really are.”

The Setup: 2,700 Questions, 19 Chemists, Multiple AI Models

At the centre of the investigation is the ChemBench benchmarking system, newly developed by the Jena research team. It includes over 2,700 tasks from virtually all areas of chemistry: organic, inorganic, analytical, physical, and technical. The questions range from school-level knowledge to university curricula and complex structural analyses. 

The research team compared 19 experienced chemists with state-of-the-art AI models. While the humans could use tools like Google and chemistry software, the AIs relied solely on their training data. The result: in many cases, top models outperformed the best human participants. “The models drew solely on knowledge from training,” Jablonka explains.
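The comparison described above boils down to scoring each participant – human or model – against an answer key. The sketch below is a minimal, hypothetical harness in that spirit; it is not the actual ChemBench code, and the `Question`, `score`, and toy-model names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Question:
    prompt: str
    answer: str  # expected answer, e.g. the correct multiple-choice letter

def score(questions, model_fn):
    """Fraction of questions answered correctly (case-insensitive exact match)."""
    correct = sum(
        1 for q in questions
        if model_fn(q.prompt).strip().upper() == q.answer.upper()
    )
    return correct / len(questions)

# Toy check with a lookup-table "model" standing in for an LLM call
qs = [Question("Atomic number of carbon?", "6"),
      Question("Molecular formula of water?", "H2O")]
toy_model = {"Atomic number of carbon?": "6",
             "Molecular formula of water?": "h2o"}.get
print(score(qs, toy_model))  # 1.0
```

In the real study, the same question set was posed to both groups; only the `model_fn` side had no access to external tools.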

Between Genius and Error: Where AI Excels – and Where It Fails

The models showed impressive performance in many classic knowledge questions. Especially in textbook and regulatory topics, they impressed with speed and accuracy – often even outperforming human specialists. For example, in a test on chemical regulation, GPT-4 achieved a success rate of 71%, while experienced chemists only reached 3%. In safety assessments, AI models could thus play an important role in the future, such as checking substances against regulatory requirements. 

However, there are limits: in structure-based tasks like predicting NMR spectra or determining isomers, the models struggled – yet delivered incorrect answers with great conviction. “A model that gives incorrect answers with high confidence can cause problems in sensitive research areas,” warns Jablonka.

Determining isomer numbers also reveals a typical weakness of the models: while they can process molecular formulas, they struggle to recognize all possible structural variants. To correctly determine the number of possible isomers, they would need to understand chemical bonding relationships and spatial arrangements – something that still largely relies on experience and structural reasoning. This combination of apparent certainty and lack of structural understanding clearly illustrates why such tasks are particularly challenging for AI. 

So it’s no surprise that the models still perform no better than random chance when it comes to tasks like drug development or retrosynthetic analysis, where chemical intuition is crucial. 

This discrepancy points to a weakness in current evaluation methods: The successes of AI in standardized questions may say more about the nature of the questions than about real chemical understanding. A model may correctly recall many facts – but true chemical reasoning, interpreting structures, understanding mechanisms, and designing creative synthesis routes remains demanding. 

What ChemBench Means for Education and Everyday Lab Work

A key conclusion of the study concerns education: If language models can solve exam questions faster and better than students, the education system needs to adapt. In the future, the focus will shift away from rote memorization toward critical thinking, dealing with uncertainty, and creative chemical problem-solving. That the models perform better doesn’t necessarily mean they ‘think’ chemically – but it does show that we need to rethink how we teach and assess. 

At the same time, ChemBench highlights the importance of developing broader and deeper evaluation criteria for AI. Performance varies significantly depending on subfield and question type – something that must be considered when designing models and user interfaces. Previous tests often focused on so-called “property-prediction” tasks – i.e. predicting basic material properties like melting point or solubility. But these fall short if AI models are to work alongside experts and contribute to real decision-making. This also requires better interfaces that enable reliable communication between humans and machines – in other words, user-friendly platforms like LabV, which present results in a clear and understandable way and allow for follow-up queries.  

The authors emphasize that benchmarks like ChemBench are just a first step – what’s needed now are user-friendly systems where AI not only provides answers but also reveals its uncertainties. 
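One concrete way such a system could reveal its uncertainty is to attach a confidence estimate to every answer and abstain below a threshold, routing the question to a human instead. The sketch below is an assumed design, not anything from the study or from LabV; the `CONFIDENCE_THRESHOLD` value and the `ModelAnswer` structure are illustrative.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.7  # assumed cutoff; would be tuned per application

@dataclass
class ModelAnswer:
    text: str
    confidence: float  # model's self-estimated probability of being correct

def present(ans: ModelAnswer) -> str:
    """Surface the answer only when confidence clears the threshold;
    otherwise flag it for human review rather than asserting it."""
    if ans.confidence >= CONFIDENCE_THRESHOLD:
        return f"{ans.text} (confidence {ans.confidence:.0%})"
    return f"Uncertain (confidence {ans.confidence:.0%}) – route to a chemist for review"

print(present(ModelAnswer("3 constitutional isomers", 0.92)))
print(present(ModelAnswer("12 isomers", 0.35)))
```

The design choice matters: an abstaining model trades coverage for trust, which is exactly the behavior the authors call for in sensitive research areas.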

A Look Ahead: What Comes After ChemBench?

The study makes it clear: AI is capable of solving certain chemistry tasks faster and more reliably than humans – but it remains limited in its ability to conduct structural and intuitive analysis. The next step lies in developing intelligent agent systems that can handle not just text, but also chemical formulas, molecular structures, and experimental data – in other words, all the diverse types of information that play a role in everyday lab work.

“The real challenge will be to develop models that not only give correct answers but also recognise when they might be wrong,” the study states. 

Such systems could, for example, compare experimental parameters with literature data in early-stage material development, propose alternative synthesis routes, or interact directly with lab automation systems. In this way, AI would become not just a knowledge base, but an active research partner – with the potential to spark entirely new innovation processes. 

What does this mean for everyday lab practice?

The ChemBench study makes one thing clear: artificial intelligence can complement human expertise—but only if it’s embedded in context, controlled, and critically examined. This is precisely where platforms like LabV come in. As a Material Intelligence Platform, LabV is not designed to replace human judgment but to support decision-making through transparent data integration, traceable analyses, and well-structured interfaces. 

A hybrid approach that combines human intuition with machine efficiency is essential—and it will determine whether AI in the lab becomes a useful tool or an uncontrollable black box. 

Conclusion: The Future Is Hybrid

ChemBench shows how far AI has come in chemistry – and where its understanding still ends. The study is a wake-up call: Anyone using AI in the lab must understand it, control it, and apply it correctly. Then, it can become an unbeatable partner. 

“Our research shows that AI can be an important complement to human expertise – not a replacement, but a valuable tool to support the work,” summarises Kevin Jablonka. “With that, our study lays the foundation for closer collaboration between AI and human expertise in chemistry.” 

“Although today’s systems are still far from thinking like a chemist, ChemBench can be a stepping stone toward that goal.” – Nature Chemistry

AI has passed – but it hasn’t earned its PhD yet.

Sign up for our newsletter.

Stay up to date on the newest trends and topics.