The Virology Capabilities Test (VCT): A Benchmark for LLM Capabilities in Virology Laboratory Protocols
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
A critical empirical contribution to AI biosecurity risk assessment: it demonstrates that frontier LLMs now surpass domain experts in virology troubleshooting, a finding directly relevant to debates about capability evaluations and biosecurity governance for advanced AI models.
Paper Details
Metadata
Abstract
We present the Virology Capabilities Test (VCT), a large language model (LLM) benchmark that measures the capability to troubleshoot complex virology laboratory protocols. Constructed from the inputs of dozens of PhD-level expert virologists, VCT consists of 322 multimodal questions covering fundamental, tacit, and visual knowledge that is essential for practical work in virology laboratories. VCT is difficult: expert virologists with access to the internet score an average of 22.1% on questions specifically in their sub-areas of expertise. However, the most performant LLM, OpenAI's o3, reaches 43.8% accuracy, outperforming 94% of expert virologists even within their sub-areas of specialization. The ability to provide expert-level virology troubleshooting is inherently dual-use: it is useful for beneficial research, but it can also be misused. Therefore, the fact that publicly available models outperform virologists on VCT raises pressing governance considerations. We propose that the capability of LLMs to provide expert-level troubleshooting of dual-use virology work should be integrated into existing frameworks for handling dual-use technologies in the life sciences.
Summary
The Virology Capabilities Test (VCT) is a 322-question multimodal benchmark evaluating LLM capabilities in complex virology lab troubleshooting, created with PhD-level expert input. Remarkably, OpenAI's o3 model scores 43.8%, outperforming 94% of human expert virologists who average only 22.1%. The authors argue these dual-use capabilities warrant integration into biosecurity governance frameworks.
Key Points
- o3 achieves 43.8% accuracy on virology troubleshooting tasks, outperforming 94% of PhD-level expert virologists, who average only 22.1%.
- The benchmark covers fundamental, tacit, and visual knowledge across 322 multimodal questions essential for practical virology work.
- Expert virologists were scored only on questions within their own sub-areas of specialization, making the human baseline conservative and the LLM advantage even more striking.
- The benchmark highlights dual-use risks: expert-level virology troubleshooting could assist in developing dangerous pathogens.
- The authors call for LLM biosecurity capabilities to be incorporated into existing dual-use research governance frameworks.
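The "outperforms 94% of experts" claim in the points above is an empirical-percentile comparison: the model's score is checked against the distribution of individual expert scores. A minimal sketch of that computation, using hypothetical per-expert accuracies (the paper reports only the 22.1% average and the 94% figure, not the underlying individual scores):

```python
def fraction_outperformed(model_score, expert_scores):
    """Fraction of experts whose score is strictly below the model's."""
    below = sum(1 for s in expert_scores if s < model_score)
    return below / len(expert_scores)

# Hypothetical per-expert accuracies (fractions), chosen to average 0.221;
# the real VCT expert sample is not public, so these are illustrative only.
experts = [0.08, 0.12, 0.15, 0.18, 0.20, 0.22, 0.24, 0.26, 0.31, 0.45]

print(fraction_outperformed(0.438, experts))  # → 0.9
```

With these illustrative numbers, o3's 43.8% accuracy exceeds 9 of the 10 hypothetical experts, a 0.9 fraction; the paper's 94% figure is the analogous statistic over its actual expert sample.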
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Biosecurity Interventions (Overview) | -- | 50.0 |
| Is EA Biosecurity Work Limited to Restricting LLM Biological Use? | Analysis | 55.0 |
Cached Content Preview
[2504.16137] Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark
Virology Capabilities Test (VCT):
A Multimodal Virology Q&A Benchmark
Jasper Götting^a (SecureBio)
Pedro Medeiros^a (SecureBio; Center for Natural and Human Sciences, Federal University of ABC)
Jon G Sanders^a (SecureBio)
Nathaniel Li^†
Long Phan (Center for AI Safety)
Karam Elabd (SecureBio)
Lennart Justen (Media Lab, Massachusetts Institute of Technology)
Dan Hendrycks (Center for AI Safety)
Seth Donoughe^∗ (SecureBio)
(2025)
Abstract
We present the Virology Capabilities Test (VCT), a large language model (LLM) benchmark that measures the capability to troubleshoot complex virology laboratory protocols.
Constructed from the inputs of dozens of PhD-level expert virologists, VCT consists of 322 multimodal questions covering fundamental, tacit, and visual knowledge that is essential for practical work in virology laboratories. VCT is difficult: expert virologists with access to the internet score an average of 22.1% on questions specifically in their sub-areas of expertise. However, the most performant LLM, OpenAI's o3, reaches 43.8% accuracy, outperforming 94% of expert virologists even within their sub-areas of specialization.
The ability to provide expert-level virology troubleshooting is inherently dual-use: it is useful for beneficial research, but it can also be misused.
Therefore, the fact that publicly available models outperform virologists on VCT raises pressing governance considerations. We propose that the capability of LLMs to provide expert-level troubleshooting of dual-use virology work should be integrated into existing frameworks for handling dual-use technologies in the life sciences.
^a Primary authors. ^† Work conducted while at Center for AI Safety. ^∗ Correspondence to seth@securebio.org
1 Introduction
As AI development accelerates, evaluations have become essential for quantifying the capabilities of large language models (LLMs), particularly in scientific reasoning [33, 23, 22]. However, commonly used benchmarks, such as MMLU [23] or GPQA [50], have important limitations. They often rely on multiple-choice questions with one correct answer among four choices. Although straightforward to create, evaluate, and grade, such benchmarks do not capture rare, tacit, and search-proof knowledge. Moreover, they do not test for reasoning over images, despite multimodality becoming a standard LLM capability with clear real-world applications [68]. Furthermore, many existing benchmarks suffer from false ground-truth labels and rapid saturation [42, 5, 40, 51].
These limitations are particularly consequential for scientifi
... (truncated, 98 KB total)