The Virology Capabilities Test (VCT): A Benchmark for LLM Capabilities in Virology Laboratory Protocols
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
A critical empirical contribution to AI biosecurity risk assessment: it demonstrates that frontier LLMs now surpass domain experts in virology troubleshooting, a finding directly relevant to debates about capability evaluations and biosecurity governance for advanced AI models.
Paper Details
Metadata
Abstract
We present the Virology Capabilities Test (VCT), a large language model (LLM) benchmark that measures the capability to troubleshoot complex virology laboratory protocols. Constructed from the inputs of dozens of PhD-level expert virologists, VCT consists of 322 multimodal questions covering fundamental, tacit, and visual knowledge that is essential for practical work in virology laboratories. VCT is difficult: expert virologists with access to the internet score an average of 22.1% on questions specifically in their sub-areas of expertise. However, the most performant LLM, OpenAI's o3, reaches 43.8% accuracy, outperforming 94% of expert virologists even within their sub-areas of specialization. The ability to provide expert-level virology troubleshooting is inherently dual-use: it is useful for beneficial research, but it can also be misused. Therefore, the fact that publicly available models outperform virologists on VCT raises pressing governance considerations. We propose that the capability of LLMs to provide expert-level troubleshooting of dual-use virology work should be integrated into existing frameworks for handling dual-use technologies in the life sciences.
Summary
The Virology Capabilities Test (VCT) is a 322-question multimodal benchmark evaluating LLM capabilities in complex virology lab troubleshooting, created with PhD-level expert input. Remarkably, OpenAI's o3 model scores 43.8%, outperforming 94% of human expert virologists who average only 22.1%. The authors argue these dual-use capabilities warrant integration into biosecurity governance frameworks.
Key Points
- o3 achieves 43.8% accuracy on virology troubleshooting tasks, outperforming 94% of PhD-level expert virologists, who average only 22.1%.
- The benchmark covers fundamental, tacit, and visual knowledge across 322 multimodal questions essential for practical virology work.
- Expert virologists were scored only on questions within their own sub-areas of specialization, making the human baseline conservative and the LLM advantage even more striking.
- The benchmark highlights dual-use risks: expert-level virology troubleshooting could assist in developing dangerous pathogens.
- The authors call for LLM biosecurity capabilities to be incorporated into existing dual-use research governance frameworks.
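The "outperforms 94% of experts" claim in the points above is an empirical-percentile comparison: the model's score is checked against the distribution of individual expert scores. A minimal sketch of that computation, using hypothetical per-expert accuracies (the paper reports only the 22.1% average and the 94% figure, not the underlying individual scores):

```python
def fraction_outperformed(model_score, expert_scores):
    """Fraction of experts whose score is strictly below the model's."""
    below = sum(1 for s in expert_scores if s < model_score)
    return below / len(expert_scores)

# Hypothetical per-expert accuracies (fractions), chosen to average 0.221;
# the real VCT expert sample is not public, so these are illustrative only.
experts = [0.08, 0.12, 0.15, 0.18, 0.20, 0.22, 0.24, 0.26, 0.31, 0.45]

print(fraction_outperformed(0.438, experts))  # → 0.9
```

With these illustrative numbers, o3's 43.8% accuracy exceeds 9 of the 10 hypothetical experts, a 0.9 fraction; the paper's 94% figure is the analogous statistic over its actual expert sample.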
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Biosecurity Interventions (Overview) | -- | 50.0 |
| Is EA Biosecurity Work Limited to Restricting LLM Biological Use? | Analysis | 55.0 |
Cached Content Preview
[2504.16137] Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark
Virology Capabilities Test (VCT):
A Multimodal Virology Q&A Benchmark
Jasper Götting^a (SecureBio)
Pedro Medeiros^a (SecureBio; Center for Natural and Human Sciences, Federal University of ABC)
Jon G Sanders^a (SecureBio)
Nathaniel Li^†
Long Phan (Center for AI Safety)
Karam Elabd (SecureBio)
Lennart Justen (Media Lab, Massachusetts Institute of Technology)
Dan Hendrycks (Center for AI Safety)
Seth Donoughe^∗ (SecureBio)
(2025)
Abstract
We present the Virology Capabilities Test (VCT), a large language model (LLM) benchmark that measures the capability to troubleshoot complex virology laboratory protocols.
Constructed from the inputs of dozens of PhD-level expert virologists, VCT consists of 322 multimodal questions covering fundamental, tacit, and visual knowledge that is essential for practical work in virology laboratories. VCT is difficult: expert virologists with access to the internet score an average of 22.1% on questions specifically in their sub-areas of expertise. However, the most performant LLM, OpenAI's o3, reaches 43.8% accuracy, outperforming 94% of expert virologists even within their sub-areas of specialization.
The ability to provide expert-level virology troubleshooting is inherently dual-use: it is useful for beneficial research, but it can also be misused.
Therefore, the fact that publicly available models outperform virologists on VCT raises pressing governance considerations. We propose that the capability of LLMs to provide expert-level troubleshooting of dual-use virology work should be integrated into existing frameworks for handling dual-use technologies in the life sciences.
^a Primary authors. ^† Work conducted while at Center for AI Safety. ^∗ Correspondence to seth@securebio.org
1 Introduction
As AI development accelerates, evaluations have become essential for quantifying the capabilities of large language models (LLMs), particularly in scientific reasoning [33, 23, 22]. However, commonly used benchmarks, such as MMLU [23] or GPQA [50], have important limitations. They often rely on multiple-choice questions with one correct answer among four choices. Although straightforward to create, evaluate, and grade, such benchmarks do not capture rare, tacit, and search-proof knowledge. Moreover, they do not test for reasoning over images, despite multimodality becoming a standard LLM capability with clear real-world applications [68]. Furthermore, many existing benchmarks suffer from false ground-truth labels and rapid saturation [42, 5, 40, 51].
These limitations are particularly consequential for scientifi
... (truncated, 98 KB total)