Virology Capabilities Test Paper
Paper Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
A benchmark paper evaluating LLM capabilities on complex virology tasks, demonstrating that advanced models like o3 can outperform human experts on specialized domain knowledge, relevant to understanding AI capability emergence and potential dual-use risks in biotech domains.
Paper Details
Metadata
Summary
The Virology Capabilities Test (VCT) is a 322-question multimodal benchmark designed to evaluate large language models' ability to troubleshoot complex virology laboratory protocols. Created with input from PhD-level expert virologists, the benchmark covers fundamental, tacit, and visual knowledge essential for practical virology work. While expert virologists score only 22.1% on average within their areas of specialization, OpenAI's o3 model achieves 43.8% accuracy, outperforming 94% of expert virologists. The authors highlight that this capability is inherently dual-use—beneficial for research but potentially misusable—and argue that LLM capabilities for expert-level virology troubleshooting should be integrated into existing dual-use technology governance frameworks in the life sciences.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| SecureBio | Organization | 65.0 |
Cached Content Preview
Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark
Jasper Götting a
SecureBio
Pedro Medeiros a
SecureBio
Center for Natural and Human Sciences, Federal University of ABC
Jon G Sanders a
SecureBio
Nathaniel Li †
Long Phan
Center for AI Safety
Karam Elabd
SecureBio
Lennart Justen
Media Lab, Massachusetts Institute of Technology
Dan Hendrycks
Center for AI Safety
Seth Donoughe ∗
SecureBio
(2025)
Abstract
We present the Virology Capabilities Test (VCT), a large language model (LLM) benchmark that measures the capability to troubleshoot complex virology laboratory protocols.
Constructed from the inputs of dozens of PhD-level expert virologists, VCT consists of 322 multimodal questions covering fundamental, tacit, and visual knowledge that is essential for practical work in virology laboratories. VCT is difficult: expert virologists with access to the internet score an average of 22.1% on questions specifically in their sub-areas of expertise. However, the most performant LLM, OpenAI’s o3, reaches 43.8% accuracy, outperforming 94% of expert virologists even within their sub-areas of specialization.
The ability to provide expert-level virology troubleshooting is inherently dual-use: it is useful for beneficial research, but it can also be misused.
Therefore, the fact that publicly available models outperform virologists on VCT raises pressing governance considerations. We propose that the capability of LLMs to provide expert-level troubleshooting of dual-use virology work should be integrated into existing frameworks for handling dual-use technologies in the life sciences.
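The headline comparison above ("outperforming 94% of expert virologists") can be read as a percentile rank of the model's accuracy within the distribution of expert scores. A minimal sketch of that calculation, with invented expert scores purely for illustration (the paper's actual per-expert scores are not reproduced here):

```python
def fraction_outperformed(model_score, expert_scores):
    """Fraction of experts whose accuracy is strictly below the model's."""
    below = sum(1 for s in expert_scores if s < model_score)
    return below / len(expert_scores)

# Invented expert accuracies (%), centered near the reported 22.1% average.
experts = [10.0, 15.5, 22.1, 30.0, 41.0]
print(fraction_outperformed(43.8, experts))  # 1.0 for this toy sample
```

With the paper's real data, this fraction comes out to 0.94; the toy sample here is too small and too low-scoring to reproduce that figure.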
ᵃ Primary authors. † Work conducted while at Center for AI Safety. ∗ Correspondence to seth@securebio.org
1 Introduction
As AI development accelerates, evaluations have become essential for quantifying the capabilities of large language models (LLMs), particularly in scientific reasoning [33, 23, 22]. However, commonly used benchmarks, such as MMLU [23] or GPQA [50], have important limitations. They often rely on multiple-choice questions with one correct answer among four choices. Although straightforward to create, evaluate, and grade, such benchmarks do not capture rare, tacit, and search-proof knowledge. Moreover, they do not test for reasoning over images, despite multimodality becoming a standard LLM capability with clear real-world applications [68]. Furthermore, many existing benchmarks suffer from false ground-truth labels and rapid saturation [42, 5, 40, 51].
These limitations are part
... (truncated, 98 KB total)