Virology Capabilities Test Paper
Paper Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
A benchmark paper evaluating LLM capabilities on complex virology tasks, demonstrating that advanced models like o3 can outperform human experts on specialized domain knowledge, relevant to understanding AI capability emergence and potential dual-use risks in biotech domains.
Paper Details
Metadata
Summary
The Virology Capabilities Test (VCT) is a 322-question multimodal benchmark designed to evaluate large language models' ability to troubleshoot complex virology laboratory protocols. Created with input from PhD-level expert virologists, the benchmark covers fundamental, tacit, and visual knowledge essential for practical virology work. While expert virologists score only 22.1% on average within their areas of specialization, OpenAI's o3 model achieves 43.8% accuracy, outperforming 94% of expert virologists. The authors highlight that this capability is inherently dual-use—beneficial for research but potentially misusable—and argue that LLM capabilities for expert-level virology troubleshooting should be integrated into existing dual-use technology governance frameworks in the life sciences.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| SecureBio | Organization | 65.0 |
Cached Content Preview
Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark
Jasper Götting a
SecureBio
Pedro Medeiros a
SecureBio
Center for Natural and Human Sciences, Federal University of ABC
Jon G Sanders a
SecureBio
Nathaniel Li †
Long Phan
Center for AI Safety
Karam Elabd
SecureBio
Lennart Justen
Media Lab, Massachusetts Institute of Technology
Dan Hendrycks
Center for AI Safety
Seth Donoughe ∗
SecureBio
(2025)
Abstract
We present the Virology Capabilities Test (VCT), a large language model (LLM) benchmark that measures the capability to troubleshoot complex virology laboratory protocols.
Constructed from the inputs of dozens of PhD-level expert virologists, VCT consists of 322 multimodal questions covering fundamental, tacit, and visual knowledge that is essential for practical work in virology laboratories. VCT is difficult: expert virologists with access to the internet score an average of 22.1% on questions specifically in their sub-areas of expertise. However, the most performant LLM, OpenAI’s o3, reaches 43.8% accuracy, outperforming 94% of expert virologists even within their sub-areas of specialization.
The ability to provide expert-level virology troubleshooting is inherently dual-use: it is useful for beneficial research, but it can also be misused.
Therefore, the fact that publicly available models outperform virologists on VCT raises pressing governance considerations. We propose that the capability of LLMs to provide expert-level troubleshooting of dual-use virology work should be integrated into existing frameworks for handling dual-use technologies in the life sciences.
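The headline comparison above ("outperforming 94% of expert virologists") can be read as a percentile rank of the model's accuracy within the distribution of expert scores. A minimal sketch of that calculation, with invented expert scores purely for illustration (the paper's actual per-expert scores are not reproduced here):

```python
def fraction_outperformed(model_score, expert_scores):
    """Fraction of experts whose accuracy is strictly below the model's."""
    below = sum(1 for s in expert_scores if s < model_score)
    return below / len(expert_scores)

# Invented expert accuracies (%), centered near the reported 22.1% average.
experts = [10.0, 15.5, 22.1, 30.0, 41.0]
print(fraction_outperformed(43.8, experts))  # 1.0 for this toy sample
```

With the paper's real data, this fraction comes out to 0.94; the toy sample here is too small and too low-scoring to reproduce that figure.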
ᵃ Primary authors. † Work conducted while at Center for AI Safety. ∗ Correspondence to seth@securebio.org
1 Introduction
As AI development accelerates, evaluations have become essential for quantifying the capabilities of large language models (LLMs), particularly in scientific reasoning [33, 23, 22]. However, commonly used benchmarks, such as MMLU [23] or GPQA [50], have important limitations. They often rely on multiple-choice questions with one correct answer among four choices. Although straightforward to create, evaluate, and grade, such benchmarks do not capture rare, tacit, and search-proof knowledge. Moreover, they do not test for reasoning over images, despite multimodality becoming a standard LLM capability with clear real-world applications [68]. Furthermore, many existing benchmarks suffer from false ground-truth labels and rapid saturation [42, 5, 40, 51].
These limitations are part
... (truncated, 98 KB total)