Epoch AI Expanded Biological Model Dataset: Tracking 360+ ML Models in Biology

web

Epoch AI·epoch.ai/blog/announcing-expanded-biology-ai-coverage

Credibility Rating

4/5

High (4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Epoch AI

Relevant to AI governance and biosecurity communities tracking the gap between the deployment of AI capabilities in high-risk domains like biology and the adoption of safety measures. The finding that fewer than 3% of models in the dataset have any dual-use safeguards is a notable empirical data point for biosecurity policy discussions.

Metadata

Importance: 62/100 · blog post · dataset

Summary

Epoch AI announces an expansion of its Biological Model Dataset to over 360 ML models used in biology (drug design, protein engineering, genomics), including training compute estimates and information on dual-use safeguards. The dataset shows a substantial increase in training compute and dataset sizes from 2017 to 2021, followed by a relative slowdown, and finds that fewer than 3% of the models in the dataset have any dual-use safeguards in place.

Key Points

• Dataset now tracks 360+ ML models in biology with details on developers, tasks, training data, and compute estimates
• Training compute and dataset sizes grew substantially from 2017 to 2021, followed by a relative slowdown, suggesting the pace of scaling may be changing
• Fewer than 3% of models in the dataset have any dual-use safeguards (data filtering, risk evaluations, inference-time refusal, access controls), raising biosecurity concerns
• The most capable models, such as the large foundation model ESM 3 and the powerful specialized model AlphaFold 3, tend to have more safeguards than other models
• Detailed safeguards data is available only upon request to protect sensitive information while enabling responsible research

Cited by 1 page

Page	Type	Quality
Scientific Research Capabilities	Capability	68.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 4 KB

Announcing our expanded biology AI coverage | Epoch AI

 We’re pleased to announce an expansion of our Biological Model Dataset, a component of Epoch AI’s larger database of machine learning models. As the role of AI in biology continues to grow—powering advances in drug design, protein engineering, and genomics—the opportunities and governance challenges posed by biological AI models increase the importance of tracking advances in this field.

 Our goal with this project is to provide a comprehensive resource for researchers and policymakers. To this end, we have curated information from over 360 models in this update, prioritizing recent models at the frontier of capability, scale, or scientific impact. Alongside details on their developers, intended tasks, and training datasets, we’ve included new estimates of the training compute that went into developing them.

 [Interactive visualization not rendered: training compute and dataset sizes of biological models over time]

 Analyzing compute and data trends can help us understand how invested the field is in scaling as a means to increase performance. The plot above tracks the evolution of training compute and dataset sizes in biological models, highlighting a substantial increase from 2017 to 2021, followed by a relative slowdown. This visualization underscores how quickly the field has advanced—and also suggests that the pace may be changing. By providing transparent, easy-to-explore compute estimates, we hope to enable deeper discussion of what’s driving progress and where bottlenecks may arise in the near future.
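
 One way to put a number on a trend like this is to fit a straight line to log10 of training compute against publication year and compare the slope across periods. The snippet below is a minimal sketch of that idea, not the method used for the plot. The function name `log10_slope_per_year`, the record layout, and every value in `toy_models` are invented for illustration and are not taken from the dataset.

```python
import math


def log10_slope_per_year(points):
    """Least-squares slope of log10(compute) against year.

    points: iterable of (publication_year, training_compute_flop) pairs.
    Returns the fitted growth in orders of magnitude per year.
    """
    points = list(points)
    xs = [float(year) for year, _ in points]
    ys = [math.log10(flop) for _, flop in points]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct years")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Invented illustrative values, NOT taken from the Epoch AI dataset.
toy_models = [
    (2017, 1e18), (2018, 1e19), (2019, 1e20), (2020, 1e21), (2021, 1e22),
    (2022, 2e22), (2023, 4e22), (2024, 8e22),
]

# Split at 2021, the year the text identifies as the turning point.
early = [p for p in toy_models if p[0] <= 2021]
late = [p for p in toy_models if p[0] >= 2021]

early_slope = log10_slope_per_year(early)
late_slope = log10_slope_per_year(late)
print(f"2017-2021: {early_slope:.2f} orders of magnitude per year")
print(f"2021-2024: {late_slope:.2f} orders of magnitude per year")
```

 With real records, a lower slope in the later period would be consistent with the relative slowdown described above, though a simple fit like this says nothing about its cause.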

 Finally, because biological models can pose dual-use concerns, we have compiled information about safeguards that developers have adopted to mitigate such risks, such as data filtering, risk evaluations, inference-time refusal, and access controls. In our dataset, fewer than 3% of models have any such safeguards, although the most capable models (large foundation models like EvolutionaryScale’s ESM 3, or powerful specialized models like AlphaFold 3) tend to have more safeguards. We encourage developers to continue sharing best practices for mitigating potential misuse.
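
 A share like the one quoted above can be computed by counting a model as safeguarded if at least one safeguard type is recorded for it. The sketch below assumes a hypothetical record layout with one boolean field per safeguard type named in the text. The field names and the toy records are assumptions for illustration and do not reflect the actual dataset schema or its contents.

```python
# Hypothetical field names, one per safeguard type named in the text.
SAFEGUARD_FIELDS = (
    "data_filtering",
    "risk_evaluation",
    "inference_time_refusal",
    "access_controls",
)


def share_with_any_safeguard(models):
    """Fraction of models with at least one safeguard field set to True."""
    models = list(models)
    if not models:
        raise ValueError("need at least one model")
    flagged = sum(
        1
        for model in models
        if any(model.get(field, False) for field in SAFEGUARD_FIELDS)
    )
    return flagged / len(models)


# Invented toy records, NOT taken from the Epoch AI dataset.
toy_records = [
    {"name": "model_a", "access_controls": True, "risk_evaluation": True},
    {"name": "model_b"},
    {"name": "model_c", "data_filtering": True},
    {"name": "model_d"},
]

share = share_with_any_safeguard(toy_records)
print(f"{share:.0%} of toy records have at least one safeguard")
```

 Counting "any safeguard" rather than each type separately matches the wording of the statistic, but it treats a model with one safeguard the same as a model with several.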

 To protect sensitive information about model safeguards while enabling responsible research, detailed safeguards data is available upon request. Researchers and developers interested in accessing this information can email [email protected].

 You can find additional information about the dataset in our database documentation.

 Sentinel Bio provided a grant to fund this data collection project and make it publicly available. Epoch AI owns the resulting dataset. We thank them for their generous support.

 About the authors

 Pablo Villalobos has a background in Mathematics and Computer Science. After spending some time as a software engineer, he decided to pivot towards AI. His interests include the economic consequences of advanced AI systems

... (truncated, 4 KB total)

Resource ID: 215d1160b90a9948 | Stable ID: sid_v95fB9OFTZ