Back
Arb Research Work Portfolio
arbresearch.com/work/
Arb Research is a small independent consultancy producing applied research for the AI safety and EA communities; this page catalogs their project portfolio and is useful for understanding the landscape of commissioned safety-relevant research.
Metadata
Importance: 35/100 · homepage
Summary
Arb Research is an independent research organization focused on quantitative and analytical work relevant to AI safety, forecasting, and effective altruism. Their portfolio showcases projects spanning AI risk evaluation, policy analysis, and evidence-based research. The work portfolio page provides an overview of completed and ongoing projects for potential clients and collaborators.
Key Points
- Arb Research produces quantitative, evidence-based analysis on AI safety and related topics for EA-aligned organizations
- Portfolio includes forecasting, policy analysis, and technical evaluations relevant to AI governance and safety
- Works as a contractor/consultancy providing research services to organizations in the AI safety ecosystem
- Projects span a range of topics including risk assessment, talent pipelines, and institutional analysis
- Serves as a reference point for understanding what kinds of applied research are being commissioned in the AI safety space
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Arb Research | Organization | 50.0 |
| Samotsvety | Organization | 61.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 50 KB
[Arb news](https://arbresearch.com/)
[Semantic duplicates confound apparent AI progress](https://www.arxiv.org/abs/2602.12413)
_February 2026_
**Authors**: Gavin, Ari, Juan, Tomas, Peli, Nicky, Nandi
**Client**: Pro bono
* * *
New paper studying AI performance. ( [The Twitter thread](https://x.com/g_leech_/status/2023384075537432662) is an accessible way in.)
In the past 3 years, LLM training corpora expanded by a factor of 10,000. Does the increased chance of including test data confound apparent AI progress? What about including "semantic" duplicates, things which happen to be very similar to test data? And so how much of LLM performance is really down to "local" generalisation (pattern-matching to hard-to-detect, semantically equivalent training data)?
We experiment on OLMo 3, one of the only really good models with open training data. Since we have its entire training corpus, we can exhaustively check for real “natural” duplicates and finetune it to estimate their impact. We embed the entire training corpus.
We were surprised by how ineffective standard n-gram decontamination is at catching even exact duplicates: 70% of harder tasks had a match.
Every single MBPP test example, and 78% of CodeForces examples, have semantic duplicates.
So: n-gram decontamination is not enough even for the easy (exact) stuff, semantic duplicates are at least a moderately big deal, and this probably transfers to frontier models to some degree.
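To make the distinction concrete, here is a minimal sketch of the two kinds of check being contrasted. The thresholds, sentences, and the character-trigram "embedding" are illustrative stand-ins of our own (the paper's actual pipeline embeds the full OLMo training corpus with learned embeddings); the point is only that exact n-gram overlap misses a paraphrase that a similarity measure still flags.

```python
def ngrams(text: str, n: int = 13) -> set:
    """Word-level n-grams, the unit used by standard decontamination."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_contaminated(train_doc: str, test_item: str, n: int = 13) -> bool:
    """Exact-overlap check: flags only verbatim (or near-verbatim) copies."""
    return bool(ngrams(train_doc, n) & ngrams(test_item, n))

def char_trigram_vector(text: str) -> dict:
    """Toy stand-in for a learned embedding: character-trigram counts."""
    t = text.lower()
    vec = {}
    for i in range(len(t) - 2):
        g = t[i:i + 3]
        vec[g] = vec.get(g, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

test_item = "Write a function to reverse a string and return the result."
paraphrase = "Implement a method that returns the reversal of an input string."

# A verbatim copy of the test item is caught by the n-gram check...
assert ngram_contaminated(test_item, test_item, n=8)
# ...but the paraphrase shares no 8-gram with it, so decontamination
# would let this "semantic duplicate" through.
assert not ngram_contaminated(paraphrase, test_item, n=8)
# A similarity score still sees the two as related (nonzero, below 1).
sim = cosine(char_trigram_vector(test_item), char_trigram_vector(paraphrase))
print(round(sim, 2))
```

In practice the semantic check runs nearest-neighbour search over corpus embeddings rather than pairwise comparisons, but the asymmetry shown here is the core of the finding.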
The first entry in a research programme with the grand aim of decomposing benchmark gains / apparent AI progress into 4 estimates:
1. benchmaxxing (memorising the test / paraphrasing / finetuning for a narrow task)
2. usemaxxing (RLing narrow capabilities in)
3. local generalisation (highly sophisticated pattern-matching but all bound by training data)
4. deep OOD generalisation (e.g. algorithms that solve a whole problem class)
Obviously this will take a lot more work to nail down, and three times that much in closed models.
**[Arxiv](https://www.arxiv.org/abs/2602.12413), [code](https://github.com/AriSpiesberger/Soft-Contamination-Prevelance), [long explainer](https://x.com/g_leech_/status/2023384075537432662)**
* * *
[2025 review](https://arbresearch.com/work/)
_December 2025_
**Authors**: Gavin, Misha, Kristie, Juan, Tomáš, Peli, Phil, Simon, Sam, Rory, Rian, Stag, Ari, Nicky, Nandi, Paul, Stephen, Kristi, David, Lydia, Vidur, Jord, Ozzie, Olga, Lauren, Ulkar, Jooda
**Client**: Misc
* * *
Most of our work this year was again private; we completed 37 projects with 4.8 FTE and spent 3 months colocated in Stockholm and London.
### Public highlights
- Dwarkesh and Gavin’s book finally [came out](https://arbresearch.com/work/#Stripe%20Press%20book%20out) in Stripe Press.
- We collected [200](https://frontier2025.netlify.app/) of the bigge
... (truncated, 50 KB total)
Resource ID: 998a34388901cfbe | Stable ID: OTYzODk3ZW