EleutherAI Evaluation
web · eleuther.ai/
EleutherAI is a key player in open-source AI research; their LM Evaluation Harness is widely used in safety and capabilities benchmarking, making them relevant to researchers studying model evaluation and alignment.
Metadata
Importance: 55/100 · homepage
Summary
EleutherAI is a decentralized, nonprofit AI research organization focused on open-source AI development, interpretability, and evaluation. They are known for creating large language models like GPT-NeoX and the Pile dataset, as well as the widely used LM Evaluation Harness. Their work emphasizes democratizing AI research and providing open alternatives to proprietary models.
Key Points
- Developed the LM Evaluation Harness, a standard open-source framework for benchmarking large language models across diverse tasks.
- Produced major open-source models and datasets including GPT-NeoX, GPT-J, and the Pile, enabling reproducible AI research.
- Conducts research in interpretability, alignment, and AI safety alongside capabilities work.
- Operates as a decentralized nonprofit, aiming to democratize access to large-scale AI research outside of large corporations.
- Collaborates broadly with academic and independent researchers to advance open, transparent AI development.
Cited by 4 pages
| Page | Type | Quality |
|---|---|---|
| Instrumental Convergence Framework | Analysis | 60.0 |
| AI Knowledge Monopoly | Risk | 50.0 |
| AI Value Lock-in | Risk | 64.0 |
| AI Proliferation | Risk | 60.0 |
1 FactBase fact citing this source
| Entity | Property | Value | As Of |
|---|---|---|---|
| Connor Leahy | Role / Title | Co-founder | 2020 |
Cached Content Preview
HTTP 200 · Fetched Mar 15, 2026 · 5 KB
EleutherAI
Explore our research
Interpreting Across Time
How do properties of models emerge and evolve over the course of training?
Eliciting Latent Knowledge
As models get smarter, humans won't always be able to independently check if a model's claims are true or false. We aim to circumvent this issue by directly eliciting latent knowledge (ELK) inside the model’s activations.
Training LLMs
EleutherAI has trained and released many powerful open source LLMs.
Recent Publications
Feb 16, 2026
arXiv
Quantifying the Effect of Test Set Contamination on Generative Evaluations
As frontier AI systems are pretrained on web-scale data, test set contamination has become a critical concern for accurately assessing their capabilities. While research has thoroughly investigated the impact of test set contamination on discriminative evaluations like multiple-choice question-answering, comparatively little research has studied the impact of test set contamination on generative evaluations. In this work, we quantitatively assess the effect of test set contamination on generative evaluations through the language model lifecycle. We pretrain language models on mixtures of web data and the MATH benchmark, sweeping model sizes and number of test set replicas contaminating the pretraining corpus; performance improves with contamination and model size. Using scaling laws, we make a surprising discovery: including even a single test set replica enables models to achieve lower loss than the irreducible error of training on the uncontaminated corpus. We then study further training: overtraining with fresh data reduces the effects of contamination, whereas supervised finetuning on the training set can either increase or decrease performance on test data, depending on the amount of pretraining contamination. Finally, at inference, we identify factors that modulate memorization: high sampling temperatures mitigate contamination ef
... (truncated, 5 KB total)
Resource ID: 120b456b2f9481b0 | Stable ID: MWIzZDk1OD
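The cached abstract measures how test-set contamination inflates generative evaluations. A common heuristic for *detecting* such contamination (not the paper's method, which runs controlled pretraining experiments) is checking n-gram overlap between a benchmark item and the pretraining corpus. A minimal sketch, with made-up example strings:

```python
# Illustrative sketch: word-level n-gram overlap as a contamination heuristic.
# This is NOT the cited paper's methodology; the corpus and benchmark items
# below are hypothetical examples.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_item: str, corpus: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus, n)) / len(item_grams)

# Hypothetical data for illustration.
corpus = "the quick brown fox jumps over the lazy dog near the riverbank at dawn"
clean_item = "a completely unrelated question about modular arithmetic and primes today"
leaked_item = "quick brown fox jumps over the lazy dog near the riverbank"

print(contamination_rate(clean_item, corpus, n=4))   # → 0.0
print(contamination_rate(leaked_item, corpus, n=4))  # → 1.0 (fully contained in corpus)
```

In practice, contamination audits of web-scale corpora use longer n-grams (8 to 13 words is typical) to avoid flagging incidental phrase reuse; the paper goes further by directly measuring how injected test-set replicas change downstream loss and benchmark scores.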