Longterm Wiki

EleutherAI Evaluation

Web: eleuther.ai

EleutherAI is a key player in open-source AI research; its LM Evaluation Harness is widely used for safety and capabilities benchmarking, making the organization relevant to researchers studying model evaluation and alignment.

Metadata

Importance: 55/100 · homepage

Summary

EleutherAI is a decentralized, nonprofit AI research organization focused on open-source AI development, interpretability, and evaluation. They are known for creating large language models like GPT-NeoX and the Pile dataset, as well as the widely used LM Evaluation Harness. Their work emphasizes democratizing AI research and providing open alternatives to proprietary models.

Key Points

  • Developed the LM Evaluation Harness, a standard open-source framework for benchmarking large language models across diverse tasks.
  • Produced major open-source models and datasets including GPT-NeoX, GPT-J, and the Pile, enabling reproducible AI research.
  • Conducts research in interpretability, alignment, and AI safety alongside capabilities work.
  • Operates as a decentralized nonprofit, aiming to democratize access to large-scale AI research outside of large corporations.
  • Collaborates broadly with academic and independent researchers to advance open, transparent AI development.
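As an illustrative sketch of how the LM Evaluation Harness is typically invoked (assuming the `lm-eval` package is installed and the CLI of recent releases; the model and task names here are example choices, not recommendations from this page):

```shell
# Illustrative invocation of EleutherAI's LM Evaluation Harness CLI.
# Assumes `pip install lm-eval`; model and tasks are example choices.
lm_eval --model hf \
  --model_args pretrained=EleutherAI/pythia-160m \
  --tasks lambada_openai,hellaswag \
  --batch_size 8
```

The harness prints a per-task table of metrics (e.g. accuracy, perplexity), which is what makes cross-model benchmark comparisons reproducible.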

Cited by 4 pages

1 FactBase fact citing this source

Entity | Property | Value | As Of
Connor Leahy | Role / Title | Co-founder | 2020

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 5 KB
EleutherAI

Explore our research

Interpreting Across Time
How do properties of models emerge and evolve over the course of training?

Eliciting Latent Knowledge
As models get smarter, humans won't always be able to independently check if a model's claims are true or false. We aim to circumvent this issue by directly eliciting latent knowledge (ELK) inside the model's activations.

Training LLMs
EleutherAI has trained and released many powerful open source LLMs.

Recent Publications

Feb 16, 2026 · arXiv
Quantifying the Effect of Test Set Contamination on Generative Evaluations
 As frontier AI systems are pretrained on web-scale data, test set contamination has become a critical concern for accurately assessing their capabilities. While research has thoroughly investigated the impact of test set contamination on discriminative evaluations like multiple-choice question-answering, comparatively little research has studied the impact of test set contamination on generative evaluations. In this work, we quantitatively assess the effect of test set contamination on generative evaluations through the language model lifecycle. We pretrain language models on mixtures of web data and the MATH benchmark, sweeping model sizes and number of test set replicas contaminating the pretraining corpus; performance improves with contamination and model size. Using scaling laws, we make a surprising discovery: including even a single test set replica enables models to achieve lower loss than the irreducible error of training on the uncontaminated corpus. We then study further training: overtraining with fresh data reduces the effects of contamination, whereas supervised finetuning on the training set can either increase or decrease performance on test data, depending on the amount of pretraining contamination. Finally, at inference, we identify factors that modulate memorization: high sampling temperatures mitigate contamination ef

... (truncated, 5 KB total)
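The abstract above concerns test set contamination in pretraining corpora. As an aside, a common heuristic for flagging such contamination (not the paper's method, which deliberately injects test set replicas) is n-gram overlap between a test document and the corpus; a minimal sketch:

```python
# Illustrative sketch: n-gram overlap, a common heuristic for flagging
# test set contamination. Not the methodology of the cited paper.
def ngrams(tokens, n=8):
    """Set of all contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(test_doc: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the test document's n-grams that also appear in the corpus."""
    test = ngrams(test_doc.split(), n)
    if not test:
        return 0.0
    corpus = ngrams(corpus_doc.split(), n)
    return len(test & corpus) / len(test)
```

A rate near 1.0 suggests the test document was seen verbatim during pretraining; production contamination checks additionally normalize whitespace and casing and use tokenizer-level n-grams.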
Resource ID: 120b456b2f9481b0 | Stable ID: MWIzZDk1OD