Back
Import AI Newsletter
Web: jack-clark.net/
A widely-followed newsletter by Anthropic co-founder Jack Clark; useful for tracking AI capability and policy developments as they emerge, and understanding how prominent safety-oriented figures contextualize new research.
Metadata
Importance: 55/100 · blog post · news
Summary
Import AI is a weekly newsletter by Jack Clark (co-founder of Anthropic and former OpenAI policy director) covering the latest developments in artificial intelligence research, policy, and safety. It curates and analyzes significant AI papers, industry trends, and governance developments, offering expert commentary on their implications. The newsletter is widely read in the AI research and policy community.
Key Points
- Weekly curated roundup of significant AI research papers with accessible summaries and critical commentary
- Covers AI policy, governance, and international competition dynamics alongside technical developments
- Written by Jack Clark, a prominent figure in AI safety and policy with experience at OpenAI and Anthropic
- Tracks compute trends, capability advances, and deployment risks relevant to AI safety considerations
- Serves as an influential signal-aggregator for researchers, policymakers, and safety-focused practitioners
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Governance and Policy | Crux | 66.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# [Import AI](https://jack-clark.net/ "Home")
##### [March 16, 2026](https://jack-clark.net/2026/03/16/importai-449-llms-training-other-llms-72b-distributed-training-run-computer-vision-is-harder-than-generative-text/)
### [ImportAI 449: LLMs training other LLMs; 72B distributed training run; computer vision is harder than generative text](https://jack-clark.net/2026/03/16/importai-449-llms-training-other-llms-72b-distributed-training-run-computer-vision-is-harder-than-generative-text/)
#### by Jack Clark
Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe.
[Subscribe now](https://importai.substack.com/subscribe?)
**Can LLMs autonomously refine other LLMs for new tasks? Somewhat.** _…PostTrainBench shows startling growth in AI capabilities at post-training…_ AI-driven R&D might be the most important thing in all of AI, because it helps us understand whether AI systems might eventually build their own successors. So far, much of the focus on AI R&D has been on components that support AI development (e.g., autonomous creation of AI kernels), or on training base models (e.g., the [NanoGPT speedrun benchmark](https://arxiv.org/abs/2506.22419)). But there’s been less attention paid to fine-tuning – the task of adapting an existing LLM to a new dataset or behavior.
Researchers from the University of Tübingen, the Max Planck Institute for Intelligent Systems, and AI research organization Thoughtful Lab want to change that with PostTrainBench, a benchmark that targets a specific aspect of post-training: improving performance against a given dataset. “Post-training is how raw language models become useful”, the authors write. “Given a clear objective and limited compute, can today’s agents do the technical work?”. The answer appears to be ‘yes, but not as well as humans’.
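To make the task concrete, here is a minimal sketch of the kind of post-training step an agent would have to automate: supervised fine-tuning of a small base model on a task's training split. It assumes a standard Hugging Face setup; the model id, dataset, and hyperparameters are illustrative choices, not PostTrainBench's actual pipeline.

```python
# A minimal sketch, assuming a standard Hugging Face supervised fine-tuning setup;
# the model, dataset, and hyperparameters below are illustrative, not
# PostTrainBench's actual pipeline.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "HuggingFaceTB/SmolLM3-3B"  # one of the four base models listed below

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Train split only: the benchmark's integrity rule forbids training on test data.
train_data = load_dataset("gsm8k", "main", split="train")

def to_features(example):
    # Fold question and answer into a single causal-LM training example.
    text = example["question"] + "\n" + example["answer"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = train_data.map(to_features, remove_columns=train_data.column_names)

args = TrainingArguments(
    output_dir="posttrain-sketch",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-5,
    bf16=True,  # matches the single-H100 setting described below
    logging_steps=50,
    report_to="none",
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
).train()
```

The point of the benchmark is that the agent, not a human, has to make every one of these choices (data source, formatting, hyperparameters) within the compute budget.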
**What are the key features of PostTrainBench?**
- **End-to-end**: “Agents must build their entire training pipeline from scratch”
- **Autonomous**: “Agents operate with full autonomy over data sources, training methods, and experimental strategy.”
- **Resource-bounded:** “Each run is constrained to 10 hours on a single H100 GPU”.
- **Integrity-preserving:** “Agents may not train on benchmark test data, modify the evaluation harness, or substitute a different model.”
**How PostTrainBench works:** “We give a frontier coding agent — Claude Code, Codex CLI, or Gemini CLI — a base language model and a target benchmark”.
- **4 models and 7 benchmarks**: The initial eval runs on four models: Qwen3-1.7B, Qwen3-4B, SmolLM3-3B, Gemma-3-4B. It tests these models across seven distinct benchmarks: AIME 2025, GSM8K, GPQA, HumanEval, BFCL, Arena-Hard, HealthBench-Easy.
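A rough sketch of that run setup, under assumed interfaces: hand a coding agent a base model and a target benchmark, pin it to one GPU, and hard-stop it when the compute budget runs out. The dataclass fields, function names, and CLI invocation are hypothetical stand-ins, not the benchmark's real harness.

```python
# Hypothetical harness sketch for a resource-bounded agent run; names and the
# agent command are illustrative, not PostTrainBench's actual interface.
import os
import subprocess
import time
from dataclasses import dataclass

@dataclass
class RunConfig:
    base_model: str                      # e.g. "Qwen3-1.7B"
    target_benchmark: str                # e.g. "GSM8K"
    budget_seconds: int = 10 * 60 * 60   # "10 hours on a single H100 GPU"
    gpu: str = "0"                       # pin the run to a single device

def run_agent(cfg: RunConfig, agent_cmd: list[str]) -> None:
    """Launch a coding agent and hard-stop it when the compute budget runs out."""
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": cfg.gpu}  # single-GPU constraint
    start = time.monotonic()
    proc = subprocess.Popen(agent_cmd, env=env)
    try:
        proc.wait(timeout=cfg.budget_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # budget exhausted: whatever checkpoint exists is what gets scored
    print(f"agent ran for {time.monotonic() - start:.0f}s")

if __name__ == "__main__":
    cfg = RunConfig(base_model="Qwen3-1.7B", target_benchmark="GSM8K")
    # Illustrative invocation of a coding agent in non-interactive mode.
    run_agent(cfg, ["claude", "-p", f"Post-train {cfg.base_model} to improve {cfg.target_benchmark}"])
```

The scoring and integrity checks (no test-set training, no harness edits, no model substitution) would sit outside this loop, applied to whatever the agent leaves behind.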
**Results – big models win, especially Opus 4.6:** “The top-performing agent — Opus 4.6 running on Claude Code — scores 23.2%, about 3× higher than the 7.5% base model average.”
**But humans are still much better:**
... (truncated, 98 KB total)
Resource ID: f2acda99123c4a09 | Stable ID: NzczZDljOG