OpenAI Deep Research February 2025 - FutureSearch
futuresearch.ai/oaidr-feb-2025/
Relevant to AI safety discussions around deployment reliability, calibration, and the gap between perceived and actual AI research capabilities; useful as a concrete empirical evaluation of a widely-used frontier AI tool.
Metadata
Importance: 42/100 · blog post · analysis
Summary
FutureSearch conducted a detailed evaluation of OpenAI's Deep Research tool shortly after its February 2025 launch, documenting six specific failure cases and identifying key failure modes including overconfidence, poor source selection, and misinformation by omission. The evaluation concludes OAIDR outperforms some competing systems but falls significantly short of human researchers, exhibiting a 'jagged frontier' of inconsistent performance.
Key Points
- OAIDR demonstrates a 'jagged frontier' of performance—inconsistently reliable in ways that make failure hard to predict before deployment
- Overconfidence is a critical flaw: the system confidently reports wrong answers (e.g., 17.5% on Cybench when the correct figure was 34.5%) rather than admitting uncertainty
- Poor source selection: OAIDR frequently prioritizes company blogs and SEO-spam over authoritative sources, undermining research quality
- Misinformation by omission is especially dangerous—incomplete research that appears comprehensive may mislead users more than obvious errors
- Recommended only for non-critical synthesis tasks; risky for topic introductions or time-sensitive factual queries
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| FutureSearch | Organization | 50.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 6 KB
## Key Takeaways
- **As later evidence shows** ([Deep Research Bench](https://futuresearch.ai/deep-research-bench)): while OpenAI's Deep Research tool was initially impressive, it went on to underperform the later release of ChatGPT-o3+search
- **OpenAI Deep Research (OAIDR) shows a "jagged frontier"** of performance—better than some competing systems but significantly worse than intelligent humans
- **Overconfidence is a major issue**: OAIDR often reports wrong answers confidently when it should admit uncertainty
- **Peculiar source selection**: The system frequently chooses company blogs or SEO-spam sites over authoritative sources
- **Risk of misinformation by omission**: Incomplete research that appears comprehensive is particularly dangerous
## Mixed Reactions to OpenAI Deep Research
On February 3, OpenAI launched Deep Research, their long-form research tool, prompting a flurry of interest and mixed reactions. **Many were seriously impressed**, while **others warned of frequent inaccuracies**.

FutureSearch conducted a detailed evaluation to understand OAIDR's true capabilities and limitations.
## The Verdict: Better Than Some, Worse Than Humans
Our evaluation found that **OAIDR is better than some competing systems** but still **significantly worse than human researchers**. The system exhibits what researchers call a **"jagged frontier"** of performance—inconsistent quality that makes it difficult to predict when it will succeed or fail.
### Key Failure Modes
1. **Overconfidence**: Reports wrong answers when it should admit uncertainty
2. **Peculiar source selection**: Prioritizes unreliable sources over authoritative ones
3. **Difficulty reading complex webpages**: Struggles with PDFs, images, and certain website formats
4. **Misinformation by omission**: Produces incomplete research that appears comprehensive
### Recommended Usage
- **Good for**: Synthesizing information where completeness isn't critical
- **Risky for**: Topic introductions (high risk of missing key information)
- **Potentially useful**: Niche, qualitative explorations
## Six Strange Failures: Detailed Examples
We tested OAIDR on six research queries where we knew the correct answers. Here's what we found:
### Failure \#1: Cybench Benchmark Performance
**Query**: Find the highest reported agent performance on the Cybench benchmark.
**OAIDR's answer**: 17.5%
**Correct answer**: 34.5%
**What went wrong**: OAIDR failed to identify the top-performing model (o1) and reported an outdated figure. The system prioritized corroboration from multiple sources over finding the most recent and accurate figure.
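The failure pattern described here—trusting a figure because many pages repeat it, rather than answering the question actually asked (the *highest* reported score)—can be illustrated with a toy selection heuristic. This is a sketch under stated assumptions: the claim data and both selection functions are hypothetical, not OAIDR's actual logic.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    score: float   # reported Cybench performance (%)
    sources: int   # how many pages corroborate this figure
    year: int      # when the figure was published

# Hypothetical claims: the stale 17.5% figure is widely repeated,
# while the correct 34.5% figure appears in a single recent source.
claims = [
    Claim(score=17.5, sources=5, year=2024),
    Claim(score=34.5, sources=1, year=2025),
]

def pick_by_corroboration(claims):
    # Heuristic resembling the failure mode: trust the figure
    # that the most sources agree on.
    return max(claims, key=lambda c: c.sources)

def pick_highest_recent(claims):
    # What the query actually asked for: the highest reported
    # score, breaking ties toward the most recent publication.
    return max(claims, key=lambda c: (c.score, c.year))

print(pick_by_corroboration(claims).score)  # 17.5 — the outdated answer
print(pick_highest_recent(claims).score)    # 34.5 — the correct answer
```

The point of the sketch is that both heuristics are individually reasonable; the error comes from applying the corroboration-weighted one to a query whose objective is a maximum, not a consensus.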
... (truncated, 6 KB total)