A study published in Nature this week compared top AI agent systems against human scientists on complex, open-ended research tasks. The humans won decisively, and the agents' failure modes are instructive.
Agents did well on contained, well-specified sub-tasks: retrieval, structured analysis, code execution against clear test harnesses. They collapsed on the parts of the work that define science: deciding which question to pursue, noticing when a result contradicts an assumption too subtly to trigger an explicit check, and revising a plan halfway through when evidence arrives that would make a human stop and reconsider.
The pattern is not new, but the magnitude is. As recently as last year, agent systems were closing the gap with human experts on bounded benchmarks. Open-ended, multi-step work appears to be a different problem class, and the research community is beginning to treat it as one.
None of this argues against the near-term utility of agents. Most valuable work is not open-ended scientific inquiry. Most valuable work is routine, structured, and easy to specify, and agents are good at routine, structured, easy-to-specify work. But the industry's default framing, the claim that we are months away from autonomous research scientists, deserves to be retired.
