Making AI work through eval hygiene
Anthropic, of all companies, just shipped three quality regressions in Claude Code that its own evals didn’t catch. Think about that. Three regressions in six weeks, from the most sophisticated eval shop in AI. If this can happen to Anthropic, it can certainly happen to you, and it probably will.

In a refreshingly candid postmortem, Anthropic walked through what went wrong. On March 4, the team flipped Claude Code’s default reasoning effort from high to medium because internal evals showed only “slightly lower intelligence with significantly less latency for the majority of tasks.” On March 26, a caching optimization meant to clear stale thinking once an idle hour had passed shipped with a bug that cleared it on every turn instead. On April 16, two innocuous-looking lines of system prompt asking Claude to be more concise turned out to cost 3% on coding quality, a regression visible only on a wider ablation suite that wasn’t part of the standard release gate. From inside the org, none of it tripped an alarm.
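Anthropic didn’t publish the offending code, but the March 26 incident reads like a classic cache-invalidation slip: a clear that was supposed to be guarded by an idle-time check effectively ran on every turn. A minimal Python sketch of that bug pattern, with every name hypothetical:

```python
import time

IDLE_EXPIRY_SECONDS = 3600  # intended: clear stale thinking after an idle hour


class ThinkingCache:
    """Holds the model's extended-thinking context between turns.
    Hypothetical reconstruction; Anthropic did not publish code."""

    def __init__(self) -> None:
        self.thinking: str | None = None
        self.last_turn_at = time.monotonic()

    def on_turn(self, new_thinking: str) -> None:
        now = time.monotonic()
        # Intended behavior: drop the cached thinking only after
        # the session has sat idle for more than an hour.
        if now - self.last_turn_at > IDLE_EXPIRY_SECONDS:
            self.thinking = None
        # The shipped bug behaved as if the guard above were deleted
        # and the clear ran unconditionally, so the thinking context
        # vanished on every single turn.
        self.thinking = (self.thinking or "") + new_thinking
        self.last_turn_at = now
```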
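The April 16 incident is the clearest argument for widening the release gate itself: had the ablation suite been part of the gate, the 3% drop would have blocked the prompt change before it shipped. A toy sketch of such a gate, with suite names, scores, and the regression budget all invented for illustration:

```python
from statistics import mean

# Hypothetical per-task scores in [0, 1]; in reality these come from
# grading model transcripts. All numbers are invented for illustration.
BASELINE_SCORES = {"standard_release": [0.82, 0.79, 0.85],
                   "coding_ablation":  [0.74, 0.77, 0.73]}
CANDIDATE_SCORES = {"standard_release": [0.82, 0.80, 0.84],  # looks flat
                    "coding_ablation":  [0.71, 0.73, 0.70]}  # ~3-point drop


def release_gate(baseline: dict, candidate: dict, budget: float = 0.01) -> bool:
    """Block the release if the candidate regresses past the budget on
    ANY suite, not just the headline one."""
    for suite in baseline:
        b, c = mean(baseline[suite]), mean(candidate[suite])
        if c < b - budget:
            print(f"gate failed on {suite}: {c:.3f} < {b:.3f}")
            return False
    return True


print("ship" if release_gate(BASELINE_SCORES, CANDIDATE_SCORES) else "hold")
```

The design point is simply that the gate fails on any suite, so a regression hiding behind a flat headline metric still blocks the ship.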