How to Measure AI Agent Performance
Why it matters: Learn how to measure AI agent performance in 2026 with metrics, traces, and a step-by-step pipeline that catches failures before users do.

Why it matters: Agent 365 gives every AI agent an identity, a registry, and real oversight. See pricing, security architecture, rollout steps, and the gaps it leaves open.
The LinkedIn co-founder was a key bridge to Microsoft’s relationship with OpenAI, but he also came with some baggage.
GitHub is expanding Copilot beyond the IDE with a new desktop application and a new collaborative work surface called canvas as part of its broader efforts to pitch the AI-assisted coding tool as the control center for agent-native software development. The desktop application announced at Microsoft’s annual Build conference this week is designed to give developers a dedicated environment for working with AI agents throughout the software development lifecycle, rather than limiting those interactions to code-generation tasks inside an editor, the company wrote in a blog post. The application includes a collaborative workspace called canvas where developers can brainstorm ideas, refine requirements, generate plans, and iterate on projects alongside AI, it said. It also has new Agent Merge and code review features that enable developers to automate Copilot to combine tasks of different agents to complete a specific goal or conduct autonomous code reviews according to set standards, it sa
Microsoft has identified seven new failure modes in agentic AI systems, in addition to those it identified last year in its first Taxonomy of Failure Modes in Agentic AI Systems. Four things contributed to the growing list of ways agentic AI can go wrong: the speed at which the technology went mainstream, the growing maturity of the Model Context Protocol (MCP) ecosystem, the rise of computer-use agents, and finally the gathering of more empirical evidence as researchers obtained more real-life findings. The seven new failure modes it has identified are: Agentic Supply Chain Compromise: agent behavior can be affected by natural language rather than malicious code; Goal Hijacking: adversarial instructions appear aligned with legitimate task completion, while silently redirecting the agent’s terminal goal; Inter-Agent Trust Escalation: a compromised agent asserts false identity or inflates claimed permissions to an orchestrator; Computer Use Agent (CUA) Visual Attack: agents operating
Projection, much? Microsoft’s head of AI has accused a rival’s AI service of being too pricey, just as the introduction of usage-based pricing for GitHub Copilot begins to hit developers using its own services. “Anthropic is extremely expensive and I think many people are urgently looking for alternatives,” Mustafa Suleyman, CEO of Microsoft AI, told Bloomberg News. The spotlight is on the cost of AI services at the moment: many different parts of the business are using the technology, while many businesses are finding it hard to report any meaningful ROI. This week, at its annual Build conference, Microsoft looked to fight back against this when it announced seven new AI models, emphasizing the lower cost. The company hopes that cheaper AI models will mean more enterprises find that AI projects are viable. In 2025, Gartner predicted that many such endeavors would be cancelled by 2027; cheaper implementations could be the way forward. Microsoft clearly sees its own AI
Microsoft Build 2026 didn't just announce products. It announced a philosophy: the era of the unmanaged AI agent is over.
Microsoft’s AI products aren’t selling, and GitHub has been plagued with troubles. WIRED spoke with VP Scott Hanselman about whether the company is in catch-up mode.
I set up an AI agent on a rented GPU, pointed it at a training script, and went to bed. By morning it had run 40 experiments, improved validation loss by 5.9%, and cut memory usage from 44 GB to 17 GB. It also spent four hours chasing a bug that a linter introduced behind […]