Insider Brief: One of artificial intelligence’s longest-running challenges is building systems that can improve both how they operate and what they know without requiring constant human intervention. A research team now reports that a new approach could break that bottleneck. In a study published on the preprint server arXiv, researchers at Palo Alto-based Hexo Labs […]
How local optimization in last‑mile delivery can quietly break the system
The post The System Always Knows: Why Local Efficiency and System Performance Are Not the Same Problem appeared first on Towards Data Science.
We’ve all heard the mantra from the quants in the business community: you can’t manage what you can’t measure. And if that’s true for human intelligence, it should be true for the artificial kind too.
How do we measure agents and large language models (LLMs)? We’re just beginning to come up with statistical metrics. Here are several of the most common metrics that designers and users toss about when they’re evaluating a model.
Time to first token
How long does it take to generate the first token? For real-time applications with time constraints, faster responses can be essential. It’s well known that people hate waiting even a few milliseconds. The teams that develop user interfaces learned decades ago that it’s important for the software to respond quickly when a human is waiting for an answer. Even a few seconds of delay means that the human will wander off to another window to check some email or place some bet on a prediction
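As a rough illustration of the metric, here is a minimal Python sketch that times how long a streaming response takes to deliver its first token. It assumes the model client exposes its output as an ordinary iterable of token strings; `measure_stream`, `StreamTiming`, and `fake_stream` are hypothetical names made up for this example and are not part of any real LLM library.

```python
import time
from typing import Callable, Iterable, Iterator, NamedTuple, Optional


class StreamTiming(NamedTuple):
    ttft_s: Optional[float]  # seconds until the first token; None if the stream was empty
    total_s: float           # seconds until the stream finished
    tokens: int              # number of tokens received


def measure_stream(start_stream: Callable[[], Iterable[str]]) -> StreamTiming:
    """Start a token stream and record time to first token and total time.

    `start_stream` is a hypothetical callable that issues the request and
    returns an iterable of tokens. The clock starts before the call so that
    request setup is counted in the time to first token.
    """
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in start_stream():
        if ttft is None:
            ttft = time.perf_counter() - start
        count += 1
    return StreamTiming(ttft, time.perf_counter() - start, count)


def fake_stream(first_delay_s: float = 0.05, gap_s: float = 0.01, n: int = 5) -> Iterator[str]:
    """Stand-in for a real model: waits, then yields n tokens with small gaps."""
    time.sleep(first_delay_s)
    for i in range(n):
        if i:
            time.sleep(gap_s)
        yield f"tok{i}"


if __name__ == "__main__":
    timing = measure_stream(fake_stream)
    print(f"time to first token: {timing.ttft_s:.3f}s over {timing.tokens} tokens")
```

In practice a single measurement says little, so it is probably more useful to repeat the call many times and look at the median and the tail of the distribution rather than one number.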
Microsoft has open-sourced an AI evaluation framework that converts natural-language requirements into executable tests, expanding its push into enterprise AI governance as organizations struggle to systematically validate agent behavior before production deployments.
The framework, called ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing), generates evaluation scenarios, datasets, metrics, and scorecards from written specifications, product requirements, and governance documents, Microsoft said in a blog post announcing the release.
“Agents fail in ways that are hard to see,” Microsoft wrote in the blog post. “They drift from policy, produce unsafe outputs in edge cases, and behave differently in production than they did in testing. Generic benchmarks do not catch these failures because they are not built around your policies, your agent, or your use case.”
Rather than requiring developers to manually create evaluation suites, ASSERT translates written intent
Deregulation risks creating hidden vulnerabilities that could destabilize the economy and echo past financial crises, making the case for cautious oversight.
The post Federal Reserve’s Barr warns banking deregulation could trigger next financial crisis appeared first on Crypto Briefing.
Why it matters: Learn how to measure AI agent performance in 2026 with metrics, traces, and a step-by-step pipeline that catches failures before users do.
Banks are adopting the XRP Ledger, but XRP stays stuck at $1.30. Why ledger adoption doesn't create token demand, and the metrics that would change it.