Training Isn’t Enough: Reasoning Models and LLMs Need Reinforcement Learning

MarkTechPost: reinforcement learning, uiuc, gpt-oss-20b, harness-1

Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b

UIUC and Chroma's Harness-1 is a 20B retrieval subagent trained with reinforcement learning inside a stateful search harness. The harness maintains the bookkeeping (candidate pool, importance-tagged curated set, evidence graph, verification records), while the policy decides what to search, curate, and verify, and when to stop. It reaches 0.730 average curated recall across eight benchmarks, beating the next open subagent by 11.4 points and trailing only Opus-4.6. Weights and harness code are public.

Jun 7, 6:25 AM

Towards Data Science: reinforcement learning, on-policy, off-policy

The Fundamental Choice in Reinforcement Learning: On‑Policy vs. Off‑Policy

How a simple choice shapes exploration, safety, and efficiency.

Jun 5, 3:00 PM

The Verge AI: openai, microsoft, ai agents, openclaw

Microsoft and OpenAI broke up — now they’re ready to fight

At Microsoft's annual Build conference on Tuesday, the company announced a slew of new or expanded AI initiatives, including a super app, in-house reasoning models, a cybersecurity tool, and OpenClaw-esque AI agents. All this news added up to a clear message: Microsoft is positioned to be one of the biggest players in AI, and it's finally acting like it. For years, Microsoft's AI business leaned hard on its early and exclusive partnership with OpenAI. But the drama-filled marriage slowly devolved into a situationship, and the pair effectively separated in late April (though Microsoft is still OpenAI's primary cloud partner, for now). This …

Jun 3, 2:04 PM

MarkTechPost: reinforcement learning, open-mm-rl, turingenterprises, multimodal reasoning

Design a Complete Multimodal RLVR Pipeline with Open-MM-RL, Vision-Language Prompting, Reward Scoring, and GRPO Export

In this tutorial, we explore the TuringEnterprises/Open-MM-RL dataset as a practical foundation for multimodal reasoning and reinforcement learning with verifiable rewards. We load the dataset, inspect its schema, analyze domains, formats, question lengths, answer types, and image distributions, and visualize representative examples from each domain. We also build a lightweight reward function that checks exact, […]

May 26, 7:25 AM

NVIDIA Blog: nvidia, london, reinforcement learning, alphago

NVIDIA, Ineffable Intelligence Team Up to Build the Future of Reinforcement Learning Infrastructure

Reinforcement-learning agents — AI systems that learn by trial and error — can convert computation into new knowledge. That’s the focus of a new engineering-level collaboration between NVIDIA and Ineffable Intelligence, the London-based AI lab founded by AlphaGo architect David Silver in the wake of Ineffable’s emergence from stealth last week. “The next frontier of […]

May 13, 1:00 PM

Towards Data Science: reasoning models

How Major Reasoning Models Converge to the Same “Brain” as They Model Reality Increasingly Better

Because there's only one reality to model!

May 7, 1:30 PM

Towards Data Science: token usage, latency, reasoning models, inference scaling

Inference Scaling (Test-Time Compute): Why Reasoning Models Raise Your Compute Bill

Why reasoning models dramatically increase token usage, latency, and infrastructure costs in production systems.

May 3, 1:00 PM

MarkTechPost: reinforcement learning, microsoft research, world-r1, flow-grpo

Microsoft Research’s World-R1 Uses Flow-GRPO and 3D-Aware Rewards to Inject Geometric Consistency Into Wan 2.1 Without Architectural Changes

Microsoft Research's World-R1 Uses Reinforcement Learning to Force 3D Consistency Into Text-to-Video Models

May 1, 12:40 AM