Microsoft MDASH found 16 Windows vulnerabilities: here's exactly how the 100-agent pipeline works

Microsoft's MDASH AI security pipeline discovered 16 previously unknown Windows vulnerabilities, including four critical remote code execution flaws, which were patched in the May 2026 Patch Tuesday updates. The system orchestrates over 100 specialized AI agents across a five-stage pipeline of preparation, scanning, validation, deduplication, and proof generation to autonomously find and prove real software vulnerabilities. This model-agnostic approach, which outperformed frontier models on the CyberGym benchmark, demonstrates that structured multi-agent systems can be more effective for vulnerability discovery than relying on a single powerful model.

Quick Answer: Microsoft's MDASH (Multi-Model Agentic Scanning Harness) scored 88.45% on the CyberGym benchmark, beating Anthropic's Mythos Preview (83.1%) and OpenAI's GPT-5.5 (81.8%), using no frontier model of its own. The system orchestrates 100+ specialized AI agents across a five-stage pipeline to find, debate, and prove real software vulnerabilities. It found 16 previously unknown Windows flaws, including four critical remote code execution bugs, patched in the May 2026 Patch Tuesday updates.

The strongest model doesn't automatically win. That's what Microsoft just demonstrated.

Anthropic's Mythos is their most powerful model, so powerful it isn't publicly available. OpenAI's GPT-5.5 is their current flagship. Microsoft used neither. Instead, they stitched together publicly available models into a structured pipeline of more than 100 specialized agents, and that system just topped the CyberGym benchmark leaderboard. By more than five points.

This isn't a benchmark story. It's a systems design story. Here's exactly how MDASH works, and what it means for the AI companies racing to build the world's most powerful single model.

What Is MDASH?

MDASH stands for Multi-Model Agentic Scanning Harness. It was built by Microsoft's Autonomous Code Security (ACS) team, several of whom were part of Team Atlanta, the group that won DARPA's $29.5 million AI Cyber Challenge by building autonomous systems that could find and patch vulnerabilities in real software.

The core idea: don't ask one model to do everything. Break the problem into stages, assign specialized agents to each stage, and let disagreement between agents become a signal rather than a failure. When an auditor flags something and a debater can't refute it, that finding's credibility goes up.

MDASH is model-agnostic by design. When a better model is released, Microsoft swaps it in via a configuration change. The pipeline, plugins, calibrations, and domain-specific context all carry forward. The model is one input.
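To make the "swap it in via a configuration change" idea concrete, here is a minimal sketch of a role-to-model mapping. It is an illustration only: the role names, model identifiers, `PipelineConfig`, and `run_role` are all hypothetical, and nothing here reflects how Microsoft actually configures MDASH.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A "model" here is just a function from prompt to response.
Model = Callable[[str], str]

# Hypothetical registry; the identifiers are placeholders, not real models.
MODEL_REGISTRY: Dict[str, Model] = {
    "model-a": lambda prompt: f"[model-a] {prompt}",
    "model-b": lambda prompt: f"[model-b] {prompt}",
    "model-c": lambda prompt: f"[model-c] {prompt}",
}


@dataclass(frozen=True)
class PipelineConfig:
    """Maps each agent role to a model name (assumed structure)."""
    role_to_model: Dict[str, str]


def run_role(config: PipelineConfig, role: str, prompt: str) -> str:
    """Look up the model assigned to a role and call it.

    The pipeline code never names a model directly, so changing
    models means changing only the config.
    """
    model = MODEL_REGISTRY[config.role_to_model[role]]
    return model(prompt)


# Swapping the auditor's model is a config edit; run_role is untouched.
old_config = PipelineConfig({"auditor": "model-a", "debater": "model-b"})
new_config = PipelineConfig({"auditor": "model-c", "debater": "model-b"})
```

Calling `run_role(new_config, "auditor", ...)` routes to the new model while the debater role and all surrounding code stay as they were, which is the property the article attributes to the design.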
The system is the asset.

The Five-Stage Pipeline

MDASH doesn't just scan code. It runs code through a structured sequence in which each stage has a different job.

Stage 1: Prepare. The system ingests source code, builds language-aware indexes, and analyzes past commits to map attack surfaces and threat models. Before any agent looks for bugs, MDASH knows where to look.

Stage 2: Scan. Specialized auditor agents examine candidate code paths and generate findings, each with hypotheses and supporting evidence. These aren't pattern matches. They're reasoned assessments of whether a code path could be exploitable.

Stage 3: Validate. A second set of agents, the debaters, argues against each finding. Can this actually be reached? Is it truly exploitable? If the debater can't punch holes in an auditor's case, the finding survives. Frontier models handle the heavy reasoning here. Distilled, faster models handle high-volume verification work.

Stage 4: Dedup. Semantically equivalent findings get collapsed. If three auditors flagged the same underlying issue through different code paths, that's one finding, not three.

Stage 5: Prove. The system constructs and executes actual inputs that trigger the bug. Not theoretical. Working proof-of-concept exploits. This is where MDASH stops being a scanner and becomes something closer to an automated offensive researcher.

Different models run at different stages. One state-of-the-art model handles reasoning-heavy tasks. A completely separate model acts as an independent counterpoint in the validation stage. The disagreement between them is a feature, not a problem.

The Numbers That Matter

Three benchmark results, each harder to ignore than the last.

CyberGym (public): MDASH scored 88.45%. Mythos Preview came in second at 83.1%, followed by GPT-5.5 at 81.8%. The benchmark, developed at UC Berkeley and published at ICLR 2026, contains 1,507 real-world vulnerability reproduction tasks from 188 open-source projects. Given a description of a known vulnerability and the unpatched code, the system must produce working attack code that triggers it.

Historical recall (internal): Microsoft ran MDASH against pre-patch snapshots of two heavily reviewed Windows components to see if it would rediscover bugs that were later confirmed by the Microsoft Security Response Center. For clfs.sys, 96% recall across 28 MSRC cases spanning five years. For tcpip.sys, 100% recall across 7 cases spanning five years. These are bugs that real attackers exploited. That a system recovers 96% of them in one of the most-reviewed kernel components in Windows is a significant claim.

Private driver test (StorageDrive): Microsoft planted 21 vulnerabilities into a private, never-published Windows driver, ensuring the models had never seen the code during training. MDASH found all 21 with zero false positives.

Two Bugs That Required More Than One Model

The benchmark scores are one thing. Two specific vulnerabilities show why a multi-agent architecture was actually necessary to find them. CVE-2...
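The Scan, Validate, and Dedup stages described in the pipeline section can be pictured with a toy sketch. Everything in it is an assumption made for illustration: the `Finding` fields, the `root_cause` key, the sample code paths, and the rule-based `toy_debater` are hypothetical stand-ins. In particular, exact matching on `root_cause` is a crude placeholder for whatever semantic-equivalence check the real system performs, and the debater would be a model call rather than a string test.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Finding:
    auditor: str      # which auditor agent raised the finding
    code_path: str    # where the auditor saw the problem
    root_cause: str   # hypothetical normalized key for the underlying bug
    hypothesis: str   # the auditor's reasoning


# A debater returns True when it manages to refute a finding.
Debater = Callable[[Finding], bool]


def validate(findings: List[Finding], debater: Debater) -> List[Finding]:
    """Stage 3 sketch: keep only findings the debater could not refute."""
    return [f for f in findings if not debater(f)]


def dedup(findings: List[Finding]) -> List[Finding]:
    """Stage 4 sketch: collapse findings that share a root cause.

    Keeps the first finding seen for each root cause.
    """
    unique: Dict[str, Finding] = {}
    for f in findings:
        unique.setdefault(f.root_cause, f)
    return list(unique.values())


def toy_debater(finding: Finding) -> bool:
    # Stand-in for a model call: refute anything on an unreachable path.
    return "unreachable" in finding.code_path


# Stage 2 output (invented): three auditors hit the same bug via
# different code paths, and a fourth flags something unreachable.
findings = [
    Finding("auditor-1", "ioctl_a -> parse_header", "len-unchecked",
            "length field is trusted"),
    Finding("auditor-2", "ioctl_b -> parse_header", "len-unchecked",
            "same copy, different entry point"),
    Finding("auditor-3", "read_file -> parse_header", "len-unchecked",
            "file-driven variant"),
    Finding("auditor-4", "unreachable_debug_hook", "debug-overflow",
            "overflow in debug hook"),
]

survivors = dedup(validate(findings, toy_debater))
```

With this invented input, the unreachable finding is dropped in validation and the three reports of the same root cause collapse to one, mirroring the article's "that's one finding, not three" description. A Prove stage would then try to build a triggering input for each survivor.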
