For vulnerability research, smaller models run repeatedly can outperform larger frontier models on cost-to-recall.

vulnerability-research ai claude-opus mythos

What: A discussion on the effectiveness of smaller AI models in vulnerability research.
Impact: Smaller models can be more cost-effective for certain tasks.
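
The cost-to-recall framing can be made concrete with a few lines of arithmetic. The sketch below is not Hacktron's benchmark code. It assumes, purely for illustration, a large model that finds a given bug with 90% probability per run, a small model that finds it with 50% probability at one tenth of the cost, and runs that are statistically independent. The function names and cost units are hypothetical.

```python
def recall_after_runs(p_single: float, runs: int) -> float:
    """Probability that at least one of `runs` independent attempts finds the bug."""
    return 1.0 - (1.0 - p_single) ** runs


def runs_to_reach(p_single: float, target: float) -> int:
    """Smallest number of independent runs whose combined recall meets `target`."""
    if not 0.0 < p_single <= 1.0:
        raise ValueError("p_single must be in (0, 1]")
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    runs = 1
    while recall_after_runs(p_single, runs) < target:
        runs += 1
    return runs


# Illustrative numbers only (assumptions, not measurements); costs are arbitrary units.
LARGE_RECALL, LARGE_COST = 0.90, 10.0
SMALL_RECALL, SMALL_COST = 0.50, 1.0

# How many small-model runs are needed to match one large-model run?
runs = runs_to_reach(SMALL_RECALL, LARGE_RECALL)
small_total_recall = recall_after_runs(SMALL_RECALL, runs)
small_total_cost = runs * SMALL_COST

print(f"small model: {runs} runs, recall {small_total_recall:.4f}, cost {small_total_cost}")
print(f"large model: 1 run, recall {LARGE_RECALL:.4f}, cost {LARGE_COST}")
```

Under these assumptions, four runs of the small model reach 93.75% combined recall for 4 cost units, against 90% for 10 units with a single large-model run. Independence is the strong assumption here: if the small model systematically misses a class of bug, repeating it will not help.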

TL;DR: If a large model finds a 0-day with 90% probability and a small model with 50% probability, but the small model costs 10x less, it is better to use the small model and run it more than once.

In our last blog post we let Claude Opus write a Chrome exploit. Anthropic also partnered with Mozilla to let their Mythos model discover 271 vulnerabilities [1]. The hype is real and there is data to back it up.

But there is a more nuanced take: these results were achieved with a skilled operator in the loop, a researcher or skilled engineer who knows what to look for, how to guide the agent’s exploration, how to set up a good harness and goal, and when to interrupt and course-correct. In that setting, the large frontier models are extraordinary tools that amplify expert capability in ways that were previously impossible, and they are possibly necessary for complex codebases like browsers and kernels.

But 99% of applications do not have the complexity of a JIT compiler. And Hacktron does not have a human operator. No one is guiding it. No one is telling it where to look. In that kind of workflow, how much do large models matter?

Recently we reported two 0-days in oauth2-proxy [2], which we used to benchmark our Hacktron scanning pipeline and compare different models. We found that for most applications, smaller models run repeatedly can outperform larger frontier models on cost-to-recall.

Security Research in 2026

To test whether Hacktron can find real 0-days, we need 0-days discovered independently of our workflow.
So Rahul (“iamnoooob”) picked a target and prompted Claude Code, with an additional skill [3]:

“You’re an expert security researcher specializing in code reviews, you’re tasked to code review oauth2-proxy source code which is deployed in your organization, you goal is to run 5 passes of security testing and code review, after every pass, u must document your findings and results and before beginning next pass, refer to the documentation to avoid repeating or heading in the right directions that have not been explored. Think like a CTF player, your goal is to find as many as realistic vulnerabilities you can (No bruteforce shit) and focus on getting an authentication bypass. I’ll be going to sleep and you should continue your work and not give up.”

with follow-up guidance:

“Do one more pass specifically for weird proxies, think what configurations of servers plus oauth proxy could result in auth bypasses?
https://github.com/grrrdog/weird_proxies
https://files.speakerdeck.com/presentations/1ccdea319fee4132968e6c07f6eb991d/Weird_proxies_2__1_.pdf”

It should come as no surprise that this will find real 0-days. And to be honest, it still amazes me every time I see it.

However, none of this is fundamentally new or beyond what a skilled human researcher could have done before LLMs. We (humans) simply didn’t do it, because of cost and the scarcity of top-tier talent. LLMs change that equation. Running an agent is dramatically cheaper than staffing an entire security team, which is why the current economics feel so compelling. But I believe this is just temporary arbitrage.

The research session above cost around $200 in tokens, plus one to two days of work by a professional researcher.

“Encouragingly, we also haven’t seen any bugs that couldn’t have been found by an elite human researcher.” (Bobby Holley, Mozilla [1])

Human-out-of-the-loop

Without a skilled operator, you could loop a large model indefinitely or spin up an agent per file, but costs explode quickly.
Even on a relatively small repository like oauth2-proxy, you are already looking at thousands of dollars. Scale that to something larger like Gumroad, and you are easily in the tens of thousands. I can only speculate, but I estimate Mozilla used at least $100,000 worth of tokens to find the 271 vulnerabilities. I wanted to say $300,000+, but to avoid embarrassing myself with an outrageous guess, I picked $100k.

This compounds further in continuous workflows. If you run scans on every PR to catch issues early, then something like Claude Code Reviews [4] is “billed on token usage and generally averages $15-$25, scaling with PR size and complexity”. At that rate you blow past the $200 mark after roughly 8 to 13 reviewed PRs, which at any reasonable scale is almost immediately.

So yes, LLMs are cheap compared to traditional security research. But that is not the right framing. What matters is where the market (which always optimizes for cost) will settle. Today’s “cheap” quickly becomes tomorrow’s baseline. The largest market will never pay for the absolute best, at any price.

The market does not solely consist of software as complex as browsers. If we can find the critical bugs in 99% of web apps, that is “good enough”, and the market will converge on cost-efficient setups for that.

Benchmark

We benchmarked our Hacktron pipeline in different model configurations against oauth2-proxy v7.15.0 [5], a widely deployed authentication reverse proxy with over 14,000 GitHub stars. Two zero-day authentication bypass vulnerabilities that we discovered...
