Anthropic Opus 4.6 is less good at finding vulns than you might think

What: Analysis of Anthropic's Opus 4.6 for vulnerability detection
Impact: Limited effectiveness with high false positives

Introduction

Opus 4.6 on its own seems to find software defects better than any previous Anthropic model, even without being embedded in a more complex workflow or agent. We decided to find out exactly how good it is. Our testing revealed that with good prompting and tools, Opus can find roughly a quarter of single-function C vulnerabilities. However, it still misses the majority of flaws, and the hits come at the expense of a high false positive rate and inconsistency across runs. These results are impressive compared to previous-generation models or human review, but they underline the need to embed the model within larger systems if vulnerability discovery at enterprise scale is to produce consistent results and manageable amounts of noise.

The Test Overview

We presented Opus 4.6 with 435 known-vulnerable C functions from real-world CVEs. We tried four different prompts and tool configurations, each simulating the sort of thing you might package as a Claude Code skill to use on your own codebase. Depending on the approach, Opus correctly discovered between 25.1% and 28.5% of the vulnerabilities.

However, false positive rates tended to be extremely high. Up to roughly 60% of all functions had at least one potentially spurious finding, although our structured reasoning approach reduced that to ~40%.

More concerningly, results varied widely across attempts using a single method. For each classification approach, there tended to be a large common core of functions correctly labeled across all runs, along with a sizable set whose labels changed from run to run.

It's worth noting that these vulnerabilities all made it past human review into production in widely used open source projects. For a general-purpose neural network to be consistently flagging ANY of these issues is incredible.

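The split between a stable core and labels that change from run to run can be made concrete with a small sketch. This is not the code used in the study. The function name `summarize_runs`, the data layout (one dict of function id to "flagged vulnerable" per run), and the toy data are all assumptions made for illustration.

```python
def summarize_runs(runs, truth):
    """Split functions into an always-correct core and an unstable set.

    runs:  list of dicts mapping function id -> bool (True = flagged vulnerable),
           one dict per repeated run of the same method (hypothetical layout).
    truth: dict mapping function id -> bool (True = actually vulnerable).
    """
    always_correct = set()
    unstable = set()
    for func_id, expected in truth.items():
        labels = [run[func_id] for run in runs]
        if all(label == expected for label in labels):
            # Correctly labeled in every run: part of the common core.
            always_correct.add(func_id)
        elif len(set(labels)) > 1:
            # The label changed between runs.
            unstable.add(func_id)
        # Otherwise the function was consistently mislabeled.
    return always_correct, unstable


# Toy data, invented for illustration only.
truth = {"f1": True, "f2": True, "f3": False}
runs = [
    {"f1": True, "f2": False, "f3": True},
    {"f1": True, "f2": True, "f3": True},
    {"f1": True, "f2": False, "f3": True},
]
core, flaky = summarize_runs(runs, truth)
print(sorted(core), sorted(flaky))
```

In the toy data, `f1` lands in the core, `f2` is unstable, and `f3` is in neither set because it is wrong the same way every time. Tracking that third group separately matters, since consistent false positives need a different fix than flaky ones.
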
In discovering the strengths, weaknesses and foibles of these powerful new models, we're not discounting their usefulness. We're doing the necessary work to understand how to engineer them correctly into rational, battle-tested systems, like any other software component. Doing this well is the difference between drowning in noise and inconsistent results, and moving at the speed of the AI-enhanced attackers that salespeople won't stop trying to scare us with.

Dataset

In 2024, Yangruibo Ding and other researchers created the PrimeVul dataset as part of this study. One of the many notable things about it is its large collection of individual known-vulnerable C functions, each paired with the same function after the patch. It's especially useful for evaluating LLM vulnerability detection because:

- The functions are from real CVEs in real code bases.
- The quality of the dataset is much higher than that of many other academic vulnerability datasets, some of which have serious accuracy issues. Labeled vulnerable functions are much more likely to be actually vulnerable, and there is very little repetition in the data.
- The benign and vulnerable function pairs are perfect for seeing whether an LLM can alert on the real issue without raising false positives on very similar benign code.

Figure: Dataset of vulnerable functions before and after patching.

For our work, we used a version of PrimeVul posted to Hugging Face: https://huggingface.co/datasets/colin/PrimeVul

We specifically used the paired subset and its test slice. This allowed a rough comparison of the P-C metric between the original study and our research, so we could get a sense of how Opus 4.6 performs compared to the models available in 2024.

Original Methodology

The original study covered too much ground to summarize briefly here. The part relevant to our Opus 4.6 benchmark is one of their techniques for evaluating model performance, which we borrowed and enhanced. Their original version:

1. Give an LLM a vulnerable function.
2. Ask it: "Is this function vulnerable? Yes/no"
3. Give an LLM the patched version of that same vulnerable function.
4. Ask for a binary classification again.
5. Compare the classifications.
6. Measure the number of times the LLM classified the vulnerable function as vulnerable AND the benign function as benign.

This approach is particularly notable because it places a premium on both precision and recall, and it captures precision in a very effective way. The LLM cannot cheat its way to victory by flagging most things "vulnerable." Also, the only difference between the pre-patch and post-patch function is the flaw, which forces the model to distinguish between two otherwise very similar functions in a controlled way.

The original researchers labeled the measurement that captures how often the model got both the vulnerable and benign halves of a pair right "P-C." If you were randomly choosing a label for each half of a function pair, you'd expect to score about 25% on it, since each half has a 50% chance of being labeled correctly. GPT-4, a state-of-the-art model at the time of the study (2024), got only 12.94% right, worse than literally flipping a coin.

Figure: Original methodology.

Updated Methodology

While the PrimeVul dataset was ideal for benchmarking Opus 4.6, we decided to up...
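The pair-wise scoring from the original methodology, and the 25% random baseline, can be sketched in a few lines of Python. This is a minimal illustration, not the original researchers' code or ours. The name `pair_correct_rate` and the tuple layout are assumptions.

```python
from itertools import product


def pair_correct_rate(pairs):
    """Fraction of pairs where both halves were classified correctly.

    pairs: iterable of (flagged_vulnerable_version, flagged_patched_version)
           booleans, where True means the model answered "vulnerable".
           (Hypothetical layout chosen for this sketch.)
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one function pair")
    # A pair counts only if the vulnerable half is flagged
    # AND the patched half is not.
    hits = sum(
        1 for vuln_flagged, patched_flagged in pairs
        if vuln_flagged and not patched_flagged
    )
    return hits / len(pairs)


# Random guessing: the four label combinations are equally likely,
# and exactly one of them (True, False) is fully correct.
random_outcomes = list(product([True, False], repeat=2))
baseline = pair_correct_rate(random_outcomes)
print(baseline)  # 0.25

# A model that flags everything scores zero, which is why this
# measurement cannot be gamed by over-reporting.
flag_everything = pair_correct_rate([(True, True)] * 10)
print(flag_everything)  # 0.0
```

The flag-everything case shows the design choice behind the measurement: recall on the vulnerable half is worthless without precision on the patched half.
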
