- What: Discusses challenges of using LLMs for security reviews
- Impact: Technical teams working on code security
If your team is wiring a coding agent with a /security-review or /security-scan custom command, you are not alone. The idea is intuitive: point an LLM at a repository, let it read files, grep for suspicious patterns, and emit a punch list of vulnerabilities before merge. In practice, that workflow inherits properties of probabilistic models and agentic control loops that static analysis vendors spent years sanding down. In this article, I document six structural failure modes, connect them to measurement ideas you can reuse in engineering reviews, and ground one of the scarier claims about “secured” code in peer-reviewed evidence.

You should read this as a scope guardrail, not a dismissal of LLM-assisted review. Used well and with the right contextual information, models compress context and suggest hypotheses. Used as the sole gate, they reintroduce variance where security programs usually demand repeatability, provenance, and budgets that survive every commit.

Background and prior art

Security engineering has long split work between human review, dynamic analysis, dependency and license scanning, and rule-driven static analysis (SAST). Those tools are not perfect, but they are engineered for repeatability: the same inputs yield the same findings, modulo explicit versioning rules, and incremental scans reuse prior graphs where products support it. Academic benchmarks and industry incident data also emphasize that correctness and security are distinct: code can pass functional tests and still admit exploits when threat models change.

Large language models inverted part of that story. They excel at open-ended synthesis and contextual reasoning (potentially cross-file reasoning too, though that is yet to be determined), which is exactly what security reviewers do informally. Benchmarks at function level looked encouraging early on, which nudged teams toward “LLM as scanner” mental models.
BaxBench (discussed later) is one of the newer checks on that optimism: it evaluates multi-file backend generation with functional tests and expert-authored exploits, reporting that even strong models leave a large slice of “correct” programs exploitable. This write-up focuses on agent harnesses: prompted workflows that loop tools, accumulate context, and terminate with a report. From a developer's perspective (AI builders and mature AI-native engineers included), that is what most /security-scan implementations are in practice.

How it works: the anatomy of an agentic security pass

Operationally, a coding agent performing repository-scale review is a feedback system. The model proposes actions (read a path, search for a string, run a command), the harness (e.g., Claude Code or Cursor) executes them, results return as tokens, and the loop continues until a stop condition. That architecture is powerful and flexible, but it couples three sources of variability: the model’s policy, the tool surface exposed to it, and nondeterministic decoding.

    flowchart TB
      subgraph inputs [Inputs]
        PR[Diff or tree snapshot]
        POL[Policies and prompts]
      end
      subgraph loop [Agent loop]
        M[Model proposes next tool calls]
        T[Tool runner: read, grep, shell]
        C[Context assembly + truncation]
        M --> T --> C --> M
      end
      subgraph outputs [Outputs]
        R[Findings list + rationale]
      end
      PR --> M
      POL --> M
      M --> R

Two consequences follow immediately. First, the state the model sees is partial and path-dependent: a different ordering of reads can change the final narrative, even if the repository is identical. Second, cost and latency scale with the loop, not with a precomputed program representation, as they do for mature SAST engines after indexing. The sections below translate those mechanics into product-level risks.

1. Run drift: the same command, a different report

Run the same /security-scan twice on an unchanged tree. If you treat the output like a compiler, you expect bitwise stability.
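To make that expectation concrete, here is a minimal Python sketch of a stability check. It assumes a hypothetical report schema in which each finding is a dict with `cwe`, `path`, `start_line`, and `end_line` keys; no particular harness emits this format. It also normalizes ordering before hashing, so the contract it tests is slightly weaker than byte-for-byte equality: two runs match when they report the same set of findings.

```python
import hashlib
import json


def report_fingerprint(findings):
    """Return a SHA-256 digest of a findings list, independent of ordering.

    The finding schema (cwe, path, start_line, end_line) is a hypothetical
    one chosen for illustration, not the output format of any real tool.
    """
    # Sort so that two runs listing identical findings in a different
    # order still produce the same digest.
    normalized = sorted(
        (f["cwe"], f["path"], f["start_line"], f["end_line"]) for f in findings
    )
    payload = json.dumps(normalized, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Illustrative data only: three "runs" against the same unchanged tree.
run_a = [
    {"cwe": "CWE-89", "path": "app/db.py", "start_line": 40, "end_line": 44},
    {"cwe": "CWE-79", "path": "app/views.py", "start_line": 12, "end_line": 12},
]
run_b = list(reversed(run_a))  # same findings, different order
run_c = run_a[:1]              # one finding silently missing

same_content = report_fingerprint(run_a) == report_fingerprint(run_b)
content_changed = report_fingerprint(run_a) != report_fingerprint(run_c)
print(same_content, content_changed)
```

A deterministic scanner would be expected to produce the same digest on every run against the same tree.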
LLM decoders do not offer that contract unless you build a deterministic harness on top (fixed model version, temperature zero where exposed, pinned prompts, constrained tool plans), and even then some residual variance often remains. In real coding agents, temperature and sampling parameters are frequently not exposed to end users, and tool ordering can differ between sessions.

Run drift is the security program name for that instability. It shows up in triage as contradictory severities, disappearing findings, or new “criticals” that no commit introduced. Teams respond by re-running scans “until it looks right,” which is harmless for creative writing and toxic for evidence chains. If you cannot reproduce a finding on demand, you cannot assign accountability, prioritize fairly, or defend an audit trail.

Quick check: pick three commits, run the harness five times at each commit SHA, and record finding identifiers (CWE, file path, line span). Compare overlap with a simple set metric such as Jaccard similarity across runs. If similarity is low while the tree is fixed, you are measuring e...