- What: Discussion on AI-driven offensive security and new attack surfaces
- Impact: Relevant to enterprise security teams and AI researchers
Continuous Offensive Security: The Line We've Been Walking

Written by Nuno Loureiro
May 27, 2026

AI Pentesting is having a moment. Well, several moments, actually. Every other week, another vendor announces something, another LLM-driven pentesting tool tops some benchmark on a target nobody's heard of, or another deck claims the "gold standard" is being disrupted, at long last... It's been busy.

Underneath all the noise, though, there's a real reason this is happening everywhere, all at once: the reasoning capability that just made AI pentesting commercially viable is the same one attackers now have in their hands. Autonomous attackers are already probing application surfaces continuously at machine speed, on a schedule defenders can't keep up with. The race is now whether your offensive security testing finds the flaws before an attacker's offensive AI does. The vendor announcements aren't the real story here: they're just the market catching up to a problem that's already arrived.

Having worked for several years now on Snyk API & Web, our Dynamic Security Testing product, I feel vindicated as Snyk announces Continuous Offensive Security. I want to use this post to do something other than just announce it: I want to explain why the line that runs from Dynamic Security Testing to AI Pentesting is one we've been walking for years now, and why anyone trying to build the second without a foundation in the first is, in my honest opinion, going to hit a ceiling quickly.

Heuristic-Detectable vs Context-Dependent

Here's the distinction that is key to understanding the whole argument.
Heuristic-Detectable vulnerabilities are the ones that show themselves to deterministic tools. SAST matches patterns in the source code itself, whereas DAST takes the other approach: it throws payloads at a running application or API and observes the responses, such as error messages, time delays, and behavioral differences. Probe, observe, infer. SQL injection, cross-site scripting, misused APIs, classic injection patterns: all of these surface reliably through behavior that heuristics can recognize. Scanners and DAST tools have become very good at this over the years. Hundreds of vulnerability classes in this category are now reliably caught across the SDLC, and that's a real win.

Context-Dependent vulnerabilities are something else entirely: BOLA (Broken Object Level Authorization) or IDOR (Insecure Direct Object Reference), cross-tenant data leakage, authentication bypasses, and especially chained vulnerabilities, where a couple of mediums and highs combine into a critical exploit path. None of these have a heuristic signature you can probe for. You cannot write a rule, in SAST or in DAST, for "User A should not be able to read user B's invoice", because that rule depends on what your application is supposed to do. The vulnerability lives in the gap between intended behavior and actual behavior, and there is no probe, no payload, no signature that can infer intent from the outside.

This is why pentesting has always been human-led. Until recently, only humans could acquire the contextual understanding required to find this second class. Heuristics find signatures or behaviors; pentesters find what's only revealed through context. As cool as that sounds, that's not a slogan; it's how the discipline has worked for as long as it's been a discipline.

And that's the line AI just crossed.

The lineage that actually matters

Here's the trade secret that isn't actually very secret: every credible DAST engine in the last decade was built by people who came from pentesting. The Snyk API & Web engine was no exception.
The team that built it had spent years finding flaws by hand, and they designed it around what pentesters actually do: recon, probe, observe, reason, escalate, validate... the whole motion. At the time, though, we could not mimic everything pentesters do. The missing piece was reasoning, which is exactly what we can add now that LLMs can understand context and reason about it.

That heritage is why our BOLA detection, which we shipped last year, works the way it does. It's not pattern-matching against a signature list, because there is no signature for BOLA: it's a chain of authorization probes guided by structural reasoning about how API objects relate to identities. It's automated flaw-hunting that crosses the threshold from "What does the code do?" into "What is it meant to do, and can I subvert that intent?" It's been working in production for our customers for months. It's the proof that the line from DAST into reasoning-based testing is walkable. It's also, not coincidentally, why we'd been thinking about building the AI version of this long before "AI Pentesting" became a category anyone was raising money for.

What changed, and why now

What changed isn't the goal, but the cost. Reasoning at scale used to require a human pentester: $20K to $50K per engagement, two weeks of calendar time on average, and a coverage window that closed the moment the report shipped, by which point the application had already shipped three more releases. That was the math of manual pentesting: irreplaceable, but constrained by the same thing every artisanal craft is constrained by, human time.

Your pentest covers fifteen days a year. What's happening to the other three hundred and fifty?

AI changes the math, but not the discipline. The reasoning step that only a pentester could perform is now also something a sufficiently capable model can perform, repeatably, at a fraction of the cost.
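To make the idea of a "chain of authorization probes" more concrete, here is a minimal sketch of a cross-identity replay check. It is an illustration under stated assumptions, not a description of how the Snyk engine works: the toy `get_invoice` endpoint, the `INVOICES` ownership map, and the `probe_bola` helper are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    actor: str       # identity that made the request
    object_id: str   # object it managed to read
    owner: str       # identity that actually owns the object


# Hypothetical in-memory stand-in for a running API under test.
# The mapping doubles as the ownership model: invoice id -> owner.
INVOICES = {"inv-1": "alice", "inv-2": "bob"}


def get_invoice(actor: str, invoice_id: str) -> int:
    """Toy endpoint with a deliberate flaw: it checks that the caller is
    a known user but never checks that the caller owns the invoice."""
    if actor not in {"alice", "bob"}:
        return 401
    if invoice_id not in INVOICES:
        return 404
    return 200  # missing ownership check: this is the BOLA


def probe_bola(fetch, ownership: dict[str, str], identities: list[str]) -> list[Finding]:
    """Replay every object read as every identity and flag reads that
    succeed for an identity the ownership model says should be denied."""
    findings = []
    for object_id, owner in ownership.items():
        for actor in identities:
            if actor != owner and fetch(actor, object_id) == 200:
                findings.append(Finding(actor, object_id, owner))
    return findings


findings = probe_bola(get_invoice, INVOICES, ["alice", "bob"])
for f in findings:
    print(f"{f.actor} can read {f.object_id}, which belongs to {f.owner}")
```

The probe itself is trivial. The hard part is the `ownership` argument: it encodes what the application is supposed to allow, and nothing in an HTTP 200 response reveals it. In this sketch it is handed in by the caller; in practice that model has to come from somewhere, which is the step that used to require a human.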
And there's a third attack surface that AI itself created

Everything above is about an attack surface that has existed as long as web applications have: the heuristic-detectable and context-dependent vulnerabilities we described, in traditional code, traditional APIs, and traditional architectures. AI changes the testing model, but it does not change the targets: these have been pentest material for two decades.

There is also an entirely new attack surface that AI itself created, and that did not exist just five years ago: LLM-integrated applications, AI Agents calling tools, chatbots wired into customer data. Retrieval pipelines pulling from sources that an attacker can poison, inject prompts into, or misuse. Data exfiltration through model outputs, or jailbreaks that turn a customer service Agent into a privileged actor with access it was never supposed to have. Manoj's piece walked through one version of this: an AI Agent calling an API nobody had stress-tested, triggered by nothing more exotic than an email address.

These attacks are not bugs you can scan for, and they are not flaws in the traditional architectural sense. They live in the gap between what an LLM was prompted to do and what an attacker can convince it to do. The only way to find them is to do, against the LLM-integrated layer, what an attacker would: probe, escalate, exfiltrate, abuse.

That's the third capability inside Continuous Offensive Security: Agent Red Teaming. Multi-step adversarial simulation against LLMs, AI Agents, and the tools they call. A tool purpose-built for the attack surface AI itself created.

There's one thing I like the most about how it's wired: it isn't a separate scan you have to schedule, or a different product you have to buy. During an assessment, the recon agent detects whether the target includes LLM-integrated components, and if it does, the Red Teaming module triggers automatically.
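The recon-then-dispatch wiring can be sketched in a few lines. This is only a rough illustration of the control flow described here: the path hints in `LLM_HINTS`, the `ReconResult` type, and the module names are assumptions made up for the example, and real detection of LLM-integrated components would rely on much richer signals than URL paths.

```python
from dataclasses import dataclass, field

# Hypothetical path fragments a recon pass might treat as evidence of an
# LLM-integrated component. Assumed for illustration only.
LLM_HINTS = ("/chat", "/completions", "/assistant", "/agent")


@dataclass
class ReconResult:
    endpoints: list[str] = field(default_factory=list)

    def llm_endpoints(self) -> list[str]:
        """Endpoints whose path contains one of the assumed LLM hints."""
        return [e for e in self.endpoints if any(h in e for h in LLM_HINTS)]


def plan_modules(recon: ReconResult) -> list[str]:
    """Always run the baseline modules; add red teaming only when recon
    found something that looks LLM-integrated."""
    modules = ["heuristic_scan", "context_probes"]  # hypothetical names
    if recon.llm_endpoints():
        modules.append("agent_red_teaming")
    return modules


classic = ReconResult(endpoints=["/api/invoices", "/api/users"])
with_llm = ReconResult(endpoints=["/api/invoices", "/api/support/chat"])

print(plan_modules(classic))
print(plan_modules(with_llm))
```

The point of the design is that the caller never declares what kind of target it is testing: the plan is derived from what recon observed, so a newly added chatbot endpoint changes the test plan without anyone updating a configuration.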
You don't have to know in advance what kind of attack surface your application presents; the system figures it out and runs the right tests against the right layer.

That matters more than it sounds at first. Most organizations now have AI somewhere in their app portfolios, but their security teams don't have a clear inventory of where. Recon-first, test-what-you-find is the only model that scales when "Where is AI running in production?" is a continuously moving target.

So, that's the attack surface: flaws in traditional architectures and the new ground AI just created on top of them. The harder question is what it actually takes to do offensive testing well against both, continuously, at enterprise scale. Pointing an LLM at a target URL and letting it figure things out from cold is not the answer. Four things separate this from running blind.

Platform context

The naïve approach to AI pentesting is to start from scratch. Point the LLM at a URL, let it crawl, let it guess, let it burn compute on enumeration that doesn't matter, hope it eventually finds something. Worse, it has no way to distinguish a theoretical finding from one that's actually exploitable in your stack, because it can't see your code, your dependencies, your prior scans, your deployment environment, or your trust boundaries. That's the demo loop, but it's not how production-grade security testing should work.

The Snyk version looks different. Continuous Offensive Security starts with everything the platform already knows about your application: SAST findings, SCA dependencies, asset inventories, prior DAST scans, and risk signals from across the platform. All of it feeds the AI Pentester before it sends a single request. That changes what the AI does on day one.
Instead of "Figure out what this application is and find vulnerabilities," the instruction set becomes "This application has these components, these dependencies, these prior findings, these reachable endpoints, this risk profile, now go after what isn't already covered." The LLM stops guessing and starts working.

Straight from our announcement this week: "Snyk is different because we already know your code." That's the platform argument in nine words. Every other AI pentes