Programmatic Scanners in the Age of AI Agents
Why rule-based vulnerability scanning still matters and where AI actually fits.
There's a loud narrative in the security industry right now: AI agents are going to replace everything. Scanners, pentesters, maybe even the bugs themselves.
Every week there's a new AI-powered security tool promising to "think like a hacker" and find vulnerabilities autonomously. And honestly? The technology is impressive. LLMs can understand context, reason about application logic, and adapt to what they see in ways that no regex or pattern-matching engine ever could.
But if you're a pentester or bug bounty hunter using these tools in practice, you've probably noticed something: the reality doesn't quite match the hype yet.
Let me explain what I mean and why I think programmatic scanners are not just still relevant, but essential.
Burp Bounty Pro
Custom Profiles · Smart Scan · AI Scanner · Multi-Step Detection
Extend Burp Suite with 254 vulnerability profiles, 27 Smart Scan rules, AI-powered targeting, and multi-step scanning with cookie reuse. Define custom payloads, match conditions, and detection rules — or let the AI Scanner pick the right profiles for you.
The AI agent promise
The idea behind AI-powered scanning agents is compelling. Instead of firing predefined payloads at predefined insertion points, an AI agent can:
- Understand the application's context: what the page does, what each parameter means, what kind of data it expects.
- Reason about attack vectors: "this parameter takes a URL, so it might be vulnerable to SSRF" or "this form field is reflected in the response without encoding, so XSS is likely."
- Adapt in real time: adjust payloads based on error messages, WAF behavior, or application responses.
- Chain multiple steps: understand that step A needs to happen before step B to exploit a vulnerability.
This is genuinely powerful. An LLM looking at HTTP traffic can identify attack surfaces that a static rule engine might miss, especially in non-standard or complex application logic.
So what's the problem?
The cost (and speed) equation nobody talks about
Here's where it gets real.
LLMs are not free. They're not even cheap, at least not the good ones.
If you use a top-tier model (GPT-4-class, Claude Opus, or equivalent), you get genuinely intelligent analysis. The model can reason about context, understand application behavior, and make smart decisions about where to attack.
But the cost per request adds up fast. A typical web application audit involves thousands (sometimes tens of thousands) of HTTP requests and responses. If every request and response needs to pass through an LLM for analysis, you're looking at significant API costs. For a thorough scan of a medium-complexity application, the bill can easily reach hundreds of dollars. Per scan.
An alternative could be to use a cheaper, smaller model. But then you lose the very thing that makes AI scanning valuable: the intelligence. A small model analyzing HTTP traffic will miss nuances, generate more false positives, and struggle with the kind of contextual reasoning that justifies using AI in the first place.
This creates an uncomfortable dilemma:
- Good model + high coverage = very expensive. Great results, but the cost per engagement becomes hard to justify for routine assessments.
- Cheap model + high coverage = mediocre results. You're paying for "AI" but getting results that are barely better than a well-configured programmatic scanner.
- Good model + low coverage = incomplete. Smart analysis on a subset of requests, but you're missing attack surface.
For a bug bounty hunter scanning multiple targets per week, or a consultancy running dozens of assessments per month, this cost structure simply doesn't scale yet.
And cost isn't the only bottleneck; there's also speed. A programmatic scanner can fire 50-100+ requests per second with configurable concurrency. Put an LLM in the loop, and you're bound by API rate limits and response latency, typically 1-5 seconds per call. For a pentester with a deadline and a scope of thousands of endpoints, that slowdown can turn a 30-minute scan into a multi-hour process. In security assessments, time is money in a very literal sense.
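A quick back-of-envelope calculation makes the gap concrete. The numbers below are illustrative, taken from the rough figures above (50-100 requests per second for a programmatic engine, a couple of seconds per LLM call):

```python
# Back-of-envelope scan-time comparison. All figures are illustrative
# assumptions, not measurements of any particular tool.

def scan_time_seconds(num_requests: int, requests_per_second: float) -> float:
    """Time to complete a scan at a sustained request rate."""
    return num_requests / requests_per_second

endpoints = 5_000           # a mid-sized scope
checks_per_endpoint = 4     # payload variants per insertion point
total_requests = endpoints * checks_per_endpoint

programmatic = scan_time_seconds(total_requests, 75)     # ~50-100 req/s
llm_in_loop = scan_time_seconds(total_requests, 1 / 2)   # ~2 s per LLM call

print(f"programmatic: {programmatic / 60:.0f} minutes")  # ~4 minutes
print(f"LLM in loop:  {llm_in_loop / 3600:.0f} hours")   # ~11 hours
```

Even with generous assumptions for the LLM, a sub-five-minute scan becomes an overnight job.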
What programmatic scanners actually do well
Here's what often gets lost in the AI hype: most common web vulnerabilities don't need artificial intelligence to be detected.
Think about it:
- Reflected XSS: Send a payload, check if it appears in the response. That's pattern matching, not reasoning.
- SQL injection: Send a time-based payload, measure the response delay. Send a boolean-based payload, compare the response length. Deterministic checks, no LLM needed.
- Path traversal: Inject ../../etc/passwd, check for root:x in the response. Pure string matching.
- Missing security headers: Check if Strict-Transport-Security is present. Simple presence/absence check.
- Technology fingerprinting: Look for wp-content in responses to detect WordPress. Pattern matching and conditional logic.
- SSRF: Send a Collaborator payload, wait for an out-of-band interaction. No reasoning required.
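These checks really are this simple. Here's a minimal sketch of a few of them; the helper names and thresholds are illustrative, not Burp Bounty's actual engine:

```python
import re
import time

# Minimal sketches of the deterministic checks described above.
# Helper names, patterns, and thresholds are illustrative assumptions.

def check_reflected_xss(payload: str, response_body: str) -> bool:
    """Reflected XSS: the payload appears unencoded in the response."""
    return payload in response_body

def check_path_traversal(response_body: str) -> bool:
    """Path traversal: /etc/passwd contents leaked into the response."""
    return re.search(r"root:.?:0:0:", response_body) is not None

def check_missing_hsts(headers: dict) -> bool:
    """Missing security header: simple presence/absence check."""
    return "Strict-Transport-Security" not in headers

def check_time_based_sqli(send_request, baseline: float, threshold: float = 4.0) -> bool:
    """Time-based SQLi: a SLEEP(5) payload should noticeably delay the response."""
    start = time.monotonic()
    send_request("' AND SLEEP(5)-- -")
    elapsed = time.monotonic() - start
    return elapsed - baseline > threshold
```

Every one of these runs in microseconds, costs nothing per invocation, and returns the same answer every time.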
These are well-defined, well-understood vulnerabilities with clear detection patterns. A programmatic scanner with good profiles can test for all of them, reliably, repeatedly, and at zero marginal cost per request.
That's not a limitation. That's an advantage.
Our programmatic scanner can fire 254 vulnerability profiles across thousands of endpoints with configurable concurrency, and it costs you nothing beyond the initial license. Try doing that with an LLM at $0.01-$0.05 per API call.
To be fair, there's a category of vulnerabilities where programmatic scanners genuinely struggle: business logic flaws, complex authorization bypasses, and context-dependent race conditions. These require understanding what the application is supposed to do, not just what it responds to a payload. An LLM reasoning about application behavior is fundamentally better suited for this kind of analysis.
Programmatic scanners won't catch a logic flaw where a discount code can be applied twice, or where changing the order of API calls bypasses a payment step. That requires intelligence, not pattern matching.
But here's the thing: these vulnerabilities are also the hardest for any automated tool to find reliably, including AI-powered ones. They represent a real gap, and AI will likely fill it over time. For now, the vast majority of bugs that matter in real-world engagements (XSS, SQLi, SSRF, RCE, path traversal, CVEs, etc.) are well within reach of a good programmatic scanner.
Where AI actually fits (and it's more than you think)
So if programmatic scanners handle the execution, where does AI actually add value?
The answer isn't just one thing. AI delivers a great value-to-cost ratio in three distinct phases of a security assessment, none of which involve the actual exploitation.
1. Pre-attack strategy
Before a pentester starts firing payloads, they study the application. They browse it, map its structure, correlate requests/responses, understand the scope. They think:
- "This application has a REST API with sequential IDs and a separate admin panel: two different attack surfaces."
- "These five endpoints all share the same session cookie but different authorization levels: access control testing is a priority."
- "The login flow involves three redirects and a token exchange: this is a complex multi-step target."
This is strategic reasoning. It requires correlating dozens of requests/responses, understanding how they relate to each other, and building a mental model of the application before a single payload is sent.
An LLM can do this remarkably well. Feed it a batch of unique request patterns from the site map, and it can identify the application's structure, map the attack surface, prioritize which areas deserve the most attention, and suggest an overall testing strategy.
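One way to do that without shipping raw traffic to the model is to condense the site map into one-line summaries and build a single strategy prompt. This is a sketch; the field names and prompt wording are my own assumptions, not a real tool's format:

```python
# Sketch: condense a site map into a compact strategy prompt for an LLM.
# Field names and prompt wording are illustrative assumptions.

def summarize_endpoint(method: str, path: str, params: list[str], auth: str) -> str:
    """One line per unique request pattern: method, path, params, auth level."""
    return f"{method} {path} params={','.join(params) or '-'} auth={auth}"

def build_strategy_prompt(endpoints: list[dict]) -> str:
    lines = [summarize_endpoint(**e) for e in endpoints]
    return (
        "You are planning a web application security assessment.\n"
        "Given these unique request patterns, group them into attack\n"
        "surfaces and rank which areas deserve the most testing effort:\n\n"
        + "\n".join(lines)
    )

site_map = [
    {"method": "GET", "path": "/api/users/{id}", "params": ["id"], "auth": "session"},
    {"method": "POST", "path": "/admin/settings", "params": ["theme"], "auth": "admin"},
]
prompt = build_strategy_prompt(site_map)
```

A few hundred such one-liners fit comfortably in one model call, which is what keeps this phase cheap.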
2. Attack targeting
Within that strategy, the next step is deciding (for each specific request) where to attack and what to test for.
A human pentester doesn't blindly test every parameter. They look at a request and think:
- "This redirect_url parameter probably handles redirects: worth testing for open redirect and SSRF."
- "This template parameter in a POST body might be loading templates server-side: SSTI is a possibility."
- "This endpoint accepts XML in the request body: XXE should be tested."
- "This API returns user data based on a sequential ID: IDOR is likely."
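For contrast, here's what a purely rule-based approximation of that reasoning looks like: a keyword map from parameter names to vulnerability classes. The map is illustrative; the LLM's advantage is precisely that it handles names and contexts no such table anticipates:

```python
# Rule-based approximation of attack targeting: map parameter names to
# likely vulnerability classes. An LLM does this with contextual reasoning
# instead of keywords. The hint table below is an illustrative assumption.

PARAM_HINTS = {
    "url": ["open-redirect", "ssrf"],
    "redirect": ["open-redirect", "ssrf"],
    "template": ["ssti"],
    "file": ["path-traversal", "lfi"],
    "path": ["path-traversal"],
    "id": ["idor", "sqli"],
}

def suggest_tests(param_name: str) -> list[str]:
    """Return vulnerability classes whose keyword appears in the name."""
    name = param_name.lower()
    seen, out = set(), []
    for hint, vulns in PARAM_HINTS.items():
        if hint in name:
            for v in vulns:
                if v not in seen:  # dedupe while preserving order
                    seen.add(v)
                    out.append(v)
    return out
```

This works for redirect_url and template, but fails the moment a developer names the SSRF-prone parameter dest or target_host.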
This kind of contextual analysis is where LLMs genuinely excel. They can look at an HTTP request (the URL structure, parameter names, content types, headers, the application context gathered during the strategy phase) and make intelligent decisions about what types of vulnerabilities are most likely for that specific request.
This is exactly where Burp Bounty Pro v3.1.0 with its new AI Scanner fits. The AI analyzes each request, decides which parameters are interesting and what vulnerability types to test for, and automatically launches the right programmatic profiles.
3. Feedback loop
After the programmatic engine executes and results come back, there's another phase where AI shines: analyzing what happened and refining the approach.
- "The SQLi payloads on the id parameter all returned 403: there's likely a WAF. Try encoded variants or different injection points."
- "The path traversal scan found /etc/passwd on the file parameter but got filtered on template: the filter may be path-specific."
- "Several endpoints returned debug error pages: the application is in development mode. Expand the scope to test for exposed admin endpoints."
This feedback loop (being informed of what the tests found, drawing conclusions, and adjusting the strategy) is something a programmatic scanner simply can't do. It requires reasoning. But it doesn't require thousands of LLM calls either. You're summarizing results and making a few strategic decisions.
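The summarization step is cheap precisely because you send aggregates, not raw traffic. A sketch, assuming a hypothetical result-dict shape:

```python
from collections import Counter

# Sketch: compress raw scan results into a compact summary an LLM can
# reason about in a single call. The result dict shape is an assumption.

def summarize_results(results: list[dict]) -> str:
    by_status = Counter(r["status"] for r in results)
    findings = [r for r in results if r.get("finding")]
    lines = [f"requests: {len(results)}"]
    lines += [f"status {code}: {n}" for code, n in sorted(by_status.items())]
    lines += [f"finding: {r['finding']} on {r['param']}" for r in findings]
    return "\n".join(lines)

results = [
    {"status": 403, "param": "id", "finding": None},
    {"status": 403, "param": "id", "finding": None},
    {"status": 200, "param": "file", "finding": "path-traversal"},
]
summary = summarize_results(results)
```

Thousands of raw responses collapse into a few dozen lines, so the strategic decision costs one model call instead of thousands.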
A word of caution on the feedback loop: it's the most conceptually appealing phase, but also the hardest to implement well. When do you stop iterating? How do you prevent the LLM from chasing false positives down a rabbit hole? How do you balance "try harder on this endpoint" with "you've already spent enough time here, move on"?
A human pentester manages this naturally through experience and intuition. Encoding that judgment into an AI-driven loop is an open problem and getting it wrong means either wasted time on dead ends or prematurely abandoned attack vectors. It's solvable, but it's not trivial.
And where AI still falls short
The exploitation phase (the actual sending of payloads, matching response patterns, measuring time-based delays, handling Collaborator interactions, chaining multi-step attacks) is where AI-powered scanning breaks down economically. Exploitation means thousands of requests, each potentially needing LLM analysis. The value could be enormous (adaptive payloads, real-time WAF bypass reasoning), but the cost makes it impractical at scale.
There's another issue that goes beyond cost: reproducibility. A regex-based profile produces the same result every time you run it against the same target. It's deterministic. An LLM might decide to test for SSRF on one run and skip it on the next, depending on subtle variations in how it processes the context. For routine pentests this might be acceptable, but for compliance-driven assessments (PCI-DSS, SOC 2, ISO 27001) where you need to demonstrate consistent, repeatable testing methodology, non-deterministic targeting is a real problem. You need to be able to explain why a specific vulnerability was or wasn't tested, and "the AI decided not to" isn't a satisfying answer in an audit report.
And then there's the hallucination problem. As Albert Ziegler, Head of AI at XBOW, puts it clearly: LLMs generate the most plausible explanation based on prior patterns, but plausibility is not proof. A model can infer SQL injection from a response pattern that merely resembles prior exploits. It can assert impact without ever demonstrating it. It sounds confident, but the vulnerability isn't real.
This isn't a flaw in a specific model or something you fix with better prompts. It's a fundamental mismatch between how LLMs reason and how vulnerabilities actually exist. An LLM doesn't measure timing differences, verify side effects, or confirm that a payload changed application behavior. It reasons about what should happen based on what it's seen before, not about what did happen.
Ziegler's conclusion is telling: validation must exist outside the model. When an AI suspects a vulnerability, that suspicion should trigger a second phase using deterministic checks, measuring timing, verifying out-of-band interactions, checking for data leakage. In other words: exactly what a programmatic scanner does.
So the picture is clear:
- Strategy → AI
- Targeting → AI
- Execution → Programmatic
- Feedback → AI
Three out of four phases where AI adds value are cheap. The one expensive phase is exactly where a programmatic engine excels.
This is essentially an intelligent evolution of what we already do with Smart Scan in Burp Bounty Pro: passive detection identifies interesting patterns in traffic, rules evaluate conditions, and active profiles execute targeted attacks automatically.
The difference is that today, the passive detection uses regex and pattern matching. With AI, that entire strategy-targeting-feedback cycle gets dramatically smarter, and because it's not the execution phase, the cost stays manageable.
The hybrid model (and we're shipping it)
I think the right model for web application scanning is a hybrid one:
- AI for the brain — Strategy, targeting, and feedback. Understanding context, correlating requests, identifying attack surfaces, prioritizing what to test, and learning from results.
- Programmatic engines for the muscle — Executing the actual tests at scale, with deterministic detection logic, configurable concurrency, and zero marginal cost per request.
We started with the targeting phase because that's where AI has the most immediate impact on scanning accuracy.
In Burp Bounty Pro v3.1.0, we're releasing the new AI Scanner. It's available now.
The AI Scanner analyzes HTTP requests and decides (for each one) where to attack and what types of vulnerabilities to test for. It looks at URL structures, parameter names, content types, request bodies, and application context. Then it automatically selects and launches the appropriate vulnerability profiles from your library.
A key design decision: the AI only analyzes requests, not responses. HTTP responses can be massive (full HTML pages, JavaScript bundles, JSON payloads) and sending all of that to an LLM would be both slow and expensive.
Instead, a programmatic scanner processes the responses first, extracting relevant signals (technology fingerprints, reflected values, error patterns, interesting headers) and feeding that structured context back to the AI. This gives the AI the information it needs to make smart targeting decisions, without the cost of analyzing raw response bodies that can be tens or hundreds of kilobytes each.
In practice, this means:
- Programmatic scanner analyzes responses: Extracts technology fingerprints, interesting patterns, and contextual signals.
- AI analyzes the request + extracted context: "This parameter takes a URL, this other one looks like a file path, the server is running Spring Boot, and this endpoint accepts XML in the body."
- AI decides what to test: "SSRF and open redirect for the URL parameter, path traversal for the file parameter, Spring-specific CVEs, XXE for the XML body."
- Programmatic profiles execute: The right profiles fire automatically with deterministic detection logic.
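Step 1 of that workflow might look like this in miniature. The regex fingerprints and signal fields are illustrative, not Burp Bounty's actual extraction set:

```python
import re

# Sketch of the response-side extraction: reduce a large response body to a
# few structured signals for the AI. Patterns and fields are illustrative.

FINGERPRINTS = {
    "wordpress": re.compile(r"wp-content"),
    "spring-boot": re.compile(r"Whitelabel Error Page"),
}

def extract_signals(headers: dict, body: str, sent_payload: str) -> dict:
    """Structured context for the AI; the raw body never leaves this function."""
    return {
        "technologies": [name for name, rx in FINGERPRINTS.items() if rx.search(body)],
        "reflected": sent_payload in body,
        "server": headers.get("Server", "unknown"),
        "body_bytes": len(body),  # the AI sees the size, never the body
    }

signals = extract_signals(
    {"Server": "nginx"},
    "<html>... wp-content/themes ...</html>",
    "bb-probe-123",
)
```

A 200 KB HTML page becomes a dict of a few dozen bytes, which is what keeps the per-request AI cost down.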
The AI handles the expensive part, the reasoning, with a minimal number of LLM calls (one per unique request pattern), and it never has to process bloated response bodies.
The programmatic engine handles everything else: response analysis, payload execution, match logic, time-based detection, Collaborator interactions, and multi-step chains.
This is not a fully AI-powered scanner that burns through API credits on every single request and response. It's a carefully designed hybrid: programmatic processing feeds intelligence to the AI, the AI makes targeting decisions, and the programmatic engine executes at scale. The programmatic layer is also the validation layer: it's what turns AI hypotheses into confirmed (or dismissed) findings, exactly as Ziegler argues should happen in any AI-driven security system.
The AI Scanner is our first step. The strategy and feedback loop phases described above are on our roadmap, because the three-phase framework (strategy → targeting → feedback) is where AI delivers the most value without breaking the bank. The execution stays programmatic.
The bottom line
In my opinion, AI will eventually replace programmatic scanners entirely. When models become significantly cheaper and more reliable, there will be less reason to maintain separate rule-based engines. An AI that can analyze, target, and exploit, all at a fraction of today's cost, will be the better tool.
But that day isn't today. And it's probably not tomorrow either.
Until cost, speed, reliability, and reproducibility all improve significantly, programmatic and AI-powered scanning need to coexist, each contributing what it does best.
With Burp Bounty Pro v3.1.0, we've started building that coexistence, beginning with the AI Scanner for intelligent targeting.
The programmatic engine executes with deterministic precision. The cost stays manageable because each layer does what it's built for.
Will this hybrid model last forever? Probably not. But it's the best approach we have today, and it works.
Burp Bounty Pro v3.1.0 ships with the new AI Scanner, 254 vulnerability profiles, 27 Smart Scan rules, and the ability to build your own IF-THEN automation workflows. Practice on Burp Bounty Lab or learn more at bountysecurity.ai.