Code Analysis
Intermediate
T1190 T1059

AI Code Review & Fuzzing

LLMs can analyse source code for vulnerabilities far faster than manual review, while AI-guided fuzzers generate smarter inputs to find bugs that traditional fuzzers miss. Google's Big Sleep project demonstrated this by discovering real-world zero-days in production software using AI alone.

The State of AI Vulnerability Discovery (2026)

Google's Project Big Sleep (formerly Naptime) used an LLM agent to discover a previously unknown, exploitable stack buffer underflow in SQLite — a critical, widely deployed database engine. This was the first publicly reported case of an AI agent finding a real 0-day before human researchers, and it won't be the last.

Why AI Code Review?

AI Strengths

  • Pattern recognition: Trained on millions of known-vulnerable code patterns
  • Tirelessness: Can audit 100K+ lines without fatigue or attention drift
  • Cross-language: Same model reviews C, Python, Go, JS, Solidity, etc.
  • Context window: Modern models handle 128K+ tokens — entire modules or small codebases in one pass
  • Explanation: Generates human-readable descriptions of why code is vulnerable

AI Limitations

  • Hallucinations: May report vulnerabilities that don't exist (false positives)
  • Business logic: Struggles with app-specific logic flaws that require domain knowledge
  • Subtle bugs: Race conditions and complex state machines still challenge LLMs
  • Scope limits: Even large context windows can't hold entire enterprise codebases
  • Requires verification: Every AI finding must be manually confirmed

1. LLM-Assisted Code Auditing

The most practical approach is to feed code chunks to an LLM with a security-focused system prompt, then triage the results manually. This works for both white-box pentests and bug bounty hunting.

Security Audit Prompt Framework

python
# AI-assisted code review framework
# Works with: GPT-4o, Claude 3.5 Sonnet, DeepSeek-Coder, local models via Ollama

import openai
from pathlib import Path

SECURITY_AUDIT_PROMPT = """You are an expert security code reviewer. Analyse the following code for:

1. **Injection vulnerabilities**: SQL injection, command injection, LDAP injection, XSS
2. **Authentication/Authorisation flaws**: Broken auth, IDOR, privilege escalation
3. **Cryptographic issues**: Weak algorithms, hardcoded keys, improper random
4. **Memory safety**: Buffer overflows, use-after-free (for C/C++/Rust unsafe)
5. **Deserialization**: Unsafe deserialization of user input
6. **SSRF / Path traversal**: Server-side request forgery, directory traversal
7. **Race conditions**: TOCTOU, missing locks on shared state
8. **Secrets exposure**: API keys, tokens, passwords in source
9. **Dependency risks**: Known-vulnerable library patterns

For each finding:
- Severity: CRITICAL / HIGH / MEDIUM / LOW / INFO
- Vulnerable code: Quote the exact line(s)
- Impact: What an attacker can achieve
- CWE: Map to CWE ID
- Remediation: Specific fix with code example

If no vulnerabilities are found, state that explicitly. Do not invent findings."""

def audit_file(filepath: str, model: str = "gpt-4o") -> str:
    """Audit a single source file for security vulnerabilities."""
    code = Path(filepath).read_text()
    
    response = openai.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SECURITY_AUDIT_PROMPT},
            {"role": "user", "content": f"File: {filepath}\n\n```\n{code}\n```"}
        ],
        temperature=0.1  # Low temp → more precise, fewer hallucinations
    )
    return response.choices[0].message.content

def audit_directory(directory: str, extensions: list = None):
    """Audit all matching files in a directory tree."""
    if extensions is None:
        extensions = ['.py', '.js', '.ts', '.go', '.c', '.cpp', '.java', '.rs', '.php', '.rb']
    
    findings = []
    for path in Path(directory).rglob('*'):
        if path.suffix in extensions and path.is_file():
            print(f"Auditing: {path}")
            result = audit_file(str(path))
            if "no vulnerabilities" not in result.lower():
                findings.append({"file": str(path), "findings": result})
    
    return findings

# Example: audit a Python web app
results = audit_directory("./target-app/src/", ['.py'])
for r in results:
    print(f"\n{'='*60}")
    print(f"FILE: {r['file']}")
    print(r['findings'])
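One limitation noted above is that a single file can exceed the model's context window. A minimal chunking helper (the sizes here are arbitrary assumptions, not tuned values) splits oversized sources into overlapping segments so each fits in one request:

```python
def chunk_code(code: str, max_chars: int = 60_000, overlap: int = 2_000) -> list[str]:
    """Split source text into overlapping chunks that fit a model's context window.

    Overlapping the boundaries reduces the chance that a vulnerability
    spanning a split point is missed by every chunk.
    """
    chunks = []
    start = 0
    while start < len(code):
        end = min(start + max_chars, len(code))
        chunks.append(code[start:end])
        if end == len(code):
            break
        start = end - overlap  # step back so chunks share a boundary region
    return chunks
```

Each chunk can then go through the same `audit_file`-style prompting individually, with the per-chunk findings merged and de-duplicated afterwards.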

Using Local Models (No Data Leakage)

bash
# For sensitive client code, use local models — zero data leaves your machine

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a coding-focused model
ollama pull deepseek-v3               # DeepSeek's general-purpose model (very large — needs serious hardware)
ollama pull qwen2.5-coder:32b         # Strong open-source code model

# Run audit against local model (OpenAI-compatible API)
export OPENAI_BASE_URL=http://localhost:11434/v1   # openai>=1.0 reads OPENAI_BASE_URL
export OPENAI_API_KEY=ollama  # Ollama ignores the key, but the SDK requires one

# Or use Ollama's native API:
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-v3",
  "prompt": "Review this code for security vulnerabilities:\n\n<paste code>",
  "stream": false
}'

Agentic Code Review

Tools like Cursor, Windsurf, and VS Code Copilot Agent Mode can iteratively review code — the AI reads a file, asks itself follow-up questions about data flow, checks related files, and chains together a multi-step analysis. This produces significantly better results than single-shot prompting.

2. Agentic Code Review Tools

Agentic coding tools go beyond single-shot prompting — they autonomously navigate codebases, follow data flows across files, and iteratively refine their analysis. These are the most effective tools for AI-assisted security review in 2026.

Cursor

AI code editor with multi-file analysis and context-aware security review. Indexes your entire project for deep cross-reference analysis.

Key: Multi-file context + Composer agent mode

Windsurf (Codeium)

AI-powered IDE with Cascade for autonomous multi-step code analysis. Understands project structure and dependency chains.

Key: Cascade autonomous agent + flow awareness

Claude Code

Anthropic's CLI agent for terminal-based code review. Reads files, runs commands, and iterates on analysis directly from your shell.

Key: Terminal-native + tool use + extended thinking

OpenAI Codex CLI

OpenAI's open-source terminal coding agent. Runs locally with sandboxed execution for safe code analysis and generation.

Key: Open-source + sandboxed execution

Aider

Open-source AI pair programming tool that works with any LLM. Git-aware with automatic commit generation and repo-map for codebase understanding.

Key: Any LLM backend + git integration + repo-map

GitHub Copilot Agent Mode

VS Code integrated agent that iterates on code review tasks. Uses workspace context, terminal access, and multi-step reasoning.

Key: VS Code native + workspace-aware + MCP tools

CLI-Based Security Review

bash
# Using Claude Code for security review
claude "Review /path/to/target-app for security vulnerabilities. Focus on
injection flaws, authentication bypasses, and IDOR. For each finding,
provide the file, line number, CWE, severity, and a remediation."

# Using Aider with any model for targeted review
pip install aider-chat
cd /path/to/target-app

# With Claude
aider --sonnet --message "Audit auth.py and api/routes.py for security
vulnerabilities. List each finding with severity, CWE ID, and a fix."

# With a local model (no data leakage)
aider --model ollama/deepseek-v3 --message "Review all files in src/
for SQL injection, command injection, and path traversal vulnerabilities."

# Using OpenAI Codex CLI
codex "Audit this repository for security vulnerabilities. Focus on the
authentication middleware, API input validation, and session handling."

3. Semgrep + AI Hybrid Analysis

The most effective approach combines traditional SAST (Static Application Security Testing) with LLM-powered triage. Semgrep finds potential issues with deterministic rules; the LLM filters out false positives and explains impact.

bash
# Step 1: Run Semgrep to get candidate findings
pip install semgrep

# Scan with security-focused rules
semgrep scan --config=p/security-audit \
             --config=p/owasp-top-ten \
             --config=p/cwe-top-25 \
             --json --output findings.json \
             ./target-app/

python
# Step 2: Feed the Semgrep findings to an LLM for triage
import json
import openai

with open("findings.json") as f:
    semgrep_results = json.load(f)

TRIAGE_PROMPT = """You are a security expert triaging Semgrep findings.
For each finding, determine:
1. Is this a TRUE positive or FALSE positive? Explain why.
2. If true positive: severity, exploitability, and recommended fix.
3. If false positive: why the code is actually safe.

Be conservative — when in doubt, flag it for manual review."""

for finding in semgrep_results.get("results", []):
    context = f"""
Rule: {finding['check_id']}
Severity: {finding['extra']['severity']}
Message: {finding['extra']['message']}
File: {finding['path']}:{finding['start']['line']}
Code: {finding['extra']['lines']}
"""
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": TRIAGE_PROMPT},
            {"role": "user", "content": context}
        ],
        temperature=0.1
    )
    print(f"\n--- {finding['path']}:{finding['start']['line']} ---")
    print(response.choices[0].message.content)

4. AI-Guided Fuzzing

Traditional fuzzers generate random or mutated inputs. AI-guided fuzzers understand the target's input format and logic, generating smarter inputs that reach deeper code paths faster.

AI Fuzzing Approaches

  • LLM seed generation (custom scripts + AFL++): the LLM generates an initial corpus from docs or a protocol spec. Best for protocol and API fuzzing.
  • Harness generation (Google OSS-Fuzz-Gen): the LLM writes fuzz harnesses from source. Best for C/C++ libraries and OSS projects.
  • Coverage-guided + LLM (FuzzGPT): the LLM analyses coverage gaps and generates targeted inputs. Best for reaching deep code paths.
  • Variant analysis (Big Sleep / Naptime): an LLM agent reviews code and past bug reports for variants. Best for finding variant bugs in patched code.
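The LLM seed-generation approach can be sketched with a small helper that turns a model's suggested example inputs into an AFL++/libFuzzer corpus directory. The separator convention and file naming below are assumptions (set by your own prompt), not part of any tool:

```python
from pathlib import Path

def extract_seeds(llm_response: str, sep: str = "-----") -> list[str]:
    """Split an LLM reply into seeds; assumes the prompt asked the model
    for one example input per block, separated by a '-----' line."""
    parts = [p.strip() for p in llm_response.split(f"\n{sep}\n")]
    return [p for p in parts if p]

def write_corpus(seeds: list[str], corpus_dir: str = "corpus") -> int:
    """Write each seed as its own file, AFL++/libFuzzer corpus style."""
    out = Path(corpus_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, seed in enumerate(seeds):
        (out / f"seed_{i:03d}").write_bytes(seed.encode())
    return len(seeds)
```

The resulting directory can be passed directly as the corpus argument to `afl-fuzz -i corpus` or a libFuzzer binary.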

LLM-Generated Fuzz Harnesses

bash
# Using Google's OSS-Fuzz-Gen approach: the LLM generates fuzz harnesses
# from source code, which are then run with coverage-guided fuzzing.
# Script names below are illustrative; check the repo README for the real entry points.

# Step 1: Clone target and OSS-Fuzz-Gen
git clone https://github.com/google/oss-fuzz-gen.git
cd oss-fuzz-gen

# Step 2: Generate a fuzz harness for a target function
# The LLM reads the function signature, documentation, and usage examples
# then writes a harness that exercises the function with fuzz-generated inputs

python generate_harness.py \
  --target-repo /path/to/target \
  --function "parse_input" \
  --model gpt-4o \
  --output harness.c

# Step 3: Compile and run with AFL++ or libFuzzer
clang -fsanitize=fuzzer,address -o fuzz_target harness.c target.c
./fuzz_target -max_len=4096 -timeout=10 corpus/

# Step 4: Let the LLM analyse crashes
python analyse_crash.py \
  --crash crash-*.txt \
  --source /path/to/target \
  --model gpt-4o

5. Google Big Sleep: AI Zero-Day Discovery

Big Sleep (formerly Naptime) is Google Project Zero's research into using LLM agents for vulnerability discovery. In late 2024, it found a real, exploitable stack buffer underflow in SQLite that neither traditional fuzzing nor prior human review had caught.

Big Sleep Agent Architecture

mermaid
flowchart TD
    G["Define Goal: Find variant bugs"] --> R["Read Source Code"]
    R --> I["Identify Vulnerability Patterns"]
    I --> W["Write Test Cases"]
    W --> E["Execute Tests"]
    E --> A{"Bug Confirmed?"}
    A -->|Yes| RPT["Report Finding"]
    A -->|No| REF["Refine Hypothesis"]
    REF --> R
    subgraph Tools
        CB["code_browser"]
        DB["debugger"]
        PS["python_sandbox"]
        RP["reporter"]
    end
    R -.-> CB
    W -.-> PS
    E -.-> DB
    RPT -.-> RP

Building Your Own Mini Big Sleep

You don't need Google's infrastructure to replicate this approach. Use an LLM with tool access (Claude with MCP, GPT-4o with function calling, or a local model with LangChain agents) and give it tools to: read code, run code, check coverage, and analyse crash outputs. The agent loop is where the magic happens — let the AI iterate autonomously.
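The agent loop described above can be sketched in a few lines of Python. Here `ask_llm` stands in for a function-calling LLM, and the tool set is a hypothetical minimum:

```python
from pathlib import Path
import subprocess
import sys

def read_code(path: str) -> str:
    """Tool: return the contents of a source file."""
    return Path(path).read_text()

def run_python(snippet: str) -> str:
    """Tool: execute a test-case snippet in a subprocess, capture output."""
    proc = subprocess.run([sys.executable, "-c", snippet],
                          capture_output=True, text=True, timeout=30)
    return proc.stdout + proc.stderr

TOOLS = {"read_code": read_code, "run_python": run_python}

def agent_loop(ask_llm, goal: str, max_steps: int = 10):
    """Iterate: the LLM proposes a tool call, we run it, feed the result back."""
    history = [f"GOAL: {goal}"]
    for _ in range(max_steps):
        action = ask_llm(history)       # e.g. {"tool": "run_python", "arg": "..."}
        if action["tool"] == "report":
            return action["arg"]        # final finding
        result = TOOLS[action["tool"]](action["arg"])
        history.append(f"{action['tool']} -> {result[:2000]}")
    return None
```

In a real setup, `ask_llm` wraps a chat-completion call with tool/function definitions, and `run_python` needs proper sandboxing before the agent touches untrusted code.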

6. Practical AI Code Review Workflow

Recommended Workflow for Pentesters

  1. Scope & prioritise: Identify high-risk code areas — authentication, payment processing, file uploads, API endpoints, deserialization handlers.
  2. Automated SAST first: Run Semgrep, CodeQL, or Bandit to generate a baseline of findings. This is fast and deterministic.
  3. LLM triage: Feed SAST findings to an LLM to filter false positives and prioritise true positives by exploitability.
  4. Deep-dive with AI: For critical areas, feed entire files or modules to the LLM with a security-focused prompt. Use agentic tools for multi-file analysis.
  5. AI fuzzing: For binary/compiled targets or parsing code, use LLM-generated fuzz harnesses with AFL++ or libFuzzer.
  6. Manual verification: ALWAYS manually verify AI findings before reporting. Write a proof-of-concept exploit for each confirmed vulnerability.
  7. Report with context: Use the LLM to draft finding descriptions, impact analysis, and remediation guidance — but review and edit before submission.
bash
# Complete AI code review pipeline
# Combines: Semgrep → LLM triage → Deep analysis → Report generation

#!/bin/bash
set -e

TARGET_DIR="$1"
OUTPUT_DIR="./audit-results"
mkdir -p "$OUTPUT_DIR"

echo "[1/4] Running Semgrep SAST scan..."
semgrep scan --config=auto --json --output "$OUTPUT_DIR/semgrep.json" "$TARGET_DIR"

echo "[2/4] LLM triage of Semgrep findings..."
python ai_triage.py \
  --findings "$OUTPUT_DIR/semgrep.json" \
  --model deepseek-v3 \
  --output "$OUTPUT_DIR/triaged.json"

echo "[3/4] Deep AI analysis of critical files..."
python ai_deep_review.py \
  --directory "$TARGET_DIR" \
  --focus "auth,payment,upload,api" \
  --model gpt-4o \
  --output "$OUTPUT_DIR/deep-review.md"

echo "[4/4] Generating report..."
python ai_report.py \
  --triaged "$OUTPUT_DIR/triaged.json" \
  --deep "$OUTPUT_DIR/deep-review.md" \
  --output "$OUTPUT_DIR/security-audit-report.md"

echo "Done! Report: $OUTPUT_DIR/security-audit-report.md"

Always Verify AI Findings

AI-generated vulnerability reports WILL contain false positives. Never submit an AI finding to a bug bounty program or client report without manually verifying it and writing a working proof-of-concept. The AI is a force multiplier — not a replacement for human judgement.

AI Code Review Labs

Hands-on exercises applying AI-assisted code review and fuzzing techniques.

  • AI-Powered OWASP Juice Shop Review (Custom Lab, easy): LLM code audit, prompt engineering, vulnerability classification
  • Semgrep + LLM Triage Pipeline (Custom Lab, medium): SAST integration, false-positive filtering, automated triage
  • Build a Mini Vulnerability Discovery Agent (Custom Lab, hard): LangChain agents, tool use, iterative code analysis
  • AI Fuzz Harness Generation with OSS-Fuzz-Gen (Custom Lab, hard): fuzz-harness generation, AFL++, crash analysis