Code Analysis
Intermediate
T1190 T1059

AI Code Review & Fuzzing

LLMs can analyse source code for vulnerabilities far faster than manual review, while AI-guided fuzzers generate smarter inputs to find bugs that traditional fuzzers miss. Google's Big Sleep project demonstrated this by discovering real-world zero-days in production software using AI alone.

The State of AI Vulnerability Discovery (2026)

Google's Project Big Sleep (formerly Naptime) used an LLM agent to discover a previously unknown, exploitable stack buffer underflow in SQLite — a critical, widely deployed database engine. This was the first publicly reported case of an AI agent finding a real 0-day before human researchers, and it won't be the last.

Why AI Code Review?

AI Strengths

  • Pattern recognition: Trained on millions of known-vulnerable code patterns
  • Tirelessness: Can audit 100K+ lines without fatigue or attention drift
  • Cross-language: Same model reviews C, Python, Go, JS, Solidity, etc.
  • Context window: Modern models handle 128K+ tokens — entire modules or small codebases in one pass
  • Explanation: Generates human-readable descriptions of why code is vulnerable

AI Limitations

  • Hallucinations: May report vulnerabilities that don't exist (false positives)
  • Business logic: Struggles with app-specific logic flaws that require domain knowledge
  • Subtle bugs: Race conditions and complex state machines still challenge LLMs
  • Scope limits: Even large context windows can't hold entire enterprise codebases
  • Requires verification: Every AI finding must be manually confirmed

1. LLM-Assisted Code Auditing

The most practical approach is to feed code chunks to an LLM with a security-focused system prompt, then triage the results manually. This works for both white-box pentests and bug bounty hunting.

Security Audit Prompt Framework

python
# AI-assisted code review framework
# Works with: GPT-4o, Claude 3.5 Sonnet, DeepSeek-Coder, local models via Ollama

import openai
from pathlib import Path

SECURITY_AUDIT_PROMPT = """You are an expert security code reviewer. Analyse the following code for:

1. **Injection vulnerabilities**: SQL injection, command injection, LDAP injection, XSS
2. **Authentication/Authorisation flaws**: Broken auth, IDOR, privilege escalation
3. **Cryptographic issues**: Weak algorithms, hardcoded keys, improper random
4. **Memory safety**: Buffer overflows, use-after-free (for C/C++/Rust unsafe)
5. **Deserialization**: Unsafe deserialization of user input
6. **SSRF / Path traversal**: Server-side request forgery, directory traversal
7. **Race conditions**: TOCTOU, missing locks on shared state
8. **Secrets exposure**: API keys, tokens, passwords in source
9. **Dependency risks**: Known-vulnerable library patterns

For each finding:
- Severity: CRITICAL / HIGH / MEDIUM / LOW / INFO
- Vulnerable code: Quote the exact line(s)
- Impact: What an attacker can achieve
- CWE: Map to CWE ID
- Remediation: Specific fix with code example

If no vulnerabilities are found, state that explicitly. Do not invent findings."""

def audit_file(filepath: str, model: str = "gpt-4o") -> str:
    """Audit a single source file for security vulnerabilities."""
    code = Path(filepath).read_text()
    
    response = openai.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SECURITY_AUDIT_PROMPT},
            {"role": "user", "content": f"File: {filepath}\n\n```\n{code}\n```"}
        ],
        temperature=0.1  # Low temp → more precise, fewer hallucinations
    )
    return response.choices[0].message.content

def audit_directory(directory: str, extensions: list = None):
    """Audit all matching files in a directory tree."""
    if extensions is None:
        extensions = ['.py', '.js', '.ts', '.go', '.c', '.cpp', '.java', '.rs', '.php', '.rb']
    
    findings = []
    for path in Path(directory).rglob('*'):
        if path.suffix in extensions and path.is_file():
            print(f"Auditing: {path}")
            result = audit_file(str(path))
            if "no vulnerabilities" not in result.lower():
                findings.append({"file": str(path), "findings": result})
    
    return findings

# Example: audit a Python web app
results = audit_directory("./target-app/src/", ['.py'])
for r in results:
    print(f"\n{'='*60}")
    print(f"FILE: {r['file']}")
    print(r['findings'])
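One limitation noted above is that a single file can exceed the model's context window. A minimal chunking helper (the sizes here are arbitrary assumptions, not tuned values) splits oversized sources into overlapping segments so each fits in one request:

```python
def chunk_code(code: str, max_chars: int = 60_000, overlap: int = 2_000) -> list[str]:
    """Split source text into overlapping chunks that fit a model's context window.

    Overlapping the boundaries reduces the chance that a vulnerability
    spanning a split point is missed by every chunk.
    """
    chunks = []
    start = 0
    while start < len(code):
        end = min(start + max_chars, len(code))
        chunks.append(code[start:end])
        if end == len(code):
            break
        start = end - overlap  # step back so chunks share a boundary region
    return chunks
```

Each chunk can then go through the same `audit_file`-style prompting individually, with the per-chunk findings merged and de-duplicated afterwards.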

Using Local Models (No Data Leakage)

bash
# For sensitive client code, use local models — zero data leaves your machine

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a coding-focused model
ollama pull deepseek-v3               # DeepSeek's general-purpose model (very large — needs serious hardware)
ollama pull qwen2.5-coder:32b         # Strong open-source code model

# Run audit against local model (OpenAI-compatible API)
export OPENAI_BASE_URL=http://localhost:11434/v1   # openai>=1.0 reads OPENAI_BASE_URL
export OPENAI_API_KEY=ollama  # Ollama ignores the key, but the SDK requires one

# Or use Ollama's native API:
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-v3",
  "prompt": "Review this code for security vulnerabilities:\n\n<paste code>",
  "stream": false
}'

Agentic Code Review

Tools like Cursor, Windsurf, and VS Code Copilot Agent Mode can iteratively review code — the AI reads a file, asks itself follow-up questions about data flow, checks related files, and chains together a multi-step analysis. This produces significantly better results than single-shot prompting.

2. Agentic Code Review Tools

Agentic coding tools go beyond single-shot prompting — they autonomously navigate codebases, follow data flows across files, and iteratively refine their analysis. These are the most effective tools for AI-assisted security review in 2026.

Cursor

AI code editor with multi-file analysis and context-aware security review. Indexes your entire project for deep cross-reference analysis.

Key: Multi-file context + Composer agent mode

Windsurf (Codeium)

AI-powered IDE with Cascade for autonomous multi-step code analysis. Understands project structure and dependency chains.

Key: Cascade autonomous agent + flow awareness

Claude Code

Anthropic's CLI agent for terminal-based code review. Reads files, runs commands, and iterates on analysis directly from your shell.

Key: Terminal-native + tool use + extended thinking

OpenAI Codex CLI

OpenAI's open-source terminal coding agent. Runs locally with sandboxed execution for safe code analysis and generation.

Key: Open-source + sandboxed execution

Aider

Open-source AI pair programming tool that works with any LLM. Git-aware with automatic commit generation and repo-map for codebase understanding.

Key: Any LLM backend + git integration + repo-map

GitHub Copilot Agent Mode

VS Code integrated agent that iterates on code review tasks. Uses workspace context, terminal access, and multi-step reasoning.

Key: VS Code native + workspace-aware + MCP tools

CLI-Based Security Review

bash
# Using Claude Code for security review
claude "Review /path/to/target-app for security vulnerabilities. Focus on
injection flaws, authentication bypasses, and IDOR. For each finding,
provide the file, line number, CWE, severity, and a remediation."

# Using Aider with any model for targeted review
pip install aider-chat
cd /path/to/target-app

# With Claude
aider --sonnet --message "Audit auth.py and api/routes.py for security
vulnerabilities. List each finding with severity, CWE ID, and a fix."

# With a local model (no data leakage)
aider --model ollama/deepseek-v3 --message "Review all files in src/
for SQL injection, command injection, and path traversal vulnerabilities."

# Using OpenAI Codex CLI
codex "Audit this repository for security vulnerabilities. Focus on the
authentication middleware, API input validation, and session handling."

3. Semgrep + AI Hybrid Analysis

The most effective approach combines traditional SAST (Static Application Security Testing) with LLM-powered triage. Semgrep finds potential issues with deterministic rules; the LLM filters out false positives and explains impact.

bash
# Step 1: Run Semgrep to get candidate findings
pip install semgrep

# Scan with security-focused rules
semgrep scan --config=p/security-audit \
             --config=p/owasp-top-ten \
             --config=p/cwe-top-25 \
             --json --output findings.json \
             ./target-app/

python
# Step 2: Feed the Semgrep findings to an LLM for triage
import json
import openai

with open("findings.json") as f:
    semgrep_results = json.load(f)

TRIAGE_PROMPT = """You are a security expert triaging Semgrep findings.
For each finding, determine:
1. Is this a TRUE positive or FALSE positive? Explain why.
2. If true positive: severity, exploitability, and recommended fix.
3. If false positive: why the code is actually safe.

Be conservative — when in doubt, flag it for manual review."""

for finding in semgrep_results.get("results", []):
    context = f"""
Rule: {finding['check_id']}
Severity: {finding['extra']['severity']}
Message: {finding['extra']['message']}
File: {finding['path']}:{finding['start']['line']}
Code: {finding['extra']['lines']}
"""
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": TRIAGE_PROMPT},
            {"role": "user", "content": context}
        ],
        temperature=0.1
    )
    print(f"\n--- {finding['path']}:{finding['start']['line']} ---")
    print(response.choices[0].message.content)

4. AI-Guided Fuzzing

Traditional fuzzers generate random or mutated inputs. AI-guided fuzzers understand the target's input format and logic, generating smarter inputs that reach deeper code paths faster.

AI Fuzzing Approaches

  • LLM seed generation (custom scripts + AFL++): the LLM generates an initial corpus from docs or a protocol spec. Best for protocol and API fuzzing.
  • Harness generation (Google OSS-Fuzz-Gen): the LLM writes fuzz harnesses from source. Best for C/C++ libraries and OSS projects.
  • Coverage-guided + LLM (FuzzGPT): the LLM analyses coverage gaps and generates targeted inputs. Best for reaching deep code paths.
  • Variant analysis (Big Sleep / Naptime): an LLM agent reviews code and past bug reports for variants. Best for finding variant bugs in patched code.
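The LLM seed-generation approach can be sketched with a small helper that turns a model's suggested example inputs into an AFL++/libFuzzer corpus directory. The separator convention and file naming below are assumptions (set by your own prompt), not part of any tool:

```python
from pathlib import Path

def extract_seeds(llm_response: str, sep: str = "-----") -> list[str]:
    """Split an LLM reply into seeds; assumes the prompt asked the model
    for one example input per block, separated by a '-----' line."""
    parts = [p.strip() for p in llm_response.split(f"\n{sep}\n")]
    return [p for p in parts if p]

def write_corpus(seeds: list[str], corpus_dir: str = "corpus") -> int:
    """Write each seed as its own file, AFL++/libFuzzer corpus style."""
    out = Path(corpus_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, seed in enumerate(seeds):
        (out / f"seed_{i:03d}").write_bytes(seed.encode())
    return len(seeds)
```

The resulting directory can be passed directly as the corpus argument to `afl-fuzz -i corpus` or a libFuzzer binary.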

LLM-Generated Fuzz Harnesses

bash
# Using Google's OSS-Fuzz-Gen approach: the LLM generates fuzz harnesses
# from source code, which are then run with coverage-guided fuzzing.
# Script names below are illustrative; check the repo README for the real entry points.

# Step 1: Clone target and OSS-Fuzz-Gen
git clone https://github.com/google/oss-fuzz-gen.git
cd oss-fuzz-gen

# Step 2: Generate a fuzz harness for a target function
# The LLM reads the function signature, documentation, and usage examples
# then writes a harness that exercises the function with fuzz-generated inputs

python generate_harness.py \
  --target-repo /path/to/target \
  --function "parse_input" \
  --model gpt-4o \
  --output harness.c

# Step 3: Compile and run with AFL++ or libFuzzer
clang -fsanitize=fuzzer,address -o fuzz_target harness.c target.c
./fuzz_target -max_len=4096 -timeout=10 corpus/

# Step 4: Let the LLM analyse crashes
python analyse_crash.py \
  --crash crash-*.txt \
  --source /path/to/target \
  --model gpt-4o

5. Google Big Sleep: AI Zero-Day Discovery

Big Sleep (formerly Naptime) is Google Project Zero's research into using LLM agents for vulnerability discovery. In late 2024, it found a real, exploitable stack buffer underflow in SQLite that neither traditional fuzzing nor prior human review had caught.

Big Sleep Agent Architecture

mermaid
flowchart TD
    G["Define Goal: Find variant bugs"] --> R["Read Source Code"]
    R --> I["Identify Vulnerability Patterns"]
    I --> W["Write Test Cases"]
    W --> E["Execute Tests"]
    E --> A{"Bug Confirmed?"}
    A -->|Yes| RPT["Report Finding"]
    A -->|No| REF["Refine Hypothesis"]
    REF --> R
    subgraph Tools
        CB["code_browser"]
        DB["debugger"]
        PS["python_sandbox"]
        RP["reporter"]
    end
    R -.-> CB
    W -.-> PS
    E -.-> DB
    RPT -.-> RP

Building Your Own Mini Big Sleep

You don't need Google's infrastructure to replicate this approach. Use an LLM with tool access (Claude with MCP, GPT-4o with function calling, or a local model with LangChain agents) and give it tools to: read code, run code, check coverage, and analyse crash outputs. The agent loop is where the magic happens — let the AI iterate autonomously.
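The agent loop described above can be sketched in a few lines of Python. Here `ask_llm` stands in for a function-calling LLM, and the tool set is a hypothetical minimum:

```python
from pathlib import Path
import subprocess
import sys

def read_code(path: str) -> str:
    """Tool: return the contents of a source file."""
    return Path(path).read_text()

def run_python(snippet: str) -> str:
    """Tool: execute a test-case snippet in a subprocess, capture output."""
    proc = subprocess.run([sys.executable, "-c", snippet],
                          capture_output=True, text=True, timeout=30)
    return proc.stdout + proc.stderr

TOOLS = {"read_code": read_code, "run_python": run_python}

def agent_loop(ask_llm, goal: str, max_steps: int = 10):
    """Iterate: the LLM proposes a tool call, we run it, feed the result back."""
    history = [f"GOAL: {goal}"]
    for _ in range(max_steps):
        action = ask_llm(history)       # e.g. {"tool": "run_python", "arg": "..."}
        if action["tool"] == "report":
            return action["arg"]        # final finding
        result = TOOLS[action["tool"]](action["arg"])
        history.append(f"{action['tool']} -> {result[:2000]}")
    return None
```

In a real setup, `ask_llm` wraps a chat-completion call with tool/function definitions, and `run_python` needs proper sandboxing before the agent touches untrusted code.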

6. Practical AI Code Review Workflow

Recommended Workflow for Pentesters

  1. Scope & prioritise: Identify high-risk code areas — authentication, payment processing, file uploads, API endpoints, deserialization handlers.
  2. Automated SAST first: Run Semgrep, CodeQL, or Bandit to generate a baseline of findings. This is fast and deterministic.
  3. LLM triage: Feed SAST findings to an LLM to filter false positives and prioritise true positives by exploitability.
  4. Deep-dive with AI: For critical areas, feed entire files or modules to the LLM with a security-focused prompt. Use agentic tools for multi-file analysis.
  5. AI fuzzing: For binary/compiled targets or parsing code, use LLM-generated fuzz harnesses with AFL++ or libFuzzer.
  6. Manual verification: ALWAYS manually verify AI findings before reporting. Write a proof-of-concept exploit for each confirmed vulnerability.
  7. Report with context: Use the LLM to draft finding descriptions, impact analysis, and remediation guidance — but review and edit before submission.
bash
# Complete AI code review pipeline
# Combines: Semgrep → LLM triage → Deep analysis → Report generation

#!/bin/bash
set -e

TARGET_DIR="$1"
OUTPUT_DIR="./audit-results"
mkdir -p "$OUTPUT_DIR"

echo "[1/4] Running Semgrep SAST scan..."
semgrep scan --config=auto --json --output "$OUTPUT_DIR/semgrep.json" "$TARGET_DIR"

echo "[2/4] LLM triage of Semgrep findings..."
python ai_triage.py \
  --findings "$OUTPUT_DIR/semgrep.json" \
  --model deepseek-v3 \
  --output "$OUTPUT_DIR/triaged.json"

echo "[3/4] Deep AI analysis of critical files..."
python ai_deep_review.py \
  --directory "$TARGET_DIR" \
  --focus "auth,payment,upload,api" \
  --model gpt-4o \
  --output "$OUTPUT_DIR/deep-review.md"

echo "[4/4] Generating report..."
python ai_report.py \
  --triaged "$OUTPUT_DIR/triaged.json" \
  --deep "$OUTPUT_DIR/deep-review.md" \
  --output "$OUTPUT_DIR/security-audit-report.md"

echo "Done! Report: $OUTPUT_DIR/security-audit-report.md"

Always Verify AI Findings

AI-generated vulnerability reports WILL contain false positives. Never submit an AI finding to a bug bounty program or client report without manually verifying it and writing a working proof-of-concept. The AI is a force multiplier — not a replacement for human judgement.

AI Code Review Labs

Hands-on exercises applying AI-assisted code review and fuzzing techniques.

  • AI-Powered OWASP Juice Shop Review (Custom Lab, easy): LLM code audit, prompt engineering, vulnerability classification
  • Semgrep + LLM Triage Pipeline (Custom Lab, medium): SAST integration, false-positive filtering, automated triage
  • Build a Mini Vulnerability Discovery Agent (Custom Lab, hard): LangChain agents, tool use, iterative code analysis
  • AI Fuzz Harness Generation with OSS-Fuzz-Gen (Custom Lab, hard): fuzz-harness generation, AFL++, crash analysis