Evaluation

Intermediate

AI Evaluation Workbench

AI security work is not finished when a prompt fails once. High-quality teams turn findings into repeatable evaluations that run after prompt edits, model upgrades, connector changes, and tool-permission changes.

Evaluation goal

Build small, representative, versioned eval sets that answer one question: did the control keep working after the system changed?

V2 Attack Flow Diagram

Finding To Regression Eval

Turn one offensive finding into a small test that keeps working after system changes.

Finding01

Confirmed Failure

Prompt, fixture, model output, retrieval trace, or tool call from the assessment.

Oracle02

Expected Control

Refuse, sanitize, retrieve only allowed chunks, require approval, or log and alert.

Eval03

Automate Check

PyRIT, Garak, Promptfoo, or custom scorer with versioned fixtures.

Release04

Gate Drift

Run after model, prompt, connector, policy, and tool-schema changes.

A useful eval measures security outcome, not prompt cleverness or refusal wording.

Workbench Stack

PyRIT

Useful for orchestrated red-team workflows, target adapters, scoring, and repeatable attack strategy execution.

Garak

Useful for broad LLM vulnerability probing across prompt injection, leakage, encoding, and unsafe response classes.

Promptfoo

Useful for product-team regression gates, prompt variants, provider comparisons, and pass/fail assertions.

Custom fixtures

Required for RAG, tool-use, tenant isolation, and business-logic tests that generic scanners cannot understand.

Eval Dataset Design

Fixture: prompt, user role, source document, tool manifest, and expected control behavior.
Oracle: deterministic assertion, semantic judge, human review queue, or policy decision.
Severity: impact if the test fails, not how clever the prompt looks.
Versioning: model, prompt, retrieval config, tool schema, and eval set version.
Evidence: request ID, output, retrieved context, tool-call trace, and scored result.
Gate: release blocking threshold, warning threshold, and owner for triage.

Score What Matters

Control pass rate

Did the system refuse, sanitize, retrieve correctly, require approval, or log the event as designed?

Impact severity

Prioritize tests that expose data, trigger tools, cross tenants, alter code, or produce unsafe operational guidance.

Regression drift

Track whether fixes degrade after prompt changes, model swaps, new connectors, or guardrail updates.

Report Template

Eval name: [short descriptive name]

Target: [app/model/feature]

Control expected: [refuse / sanitize / retrieve only allowed docs / require approval / log]

Observed result: [pass/fail plus evidence ID]

Risk: [business impact if this regresses]

Owner: [team responsible for prompt, gateway, retrieval, or tool policy]

Keep evals small and brutal

Ten high-signal tests tied to real findings are usually more valuable than thousands of generic jailbreak prompts. Start with the failures from the engagement, then expand only where the architecture creates repeated risk.

Advanced Research

Operator Playbook

Convert offensive AI findings into repeatable evals that product teams can run after model, prompt, policy, or connector changes.

Authorized use only

Offensive Focus

Build adversarial prompt suites from real findings and controlled fixtures.
Score security outcomes that matter: data exposure, tool action, policy bypass, tenant confusion, and logging failure.
Use evals to prevent regressions, not merely to generate pass/fail percentages.

Evidence To Capture

Written scope and allowed test classes
Timestamped prompts, retrieved context, tool calls, and response artifacts
Request IDs, model/provider/version, policy decisions, and tenant or user role
Screenshots or exported logs that reproduce the finding without exposing client secrets

Offensive Test Cases

Finding-to-eval conversion

Objective: Convert one confirmed finding into a repeatable test with expected safe behavior.
Authorized setup: Use sanitized prompts, fixtures, and staging targets.
Evidence: Original finding, eval prompt, fixture, scorer, expected result, and regression run output.

Provider drift comparison

Objective: Run the same adversarial suite across model versions or providers and compare security outcomes.
Authorized setup: Use approved providers and non-sensitive test data.
Evidence: Provider/model version, prompt version, pass/fail deltas, and risk decision.

Common Findings

Teams patch prompts manually without converting failures into regression tests.
Evals measure refusal wording instead of real security outcomes.
Provider/model changes ship without security drift checks.

Lab Ideas

Create a ten-case prompt injection suite for a toy support bot.
Use Promptfoo to compare two prompt versions.
Write a custom scorer for retrieved canary leakage.

Related Offensive AI Guides

AI App Pentest Methodology

Plan and execute authorized AI application tests.

RAG Security Testing

Build retrieval-specific eval fixtures.

AI Attack & Defense

Prompt injection and guardrail testing.

Tools & Resources

AI security tools, benchmarks, and practice platforms.