Evaluation
Intermediate
AML.T0054

AI Evaluation Workbench

AI security work is not finished when a prompt fails once. High-quality teams turn findings into repeatable evaluations that run after prompt edits, model upgrades, connector changes, and tool-permission changes.

Evaluation goal

Build small, representative, versioned eval sets that answer one question: did the control keep working after the system changed?

V2 Attack Flow Diagram

Finding To Regression Eval

Turn one offensive finding into a small test that keeps working after system changes.

Finding01

Confirmed Failure

Prompt, fixture, model output, retrieval trace, or tool call from the assessment.

Oracle02

Expected Control

Refuse, sanitize, retrieve only allowed chunks, require approval, or log and alert.

Eval03

Automate Check

PyRIT, Garak, Promptfoo, or custom scorer with versioned fixtures.

Release04

Gate Drift

Run after model, prompt, connector, policy, and tool-schema changes.

A useful eval measures security outcome, not prompt cleverness or refusal wording.

Workbench Stack

PyRIT

Useful for orchestrated red-team workflows, target adapters, scoring, and repeatable attack strategy execution.

Garak

Useful for broad LLM vulnerability probing across prompt injection, leakage, encoding, and unsafe response classes.

Promptfoo

Useful for product-team regression gates, prompt variants, provider comparisons, and pass/fail assertions.

Custom fixtures

Required for RAG, tool-use, tenant isolation, and business-logic tests that generic scanners cannot understand.

Eval Dataset Design

  • Fixture: prompt, user role, source document, tool manifest, and expected control behavior.
  • Oracle: deterministic assertion, semantic judge, human review queue, or policy decision.
  • Severity: impact if the test fails, not how clever the prompt looks.
  • Versioning: model, prompt, retrieval config, tool schema, and eval set version.
  • Evidence: request ID, output, retrieved context, tool-call trace, and scored result.
  • Gate: release blocking threshold, warning threshold, and owner for triage.

Score What Matters

Control pass rate

Did the system refuse, sanitize, retrieve correctly, require approval, or log the event as designed?

Impact severity

Prioritize tests that expose data, trigger tools, cross tenants, alter code, or produce unsafe operational guidance.

Regression drift

Track whether fixes degrade after prompt changes, model swaps, new connectors, or guardrail updates.

Report Template

Eval name: [short descriptive name]

Target: [app/model/feature]

Control expected: [refuse / sanitize / retrieve only allowed docs / require approval / log]

Observed result: [pass/fail plus evidence ID]

Risk: [business impact if this regresses]

Owner: [team responsible for prompt, gateway, retrieval, or tool policy]

Keep evals small and brutal

Ten high-signal tests tied to real findings are usually more valuable than thousands of generic jailbreak prompts. Start with the failures from the engagement, then expand only where the architecture creates repeated risk.

Advanced Research

Operator Playbook

Convert offensive AI findings into repeatable evals that product teams can run after model, prompt, policy, or connector changes.

Authorized use only

Offensive Focus

  • Build adversarial prompt suites from real findings and controlled fixtures.
  • Score security outcomes that matter: data exposure, tool action, policy bypass, tenant confusion, and logging failure.
  • Use evals to prevent regressions, not merely to generate pass/fail percentages.

Evidence To Capture

  • Written scope and allowed test classes
  • Timestamped prompts, retrieved context, tool calls, and response artifacts
  • Request IDs, model/provider/version, policy decisions, and tenant or user role
  • Screenshots or exported logs that reproduce the finding without exposing client secrets

Offensive Test Cases

Finding-to-eval conversion

Objective
Convert one confirmed finding into a repeatable test with expected safe behavior.
Authorized setup
Use sanitized prompts, fixtures, and staging targets.
Evidence
Original finding, eval prompt, fixture, scorer, expected result, and regression run output.

Provider drift comparison

Objective
Run the same adversarial suite across model versions or providers and compare security outcomes.
Authorized setup
Use approved providers and non-sensitive test data.
Evidence
Provider/model version, prompt version, pass/fail deltas, and risk decision.

Common Findings

  • Teams patch prompts manually without converting failures into regression tests.
  • Evals measure refusal wording instead of real security outcomes.
  • Provider/model changes ship without security drift checks.

Lab Ideas

  • Create a ten-case prompt injection suite for a toy support bot.
  • Use Promptfoo to compare two prompt versions.
  • Write a custom scorer for retrieved canary leakage.