AI Evaluation Workbench
AI security work is not finished when a prompt fails once. High-quality teams turn findings into repeatable evaluations that run after prompt edits, model upgrades, connector changes, and tool-permission changes.
Evaluation goal
V2 Attack Flow Diagram
Finding To Regression Eval
Turn one offensive finding into a small test that keeps working after system changes.
Confirmed Failure
Prompt, fixture, model output, retrieval trace, or tool call from the assessment.
Expected Control
Refuse, sanitize, retrieve only allowed chunks, require approval, or log and alert.
Automate Check
PyRIT, Garak, Promptfoo, or custom scorer with versioned fixtures.
Gate Drift
Run after model, prompt, connector, policy, and tool-schema changes.
Workbench Stack
PyRIT
Useful for orchestrated red-team workflows, target adapters, scoring, and repeatable attack strategy execution.
Garak
Useful for broad LLM vulnerability probing across prompt injection, leakage, encoding, and unsafe response classes.
Promptfoo
Useful for product-team regression gates, prompt variants, provider comparisons, and pass/fail assertions.
Custom fixtures
Required for RAG, tool-use, tenant isolation, and business-logic tests that generic scanners cannot understand.
Eval Dataset Design
- Fixture: prompt, user role, source document, tool manifest, and expected control behavior.
- Oracle: deterministic assertion, semantic judge, human review queue, or policy decision.
- Severity: impact if the test fails, not how clever the prompt looks.
- Versioning: model, prompt, retrieval config, tool schema, and eval set version.
- Evidence: request ID, output, retrieved context, tool-call trace, and scored result.
- Gate: release blocking threshold, warning threshold, and owner for triage.
Score What Matters
Control pass rate
Did the system refuse, sanitize, retrieve correctly, require approval, or log the event as designed?
Impact severity
Prioritize tests that expose data, trigger tools, cross tenants, alter code, or produce unsafe operational guidance.
Regression drift
Track whether fixes degrade after prompt changes, model swaps, new connectors, or guardrail updates.
Report Template
Eval name: [short descriptive name]
Target: [app/model/feature]
Control expected: [refuse / sanitize / retrieve only allowed docs / require approval / log]
Observed result: [pass/fail plus evidence ID]
Risk: [business impact if this regresses]
Owner: [team responsible for prompt, gateway, retrieval, or tool policy]
Keep evals small and brutal
Advanced Research
Operator Playbook
Convert offensive AI findings into repeatable evals that product teams can run after model, prompt, policy, or connector changes.
Offensive Focus
- Build adversarial prompt suites from real findings and controlled fixtures.
- Score security outcomes that matter: data exposure, tool action, policy bypass, tenant confusion, and logging failure.
- Use evals to prevent regressions, not merely to generate pass/fail percentages.
Evidence To Capture
- Written scope and allowed test classes
- Timestamped prompts, retrieved context, tool calls, and response artifacts
- Request IDs, model/provider/version, policy decisions, and tenant or user role
- Screenshots or exported logs that reproduce the finding without exposing client secrets
Offensive Test Cases
Finding-to-eval conversion
- Objective
- Convert one confirmed finding into a repeatable test with expected safe behavior.
- Authorized setup
- Use sanitized prompts, fixtures, and staging targets.
- Evidence
- Original finding, eval prompt, fixture, scorer, expected result, and regression run output.
Provider drift comparison
- Objective
- Run the same adversarial suite across model versions or providers and compare security outcomes.
- Authorized setup
- Use approved providers and non-sensitive test data.
- Evidence
- Provider/model version, prompt version, pass/fail deltas, and risk decision.
Common Findings
- Teams patch prompts manually without converting failures into regression tests.
- Evals measure refusal wording instead of real security outcomes.
- Provider/model changes ship without security drift checks.
Lab Ideas
- Create a ten-case prompt injection suite for a toy support bot.
- Use Promptfoo to compare two prompt versions.
- Write a custom scorer for retrieved canary leakage.