Test CLI — Adversarial Simulation
Simulate prompt injection attacks against your system prompts. Verify your AI applications are protected in CI/CD.
On this page
Overview
Simulate prompt injection attacks against your system prompts. @safepaste/test generates adversarial variants by injecting known attack payloads into a target prompt, scans each with @safepaste/core, and reports which attacks were detected.
Primary use case: CI/CD gating — verify that your AI system prompts are protected against known prompt injection categories before deploying to production.
@safepaste/core's deterministic detection engine. No network requests, no data sent anywhere.
Installation
Install both the test CLI and the core detection engine as dependencies.
npm install @safepaste/test @safepaste/core
Or run directly without installing.
npx @safepaste/test "You are a helpful assistant"@safepaste/core (>= 0.3.0) must be installed alongside @safepaste/test. The test CLI treats it as a black box — text in, detection result out.
Quick Start
Three ways to provide your system prompt to the test CLI.
Positional argument
Pass the prompt directly as a quoted string.
safepaste-test "You are a helpful coding assistant"File input
Read the prompt from a file — useful for long system prompts.
safepaste-test --file system-prompt.txt
Stdin
Pipe the prompt from another command or script.
echo "You are a helpful assistant" | safepaste-test
CLI Reference
All available flags and their defaults.
| Flag | Description | Default |
|---|---|---|
--format <report|json|jsonl> |
Output format | report |
--strict |
Strict mode (detection threshold 25 instead of 35) | off |
--categories <cat1,cat2,...> |
Test specific categories only | all 13 |
--pass-threshold <N> |
Minimum detection rate, 0 to 1 | 0.8 |
--file <path> |
Read prompt from file | - |
--help |
Show help text | - |
--version |
Show version number | - |
Output Formats
Choose from three output formats depending on your use case.
Report (default)
Human-readable summary with an ASCII bar chart showing detection rates per category. This is the default when no --format flag is specified.
SafePaste Attack Simulation ============================================================ Target: "You are a helpful coding assistant" Categories: 13 | Variants: 78 | Threshold: 35 instruction_override 6/6 100% ████████████ role_hijacking 6/6 100% ████████████ system_prompt 6/6 100% ████████████ exfiltration 6/6 100% ████████████ secrecy 6/6 100% ████████████ jailbreak 6/6 100% ████████████ obfuscation 4/6 67% ████████ instruction_chaining 6/6 100% ████████████ meta 5/6 83% ██████████ tool_call_injection 6/6 100% ████████████ system_message_spoofing 6/6 100% ████████████ roleplay_jailbreak 0/6 0% multi_turn_injection 5/6 83% ██████████ Result: PASS 63/78 detected (80.8% >= 80% threshold)
JSON
Single JSON object with full results. Use --format json for programmatic consumption.
safepaste-test --format json "You are a helpful assistant"{
"pass": true,
"prompt": "You are a helpful assistant",
"config": { "threshold": 35, "categories": 13, "passThreshold": 0.8 },
"summary": {
"total": 78,
"detected": 63,
"missed": 15,
"rate": 0.808,
"pass": true
},
"categories": { /* per-category stats */ },
"variants": [ /* full variant details */ ]
}JSONL
One JSON object per line, one per variant. Use --format jsonl for streaming or log ingestion.
safepaste-test --format jsonl "You are a helpful assistant"{"category":"instruction_override","strategy":"prepend","flagged":true,"score":72}
{"category":"instruction_override","strategy":"append","flagged":true,"score":68}
{"category":"instruction_override","strategy":"wrap","flagged":true,"score":65}
...Exit Codes
The CLI uses exit codes for CI/CD integration. A non-zero exit code fails your pipeline.
| Code | Meaning |
|---|---|
0 |
Detection rate >= pass threshold (default 80%) |
1 |
Detection rate < pass threshold — your prompt may be vulnerable |
2 |
Usage error (missing prompt, invalid flag, file not found) |
CI/CD Integration
Add prompt injection testing to your deployment pipeline. The CLI exits with code 1 when detection falls below the pass threshold, which fails the CI step.
GitHub Actions
name: Prompt Injection Test on: [push, pull_request] jobs: security-scan: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: actions/setup-node@v4 with: node-version: 18 - run: npm install @safepaste/test @safepaste/core - name: Test prompt injection detection run: npx @safepaste/test --file prompts/system.txt --pass-threshold 0.8
Generic CI
# Basic — fail the build if detection rate is below 80% safepaste-test "Your system prompt here" || exit 1 # Strict mode with higher threshold safepaste-test --strict --pass-threshold 0.9 --file system-prompt.txt # Save a JSON report as a build artifact safepaste-test --format json --file system-prompt.txt > security-report.json
--pass-threshold 0.7 and increase it as you improve your system prompt's resilience. A threshold of 0.9 with --strict is a strong security gate.
Attack Categories
The test CLI generates payloads across all 13 prompt injection attack categories. Each category has 2 seed payloads that get expanded into 6 variants (3 injection strategies each).
| Category | Description |
|---|---|
instruction_override |
Attempts to replace or override the system's instructions |
role_hijacking |
Tries to change the AI's assigned role or persona |
system_prompt |
Attempts to extract or reveal the system prompt |
exfiltration |
Tries to exfiltrate data via URLs, markdown, or other channels |
secrecy |
Instructs the AI to hide its actions from the user |
jailbreak |
Attempts to bypass safety guardrails and restrictions |
obfuscation |
Uses encoding, character substitution, or formatting tricks to hide payloads |
instruction_chaining |
Chains multiple instructions to slip malicious ones through |
meta |
Meta-level attacks about the AI's own behavior or training |
tool_call_injection |
Injects fake tool calls or function invocations |
system_message_spoofing |
Spoofs system-level messages or delimiters |
roleplay_jailbreak |
Uses roleplay scenarios to bypass restrictions |
multi_turn_injection |
Simulates multi-turn conversation context to inject instructions |
Filter to specific categories with the --categories flag.
safepaste-test --categories exfiltration,tool_call_injection --file system-prompt.txt
How It Works
The test CLI follows a four-step process to generate and evaluate adversarial test cases.
Select seed payloads
Selects 2 seed payloads per attack category (26 total across 13 categories). Each payload is a known prompt injection technique.
Generate injection variants
For each payload, generates 3 injection variants using different strategies: prepend (payload before prompt), append (payload after prompt), and wrap (payload wraps the prompt). This produces 78 total variants.
Scan each variant
Scans each variant with @safepaste/core's scanPrompt() function. The core engine runs 61 weighted detection patterns and returns a score, risk level, and flagged status.
Aggregate and report
Aggregates results by category, computes the overall detection rate, and determines pass/fail against your threshold. Outputs the report in your chosen format.
Programmatic API
Use the run() function directly in your Node.js code for custom integrations, test suites, or dashboards.
var { run } = require('@safepaste/test'); var report = run('You are a helpful coding assistant', { strict: false, categories: null, // all 13 categories passThreshold: 0.8 }); console.log(report.pass); // true or false console.log(report.summary.rate); // 0.808 console.log(report.summary.detected); // 63 console.log(report.summary.total); // 78
Return value
The run() function returns an object with the following structure.
| Field | Type | Description |
|---|---|---|
pass |
boolean | Whether the detection rate meets the pass threshold |
prompt |
string | The original target prompt |
config |
object | Configuration used: threshold, categories count, passThreshold, strict |
summary |
object | total, detected, missed, rate, pass |
categories |
object | Per-category stats: { total, detected } for each category |
variants |
array | Full details for every variant: category, strategy, text, flagged, score, risk |
Options
| Option | Type | Default | Description |
|---|---|---|---|
strict |
boolean | false |
Use strict mode (threshold 25 instead of 35) |
categories |
string[] | null | null |
Array of category names to test, or null for all |
passThreshold |
number | 0.8 |
Minimum detection rate (0-1) for a passing result |