Detecting and Classifying AI Security Events
Before you can respond to an AI security incident, you need to detect it, and detection is harder than it sounds. AI incidents often don't generate the obvious signals of conventional attacks: no malware, no network intrusion, no telltale access log entry.
The Four Detection Signals
- Performance degradation – A sudden or gradual decline in model accuracy, precision, or any operational KPI. Could indicate poisoning, drift, or active evasion.
- Statistical anomalies in inputs – Inputs that are statistically unusual compared to the baseline distribution. May indicate active adversarial probing or injection attempts.
- Output anomalies – Model outputs that are factually wrong, unexpectedly formatted, or contain content outside the expected range. May indicate successful injection or jailbreaking.
- Query pattern anomalies – Unusual volume, frequency, or patterns in API queries. The signature of model extraction or membership inference attacks.
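As an illustration of the second signal, statistical input anomalies can be flagged with a simple z-score check of a recent batch against a recorded baseline. This is a minimal stdlib sketch; the function names and the threshold of 3 standard deviations are illustrative choices, not part of the framework:

```python
import statistics

def input_anomaly_score(baseline: list[float], recent: list[float]) -> float:
    """Z-score of the recent batch mean against the baseline distribution."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    batch_mean = statistics.mean(recent)
    return abs(batch_mean - mu) / sigma if sigma > 0 else 0.0

def is_input_anomalous(baseline: list[float],
                       recent: list[float],
                       threshold: float = 3.0) -> bool:
    """Flag the batch when it drifts more than `threshold` sigmas from baseline."""
    return input_anomaly_score(baseline, recent) > threshold
```

In practice you would track richer statistics than a single feature mean (per-feature distributions, token-length histograms, embedding distances), but the shape is the same: a stored baseline, a fresh observation, and a deviation threshold.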
The AI Incident Classification Framework
Not every anomaly is a security incident. Before mobilising a response, classify what you're dealing with:
- Class A – Security Incident: Evidence of deliberate malicious activity: confirmed injection, active extraction attempt, poisoning detected. Full incident response activation.
- Class B – Safety Event: Model producing harmful, biased, or incorrect outputs without evidence of attack. Model safety review required.
- Class C – Quality Degradation: Model performance declining but no evidence of attack or safety issue. Engineering investigation required.
- Class D – Environmental Change: Model behaving differently due to infrastructure changes, dependency updates, or data pipeline changes. Root cause analysis required.
Treating a Class A incident as a Class C quality issue delays containment and allows an active attack to continue. Conversely, treating a Class C issue as a Class A incident wastes incident response resources and causes unnecessary escalation. The classification step is worth the time it takes.
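The four classes can be encoded as a priority-ordered triage routine: evidence of attack trumps everything else, and a known environmental change explains a decline before a quality investigation is opened. The evidence flags below are illustrative placeholders for whatever your detection pipeline actually produces:

```python
from enum import Enum

class IncidentClass(Enum):
    A = "Security Incident"
    B = "Safety Event"
    C = "Quality Degradation"
    D = "Environmental Change"

def classify_event(malicious_evidence: bool,
                   harmful_outputs: bool,
                   performance_decline: bool,
                   environment_changed: bool) -> IncidentClass:
    """Triage an anomaly in priority order (sketch; flags are illustrative)."""
    if malicious_evidence:
        return IncidentClass.A          # full incident response activation
    if harmful_outputs:
        return IncidentClass.B          # model safety review
    if environment_changed:
        return IncidentClass.D          # root cause analysis
    if performance_decline:
        return IncidentClass.C          # engineering investigation
    raise ValueError("no anomaly signals present; nothing to classify")
```

The ordering is the point: a performance decline with evidence of poisoning is Class A, not Class C, which is exactly the misclassification the paragraph above warns against.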
What to Monitor
- Model accuracy metrics on a holdout test set – daily at minimum, hourly for critical systems
- Distribution of inputs and outputs – flag significant deviations from baseline
- API query volumes, sources, and patterns – alert on statistical anomalies
- Error rates and refusal rates for LLM systems – sudden spikes may indicate attack activity
- User reports and feedback – often the first signal of a safety or output quality issue
You cannot detect anomalies without a baseline. Before you can run an AI incident response programme, you need to know what normal looks like for every AI system you operate. Document the expected performance ranges and input distributions for each system – this becomes your detection benchmark.
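Such a baseline can be as simple as a structured record per system, checked on a schedule. A hedged sketch of what that record might look like; the field names, system name, and ranges are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class SystemBaseline:
    """Documented 'normal' for one AI system (illustrative fields)."""
    system_name: str
    accuracy_range: tuple[float, float]   # expected (min, max) on the holdout set
    daily_query_range: tuple[int, int]    # expected (min, max) API calls per day

    def check(self, accuracy: float, daily_queries: int) -> list[str]:
        """Compare fresh observations against the documented baseline."""
        findings = []
        lo, hi = self.accuracy_range
        if not lo <= accuracy <= hi:
            findings.append(f"accuracy {accuracy:.3f} outside [{lo}, {hi}]")
        qlo, qhi = self.daily_query_range
        if not qlo <= daily_queries <= qhi:
            findings.append(f"query volume {daily_queries} outside [{qlo}, {qhi}]")
        return findings
```

An empty findings list means the system is inside its documented benchmark; anything else feeds the classification step above.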
