Detecting and Classifying AI Security Events
Before you can respond to an AI security incident, you need to detect it, and detection is harder than it sounds. AI incidents often don't generate the obvious signals of conventional attacks: no malware, no network intrusion, no telltale access log entry.
The Four Detection Signals
- Performance degradation – A sudden or gradual decline in model accuracy, precision, or any operational KPI. Could indicate poisoning, drift, or active evasion.
- Statistical anomalies in inputs – Inputs that are statistically unusual compared to the baseline distribution. May indicate active adversarial probing or injection attempts.
- Output anomalies – Model outputs that are factually wrong, unexpectedly formatted, or contain content outside the expected range. May indicate successful injection or jailbreaking.
- Query pattern anomalies – Unusual volume, frequency, or patterns in API queries. The signature of model extraction or membership inference attacks.
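As an illustration of the second signal, statistical input anomalies can be flagged with a simple z-score check of a recent batch against a recorded baseline. This is a minimal stdlib sketch; the function names and the threshold of 3 standard deviations are illustrative choices, not part of the framework:

```python
import statistics

def input_anomaly_score(baseline: list[float], recent: list[float]) -> float:
    """Z-score of the recent batch mean against the baseline distribution."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    batch_mean = statistics.mean(recent)
    return abs(batch_mean - mu) / sigma if sigma > 0 else 0.0

def is_input_anomalous(baseline: list[float],
                       recent: list[float],
                       threshold: float = 3.0) -> bool:
    """Flag the batch when it drifts more than `threshold` sigmas from baseline."""
    return input_anomaly_score(baseline, recent) > threshold
```

In practice you would track richer statistics than a single feature mean (per-feature distributions, token-length histograms, embedding distances), but the shape is the same: a stored baseline, a fresh observation, and a deviation threshold.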
The AI Incident Classification Framework
Not every anomaly is a security incident. Before mobilising a response, classify what you're dealing with:
- Class A – Security Incident: Evidence of deliberate malicious activity: confirmed injection, active extraction attempt, poisoning detected. Full incident response activation.
- Class B – Safety Event: Model producing harmful, biased, or incorrect outputs without evidence of attack. Model safety review required.
- Class C – Quality Degradation: Model performance declining but no evidence of attack or safety issue. Engineering investigation required.
- Class D – Environmental Change: Model behaving differently due to infrastructure changes, dependency updates, or data pipeline changes. Root cause analysis required.
Treating a Class A incident as a Class C quality issue delays containment and allows an active attack to continue. Conversely, treating a Class C issue as a Class A incident wastes incident response resources and causes unnecessary escalation. The classification step is worth the time it takes.
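The four classes can be encoded as a priority-ordered triage routine: evidence of attack trumps everything else, and a known environmental change explains a decline before a quality investigation is opened. The evidence flags below are illustrative placeholders for whatever your detection pipeline actually produces:

```python
from enum import Enum

class IncidentClass(Enum):
    A = "Security Incident"
    B = "Safety Event"
    C = "Quality Degradation"
    D = "Environmental Change"

def classify_event(malicious_evidence: bool,
                   harmful_outputs: bool,
                   performance_decline: bool,
                   environment_changed: bool) -> IncidentClass:
    """Triage an anomaly in priority order (sketch; flags are illustrative)."""
    if malicious_evidence:
        return IncidentClass.A          # full incident response activation
    if harmful_outputs:
        return IncidentClass.B          # model safety review
    if environment_changed:
        return IncidentClass.D          # root cause analysis
    if performance_decline:
        return IncidentClass.C          # engineering investigation
    raise ValueError("no anomaly signals present; nothing to classify")
```

The ordering is the point: a performance decline with evidence of poisoning is Class A, not Class C, which is exactly the misclassification the paragraph above warns against.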
What to Monitor
- Model accuracy metrics on a holdout test set – daily at minimum, hourly for critical systems
- Distribution of inputs and outputs – flag significant deviations from baseline
- API query volumes, sources, and patterns – alert on statistical anomalies
- Error rates and refusal rates for LLM systems – sudden spikes may indicate attack activity
- User reports and feedback – often the first signal of a safety or output quality issue
You cannot detect anomalies without a baseline. Before you can run an AI incident response programme, you need to know what normal looks like for every AI system you operate. Document the expected performance ranges and input distributions for each system – this becomes your detection benchmark.
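Such a baseline can be as simple as a structured record per system, checked on a schedule. A hedged sketch of what that record might look like; the field names, system name, and ranges are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class SystemBaseline:
    """Documented 'normal' for one AI system (illustrative fields)."""
    system_name: str
    accuracy_range: tuple[float, float]   # expected (min, max) on the holdout set
    daily_query_range: tuple[int, int]    # expected (min, max) API calls per day

    def check(self, accuracy: float, daily_queries: int) -> list[str]:
        """Compare fresh observations against the documented baseline."""
        findings = []
        lo, hi = self.accuracy_range
        if not lo <= accuracy <= hi:
            findings.append(f"accuracy {accuracy:.3f} outside [{lo}, {hi}]")
        qlo, qhi = self.daily_query_range
        if not qlo <= daily_queries <= qhi:
            findings.append(f"query volume {daily_queries} outside [{qlo}, {qhi}]")
        return findings
```

An empty findings list means the system is inside its documented benchmark; anything else feeds the classification step above.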
