The Four Adversarial Attack Types
Adversarial attacks are deliberate, targeted attempts to cause an AI model to produce wrong, harmful, or exploitable outputs, all without triggering conventional detection. Understanding the taxonomy is the first step to building defences.
An adversarial input is a carefully crafted input that exploits weaknesses in how a model learned, not weaknesses in how code was written. This is what makes such attacks invisible to traditional cybersecurity tools.
Attack Type 1: Evasion Attacks
Evasion attacks manipulate inputs at inference time, that is, when the model is already deployed and processing real data. The attacker crafts an input that looks legitimate to humans but causes the model to misclassify or misbehave. Some typical examples, with a short code sketch after the list:
- In computer vision: adding imperceptible pixel noise that flips a "stop sign" to "speed limit" for an autonomous vehicle
- In fraud detection: crafting transaction data that mimics legitimate patterns but is in fact fraudulent
- In NLP: rephrasing a phishing email so a spam classifier marks it as safe
- In malware detection: subtly modifying malicious code to evade AI-powered antivirus
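To make this concrete, here is a minimal FGSM-style sketch against a toy linear classifier. Everything in it is an illustrative assumption (the model, its weights, and the epsilon budget); it is not code for attacking any real system.

```python
# FGSM-style evasion against a toy linear classifier (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

# Toy "deployed model": logistic regression with fixed, assumed weights.
w = rng.normal(size=20)
b = 0.1

def predict_proba(x):
    """Probability that input x is classified as 'malicious' (class 1)."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

# A sample input; the attacker wants to push its score toward 'safe'.
x = rng.normal(size=20)

# FGSM: nudge each feature a small, bounded step in the direction that
# lowers the 'malicious' score. For a linear model, the gradient of the
# logit with respect to the input is simply the weight vector.
epsilon = 0.2
x_adv = x - epsilon * np.sign(w)

print(f"original score:    {predict_proba(x):.3f}")
print(f"adversarial score: {predict_proba(x_adv):.3f}")
```

The point of the bounded step is that x_adv stays close to the original input (no feature moves by more than epsilon) while the classification can still flip, which is exactly why these inputs look legitimate to humans.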
Attack Type 2: Model Inversion
Model inversion attacks work by querying a deployed model repeatedly with crafted inputs to reconstruct information about the training data. The attacker doesn't need access to your database; they use your model as a window into it.
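As a hedged sketch of the mechanics, the loop below hill-climbs toward inputs that maximise a model's confidence, using nothing but query access. The `query_model` function is a stand-in for a real prediction API, and the "secret" record is simulated; all names and numbers are assumptions for illustration.

```python
# Black-box model inversion sketch: reconstruct a representative record
# purely by querying a deployed model (all values simulated).
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sensitive record the model effectively memorised.
secret_record = rng.normal(size=8)

def query_model(x):
    """Stand-in prediction API: confidence peaks near the secret record."""
    return float(np.exp(-np.sum((x - secret_record) ** 2)))

# Attacker side: no database access, only queries.
x = np.zeros(8)
best_score = query_model(x)

for _ in range(20_000):
    candidate = x + rng.normal(scale=0.05, size=8)
    score = query_model(candidate)
    if score > best_score:  # keep moves that raise the model's confidence
        x, best_score = candidate, score

print("reconstruction error:", np.linalg.norm(x - secret_record))
```

After enough queries the reconstruction error approaches zero: the attacker has recovered the record through the model alone, which is the sense in which the model acts as a window into the training data.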
If your model was trained on personal data, a successful model inversion attack is a personal data breach under GDPR, regardless of whether your database was ever accessed. The Information Commissioner's Office has confirmed this interpretation.
Attack Type 3: Membership Inference
Membership inference is more targeted than model inversion. Instead of reconstructing training data, the attacker wants to know whether a specific individual's record was used in training. This sounds abstract, but consider the implications; a sketch of the basic technique follows these examples.
- A healthcare model trained on patient records: an attacker could confirm whether a specific person has a particular condition
- A financial risk model: confirming whether a person's bankruptcy history was in the training set
- An HR performance model: revealing whether specific employees were labelled as underperformers
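A common, simple form of this attack exploits the fact that overfit models are more confident on records they were trained on. The sketch below simulates that gap and applies a confidence threshold; the distributions and threshold are assumptions chosen for illustration, not measurements from a real model.

```python
# Confidence-threshold membership inference (simulated confidences).
import numpy as np

rng = np.random.default_rng(2)

# Confidence the target model assigns to the true label: training
# members tend to score higher than records the model never saw.
member_conf = rng.beta(8, 2, size=1000)     # skewed toward 1.0
nonmember_conf = rng.beta(4, 4, size=1000)  # centred near 0.5

def infer_membership(confidence, threshold=0.75):
    """Guess 'member' when the model is unusually confident."""
    return confidence >= threshold

tpr = infer_membership(member_conf).mean()     # members correctly flagged
fpr = infer_membership(nonmember_conf).mean()  # non-members wrongly flagged
print(f"members flagged: {tpr:.1%}, non-members flagged: {fpr:.1%}")
```

The attacker only needs the model's confidence on a single record to make a better-than-chance guess about whether that record was in the training set.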
Attack Type 4: Model Extraction
Model extraction is IP theft via API. An attacker systematically queries your model, thousands or millions of times, with carefully designed inputs, collects the outputs, and uses them to train a functional replica of your model.
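The sketch below shows the core idea at toy scale: harvest input/output pairs through the exposed endpoint, then fit a surrogate. The victim here is a hypothetical linear scorer and `victim_api` is a stand-in for a real endpoint; real attacks apply the same pattern to classification APIs with far more queries.

```python
# Model extraction sketch: fit a replica from API responses alone.
import numpy as np

rng = np.random.default_rng(3)

# Victim side: proprietary weights the attacker never sees.
secret_w = rng.normal(size=12)

def victim_api(x):
    """Stand-in for the publicly exposed prediction endpoint."""
    return x @ secret_w

# Attacker side: harvest input/output pairs through the API...
queries = rng.normal(size=(5000, 12))
responses = victim_api(queries)

# ...then fit a functional replica by least squares.
stolen_w, *_ = np.linalg.lstsq(queries, responses, rcond=None)

print("weight recovery error:", np.linalg.norm(stolen_w - secret_w))
```

With enough queries the replica tracks the victim almost exactly, despite zero access to code, data, or infrastructure.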
If your AI model represents a competitive advantage (proprietary risk scoring, demand forecasting, recommendation logic), model extraction means a competitor or adversary can steal it without ever touching your codebase, infrastructure, or database. The only thing they need is API access.
These four attack types require different defences. You cannot address all of them with a single control. The key starting point is knowing which of your AI systems are exposed to external inputs: those are your highest-risk surfaces.
