The Four Adversarial Attack Types
Adversarial attacks are deliberate, targeted attempts to cause an AI model to produce wrong, harmful, or exploitable outputs, all without triggering conventional detection. Understanding the taxonomy is the first step to building defences.
An adversarial input is a carefully crafted input that exploits weaknesses in how a model learned, not weaknesses in how code was written. This is what makes such attacks invisible to traditional cybersecurity tools.
Attack Type 1: Evasion Attacks
Evasion attacks manipulate inputs at inference time, that is, when the model is already deployed and processing real data. The attacker crafts an input that looks legitimate to humans but causes the model to misclassify or misbehave. Some typical examples, with a short code sketch after the list:
- In computer vision: adding imperceptible pixel noise that flips a "stop sign" to "speed limit" for an autonomous vehicle
- In fraud detection: crafting transaction data that mimics legitimate patterns but is in fact fraudulent
- In NLP: rephrasing a phishing email so a spam classifier marks it as safe
- In malware detection: subtly modifying malicious code to evade AI-powered antivirus
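To make this concrete, here is a minimal FGSM-style sketch against a toy linear classifier. Everything in it is an illustrative assumption (the model, its weights, and the epsilon budget); it is not code for attacking any real system.

```python
# FGSM-style evasion against a toy linear classifier (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

# Toy "deployed model": logistic regression with fixed, assumed weights.
w = rng.normal(size=20)
b = 0.1

def predict_proba(x):
    """Probability that input x is classified as 'malicious' (class 1)."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

# A sample input; the attacker wants to push its score toward 'safe'.
x = rng.normal(size=20)

# FGSM: nudge each feature a small, bounded step in the direction that
# lowers the 'malicious' score. For a linear model, the gradient of the
# logit with respect to the input is simply the weight vector.
epsilon = 0.2
x_adv = x - epsilon * np.sign(w)

print(f"original score:    {predict_proba(x):.3f}")
print(f"adversarial score: {predict_proba(x_adv):.3f}")
```

The point of the bounded step is that x_adv stays close to the original input (no feature moves by more than epsilon) while the classification can still flip, which is exactly why these inputs look legitimate to humans.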
Attack Type 2: Model Inversion
Model inversion attacks work by querying a deployed model repeatedly with crafted inputs to reconstruct information about the training data. The attacker doesn't need access to your database; they use your model as a window into it.
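As a hedged sketch of the mechanics, the loop below hill-climbs toward inputs that maximise a model's confidence, using nothing but query access. The `query_model` function is a stand-in for a real prediction API, and the "secret" record is simulated; all names and numbers are assumptions for illustration.

```python
# Black-box model inversion sketch: reconstruct a representative record
# purely by querying a deployed model (all values simulated).
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sensitive record the model effectively memorised.
secret_record = rng.normal(size=8)

def query_model(x):
    """Stand-in prediction API: confidence peaks near the secret record."""
    return float(np.exp(-np.sum((x - secret_record) ** 2)))

# Attacker side: no database access, only queries.
x = np.zeros(8)
best_score = query_model(x)

for _ in range(20_000):
    candidate = x + rng.normal(scale=0.05, size=8)
    score = query_model(candidate)
    if score > best_score:  # keep moves that raise the model's confidence
        x, best_score = candidate, score

print("reconstruction error:", np.linalg.norm(x - secret_record))
```

After enough queries the reconstruction error approaches zero: the attacker has recovered the record through the model alone, which is the sense in which the model acts as a window into the training data.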
If your model was trained on personal data, a successful model inversion attack is a personal data breach under GDPR, regardless of whether your database was ever accessed. The Information Commissioner's Office has confirmed this interpretation.
Attack Type 3: Membership Inference
Membership inference is more targeted than model inversion. Instead of reconstructing training data, the attacker wants to know whether a specific individual's record was used in training. This sounds abstract, but consider the implications; a sketch of the basic technique follows these examples.
- A healthcare model trained on patient records: an attacker could confirm whether a specific person has a particular condition
- A financial risk model: confirming whether a person's bankruptcy history was in the training set
- An HR performance model: revealing whether specific employees were labelled as underperformers
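A common, simple form of this attack exploits the fact that overfit models are more confident on records they were trained on. The sketch below simulates that gap and applies a confidence threshold; the distributions and threshold are assumptions chosen for illustration, not measurements from a real model.

```python
# Confidence-threshold membership inference (simulated confidences).
import numpy as np

rng = np.random.default_rng(2)

# Confidence the target model assigns to the true label: training
# members tend to score higher than records the model never saw.
member_conf = rng.beta(8, 2, size=1000)     # skewed toward 1.0
nonmember_conf = rng.beta(4, 4, size=1000)  # centred near 0.5

def infer_membership(confidence, threshold=0.75):
    """Guess 'member' when the model is unusually confident."""
    return confidence >= threshold

tpr = infer_membership(member_conf).mean()     # members correctly flagged
fpr = infer_membership(nonmember_conf).mean()  # non-members wrongly flagged
print(f"members flagged: {tpr:.1%}, non-members flagged: {fpr:.1%}")
```

The attacker only needs the model's confidence on a single record to make a better-than-chance guess about whether that record was in the training set.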
Attack Type 4: Model Extraction
Model extraction is IP theft via API. An attacker systematically queries your model, thousands or millions of times, with carefully designed inputs, collects the outputs, and uses them to train a functional replica of your model.
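The sketch below shows the core idea at toy scale: harvest input/output pairs through the exposed endpoint, then fit a surrogate. The victim here is a hypothetical linear scorer and `victim_api` is a stand-in for a real endpoint; real attacks apply the same pattern to classification APIs with far more queries.

```python
# Model extraction sketch: fit a replica from API responses alone.
import numpy as np

rng = np.random.default_rng(3)

# Victim side: proprietary weights the attacker never sees.
secret_w = rng.normal(size=12)

def victim_api(x):
    """Stand-in for the publicly exposed prediction endpoint."""
    return x @ secret_w

# Attacker side: harvest input/output pairs through the API...
queries = rng.normal(size=(5000, 12))
responses = victim_api(queries)

# ...then fit a functional replica by least squares.
stolen_w, *_ = np.linalg.lstsq(queries, responses, rcond=None)

print("weight recovery error:", np.linalg.norm(stolen_w - secret_w))
```

With enough queries the replica tracks the victim almost exactly, despite zero access to code, data, or infrastructure.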
If your AI model represents a competitive advantage (proprietary risk scoring, demand forecasting, recommendation logic), model extraction means a competitor or adversary can steal it without ever touching your codebase, infrastructure, or database. The only thing they need is API access.
These four attack types require different defences. You cannot address all of them with a single control. The key starting point is knowing which of your AI systems are exposed to external inputs: those are your highest-risk surfaces.
