1.2 · Adversarial Attacks & Model Manipulation

The Four Adversarial Attack Types

โฑ 12 minCourse 01

Adversarial attacks are deliberate, targeted attempts to cause an AI model to produce wrong, harmful, or exploitable outputs without triggering conventional detection. Understanding the taxonomy is the first step to building defences.

Core Definition

An adversarial input is a carefully crafted input that exploits weaknesses in how a model learned, not weaknesses in how its code was written. This is what makes such attacks invisible to traditional cybersecurity tools.

Attack Type 1: Evasion Attacks

Evasion attacks manipulate inputs at inference time, that is, when the model is already deployed and processing real data. The attacker crafts an input that looks legitimate to humans but causes the model to misclassify or misbehave.

  • In computer vision: adding imperceptible pixel noise that flips a "stop sign" classification to "speed limit" for an autonomous vehicle
  • In fraud detection: crafting transaction data that mimics legitimate patterns but is in fact fraudulent
  • In NLP: rephrasing a phishing email so a spam classifier marks it as safe
  • In malware detection: subtly modifying malicious code to evade AI-powered antivirus
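The core mechanic behind many evasion attacks can be sketched in a few lines. The following is a minimal, illustrative FGSM-style example against a toy logistic classifier; the model, weights, and epsilon are assumptions invented for the sketch, not taken from any real system.

```python
import numpy as np

# Toy evasion sketch: nudge an input a small step in the direction
# that lowers its "malicious" score. For a logistic model, the sign
# of the input gradient is simply sign(w), so no autodiff is needed.
# All weights and values here are illustrative assumptions.

def predict(w, b, x):
    """Logistic score: probability the input is classed 'malicious'."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def fgsm_perturb(w, b, x, epsilon=0.5):
    """Shift x by epsilon in the direction that reduces the score."""
    return x - epsilon * np.sign(w)

rng = np.random.default_rng(0)
w = rng.normal(size=8)
b = 0.0
x = np.abs(rng.normal(size=8))          # a 'malicious' feature vector

original = predict(w, b, x)
adv = fgsm_perturb(w, b, x)
evaded = predict(w, b, adv)
print(f"score before: {original:.3f}, after: {evaded:.3f}")
```

The perturbation is small in each feature, yet the score drops sharply, because the step is aligned with the model's own decision gradient rather than chosen at random.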

Attack Type 2: Model Inversion

Model inversion attacks work by querying a deployed model repeatedly with crafted inputs to reconstruct information about the training data. The attacker doesn't need access to your database; they use your model as a window into it.

⚠ GDPR Implication

If your model was trained on personal data, a successful model inversion attack is a personal data breach under GDPR, regardless of whether your database was ever accessed. The Information Commissioner's Office has confirmed this interpretation.
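The query-only nature of model inversion can be illustrated with a short hill-climbing loop. In this sketch, the `secret` vector stands in for sensitive training data and `query` stands in for the deployed model's API; both are invented for illustration. The attacker never reads the secret directly, only the confidence scores the API returns.

```python
import numpy as np

# Model inversion sketch: the attacker can only call query(), yet by
# keeping whichever random perturbations the API rewards, their guess
# converges toward a class-representative vector. The 'secret'
# template is an illustrative stand-in for private training data.

rng = np.random.default_rng(1)
secret = rng.uniform(size=16)           # private training data (hidden)

def query(x):
    """Black-box API: confidence that x belongs to the target class."""
    return float(np.exp(-np.sum((x - secret) ** 2)))

start = np.full(16, 0.5)                # attacker's neutral first guess
x = start
for _ in range(2000):
    candidate = np.clip(x + rng.normal(scale=0.05, size=16), 0.0, 1.0)
    if query(candidate) > query(x):     # keep moves the API rewards
        x = candidate

print(f"reconstruction error: {np.abs(x - secret).mean():.3f}")
```

Real attacks use far more sophisticated search, but the lesson is the same: every confidence score the API returns leaks a little information about what the model memorised.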

Attack Type 3: Membership Inference

Membership inference is more targeted than model inversion. Instead of reconstructing training data, the attacker wants to know whether a specific individual's record was used in training. This sounds abstract, but consider the implications.

  • A healthcare model trained on patient records: an attacker could confirm whether a specific person has a particular condition
  • A financial risk model: confirming whether a person's bankruptcy history was in the training set
  • An HR performance model: revealing whether specific employees were labelled as underperformers
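The classic signal exploited here is overfitting: models tend to be more confident on records they were trained on than on unseen records. A minimal sketch of a confidence-threshold membership test follows; the confidence distributions and the 0.7 threshold are illustrative assumptions, not measurements from a real model.

```python
import numpy as np

# Membership inference sketch: simulate the confidence gap between
# training members and non-members, then apply a simple threshold
# rule. Distributions and threshold are illustrative assumptions.

rng = np.random.default_rng(2)
member_conf = rng.beta(8, 2, size=500)      # confidences on training records
nonmember_conf = rng.beta(4, 4, size=500)   # confidences on unseen records

threshold = 0.7                             # attacker's decision rule
tp = np.mean(member_conf > threshold)       # members correctly flagged
fp = np.mean(nonmember_conf > threshold)    # non-members wrongly flagged
print(f"true positive rate {tp:.2f}, false positive rate {fp:.2f}")
```

If the true positive rate meaningfully exceeds the false positive rate, the attacker learns membership better than chance, and that gap is exactly what defences such as differential privacy aim to close.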

Attack Type 4: Model Extraction

Model extraction is IP theft via API. An attacker systematically queries your model, thousands or millions of times, with carefully designed inputs, collects the outputs, and uses them to train a functional replica of your model.
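The attack loop is simple enough to sketch end to end: probe the API, record its answers, fit a surrogate. In this illustrative example the victim is a linear classifier and the attacker fits a least-squares surrogate on its hard labels; the victim model and the query budget are assumptions made for the sketch.

```python
import numpy as np

# Model extraction sketch: the attacker never sees w_victim, only the
# labels the API returns, yet a surrogate fitted on those labels
# closely replicates the victim's decisions. Victim model and query
# budget are illustrative assumptions.

rng = np.random.default_rng(3)
w_victim = rng.normal(size=10)          # proprietary weights, never exposed

def victim_api(X):
    """Black-box endpoint: returns only hard 0/1 labels."""
    return (X @ w_victim > 0).astype(float)

# 1. Spend the query budget on random probe inputs.
X = rng.normal(size=(5000, 10))
y = victim_api(X)

# 2. Fit a surrogate by least squares on the API's answers (labels
#    mapped from {0, 1} to {-1, +1}).
w_stolen, *_ = np.linalg.lstsq(X, 2 * y - 1, rcond=None)

# 3. Measure agreement between surrogate and victim on fresh inputs.
X_test = rng.normal(size=(1000, 10))
agreement = np.mean(victim_api(X_test) == (X_test @ w_stolen > 0))
print(f"surrogate agrees with victim on {agreement:.1%} of inputs")
```

Even though the API returned nothing but labels, the surrogate agrees with the victim on the vast majority of fresh inputs, which is why rate limiting and query monitoring matter for any model exposed via API.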

The Commercial Impact

If your AI model represents a competitive advantage (proprietary risk scoring, demand forecasting, recommendation logic), model extraction means a competitor or adversary can steal it without ever touching your codebase, infrastructure, or database. The only thing they need is API access.

  • 4.2× increase in adversarial AI attacks since 2022
  • 73% of production AI systems have no adversarial testing
  • £2.1M average cost of a model-related security incident (2025)
✓ What You Should Know

These four attack types require different defences. You cannot address all of them with a single control. The key starting point is knowing which of your AI systems are exposed to external inputs โ€” those are your highest-risk surfaces.