Adversarial Machine Learning
Attacks across the ML lifecycle: evasion (adversarial examples), poisoning (corrupt training data), model extraction (steal via queries), privacy leakage (membership inference), and LLM jailbreaks. Why defenses are hard.
Why This Matters
ML models are deployed in safety-critical systems: autonomous vehicles, medical diagnosis, content moderation, fraud detection. Adversarial machine learning studies how attackers can manipulate these systems and what defenders can do about it.
The field began with a startling discovery: adding tiny, imperceptible perturbations to images can cause state-of-the-art classifiers to misclassify with high confidence. Since then, the threat model has expanded far beyond image perturbations to encompass the entire ML lifecycle.
NIST published the Adversarial Machine Learning taxonomy (NIST AI 100-2) to standardize how we think about these threats. Understanding this taxonomy is essential for anyone deploying ML in production.
Mental Model
Think of adversarial ML as a game between an attacker and a defender. The attacker has some knowledge of the model (white-box, black-box, or gray-box) and some capability (perturb inputs, corrupt training data, query the API). The defender wants the model to behave correctly despite the attacker's efforts.
The key insight: standard ML optimizes for average-case performance. Security requires worst-case guarantees. This gap is why adversarial ML is hard.
Formal Setup and Notation
Let $f_\theta$ be a classifier with parameters $\theta$. Let $\ell(f_\theta(x), y)$ be the loss function. Let $(x, y)$ be a correctly classified input-label pair.
Adversarial Example
An adversarial example is a perturbed input $x' = x + \delta$ such that:

$$f_\theta(x') \neq y \quad \text{subject to} \quad \|\delta\| \leq \epsilon$$

The perturbation $\delta$ is bounded in some norm (typically $\ell_\infty$ or $\ell_2$) so that $x'$ is perceptually similar to $x$.
Adversarial Risk
The adversarial risk (or robust risk) of a classifier is:

$$R_{\mathrm{adv}}(f_\theta) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\max_{\|\delta\| \leq \epsilon} \ell(f_\theta(x + \delta), y)\right]$$

This replaces the standard expectation with a worst-case inner maximization over the perturbation set.
Threat Model
A threat model specifies: (1) the attacker's goal (misclassification, targeted misclassification, denial of service), (2) the attacker's knowledge (white-box, black-box, gray-box), (3) the attacker's capability (what they can modify and by how much).
NIST AML Taxonomy
The NIST taxonomy organizes attacks by the ML lifecycle stage they target:
Evasion attacks (inference time): modify inputs to fool a deployed model. This includes adversarial examples for classifiers and jailbreaks for LLMs.
Poisoning attacks (training time): inject malicious data into the training set to degrade model performance or plant backdoors.
Privacy attacks (post-deployment): extract sensitive information about the training data or the model itself.
Abuse attacks (misuse): use the model for purposes it was not designed for, such as generating harmful content.
Evasion Attacks in Detail
Fast Gradient Sign Method (FGSM)
The simplest evasion attack. Compute the gradient of the loss with respect to the input and step in the sign direction:

$$x' = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x \ell(f_\theta(x), y)\right)$$

This is a single-step, $\ell_\infty$-bounded attack. It is fast but weak compared to iterative methods.
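As a concrete sketch, FGSM against a toy logistic-regression model fits in a few lines of NumPy. The weights and input below are hypothetical, hand-set so that a single step visibly flips the prediction:

```python
import numpy as np

def fgsm(x, y, w, b, eps):
    """One FGSM step against a logistic model p(y=1|x) = sigmoid(w.x + b).

    For binary cross-entropy the input gradient is (p - y) * w, so the
    attack moves eps in the sign of that direction.
    """
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)

# Hypothetical hand-set model and input, chosen so one step flips the class.
w = np.array([2.0, -1.0, 0.5])
b = 0.0
x = np.array([0.5, 0.2, -0.1])
y = 1.0

x_adv = fgsm(x, y, w, b, eps=0.4)
logit_clean = w @ x + b    # positive: classified as class 1
logit_adv = w @ x_adv + b  # negative: the attack flipped the prediction
print(logit_clean, logit_adv)
```

For a linear model the input gradient direction is constant ($\pm w$), which is why a single step suffices here; deep networks need the iterative attacks below.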
Projected Gradient Descent (PGD) Attack
An iterative refinement of FGSM (Madry et al. 2018). Starting from $x^{(0)} = x$ (often with a random initialization inside the $\epsilon$-ball), repeat:

$$x^{(t+1)} = \Pi\left(x^{(t)} + \alpha \cdot \mathrm{sign}\left(\nabla_x \ell(f_\theta(x^{(t)}), y)\right)\right)$$

where the projection $\Pi$ is onto the intersection of the $\ell_\infty$-ball of radius $\epsilon$ around $x$ and the valid input domain (e.g., $[0,1]^d$ for pixel values). In practice this is implemented as two clips: clip to $[x - \epsilon, x + \epsilon]$ for $\ell_\infty$, then clip to $[0,1]$. Skipping the domain clip produces out-of-distribution inputs that do not correspond to valid images. PGD with random restarts is considered a strong first-order attack.
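A minimal NumPy PGD for a toy logistic-regression model, with both clips made explicit. The weights, budget $\epsilon$, and step size $\alpha$ are illustrative:

```python
import numpy as np

def pgd_linf(x, y, w, b, eps, alpha, steps):
    """PGD under an l_inf ball for a toy logistic model p(y=1|x) = sigmoid(w.x + b).

    Each iteration takes a signed gradient step, then applies the two clips:
    onto [x - eps, x + eps] (the l_inf ball) and onto [0, 1] (the input domain).
    """
    x_adv = x.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w @ x_adv + b)))
        grad = (p - y) * w
        x_adv = x_adv + alpha * np.sign(grad)
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto the l_inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # project onto the valid domain
    return x_adv

# Hypothetical hand-set weights and input.
w = np.array([3.0, -2.0, 1.0])
b = -0.5
x = np.array([0.6, 0.3, 0.4])
y = 1.0

x_adv = pgd_linf(x, y, w, b, eps=0.3, alpha=0.1, steps=10)
print(w @ x + b, w @ x_adv + b)  # clean logit positive, adversarial logit negative
```

Note that the ball clip comes before the domain clip; both are applied every step, so the iterate never leaves the feasible set even when the gradient step overshoots.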
Carlini-Wagner (C&W) attack reformulates adversarial example generation as an optimization problem with a different objective. It often finds smaller perturbations than PGD but is more computationally expensive.
Transferability: adversarial examples generated for one model often fool other models trained on similar data. This enables black-box attacks: generate adversarial examples on a surrogate model and apply them to the target.
Poisoning Attacks
Data poisoning corrupts the training set. The attacker injects or modifies a fraction of training examples to either degrade overall accuracy or cause specific targeted misclassifications.
Backdoor attacks (trojan attacks) plant a trigger pattern. The model behaves normally on clean inputs but misclassifies any input containing the trigger. For example, a small patch in the corner of an image could cause the model to always predict a target class.
The poisoning rate needed can be surprisingly small. In some settings, poisoning less than 1% of the training data suffices to plant a reliable backdoor.
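The backdoor mechanism can be sketched end to end on synthetic data. Everything here is invented for illustration: the clean task (label 1 iff mean brightness is positive), the trigger pixel, and the 1% poisoning rate; the point is only that a tiny poisoned fraction installs a reliable trigger while clean accuracy stays high:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 25, 2000                      # 5x5 "images", flattened; synthetic data

# Invented clean task: label 1 iff mean brightness is positive.
X = rng.normal(size=(n, d))
y = (X.mean(axis=1) > 0).astype(float)

# Poison 1% of the training set: stamp a trigger pixel and force the target label.
TRIGGER_PIXEL, TRIGGER_VALUE, TARGET = d - 1, 6.0, 1.0
poison_idx = rng.choice(n, size=n // 100, replace=False)
X[poison_idx, TRIGGER_PIXEL] = TRIGGER_VALUE
y[poison_idx] = TARGET

# Train logistic regression on the poisoned set by plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(1500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.1 * X.T @ (p - y) / n
    b -= 0.1 * (p - y).mean()

# Clean behavior is largely preserved ...
X_test = rng.normal(size=(500, d))
y_test = (X_test.mean(axis=1) > 0).astype(float)
clean_acc = (((X_test @ w + b) > 0) == y_test).mean()

# ... but stamping the trigger pushes predictions toward the target class.
X_trig = X_test.copy()
X_trig[:, TRIGGER_PIXEL] = TRIGGER_VALUE
trigger_rate = ((X_trig @ w + b) > 0).mean()
print(clean_acc, trigger_rate)
```

The poisoned examples give the trigger pixel a large positive weight that clean data barely constrains, which is exactly why backdoors are hard to detect from clean-accuracy metrics alone.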
Model Extraction and Privacy
Model extraction (model stealing): the attacker queries a black-box model API and uses the responses to train a functionally equivalent copy. This threatens intellectual property and also enables white-box attacks on the copy.
Membership inference: given a model and a data point, determine whether that point was in the training set. This is a privacy violation. Overfit models are particularly vulnerable because they memorize training data.
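The connection between overfitting and membership inference can be shown with a loss-thresholding attack on a deliberately memorizing model. The setup is synthetic and the threshold is hand-picked; with $d \gg n$ and random labels, the model can only memorize, and member losses collapse toward zero:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_in, n_out = 100, 30, 30         # over-parameterized on purpose: d >> n_in

# Random labels force the model to memorize rather than generalize.
X_in = rng.normal(size=(n_in, d))
y_in = rng.integers(0, 2, size=n_in).astype(float)
X_out = rng.normal(size=(n_out, d))  # non-members from the same distribution
y_out = rng.integers(0, 2, size=n_out).astype(float)

w, b = np.zeros(d), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X_in @ w + b)))
    w -= 0.5 * X_in.T @ (p - y_in) / n_in
    b -= 0.5 * (p - y_in).mean()

def xent(X, y):
    """Per-example cross-entropy under the trained model."""
    p = np.clip(1.0 / (1.0 + np.exp(-(X @ w + b))), 1e-12, 1 - 1e-12)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

loss_in, loss_out = xent(X_in, y_in), xent(X_out, y_out)

# The attack: guess "member" iff the per-example loss is below a threshold.
tau = 0.3                            # hand-picked for this toy
guesses = np.concatenate([loss_in, loss_out]) < tau
truth = np.concatenate([np.ones(n_in), np.zeros(n_out)]).astype(bool)
attack_acc = (guesses == truth).mean()
print(loss_in.mean(), loss_out.mean(), attack_acc)
```

An attack accuracy meaningfully above 50% is the privacy violation: the loss value alone reveals training-set membership.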
Training data extraction: for large language models, carefully crafted prompts can cause the model to regurgitate memorized training data, including personally identifiable information.
LLM Attack Surfaces
Four distinct attack surfaces exist for LLMs, often conflated in practice:
Jailbreaks: evasion attacks against safety training. The user crafts prompts that bypass RLHF-installed refusals. Categories: role-playing attacks (DAN-style persona priming), encoding attacks (base64, low-resource languages, cipher tricks), multi-turn attacks (gradually escalating), and optimization-based attacks such as GCG (Zou, Wang, Kolter, Fredrikson 2023), which does gradient-based search over adversarial suffixes in token space and transfers across aligned models.
Prompt injection: untrusted content ingested by the LLM (web pages, emails, tool outputs) contains instructions the model follows. The adversary is not the user but a third party whose text reaches the model's context window. Indirect prompt injection (Greshake et al. 2023) is a system-design vulnerability, not a model-weights vulnerability: the LLM has no way to distinguish trusted user instructions from untrusted tool output.
Training-data poisoning: corruption of pretraining or fine-tuning data. Carlini et al. 2023 show that poisoning web-scale datasets is practical at low cost, because popular pretraining crawls have mutable or buyable provenance (expired domains, stale Wikipedia snapshots).
Training-data extraction: large models memorize and can be prompted to regurgitate training-set content, including PII and copyrighted text.
These surfaces require different defenses. Jailbreak-hardening does nothing against prompt injection; input filters do nothing against poisoning.
Main Theorems
Robustness-Accuracy Tradeoff (pedagogical toy)
Statement
This is a pedagogical toy in the spirit of Tsipras, Santurkar, Engstrom, Turner, Madry (2019). Consider a balanced two-class Gaussian mixture on $\mathbb{R}^d$ with class conditionals $\mathcal{N}(\mu, I)$ and $\mathcal{N}(-\mu, I)$, and an $\ell_2$-bounded adversary with budget $\epsilon < \|\mu\|_2$. The Bayes-optimal standard classifier is the linear rule $\mathrm{sign}(\mu^\top x)$ with clean accuracy $\Phi(\|\mu\|_2)$. The Bayes-optimal $\epsilon$-robust classifier within this linear family is the same direction but with a shifted threshold, yielding clean accuracy:

$$\mathrm{Acc}_{\mathrm{rob}} = \frac{1}{2}\Phi(\|\mu\|_2 - \epsilon) + \frac{1}{2}\Phi(\|\mu\|_2 + \epsilon)$$

where $\Phi$ is the standard normal CDF. The robustness-accuracy gap $\Phi(\|\mu\|_2) - \mathrm{Acc}_{\mathrm{rob}}$ is strictly positive whenever $\epsilon > 0$.
Tsipras et al. (2019) establish a stronger separation in a different toy distribution (Bernoulli-Gaussian features): there, any classifier with non-trivial $\ell_\infty$-robust accuracy must have standard accuracy bounded away from $100\%$, and robust learning requires sample complexity worse than standard learning (Schmidt et al. 2018).
Intuition
Robustness requires the classifier to maintain correct predictions even when inputs are pushed toward the decision boundary. This shrinks the effective margin by $\epsilon$ in the worst-case direction. In this toy, the shift is exactly $\epsilon$ along $\mu / \|\mu\|_2$, so the clean-accuracy drop is $\Phi(\|\mu\|_2) - \frac{1}{2}\left[\Phi(\|\mu\|_2 - \epsilon) + \Phi(\|\mu\|_2 + \epsilon)\right]$.
Proof Sketch
By symmetry the Bayes-optimal standard classifier is $\mathrm{sign}(\mu^\top x)$ with threshold at zero. An $\epsilon$-robust classifier must classify correctly on the $\epsilon$-expansion of each class region. For linear classifiers $\mathrm{sign}(w^\top x - t)$, robustness to $\|\delta\|_2 \leq \epsilon$ means the signed distance to the hyperplane must exceed $\epsilon$ on training points of each class. The Bayes-optimal robust linear classifier shifts the threshold by $\epsilon$ away from the class being defended; averaging the two shifted error probabilities gives $\frac{1}{2}\Phi(\|\mu\|_2 - \epsilon) + \frac{1}{2}\Phi(\|\mu\|_2 + \epsilon)$.
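Plugging illustrative numbers into the toy (here $\|\mu\|_2 = 1.5$ and $\epsilon = 0.5$, both invented) confirms that the shifted-threshold clean accuracy $\frac{1}{2}[\Phi(\|\mu\|_2 - \epsilon) + \Phi(\|\mu\|_2 + \epsilon)]$ sits strictly below the standard clean accuracy $\Phi(\|\mu\|_2)$:

```python
import math

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu_norm, eps = 1.5, 0.5   # illustrative values of ||mu||_2 and the budget

clean_std = Phi(mu_norm)                                     # standard classifier
clean_rob = 0.5 * (Phi(mu_norm - eps) + Phi(mu_norm + eps))  # shifted threshold
gap = clean_std - clean_rob
print(clean_std, clean_rob, gap)
```

The gap is positive for any $\epsilon > 0$ because the average of $\Phi$ at $\|\mu\|_2 \pm \epsilon$ lies below $\Phi(\|\mu\|_2)$; the loss on the near side outweighs the gain on the far side.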
Why It Matters
This toy illustrates why proposed defenses against adversarial examples typically pay a clean-accuracy cost. The tradeoff is not an artifact of bad defense design; in this distributional family it is baked into the geometry. Practitioners must decide how much clean accuracy they are willing to sacrifice for robustness.
Failure Mode
The statement is for a specific Gaussian mixture with identity covariance and linear decision rules. Real data distributions can have different tradeoff curves, and non-linear classifiers on structured manifolds can in principle be both robust and accurate. Randomized smoothing (Cohen, Rosenfeld, Kolter 2019) gives provably certified -robust classifiers without assuming this toy setup. The "there exist models that are simultaneously robust and accurate" question remains open for real data.
Defenses
Adversarial training augments the training set with adversarial examples. The objective becomes:

$$\min_\theta \; \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\max_{\|\delta\| \leq \epsilon} \ell(f_\theta(x + \delta), y)\right]$$

This is the most effective empirical defense but is computationally expensive (it requires running PGD at every training step) and incurs the accuracy tradeoff described above.
Certified defenses provide provable guarantees that no perturbation within the threat model can change the prediction. Randomized smoothing (Cohen, Rosenfeld, Kolter 2019) is the most scalable certified defense. Define the smoothed classifier:

$$g(x) = \arg\max_c \; \Pr_{\eta \sim \mathcal{N}(0, \sigma^2 I)}\left[f(x + \eta) = c\right]$$

If the top class has lower-confidence bound $\underline{p_A}$ and the runner-up has upper-confidence bound $\overline{p_B}$, then $g$ is certifiably $\ell_2$-robust at $x$ with radius:

$$R = \frac{\sigma}{2}\left(\Phi^{-1}(\underline{p_A}) - \Phi^{-1}(\overline{p_B})\right)$$

where $\Phi^{-1}$ is the standard-normal quantile function. The certificate is distribution-free in $f$; only the noise level $\sigma$ and the sampled class margins matter.
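A Monte Carlo sketch of the certificate for a toy linear base classifier. A faithful implementation separates selection and estimation samples and uses Clopper-Pearson confidence bounds for $\underline{p_A}$ and $\overline{p_B}$; raw empirical frequencies stand in for them here, for illustration only:

```python
import numpy as np
from statistics import NormalDist

Phi_inv = NormalDist().inv_cdf       # standard-normal quantile function

def certify(f, x, sigma, n=10000, seed=0):
    """Monte Carlo sketch of the randomized-smoothing certificate.

    Raw empirical class frequencies replace the confidence bounds that a
    real implementation (e.g. Clopper-Pearson) would use.
    """
    rng = np.random.default_rng(seed)
    noise = sigma * rng.normal(size=(n, x.size))
    classes = np.array([f(x + eta) for eta in noise])
    counts = np.bincount(classes, minlength=2)
    top, runner = np.argsort(counts)[::-1][:2]
    p_a = np.clip(counts[top] / n, 1e-6, 1 - 1e-6)
    p_b = np.clip(counts[runner] / n, 1e-6, 1 - 1e-6)
    radius = 0.5 * sigma * (Phi_inv(p_a) - Phi_inv(p_b))
    return int(top), radius

# Toy base classifier: a known linear rule, so the true distance from x to the
# decision boundary is |w.x| / ||w||_2 = 1.0, which the certificate approaches.
w = np.array([1.0, 0.0])
f = lambda z: int(w @ z >= 0)
x = np.array([1.0, 0.0])

c, radius = certify(f, x, sigma=0.5)
print(c, radius)
```

For a linear base classifier the certificate is tight: as sampling noise vanishes, $\frac{\sigma}{2}(\Phi^{-1}(p_A) - \Phi^{-1}(p_B))$ converges to the true boundary distance.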
Input preprocessing (denoising, compression) and detection methods try to filter adversarial inputs before classification. These have repeatedly been broken by adaptive attacks.
Common Confusions
Adversarial robustness is not the same as input validation
Input validation checks that inputs are well-formed (correct format, in expected range). Adversarial robustness handles inputs that look legitimate but are designed to cause misclassification. A valid JPEG image can be an adversarial example.
Defense evaluation requires adaptive attacks
Many published defenses were broken because they were only tested against non-adaptive attacks (attacks not designed for the specific defense). A defense must be evaluated against an attacker who knows the defense mechanism and adapts accordingly. Obfuscated gradients, for example, can make gradient-based attacks fail without providing real robustness.
Summary
- Empirically, most high-capacity ML models without adversarial training are vulnerable to small adversarial perturbations. Randomized smoothing (Cohen-Rosenfeld-Kolter 2019) provides certified robustness for some models and threat models
- The NIST taxonomy: evasion, poisoning, privacy, abuse
- PGD attack: iterative FGSM with projection, the standard strong attack
- Robustness-accuracy tradeoff: fundamental, not an artifact of bad defenses
- Adversarial training: effective but expensive and accuracy-reducing
- Always evaluate defenses against adaptive attackers
- LLM jailbreaks are the evasion attack analog for language models
Exercises
Problem
Compute the FGSM perturbation for a linear classifier $f(x) = w^\top x$ with squared loss $\ell(x, y) = \frac{1}{2}(w^\top x - y)^2$. What direction does the perturbation point in, and why does this make geometric sense?
Problem
Explain why black-box adversarial attacks work despite the attacker having no access to the target model's gradients. What property of adversarial examples enables this?
Problem
The robustness-accuracy tradeoff theorem shows a fundamental tension in the Gaussian mixture setting. Could additional unlabeled data help reduce this tradeoff? Discuss the theoretical arguments and empirical evidence.
References
Foundations (evasion and adversarial examples):
- Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow, Fergus, "Intriguing Properties of Neural Networks" (ICLR 2014). Original observation that small imperceptible perturbations flip classifier predictions.
- Goodfellow, Shlens, Szegedy, "Explaining and Harnessing Adversarial Examples" (ICLR 2015). FGSM and the linear explanation.
- Madry, Makelov, Schmidt, Tsipras, Vladu, "Towards Deep Learning Models Resistant to Adversarial Attacks" (ICLR 2018). PGD attack and min-max adversarial training.
- Tsipras, Santurkar, Engstrom, Turner, Madry, "Robustness May Be at Odds with Accuracy" (ICLR 2019). Toy-distribution analysis of the robustness-accuracy tradeoff.
- Schmidt, Santurkar, Tsipras, Talwar, Madry, "Adversarially Robust Generalization Requires More Data" (NeurIPS 2018). Sample-complexity separation.
Evaluation and adaptive attacks:
- Athalye, Carlini, Wagner, "Obfuscated Gradients Give a False Sense of Security" (ICML 2018). Methodology for breaking defenses that rely on masked gradients.
- Carlini, Athalye, Papernot, Brendel, Rauber, Tsipras, Goodfellow, Madry, Kurakin, "On Evaluating Adversarial Robustness" (arXiv 2019). Checklist for honest robustness evaluation.
- Croce, Hein, "Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks" (ICML 2020). AutoAttack.
- Tramèr, Carlini, Brendel, Madry, "On Adaptive Attacks to Adversarial Example Defenses" (NeurIPS 2020). Thirteen defenses broken by adaptive attacks.
- Croce et al., "RobustBench: a standardized adversarial robustness benchmark" (NeurIPS Datasets 2021).
Certified robustness:
- Cohen, Rosenfeld, Kolter, "Certified Adversarial Robustness via Randomized Smoothing" (ICML 2019). The certificate.
LLM attack surfaces:
- Zou, Wang, Kolter, Fredrikson, "Universal and Transferable Adversarial Attacks on Aligned Language Models" (arXiv:2307.15043, 2023). GCG: gradient-based adversarial suffixes.
- Greshake, Abdelnabi, Mishra, Endres, Holz, Fritz, "Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection" (arXiv:2302.12173, 2023).
- Anil, Durmus, Panickssery, Sharma, Benton, et al., "Many-Shot Jailbreaking" (Anthropic, 2024). Extends the LLM attack surface to long-context settings.
- Carlini, Jagielski, Choquette-Choo, Paleka, Pearce, Anderson, Terzis, Thomas, Tramèr, "Poisoning Web-Scale Training Datasets is Practical" (arXiv:2302.10149, 2023).
Standards and taxonomy:
- NIST AI 100-2 E2023, "Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations" (2024).
Next Topics
The natural next step from adversarial ML:
- LLM application security: OWASP Top 10 for LLM applications
Last reviewed: April 18, 2026