VU#425163: Machine learning classifiers trained via gradient descent are vulnerable to arbitrary misclassification attack

Cyber Security | June 4, 2021


Machine learning models trained using gradient descent can be forced to make arbitrary misclassifications by an attacker who can influence the items to be classified. The impact of a misclassification varies widely depending on the ML model’s purpose and the systems of which it is a part.


This vulnerability results from using gradient descent to determine classification of inputs via a neural network. As such, it is a vulnerability in the algorithm. In plain terms, this means that the currently standard usage of this type of machine learning algorithm can always be fooled or manipulated if the adversary can interact with it. What kind or amount of interaction an adversary needs is not always clear, and some attacks can be successful with only minor or indirect interaction. In general, however, more access or more interaction options reduce the effort required to fool the machine learning algorithm. If the adversary has information about some part of the machine learning process (training data, training results, model, or operational/testing data), then with sufficient effort the adversary can craft an input that will fool the machine learning tool into yielding a result of the adversary’s choosing. In instantiations of this vulnerability that we are currently aware of, “sufficient effort” ranges widely, up to weeks of commodity compute time.

Within the taxonomy by , such misclassifications are either perturbation attacks or adversarial examples in the physical domain. There are other kinds of failures or attacks related to ML systems, and other ML systems besides those trained via gradient descent. However, this note is restricted to this specific algorithm vulnerability. Formally, the vulnerability is defined for the following case of classification.

Let \(x\) be a feature vector and \(y\) be a class label. Let \(L\) be a loss function, such as cross entropy loss. We wish to learn a parameterization vector \(\theta\) for a given class of functions \(f\) such that the expected loss is minimized. Specifically, let

\[ \theta_{\star} = \arg\min_{\theta} \mathop{\mathbb{E}}_{x,y} L\left(f\left(\theta, x\right), y\right) \]

In the case where \(f\left(\theta, x\right)\) is a neural network, finding the global minimizer \(\theta_{\star}\) is often computationally intractable. Instead, various methods are used to find \(\hat{\theta}\), which is a “good enough” approximation. We refer to \(f\left(\hat{\theta}, \cdot\right)\) as the fitted neural network.

If stochastic gradient descent is used to find \(\hat{\theta}\) for the broadly defined set of \(f\left(\theta, x\right)\) representing neural networks, then the fitted neural network \(f\left(\hat{\theta}, \cdot\right)\) is vulnerable to adversarial manipulation.

Specifically, it is possible to take \(f\left(\hat{\theta}, \cdot\right)\) and find an \(x'\) such that the difference between \(x\) and \(x'\) is smaller than some arbitrary \(\epsilon\) and yet \(f\left(\hat{\theta}, x\right)\) has the label \(y\) and \(f\left(\hat{\theta}, x'\right)\) has an arbitrarily different label \(y'\).
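As a rough illustration of how a "good enough" parameter vector is found in practice, the following minimal sketch runs stochastic gradient descent on the cross entropy loss. It assumes a single-layer logistic model in place of a full neural network, and the toy data, learning rate, epoch count, and every function name are hypothetical choices made only for this example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(theta, x):
    # f(theta, x): estimated probability that x has label 1.
    # Assumption: theta holds one weight per feature plus a trailing bias.
    return sigmoid(sum(w * xi for w, xi in zip(theta, x)) + theta[-1])


def cross_entropy(p, y):
    # Loss L for one example; the small constant avoids log(0).
    tiny = 1e-12
    return -(y * math.log(p + tiny) + (1 - y) * math.log(1 - p + tiny))


def mean_loss(theta, data):
    # Empirical stand-in for the expected loss in the formula above.
    return sum(cross_entropy(predict_proba(theta, x), y) for x, y in data) / len(data)


def sgd_fit(data, dim, lr=0.1, epochs=200, seed=0):
    # Stochastic gradient descent: one parameter update per training example.
    rng = random.Random(seed)
    theta = [0.0] * (dim + 1)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            err = predict_proba(theta, x) - y  # dL/dz for sigmoid + cross entropy
            for i in range(dim):
                theta[i] -= lr * err * x[i]
            theta[dim] -= lr * err
    return theta


# Hypothetical toy data: two Gaussian clusters in two dimensions.
_rng = random.Random(1)
toy_data = (
    [([_rng.gauss(-1.0, 0.5), _rng.gauss(-1.0, 0.5)], 0) for _ in range(50)]
    + [([_rng.gauss(1.0, 0.5), _rng.gauss(1.0, 0.5)], 1) for _ in range(50)]
)

theta_hat = sgd_fit(toy_data, dim=2)
initial_loss = mean_loss([0.0, 0.0, 0.0], toy_data)
fitted_loss = mean_loss(theta_hat, toy_data)
```

The result `theta_hat` plays the role of the approximate minimizer: it lowers the empirical loss relative to the starting point, with no claim that it is the global minimizer.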
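The statement above can be made concrete with a small sketch. It assumes a hand-built linear softmax classifier as a stand-in for the fitted network, so the gradient with respect to the input can be written analytically, and it uses a single targeted gradient-sign step bounded by epsilon in the max norm. The model, dimensions, input, and all names are hypothetical. Attacks on real networks typically obtain the gradient by automatic differentiation and may need several steps.

```python
import math

NUM_CLASSES = 3
DIM = 300  # many features, so a tiny per-feature change adds up


def logits(x):
    # Hypothetical "fitted" model: the score of class c is the sum of the
    # features whose index is congruent to c modulo NUM_CLASSES.
    z = [0.0] * NUM_CLASSES
    for i, xi in enumerate(x):
        z[i % NUM_CLASSES] += xi
    return z


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict(x):
    z = logits(x)
    return max(range(NUM_CLASSES), key=lambda c: z[c])


def sign(v):
    return (v > 0) - (v < 0)


def targeted_step(x, target, epsilon):
    # For this model, the gradient of -log p[target] with respect to x_i is
    # p[c] - 1[c == target], where c = i % NUM_CLASSES.
    # Moving each feature by epsilon against the sign of that gradient
    # raises the target class score while keeping |x'_i - x_i| <= epsilon.
    p = softmax(logits(x))
    x_adv = []
    for i, xi in enumerate(x):
        c = i % NUM_CLASSES
        grad = p[c] - (1.0 if c == target else 0.0)
        x_adv.append(xi - epsilon * sign(grad))
    return x_adv


# Hypothetical input that the model assigns to class 0.
x = [0.51 if i % NUM_CLASSES == 0 else 0.50 for i in range(DIM)]
epsilon = 0.01
target_label = 2

x_adv = targeted_step(x, target_label, epsilon)
original_label = predict(x)
adversarial_label = predict(x_adv)
max_change = max(abs(a - b) for a, b in zip(x_adv, x))
```

In this toy setting, no feature moves by more than `epsilon`, yet the predicted label changes from the original class to the class chosen by the attacker, because the small changes accumulate across many features.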

The uncertainty of the impact of this vulnerability is compounded because practitioners and vendors do not tend to disclose what machine learning algorithms they use. However, training neural networks by gradient descent is a common technique. See also the examples in the impact section.


An attacker can interfere with a system that uses a model trained via gradient descent in order to change the system’s behavior. As an algorithm vulnerability, this flaw has a wide-ranging but difficult-to-fully-describe impact. The precise impact will vary with the application of the ML system. We provide three illustrative examples; these should not be considered exhaustive.


The CERT/CC is currently unaware of a specific practical solution to this problem. As a general defense, do both of the following:
1. Adversarial training and testing of ML models. If a model must be exposed to adversarial input, the only well-tested defense against this kind of adversarial attack is adversarial training: using adversarially perturbed examples as part of the neural network’s training regimen. When used for training, these examples increase the model’s robustness against adversarial attack. Such training significantly increases the difficulty of attacking the model, but it does not guarantee that the model is not vulnerable. Test your machine learning algorithm against known attacks. Libraries featuring reference implementations of popular attacks and defenses include, among others, the Adversarial Robustness Toolbox (ART).

2. Standard defense in depth. A machine learning tool is not different from other software in this regard. Any tool should be deployed in an ecosystem that supports and defends it from adversarial manipulation. For machine learning tools specifically designed to serve a cybersecurity purpose, this is particularly important, as they are exposed to adversarial input as part of their designed tasking. See for more information on evaluating machine learning tools for cybersecurity.
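The adversarial training mentioned in the first item can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes a logistic model instead of a neural network, uses a single gradient-sign perturbation to generate the adversarial examples, and relies on hypothetical toy data, hyperparameters, and function names. In practice the libraries mentioned above provide tested attack and defense implementations.

```python
import math
import random


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def prob(theta, x):
    # Assumption: theta holds one weight per feature plus a trailing bias.
    return sigmoid(sum(w * xi for w, xi in zip(theta, x)) + theta[-1])


def sign(v):
    return (v > 0) - (v < 0)


def perturb(theta, x, y, epsilon):
    # Gradient-sign perturbation: d(loss)/dx_i = (p - y) * w_i, so stepping
    # along its sign increases the loss while |x'_i - x_i| <= epsilon.
    err = prob(theta, x) - y
    return [xi + epsilon * sign(err * w) for xi, w in zip(x, theta)]


def sgd_step(theta, x, y, lr):
    err = prob(theta, x) - y
    for i, xi in enumerate(x):
        theta[i] -= lr * err * xi
    theta[-1] -= lr * err


def adversarial_train(data, dim, epsilon, lr=0.1, epochs=100, seed=0):
    # Each example is used twice per epoch: once as given and once
    # perturbed against the current parameters.
    rng = random.Random(seed)
    theta = [0.0] * (dim + 1)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            x_adv = perturb(theta, x, y, epsilon)
            sgd_step(theta, x, y, lr)
            sgd_step(theta, x_adv, y, lr)
    return theta


def accuracy(theta, data, epsilon=0.0):
    # With epsilon > 0, every input is perturbed before evaluation,
    # which tests the model against this one known attack.
    correct = 0
    for x, y in data:
        x_eval = perturb(theta, x, y, epsilon) if epsilon > 0 else x
        correct += int((prob(theta, x_eval) >= 0.5) == (y == 1))
    return correct / len(data)


# Hypothetical toy data: two Gaussian clusters in two dimensions.
_rng = random.Random(1)
toy_data = (
    [([_rng.gauss(-1.0, 0.5), _rng.gauss(-1.0, 0.5)], 0) for _ in range(50)]
    + [([_rng.gauss(1.0, 0.5), _rng.gauss(1.0, 0.5)], 1) for _ in range(50)]
)

theta_robust = adversarial_train(toy_data, dim=2, epsilon=0.3)
clean_acc = accuracy(theta_robust, toy_data)
robust_acc = accuracy(theta_robust, toy_data, epsilon=0.3)
```

Comparing `clean_acc` with `robust_acc` gives a simple measure of how much accuracy this particular attack removes. Consistent with the caveat above, a good score against one attack does not show that the model is safe against others.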

Other proposed solutions, which rely on either pre-processing the data or simply obfuscating the gradient of the loss, do not work when your adversary is aware that you are attempting those mitigations.


See Papernot et al. (2016) or Biggio and Roli (2018) for a brief history.

This document was written by Allen Householder, Jonathan M. Spring, Nathan VanHoudnos, and Oren Wright.

Vendor Information

This advisory information is generic and does not describe any specific instance of this type of problem, so no vendors have been notified or listed here.

CVSS Metrics

Group Score Vector
Base 0 AV:–/AC:–/Au:–/C:–/I:–/A:–
Temporal 0 E:ND/RL:ND/RC:ND
Environmental 0 CDP:ND/TD:ND/CR:ND/IR:ND/AR:ND

Other Information

Date Public: 2020-03-19
Document Revision: 25
