[1412.6572] Explaining and Harnessing Adversarial Examples - arXiv

paper

Credibility Rating

3/5

Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Metadata

Cited by 2 pages

Page	Type	Quality
AI Accident Risk Cruxes	Crux	67.0
Adversarial Training	Approach	58.0

Cached Content Preview

HTTP 200 | Fetched Apr 30, 2026 | 53 KB

# Explaining and Harnessing Adversarial Examples

Ian J. Goodfellow, Jonathon Shlens & Christian Szegedy

Google Inc., Mountain View, CA

{goodfellow,shlens,szegedy}@google.com

###### Abstract

Several machine learning models, including neural networks, consistently
misclassify adversarial examples—inputs formed
by applying small but intentionally worst-case perturbations to examples from the dataset,
such that the perturbed input results in the model outputting an incorrect
answer with high confidence.
Early attempts at explaining this phenomenon focused on nonlinearity and overfitting.
We argue instead that the primary cause of neural networks’ vulnerability to adversarial
perturbation is their linear nature.
This explanation is supported by new quantitative results while
giving the first explanation of the most intriguing fact about
them: their generalization across architectures and training sets.
Moreover, this view yields a simple and fast method of generating adversarial examples.
Using this approach to provide examples for adversarial training, we reduce the test
set error of a maxout network on the MNIST dataset.
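
The "simple and fast method" mentioned in the abstract perturbs an input along the sign of the loss gradient; the full paper calls this the fast gradient sign method. The following is a minimal sketch of that idea on a toy logistic regression model, not the authors' code. The model, the function names (`sign_gradient_perturb`, `predict_proba`), and all numeric values are assumptions chosen for illustration.

```python
import math

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def predict_proba(x, w, b):
    """Probability of class 1 under a logistic regression model (illustrative)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def sign_gradient_perturb(x, y, w, b, eps):
    """Move each input coordinate by eps in the direction that increases the loss.

    For cross-entropy loss on sigmoid(w.x + b), the gradient of the loss
    with respect to the input x is (p - y) * w, where p is the predicted
    probability of class 1 and y is the true label (0 or 1).
    """
    p = predict_proba(x, w, b)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

if __name__ == "__main__":
    # Hypothetical weights and input, assumed only for this demonstration.
    w, b = [1.0, -2.0, 0.5], 0.0
    x, y = [0.5, -0.5, 1.0], 1
    x_adv = sign_gradient_perturb(x, y, w, b, eps=0.25)
    print("clean probability of true class:    ", predict_proba(x, w, b))
    print("perturbed probability of true class:", predict_proba(x_adv, w, b))
```

Each coordinate moves by at most `eps`, yet every move pushes the loss in the same direction, so the model's confidence in the true class drops. Generating one such example costs a single gradient computation, which is what makes using them during training practical.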

## 1 Introduction

Szegedy et al. ([2014b](https://ar5iv.labs.arxiv.org/html/1412.6572#bib.bib19)) made an intriguing discovery: several machine learning models,
including state-of-the-art neural networks, are vulnerable to adversarial examples.
That is, these machine learning models misclassify examples that are only slightly different
from correctly classified examples drawn from the data distribution.
In many cases, a wide variety of models with different architectures trained
on different subsets of the training data misclassify the same adversarial example.
This suggests that adversarial examples expose
fundamental blind spots in our training algorithms.

The cause of these adversarial examples was a mystery, and speculative explanations have
suggested it is due to extreme nonlinearity of deep neural networks, perhaps combined
with insufficient model averaging and insufficient regularization of the purely supervised
learning problem. We show that these speculative hypotheses are unnecessary.
Linear behavior in high-dimensional spaces is sufficient to cause adversarial examples.
This view enables us to design a fast method of generating adversarial examples that
makes adversarial training practical. We show that adversarial training can provide
an additional regularization benefit beyond that provided by using dropout (Srivastava et al., [2014](https://ar5iv.labs.arxiv.org/html/1412.6572#bib.bib17)) alone.
Generic regularization strategies such as dropout, pretraining,
and model averaging do not confer a significant reduction in a model’s vulnerability to
adversarial examples, but changing to nonlinear model families such as RBF networks can do so.
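
The claim that linear behavior in high-dimensional spaces is sufficient can be illustrated with a small numerical sketch. For a linear unit with weights `w`, adding a perturbation `eta = eps * sign(w)` changes the activation `w.x` by `eps` times the L1 norm of `w`, so the shift grows with the input dimension even though no single coordinate moves by more than `eps`. The random weights, the input distribution, and the helper names below are assumptions made for illustration only.

```python
import random

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def activation_shift(n, eps, seed=0):
    """Return (observed shift in w.x, eps * L1 norm of w) for dimension n."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]  # hypothetical weights
    x = [rng.random() for _ in range(n)]         # hypothetical input in [0, 1)
    eta = [eps * sign(wi) for wi in w]           # each |eta_i| is at most eps
    x_adv = [xi + ei for xi, ei in zip(x, eta)]
    shift = dot(w, x_adv) - dot(w, x)
    l1_term = eps * sum(abs(wi) for wi in w)
    return shift, l1_term

if __name__ == "__main__":
    for n in (10, 1000, 100000):
        shift, l1_term = activation_shift(n, eps=0.01)
        print(f"n={n:>6}  shift={shift:.4f}  eps*||w||_1={l1_term:.4f}")
```

The two printed columns agree up to floating-point error, and both grow roughly in proportion to `n`: many imperceptibly small per-coordinate changes add up to a large change in the output.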

Our explanation suggests a fundamental tension between designing models that are easy to train due
to their linearity and designing models that use nonlinear effects t

... (truncated, 53 KB total)

Resource ID: d20b6a68747b55ae | Stable ID: sid_4QIT1gaaMg