Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images
Deep neural networks (DNNs) have recently been achieving state-of-the-art performance on a variety of pattern-recognition tasks, most notably visual classification problems. Given that DNNs are now able to classify objects in images with near-human-level performance, questions naturally arise as to what differences remain between computer and human vision. A recent study revealed that changing an image (e.g. of a lion) in a way imperceptible to humans can cause a DNN to label the image as something else entirely (e.g. mislabeling a lion a library). Here we show a related result: it is easy to produce images that are completely unrecognizable to humans, but that state-of-the-art DNNs believe to be recognizable objects with 99.99% confidence (e.g. labeling with certainty that white noise static is a lion). Specifically, we take convolutional neural networks trained to perform well on either the ImageNet or MNIST datasets and then find images with evolutionary algorithms or gradient ascent that DNNs label with high confidence as belonging to each dataset class. It is possible to produce images totally unrecognizable to human eyes that DNNs believe with near certainty are familiar objects. Our results shed light on interesting differences between human vision and current DNNs, and raise questions about the generality of DNN computer vision.
Full Citation: Nguyen A, Yosinski J, Clune J. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In Computer Vision and Pattern Recognition (CVPR '15), IEEE, 2015. (pdf)
- High quality figures from the paper here.
- Original CPPN images evolved from our experiments: 5 runs x 1000 images = 5000 images.
- Note that: even if your model is not fooled by much of our images, that is only a generalization test. It still could be the case that evolution could fool that network if specifically targeting it (see section A.2 in the paper supplementary).
- The entire code base to conduct the experiments in this paper are available here.
Press coverage:
- MIT Technology Review: “Smart” Software Can Be Tricked into Seeing What Isn’t There
- Wired: Simple Pictures That State-of-the-Art AI Still Can’t Recognize
- The Atlantic: How to fool a computer with optical illusions
- New Scientist: Optical illusions fool computers into seeing things
- Phys.org: Researchers find a way to fool deep neural networks into 'recognizing' images that aren't there
- Discovery: What Computers Think The World Looks Like
- ExtremeTech: Bad news, future: Computer brains are easily tricked by optical illusions, too
- iProgrammer: The Deep Flaw In All Neural Networks
- Our work was also on the homepage of Slashdot, was the #1 news article on HackerNews, and was published on Slate.
Frequently Asked Questions
- Were you trying to fool the network?
No. We were not trying to produce adversarial, unrecognizable images. Instead, we were trying to produce recognizable images, but these unrecognizable images emerged.
- Would the network be fooled again if you retrain it to recognize these fooling images as garbage?
We tried that. The answer is: “It’s complicated.” See sections 3.6, 3.7, and 3.8 in the paper for the results.
- Can the images be considered art?
Yes. In fact, to test that hypothesis we submitted them to the "University of Wyoming 40th Annual Juried Student Exhibition” and they were accepted. We’re also in contact with an artist that wants to show them in art galleries.
- Do these networks have a “none of the above” option, or are they forced to choose a class?
In response to an unrecognizable image, the networks could have output a low confidence for each of the 1000 classes, instead of an extremely high confidence value for one of the classes. In fact, they do just that for randomly generated images (e.g. those in generation 0 of the evolutionary run) : see Fig. 6 & 7 in the paper. Thus, they effectively have a way to communicate “none of the above”. It would be interesting to try training them with a specific “none of the above” class, but that would require assembling a set of pictures to put in that class that we are sure do not belong in one of the other 1000 classes. We’d like to do that, but assembling these images will take some time. Stay tuned.
- How are the certainty (confidence) scores calculated?
The neural network outputs its confidence in the image belonging to each of 1000 classes. The confidence scores across all 1000 classes must sum to 1, meaning that if one class is given 99.9% confidence then the rest of the classes combined have less than 0.1% confidence. In other words, in that case it is really declaring that it’s suuuuure the image is of that class. It could also split its confidence between classes more evenly. For example, looking at a child climbing a tree it could give 60% confidence to a “tree” class and 40% confidence to a “child” class, but for most of the images in our paper the network is certain the image belongs to one class.
- Can you fool the network if it only gives you a label to an image, but not confidence scores? In other words, do you need a gradient to follow or would a binary (yes/no) answer from the network suffice?
The gradient helps tremendously, but we think it is possible to fool a network even without a gradient (i.e. with a binary answer). For example, even if the only answer provided by Google’s image filter is that they do, or do not, filter an image, and a hacker is given no other information, we still think the network could be fooled. One way is brute force: simply try a lot and see if you get lucky. A smarter way is to utilize the result from our paper that images that fool one network often fool another. Given that, hackers can train their own network, use its gradient to produce fooling images, and then try those on Google’s image filters. Since Google’s filters (in all likelihood) also use deep neural networks, the attack is much more likely to work than the brute force method.
- Does this optical illusion phenomenon happen to animals or humans?
Yes. Humans are susceptible to optical illusions. Such illusions are designed to hack the way our brains see the world. Similarly, these images hack the way neural networks see the world.
Also of interest: Nikolaas Tinbergen found that natural neural networks are also susceptible to a related phenomenon, which he called supernormal stimuli. See https://imgur.com/a/ibMUn for more.
- Is this a Rorschach test for neural networks? Is it similar to seeing the face of Jesus in a grilled cheese sandwich, or a unicorn in a cloud?
The concepts are related, but the difference is that you say to yourself “that cloud looks like a unicorn”, but you don’t actually believe you are seeing a unicorn with certainty. These networks are certain that the images are really the everyday objects they declare them to be.
- Does this phenomenon happen to other classification models also?
We think that all AI techniques that create decision boundaries between classes (e.g. SVMs, deep neural networks, etc.), which are known as discriminative models, are subject to this fooling phenomenon. The below image helps explain why. With discriminative models, when generating images that a DNN maximally believes are members of the class, the produced image (red triangle) will be placed far away from the decision boundary, and thus far away from exemplars of the class (and therefore will likely be unrecognizable).
Another approach is to model the data in each class, instead of simply draw a boundary between classes. That is possible with generative models. With such generative models, the red triangle is likely to be closer to class exemplars (though it is still an open question as to whether it would look natural/recognizable, as the space within the manifold could still be large and consist of many unrecognizable images). We think that a perfect generative model would not be fooled, although making such a model for high-dimensional data (such as the natural images in our experiment) is currently beyond the technical abilities of scientists.
Note that a recent paper suggests that generative models are also susceptible to adversarial examples. - Can you share the code used to produce these images?
The deep neural network is the pre-trained network modeled on AlexNet provided by Caffe.
To evolve images, both the directly encoded and indirectly encoded images, we use the Sferes evolutionary framework. The entire code base to conduct the evolutionary experiments can be download here. The code for the images produced by gradient ascent is available here.