This work has led to some unfortunate misconceptions.
In particular, this weakness has nothing to do with Computer Vision and nothing to do with deep learning. These attacks break ConvNets on images only because images are fun to look at and ConvNets are state of the art. But at its core, the weakness comes from the use of linear functions. In fact, you can break a simple linear classifier (e.g. a Softmax classifier or Logistic Regression) in just the same way, and you could similarly break speech recognition systems, etc. I covered this in CS231n in the "Visualizing/Understanding ConvNets" lecture, slides around #50 (http://vision.stanford.edu/teaching/cs231n/slides/lecture8.p...).
The way I like to think about this is that for any input (e.g. an image), imagine there are billions of tiny noise patterns you could add to the input. The vast majority of them are harmless and don't change the classification, but given the weights of the network, backpropagation allows us to efficiently compute (with dynamic programming, basically) exactly the single most damaging noise pattern out of all of those billions.
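To make that concrete, here is a rough NumPy sketch for a plain softmax (linear) classifier, where the gradient with respect to the input is available in closed form. Everything here (the weights, the input, the epsilon) is made up purely for illustration, and the sign step is the fast-gradient-sign trick from the "Explaining and Harnessing Adversarial Examples" paper mentioned elsewhere in this thread, not something specific to this paper:

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(z):
        z = z - z.max()                      # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    # Toy stand-in for a trained linear (softmax) classifier: random weights,
    # a 784-dim "image", 10 classes. In practice W and b would come from training.
    D, K = 784, 10
    W = rng.normal(scale=0.01, size=(K, D))
    b = np.zeros(K)
    x = rng.uniform(0.0, 1.0, size=D)        # a fake "image"
    y = 3                                    # pretend this is the true class

    p = softmax(W @ x + b)                   # forward pass: class probabilities

    # Gradient of the cross-entropy loss with respect to the *input* (not the
    # weights). For a linear model it has the closed form W^T (p - onehot(y)).
    onehot = np.zeros(K)
    onehot[y] = 1.0
    grad_x = W.T @ (p - onehot)

    # One small step per pixel in the direction that increases the loss
    # (the fast-gradient-sign idea), clipped back to the valid pixel range.
    eps = 0.05
    x_adv = np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

    print("confidence in true class before:", p[y])
    print("confidence in true class after: ", softmax(W @ x_adv + b)[y])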
All that being said, this is a concern and people are working on fixing it.
I agree with most of what you say, but note that nearly all of the images in the paper were generated without the gradient. That is, the images produced by evolution did not use the gradient at all, only the network's output prediction confidences. A few images do use the gradient, but only to show a third class of "fooling images".
PS. It's nice to see our work (both this paper and the NIPS paper on transfer learning) in your class. Thanks for including it. I wish I could have my students take your course!
> This work has led to some unfortunate misconceptions.
Agreed; the weaknesses reported should definitely not be taken to affect only convnets or only deep learning. Ian's "Explaining and Harnessing Adversarial Examples" paper (linked by @Houshalter) should be required reading :).
> backpropagation allows us to efficiently compute (with dynamic programming, basically) exactly the single most damaging noise pattern out of all of those billions.
True. By using backprop, one can easily compute exact patterns of pixelwise noise to add to an image to produce arbitrary desired output changes. However, it's an important detail that most of the images in the paper (all except the last section) were produced without knowledge of the network's weights and without using backpropagation at all. This means a would-be adversary need not have access to the complete model, only a way of running many examples through the network and checking the outputs.
> ...there are billions of tiny noise patterns you could add to the input.
Perhaps because the CPPN fooling images were created in a different way (without using backprop), they fool networks more robustly than one might expect. Far from being a brittle addition of a very precise, pixelwise noise pattern, many fooling images are robust enough that their classification holds up even under rather severe distortions, such as using a cell phone camera to take a photo of the PDF displayed on a monitor and then running the photo through an AlexNet trained with a different random seed (photo cred: Dileep George):
The two cases you describe, (1) the adversary has the weights and architecture and (2) the adversary can only do a forward pass and observe the output, are equivalent when all you're trying to do is compute the gradient on the data. In case 1 I use backprop; in case 2 I can compute the gradient numerically, it just takes a bit longer. Your stochastic search speeds this up.
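For the forward-pass-only case, here is a minimal finite-difference sketch; f_loss is a hypothetical callable standing in for "run the example through the black-box model and read off a scalar score":

    import numpy as np

    def numerical_input_gradient(f_loss, x, eps=1e-4):
        # Estimate d(loss)/d(input) using only forward passes: two queries per
        # input dimension, so it is slow, but it needs nothing beyond the
        # ability to run examples through the model and read its outputs.
        grad = np.zeros_like(x)
        for i in range(x.size):
            bump = np.zeros_like(x)
            bump[i] = eps
            grad[i] = (f_loss(x + bump) - f_loss(x - bump)) / (2 * eps)
        return grad

    # Example with a made-up scalar "score" standing in for the black-box model:
    f = lambda v: float(np.sin(v).sum())
    x0 = np.linspace(0.0, 1.0, 5)
    print(numerical_input_gradient(f, x0))   # approximately cos(x0)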
Likewise, I was not very surprised that you can produce fooling images, but it is surprising and concerning that they generalize across models. It seems that there are entire, huge fooling subspaces of the input space, not just fooling images as points. And that these subspaces overlap a lot from one net to another, likely since they share similar training data (?) unclear. Anyway, really cool work :)
> Likewise, I was not very surprised that you can produce fooling images, but it is surprising and concerning that they generalize across models. It seems that there are entire, huge fooling subspaces of the input space, not just fooling images as points. And that these subspaces overlap a lot from one net to another,
Agreed. That is surprising, and also increases the security risks, because I can produce images on my in-house network and then take them out into the world to fool other networks without even having access to the outputs of those networks.
> likely since they share similar training data (?) unclear.
The original Szegedy et al. paper shows that these sorts of examples generalize even to networks trained on different subsets of the data (and with different architectures).
> Agreed. That is surprising, and also increases the security risks, because I can produce images on my in-house network and then take them out into the world to fool other networks without even having access to the outputs of those networks.
Good point. You could do this with the gradient version too (fool in-house using gradients -> hopefully fool someone else's network), but the transferability of fooling examples might differ depending on how they are found.
I've quite enjoyed reading your paper since it was uploaded to arxiv in December and I have been toying with redoing the MNIST part of your experiment on various classifiers. (I'm particularly interested to see if images generated against an SVM can fool a nearest neighbor or something like that.)
But I'm having problems generating images: a top SVM classifier on MNIST has a very stable confidence distribution on noisy images. If I generate 1000 random images, only 1 or 2 of them have confidences that differ from the median confidence distribution. That is, nearly all of the images are classified with the same confidence for class 1, share the same confidence for class 2, and so on.
So it is very difficult to make changes that affect the output of the classifier.
Any tips on how to get started with generating the images?
I would just unleash evolution. 1 or 2 in the first generation is a toehold, and from there evolution can begin to do its work. You can also try a larger population (e.g. 2000) and let it run for a while.
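For instance, something like this minimal direct-encoding hill climber, where confidence is a placeholder for querying your SVM for the probability of the target class. (The paper's images were evolved with a real EA and, for the regular-image set, a CPPN encoding; this is just the simplest possible sketch.)

    import numpy as np

    rng = np.random.default_rng(0)

    def evolve_fooling_image(confidence, shape=(28, 28),
                             pop_size=2000, generations=200, sigma=0.1):
        # Greedy (1 + lambda)-style hill climber: keep the best image found so
        # far and mutate it; candidates are scored only by the classifier's
        # confidence in the target class, so no gradients or weights are needed.
        best = rng.uniform(0.0, 1.0, size=shape)       # start from random noise
        best_score = confidence(best)
        for _ in range(generations):
            for _ in range(pop_size):
                child = np.clip(best + rng.normal(0.0, sigma, size=shape),
                                0.0, 1.0)
                score = confidence(child)
                if score > best_score:                 # greedy selection
                    best, best_score = child, score
            if best_score > 0.99:                      # "fooled" with high confidence
                break
        return best, best_score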
> ...in case 2 I can compute the gradient numerically, it just takes a bit longer.
Yep, true, might just take a while. On the other hand, even a very noisy estimate of the gradient might suffice, which could be faster to obtain. Perhaps someone will do that experiment soon. Maybe you could convince one of those students of yours to do this for extra credit?? ;).
> Likewise, I was not very surprised that you can produce fooling images, but it is surprising and concerning that they generalize across models.
Ditto x2.
> It seems that there are entire, huge fooling subspaces of the input space, not just fooling images as points. And that these subspaces overlap a lot from one net to another, likely since they share similar training data (?) unclear.
Yeah. I wonder if the subspaces found using non-gradient based exploration end up being either larger or overlapping more between networks than those found (more easily) with the gradient. Would be another interesting followup experiment.
Wait - they didn't use knowledge of the neural network's internal state to calculate these patterns? Does that mean they could create equivalent images for human beings? What would those look like?!
No, but we did make use of (1) a large number of input -> network -> output iterations, along with (2) precisely measured output values to decide which input to try next. It may not be so easy to experiment in the same way on natural organisms (ethically or otherwise).
Of course, if you're as clever as Tinbergen, you might be able to come up with patterns that fool organisms even without (1) or (2):
Perhaps a single experiment on millions of different people? A web experiment of some kind? "Which image looks more like a panda?" and flash two images on the screen.
That's a good idea, though note that there's a difference between asking "Which of these two images looks more like a panda?" and "Which of these two images looks more like a panda than a dog or cat?". The latter is the supervised learning setting used in the paper, and generally could lead to examples that look very different than pandas, as long as they look slightly more like pandas than dogs or cats. The former method is more like unsupervised density learning and could more plausibly produce increasingly panda-esque images over time.
A sort of related idea was explored with this site, where millions (ok, thousands) of users evolve shapes that look like whatever they want, but likely with a strong bias toward shapes recognizable to humans. Over time, many common motifs arise:
Problem is, even if you succeed and end up with a fabricated picture that fools human neural nets into believing it's a picture of a panda, how would you tell it's not really a picture of a panda?
You'd need another classifier to tell you "nope it's actually just random noise and shapes" ... hm.
I think you missed my somewhat deeper philosophical point :)
Who gets to decide what is really a picture of a panda?
If we managed to craft a picture that could, with very high certainty, trick human neural nets (for the sake of argument, including those higher cognitive functions) into believing something is a picture of a panda, "except it actually really isn't", what does that even mean?
Human insists it's a picture of a panda, computer classifier maintains it's noise and shapes.
Interesting, sure. But I started out wondering whether some obviously-noise picture could be found that fools humans, at least at first glance. "Hey a panda! Wait, what was I thinking, that's just noise!" It would be weird and cool, on the order of the dress meme, etc., but much more so.
Kind of like the memes in Snow Crash, ancient forgotten symbols that make up the kernel of human thought.
"In modern software implementations of artificial neural networks, the approach inspired by biology has been largely abandoned for a more practical approach based on statistics and signal processing." [1]
The embodiment is changed, of course, but the process of successive ranks of weighted accumulators has not been, and that process is the neural model. Of course it's different. But there's still the question of whether failure modes of the mathematical model could be present in the biological one. It's a question of modeling, not wetware vs. hardware.
But that model is not the way the overall activity of the brain is currently understood - neurons are seen as being far more complex than simple threshold machines. They may involve threshold effects, but the claim that they work overall like any version of artificial neural networks is no longer supported by anyone.