Scientists See Promise in Deep-Learning Programs (nytimes.com)
141 points by mtgx on Nov 24, 2012 | 68 comments


Since I see some misunderstanding about deep learning, let me explain the fundamental idea: It's about reusing intermediate work.

The intuition is this: suppose I asked you to write a complicated computer program, and told you that you could use routines and subroutines, but not sub-subroutines or any deeper levels of abstraction. In this restricted case you could still write any computer program, but you would have to do a lot of code-copying. With arbitrary levels of abstraction, you could reuse code much more elegantly, and your code would be more compact.
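
A toy sketch of that analogy in Python (a made-up example, just to illustrate): with only one level of abstraction the shared step gets copy-pasted into every routine, while deeper nesting defines it once and reuses it.

    # illustrative toy, not from the thread
    def process_images_flat(images):
        out = []
        for img in images:
            scaled = [p / 255.0 for p in img]      # the shared "normalize" step, copied here...
            out.append(sum(scaled) / len(scaled))
        return out

    def process_audio_flat(clips):
        out = []
        for clip in clips:
            scaled = [s / 255.0 for s in clip]     # ...and copied again here
            out.append(max(scaled))
        return out

    def normalize(xs):                             # with deeper nesting, the intermediate
        return [x / 255.0 for x in xs]             # work is defined once and reused

    def process_images(images):
        return [sum(normalize(img)) / len(img) for img in images]

    def process_audio(clips):
        return [max(normalize(clip)) for clip in clips]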

Here is a more formal description: if you have a complicated non-linear function, you can describe it much like a circuit. If you restrict the depth of the circuit, you can in principle represent any function, but you need a really wide (exponentially wide) circuit, and that width can lead to overfitting (Occam's Razor). By comparison, with a deep circuit you can represent arbitrary functions compactly.

Standard SVMs and random forests can be shown, mathematically, to have a limited number of layers (circuit depth).

It turns out that expressing deep models using neural networks is quite convenient.

I gave an introduction to deep learning in 2009 that describes these intuitions: http://vimeo.com/7977427


If you restrict the depth of the circuit, you can in principle represent any function, but you need a really wide (exponentially wide) circuit.

Are you sure it's exponential?

If you look at binary functions (i.e. boolean circuits), any such function can be represented by a single-layer function whose size is linear in the number of gates of the original function (I think it's 3 or 4 variables per gate) by converting to conjunctive normal form.

Of course it's not obvious that a similar scaling exists for non-binary functions but I'd be a bit surprised if increasing depth led to an exponential gain in representational efficiency.


I am not sure in the sense of: If I were dropped on a desert island, I could derive a water-tight proof of this result from scratch.

I am confident, though, based upon my reading of secondary sources written by people that I trust.

From one of Bengio's works (http://www.iro.umontreal.ca/~bengioy/papers/ftml.pdf): "More interestingly, there are functions computable with a polynomial-size logic gates circuit of depth k that require exponential size when restricted to depth k − 1 (Hastad, 1986)."


I think my argument was mistaken. The CNF form I was thinking of involves adding unknown variables so it doesn't actually allow you to compute the function in one step.


Computing a sum modulo 2 (cumulative XOR) of n boolean inputs requires an exponential number of gates if you only have OR, NOT, and AND gates and the depth of the circuit is held constant.
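
A quick way to see the depth-2 blow-up concretely (a small Python sketch; it counts the AND terms in the standard OR-of-ANDs form of parity, where every odd-weight input needs its own term):

    # illustrative sketch, not from the thread
    from itertools import product

    def dnf_terms_for_parity(n):
        # in a depth-2 OR-of-ANDs formula for n-bit parity, every odd-weight
        # input needs its own AND term, so the width is 2^(n-1)
        return sum(1 for bits in product([0, 1], repeat=n) if sum(bits) % 2 == 1)

    for n in range(2, 11):
        print(n, dnf_terms_for_parity(n))   # 2, 4, 8, ..., 512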


Reusing intermediate work? I don't think this is a good intuition. Using several levels of abstraction is more like it.

For example, in face recognition, the first level could be pixels. The second level, edges and corners: http://www.cs.nyu.edu/~yann/research/deep/images/ff1.gif The third level, parts of the face: http://people.cs.umass.edu/~elm/images/face_feature.jpg


I'm not an expert on this, but I think this article overstates the relationship between "deep learning methods" and "neural networks". Neural nets have been around forever and, in the feed-forward case, are actually fairly basic statistical classifiers.

Deep learning, on the other hand, is about using layers of classifiers to progressively recognize higher-order concepts. In computer vision, for example, the first layer of classifiers may be recognizing things like edges, blocks of color, and other simple concepts, while progressively higher layers may be recognizing things like "arm", "desk", or "cat" from the lower-order concepts.

There's a book I read a while ago that was super-interesting and digs into how one researcher leveraged knowledge about how the human brain works to develop one of these deep learning methods: "On Intelligence" by Jeff Hawkins (http://www.amazon.com/On-Intelligence-Jeff-Hawkins/dp/B000GQ...)


No.

All currently used deep learning algorithms are special cases of neural networks. The reason this is called "deep" learning is that before 2006, no one knew how to efficiently train neural nets with more than 1 or 2 hidden layers. (Or couldn't in practice, for lack of computing power.) Thanks to a breakthrough by Dr Hinton, this is now possible.

But all the models used are neural nets. It's just that a vast number of new algorithms for training them have been developed in the last few years, and people have come up with new ideas about how to use them.

But it is all neural nets. And that's the whole beauty of it.


Closer, but still no :) Geoff Hinton proposed contrastive divergence training for Restricted Boltzmann Machines in his 2006 Science paper. CD does not apply outside of RBMs though, and most of the nets in the article here are not in fact RBMs. The paper did spark a lot of interest in the field though.

These are all neural nets (with some bells and whistles in some cases like tied weights, pooling units, etc.) trained exactly as they were trained before, using stochastic gradient descent or LBFGS. We did come up with a lot of tricks for making SGD work though, like momentum terms, clamping of weights during learning, dropout, unsupervised pretraining, etc., but in large part it's just a lot more compute power. These networks just turned out to work very well when you have a LOT of (fairly homogeneous) data and can afford to scale them up computationally. And that's pretty awesome; looks like we have a powerful hammer and there are plenty of nails lying around :)
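
To make a couple of those tricks concrete, here is a minimal numpy sketch of one SGD update with a momentum term and max-norm weight clamping (the function name and constants are made up for illustration):

    import numpy as np

    def sgd_momentum_step(W, grad, velocity, lr=0.01, momentum=0.9, max_norm=3.0):
        """One SGD update with a momentum term and max-norm clamping of the rows of W (illustrative sketch)."""
        velocity = momentum * velocity - lr * grad           # decaying average of past gradients
        W = W + velocity
        norms = np.linalg.norm(W, axis=1, keepdims=True)     # clamp: rescale rows whose norm grew too large
        W = W * np.minimum(1.0, max_norm / (norms + 1e-8))
        return W, velocity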


That is not entirely accurate. The Science paper described how to (pre)train a deep belief net by training a sequence of RBMs. Contrastive divergence for RBM training (and more generally products of experts) was described in 2002 in "Training Products of Experts by Minimizing Contrastive Divergence" http://www.cs.toronto.edu/~hinton/absps/nccd.pdf


Doh, not very carefully worded now that I'm re-reading my answer; you're right, of course. Well, at least we're slowly converging on the right answer over several comments :)


What exactly is wrong with what I wrote? I did not say that all nets nowadays would be trained by RBMs (on the contrary, I said quite the opposite, that new algorithms had been developed). I just said that they were part of the breakthrough.


What are your thoughts re: LBFGS vs HF as applied to FF networks? I've been using HF for RNNs and have been having very good results, but I haven't yet tried it on FF networks and wonder if I'd see a benefit compared to SGD with the bells and whistles or even something like LBFGS.


Are you talking about Hinton's "A Fast Learning Algorithm for Deep Belief Nets"? Before that was published, Hinton's lab and their spiritual allies had been training large restricted Boltzmann machines via truncated sampling for decades. And Yann LeCun's convolutional networks (the architecture used in Google's vision project) have also been trained via plain old stochastic gradient descent for decades.

As far as I can tell there hasn't been any single revolutionary breakthrough in this field...we just keep getting more computing power, discovering better tricks and heuristics, and trying to build larger and larger networks.


I'm guessing the "pretraining" described in this 2006 Science article: http://www.cs.toronto.edu/~hinton/science.pdf (possibly the same line of research as the article you mention). Sure, if you look at things from a wide enough perspective, there haven't been any "revolutionary" breakthroughs. But this did seem to reignite interest in neural nets after they had sort of languished for a while. (Science described this work, somewhat hyperbolically, as "Neural nets 2.0".)


I think culturally, Hinton made a big splash and got people to pay attention to learning hierarchies and SGD-like training algorithms. Algorithmically, though, SGD is both ancient and still the dominant deep learning training technique (though useful tricks, extensions, and rules of thumb keep accumulating).


That's a very wide classification. I could say everything is a machine algorithm since they all run on general-purpose CPUs.


On the other hand, there is nothing a neural net can do that a Turing machine can't do (perhaps even better?).


There is no "a neural net". Which model do you mean? Certainly not the deep belief networks under discussion in the article.

edit: I'm not sure I was clear enough-- the term "neural network" is a misnomer that encompasses extremely different models that are largely unrelated except for being vaguely inspired by the brain. A vanilla multilayer-perceptron is essentially a generalization of logistic regression. Restricted Boltzmann Machines are different beasts-- they're a restriction of undirected graphical models made amenable to efficient training. Recurrent neural networks aren't in any way a minor extension of other neural networks-- you need different terminology to talk meaningfully about them and they essentially don't have reliable training algorithms. This latter class can be viewed as Turing-equivalent computation, but they're not at all the same as the models in the original article.


Neural Nets cannot loop (unless they are recurrent neural nets) and are memory bound.


That's also what mojuba was referring to.


What was the breakthrough?


The breakthrough was the insight that while you cannot train a deep neural net all at once with backprop, you can train it one layer at a time, greedily, with an unsupervised objective, and later fine-tune it with standard backprop.
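
Schematically, it looks something like this (a rough numpy sketch that uses tied-weight sigmoid autoencoders in place of RBMs; layer sizes, learning rate, and names are made up):

    # illustrative sketch, not from the thread
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train_autoencoder_layer(H, n_hidden, lr=0.1, epochs=50, rng=np.random):
        """Greedy unsupervised step: fit one tied-weight sigmoid autoencoder on H."""
        W = 0.01 * rng.randn(H.shape[1], n_hidden)
        for _ in range(epochs):
            code = sigmoid(H @ W)                      # encode
            recon = sigmoid(code @ W.T)                # decode with tied weights
            d_recon = (recon - H) * recon * (1 - recon)
            d_code = (d_recon @ W) * code * (1 - code)
            W -= lr * (H.T @ d_code + d_recon.T @ code) / len(H)   # squared-error gradient
        return W

    def greedy_pretrain(X, layer_sizes=(64, 16)):
        Ws, H = [], X
        for n_hidden in layer_sizes:
            W = train_autoencoder_layer(H, n_hidden)
            Ws.append(W)
            H = sigmoid(H @ W)      # freeze this layer and feed its codes to the next one
        return Ws                   # these weights then initialize a net for supervised backprop fine-tuning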

Years later, Swiss researchers (Dan Ciresan et al) found that you can train deep neural nets with backprop after all, but you need lots of training time and lots of data. You can only achieve this by making use of GPUs; otherwise it would take months.


You can't train fully connected deep models with backprop, or at least not easily or well. An alternative solution to this problem is spatial weight pooling (Yann's convolutional networks), which plays well with SGD.


That is correct. The problem is that the gradients get smaller and smaller as you backpropagate toward the input layer, so learning in the front part of the net is slow. Hinton has a lot of good material about this in his Coursera lectures.
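
You can see the effect with a few lines of numpy (an illustrative toy with arbitrary width, depth, and initialization): push an input through a stack of sigmoid layers, backpropagate a unit error, and watch the gradient norm shrink toward the input layer (sigmoid' is at most 0.25, so the factors multiply down).

    import numpy as np

    rng = np.random.RandomState(0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    n, depth = 50, 20                             # arbitrary toy sizes
    Ws = [rng.randn(n, n) * 0.1 for _ in range(depth)]

    acts, h = [], rng.randn(n)
    for W in Ws:                                  # forward pass through 20 sigmoid layers
        h = sigmoid(W @ h)
        acts.append(h)

    grad = np.ones(n)                             # pretend the error signal at the top is all ones
    for W, a in zip(reversed(Ws), reversed(acts)):
        grad = W.T @ (grad * a * (1 - a))         # chain rule through the sigmoid and the weights
        print(np.linalg.norm(grad))               # the norm keeps shrinking toward the input layer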


Yes you can.

Check out the publications by Ciresan on MNIST, have a look at Hinton's dropout paper or at the Kaggle competition that used deep nets. Or try it yourself and spend a decent amount of time on hyperparameter tuning. :)


Which of Ciresan's projects are you referring to? Everything I've seen by him uses convolutional layers of some sort.


The first time I saw a paper on feasible deep networks was at NIPS 2006, specifically this paper: http://www.cs.toronto.edu/~hinton/absps/fastnc.pdf

It's been a while since I read the paper, but as I recall it involved training an unsupervised model layer by layer (training a layer, freezing the weights, then training another layer on top of it).


http://www.socher.org/index.php/DeepLearningTutorial/DeepLea... is also a good reference. I wrote a short blog post this morning on the same subject http://blog.markwatson.com/2012/11/deep-learning.html


Contrastive Divergence.

The deep learning / RBM tutorial here is quite good and explains the technique.

http://deeplearning.net/tutorial/rbm.html#rbm
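
For reference, one CD-1 update for a binary RBM looks roughly like this (a numpy sketch, not the tutorial's code; hyper-parameters are arbitrary):

    import numpy as np

    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    def cd1_update(W, b_vis, b_hid, v0, lr=0.1, rng=np.random):
        """One contrastive-divergence (CD-1) step for a binary RBM on a batch v0 (illustrative sketch)."""
        h0_prob = sigmoid(v0 @ W + b_hid)                       # positive phase: hiddens given data
        h0 = (rng.rand(*h0_prob.shape) < h0_prob).astype(float)
        v1_prob = sigmoid(h0 @ W.T + b_vis)                     # negative phase: one Gibbs step
        h1_prob = sigmoid(v1_prob @ W + b_hid)
        W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / len(v0)   # data stats minus reconstruction stats
        b_vis += lr * (v0 - v1_prob).mean(axis=0)
        b_hid += lr * (h0_prob - h1_prob).mean(axis=0)
        return W, b_vis, b_hid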


Jeff has some fascinating theories on AI that I think have a real chance of taking us out of this rut that AI has been stuck in for the past 60 years. If you want a good overview of what's in his book, "On Intelligence", check out his TED talk. http://www.ted.com/talks/jeff_hawkins_on_how_brain_science_w...


Geoffrey Hinton, mentioned in the article, has his class on neural networks available on Coursera https://www.coursera.org/course/neuralnets


Hinton was one of the people who invented backpropagation, which has let neural nets be as powerful as they are today. Somehow, despite his brilliance and intimate familiarity with backpropagation, his explanation of it is stunningly clear and simple. I'm thoroughly enjoying this course and recommend it to anyone who wants to build their own neural networks.


"If you can't explain something simply, you don't know enough about it. You do not really understand something unless you can explain it to your grandmother." - Some German dude :)


C.S. Lewis? "Any fool can write learned language: the vernacular is the real test. If you can't turn your faith into it, then either you don't understand it or you don't believe it."


That would be Uncle Al.


I was watching his lectures and saw this post when I took a break. He was talking about the ups and downs in the history of neural nets. As far as I understand from all these lectures, we're on the verge of a new up phase: neural nets are meaningful when they are large and deep, and training such nets is becoming feasible, although not immediately.


I started taking the class but had to take a break due to other real-life occurrences. It was very enjoyable both content-wise and style-wise; for what my opinion counts, I recommend it.


For those looking to learn about these techniques, I'd highly recommend the deep learning Theano tutorials.

Hinton has a class on Coursera--I think it would be very confusing for beginners, but it has really great material.

Also, I run the "SF Neural Network Aficionados" meetup in San Francisco and will be giving a workshop in January about building your own DBN in Python, so feel free to check that out if you're in SF (although space was an issue last time).


Please put the notes & code online


How is "deep learning" different from "neural network"?


The idea of having multiple levels of representation (deep learning) goes beyond neural networks. A good example is the recent work (award-winning at NIPS 2012) on sum-product networks, which are graphical models whose partition function is tractable by construction. Several important things have been added since 2006 (when deep learning was deemed to begin) to the previous wave of neural networks research, in particular: powerful unsupervised learning algorithms (which allow very successful semi-supervised and transfer learning; 2 competitions won in 2011), often incorporating advanced probabilistic models with latent variables; a better understanding (although much more remains to be done) of the optimization difficulty of training gradient-based systems through many composed non-linearities; and other improvements to regularize better (such as the recent dropout technique) and to rationally and efficiently select hyper-parameters (random sampling and Bayesian optimization). It is also true that sheer improvements in computing power and amounts of training data are in part responsible for the impressively good results recently obtained in speech recognition (see the recent New York Times article, 24 Nov, J. Markoff) and object recognition (see the NIPS 2012 paper by Krizhevsky et al).


Well--two answers:

1) It's not. It's just a buzz word that people are going to use to separate the current (2006+) research from older research concerning neural networks. This is to draw a clear (and potentially self serving?) distinction between old neural networks that were discredited due to their lack of results vs current research that produces much much better results. So, it's neural networks rebranded. Oh my.

2) "traditional" neural networks and what people are using now are very different--mostly because what people are doing now actually works. Deep learning refers to deep neural networks, which take more traditional neural networks and stack them on top of each other to form a hierarchy of representations that ends up being effective for all kinds of stuff. Not that deep neural networks are a new concept--the newness is more that this is now practical rather than theoretical.

So, deep learning is like 20% bullshit, 80% the real deal. Still lots of work to be done, but I think "deep learning" is a nice buzzword to describe the current state of the art as far as neural networks go. It's all neural networks--but this time it's different, haha.


DBN?



Very confusing abbreviation, since it is also used for Dynamic Bayesian Networks. Somebody should come up with something else, and quickly, before it sticks. :)


I was involved in the speech recognition work mentioned in the article and I led the team that won the Merck contest if anyone has any questions about those things. I also spend some time answering any machine learning question I feel qualified to answer at metaoptimize.com/qa


Congratulations on winning the Merck contest! That was an impressive demonstration.

About 12 years ago, I switched from a Bio major to CS. I hoped to major in AI, but after taking 2 upper level classes, one focusing on symbolic AI and the other focusing on Bayesian networks, I was completely turned off.

Our brains are massively parallel, redundant systems that share practically nothing in common with modern von Neumann CPUs. It seemed the only logical approach to AI was to study neurons, then try to discover the basic functional units that they form in simple biological life forms like insects or worms, and keep reverse engineering the brains of higher and higher life forms until we reach human-level AI.

Whenever I tried to relate my course material in AI to what was actually going on in a brain, my profs met my questions with disdain and disinterest. I learned more about neurons in my high school AP Bio class than in either of my AI classes. In their defense, we've come a long way since, with new tools like MRIs and neural probes.

The answers are all locked up in our heads. It took nature millions of years of natural selection to engineer our brains. If we want to crack this puzzle in our lifetimes, we need to copy nature, not reinvent it from scratch. Purely mathematical theories like Bayesian statistics that have no basis in biological systems might work in specific cases, but are not going to give us strong AI.

Are these new deep learning algorithms for neural networks rooted in biological research? Do we have the necessary tools yet to start reverse engineering the basic functional units of the brain?


We think so (http://vicarious.com/), but we are obviously biased.


I worked on the Netflix Prize and haven't kept up with the field since then. There the RBM (or the modified version per Ruslan's paper) performed very well, but not substantially better than the linear models (in an apples-to-apples comparison, ignoring the time dimension and the peeking at the contents of the quiz/test set). And as I recall no one really made any progress with deeper networks on that problem. Has anything been learned since then that would suggest progress there?

I also don't recall anyone successfully incorporating the date of the rating into the RBM. Mostly this was useful in other models because on any particular day people would just bias their ratings up or down a bit. But also, as one can imagine, over the course of a year or two their tastes would change. Is it straightforward to include that time dimension into RBMs, and if so, is that a recently discovered technique?


The Netflix Prize winners had a few RBM models that used the dates.

Regarding the DBM: I also tried to use more than one layer, but without success. I tried out 3-layer and 4-layer autoencoders (which can be called 1.5-layer and 2-layer DBMs), with initialization by stacked RBMs or without it. It did not work well, probably because: a) the model was inaccurate, and b) the learning method proposed for the DBM was not completely correct. Intuitively, the right DBM-like model with the right learning method should have a chance to improve something on the Netflix task.

I found some improvement, though (in learning time rather than accuracy), in the standard RBMs. Instead of using plain CD, I split the weights into two sets, creating a directed RBM version. The "up" weights from the visible nodes to the hidden nodes are learned with CD with T=1. The "down" weights are learned to best fit the visible nodes, using the hidden nodes as predictors. The hidden nodes generated by CD with T=1 are good enough, and we do not need additional iterations with increased T.


I played around for a while with writing an RBM learner in Go (RBMs are a particular instance of deep learning which Hinton specializes in).

More an experiment than anything else, but for anyone who is interested: https://github.com/taliesinb/gorbm. I don't claim there aren't bugs, and there is no documentation.

The consensus I've picked up from AI-specializing friends is that there are a lot of subtle gotchas and tricks (which Hinton and friends know about but don't necessarily advertise) without which RBMs are a non-starter for many problems. Which I suppose is pretty much standard for esoteric machine learning.


Deep belief networks are extremely powerful; we are finally getting to the point where we don't need to do tons of feature engineering to make useful complex classifiers. It used to be that you would have to spend a ton of time doing data analysis and feature extraction to get useful and robust classifiers, and of course the usefulness of those sorts of networks was limited by how well you did the feature extraction. Now you train networks on much more minimally processed data and get great results out of them.


Since the fall of AI, there have been two groups of people in this topic: one trying to produce reproducible, robust results with well-defined algorithms, and a second importing random ideas from the first group onto some questionably defined ANN model and getting all the hype because of the "neural" buzzword. "Deep learning" is actually called boosting and has been around for years.


Unsupervised pre-training is fundamentally different than boosting.

Boosting is a clever way of modelling a conditional distribution. The insight behind the success of pre-training is that, for many perceptual tasks, having a good model of the input (rather than the input->output mapping) is key.

I have no delusion that the algorithms that work for training deep networks are anything like what the brain actually does, but I don't care. There are many tasks where deep neural nets are state of the art.


Not to argue with you, robrenaud, but Hinton himself writes in their 2006 paper 'A Fast Learning Algorithm for Deep Belief Nets':

The greedy algorithm bears some resemblance to boosting in its repeated use of the same “weak” learner, but instead of reweighting each data vector to ensure that the next step learns something new, it re-represents it.

I guess that most people however would not think of this interpretation of greedily pretraining deep networks :). (I wonder if mbq had this in mind).

In the same article your point about good models of the input is mentioned, too (I'll only copy & paste a small part of the paragraph):

Unsupervised methods, however, can use very large unlabeled data sets, and each case may be very high-dimensional, thus providing many bits of constraint on a generative model.

The 2006 paper is really an amazing read in my opinion.


Boosting selects successive weak learners for the same classification problem, but under a changing distribution/weighting of the input space. Deep learning stacks complex models to create increasingly abstract representations. All I can really imagine them having in common is (1) they're both families of machine learning techniques and (2) they both (roughly) involve a collection of models, albeit in very different ways.


You're talking about AdaBoost; general boosting can use any models, and the only idea there is that it adds a new model to fix the residuals of the current chain. BTW, "increasingly abstract representation" is a perfect example of the meaningless PR that ANNs are built of.


No.

Deep learning is not about fixing the residuals of the current chain. Deep learning isn't even about residuals in the first place. It's about (1) finding good representations of your data (aka feature learning), (2) then adding a discriminative model on top and then (3) tuning everything. There is no relation to boosting at all.


Deep learning is not boosting at all. Deep learning is about composing trainable modules. Adding a layer f(x) to a layer g(x) to get h(x) = f(g(x)). Boosting creates a final classifier that is a weighted sum of the base classifiers, or something like h(x) = a * f(x) + b * g(x). Composition is what Professor Hinton means when he says "re-represent the input" and other similar phrases.
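
In code, the contrast is simply this (toy stand-in functions, purely for illustration):

    f = lambda x: x ** 2                  # stand-ins for trained modules / base learners
    g = lambda x: x + 1

    deep = lambda x: f(g(x))              # deep learning: composition, h(x) = f(g(x))

    def boosted(x, a=0.7, b=0.3):         # boosting: weighted sum, h(x) = a*f(x) + b*g(x)
        return a * f(x) + b * g(x)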


Pegged to the NIPS conference next week: http://nips.cc/Conferences/2012/


The students were also working with a relatively small set of data;

ANNs are overfitted more often than not.


Are there any good C++ or Python SciPy libraries for building and training deep learning networks?


There is a C++/CUDA library with a Python frontend, from one of the guys who works with Hinton, that I am starting to play with. It is written by Alex Krizhevsky and has lots of tools for training feed-forward networks with lots of different connection topologies and neuron types. If I am not mistaken, this was the library used in the recent Kaggle drug competition referenced in the article. There is some good starting-point documentation there as well; as long as you know enough about the mechanics of artificial neural networks, it has some really interesting stuff in there.

Here is the link: http://code.google.com/p/cuda-convnet/


Mentioned somewhere else: http://deeplearning.net/software/theano/ with a tutorial http://deeplearning.net/tutorial

Not C++ or Python, but Lua with lots of stuff: Torch 7 (http://www.torch.ch/)


Is there a good place to plug in to get an overview of what has been and is going on in this area, without having to dive in all the way? An overview of the concepts, not the nuts and bolts, not the heavy lifting.


The one overview I've found most useful is http://www.youtube.com/watch?v=ZmNOAtZIgIk (Bay Area Vision Meeting: Unsupervised Feature Learning and Deep Learning, by Andrew Ng, April 2011).


Can someone contrast what's in that article with what Jeff Hawkins' Numenta is attempting?



