Another good dimensionality reduction technique to consider is Latent Dirichlet Allocation (LDA). I use this approach for natural language or other "bursty" data sets. "Bursty" data sets have a Zipfian distribution over features, but certain long-tail features have a much higher probability of being observed again within an instance once they have been observed there at all. For example, "armadillo" is relatively rare, but an article mentioning an armadillo once has a high chance of mentioning it again.
A cool thing about LDA is that it allows you to express the latent characteristics of a given document as a point in Euclidean space. This gives you the ability to use spatial distance metrics such as cosine distance to express document similarity. I specifically use this for recommending large-scale UGC communities based on their latent characteristics. Furthermore, since you've turned your language data into spatial data, you're able to use spatial classifiers such as SVMs more effectively over natural language data, which is normally a bit better suited for Bayesian classifiers.
I'm a huge fan of Gensim for its LDA library. It's even capable of distributed computing using Pyro4. It's relatively trivial to deploy an LDA pipeline for extremely large datasets using EC2 and the Boto AWS library.
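Here's a minimal sketch of that pipeline with Gensim; the toy documents and the number of topics are purely illustrative:

    from gensim import corpora, models, similarities

    # Toy corpus; in practice these would be tokenized documents.
    docs = [["cat", "meows", "cat"], ["dog", "barks"], ["cat", "and", "dog"]]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    # Fit LDA and index the topic vectors for cosine-similarity lookups.
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
    index = similarities.MatrixSimilarity(lda[corpus], num_features=lda.num_topics)

    # Cosine similarity of the first document's topic vector to every document.
    print(index[lda[corpus[0]]])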
Edit: If you haven't heard of it, scikit-learn is an awesome Python library for highly performant machine learning using Python's C extensions for numerical computing (scipy, numpy). It's easy to take the data you get above and perform learning on it using the classifiers provided.
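And a rough sketch of the scikit-learn side; the random matrix below just stands in for the dense topic vectors you'd get out of the LDA step above:

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((200, 50))      # stand-in for 200 documents x 50 topic weights
    y = rng.integers(0, 2, 200)    # stand-in labels

    # Cross-validated accuracy of a linear SVM on the topic-space features.
    clf = LinearSVC()
    print(cross_val_score(clf, X, y, cv=5).mean())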
If you're explicitly looking to calculate document similarity via cosine distance, you may also want to try a technique that explicitly tries to map into an orthogonal space, like any of the principal components analysis variants.
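For example, a quick sketch with scikit-learn's TruncatedSVD (i.e. LSA) over TF-IDF vectors; the documents here are purely illustrative:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["the cat sat", "the dog barked", "a cat chased a dog"]
    tfidf = TfidfVectorizer().fit_transform(docs)
    reduced = TruncatedSVD(n_components=2).fit_transform(tfidf)

    print(cosine_similarity(reduced))   # pairwise document similarities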
I really like the idea of the Web site as a whole: explaining concepts from computer vision in a simple way.
When I was starting my master's course I was interested in learning what the concept of bag of words in computer vision was all about. Although it is a straightforward technique, there are few examples on the Web explaining how to implement it (clustering the feature vectors, etc.).
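In case it helps anyone else, here's the basic idea in a few lines; random arrays stand in for real local descriptors (SIFT, SURF, etc.), and the vocabulary size is arbitrary:

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    rng = np.random.default_rng(0)
    descriptors_per_image = [rng.random((100, 128)) for _ in range(5)]  # 5 "images"

    # Build the visual vocabulary by clustering all descriptors.
    kmeans = MiniBatchKMeans(n_clusters=50, n_init=3)
    kmeans.fit(np.vstack(descriptors_per_image))

    # Each image becomes a histogram of visual-word counts.
    histograms = [np.bincount(kmeans.predict(d), minlength=50)
                  for d in descriptors_per_image]
    print(histograms[0].shape)  # fixed-length feature vector for a classifier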
Although not widely known, the best way to learn about specific computer vision topics in detail is usually through the websites for tutorials held at the premier computer vision conferences. These are CVPR, ICCV, and ECCV. For example, here are the tutorials held at ICCV 2013: http://www.iccv2013.org/tutorials.php
If you click through, you'll see that most of them have links to slides, and some even have video coverage and/or links to software as well. They're also usually presented by experts in that area.
Finally, note that even if the most recent conferences don't have relevant tutorials for what you're looking for, you can still often find good material by going back further in time (e.g., tutorials presented 2 or more years ago). Although the state of the art may have advanced since then, the fundamentals usually don't change very frequently.
Not really. The article mentions that using linear methods (i.e., LIBLINEAR) is one way to avoid the curse. LIBLINEAR is specifically designed for situations in which you have many features and relatively few training instances. When using a linear classifier it may make sense to simply generate as many features as you can, and then use, e.g., lasso regression to do feature selection. http://www.csie.ntu.edu.tw/~cjlin/liblinear/
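A hedged sketch of that idea using scikit-learn's liblinear-backed logistic regression with an L1 penalty; the synthetic data just makes the point that most coefficients get zeroed out:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.random((100, 1000))                # far more features than instances
    y = (X[:, 0] + X[:, 1] > 1).astype(int)    # only two features actually matter

    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
    print((clf.coef_ != 0).sum(), "features kept out of", X.shape[1])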
Not strictly as many features as you can - there are many ways to add huge numbers of highly correlated and redundant features that limit the effectiveness of both the classifier and the feature selection or regularization methods.
A simple example of this is in natural language processing. Adding dependency or phrase-structure parse features to an n-gram bag-of-words model might result in an order-of-magnitude increase in the number of dimensions in your feature space and end up harming classification accuracy, even with tightly controlled and elegant feature selection methods.
But the article used a linear model to demonstrate the curse, and the model was overfit with just 3 dimensions. There is clearly something missing: for example, for text data it is not uncommon to have thousands or hundreds of thousands of dimensions, and algorithms work fine.
I think the missing piece is regularisation. It doesn't have to do feature selection and actually reduce the number of dimensions, but you're right that using L1 for such data is usually a good idea.
The article had very few data points, that's why it worked with 3 dimensions. The deciding factor is how N (effective number of data points) compares with p (effective number of features).
Is there some kind of test to know whether we are past the optimal number of dimensions? I guess overfitting could be detected by the ratio between the volume and the area of the classification boundary.
Cross-validation (actually, this is mentioned toward the end of the article). Basically, fit the classifier with a subset of the data and test the predictions on the remainder. Predictions for out-of-sample data will be poor if you have overfitting.
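Something like this, as a toy illustration (random labels, so there is nothing real to learn); the gap between training accuracy and cross-validated accuracy is the overfitting signal:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((30, 50))        # few samples, many dimensions
    y = rng.integers(0, 2, 30)      # random labels

    clf = SVC(kernel="rbf", gamma=10).fit(X, y)
    print("train accuracy:", clf.score(X, y))   # near 1.0 (memorised the noise)
    print("cv accuracy:",
          cross_val_score(SVC(kernel="rbf", gamma=10), X, y, cv=5).mean())  # near chance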
There's some additional information you'd need to determine this. Suppose, for the sake of argument, I only have one feature, X. Pretend I extend it into two dimensions by simply replicating the feature. In two dimensions, (X,X) forms a perfect straight line, which should make it clear that using an additional dimension didn't gain you anything.
Testing whether or not to include additional terms requires an understanding of the distribution of the response, as well as the amount of collinearity among the features (how similar the features are). There are some ways to do this in statistics, but it's more something done in inference than in prediction.
Heuristically, the most common way is just to look at the cross-validated classification error and compare it with and without a feature (or set of features) in question. Asking about the distribution of the cross-validated classification rate is an interesting statistical question, though!
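For instance, a quick sketch of that comparison (the data and the feature split are made up):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((200, 20))
    y = (X[:, 0] > 0.5).astype(int)   # only column 0 is informative

    without = cross_val_score(LogisticRegression(), X[:, :10], y, cv=5).mean()
    with_all = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
    print("first 10 features:", without, "all 20 features:", with_all)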
Derivatives require continuity. It would be sufficient to simply look at which number of dimensions gave you the best cross-validated classification rate.
Cross-validation, separate training and test data sets, or AIC/BIC (AIC is more forgiving than BIC) if you can get a reasonable estimate of your "degrees of freedom". (For many models, however, d.f. is either not defined or intractable. For bagged or boosted trees, for example, you need CV or a test set.)
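For reference, the standard definitions (k = number of fitted parameters, n = number of observations, logL = maximised log-likelihood); the numbers in the example are made up:

    import numpy as np

    def aic(logL, k):
        return 2 * k - 2 * logL

    def bic(logL, k, n):
        return k * np.log(n) - 2 * logL

    # e.g. a model with 5 parameters, fit to 100 points, with logL = -120:
    print(aic(-120.0, 5), bic(-120.0, 5, 100))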
If you're data rich, you tend not to use CV but to have two or three sets. The reason 3 is better is because you ideally have (a) a training set for building models with known, fixed "hyperparameters" (e.g. regularization coefficients, tree sizes, neural net topologies), (b) a validation set for evaluating models with varying hyperparameters, in order to optimally select them, and (c) a test set on which you can evaluate the model for accuracy after your hyperparameters are chosen from b. Cross-validation is typically what you need to do when you have a small number of observations (say, 1000).
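A sketch of that three-set scheme (the fractions and the hyperparameter grid are arbitrary):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X, y = rng.random((1000, 20)), rng.integers(0, 2, 1000)

    # 60% train, 20% validation, 20% test via two chained splits.
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

    # (b) choose the hyperparameter that does best on the validation set...
    best_C = max([0.01, 0.1, 1.0],
                 key=lambda C: LinearSVC(C=C).fit(X_train, y_train).score(X_val, y_val))
    # (c) ...then report accuracy once on the held-out test set.
    print(LinearSVC(C=best_C).fit(X_train, y_train).score(X_test, y_test))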
Imagine flattening the 3D space (Figure 5) down onto the 2D space of the floor.
Then, the image is trying to illustrate which sections of the 2D space have been selected as 'cat' by the classifier (green plane).
Visually, you can imagine that the cat heads have each pushed down a chunk of the green plane (the dog heads have held the green plane up off the ground).
This illustrates that the third dimension was important for separating out the cats: they certainly aren't separated by a clean line in those two dimensions.
However, it also illustrates that the separation might be slightly contrived: viewed in 2D, the green plane seems to have plucked out the odd cats correctly, but without any apparent logic.
One goal is to show that using a high number of dimensions will guarantee that you can separate dogs and cats, but that this is just an over-fitted solution to the data set that you have: it will not continue to work when you apply it to further data.
> Visually, you can imagine that the cat heads have each pushed down a chunk of the green plane (the dog heads have held the green plane up off the ground).
Thank you, that helps. I was mentally projecting the whole plane down and could see why it wasn't all-green (or all-not).