Another good dimensionality reduction technique to consider is Latent Dirichlet Allocation (LDA). I use this approach for natural language or other "bursty" data sets. "Bursty" data sets have a Zipfian distribution over features, but certain long-tail features have a much higher probability of being observed again within an instance once they have been observed there at all. For example, "armadillo" is relatively rare, but an article mentioning an armadillo once has a high chance of mentioning it again.
A cool thing about LDA is that it allows you to express the latent characteristics of a given document as a point in Euclidean space. This gives you the ability to use spatial distance metrics such as cosine distance to express document similarity. I specifically use this for recommending large-scale UGC communities based on their latent characteristics. Furthermore, since you've turned your language data into spatial data, you're able to use spatial classifiers such as SVMs more effectively over natural language data, which is normally a bit better suited for Bayesian classifiers.
I'm a huge fan of Gensim for its LDA library. It's even capable of distributed computing using Pyro4. It's relatively trivial to deploy an LDA pipeline for extremely large datasets using EC2 and the Boto AWS library.
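Here's a minimal sketch of that pipeline with Gensim; the toy documents and the number of topics are purely illustrative:

    from gensim import corpora, models, similarities

    # Toy corpus; in practice these would be tokenized documents.
    docs = [["cat", "meows", "cat"], ["dog", "barks"], ["cat", "and", "dog"]]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    # Fit LDA and index the topic vectors for cosine-similarity lookups.
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
    index = similarities.MatrixSimilarity(lda[corpus], num_features=lda.num_topics)

    # Cosine similarity of the first document's topic vector to every document.
    print(index[lda[corpus[0]]])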
Edit: If you haven't heard of it, scikit-learn is an awesome Python library for highly performant machine learning using Python's C extensions for numerical computing (scipy, numpy). It's easy to take the data you get above and perform learning on it using the classifiers provided.
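And a rough sketch of the scikit-learn side; the random matrix below just stands in for the dense topic vectors you'd get out of the LDA step above:

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((200, 50))      # stand-in for 200 documents x 50 topic weights
    y = rng.integers(0, 2, 200)    # stand-in labels

    # Cross-validated accuracy of a linear SVM on the topic-space features.
    clf = LinearSVC()
    print(cross_val_score(clf, X, y, cv=5).mean())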
If you're explicitly looking to calculate document similarity via cosine distance, you may also want to try a technique that explicitly tries to map into an orthogonal space, like any of the principal components analysis variants.
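For example, a quick sketch with scikit-learn's TruncatedSVD (i.e. LSA) over TF-IDF vectors; the documents here are purely illustrative:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["the cat sat", "the dog barked", "a cat chased a dog"]
    tfidf = TfidfVectorizer().fit_transform(docs)
    reduced = TruncatedSVD(n_components=2).fit_transform(tfidf)

    print(cosine_similarity(reduced))   # pairwise document similarities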
I really like the idea of the Web site as a whole: explaining concepts from computer vision in a simple way.
When I was starting my master's course I was interested in learning what the concept of bag of words in computer vision was all about. Although it is a straightforward technique, there are few examples on the Web explaining how to implement it (clustering the feature vectors, etc.).
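In case it helps anyone else, here's the basic idea in a few lines; random arrays stand in for real local descriptors (SIFT, SURF, etc.), and the vocabulary size is arbitrary:

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    rng = np.random.default_rng(0)
    descriptors_per_image = [rng.random((100, 128)) for _ in range(5)]  # 5 "images"

    # Build the visual vocabulary by clustering all descriptors.
    kmeans = MiniBatchKMeans(n_clusters=50, n_init=3)
    kmeans.fit(np.vstack(descriptors_per_image))

    # Each image becomes a histogram of visual-word counts.
    histograms = [np.bincount(kmeans.predict(d), minlength=50)
                  for d in descriptors_per_image]
    print(histograms[0].shape)  # fixed-length feature vector for a classifier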
Although not widely known, the best way to learn about specific computer vision topics in detail is usually through the websites for tutorials held at the premier computer vision conferences. These are CVPR, ICCV, and ECCV. For example, here are the tutorials held at ICCV 2013: http://www.iccv2013.org/tutorials.php
If you click through, you'll see that most of them have links to slides, and some even have video coverage and/or links to software as well. They're also usually presented by experts in that area.
Finally, note that even if the most recent conferences don't have relevant tutorials for what you're looking for, you can still often find good material by going back further in time (e.g., tutorials presented 2 or more years ago). Although the state of the art may have advanced since then, the fundamentals usually don't change very frequently.
Not really. The article mentions that using linear methods (i.e., LIBLINEAR) is one way to avoid the curse. LIBLINEAR is specifically designed for situations in which you have many features and relatively few training instances. When using a linear classifier it may make sense to simply generate as many features as you can, and then use, e.g., lasso regression to do feature selection. http://www.csie.ntu.edu.tw/~cjlin/liblinear/
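A hedged sketch of that idea using scikit-learn's liblinear-backed logistic regression with an L1 penalty; the synthetic data just makes the point that most coefficients get zeroed out:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.random((100, 1000))                # far more features than instances
    y = (X[:, 0] + X[:, 1] > 1).astype(int)    # only two features actually matter

    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
    print((clf.coef_ != 0).sum(), "features kept out of", X.shape[1])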
Not strictly as many features as you can - there are many ways to add huge numbers of highly correlated and redundant features that limit the effectiveness of both the classifier and the feature selection or regularization methods.
A simple example of this is in natural language processing. Adding dependency or phrase-structure parse features to an n-gram bag-of-words model might result in an order-of-magnitude increase in the number of dimensions in your feature space and end up harming classification accuracy, even with tightly controlled and elegant feature selection methods.
But the article used a linear model to demonstrate the curse, and the model was overfit with just 3 dimensions. There is clearly something missing: for example, for text data it is not uncommon to have thousands or hundreds of thousands of dimensions, and algorithms work fine.
I think the missing piece is regularisation. It doesn't have to do feature selection and actually reduce the number of dimensions, but you're right that using L1 for such data is usually a good idea.
The article had very few data points, that's why it worked with 3 dimensions. The deciding factor is how N (effective number of data points) compares with p (effective number of features).
Is there some kind of test to know whether we are past the optimal number of dimensions? I guess overfitting could be detected by the ratio between the volume and the area of the classification boundary.
Cross-validation (actually, this is mentioned toward the end of the article). Basically, fit the classifier with a subset of the data and test the predictions on the remainder. Predictions for out-of-sample data will be poor if you have overfitting.
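Something like this, as a toy illustration (random labels, so there is nothing real to learn); the gap between training accuracy and cross-validated accuracy is the overfitting signal:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((30, 50))        # few samples, many dimensions
    y = rng.integers(0, 2, 30)      # random labels

    clf = SVC(kernel="rbf", gamma=10).fit(X, y)
    print("train accuracy:", clf.score(X, y))   # near 1.0 (memorised the noise)
    print("cv accuracy:",
          cross_val_score(SVC(kernel="rbf", gamma=10), X, y, cv=5).mean())  # near chance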
There's some additional information you'd need to determine this. Suppose, for the sake of argument, I only have one feature, X. Pretend I extend it into two dimensions by simply replicating the feature. In two dimensions, (X,X) forms a perfect straight line, which should make it clear that using an additional dimension didn't gain you anything.
Testing whether or not to include additional terms requires an understanding of the distribution of the response, as well as the amount of collinearity among the features (how similar the features are). There are some ways to do this in statistics, but it's more something done in inference than in prediction.
Heuristically, the most common way is just to look at the cross-validated classification error and compare it with and without a feature (or set of features) in question. Asking about the distribution of the cross-validated classification rate is an interesting statistical question, though!
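For instance, a quick sketch of that comparison (the data and the feature split are made up):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((200, 20))
    y = (X[:, 0] > 0.5).astype(int)   # only column 0 is informative

    without = cross_val_score(LogisticRegression(), X[:, :10], y, cv=5).mean()
    with_all = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
    print("first 10 features:", without, "all 20 features:", with_all)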
Derivatives require continuity. It would be sufficient to simply look at which number of dimensions gave you the best cross-validated classification rate.
Cross-validation, separate training and test data sets, or AIC/BIC (AIC is more forgiving than BIC) if you can get a reasonable estimate of your "degrees of freedom". (For many models, however, d.f. is either not defined or intractable. For bagged or boosted trees, for example, you need CV or a test set.)
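For reference, the standard definitions (k = number of fitted parameters, n = number of observations, logL = maximised log-likelihood); the numbers in the example are made up:

    import numpy as np

    def aic(logL, k):
        return 2 * k - 2 * logL

    def bic(logL, k, n):
        return k * np.log(n) - 2 * logL

    # e.g. a model with 5 parameters, fit to 100 points, with logL = -120:
    print(aic(-120.0, 5), bic(-120.0, 5, 100))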
If you're data rich, you tend not to use CV but to have two or three sets. The reason 3 is better is because you ideally have (a) a training set for building models with known, fixed "hyperparameters" (e.g. regularization coefficients, tree sizes, neural net topologies), (b) a validation set for evaluating models with varying hyperparameters, in order to optimally select them, and (c) a test set on which you can evaluate the model for accuracy after your hyperparameters are chosen from b. Cross-validation is typically what you need to do when you have a small number of observations (say, 1000).
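A sketch of that three-set scheme (the fractions and the hyperparameter grid are arbitrary):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X, y = rng.random((1000, 20)), rng.integers(0, 2, 1000)

    # 60% train, 20% validation, 20% test via two chained splits.
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

    # (b) choose the hyperparameter that does best on the validation set...
    best_C = max([0.01, 0.1, 1.0],
                 key=lambda C: LinearSVC(C=C).fit(X_train, y_train).score(X_val, y_val))
    # (c) ...then report accuracy once on the held-out test set.
    print(LinearSVC(C=best_C).fit(X_train, y_train).score(X_test, y_test))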
Imagine flattening the 3D space (Figure 5) down onto the 2D space of the floor.
Then, the image is trying to illustrate which sections of the 2D space have been selected as 'cat' by the classifier (green plane).
Visually, you can imagine that the cat heads have each pushed down a chunk of the green plane (the dog heads have held the green plane up off the ground).
This illustrates that the third dimension was important for separating out the cats: they certainly aren't separated by a clean line in those two dimensions.
However, it also illustrates that the separation might be slightly contrived: viewed in 2D, the green plane seems to have plucked out the odd cats correctly, but without any apparent logic.
One goal is to show that using a high number of dimensions will guarantee that you can separate dogs and cats, but that this is just an over-fitted solution to the data set that you have: it will not continue to work when you apply it to further data.
> Visually, you can imagine that the cat heads have each pushed down a chunk of the green plane (the dog heads have held the green plane up off the ground).
Thank you, that helps. I was mentally projecting the whole plane down and could see why it wasn't all-green (or all-not).