If someone is looking for the referenced paper (Accelerating Large-Scale Inference with Anisotropic Vector Quantization), it can be found here: https://arxiv.org/abs/1908.10396
(It's also linked in the repository's docs/algorithms.md, to which there's a link at the bottom of the main page.)
The paper is a bit odd. It feels like a lot of things are missing:
- How are the quantized vectors used to do NNS? Do we have to brute force compare against each codeword?
- In that case, how many codewords are there? How big is k compared to n? How does changing k affect performance?
- What architecture is used to train the embeddings? We are only given the loss function.
- How fast is the preprocessing step? If we need to do Lloyd's-algorithm-style iterations (sketched below) combined with training neural nets, this is potentially a lot slower than Faiss, which just picks random landmark points.
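For reference, here's a toy sketch of the Lloyd's-style iterations I mean (plain k-means on the raw vectors; this is my own illustration of the preprocessing cost, not the paper's actual anisotropic training loop):

```python
import numpy as np

def lloyd_codebook(X, k, iters=10, seed=0):
    """Toy Lloyd's iterations (plain k-means) for building a codebook of k codewords.

    X: (n, d) array of vectors; returns a (k, d) array of codewords.
    Not ScaNN's actual training procedure -- just the kind of preprocessing
    the question above is asking about.
    """
    rng = np.random.default_rng(seed)
    codewords = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # random init
    for _ in range(iters):
        # Assignment step: nearest codeword for every point.
        d2 = ((X[:, None, :] - codewords[None, :, :]) ** 2).sum(axis=-1)    # (n, k)
        assign = d2.argmin(axis=1)
        # Update step: move each codeword to the mean of its assigned points.
        for j in range(k):
            members = X[assign == j]
            if len(members):
                codewords[j] = members.mean(axis=0)
    return codewords
```

Even this toy version costs O(n * k * d) per iteration, which is why the preprocessing-time question matters.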
Right now ANN is a huge (and fascinating) area of active data structures/algorithms research.
I do sort of wonder if we will reach a point where, instead of pure recall-per-speed, we’ll have a family of algorithms with different trade-offs for different use cases, where we start judging a given ANN approach on metrics for that domain instead of just its ability to reproduce a nearest-neighbor result set.
For example, one thing I see assumed is that recall at N is what matters, but does good recall also mean the “ranking” within that top N is ideal? I don’t want to have to manually compute exact nearest neighbors on that top N if I can avoid it.
And are there specific vector spaces where one ANN is preferred? Or will there be some universal approach that just works for everything?
I realize it’s too early to tell, but these questions always percolate in my mind whenever we hit a new recall-for-speed benchmark, especially since I still see people doing more naive things that seem to work perfectly fine for their use case (like KD-trees, random projections, or LSH).
> And are there specific vector spaces where one ANN is preferred? Or will there be some universal approach that just works for everything?
This is a great question for research right now. Basically, Euclidean (and inner-product) search is quite well understood, but for distance measures such as edit distance we still have no idea what the best approach is.
You use the ANN index to get a short list of candidate vectors, on which you can then do normal dot products, so you don't have to rely on an approximate score for ranking.
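For concreteness, a minimal sketch of that two-stage pattern (the function name, array shapes, and candidate list here are my own illustration, not ScaNN's API):

```python
import numpy as np

def rerank(query, database, candidate_ids, k=10):
    """Re-score ANN candidates with exact dot products and return the top k.

    query:          (d,) float array
    database:       (n, d) float array of the original, un-quantized vectors
    candidate_ids:  ids returned by the ANN index (a superset of what we want)
    """
    candidates = database[candidate_ids]       # gather the original vectors
    exact_scores = candidates @ query          # exact inner products
    order = np.argsort(-exact_scores)[:k]      # best k by exact score
    return [candidate_ids[i] for i in order], exact_scores[order]
```

The approximate scores only decide which few candidates get looked at; the final ranking comes from the exact dot products.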
Ok, so I am just trying to understand the basic concepts in the paper and put them in my own words:
It seems that the primary idea is that quantization precision is more important where there is a high density of neighbors.
I.e. at the edges the quantized sections (buckets) could be large, since there are few items there, but in high-density areas the buckets should be much smaller, to keep the distribution of objects per bucket as even as possible.
Therefore, the overall effectiveness of a quantization loss function should not be evaluated as a plain sum of squared errors (which implicitly treats error as equally important everywhere in the vector space), but should instead take the density of the vector space into account and use it to weight the errors in different regions.
To me it seems analogous to a hash set, where the goal would be to have even distribution (same number of items in every bucket).
We want to quantize space so that every position has about the same number of items.
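To make that reading concrete, here's a toy per-point-weighted reconstruction loss (my own formulation of the density-weighting idea above; the paper's actual loss is "score-aware" and weights the component of the quantization residual parallel to each datapoint, rather than using density directly):

```python
import numpy as np

def weighted_quantization_loss(X, codewords, weights=None):
    """Reconstruction loss where each point's squared error gets a weight.

    With weights=None this is the plain sum-of-squared-errors baseline;
    passing per-point weights (e.g. an estimate of local density) makes
    errors in crowded regions count for more.
    """
    d2 = ((X[:, None, :] - codewords[None, :, :]) ** 2).sum(axis=-1)  # (n, k) squared distances
    per_point_error = d2.min(axis=1)                                  # error to nearest codeword
    if weights is None:
        weights = np.ones(len(X))
    return float((weights * per_point_error).sum())
```

That's the "smaller buckets where it's dense" intuition expressed as a loss rather than as a bucketing rule.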
Wow, this looks impressively fast at very reasonable recall rates in the ANN-benchmarks. It seems to leave faiss and nmslib in the dust. Pulling up the arXiv paper as we speak to figure out what these guys are doing to achieve such impressive results.
I'm surprised that I don't see DBSCAN, HDBSCAN, Spectral, etc. I don't even recognize these methods. Am I missing something or have the methods I'm familiar with become obsolete that fast?
DBSCAN, HDBSCAN, Spectral, and many other classical algorithms are not scalable. You cannot use them to search for approximate nearest neighbors, say, among several billion embeddings in a 1024-dimensional vector space, which is the sort of thing you can do with this library.
I don't know anything about this field. Are these distributed systems? A billion 1024 dimensional vectors sounds like a whole lot of memory (8 TB if each dimension is a double?).
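Quick back-of-the-envelope on the memory question (just the arithmetic; the 1-byte row is an assumption about quantized codes, not a claim about ScaNN's exact encoding):

```python
# Memory for 1e9 vectors x 1024 dims at different precisions.
n, d = 1_000_000_000, 1024
for name, bytes_per_dim in [("float64", 8), ("float32", 4), ("1-byte code", 1)]:
    print(f"{name}: {n * d * bytes_per_dim / 1e12:.1f} TB")
# float64: 8.2 TB, float32: 4.1 TB, 1-byte code: 1.0 TB
```

So yes, about 8 TB for doubles; float32 halves that, and compact quantized codes are a big part of how these libraries shrink the index further.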
Different category: DBSCAN does clustering, while approximate nearest neighbours is a method for taking dot products between a query vector and a large collection and getting back a list of the top K, so it's a fast method for similarity search and ranking.
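In code, the exact computation that ANN libraries approximate is just this full scan (plain numpy; the names are made up for illustration):

```python
import numpy as np

def exact_top_k(query, database, k=10):
    """Brute-force maximum-inner-product search: score everything, keep the best k."""
    scores = database @ query                 # one dot product per database vector
    top = np.argpartition(-scores, k)[:k]     # ids of the k highest scores, unordered
    return top[np.argsort(-scores[top])]      # same ids, ordered best-first
```

ANN indexes exist to return (almost) the same ids without touching every row of `database` on every query.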
Well... the authors of this paper propose using "anisotropic vector quantization" functions that take into account the density of points in the embedding space -- instead of more conventional locality-sensitive hashing functions that ignore such density. I like to think of it as "clustering" in the space of vector quantization functions, then using those functions to quantize the data for fast NN search.
The analogy with classical algorithms would be to apply something like DBSCAN or Spectral clustering to cluster points, then find a way to use the clusters to speed up NN search. In practice no one would do this; I'm mentioning it here only because I think it's a helpful analogy.
And what is that exactly? I was under the impression that the result of running nearest-neighbor algorithms is clusters. DBSCAN is a clustering algorithm. kNN is a clustering algorithm.
Is this used to find neighbors for use in a clustering algorithm?
Would this algorithm suit the case of wanting to find neighbors within a set radius of a point? Does anyone know of an approximate method for doing this?
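The naive thing I can think of is to pull a generous top-K from an ANN index and then filter by distance, roughly like this (where `candidate_ids` comes from whatever index you use):

```python
import numpy as np

def approx_radius_query(query, database, candidate_ids, radius):
    """Keep only the ANN candidates within `radius` (Euclidean) of the query.

    `candidate_ids` is assumed to come from some ANN index queried with a
    generous K; true neighbors the index missed stay missed, so this is
    only approximate -- which is exactly what I'm asking about.
    """
    candidates = database[candidate_ids]
    dists = np.linalg.norm(candidates - query, axis=1)
    keep = dists <= radius
    return np.asarray(candidate_ids)[keep], dists[keep]
```

But I'd love to know if there's a method that targets the radius directly instead of going through top-K.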
milvus integrates and wraps libraries like faiss and nmslib, which are alternatives to Scann. (If anything, I expect the milvus developers will soon be looking into integrating and wrapping Scann.)