Survey of Vector Database Management Systems (VDBMS)

  • VDBMS review: create wiki notes @parallel(false) @autodone(false)
    • Distance measures @parallel(false) @autodone(false)

      • Hamming distance @parallel(false) @autodone(false) I usually think of Hamming distance as a form of edit distance that only tracks substitutions, but it has a more general formulation: the number of positions (dimensions) at which two equal-length multidimensional entities differ.
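
        A minimal sketch in NumPy (array contents are arbitrary examples of mine):

```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Number of positions (dimensions) at which two equal-length vectors differ."""
    assert a.shape == b.shape
    return int(np.count_nonzero(a != b))

# Works for bit vectors and, more generally, any discrete-valued vectors.
print(hamming_distance(np.array([1, 0, 1, 1]), np.array([1, 1, 1, 0])))        # 2
print(hamming_distance(np.array(list("karolin")), np.array(list("kathrin"))))  # 3
```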

      • Inner product distance @parallel(false) @autodone(false)

      • Minkowski distance @parallel(false) @autodone(false)

      • Mahalanobis distance @parallel(false) @autodone(false)

    • Query concepts @parallel(false) @autodone(false)

      • Approximate nearest neighbors (formal definition) @parallel(false) @autodone(false) Here c refers to the approximation degree and k to the number of neighbors. This formulation may have originated in the review.
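
        One common formal statement (my notation; assumes a distance function d):

```latex
% (c,k)-search: given a query $q$, a dataset $X$, and $c \ge 1$,
% return a set $R \subseteq X$ with $|R| = k$ such that every returned
% vector is within a factor $c$ of the distance from $q$ to its true
% $k$-th nearest neighbor $x_k^{*}$:
\[
  \forall\, x \in R:\quad d(q, x) \;\le\; c \cdot d(q, x_k^{*}).
\]
```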

      • (c,k)-Search @parallel(false) @autodone(false)

      • Range search @parallel(false) @autodone(false)

      • K-nearest neighbors @parallel(false) @autodone(false)

      • Approximate nearest neighbor search @parallel(false) @autodone(false)

      • Nearest neighbor search @parallel(false) @autodone(false)

      • Predicated search query @parallel(false) @autodone(false)

      • Dense vs sparse retrieval @parallel(false) @autodone(false) Vector retrieval is “dense” because you need to retrieve the entire candidate in order to make a comparison, and that comparison is comprehensive, involving every dimension. By contrast, in sparse retrieval, only a small amount of data needs to be fetched from an index, and the comparison is usually O(1).

        All data systems, both dense and sparse, make tradeoffs to balance read and write latency, fault tolerance, scalability, etc. Approximate search systems must also trade off precision and recall.

      • Precision and recall @parallel(false) @autodone(false)

    • Partitioning @parallel(false) @autodone(false)

    • Quantization @parallel(false) @autodone(false)

      • Vector quantization @parallel(false) @autodone(false) a mechanism for vector compression

      • Product quantization (PQ) @parallel(false) @autodone(false) A strategy for vector quantization

        A high-dimensional vector is split into lower-dimensional subvectors

        Each of these subvectors is then clustered using k-means

        For a given subvector space, the centroid of each cluster is given an index. Index ordering is not meaningful. The mapping of indices to centroids is called a “codebook.”

        A given vector is thus represented as a coordinate in the index space, which is lower in both dimension and cardinality than the original vector space.
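
        A minimal sketch of PQ with scikit-learn's KMeans (subvector count, codebook size, and names are my own choices, not from the review):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32)).astype(np.float32)  # training vectors

M = 4                    # number of subvectors
K = 16                   # centroids per subvector space
d_sub = X.shape[1] // M

# One codebook (set of centroids) per subvector space.
codebooks = [
    KMeans(n_clusters=K, n_init=4, random_state=0).fit(X[:, m * d_sub:(m + 1) * d_sub])
    for m in range(M)
]

def pq_encode(v: np.ndarray) -> np.ndarray:
    """Represent a vector as M centroid indices, one per subvector."""
    return np.array(
        [codebooks[m].predict(v[m * d_sub:(m + 1) * d_sub][None, :])[0] for m in range(M)],
        dtype=np.uint8,
    )

def pq_decode(codes: np.ndarray) -> np.ndarray:
    """Lossy reconstruction: concatenate the selected centroids."""
    return np.concatenate([codebooks[m].cluster_centers_[codes[m]] for m in range(M)])

codes = pq_encode(X[0])    # 4 small indices instead of 32 floats
approx = pq_decode(codes)
```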

      • Scalar quantization (SQ) @parallel(false) @autodone(false) In general, SQ refers to any strategy for finding a discrete representation of a continuous scalar variable

        In the context of vector quantization, SQ implies quantizing each dimension of the vector separately

        Any scalar value in that dimension is quantized to one of a discrete, ordered set of values (identified by indices) called “quantization levels”

        k-means can be used to define these indices, but it doesn’t have to be
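
        A minimal sketch of the simplest variant, uniform (non-k-means) quantization of each dimension to 8-bit levels (names and the uniform-bin choice are mine):

```python
import numpy as np

def sq_fit(X: np.ndarray):
    """Per-dimension min and max define the quantization range."""
    return X.min(axis=0), X.max(axis=0)

def sq_encode(v: np.ndarray, lo, hi, n_levels: int = 256) -> np.ndarray:
    """Map each dimension independently onto one of n_levels ordered indices."""
    scale = (hi - lo) / (n_levels - 1)
    return np.clip(np.round((v - lo) / scale), 0, n_levels - 1).astype(np.uint8)

def sq_decode(codes: np.ndarray, lo, hi, n_levels: int = 256) -> np.ndarray:
    """Map indices back to representative values (the quantization levels)."""
    return lo + codes * (hi - lo) / (n_levels - 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64)).astype(np.float32)
lo, hi = sq_fit(X)
codes = sq_encode(X[0], lo, hi)   # 64 bytes instead of 64 float32s
approx = sq_decode(codes, lo, hi)
```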

      • Anisotropic vector quantization (AVQ) @parallel(false) @autodone(false) Considered to be a form of PQ. Work from this discussion: https://chatgpt.com/share/92681f85-a486-4e95-be16-0bcd50b1e81b

    • Etsy @parallel(false) @autodone(false)

      • https://www.etsy.com/codeascraft/from-image-classification-to-multitask-modeling-building-etsys-search-by-image-feature @parallel(false) @autodone(false)

      • FAISS @parallel(false) @autodone(false) FAISS is a modular library for vector search. Basically, you choose a quantization scheme, and then you can choose an indexing strategy (IVF, HNSW, etc.). It offers “flat” indexes that simply store the (possibly quantized) vectors and scan them exhaustively, i.e., brute-force k-NN.
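
        A minimal usage sketch (assuming faiss-cpu and NumPy are installed; sizes and parameters are illustrative): a flat index for brute-force k-NN next to an IVF index over the same data.

```python
import numpy as np
import faiss

d = 64
rng = np.random.default_rng(0)
xb = rng.normal(size=(10_000, d)).astype(np.float32)   # database vectors
xq = rng.normal(size=(5, d)).astype(np.float32)        # query vectors

# Flat index: store vectors as-is and scan them all (brute-force k-NN).
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D_flat, I_flat = flat.search(xq, 10)

# IVF index: k-means partitions the data into nlist cells; only nprobe
# cells are scanned per query, trading recall for speed.
nlist = 100
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(xb)        # learn the coarse centroids
ivf.add(xb)
ivf.nprobe = 8
D_ivf, I_ivf = ivf.search(xq, 10)
```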

      • Maximum inner-product search @parallel(false) @autodone(false)

      • ScaNN @parallel(false) @autodone(false) Strategy for quantization that successively applies AVQ and then PQ. The resulting quantized vector can then be hashed to achieve partitioning and hence horizontal scaling.

      • Vertex AI Matching Engine @parallel(false) @autodone(false) A commercial Google Cloud offering based on ScaNN

        There is no information about how it works beyond the fact that it uses ScaNN for storage and retrieval

        In particular, we have no specific information about the partitioning strategy used for horizontal scaling

        However, we can speculate that the hierarchical nature of ScaNN might give rise to a tree-based indexing strategy

      • https://www.etsy.com/codeascraft/efficient-visual-representation-learning-and-evaluation @parallel(false) @autodone(false)

      • Inverted file index (IVF) @parallel(false) @autodone(false) https://en.wikipedia.org/wiki/Inverted_index

        • IVFSQ @parallel(false) @autodone(false) Perform scalar quantization and bucket to the nearest centroid

    • Indexing strategies @parallel(false) @autodone(false)

      • Why do we need to index? @parallel(false) @autodone(false) Brute-force vector comparison is O(dN), where d is the dimension of the vectors and N is the number of vectors in the comparison set.

        With sparse retrieval, indexing can involve sorting. However, there is no natural sorting principle for vectors. Hence, indexing always involves some kind of partitioning, within which brute force is ultimately required.
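
        For reference, the O(dN) brute-force baseline that indexing tries to avoid (a sketch, names mine):

```python
import numpy as np

def brute_force_knn(X: np.ndarray, q: np.ndarray, k: int) -> np.ndarray:
    """Exact k-NN by scanning all N vectors: O(d) work per vector, O(dN) total."""
    dists = np.linalg.norm(X - q, axis=1)   # N distances, each over d dimensions
    return np.argsort(dists)[:k]            # indices of the k closest vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 128)).astype(np.float32)
q = rng.normal(size=128).astype(np.float32)
print(brute_force_knn(X, q, k=5))
```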

      • Data-dependent vs data-independent partitioning schemes @parallel(false) @autodone(false) There’s no reason why you couldn’t just try to uniformly break down the data in your vector database as a form of load balancing. But usually you can get much better performance by taking the underlying distribution into account. The problem is that, if the distribution changes, then your indexing scheme could hurt accuracy, performance, or both.

      • Table-based indexing @parallel(false) @autodone(false) With table-based indexing, a query is mapped onto a hash key and then the corresponding bucket is scanned. This implies that candidates outside of the selected partition will be excluded.

        • Locality-sensitive hashing (LSH) @parallel(false) @autodone(false) Probabilistic bucketing strategy designed to maximize the probability that similar inputs end up in the same bucket.

          The selection of the hashing scheme involves random numbers, but the actual hash value of a given input is deterministic given an existing scheme.

          Typically one creates multiple hash functions and hashes each input according to each function.

          Can tune precision, recall, and complexity by adjusting replicas and set logic (AND, OR, quantifiers)

          Most implementations are data-independent, but it’s a flexible scheme
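
          A minimal sketch of one data-independent family (random-hyperplane / SimHash for cosine similarity; the table and bit counts are arbitrary choices of mine):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d, n_tables, n_bits = 64, 4, 16

# The scheme is chosen randomly, but hashing a given vector with a
# fixed scheme is deterministic.
planes = [rng.normal(size=(n_bits, d)) for _ in range(n_tables)]

def lsh_key(v: np.ndarray, P: np.ndarray) -> int:
    """Sign pattern of v against each hyperplane, packed into an integer key."""
    bits = (P @ v > 0).astype(int)
    return int("".join(map(str, bits)), 2)

# Index: one bucket table per hash function (the "replicas").
X = rng.normal(size=(10_000, d))
tables = [defaultdict(list) for _ in range(n_tables)]
for i, v in enumerate(X):
    for t, P in enumerate(planes):
        tables[t][lsh_key(v, P)].append(i)

# Query: OR over tables (union of candidate buckets), then exact re-ranking.
q = rng.normal(size=d)
candidates = {i for t, P in enumerate(planes) for i in tables[t][lsh_key(q, P)]}
best = sorted(candidates, key=lambda i: np.linalg.norm(X[i] - q))[:10]
```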

        • Learning-to-hash (L2H) @parallel(false) @autodone(false) As the name suggests, learns a hashing algorithm, typically using k-means clustering.

          Training is expensive; out-of-distribution updates degrade the index

          Relatively uncommon in vector database systems

          • Spectral hashing @parallel(false) @autodone(false) Designs a hash function based on the principal components of the similarity matrix

          • SPANN @parallel(false) @autodone(false) Hash vectors to the nearest centroid following k-means
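
            A minimal sketch of this centroid-hashing idea with scikit-learn (cluster count, probe count, and names are mine): after k-means, a vector's "hash" is its nearest centroid, and each centroid owns a posting list.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 32)).astype(np.float32)
n_clusters = 64

# "Training" the hash = fitting k-means (expensive; drifting data degrades it).
km = KMeans(n_clusters=n_clusters, n_init=4, random_state=0).fit(X)
posting_lists = {c: np.where(km.labels_ == c)[0] for c in range(n_clusters)}

# Query: hash to the nearest centroid(s), then scan only those posting lists.
q = rng.normal(size=32).astype(np.float32)
nearest = np.argsort(np.linalg.norm(km.cluster_centers_ - q, axis=1))[:3]
candidates = np.concatenate([posting_lists[c] for c in nearest])
best = candidates[np.argsort(np.linalg.norm(X[candidates] - q, axis=1))[:10]]
```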

      • Tree-based indexing @parallel(false) @autodone(false) Tree-based vector search indices are search trees constructed by iteratively splitting the set of vectors. The algorithms differ in how they select the partitioning plane.

        • k-d tree @parallel(false) @autodone(false) A k-d tree is a binary search tree representing high-dimensional data. For a node at tree depth ℓ in a d-dimensional space, partition the current node by the median of the (ℓ mod d)-th dimension. (That is, the median across all values represented by the current node.)

          The k-d tree is inefficient when the variance along some dimensions is greater than the variance along others, especially if the directions of greatest variance are not aligned with the coordinate axes.
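
          A minimal usage sketch with SciPy's off-the-shelf k-d tree (sizes are arbitrary):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))   # k-d trees work best at modest dimensionality

tree = cKDTree(X)                  # recursive median splits, cycling through dimensions
q = rng.normal(size=8)
dists, idx = tree.query(q, k=5)    # exact 5 nearest neighbors
```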

        • Principal component tree (PKD, FLANN) @parallel(false) @autodone(false) Principal component trees are a variant on the k-d tree in which you rotate the input vectors so that they are expressed in terms of a basis consistent with the principal components of the covariance matrix.

          If you first sort the principal components in descending order of eigenvalue magnitude and then build a regular k-d tree, you get what’s called a PKD tree. This is the optimal k-d tree for the data.

          As a heuristic, you can choose random principal components. This heuristic is implemented in a library called FLANN.
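
          A minimal sketch of the PKD idea (names mine): rotate into the principal-component basis, ordered by descending eigenvalue, then build an ordinary k-d tree on the rotated data.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8)) @ rng.normal(size=(8, 8))   # correlated dimensions

# Principal components of the covariance matrix, sorted by descending eigenvalue.
mu = X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X - mu, rowvar=False))
R = eigvecs[:, np.argsort(eigvals)[::-1]]

tree = cKDTree((X - mu) @ R)       # ordinary k-d tree in the rotated basis

q = rng.normal(size=8)
dists, idx = tree.query((q - mu) @ R, k=5)   # queries must be rotated the same way
```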

        • Random projection tree (RPTree, ANNOY) @parallel(false) @autodone(false) A class of algorithms that split the data, with overlaps, based on random projections. Justified by the Johnson-Lindenstrauss lemma, which says that random projections approximately preserve distances. Because of the overlap, they can be space-inefficient.
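
          A minimal usage sketch with the Annoy library, which builds a forest of random projection trees (assuming the annoy package is installed; metric and tree count are illustrative):

```python
import numpy as np
from annoy import AnnoyIndex

d = 64
rng = np.random.default_rng(0)

index = AnnoyIndex(d, "euclidean")
for i in range(10_000):
    index.add_item(i, rng.normal(size=d).tolist())

index.build(10)                              # 10 trees: more trees, better recall, more space

q = rng.normal(size=d).tolist()
neighbors = index.get_nns_by_vector(q, 10)   # approximate 10 nearest neighbors
```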

        • Defeatist search @parallel(false) @autodone(false) Return all of the vectors in the leaf node that would contain the query, regardless of how many or how close. “Defeatist” because it doesn’t even try.

      • Graph-based indexing @parallel(false) @autodone(false)

        • k-NN graph (KNNG) @parallel(false) @autodone(false) Associates each vector with its k-nearest neighbors. Allows exact k-NN lookup in O(1) time as long as the query is in the reference set.

          There are iterative algorithms for building an approximate KNNG. These are considered to be a form of unsupervised learning.

          KNNG can have disjoint subgraphs, especially for small k. To ensure connectivity, one can use a related class of graphs called Monotonic Search Networks (MSNs).
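
          A minimal sketch of building an exact KNNG with scikit-learn (brute force; iterative approximate constructions such as NN-Descent are not shown):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32))
k = 10

nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own nearest neighbor
_, idx = nn.kneighbors(X)
knn_graph = {i: list(idx[i, 1:]) for i in range(len(X))}

# For a query already in the reference set, exact k-NN is a dictionary lookup: O(1).
print(knn_graph[0])
```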

        • Navigable small-world (NSW) graph @parallel(false) @autodone(false) A navigable small world graph satisfies the properties of both a small world graph and a navigable graph. With such a graph, (c, k)-search complexity is logarithmic in the worst case.

        • Hierarchical NSW (HNSW) graph @parallel(false) @autodone(false) Adds additional layers of connections on top of an NSW graph, giving it a number of desirable properties. I haven’t looked into it.
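
          A minimal usage sketch with the hnswlib library (assuming it is installed; M, ef_construction, and ef are illustrative knobs):

```python
import numpy as np
import hnswlib

d = 64
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, d)).astype(np.float32)

index = hnswlib.Index(space="l2", dim=d)
index.init_index(max_elements=10_000, ef_construction=200, M=16)
index.add_items(X, np.arange(len(X)))

index.set_ef(50)                           # query-time breadth vs. recall trade-off
labels, dists = index.knn_query(X[:5], k=10)
```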

    • Predicated querying @parallel(false) @autodone(false)

      • Pre-filtering @parallel(false) @autodone(false)

        • Block-first scan @parallel(false) @autodone(false) Block out certain vectors before the scan. The scan proceeds over non-blocked vectors. This can be done at query time (“online blocking”) unless using a graph index, in which case it is necessary to first construct a pre-blocked graph (“offline blocking”).

        • Visit-first scan @parallel(false) @autodone(false) As each vector is retrieved, check to see if it meets the predicate condition. If not, exclude it and continue searching. If the predicate rarely excludes vectors, this can be faster than block-first scan because it avoids the initial blocking scan. For highly restrictive predicates, it can force extensive backtracking.

      • Post-filtering @parallel(false) @autodone(false) The bolt-on strategy for predicated queries: after retrieving neighbors, exclude those that don’t meet the predicate. Can result in a set that contains fewer than k items.
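
        A minimal sketch of the bolt-on pattern (the ann_search callable and over-fetch factor are placeholders of mine, not part of any particular system):

```python
def post_filtered_search(query, k, predicate, ann_search, overfetch=4):
    """Over-fetch candidates from any ANN index, then drop those failing the predicate.

    ann_search(query, n) is assumed to return n (id, distance) pairs in
    ascending distance order. The result may still contain fewer than k
    items when the predicate is very restrictive.
    """
    candidates = ann_search(query, k * overfetch)
    return [(i, dist) for (i, dist) in candidates if predicate(i)][:k]
```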

    • Distributed search @parallel(false) @autodone(false)

      • Scatter-gather @parallel(false) @autodone(false) Query is “scattered” to all relevant shards, and the result is obtained by “gathering” (aggregating) the results from each shard.

        For nearest neighbor search, this involves first running k-NN or ANN on each relevant shard, then running k-NN on the union of the returned results.
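
        A minimal sketch of that two-stage merge (the shard interface and thread-pool fan-out are my assumptions):

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def scatter_gather_knn(query, k, shards):
    """Scatter: each shard computes its local top-k as (distance, id) pairs.
    Gather: merge the partial results and keep the global top-k."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda shard: shard.knn(query, k), shards)
    merged = [hit for partial in partials for hit in partial]
    return heapq.nsmallest(k, merged)   # smallest distances first
```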

      • LSM tree @parallel(false) @autodone(false) Came up extensively in Kleppmann, and also appeared in Pan et al. 2024. The basic idea is to stream updates to a log and merge them into the main database at a convenient time, using a special data structure (an in-memory buffer that is periodically flushed into sorted on-disk segments).

      • Out-of-place update @parallel(false) @autodone(false)

    • Two-towers (aka dual encoder) neural network @parallel(false) @autodone(false)