New extension to DuckDB: vector similarity search

Estimated read time 10 min read

Preface

In the current fields of data analysis, machine learning and AI, vector search has become an important part of database functionality. Postgres has pgvector, Oracle 23ai has also added a vector search function, and now DuckDB also has a plug-in.

DuckDB v0.10.0 brings many major updates  , including the fixed-length Array data type [1] that DuckDB also wants to join the vector database battle . During this period, there were also several articles on attempts at vector search, such as  using DuckDB + Ibis for RAG  and  building an artificial intelligence-driven search function in DUCKDB  .

The previous exploration was just a prelude. Now, DuckDB’s vss extension brings real vector search capabilities. vss Introduced support for HNSW (Hierarchical Navigable Small World) index to speed up vector similarity searches.

The original motivation for adding this data type (Array) was to provide optimized operations for lists that could take advantage of the positional semantics of their children and avoid branches since all lists are of the same length. For example, array operations in NumPy: stacking, shifting, multiplication, etc. Additionally, we wanted to improve interoperability with Apache Arrow, because previously Arrow’s fixed-size list types were converted to regular variable-size lists when imported into DuckDB, thus losing some type information.

With the growing interest in  vector embeddings  and  semantic similarity searchesARRAY , DuckDB also  introduced several distance metric functions  for this new  type: array_distance[2] ,  array_inner_product[3]  and  array_cosine_similarity[4] .

For those readers who have not yet dabbled in word embeddings or vector searches, this is an opportunity to learn about this powerful technique, which is, in a nutshell, a method for representing documents, images, entities – data as high-dimensional vectors and then A vector space is searched for similar vectors, using some kind of mathematical “distance” expression to measure similarity. This technique is widely used in everything from natural language processing to recommendation systems and image recognition, and its popularity has surged recently due to the emergence of generative AI and the availability of pre-trained models.

This is really exciting for the community! While we (and in this article we refer to DuckDB Labs) initially stated publicly that we would not add a vector similarity search index to DuckDB because we felt it was beyond our scope, we are very interested in supporting custom indexes through extensions. Personally, I  have been wanting to insert an “R-tree” index since DuckDB’s  spatial extension [5] ! So when one of our client projects evolved into creating a proof-of-concept custom “HNSW” index extension, we decided to give it a try. So… things went on like this.

vss Expand

Now, we are happy to announce  vss the birth of a vector similarity search extension for DuckDB! While some might say we’re late to the vector search party, we think the party has just begun!

Vector Similarity Search (VSS) extension

On the surface, vss it looks like a relatively small DuckDB extension. It does not provide any new data types, scalar functions or copy functions, but a new index type: HNSW(hierarchical navigable small world), which is a graph-based index structure particularly suitable for high-dimensional vector similarity Sex search.

CREATE TABLE embeddings (vec FLOAT[3]);
CREATE INDEX idx ON embeddings USING HNSW (vec);

 This index type cannot be used to enforce constraints or uniqueness like built-in  ART indexes [6] , nor can it be used to speed up joins or index regular columns. In contrast, HNSW indexes only apply to   columns of the type  containing FLOAT the elements  , and are only used to speed up queries. Compute the “distance” between a constant  and the indexed column   , sort the results by distance and return the top n results. That is, a query like this:ARRAYFLOAT ARRAYFLOAT ARRAY

SELECT *
FROM embeddings
ORDER BY array_distance(vec, [1, 2, 3]::FLOAT[3])
LIMIT 3;

Turn the optimized logical plan into a projection of a new  HNSW index scan operator that completely removes constraints and ordering. We can  EXPLAIN verify this by inspecting the output:

EXPLAIN
SELECT *
FROM embeddings
ORDER BY array_distance(vec, [1, 2, 3]::FLOAT[3])
LIMIT 3;
┌───────────────────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─     │
│             #0            │
└─────────────┬─────────────┘
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─     │
│            vec            │
│array_distance(vec, [1.0, 2│
│         .0, 3.0])         │
└─────────────┬─────────────┘
│      HNSW_INDEX_SCAN      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─     │
│   t1 (HNSW INDEX SCAN :   │
│            idx)           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─     │
│            vec            │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─     │
│           EC: 3           │
└───────────────────────────┘

You can  HNSW pass a parameter to the index creation statement  metric to determine which distance measure to use. The supported metrics are  l2sqcosine and  inner_product, corresponding to the three built-in distance functions: array_distancearray_cosine_similarity and  array_inner_product.

  • • By default  l2sq, it uses Euclidean distance ( array_distance):
CREATE INDEX l2sq_idx ON embeddings USING HNSW (vec)
WITH (metric = 'l2sq');
  • • To use cosine distance ( array_cosine_similarity):
CREATE INDEX cos_idx ON embeddings USING HNSW (vec)
WITH (metric = 'cosine');
  • • To use inner product ( array_inner_product):
CREATE INDEX ip_idx ON embeddings USING HNSW (vec)
WITH (metric = 'ip');

accomplish

vss The extension is based on  the usearch[7]  library, which provides a flexible C++ implementation of the HNSW index data structure with very impressive performance benchmarks. While we’re currently only using  usearch a subset of all the features and tweaking options offered, we’re excited to explore how to take advantage of even more of its features in the future. So far, we’re happy that it fits well with DuckDB’s development philosophy. Like DuckDB itself, usearch it’s written in portable C++11 with no external dependencies and released under a permissive license, which makes it very easy to integrate into our extension build and distribution pipelines.

limitation

A major limitation of the current  HNSW index is that it can only be created in an in-memory database unless the configuration parameter  SET hnsw_enable_experimental_persistence = ⟨bool⟩ is set to  true. If this parameter is not set, any attempt to create HNSW an index in the disk database  will generate an error message. However, if this parameter is set, the index will not only be created in memory, but will also be persisted to disk as part of the DuckDB database file during the checkpoint process. After restarting or loading a database file with a persistent  HNSW index, the index will be lazily loaded back into memory the first time the related table is accessed, which is much faster than recreating the index from scratch.

The reason for locking this feature behind the experimental flag is that we still have some known custom index persistence issues that need to be resolved before it will be enabled by default. In particular, WAL recovery of custom indexes has not been implemented correctly, which means that if a crash occurs or the database  HNSW is shut down unexpectedly while the index table has uncommitted changes, you may experience data loss or index corruption. While it is technically possible  vss to manually recover from an unexpected shutdown by first starting DuckDB separately, loading the extensions, and then appending the database files, which ensures that indexing functionality is available during WAL replay  HNSW , you should not rely on this approach for production workloads.

We are actively working on resolving this and other index persistence-related issues and hope to resolve them in  DuckDB v0.10.3 [8]  , but for now we recommend using indexes only in in-memory databases  HNSW .

However, at runtime, just like  ART that, HNSW the index must fit completely in RAM, and  HNSW the memory allocated at runtime is allocated “externally” to the DuckDB memory management system, meaning that it does not respect DuckDB’s  memory_limit configuration parameters.

Another current limitation of indexing so far HNSW is that it only supports the type of array elements  FLOAT (32-bit, single-precision floating point), and only supports distance measures corresponding to the three built-in distance functions, namely  array_distance, , array_inner_product and  array_cosine_similarity. But this is something we plan to expand on in the near future, as this is less of a technical limitation and more of a “we haven’t had time to do it yet” limitation.

In conclusion

The DuckDB  vss extension is a new extension that adds support for creating HNSW indexes on fixed-size list columns in DuckDB, accelerating vector similarity search queries. The extension can currently be installed by running on DuckDB v0.10.2 on all supported platforms (including WASM!)  INSTALL vss; LOAD vss . vss The extension breaks new ground for DuckDB extensions, providing custom index types, and we’re excited to improve and extend this functionality.

While we’re still working on resolving some of the limitations mentioned above, especially those related to persistence (and performance), we still really wanted to share this early version of the  vss extension because we believe it will open up a lot of cool opportunities for the community. So be sure to check out  vss the extension documentation [9]  for more information on how to use this extension!

Thanks to DuckDB Labs customers for sponsoring this innovative work. Now, we invite every developer and data scientist interested in vector search to learn more about the vss extension through our documentation and GitHub repository, and join our community to jointly promote the advancement of data analysis technology. 

You May Also Like

More From Author

+ There are no comments

Add yours