On Open Source Vector Databases. Q&A with Ben Bromhead

Q1. Instaclustr, acquired by NetApp a couple years ago, is now Instaclustr by NetApp—what role does Instaclustr by NetApp serve?

Instaclustr by NetApp helps organizations deliver applications at scale by operating and supporting their data infrastructure through our platform for open source technologies.

We have always believed in the power of open source technology at the data layer, and we offer organizations a one-stop destination for deploying, managing, and monitoring all components of their data infrastructure.

Now under NetApp, we are able to take advantage of some of the best storage capabilities on the market, whether in the cloud, hybrid, or on-prem. Over the coming months, you will see us increasingly integrating our open source capabilities with NetApp’s block, file, and object offerings for better price performance and features.

Q2. You are the CTO at Instaclustr by NetApp. What are your current projects?

I’m focused on two areas right now. The first is helping our customers adopt and understand the new capabilities within Instaclustr enabled by NetApp. The second, which I think most of the industry is focusing on, is enabling generative AI workloads. 

Our generative AI focus is primarily on open source vector databases. Instaclustr has offered vector search capabilities since the start of 2023 through OpenSearch (via its k-NN plugin), Postgres (via pgvector), and, more recently, vector indexes in Apache Cassandra 5.0.

Q3. What are the open source vector database options for AI workloads? What factors go into deciding which one teams should go with for their AI workloads?

Organizations have been understandably eager to use vector databases to increase generative AI application reliability, improve recency, and reduce hallucinations. What is less understood, though it is welcome news for technology leaders at these enterprises, is that implementing a vector database may not require adopting new, unfamiliar, or expensive data-layer technologies. While specialized vector database solutions exist, enterprises can eliminate much of the adoption learning curve if they already use some of the most popular, fully open source databases out there: PostgreSQL (using the pgvector extension), Apache Cassandra (version 5.0), or OpenSearch.

The pgvector extension turns the extremely popular Postgres database into a high-performance vector database. Organizations that have Postgres in place and want to implement an intelligent data infrastructure quickly will really like this option. As we speak, countless organizations are already using tried-and-true Postgres to support LLMs and production AI applications. Postgres is also your first port of call for exploring a new idea, a new offering, or just experimenting.
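To make that concrete, here is a minimal sketch of a pgvector similarity query from Python. It assumes the psycopg2 driver and a Postgres instance with the pgvector extension installed; the table name, column names, and tiny three-dimensional vectors are hypothetical stand-ins (real embedding models produce hundreds or thousands of dimensions).

```python
import psycopg2

# Hypothetical connection details; adjust for your environment.
conn = psycopg2.connect("dbname=appdb user=app password=secret host=localhost")
cur = conn.cursor()

# Enable pgvector and create a table with a vector column.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id bigserial PRIMARY KEY,
        content text,
        embedding vector(3)
    )
""")
cur.execute(
    "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)",
    ("password policy excerpt", "[0.1, 0.9, 0.2]"),
)

# Nearest-neighbor search: <-> is pgvector's L2 distance operator
# (<=> is cosine distance, <#> is negative inner product).
cur.execute(
    "SELECT content FROM documents ORDER BY embedding <-> %s::vector LIMIT 5",
    ("[0.1, 0.8, 0.3]",),
)
for (content,) in cur.fetchall():
    print(content)

conn.commit()
```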

Cassandra 5.0 reintroduces the database as an ideal option for developing AI applications and storing the intelligent data that AI workloads require. This latest version, currently in open beta with GA expected soon, adds capabilities and functionality specifically designed for AI use cases and for accelerating AI development. Among these features are vector search, native vector indexing, a new vector data type for storing embedding vectors, and new CQL functions that make harnessing that data pretty darn simple. Cassandra now combines robust AI workload support with its hallmark high availability and scalability, making it an especially strong option for organizations already experienced with the open source database.
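As a sketch of that new CQL surface area, the following assumes a local Cassandra 5.0 node and the Python cassandra-driver; the keyspace, table, index name, and toy three-dimensional vectors are hypothetical.

```python
from cassandra.cluster import Cluster

# Connect to a hypothetical local Cassandra 5.0 node.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# The new vector data type stores fixed-dimension float embeddings.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.documents (
        id int PRIMARY KEY,
        content text,
        embedding vector<float, 3>
    )
""")

# A storage-attached index (SAI) enables approximate-nearest-neighbor search.
session.execute("""
    CREATE CUSTOM INDEX IF NOT EXISTS ann_idx ON demo.documents (embedding)
    USING 'StorageAttachedIndex'
""")

session.execute(
    "INSERT INTO demo.documents (id, content, embedding) "
    "VALUES (1, 'password policy excerpt', [0.1, 0.9, 0.2])"
)

# ANN OF orders rows by vector similarity to the query embedding.
rows = session.execute(
    "SELECT content FROM demo.documents "
    "ORDER BY embedding ANN OF [0.1, 0.8, 0.3] LIMIT 5"
)
for row in rows:
    print(row.content)
```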

OpenSearch is another strong open source choice, offering search, analytics, and a vector database in a single solution that countless enterprises are already familiar with. OpenSearch is a known quantity among enterprises, ready to speed development and support stable AI applications at scales reaching into the tens of billions of vectors, with low latency and high availability.
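Here, similarly, is a minimal k-NN sketch using the opensearch-py client; the host, index name, and toy vectors are hypothetical.

```python
from opensearchpy import OpenSearch

# Hypothetical local cluster; adjust hosts and auth for your deployment.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# index.knn enables the k-NN plugin; knn_vector declares the embedding field.
client.indices.create(
    index="documents",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "content": {"type": "text"},
                "embedding": {"type": "knn_vector", "dimension": 3},
            }
        },
    },
)

client.index(
    index="documents",
    body={"content": "password policy excerpt", "embedding": [0.1, 0.9, 0.2]},
    refresh=True,
)

# Approximate nearest-neighbor query against the embedding field.
results = client.search(
    index="documents",
    body={"size": 5, "query": {"knn": {"embedding": {"vector": [0.1, 0.8, 0.3], "k": 5}}}},
)
for hit in results["hits"]["hits"]:
    print(hit["_source"]["content"])
```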

Each of these technologies offers a vector database solution you can trust, backed by a vast community and a mature, well-governed open source project.

Q4. While there are benefits of open source vector databases for improving accuracy and reducing hallucinations in generative AI models, are there potential downsides or limitations that teams should be aware of?

Using a technique called retrieval augmented generation (RAG), you can look up data that is relevant to a user’s or an agent’s prompt and provide it as context to the LLM. This enables you to supply the LLM with data it was never trained on, data that might be too expensive to fine-tune on, or data that changes rapidly.

Even when supported by a familiar database solution, teams should carve out time and resources for the implementation process and for continual optimization. There will be a learning curve in understanding how to leverage a vector database, both performance-wise and for cost efficiency.

The effectiveness of the solution will also greatly depend on the quality of the data used, how the data is chunked and embedded, as well as other metadata and hybrid search terms used alongside a vector search solution. It’s super easy to put vector search and LLMs together for a great and effective demo, but making something production-ready is where the real challenge lies. 

Having team members with vector database and data science expertise, or tapping into managed services for knowledgeable support, will alleviate those growing pains as enterprises build out their intelligent data infrastructure.

Q5. What’s a concrete example of how teams might leverage retrieval augmented generation (RAG) to boost language model performance for a specific AI use case?

RAG helps by providing additional context for a prompt, giving an LLM a greater set of information and data and thus increasing its accuracy. Vector search takes advantage of embedding models, which turn the semantic meaning of a piece of text into a list of numbers (a vector). Concepts, topics, and words that are strongly related to each other will have vectors that are geometrically close to each other.

For example, when talking about furnishings for a lounge room, the words ‘sofa’ and ‘couch’ will be very close to each other. However, in a sentence where the word ‘couch’ is used as a verb (e.g., to couch feedback to an employee as a set of constructive areas to improve on), the vectors for ‘couch’ and ‘sofa’ will be further apart. If you had just done a basic keyword search, both contexts would be ranked equally, because the keyword appears in both.
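To make the geometry concrete, here is a toy sketch; the vectors are made up purely to illustrate that related meanings sit closer together, and a real embedding model would produce much higher-dimensional output.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: near 1.0 means same direction, near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings: "sofa" and "couch" (the furniture) point the same way,
# while "couch" (to phrase something) points elsewhere.
sofa = [0.9, 0.1, 0.0]
couch_furniture = [0.85, 0.15, 0.05]
couch_phrasing = [0.1, 0.2, 0.9]

print(cosine_similarity(sofa, couch_furniture))  # high, roughly 0.99
print(cosine_similarity(sofa, couch_phrasing))   # low, roughly 0.13
```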

For a basic AI use case, take an enterprise employee who wants to query the company’s employment documentation (for example, to look up the company’s IT security policy on passwords). The IT policy documentation can be broken up into smaller chunks, with an embedding (an encoding of its semantic meaning) generated for each chunk. These embeddings are represented as vectors and can be stored in your favorite vector database.

When the user asks the AI application a question, the RAG process looks at that query and searches the vector database for related document chunks, based on the embeddings calculated for the user’s query. The search is done using a nearest-neighbor algorithm and returns a set of documents that may be relevant to that query. The process then hands the returned plain-text documents to the LLM, alongside the original question, so it can reply to the user with accurate text. In this way, RAG can reduce hallucination issues, because it simply finds and retrieves the most relevant existing information.
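Pulling those steps together, here is a self-contained sketch of that flow. The embed() function is a crude stand-in for a real embedding model, and the in-memory list stands in for a real vector database; in practice you would swap in your embedding model, vector store, and LLM of choice.

```python
import math

def embed(text):
    """Stand-in for a real embedding model: a normalized bag-of-letters vector.
    In production, call an actual embedding model here instead."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    # Vectors are already normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# 1. Chunk the policy documentation and store (chunk, embedding) pairs.
chunks = [
    "Passwords must be at least 14 characters long.",
    "VPN access requires multi-factor authentication.",
    "Expense reports are due by the fifth of each month.",
]
store = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Embed the user's question and retrieve the nearest chunks.
question = "What is the minimum password length?"
q_vec = embed(question)
ranked = sorted(store, key=lambda item: cosine(q_vec, item[1]), reverse=True)
context = "\n".join(chunk for chunk, _ in ranked[:2])

# 3. Hand the retrieved chunks to the LLM as prompt context.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # in production, send this prompt to your LLM
```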

Q6. What specific advantages do 100% open source vector databases have over proprietary vector databases?

Teams are under tremendous competitive pressure to harness generative AI and get differentiated applications into production, out the door, and refined to meet market needs. Said another way: enterprises can’t afford hiccups right now. Engaging in a contract with a proprietary vector database provider that turns out to be a vendor lock-in trap, or just isn’t the best fit capabilities-wise, can derail an organization’s AI roadmap before the train even gets going.

In scenarios where AI is crucial to competitiveness, that can be a serious blow to an enterprise. In contrast, 100% open source vector databases leave organizations in full control of their data, preserve their options for future migrations, and offer extremely powerful and trustworthy capabilities for supporting AI workloads and development—at least when it comes to those three technologies I’ve mentioned.

Q7. For teams already heavily invested in a particular open source database ecosystem like PostgreSQL or Apache Cassandra, how difficult is the migration path to incorporate vector database capabilities? Are there any implementation challenges they should anticipate?

As I alluded to earlier, vector database experience can play a decisive role in how quickly a team can harness even a familiar solution, get up to speed, and begin to optimize functionality. Cassandra 5.0, OpenSearch, and PostgreSQL offer some of the most robust, well-documented, and well-supported vector database choices out there when it comes to getting up and running and realizing their full potential.

That said, there will be a learning curve on the road to implementation and optimization, which experience or the right partners can effectively reduce.

Q8. Anything else you wish to add?

The best approach is to jump in and find a scenario where an LLM can add immediate value. Ensure it is low-risk enough for your team to learn lessons and improve as they better understand how to get the best out of generative AI.

………………………………………………

Ben Bromhead is the CTO at Instaclustr by NetApp

Prior to co-founding Instaclustr in 2012 (acquired by NetApp in 2022), Ben was an independent consultant developing NoSQL solutions for enterprises, and ran a high-tech cryptographic and cybersecurity formal testing laboratory at BAE Systems and Stratsec.

Sponsored by Clement | Peterson
