Vector Search Engine: A Hands-On Example

📌 Implementing a Vector Search Engine with Qdrant Cloud

What is a Vector Database and Why is it Gaining Attention?

A vector database is a type of database that stores data as vectors instead of traditional rows and columns. Vectors are numerical representations of data points, which can capture the essence of complex data such as images, texts, and even audio. This capability makes vector databases particularly useful in applications requiring similarity searches, which are common in fields like machine learning, natural language processing, and computer vision.

Image source: Why You Shouldn’t Invest In Vector Databases?

The surge in interest in vector databases is driven by the rapid advancements in AI and machine learning. Traditional databases struggle to manage and search through high-dimensional data effectively. Vector databases, on the other hand, are designed to handle such data with ease, enabling efficient and accurate similarity searches. This makes them invaluable for modern applications such as recommendation systems, image and video search, and natural language understanding.

What is Vector Search and Why is it Necessary?

Vector search involves finding vectors in a high-dimensional space that are similar to a given query vector. This is essential in many AI applications. For instance, in a recommendation system, vector search can quickly identify items similar to those a user has previously interacted with. In image search, it can find visually similar images, and in natural language processing, it can retrieve semantically similar texts.

The necessity of vector search arises from the limitations of exact matching methods in handling unstructured and high-dimensional data. Exact matching can miss relevant items that are not identical but similar in some aspects. Vector search addresses this by enabling similarity-based retrieval, which is more aligned with how humans perceive similarities.

What is Qdrant and Why Do We Need It?

Qdrant is an advanced vector database that provides efficient and accurate similarity search capabilities. It leverages the power of vectors to manage high-dimensional data, making it an essential tool for applications involving machine learning and artificial intelligence.

Image source: social-preview-H.png (1280×640) (qdrant.tech)

Why Qdrant is Necessary

In an era where data is growing exponentially and the need for fast, accurate search results is critical, Qdrant fills a crucial gap. Traditional databases are not optimized for the types of high-dimensional data used in AI and ML applications. Qdrant offers a dedicated platform for storing and querying vector data, addressing these specific needs.

Technical Features of Qdrant

High-Performance Similarity Search: Qdrant is optimized for speed and accuracy, using advanced algorithms to quickly find the most similar vectors in large datasets.
Scalability: Designed to scale with your needs, Qdrant can handle millions to billions of vectors without compromising performance.
Easy Integration: Qdrant integrates seamlessly with popular machine learning frameworks and libraries, simplifying its adoption into existing workflows.
Real-Time Updates: Supports real-time updates, allowing for the addition, deletion, or modification of vectors on the fly without affecting search performance.
Rich API: Provides a comprehensive API for programmatic interaction, facilitating automation and integration.

Comparing Qdrant with Other Vector Databases

Advantages of Qdrant

Performance: Qdrant outperforms many other vector databases in search speed and accuracy. Its optimized algorithms and data structures ensure fast query times even with large datasets.
User-Friendly: Qdrant's intuitive interface and robust API make it accessible to developers of all skill levels. The seamless integration with machine learning frameworks further enhances its usability.
Scalability: Unlike some vector databases that struggle with large datasets, Qdrant is designed to scale efficiently, handling vast amounts of data without sacrificing performance.
Real-Time Capabilities: The ability to update vectors in real time is a significant advantage, especially for applications requiring constant data updates and immediate query results.

Disadvantages of Qdrant

Resource Intensive: Due to its high-performance capabilities, Qdrant can be resource-intensive, requiring more powerful hardware and infrastructure compared to simpler databases.
Complexity: For users unfamiliar with vector databases, the initial learning curve can be steep. Understanding vector representations and similarity search algorithms is crucial to fully leverage Qdrant's capabilities.

Sample Code for Using Qdrant Cloud

Below are code blocks that will demonstrate how to build a vector search engine using Qdrant Cloud. The code will include sections for importing embedding data, building an API with FastAPI, and implementing the search engine, along with the necessary package list.

Embedding Data Import (load.py)

import json
import logging
import os
import time
import pandas as pd
from qdrant_client import QdrantClient
from qdrant_client.http import models as rest

from qdrant_client.conversions import common_types as types
from ast import literal_eval
from .consts import COLLECTION_NAME, EMBEDDING_DATA

logger = logging.getLogger()
logger.setLevel(logging.INFO)
consoleHandler = logging.StreamHandler()
logger.addHandler(consoleHandler)
logger.setLevel(logging.INFO)

qdrant_host = os.environ["QDRANT_HOST"]
qdrant_api_key = os.environ["QDRANT_API_KEY"]
CHUNK_SIZE = 500


def load_embeddings_in_chunks(
    q_client: QdrantClient, file_path: str, chunk_size=CHUNK_SIZE
):
    logger.info(f"Processing {file_path}")

    # create collection if not exists
    collections: types.CollectionsResponse = q_client.get_collections()
    collections_names = [col.name for col in collections.collections]
    logger.info(f"Existing collections: {collections_names}")

    with open(file_path, "r") as file:
        vector_size = 0
        count_rows = 0
        for _, df_chunk in enumerate(pd.read_csv(file, chunksize=chunk_size)):
            df_chunk["embeddings"] = df_chunk["embeddings"].apply(literal_eval)

            if vector_size == 0:
                vector_size = len(df_chunk.iloc[0]["embeddings"])
                logger.info(f"vector_size: {vector_size}")

            # create collection if not exists
            if COLLECTION_NAME not in collections_names:
                q_client.recreate_collection(
                    collection_name=COLLECTION_NAME,
                    vectors_config={
                        "content": rest.VectorParams(
                            distance=rest.Distance.COSINE,
                            size=vector_size,
                        ),
                    },
                )
                logger.info(f"Created collection: {COLLECTION_NAME}")
                collections: types.CollectionsResponse = q_client.get_collections()
                collections_names = [col.name for col in collections.collections]

            points = [
                rest.PointStruct(
                    id=row["id"],
                    vector={
                        "content": row["embeddings"],
                    },
                    payload=json.loads(row["properties"]),
                )
                for _, row in df_chunk.iterrows()
            ]

            res: types.UpdateResult = q_client.upsert(
                collection_name=COLLECTION_NAME,
                points=points,
                wait=True,
            )
            logger.info(f"Upserted chunk to collection: {res}")
            count_rows += len(points)

        logger.info(f"Loaded {count_rows} rows collection.")


def load_to_db(q_client: QdrantClient):
    start_time = time.time()
    logger.info("Loading embeddings to db")

    # Assuming EMBEDDING_DATA contains file paths for the embeddings
    file_path = f"./embeddings/embedding1.csv"
    load_embeddings_in_chunks(q_client, file_path)

    file_path = f"./embeddings/embedding2.csv"
    load_embeddings_in_chunks(q_client, file_path)

    file_path = f"./embeddings/embedding3.csv"
    load_embeddings_in_chunks(q_client, file_path)

    logger.info(
        f"Loaded embeddings to db. Spent time: {str(time.time() - start_time)} seconds"
    )


def main():
    qdrant_client = QdrantClient(qdrant_host, api_key=qdrant_api_key)
    load_to_db(qdrant_client)


if __name__ == "__main__":
    main()

Building the API with FastAPI (main.py)

import logging
import os

from fastapi import FastAPI, HTTPException

from .search_engine import search_query, q_client, vector_size

if os.getenv("OPENAI_API_KEY", "") == "":
    raise Exception("OPENAI_API_KEY is not set")

# setting log go into stdout
logger = logging.getLogger()
logger.setLevel(logging.INFO)
consoleHandler = logging.StreamHandler()
logger.addHandler(consoleHandler)
logger.setLevel(logging.INFO)


app = FastAPI()


@app.get("/health")
def health():
    return {"status": "ok"}


@app.get("/count")
def db_size():
    count = vector_size()
    return {"count": count}


@app.get("/search")
def search(query: str, top_k: int = 20):
    try:
        result = search_query(query, top_k=top_k)
    except ValueError as e:
        raise HTTPException(status_code=404, detail="Item not found")
    return result


if __name__ == "__main__":
    # for test purpose
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8080)

Search Engine Implementation (search_engine.py)

import logging
import os
from time import time

import openai
import qdrant_client
from qdrant_client.conversions.common_types import CountResult

from .consts import COLLECTION_NAME

logger = logging.getLogger()

qdrant_host = os.getenv("QDRANT_HOST", None)
qdrant_api_key = os.getenv("QDRANT_API_KEY", None)

# Qdrant Cloud
if qdrant_host and qdrant_api_key:
    q_client = qdrant_client.QdrantClient(
        qdrant_host,
        api_key=qdrant_api_key,
    )
    logger.info(f"Qdrant client running on Cloud: {qdrant_host}")
else:
    q_client = qdrant_client.QdrantClient(":memory:")
    logger.info(f"Qdrant client running on local machine.")


def vector_size():
    try:
        res: CountResult = q_client.count(collection_name=COLLECTION_NAME)
        return res.count
    except Exception as e:
        logger.error(f"Error when counting: {e}")
        return 0


def search_query(query: str, vector_name: str = "content", top_k: int = 20):
    embds_start = time()
    embedded_query = openai.Embedding.create(
        input=query,
        model="text-embedding-ada-002",
    )["data"][0]["embedding"]
    embds_end = time()
    embds_time = embds_end - embds_start
    logger.info(f"Embedding time: {embds_time}, query: {query}")

    query_start = time()
    query_results = q_client.search(
        collection_name=COLLECTION_NAME,
        query_vector=(vector_name, embedded_query),
        limit=top_k,
    )
    query_end = time()
    query_time = query_end - query_start

    return {
        "result": query_results,
        "embedding_time": embds_time,
        "query_time": query_time,
    }

Constant Variables (consts.py)

COLLECTION_NAME = "my_first_collection"

EMBEDDING_DATA = {
    "EMBEDDING_DATA1": "embedding1.csv",
    "EMBEDDING_DATA2": "embedding2.csv",
    "EMBEDDING_DATA3": "embedding3.csv"
}

Required Packages List (requirements.txt)

fastapi==0.101.0
uvicorn[standard]==0.23.2
gunicorn==21.2.0
langchain==0.0.100
qdrant-client==1.4.0
openai==0.27.8
pandas==2.0.3

How to run it?

Prerequisites

Install following tools and packages:

Docker
Docker Compose
Python 3.9 and pip
requirements.txt

Prepare your embedding data:

Embeddings

Setup

1. Data files

Create csv files and add them under embeddings directory. Fow now, we have 3 files and their names should be:

embedding1.csv
embedding2.csv
embedding3.csv

2. Environment variables

Create a file named .env from .env.example and fill in the values.

OPENAI_API_KEY: Key to connect to GPT
QDRANT_HOST: Set it if using Qdrant cloud
QDRANT_API_KEY: Set it if using Qdrant cloud

3. Start service as docker container

Run following command in the root directory of the project, it will build docker image and start service:

# Build and start the service
docker-compose up -d

# only build the image
docker-compose build

# only start the service
docker-compose up -d

# stop the service
docker-compose stop

This will start service on 127.0.0.1 on port 8080.

Deploy and Run

All endpoints are GET request. Endpoints:

http://127.0.0.1:8080/health: Return ok message (for health check later)
http://127.0.0.1:8080/count: Return number of vectors in database
http://127.0.0.1:8080/search?query=: Search similarity endpoint with query parameter for query string and top_k parameter for number of results to return. Default top_k is 20.
Example (with only query string): http://127.0.0.1:8080/search?query=VectorDB
Example (with only query string and top_k size): http://127.0.0.1:8080/search?query=VectorDB&top_k=1

Data initial load to Cloud

 QDRANT_HOST="HOST_URL" QDRANT_API_KEY="API_KEY" python load.py

Notice

Deploying to Qdrant Cloud

Environment variables should be set in production environment. In local development, docker-compose will read .env file and set environment variables for container.

Loading data

Currently, to make development simple, we have added /load endpoint for loading data from .csv files.

Conclusion and Summary

Qdrant is a powerful vector database tailored to meet the demands of modern AI and machine learning applications. Its ability to efficiently store and search high-dimensional data makes it indispensable for tasks involving similarity search. With features like high performance, scalability, and real-time updates, Qdrant stands out among vector databases.

While it can be resource-intensive and may have a steep learning curve, the benefits it offers make it a valuable tool for data scientists and developers. By leveraging Qdrant Cloud, developers can easily integrate vector search capabilities into their applications, enabling more accurate and efficient data retrieval.

References

Back