Vector Database for Python Developers

Vector databases store embeddings that represent data, enabling semantic similarity comparisons between objects. They are used to perform similarity search across multimodal data such as text, images, audio, or video, and they also power LLM applications by providing context that improves generation quality and reduces hallucinations.

vectordb is a simple, user-friendly solution for Python developers looking to create their own vector database with CRUD support. Vector databases are a key component of the LLM stack, as they give models access to context and memory. Many existing solutions force developers and users into complexity they often do not need. With vectordb, you can easily create your own vector database that works locally and can still be deployed and served with scalability features such as sharding and replication.

Start with your solution as a local library and seamlessly transition into a served database with all the needed capabilities. No more complexity than you need.

vectordb is based on the local libraries wrapped inside DocArray and on the scalability, reliability, and serving capabilities of Jina.

In simple terms, you can think of DocArray as the Lucene-like algorithmic logic powering the retrieval capabilities, and of Jina as the Elasticsearch-like layer making sure that the indexes are served and scaled for clients. vectordb wraps these technologies to give you a powerful and easy-to-use experience for building and using vector databases.

💪 Features

  • User-friendly interface: vectordb is designed with simplicity and ease of use in mind, making it accessible even for beginners.

  • Adapts to your needs: vectordb is designed to offer what you need without extra complexity, supporting the features needed at every step: from local usage, to serving, to the cloud, in a seamless way.

  • CRUD support: vectordb supports CRUD operations: index, search, update, and delete.

  • Serve: Serve your database as a service, accepting insert and search requests over the gRPC or HTTP protocol.

  • Scalable: With vectordb, you can deploy your database in the cloud and take advantage of powerful scalability features like sharding and replication. Sharding your data can improve the latency of your service, while replication can improve its availability and throughput.

  • Deploy to the cloud: If you need to deploy your service in the cloud, you can easily deploy it to Jina AI Cloud. More deployment options will come soon.

  • Serverless capacity: vectordb can be deployed in the cloud in serverless mode, allowing you to save resources and have the data available only when needed.

  • Multiple ANN algorithms: vectordb contains different implementations of ANN algorithms. These are the ones offered so far; we plan to integrate more:

    • InMemoryExactNNVectorDB (exact NN search): implements a simple exhaustive nearest-neighbour algorithm.
    • HNSWVectorDB (approximate NN search): based on HNSWLib.
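To make the exact-NN option concrete, here is a minimal, illustrative sketch of what brute-force nearest-neighbour search does: score every stored vector against the query and keep the closest ones. This is not vectordb's implementation; the function and variable names are made up for illustration.

```python
import math

def l2_distance(a, b):
    # Euclidean distance between two equal-length vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_nn_search(index, query, limit=3):
    # score every stored vector against the query, keep the `limit` closest
    scored = sorted(index.items(), key=lambda kv: l2_distance(kv[1], query))
    return [doc_id for doc_id, _ in scored[:limit]]

index = {'a': [0.0, 0.0], 'b': [1.0, 0.0], 'c': [5.0, 5.0]}
print(exact_nn_search(index, [0.9, 0.1], limit=2))  # ['b', 'a']
```

HNSW avoids this linear scan by navigating a layered proximity graph, trading a little accuracy for much faster queries on large indexes.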

🏁 Getting Started

To get started with vectordb, simply follow these steps. In this example we use InMemoryExactNNVectorDB:

  1. Install vectordb:

pip install vectordb

  2. Define your Index Document schema using DocArray:
from docarray import BaseDoc
from docarray.typing import NdArray

class MyTextDoc(BaseDoc):
   text: str = ''
   embedding: NdArray[128]

Make sure that the schema has a tensor-typed field with a shape annotation (here, embedding), as in the example.

  3. Use any of the pre-built databases with the document schema (InMemoryExactNNVectorDB or HNSWVectorDB):
import numpy as np
from docarray import DocList
from vectordb import InMemoryExactNNVectorDB, HNSWVectorDB

db = InMemoryExactNNVectorDB[MyTextDoc](workspace='./workspace_path')

db.index(inputs=DocList[MyTextDoc]([MyTextDoc(text=f'index {i}', embedding=np.random.rand(128)) for i in range(1000)]))
results = db.search(inputs=DocList[MyTextDoc]([MyTextDoc(text='query', embedding=np.random.rand(128))]), limit=10)

Each result will contain its matches under the .matches attribute as a DocList[MyTextDoc].

  4. Serve the database as a service with any of these protocols: gRPC, HTTP, or WebSocket.
with InMemoryExactNNVectorDB[MyTextDoc].serve(workspace='./workspace_path', protocol='grpc', port=12345, replicas=1, shards=1) as service:
   service.index(inputs=DocList[MyTextDoc]([MyTextDoc(text=f'index {i}', embedding=np.random.rand(128)) for i in range(1000)]))
   service.block()
  5. Interact with the database through a client, in a similar way as before:
from vectordb import Client

c = Client[MyTextDoc](address='grpc://0.0.0.0:12345')
results = c.search(inputs=DocList[MyTextDoc]([MyTextDoc(text='query', embedding=np.random.rand(128))]), limit=10)

CRUD API:

When using vectordb as a library, or accessing it from a client connected to a served instance, the Python objects share the exact same API, providing index, search, update, and delete capabilities:

  • index: Index gets as input the DocList to index.

  • search: Search gets as input a DocList of batched queries, or a single BaseDoc as a single query. It returns one result per query, where each result has matches and scores attributes sorted by relevance.

  • delete: Delete gets as input the DocList of documents to delete from the index. The delete operation only considers the id attribute, so you need to keep track of the indexed IDs if you want to delete documents.

  • update: Update gets as input the DocList of documents to update in the index. The update operation will replace the indexed document that has the same ID with the attributes and payload from the input document.
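The CRUD semantics above can be illustrated with a toy, dict-backed index keyed on document ID. This is purely an illustration of the semantics, not vectordb code; the class and field names are invented.

```python
class ToyIndex:
    def __init__(self):
        self._docs = {}  # maps document id -> document payload

    def index(self, docs):
        for doc in docs:
            self._docs[doc['id']] = doc

    def delete(self, docs):
        # only the `id` attribute matters for deletion
        for doc in docs:
            self._docs.pop(doc['id'], None)

    def update(self, docs):
        # replace the stored document carrying the same id
        for doc in docs:
            if doc['id'] in self._docs:
                self._docs[doc['id']] = doc

idx = ToyIndex()
idx.index([{'id': '1', 'text': 'hello'}, {'id': '2', 'text': 'world'}])
idx.update([{'id': '2', 'text': 'updated'}])
idx.delete([{'id': '1'}])
print(idx._docs)  # {'2': {'id': '2', 'text': 'updated'}}
```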

🚀 Serve and scale your own Database, add replication and sharding

Serving:

To serve your vectordb so that it can be accessed from a Client, you can pass the following parameters:

  • protocol: The protocol to be used for serving; it can be gRPC, HTTP, WebSocket, or any combination of them provided as a list. Defaults to gRPC.

  • port: The port where the service will be accessible; it can be a list with one port for each protocol provided. Defaults to 8081.

  • workspace: The path used by vectordb to hold and persist required data. Defaults to '.' (the current directory).
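Putting these parameters together, a hedged sketch of a serve() call, mirroring the example from the Getting Started section (the protocol, port, and workspace values shown are illustrative choices, not recommendations):

```python
from docarray import BaseDoc
from docarray.typing import NdArray
from vectordb import InMemoryExactNNVectorDB

class MyTextDoc(BaseDoc):
    text: str = ''
    embedding: NdArray[128]

# protocol, port, and workspace as described above; values are illustrative
with InMemoryExactNNVectorDB[MyTextDoc].serve(
    workspace='./workspace_path',
    protocol='http',
    port=8081,
) as service:
    service.block()  # keep the service running until interrupted
```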

Scalability

When serving or deploying your vector databases, you can set two scaling parameters in vectordb:

  • Shards: The number of shards into which the data will be split. Sharding allows for better latency: vectordb makes sure that documents are indexed in only one of the shards, while search requests are sent to all of them, and vectordb merges the results from all shards.

  • Replicas: The number of replicas of the same DB that must exist. The given replication factor is shared by all the shards. vectordb uses the RAFT algorithm to keep the index in sync across all the replicas of each shard. This increases the availability of the service and allows for better search throughput, as multiple replicas can respond in parallel to more search requests while still allowing CRUD operations.

** When deployed to JCloud, the number of replicas is currently set to 1. We are working to enable replication in the cloud.
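The shard fan-out described above can be sketched in a few lines: each shard returns its local top-k sorted by score, and the results are merged and cut to the global limit. This is illustrative only, not vectordb's internals; the names are invented.

```python
import heapq

def merge_shard_results(per_shard_results, limit):
    # each shard contributes (score, doc_id) pairs sorted ascending
    # (lower score = closer); heapq.merge combines the sorted streams
    merged = heapq.merge(*per_shard_results)
    return [doc_id for _, doc_id in list(merged)[:limit]]

shard_a = [(0.1, 'a1'), (0.7, 'a2')]
shard_b = [(0.3, 'b1'), (0.4, 'b2')]
print(merge_shard_results([shard_a, shard_b], limit=3))  # ['a1', 'b1', 'b2']
```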

💻 vectordb CLI

vectordb ships with a simple CLI that helps you serve and deploy your vectordb database.

First, you need to embed your database instance or class in a Python file.

# example.py
from docarray import DocList, BaseDoc
from docarray.typing import NdArray
from vectordb import InMemoryExactNNVectorDB


class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]


db = InMemoryExactNNVectorDB[MyDoc](workspace='./vectordb') # notice how `db` is the instance that we want to serve

if __name__ == '__main__':
    # make sure to protect this part of the code
    with db.serve() as service:
        service.block()
  • Serve your app locally: vectordb serve --db example:db
  • Deploy your app on JCloud: vectordb deploy --db example:db

☁️ Deploy it to the cloud

vectordb allows you to deploy your solution to the cloud easily.

  1. First, you need to get a Jina AI Cloud account

  2. Log in to your Jina AI Cloud account using the jc command line:

jc login

  3. Deploy:
vectordb deploy --db example:db
Command output:
╭──────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ App ID       │                                           <id>                                                         │
├──────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Phase        │                                       Serving                                                       │
├──────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Endpoint     │                                 grpc://<id>.wolf.jina.ai                                               │
├──────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ App logs     │                                   dashboards.wolf.jina.ai                                         │
╰──────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────╯
  4. Connect from the Client

Once deployed, you can use vectordb Client to access the given endpoint.

from vectordb import Client

c = Client(address='grpc://<id>.wolf.jina.ai')
  5. Manage your deployed instances using jcloud: you can then list and delete your deployed DBs with the jc command:

jc list <>

jc delete <>

⚙️ Configure

Here you can find the list of parameters you can use to configure the behavior for each of the VectorDB types.

InMemoryExactNNVectorDB

This database type does an exhaustive search on the embeddings and therefore has a very limited set of configuration options:

  • workspace: The folder where the required data will be persisted.
InMemoryExactNNVectorDB[MyDoc](workspace='./vectordb')
InMemoryExactNNVectorDB[MyDoc].serve(workspace='./vectordb')

HNSWVectorDB

This database implements Approximate Nearest Neighbour search based on the HNSW algorithm, using HNSWLib.

It contains more configuration options:

  • workspace: The folder where the required data will be persisted.

Then there is a set of configuration options that tweak the performance and accuracy of the NN search algorithm. You can find more details in the HNSWLib README:

  • space: name of the space, related to the similarity metric used (can be one of "l2", "ip", or "cosine"). Default: "l2".
  • max_elements: initial capacity of the index, which is increased dynamically. Default: 1024.
  • ef_construction: parameter that controls the speed/accuracy trade-off during index construction. Default: 200.
  • ef: parameter controlling the query-time/accuracy trade-off. Default: 10.
  • M: parameter that defines the maximum number of outgoing connections in the graph. Default: 16.
  • allow_replace_deleted: enables replacing deleted elements with newly added ones. Default: False.
  • num_threads: default number of threads to use during index and search operations. Default: 1.
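For example, a hedged configuration sketch passing these options as keyword arguments, following the same pattern used for workspace above (MyDoc is the schema from the CLI example; the chosen values are illustrative, not recommendations):

```python
from docarray import BaseDoc
from docarray.typing import NdArray
from vectordb import HNSWVectorDB

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

db = HNSWVectorDB[MyDoc](
    workspace='./vectordb',
    space='cosine',        # similarity metric
    max_elements=10_000,   # initial index capacity
    ef_construction=400,   # build-time speed/accuracy trade-off
    ef=50,                 # query-time speed/accuracy trade-off
    M=16,                  # max outgoing graph connections
)
```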

🛣️ Roadmap

We have big plans for the future of Vector Database! Here are some of the features we have in the works:

  • Further configuration of ANN algorithms.

  • More ANN search algorithms: We want to support more ANN search algorithms.

  • Filter capacity: We want to support filtering for our offered ANN Search solutions.

  • Customizable: We want to make it easy for users to customize the behavior for their specific needs in an easy way for Python developers.

  • Serverless capacity: We're working on adding serverless capacity to vectordb in the cloud. We currently allow scaling between 0 and 1 replicas; we aim to offer 0 to N.

  • More deployment options: We want to enable deploying vectordb on different clouds, with more options.

If you need any help with vectordb, or you are interested in using it and have requests to make it fit your own needs, don't hesitate to reach out to us. You can join our Slack community and chat with us and other community members.

Contributing

We welcome contributions from the community! If you have an idea for a new feature or improvement, please let us know. We're always looking for ways to make vectordb better for our users.