# Vector Database for Python Developers
Vector databases store embeddings for semantic similarity between objects, enabling similarity searches across multimodal data types. They enhance LLM applications by providing context and improving generation results.
Meet vectordb: a user-friendly Python solution for creating vector databases with CRUD support. Unlike complex alternatives, vectordb allows easy local deployment while offering scalability features like sharding and replication. Seamlessly transition from a local library to a served database without unnecessary complexity.
vectordb leverages DocArray retrieval capabilities and Jina scalability, reliability, and serving capabilities. In essence, DocArray powers the Vector Search logic while Jina ensures scalable index serving, creating a powerful and user-friendly vector database experience.
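To make the core idea concrete, here is a minimal, dependency-free sketch (not part of `vectordb`'s API; names are made up for illustration) of what a vector database does at heart: it stores embeddings and ranks them by similarity to a query embedding.

```python
import math

def cosine_similarity(a, b):
    # similarity between two embedding vectors: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# a toy "database" of text -> embedding pairs (embeddings are hand-made here;
# in practice they come from an embedding model)
docs = {
    'cat': [1.0, 0.0, 0.1],
    'dog': [0.9, 0.1, 0.0],
    'car': [0.0, 1.0, 0.9],
}

query = [1.0, 0.05, 0.05]
# rank all documents by similarity to the query, most similar first
ranked = sorted(docs, key=lambda k: cosine_similarity(query, docs[k]), reverse=True)
print(ranked)  # → ['cat', 'dog', 'car']
```

A real vector database adds persistence, CRUD, approximate-NN indexing, and serving on top of this ranking primitive.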
## 💪 Features

- **User-friendly interface:** `vectordb` offers a simple and intuitive interface, catering to users of all levels of expertise.
- **Tailored to your needs:** `vectordb` provides the necessary features without unnecessary complexity, ensuring a smooth transition from local to server and cloud deployment.
- **CRUD support:** `vectordb` supports the essential CRUD operations: indexing, searching, updating, and deleting.
- **Serve as a service:** `vectordb` allows you to serve your databases and perform insertion or search operations through the gRPC, HTTP, or WebSocket protocols.
- **Scalability:** Take advantage of `vectordb`'s deployment capabilities to benefit from powerful scalability features such as sharding and replication. Sharding improves service latency, while replication enhances availability and throughput.
- **Cloud deployment:** Easily deploy your service in the cloud using Jina AI Cloud. Stay tuned for upcoming deployment options.
- **Serverless capacity:** `vectordb` can be deployed in the cloud in serverless mode, allowing you to save resources and have the data available only when needed.
- **Multiple ANN algorithms:** `vectordb` contains different implementations of ANN algorithms. These are the ones offered so far (we plan to integrate more):
  - `InMemoryExactNNVectorDB` (exact NN search): implements a simple nearest-neighbour algorithm.
  - `HNSWVectorDB` (approximate NN search): based on HNSWLib.
## 🏁 Getting Started

To get started with Vector Database, follow these easy steps. In this example we use `InMemoryExactNNVectorDB`:

1. Install `vectordb`:

```bash
pip install vectordb
```
2. Define your Index Document schema using DocArray:

```python
from docarray import BaseDoc
from docarray.typing import NdArray

class MyTextDoc(BaseDoc):
    text: str = ''
    embedding: NdArray[128]
```

Make sure that the schema has an embedding field with a tensor type and a shape annotation, as in the example.
3. Use any of the pre-built databases with the document schema (`InMemoryExactNNVectorDB` or `HNSWVectorDB`):

```python
import numpy as np
from docarray import DocList
from vectordb import InMemoryExactNNVectorDB, HNSWVectorDB

db = InMemoryExactNNVectorDB[MyTextDoc](workspace='./workspace_path')
db.index(inputs=DocList[MyTextDoc]([MyTextDoc(text=f'index {i}', embedding=np.random.rand(128)) for i in range(1000)]))
results = db.search(inputs=DocList[MyTextDoc]([MyTextDoc(text='query', embedding=np.random.rand(128))]), limit=10)
```

Each result contains the matches under the `.matches` attribute as a `DocList[MyTextDoc]`.
4. Serve the database as a service with any of these protocols: `gRPC`, `HTTP`, or `WebSocket`:

```python
with InMemoryExactNNVectorDB[MyTextDoc].serve(workspace='./hnswlib_path', protocol='grpc', port=12345, replicas=1, shards=1) as service:
    service.index(inputs=DocList[MyTextDoc]([MyTextDoc(text=f'index {i}', embedding=np.random.rand(128)) for i in range(1000)]))
    service.block()
```
5. Interact with the database through a client, in a similar way as before:

```python
from vectordb import Client

c = Client[MyTextDoc](address='grpc://0.0.0.0:12345')
results = c.search(inputs=DocList[MyTextDoc]([MyTextDoc(text='query', embedding=np.random.rand(128))]), limit=10)
```
### CRUD API

When using `vectordb` as a library, or when accessing a served instance from a client, the Python objects share the exact same API, providing `index`, `search`, `update`, and `delete` capabilities:

- `index`: takes as input the `DocList` to index.
- `search`: takes as input a `DocList` of batched queries, or a single `BaseDoc` as a single query. It returns one or multiple results, where each query has `matches` and `scores` attributes sorted by relevance.
- `delete`: takes as input the `DocList` of documents to delete from the index. The `delete` operation only considers the `id` attribute, so you need to keep track of the indexed IDs if you want to delete documents.
- `update`: takes as input the `DocList` of documents to update in the index. The `update` operation replaces the indexed document with the same `id` with the attributes and payload from the input document.
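As a rough mental model (this is an illustrative sketch, not `vectordb`'s actual implementation), the CRUD operations behave like a store keyed by document `id`, which is why `delete` and `update` need the indexed IDs:

```python
# toy sketch of CRUD semantics keyed by document id (illustrative only)
index = {}

def index_docs(docs):
    # index: store each document under its id
    for d in docs:
        index[d['id']] = d

def update_docs(docs):
    # update: replace the indexed document that has the same id
    for d in docs:
        index[d['id']] = d

def delete_docs(docs):
    # delete: only the `id` attribute matters
    for d in docs:
        index.pop(d['id'], None)

index_docs([{'id': 'a', 'text': 'first'}, {'id': 'b', 'text': 'second'}])
update_docs([{'id': 'a', 'text': 'first, revised'}])
delete_docs([{'id': 'b'}])
print(sorted(index))       # ['a']
print(index['a']['text'])  # first, revised
```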
## 🚀 Serve and scale your own Database, add replication and sharding
### Serving

To have your `vectordb` served so that it can be accessed from a Client, you can pass the following parameters:

- `protocol`: the protocol to be used for serving; can be `gRPC`, `HTTP`, `websocket`, or any combination of them provided as a list. Defaults to `gRPC`.
- `port`: the port where the service will be accessible; can be a list with one port for each protocol provided. Defaults to 8081.
- `workspace`: the path used by the VectorDB to hold and persist required data. Defaults to `'.'` (the current directory).
### Scalability

When serving or deploying your Vector Databases, you can set two scaling parameters:

- **Shards:** the number of shards into which the data will be split. Sharding allows for better latency: `vectordb` makes sure that each Document is indexed in only one of the shards, sends every search request to all shards, and merges the results from all of them.
- **Replicas:** the number of replicas of the same DB that must exist. The given replication factor is shared by all the shards. `vectordb` uses the RAFT algorithm to keep the index in sync between all the replicas of each shard. Replication increases the availability of the service and allows for better search throughput, as multiple replicas can respond in parallel to more search requests while still allowing CRUD operations.

** When deployed to JCloud, the number of replicas will be set to 1. We are working on enabling replication in the cloud.
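The division of labour between shards described above can be sketched in plain Python. This is a simplification of what `vectordb` does internally (the function names here are made up): each document lands in exactly one shard, every shard answers each query with a local top-k, and the coordinator merges the partial results into a single global top-k:

```python
import heapq

NUM_SHARDS = 3
# each shard holds its own slice of the data (doc_id -> 1-D "embedding")
shards = [dict() for _ in range(NUM_SHARDS)]

def index_doc(doc_id, embedding):
    # each document is indexed in exactly one shard
    shards[hash(doc_id) % NUM_SHARDS][doc_id] = embedding

def search(query, limit):
    # the query is sent to every shard; each shard returns its local top-k
    per_shard = []
    for shard in shards:
        local = sorted(shard.items(), key=lambda kv: abs(kv[1] - query))[:limit]
        per_shard.append(local)
    # the coordinator merges the partial results into a global top-k
    merged = heapq.nsmallest(limit, (d for local in per_shard for d in local),
                             key=lambda kv: abs(kv[1] - query))
    return [doc_id for doc_id, _ in merged]

for i in range(10):
    index_doc(f'doc-{i}', float(i))

print(search(query=4.2, limit=3))  # → ['doc-4', 'doc-5', 'doc-3']
```

Replication is orthogonal: each shard in this picture would exist as several RAFT-synchronized copies, any of which can answer a search request.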
## 💻 `vectordb` CLI

`vectordb` comes with a simple CLI that helps you serve and deploy your database.

First, you need to embed your database instance or class in a Python file:

```python
# example.py
from docarray import DocList, BaseDoc
from docarray.typing import NdArray
from vectordb import InMemoryExactNNVectorDB

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

db = InMemoryExactNNVectorDB[MyDoc](workspace='./vectordb')  # notice how `db` is the instance that we want to serve

if __name__ == '__main__':
    # make sure to protect this part of the code
    with db.serve() as service:
        service.block()
```

| Description | Command |
| --- | --- |
| Serve your app locally | `vectordb serve --db example:db` |
| Deploy your app on JCloud | `vectordb deploy --db example:db` |
## ☁️ Deploy it to the cloud

`vectordb` allows you to deploy your solution to the cloud easily.

- First, you need to get a Jina AI Cloud account.

- Log in to your Jina AI Cloud account using the `jc` command line:

```bash
jc login
```

- Deploy:

```bash
vectordb deploy --db example:db
```
<details>
<summary>Show command output</summary>

```text
╭──────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ App ID       │ <id>                                                                                                   │
├──────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Phase        │ Serving                                                                                                │
├──────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Endpoint     │ grpc://<id>.wolf.jina.ai                                                                               │
├──────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ App logs     │ dashboards.wolf.jina.ai                                                                                │
╰──────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```

</details>
- Connect from a Client:

Once deployed, you can use the `vectordb` Client to access the given endpoint.

```python
from vectordb import Client

c = Client(address='grpc://<id>.wolf.jina.ai')
```
- Manage your deployed instances using `jcloud`:

You can then list and delete your deployed DBs with the `jc` command:

```bash
jc list <>
jc delete <>
```
## ⚙️ Configure

Here you can find the list of parameters you can use to configure the behavior of each of the VectorDB types.

### InMemoryExactNNVectorDB

This database type does an exhaustive search on the embeddings and therefore has a very limited set of configuration options:

- `workspace`: the folder where the required data will be persisted.

```python
InMemoryExactNNVectorDB[MyDoc](workspace='./vectordb')
InMemoryExactNNVectorDB[MyDoc].serve(workspace='./vectordb')
```
### HNSWVectorDB

This database implements Approximate Nearest Neighbour search based on the HNSW algorithm, using HNSWLib.
It contains more configuration options:

- `workspace`: the folder where the required data will be persisted.

Then a set of configurations that tweak the performance and accuracy of the NN search algorithm. You can find more details in the HNSWLib README:

- `space`: name of the space, related to the similarity metric used (can be one of `"l2"`, `"ip"`, or `"cosine"`). Default: `"l2"`.
- `max_elements`: initial capacity of the index, which is increased dynamically. Default: 1024.
- `ef_construction`: controls the speed/accuracy trade-off during index construction. Default: 200.
- `ef`: controls the query time/accuracy trade-off. Default: 10.
- `M`: defines the maximum number of outgoing connections in the graph. Default: 16.
- `allow_replace_deleted`: enables replacing deleted elements with newly added ones. Default: False.
- `num_threads`: default number of threads to use while `index` and `search` are run. Default: 1.
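The `space` option only changes the distance the index ranks by. As a reference, following HNSWLib's definitions (note that `"l2"` is the *squared* Euclidean distance, and that `"ip"` and `"cosine"` agree only for unit-length vectors), the three metrics can be sketched in plain Python:

```python
import math

def l2_distance(a, b):
    # "l2": squared Euclidean distance, sum((a_i - b_i)^2)
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ip_distance(a, b):
    # "ip": inner-product distance, 1 - dot(a, b)
    return 1.0 - sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # "cosine": 1 - dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a, b = [1.0, 0.0], [0.6, 0.8]
# l2_distance(a, b)     ≈ 0.8
# ip_distance(a, b)     ≈ 0.4
# cosine_distance(a, b) ≈ 0.4  (b is already unit length, so ip == cosine here)
```

In all three cases a smaller distance means a better match.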
## 🛣️ Roadmap

We have big plans for the future of Vector Database! Here are some of the features we have in the works:

- **More ANN search algorithms:** We want to support more ANN search algorithms.
- **Filter capacity:** We want to support filtering for our offered ANN search solutions.
- **Customizability:** We want to make it easy for Python developers to customize the behavior for their specific needs.
- **Serverless capacity:** We're working on adding serverless capacity to `vectordb` in the cloud. We currently allow scaling between 0 and 1 replicas; we aim to offer scaling from 0 to N.
- **More deployment options:** We want to enable deploying `vectordb` on different clouds with more options.

If you need any help with `vectordb`, or if you are interested in using it and have requests to make it fit your own needs, don't hesitate to reach out to us. You can join our Slack community and chat with us and other community members.
## Contributing

We welcome contributions from the community! If you have an idea for a new feature or improvement, please let us know. We're always looking for ways to make `vectordb` better for our users.