docs: update readme (#19)

* docs: update readme

* docs: update readme further

* docs: add CRUD API explanation

Signed-off-by: Joan Fontanals Martinez <joan.martinez@jina.ai>

---------

Signed-off-by: Joan Fontanals Martinez <joan.martinez@jina.ai>
Joan Fontanals committed 2023-06-08 16:55:28 +02:00 (committed by GitHub)
parent e08989c654
commit 5c38ae719f
6 changed files with 92 additions and 56 deletions


@@ -87,13 +87,13 @@ jobs:
         run: |
           python -m pip install --upgrade pip
           python -m pip install wheel
-          pip install -r requirements.txt
-          pip install --no-cache-dir ".[test]"
-          sudo apt-get install libsndfile1
+          pip install pytest
+          pip install .
+          pip install -U docarray[hnswlib]>=0.33.0
       - name: Test
         id: test
         run: |
-          pytest -v -s --log-cli-level=DEBUG ${{ matrix.test-path }}
+          pytest -v -s ${{ matrix.test-path }}
         timeout-minutes: 30
   integration-tests:
@@ -116,22 +116,14 @@ jobs:
         run: |
           python -m pip install --upgrade pip
           python -m pip install wheel
-          pip install -r requirements.txt
-          pip install --no-cache-dir ".[test]"
-          sudo apt-get install libsndfile1
-      - name: Setup monitoring stack
-        run: |
-          cd $GITHUB_WORKSPACE
-          docker-compose -f tests/integration/docker-compose.yml --project-directory . up --build -d --remove-orphans
+          pip install pytest
+          pip install .
+          pip install -U docarray[hnswlib]>=0.33.0
       - name: Test
         id: test
         run: |
-          pytest -v -s --log-cli-level=DEBUG ${{ matrix.test-path }}
+          pytest -v -s ${{ matrix.test-path }}
         timeout-minutes: 30
-      - name: Cleanup monitoring stack
-        run: |
-          cd $GITHUB_WORKSPACE
-          docker-compose -f tests/integration/docker-compose.yml --project-directory . down
   # just for blocking the merge until all parallel integration-tests are successful
   success-all-test:

README.md (102 changed lines)

@@ -1,9 +1,16 @@
 # Vector Database for Python Developers
 `vectordb` is a simple, user-friendly solution for Python developers looking to create their own vector database with CRUD support. Vector databases are a key component of the stack needed to use LLMs as they allow them to have access to context and memory. Many of the solutions out there require developers and users to use complex solutions that are often not needed. With `vectordb`, you can easily create your own vector database solution that can work locally and still be easily deployed and served with scalability features such as sharding and replication.
-`vectordb` allows you to start simple and work locally while allowing when needed to deploy and scale in a seamless manner. With the help of [DocArray](https://github.com/docarray/docarray) and [Jina](https://github.com/jina-ai/jina) `vectordb` allows developers to focus on the algorithmic part and tweak the core of the vector search with Python as they want while keeping it easy to scale and deploy the solution.
-Stop wondering what exact algorithms do existing solutions apply, how do they apply filtering or how to map your schema to their solutions, with `vectordb` you as a Python developer can easily understand and control what is the vector search algorithm doing, giving you the full control if needed while supporting you for local setting and in more advanced and demanding scenarios in the cloud.
+Start with your solution as a local library and seamlessly transition into a served database with all the needed capabilities. No more complexity than needed.
+`vectordb` is based on the local libraries wrapped inside [DocArray](https://github.com/docarray/docarray) and the scalability, reliability and serving capabilities of [Jina](https://github.com/jina-ai/jina).
+In simple terms, one can think of [DocArray](https://github.com/docarray/docarray) as the `Lucene` algorithmic logic for vector search, powering the retrieval capabilities, and of [Jina](https://github.com/jina-ai/jina) as the `ElasticSearch` making sure that the indexes are served and scaled for the clients. `vectordb` wraps these technologies to give a powerful and easy-to-use experience for developing vector databases.
+<!--(THIS CAN BE SHOWN WHEN CUSTOMIZATION IS ENABLED) `vectordb` allows you to start simple and work locally while allowing when needed to deploy and scale in a seamless manner. With the help of [DocArray](https://github.com/docarray/docarray) and [Jina](https://github.com/jina-ai/jina) `vectordb` allows developers to focus on the algorithmic part and tweak the core of the vector search with Python as they want while keeping it easy to scale and deploy the solution. -->
+<!--(THIS CAN BE SHOWN WHEN CUSTOMIZATION IS ENABLED) Stop wondering what exact algorithms existing solutions apply, how they apply filtering or how to map your schema to their solutions; with `vectordb` you as a Python developer can easily understand and control what the vector search algorithm is doing, giving you full control if needed while supporting you in local settings and in more advanced and demanding scenarios in the cloud. -->
 ## :muscle: Features
@@ -25,9 +32,9 @@ Stop wondering what exact algorithms do existing solutions apply, how do they ap
 - Exact NN Search: Implements Simple Nearest Neighbour Algorithm.
 - HNSWLib: Based on [HNSWLib](https://github.com/nmslib/hnswlib)
-- Filter capacity: `vectordb` allows you to have filters on top of the ANN search.
-- Customizable: `vectordb` can be easily extended to suit your specific needs or schemas, so you can build the database you want and for any input and output schema you want with the help of [DocArray](https://github.com/docarray/docarray).
+<!--(THIS CAN BE SHOWN WHEN FILTER IS ENABLED)- Filter capacity: `vectordb` allows you to have filters on top of the ANN search. -->
+<!--(THIS CAN BE SHOWN WHEN FILTER IS ENABLED)- Customizable: `vectordb` can be easily extended to suit your specific needs or schemas, so you can build the database you want and for any input and output schema you want with the help of [DocArray](https://github.com/docarray/docarray).-->

 ## 🏁 Getting Started
@@ -37,32 +44,36 @@ To get started with Vector Database, simply follow these easy steps, in this exa
 ```pip install vectordb```
-2. Define your Index Document schema or use any of the predefined ones using [DocArray](https://docs.docarray.org/user_guide/representing/first_step/):
+2. Define your Index Document schema using [DocArray](https://docs.docarray.org/user_guide/representing/first_step/):
 ```python
 from docarray import BaseDoc
-from docarray.text import TextDoc
+from docarray.typing import NdArray

-class MyTextDoc(TextDoc):
-    author: str = ''
+class MyTextDoc(BaseDoc):
+    text: str = ''
+    embedding: NdArray[128]
 ```
+Make sure that the schema has a tensor field with a shape annotation, as `embedding` in the example.
-3. Use any of the pre-built databases with the document schema as a Python class:
+3. Use any of the pre-built databases with the document schema (InMemoryExactNNVectorDB or HNSWLibDB):
 ```python
-from vectordb import HNSWLibDB
+import numpy as np
+from docarray import DocList
+from vectordb import InMemoryExactNNVectorDB, HNSWLibDB

-db = HNSWLibDB[MyTextDoc](data_path='./hnwslib_path')
+db = InMemoryExactNNVectorDB[MyTextDoc](workspace='./workspace_path')
 db.index(inputs=DocList[MyTextDoc]([MyTextDoc(text=f'index {i}', embedding=np.random.rand(128)) for i in range(1000)]))
-results = db.search(inputs=DocList[MyTextDoc]([MyTextDoc(text='query', embedding=np.random.rand(128))]), parameters={'limit': 10})
+results = db.search(inputs=DocList[MyTextDoc]([MyTextDoc(text='query', embedding=np.random.rand(128))]), limit=10)
 ```
 Each result will contain the matches under the `.matches` attribute as a `DocList[MyTextDoc]`.
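For example, a minimal sketch of inspecting those matches (assuming the `db`, `MyTextDoc` and `results` objects from the snippet above; `matches` and `scores` are the attributes described in the CRUD API section below):
```python
# Minimal sketch: walk over each query result and its sorted matches
for res in results:
    for match, score in zip(res.matches, res.scores):
        print(match.text, score)  # `scores` is assumed aligned with `matches`
```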
-4. Serve the database as a service
+4. Serve the database as a service with any of these protocols: `gRPC`, `HTTP` or `WebSocket`.
 ```python
-with HNSWLibDB[MyTextDoc].serve(config={'data_path'= './hnswlib_path'}, port=12345, replicas=1, shards=1) as service:
+with InMemoryExactNNVectorDB[MyTextDoc].serve(workspace='./workspace_path', protocol='grpc', port=12345, replicas=1, shards=1) as service:
+    service.index(inputs=DocList[MyTextDoc]([MyTextDoc(text=f'index {i}', embedding=np.random.rand(128)) for i in range(1000)]))
     service.block()
 ```
@@ -71,12 +82,47 @@ with HNSWLibDB[MyTextDoc].serve(config={'data_path'= './hnswlib_path'}, port=123
 ```python
 from vectordb import Client

-c = Client[MyTextDoc](port=12345)
-c.index(inputs=DocList[TextDoc]([TextDoc(text=f'index {i}', embedding=np.random.rand(128)) for i in range(1000)]))
-results = c.search(inputs=DocList[TextDoc]([TextDoc(text='query', embedding=np.random.rand(128))]), parameters={'limit': 10})
+c = Client[MyTextDoc](address='grpc://0.0.0.0:12345')
+results = c.search(inputs=DocList[MyTextDoc]([MyTextDoc(text='query', embedding=np.random.rand(128))]), limit=10)
 ```
+## CRUD API:
+When using `vectordb` as a library or accessing a served instance from a client, the Python objects share the exact same API
+to provide `index`, `search`, `update` and `delete` capability (see the sketch below):
+- `index`: takes as input the `DocList` to index.
+- `search`: takes as input the `DocList` of batched queries or a single `BaseDoc` as a single query. It returns one or multiple results, where each query has `matches` and `scores` attributes sorted by relevance.
+- `delete`: takes as input the `DocList` of documents to delete from the index. The `delete` operation only looks at the `id` attribute, so you need to keep track of the indexed IDs if you want to delete documents.
+- `update`: takes as input the `DocList` of documents to update in the index. The `update` operation will update the indexed document with the same ID using the attributes and payload of the input document.
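A minimal sketch of the four operations, assuming the `db` instance and `MyTextDoc` schema from the Getting Started section; the `inputs=` keyword for `update` and `delete` mirrors `index` and is an assumption here:
```python
import numpy as np
from docarray import DocList

# Build a small batch of documents to work with (sketch)
docs = DocList[MyTextDoc](
    [MyTextDoc(text=f'doc {i}', embedding=np.random.rand(128)) for i in range(10)]
)
db.index(inputs=docs)                          # index the batch
results = db.search(inputs=docs[:1], limit=5)  # query; matches sorted by relevance

docs[0].text = 'updated text'
db.update(inputs=docs[:1])  # overwrites the indexed doc that shares the same id
db.delete(inputs=docs[:1])  # removes by id; other attributes are ignored
```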
+## :rocket: Serve and scale your own Database, add replication and sharding
+
+### Serving:
+In order to serve your `vectordb` so that it can be accessed from a Client, you can set the following parameters (see the sketch after the list):
+- protocol: The protocol to be used for serving; it can be `gRPC`, `HTTP`, `WebSocket` or any combination of them, provided as a list. Defaults to `gRPC`.
+- port: The port where the service will be accessible; it can be a list with one port for each protocol provided. Defaults to 8081.
+- workspace: The path used by the VectorDB to hold and persist required data. Defaults to '.' (current directory).
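For example, a minimal sketch of serving with a combination of protocols (the list-valued forms follow the description above; exact accepted values may vary between versions):
```python
# Serve over two protocols, one port per protocol (sketch)
with InMemoryExactNNVectorDB[MyTextDoc].serve(
    workspace='./workspace_path',  # where data is held and persisted
    protocol=['grpc', 'http'],     # any combination, provided as a list
    port=[12345, 12346],           # one port for each protocol
) as service:
    service.block()
```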
+### Scalability
+When serving or deploying your Vector Database you can set two scaling parameters in `vectordb` (a sketch follows the list):
+- Shards: The number of shards in which the data will be split. This allows for better latency: `vectordb` makes sure that documents are indexed in only one of the shards, while search requests are sent to all the shards and `vectordb` merges the results from all of them.
+- Replicas: The number of replicas of the same DB that must exist. The given replication factor will be shared by all the shards. `vectordb` uses the [RAFT](https://raft.github.io/) algorithm to ensure that the index is in sync between all the replicas of each shard. With this, `vectordb` increases the availability of the service and allows for better search throughput, as multiple replicas can respond in parallel to more search requests while still allowing CRUD operations.
+
+** When deployed to JCloud, the number of replicas will be set to 1. We are working to enable replication in the cloud.
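A minimal sketch of setting both scaling parameters when serving (same `serve` API as in the Getting Started section):
```python
# 2 shards, each replicated 3 times and kept in sync via RAFT (sketch)
with InMemoryExactNNVectorDB[MyTextDoc].serve(
    workspace='./workspace_path',
    port=12345,
    shards=2,    # each document lands in exactly one shard; searches fan out and merge
    replicas=3,  # replication factor shared by all shards
) as service:
    service.block()
```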
 ## :cloud: Deploy it to the cloud
 `vectordb` allows you to deploy your solution to the cloud easily.
@@ -99,28 +145,16 @@ You can then list and delete your deployed DBs with `jc`:
 ```jc delete <>```
-## :rocket: Scale your own Database, add replication and sharding
-When serving or deploying your Vector Databases you can set 2 scaling parameters and `vectordb`:
-- Shards: The number of shards in which the data will be split. This will allow for better latency. `vectordb` will make sure that Documents are indexed in only one of the shards, while search request will be sent to all the shards and `vectordb` will make sure to merge the results from all shards.
-- Replicas: The number of replicas of the same DB that must exist. The given replication factor will be shared by all the `shards`. `vectordb` uses RAFT algorithm to ensure that the index is in sync between all the replicas of each shard. With this, `vectordb` increases the availability of the service and allows for better search throughput as multiple replicas can respond in parallel to more search requests while allowing CRUD operations.
-** When deployed, the number of replicas will be set to 1. We are working to enable replication in the cloud
-
-## 🛠️ (Optional) Customize your Database
-TODO: Explain how to write your own implementation
 ## 🛣️ Roadmap
 We have big plans for the future of Vector Database! Here are some of the features we have in the works:
-- Serverless capacity: We're working on adding serverless capacity to `vectordb` in the cloud. We currenly allow to scale between 0 and 1 replica, we aim to offer from 0 to N.
-- More ANN search algorithms: We want to support more ANN search algorithms
+- Further configuration of ANN algorithms.
+- More ANN search algorithms: We want to support more ANN search algorithms.
+- Filter capacity: We want to support filtering for our offered ANN Search solutions.
+- Customizable: We want to make it easy for users to customize the behavior for their specific needs, in a way that is easy for Python developers.
+- Serverless capacity: We're working on adding serverless capacity to `vectordb` in the cloud. We currently allow scaling between 0 and 1 replicas; we aim to offer from 0 to N.
 - More deploying options: We want to enable deploying `vectordb` on different clouds with more options

 If you need any help with `vectordb`, or you are interested in using it and have requests to make it fit your own needs, don't hesitate to reach out to us. You can join our [Slack community](https://jina.ai/slack) and chat with us and other community members.


@@ -1,5 +1,16 @@
 from setuptools import setup, find_packages
-from vectordb import __version__
+from os import path
+
+try:
+    pkg_name = 'vectordb'
+    libinfo_py = path.join(pkg_name, '__init__.py')
+    libinfo_content = open(libinfo_py, 'r', encoding='utf-8').readlines()
+    version_line = [l.strip() for l in libinfo_content if l.startswith('__version__')][0]
+    exec(version_line)  # gives __version__
+except FileNotFoundError:
+    __version__ = '0.0.0'

 # Read the contents of requirements.txt
 with open('requirements.txt', 'r') as f:
@@ -32,11 +43,10 @@ setup(
         'test': [
             'pytest',
             'pytest-asyncio',
-            'monkeypatch'
         ],
     },
     install_requires=requirements,
 )

 import subprocess
-subprocess.run(['pip', 'install', 'docarray[hnswlib]>=0.32.0'])
+subprocess.run(['pip', 'install', 'docarray[hnswlib]>=0.33.0'])


@@ -87,9 +87,9 @@ class VectorDB(Generic[TSchema]):
             for shard in range(shards):
                 peer_ports[str(shard)] = []
                 for replica in range(replicas):
-                    peer_ports[str(shard)].append(8081 + shard * replicas + replica + 1)
+                    peer_ports[str(shard)].append(port + shard * replicas + replica + 1)
         else:
-            peer_ports['0'] = [8081 + (replica + 1) for replica in range(replicas)]
+            peer_ports['0'] = [port + (replica + 1) for replica in range(replicas)]
         if stateful is True:
             if shards is not None:


@@ -5,9 +5,9 @@ RETURN_TYPE = 'return_type'

 def sort_matches_by_scores(func):
     """Method to ensure that return docs have matches sorted by score"""
-    from docarray import DocList
     def wrapper(*args, **kwargs):
+        from docarray import DocList
         res = func(*args, **kwargs)
         obj = args[0]
         if isinstance(res, DocList):