docs: benchmark fixes (#294)
* docs: fix link and clarify

* chore: average benchmark results

* docs: reformat backends

* fix: random seed

* docs: fix table

* docs: add docarray version

* docs: add docarray version
alaeddine-13 authored Apr 22, 2022
1 parent a32ba7b commit 94efe7a
Showing 2 changed files with 40 additions and 15 deletions.
19 changes: 12 additions & 7 deletions docs/advanced/document-store/benchmark.md
@@ -12,12 +12,15 @@

We create a DocumentArray with one million Documents and benchmark all supported document stores. This includes classic databases and vector databases, all under the same DocumentArray API:

* {ref}`"None"<documentarray>`: `DocumentArray()`, namely an in-memory "store"
* {ref}`Sqlite<sqlite>`: `DocumentArray(storage='sqlite')`
* {ref}`Weaviate<weaviate>`: `DocumentArray(storage='weaviate')`
* {ref}`Qdrant<qdrant>`: `DocumentArray(storage='qdrant')`
* {ref}`Annlite<annlite>`: `DocumentArray(storage='anlite')`
* {ref}`ElasticSearch<elasticsearch>`: `DocumentArray(storage='elasticsearch')`
| Name | Usage | Version |
|---------------------------------------------------------------------------------------------|------------------------------------------|-------------------|
| [`"None"`](../../../fundamentals/documentarray/#documentarray), namely an in-memory "store" | `DocumentArray()` | DocArray `0.12.8` |
| [`Sqlite`](../sqlite/#sqlite) | `DocumentArray(storage='sqlite')` | `2.6.0` |
| [`Weaviate`](../weaviate/#weaviate) | `DocumentArray(storage='weaviate')` | `1.11.0` |
| [`Qdrant`](../qdrant/#qdrant) | `DocumentArray(storage='qdrant')` | `0.7.0` |
| [`Annlite`](../annlite/#annlite)                                                              | `DocumentArray(storage='annlite')`       | `0.3.1`           |
| [`ElasticSearch`](../elasticsearch/#elasticsearch) | `DocumentArray(storage='elasticsearch')` | `8.1.0` |
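
For illustration, the API is identical across backends; only the `storage` argument changes. Below is a minimal sketch (the client/server stores additionally assume a locally running server, and depending on the backend and version a `config` dict, e.g. with the vector dimensionality, may also be required):

```python
import numpy as np
from docarray import Document, DocumentArray

# in-memory "store"
da = DocumentArray()

# the same API backed by SQLite; the other backends only differ in the `storage` value
da_sqlite = DocumentArray(storage='sqlite')

# both behave like a regular DocumentArray
da.extend([Document(embedding=np.random.random(128)) for _ in range(100)])
```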


We focus on the following tasks:

@@ -118,13 +121,15 @@ Each Document follows the structure:

We use the `Recall@K` value as an indicator of search quality. The in-memory and SQLite stores **do not implement** approximate nearest neighbor search but use exhaustive search instead. Hence, they give the maximum `Recall@K` but are the slowest.
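As a minimal sketch of one common way to compute this metric (the benchmark script's own `recall` helper may differ in detail), recall at `K` is the fraction of the exact top-`K` neighbours that also appear among the top-`K` results returned by the store:

```python
def recall_at_k(predicted_ids, true_neighbour_ids, k):
    """Fraction of the exact top-k neighbours found among the top-k predictions."""
    return len(set(predicted_ids[:k]) & set(true_neighbour_ids[:k])) / k


# e.g. two of the three exact neighbours are retrieved -> Recall@3 = 0.67
print(recall_at_k(['a', 'b', 'c'], ['a', 'c', 'd'], k=3))
```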

The experiments were conducted on a 4.5 Ghz AMD Ryzen Threadripper 3960X 24-Core Processor with Python 3.8.5.
The experiments were conducted on a 4.5 GHz AMD Ryzen Threadripper 3960X 24-Core Processor with Python 3.8.5 and DocArray 0.12.8.

Moreover, since Weaviate, Qdrant and ElasticSearch follow a client/server pattern, we set them up with their official
Docker images in a **single node** configuration, with 40 GB of RAM allocated. That is, only one replica and one shard are
used during the benchmarking. We did not opt for a cluster setup because our benchmarks mainly aim to assess the
capabilities of a single server instance.

Results might include overhead from the DocArray side, which applies equally to all backends unless a specific
backend provides a more efficient implementation.
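
Several of the per-operation timings in `scripts/benchmarking.py` (see the diff below) are obtained by running the operation multiple times and averaging. A stripped-down sketch of that timing-and-averaging idea, using a hypothetical `timed` helper rather than the script's own wrappers:

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (elapsed seconds, result)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return time.perf_counter() - start, result


def average_time(fn, runs, *args, **kwargs):
    """Average the wall-clock time of fn over several runs."""
    times = [timed(fn, *args, **kwargs)[0] for _ in range(runs)]
    return sum(times) / len(times)
```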

### Settings of the nearest neighbour search

36 changes: 28 additions & 8 deletions scripts/benchmarking.py
Expand Up @@ -18,6 +18,7 @@
TENSOR_SHAPE = (512, 256)
K = 10
n_vector_queries = 1000
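# seed NumPy's global RNG to make NumPy-based randomness reproducible across benchmark runs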
np.random.seed(123)

parser = argparse.ArgumentParser()
parser.add_argument(
@@ -174,17 +175,36 @@ def recall(predicted, relevant, eval_at):
create_time, _ = create(da, docs)

# for n_q in n_query:
console.print(f'reading {n_query} docs ...')
read_time, _ = read(
    da,
    random.sample([d.id for d in docs], n_query),
console.print(
    f'reading {n_query} docs averaged {n_vector_queries} times ...'
)
read_times = []
for _ in range(n_vector_queries):
    read_time, _ = read(
        da,
        random.sample([d.id for d in docs], n_query),
    )
    read_times.append(read_time)
read_time = sum(read_times) / len(read_times)

console.print(f'updating {n_query} docs ...')
update_time, _ = update(da, docs_to_update)
console.print(
    f'updating {n_query} docs averaged {n_vector_queries} times ...'
)
update_times = []
for _ in range(n_vector_queries):
    update_time, _ = update(da, docs_to_update)
    update_times.append(update_time)
update_time = sum(update_times) / len(update_times)

console.print(f'deleting {n_query} docs ...')
delete_time, _ = delete(da, [d.id for d in docs_to_delete])
console.print(
    f'deleting {n_query} docs averaged {n_vector_queries} times ...'
)
delete_times = []
for _ in range(n_vector_queries):
    delete_time, _ = delete(da, [d.id for d in docs_to_delete])
    delete_times.append(delete_time)
    da.extend(docs_to_delete)
delete_time = sum(delete_times) / len(delete_times)

console.print(
    f'finding {n_query} docs by vector averaged {n_vector_queries} times ...'
