docs(array): add doc for array api
hanxiao committed Jan 4, 2022
1 parent d3f2c61 commit 3b2e07c
Showing 12 changed files with 195 additions and 96 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -17,7 +17,7 @@ DocArray is a library for nested, unstructured data such as text, image, audio,

🧑‍🔬 **Data science powerhouse**: greatly facilitate data scientists work on embedding, matching, visualizing, evaluating via Torch/Tensorflow/ONNX/PaddlePaddle.

🚡 **Portable**: ready to wire at anytime with efficient and compact serialization from/to Protobuf, binary, JSON, CSV, dataframe.
🚡 **Portable**: ready-to-wire at anytime with efficient and compact serialization from/to Protobuf, bytes, JSON, CSV, dataframe.

<!-- end elevator-pitch -->

1 change: 1 addition & 0 deletions docs/fundamentals/document/attribute.md
@@ -104,6 +104,7 @@ print(d)

You can also check which content field is set by `.content_type`.

(content-uri)=
## Load content from URI

A quite common pattern is loading content from a URI instead of assigning it directly in the code.
1 change: 1 addition & 0 deletions docs/fundamentals/document/construct.md
@@ -1,3 +1,4 @@
(construct-doc)=
# Construct

Initializing a Document object is super easy. This chapter introduces ways to construct an empty Document and a filled Document. You can also construct a Document from bytes, JSON, or a Protobuf message, as introduced {ref}`in the next chapter<serialize>`.
54 changes: 31 additions & 23 deletions docs/fundamentals/document/index.md
@@ -4,28 +4,31 @@

A Document object has a predefined data structure, shown below. Each attribute can be set and read with dot notation, as you would do with any Python object.

| Attribute | Type | Description |
|-------------|--------------------| ----------- |
| id | string | A hexdigest that represents a unique document ID |
| buffer | bytes | the raw binary content of this document, which often represents the original document when comes into jina |
| blob | `ndarray`-like | the ndarray of the image/audio/video document |
| text | string | a text document |
| granularity | int | the depth of the recursive chunk structure |
| adjacency | int | the width of the recursive match structure |
| parent_id | string | the parent id from the previous granularity |
| weight | float | The weight of this document |
| uri | string | a uri of the document could be: a local file path, a remote url starts with http or https or data URI scheme |
| modality | string | modality, an identifier to the modality this document belongs to. In the scope of multi/cross modal search |
| mime_type | string | mime type of this document, for buffer content, this is required; for other contents, this can be guessed |
| offset | float | the offset of the doc |
| location | float | the position of the doc, could be start and end index of a string; could be x,y (top, left) coordinate of an image crop; could be timestamp of an audio clip |
| chunks | `DocumentArray` | list of the sub-documents of this document (recursive structure) |
| matches | `DocumentArray` | the matched documents on the same level (recursive structure) |
| embedding | `ndarray`-like | the embedding of this document |
| tags | dict | a structured data value, consisting of field which map to dynamically typed values. |
| scores | `NamedScore` | Scores performed on the document, each element corresponds to a metric |
| evaluations | `NamedScore` | Evaluations performed on the document, each element corresponds to a metric |

| Attribute | Type | Description |
|-------------|-----------------| ----------- |
| id | string | A hexdigest that represents a unique document ID |
| buffer | bytes | the raw binary content of this document, which often represents the original document when comes into jina |
| blob | `ndarray`-like | the ndarray of the image/audio/video document |
| text | string | a text document |
| granularity | int | the depth of the recursive chunk structure |
| adjacency | int | the width of the recursive match structure |
| parent_id | string | the parent id from the previous granularity |
| weight | float | The weight of this document |
| uri | string | a uri of the document could be: a local file path, a remote url starts with http or https or data URI scheme |
| modality | string | modality, an identifier to the modality this document belongs to. In the scope of multi/cross modal search |
| mime_type | string | mime type of this document, for buffer content, this is required; for other contents, this can be guessed |
| offset | float | the offset of the doc |
| location | float | the position of the doc, could be start and end index of a string; could be x,y (top, left) coordinate of an image crop; could be timestamp of an audio clip |
| chunks | `DocumentArray` | list of the sub-documents of this document (recursive structure) |
| matches | `DocumentArray` | the matched documents on the same level (recursive structure) |
| embedding | `ndarray`-like | the embedding of this document |
| tags | dict | a structured data value, consisting of field which map to dynamically typed values. |
| scores | `NamedScore` | Scores performed on the document, each element corresponds to a metric |
| evaluations | `NamedScore` | Evaluations performed on the document, each element corresponds to a metric |
```{tip}
An `ndarray`-like object can be a Python (nested) List/Tuple, Numpy ndarray, SciPy sparse matrix (spmatrix), TensorFlow dense and sparse tensor, PyTorch dense and sparse tensor, or PaddlePaddle dense tensor.
```

The data structure of the Document is comprehensive and well-organized. One can categorize those attributes into the following groups:

@@ -37,12 +40,17 @@ The data structure of the Document is comprehensive and well-organized. One can

This picture depicts how you may want to construct or comprehend a Document object.



```{figure} images/document-attributes.svg
```


Document also provides a set of functions frequently used in the data science and machine learning communities.


## What's next?

To start, let's first see how to construct a Document object in {ref}`the next chapter<construct-doc>`.


```{toctree}
:hidden:
130 changes: 110 additions & 20 deletions docs/fundamentals/documentarray/construct.md
@@ -1,46 +1,136 @@
(construct-array)=
# Construct

You can construct a `DocumentArray` in different ways:
## Construct an empty array

````{tab} From empty Documents
```python
from jina import DocumentArray
from docarray import DocumentArray

da = DocumentArray.empty(10)
```
````

```text
<DocumentArray (length=10) at 4456123280>
```

## Construct from list-like objects

You can construct a DocumentArray from a `Sequence`, `List`, `Tuple` or `Iterator` that yields `Document` objects.

````{tab} From list of Documents
```python
from jina import DocumentArray, Document
from docarray import DocumentArray, Document
da = DocumentArray([Document(...), Document(...)])
da = DocumentArray([Document(text='hello'), Document(text='world')])
```
```text
<DocumentArray (length=2) at 4866772176>
```
````
````{tab} From generator
```python
from jina import DocumentArray, Document
da = DocumentArray((Document(...) for _ in range(10)))
da = DocumentArray((Document() for _ in range(10)))
```
```text
<DocumentArray (length=10) at 4866772176>
```
````
````{tab} From another DocumentArray
```python
from jina import DocumentArray, Document

da = DocumentArray((Document() for _ in range(10)))
Because a DocumentArray is itself a "list-like object that yields `Document`", you can also construct a DocumentArray from another DocumentArray:

```python
da = DocumentArray(...)
da1 = DocumentArray(da)
```
````

````{tab} From JSON, CSV, ndarray, files, ...
## Construct from a single Document

```python
from docarray import DocumentArray, Document

d1 = Document(text='hello')
da = DocumentArray(d1)
```

```text
<DocumentArray (length=1) at 4452802192>
```

## Deep copy on elements

You can find more details about those APIs in {class}`~jina.types.arrays.mixins.io.from_gen.FromGeneratorMixin`.
Note that, as with a Python list, adding a Document object to a DocumentArray only adds a reference to it. The original Document is *not* copied. If you change the original Document afterwards, the one inside the DocumentArray also changes. Here is an example:

```python
da = DocumentArray.from_ndjson(...)
da = DocumentArray.from_csv(...)
da = DocumentArray.from_files(...)
da = DocumentArray.from_lines(...)
da = DocumentArray.from_ndarray(...)
from docarray import DocumentArray, Document

d1 = Document(text='hello')
da = DocumentArray(d1)

print(da[0].text)
d1.text = 'world'
print(da[0].text)
```
````

```text
hello
world
```

This may surprise some users, but the following plain Python code shows that the behavior is exactly how a Python list treats mutable elements.

```python
d = {'hello': None}
a = [d]

print(a[0]['hello'])
d['hello'] = 'world'
print(a[0]['hello'])
```

```text
None
world
```

To make a deep copy, set `DocumentArray(..., copy=True)`. All Documents in this DocumentArray are then completely new objects with the same contents as the original ones.

```python
from docarray import DocumentArray, Document

d1 = Document(text='hello')
da = DocumentArray(d1, copy=True)

print(da[0].text)
d1.text = 'world'
print(da[0].text)
```

```text
hello
hello
```
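
The contrast between storing a reference and storing a deep copy can be reproduced in plain Python. The following is only a sketch using the standard library's `copy.deepcopy` on a dict; it assumes nothing about how `copy=True` is implemented inside DocumentArray, and the variable names are illustrative.

```python
import copy

original = {'text': 'hello'}

# A list stores a reference to the same dict object.
by_reference = [original]
# deepcopy creates an independent object with equal contents.
by_copy = [copy.deepcopy(original)]

# Mutate the original after both lists were built.
original['text'] = 'world'

same_object = by_reference[0] is original  # True: same object in memory
referenced_text = by_reference[0]['text']  # follows the mutation
copied_text = by_copy[0]['text']           # unaffected by the mutation

print(same_object, referenced_text, copied_text)
```

The referenced element follows the mutation while the copied element keeps the value it had at copy time, which mirrors the two DocumentArray outputs shown above.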

## Construct from local files

You may recall the common pattern {ref}`mentioned here<content-uri>`. With {meth}`~docarray.document.generators.from_files`, you can easily construct a DocumentArray object from all file paths that match a glob expression.

```python
from docarray import DocumentArray

da_jpg = DocumentArray.from_files('images/*.jpg')
da_png = DocumentArray.from_files('images/*.png')
da_all = DocumentArray.from_files(['images/**/*.png', 'images/**/*.jpg', 'images/**/*.jpeg'])
```

This scans all filenames that match the expression and constructs Documents with the `.uri` attribute filled. You can control whether each file is read as text or binary with the `read_mode` argument.
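
To get a feel for which paths a glob expression selects, here is a small sketch using Python's standard `glob` module on a temporary directory. It illustrates glob semantics only; whether `from_files` treats `**` recursively in the same way is an assumption, and the file names are made up for the example.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a tiny tree: images/a.jpg, images/b.png, images/nested/c.png
    os.makedirs(os.path.join(root, 'images', 'nested'))
    for parts in (('images', 'a.jpg'),
                  ('images', 'b.png'),
                  ('images', 'nested', 'c.png')):
        open(os.path.join(root, *parts), 'w').close()

    # '*' matches within a single directory level only.
    flat = glob.glob(os.path.join(root, 'images', '*.png'))
    # '**' with recursive=True matches zero or more directory levels.
    deep = glob.glob(os.path.join(root, 'images', '**', '*.png'),
                     recursive=True)

    flat_names = sorted(os.path.basename(p) for p in flat)
    deep_names = sorted(os.path.basename(p) for p in deep)

print(flat_names)
print(deep_names)
```

The single-level pattern finds only `b.png`, while the recursive pattern also reaches `c.png` in the nested folder.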



## What's next?

In the next chapter, we will see how to construct a DocumentArray from bytes, JSON, CSV, a dataframe, or a Protobuf message.
15 changes: 9 additions & 6 deletions docs/fundamentals/documentarray/index.md
@@ -1,15 +1,17 @@
(documentarray)=
# DocumentArray

```{toctree}
:hidden:
{class}`~docarray.array.document.DocumentArray` is a list-like container of {class}`~docarray.document.Document` objects. It is **the best way** to work with multiple Documents.

In a nutshell, you can simply consider it as a Python `list`, as it implements **all** list interfaces. That is, if you know how to use Python `list`, you already know how to use DocumentArray.

It is also as powerful as NumPy's `ndarray`, in that you can access its elements with {ref}`fancy slicing syntax<access-elements>`.

documentarraymemmap-api
```
What makes it more exciting are the advanced features of DocumentArray. These features greatly facilitate data scientists' work on accessing nested elements, evaluating, visualizing, parallel computing, serializing, matching, etc.

A {class}`~jina.types.arrays.document.DocumentArray` is a list of `Document` objects. You can construct, delete, insert, sort and traverse
a `DocumentArray` like a Python `list`. It implements all Python List interface.
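
To make "implements the list interface" concrete, here is a minimal sketch in plain Python. `TinyArray` is a hypothetical stand-in and is not how DocumentArray is implemented; it only shows that a container providing the five abstract methods of `collections.abc.MutableSequence` gets methods such as `append`, `extend` and `reverse` for free.

```python
from collections.abc import MutableSequence


class TinyArray(MutableSequence):
    """Hypothetical list-like container, for illustration only."""

    def __init__(self, items=None):
        self._items = list(items) if items is not None else []

    def __getitem__(self, index):
        return self._items[index]

    def __setitem__(self, index, value):
        self._items[index] = value

    def __delitem__(self, index):
        del self._items[index]

    def __len__(self):
        return len(self._items)

    def insert(self, index, value):
        self._items.insert(index, value)


arr = TinyArray(['hello', 'world'])
arr.append('!')      # mixin method, built on insert()
arr.insert(0, 'oh')  # implemented above
del arr[1]           # removes 'hello'
arr.reverse()        # mixin method, built on __getitem__/__setitem__
items = list(arr)    # iteration also comes from the mixin

print(items)
```

Code written against the list interface (indexing, `len`, iteration, `in`) then works on such a container without changes, which is the property the text above claims for DocumentArray.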
## What's next?

Let's see how to construct a DocumentArray {ref}`in the next section<construct-array>`.

```{toctree}
:hidden:
@@ -23,5 +25,6 @@ matching
evaluation
parallelization
visualization
sharing
list-like
```
52 changes: 6 additions & 46 deletions docs/fundamentals/documentarray/serialization.md
@@ -1,53 +1,13 @@
# Serialization

`DocumentArray` provides the following methods for importing from/exporting to different formats.
DocArray is designed to be "ready-to-wire" at any time, so serialization is important. DocumentArray provides multiple serialization methods that allow you to transfer a DocumentArray object over the network and across different microservices.

| Description | Export Method | Import Method |
|-----------------------------------|---------------------------------------------------------------------|-----------------------------------------------|
| LZ4-compressed binary string/file | `.to_bytes()` (or `bytes(...)` for more Pythonic), `.save_binary()` | `.load_binary()` |
| JSON string/file | `.to_json()`, `.save_json()` | `.load_json()`, `.from_ndjson()` |
| CSV file | `.save_csv()` | `.load_csv()`, `.from_lines()`, `.from_csv()` |
| `pandas.Dataframe` object | `.to_dataframe()` | `.from_dataframe()` |
| Local files | | `.from_files()` |
| `numpy.ndarray` object | | `.from_ndarray()` |
| Jina Cloud Storage (experimental) | `.push()` | `.pull()` |
## From/to JSON

```{seealso}
`.from_*()` functions often utilize generators. When used independently, they can be more memory-efficient. See {mod}`~jina.types.document.generators`.
```
## From/to bytes

### Sharing DocumentArray across machines

```{caution}
This is an experimental feature introduced in Jina `2.5.4`. The behavior of this feature might change in the future.
```

Since Jina `2.5.4`, there is a new IO feature: {meth}`~jina.types.arrays.mixins.io.pushpull.PushPullMixin.push` and {meth}`~jina.types.arrays.mixins.io.pushpull.PushPullMixin.pull`,
which allow you to share a DocumentArray object across machines.

Suppose you are working on a GPU machine via Google Colab/Jupyter. After preprocessing and embedding, you have everything you need in a DocumentArray. You can easily transfer it to your local laptop via:

```python
from jina import DocumentArray

da = DocumentArray(...) # heavylifting, processing, GPU task, ...
da.push(token='myda123')
```

Then on your local laptop, simply

```python
from jina import DocumentArray

da = DocumentArray.pull(token='myda123')
```

Now you can continue the work locally, analyzing `da` or visualizing it. Your friends and colleagues who know the token `myda123` can also pull that DocumentArray, which is useful when you want to quickly share results with them.

For more information on this feature, please refer to {class}`~jina.types.arrays.mixins.io.pushpull.PushPullMixin`.

```{danger}
The lifetime of the storage is not guaranteed at the moment: it could be a day, it could be a week. Do not use it for persistence in production. Consider it only as temporary transmission or a clipboard.
```
## From/to Protobuf

## From/to list

## From/to dataframe
30 changes: 30 additions & 0 deletions docs/fundamentals/documentarray/sharing.md
@@ -0,0 +1,30 @@
# Data Sharing


Since Jina `2.5.4`, there is a new IO feature: {meth}`~jina.types.arrays.mixins.io.pushpull.PushPullMixin.push` and {meth}`~jina.types.arrays.mixins.io.pushpull.PushPullMixin.pull`,
which allow you to share a DocumentArray object across machines.

Suppose you are working on a GPU machine via Google Colab/Jupyter. After preprocessing and embedding, you have everything you need in a DocumentArray. You can easily transfer it to your local laptop via:

```python
from jina import DocumentArray

da = DocumentArray(...) # heavylifting, processing, GPU task, ...
da.push(token='myda123')
```

Then on your local laptop, simply

```python
from jina import DocumentArray

da = DocumentArray.pull(token='myda123')
```

Now you can continue the work locally, analyzing `da` or visualizing it. Your friends and colleagues who know the token `myda123` can also pull that DocumentArray, which is useful when you want to quickly share results with them.

For more information on this feature, please refer to {class}`~jina.types.arrays.mixins.io.pushpull.PushPullMixin`.

```{danger}
The lifetime of the storage is not guaranteed at the moment: it could be a day, it could be a week. Do not use it for persistence in production. Consider it only as temporary transmission or a clipboard.
```
1 change: 1 addition & 0 deletions docs/get-started/image-match.md
@@ -0,0 +1 @@
# Similar Image Search with ResNet50
1 change: 1 addition & 0 deletions docs/get-started/text-match.md
@@ -0,0 +1 @@
# QA Matching with Transformer
2 changes: 2 additions & 0 deletions docs/get-started/what-is.md
@@ -0,0 +1,2 @@
# What is DocArray

2 changes: 2 additions & 0 deletions docs/index.md
@@ -64,6 +64,8 @@ not installing `docarray` correctly. You are probably still using an old `docarr
:end-before: <!-- end support-pitch -->
```



```{toctree}
:caption: User Guides
:hidden:
