docs(array): add doc for array api

docarray · Jan 4, 2022 · 3b2e07c
1 parent d3f2c61
commit 3b2e07c

Showing 12 changed files with 195 additions and 96 deletions.
diff --git a/README.md b/README.md
@@ -17,7 +17,7 @@ DocArray is a library for nested, unstructured data such as text, image, audio,
 
 🧑‍🔬 **Data science powerhouse**: greatly facilitate data scientists work on embedding, matching, visualizing, evaluating via Torch/Tensorflow/ONNX/PaddlePaddle.
 
-🚡 **Portable**: ready to wire at anytime with efficient and compact serialization from/to Protobuf, binary, JSON, CSV, dataframe.
+🚡 **Portable**: ready-to-wire at any time with efficient and compact serialization from/to Protobuf, bytes, JSON, CSV, dataframe.
 
 <!-- end elevator-pitch -->
 

diff --git a/docs/fundamentals/document/attribute.md b/docs/fundamentals/document/attribute.md
@@ -104,6 +104,7 @@ print(d)
 
 You can also check which content field is set by `.content_type`.
 
+(content-uri)=
 ## Load content from URI
 
 A quite common pattern is loading content from a URI instead of assigning them directly in the code.

diff --git a/docs/fundamentals/document/construct.md b/docs/fundamentals/document/construct.md
@@ -1,3 +1,4 @@
+(construct-doc)=
 # Construct
 
 Initializing a Document object is super easy. This chapter introduces the ways of constructing empty Document, filled Document. One can also construct Document from bytes, JSON, Protobuf message as introduced {ref}`in the next chapter<serialize>`.

diff --git a/docs/fundamentals/document/index.md b/docs/fundamentals/document/index.md
@@ -4,28 +4,31 @@
 
 A Document object has a predefined data structure as below, each of the attributes can be set/get with the dot expression as you would do with any Python object.
 
+| Attribute   | Type               | Description |
+|-------------|--------------------|-------------|
+| id          | string             | a hexdigest that represents a unique document ID |
+| buffer      | bytes              | the raw binary content of this document, which often represents the original document when it comes into Jina |
+| blob        | `ndarray`-like     | the ndarray of the image/audio/video document |
+| text        | string             | a text document |
+| granularity | int                | the depth of the recursive chunk structure |
+| adjacency   | int                | the width of the recursive match structure |
+| parent_id   | string             | the parent id from the previous granularity |
+| weight      | float              | the weight of this document |
+| uri         | string             | a URI of the document; it can be a local file path, a remote URL starting with http or https, or a data URI |
+| modality    | string             | an identifier of the modality this document belongs to, used in multi/cross-modal search |
+| mime_type   | string             | MIME type of this document; required for buffer content, can be guessed for other contents |
+| offset      | float              | the offset of the doc |
+| location    | float              | the position of the doc; it can be the start and end index of a string, the x,y (top, left) coordinate of an image crop, or the timestamp of an audio clip |
+| chunks      | `DocumentArray`    | list of the sub-documents of this document (recursive structure) |
+| matches     | `DocumentArray`    | the matched documents on the same level (recursive structure) |
+| embedding   | `ndarray`-like     | the embedding of this document |
+| tags        | dict               | a structured data value, consisting of fields that map to dynamically typed values |
+| scores      | `NamedScore`       | scores performed on the document; each element corresponds to a metric |
+| evaluations | `NamedScore`       | evaluations performed on the document; each element corresponds to a metric |
 
-| Attribute   | Type            | Description |
-|-------------|-----------------| ----------- |
-| id          | string          | A hexdigest that represents a unique document ID |
-| buffer      | bytes           | the raw binary content of this document, which often represents the original document when comes into jina |
-| blob        | `ndarray`-like  | the ndarray of the image/audio/video document |
-| text        | string          | a text document |
-| granularity | int             | the depth of the recursive chunk structure |
-| adjacency   | int             | the width of the recursive match structure |
-| parent_id   | string          | the parent id from the previous granularity |
-| weight      | float           | The weight of this document |
-| uri         | string          | a uri of the document could be: a local file path, a remote url starts with http or https or data URI scheme |
-| modality    | string          | modality, an identifier to the modality this document belongs to. In the scope of multi/cross modal search |
-| mime_type   | string          | mime type of this document, for buffer content, this is required; for other contents, this can be guessed |
-| offset      | float           | the offset of the doc |
-| location    | float           | the position of the doc, could be start and end index of a string; could be x,y (top, left) coordinate of an image crop; could be timestamp of an audio clip |
-| chunks      | `DocumentArray` | list of the sub-documents of this document (recursive structure) |
-| matches     | `DocumentArray` | the matched documents on the same level (recursive structure) |
-| embedding   | `ndarray`-like  | the embedding of this document |
-| tags        | dict            | a structured data value, consisting of field which map to dynamically typed values. |
-| scores      | `NamedScore`            | Scores performed on the document, each element corresponds to a metric |
-| evaluations | `NamedScore`            | Evaluations performed on the document, each element corresponds to a metric |
+```{tip}
+An `ndarray`-like object can be a Python (nested) list/tuple, NumPy ndarray, SciPy sparse matrix (spmatrix), TensorFlow dense or sparse tensor, PyTorch dense or sparse tensor, or PaddlePaddle dense tensor.
+```
 
 The data structure of the Document is comprehensive and well-organized. One can categorize those attributes into the following groups:
 
@@ -37,12 +40,17 @@ The data structure of the Document is comprehensive and well-organized. One can
 
 This picture depicts how you may want to construct or comprehend a Document object.
 
-
-
 ```{figure} images/document-attributes.svg
 ```
 
 
+Document also provides a set of functions frequently used in the data science and machine learning communities.
+
+
+## What's next?
+
+To start, let's first see how to construct a Document object in {ref}`the next chapter<construct-doc>`.
+
 
 ```{toctree}
 :hidden:

diff --git a/docs/fundamentals/documentarray/construct.md b/docs/fundamentals/documentarray/construct.md
@@ -1,46 +1,136 @@
+(construct-array)=
 # Construct
 
-You can construct a `DocumentArray` in different ways:
+## Construct an empty array
 
-````{tab} From empty Documents
 ```python
-from jina import DocumentArray
+from docarray import DocumentArray
 
 da = DocumentArray.empty(10)
 ```
-````
+
+```text
+<DocumentArray (length=10) at 4456123280>
+```
+
+## Construct from list-like objects
+
+You can construct a DocumentArray from a `Sequence`, `List`, `Tuple` or `Iterator` that yields `Document` objects.
+
 ````{tab} From list of Documents
 ```python
-from jina import DocumentArray, Document
+from docarray import DocumentArray, Document
 
-da = DocumentArray([Document(...), Document(...)])
+da = DocumentArray([Document(text='hello'), Document(text='world')])
 ```
+
+```text
+<DocumentArray (length=2) at 4866772176>
+```
+
 ````
 ````{tab} From generator
 ```python
 from jina import DocumentArray, Document
 
-da = DocumentArray((Document(...) for _ in range(10)))
+da = DocumentArray((Document() for _ in range(10)))
+```
+
+```text
+<DocumentArray (length=10) at 4866772176>
 ```
 ````
-````{tab} From another DocumentArray
-```python
-from jina import DocumentArray, Document
 
-da = DocumentArray((Document() for _ in range(10)))
+As a DocumentArray is itself a "list-like object that yields `Document`", you can also construct a DocumentArray from another DocumentArray:
+
+```python
+da = DocumentArray(...)
 da1 = DocumentArray(da)
 ```
-````
 
-````{tab} From JSON, CSV, ndarray, files, ...
+## Construct from a single Document
+
+```python
+from docarray import DocumentArray, Document
+
+d1 = Document(text='hello')
+da = DocumentArray(d1)
+```
+
+```text
+<DocumentArray (length=1) at 4452802192>
+```
+
+## Deep copy on elements
 
-You can find more details about those APIs in {class}`~jina.types.arrays.mixins.io.from_gen.FromGeneratorMixin`.
+Note that, as with a Python list, adding a Document object to a DocumentArray only adds a reference to it. The original Document is *not* copied. If you change the original Document afterwards, the one inside the DocumentArray will also change. Here is an example:
 
 ```python
-da = DocumentArray.from_ndjson(...)
-da = DocumentArray.from_csv(...)
-da = DocumentArray.from_files(...)
-da = DocumentArray.from_lines(...)
-da = DocumentArray.from_ndarray(...)
+from docarray import DocumentArray, Document
+
+d1 = Document(text='hello')
+da = DocumentArray(d1)
+
+print(da[0].text)
+d1.text = 'world'
+print(da[0].text)
 ```
-````
+
+```text
+hello
+world
+```
+
+This may surprise some users, but the following plain Python code shows that this behavior is consistent with how a Python list works.
+
+```python
+d = {'hello': None}
+a = [d]
+
+print(a[0]['hello'])
+d['hello'] = 'world'
+print(a[0]['hello'])
+```
+
+```text
+None
+world
+```
+
+To make a deep copy, set `DocumentArray(..., copy=True)`. All Documents in this DocumentArray are then completely new objects with contents identical to the original ones.
+
+```python
+from docarray import DocumentArray, Document
+
+d1 = Document(text='hello')
+da = DocumentArray(d1, copy=True)
+
+print(da[0].text)
+d1.text = 'world'
+print(da[0].text)
+```
+
+```text
+hello
+hello
+```
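
The difference between storing a reference and storing a deep copy can be reproduced with the standard library alone. The following is a minimal sketch, not docarray code: `Doc` is a hypothetical stand-in for `Document`, and `copy.deepcopy` is used here only to imitate what `copy=True` is described as doing above.

```python
import copy
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a Document; not docarray's class."""
    text: str = ''


original = Doc(text='hello')

# The list stores a reference to `original`, not a copy of it.
by_reference = [original]
# deepcopy builds a new list holding an independent copy of `original`.
deep_copied = copy.deepcopy([original])

original.text = 'world'

print(by_reference[0].text)  # world
print(deep_copied[0].text)   # hello
```

The first container sees the later change because it holds the same object; the second does not because its element is a separate object with the same initial contents.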
+
+## Construct from local files
+
+You may recall the common pattern {ref}`mentioned here<content-uri>`. With {meth}`~docarray.document.generators.from_files`, one can easily construct a DocumentArray object from all file paths matching a glob expression.
+
+```python
+from docarray import DocumentArray
+
+da_jpg = DocumentArray.from_files('images/*.jpg')
+da_png = DocumentArray.from_files('images/*.png')
+da_all = DocumentArray.from_files(['images/**/*.png', 'images/**/*.jpg', 'images/**/*.jpeg'])
+```
+
+This scans all filenames that match the expression and constructs Documents with the `.uri` attribute filled. You can control whether each file is read as text or binary with the `read_mode` argument.
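
To get a feel for which paths such glob expressions select, here is a small sketch using Python's own `glob` module on a temporary directory. It assumes that `from_files` follows the same pattern semantics as `glob.glob(..., recursive=True)`; that is an assumption for illustration, not something this page confirms. The file names are made up.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a tiny, made-up directory tree to match against.
    os.makedirs(os.path.join(root, 'images', 'cats'))
    for rel in ('images/a.jpg', 'images/b.png', 'images/cats/c.jpg'):
        open(os.path.join(root, *rel.split('/')), 'w').close()

    # '*.jpg' only matches files directly inside images/.
    flat = glob.glob(os.path.join(root, 'images', '*.jpg'))
    # '**' with recursive=True also descends into subdirectories.
    nested = glob.glob(os.path.join(root, 'images', '**', '*.jpg'), recursive=True)

    flat_names = sorted(os.path.relpath(p, root).replace(os.sep, '/') for p in flat)
    nested_names = sorted(os.path.relpath(p, root).replace(os.sep, '/') for p in nested)

print(flat_names)    # ['images/a.jpg']
print(nested_names)  # ['images/a.jpg', 'images/cats/c.jpg']
```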
+
+
+
+## What's next?
+
+In the next chapter, we will see how to construct a DocumentArray from bytes, JSON, CSV, a dataframe, or a Protobuf message.
diff --git a/docs/fundamentals/documentarray/index.md b/docs/fundamentals/documentarray/index.md
@@ -1,15 +1,17 @@
 (documentarray)=
 # DocumentArray
 
-```{toctree}
-:hidden:
+{class}`~docarray.array.document.DocumentArray` is a list-like container of {class}`~docarray.document.Document` objects. It is **the best way** to work with multiple Documents.
+
+In a nutshell, you can simply consider it as a Python `list`, as it implements **all** list interfaces. That is, if you know how to use a Python `list`, you already know how to use DocumentArray.
+
+It is also as powerful as NumPy's `ndarray`: you can access its elements with {ref}`fancy slicing syntax<access-elements>`.
 
-documentarraymemmap-api
-```
+What makes it more exciting are the advanced features of DocumentArray. These features greatly facilitate data scientists' work on accessing nested elements, evaluating, visualizing, parallel computing, serializing, matching, etc.
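
As a rough illustration of what "implements all list interfaces" can mean in Python, the sketch below builds a list-like container on top of `collections.abc.MutableSequence`. The `DocList` class is hypothetical and is not how DocumentArray is implemented; it only shows that defining five methods is enough to obtain `append`, `reverse`, iteration and the rest of the list protocol.

```python
from collections.abc import MutableSequence


class DocList(MutableSequence):
    """Hypothetical list-like container; not docarray's implementation."""

    def __init__(self, docs=()):
        self._docs = list(docs)

    def __getitem__(self, index):
        return self._docs[index]

    def __setitem__(self, index, value):
        self._docs[index] = value

    def __delitem__(self, index):
        del self._docs[index]

    def __len__(self):
        return len(self._docs)

    def insert(self, index, value):
        self._docs.insert(index, value)


docs = DocList(['hello', 'world'])
docs.append('!')   # provided by MutableSequence, built on insert()
docs.reverse()     # also provided by the mixin
print(len(docs), list(docs))  # 3 ['!', 'world', 'hello']
```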
 
-A {class}`~jina.types.arrays.document.DocumentArray` is a list of `Document` objects. You can construct, delete, insert, sort and traverse
-a `DocumentArray` like a Python `list`. It implements all Python List interface. 
+## What's next?
 
+Let's see how to construct a DocumentArray {ref}`in the next section<construct-array>`.
 
 ```{toctree}
 :hidden:
@@ -23,5 +25,6 @@ matching
 evaluation
 parallelization
 visualization
+sharing
 list-like
 ```
diff --git a/docs/fundamentals/documentarray/serialization.md b/docs/fundamentals/documentarray/serialization.md
@@ -1,53 +1,13 @@
 # Serialization
 
-`DocumentArray` provides the following methods for importing from/exporting to different formats.
+DocArray is designed to be "ready-to-wire" at any time, so serialization is important. DocumentArray provides multiple serialization methods that allow one to transfer a DocumentArray object over the network and across different microservices.
 
-| Description                       | Export Method                                                       | Import Method                                 |
-|-----------------------------------|---------------------------------------------------------------------|-----------------------------------------------|
-| LZ4-compressed binary string/file | `.to_bytes()` (or `bytes(...)` for more Pythonic), `.save_binary()` | `.load_binary()`                              |
-| JSON string/file                  | `.to_json()`, `.save_json()`                                        | `.load_json()`, `.from_ndjson()`              |
-| CSV file                          | `.save_csv()`                                                       | `.load_csv()`, `.from_lines()`, `.from_csv()` |
-| `pandas.Dataframe` object         | `.to_dataframe()`                                                   | `.from_dataframe()`                           |
-| Local files                       |                                                                     | `.from_files()`                               |
-| `numpy.ndarray` object            |                                                                     | `.from_ndarray()`                             |
-| Jina Cloud Storage (experimental) | `.push()`                                                           | `.pull()`                                     |
+## From/to JSON
 
-```{seealso}
-`.from_*()` functions often utlizes generators. When using independently, can be more memory-efficient. See {mod}`~jina.types.document.generators`.   
-```
+## From/to bytes
 
-### Sharing DocumentArray across machines
-
-```{caution}
-This is an experimental feature introduced in Jina `2.5.4`. The behavior of this feature might change in the future. 
-```
-
-Since Jina `2.5.4` we introduce a new IO feature: {meth}`~jina.types.arrays.mixins.io.pushpull.PushPullMixin.push` and {meth}`~jina.types.arrays.mixins.io.pushpull.PushPullMixin.pull`, 
-which allows you to share a DocumentArray object across machines.
-
-Considering you are working on a GPU machine via Google Colab/Jupyter. After preprocessing and embedding, you got everything you need in a DocumentArray. You can easily transfer it to the local laptop via:
-
-```python
-from jina import DocumentArray
-
-da = DocumentArray(...)  # heavylifting, processing, GPU task, ...
-da.push(token='myda123')
-```
-
-Then on your local laptop, simply
-
-```python
-from jina import DocumentArray
-
-da = DocumentArray.pull(token='myda123')
-```
-
-Now you can continue the work at local, analyzing `da` or visualizing it. Your friends & colleagues who know the token `myda123` can also pull that DocumentArray. It's useful when you want to quickly share the results with your colleagues & friends.
-
-For more information of this feature, please refer to {class}`~jina.types.arrays.mixins.io.pushpull.PushPullMixin`.
-
-```{danger}
-The lifetime of the storage is not promised at the momennt: could be a day, could be a week. Do not use it for persistence in production. Only consider this as temporary transmission or a clipboard.
-```
+## From/to Protobuf
 
+## From/to list
 
+## From/to dataframe
diff --git a/docs/fundamentals/documentarray/sharing.md b/docs/fundamentals/documentarray/sharing.md
@@ -0,0 +1,30 @@
+# Data Sharing
+
+
+Since Jina `2.5.4`, a new IO feature is available: {meth}`~jina.types.arrays.mixins.io.pushpull.PushPullMixin.push` and {meth}`~jina.types.arrays.mixins.io.pushpull.PushPullMixin.pull`,
+which allow you to share a DocumentArray object across machines.
+
+Suppose you are working on a GPU machine via Google Colab/Jupyter. After preprocessing and embedding, you have everything you need in a DocumentArray. You can easily transfer it to your local laptop via:
+
+```python
+from jina import DocumentArray
+
+da = DocumentArray(...)  # heavy lifting, processing, GPU task, ...
+da.push(token='myda123')
+```
+
+Then on your local laptop, simply:
+
+```python
+from jina import DocumentArray
+
+da = DocumentArray.pull(token='myda123')
+```
+
+Now you can continue the work locally, analyzing `da` or visualizing it. Your friends and colleagues who know the token `myda123` can also pull that DocumentArray. This is useful when you want to quickly share results with them.
+
+For more information about this feature, please refer to {class}`~jina.types.arrays.mixins.io.pushpull.PushPullMixin`.
+
+```{danger}
+The lifetime of the storage is not guaranteed at the moment: it could be a day, it could be a week. Do not use it for persistence in production. Only consider it as temporary transmission or a clipboard.
+```
diff --git a/docs/get-started/image-match.md b/docs/get-started/image-match.md
@@ -0,0 +1 @@
+# Similar Image Search with ResNet50
diff --git a/docs/get-started/text-match.md b/docs/get-started/text-match.md
@@ -0,0 +1 @@
+# QA Matching with Transformer
diff --git a/docs/get-started/what-is.md b/docs/get-started/what-is.md
@@ -0,0 +1,2 @@
+# What is DocArray
+
diff --git a/docs/index.md b/docs/index.md
@@ -64,6 +64,8 @@ not installing `docarray` correctly. You are probably still using an old `docarr
 :end-before: <!-- end support-pitch -->
 ```
 
+
+
 ```{toctree}
 :caption: User Guides
 :hidden: