From d55d0d75d612039ddbfd5291f52cfe246ec2ff99 Mon Sep 17 00:00:00 2001 From: Indu Date: Mon, 11 Jun 2018 18:06:15 +0000 Subject: [PATCH 01/12] Add first draft of profiler tutorial --- docs/tutorials/python/profiler.md | 185 ++++++++++++++++++++++++++++++ 1 file changed, 185 insertions(+) create mode 100644 docs/tutorials/python/profiler.md diff --git a/docs/tutorials/python/profiler.md b/docs/tutorials/python/profiler.md new file mode 100644 index 000000000000..9c43413b499d --- /dev/null +++ b/docs/tutorials/python/profiler.md @@ -0,0 +1,185 @@ +# Profiling MXNet Models + +It is often helpful to understand what operations take how much time while running a model. This helps optimize the model to run faster. In this tutorial, we will learn how to profile MXNet models to measure their running time and memory consumption using the MXNet profiler. + +## The incorrect way to profile + +If you have just begun using MXNet, you might be tempted to measure execution time of your model using Python's `time` module like shown below: + +``` +from time import time +from mxnet import autograd, nd +import mxnet as mx + +start = time() +x = nd.random_uniform(shape=(2000,2000)) +y = nd.dot(x, x) +print('Time for matrix multiplication: %f sec\n' % (time() - start)) + +start = time() +print(y.asnumpy()) +print('Time for printing the output: %f sec' % (time() - start)) +``` + +Time for matrix multiplication: 0.005051 sec + +[[501.1584 508.29724 495.65237 ... 492.84705 492.69092 490.0481 ] + [508.81058 507.1822 495.1743 ... 503.10526 497.29315 493.67917] + [489.56598 499.47015 490.17722 ... 490.99945 488.05008 483.28836] + ... + [484.0019 495.7179 479.92142 ... 493.69952 478.89194 487.2074 ] + [499.64932 507.65094 497.5938 ... 493.0474 500.74512 495.82712] + [516.0143 519.1715 506.354 ... 510.08878 496.35608 495.42523]] +Time for printing the output: 0.167693 sec + +From the output above, it seems as if printing the output takes lot more time that multiplying two large matrices. 
That doesn't feel right. + +This is because in MXNet, all operations are executed asynchronously. So, when `nd.dot(x, x)` returns, the matrix multiplication is not complete, it has only been queued for execution. `asnumpy` in `print(y.asnumpy())` however waits for the result to be computed and hence takes longer time. + +## The correct way to profile + +The correct way to measure running time of MXNet models is to use MXNet profiler. In the rest of this tutorial, we will learn how to use the MXNet profiler to measure the running time and memory consumption of MXNet models. + +To use the profiler, you need to build MXNet with `USE_PROFILER` enabled. For example this command will build the CPU version of MXNet on Linux, + +``` +make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_PROFILER=1 +``` + +Check [this](http://mxnet.incubator.apache.org/install/index.html?device=Linux&language=Python&processor=CPU) page for more information on building from source for various environments. + +After building with `USE_PROFILER=True` and installing, you can import the profiler and configure it from Python code. + +``` +from mxnet import profiler +profiler.set_config(profile_all=True, aggregate_stats=True, filename='profile_output.json') +``` + +`profile_all` enables all types of profiling. You can also individually enable the following types of profiling: + +- `profile_symbolic` (boolean): whether to profile symbolic operators +- `profile_imperative` (boolean): whether to profile imperative operators +- `profile_memory` (boolean): whether to profile memory usage +- `profile_api` (boolean): whether to profile the C API + +`aggregate_stats` aggregates statistics in memory which can then be printed to console by calling `profiler.dumps()`. + +### Setup: Build a model + +Let's build a small convolutional neural network that we can use for profiling. 
+
+```
+from mxnet import gluon
+net = gluon.nn.HybridSequential()
+with net.name_scope():
+    net.add(gluon.nn.Conv2D(channels=20, kernel_size=5, activation='relu'))
+    net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2))
+    net.add(gluon.nn.Conv2D(channels=50, kernel_size=5, activation='relu'))
+    net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2))
+    net.add(gluon.nn.Flatten())
+    net.add(gluon.nn.Dense(512, activation="relu"))
+    net.add(gluon.nn.Dense(10))
+```
+
+We need data that we can run through the network for profiling. We'll use the MNIST dataset.
+
+```
+from mxnet.gluon.data.vision import transforms
+train_data = gluon.data.DataLoader(gluon.data.vision.MNIST(train=True).transform_first(transforms.ToTensor()),
+                                   batch_size=64, shuffle=True)
+```
+
+Let's define a method that will run one training iteration given data and label.
+
+```
+# Use GPU is available
+ctx = mx.gpu() if mx.test_utils.list_gpus() else mx.cpu()
+
+# Initialize the parameters with random weights
+net.collect_params().initialize(mx.init.Xavier(), ctx=ctx)
+
+# Use SGD optimizer
+trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .1})
+
+# Softmax Cross Entropy is a frequently used loss function for multi-class classification
+softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
+
+# A helper function to run one training iteration
+def run_training_iteration(data, label):
+
+    # Load data and label into the right context
+    data = data.as_in_context(ctx)
+    label = label.as_in_context(ctx)
+
+    # Run the forward pass
+    with autograd.record():
+        output = net(data)
+        loss = softmax_cross_entropy(output, label)
+
+    # Run the backward pass
+    loss.backward()
+
+    # Apply changes to parameters
+    trainer.step(data.shape[0])
+```
+
+### Starting and stopping the profiler
+
+When the first forward pass is run on a network, MXNet does a number of housekeeping tasks including infering the shapes of various parameters, allocating memory for intermediate and final outputs, etc.
For these reasons, profiling the first iteration doesn't provide accurate results. We will therefore skip the first iteration.
+
+```
+# Run the first iteration without profiling
+itr = iter(train_data)
+run_training_iteration(*next(itr))
+```
+
+We'll run the next iteration with the profiler turned on.
+
+```
+data, label = next(itr)
+
+# Ask the profiler to start recording
+profiler.set_state('run')
+
+run_training_iteration(data, label)
+
+# Ask the profiler to stop recording
+profiler.set_state('stop')
+```
+
+Between running and stopping the profiler, you can also pause and resume the profiler using `profiler.pause()` and `profiler.resume()` respectively to profile only parts of code you want to profile.
+
+### Viewing profiler output
+
+There are two ways to view the information collected by the profiler. You can either view it in console or you can view a more graphical version in a browser.
+
+#### 1. View in console
+
+You can use the `profiler.dumps()` method to view the information collected by the profiler in the console. The collected information contains time taken by each operator, time taken by each C API and memory consumed in both CPU and GPU.
+
+```
+print(profiler.dumps())
+```
+
+![Profile Statistics](profile_stats.png)
+
+#### 2. View in browser
+
+You can also dump the information collected by the profiler into a `json` file using the `profiler.dump()` function and view it in a browser.
+
+```
+profiler.dump()
+```
+
+`dump()` creates a `json` file which can be viewed using a trace consumer like `chrome://tracing` in the Chrome browser. Here is a snapshot that shows the output of the profiling we did above.
+
+![Tracing Screenshot](profiler_output_chrome.png)
+
+Let's zoom in to check the time taken by operators
+
+![Operator profiling](profile_operators.png)
+
+The above picture visualizes the sequence in which the operators were executed and the time taken by each operator.
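Besides loading the `json` file into `chrome://tracing`, the same trace can be inspected programmatically. The sketch below is only an illustration and runs on synthetic data, not real profiler output: it assumes the Chrome trace event layout (a top-level `traceEvents` list of complete `"ph": "X"` events carrying a `name` and a `dur` in microseconds), which may differ from what the profiler actually emits.

```python
import json
from collections import defaultdict

def summarize_trace(trace_json):
    """Total up the 'dur' field (microseconds) per event name.

    Assumes Chrome-trace 'complete' events: dicts with 'ph' == 'X',
    a 'name', and a 'dur' in microseconds, under a 'traceEvents' key.
    """
    totals = defaultdict(int)
    for event in json.loads(trace_json).get('traceEvents', []):
        if event.get('ph') == 'X':
            totals[event['name']] += event.get('dur', 0)
    return dict(totals)

# Synthetic stand-in for the dumped trace file -- not real profiler output.
sample = '''{"traceEvents": [
  {"name": "Convolution", "ph": "X", "ts": 0,   "dur": 120},
  {"name": "Activation",  "ph": "X", "ts": 120, "dur": 30},
  {"name": "Convolution", "ph": "X", "ts": 150, "dur": 110}
]}'''

print(summarize_trace(sample))  # {'Convolution': 230, 'Activation': 30}
```

A summary like this can answer "which operator dominates?" without opening a browser, at the cost of losing the timeline view.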
+ +If you would like to learn more about the profiler, there are more examples available [here](https://github.com/apache/incubator-mxnet/tree/master/example/profiler). + From 273797d04156b26b6f71f9172054107332ce4980 Mon Sep 17 00:00:00 2001 From: Indu Date: Mon, 11 Jun 2018 18:33:34 +0000 Subject: [PATCH 02/12] Minor changes - Add images from web-data - Add --- docs/tutorials/python/profiler.md | 42 +++++++++++++++---------------- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/docs/tutorials/python/profiler.md b/docs/tutorials/python/profiler.md index 9c43413b499d..39321a1f17f4 100644 --- a/docs/tutorials/python/profiler.md +++ b/docs/tutorials/python/profiler.md @@ -6,7 +6,7 @@ It is often helpful to understand what operations take how much time while runni If you have just begun using MXNet, you might be tempted to measure execution time of your model using Python's `time` module like shown below: -``` +```python from time import time from mxnet import autograd, nd import mxnet as mx @@ -21,16 +21,16 @@ print(y.asnumpy()) print('Time for printing the output: %f sec' % (time() - start)) ``` -Time for matrix multiplication: 0.005051 sec +Time for matrix multiplication: 0.005051 sec -[[501.1584 508.29724 495.65237 ... 492.84705 492.69092 490.0481 ] - [508.81058 507.1822 495.1743 ... 503.10526 497.29315 493.67917] - [489.56598 499.47015 490.17722 ... 490.99945 488.05008 483.28836] - ... - [484.0019 495.7179 479.92142 ... 493.69952 478.89194 487.2074 ] - [499.64932 507.65094 497.5938 ... 493.0474 500.74512 495.82712] - [516.0143 519.1715 506.354 ... 510.08878 496.35608 495.42523]] -Time for printing the output: 0.167693 sec +[[501.1584 508.29724 495.65237 ... 492.84705 492.69092 490.0481 ] + [508.81058 507.1822 495.1743 ... 503.10526 497.29315 493.67917] + [489.56598 499.47015 490.17722 ... 490.99945 488.05008 483.28836] + ... + [484.0019 495.7179 479.92142 ... 493.69952 478.89194 487.2074 ] + [499.64932 507.65094 497.5938 ... 
493.0474 500.74512 495.82712] + [516.0143 519.1715 506.354 ... 510.08878 496.35608 495.42523]] +Time for printing the output: 0.167693 sec From the output above, it seems as if printing the output takes lot more time that multiplying two large matrices. That doesn't feel right. @@ -50,7 +50,7 @@ Check [this](http://mxnet.incubator.apache.org/install/index.html?device=Linux&l After building with `USE_PROFILER=True` and installing, you can import the profiler and configure it from Python code. -``` +```python from mxnet import profiler profiler.set_config(profile_all=True, aggregate_stats=True, filename='profile_output.json') ``` @@ -68,7 +68,7 @@ profiler.set_config(profile_all=True, aggregate_stats=True, filename='profile_ou Let's build a small convolutional neural network that we can use for profiling. -``` +```python from mxnet import gluon net = gluon.nn.HybridSequential() with net.name_scope(): @@ -83,7 +83,7 @@ with net.name_scope(): We need data that we can run through the network for profiling. We'll use the MNIST dataset. -``` +```python from mxnet.gluon.data.vision import transforms train_data = gluon.data.DataLoader(gluon.data.vision.MNIST(train=True).transform_first(transforms.ToTensor()), batch_size=64, shuffle=True) @@ -91,7 +91,7 @@ train_data = gluon.data.DataLoader(gluon.data.vision.MNIST(train=True).transform Let's define a method that will run one training iteration given data and label. -``` +```python # Use GPU is available ctx = mx.gpu() if mx.test_utils.list_gpus() else mx.cpu() @@ -127,7 +127,7 @@ def run_training_iteration(data, label): When the first forward pass is run on a network, MXNet does a number of housekeeping tasks including infering the shapes of various parameters, allocating memory for intermediate and final outputs, etc. For these reasons, profiling the first iteration doesn't provide accurate results. We will therefore skip the first iteration. 
-``` +```python # Run the first iteration without profiling itr = iter(train_data) run_training_iteration(*next(itr)) @@ -135,7 +135,7 @@ run_training_iteration(*next(itr)) We'll run the next iteration with the profiler turned on. -``` +```python data, label = next(itr) # Ask the profiler to start recording @@ -157,27 +157,27 @@ There are two ways to view the information collected by the profiler. You can ei You can use the `profiler.dumps()` method to view the information collected by the profiler in the console. The collected information contains time taken by each operator, time taken by each C API and memory consumed in both CPU and GPU. -``` +```python print(profiler.dumps()) ``` -![Profile Statistics](profile_stats.png) +![Profile Statistics](https://github.com/dmlc/web-data/blob/master/mxnet/tutorials/python/profiler/profile_stats.png) #### 2. View in browser You can also dump the information collected by the profiler into a `json` file using the `profiler.dump()` function and view it in a browser. -``` +```python profiler.dump() ``` `dump()` creates a `json` file which can be viewed using a trace consumer like `chrome://tracing` in the Chrome browser. Here is a snapshot that shows the output of the profiling we did above. -![Tracing Screenshot](profiler_output_chrome.png) +![Tracing Screenshot](https://github.com/dmlc/web-data/blob/master/mxnet/tutorials/python/profiler/profiler_output_chrome.png) Let's zoom in to check the time taken by operators -![Operator profiling](profile_operators.png) +![Operator profiling](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/tutorials/python/profiler/profile_operators.png) The above picture visualizes the sequence in which the operators were executed and the time taken by each operator. 
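The `set_state('run')` / `set_state('stop')` pattern in the tutorial pairs naturally with a small context manager, so the profiler is always stopped even if an iteration raises. The sketch below is generic and hedged: it takes the start and stop actions as plain callables (e.g. `lambda: profiler.set_state('run')` and `lambda: profiler.set_state('stop')` from the tutorial) so it can be demonstrated without MXNet installed.

```python
from contextlib import contextmanager

@contextmanager
def profiled(start, stop):
    """Call `start` on entry and guarantee `stop` on exit, even on error."""
    start()
    try:
        yield
    finally:
        stop()

# Stand-in callables that just record the call order; with MXNet one would
# pass lambda: profiler.set_state('run') and lambda: profiler.set_state('stop').
calls = []
with profiled(lambda: calls.append('run'), lambda: calls.append('stop')):
    calls.append('work')
print(calls)  # ['run', 'work', 'stop']
```

The `finally` clause is the point of the design: a crashing training step still flushes the recorded profile instead of leaving the profiler running.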
From 3e8a9997b9e6f7ab039e67df1f97500714aa22b2 Mon Sep 17 00:00:00 2001 From: Indu Date: Mon, 11 Jun 2018 18:41:48 +0000 Subject: [PATCH 03/12] Language corrections --- docs/tutorials/python/profiler.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/tutorials/python/profiler.md b/docs/tutorials/python/profiler.md index 39321a1f17f4..d2378dcbf57c 100644 --- a/docs/tutorials/python/profiler.md +++ b/docs/tutorials/python/profiler.md @@ -4,7 +4,7 @@ It is often helpful to understand what operations take how much time while runni ## The incorrect way to profile -If you have just begun using MXNet, you might be tempted to measure execution time of your model using Python's `time` module like shown below: +If you have just begun using MXNet, you might be tempted to measure the execution time of your model using Python's `time` module like shown below: ```python from time import time @@ -34,13 +34,13 @@ Time for printing the output: 0.167693 sec From the output above, it seems as if printing the output takes lot more time that multiplying two large matrices. That doesn't feel right. -This is because in MXNet, all operations are executed asynchronously. So, when `nd.dot(x, x)` returns, the matrix multiplication is not complete, it has only been queued for execution. `asnumpy` in `print(y.asnumpy())` however waits for the result to be computed and hence takes longer time. +This is because, in MXNet, all operations are executed asynchronously. So, when `nd.dot(x, x)` returns, the matrix multiplication is not complete, it has only been queued for execution. `asnumpy` in `print(y.asnumpy())` however, waits for the result to be computed and hence takes longer time. ## The correct way to profile The correct way to measure running time of MXNet models is to use MXNet profiler. In the rest of this tutorial, we will learn how to use the MXNet profiler to measure the running time and memory consumption of MXNet models. 
-To use the profiler, you need to build MXNet with `USE_PROFILER` enabled. For example this command will build the CPU version of MXNet on Linux, +To use the profiler, you need to build MXNet with `USE_PROFILER` enabled. For example, this command will build the CPU version of MXNet on Linux, ``` make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_PROFILER=1 @@ -125,7 +125,7 @@ def run_training_iteration(data, label): ### Starting and stopping the profiler -When the first forward pass is run on a network, MXNet does a number of housekeeping tasks including infering the shapes of various parameters, allocating memory for intermediate and final outputs, etc. For these reasons, profiling the first iteration doesn't provide accurate results. We will therefore skip the first iteration. +When the first forward pass is run on a network, MXNet does a number of housekeeping tasks including inferring the shapes of various parameters, allocating memory for intermediate and final outputs, etc. For these reasons, profiling the first iteration doesn't provide accurate results. We will, therefore skip the first iteration. ```python # Run the first iteration without profiling @@ -147,11 +147,11 @@ run_training_iteration(*next(itr)) profiler.set_state('stop') ``` -Between running and stopping the profiler, you can also pause and resume the profiler using `profiler.pause()` and `profiler.resume()` respectively to profile only parts of code you want to profile. +Between running and stopping the profiler, you can also pause and resume the profiler using `profiler.pause()` and `profiler.resume()` respectively to profile only parts of the code you want to profile. ### Viewing profiler output -There are two ways to view the information collected by the profiler. You can either view it in console or you can view a more graphical version in a browser. +There are two ways to view the information collected by the profiler. 
You can either view it in the console or you can view a more graphical version in a browser. #### 1. View in console From e177c70345018273874aeea54cee80a483c1bdcf Mon Sep 17 00:00:00 2001 From: Indu Date: Wed, 13 Jun 2018 23:15:14 +0000 Subject: [PATCH 04/12] Minor changes - Fix image URLs - Fix formatting of output --- docs/tutorials/python/profiler.md | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/docs/tutorials/python/profiler.md b/docs/tutorials/python/profiler.md index d2378dcbf57c..030bec1944d8 100644 --- a/docs/tutorials/python/profiler.md +++ b/docs/tutorials/python/profiler.md @@ -21,16 +21,25 @@ print(y.asnumpy()) print('Time for printing the output: %f sec' % (time() - start)) ``` -Time for matrix multiplication: 0.005051 sec + +**Time for matrix multiplication: 0.005051 sec** [[501.1584 508.29724 495.65237 ... 492.84705 492.69092 490.0481 ] + [508.81058 507.1822 495.1743 ... 503.10526 497.29315 493.67917] + [489.56598 499.47015 490.17722 ... 490.99945 488.05008 483.28836] + ... + [484.0019 495.7179 479.92142 ... 493.69952 478.89194 487.2074 ] + [499.64932 507.65094 497.5938 ... 493.0474 500.74512 495.82712] + [516.0143 519.1715 506.354 ... 510.08878 496.35608 495.42523]] -Time for printing the output: 0.167693 sec + +**Time for printing the output: 0.167693 sec** + From the output above, it seems as if printing the output takes lot more time that multiplying two large matrices. That doesn't feel right. @@ -161,7 +170,7 @@ You can use the `profiler.dumps()` method to view the information collected by t print(profiler.dumps()) ``` -![Profile Statistics](https://github.com/dmlc/web-data/blob/master/mxnet/tutorials/python/profiler/profile_stats.png) +![Profile Statistics](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/tutorials/python/profiler/profile_stats.png) #### 2. 
View in browser @@ -173,7 +182,7 @@ profiler.dump() `dump()` creates a `json` file which can be viewed using a trace consumer like `chrome://tracing` in the Chrome browser. Here is a snapshot that shows the output of the profiling we did above. -![Tracing Screenshot](https://github.com/dmlc/web-data/blob/master/mxnet/tutorials/python/profiler/profiler_output_chrome.png) +![Tracing Screenshot](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/tutorials/python/profiler/profiler_output_chrome.png) Let's zoom in to check the time taken by operators From 823ea3ec7fe518d8b94cc9ee543d79de8aa59be5 Mon Sep 17 00:00:00 2001 From: Indu Date: Thu, 14 Jun 2018 05:52:23 +0000 Subject: [PATCH 05/12] Minor changes - Add download button. - Hide profile_stats.png in notebook. --- docs/tutorials/python/profiler.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/tutorials/python/profiler.md b/docs/tutorials/python/profiler.md index 030bec1944d8..5409b8d4bef8 100644 --- a/docs/tutorials/python/profiler.md +++ b/docs/tutorials/python/profiler.md @@ -170,7 +170,7 @@ You can use the `profiler.dumps()` method to view the information collected by t print(profiler.dumps()) ``` -![Profile Statistics](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/tutorials/python/profiler/profile_stats.png) +![Profile Statistics](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/tutorials/python/profiler/profile_stats.png) #### 2. View in browser @@ -192,3 +192,5 @@ The above picture visualizes the sequence in which the operators were executed a If you would like to learn more about the profiler, there are more examples available [here](https://github.com/apache/incubator-mxnet/tree/master/example/profiler). + + From 504139119cfa444a18ac955e2f5d1bfd6ae40a02 Mon Sep 17 00:00:00 2001 From: Indu Date: Thu, 14 Jun 2018 05:57:56 +0000 Subject: [PATCH 06/12] Add tutorial to index. 
--- docs/tutorials/index.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md index a970c0a52ef9..0890d6137225 100644 --- a/docs/tutorials/index.md +++ b/docs/tutorials/index.md @@ -42,6 +42,7 @@ Select API:  * [Inference using an ONNX model](/tutorials/onnx/inference_on_onnx_model.html) * [Fine-tuning an ONNX model on Gluon](/tutorials/onnx/fine_tuning_gluon.html) * [Visualizing Decisions of Convolutional Neural Networks](/tutorials/vision/cnn_visualization.html) + * [Profiling MXNet Models](/tutorials/python/profiler.html) * API Guides * Core APIs * NDArray From d9e6540c7d9c3d9424fa0383de382bcee3cc14b7 Mon Sep 17 00:00:00 2001 From: Indu Date: Thu, 14 Jun 2018 05:59:21 +0000 Subject: [PATCH 07/12] Add tutorial to tests. --- tests/tutorials/test_tutorials.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/tests/tutorials/test_tutorials.py b/tests/tutorials/test_tutorials.py index 0f9103dcaa9e..4741df59c8db 100644 --- a/tests/tutorials/test_tutorials.py +++ b/tests/tutorials/test_tutorials.py @@ -213,3 +213,6 @@ def test_vision_large_scale_classification(): def test_vision_cnn_visualization(): assert _test_tutorial_nb('vision/cnn_visualization') + +def test_python_profiler(): + assert _test_tutorial_nb('python/profiler') From 553211f1a67345f6cc59feb04d09254427f451bb Mon Sep 17 00:00:00 2001 From: Indu Date: Thu, 14 Jun 2018 19:10:54 +0000 Subject: [PATCH 08/12] Add a note about nd.waitall() --- docs/tutorials/python/profiler.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/tutorials/python/profiler.md b/docs/tutorials/python/profiler.md index 5409b8d4bef8..a6aaa2561f13 100644 --- a/docs/tutorials/python/profiler.md +++ b/docs/tutorials/python/profiler.md @@ -45,6 +45,8 @@ From the output above, it seems as if printing the output takes lot more time th This is because, in MXNet, all operations are executed asynchronously. 
So, when `nd.dot(x, x)` returns, the matrix multiplication is not complete, it has only been queued for execution. `asnumpy` in `print(y.asnumpy())` however, waits for the result to be computed and hence takes longer time. +While it is possible to use `NDArray.waitall()` before and after operations to get running time of operations, it is not a scalable method to measure running time of multiple sets of operations, especially in a Sequential or Hybrid network. + ## The correct way to profile The correct way to measure running time of MXNet models is to use MXNet profiler. In the rest of this tutorial, we will learn how to use the MXNet profiler to measure the running time and memory consumption of MXNet models. From 3d272a0ff07039b18400fdaee716bd3b9f3ef51e Mon Sep 17 00:00:00 2001 From: Indu Date: Fri, 15 Jun 2018 07:33:21 +0000 Subject: [PATCH 09/12] Remove the example build command. Link to installation page is sufficient. --- docs/tutorials/python/profiler.md | 10 +--------- 1 file changed, 1 insertion(+), 9 deletions(-) diff --git a/docs/tutorials/python/profiler.md b/docs/tutorials/python/profiler.md index a6aaa2561f13..f7c8fbdd32f1 100644 --- a/docs/tutorials/python/profiler.md +++ b/docs/tutorials/python/profiler.md @@ -51,15 +51,7 @@ While it is possible to use `NDArray.waitall()` before and after operations to g The correct way to measure running time of MXNet models is to use MXNet profiler. In the rest of this tutorial, we will learn how to use the MXNet profiler to measure the running time and memory consumption of MXNet models. -To use the profiler, you need to build MXNet with `USE_PROFILER` enabled. For example, this command will build the CPU version of MXNet on Linux, - -``` -make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_PROFILER=1 -``` - -Check [this](http://mxnet.incubator.apache.org/install/index.html?device=Linux&language=Python&processor=CPU) page for more information on building from source for various environments. 
- -After building with `USE_PROFILER=True` and installing, you can import the profiler and configure it from Python code. +To use the profiler, you need to build MXNet with `USE_PROFILER=1`. Check the [installation](http://mxnet.incubator.apache.org/install/index.html) page for more information on how to install MXNet from source. After building with `USE_PROFILER=1` and installing, you can import the profiler and configure it from Python code. ```python from mxnet import profiler From 8f76e284856b1793cdbe71e3087c9442416a9328 Mon Sep 17 00:00:00 2001 From: Indu Date: Fri, 15 Jun 2018 07:35:48 +0000 Subject: [PATCH 10/12] Fix typo --- docs/tutorials/python/profiler.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tutorials/python/profiler.md b/docs/tutorials/python/profiler.md index f7c8fbdd32f1..77105033782d 100644 --- a/docs/tutorials/python/profiler.md +++ b/docs/tutorials/python/profiler.md @@ -95,7 +95,7 @@ train_data = gluon.data.DataLoader(gluon.data.vision.MNIST(train=True).transform Let's define a method that will run one training iteration given data and label. 
```python -# Use GPU is available +# Use GPU if available ctx = mx.gpu() if mx.test_utils.list_gpus() else mx.cpu() # Initialize the parameters with random weights From 54eae819006cc59ceca8cc40628c379a0cdba52c Mon Sep 17 00:00:00 2001 From: Indu Date: Mon, 18 Jun 2018 19:24:42 +0000 Subject: [PATCH 11/12] Include info about env variables related to profiling --- docs/tutorials/python/profiler.md | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/docs/tutorials/python/profiler.md b/docs/tutorials/python/profiler.md index 77105033782d..0c2f8381ce15 100644 --- a/docs/tutorials/python/profiler.md +++ b/docs/tutorials/python/profiler.md @@ -126,7 +126,7 @@ def run_training_iteration(data, label): trainer.step(data.shape[0]) ``` -### Starting and stopping the profiler +### Starting and stopping the profiler from Python When the first forward pass is run on a network, MXNet does a number of housekeeping tasks including inferring the shapes of various parameters, allocating memory for intermediate and final outputs, etc. For these reasons, profiling the first iteration doesn't provide accurate results. We will, therefore skip the first iteration. @@ -152,6 +152,18 @@ profiler.set_state('stop') Between running and stopping the profiler, you can also pause and resume the profiler using `profiler.pause()` and `profiler.resume()` respectively to profile only parts of the code you want to profile. +### Starting profiler automatically using environment variable + +The method described above requires code changes to start and stop the profiler. You can also start the profiler automatically and profile the entire code without any code changes using the `MXNET_PROFILER_AUTOSTART` environment variable. + +MXNet will start the profiler automatically if you run your code with the environment variable `MXNET_PROFILER_AUTOSTART` set to `1`. The profiler output is stored into `profile.json` in the current directory. 
+ +Note that the profiler output could be large depending on your code. It might be helpful to profile only sections of your code using the `set_state` API described in the previous section. + +### Increasing granularity of the profiler output + +MXNet executes computation graphs in 'bulk mode' which reduces kernel launch gaps in between symbolic operators for faster execution. This could reduce the granularity of the profiler output. If you need profiling result of every operator, please set the environment variables `MXNET_EXEC_BULK_EXEC_INFERENCE` and `MXNET_EXEC_BULK_EXEC_TRAIN` to `0` to disable the bulk execution mode. + ### Viewing profiler output There are two ways to view the information collected by the profiler. You can either view it in the console or you can view a more graphical version in a browser. From 0ae6df13ec9fc0746960b28f2f87d15d3abacfc9 Mon Sep 17 00:00:00 2001 From: Indu Date: Mon, 18 Jun 2018 20:13:52 +0000 Subject: [PATCH 12/12] Add a further reading section --- docs/tutorials/python/profiler.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/docs/tutorials/python/profiler.md b/docs/tutorials/python/profiler.md index 0c2f8381ce15..46fe23bbafe0 100644 --- a/docs/tutorials/python/profiler.md +++ b/docs/tutorials/python/profiler.md @@ -196,7 +196,10 @@ Let's zoom in to check the time taken by operators The above picture visualizes the sequence in which the operators were executed and the time taken by each operator. -If you would like to learn more about the profiler, there are more examples available [here](https://github.com/apache/incubator-mxnet/tree/master/example/profiler). +### Further reading + +- [Examples using MXNet profiler.](https://github.com/apache/incubator-mxnet/tree/master/example/profiler) +- [Some tips for improving MXNet performance.](https://mxnet.incubator.apache.org/faq/perf.html)
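The environment variables introduced in the last two patches can be combined when launching a script under the profiler. This is only a sketch: the variable names come from the patches above, `'train.py'` is a placeholder, and the actual launch line is left commented out.

```python
import os

# Variable names are taken from the tutorial above; '1' autostarts the
# profiler, '0' disables bulk execution so each operator is reported alone.
profiling_env = dict(
    os.environ,
    MXNET_PROFILER_AUTOSTART='1',
    MXNET_EXEC_BULK_EXEC_INFERENCE='0',
    MXNET_EXEC_BULK_EXEC_TRAIN='0',
)

# Hypothetical launch of a training script ('train.py' is a placeholder):
# import subprocess
# subprocess.run(['python', 'train.py'], env=profiling_env, check=True)
print(profiling_env['MXNET_PROFILER_AUTOSTART'])  # 1
```

Setting the variables on a child process rather than in the current shell keeps the profiled run isolated, so ordinary runs are unaffected.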