[MXNET-323] Improve performance of broadcast ops backward pass #11252

anirudh2290 · 2018-06-13T01:10:31Z

Description

This PR tries to improve the performance of broadcast ops backward pass, by caching the intermediate computations and using LaunchEx. The speedup for both forward and backward pass combined for the broadcast_add is around 1.4X. The experiments have been run on p2.8xlarge.

The below numbers are for broadcasting a tensor of shape (1,) to a tensor of shape destination shape as given below.

Destination tensor shape	Before the Change	After the Change
100, 100	0.017	0.013
250, 250	0.085	0.065
500, 500	0.35	0.25
1000, 1000	1.34	0.98
2000, 2000	5.6	3.9
12000, 12000	188	144
2**17, 10, 10	18	12

import numpy as np
import mxnet as mx
import time

a = mx.sym.var('a')
b = mx.sym.var('b')

a_ = mx.nd.ones((2**17, 10, 10))
b_ = mx.nd.ones((1))

func2 = mx.sym.broadcast_add(a, b).bind(mx.cpu(), args={'a': a_, 'b': b_}, args_grad = {'a': mx.nd.ones((2**17, 10, 10)), 'b': mx.nd.ones((1))})

for _ in range(2):
    # boadcast_add(array, array)
    start = time.time()
    for i in range(100):
        out = func2.forward(is_train=True)[0]
        func2.backward(mx.nd.ones((2**17, 10, 10)))
    mx.nd.waitall()
    print("func2: {}".format(time.time() - start))

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

…oadcast2

anirudh2290 · 2018-06-21T01:05:49Z

@piiswrong can you help take a look ?

anirudh2290 · 2018-06-28T16:55:48Z

@rahul003 @haojin2 @eric-haibin-lin @reminisce

eric-haibin-lin · 2018-07-03T18:35:08Z

src/operator/tensor/broadcast_reduce-inl.cuh

@@ -602,6 +602,11 @@ void Reduce(Stream<gpu> *s, const TBlob& small, const OpReqType req,
  ReduceImpl<Reducer, ndim, DType, OP>(stream, small, req, big, workspace, config);
 }

+template <typename Reducer, int ndim, typename DType, typename OP>
+void ReduceWithExtraMem(Stream<cpu>* s, const TBlob& small, const OpReqType req,
+                        const Tensor<cpu, 1, char>& workspace, const TBlob& big) {};


Empty implementation?

ReduceWithExtraMem is only used for the cpu implementation and prevents the build from failing.

binary_broadcast_op.h already included broadcast_reduce-inl.h. Why is it necessary to add this function here?

broadcast_reduce-inl.h includes the code in broadcast_reduce-inl.cuh or some part of code (including ReduceWithExtraMem) in broadcast_reduce-inl.h based on if CUDACC is defined. https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/broadcast_reduce-inl.h#L171. This causes build to fail when omitting ReduceWithExtraMem in broadcast_reduce-inl.cuh.

eric-haibin-lin · 2018-07-03T18:35:26Z

src/operator/tensor/elemwise_binary_broadcast_op.h

+template<typename xpu, typename LOP, typename ROP>
+inline typename std::enable_if<std::is_same<xpu, gpu>::value, void>::type
+BinaryBroadcastBackwardUseNone(const nnvm::NodeAttrs& attrs,
+                                    const OpContext& ctx,


nit: indentation

haojin2 · 2018-07-03T18:39:16Z

src/operator/tensor/broadcast_reduce-inl.h

            const Tensor<cpu, 1, char>& workspace, const TBlob& big) {
  if (req == kNullOp) return;
  Shape<ndim> rshape, rstride;
  diff(small.shape_.get<ndim>(), big.shape_.get<ndim>(), &rshape, &rstride);
  int N = small.shape_.Size(), M = rshape.Size();
  seq_reduce_compute<Reducer, ndim, DType, OP>(
+      N, M, req == kAddTo, big.dptr<DType>(), small.dptr<DType>(),
+      big.shape_.get<ndim>(), small.shape_.get<ndim>(), rshape, rstride);


nit: indentation should be 2 spaces?

haojin2 · 2018-07-03T18:39:30Z

src/operator/tensor/elemwise_binary_broadcast_op-inl.cuh

+template<typename xpu, typename LOP, typename ROP>
+inline typename std::enable_if<std::is_same<xpu, gpu>::value, void>::type
+BinaryBroadcastBackwardUseNone(const nnvm::NodeAttrs& attrs,
+                                    const OpContext& ctx,


nit: alignment of lines

eric-haibin-lin · 2018-07-04T22:26:19Z

src/operator/tensor/elemwise_binary_broadcast_op.h

@@ -544,20 +545,25 @@ void BinaryBroadcastBackwardUseNone(const nnvm::NodeAttrs& attrs,
      const TBlob out = inputs[0].reshape(new_oshape);


since this implementation is only for cpu, is it better to replace xpu with cpu inside?

eric-haibin-lin · 2018-07-04T22:26:32Z

src/operator/tensor/broadcast_reduce-inl.cuh

@@ -602,6 +602,11 @@ void Reduce(Stream<gpu> *s, const TBlob& small, const OpReqType req,
  ReduceImpl<Reducer, ndim, DType, OP>(stream, small, req, big, workspace, config);
 }

+template <typename Reducer, int ndim, typename DType, typename OP>
+void ReduceWithExtraMem(Stream<cpu>* s, const TBlob& small, const OpReqType req,
+                        const Tensor<cpu, 1, char>& workspace, const TBlob& big) {};


binary_broadcast_op.h already included broadcast_reduce-inl.h. Why is it necessary to add this function here?

anirudh2290 · 2018-07-10T23:38:37Z

@eric-haibin-lin I have addressed your comments.

…e#11252) * Fix cached broadcast * Fix * Use seq_reduce_compute logic for stable sum * Fix lint * Add declarations * Add elemwise binary broadcast op cuh file * Add license for elemwise_binary_broadcast_op-inl.cuh * Fix broadcast * Fix indentation * Use cpu and gpu instead of xpu

anirudh2290 added 8 commits June 12, 2018 23:58

Fix cached broadcast

c86c618

Fix

3e77798

Use seq_reduce_compute logic for stable sum

e09810c

Fix lint

cd29174

Add declarations

4142b92

Merge branch 'master' of https://github.com/dmlc/mxnet into cached_br…

27bb730

…oadcast2

Add elemwise binary broadcast op cuh file

8e2b4a8

Add license for elemwise_binary_broadcast_op-inl.cuh

cf481f6

anirudh2290 changed the title ~~[WIP] [MXNET-323] Improve performance of broadcast ops backward pass~~ [MXNET-323] Improve performance of broadcast ops backward pass Jun 21, 2018

eric-haibin-lin reviewed Jul 3, 2018

View reviewed changes

haojin2 reviewed Jul 3, 2018

View reviewed changes

anirudh2290 added 2 commits July 3, 2018 21:27

Fix broadcast

1cd2470

Fix indentation

c0926f6

eric-haibin-lin reviewed Jul 4, 2018

View reviewed changes

Use cpu and gpu instead of xpu

76850d9

eric-haibin-lin approved these changes Jul 12, 2018

View reviewed changes

piiswrong merged commit 32d298b into apache:master Jul 13, 2018

anirudh2290 mentioned this pull request Jul 24, 2019

Broadcasting ops are slow #8219

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MXNET-323] Improve performance of broadcast ops backward pass #11252

[MXNET-323] Improve performance of broadcast ops backward pass #11252

anirudh2290 commented Jun 13, 2018

anirudh2290 commented Jun 21, 2018

anirudh2290 commented Jun 28, 2018

eric-haibin-lin Jul 3, 2018

anirudh2290 Jul 4, 2018

eric-haibin-lin Jul 4, 2018

anirudh2290 Jul 5, 2018

eric-haibin-lin Jul 3, 2018

haojin2 Jul 3, 2018 •

edited

Loading

haojin2 Jul 3, 2018

eric-haibin-lin Jul 4, 2018

eric-haibin-lin Jul 4, 2018

anirudh2290 commented Jul 10, 2018

		@@ -544,20 +545,25 @@ void BinaryBroadcastBackwardUseNone(const nnvm::NodeAttrs& attrs,
		const TBlob out = inputs[0].reshape(new_oshape);

[MXNET-323] Improve performance of broadcast ops backward pass #11252

[MXNET-323] Improve performance of broadcast ops backward pass #11252

Conversation

anirudh2290 commented Jun 13, 2018

Description

Checklist

Essentials

anirudh2290 commented Jun 21, 2018

anirudh2290 commented Jun 28, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

haojin2 Jul 3, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

anirudh2290 commented Jul 10, 2018

haojin2 Jul 3, 2018 •

edited

Loading