Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

speed up of topk-operator #10997

Merged
merged 1 commit into from
May 21, 2018
Merged

speed up of topk-operator #10997

merged 1 commit into from
May 21, 2018

Conversation

asmushetzel
Copy link
Contributor

Description

Speedup of topk-operator. Specifically

  • on CPU, avoid doing 3 full sorts and instead do a single partial_sort resp a single full sort when K is reasonably large.
  • on CPU, parallelize over the batch size
  • on GPU, have a faster methods for very small K (which only works on shared memory). This is an important special case for example in beam search applications.

It is hard to put exact criteria in such code on when to switch between different versions of sort algorithms. The CPU criteria should be reasonable stable, the GPU one as well. Tests on Volta showed that a back2back full sort will start to outperform the special small-K version for some cases only (batch size 1, k > 500,000) and not by much. We can put more logic in there and try to be smarter for such cases, but not sure that it is worth it.

The work here is driven by the fact that the performance in cases with small-K of the existing code is non-acceptable. We have an important application where we resorted actually to numpy to get around the current performance issues.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • [ x] Changes are complete (i.e. I finished coding on this PR)
  • [x ] All changes have test coverage:
  • [x ] Code is well-documented:
  • [ x] To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@asmushetzel asmushetzel force-pushed the topk2 branch 2 times, most recently from 89f5fd9 to 15474fd Compare May 18, 2018 16:59
@marcoabreu
Copy link
Contributor

Hi Asmus, thanks for the great work! Could you please elaborate on the test coverage? I don't think we have these exact test cases covered (the different values of N, K and batch sizes specifically). Would be great if you could add some tests for that

@asmushetzel
Copy link
Contributor Author

test_order() in test_operator.py tests full sort, partial sort, different choice of axis, ascending/descending, different K (including cases where K is greater/bigger than the threshold in GPUs) etc. It is quite comprehensive and IMO sufficient.

@marcoabreu marcoabreu requested review from piiswrong and szha May 21, 2018 12:57
@piiswrong piiswrong merged commit b746632 into apache:master May 21, 2018
@sxjscience
Copy link
Member

Great!

jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request May 29, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants