Slightly optimize slice::sort #39538

ghost · 2017-02-04T15:40:29Z

First, get rid of some bound checks.

Second, instead of comparing by ternary compare function, use a binary function testing whether an element is less than some other element. This apparently makes it easier for the compiler to reason about the code. I've noticed the same effect with pdqsort crate.

Benchmark:

name                                        before ns/iter        after ns/iter         diff ns/iter   diff %
slice::bench::sort_large_ascending          8,969 (8919 MB/s)     7,410 (10796 MB/s)          -1,559  -17.38%
slice::bench::sort_large_big_ascending      355,640 (3599 MB/s)   359,137 (3564 MB/s)          3,497    0.98%
slice::bench::sort_large_big_descending     427,112 (2996 MB/s)   424,721 (3013 MB/s)         -2,391   -0.56%
slice::bench::sort_large_big_random         2,207,799 (579 MB/s)  2,138,804 (598 MB/s)       -68,995   -3.13%
slice::bench::sort_large_descending         13,694 (5841 MB/s)    13,514 (5919 MB/s)            -180   -1.31%
slice::bench::sort_large_mostly_ascending   239,697 (333 MB/s)    203,542 (393 MB/s)         -36,155  -15.08%
slice::bench::sort_large_mostly_descending  270,102 (296 MB/s)    234,263 (341 MB/s)         -35,839  -13.27%
slice::bench::sort_large_random             513,406 (155 MB/s)    470,084 (170 MB/s)         -43,322   -8.44%
slice::bench::sort_large_random_expensive   23,650,321 (3 MB/s)   23,675,098 (3 MB/s)         24,777    0.10%
slice::bench::sort_medium_ascending         143 (5594 MB/s)       132 (6060 MB/s)                -11   -7.69%
slice::bench::sort_medium_descending        197 (4060 MB/s)       188 (4255 MB/s)                 -9   -4.57%
slice::bench::sort_medium_random            3,358 (238 MB/s)      3,271 (244 MB/s)               -87   -2.59%
slice::bench::sort_small_ascending          32 (2500 MB/s)        32 (2500 MB/s)                   0    0.00%
slice::bench::sort_small_big_ascending      97 (13195 MB/s)       97 (13195 MB/s)                  0    0.00%
slice::bench::sort_small_big_descending     247 (5182 MB/s)       249 (5140 MB/s)                  2    0.81%
slice::bench::sort_small_big_random         502 (2549 MB/s)       498 (2570 MB/s)                 -4   -0.80%
slice::bench::sort_small_descending         55 (1454 MB/s)        61 (1311 MB/s)                   6   10.91%
slice::bench::sort_small_random             358 (223 MB/s)        356 (224 MB/s)                  -2   -0.56%

rust-highfive · 2017-02-04T15:40:43Z

r? @alexcrichton

(rust_highfive has picked a reviewer for you, use r? to override)

First, get rid of some bound checks. Second, instead of comparing by ternary `compare` function, use a binary function testing whether an element is less than some other element. This apparently makes it easier for the compiler to reason about the code. Benchmark: ``` name before ns/iter after ns/iter diff ns/iter diff % slice::bench::sort_large_ascending 8,969 (8919 MB/s) 7,410 (10796 MB/s) -1,559 -17.38% slice::bench::sort_large_big_ascending 355,640 (3599 MB/s) 359,137 (3564 MB/s) 3,497 0.98% slice::bench::sort_large_big_descending 427,112 (2996 MB/s) 424,721 (3013 MB/s) -2,391 -0.56% slice::bench::sort_large_big_random 2,207,799 (579 MB/s) 2,138,804 (598 MB/s) -68,995 -3.13% slice::bench::sort_large_descending 13,694 (5841 MB/s) 13,514 (5919 MB/s) -180 -1.31% slice::bench::sort_large_mostly_ascending 239,697 (333 MB/s) 203,542 (393 MB/s) -36,155 -15.08% slice::bench::sort_large_mostly_descending 270,102 (296 MB/s) 234,263 (341 MB/s) -35,839 -13.27% slice::bench::sort_large_random 513,406 (155 MB/s) 470,084 (170 MB/s) -43,322 -8.44% slice::bench::sort_large_random_expensive 23,650,321 (3 MB/s) 23,675,098 (3 MB/s) 24,777 0.10% slice::bench::sort_medium_ascending 143 (5594 MB/s) 132 (6060 MB/s) -11 -7.69% slice::bench::sort_medium_descending 197 (4060 MB/s) 188 (4255 MB/s) -9 -4.57% slice::bench::sort_medium_random 3,358 (238 MB/s) 3,271 (244 MB/s) -87 -2.59% slice::bench::sort_small_ascending 32 (2500 MB/s) 32 (2500 MB/s) 0 0.00% slice::bench::sort_small_big_ascending 97 (13195 MB/s) 97 (13195 MB/s) 0 0.00% slice::bench::sort_small_big_descending 247 (5182 MB/s) 249 (5140 MB/s) 2 0.81% slice::bench::sort_small_big_random 502 (2549 MB/s) 498 (2570 MB/s) -4 -0.80% slice::bench::sort_small_descending 55 (1454 MB/s) 61 (1311 MB/s) 6 10.91% slice::bench::sort_small_random 358 (223 MB/s) 356 (224 MB/s) -2 -0.56% ```

Before, the `count` would be copied into the closure and could potentially be optimized way. This change ensures it's borrowed by closure and finally consumed by `test::black_box`.

alexcrichton · 2017-02-04T22:48:45Z

@bors: r+

Thanks!

bors · 2017-02-04T22:48:45Z

📌 Commit fa457bf has been approved by alexcrichton

…alexcrichton Slightly optimize slice::sort First, get rid of some bound checks. Second, instead of comparing by ternary `compare` function, use a binary function testing whether an element is less than some other element. This apparently makes it easier for the compiler to reason about the code. I've noticed the same effect with [pdqsort](https://github.com/stjepang/pdqsort) crate. Benchmark: ``` name before ns/iter after ns/iter diff ns/iter diff % slice::bench::sort_large_ascending 8,969 (8919 MB/s) 7,410 (10796 MB/s) -1,559 -17.38% slice::bench::sort_large_big_ascending 355,640 (3599 MB/s) 359,137 (3564 MB/s) 3,497 0.98% slice::bench::sort_large_big_descending 427,112 (2996 MB/s) 424,721 (3013 MB/s) -2,391 -0.56% slice::bench::sort_large_big_random 2,207,799 (579 MB/s) 2,138,804 (598 MB/s) -68,995 -3.13% slice::bench::sort_large_descending 13,694 (5841 MB/s) 13,514 (5919 MB/s) -180 -1.31% slice::bench::sort_large_mostly_ascending 239,697 (333 MB/s) 203,542 (393 MB/s) -36,155 -15.08% slice::bench::sort_large_mostly_descending 270,102 (296 MB/s) 234,263 (341 MB/s) -35,839 -13.27% slice::bench::sort_large_random 513,406 (155 MB/s) 470,084 (170 MB/s) -43,322 -8.44% slice::bench::sort_large_random_expensive 23,650,321 (3 MB/s) 23,675,098 (3 MB/s) 24,777 0.10% slice::bench::sort_medium_ascending 143 (5594 MB/s) 132 (6060 MB/s) -11 -7.69% slice::bench::sort_medium_descending 197 (4060 MB/s) 188 (4255 MB/s) -9 -4.57% slice::bench::sort_medium_random 3,358 (238 MB/s) 3,271 (244 MB/s) -87 -2.59% slice::bench::sort_small_ascending 32 (2500 MB/s) 32 (2500 MB/s) 0 0.00% slice::bench::sort_small_big_ascending 97 (13195 MB/s) 97 (13195 MB/s) 0 0.00% slice::bench::sort_small_big_descending 247 (5182 MB/s) 249 (5140 MB/s) 2 0.81% slice::bench::sort_small_big_random 502 (2549 MB/s) 498 (2570 MB/s) -4 -0.80% slice::bench::sort_small_descending 55 (1454 MB/s) 61 (1311 MB/s) 6 10.91% slice::bench::sort_small_random 358 (223 MB/s) 356 (224 MB/s) -2 -0.56% ```

Rollup of 12 pull requests - Successful merges: #39439, #39472, #39481, #39491, #39501, #39509, #39514, #39519, #39526, #39528, #39530, #39538 - Failed merges:

@alexcrichton

…ichton Specialize `PartialOrd<A> for [A] where A: Ord` This way we can call `cmp` instead of `partial_cmp` in the loop, removing some burden of optimizing `Option`s away from the compiler. PR #39538 introduced a regression where sorting slices suddenly became slower, since `slice1.lt(slice2)` was much slower than `slice1.cmp(slice2) == Less`. This problem is now fixed. To verify, I benchmarked this simple program: ```rust fn main() { let mut v = (0..2_000_000).map(|x| x * x * x * 18913515181).map(|x| vec![x, x ^ 3137831591]).collect::<Vec<_>>(); v.sort(); } ``` Before this PR, it would take 0.95 sec, and now it takes 0.58 sec. I also tried changing the `is_less` lambda to use `cmp` and `partial_cmp`. Now all three versions (`lt`, `cmp`, `partial_cmp`) are equally performant for sorting slices - all of them take 0.58 sec on the benchmark. Tangentially, as soon as we get `default impl`, it might be a good idea to implement a blanket default impl for `lt`, `gt`, `le`, `ge` in terms of `cmp` whenever possible. Today, those four functions by default are only implemented in terms of `partial_cmp`. r? @alexcrichton

@alexcrichton

…d, r=alexcrichton Specialize `PartialOrd<A> for [A] where A: Ord` This way we can call `cmp` instead of `partial_cmp` in the loop, removing some burden of optimizing `Option`s away from the compiler. PR rust-lang#39538 introduced a regression where sorting slices suddenly became slower, since `slice1.lt(slice2)` was much slower than `slice1.cmp(slice2) == Less`. This problem is now fixed. To verify, I benchmarked this simple program: ```rust fn main() { let mut v = (0..2_000_000).map(|x| x * x * x * 18913515181).map(|x| vec![x, x ^ 3137831591]).collect::<Vec<_>>(); v.sort(); } ``` Before this PR, it would take 0.95 sec, and now it takes 0.58 sec. I also tried changing the `is_less` lambda to use `cmp` and `partial_cmp`. Now all three versions (`lt`, `cmp`, `partial_cmp`) are equally performant for sorting slices - all of them take 0.58 sec on the benchmark. Tangentially, as soon as we get `default impl`, it might be a good idea to implement a blanket default impl for `lt`, `gt`, `le`, `ge` in terms of `cmp` whenever possible. Today, those four functions by default are only implemented in terms of `partial_cmp`. r? @alexcrichton

@alexcrichton

Implement feature sort_unstable Tracking issue for the feature: #40585 This is essentially integration of [pdqsort](https://github.com/stjepang/pdqsort) into libcore. There's plenty of unsafe blocks to review. The heart of pdqsort is `fn partition_in_blocks` and is probably the most challenging function to understand. It requires some patience, but let me know if you find it too difficult - comments could always be improved. #### Changes * Added `sort_unstable` feature. * Tweaked insertion sort constants for stable sort. Sorting integers is now up to 5% slower, but sorting big elements is much faster (in particular, `sort_large_big_random` is 35% faster). The old constants were highly optimized for sorting integers, so overall the configuration is more balanced now. A minor regression in case of integers is forgivable as we recently had performance improvements (#39538) that completely make up for it. * Removed some uninteresting sort benchmarks. * Added a new sort benchmark for string sorting. #### Benchmarks The following table compares stable and unstable sorting: ``` name stable ns/iter unstable ns/iter diff ns/iter diff % slice::sort_large_ascending 7,240 (11049 MB/s) 7,380 (10840 MB/s) 140 1.93% slice::sort_large_big_random 1,454,138 (880 MB/s) 910,269 (1406 MB/s) -543,869 -37.40% slice::sort_large_descending 13,450 (5947 MB/s) 10,895 (7342 MB/s) -2,555 -19.00% slice::sort_large_mostly_ascending 204,041 (392 MB/s) 88,639 (902 MB/s) -115,402 -56.56% slice::sort_large_mostly_descending 217,109 (368 MB/s) 99,009 (808 MB/s) -118,100 -54.40% slice::sort_large_random 477,257 (167 MB/s) 346,028 (231 MB/s) -131,229 -27.50% slice::sort_large_random_expensive 21,670,537 (3 MB/s) 22,710,238 (3 MB/s) 1,039,701 4.80% slice::sort_large_strings 6,284,499 (38 MB/s) 6,410,896 (37 MB/s) 126,397 2.01% slice::sort_medium_random 3,515 (227 MB/s) 3,327 (240 MB/s) -188 -5.35% slice::sort_small_ascending 42 (1904 MB/s) 41 (1951 MB/s) -1 -2.38% slice::sort_small_big_random 503 (2544 MB/s) 514 (2490 MB/s) 11 2.19% slice::sort_small_descending 72 (1111 MB/s) 69 (1159 MB/s) -3 -4.17% slice::sort_small_random 369 (216 MB/s) 367 (217 MB/s) -2 -0.54% ``` Interesting cases: * Expensive comparison function and string sorting - it's a really close race, but timsort performs a slightly smaller number of comparisons. This is a natural difference of bottom-up merging versus top-down partitioning. * `large_descending` - unstable sort is faster, but both sorts should have equivalent performance. Both just check whether the slice is descending and if so, they reverse it. I blame LLVM for the discrepancy. r? @alexcrichton

rust-highfive assigned alexcrichton Feb 4, 2017

Stjepan Glavina added 2 commits February 4, 2017 16:44

Minor fix in the *_expensive benchmark

fa457bf

Before, the `count` would be copied into the closure and could potentially be optimized way. This change ensures it's borrowed by closure and finally consumed by `test::black_box`.

frewsxcv mentioned this pull request Feb 5, 2017

Rollup of 12 pull requests #39565

Closed

This was referenced Feb 5, 2017

Rollup of 12 pull requests #39566

Closed

Rollup of 12 pull requests #39567

Merged

bors added a commit that referenced this pull request Feb 5, 2017

Auto merge of #39567 - frewsxcv:rollup, r=frewsxcv

031c116

Rollup of 12 pull requests - Successful merges: #39439, #39472, #39481, #39491, #39501, #39509, #39514, #39519, #39526, #39528, #39530, #39538 - Failed merges:

bors merged commit fa457bf into rust-lang:master Feb 5, 2017

ghost deleted the slightly-optimize-sort branch February 5, 2017 22:23

cuviper mentioned this pull request Feb 8, 2017

Add clamp function rust-num/num#262

Merged

ghost mentioned this pull request Feb 8, 2017

Specialize PartialOrd<A> for [A] where A: Ord #39642

Merged

alexcrichton mentioned this pull request Mar 1, 2017

Regression in done-0.0.0-reserve on Rust 1.17 #40192

Closed

ghost mentioned this pull request Mar 17, 2017

Implement feature sort_unstable #40601

Merged

TimNN mentioned this pull request Apr 13, 2017

Should PartialOrd and Ord impls be equivalent? #41270

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slightly optimize slice::sort #39538

Slightly optimize slice::sort #39538

ghost commented Feb 4, 2017 •

edited by ghost

Loading

rust-highfive commented Feb 4, 2017

alexcrichton commented Feb 4, 2017

bors commented Feb 4, 2017

Slightly optimize slice::sort #39538

Slightly optimize slice::sort #39538

Conversation

ghost commented Feb 4, 2017 • edited by ghost Loading

rust-highfive commented Feb 4, 2017

alexcrichton commented Feb 4, 2017

bors commented Feb 4, 2017

ghost commented Feb 4, 2017 •

edited by ghost

Loading