Reduce clone of Statistics in ListingTable and PartitionedFile #11802

Merged
merged 12 commits into from
Aug 6, 2024

Conversation

Rachelint
Contributor

@Rachelint Rachelint commented Aug 4, 2024

Which issue does this PR close?

Part of #11719

Rationale for this change

What changes are included in this PR?

  • Reduce the cost of cloning and dropping Statistics by wrapping it in an Arc
  • Optimize the implementation of get_statistics_with_limit (it seems some temporary vectors may exist, but I am not sure)
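
As a rough illustration of the first bullet, here is a minimal sketch of why wrapping statistics in an `Arc` makes cloning cheap. `FileStats`, `ColumnStats`, and the helper functions are simplified, hypothetical stand-ins, not DataFusion's actual `Statistics` definition.

```rust
use std::sync::Arc;

/// Hypothetical, simplified stand-in for per-column statistics.
#[derive(Clone, Debug, PartialEq)]
pub struct ColumnStats {
    pub min: Option<String>,
    pub max: Option<String>,
}

/// Hypothetical, simplified stand-in for table-level statistics.
#[derive(Clone, Debug, PartialEq)]
pub struct FileStats {
    pub num_rows: Option<usize>,
    pub columns: Vec<ColumnStats>,
}

/// Deep clone: allocates a new Vec and copies every String inside it.
pub fn deep_copy(stats: &FileStats) -> FileStats {
    stats.clone()
}

/// Shared clone: only bumps the reference count; no column data is copied.
pub fn shared_copy(stats: &Arc<FileStats>) -> Arc<FileStats> {
    Arc::clone(stats)
}

/// Builds statistics for a table with `n` columns (illustrative values only).
pub fn sample_stats(n: usize) -> FileStats {
    FileStats {
        num_rows: Some(1_000),
        columns: (0..n)
            .map(|i| ColumnStats {
                min: Some(format!("min-{}", i)),
                max: Some(format!("max-{}", i)),
            })
            .collect(),
    }
}
```

With many columns, `deep_copy` does work proportional to the column count, while `shared_copy` is a single reference-count update regardless of size.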

Are these changes tested?

By existing tests.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the core Core DataFusion crate label Aug 4, 2024
@Rachelint
Contributor Author

Rachelint commented Aug 4, 2024

According to a simple benchmark of ClickBench q0:

  • The first change makes it more than 1.30x faster (this has since been partially eliminated):
┏━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃   main ┃ reduce-clone-of-statistic ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │ 3.65ms │                    2.77ms │ +1.32x faster │
└──────────────┴────────┴───────────────────────────┴───────────────┘
  • After the second change is applied, it is more than 2.10x faster:
┏━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃   main ┃ reduce-clone-of-statistic ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │ 3.65ms │                    1.69ms │ +2.15x faster │
└──────────────┴────────┴───────────────────────────┴───────────────┘
  • The complete benchmark for clickbench_partitioned (because the Arc<Statistics> in PartitionedFile was eliminated, the final result is 1.67x faster):
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ reduce-clone-of-statistic ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     3.46ms │                    2.07ms │ +1.67x faster │
│ QQuery 1     │    55.94ms │                   55.10ms │     no change │
│ QQuery 2     │   149.88ms │                  147.99ms │     no change │
│ QQuery 3     │   165.74ms │                  161.47ms │     no change │
│ QQuery 4     │  1593.53ms │                 1582.07ms │     no change │
│ QQuery 5     │  1444.10ms │                 1414.92ms │     no change │
│ QQuery 6     │    46.93ms │                   44.35ms │ +1.06x faster │
│ QQuery 7     │    57.43ms │                   54.95ms │     no change │
│ QQuery 8     │  2268.06ms │                 2258.53ms │     no change │
│ QQuery 9     │  1871.11ms │                 1866.65ms │     no change │
│ QQuery 10    │   508.29ms │                  503.46ms │     no change │
│ QQuery 11    │   567.24ms │                  554.40ms │     no change │
│ QQuery 12    │  1635.85ms │                 1628.98ms │     no change │
│ QQuery 13    │  3178.33ms │                 3199.05ms │     no change │
│ QQuery 14    │  2377.17ms │                 2369.37ms │     no change │
│ QQuery 15    │  1791.89ms │                 1772.30ms │     no change │
│ QQuery 16    │  4722.17ms │                 4731.72ms │     no change │
│ QQuery 17    │  4641.52ms │                 4609.93ms │     no change │
│ QQuery 18    │  9375.13ms │                 9536.14ms │     no change │
│ QQuery 19    │   135.14ms │                  133.52ms │     no change │
│ QQuery 20    │  3490.32ms │                 3499.24ms │     no change │
│ QQuery 21    │  4016.60ms │                 4017.70ms │     no change │
│ QQuery 22    │  8761.18ms │                 8770.62ms │     no change │
│ QQuery 23    │ 21291.41ms │                21205.12ms │     no change │
│ QQuery 24    │  1027.87ms │                 1020.31ms │     no change │
│ QQuery 25    │   834.89ms │                  825.86ms │     no change │
│ QQuery 26    │  1201.90ms │                 1196.39ms │     no change │
│ QQuery 27    │  4920.91ms │                 4911.59ms │     no change │
│ QQuery 28    │ 21419.93ms │                21962.78ms │     no change │
│ QQuery 29    │   833.25ms │                  841.28ms │     no change │
│ QQuery 30    │  1935.39ms │                 1936.87ms │     no change │
│ QQuery 31    │  2782.50ms │                 2784.20ms │     no change │
│ QQuery 32    │ 15001.36ms │                15043.47ms │     no change │
│ QQuery 33    │  9388.50ms │                 9423.83ms │     no change │
│ QQuery 34    │  9219.85ms │                 9246.16ms │     no change │
│ QQuery 35    │  3006.75ms │                 3022.45ms │     no change │
│ QQuery 36    │   243.27ms │                  232.42ms │     no change │
│ QQuery 37    │   107.44ms │                  106.40ms │     no change │
│ QQuery 38    │   135.83ms │                  138.65ms │     no change │
│ QQuery 39    │   798.46ms │                  815.32ms │     no change │
│ QQuery 40    │    54.63ms │                   53.55ms │     no change │
│ QQuery 41    │    49.08ms │                   43.92ms │ +1.12x faster │
│ QQuery 42    │    62.14ms │                   60.26ms │     no change │
└──────────────┴────────────┴───────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                        ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (main)                        │ 147172.40ms │
│ Total Time (reduce-clone-of-statistic)   │ 147785.39ms │
│ Average Time (main)                      │   3422.61ms │
│ Average Time (reduce-clone-of-statistic) │   3436.87ms │
│ Queries Faster                           │           3 │
│ Queries Slower                           │           0 │
│ Queries with No Change                   │          40 │
└──────────────────────────────────────────┴─────────────┘

@alamb
Contributor

alamb commented Aug 4, 2024

This is very exciting and a great idea. Thank you @Rachelint

We have seen similar performance challenges cloning Statistics in InfluxDB

@Rachelint
Contributor Author

Rachelint commented Aug 4, 2024

This is very exciting and a great idea. Thank you @Rachelint

We have seen similar performance challenges cloning Statistics in InfluxDB

Glad to see that it can help!

In fact, I want to change the return type of the statistics function to Arc<Statistics> in a further PR, to reduce cloning of Statistics even more.

fn statistics(&self) -> Result<Statistics> {

But I don't know if it is OK to modify such a function in a public trait...
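
To make the idea concrete, here is a sketch of what such a signature change could look like. `StatsProvider`, `Stats`, and `CachedSource` are hypothetical names used only for illustration, and `String` stands in for the real error type; this is not the actual DataFusion trait.

```rust
use std::sync::Arc;

/// Hypothetical, simplified statistics type.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct Stats {
    pub num_rows: Option<usize>,
    pub column_mins: Vec<Option<i64>>,
}

/// Hypothetical trait showing both return styles side by side.
pub trait StatsProvider {
    /// Owned style: every call hands back a full copy.
    fn statistics_owned(&self) -> Result<Stats, String>;
    /// Shared style: callers receive a handle to one allocation.
    fn statistics_shared(&self) -> Result<Arc<Stats>, String>;
}

/// Hypothetical implementer that keeps its statistics behind an Arc.
pub struct CachedSource {
    stats: Arc<Stats>,
}

impl CachedSource {
    pub fn new(stats: Stats) -> Self {
        Self { stats: Arc::new(stats) }
    }
}

impl StatsProvider for CachedSource {
    fn statistics_owned(&self) -> Result<Stats, String> {
        // Deep copy of the inner value, including the Vec.
        Ok(self.stats.as_ref().clone())
    }

    fn statistics_shared(&self) -> Result<Arc<Stats>, String> {
        // Only the reference count changes.
        Ok(Arc::clone(&self.stats))
    }
}
```

Changing the return type of an existing public trait method would require every implementer and caller to be updated, which is the compatibility concern raised above.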

@Rachelint Rachelint changed the title Reduce clone of Statistics by using arc Reduce clone of Statistics in ListingTable Aug 4, 2024
@Rachelint Rachelint marked this pull request as ready for review August 4, 2024 14:24
@Rachelint Rachelint force-pushed the reduce-clone-of-statistic branch from 459bcfc to 28efa57 Compare August 4, 2024 15:11
@Rachelint Rachelint marked this pull request as draft August 4, 2024 16:47
@Rachelint Rachelint marked this pull request as ready for review August 5, 2024 04:36
@Rachelint Rachelint changed the title Reduce clone of Statistics in ListingTable Reduce clone of Statistics in ListingTable -- 2x faster for ClickBench Q0 Aug 5, 2024
@Rachelint Rachelint changed the title Reduce clone of Statistics in ListingTable -- 2x faster for ClickBench Q0 Reduce clone of Statistics in ListingTable Aug 5, 2024
@Rachelint
Contributor Author

More benchmarks (the ones showing no change):

--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ reduce-clone-of-statistic ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 277.12ms │                  277.33ms │     no change │
│ QQuery 2     │  46.22ms │                   42.92ms │ +1.08x faster │
│ QQuery 3     │ 108.74ms │                  107.83ms │     no change │
│ QQuery 4     │  58.80ms │                   57.93ms │     no change │
│ QQuery 5     │ 188.03ms │                  185.05ms │     no change │
│ QQuery 6     │  50.56ms │                   52.48ms │     no change │
│ QQuery 7     │ 294.15ms │                  290.95ms │     no change │
│ QQuery 8     │ 118.22ms │                  116.69ms │     no change │
│ QQuery 9     │ 218.15ms │                  218.93ms │     no change │
│ QQuery 10    │ 187.02ms │                  188.36ms │     no change │
│ QQuery 11    │  30.54ms │                   30.30ms │     no change │
│ QQuery 12    │  80.52ms │                   80.74ms │     no change │
│ QQuery 13    │ 125.80ms │                  121.68ms │     no change │
│ QQuery 14    │  74.04ms │                   72.71ms │     no change │
│ QQuery 15    │ 102.41ms │                  104.08ms │     no change │
│ QQuery 16    │  42.47ms │                   40.99ms │     no change │
│ QQuery 17    │ 265.32ms │                  264.94ms │     no change │
│ QQuery 18    │ 439.87ms │                  433.37ms │     no change │
│ QQuery 19    │ 133.63ms │                  134.33ms │     no change │
│ QQuery 20    │ 142.03ms │                  135.10ms │     no change │
│ QQuery 21    │ 289.53ms │                  285.10ms │     no change │
│ QQuery 22    │  27.15ms │                   24.94ms │ +1.09x faster │
└──────────────┴──────────┴───────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                        ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)                        │ 3300.31ms │
│ Total Time (reduce-clone-of-statistic)   │ 3266.73ms │
│ Average Time (main)                      │  150.01ms │
│ Average Time (reduce-clone-of-statistic) │  148.49ms │
│ Queries Faster                           │         2 │
│ Queries Slower                           │         0 │
│ Queries with No Change                   │        20 │
└──────────────────────────────────────────┴───────────┘

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ reduce-clone-of-statistic ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.68ms │                    0.71ms │     no change │
│ QQuery 1     │    65.11ms │                   66.44ms │     no change │
│ QQuery 2     │   160.05ms │                  158.06ms │     no change │
│ QQuery 3     │   182.27ms │                  180.71ms │     no change │
│ QQuery 4     │  1585.06ms │                 1598.65ms │     no change │
│ QQuery 5     │  1561.95ms │                 1508.21ms │     no change │
│ QQuery 6     │    56.30ms │                   56.04ms │     no change │
│ QQuery 7     │    67.08ms │                   67.35ms │     no change │
│ QQuery 8     │  2253.04ms │                 2255.98ms │     no change │
│ QQuery 9     │  1876.22ms │                 1866.22ms │     no change │
│ QQuery 10    │   542.70ms │                  529.90ms │     no change │
│ QQuery 11    │   596.05ms │                  579.83ms │     no change │
│ QQuery 12    │  1726.31ms │                 1692.74ms │     no change │
│ QQuery 13    │  3985.22ms │                 3990.07ms │     no change │
│ QQuery 14    │  2518.86ms │                 2523.15ms │     no change │
│ QQuery 15    │  1779.28ms │                 1768.23ms │     no change │
│ QQuery 16    │  4825.89ms │                 4817.11ms │     no change │
│ QQuery 17    │  4763.05ms │                 4700.93ms │     no change │
│ QQuery 18    │ 10052.79ms │                10142.51ms │     no change │
│ QQuery 19    │   143.22ms │                  144.22ms │     no change │
│ QQuery 20    │  3276.89ms │                 3262.53ms │     no change │
│ QQuery 21    │  3852.14ms │                 3813.10ms │     no change │
│ QQuery 22    │  9274.03ms │                 8843.84ms │     no change │
│ QQuery 23    │ 22849.52ms │                22499.54ms │     no change │
│ QQuery 24    │  1145.09ms │                 1115.46ms │     no change │
│ QQuery 25    │  1033.81ms │                  985.75ms │     no change │
│ QQuery 26    │  1334.39ms │                 1300.65ms │     no change │
│ QQuery 27    │  4684.21ms │                 4654.56ms │     no change │
│ QQuery 28    │ 23122.33ms │                23813.49ms │     no change │
│ QQuery 29    │   891.59ms │                  892.46ms │     no change │
│ QQuery 30    │  2017.38ms │                 2025.79ms │     no change │
│ QQuery 31    │  2832.43ms │                 2891.86ms │     no change │
│ QQuery 32    │ 15113.67ms │                15290.76ms │     no change │
│ QQuery 33    │  9394.78ms │                 9566.36ms │     no change │
│ QQuery 34    │  9355.68ms │                 9431.65ms │     no change │
│ QQuery 35    │  2988.67ms │                 3006.76ms │     no change │
│ QQuery 36    │   256.50ms │                  256.27ms │     no change │
│ QQuery 37    │   169.19ms │                  150.94ms │ +1.12x faster │
│ QQuery 38    │   153.09ms │                  150.05ms │     no change │
│ QQuery 39    │   795.92ms │                  811.69ms │     no change │
│ QQuery 40    │    58.10ms │                   60.42ms │     no change │
│ QQuery 41    │    55.76ms │                   56.48ms │     no change │
│ QQuery 42    │    68.34ms │                   69.07ms │     no change │
└──────────────┴────────────┴───────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                        ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (main)                        │ 153464.65ms │
│ Total Time (reduce-clone-of-statistic)   │ 153596.55ms │
│ Average Time (main)                      │   3568.95ms │
│ Average Time (reduce-clone-of-statistic) │   3572.01ms │
│ Queries Faster                           │           1 │
│ Queries Slower                           │           0 │
│ Queries with No Change                   │          42 │
└──────────────────────────────────────────┴─────────────┘

@Rachelint
Contributor Author

Some strange results were found during the benchmarks (see #11807).

But after pulling main and rebasing, the results became different...
I am almost sure that it is not really related to the code... So strange...

@Rachelint Rachelint force-pushed the reduce-clone-of-statistic branch from 28efa57 to 5162833 Compare August 5, 2024 06:35
@alamb alamb changed the title Reduce clone of Statistics in ListingTable Reduce clone of Statistics in ListingTable and PartitionedFile Aug 5, 2024
@alamb alamb added the api change Changes the API exposed to users of the crate label Aug 5, 2024
Contributor

@alamb alamb left a comment

Thank you very much @Rachelint -- I went through this PR carefully. I have some comments that I think could make the code better but I don't think they are necessary to merge this PR

@@ -78,10 +78,11 @@ pub struct PartitionedFile {
///
/// DataFusion relies on these statistics for planning (in particular to sort file groups),
/// so if they are incorrect, incorrect answers may result.
- pub statistics: Option<Statistics>,
+ pub statistics: Option<Arc<Statistics>>,
Contributor

@alamb alamb Aug 5, 2024

💯 This alone will likely avoid a bunch of copying

It is also an API change, so I marked the PR thusly

@@ -159,6 +160,24 @@ impl From<ObjectMeta> for PartitionedFile {
}
}

impl Default for PartitionedFile {
Contributor

Isn't this the same as #[derive(Default)]?


total_byte_size =
- add_row_stats(file_stats.total_byte_size, total_byte_size);
+ add_row_stats(file_stats.total_byte_size.clone(), total_byte_size);
Contributor

I double checked that stats here are Precision<usize> (and thus this clone is not a performance problem)

Contributor

I also made a small experiment with an alternate formulation where it might be clearer that the copy is not occurring (last commit in #11828)

) -> Precision<ScalarValue> {
match (&min_values, &min_nominee) {
(Precision::Exact(val1), Precision::Exact(val2)) if val1 > val2 => min_nominee,
min_nominee: &Precision<ScalarValue>,
Contributor

💯 to reduce this copy

Contributor Author

💯 to reduce this copy

I think an alternative may be to refactor the scalars that are expensive to clone into a cheap-to-clone implementation (like String to Arc<str>)?
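
A minimal sketch of that alternative, using two hypothetical scalar enums (not DataFusion's `ScalarValue`), to show the difference between cloning a `String` and cloning an `Arc<str>`:

```rust
use std::sync::Arc;

/// Hypothetical scalar with an owned string: cloning copies the bytes.
#[derive(Clone, Debug, PartialEq)]
pub enum OwnedScalar {
    Int(i64),
    Utf8(String),
}

/// Hypothetical scalar with a shared string: cloning bumps a reference count.
#[derive(Clone, Debug, PartialEq)]
pub enum SharedScalar {
    Int(i64),
    Utf8(Arc<str>),
}

/// Returns true when both scalars are strings backed by the same allocation.
pub fn shares_buffer(a: &SharedScalar, b: &SharedScalar) -> bool {
    match (a, b) {
        (SharedScalar::Utf8(x), SharedScalar::Utf8(y)) => Arc::ptr_eq(x, y),
        _ => false,
    }
}
```

A clone of `SharedScalar::Utf8` points at the same bytes as the original, whereas a clone of `OwnedScalar::Utf8` allocates and copies. The trade-off is that the shared form is immutable and pays for reference counting.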

Contributor

I think that would also be something interesting to pursue 💯

Comment on lines 142 to 158
partitioned_files
- .chunks(chunk_size)
- .map(|c| c.to_vec())
+ .chunks_mut(chunk_size)
+ .map(|c| c.iter_mut().map(mem::take).collect())
.collect()
}
Contributor

I think this could also be formulated with drain(), and thus avoid the need for Default: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.drain

Here is a POC of it working: #11829

    let mut chunks = Vec::with_capacity(n);

    let mut current_chunk = Vec::with_capacity(chunk_size);
    for file in partitioned_files.drain(..) {
        current_chunk.push(file);
        if current_chunk.len() == chunk_size {
            chunks.push(mem::take(&mut current_chunk));
        }
    }
    if !current_chunk.is_empty() {
        chunks.push(current_chunk)
    }
    chunks

(I don't know if this matters for performance)
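
For reference, a self-contained sketch of the same drain-based approach as a generic function. The name `split_into_chunks`, the `FileEntry` type, and the ceiling-division rule for `chunk_size` are assumptions made for illustration; the snippet above does not show how `n` and `chunk_size` are defined.

```rust
use std::mem;

/// Hypothetical file type that deliberately implements neither Clone nor Default.
#[derive(Debug, PartialEq)]
pub struct FileEntry(pub String);

/// Splits `items` into at most `n` chunks by moving elements out with `drain`,
/// so `T` needs neither `Clone` nor `Default`.
pub fn split_into_chunks<T>(mut items: Vec<T>, n: usize) -> Vec<Vec<T>> {
    if items.is_empty() || n == 0 {
        return vec![];
    }

    // Assumption: ceiling division, so that no more than `n` chunks are produced.
    let chunk_size = (items.len() + n - 1) / n;

    let mut chunks = Vec::with_capacity(n);
    let mut current_chunk = Vec::with_capacity(chunk_size);
    for item in items.drain(..) {
        current_chunk.push(item);
        if current_chunk.len() == chunk_size {
            // `mem::take` swaps in an empty Vec; only Vec<T> needs Default, not T.
            chunks.push(mem::take(&mut current_chunk));
        }
    }
    // Keep the final, partially filled chunk if there is one.
    if !current_chunk.is_empty() {
        chunks.push(current_chunk);
    }
    chunks
}
```

For example, five items split with `n = 2` give a chunk size of 3, so the result is one chunk of three items followed by one chunk of two.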

Contributor Author

@Rachelint Rachelint Aug 6, 2024

I used this to replace the chunks_mut version and saw no change in performance, but it is really good to eliminate the need for Default on PartitionedFile.

Contributor

@alamb alamb left a comment

I am running benchmarks on my test machine too but I think this looks great to me

@alamb
Contributor

alamb commented Aug 5, 2024

But after pulling main and rebasing, the results became different...
I am almost sure that it is not really related to the code... So strange...

I think there can be some significant variation in performance when we are measuring queries that take 100s of ms -- so it may be measurement noise

@alamb
Contributor

alamb commented Aug 5, 2024

I couldn't reproduce the performance improvement 🤔

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ reduce-clone-of-statistic ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.68ms │                    0.65ms │     no change │
│ QQuery 1     │    73.22ms │                   68.60ms │ +1.07x faster │
│ QQuery 2     │   127.00ms │                  121.18ms │     no change │
│ QQuery 3     │   130.57ms │                  131.48ms │     no change │
│ QQuery 4     │  1004.57ms │                  977.24ms │     no change │
│ QQuery 5     │  1048.62ms │                 1100.14ms │     no change │
│ QQuery 6     │    65.11ms │                   65.73ms │     no change │
│ QQuery 7     │    75.67ms │                   72.46ms │     no change │
│ QQuery 8     │  1482.20ms │                 1484.73ms │     no change │
│ QQuery 9     │  1368.35ms │                 1335.16ms │     no change │
│ QQuery 10    │   453.35ms │                  458.35ms │     no change │
│ QQuery 11    │   497.56ms │                  487.70ms │     no change │
│ QQuery 12    │  1157.58ms │                 1196.27ms │     no change │
│ QQuery 13    │  2351.61ms │                 2436.85ms │     no change │
│ QQuery 14    │  1583.30ms │                 1622.80ms │     no change │
│ QQuery 15    │  1108.02ms │                 1109.76ms │     no change │
│ QQuery 16    │  2930.51ms │                 2942.69ms │     no change │
│ QQuery 17    │  2865.63ms │                 2908.80ms │     no change │
│ QQuery 18    │  5855.23ms │                 5819.46ms │     no change │
│ QQuery 19    │   122.07ms │                  122.00ms │     no change │
│ QQuery 20    │  1654.79ms │                 1690.15ms │     no change │
│ QQuery 21    │  1966.00ms │                 2037.89ms │     no change │
│ QQuery 22    │  4425.05ms │                 4841.03ms │  1.09x slower │
│ QQuery 23    │ 10970.91ms │                11441.47ms │     no change │
│ QQuery 24    │   698.12ms │                  754.27ms │  1.08x slower │
│ QQuery 25    │   635.64ms │                  662.99ms │     no change │
│ QQuery 26    │   795.75ms │                  835.62ms │  1.05x slower │
│ QQuery 27    │  2506.06ms │                 2533.12ms │     no change │
│ QQuery 28    │ 15227.11ms │                15365.39ms │     no change │
│ QQuery 29    │   553.26ms │                  569.91ms │     no change │
│ QQuery 30    │  1294.37ms │                 1321.61ms │     no change │
│ QQuery 31    │  1601.24ms │                 1676.62ms │     no change │
│ QQuery 32    │  7582.42ms │                 7757.52ms │     no change │
│ QQuery 33    │  5026.69ms │                 5103.35ms │     no change │
│ QQuery 34    │  5008.33ms │                 5122.12ms │     no change │
│ QQuery 35    │  1889.92ms │                 1902.88ms │     no change │
│ QQuery 36    │   324.32ms │                  311.34ms │     no change │
│ QQuery 37    │   208.77ms │                  216.91ms │     no change │
│ QQuery 38    │   194.58ms │                  192.48ms │     no change │
│ QQuery 39    │  1028.92ms │                  994.52ms │     no change │
│ QQuery 40    │    88.58ms │                   88.15ms │     no change │
│ QQuery 41    │    80.15ms │                   79.89ms │     no change │
│ QQuery 42    │    94.55ms │                   93.66ms │     no change │
└──────────────┴────────────┴───────────────────────────┴───────────────┘

--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ reduce-clone-of-statistic ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  103.86ms │                  109.40ms │  1.05x slower │
│ QQuery 2     │   25.44ms │                   25.20ms │     no change │
│ QQuery 3     │   41.18ms │                   42.15ms │     no change │
│ QQuery 4     │   36.43ms │                   34.44ms │ +1.06x faster │
│ QQuery 5     │   62.59ms │                   62.96ms │     no change │
│ QQuery 6     │    8.65ms │                    8.53ms │     no change │
│ QQuery 7     │  117.76ms │                  121.29ms │     no change │
│ QQuery 8     │   26.31ms │                   26.38ms │     no change │
│ QQuery 9     │   63.16ms │                   63.40ms │     no change │
│ QQuery 10    │   69.42ms │                   71.37ms │     no change │
│ QQuery 11    │   64.83ms │                   65.63ms │     no change │
│ QQuery 12    │   27.29ms │                   27.44ms │     no change │
│ QQuery 13    │   40.85ms │                   41.87ms │     no change │
│ QQuery 14    │   11.45ms │                   11.31ms │     no change │
│ QQuery 15    │   21.04ms │                   21.42ms │     no change │
│ QQuery 16    │   26.00ms │                   26.18ms │     no change │
│ QQuery 17    │  104.54ms │                  102.61ms │     no change │
│ QQuery 18    │  235.03ms │                  228.68ms │     no change │
│ QQuery 19    │   29.05ms │                   27.95ms │     no change │
│ QQuery 20    │   46.78ms │                   45.58ms │     no change │
│ QQuery 21    │  171.09ms │                  173.58ms │     no change │
│ QQuery 22    │   14.21ms │                   14.10ms │     no change │
└──────────────┴───────────┴───────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                        ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main_base)                   │ 1346.95ms │
│ Total Time (reduce-clone-of-statistic)   │ 1351.45ms │
│ Average Time (main_base)                 │   61.23ms │
│ Average Time (reduce-clone-of-statistic) │   61.43ms │
│ Queries Faster                           │         1 │
│ Queries Slower                           │         1 │
│ Queries with No Change                   │        20 │
└──────────────────────────────────────────┴───────────┘

@Rachelint
Contributor Author

Rachelint commented Aug 6, 2024

@alamb It mainly improves the short queries in clickbench_partitioned, and I expect other cases to show no change. Maybe you can try the clickbench_partitioned case?

🤔 But it is strange that it gets slower in other cases (especially the longer queries in clickbench_1). I guess the reasons could be:

  • Some changes in the latest main that are not in this branch make a difference (assuming the main branch here is the latest)? The slower queries in clickbench_1 look like this.
  • Maybe measurement noise exists? The slower query in tpch_mem_sf1 may be due to this.

I am running clickbench_1 locally to try to reproduce it.

@Rachelint Rachelint force-pushed the reduce-clone-of-statistic branch 2 times, most recently from 9a8674f to ce03376 Compare August 6, 2024 07:57
@Rachelint Rachelint force-pushed the reduce-clone-of-statistic branch from ce03376 to 49ca5cb Compare August 6, 2024 07:58
@alamb
Contributor

alamb commented Aug 6, 2024

@alamb It mainly improves the short queries in clickbench_partitioned, and I expect other cases to show no change. Maybe you can try the clickbench_partitioned case?

That is my expectation too

🤔 But it is strange that it gets slower in other cases (especially the longer queries in clickbench_1). I guess the reasons could be:

  • Some changes in the latest main that are not in this branch make a difference (assuming the main branch here is the latest)? The slower queries in clickbench_1 look like this.

My script tries to control for this by comparing against git merge-base -- FWIW the script I am using is here https://github.com/alamb/datafusion-benchmarking/blob/main/compare_branch.sh

  • Maybe measurement noise exists? The slower query in tpch_mem_sf1 may be due to this.

Yes, maybe

I am running clickbench_1 locally to try to reproduce it.

Thank you. I also hope to spend some time shortly looking into this (and it looks like you have done some additional work too)

@Rachelint
Contributor Author

Rachelint commented Aug 6, 2024

I am still working on finding the reason why the long queries are slower (e.g. q22, as mentioned with the strange result above). After pulling and rebasing onto the latest main, this branch now takes 9200ms and main 8800ms...

It is almost impossible for the code change here to make such a difference (the planning stage takes less than 5ms).
And the generated plans are the same, as far as I can see in analyze.

I used perf to collect some CPU metrics, and they also seem almost the same...

  • latest main
83,416,745,608      dTLB-loads
    22,018,889      dTLB-load-misses          #    0.03% of all dTLB cache hits
<not supported>      dTLB-prefetch-misses
83,416,745,608      L1-dcache-loads
 6,315,625,925      L1-dcache-load-misses     #    7.57% of all L1-dcache hits
29,755,469,086      L1-dcache-stores

  17.828162002 seconds time elapsed

  91.404349000 seconds user
   5.162641000 seconds sys
  • this branch
83,396,296,754      dTLB-loads
    21,873,549      dTLB-load-misses          #    0.03% of all dTLB cache hits
<not supported>      dTLB-prefetch-misses
83,396,296,754      L1-dcache-loads
 6,312,881,833      L1-dcache-load-misses     #    7.57% of all L1-dcache hits
29,739,091,743      L1-dcache-stores

  18.637998137 seconds time elapsed

  94.941212000 seconds user
   5.033127000 seconds sys

@Rachelint
Contributor Author

Rachelint commented Aug 6, 2024

It is really interesting. I profiled the two branches with q22 in the clickbench_partitioned case:

sudo perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles ./target/release/dfbench-main-d clickbench  --iterations 2 --path "./benchmarks/data/hits_partitioned/" --queries-path "./benchmarks/queries/clickbench/queries.sql" -o "./result"

Then I found that this branch's bus-cycles count is higher than main's, although its total instruction count is lower, as I expected.
It means that the CPU does more memory accesses in this branch.

Then I reverted only the commit b7262c2d56c6254dcb07a227ac89f9181c4cf570, which introduced Arc<Statistics>, but kept the other commits in this PR, and q22 became as fast as on main!

It seems the Arc here leads to more memory accesses?

  • latest main
   388,784,924,806      cycles
   518,440,373,959      instructions              #    1.33  insn per cycle
    12,099,462,240      cache-references
     6,367,115,115      cache-misses              #   52.623 % of all cache refs
     2,269,797,500      bus-cycles

      18.112975321 seconds time elapsed

      89.534429000 seconds user
       5.590137000 seconds sys
  • this branch
   407,813,402,518      cycles
   513,514,196,117      instructions              #    1.26  insn per cycle
    12,041,238,783      cache-references
     6,321,844,624      cache-misses              #   52.502 % of all cache refs
     2,378,396,233      bus-cycles

      18.527597954 seconds time elapsed

      95.135616000 seconds user
       4.336611000 seconds sys
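
As a sanity check on the counters above, the relative changes can be computed directly. This is a small sketch and not part of the PR: the names `PerfCounters`, `percent_change` and `ipc` are made up for illustration, and only the numbers come from the `perf stat` output.

```rust
/// Raw counters copied from the two `perf stat` runs above.
/// (Hypothetical helper type, used only for this calculation.)
struct PerfCounters {
    cycles: u64,
    instructions: u64,
    bus_cycles: u64,
}

/// Counters for latest main.
const MAIN: PerfCounters = PerfCounters {
    cycles: 388_784_924_806,
    instructions: 518_440_373_959,
    bus_cycles: 2_269_797_500,
};

/// Counters for this branch (with `Arc<Statistic>` in `PartitionedFile`).
const BRANCH: PerfCounters = PerfCounters {
    cycles: 407_813_402_518,
    instructions: 513_514_196_117,
    bus_cycles: 2_378_396_233,
};

/// Relative change of `new` versus `old`, in percent.
fn percent_change(old: u64, new: u64) -> f64 {
    (new as f64 - old as f64) / old as f64 * 100.0
}

/// Instructions retired per cycle.
fn ipc(c: &PerfCounters) -> f64 {
    c.instructions as f64 / c.cycles as f64
}
```

With these inputs, `percent_change` gives roughly +4.9% for cycles and +4.8% for bus-cycles on this branch, and about -1% for instructions, which is consistent with the IPC drop from 1.33 to 1.26 that perf reports.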

@Rachelint Rachelint force-pushed the reduce-clone-of-statistic branch from 6993a3f to 3b393d3 Compare August 6, 2024 15:38
@Rachelint Rachelint force-pushed the reduce-clone-of-statistic branch from 3b393d3 to 56cc8ea Compare August 6, 2024 16:06
@Rachelint
Contributor Author

Rachelint commented Aug 6, 2024

@alamb I think I finally found the reason. It does not seem to be measurement noise for the long queries (such as q22 in clickbench)...

Introducing Arc<Statistic> into PartitionedFile may actually make the long queries slower. The details are above: although using Arc decreases the instruction count, it increases bus-cycles, which in the end leads to a higher cycle count (slower).

I guess it is related to the atomic operations in Arc: when the number of PartitionedFile instances becomes large, the cost of the atomics becomes non-trivial. But I am not sure, this is just a guess.
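
To make the guess concrete: every clone and every drop of an `Arc` updates a shared atomic reference count, even though the payload itself is never copied. The sketch below only illustrates that mechanism. `Statistics` and `PartitionedFile` here are simplified, hypothetical stand-ins rather than the real DataFusion types, and sharing one `Arc` across all files is an assumption made so that the count is easy to observe.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for DataFusion's `Statistics` (the real type has more fields).
#[allow(dead_code)]
#[derive(Debug, Clone, Default)]
struct Statistics {
    num_rows: Option<usize>,
    total_byte_size: Option<usize>,
}

/// Hypothetical, simplified `PartitionedFile` holding its statistics behind an `Arc`.
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct PartitionedFile {
    path: String,
    statistics: Option<Arc<Statistics>>,
}

/// Builds `n` files that all point at the same shared statistics.
fn make_files(shared: &Arc<Statistics>, n: usize) -> Vec<PartitionedFile> {
    (0..n)
        .map(|i| PartitionedFile {
            path: format!("part-{i}.parquet"),
            // `Arc::clone` copies only a pointer, but it performs one atomic
            // increment on the shared strong count.
            statistics: Some(Arc::clone(shared)),
        })
        .collect()
}
```

Building 100 files from one `Arc` raises `Arc::strong_count` to 101, and dropping the vector brings it back to 1 through 100 atomic decrements. Whether those decrements really explain the extra bus-cycles is not shown by this sketch.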

I have now removed the Arc<Statistic> from PartitionedFile so as not to hurt the long queries.

The new benchmarks follow.

@Rachelint
Contributor Author

The target case clickbench_partitioned

--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ reduce-clone-of-statistic ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     3.46ms │                    2.07ms │ +1.67x faster │
│ QQuery 1     │    55.94ms │                   55.10ms │     no change │
│ QQuery 2     │   149.88ms │                  147.99ms │     no change │
│ QQuery 3     │   165.74ms │                  161.47ms │     no change │
│ QQuery 4     │  1593.53ms │                 1582.07ms │     no change │
│ QQuery 5     │  1444.10ms │                 1414.92ms │     no change │
│ QQuery 6     │    46.93ms │                   44.35ms │ +1.06x faster │
│ QQuery 7     │    57.43ms │                   54.95ms │     no change │
│ QQuery 8     │  2268.06ms │                 2258.53ms │     no change │
│ QQuery 9     │  1871.11ms │                 1866.65ms │     no change │
│ QQuery 10    │   508.29ms │                  503.46ms │     no change │
│ QQuery 11    │   567.24ms │                  554.40ms │     no change │
│ QQuery 12    │  1635.85ms │                 1628.98ms │     no change │
│ QQuery 13    │  3178.33ms │                 3199.05ms │     no change │
│ QQuery 14    │  2377.17ms │                 2369.37ms │     no change │
│ QQuery 15    │  1791.89ms │                 1772.30ms │     no change │
│ QQuery 16    │  4722.17ms │                 4731.72ms │     no change │
│ QQuery 17    │  4641.52ms │                 4609.93ms │     no change │
│ QQuery 18    │  9375.13ms │                 9536.14ms │     no change │
│ QQuery 19    │   135.14ms │                  133.52ms │     no change │
│ QQuery 20    │  3490.32ms │                 3499.24ms │     no change │
│ QQuery 21    │  4016.60ms │                 4017.70ms │     no change │
│ QQuery 22    │  8761.18ms │                 8770.62ms │     no change │
│ QQuery 23    │ 21291.41ms │                21205.12ms │     no change │
│ QQuery 24    │  1027.87ms │                 1020.31ms │     no change │
│ QQuery 25    │   834.89ms │                  825.86ms │     no change │
│ QQuery 26    │  1201.90ms │                 1196.39ms │     no change │
│ QQuery 27    │  4920.91ms │                 4911.59ms │     no change │
│ QQuery 28    │ 21419.93ms │                21962.78ms │     no change │
│ QQuery 29    │   833.25ms │                  841.28ms │     no change │
│ QQuery 30    │  1935.39ms │                 1936.87ms │     no change │
│ QQuery 31    │  2782.50ms │                 2784.20ms │     no change │
│ QQuery 32    │ 15001.36ms │                15043.47ms │     no change │
│ QQuery 33    │  9388.50ms │                 9423.83ms │     no change │
│ QQuery 34    │  9219.85ms │                 9246.16ms │     no change │
│ QQuery 35    │  3006.75ms │                 3022.45ms │     no change │
│ QQuery 36    │   243.27ms │                  232.42ms │     no change │
│ QQuery 37    │   107.44ms │                  106.40ms │     no change │
│ QQuery 38    │   135.83ms │                  138.65ms │     no change │
│ QQuery 39    │   798.46ms │                  815.32ms │     no change │
│ QQuery 40    │    54.63ms │                   53.55ms │     no change │
│ QQuery 41    │    49.08ms │                   43.92ms │ +1.12x faster │
│ QQuery 42    │    62.14ms │                   60.26ms │     no change │
└──────────────┴────────────┴───────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                        ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (main)                        │ 147172.40ms │
│ Total Time (reduce-clone-of-statistic)   │ 147785.39ms │
│ Average Time (main)                      │   3422.61ms │
│ Average Time (reduce-clone-of-statistic) │   3436.87ms │
│ Queries Faster                           │           3 │
│ Queries Slower                           │           0 │
│ Queries with No Change                   │          40 │
└──────────────────────────────────────────┴─────────────┘

@Rachelint
Contributor Author

The other non-target cases

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃       main ┃ reduce-clone-of-statistic ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │     0.67ms │                    0.65ms │ no change │
│ QQuery 1     │    65.64ms │                   64.44ms │ no change │
│ QQuery 2     │   159.91ms │                  159.89ms │ no change │
│ QQuery 3     │   183.11ms │                  181.20ms │ no change │
│ QQuery 4     │  1603.36ms │                 1613.42ms │ no change │
│ QQuery 5     │  1520.54ms │                 1524.46ms │ no change │
│ QQuery 6     │    56.11ms │                   56.84ms │ no change │
│ QQuery 7     │    67.41ms │                   66.71ms │ no change │
│ QQuery 8     │  2257.21ms │                 2268.91ms │ no change │
│ QQuery 9     │  1880.61ms │                 1881.54ms │ no change │
│ QQuery 10    │   542.22ms │                  531.25ms │ no change │
│ QQuery 11    │   590.60ms │                  585.14ms │ no change │
│ QQuery 12    │  1694.33ms │                 1698.81ms │ no change │
│ QQuery 13    │  3323.80ms │                 3294.59ms │ no change │
│ QQuery 14    │  2525.08ms │                 2490.38ms │ no change │
│ QQuery 15    │  1774.93ms │                 1777.03ms │ no change │
│ QQuery 16    │  4806.90ms │                 4812.63ms │ no change │
│ QQuery 17    │  4742.81ms │                 4665.33ms │ no change │
│ QQuery 18    │ 10054.37ms │                 9883.42ms │ no change │
│ QQuery 19    │   148.36ms │                  145.45ms │ no change │
│ QQuery 20    │  3262.71ms │                 3240.41ms │ no change │
│ QQuery 21    │  3818.90ms │                 3793.73ms │ no change │
│ QQuery 22    │  8847.59ms │                 8840.25ms │ no change │
│ QQuery 23    │ 22621.27ms │                22558.51ms │ no change │
│ QQuery 24    │  1118.08ms │                 1115.80ms │ no change │
│ QQuery 25    │   992.32ms │                  991.33ms │ no change │
│ QQuery 26    │  1307.20ms │                 1300.77ms │ no change │
│ QQuery 27    │  4689.10ms │                 4661.18ms │ no change │
│ QQuery 28    │ 23024.40ms │                23722.41ms │ no change │
│ QQuery 29    │   890.02ms │                  894.09ms │ no change │
│ QQuery 30    │  1994.15ms │                 1998.73ms │ no change │
│ QQuery 31    │  2829.76ms │                 2821.23ms │ no change │
│ QQuery 32    │ 15160.37ms │                15177.93ms │ no change │
│ QQuery 33    │  9355.13ms │                 9327.41ms │ no change │
│ QQuery 34    │  9287.31ms │                 9305.62ms │ no change │
│ QQuery 35    │  3002.29ms │                 3019.29ms │ no change │
│ QQuery 36    │   261.70ms │                  248.65ms │ no change │
│ QQuery 37    │   156.69ms │                  152.93ms │ no change │
│ QQuery 38    │   149.18ms │                  150.53ms │ no change │
│ QQuery 39    │   811.29ms │                  808.08ms │ no change │
│ QQuery 40    │    59.28ms │                   59.07ms │ no change │
│ QQuery 41    │    55.25ms │                   56.12ms │ no change │
│ QQuery 42    │    69.43ms │                   66.93ms │ no change │
└──────────────┴────────────┴───────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                        ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (main)                        │ 151761.37ms │
│ Total Time (reduce-clone-of-statistic)   │ 152013.10ms │
│ Average Time (main)                      │   3529.33ms │
│ Average Time (reduce-clone-of-statistic) │   3535.19ms │
│ Queries Faster                           │           0 │
│ Queries Slower                           │           0 │
│ Queries with No Change                   │          43 │
└──────────────────────────────────────────┴─────────────┘

--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃     main ┃ reduce-clone-of-statistic ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1     │ 204.44ms │                  202.51ms │ no change │
│ QQuery 2     │  29.58ms │                   31.01ms │ no change │
│ QQuery 3     │  81.19ms │                   81.99ms │ no change │
│ QQuery 4     │  57.45ms │                   57.49ms │ no change │
│ QQuery 5     │ 119.86ms │                  118.53ms │ no change │
│ QQuery 6     │  12.56ms │                   12.50ms │ no change │
│ QQuery 7     │ 248.12ms │                  244.20ms │ no change │
│ QQuery 8     │  25.27ms │                   25.43ms │ no change │
│ QQuery 9     │ 115.59ms │                  116.67ms │ no change │
│ QQuery 10    │ 113.39ms │                  113.32ms │ no change │
│ QQuery 11    │  55.93ms │                   55.65ms │ no change │
│ QQuery 12    │  33.71ms │                   34.02ms │ no change │
│ QQuery 13    │  78.77ms │                   76.73ms │ no change │
│ QQuery 14    │  14.64ms │                   14.60ms │ no change │
│ QQuery 15    │  23.23ms │                   23.77ms │ no change │
│ QQuery 16    │  35.59ms │                   35.66ms │ no change │
│ QQuery 17    │ 168.67ms │                  167.96ms │ no change │
│ QQuery 18    │ 475.47ms │                  472.33ms │ no change │
│ QQuery 19    │  35.21ms │                   35.14ms │ no change │
│ QQuery 20    │  77.52ms │                   78.73ms │ no change │
│ QQuery 21    │ 277.91ms │                  274.80ms │ no change │
│ QQuery 22    │  18.82ms │                   19.37ms │ no change │
└──────────────┴──────────┴───────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                        ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)                        │ 2302.93ms │
│ Total Time (reduce-clone-of-statistic)   │ 2292.40ms │
│ Average Time (main)                      │  104.68ms │
│ Average Time (reduce-clone-of-statistic) │  104.20ms │
│ Queries Faster                           │         0 │
│ Queries Slower                           │         0 │
│ Queries with No Change                   │        22 │
└──────────────────────────────────────────┴───────────┘
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃     main ┃ reduce-clone-of-statistic ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1     │ 277.52ms │                  275.11ms │ no change │
│ QQuery 2     │  44.10ms │                   42.95ms │ no change │
│ QQuery 3     │ 108.33ms │                  106.12ms │ no change │
│ QQuery 4     │  59.08ms │                   57.71ms │ no change │
│ QQuery 5     │ 189.75ms │                  190.00ms │ no change │
│ QQuery 6     │  53.32ms │                   52.75ms │ no change │
│ QQuery 7     │ 293.43ms │                  296.44ms │ no change │
│ QQuery 8     │ 122.13ms │                  119.84ms │ no change │
│ QQuery 9     │ 220.95ms │                  221.72ms │ no change │
│ QQuery 10    │ 189.97ms │                  188.24ms │ no change │
│ QQuery 11    │  30.37ms │                   30.10ms │ no change │
│ QQuery 12    │  80.62ms │                   79.39ms │ no change │
│ QQuery 13    │ 123.44ms │                  123.03ms │ no change │
│ QQuery 14    │  74.08ms │                   74.29ms │ no change │
│ QQuery 15    │ 103.81ms │                  103.95ms │ no change │
│ QQuery 16    │  41.45ms │                   40.85ms │ no change │
│ QQuery 17    │ 267.03ms │                  266.79ms │ no change │
│ QQuery 18    │ 435.63ms │                  437.27ms │ no change │
│ QQuery 19    │ 133.49ms │                  134.07ms │ no change │
│ QQuery 20    │ 127.94ms │                  127.25ms │ no change │
│ QQuery 21    │ 285.53ms │                  287.23ms │ no change │
│ QQuery 22    │  24.96ms │                   24.01ms │ no change │
└──────────────┴──────────┴───────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                        ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)                        │ 3286.93ms │
│ Total Time (reduce-clone-of-statistic)   │ 3279.13ms │
│ Average Time (main)                      │  149.41ms │
│ Average Time (reduce-clone-of-statistic) │  149.05ms │
│ Queries Faster                           │         0 │
│ Queries Slower                           │         0 │
│ Queries with No Change                   │        22 │
└──────────────────────────────────────────┴───────────┘

@alamb
Contributor

alamb commented Aug 6, 2024

I find it very strange that Arc in statistics should show up at all in the execution times -- I would expect a query that takes seconds to run would not look at the statistics once the query started and I would expect the actual processing time to dominate 🤔

@alamb
Contributor

alamb commented Aug 6, 2024

I'll rerun and see what I can see

@Rachelint
Contributor Author

The statistics are actually not used once execution has started. I guess the slowdown may come from the drop of PartitionedFile here (PartitionedFile drop -> Arc<Statistic> drop -> atomic sub).

fn start_next_file(&mut self) -> Option<Result<(FileOpenFuture, Vec<ScalarValue>)>> {
    let part_file = self.file_iter.pop_front()?;
    let file_meta = FileMeta {
        object_meta: part_file.object_meta,
        range: part_file.range,
        extensions: part_file.extensions,
    };
    Some(
        self.file_opener
            .open(file_meta)
            .map(|future| (future, part_file.partition_values)),
    )
}

A possible alternative that may be worth trying in the future: take and drop the Arc<Statistic> in PartitionedFile before the actual execution?
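
A rough sketch of what that alternative could look like, assuming the statistics field is an `Option<Arc<..>>`. The types and the helper `release_statistics` are hypothetical and are not existing DataFusion APIs. The idea is to release every handle in one pass before execution starts, so that popping and dropping a file later does not touch the reference count.

```rust
use std::collections::VecDeque;
use std::sync::Arc;

/// Hypothetical stand-in for DataFusion's `Statistics`.
#[allow(dead_code)]
#[derive(Debug, Default)]
struct Statistics {
    num_rows: Option<usize>,
}

/// Hypothetical, simplified `PartitionedFile`.
#[allow(dead_code)]
#[derive(Debug)]
struct PartitionedFile {
    path: String,
    statistics: Option<Arc<Statistics>>,
}

/// Takes and drops each file's statistics handle up front (for example once
/// planning is finished). Returns how many handles were released.
fn release_statistics(files: &mut VecDeque<PartitionedFile>) -> usize {
    files
        .iter_mut()
        // `Option::take` leaves `None` behind and hands the `Arc` to us.
        .filter_map(|file| file.statistics.take())
        // `count` consumes the iterator, dropping every taken `Arc` here.
        .count()
}
```

After `release_statistics` runs, each file's `statistics` field is `None`, so the later `pop_front` in the file stream would drop plain data only. This moves the atomic decrements rather than removing them, so it would still need to be benchmarked.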

@alamb alamb removed the api change Changes the API exposed to users of the crate label Aug 6, 2024
@alamb
Contributor

alamb commented Aug 6, 2024

Removed the api change label as we have now removed the Arc

I agree the timings look good now.

Would you be willing to create a new PR with just the Arc statistics changes so we can see if we see any differences there?

Thanks again @Rachelint

@alamb alamb merged commit bddb641 into apache:main Aug 6, 2024
24 checks passed
@Rachelint
Contributor Author

@alamb I found the reason finally,

Yes, I am planning to; it is really worth pursuing.

@alamb
Contributor

alamb commented Aug 8, 2024

Thanks -- filed #11885
