
Reported DataFusion performance problem #9148

Closed
alamb opened this issue Feb 7, 2024 · 6 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@alamb
Contributor

alamb commented Feb 7, 2024

Describe the bug

Reported in Discord by @mispp: https://discord.com/channels/885562378132000778/1166447479609376850/1204163621433639003

ok people, a performance question if i may... I pulled a ~400 MB parquet file of New York taxi trips for testing. I have a simple aggregation that is supposed to sum up a column called trip_time. No group by is used and everything is done via the DataFrame API.
This operation takes ~2s.
Is this expected?

I saw a video (https://youtu.be/NVKujPxwSBA?t=1589) that showed DataFusion processing several gigabytes in less than a second.

So basically the task here is to reproduce the reported performance and see if there is anything wrong or anything we could improve.
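
For reference, the reported query amounts to a single SUM over one column with no GROUP BY. A minimal sketch of the same aggregation expressed via SQL, assuming the file is registered as a table named trips (illustrative only, not the reporter's code):

use datafusion::error::Result;
use datafusion::execution::{context::SessionContext, options::ParquetReadOptions};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Register the downloaded parquet file as a table named "trips".
    ctx.register_parquet(
        "trips",
        "./fhvhv_tripdata_2023-01.parquet",
        ParquetReadOptions::default(),
    )
    .await?;

    // SUM(trip_time) over the whole file, with no GROUP BY.
    let df = ctx.sql("SELECT SUM(trip_time) FROM trips").await?;
    df.show().await?;
    Ok(())
}

Either formulation should produce the same single-row result as the DataFrame program below.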

To Reproduce

Original report: https://gist.github.com/mispp/229fdad7d70c8ab974a8f72f4bdfc43c

DataSet: https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-01.parquet

Cargo.toml

[package]
name = "perf-issue"
version = "0.1.0"
edition = "2021"

[dependencies]
tokio = { version = "1", features = ["full"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1.0"
datafusion = "34"
arrow-schema = "*"

Program:

use std::time::SystemTime;

use datafusion::{
    common::Column,
    execution::{context::SessionContext, options::ParquetReadOptions},
    logical_expr::Expr,
};

#[tokio::main]
async fn main() {
    let start = SystemTime::now();

    let ctx = SessionContext::new();
    let read_options = ParquetReadOptions {
        file_extension: ".parquet",
        table_partition_cols: vec![],
        parquet_pruning: Some(true),
        skip_metadata: Some(false),
        schema: None,
        file_sort_order: vec![],
    };

    // SUM(trip_time) with no GROUP BY expressions.
    let analysis_expressions: Vec<Expr> = vec![datafusion::logical_expr::expr_fn::sum(
        Expr::Column(Column::from_name("trip_time")),
    )];
    let group_expressions: Vec<Expr> = vec![];

    println!("just before df -> {}", start.elapsed().unwrap().as_millis());

    let df = ctx
        .read_parquet("./fhvhv_tripdata_2023-01.parquet", read_options)
        .await
        .unwrap();
    println!("reading df -> {}", start.elapsed().unwrap().as_millis());

    let df_aggregated = df
        .aggregate(group_expressions, analysis_expressions)
        .unwrap()
        .collect()
        .await;
    println!("df aggregation -> {}", start.elapsed().unwrap().as_millis());

    println!("results -> {:?}", df_aggregated);
}

Expected behavior

No response

Additional context

No response

@alamb added the "bug" label Feb 7, 2024
@alamb changed the title from "DataFusion performance problem (or optimization opportunity?)" to "Reported DataFusion performance problem" Feb 7, 2024
@alamb added the "help wanted" label Feb 7, 2024
@alamb
Contributor Author

alamb commented Feb 7, 2024

Ran this on my M3 Mac and it finished in 144ms

andrewlamb@Andrews-MacBook-Pro:~/Downloads$ ./rust_playground
just before df -> 1
reading df -> 6
df aggregation -> 144
results -> Ok([RecordBatch { schema: Schema { fields: [Field { name: "SUM(?table?.trip_time)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [PrimitiveArray<Int64>
[
  20227776240,
]], row_count: 1 }])
andrewlamb@Andrews-MacBook-Pro:~/Downloads$

When I ran the debug build, it took more like 2 seconds:

andrewlamb@Andrews-MacBook-Pro:~/Downloads$ ./rust_playground.debug
just before df -> 6
reading df -> 17
df aggregation -> 1822
results -> Ok([RecordBatch { schema: Schema { fields: [Field { name: "SUM(?table?.trip_time)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [PrimitiveArray<Int64>
[
  20227776240,
]], row_count: 1 }])

@alamb
Contributor Author

alamb commented Feb 7, 2024

So I wonder if the reporter simply didn't run with a release build

@alamb
Contributor Author

alamb commented Feb 7, 2024

I am going to try this on a less powerful Linux machine

@mispp

mispp commented Feb 7, 2024

> So I wonder if the reporter simply didn't run with a release build

No, it was a simple 'cargo run' with no parameters given. Ok, so this was the reason.

@alamb
Contributor Author

alamb commented Feb 7, 2024

Ah, got it -- I think you need to run cargo run --release to get good performance

Thanks again for the report @mispp

Closing this one down as I think we have found the root cause
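
A related option: Cargo can also build just the dependencies with optimizations while the user's own crate stays in a fast debug build. A minimal Cargo.toml sketch, assuming the reproduction's Cargo.toml above:

# Hypothetical addition to the reproduction's Cargo.toml: compile all
# dependencies (including datafusion) with optimizations even in dev builds.
[profile.dev.package."*"]
opt-level = 3

Since nearly all of the work in this query happens inside DataFusion and Arrow, this alone would likely recover most of the release-build speed.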

@alamb closed this as completed Feb 7, 2024
@alamb
Contributor Author

alamb commented Feb 7, 2024

This is consistent on my less powerful Linux machine too:

alamb@aal-dev:~/rust_playground$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.10s
     Running `target/debug/rust_playground`
just before df -> 0
reading df -> 2
df aggregation -> 2758
results -> Ok([RecordBatch { schema: Schema { fields: [Field { name: "SUM(?table?.trip_time)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [PrimitiveArray<Int64>
[
  20227776240,
]], row_count: 1 }])
alamb@aal-dev:~/rust_playground$
alamb@aal-dev:~/rust_playground$ cargo build --release
    Finished release [optimized] target(s) in 0.11s
alamb@aal-dev:~/rust_playground$ cargo run --release
    Finished release [optimized] target(s) in 0.10s
     Running `target/release/rust_playground`
just before df -> 0
reading df -> 0
df aggregation -> 185
results -> Ok([RecordBatch { schema: Schema { fields: [Field { name: "SUM(?table?.trip_time)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [PrimitiveArray<Int64>
[
  20227776240,
]], row_count: 1 }])
