Regression: Wrong result when there are 2 count(distinct) #9586

NGA-TRAN · 2024-03-12T18:48:01Z

Describe the bug

Wrong result when there are 2 count distinct in the select clause,

To Reproduce

I will share there file with @alamb because I cannot attached .parquet file here

Bug

-- BUG
SELECT  COUNT(DISTINCT host) AS servers_count, count(distinct pool) as pool_count, server_role, os, env, datacenter from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' GROUP BY server_role, os, env, datacenter;
+---------------+------------+-------------+---------+------------+------------+
| servers_count | pool_count | server_role | os      | env        | datacenter |
+---------------+------------+-------------+---------+------------+------------+
| 1             | 1          | mesg        | windows | production | mn         |
| 2             | 2          | mesg        | windows | production | va         |  -- should have 4 servers_count and 3 pool_count
+---------------+------------+-------------+---------+------------+------------+

Here are other queries running in DF CLI that tell me the right results

SELECT DISTINCT host, server_role, os, env, datacenter from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' order by datacenter;
+---------------+-------------+---------+------------+------------+
| host          | server_role | os      | env        | datacenter |
+---------------+-------------+---------+------------+------------+
| mpm-mesg1002c | mesg        | windows | production | mn         |
| vpm-mesg1001c | mesg        | windows | production | va         |
| vpm-mesg1005b | mesg        | windows | production | va         |
| vpm-mesg1008d | mesg        | windows | production | va         |
| vpm-mesg1007b | mesg        | windows | production | va         |
+---------------+-------------+---------+------------+------------+

SELECT DISTINCT pool, server_role, os, env, datacenter from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' order by datacenter;
+------+-------------+---------+------------+------------+
| pool | server_role | os      | env        | datacenter |
+------+-------------+---------+------------+------------+
| c    | mesg        | windows | production | mn         |
| b    | mesg        | windows | production | va         |
| c    | mesg        | windows | production | va         |
| d    | mesg        | windows | production | va         |
+------+-------------+---------+------------+------------+

SELECT  COUNT(DISTINCT host) AS servers_count, "server_role", "os", "env", "datacenter" from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' GROUP BY "server_role", "os", "env", "datacenter";
+---------------+-------------+---------+------------+------------+
| servers_count | server_role | os      | env        | datacenter |
+---------------+-------------+---------+------------+------------+
| 4             | mesg        | windows | production | va         |
| 1             | mesg        | windows | production | mn         |
+---------------+-------------+---------+------------+------------+

SELECT  count(distinct pool) as pool_count, "server_role", "os", "env", "datacenter" from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' GROUP BY "server_role", "os", "env", "datacenter";
+------------+-------------+---------+------------+------------+
| pool_count | server_role | os      | env        | datacenter |
+------------+-------------+---------+------------+------------+
| 1          | mesg        | windows | production | mn         |
| 3          | mesg        | windows | production | va         |
+------------+-------------+---------+------------+------------+

Expected behavior

SELECT  COUNT(DISTINCT host) AS servers_count, count(distinct pool) as pool_count, server_role, os, env, datacenter from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' GROUP BY server_role, os, env, datacenter;
+---------------+------------+-------------+---------+------------+------------+
| servers_count | pool_count | server_role | os      | env        | datacenter |
+---------------+------------+-------------+---------+------------+------------+
| 1             | 1          | mesg        | windows | production | mn         |
| 4             | 3          | mesg        | windows | production | va         |  
+---------------+------------+-------------+---------+------------+------------+


### Additional context

_No response_

The text was updated successfully, but these errors were encountered:

viirya · 2024-03-12T19:48:51Z

cc @huaxingao

alamb · 2024-03-12T20:00:40Z

Here is the schema:

DataFusion CLI v36.0.0
❯ describe 'file.parquet';
+------------------------+-----------------------------+-------------+
| column_name            | data_type                   | is_nullable |
+------------------------+-----------------------------+-------------+
| Percent_Idle_Time      | Float64                     | YES         |
| Percent_Processor_Time | Float64                     | YES         |
| Percent_User_Time      | Float64                     | YES         |
| cpu                    | Dictionary(Int32, Utf8)     | YES         |
| datacenter             | Dictionary(Int32, Utf8)     | YES         |
| env                    | Dictionary(Int32, Utf8)     | YES         |
| host                   | Dictionary(Int32, Utf8)     | YES         |
| instance               | Dictionary(Int32, Utf8)     | YES         |
| objectname             | Dictionary(Int32, Utf8)     | YES         |
| os                     | Dictionary(Int32, Utf8)     | YES         |
| pool                   | Dictionary(Int32, Utf8)     | YES         |
| server_role            | Dictionary(Int32, Utf8)     | YES         |
| time                   | Timestamp(Nanosecond, None) | NO          |
| usage_idle             | Float64                     | YES         |
| usage_iowait           | Float64                     | YES         |
| usage_system           | Float64                     | YES         |
| usage_user             | Float64                     | YES         |
+------------------------+-----------------------------+-------------+

alamb · 2024-03-12T20:02:34Z

An update here is that @erratic-pattern is working on creating a reproducer we can share on this ticket

NGA-TRAN · 2024-03-14T15:22:43Z

Many thanks to @erratic-pattern who has identified the PR that introduce:

Issue is related to this PR for array aggregate order and distinct.

the commit immediately before it works correctly with the above queries.

in arrow-datafusion:

cd datafusion-cli
git checkout 0e728fce0a1a87567979bc74ebb64951b0fd9ac8
cargo build
./target/debug/datafusion-cli -f ../bug.sql
DataFusion CLI v36.0.0
+---------------+------------+------------+
| servers_count | pool_count | datacenter |
+---------------+------------+------------+
| 1             | 1          | mn         |
| 4             | 3          | va         |
+---------------+------------+------------+

if you then try to run the same query after the above PR, you get the incorrect result:

git checkout fc84a639fca7716e529384c0e919fb90b75139da
cargo build
./target/debug/datafusion-cli -f ../bug.sql
DataFusion CLI v36.0.0
+---------------+------------+------------+
| servers_count | pool_count | datacenter |
+---------------+------------+------------+
| 1             | 1          | mn         |
| 3             | 2          | va         |
+---------------+------------+------------+
2 rows in set. Query took 0.534 seconds.

bug.sql:

SELECT  COUNT(DISTINCT host) AS servers_count, count(distinct pool) as pool_count, datacenter from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' GROUP BY datacenter;

We are working to share the file

erratic-pattern · 2024-03-14T17:56:43Z

I am continuing to investigate this to see if I can create a minimal reproduction.

alamb · 2024-03-15T10:22:59Z

I have a theory about what is wrong

I think this change from #9234

https://github.com/apache/arrow-datafusion/pull/9234/files#diff-a292ca8deeaaff7ba19bad4adf609c476ff383db56249f75cb6aeab77e887744R245-R248

effectively skips all but the first intermediate result when combining data together

Here is the code on main, specificaly, that I think should look at all elements in the array:

https://github.com/apache/arrow-datafusion/blob/3c26e597aeacde0a5e6a51f30394d3d31c6acd96/datafusion/physical-expr/src/aggregate/count_distinct/mod.rs#L256

So to find a reproducer it would need:

A GROUP BY
More than 1 target partition (to trigger repartitioned group by)
The input partitions don't have all the distinct values

I think the best idea here is to make a fuzz test that triggers the issue (randomly send inputs). We can potentially follow the model here:

https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/fuzz_cases/aggregate_fuzz.rs

I will try to do so later today if no one beats me to it

alamb · 2024-03-15T20:09:40Z

Here is a self contained reproducer with just datafusion-cli

Step 1: Create data

------
-- Create a table using UNION ALL to get 2 partitions (very important)
------
create table test as
    select * from (values('foo', 'bar', 1))
    UNION ALL
    select * from (values('foo', 'baz', 1));

------
-- Now, create a table with the same data, but column2 has type `Dictionary(Int32)` to trigger the fallback code
-----
create table test_dict as
  select
    column1,
    arrow_cast(column2, 'Dictionary(Int32, Utf8)') as "column2",
    column3
from test;

Now, run queries

-- there are two distinct values of column 2: 'bar' and 'baz'

select * from test_dict;
+---------+---------+---------+
| column1 | column2 | column3 |
+---------+---------+---------+
| foo     | bar     | 1       |
| foo     | baz     | 1       |
+---------+---------+---------+
2 rows in set. Query took 0.000 seconds.


-- but this query says there is only 1 distinct value in column2
select
  count(distinct column1),
  count(distinct column2)
FROM test_dict
GROUP BY column3;

+-----------------------------------+-----------------------------------+
| COUNT(DISTINCT test_dict.column1) | COUNT(DISTINCT test_dict.column2) |
+-----------------------------------+-----------------------------------+
| 1                                 | 1                                 |
+-----------------------------------+-----------------------------------+
1 row in set. Query took 0.002 seconds.

Running the same query against the non dictionary encoded column in test produces the correct result:

select
  count(distinct column1),
  count(distinct column2)
FROM test_dict
GROUP BY column3;

+------------------------------+------------------------------+
| COUNT(DISTINCT test.column1) | COUNT(DISTINCT test.column2) |
+------------------------------+------------------------------+
| 1                            | 2                            |
+------------------------------+------------------------------+
1 row in set. Query took 0.001 seconds.

alamb · 2024-03-18T14:45:44Z

I plan to fix this issue

NGA-TRAN added the bug Something isn't working label Mar 12, 2024

alamb mentioned this issue Mar 18, 2024

DataFusion weekly project plan (Andrew Lamb) - March 18, 2024 #9675

Closed

7 tasks

alamb self-assigned this Mar 18, 2024

alamb changed the title ~~Wrong result when there are 2 count(distinct)~~ Regression: Wrong result when there are 2 count(distinct) Mar 18, 2024

alamb added the regression Something that used to work no longer does label Mar 18, 2024

This was referenced Mar 18, 2024

Fix incorrect results with multiple COUNT(DISTINCT..) aggregates on dictionaries #9679

Merged

Add tests for filtering, grouping, aggregation of ARRAYs #9695

Merged

Release DataFusion 37.0.0 #9682

Closed

alamb closed this as completed in #9679 Mar 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regression: Wrong result when there are 2 count(distinct) #9586

Regression: Wrong result when there are 2 count(distinct) #9586

NGA-TRAN commented Mar 12, 2024

viirya commented Mar 12, 2024

alamb commented Mar 12, 2024

alamb commented Mar 12, 2024

NGA-TRAN commented Mar 14, 2024

erratic-pattern commented Mar 14, 2024

alamb commented Mar 15, 2024

alamb commented Mar 15, 2024 •

edited

Loading

alamb commented Mar 18, 2024

Regression: Wrong result when there are 2 count(distinct) #9586

Regression: Wrong result when there are 2 count(distinct) #9586

Comments

NGA-TRAN commented Mar 12, 2024

Describe the bug

To Reproduce

Bug

Here are other queries running in DF CLI that tell me the right results

Expected behavior

viirya commented Mar 12, 2024

alamb commented Mar 12, 2024

alamb commented Mar 12, 2024

NGA-TRAN commented Mar 14, 2024

erratic-pattern commented Mar 14, 2024

alamb commented Mar 15, 2024

alamb commented Mar 15, 2024 • edited Loading

alamb commented Mar 18, 2024

alamb commented Mar 15, 2024 •

edited

Loading