Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression: Wrong result when there are 2 count(distinct) #9586

Closed
Tracked by #9682
NGA-TRAN opened this issue Mar 12, 2024 · 8 comments · Fixed by #9679
Closed
Tracked by #9682

Regression: Wrong result when there are 2 count(distinct) #9586

NGA-TRAN opened this issue Mar 12, 2024 · 8 comments · Fixed by #9679
Assignees
Labels
bug Something isn't working regression Something that used to work no longer does

Comments

@NGA-TRAN
Copy link
Contributor

Describe the bug

Wrong result when there are 2 count distinct in the select clause,

To Reproduce

I will share there file with @alamb because I cannot attached .parquet file here

Bug

-- BUG
SELECT  COUNT(DISTINCT host) AS servers_count, count(distinct pool) as pool_count, server_role, os, env, datacenter from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' GROUP BY server_role, os, env, datacenter;
+---------------+------------+-------------+---------+------------+------------+
| servers_count | pool_count | server_role | os      | env        | datacenter |
+---------------+------------+-------------+---------+------------+------------+
| 1             | 1          | mesg        | windows | production | mn         |
| 2             | 2          | mesg        | windows | production | va         |  -- should have 4 servers_count and 3 pool_count
+---------------+------------+-------------+---------+------------+------------+

Here are other queries running in DF CLI that tell me the right results

SELECT DISTINCT host, server_role, os, env, datacenter from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' order by datacenter;
+---------------+-------------+---------+------------+------------+
| host          | server_role | os      | env        | datacenter |
+---------------+-------------+---------+------------+------------+
| mpm-mesg1002c | mesg        | windows | production | mn         |
| vpm-mesg1001c | mesg        | windows | production | va         |
| vpm-mesg1005b | mesg        | windows | production | va         |
| vpm-mesg1008d | mesg        | windows | production | va         |
| vpm-mesg1007b | mesg        | windows | production | va         |
+---------------+-------------+---------+------------+------------+

SELECT DISTINCT pool, server_role, os, env, datacenter from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' order by datacenter;
+------+-------------+---------+------------+------------+
| pool | server_role | os      | env        | datacenter |
+------+-------------+---------+------------+------------+
| c    | mesg        | windows | production | mn         |
| b    | mesg        | windows | production | va         |
| c    | mesg        | windows | production | va         |
| d    | mesg        | windows | production | va         |
+------+-------------+---------+------------+------------+

SELECT  COUNT(DISTINCT host) AS servers_count, "server_role", "os", "env", "datacenter" from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' GROUP BY "server_role", "os", "env", "datacenter";
+---------------+-------------+---------+------------+------------+
| servers_count | server_role | os      | env        | datacenter |
+---------------+-------------+---------+------------+------------+
| 4             | mesg        | windows | production | va         |
| 1             | mesg        | windows | production | mn         |
+---------------+-------------+---------+------------+------------+

SELECT  count(distinct pool) as pool_count, "server_role", "os", "env", "datacenter" from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' GROUP BY "server_role", "os", "env", "datacenter";
+------------+-------------+---------+------------+------------+
| pool_count | server_role | os      | env        | datacenter |
+------------+-------------+---------+------------+------------+
| 1          | mesg        | windows | production | mn         |
| 3          | mesg        | windows | production | va         |
+------------+-------------+---------+------------+------------+

Expected behavior

SELECT  COUNT(DISTINCT host) AS servers_count, count(distinct pool) as pool_count, server_role, os, env, datacenter from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' GROUP BY server_role, os, env, datacenter;
+---------------+------------+-------------+---------+------------+------------+
| servers_count | pool_count | server_role | os      | env        | datacenter |
+---------------+------------+-------------+---------+------------+------------+
| 1             | 1          | mesg        | windows | production | mn         |
| 4             | 3          | mesg        | windows | production | va         |  
+---------------+------------+-------------+---------+------------+------------+

### Additional context

_No response_
@NGA-TRAN NGA-TRAN added the bug Something isn't working label Mar 12, 2024
@viirya
Copy link
Member

viirya commented Mar 12, 2024

cc @huaxingao

@alamb
Copy link
Contributor

alamb commented Mar 12, 2024

Here is the schema:

DataFusion CLI v36.0.0
❯ describe 'file.parquet';
+------------------------+-----------------------------+-------------+
| column_name            | data_type                   | is_nullable |
+------------------------+-----------------------------+-------------+
| Percent_Idle_Time      | Float64                     | YES         |
| Percent_Processor_Time | Float64                     | YES         |
| Percent_User_Time      | Float64                     | YES         |
| cpu                    | Dictionary(Int32, Utf8)     | YES         |
| datacenter             | Dictionary(Int32, Utf8)     | YES         |
| env                    | Dictionary(Int32, Utf8)     | YES         |
| host                   | Dictionary(Int32, Utf8)     | YES         |
| instance               | Dictionary(Int32, Utf8)     | YES         |
| objectname             | Dictionary(Int32, Utf8)     | YES         |
| os                     | Dictionary(Int32, Utf8)     | YES         |
| pool                   | Dictionary(Int32, Utf8)     | YES         |
| server_role            | Dictionary(Int32, Utf8)     | YES         |
| time                   | Timestamp(Nanosecond, None) | NO          |
| usage_idle             | Float64                     | YES         |
| usage_iowait           | Float64                     | YES         |
| usage_system           | Float64                     | YES         |
| usage_user             | Float64                     | YES         |
+------------------------+-----------------------------+-------------+

@alamb
Copy link
Contributor

alamb commented Mar 12, 2024

An update here is that @erratic-pattern is working on creating a reproducer we can share on this ticket

@NGA-TRAN
Copy link
Contributor Author

Many thanks to @erratic-pattern who has identified the PR that introduce:

Issue is related to this PR for array aggregate order and distinct.

the commit immediately before it works correctly with the above queries.

in arrow-datafusion:

cd datafusion-cli
git checkout 0e728fce0a1a87567979bc74ebb64951b0fd9ac8
cargo build
./target/debug/datafusion-cli -f ../bug.sql
DataFusion CLI v36.0.0
+---------------+------------+------------+
| servers_count | pool_count | datacenter |
+---------------+------------+------------+
| 1             | 1          | mn         |
| 4             | 3          | va         |
+---------------+------------+------------+

if you then try to run the same query after the above PR, you get the incorrect result:

git checkout fc84a639fca7716e529384c0e919fb90b75139da
cargo build
./target/debug/datafusion-cli -f ../bug.sql
DataFusion CLI v36.0.0
+---------------+------------+------------+
| servers_count | pool_count | datacenter |
+---------------+------------+------------+
| 1             | 1          | mn         |
| 3             | 2          | va         |
+---------------+------------+------------+
2 rows in set. Query took 0.534 seconds.

bug.sql:

SELECT  COUNT(DISTINCT host) AS servers_count, count(distinct pool) as pool_count, datacenter from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' GROUP BY datacenter;

We are working to share the file

@erratic-pattern
Copy link
Contributor

I am continuing to investigate this to see if I can create a minimal reproduction.

@alamb
Copy link
Contributor

alamb commented Mar 15, 2024

I have a theory about what is wrong

I think this change from #9234

https://github.com/apache/arrow-datafusion/pull/9234/files#diff-a292ca8deeaaff7ba19bad4adf609c476ff383db56249f75cb6aeab77e887744R245-R248

effectively skips all but the first intermediate result when combining data together

Here is the code on main, specificaly, that I think should look at all elements in the array:

https://github.com/apache/arrow-datafusion/blob/3c26e597aeacde0a5e6a51f30394d3d31c6acd96/datafusion/physical-expr/src/aggregate/count_distinct/mod.rs#L256

So to find a reproducer it would need:

  1. A GROUP BY
  2. More than 1 target partition (to trigger repartitioned group by)
  3. The input partitions don't have all the distinct values

I think the best idea here is to make a fuzz test that triggers the issue (randomly send inputs). We can potentially follow the model here:

https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/fuzz_cases/aggregate_fuzz.rs

I will try to do so later today if no one beats me to it

@alamb
Copy link
Contributor

alamb commented Mar 15, 2024

Here is a self contained reproducer with just datafusion-cli

Step 1: Create data

------
-- Create a table using UNION ALL to get 2 partitions (very important)
------
create table test as
    select * from (values('foo', 'bar', 1))
    UNION ALL
    select * from (values('foo', 'baz', 1));

------
-- Now, create a table with the same data, but column2 has type `Dictionary(Int32)` to trigger the fallback code
-----
create table test_dict as
  select
    column1,
    arrow_cast(column2, 'Dictionary(Int32, Utf8)') as "column2",
    column3
from test;

Now, run queries

-- there are two distinct values of column 2: 'bar' and 'baz'

select * from test_dict;
+---------+---------+---------+
| column1 | column2 | column3 |
+---------+---------+---------+
| foo     | bar     | 1       |
| foo     | baz     | 1       |
+---------+---------+---------+
2 rows in set. Query took 0.000 seconds.


-- but this query says there is only 1 distinct value in column2
select
  count(distinct column1),
  count(distinct column2)
FROM test_dict
GROUP BY column3;

+-----------------------------------+-----------------------------------+
| COUNT(DISTINCT test_dict.column1) | COUNT(DISTINCT test_dict.column2) |
+-----------------------------------+-----------------------------------+
| 1                                 | 1                                 |
+-----------------------------------+-----------------------------------+
1 row in set. Query took 0.002 seconds.

Running the same query against the non dictionary encoded column in test produces the correct result:

select
  count(distinct column1),
  count(distinct column2)
FROM test_dict
GROUP BY column3;

+------------------------------+------------------------------+
| COUNT(DISTINCT test.column1) | COUNT(DISTINCT test.column2) |
+------------------------------+------------------------------+
| 1                            | 2                            |
+------------------------------+------------------------------+
1 row in set. Query took 0.001 seconds.

@alamb
Copy link
Contributor

alamb commented Mar 18, 2024

I plan to fix this issue

@alamb alamb changed the title Wrong result when there are 2 count(distinct) Regression: Wrong result when there are 2 count(distinct) Mar 18, 2024
@alamb alamb added the regression Something that used to work no longer does label Mar 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working regression Something that used to work no longer does
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants