Preserve constant values across union operations #13805

gokselk · 2024-12-17T06:24:28Z

Which issue does this PR close?

Rationale for this change

Currently, DataFusion doesn't preserve constant values across union operations even when both sides have the same constant value. This change enables better optimization by tracking and preserving constant values when they match.

What changes are included in this PR?

Added value: Option<ScalarValue> field to ConstExpr
Added methods to get/set constant values
Modified union operation logic to preserve matching constant values
Updated equality comparison for ConstExpr
Added tests for constant value preservation in unions
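
To make the first two items concrete, here is a minimal sketch of a constant expression that optionally carries its value. The types below are simplified stand-ins written for illustration (a `String` instead of a physical expression, a two-variant `ScalarValue`); they are assumptions and not the actual DataFusion definitions.

```rust
// Simplified stand-ins for illustration only; the real DataFusion types are richer.
#[derive(Debug, Clone, PartialEq)]
pub enum ScalarValue {
    Int32(i32),
    Utf8(String),
}

/// A (hypothetical, simplified) expression known to be constant.
#[derive(Debug, Clone, PartialEq)]
pub struct ConstExpr {
    /// Name of the constant expression (stand-in for a physical expression).
    expr: String,
    /// Whether the expression has the same value in every partition.
    across_partitions: bool,
    /// The constant value, if it is known.
    value: Option<ScalarValue>,
}

impl ConstExpr {
    /// Creates a constant expression whose value is not known.
    pub fn new(expr: impl Into<String>) -> Self {
        Self {
            expr: expr.into(),
            across_partitions: false,
            value: None,
        }
    }

    /// Sets the `across_partitions` flag.
    pub fn with_across_partitions(mut self, across_partitions: bool) -> Self {
        self.across_partitions = across_partitions;
        self
    }

    /// Records the known constant value.
    pub fn with_value(mut self, value: ScalarValue) -> Self {
        self.value = Some(value);
        self
    }

    /// Returns the known constant value, if any.
    pub fn value(&self) -> Option<&ScalarValue> {
        self.value.as_ref()
    }

    pub fn expr(&self) -> &str {
        &self.expr
    }

    pub fn across_partitions(&self) -> bool {
        self.across_partitions
    }
}
```

Because the derived `PartialEq` compares all fields, two constants on the same expression with different known values compare as unequal in this sketch.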

Are these changes tested?

Yes, added new test case test_union_constant_value_preservation that verifies constant value preservation across unions.

Are there any user-facing changes?

No user-facing changes. This is an internal optimization improvement.

gokselk · 2024-12-17T06:25:40Z

cc: @berkaysynnada @ozankabak

berkaysynnada

I have just one suggestion, otherwise LGTM

berkaysynnada · 2024-12-17T06:51:27Z

datafusion/physical-expr/src/equivalence/properties.rs


-    // remove any constants that are shared in both outputs (avoid double counting them)
+    // Remove any constants that are shared in both outputs (avoid double counting them)
    for c in &constants {
        lhs = lhs.remove_constant(c);
        rhs = rhs.remove_constant(c);


When I remove this for loop, the tests don't fail. Can you check whether it is really needed? If it is, can we add a test for that scenario in this PR as well?

I've tested it, and the constant removal loop appears to be redundant. I removed it in commit 291257f.

Can you figure out why the loop existed, and whether it became redundant with this PR? It might do some work that the tests don't cover. Maybe you can add some debug_assert!() checks to ensure we are not double counting (as the comment says).

The constant removal loop was unnecessary even in the original code. The function already prevents double-counting by:

First collecting only the constants that exist in both LHS and RHS into a filtered constants vector

Using only this filtered constants vector to create the final result via with_constants()

While add_satisfied_orderings() uses the original constant sets from LHS and RHS, this is correct because it only checks whether orderings from one side are satisfied on the other side. Having extra constants in the original sides doesn't affect this check.

So modifying lhs and rhs by removing constants has no effect on the final result, as these modified properties aren't used in any way that would cause double-counting. The comment about "avoiding double counting" was likely added as a defensive measure.
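
The filtering step described above, combined with the value preservation this PR adds, can be sketched roughly as follows. All types and the `union_constants` function are hypothetical simplifications written for illustration; in particular, setting `across_partitions` only when both sides agree on a known value is an assumption of this sketch, not a statement about the actual implementation.

```rust
// Simplified stand-ins for illustration; these are not DataFusion's real types.
#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    Int32(i32),
    Utf8(String),
}

#[derive(Debug, Clone, PartialEq)]
struct ConstExpr {
    expr: String,
    across_partitions: bool,
    value: Option<ScalarValue>,
}

/// Keeps only expressions that are constant on both sides of the union.
/// The value is retained only when both sides know it and agree on it.
fn union_constants(lhs: &[ConstExpr], rhs: &[ConstExpr]) -> Vec<ConstExpr> {
    lhs.iter()
        .filter_map(|l| {
            // Skip constants that do not appear on the right-hand side.
            let r = rhs.iter().find(|r| r.expr == l.expr)?;
            let value = match (&l.value, &r.value) {
                (Some(a), Some(b)) if a == b => Some(a.clone()),
                _ => None,
            };
            // Assumption: the result is uniform across partitions only if
            // both inputs were and the values are known to match.
            let across_partitions =
                l.across_partitions && r.across_partitions && value.is_some();
            Some(ConstExpr {
                expr: l.expr.clone(),
                across_partitions,
                value,
            })
        })
        .collect()
}
```

Since the result is built only from the filtered list, each shared constant appears once, which is why no separate removal pass is needed in this sketch.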

alamb

Thanks @gokselk and @berkaysynnada

I suggest we try to write an end to end sqllogictest for this query too.

alamb · 2024-12-17T11:32:54Z

datafusion/physical-expr/src/equivalence/properties.rs

+        assert_eq!(const_a.value(), Some(&literal_10));
+
+        Ok(())
+    }
 }


Is there a way to create an end-to-end .slt test that shows this behavior?

For example, an EXPLAIN plan where a Sort is optimized away after the constant value is propagated through the union?

Good idea! I have one in mind. Let me add it.

Hey @alamb, I tried it, but after thinking about it more, we actually need one more step in the planner to see an end-to-end difference. We now have the knowledge, but we are not using it. Two possible optimizations come to mind.
Let's assume we have:

# Constant value tracking across union
query TT
explain
SELECT * FROM(
(
    SELECT * FROM aggregate_test_100 WHERE c1='a'
)
UNION ALL
(
    SELECT * FROM aggregate_test_100 WHERE c1='a'
))
ORDER BY c1
----
+ physical_plan
+ 01)SortPreservingMergeExec: [c1@0 ASC NULLS LAST]
+ 02)--UnionExec
+ 03)----CoalesceBatchesExec: target_batch_size=2
+ 04)------FilterExec: c1@0 = a
+ 05)--------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+ 06)----------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13], has_header=true
+ 07)----CoalesceBatchesExec: target_batch_size=2
+ 08)------FilterExec: c1@0 = a
+ 09)--------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+ 10)----------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13], has_header=true

At the top of the plan, we see a SortPreservingMergeExec (SPM). However, a CoalescePartitionsExec would be enough there, and that would certainly improve performance.

For the same query without an ORDER BY but with another outer filter, the plan will contain another filter, which could actually be removed. This is another optimization, but it would apply much more rarely than the first one.

The second one may not be very realistic, but the first one could be implemented without much effort, with a few changes within the scope of replace_with_order_preserving_variants.

Can you take a look at the first one, @gokselk? It should take only a few lines of changes in the plan_with_order_preserving_variants() function. It should first look at the ordering requirements and, if they are matched, try to convert the CoalescePartitionsExec to a SortPreservingMergeExec. Before that conversion, you can check the across_partitions flag of the input constants and, if it is true, leave the CoalescePartitionsExec as is.
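
The suggested check can be sketched in isolation as below. This is not the code of plan_with_order_preserving_variants(); `MergeStrategy`, `ConstInfo` and `choose_merge` are hypothetical names introduced only to illustrate the decision: when every required sort expression is constant across partitions, merging in sorted order buys nothing, so a plain coalesce is enough.

```rust
/// Hypothetical stand-in for the two operators discussed above.
#[derive(Debug, PartialEq)]
enum MergeStrategy {
    CoalescePartitions,
    SortPreservingMerge,
}

/// Hypothetical summary of a constant expression of the input.
struct ConstInfo {
    expr: String,
    across_partitions: bool,
}

/// Chooses how to merge partitions given the required sort expressions.
/// If every sort expression is constant across all partitions, any
/// interleaving of the partitions already satisfies the ordering.
fn choose_merge(sort_exprs: &[&str], constants: &[ConstInfo]) -> MergeStrategy {
    let ordering_is_trivial = sort_exprs.iter().all(|sort_expr| {
        constants
            .iter()
            .any(|c| c.across_partitions && c.expr == *sort_expr)
    });
    if ordering_is_trivial {
        MergeStrategy::CoalescePartitions
    } else {
        MergeStrategy::SortPreservingMerge
    }
}
```

For the query above, `ORDER BY c1` with `c1` constant across partitions would select the coalescing strategy in this sketch; a constant that holds only within each partition would not.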

I've made changes to FilterExec for value extraction and added an initial SLT file. The query now shows CoalescePartitionsExec in the output, so I think your suggested changes to plan_with_order_preserving_variants() might not be needed anymore. However, I'd appreciate your review to confirm this.

It appears that I broke some ORDER BY queries in my recent commits. I will investigate this further.

To add more context, some tests are failing non-deterministically, which is why I didn't notice it beforehand.

ozankabak · 2024-12-18T07:48:40Z

I wonder if we should change across_partitions to an enum; i.e.

enum PartitionValues {
    Uniform(Option<ScalarValue>),
    Heterogenous(Option<Vec<ScalarValue>>)
}

with Uniform meaning that all partitions have the same value given in the payload (if known), and Heterogenous meaning partitions can have different constant values (each of which is given in the vector, if known).
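
A rough sketch of how such an enum might be used is shown below. The enum itself follows the proposal above; `ScalarValue` is a simplified stand-in, and the `from_known` and `uniform_value` helpers are hypothetical additions for illustration only.

```rust
// Simplified stand-in for illustration; not DataFusion's real ScalarValue.
#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    Int32(i32),
    Utf8(String),
}

#[derive(Debug, Clone, PartialEq)]
enum PartitionValues {
    /// All partitions share the same value (given in the payload, if known).
    Uniform(Option<ScalarValue>),
    /// Partitions may have different values (one per partition, if known).
    Heterogenous(Option<Vec<ScalarValue>>),
}

impl PartitionValues {
    /// Hypothetical helper: builds the enum from known per-partition values,
    /// collapsing to `Uniform` when all of them are equal.
    fn from_known(values: Vec<ScalarValue>) -> Self {
        if values.is_empty() {
            return PartitionValues::Heterogenous(None);
        }
        let all_equal = values.iter().all(|v| *v == values[0]);
        if all_equal {
            PartitionValues::Uniform(Some(values[0].clone()))
        } else {
            PartitionValues::Heterogenous(Some(values))
        }
    }

    /// Hypothetical helper: the single value shared by all partitions, if known.
    fn uniform_value(&self) -> Option<&ScalarValue> {
        match self {
            PartitionValues::Uniform(value) => value.as_ref(),
            PartitionValues::Heterogenous(_) => None,
        }
    }
}
```

One appeal of this shape is that the boolean flag and the optional value can no longer disagree: a known shared value exists only in the `Uniform` case.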

gokselk added 6 commits November 19, 2024 09:58

Add value tracking to ConstExpr for improved union optimization

d4e41b1

Update PartialEq impl

fc58594

Minor change

5b3278e

Add docstring for ConstExpr value

5201e3b

Improve constant propagation across union partitions

de8bc13

Add assertion for across_partitions

35bfdc4

github-actions bot added the physical-expr Physical Expressions label Dec 17, 2024

gokselk changed the title ~~Feature/const expr value tracking~~ Preserve constant values across union operations Dec 17, 2024

fix fmt

76f497e

berkaysynnada approved these changes Dec 17, 2024

berkaysynnada and others added 3 commits December 17, 2024 09:57

Update properties.rs

2721609

Remove redundant constant removal loop

291257f

Remove unnecessary mut

8bc7fd2

alamb reviewed Dec 17, 2024

gokselk added 3 commits December 18, 2024 02:42

Set across_partitions=true when both sides are constant

3051cd4

Extract and use constant values in filter expressions

4c3f0d1

Add initial SLT for constant value tracking across UNION ALL

a23faed

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 17, 2024
