Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow to filter null in array_agg #13742

Open
rluvaton opened this issue Dec 12, 2024 · 5 comments
Open

Allow to filter null in array_agg #13742

rluvaton opened this issue Dec 12, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@rluvaton
Copy link
Contributor

rluvaton commented Dec 12, 2024

Is your feature request related to a problem or challenge?

Yes, I want nulls to be filtered from array_agg when I specify with_ignore_nulls: true in AggregateExprBuilder to have behavior similar to Spark, this should also work when with_distinct is false or true

Describe the solution you'd like

Specifing with_ignore_nulls: true should ignore

Describe alternatives you've considered

  1. Adding a filter for removing nulls to the child before creating the array_agg
  2. Copy the implementation and added in myself as an UDAF

Additional context

This PR might be relevant:

I'm willing to open a pull request

@rluvaton rluvaton added the enhancement New feature or request label Dec 12, 2024
@findepi
Copy link
Member

findepi commented Dec 12, 2024

AFAICT, the with_ignore_nulls: true is currently... ignored?
Given it's part of the "logical plan", it should be obeyed. It's probably part of LP because the LP is half-way between parser's AST and real plan (see also #12604 and cc @alamb).

I'm OK for respecting with_ignore_nulls in the plan, but I would actually consider removing it completely. Simpler things are better. The same functionality is available using AggregateFunction.filter.
What calls for keeping with_ignore_nulls is that aggregate functions -- in general -- skip over null inputs. SQL standard array_agg is one of few exceptions (and some systems like Snowflake and apparently Spark skip nulls for array_agg). Ideally this could be decided by the function itself, so could be a parameter of the ArrayAgg struct.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Dec 13, 2024

I like the idea that we leverage existing filter in AggregateExec for this usecase

/// FILTER (WHERE clause) expression for each aggregate expression
filter_expr: Vec<Option<Arc<dyn PhysicalExpr>>>,

I guess you can create AggregateExec with predicate that only non-nulls goes to array-agg, similar to the following syntax

select array_agg(column1) from table where column1 != NUll;

We could even introduce a general filter parameter to UDAF. Then we have something like array_agg(lit(1)).distinct().filter(col != null).build() 🤔

We have filter already 👍🏻

@alamb
Copy link
Contributor

alamb commented Dec 16, 2024

To be clear, I think the SQL parser supports predicates like ARRAY_AGG(x) IGNORE NULL and we have logical plan support for them.

However, some aggregates (silently) ignore this, see

I think filling out support (or at least making it clear that this syntax is ignored) would be a great idea

@findepi
Copy link
Member

findepi commented Dec 17, 2024

Thank you @alamb. do you have an example of an agg that doesn't ignore them?
I tried find one, but maybe i was searching the wrong way

@alamb
Copy link
Contributor

alamb commented Dec 17, 2024

Thank you @alamb. do you have an example of an agg that doesn't ignore them? I tried find one, but maybe i was searching the wrong way

I think first_value supports nulls / ignore nulls

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants