Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: define processing time column #7209

Closed
fuyufjh opened this issue Jan 5, 2023 · 6 comments
Closed

feat: define processing time column #7209

fuyufjh opened this issue Jan 5, 2023 · 6 comments
Assignees
Milestone

Comments

@fuyufjh
Copy link
Member

fuyufjh commented Jan 5, 2023

Is your feature request related to a problem? Please describe.

Provide a column with event ingesting time, in case there is no appropriate time column in user data.

Describe the solution you'd like

Based on #6952, we can introduce proc_time as a column in the source, which is same with Flink: https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/table/concepts/time_attributes/#processing-time

CREATE SOURCE user_actions (
  user_name STRING,
  data STRING,
  user_action_time AS PROCTIME() -- declare an additional field as a processing time attribute
) WITH (
  ...
);


SELECT TUMBLE_START(user_action_time, INTERVAL '10' MINUTE), COUNT(DISTINCT user_name)
FROM user_actions
GROUP BY TUMBLE(user_action_time, INTERVAL '10' MINUTE);

Describe alternatives you've considered

1. Hidden column.

The problem of this approach is that it doesn't explicitly tell the users the column is associated with source/table. For example:

This is clear:

select _proc_time, id, value from t where k = 2;

This is ambiguous:

select _proc_time, id, value from a join b where a.bid = b.id where k = 1
-- ERROR: _proc_time should be either a._proc_time or b._proc_time

2. System function

This is actually a different thing. It is apparent that functions are evaluated when executing it, rather than data injected.

It doesn't matter for this simple query

select _proc_time, id, value from t where k = 2;

But will be a problem for more complex ones:

select proc_time(), id, value from a join b where a.bid = b.id where k = 1
-- When is proc_time() evaluated? Intuitively, it should be after the two events joined.

We had better call it now() or something else to distinguish from our topic here.

Additional context

No response

@st1page
Copy link
Contributor

st1page commented Jan 6, 2023

prefer the Hidden column, it should be a record's properties but not a impure function

@fuyufjh
Copy link
Member Author

fuyufjh commented Jan 6, 2023

prefer the Hidden column, it should be a record's properties but not a impure function

My concern is that the concept "hidden column" is unfamiliar to users, especially Postgres users. For example, he/she may feel confused when writing a.proc_time but there is no proc_time column in describe table or show create table.

it should be a record's properties but not an impure function

But It does look like current_user , an in-pure function, doesn't it? 🤣

@st1page
Copy link
Contributor

st1page commented Jan 8, 2023

Ok, just allowing PROCTIME() when the user defines the Source or Table LGTM.
Another question, should we support more kinds of DDL on the source? e.g.

  1. add the PROCTIME column on the source
  2. even more... change the watermark column of a source?
    c.c. @BugenZhao

@liurenjie1024
Copy link
Contributor

How we handle this in batch query? The time when it's scaned into our system?

@lmatz
Copy link
Contributor

lmatz commented Jan 18, 2023

I am wondering if it also makes sense to add a time column when being materialized instead of being just read from the source? I suppose proc_time() is added to the row when it is just read.

@fuyufjh
Copy link
Member Author

fuyufjh commented Jan 26, 2023

I am wondering if it also makes sense to add a time column when being materialized instead of being just read from the source? I suppose proc_time() is added to the row when it is just read.

For a table (previously materialized source) - Yes, the processing time should be persisted in that table/MV as well, just like any other column.

For a source (i.e. not materialized) - No, I suppose the only thing we can do is using the read time as processing time, as @liurenjie1024 commented above. To avoid that, users should define an event time e.g. Kafka timestamp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants