Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: an append-only materialized view storing continuous timestamps #15830

Closed
st1page opened this issue Mar 21, 2024 · 13 comments · Fixed by #17371
Closed

feat: an append-only materialized view storing continuous timestamps #15830

st1page opened this issue Mar 21, 2024 · 13 comments · Fixed by #17371
Assignees
Labels
help wanted Issues that need help from contributors
Milestone

Comments

@st1page
Copy link
Contributor

st1page commented Mar 21, 2024

background

In some use cases, users want a timed triggering logic and they like to do it like this

CREATE TABLE per_minutes(minute timestamp) APPEND ONLY;

INSERT INTO per_minutes SELECT * FROM generate_series(
    '2020-01-01 00:00:00'::TIMESTAMP,
    '2050-01-01 00:00:00'::TIMESTAMP, 
    interval '1' minute
);

CREATE VIEW timer_per_minutes AS SELECT * FROM per_minutes where minute < now();

The view timer_per_minutes will emit the event per minute periodically.

RW can support this kind of usage with a more friendly syntax with a TVF timer_source(interval: INTERVAL) -> APPEND ONLY STREAM (time: timestamptz);. Also, RW can do more optimization for this case and save the big time list table.

implementation

It is a flag on the NowExecutor. And it should not associate with the barrier interval. Even when barrier interval is 10s and the timer's interval is 1s, it should emit the time for each 1s.

@github-actions github-actions bot added this to the release-1.8 milestone Mar 21, 2024
@fuyufjh
Copy link
Member

fuyufjh commented Mar 21, 2024

Can this just be expressed with

create materialized view xxx as generate_series(
    '2020-01-01 00:00:00'::TIMESTAMP, -- start 
    NOW(), -- end
    interval '1' minute  -- step
);

Internally, NowExecutor will be used

@st1page
Copy link
Contributor Author

st1page commented Mar 21, 2024

Can this just be expressed with

create materialized view xxx as generate_series(
    '2020-01-01 00:00:00'::TIMESTAMP, -- start 
    NOW(), -- end
    interval '1' minute  -- step
);

Internally, NowExecutor will be used

The issue is that it will generate too many data in the table and also it is a big generate_series batch query.

@fuyufjh
Copy link
Member

fuyufjh commented Mar 21, 2024

also it is a big generate_series batch query.

Didn't get this. Can you explain it? I don't see any batch query here.

By the way, regarding the original query in PR description

CREATE VIEW timer_per_minutes AS SELECT * FROM per_minutes where minute > now();

Do you mean where minute < now()?

@st1page
Copy link
Contributor Author

st1page commented Mar 21, 2024

```sql
create materialized view xxx as generate_series(
    '2020-01-01 00:00:00'::TIMESTAMP, -- start 
    NOW(), -- end
    interval '1' minute  -- step
);

Oh, create mv with a generate_series is even worse

create materialized view xxx as generate_series(
    '2020-01-01 00:00:00'::TIMESTAMP, -- start 
    NOW(), -- end
    interval '1' minute  -- step
);

WIth this query, every time the now()'s value changes, RW will retract all the result of the old TVF generate_series and send the new value.

@st1page
Copy link
Contributor Author

st1page commented Mar 21, 2024

CREATE VIEW timer_per_minutes AS SELECT * FROM per_minutes where minute > now();

Do you mean where minute < now()?

Yes, I will change it.

@fuyufjh
Copy link
Member

fuyufjh commented Mar 21, 2024

Oh, create mv with a generate_series is even worse

create materialized view xxx as generate_series(
    '2020-01-01 00:00:00'::TIMESTAMP, -- start 
    NOW(), -- end
    interval '1' minute  -- step
);

WIth this query, every time the now()'s value changes, RW will retract all the result of the old TVF generate_series and send the new value.

Yeah, by the current implementation, it is. I am just proposing to use the syntax. We should make it incremental internally.

@st1page
Copy link
Contributor Author

st1page commented Mar 21, 2024

Ohh so sorry I misunderstand it ... LGTM to use the syntax.

generate_series(
    '2020-01-01 00:00:00'::TIMESTAMP, -- start 
    NOW(), -- end
    interval '1' minute  -- step
);

@fuyufjh
Copy link
Member

fuyufjh commented Mar 21, 2024

By the way, share the time_bucket_gapfill from TimescaleDB.

Compared with Join, it has 2 advantages:

  1. It allows specifying how to interpolate values.
  2. Streaming Join is very expensive

Regarding of the user interface, I slightly prefer this than JOIN, also because it's more straigh-forward

@st1page
Copy link
Contributor Author

st1page commented Mar 21, 2024

Streaming Join is very expensive

So, Even a streaming time_bucket_gapfill is better because it exploits the property that the time columns are basically ordered and that the gaps are not very large? 🤔 It is interesting.

@fuyufjh
Copy link
Member

fuyufjh commented Mar 27, 2024

Streaming Join is very expensive

So, Even a streaming time_bucket_gapfill is better because it exploits the property that the time columns are basically ordered and that the gaps are not very large? 🤔 It is interesting.

Let me explain why a (naive) outer join is not the optimal approach.

For example, let's consider the hardest case first - the only left row in a time window was deleted, in this case, a gap is created, and we need to fill the gap with a row of NULLs.

No matter which approach we are using, to implement this, the operator must memorize how many rows were in this window before the delete. Actually, this is exactly what the degree table of left outer join does.

-- imagine it as 
upstream_data LEFT OUTER JOIN per_minutes_series

However, the rest state tables, such as the materialized rows of both left & right sides, were useless, because

  • The left state table (i.e. the materialized rows from upstream_data) is not necessary because every time an upstream_data row comes, it can be emitted immediately and doesn't need to wait for per_minutes_series
  • The right statble table (i.e. the materialized rows from per_minutes_series) is wasteful because it's very simple. Just keeping the start_time and end_time would be enough.

I am thinking that we might need to introduce a special operator, saying, like a much simplified outer join operator with a degree table. Of course, this can be a second step and naive join can be used before it's completed, but before this we need to carefully consider the syntax: join, aggregate function, or TVF.

@st1page
Copy link
Contributor Author

st1page commented Mar 28, 2024

dolphinDB allow a interval syntax in the group by clause, which supports the "fill" argument

fill indicates how to fill the missing values of the result. It can take the value of “prev”, “post”, “linear”, “null”, a specific numeric value and “none”.
“prev”: the previous value
“post”: the next value
“linear”: linear interpolation. For non-numerical data, linear interpolation cannot be used and the “prev” method will be used instead.
“null”: null value
a specific numeric value.
“none”: do not interpolate

https://docs.dolphindb.cn/en/help/SQLStatements/interval.html

@fuyufjh
Copy link
Member

fuyufjh commented Mar 28, 2024

https://docs.dolphindb.cn/en/help/SQLStatements/interval.html

Interesting. While for RisingWave, the closest syntax is actually the time window function i.e. TUMBLE and HOP. But regarding of implementation, it seems that reusing the state in Agg is the most efficient way.

@fuyufjh
Copy link
Member

fuyufjh commented Mar 28, 2024

Oh, I just realized the discussion around gap filling and interpolation has been a little bit off-topic. Let's focus on the original requirement for this issue. That is, supporting

create materialized view xxx as generate_series(
    '2020-01-01 00:00:00'::TIMESTAMP, -- start 
    NOW(), -- end
    interval '1' minute  -- step
);

with some efficient incremental way.

@fuyufjh fuyufjh changed the title feat: implement timer_source(interval: INTERVAL) -> APPEND ONLY STREAM (time: timestamptz); feat: an append-only materialized view storing continuous timestamps Mar 28, 2024
@fuyufjh fuyufjh added the help wanted Issues that need help from contributors label Mar 28, 2024
@st1page st1page self-assigned this Apr 8, 2024
@st1page st1page modified the milestones: release-1.8, release-1.9 Apr 8, 2024
@st1page st1page modified the milestones: release-1.9, release-1.10 May 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Issues that need help from contributors
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants