
Migrate from Postgres to ClickHouse #29

Open
bitjson opened this issue Nov 20, 2021 · 2 comments
Labels
enhancement New feature or request

Comments

@bitjson
Member

bitjson commented Nov 20, 2021

In early iterations (on Postgres 12), I experimented with expression indexes for fee_satoshis and other computed fields. I found that expression indexes didn't improve performance as much as I'd hoped, and they are surprisingly expensive to build and maintain (both in terms of DB latency and storage space).

It would be valuable to try again using Postgres 14 and the latest version of Hasura – ideally, it would be possible to simply create expression indexes for any computed fields which will be commonly needed by a Chaingraph instance.
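A sketch of what such an expression index might look like (table and column names here are illustrative, not Chaingraph's actual schema):

```sql
-- An expression index can accelerate a computed field, but only when the
-- expression is IMMUTABLE and references nothing beyond the row itself:
CREATE INDEX transaction_size_bytes_idx
  ON transaction ((length(encoded_hex) / 2));

-- Fields like fee_satoshis depend on joins to the inputs/outputs tables,
-- so they cannot safely be expression-indexed directly; they would need
-- a precomputed column instead (see below).
```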

However, I think expression indexes are likely to be insufficient for use cases like block explorers or blockchain analysis systems, where we want most of these computed fields to be accessible in a single table row. (If a query has to hit 8 different expression indexes to return all the computed fields for a particular transaction or block, it's still going to be expensive.)

For these use cases, it would be valuable to somehow support a "storage-heavy" mode, where Chaingraph alters tables to include new columns for any computed values needed by the application. (It would be great if the particular columns were configurable, but if it makes maintenance significantly easier, we could also just have a single storage-heavy mode which would enable everything. Wallet backends might use the default install, while block explorers would choose the heavy option.)

It would be great if Chaingraph could move back and forth between these modes – if the agent detects that "heavy" mode has been enabled, it would alter all the required tables, then start intelligently filling columns (building up results by e.g. computing input_value_satoshis and output_value_satoshis before fee_satoshis). Once enabled, the additional values could be maintained by Postgres triggers, by computations in the agent, or by both.
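The migration the agent might run on entering "heavy" mode could look roughly like this (all table/column names are illustrative assumptions, not the actual schema); note the dependency ordering, with the value sums backfilled before the fee:

```sql
-- Add the accelerated columns idempotently.
ALTER TABLE transaction
  ADD COLUMN IF NOT EXISTS input_value_satoshis  bigint,
  ADD COLUMN IF NOT EXISTS output_value_satoshis bigint,
  ADD COLUMN IF NOT EXISTS fee_satoshis          bigint;

-- First pass: fill the sums for rows not yet computed.
UPDATE transaction t SET
  output_value_satoshis = (SELECT sum(o.value_satoshis)
                             FROM output o
                            WHERE o.transaction_internal_id = t.internal_id)
  WHERE t.output_value_satoshis IS NULL;

-- Second pass: fee derives from the two sums.
UPDATE transaction t
  SET fee_satoshis = t.input_value_satoshis - t.output_value_satoshis
  WHERE t.fee_satoshis IS NULL
    AND t.input_value_satoshis  IS NOT NULL
    AND t.output_value_satoshis IS NOT NULL;
```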

This mode would also have to configure Hasura to read from the added table columns rather than use the computed field functions. (That might require us to ship two versions of Chaingraph's Hasura metadata.)

@bitjson bitjson added the enhancement New feature or request label Nov 20, 2021
@bitjson
Member Author

bitjson commented Nov 23, 2021

I probably won’t have time to work on this for a while, but just wanted to document what I’m currently thinking is the solution here: manually-refreshed materialized views.

I think we’ll need to migrate each table to a view which either directly references a column in an underlying table or falls back to the function. On startup, the agent would detect any configuration changes in “query acceleration columns” and add the column(s), changing the respective view to reference the new columns after they have been added. (Maybe a separate CHAINGRAPH_ALLOW_COLUMN_DELETION environment variable would allow previously enabled columns to be cleaned up; should be off by default to ensure accidental configuration changes don’t wipe expensive-to-compute data.)

So for example, transaction would become a view which references an underlying transaction_data that might also have columns for fee_satoshis and other otherwise-computed values. Any view columns which are not enabled in the underlying *_data table would be configured to compute the value at query time. For use cases which rarely or never require these values, it’s still possible to request them (and results simply take longer to return); for use cases like block explorers which tend to request all available data, the view basically passes through all columns directly from a single table, making lookups as fast as possible (and faster than a similar query which hits multiple expression indexes).
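A minimal sketch of the view-over-table arrangement described above (names are illustrative; `compute_fee_satoshis` is a hypothetical stand-in for the existing computed-field function):

```sql
CREATE TABLE transaction_data (
  internal_id  bigint PRIMARY KEY,
  hash         bytea NOT NULL,
  -- ... other always-present columns ...
  fee_satoshis bigint  -- only filled when this column is "accelerated"
);

-- Accelerated variant: the view passes the stored column straight through.
CREATE VIEW transaction AS
  SELECT internal_id, hash, fee_satoshis
    FROM transaction_data;

-- Non-accelerated variant: identical interface, computed at query time.
-- CREATE VIEW transaction AS
--   SELECT internal_id, hash,
--          compute_fee_satoshis(internal_id) AS fee_satoshis
--     FROM transaction_data;
```

Either way, callers see the same `transaction` relation with the same columns, which is what keeps the interface consistent across configurations.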

Each enabled column should also add triggers as necessary to any tables which affect the column. (For best performance, should triggers be combined so that each table only has a single column computation trigger to avoid multiple lookups?) Finally, some columns can be computed by the agent to save DB work, e.g. input_count, output_count, input_value_satoshis, output_value_satoshis, and fee_satoshis can all be computed before the transaction_data row is saved to the database. No triggers should be necessary for these columns – the agent can immediately start sending the computed values for new transactions while the database computes the correct values for transactions which are already saved.
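One possible shape for such a trigger, sketched with illustrative names (here the trigger only invalidates the stored value, leaving recomputation to a background pass; an alternative would be recomputing inline):

```sql
-- When an output changes, mark the owning transaction's fee for recompute.
CREATE OR REPLACE FUNCTION refresh_fee_satoshis() RETURNS trigger AS $$
BEGIN
  UPDATE transaction_data
     SET fee_satoshis = NULL
   WHERE internal_id = NEW.transaction_internal_id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER output_fee_refresh
  AFTER INSERT OR UPDATE ON output
  FOR EACH ROW EXECUTE FUNCTION refresh_fee_satoshis();
```

Combining all enabled columns into one trigger function per source table, as suggested above, would avoid firing several row-level triggers per write.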

Management of this strategy will probably require that we move away from having Hasura manage the DB schema and instead have the agent actively manage it (so the agent can detect configuration options at startup and make any needed changes to the tables/views/triggers). Most existing Hasura computed fields would also become normal fields, since the function would be called by the view. (Added benefit: this would clean up the unnecessary computed-field notes in the schema docs which don't need to be exposed to end users, e.g. "A computed field, executes function "transaction_fee_satoshis"".)

The primary benefit of this strategy is that it offers the best possible performance, doesn't waste computation/storage on columns which a particular use case doesn't require, and offers a consistent interface regardless of which "accelerated columns" are enabled – we won't have to maintain or generate multiple Hasura metadata configurations, and SQL queries written for Chaingraph will always work regardless of configuration (albeit more slowly for queries requiring lots of non-accelerated, expensive computations).

@bitjson bitjson changed the title Mode to improve performance of fee_satoshis and other computed fields Accelerated columns: improve performance of computed fields Nov 23, 2021
@bitjson
Member Author

bitjson commented Nov 8, 2023

Some important movement on this issue in the past few months: Hasura now supports connecting to a ClickHouse database as of v2.33.0.

In general, ClickHouse is an excellent choice for our workload (schema requires no mutations, even during re-org events). With support now in Hasura, we could get far better performance out of a ClickHouse-backed Chaingraph than from the heavily-optimized Postgres-based solution described above.
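As a rough illustration of why the workload fits (ClickHouse dialect; names and codecs are assumptions, not a proposed schema), an append-only MergeTree table suits a schema that never mutates rows, and per-column codecs are where the heavy compression of small values comes from:

```sql
CREATE TABLE transaction
(
  block_internal_id UInt64,
  hash              FixedString(32),
  fee_satoshis      UInt64 CODEC(T64, ZSTD),
  size_bytes        UInt32 CODEC(T64, ZSTD)
)
ENGINE = MergeTree
ORDER BY (block_internal_id, hash);
```

Re-org handling would stay insert-only, e.g. by writing replacement rows keyed by block and filtering to the active chain at query time (an assumption here, not a settled design).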

A migration would not be trivial, but I think it could be made mostly-transparent to end users. Query performance should be incomparably better, especially for block explorer/analytics-type queries which aren't accelerated by the indexes Chaingraph currently creates. I also expect the total database size to be significantly reduced, as ClickHouse should be able to heavily compress even small values (and ClickHouse is also much more conservative with row/column overhead).

@bitjson bitjson changed the title Accelerated columns: improve performance of computed fields Migrate from Postgres to ClickHouse Nov 8, 2023
@bitjson bitjson pinned this issue Dec 8, 2023
@bitjson bitjson mentioned this issue Dec 8, 2023