Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat : data bucket for persisting data to gcs #6170

Merged
merged 18 commits into from
Dec 4, 2024
Merged

Conversation

k-anshul
Copy link
Member

@k-anshul k-anshul commented Nov 26, 2024

PrefixedBucket closes underlying bucket so no instance will be able to read other instance's data.

@k-anshul k-anshul self-assigned this Nov 26, 2024
Copy link
Contributor

@begelundmuller begelundmuller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if we could configure the data directory and data bucket together (instead of passing one as config vars and another as a constructor function). What do you think about grouping storage/persistence related functionality in a single place and dependency-injecting it similar to the logger and activity client?

For example, we could have a runtime/storage package:

// runtime/storage/storage.go
type Client struct {...}

func New(dataDir, tmpDir string, bucketCfg map[string]any) *Client {...}

func NewTmp() *Client {...}

func (c *Client) WithPrefix(instanceID string) *Client {...}

func (c *Client) DataDir(elem ..string) string {...}

func (c *Client) TempDir(elem ..string) string {...}

func (c *Client) OpenBucket() (*blob.Bucket, bool, error) {...}

Which we create in start.go and pass to the runtime, like runtime.New(ctx, opts, logger, ..., storage). In tests, we can have storage.NewTmp() similar to zap.NewNop().

Then when opening a driver, it can pass r.storage.WithPrefix(instanceID) to it.

The advantages I see to this are:

  1. All disk persistence is managed in one place, making it simpler to understand and easier to track storage for billing and periodic cleanup
  2. In the future, enables exposing buckets in other ways than *blob.Bucket. This might be useful for e.g. caching exports where writing directly from the OLAP might be faster.

runtime/drivers/drivers.go Outdated Show resolved Hide resolved
@k-anshul
Copy link
Member Author

k-anshul commented Dec 2, 2024

It would be great if we could configure the data directory and data bucket together (instead of passing one as config vars and another as a constructor function). What do you think about grouping storage/persistence related functionality in a single place and dependency-injecting it similar to the logger and activity client?

Yeah sounds good.
I did the changes as per suggested design. Two changes/considerations though:

  1. Even for tests we will need data_dir so better to let callers use same constructor and pass a temporary directory which can later be cleaned.
  2. As of now it is not possible to set connector name so all connectors of same driver will use same name but this can be easily added by making some changes in AcquireConnection. I haven't done it yet, will make the changes as part of a different PR. As of now there is no impact because of this.

runtime/connection_cache.go Outdated Show resolved Hide resolved
runtime/drivers/duckdb/duckdb.go Outdated Show resolved Hide resolved
runtime/storage/storage.go Show resolved Hide resolved
runtime/storage/storage.go Outdated Show resolved Hide resolved
runtime/drivers/snowflake/sql_store.go Show resolved Hide resolved
runtime/storage/storage.go Outdated Show resolved Hide resolved
Comment on lines 55 to 58
newClient := &Client{
dataDirPath: c.dataDirPath,
bucketConfig: c.bucketConfig,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to handle tempDirPath here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OOPS. Thanks :)

@begelundmuller begelundmuller merged commit 0154424 into main Dec 4, 2024
7 checks passed
@begelundmuller begelundmuller deleted the data_bucket branch December 4, 2024 15:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants