standardize formatting, remove outdated info (#1820)
aeluce authored Dec 12, 2024
1 parent eb0047a commit 24a66c1
Showing 29 changed files with 95 additions and 111 deletions.
5 changes: 2 additions & 3 deletions site/docs/getting-started/quickstart/quickstart.md
@@ -85,8 +85,7 @@ table.

![Link Capture](https://storage.googleapis.com/estuary-marketing-strapi-uploads/uploads//link_source_to_capture_b0d37a738f/link_source_to_capture_b0d37a738f.png)

After pressing continue, you are met with a few configuration options, but for now, feel free to press **Next,** then *
*Save and Publish** in the top right corner, the defaults will work perfectly fine for this tutorial.
After pressing continue, you are met with a few configuration options, but for now, feel free to press **Next,** then **Save and Publish** in the top right corner, the defaults will work perfectly fine for this tutorial.

A successful deployment will look something like this:

@@ -100,7 +99,7 @@ the data looks.
Looks like the data is arriving as expected, and the schema of the table is properly configured by the connector based
on the types of the original table in Postgres.

To get a feel for how the data flow works; head over to the collection details page on the Flow web UI to see your
To get a feel for how the data flow works, head over to the collection details page on the Flow web UI to see your
changes immediately. On the Snowflake end, they will be materialized after the next update.

## Next Steps<a id="next-steps"></a>
@@ -27,7 +27,7 @@ Amazon RDS, Amazon Aurora, Google Cloud SQL, Azure Database for PostgreSQL, and

## Introduction

Materialized views in Postgres give you a powerful way narrow down a huge dataset into a compact one that you can easily monitor.
Materialized views in Postgres give you a powerful way to narrow down a huge dataset into a compact one that you can easily monitor.
But if your data is updating in real-time, traditional materialized views introduce latency. They're batch workflows — the query is run at a set interval.
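For reference, the batch pattern described above looks something like the following in plain Postgres. This is a minimal, purely illustrative sketch with hypothetical table and view names; it is not part of the tutorial being diffed.

```sql
-- Hypothetical rollup: condense a large orders table into a compact summary.
CREATE MATERIALIZED VIEW daily_order_totals AS
SELECT order_date, count(*) AS order_count, sum(amount) AS revenue
FROM orders
GROUP BY order_date;

-- The view only reflects new rows when it is refreshed, typically on a schedule
-- (cron, pg_cron, or similar). Each refresh re-runs the full query, which is
-- where the latency described above comes from.
REFRESH MATERIALIZED VIEW daily_order_totals;
```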

To get around this, you'll need to perform a real-time transformation elsewhere.
10 changes: 5 additions & 5 deletions site/docs/getting-started/tutorials/dataflow-s3-snowflake.md
@@ -73,10 +73,10 @@ You'll start by creating your capture.

4. Click inside the **Name** box.

Names of entities in Flow must be unique. They're organized by prefixes, similar to paths in a file system.

You'll see one or more prefixes pertaining to your organization.
These prefixes represent the **namespaces** of Flow to which you have access.

5. Click your prefix from the dropdown and append a unique name after it. For example, `myOrg/yourname/citibiketutorial`.

@@ -115,13 +115,13 @@ Before you can materialize from Flow to Snowflake, you need to complete some set

1. Leave the Flow web app open. In a new window or tab, go to your Snowflake console.

If you're a new trial user, you should have received instructions by email. For additional help in this section, see the [Snowflake documentation](https://docs.snowflake.com/en/user-guide-getting-started.html).

2. Create a new SQL worksheet if you don't have one open.

This provides an interface where you can run queries.

3. Paste the follow script into the console, changing the value for `estuary_password` from `secret` to a strong password):
3. Paste the following script into the console, changing the value for `estuary_password` from `secret` to a strong password:

```sql
set database_name = 'ESTUARY_DB';
@@ -308,4 +308,4 @@ Reduction annotations also have some benefits over task state (like SQLite table
* Certain aggregations, such as recursive merging of tree-like structures,
are much simpler to express through reduction annotations vs implementing yourself.

[See "Where to Accumulate?" for more discussion]../../concepts/derivations.md(#where-to-accumulate).
[See "Where to Accumulate?" for more discussion](../../concepts/derivations.md#where-to-accumulate).
20 changes: 13 additions & 7 deletions site/docs/getting-started/tutorials/postgresql_cdc_to_snowflake.md
@@ -59,7 +59,8 @@ As this tutorial is focused on CDC replication from PostgreSQL, we’ll need a d

Save the below `yaml` snippet as a file called `docker-compose.yml`. This `docker-compose.yml` file contains the service definitions for the PostgreSQL database and the mock data generator service.

:::tip Since V2, compose is integrated into your base Docker package, there’s no need to download any separate tooling!
:::tip
Since V2, compose is integrated into your base Docker package, there’s no need to download any separate tooling!
:::

```yaml title="docker-compose.yml"
@@ -112,7 +113,8 @@ Don’t be alarmed by all these Docker configurations, they are made to be repro
Next up, create a folder called `schemas` and paste the below SQL DDL into a file called `products.sql`. This file contains the schema of the demo data.

:::note This file defines the schema via a create table statement, but the actual table creation happens in the `init.sql` file, this is just a quirk of the [Datagen](https://github.com/MaterializeInc/datagen) data generator tool.
:::note
This file defines the schema via a create table statement, but the actual table creation happens in the `init.sql` file, this is just a quirk of the [Datagen](https://github.com/MaterializeInc/datagen) data generator tool.
:::
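The tutorial's actual `products.sql` is truncated in this hunk. For a sense of shape only, Datagen consumes an ordinary `CREATE TABLE` statement along these lines; the columns below are hypothetical, not the tutorial's real schema.

```sql
-- Hypothetical schema for illustration; the real products.sql ships with the tutorial.
CREATE TABLE public.products (
    id INT PRIMARY KEY,
    name VARCHAR(255),
    price NUMERIC(10, 2),
    created_at TIMESTAMP
);
```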

```sql title="products.sql"
@@ -170,7 +172,8 @@ GRANT pg_read_all_data TO flow_capture;

Granting the `pg_read_all_data` privilege to the `flow_capture` user ensures that it can access and read data from all tables in the database, essential for capturing changes.

:::note `pg_read_all_data` is used for convenience, but is not a hard requirement, since it is possible to grant a more granular set of permissions. For more details check out the [connector docs](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/#self-hosted-postgresql).
:::note
`pg_read_all_data` is used for convenience, but is not a hard requirement, since it is possible to grant a more granular set of permissions. For more details check out the [connector docs](https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/#self-hosted-postgresql).
:::
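For a rough idea of what a more granular grant set could look like instead of `pg_read_all_data` (the table name here is only an example; the connector docs linked above remain the authoritative reference):

```sql
-- Grant read access only to the specific schema and tables the capture needs.
GRANT USAGE ON SCHEMA public TO flow_capture;
GRANT SELECT ON public.products TO flow_capture;
-- Repeat the GRANT SELECT for each additional table you plan to capture.
```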

```sql
@@ -201,7 +204,8 @@ A publication defines a set of tables whose changes will be replicated. In this

These commands configure the `flow_publication` publication to publish changes via partition root and add the specified tables to the publication. By setting `publish_via_partition_root` to true, the publication ensures that updates to partitioned tables are correctly captured and replicated.

:::note The table in this tutorial is not partitioned, but we recommend always setting `publish_via_partition_root` when creating a publication.
:::note
The table in this tutorial is not partitioned, but we recommend always setting `publish_via_partition_root` when creating a publication.
:::

These objects form the backbone of a robust CDC replication setup, ensuring data consistency and integrity across systems. After the initial setup, you will not have to touch these objects in the future, unless you wish to start ingesting change events from a new table.
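The statements the paragraph refers to sit just above this hunk and are not shown in the diff. For orientation, they take roughly this shape (the table name is a placeholder):

```sql
-- Replicate partitioned tables via their root, as recommended in the note above.
ALTER PUBLICATION flow_publication SET (publish_via_partition_root = true);

-- Add each table whose change events should be captured.
ALTER PUBLICATION flow_publication ADD TABLE public.products;
```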
@@ -233,7 +237,8 @@ ngrok tcp 5432

You should immediately be greeted with a screen that contains the public URL for the tunnel we just started! In the example above, the public URL `5.tcp.eu.ngrok.io:14407` is mapped to `localhost:5432`, which is the address of the Postgres database.

:::note Don’t close this window while working on the tutorial as this is required to keep the connections between Flow and the database alive.
:::note
Don’t close this window while working on the tutorial as this is required to keep the connections between Flow and the database alive.
:::

Before we jump into setting up the replication, you can quickly verify the data being properly generated by connecting to the database and peeking into the products table, as shown below:
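The verification snippet itself falls outside this hunk. Any quick read against the table works; for example, assuming the default `public` schema:

```sql
-- Sanity check that Datagen is inserting rows into the products table.
SELECT count(*) FROM public.products;
SELECT * FROM public.products LIMIT 5;
```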
@@ -412,9 +417,10 @@ And that’s pretty much it, you’ve successfully published a real-time CDC pip

Looks like the data is arriving as expected, and the schema of the table is properly configured by the connector based on the types of the original table in Postgres.

To get a feel for how the data flow works; head over to the collection details page on the Flow web UI to see your changes immediately. On the Snowflake end, they will be materialized after the next update.
To get a feel for how the data flow works, head over to the collection details page on the Flow web UI to see your changes immediately. On the Snowflake end, they will be materialized after the next update.

:::note Based on your configuration of the "Update Delay" parameter when setting up the Snowflake Materialization, you might have to wait until the configured amount of time passes for your changes to make it to the destination.
:::note
Based on your configuration of the "Update Delay" parameter when setting up the Snowflake Materialization, you might have to wait until the configured amount of time passes for your changes to make it to the destination.
:::


@@ -54,7 +54,8 @@ MongoDB supports various types of change events, each catering to different aspe

- Delete Events: Signaled when documents are removed from a collection.

:::note In MongoDB, if you delete a key from a document, the corresponding change event that gets fired is an "update" event. This may seem counterintuitive at first, but in MongoDB, updates are atomic operations that can modify specific fields within a document, including removing keys. So, when a key is deleted from a document, MongoDB interprets it as an update operation where the specific field (i.e., the key) is being removed, resulting in an "update" event being generated in the oplog.
:::note
In MongoDB, if you delete a key from a document, the corresponding change event that gets fired is an "update" event. This may seem counterintuitive at first, but in MongoDB, updates are atomic operations that can modify specific fields within a document, including removing keys. So, when a key is deleted from a document, MongoDB interprets it as an update operation where the specific field (i.e., the key) is being removed, resulting in an "update" event being generated in the oplog.
:::

![Delete event](https://storage.googleapis.com/estuary-marketing-strapi-uploads/uploads//image3_5dc8c9ea52/image3_5dc8c9ea52.png)
@@ -121,7 +122,8 @@ Navigate to the “Network Access” page using the left hand sidebar, and using

Next, find your connection string by navigating to the `mongosh` setup page by clicking the “Connect” button on the database overview section, then choosing the “Shell” option.

:::note You’re not going to set up `mongosh` for this tutorial, but this is the easiest way to get ahold of the connection string we’ll be using.
:::note
You’re not going to set up `mongosh` for this tutorial, but this is the easiest way to get ahold of the connection string we’ll be using.
:::

![Grab your MongoDB connection string](https://storage.googleapis.com/estuary-marketing-strapi-uploads/uploads//image9_81fdbf1a20/image9_81fdbf1a20.png)
@@ -160,7 +162,8 @@ Before we initialize the connector, let’s talk a little bit about how incoming

The **documents** of your flows are stored in **collections**: real-time data lakes of JSON documents in cloud storage.

:::note Keep in mind, these are not the same documents and collections as the ones in MongoDB, only the names are similar, but we are talking about separate systems.
:::note
Keep in mind, these are not the same documents and collections as the ones in MongoDB, only the names are similar, but we are talking about separate systems.
:::

Collections being stored in an object storage mean that once you start capturing data, you won’t have to worry about it not being available to replay – object stores such as S3 can be configured to cheaply store data forever. See [docs page](https://docs.estuary.dev/concepts/collections/#documents) for more information about documents.
@@ -216,7 +219,8 @@ Incremental backfills in the MongoDB connector follow a straightforward approach

In the event of a pause in the connector's process, it resumes capturing change events from the point of interruption. However, the connector's ability to accomplish this depends on the size of the replica set oplog. In certain scenarios where the pause duration is significant enough for the oplog to purge old change events, the connector may necessitate redoing the backfill to maintain data consistency.

:::tip To ensure reliable data capture, it is recommended to [adjust the oplog size](https://www.mongodb.com/docs/manual/tutorial/change-oplog-size/#c.-change-the-oplog-size-of-the-replica-set-member) or set a [minimum retention period](https://www.mongodb.com/docs/manual/reference/command/replSetResizeOplog/#minimum-oplog-retention-period). A recommended minimum retention period of at least 24 hours is sufficient for most cases.
:::tip
To ensure reliable data capture, it is recommended to [adjust the oplog size](https://www.mongodb.com/docs/manual/tutorial/change-oplog-size/#c.-change-the-oplog-size-of-the-replica-set-member) or set a [minimum retention period](https://www.mongodb.com/docs/manual/reference/command/replSetResizeOplog/#minimum-oplog-retention-period). A recommended minimum retention period of at least 24 hours is sufficient for most cases.
:::

## Real-time CDC<a id="real-time-cdc"></a>
11 changes: 6 additions & 5 deletions site/docs/guides/connect-network.md
@@ -10,7 +10,7 @@ You configure this in the `networkTunnel` section of applicable capture or mater
before you can do so, you need a properly configured SSH server on your internal network or cloud hosting platform.

:::tip
If permitted by your organization, a quicker way to connect to a secure database is to [allowlist the Estuary IP addresses](/reference/allow-ip-addresses)
If permitted by your organization, a quicker way to connect to a secure database is to [allowlist the Estuary IP addresses](/reference/allow-ip-addresses).

For help completing this task on different cloud hosting platforms,
see the documentation for the [connector](../reference/Connectors/README.md) you're using.
@@ -37,6 +37,7 @@ to add your SSH server to your capture or materialization definition.
- `ssh://[email protected]`
- `ssh://[email protected]`
- `ssh://[email protected]:22`

:::info Hint
The [SSH default port is 22](https://www.ssh.com/academy/ssh/port).
Depending on where your server is hosted, you may not be required to specify a port,
@@ -55,12 +56,12 @@ to add your SSH server to your capture or materialization definition.
ssh-keygen -p -N "" -m pem -f /path/to/key
```

Taken together, these configuration details would allow you to log into the SSH server from your local machine.
They'll allow the connector to do the same.

5. Configure your internal network to allow the SSH server to access your capture or materialization endpoint.
4. Configure your internal network to allow the SSH server to access your capture or materialization endpoint.

6. To grant external access to the SSH server, it's essential to configure your network settings accordingly. The approach you take will be dictated by your organization's IT policies. One recommended step is to [allowlist the Estuary IP addresses](/reference/allow-ip-addresses). This ensures that connections from this specific IP are permitted through your network's firewall or security measures.
5. To grant external access to the SSH server, it's essential to configure your network settings accordingly. The approach you take will be dictated by your organization's IT policies. One recommended step is to [allowlist the Estuary IP addresses](/reference/allow-ip-addresses). This ensures that connections from this specific IP are permitted through your network's firewall or security measures.

## Setup for AWS
