[DPE-2897] Cross-region async replication #368
Conversation
Deploy two models, each with 1x postgresql. Then, configure async replication as follows:

```shell
$ juju switch psql-1
$ juju offer postgresql-k8s:async-primary async-primary   # async-primary is the relation provided by the leader
$ juju switch psql-2
$ juju consume admin/psql-1.async-primary                  # consume the primary relation
$ juju relate postgresql-k8s:async-replica async-primary   # both applications are now related; postgresql-k8s in model psql-2 is the standby-leader
```

Now, run the action:

```shell
$ juju run -m psql-1 postgresql-k8s/0 promote-standby-cluster   # make postgresql-k8s in model psql-1 the leader cluster
```

Run the following command to check status:

```shell
# Set PATRONI_KUBERNETES_NAMESPACE to the model that hosts the cluster you are checking.
$ PATRONI_KUBERNETES_LABELS='{application: patroni, cluster-name: patroni-postgresql-k8s}' \
  PATRONI_KUBERNETES_NAMESPACE=psql-2 \
  PATRONI_KUBERNETES_USE_ENDPOINTS=true \
  PATRONI_NAME=postgresql-k8s-0 \
  PATRONI_REPLICATION_USERNAME=replication \
  PATRONI_SCOPE=patroni-postgresql-k8s \
  PATRONI_SUPERUSER_USERNAME=operator \
  patronictl -c /var/lib/postgresql/data/patroni.yml list
```

Role should be "Standby leader" and State should be "Running".
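For reference, a healthy standby cluster looks roughly like this in the `patronictl list` output (illustrative only; the exact columns, capitalization, and host addresses vary by Patroni version and deployment):

```
+ Cluster: patroni-postgresql-k8s ---------+----------------+---------+----+-----------+
| Member           | Host                  | Role           | State   | TL | Lag in MB |
+------------------+-----------------------+----------------+---------+----+-----------+
| postgresql-k8s-0 | 10.1.x.x              | Standby Leader | running |  1 |           |
+------------------+-----------------------+----------------+---------+----+-----------+
```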
```diff
@@ -116,6 +124,10 @@ postgresql:
   - {{ 'hostssl' if enable_tls else 'host' }} all all 0.0.0.0/0 md5
   {%- endif %}
   - {{ 'hostssl' if enable_tls else 'host' }} replication replication 127.0.0.1/32 md5
+  - {{ 'hostssl' if enable_tls else 'host' }} replication replication 127.0.0.6/32 md5
```
Just a question: why do we need to add the 127.0.0.6 IP?
src/relations/async_replication.py
Outdated
```python
# This unit is the leader, generate a new configuration and leave.
# There is nothing to do for the leader.
for attempt in Retrying(stop=stop_after_attempt(5), wait=wait_fixed(3)):
    with attempt:
        self.container.stop(self.charm._postgresql_service)
        self.charm.update_config()
        self.container.start(self.charm._postgresql_service)
```
Could you improve the comment a little? It is not clear why we need to restart the primary.
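As a possible wording (my assumption about the intent, to be corrected by the author if the restart serves a different purpose):

```python
# Restart PostgreSQL on the leader so it picks up the configuration
# regenerated above (assumption: some of the changed settings are not
# applied by a mere reload, hence the full stop/start cycle).
```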
```python
# Store current data in a ZIP file, clean folder and generate configuration.
logger.info("Creating backup of pgdata folder")
self.container.exec(
    f"tar -zcf /var/lib/postgresql/data/pgdata-{str(datetime.now()).replace(' ', '-').replace(':', '-')}.zip /var/lib/postgresql/data/pgdata".split()
```
Maybe a little faster:

```diff
-f"tar -zcf /var/lib/postgresql/data/pgdata-{str(datetime.now()).replace(' ', '-').replace(':', '-')}.zip /var/lib/postgresql/data/pgdata".split()
+f"tar -JcpSf /var/lib/postgresql/data/pgdata-{str(datetime.now()).replace(' ', '-').replace(':', '-')}.tar.xz /var/lib/postgresql/data/pgdata".split()
```
src/relations/async_replication.py
Outdated
```python
primary_relation = self.model.get_relation(ASYNC_PRIMARY_RELATION)
if not primary_relation:
    event.fail("No primary relation")
    return
```
I need some explanation of why the relation is needed during promotion. Maybe a direct call would work better.
```diff
@@ -1688,7 +1688,6 @@ files = [
     {file = "PyYAML-6.0.1-cp311-cp311-win_amd64.whl", hash = "sha256:bf07ee2fef7014951eeb99f56f39c9bb4af143d8aa3c21b1677805985307da34"},
     {file = "PyYAML-6.0.1-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:855fb52b0dc35af121542a76b9a84f8d1cd886ea97c84703eaa6d88e37a2ad28"},
     {file = "PyYAML-6.0.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:40df9b996c2b73138957fe23a16a4f0ba614f4c0efce1e9406a184b6d07fa3a9"},
-    {file = "PyYAML-6.0.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a08c6f0fe150303c1c6b71ebcd7213c2858041a7e01975da3a99aed1e7a378ef"},
```
I think this is caused by python-poetry/poetry#6513
src/coordinator_ops.py
Outdated
It might make sense to make this into a charm lib further down the line.
Code: LGTM. We can merge this to continue from here. IMHO, we should revise and improve the UI/UX.

Current UX:

Significant UX issues above:

Proposed UI/UX (let's forget about the test app and PgBouncer here for simplicity):

Please check this and consider it as a future UX improvement => branch a ticket. Thanks!
@marceloneppel I have re-tested the last commit (on Juju 3.4.0) using the UX I posted above; the first switchover psql-1 => psql-2 got stuck on the psql-1 side. It could be a Juju 3.4 issue, but no Juju issues were noticed there. BTW, Juju 3.4 has cross-model secrets fixed => a reason to test there (the test app can be in a 3rd model).
Hi @taurus-forever! I fixed the issue that caused a unit to become stuck in that state.
It was superseded by #447.
Issue
Connecting two PostgreSQL clusters through async replication in a cross-region setup is currently impossible.
Solution
Two endpoints were created to relate the Juju applications of the two PostgreSQL clusters (`async-primary` and `async-replica`). The relation exchanges information about the topology of the clusters (the IP addresses and the endpoint of one of the `sync_standby` members of the main cluster, i.e. the cluster that replicates its data to the other, secondary cluster) and also the secrets from the main cluster. To promote one cluster to be the main cluster, it's necessary to call the `promote-standby-cluster` action, which enables the replication between the clusters.

Also, if anything changes in the main cluster topology (like the `sync_standby` crashing), the endpoint of the main cluster is updated in the secondary cluster to keep replication working. If something happens to the `standby_leader` in the secondary cluster, another member takes that role and the replication continues to work. `src/relations/async_replication.py` contains all that logic, including the sharing of the IP addresses of the secondary cluster, which are needed to enable the replication connection from that cluster's units to the main cluster through the `pg_hba` rules.

On the secondary side, all units are stopped so that the cluster info can be deleted from the Patroni K8s endpoint and a new cluster can be started replicating from the main cluster in standby mode. For that, it's necessary to coordinate the units so that the new cluster starts only after all units have been stopped and the cluster information is empty in the K8s endpoint. This is done in `src/coordinator_ops.py`.

Integration tests are implemented at #369.
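To make the coordination idea concrete, here is a minimal sketch of the pattern (hypothetical names throughout; the real implementation in `src/coordinator_ops.py` differs in detail): each unit reports a `stopped` flag in the peer relation data, and only when the leader sees all flags set does it clear the Patroni endpoint info and signal the units to start the new standby cluster.

```python
# Sketch of the stop/start coordination described above (hypothetical names).
# Each unit stops its PostgreSQL service and acknowledges in the peer relation;
# the leader waits for all acknowledgements before allowing the restart.


class AsyncReplicationCoordinator:
    def __init__(self, charm, peer_relation_name: str = "coordinator"):
        self.charm = charm
        self.peer_relation_name = peer_relation_name

    def on_stop_requested(self, event) -> None:
        """Runs on every unit: stop the service and acknowledge."""
        relation = self.charm.model.get_relation(self.peer_relation_name)
        self.charm.stop_postgresql_service()  # hypothetical helper
        relation.data[self.charm.unit]["stopped"] = "True"

    def all_units_stopped(self) -> bool:
        """Runs on the leader: check whether every unit has acknowledged."""
        relation = self.charm.model.get_relation(self.peer_relation_name)
        units = {self.charm.unit} | set(relation.units)
        return all(relation.data[unit].get("stopped") == "True" for unit in units)

    def maybe_start_standby_cluster(self) -> None:
        """Runs on the leader: only proceed once the old cluster is fully down."""
        if not self.all_units_stopped():
            return  # wait for the remaining units to acknowledge
        self.charm.delete_patroni_cluster_endpoint_info()  # hypothetical helper
        self.charm.signal_units_to_start_as_standby()  # hypothetical helper
```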
Unit tests should be implemented in a separate PR too.