[POC] Add standby/active multi-cluster support #301
Deploy two models, each with 1x postgresql.
Then, configure async replication as follows:
$ juju switch psql-1
$ juju offer postgresql-k8s:async-primary async-primary  # async-primary is the relation provided by the leader
$ juju switch psql-2
$ juju consume admin/psql-1.async-primary  # consume the primary relation
$ juju relate postgresql-k8s:async-replica async-primary  # Both units are now related, where postgresql-k8s in model psql-2 is the standby-leader
Now, run the action:
$ juju run -m psql-1 postgresql-k8s/0 promote-standby-cluster  # move postgresql-k8s in model psql-1 to be the leader cluster
Run the following command to check status (update PATRONI_KUBERNETES_NAMESPACE to match the model being checked):
$ PATRONI_KUBERNETES_LABELS='{application: patroni, cluster-name: patroni-postgresql-k8s}' \
PATRONI_KUBERNETES_NAMESPACE=psql-2 \
PATRONI_KUBERNETES_USE_ENDPOINTS=true \
PATRONI_NAME=postgresql-k8s-0 \
PATRONI_REPLICATION_USERNAME=replication \
PATRONI_SCOPE=patroni-postgresql-k8s \
PATRONI_SUPERUSER_USERNAME=operator \
patronictl -c /var/lib/postgresql/data/patroni.yml list
Role should be "Standby leader" and State should be "Running".
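For anyone scripting the status check, here is a minimal sketch that builds the same environment and argument list in Python. The function name and its parameters are hypothetical and not part of this PR; the variable names and values are copied from the command above.

```python
def patronictl_list_command(namespace, unit_name="postgresql-k8s-0",
                            scope="patroni-postgresql-k8s"):
    """Return (env, argv) mirroring the manual patronictl status check.

    Hypothetical helper: only the variable names and values come from the
    command in the PR description.
    """
    env = {
        "PATRONI_KUBERNETES_LABELS": "{application: patroni, cluster-name: %s}" % scope,
        # The namespace is the Juju model being inspected, e.g. "psql-2".
        "PATRONI_KUBERNETES_NAMESPACE": namespace,
        "PATRONI_KUBERNETES_USE_ENDPOINTS": "true",
        "PATRONI_NAME": unit_name,
        "PATRONI_REPLICATION_USERNAME": "replication",
        "PATRONI_SCOPE": scope,
        "PATRONI_SUPERUSER_USERNAME": "operator",
    }
    argv = ["patronictl", "-c", "/var/lib/postgresql/data/patroni.yml", "list"]
    return env, argv
```

Inside the postgresql container this could be passed to subprocess.run(argv, env={**os.environ, **env}, check=True); that call is left out of the sketch because it only works where patronictl is installed.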
The async-replica/-primary relation break is similar in concept to primary/secondary relation endpoints in: https://ubuntu.com/ceph/docs/setting-up-multi-site
src/relations/async_replication.py
Outdated
@@ -0,0 +1,291 @@
# Copyright 2022 Canonical Ltd.
- # Copyright 2022 Canonical Ltd.
+ # Copyright 2023 Canonical Ltd.
src/relations/async_replication.py
Outdated
# If this is a standby-leader, then execute switchover logic
# TODO
Looking at Patroni docs:
Automatic promotion is not possible, because DC2 will never able to figure out the state of DC1.
You should not use pg_ctl promote in this scenario, you need “manually promote” the healthy cluster by removing standby_cluster section from the [dynamic configuration](https://patroni.readthedocs.io/en/latest/dynamic_configuration.html#dynamic-configuration).
It sounds to me that we need to remove the new relation (with --force?) to promote the replica cluster and we can't really promote and demote with just actions.
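To make the "remove the standby_cluster section" step concrete, here is a minimal sketch of the patch it amounts to. It assumes, as I read the Patroni REST API docs, that a PATCH to /config removes any key whose value is null. The helper names and the example host and port are hypothetical, and apply_patch only imitates that merge locally so the effect can be inspected without a running cluster.

```python
import json


def promotion_patch():
    """Patch body that drops standby_cluster from the dynamic configuration."""
    return {"standby_cluster": None}


def apply_patch(config, patch):
    """Local imitation of a null-removes-key merge (assumed Patroni behaviour)."""
    result = dict(config)
    for key, value in patch.items():
        if value is None:
            # A null value removes the key instead of storing null.
            result.pop(key, None)
        elif isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = apply_patch(result[key], value)
        else:
            result[key] = value
    return result


# Hypothetical dynamic configuration of a standby cluster.
standby_config = {
    "loop_wait": 10,
    "standby_cluster": {"host": "10.0.0.1", "port": 5432},
}

promoted_config = apply_patch(standby_config, promotion_patch())
request_body = json.dumps(promotion_patch())
```

Here request_body is what would be sent to the standby leader's /config endpoint, and promoted_config shows the configuration without the standby_cluster section.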
Hi @dragomirp, the way I see the fail/switchover happening is the following:
# First, offer async-primary from both models:
juju switch psql-1
juju offer postgresql-k8s:async-primary async-primary
juju switch psql-2
juju offer postgresql-k8s:async-primary async-primary
Then, in each model, we consume the other model's async-primary offer and relate to it:
juju consume admin/psql-2.async-primary
juju relate -m psql-1 postgresql-k8s:async-replica async-primary
juju consume admin/psql-1.async-primary
juju relate -m psql-2 postgresql-k8s:async-replica async-primary
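The symmetric setup above can be generated for any pair of models. This is a minimal sketch that only builds the command strings; the function name is hypothetical, and the extra juju switch before each consume is my assumption that the offer has to be consumed from inside the consuming model.

```python
def async_setup_commands(model_a, model_b, app="postgresql-k8s", owner="admin"):
    """Build the offer/consume/relate commands for two models (hypothetical helper)."""
    commands = []
    # Step 1: each model offers its async-primary endpoint.
    for model in (model_a, model_b):
        commands.append(f"juju switch {model}")
        commands.append(f"juju offer {app}:async-primary async-primary")
    # Step 2: each model consumes the other model's offer and relates to it.
    for local, remote in ((model_a, model_b), (model_b, model_a)):
        commands.append(f"juju switch {local}")  # assumption: consume runs in the local model
        commands.append(f"juju consume {owner}/{remote}.async-primary")
        commands.append(f"juju relate -m {local} {app}:async-replica async-primary")
    return commands
```

Printing "\n".join(async_setup_commands("psql-1", "psql-2")) gives a script equivalent to the commands in this comment.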
Once that setup is done, the postgresql apps know that async replication is available, but they will not apply the actual configuration yet. That happens once we run the promote-standby-cluster action on one of the models.
At that moment, the model where the action ran should take over and become the primary. The remaining clusters continue as replicas.
At switchover, the user must initiate the process. That should be:
juju run -m <model-with-old-primary> postgresql-k8s/leader demote-primary-cluster
juju run -m <model-with-new-primary> postgresql-k8s/leader promote-standby-cluster
The demote should not be successful if the target unit sees a cluster as "primary" still connected in its async-replica relation.
In case of failover, I think you are correct: depending on the state of the primary cluster, we may end up with Juju still knowing about the primary cluster and holding its databag (with the primary key set), even though the cluster is gone. In that case we would indeed need to remove the relation first, possibly with --force, and then promote one of the replica clusters to leader.
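A rough sketch of the two flows, under stated assumptions: the role strings stored in the databag ("primary", "standby"), the function names, and the exact remove-relation endpoints are hypothetical and only illustrate the ordering described above.

```python
from typing import List, Mapping


def demote_allowed(async_replica_roles: Mapping[str, str]) -> bool:
    """Refuse the demote while a related cluster still reports itself as primary.

    async_replica_roles maps a remote cluster name to the role found in its
    relation databag (hypothetical layout).
    """
    return "primary" not in async_replica_roles.values()


def switchover_commands(old_model: str, new_model: str,
                        app: str = "postgresql-k8s") -> List[str]:
    """Planned switchover: demote the old primary first, then promote."""
    return [
        f"juju run -m {old_model} {app}/leader demote-primary-cluster",
        f"juju run -m {new_model} {app}/leader promote-standby-cluster",
    ]


def failover_commands(new_model: str, app: str = "postgresql-k8s") -> List[str]:
    """Failover: the old primary is gone, so drop the relation, then promote."""
    return [
        # Assumption: the stale relation is removed from the surviving model.
        f"juju remove-relation -m {new_model} {app}:async-replica async-primary --force",
        f"juju run -m {new_model} {app}/leader promote-standby-cluster",
    ]
```

Both flows stay manual on purpose: nothing here decides on its own that a failover should happen.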
But in any case, I don't think we should do automatic failover between clusters, as this is async replication. It should be a conscious decision from the user.
Nice. Thanks @phvalguima!
On postgresql-k8s-operator/src/charm.py, line 486 in c4c0adb:
if not self.is_primary and (
we should also check for standby_leader (like we do for primary) to avoid a wrong status message (after scaling up the standby cluster).
Based on the first feedback from our discussions:
I just finished testing with clusters composed of at least 2 units. I noticed two issues: (1) #306 - I was wrongly using
Regarding @dragomirp's question: I can run a switchover between clusters with some manual steps. First, I deploy the environment with the following steps:
At the end of the process above, cluster psql-2 will be the
Then, the switchover can be done by removing the following entries from the Standby Leader:
The standby cluster will be promoted to leader:
These are just very early results. We need to test this type of switchover on clusters under stress.
… will stop their services before moving on and reconfiguring
Superseded by #368.
This PR adds support for active/standby multi-clustering in PostgreSQL. It uses Patroni's standby_cluster option to bootstrap one of the clusters as a follower of the primary unit. It creates 2x new relations and 2x new actions:
The UX is described as follows:
Then, configure async replication as follows:
Finally, set the relation and run the promotion action:
Once the models settle, it is possible to check the status within one of the postgresql units.
For example, the following status can be seen in standby's patroni:
Role should be "Standby leader" and State should be "Running".
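The Role/State expectation can also be checked programmatically. This is a minimal sketch that assumes each cluster member is available as a dict with "Role" and "State" keys, modeled on the columns of the patronictl table; the function name and the exact spelling of the role value are assumptions, so the comparison is normalized.

```python
def has_running_standby_leader(members):
    """Return True if any member is a standby leader in the running state.

    members: list of dicts with "Role" and "State" keys (assumed shape).
    """
    for member in members:
        # Normalize so "Standby leader", "Standby Leader" and "standby_leader" all match.
        role = str(member.get("Role", "")).replace("_", " ").strip().lower()
        state = str(member.get("State", "")).strip().lower()
        if role == "standby leader" and state == "running":
            return True
    return False
```

A check like this could run against each standby model after the promotion action has settled.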