DFBUGS-1026: Provision to disable DRPC reconcile temporarily - WA of CG #1720

Open
wants to merge 1 commit into base: main

Conversation

@GowthamShanmugam commented Dec 11, 2024

Issue: https://issues.redhat.com/browse/DFBUGS-1026

Description
The issue is that the user has to opt into the multi-volume consistency group mode of replication via the DRPC (DRPlacementControl CR) by adding an annotation. The limitation is that the annotation has to be added while creating the DRPC; it cannot be added afterwards. The problem is the UI: users cannot test this tech preview feature via the UI because the UI creates the DRPC for them, and they cannot add the annotation after the UI has created it. KCS: https://access.redhat.com/articles/7091592

The proposed solution is:
To temporarily pause DRPC reconciliation, a new ramen configuration option is introduced, so a DRPC created through the user interface is not reconciled. The user can then add the annotations drplacementcontrol.ramendr.openshift.io/is-cg-enabled and drplacementcontrol.ramendr.openshift.io/reconciliationUnpaused to resume reconciliation and enable the consistency group for a specific DRPC. This is a short-term fix for the dev preview feature.
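
As a rough illustration of the intended gate, here is a minimal sketch reusing the names that appear in the snippets reviewed below (schedulingEnabled, ReconciliationUnpaused, StartDRPCReconciliationPaused); the import path and exact wiring are assumptions, not the final implementation:

// Assumed import path for the ramen API types.
import rmn "github.com/ramendr/ramen/api/v1alpha1"

// ReconciliationUnpaused is the per-DRPC annotation that overrides the global pause.
const ReconciliationUnpaused = "drplacementcontrol.ramendr.openshift.io/reconciliationUnpaused"

// schedulingEnabled reports whether this DRPC should be reconciled: the global
// pause flag in the ramen config disables reconciliation unless the DRPC
// explicitly opts back in via the annotation.
func schedulingEnabled(drpc *rmn.DRPlacementControl, ramenConfig *rmn.RamenConfig) bool {
	if drpc.GetAnnotations()[ReconciliationUnpaused] == "true" {
		return true
	}

	return !ramenConfig.StartDRPCReconciliationPaused
}

With a gate like this, a DRPC created by the UI is left alone until the user adds the is-cg-enabled and reconciliationUnpaused annotations.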

@nirs (Member) commented Dec 11, 2024

Issue: https://issues.redhat.com/browse/DFBUGS-1026

The issue explains the problem, but what is the proposed solution? What is the user flow with this fix? Which flows should be affected?

@nirs (Member) left a comment

The solution is not clear to me; I added some comments about the code style.

@GowthamShanmugam force-pushed the DFBUGS-1026 branch 4 times, most recently from ba359d9 to 25a0902 on December 11, 2024 15:51
@GowthamShanmugam (Author)

/hold not yet tested on the cluster

@BenamarMk (Member)

The solution is not clear to me; I added some comments about the code style.

@nirs the description outlines both the problem and the proposed solution. As Gowtham mentioned, this is intended as a temporary fix and is expected to be removed in future versions. That said, I like the concept of pausing a drpc reconciliation, as it could offer valuable control in certain scenarios.

@nirs (Member) commented Dec 11, 2024

The solution is not clear to me; I added some comments about the code style.

@nirs the description outlines both the problem and the proposed solution. As Gowtham mentioned, this is intended as a temporary fix and is expected to be removed in future versions. That said, I like the concept of pausing a drpc reconciliation, as it could offer valuable control in certain scenarios.

Yes, I also think this is a useful feature - being able to disable reconcile for a while sounds like a nice feature, even if it does not stop replication on the managed cluster. So we had better add these options in a generic way and keep them.

@nirs (Member) left a comment

We need to think about which operations are affected by disabling reconcile. The current change will not work for deleting the drpc.

Should we disable reconcile only for a live drpc, and reconcile normally if the drpc is deleted?


// StartDRPCReconciliationPaused is a flag to pause all the DRPC reconcile temporarily.
// This flag can be overridden by the annotation drplacementcontrol.ramendr.openshift.io/reconciliationUnpaused
StartDRPCReconciliationPaused bool `json:"startDRPCReconciliationPaused,omitempty"`

I would remove the Start from the name; this flag disables reconcile even if it was already started. It means that the drpc is paused by default and requires unpausing to start reconciling.

I'm not sure about paused and unpaused; maybe we can use more common terms like disabled and enabled?

The option can be:

drpcReconciliationDisabled

and the annotation:

reconciliation-enabled

The logic seems clear: if the option is set, all drpcs are disabled, except those with the explicit "enabled" annotation. It is common for a local flag to override a global flag.

@BenamarMk what do you think?


The Start... here is simply an indication that disabling is intended for new DRPCs only. But we can drop it; I'll let Gowtham decide. Also, I like the word Pause instead of Disable.


I am not really certain we need this; to stop reconciliation, just scale down the deployment to 0 replicas. That would scale down all controllers (i.e. policy/cluster/etc.), but this seems excessive at present.

Once the annotation is removed for CG enabling, if a future use case arises where a single controller from the set needs to be disabled, this may be considered.

@@ -2467,3 +2480,8 @@ func (r *DRPlacementControlReconciler) drpcHaveCommonClusters(ctx context.Contex

return drpolicyClusters.Intersection(otherDrpolicyClusters).Len() > 0, nil
}

func schedulingEnabled(drpc *rmn.DRPlacementControl, ramenConfig *rmn.RamenConfig) bool {
force := drpc.GetAnnotations()[ReconciliationUnpaused]

force does not match the new name; it was good for the old name.

@@ -152,6 +155,16 @@ func (r *DRPlacementControlReconciler) Reconcile(ctx context.Context, req ctrl.R
return ctrl.Result{}, err
}

// Workaround: to add drplacementcontrol.ramendr.openshift.io/is-cg-enabled annotation
// for the application workload that is DR protected from the UI.
if !schedulingEnabled(drpc, ramenConfig) {

Since the flags are about reconciliation, this helper should be reconciliationEnabled()

if !schedulingEnabled(drpc, ramenConfig) {
logger.Info("DRPC is reconcile is paused, To continue the reconsile, " +
"either add drplacementcontrol.ramendr.openshift.io/reconciliationUnpaused: 'true' " +
"annotation or update StartDRPCReconciliationPaused on the ramen configMap.")

This log will appear many times - each time a watched resource changes, and it is a very verbose log.

It would be better to set the status to report that the drpc is disabled. In this case this will happen only once, since on the next reconcile the status will not change.
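
For example, a minimal sketch of reporting this through a status condition instead of a log line, using the standard apimachinery helpers (the condition type and reason strings are made up for illustration, and the caller would still need to update the status subresource):

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	rmn "github.com/ramendr/ramen/api/v1alpha1" // assumed import path
)

// setReconciliationPausedCondition records the paused state in the DRPC status.
// Unlike a log message, this is written once and stays stable across reconciles.
func setReconciliationPausedCondition(drpc *rmn.DRPlacementControl) {
	meta.SetStatusCondition(&drpc.Status.Conditions, metav1.Condition{
		Type:               "ReconciliationPaused", // hypothetical condition type
		Status:             metav1.ConditionTrue,
		Reason:             "PausedByRamenConfig", // hypothetical reason
		ObservedGeneration: drpc.Generation,
		Message:            "reconciliation is paused by the ramen config; add the reconciliationUnpaused annotation to resume",
	})
}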

})

When("Application deployed for the first time", func() {
It("Should not have any drpc status", func() {

It would be better to have the status report that the drpc is disabled.

populateDRClusters()
})

When("Application deployed for the first time", func() {

DRPC created for the first time?

Application deployment may not be related to the drpc; it can be deployed before the drpc was created.

})

When("Update ramen config to override DRPC reconcile", func() {
It("Should have some drpc status after reconcile", func() {

Do you mean that disabling reconciliation will keep the status from before the reconciliation was disabled?

It would be more useful if we had a disabled field in the drpc status, reflecting the current state of the drpc. If we disable the drpc after it was reconciled, the status would be the same as the last status, with Status.disabled: "true".

When("Deleting DRPolicy with DRPC references", func() {
It("Should retain the deleted DRPolicy in the API server", func() {
By("\n\n*** DELETE drpolicy ***\n\n")
deleteDRPolicyAsync()

This tests that drpolicy is not deleted?


When("Deleting DRPolicy with DRPC references", func() {
It("Should retain the deleted DRPolicy in the API server", func() {
By("\n\n*** DELETE drpolicy ***\n\n")

This was copied from another test? Not sure why we need all these newlines and ***.

When("Deleting user PlacementRule", func() {
It("Should cleanup DRPC", func() {
By("\n\n*** DELETE User PlacementRule ***\n\n")
deleteUserPlacementRule(UserPlacementRuleName, DefaultDRPCNamespace)

What does "cleanup DRPC" means? if this delete the drpc, the next test is incorrect since the drpc was already deleted.

But I don't think that deleting drpc will work with the current code, since return from the reconcile very early before we reach the check for deleted drpc.

@ShyamsundarR (Member)


Can we just scale the reconciler deployment to 0 and then create the DRPC via the console, edit the DRPC with the required annotation and then scale back the deployment?
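
For reference, a minimal sketch of that manual flow done with client-go (the hub namespace and deployment name are assumptions for illustration; use the values from the actual ramen hub install):

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// scaleRamenHub scales the ramen hub operator deployment: 0 replicas before the
// DRPC is created from the console, back to 1 after the annotation is added.
func scaleRamenHub(ctx context.Context, c kubernetes.Interface, replicas int32) error {
	const ns, name = "openshift-operators", "ramen-hub-operator" // assumed names

	scale, err := c.AppsV1().Deployments(ns).GetScale(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}

	scale.Spec.Replicas = replicas

	_, err = c.AppsV1().Deployments(ns).UpdateScale(ctx, name, scale, metav1.UpdateOptions{})

	return err
}

The same scale down / annotate / scale back flow can of course be done with kubectl scale on the hub; the sketch just mirrors it programmatically.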

@BenamarMk (Member) commented Dec 11, 2024

Can we just scale the reconciler deployment to 0 and then create the DRPC via the console, edit the DRPC with the required annotation and then scale back the deployment?

@ShyamsundarR initially, that was the approach, and that's how we test it now. However, during the testing process, we consistently found ourselves cleaning up after the mess due to forgetting to scale down the deployment. If we think that's acceptable though, then we can close the PR. The idea was to set the new global config once and forget about it.

@nirs (Member) commented Dec 11, 2024

Is it possible to add the missing annotation in the UI instead of adding this new "paused" state to ramen?

@ShyamsundarR (Member)

Can we just scale the reconciler deployment to 0 and then create the DRPC via the console, edit the DRPC with the required annotation and then scale back the deployment?

@ShyamsundarR initially, that was the approach, and that's how we test it now. However, during the testing process, we consistently found ourselves cleaning up after the mess due to forgetting to scale down the deployment. If we think that's acceptable though, then we can close the PR. The idea was to set the new global config once and forget about it.

I understand someone forgetting it and so making it easier, but as the feature is still under a dev preview state, I would prefer we not add config changes to help the qualification efforts. IOW, we can close this and (soon) shift to testing with the default methods in the future.

@GowthamShanmugam (Author)

I understand someone forgetting it and so making it easier, but as the feature is still under a dev preview state, I would prefer we not add config changes to help the qualification efforts. IOW, we can close this and (soon) shift to testing with the default methods in the future.

cc: @BenamarMk for the confirmation to close this PR

@BenamarMk (Member)

I understand someone forgetting it and so making it easier, but as the feature is still under a dev preview state, I would prefer we not add config changes to help the qualification efforts. IOW, we can close this and (soon) shift to testing with the default methods in the future.

cc: @BenamarMk for the confirmation to close this PR

It looks like the general agreement is that, for a dev preview release, asking users to manually scale down Ramen at the hub seems acceptable. This change was proposed to address a common issue where forgetting to scale down Ramen before deploying a new workload led to additional cleanup efforts after the fact. The goal was to introduce a "set it once and forget it" option in the Ramen config. This way, no actions would be taken on new workloads until users explicitly unpause it, serving as a built-in reminder.

That said, I am aligned with the decision to close the PR.

@GowthamShanmugam (Author) commented Dec 12, 2024

I understand someone forgetting it and so making it easier, but as the feature is still under a dev preview state, I would prefer we not add config changes to help the qualification efforts. IOW, we can close this and (soon) shift to testing with the default methods in the future.

cc: @BenamarMk for the confirmation to close this PR

It looks like the general agreement is that, for a dev preview release, asking users to manually scale down Ramen at the hub seems acceptable. This change was proposed to address a common issue where forgetting to scale down Ramen before deploying a new workload led to additional cleanup efforts after the fact. The goal was to introduce a "set it once and forget it" option in the Ramen config. This way, no actions would be taken on new workloads until users explicitly unpause it, serving as a built-in reminder.

That said, I am aligned with the decision to close the PR.

@ShyamsundarR The UI epic is targeted for ODF 4.19, and the current KCS has some gaps for 4.17 and 4.18: https://access.redhat.com/articles/7091592. I am OK to close this PR, but we need to correct this KCS. Shall I ask the documentation team to add the steps to scale down the ramen hub pod deployment?
