Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancement freeze logic #180

Merged
merged 5 commits into from
Oct 31, 2018

Conversation

prashanth26
Copy link
Contributor

@prashanth26 prashanth26 commented Oct 22, 2018

What this PR does / why we need it:
We need to enhance the freeze logic of safety-controller when APIServers are not reachable to freeze the machine controller from doing any wrong/unwanted reconcilations

Which issue(s) this PR fixes:
Fixes #155

Special notes for your reviewer:
The commits have been split into the following

  1. (Un)freezing of machine controller based on APIServer availability - Actual logic for freezing
  2. Controller test suite now supports control and target clients, Testing Framework now supports faking of object tracker returns/delays - Enhancements to testing framework
  3. Added test cases for machine controller freezing - Adding unit test cases to test changes.

Release note:

(Un)Freezing of machine controller based on APIServer availability 

- Machine Controller is now frozen when either the control (or) target APIServer is unreachable/down
- It is unfrozen when they are reachable again
- When unfreeze happens, the machine objects are re-queued and any machine object's who health check was being processed is re-initialized
@prashanth26 prashanth26 added kind/enhancement Enhancement, improvement, extension priority/critical Needs to be resolved soon, because it impacts users negatively needs/review Needs review status/new Issue is new and unprocessed component/machine-controller-manager size/s Size of pull request is small (see gardener-robot robot/bots/size.py) area/dev-productivity Developer productivity related (how to improve development) topology/seed Affects Seed clusters labels Oct 22, 2018
@prashanth26 prashanth26 requested a review from a team as a code owner October 22, 2018 14:22
- Supports faking of object tracker returns/delays
- Controller test suite now supports control and target clients
@prashanth26 prashanth26 force-pushed the enhancement/enhance-freeze branch from 653e786 to adef455 Compare October 22, 2018 14:29
Copy link

@amshuman-kr amshuman-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Prashanth! Overall, this looks very good!

I have added some comments and requested some changes.

Makefile Show resolved Hide resolved
cmd/machine-controller-manager/app/options/options.go Outdated Show resolved Hide resolved
glog.V(3).Infof("reconcileClusterMachineSafetyAPIServer: Start")
defer glog.V(3).Infof("reconcileClusterMachineSafetyAPIServer: Stop")

if c.safetyOptions.MachineControllerFrozen {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the logic inside the if-else block run regardless of whether machine controller is frozen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If condition (When controller is frozen) contains changes that updates machine status when from moving from health-check resetting (Unknown-->Running). Hence it makes sense to run this only when we unfreezing the controller and hence this check.

For the else logic (When controller is not frozen) sets the timer to check APIServer inactivity, and this checks will not be required if the controller is already frozen hence have added this check here.

Keeping the above two reasons in mind, I hope that the conditions make sense.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Understood.

But I thought inverting the nesting of checks for frozen and isServerUp might lead to leaner generated code. Please ignore. Not important enough ;-)

c.safetyOptions.APIserverInactiveStartTime = time.Time{}
glog.V(2).Infof("reconcileClusterMachineSafetyAPIServer: UnFreezing Machine Controller")
}
} else {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above ;-)

pkg/controller/machine_safety_test.go Outdated Show resolved Hide resolved
pkg/controller/machine_test.go Outdated Show resolved Hide resolved
pkg/controller/machine_test.go Outdated Show resolved Hide resolved
pkg/controller/machine_test.go Outdated Show resolved Hide resolved
pkg/fakeclient/client.go Show resolved Hide resolved
pkg/fakeclient/client.go Outdated Show resolved Hide resolved
@prashanth26
Copy link
Contributor Author

prashanth26 commented Oct 23, 2018

Thanks for your review @amshuman-kr . I shall make the necessary changes.

@prashanth26 prashanth26 changed the title Enhancement/enhance freeze Enhancement freeze logic Oct 23, 2018
Copy link
Contributor

@ggaurav10 ggaurav10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this scale? I mean - for large clusters of sizes 100-200 machines:

  1. How much time does isAPIServerUp() take in listing all nodes and all machines
  2. How much time does it take to set/reset the status field of all machines in both cases of freeze and unfreeze

Can we do a sample run, and get these numbers. That would give some confidence into the design.
Also how much do we want it to scale. 1000 machines?
In case, we find that this doesn't scale, then does below method provide some improvement:

  1. isAPIServerUp() lists all machinesets or machinedeployments
  2. Instead of setting/resetting status of all machines, does it make more sense to set status on machinedpeloyment and stop machine deployment controller from reconciling status from machineset

@amshuman-kr
Copy link

amshuman-kr commented Oct 23, 2018

Perhaps we can use a get call to a known non-existing resource where we expect a standard error response?

Or simply use the REST client instead?

@prashanth26
Copy link
Contributor Author

Or maybe I could run a simple GET("dummy_name") call on a dummy machine/node object. Thereby only trying to fetch a single object at most.

@prashanth26
Copy link
Contributor Author

I have made changes to adapt to most of the changes suggested by @amshuman-kr .

With respect to the valid concern raised by @ggaurav10 regarding LIST() of objects, i have changed the checks to GET() instead of LIST(). Refer - 53ad7ff#diff-65daa4cabb1be8c69955ab239204630bR154

pkg/controller/machine_safety.go Show resolved Hide resolved
pkg/controller/machine_safety.go Show resolved Hide resolved
pkg/controller/machine_safety.go Show resolved Hide resolved
@prashanth26
Copy link
Contributor Author

prashanth26 commented Oct 26, 2018

@ggaurav10 Thanking you 👍 @amshuman-kr I am waiting to hear back from you on this PR.

Copy link

@amshuman-kr amshuman-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Only some questions.

glog.V(3).Infof("reconcileClusterMachineSafetyAPIServer: Start")
defer glog.V(3).Infof("reconcileClusterMachineSafetyAPIServer: Stop")

if c.safetyOptions.MachineControllerFrozen {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Understood.

But I thought inverting the nesting of checks for frozen and isServerUp might lead to leaner generated code. Please ignore. Not important enough ;-)

c.safetyOptions.APIserverInactiveStartTime = time.Time{}
glog.V(2).Infof("reconcileClusterMachineSafetyAPIServer: UnFreezing Machine Controller")
}
} else {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above ;-)

c, w := createController(stop, namespace, objects, nil)
defer w.Stop()
c, trackers := createController(stop, namespace, controlMachineObjects, nil, nil)
defer trackers.Stop()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


// isAPIServerUp returns true if APIServers are up
// Both control and target APIServers
func (c *controller) isAPIServerUp() bool {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need some timeouts or retries here before fail and freeze?

Copy link
Contributor Author

@prashanth26 prashanth26 Oct 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are using timeouts before freezing, I assume this is what you mean by the above statement - https://github.com/gardener/machine-controller-manager/pull/180/files/53ad7ff54c570cab93137d15d8d038ce9036b0bd?utf8=%E2%9C%93&diff=unified#diff-65daa4cabb1be8c69955ab239204630bR135

So what we have that we have MachineSafetyAPIServerStatusCheckPeriod (default: 1min) and MachineSafetyAPIServerStatusCheckTimeout (default: 30sec). So the MCM polls every 1min and checks if APIServer is down, and if it is then it waits for 30secs during which it polls every 30/5 = 6seconds before it freezes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant timeouts for the Get calls. I guess we are relying on the defaults there from client-co, right? But I guess, that might not matter. Please ignore.

Copy link

@amshuman-kr amshuman-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@prashanth26 prashanth26 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Oct 31, 2018
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Oct 31, 2018
@gardener-robot-ci-1 gardener-robot-ci-1 added the needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Oct 31, 2018
@prashanth26 prashanth26 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Oct 31, 2018
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Oct 31, 2018
1. Blank object fetch is causing issue in newer versions of kubernetes
2. Added comments in CI tests
@prashanth26 prashanth26 force-pushed the enhancement/enhance-freeze branch from 9d1dd3c to bdf03bf Compare October 31, 2018 12:16
Copy link
Member

@hardikdr hardikdr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@hardikdr hardikdr merged commit 35fd0d5 into gardener:master Oct 31, 2018
@prashanth26 prashanth26 deleted the enhancement/enhance-freeze branch November 2, 2018 05:05
@ghost ghost added the component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) label Mar 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/dev-productivity Developer productivity related (how to improve development) component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) kind/enhancement Enhancement, improvement, extension needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/review Needs review priority/critical Needs to be resolved soon, because it impacts users negatively size/s Size of pull request is small (see gardener-robot robot/bots/size.py) status/new Issue is new and unprocessed topology/seed Affects Seed clusters
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enhance freeze logic
5 participants