-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhancement freeze logic #180
Enhancement freeze logic #180
Conversation
- Machine Controller is now frozen when either the control (or) target APIServer is unreachable/down - It is unfrozen when they are reachable again - When unfreeze happens, the machine objects are re-queued and any machine object's who health check was being processed is re-initialized
- Supports faking of object tracker returns/delays - Controller test suite now supports control and target clients
653e786
to
adef455
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Prashanth! Overall, this looks very good!
I have added some comments and requested some changes.
glog.V(3).Infof("reconcileClusterMachineSafetyAPIServer: Start") | ||
defer glog.V(3).Infof("reconcileClusterMachineSafetyAPIServer: Stop") | ||
|
||
if c.safetyOptions.MachineControllerFrozen { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the logic inside the if-else
block run regardless of whether machine controller is frozen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If condition (When controller is frozen) contains changes that updates machine status when from moving from health-check resetting (Unknown-->Running). Hence it makes sense to run this only when we unfreezing the controller and hence this check.
For the else logic (When controller is not frozen) sets the timer to check APIServer inactivity, and this checks will not be required if the controller is already frozen hence have added this check here.
Keeping the above two reasons in mind, I hope that the conditions make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Understood.
But I thought inverting the nesting of checks for frozen and isServerUp might lead to leaner generated code. Please ignore. Not important enough ;-)
c.safetyOptions.APIserverInactiveStartTime = time.Time{} | ||
glog.V(2).Infof("reconcileClusterMachineSafetyAPIServer: UnFreezing Machine Controller") | ||
} | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above ;-)
Thanks for your review @amshuman-kr . I shall make the necessary changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this scale? I mean - for large clusters of sizes 100-200 machines:
- How much time does
isAPIServerUp()
take in listing all nodes and all machines - How much time does it take to set/reset the status field of all machines in both cases of freeze and unfreeze
Can we do a sample run, and get these numbers. That would give some confidence into the design.
Also how much do we want it to scale. 1000 machines?
In case, we find that this doesn't scale, then does below method provide some improvement:
isAPIServerUp()
lists all machinesets or machinedeployments- Instead of setting/resetting status of all machines, does it make more sense to set status on machinedpeloyment and stop machine deployment controller from reconciling status from machineset
Perhaps we can use a get call to a known non-existing resource where we expect a standard error response? Or simply use the REST client instead? |
Or maybe I could run a simple GET("dummy_name") call on a dummy machine/node object. Thereby only trying to fetch a single object at most. |
I have made changes to adapt to most of the changes suggested by @amshuman-kr . With respect to the valid concern raised by @ggaurav10 regarding LIST() of objects, i have changed the checks to GET() instead of LIST(). Refer - 53ad7ff#diff-65daa4cabb1be8c69955ab239204630bR154 |
@ggaurav10 Thanking you 👍 @amshuman-kr I am waiting to hear back from you on this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Only some questions.
glog.V(3).Infof("reconcileClusterMachineSafetyAPIServer: Start") | ||
defer glog.V(3).Infof("reconcileClusterMachineSafetyAPIServer: Stop") | ||
|
||
if c.safetyOptions.MachineControllerFrozen { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Understood.
But I thought inverting the nesting of checks for frozen and isServerUp might lead to leaner generated code. Please ignore. Not important enough ;-)
c.safetyOptions.APIserverInactiveStartTime = time.Time{} | ||
glog.V(2).Infof("reconcileClusterMachineSafetyAPIServer: UnFreezing Machine Controller") | ||
} | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above ;-)
c, w := createController(stop, namespace, objects, nil) | ||
defer w.Stop() | ||
c, trackers := createController(stop, namespace, controlMachineObjects, nil, nil) | ||
defer trackers.Stop() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
// isAPIServerUp returns true if APIServers are up | ||
// Both control and target APIServers | ||
func (c *controller) isAPIServerUp() bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need some timeouts or retries here before fail and freeze?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are using timeouts before freezing, I assume this is what you mean by the above statement - https://github.com/gardener/machine-controller-manager/pull/180/files/53ad7ff54c570cab93137d15d8d038ce9036b0bd?utf8=%E2%9C%93&diff=unified#diff-65daa4cabb1be8c69955ab239204630bR135
So what we have that we have MachineSafetyAPIServerStatusCheckPeriod (default: 1min) and MachineSafetyAPIServerStatusCheckTimeout (default: 30sec). So the MCM polls every 1min and checks if APIServer is down, and if it is then it waits for 30secs during which it polls every 30/5 = 6seconds before it freezes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant timeouts for the Get calls. I guess we are relying on the defaults there from client-co, right? But I guess, that might not matter. Please ignore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
1. Blank object fetch is causing issue in newer versions of kubernetes 2. Added comments in CI tests
9d1dd3c
to
bdf03bf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
What this PR does / why we need it:
We need to enhance the freeze logic of safety-controller when APIServers are not reachable to freeze the machine controller from doing any wrong/unwanted reconcilations
Which issue(s) this PR fixes:
Fixes #155
Special notes for your reviewer:
The commits have been split into the following
Release note: