-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ext cpu throttling scenarios #2581
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #2581 +/- ##
===========================================
- Coverage 71.96% 71.95% -0.01%
===========================================
Files 102 102
Lines 15397 15460 +63
Branches 2446 2459 +13
===========================================
+ Hits 11080 11125 +45
- Misses 3824 3836 +12
- Partials 493 499 +6
Continue to review full report at Codecov.
|
@@ -873,6 +913,31 @@ def start_tracking_extension_services_cgroups(self, services_list): | |||
if service_name is not None: | |||
self.start_tracking_unit_cgroups(service_name) | |||
|
|||
@staticmethod | |||
def get_extension_services_list(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
retrieve the manifest files of installed extensions to reset quotas when we disable cgroups for all.
process_output = read_output(stdout, stderr) | ||
|
||
if timed_out: | ||
if cpu_cgroup is not None:# Report CPUThrottledTime when timeout happens |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Report the Throttled time in the exception msg when timeout case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of passing the cpu_cgroup as parameter down the call stack, can we handle the exception in the cgroup configurator and report the throttle time there? (you can create a new exception ExtensionTimeout derived from ExtensionError)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I would like to do that way but when timeout happens, we kill the process before returns ExtensionTimeout exception. That would remove the cgroups so I had to pass cpu_cgroup down the lane to compute before kill.
elif disable_cgroups == DisableCgroups.AGENT: # disable agent | ||
self._agent_cgroups_enabled = False | ||
self.__reset_agent_cpu_quota() | ||
CGroupsTelemetry.stop_tracking(CpuCgroup(AGENT_NAME_TELEMETRY, self._agent_cpu_cgroup_path)) | ||
elif disable_cgroups == DisableCgroups.EXTENSIONS: # disable extensions | ||
self._extensions_cgroups_enabled = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed as we don't disable particular extension
@@ -519,7 +529,6 @@ def __reset_agent_cpu_quota(): | |||
""" | |||
logger.info("Resetting agent's CPUQuota") | |||
if CGroupConfigurator._Impl.__try_set_cpu_quota(''): # setting an empty value resets to the default (infinity) | |||
CGroupsTelemetry.set_track_throttled_time(False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed as we are already stop tracking the cgroup and also this would set to false for extensions too which we don't want to do it on agent throttling case
azurelinuxagent/common/cgroup.py
Outdated
@@ -138,6 +139,7 @@ def __init__(self, name, cgroup_path): | |||
self._current_system_cpu = None | |||
self._previous_throttled_time = None | |||
self._current_throttled_time = None | |||
self._throttled_time_lock = threading.RLock() # Protect the get_throttled_time which is called from Monitor thread and cgroupconfigurator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it doesn't seem like we need to protect concurrent access to this getter, do we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need in the case of long running extensions which are running more than 5mins or so. In those cases, monitor thread and update handler will have race condition.
process_output = read_output(stdout, stderr) | ||
|
||
if timed_out: | ||
if cpu_cgroup is not None:# Report CPUThrottledTime when timeout happens |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of passing the cpu_cgroup as parameter down the call stack, can we handle the exception in the cgroup configurator and report the throttle time there? (you can create a new exception ExtensionTimeout derived from ExtensionError)
throttled_time = 0 | ||
if cpu_cgroup is not None: | ||
try: | ||
throttled_time = float(cpu_cgroup.get_throttled_time() / 1E9) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do this conversion in a couple of places; we should abstract it in the CpuCgroup class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
files_to_create.append((drop_in_file_cpu_quota, cpu_quota_contents)) | ||
self.__create_all_files(files_to_create) | ||
except Exception as exception: | ||
_log_cgroup_warning('Failed to set CPUQuota: {0}', ustr(exception)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add the extension/service name to the error message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
read the extension services if ResourceLimits are present. | ||
""" | ||
extensions_services = {} | ||
for manifest_path in glob.iglob(os.path.join(conf.get_lib_dir(), "*/HandlerManifest.json")): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we abstract the extension traversal in ExtensionsHandlerHandler? that class already has some code to do this kind of stuff, and takes into account, for example, extensions that failed to install (you probably want to skip those)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The challenge was getting around cyclic dependency if I have to move this to extensionhandler
Description
Cgroup disable on following cases:-
Agent throttled - reset agent cpu quota.
Extra process in Agent cgroup - Disable All (includes extension and its services). reset quotas and reset monitors.
Systemd-run failure: disable all, reset quotas and reset monitors.
Also have changes for logging throttled time on extension timeout case.
PR information
Quality of Code and Contribution Guidelines