Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ext cpu throttling scenarios #2581

Merged
merged 7 commits into from
May 17, 2022
Merged

Conversation

nagworld9
Copy link
Contributor

Description

Cgroup disable on following cases:-

Agent throttled - reset agent cpu quota.

Extra process in Agent cgroup - Disable All (includes extension and its services). reset quotas and reset monitors.
 
Systemd-run failure: disable all, reset quotas and reset monitors.

Also have changes for logging throttled time on extension timeout case.


PR information

  • The title of the PR is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New Unit tests were added for the changes made

Quality of Code and Contribution Guidelines

@codecov
Copy link

codecov bot commented May 4, 2022

Codecov Report

Merging #2581 (e174e27) into develop (e4ec4c9) will decrease coverage by 0.00%.
The diff coverage is 72.72%.

@@             Coverage Diff             @@
##           develop    #2581      +/-   ##
===========================================
- Coverage    71.96%   71.95%   -0.01%     
===========================================
  Files          102      102              
  Lines        15397    15460      +63     
  Branches      2446     2459      +13     
===========================================
+ Hits         11080    11125      +45     
- Misses        3824     3836      +12     
- Partials       493      499       +6     
Impacted Files Coverage Δ
azurelinuxagent/common/cgroupconfigurator.py 73.52% <63.93%> (-0.72%) ⬇️
...urelinuxagent/common/utils/extensionprocessutil.py 96.96% <88.23%> (-3.04%) ⬇️
azurelinuxagent/common/cgroup.py 91.08% <100.00%> (+0.11%) ⬆️
azurelinuxagent/common/cgroupapi.py 81.60% <100.00%> (+0.21%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e4ec4c9...e174e27. Read the comment docs.

@@ -873,6 +913,31 @@ def start_tracking_extension_services_cgroups(self, services_list):
if service_name is not None:
self.start_tracking_unit_cgroups(service_name)

@staticmethod
def get_extension_services_list():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

retrieve the manifest files of installed extensions to reset quotas when we disable cgroups for all.

process_output = read_output(stdout, stderr)

if timed_out:
if cpu_cgroup is not None:# Report CPUThrottledTime when timeout happens
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Report the Throttled time in the exception msg when timeout case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of passing the cpu_cgroup as parameter down the call stack, can we handle the exception in the cgroup configurator and report the throttle time there? (you can create a new exception ExtensionTimeout derived from ExtensionError)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I would like to do that way but when timeout happens, we kill the process before returns ExtensionTimeout exception. That would remove the cgroups so I had to pass cpu_cgroup down the lane to compute before kill.

elif disable_cgroups == DisableCgroups.AGENT: # disable agent
self._agent_cgroups_enabled = False
self.__reset_agent_cpu_quota()
CGroupsTelemetry.stop_tracking(CpuCgroup(AGENT_NAME_TELEMETRY, self._agent_cpu_cgroup_path))
elif disable_cgroups == DisableCgroups.EXTENSIONS: # disable extensions
self._extensions_cgroups_enabled = False
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed as we don't disable particular extension

@@ -519,7 +529,6 @@ def __reset_agent_cpu_quota():
"""
logger.info("Resetting agent's CPUQuota")
if CGroupConfigurator._Impl.__try_set_cpu_quota(''): # setting an empty value resets to the default (infinity)
CGroupsTelemetry.set_track_throttled_time(False)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed as we are already stop tracking the cgroup and also this would set to false for extensions too which we don't want to do it on agent throttling case

@@ -138,6 +139,7 @@ def __init__(self, name, cgroup_path):
self._current_system_cpu = None
self._previous_throttled_time = None
self._current_throttled_time = None
self._throttled_time_lock = threading.RLock() # Protect the get_throttled_time which is called from Monitor thread and cgroupconfigurator.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't seem like we need to protect concurrent access to this getter, do we?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need in the case of long running extensions which are running more than 5mins or so. In those cases, monitor thread and update handler will have race condition.

process_output = read_output(stdout, stderr)

if timed_out:
if cpu_cgroup is not None:# Report CPUThrottledTime when timeout happens
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of passing the cpu_cgroup as parameter down the call stack, can we handle the exception in the cgroup configurator and report the throttle time there? (you can create a new exception ExtensionTimeout derived from ExtensionError)

throttled_time = 0
if cpu_cgroup is not None:
try:
throttled_time = float(cpu_cgroup.get_throttled_time() / 1E9)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do this conversion in a couple of places; we should abstract it in the CpuCgroup class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

azurelinuxagent/common/cgroupconfigurator.py Show resolved Hide resolved
azurelinuxagent/common/cgroupconfigurator.py Show resolved Hide resolved
files_to_create.append((drop_in_file_cpu_quota, cpu_quota_contents))
self.__create_all_files(files_to_create)
except Exception as exception:
_log_cgroup_warning('Failed to set CPUQuota: {0}', ustr(exception))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add the extension/service name to the error message

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

read the extension services if ResourceLimits are present.
"""
extensions_services = {}
for manifest_path in glob.iglob(os.path.join(conf.get_lib_dir(), "*/HandlerManifest.json")):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we abstract the extension traversal in ExtensionsHandlerHandler? that class already has some code to do this kind of stuff, and takes into account, for example, extensions that failed to install (you probably want to skip those)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The challenge was getting around cyclic dependency if I have to move this to extensionhandler

tests/common/test_cgroupconfigurator.py Show resolved Hide resolved
@nagworld9 nagworld9 merged commit aa6d82c into Azure:develop May 17, 2022
@nagworld9 nagworld9 deleted the TASK-14179340 branch May 17, 2022 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants