[doc] Monitoring and Auto-mitigating the unhealthy of docker containers in SONiC #564
base: master
the running status of critical process and resource usage. Signed-off-by: Yong Zhao <[email protected]>
feature. Signed-off-by: Yong Zhao <[email protected]>
One new review comment added and one old comment is still unaddressed.
Resource Usage. Signed-off-by: Yong Zhao <[email protected]>
auto-restart and warm re-boot. Add a paragraph to introduce how we can use Monit to monitor multiple processes with the same command. Signed-off-by: Yong Zhao <[email protected]>
},
"lldp": {
    "auto_restart": "disabled",
    "high_mem_alert": "104857600",
Can the thresholds be made human-readable? Would it be possible to express the threshold as a percentage?
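To make the question concrete: the `high_mem_alert` values in the snippet are raw byte counts stored as strings. A minimal sketch of how they could be rendered in binary units or as a share of total memory follows. The helper names `humanize_bytes` and `percent_of_total` are hypothetical and are not part of the proposed design.

```python
def humanize_bytes(value: str) -> str:
    """Render a byte count (stored as a string in the config) in binary units."""
    size = float(int(value))
    for unit in ("B", "KiB", "MiB", "GiB"):
        # Stop at the first unit that keeps the number below 1024,
        # or at GiB, the largest unit this sketch handles.
        if size < 1024 or unit == "GiB":
            return f"{size:g} {unit}"
        size /= 1024


def percent_of_total(value: str, total_bytes: int) -> float:
    """Express a byte threshold as a percentage of total memory."""
    return 100.0 * int(value) / total_bytes


print(humanize_bytes("104857600"))  # threshold quoted for lldp
print(humanize_bytes("157286400"))  # threshold quoted for snmp
```

Under this reading, the lldp threshold is 100 MiB and the snmp threshold is 150 MiB.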
},
"snmp": {
    "auto_restart": "enabled",
    "high_mem_alert": "157286400",
How do you determine which thresholds to configure? Do you have any recommendations?
admin@sonic:~$ show container feature autorestart
Container Name        Status
--------------------  --------
database              disabled
Will the database container's data remain consistent after an auto-restart?
container stops, the systemd service which manages the container will also stop, but it is
configured to automatically restart the service, thus it will restart the container.

We also introduced a configuration option which can enable or disable this auto-restart feature
How does auto-restart work with dynamically loaded Docker containers?
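As an illustration of the per-container enable/disable option discussed in the snippet, the sketch below looks up the `auto_restart` field in a table shaped like the JSON fragments quoted in this review. The dictionary layout and the function name `auto_restart_enabled` are assumptions made for illustration; the actual lookup in SONiC may differ.

```python
# Values copied from the JSON fragments quoted in this conversation.
FEATURES = {
    "lldp": {"auto_restart": "disabled", "high_mem_alert": "104857600"},
    "snmp": {"auto_restart": "enabled", "high_mem_alert": "157286400"},
}


def auto_restart_enabled(features: dict, container: str) -> bool:
    """Return True only when the container is explicitly set to "enabled"."""
    return features.get(container, {}).get("auto_restart") == "enabled"


for name in ("lldp", "snmp"):
    print(name, auto_restart_enabled(FEATURES, name))
```

In this sketch, a container that is missing from the table is treated as disabled, which is one possible answer for containers that appear after the table was written.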
Similar to monitoring the critical processes, we can employ Monit to monitor the resource usage
such as CPU, memory and disk for each process. Unfortunately Monit is unable to do the resource monitoring
at the container level. Thus we propose a new design to achieve such monitoring based on Monit.
Specifically Monit will monitor a script and check its exit status. This script
Is this script able to detect a hang, loop, or deadlock in the processes or
threads inside the container?
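For context, a script of the kind described in the snippet could look roughly like the sketch below: it reads a container's memory usage and maps it to an exit status that a monitor such as Monit can check. This is a sketch under stated assumptions, not the script from the PR. The function names are hypothetical, and the parser assumes `docker stats` reports memory as a string such as `45.3MiB / 7.7GiB` with binary units. Note that a check like this compares usage against a threshold only and does not by itself detect hangs or deadlocks.

```python
import re
import subprocess

# Binary unit multipliers assumed for the memory column of `docker stats`.
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def parse_mem_usage(field: str) -> int:
    """Parse the used-memory part of e.g. '45.3MiB / 7.7GiB' into bytes."""
    used = field.split("/")[0].strip()
    match = re.fullmatch(r"([0-9.]+)\s*(B|KiB|MiB|GiB)", used)
    if match is None:
        raise ValueError(f"unrecognized memory field: {field!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def exit_status(used_bytes: int, threshold_bytes: int) -> int:
    """Return 0 when usage is within the threshold, 1 when it is exceeded."""
    return 1 if used_bytes > threshold_bytes else 0


def check_container(name: str, threshold_bytes: int) -> int:
    """Query Docker once for the container and map its usage to an exit status."""
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", name],
        capture_output=True, text=True, check=True,
    ).stdout
    return exit_status(parse_mem_usage(out), threshold_bytes)
```

A wrapper could end with `sys.exit(check_container(name, threshold))` so that the monitor sees a non-zero status whenever the threshold is exceeded.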
1. Monit must provide the ability to generate an alert when a critical process has not
been alive for 5 minutes.
2. Monit must provide the ability to generate an alert when the resource usage of
a docker container is larger than the pre-defined threshold.
Can this be saved in the DB to enable trend analysis of container resource utilization?
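The first requirement in the snippet (alert only after a critical process has been down for 5 minutes) amounts to counting consecutive failed checks. The sketch below illustrates that logic in isolation. It assumes one check per 60-second cycle, so 5 consecutive failures correspond to 5 minutes; the class name `ProcessWatch` and the cycle length are illustrative assumptions, and this is not how Monit implements the check internally.

```python
from dataclasses import dataclass


@dataclass
class ProcessWatch:
    cycles_required: int   # consecutive failed checks needed before alerting
    failed_cycles: int = 0

    def observe(self, alive: bool) -> bool:
        """Record one check and return True when an alert should be raised."""
        if alive:
            # Any successful check resets the streak.
            self.failed_cycles = 0
            return False
        self.failed_cycles += 1
        return self.failed_cycles >= self.cycles_required


# Assumed: one check per 60-second cycle, so 5 cycles is 5 minutes.
watch = ProcessWatch(cycles_required=5)
alerts = [watch.observe(alive=False) for _ in range(5)]
print(alerts)
```

Only the fifth consecutive failure triggers the alert; a single successful check in between starts the count again.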
What's the plan for this feature? Is it proceeding?
@ben-gale: Yes, it is proceeding. Most of the infrastructure is already in place in the master and 201911 branches.
Thanks Joe - timeline for the code PRs to master?
@yozhao101: Can you please add a comment here with links to all the related PRs in sonic-buildimage and sonic-utilities thus far?
Yes, I will update with links to the PRs.
This document introduces three features which we plan to deploy into SONiC:
1. We proposed employing Monit to monitor the running status of critical processes in docker containers. The PRs for this proposal in the public SONiC repo are as follows: sonic-net/sonic-buildimage#3940
2. We proposed employing the process monitoring/notification framework of supervisord to implement the auto-restart feature of docker containers. The PRs for this proposal in the public SONiC repo are as follows:
[process monitoring/notification framework] https://github.com/Azure/sonic-buildimage/pull/2852/files
[Syncd] https://github.com/Azure/sonic-buildimage/pull/3534/files
[CLI to check the state of autorestart feature of each container]
by Supervisord and high memory restart. Signed-off-by: Yong Zhao <[email protected]>
8498931 to 8837dc2
This document introduces the motivation and design for monitoring and auto-mitigating unhealthy docker containers in SONiC.