-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Services] Restart DHCP-Relay service upon unexpected critical process exit. #3667
Conversation
…relay service, this file contains a single groupname: isc-dhcp-relay. Signed-off-by: Yong Zhao <[email protected]>
…ical processes file into dockfile.j2. Signed-off-by: Yong Zhao <[email protected]>
… supervisord conf file. Signed-off-by: Yong Zhao <[email protected]>
…f it attempts to restart this container more than 3 times in 20 minutes. Signed-off-by: Yong Zhao <[email protected]>
… to shared Makefile docker-dhcp-relay.mk. Signed-off-by: Yong Zhao <[email protected]>
Looks good. However, I feel that it's not clear that one can now add group names to the "critical_processes" file (because the file name doesn't mention groups). I don't want to rename the file to something long, like "critical_processes_and_groups", though. Any suggestions? |
I though this issue for a while. In order to keep consistency with other containers, we can put |
I don’t think we should take the templated approach, as using the group name is now shown to work, is much simpler and will require far less maintenance in the future. I think we can keep this as-is for now, but I would like to distinguish between processes and groups in the future. Maybe once all of the containers are managed properly, we can update the critical_processes syntax to match the supervisor.conf syntax. E.g.,
Then we can update the event listener's parsing logic. This separates the individual processes from the groups and also makes it clear to the user. |
Retest this please |
… which monitors a bunch of processes. Signed-off-by: Yong Zhao <[email protected]>
Signed-off-by: Yong Zhao <[email protected]>
Retest vs please |
…s exit. (sonic-net#3667) Signed-off-by: Yong Zhao <[email protected]>
…s exit. (sonic-net#3667) Signed-off-by: Yong Zhao <[email protected]> [Services] Restart Platform-monitor service upon unexpected critical process exit. (sonic-net#3689) Signed-off-by: Yong Zhao <[email protected]> Signed-off-by: Sangita Maity <[email protected]> RB=2126600 G=lnos-reviewers R=pchaudha,pmao,vapatil,zxu A=zxu
What I did
Restart DHCP-Relay service if one of critical processes running in DHCP-Relay container exited or crashed abnormally.
How I did it
Generally I follow the framework created by Joe to implement this feature in DHCP-Relay container.
First, add supervisor-proc-exit-listener event listener option in Supervisord configuration file in DHCP-Relay docker container. Supervisord will read a list of critical processes for which to monitor the unexpected crashed and exited.
For DHCP-Relay container, since a bunch of critical processes will be monitored by a group, we only need put the groupname in the file critical_processes. At the same time, we also add source code in supervisor-proc-exit-listener script to retrieve the groupname and then decide whether it appears in critical_processes.
Second, configure dhcp-relay.service to always auto-restart the service if it stops, with a delay of 30 seconds. Also set a rate limit of 3 restarts within 20 minutes (1200 seconds).
How to verify it
On your switch device, please use docker ps command to list all running docker containers.
Then use docker exec -it container_id /bin/bash to login target container. Typing top command
on the shell will display all the processes dynamically and you will spot the process id of one
of the critical processes. Finally type the command kill -9 process_id to terminate one process.
After exiting the container, you can use watch -n 1 docker ps to dynamically see the restart
of DHCP-Relay container.