[Services] Restart DHCP-Relay service upon unexpected critical process exit. #3667

yozhao101 · 2019-10-25T19:02:58Z

What I did
Restart the DHCP-Relay service if one of the critical processes running in the DHCP-Relay container exits or crashes unexpectedly.
How I did it
I generally followed the framework created by Joe to implement this feature in the DHCP-Relay container.
First, add the supervisor-proc-exit-listener event listener to the Supervisord configuration file in the DHCP-Relay docker container. The listener reads a list of critical processes to monitor for unexpected crashes or exits.
In the DHCP-Relay container, the critical processes are all monitored as one group, so we only need to put the group name in the critical_processes file. We also added code to the supervisor-proc-exit-listener script that retrieves the group name and checks whether it appears in critical_processes.
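The group-name check described above can be sketched roughly as follows. This is a minimal illustration, not the actual supervisor-proc-exit-listener code: the function names and the process name `isc-dhcp-relay-Vlan1000` are hypothetical, and the payload layout assumes the space-separated `key:value` fields that Supervisord sends for `PROCESS_STATE_EXITED` events.

```python
def parse_tokens(line):
    """Turn 'key:value key:value ...' into a dict."""
    return dict(token.split(":", 1) for token in line.split())


def load_critical_names(text):
    """Return the non-empty lines of a critical_processes file as a set."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def is_critical_exit(payload, critical_names):
    """True if an unexpected exit belongs to a listed process or group."""
    fields = parse_tokens(payload)
    if fields.get("expected") != "0":
        return False  # Supervisord considered this exit expected
    return (fields.get("processname") in critical_names
            or fields.get("groupname") in critical_names)


# critical_processes for dhcp-relay holds a single group name.
critical = load_critical_names("isc-dhcp-relay\n")

# Hypothetical payload; the process name is an assumption for illustration.
payload = ("processname:isc-dhcp-relay-Vlan1000 groupname:isc-dhcp-relay "
           "from_state:RUNNING expected:0 pid:42")
print(is_critical_exit(payload, critical))
```

Matching on either the process name or the group name keeps existing containers, which list individual process names, working unchanged.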
Second, configure dhcp-relay.service to always restart the service automatically if it stops, after a delay of 30 seconds. Also set a rate limit of 3 restarts within 20 minutes (1200 seconds), after which systemd stops trying to restart the container.
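To make the rate limit concrete, here is a simplified sliding-window model of "at most 3 starts per 1200 seconds". The `RestartBudget` class is hypothetical and only illustrates the arithmetic; systemd's own accounting may differ in detail.

```python
from collections import deque


class RestartBudget:
    """Simplified model: allow at most `burst` starts per `interval` seconds."""

    def __init__(self, burst=3, interval=1200):
        self.burst = burst
        self.interval = interval
        self.starts = deque()  # timestamps of starts still inside the window

    def allow(self, now):
        # Forget starts that have aged out of the window.
        while self.starts and now - self.starts[0] >= self.interval:
            self.starts.popleft()
        if len(self.starts) >= self.burst:
            return False  # budget exhausted, no further restart
        self.starts.append(now)
        return True


budget = RestartBudget()
# Crashes every 30 seconds: three starts are allowed, the fourth is refused.
print([budget.allow(t) for t in (0, 30, 60, 90)])
```

With a 30-second restart delay, a process that keeps crashing would exhaust the budget within a couple of minutes, so the service stays down rather than restarting indefinitely.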
How to verify it
On your switch, run docker ps to list all running docker containers.
Then run docker exec -it container_id /bin/bash to log in to the target container. Running top
in that shell displays the running processes, so you can find the process ID of one
of the critical processes. Finally, run kill -9 process_id to terminate that process.
After exiting the container, run watch -n 1 docker ps to watch the DHCP-Relay container
restart.
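As an alternative to watching the output by eye, the check can be scripted. The sketch below parses tab-separated `docker ps --format` output and flags a container whose status shows it came up seconds ago. The container names `dhcp_relay` and `swss` and the sample output are assumptions for illustration, and the "seconds" heuristic is only a rough indicator.

```python
import subprocess


def parse_ps(output):
    """Map container name -> status from tab-separated `docker ps` output."""
    statuses = {}
    for line in output.splitlines():
        if "\t" in line:
            name, status = line.split("\t", 1)
            statuses[name.strip()] = status.strip()
    return statuses


def docker_statuses():
    """Query the local Docker daemon (requires docker on the switch)."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}\t{{.Status}}"],
        check=True, capture_output=True, text=True,
    )
    return parse_ps(result.stdout)


def recently_started(status):
    """Heuristic: an 'Up ... second(s)' status suggests a fresh (re)start."""
    return status.startswith("Up") and "second" in status


# Hypothetical sample output, used here instead of a live `docker ps` call.
sample = "dhcp_relay\tUp 4 seconds\nswss\tUp 2 hours\n"
statuses = parse_ps(sample)
print({name: recently_started(s) for name, s in statuses.items()})
```

On a real device, calling `docker_statuses()` shortly after the kill -9 step would be expected to show the DHCP-Relay container with a fresh uptime once the 30-second restart delay has elapsed.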


jleveque · 2019-10-25T21:02:27Z

Looks good. However, I feel that it's not clear that one can now add group names to the "critical_processes" file (because the file name doesn't mention groups). I don't want to rename the file to something long, like "critical_processes_and_groups", though. Any suggestions?

yozhao101 · 2019-10-29T17:19:56Z

I thought about this issue for a while. To stay consistent with other containers, we could put
the actual process names in this file instead of the group name. Or we could divide critical_processes
into two sections: a critical process section and a group section. For dhcp-relay, we would leave the critical
process section empty and put a single group name in the group section. For now, can we create a critical_processes.j2 file to handle this issue?

jleveque · 2019-10-29T18:52:28Z

I don’t think we should take the templated approach, as using the group name is now shown to work, is much simpler and will require far less maintenance in the future. I think we can keep this as-is for now, but I would like to distinguish between processes and groups in the future. Maybe once all of the containers are managed properly, we can update the critical_processes syntax to match the supervisor.conf syntax. E.g.,

program:x
program:y
group:z

Then we can update the event listener's parsing logic. This separates the individual processes from the groups and also makes it clear to the user.
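A possible shape for that updated parsing logic is sketched below, assuming the `program:` / `group:` prefixes proposed above. The function name `parse_critical_entries` is hypothetical and this is not code from the PR, which keeps plain names in critical_processes for now.

```python
def parse_critical_entries(text):
    """Split prefixed critical_processes lines into (processes, groups)."""
    processes, groups = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        kind, sep, name = line.partition(":")
        if not sep or not name:
            raise ValueError("expected 'program:<name>' or 'group:<name>': %r" % line)
        if kind == "program":
            processes.append(name)
        elif kind == "group":
            groups.append(name)
        else:
            raise ValueError("unknown entry type %r in line %r" % (kind, line))
    return processes, groups


processes, groups = parse_critical_entries("program:x\nprogram:y\ngroup:z\n")
print(processes, groups)
```

Rejecting unprefixed lines makes the file self-describing, at the cost of requiring every existing critical_processes file to be migrated at the same time as the listener.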

jleveque · 2019-11-01T23:23:46Z

Retest this please


jleveque · 2019-11-06T01:12:54Z

Retest vs please


…s exit. (sonic-net#3667) Signed-off-by: Yong Zhao <[email protected]> [Services] Restart Platform-monitor service upon unexpected critical process exit. (sonic-net#3689) Signed-off-by: Yong Zhao <[email protected]> Signed-off-by: Sangita Maity <[email protected]> RB=2126600 G=lnos-reviewers R=pchaudha,pmao,vapatil,zxu A=zxu

yozhao101 added 5 commits October 25, 2019 11:21

[docker-dhcp-relay] Create a file named critical_processes. For dhcp-…

2474fab

…relay service, this file contains a single groupname: isc-dhcp-relay. Signed-off-by: Yong Zhao <[email protected]>

[docker-dhcp-relay] Add paths of supervisord listener script and crit…

1cfa8e6

…ical processes file into dockfile.j2. Signed-off-by: Yong Zhao <[email protected]>

[docker-dhcp-relay] Make event listener autostart by adding option in…

847aed0

… supervisord conf file. Signed-off-by: Yong Zhao <[email protected]>

[docker-dhcp-relay] Configure systemd to stop restarting dhcp-relay i…

2ed2adb

…f it attempts to restart this container more than 3 times in 20 minutes. Signed-off-by: Yong Zhao <[email protected]>

[docker-dhcp-relay] Add macro $(SUPERVISOR_PROC_EXIT_LISTENER_SCRIPT)…

e2ae9d1

… to shared Makefile docker-dhcp-relay.mk. Signed-off-by: Yong Zhao <[email protected]>

yozhao101 requested a review from jleveque October 25, 2019 19:02

jleveque added the Enhancement ➕ label Oct 25, 2019

yozhao101 added 2 commits November 4, 2019 16:32

[docker-dhcp-relay] Event listener will also be guided by a groupname…

b705b35

… which monitors a bunch of processes. Signed-off-by: Yong Zhao <[email protected]>

[docker-dhcp-relay] Add event listener option in test conf file.

7ca5fc8

Signed-off-by: Yong Zhao <[email protected]>

jleveque approved these changes Nov 6, 2019


jleveque merged commit ed79f54 into sonic-net:master Nov 6, 2019

zhenggen-xu pushed a commit to zhenggen-xu/sonic-buildimage that referenced this pull request Jan 10, 2020

[Services] Restart DHCP-Relay service upon unexpected critical proces…

47989f6

…s exit. (sonic-net#3667) Signed-off-by: Yong Zhao <[email protected]>

yozhao101 mentioned this pull request Jun 25, 2020

[doc] Monitoring and Auto-mitigating the unhealthy of docker containers in SONiC sonic-net/SONiC#564

Open

jleveque added the DHCP Relay label Jul 9, 2020
