unifi-os doesn't restart #29

klemens-st · 2021-03-08T20:34:12Z

Expected
With the on-boot script, udm-le waits 5 minutes then installs certs and restarts unifi-os

Actual
The unifi-os restart command that the script runs doesn't work. Unifi-os is stopped but doesn't restart, everything is back to normal only after manually executing unifi-os restart.

UDM-PRO firmware 1.8.6
on-boot-script 1.0.4
udm-le 1.0.7

The text was updated successfully, but these errors were encountered:

kchristensen · 2021-03-08T20:52:50Z

I've seen this once or twice myself but never captured what is causing it to fail. I don't suppose you have any logs or anything that we might be able to use to figure out why it didn't restart?

Without anything to go on the only thing I can think of would be some kind of Until loop (provided this shell supports it) that does a restart until `unifi-os | grep 'is already running'

klemens-st · 2021-03-08T21:03:57Z

In my case unfortnately it happens every time udm-pro reboots.

I have ssh access to udm-pro, I didn't paste any logs in my original message because I'm not very familiar yet with that system and don't know which log files are relevant to this problem. Is it just /mnt/data/log/messages or some other files too? In /mnt/data/log/messages there's this at the exact moment udm-le script runs:
Mar 8 14:50:43 bryza user.notice unifi-os: Stopping unifi-os
but then unifi-os never starts.

kchristensen · 2021-03-08T21:12:47Z

This might do the trick, I'll see about working into the master branch:

# cat foo
until [ "$(podman inspect -f '{{.State.Running}}' "unifi-os" 2>&1)" = "true" ];
do
	unifi-os restart
	sleep 5
done


# unifi-os stop; sh -x foo
unifi-os: Stopping unifi-os
unifi-os: 200dace39f31e83cfbb1ee947d6016c76c588d2864c652f69769f261db4f777e
unifi-os: Stopping unifi-os SSH daemon... OK
+ podman inspect -f '{{.State.Running}}' unifi-os
+ '[' false '=' true ]
+ unifi-os restart
unifi-os: unifi-os is not running
unifi-os: Stopping unifi-os SSH daemon... OK
unifi-os: Starting unifi-os
unifi-os: Stopping unifi-os SSH daemon... OK
unifi-os: Starting unifi-os SSH daemon... OK
unifi-os: unifi-os
+ sleep 5
+ podman inspect -f '{{.State.Running}}' unifi-os
+ '[' true '=' true ]
#

klemens-st · 2021-03-08T21:39:42Z

Here's /mnt/data/log/messages from the moment udm-pro rebooted to the moment I ran unifi-os restart
messages1.log

kchristensen · 2021-03-08T21:40:26Z

Ok, I pushed up a possible fix. It'll try to restart the container, and if that fails retry 5 times before it bails out. This might catch whatever edge case we're seeing.

83d7be1

Try that on for size and if it looks good I'll cut a new release at some point soon.

klemens-st · 2021-03-08T22:27:41Z

I tried it, updated the file and rebooted udm-pro, no luck. It's exactly the same as before. unifi-os: Stopping unifi-os 5 minutes after reboot and it never starts back again.

klemens-st · 2021-03-08T22:29:57Z

Maybe I will just add another script to on-boot.d that restarts the container 6 or 7 minutes after reboot. Will give it a try tomorrow.

kchristensen · 2021-03-08T23:51:09Z

That's really odd, you can try increasing the sleep in restart_unifi_os to be longer than 5 seconds and it'll do what you're after.

kchristensen · 2021-03-08T23:53:05Z

Another thing to try, edit this line: https://github.com/kchristensen/udm-le/blob/master/on_boot.d/99-udm-le.sh#L10

Change it to something like: sh -x ${UDM_LE_PATH}/udm-le.sh bootrenew > /var/log/udm-le.log. It should dump a verbose run of what it is trying to do to /var/log/udm-le.log which might shed some light on what is going on.

kchristensen · 2021-03-09T00:10:05Z

Another thing that stands out to me: On reboot, if you don't need to reissue a certificate because it is > 60 days old, udm-le doesn't try to restart the unifi-os container because there's no reason to (unless you're using a captive portal certificate). If you have a valid certificate, and you reboot it effectively does nothing but exit.

So, if you know you have a valid certificate and on a reboot unifi-os isn't starting, I feel like there's something else going on we don't know about.

klemens-st · 2021-03-09T19:01:54Z

The output of sh -x ${UDM_LE_PATH}/udm-le.sh bootrenew > /var/log/udm-le.log was:

The certificate expires in 89 days, the number of days defined to perform the renewal is 60: no renewal. No new certificate was found, exiting without restart.

unless you're using a captive portal certificate

Yes, I was trying to set that up, will try without the captive portal and see if it works.

kchristensen · 2021-03-09T19:19:29Z

Ok that's a clue, see if it works w/o the captive portal and we can go from there. I don't personally use it, so there might be some bug there I haven't run into.

klemens-st · 2021-03-09T22:49:06Z

I tried without the captive portal. Still same problem.

So, if you know you have a valid certificate and on a reboot unifi-os isn't starting, I feel like there's something else going on we don't know about.

It's not that it isn's starting at all, when UDM-PRO reboots unifi-os starts properly but then stops after 5 minutes when udm-le.sh bootrenew is run

On reboot, if you don't need to reissue a certificate because it is > 60 days old, udm-le doesn't try to restart the unifi-os container because there's no reason to (unless you're using a captive portal certificate).

Not really. udm-le.sh renew includes this check:

udm-le/udm-le.sh

Lines 104 to 106 in a3f2b92

    
           if [ "${NEW_CERT}" = "yes" ]; then 
        
           	add_captive && restart_unifi_os 
        
           fi

but bootrenew doesn't, it runs add_captive and restart_unifi_os regardless:

udm-le/udm-le.sh

Line 110 in a3f2b92

    
           ${PODMAN_CMD} ${LEGO_ARGS} renew --days 60 && deploy_cert && add_captive && restart_unifi_os

I looked it up and it seems it was discussed and agreed on in #8. So it seems it fails on the unifi-os restart bit.

kchristensen · 2021-03-15T13:38:05Z

Hm, I guess the route to take to get to the bottom of it is to add some more logging to restart_unifi_os to see if we can't capture more of the output or something. It's weird that you only see the Stopping unifi-os bit and nothing further as that's part of the output of unifi-os restart indicating that it did execute it, but then is failing somewhere in the unifi-os code.

Speedliner · 2021-03-15T15:52:56Z

Actual
The unifi-os restart command that the script runs doesn't work. Unifi-os is stopped but doesn't restart, everything is back to normal only after manually executing unifi-os restart.

I' facing this problem as well. My UDMP also refuses to work after five minutes and has to be revived with a "unifi-os restart".

UDM-PRO firmware 1.9.1
on-boot-script 1.0.4
udm-le 1.0.7

Is there anything i can provide beyond that to help solve this?

klemens-st · 2021-03-15T15:55:19Z

Hm, I guess the route to take to get to the bottom of it is to add some more logging to restart_unifi_os

Yes, I was going to try but haven't found the time recently. Will report here once I have tested more.

klemens-st · 2021-03-15T15:57:08Z

I am also going to test a simple on boot script containing unifi-os restart only because my suspicion is this has something to do with how on boot scripts are executed on my udm-pro

kchristensen · 2021-03-15T15:57:21Z

I'm on 1.9.1 / on-boot-script 1.0.4 and oddly I haven't run into this lately so figuring out how to get more verbose logs here is going to be key. unifi-os is just a shell script, so if you modify the restart_unifi_os command to call sh -x /usr/sbin/unifi-os restart instead of just unifi-os restart and pipe that output somewhere as we discussed in this thread previously we might get something of use out of there.

Speedliner · 2021-03-15T16:40:32Z

I changed the udm-le.sh script as following:

After rebooting the UDMP and waiting for 5 minutes, it creates the udm-le.log file under /var/log/ - but it stays empty:

What did I wrong?

kchristensen · 2021-03-15T16:45:12Z

Oh, I think unifi-os does something wonky with stdout, try sh -x /usr/sbin/unifi-os restart 2>&1 > /var/log/udm-le.log maybe?

Speedliner · 2021-03-15T16:58:28Z

Sadly, that didn't help...

klemens-st · 2021-03-15T17:03:12Z

Yeah, that was never going to work.

From /usr/sbin/unifi-os

case "$cmd" in
	stop)
		stop 2>&1 | logger -st unifi-os
		;;
	start)
		start 2>&1 | logger -st unifi-os
		;;
	reset)
		reset 2>&1 | logger -st unifi-os
		;;
	restart)
		restart 2>&1 | logger -st unifi-os
		;;
	running)
		container_is_running "${CONTAINER_NAME}"
		;;
	shell)
		podman exec -ti "${CONTAINER_NAME}" bash
		;;
	update)
		update "${2}" 2>&1 | logger -st unifi-os
		;;
	*)
		echo "Usage: $0 [stop start restart shell 'update url']"
		;;
esac

klemens-st · 2021-03-15T17:04:01Z

Go ahead and modify /usr/sbin/unifi-os directly, just make a solid backup first.

Or you can just have a look at /var/log/messages which is where logger is outputting stuff

Speedliner · 2021-03-15T17:45:33Z

Okay, after extensive testing, I came to the following conclustion:

Editing unifi-os directly works - as long as you do not reboot. After rebooting, those changes made are reverted and nothing gets logged anymore.

klemens-st · 2021-03-15T22:15:22Z

Yeah, that was never going to work.

Not really, I was wrong there. The -x switch does work but the output is written to syslog at /var/log/messages and not to the file we specified, the file remains empty.

I did some further testing. I created an on boot script that contained only

wait 600
unifi-os restart

It failed the same way udm-le fails. So it's clearly something with on boot scripts execution on my udm-pro.

I narrowed down the problem further by modifying my script to sh -x /usr/sbin/unifi-os restart and inspecting /var/log/messages.

unifi-os restart calls the stop function first which is as follows:

stop() {
	if ! container_is_running "${CONTAINER_NAME}"; then
		echo "${CONTAINER_NAME} is not running"
	else
		echo "Stopping ${CONTAINER_NAME}"
		podman stop "${CONTAINER_NAME}"
	fi
	stop_dropbear_daemon
}

According to the logs I saw, podman stop is the last command that executes, stop_dropbear_deamon or the restart function that should be called later are never executed. This is because the script receives a SIGTERM.

Mar 15 21:52:09 unifi user.notice unifi-os: + stop
Mar 15 21:52:09 unifi user.notice unifi-os: + container_is_running unifi-os
Mar 15 21:52:09 unifi user.notice unifi-os: + podman inspect -f '{{.State.Running}}' unifi-os
Mar 15 21:52:09 unifi user.notice unifi-os: + '[' true '=' true ]
Mar 15 21:52:09 unifi user.notice unifi-os: + echo 'Stopping unifi-os'
Mar 15 21:52:09 unifi user.notice unifi-os: Stopping unifi-os
Mar 15 21:52:09 unifi user.notice unifi-os: + podman stop unifi-os
Mar 15 21:52:15 unifi syslog.info kernel: [  333.980562] systemd-journald[26]: Received SIGTERM from PID 1 (system
d-shutdow).
Mar 15 21:52:15 unifi user.emerg kernel: [  333.982681] reboot: cmd=cdef0123
Mar 15 21:52:15 unifi user.emerg kernel: [  333.986192] reboot:   called by: [ 2122]/lib/systemd/systemd-shutdown
halt --timeout 90000000us --log-level 6 --log-target console --log-color
Mar 15 21:52:15 unifi user.emerg kernel: [  333.999053] reboot:   called by: [ 2108]/usr/libexec/podman/conmon --a
pi-version 1 -c 4c2e68e3a60876d796608aa35e9c4e9e2ccf9417127022a8c4ca0b96e5512619 -u 4c2e68e3a6087
Mar 15 21:52:15 unifi user.emerg kernel: [  334.014347] reboot:   called by: [    1]init
Mar 15 21:52:15 unifi user.emerg kernel: [  334.018642] reboot:   called by: [    0]

You can see that just after podman stop unifi-os the script receives a SIGTERM, it then finishes shutting down unifi-os and exits. This never happens when unifi-os restart is run from command line. I have absolutely no idea where this is comming from, I think I'm going to ask on udm-utilities issues.

emlynmac · 2021-03-19T17:58:13Z

I have been experiencing unifi-os dying since I installed and thought it was something flaky with the utils / OS. Now, I understand that it was this little restart fail causing the issue.

My understanding is that the udm-le script will be called from the context of the container, so when you call restart from inside, then it shuts down and gets killed and doesn't restart, as shown in the logs above.

I think there are some ways to solve this:

Somehow figure out how to spawn off the unifi-os start command into something that out-lives the container and doesn't get killed
Find an alternative way to get the OS to pick up the new certificates. Is there not a way that we can get the web services to restart without pulling down the whole container?
Schedule a restart using the unifi-os to happen at least every 70 days, which should then pick up the new certs.

For now, I have disabled the restart in my scripts and will restart it when I need to manually - I'd rather have the certs fail than the whole unifi-os not working after a power outage.

kchristensen · 2021-03-19T18:02:17Z

The on-boot-script actually executes the init script on boot on the host, not in the container. The unifi-os script is basically a wrapper around podman commands that starts up the container that runs the controller, etc.

The issue is that when whatever causes the restart on boot to fail, the controller and captive portal and what have you won't be running. The actual routing stuff happens on the host and should be unaffected by the failure to restart.

As for #3, that happens already, provided udm-le.sh gets run every boot it drops a file in the cron.d directory that will handle the nightly check and subsequent re-issue/restart if necessary.

klemens-st · 2021-03-19T18:02:30Z

My understanding is that the udm-le script will be called from the context of the container, so when you call restart from inside, then it shuts down and gets killed and doesn't restart, as shown in the logs above.

I thought the same but then realised not everyone experiences this issue, which is strange.

kchristensen · 2021-03-19T18:03:45Z

My understanding is that the udm-le script will be called from the context of the container, so when you call restart from inside, then it shuts down and gets killed and doesn't restart, as shown in the logs above.

I thought the same but then realised not everyone experiences this issue, which is strange.

Indeed. I have no issues with the way this is setup and I've had this running on my UDMP since more or less the Early Access days of the udmp. That said I only use Cloudflare for DNS and none of the captive portal things that people have added over time, so it is hard to say if that is possibly causing issues.

emlynmac · 2021-03-19T18:06:10Z

The issue is that when whatever causes the restart on boot to fail, the controller and captive portal and what have you won't be running. The actual routing stuff happens on the host and should be unaffected by the failure to restart.

Routing won't be affected, but protect will not be up - which is a problem.

My setup has udm-le, cloudflare-ddns, pihole, cloudflared and homebridge containers. I don't have captive portal set up.
Running unifi-os restart from the SSH console works fine.

klemens-st · 2021-03-19T18:07:41Z

I have disabled the captive portal and have nothing else installed besides on-boot and udm-le. And as I have recently tested it's just the restart command that fails, it has nothing to do with udm-le itself.

emlynmac · 2021-03-19T18:08:22Z

Does restart work from the shell?

klemens-st · 2021-03-19T18:08:37Z

Yes, always.

klemens-st · 2021-03-19T18:11:57Z

But I think you are on the right lines that when on-boot script runs unifi-os restart its executed in a different context. The kernel part of the logs when it's run from shell is different.

emlynmac · 2021-03-23T02:44:40Z

So peacey over at the udm-utilities has a fix for this issue.

I took his second suggestion and replaced unifi-os restart with pod man restart unifi-os &

The last case is now:

 echo 'Attempting certificate renewal'                                                                
        ${PODMAN_CMD} ${LEGO_ARGS} renew --days 60 && deploy_cert && add_captive && podman restart unifi-os &
        ;;

Seems to have resolved the issue

klemens-st · 2021-03-30T06:24:32Z

Finally found the time to test 1.0.9. It fixed the issue, thanks!

klemens-st mentioned this issue Mar 19, 2021

unifi-os restart doesn't work from an on-boot script unifi-utilities/unifios-utilities#148

Closed

LaurenceJJones mentioned this issue Mar 23, 2021

Potential fix for restart script #32

Merged

klemens-st closed this as completed Mar 30, 2021

unifi-os doesn't restart #29

unifi-os doesn't restart #29

Comments

klemens-st commented Mar 8, 2021

kchristensen commented Mar 8, 2021

klemens-st commented Mar 8, 2021

kchristensen commented Mar 8, 2021

klemens-st commented Mar 8, 2021

kchristensen commented Mar 8, 2021

klemens-st commented Mar 8, 2021

klemens-st commented Mar 8, 2021 • edited Loading

kchristensen commented Mar 8, 2021 • edited Loading

kchristensen commented Mar 8, 2021

kchristensen commented Mar 9, 2021

klemens-st commented Mar 9, 2021

kchristensen commented Mar 9, 2021

klemens-st commented Mar 9, 2021 • edited Loading

kchristensen commented Mar 15, 2021

Speedliner commented Mar 15, 2021

klemens-st commented Mar 15, 2021 • edited Loading

klemens-st commented Mar 15, 2021 • edited Loading

kchristensen commented Mar 15, 2021

Speedliner commented Mar 15, 2021

kchristensen commented Mar 15, 2021

Speedliner commented Mar 15, 2021

klemens-st commented Mar 15, 2021

klemens-st commented Mar 15, 2021 • edited Loading

Speedliner commented Mar 15, 2021

klemens-st commented Mar 15, 2021

emlynmac commented Mar 19, 2021

kchristensen commented Mar 19, 2021 • edited Loading

klemens-st commented Mar 19, 2021

kchristensen commented Mar 19, 2021

emlynmac commented Mar 19, 2021 • edited Loading

klemens-st commented Mar 19, 2021

emlynmac commented Mar 19, 2021

klemens-st commented Mar 19, 2021

klemens-st commented Mar 19, 2021

emlynmac commented Mar 23, 2021

klemens-st commented Mar 30, 2021

klemens-st commented Mar 8, 2021 •

edited

Loading

kchristensen commented Mar 8, 2021 •

edited

Loading

klemens-st commented Mar 9, 2021 •

edited

Loading

klemens-st commented Mar 15, 2021 •

edited

Loading

klemens-st commented Mar 15, 2021 •

edited

Loading

klemens-st commented Mar 15, 2021 •

edited

Loading

kchristensen commented Mar 19, 2021 •

edited

Loading

emlynmac commented Mar 19, 2021 •

edited

Loading