Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unifi-os doesn't restart #29

Closed
klemens-st opened this issue Mar 8, 2021 · 36 comments
Closed

unifi-os doesn't restart #29

klemens-st opened this issue Mar 8, 2021 · 36 comments

Comments

@klemens-st
Copy link

Expected
With the on-boot script, udm-le waits 5 minutes then installs certs and restarts unifi-os

Actual
The unifi-os restart command that the script runs doesn't work. Unifi-os is stopped but doesn't restart, everything is back to normal only after manually executing unifi-os restart.

UDM-PRO firmware 1.8.6
on-boot-script 1.0.4
udm-le 1.0.7

@kchristensen
Copy link
Owner

I've seen this once or twice myself but never captured what is causing it to fail. I don't suppose you have any logs or anything that we might be able to use to figure out why it didn't restart?

Without anything to go on the only thing I can think of would be some kind of Until loop (provided this shell supports it) that does a restart until `unifi-os | grep 'is already running'

@klemens-st
Copy link
Author

In my case unfortnately it happens every time udm-pro reboots.

I have ssh access to udm-pro, I didn't paste any logs in my original message because I'm not very familiar yet with that system and don't know which log files are relevant to this problem. Is it just /mnt/data/log/messages or some other files too? In /mnt/data/log/messages there's this at the exact moment udm-le script runs:
Mar 8 14:50:43 bryza user.notice unifi-os: Stopping unifi-os
but then unifi-os never starts.

@kchristensen
Copy link
Owner

This might do the trick, I'll see about working into the master branch:

# cat foo
until [ "$(podman inspect -f '{{.State.Running}}' "unifi-os" 2>&1)" = "true" ];
do
	unifi-os restart
	sleep 5
done


# unifi-os stop; sh -x foo
unifi-os: Stopping unifi-os
unifi-os: 200dace39f31e83cfbb1ee947d6016c76c588d2864c652f69769f261db4f777e
unifi-os: Stopping unifi-os SSH daemon... OK
+ podman inspect -f '{{.State.Running}}' unifi-os
+ '[' false '=' true ]
+ unifi-os restart
unifi-os: unifi-os is not running
unifi-os: Stopping unifi-os SSH daemon... OK
unifi-os: Starting unifi-os
unifi-os: Stopping unifi-os SSH daemon... OK
unifi-os: Starting unifi-os SSH daemon... OK
unifi-os: unifi-os
+ sleep 5
+ podman inspect -f '{{.State.Running}}' unifi-os
+ '[' true '=' true ]
#

@klemens-st
Copy link
Author

Here's /mnt/data/log/messages from the moment udm-pro rebooted to the moment I ran unifi-os restart
messages1.log

@kchristensen
Copy link
Owner

Ok, I pushed up a possible fix. It'll try to restart the container, and if that fails retry 5 times before it bails out. This might catch whatever edge case we're seeing.

83d7be1

Try that on for size and if it looks good I'll cut a new release at some point soon.

@klemens-st
Copy link
Author

I tried it, updated the file and rebooted udm-pro, no luck. It's exactly the same as before. unifi-os: Stopping unifi-os 5 minutes after reboot and it never starts back again.

@klemens-st
Copy link
Author

klemens-st commented Mar 8, 2021

Maybe I will just add another script to on-boot.d that restarts the container 6 or 7 minutes after reboot. Will give it a try tomorrow.

@kchristensen
Copy link
Owner

kchristensen commented Mar 8, 2021

That's really odd, you can try increasing the sleep in restart_unifi_os to be longer than 5 seconds and it'll do what you're after.

@kchristensen
Copy link
Owner

Another thing to try, edit this line: https://github.com/kchristensen/udm-le/blob/master/on_boot.d/99-udm-le.sh#L10

Change it to something like: sh -x ${UDM_LE_PATH}/udm-le.sh bootrenew > /var/log/udm-le.log. It should dump a verbose run of what it is trying to do to /var/log/udm-le.log which might shed some light on what is going on.

@kchristensen
Copy link
Owner

Another thing that stands out to me: On reboot, if you don't need to reissue a certificate because it is > 60 days old, udm-le doesn't try to restart the unifi-os container because there's no reason to (unless you're using a captive portal certificate). If you have a valid certificate, and you reboot it effectively does nothing but exit.

So, if you know you have a valid certificate and on a reboot unifi-os isn't starting, I feel like there's something else going on we don't know about.

@klemens-st
Copy link
Author

The output of sh -x ${UDM_LE_PATH}/udm-le.sh bootrenew > /var/log/udm-le.log was:

The certificate expires in 89 days, the number of days defined to perform the renewal is 60: no renewal. No new certificate was found, exiting without restart.

unless you're using a captive portal certificate

Yes, I was trying to set that up, will try without the captive portal and see if it works.

@kchristensen
Copy link
Owner

Ok that's a clue, see if it works w/o the captive portal and we can go from there. I don't personally use it, so there might be some bug there I haven't run into.

@klemens-st
Copy link
Author

klemens-st commented Mar 9, 2021

I tried without the captive portal. Still same problem.

So, if you know you have a valid certificate and on a reboot unifi-os isn't starting, I feel like there's something else going on we don't know about.

It's not that it isn's starting at all, when UDM-PRO reboots unifi-os starts properly but then stops after 5 minutes when udm-le.sh bootrenew is run

On reboot, if you don't need to reissue a certificate because it is > 60 days old, udm-le doesn't try to restart the unifi-os container because there's no reason to (unless you're using a captive portal certificate).

Not really. udm-le.sh renew includes this check:

udm-le/udm-le.sh

Lines 104 to 106 in a3f2b92

if [ "${NEW_CERT}" = "yes" ]; then
add_captive && restart_unifi_os
fi

but bootrenew doesn't, it runs add_captive and restart_unifi_os regardless:
${PODMAN_CMD} ${LEGO_ARGS} renew --days 60 && deploy_cert && add_captive && restart_unifi_os

I looked it up and it seems it was discussed and agreed on in #8. So it seems it fails on the unifi-os restart bit.

@kchristensen
Copy link
Owner

Hm, I guess the route to take to get to the bottom of it is to add some more logging to restart_unifi_os to see if we can't capture more of the output or something. It's weird that you only see the Stopping unifi-os bit and nothing further as that's part of the output of unifi-os restart indicating that it did execute it, but then is failing somewhere in the unifi-os code.

@Speedliner
Copy link

Actual
The unifi-os restart command that the script runs doesn't work. Unifi-os is stopped but doesn't restart, everything is back to normal only after manually executing unifi-os restart.

I' facing this problem as well. My UDMP also refuses to work after five minutes and has to be revived with a "unifi-os restart".

UDM-PRO firmware 1.9.1
on-boot-script 1.0.4
udm-le 1.0.7

Is there anything i can provide beyond that to help solve this?

@klemens-st
Copy link
Author

klemens-st commented Mar 15, 2021

Hm, I guess the route to take to get to the bottom of it is to add some more logging to restart_unifi_os

Yes, I was going to try but haven't found the time recently. Will report here once I have tested more.

@klemens-st
Copy link
Author

klemens-st commented Mar 15, 2021

I am also going to test a simple on boot script containing unifi-os restart only because my suspicion is this has something to do with how on boot scripts are executed on my udm-pro

@kchristensen
Copy link
Owner

I'm on 1.9.1 / on-boot-script 1.0.4 and oddly I haven't run into this lately so figuring out how to get more verbose logs here is going to be key. unifi-os is just a shell script, so if you modify the restart_unifi_os command to call sh -x /usr/sbin/unifi-os restart instead of just unifi-os restart and pipe that output somewhere as we discussed in this thread previously we might get something of use out of there.

@Speedliner
Copy link

I changed the udm-le.sh script as following:

grafik

After rebooting the UDMP and waiting for 5 minutes, it creates the udm-le.log file under /var/log/ - but it stays empty:

grafik

What did I wrong?

@kchristensen
Copy link
Owner

Oh, I think unifi-os does something wonky with stdout, try sh -x /usr/sbin/unifi-os restart 2>&1 > /var/log/udm-le.log maybe?

@Speedliner
Copy link

Sadly, that didn't help...

grafik

grafik

@klemens-st
Copy link
Author

Yeah, that was never going to work.

From /usr/sbin/unifi-os

case "$cmd" in
	stop)
		stop 2>&1 | logger -st unifi-os
		;;
	start)
		start 2>&1 | logger -st unifi-os
		;;
	reset)
		reset 2>&1 | logger -st unifi-os
		;;
	restart)
		restart 2>&1 | logger -st unifi-os
		;;
	running)
		container_is_running "${CONTAINER_NAME}"
		;;
	shell)
		podman exec -ti "${CONTAINER_NAME}" bash
		;;
	update)
		update "${2}" 2>&1 | logger -st unifi-os
		;;
	*)
		echo "Usage: $0 [stop start restart shell 'update url']"
		;;
esac

@klemens-st
Copy link
Author

klemens-st commented Mar 15, 2021

Go ahead and modify /usr/sbin/unifi-os directly, just make a solid backup first.

Or you can just have a look at /var/log/messages which is where logger is outputting stuff

@Speedliner
Copy link

Okay, after extensive testing, I came to the following conclustion:

Editing unifi-os directly works - as long as you do not reboot. After rebooting, those changes made are reverted and nothing gets logged anymore.

@klemens-st
Copy link
Author

Yeah, that was never going to work.

Not really, I was wrong there. The -x switch does work but the output is written to syslog at /var/log/messages and not to the file we specified, the file remains empty.

I did some further testing. I created an on boot script that contained only

wait 600
unifi-os restart

It failed the same way udm-le fails. So it's clearly something with on boot scripts execution on my udm-pro.

I narrowed down the problem further by modifying my script to sh -x /usr/sbin/unifi-os restart and inspecting /var/log/messages.

unifi-os restart calls the stop function first which is as follows:

stop() {
	if ! container_is_running "${CONTAINER_NAME}"; then
		echo "${CONTAINER_NAME} is not running"
	else
		echo "Stopping ${CONTAINER_NAME}"
		podman stop "${CONTAINER_NAME}"
	fi
	stop_dropbear_daemon
}

According to the logs I saw, podman stop is the last command that executes, stop_dropbear_deamon or the restart function that should be called later are never executed. This is because the script receives a SIGTERM.

Mar 15 21:52:09 unifi user.notice unifi-os: + stop
Mar 15 21:52:09 unifi user.notice unifi-os: + container_is_running unifi-os
Mar 15 21:52:09 unifi user.notice unifi-os: + podman inspect -f '{{.State.Running}}' unifi-os
Mar 15 21:52:09 unifi user.notice unifi-os: + '[' true '=' true ]
Mar 15 21:52:09 unifi user.notice unifi-os: + echo 'Stopping unifi-os'
Mar 15 21:52:09 unifi user.notice unifi-os: Stopping unifi-os
Mar 15 21:52:09 unifi user.notice unifi-os: + podman stop unifi-os
Mar 15 21:52:15 unifi syslog.info kernel: [  333.980562] systemd-journald[26]: Received SIGTERM from PID 1 (system
d-shutdow).
Mar 15 21:52:15 unifi user.emerg kernel: [  333.982681] reboot: cmd=cdef0123
Mar 15 21:52:15 unifi user.emerg kernel: [  333.986192] reboot:   called by: [ 2122]/lib/systemd/systemd-shutdown
halt --timeout 90000000us --log-level 6 --log-target console --log-color
Mar 15 21:52:15 unifi user.emerg kernel: [  333.999053] reboot:   called by: [ 2108]/usr/libexec/podman/conmon --a
pi-version 1 -c 4c2e68e3a60876d796608aa35e9c4e9e2ccf9417127022a8c4ca0b96e5512619 -u 4c2e68e3a6087
Mar 15 21:52:15 unifi user.emerg kernel: [  334.014347] reboot:   called by: [    1]init
Mar 15 21:52:15 unifi user.emerg kernel: [  334.018642] reboot:   called by: [    0]

You can see that just after podman stop unifi-os the script receives a SIGTERM, it then finishes shutting down unifi-os and exits. This never happens when unifi-os restart is run from command line. I have absolutely no idea where this is comming from, I think I'm going to ask on udm-utilities issues.

@emlynmac
Copy link
Contributor

I have been experiencing unifi-os dying since I installed and thought it was something flaky with the utils / OS. Now, I understand that it was this little restart fail causing the issue.

My understanding is that the udm-le script will be called from the context of the container, so when you call restart from inside, then it shuts down and gets killed and doesn't restart, as shown in the logs above.

I think there are some ways to solve this:

  1. Somehow figure out how to spawn off the unifi-os start command into something that out-lives the container and doesn't get killed
  2. Find an alternative way to get the OS to pick up the new certificates. Is there not a way that we can get the web services to restart without pulling down the whole container?
  3. Schedule a restart using the unifi-os to happen at least every 70 days, which should then pick up the new certs.

For now, I have disabled the restart in my scripts and will restart it when I need to manually - I'd rather have the certs fail than the whole unifi-os not working after a power outage.

@kchristensen
Copy link
Owner

kchristensen commented Mar 19, 2021

The on-boot-script actually executes the init script on boot on the host, not in the container. The unifi-os script is basically a wrapper around podman commands that starts up the container that runs the controller, etc.

The issue is that when whatever causes the restart on boot to fail, the controller and captive portal and what have you won't be running. The actual routing stuff happens on the host and should be unaffected by the failure to restart.

As for #3, that happens already, provided udm-le.sh gets run every boot it drops a file in the cron.d directory that will handle the nightly check and subsequent re-issue/restart if necessary.

@klemens-st
Copy link
Author

My understanding is that the udm-le script will be called from the context of the container, so when you call restart from inside, then it shuts down and gets killed and doesn't restart, as shown in the logs above.

I thought the same but then realised not everyone experiences this issue, which is strange.

@kchristensen
Copy link
Owner

My understanding is that the udm-le script will be called from the context of the container, so when you call restart from inside, then it shuts down and gets killed and doesn't restart, as shown in the logs above.

I thought the same but then realised not everyone experiences this issue, which is strange.

Indeed. I have no issues with the way this is setup and I've had this running on my UDMP since more or less the Early Access days of the udmp. That said I only use Cloudflare for DNS and none of the captive portal things that people have added over time, so it is hard to say if that is possibly causing issues.

@emlynmac
Copy link
Contributor

emlynmac commented Mar 19, 2021

The issue is that when whatever causes the restart on boot to fail, the controller and captive portal and what have you won't be running. The actual routing stuff happens on the host and should be unaffected by the failure to restart.

Routing won't be affected, but protect will not be up - which is a problem.

My setup has udm-le, cloudflare-ddns, pihole, cloudflared and homebridge containers. I don't have captive portal set up.
Running unifi-os restart from the SSH console works fine.

@klemens-st
Copy link
Author

I have disabled the captive portal and have nothing else installed besides on-boot and udm-le. And as I have recently tested it's just the restart command that fails, it has nothing to do with udm-le itself.

@emlynmac
Copy link
Contributor

Does restart work from the shell?

@klemens-st
Copy link
Author

Yes, always.

@klemens-st
Copy link
Author

But I think you are on the right lines that when on-boot script runs unifi-os restart its executed in a different context. The kernel part of the logs when it's run from shell is different.

@emlynmac
Copy link
Contributor

So peacey over at the udm-utilities has a fix for this issue.

I took his second suggestion and replaced unifi-os restart with pod man restart unifi-os &

The last case is now:

 echo 'Attempting certificate renewal'                                                                
        ${PODMAN_CMD} ${LEGO_ARGS} renew --days 60 && deploy_cert && add_captive && podman restart unifi-os &
        ;;

Seems to have resolved the issue

@klemens-st
Copy link
Author

Finally found the time to test 1.0.9. It fixed the issue, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants