emergency mode with too many hard disks / slow SAS controller #1517
Comments
Can you share the full log?
That's hard, I suppose. The node is a Dell R730xd with all 24 disks in the front filled; Flatcar resides on a 25th disk in the back. All are SSDs of consumer grade, except for the Flatcar one, which is enterprise grade (= full SAS interface, I believe). Inside is a … Then there is some …
Looking at the zpool disks, they have 2 partitions each, like so (that's from another node): …
which means I have 2 * 24 + 9 = 57 partitions in the system.
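For reference, a whole-disk ZFS member on Linux typically carries one large data partition plus a small reserved partition, so a layout along these lines is expected (a hedged sketch with made-up device names and sizes, not the actual listing from that node):

```
# Illustrative only: device name and sizes are made up.
lsblk -o NAME,SIZE,TYPE /dev/sdb
# NAME     SIZE TYPE
# sdb      1.8T disk
# |-sdb1   1.8T part   <- zfs_member data partition
# `-sdb9     8M part   <- small reserved partition created by ZFS
```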
Observing the console, I see that nothing with zfs/zpool gets to happen yet. Two more unit jobs stick out, however: …
Before the emergency shell kicks in, I spot some other …
In the emergency shell, looking at …
EDIT: it's not quite as bad; the other node with 20 disks equipped has 138 links in …
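In case it is useful to others, the symlink count can be checked from the emergency shell with standard tools; the directories below are the usual udev locations, assumed rather than quoted from the report:

```
# Count udev-created disk symlinks; these directories grow quickly
# when many physical disks and partitions are attached.
ls /dev/disk/by-id | wc -l
ls /dev/disk/by-path | wc -l
```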
Can you try and see if one of these two kernel command line arguments helps mitigate this:
…
or
…
If they do, you can persist them in …
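The argument names, values, and config path below are assumptions about the kind of options meant here, not necessarily the two that were actually suggested:

```
# Hedged sketch: timeout-related kernel arguments of the relevant kind.
#   rd.timeout=300                          # dracut: longer device wait in the initrd
#   systemd.default_device_timeout_sec=300  # systemd: default device unit timeout
# On Flatcar, kernel arguments can be persisted via the OEM GRUB config
# (path assumed; on recent releases the OEM partition may live under /oem):
echo 'set linux_append="rd.timeout=300 systemd.default_device_timeout_sec=300"' \
  | sudo tee -a /usr/share/oem/grub.cfg
```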
That's possible, I want to try to reproduce. Our USR partition is discovered by UUID, and we wait for udev to settle, so it is possible that scanning all devices/partitions is taking longer than systemd waits by default.
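A minimal sketch of how that wait can be inspected, assuming a generic device unit name rather than the actual by-UUID unit for the USR partition:

```
# Job timeout properties bound how long boot waits for a device to appear;
# dev-sda.device is only a placeholder for the real device unit.
systemctl show -p JobTimeoutUSec,JobRunningTimeoutUSec dev-sda.device
# See how long the udev queue takes to drain on this hardware:
time udevadm settle
```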
All right, these timeouts make it work! That's a quick fix at least. Many thanks! It made 133 links. The jobs popping out on the console are … Thereafter, zpool imports fine.
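For anyone applying the same workaround, a quick sanity check after reboot (argument names as assumed in the sketch above):

```
# Confirm the extra arguments actually made it onto the kernel command line.
tr ' ' '\n' < /proc/cmdline | grep -i timeout
```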
Description
I have zpools spanning 24 disks in that node. As soon as I connect these disks, boot always ends up in emergency mode because some device mapper discovery (?) takes a long time. I then see
sysroot.mount: Mounting timeout. Terminating.
and from there everything stops and drops into emergency mode. Before that I see many lines like this (presumably one per disk or so):
systemd-udevd[556]: 0:0:22:0: Worker [779] processing SEQNUM=3014 is taking a long time
I have another node with half that many disks, which also spends quite some time looking at all the disks but gets done before a sysroot.mount timeout happens, then imports its zpools, and all is fine in the end. All I need is a longer timeout, I guess, presumably on the order of 5 minutes, before it falls back to emergency mode.
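A hedged sketch of how to confirm from the emergency shell that it is indeed the sysroot device job timing out, using standard systemd/udev tools (nothing here is taken from the actual node):

```
# What is boot still waiting for?
systemctl list-jobs
# Messages around the timeout in the current boot:
journalctl -b --no-pager | grep -iE 'sysroot|time(d)? ?out|udev' | tail -n 50
# Is the udev event queue still busy? (non-zero exit means not settled yet)
udevadm settle --timeout=5; echo "settle exit: $?"
```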
Impact
Cannot get that node online with its data on the zpools.
Reproduction
- installed 3975.2.0 successfully from a USB stick using the flatcar-install -d /dev/sda -i ignition.json method
- reboot and Ignition complete successfully if /dev/sda is the only disk connected; the node comes up, joins, all fine
- a few disks, like 5 or 10, would probably be fine too? (cannot try, as it would degrade the zpools)
Additional information
I am migrating the cluster from CentOS, which appears to have no problem with the slowness of the many disk devices (can't tell about earlier Flatcar versions).
Flatcar release (booted with the data disks disconnected): …