emergency mode with too many hard disks / slow SAS controller #1517

Open
x3nb63 opened this issue Aug 20, 2024 · 5 comments
Labels
kind/bug Something isn't working

Comments


x3nb63 commented Aug 20, 2024

Description

I have zpools spanning 24 disks in that node. As soon as I connect these disks, boot always ends up in emergency mode because some device-mapper discovery (?) takes a long time. I then see sysroot.mount: Mounting timeout. Terminating. and from there everything drops into emergency mode.

Before that I see many lines like this (presumably one per disk or so):

systemd-udevd[556]: 0:0:22:0: Worker [779] processing SEQNUM=3014 is taking a long time

I have another node with half as many disks, which also spends quite some time looking at all the disks but finishes before the sysroot.mount timeout hits, then imports its zpools, and everything is fine in the end.

All I need is a longer timeout, I guess, presumably on the order of 5 minutes, before the system falls back to emergency mode.
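
For reference, the wait in question is systemd's default start timeout, which can be inspected on a normally booted node (a sketch; I assume the initrd uses the same manager default unless it is overridden on the kernel command line):

# Show the manager-wide default start timeout (typically 90s);
# a mount unit's TimeoutSec defaults to this value.
systemctl show --property=DefaultTimeoutStartSec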

Impact

Cannot get that node online with its data on the zpools.

Reproduction

Installed 3975.2.0 successfully from a USB stick using the flatcar-install -d /dev/sda -i ignition.json method.

Reboot and Ignition complete successfully if /dev/sda is the only disk connected; the node comes up, joins, all fine ...

(A few disks, say 5 or 10, would probably be fine too? I cannot try, as it would degrade the zpools.)

Additional information

I am migrating the cluster from CentOS, which appears to have no problem with the slowness of the many disk devices (I can't tell about earlier Flatcar versions).

Flatcar release, booted with the data disks disconnected:

n04 ~ # cat /etc/os-release
NAME="Flatcar Container Linux by Kinvolk"
ID=flatcar
ID_LIKE=coreos
VERSION=3975.2.0
VERSION_ID=3975.2.0
BUILD_ID=2024-08-05-2103
SYSEXT_LEVEL=1.0
PRETTY_NAME="Flatcar Container Linux by Kinvolk 3975.2.0 (Oklo)"
ANSI_COLOR="38;5;75"
HOME_URL="https://flatcar.org/"
BUG_REPORT_URL="https://issues.flatcar.org"
FLATCAR_BOARD="amd64-usr"
CPE_NAME="cpe:2.3:o:flatcar-linux:flatcar_linux:3975.2.0:*:*:*:*:*:*:*"

jepio commented Aug 20, 2024

Can you share full journalctl -b0 and dmesg output from the emergency mode? Do you have any idea how this could be reproduced in a synthetic environment? What is the topology in terms of disks and controllers?

sysroot.mount runs from the initrd, and zfs/zpool probing happens after switch-root. We might be probing more than CentOS from our initrd, hard to tell right now.


x3nb63 commented Aug 21, 2024

Can you share full journalctl -b0 and dmesg output from the emergency mode?

dmesg.txt
journalctl-b0.txt

Do you have any idea how this could be reproduced in a synthetic environment? What is the topology in terms of disks and controllers?

That's hard, I suppose.

The node is a Dell R730xd with all 24 front bays filled. Flatcar resides on a 25th disk in the back. All are consumer-grade SSDs, except the Flatcar one, which is enterprise grade (= full SAS interface, I believe).

Inside is a Broadcom / Avago / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02) controller run by the mpt3sas kernel module, connecting all the front disks. Neither the BIOS nor iDRAC shows errors from that controller; it's just slow ... the BIOS phase takes 5-10 minutes with all disks.

Then there is an Intel Corporation C610/X99 series chipset 6-Port SATA Controller operated by sd_mod, connecting the Flatcar disk in the back.
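
In case it helps, the controller-to-driver binding can be double-checked like this (a sketch; lspci -k prints the kernel driver in use for each device):

# List SAS/SATA controllers along with the kernel driver bound to them
lspci -k | grep -iA3 'sas\|sata'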

Looking at the zpool disks, they have 2 partitions each, like so (that's from another node):

# fdisk -l /dev/sdu
Disk /dev/sdu: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: WDC  WDS200T2B0A
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: AE8B2BAE-3A6B-5A4F-9CA1-FA4B164FFDB9

Device          Start        End    Sectors  Size Type
/dev/sdu1        2048 3907012607 3907010560  1.8T Solaris /usr & Apple ZFS
/dev/sdu9  3907012608 3907028991      16384    8M Solaris reserved 1

which means I have 2 * 24 + 9 = 57 partitions in the system.
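
That count can be verified quickly (a sketch using lsblk):

# Count partition-type block devices in the system
lsblk -ln -o NAME,TYPE | awk '$2 == "part"' | wc -l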

sysroot.mount runs from the initrd, and zfs/zpool probing happens after switch-root. We might be probing more than CentOS from our initrd, hard to tell right now.

Observing the console, I see that nothing zfs/zpool-related happens yet. Two more unit jobs stick out, however: dev-mapper-usr.device/start and verity-setup.service/start. Both run without a time limit and don't finish for as long as I waited.

Before the emergency shell kicks in, I spot some other device-mapper thing doing a lot of work.

In the emergency shell, looking at /dev/disk/by-id/ shows multiple (!) links to each of these disks and partitions ... ata-..., dm-name-..., dm-uuid-..., wwn-... ... but it does not look complete; I need to count to be sure. My suspicion is that device mapper does not finish creating all of these. It's >250 links altogether, at least.

EDIT: it's not quite as bad: the other node, equipped with 20 disks, has 138 links in /dev/disk/by-id/, and another one (which is still on CentOS) has 14 disks and 162 links ... still a lot to loop over ...
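
Counting them is simple enough from the emergency shell:

# Count the by-id symlinks udev has created so far
ls -1 /dev/disk/by-id/ | wc -l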


jepio commented Aug 21, 2024

Can you try and see if one of these two kernel command line arguments helps mitigate this:

systemd.default_timeout_start_sec=300

or

systemd.default_device_timeout_sec=300

If they do, you can persist them in /oem/grub.cfg:

set linux_append="systemd.(...)"
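
For example, with the device timeout (a sketch; adjust the value as needed):

# /oem/grub.cfg
set linux_append="systemd.default_device_timeout_sec=300"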


jepio commented Aug 21, 2024

In the emergency shell, looking at /dev/disk/by-id/ shows multiple (!) links to each of these disks and partitions ... ata-..., dm-name-..., dm-uuid-..., wwn-... ... but it does not look complete; I need to count to be sure. My suspicion is that device mapper does not finish creating all of these. It's >250 links altogether, at least.

EDIT: it's not quite as bad: the other node, equipped with 20 disks, has 138 links in /dev/disk/by-id/, and another one (which is still on CentOS) has 14 disks and 162 links ... still a lot to loop over ...

That's possible; I want to try to reproduce it. Our USR partition is discovered by UUID, and we wait for udev to settle, so it is possible that scanning all devices/partitions takes longer than systemd waits by default.
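
The settle wait can also be exercised by hand from a shell (a sketch; udevadm settle blocks until the udev event queue is empty or the timeout expires):

# Block until udev's event queue drains, waiting up to 300 seconds;
# a non-zero exit status means events were still pending at the timeout.
udevadm settle --timeout=300
echo $?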


x3nb63 commented Aug 21, 2024

All right, these timeouts make it work! That's a quick fix at least. Many thanks!

It made 133 links. The jobs popping up on the console are dev-disk-by\x2dpartuuid..., dev-disk-by\x2dpartlab..., dev-disk-by\x2dlabel..., and dev-mapper-usr.device.

Thereafter, the zpool imports fine.
