Not able to mount with -o degraded when a disk is missing after hardware failure #703
Comments
Alright, so I decided to try downgrading bcachefs-tools to 1.7.0 and giving it another try and lo and behold, it worked! So this seems like it might just be a tools bug in 1.9.x. The command I ended up running:
It took several hours, and spat out a lot of
warnings, but otherwise seems to have had no further trouble mounting the 7 remaining disks. I'm backing up the important stuff before I try any more things, and obviously right now there is a lot of this in dmesg (definitely not unexpected at this point):
Having the same issue. Trying the downgrade solution now. This will be the second issue I've run into as a result of a newer bcachefs-tools package (disk version 1.9 this time) operating on arrays with the released kernel version (disk version 1.7 this time), so I think there are some non-trivial edge cases that are getting missed due to version shear. Perhaps the kernel-version matched bcachefs tools should be made the primary package with the latest tools version as a secondary or dev package. Note that I'm also on Arch, like the reporter.
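To make the "version shear" check concrete, here is a minimal sketch of comparing two dotted version strings in shell. The helper name `version_lt` is hypothetical, the version strings are passed in by hand, and it assumes a `sort` that supports `-V` (GNU coreutils does); it is not something bcachefs-tools ships.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if version $1 is strictly older than $2.
# Assumes `sort -V` (version sort) is available.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example: flag the mismatch described above (values typed in manually).
kernel_disk_version="1.7.0"
tools_version="1.9.1"
if version_lt "$kernel_disk_version" "$tools_version"; then
    echo "tools ($tools_version) are newer than the kernel's on-disk version ($kernel_disk_version)"
fi
```

A mismatch reported this way is only a hint to try a matching tools release; it does not by itself prove the tools are at fault.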
I wasn't able to resolve my issue by downgrading bcachefs-tools (it hung during the backpointer checks on tools 1.7), but I was able to recover the array by upgrading the kernel to the current linux-next-git rather than the regular linux package (6.10.something). So definitely a version shearing issue or bug in kernel 6.10, but one that seems to have been resolved at HEAD. The only bad news is that if people lose an array right now, they might be hosed until the next kernel release (unless they follow similar steps).
Clarification: installing linux-next-git and adding a bunch of swap space allowed me to complete a clean fsck run, not mount the filesystem. Getting the array mounted once clean did require downgrading to bcachefs-tools version 1.7.0.

Now that I've gotten everything fully back up, it seems I hit two issues:

a) Trying to mount with tools version 1.9.0 and fewer than the ideal set of disks failed with "invalid argument" even with -o degraded (insufficient disks to start, per kernel logs). My intuition is that this is related to the new code that scans for bcachefs superblocks being active even when an explicit colon-separated device list is given on the command line. Downgrading to tools 1.7.0 allowed me to mount the filesystem with

b) Confounding factor: filesystem was not clean due to OOM during the previous device remove operation. Running fsck on a 12TB filesystem with a missing disk needed about 58GB of virtual memory at peak, which was the root cause of the failed fsck runs. Because fsck runs automatically on mount if unclean (true for all attempts in my case), it made mounting with tools 1.7.0 look like a failure because the kernel ran out of memory and crashed or locked up the host, prompting reboots. Getting a clean fsck first by adding more swap space allowed me to mount via tools 1.7.0 like the original reporter.

It's unclear if upgrading to linux-next-git was necessary. I think adding the additional swap space so fsck could complete was the primary fix. However, the error messages from both the kernel and bcachefs-tools code were better, which helped. So thanks for improving those :) That said, I looked at the changelog, and there were several malloc and deadlock related changes between mainline and next that might have unblocked the fsck runs. Like I said: unclear.

Regardless, all is well again with my filesystems, so this should be my last update unless there is anything I can provide to help debug or confirm a fix.
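For anyone hitting the same OOM during fsck, adding temporary swap might look roughly like the sketch below. The function name `swapfile_cmds`, the path `/swapfile`, and the 64 GiB size are all assumptions for illustration (sized a little above the ~58GB peak mentioned above); the function only prints the standard `dd`/`chmod`/`mkswap`/`swapon` commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the commands to create and enable
# a swap file of SIZE_GIB gibibytes at PATH.
swapfile_cmds() {
    path=$1
    size_gib=$2
    # dd writes 1 MiB blocks, so 1 GiB = 1024 blocks.
    count=$((size_gib * 1024))
    printf 'dd if=/dev/zero of=%s bs=1M count=%s status=progress\n' "$path" "$count"
    printf 'chmod 600 %s\n' "$path"
    printf 'mkswap %s\n' "$path"
    printf 'swapon %s\n' "$path"
}

# Example with assumed values; pipe to `sh` as root only after reviewing.
swapfile_cmds /swapfile 64
```

The swap file should live on a healthy disk outside the degraded array, and can be dropped again with `swapoff` once fsck has completed.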
1.9.1, and possibly 1.9.0 as well, had a bug in the mount helper that resulted in mount options not getting passed through. Can you check that? Either build a newer version of -tools, or mount without the helper (-i).
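A rough sketch of what mounting without the helper could look like, for readers who have not used `mount -i` before. It relies on util-linux `mount -i` skipping the `mount.bcachefs` helper and assumes the kernel accepts the colon-separated device list directly; the function names are hypothetical, and the script only prints the command instead of running it. For an encrypted filesystem the key would presumably need to be loaded first, which this sketch does not cover.

```shell
#!/bin/sh
# Hypothetical helper: join the device paths that still exist into a
# colon-separated list, skipping any that are gone (e.g. the failed disk).
join_present_devices() {
    list=""
    for dev in "$@"; do
        [ -e "$dev" ] || continue
        list="${list:+$list:}$dev"
    done
    printf '%s\n' "$list"
}

# Hypothetical helper: print (do not run) a degraded mount command that
# bypasses the mount helper via `mount -i`.
degraded_mount_cmd() {
    target=$1
    shift
    devices=$(join_present_devices "$@")
    printf 'mount -i -t bcachefs -o degraded %s %s\n' "$devices" "$target"
}

# Example: /dev/sdh is listed but dropped automatically if it is missing.
degraded_mount_cmd /mnt/storage /dev/sda /dev/sdb /dev/sdc /dev/sdd \
    /dev/sde /dev/sdf /dev/sdg /dev/sdh
```

Printing the command first makes it easy to confirm that `-o degraded` is really present before running it as root, which is exactly the thing the helper bug was dropping.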
I'm not able to do testing in the next day or two because I'll be out of cell service in the woods, but will try to confirm that when I'm back. Could you add a mention of the -i flag on bcachefs mount --help though? It sounds like it can be relevant in stressful situations, and I was not even aware that flag existed, despite looking for things exactly like that to try when I was initially debugging the issue.

Just for the sake of appreciation: this has been my only negative experience with bcachefs other than some deadlocks in the first kernel version. Your work on this is much appreciated and I'm proud to have been a supporter for several years.
I hit a very similar issue with an encrypted array of 4 HDDs + 1 SSD (durability 0).
Workaround by bypassing the mount helper:
dmesg log:
A subsequent mount with PS @DaemonF: the
I have an 8-disk array and after one of my disks died suddenly I'm no longer able to mount it since /dev/sdh no longer exists:

❯ sudo bcachefs mount -v -o degraded,errors=remount-ro /dev/sda:/dev/sdb:/dev/sdc:/dev/sdd:/dev/sde:/dev/sdf:/dev/sdg /mnt/storage
DEBUG - bcachefs::commands::mount: Walking udev db!
INFO - bcachefs::commands::mount: mounting with params: device: /dev/sda:/dev/sdb:/dev/sdc:/dev/sdd:/dev/sde:/dev/sdf:/dev/sdg, target: /mnt/storage, options: degraded,errors=remount-ro
DEBUG - bcachefs::commands::mount: parsing mount options: degraded,errors=remount-ro
INFO - bcachefs::commands::mount: mounting filesystem
ERROR - bcachefs::commands::mount: Fatal error: Invalid argument

And in dmesg:
If I try to mount it with -o very_degraded it gives the same output. Using mount.bcachefs and mount -t bcachefs gives the same output, as does using UUID=55cfeccc-d8b2-4813-b1a4-9ff9212962e7.

I saw that you can remove a disk by ID so I also tried:
So it seems that would only work if I could mount the array first, which is exactly the problem.
Some extra info: