Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Instance: Fix deadlock during failed snapshot creation #13821

Merged
merged 3 commits into from
Aug 2, 2024

Conversation

MggMuggins
Copy link
Contributor

@MggMuggins MggMuggins commented Jul 25, 2024

Fixes #13466

We need a non-locking delete on an Instance; otherwise a failure while mounting the instance/updating the backup file will deadlock.

I don't love the type assertion here. However, since CreateInternal returns Instance, we don't have a lot of options beyond enriching that interface or providing an additional one.

In order to avoid using type assertions we'd need to define an internal interface instance for delete and possibly a few other functions in the drivers/ package and implement it for lxc/qemu. Alternatively, moving the Interface interface into the same package as the drivers would allow us to define some private methods on Interface instead.

(Updating the backup file on the full volume was very likely the failure in #13466; this is after applying the fix):

$ lxc snapshot lucky-bulldog
Error: Failed to create file "/var/lib/lxd/containers/microcloud-test_lucky-bulldog/backup.yaml": open /var/lib/lxd/containers/microcloud-test_lucky-bulldog/backup.yaml: disk quota exceeded

I will throw a snapshot failure test on here this afternoon.

LXD-1117

@MggMuggins MggMuggins force-pushed the fix-snapshot-deadlock branch 3 times, most recently from 8527ebd to 4618f4d Compare July 25, 2024 22:04
test/suites/snapshots.sh Outdated Show resolved Hide resolved
test/suites/snapshots.sh Outdated Show resolved Hide resolved
test/suites/snapshots.sh Outdated Show resolved Hide resolved
test/suites/snapshots.sh Outdated Show resolved Hide resolved
We need a non-locking delete on an `Instance`.

I don't love the type assertion here. However, since CreateInternal
returns Instance, we don't have a lot of options beyond enriching that
interface or providing an additional one.

A solution that doesn't use type assertions would be to define an
internal interface for `delete` and possibly a few other functions in
the drivers/ package and implement it for common/lxc/qemu. Alternatively,
moving the `Interface` interface into the same package as the drivers
would allow us to define some private methods on `Interface` instead.

Signed-off-by: Wesley Hershberger <[email protected]>
@MggMuggins MggMuggins force-pushed the fix-snapshot-deadlock branch from 4618f4d to c4df03e Compare July 31, 2024 20:42
@MggMuggins
Copy link
Contributor Author

MggMuggins commented Jul 31, 2024

lxc stop also fails on zfs when the disk is full:
Error: Stopping the instance failed: Failed clearing ownership: chown /home/wesley/Workspace/lxd/test/tmp.KFM/iaA/containers/c1: disk quota exceeded

The container does become STOPPED, but this prevents lxc delete -f c1 from removing the container, hence the separate call to lxc stop. I'm happy to open another bug for this if it's undesired; it's inconvenient but easy to work around.

I tried rm /root/big.bin before deleting the instance and while I was able to remove the file, the dataset USED/AVAIL percentages didn't change. I doubt that there was a process still holding the file open (no lsof in testimage but only two processes, init and ps from lxc exec c1 -- ps) so I'm sure this is some zfs feature/behavior I'm not considering/don't know about.

Other backends:

Backend Result
dir ❌ :lxc snapshot c1 succeeds
btrfs lxc snapshot c1 succeeds; 3 MiB root disk required
lvm lxc snapshot c1 succeeds; 9 MiB root disk required

Didn't try Ceph as I don't have a cluster handy.

Let me know if you'd like other changes; thanks!

@MggMuggins MggMuggins requested a review from tomponline July 31, 2024 21:32
@MggMuggins MggMuggins marked this pull request as ready for review July 31, 2024 21:34
test/suites/snapshots.sh Outdated Show resolved Hide resolved
@tomponline
Copy link
Member

lxc stop also fails on zfs when the disk is full:
Error: Stopping the instance failed: Failed clearing ownership: chown /home/wesley/Workspace/lxd/test/tmp.KFM/iaA/containers/c1: disk quota exceeded

I think we should log a warning but continue the onstop hook function if the error contains "disk quota exceeded" (or if there is a better way to detect that error).

This at least runs through the reverter that gets built in case snapshot
creation fails.

Signed-off-by: Wesley Hershberger <[email protected]>
@MggMuggins MggMuggins force-pushed the fix-snapshot-deadlock branch 2 times, most recently from e6a7c95 to 4c38c09 Compare August 1, 2024 22:26
... rootfs to stop.

Signed-off-by: Wesley Hershberger <[email protected]>
@MggMuggins MggMuggins force-pushed the fix-snapshot-deadlock branch from 4c38c09 to 8fbb18e Compare August 1, 2024 22:32
@MggMuggins
Copy link
Contributor Author

Correct me if I'm wrong, but the failure of those chown/chmod would only leave the mount point owned by a uid that could potentially be reused for another container. All that does is allow the mount point to be traversed by a process running under that uid that has access to the parent dir. Agreed, not a concern. Thanks for your patience.

@MggMuggins MggMuggins requested a review from tomponline August 1, 2024 22:51
@tomponline
Copy link
Member

Correct me if I'm wrong, but the failure of those chown/chmod would only leave the mount point owned by a uid that could potentially be reused for another container. All that does is allow the mount point to be traversed by a process running under that uid that has access to the parent dir. Agreed, not a concern. Thanks for your patience.

Yes thats my understanding, and either way that doesn't succeeded whether we return an error or not, its just that by returning an error further bad things happen because the instance's devices are not cleaned up on the host side.

@tomponline tomponline changed the title lxd/instance/drivers: Fix deadlock during failed snapshot creation Instance: Fix deadlock during failed snapshot creation Aug 2, 2024
@tomponline tomponline merged commit 59d9592 into canonical:main Aug 2, 2024
29 checks passed
@MggMuggins MggMuggins deleted the fix-snapshot-deadlock branch August 2, 2024 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants