[6.11,6.12] Constant I/O (rebalance) when foreground 2x nvme + background 2x HDD when nvme size >> HDD size #799

Open
elmystico opened this issue Dec 11, 2024 · 7 comments

elmystico commented Dec 11, 2024

The I/O ATE MY FLASH after two weeks or so; fortunately they were not expensive at all, just some old pieces.

Setup: two 256 GiB NVMe partitions and two 34 GiB HDD partitions, four partitions in total.
(I'm not using this configuration anymore, but I've tried it a few times from scratch and this was reproducible each time.)
Kernel v6.11 (Debian testing).

```sh
bcachefs format --fs_label=data --replicas=2 --block_size=4k --background_compression=lz4:1 \
    --label=dhdd.tosh4310 /dev/sda3 --label=dhdd.tosh21F0 /dev/sdb3 \
    --discard \
    --label=dnvme.970evo /dev/nvme0n1p4 \
    --label=dnvme.960evo /dev/nvme1n1p4 \
    --foreground_target=dnvme --background_target=dhdd
```

After filling it with some data and putting live processes on it:
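(The summary below appears to be `bcachefs fs usage -h` output; the mountpoint in the example command is illustrative.)

```sh
# Mountpoint path is illustrative
bcachefs fs usage -h /mnt/data
```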

```
Size: 534 GiB
Used: 124 GiB
Online reserved: 1.96 MiB

Data type Required/total Durability Devices
reserved: 1/2 [] 151 MiB
btree: 1/2 2 [nvme0n1p4 nvme1n1p4] 4.51 GiB
user: 1/2 2 [sda3 sdb3] 63.5 GiB
user: 1/2 2 [sda3 nvme0n1p4] 977 MiB
user: 1/2 2 [sda3 nvme1n1p4] 961 MiB
user: 1/2 2 [sdb3 nvme0n1p4] 968 MiB
user: 1/2 2 [sdb3 nvme1n1p4] 985 MiB
user: 1/2 2 [nvme0n1p4 nvme1n1p4] 52.3 GiB
cached: 1/1 1 [sda3] 440 KiB
cached: 1/1 1 [sdb3] 384 KiB
cached: 1/1 1 [nvme0n1p4] 14.1 GiB
cached: 1/1 1 [nvme1n1p4] 14.1 GiB

Compression:
type compressed uncompressed average extent size
lz4 51.8 GiB 197 GiB 70.5 KiB
incompressible 147 GiB 147 GiB 70.2 KiB

Btree usage:
extents: 1.19 GiB
inodes: 305 MiB
dirents: 107 MiB
xattrs: 389 MiB
alloc: 677 MiB
reflink: 137 MiB
subvolumes: 512 KiB
snapshots: 512 KiB
lru: 22.5 MiB
freespace: 5.00 MiB
need_discard: 1.00 MiB
backpointers: 1.52 GiB
bucket_gens: 11.0 MiB
snapshot_trees: 512 KiB
deleted_inodes: 512 KiB
logged_ops: 1.00 MiB
rebalance_work: 117 MiB
subvolume_children: 512 KiB
accounting: 69.5 MiB

Pending rebalance work:
54.3 GiB

dhdd.tosh21F0 (device 1): sdb3 rw
data buckets fragmented
free: 1.06 GiB 4339
sb: 3.00 MiB 13 252 KiB
journal: 272 MiB 1088
btree: 0 B 0
user: 32.7 GiB 133824 100 KiB
cached: 0 B 0
parity: 0 B 0
stripe: 0 B 0
need_gc_gens: 0 B 0
need_discard: 0 B 0
unstriped: 0 B 0
capacity: 34.0 GiB 139264

dhdd.tosh4310 (device 0): sda3 rw
data buckets fragmented
free: 1.07 GiB 4381
sb: 3.00 MiB 13 252 KiB
journal: 272 MiB 1088
btree: 0 B 0
user: 32.7 GiB 133782 12.0 KiB
cached: 0 B 0
parity: 0 B 0
stripe: 0 B 0
need_gc_gens: 0 B 0
need_discard: 0 B 0
unstriped: 0 B 0
capacity: 34.0 GiB 139264

dnvme.960evo (device 3): nvme1n1p4 rw
data buckets fragmented
free: 196 GiB 802284
sb: 3.00 MiB 13 252 KiB
journal: 2.00 GiB 8192
btree: 2.25 GiB 9237
user: 27.1 GiB 111187 360 KiB
cached: 14.1 GiB 117422 14.5 GiB
parity: 0 B 0
stripe: 0 B 0
need_gc_gens: 0 B 0
need_discard: 60.3 MiB 241
unstriped: 0 B 0
capacity: 256 GiB 1048576

dnvme.970evo (device 2): nvme0n1p4 rw
data buckets fragmented
free: 197 GiB 808055
sb: 3.00 MiB 13 252 KiB
journal: 2.00 GiB 8192
btree: 2.25 GiB 9237
user: 27.1 GiB 111186 92.0 KiB
cached: 14.1 GiB 110583 12.9 GiB
parity: 0 B 0
stripe: 0 B 0
need_gc_gens: 0 B 0
need_discard: 328 MiB 1310
unstriped: 0 B 0
capacity: 256 GiB 1048576
```

Look at the pending rebalance amount.
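(The dump below appears to be `bcachefs show-super` output; the device path in the example command is illustrative.)

```sh
# Device path is illustrative; any member device works
bcachefs show-super /dev/sda3
```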

```
Device: (unknown device)
External UUID: e9807c87-b09b-4cde-8065-4a475de5e2cb
Internal UUID: 16fc0099-7df6-4ea3-9f4e-49cfc10034c9
Magic number: c68573f6-66ce-90a9-d96a-60cf803df7ef
Device index: 1
Label: data
Version: 1.12: rebalance_work_acct_fix
Version upgrade complete: 1.12: rebalance_work_acct_fix
Oldest version on disk: 1.12: rebalance_work_acct_fix
Created: Fri Nov 15 17:10:58 2024
Sequence number: 75
Time of last write: Sun Dec 1 00:31:50 2024
Superblock size: 5.38 KiB/1.00 MiB
Clean: 0
Devices: 4
Sections: members_v1,replicas_v0,disk_groups,clean,journal_seq_blacklist,journal_v2,counters,members_v2,errors,ext,downgrade
Features: lz4,journal_seq_blacklist_v3,reflink,new_siphash,inline_data,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Compat features: alloc_info,alloc_metadata,extents_above_btree_updates_done,bformat_overflow_done

Options:
block_size: 4.00 KiB
btree_node_size: 256 KiB
errors: continue [fix_safe] panic ro
metadata_replicas: 2
data_replicas: 2
metadata_replicas_required: 1
data_replicas_required: 1
encoded_extent_max: 64.0 KiB
metadata_checksum: none [crc32c] crc64 xxhash
data_checksum: none [crc32c] crc64 xxhash
compression: none
background_compression: lz4:1
str_hash: crc32c crc64 [siphash]
metadata_target: none
foreground_target: dnvme
background_target: dhdd
promote_target: none
erasure_code: 0
inodes_32bit: 1
shard_inode_numbers: 1
inodes_use_key_cache: 1
gc_reserve_percent: 8
gc_reserve_bytes: 0 B
root_reserve_percent: 0
wide_macs: 0
promote_whole_extents: 1
acl: 1
usrquota: 0
grpquota: 0
prjquota: 0
journal_flush_delay: 1000
journal_flush_disabled: 0
journal_reclaim_delay: 100
journal_transaction_names: 1
allocator_stuck_timeout: 30
version_upgrade: [compatible] incompatible none
nocow: 0

members_v2 (size 592):
Device: 0
Label: tosh4310 (1)
UUID: a04ae694-690c-49fa-999d-c35db9e55b9f
Size: 34.0 GiB
read errors: 0
write errors: 0
checksum errors: 0
seqread iops: 0
seqwrite iops: 0
randread iops: 0
randwrite iops: 0
Bucket size: 256 KiB
First bucket: 0
Buckets: 139264
Last mount: Sun Dec 1 00:30:28 2024
Last superblock write: 75
State: rw
Data allowed: journal,btree,user
Has data: journal,user,cached
Btree allocated bitmap blocksize: 1.00 B
Btree allocated bitmap: 0000000000000000000000000000000000000000000000000000000000000000
Durability: 1
Discard: 0
Freespace initialized: 1
Device: 1
Label: tosh21F0 (2)
UUID: 08632210-3ddf-4290-971d-17bb26f979e4
Size: 34.0 GiB
read errors: 0
write errors: 0
checksum errors: 0
seqread iops: 0
seqwrite iops: 0
randread iops: 0
randwrite iops: 0
Bucket size: 256 KiB
First bucket: 0
Buckets: 139264
Last mount: Sun Dec 1 00:30:28 2024
Last superblock write: 75
State: rw
Data allowed: journal,btree,user
Has data: journal,user,cached
Btree allocated bitmap blocksize: 1.00 B
Btree allocated bitmap: 0000000000000000000000000000000000000000000000000000000000000000
Durability: 1
Discard: 0
Freespace initialized: 1
Device: 2
Label: 970evo (4)
UUID: b94e6dd2-e553-4e03-b6d6-7e39c799267b
Size: 256 GiB
read errors: 0
write errors: 0
checksum errors: 0
seqread iops: 0
seqwrite iops: 0
randread iops: 0
randwrite iops: 0
Bucket size: 256 KiB
First bucket: 0
Buckets: 1048576
Last mount: Sun Dec 1 00:30:28 2024
Last superblock write: 75
State: rw
Data allowed: journal,btree,user
Has data: journal,btree,user,cached
Btree allocated bitmap blocksize: 8.00 MiB
Btree allocated bitmap: 0000000010000001100000000000000000000000000000001110010100000101
Durability: 1
Discard: 1
Freespace initialized: 1
Device: 3
Label: 960evo (5)
UUID: b03e2746-b6f5-4474-b692-f5fb70ac0662
Size: 256 GiB
read errors: 0
write errors: 0
checksum errors: 0
seqread iops: 0
seqwrite iops: 0
randread iops: 0
randwrite iops: 0
Bucket size: 256 KiB
First bucket: 0
Buckets: 1048576
Last mount: Sun Dec 1 00:30:28 2024
Last superblock write: 75
State: rw
Data allowed: journal,btree,user
Has data: journal,btree,user,cached
Btree allocated bitmap blocksize: 8.00 MiB
Btree allocated bitmap: 0000000000000000100000000000000000000000000000000110010100000101
Durability: 1
Discard: 1
Freespace initialized: 1

errors (size 24):
accounting_mismatch 20 Sun Dec 1 00:30:49 2024
```

@nitinkmr333

Duplicate of #795

@nitinkmr333

@elmystico It shouldn't eat into your SSD's TBW rating, since only reads are affected (at least in my testing).


elmystico commented Dec 16, 2024

Fair enough @nitinkmr333 - I've made a VM just for this: 2x 32 GiB plus 2x 16 GiB, with the background target on the smaller pair, and I see constant I/O including writes for no reason, while the "pending rebalance amount" isn't changing whatsoever.
Please test with parameters similar to mine. Fill the filesystem so there is too much data to fit into the background target, with 2 copies. When rebalance fills the background disks it somehow doesn't stop and there is constant r/w I/O. A rough sketch of the setup is below.
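Roughly what the VM setup looks like (device names, mountpoint and fill size are illustrative, not the exact ones I used):

```sh
# 2x 32 GiB foreground (vdb, vdc) and 2x 16 GiB background (vdd, vde) - names are illustrative
bcachefs format --fs_label=test --replicas=2 \
    --label=ssd.a /dev/vdb --label=ssd.b /dev/vdc \
    --label=hdd.a /dev/vdd --label=hdd.b /dev/vde \
    --foreground_target=ssd --background_target=hdd
mount -t bcachefs /dev/vdb:/dev/vdc:/dev/vdd:/dev/vde /mnt/test

# Write more (incompressible) data than the background pair can hold with 2 replicas
dd if=/dev/urandom of=/mnt/test/fill bs=1M count=20480 status=progress

# Watch whether "Pending rebalance work" ever settles
watch -n 10 'bcachefs fs usage -h /mnt/test | grep -A1 "Pending rebalance"'
```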
Having reported this, it doesn't seem to be a duplicate of #795! Or if it is, #795 would have r/w I/O as well. This can go unnoticed because the bcachefs kernel thread doesn't show writes, but when you look at overall system writes, it sure does write.
Also have a look at the I/O measurements inside and outside the VM (high r/w!). As soon as I umount the fs, I/O drops to zero.
Zrzut ekranu (182)

Screenshot_disker_2024-12-16_133618

@elmystico

Hm, after upgrading the kernel from v6.11 to v6.12 there is no more write I/O, but still full I/O saturation, so perhaps it is a duplicate of #795 after all (as you mentioned, @nitinkmr333).

Zrzut ekranu (219)

@elmystico

Hm, I can see that you've been using v6.11 as well, @nitinkmr333 - perhaps just a reboot made the writes stop? Anyway, until you reboot it, it may stay stuck with r/w I/O, not read-only I/O.

@elmystico elmystico changed the title Constant I/O (rebalance) when foreground 2x nvme + background 2x HDD when nvme size >> HDD size [6.11,6.12] Constant I/O (rebalance) when foreground 2x nvme + background 2x HDD when nvme size >> HDD size Dec 16, 2024
@nitinkmr333

@elmystico I tested it by creating loopback devices.
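Roughly how the loopback devices can be set up (file names and sizes here are illustrative, not the exact ones I used):

```sh
# Sparse backing files on the btrfs host filesystem (sizes are illustrative)
truncate -s 32G /mnt/btrfs/bch-fg-0.img /mnt/btrfs/bch-fg-1.img
truncate -s 16G /mnt/btrfs/bch-bg-0.img /mnt/btrfs/bch-bg-1.img

# Attach them as loop devices and check which /dev/loopN they were assigned
for f in /mnt/btrfs/bch-*.img; do losetup -f "$f"; done
losetup -a
```

The loop devices are then formatted and filled the same way as the real partitions above.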

On kernel 6.11, I noticed that bcachefs was doing heavy reads but not writes. However, my underlying btrfs filesystem (on which I created the loopback devices) was doing the same amount of writes (perhaps btrfs is rewriting some data because of the loopback devices?).
image
I believe it is the same case as yours. In the image you shared (using iotop), I can see bcachefs is doing reads, but the writes are probably done by your underlying filesystem (where the qcow2 images are stored).

After upgrading to kernel 6.12.2, I noticed that the underlying btrfs filesystem is no longer doing those writes on the same setup. There are only reads now (by bcachefs):
Screenshot_20241217_095246_crop

I also checked real hardware (an SD card and a hard drive) by creating 2 partitions on each of them - one as the foreground and one as the background target. There were reads but no writes on the filesystem (even on kernel 6.11) after filling the background target partitions.

Rebooting or remounting these bcachefs drives does not make any difference in my case.

I will try your VM setup.


elmystico commented Dec 17, 2024

> I believe it is the same case as yours. In the image you shared (using iotop), I can see bcachefs is doing reads, but the writes are probably done by your underlying filesystem (where the qcow2 images are stored).

Yeah, I've seen that with bare-metal bcachefs as well. It looks like the bch-rebalance kthread hides its write I/O inside a different thread or something like that, because it can't be seen directly, but you can see the write I/O at the drive level.
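One way to see it (the device names below are from my setup; adjust as needed) is to compare per-process I/O from iotop with per-device write throughput:

```sh
# Per-device throughput: writes show up here even when iotop
# attributes almost nothing to the bcachefs threads
iostat -dxm 1 nvme0n1 nvme1n1 sda sdb

# Or read the raw counters (field 10 is sectors written)
awk '$3 ~ /^(sd[ab]|nvme[01]n1)$/ {print $3, $10}' /proc/diskstats
```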

Anyway, I think it's OK now to wait for a reaction from Kent or another dev; we don't know whether (and what) more info is needed to fix this.
