
Conversation

@lisongqian (Contributor) commented Aug 23, 2025

block: discard and write_zeroes support for raw sync and raw async

The host doesn't reclaim disk space when space is released in the guest.
This patch adds discard and write_zeroes support for the raw sync and raw async (io_uring) backends, allowing a trim executed in the guest to notify the disk backend to release the disk space.

Signed-off-by: Songqian Li sionli@tencent.com
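
For a file-backed raw disk, the two new request types boil down to a pair of fallocate() calls; a minimal sketch using the libc crate (an illustration of the system calls involved, not the patch itself):

use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

/// Discard: release the byte range back to the filesystem without changing the file size.
fn punch_hole(file: &File, offset: i64, len: i64) -> io::Result<()> {
    // FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE.
    let ret = unsafe {
        libc::fallocate(
            file.as_raw_fd(),
            libc::FALLOC_FL_PUNCH_HOLE | libc::FALLOC_FL_KEEP_SIZE,
            offset,
            len,
        )
    };
    if ret < 0 { Err(io::Error::last_os_error()) } else { Ok(()) }
}

/// write_zeroes: make the range read back as zeroes while keeping it allocated.
fn zero_range(file: &File, offset: i64, len: i64) -> io::Result<()> {
    let ret = unsafe { libc::fallocate(file.as_raw_fd(), libc::FALLOC_FL_ZERO_RANGE, offset, len) };
    if ret < 0 { Err(io::Error::last_os_error()) } else { Ok(()) }
}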

@lisongqian lisongqian requested a review from a team as a code owner August 23, 2025 06:39
@lisongqian lisongqian force-pushed the discard branch 6 times, most recently from e2406be to 5ffbf9e on August 23, 2025 07:30
@rbradford (Member) left a comment

Does the guest OS take advantage of this? What's the plan for implementing it on something better than the raw sync backend, as I doubt that gets used very much compared to, say, io_uring?

id=<device_id>,pci_segment=<segment_id>,rate_limit_group=<group_id>,\
queue_affinity=<list_of_queue_indices_with_their_associated_cpuset>,\
serial=<serial_number>";
serial=<serial_number>,discard=on|off,detect-zeroes=on|off|unmap";
Member

I think these should always be on if the backing file implementation supports it.

Contributor Author

The performance of deleting many files on ext4 deteriorates after executing discard.
Do we need more discussion to confirm whether to set discard=on as the default config?

Contributor Author

I changed discard=on to be the default config; users can disable it if they don't want the guest to execute discard or write_zeroes on this block device.

fn check_request(features: u64, request_type: RequestType) -> result::Result<(), ExecuteError> {
    let has_feature = |feature_pos: u64| -> bool { (features & (1u64 << feature_pos)) != 0 };

    if has_feature(VIRTIO_BLK_F_RO.into()) && request_type != RequestType::In {
Member

Please include this refactoring and the R/O support in a separate commit (and even separate PR) as I think this would be a good thing to land.

Contributor Author

Moved to a new PR: #7294.

@lisongqian lisongqian force-pushed the discard branch 2 times, most recently from 2c545e5 to f847dfb on August 25, 2025 07:08
@lisongqian (Contributor Author) commented Aug 25, 2025

Does the guest OS take advantage of this? What's the plan for implementing it on something better than the raw sync backend, as I doubt that gets used very much compared to, say, io_uring?

io_uring is supported.

@lisongqian lisongqian force-pushed the discard branch 2 times, most recently from d4d4b26 to a241f7a on August 27, 2025 02:38
@lisongqian lisongqian changed the title from "block: discard and write_zeroes support for raw sync" to "block: discard and write_zeroes support" Aug 28, 2025
@lisongqian lisongqian changed the title from "block: discard and write_zeroes support" to "block: discard and write_zeroes support for raw sync and raw async" Aug 28, 2025
@lisongqian lisongqian force-pushed the discard branch 2 times, most recently from 0479c2f to df98391 on August 28, 2025 14:50
@arachsys (Contributor) commented Sep 2, 2025

This is great! The lack of discard support has been one of the major drawbacks of both firecracker and cloud-hypervisor compared to qemu. I guess a future enhancement might be calling ioctl(BLKDISCARD) instead of using FALLOC_FL_PUNCH_HOLE when the virtual disk is backed by a block device instead of a plain file.
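
For the block-device case this would come down to something like the following sketch; the ioctl request numbers are the ones from <linux/fs.h>, and this is only an illustration of the calls involved, not code from this PR:

use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

// ioctl request numbers from <linux/fs.h>: _IO(0x12, 119) and _IO(0x12, 127).
const BLKDISCARD: u64 = 0x1277;
const BLKZEROOUT: u64 = 0x127f;

/// Discard a byte range on a block device (TRIM pass-through); contents afterwards are undefined.
fn blkdev_discard(dev: &File, offset: u64, len: u64) -> io::Result<()> {
    let range = [offset, len];
    // SAFETY: `dev` is an open block device and `range` outlives the call.
    let ret = unsafe { libc::ioctl(dev.as_raw_fd(), BLKDISCARD as _, range.as_ptr()) };
    if ret < 0 { Err(io::Error::last_os_error()) } else { Ok(()) }
}

/// Zero a byte range on a block device; unlike discard, zeroes are guaranteed on read-back.
fn blkdev_zeroout(dev: &File, offset: u64, len: u64) -> io::Result<()> {
    let range = [offset, len];
    // SAFETY: as above.
    let ret = unsafe { libc::ioctl(dev.as_raw_fd(), BLKZEROOUT as _, range.as_ptr()) };
    if ret < 0 { Err(io::Error::last_os_error()) } else { Ok(()) }
}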

@rbradford (Member)

@lisongqian Please can you rebase - we can probably look at moving ahead with this PR.

@phip1611 phip1611 mentioned this pull request Sep 25, 2025
@lisongqian lisongqian force-pushed the discard branch 5 times, most recently from c600fcf to 4bef3fb on September 25, 2025 13:59
arachsys added a commit to arachsys/packages that referenced this pull request Oct 20, 2025
Backport experimental support for virtio-blk discard from

  cloud-hypervisor/cloud-hypervisor#7292

for testing and benchmarking.
@arachsys (Contributor) commented Oct 21, 2025

I decided to do a little casual testing with this PR, on my linux-musl x86-64 dev host. I created a small, fully allocated disk image on the host's tmpfs /tmp with

# fallocate -l 8G /tmp/big

and then started cloud hypervisor with a 6.16.x kernel and my 'rescue initramfs' to play with this disk. My first move was to try making an ext4 filesystem, which involves writing zeros and discarding unused blocks.

I got an immediate warning about failed requests because 'Submission queue is full':

# hyper --cpus boot=1 --memory size=1G --kernel=boot/linux \
    --initramfs=boot/repair.zst --disk=path=/tmp/big --cmdline quiet
Please reboot or poweroff when maintenance is complete
# mkfs.ext4 /dev/vda
mke2fs 1.47.3 (8-Jul-2025)
Discarding device blocks: done
Creating filesystem with 2097152 4k blocks and 524288 inodes
Filesystem UUID: 30d894fc-df6b-4b21-a1dd-7daeab73aefc
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): cloud-hypervisor: 31.212182s: <_disk0_q0> WARN:virtio-devices/src/block.rs:277 -- Request failed: Request { request_type: WriteZeroes, sector: 0, data_descriptors: [(GuestAddress(233e5f0), 10)], status_addr: GuestAddress(39b5b10), writeback: true, aligned_operations: [AlignedOperation { origin_ptr: 7f2d9333e5f0, aligned_ptr: 7f2dd10cc400, size: 10, layout: Layout { size: 10, align: 512 (1 << 9) } }], start: Instant { tv_sec: abf31, tv_nsec: 26fead07 } } DiscardWriteZeroes(WriteZeroes(WriteVectored(Custom { kind: Other, error: "Submission queue is full" })))
cloud-hypervisor: 31.212733s: <_disk0_q0> WARN:virtio-devices/src/block.rs:277 -- Request failed: Request { request_type: WriteZeroes, sector: 0, data_descriptors: [(GuestAddress(233e5f0), 10)], status_addr: GuestAddress(39b5c90), writeback: true, aligned_operations: [AlignedOperation { origin_ptr: 7f2d9333e5f0, aligned_ptr: 7f2dd10cc600, size: 10, layout: Layout { size: 10, align: 512 (1 << 9) } }], start: Instant { tv_sec: abf31, tv_nsec: 2707276d } } DiscardWriteZeroes(WriteZeroes(WriteVectored(Custom { kind: Other, error: "Submission queue is full" })))
done
Writing superblocks and filesystem accounting information: cloud-hypervisor: 31.241350s: <_disk0_q0> WARN:virtio-devices/src/block.rs:305 -- Request failed with batch submission: Request { request_type: In, sector: 40000, data_descriptors: [(GuestAddress(2818000), 1000)], status_addr: GuestAddress(39b5e10), writeback: true, aligned_operations: [], start: Instant { tv_sec: abf31, tv_nsec: 28bbcde1 } } ReadVectored(Custom { kind: Other, error: "Submission queue is full" })
cloud-hypervisor: 31.241740s: <_disk0_q0> WARN:virtio-devices/src/block.rs:305 -- Request failed with batch submission: Request { request_type: In, sector: 40000, data_descriptors: [(GuestAddress(2818000), 1000)], status_addr: GuestAddress(39b5f90), writeback: true, aligned_operations: [], start: Instant { tv_sec: abf31, tv_nsec: 28c1c883 } } ReadVectored(Custom { kind: Other, error: "Submission queue is full" })
cloud-hypervisor: 31.242277s: <_disk0_q0> WARN:virtio-devices/src/block.rs:305 -- Request failed with batch submission: Request { request_type: Out, sector: 0, data_descriptors: [(GuestAddress(2225000), 1000), (GuestAddress(2810000), 1000), (GuestAddress(2205000), 1000), (GuestAddress(2155000), 2000), (GuestAddress(214e000), 1000), (GuestAddress(2150000), 2000), (GuestAddress(214d000), 1000), (GuestAddress(214a000), 1000), (GuestAddress(2147000), 1000), (GuestAddress(214b000), 2000), (GuestAddress(2148000), 2000), (GuestAddress(214f000), 1000), (GuestAddress(2252000), 1000), (GuestAddress(2152000), 1000), (GuestAddress(2145000), 1000), (GuestAddress(21fd000), 1000), (GuestAddress(2143000), 2000), (GuestAddress(2141000), 2000), (GuestAddress(2146000), 1000), (GuestAddress(213e000), 1000), (GuestAddress(213d000), 1000), (GuestAddress(213f000), 2000), (GuestAddress(213a000), 1000), (GuestAddress(2202000), 1000), (GuestAddress(2201000), 1000), (GuestAddress(21fe000), 1000), (GuestAddress(213b000), 2000), (GuestAddress(2134000), 1000), (GuestAddress(2136000), 2000), (GuestAddress(2133000), 1000), (GuestAddress(2130000), 2000), (GuestAddress(212a000), 2000), (GuestAddress(2132000), 1000), (GuestAddress(212c000), 4000), (GuestAddress(2135000), 1000), (GuestAddress(2153000), 1000), (GuestAddress(2206000), 1000), (GuestAddress(2138000), 1000), (GuestAddress(2128000), 1000), (GuestAddress(2126000), 2000), (GuestAddress(21ff000), 1000), (GuestAddress(2124000), 2000), (GuestAddress(21fb000), 1000), (GuestAddress(2129000), 1000), (GuestAddress(21fa000), 1000), (GuestAddress(21e5000), 1000), (GuestAddress(2200000), 1000), (GuestAddress(224e000), 1000), (GuestAddress(2203000), 1000), (GuestAddress(224f000), 1000), (GuestAddress(221b000), 1000), (GuestAddress(2121000), 1000), (GuestAddress(2120000), 1000), (GuestAddress(2122000), 2000), (GuestAddress(211d000), 3000), (GuestAddress(2117000), 1000), (GuestAddress(216c000), 1000), (GuestAddress(2119000), 2000), (GuestAddress(2116000), 1000), (GuestAddress(2251000), 1000), (GuestAddress(2112000), 2000), (GuestAddress(210b000), 2000), (GuestAddress(2114000), 1000), (GuestAddress(210e000), 4000), (GuestAddress(2118000), 1000), (GuestAddress(2139000), 1000), (GuestAddress(211b000), 1000), (GuestAddress(2109000), 1000), (GuestAddress(2107000), 2000), (GuestAddress(224d000), 1000), (GuestAddress(2105000), 2000), (GuestAddress(210a000), 1000), (GuestAddress(2102000), 1000), (GuestAddress(2101000), 1000), (GuestAddress(2103000), 2000), (GuestAddress(20fe000), 3000), (GuestAddress(20f8000), 1000), (GuestAddress(20fa000), 1000), (GuestAddress(20f7000), 1000), (GuestAddress(2173000), 1000), (GuestAddress(20f4000), 1000), (GuestAddress(20f3000), 1000), (GuestAddress(20f5000), 1000), (GuestAddress(20ed000), 2000), (GuestAddress(20f6000), 1000), (GuestAddress(20f0000), 3000), (GuestAddress(20f9000), 1000), (GuestAddress(211c000), 1000), (GuestAddress(20fc000), 1000), (GuestAddress(21d7000), 1000), (GuestAddress(21db000), 2000), (GuestAddress(21e0000), 1000)], status_addr: GuestAddress(39b5f90), writeback: true, aligned_operations: [], start: Instant { tv_sec: abf31, tv_nsec: 28c9bcc0 } } WriteVectored(Custom { kind: Other, error: "Submission queue is full" })

The guest appears to be hung at this point.

However, if I disable detect-zeroes, things seem fine:

# hyper --cpus boot=1 --memory size=1G --kernel=boot/linux \
  --initramfs=boot/repair.zst --disk=path=/tmp/big,detect-zeroes=off --cmdline quiet
Please reboot or poweroff when maintenance is complete
# mkfs.ext4 /dev/vda
mke2fs 1.47.3 (8-Jul-2025)
Discarding device blocks: done
Creating filesystem with 2097152 4k blocks and 524288 inodes
Filesystem UUID: 3859c2aa-97be-4a58-a688-e75402dd70a5
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done
#

I can confirm that the discards have happened correctly too:

# du /tmp/big # before
8388608 /tmp/big
# du /tmp/big # after
71900   /tmp/big
#

I don't think the specifics of the kernel or initramfs are very exciting, although they're available at

respectively, with the defconfig-abbreviated kernel config (suitable for make olddefconfig) at

(I should also confirm: for the purposes of this test, I've built an exact clone of this PR's branch without any additional patches, using the current release version of rust. I've also rebased the changes onto the current release of CH, and get the same behaviour there.)

I'll continue trying to find problems with the discard=on,detect-zeroes=off configuration with more testing, but so far everything is working perfectly. I've tested with both block-device and file-backed vda, with a filesystem mounted -o discard, and see used space smoothly increasing and decreasing as the filesystem is filled and emptied. I've also successfully tested guest-side fstrim, which does the right thing.

I've now repeated these tests with discard=on,detect-zeroes=unmap, which also appears to work fine. The initial mkfs.ext4 doesn't complain about an overflowing submission queue, and the allocated space on the backing file smoothly increases and decreases as the filesystem is filled and emptied, exactly as when detect-zeroes is turned off.

So I think it's just detect-zeroes=on that has some small problem here, @lisongqian?

It's really great seeing CH having proper discard/trim support at last. I was delighted when I saw the allocated space in the disk image being released for the first time. Thanks! :)

@likebreath (Member) left a comment

@arachsys Thanks for sharing your experiments with detailed write-up. Can you please work with @lisongqian to get the issues (when enabling detect-zeroes) fixed and add integration tests too?

Please also summarize the motivation for adding this support, e.g. explain why it helps with space reclamation and performance. /cc @lisongqian

@arachsys (Contributor) commented Oct 30, 2025

Until @lisongqian is around, maybe I can help out with some notes on the importance of discard support, as you asked, @likebreath?

With guest drives provisioned directly onto SSDs (or contiguous volumes/partitions of SSDs), discard support is essentially pass-through of TRIM, and is useful for exactly the same reasons that TRIM helps on the host: passing information about unused blocks down to the flash management firmware to help with wear-levelling and performance.

It's much better to erase the released blocks during "idle" time ahead of wanting to reuse them, vs doing a full read-erase-modify-write at write time. And allowing the firmware to understand what blocks are really free and can be reused can make the wear-levelling work across a larger range of free blocks, prolonging SSD life.

With thin-provisioned guests, either with sparse image files in filesystems or something like lvm2 sparse volumes, discard plays an even more important role: allowing unused space when files are deleted in guest filesystems (mounted with -o discard) to be released back to the thin provisioning pool for use by other guests. This turns out to work really well in practice: if you fill a large filesystem in the guest then delete everything again, almost all of the space is released straight back again.

There are lots of thin-provisioned or SSD-backed situations where the only reason I use qemu instead of cloud-hypervisor or firecracker is because qemu supports discard pass-through, and that's required to make the thin-provisioning work or maximise SSD lifespan.

@lisongqian lisongqian force-pushed the discard branch 4 times, most recently from 84cdf88 to 3c9ace3 on November 23, 2025 16:13
@rbradford (Member)

As I said, I prefer the way crosvm did it - #7292 (comment) -
i.e. you don't need a separate write-zeroes option; we can just do that unconditionally with no negative performance impact.

As I understand it, crosvm simplifies the options to sparse=off, which disables discard and makes write-zeroes attempt a fallocate ZERO_RANGE, and sparse=on, which enables both discard and write-zeroes as unmaps, i.e. either fallocate PUNCH_HOLE or block device discard. That certainly works for me; does it cover your use cases too, @lisongqian?

Edit: ...though not all block devices guarantee that zeroes will be read back from discarded blocks, so mapping write-zeroes to block device discard isn't correct. But it's probably fine in the sparse=on case to unconditionally map virtio discard to fallocate PUNCH_HOLE or ioctl BLKDISCARD, and map virtio write-zeroes to fallocate PUNCH_HOLE or ioctl BLKZEROOUT.

For my purposes, I also wouldn't care if write-zeroes was unconditionally mapped to fallocate ZERO_RANGE / ioctl BLKZEROOUT in both sparse=off and sparse=on, with the sparse setting just changing discard behaviour. That's probably the simplest option of all.

Right, crosvm does ZERO_RANGE by default for virtio WRITE_ZEROES. That's why I'm saying we don't need to have a special option for that.

@arachsys (Contributor)

Right, crosvm does ZERO_RANGE by default for virtio WRITE_ZEROES. That's why I'm saying we don't need to have a special option for that.

I agree, at least for my use of CH. The only thing I wanted to double-check is that both sync and async do the right thing with block devices in the sparse case, with write-zeroes mapping to ioctl BLKZEROOUT and discard mapping to ioctl BLKDISCARD.

With files instead of block devices, write-zeroes could correctly map to either fallocate PUNCH_HOLE or fallocate ZERO_RANGE, and I don't mind which. Probably the latter unconditionally is easier, as you say.
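
To pin down the semantics being proposed here, a small self-contained sketch of how a sparse flag could map the two virtio-blk request types onto the operations discussed above (the names are illustrative, not cloud-hypervisor's actual code):

/// Which low-level operation a virtio-blk request would be translated into.
#[derive(Debug, PartialEq)]
enum Action {
    PunchHole,  // fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE)
    ZeroRange,  // fallocate(FALLOC_FL_ZERO_RANGE)
    BlkDiscard, // ioctl(BLKDISCARD)
    BlkZeroOut, // ioctl(BLKZEROOUT)
    Ignore,     // nothing to do; discard is advisory
}

enum Backing {
    RegularFile,
    BlockDevice,
}

/// sparse only gates discard; with sparse=off the DISCARD feature would not be
/// advertised to the guest at all, so a stray discard is simply ignored.
fn map_discard(sparse: bool, backing: &Backing) -> Action {
    match (sparse, backing) {
        (false, _) => Action::Ignore,
        (true, Backing::RegularFile) => Action::PunchHole,
        (true, Backing::BlockDevice) => Action::BlkDiscard,
    }
}

/// write_zeroes is always honoured, so the guest reads back zeroes in both modes.
fn map_write_zeroes(backing: &Backing) -> Action {
    match backing {
        Backing::RegularFile => Action::ZeroRange,
        Backing::BlockDevice => Action::BlkZeroOut, // guarantees zeroes, unlike BLKDISCARD
    }
}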

@lisongqian lisongqian force-pushed the discard branch 2 times, most recently from 86104f5 to 72d52b6 on November 24, 2025 12:25
@lisongqian (Contributor Author)

The latest push has changed the configuration to sparse.

The only thing I wanted to double-check is that both sync and async do the right thing with block devices in the sparse case, with write-zeroes mapping to ioctl BLKZEROOUT and discard mapping to ioctl BLKDISCARD.

For raw async with io_uring, the IORING_OP_IOCTL patch hasn't landed: https://lore.kernel.org/lkml/3757f227-6ed9-b140-e367-69387f966874@gmail.com/t/. Do we need to use sync ioctl operations here? @arachsys @rbradford

@arachsys (Contributor)

For raw async with io_uring, the IORING_OP_IOCTL patch hasn't landed: https://lore.kernel.org/lkml/3757f227-6ed9-b140-e367-69387f966874@gmail.com/t/. Do we need to use sync ioctl operations here? @arachsys @rbradford

Ah, I hadn't spotted that wasn't available! When sparse is enabled on a block device, are all IO operations except discard async, with discard issued synchronously, or does enabling sparse force all IO operations, including normal reads and writes, to be sync?

@arachsys (Contributor)

Potentially torvalds/linux@50c5225, which went into 6.11, might be the specific async block discard, although it's not the whole of ioctl()?

@lisongqian (Contributor Author)

Potentially torvalds/linux@50c5225, which went into 6.11, might be the specific async block discard, although it's not the whole of ioctl()?

I will give it a try and add some integration test cases.

@lisongqian lisongqian changed the title from "block: discard and write_zeroes support for raw sync and raw async" to "[WIP] block: discard and write_zeroes support for raw sync and raw async" Nov 25, 2025
@lisongqian lisongqian force-pushed the discard branch 3 times, most recently from 67ddc0b to 245eaca on December 3, 2025 14:50
@arachsys (Contributor) commented Dec 6, 2025

Would you like me to do some live testing on this yet, @lisongqian, or is it still in development? Very happy to test on some real VMs if we're at a stage where that's helpful.

@lisongqian (Contributor Author)

Would you like me to do some live testing on this yet, @lisongqian, or is it still in development? Very happy to test on some real VMs if we're at a stage where that's helpful.

I'm not sure it's ready. The unit test has failed, but I don't know whether the host kernel is newer than 6.12.
I'm busy this month; I'd be happy if you could help me test it ❤️

The host doesn't reclaim disk space when space is released in the guest.
This patch supports discard with FALLOC_FL_PUNCH_HOLE and write_zeroes
with FALLOC_FL_ZERO_RANGE for raw sync, allowing a trim executed in the
guest to notify the disk backend to release the disk space.

Signed-off-by: Songqian Li <sionli@tencent.com>
This patch supports discard with ioctl(BLKDISCARD) and write_zeroes
with ioctl(BLKZEROOUT) when the blk backend is a block device.

Signed-off-by: Songqian Li <sionli@tencent.com>
This patch supports discard and write_zeroes for raw io_uring, allowing
a trim executed in the guest to notify the disk backend to release the
disk space.

Signed-off-by: Songqian Li <sionli@tencent.com>
Signed-off-by: Songqian Li <sionli@tencent.com>
Signed-off-by: Songqian Li <sionli@tencent.com>
@snue (Contributor) commented Dec 16, 2025

I'm not sure it's ready. The unit test has failed, but I don't know whether the host kernel is newer than 6.12.
I'm busy this month; I'd be happy if you could help me test it ❤️

I can reproduce these failures on a 6.12 kernel. It is flaky, with different assertions in the new test failing at different times. I also observed a partly cleared buffer at some point. My guess is there is at least a race condition between the blk_dev discard tests on the use of /tmp/test_discard_dev.raw, but my first attempt at avoiding that did not fix the test yet.
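
(If the shared path is indeed the culprit, one simple option would be to give each test its own scratch file instead of the fixed /tmp/test_discard_dev.raw; a hypothetical helper, not taken from this PR:)

use std::path::PathBuf;
use std::sync::atomic::{AtomicU32, Ordering};

// One unique scratch file per test, so parallel tests never race on the same backing file.
static TEST_FILE_ID: AtomicU32 = AtomicU32::new(0);

fn unique_test_image(prefix: &str) -> PathBuf {
    let id = TEST_FILE_ID.fetch_add(1, Ordering::Relaxed);
    std::env::temp_dir().join(format!("{}_{}_{}.raw", prefix, std::process::id(), id))
}

// Example use inside a test:
//   let path = unique_test_image("test_discard_dev");
//   ... create and open the image, run the discard/write_zeroes assertions ...
//   std::fs::remove_file(&path).ok();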

@snue (Contributor) left a comment

Very nice work 🎉
I am still trying to figure out why the tests fail. Apart from the unexpected buffer content assertions, I also see segfaults with the new tests occasionally.

Otherwise, just some small suggestions so far.

Comment on lines +85 to +90
/// Failed to punch hole file.
#[error("Failed to punch hole file: {0}")]
PunchHole(#[source] std::io::Error),
/// Failed to write zeroes to file.
#[error("Failed to write zeroes to file: {0}")]
WriteZeroes(#[source] std::io::Error),
Contributor

The nested error printing was improved a while ago, and the {0} placeholders are not needed anymore for errors that provide a #[source].
Also, a small suggestion:

Suggested change:

/// Failed to punch hole in file.
#[error("Failed to punch hole in file")]
PunchHole(#[source] std::io::Error),
/// Failed to write zeroes to file.
#[error("Failed to write zeroes to file")]
WriteZeroes(#[source] std::io::Error),

Comment on lines +98 to +101
#[error("Failed to punch hole: {0}")]
PunchHole(AsyncIoError),
#[error("Failed to write zeroes: {0}")]
WriteZeroes(AsyncIoError),
Contributor

Same here: {0} is better replaced with #[source].

Suggested change:

#[error("Failed to punch hole")]
PunchHole(#[source] AsyncIoError),
#[error("Failed to write zeroes")]
WriteZeroes(#[source] AsyncIoError),

Comment on lines +168 to +169
#[error("Failed to handle discard or write zeroes: {0}")]
DiscardWriteZeroes(Error),
Contributor

And here...

Suggested change:

#[error("Failed to handle discard or write zeroes")]
DiscardWriteZeroes(#[source] Error),

while let Some(mut desc_chain) = queue.pop_descriptor_chain(self.mem.memory()) {
    let mut request = Request::parse(&mut desc_chain, self.access_platform.as_deref())
        .map_err(Error::RequestParsing)?;
    debug!("virtio-blk request type: {:?}", request.request_type);
Contributor

This is surely helpful for development, but should be removed from the loop before merge.

Comment on lines +415 to +416
--privileged \
--security-opt seccomp=unconfined \
Contributor

This is unfortunate for the unit tests, but probably required for the loop device creation? Should this rather be an integration test? (but I also like that it is a quick unit test currently)

Comment on lines +254 to +267
let mut cmd = [0u8; 16];
cmd[0..8].copy_from_slice(&offset.to_ne_bytes());
cmd[8..16].copy_from_slice(&length.to_ne_bytes());

// SAFETY: we know the file descriptor is valid and points to a block device.
unsafe {
    sq.push(
        &opcode::UringCmd16::new(types::Fd(self.fd), BLOCK_URING_CMD_DISCARD() as _)
            .cmd(cmd)
            .build()
            .user_data(user_data),
    )
    .map_err(|_| AsyncIoError::PunchHole(Error::other("Submission queue is full")))?;
};
@snue (Contributor) commented Dec 17, 2025

The arguments to this operation are wrong. The offset needs to go into sqe->addr and the length into sqe->addr3, aka sqe->cmd. It looks like this isn't exposed through the opcode. I hacked it together like this:

Suggested change:

let mut cmd = [0u8; 16];
// length goes to `cmd` aka `sqe.addr3`, but offset/start goes to `sqe.addr`
cmd[0..8].copy_from_slice(&length.to_ne_bytes());
// SAFETY: we know the file descriptor is valid and points to a block device.
unsafe {
    let mut sqe =
        opcode::UringCmd16::new(types::Fd(self.fd), BLOCK_URING_CMD_DISCARD() as _)
            .cmd(cmd)
            .build()
            .user_data(user_data);
    let sqe_ptr = &raw mut sqe as *mut [u8; 128];
    (&mut (*sqe_ptr))[16..24].copy_from_slice(&offset.to_ne_bytes());
    sq.push(&sqe).map_err(|_| {
        AsyncIoError::PunchHole(Error::other("Submission queue is full"))
    })?;
};

This makes the test_blk_dev_raw_io_uring_discard_wr_zeroes_request succeed for me (and the other discard tests as well).

I still see a little flakiness with both test_raw_io_uring_discard_wr_zeroes_request and the async blk_dev variant, which I have not tracked down yet.

Contributor

sqe->addr is exposed in the io_uring crate since tokio-rs/io-uring#375 (not in a crate release yet).

Comment on lines +1164 to +1174
let evt_num = match epoll::wait(epoll_file.as_raw_fd(), timeout, &mut events[..]) {
    Ok(n) => n,
    Err(e) if e.kind() == std::io::ErrorKind::Interrupted => continue,
    Err(e) => panic!("epoll_wait failed: {e}"),
};

for event in events.iter().take(evt_num) {
    assert_eq!(event.data as u16, COMPLETION_EVENT);
    disk_image_async.notifier().read().unwrap();
}
break;
Contributor

I observed rare cases where multiple events are queued, but the next read is issued before they have all completed. This causes the unit test to fail because the buffer contents don't match: the read may have been reordered with a still-queued write, or may not have happened at all by the time the comparison is done.

Suggested change:

let evt_num = match epoll::wait(epoll_file.as_raw_fd(), timeout, &mut events[..]) {
    // Always loop until no more events are detected.
    Ok(0) => break,
    Ok(n) => n,
    Err(e) if e.kind() == std::io::ErrorKind::Interrupted => continue,
    Err(e) => panic!("epoll_wait failed: {e}"),
};
for event in events.iter().take(evt_num) {
    assert_eq!(event.data as u16, COMPLETION_EVENT);
    disk_image_async.notifier().read().unwrap();
}

The relevant strace part of a failing execution:

[pid 2040901] io_uring_enter(4, 1, 0, 0, NULL, 128) = 1
[pid 2040901] epoll_wait(6, [{events=EPOLLIN, data=0x11}], 1, 5) = 1
[pid 2040901] read(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 2040901] io_uring_enter(4, 1, 0, 0, NULL, 128) = 1
[pid 2040901] io_uring_enter(4, 1, 0, 0, NULL, 128) = 1
[pid 2040901] io_uring_enter(4, 1, 0, 0, NULL, 128) = 1
[pid 2040901] epoll_wait(6, [{events=EPOLLIN, data=0x11}], 1, 5) = 1
[pid 2040901] read(5, "\1\0\0\0\0\0\0\0", 8) = 8  <-- only one of 3 queued write events completed
[pid 2040901] io_uring_enter(4, 1, 0, 0, NULL, 128) = 1  <-- queued the next read
[pid 2040901] epoll_wait(6, [{events=EPOLLIN, data=0x11}], 1, 5) = 1
[pid 2040901] read(5, "\1\0\0\0\0\0\0\0", 8) = 8  <-- one more event completed (2 remaining?)
[pid 2040901] gettid()                  = 2040901  <-- starting failed assert handler

With this final change (and those from previous comments), the unit tests now look solid to me! 🎉

Contributor

Well, some more stress-testing found another very rare case of a not fully cleared buffer:

thread 'tests::test_blk_dev_raw_io_uring_discard_wr_zeroes_request' (2357082) panicked at block/src/lib.rs:1584:13:
assertion `left == right` failed
  left: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86]
 right: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

I don't see how that happened, but it is more likely another test design deficiency than an implementation bug. I suspect no completion event had been queued yet in that case. Simply increasing the timeout has not resurfaced the bug yet.

Contributor

Making sure at least one event was processed seems to work as well, even with a smaller timeout to provoke the situation. That's probably the way to go here.

Suggested change:

let mut total = 0;
loop {
    let evt_num = match epoll::wait(epoll_file.as_raw_fd(), timeout, &mut events[..]) {
        // Always loop until no more events are detected.
        Ok(0) => if total > 0 { break } else { continue },
        Ok(n) => n,
        Err(e) if e.kind() == std::io::ErrorKind::Interrupted => continue,
        Err(e) => panic!("epoll_wait failed: {e}"),
    };
    for event in events.iter().take(evt_num) {
        assert_eq!(event.data as u16, COMPLETION_EVENT);
        disk_image_async.notifier().read().unwrap();
    }
    total += evt_num;
}
