Conversation

@posk-io (Contributor) commented Sep 24, 2025

Live Update Orchestrator (LUO) is an ongoing Linux Kernel
effort to minimize guest VM disruptions during host kernel
kexec (~reboot):

https://lore.kernel.org/all/20250807014442.3829950-1-pasha.tatashin@soleen.com/

One of the components of LUO is the ability to keep VM guest
memory "untouched" during host kernel reboot:

https://lore.kernel.org/all/20250807014442.3829950-30-pasha.tatashin@soleen.com/

Keeping guest memory "live" across host kernel reboots
minimizes the time it takes to snapshot/restore a VM,
especially with large-memory guests (~TB of RAM).

This PR builds on the existing Cloud Hypervisor ability
to save/restore guest memory backed by PMEM (files); it is mostly
a config/API change.

Signed-off-by: Peter Oskolkov <posk@google.com>

@phip1611 (Member) left a comment

I like the direction this is going in, but I see several things we need to discuss and/or address first. Major concern: #7377 (comment)

let mut restore_cfg: RestoreConfig = serde_json::from_slice(body.raw())?;

// First, consume mem FDs, if any.
let expected_mem_fds = restore_cfg.mem_fds.as_ref().map_or(0, |fds| fds.len());
@phip1611 (Member) Sep 25, 2025

We need to be very careful here and I'd like to discuss this first! We can for sure structure the API in a way where we promise:

External FDs must be provided in the following order: `mem_fds`, `net_fds`[, `blk_fds`].

But once we do this and management software such as libvirt (which we at Cyberus are working on) relies on it, we can never go back.

Does anyone have ideas on how we can make this decision, and its enforcement, clearer? It must also be documented somewhere!

@rbradford (Member) Sep 25, 2025

Maybe we should add extra state to the restore command that provides an ordered set of (type, i.e. what the FDs are for; number of FDs). Then we can do extra validation and checks, allowing us to be flexible in the ordering if necessary.
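
A minimal sketch of that idea, assuming hypothetical names (FdKind, FdLayout) that are not in the tree, and that serde/serde_json are available:

```rust
use serde::{Deserialize, Serialize};

// The request would carry an ordered list describing what the incoming FDs are
// for, so the handler can validate counts before assigning anything.
#[derive(Clone, Copy, Debug, Deserialize, Serialize)]
#[serde(rename_all = "snake_case")]
enum FdKind {
    Mem,
    Net,
    Blk,
}

#[derive(Clone, Debug, Deserialize, Serialize)]
struct FdLayout {
    /// Ordered: the n-th group of FDs on the socket belongs to the n-th entry.
    groups: Vec<(FdKind, usize)>,
}

impl FdLayout {
    /// Total number of FDs the caller promises to send.
    fn total(&self) -> usize {
        self.groups.iter().map(|(_, n)| *n).sum()
    }
}

fn main() {
    let layout = FdLayout {
        groups: vec![(FdKind::Mem, 2), (FdKind::Net, 1)],
    };
    // The handler would compare this against the number of FDs actually received.
    assert_eq!(layout.total(), 3);
    println!("{}", serde_json::to_string(&layout).unwrap());
}
```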

Member

Whatever we end up with, it has to work at least in the following cases:

  • CreateVm
  • RestoreVm
  • ReceiveLiveMigration

All of these may receive FDs for various components at once, unlike API calls such as AddNet.

Contributor Author

We can have something like

struct FdConfig {
    mem_fds: usize,
    net_fds: usize,
}

and add fd_config: Option<FdConfig> as a top-level field to VmConfig and others. WDYT?
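
To make that concrete, a rough sketch (simplified and hypothetical; the real VmConfig obviously has many more fields):

```rust
// Count-based descriptor hanging off the top-level config.
struct FdConfig {
    mem_fds: usize,
    net_fds: usize,
}

struct VmConfig {
    // ...existing fields elided...
    fd_config: Option<FdConfig>,
}

// A handler could then validate up front how many FDs it expects on the socket.
fn expected_fd_count(cfg: &VmConfig) -> usize {
    cfg.fd_config.as_ref().map_or(0, |f| f.mem_fds + f.net_fds)
}

fn main() {
    let cfg = VmConfig {
        fd_config: Some(FdConfig { mem_fds: 1, net_fds: 2 }),
    };
    assert_eq!(expected_fd_count(&cfg), 3);
}
```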

Contributor Author

Alternatively, configs could contain FD indices, e.g. MemoryZone { size: 515M, fd_idx: 3 }, which would indicate that the config will be accompanied by FDs, and that fd_vec[3] will contain the FD for this MemoryZone. At the moment configs are supposed to have placeholder "-1" values when coming from the user; if they instead contained FD indices, with the FD vector attached at the API receiving end, that would disambiguate things, right?
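
A sketch of that index-based alternative, with illustrative field names and values (not the actual config structs):

```rust
#[derive(Debug)]
struct MemoryZoneConfig {
    size: u64,
    /// Index into the FD vector sent alongside the API request, if any.
    fd_idx: Option<usize>,
}

// Resolve a zone's FD from the vector received over the socket. A None here
// means either "no external FD" or an out-of-range index (the real code would
// report the latter as an error).
fn resolve_fd(zone: &MemoryZoneConfig, fd_vec: &[i32]) -> Option<i32> {
    zone.fd_idx.and_then(|i| fd_vec.get(i).copied())
}

fn main() {
    let fd_vec = vec![10, 11, 12, 13];
    let zone = MemoryZoneConfig { size: 512 << 20, fd_idx: Some(3) };
    println!("{zone:?} -> {:?}", resolve_fd(&zone, &fd_vec));
    assert_eq!(resolve_fd(&zone, &fd_vec), Some(13));
}
```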

@phip1611 (Member) Sep 25, 2025

and add fd_config: Option<FdConfig> as a top-level field to VmConfig and others. WDYT?

This would work for creating a VM. But what about state save/resume (save/restore) and receive-migration? I'd love it if we found a solution that handles this complexity in all cases, across all config objects that may now or in the future take external FDs from an API call. Ideally not with a dozen Option<SomeFdConfigType> fields. But right now I can't think of an easy solution either without thinking about this longer.

// First, consume mem FDs, if any.
let expected_mem_fds = restore_cfg.mem_fds.as_ref().map_or(0, |fds| fds.len());

if files.len() < expected_mem_fds {
Member

It looks to me like you're replicating logic here that is supposed to be abstracted behind attach_fds_to_cfgs. The idea of that abstraction was exactly to not have these kinds of statements in the handler.

#[serde(
default,
serialize_with = "serialize_memory_zone_fd",
deserialize_with = "deserialize_memory_zone_fd"
Member

I think in #7281 we decided against a special serialize path. See 666444d.

I'd love to have it consistent here as well.

Contributor Author

I think in #7281 we decided against a special serialize path. See 666444d.

I'd love to have it consistent here as well.

Again, I'm not sure I understand your concern: there already are (de)serialize_netconfig_fds functions; I'm adding (de)serialize_memory_zone_fd in this patch/PR specifically to be consistent with what the existing code does. What is not consistent? What am I missing?

Member

If I'm not mistaken and am not missing something, my intuition is: if you dropped serialize_with = "serialize_memory_zone_fd", the VmInfo API call would display the FD actually in use rather than a "-1" value, similar to how we handle FDs for virtio-net devices.
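
A self-contained sketch of that suggestion; the deserializer body below is only a stand-in (the real serialize_memory_zone_fd/deserialize_memory_zone_fd helpers from the diff may behave differently):

```rust
use serde::{Deserialize, Deserializer, Serialize};

// Stand-in for the helper from the diff above: ignore whatever the user sent,
// so an FD can never be injected via plain JSON.
fn deserialize_memory_zone_fd<'de, D: Deserializer<'de>>(d: D) -> Result<Option<i32>, D::Error> {
    let _ = Option::<i32>::deserialize(d)?;
    Ok(None)
}

#[derive(Serialize, Deserialize, Debug)]
struct MemoryZoneConfigSketch {
    // Only a custom deserializer, no `serialize_with`: serializing this struct
    // (e.g. for VmInfo) then shows the FD actually in use, not a placeholder.
    #[serde(default, deserialize_with = "deserialize_memory_zone_fd")]
    fd: Option<i32>,
}

fn main() {
    let mut cfg: MemoryZoneConfigSketch = serde_json::from_str(r#"{"fd": 99}"#).unwrap();
    cfg.fd = Some(7); // FD attached internally after being received over the socket
    println!("{}", serde_json::to_string(&cfg).unwrap()); // prints {"fd":7}
}
```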

}
}

impl ConfigWithFDs for VmConfig {
Member

This puzzles me a bit. The trait was supposed to be used for MemoryZoneConfig or similar, i.e., the "leaf config objects", not the main object.

If the original design proposed in #7281 was not mature enough, should we change it again? I think this PR adds inconsistencies in its current state, and that concerns me.

Contributor Author

The trait was supposed to be used for MemoryZoneConfig or similar, i.e., the "leaf config objects", not the main object.

I don't fully understand your concern here. Are you suggesting that MemoryZoneConfig should implement the trait, not the top-level config?

If so, I don't understand the need for this trait at all, as the trait is defined in this module (http_endpoint), implemented in this module, and used in this module. Why is it needed at all? To me, traits are useful as abstractions to facilitate interop across modules/crates, but here it doesn't seem to be doing anything?

If it were applicable to any config, including the top-level one, it could be argued to be useful, as it would hide the implementation details of "consuming" FDs; but if it is only to be applied to "leaf config objects", what's the purpose of the trait?

@phip1611 (Member) Sep 25, 2025

I understand your confusion and want to try to clarify things. I'm not saying the solution we introduced in #7281 is the best design one can have for this, but it was a pragmatic solution to simplify the handling of config objects that want external FDs in http_endpoints.rs, and it enables additions such as #7379, which is pretty cool.

The idea is that every handler in http_endpoint.rs that receives external FDs uses attach_fds_to_cfg(). To be able to use that, you have to implement ConfigWithFDs for every config struct you are dealing with. This is at least what worked well for us for state save/resume, hot attach, and live migration with virtio-net devices that have external FDs.
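
For readers without #7281 in front of them, a rough, self-contained sketch of that pattern (the actual trait and helper in the tree may differ in signatures and error types):

```rust
trait ConfigWithFDs {
    /// How many external FDs this config expects.
    fn fd_count(&self) -> usize;
    /// Attach the FDs that were received over the API socket.
    fn attach_fds(&mut self, fds: Vec<i32>);
}

/// Generic helper a handler calls instead of open-coding length checks;
/// it returns the FDs left over for the next FD-taking config.
fn attach_fds_to_cfg<C: ConfigWithFDs>(cfg: &mut C, mut files: Vec<i32>) -> Result<Vec<i32>, String> {
    let want = cfg.fd_count();
    if files.len() < want {
        return Err(format!("expected {} FDs, got {}", want, files.len()));
    }
    let rest = files.split_off(want);
    cfg.attach_fds(files);
    Ok(rest)
}

struct NetConfigSketch {
    fds: Option<Vec<i32>>,
}

impl ConfigWithFDs for NetConfigSketch {
    fn fd_count(&self) -> usize {
        self.fds.as_ref().map_or(0, |f| f.len())
    }
    fn attach_fds(&mut self, fds: Vec<i32>) {
        self.fds = Some(fds);
    }
}

fn main() {
    // Placeholders ("-1") came in via JSON; the real FDs arrive over the socket.
    let mut net = NetConfigSketch { fds: Some(vec![-1, -1]) };
    let leftover = attach_fds_to_cfg(&mut net, vec![33, 34]).unwrap();
    assert!(leftover.is_empty());
    assert_eq!(net.fds, Some(vec![33, 34]));
}
```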

@phip1611 (Member) commented Sep 25, 2025

I don't want to slow things down here, and I'm doing my best to work with you proactively, but the more I think about it, the more I'm genuinely concerned that we first need some serious design thinking about all current and future users/config objects that want external FDs, and that we need a solution that fulfills all of the following properties:

  • API events:
    • VM creation
    • device adding/removing (hotplug)
    • live-migration
    • save/restore
  • Code quality:
    • do not have many corner cases and dozens of runtime if/else checks or checks on Options
    • prevent too many config objects / code duplication
  • API Stability
    • For API calls that take multiple FDs for different entities (e.g., net and mem), we need to agree on an order that is either configurable, or we must ensure that we never break it in the future

I think what we did in #7281 is partly a step in that direction, but we might also need to refactor and adapt it. We could discuss things in person in an upcoming Cloud Hypervisor community call.

@posk-io (Contributor, Author) commented Sep 25, 2025

I don't want to slow things down here, and I'm doing my best to work with you proactively, but the more I think about it, the more I'm genuinely concerned that we first need some serious design thinking about all current and future users/config objects that want external FDs, and that we need a solution that fulfills all of the following properties:

I agree that we need to do it properly; I'm not that worried, though, as it looks like it is more of a configuration/API issue, not a core runtime behavior. Even if we get it wrong at first, API deprecation/versioning is quite normal these days.

Another idea: it seems that all, or most, things that may have an FD attached to them have string tags/names/IDs attached to them as well ("tap0", "mem0", "disk0", etc.). So we may postulate that any entity that may get an external FD attached to it must have an ID, and that any API call that sends an array/vector of FDs must also contain an array/vector of strings that uniquely assign them to devices (e.g. ["tap0", "tap1", "mem0", "mem1"]). This array of strings/IDs must have the same length as the array/vector of incoming FDs. This will remove any ambiguity re: which FD belongs to which "device".
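
A minimal sketch of that pairing rule, assuming parallel ids/fds vectors on the request (names here are illustrative, not an existing API):

```rust
use std::collections::HashMap;

// Reject length mismatches and duplicate IDs, then build an unambiguous
// id -> fd map that the rest of the config handling can consume.
fn match_fds_to_ids(ids: &[String], fds: &[i32]) -> Result<HashMap<String, i32>, String> {
    if ids.len() != fds.len() {
        return Err(format!("got {} IDs but {} FDs", ids.len(), fds.len()));
    }
    let mut map = HashMap::with_capacity(ids.len());
    for (id, fd) in ids.iter().zip(fds) {
        if map.insert(id.clone(), *fd).is_some() {
            return Err(format!("duplicate ID '{id}'"));
        }
    }
    Ok(map)
}

fn main() {
    let ids = ["tap0", "tap1", "mem0"].map(String::from);
    let fds = [33, 34, 35];
    let map = match_fds_to_ids(&ids, &fds).unwrap();
    assert_eq!(map["mem0"], 35);
}
```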

@posk-io (Contributor, Author) commented Sep 30, 2025

@rbradford Rob, how would you like to proceed here? What do you think of this approach:

It seems that all, or most, things that may have an FD attached to them have string tags/names/IDs attached to them as well ("tap0", "mem0", "disk0", etc.). So we may postulate that any entity that may get an external FD attached to it must have an ID, and that any API call that sends an array/vector of FDs must also contain an array/vector of strings that uniquely assign them to devices (e.g. ["tap0", "tap1", "mem0", "mem1"]). This array of strings/IDs must have the same length as the array/vector of incoming FDs. This will remove any ambiguity re: which FD belongs to which "device".

@phip1611 (Member) commented Oct 1, 2025

Do we want to discuss this in tomorrow's community call? I plan to attend.

My current rough idea is to:

  • as @rbradford proposed, have a generic struct carrying all the FD information
  • we would therefore drop the FD from the restored net config struct and similar places, and not allow this in future work either
  • the VmConfig will get a function to apply the FDs onto itself in the right order (see the sketch below)

Not sure how compatible this is with the current code. This would somehow make #7281 obsolete, which is fine, as it would be a better design.
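
A rough sketch of what that third point could look like; all type and field names below are illustrative, not the real config structs:

```rust
struct NetSketch { fds: Option<Vec<i32>> }
struct MemZoneSketch { fd: Option<i32> }

// Top-level config owning the single, well-defined order in which externally
// received FDs are applied to its sub-configs.
struct VmConfigSketch {
    net: Vec<NetSketch>,
    memory_zones: Vec<MemZoneSketch>,
}

impl VmConfigSketch {
    /// Apply FDs in a fixed order: memory zones first, then net devices.
    fn apply_external_fds(&mut self, fds: Vec<i32>) -> Result<(), String> {
        let mut it = fds.into_iter();
        for zone in &mut self.memory_zones {
            zone.fd = Some(it.next().ok_or("not enough FDs for memory zones")?);
        }
        for net in &mut self.net {
            // One FD per net device in this toy model; real devices may take several.
            net.fds = Some(vec![it.next().ok_or("not enough FDs for net devices")?]);
        }
        if it.next().is_some() {
            return Err("too many FDs supplied".into());
        }
        Ok(())
    }
}

fn main() {
    let mut cfg = VmConfigSketch {
        net: vec![NetSketch { fds: None }],
        memory_zones: vec![MemZoneSketch { fd: None }],
    };
    cfg.apply_external_fds(vec![40, 41]).unwrap();
    assert_eq!(cfg.memory_zones[0].fd, Some(40));
    assert_eq!(cfg.net[0].fds, Some(vec![41]));
}
```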

@posk-io (Contributor, Author) commented Oct 1, 2025

I'm OK with any design that is generic enough to allow adding more "FD-enabled" configuration parameters in the future.

@rbradford (Member)

@rbradford Rob, how would you like to proceed here? What do you think of this approach:

It seems that all, or most, things that may have an FD attached to them have string tags/names/IDs attached to them as well ("tap0", "mem0", "disk0", etc.). So we may postulate that any entity that may get an external FD attached to it must have an ID, and that any API call that sends an array/vector of FDs must also contain an array/vector of strings that uniquely assign them to devices (e.g. ["tap0", "tap1", "mem0", "mem1"]). This array of strings/IDs must have the same length as the array/vector of incoming FDs. This will remove any ambiguity re: which FD belongs to which "device".

Yes, that might be a great solution: provide a list of IDs and then provide a list of FDs; check the lengths and then match them up. And we pick predictable names if the user does not provide them.

I've asked @phip1611 to work with you to come to a scalable solution that handles the existing use cases and, hopefully, many future ones.

@posk-io (Contributor, Author) commented Oct 2, 2025

[...]
Yes, that might be a great solution: provide a list of IDs and then provide a list of FDs; check the lengths and then match them up. And we pick predictable names if the user does not provide them.

I've asked @phip1611 to work with you to come to a scalable solution that handles the existing use cases and, hopefully, many future ones.

Thanks, Rob! @phip1611 Philipp: would you like to take a stab at this, or would you prefer me doing it? I can probably have something ready for review by the end of next week, but unlikely much earlier.

@phip1611 (Member) commented Oct 6, 2025

Hey, sorry for the late reply!

I've asked @phip1611 to work with you to come to a scalable solution that handles the existing use cases and, hopefully, many future ones. - @rbradford

I'm very happy to assist and support here!

Thanks, Rob! @phip1611 Philipp: would you like to take a stab at this, or would you prefer me doing it? I can probably have something ready for review by the end of next week, but unlikely much earlier. - @posk-io

I’ll be away for the next two weeks due to the EuroRust conference in Paris and a short vacation afterwards, but I’ll still have time for code reviews. I'll return to work in week 43.

would you like to take a stab at this

I'd highly appreciate it if you could invest time here and take the lead, at least for the next two weeks!

@posk-io (Contributor, Author) commented Oct 10, 2025

Update: I believe I have everything working; I just need to add/run more tests and clean up the code a bit. I will probably update the PR early/mid next week.

@posk-io force-pushed the memfd-02 branch 2 times, most recently from 210bea9 to 5e636e4 on October 14, 2025 16:57
@posk-io (Contributor, Author) commented Oct 14, 2025

@rbradford @phip1611 Please have a look and let me know what you think. The PR probably needs more tests, which I will happily add once the overall approach is agreed upon.

One concern I have is that some old configs won't work now. Is this an issue? If so, how should it be addressed?

Live Update Orchestrator (LUO) is an ongoing Linux Kernel
effort to minimize guest VM disruptions during host kernel
kexec (~reboot):

https://lore.kernel.org/all/20250807014442.3829950-1-pasha.tatashin@soleen.com/

One of the components of LUO is the ability to keep VM guest
memory "untouched" during host kernel reboot:

https://lore.kernel.org/all/20250807014442.3829950-30-pasha.tatashin@soleen.com/

Keeping guest memory "live" across host kernel reboots
minimizes the time it takes to snapshot/restore a VM,
especially with large-memory guests (~TB of RAM).

This commit builds on the existing Cloud Hypervisor ability
to save/restore guest memory backed by PMEM (files); it is mostly
a config/API change.

Signed-off-by: Peter Oskolkov <posk@google.com>
@phip1611 (Member) commented Oct 15, 2025

@posk-io Sorry, my time for reviewing code over the past few days was more limited than expected. I can do a meaningful review starting from Monday but I'll take a quick look just now as well.

One concern I have is that some old configs won't work now. Is this an issue? If so, how should it be addressed?

My first intuition is that this is okay. Apart from ch-remote, I'm not aware of any management software out there in use (except for our libvirt fork where we upstream patches one by one).

@phip1611 (Member) left a comment

Could you please move the functionality of pub struct ExternalFdsConfig et al. at least to a dedicated commit?

}
}

pub fn consume_fds(&mut self, fds: Vec<i32>) -> Result<(), PayloadConfigError> {
Member

Vec<RawFd>

Contributor Author

Configs store FDs as i32, so I'm not sure why RawFds should be passed here...
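
For reference, std's RawFd is only a type alias (pub type RawFd = c_int, i.e. i32 on Linux), so switching the signature would be about stating intent rather than changing the type. A small sketch with a hypothetical function name:

```rust
use std::os::unix::io::RawFd;

// `Vec<RawFd>` and `Vec<i32>` are the same type on Linux; using RawFd in the
// signature just documents that these integers are file descriptors.
fn consume_fds_sketch(fds: Vec<RawFd>) -> usize {
    fds.len()
}

fn main() {
    let fds: Vec<i32> = vec![3, 4, 5];
    // No conversion needed: a Vec<i32> already is a Vec<RawFd>.
    assert_eq!(consume_fds_sketch(fds), 3);
}
```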

@rbradford (Member) left a comment

I think this first commit should be split so that the external_fds handling is added first and then memory backed by FDs is added as a later commit.

@posk-io (Contributor, Author) commented Oct 22, 2025

I think this first commit should be split so that the external_fds handling is added first and then memory backed by FDs is added as a later commit.

Will do, thanks! I'm glad that's the only feedback ;)

P.S. I assume you mean splitting the first commit not just into two commits in the same PR (this one), but into two separate PRs; so I'll mark this PR as WIP/Draft and keep it around for memory zone FDs, and prepare a separate config refactoring PR.

@posk-io marked this pull request as draft on October 22, 2025 19:21
@posk-io changed the title from "vmm: allow configuring memory zones with memory FDs" to "[WIP] vmm: allow configuring memory zones with memory FDs" on Oct 22, 2025
posk-io added a commit to posk-io/cloud-hypervisor that referenced this pull request Oct 23, 2025
As discussed in PR cloud-hypervisor#7377,
Cloud Hypervisor may potentially accept different types of FDs on
VM creation and/or resuming:
- FDs for net devices (currently supported);
- FDs for pmem devices for memory zones (WIP);
- FDs for storage devices, etc.

As FDs are passed as an array of anonymous numbers over UDS sockets,
there should be a way to assign them to the devices they belong to.

This PR introduces ExternalFdsConfig in vm_config.rs, which
matches, positionally, the list of device IDs/names to the list
of FDs.

Signed-off-by: Peter Oskolkov <posk@google.com>
@posk-io (Contributor, Author) commented Oct 23, 2025

[...] prepare a separate config refactoring PR.

Here it is: #7431
