Conversation

@posk-io (Contributor) commented Sep 24, 2025

Live Update Orchestrator (LUO) is an ongoing Linux Kernel
effort to minimize guest VM disruptions during host kernel
kexec (~reboot):

https://lore.kernel.org/all/20250807014442.3829950-1-pasha.tatashin@soleen.com/

One of the components of LUO is the ability to keep VM guest
memory "untouched" during host kernel reboot:

https://lore.kernel.org/all/20250807014442.3829950-30-pasha.tatashin@soleen.com/

Keeping guest memory "live" across host kernel reboots
minimizes the time it takes to snapshot/restore a VM,
especially with large-memory guests (~TB of RAM).

This PR builds on the existing Cloud Hypervisor ability
to save/restore guest memory backed by PMEM (files); it is mostly
a config/API change.

Signed-off-by: Peter Oskolkov <posk@google.com>

@phip1611 (Member) left a comment

I like the direction this is going in, but I see several things we need to discuss and/or address first. Major concern: #7377 (comment)

let mut restore_cfg: RestoreConfig = serde_json::from_slice(body.raw())?;

// First, consume mem FDs, if any.
let expected_mem_fds = restore_cfg.mem_fds.as_ref().map_or(0, |fds| fds.len());
@phip1611 (Member) Sep 25, 2025

We need to be very careful here and I'd like to discuss this first! We can for sure structure the API in a way where we promise:

External FDs must be provided in the following order: `mem_fds`, `net_fds`[, `blk_fds`].

But once we do this and management software such as libvirt (which we at Cyberus are working on) relies on it, we can never go back.

Does anyone have ideas on how we can make this decision, and its enforcement, clearer? It must also be documented somewhere!

@rbradford (Member) Sep 25, 2025

Maybe we should add extra state to the restore command that provides an ordered set of (type, i.e. what the FDs are for; number of FDs). Then we can do extra validation and checks, allowing us to be flexible in the ordering if necessary.
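
A minimal sketch of that idea, assuming hypothetical names (FdKind, FdLayout) that are not in the tree, and that serde/serde_json are available:

```rust
use serde::{Deserialize, Serialize};

// The request would carry an ordered list describing what the incoming FDs are
// for, so the handler can validate counts before assigning anything.
#[derive(Clone, Copy, Debug, Deserialize, Serialize)]
#[serde(rename_all = "snake_case")]
enum FdKind {
    Mem,
    Net,
    Blk,
}

#[derive(Clone, Debug, Deserialize, Serialize)]
struct FdLayout {
    /// Ordered: the n-th group of FDs on the socket belongs to the n-th entry.
    groups: Vec<(FdKind, usize)>,
}

impl FdLayout {
    /// Total number of FDs the caller promises to send.
    fn total(&self) -> usize {
        self.groups.iter().map(|(_, n)| *n).sum()
    }
}

fn main() {
    let layout = FdLayout {
        groups: vec![(FdKind::Mem, 2), (FdKind::Net, 1)],
    };
    // The handler would compare this against the number of FDs actually received.
    assert_eq!(layout.total(), 3);
    println!("{}", serde_json::to_string(&layout).unwrap());
}
```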

Member

Whatever we end up with, it has to work at least in the following cases:

  • CreateVm
  • RestoreVm
  • ReceiveLiveMigration

All of these may receive FDs for various components at once, unlike API calls such as AddNet.

Contributor Author

We can have something like

struct FdConfig {
    mem_fds: usize,
    net_fds: usize,
}

and add fd_config: Option<FdConfig> as a top-level field to VmConfig and others. WDYT?
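
To make that concrete, a rough sketch (simplified and hypothetical; the real VmConfig obviously has many more fields):

```rust
// Count-based descriptor hanging off the top-level config.
struct FdConfig {
    mem_fds: usize,
    net_fds: usize,
}

struct VmConfig {
    // ...existing fields elided...
    fd_config: Option<FdConfig>,
}

// A handler could then validate up front how many FDs it expects on the socket.
fn expected_fd_count(cfg: &VmConfig) -> usize {
    cfg.fd_config.as_ref().map_or(0, |f| f.mem_fds + f.net_fds)
}

fn main() {
    let cfg = VmConfig {
        fd_config: Some(FdConfig { mem_fds: 1, net_fds: 2 }),
    };
    assert_eq!(expected_fd_count(&cfg), 3);
}
```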

Contributor Author

Alternatively, configs could contain FD indices, e.g. MemoryZone { size: 515M, fd_idx: 3 }, which would indicate that the config will be accompanied by FDs, and that fd_vec[3] will contain the FD for this MemoryZone. At the moment configs are supposed to have placeholder "-1" values when coming from the user; if they instead contained FD indices, with the FD vector attached at the API receiving end, that would disambiguate things, right?
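
A sketch of that index-based alternative, with illustrative field names and values (not the actual config structs):

```rust
#[derive(Debug)]
struct MemoryZoneConfig {
    size: u64,
    /// Index into the FD vector sent alongside the API request, if any.
    fd_idx: Option<usize>,
}

// Resolve a zone's FD from the vector received over the socket. A None here
// means either "no external FD" or an out-of-range index (the real code would
// report the latter as an error).
fn resolve_fd(zone: &MemoryZoneConfig, fd_vec: &[i32]) -> Option<i32> {
    zone.fd_idx.and_then(|i| fd_vec.get(i).copied())
}

fn main() {
    let fd_vec = vec![10, 11, 12, 13];
    let zone = MemoryZoneConfig { size: 512 << 20, fd_idx: Some(3) };
    println!("{zone:?} -> {:?}", resolve_fd(&zone, &fd_vec));
    assert_eq!(resolve_fd(&zone, &fd_vec), Some(13));
}
```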

@phip1611 (Member) Sep 25, 2025

and add fd_config: Option<FdConfig> as a top-level field to VmConfig and others. WDYT?

This would work for creating a VM. But what about state save/resume (save/restore) and receive-migration? I'd love it if we found a solution that handles this complexity in all cases, across all config objects that may now or in the future take external FDs from an API call. Ideally not with a dozen Option<SomeFdConfigType> fields. But right now I can't think of an easy solution either without thinking about this longer.

// First, consume mem FDs, if any.
let expected_mem_fds = restore_cfg.mem_fds.as_ref().map_or(0, |fds| fds.len());

if files.len() < expected_mem_fds {
Member

It looks to me like you're replicating logic here that is supposed to be abstracted behind attach_fds_to_cfgs. The idea of that abstraction was exactly to not have these kinds of statements in the handler.

#[serde(
default,
serialize_with = "serialize_memory_zone_fd",
deserialize_with = "deserialize_memory_zone_fd"
Member

I think in #7281 we decided against a special serialize path. See 666444d.

I'd love to have it consistent here as well.

Contributor Author

I think in #7281 we decided against a special serialize path. See 666444d.

I'd love to have it consistent here as well.

Again, I'm not sure I understand your concern: there already are (de)serialize_netconfig_fds functions; I'm adding (de)serialize_memory_zone_fd in this patch/PR specifically to be consistent with what the existing code does. What is not consistent? What am I missing?

Member

If I'm not mistaken and am not missing something, my intuition is: if you dropped serialize_with = "serialize_memory_zone_fd", the VmInfo API call would display the FD actually in use rather than a "-1" value, similar to how we handle FDs for virtio-net devices.
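
A self-contained sketch of that suggestion; the deserializer body below is only a stand-in (the real serialize_memory_zone_fd/deserialize_memory_zone_fd helpers from the diff may behave differently):

```rust
use serde::{Deserialize, Deserializer, Serialize};

// Stand-in for the helper from the diff above: ignore whatever the user sent,
// so an FD can never be injected via plain JSON.
fn deserialize_memory_zone_fd<'de, D: Deserializer<'de>>(d: D) -> Result<Option<i32>, D::Error> {
    let _ = Option::<i32>::deserialize(d)?;
    Ok(None)
}

#[derive(Serialize, Deserialize, Debug)]
struct MemoryZoneConfigSketch {
    // Only a custom deserializer, no `serialize_with`: serializing this struct
    // (e.g. for VmInfo) then shows the FD actually in use, not a placeholder.
    #[serde(default, deserialize_with = "deserialize_memory_zone_fd")]
    fd: Option<i32>,
}

fn main() {
    let mut cfg: MemoryZoneConfigSketch = serde_json::from_str(r#"{"fd": 99}"#).unwrap();
    cfg.fd = Some(7); // FD attached internally after being received over the socket
    println!("{}", serde_json::to_string(&cfg).unwrap()); // prints {"fd":7}
}
```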

}
}

impl ConfigWithFDs for VmConfig {
Member

This puzzles me a bit. The trait was supposed to be used for MemoryZoneConfig or similar, i.e., the "leaf config objects", not the main object.

If the original design proposed in #7281 was not mature enough, should we change it again? I think this PR adds inconsistencies in its current state, and that concerns me.

Contributor Author

The trait was supposed to be used for MemoryZoneConfig or similar, i.e., the "leaf config objects", not the main object.

I don't fully understand your concern here. Are you suggesting that MemoryZoneConfig should implement the trait, not the top-level config?

If so, I don't understand the need for this trait at all, as the trait is defined in this module (http_endpoint), implemented in this module, and used in this module. Why is it needed at all? To me, traits are useful as abstractions to facilitate interop across modules/crates, but here it doesn't seem to be doing anything?

If it were applicable to any config, including the top-level one, it could be argued to be useful, as it would hide the implementation details of "consuming" FDs; but if it is only to be applied to "leaf config objects", what's the purpose of the trait?

@phip1611 (Member) Sep 25, 2025

I understand your confusion and want to try to clarify things. I'm not saying the solution we introduced in #7281 is the best design one can have for this, but it was a pragmatic solution to simplify the handling of config objects that want external FDs in http_endpoints.rs, and it enables additions such as #7379, which is pretty cool.

The idea is that every handler in http_endpoint.rs that receives external FDs uses attach_fds_to_cfg(). To be able to use that, you have to implement ConfigWithFDs for every config struct you are dealing with. This is at least what worked well for us for state save/resume, hot attach, and live migration with virtio-net devices that have external FDs.
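
For readers without #7281 in front of them, a rough, self-contained sketch of that pattern (the actual trait and helper in the tree may differ in signatures and error types):

```rust
trait ConfigWithFDs {
    /// How many external FDs this config expects.
    fn fd_count(&self) -> usize;
    /// Attach the FDs that were received over the API socket.
    fn attach_fds(&mut self, fds: Vec<i32>);
}

/// Generic helper a handler calls instead of open-coding length checks;
/// it returns the FDs left over for the next FD-taking config.
fn attach_fds_to_cfg<C: ConfigWithFDs>(cfg: &mut C, mut files: Vec<i32>) -> Result<Vec<i32>, String> {
    let want = cfg.fd_count();
    if files.len() < want {
        return Err(format!("expected {} FDs, got {}", want, files.len()));
    }
    let rest = files.split_off(want);
    cfg.attach_fds(files);
    Ok(rest)
}

struct NetConfigSketch {
    fds: Option<Vec<i32>>,
}

impl ConfigWithFDs for NetConfigSketch {
    fn fd_count(&self) -> usize {
        self.fds.as_ref().map_or(0, |f| f.len())
    }
    fn attach_fds(&mut self, fds: Vec<i32>) {
        self.fds = Some(fds);
    }
}

fn main() {
    // Placeholders ("-1") came in via JSON; the real FDs arrive over the socket.
    let mut net = NetConfigSketch { fds: Some(vec![-1, -1]) };
    let leftover = attach_fds_to_cfg(&mut net, vec![33, 34]).unwrap();
    assert!(leftover.is_empty());
    assert_eq!(net.fds, Some(vec![33, 34]));
}
```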

@phip1611 (Member) commented Sep 25, 2025

I don't want to slow things down here, and I'm doing my best to work with you proactively, but the more I think about it, the more I'm genuinely concerned that we first need some serious design thinking about all current and future users/config objects that want external FDs, and that we need a solution that fulfills all of the following properties:

  • API events:
    • VM creation
    • device adding/removing (hotplug)
    • live-migration
    • save/restore
  • Code quality:
    • do not have many corner cases and dozens of runtime if/else checks or checks on Options
    • prevent too many config objects / code duplication
  • API Stability
    • For API calls that take multiple FDs for different entities (e.g., net and mem), we need to agree on an order that is either configurable, or we must ensure that we never break it in the future

I think what we did in #7281 is partly a step in that direction, but we might also need to refactor and adapt it. We could discuss things in person in an upcoming Cloud Hypervisor community call.

@posk-io (Contributor, Author) commented Sep 25, 2025

I don't want to slow things down here, and I'm doing my best to work with you proactively, but the more I think about it, the more I'm genuinely concerned that we first need some serious design thinking about all current and future users/config objects that want external FDs, and that we need a solution that fulfills all of the following properties:

I agree that we need to do it properly; I'm not that worried, though, as it looks like it is more of a configuration/API issue, not a core runtime behavior. Even if we get it wrong at first, API deprecation/versioning is quite normal these days.

Another idea: it seems that all, or most, things that may have an FD attached to them have string tags/names/IDs attached to them as well ("tap0", "mem0", "disk0", etc.). So we may postulate that any entity that may get an external FD attached to it must have an ID, and that any API call that sends an array/vector of FDs must also contain an array/vector of strings that uniquely assign them to devices (e.g. ["tap0", "tap1", "mem0", "mem1"]). This array of strings/IDs must have the same length as the array/vector of incoming FDs. This will remove any ambiguity re: which FD belongs to which "device".
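
A minimal sketch of that pairing rule, assuming parallel ids/fds vectors on the request (names here are illustrative, not an existing API):

```rust
use std::collections::HashMap;

// Reject length mismatches and duplicate IDs, then build an unambiguous
// id -> fd map that the rest of the config handling can consume.
fn match_fds_to_ids(ids: &[String], fds: &[i32]) -> Result<HashMap<String, i32>, String> {
    if ids.len() != fds.len() {
        return Err(format!("got {} IDs but {} FDs", ids.len(), fds.len()));
    }
    let mut map = HashMap::with_capacity(ids.len());
    for (id, fd) in ids.iter().zip(fds) {
        if map.insert(id.clone(), *fd).is_some() {
            return Err(format!("duplicate ID '{id}'"));
        }
    }
    Ok(map)
}

fn main() {
    let ids = ["tap0", "tap1", "mem0"].map(String::from);
    let fds = [33, 34, 35];
    let map = match_fds_to_ids(&ids, &fds).unwrap();
    assert_eq!(map["mem0"], 35);
}
```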

@posk-io (Contributor, Author) commented Sep 30, 2025

@rbradford Rob, how would you like to proceed here? What do you think of this approach:

It seems that all, or most, things that may have an FD attached to them have string tags/names/IDs attached to them as well ("tap0", "mem0", "disk0", etc.). So we may postulate that any entity that may get an external FD attached to it must have an ID, and that any API call that sends an array/vector of FDs must also contain an array/vector of strings that uniquely assign them to devices (e.g. ["tap0", "tap1", "mem0", "mem1"]). This array of strings/IDs must have the same length as the array/vector of incoming FDs. This will remove any ambiguity re: which FD belongs to which "device".

@phip1611 (Member) commented Oct 1, 2025

Do we want to discuss this in tomorrow's community call? I plan to attend.

My current rough idea is to:

  • as @rbradford proposed, have a generic struct carrying all the FD information
  • we would therefore drop the FD from the restored net config struct and similar places, and not allow this in future work either
  • the VmConfig will get a function to apply the FDs onto itself in the right order (see the sketch below)

Not sure how compatible this is with the current code. This would somehow make #7281 obsolete, which is fine, as it would be a better design.
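
A rough sketch of what that third point could look like; all type and field names below are illustrative, not the real config structs:

```rust
struct NetSketch { fds: Option<Vec<i32>> }
struct MemZoneSketch { fd: Option<i32> }

// Top-level config owning the single, well-defined order in which externally
// received FDs are applied to its sub-configs.
struct VmConfigSketch {
    net: Vec<NetSketch>,
    memory_zones: Vec<MemZoneSketch>,
}

impl VmConfigSketch {
    /// Apply FDs in a fixed order: memory zones first, then net devices.
    fn apply_external_fds(&mut self, fds: Vec<i32>) -> Result<(), String> {
        let mut it = fds.into_iter();
        for zone in &mut self.memory_zones {
            zone.fd = Some(it.next().ok_or("not enough FDs for memory zones")?);
        }
        for net in &mut self.net {
            // One FD per net device in this toy model; real devices may take several.
            net.fds = Some(vec![it.next().ok_or("not enough FDs for net devices")?]);
        }
        if it.next().is_some() {
            return Err("too many FDs supplied".into());
        }
        Ok(())
    }
}

fn main() {
    let mut cfg = VmConfigSketch {
        net: vec![NetSketch { fds: None }],
        memory_zones: vec![MemZoneSketch { fd: None }],
    };
    cfg.apply_external_fds(vec![40, 41]).unwrap();
    assert_eq!(cfg.memory_zones[0].fd, Some(40));
    assert_eq!(cfg.net[0].fds, Some(vec![41]));
}
```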

@posk-io (Contributor, Author) commented Oct 1, 2025

I'm OK with any design that is generic enough to allow adding more "FD-enabled" configuration parameters in the future.

@rbradford (Member)

@rbradford Rob, how would you like to proceed here? What do you think of this approach:

It seems that all, or most, things that may have an FD attached to them have string tags/names/IDs attached to them as well ("tap0", "mem0", "disk0", etc.). So we may postulate that any entity that may get an external FD attached to it must have an ID, and that any API call that sends an array/vector of FDs must also contain an array/vector of strings that uniquely assign them to devices (e.g. ["tap0", "tap1", "mem0", "mem1"]). This array of strings/IDs must have the same length as the array/vector of incoming FDs. This will remove any ambiguity re: which FD belongs to which "device".

Yes, that might be a great solution: provide a list of IDs and then provide a list of FDs; check the lengths and then match them up. And we pick predictable names if the user does not provide them.

I've asked @phip1611 to work with you to come to a scalable solution that handles the existing use cases and, hopefully, many future ones.

@posk-io (Contributor, Author) commented Oct 2, 2025

[...]
Yes, that might be a great solution: provide a list of IDs and then provide a list of FDs; check the lengths and then match them up. And we pick predictable names if the user does not provide them.

I've asked @phip1611 to work with you to come to a scalable solution that handles the existing use cases and, hopefully, many future ones.

Thanks, Rob! @phip1611 Philipp: would you like to take a stab at this, or would you prefer me doing it? I can probably have something ready for review by the end of next week, but unlikely much earlier.

@phip1611 (Member) commented Oct 6, 2025

Hey, sorry for the late reply!

I've asked @phip1611 to work with you to come to a scalable solution that handles the existing use cases and, hopefully, many future ones. - @rbradford

I'm very happy to assist and support here!

Thanks, Rob! @phip1611 Philipp: would you like to take a stab at this, or would you prefer me doing it? I can probably have something ready for review by the end of next week, but unlikely much earlier. - @posk-io

I’ll be away for the next two weeks due to the EuroRust conference in Paris and a short vacation afterwards, but I’ll still have time for code reviews. I'll return to work in week 43.

would you like to take a stab at this

I'd highly appreciate it if you could invest time here and take the lead, at least for the next two weeks!

@posk-io (Contributor, Author) commented Oct 10, 2025

Update: I believe I have everything working; I just need to add/run more tests and clean up the code a bit. I will probably update the PR early/mid next week.

@posk-io force-pushed the memfd-02 branch 2 times, most recently from 210bea9 to 5e636e4 on October 14, 2025 16:57
@posk-io (Contributor, Author) commented Oct 14, 2025

@rbradford @phip1611 Please have a look and let me know what you think. The PR probably needs more tests, which I will happily add once the overall approach is agreed upon.

One concern I have is that some old configs won't work now. Is this an issue? If so, how should it be addressed?

Live Update Orchestrator (LUO) is an ongoing Linux Kernel
effort to minimize guest VM disruptions during host kernel
kexec (~reboot):

https://lore.kernel.org/all/20250807014442.3829950-1-pasha.tatashin@soleen.com/

One of the components of LUO is the ability to keep VM guest
memory "untouched" during host kernel reboot:

https://lore.kernel.org/all/20250807014442.3829950-30-pasha.tatashin@soleen.com/

Keeping guest memory "live" across host kernel reboots
minimizes the time it takes to snapshot/restore a VM,
especially with large-memory guests (~TB of RAM).

This commit builds on the existing Cloud Hypervisor ability
to save/restore guest memory backed by PMEM (files); it is mostly
a config/API change.

Signed-off-by: Peter Oskolkov <posk@google.com>
@phip1611 (Member) commented Oct 15, 2025

@posk-io Sorry, my time for reviewing code over the past few days was more limited than expected. I can do a meaningful review starting from Monday but I'll take a quick look just now as well.

One concern I have is that some old configs won't work now. Is this an issue? If so, how should it be addressed?

My first intuition is that this is okay. Apart from ch-remote, I'm not aware of any management software out there in use (except for our libvirt fork where we upstream patches one by one).

@phip1611 (Member) left a comment

Could you please move the functionality of pub struct ExternalFdsConfig et al. at least to a dedicated commit?

}
}

pub fn consume_fds(&mut self, fds: Vec<i32>) -> Result<(), PayloadConfigError> {
Member

Vec<RawFd>

Contributor Author

Configs store FDs as i32, so I'm not sure why RawFds should be passed here...
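
For reference, std's RawFd is only a type alias (pub type RawFd = c_int, i.e. i32 on Linux), so switching the signature would be about stating intent rather than changing the type. A small sketch with a hypothetical function name:

```rust
use std::os::unix::io::RawFd;

// `Vec<RawFd>` and `Vec<i32>` are the same type on Linux; using RawFd in the
// signature just documents that these integers are file descriptors.
fn consume_fds_sketch(fds: Vec<RawFd>) -> usize {
    fds.len()
}

fn main() {
    let fds: Vec<i32> = vec![3, 4, 5];
    // No conversion needed: a Vec<i32> already is a Vec<RawFd>.
    assert_eq!(consume_fds_sketch(fds), 3);
}
```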

@rbradford (Member) left a comment

I think this first commit should be split so that the external_fds handling is added first and then memory backed by FDs is added as a later commit.

@posk-io (Contributor, Author) commented Oct 22, 2025

I think this first commit should be split so that the external_fds handling is added first and then memory backed by FDs is added as a later commit.

Will do, thanks! I'm glad that's the only feedback ;)

P.S. I assume you mean splitting the first commit not just into two commits in the same PR (this one), but into two separate PRs; so I'll mark this PR as WIP/Draft and keep it around for memory zone FDs, and prepare a separate config refactoring PR.

@posk-io marked this pull request as draft on October 22, 2025 19:21
@posk-io changed the title from "vmm: allow configuring memory zones with memory FDs" to "[WIP] vmm: allow configuring memory zones with memory FDs" on Oct 22, 2025
posk-io added a commit to posk-io/cloud-hypervisor that referenced this pull request Oct 23, 2025
As discussed in PR cloud-hypervisor#7377,
Cloud Hypervisor may potentially accept different types of FDs on
VM creation and/or resuming:
- FDs for net devices (currently supported);
- FDs for pmem devices for memory zones (WIP);
- FDs for storage devices, etc.

As FDs are passed as an array of anonymous numbers over UDS sockets,
there should be a way to assign them to the devices they belong to.

This PR introduces ExternalFdsConfig in vm_config.rs, which
matches, positionally, the list of device IDs/names to the list
of FDs.

Signed-off-by: Peter Oskolkov <posk@google.com>
@posk-io (Contributor, Author) commented Oct 23, 2025

[...] prepare a separate config refactoring PR.

Here it is: #7431
