[WIP] vmm: allow configuring memory zones with memory FDs #7377

Conversation
I like the direction this is going in, but I see several things we need to discuss and/or address first. Major concern: #7377 (comment)
vmm/src/api/http/http_endpoint.rs (Outdated)

    let mut restore_cfg: RestoreConfig = serde_json::from_slice(body.raw())?;

    // First, consume mem FDs, if any.
    let expected_mem_fds = restore_cfg.mem_fds.as_ref().map_or(0, |fds| fds.len());
We need to be very careful here and I'd like to discuss this first! We can certainly structure the API in a way that promises:
Externally, FDs must be provided in the following order: mem_fds, net_fds[, blk_fds].
But once we do this and management software such as libvirt (which we at Cyberus are working on) relies on it, we can never go back.
Does anyone have ideas on how we can make this decision, and its enforcement, clearer? It must also be documented somewhere!
Maybe we should add extra state to the restore command that provides an ordered set of (type, i.e. what the FDs are for; number of FDs). Then we can do extra validation and checks, allowing us to be flexible in the ordering if necessary.
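For illustration only, a minimal sketch of what such "extra state" could look like; the type names (FdPurpose, FdLayout) and the shape of the validation are hypothetical, not existing Cloud Hypervisor code:

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical: what a batch of incoming FDs is meant for.
#[derive(Clone, Copy, Debug, Deserialize, Serialize)]
enum FdPurpose {
    Mem,
    Net,
    Blk,
}

/// Hypothetical: ordered description of the FD stream sent alongside a request,
/// e.g. [(Mem, 2), (Net, 1)] means "the first two FDs are for memory zones,
/// the next one is for a net device".
#[derive(Clone, Debug, Deserialize, Serialize)]
struct FdLayout(Vec<(FdPurpose, usize)>);

impl FdLayout {
    /// Check that the number of FDs actually received matches the declared layout.
    fn validate(&self, received: usize) -> Result<(), String> {
        let expected: usize = self.0.iter().map(|(_, n)| n).sum();
        if expected == received {
            Ok(())
        } else {
            Err(format!("expected {expected} FDs, received {received}"))
        }
    }
}
```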
Whatever we end up with, it has to work in at least the following cases:
- CreateVm
- RestoreVm
- ReceiveLiveMigration
All of these can receive FDs for various components at once, unlike API calls such as AddNet.
We can have something like
struct FdConfig {
mem_fds: usize,
net_fds: usize,
}
and add fd_config: Option<FdConfig> as a top-level field to VmConfig and others. WDYT?
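A rough sketch of how such counters could be used to split the incoming FD vector, assuming the promised mem-then-net ordering; VmConfig is heavily simplified here and split_fds is a hypothetical helper:

```rust
// Hypothetical sketch only; the real VmConfig has many more fields.
struct FdConfig {
    mem_fds: usize,
    net_fds: usize,
}

struct VmConfig {
    // ...existing fields elided...
    fd_config: Option<FdConfig>,
}

/// Split the flat FD vector into (mem FDs, net FDs) according to the declared counts.
fn split_fds(cfg: &FdConfig, mut fds: Vec<i32>) -> Result<(Vec<i32>, Vec<i32>), String> {
    if fds.len() != cfg.mem_fds + cfg.net_fds {
        return Err(format!(
            "expected {} FDs, got {}",
            cfg.mem_fds + cfg.net_fds,
            fds.len()
        ));
    }
    // FDs arrive in the promised order: mem first, then net.
    let net_fds = fds.split_off(cfg.mem_fds);
    Ok((fds, net_fds))
}
```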
Alternatively, configs could contain FD indices, e.g. MemoryZone { size: 515M, fd_idx: 3 }, which would indicate that the config will be accompanied by FDs and that fd_vec[3] will contain the FD for this MemoryZone. At the moment configs are supposed to have placeholder "-1" values when coming from the user; if they instead contained FD indices, with the FD vector attached at the API receiving end, that would disambiguate things, right?
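A hedged sketch of the fd_idx idea; the struct and helper below are illustrative and do not reflect the actual MemoryZoneConfig definition:

```rust
/// Illustrative only: a config entry carries an index into the FD vector that
/// accompanies the API request, instead of a placeholder "-1" FD value.
struct MemoryZoneSketch {
    id: String,
    size: u64,
    fd_idx: Option<usize>, // index into the FD vector attached at the receiving end
}

/// Resolve the zone's FD index against the FD vector received with the request.
fn resolve_fd(zone: &MemoryZoneSketch, fd_vec: &[i32]) -> Result<Option<i32>, String> {
    match zone.fd_idx {
        None => Ok(None),
        Some(idx) => fd_vec
            .get(idx)
            .copied()
            .map(Some)
            .ok_or_else(|| format!("fd_idx {idx} out of range for zone {}", zone.id)),
    }
}
```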
and add fd_config: Option<FdConfig> as a top-level field to VmConfig and others. WDYT?

This would work for creating a VM. But what about state save/resume (save/restore) and receive-migration? I'd love it if we found a solution that handles this complexity in all cases, across all config objects that may now or in the future take external FDs from an API call. Ideally not with a dozen Option<SomeFdConfigType> fields. But right now I also can't think of an easy solution without thinking about this longer.
vmm/src/api/http/http_endpoint.rs (Outdated)

    // First, consume mem FDs, if any.
    let expected_mem_fds = restore_cfg.mem_fds.as_ref().map_or(0, |fds| fds.len());

    if files.len() < expected_mem_fds {
It looks to me like you are replicating logic here that is supposed to be abstracted behind attach_fds_to_cfgs. The whole idea of that abstraction was precisely to avoid these kinds of statements in the handler.
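To make the point concrete, a hedged sketch of the intended shape; RestoreConfigSketch and attach_fds_to_cfg_sketch are stand-ins, and the real attach_fds_to_cfgs signature in the tree may differ:

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct RestoreConfigSketch {
    source_url: String,
    mem_fds: Option<Vec<i32>>,
}

/// Hypothetical stand-in for the shared abstraction: it owns all of the
/// "how many FDs did we expect vs. receive" logic.
fn attach_fds_to_cfg_sketch(cfg: &mut RestoreConfigSketch, files: Vec<i32>) -> Result<(), String> {
    let expected = cfg.mem_fds.as_ref().map_or(0, |fds| fds.len());
    if files.len() != expected {
        return Err(format!("expected {expected} FDs, received {}", files.len()));
    }
    if let Some(fds) = cfg.mem_fds.as_mut() {
        *fds = files;
    }
    Ok(())
}

/// The HTTP handler itself then stays free of FD-counting statements.
fn restore_handler_sketch(body: &[u8], files: Vec<i32>) -> Result<RestoreConfigSketch, String> {
    let mut cfg: RestoreConfigSketch = serde_json::from_slice(body).map_err(|e| e.to_string())?;
    attach_fds_to_cfg_sketch(&mut cfg, files)?;
    Ok(cfg)
}
```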
vmm/src/vm_config.rs (Outdated)

    #[serde(
        default,
        serialize_with = "serialize_memory_zone_fd",
        deserialize_with = "deserialize_memory_zone_fd"
I think in #7281 we decided against a special serialize path. See 666444d.
I'd love to have it consistent here as well.
Again, I'm not sure I understand your concern: there are already (de)serialize_netconfig_fds functions; I'm adding (de)serialize_memory_zone_fd in this patch/PR specifically to be consistent with what the current code is doing. What is not consistent? What am I missing?
If I'm not mistaken and not missing something, my intuition here is: if you drop serialize_with = "serialize_memory_zone_fd", the VmInfo API call would display the actually used FD and not a "-1" value, similar to how we handle FDs for virtio-net devices.
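For context, the kind of serializer under discussion; this is a minimal sketch, not the actual serialize_memory_zone_fd implementation: keeping such a serializer makes VmInfo always report a placeholder, while dropping it would expose the FD actually in use.

```rust
use serde::Serializer;

/// Sketch: a serializer that always reports -1 instead of the live FD, so the
/// API output never shows the process-local descriptor number.
fn serialize_fd_as_placeholder<S>(_fd: &i32, serializer: S) -> Result<S::Ok, S::Error>
where
    S: Serializer,
{
    serializer.serialize_i32(-1)
}
```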
vmm/src/api/http/http_endpoint.rs (Outdated)

    }
    }

    impl ConfigWithFDs for VmConfig {
This puzzles me a bit. This was supposed to be used for MemoryZoneConfig or so, i.e., the "leaf config objects", not the main object.
If the original design proposed in #7281 was not mature enough, should we change it again? I'm concerned that this PR adds inconsistencies in its current state.
This was supposed to be used for MemoryZoneConfig or so, i.e., the "leaf config objects", not the main object.
I don't fully understand your concern here. Are you suggesting that MemoryZoneConfig should implement the trait, not the top-level config?
If so, I don't understand the need for this trait at all: the trait is defined in this module (http_endpoint), implemented in this module, and used in this module. Why is it needed at all? To me, traits are useful as abstractions to facilitate interop across modules/crates, but here it doesn't seem to be doing anything.
If it were applicable to any config, including the top-level one, it could be argued to be useful, as it would hide the implementation details of "consuming" FDs; but if it is only to be applied to "leaf config objects", what's the purpose of the trait?
I understand your confusion and I want to try to clarify things. I'm not saying the solution we introduced in #7281 is the best design one can have for this, but it was a pragmatic solution to simplify the handling of config objects that want external FDs in http_endpoint.rs, and it enables additions such as #7379, which is pretty cool.
The idea is that every handler in http_endpoint.rs that receives external FDs uses attach_fds_to_cfg(). And to be able to use that, you have to implement ConfigWithFDs for every config struct you are dealing with. This, at least, is what worked well for us when it comes to state save/resume, hot attach, and live migration with virtio-net devices that use external FDs.
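For readers following along, a hedged sketch of that pattern; the actual ConfigWithFDs trait and attach_fds_to_cfg() in http_endpoint.rs may be shaped differently:

```rust
/// Sketch of the trait: each config that takes external FDs says how many it
/// expects and how to store them.
trait ConfigWithFdsSketch {
    fn expected_fds(&self) -> usize;
    fn consume_fds(&mut self, fds: Vec<i32>);
}

/// Sketch of the shared helper used by every handler that receives FDs.
fn attach_fds_generic_sketch<C: ConfigWithFdsSketch>(
    cfg: &mut C,
    fds: Vec<i32>,
) -> Result<(), String> {
    if fds.len() != cfg.expected_fds() {
        return Err(format!(
            "expected {} FDs, received {}",
            cfg.expected_fds(),
            fds.len()
        ));
    }
    cfg.consume_fds(fds);
    Ok(())
}
```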
I don't want to slow things down here, and I'm doing my best to work with you proactively, but the more I think about it, the more I'm genuinely concerned that we first need major design thinking about all current and future users/config objects that want external FDs, and a solution that fulfills all the necessary properties.
I think what we did in #7281 is a partial step in that direction, but we might also need to refactor and adapt it. We could discuss things in person in an upcoming Cloud Hypervisor community call.
I agree that we need to do it properly; I'm not that worried, though, as it looks like this is more of a configuration/API issue than a core runtime behavior. Even if we get it wrong at first, API deprecation/versioning is quite normal these days. Another idea: it seems that all, or most, things that may have an FD attached to them also have string tags/names/IDs attached to them ("tap0", "mem0", "disk0", etc.). So we may postulate that any entity that may get an external FD attached to it must have an ID, and that any API call that sends an array/vector of FDs must also contain an array/vector of strings that uniquely assign them to devices (e.g. ["tap0", "tap1", "mem0", "mem1"]). This array of strings/IDs must have the same length as the array/vector of incoming FDs. This removes any ambiguity about which FD belongs to which "device".
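A small sketch of the ID-based matching being proposed here; the function name and error handling are illustrative only:

```rust
use std::collections::HashMap;

/// Positionally pair a list of device IDs with the FDs received over the socket,
/// rejecting length mismatches and duplicate IDs.
fn match_fds_to_ids(ids: &[String], fds: &[i32]) -> Result<HashMap<String, i32>, String> {
    if ids.len() != fds.len() {
        return Err(format!(
            "got {} IDs but {} FDs; the two lists must have the same length",
            ids.len(),
            fds.len()
        ));
    }
    let mut map = HashMap::new();
    for (id, fd) in ids.iter().zip(fds.iter()) {
        if map.insert(id.clone(), *fd).is_some() {
            return Err(format!("duplicate device ID '{id}' in FD assignment"));
        }
    }
    Ok(map)
}
```

With ids = ["tap0", "tap1", "mem0", "mem1"] and four incoming FDs, each FD would be unambiguously assigned to its device.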
@rbradford Rob, how would you like to proceed here? What do you think of this approach:
Do we want to discuss this on tomorrow's community call? I plan to attend. My current rough idea is to:
Not sure how compatible this is with the current code. This would essentially make #7281 obsolete, which is fine, as it would be a better design.
I'm OK with any design that is generic enough to allow adding more "FD-enabled" configuration parameters in the future.
Yes, that might be a great solution: provide a list of IDs and then provide a list of FDs, check the lengths, and then match them up. And we do pick predictable names if the user does not provide them. I've asked @phip1611 to work with you to come to a scalable solution that handles the existing use cases and, hopefully, many future ones.
Thanks, Rob! @phip1611 Philipp: would you like to take a stab at this, or would you prefer that I do it? I can probably have something ready for review by the end of next week, but unlikely much earlier.
Hey, sorry for the late reply!
I'm very happy to assist and support here!
I’ll be away for the next two weeks due to the EuroRust conference in Paris and a short vacation afterwards, but I’ll still have time for code reviews. I'll return to work in week 43.
I'd highly appreciate it if you could invest time here and take the lead, at least for the next two weeks!
Update: I believe I have everything working; I just need to add/run more tests and clean up the code a bit. I will probably update the PR early/mid next week.
Force-pushed from 210bea9 to 5e636e4
@rbradford @phip1611 Please have a look and let me know what you think. The PR probably needs more tests, which I will happily add once the overall approach is agreed upon. One concern I have is that some old configs won't work now. Is this an issue? If so, how should it be addressed?
Live Update Orchestrator (LUO) is an ongoing Linux kernel effort to minimize guest VM disruptions during host kernel kexec (~reboot): https://lore.kernel.org/all/20250807014442.3829950-1-pasha.tatashin@soleen.com/

One of the components of LUO is the ability to keep VM guest memory "untouched" during a host kernel reboot: https://lore.kernel.org/all/20250807014442.3829950-30-pasha.tatashin@soleen.com/

Keeping guest memory "live" across host kernel reboots minimizes the time it takes to snapshot/restore a VM, especially with large-memory guests (~TB of RAM).

This commit builds on the existing Cloud Hypervisor ability to save/restore guest memory backed by PMEM (files); it is mostly a config/API change.

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
@posk-io Sorry, my time for reviewing code over the past few days was more limited than expected. I can do a meaningful review starting from Monday, but I'll take a quick look just now as well.
My first intuition is that this is okay. Apart from ch-remote, I'm not aware of any management software out there in use (except for our libvirt fork, where we upstream patches one by one).
Could you please move the functionality of pub struct ExternalFdsConfig et al. at least to a dedicated commit?
    }
    }

    pub fn consume_fds(&mut self, fds: Vec<i32>) -> Result<(), PayloadConfigError> {
Vec<RawFd>
Configs store FDs as i32, so I'm not sure why RawFds should be passed here...
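For reference, on Linux RawFd is just std's alias for c_int (i.e. i32), so accepting Vec<RawFd> would not change the representation, only document intent; a minimal illustration:

```rust
use std::os::fd::RawFd;

// std defines `pub type RawFd = c_int;`, which is i32 on Linux, so Vec<RawFd>
// and Vec<i32> are the same type in practice; RawFd just signals that the
// values are file descriptors.
fn consume_fds_sketch(fds: Vec<RawFd>) {
    for fd in fds {
        let as_i32: i32 = fd; // no conversion needed: RawFd is an alias of i32 here
        println!("received fd {as_i32}");
    }
}
```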
rbradford left a comment
I think this first commit should be split so that the external_fds handling is added first and then memory backed by FDs is added as a later commit.
Will do, thanks! I'm glad that's the only feedback ;) P.S. I assume you mean splitting the first commit not just into two commits in the same PR (this one), but into two separate PRs; so I'll mark this PR as WIP/Draft and keep it around for the memory zone FDs, and prepare a separate config refactoring PR.
As discussed in PR cloud-hypervisor#7377, Cloud Hypervisor may potentially accept different types of FDs on VM creation and/or resuming:
- FDs for net devices (currently supported);
- FDs for pmem devices for memory zones (WIP);
- FDs for storage devices, etc.
As FDs are passed as an array of anonymous numbers over UDS sockets, there should be a way to assign them to the devices they belong to. This PR introduces ExternalFdsConfig in vm_config.rs, which matches, positionally, the list of device IDs/names to the list of FDs.
Signed-off-by: Peter Oskolkov <posk@google.com>
Here it is: #7431