Linux x86 Xen EFI boot entry evaluation
- Linux x86 Xen EFI boot entry evaluation
- Issues with x86 boot entries
- Xen evolution and roadmap
- Why use EFI for HVMlite
- EFI calling conventions are standardized
- EFI entry generalizes what new HVMLite entry proposes
- Further semantics may be needed
- Match Xen ARM's clean solution
- You don't need full EFI emulation
- Minimal EFI stubs for guests
- EFI stubs which may be needed for guests
- EFI stubs not needed for guests
- dom0 EFI
- domU EFI emulation possibilities
- kexec needs a boot path as well
- Points against using EFI
- Remaining questions
This page documents the proposal to use the x86 EFI boot entry for the newly proposed Xen guest type, HVMLite / PVHv2. The proposal is being discussed with the x86, lkml and xen-devel mailing lists copied.
Issues with x86 boot entries
Bypassing native startup_32() / startup_64()
One type of custom x86 boot entry bypasses the usual native path, which on x86 is startup_32() and startup_64(). The worst known example of this type is the first Xen entry added to Linux, for Xen PV guests; this is known as the Xen PV path. The issues with that approach are elaborated in a post on overlooked issues with pv_ops, which enabled multiple entry points. Further details are provided in another post that explains how lguest sets the zero page and suggests that the discrepancies introduced by the Xen PV entry point can be addressed with a series of proper semantics and technical frameworks. Some of the ongoing work in that direction is documented on the kernel-sandboxing wiki.
Small x86 zero page stubs
Another possible type of custom x86 boot entry is a small stub that only takes a custom data structure, interprets it to set up the x86 zero page, and then hands off to the proper native startup_32() / startup_64() entry point. Although this approach is much better than boot entries that avoid the native entry points, it has its own drawbacks. It makes it harder for boot loader authors to pick the correct entry point. It may also mean old boot loaders cannot work with new kernels that add new entry points.
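The stub approach described above can be sketched in a few lines of C. This is an illustration only, under stated assumptions: struct custom_start_info, CUSTOM_MAGIC, struct mini_zero_page and stub_fill_zero_page() are hypothetical names, and the simplified structure does not follow the real struct boot_params layout.

```c
#include <stdint.h>
#include <string.h>

#define CUSTOM_MAGIC  0x58454e21u  /* made-up value for this sketch */
#define E820_TYPE_RAM 1u           /* usable RAM in an e820 map */

/* Hypothetical structure a hypervisor or boot loader might hand to the stub. */
struct custom_start_info {
	uint32_t magic;
	uint64_t cmdline_addr;
	uint64_t ramdisk_addr;
	uint64_t ramdisk_size;
	uint64_t ram_bytes;    /* RAM size, assumed to start at address 0 */
};

/* Simplified stand-in for the zero page; NOT the real struct boot_params. */
struct mini_e820_entry {
	uint64_t addr;
	uint64_t size;
	uint32_t type;
};

struct mini_zero_page {
	uint64_t cmd_line_ptr;
	uint64_t ramdisk_image;
	uint64_t ramdisk_size;
	uint32_t e820_entries;
	struct mini_e820_entry e820[4];
};

/* Returns 0 on success, -1 if the input is missing or not recognized. */
int stub_fill_zero_page(const struct custom_start_info *si,
			struct mini_zero_page *zp)
{
	if (!si || !zp || si->magic != CUSTOM_MAGIC)
		return -1;

	memset(zp, 0, sizeof(*zp));
	zp->cmd_line_ptr = si->cmdline_addr;
	zp->ramdisk_image = si->ramdisk_addr;
	zp->ramdisk_size = si->ramdisk_size;

	/* Describe all RAM as one usable region. */
	zp->e820[0].addr = 0;
	zp->e820[0].size = si->ram_bytes;
	zp->e820[0].type = E820_TYPE_RAM;
	zp->e820_entries = 1;

	/* A real stub would now jump to startup_32() / startup_64(). */
	return 0;
}
```

Even in this reduced form, the stub has to know both the custom structure and the zero page layout, which is the coupling that makes each new entry point a burden for boot loader authors.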
Xen evolution and roadmap
The x86 Xen PV design is old, and while Xen PVHVM is available now, Xen PVH was the last proposed architecture that promised to combine the best of the hardware virtualization extensions with a paravirtualized guest type. Mukesh Rathor at Oracle seems to have architected the first notion of Xen PVH back in 2012.
One of the pitfalls of PV on Xen on x86_64 was that AMD removed segmentation limits, which Xen originally used for protection between user mode, guest mode and hypervisor. Without segmentation limits, "both guest kernel and userspace run in ring 3, each with their own address space. Every time a guest process needs to make a system call, it has to bounce up into Xen, which will context-switch to the guest kernel. This not only takes more time for each system call, but requires flushing one of the key CPU caches, called a TLB. Frequent flushing of the TLB causes all execution to run more slowly for some time afterwards, as the TLB is filled up again".
PVH is supposed to give us the best of both worlds. "It’s a fully PV kernel mode, running with paravirtualized disk and network, paravirtualized interrupts and timers, no emulated devices of any kind (and thus no qemu), no BIOS or legacy boot — but instead of requiring PV MMU, it uses the HVM hardware extensions to virtualize the pagetables, as well as system calls and other privileged operations."
Despite all this, Xen PVH design is incomplete, considered experimental, and simply not fully functional and will likely be removed in the future, once a proper design is merged. HVMLite is the latest proper design choice for Xen x86 guests.
Despite the new architectural solutions being proposed, which should help the PV design and further unify boot entries on x86, some folks have been discussing an alternative design for PVH which could further help address Linux community concerns about differences between bare metal Linux and Linux as a Xen guest (including dom0). Paravirtualization isn't the best liked feature of the kernel in the community, and HVMLite is designed to help address both of these concerns.
In order to be able to run as dom0 it is required to be able to run without the need of qemu. In order to have as little impact as possible to the Linux kernel it is required to make use of as many hardware virtualization features as possible.
This should make it clear why HVMLite is preferred over the PVH design: instead of originating from a paravirtualized guest, HVMLite is a modified HVM guest. This minimizes the impact on the kernel.
PVH came from Mukesh. HVMLite, however, was first proposed by Roger Pau Monné at Citrix, and the first patches implementing it were later proposed by Boris Ostrovsky.
Xen ARM solution
Stefano architected virtualization on ARM for Xen along the same lines as Mukesh's PVH. Xen ARM dom0 and domU solutions use standardized boot entries. Refer to Linux's Documentation/efi-stub.txt for details. This entry is not specific to Xen in any way: there aren't any Xen specific entry points in Linux for Xen on ARM.
Why use EFI for HVMlite
EFI calling conventions are standardized
EFI calling conventions are defined in a standard document (the UEFI spec) and therefore work across multiple architectures and operating systems. There are concerns over adherence to a spec without a firmware to call into; however, it would be fairly straightforward to provide stubs in the Linux kernel for all services that are usually provided by firmware.
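As a rough illustration of what such stubs could look like, the sketch below fills a small table of function pointers with handlers that simply report that nothing is available. The names struct stub_runtime_services, stub_get_variable(), stub_set_variable() and stub_rt are hypothetical, the argument lists only approximate the UEFI GetVariable() / SetVariable() prototypes, and the status codes assume a 64-bit platform where the high bit of the status word marks an error.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t efi_status_t;  /* assumption: 64-bit status word */

#define EFI_ERROR_BIT   (1ULL << 63)
#define EFI_SUCCESS     0ULL
#define EFI_UNSUPPORTED (EFI_ERROR_BIT | 3)
#define EFI_NOT_FOUND   (EFI_ERROR_BIT | 14)

/* Hypothetical, reduced runtime services table holding only variable calls. */
struct stub_runtime_services {
	efi_status_t (*get_variable)(const uint16_t *name, const void *vendor,
				     uint32_t *attrs, size_t *size, void *data);
	efi_status_t (*set_variable)(const uint16_t *name, const void *vendor,
				     uint32_t attrs, size_t size,
				     const void *data);
};

/* No variable store is backed by anything, so every lookup misses. */
static efi_status_t stub_get_variable(const uint16_t *name, const void *vendor,
				      uint32_t *attrs, size_t *size, void *data)
{
	(void)name; (void)vendor; (void)attrs; (void)size; (void)data;
	return EFI_NOT_FOUND;
}

/* Writes are refused outright rather than silently dropped. */
static efi_status_t stub_set_variable(const uint16_t *name, const void *vendor,
				      uint32_t attrs, size_t size,
				      const void *data)
{
	(void)name; (void)vendor; (void)attrs; (void)size; (void)data;
	return EFI_UNSUPPORTED;
}

const struct stub_runtime_services stub_rt = {
	.get_variable = stub_get_variable,
	.set_variable = stub_set_variable,
};
```

A caller written against the table sees well-defined error codes instead of a missing service, which is the property a firmware-less guest would rely on.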
EFI entry generalizes what new HVMLite entry proposes
All that the new HVMLite entry does is set up identity page tables and then call xen_prepare_hvmlite(), which crafts boot_params based on memory pointed to by %ebx (stashed in hvmlite_start_info at the very beginning).
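For readers unfamiliar with the first step, the following is a generic sketch of an identity mapping for x86-64 long mode that covers the first 1 GiB with 2 MiB pages. It is not the code from the HVMLite patches; struct identity_tables and build_identity_map() are hypothetical names, and the physical addresses of the tables are passed in as parameters because the sketch is meant to compile in user space.

```c
#include <stdint.h>
#include <string.h>

#define PTE_PRESENT 0x001ULL
#define PTE_WRITE   0x002ULL
#define PTE_PSE     0x080ULL            /* in a PD entry: maps a 2 MiB page */
#define PT_ENTRIES  512
#define SZ_2M       (2ULL * 1024 * 1024)

/* Three 4 KiB tables; on real hardware each must be 4 KiB aligned. */
struct identity_tables {
	uint64_t pml4[PT_ENTRIES];
	uint64_t pdpt[PT_ENTRIES];
	uint64_t pd[PT_ENTRIES];
};

/*
 * pdpt_phys and pd_phys are the physical addresses at which t->pdpt and
 * t->pd will live once the tables are installed.
 */
void build_identity_map(struct identity_tables *t,
			uint64_t pdpt_phys, uint64_t pd_phys)
{
	unsigned int i;

	memset(t, 0, sizeof(*t));

	/* One PML4 entry and one PDPT entry cover the low 1 GiB. */
	t->pml4[0] = pdpt_phys | PTE_PRESENT | PTE_WRITE;
	t->pdpt[0] = pd_phys | PTE_PRESENT | PTE_WRITE;

	/* Each PD entry maps virtual address i * 2 MiB to the same physical one. */
	for (i = 0; i < PT_ENTRIES; i++)
		t->pd[i] = (i * SZ_2M) | PTE_PRESENT | PTE_WRITE | PTE_PSE;
}
```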
The Linux x86 EFI application / driver entry point allows for two parameters to be passed, an Image Handle pointer and a pointer to an EFI System Table. Hanging off of the EFI System Table (efi_system_table_t) is a list of EFI configuration tables (->tables). Xen can pass a custom Xen table, in whatever format we wish, from the firmware / boot loader to the kernel. Using this mechanism would avoid adding arbitrary custom early boot paths to Linux. Refer to efi_config_parse_tables().
We don't currently parse these config tables in the EFI boot stub on x86, but we could if necessary, for example if that info is required very early in boot. We could also "tag" the Image Handle with some Xen protocol to detect Xen-ness.
The efi_mem_reserve() call helps with a secondary issue: if you kexec to a new kernel, how do you mark as reserved the EFI regions that would otherwise be freed? This is more complicated than it sounds because the kernel may have already memblock_reserve()'d them and so we need to preserve that, while also informing the EFI subsystem.
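A minimal sketch of how a Xen-specific configuration table could be located by GUID is shown below. All type and function names are simplified stand-ins rather than the kernel's own definitions, and HYPOTHETICAL_XEN_TABLE_GUID is a made-up value: no such GUID is defined on this page.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 16-byte GUID with no padding between fields. */
typedef struct {
	uint32_t data1;
	uint16_t data2;
	uint16_t data3;
	uint8_t  data4[8];
} guid_sketch_t;

/* Simplified configuration table entry: a GUID plus a pointer to the data. */
struct config_table_sketch {
	guid_sketch_t guid;
	void *table;
};

/* Simplified system table: only the fields needed for the lookup. */
struct system_table_sketch {
	size_t nr_tables;
	const struct config_table_sketch *tables;
};

/* Made-up GUID used purely for illustration. */
static const guid_sketch_t HYPOTHETICAL_XEN_TABLE_GUID = {
	0x12345678, 0x9abc, 0xdef0,
	{ 0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef }
};

/* Returns the table registered under guid, or NULL if there is none. */
void *find_config_table(const struct system_table_sketch *st,
			const guid_sketch_t *guid)
{
	size_t i;

	if (!st || !guid)
		return NULL;

	for (i = 0; i < st->nr_tables; i++) {
		if (memcmp(&st->tables[i].guid, guid, sizeof(*guid)) == 0)
			return st->tables[i].table;
	}
	return NULL;
}
```

A NULL result would mean the kernel was not handed a Xen table and can continue down the ordinary EFI path, so the same entry point serves both cases.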
Further semantics may be needed
If we need further early boot semantics, instead of extending the x86 boot protocol we could just use EFI configuration tables. We're still evaluating whether we need to extend the semantics further by first trying to address all current virtualization hacks.
Match Xen ARM's clean solution
You don't need full EFI emulation
Minimal EFI stubs for guests
EFI stubs which may be needed for guests
Variable operation functions
Variable operation functions may be needed if you want to use standard distribution installers. This raises an interesting point regarding running domUs from physical disks. EFI variables are backed by NVRAM, and when using OVMF with QEMU's -pflash switch, it's not clear if you can point to a raw partition for NVRAM space or whether it only takes a file. It would seem this is just a matter of using the right QEMU argument settings; data could even be stored in a ROM image. It's known that OVMF stores a copy of its config in the ESP.
EFI stubs not needed for guests
Not necessary: these are completely unused on native x86 EFI today, though it's worth noting they are used on arm64. The implementation on x86 is considered absolute crap; if we really wanted to use this, one possibility is to write a proper reference implementation.
SetVirtualAddressMap() et consortes are potentially not needed, especially if you're running something like OVMF.
ResetSystem() is not needed unless you also want to support EFI capsules (unlikely).
domU EFI emulation possibilities
domU would use EFI directly, without intermediate layers like hypercalls. This would mean domUs need an EFI emulation to be provided. If you don't emulate EFI from domU the implementation required would be minimal. We'd need a way to distinguish bare metal from HVMLite by using the EFI protocol -- other virtualization platforms can also do the same. Using the EFI GUID would seem to be the logical way to go to address these needed semantics.
Xen implements its own EFI environment for guests
Xen uses Tianocore / OVMF
kexec needs a boot path as well
kexec needs a boot path as well; ironing out an EFI boot path for Xen HVMLite also helps address kexec. The kexec and direct EFI boot paths are different; for kexec refer to efi_enter_virtual_mode() -- kexec cannot call SetVirtualAddressMap() because the first kernel already invoked it. The semantics for distinguishing whether a boot came from an EFI boot path or from kexec could be improved -- currently the kernel looks for a setup_data object of type SETUP_EFI in boot_params.
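The current detection can be pictured as a walk over the setup_data linked list. The sketch below is a user-space approximation under stated assumptions: struct setup_data_sketch and has_setup_efi() are hypothetical names, the next field is an ordinary pointer where the real structure stores a physical address, and the value 4 for SETUP_EFI is assumed to match Linux's bootparam.h.

```c
#include <stddef.h>
#include <stdint.h>

#define SETUP_EFI 4  /* assumed to match Linux's bootparam.h */

/* Simplified setup_data node; the payload that follows the header is omitted. */
struct setup_data_sketch {
	const struct setup_data_sketch *next;
	uint32_t type;
	uint32_t len;
};

/* Returns 1 if any node in the list has type SETUP_EFI, 0 otherwise. */
int has_setup_efi(const struct setup_data_sketch *head)
{
	const struct setup_data_sketch *node;

	for (node = head; node != NULL; node = node->next) {
		if (node->type == SETUP_EFI)
			return 1;
	}
	return 0;
}
```

Because the check depends only on the presence of one node type, it says nothing else about how the kernel was started, which is why clearer semantics are suggested above.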
Points against using EFI
Legacy PV guests need to be supported
Nulling the claimed boot loader effect
startup_32 / startup_64 flexibility
If the concern is that Xen is just adding yet another entry, perhaps the native paths should be made more flexible to enable further fine tuning and customization. This may mean adding more semantics to let a Xen HVMLite boot stub do its work, which in turn may mean proposing further extensions to the x86 boot protocol. This needs to be evaluated against using EFI and the EFI tables for a customized Xen protocol within EFI.