Linux x86 Xen EFI boot entry evaluation

Contents

Linux x86 Xen EFI boot entry evaluation
Issues with boot x86 boot entries
1. Bypassing native startup_32() / startup_64()
2. Small x86 zero page stubs
Xen evolution and roadmap
Why use EFI for HVMlite
Points against using EFI
Remaining questions

This page documents the proposal to use the x86 EFI boot entry for the newly proposed Xen guest type, HVMLite / PVHv2. The proposal is being discussed in a thread with the x86, lkml and xen-devel mailing lists copied:

http://lkml.kernel.org/r/20160406024027.GX1990@wotan.suse.de

Issues with boot x86 boot entries

Adding more than one boot entry to Linux for a single architecture has proven to have a series of drawbacks. There are really two types of custom boot entry:

Bypassing native startup_32() / startup_64()
Small x86 zero page stubs, handing off to native startup_32()/startup_64() entry

Bypassing native startup_32() / startup_64()

One type of custom x86 boot entry bypasses the usual native path, which on x86 is startup_32() and startup_64(). The worst known example of this type is the first Xen entry added to Linux, for Xen PV guests; this is known as the Xen PV path. The issues with that approach are elaborated in a post that goes into overlooked issues with pv_ops, which enabled multiple entry points. Further details are provided in another post that explains how lguest sets the zero page and suggests that the discrepancies introduced by the Xen PV entry point can be addressed with a series of proper semantics and technical frameworks. Some of the ongoing work in that direction is documented on the kernel-sandboxing wiki.

Small x86 zero page stubs

Another possible type of custom x86 boot entry is a small boot entry that only takes a custom data structure, interprets it to set up the x86 zero page, and then hands off to the proper native startup_32() / startup_64() entry point. Although this approach is much better than boot entries that avoid the native entry points, it has its own drawbacks. It makes it harder for boot loader authors to pick the correct entry point. It may also mean old boot loaders cannot work with new kernels that add new entry points.
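To make the idea of a zero page stub concrete, here is a minimal user-space sketch in C. Everything in it is an assumption for illustration: custom_start_info is a made-up hypervisor structure, zero_page_sketch is a heavily simplified stand-in for the real struct boot_params, and fill_zero_page() is a hypothetical name. It only shows the translation step; a real stub would then jump to startup_32() / startup_64().

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_REGIONS 4
#define SKETCH_E820_RAM    1   /* E820 type for usable RAM */

/* Hypothetical custom structure handed over by a hypervisor or loader. */
struct custom_region {
    uint64_t addr;
    uint64_t size;
};

struct custom_start_info {
    uint32_t cmdline_paddr;                  /* physical address of the command line */
    uint32_t nr_regions;                     /* number of valid entries in region[] */
    struct custom_region region[SKETCH_MAX_REGIONS];
};

/* Heavily simplified stand-in for the x86 zero page (struct boot_params). */
struct zero_page_sketch {
    uint32_t cmd_line_ptr;
    uint8_t  e820_entries;
    struct {
        uint64_t addr;
        uint64_t size;
        uint32_t type;
    } e820[SKETCH_MAX_REGIONS];
};

/*
 * Translate the custom structure into the zero page.
 * Returns 0 on success, -1 on invalid input.
 */
int fill_zero_page(const struct custom_start_info *si,
                   struct zero_page_sketch *zp)
{
    uint32_t i;

    if (!si || !zp || si->nr_regions > SKETCH_MAX_REGIONS)
        return -1;

    memset(zp, 0, sizeof(*zp));
    zp->cmd_line_ptr = si->cmdline_paddr;

    /* Report every region as usable RAM in the simplified memory map. */
    for (i = 0; i < si->nr_regions; i++) {
        zp->e820[i].addr = si->region[i].addr;
        zp->e820[i].size = si->region[i].size;
        zp->e820[i].type = SKETCH_E820_RAM;
    }
    zp->e820_entries = (uint8_t)si->nr_regions;

    /* A real stub would now hand zp to the native entry point. */
    return 0;
}
```

Each such stub needs its own input structure and its own entry symbol, which is where the cost to boot loader authors comes from.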

Reducing arbitrary entry points in Linux in general is a goal.

Xen evolution and roadmap

Below is a review of the evolution of the ideal Xen guest type: what PVH is/was, what HVMLite is, and how Xen ARM guests were addressed.

About PVH

The x86 Xen PV design is old, and while Xen PVHVM is available now, Xen PVH was the last proposed architecture that promised to combine the best of hardware virtualization extensions with a paravirtualized guest type. Mukesh Rathor at Oracle seems to have architected the first notion of Xen PVH in 2012.

One of the pitfalls of PV on Xen on x86_64 was that AMD removed segmentation limits, which Xen originally used for protection between user mode, guest mode and the hypervisor. Without segmentation limits, "both guest kernel and userspace run in ring 3, each with their own address space. Every time a guest process needs to make a system call, it has to bounce up into Xen, which will context-switch to the guest kernel. This not only takes more time for each system call, but requires flushing one of the key CPU caches, called a TLB. Frequent flushing of the TLB causes all execution to run more slowly for some time afterwards, as the TLB is filled up again".

PVH is supposed to give us the best of both worlds. "It’s a fully PV kernel mode, running with paravirtualized disk and network, paravirtualized interrupts and timers, no emulated devices of any kind (and thus no qemu), no BIOS or legacy boot — but instead of requiring PV MMU, it uses the HVM hardware extensions to virtualize the pagetables, as well as system calls and other privileged operations."

Despite all this, the Xen PVH design is incomplete, considered experimental, and not fully functional. It will likely be removed in the future, once a proper design is merged. HVMLite is the latest proper design choice for Xen x86 guests.

About HVMLite

Despite new architectural solutions being proposed which should help the PV design and further unify boot entries on x86, some folks have been discussing an alternative design for PVH which could further help address Linux community concerns about the differences between bare metal Linux and Linux as a Xen guest (including dom0). Paravirtualization isn't the best liked feature of the kernel in the community, and HVMLite is designed to help address both of these concerns.

In order to be able to run as dom0 it is required to be able to run without the need of qemu. In order to have as little impact as possible to the Linux kernel it is required to make use of as many hardware virtualization features as possible.

This should make it clear why HVMlite is preferred over the PVH design: instead of originating from a paravirtualized guest, HVMlite is a modified HVM guest. This minimizes the impact on the kernel.

Currently only domU support is being implemented, but there are plans to support dom0 as well. It is much easier to test and debug a domU than a dom0.

PVH came from Mukesh. HVMLite, however, was first proposed by Roger Pau Monné at Citrix, and the first patches implementing it were later proposed by Boris Ostrovsky.

The basic HVMLite specific performance knobs:

usage of EPT: as avoiding pv pagetables is the main goal there is nothing we want to change here
I/O: PV drivers are to be preferred over emulated legacy devices. The specific pv driver implementation won't be unique to HVMlite, any change can be applied to PV guests and HVM guests too.
timers, interrupts: the HVMlite design allows APIC based timers and interrupts as well as pv based ones.

Removing qemu as a requirement should be a huge win in and of itself.

Xen ARM solution

Stefano architected a virtualization solution on ARM for Xen which matches Mukesh's PVH design. Xen ARM dom0 and domU solutions use standardized boot entries. Refer to Linux's Documentation/efi-stub.txt for details. This entry is not specific to Xen in any way. There aren't any Xen specific entry points in Linux for Xen on ARM.

Why use EFI for HVMlite

Below is a list of gains to consider when using the EFI boot entry for x86 HVMlite. We go into detail on each further below.

EFI calling conventions are defined in a standards document (the UEFI spec) and therefore work for multiple architectures and multiple operating systems.
The Linux x86 EFI entry mimics what the currently proposed HVMLite entry does, but generalizes it
Further early boot hypervisor semantics may be needed: either we extend the x86 boot protocol or we use EFI configuration tables
Match Xen ARM's clean solution
You don't need full EFI emulation
kexec needs a boot path as well

EFI calling conventions are standardized

EFI calling conventions are defined in a standards document (the UEFI spec) and therefore work for multiple architectures and operating systems. There are concerns over adherence to a spec without a firmware to call into; however, it would be fairly straightforward to provide stubs in the Linux kernel for all services that are usually provided by firmware.

It's useful to list the involvement in UEFI of other architectures. Here's a small list:

x86_64 - fully interested
ARM - fully interested
PPC - some interest in the past, uses Open Firmware (Power Firmware), requiring a completely specific boot chain.
s390x - this would be rather complex given that even grub uses its own kernel before loading other kernels. Not aware of anyone working on this, and there is no trace of history of interest.
Any other architectures ?

EFI entry generalizes what new HVMLite entry proposes

The Linux x86 EFI entry mimics what the proposed HVMLite entry does, but generalizes it.

All the new HVMLite entry does is set up identity page tables and then call xen_prepare_hvmlite(), which crafts boot_params based on the memory pointed to by %ebx (stashed in hvmlite_start_info at the very beginning).

The Linux x86 EFI application / driver entry point allows for two parameters to be passed: an Image Handle pointer and a pointer to an EFI System Table. Hanging off of the EFI System Table (efi_system_table_t) is a list of EFI configuration tables (->tables). Xen can use a custom Xen table, in whatever format we wish, to pass information from the firmware / boot loader to the kernel. Using this mechanism would avoid adding arbitrary custom early boot paths to Linux. Refer to efi_config_parse_tables().
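As a rough illustration of the configuration table mechanism, the following user-space C sketch walks a list of (GUID, pointer) pairs and returns the table matching a given GUID. The types are simplified stand-ins for the real EFI and kernel definitions, the names are hypothetical, and SKETCH_XEN_GUID is a made-up value; no Xen table GUID is defined by the text above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for a 128-bit EFI GUID. */
typedef struct {
    uint8_t b[16];
} sketch_guid_t;

/* One configuration table entry: a vendor GUID plus an opaque pointer. */
struct sketch_config_table {
    sketch_guid_t guid;
    void *table;
};

/* Only the two system table fields this sketch needs. */
struct sketch_system_table {
    size_t nr_tables;
    struct sketch_config_table *tables;
};

/* Made-up GUID for illustration only; NOT a real Xen identifier. */
static const sketch_guid_t SKETCH_XEN_GUID = {{
    0x58, 0x45, 0x4e, 0x00, 0x01, 0x02, 0x03, 0x04,
    0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c
}};

/*
 * Return the table registered under 'guid', or NULL if the firmware or
 * boot loader did not install one.
 */
void *sketch_find_table(const struct sketch_system_table *st,
                        const sketch_guid_t *guid)
{
    size_t i;

    if (!st || !guid)
        return NULL;

    for (i = 0; i < st->nr_tables; i++) {
        if (memcmp(&st->tables[i].guid, guid, sizeof(*guid)) == 0)
            return st->tables[i].table;
    }
    return NULL;
}
```

A NULL result would simply mean the table is absent, so one kernel image could follow the same entry path whether or not a Xen-specific table was installed.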

For an example of how this scheme was used on ARM to pass a screen_info object from the EFI boot stub (which doesn't have access to the kernel proper on ARM) to the kernel refer to:

https://lkml.kernel.org/r/1459526735-24936-7-git-send-email-ard.biesheuvel@linaro.org

We don't currently parse these config tables in the EFI boot stub on x86, but we could if necessary. For example, if that info is required super early in boot. We could also "tag" the Image Handle with some Xen protocol to detect Xen-ness.

The efi_mem_reserve() call helps with a secondary issue, which is that if you kexec to a new kernel, how do you mark EFI regions as reserved that would otherwise be freed. This is more complicated than it sounds because the kernel may have already memblock_reserve()'d them and so we need to preserve that, while also informing the EFI subsystem.

Further semantics may be needed

If we need further early boot semantics, instead of extending the x86 boot protocol we could just use EFI configuration tables. We're still evaluating if we need to extend the semantics further by trying to address first all current virtualization hacks.

Match Xen ARM's clean solution

ARM already uses the EFI boot entry, and is an example proof of concept that no custom boot entries would be needed. Matching Xen ARM's solution should help with expectations and setup.

You don't need full EFI emulation

You don't really need a full-fledged EFI emulation. A lot of EFI mechanisms are opt-in, so you only really need to implement a subset of EFI stubs.

Minimal EFI stubs for guests

These are identified as minimal requirements for Xen guests.

GetMemoryMap()

This has been identified as required.

ExitBootServices()

This has been identified as required.
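The two required services interact through a map key: the caller fetches the memory map, then passes the key it received to ExitBootServices(), which refuses stale keys. Below is a minimal user-space C sketch of that handshake. The structure layouts, status names and function names are simplified assumptions for illustration, not the UEFI definitions and not any existing Xen or OVMF code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified status codes; the values follow the UEFI high-bit error style. */
typedef uint64_t sketch_status_t;
#define SK_SUCCESS            0ULL
#define SK_ERROR_BIT          (1ULL << 63)
#define SK_INVALID_PARAMETER  (SK_ERROR_BIT | 2)
#define SK_BUFFER_TOO_SMALL   (SK_ERROR_BIT | 5)

/* Simplified memory descriptor. */
struct sketch_mem_desc {
    uint32_t type;
    uint64_t phys_start;
    uint64_t virt_start;
    uint64_t num_pages;
    uint64_t attribute;
};

/* Hypothetical state kept by a minimal guest EFI environment. */
struct sketch_firmware {
    const struct sketch_mem_desc *map;
    size_t nr_desc;
    uint64_t map_key;            /* identifies the current version of the map */
    int boot_services_active;
};

/* Copy the memory map to the caller, or report the size that is needed. */
sketch_status_t sketch_get_memory_map(struct sketch_firmware *fw,
                                      size_t *map_size,
                                      struct sketch_mem_desc *buf,
                                      uint64_t *map_key,
                                      size_t *desc_size)
{
    size_t needed;

    if (!fw || !map_size || !map_key || !desc_size)
        return SK_INVALID_PARAMETER;

    needed = fw->nr_desc * sizeof(struct sketch_mem_desc);
    *desc_size = sizeof(struct sketch_mem_desc);

    if (*map_size < needed) {
        *map_size = needed;      /* tell the caller how much to allocate */
        return SK_BUFFER_TOO_SMALL;
    }
    if (!buf)
        return SK_INVALID_PARAMETER;

    memcpy(buf, fw->map, needed);
    *map_size = needed;
    *map_key = fw->map_key;
    return SK_SUCCESS;
}

/* Terminate boot services only if the caller holds the current map key. */
sketch_status_t sketch_exit_boot_services(struct sketch_firmware *fw,
                                          uint64_t map_key)
{
    if (!fw || map_key != fw->map_key)
        return SK_INVALID_PARAMETER;

    fw->boot_services_active = 0;
    return SK_SUCCESS;
}
```

A caller would typically call sketch_get_memory_map() once with a zero-sized buffer to learn the required size, call it again with a large enough buffer, and then pass the returned key to sketch_exit_boot_services().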

EFI stubs which may be needed for guests

These are stubs which may be needed.

Exit()

Exit() is probably needed for EFI drivers/tools if you want to run OVMF.

Variable operation functions

Variable operation functions may be needed if you want to use standard distribution installers. This raises an interesting point regarding running domUs from physical disks. EFI variables are backed by NVRAM, and when using OVMF and its -pflash switch, it's not clear if you can point to a raw partition for NVRAM space or whether it only takes a file. It would seem this is just a matter of using the right QEMU argument settings; data could even be stored in a ROM image. It's known that OVMF stores a copy of its config in the ESP.

EFI stubs not needed for guests

These stubs are not needed.

GetTime()/SetTime()

Not necessary; these are completely unused on native x86 EFI today, though it's worth noting they are used on arm64. The implementation is considered absolute crap on x86; if we really wanted to use this, one possibility is to write a proper reference implementation.

SetVirtualAddressMap()

SetVirtualAddressMap() and related calls are potentially not needed, especially if you're running something like OVMF.

ResetSystem()

ResetSystem() is not needed unless you also want to support EFI capsules (unlikely).

dom0 EFI

dom0 would use hypercalls to talk to EFI.

domU EFI emulation possibilities

If we wanted to emulate EFI from domU, there are a few possibilities to consider.

domU would use EFI directly, without intermediate layers like hypercalls. This would mean domUs need an EFI emulation to be provided. If you don't emulate EFI from domU the implementation required would be minimal. We'd need a way to distinguish bare metal from HVMLite by using the EFI protocol -- other virtualization platforms can also do the same. Using the EFI GUID would seem to be the logical way to go to address these needed semantics.

Xen implements its own EFI environment for guests

Xen could implement and maintain its own minimal EFI environment for guests (based on the above).

Xen uses Tianocore / OVMF

Tianocore / OVMF could also be used; this is what Xen ARM used. It would be useful to identify why Xen ARM went with Tianocore / OVMF and not a minimal EFI environment.

kexec needs a boot path as well

kexec needs a boot path as well, so ironing out an EFI boot path for Xen HVMLite also helps address kexec. The kexec and direct EFI boot paths are different; for kexec refer to efi_enter_virtual_mode() -- kexec cannot call SetVirtualAddressMap() because the first kernel already invoked it. The semantics to distinguish between a boot that came from an EFI boot path and one that came from kexec could be improved -- currently the kernel looks for a setup_data object of type SETUP_EFI in boot_params.
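The SETUP_EFI check mentioned above amounts to walking the setup_data linked list that hangs off boot_params. The following user-space C sketch shows that walk under stated assumptions: the node layout mirrors the header of the x86 boot protocol's struct setup_data, ordinary pointers stand in for the physical addresses the kernel would have to map, and sketch_has_efi_setup() is a hypothetical name rather than a kernel function.

```c
#include <stddef.h>
#include <stdint.h>

/* setup_data type value for EFI, as in the x86 boot protocol. */
#define SKETCH_SETUP_EFI 4

/* Header of one setup_data node; the payload would follow in memory. */
struct sketch_setup_data {
    uint64_t next;   /* address of the next node, 0 terminates the list */
    uint32_t type;
    uint32_t len;
};

/*
 * Return 1 if the list starting at 'first' contains a SETUP_EFI node,
 * 0 otherwise. In this sketch addresses are plain user-space pointers.
 */
int sketch_has_efi_setup(uint64_t first)
{
    uint64_t addr = first;

    while (addr) {
        const struct sketch_setup_data *sd =
            (const struct sketch_setup_data *)(uintptr_t)addr;

        if (sd->type == SKETCH_SETUP_EFI)
            return 1;
        addr = sd->next;
    }
    return 0;
}
```

Inferring the boot path from the presence of a node in this way is implicit, which is part of why the semantics could be improved.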

Points against using EFI

Legacy PV guests need to be supported

If we want to later deprecate old Xen PV guest types we will need to ensure legacy guests that do not have EFI can boot.

Minimal EFI emulation would be needed in Xen; the minimal work above is what has been determined to be needed. It is expected this would be useful not only for domU but also for kexec.

Nulling the claimed boot loader effect

The proposed small HVMLite zero page boot stub is very Xen specific and only Xen-aware tools will use it, so no huge impact on boot loaders is expected.

startup_32 / startup_64 flexibility

If the concern is that Xen is just adding yet another entry, perhaps the native paths should be made more flexible to enable further fine tuning and customizations. This may mean having to add more semantics to let a Xen HVMLite boot stub do its work, which in turn may mean having to propose extending the x86 boot protocol further. This needs to be weighed against using EFI and the EFI tables for a customized Xen protocol within EFI.

Remaining questions

Support for boot loaders
Support for multiple OSes
Support for PE binaries (e.g. EFI shell)
If the future EFI roadmap for Xen is to support a full blown EFI implementation, should OVMF be considered from the start ?
Why did Xen ARM use OVMF instead of using a smaller subset implementation ?