= Linux x86 Xen EFI boot entry evaluation =

<<TableOfContents(4)>>

This page documents the proposal to use the x86 EFI boot entry for the newly proposed Xen guest type, HVMLite / PVHv2. The proposal is being discussed with the x86, lkml and xen-devel mailing lists copied:

http://lkml.kernel.org/r/20160406024027.GX1990@wotan.suse.de

= Issues with x86 boot entries =

Adding more than one boot entry into Linux for a single architecture has proven
to have a series of drawbacks. There are really two types of custom boot entries:

  * Bypassing native startup_32() / startup_64()
  * Small x86 zero page stubs, handing off to native startup_32()/startup_64() entry

== Bypassing native startup_32() / startup_64() ==

One type of custom x86 boot entry bypasses the usual native path, which on x86 is startup_32()
and startup_64(). The worst known example of this type is the first Xen entry added
to Linux, for Xen PV guests; this is known as the Xen PV path. Issues with that approach are detailed in a post that
goes into overlooked [[http://www.do-not-panic.com/2015/12/avoiding-dead-code-pvops-not-silver-bullet.html|issues with pv_ops that enabled multiple entry points]]. Another post explains
how [[http://www.do-not-panic.com/2015/12/xen-and-x86-linux-zero-page.html|lguest sets the zero page and suggests that the discrepancies introduced by the Xen PV entry point can be addressed]] with a series of proper semantics
and technical frameworks. Some of the ongoing work in that direction is documented
on the [[http://kernelnewbies.org/KernelProjects/kernel-sandboxing|kernel-sandboxing wiki]].

== Small x86 zero page stubs ==

Another possible type of custom x86 boot entry is a small boot entry that only
takes a custom data structure, interprets it to set up the x86 zero page,
and later hands off to the proper native startup_32() / startup_64() entry point.
Although this approach is much better than boot entries that avoid the native entry
points, it also has its own drawbacks. It makes it harder for boot loader authors
to pick the correct entry point. It may also mean old boot loaders cannot work with
new kernels that add new entry points.

Reducing arbitrary entry points in Linux in general is a goal.

= Xen evolution and roadmap =

Below is a review of the evolution of the ideal Xen guest type. We review what PVH is/was, what HVMLite is, and how Xen ARM guests were addressed.

== About PVH ==

x86 Xen PV design is old and while Xen PVHVM is available now, the Xen PVH
design was the last proposed architecture which promised to take advantage of the
best of hardware virtualization extensions and a paravirtualization guest type.
[[https://blog.xenproject.org/2012/09/21/xensummit-sessions-new-pvh-virtualisation-mode-for-arm-cortex-a15arm-servers-and-x86/|Mukesh Rathor at Oracle seems to have architected the first notion of Xen PVH in 2012]].

One of the pitfalls of PV on Xen on x86_64 was that AMD removed segmentation                                                                                  
limits, which Xen originally used for protection between user mode, guest mode                                                                                
and hypervisor. [[http://wiki.xen.org/wiki/Virtualization_Spectrum#Problems_with_paravirtualization:_AMD_and_x86-64|Without segmentation limits]] it has meant "both guest kernel and                                                                               
userspace run in ring 3, each with their own address space. Every time a guest                                                                               
process needs to make a system call, it has to bounce up into Xen, which will                                                                                 
context-switch to the guest kernel. This not only takes more time for each                                                                                    
system call, but requires flushing one of the key CPU caches, called a TLB.                                                                                   
Frequent flushing of the TLB causes all execution to run more slowly for some                                                                                 
time afterwards, as the TLB is filled up again".

PVH is supposed to [[http://wiki.xen.org/wiki/Virtualization_Spectrum#Almost_fully_PV:_PVH_mode|give us the best of both worlds]]. "It’s a fully PV                                                                                     
kernel mode, running with paravirtualized disk and network, paravirtualized                                                                                   
interrupts and timers, no emulated devices of any kind (and thus no qemu), no                                                                                 
BIOS or legacy boot — but instead of requiring PV MMU, it uses the HVM hardware                                                                               
extensions to virtualize the pagetables, as well as system calls and other                                                                                    
privileged operations." 

Despite all this, the Xen PVH design is incomplete, considered experimental, and not fully
functional. It will likely be removed in the future, once a proper design is merged.
HVMLite is the latest proper design choice for Xen x86 guests.

== About HVMLite ==

Despite ongoing [[http://kernelnewbies.org/KernelProjects/kernel-sandboxing|new architectural solutions]] proposed
which should help PV design and further unify boot entries on x86, some folks have
been discussing an alternative design for PVH which could further help address Linux community
concerns on differences between bare metal Linux and Linux as a Xen guest (including dom0).
Paravirtualization isn't the best liked feature of the kernel in the community and HVMLite
is designed to help address both of these concerns.

In order to run as dom0, a guest must be able to run without qemu.
In order to have as little impact as possible on the Linux kernel, it must make use
of as many hardware virtualization features as possible.

This should make it clear why HVMLite is preferred over the PVH design: instead of
originating from a paravirtualized guest, HVMLite is a modified HVM guest. This
minimizes the impact on the kernel.

Currently only domU support is being implemented, but there are plans
to support dom0 as well. It is much easier to test and debug a domU than a dom0.

PVH came from Mukesh. [[http://lists.xen.org/archives/html/xen-devel/2016-02/msg01609.html|HVMLite, however, was first proposed by Roger Pau Monné at Citrix]], and a
patch proposal for [[http://lkml.kernel.org/r/1454341137-14110-3-git-send-email-boris.ostrovsky@oracle.com|its implementation was later posted by Boris Ostrovsky]].

The basic HVMLite-specific performance knobs are:

  * usage of EPT: as avoiding pv pagetables is the main goal there is nothing we want to change here
  * I/O: PV drivers are to be preferred over emulated legacy devices. The specific pv driver implementation won't be unique to HVMlite, any change can be applied to PV guests and HVM guests too.
  * timers, interrupts: the HVMlite design allows APIC based timers and interrupts as well as pv based ones.

Removing qemu as a requirement should be a huge win in and of itself.

== Xen ARM solution ==

Stefano architected virtualization on ARM for Xen along the same lines
as Mukesh's PVH. Xen ARM dom0 and domU solutions use standardized boot entries. Refer
to Linux's Documentation/efi-stub.txt for details. This entry is not specific to Xen
in any way. There aren't any Xen-specific entry points in Linux for Xen on ARM.

= Why use EFI for HVMlite =

Below is a list of gains to consider when using the EFI boot entry for x86 HVMLite. We go into details for each further below.

  * EFI calling conventions are defined in a standards document (the UEFI spec) and therefore work for multiple architectures and multiple Operating Systems.
  * The Linux x86 EFI entry mimics what the currently proposed HVMLite entry does, but generalizes it
  * If further early boot hypervisor semantics are needed, we can either extend the x86 boot protocol or use EFI configuration tables
  * Match Xen ARM's clean solution
  * You don't need full EFI emulation
  * kexec needs a boot path as well

== EFI calling conventions are standardized ==

EFI calling conventions are defined in a standards document (the UEFI spec) and therefore work
for multiple architectures and Operating Systems. There are concerns over adherence to
a spec without a firmware to call into, however it would be fairly straightforward
to provide stubs in the Linux kernel for all services that are usually provided by firmware.

It's useful to list involvement in UEFI for other architectures. Here's a small list:

  * x86_64 - fully interested
  * ARM - fully interested
  * PPC - some interest in the past, uses Open Firmware (Power Firmware), requiring a completely specific boot chain.
  * s390x - this would be rather complex given that even grub uses its own kernel before loading other kernels. Not aware of anyone working on this, and there is no trace of history of interest.
  * Any other architectures ?

== EFI entry generalizes what new HVMLite entry proposes ==

The Linux x86 EFI entry mimics what the proposed HVMLite entry does, but generalizes
it.

All that the new HVMLite entry does is set up identity page tables and then call
xen_prepare_hvmlite(), which crafts boot_params based on memory pointed to by
%ebx (stashed in hvmlite_start_info at the very beginning).

The Linux x86 EFI application / driver entry point allows for two parameters to be
passed, an Image Handle pointer and a pointer to an EFI System Table. Hanging off
of the EFI System Table (efi_system_table_t) is a list of EFI configuration tables
(->tables). Xen can use a custom Xen table, in whatever format we wish, to pass data from the
firmware / boot loader to the kernel. Using this mechanism would avoid
adding arbitrary custom early boot paths to Linux. Refer to efi_config_parse_tables().
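
The configuration table lookup described above can be modeled in a few lines. This is only an illustrative Python model of the idea, not kernel code: each configuration table entry pairs a vendor GUID with a pointer, and a consumer scans the list for the GUID it cares about. The ACPI 2.0 GUID is the one published in the UEFI spec; the Xen GUID, the `ConfigTable` type and the helper names are hypothetical and exist only for this sketch.

```python
import uuid
from typing import NamedTuple, Optional, Sequence


class ConfigTable(NamedTuple):
    """Model of one EFI configuration table entry: a GUID plus a pointer."""
    guid: uuid.UUID
    table: int  # physical address of the vendor-defined table


# Well-known GUID from the UEFI spec, shown for contrast.
ACPI_20_TABLE_GUID = uuid.UUID("8868e871-e4f1-11d3-bc22-0080c73c8881")

# HYPOTHETICAL: no such GUID has been assigned; it only illustrates the idea.
HYPOTHETICAL_XEN_TABLE_GUID = uuid.UUID("12345678-1234-1234-1234-123456789abc")


def find_config_table(tables: Sequence[ConfigTable],
                      guid: uuid.UUID) -> Optional[int]:
    """Return the pointer stored for `guid`, or None if it is not present."""
    for entry in tables:
        if entry.guid == guid:
            return entry.table
    return None


def booted_under_xen(tables: Sequence[ConfigTable]) -> bool:
    """Treat the presence of the (hypothetical) Xen table as 'this is Xen'."""
    return find_config_table(tables, HYPOTHETICAL_XEN_TABLE_GUID) is not None
```

With a scheme like this, bare metal and Xen would share the same entry point, and only the contents of the table list would differ.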

For an example of how this scheme was used on ARM to pass a screen_info object from
the EFI boot stub (which doesn't have access to the kernel proper on ARM) to the
kernel refer to:

https://lkml.kernel.org/r/1459526735-24936-7-git-send-email-ard.biesheuvel@linaro.org

We don't currently parse these config tables in the EFI boot stub on x86, but we could
if necessary. For example, if that info is required super early in boot. We could also
"tag" the Image Handle with some Xen protocol to detect Xen-ness.

The efi_mem_reserve() call helps with a secondary issue: if you kexec to
a new kernel, how do you mark as reserved those EFI regions that would otherwise be freed?
This is more complicated than it sounds because the kernel may have already
memblock_reserve()'d them and so we need to preserve that, while also informing the
EFI subsystem.

== Further semantics may be needed ==

If we need further early boot semantics, instead of extending the x86 boot protocol we could just use EFI configuration tables. We're still evaluating if we need to extend the semantics further by trying to address first all current virtualization hacks.

== Match Xen ARM's clean solution ==

ARM already uses the EFI boot entry, and is an example proof of concept that no custom
boot entries would be needed. Matching Xen ARM's solution should help with expectations
and setup.

== You don't need full EFI emulation ==

You don't really need a full-fledged EFI emulation. Many EFI mechanisms are opt-in,
so you only really need to implement a subset of EFI stubs.

=== Minimal EFI stubs for guests ===

These are identified as minimal requirements for Xen guests.

==== GetMemoryMap() ====

This has been identified as required.

==== ExitBootServices() ====

This has been identified as required.

=== EFI stubs which may be needed for guests ===

These are stubs which may be needed.

==== Exit() ====

Exit() is probably needed for EFI drivers/tools if you want to run OVMF.

==== Variable operation functions ====

Variable operation functions may be needed if you want to use standard distribution installers. This
raises an interesting point regarding running domUs from physical disks. EFI variables are
backed by NVRAM and when using OVMF and its -pflash switch, it's not clear if you can
point to a raw partition for NVRAM space or whether it only takes a file.
It would seem this is just a matter of using the right QEMU argument settings; data
could even be stored in a ROM image. It's known that OVMF stores a copy of its config in the ESP.

=== EFI stubs not needed for guests ===

These stubs are not needed.

==== GetTime()/SetTime() ====

Not necessary; these are completely unused on native x86 EFI today,
though it's worth noting they are used on arm64. The implementation is considered
absolute crap on x86; if we really wanted to use this, one possibility is to write a
proper reference implementation.

==== SetVirtualAddressMap() ====

SetVirtualAddressMap() et consortes are potentially not needed,
especially if you're running something like OVMF.

==== ResetSystem() ====

ResetSystem() is not needed unless you also want to support EFI capsules (unlikely).
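
Taken together, the three groups of stubs above could be modeled as a service table in which only the required calls must be implemented and everything else falls back to a stub that reports "unsupported". This is an illustrative Python model under stated assumptions, not a proposal for the actual Xen or Linux implementation. The status values follow the UEFI spec for 64-bit platforms; the variable service names are the standard UEFI runtime services assumed to be what "variable operation functions" refers to; `build_service_table()` and the grouping constants are hypothetical names for this sketch.

```python
from typing import Callable, Dict

EFI_SUCCESS = 0
EFI_UNSUPPORTED = (1 << 63) | 3  # high bit set marks an error code on 64-bit

# Grouping taken from the sections above.
REQUIRED = ("GetMemoryMap", "ExitBootServices")
MAYBE_NEEDED = ("Exit", "GetVariable", "GetNextVariableName", "SetVariable")
NOT_NEEDED = ("GetTime", "SetTime", "SetVirtualAddressMap", "ResetSystem")
ALL_SERVICES = REQUIRED + MAYBE_NEEDED + NOT_NEEDED


def _unsupported(*_args) -> int:
    """Default stub: tell the caller the service is not available."""
    return EFI_UNSUPPORTED


def build_service_table(
        implemented: Dict[str, Callable[..., int]]) -> Dict[str, Callable[..., int]]:
    """Build a full table from a partial one, enforcing the required subset."""
    unknown = [name for name in implemented if name not in ALL_SERVICES]
    if unknown:
        raise ValueError(f"unknown services: {unknown}")
    missing = [name for name in REQUIRED if name not in implemented]
    if missing:
        raise ValueError(f"required services not implemented: {missing}")
    table = {name: _unsupported for name in ALL_SERVICES}
    table.update(implemented)
    return table
```

A minimal guest environment would then pass in only GetMemoryMap() and ExitBootServices(), and could grow the table later (for example with the variable services) without changing callers.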

=== dom0 EFI ===

dom0 would use hypercalls to talk to EFI.

=== domU EFI emulation possibilities ===

If we wanted to emulate EFI from domU, there are a few possibilities to consider.

domU would use EFI directly, without intermediate layers like hypercalls. This would mean
domUs need an EFI emulation to be provided. If you don't emulate EFI from domU the
implementation required would be minimal. We'd need a way to distinguish bare metal
from HVMLite by using the EFI protocol -- other virtualization platforms can also do
the same. Using the EFI GUID would seem to be the logical way to go to address these
needed semantics.

==== Xen implements its own EFI environment for guests ====

Xen could implement and maintain its own minimal EFI environment for guests (based on the above).

==== Xen uses Tianocore / OVMF ====

Tianocore / OVMF could also be used; this is what Xen ARM used. It would be useful to identify
why Xen ARM went with Tianocore / OVMF and not a minimal EFI environment.

== kexec needs a boot path as well ==

kexec needs a boot path as well, so ironing out an EFI boot path for Xen HVMLite also
helps address kexec. The kexec and direct EFI boot paths are different; for kexec
refer to efi_enter_virtual_mode() -- kexec cannot call SetVirtualAddressMap() because the
first kernel already invoked it. The semantics to distinguish whether a boot came from
an EFI boot path or from kexec could be improved -- currently the kernel looks for a setup_data
object of type SETUP_EFI in boot_params.
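
The setup_data check mentioned above amounts to walking a singly linked list. Below is an illustrative Python model of that walk, not kernel code. It assumes the x86 boot protocol layout of a setup_data node (a 64-bit next pointer, a 32-bit type, a 32-bit length, then the payload) and SETUP_EFI = 4 as defined in the x86 bootparam header; physical memory is modeled as a bytes object indexed by address, and the helper names are hypothetical.

```python
import struct
from typing import List

SETUP_EFI = 4  # assumed value, per the x86 bootparam header

# struct setup_data header: u64 next, u32 type, u32 len (little endian).
_HDR = struct.Struct("<QII")


def make_node(next_addr: int, type_: int, payload: bytes = b"") -> bytes:
    """Serialize one setup_data node (helper for building test memory)."""
    return _HDR.pack(next_addr, type_, len(payload)) + payload


def setup_data_types(memory: bytes, first: int) -> List[int]:
    """Walk the list starting at `first` and return each node's type.

    A next pointer of 0 terminates the list, as in the boot protocol.
    """
    types = []
    seen = set()
    addr = first
    while addr != 0:
        if addr in seen:
            raise ValueError("loop in setup_data list")
        seen.add(addr)
        next_addr, type_, _length = _HDR.unpack_from(memory, addr)
        types.append(type_)
        addr = next_addr
    return types


def has_setup_efi(memory: bytes, first: int) -> bool:
    """True if some node is of type SETUP_EFI (the kexec-on-EFI hint)."""
    return SETUP_EFI in setup_data_types(memory, first)
```

The point of the sketch is that the kexec case is detected only indirectly, by the presence of one node type; a more explicit marker would make the distinction between the two boot paths easier to reason about.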

= Points against using EFI =

== Legacy PV guests need to be supported ==

If we want to later deprecate old Xen PV guest types we will need to ensure legacy guests that do not have EFI can boot.

Minimal EFI emulation would be needed in Xen, the minimal work above is what has been determined to be needed. It is expected this would not only be useful for domU but also for kexec.

== Nulling the claimed boot loader effect ==

The proposed small HVMLite zero page boot stub is very Xen specific and only Xen-aware tools will use it, so no large impact on boot loaders is expected.

== startup_32 / startup_64 flexibility ==

If the concern is that Xen is just adding yet another entry, perhaps the native paths should be made more flexible to enable further fine tuning and customizations. This may mean having to add more semantics to let a Xen HVMLite boot stub do its work, which in turn may mean having to propose extending the x86 boot protocol further. This needs to be weighed against using EFI and the EFI tables for a customized Xen protocol within EFI.

= Remaining questions =

  * Support for boot loaders
  * Support for multiple OSes
  * Support for PE binaries (e.g. EFI shell)
  * If the future EFI roadmap for Xen is to support a full blown EFI implementation, should OVMF be considered from the start ?
  * Why did Xen ARM use OVMF instead of using a smaller subset implementation ?