Differences between revisions 2 and 5 (spanning 3 versions)

Linux x86 Xen EFI boot entry evaluation

This page documents the proposal to use the x86 EFI boot entry for the newly proposed Xen guest type, HVMLite / PVHv2.

Issues with x86 boot entries

Adding more than one boot entry to Linux for a single architecture has proven to have a series of drawbacks. There are really two types of custom boot entry:

Bypassing native startup_32() / startup_64()
Small x86 zero page stubs, handing off to native startup_32()/startup_64() entry

Bypassing native startup_32() / startup_64()

One type of custom x86 boot entry bypasses the usual native path, which on x86 is startup_32() and startup_64(). The worst known example of this type is the first Xen entry added to Linux, for Xen PV guests; this is known as the Xen PV path. The issues with that approach are elaborated in a post that goes into the overlooked [http://www.do-not-panic.com/2015/12/avoiding-dead-code-pvops-not-silver-bullet.html issues with pv_ops that enabled multiple entry points]. Further details are provided in another post that explains how [http://www.do-not-panic.com/2015/12/xen-and-x86-linux-zero-page.html lguest sets the zero page and alludes that the discrepancies introduced by the Xen PV entry point can be addressed] with a series of proper semantics and technical frameworks. Some of the ongoing work in that direction is documented on the [http://kernelnewbies.org/KernelProjects/kernel-sandboxing kernel-sandboxing wiki].

Small x86 zero page stubs

Another possible type of custom x86 boot entry is a small stub that takes a custom data structure, interprets it to set up the x86 zero page, and then hands off to the proper native startup_32() / startup_64() entry point. Although this approach is much better than boot entries that avoid the native entry points, it has its own drawbacks. It makes it harder for boot loader authors to pick the correct entry point. It may also mean old boot loaders cannot work with new kernels that add new entry points.

Reducing arbitrary entry points in Linux in general is a goal.

Xen evolution and roadmap

Below is a review of the evolution of the ideal Xen guest type. We review what PVH is/was, what HVMLite is, and how Xen ARM guests were addressed.

About PVH

The x86 Xen PV design is old, and while Xen PVHVM is available now, the Xen PVH design was the last proposed architecture which promised to combine the best of hardware virtualization extensions with a paravirtualized guest type. [https://blog.xenproject.org/2012/09/21/xensummit-sessions-new-pvh-virtualisation-mode-for-arm-cortex-a15arm-servers-and-x86/ Mukesh Rathor at Oracle seems to have architected the first notion of Xen PVH in 2012].

One of the pitfalls of PV on Xen on x86_64 was that AMD removed segmentation limits, which Xen originally used for protection between user mode, guest mode and hypervisor. [http://wiki.xen.org/wiki/Virtualization_Spectrum#Problems_with_paravirtualization:_AMD_and_x86-64 Without segmentation limits] it has meant "both guest kernel and userspace run in ring 3, each with their own address space. Every time a guest process needs to make a system call, it has to bounce up into Xen, which will context-switch to the guest kernel. This not only takes more time for each system call, but requires flushing one of the key CPU caches, called a TLB. Frequent flushing of the TLB causes all execution to run more slowly for some time afterwards, as the TLB is filled up again".

PVH is supposed to [http://wiki.xen.org/wiki/Virtualization_Spectrum#Almost_fully_PV:_PVH_mode give us the best of both worlds]. "It’s a fully PV kernel mode, running with paravirtualized disk and network, paravirtualized interrupts and timers, no emulated devices of any kind (and thus no qemu), no BIOS or legacy boot — but instead of requiring PV MMU, it uses the HVM hardware extensions to virtualize the pagetables, as well as system calls and other privileged operations."

Despite all this, the Xen PVH design is incomplete, considered experimental, and not fully functional. It will likely be removed in the future, once a proper design is merged. HVMLite is the latest proper design choice for Xen x86 guests.

About HVMLite

Despite the ongoing [http://kernelnewbies.org/KernelProjects/kernel-sandboxing new architectural solutions] being proposed, which should help the PV design and further unify boot entries on x86, some folks have been discussing an alternative design for PVH which could further help address Linux community concerns about differences between bare metal Linux and Linux as a Xen guest (including dom0). Paravirtualization isn't the best liked feature of the kernel in the community, and HVMLite is designed to help address both of these concerns.

In order to run as dom0, the guest must be able to run without qemu. In order to have as little impact as possible on the Linux kernel, it must make use of as many hardware virtualization features as possible.

This should make it clear why HVMLite is preferred over the PVH design: instead of originating from a paravirtualized guest, HVMLite is a modified HVM guest. This minimizes the impact on the kernel.

Currently only support for domU is being implemented, but there are plans to support dom0 as well. It is much easier to test and debug a domU than a dom0.

PVH came from Mukesh. [http://lists.xen.org/archives/html/xen-devel/2016-02/msg01609.html HVMLite, however, was first proposed by Roger Pau Monné at Citrix], and a later patch with [http://lkml.kernel.org/r/1454341137-14110-3-git-send-email-boris.ostrovsky@oracle.com its implementation was first proposed by Boris Ostrovsky].

The basic HVMLite specific performance knobs:

usage of EPT: as avoiding pv pagetables is the main goal there is nothing we want to change here
I/O: PV drivers are to be preferred over emulated legacy devices. The specific pv driver implementation won't be unique to HVMlite, any change can be applied to PV guests and HVM guests too.
timers, interrupts: the HVMlite design allows APIC based timers and interrupts as well as pv based ones.

Removing qemu as a requirement should be a huge win in and of itself.

Xen ARM solution

Stefano architected a virtualization solution for Xen on ARM which matches Mukesh's PVH design. Xen ARM dom0 and domU use standardized boot entries; refer to Linux's Documentation/efi-stub.txt for details. This entry is not specific to Xen in any way: there aren't any Xen specific entry points in Linux for Xen on ARM.

Why use EFI for HVMlite

Below is a list of gains to consider when evaluating the EFI boot entry for x86 HVMLite. We go into detail on each further below.

EFI calling conventions are defined in a standards document (the UEFI spec) and therefore work for multiple architectures and multiple operating systems.
The Linux x86 EFI entry mimics what the currently proposed HVMLite entry does, but generalizes it
We may need further early boot hypervisor semantics: either we extend the x86 boot protocol or we use EFI configuration tables
Match Xen ARM's clean solution
You don't need full EFI emulation
kexec needs a boot path as well

EFI calling conventions are standardized

EFI calling conventions are defined in a standards document (the UEFI spec) and therefore work for multiple architectures and operating systems. There are concerns over adherence to a spec without a firmware to call into; however, it would be fairly straightforward to provide stubs in the Linux kernel for all services that are usually provided by firmware.

It's useful to list involvement in UEFI for other architectures. Here's a small list:

x86_64 - fully interested
ARM - fully interested
PPC - some interest in the past, uses Open Firmware (Power Firmware), requiring a completely specific boot chain.
s390x - this would be rather complex given that even grub uses its own kernel before loading other kernels. Not aware of anyone working on this, and there is no trace of history of interest.
Any other architectures ?

EFI entry generalizes what new HVMLite entry proposes

The Linux x86 EFI entry mimics what the proposed HVMLite entry does, but generalizes it.

All the new HVMLite entry does is set up identity page tables and then call xen_prepare_hvmlite(), which crafts boot_params based on the memory pointed to by %ebx (stashed in hvmlite_start_info at the very beginning).
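The shape of such a stub can be modeled in a few lines. This is only a Python sketch of the idea, not the kernel code: StartInfo, HYPOTHETICAL_MAGIC, craft_boot_params() and custom_entry() are names made up for illustration, the field layout is not the real hvmlite_start_info layout, and treating the first module as the initial ramdisk is an assumption of this sketch.

```python
from dataclasses import dataclass, field

# Placeholder value for this sketch only; not the real Xen start-info magic.
HYPOTHETICAL_MAGIC = 0x12345678


@dataclass
class StartInfo:
    """Hypothetical stand-in for the structure the hypervisor points %ebx at."""
    magic: int
    cmdline: str = ""
    modules: list = field(default_factory=list)  # (start, size) pairs


def craft_boot_params(info):
    """Translate the custom structure into zero-page-like fields."""
    if info.magic != HYPOTHETICAL_MAGIC:
        raise ValueError("not booted through the custom entry")
    params = {"cmd_line": info.cmdline, "ramdisk_image": 0, "ramdisk_size": 0}
    if info.modules:
        # Assumption of this sketch: the first module is the initial ramdisk.
        params["ramdisk_image"], params["ramdisk_size"] = info.modules[0]
    return params


def custom_entry(info, native_entry):
    """Build the zero page, then hand off to the native entry point."""
    return native_entry(craft_boot_params(info))
```

The point of the model is that everything the custom entry does is a data translation; the native entry then runs unchanged, which is why a generic mechanism for passing that data could replace the entry itself.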

The Linux x86 EFI application / driver entry point allows for two parameters to be passed, an Image Handle pointer and a pointer to an EFI System Table. Hanging off of the EFI System Table (efi_system_table_t) is a list of EFI configuration tables (->tables). Xen can use a custom Xen table, with whatever format we wish, to pass information from the firmware / boot loader to the kernel. Using this mechanism would avoid adding arbitrary custom early boot paths to Linux. Refer to efi_config_parse_tables().
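A rough model of that lookup, in Python rather than kernel C: each 64-bit configuration table entry is a 16-byte GUID followed by an 8-byte pointer, and the consumer scans the array for the GUID it cares about. HYPOTHETICAL_XEN_TABLE_GUID is invented for this sketch (no Xen GUID is defined on this page), and pack_config_tables() / find_config_table() are illustrative helpers, not kernel functions.

```python
import struct
import uuid

# Hypothetical GUID for a Xen-specific table; made up for this sketch.
HYPOTHETICAL_XEN_TABLE_GUID = uuid.UUID("00000000-0000-4000-8000-00000000beef")

# One 64-bit configuration table entry: 16-byte GUID, 8-byte table pointer.
ENTRY = struct.Struct("<16sQ")


def pack_config_tables(tables):
    """Serialize (guid, address) pairs the way firmware would lay them out."""
    return b"".join(ENTRY.pack(guid.bytes_le, addr) for guid, addr in tables)


def find_config_table(blob, nr_tables, wanted):
    """Return the address registered for `wanted`, or None if it is absent."""
    for i in range(nr_tables):
        raw_guid, addr = ENTRY.unpack_from(blob, i * ENTRY.size)
        # EFI GUIDs are stored mixed-endian, which is what bytes_le decodes.
        if uuid.UUID(bytes_le=raw_guid) == wanted:
            return addr
    return None
```

A missing table simply yields None, so the same kernel image could boot on bare metal and under a hypervisor without a separate entry point.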

For an example of how this scheme was used on ARM to pass a screen_info object from the EFI boot stub (which doesn't have access to the kernel proper on ARM) to the kernel refer to:

https://lkml.kernel.org/r/1459526735-24936-7-git-send-email-ard.biesheuvel@linaro.org

We don't currently parse these config tables in the EFI boot stub on x86, but we could if necessary, for example if that info is required super early in boot. We could also "tag" the Image Handle with some Xen protocol to detect Xen-ness.

The efi_mem_reserve() call helps with a secondary issue: if you kexec to a new kernel, how do you mark as reserved the EFI regions that would otherwise be freed? This is more complicated than it sounds because the kernel may have already memblock_reserve()'d them, so we need to preserve that while also informing the EFI subsystem.

Further semantics may be needed

If we need further early boot semantics, instead of extending the x86 boot protocol we could just use EFI configuration tables. We're still evaluating whether we need to extend the semantics further, by first trying to address all current virtualization hacks.

Match Xen ARM's clean solution

ARM already uses the EFI boot entry, and is an example proof of concept that no custom boot entries would be needed. Matching Xen ARM's solution should help with expectations and setup.

You don't need full EFI emulation

You don't really need a full fledged EFI emulation. Many EFI mechanisms are opt-in, so you only really need to implement a subset of EFI stubs.

Minimal EFI stubs for guests

These are identified as minimal requirements for Xen guests.

GetMemoryMap()

This has been identified as required.

ExitBootServices()

This has been identified as required.
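The two required services are coupled: per the UEFI spec, ExitBootServices() takes the map key returned by the most recent GetMemoryMap() and fails with EFI_INVALID_PARAMETER if the memory map has changed since. A minimal emulation therefore has to implement at least that handshake. The sketch below models it in Python; MinimalBootServices, allocate() and leave_firmware() are hypothetical names for illustration, and status codes are plain strings rather than real EFI_STATUS values.

```python
class MinimalBootServices:
    """Toy model of the two boot services identified as required."""

    def __init__(self, memory_map):
        self._map = list(memory_map)
        self._key = 1
        self.exited = False

    def allocate(self, desc):
        # Any change to the memory map invalidates previously returned keys.
        self._map.append(desc)
        self._key += 1

    def get_memory_map(self):
        """Return a snapshot of the map together with its current key."""
        return list(self._map), self._key

    def exit_boot_services(self, map_key):
        if map_key != self._key:
            return "EFI_INVALID_PARAMETER"
        self.exited = True
        return "EFI_SUCCESS"


def leave_firmware(bs, retries=2):
    """Caller-side loop: fetch the map, then exit with the matching key."""
    for _ in range(retries):
        memory_map, key = bs.get_memory_map()
        if bs.exit_boot_services(key) == "EFI_SUCCESS":
            return memory_map
    raise RuntimeError("could not exit boot services")
```

An emulation whose memory map never changes could get away with a constant key, which is part of why the required subset is so small.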

EFI stubs which may be needed for guests

These are stubs which may be needed.

Exit()

Exit() is probably needed for EFI drivers/tools if you want to run OVMF.

Variable operation functions

Variable operation functions may be needed if you want to use standard distribution installers. This raises an interesting point regarding running domUs from physical disks. EFI variables are backed by NVRAM, and when using OVMF with QEMU's -pflash switch, it's not clear if you can point to a raw partition for NVRAM space or whether it only takes a file. It would seem this is just a matter of using the right QEMU argument settings; data could even be stored in a ROM image. It's known that OVMF stores a copy of its config in the ESP.

EFI stubs not needed for guests

These stubs are not needed.

GetTime()/SetTime()

Not necessary; these are completely unused on native x86 EFI today, though it's worth noting they are used on arm64. The implementation is considered absolute crap on x86; if we really wanted to use it, one possibility is to write a proper reference implementation.

SetVirtualAddressMap()

SetVirtualAddressMap() and friends are potentially not needed. It isn't actually necessary, especially if you're running something like OVMF.

ResetSystem()

ResetSystem() is not needed unless you also want to support EFI capsules (unlikely).

domU EFI emulation possibilities

If we wanted to emulate EFI from domU, there are a few possibilities to consider.

Xen implements its own EFI environment for guests

Xen could implement and maintain its own minimal EFI environment for guests (based on the above).

Xen uses Tianocore / OVMF

Tianocore / OVMF could also be used; this is what Xen ARM used. It would be useful to identify why Xen ARM went with Tianocore / OVMF and not a minimal EFI environment.

kexec needs a boot path as well

kexec needs a boot path as well, so ironing out an EFI boot path for Xen HVMLite also helps address kexec. The kexec and direct EFI boot paths are different; for kexec refer to efi_enter_virtual_mode(). kexec cannot call SetVirtualAddressMap() because the first kernel already invoked it. The semantics for distinguishing whether a boot came from an EFI boot path or from kexec could be improved: currently the kernel looks for a setup_data object of type SETUP_EFI in boot_params.
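The check described above amounts to walking the setup_data linked list that hangs off boot_params. A Python model follows, assuming the header layout (64-bit next pointer, 32-bit type, 32-bit len) and the value SETUP_EFI = 4 from Linux's x86 bootparam.h; verify both against the kernel tree in use. has_setup_efi() is a hypothetical helper, and physical memory is modeled as a flat byte buffer in which address 0 terminates the list.

```python
import struct

SETUP_EFI = 4  # assumed to match Linux's x86 bootparam.h
HDR = struct.Struct("<QII")  # struct setup_data header: next, type, len


def has_setup_efi(mem, first):
    """Walk the setup_data list in `mem`, starting at offset `first`.

    Returns True if any node is of type SETUP_EFI, which is the hint the
    kernel uses today to tell a kexec boot from a direct EFI boot.
    """
    addr = first
    while addr:
        nxt, typ, _length = HDR.unpack_from(mem, addr)
        if typ == SETUP_EFI:
            return True
        addr = nxt
    return False
```

Relying on the presence of one list node is what makes the current semantics fragile; an explicit flag or table would state the boot origin directly.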

Concerns with using EFI

Legacy PV guests need to be supported
dom0: would use hypercalls to talk to EFI
domU: would use EFI directly, without intermediate layers like hypercalls. This would mean domUs need an EFI emulation to be provided. If you don't emulate EFI from domU the implementation required would be minimal. We'd need a way to distinguish bare metal from HVMLite by using the EFI protocol -- other virtualization platforms can also do the same. Using the EFI GUID would seem to be the logical way to go to address these needed semantics.

Remaining questions

Support for boot loaders
Support for multiple OSes
Support for PE binaries (e.g. EFI shell)
If the future EFI roadmap for Xen is to support a full blown EFI implementation, should OVMF be considered from the start ?
Why did Xen ARM use OVMF instead of using a smaller subset implementation ?

-  ⇤ ← Revision 2 as of 2016-04-05 23:48:39 → 
  Size: 14840
  Editor: mcgrof
  Comment:
+   ← Revision 5 as of 2016-04-06 01:25:12 → ⇥
  Size: 15673
  Editor: mcgrof
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 17:
-Boot entries that bypass the usual native path, on x86 this is startup_32()
and startup_64(). The worst entry type known is the first Xen entry added
to Linux. Details of issues with that approach are elaborated in a post that
+One type of custom x86 boot entry are boot entries that bypass the usual native path, on x86 this is startup_32()
and startup_64(). The worst entry type known that exemplifies this best is the first Xen entry added
to Linux, for Xen PV guests; this is known as the Xen PV path. Details of issues with that approach are elaborated in a post that
 Line 27:
-Small boot entries that all they do is take a custom data structure, and interpret
it to set the x86 zero page page, and later hand off to the proper native
startup_32() / startup_64() entry point. Although this approach is much better
than boot entries that avoid the native entry ppoints, it also has its own
drawbacks. It makes it harder for boot loader authors to pick the correct
entry point. It may also mean old boot loaders cannot work with new kernels that
add new entry points. Reducing arbitrary entry points in Linux in general is a goal.
+Another type of custoim x86 boot entry possible are small boot entries that all
they do is take a custom data structure, and interpret it to set the x86 zero page page,
and later hand off to the proper native startup_32() / startup_64() entry point.
Although this approach is much better than boot entries that avoid the native entry
points, it also has its own drawbacks. It makes it harder for boot loader authors
to pick the correct entry point. It may also mean old boot loaders cannot work with
new kernels that add new entry points.

Reducing arbitrary entry points in Linux in general is a goal.
-Line 41:
+Line 43:
-x86 Xen PV design is old and while there is Xen PVHVM available now, the Xen PVH
+x86 Xen PV design is old and while Xen PVHVM is available now, the Xen PVH
-Line 63:
+Line 65:
-Despite all this, Xen PVH design is incomplete, considered experimental, and simply not fully functional and will likely be removed in the future, once a proper design is merged. HVMLite is that design choice it seems.
+Despite all this, Xen PVH design is incomplete, considered experimental, and simply not fully
functional and will likely be removed in the future, once a proper design is merged.
HVMLite is the latest proper design choice for Xen x86 guests.
-Line 70:
+Line 74:
-concerns on differences between bare metal Linux and Linux as Xen guest (including dom0). Paravirtualization isn't
the best liked feature of the kernel in the community and HVMLite is designed to help address both of these concerns.
+concerns on differences between bare metal Linux and Linux as Xen guest (including dom0).
Paravirtualization isn't the best liked feature of the kernel in the community and HVMLite
is designed to help address both of these concerns.
-Line 77:
+Line 82:
-This makes clear why HVMlite was preferred over PVH: instead of originating from
a paravirtualized guest HVMlite is a modified HVM guest. This ensures minimizing
the impact on the kernel.

Currently only domU is being designed first, but there are plans to support
dom0 as well. It is much easier to test and debug a domU than dom0.
+This should make it clear why HVMlite is preferred over the PVH disign: instead of
originating from a paravirtualized guest HVMlite is a modified HVM guest. This
ensures minimizing the impact on the kernel.

Currently only support for domU is being implemented first, but there are plans
to support dom0 as well. It is much easier to test and debug a domU than dom0.
-Line 89:
+Line 94:
-- usage of EPT: as avoiding pv pagetables is the main goal there is nothing we want to change here.
- I/O: PV drivers are to be preferred over emulated legacy devices. The specific pv driver implementation won't be unique to HVMlite, any change can be applied to PV guests and HVM guests too.
- timers, interrupts: the HVMlite design allows APIC based timers and interrupts as well as pv based ones.
+  * usage of EPT: as avoiding pv pagetables is the main goal there is nothing we want to change here
  * I/O: PV drivers are to be preferred over emulated legacy devices. The specific pv driver implementation won't be unique to HVMlite, any change can be applied to PV guests and HVM guests too.
  * timers, interrupts: the HVMlite design allows APIC based timers and interrupts as well as pv based ones.
-Line 100:
+Line 104:
-to Linux's Documentation/efi-stub.txt for details/. This is not specific to Xen.
There aren't any Xen specific entry points in Linux for Xen on ARM.
+to Linux's Documentation/efi-stub.txt for details. This entry is not specific to Xen
in any way. There aren't any Xen specific entry points in Linux for Xen on ARM.
-Line 110:
+Line 114:
+  * Match Xen ARM's clean solution
-Line 120:
+Line 125:
-ARM already uses the EFI boot entry, and is an example proof of concept that no custom
boot entries would be needed.
-Line 127:
+Line 129:
-Line 129:
+Line 130:
-  * s390x - this would be rather complex given that even grub uses its own kernel
    before loading other kernels. Not aware of anyone working on this, and there is
    no trace of history of interest.
+  * s390x - this would be rather complex given that even grub uses its own kernel before loading other kernels. Not aware of anyone working on this, and there is no trace of history of interest.
-Line 172:
+Line 169:
+== Match Xen ARM's clean solution ==

ARM already uses the EFI boot entry, and is an example proof of concept that no custom
boot entries would be needed. Matching Xen ARM's solution should help with expectations
and setup.
-Line 174:
+Line 177:
-You don't reallly need a full fledged EFI emuluation, can opt-in for a lot of EFI mechanisms,
+You don't really need a full fledged EFI emuluation, can opt-in for a lot of EFI mechanisms,
-Line 177:
+Line 180:
-Minimal EFI environemnt for guests:

  * GetMemoryMap()
  * ExitBootServices()

May be needed:

  * Exit() - Probably for EFI drivers/tools if you want to run OVMF
  * variable operation functions - Needed if you want to use standard distribution installers.
    This raise an interesting point regarding running domUs from physical disks. EFI variables
    are backed by NVRAM and when using OVMF and its -pflash switch, its not clear if you can
    point to a raw partition for NVRAM space or whether it only takes a file.
    It would seem this is just a matter of using the right QEMU argument settings, data could
    even be stored in a ROM image. Its known that OVMF stores a copy of its config in ESP.

Not needed:

  * GetTime()/SetTime() - Not necessary, these are completely unused on native x86 EFI today.
    Though it's worth noting they are used on arm64. This implementation is considered
    absolute crap on x86, if we really wanted to use this, one possibility is to write a
    proper reference implementation.
  * SetVirtualAddressMap() et consortes - Potentially not, it isn't actually necessary,
    especially if you're running something like OVMF.
  * ResetSystem() - Not unless you also want to support EFI capsules (unlikely).
            If we wanted to emulate EFI from domU, there are a few possibilities:
  * Xen implements its own EFI environment for guests
  * Tianocore / OVMF is used - this is what Xen ARM used
+=== Minimal EFI stubs for guests ===

These are identified as minimal requirements for Xen guests

==== GetMemoryMap() ====

This has been identified are required.

==== ExitBootServices() ====

This has been identified are required.

=== EFI stubs which may be needed for guests ===

These are stubs which may be needed.

==== Exit() ====

Exit() is probably needed for EFI drivers/tools if you want to run OVMF.

==== Variable operation functions ====

Variable operation functions may be needed if you want to use standard distribution installers. This
raise an interesting point regarding running domUs from physical disks. EFI variables are
backed by NVRAM and when using OVMF and its -pflash switch, its not clear if you can
point to a raw partition for NVRAM space or whether it only takes a file.
It would seem this is just a matter of using the right QEMU argument settings, data
could even be stored in a ROM image. Its known that OVMF stores a copy of its config in ESP.

=== EFI stubs not needed for guests ===

These stubs are not needed.


==== GetTime()/SetTime() ====

Not necessary, these are completely unused on native x86 EFI today.
Though it's worth noting they are used on arm64. This implementation is considered
absolute crap on x86, if we really wanted to use this, one possibility is to write a
proper reference implementation.

==== SetVirtualAddressMap() ====

SetVirtualAddressMap() et consortes is potentially not needed, it isn't actually necessary,
especially if you're running something like OVMF.

==== ResetSystem() ====

ResetSystem() is not needed unless you also want to support EFI capsules (unlikely).

=== domU EFI emulation possibilities ===

If we wanted to emulate EFI from domU, there are a few possibilities to consider.

==== Xen implements its own EFI environment for guests ====

Xen could implements and maintains its own minimal EFI environment for guests (based on the above).

==== Xen uses Tianocore / OVMF ====

Tianocore / OVMF could also be used, this is what Xen ARM used. It would be useful to identify
why Xen ARM went with Tianocore / OVMF and not a minimal EFI environment.
-Line 219:
+Line 256:
-  * domU: would use EFI directly, without intermediate layers like hypercalls
 This would mean domUs need an EFI emulation to be provided.
        If you don't emulate EFI from domU the implementation required would
        be minimal. We'd need a way to distinguish bare metal from HVMLite
        by using the EFI protocol -- other virtualization platforms can also
        do the same. Using the EFI GUID would seem to be the logical way to go
        to address these needed semantics.
+  * domU: would use EFI directly, without intermediate layers like hypercalls. This would mean domUs need an EFI emulation to be provided. If you don't emulate EFI from domU the implementation required would be minimal. We'd need a way to distinguish bare metal from HVMLite by using the EFI protocol -- other virtualization platforms can also do the same. Using the EFI GUID would seem to be the logical way to go to address these needed semantics.
-Line 232:
+Line 263:
-  * If the future EFI roadmap for Xen is to support a full blown EFI implementation, should
    OVMF be considered from the start ?
+  * If the future EFI roadmap for Xen is to support a full blown EFI implementation, should OVMF be considered from the start ?