Kernel sandboxing
This page documents a proposal for a series of new kernel strategies to help sandbox the kernel. This is very different from sandboxing userspace from the kernel; this is about sandboxing the kernel from itself.
Major Linux distributions tend to prefer to ship and support only one binary kernel across a slew of different environments. This often means a lot of code ends up enabled as modular even if unused, and other times some code ends up enabled as built-in, as part of vmlinux, even if it is never used. This presents a few issues:
- Code size concerns - not a critical concern except for tiny Linux systems
- Dead code concerns - we sometimes cannot be sure that code which should not run never actually runs
- Security concerns - unused code may be left enabled, which increases the kernel's attack surface
Kernel sandboxing tries to address these problems through a series of techniques and toolboxes.
Code concerns
First we must understand each of the above concerns.
Code size concerns
The Linux kernel tinification effort ('make tinyconfig') helps produce a minimal kernel. This effort is limited to what can be disabled through Kconfig alone. There is potential to further reduce memory footprint, both code and data, when we can determine at run time that certain code is not needed.
Dead code concerns
In theory, code that should not run should not run, but it would be much better if we could say that code which should not run cannot run. One of the best examples of code that should not run is code built into the kernel to support different types of virtualization environments when you know you are running on bare metal. Likewise, if you are running in a virtualized environment, there is certain code designed only for bare metal that we know must not run. This is best explained in the post "Avoiding dead code, pvops is not the silver bullet".
Security code concerns
Leaving code enabled which we know cannot be useful after a specific point in run time only increases the attack vectors possible on a system. We should be able to figure out when we no longer need certain code; for some mechanisms in the kernel there should already be heuristics available to determine this. At that point the kernel should be able to completely disarm such code.
Current available solutions
This list covers the mechanisms currently used in the kernel to help sandbox it. Their limitations should be weighed when devising solutions to address the remaining shortcomings at run time.
Kconfig
Kconfig can be used to disable compiling and linking into the kernel code functionality you know you do not need. Kconfig is limited in that you need to know which features you do not want enabled. Disabling a lot of Kconfig options also reduces the flexibility of a distribution's final kernel. For this reason many Linux distributions rely on module support, loading required code functionality at run time, only when needed. Often, though, certain code cannot be made modular, and Linux distributions wishing to enable such functionality have no option other than enabling a series of features only available as built-in.
Kernel parameters
Kernel parameters enable dynamic tuning of code functionality, both for built-in code and for modules (as module parameters). Kernel parameters are applied at run time; it is up to the user to enable or disable particular parameters to customize run time functionality. Even though kernel and module parameters can be used to disable certain functionality at run time, the disabled code typically remains present at run time. It is up to the implementation to ensure, by analyzing code flow, that disabled code will never run.
Binary patching
Certain features are critical to performance, and relying on branches and variables at run time to determine which code path to take degrades performance. The Linux kernel therefore supports modifying critical code at run time to avoid unnecessary branching. Binary patching is supported on different architectures on Linux; for details on the x86 implementation refer to arch/x86/kernel/alternative.c. Binary patching takes place towards the end of start_kernel() in init/main.c, right before the first userspace init process is spawned, via check_bugs(). This calls your architecture-specific check_bugs(); on x86_64 this is implemented in arch/x86/kernel/cpu/bugs_64.c, which eventually calls alternative_instructions().
Alternatives are implemented using custom ELF sections; x86 currently has two dedicated alternative sections. Each of these ELF sections is stuffed with struct alt_instr data. The custom linker script on Linux (arch/x86/kernel/vmlinux.lds.S) defines these as:
	/*
	 * start address and size of operations which during runtime
	 * can be patched with virtualization friendly instructions or
	 * baremetal native ones. Think page table operations.
	 * Details in paravirt_types.h
	 */
	. = ALIGN(8);
	.parainstructions : AT(ADDR(.parainstructions) - LOAD_OFFSET) {
		__parainstructions = .;
		*(.parainstructions)
		__parainstructions_end = .;
	}

	/*
	 * struct alt_instr entries. From the header (alternative.h):
	 * "Alternative instructions for different CPU types or capabilities"
	 * Think locking instructions on spinlocks.
	 */
	. = ALIGN(8);
	.altinstructions : AT(ADDR(.altinstructions) - LOAD_OFFSET) {
		__alt_instructions = .;
		*(.altinstructions)
		__alt_instructions_end = .;
	}

	/*
	 * And here are the replacement instructions. The linker sticks
	 * them as binary blobs. The .altinstructions has enough data to
	 * get the address and the length of them to patch the kernel safely.
	 */
	.altinstr_replacement : AT(ADDR(.altinstr_replacement) - LOAD_OFFSET) {
		*(.altinstr_replacement)
	}
The alternative_instructions() then has:
void __init alternative_instructions(void)
{
	...
	apply_alternatives(__alt_instructions, __alt_instructions_end);
	...
	apply_paravirt(__parainstructions, __parainstructions_end);
}
The alternatives implementation makes it possible to no-op out functionality which is unused, but this only covers the remaining code that a replacement instruction does not suffice to replace, and it is only used for very performance-critical sections of the kernel.
Candidate code review
Below is a list of candidate code we have identified as possible targets for the new prospective solutions we are working on, to reduce kernel size, address dead code, or address security concerns at run time.
pvops
Linux distributions using pvops enable a lot of code which remains on even after we have discovered which virtualization environment we are in. We do not need code for other virtualization environments once we know which environment we are on. For some code it may be easy to figure out which built-in code is Xen-specific or bare-metal-only, but this is not always true. For instance, although we know some bare-metal-only code should not run when on Xen, we have no guarantee that it is in fact not running on Xen. We need a mechanism enabling a staging / debugging phase which does not fault the kernel but simply prints a warning to dmesg when code not intended to run is found to be running. After a phase of certainty and code fixing, such code could be discarded, if such mechanisms existed. Likewise, when on bare metal we know we do not need any of the Xen built-in code. The same applies to the other virtualization environments supported through pvops.
ACPI legacy
Section 5.2.9.3 of the ACPI specification, IA-PC Boot Architecture Flags, documents a series of flags for legacy x86 mechanisms (similar flags exist for ARM, so review those too). If ACPI annotates certain functionality as not needed, then even if we built it into the kernel, once these flags are cleared we know we do not need such legacy functionality. A legacy-free system is one that lacks any legacy requirements.
Linux actually only currently makes use of a few of these. We should take advantage and not only parse the ACPI table to vet for more, but also consider addressing code we don't need.
/* Masks for FADT IA-PC Boot Architecture Flags (boot_flags) [Vx]=Introduced in this FADT revision */

#define ACPI_FADT_LEGACY_DEVICES    (1)         /* 00: [V2] System has LPC or ISA bus devices */
#define ACPI_FADT_8042              (1<<1)      /* 01: [V3] System has an 8042 controller on port 60/64 */
#define ACPI_FADT_NO_VGA            (1<<2)      /* 02: [V4] It is not safe to probe for VGA hardware */
#define ACPI_FADT_NO_MSI            (1<<3)      /* 03: [V4] Message Signaled Interrupts (MSI) must not be enabled */
#define ACPI_FADT_NO_ASPM           (1<<4)      /* 04: [V4] PCIe ASPM control must not be enabled */
#define ACPI_FADT_NO_CMOS_RTC       (1<<5)      /* 05: [V5] No CMOS real-time clock present */
Jump labels
Jump labels let branches be patched at run time. Once we have dynamically branched into the code we know runs, we also know we no longer need the code path we branched away from.
Prospective solutions
To help address the gaps in the existing solutions, here are a few solutions currently being worked on or already submitted for review for inclusion upstream.
Linker tables and lightweight feature graph
The Linux kernel has a series of custom ELF sections spread over the tree. Each time a custom ELF section is added to the kernel, the custom linker script (for x86 this is arch/x86/kernel/vmlinux.lds.S) must be extended with the section's start and end addresses, and code must then be annotated accordingly. A generic solution is being proposed upstream, called linker tables. The effort is originally inspired by both iPXE's own linker table solution and Linux's IOMMU initialization code. The Linux linker table solution has its own lightweight, optimized run time sort functionality, originally based on the existing Linux kernel IOMMU initialization functionality but with stronger semantics. The Linux linker table work enables:
- A simple generic link-order sort, based on a priority number
- The ability to sort code further at run time, later in boot
- Further run time fine-tuning of dependency relationships between code
- New force-obj-y and force-lib-y Makefile entries, as an alternative to obj-y, enabling code to always be compiled but only linked in when mandated by Kconfig; this helps avoid code rot
The latest draft code is available at:
So far, review of the linker table effort and proof-of-concept use with existing custom sections has revealed that there are really two types of custom sections used on Linux:
- custom sections used to stuff code in
- custom sections used as tables, which code later iterates over
The first use case only requires support for compartmentalizing code / data into sections. The second is more in line with the original intentions of linker tables. For this reason a generic section solution is being devised first. The run time sort functionality is simply not needed for it, so it will be yanked out. In the future it may be lifted and used for other subsystems, once a basic generic custom section / linker table solution is merged.
Compiler multiverse support
gcc could be extended to generalize the binary patching technique used in the Linux kernel for use by any application. What this provides is variable-optimized run time multiverse support. This effort now has its own home page:
https://github.com/luhsra/multiverse/
Full blown feature graph
At the 2016 Linux Plumbers Conference in Santa Fe, the topic of complex dependencies was addressed; part of the goal was to see whether any lightweight feature graph solution already present in Linux could be made generic for use in building relationships. It would seem that the answer was no.
The notes from these sessions:
https://lwn.net/Articles/705852/
SAT solvers
As code and run time complexity grow, one may end up needing a SAT solver to address more complex run time code dependency relationships. Work on this front is starting with kconfig-sat; refer to linux-sat for other potential use cases.
Freeing and marking code as not present
The Linux kernel already has a mechanism to free code it does not need after boot, free_init_pages(). This mechanism also has a debug feature to mark freed code as not present, triggering a page fault when it is touched. We should be able to repurpose and generalize this to free other code, or mark other code as not present, at other specific points in time after init. Thanks to the linker table work, which generalizes custom linker script sections, and since free_init_pages() relies on section begin/end addresses, provided code is compartmentalized properly it should be possible to free more code, or mark it not present, after run time moments other than init. The current code assumes it does not have to deal with concurrency; this will need to be addressed if it is to be repurposed.
For instance, the existing code that free init code uses:
void free_initmem(void)
{
	free_init_pages("unused kernel",
			(unsigned long)(&__init_begin),
			(unsigned long)(&__init_end));
}
The custom Linux section freed here is delineated by __init_begin and __init_end.
The hunt for dead code
Sometimes finding code that should not run for a run time environment may involve rather easy code inspection. Other times this may be a bit more complex. We need toolboxes to help enable developers identify dead code easily. This section is dedicated to ideas to help in this direction.
Using kprobes
A simple solution one can devise is to attach a kprobe to every exported symbol in the Linux kernel, run a battery of run time tests and workloads that mimic all expected uses, and add each exported symbol to a linked list once it is used. At the end of the test, one can deduce which exported symbols went unused. This gives a basic idea of what code should be inspected.
Using eBPF
This is so far just a list of ideas.
eBPF programs can attach to tracepoints using debugfs via the tracepoint "filter" file - write bpf_ID to the file. These tracepoints could replicate "trigger" moments indicative of key run time turning points, used to annotate that a very specific run time condition has been reached. For instance, if we know we do not need a legacy piece of code, can we use an eBPF program to trigger a call into the kernel to then free that legacy code?
Since we can use eBPF to "profile" random pieces of code, we should be able to build eBPF maps with counters on sequences we know should trigger at least once; otherwise certain code might not be useful for the current run time.