= Kernel sandboxing =

This page documents a proposal for a series of new kernel strategies to help sandbox the kernel. This is very different from sandboxing userspace from the kernel: it is about sandboxing the kernel from itself.

[[TableOfContents(4)]]

Major Linux distributions tend to prefer to ship and support only one binary kernel across a slew of different environments. This often means a lot of code ends up enabled as modules even if unused, while other code ends up built-in, as part of vmlinux, even if it is never used. This presents a few issues:

 * Code size concerns - not a critical concern except on tiny Linux systems
 * Dead code concerns - we sometimes cannot be sure that code which should not run never runs
 * Security concerns - unused code left enabled increases the attack vectors in the kernel

Kernel sandboxing strategies try to address these problems through a series of techniques and toolboxes.

= Code concerns =

First we must understand each of the code concerns listed above.

== Code size concerns ==

The [http://tiny.wiki.kernel.org Linux kernel tinification effort] ('make tinyconfig') helps produce a minimal kernel. This effort is limited to what can be disabled through Kconfig. There is potential to reduce the memory footprint further, for both code and data, when we determine at run time that certain code is not needed.

== Dead code concerns ==

In theory, code that should not run does not run, but it would be much better if we could say that code that should not run cannot run. One of the best examples is code built into the kernel to support different types of virtualization environments when you know you are running on bare metal. Likewise, if you are running in a virtualized environment there is code designed only for bare metal that we know must not run.
This is best explained in the post [http://www.do-not-panic.com/2015/12/avoiding-dead-code-pvops-not-silver-bullet.html "Avoiding dead code, pvops is not the silver bullet"].

== Security code concerns ==

Leaving code enabled that we know cannot be useful after a specific point at run time only increases the attack vectors available on a system. We should be able to figure out when certain code is no longer needed, and for some mechanisms in the kernel heuristics to determine this should already be available. At that point the kernel should be able to completely disarm such code.

= Current available solutions =

This list covers the mechanisms currently used in the kernel to help sandbox it. Their limitations should be kept in mind when considering solutions to address shortcomings at run time.

== Kconfig ==

Kconfig can be used to avoid compiling and linking into the kernel functionality you know you do not need. Kconfig is limited in that you need to know in advance which features you do not want enabled. Disabling a lot of Kconfig options also reduces the flexibility of a distribution's final kernel. For this reason a lot of Linux distributions rely on module support, which loads required functionality at run time only when needed. Often certain code cannot be made modular though, so Linux distributions wishing to enable that functionality have no option but to enable it as built-in.

== Kernel parameters ==

Kernel parameters enable dynamic tuning of functionality both for built-in code and for modules (as module parameters). Kernel parameters are used at run time; it is up to the user to enable or disable certain parameters to customize run time functionality.
Even though kernel and module parameters can be used to disable certain functionality at run time, the disabled code is typically still present at run time. It is up to the implementation to ensure, by analyzing code flow, that disabled code will never run.

== Binary patching ==

Certain features are critical to performance, and relying on branches and variables at run time to determine which code path to take degrades performance. The Linux kernel supports modifying critical code at run time to avoid unnecessary branching. Binary patching is supported on different architectures on Linux. For details on the x86 implementation refer to the [https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/alternative.c arch/x86/kernel/alternative.c] file.

Binary patching takes place towards the end of start_kernel() in [https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/init/main.c init/main.c], in check_bugs(), shortly before the first userspace init process is started. This calls your architecture specific check_bugs(). For instance on x86_64 this is implemented in [https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/bugs_64.c arch/x86/kernel/cpu/bugs_64.c], which eventually calls alternative_instructions().

Alternatives are implemented using custom ELF sections; x86 currently has 2 dedicated alternative sections. These ELF sections are populated with struct alt_instr data. The custom linker script ([https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/vmlinux.lds.S arch/x86/kernel/vmlinux.lds.S]) on Linux has these defined as:

{{{
	/*
	 * start address and size of operations which during runtime
	 * can be patched with virtualization friendly instructions or
	 * baremetal native ones. Think page table operations.
	 * Details in paravirt_types.h
	 */
	. = ALIGN(8);
	.parainstructions : AT(ADDR(.parainstructions) - LOAD_OFFSET) {
		__parainstructions = .;
		*(.parainstructions)
		__parainstructions_end = .;
	}

	/*
	 * struct alt_inst entries. From the header (alternative.h):
	 * "Alternative instructions for different CPU types or capabilities"
	 * Think locking instructions on spinlocks.
	 */
	. = ALIGN(8);
	.altinstructions : AT(ADDR(.altinstructions) - LOAD_OFFSET) {
		__alt_instructions = .;
		*(.altinstructions)
		__alt_instructions_end = .;
	}

	/*
	 * And here are the replacement instructions. The linker sticks
	 * them as binary blobs. The .altinstructions has enough data to
	 * get the address and the length of them to patch the kernel safely.
	 */
	.altinstr_replacement : AT(ADDR(.altinstr_replacement) - LOAD_OFFSET) {
		*(.altinstr_replacement)
	}
}}}

The alternative_instructions() function then has:

{{{
void __init alternative_instructions(void)
{
	...
	apply_alternatives(__alt_instructions, __alt_instructions_end);
	...
	apply_paravirt(__parainstructions, __parainstructions_end);
}
}}}

The alternatives implementation makes it possible to no-op unused functionality, but only for the remainder of a patch site that the replacement instructions do not fill. It is also only used for very critical sections of the kernel.

= Candidate code review =

Below is a list of code we have identified as a possible candidate for the prospective solutions we are working on to reduce kernel size, address dead code, or address security concerns at run time.

== pvops ==

Linux distributions using pvops enable a lot of code which always remains available even after we have discovered which virtualization environment we are on. We do not need code for other virtualization environments once we know our environment. For some code it may be easy to figure out what is built-in and Xen-specific or bare-metal-only, but this is not always true.
For instance, although we know some bare-metal-only code should not run on Xen, we have no guarantee that it does not. We need a mechanism that supports a staging / debugging phase in which the kernel does not fault but simply logs a warning in dmesg when code not intended to run is found running. After a phase of building certainty and fixing code, such code can be discarded if such mechanisms exist. Likewise, when we are on bare metal we know we do not need any of the built-in Xen code. The same applies to the other virtualization environments supported through pvops.

== ACPI legacy ==

The [http://www.acpi.info/DOWNLOADS/ACPIspec50.pdf ACPI 5.2.9.3 IA-PC Boot Architecture Flags] section documents a series of flags for legacy x86 mechanisms (such things also exist for ARM, so review those too). If ACPI annotates certain functionality as not needed then, even if we built it into the kernel, once these flags are cleared we know we do not need such legacy functionality. A legacy free system is one that lacks any legacy requirements. Linux currently makes use of only a few of these flags. We should take advantage of them, not only by parsing the ACPI table to vet for more, but also by considering how to address code we do not need.

{{{
/* Masks for FADT IA-PC Boot Architecture Flags (boot_flags) [Vx]=Introduced in this FADT revision */

#define ACPI_FADT_LEGACY_DEVICES    (1)         /* 00: [V2] System has LPC or ISA bus devices */
#define ACPI_FADT_8042              (1<<1)      /* 01: [V3] System has an 8042 controller on port 60/64 */
#define ACPI_FADT_NO_VGA            (1<<2)      /* 02: [V4] It is not safe to probe for VGA hardware */
#define ACPI_FADT_NO_MSI            (1<<3)      /* 03: [V4] Message Signaled Interrupts (MSI) must not be enabled */
#define ACPI_FADT_NO_ASPM           (1<<4)      /* 04: [V4] PCIe ASPM control must not be enabled */
#define ACPI_FADT_NO_CMOS_RTC       (1<<5)      /* 05: [V5] No CMOS real-time clock present */
}}}

== Jump labels ==

After we dynamically branch into the code we know runs, we know we no longer need the code that was passed over in its favor.
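
As one concrete illustration for the candidates in this section, the IA-PC boot flags from the ACPI legacy listing can be tested with plain bit masks. The following is a minimal userspace sketch, not kernel code: the helper names (fadt_has_8042() and friends) are hypothetical, and treating "legacy free" as "no LPC/ISA devices and no 8042 controller" is an assumption made only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Masks copied from the FADT boot_flags listing on this page. */
#define ACPI_FADT_LEGACY_DEVICES    (1)
#define ACPI_FADT_8042              (1<<1)
#define ACPI_FADT_NO_VGA            (1<<2)
#define ACPI_FADT_NO_MSI            (1<<3)
#define ACPI_FADT_NO_ASPM           (1<<4)
#define ACPI_FADT_NO_CMOS_RTC       (1<<5)

/* Hypothetical helper: firmware reports an 8042 controller on port 60/64. */
bool fadt_has_8042(uint16_t boot_flags)
{
	return (boot_flags & ACPI_FADT_8042) != 0;
}

/* Hypothetical helper: a CMOS RTC is present unless the NO_CMOS_RTC bit is set. */
bool fadt_has_cmos_rtc(uint16_t boot_flags)
{
	return (boot_flags & ACPI_FADT_NO_CMOS_RTC) == 0;
}

/* Hypothetical helper: probing for VGA is allowed unless NO_VGA is set. */
bool fadt_may_probe_vga(uint16_t boot_flags)
{
	return (boot_flags & ACPI_FADT_NO_VGA) == 0;
}

/*
 * Assumption for this sketch only: a system is "legacy free" when it
 * reports neither LPC/ISA bus devices nor an 8042 controller.
 */
bool fadt_is_legacy_free(uint16_t boot_flags)
{
	return (boot_flags & (ACPI_FADT_LEGACY_DEVICES | ACPI_FADT_8042)) == 0;
}
```

Code guarded by such checks is a natural first candidate for being discarded or disarmed once the flags have been parsed.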

= Prospective solutions =

To help address the gaps in the existing solutions, here are a few solutions currently being worked on or already submitted for review for inclusion upstream.

== Linker tables and lightweight feature graph ==

The Linux kernel has a series of custom ELF sections spread across the kernel. Each time a custom ELF section is added, the custom linker script (for x86 this is [https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/vmlinux.lds.S arch/x86/kernel/vmlinux.lds.S]) must be extended to include the start and end address of the section. Code is then annotated accordingly. A generic solution, called linker tables, is being proposed upstream. The effort is based on iPXE's own linker table solution, now heavily modified to suit Linux. The work has gone through two patch series so far. The first incarnation followed iPXE's original linker table implementation, extended to fit Linux, and had its own lightweight, optimized run time sort, originally based on the existing Linux kernel IOMMU initialization code but with stronger semantics. By leveraging both efforts together with its own extensions it enables:

 * A generic, simple link-time sort of code, based on a priority number
 * The ability to sort code further at run time, later in boot
 * Further run time fine tuning of the semantics of dependencies / relationships between code
 * A new table-y Makefile entry, as an alternative to obj-y, so that code is always compiled but only linked in when mandated by Kconfig, which avoids code rotting

Following discussion of the first series, its patches were split into 3 new categories for the second version. A third version is being worked on right now, with more functionality and with the changes split further into more categories. Below is the list of the original series and the follow-up work.
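
Before the patch lists, the core idea of a table collected by the linker and sorted by priority can be sketched in userspace. This is only an illustration under stated assumptions, not the proposed kernel API: struct feature, DECLARE_FEATURE(), the section name feature_tbl and run_features() are all hypothetical. It relies on GCC or Clang section attributes and on the GNU-style linker behavior of providing __start_SECTION and __stop_SECTION symbols for ELF sections whose name is a valid C identifier, so no linker script change is needed. The sort here happens at run time with qsort(), not at link time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table entry: a named init hook with a priority. */
struct feature {
	const char *name;
	int priority;		/* lower priority value runs first */
	void (*init)(void);
};

/*
 * Place a pointer to each entry in the ELF section "feature_tbl".
 * Pointers (not structs) are stored so the section forms a gap-free array.
 */
#define DECLARE_FEATURE(name_, prio_, fn_)				\
	static struct feature feature_##name_ = { #name_, prio_, fn_ };	\
	static struct feature *feature_ptr_##name_			\
		__attribute__((used, section("feature_tbl"))) = &feature_##name_

/* Section bounds, provided by the linker (toolchain assumption, see above). */
extern struct feature *__start_feature_tbl[];
extern struct feature *__stop_feature_tbl[];

/* Records the order in which the hooks ran, one character per hook. */
char run_log[16];

static void log_char(char c)
{
	size_t len = strlen(run_log);

	assert(len + 1 < sizeof(run_log));
	run_log[len] = c;
	run_log[len + 1] = '\0';
}

static void init_c(void) { log_char('c'); }
static void init_a(void) { log_char('a'); }
static void init_b(void) { log_char('b'); }

/* Declared out of priority order on purpose. */
DECLARE_FEATURE(gamma, 30, init_c);
DECLARE_FEATURE(alpha, 10, init_a);
DECLARE_FEATURE(beta, 20, init_b);

static int cmp_priority(const void *a, const void *b)
{
	const struct feature *fa = *(struct feature *const *)a;
	const struct feature *fb = *(struct feature *const *)b;

	return (fa->priority > fb->priority) - (fa->priority < fb->priority);
}

/* Sort the table by priority, run every hook, and return the entry count. */
size_t run_features(void)
{
	size_t n = (size_t)(__stop_feature_tbl - __start_feature_tbl);
	size_t i;

	run_log[0] = '\0';
	qsort(__start_feature_tbl, n, sizeof(*__start_feature_tbl), cmp_priority);
	for (i = 0; i < n; i++)
		__start_feature_tbl[i]->init();
	return n;
}
```

Calling run_features() walks every entry that any translation unit placed in the section, which is the property that makes a table attractive: adding a feature means adding one declaration, with no central list to edit.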

=== Userspace linker table solution ===

The Linux linker table solution was first developed in userspace for ease of testing and evolution. The goal is to keep this tree up to date to enable further extensions, mockups, and easier testing in userspace. Those interested can also fork it for integration into other projects.

 * [https://git.kernel.org/cgit/linux/kernel/git/mcgrof/linker-tables.git/ userspace linker table solution]

The history details exactly why certain things were changed, both where the code deviates from the iPXE solution and where it deviates from the semantics of the original IOMMU initialization code in Linux.

=== First iteration: linker table ===

The first series had a lot of functionality meshed together:

 * Linker tables for Linux - this first iteration required a custom change to each architecture's custom linker script for each new linker table you added
 * Lightweight feature graph based on the existing IOMMU initialization code with enhanced semantics
 * Rebranding of paravirt_enabled() - this was as per Konrad's original recommendation
 * Use of linker tables within x86's init entry points
 * First actual use of the x86 protocol hardware subarch for Xen

Below is the list of patches for the first iteration of work:

 * [http://lkml.kernel.org/r/1450217797-19295-1-git-send-email-mcgrof@do-not-panic.com RFC v1 0/8] x86/init: Linux linker tables]
 * [http://lkml.kernel.org/r/1450217797-19295-2-git-send-email-mcgrof@do-not-panic.com RFC v1 1/8] paravirt: rename paravirt_enabled to paravirt_legacy]
 * [http://lkml.kernel.org/r/1450217797-19295-3-git-send-email-mcgrof@do-not-panic.com RFC v1 2/8] tables.h: add linker table support]
 * [http://lkml.kernel.org/r/1450217797-19295-4-git-send-email-mcgrof@do-not-panic.com RFC v1 3/8] x86/boot: add BIT() to boot/bitops.h]
 * [http://lkml.kernel.org/r/1450217797-19295-5-git-send-email-mcgrof@do-not-panic.com RFC v1 4/8] x86/init: add linker table support]
 * [http://lkml.kernel.org/r/1450217797-19295-6-git-send-email-mcgrof@do-not-panic.com RFC v1 5/8] x86/init: move ebda reservations into linker table]
 * [http://lkml.kernel.org/r/1450217797-19295-7-git-send-email-mcgrof@do-not-panic.com RFC v1 6/8] x86/init: use linker table for i386 early setup]
 * [http://lkml.kernel.org/r/1450217797-19295-8-git-send-email-mcgrof@do-not-panic.com RFC v1 7/8] x86/init: user linker table for ce4100 early setup]
 * [http://lkml.kernel.org/r/1450217797-19295-9-git-send-email-mcgrof@do-not-panic.com RFC v1 8/8] x86/init: use linker table for mid early setup]

=== Second iteration: linker table ===

This was split up into a few patch sets.

Replace paravirt_enabled() and paravirt RTC check: it was determined that instead of rebranding paravirt_enabled() we should simply remove it. Details are documented at [http://kernelnewbies.org/KernelProjects/remove-paravirt-enabled remove-paravirt-enabled].

 * [http://lkml.kernel.org/r/1456212255-23959-1-git-send-email-mcgrof@kernel.org PATCH v3 00/11] x86/init: replace paravirt_enabled() were possible]
 * [http://lkml.kernel.org/r/1456212255-23959-2-git-send-email-mcgrof@kernel.org PATCH v3 01/11] x86/boot: enumerate documentation for the x86 hardware_subarch]
 * [http://lkml.kernel.org/r/1456212255-23959-3-git-send-email-mcgrof@kernel.org PATCH v3 02/11] tools/lguest: make lguest launcher use X86_SUBARCH_LGUEST explicitly]
 * [http://lkml.kernel.org/r/1456212255-23959-4-git-send-email-mcgrof@kernel.org PATCH v3 03/11] x86/xen: use X86_SUBARCH_XEN for PV guest boots]
 * [http://lkml.kernel.org/r/1456212255-23959-5-git-send-email-mcgrof@kernel.org PATCH v3 04/11] x86/init: make ebda depend on PC subarch]
 * [http://lkml.kernel.org/r/1456212255-23959-6-git-send-email-mcgrof@kernel.org PATCH v3 05/11] tools/lguest: force disable tboot and apm]
 * [http://lkml.kernel.org/r/1456212255-23959-7-git-send-email-mcgrof@kernel.org PATCH v3 06/11] apm32: remove paravirt_enabled() use]
 * [http://lkml.kernel.org/r/1456212255-23959-8-git-send-email-mcgrof@kernel.org PATCH v3 07/11] x86/tboot: remove paravirt_enabled()]
 * [http://lkml.kernel.org/r/1456212255-23959-9-git-send-email-mcgrof@kernel.org PATCH v3 08/11] x86/cpu/intel: replace paravirt_enabled() for f00f work around]
 * [http://lkml.kernel.org/r/1456212255-23959-10-git-send-email-mcgrof@kernel.org PATCH v3 09/11] x86/boot: add BIT() to boot/bitops.h]
 * [http://lkml.kernel.org/r/1456212255-23959-11-git-send-email-mcgrof@kernel.org PATCH v3 10/11] x86/rtc: replace paravirt rtc check with x86 specific solution]
 * [http://lkml.kernel.org/r/1456212255-23959-12-git-send-email-mcgrof@kernel.org PATCH v3 11/11] pnpbios: replace paravirt_enabled() check with subarch checks]

Review of the first linker table series led to a discussion about generalizing the solution even further: instead of requiring a data structure, allow the generalization to be used for any custom section in Linux, which also removes the need for custom changes to each architecture's linker script. Not having to modify the custom linker script is made possible by relying on existing standard ELF sections and just adding an entry dedicated to tables, which is sorted at link time.

Changes with proof of concept uses, replacing existing custom kernel hacks with the generic solution:

 * [http://lkml.kernel.org/r/1455889559-9428-1-git-send-email-mcgrof@kernel.org RFC v2 0/7] linux: add linker tables]
 * [http://lkml.kernel.org/r/1455889559-9428-2-git-send-email-mcgrof@kernel.org RFC v2 1/7] sections.h: add sections header to collect all section info]
 * [http://lkml.kernel.org/r/1455889559-9428-3-git-send-email-mcgrof@kernel.org RFC v2 2/7] tables.h: add linker table support]
 * [http://lkml.kernel.org/r/1455889559-9428-4-git-send-email-mcgrof@kernel.org RFC v2 3/7] firmware: port built-in section to linker table]
 * [http://lkml.kernel.org/r/1455889559-9428-5-git-send-email-mcgrof@kernel.org RFC v2 4/7] asm/sections: add a generic push_section_tbl()]
 * [http://lkml.kernel.org/r/1455889559-9428-6-git-send-email-mcgrof@kernel.org RFC v2 5/7] jump_label: port __jump_table to linker tables]
 * [http://lkml.kernel.org/r/1455889559-9428-7-git-send-email-mcgrof@kernel.org RFC v2 6/7] dynamic_debug: port to use linker tables]
 * [http://lkml.kernel.org/r/1455889559-9428-8-git-send-email-mcgrof@kernel.org RFC v2 7/7] kprobes: port to linker table]

x86 use of linker table:

 * [http://lkml.kernel.org/r/1455891343-10016-1-git-send-email-mcgrof@kernel.org RFC v2 0/6] x86/init: use linker table]
 * [http://lkml.kernel.org/r/1455891343-10016-2-git-send-email-mcgrof@kernel.org RFC v2 1/6] x86/boot: add BIT() to boot/bitops.h]
 * [http://lkml.kernel.org/r/1455891343-10016-3-git-send-email-mcgrof@kernel.org RFC v2 2/6] x86/init: use linker tables to simplify x86 init and annotate dependencies]
 * [http://lkml.kernel.org/r/1455891343-10016-4-git-send-email-mcgrof@kernel.org RFC v2 3/6] x86/init: move ebda reservations into linker table]
 * [http://lkml.kernel.org/r/1455891343-10016-5-git-send-email-mcgrof@kernel.org RFC v2 4/6] x86/init: use linker table for i386 early setup]
 * [http://lkml.kernel.org/r/1455891343-10016-6-git-send-email-mcgrof@kernel.org RFC v2 5/6] x86/init: user linker table for ce4100 early setup]
 * [http://lkml.kernel.org/r/1455891343-10016-7-git-send-email-mcgrof@kernel.org RFC v2 6/6] x86/init: use linker table for mid early setup]

=== Third iteration: linker tables ===

A third iteration is being worked on based on review. Here is a status update on the work:

==== Linker tables ====

The review of the linker table effort and its proof of concept use with existing custom sections revealed that there are really two types of custom sections used in Linux:

 * custom sections used to stuff code in
 * custom sections used as tables, which code later iterates over

The first use case only requires support for compartmentalizing code / data into sections. The second use case is more in line with the original intentions of linker tables. For this reason a generic section solution is being devised first. The run time sort functionality is simply not needed, so it will be removed. In the future it may be lifted / used for other subsystems, once a basic core generic custom section / linker table solution is merged.

==== Remove paravirt_enabled() ====

The discussion from the last series led to the conclusion that we do not want to use the x86 protocol subarch to replace paravirt_enabled(). Instead, Ingo recommended other alternatives, in particular devising x86 platform quirk and legacy fields and having each platform annotate its requirements / quirks. Refer to the page [http://kernelnewbies.org/KernelProjects/remove-paravirt-enabled remove paravirt_enabled] for more details.

Andy has posted the following patch set to replace paravirt_enabled in the x86 ESPFIX code. It has already been merged in Ingo's tree, and further work to replace the other uses is underway:

 * [http://lkml.kernel.org/r/cover.1456789731.git.luto@kernel.org PATCH v2 0/2] x86/entry/32: Get rid of paravirt_enabled in ESPFIX]
 * [http://lkml.kernel.org/r/5cf8d92df1ad2965a2d8cdbb466af04da8dbbbc1.1456789731.git.luto@kernel.org PATCH v2 1/2] x86/entry/32: Introduce and use X86_BUG_ESPFIX instead of paravirt_enabled]
 * [http://lkml.kernel.org/r/b8adc42d21ea64d84589f8ee7540f8299df21577.1456789731.git.luto@kernel.org PATCH v2 2/2] x86/asm-offsets: Remove PARAVIRT_enabled]

==== Use of linker tables ====

Here is a summary of the follow-up discussion on the basic proof of concept uses of linker tables:

 * firmware: it was asked that we just remove this code
 * jump label and dynamic debug: these seem like ideal candidates to port to the generic solution
 * kprobes: further testing was requested and completed successfully without issues. Since certain uses of __kprobes are going to be modified so that blacklisting is done as designed, it was asked that the conversion be deferred until that work is complete

== Compiler multiverse support ==

gcc could be extended to generalize the binary patching technique used in the Linux kernel so that any application can use it. What this provides is variable-optimized run time multiverse support. This alternative is documented separately, given that its prospective use goes well beyond the Linux kernel. For details refer to: http://kernelnewbies.org/KernelProjects/compiler-multiverse

== Full blown feature graph ==

Mauro seems to have a feature graph fully implemented for media drivers. This should be looked into for code with more complex relationship semantics than the simple feature graph originally proposed in the first iteration of linker tables for Linux (and/or already used upstream in the Linux IOMMU init solution).

== SAT solvers ==

As code and run time complexity grow, one may end up needing a SAT solver to address more complex run time code dependency relationships. Work on this front is starting with [http://kernelnewbies.org/KernelProjects/kconfig-sat kconfig-sat]; refer to [http://kernelnewbies.org/KernelProjects/linux-sat linux-sat] for other potential use cases.

= Freeing and marking code as not present =

The Linux kernel already has a solution in place to free code it does not need after boot, through free_init_pages(). This mechanism also has a debug feature that marks code as not present so that a page fault is triggered when debugging. We should be able to repurpose and generalize this to free other code, or mark other code as not present, at other specific points in time after init. Because the linker table work generalizes custom linker script sections, and because free_init_pages() relies on section begin/end addresses, you should be able to free more code or mark more code as not present at some specific run time moment other than init, provided you compartmentalized the code properly. The current code assumes you do not have to deal with concurrency; this will need to be addressed if it is going to be repurposed.

For instance, the existing code that frees init code, in [https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/x86/mm/init.c arch/x86/mm/init.c], uses:

{{{
void free_initmem(void)
{
	free_init_pages("unused kernel",
			(unsigned long)(&__init_begin),
			(unsigned long)(&__init_end));
}
}}}

The custom Linux section here is delineated by __init_begin and __init_end.

= The hunt for dead code =

Sometimes finding code that should not run in a given run time environment involves rather easy code inspection. Other times it may be more complex. We need toolboxes to help developers identify dead code easily. This section is dedicated to ideas in this direction.

== Using kprobes ==

A simple solution one can devise is to attach a kprobe to every exported symbol in the Linux kernel, then run a battery of run time tests and workloads that mimic all expected uses, adding each exported symbol to a linked list once it is used. At the end of the test, one can deduce which exported symbols are unused. This should give a basic idea of the code that should be inspected.

== Using eBPF ==

This is so far just a list of ideas.

eBPF programs can attach to tracepoints using debugfs, through the tracepoint "filter" file - write bpf_ID to the file. These tracepoints could act as "trigger" moments, indicative of key run time turning points, which can be used to annotate that a very specific run time condition has been reached. For instance, if we know we do not need a legacy piece of code, can we use an eBPF program to trigger a call into the kernel to then free such legacy code?

Since we can use eBPF to "profile" arbitrary pieces of code, we should be able to build eBPF maps with counters on sequences we know should trigger at least once; if they never do, the related code might not be useful for the current run time.
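
Both the kprobe and the eBPF ideas above reduce to the same bookkeeping: count hits per watched symbol, then report the symbols whose count stayed at zero. The following userspace sketch shows only that bookkeeping. It does not use the kprobes or eBPF APIs; record_hit() stands in for whatever a probe handler or eBPF program would do on a hit, and the symbol names in the watch list are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* One counter per watched symbol. */
struct sym_counter {
	const char *name;
	unsigned long hits;
};

/* Hypothetical watch list; a real run would enumerate exported symbols. */
static struct sym_counter watched[] = {
	{ "legacy_probe", 0 },
	{ "native_setup", 0 },
	{ "xen_only_setup", 0 },
};

#define NR_WATCHED (sizeof(watched) / sizeof(watched[0]))

static struct sym_counter *find_sym(const char *name)
{
	size_t i;

	for (i = 0; i < NR_WATCHED; i++)
		if (strcmp(watched[i].name, name) == 0)
			return &watched[i];
	return NULL;
}

/* Stand-in for a probe handler: returns 0 on success, -1 if not watched. */
int record_hit(const char *name)
{
	struct sym_counter *sym = find_sym(name);

	if (!sym)
		return -1;
	sym->hits++;
	return 0;
}

/* Returns 1 if the symbol is watched and was never hit, 0 otherwise. */
int is_unused(const char *name)
{
	struct sym_counter *sym = find_sym(name);

	return sym && sym->hits == 0;
}

/* Number of watched symbols that were never hit during the test run. */
size_t count_unused(void)
{
	size_t i, unused = 0;

	for (i = 0; i < NR_WATCHED; i++)
		if (watched[i].hits == 0)
			unused++;
	return unused;
}
```

Symbols still reported by is_unused() after a representative workload are only candidates for inspection: a zero count shows that the workload did not reach the code, not that the code can never run.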