Kernel sandboxing

This page documents a proposal for a series of new kernel strategies to help sandbox the kernel. This is very different from sandboxing userspace from the kernel: it is about sandboxing the kernel from itself.


Major Linux distributions prefer to ship and support a single binary kernel across a slew of different environments. This often means a lot of code ends up enabled as modules even if it is unused, and some code even ends up built-in and never used. This presents a few issues:

Code size concerns - not a critical concern except on tiny Linux systems
Dead code concerns - we sometimes cannot be sure that code which should not run never runs
Security concerns - certain unused code may be left enabled, which increases attack vectors in the kernel

Kernel sandboxing strategies try to address these problems through a series of techniques and tools.

Code concerns

First we must understand each of the code concerns listed above.

Code size concerns

The [http://tiny.wiki.kernel.org Linux kernel tinification effort] ('make tinyconfig') helps produce a minimal kernel. This effort is limited to what can be disabled through Kconfig. There is potential to reduce the memory footprint further, in both code and data, once we determine that certain code is not needed.

Dead code concerns

In theory, code that should not run should not run, but it would be much better if we could say that code that should not run cannot run. One of the best examples of code that should not run is code built into the kernel to support different types of virtualization environments when you know you are running on bare metal. Likewise, if you are running in a virtualized environment there is certain code designed only for bare metal that we know must not run. This is best explained in the post [http://www.do-not-panic.com/2015/12/avoiding-dead-code-pvops-not-silver-bullet.html "Avoiding dead code, pvops is not the silver bullet"].

Security concerns

Leaving code enabled which we know cannot be useful after a specific point in run time only increases the attack vectors available on a system. We should be able to figure out when we no longer need certain code, and for some mechanisms in the kernel there should already be heuristics available to determine this. At that point the kernel should be able to completely disarm such code.

Currently available solutions

This section lists the mechanisms currently used in the kernel to help sandbox it. The limitations of these mechanisms should be kept in mind when evaluating solutions that address their shortcomings at run time.

Kconfig

Kconfig can be used to keep functionality you know you do not need from being compiled and linked into the kernel. Kconfig is limited in that you need to know in advance which features you do not want enabled. Disabling a lot of Kconfig options also reduces the flexibility of a distribution's final kernel. For this reason a lot of Linux distributions rely on module support, which enables loading required functionality at run time, only when needed. Often, though, certain code cannot be made modular, and Linux distributions wishing to enable that functionality have no option but to enable a series of features that are only available as built-in.
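
As a rough illustration of what disabling a Kconfig option means for the resulting binary, the following sketch gates a feature behind a preprocessor symbol. CONFIG_DEMO_FEATURE and the demo_feature_*() names are hypothetical and exist only for this example; in a real kernel build the CONFIG_* macros come from the generated configuration rather than from a hand-written #define.

```c
/*
 * Hypothetical option, used for illustration only. Uncomment the
 * define to "enable" the feature, as a Kconfig selection would.
 */
/* #define CONFIG_DEMO_FEATURE 1 */

#ifdef CONFIG_DEMO_FEATURE
/* Real implementation: only compiled when the option is enabled. */
int demo_feature_run(int x)
{
	return x + 1;
}
#else
/* Stub: the real implementation is not compiled or linked at all. */
static inline int demo_feature_run(int x)
{
	(void)x;
	return -1; /* arbitrary "not available" value for this sketch */
}
#endif

/* Reports whether the feature was built into this binary. */
int demo_feature_built(void)
{
#ifdef CONFIG_DEMO_FEATURE
	return 1;
#else
	return 0;
#endif
}
```

With the symbol left undefined, callers of demo_feature_run() get the stub and the real implementation is absent from the object file. That is a strong guarantee, but the decision is fixed at build time, which is exactly the limitation described above.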

Kernel parameters

Kernel parameters enable dynamic tuning of code functionality, both for built-in code and for modules (as module parameters). Kernel parameters are used at run time; it is at the user's discretion to enable or disable certain parameters to customize run time functionality. Even though kernel and module parameters can be used to disable certain functionality at run time, the code being disabled is typically still technically available at run time. It is up to the implementation to ensure that disabled code will never run by analyzing code flow.
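
The difference between "disabled" and "unavailable" can be pictured with a small userspace sketch. Everything here is hypothetical: demo_enabled stands in for a kernel or module parameter (which the kernel would declare with module_param()), and the saved callback stands in for any code path that reaches the feature without consulting the switch.

```c
#include <stdbool.h>

/* Hypothetical run time switch, standing in for a kernel parameter. */
static bool demo_enabled = true;

/* Counts how often the gated code actually ran. */
static int demo_runs;

/* The gated functionality: present in the binary whatever the switch says. */
static int demo_do_work(int x)
{
	demo_runs++;
	return x * 2;
}

/* Flips the switch, as a parameter handler might. */
void demo_set_enabled(bool on)
{
	demo_enabled = on;
}

/* Regular entry point: honours the switch. */
int demo_entry(int x)
{
	if (!demo_enabled)
		return -1;
	return demo_do_work(x);
}

/*
 * A reference taken while the feature was still enabled, for example
 * a callback registered during initialization.
 */
static int (*demo_saved_callback)(int) = demo_do_work;

/* A code path that forgot to check the switch. */
int demo_call_saved_callback(int x)
{
	return demo_saved_callback(x);
}

/* Accessor so callers can observe whether the gated code ran. */
int demo_run_count(void)
{
	return demo_runs;
}
```

After demo_set_enabled(false), demo_entry() refuses to run the feature, yet demo_call_saved_callback() still executes it. Only an analysis of every path to demo_do_work() can show that the disabled code never runs.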

Binary patching

Certain features are critical to performance, and relying on branches and variables at run time to determine which code path to take degrades performance. The Linux kernel supports modifying critical code at run time to avoid unnecessary branching. Binary patching is supported on different architectures on Linux. For details on the x86 implementation refer to the [https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/alternative.c arch/x86/kernel/alternative.c] file. Binary patching takes place towards the end of start_kernel() in [https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/init/main.c init/main.c], right before the first init userspace process is started, through the call to check_bugs(). This calls your architecture-specific check_bugs(). For instance, on x86_64 this is implemented in [https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/bugs_64.c arch/x86/kernel/cpu/bugs_64.c], which eventually calls alternative_instructions().

Alternatives are implemented using custom ELF sections; x86 currently has 2 dedicated alternative sections. Each of these ELF sections holds struct alt_instr data. The custom linker script ([https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/vmlinux.lds.S arch/x86/kernel/vmlinux.lds.S]) on Linux defines them as:

        /*                                                                      
         * start address and size of operations which during runtime            
         * can be patched with virtualization friendly instructions or          
         * baremetal native ones. Think page table operations.                  
         * Details in paravirt_types.h                                          
         */                                                                     
        . = ALIGN(8);                                                           
        .parainstructions : AT(ADDR(.parainstructions) - LOAD_OFFSET) {         
                __parainstructions = .;                                         
                *(.parainstructions)                                            
                __parainstructions_end = .;                                     
        }                                                                       
                                                                                
        /*                                                                      
         * struct alt_inst entries. From the header (alternative.h):            
         * "Alternative instructions for different CPU types or capabilities"   
         * Think locking instructions on spinlocks.                             
         */                                                                     
        . = ALIGN(8);                                                           
        .altinstructions : AT(ADDR(.altinstructions) - LOAD_OFFSET) {           
                __alt_instructions = .;                                         
                *(.altinstructions)                                             
                __alt_instructions_end = .;                                     
        }                                                                       
                                                                                
        /*                                                                      
         * And here are the replacement instructions. The linker sticks         
         * them as binary blobs. The .altinstructions has enough data to        
         * get the address and the length of them to patch the kernel safely.   
         */                                                                     
        .altinstr_replacement : AT(ADDR(.altinstr_replacement) - LOAD_OFFSET) { 
                *(.altinstr_replacement)                                        
        }

alternative_instructions() then contains:

void __init alternative_instructions(void)                                      
{
   ...
   apply_alternatives(__alt_instructions, __alt_instructions_end);
   ...
   apply_paravirt(__parainstructions, __parainstructions_end);
}
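
The pattern of start and end symbols bracketing a custom section can be tried out in userspace. The sketch below is only an analogy and makes several assumptions: it needs a GNU-compatible toolchain targeting ELF (GCC or Clang with GNU ld or lld), it relies on the linker automatically defining __start_ and __stop_ symbols for sections whose name is a valid C identifier instead of using a custom linker script, and struct demo_site, DEMO_SITE() and the demo_sites section are hypothetical names that do not exist in the kernel.

```c
#include <stddef.h>

/* Hypothetical record type; far simpler than what the kernel stores. */
struct demo_site {
	int id;      /* which call site this is */
	int feature; /* feature the site depends on */
};

/*
 * Place one record into the custom section "demo_sites". The "used"
 * attribute keeps the compiler from discarding the otherwise
 * unreferenced variable.
 */
#define DEMO_SITE(name, site_id, feat)                          \
	static const struct demo_site name                      \
	__attribute__((used, section("demo_sites"))) = {        \
		site_id, feat                                   \
	}

DEMO_SITE(site_a, 1, 0);
DEMO_SITE(site_b, 2, 1);
DEMO_SITE(site_c, 3, 0);

/*
 * Section bounds, provided by the linker. They play the role that
 * __alt_instructions and __alt_instructions_end play in the kernel's
 * linker script.
 */
extern const struct demo_site __start_demo_sites[];
extern const struct demo_site __stop_demo_sites[];

/* Number of records the linker collected into the section. */
size_t demo_total_sites(void)
{
	return (size_t)(__stop_demo_sites - __start_demo_sites);
}

/* Walk the section from start to end, counting sites for one feature. */
size_t demo_count_sites(int feature)
{
	const struct demo_site *s;
	size_t n = 0;

	for (s = __start_demo_sites; s < __stop_demo_sites; s++)
		if (s->feature == feature)
			n++;
	return n;
}
```

The records are declared independently of one another, yet the loop sees them as one contiguous array. This is what lets a single call over a start and end pair reach every registered site without a central list.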

The alternatives implementation makes it possible to no-op unused functionality, but only for the remainder of a patch site that the replacement instruction does not fill. It is also only used for very critical sections of the kernel.
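
The replace-then-pad behaviour can be sketched on a plain byte buffer. This is a simplified userspace model, not the kernel's code: struct patch_site and apply_patches() are hypothetical, the feature mask stands in for CPU capability checks, and the padding uses the single-byte x86 NOP (0x90) for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NOP_BYTE 0x90 /* single-byte x86 NOP */

/* Hypothetical, simplified patch record; not the kernel's layout. */
struct patch_site {
	size_t offset;       /* start of the original bytes in the buffer */
	uint8_t orig_len;    /* length of the original instruction(s) */
	const uint8_t *repl; /* replacement bytes, NULL if there are none */
	uint8_t repl_len;    /* must not exceed orig_len */
	int feature;         /* bit number that selects this replacement */
};

/*
 * Apply every record in [start, end) whose feature bit is set in
 * 'features'. The replacement is copied over the original bytes and
 * whatever is left of the original site is filled with NOPs.
 * Returns the number of sites that were patched.
 */
size_t apply_patches(uint8_t *text, size_t text_len,
		     const struct patch_site *start,
		     const struct patch_site *end,
		     uint32_t features)
{
	const struct patch_site *p;
	size_t patched = 0;

	for (p = start; p < end; p++) {
		if (!(features & (1u << p->feature)))
			continue; /* feature absent: keep the original code */

		assert(p->repl_len <= p->orig_len);
		assert(p->offset + p->orig_len <= text_len);

		if (p->repl_len)
			memcpy(text + p->offset, p->repl, p->repl_len);
		memset(text + p->offset + p->repl_len, NOP_BYTE,
		       (size_t)(p->orig_len - p->repl_len));
		patched++;
	}
	return patched;
}
```

A site with an empty replacement is turned entirely into NOPs, which is the closest this mechanism comes to disarming code. Bytes outside the recorded sites are never touched, so anything that was not annotated as a patch site stays reachable.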

Prospective solutions

XXX