KernelMemoryAllocation - Linux Kernel Newbies

by Arnout Vandecappelle, Mind

In the kernel, malloc() is not available. Instead, the kernel defines its own memory allocation functions, and several different allocation mechanisms exist. This article gives an overview of them.

References

The memory manager is discussed as part of an introductory course.

http://linux-mm.org/LinuxMMInternals is a wiki about the kernel memory manager.

http://www.win.tue.nl/~aeb/linux/lk/lk-9.html and http://www.linuxjournal.com/article/6930 give an overview of the three main kernel memory allocation mechanisms.

http://www.informit.com/content/images/0131453483/downloads/gorman_book.pdf is a complete book on the Linux kernel memory manager. It is more detailed than most readers will need, though.

Summary

All allocations are served from one of three zones: ZONE_DMA (memory reachable by ISA DMA), ZONE_NORMAL, and ZONE_HIGHMEM (memory that is not permanently mapped into the kernel's address space, so it must be entered into the kernel page tables before the kernel can access it; it is needed for large memory on 32-bit machines).

alloc_bootmem_...(): allocator used at boot time. This code is deleted after initialisation!
__get_free_pages(): get a power-of-two number of physically contiguous pages. Use get_order() to convert a size in bytes into the order (the base-2 logarithm of the page count) that the function expects. Sizes up to about 8MiB are OK.
kmalloc(): get any size (the request is actually rounded up, generally to a power of two, and served from the default slab caches). Maximum size is usually 128KiB.
kmem_cache_alloc(): get predefined size from a kmem_cache (allocates extra slabs as needed).
vmalloc(): allocate contiguous virtual memory, which corresponds to non-contiguous physical memory. Use instead of kmalloc for large chunks of data (getting many contiguous physical pages leads to external fragmentation).
Various parts of the kernel have their own allocators, often built on kmem_cache slabs. They provide their own allocation functions, which often perform some additional management work (e.g. updating a list).
request_mem_region(): reserves a specific physical address range for device I/O. Use ioremap() to map it into the kernel's virtual address space, then access it with ioread8(), iowrite8(), memset_io(), memcpy_toio() and memcpy_fromio().
remap_pfn_range(): reserves a virtual address range and maps it to a given range of physical pages. Pages must have been allocated already. Typically used for implementing user-space mmap() for a device. Allows direct access to memory-mapped I/O from user space.
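
The size rounding done by the page allocator and by kmalloc() can be illustrated with a small user-space model. This is only a sketch, not kernel code: the names model_get_order() and model_round_up_pow2() are hypothetical, and a page size of 4KiB is assumed (the real value is architecture-dependent).

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12                          /* assumption: 4KiB pages */
#define MODEL_PAGE_SIZE  ((size_t)1 << MODEL_PAGE_SHIFT)

/*
 * Hypothetical model of get_order(): the smallest order such that
 * (MODEL_PAGE_SIZE << order) is at least `size` bytes.
 */
static unsigned int model_get_order(size_t size)
{
    assert(size > 0);

    /* Number of whole pages needed to hold `size` bytes. */
    size_t pages = (size + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;
    unsigned int order = 0;

    while (((size_t)1 << order) < pages)
        order++;
    return order;
}

/*
 * Hypothetical model of power-of-two rounding: the number of bytes a
 * power-of-two allocator would consume for a request of `size` bytes.
 */
static size_t model_round_up_pow2(size_t size)
{
    assert(size > 0);

    size_t n = 1;

    while (n < size)
        n <<= 1;
    return n;
}
```

In this model a request for 3 pages has order 2 and therefore occupies 4 pages, and a 129-byte request occupies 256 bytes. This internal waste is one reason to use a dedicated kmem_cache for objects whose size is known in advance.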

HIGHMEM

See http://linux-mm.org/HighMemory.

The Linux kernel normally uses a very simple way to map virtual to physical addresses: subtract PAGE_OFFSET (0xC0000000 on 32-bit x86). However, that leaves only 1GiB of addressable space for the kernel. Therefore, the kernel defines high memory: physical memory that does not fit into this direct mapping. When high memory is allocated, it is not directly addressable. To address it, kmap() first has to be called to enter the memory page into the kernel page table. The returned address is then valid until kunmap() is called. Every access to the page has to be bracketed by a kmap() / kunmap() pair.
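
The direct mapping described above is plain arithmetic, which the following user-space sketch models. It is not kernel code: the function names are hypothetical, the value 0xC0000000 assumes 32-bit x86 with the usual 3GiB/1GiB split, and the model ignores the part of the kernel address space that the real kernel sets aside for vmalloc() and other mappings.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 32-bit x86 with a 3GiB/1GiB user/kernel split. */
#define MODEL_PAGE_OFFSET 0xC0000000u

/* Direct-mapped kernel virtual address to physical address. */
static uint32_t model_virt_to_phys(uint32_t vaddr)
{
    /* Addresses below PAGE_OFFSET belong to user space. */
    assert(vaddr >= MODEL_PAGE_OFFSET);
    return vaddr - MODEL_PAGE_OFFSET;
}

/* Physical address to direct-mapped kernel virtual address. */
static uint32_t model_phys_to_virt(uint32_t paddr)
{
    /*
     * Only the first 4GiB - PAGE_OFFSET = 1GiB of physical memory
     * fits; anything above it would be high memory.
     */
    assert(paddr < (uint32_t)(0u - MODEL_PAGE_OFFSET));
    return paddr + MODEL_PAGE_OFFSET;
}
```

In this model, physical address 0x00100000 appears at kernel virtual address 0xC0100000, and any physical address at or above 0x40000000 has no direct-mapped address at all, which is why such pages need kmap().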

High memory is mostly relevant for I/O buffers to mass storage devices: they require a lot of kernel space and may eat up the 1GiB address space. The kernel provides an additional feature, bounce buffers (cf. bounce_buffer_create), to manage this type of buffer on large-memory systems.

DMA

https://elixir.bootlin.com/linux/latest/source/Documentation/core-api/dma-api.rst and https://elixir.bootlin.com/linux/latest/source/Documentation/core-api/dma-api-howto.rst in the kernel source tree document how to do DMA. There is a large overlap in the content of the two documents. dma-api.rst is a bit more high-level. However, dma-api-howto.rst contains some good skeleton code you can start from when writing a driver.

DMA requires memory that can be accessed by the hardware (which often requires it to be in the ZONE_DMA memory region), that is not cached, and that is physically contiguous. Therefore, drivers of DMA hardware use dma_alloc_coherent() to allocate DMA-able space. For DMA over the PCI bus, older kernels provided pci_alloc_consistent(), a wrapper that has since been removed in favour of calling dma_alloc_coherent() directly. For USB, use usb_alloc_coherent() (called usb_buffer_alloc() in older kernels). Note that you still need to use memory barriers to make sure the accesses are not reordered by the processor. Basically, the only thing guaranteed here is that the DMA region is uncacheable.

How this coherency/consistency is guaranteed is processor-dependent, so these functions are implemented in the architecture-specific directories.

Since dma_alloc_coherent() allocates at least a full page, use dma_pool_create() to allocate space for smaller transfers. Then, take some space from the pool with dma_pool_alloc().

Since a cache-coherent mapping may be expensive, a streaming mapping also exists. This is a buffer for one-way communication, which means coherency is limited to flushing the data from the cache after a write finishes. The buffer has to be allocated beforehand (e.g. using kmalloc()). DMA for it is set up with dma_map_single(). When the DMA is finished (e.g. when the device has sent an interrupt signalling end of DMA), call dma_unmap_single(). Between map and unmap, the device is in control of the buffer: if the transfer sends data to the device, fill the buffer before calling dma_map_single(); if it receives data from the device, read the buffer only after dma_unmap_single().

Streaming DMA may use bounce buffers if necessary (i.e. if the physical address is not accessible by the device DMA, as specified by the DMA mask set for the device by dma_set_mask()). Bounce buffers require extra memory-to-memory copies. This is an issue on large-memory systems with devices that can address only 32 bits or fewer. Note that the implementation of dma_unmap_single() is architecture-specific and may not include bounce buffers (e.g. on x86 it doesn't and there's no check).
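
Whether a buffer needs bouncing comes down to comparing its physical address range with the device's DMA mask. The user-space sketch below models that comparison only. It is not the kernel's implementation: the function names are hypothetical, and model_dma_bit_mask() merely mirrors what the kernel's DMA_BIT_MASK(n) macro computes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mask with the low n bits set, like the kernel's DMA_BIT_MASK(n). */
static uint64_t model_dma_bit_mask(unsigned int n)
{
    assert(n >= 1 && n <= 64);
    return (n == 64) ? ~0ULL : ((1ULL << n) - 1);
}

/* True if every byte of [addr, addr + size) is addressable under mask. */
static bool model_dma_reachable(uint64_t addr, size_t size, uint64_t mask)
{
    assert(size > 0);

    uint64_t last = addr + size - 1;   /* address of the final byte */

    /* `last >= addr` rejects ranges that wrap around 2^64. */
    return last >= addr && last <= mask;
}

/* A buffer the device cannot reach has to go through a bounce buffer. */
static bool model_needs_bounce(uint64_t addr, size_t size, uint64_t mask)
{
    return !model_dma_reachable(addr, size, mask);
}
```

With a 32-bit mask, a page ending exactly at the 4GiB boundary is still reachable, while a buffer that crosses the boundary or lies entirely above it would need a bounce buffer in this model.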

If the buffer is not physically contiguous, it must be passed through a scatter/gather list. Use dma_map_sg() instead of dma_map_single().

If you're doing a lot of DMA, you would normally have a sequence of map-unmap-map-unmap requests. Rather than unmapping, you can keep the address mapped and just synchronise with dma_sync_single_for_cpu() or dma_sync_single_for_device(), as appropriate.
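
The ownership rules for a streaming buffer can be summarised as a small state machine. The user-space sketch below is only a model for reasoning about who may touch the buffer at each step. The struct and function names are hypothetical; each function stands in for the kernel call named in its comment and performs no real DMA.

```c
#include <assert.h>
#include <stdbool.h>

enum model_owner { MODEL_OWNER_CPU, MODEL_OWNER_DEVICE };

/* Hypothetical bookkeeping for one streaming DMA buffer. */
struct model_mapping {
    bool mapped;
    enum model_owner owner;
};

/* Stands in for dma_map_single(): the device takes over the buffer. */
static void model_map(struct model_mapping *m)
{
    assert(!m->mapped);
    m->mapped = true;
    m->owner = MODEL_OWNER_DEVICE;
}

/* Stands in for dma_sync_single_for_cpu(): CPU may access, still mapped. */
static void model_sync_for_cpu(struct model_mapping *m)
{
    assert(m->mapped);
    m->owner = MODEL_OWNER_CPU;
}

/* Stands in for dma_sync_single_for_device(): hand back to the device. */
static void model_sync_for_device(struct model_mapping *m)
{
    assert(m->mapped);
    m->owner = MODEL_OWNER_DEVICE;
}

/* Stands in for dma_unmap_single(): the mapping ends, the CPU owns it. */
static void model_unmap(struct model_mapping *m)
{
    assert(m->mapped);
    m->mapped = false;
    m->owner = MODEL_OWNER_CPU;
}

/* The CPU may read or write the buffer only while it owns it. */
static bool model_cpu_may_touch(const struct model_mapping *m)
{
    return m->owner == MODEL_OWNER_CPU;
}
```

A driver that reuses one buffer for many transfers would, in this model, call model_map() once, alternate model_sync_for_cpu() and model_sync_for_device() around each transfer, and call model_unmap() only when it is done with the buffer.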
