
The ultimate goal of the folios project is to turn struct page into:

struct page {
        u64 memdesc;
};

Bits 0-3 of memdesc are a type field that describes what the remaining bits are used for.

Type   Meaning    Remaining bits
0      Misc       See below
1      Buddy      Embedded struct buddy
2      File       Pointer to struct folio
3      Anon       Pointer to struct anon_folio
4      KSM        Pointer to struct ksm (TBD)
5      Slab       Pointer to struct slab
6      Bump       Pointer to struct bump (TBD)
7      Movable    Pointer to struct movable (TBD)
8      PageTable  Pointer to struct ptdesc
9      HWPoison   Pointer to struct hwpoison (TBD)
10     PerCPU     Pointer to struct pcpudesc (TBD)
11     AcctMem    Pointer to struct acctmem
12     ZPDesc     Pointer to struct zpdesc
13     Managed    Pointer to struct mgdesc
14-15  not yet assigned
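
As a rough illustration of the encoding in the table above, here is a userspace sketch that pulls the type out of the low four bits of a memdesc. The enum and helper names (MEMDESC_*, page_memdesc_type(), page_memdesc_payload()) are made up for this example and are not an existing kernel API; only the numeric values come from the table.

```c
#include <assert.h>
#include <stdint.h>

/* Type numbers come from the table; the identifier names are hypothetical. */
enum memdesc_type {
        MEMDESC_MISC      = 0,
        MEMDESC_BUDDY     = 1,
        MEMDESC_FILE      = 2,
        MEMDESC_ANON      = 3,
        MEMDESC_KSM       = 4,
        MEMDESC_SLAB      = 5,
        MEMDESC_BUMP      = 6,
        MEMDESC_MOVABLE   = 7,
        MEMDESC_PAGETABLE = 8,
        MEMDESC_HWPOISON  = 9,
        MEMDESC_PERCPU    = 10,
        MEMDESC_ACCTMEM   = 11,
        MEMDESC_ZPDESC    = 12,
        MEMDESC_MANAGED   = 13,
        /* 14-15 not yet assigned */
};

/* Bits 0-3 hold the type. */
#define MEMDESC_TYPE_MASK 0xfULL

/* The end-goal struct page: a single 64-bit memdesc. */
struct page {
        uint64_t memdesc;
};

/* Extract the type field from bits 0-3. */
static inline enum memdesc_type page_memdesc_type(const struct page *page)
{
        return (enum memdesc_type)(page->memdesc & MEMDESC_TYPE_MASK);
}

/* Everything except the type field; its meaning depends on the type. */
static inline uint64_t page_memdesc_payload(const struct page *page)
{
        return page->memdesc & ~MEMDESC_TYPE_MASK;
}
```

A caller would typically switch on page_memdesc_type() and only then interpret the payload as a pointer or as embedded data.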

Misc memory

Type 0 is used for memory which needs no further data associated with it. Bits 4-10 are used as a subtype to determine what the memory is used for:

Subtype  Meaning
0        PageReserved
1        A Zero Page (maybe PTE or PMD)
2        Unknown (probably device driver)
3        Vmalloc
4        Guard
5        Offline
6        kmalloc_large (unless acctmem)
7        Exact
8        brd
9-127    not yet assigned

Bit 11 is set if the page may be mapped to userspace.

Bits 12-17 are used to store the order of the page. The high bits are used to store section/node/zone information, as is done today with the page flags.
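
The bit layout described above for type 0 can be sketched as follows. The macro and function names are hypothetical; only the bit positions (subtype in bits 4-10, the userspace-mappable flag in bit 11, order in bits 12-17) come from the text. The section/node/zone bits are left out because their layout is not specified here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions from the text; names are made up for this sketch. */
#define MISC_SUBTYPE_SHIFT  4
#define MISC_SUBTYPE_MASK   0x7fULL     /* bits 4-10: 7 bits, subtypes 0-127 */
#define MISC_MAPPABLE_SHIFT 11          /* bit 11: may be mapped to userspace */
#define MISC_ORDER_SHIFT    12
#define MISC_ORDER_MASK     0x3fULL     /* bits 12-17: 6 bits */

/* A few subtype numbers from the table above. */
enum misc_subtype {
        MISC_RESERVED = 0,
        MISC_ZERO     = 1,
        MISC_UNKNOWN  = 2,
        MISC_VMALLOC  = 3,
};

/* Build a type-0 memdesc; bits 0-3 stay zero (type Misc). */
static inline uint64_t misc_memdesc(unsigned int subtype, bool mappable,
                                    unsigned int order)
{
        return ((uint64_t)(subtype & MISC_SUBTYPE_MASK) << MISC_SUBTYPE_SHIFT) |
               ((uint64_t)mappable << MISC_MAPPABLE_SHIFT) |
               ((uint64_t)(order & MISC_ORDER_MASK) << MISC_ORDER_SHIFT);
}

static inline unsigned int misc_subtype(uint64_t memdesc)
{
        return (memdesc >> MISC_SUBTYPE_SHIFT) & MISC_SUBTYPE_MASK;
}

static inline bool misc_user_mappable(uint64_t memdesc)
{
        return (memdesc >> MISC_MAPPABLE_SHIFT) & 1;
}

static inline unsigned int misc_order(uint64_t memdesc)
{
        return (memdesc >> MISC_ORDER_SHIFT) & MISC_ORDER_MASK;
}
```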

XXX: How to indicate that a page was allocated from reserves like pfmemalloc today?

NOTE! There is no refcount for this kind of memory! Nor mapcount! get_page() / put_page() will throw an error for them. You can free the pages, and they will go straight back to the page allocator. If you need a refcount, allocate a folio instead. You can still map them to userspace, but they will be treated like a PFNMAP.

struct buddy

Type 1 is used for pages which are in the MatthewWilcox/BuddyAllocator. This is either one or two words of data which is used to manage the pages (see the link for more detail).

Memdesc pointers

All structs pointed to from a memdesc must be allocated from a slab which has its alignment set to 16 bytes (in order to allow the bottom 4 bits to be used for the type). That implies that they are a multiple of 16 bytes in size. The slab must also have the TYPESAFE_BY_RCU flag set as some page walkers will attempt to look up the memdesc from the page while holding only the RCU read lock.
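
The alignment requirement is what makes the pointer/type packing work. Below is a userspace sketch: aligned_alloc() stands in for a slab cache created with 16-byte alignment, and it does not model TYPESAFE_BY_RCU at all. The helper names (memdesc_obj_alloc(), memdesc_pack(), memdesc_unpack()) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MEMDESC_ALIGN     16
#define MEMDESC_TYPE_MASK ((uintptr_t)(MEMDESC_ALIGN - 1))

/* Userspace stand-in for allocating from a 16-byte-aligned slab cache. */
static inline void *memdesc_obj_alloc(size_t size)
{
        /* aligned_alloc() wants the size to be a multiple of the alignment. */
        size_t rounded = (size + MEMDESC_ALIGN - 1) & ~(size_t)(MEMDESC_ALIGN - 1);

        return aligned_alloc(MEMDESC_ALIGN, rounded);
}

/* Combine a descriptor pointer and a 4-bit type into one memdesc word. */
static inline uint64_t memdesc_pack(const void *desc, unsigned int type)
{
        uintptr_t addr = (uintptr_t)desc;

        assert((addr & MEMDESC_TYPE_MASK) == 0);        /* low 4 bits must be free */
        assert(type <= MEMDESC_TYPE_MASK);
        return (uint64_t)addr | type;
}

/* Recover the descriptor pointer by masking off the type bits. */
static inline void *memdesc_unpack(uint64_t memdesc)
{
        return (void *)(uintptr_t)(memdesc & ~(uint64_t)MEMDESC_TYPE_MASK);
}
```

If a descriptor were only 8-byte aligned, bit 3 of its address could be set and would be misread as part of the type, which is why the cache alignment has to be 16.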

File and Anon

File and anon memory will both use struct folio (for now?)

struct folio {
    unsigned long flags;
    struct list_head lru;
    struct address_space *mapping;
    pgoff_t index;
    void *private;
    atomic_t _refcount;
    atomic_t _mapcount;
    atomic_t pincount;
    unsigned char order;
    unsigned long pfn;
    unsigned long memcg_data;
};

This looks very similar to today. The only additions are the pfn (which we can use to get the struct page if needed), the pincount and the order. This is 80 bytes, so we get 51 per 4KiB page.
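
To sanity-check the size arithmetic, here is a userspace sketch of the same layout. The definitions of list_head, atomic_t and pgoff_t are stand-ins sized as on a 64-bit (LP64) kernel, and folios_per_page() is a hypothetical helper; on a 32-bit target the numbers would differ.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for kernel types (assumption: LP64, 8-byte longs and pointers). */
struct list_head { struct list_head *next, *prev; };
typedef struct { int counter; } atomic_t;
typedef unsigned long pgoff_t;
struct address_space;                   /* opaque; only used through a pointer */

struct folio {
        unsigned long flags;            /*  8 bytes */
        struct list_head lru;           /* 16 bytes */
        struct address_space *mapping;  /*  8 bytes */
        pgoff_t index;                  /*  8 bytes */
        void *private;                  /*  8 bytes */
        atomic_t _refcount;             /*  4 bytes */
        atomic_t _mapcount;             /*  4 bytes */
        atomic_t pincount;              /*  4 bytes */
        unsigned char order;            /*  1 byte + 3 bytes padding */
        unsigned long pfn;              /*  8 bytes */
        unsigned long memcg_data;       /*  8 bytes */
};

#define SKETCH_PAGE_SIZE 4096UL

/* How many whole folio descriptors fit in one 4KiB page. */
static inline size_t folios_per_page(void)
{
        return SKETCH_PAGE_SIZE / sizeof(struct folio);
}
```

The 80-byte size is also a multiple of 16, which fits the alignment rule for structs pointed to from a memdesc.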

KSM

TBD

Slab

There's a minor recursion problem for the slab memdesc. This can be avoided by special-casing the struct slab allocation; any time we need to allocate a new slab for the slab memdesc cache, we _do not_ allocate a struct slab for it; we use the first object in the allocated memory for its own struct slab.

Bump

This was known as page_pool or netpool. Patches for this are forthcoming.

AcctMem

Folios and slabs are also accounted to a memcg, but if your code calls alloc_pages(GFP_KERNEL_ACCOUNT), we will allocate a struct acctmem and return a pointer to the first struct page, just as we do today. Only core code refers to the memcg_data today, so no device driver changes will be needed.

struct acctmem {
    unsigned long flags;
    unsigned long memcg_data;
};

One of the flags will denote kmalloc_large. Other uses of the flags field will include section/node/zone.

Managed memory

This is miscellaneous memory which needs a small amount of extra information, eg a list_head. We don't want/need to allocate a separate memdesc type for each of these; it's like a type 0, but with additional metadata.

struct mgdesc {
    unsigned long flags;
    unsigned long data[3];
};

The flags will contain an 8-bit field to allow the user to check that this really is their memory. The flags also contain section/node/zone, so users cannot use arbitrary bits in this field (a fuller description of how many bits are available in this field will exist at some point). The 3 data fields may be used by the owner without restriction.
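
The ownership check might look something like the sketch below. The position of the 8-bit field is not specified above, so placing it in the low 8 bits of flags is purely an assumption for illustration, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mgdesc {
        unsigned long flags;
        unsigned long data[3];
};

/*
 * ASSUMPTION: the owner tag lives in the low 8 bits of flags. The text only
 * says that an 8-bit field exists, not where it is.
 */
#define MGDESC_OWNER_MASK 0xffUL

/* Record which user owns this descriptor, leaving the other flag bits alone. */
static inline void mgdesc_set_owner(struct mgdesc *mg, unsigned char owner)
{
        mg->flags = (mg->flags & ~MGDESC_OWNER_MASK) | owner;
}

/* Check that this really is the caller's memory before touching data[]. */
static inline bool mgdesc_owned_by(const struct mgdesc *mg, unsigned char owner)
{
        return (mg->flags & MGDESC_OWNER_MASK) == owner;
}
```

A user would call mgdesc_owned_by() with its own tag before interpreting any of the three data words.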

Allocating memory

Device drivers that do not touch the contents of struct page can continue calling alloc_pages() as they do today. They will get back a struct page pointer which will have subtype "Unknown" but they won't care.

We'll add a new memdesc_alloc_pages() family which allocates the memory and sets each page->memdesc to the passed-in memdesc. So each memdesc allocator will first use slab to allocate a memdesc, then allocate the pages that point to that memdesc.
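
The two-step flow can be modelled in userspace as below. Everything here is a toy: calloc() stands in for the page allocator, aligned_alloc() for the slab cache, and the signature of memdesc_alloc_pages_sketch() is an assumption, since the real interface has not been defined yet.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct page {
        uint64_t memdesc;
};

/* Toy page allocator: returns 2^order zeroed struct pages. */
static inline struct page *toy_alloc_pages(unsigned int order)
{
        return calloc((size_t)1 << order, sizeof(struct page));
}

/* Hypothetical signature: allocate pages and point each one at the memdesc. */
static inline struct page *memdesc_alloc_pages_sketch(uint64_t memdesc,
                                                      unsigned int order)
{
        struct page *pages = toy_alloc_pages(order);

        if (!pages)
                return NULL;
        for (size_t i = 0; i < ((size_t)1 << order); i++)
                pages[i].memdesc = memdesc;
        return pages;
}

/* A memdesc allocator: descriptor first, then the pages that point to it. */
static inline struct page *alloc_with_desc(unsigned int type, unsigned int order,
                                           void **descp)
{
        void *desc = aligned_alloc(16, 32);     /* step 1: the memdesc itself */
        struct page *pages;

        if (!desc)
                return NULL;
        /* step 2: pages, each tagged with (pointer | type) */
        pages = memdesc_alloc_pages_sketch((uint64_t)(uintptr_t)desc | type, order);
        if (!pages) {
                free(desc);
                return NULL;
        }
        *descp = desc;
        return pages;
}
```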

alloc_pages_exact

We'll follow the same approach as today; allocate a buddy of sufficient size, but we'll expose a new primitive buddy_free_tail() which will release the tail pages back to the buddy allocator.
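
As a worked example of the arithmetic, the sketch below computes which order would be allocated for an exact-size request and how many tail pages would then be handed back. It only does the bookkeeping; it does not call buddy_free_tail(), whose signature is not defined here. The struct and helper names are hypothetical, and a 4KiB page size is assumed.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL

/* Smallest order such that 2^order pages cover nr_pages. */
static inline unsigned int order_for_pages(size_t nr_pages)
{
        unsigned int order = 0;

        while (((size_t)1 << order) < nr_pages)
                order++;
        return order;
}

struct exact_plan {
        unsigned int order;     /* order of the buddy block to allocate */
        size_t used_pages;      /* pages the caller actually keeps */
        size_t tail_pages;      /* pages to release back to the buddy allocator */
};

/* Plan an exact allocation of 'bytes' bytes (bytes must be non-zero). */
static inline struct exact_plan plan_exact(size_t bytes)
{
        struct exact_plan plan;

        assert(bytes > 0);
        plan.used_pages = (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
        plan.order = order_for_pages(plan.used_pages);
        plan.tail_pages = ((size_t)1 << plan.order) - plan.used_pages;
        return plan;
}
```

For a 20000-byte request this gives 5 used pages out of an order-3 block, so 3 tail pages would go back to the buddy allocator.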

Freeing memory

Folios (file/anon/ksm) have a refcount. These should be freed with folio_put(). Other memdescs may not have a refcount (eg slab).

Misc allocations (type 0) are simply passed back to the page allocator which will turn them into Buddy pages.

Splitting folios

It is going to be stupidly expensive to split an order-9 allocation into 512 order-0 allocations. We'll have to allocate 512 folios. We may want to optimise split_page() to only allocate folios for pages we're not going to immediately free.

Mapping memory into userspace

File, anon and KSM memory are rmappable. The rmap does not apply to other kinds of memory (networking, device driver, vmalloc, etc). These kinds of memory should be added to VM_MIXEDMAP or VM_PFNMAP mappings only.

Things to remember

ext4 attaches a buffer_head to memory allocated from slab. virt_to_folio() should probably return NULL in this case?

KernelNewbies: MatthewWilcox/Memdescs (last edited 2024-11-20 05:30:27 by MatthewWilcox)