The ultimate goal of the folios project is to turn struct page into:

{{{
struct page {
	unsigned long memdesc;
};
}}}
Bits 0-3 of memdesc are a ''type'' field that describes what the remaining bits are used for.
|| type || Meaning || Remaining bits ||
|| 0 || Irregular memory || See below ||
|| 1 || Buddy || Pointer to struct buddy ||
|| 2 || File || Pointer to struct folio ||
|| 3 || Anon || Pointer to struct anon_folio ||
|| 4 || KSM || Pointer to struct ksm (TBD) ||
|| 5 || Slab || Pointer to struct slab ||
|| 6 || Movable || Pointer to struct movable (TBD) ||
|| 7 || Page Table || Pointer to struct ptdesc ||
|| 8 || Net Pool || Pointer to struct netpool (TBD) ||
|| 9 || HWPoison || Pointer to struct hwpoison (TBD) ||
|| 10-15 || not yet assigned || ||
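For illustration, looking up a memdesc might work like this minimal sketch (the helper names are hypothetical; only the bit layout comes from the table above):

{{{
#define MEMDESC_TYPE_MASK	0xfUL

/* Hypothetical helper: extract the type field (bits 0-3). */
static inline unsigned int memdesc_type(const struct page *page)
{
	return page->memdesc & MEMDESC_TYPE_MASK;
}

/*
 * Hypothetical helper: for the pointer types (1-9), the remaining bits
 * are the pointer itself; 16-byte alignment keeps the low 4 bits zero.
 */
static inline void *memdesc_ptr(const struct page *page)
{
	return (void *)(page->memdesc & ~MEMDESC_TYPE_MASK);
}
}}}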
=== Irregular memory ===
Type 0 is used for pages which will never be freed. Bits 4-7 distinguish why this page is irregular:
|| subtype || Meaning ||
|| 0 || The Zero Page ||
|| 1 || ||
|| 2-15 || not yet assigned ||
Bits 8-13 are also used to store the order of the page. The high bits are used to store section/node/zone information, as is done today with the page flags.
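As a worked example, the zero page's memdesc could be assembled like this sketch (the function name is hypothetical, and the exact placement of the section/node/zone bits is left out):

{{{
/* Hypothetical: build an irregular (type 0) memdesc for the zero page. */
static void __init set_zero_page_memdesc(struct page *page, unsigned int order)
{
	unsigned long memdesc = 0;		/* bits 0-3: type 0, irregular */

	memdesc |= 0UL << 4;			/* bits 4-7: subtype 0, zero page */
	memdesc |= (unsigned long)order << 8;	/* bits 8-13: order */
	/* The high section/node/zone bits are omitted from this sketch. */
	page->memdesc = memdesc;
}
}}}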
== Memdesc pointers ==
All structs pointed to from a memdesc must be allocated from a slab which has its alignment set to 16 bytes (in order to allow the bottom 4 bits to be used for the type). That implies that they are a multiple of 16 bytes in size. The slab must also have the TYPESAFE_BY_RCU flag set as some page walkers will attempt to look up the memdesc from the page while holding only the RCU read lock.
Other than struct buddy, all memdescs must have a pointer to struct buddy as the first word. See below.
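For example, a cache for folio memdescs might be created like this (the cache name and init function are illustrative; kmem_cache_create() and SLAB_TYPESAFE_BY_RCU are today's slab API):

{{{
static struct kmem_cache *folio_cachep;

static int __init folio_cache_init(void)
{
	/* 16-byte alignment frees the low 4 bits for the memdesc type. */
	folio_cachep = kmem_cache_create("folio", sizeof(struct folio), 16,
					 SLAB_TYPESAFE_BY_RCU, NULL);
	return folio_cachep ? 0 : -ENOMEM;
}
}}}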
== struct buddy ==
{{{
struct buddy {
	unsigned long flags;
	struct list_head buddy_list;
	struct page *first;
};
}}}
This is used by the buddy allocator to track free pages. It is also used when the user does not need to store significant auxiliary information. The upper bits of the flags word contain section/node/zone information as page->flags does today. However the lower bits are quite different:
|| Bits || Meaning ||
|| 0-3 || Subtype ||
|| 4-9 || Order ||
|| 10-13 || Migratetype ||
|| 14 || pfmemalloc ||
|| 15 || Userspace mappable ||

|| Subtype || Meaning ||
|| 0 || Free ||
|| 1 || Device driver allocation ||
|| 2 || Vmalloc ||
|| 3 || Guard ||
|| 4 || Offline ||
|| 5 || kmalloc_large ||
|| 6 || PMD-sized Zero Page ||
|| 7 || Exact ||
|| 8-15 || not yet assigned ||
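Accessors for those fields might look like this sketch (the names are hypothetical; only the bit positions come from the table):

{{{
/* Hypothetical accessors for the low bits of buddy->flags. */
static inline unsigned int buddy_subtype(const struct buddy *buddy)
{
	return buddy->flags & 0xf;		/* bits 0-3 */
}

static inline unsigned int buddy_order(const struct buddy *buddy)
{
	return (buddy->flags >> 4) & 0x3f;	/* bits 4-9 */
}

static inline unsigned int buddy_migratetype(const struct buddy *buddy)
{
	return (buddy->flags >> 10) & 0xf;	/* bits 10-13 */
}
}}}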
== Allocating memory ==
Device drivers that do not touch the contents of struct page can continue calling alloc_pages() as they do today. They will get back a struct page pointer which will have a struct buddy memdesc, but they won't care.
We'll add a new buddy_alloc() which will return the buddy memdesc pointer instead. Each allocator can change the subtype from 'Free' to whatever subtype it has.
We'll add a new memdesc_alloc_pages() family which allocates the memory and sets each page->memdesc to the passed-in memdesc. So each memdesc allocator will first use slab to allocate a memdesc, then allocate the pages that point to that memdesc. As an optimisation, if sizeof(struct my_memdesc) is <= sizeof(struct buddy), we can avoid allocating a new memdesc and cast the pointer returned from buddy_alloc().
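Putting that together, a folio allocation might look roughly like this (everything beyond the slab calls is hypothetical, including the memdesc_alloc_pages() signature):

{{{
/* Sketch: allocate a folio memdesc, then pages that point at it. */
static struct folio *folio_alloc_sketch(gfp_t gfp, unsigned int order)
{
	struct folio *folio = kmem_cache_alloc(folio_cachep, gfp);

	if (!folio)
		return NULL;

	/* Type 2 (File): memdesc is the folio pointer plus the type bits. */
	if (memdesc_alloc_pages((unsigned long)folio | 2, gfp, order) < 0) {
		kmem_cache_free(folio_cachep, folio);
		return NULL;
	}

	return folio;
}
}}}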
=== alloc_pages_exact ===
We'll follow the same approach as today; allocate a buddy of sufficient size, but we'll expose a new primitive buddy_free_tail() which will release the tail pages back to the buddy allocator.
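A sketch of that, assuming buddy_alloc() takes a gfp mask and an order, and buddy_free_tail() takes the buddy and the number of pages to keep (both signatures are assumptions, as is buddy_set_subtype()):

{{{
/* Sketch only; the new primitives' real signatures are undecided. */
void *alloc_pages_exact(size_t size, gfp_t gfp)
{
	unsigned long nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
	struct buddy *buddy = buddy_alloc(gfp, get_order(size));

	if (!buddy)
		return NULL;

	/* Mark the allocation as subtype 7 (Exact)... */
	buddy_set_subtype(buddy, 7);
	/* ...then release the tail pages back to the buddy allocator. */
	buddy_free_tail(buddy, nr_pages);

	return page_address(buddy->first);
}
}}}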
=== Slab memdesc ===
There's a minor recursion problem for the slab memdesc. This can be avoided by special-casing the struct slab allocation; any time we need to allocate a new slab for the slab memdesc cache, we _do not_ allocate a struct slab for it; we use the first object in the allocated memory for its own struct slab.
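In sketch form (init_slab_struct() and its arguments are hypothetical):

{{{
/* Sketch: bootstrap path when the struct slab cache needs a new slab. */
static struct slab *alloc_slab_for_slab_cache(gfp_t gfp)
{
	struct buddy *buddy = buddy_alloc(gfp, 0);
	struct slab *slab;

	if (!buddy)
		return NULL;

	/* The first object in the new page serves as its own struct slab. */
	slab = page_address(buddy->first);
	init_slab_struct(slab, buddy);

	return slab;
}
}}}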
=== Buddy memdesc ===
When the page allocator wants to split an order-N page into two lower-order pages, it has to allocate a new struct buddy, which comes from the slab allocator. So when allocating a folio, we might see:
 * Allocate folio memdesc
 * Folio slab cache has no free objects, allocates a new slab (order 0)
 * The smallest order that the page allocator has on hand is order-3
 * Page allocator asks slab for 3 new buddy memdescs (one for each level)
 * Buddy slab cache has no free objects, allocates a new slab (order 0)
 * ... oh dear ...
Here are a few options for solving it.
 1. When allocating buddy slabs, slab could pass order -1 to the page allocator, which would mean to hand back _any_ order page. Slab would then use the first N objects in the first page to be the buddy memdescs for (in this example) pages 1, 2 and 4. Then it would give the extra pages back to the page allocator.
 2. The page allocator could avoid using the slab allocator to allocate buddies. They're 16 byte objects, and it's relatively easy to split a 4kB page into 256 equal pieces. But then the page allocator is duplicating something that slab is good at.
 3. Pass -1 as the order when allocating buddy slabs. Treat it as an order-0 slab and use buddy_free_pages() to free all but the first page. The page allocator will allocate fresh buddies from the slab allocator, which will be able to satisfy them. See the sketch after this list.
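A sketch of option 3 (every name here other than buddy_free_pages() is hypothetical):

{{{
/* Sketch: the buddy slab cache takes whatever order is available. */
static struct slab *alloc_buddy_slab(gfp_t gfp)
{
	unsigned int order;
	struct page *page = buddy_alloc_any_order(gfp, &order);
	struct slab *slab;

	if (!page)
		return NULL;

	/* Treat the first page as an ordinary order-0 slab of buddies. */
	slab = slab_init_order0(page, gfp);

	/* Give every page past the first back to the page allocator. */
	if (order > 0)
		buddy_free_pages(page + 1, (1UL << order) - 1);

	return slab;
}
}}}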
So the new worst-case scenario looks like:
 * Allocate folio memdesc
 * Folio slab cache has no free objects, allocates a new slab (order 0)
 * The smallest order that the page allocator has on hand is order-3
 * Page allocator asks slab for 3 new buddy memdescs (one for each level)
 * Buddy slab cache has no free objects, allocates a new slab (order -1)
 * Page allocator returns the order-3 buddy it was going to use for slab
 * Slab returns 7 pages to the page allocator
 * Page allocator allocates 3 buddies
 * Page allocator returns the order-0 page
 * Slab allocator gives itself the struct slab it needed
This can be made more complex by, e.g., an interrupt coming in and stealing all the buddy objects, but since we get 256 per 4kB page this is unlikely, and the consequence is simply going around the loop again.
== Freeing memory ==
Folios (file/anon/ksm) have a refcount. These should be freed with folio_put(). Other memdescs may not have a refcount.
Allocations which use a buddy memdesc are simply passed back to the page allocator. Other memdescs will copy the buddy pointer from the memdesc before RCU freeing it, and update each struct page with the buddy memdesc.
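The non-buddy path might look like this sketch (helper names are hypothetical, including buddy_free(); buddy_order() is the accessor sketched earlier, and the whole thing relies on the buddy pointer being the first word of every memdesc, as described above):

{{{
/* Sketch: free a non-buddy memdesc by reverting pages to its buddy. */
static void memdesc_free_sketch(struct page *page, void *memdesc,
				struct kmem_cache *cache)
{
	/* Every memdesc stores a pointer to its struct buddy first. */
	struct buddy *buddy = *(struct buddy **)memdesc;
	unsigned long i, nr = 1UL << buddy_order(buddy);

	/* Repoint each page at the buddy memdesc (type 1). */
	for (i = 0; i < nr; i++)
		page[i].memdesc = (unsigned long)buddy | 1;

	/* TYPESAFE_BY_RCU makes the immediate free safe for RCU walkers. */
	kmem_cache_free(cache, memdesc);
	buddy_free(buddy);	/* hand the pages back to the page allocator */
}
}}}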
== Memory control group ==
== Splitting folios ==
It is going to be stupidly expensive to split an order-9 allocation into 512 order-0 allocations. We'll have to allocate 512 buddies and 512 folios. We may want to optimise split_page() to only allocate folios for pages we're not going to immediately free, and to allocate higher order buddies for pages we are going to free.
== Mapping memory into userspace ==
File, anon and KSM memory are rmappable. The rmap does not apply to other kinds of memory (networking, device driver, vmalloc, etc). These kinds of memory should be added to VM_MIXEDMAP or VM_PFNMAP mappings only.
= Things to remember =
ext4 attaches a buffer_head to memory allocated from slab. virt_to_folio() should probably return NULL in this case?