4930
Comment: pfmemalloc & kmalloc_large
|
5641
|
Line 5: | Line 5: |
unsigned long memdesc; | u64 memdesc; |
Line 9: | Line 9: |
The memdesc contains a 4 bit ''type'' field that describes what the remaining 60/28 bits are used for. | Bits 0-3 of memdesc are a ''type'' field that describes what the remaining bits are used for. |
Line 12: | Line 12: |
|| 0 || No Pointer || See below || || 1 || Buddy || Pointer to struct buddy || |
|| 0 || Misc || See below || || 1 || Buddy || Embedded struct buddy || |
Line 22: | Line 22: |
|| 10-15 || not yet assigned || || | || 10 || PerCPU || Pointer to struct pcpudesc (TBD) || || 11-15 || not yet assigned || || |
Line 25: | Line 26: |
=== Type 0 === | === Misc memory === |
Line 27: | Line 28: |
Type 0 is used for allocations which will never be freed. The next four bits distinguish what kind of allocation this is: | Type 0 is used for memory which needs no further data associated with it. Bits 4-10 are used as a subtype to determine what the memory is used for: |
Line 29: | Line 30: |
|| subtype || Meaning || || 1 || PageReserved || || 2-15 || not yet assigned || |
|| Subtype || Meaning || || 0 || PageReserved || || 1 || A Zero Page (maybe PTE or PMD) || || 2 || Unknown (probably device driver) || || 3 || Vmalloc || || 4 || Guard || || 5 || Offline || || 6 || kmalloc_large || || 7 || Exact || || 8 || brd || || 9-127 || not yet assigned || |
Line 33: | Line 42: |
The high bits are used to store zone/node/... information, as is done today with the page flags. Bits 8-13 are also used to store the order of the allocation. | Bit 11 is set if the page may be mapped to userspace. |
Line 35: | Line 44: |
Bits 12-17 are used to store the order of the page. The high bits are used to store section/node/zone information, as is done today with the page flags. XXX: How to indicate that a page was allocated from reserves like pfmemalloc today? NOTE! There is no refcount for this kind of memory! Nor mapcount! get_page() / put_page() will throw an error for them. You can free the pages, and they will go straight back to the page allocator. If you need a refcount, allocate a folio instead. You can still map them to userspace, but they will be treated like a PFNMAP. === struct buddy === Type 1 is used for pages which are in the MatthewWilcox/BuddyAllocator. This is either one or two words of data which is used to manage the pages (see the link for more detail). |
|
Line 40: | Line 59: |
Other than struct buddy, all memdescs must have a pointer to struct buddy as the first word. See below. === struct buddy === |
=== struct folio === |
Line 45: | Line 62: |
struct buddy { unsigned long flags; struct list_head buddy_list; struct page *first; |
struct folio { unsigned long flags; struct list_head lru; struct address_space *mapping; pgoff_t index; void *private; atomic_t _refcount; atomic_t _mapcount; unsigned long pfn; unsigned long memcg_data; |
Line 52: | Line 75: |
This is used by the buddy allocator to track free pages. It is also used when the user does not need to store significant auxiliary information. The flags word contains node/zone/section/etc information as page->flags does today. However the lower bits are quite different: | This looks very similar to today. The only addition is the pfn, which we can use for getting the struct page if needed. |
Line 54: | Line 77: |
|| Bits || Meaning || || 0-3 || Subtype || || 4-9 || Order || || 10-13 || Migratetype || || 14 || pfmemalloc || || 15 || Userspace mappable || |
=== Other structs === |
Line 61: | Line 79: |
|| Subtype || Meaning || || 0 || Free || || 1 || Device driver allocation || || 2 || Vmalloc || || 3 || Guard || || 4 || Offline || || 5 || kmalloc_large || || 6-15 || not yet assigned || |
TBD |
Line 72: | Line 83: |
Device drivers that do not touch the contents of struct page can continue calling alloc_pages() as they do today. They will get back a struct page pointer which will have a struct buddy memdesc, but they won't care. | Device drivers that do not touch the contents of struct page can continue calling alloc_pages() as they do today. They will get back a struct page pointer which will have subtype "Unknown" but they won't care. |
Line 74: | Line 85: |
We'll add a new buddy_alloc() which will return the buddy memdesc pointer instead. Each allocator can change the subtype from 'Free' to whatever subtype it has. | We'll add a new memdesc_alloc_pages() family which allocate the memory and set each page->memdesc to the passed-in memdesc. So each memdesc allocator will first use slab to allocate a memdesc, then allocate the pages that point to that memdesc. |
Line 76: | Line 87: |
We'll add a new memdesc_alloc_pages() family which allocate the memory and set each page->memdesc to the passed-in memdesc. So each memdesc allocator will first use slab to allocate a memdesc, then allocate the pages that point to that memdesc. As an optimisation, if sizeof(struct my_memdesc) is <= sizeof(struct buddy), we can avoid allocating a new memdesc and cast the pointer returned from buddy_alloc(). | === alloc_pages_exact === We'll follow the same approach as today; allocate a buddy of sufficient size, but we'll expose a new primitive buddy_free_tail() which will release the tail pages back to the buddy allocator. === Slab memdesc === |
Line 80: | Line 95: |
I don't know how we'll handle alloc_pages_exact(). |
|
Line 84: | Line 97: |
Folios (file/anon/ksm) have a refcount. These should be freed with folio_put(). Other memdescs may not have a refcount, | Folios (file/anon/ksm) have a refcount. These should be freed with folio_put(). Other memdescs may not have a refcount (eg slab). |
Line 86: | Line 99: |
Allocations which use a buddy memdesc are simply passed back to the buddy allocator. Other memdescs will copy the buddy pointer from the memdesc before RCU freeing it, and update each struct page with the buddy memdesc. | Misc allocations (type 0) are simply passed back to the page allocator which will turn them into Buddy pages. |
Line 90: | Line 103: |
== Splitting folios == It is going to be stupidly expensive to split an order-9 allocation into 512 order-0 allocations. We'll have to allocate 512 folios. We may want to optimise split_page() to only allocate folios for pages we're not going to immediately free. |
|
Line 94: | Line 110: |
= Things to remember = ext4 attaches a buffer_head to memory allocated from slab. virt_to_folio() should probably return NULL in this case? |
The ultimate goal of the folios project is to turn struct page into:
struct page { u64 memdesc; };
Bits 0-3 of memdesc are a type field that describes what the remaining bits are used for.
|| type || Meaning || Remaining bits ||
|| 0 || Misc || See below ||
|| 1 || Buddy || Embedded struct buddy ||
|| 2 || File || Pointer to struct folio ||
|| 3 || Anon || Pointer to struct anon_folio ||
|| 4 || KSM || Pointer to struct ksm (TBD) ||
|| 5 || Slab || Pointer to struct slab ||
|| 6 || Movable || Pointer to struct movable (TBD) ||
|| 7 || || Pointer to struct ptdesc ||
|| 8 || || Pointer to struct netpool (TBD) ||
|| 9 || HWPoison || Pointer to struct hwpoison (TBD) ||
|| 10 || PerCPU || Pointer to struct pcpudesc (TBD) ||
|| 11-15 || not yet assigned || ||
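As a rough illustration of the type field, here is a minimal userspace sketch of reading and writing bits 0-3 of memdesc. The enum names, the mask macro and the helper functions are all hypothetical (only the numeric type values and struct page come from this page), and this is not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Numeric values follow the type table; the names are hypothetical. */
enum memdesc_type {
    MEMDESC_MISC     = 0,
    MEMDESC_BUDDY    = 1,
    MEMDESC_FILE     = 2,
    MEMDESC_ANON     = 3,
    MEMDESC_KSM      = 4,
    MEMDESC_SLAB     = 5,
    MEMDESC_MOVABLE  = 6,
    MEMDESC_PTDESC   = 7,
    MEMDESC_NETPOOL  = 8,
    MEMDESC_HWPOISON = 9,
    MEMDESC_PERCPU   = 10,
};

#define MEMDESC_TYPE_MASK 0xfULL    /* bits 0-3 */

struct page {
    uint64_t memdesc;
};

/* Extract the 4-bit type from a page's memdesc. */
static inline unsigned int memdesc_type(const struct page *page)
{
    return (unsigned int)(page->memdesc & MEMDESC_TYPE_MASK);
}

/* Types 2-10 carry a pointer in the remaining bits; 0 and 1 do not. */
static inline int memdesc_is_pointer(const struct page *page)
{
    unsigned int type = memdesc_type(page);

    return type >= MEMDESC_FILE && type <= MEMDESC_PERCPU;
}

/* Replace the type while leaving the remaining bits untouched. */
static inline void memdesc_set_type(struct page *page, unsigned int type)
{
    assert(type <= MEMDESC_TYPE_MASK);
    page->memdesc = (page->memdesc & ~MEMDESC_TYPE_MASK) | type;
}
```

A page walker would switch on memdesc_type() before deciding how to interpret the rest of the word.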
|
Misc memory
Type 0 is used for memory which needs no further data associated with it. Bits 4-10 are used as a subtype to determine what the memory is used for:
|| Subtype || Meaning ||
|| 0 || PageReserved ||
|| 1 || A Zero Page (maybe PTE or PMD) ||
|| 2 || Unknown (probably device driver) ||
|| 3 || Vmalloc ||
|| 4 || Guard ||
|| 5 || Offline ||
|| 6 || kmalloc_large ||
|| 7 || Exact ||
|| 8 || brd ||
|| 9-127 || not yet assigned ||
Bit 11 is set if the page may be mapped to userspace.
Bits 12-17 are used to store the order of the page. The high bits are used to store section/node/zone information, as is done today with the page flags.
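The bit layout of a Misc memdesc could be packed and unpacked roughly as follows. This is only a sketch: the macro, enum and function names are hypothetical, and the section/node/zone information in the high bits is left out because its exact position is not specified here.

```c
#include <assert.h>
#include <stdint.h>

/* Layout described above: type in bits 0-3 (0 for Misc), subtype in
 * bits 4-10, userspace-mappable flag in bit 11, order in bits 12-17. */
#define MISC_SUBTYPE_SHIFT  4
#define MISC_SUBTYPE_MASK   0x7fULL         /* 7 bits: 0-127 */
#define MISC_USER_MAPPABLE  (1ULL << 11)
#define MISC_ORDER_SHIFT    12
#define MISC_ORDER_MASK     0x3fULL         /* 6 bits: 0-63 */

/* Values follow the subtype table; the names are hypothetical. */
enum misc_subtype {
    MISC_RESERVED      = 0,
    MISC_ZERO_PAGE     = 1,
    MISC_UNKNOWN       = 2,
    MISC_VMALLOC       = 3,
    MISC_GUARD         = 4,
    MISC_OFFLINE       = 5,
    MISC_KMALLOC_LARGE = 6,
    MISC_EXACT         = 7,
    MISC_BRD           = 8,
};

/* Build a Misc memdesc; the type field (bits 0-3) stays 0. */
static inline uint64_t misc_memdesc(unsigned int subtype, unsigned int order,
                                    int user_mappable)
{
    assert(subtype <= MISC_SUBTYPE_MASK);
    assert(order <= MISC_ORDER_MASK);
    return ((uint64_t)subtype << MISC_SUBTYPE_SHIFT) |
           (user_mappable ? MISC_USER_MAPPABLE : 0) |
           ((uint64_t)order << MISC_ORDER_SHIFT);
}

static inline unsigned int misc_subtype(uint64_t memdesc)
{
    return (unsigned int)((memdesc >> MISC_SUBTYPE_SHIFT) & MISC_SUBTYPE_MASK);
}

static inline unsigned int misc_order(uint64_t memdesc)
{
    return (unsigned int)((memdesc >> MISC_ORDER_SHIFT) & MISC_ORDER_MASK);
}

static inline int misc_user_mappable(uint64_t memdesc)
{
    return (memdesc & MISC_USER_MAPPABLE) != 0;
}
```

For example, an order-9 vmalloc allocation that may be mapped to userspace would encode as 0x9830 under this layout.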
XXX: How to indicate that a page was allocated from reserves like pfmemalloc today?
NOTE! There is no refcount for this kind of memory! Nor mapcount! get_page() / put_page() will throw an error for them. You can free the pages, and they will go straight back to the page allocator. If you need a refcount, allocate a folio instead. You can still map them to userspace, but they will be treated like a PFNMAP.
struct buddy
Type 1 is used for pages which are in the buddy allocator. The memdesc is either one or two words of data which are used to manage the pages (see MatthewWilcox/BuddyAllocator for more detail).
Memdesc pointers
All structs pointed to from a memdesc must be allocated from a slab which has its alignment set to 16 bytes (in order to allow the bottom 4 bits to be used for the type). That implies that they are a multiple of 16 bytes in size. The slab must also have the TYPESAFE_BY_RCU flag set as some page walkers will attempt to look up the memdesc from the page while holding only the RCU read lock.
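To show why the 16-byte alignment matters, here is a small userspace sketch of storing the type in the bottom 4 bits of a pointer. It uses C11 aligned_alloc() as a stand-in for a slab cache with 16-byte alignment; struct demo_desc and all the helper names are hypothetical, and TYPESAFE_BY_RCU has no equivalent in this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical descriptor; its size is a multiple of 16 bytes. */
struct demo_desc {
    uint64_t a;
    uint64_t b;
};

_Static_assert(sizeof(struct demo_desc) % 16 == 0,
               "descriptor size must be a multiple of 16 bytes");

/* Stand-in for allocating from a slab with 16-byte alignment. */
static inline struct demo_desc *demo_desc_alloc(void)
{
    return aligned_alloc(16, sizeof(struct demo_desc));
}

/* Combine an aligned pointer and a 4-bit type into one memdesc word. */
static inline uint64_t memdesc_pack(void *desc, unsigned int type)
{
    uintptr_t addr = (uintptr_t)desc;

    assert((addr & 0xf) == 0);  /* bottom 4 bits must be free */
    assert(type <= 0xf);
    return (uint64_t)addr | type;
}

/* Recover the pointer by masking off the type bits. */
static inline void *memdesc_ptr(uint64_t memdesc)
{
    return (void *)(uintptr_t)(memdesc & ~0xfULL);
}

static inline unsigned int memdesc_type_of(uint64_t memdesc)
{
    return (unsigned int)(memdesc & 0xf);
}
```

If the descriptor were only 8-byte aligned, bit 3 of the address could be set and would be mistaken for part of the type.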
struct folio
struct folio {
    unsigned long flags;
    struct list_head lru;
    struct address_space *mapping;
    pgoff_t index;
    void *private;
    atomic_t _refcount;
    atomic_t _mapcount;
    unsigned long pfn;
    unsigned long memcg_data;
};
This looks very similar to today. The only addition is the pfn, which we can use for getting the struct page if needed.
Other structs
TBD
Allocating memory
Device drivers that do not touch the contents of struct page can continue calling alloc_pages() as they do today. They will get back a struct page pointer which will have subtype "Unknown" but they won't care.
We'll add a new memdesc_alloc_pages() family of functions which allocate the memory and set each page->memdesc to the passed-in memdesc. So each memdesc allocator will first use slab to allocate a memdesc, then allocate the pages that point to that memdesc.
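The two-step flow (allocate the memdesc, then allocate pages that point to it) might look something like the userspace sketch below. Everything here is an assumption made for illustration: the demo_ names, the tiny static memory map, the bump allocator standing in for the page allocator, and aligned_alloc() standing in for slab. The real memdesc_alloc_pages() signature has not been decided.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct page {
    uint64_t memdesc;
};

#define DEMO_NR_PAGES   64
#define DEMO_MAX_ORDER  6
#define DEMO_TYPE_SLAB  5ULL    /* type 5 in the type table */

/* Stand-ins for the memory map and the page allocator. */
static struct page demo_mem_map[DEMO_NR_PAGES];
static unsigned long demo_next_free;

/* Hypothetical: allocate 2^order naturally aligned pages and set each
 * page's memdesc to the passed-in value. */
static struct page *demo_memdesc_alloc_pages(uint64_t memdesc,
                                             unsigned int order)
{
    unsigned long nr, start, i;

    if (order > DEMO_MAX_ORDER)
        return NULL;
    nr = 1UL << order;
    start = (demo_next_free + nr - 1) & ~(nr - 1);
    if (start + nr > DEMO_NR_PAGES)
        return NULL;
    for (i = 0; i < nr; i++)
        demo_mem_map[start + i].memdesc = memdesc;
    demo_next_free = start + nr;
    return &demo_mem_map[start];
}

/* Hypothetical descriptor, 32 bytes so it can be 16-byte aligned. */
struct demo_slab {
    uint64_t words[4];
};

/* Step 1: allocate the memdesc. Step 2: allocate pages pointing to it. */
static struct page *demo_alloc_slab_pages(unsigned int order,
                                          struct demo_slab **slabp)
{
    struct demo_slab *slab = aligned_alloc(16, sizeof(*slab));
    struct page *page;

    if (!slab)
        return NULL;
    assert(((uintptr_t)slab & 0xf) == 0);
    page = demo_memdesc_alloc_pages((uint64_t)(uintptr_t)slab |
                                    DEMO_TYPE_SLAB, order);
    if (!page) {
        free(slab);
        return NULL;
    }
    *slabp = slab;
    return page;
}
```

Every page in the allocation ends up with the same memdesc word, so any page can be used to find the descriptor.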
alloc_pages_exact
We'll follow the same approach as today: allocate a buddy of sufficient size, then release the unused tail pages back to the buddy allocator using a new primitive, buddy_free_tail().
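The arithmetic behind this is simple enough to sketch. The code below only computes which order would be allocated and how many tail pages would be handed back; it does not model buddy_free_tail() itself, whose interface is not defined yet. The 4096-byte page size and all the demo_ names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL   /* assumed page size */

struct demo_exact_plan {
    unsigned int order;     /* order of the buddy to allocate */
    unsigned long nr_kept;  /* pages the caller actually needs */
    unsigned long nr_tail;  /* pages to release back to the buddy allocator */
};

/* Work out the smallest order covering size bytes and the unused tail. */
static struct demo_exact_plan demo_plan_exact(size_t size)
{
    struct demo_exact_plan plan;
    unsigned long nr_pages;

    assert(size > 0);
    nr_pages = (size + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;

    plan.order = 0;
    while ((1UL << plan.order) < nr_pages)
        plan.order++;

    plan.nr_kept = nr_pages;
    plan.nr_tail = (1UL << plan.order) - nr_pages;
    return plan;
}
```

A request for five pages, for instance, would allocate an order-3 buddy (eight pages) and release the last three.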
Slab memdesc
There's a minor recursion problem for the slab memdesc. This can be avoided by special-casing the struct slab allocation; any time we need to allocate a new slab for the slab memdesc cache, we _do not_ allocate a struct slab for it; we use the first object in the allocated memory for its own struct slab.
Freeing memory
Folios (file/anon/ksm) have a refcount. These should be freed with folio_put(). Other memdescs may not have a refcount (e.g. slab).
Misc allocations (type 0) are simply passed back to the page allocator, which will turn them into Buddy pages.
Memory control group
Splitting folios
It is going to be stupidly expensive to split an order-9 allocation into 512 order-0 allocations. We'll have to allocate 512 folios. We may want to optimise split_page() to only allocate folios for pages we're not going to immediately free.
Mapping memory into userspace
File, anon and KSM memory are rmappable. The rmap does not apply to other kinds of memory (networking, device driver, vmalloc, etc). These kinds of memory should be added to VM_MIXEDMAP or VM_PFNMAP mappings only.
Things to remember
ext4 attaches a buffer_head to memory allocated from slab. virt_to_folio() should probably return NULL in this case?