3912
Comment:
|
← Revision 6 as of 2024-03-04 15:05:32 ⇥
4527
Add ZONE_NOSPLIT option
|
Deletions are marked like this. | Additions are marked like this. |
Line 11: | Line 11: |
While Node/Section/Zone is initialised in page->flags, zone & node information is already part of the zone, so we can discard it on free and reinitialise it on allocation. Section can be inferred from the PFN. We do need to find the previous and next pages in the list; the operations we need are to remove the first page on the list (allocation), add a page to the list (freeing) and remove a page from anywhere on the list (a free operation has found its buddy; we need to remove the buddy from the list it's currently on and add it to a different list). | While Node/Section/Zone is initialised in page->flags, zone & node information is already part of the zone, so we can discard it on free and reinitialise it on allocation. Section can be derived from the PFN at allocation time. We do need to find the previous and next pages in the list; the operations we need are to remove the first page on the list (allocation), add a page to the list (freeing) and remove a page from anywhere on the list (a free operation has found its buddy; we need to remove the buddy from the list it's currently on and add it to a different list). |
Line 15: | Line 15: |
First option is to do this: | === Option 1 === |
Line 33: | Line 33: |
The second option is the same data structure, but store the PFN of the next/prev pages instead of the pointer to the struct page. That gives us a lot more bits to play with! On 32-bit, we can use 28 bits to support up to 1TB of memory (theoretically possible with ARM LPAE). But we no longer have a limit of order-15 pages as we know that PFNs are naturally aligned, and so we can use [[MatthewWilcox/NaturallyAlignedOrder|a single bit]] to record what order the page is. And we have three bits left over! On 64-bit, we have space for 60 bits of PFN which works out to 4096 exabytes of memory (most 64-bit architectures can support PFNs up to about 51 bits). | === Option 2 === |
Line 35: | Line 35: |
The third option is an optional compression of option 2. Many 64-bit systems, like my laptop and my phone have less than 1TB of memory. So instead of using a pair of unsigned long, we can encode all of this into a single 8-byte integer: | The same data structure, but store the PFN of the next/prev pages instead of the pointer to the struct page. That gives us a lot more bits to play with! On 32-bit, we can use 28 bits to support up to 1TB of memory (theoretically possible with ARM LPAE). But we no longer have a limit of order-15 pages as we know that PFNs are naturally aligned, and so we can use [[MatthewWilcox/NaturallyAlignedOrder|a single bit]] to record what order the page is. And we have three bits left over! On 64-bit, we have space for 60 bits of PFN which works out to 4096 exabytes of memory (most 64-bit architectures can support PFNs up to about 51 bits). === Option 3 === We can compress option 2 on many 64-bit systems. For example, my laptop and my phone have less than 2TB of memory. Instead of using a pair of unsigned long, we can encode next/prev/order into a single 8-byte integer: |
Line 44: | Line 48: |
That is actually 29 bits, letting us support up to 2TB systems with an 8 byte memdesc. Assuming there's a decent Kconfig option to determine whether it's OK to decline to support the extra memory. | That is 29 bits, letting us support up to 2TB systems with an 8 byte memdesc. Assuming there's a decent Kconfig option to determine whether it's OK to decline to support memory above 2TB ... |
Line 46: | Line 50: |
It is tempting to see if we can shrink memdesc to 4 bytes on 32-bit systems, but we only get 13 bits for prev/next, limiting us to a 32MB system. If it's worth the Kconfig option, it'll only be a few extra lines of code to support it. | It is tempting to see if we can shrink memdesc to 4 bytes on 32-bit systems, but we only get 13 bits for prev & next, limiting us to a 32MB system. If it's worth the Kconfig option, it'll only be a few extra lines of code to support it. === Option 4 === Instead of using an absolute PFN, use a PFN relative to the base of the zone. That would mean we need one zone per 2TB of memory, which would expand the number of zones a little. But we could keep memdesc at 8 bytes, even on the largest machines, which would save us a Kconfig option for gargantuan machines. === Option 5 === If we accept [[https://lore.kernel.org/linux-mm/20240229183436.4110845-2-yuzhao@google.com/|ZONE_NOSPLIT]] we can scale the next/prev by an extra 9 bits each in that zone, giving us support for up to 1PB of physical addresses (assuming 2MB minimum order). |
This is part of the MatthewWilcox/Memdescs project.
Today, pages in the buddy allocator store the following information:
Node/Section/Zone (in page->flags)
Prev + Next page of this order (in page->buddy_list, aliasing page->lru)
Page order (in page->private)
migratetype (stored in page->index) is only used by the PCP frontend to the buddy allocator and does not need to be retained by pages in the buddy allocator.
While Node/Section/Zone is initialised in page->flags, zone & node information is already part of the zone, so we can discard it on free and reinitialise it on allocation. Section can be derived from the PFN at allocation time. We do need to find the previous and next pages in the list; the operations we need are to remove the first page on the list (allocation), add a page to the list (freeing) and remove a page from anywhere on the list (a free operation has found its buddy; we need to remove the buddy from the list it's currently on and add it to a different list).
Once we move to memdescs, we have a choice between storing all the information we need in struct page and allocating a 'struct buddy' for each page being managed by the buddy allocator. Allocating a struct buddy is feasible but adds a lot of complex interactions between the buddy allocator and the slab allocator to ensure that we can always allocate struct buddy in order to either allocate or free memory. I definitely prefer storing all necessary information directly in struct page.
Option 1
struct buddy { unsigned long next; unsigned long prev; }; struct page { union { unsigned long memdesc; struct buddy buddy; }; };
This gives us a 16 byte struct page. The 'next' field has its bottom four bits used for the memdesc type for buddy pages. The bottom four bits of 'prev' can be used to store the order (assuming we don't need to go past order 15 pages). The remaining bits of 'next' and 'prev' can be used to store pointers to struct page, since struct page is aligned to 16 bits. But this doesn't work on 32-bit because struct page would only be eight bytes. We could make it work by allocating two memdesc types to the buddy allocator (eg types 1 and 9) so that it works for merely 8 byte alignment. But I think we have better options ...
Option 2
The same data structure, but store the PFN of the next/prev pages instead of the pointer to the struct page. That gives us a lot more bits to play with! On 32-bit, we can use 28 bits to support up to 1TB of memory (theoretically possible with ARM LPAE). But we no longer have a limit of order-15 pages as we know that PFNs are naturally aligned, and so we can use a single bit to record what order the page is. And we have three bits left over! On 64-bit, we have space for 60 bits of PFN which works out to 4096 exabytes of memory (most 64-bit architectures can support PFNs up to about 51 bits).
Option 3
We can compress option 2 on many 64-bit systems. For example, my laptop and my phone have less than 2TB of memory. Instead of using a pair of unsigned long, we can encode next/prev/order into a single 8-byte integer:
bits |
meaning |
0-3 |
memdesc type buddy |
4 |
order encoding |
5-33 |
next |
34-62 |
prev |
63 |
unused |
That is 29 bits, letting us support up to 2TB systems with an 8 byte memdesc. Assuming there's a decent Kconfig option to determine whether it's OK to decline to support memory above 2TB ...
It is tempting to see if we can shrink memdesc to 4 bytes on 32-bit systems, but we only get 13 bits for prev & next, limiting us to a 32MB system. If it's worth the Kconfig option, it'll only be a few extra lines of code to support it.
Option 4
Instead of using an absolute PFN, use a PFN relative to the base of the zone. That would mean we need one zone per 2TB of memory, which would expand the number of zones a little. But we could keep memdesc at 8 bytes, even on the largest machines, which would save us a Kconfig option for gargantuan machines.
Option 5
If we accept ZONE_NOSPLIT we can scale the next/prev by an extra 9 bits each in that zone, giving us support for up to 1PB of physical addresses (assuming 2MB minimum order).