MatthewWilcox/Memdescs/Path

How do we get to memdescs in a series of bisectable, small and reviewable steps?

Immediate projects

There are no dependencies between these projects. They can all be tackled by different people. If you want to work on one of them, please ask Matthew for an invite to the THP Cabal meeting for coordination purposes.

  • Remove page->lru uses. This is really a set of many small projects. Some tactics can be shared between them, but each usage needs to be examined individually to work out how to replace it.

  • Remove page->index uses.

  • Remove page->mapping uses.

    • AMDGPU probably needs a better solution than the one currently proposed.
  • Remove uses of &folio->page. Sometimes the fix is to use folio_page(folio, 0); sometimes it is to push folios into the called function (e.g. block_commit_write()).

  • Remove bh->b_page, as it is also effectively a cast between page and folio.

  • Remove folio from release_pages_arg.

  • Split the pagepool bump allocator out of struct page, as has been done for, e.g., slab and ptdesc.
  • Fix memcg_data so that slabs, folios and plain pages are each accounted appropriately.

We may wish to do a "developer preview" where we just disable some modules without finishing the conversion so that people can evaluate the performance.

Slab

Before we can split page and folio, we need to split slab from both page and folio as we currently cast between page, slab and folio.

To minimise the changes to code outside of slab, all pages in a slab will set PageTail and have compound_head point to the separately allocated struct slab. struct slab will gain an 'address' member that points to the first byte in the page allocation, which allows us to implement slab_address() and slab_to_page().

Large kmalloc allocations will continue to be compound allocations for now. They will store their memcg_data in page->memcg_data and their type in page->page_type. Once struct page is shrunk (see below), their memdesc will be Misc with a subtype of Large kmalloc. If accounted, their memcg_data will be stored in page->private (for now).

After those projects are complete

Then we can shrink struct page to 32 bytes:

struct folio {
    unsigned long flags;
    struct list_head lru;
    struct address_space *mapping;
    pgoff_t index;
    void *private;
    unsigned int _refcount;
    unsigned int _mapcount;
    unsigned int pincount;
    unsigned char order;
    /* 3 bytes available here */
    unsigned long pfn;
    struct mem_cgroup *memcg;
    /* Large folios will store more information here */
};

struct page {
    unsigned long flags;
    union {
        struct list_head buddy_list;
        struct list_head pcp_list;
        struct {
            unsigned long memdesc;
            union {
                unsigned long private;
                atomic_t _mapcount; // only used for folios?
            };
        };
    };
    int _refcount; // 0 for folios
};

For each memdesc (slab, folio, zpdesc, ptdesc, bump) in turn, we create a slab cache for it. Then we make page->compound_head point to the dynamically allocated memdesc rather than the first page. Then we can transition to the above layout.

Memdesc types

As in the fully shrunk struct page, bits 0-3 of memdesc are a type field that describes what the remaining bits are used for. However, types 0, 4, 8 and 12 all alias as "buddy" due to the storage of the buddy_list overlapping the memdesc field.

type  Meaning        Remaining bits in memdesc field
0     Buddy          buddy_list
1     Misc           See below
2     File           Pointer to struct folio
3     Anon           Pointer to struct folio
4     Buddy (alias)
5     Slab           Pointer to struct slab
6     Bump           Pointer to struct bump (TBD)
7     Movable        Pointer to struct movable (TBD)
8     Buddy (alias)
9     HWPoison       Pointer to struct hwpoison (TBD)
10    Accounted      Pointer to struct obj_cgroup
11    ZPDesc         Pointer to struct zpdesc
12    Buddy (alias)
13    KSM            Pointer to struct ksm (TBD)
14    PageTable      Pointer to struct ptdesc
15    unused

In this 2025 world, we copy page->flags from the first page to the folio, but leave it intact to keep code calling page_zone() working the way it does today. The union with the lru/buddy pointers means that memdesc types 0, 4, 8 and 12 are all used by free pages.

For a page in a folio, the usage is:

  • flags (mostly zone/node/section; PageAnonExclusive, PageHwPoison, maybe a couple of others?)

  • memdesc (points to the folio)
  • mapcount (per-page mapcount)

For accounted memory, there are several possibilities:

  • If a folio, we use the memcg in the folio
  • If slab, we use the objcg in struct slab
  • If a plain page allocated with __GFP_ACCOUNT, we use the memdesc (type Accounted) to point to the objcg

Notes:

  • For buddy pages, we use five bits of page->flags to store buddy_order.

  • PageTail() moves out of line. If folio, compare page_pfn() with folio->pfn. If slab, compare page_address() with slab->addr. Otherwise, see if memdesc is tail.

  • Similarly get_page(), lock_page(), put_page(), unlock_page() go out of line and operate differently, depending on the memdesc type.
  • movable_ops users need to use memdesc
  • Can only support 13 memdesc types (buddy plus 12 others), because the list_head pointer that overlaps memdesc is only guaranteed to be 4-byte aligned (so we can use 1-3, 5-7, 9-11 and 13-15 but not 0, 4, 8 or 12)
  • Calling compound_head() on a page that belongs to a folio or slab is a BUG()
  • Calling page_folio() on a page which does not belong to a folio returns NULL

Page2026

The next step is to shrink struct page to 16 bytes:

struct page {
    union {
        struct list_head buddy_list;
        struct {
            unsigned long memdesc;
            unsigned long private;
        };
    };
};

This will involve changes to page_zone(), page_to_nid() and so on.

After this, we can start working on removing accesses to page->private from device drivers. When that is finished, we can explore the various options presented in MatthewWilcox/BuddyAllocator.
