
For every page of physical memory in Linux, there is a struct page. Struct page is a rather weak data type; it's very easy to look at (e.g.) page->mapping when the page is actually a tail page, and so does not have a mapping. Folios are the beginning of separating out some of the roles of struct page. Conceptually, folios take the contents of struct page (except the tail page parts) and move them into struct folio. That isn't what the patchset actually does, because I'm not enough of a masochist to make all those changes in one go.
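To make the tail page hazard concrete, here is a minimal userspace sketch. Every name in it (mock_page, mock_folio, mock_page_folio and so on) is hypothetical and the layout is simplified; the only thing borrowed from struct page is the idea that a tail page's compound_head points at the head page with bit 0 set. The mock folio wraps a pointer to the head page purely to keep the example free of casts, which is an assumption of this sketch and not a description of the patchset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock types for illustration only; these are not the kernel's definitions. */
struct mock_mapping {
    int id;
};

struct mock_page {
    unsigned long flags;
    /* Tail page: address of the head page with bit 0 set. Otherwise 0. */
    uintptr_t compound_head;
    /* Only meaningful on a head (or order-0) page. */
    struct mock_mapping *mapping;
};

/* A mock folio can only ever refer to a head page, never to a tail page. */
struct mock_folio {
    struct mock_page *head;
};

/* Resolve any page, head or tail, to the folio that contains it. */
static struct mock_folio mock_page_folio(struct mock_page *page)
{
    struct mock_folio folio;

    if (page->compound_head & 1)
        page = (struct mock_page *)(page->compound_head - 1);
    folio.head = page;
    return folio;
}

/* Always safe: there is no way to ask a tail page for its mapping here. */
static struct mock_mapping *mock_folio_mapping(struct mock_folio folio)
{
    return folio.head->mapping;
}

/* Build a compound page of nr pages; only the head carries the mapping. */
static void mock_make_compound(struct mock_page *pages, size_t nr,
                               struct mock_mapping *mapping)
{
    pages[0].compound_head = 0;
    pages[0].mapping = mapping;
    for (size_t i = 1; i < nr; i++) {
        pages[i].compound_head = (uintptr_t)&pages[0] | 1;
        pages[i].mapping = NULL;
    }
}
```

Reading pages[2].mapping directly compiles and silently yields NULL, which is the weakness described above. Going through mock_page_folio() first makes the type system force the lookup to land on the head page.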

We can (and should) go further. We need to understand how memory is (currently) allocated and used. Any memory mapped to userspace needs dirty & locked bits, a refcount and a mapcount.

Purpose        Notes
-------------  ------------------------------------------------------------
Free           Belongs to the buddy allocator. Not mappable to userspace
Slab           Not mappable to userspace
Page table     Not mappable to userspace
vmalloc        May be mapped
kernel stack   Currently allocated by vmalloc, but should never be mapped
File cache     May be mapped
Anon           May be mapped
net pool       May be mapped
zsmalloc       Not mappable to userspace
kernel text    Can we map this through /dev/kmem or something?
kernel data    Mapping this is a security hole?
ZONE_DEVICE    This is a disaster area
offline        Logically offline, e.g. balloon
guard          see debug_pagealloc
arbitrary      Many device drivers just allocate pages and map them to userspace
arbitrary      Many more device drivers allocate pages from the buddy allocator and don't map them to userspace. These should probably be converted to use the slab allocator, but this is a job for someone who understands each driver well.
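The classification above can be written down as a small lookup table. This is only a sketch: the enum names and the three-way MAP_NEVER / MAP_MAYBE / MAP_UNSPECIFIED split are invented for illustration, and rows whose notes are a question or say nothing about mapping are deliberately left as MAP_UNSPECIFIED without guessing an answer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the rows of the table; not kernel identifiers. */
enum mem_purpose {
    MEM_FREE,
    MEM_SLAB,
    MEM_PAGE_TABLE,
    MEM_VMALLOC,
    MEM_KERNEL_STACK,
    MEM_FILE_CACHE,
    MEM_ANON,
    MEM_NET_POOL,
    MEM_ZSMALLOC,
    MEM_KERNEL_TEXT,
    MEM_KERNEL_DATA,
    MEM_ZONE_DEVICE,
    MEM_OFFLINE,
    MEM_GUARD,
    MEM_DRIVER_MAPPED,   /* "arbitrary": driver pages mapped to userspace */
    MEM_DRIVER_UNMAPPED, /* "arbitrary": driver pages never mapped */
    MEM_NR_PURPOSES
};

enum user_mappable {
    MAP_UNSPECIFIED = 0, /* the table leaves the question open */
    MAP_NEVER,
    MAP_MAYBE
};

/* Entries not listed default to MAP_UNSPECIFIED (zero). */
static const enum user_mappable mappable[MEM_NR_PURPOSES] = {
    [MEM_FREE]            = MAP_NEVER,
    [MEM_SLAB]            = MAP_NEVER,
    [MEM_PAGE_TABLE]      = MAP_NEVER,
    [MEM_VMALLOC]         = MAP_MAYBE,
    [MEM_KERNEL_STACK]    = MAP_NEVER,
    [MEM_FILE_CACHE]      = MAP_MAYBE,
    [MEM_ANON]            = MAP_MAYBE,
    [MEM_NET_POOL]        = MAP_MAYBE,
    [MEM_ZSMALLOC]        = MAP_NEVER,
    [MEM_DRIVER_MAPPED]   = MAP_MAYBE,
    [MEM_DRIVER_UNMAPPED] = MAP_NEVER,
};

/* Memory that may be mapped needs dirty/locked bits, refcount and mapcount. */
static int needs_user_mapping_state(enum mem_purpose purpose)
{
    return mappable[purpose] == MAP_MAYBE;
}

/* Count the purposes that may end up mapped to userspace. */
static size_t count_maybe_mapped(void)
{
    size_t n = 0;

    for (int p = 0; p < MEM_NR_PURPOSES; p++)
        if (needs_user_mapping_state((enum mem_purpose)p))
            n++;
    return n;
}
```

Under this reading only five of the sixteen rows may be mapped, which is the set of users that actually needs the per-mapping state. Everything marked MAP_NEVER is a candidate for a descriptor without a mapcount.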

Memory allocated to slab is low-hanging fruit. Preliminary patch available here: https://lore.kernel.org/lkml/YUpaTBJ%2FJhz15S6a@casper.infradead.org/

Page tables are probably the next obvious thing to split out. At that point, though, I lack vision for splitting out any of the other things. So we end up with:

struct page {
    unsigned long flags;
    unsigned long compound_head;
    union {
        struct { /* First tail page only */
            unsigned char compound_dtor;
            unsigned char compound_order;
            atomic_t compound_mapcount;
            unsigned int compound_nr;
        };
        struct { /* Second tail page only */
            atomic_t hpage_pinned_refcount;
            struct list_head deferred_list;
        };
        unsigned long padding1[4];
    };
    unsigned int padding2[2];
#ifdef CONFIG_MEMCG
    unsigned long padding3;
#endif
#ifdef WANT_PAGE_VIRTUAL
    void *virtual;
#endif
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
    int _last_cpupid;
#endif
};

struct slab {
    /* ... slab specific stuff here ... */
};

struct page_table {
    /* ... pgtable stuff here ... */
};

struct folio {
    unsigned long flags;
    union {
        struct {
            struct list_head lru;
            struct address_space *mapping;
            pgoff_t index;
            void *private;
        };
        struct {
            /* ... net pool here ... */
        };
        struct {
            /* ... zone device here ... */
        };
    };
    atomic_t _mapcount;
    atomic_t _refcount;
#ifdef CONFIG_MEMCG
    unsigned long memcg_data;
#endif
};

(NB: I haven't added the rcu_head anywhere; I'm not sure what needs it right now. Maybe just slab?)

Eventually, I hope to get to the point where we dynamically allocate the struct folio, struct slab and other memory descriptors [1]. struct page then contains only a single unsigned long, pointing to the memory descriptor that page belongs to. This pointer can be typed (using the usual bottom few bits). It shrinks the overhead of memmap from 1.6% of memory to 0.2% of memory. There's a bootstrapping problem here (e.g. to allocate the first struct slab), but I'm sure it's not insurmountable. I'm also not sure what the buddy allocator looks like at this point, if it can't use struct page to store its metadata (or maybe it can still use this newly shrunken struct page to store enough metadata?).
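A rough userspace sketch of what a typed descriptor pointer could look like, together with the overhead arithmetic. All names here (tiny_page, memdesc_type, the MEMDESC_* values) are made up for illustration, and the choice of three tag bits assumes descriptors are at least 8-byte aligned. The percentages are reproduced assuming a 64-byte struct page, an 8-byte unsigned long and 4KiB pages; none of those sizes are stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type tags stored in the low bits of the descriptor pointer. */
enum memdesc_type {
    MEMDESC_FOLIO = 1,
    MEMDESC_SLAB = 2,
    MEMDESC_PAGE_TABLE = 3
};

/* Assumes descriptors are 8-byte aligned, leaving three low bits free. */
#define MEMDESC_TYPE_MASK ((uintptr_t)0x7)

/* The shrunken page: one word holding a tagged pointer. */
struct tiny_page {
    uintptr_t memdesc;
};

/* Pack a descriptor pointer and its type into a single word. */
static struct tiny_page tiny_page_set(void *desc, enum memdesc_type type)
{
    struct tiny_page page;

    assert(((uintptr_t)desc & MEMDESC_TYPE_MASK) == 0);
    page.memdesc = (uintptr_t)desc | (uintptr_t)type;
    return page;
}

/* Recover the type tag from the low bits. */
static enum memdesc_type tiny_page_type(struct tiny_page page)
{
    return (enum memdesc_type)(page.memdesc & MEMDESC_TYPE_MASK);
}

/* Recover the descriptor pointer by masking the tag off. */
static void *tiny_page_desc(struct tiny_page page)
{
    return (void *)(page.memdesc & ~MEMDESC_TYPE_MASK);
}

/* memmap overhead as a percentage of memory, for one descriptor per page. */
static double memmap_overhead_percent(size_t desc_size, size_t page_size)
{
    return 100.0 * (double)desc_size / (double)page_size;
}
```

With those assumed sizes, memmap_overhead_percent(64, 4096) is 1.5625 and memmap_overhead_percent(8, 4096) is about 0.195, which round to the 1.6% and 0.2% quoted above. The saving only counts the memmap itself; the dynamically allocated descriptors still cost memory, just once per allocation and not once per page.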

[1] Every memory descriptor struct (even those that aren't mappable to userspace) must be TYPESAFE_BY_RCU as there are lockless page table walkers that might see a stale reference to a memory descriptor.

KernelNewbies: MemoryTypes (last edited 2021-10-12 14:25:49 by MatthewWilcox)