• Immutable Page
  • Info
  • Attachments


Fernand0 we are very pleased to present you today Rik van Riel.
Fernand0 He is a kernel hacker working on memory management.
Fernand0 Currently he is working at conectiva S.A. in Brazil. As all of you know,
Fernand0 it is a big Linux company from South America.
Fernand0 Appart from kernel hacking, he also runs the Linux-MM website and the
Fernand0 #kernelnewbies IRC channel on openprojects.net
Fernand0 You can find more about him at: www.surriel.com (there you can find, among
Fernand0 other things the slides of this talk at:
Fernand0 (http://www.surriel.com/lectures/mmtour.html)
Fernand0 He will talk here about memory management but other interests of him
Fernand0 are: High availability, filesystems and various other things ...
Fernand0 The talk will be here, in #linux channel, Mr Riel suggested us to make
Fernand0 another channel (#qc -> questions channel) to write questions during the talk.
Fernand0 Should you have any questions, comments, etc, just write them in #qc
Fernand0 and Mr. Riel will reply.
Fernand0 Thank you to Mr. Riel for comming here and also to all of you
Fernand0 The title of his talk is:
Fernand0 Too little, too slow; memory management
Fernand0 Mr. Riel ...
riel I guess it's time to begin .............
riel ok, welcome everybody
riel today I will be giving a talk about Linux memory management
riel the slides are at http://www.surriel.com/lectures/mmtour.html
riel we will begin with some of the slides introducing memory management and explaining why we need memory management
riel if you have any questions about my talk, you can ask them in #qc
riel in #qc, you can also discuss with each other the things I talk about
riel this channel (#linux) is meant to be completely silent ... except for me of course ;)
riel ...... (page 1) .....
riel let me begin by telling a little bit about what I am doing at the moment
riel Conectiva is paying me to work on improving the Linux kernel full-time
riel this means that I am working for Linux Torvalds and Alan Cox, but Conectiva is paying me ;) [thanks Conectiva :)]
riel now I'll move on to the real talk ... (page 2)
riel [for the new people ... http://www.surriel.com/lectures/mmtour.html for the slides]
riel ok, I will begin by explaining about memory management
riel most of the introduction I will skip
riel but I will tell a few things about the memory hierarchy and about page faults and page replacement
riel lets start with the picture on (page 3)
riel this picture represents the "memory hierarchy"
riel every computer has more kinds of memory
riel very fast and very small memory
riel and very big but very slow memory
riel fast memory can be the registers or L1 cpu cache
riel slow memory can be L2 cache or RAM
riel and then you have hard disk, which is REALLY REALLY extremely slow ;)
riel when you see this picture, some people will ask themselves the question "but why doesn't my machine have only fast memory?"
riel or "why don't we just run everything from fast memory?"
riel the reason for this is that it is impossible to make very big fast memory
riel and even if it was possible, it would simply be too expensive
riel and you cannot run everything from fast memory because programs are simply too big
riel now we've talked about "fast" and "slow" memory ... (page 4) tells us about different kinds of speeds
riel you have "latency" and "throughput"
riel ok, good to have everybody back
riel if you look at (page 6) you can see how rediculously slow some memory things are
riel ok, lets go to (page 7)
riel "latency" == "if I ask for something, how long do I have to wait until I get the answer"
riel "throughput" == "how much data can I get per minute"
riel <riel> I think we do not have time to look at the L1 and L2 cache things
riel <riel> so lets move on to the RAM management
riel <riel> on (page 14)
riel <riel> RAM is the slowest electronic memory in a computer
riel <riel> it is often 100 times slower than the CPU core (in latency)
riel <riel> this is very very slow
riel <riel> but when you see that the disk is 100000 times slower than RAM (in latency), suddenly memory looks fast again ... ;)
riel <riel> this enormous difference in speed makesit very important that you have the data in memory that you need
riel -
riel 6<riel> if you do not have the data you need in RAM, you need to wait VERY LONG (often more than 5 million CPU cycles) before your data is there and your program can continue
riel <riel> on the other hand, everybody knows that you NEVER have enough memory ;)
riel <riel> so the system has to chose which pages to keep in memory (or which pages to read from disk) and which pages to throw away (swap out)
riel <
riel ok, lets try this again ;)
riel the ping timeout probably lost my last 3 minutes of the talk
riel 6<riel> so lets move on to the RAM management
riel <riel> on (page 14)
riel <riel> RAM is the slowest electronic memory in a computer
riel <riel> it is often 100 times slower than the CPU core (in latency)
riel <riel> this is very very slow
riel <riel> but when you see that the disk is 100000 times slower than RAM (in latency), suddenly memory looks fast again ... ;)
riel <-- Sadie has quit (Ping timeout for Sadie[orka.go2.pl])
riel <riel> this enormous difference in speed makesit very important that you have the data in memory that you need
riel -
riel but as we all know, no computer ever has enough memory ... ;)
riel and the speed difference is REALLY big ... this means that the system has to choose very carefully what data it keeps in RAM and what data it throws away (swaps out)
riel lets move on to page 18
riel ok, if a page of a process is NOT in memory (but the process wants it) then the CPU will give an error and abort the program
riel then the Operating System (OS) gets the job of fixing this error and letting the program continue
riel this trap is called a "PAGE FAULT"
riel the OS fixes the job by getting a free page, putting the right data in that page and giving the page to that program
riel after that the process continues just like nothing happened
riel the ONLY big problem is that such a page fault easily takes 5 _million_ CPU cycles
riel so you want to make sure you have as little page faults as possible
riel the other problem is that you only have a "little bit" of memory in your machine
riel and you run out of free memory very fast
riel at that point, the OS needs to choose which data it keeps in memory and which data it swaps out
riel ..... lets move to (page 19) of http://www.surriel.com/lectures/mmtour.html ....
riel the "perfect" thing to do is to throw away (swap out) that data which will not be needed again for the longest time
riel that way you have the longest time between page faults and the minimum number of page faults per minute ... so the best system performance
riel the only problem with this method is that you need to look into the future to do this
riel and that isn't really possible ... ;)))
riel so we have to come up with other ideas that approximate this idea
riel ......
riel one idea is LRU ... we swap out the page which has not been used for the longest time
riel the idea is: "if a page has not been used for 30 minutes, I can be pretty sure I will not use it again in the next 5 seconds"
riel which really makes a lot of sense in most situation
riel unfortunately, there are a few (very common) cases where LRU does the exact wrong thing
riel take for example a system where somebody is burning a CD
riel to burn a CD at 8-speed, you will be "streaming" your data at 1.2MB per second
riel at that speed, it will take just 30 seconds on your 64MB workstation before your mail reader is "older" than the old data from the CD write program
riel and your system will swap out the mail reader
riel which is the exact WRONG thing to do
riel because most likely you will use your mail reader before you will burn the same CD image again
riel LFU would avoid this situation
riel LFU swaps out the page which has been used least often
riel so it would see that the mail reader is being used all the time (pages used 400 times in the last 2 minutes) while the CD image has only been used one time (read from disk, burn to CD and forget about it)
riel and LFU would nicely throw out the CD image data that has been used
riel and keep the mail reader in memory
riel in this situation LFU is almost perfect
riel ... now we take another example ;)
riel if we look at GCC, you will see that it consists of 3 parts
riel a preprocessor (cpp), a compiler (cc1) and an assembler (as)
riel suppose you only have memory for one of these at a time
riel cpp was running just fine and used its memory 400 times in the last minute
riel now it is cc1's turn to do work
riel but cc1 does not fit in memory at the same time as cpp
riel and LFU will swap out parts of cc1 because cpp used its memory a lot ...
riel and does the exact wrong thing ... cpp _stopped doing work_ a second ago
riel and cc1 is now the important process to keep in memory
riel ... this means that both LRU and LFU are good for some situations, but really bad for other situations
riel I got a question if LRU or LFU is better ... the answer is none of them ;)
riel what we really want is something that has the good parts of both LRU and LFU but not the bad parts
riel luckily we have such a solution ... page aging (on page 20)
riel page aging is really simple
riel the system scans over all of memory, and each page has an "age" (just points)
riel if the page has been used since we scanned the page last, we increase the page age
riel <Rob> riel: How do you measure when a page has last been "used"?
riel ... umm yes ... I almost forgot about that part ;))
riel --- when a page is being used, the CPU sets a special bit, the "accessed bit" on the page (or the page table)
riel --- and we only have to look at this bit to see if the page was used
riel --- and after we look at the bit, we set it to 0 so it will change if is being used again after we scan
riel so back to page aging now
riel if the page was used since we last scan it, we make the page age bigger
riel if the page was not used, we make the page age smaller
riel and when the page age reaches 0, the page is a candidate for swapout ... we remove the data and use the memory for something else
riel now there are different ways of making the page age bigger and smaller
riel for making it bigger, we just add a magic number to the page age ... page->age += 3
riel for making it smaller, we can do multiple things
riel if we substract a magic number (page->age -= 1), we will be close to LFU
riel if we divide the page age by 2 (page->age /= 2), we will be close to LRU
riel to be honest, I have absolutely no idea which of the two would work best
riel or if we want system administrators to select this themselves, depending on what the system is doing
riel page aging is used by Linux 2.0, FreeBSD and Linux 2.4
riel somebody thought it would be a good idea to remove page aging in Linux 2.2, but it turned out not to work very well ... ;)
riel ... so we put it back for Linux 2.4 ;))
riel and another question: <HoraPe> riel: what is linux using?
riel HoraPe: Linux 2.0 uses the "page->age -= 1" strategy
riel HoraPe: and Linux 2.4 uses the "page->age /= 2" strategy
riel maybe the first strategy is better, maybe the second strategy is better
riel if we have any volunteers who want to test this, talk to me after the lecture ;))
riel I will now go on to (page 21) and talk about drop-behind
riel most of you have probably heard about read-ahead
riel where the system tries to read in data from a file *before* the program which uses the file needs it
riel this sounds difficult, but if the program is just reading the file from beginning to end it is quite simple ...
riel one problem is that this linear read will quickly fill up all memory if it is very fast
riel and you do not want that, because you also have other things you want to do with your memory
riel the solution is to put all the pages _behind_ where the program has been reading on the list of pages we will swap out next (the inactive list)
riel so in front of where the program is now, you read in all the data the program needs next (very friendly for the program)
riel and in exchange for that, you remove the data the program will probably not need any more
riel of course you can make mistakes here, but if you get it right 90% of the time it is still good for performance ... you do not need to be perfect
riel ... from the part about hard disks, I will skip almost everything
riel ... only (page 23) I will discuss today
riel as you probably know, hard disks are really strange devices
riel they consist of a bunch of metal (or glass) plates with a magnetic coating on them which spin around at rediculously high speeds
riel and there is a read-write arm which can seek across the disk at very low speeds
riel the consequences of this design are that hard disks have a high throughput ... 20 MB/second is quite normal today
riel this is fast enough to keep a modern CPU busy
riel on the other hand, if you need some piece of data, your CPU will have to wait for 5 _million_ CPU cycles
riel so hard disks are MUCH too slow if you're not reading the disk from beginning to end
riel this means that so called "linear reads" are very fast
riel while "random access" is extremely slow
riel you should not be surprised if 90% of the data is in linear reads, but hard disks spend 95% of their time doing random disk IO
riel because the linear IO is so fast the disk can do it in almost no time ;)
riel the normal optimisation for this is "IO clustering", where the OS reads (or writes) as much data in one place of the disk as possible
riel the "as possible" can not be too large, however ...
riel if you have "only" 64 MB RAM in your machine, you probably do not want to do readahead in 2MB pieces
riel because that way you will throw useful data out of memory, which you will need to read in again later, etc...
riel so it is good to read in a small part of data, but it is also good to read in very big parts of data ... and the OS will have to decide on some good value all by itself
riel Linux has some auto-tuning readahead code for this situation (in mm/filemap.c::generic_file_readahead(), for the interested) but that code still needs some work to make it better
riel and of course, another way to make "disk accesses" fast is to make sure you do not access the disk
riel you can do this if the data you need is already (or still) in memory
riel Linux uses all "extra" memory as a disk cache in the hope that it can avoid disk reads
riel and most other good operating systems do the same (FreeBSD for example)
riel ... now I will go on with Linux memory management
riel ... on (page 28) and furter
riel I will explain how memory management in Linux 2.2 chooses which pages to swap out, what is wrong with that and how we fix the situation in Linux 2.4
riel and also the things that are still wrong in Linux 2.4 and need to be fixed later ;)
riel ok, another question: <rcastro> riel: to keep data in memory, have you ever thought about compressed data in memory?
riel --- this is a good idea in some circumstances
riel --- research has shown that compressed cache means that some systems can do with less disk IO and are faster
riel --- on the other hand, for some other systems it makes the system slower because of the overhead of compression
riel --- it really depends on what you do with your system if the "compressed cache" trick is worth it or not
riel --- and it would be interesting to see as an option on Linux since it is really useful for some special systems
riel --- for example, systems which do not have swap
riel ... ok, lets move on to (page 31)
riel Linux 2.2 swapout code is really simple
riel (at least, that's the idea)
riel the main function is do_try_to_free_pages()
riel this function calls shrink_mmap(), swap_out() and a few other - less important - functions
riel shrink_mmap() simply scans all of memory and will throw away (swap out) all cache pages which were not used since the last time we scanned
riel and swap_out() scans the memory of all programs and swaps out every program page which was not used since the last time we scanned it
riel ... (page 32)
riel this is a really simple system which works well if the system load is not too high
riel but as soon as the load gets higher, it can completely break down for some reasons
riel if, for example, the load on the system is very variable, we get problems
riel if you have enough memory for 30 minutes and all memory has been used in those 30 minutes, then after 30 minutes _every_ page has been used since the last time we scanned (30 minutes ago)
riel and then something happens in the system (Netscape gets started)
riel but the OS has no idea which page to swap out, since all pages were used in the last 30 minutes, when we scanned last
riel in that situation, the OS usually swaps out the 'wrong' pages
riel and those wrong pages are needed again 5 milliseconds later
riel which makes the OS swap out *other* wrong pages again, until everything settles down
riel so every time the system load increases, you have a period where the system is really slow and has to adjust to the load ...
riel another problem is that (in shrink_mmap) we scan and swap out pages from the same function
riel this breaks down when we have a very high load on the system and a lot of the pages we want to swap out need to be written to disk first
riel shrink_mmap() will scan every page in memory and start disk IO for the pages that need to be written to disk
riel after that it will start scanning at the beginning again
riel and no page it sees has been used since the last time we scanned it, since kswapd was the only thing running
riel at that point the system -again- starts swapping out the wrong pages
riel a question: <movement> is this the do_try_to_free_pages() printk we hear so much about on lkml ?
riel --- this printk is called when do_try_to_free_pages() cannot find pages to swap out
riel --- not when do_try_to_free_pages() swaps the wrong pages by accident
riel --- so these things are not the same
riel ... lets move on to (page 33) and see how we fix these problems in Linux 2.4
riel the two big changes for Linux 2.4 are page aging and the separation of page aging and page writeback to disk
riel page aging means we are more precise in chosing which page we swap out, so we will have a better chance of having the pages we need in memory
riel and the system will perform better when memory is getting full
riel the separation of page aging and page flushing means that we will not swap out the wrong page just because the right page still needs to be written to disk and we cannot use it for something else yet
riel ... on (page 35) I will explain about the memory queues we have in Linux 2.4
riel we have 3 "types" of pages in Linux 2.4
riel active pages, inactive_dirty pages and inactive_clean pages
riel we do page aging on the active pages
riel and the inactive_dirty and inactive_clean pages are simply sitting there waiting to be used for something else
riel ... now we go back to (page 34) <sorry>
riel having MORE inactive pages means that the system is better able to deal with big allocations and spikes in system load
riel however, moving pages from the active to the inactive list and back is a lot of overhead
riel so having LESS inactive pages is also good ...
riel the solution Linux 2.4 takes is to see how much memory is being reclaimed for other purposes each second (averaged over a minute) and simply keep 1 second of allocations in the inactive queues
riel I am pretty certain we can do this better for Linux 2.5, but nobody has had time yet to research this ...
riel what Linux 2.4 also does is some light background scanning
riel every minute or so all the cache memory is scanned
riel and when the page->age of pages in the cache reaches 0, it will be moved to the inactive list
riel so when the system gets a burst of activity again after 30 minutes, the system knows exactly which pages to swapout and which pages to keep in memory
riel this fixes the biggest problems we have with Linux 2.2 VM
riel ... because we have little time left, I will now go to the Out Of Memory (OOM) killer on (page 43)
riel which will be the last part of the lecture, after this you can ask questions ;)
riel ok, the OOM killer
riel when memory *and* swap are full, there is not much you can do
riel in fact, you can either sit there and wait until a program goes away, or you can kill a program and hope the system goes on running
riel in Linux, the system always kills a process
riel Linux 2.2 kills the process which is currently doing an allocation, which is very bad if it happens to be syslog or init
riel Linux 2.4 tries to be smart and select a "good" process to kill
riel for this, it looks at the size of the process (so killing 1 process gets us all the memory back we need)
riel but also at if it is a root process or if the process has direct hardware access (it is very bad to kill these programs)
riel and at the amount of time the process has been running and the CPU time it has used
riel because it is better to kill a 5-second old Netscape than to kill your mathematical calculation which has been running for 3 weeks
riel even if the Netscape is smaller ...
riel killing the big calculation will mean the computer loses a lot of work, which is bad
riel for Linux 2.5, I guess some people will also want to have the OOM killer look at which _user_ is doing bad things to the system
riel but that are things to do in the future
riel ... on (page 44) you can find some URLs with interesting information
riel ... thank you for your time, if you have any questions, feel free to join the discussion on #qc
riel ... this is the end of my talk, but I will be in #qc for a bit more time
Fernand0 clap clap clap clap clap clap clpa clpar clap
Fernand0 clap clap clap clap clap clap clpa clpar clap
Fernand0 clap clap clap clap clap clap clpa clpar clap
riel btw, for people interested in Linux kernel hacking we have a special IRC channel
riel on irc.openprojects.net #kernelnewbies
riel see http://kernelnewbies.org/ for the #kernelnewbies website
riel most of the time you can find me (and other kernel hackers) on that channel
riel if you have some in-depth questions or find something interesting when reading my slides, you can always go there
Fernand0 well, my friends
Fernand0 feel free to continue discussing at #qc
Fernand0 many thanks to Rik van riel and to all of you for comming here
Tell others about this page:

last edited 2006-08-15 20:16:57 by njit-border-ce01