Email: ...

Before I start off, I would like to confess that I am a complete Kernel_Newbie who has been hacking 2.4 Kernel Code for only 1 month, and hence some of the information might be inconsistent owing to the inconsistent knowledge of the author. It is highly recommended to combine this Document with Tigrans 2.4 Kernel Internals (, as the left_overs can be traced to his document.

Moreover there are first_hand references in his document in linuxdoc about Process Management, IPC, page_cache, which doesn't call for a repeat invocation, as far as the data structures are concerned. Nevertheless, the best source is the Kernel-Source itself, and the information contained in this document are only for Linux 2.4 on INTEL. There have been rewrites of some sections (the scheduler being one) in 2.4.

By the way, if anyone who is reading this is having ideas of analyzing kernel_code,then he should start with IPC (linux/ipc/{sem.c,shm.c,msg.c,utilc.util.h}), as the sections there are pretty local to their origin, except for a few sched.c references.


One of the methods of learning to Program is by looking at others Code, and none better than witnessing Experts in Action. The Linux Kernel Source is a plethora of knowledge embedded in millions of lines of Code, where performance and wizardry rules. Taking a decision to look at Kernel Code is a step towards getting yourself drowned in a fathomless ocean. Infact for people reluctant or scared to look at Kernel_Code, the best method to start, is by reading more about Linux,and getting oneself theoretically acquainted with its modus_operandi.

It is advisable to have a look at some books on Kernel Internals,especially Linux Device Drivers by A.Rubini. The practical_acquaintance will creep_in with time, and with perseverance, and is a never_ending process, where one starts fighting with the Code, to get under its skin. The code is split up into several_rooms and one can choose his room for a period of time before moving onto another, where the complexities can vary to a large extent. So lets start off by picking a room which is located at 30,000 feet from the earth.

A View from 30,000 feet of the Linux Kernel dancing around in Memory :

A boot through a floppy : (arch/i386/boot/bootsect.S)


The above was a view from 30,000 feet of the Linux Kernel on its WAY_TO_WORK.Let us come down a bit,and start taking a macro view on the things explained above,before moving onto other things.

Booting Specifics :

The BIOS after doing the POST (Power on Self Test), checks for the booting device. If its a boot through a floppy, then arch/i386/boot/bootsect.S gets the cake, but for boots from hard_disks, its handled a bit differently. For hard_disks the MBR (Master Boot Record contains the partition information and also a small program embedded to load the first sector of the active partition into memory.) The boot loader LILO for Linux copies itself in place of the MBR program. This program reads information from a config. file, moves the boot_sector out of the way to 0x9000 and the second stage loader loads setup.S at 0x90200 and the system at 0x10000. The same thing happens during a boot from a floppy, except that the loader does the thing bootsect.S does before jumping to setup at 0x90200. setup.S in arch/i386/boot/ uses BIOS calls to get device information (like memory,video,hd0,hd1,APM-BIOS,mouse) into specific points in the bootsegment ( thats 0x90000 aka INITSEG in setup.S). It uses the bootsegment as a free space to store information found using BIOS calls.

The information include: (They may not be in the order in which they are scanned for and some of them might|should have been skipped. Meanwhile it holds good only zImages (small),whereas bzImages are handled differently as they end up in high_mem.(Check out boot/compressed/misc.c and setup.S for complete info.):

The system is copied from 0x10000 (64K) to 0x1000 (1st page.) Address line A20 (20 bit addressing mode->DOS) is enabled, interrupts are masked off, and the CPU switches to PROTECTED Mode. A 32 bit pointer to the Boot Segment (aka INITSIG) is stored in esi. A jump to 32 bit startup code starts off at head.S (arch/i386/boot/compressed/head.S). head.S starts off by setting up the segment selectors, and the registers to KERNEL_DS (include/asm-i386/segment.h). It then sets up a temporary stack to stack_start (misc.c), and then uses this to call decompress_kernel (misc.c) which decompresses the kernel to 0x100000 (1 MB).(check inflate.c for the inflation performed.). A jump to 0x100000 then ends up to a call to arch/i386/kernel/head.S (note the difference in locations of 2 head.S). This is the 32 bit entry startup code, which enables paging.

(movl $swapper_pg_dir , %eax; movl %eax, %cr3;

It should be noted that the Kernel is at offset PAGE_OFFSET (asm-i386/page.h) which starts at 0xC0000000 or 3 GB virtual memory. This is evident from the script written by Martin Mares,(arch/i386/ which describes the composition of the vmlinux object. {Please read linker script tutorials from GNU, as that would enable you to follow, and also some variables which are directly linked in the script, like start_ksymtab, stop_ksymtab ,etc.etc, which are defined in file. If you dont get to that,and try ruminating over the extern declarations of start_ksymtab and stop_ksymtab in kernel/module.c, then you would never get out of your search_module.) The linker script defines the sections which make the vmlinux object, and the sections like .text.init, .data.init,.initcall.init,.exitcall.exit, are freed up later,to reclaim the memory used by those sections. (check out include/linux/init.h for the macro definitions. They use a special GCC magic to define the sections, like in #define initdata attribute((section(".data.init"))) , and use it like: static char init[] initdata = "remove_windows"; Check out the tutorials or the script for details regarding its usage. This is a performance_hack in the Linux Kernel which I presume is not there in most of the Kernels.).

The first 8 MB page mappings are initialised in pg0 and pg1. The empty_zero_page which is a globally shared page follows up after pg1. The swapper_pg_dir which is the page_directory for init, is initialised for the first 0-8M identity mappings with the Kernel Offset PAGE_OFFSET. When you read the table,please bear in mind that swapper_pg_dir is located at 0x1000 with reference to the kernel loaded at 1MB. Hence the swapper_pg_dir starts off with a mapping of 0x102007 instead of 0x101007 (I might be wrong,but I have a hunch,that I have got it right.) The page directory is split up into 2 parts: Total = 1024 entries: 3 GIG FOR USER, AND 1 GIG FOR KERNEL which has a direct mapping with PAGE_OFFSET -> a.k.a pa(kaddr) == phys_addr: and va(phys_addr) = kern_addr:

These will be discussed when Page tables will be taken up. The cpu information found is loaded up into boot_cpu_data. setup_idt initialises the IDT with 256 entries pointing to no_int,which is the bad interrupt handler defined in head.S. The ldt and the gdt entries are loaded, (There are 8192 LDT entries which modify_ldt manual says that they are used by the processor to keep track of the memory management tables per process. I am not sure about the usage stats of LDT. GDT entries are loaded with 140 quadwords,depended on NR_CPUS, in head.S. They denote the data_segment switch and the Code segment switch(0x10->Kernel Code,0x18 for Kernel Data,0x23 for User_Code and 0x2b for User_Data->segment.h says that), and specify the protection on the segments for text and data. (Am I Right ???) Jump to start_kernel takes place only for BP(Boot Processor),and all other APs (Application Processor) (if in SMP) call initialize_secondary.start_kernel initialises a whole lot of things,which you can checkout in init/main.c before creating the init kernel_thread,and starts of the init as a Kernel Thread.

Kernel Thread:

A Kernel_thread is a cloned thread,which is defined in arch/i386/kernel/process.c. A good example of a Kernel Thread is "keventd" which is started as a kernel_thread using start_context_thread(context_thread,NULL,CLONE_FILES | CLONE_FS) in kernel/context.c which has the operations for the kernel thread keventd. A kernel_thread usually runs in the background, releasing its memory map,and the files struct and becoming one with "init" by calling daemonize() defined kernel/sched.c. keventd gives up its resources,becomes one with init, and goes into background, trying to wait for someone to call schedule_task. It blocks till someone calls schedule_task by putting itself into its wait_queue, calling schedule,thereby waiting for someone to wake it up through a call to schedule_task.

schedule_task queues up the tq_struct passed as an argument in the task_queue tq_context (DECLARE_TASK_QUEUE(tq_context)). It then wakes up the keventd process, blocking in its wait_queue, for someone to wake it up, in order to consume the tq_context queue. After consuming the tq_context queue, it then wakes up the process blocking in flush_scheduled_tasks, context_task_done wait_queue, trying to flush_out queued tasks, before blocking again by calling schedule and waiting for the schedule_task routine to wake it up. Task Queues and Bottom Halves would be taken up after System Calls.

System Call Handling :

arch/i386/kernel/entry.S: system_call is the entry point of System Calls.: The stack state at the time of syscall is defined by struct pt_regs in asm-i386/ptrace.h. There are 3 trap gates to syscalls. lcall7 and lcall27 handlers where the stack state has to be realigned,and the function corresponding to the struct exec_domain registered is called. But ret. path is the same as in int 0x80 which is the normal method of a syscall,and jump to kernel mode. system_call is the entry point to system_calls defined entry.S. This does a SAVE_ALL by saving all the registers in the stack (as defined by the structure "struct pt_regs. (starts from low to high.So it should be ok.)". After storing,it gets the number of the system_call from %eax. Note: (hacknayed note) In normal syscall_handling, the registers used are "eax,ebx,ecx,edx,esi,edi,and ebp(for six arguments which is the max limit.)". "eax" contains the number of the system_call.The rest contain the arguments passed with a max limit of 6. (check out /usr/src/linux/include/asm-i386/unistd.h for the numbers of sys_calls supported.(currently 221 syscalls, and also check out the syscall macro wrappers, which use inlined assembly. (checkout inlined assembly tutorials from its quite necessary to hack kernel code.) Using the number of the system_call,it indexes into a pointer to a function table (sys_call_table->entry.S) to call the function. Note that all of the system call functions use "asmlinkage" which is a gcc macro (defined in /usr/src/linux/include/linux/linkage.h),which tells gcc,to look for the arguments of the system call, in the stack and not in registers,where they were passed originally.And for ptrace,the route is quite different in the sense,that if the task is ptraced (PT_TRACESYS),it jumps to tracesys ,which calls syscall_trace,before indexing into the sys_call_table for calling the corresponding routine and falling back to tracesys_exit to return to ret_from_sys_call. When the system_call handler returns,the return value is moved back to the saved "eax",and the call then goes through ret_from_sys_call. But how does a Child Thread created through a fork or a clone gets handled ? Which route does it take ? fork,and clones take a slightly different route in the sense that they dont first reach ret_from_sys_call. Thats again a fantastic trick which is performed in copy_thread call to fork (check out kernel/fork.c). The copy_thread routine defined in (arch/i386/kernel/process.c) also justifies the glaring fact,that would appear in most of the hackers guide.

The stack pointer is aligned on a 2 page boundary (page = 1 << 12) with the "current" process pointer. Meaning- The Kernel Stack associated with each process is aligned with the Process Pointer on a THREAD_SIZE ( 1 << 13 ) boundary.

The significance of this statement lies in the extraction of the "current" pointer justified in arch/i386/entry.S (check out lcall7,lcall27 handlers for usage and GET_CURRENT for definition.) The definition is pretty simple if one knows to align something on a specific boundary.("AND" with the complement of a X-1, if you want to align on the nearest X boundary.) Also note that the current pointer is kept in "bx".If you want to find "current" from the stack pointer,just "AND." it with the complement of 8191 or -8192 or THREAD_SIZE - 1.(A good explanation of this is given in get_current,although GET_CURRENT in entry.S should say it all as there is no usage of gcc inline asm.("movl %esp, %bx; andl $-8192, %bx;" ) The copy_thread routine sets up a struct pt_regs pointer to POINT to: (struct pt_regs *)(THREAD_SIZE + (unsigned long)task_pointer) - 1 and does a struct_copy of the registers from the parent_thread. (The thread that invoked fork.) This way it ensures that the state of the stack for a ret_from_sys_call is just about right for an "iret" to user_space. It confirms the return state of the child thread by setting up the task->thread.eip to ret_from_fork. (check out entry.S again.) The thread_struct structure associated with a task,keeps status information about the thread,like the threads esp (stack pointer),esp0 (upper limit,-Squeeze me-. but p->thread.esp0 is set to THREAD_SIZE + (unsigned long)task_pointer), fs and gs register values,along with debug register values (8 debug registers). Merely setting up the task->thread.eip (instruction pointer) will not put it in the ret_from_fork path. The trick lies in switch_to (inlined_asm defined in /usr/src/linux/include/asm-i386/system.h),which happens in the scheduler (kernel/sched.c) which switches from the previous task to the next task. The switch_to routine saves di,si and bp into the stack and then saves the stack pointer,and the instruction pointer of the switched task set to the value just after the (jmp switch_to) instruction (a C routine defined in asm-i386/process.c), in its thread_struct, and then resets the stack pointer to the new task,and pushes the return address defined in task->thread.eip (ret_from_fork) in the stack,before jumping to switch_to routine. The switch_to saves the tss values for esp0 of the new switched task,also saves the ioperm bitmaps,saves the fs,and gs in the prev_thread,and loads the fs,and gs in the new thread,loads the debug registers from the new thread,and then bids GOODBYE before entering ret_from_fork. This way,2 threads will go through different return paths in a fork call.(which creates an additional process table entry.) The caller of the fork routine goes through ret_from_sys_call,which is the main destination of most of the routines,and returns with the pid of child process. The child process instead gets back with a return of 0, which is setup in copy_thread by making childregs->eax = 0. ret_from_fork also jumps to ret_from_sys_call but not before calling schedule_tail (defined in kernel/sched.c and will be explained in the scheduler section.) ret_from_sys_call does the following: It first checks the softirq_active bitmask for any active softirqs (tasklets or bottom halves to be run.) If its active, then it calls handle_softirq which jumps to do_softirq (kernel/softirq.c). This returns back to ret_from_intr which checks for its mode when it was called. If in kernel_mode it jumps to restore_all straight, but if in User_Mode,it makes a jump to ret_with_reschedule,which checks if task->need_resched is set ,before jumping to reschedule which calls schedule,and gets a return path to ret_from_sys_call. ret_from_sys_call also checks for signal pending,and if there are signals pending,it jumps to signal_return which calls do_signal(arch/i386/kernel.c) before jumping to restore_all,and back to the USER_SPACE from HELL.

Schematic Representation from ret_from_sys_call: (for return from System_calls)

ret_from_sys_call:Mode-Kernel_Mode | User_Mode

Task Queues and Bottom Halves :

Task_Queues are extended bottom halves which define functions to be either called at specific points by using pre-defined task_queues (tq_immediate,tq_timer,tq_disk.I dont see tq_scheduler in my 2.4s include/tqueue.h file, where as its there in 2.2.x. and sched.c in my 2.4 source doesnt call run_task_queue(&tq_scheduler). To understand task_queues you should first understand bottom_halves or tasklets,and to understand them you should have a look at linux/softirq.c. Its judicial and advisable to split the job of the interrupt handler into 2 halves. While the interrupt handler runs by disabling the interrupt occured,it should be fast in processing the interrupt. Any processing of data or other stuffs can be taken up by a bottom half. For example an interrupt handler can fetch the data,and mark up a task to be executed at known points. What are these known points and when are they processed ? Bottom halves are consumed in 2 places. In the ret_from_sys_call or ret_from_exception or in schedule(), by calling do_softirq (defined linux/kernel/softirq.c). The do_softirq goes through the list of active_softirq (which is a bitmask of softirq_active(cpu).(check include/linux/interrupt.h,and follow up the headers in that file.) It goes through each softirq_action structure and calls the function registered. The tasklets TASKLET_SOFTIRQ,HI_SOFTIRQ ,each having an instance of tasklet_struct,queue themselves by calling either tasklet_schedule or tasklet_hi_schedule. The latter being used by mark_bh routine which is retained for backwards compatibility. The softirqs are initialised at system startup by calling softirq_init,which opens up the softirqs TASKLET_SOFTIRQ and HI_SOFTIRQ. The older versions of bottom_half,are also initialised. If one queues up tasklets by calling tasklet_schedule, they will be consumed by calling tasklet_action defined in softirq.c. The older ones using mark_bh are consumed by calling tasklet_hi_action. But because the old ones are limited in number (32),execute only one function,and run with spinlock held (they cannot block),task_queues were introduced which are extended bottom halves. To understand and to use task_queues,you should look at tqueue.h for its usage. You can do for example: DECLARE_TASK_QUEUE(run_queue); static struct tq_struct tq; tq->routine = your_routine; tq->data = &your_data; And then do: queue_task(&tq,&run_queue); run_task_queue(&run_queue) for it to be consumed. If you use pre-defined task_queues like ,tq_immediate,tq_timer, then the task_queue would be consumed at definite points. tq_immediate queue is a bottom half,and be can run immediately on ret_from_exception/intr by queue_task(&tq,&tq_immediate); mark_bh(IMMEDIATE_BH). The tq_timer would be consumed on a timer interrupt (and also on a console_close),So you do for example: queue_task(&tq,&tq_timer); mark_bh(TIMER_BH). For the rest,softirq.c should deliver the coup-de-grace on this Section.

The Scheduler :

This section again is covered up beautifully in Linux 2.4 Internals,and Lazy(const char *lazy,...) can look at Tigrans Materials, or the mor e active ones, can take up sched.c. The code is actually not that complex, in UNP mode,but gets solid in SMP mode. The gotos rule the roost in schedule,and one will start dancing while analysing the code,because its stuffed with "goto XX, and goto XX_back." They are with a purpose and defy the normal 'C' guides which say, "GOTOs are obsolete and should not be used." gotos result in better assembly code,and the scheduler is one such optimised example. These are the actions performed by the scheduler in a nutshell:

switch_to returns when the scheduled_out task starts running,which then has to go through the schedule_tail routine. The schedule_tail routine (also called from ret_from_fork path,as mentioned before) zeroes out the task->has_cpu,and calls reschedule_idle only if its not an idle_task and the scheduler policy for the task is not SCHED_YIELD,else the SCHED_YIELD is masked out. reschedule_idle is an important routine in SMP zone,whereas in a UNP context,it just amounts to a positive ( > 1) pre-emption_goodness.(The chance of pre-empting a running task) In SMP context,reschedule_idle tries to find an idle_cpu for the previous task that is now scheduled in. It first checks whether the process can_be scheduled in this_cpu (task->processor). In order to get the concept of can_schedule,check out the macro can_schedule,which checks that the Process is not running in another CPU,and has not masked off task->processor from the allowed_cpus. If an idle task is running in task->processor (best_cpu) and if need_resched is not set to -1,and if this_cpu (current->processor) is not equal to best_cpu,it sends a reschedule event to that cpu.(smp_send_reschedule). If an idle_task is not running,then it loops through the list of processors,and tries to find an idle_task which has not run for the maximum time,or tries to locate a task with a positive maximum pre-emption goodness if it fails in getting an idle task on all cpus,before sending a reschedule event on that cpu. The goodness routine which tries to find the goodness value is pretty simple.It masks out SCHED_YIELD with a negative return,real_time tasks with SCHED_FIFO and SCHED_RR policies are given the maximum increment of +1000 on a task->rt_priority.The SCHED_OTHER guys,get their time_slices added with their NICE value. (-20 highest to +19 lowest.) There are also many helper functions defined in sched.c which might be of interest. wake_ups aside,schedule_timeout is one such function which uses Kernel Timers (check it out for timer_list usage., to schedule out a task,and bring it back into action. (see include/linux/timer.h.) Kernel Timers can be used when one wants to run a tasklet at a defined time in future. You can set the expiry interval of the timer in jiffies (number of clock_ticks since System Boot, and is incremented HZ (include/linux/param.h) times/SEC.


struct timer_list your_timer;


your_timer.expires = jiffies + HZ * 5;

your_timer.function = your_function; = (unsigned long)current;

add_timer(&your_timer); /*It gets called after 5 secs,which is set as the expiry of the timer. */


Signal Handling :

In order to get inside signal_handling,one should focus on kernel/signal.c,and arch/i386/kernel/signal.c. Also look up the headers in include/{asm,linux}/{sigcontext.h,siginfo.h,signal.h} for the data_structures. Each signal has a stack frame. Thats setup in set_sigcontext (arch/i386/kernel/signal.c), by storing the stack of the user at the time of ret_from_sys_call(a.k.a struct pt_regs,include/asm/ptrace.h), in the sigcontext structure. This way,it performs a magic to switch to signal handler,by setting up the stack context to the sigframe. Try analysing the sigframe structure which contains the first 2 fields as { char *pretcode; int sig; }. Assuming that most of you might be knowing the function_call semantics of the GCC compiler,you should be able to realise the importance of the first 2 fields. The pretcode, is the return address which will get popped off the stack on return from the sighandler,and the argument to the signal handler gets the signal number,as its just above the return address in the stack layout during a call to the signal handler.The pretcode (which is a pointer) contains the address of retcode (another 8 byte field in the sigcontext structure) which contains the code to jump to Kernel Space by making a "sigreturn" system call. (Check out the hex code of retcode.) The signal number of sigreturn is pushed into eax,and then the return to kernel space is setup by a int 0x80 (interrupt)instruction. The stack layout at the time of sys_call is saved in the sigcontext structure for it to get restored on return to Kernel Space after handling the signal.The eip (Instruction Pointer) is changed to which is the handler of the signal. The segment is changed to USER_DS which is the segment selector switch and control switches to do_signal,which returns to the signal_return path. The signal_return defined in (arch/i386/kernel/entry.S) does a RESTORE_ALL to get back to user space by popping up the registers from the stack. It should be noted that the esp (stack pointer) is set to the sigframe which was set_up as explained above.So the stack frame when the signal handling is done looks like this :

As the eip points to, the execution starts from the signal handler as if a function call has taken place. The return from the signal handler comes back to sigreturn as a result of a system_call because of the retcode thats inserted as a return address in the stack frame. sigreturn,a system_call defined arch/i386/kernel/signal.c restores the signal context by restoring the values stored in sigcontext back to the registers,restores up the old mask and returns.This way the eip is set back to old stuff,that happened when the process got a signal. ret_from_sys_call takes care of all the handle_softirq stuff,reschedule stuff,before doing a RESTORE_ALL,as explained before. The do_signal routine works by dequeuing the signals,one after the another, to send the first possible signal.Dequeueing of signals is logical only if some fragment has queued the signal.This signal queue,and siginfo management is done in linux/kernel/signal.c.When the process receives a signal either through a notify_parent or a kill,the signal is queued up using send_siginfo defined in kernel/signal.c. This routine collects the signal,and queues up the signal in the sigpending structure of the process. The validity of the signal is also checked by bad_signal routine,before queueing up the signal. Some conclusions are arrived at,by the signal_type routine if the handlers fall in SIG_DFL, or SIG_IGN category. There is a special treatment for SIGCHLD, for a SIG_IGN, which never gets ignored. This according to me,is double standards,because the caller gets away if he has a SIG_IGN handler for SIGCHLD,but gets penalised (Zombies galore) if he forgets to install the handler. (a ret.value of 1 instead of 0 in the signal_type routine for {SIGCHLD,SIG_DFL } case should eradicate Zombies.) In fact,the do_signal routine has a special check for {SIGCHLD,SIG_IGN},and does the job of reaping the child by calling sys_wait4 and thus prevents useless Zombies from piling up,with task->state = TASK_ZOMBIE. Please note that Zombies dont clog resources, as when a process exits, exit_notify (linux/kernel/exit.c), performs all cleanup with respect to a tasks resources,before setting the task state to TASK_ZOMBIE. The notify_parent (linux/kernel/signal.c) call in exit_notify,queues up the tasks exit_signal(task->exit_signal) which has a default value of SIGCHLD on a "fork." This function fills up the siginfo structure for the signal to be queued,queues up the signal,calling send_siginfo,and wakes up the parent (and all the others in its thread_group (struct list_head thread_group) ) blocking on its wait_chldexit wait_queue.

Explanation of sys_wait4 is done only to confirm that Zombies do not claim resources, as most of them wrongly believe:

The sys_wait4 syscall,which is responsible for reaping the child, processes its child list, by looping through the task->p_osptr (older sibling pointer), starting from task->p_cptr (youngest child), based on a set of flags. If it finds any tasks with TASK_STOPPED,(include PT_TRACED), and TASK_ZOMBIE, state it takes the corresponding actions. In case of TASK_STOPPED,it sets 0x7f in the first 7 bits,which indicate a TASK_STOPPED state. In case of TASK_ZOMBIE,it adjusts the task->p_opptr,original parent pointer if the task->p_pptr is not in sync with it, readjusts the links in the Process (REMOVE_LINKS,SET_LINKS ->include/linux/sched.h), notifies the parent with a SIGCHLD before returning,else it frees up the task_struct (hence removing the process table entry of the Zombie,which wasnt hogging memory.) and returns with the pid of the child process freed. In case it doesnt find any child exiting and the flags dont contain WNOHANG,it puts itself into the wait_chldexit wait_queue,and blocks (calls schedule() ) until the child wakes it up through a notify_parent call,and continues again repeating the same process described above.

Memory Management related Stuff:

Physical abstraction of memory is represented by the "struct page.". Virtual abstraction of the memory is evident in the vm_area_struct per process. Memory is addressed in Pages, which are 4K on a 32 bit machine. Linux supports 3 level page tables, even though architectures like INTEL, require only 2 levels of Paging. As most of you might be knowing that in Protected Mode coding, the Processors need a mechanism of translating the 32 Bit Virtual Addresses into Physical Addresses,which are aligned on a page boundary. As mentioned before, Linux offers 3 levels which define the course of translation of a virtual address into a physical address. The three levels are: Page Directory a.k.a pgd_t Page Mid Level Directory a.k.a pmd_t Page Table Entry a.k.a pte_t In a 32 bit Virtual Address Space, there has to be space for 4 GB ( 1 << 32 ) entries. This is accomplished by the 3 levels mentioned above. Every process has a Page Directory which has PTRS_PER_PGD number of Entries, a.k.a 1024 Entries, for 1 page of space. Each PGD entry in turn contains a pointer to a PMD entry,which in turn contains a pointer to a PTE entry,which contains the Page Table Entry. (or the Physical address of the Page, a.k.a pa(addr) ) But in 2 level architectures as in INTEL, its suffice to fold the PMD entry into a PGD entry,thereby optimising out one redundant level. Considering the fact that one has to accomodate 4 GB of entries, Each PGD entry,gives room to hold 4 MB of Memory, by allocating a PTE pointer (the PMD is folded into a PGD in 32 bit.)which can in turn hold 1024 Pages,as it has space to accomodate 1024 entries.(Each entry,containing a page.) Hence with each PGD entry holding 4 MB ( 1 << 22 ) of address space,(a.k.a PGDIR_SIZE), you touch the upper limit of 4 GB with 1024 PGD entries. Now let us look at how a Processor actually converts a 32 Bit Virtual Address into a Physical Address.As I have said before,the Processor has the PGD pointer changed to the corresponding tasks PGD pointer during a switch_mm call in schedule. Now it remains to be seen,as to how the processor resolves the 32 bit address into a physical one. This will be illustrated on viewing the definitions of the pgd_offset, pmd_offset, and pte_offset macros, in include/asm/pgtable.Also try looking at the various Page Table headers,like page.h,pgtable-2level.h for INTEL. The processor takes the last 10 bits (obvious as you can have 1024 or 1 << 10 pgdir entries) of a 32 bit Virt. address,to index into the page directory entry. It then dereferences the value at that entry, (which is a PTE pointer,as the PMD is folded for 32 Bit.) and uses that as a pointer to a page table entry to index into the Page table entry which contains the physical page address.It takes 10 bits from Bit #: 12 through Bit #: 22 ,leaving room for a 1 page offset.This way it obtains the physical page address. Example: Bit equivalent:

PGD 10 bits PTE 10 BITS (PMD folded in 32 bit)

0000 0000 00 00 1000 0000 1100 0000 1111 -> (0x00080c0f)

pgd index(0) pte_index(1 << 7 , pmd is folded to pgd)

I hope the translation is clear now. Before we go on to copy_on_write pages (COW pages), page_faults, etc,let us have a look at the virtual abstraction of the physical space. Each process has a Virtual Memory area,which is represented by the vm_area_struct structure.A process can have many such virtual memory areas,sorted out by address. For example, a text segment of a process would be constituting a virtual memory area, a data segment will constitute another,and mapped segments would constitute another,and so on. The Virtual Address bounds of a Process cannot exceed 3 GB (TASK_SIZE),as from then on,its into Kernel Space, which is at offset PAGE_OFFSET ( 3 GB in virtual memory, and has an identity mapping with the physical memory, as represented by the swapper_pg_dir which is the page_directory for init.) After 3 GB, a Process has the Kernel Page Tables entries.Infact it has the same entries as that of the init process, after 3 GB. The KERNEL_PGD_PTRS entries are copied when a Process is created, from swapper_pg_dir,by doing a memcpy(task->mm->pgd + USER_PGD_PTRS, swapper_pg_dir + USER_PGD_PTRS , sizeof(pgd_t) * KERNEL_PGD_PTRS); Hence the last 256 entries (The mapping for 3 GB and up) is copied directly to the process page tables. The links to the next vm_area_struct are reflected by vm_next entry, and the links to the inode address_spaces shared and private mapping, are reflected by vm_next_share and vm_pprev_share entries in the vm_area_struct.The root of the AVL tree,is represented by mmap_avl, in the mm_struct of a process, and the head of the vm_area_structs is the mmap field in the mm_struct,and there is also a mmap_cache for the last vma struct found. AVL trees are used to speed up vm_area_struct searches by having the vm_area_structs represented by a Tree,which starts builing when the mmap count reaches AVL_MIN_MAP_COUNT.You can check out the avl_tree code in mm/mmap_avl.c, which gets included in mm/mmap.c. Each vm area also has pointers to functions that represent the page_faults,no_page handlers,etc. for that area, in vm_operations_struct. If the process has a file mapped to that area,the file can be traced by vm_file field. Across a fork,not only does the vm_area gets copied (check fork.c ),but any inode shares,that the parent has mapped, gets linked to the child thread,and the area is opened.(vm_ops->open ) In order to get a first hand glimpse of vm_area management,you should have a look at mmap.c, which has important definitions for do_mmap_pgoff, do_unmap,do_brk,insert_vm_struct, which inserts a vm struct, sorted by address,and also builds the mmap_avl list,on reaching AVL_MIN_MAP_COUNT,by calling build_mmap_avl.There are also other important routines which you need to take up.(like find_vma,find_vma_prev,vmtruncate_list in memory.c,etc.etc). The do_mmap_pgoff gets called on a mmap syscall. A mmap system call is used to map Physical pages to the tasks virtual address space.Supposing that a task wants to mmap LEN bytes from offset OFFSET (Page aligned), from a File to its virtual address space,it uses a mmap system call,to map them to its address space. The task of do_mmap in a nutshell,apart from verification of the arguments passed from user_space, is to ensure the creation of a virtual memory area,before calling the Files mmap method. (f_op->mmap) This method takes charge of actually mapping the pages to the tasks virtual address space,in order for things to go around smoothly, on access of the virtual address. Most of the calls to mmap,ask the kernel to resolve a virtual address by giving a ZERO argument (even shmat for shared memory calls do that) for the virtual address to be mapped. If the map is not MAP_FIXED,and the argument is a Zero,do_mmap_pgoff gets a virtual address by calling get_unmapped_addr,which starts checking from TASK_UNMAPPED_BASE which starts at TASK_SIZE/3 (1 GIG), to hunt for a virtual address,keeping the upper limit as TASK_SIZE.If the address is found,a vm_area_struct is created,and gets linked with the tasks vma before the address is returned,with a call to insert_vm_struct which also links the vma to the inode address spaces i_mmap or i_mmap_shared lists,depending on whether VM_SHARED flag is set or not. Try taking a look at mmap.c,which is quite a useful file, as far as vm_area management is concerned. I hope by now,you would have visualised the mechanism of do_munmap,which unmaps the virtual address from the processes virtual address mapping. do_munmap,is a bit tricky in the sense that there are 4 cases to be governed,which are elucidated in unmap_fixup.

The first case, doesnt have to do much other than calling vm_ops->close,and cleaning up the slab cache entry for the vma.The second case has to only readjust the vm_end,and the third readjust the vm_start and vm_pgoff (page offset) fields. But the fourth case,has to bring up a new VMA based on the ends being freed.So it results in an extra allocation of VMA. Check out unmap_fixup which is well documented at the beginning,and is pretty simple to follow. But bear in mind,that the call to unmap_fixup has been preceeded by a remove_shared_vm_struct and {zap,flush}page_range calls.Hence the unmap_fixup has to just set up a new VMA and insert the vm struct,if one of the ends get changed, which happens in all the cases except the first. do_munmap,builds up a free_vma list,before doing the updation of te vmas. The do_brk call which sets up the brk segment of the task,(end of the data segment,which is what malloc does,)is a do_mmap in disguise. Examination of error_possibilities aside,it unmaps the last segment, before setting up the new vma. It uses a make_pages_present call,to force the pages to tasks memory,iFF the vm is locked->VM_LOCKED. In other cases,the pages are faulted in by do_no_page which faults in an anonymous page(do_anonymous_page), since the vma doesnt have any vm_operations_struct defined for page_fault handlers. Now its time to switch context to page_fault_handlers which indicates an indirect reference to mm/memory.c As I said,do_mmap calls the file specific mmap_method,which sets up the environment for faulting in the page, for the virtual address being mapped.The mmap method can do 2 things.It can either fault in the page,and map the virtual address by calling remap_page_range,or can initialise the environment,and postpone the page fault operations to do_no_page handler if it defines one,or can instead decide to fall back on the default do_no_page handler,which tries to fault in an anonymous zeroed page,depending on whether the write_access is set. Imagining that it falls back on the default no_page handler do_no_page,a Process either gets a write_protected globally shared anonymous page (ZERO_PAGE(addr)->empty_zero_page), or an anonymous zeroed new page from do_anonymous_page.The pte entry (page table entry) is set by calling set_pte(ptep,pte_entry), and thus the stage is set for a write_protected_fault do_wp_page iFF the globally shared zero page was mapped to the process address space.Considering that it was mapped,the process goes through do_wp_page,which gets the cake for write_protected page faults.The stage is set for a COW mapping which in other words is a COPY_ON_WRITE mapping-Meaning COPY ME ON WRITE ACCESS.(break_cow).A new page is faulted in,and the contents of the pte_wrprotect page are copied to the new page,and the page table entry is set (set_pte)for the faulting address to the new one,whereas the shared_ness of the old page gets decreased. If the page happens to be in the Swap Cache,then the page is marked dirty before returning. The routine do_swap_page is called for swapping in a page,if the page is not present.handle_mm_fault does that,if page_present is not set.(pte_present) do_swap_page looks up a swap_cache and gets in a page.It removes the page from the swap cache (swap_free),and does a set_pte. A page contains some hints regarding its location in the swap. A swap cache entry contains 6 bits to indicate Swap Type, (64 swap devices supported),and 24 bits are retained for SWAP_OFFSET within the swap device. Given a swap entry,you can lookup a swap cache by doing lookup_swap_cache, and get the page.If its not present, there is a swapin_readahead which tries to readahead aligned 1 << clusters,(I didnt get that.) before doing a read_swapcache,to get a page. Some of the routines call handle_mm_fault to fault in the page to tasks virtual address space. One such routine is map_user_kiobuf which uses a kiobuf to force in the pages to user space. Try reading it,because sometimes you might have to use kiobuf,for mapping vmalloced pages to user_space,supposing you define your own page_fault_handler,do_no_page. If you want to define your own page_fault_handler,for mapping kernel_memory allocated using kmalloc or vmalloc to user_space,then you should be prepared to look up the page tables.For kmalloc cases you can use remap_page_range,but be aware of the fact that remap_page_range only works with page_reserved cases. It doesn work if the page is not reserved.(check out remap_pte_range in memory.c). So you are better off,faulting in 1 page at a time,in case of a page_fault exception. I have some sample codes,but they are there in, mapping a device section. vmalloc case is not covered over there. For vmalloc cases,you have to look up kernel Page tables for tracking the page. The reason being-Vmalloc virtual_addresses are contiguous in space,but they are not physically contiguous.vmalloc addresses lie after the end of Physical Memory.(8 MEM HOLE before vmalloc starts). Hence you should go about tracing the pte_entry for those virtual_addresses in the Kernels Page tables. You can do for example : get_page : {

pgd_t *pgd; pmd_t *pmd; pte_t *pte;

struct page *page = NOPAGE_SIGBUS;

pgd = pgd_offset_k(address) ; //pgd_offset_k takes care of passing init_mm.

if(! pgd_none(*pgd) ) {

pmd = pmd_offset(pgd,address);

if(! pmd_none(*pmd) ) {

pte = pte_offset(pmd,address);

if(pte_present(*pte) ) {

page = pte_page(*pte); //get the page




return page;


You should also take care of the vm_operations_struct if you are defining your own page_fault handlers.

struct vm_operations_struct your_ops = {

open: open,

close: close,



Open and close are called on map and unmap time,or on a fork,when a child tries to link itself against the inode address spaces mapping.


A very important and an interesting function which you might want to look at is copy_page_range,(memory.c) called from fork, which copies the page tables(pte_entries) from the parent to the child across a fork. The function also takes care of COW pages,if it manages to detect the COW pages. (vm->vm_flags & (VM_SHARED | VM_MAYWRITE) ) == VM_MAYWRITE). If its a COW page,it write protects the page both in the parent and the child (pte_wrprotect),to ensure that a dirty,writable page would be faulted in when either of them tries to do a write on the page,as was explained before. Try looking at it,as that would help you to have control over page tables,if you want to tweak the page tables. zap_page_range frees the pages mapped to the pte_entries and zeroes out the pte_entries,in the page_tables, given a range of addresses.do_munmap calls it,to free up the pages allocated to the tasks page_tables for the area unmapped. clear_page_tables frees the page table pointers. There are many other utility functions,which might be of help. You can then take up the mprotect ,mlock codes,if you are through with the memory.c and mmap.c stuff. A very important file that implements the slab_allocator for Linux is slab.c.Its an implementation of a slab cache Let us discuss its actions in a nutshell. A cache object is made up of several slabs,and each slab can contain many instances of objects. The slab cache depends upon struct page, to extract the slab and caches respectively from an object pointer. This way when an object gets freed,the slab to which it belongs to is located,or the cache to which it belongs to in case of kfree is located by either calling GET_PAGE_SLAB OR GET_PAGE_CACHE. On system startup,the slab cache gets initialised by a call to kmem_cache_init,kmem_cache_sizes_init. kmem_cache_init initialises the main cache,cache_cache to which all the caches created are linked.kmem_cache_sizes_init creates several caches(normal and dma) of various sizes of powers of 2, starting from 32 through order 5, (PAGE_SIZE << 5) which are used by kmalloc to allocate memory. The maximum order supported is order 5,and hence the object size is limited to order 5.A cache is created by calling kmem_cache_create("your_cache",your_cache_size,your_cache_offset,your_cache_flags,your_constructor,your_destructor). You should have a look at kmem_cache_t structure to get a glimpse of the slab cache.When a cache is created by calling, kmem_cache_create, kmem_cache_estimate is called to compute the number of objects that can fit of the size requested, starting from order 0.There are several constraints that are followed while creating a cache.The order is restricted to max_gfporder which is 5,and cannot exceed this limit. An OFF_SLAB limit needed for getting the bufctls of a slab,is alternatively computed,which is half of the maximum cache_size.A slab which contains many objects is represented by a slab_t structure.Each slab contains a bufctl array,which keeps track of the free object slots per slab.On cache creation,the cache gets linked in the cache_cache.In order to know the flags that one can use,take a look at slab.c.One flag thats widely used while creating a cache,is SLAB_HWCACHE_ALIGN,which aligns the objects on an L1_CACHE_BYTES boundary.There are other options which can trap NULL references to the cache,mainly SLAB_POISON,which fills up the POISON_BYTE, and POISON_END,by leaving a 1 word gap before and after the end of the object.The object size,and the number of objects estimated by kmem_cache_estimate,are stored in the cache structure to be used by the slab,when it gets allocated. Per cache,there is a list of partially free,fully free,and full slabs.The slab that gets selected during allocation is the firstnotfull slab,which selects the slab to allocate an object. When an object is allocated from the cache,using kmem_cache_alloc(your_cache_p,your_flags), kmem_cache_alloc gets called which uses kmem_cache_alloc_one to allocate the object.kmem_cache_alloc_one is a macro,which looks up the firstnotfull slab pointer to see,if there are any available slabs.If there are available slabs,it calls kmem_cache_alloc_one_tail, to get the object. If there are no slabs, then a call to kmem_cache_grow is made,to fetch a single slab to accomodate an object. kmem_cache_grow, tries to get memory by calling kmem_getpages, which calls get_free_pages(flags,cachep->gfp_order), to allocate memory. After the memory is allocated,kmem_cache_grow gets the slab management object by calling kmem_slab_mgmt,and then initialises the objects by calling kmem_cache_init_objs.Then the pages allocated are linked up in the struct page.(SET_PAGE_SLAB,SET_PAGE_CACHE variants.) The slab gets added in the cache list of slabs,and kmem_cache_grow returns.kmem_cache_alloc_one_tail allocates the object,updates the slabs free object number slot,and adjusts the slabs bufctl limit. If the slabs bufctl limit hits a peak (BUFCTL_END),then the firstnotfull pointer is adjusted to point to the next slab, before returning a pointer to the object. An object is freed from the slab_cache by calling kmem_cache_free. kmem_cache_free calls kmem_cache_free_one which makes the objects slot available for reuse,by modifying the slabs free slot. (slab->free) It is to be noted that kfree and kmalloc call kmem_cache_alloc and kmem_cache_free respectively to free up the object.In case of kmalloc,a search is made for the cache pointer corresponding to the size requested, by looking up the cache_sizes structure. kfree locates the cache pointer from the Object pointer by looking up the struct page.(GET_PAGE_CACHE),before calling kmem_cache_free to free up the object.kmem_cache_free apart from freeing up the object,does some readjustments to the firstnotfull pointer based on whether the slab was full before the object was freed, or the slab is free after the object was freed. (Lookout for goto moveslab_partial,moveslab_free.) There is also a reaper function for the slab cache called when to free up memory,when the memory gets overburdened. The reaper starts at clock_searchp cache pointer and tries to find a cache,with the maximum number of free_pages, with an upper limit of REAP_PERFECT. It examines the slabs of each of the caches, looping till slab in use (slab->inuse) is found.It optimises out certain conditions based on the caches with gfp_order > 0,and with a cache_contructor to about 80 percent of the actual free count.If a cache is found that can be reaped,kmem_slab_destroy is called on the slabs after reducing the reaping limit to 80 percent of the slabs, and removing the best_len number of slabs from the caches list of slabs. kmem_slab_destroy frees up the pages by calling kmem_freepages,thus freeing up the memory.Have a look at slab.c, as there are many other factors,like a per_cpu cache for objects is kept in cpucache_t,which gets initialised on a enable_all_cpucaches call.For SMP,the cpucache_t caches the list of objects depending upon the batchcount and the object limit per cpu.The kmem_cache_t contains a per CPU pointer to cpucache_t,where the objects are cached. The updation of the kmem_cache_ts cpucache_t pointer is done by having a ccupdate_t structure,which updates the cpucache_t and frees up any old cpucache_t entries.Try looking at slab.c to see their implementation.


To wrap it up,its evident that most of the issues like spinlocks,semaphores,read/write,IPC werent taken up. I wanted to take them up,but they were very well discussed on Linux 2.4 Internals Document in You should also have a look at include/linux/list.h for the list_head double_linked_list interface as they are heavily used in the Kernel.From what I have analysed till now,a reader can get first hand looks at implementation of all kinds of data structures like linked lists,queues,trees(as in AVL),hashing,etc. Moreover,IPC should be the first point to start hacking the kernel.While going through the shm.c (shared memory), try going through the shmem.c code in linux/mm which is the implementation of the shmem file system used by shared memory,while mapping the files to user_space.Focus on pipelined_send in message queues,and the sem_queue and the sem_undo structure management in sem.c.If you want to check out the reboot magic,have a look at machine_restart in arch/i386/process.c, which switches to real mode before making the jump to BIOS.(0xffff0000). Before I stop,its evident that this document is totally incomplete and will/should have future revisions. Thanks for a patient reading.I hope it was useful.

- A.R.Karthick

KernelNewbies: Karthick (last edited 2017-12-30 01:30:07 by localhost)