Thursday, December 13, 2007

Memory Management in Linux

Linux is a Unix-like computer operating system. Linux is one of the most prominent examples of free software and open source development; typically all underlying source code can be freely modified, used, and redistributed by anyone.

The Linux kernel was first released to the public on 17 September 1991, for the Intel x86 PC architecture. The kernel was augmented with system utilities and libraries from the GNU project to create a usable operating system, which led to an alternative term, GNU/Linux. Linux is packaged for different uses in Linux distributions, which contain the sometimes modified kernel along with a variety of other software packages tailored to different requirements.

Predominantly known for its use in servers, Linux is supported by corporations such as Dell, Hewlett-Packard, IBM, Novell, Oracle Corporation, Red Hat, and Sun Microsystems. It is used as an operating system for a wide variety of computer hardware, including desktop computers, supercomputers, game systems such as the PlayStation 2 and PlayStation 3, several arcade games, and embedded devices such as mobile phones and routers.

Linux Memory Management

The Linux memory manager implements demand paging with a copy-on-write strategy relying on the 386's paging support. A process acquires its page tables from its parent (during a fork()) with the entries marked as read-only or swapped. Then, if the process tries to write to that memory space, and the page is a copy-on-write page, it is copied, and the page is marked read-write. An exec() results in the reading in of a page or so from the executable. The process then faults in any other pages it needs.
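The fork-then-write sequence described above can be modelled in a few lines. This is a toy simulation under stated assumptions, not kernel code: the names Frame and AddressSpace and their methods are invented for illustration, and a Python dictionary stands in for the page tables.

```python
class Frame:
    """A physical page frame with a reference count (hypothetical model)."""
    def __init__(self, data):
        self.data = bytearray(data)
        self.refs = 1


class AddressSpace:
    """Maps virtual page numbers to [frame, writable] entries."""
    def __init__(self):
        self.pages = {}

    def map_page(self, vpn, data):
        self.pages[vpn] = [Frame(data), True]

    def fork(self):
        """Share every frame with the child and mark both mappings read-only."""
        child = AddressSpace()
        for vpn, entry in self.pages.items():
            frame = entry[0]
            entry[1] = False              # parent loses write permission too
            frame.refs += 1
            child.pages[vpn] = [frame, False]
        return child

    def write(self, vpn, offset, value):
        """Write one byte, resolving a copy-on-write fault if needed."""
        frame, writable = self.pages[vpn]
        if not writable:
            if frame.refs > 1:
                # Still shared: copy the frame and drop our reference to the old one.
                frame.refs -= 1
                frame = Frame(frame.data)
            # The frame is private now, so the entry becomes read-write.
            self.pages[vpn] = [frame, True]
        frame.data[offset] = value

    def read(self, vpn, offset):
        return self.pages[vpn][0].data[offset]


parent = AddressSpace()
parent.map_page(0, b"\x00" * 4096)
child = parent.fork()
child.write(0, 0, 7)    # faults, copies the page, then writes to the copy
```

After the child's write, the parent still sees its original byte, and the two address spaces no longer share a frame for that page.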

Each process has a page directory, which means it can access 1,024 page tables pointing to a total of 1,048,576 (1M) pages of 4 KB each, which is 4 GB of memory. A process' page directory is initialized during a fork by copy_page_tables(). The idle process has its page directory initialized during the initialization sequence.
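The arithmetic behind the 4 GB figure, and the way a 32-bit linear address is divided between the page directory, a page table, and the page itself, can be checked with a short sketch. It assumes the classic two-level 80386 layout (10 bits, 10 bits, 12 bits); the function name split_linear_address is my own.

```python
PAGE_SIZE = 4096            # 4 KB pages
ENTRIES_PER_TABLE = 1024    # entries in a page directory or in a page table

def split_linear_address(addr):
    """Return (directory index, table index, offset) for a 32-bit linear address."""
    directory_index = (addr >> 22) & 0x3FF   # top 10 bits select the page table
    table_index = (addr >> 12) & 0x3FF       # middle 10 bits select the page
    offset = addr & 0xFFF                    # low 12 bits select the byte
    return directory_index, table_index, offset

total_pages = ENTRIES_PER_TABLE * ENTRIES_PER_TABLE   # pages reachable from one directory
addressable_bytes = total_pages * PAGE_SIZE           # total address space in bytes
```

For example, split_linear_address(0x08048123) yields directory entry 32, table entry 72, and offset 0x123.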

Each user process has a local descriptor table that contains a code segment and data-stack segment. These user segments extend from 0 to 3 GB (0xc0000000). In user space, linear addresses and logical addresses are identical.

On the 80386, linear addresses run from 0 GB to 4 GB. A linear address points to a particular memory location within this space. A linear address is not a physical address; it is a virtual address. A logical address consists of a selector and an offset. The selector points to a segment, and the offset tells how far into that segment the address is located.

The kernel code and data segments are privileged segments defined in the global descriptor table and extend from 3 GB to 4 GB. The swapper page directory (swapper_pg_dir) is set up so that logical addresses and physical addresses are identical in kernel space.

The space above 3 GB appears in a process' page directory as pointers to kernel page tables. This space is invisible to the process in user mode, but the mapping becomes relevant when privileged mode is entered, for example, to handle a system call. Supervisor mode is entered within the context of the current process, so address translation occurs with respect to the process' page directory but using kernel segments. This is identical to the mapping produced by using swapper_pg_dir and kernel segments, as both page directories use the same page tables in this space. Only task[0] (the idle task, sometimes called the swapper task for historical reasons, even though it has nothing to do with swapping in the Linux implementation) uses swapper_pg_dir directly.

  • The user process' segment_base = 0x00, page_dir private to the process.
  • When the user process makes a system call: segment_base = 0xc0000000, page_dir = same user page_dir.
  • swapper_pg_dir contains a mapping for all physical pages from 0xc0000000 to 0xc0000000 + end_mem, so the first 768 entries in swapper_pg_dir are zero, and then there are 4 or more that point to kernel page tables.
  • The user page directories have the same entries as swapper_pg_dir from entry 768 upward. The first 768 entries map the user space.
The upshot is that whenever the linear address is above 0xc0000000 everything uses the same kernel page tables.
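The two translation steps in the list above, segmentation followed by the kernel's fixed paging offset, can be sketched as follows. The constant 0xc0000000 and the figure of 768 come from the text; the function names are hypothetical, and the code only models the address arithmetic, not the hardware.

```python
KERNEL_BASE = 0xC0000000         # 3 GB, the kernel segment base from the text
BYTES_PER_DIR_ENTRY = 4 << 20    # one page table maps 1024 * 4 KB = 4 MB

def logical_to_linear(segment_base, offset):
    """Segmentation step: linear address = segment base + offset (32-bit wrap)."""
    return (segment_base + offset) & 0xFFFFFFFF

def kernel_linear_to_physical(linear):
    """Paging step in kernel space: physical address p is mapped at KERNEL_BASE + p."""
    if linear < KERNEL_BASE:
        raise ValueError("not a kernel-space linear address")
    return linear - KERNEL_BASE

def kernel_directory_slots(end_mem):
    """Page-directory indices occupied by the kernel mapping of end_mem bytes."""
    first = KERNEL_BASE // BYTES_PER_DIR_ENTRY      # first kernel entry
    count = -(-end_mem // BYTES_PER_DIR_ENTRY)      # ceiling division
    return range(first, first + count)
```

With 16 MB of physical memory, kernel_directory_slots(16 << 20) covers entries 768 through 771, which matches the "4 or more" entries mentioned above. A kernel offset of 0x100000 becomes linear address 0xC0100000 and translates back to physical address 0x100000, which is why kernel logical and physical addresses coincide.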

The user stack sits at the top of the user data segment and grows down. The kernel stack is not a pretty data structure or segment that I can point to with a "yon lies the kernel stack." A kernel_stack_frame (a page) is associated with each newly created process and is used whenever the kernel operates within the context of that process. Bad things would happen if the kernel stack were to grow below its current stack frame.

User pages can be stolen or swapped. A user page is one that is mapped below 3 GB in a user page table. This region does not contain page directories or page tables. Only dirty pages are swapped.

Minor alterations are needed in some places (tests for process memory limits come to mind) to provide support for programmer-defined segments.

Linux Virtual Memory

Introduction

The Linux philosophy regarding memory usage is that “unused memory is wasted memory”. So what does that mean when you look at the free list when using the top utility or vmstat? It means that top is showing you wasted memory, or rather, memory that is not currently needed. Looking at the free column alone to determine current memory usage is misleading, in that it gives you an incomplete view of the whole memory picture.
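To see why the free column alone is misleading, it helps to add the buffer and cache figures to it. The sketch below parses text in the format of /proc/meminfo and sums the MemFree, Buffers, and Cached fields. The function names are my own, the sample numbers are made up for illustration, and the sum is only a rough estimate, since not every cached page can be reclaimed.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not rest.strip():
            continue                      # skip blank or malformed lines
        fields[name.strip()] = int(rest.split()[0])
    return fields

def reclaimable_estimate_kb(fields):
    """Free memory plus buffers and page cache, in kilobytes (rough estimate)."""
    return fields["MemFree"] + fields.get("Buffers", 0) + fields.get("Cached", 0)

# Made-up sample values, not measurements from a real machine.
SAMPLE = """\
MemTotal:        2048000 kB
MemFree:           51200 kB
Buffers:          102400 kB
Cached:          1228800 kB
"""
sample = parse_meminfo(SAMPLE)
estimate_kb = reclaimable_estimate_kb(sample)
```

In this sample only about 50 MB is on the free list, yet well over 1 GB could be handed out if processes asked for it. On a live system the same functions can be fed the contents of /proc/meminfo.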

In this paper I will try to provide a more complete view of Linux memory management and highlight a few tools that will help reach this goal. I will also outline a method for quickly determining a Linux system's memory use, which should prove handy when you need to eliminate possible contributors to bad performance or chase down memory-related errors. The tools and methods I will use in the examples have been chosen for their availability and ease of use, and can be applied to any Linux system if you want to re-create the scenarios on your own equipment.

First off, we should define some terms that appear throughout this document. Particular attention will be paid to the page cache, because that is where it seems all our memory winds up.

Definitions

Page - a discrete unit of memory that is manipulated by the Linux kernel. On systems that use Intel 32-bit processors, a page is 4096 bytes, or 4 kilobytes (KB).
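Because the kernel works in whole pages, any byte count is effectively rounded up to a page boundary. A minimal sketch of that rounding, using the page size Python reports for the running system (the helper name pages_needed is hypothetical):

```python
import mmap

PAGE_SIZE = mmap.PAGESIZE   # page size of the running system, in bytes

def pages_needed(nbytes, page_size=PAGE_SIZE):
    """Number of whole pages required to hold nbytes (rounds up)."""
    return (nbytes + page_size - 1) // page_size
```

With 4096-byte pages, a request for 4097 bytes occupies two pages.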

Anonymous memory - when a process requests memory through the malloc() library call (which in turn obtains it from the kernel), the process is assigned memory that has no file backing on disk. This is why it is called "anonymous". Because there is no file to write these pages back to, they can only be paged out to swap space on disk. When the kernel needs to free up memory (due to pressure from processes that need more memory or when new processes start), the swap area is used to write out the changed pages. The kernel then adds these reclaimed pages to the free list. When a process tries to access pages that have since been paged out to the swap area, those pages need to be read back from disk and written into memory.
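An anonymous mapping can also be requested directly. The sketch below uses Python's mmap module with a file descriptor of -1, which asks for memory with no file behind it. The helper name make_anonymous_region is my own, and this is only an illustration of the concept, not a description of how malloc() is implemented.

```python
import mmap

def make_anonymous_region(npages):
    """Create an anonymous mapping (no file backing) of npages pages."""
    return mmap.mmap(-1, npages * mmap.PAGESIZE)

region = make_anonymous_region(4)
first_byte_before = region[0]     # fresh anonymous pages read as zero
region[0:5] = b"hello"            # the first write dirties the page
contents = bytes(region[0:5])
region_size = len(region)
region.close()
```

Only the page that was written to is dirty, so under memory pressure it is the one that would need to go to swap before its frame could be reused.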

Buffer cache - The buffer cache is the area of memory set aside to buffer blocks read from or written to disk. This disk activity is known as disk I/O, or disk input and output. Buffer cache also contains filesystem metadata, such as directory structure data and filesystem journaling information.

Page cache - The area of memory set aside for filesystem and process pages that have been read in from disk, or pages that have no file backing. If the kernel needs to allocate memory to a process, and it finds the pages here, there will be no disk I/O operation. The page cache contains anonymous memory pages, processes' executable pages and pages of regular files open for reading and writing. The Linux kernel tries to keep this as large as possible to maintain fast file operations.

Paging - The act of moving pages of memory in from and out to disk. Paging in refers to loading a process's executable image and associated data into memory at startup. It also refers to loading pages into memory that were previously written to swap. Paging out occurs when pages are written to disk in order to free memory. This paging out can be either to the page's file backing on the filesystem or to disk-based swap.
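Paging activity can be watched by comparing two snapshots of the kernel's cumulative counters. The sketch below assumes text in the "name value" format of /proc/vmstat and looks at the pgpgin, pgpgout, pswpin, and pswpout counters. The function names are hypothetical and the sample figures are made up.

```python
PAGING_COUNTERS = ("pgpgin", "pgpgout", "pswpin", "pswpout")

def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters

def paging_delta(before, after, names=PAGING_COUNTERS):
    """Change in each paging counter between two snapshots."""
    return {name: after.get(name, 0) - before.get(name, 0) for name in names}

# Made-up snapshots, not taken from a real machine.
BEFORE = "pgpgin 1000\npgpgout 5000\npswpin 10\npswpout 20\n"
AFTER = "pgpgin 1200\npgpgout 5600\npswpin 10\npswpout 75\n"
delta = paging_delta(parse_vmstat(BEFORE), parse_vmstat(AFTER))
```

In this sample the swap-out counter rose while the swap-in counter stayed flat, which would suggest that pages were being pushed out to swap during the interval but not yet needed back.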

Free list - The pool from which memory allocations are satisfied. The Linux kernel tries to keep the free list at a certain size so that allocations need not always be satisfied from cache. The kernel uses a method of aging where the least recently used pages eventually filter down through different states and are then candidates for being placed here.

Cache hit rate - the rate at which a system can find a page in cache. A miss indicates a read from disk for the requested page, which is slow and needs to be avoided.
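The relationship between cache size, least-recently-used eviction, and hit rate can be explored with a small simulation. This sketch uses plain LRU, which is simpler than the kernel's aging scheme, and the function name simulate_lru_hit_rate is my own.

```python
from collections import OrderedDict

def simulate_lru_hit_rate(accesses, capacity):
    """Replay page accesses through an LRU cache and return hits / total accesses."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    cache = OrderedDict()
    hits = 0
    for page in accesses:
        if page in cache:
            hits += 1
            cache.move_to_end(page)          # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)    # evict the least recently used page
            cache[page] = True               # a miss stands for a read from disk
    return hits / len(accesses) if accesses else 0.0

trace = [1, 2, 1, 3, 1, 2]
small_cache_rate = simulate_lru_hit_rate(trace, 2)
large_cache_rate = simulate_lru_hit_rate(trace, 3)
```

On this trace the two-page cache hits on 2 of 6 accesses, while the three-page cache hits on 3 of 6. That is the reason the kernel tries to keep the page cache as large as it can.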