
CSCI.4210 Operating Systems
Virtual Memory

The memory management strategies for multiprogramming systems described in the previous lesson were based on two assumptions:

  • A process needed to be loaded into memory in its entirety before it could run.
  • A process needed to be loaded into contiguous memory.
These seemed like obvious assumptions when multiprogramming was first introduced. However, as programs became larger and as systems started to load larger numbers of processes at the same time, these primitive memory management systems began to break down.

In the 1970s, most computer systems began to switch to a memory management system called virtual memory, in which both of these assumptions were violated. Virtual memory is based on the observation that at any given time, a process is only using a small fraction of its code and data. A process which may consist of tens of thousands of lines of code might only be running in a few functions. In fact, a substantial fraction of most real world code consists of error handling routines which are seldom used.

Recall the principles of spatial and temporal locality discussed last week. These two principles also apply to virtual memory.

The idea behind virtual memory is that physical memory is divided into fixed size units called page frames, and the address space of a process is divided into pages of the same size. Pages are typically 512 to 8192 bytes, with 4096 being a typical value. Page size is virtually always a power of two, for reasons to be explained below.

[Figure: Virtual Memory (1)]

Pages are loaded into memory only when they are needed. Adjacent pages are not necessarily loaded into adjacent page frames in memory. At any point in time during the execution of a program, only a small fraction of the total pages of a process may be loaded.

[Figure: Virtual Memory (2)]

Note that there are two kinds of addresses (or two address spaces): virtual addresses and real addresses. Virtual addresses, also called logical addresses, are program generated. In a system without virtual memory, a physical address and a logical address are the same; the virtual address is placed on the address bus directly because it refers to a physical address in main memory. On systems with virtual memory, there is a mapping between virtual addresses and real addresses. In the above figure, suppose that the code for the next instruction to be executed is in page 3. The running program only knows the virtual address. Rather than putting the virtual address directly on the address bus, the system sends this address to the Memory Management Unit or MMU. The MMU provides the mapping between virtual addresses and physical addresses. In the example above, this mapping would convert a virtual address in page 3 to a real address in page frame 10. The data structure which contains this mapping is called a page table.

It is possible that a process might access a page which is not in memory. For example, suppose that the process tries to access data in page 8. The MMU will determine that this page is not loaded, and this will generate a page fault. A page fault is an interrupt. The handler for a page fault locates the needed page on the disk, copies it into a page frame in memory, and updates the page table to record where the page is loaded. Control then returns to the process.

When a program first starts running, it will generate a number of page faults. Suppose that the first executable instruction is
    MOV A704,AX; copy value at address hex A704 to register AX
This one instruction will generate two page faults. The instruction itself will be at an address in the code segment, and so the page which holds this instruction will have to be loaded into a frame in memory. Then the page in the data segment or stack which contains virtual address A704 will have to be loaded into another frame. However, if the program conforms to the principles of temporal and spatial locality, subsequent instructions are likely to be in the same page as this instruction, and subsequent accesses to data are likely to be in the same page of data.

    Structure of the page table

    A row of the page table typically has the following fields.

• Page frame number (the physical frame holding the page; the virtual page number itself is the index into the table)
• Present/absent bit (is this page currently loaded?)
    • Protection bit (read/write vs. read only)
    • Dirty bit (has the data been modified?)
    • Referenced bit (has this page been recently referenced?)
• Caching disabled bit (can this data be cached? Computers which support memory mapped devices should not cache certain data).
The dirty bit is important. If at least one value in the page has been modified since the page was loaded from disk into memory, there is a discrepancy between the copy of the page in memory and the copy on the disk. In this case the dirty bit is on; otherwise it is off. This is important, because there will come a time when this page is replaced by another, and when this occurs, if the dirty bit is on, the contents of the page will have to be copied back to disk by the page fault handler. If the dirty bit is off, the copy of the page on the disk is identical to the copy in memory, and so the page can be overwritten without losing any data.
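In code, the eviction path might consult the dirty bit like this (a minimal sketch; the structure and the write_to_disk() helper are illustrative, not from any particular kernel):

    #include <stdbool.h>

    struct page_entry {
        unsigned frame;    /* physical frame currently holding this page */
        bool present;      /* is the page loaded? */
        bool dirty;        /* modified since it was loaded from disk? */
    };

    void write_to_disk(struct page_entry *pe);   /* stand-in for the real I/O path */

    void evict(struct page_entry *pe) {
        if (pe->dirty)
            write_to_disk(pe);   /* memory copy differs from the disk copy */
        /* otherwise the disk copy is identical, so the frame can simply
           be reused without any write */
        pe->present = false;
    }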

    The MMU

When virtual memory was first proposed, many people thought that it would never work because it would be too difficult to implement paging rapidly enough. The page table provides a mapping between the logical (virtual) address and the real address. This mapping has to be very fast, and the page tables can be very large.

Obviously this mapping has to be implemented in hardware. The section of the CPU that provides this service is called the Memory Management Unit (MMU). Note that every memory reference has to go through the MMU.

[Figure: Virtual Memory (3)]

Let's start with a toy implementation to demonstrate some of the issues that the MMU has to deal with. Suppose a system has a 16-bit virtual address space (program addresses range from 0 to 64K), but only 32K of memory, divided into eight page frames of size 4K. A 16-bit virtual address can be divided into a 4-bit page field and a 12-bit offset, so each page and page frame is 4,096 bytes (2^12). The page table consists of 16 entries, one for each possible value of the 4-bit page field. The offset is simply the address within that page. For example, the virtual address 0x5064
(0101 0000 0110 0100 in binary)
is divided into two parts. The first four bits refer to the page (0x5), and the last twelve bits refer to the offset within that page (0x064), which is 100 in decimal.

The page table has an entry for each logical page, 16 in all, and the four-bit page field is an index into the page table.

[Figure: Virtual Memory (4)]
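In C, the split and the lookup amount to a shift and a mask, which is why page sizes are powers of two. Here is a sketch of the toy scheme above (the table contents would be filled in by the page fault handler):

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_BITS   12                       /* 4,096-byte pages */
    #define OFFSET_MASK ((1u << PAGE_BITS) - 1)

    struct pte {
        uint8_t frame;    /* which of the 8 physical frames holds the page */
        bool present;     /* present/absent bit */
    };

    struct pte page_table[16];   /* one entry per value of the 4-bit page field */

    /* Translate a 16-bit virtual address; return -1 to signal a page fault. */
    long translate(uint16_t vaddr) {
        unsigned page   = vaddr >> PAGE_BITS;    /* top 4 bits */
        unsigned offset = vaddr & OFFSET_MASK;   /* low 12 bits */
        if (!page_table[page].present)
            return -1;                           /* page fault */
        return ((long)page_table[page].frame << PAGE_BITS) | offset;
    }

With the example above, translate(0x5064) indexes entry 5 of the table and keeps offset 100.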

If the entire page table is loaded into registers, this is trivial to implement in hardware, and performance would be very fast, but it has some serious problems. The obvious one is that most modern systems allow processes to have very large address spaces. An address space of 32 bits per process is typical now. If the offset field is 12 bits (4,096-byte pages, also a typical figure), then the page field has to be 20 bits, so the page table has to have 2^20 (1,048,576) entries. This is clearly far too large to load into registers.

Also, keep in mind that each process would have its own page table, and so every time a context switch occurred, the page table for the new process would have to be loaded into the registers; this would make context switching very slow, far too slow for a typical system which may make a hundred or more context switches per second.

The alternative is to keep the page table in main memory, but this is inefficient because each memory reference now involves two memory references, one to get the appropriate page table entry and one to fetch the actual data. That is not a viable solution.

One solution is to have a multi-level page table. In the above example, a logical address was divided into two fields, a page field and an offset field. With a multi-level page table, the logical address is divided into three fields: one for the primary page table, one for the secondary page table, and one for the offset of the address within the page.

Here is an example. A system with a 32-bit logical address field divides such an address into a ten-bit field for the first level page table, a ten-bit field for the second level page table, and a 12-bit field for the offset, creating a page size of 2^12 or 4,096 bytes.

There are potentially 2^20 entries in the page table, far too many to keep in memory at one time. However, keep in mind that although each process has a potential address space of 2^32 bytes, most processes only use a tiny fraction of this. In particular, when a process is loaded, there is a large chunk of unused memory in the middle of the address space for the stack and the heap to grow into.

Recall that in the page table design described above, each entry in the page table contained a pointer to real memory. In a two-level page table, each entry in the first level page table contains a pointer to a second level page table, and the second level page table provides the address of the page frame in real memory. There are 2^10 or 1,024 entries in the first level page table, each pointing to a second level page table. The first level page table is always in memory, but only a handful of second level page tables are usually needed.

If the 32-bit virtual address is
00000101 11100110 11100100 01100100 or 0x05E6E464
then to determine the physical address, this logical address is sent to the MMU, which extracts the first ten bits, 0000010111, to use as an index into the first level page table. The first level page table provides a pointer to one of 1,024 second level page tables (in fact most of these will not exist). This table is hopefully in memory. The MMU will extract the second ten bits, 1001101110, and use this as an index into the second level page table, which will provide a pointer to the actual page frame in real memory. The last 12 bits, 010001100100, will provide the offset in the page for the actual byte which is requested.

[Figure: Virtual Memory (5)]
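The two-level walk can be sketched the same way (again illustrative only; a real MMU does this in hardware, and entry formats are architecture specific):

    #include <stdint.h>
    #include <stddef.h>

    #define LEVEL_BITS  10
    #define OFFSET_BITS 12
    #define LEVEL_MASK  ((1u << LEVEL_BITS) - 1)

    typedef struct {
        uint32_t frame;   /* physical page frame number */
        int present;
    } pte_t;

    typedef struct {
        pte_t *tables[1 << LEVEL_BITS];   /* 1,024 second level tables, mostly NULL */
    } top_table_t;

    /* Translate a 32-bit virtual address; return -1 to signal a page fault. */
    int64_t walk(const top_table_t *top, uint32_t vaddr) {
        uint32_t i1     = vaddr >> (LEVEL_BITS + OFFSET_BITS);   /* first 10 bits */
        uint32_t i2     = (vaddr >> OFFSET_BITS) & LEVEL_MASK;   /* next 10 bits */
        uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);     /* last 12 bits */

        pte_t *second = top->tables[i1];
        if (second == NULL || !second[i2].present)
            return -1;                          /* page fault */
        return ((int64_t)second[i2].frame << OFFSET_BITS) | offset;
    }

For the worked example, walk() would compute i1 = 23 and i2 = 622 from 0x05E6E464.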

This model is still too slow to be effective. Even if both of the required page tables are in memory, one logical memory access would now require three memory accesses, because the MMU would have to read the first page table and the second page table to be able to generate the physical address.

The solution is Translation Lookaside Buffers, or TLBs. These are based on the principles of temporal and spatial locality, which most programs conform to. Most processes, no matter how large they are, are usually only executing steps in a small part of the text segment and accessing only a small fraction of the data during any given short time period. This means that they are accessing only a small number of pages. The idea behind a TLB is that the translations for these pages are cached.

The TLB is inside the MMU, and contains pointers to a small number of frequently used pages. Whenever the running program needs to access memory, it first checks to see if the pointer to the page containing that address is stored in the TLB. Most of the time it is, and so the MMU can immediately put the real address on the address bus without having to read any other memory.

Suppose the TLB has slots for 32 pages. Even with this small number, it cannot do a linear search to determine if the page is in the TLB; it needs to search in parallel, and so this has to be implemented in hardware.

Entries in the TLB look very similar to entries in the actual page table; each record contains the virtual page number, the physical page frame number, a valid flag, a dirty bit, and protection information.

The relationship between the TLB, the MMU, and the cache is fairly complicated. Here is a typical arrangement.

[Figure: Virtual Memory (6)]

The CPU generates a virtual address. First, the TLB is searched to see if the translation to a physical address is cached. If there is a hit, we know the physical address. The next step is to search the cache to see if the value is in the cache. If there is a hit, this value is returned to the CPU. In most cases, both of these queries are hits, and so the value can be returned quickly without accessing main memory.

If the TLB does not have a pointer to the page that contains the virtual address, the next step is to search the actual page table. If there is a hit, then we can generate the physical address, and so we search the cache to determine if this value is in the cache. We also add this page to the TLB.

If the page is not in the page table, then a page fault is generated. In this case, the page will be read in from the disk, the page table will be updated, and the TLB will be updated.
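In outline, each memory reference follows this path (a software model of the flow just described; the helper functions are hypothetical, and in real hardware the TLB and cache probes happen in parallel):

    #include <stdint.h>

    /* Hypothetical helpers: the lookups return a physical address or -1. */
    int64_t tlb_lookup(uint32_t vaddr);
    int64_t page_table_lookup(uint32_t vaddr);
    void    handle_page_fault(uint32_t vaddr);   /* reads the page from disk */
    void    tlb_insert(uint32_t vaddr, int64_t paddr);

    int64_t resolve(uint32_t vaddr) {
        int64_t paddr = tlb_lookup(vaddr);       /* common case: TLB hit */
        if (paddr >= 0)
            return paddr;

        paddr = page_table_lookup(vaddr);        /* TLB miss: walk the table */
        if (paddr < 0) {
            handle_page_fault(vaddr);            /* load the page, update table */
            paddr = page_table_lookup(vaddr);    /* retry: guaranteed hit now */
        }
        tlb_insert(vaddr, paddr);                /* cache the translation */
        return paddr;
    }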

    Inverted Page Tables

Most page tables are relatively sparse; they contain space for perhaps a million pages, but only a tiny fraction of these pages are actually used by the typical process. Thus, a traditional page table is somewhat wasteful. Also, as 64-bit computers become more common, page tables will become inconceivably huge. The solution is inverted page tables.

In a regular page table, there is an entry for each virtual page; in an inverted page table there is an entry for each physical page frame, so the page table is much smaller. This would not seem to make much sense, because to access a particular virtual address, it would be necessary to search every entry of the inverted page table to see if the page is there. Of course the system first checks the TLB, and if there is a hit, there is no need to access the page table at all.

An inverted page table uses associative or content addressable memory. Each cell in an associative memory contains a key field and a data field. To search an inverted page table, the system uses a hash function: the virtual page number is passed through a function which generates an index to the appropriate row in the inverted page table. You all remember from your data structures course that a hash function allows very rapid search of large data sets.
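A simplified software model (not any particular architecture's layout) treats the inverted table itself as an open-addressed hash table, where the slot index doubles as the frame number:

    #include <stdint.h>

    #define NFRAMES 4096   /* one entry per physical frame; count is hypothetical */

    struct ipt_entry {
        uint32_t pid;      /* key: owning process */
        uint32_t vpage;    /* key: virtual page number */
        int      used;
    };

    struct ipt_entry ipt[NFRAMES];

    /* Hash the (pid, page) pair to a starting slot, then probe linearly.
     * Returns the frame number (the entry's index) or -1 for a fault. */
    int ipt_lookup(uint32_t pid, uint32_t vpage) {
        unsigned h = (vpage * 2654435761u ^ pid) % NFRAMES;
        for (int probes = 0; probes < NFRAMES; probes++) {
            unsigned i = (h + probes) % NFRAMES;
            if (!ipt[i].used)
                return -1;                /* page not resident: page fault */
            if (ipt[i].pid == pid && ipt[i].vpage == vpage)
                return (int)i;            /* entry index == physical frame */
        }
        return -1;                        /* table full, no match */
    }

This sketch ignores deletions; real designs usually chain collisions through a separate hash anchor table instead of probing.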

    Page Replacement Algorithms

When a page fault occurs, the system has to copy a new page from disk into a page frame in memory. This may involve replacing a page which is already loaded. How does the system decide which page to replace? Note that if the page which is being replaced has been modified (the dirty bit is on), it has to be copied back to disk. It is important to have an efficient page replacement algorithm, because if the page which is replaced is accessed again soon, the result will be another page fault.

Page faults are time consuming. If the system is generating too many page faults, it is doing little real work; this situation is called thrashing.

Let's assume that we have perfect knowledge of the page reference stream into the future. The forward distance of a particular page at a particular time is the next place in the stream where that page is referenced again. The forward distance would be 1 if the very next reference was to that page, and infinity if that page will never be referenced again. The optimal page replacement algorithm would be to replace the page which has the highest forward distance; in other words, replace the page which will not be used for the longest time.

Unfortunately, the system has no way of knowing this. However, it is possible to make a reasonable guess about which pages may not be referenced again soon.
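Although no real system can look ahead, the optimal policy is easy to simulate after the fact when the whole reference string has been recorded, which is how it serves as a yardstick for other algorithms. A sketch:

    #include <limits.h>

    /* Given the resident pages and the full future reference string,
     * pick the resident page with the greatest forward distance. */
    int opt_victim(const int *resident, int nresident,
                   const int *refs, int nrefs, int now) {
        int victim = 0, best = -1;
        for (int i = 0; i < nresident; i++) {
            int dist = INT_MAX;                   /* never referenced again */
            for (int j = now + 1; j < nrefs; j++)
                if (refs[j] == resident[i]) {
                    dist = j - now;               /* next use of this page */
                    break;
                }
            if (dist > best) {
                best = dist;
                victim = i;
            }
        }
        return victim;   /* index into resident[] of the page to replace */
    }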

One simple and obvious strategy is the not-recently-used algorithm (NRU). This algorithm says to replace a page in memory which has not been accessed recently. Because of the principle of temporal locality, we can assume that a page which has not been referenced recently is unlikely to be referenced soon.

The ideal implementation of the NRU algorithm would be to replace the page which has been least recently used (LRU), i.e. has not been accessed for the longest time. However, it is difficult to implement a pure LRU algorithm because this would require keeping track of how long ago each page was referenced, and the overhead associated with this is far too high to be practical. An approximation of LRU will be discussed below.

It is simple to implement a not-recently-used algorithm. A row in a page table usually has a flag called the referenced bit. Most systems have a clock which periodically generates an interrupt about once every 20 msec. On each clock interrupt (or similar schedule), the referenced bits of all of the pages in the page table are set to zero. As each page is referenced, its referenced bit is turned on. When a page fault occurs, if an active page needs to be replaced, the system looks for a page where the referenced bit is off, and replaces that page.

This is far from perfect. If a page fault occurs shortly after a clock interrupt, essentially all of the pages will be marked unreferenced, and so the system will have little information about which page to overwrite.

This system is further complicated by the fact that if the dirty bit is on, the page has to be copied to disk. In this case, a page fault results in both a read-from-disk operation and a write-to-disk operation. Thus, a variant of this NRU strategy is to look first for a page where both the referenced bit and the dirty bit are off, and only replace a page with the dirty bit on if there are no pages with both bits off. Of course this requires additional overhead, because the entire page table has to be searched.
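A sketch of this variant (the page table is reduced here to just the two bits the algorithm inspects):

    #include <stdbool.h>

    struct entry {
        bool referenced;   /* R: set on each access, cleared at clock interrupts */
        bool dirty;        /* D: set when the page is modified */
    };

    /* Prefer a page with R and D both off; fall back to R off but D on;
     * in the worst case (all pages referenced), give up and return 0. */
    int nru_victim(const struct entry *table, int n) {
        int dirty_fallback = -1;
        for (int i = 0; i < n; i++) {
            if (!table[i].referenced) {
                if (!table[i].dirty)
                    return i;              /* best case: no write-back needed */
                if (dirty_fallback < 0)
                    dirty_fallback = i;    /* usable, but requires a write */
            }
        }
        return dirty_fallback >= 0 ? dirty_fallback : 0;
    }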

Another low overhead strategy is the First In First Out (FIFO) page replacement algorithm. In this case, the system keeps track of the load time for each page, and it replaces the page which has been in memory longest. This can be easily implemented by putting each new page at the tail of a list in memory, and removing pages from the head.

The FIFO algorithm is included here for completeness, but it is not used in practice. There is no particular reason to assume that the page which was loaded longest ago is less likely to be accessed soon than a page which was just loaded.

We can improve on the NRU algorithm. The text describes a second chance algorithm and a clock algorithm, but the latter is just an implementation of the former. The clock algorithm keeps all of the current pages in a circular list. There is a pointer which moves around this list like the hand of a clock. When a page fault occurs, the system examines the page that the hand is currently pointing to. If the referenced bit is off, this page is replaced in memory (copying the contents back to disk if the dirty bit is on). If the referenced bit is on for that page, it is set to off and the hand continues around the list. This continues until it finds a page where the referenced bit is off. The algorithm is guaranteed to find such a page, because the list is circular, and the hand will eventually get back to the page where it started, and this page will have its referenced bit off.
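The hand is just an index into a circular array. A minimal sketch:

    #include <stdbool.h>

    struct clock_entry {
        bool referenced;
    };

    /* Advance the hand until a page with R == 0 is found, clearing R bits
     * along the way; this terminates within two sweeps of the list. */
    int clock_victim(struct clock_entry *pages, int n, int *hand) {
        for (;;) {
            if (!pages[*hand].referenced) {
                int victim = *hand;
                *hand = (*hand + 1) % n;     /* leave the hand past the victim */
                return victim;               /* caller writes back if dirty */
            }
            pages[*hand].referenced = false; /* give the page a second chance */
            *hand = (*hand + 1) % n;
        }
    }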

It goes without saying that the referenced bit is turned on whenever a page is referenced.

This algorithm has an advantage over the NRU algorithm: when it finds a page with the referenced bit off, it can be certain that this page has not been referenced in one full sweep across all of the pages, whereas with the NRU algorithm, if the system finds a page with the referenced bit off, there is no guarantee regarding how long ago it was referenced, only that it has not been referenced since the last clock interrupt.

    The clock algorithm or variants of it are used in many real operating systems.

It was stated above that the Least Recently Used (LRU) algorithm would be very efficient, but that it would be difficult to implement in practice. The text talks about two ways to implement LRU in hardware.

One method is for the system to maintain a counter which is incremented after each instruction is executed. Each row of the page table needs a field to hold this counter. Whenever a particular page is referenced, the current value of the counter is copied into that page's row.

When it is time to replace a page in memory, the system checks this value in each page and replaces the page where the value is the lowest.

Here is an example. Suppose that the value of the counter at the start of the scenario is 1000 (in practice, the counter should be very large; the text suggests 64 bits). There are eight pages in memory, numbered one to eight. Here is a string of page references.

    6, 7, 3, 8, 8, 1, 4, 2, 5, 6, 1

After executing this string, here is the value of the counter for each of the eight pages (assuming the counter is incremented just before each value is stored, so the first reference, to page 6, stores 1001).

Page 1: 1011
Page 2: 1008
Page 3: 1003
Page 4: 1007
Page 5: 1009
Page 6: 1010
Page 7: 1002
Page 8: 1005

If a page fault occurred at this point in time, and one of these eight pages had to be replaced, the system would look for the lowest value of the counter, see that it belonged to page 7, and replace that page.
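A software model of this hardware scheme is tiny (pages are numbered 0 to 7 here, and on_reference() stands in for what the MMU would do on every access):

    #include <stdint.h>

    static uint64_t counter = 1000;   /* 64-bit, as the text suggests */
    static uint64_t stamp[8];         /* one counter value per page */

    void on_reference(int page) {
        stamp[page] = ++counter;      /* record the time of this access */
    }

    int lru_victim(void) {
        int victim = 0;
        for (int i = 1; i < 8; i++)
            if (stamp[i] < stamp[victim])   /* oldest stamp = least recently used */
                victim = i;
        return victim;
    }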

This algorithm can only be implemented if there is hardware support for it in the MMU, and few if any hardware developers have implemented such an algorithm.

A close approximation of the LRU algorithm can be implemented entirely in software. The text refers to this as Not Frequently Used (NFU) with aging. It works like this. Each page has a counter associated with it. Recall that operating systems have a clock interrupt which occurs roughly every 20 msec. At each clock interrupt, each counter is shifted right by one bit (this is equivalent to dividing by 2; the least significant bit is lost). If the page has been referenced since the last interrupt, the most significant bit of the counter is set to one, otherwise it is set to zero. Then all of the referenced flags are set to zero.
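The per-tick update is only a few lines of C (a sketch; counter widths and tick frequency vary):

    #include <stdint.h>
    #include <stdbool.h>

    /* Run at every clock interrupt: age all counters, fold in this tick's
     * referenced bits, then clear the bits for the next tick. */
    void age(uint8_t counter[], bool referenced[], int npages) {
        for (int i = 0; i < npages; i++) {
            counter[i] >>= 1;            /* divide by two; low bit is lost */
            if (referenced[i])
                counter[i] |= 0x80;      /* set the most significant bit */
            referenced[i] = false;
        }
    }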

Here is an example. There are six pages in memory, and the counter is eight bits (eight bits is usually enough). During each time slice (the period of time between clock interrupts), the following pages are referenced. For example, during the first time slice, pages 2 and 4 were referenced; the others were loaded but not referenced.

Time slice   Pages referenced
1            2, 4
2            1, 2, 3, 4
3            2, 3, 4, 5, 6
4            1, 2
5            1, 2, 5, 6
6            2
7            1, 2, 6
8            1, 2, 5

After this set of eight time slices, the counters look like this:
Page 1: 11011010 (218)
Page 2: 11111111 (255)
Page 3: 00000110 (6)
Page 4: 00000111 (7)
Page 5: 10010100 (148)
Page 6: 01010100 (84)

When a page fault occurs, the page with the lowest value of the counter is replaced. If a page fault occurred after the above string, page 3 would be replaced, because its counter has the lowest value.

Note that any page which was referenced in the last clock cycle will have a higher value than a page not so referenced, because its most significant bit will be one, and the most significant bit of any page not referenced in the last cycle will be zero. Likewise, any page referenced in the past two cycles will have a higher value than a page which has not been referenced in the past two cycles, and so on.

Working Sets

All of the above models of paging use demand paging. In demand paging, a page is only loaded when it is needed, and a page fault always loads exactly one page. When a process is first created, none of its pages are in memory, and so there will always be a number of page faults when a process starts. After a while, enough pages will be loaded into memory so that the process runs for long periods with few page faults. The set of pages which need to be in memory in order for the program to run for an extended period with few page faults is called the working set.

Operating systems which are running at or near capacity will swap out processes, that is, remove all of their pages from memory and remove the job from the running queue. At some point a swapped out job will be swapped back in, that is, returned to the running queue. If the system is using demand paging, such a job will again generate a number of page faults before having enough pages in memory to run. Thus, it would make sense to keep track of the pages of a process which were in memory when the job was swapped out, and when the job is swapped back in, restore all of these pages to memory before starting to run the job. Loading pages before they generate a page fault is called prepaging, and this model is called the working set model because it restores the entire working set of a process before running it.

This is more complicated than it seems. The operating system needs to keep track of which pages have been recently accessed. It is possible, indeed likely, that a process may have pages loaded which it has not accessed for a while, but which have not been replaced. These pages should not be considered to be a part of the working set.

We can define the working set w(k,t) at time t as the set of pages that have been referenced by a particular process within the past k memory references. Up to a point, as k increases, the size of the working set increases, but this function levels off quickly.

In order to implement this, the system would need to update a register at each memory reference, and this is far too costly to be done even in hardware. However, it is possible to have a reasonable simulation. A field for time of last use is associated with each page (although this cannot be kept entirely current). As with other page replacement algorithms, each page also has a referenced bit. At each clock tick (about once every 20 msec), the system clears this bit (sets it to zero), and each page reference sets it to one.

All operating systems keep track of the total running time of a process. When a page fault occurs, the system checks all of the pages associated with the current process. If the referenced bit is on, meaning that the page has been referenced since the last clock tick, the time of last use field is updated to the current value of the total run time of the process. If the referenced bit is zero, the value in the time of last use field is compared to the current total time value, and if the difference is greater than some value, tau, this page is removed.
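As a sketch, the scan performed at each page fault looks like this (tau and the field names follow the example that comes next):

    #include <stdbool.h>

    struct ws_entry {
        long time_of_last_use;   /* in msec of the process's total run time */
        bool referenced;         /* set on access, cleared at clock ticks */
    };

    /* Scan the faulting process's pages: refresh timestamps of referenced
     * pages; a page unreferenced for more than tau msec of process time
     * has left the working set and may be replaced. Returns a candidate
     * index, or -1 if every page is still in the working set. */
    int ws_scan(struct ws_entry *pages, int n, long now, long tau) {
        int candidate = -1;
        for (int i = 0; i < n; i++) {
            if (pages[i].referenced)
                pages[i].time_of_last_use = now;
            else if (now - pages[i].time_of_last_use > tau)
                candidate = i;
        }
        return candidate;
    }

Running this over the table below with now = 1000 and tau = 50 flags page 6, exactly as traced in the example.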

Here is an example. At an instant in time, the currently running process has six pages in memory. The total running time of the process is now 1000 msec (this value is always an approximation, since it is only updated at clock interrupts). The value of tau, set by the system, is 50 msec, meaning that pages which have not been referenced in the past 50 msec of run time for that process are candidates to be replaced because they are no longer in the working set. Here are the values of the six pages when a page fault occurs. Clearly the page table would have other data as well, but I only show the data relevant to this algorithm.

Page number   Time of last use   Referenced flag
1             980                1
2             980                0
3             960                0
4             980                1
5             930                1
6             940                0

The system needs to find a page to replace. The first page has been referenced since the last clock interrupt, so it is clearly not a candidate for replacement; its time of last use field is updated to 1000 (the current running time value). The second page has not been referenced in this clock cycle, but the current time (1000) minus its time of last use (980) is less than tau (20 is less than 50), so it is not a candidate for removal. The same is true of the third page. Pages 4 and 5 have been recently referenced, so their time of last use fields are updated to 1000. Page 6 has not been referenced since the last clock tick (its referenced bit is zero), and the current time minus its time of last use is 60 msec, which is greater than tau, so it is no longer in the working set and it is a candidate for replacement.

This algorithm works well, but there is a lot of overhead associated with it, because at each page fault every page associated with that process has to be examined and possibly updated. This algorithm can be combined with the clock algorithm to get an efficient page replacement strategy which is widely used in practice.

In this algorithm, the pages associated with a process are kept in a circular list. Each page has a referenced bit, a modified bit (dirty bit), and a time of last use field, as well as a pointer to the physical page frame address. At each page fault, the list is searched to find a suitable candidate for replacement. Searching starts where it ended at the last page fault (picture the hand of a clock). The algorithm differs from the above in that not all pages are searched; searching stops when a suitable page for replacement is found. Also, if a page is not in the working set but its modified bit is on, meaning that the page cannot be replaced until its new values are written to disk, then rather than performing the write immediately, the write to disk is scheduled and the hand moves on, looking for a better candidate page.
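A sketch of one sweep of this combined scan (WSClock-style; schedule_write() is a hypothetical hook for the asynchronous write-back just described):

    #include <stdbool.h>

    struct wsc_entry {
        long time_of_last_use;
        bool referenced;
        bool dirty;
    };

    void schedule_write(struct wsc_entry *p);   /* asynchronous write-back hook */

    /* Move the hand around the circular list until a clean page outside
     * the working set is found; dirty candidates get their writes
     * scheduled and are skipped for now. */
    int wsclock_victim(struct wsc_entry *p, int n, int *hand,
                       long now, long tau) {
        for (int scanned = 0; scanned < 2 * n; scanned++) {
            int i = *hand;
            *hand = (*hand + 1) % n;
            if (p[i].referenced) {
                p[i].referenced = false;         /* still in the working set */
                p[i].time_of_last_use = now;
            } else if (now - p[i].time_of_last_use > tau) {
                if (!p[i].dirty)
                    return i;                    /* clean and aged out: replace */
                schedule_write(&p[i]);           /* write back, keep scanning */
            }
        }
        return -1;   /* no suitable page this sweep; fall back to another rule */
    }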
