A Decentralized Virtual Memory Scheme Implemented on an Emulated Multiprocessor

Mats Brorsson

Department of Computer Engineering, Lund University P.O. Box 118, S-221 00 Lund, Sweden

Abstract

A decentralized scheme for virtual memory management on MIMD multiprocessors with shared memory has been developed. Control and data structures are kept local to the processing elements (PEs), which reduces the global traffic and makes a high degree of parallelism possible. Each of the PEs in the target architecture consists of a processor and part of the shared memory, and is connected to the others by a common bus. The traditional approach based on replication or sharing of data structures is not suitable in this case, when the number of PEs is in the magnitude of 100. This is due to the excessive global traffic caused by consistency or mutual exclusion protocols. A variant of Denning's Working Set page replacement algorithm is used, in which each process owns a page list. Shared pages are not present in more than one list, and it is shown that this will not increase the page fault rate in most cases.

1 Background

The idea for a virtual memory first arose in the 1950's [10]. Early computer users had to handle their own memory management by dividing their programs into modules and explicitly specifying which module would replace which when they ran out of memory space. This technique is known as overlaying. The problem became even worse when the concepts of multiprogramming, timesharing and virtual machines were to be considered [2]. One way to characterize the behaviour of a program is by its reference string, in which we can observe that for long periods of time only a relatively well defined subset of the program's address space is referenced. This subset is referred to as the locality set of a program. The changes of the locality set are often abrupt and are called phase transitions [5].

By dividing the memory space into equally sized pages (or segments of varying size), only those pages belonging to the locality set of a program need to be resident in the main memory. The set of pages currently present in the main memory is referred to as the resident set. The mechanism for selecting pages to be removed is called the removal policy or replacement algorithm. One of the most well-known removal policies is the Denning Working Set policy (DWS) [3]. It belongs to the class of variable space policies, as it tries to adjust the memory allocation to the current needs of the program. In order to achieve this, the DWS estimates the locality set of a program by its working set (WS). All the pages that have been referenced during the last T memory references belong to the working set. The parameter T is called the window size and is usually kept constant during the execution, and the same for all programs in the system. Pages that do not belong to any working set are candidates for replacement.

Although conceptually simple, the working set is hard to identify exactly during the execution of a program. It would require a mechanism that updated a list of all the pages belonging to the working set for every memory reference. The standard implementation of the DWS policy is therefore based on an approximation of the working set, as follows. For every page, a reference bit is set each time any location within the page is referenced. The reference bits can be read and reset by software. The operating system maintains a list of all the pages belonging to each working set and a similar list of all the pages that do not belong to any working set, the free page list. As both the working sets and the free page list refer to pages already allocated to page frames in the physical memory, the normal implementation is based on a page frame table containing status information on all the page frames. The working sets and the free page list are then maintained simply as doubly linked lists of records in the page frame table.

The approximation of the working set is computed in the following way. Every time a page fault occurs, the page that is brought into memory is entered into the corresponding working set. The algorithm for moving pages from the working set to the free list is based on a mechanism called aging of the pages in the working set. Using real-time clock interrupts, the operating system keeps track of the number of executed instructions or the CPU time for every program or process in the system. At specific intervals in the virtual time scale of each process, the operating system reads the reference bits of all pages in the working set belonging to that specific process, updates the working set and the free page list according to an aging algorithm, and then resets all or some of the reference bits. The scheme for resetting the reference bits is also part of the aging algorithm. In the simplest aging algorithm, the virtual time interval for updating is equal to the window size T. Pages whose reference bit is not set are then moved from the working set list to the free list. This algorithm results in a very crude approximation of the working set. Other algorithms, based on an updating interval smaller than T, exist that give better approximations [4].

The DWS policy is a local policy in the sense that the replacement decision depends only on the behaviour of the current program or process. Because of this, it works equally well for multiprogramming or multi-user systems. In the case where two or more processes have shared pages, the situation will, however, be substantially different. The fact that a page is shared by two or more processes means that the processes affect each other's paging dynamics. A shared page should be resident as long as any process references it actively. The ideal solution would be to let the shared pages belong to the working sets of all the corresponding processes. When a shared page is removed from one of the working sets, a check must then be made to see if it still belongs to any other working set, in order to decide whether it should be moved to the free list or not. The bookkeeping to make this possible would require a dynamic page frame table, which is expensive to implement, especially if the reference bits are to be updated by hardware. Another solution would be to let the page belong only to the working set of one of the processes sharing the page. If the activity of this process is much less than that of any of the others, the aging will not be handled correctly.

2 Multiprocessors with shared memory.

We will be concerned with a class of computers that is becoming increasingly important, in which a number of processors share the same address space: shared memory multiprocessors. One of the most common architectures of this kind is based on a memory structure where the entire shared memory is global, i.e., equally accessible from all the processors. By having one copy of the operating system and all data structures, such as page tables, page frame tables and free page lists, in global memory, the uniprocessor implementation of paging can be applied directly just by ensuring that all updates of the data structures are atomic. Most commercial multiprocessors use this approach, but they are limited to a small number of processing elements, typically 20-30 [8].

We will now consider the case where the shared memory is physically distributed over a number of processing elements interconnected by a communication system. In particular we will consider a multiprocessor with the following characteristics [6, 9].

- The system consists of a number (magnitude of 100) of identical processing elements (PEs), each containing a processor and part of the shared memory.
- The processing elements are connected by a shared common bus, which means that we must minimize the amount of traffic in order to avoid contention.
- The address space is common to all the processing elements, but the shared memory is physically distributed over the PEs, and the allocation of physical pages can be decided by the paging system in order to reduce the communication costs.

In this type of multiprocessor it is favourable to use a decentralized operating system in which many of the decisions are made locally on each PE. The corresponding data structures also ought to be kept local as much as possible. In the case of page replacement algorithms, the uniprocessor approach described earlier, based on global data structures and global decisions, is not appropriate for this number of processing elements, due to the excessive traffic on the common bus caused by the access and updating of global data structures. Even if the global data is cached, there will be a lot of traffic in order to keep the shared data consistent. This traffic will easily lead to contention on the bus when we increase the number of processing elements.

In the following we will present an extension of the DWS policy suitable for multiprocessors with distributed, shared memory. Compared to the uniprocessor case, the following new conditions are the most important to take into consideration.

- The page frame information is stored locally on each PE.
- As many of the decisions about page allocation as possible should be made locally and based only on local data structures.
- The allocation of pages shared between processes executing on different PEs should take into account the different amount of traffic produced on the communication system by different allocation strategies.

As a consequence of this, the definition of the working set as an approximation of the locality set of a process has to be revised.
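The simple aging algorithm described in section 1 can be modelled in a few lines. The sketch below is illustrative only (Python, with invented names, not the implementation discussed in this paper): at each virtual-time interval equal to T, pages whose reference bit is clear are moved from the working-set list to the free list, and the bits are reset.

```python
def age_working_set(working_set, free_list, ref_bits):
    """One aging scan of the simplest DWS approximation.

    working_set: ordered list of page numbers in the process' working set.
    free_list:   list of pages that belong to no working set.
    ref_bits:    set of pages referenced since the previous scan
                 (models the hardware reference bits).
    """
    for page in list(working_set):
        if page in ref_bits:
            ref_bits.discard(page)      # referenced: reset the bit, keep the page
        else:
            working_set.remove(page)    # not referenced in the last window:
            free_list.append(page)      # candidate for replacement

# Pages 10 and 30 were referenced during the last window; page 20 was not.
ws, free, bits = [10, 20, 30], [], {10, 30}
age_working_set(ws, free, bits)
print(ws, free)   # -> [10, 30] [20]
```

A finer-grained implementation would scan more often than once per window, as noted above for the algorithms of [4].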

3 A proposed virtual memory scheme.

We suggest a decentralized memory management with a local, variable space page replacement policy, in which pages shared between processes executing on different PEs form one set of their own, and are aged and released based on real time. Pages shared between processes executing on the same PE are incorporated into one of the processes' working sets and are aged with them. This proposal emphasizes the decentralization of the data structures and control, so instead of suffering from the bottleneck of a central operating system and shared data structures, the scheme exhibits high parallelism. A given PE first performs an address translation for its local memory. If it fails to find the requested memory location, then all the other PEs perform an address translation for their memories in order to see if the page is present anywhere in the system's main memory. Data references are carried out over the shared bus, and code pages are copied to the requesting PE. From time to time, the pages in the working sets and in the PE-global allocated list are scanned to see if they have been referenced recently. If not, they are released and inserted into the free list, or written back to the disk in case they have been modified.

3.1 Memory model.

The programming model of the system consists of a number of processing elements (PEs) connected by a common bus. On each PE one or more processes are executing, all of which see the same virtual address space, which makes it possible for processes to share memory. The memory is divided into equally sized pages. The physical memory is distributed over the PEs and is divided into page frames of the same size as the virtual pages. All PEs are equal except that one of them, the page-server, is connected to the disk and thus serves all the others with disk transactions. There are two kinds of requests that a PE can send to the page-server, ReadVirtualPage and WriteVirtualPage. They correspond to a transfer of a page from the disk to the PE and to the disk from the PE, respectively. The virtual page number is transferred to the page-server as a parameter. At the page-server the virtual page directory (VPD) is located. All virtual pages addressable in the system have an entry in the VPD with information about disk location, type (code, data) and whether or not the page contains valid information. When the page-server receives a request for a read of a virtual page, it first checks if the page is valid and then sends it to the PE which requested it. Similarly, if the request is a write, the page-server transfers the page from the PE's memory and then writes it onto the disk.

In each PE there are two sets of pages, common to all processes within that PE: the PE-global allocated list and the free list. The PE-global allocated list consists of those pages which are referenced externally, i.e., through the common bus. Such references are called external memory references, in contrast to local memory references. In the free list are those page frames which are free to be allocated to new virtual pages. To each process is associated a set of pages, a working set, organized as an ordered list of pages. New pages are added to the working set in the event of a page fault. Pages are removed from the working set if they have not been referenced locally for some time or if they become shared with a process on another PE. We can now define the working set of the current process to contain the pages referenced for the first time by this process and which have only been referenced locally during the last time interval. This means that a page shared by processes on the same PE is in one of the working sets, namely the one for the process which first referenced the page since start or since the page was released.

On each PE there is an inverted page table with an entry for each page frame on the PE. The page table is used in the address translation mechanisms and the index gives the page frame number from which the physical address can be derived. Some of the fields in an entry are:

ExternalRef: Tells if the page has been referenced through the common bus.

LocalRef: Tells if the page has been referenced locally.

List: This field tells to which list the page presently belongs.

VPN: The virtual page number.

A page is known to be resident in main memory if and only if its virtual page number can be found in the VPN field for some page frame in the page table. Sections 3.2 and 3.3 contain a detailed description of the actions taken by the memory management system.

3.2 Memory addressing.

Firstly, the page frame table is searched in order to determine if the page is resident in the local main memory. If so, then the List field is checked to see in which of the lists it is present. If the page is either in a working set or in the PE-global allocated list, then the LocalRef bit is set. The position in the page frame table points to the physical page and the reference is carried out. If the page was not found locally, then a request to do the memory reference is sent to the other PEs. This is accomplished by broadcasting the virtual address over the bus to all other PEs. Each of the PEs examines its inverted page table to see if the page is resident in its physical memory. If resident, then the corresponding list can be found in the List field of the page table. If the referenced page is in a working set or in the PE-global allocated list, and the reference is not an instruction fetch, then the page is marked with the ExternalRef bit and the reference is carried out. If the reference is an instruction fetch and the page is resident, then the page is copied to the requesting PE and the reference is done locally, i.e., code pages are never shared between PEs.

Now, if we still have not been able to do the reference, the page has to be requested from the disk. The current process is suspended and the CPU is scheduled to another process. The replacement algorithm is run to get a free page frame, which is allocated to the faulting page, and a ReadVirtualPage request is sent to the page-server. When the page eventually is written to the allocated page frame, the page is added to the faulted process' working set and the CPU is interrupted so that the process can resume its execution. By the page fault, the process' working set has grown by one page. The memory addressing scheme can be expressed with the pseudo code in figure 1.

loop
  GET_ADDR(VAddr, Op)
  if not LOC_MEM_REF(VAddr, Op, Data) then
    if not EXT_MEM_REF(VAddr, Op, Data) then
      PAGEFAULT_HANDLING(VAddr);
      ABORT_PROCESSOR
    end if
  end if
end loop

Figure 1: Memory addressing loop.

3.3 Replacement algorithm.

Figure 2 illustrates the movement of pages and page frames between lists as a consequence of various events. At a page fault, a free page frame is extracted from the free list and allocated to the new page. If the free list length has thereby decreased beneath a certain threshold level, the page aging routine is run in order to remove pages not recently used by the faulted process. If even this fails to provide free page frames, then a page frame is taken in FIFO order from the working set of the current process, to be replaced by the new page.

In order to always have free page frames in the free list under normal load, there are two scan routines which run at the times indicated below. The first routine scans the pages in the current process' working set when the process has accumulated a certain amount of CPU time, and takes action depending on the status of the LocalRef bit.

LocalRef=1: The page has recently been used. The page remains in the working set and the LocalRef bit is reset.

LocalRef=0: The page has not been used since the last invocation of the scan routine, and is thus considered not to be in the process' working set and is released.

In either case, if the ExternalRef bit is set, then the page has been referenced from the communication system and it is moved from the working set to the PE-global allocated list. The second routine scans the pages in the PE-global allocated list and is invoked at regular real time intervals. This scanner acts depending on the ExternalRef bit.

ExternalRef=1: The page has been recently used. The ExternalRef and LocalRef bits are reset.

ExternalRef=0: The page has not been referenced from the common bus since the last invocation of the scan routine. If the LocalRef bit is set, then it has still been referenced locally and no other action than resetting the LocalRef bit is taken. If LocalRef=0, then the page has not been referenced at all during the last time quantum and it is released.

If a page released in one of the scan routines has been modified since it was last read from the disk, then it is sent to the page-server and written to disk. Otherwise there exists a valid copy on disk. The page frame is inserted into the free list.

Figure 2: Page movements between lists.

4 Dynamics for shared pages.

The ideal way to administrate the shared pages would be to let a page be incorporated into the working sets of every process which references the page, independently of the physical location of the page and of which PEs the processes are executing on. In this case the working set is defined to be the pages that have been referenced at least once during the last T memory references. There are two major restrictions to this solution, which led to the proposed approximation described above. Firstly, it is expensive to maintain a true working set according to the definition, so the decision about which pages should be removed from a working set can be approximated by the use of reference bits which are scanned from time to time. This approximation of the working set concept is used, for instance, by Sperry Univac [4]. The next restriction is to let the pages belong to only one working set, disregarding whether or not the page is shared by several processes. This seems at first glance to be a serious violation of the working set concept, but the result, if we look only at the aspect of residency of the pages, is very much the same. The PE-global allocated list is instrumental in making decisions about the pages that are shared by processes which are executing on different processing elements.

4.1 Comparison with an ideal solution.

We compare the proposed solution with an ideal one in two scenarios. In the following discussion, we do not consider the different costs of accessing the local memory or the memory of another processing element. The solutions are compared in the sense of how the pages are organized in the working sets, and not how well the working sets in the proposed solution approximate true working sets. Let a page be shared by two processes, A and B. Each process A and B can be in one of four possible states regarding its activity in accessing the shared page:

1. The process is active and making references to the page.
2. The process is active and does not reference the page.
3. The process is inactive but would have referenced the page if it were active.
4. The process is neither active, nor would it have accessed the page if it were active.

In the first scenario, the processes execute on the same PE. When the scenario begins, the page is in the working set of A, and it is referenced by both A and B.

A  B  ideal        proposed
1  1  in both WS   in A's WS
2  1  in B's WS    in A's WS
3  1  in both WS   in A's WS
4  1  in B's WS    in A's WS
1  3  in both WS   in A's WS
2  3  in B's WS    aged and released
3  3  in both WS   in A's WS
4  3  in B's WS    aged and released

Table 1: Possession of a shared page; A and B on the same PE.

Table 1 shows in which working set, if any, the shared page is when the referencing processes are in different states. The situation where the processes execute on different PEs is illustrated in table 2.

A  B  ideal        proposed
1  1  in both WS   in PE-global list
2  1  in B's WS    in PE-global list
3  1  in both WS   in PE-global list
4  1  in B's WS    in PE-global list
1  3  in both WS   in PE-global list
2  3  in B's WS    aged and released
3  3  in both WS   in PE-global list
4  3  in B's WS    aged and released

Table 2: Possession of a shared page; A and B on different PEs.

From the tables we can see that the only situation where the result differs significantly from the ideal is when process B has need of the page but is blocked for some reason and process A has ceased to access the page. The statement above is true if the processes are actively accessing the page or if they are making no references at all. The effect of the deviation can be reduced by the use of page caches (free and modified page lists). For example, if a page actively shared by process A and process B becomes accessed only by B, and after some time B goes to sleep, then the page will be released and moved to the free or to the modified page list. If B wakes up soon enough, then B will have an opportunity to recover the page from one of the lists.

The proposed extension to a working set based replacement algorithm is a violation of the DWS algorithm, and the approximation of true working sets that is used is fairly coarse. The important thing, however, is that the pages are in main memory when they are needed, and not in which working set they are. With the proposal described above, information for estimating a process' memory demand is lost. The size of a process' working set has lost its correlation to the memory demand, and the working set can no longer be used, without extensions to the scheme described above, as an aid in making decisions about the swapping of processes.

5 Experiments.

5.1 Experimental environment.

To evaluate the proposed organization, an implementation has been done on the MUMS multiprocessor. MUMS is an experimental multiprocessor developed at the Department of Computer Engineering, University of Lund [9]. One of the goals of the MUMS project is to evaluate VLSI based design principles. Several mechanisms in our proposal for memory management may well be implemented in hardware. In order to be able to investigate which mechanisms are suitable to implement in hardware, and to experiment with different hardware structures, we have chosen to emulate parts of the hardware in software. This provides the possibility to carry out experiments with real-life workloads without the severe performance degradation we would have if we simulated the whole system [7].

The processor on each PE is divided into two CPUs, one of which is called the execution CPU, which executes application programs and some of the operating system functions. The other one, the communication CPU, executes the emulation software, which is written in assembly and extended concurrent Pascal. Every memory reference issued by the execution CPU is handled by the communication CPU, and the action taken is completely dependent on the emulation software executing on the communication CPU. The system structure is directed towards architectures with the characteristics described in section 2. The current version of MUMS consists of 38 PE boards interconnected by a common bus on which all communication between the processors takes place. A PE board is divided into a processor part and a memory part which contains 512 kB of the physical memory. One of the PEs (the page-server) is dedicated to serving the others with disk accesses. The page-server is also connected to a MicroVAX II host system on which all software development is done.

A concurrent extension to Pascal has been developed at our department [11], which provides the facility to write programs with a concurrency notion, dynamic creation and deletion of processes, and the concept of semaphores for synchronization. The extended Pascal system of the communication processor is also used as an application programming language in which processes can be spread over the PEs and execute concurrently. So far, the process allocation to processors is static, i.e., no process migration is possible during the execution. Instead, their place of execution is determined at compile time. The interface between the execution processor and the emulation software is similar to the one used between microprocessor and coprocessor in modern microprocessor based computer systems. The proposed memory management system has been implemented using functions provided by the basic emulation system [11]. These functions include address translation mechanisms, communication and memory access primitives.

Measurements are obtained by the insertion of counters, for various events concerning the emulated functions, in the emulation software code. Each time an event occurs, the counter is incremented. Once in a while the counters can be logged to a file gathering the statistics. These constructs are called software probes, and they are extensively used to monitor the behaviour of the emulated structure. For the purpose of evaluating our implementation of distributed virtual memory management, we have used software probes to record statistics about the following events:

- Page faults.
- Process virtual time (measured in number of memory references).
- Working set changes.

The parameters which we have changed from one experiment to another were:

- Window size (T) in the page replacement algorithm.
- Size of available main memory.

A special kind of software probe is the trace function. It allows us to time stamp events. We have used the trace function in our experiments to trace the movement of selected virtual pages from disk to different lists, back to disk and so forth. All transfers of the chosen pages are time stamped and recorded. After the completion of a set of experiments, a considerable amount of statistics has been gathered in different files on the host system. It is not practical for humans to study all of these statistics and try to draw any conclusions from them, so we have constructed a statistics post processor which extracts chosen data from a set of files with common parameter settings. The output from the program can be either in table or in diagram form.

Emulation of hardware in software and the use of software probes form a very useful tool in the investigation of different architectural details. The architecture can be changed and measurements can be made with software probes in a way that would be impossible or infeasible to accomplish if real hardware were to be used. There is also the possibility to experiment with implementing in hardware different mechanisms which are normally realized in the operating system.
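A software probe, as described above, amounts to a set of event counters that are periodically flushed to a statistics log. The following minimal model (Python; the class and record format are invented for illustration and are not those of the MUMS emulation software) shows the idea:

```python
from collections import Counter

class SoftwareProbe:
    """Counts occurrences of named events and flushes them to a log."""

    def __init__(self):
        self.counters = Counter()
        self.log = []                 # stands in for the statistics file

    def event(self, name):
        """Called from the emulation code each time an event occurs."""
        self.counters[name] += 1

    def flush(self, timestamp):
        """Log the counters once in a while, then start a new interval."""
        self.log.append((timestamp, dict(self.counters)))
        self.counters.clear()

probe = SoftwareProbe()
for _ in range(3):
    probe.event("page_fault")
probe.event("ws_change")
probe.flush(timestamp=1000)
print(probe.log)   # -> [(1000, {'page_fault': 3, 'ws_change': 1})]
```

The trace function mentioned above differs only in that it records each event individually with its time stamp instead of aggregating counts per interval.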

5.2 Experimental setup.

The two workloads described here, which form the base of our experiments, are constructed to illustrate the situation in the scenarios mentioned in section 4.1. The first consists of a number of identical processes evenly distributed over three processing elements. These processes do not share any data, but they use a lot of private data and thus put pressure on the memory system in a way that may affect other processes on the same PE. Additionally, there are three processes, one on each PE, which share one page. These processes are called A, B and C. Processes A and B behave as described in the two scenarios of section 4.1, but the shared page is, from the start of the scenario, in the working set of process C. Processes A, B and C execute on PEs 1, 2 and 3 respectively.

Figures 3 and 4 show, for scenarios number one and two respectively, the expected movement of the page shared specifically by processes A and B. References from the processes to the page are shown as straight or dashed lines in the diagrams, and the numbers show the transition from one state of reference intensity to another. These numbers are the same as those mentioned in section 4.1. Dots and wave-lines denote the residency of the shared page: the dots mean that the shared page is in a working set of a process on the corresponding PE, and the transition from dots to a wave-line means that the page has moved from a local working set to the PE-global list.

Figure 3: Movement of a shared page in scenario number 1.

Figure 4: Movement of a shared page in scenario number 2.

In figure 4, which refers to the second scenario, it is shown that when process B, which is executing on PE number 2, is suspended and process A no longer makes references to the page, then the page is aged in PE number 3 and written to the disk. This differs from the ideal organization, where the page would still be in the WS of process B. When A starts to reference the page and B resumes its execution, the page will be read to the PE where it was first referenced, in this case PE number 1. Since it is almost simultaneously referenced by both A and B, it will be in the PE-global allocated list.

A second workload has been run in order to exemplify what kind of data it is possible to produce. The program solves an equation system with Cramer's method in parallel. Two processes execute on each PE. What we expect to get is a flexible and easy-to-understand means of viewing how the behaviour alters with changes in the parameters. We have run the workloads 6 times each, with 20 kB, 25 kB and 50 kB of free memory and window sizes ranging from 6000 to 40000 memory references.

5.3 Results.

To verify the expected behaviour in section 5.2, the movement of page number 81, which is the shared page referred to in both scenarios, has been traced. Figure 5 shows a sample of the trace, derived from the run of a program for scenario number two. Processes on each PE are numbered from 1 and upwards. External memory references are internally treated in the emulation software as if they came from process number 0; the process time for process 0 is consequently the number of external memory references. Firstly, the page is referenced by process 5 on PE number 3, which is process C in the discussion above. Then, after about 30000 memory references, the page becomes shared, and this is seen in the figure as a transfer to the PE-global allocated list on PE number 3.

Page # 81, PE # 3
Process  Process time  Real time  Type      Transfer
5        34404         2480841    Datapage  toWorkingSet
0        4294          2511440    Datapage  toPEglobalAlloca
0        5366          2862057    Datapage  toFreelist

Page # 81, PE # 1
Process  Process time  Real time  Type      Transfer
5        739704        3365689    Datapage  toWorkingSet
0        4022          3388980    Datapage  toPEglobalAlloca

Figure 5: Trace of page number 81 in the experiment of scenario number 2.

Figure 6: Average WS size as a function of the window size for three processes, one from each PE.

Figure 7: Page fault rate as a function of the window size for three processes, one from each PE.

Figure 8: Working set changes with time. Window size equals 6000 memory references.

Figure 9: Average working set size versus window size with different amounts of memory available.

After about another 350000 memory references, the page is no longer part of the PE-global allocated list, since it has not been referenced during the interval between two scans of the reference bits; the page frame is released and inserted into the free list. Later on, the page is accessed by process number 5 on PE 1; this is process A. When process B, which is executing on PE number 2, starts to reference the page, page number 81 is moved to the PE-global allocated list in the same way as was done earlier on PE number 3.

Figures 6, 7 and 8 show some curves measured from a set of experiments running the equation-solving program. The curves are obtained from three processes executing in different processing elements. From each run with certain parameter settings, one pair of coordinates was obtained; pairs of coordinates are connected with straight lines. Figure 9 shows, for one process, how the average working set size changes when the amount of free memory is increased from 20 kB and 25 kB to 50 kB.
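Curves of the kind measured here can be derived directly from a page-reference trace. The sketch below computes the average working-set size for a given window size T, following Denning's definition (the set of distinct pages touched during the last T references); the trace format and function name are illustrative assumptions, not the MUMS measurement tooling.

```python
# Illustrative computation of average working-set size from a reference
# trace, using Denning's definition of W(t, T); not the MUMS tooling.
from collections import deque, Counter

def average_ws_size(trace, window):
    """Average working-set size over a page-reference trace.

    The working set at time t is the set of distinct pages referenced
    during the last `window` references.
    """
    recent = deque()        # the last `window` page references, in order
    counts = Counter()      # reference counts within the window
    total = 0
    for page in trace:
        recent.append(page)
        counts[page] += 1
        if len(recent) > window:
            old = recent.popleft()
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
        total += len(counts)    # |W(t, window)| = distinct pages in window
    return total / len(trace)

# A tiny trace that alternates between two localities: enlarging the
# window can only keep the average working set the same or grow it.
trace = [1, 2, 1, 2, 3, 4, 3, 4, 1, 2]
small = average_ws_size(trace, 2)   # -> 1.9
large = average_ws_size(trace, 6)   # -> 3.0
assert small <= large <= len(set(trace))
```

Since W(t, T) grows monotonically with T at every t, curves of average WS size versus window size, such as those in figure 6, are necessarily non-decreasing.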

5.4 Conclusions.

The goal has been to propose a reasonably simple scheme for managing shared pages with a completely decentralized algorithm, in a multiprocessor with shared memory. The proposed scheme does not use any global data structures, and all decisions can be made locally. This leads to a high degree of parallelism and a smaller amount of traffic on the communication system. Algorithms for allocating shared pages in a way that further minimizes the global traffic can easily be inserted, and will be an object of future research.

The discussion in section 4 indicated that in most cases a shared page will be resident when it is needed, even if it is included in only one of the corresponding processes' working sets. The discussion also points out where the most severe differences will occur, and suggests that a way to minimize the impact on execution time would be to utilize page caches to delay the release of a page, giving processes a possibility to reclaim a page even though it has left the working set.

Experiments in which a parallel program has been run on the MUMS implementation of the proposed scheme were described in section 5.2. The results accord with the expected behaviour, as seen in figures 3 and 4. Other experiments with the same set of parameters as in scenario number two, except that page caches were used, have shown that the shared page is then also kept resident during the period when it was previously paged out. This shows that the use of page caches is a possible way to increase performance.

The MUMS experimental multiprocessor has shown itself to be a good tool for investigating the behaviour of our implementation of a distributed virtual memory system. Different memory organizations and system architectures are easily investigated thanks to the software emulation of hardware mechanisms and to monitoring with software probes. Figures 6 to 9, which have all been generated from the same set of experiments, show some of the flexibility of MUMS as a tool for evaluating different design principles. We now have a firm basis for a thorough evaluation of the proposed decentralized scheme for memory management on MIMD multiprocessors with shared memory. Suitable performance indices and evaluation methods are discussed in [1].

References

[1] M. Brorsson. Evaluating Multiprocessor Virtual Memory Performance using Emulation. Technical report, Department of Computer Engineering, Lund University, Sweden, September 1988.
[2] R. P. Case and A. Padegs. Architecture of the IBM System/370. Communications of the ACM, 21(1):73-96, January 1978.
[3] P. J. Denning. The working set model for program behavior. Communications of the ACM, 11(5):323-333, May 1968.
[4] M. Fogel. The VMOS Paging Algorithm: A Practical Implementation of the Working Set Model. ACM Operating Systems Review, 8(1):8-17, January 1974.
[5] I. J. Haikala. Program Behaviour in Memory Hierarchies. PhD thesis, Department of Computer Science, University of Helsinki, Finland, January 1986.
[6] L. Philipson. Implementation of a Pascal Based Parallel Language for a Multiprocessor Computer. Software-Practice and Experience, 14(7):643-657, July 1984.
[7] B. Stavenow and L. Philipson. Performance Simulation of a Multiprocessor Computer with Virtual Memory. Technical report, Department of Telecommunication Systems and Department of Computer Engineering, Lund University, Sweden, 1984.
[8] P. Stenström. Reducing Contention in Shared-Memory Multiprocessors. IEEE Computer, 21(11), November 1988.
[9] P. Stenström and L. Philipson. A Layered Emulator for Evaluation of Design Principles for MIMD Multiprocessors with Shared Memory. In Proceedings of Parallel Architectures and Languages Europe, pages 329-344. Springer-Verlag, June 1987.
[10] F. H. Sumner, G. Haley, and E. C. Y. Chen. The Central Control Unit of the 'Atlas' Computer. Proceedings IFIP Congress, pages 657-662, 1962.
[11] A. Svensson. A Basic Emulator Environment for a MIMD Multiprocessor with Shared Memory. Technical report, Department of Computer Engineering, Lund University, Sweden, April 1988.

Acknowledgements. The author would like to thank Professor Lars Philipson, who has given his support and contributed invaluable comments and ideas. The following persons are also acknowledged for their work in the project: Per Stenström, Anders Svensson, Anders Ardö and Lars Lundberg. Many thanks to Glenn Jennings for his contribution to the presentation of this paper. This research was sponsored by the Swedish National Board for Technical Development (STU) under contract numbers 80-3962, 83-3647 and 85-3899.