Microkernels In a Bit More Depth
Gernot Heiser
COMP9242 2005/S2 Week 5
cse/UNSW/NICTA
MOTIVATION
• Early operating systems had very little structure.
• A strictly layered approach was promoted by [Dijkstra 1968].
• Later OSes (more or less) followed that approach (e.g., Unix).
• Such systems are known as monolithic kernels.
COMP9242 2005/S2 W5 P1
ADVANTAGES OF MONOLITHIC KERNELS
• Kernel has access to everything:
➜ all optimisations possible
➜ all techniques/mechanisms/concepts implementable
• Kernel can be extended by adding more code, e.g. for:
➜ new services
➜ support for new hardware
PROBLEMS WITH LAYERED APPROACH
• Widening range of services and applications ⇒ OS bigger, more complex, slower, more error prone.
• Need to support same OS on different hardware.
• Like to support various OS environments.
• Distribution ⇒ impossible to provide all services from same (local) kernel.
EVOLUTION OF THE LINUX KERNEL
[Figure: Linux kernel source size (.tar.gz, Mb) over time, Jan 1993 to Jan 2004, growing from near zero to over 30 Mb]
Linux 2.4.18: 2.7M LoC
APPROACHES TO TACKLING COMPLEXITY
• Classical software-engineering approach: modularity
➜ (relatively) small, mostly self-contained components
➜ well-defined interfaces between them
➜ enforcement of interfaces
➜ containment of faults to few modules
• Doesn’t work with monolithic kernels:
➜ all kernel code executes in privileged mode
➜ faults aren’t contained
➜ interfaces cannot be enforced
➜ performance takes priority over structure
EVOLUTION OF THE LINUX KERNEL — PART 2
Dependencies between kernel modules via global variables (“common coupling”) as a function of version number [SJW+02]: exponential growth!
MICROKERNEL IDEA: BREAK UP THE OS
[Figure: applications and user-level OS servers run non-privileged on top of a privileged microkernel, which runs on the hardware]
Based on the ideas of Brinch Hansen’s “Nucleus” [BH70]
MONOLITHIC VS. MICROKERNEL OS STRUCTURE
[Figure: left, a monolithic system — the application enters kernel mode via syscall; the kernel contains VFS, IPC and file system, scheduler and virtual memory, device drivers and dispatcher, on top of the hardware. Right, a microkernel system — applications use IPC to talk to user-level servers (UNIX server, device driver, file server); only IPC and virtual memory are in kernel mode, on top of the hardware]
MICROKERNEL OS
• Kernel:
➜ contains code which must run in supervisor mode;
➜ isolates hardware dependence from higher levels;
➜ is small and fast ⇒ extensible system;
➜ provides mechanisms.
• User-level servers:
➜ are hardware independent/portable,
➜ provide “OS environment”/“OS personality” (maybe several),
➜ may be invoked:
   ➜ from application (via message-passing IPC)
   ➜ from kernel (upcalls);
➜ implement policies [BH70].
DOWNCALL VS. UPCALL
• Downcall (syscall): unprivileged user code enters the privileged kernel; implemented via trap.
• Upcall: privileged kernel code enters unprivileged user mode; implemented via IPC.
MICROKERNEL-BASED SYSTEMS
[Figure: three example stacks, each an L4 microkernel on hardware — a classic OS plus security extensions; specialized servers (RT, MM, native Java) on a thin embedded layer; and a highly-specialized component app]
EARLY EXAMPLE: HYDRA
• Separation of mechanism from policy
➜ e.g. protection vs. security
• No hierarchical layering of kernel
• Protection, even within OS
➜ uses (segregated) capabilities
• Objects, encapsulation, units of protection
• Unique object name, no ownership
• Object persistence based on reference counting [WCC+74]
HYDRA ...
• Can be considered the first object-oriented OS
• Has been called the first microkernel OS (by people who ignored Brinch Hansen)
• Has had enormous influence on later operating-systems research
• Was never widely used, even at CMU, because of
➜ poor performance,
➜ lack of a complete environment
POPULAR EXAMPLE: MACH
• Developed at CMU by Rashid and others [RTY+88] from 1984
• Successor of Accent [FR86] and RIG [Ras88]
GOALS:
• Tailorability: support different OS interfaces
• Portability: almost all code H/W independent
• Real-time capability
• Multiprocessor and distribution support
• Security
Coined the term microkernel.
BASIC FEATURES OF MACH KERNEL
• Task and thread management
• Interprocess communication (asynchronous message-passing)
• Memory object management
• System call redirection
• Device support
• Multicomputer support
MACH TASKS AND THREADS
• Task consists of one or more threads
• Task provides address space and other environment
• Thread is active entity (basic unit of CPU utilisation)
• Threads have own stacks, are kernel scheduled
• Threads may run in parallel on a multiprocessor
• “Privileged user-state program” may be used to control scheduling
• Task created from “blueprint” with empty or inherited address space
• Activated by creating a thread in it
MACH IPC: PORTS
• Addressing based on ports:
➜ a port is a mailbox, allocated/destroyed via a system call
➜ has a fixed-size message queue associated with it
➜ is protected by (segregated) capabilities
➜ has exactly one receiver, but possibly many senders
➜ can have a “send-once” capability to a port
• Can pass the receive capability for a port to another process
➜ give up read access to the port
• Kernel detects ports without senders or receiver
• Processes may have many ports (UNIX server has 2000!)
• Ports can be grouped into port sets
➜ allows listening to many ports (like select())
• Send blocks if queue is full
➜ except with send-once cap (used for server replies)
MACH IPC: MESSAGES
• Segregated capabilities:
➜ threads refer to them via local indices
➜ kernel marshalls capabilities in messages
➜ message format must identify caps
• Message contents:
➜ send capability to destination port (mandatory)
   ➜ used by kernel to validate operation
➜ optional send capability to reply port
   ➜ for use by receiver to send reply
➜ possibly other capabilities
➜ “in-line” (by-value) data
➜ “out-of-line” (by-reference) data, using copy-on-write
   ➜ may contain whole address spaces
MACH IPC
[Figure: a message consists of a header, port rights (capabilities), in-line data, and out-of-line data. Out-of-line data is transferred by mapping: before the IPC the pages are mapped in task 1’s virtual address space; after the IPC the same physical memory is mapped into task 2’s address space]
MACH VIRTUAL MEMORY MANAGEMENT
• Address space constructed from memory regions
➜ initially empty
➜ populated by:
   ➜ explicit allocation
   ➜ explicitly mapping a memory object
   ➜ inheriting from “blueprint” (as in Linux clone()); inheritance: none, shared or copied
   ➜ allocated automatically by kernel during IPC, when passing by-reference parameters
⇒ sparse virtual memory use (unlike UNIX)
• 3 page states:
➜ unallocated
➜ allocated & unreferenced
➜ allocated & initialised
COPY-ON-WRITE IN MACH
• When data is copied (“blueprint” or passed by-reference):
➜ source and destination share a single copy,
➜ both virtual pages are mapped to the same frame
• Marked as read-only
• When one copy is modified, a fault occurs
• Handling by kernel: a physical copy is made
➜ VM mapping is changed to refer to the new copy
• Advantage:
➜ efficient way of sharing/passing large amounts of data
• Drawbacks:
➜ expensive for small amounts of data (page-table manipulations)
➜ data must be properly aligned
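The copy-on-write behaviour described above can be sketched as follows; `Frame` and `COWPage` are invented names for this toy model, not Mach data structures:

```python
class Frame:
    """A physical memory frame holding some data."""
    def __init__(self, data):
        self.data = data

class COWPage:
    """A virtual page; shared pages are mapped read-only until written."""
    def __init__(self, frame):
        self.frame = frame
        self.writable = False

    def read(self):
        return self.frame.data

    def write(self, data):
        if not self.writable:
            # write "fault": the kernel makes a physical copy,
            # remaps this page to it, and marks it writable
            self.frame = Frame(self.frame.data)
            self.writable = True
        self.frame.data = data

shared = Frame("hello")
src, dst = COWPage(shared), COWPage(shared)
assert src.frame is dst.frame            # one physical copy, shared
dst.write("bye")                         # first write triggers the copy
assert src.read() == "hello" and dst.read() == "bye"
assert src.frame is not dst.frame        # now two frames exist
```

The drawback noted above is visible here too: even a one-byte write pays for a whole-frame copy and a mapping change.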
MACH ADDRESS MAPS
• Address spaces represented as address maps:
[Figure: a doubly-linked list of map entries (head/tail), each entry referring to a memory object]
• Any part of the AS can be mapped to (part of) a memory object
• Compact representation of sparse address spaces
➜ compare to multi-level page tables?
MEMORY OBJECTS
• Kernel doesn’t support file system
• Memory objects are an abstraction of secondary storage:
➜ can be mapped into virtual memory
➜ are cached by the kernel in physical memory
➜ pager invoked if uncached page is touched
   ➜ used by file-system server to provide data
• Support data sharing
➜ by mapping objects into several address spaces
• Memory is only a cache for memory objects
USER-LEVEL PAGE FAULT HANDLERS
• All actual I/O performed by pager; can be
➜ default pager (provided by kernel), or
➜ external pager, running at user level
[Figure: the kernel maps a task’s memory onto a memory-object cache; on a miss it communicates with the external pager via IPC]
• Intrinsic page fault cost: 2 IPCs
HANDLING PAGE FAULTS
➀ Check protection & locate memory object
➜ uses address map
➁ Check cache, invoke pager if cache miss
➜ uses a hashed page table
➂ Check copy-on-write
➜ perform physical copy if write fault
➃ Enter new mapping into H/W page tables
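The four steps above can be sketched as one function. This is an illustrative model, not Mach's code path — the data structures and names (`handle_fault`, the map/cache dictionaries, the pager callback) are invented:

```python
def handle_fault(addr, is_write, address_map, cache, pager, hw_page_table):
    # 1. check protection & locate the memory object via the address map
    entry = address_map.get(addr // 0x1000)
    if entry is None:
        return "SEGFAULT"
    obj, offset, cow = entry
    # 2. check the cache; invoke the pager on a miss
    #    (in Mach this is where the 2 IPCs to an external pager happen)
    key = (obj, offset)
    if key not in cache:
        cache[key] = pager(obj, offset)
    frame = cache[key]
    # 3. copy-on-write: make a physical copy on a write fault
    if cow and is_write:
        frame = list(frame)
    # 4. enter the new mapping into the H/W page table
    hw_page_table[addr // 0x1000] = frame
    return "OK"

amap = {0x40: ("file.obj", 0, False)}   # one mapped page
cache, hwpt, calls = {}, {}, []
def pager(obj, off):
    calls.append((obj, off)); return ["data"]

assert handle_fault(0x40123, False, amap, cache, pager, hwpt) == "OK"
assert calls == [("file.obj", 0)]       # pager invoked once
assert handle_fault(0x40456, False, amap, cache, pager, hwpt) == "OK"
assert calls == [("file.obj", 0)]       # second fault hits the cache
assert handle_fault(0x99000, False, amap, cache, pager, hwpt) == "SEGFAULT"
```

The second fault on the same page never reaches the pager, which is exactly why the kernel caches memory-object pages.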
REMOTE COMMUNICATION
• Client A sends message to server B on remote node
➀ A sends message to local proxy port for B’s receive port
➁ User-level network message server (NMS) receives from proxy port
➂ NMS converts proxy port into (global) network port
➃ NMS sends message to NMS on B’s node
➜ may need conversion (byte order, ...)
➄ Remote NMS converts network port into local port (B’s)
➅ Remote NMS sends message to that port
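The port-translation steps above amount to two table lookups, one on each node. The sketch below models them with dictionaries; the `NetMsgServer` class and the port names are invented for illustration:

```python
class NetMsgServer:
    """Toy network message server: translates between local proxy ports,
    global network ports, and local receive ports."""
    def __init__(self, proxy_to_net, net_to_local):
        self.proxy_to_net = proxy_to_net    # local proxy port -> network port
        self.net_to_local = net_to_local    # network port -> local port

    def outbound(self, proxy_port, msg):
        net_port = self.proxy_to_net[proxy_port]    # step 3
        return (net_port, msg)                      # step 4: goes on the wire

    def inbound(self, net_port, msg):
        return (self.net_to_local[net_port], msg)   # steps 5-6

nms_a = NetMsgServer({"proxy:B": "net:42"}, {})
nms_b = NetMsgServer({}, {"net:42": "B_receive_port"})

wire = nms_a.outbound("proxy:B", "hello")       # steps 1-4 on A's node
assert wire == ("net:42", "hello")
assert nms_b.inbound(*wire) == ("B_receive_port", "hello")  # on B's node
```

The point of the design is that the kernel sees only local IPC; all network knowledge lives in the user-level NMS.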
MACH UNIX EMULATION
[Figure: a syscall from the application is redirected by Mach to an emulation library in the application’s address space, which communicates with the Unix server via IPC where needed; the result is returned to the application]
• Emulation library in user address space handles IPC
• Invoked by system call redirection (trampoline mechanism)
➜ supports binary compatibility
MACH = MICROKERNEL?
• Most OS services implemented at user level
➜ using memory objects and external pagers
➜ provides mechanisms, not policies
• Mostly hardware independent
• Big!
➜ 140 system calls
➜ size: 200k instructions
• Performance poor
➜ tendency to move features into kernel
➜ Darwin (base of MacOS X) has complete BSD kernel inside Mach
Further information on Mach: [YTR+87, CDK94, Sin97]
OTHER CLIENT-SERVER SYSTEMS
• Lots!
• Most notable systems:
➜ Amoeba: VU Amsterdam, early 1980’s [TM81, TM84, MT86]
➜ Chorus: INRIA (France), from early 1980’s [DA92, RAA+90, RAA+92]
   ➜ commercialised by Chorus Systèmes in 1988
   ➜ recently bought by Sun
➜ Windows NT: Microsoft (early 1990’s) [Cus93]
   ➜ main servers run in kernel mode
CRITIQUE OF MICROKERNEL ARCHITECTURES
“I’m not interested in making devices look like user-level. They aren’t, they shouldn’t, and microkernels are just stupid.” — Linus Torvalds
Is Linus right?
MICROKERNEL PERFORMANCE
• First-generation µ-kernel systems exhibited poor performance compared to monolithic UNIX implementations
➜ particularly Mach, the best-known example
• Reasons are investigated by [Chen & Bershad 93]:
➜ instrumented user and system code to collect execution traces
➜ run on DECstation 5000/200 (25MHz R3000)
➜ run under Ultrix and Mach with Unix server
➜ traces fed to memory-system simulator
➜ analyse MCPI (memory cycles per instruction)
   ➜ baseline MCPI (i.e. excluding idle loops)
ULTRIX VS. MACH MCPI
[Figure: baseline MCPI for Ultrix and Mach across the benchmark suite]
INTERPRETATION
Observations:
• Mach memory penalty (i.e. cache misses or write stalls) higher
• Mach VM system executes more instructions than Ultrix (but has more functionality)
Claim:
• Degraded performance is (intrinsic?) result of OS structure
• IPC cost (known to be high in Mach) is not a major factor [Ber92]
ASSERTIONS
1 OS has less instruction and data locality than user code
➜ system code has higher cache and TLB miss rates
➜ particularly bad for instructions
2 System execution is more dependent on instruction-cache behaviour than is user execution
➜ MCPIs dominated by system i-cache misses
Note: most benchmarks were small, i.e. user code fits in cache
3 Competition between user and system code is not a problem
➜ few conflicts between user and system caching
➜ TLB misses are not a relevant factor
Note: the hardware used has direct-mapped physical caches.
⇒ Split system/user caches wouldn’t help
SELF-INTERFERENCE
➜ Only examine system cache misses
➜ Shaded: system cache misses removed by associativity
➜ MCPI for system-only, using R3000 direct-mapped cache
➜ Reductions due to associativity were obtained by running the system on a simulator with a two-way associative cache of the same size
ASSERTIONS ...
4 Self-interference is a problem in system instruction reference streams
➜ high internal conflicts in system code
➜ system would benefit from higher cache associativity
5 System block memory operations are responsible for a large percentage of memory-system reference costs
➜ particularly true for I/O system calls
6 Write buffers are less effective for system references
➜ write buffer allows limited asynchronous writes on cache misses
7 Virtual-to-physical mapping strategy can have significant impact on cache performance
➜ unfortunate mapping may increase conflict misses
➜ “random” mappings (Mach) are to be avoided
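Assertion 4 — that associativity removes self-interference conflicts — can be demonstrated with a toy cache simulation. The trace and cache sizes below are made up for illustration; the point is only that two hot code paths mapping to the same direct-mapped set thrash, while a 2-way cache of the same capacity absorbs them:

```python
def misses(trace, num_sets, ways):
    """Count misses for an LRU set-associative cache of num_sets x ways lines.
    Cache lines are identified by integer line numbers."""
    sets = [[] for _ in range(num_sets)]
    miss = 0
    for line in trace:
        s = sets[line % num_sets]
        if line in s:
            s.remove(line)      # hit: re-append below as most-recently-used
        else:
            miss += 1
            if len(s) >= ways:
                s.pop(0)        # evict least-recently-used line
        s.append(line)
    return miss

# Lines 0 and 8 conflict in an 8-set direct-mapped cache (0 % 8 == 8 % 8):
trace = [0, 8, 0, 8, 0, 8, 1, 2, 3]
direct = misses(trace, num_sets=8, ways=1)   # direct-mapped: thrashing
twoway = misses(trace, num_sets=4, ways=2)   # same capacity, 2-way LRU
assert direct > twoway
```

This is the mechanism behind the shaded bars on the previous slide: same capacity, fewer conflict misses once associativity lets the two conflicting lines coexist.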
OTHER EXPERIENCE WITH µ-KERNEL PERFORMANCE
• System call costs are (inherently?) high
➜ typically hundreds of cycles, 900 for Mach/i486
• Context (address-space) switching costs are (inherently?) high
➜ getting worse (in terms of cycles) with increasing CPU/memory speed ratios [Ous90]
➜ IPC (involving system calls and context switches) is inherently expensive
SO, WHAT’S WRONG?
• µ-kernels heavily depend on IPC
• IPC is expensive
➜ Is the µ-kernel idea flawed?
➜ Should some code never leave the kernel?
➜ Do we have to buy flexibility with performance?
A CRITIQUE OF THE CRITIQUE
Data presented earlier:
➜ are specific to one (or a few) systems,
➜ results cannot be generalised without thorough analysis,
➜ no such analysis has been done
⇒ Cannot trust the conclusions [Lie95]
RE-ANALYSIS OF CHEN & BERSHAD’S DATA
[Figure: baseline MCPI for Ultrix (U) and Mach (M), split into system cache-miss MCPI and other MCPI; the values per benchmark are:]
Benchmark   Ultrix   Mach
sed         0.227    0.495
egrep       0.035    0.081
yacc        0.067    0.129
gcc         0.434    0.690
compress    0.250    0.418
ab          0.427    0.534
espresso    0.041    0.068
MCPI for Ultrix and Mach; the right end of each bar is i- and d-cache misses in system code. Overhead is mostly in the system.
RE-ANALYSIS OF CHEN & BERSHAD’S DATA ...
Benchmark   Ultrix   Mach
sed         0.170    0.415
egrep       0.024    0.069
yacc        0.039    0.098
gcc         0.130    0.388
compress    0.102    0.258
ab          0.230    0.382
espresso    0.012    0.037
MCPI caused by cache misses: conflict (black) vs capacity (white)
CONCLUSION
• Mach system (kernel + UNIX server + emulation library) is too big!
• UNIX server is essentially the same
• Emulation library is irrelevant (according to Chen & Bershad)
⇒ Mach µ-kernel working set is too big
Can we build µ-kernels which avoid these problems?
REQUIREMENTS FOR µ-KERNELS:
• Fast (system call costs, IPC costs)
• Small (big ⇒ slow)
⇒ Must be well designed, providing a minimal set of operations.
Can this be done?
ARE HIGH SYSTEM COSTS ESSENTIAL?
• Example: kernel call cost on i486
➜ Mach kernel call: 900 cycles
➜ inherent (hardware-dictated) cost: 107 cycles
⇒ ~800 cycles kernel overhead
➜ L4 kernel call: 123–180 cycles (15–73 cycles overhead)
⇒ Mach’s performance is a result of design and implementation, not of the µ-kernel concept!
µ-KERNEL DESIGN PRINCIPLES [LIE96]
• Minimality: if it doesn’t have to be in the kernel, it shouldn’t be in the kernel
• Appropriate abstractions which can be made fast and allow efficient implementation of services
• Well written: it pays to shave a few cycles off the TLB refill handler or the IPC path
• Unportable: must be targeted to specific hardware
➜ no problem if it’s small, and higher layers are portable
➜ example: Liedtke reports significant rewrite of memory management when porting from 486 to Pentium
⇒ an “abstract hardware layer” is too costly
NON-PORTABILITY EXAMPLE: I486 VS PENTIUM
• Size and associativity of TLB
• Size and organisation of cache (larger line size ⇒ restructured IPC)
• Segment registers on Pentium used to simulate a tagged TLB
⇒ different trade-offs
WHAT MUST A µ-KERNEL PROVIDE?
• Virtual memory / address spaces
• Threads
• Fast IPC
• Unique identifiers (for IPC addressing)
A µ-KERNEL DOES NOT HAVE TO PROVIDE:
• File system
➜ use user-level server (as in Mach)
• Device drivers
➜ user-level driver invoked via interrupt (= IPC)
• Page-fault handler
➜ use user-level pager
L4 IMPLEMENTATION TECHNIQUES (LIEDTKE)
• Appropriate system calls to reduce number of kernel invocations
➜ e.g., reply & receive next
➜ as many syscall args as possible in registers
• Efficient IPC
➜ rich message structure
➜ value and reference parameters in message
➜ copy message only once (i.e. not user→kernel→user)
• Fast thread access
➜ thread UIDs (containing thread ID)
➜ TCBs in (mapped) VM, cache-friendly layout
➜ separate kernel stack for each thread (for fast interrupt handling)
• General optimisations
➜ “hottest” kernel code is shortest
➜ kernel IPC code on single page, critical data on single page
➜ many H/W-specific optimisations
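Why "reply & receive next" helps can be seen by counting kernel entries in a server loop. The model below is schematic — the function and its accounting are invented for illustration, not L4 API calls:

```python
def serve(requests, combined):
    """Count kernel entries for a server loop.
    combined=True models L4's reply-&-receive-next: one trap per request.
    combined=False models separate reply and receive syscalls: two traps."""
    kernel_entries = 0
    for _ in requests:
        # ... handle the request at user level ...
        if combined:
            kernel_entries += 1   # one trap: reply to caller & wait for next
        else:
            kernel_entries += 2   # two traps: send reply, then receive next
    return kernel_entries

assert serve(range(1000), combined=False) == 2000
assert serve(range(1000), combined=True) == 1000
```

Since every kernel entry costs on the order of 100+ cycles even on a well-designed kernel, halving the trap count on the hot server path is a direct throughput win.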
MICROKERNEL PERFORMANCE
System      CPU           MHz     RPC       cyc/IPC   semantics
L4          R4600         100     1.7 µs    100       full
L4          Alpha         433     0.2 µs    45        full
L4          Pentium       166     1.5 µs    121       full
L4          486           50      10 µs     250       full
QNX         486           33      76 µs     1254      full
Mach        R2000         16.7    190 µs    1584      full
SCR RPC     CVAX          12.5    464 µs    2900      full
Mach        486           50      230 µs    5750      full
Amoeba      68020         15      800 µs    6000      full
Spin        Alpha 21064   133     102 µs    6783      full
Mach        Alpha 21064   133     104 µs    6916      full
Exo-tlrpc   R2000         116.7   6 µs      53        restricted
Spring      SparcV8       40      11 µs     220       restricted
DP-Mach     486           66      16 µs     528       restricted
LRPC        CVAX          12.5    157 µs    981       restricted
L4KA::PISTACHIO IPC PERFORMANCE
Architecture    port/optimisation   C++ (intra/inter AS)   optimised (intra/inter AS)
Pentium-3       UKa                 180 / 367              113 / 305
Pentium-4       UKa                 385 / 983              196 / 416
Itanium 2       UKa/NICTA           508 / 508              36 / 36
  cross CPU     UKa                 7419 / 7410            N/A / N/A
MIPS64          NICTA/UNSW          276 / 276              109 / 109
  cross CPU     NICTA/UNSW          3238 / 3238            690 / 690
PowerPC-64      NICTA/UNSW          330 / 518              200‡ / 200‡
Alpha 21264     NICTA/UNSW          440 / 642              ≈70† / ≈70†
ARM/XScale      NICTA/UNSW          340 / 340              120–140‡ / 120–140‡
UltraSPARC      NICTA/UNSW          —                      100‡ / 100‡
† “Version 2” assembler kernel   ‡ Guestimate!
CASE IN POINT: L4LINUX [HÄRTIG et al. 97]
• Port of Linux kernel to L4 (like Mach Unix server)
➜ single-threaded (for simplicity, not performance)
➜ is pager of all Linux user processes
➜ maps emulation library and signal-handling code into the AS
➜ server AS maps physical memory (& Linux runs within)
➜ copying between user and server done on physical memory
   ➜ uses software lookup of page tables for address translation
• Changes to Linux restricted to architecture-dependent part
• Duplication of page tables (L4 and Linux server)
• Binary compatible to native Linux via trampoline mechanism
➜ but also modified libc with RPC stubs
SIGNAL DELIVERY IN L4LINUX
[Figure 3: Signal delivery in L4Linux; arrows denote IPC, numbers in parentheses the sequence of actions. The Linux server’s main thread forwards the signal (1) to the signal thread in the user process, which manipulates the user thread (2); the user thread enters the Linux server (3) and is resumed (4)]
• Separate signal-handler thread in each user process
➜ server IPCs signal-handler thread
➜ handler thread ex_regs main user thread to save state
➜ main user thread IPCs Linux server
➜ server does signal processing itself
➜ server IPCs user thread to resume
The signal thread and the emulation library are not protected against the main thread. However, the user process can only damage itself by modifying them. Global effects of signaling, e.g. killing a process, are implemented by the Linux server; the signal thread only notifies the user process.
L4LINUX PERFORMANCE
MICROBENCHMARKS: getpid()
System                  Time [µs]   Cycles
Linux                   1.68        223
L4Linux                 3.95        526
L4Linux (trampoline)    5.66        753
MkLinux in-kernel       15.66       2050
MkLinux server          110.60      14710
CYCLE BREAKDOWN:
Client                   Cycles   Server
enter emulation lib      20
send syscall message     168      wait for msg
                         131      Linux kernel
receive reply            188      send reply
leave emulation lib      19
Hardware cost: 82 cycles (133MHz Pentium)
L4LINUX PERFORMANCE MICROBENCHMARKS: LMBENCH

[Figure 6: lmbench results for Linux, L4Linux, MkLinux (in-kernel), and MkLinux (user): latencies for write /dev/null, null/simple//bin/sh process creation, mmap, 2- and 8-process context switches, pipe, UDP, RPC/UDP, TCP, and RPC/TCP, plus inverse bandwidths for pipe, TCP, file reread, and mmap reread. Normalized to native Linux and presented as slowdowns: a shorter bar is a better result. Hardware: 133 MHz Pentium.]

• lmbench measures basic operations (system calls, context switches, memory accesses, pipe and network operations) by repeating each a large number of times
• After deducting fixed costs from the total system-call costs, 141 cycles remain for native Linux and 170 for L4Linux: the small difference indicates that IPC is indeed the major cause of the additional cost in L4Linux
• The remaining L4Linux overhead code uses 45 cache lines, 9% of the first-level cache, including the cache L4 needs for IPC
• Both MkLinux versions have a much higher lmbench penalty than L4Linux
• Surprisingly, the effect of co-location is rather small
• Even the L4Linux penalties are not as low as hoped

MACROBENCHMARKS: KERNEL COMPILE

System                  Real time
Linux                   476 s
L4Linux                 506 s (+6.3%)
L4Linux (trampoline)    509 s (+6.9%)
MkLinux (in-kernel)     555 s (+16.6%)
MkLinux (user)          605 s (+27.1%)

• Real time to recompile the Linux server (133 MHz Pentium)
• L4Linux is 6–7% slower than native Linux, but 10–20% faster than both MkLinux versions

[Figure: AIM Suite-7 benchmark, real time and jobs per minute under increasing load]
CONCLUSIONS

• Mach sux ⇏ microkernels suck
• L4 shows that performance might be deliverable
  ➜ L4Linux gets close to monolithic kernel performance
  ➜ need real multi-server system to evaluate µ-kernel potential
• Jury is still out!
• Mach has prejudiced community (see Linus...)
  ➜ It's still an uphill battle!
REFERENCES

[Ber92]    Brian N. Bershad. The increasing irrelevance of IPC performance for microkernel-based operating systems. In USENIX WS Microkernels & other Kernel Arch., pages 205–211, Seattle, WA, USA, Apr 1992.

[BH70]     Per Brinch Hansen. The nucleus of a multiprogramming operating system. CACM, 13:238–250, 1970.

[CB93]     J. Bradley Chen and Brian N. Bershad. The impact of operating system structure on memory system performance. In 14th SOSP, pages 120–133, Asheville, NC, USA, Dec 1993.

[CDK94]    George F. Coulouris, Jean Dollimore, and Tim Kindberg. Distributed Systems: Concepts and Design. Addison-Wesley, 2nd edition, 1994.

[Cus93]    Helen Custer. Inside Windows NT. Microsoft Press, 1993.

[DA92]     Randall W. Dean and François Armand. Data movement in kernelized systems. In USENIX WS Microkernels & other Kernel Arch., pages 243–261, Seattle, WA, USA, Apr 1992.

[Dij68]    Edsger W. Dijkstra. The structure of the "THE" multiprogramming system. CACM, 11:341–346, 1968.

[FR86]     Robert Fitzgerald and Richard Rashid. The integration of virtual memory management and interprocess communication in Accent. Trans. Comp. Syst., 4:147–177, 1986.

[HHL+97]   Hermann Härtig, Michael Hohmuth, Jochen Liedtke, Sebastian Schönberg, and Jean Wolter. The performance of µ-kernel-based systems. In 16th SOSP, pages 66–77, St. Malo, France, Oct 1997.

[Lie95]    Jochen Liedtke. On µ-kernel construction. In 15th SOSP, pages 237–250, Copper Mountain, CO, USA, Dec 1995.

[Lie96]    Jochen Liedtke. Towards real microkernels. CACM, 39(9):70–77, Sep 1996.

[MT86]     Sape J. Mullender and Andrew S. Tanenbaum. The design of a capability-based distributed operating system. The Comp. J., 29:289–299, 1986.

[Ous90]    J. K. Ousterhout. Why aren't operating systems getting faster as fast as hardware? In 1990 Summer USENIX Techn. Conf., pages 247–256, Jun 1990.

[RAA+90]   M. Rozier, V. Abrossimov, F. Armand, I. Boule, M. Gien, M. Guillemont, F. Herrmann, C. Kaiser, S. Langlois, P. Léonard, and W. Neuhauser. Overview of the CHORUS distributed operating system. Technical report CS/TR-90-25, Chorus systèmes, Montigny-le-Bretonneux (France), Apr 1990.

[RAA+92]   M. Rozier, V. Abrossimov, F. Armand, L. Boule, M. Gien, M. Guillemont, F. Herrman, C. Kaiser, S. Langlois, P. Léonard, and W. Neuhauser. Overview of the Chorus distributed operating system. In USENIX WS Microkernels & other Kernel Arch., pages 39–69, Seattle, WA, USA, Apr 1992.

[Ras88]    Richard F. Rashid. From RIG to Accent to Mach: The evolution of a network operating system. In Bernardo A. Huberman, editor, The Ecology of Computation, Studies in Computer Science and Artificial Intelligence, pages 207–230. North-Holland, Amsterdam, 1988.

[RTY+88]   Richard Rashid, Avadis Tevanian, Jr., Michael Young, David Golub, Robert Baron, David Black, William J. Bolosky, and Jonathan Chew. Machine-independent virtual memory management for paged uniprocessor and multiprocessor architectures. Trans. Computers, C-37:896–908, 1988.

[Sin97]    Pradeep K. Sinha. Distributed Operating Systems: Concepts and Design. Comp. Soc. Press, 1997.

[SJW+02]   Stephen R. Schach, Bo Jin, David R. Wright, Gillian Z. Heller, and A. Jefferson Offutt. Maintainability of the Linux kernel. IEE Proc.: Softw., 149:18–23, 2002.

[TM81]     Andrew S. Tanenbaum and Sape J. Mullender. An overview of the Amoeba distributed operating system. Operat. Syst. Rev., 15(3):51–64, 1981.

[TM84]     Andrew S. Tanenbaum and Sape Mullender. The design of a capability-based distributed operating system. Technical Report IR-88, Vrije Universiteit, Nov 1984.

[WCC+74]   W. Wulf, E. Cohen, W. Corwin, A. Jones, R. Levin, C. Pierson, and F. Pollack. HYDRA: The kernel of a multiprocessor operating system. CACM, 17:337–345, 1974.

[YTR+87]   Michael Young, Avadis Tevanian, Richard Rashid, David Golub, Jeffrey Eppinger, Jonathan Chew, William Bolosky, David Black, and Robert Baron. The duality of memory and communication in the implementation of a multiprocessor operating system. In 11th SOSP, pages 63–76, 1987.