Microkernels In a Bit More Depth
Gernot Heiser
COMP9242 2005/S2 Week 5
cse/UNSW/NICTA
MOTIVATION
• Early operating systems had very little structure.
• A strictly layered approach was promoted by [Dijkstra 1968].
• Later OSes (more or less) followed that approach (e.g., Unix).
• Such systems are known as monolithic kernels.
COMP9242 2005/S2 W5 P1
ADVANTAGES OF MONOLITHIC KERNELS
• Kernel has access to everything:
➜ all optimisations possible
➜ all techniques/mechanisms/concepts implementable
• Kernel can be extended by adding more code, e.g. for:
➜ new services
➜ support for new hardware
PROBLEMS WITH LAYERED APPROACH
• Widening range of services and applications ⇒ OS bigger, more complex, slower, more error prone.
• Need to support same OS on different hardware.
• Like to support various OS environments.
• Distribution ⇒ impossible to provide all services from same (local) kernel.
EVOLUTION OF THE LINUX KERNEL
[Figure: Linux kernel source size (.tar.gz, Mb) over time, Jan 1993 to Jan 2004, growing from near zero to over 30 Mb]
Linux 2.4.18: 2.7M LoC
APPROACHES TO TACKLING COMPLEXITY
• Classical software-engineering approach: modularity
➜ (relatively) small, mostly self-contained components
➜ well-defined interfaces between them
➜ enforcement of interfaces
➜ containment of faults to few modules
• Doesn’t work with monolithic kernels:
➜ all kernel code executes in privileged mode
➜ faults aren’t contained
➜ interfaces cannot be enforced
➜ performance takes priority over structure
EVOLUTION OF THE LINUX KERNEL — PART 2
Dependencies between kernel modules via global variables (“common coupling”) as a function of version number [SJW+02]: exponential growth!
MICROKERNEL IDEA: BREAK UP THE OS
[Figure: applications and user-level OS servers run non-privileged on top of a privileged microkernel, which runs on the hardware]
Based on the ideas of Brinch Hansen’s “Nucleus” [BH70]
MONOLITHIC VS. MICROKERNEL OS STRUCTURE
[Figure: left, a monolithic system — the application enters kernel mode via syscall; the kernel contains VFS, IPC and file system, scheduler and virtual memory, device drivers and dispatcher, on top of the hardware. Right, a microkernel system — applications use IPC to talk to user-level servers (UNIX server, device driver, file server); only IPC and virtual memory are in kernel mode, on top of the hardware]
MICROKERNEL OS
• Kernel:
➜ contains code which must run in supervisor mode;
➜ isolates hardware dependence from higher levels;
➜ is small and fast ⇒ extensible system;
➜ provides mechanisms.
• User-level servers:
➜ are hardware independent/portable,
➜ provide “OS environment”/“OS personality” (maybe several),
➜ may be invoked:
   ➜ from application (via message-passing IPC)
   ➜ from kernel (upcalls);
➜ implement policies [BH70].
DOWNCALL VS. UPCALL
• Downcall (syscall): unprivileged user code enters the privileged kernel; implemented via trap.
• Upcall: privileged kernel code enters unprivileged user mode; implemented via IPC.
MICROKERNEL-BASED SYSTEMS
[Figure: three example stacks, each an L4 microkernel on hardware — a classic OS plus security extensions; specialized servers (RT, MM, native Java) on a thin embedded layer; and a highly-specialized component app]
EARLY EXAMPLE: HYDRA
• Separation of mechanism from policy
➜ e.g. protection vs. security
• No hierarchical layering of kernel
• Protection, even within OS
➜ uses (segregated) capabilities
• Objects, encapsulation, units of protection
• Unique object name, no ownership
• Object persistence based on reference counting [WCC+74]
HYDRA ...
• Can be considered the first object-oriented OS
• Has been called the first microkernel OS (by people who ignored Brinch Hansen)
• Has had enormous influence on later operating-systems research
• Was never widely used, even at CMU, because of
➜ poor performance,
➜ lack of a complete environment
POPULAR EXAMPLE: MACH
• Developed at CMU by Rashid and others [RTY+88] from 1984
• Successor of Accent [FR86] and RIG [Ras88]
GOALS:
• Tailorability: support different OS interfaces
• Portability: almost all code H/W independent
• Real-time capability
• Multiprocessor and distribution support
• Security
Coined the term microkernel.
BASIC FEATURES OF MACH KERNEL
• Task and thread management
• Interprocess communication (asynchronous message-passing)
• Memory object management
• System call redirection
• Device support
• Multicomputer support
MACH TASKS AND THREADS
• Task consists of one or more threads
• Task provides address space and other environment
• Thread is active entity (basic unit of CPU utilisation)
• Threads have own stacks, are kernel scheduled
• Threads may run in parallel on a multiprocessor
• “Privileged user-state program” may be used to control scheduling
• Task created from “blueprint” with empty or inherited address space
• Activated by creating a thread in it
MACH IPC: PORTS
• Addressing based on ports:
➜ a port is a mailbox, allocated/destroyed via a system call
➜ has a fixed-size message queue associated with it
➜ is protected by (segregated) capabilities
➜ has exactly one receiver, but possibly many senders
➜ can have a “send-once” capability to a port
• Can pass the receive capability for a port to another process
➜ give up read access to the port
• Kernel detects ports without senders or receiver
• Processes may have many ports (UNIX server has 2000!)
• Ports can be grouped into port sets
➜ allows listening to many ports (like select())
• Send blocks if queue is full
➜ except with send-once cap (used for server replies)
MACH IPC: MESSAGES
• Segregated capabilities:
➜ threads refer to them via local indices
➜ kernel marshalls capabilities in messages
➜ message format must identify caps
• Message contents:
➜ send capability to destination port (mandatory)
   ➜ used by kernel to validate operation
➜ optional send capability to reply port
   ➜ for use by receiver to send reply
➜ possibly other capabilities
➜ “in-line” (by-value) data
➜ “out-of-line” (by-reference) data, using copy-on-write
   ➜ may contain whole address spaces
MACH IPC
[Figure: a message consists of a header, port rights (capabilities), in-line data, and out-of-line data. Out-of-line data is transferred by mapping: before the IPC the pages are mapped in task 1’s virtual address space; after the IPC the same physical memory is mapped into task 2’s address space]
MACH VIRTUAL MEMORY MANAGEMENT
• Address space constructed from memory regions
➜ initially empty
➜ populated by:
   ➜ explicit allocation
   ➜ explicitly mapping a memory object
   ➜ inheriting from “blueprint” (as in Linux clone()); inheritance: none, shared or copied
   ➜ allocated automatically by kernel during IPC, when passing by-reference parameters
⇒ sparse virtual memory use (unlike UNIX)
• 3 page states:
➜ unallocated
➜ allocated & unreferenced
➜ allocated & initialised
COPY-ON-WRITE IN MACH
• When data is copied (“blueprint” or passed by-reference):
➜ source and destination share a single copy,
➜ both virtual pages are mapped to the same frame
• Marked as read-only
• When one copy is modified, a fault occurs
• Handling by kernel: a physical copy is made
➜ VM mapping is changed to refer to the new copy
• Advantage:
➜ efficient way of sharing/passing large amounts of data
• Drawbacks:
➜ expensive for small amounts of data (page-table manipulations)
➜ data must be properly aligned
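The copy-on-write behaviour described above can be sketched as follows; `Frame` and `COWPage` are invented names for this toy model, not Mach data structures:

```python
class Frame:
    """A physical memory frame holding some data."""
    def __init__(self, data):
        self.data = data

class COWPage:
    """A virtual page; shared pages are mapped read-only until written."""
    def __init__(self, frame):
        self.frame = frame
        self.writable = False

    def read(self):
        return self.frame.data

    def write(self, data):
        if not self.writable:
            # write "fault": the kernel makes a physical copy,
            # remaps this page to it, and marks it writable
            self.frame = Frame(self.frame.data)
            self.writable = True
        self.frame.data = data

shared = Frame("hello")
src, dst = COWPage(shared), COWPage(shared)
assert src.frame is dst.frame            # one physical copy, shared
dst.write("bye")                         # first write triggers the copy
assert src.read() == "hello" and dst.read() == "bye"
assert src.frame is not dst.frame        # now two frames exist
```

The drawback noted above is visible here too: even a one-byte write pays for a whole-frame copy and a mapping change.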
MACH ADDRESS MAPS
• Address spaces represented as address maps:
[Figure: a doubly-linked list of map entries (head/tail), each entry referring to a memory object]
• Any part of the AS can be mapped to (part of) a memory object
• Compact representation of sparse address spaces
➜ compare to multi-level page tables?
MEMORY OBJECTS
• Kernel doesn’t support file system
• Memory objects are an abstraction of secondary storage:
➜ can be mapped into virtual memory
➜ are cached by the kernel in physical memory
➜ pager invoked if uncached page is touched
   ➜ used by file-system server to provide data
• Support data sharing
➜ by mapping objects into several address spaces
• Memory is only a cache for memory objects
USER-LEVEL PAGE FAULT HANDLERS
• All actual I/O performed by pager; can be
➜ default pager (provided by kernel), or
➜ external pager, running at user level
[Figure: the kernel maps a task’s memory onto a memory-object cache; on a miss it communicates with the external pager via IPC]
• Intrinsic page fault cost: 2 IPCs
HANDLING PAGE FAULTS
➀ Check protection & locate memory object
➜ uses address map
➁ Check cache, invoke pager if cache miss
➜ uses a hashed page table
➂ Check copy-on-write
➜ perform physical copy if write fault
➃ Enter new mapping into H/W page tables
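The four steps above can be sketched as one function. This is an illustrative model, not Mach's code path — the data structures and names (`handle_fault`, the map/cache dictionaries, the pager callback) are invented:

```python
def handle_fault(addr, is_write, address_map, cache, pager, hw_page_table):
    # 1. check protection & locate the memory object via the address map
    entry = address_map.get(addr // 0x1000)
    if entry is None:
        return "SEGFAULT"
    obj, offset, cow = entry
    # 2. check the cache; invoke the pager on a miss
    #    (in Mach this is where the 2 IPCs to an external pager happen)
    key = (obj, offset)
    if key not in cache:
        cache[key] = pager(obj, offset)
    frame = cache[key]
    # 3. copy-on-write: make a physical copy on a write fault
    if cow and is_write:
        frame = list(frame)
    # 4. enter the new mapping into the H/W page table
    hw_page_table[addr // 0x1000] = frame
    return "OK"

amap = {0x40: ("file.obj", 0, False)}   # one mapped page
cache, hwpt, calls = {}, {}, []
def pager(obj, off):
    calls.append((obj, off)); return ["data"]

assert handle_fault(0x40123, False, amap, cache, pager, hwpt) == "OK"
assert calls == [("file.obj", 0)]       # pager invoked once
assert handle_fault(0x40456, False, amap, cache, pager, hwpt) == "OK"
assert calls == [("file.obj", 0)]       # second fault hits the cache
assert handle_fault(0x99000, False, amap, cache, pager, hwpt) == "SEGFAULT"
```

The second fault on the same page never reaches the pager, which is exactly why the kernel caches memory-object pages.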
REMOTE COMMUNICATION
• Client A sends message to server B on remote node
➀ A sends message to local proxy port for B’s receive port
➁ User-level network message server (NMS) receives from proxy port
➂ NMS converts proxy port into (global) network port
➃ NMS sends message to NMS on B’s node
➜ may need conversion (byte order, ...)
➄ Remote NMS converts network port into local port (B’s)
➅ Remote NMS sends message to that port
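The port-translation steps above amount to two table lookups, one on each node. The sketch below models them with dictionaries; the `NetMsgServer` class and the port names are invented for illustration:

```python
class NetMsgServer:
    """Toy network message server: translates between local proxy ports,
    global network ports, and local receive ports."""
    def __init__(self, proxy_to_net, net_to_local):
        self.proxy_to_net = proxy_to_net    # local proxy port -> network port
        self.net_to_local = net_to_local    # network port -> local port

    def outbound(self, proxy_port, msg):
        net_port = self.proxy_to_net[proxy_port]    # step 3
        return (net_port, msg)                      # step 4: goes on the wire

    def inbound(self, net_port, msg):
        return (self.net_to_local[net_port], msg)   # steps 5-6

nms_a = NetMsgServer({"proxy:B": "net:42"}, {})
nms_b = NetMsgServer({}, {"net:42": "B_receive_port"})

wire = nms_a.outbound("proxy:B", "hello")       # steps 1-4 on A's node
assert wire == ("net:42", "hello")
assert nms_b.inbound(*wire) == ("B_receive_port", "hello")  # on B's node
```

The point of the design is that the kernel sees only local IPC; all network knowledge lives in the user-level NMS.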
MACH UNIX EMULATION
[Figure: a syscall from the application is redirected by Mach to an emulation library in the application’s address space, which communicates with the Unix server via IPC where needed; the result is returned to the application]
• Emulation library in user address space handles IPC
• Invoked by system call redirection (trampoline mechanism)
➜ supports binary compatibility
MACH = MICROKERNEL?
• Most OS services implemented at user level
➜ using memory objects and external pagers
➜ provides mechanisms, not policies
• Mostly hardware independent
• Big!
➜ 140 system calls
➜ size: 200k instructions
• Performance poor
➜ tendency to move features into kernel
➜ Darwin (base of MacOS X) has complete BSD kernel inside Mach
Further information on Mach: [YTR+87, CDK94, Sin97]
OTHER CLIENT-SERVER SYSTEMS
• Lots!
• Most notable systems:
➜ Amoeba: VU Amsterdam, early 1980’s [TM81, TM84, MT86]
➜ Chorus: INRIA (France), from early 1980’s [DA92, RAA+90, RAA+92]
   ➜ commercialised by Chorus Systèmes in 1988
   ➜ recently bought by Sun
➜ Windows NT: Microsoft (early 1990’s) [Cus93]
   ➜ main servers run in kernel mode
CRITIQUE OF MICROKERNEL ARCHITECTURES
“I’m not interested in making devices look like user-level. They aren’t, they shouldn’t, and microkernels are just stupid.” — Linus Torvalds
Is Linus right?
MICROKERNEL PERFORMANCE
• First-generation µ-kernel systems exhibited poor performance compared to monolithic UNIX implementations
➜ particularly Mach, the best-known example
• Reasons are investigated by [Chen & Bershad 93]:
➜ instrumented user and system code to collect execution traces
➜ run on DECstation 5000/200 (25MHz R3000)
➜ run under Ultrix and Mach with Unix server
➜ traces fed to memory-system simulator
➜ analyse MCPI (memory cycles per instruction)
   ➜ baseline MCPI (i.e. excluding idle loops)
ULTRIX VS. MACH MCPI
[Figure: baseline MCPI for Ultrix and Mach across the benchmark suite]
INTERPRETATION
Observations:
• Mach memory penalty (i.e. cache misses or write stalls) higher
• Mach VM system executes more instructions than Ultrix (but has more functionality)
Claim:
• Degraded performance is (intrinsic?) result of OS structure
• IPC cost (known to be high in Mach) is not a major factor [Ber92]
ASSERTIONS
1 OS has less instruction and data locality than user code
➜ system code has higher cache and TLB miss rates
➜ particularly bad for instructions
2 System execution is more dependent on instruction-cache behaviour than is user execution
➜ MCPIs dominated by system i-cache misses
Note: most benchmarks were small, i.e. user code fits in cache
3 Competition between user and system code is not a problem
➜ few conflicts between user and system caching
➜ TLB misses are not a relevant factor
Note: the hardware used has direct-mapped physical caches.
⇒ Split system/user caches wouldn’t help
SELF-INTERFERENCE
➜ Only examine system cache misses
➜ Shaded: system cache misses removed by associativity
➜ MCPI for system-only, using R3000 direct-mapped cache
➜ Reductions due to associativity were obtained by running the system on a simulator with a two-way associative cache of the same size
ASSERTIONS ...
4 Self-interference is a problem in system instruction reference streams
➜ high internal conflicts in system code
➜ system would benefit from higher cache associativity
5 System block memory operations are responsible for a large percentage of memory-system reference costs
➜ particularly true for I/O system calls
6 Write buffers are less effective for system references
➜ write buffer allows limited asynchronous writes on cache misses
7 Virtual-to-physical mapping strategy can have significant impact on cache performance
➜ unfortunate mapping may increase conflict misses
➜ “random” mappings (Mach) are to be avoided
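Assertion 4 — that associativity removes self-interference conflicts — can be demonstrated with a toy cache simulation. The trace and cache sizes below are made up for illustration; the point is only that two hot code paths mapping to the same direct-mapped set thrash, while a 2-way cache of the same capacity absorbs them:

```python
def misses(trace, num_sets, ways):
    """Count misses for an LRU set-associative cache of num_sets x ways lines.
    Cache lines are identified by integer line numbers."""
    sets = [[] for _ in range(num_sets)]
    miss = 0
    for line in trace:
        s = sets[line % num_sets]
        if line in s:
            s.remove(line)      # hit: re-append below as most-recently-used
        else:
            miss += 1
            if len(s) >= ways:
                s.pop(0)        # evict least-recently-used line
        s.append(line)
    return miss

# Lines 0 and 8 conflict in an 8-set direct-mapped cache (0 % 8 == 8 % 8):
trace = [0, 8, 0, 8, 0, 8, 1, 2, 3]
direct = misses(trace, num_sets=8, ways=1)   # direct-mapped: thrashing
twoway = misses(trace, num_sets=4, ways=2)   # same capacity, 2-way LRU
assert direct > twoway
```

This is the mechanism behind the shaded bars on the previous slide: same capacity, fewer conflict misses once associativity lets the two conflicting lines coexist.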
OTHER EXPERIENCE WITH µ-KERNEL PERFORMANCE
• System call costs are (inherently?) high
➜ typically hundreds of cycles, 900 for Mach/i486
• Context (address-space) switching costs are (inherently?) high
➜ getting worse (in terms of cycles) with increasing CPU/memory speed ratios [Ous90]
➜ IPC (involving system calls and context switches) is inherently expensive
SO, WHAT’S WRONG?
• µ-kernels heavily depend on IPC
• IPC is expensive
➜ Is the µ-kernel idea flawed?
➜ Should some code never leave the kernel?
➜ Do we have to buy flexibility with performance?
A CRITIQUE OF THE CRITIQUE
Data presented earlier:
➜ are specific to one (or a few) systems,
➜ results cannot be generalised without thorough analysis,
➜ no such analysis has been done
⇒ Cannot trust the conclusions [Lie95]
RE-ANALYSIS OF CHEN & BERSHAD’S DATA
[Figure: baseline MCPI for Ultrix (U) and Mach (M), split into system cache-miss MCPI and other MCPI; the values per benchmark are:]
Benchmark   Ultrix   Mach
sed         0.227    0.495
egrep       0.035    0.081
yacc        0.067    0.129
gcc         0.434    0.690
compress    0.250    0.418
ab          0.427    0.534
espresso    0.041    0.068
MCPI for Ultrix and Mach; the right end of each bar is i- and d-cache misses in system code. Overhead is mostly in the system.
RE-ANALYSIS OF CHEN & BERSHAD’S DATA ...
Benchmark   Ultrix   Mach
sed         0.170    0.415
egrep       0.024    0.069
yacc        0.039    0.098
gcc         0.130    0.388
compress    0.102    0.258
ab          0.230    0.382
espresso    0.012    0.037
MCPI caused by cache misses: conflict (black) vs capacity (white)
CONCLUSION
• Mach system (kernel + UNIX server + emulation library) is too big!
• UNIX server is essentially the same
• Emulation library is irrelevant (according to Chen & Bershad)
⇒ Mach µ-kernel working set is too big
Can we build µ-kernels which avoid these problems?
REQUIREMENTS FOR µ-KERNELS:
• Fast (system call costs, IPC costs)
• Small (big ⇒ slow)
⇒ Must be well designed, providing a minimal set of operations.
Can this be done?
ARE HIGH SYSTEM COSTS ESSENTIAL?
• Example: kernel call cost on i486
➜ Mach kernel call: 900 cycles
➜ inherent (hardware-dictated) cost: 107 cycles
⇒ ~800 cycles kernel overhead
➜ L4 kernel call: 123–180 cycles (15–73 cycles overhead)
⇒ Mach’s performance is a result of design and implementation, not of the µ-kernel concept!
µ-KERNEL DESIGN PRINCIPLES [LIE96]
• Minimality: if it doesn’t have to be in the kernel, it shouldn’t be in the kernel
• Appropriate abstractions which can be made fast and allow efficient implementation of services
• Well written: it pays to shave a few cycles off the TLB refill handler or the IPC path
• Unportable: must be targeted to specific hardware
➜ no problem if it’s small, and higher layers are portable
➜ example: Liedtke reports significant rewrite of memory management when porting from 486 to Pentium
⇒ an “abstract hardware layer” is too costly
NON-PORTABILITY EXAMPLE: I486 VS PENTIUM
• Size and associativity of TLB
• Size and organisation of cache (larger line size ⇒ restructured IPC)
• Segment registers on Pentium used to simulate a tagged TLB
⇒ different trade-offs
WHAT MUST A µ-KERNEL PROVIDE?
• Virtual memory / address spaces
• Threads
• Fast IPC
• Unique identifiers (for IPC addressing)
A µ-KERNEL DOES NOT HAVE TO PROVIDE:
• File system
➜ use user-level server (as in Mach)
• Device drivers
➜ user-level driver invoked via interrupt (= IPC)
• Page-fault handler
➜ use user-level pager
L4 IMPLEMENTATION TECHNIQUES (LIEDTKE)
• Appropriate system calls to reduce number of kernel invocations
➜ e.g., reply & receive next
➜ as many syscall args as possible in registers
• Efficient IPC
➜ rich message structure
➜ value and reference parameters in message
➜ copy message only once (i.e. not user→kernel→user)
• Fast thread access
➜ thread UIDs (containing thread ID)
➜ TCBs in (mapped) VM, cache-friendly layout
➜ separate kernel stack for each thread (for fast interrupt handling)
• General optimisations
➜ “hottest” kernel code is shortest
➜ kernel IPC code on single page, critical data on single page
➜ many H/W-specific optimisations
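Why "reply & receive next" helps can be seen by counting kernel entries in a server loop. The model below is schematic — the function and its accounting are invented for illustration, not L4 API calls:

```python
def serve(requests, combined):
    """Count kernel entries for a server loop.
    combined=True models L4's reply-&-receive-next: one trap per request.
    combined=False models separate reply and receive syscalls: two traps."""
    kernel_entries = 0
    for _ in requests:
        # ... handle the request at user level ...
        if combined:
            kernel_entries += 1   # one trap: reply to caller & wait for next
        else:
            kernel_entries += 2   # two traps: send reply, then receive next
    return kernel_entries

assert serve(range(1000), combined=False) == 2000
assert serve(range(1000), combined=True) == 1000
```

Since every kernel entry costs on the order of 100+ cycles even on a well-designed kernel, halving the trap count on the hot server path is a direct throughput win.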
MICROKERNEL PERFORMANCE
System      CPU           MHz     RPC       cyc/IPC   semantics
L4          R4600         100     1.7 µs    100       full
L4          Alpha         433     0.2 µs    45        full
L4          Pentium       166     1.5 µs    121       full
L4          486           50      10 µs     250       full
QNX         486           33      76 µs     1254      full
Mach        R2000         16.7    190 µs    1584      full
SCR RPC     CVAX          12.5    464 µs    2900      full
Mach        486           50      230 µs    5750      full
Amoeba      68020         15      800 µs    6000      full
Spin        Alpha 21064   133     102 µs    6783      full
Mach        Alpha 21064   133     104 µs    6916      full
Exo-tlrpc   R2000         116.7   6 µs      53        restricted
Spring      SparcV8       40      11 µs     220       restricted
DP-Mach     486           66      16 µs     528       restricted
LRPC        CVAX          12.5    157 µs    981       restricted
L4KA::PISTACHIO IPC PERFORMANCE
Architecture    port/optimisation   C++ (intra/inter AS)   optimised (intra/inter AS)
Pentium-3       UKa                 180 / 367              113 / 305
Pentium-4       UKa                 385 / 983              196 / 416
Itanium 2       UKa/NICTA           508 / 508              36 / 36
  cross CPU     UKa                 7419 / 7410            N/A / N/A
MIPS64          NICTA/UNSW          276 / 276              109 / 109
  cross CPU     NICTA/UNSW          3238 / 3238            690 / 690
PowerPC-64      NICTA/UNSW          330 / 518              200‡ / 200‡
Alpha 21264     NICTA/UNSW          440 / 642              ≈70† / ≈70†
ARM/XScale      NICTA/UNSW          340 / 340              120–140‡ / 120–140‡
UltraSPARC      NICTA/UNSW          —                      100‡ / 100‡
† “Version 2” assembler kernel   ‡ Guestimate!
CASE IN POINT: L4LINUX [HÄRTIG et al. 97]
• Port of Linux kernel to L4 (like Mach Unix server)
➜ single-threaded (for simplicity, not performance)
➜ is pager of all Linux user processes
➜ maps emulation library and signal-handling code into the AS
➜ server AS maps physical memory (& Linux runs within)
➜ copying between user and server done on physical memory
   ➜ uses software lookup of page tables for address translation
• Changes to Linux restricted to architecture-dependent part
• Duplication of page tables (L4 and Linux server)
• Binary compatible to native Linux via trampoline mechanism
➜ but also modified libc with RPC stubs
SIGNAL DELIVERY IN L4LINUX
[Figure 3: Signal delivery in L4Linux; arrows denote IPC, numbers in parentheses the sequence of actions. The Linux server’s main thread forwards the signal (1) to the signal thread in the user process, which manipulates the user thread (2); the user thread enters the Linux server (3) and is resumed (4)]
• Separate signal-handler thread in each user process
➜ server IPCs signal-handler thread
➜ handler thread ex_regs main user thread to save state
➜ main user thread IPCs Linux server
➜ server does signal processing itself
➜ server IPCs user thread to resume
The signal thread and the emulation library are not protected against the main thread. However, the user process can only damage itself by modifying them. Global effects of signaling, e.g. killing a process, are implemented by the Linux server; the signal thread only notifies the user process.
L4LINUX PERFORMANCE
MICROBENCHMARKS: getpid()
System                  Time [µs]   Cycles
Linux                   1.68        223
L4Linux                 3.95        526
L4Linux (trampoline)    5.66        753
MkLinux in-kernel       15.66       2050
MkLinux server          110.60      14710
CYCLE BREAKDOWN:
Client                   Cycles   Server
enter emulation lib      20
send syscall message     168      wait for msg
                         131      Linux kernel
receive reply            188      send reply
leave emulation lib      19
Hardware cost: 82 cycles (133MHz Pentium)
L4LINUX PERFORMANCE MICROBENCHMARKS: LMBENCH

[Figure 6: lmbench results for Linux, L4Linux, MkLinux (in-kernel), and MkLinux (user): latencies for write /dev/null, null/simple//bin/sh process creation, mmap, 2- and 8-process context switches, pipe, UDP, RPC/UDP, TCP, and RPC/TCP, plus inverse bandwidths for pipe, TCP, file reread, and mmap reread. Normalized to native Linux and presented as slowdowns: a shorter bar is a better result. Hardware: 133 MHz Pentium.]

• lmbench measures basic operations (system calls, context switches, memory accesses, pipe and network operations) by repeating each a large number of times
• After deducting fixed costs from the total system-call costs, 141 cycles remain for native Linux and 170 for L4Linux: the small difference indicates that IPC is indeed the major cause of the additional cost in L4Linux
• The remaining L4Linux overhead code uses 45 cache lines, 9% of the first-level cache, including the cache L4 needs for IPC
• Both MkLinux versions have a much higher lmbench penalty than L4Linux
• Surprisingly, the effect of co-location is rather small
• Even the L4Linux penalties are not as low as hoped

MACROBENCHMARKS: KERNEL COMPILE

System                  Real time
Linux                   476 s
L4Linux                 506 s (+6.3%)
L4Linux (trampoline)    509 s (+6.9%)
MkLinux (in-kernel)     555 s (+16.6%)
MkLinux (user)          605 s (+27.1%)

• Real time to recompile the Linux server (133 MHz Pentium)
• L4Linux is 6–7% slower than native Linux, but 10–20% faster than both MkLinux versions

[Figure: AIM Suite-7 benchmark, real time and jobs per minute under increasing load]
CONCLUSIONS

• Mach sux ⇏ microkernels suck
• L4 shows that performance might be deliverable
  ➜ L4Linux gets close to monolithic kernel performance
  ➜ need real multi-server system to evaluate µ-kernel potential
• Jury is still out!
• Mach has prejudiced community (see Linus...)
  ➜ It's still an uphill battle!
REFERENCES

[Ber92]    Brian N. Bershad. The increasing irrelevance of IPC performance for microkernel-based operating systems. In USENIX WS Microkernels & other Kernel Arch., pages 205–211, Seattle, WA, USA, Apr 1992.

[BH70]     Per Brinch Hansen. The nucleus of a multiprogramming operating system. CACM, 13:238–250, 1970.

[CB93]     J. Bradley Chen and Brian N. Bershad. The impact of operating system structure on memory system performance. In 14th SOSP, pages 120–133, Asheville, NC, USA, Dec 1993.

[CDK94]    George F. Coulouris, Jean Dollimore, and Tim Kindberg. Distributed Systems: Concepts and Design. Addison-Wesley, 2nd edition, 1994.

[Cus93]    Helen Custer. Inside Windows NT. Microsoft Press, 1993.

[DA92]     Randall W. Dean and François Armand. Data movement in kernelized systems. In USENIX WS Microkernels & other Kernel Arch., pages 243–261, Seattle, WA, USA, Apr 1992.

[Dij68]    Edsger W. Dijkstra. The structure of the "THE" multiprogramming system. CACM, 11:341–346, 1968.

[FR86]     Robert Fitzgerald and Richard Rashid. The integration of virtual memory management and interprocess communication in Accent. Trans. Comp. Syst., 4:147–177, 1986.

[HHL+97]   Hermann Härtig, Michael Hohmuth, Jochen Liedtke, Sebastian Schönberg, and Jean Wolter. The performance of µ-kernel-based systems. In 16th SOSP, pages 66–77, St. Malo, France, Oct 1997.

[Lie95]    Jochen Liedtke. On µ-kernel construction. In 15th SOSP, pages 237–250, Copper Mountain, CO, USA, Dec 1995.

[Lie96]    Jochen Liedtke. Towards real microkernels. CACM, 39(9):70–77, Sep 1996.

[MT86]     Sape J. Mullender and Andrew S. Tanenbaum. The design of a capability-based distributed operating system. The Comp. J., 29:289–299, 1986.

[Ous90]    J. K. Ousterhout. Why aren't operating systems getting faster as fast as hardware? In 1990 Summer USENIX Techn. Conf., pages 247–256, Jun 1990.

[RAA+90]   M. Rozier, V. Abrossimov, F. Armand, I. Boule, M. Gien, M. Guillemont, F. Herrmann, C. Kaiser, S. Langlois, P. Léonard, and W. Neuhauser. Overview of the CHORUS distributed operating system. Technical report CS/TR-90-25, Chorus systèmes, Montigny-le-Bretonneux (France), Apr 1990.

[RAA+92]   M. Rozier, V. Abrossimov, F. Armand, L. Boule, M. Gien, M. Guillemont, F. Herrman, C. Kaiser, S. Langlois, P. Léonard, and W. Neuhauser. Overview of the Chorus distributed operating system. In USENIX WS Microkernels & other Kernel Arch., pages 39–69, Seattle, WA, USA, Apr 1992.

[Ras88]    Richard F. Rashid. From RIG to Accent to Mach: The evolution of a network operating system. In Bernardo A. Huberman, editor, The Ecology of Computation, Studies in Computer Science and Artificial Intelligence, pages 207–230. North-Holland, Amsterdam, 1988.

[RTY+88]   Richard Rashid, Avadis Tevanian, Jr., Michael Young, David Golub, Robert Baron, David Black, William J. Bolosky, and Jonathan Chew. Machine-independent virtual memory management for paged uniprocessor and multiprocessor architectures. Trans. Computers, C-37:896–908, 1988.

[Sin97]    Pradeep K. Sinha. Distributed Operating Systems: Concepts and Design. Comp. Soc. Press, 1997.

[SJW+02]   Stephen R. Schach, Bo Jin, David R. Wright, Gillian Z. Heller, and A. Jefferson Offutt. Maintainability of the Linux kernel. IEE Proc.: Softw., 149:18–23, 2002.

[TM81]     Andrew S. Tanenbaum and Sape J. Mullender. An overview of the Amoeba distributed operating system. Operat. Syst. Rev., 15(3):51–64, 1981.

[TM84]     Andrew S. Tanenbaum and Sape Mullender. The design of a capability-based distributed operating system. Technical Report IR-88, Vrije Universiteit, Nov 1984.

[WCC+74]   W. Wulf, E. Cohen, W. Corwin, A. Jones, R. Levin, C. Pierson, and F. Pollack. HYDRA: The kernel of a multiprocessor operating system. CACM, 17:337–345, 1974.

[YTR+87]   Michael Young, Avadis Tevanian, Richard Rashid, David Golub, Jeffrey Eppinger, Jonathan Chew, William Bolosky, David Black, and Robert Baron. The duality of memory and communication in the implementation of a multiprocessor operating system. In 11th SOSP, pages 63–76, 1987.