PERFORMANCE OF LOW-COST ULTRASPARC MULTIPROCESSORS CONNECTED BY SCI

Knut Omang
Dep. of Informatics, University of Oslo, Box 1080 Blindern, N-0316 OSLO, Norway
[email protected]

Bodo Parady
Sun Microsystems Inc., MPK12-204, 2550 Garcia Avenue, Mountain View, California
[email protected]

ABSTRACT


In bringing high end performance to low cost workstation hardware, the efficiency of the interconnection between the individual nodes is a key factor for overall system performance. In this paper, basic performance characteristics such as throughput and latency are measured on state-of-the-art workstations connected by SCI (Scalable Coherent Interface) based I/O adapters. The flexibility of the SCI interconnect gives options for customizable internal bandwidths and systems with up to 64K nodes. Point-to-point throughput is thus limited by different parts of the interface towards each node, and not by the interconnect itself. Point-to-point performance is investigated and compared to similar measurements done for other interconnects. Even though options for improvement are pointed out, the results demonstrate clearly that although this interconnect has latency characteristics inferior to common SMPs, it currently provides one of the best I/O adapter based solutions and should definitely be taken into consideration as a basis for clustering.


Keywords: Computer systems, distributed processors, multiprocessors, performance analysis, operating systems.

1 INTRODUCTION

While inexpensive desktop computers have seen an incredible performance increase in recent years, and significant progress has been made in the area of CPU design, the local area network interfaces of these computers still use technology that was already mature in the mid eighties. 10 Mbits/s Ethernet is still the most common, although some vendors are starting to equip their systems with improved 100 Mbits/s interfaces using improved versions of essentially the same base technology. ATM (Asynchronous Transfer Mode) is another standard that is starting to come into use. While ATM offers important additional functionality such as QoS (quality of service) options as well as higher unidirectional bandwidths, latency does not seem to improve, making message passing as slow as (or sometimes even slower than) over Ethernet. ATM software overhead is also still high: since ATM is an interconnect which may lose packets, additional software layers must take care of error handling and retransmission. In this paper a network interconnect is presented where communication can be done almost solely in hardware. This is possible because the hardware takes care of error checking and retransmission during normal operation, making little or no additional checking necessary. The SCI interconnect to be presented has a very low zero byte latency for an I/O adapter, and a peak system to network bandwidth that does not lag behind other interface technologies.

2 THE CLUSTER HARDWARE


Ultra-2 workstations are used as cluster nodes. The Ultra-2 is a new generation of UltraSparc based workstations from Sun Microsystems; each Ultra-2 is an SMP with two 200 MHz CPUs. The Solaris operating system supports multithreading (also in the kernel), with increased opportunities for parallelism also at the driver level. The term low-cost is used in comparison to SMPs with many processors, in which scalability has a relatively high initial cost, and where the size of the final system must be decided upon (and paid for) initially, even if the initial system only consists of a few processors. SCI (Scalable Coherent Interface) (SCI 1993) is a standard for high speed interconnects designed with the scalability and performance of closely coupled systems in mind. Dolphin Interconnect Solutions' SCI/Sbus-2 is the second generation of SCI interface cards for the I/O bus of the SparcStations. This new generation of SCI I/O adapter cards uses the new, faster LinkController as its SCI interface. In addition, the Sbus interface and driver software have been improved to give considerably better throughput characteristics than its predecessor. The simplest, back-to-back connected cluster configuration is pictured in figure 1.


Figure 1: Simple back-to-back connection of two nodes using Sbus-SCI adapters (each node's host CPU/memory system connects through the IOMMU and the host Sbus to the Sbus-SCI card).

Connected to the I/O bus, with its limited capabilities (such as no direct access to memory, and no way of intercepting operations from the CPU to the memory), the SCI interface used in this paper only provides support for a limited part of the full SCI specification. The operations offered by the interface are:

• Hardware support for single 1, 2, 4 and 8 byte remote load and store operations and 64 byte store operations.


• A DMA engine capable of doing larger transfers without using CPU cycles.

Load/store operations to remote locations are triggered by ordinary CPU load/store instructions to special regions of virtual memory mapped to I/O space. We denote such operations programmed I/O (PIO) operations. The UltraSparc architecture has special support for block store operations with block sizes up to 64 bytes, and the 64 byte store support uses these special instructions. Since PIO operations are done completely in hardware, they have very low latency. Remote store operations are asynchronous, may get pipelined within the interconnect and may arrive out of order; the CPU is only stalled until the store operation has completed on the local Sbus. Remote load operations, on the other hand, are synchronous, i.e. the CPU stalls until the operation and all previous store operations to the same memory have completed. The DMA engine has a relatively high setup cost compared to PIO. In addition, DMA data must start on a 64 byte alignment boundary and must consist of contiguous sections that are a multiple of 64 bytes in length. However, the DMA engine provides high bandwidth and relieves the CPU of work once running. Consequently, the best way of implementing message passing across this interconnect is to use PIO for small transfers and a combination of PIO (for alignment) and DMA for larger transfers. This interconnect hardware offers limited abilities compared to what is possible for devices with direct access to the system bus, yet powerful functionality compared to traditional network interfaces, suggesting a mental model of the interconnect as something in between a local area network (LAN) and a traditional backplane bus.
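To make the PIO path more concrete, the sketch below shows how a small message could be pushed into a remote segment that is assumed to be mapped into the sender's address space, with a dummy remote load used to order the stores before a flag word is raised. It is a minimal illustration only: the structure layout, the flag convention and the use of plain 8 byte stores (instead of the UltraSparc 64 byte block store instructions) are assumptions for the example, not the actual Dolphin driver interface.

    #include <stdint.h>
    #include <stddef.h>

    /*
     * Minimal sketch of PIO message passing over a mapped remote segment.
     * "payload" and "flag" are assumed to point into a region of virtual
     * memory that the SCI driver has mapped onto memory on the receiving
     * node; the layout is illustrative, not the actual driver interface.
     */
    struct pio_channel {
        volatile uint64_t *payload;   /* start of remote payload area      */
        volatile uint64_t *flag;      /* remote flag word polled by reader */
    };

    static void pio_send(struct pio_channel *ch, const uint64_t *msg, size_t nwords)
    {
        size_t i;

        /* Remote stores are posted: the CPU only stalls until each store
         * has completed on the local Sbus, so consecutive stores may be
         * pipelined by the interconnect and may arrive out of order. */
        for (i = 0; i < nwords; i++)
            ch->payload[i] = msg[i];

        /* A remote load is synchronous and does not return until all
         * previous stores to the same memory have completed, so it acts
         * as a store barrier before the flag is raised. */
        (void)ch->payload[0];

        /* Raise the flag the receiver is polling on. */
        *ch->flag = 1;
    }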

3 CLUSTER BUILDING BLOCKS

The Dolphin Sbus/SCI adapter card itself comes with a single connector. This connector has both input and output pins, so a single station cable can be used to connect two workstations "back-to-back", as in figure 1. Such a setup creates a two node SCI ring. Four nodes can for instance be connected using Dolphin's 4-way switch and 4 station cables, giving 4 two node rings with the switch being one of the nodes on each of the rings, as in figure 2. Another possible 4 node configuration is to make a single ring using small Dolphin provided connector boxes (EDUs) and to connect these boxes using the smaller ring cables. Using these basic building blocks, larger configurations can easily be created; for example, a typical 16 node configuration could use a four way switch and 5-way SCI rings where the switch is the 5th connection in a four node ring (figure 3). If more bandwidth or a higher degree of availability is needed, the SCI interconnect can be duplicated using multiple Sbus cards in each node. Some of the results presented later are gathered using two adapters in each node of a two-node configuration.

Figure 2: A switched 4-node configuration.

Figure 3: A 4-node building block.

4 SYSTEM SOFTWARE LAYERS

The SCI interconnect is currently accessible from user space through two different interfaces:

1. Through the standard TCP/IP protocol stack and its standard UNIX socket abstraction.

2. Through a special "raw" device interface that provides more direct access to the underlying hardware.

"Raw" access is acquired by opening specific UNIX device files and using either

• the read/write system calls for operating system supported message passing, or

• programmed I/O (PIO), by setting up user level accessible shared memory through the UNIX mmap and ioctl system calls (memory mapped I/O).

The read/write system call interface uses DMA for aligned message transfers, but currently does not use an effective strategy for misaligned and odd length messages. The aligned transfers using this interface are denoted "Raw SCI" in the performance figures. Once a shared memory mapping is set up, PIO operations are performed completely by hardware. To use PIO effectively for message passing, a protocol for queueing and waiting is needed. A future optimized message passing protocol for SCI should probably use a combination of polling for user level messages and operating system handled DMA requests. In this paper the latency and throughput performance of the two methods are investigated to show what to expect from such protocols. TCP/IP over SCI is made available through a separate driver module that uses the exported interface of the low level SCI driver to implement the needed higher level functionality for the Solaris Streams system. The Streams system itself is then used to implement the socket user interface as well as other higher level interfaces.
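As an illustration of the two raw access methods, the sketch below opens a raw SCI device file, sends one aligned buffer through the read/write interface, and maps a shared memory window for PIO. The device path /dev/sci0, the mmap offset and the omitted ioctl setup are hypothetical placeholders; the actual device names and ioctl commands of the Dolphin driver are not given in this paper.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SEG_SIZE (64 * 1024)

    int main(void)
    {
        /* Hypothetical device file exported by the low level SCI driver. */
        int fd = open("/dev/sci0", O_RDWR);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Path 1: operating system supported message passing ("Raw SCI").
         * The buffer is 64 byte aligned and a multiple of 64 bytes long,
         * so the driver can DMA directly from it on the sender side. */
        char *raw = malloc(SEG_SIZE + 64);
        if (raw == NULL)
            return 1;
        char *buf = (char *)(((uintptr_t)raw + 63) & ~(uintptr_t)63);
        memset(buf, 0x42, SEG_SIZE);
        if (write(fd, buf, SEG_SIZE) != SEG_SIZE)
            perror("write");

        /* Path 2: user level shared memory for programmed I/O.  Once the
         * driver has mapped a window of remote memory into this process
         * (the necessary ioctl negotiation is omitted here), plain loads
         * and stores to the window become SCI transactions without any
         * system call on the data path. */
        volatile char *win = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
        if (win != MAP_FAILED) {
            win[0] = 1;                      /* a single remote store */
            munmap((void *)win, SEG_SIZE);
        }

        free(raw);
        close(fd);
        return 0;
    }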





5 INTERCONNECT LATENCY

Latency of short messages is important for applications that need to exchange relatively small amounts of data, but where the delay of such exchange operations is crucial to performance. The SCI/Sbus-2 hardware is characterized by very low latency. A single short (1, 2, 4 or 8 byte) remote load/store operation over SCI between two Ultra-2s takes less than 4 µs.

This is slower than a similar access to local memory (0.1-1 µs), but faster than most network adapter technologies. All these short stores take approximately the same time to traverse the interconnect, since they are accomplished using the smallest available SCI transaction (16 bytes). A 64 byte remote store takes 7 µs, but consecutive stores may be pipelined to achieve better performance. The PIO latency for SCI in figure 4 is measured using a small microbenchmark that transfers the given number of bytes using as large remote stores as possible. When a remote write is done, control is returned to the CPU once the operation has completed on the local Sbus. Another write may be posted immediately, even though the effect of the first write is not yet visible to the other processors, and this second write might bypass the previous one. To support deterministic operation for messages longer than 64 bytes, a dummy load instruction to the same memory acts as a store barrier by flushing all pending write operations, so that when the load returns, all previous writes have completed at the remote end before the value polled by the receiver is updated. Message passing latency is measured as the time needed to send a message using a specific interconnect/protocol and receive an equally long response, divided by 2.
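A sketch of this round-trip measurement is shown below, assuming a pair of blocking send/receive helpers for whichever interconnect/protocol is being timed; msg_send and msg_recv are placeholders, not functions from the paper.

    #include <stdio.h>
    #include <sys/time.h>

    /* Placeholders for the transport under test (raw SCI, TCP/IP over SCI,
     * Ethernet, ATM, ...); these are not functions from the paper. */
    extern void msg_send(const void *buf, int len);
    extern void msg_recv(void *buf, int len);

    /* One-way latency estimated as half the average round-trip time: the
     * initiator sends a message of "len" bytes (len <= sizeof(buf)) and
     * waits for an equally long response, repeated "iters" times. */
    static double half_round_trip_us(int len, int iters)
    {
        static char buf[4096];
        struct timeval start, stop;
        int i;

        gettimeofday(&start, NULL);
        for (i = 0; i < iters; i++) {
            msg_send(buf, len);
            msg_recv(buf, len);
        }
        gettimeofday(&stop, NULL);

        double elapsed_us = (stop.tv_sec - start.tv_sec) * 1e6 +
                            (stop.tv_usec - start.tv_usec);
        return elapsed_us / (2.0 * iters);
    }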

6 END SYSTEM THROUGHPUT

The motivation for throughput measurements is twofold:

1. To estimate expected application performance and identify the system limits that will be seen by user applications.

2. To identify the bottlenecks of the system, to provide feedback for possible improvement.

Throughput is characterized by running a single microbenchmark with different values for several parameters:

• Single threaded one-way throughput performance

• Multithreaded performance (i.e. multiplexing several logical connections from separate threads over the same physical interface to hide latency)

• Two-way performance (i.e. running both reader and writer thread(s) at the same time)

• Using more than one physical SCI interconnect at the same time (multiple adapters)

• Using different protocols (TCP/IP, the lower level raw system interface, or user level shared memory)

• Using different interconnects (SCI, Fast Ethernet and ATM)

This microbenchmark starts a configurable number of independent reader and writer threads (figure 5). Single threaded unidirectional performance shows how the software and hardware interface of a single logical connection performs. Multithreaded unidirectional performance shows the limits of the SCI DMA queuing and of the DMA engine itself (assuming the CPU(s) have enough cycles so that thread context switching has no notable impact on performance). Two-way performance shows how the SCI/Sbus interface behaves when data is coming in both directions at the same time (which will certainly be the case for most applications). Performance measurements using multiple SCI/Sbus cards reveal whether the bottleneck is the SCI interface, the interoperability of the I/O bus and the card, or the Sbus itself being saturated, by looking at how well performance scales with the number of boards used. Using the same test program on all the different interconnects reduces the number of possible error sources in the comparison.
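The sketch below outlines one node of such a tester along the lines of figure 5, with a writer thread streaming to the next node and a reader thread draining data from the previous node; send_all and recv_all stand for blocking transfers over one SCI virtual circuit and are assumptions for the example, not the actual benchmark code.

    #include <pthread.h>
    #include <stddef.h>

    /* Placeholders for blocking transfers over one SCI virtual circuit;
     * these are not the actual benchmark primitives. */
    extern void send_all(int channel, const void *buf, size_t len);
    extern void recv_all(int channel, void *buf, size_t len);

    struct worker_arg {
        int channel;     /* virtual circuit to/from the neighbouring node */
        size_t msg_len;  /* user buffer size being measured               */
        long total;      /* total number of bytes to transfer             */
    };

    /* Writer thread: streams data to the reader thread on node (i+1) mod N. */
    static void *writer(void *p)
    {
        struct worker_arg *a = p;
        static char buf[128 * 1024];
        long sent;

        for (sent = 0; sent < a->total; sent += (long)a->msg_len)
            send_all(a->channel, buf, a->msg_len);
        return NULL;
    }

    /* Reader thread: drains data coming from the writer on node (i-1) mod N. */
    static void *reader(void *p)
    {
        struct worker_arg *a = p;
        static char buf[128 * 1024];
        long received;

        for (received = 0; received < a->total; received += (long)a->msg_len)
            recv_all(a->channel, buf, a->msg_len);
        return NULL;
    }

    /* One node of the tester: starting both threads exercises two-way
     * traffic; starting only the writer (or several writers on separate
     * logical connections) gives the one-way and multithreaded numbers. */
    static void run_node(int chan_out, int chan_in, size_t msg_len, long total)
    {
        pthread_t w, r;
        struct worker_arg wa, ra;

        wa.channel = chan_out; wa.msg_len = msg_len; wa.total = total;
        ra.channel = chan_in;  ra.msg_len = msg_len; ra.total = total;

        pthread_create(&w, NULL, writer, &wa);
        pthread_create(&r, NULL, reader, &ra);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
    }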

Figure 4: Latencies of different interconnects/protocols as a function of user buffer size (SCI 4, 8, 64 byte PIO; TCP/IP over SCI; TCP/IP over 100baseT Ethernet; TCP/IP over Sun ATM; TCP/IP over Fore ATM).

Figure 5: A basic component of the N-way throughput tester: on each node i, a writer thread sends over an SCI virtual circuit to the reader thread on node (i+1) mod N, while a reader thread receives from the writer thread on node (i-1) mod N.

6.1 One-way Point-to-point Performance

Maximal sustainable point-to-point throughput (regardless of message size) is important to applications that need to transfer large amounts of data, while the point-to-point throughput at different message sizes is important to applications with a specific pattern of transfers. Unidirectional point-to-point throughput for different message sizes over raw SCI and TCP/IP over SCI can be inspected in figure 6.


Figure 6: One-way point-to-point throughput of different protocol/interconnect pairs as a function of user buffer size (raw SCI DMA with 3 threads; TCP/IP over SCI; TCP/IP over 100baseT; TCP/IP over Sun ATM; SCI user level 64 byte PIO).

Figure 7: Two-way point-to-point throughput of different protocol/interconnect pairs as a function of user buffer size (raw SCI with 2 threads; TCP/IP over SCI; TCP/IP over 100baseT Ethernet; TCP/IP over Sun ATM; SCI user level 64 byte PIO).

For reference we have also run the same microbenchmark over the Ultra-2's built-in 100 Mbits/s fast Ethernet port, with the Ethernet interfaces connected directly to an Ethernet switch. In addition we have included some measurements obtained between two back-to-back connected 155 Mbits/s Sun ATM boards, as well as over a pair of Fore Systems 155 Mbits/s ATM boards.** The characteristics of TCP/IP over the Fore boards are described more closely in (Bryhni and Omang 1996). The raw SCI data was collected using DMA only. User level message passing with PIO is accomplished using a special lightweight protocol that is still under development. This protocol uses the special UltraSparc instruction set extensions to optimize message pipelining and requires nontrivial UltraSparc assembly programming. Currently only the 3 smallest special cases are implemented. Optimized implementations of the larger message cases are expected to have slightly better performance due to the possibility of increased pipelining of multiple block load/stores in the processor; this is the topic of another article. The disadvantage of using PIO for larger transfers is that most of the involved CPU's cycles will be spent doing communication, while DMA transfers can go on in parallel with other processing. For medium length messages, TCP/IP over SCI performs better than DMA based raw SCI. This is probably due to the assembling of messages in the different layers of the TCP protocol.


6.2 Two-way Throughput Performance

It is not obvious where the bottlenecks are in a complicated system like this. In our case, one might initially believe that the one-way throughput test reveals the "pipe width" of the complete Sbus/SCI interface; interleaving incoming and outgoing SCI traffic should then make no difference compared to having one-way traffic only. However, this turns out not to be the case. As is clear from figure 7, throughput when reading and writing at the same time at both ends gives more than 18 Mbytes/s in each direction, a total of 36 Mbytes/s, which is an increase of 42% compared to the one-directional case. Clearly, some limit must exist in the DMA system or in the buffering of SCI transactions related to traffic in one direction; that is, the overall capacity of the "pipe" is not the current bottleneck. Another conclusion to draw from figure 7 is that both SCI (and also the Sun ATM interconnect) will give better performance for applications with more or less symmetric communication patterns than one could expect from the one-way sustained throughput numbers in figure 6. Consequently, providing only the 1-way numbers does not tell the whole story about an interface. This is also illustrated by the fact that the Fore ATM boards do not show any performance increase for two-way throughput (numbers not included in the graphs for readability reasons). For standard half duplex Ethernet, 2-way performance is obviously not better than 1-way, but modern Ethernet interfaces are starting to support full duplex operation.

** These measurements are obtained using two SparcStation 20’s, but both platforms are fast enough that we believe this difference has little impact on throughput performance.


Preliminary tests with the fast Ethernet interface on the Ultra-2s running 2-way in full duplex mode through a switch show a 60-70% increase in throughput compared to the 1-way case. Thus two-way throughput is definitely a parameter that should be measured.

6.3 Hiding DMA Setup Latency

Implementing message passing with maximal efficiency requires reducing the number of times data has to be copied to a minimum. The DMA engine of the SCI hardware requires that DMA data is a multiple of 64 bytes and that both the sender data and the receive buffer start address are aligned to a 64 byte boundary. To avoid costly copying operations when data is aligned, the read/write interface of the raw SCI device currently uses DMA directly from the user buffer on the writer side. On the reader end, data is DMA'ed into a kernel buffer and the driver uses an extra level of copying to get the data into the (possibly differently aligned) user buffer. Note that the sender has to wait for completion of the DMA operation before its buffer can be reused. If this process is the only one using the interconnect, it will not be able to get the full bandwidth, because the DMA engine is idle while the CPU is working and vice versa (see figure 8). This lack of overlap between communication and computation clearly hurts performance. Another interface to raw driver operation could allow asynchronous operation, i.e. having a post-send call and a complete-send call, and require that the user program issues the complete-send call before attempting to reuse the buffer.
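A sketch of what such an asynchronous interface could look like from the application side is given below. The post_send and complete_send calls are hypothetical, invented here to illustrate the idea; they do not exist in the current driver. With two buffers, the CPU can fill one buffer while the DMA engine drains the other.

    #include <stddef.h>

    /* Hypothetical asynchronous raw SCI interface, as suggested above;
     * these calls do not exist in the current driver and the names are
     * invented for illustration.  post_send() queues a DMA transfer
     * directly from the (64 byte aligned) user buffer and returns a
     * handle immediately; complete_send() blocks until that transfer has
     * finished, after which the buffer may be reused. */
    typedef int sci_send_handle;

    extern sci_send_handle post_send(int fd, const void *buf, size_t len);
    extern int complete_send(int fd, sci_send_handle h);

    /* Double buffering: while the DMA engine drains one buffer, the CPU
     * fills the other, overlapping communication with computation. */
    static void stream(int fd, char *buf[2], size_t len, int chunks,
                       void (*fill)(char *, size_t))
    {
        sci_send_handle pending = -1;
        int i;

        for (i = 0; i < chunks; i++) {
            char *cur = buf[i & 1];

            fill(cur, len);                   /* produce the next chunk    */
            if (pending >= 0)
                complete_send(fd, pending);   /* wait for the older buffer */
            pending = post_send(fd, cur, len);
        }
        if (pending >= 0)
            complete_send(fd, pending);
    }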

Figure 8: Interleaving computation and communication using threads: in a single threaded system the DMA engine is idle while the CPU is doing other processing or DMA setup (and vice versa), while a two-thread system keeps both busy.

Figure 9: Point-to-point throughput: effect of multiplexing read/writes over multiple virtual channels (raw SCI with no multiplexing, and with 2, 3 and 4 threads).

Work is being done to provide a pseudo driver environment for this and other performance experiments that require kernel level modifications. Because of the lack of asynchronous interfaces in the current driver, the raw SCI throughput results in this paper are obtained by multiplexing data over more than one virtual channel using lightweight threads. The effect this has on performance can be seen in figure 9. For small messages little is gained: setup of DMA is relatively costly compared to the DMA transfer itself, and setup of several buffers can currently not be done in parallel due to serialization in the current implementation of the driver. As messages get larger, the DMA transfer time is long enough for some setup to be done in another thread, and the performance gain is significant. As message sizes become even larger, this gain is reduced because the overall time used for setup diminishes (relatively fewer setups are needed). Still, we see a noticeable (8-10%) improvement even for optimal message sizes, clearly indicating the need for a better implementation.
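As an illustration of this multiplexing, the sketch below splits one large, 64 byte aligned transfer over several raw SCI virtual channels, each served by its own lightweight thread, so that the DMA setup for one slice can be processed while another slice is in transfer. The per-channel file descriptors and the slicing policy are assumptions for the example and not part of the actual benchmark.

    #include <pthread.h>
    #include <stddef.h>
    #include <unistd.h>

    #define MAX_CHANNELS 4

    struct chan_job {
        int fd;             /* raw SCI device fd for this virtual channel */
        const char *data;   /* this channel's slice of the user buffer    */
        size_t len;         /* slice length (64 byte aligned)             */
    };

    /* Each thread pushes its slice through its own virtual channel; while
     * one channel is busy with DMA setup or transfer in the driver, the
     * setups of the other channels can proceed. */
    static void *chan_writer(void *p)
    {
        struct chan_job *job = p;
        size_t done = 0;

        while (done < job->len) {
            ssize_t n = write(job->fd, job->data + done, job->len - done);
            if (n <= 0)
                break;
            done += (size_t)n;
        }
        return NULL;
    }

    /* Split "len" bytes (assumed to be a multiple of 64) over "nchan"
     * channels (nchan <= MAX_CHANNELS) in 64 byte aligned slices. */
    static void send_multiplexed(int fds[], int nchan, const char *buf, size_t len)
    {
        pthread_t tid[MAX_CHANNELS];
        struct chan_job job[MAX_CHANNELS];
        size_t slice = ((len / nchan) / 64) * 64;
        int i;

        for (i = 0; i < nchan; i++) {
            job[i].fd = fds[i];
            job[i].data = buf + (size_t)i * slice;
            job[i].len = (i == nchan - 1) ? len - (size_t)i * slice : slice;
            pthread_create(&tid[i], NULL, chan_writer, &job[i]);
        }
        for (i = 0; i < nchan; i++)
            pthread_join(tid[i], NULL);
    }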

7 PERFORMANCE USING MORE THAN ONE SCI BOARD IN EACH NODE


The Ultra-2 workstations are equipped with 4 Sbus slots, of which the Sbus-2 card occupies one. These slots all belong to the same bus, so overall performance will at least be limited by the bus bandwidth. The SCI driver supports multiple boards per node. We wanted to see if we were able to saturate the Sbus using more than one interface, and also to see what maximal flow we were able to get using these cards. We put two boards in each of the two nodes, with each pair of boards connected back-to-back (separate SCI rings). Throughput results of one-way and two-way operation over both interfaces simultaneously are presented in figure 10. The results show that the peak total throughput of the system in one direction doubles to more than 51 Mbytes/s, while throughput running in both directions increases by 57%, from 36 Mbytes/s to more than 57 Mbytes/s. These results indicate that the bandwidth of the Sbus is no problem with one Sbus-2 card, while we are starting to approach limits with two cards when running in both directions, and suggest that the observed 57 Mbytes/s is close to the hardware limit of the Sbus in 32 bit mode. The theoretical limit of Sbus performance (in 32 bit mode at 25 MHz) is 100 Mbytes/s, but that number does not include arbitration and other protocol overhead.



Figure 10: One-way/two-way throughput performance of the SCI interconnect using two SCI/Sbus adapters connected by two separate SCI rings (raw SCI 1-way and 2-way with 2 threads; TCP/IP over SCI 1-way and 2-way).

7.1 TCP/IP over SCI

The TCP/IP protocol over SCI is implemented by providing a software layer called the DLPI (data link provider interface). This is a separate pseudo device driver module that links the lower level parts of the Dolphin device driver with the higher level protocols in the UNIX System V Streams interface. Since SCI provides a transmission error free service, some of the TCP protocol features are not really needed and present useless overhead for many applications. The software overhead becomes particularly clear when looking at the increase in latency compared to the native SCI case: the SCI read/write interface has a minimal latency of 98 µs (not present in figure 4), while the minimal latency for TCP/IP over SCI is 200 µs. Even though the low level latency of the SCI interconnect should be superior to the latency of an ATM interconnection, the observed latencies of the two TCP/IP implementations are currently in the same range, and both are worse than the 100 Mbits/s Ethernet numbers. The TCP/IP implementation currently does not make use of 64 byte programmed I/O, and it is not optimized in any way, as opposed to the more mature implementations. The introduction of 64 byte PIO in the TCP/IP driver, as well as tuning and optimization, should give significant speedup, however probably still with significant protocol overhead compared to the bare hardware performance.

8 RELATED SCI WORK

There are other works that characterize the older Sbus-1 adapter. The Sbus-1 adapter has a similar interface towards the user, but the hardware is more limited: the Sbus-1 boards had buffers for only one outstanding SCI transaction on the SCI side, the driver at that point did not support queuing of DMA operations, and the hardware did not have support for 8 or 64 byte stores. The Sbus-2 adapter thus has more than twice the throughput and much lower latency on pipelined writes. The low level performance tests in this paper are collected using improved and extended versions of the microbenchmarks described in (Omang 1995). Parts of that work, together with an improved comparison with ATM, are available in (Bryhni and Omang 1996). An interesting analysis of the performance of a higher level application over Sbus-1 can be found in (Bugge and Husøy 1995). Other work on the SCI/Sbus-1 adapter cards includes (Klovning and Bryhni 1994; George et al. 1995; George et al. 1996).


9 OTHER RELATED WORK

DEC's Memory Channel (MC) (Gillett 1996; Gillett and Kaufmann 1996) is similar to SCI in implementing distributed shared memory, but it uses reflective memory: a store to a shared segment on one node is spread to local copies on all participating nodes. This is different from Sbus/SCI, where only one copy of the data is kept. MC is limited to 8 nodes and to a total network address space of 512 Mbyte, and the interconnect technology is not standards based like SCI. The latest version of this technology gives a minimal (one-way) latency of 2.9 µs for a 32 byte message, and a raw bandwidth of 59 Mbytes/s on the faster PCI bus. Sbus/SCI on the other hand allows each node to import 256 Mbyte of address space physically allocated on other nodes; that is, the total amount of shared memory scales with the number of nodes, but each individual node can at any one time only map a total of 256 Mbyte of memory from other nodes for programmed I/O. DMA can be used independently of this mapping limit. SCI also has support for up to 64,000 nodes, although currently only small systems (8 nodes with Sbus-2 by September 1996) have actually been configured.

Another related project is the Princeton SHRIMP project (Iftode et al. 1996; Dubnicki et al. 1996). SHRIMP uses an Intel Paragon backplane as the physical layer, and Pentium PCs with the slower EISA bus as individual nodes. The SHRIMP addition is a low latency message passing protocol that, through snooping on the memory bus of each node, is able to selectively fetch memory operations to local memory from the memory bus. The access to the memory bus is read-only, so all writes have to take place through the I/O bus. Sbus/SCI and MC, being pure I/O adapter cards, do not have access to the memory bus of the local node, so all interconnect operations have to go across the I/O interface. The positive side of the SHRIMP approach is the symmetry of the shared memory and the possibility of maintaining coherency through the snooping capability, which makes SHRIMP shared memory usable as traditional SMP shared memory in larger quantities. The drawback is that SHRIMP needs an additional set of hardware for each system bus to be supported (and system bus interfaces are in general less available than I/O interfaces), while Sbus/SCI and MC only depend on the I/O interface. With respect to performance, the Sbus/SCI interface shows slightly better latency and somewhat better bandwidth. Table 1 contains a short summary; the MC and SHRIMP results are from (Gillett and Kaufmann 1996) and (Dubnicki et al. 1996) respectively.


Interconnect              Latency (µs)   Throughput (MB/s)
SHRIMP                    4.8            23.0
SCI (raw)                 4.0            36.6
SCI + TCP                 199.0          29.3
ATM + TCP                 220.9          30.0
DEC MC/PCI                2.9            59.0
2 adapters (same Sbus):
SCI (raw)                 4.0            57.8
SCI + TCP                 199.0          41.9

Table 1: Some numbers for different hardware supported DSM solutions. The SCI and ATM bandwidths are the two-way throughput measurements.






10 CONCLUSION

A detailed low level performance analysis of point-to-point connections of workstations using SCI has been conducted. Comparing these results with results obtained on the same platform, using exactly the same test program and identical versions of the operating system and hardware, for ATM and Fast Ethernet confirms that the Sbus adapter based SCI interconnect provides latency margins and bandwidth currently not documented for any other standard I/O adapter based interconnect. The results show that even though the hardware has very low latency, this property is lost through operating system overhead and protocol processing in the standard IP case. There is a definite need for low level user space protocols that take into consideration the cost of operating system calls and interrupt processing. This work shows some of the potential for such protocols in the SCI case.

Traditionally, throughput is measured as one directional throughput from one node to another. Modern hardware has different bi-directional bandwidth characteristics, which suggests that two-way throughput numbers should be part of network adapter performance measurements.

The maximal scalability of this technology is yet to be demonstrated, but this low level analysis shows some of the potential. Good latency and end system throughput numbers are important factors in making an interconnect scalable. The SCI interconnect itself has performance that allows several nodes on a ring, and switched configurations allow for further scaling. The combination allows for different topologies depending on performance needs. Work to look at such larger configurations is in progress. However, the two node case is itself quite interesting, since it provides an upgrade path for SMPs. Currently, vendors offer SMPs with up to a few tens of processors. Scaling from there currently requires moving to specialized high-end computers at a completely different price level. Using multiple SCI cards (Sun's new Enterprise systems come with multiple I/O buses, each of which can be equipped with 2 SCI boards for fault tolerance or speed), a powerful dual-node multiprocessor can be constructed. This is also justified by the fact that by the time this paper is presented, Sun offers SMP cluster products based on this technology.

Researchers have for many years dreamt about the "network multicomputer", where standard workstations in a LAN can be put to work solving a single large problem, even with considerable exchange of data. This is what (Bugge and Husøy 1995) call "accidental clustering". This paper shows that this goal is about to be achieved.

We foresee interesting possibilities with this interconnect also in the area of "dedicated clustering". SCI does not have the same limits on the number of nodes as SMPs and has a scalable interconnect bandwidth. Thus clusters of single CPUs, or clusters of smaller SMPs, can easily be put together incrementally using this technology, possibly replacing or supplementing large SMPs for a number of applications.


11 ACKNOWLEDGMENTS

We would like to thank Mark Hill and Øystein Gran Larsen for valuable comments on this paper, and Sun Microsystems for providing the equipment and working environment. Thanks also to the rest of the HPC team at Sun, and to Bjørn Dag Johnsen at Dolphin, for stimulating discussions and valuable input during this work. Knut Omang is supported by the University of Oslo and Dolphin Interconnect Solutions.


References


Bryhni, H. and K. Omang (1996, July). A Comparison of Network Adapter based Technologies for Workstation Clustering. In Proceedings of 11th International Conference on Systems Engineering.

Bugge, H. and P. O. Husøy (1995, October). Dedicated Clustering: A Case Study. In Proceedings of Fourth International Workshop on SCI-based High-Performance Low-Cost Computing.

Dubnicki, C., L. Iftode, E. E. Felten, and K. Li (1996, April). Software Support for Virtual Memory-Mapped Communication. In Proceedings of 10th International Parallel Processing Symposium.

George, A. D., R. Todd, W. Phipps, M. Miars, and W. Rosen (1996, March). Parallel Processing Experiments on an SCI-based Workstation Cluster. In Proceedings of Fifth International Workshop on SCI-based High-Performance Low-Cost Computing.

George, A. D., R. W. Todd, and W. Rosen (1995, October). A Cluster Testbed for SCI-based Parallel Processing. In Proceedings of Fourth International Workshop on SCI-based High-Performance Low-Cost Computing.

Gillett, R. B. (1996, February). Memory Channel for PCI. IEEE Micro, 12–18.

Gillett, R. B. and R. Kaufmann (1996, August). Experience Using the First-Generation Memory Channel for PCI Network. In Proceedings of Hot Interconnects IV, pp. 205–214.

Iftode, L., C. Dubnicki, E. W. Felten, and K. Li (1996, February). Improving Release-Consistent Shared Virtual Memory Using Automatic Update. In Proceedings of 2nd International Symposium on High-Performance Computer Architecture.

Klovning, E. and H. Bryhni (1994). Design and performance evaluation of a multiprotocol SCI LAN. Technical Report TF R 45/94, Telenor Research, Kjeller.

Omang, K. (1995, November). Preliminary Performance results from SALMON, a Multiprocessing Environment based on Workstations Connected by SCI. Research Report 208, Department of Informatics, University of Oslo, Norway. Available at http://www.ifi.uio.no/~sci/papers.html.

SCI (1993, August). IEEE Standard for Scalable Coherent Interface (SCI).

