2013 IEEE 27th International Symposium on Parallel & Distributed Processing Workshops and PhD Forum

Ninja Migration: An Interconnect-transparent Migration for Heterogeneous Data Centers Ryousei Takano, Hidemoto Nakada, Takahiro Hirofuchi, Yoshio Tanaka, and Tomohiro Kudoh Information Technology Research Institute National Institute of Advanced Industrial Science and Technology (AIST) 1-1-1 Umezono, Tsukuba, Ibaraki 305-8568 Japan Email: {takano-ryousei, hide-nakada, t.hirofuchi, yoshio.tanaka, t.kudoh}@aist.go.jp

bypass I/O technologies, including PCI passthrough and SR-IOV, can significantly reduce the overhead [4], they make it impossible to migrate virtual machines. The heterogeneity of the underlying software and hardware, including VMM, CPU, and interconnect architectures, also makes it hard to migrate a VM. For example, a VM running on an Infiniband cluster, in which the VM is assigned an Infiniband device, cannot migrate to an Ethernet cluster, and vice versa. Even if the VM can migrate, an application using the device cannot continue to execute without relaunching. In this paper, we address these issues and propose an interconnect-transparent migration, which enables us to simultaneously migrate multiple co-located VMs between data centers equipped with different interconnect devices. To realize the proposed mechanism, we adopt a gray-box approach [5], [6], a cross-layer technique that improves the performance and functionality of a virtualized environment by leveraging knowledge of the guest operating system (OS). Our previous work, Symbiotic Virtualization (SymVirt), addresses the former issue, and we have presented a proof-of-concept implementation [7]. SymVirt enables us to migrate a VM with a VMM-bypass I/O device such as an Infiniband host channel adapter (HCA). We implemented an interconnect-transparent migration based on the SymVirt mechanism, called Ninja migration, on top of both QEMU/KVM and the Open MPI system. To demonstrate its feasibility, we conducted experiments in which VMs running MPI benchmark programs migrate between an Infiniband cluster and an Ethernet cluster. The results of our experiments indicate that 1) the proposed mechanism has no performance overhead during normal operations, and 2) MPI processes running on distributed VMs can migrate between an Infiniband cluster and an Ethernet cluster without restarting the processes. The rest of the paper is organized as follows.
Section II describes the use cases and requirements of an interconnect-transparent migration. The design and implementation of the Ninja migration mechanism are presented in Section III. Section IV shows the results of our experiments, and we discuss further optimization techniques in Section V. In Section VI, we briefly discuss related work. Finally, Section VII

Abstract—A virtual machine (VM) migration is useful for improving flexibility and maintainability in cloud computing environments. However, the heterogeneity of the underlying software and hardware, including CPU and interconnect architectures, makes it hard to migrate a VM. In addition, VM monitor (VMM)-bypass I/O technologies, which significantly reduce the overhead of I/O virtualization, also make VM migration impossible. Therefore, a VM assigned to an Infiniband device cannot migrate to an Ethernet machine, and vice versa. If we overcome the above barriers, we can increase the potential and possibilities of VM migration. In this paper, we propose an interconnect-transparent migration mechanism to simultaneously migrate multiple co-located VMs between data centers equipped with different interconnect devices. Our implementation of the proposed mechanism, called Ninja migration, is achieved by cooperation between a VMM and an MPI runtime system on the guest OS. We demonstrate fallback and recovery operations on a high performance computing workload using the proposed mechanism. We have confirmed that 1) the proposed mechanism has no performance overhead during normal operations, and 2) MPI processes running on distributed VMs can migrate between an Infiniband cluster and an Ethernet cluster without restarting the processes.

Keywords—Virtualization, Interconnect-transparent migration, VMM-bypass I/O, MPI, Cloud computing

I. INTRODUCTION
A high performance computing (HPC) cloud is a promising HPC platform. Recently, cloud computing has been getting increased attention from the HPC community. To meet the demand, several systems, e.g., Amazon EC2 Cluster Compute Instances [1], Google Compute Engine [2], and CycleCloud [3], have been proposed. By introducing cloud computing to HPC, all the benefits of cloud computing, such as reduced ownership cost, higher flexibility, and higher availability, can be enjoyed by users. Virtualization is a key technology in cloud computing, and it is widely used for flexibility and security. It makes migration of computing elements easy, and such ease-of-use is useful for achieving server consolidation and fault tolerance. However, current VM migration technologies lack support for both VMM-bypass I/O devices and the heterogeneity of interconnect devices. Virtualization introduces a large overhead, spoiling I/O performance. Although VMM-


[Figure 1: normal operation — VM1, VM2, and VM3 each communicate over Infiniband on an Infiniband cluster; fallback operation — the same VMs each communicate over TCP/IP on an Ethernet cluster; a fallback migration and a recovery migration transition between the two states.]

Figure 1. A use case of an interconnect-transparent migration. The left figure shows the state of normal operation on an Infiniband cluster; the right figure shows the state of fallback operation on an Ethernet cluster. An interconnect-transparent migration enables us to transition transparently between the two states without restarting a parallel application running on the VMs.

filling [9] and load balancing techniques using VM technologies. For the purpose of implementing these techniques, interconnect-transparent migration is useful. VM consolidation enables us to reduce the total cost of ownership, especially for web and enterprise workloads. Some researchers have focused on application/workload-aware VM consolidation techniques. L. Cherkasova, et al., have reported significant under-utilization of computing resources in an LHC Computing Grid data center: 50 % of the jobs use less than 2 % of the CPU time and 70 % use less than 14 % of the CPU time during their lifetime [10]. This implies that VM consolidation can also be effective for HPC workloads. Sustainable grid supercomputing: The number of processors used changes dynamically on demand, and resources are allocated and migrated dynamically on a Grid of geographically distributed supercomputers, according to both reservations and unexpected faults [11]. The PRAGMA resources working group [12] has demonstrated a single VM image running on multiple heterogeneous and geographically distributed sites. Interconnect-transparent migration can help to improve the flexibility of such resource sharing. To support the execution of a parallel application on such a geographically distributed environment, we have proposed a Grid middleware suite called GridARS [13], [14]. Disaster recovery: VMs are evacuated from a disaster-affected data center to a safe data center before those VMs crash [15]. In order to ensure the continuity of their services, interconnect-transparent migration, which expands the range of data centers that can accept VM migration, is useful.

summarizes the paper.
II. INTERCONNECT-TRANSPARENT MIGRATION
A. Use cases of an interconnect-transparent migration
A VM migration is useful for improving flexibility and maintainability in cloud computing environments. However, the heterogeneity of the underlying software and hardware can make it impossible to migrate a VM: we cannot always assume that the destination node of a migration has the same equipment as the source node. Some researchers have addressed a heterogeneous VM migration technique [8], which focuses on the heterogeneity among VMMs. In this paper, we focus on the heterogeneity of interconnect architectures; we call such a technique an interconnect-transparent migration. Figure 1 shows an interconnect-transparent migration, which allows the migration of VMs between data centers in which compute nodes are connected with different interconnect networks. Here we assume the following scenario. The data center on the left is normally the one used. When we cannot continue to use this data center, due to hardware failures, scheduled maintenance, and so on, our application migrates to the data center on the right. During both a fallback migration and a recovery migration, the application can continue to run without relaunching. There are many use cases for such a migration technique, as follows. Non-stop maintenance: During hardware or software maintenance on a machine, interconnect-transparent migration allows a VM to transparently fail over to another machine without stopping the service. We cannot ensure that high-end machines equipped with a high-speed interconnect are always available. In addition, using the proactive and reactive fault-tolerant systems shown in [7], we can restart VMs on an Ethernet cluster from checkpointed VM images taken on an Infiniband cluster. High resource utilization: The optimal placement of VMs is scheduled by leveraging the ability to suspend, migrate, and resume computations. There exist back-

B. Requirements
First of all, we assume our target application is a Message Passing Interface (MPI) program. The following describes the requirements that must be met to achieve an interconnect-transparent migration. First, a VMM must be able to migrate a VM with VMM-bypass I/O devices. Second, a VMM


can migrate a VM even if the source and destination nodes have different interconnect devices. To meet the above requirements, cooperation between a VMM and the guest OS is required, as follows. By detaching all currently attached VMM-bypass I/O devices, it is possible to migrate a VM. To perform such a migration without losing in-flight data, packet transmission from/to the VM should be stopped prior to detaching the devices involved. VMM-bypass I/O devices can be detached and re-attached using PCI hotplugging. However, with a VMM alone, it is hard to know the communication status of an application inside a guest OS and the proper timing for detaching and re-attaching the devices, especially if VMM-bypass devices are used. Moreover, in conjunction with a migration, the MPI runtime system must change the transport protocol according to the available interconnect devices.
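The required ordering — stop traffic, detach the VMM-bypass device, migrate, re-attach, resume — can be captured as a small validity check. The sketch below is purely illustrative and not part of the paper's implementation; the step names are hypothetical.

```python
# Hypothetical sketch: validate that a migration plan respects the ordering
# required in Section II-B. Step names are illustrative, not real APIs.
REQUIRED_ORDER = ["quiesce", "detach", "migrate", "attach", "resume"]

def is_safe_plan(steps):
    """True iff all required steps are present and occur in the required order."""
    try:
        positions = [steps.index(s) for s in REQUIRED_ORDER]
    except ValueError:          # a required step is missing entirely
        return False
    return positions == sorted(positions)

# Detaching before migrating is safe; migrating with the device attached is not.
assert is_safe_plan(["quiesce", "detach", "migrate", "attach", "resume"])
assert not is_safe_plan(["quiesce", "migrate", "detach", "attach", "resume"])
```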

[Figure 2: on Infiniband nodes, each VM's MPI library uses the IB transport through a VMM-bypass HCA; after a fallback migration to Ethernet nodes, the MPI library uses the TCP transport; a recovery migration returns the VMs to the Infiniband nodes. Time proceeds through phases 1) to 4).]

Figure 2. An overview of Ninja migration. Ninja migration works through cooperation between distributed VMMs and the guest OSes. This figure shows a fallback-and-recovery behavior that consists of four phases: 1) normal operation, 2) fallback migration, 3) fallback operation, and 4) recovery migration. After phase 4, the status returns to that of phase 1.

III. NINJA MIGRATION
A. Design
We present our implementation of an interconnect-transparent migration mechanism, called Ninja migration. To realize cooperation between a VMM and the guest OS, we adopt a gray-box approach [5], [6], a cross-layer technique that improves the performance and functionality of a virtualized environment by leveraging knowledge of the guest OS. We designed the Ninja migration mechanism based on the Symbiotic Virtualization (SymVirt) mechanism [7], which enables distributed VMMs to cooperate with a message passing layer on the guest OSes. Figure 2 shows an overview of the Ninja migration mechanism. The upper figures show the Infiniband nodes; the bottom figures show the Ethernet nodes. Following the fallback-and-recovery scenario shown in Figure 1, it works in four phases: 1) normal operation, 2) fallback migration, 3) fallback operation, and 4) recovery migration. After phase 4, the status returns to that of phase 1. During normal operation, each VM communicates over an Infiniband network via a VMM-bypass Infiniband HCA. In a fallback migration, the Infiniband HCA is detached, and the VM migrates to a node on the Ethernet cluster; at the same time, the MPI runtime system switches the transport protocol from Infiniband to TCP/IP. During fallback operation, each VM communicates over TCP/IP via an Ethernet NIC. In a recovery migration, the VM migrates back to the original node, and the Infiniband HCA is re-attached; at the same time, the MPI runtime system switches the transport protocol back from TCP/IP to Infiniband. The following introduces an overview of SymVirt, and then presents the implementation of the Ninja migration mechanism in some detail.
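The four-phase cycle above can be sketched as a small state machine. The phase names and the transport used in each operating phase follow Figure 2; the class itself is only an illustration, not part of the implementation.

```python
# Illustrative sketch of the four-phase Ninja migration cycle (Figure 2).
# The transport mapping follows the paper; the class is not real code.
PHASES = ["normal operation", "fallback migration",
          "fallback operation", "recovery migration"]

TRANSPORT = {
    "normal operation": "Infiniband (VMM-bypass HCA)",
    "fallback operation": "TCP/IP (Ethernet NIC)",
}

class NinjaCycle:
    def __init__(self):
        self.index = 0  # start in normal operation

    @property
    def phase(self):
        return PHASES[self.index]

    def step(self):
        # After phase 4 (recovery migration), the status returns to phase 1.
        self.index = (self.index + 1) % len(PHASES)
        return self.phase

cycle = NinjaCycle()
assert cycle.phase == "normal operation"
assert cycle.step() == "fallback migration"
assert cycle.step() == "fallback operation"
assert TRANSPORT[cycle.phase].startswith("TCP/IP")
assert cycle.step() == "recovery migration"
assert cycle.step() == "normal operation"  # back to phase 1
```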

or data center. The key to implementing migration and checkpoint/restart for VMs with VMM-bypass I/O devices is to unplug the devices only when such VM-level functions are required. To unplug devices safely while a parallel application is running on distributed VMMs, we must be able to create a globally consistent snapshot of the entire virtualized cluster. Therefore, we employ a technique that combines PCI hotplug with the coordination of a parallel application. The former enables us to add and remove devices while the OS is running; the latter is required to preserve the VM execution and communication states so that snapshots can be restored in the future. Moreover, this approach enables us to avoid virtualization overhead during normal operations. SymVirt provides a simple intra-node communication mechanism between a VMM and the guest OS: a pair of mode-switch calls between the guest OS and the VMM, SymVirt wait and SymVirt signal. From the viewpoint of a guest OS, a SymVirt wait call behaves as a synchronous call: the execution of the VM is blocked until a SymVirt signal call is issued on the VMM. Between the SymVirt wait and signal calls, VMM monitor commands, e.g., attaching and detaching a device, and migration commands can be issued. Figure 3 shows an overview of the proposed mechanism, which consists of the SymVirt coordinator, the SymVirt controller, and SymVirt agents. The SymVirt coordinator runs inside an application process. The SymVirt controller is a master program on the VMM side; the controller and the SymVirt agents work together to control distributed VMMs. This mechanism works in cooperation with a cloud scheduler. The workflow of SymVirt is summarized as follows. 1) A cloud scheduler delivers a trigger event, e.g., a migration

B. SymVirt: Symbiotic Virtualization The aim of SymVirt is to simultaneously migrate and checkpoint/restart multiple co-located VMs in a cluster


C. Implementation

We implemented Ninja migration on top of QEMU/KVM [16] and Open MPI [17], and we confirmed that it works on heterogeneous virtualized clusters. The details of the implementation are described below. For the purpose of dynamically switching transport protocols, we extended the SymVirt coordinator. The SymVirt coordinator exploits the modular checkpoint/restart framework [18] of Open MPI for VM-level checkpoint/restart and migration instead of process-level checkpoint/restart. This framework consists of a checkpoint/restart coordination protocol framework called the OMPI CRCP (Checkpoint/Restart Coordination Protocol), and a single-process checkpoint/restart service framework called the OPAL CRS (Checkpoint/Restart Service). The OPAL CRS supports a user-level checkpoint feature (SELF) and the process-level checkpoint/restart system BLCR [19]. In order to ensure easy deployment, the SymVirt coordinator is required to work without modification of either the MPI library or applications. The OMPI CRCP can be used without modification. Instead of implementing a new OPAL CRS component for SymVirt, we used the SELF component, which supports application-level checkpointing by providing the application callbacks upon checkpoint, restart, and continue operations. The SymVirt coordinator uses the checkpoint and continue callbacks to issue SymVirt wait calls; it does not use the restart callback. The SELF handler routines for SymVirt are implemented as a shared library, libsymvirt.so. Using the LD_PRELOAD environment variable, the library is loaded into an MPI process at runtime.
Ninja migration transparently switches transport protocols before and after a migration as follows. The Open MPI CRS releases all resources allocated on Infiniband devices in the pre-checkpoint phase. The OMPI Byte Transfer Layer (BTL) provides an interconnect-agnostic abstraction, used for MPI point-to-point messages on several types of networks. BTL modules are reconstructed and connections are re-established in the continue and restart phases. Therefore, there are no problems even if Local IDs (port addresses) or Queue Pair Numbers change after a migration. This design follows from the fact that BLCR does not support mechanisms to save and restore network connections, i.e., sockets. Each BTL has an exclusivity parameter used to select the primary communication device; the higher the value, the higher the priority. For example, that of TCP is 100, while that of Infiniband is 1024. If an Infiniband device is available after a migration, it is used according to the exclusivity parameters; otherwise, fallback to Ethernet occurs. Note that when only the TCP BTL module is available for inter-node communication, BTL reconstruction is not executed. This situation corresponds to a recovery migration. Therefore, we set the variable ompi_cr_continue_like_restart to forcibly reconstruct BTL modules when a recovery mi-

Figure 3. An overview of the SymVirt mechanism: 1) the cloud scheduler delivers trigger events; 2) the SymVirt coordinators inside the MPI processes on each guest OS coordinate; 3) each coordinator issues a SymVirt wait; 4) the SymVirt controller and per-node SymVirt agents on the VMM side do something, e.g., migration/checkpointing; 5) the agents issue a SymVirt signal.

Figure 4. The control flow of a VM migration using SymVirt: after the application confirms and issues SymVirt wait, control passes from guest OS mode to VMM mode, where the SymVirt controller/agent performs detach, migration, and re-attach (confirming link-up), before a SymVirt signal returns control to the guest.

or checkpoint/restart request, to both the MPI runtime system and the SymVirt controller. The MPI runtime invokes the SymVirt coordinators at each MPI process. 2) The SymVirt coordinators synchronize all processes and create a consistent state for the entire application by using a coordination protocol. 3) Each SymVirt coordinator issues a SymVirt wait call. The VM is paused until a SymVirt signal call is received. 4) The SymVirt controller spawns SymVirt agent threads. Each agent connects to the VMM monitor interface and executes a procedure corresponding to the event. 5) The SymVirt agents issue a SymVirt signal call, and the VMs are resumed. Figure 4 shows the control flow of a VM migration using SymVirt. It consists of the following three phases: 1) detach: a SymVirt agent removes a VMM-bypass I/O device from the VM, 2) migration, and 3) re-attach: a SymVirt agent re-attaches a VMM-bypass I/O device to the VM. Each phase involves a mode transition between the VMM and the guest OS. A SymVirt wait call is issued by a SymVirt coordinator; the SymVirt agents then control the VM, followed by a SymVirt signal to wake the VM up. During phases 1) and 3), the guest OS needs to recognize the addition and removal of a device in order to migrate a VM safely. Therefore, a waiting period is required so that the PCI hotplug mechanism, i.e., the ACPI hotplug PCI controller driver acpiphp, can work on the guest OS.
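The blocking semantics of the SymVirt wait/signal pair can be mimicked with ordinary thread synchronization. The sketch below is only an analogy — a real SymVirt wait is a mode-switch call that pauses the whole VM, not a thread — and all names in it are illustrative.

```python
import threading

# Analogy for SymVirt wait/signal: the "guest" blocks in its wait step
# until the "VMM side" finishes its work (detach/migrate/re-attach) and
# signals. Purely illustrative; not part of the SymVirt implementation.
signal_event = threading.Event()
log = []

def guest():
    log.append("coordination done")   # step 2: consistent global state
    log.append("symvirt_wait")        # step 3: the VM pauses here
    signal_event.wait()               # blocked until the VMM signals
    log.append("resumed")             # step 5: the VM resumes

def vmm():
    log.append("detach/migrate/attach")  # step 4: monitor commands run
    signal_event.set()                   # step 5: symvirt_signal

g = threading.Thread(target=guest)
g.start()
vmm()  # in reality the controller/agents act while the VM is paused
g.join()
assert log[-1] == "resumed"
assert "detach/migrate/attach" in log
```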


and blade switches. Hyper-Threading was disabled. Table I summarizes the specifications of the node PC and the switch. We set up two virtualized clusters on top of the physical cluster. A single VM, which had 8 CPU cores and 20 GB of memory, ran on each physical machine. The host OS and the guest OS were Debian GNU/Linux 7.0 (testing) and Scientific Linux 6.2, respectively. The VM image was created in the qcow2 format, which enabled us to take snapshots internally. Live migration requires storage shared between the source and destination nodes; in this experiment, we used NFS version 3. The proposed mechanism was implemented based on Linux kernel version 3.2.18 and QEMU/KVM version 1.1-rc3. We used the default pre-copy live migration of QEMU/KVM. The virtual CPU model was set to “host” to allow the guest OS to use all available host processor features, and the “-smp” and “-numa” options were set at boot time to configure NUMA affinity. On the VM environment, the OpenFabrics Enterprise Distribution (OFED) version 1.5.4.1 was used. The benchmark applications were compiled with gcc/gfortran version 4.4.6 with the optimization option “-O2.” We used Open MPI version 1.6.0 [17] with the options “--mca mpi_leave_pinned 0 -am ft-enable-cr.”

Table I
AGC CLUSTER SPECIFICATIONS.

Node PC
  CPU:        Quad-core Intel Xeon E5540/2.53GHz x2
  Chipset:    Intel 5520
  Memory:     48 GB DDR3-1066
  Infiniband: Mellanox ConnectX (MT26428)
  10 GbE:     Broadcom NetXtreme II (BMC57711)
  Disk:       SAS 300 GB hardware RAID-1 array
Switch
  Infiniband: Mellanox M3601Q
  10 GbE:     Dell M8024

Figure 5. A simplified version of the Ninja migration script:

import symvirt
from symvirt import config

### 1. fallback migration
ctl = symvirt.Controller(config.eth_hostlist)
# 1a. device detach
ctl.wait_all()
kwargs = {'tag' : 'vf0'}
ctl.device_detach(**kwargs)
ctl.signal()
# 1b. migration
ctl.wait_all()
ctl.migration(config.ib_hostlist, config.eth_hostlist)
ctl.quit()

### 2. recovery migration
ctl = symvirt.Controller(config.eth_hostlist)
# 2a. migration
ctl.wait_all()
ctl.migration(config.eth_hostlist, config.ib_hostlist)
ctl.quit()
# 2b. device attach
ctl = symvirt.Controller(config.ib_hostlist)
ctl.wait_all()
kwargs = {'host' : '04:00.0', 'tag' : 'vf0'}
ctl.device_attach(**kwargs)
ctl.signal()
ctl.close()

gration executes. The SymVirt controller and the SymVirt agents are implemented in Python. The SymVirt controller invokes a SymVirt agent thread for each QEMU process. A SymVirt agent controls virtual machines by using QEMU monitor commands, including migrate, device_add, and device_del. Each agent communicates with a QEMU process via the QEMU Monitor Protocol (QMP) or a telnet connection. Figure 5 shows a simplified version of the Ninja migration script. The wait_all method waits until all given VMs have issued the SymVirt wait call; the signal method resumes all VMs. The other methods, including device_detach, migration, and device_attach, correspond to QEMU monitor commands. We assume that the cloud scheduler provides information including the source and destination nodes of a migration and the PCI ID of the VMM-bypass I/O device; this is a reasonable assumption in cloud environments.
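The monitor commands behind the script map onto QMP requests. The sketch below only builds the JSON messages an agent might send; the device id "vf0", the tcp URI, and the vfio-pci driver name are placeholder assumptions (no QEMU process is contacted, and the paper does not specify the exact driver).

```python
import json

# Illustrative sketch: QMP messages corresponding to device_detach,
# migration, and device_attach. All argument values are placeholders;
# a real SymVirt agent sends such messages over the QMP socket.
def qmp(execute, **arguments):
    msg = {"execute": execute}
    if arguments:
        msg["arguments"] = arguments
    return json.dumps(msg)

detach  = qmp("device_del", id="vf0")                 # 1) detach
migrate = qmp("migrate", uri="tcp:eth-node0:4444")    # 2) migration
attach  = qmp("device_add", driver="vfio-pci",        # 3) re-attach
              host="04:00.0", id="vf0")

assert json.loads(detach) == {"execute": "device_del",
                              "arguments": {"id": "vf0"}}
assert json.loads(migrate)["arguments"]["uri"].startswith("tcp:")
assert json.loads(attach)["arguments"]["host"] == "04:00.0"
```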

B. The overhead of SymVirt
The overhead of SymVirt is divided into three parts: hotplug, link-up, and migration, as shown in Figure 4. The hotplug time is the sum of the detach, re-attach, and confirm times. The link-up time is the wait time until a link becomes active on a guest OS. Theoretically, both the hotplug and link-up times are constant; the migration time depends on the memory footprint. We used two 8-node clusters; one was connected with Infiniband and the other with Ethernet. VMs on the former communicate with each other via VMM-bypass Infiniband devices; VMs on the latter communicate via para-virtualized Ethernet virtio_net devices. First, we measured the elapsed time of hotplug and link-up by using a simple memory-intensive micro benchmark, called memtest. Second, we measured the

IV. EXPERIMENT
A. Experimental setting
We used a 16-node cluster, which is part of the AIST Green Cloud (AGC) cluster. The cluster consists of Dell PowerEdge M610 blade servers, each comprising 8 CPU cores, 48 GB of memory, a 300 GB SAS disk array, a QDR Infiniband HCA, and a 10 Gigabit Ethernet (GbE) NIC. The Dell M1000e blade enclosure holds 16 blade servers


Table II
ELAPSED TIME OF HOTPLUG AND LINK-UP [SECONDS].

  Infiniband → Infiniband:  hotplug 3.88, link-up 29.91
  Infiniband → Ethernet:    hotplug 2.80, link-up 0.00
  Ethernet → Infiniband:    hotplug 1.15, link-up 29.79
  Ethernet → Ethernet:      hotplug 0.13, link-up 0.00

[Figure 6: stacked bars (Y-axis: execution time in seconds) breaking the overhead into migration, hotplug, and link-up components for memory footprints of 2 GB, 4 GB, 8 GB, and 16 GB.]

total overhead included in the migration time by using both the memtest and the NAS Parallel Benchmarks (NPB). In this case, both the source and the destination clusters use Infiniband only. 1) Hotplug and Link-up Time: The memtest benchmark sequentially writes data to a 2 GB memory array. We used 8 VMs, and an MPI process ran on each VM. In this experiment, we performed self-migration, where a VM migrates to the same physical node, with four combinations of interconnect settings: Infiniband → Infiniband, Infiniband → Ethernet, Ethernet → Infiniband, and Ethernet → Ethernet. Table II shows the elapsed times of hotplug and link-up. Each value is measured three times and the best is taken. The hotplug time of Infiniband is longer than that of Ethernet. When the destination node has an Infiniband device, the link-up takes about 30 seconds, which is not a negligible overhead; in contrast, when the destination node has an Ethernet device, it is negligible. This issue is discussed in Section V. 2) Memtest Micro Benchmark: The memtest benchmark sequentially wrote data to a memory array whose size ranged from 2 GB to 16 GB. We used 8 VMs, and an MPI process ran on each VM. Figure 6 shows the total execution overhead of Ninja migration. Each value is measured three times and the best is taken. The variation of the overhead is within 2 seconds. Analyzing the breakdown of the overhead, the migration time depends on the memory footprint; both hotplug and link-up times are approximately constant. The migration time is not exactly proportional to the memory footprint. This is because a VMM traverses the whole of the guest OS's memory during a migration, and the QEMU/KVM migration mechanism compresses pages that contain uniform data, e.g., “zero pages,” to reduce the amount of transferred memory. The hotplug and link-up time is three times longer than that of the self-migration shown in Table II; we believe migration noise interferes with the execution of hotplug.
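The overhead decomposition above can be written as a small additive model. The hotplug and link-up constants below are the self-migration values from Table II; the function itself, and its footprint-dependent migration-time argument, are an illustrative sketch rather than the paper's code.

```python
# Rough overhead model from Section IV-B: total = hotplug + link-up +
# migration(footprint). Constants are the self-migration values measured
# in Table II [seconds]; the migration term is supplied by the caller.
HOTPLUG = {("IB", "IB"): 3.88, ("IB", "Eth"): 2.80,
           ("Eth", "IB"): 1.15, ("Eth", "Eth"): 0.13}
LINKUP  = {("IB", "IB"): 29.91, ("IB", "Eth"): 0.00,
           ("Eth", "IB"): 29.79, ("Eth", "Eth"): 0.00}

def estimated_overhead(src, dst, migration_time):
    """migration_time depends on the memory footprint (see Figure 6)."""
    return HOTPLUG[(src, dst)] + LINKUP[(src, dst)] + migration_time

# Link-up dominates whenever the destination has an Infiniband device.
assert estimated_overhead("IB", "IB", 0.0) > 30.0
assert estimated_overhead("Eth", "Eth", 0.0) < 1.0
```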
3) NAS Parallel Benchmarks: We also evaluated the proposed mechanism with a more practical application benchmark, the NPB version 3.3.1. The problem size is class D, and the total number of processes is 64. We used the following four benchmark programs from the NPB: BT (Block Tridiagonal), CG (Conjugate Gradient), FT (Fast Fourier Transform), and LU (LU Simulated CFD Application). The Ninja migration mechanism is issued once, three minutes

Figure 6. The overhead of Ninja migration on a memtest benchmark [seconds].


Figure 7. The overhead of Ninja migration on the NPB 3.3 (64 processes, class D) [seconds].

after each benchmark start time. Figure 7 breaks down the overhead caused by Ninja migration. Each value is measured three times and the best is taken. The “baseline” bars indicate execution without Ninja migration; the “proposed” bars indicate execution with one Ninja migration. First, Ninja migration has no performance overhead during normal operations. Second, the migration time is basically proportional to the memory footprint, where the memory footprints ranged from 2.3 GB to 16 GB; both hotplug and link-up times are constant.
C. Fallback and recovery migration
To demonstrate fallback and recovery migration using the Ninja migration mechanism, 4 VMs migrate and migrate back according to the following scenario: 4 hosts (Infiniband) → 2 hosts (TCP) → 4 hosts (Infiniband) → 4 hosts (TCP). “4 hosts” denotes that one VM runs on each host; “2 hosts” denotes that two VMs run on one host, which shows an example of server consolidation. “TCP” denotes that MPI processes communicate with TCP/IP via a virtio_net NIC. The only difference between migrating to the Infiniband cluster and to the Ethernet cluster is the transport protocol switching. After a migration, the Open MPI runtime initializes BTL modules, detects both Infiniband and Ethernet devices, and


During Ninja migration, an application is completely frozen. Although the impact depends on applications and the frequency of migrations, reducing overhead costs is another important open issue.

re-establishes Infiniband connections among MPI processes, as shown in Section III-C. We conducted the above experiment with two settings: 1 process per VM (4 processes in total) and 8 processes per VM (32 processes in total). The benchmark program used was a simple MPI program that repeatedly broadcasts and reduces 8 GB of data per node. Ninja migration was launched every 10 iteration steps. The elapsed time of each iteration should decrease as the performance of the interconnect increases, because MPI_Bcast and MPI_Reduce dominate the execution time. Figure 8 shows the results. The X-axis shows the iteration steps; the Y-axis shows the elapsed time of each iteration in seconds. The elapsed times of iteration steps 11, 21, and 31 include the migration time. The top part of each bar shows the overhead associated with Ninja migration. With these results, we have confirmed that MPI processes communicating with each other through Infiniband can migrate to Ethernet machines without restarting processes; this is impossible in existing virtualized environments. The total overhead remains the same as the number of processes per VM increases from 1 to 8. In addition, the execution times of 8 processes per VM are shorter than those of 1 process per VM, except for “2 hosts (TCP).” In the case of “2 hosts (TCP),” the low performance is caused by heavy CPU contention under the CPU over-commit setting.

VI. R ELATED W ORK VM migration is widely used for cloud and enterprise environments. However, to the best of our knowledge, there are only a few studies that address how to break the heterogeneity barriers, including those of CPU, I/O devices, and VMM architectures. Vagrant [8] supports a live migration across heterogeneous VMMs, e.g., Xen and KVM. In terms of I/O devices, Vagrant focuses on an emulation-based device model, and it does not consider the heterogeneity of I/O devices and the support of VMM-bypass I/O devices. Combining both their approach and our proposed mechanism can improve the flexibility of virtualized data centers. A. Kadav and M. Swift have proposed another lightweight software mechanism for migrating VMs with VMM-bypass I/O devices [22]. On the source node, the shadow driver continuously monitors the state of the driver and device. After migration, the shadow driver uses this information to configure a driver for the corresponding device on the destination. This technique can be applied to any class of devices. However, a device driver-level implementation cannot support migration between an Infiniband node and an Ethernet node. Another drawback is the overhead of logging. To improve the efficiency and the performance by leveraging the knowledge of a guest OS, several gray-box techniques have been proposed [5], [6]. These techniques are required for a communication mechanism between a VMM and the guest OS. SymCall [6] provides an upcall mechanism from a VMM to a guest OS, using a nested VM Exit call. In contrast, Socket outsourcing [23] and SymVirt provide a simple hypercall mechanism from a guest OS to the VMM. SymVirt does not require a complicated upcall mechanism, assuming it works in cooperation with a cloud controller. Socket outsourcing offloads a guest OS’s functionality, like TCP/IP communication, to the VMM. Some para-virtualized Infiniband drivers for Xen and VMWare ESXi have been proposed [24], [25], [26]. 
In contrast to these studies, the proposed mechanism relies on VMM-bypass I/O technologies and hotplugging mechanisms instead of implementing a para-virtualized driver for a specific VMM. Therefore, there is no performance overhead and no limitation on supported devices; Myrinet and other devices can be supported as well. Nomad, in particular, supports migration of virtual machines with an Infiniband device [25]. Nomad virtualizes location-dependent resources, including Local IDs (port addresses), Queue Pair Numbers, and memory keys for RDMA operations. The proposed system does not need such virtualization because it relies on Open MPI's checkpoint/restart framework to re-establish all connections after

V. DISCUSSION

This section discusses the results of our experiments and the remaining open issues. We also present some ideas for optimization to improve efficiency. Our evaluation lacks scalability tests, but the proposed mechanism is essentially scalable. The overhead consists of four parts: coordination, migration, hotplug, and link-up. Coordination has a negligible impact on the total overhead. The migration time may increase significantly as the number of hosts increases, due to network congestion and other factors; this is independent of the proposed mechanism and remains an open issue. The other two parts complete in constant time. The link-up time of Infiniband devices is about 30 seconds, which is not a negligible overhead. During that time, the port state remains “polling,” which indicates the port is not physically connected. We need to investigate what is happening. In contrast, Ethernet, with either a real NIC or a virtio-net NIC, does not have this problem. In our experiment, the network throughput of migration is less than 1.3 Gbps, and thus cannot fully utilize a 10 GbE network. This is because of a CPU bottleneck at the source node: the current QEMU migration implementation is based on TCP/IP and uses a single thread, and during migration the utilization of one CPU core is saturated at 100%. RDMA-based migration [20], [21] can reduce CPU utilization and improve throughput compared with TCP/IP-based migration.
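A back-of-envelope calculation shows how much the single-threaded TCP bottleneck inflates migration time. The VM memory size used below is a hypothetical example (the paper does not state it in this section); only the 1.3 Gbps observed throughput and the 10 GbE line rate come from the text:

```python
def transfer_seconds(memory_gib, throughput_gbps):
    """Seconds to move `memory_gib` GiB of VM memory at `throughput_gbps`
    Gbit/s, ignoring dirty-page retransmission during live migration."""
    bits = memory_gib * (2**30) * 8
    return bits / (throughput_gbps * 1e9)

mem_gib = 16  # assumed VM memory footprint, for illustration only
t_observed = transfer_seconds(mem_gib, 1.3)   # at the observed 1.3 Gbps
t_linerate = transfer_seconds(mem_gib, 10.0)  # if the 10 GbE link were saturated
```

With these assumptions, the CPU-bound transfer takes roughly 106 seconds versus about 14 seconds at line rate, a gap of nearly 8x, which is the motivation for the RDMA-based migration cited above.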


[Figure 8 appears here: two bar charts, (a) 1 process / VM (total 4 processes) and (b) 8 processes / VM (total 32 processes), plotting the execution time in seconds of each iteration, split into application time and Ninja migration overhead, against iteration steps 1–31, with segments annotated by host configuration: “4 hosts (IB)”, “2 hosts (TCP)”, and “4 hosts (TCP)”.]

Figure 8. The fallback migration and recovery migration. The X-axis shows the iteration steps; the Y-axis shows the elapsed time of each iteration in seconds. The elapsed times of iteration steps 11, 21, and 31 include the migration time.

In terms of scalability, the results lead to the positive conclusion that a migration of VMs across two racks is feasible. We plan to demonstrate Ninja migration on large-scale clusters under more realistic scenarios, including wide-area migration of VMs for disaster recovery and intelligent VM placement for power saving in a data center consisting of heterogeneous racks. We will also investigate the overhead associated with Ninja migration, especially the very long link-up time of Infiniband, in more detail. Based on the lessons learned from this study, we will design and implement a generic communication layer, independent of the MPI runtime system, to support guest OS cooperative migration based on a SymVirt mechanism. This will bring the benefit of interconnect-transparent migration to a wide range of applications.

a migration. This contributes to the simplicity and robustness of our implementation. Although VMM-bypass I/O technologies are effective in improving the I/O performance of a guest OS, they are still unable to achieve bare-metal levels due to the overhead of VM Exits, which increases communication latency. This is because a guest OS cannot selectively intercept physical interrupts. Exit-less interrupt (ELI) [27] addresses this issue; it is a software-only approach for handling interrupts within guest VMs directly and securely. We expect that next-generation hardware virtualization will support selective VM Exits for interrupts. As another approach to achieving the desired combination of performance and dependability, H. Chen et al. have proposed a self-virtualization technique [28], which provides an OS with the capability to turn virtualization on and off on demand. It enables migration and checkpoint/restart while avoiding virtualization overhead during normal operation. However, it lacks a coordination mechanism among distributed VMMs.

ACKNOWLEDGMENT

This work was partly supported by JSPS KAKENHI Grant Number 24700040.

VII. CONCLUSION

We have proposed an interconnect-transparent migration mechanism for heterogeneous data centers. We also presented our implementation of the proposed mechanism, called Ninja migration, which simultaneously migrates multiple co-located VMs in cooperation with the VMM and an MPI runtime system on the guest OS, using a SymVirt mechanism. The use of both SymVirt and the Open MPI CRS makes our implementation simple and robust. To demonstrate feasibility, we conducted experiments in which VMs running MPI benchmark programs migrate between an Infiniband cluster and an Ethernet cluster. The results confirm that 1) the proposed mechanism has no performance overhead during normal operations, and 2) MPI processes running on distributed VMs can migrate between an Infiniband cluster and an Ethernet cluster without restarting the application processes.

REFERENCES

[1] Amazon EC2. http://aws.amazon.com/ec2/.

[2] Google Compute Engine. https://developers.google.com/compute/.

[3] CycleCloud. http://cyclecomputing.com/cyclecloud/overview.

[4] Ryousei Takano, Tsutomu Ikegami, Takahiro Hirofuchi, and Yoshio Tanaka. Toward a practical “HPC Cloud”: Performance tuning of a virtualized InfiniBand cluster. In Proc. of the 6th International Conference on Ubiquitous Information Technologies and Applications (CUTE), December 2011.

[5] Timothy Wood, Prashant Shenoy, Arun Venkataramani, and Mazin Yousif. Black-box and Gray-box Strategies for Virtual Machine Migration. In Proc. of the 4th USENIX Symposium on Networked Systems Design and Implementation, 2007.


[6] Jack Lange and Peter Dinda. SymCall: Symbiotic Virtualization Through VMM-to-Guest Upcalls. In Proc. of the 2011 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), pages 193–204, March 2011.

[16] Avi Kivity. KVM: the Linux Virtual Machine Monitor. In Proc. of the Ottawa Linux Symposium, pages 225–230, July 2007.

[17] Open MPI. http://www.open-mpi.org/.

[7] Ryousei Takano, Hidemoto Nakada, Takahiro Hirofuchi, Yoshio Tanaka, and Tomohiro Kudoh. Cooperative VM Migration for a Virtualized HPC Cluster with VMM-Bypass I/O devices. In Proc. of the IEEE 8th International Conference on eScience, October 2012.

[18] Joshua Hursey, Jeffrey M. Squyres, Timothy I. Mattox, and Andrew Lumsdaine. The design and implementation of checkpoint/restart process fault tolerance for Open MPI. In Proc. of the 12th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems (DPDNS), March 2007.

[8] Pengcheng Liu, Ziye Yang, Xiang Song, Yixun Zhou, Haibo Chen, and Binyu Zang. Heterogeneous Live Migration of Virtual Machines. In Proc. of the International Workshop on Virtualization Technology (IWVT), 2008.

[19] Paul H. Hargrove and Jason C. Duell. Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters. In Proc. of SciDAC 2006, 2006.

[20] W. Huang, Q. Gao, J. Liu, and D. K. Panda. High Performance Virtual Machine Migration with RDMA over Modern Interconnects. In Proc. of the IEEE International Conference on Cluster Computing, 2007.

[9] Borja Sotomayor, Kate Keahey, and Ian Foster. Combining Batch Execution and Leasing Using Virtual Machines. In Proc. of the 17th International Symposium on High Performance Distributed Computing (HPDC), pages 87–96, June 2008.

[21] C. Isci, J. Liu, B. Abali, J. O. Kephart, and J. Kouloheris. Improving server utilization using fast virtual machine migration. IBM Journal of Research and Development, 55(6):4:1– 4:12, 2011.

[10] Ludmila Cherkasova, Diwaker Gupta, Eygene Ryabinkin, Roman Kurakin, Vladimir Dobretsov, and Amin Vahdat. Optimizing Grid Site Manager Performance with Virtual Machines. In Proc. of the 3rd USENIX Workshop on Real, Large Distributed Systems (WORLDS), 2006.

[22] Asim Kadav and Michael M. Swift. Live Migration of Direct-Access Devices. SIGOPS Operating Systems Review, pages 95–104, 2009.

[11] Hiroshi Takemiya, Yoshio Tanaka, Hidemoto Nakada, Satoshi Sekiguchi, Shuji Ogata, Rajiv K. Kalia, Aiichiro Nakano, and Priya Vashishta. Sustainable adaptive grid supercomputing: Multiscale simulation of semiconductor processing across the Pacific. In Proc of the ACM/IEEE Conference on Supercomputing (SC), 2006.

[23] Hideki Eiraku, Yasushi Shinjo, Calton Pu, Younggyun Koh, and Kazuhiko Kato. Fast Networking with Socket-Outsourcing in Hosted Virtual Machine Environments. In Proc. of the 24th ACM Symposium on Applied Computing, pages 310–317, 2009.

[12] PRAGMA: Pacific Rim Applications and Grid Middleware Assembly. http://www.pragma-grid.net.

[24] Jiuxing Liu, Wei Huang, Bulent Abali, and Dhabaleswar K. Panda. High Performance VMM-Bypass I/O in Virtual Machines. In Proc. of the USENIX Annual Technical Conference, pages 29–42, June 2006.

[13] Atsuko Takefusa, Hidemoto Nakada, Ryousei Takano, Tomohiro Kudoh, and Yoshio Tanaka. GridARS: A Grid Advanced Resource Management System Framework for Intercloud. In Proc. of the 1st International Workshop on Network Infrastructure Services as part of Cloud Computing (NetCloud), December 2011.

[25] W. Huang, J. Liu, M. Koop, B. Abali, and D. K. Panda. Nomad: Migrating OS-bypass Networks in Virtual Machines. In Proc. of the 2007 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), 2007.

[26] Bhavesh Davda and Josh Simons. RDMA on vSphere: Update and Future Directions. Open Fabrics Workshop, March 2012.

[14] Ryousei Takano, Hidemoto Nakada, Atsuko Takefusa, and Tomohiro Kudoh. A Distributed Application Execution System for an Infrastructure with Dynamically Configured Networks. In Proc. of the 2nd International Workshop on Network Infrastructure Services as part of Cloud Computing (NetCloud), December 2012.

[27] Abel Gordon, Nadav Amit, Nadav Har’El, Muli Ben-Yehuda, Alex Landau, Assaf Schuster, and Dan Tsafrir. ELI: Baremetal Performance for I/O Virtualization. In Proc. of the 17th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2012.

[15] Mauricio Tsugawa, Renato Figueiredo, Jose Fortes, Takahiro Hirofuchi, Hidemoto Nakada, and Ryousei Takano. On the Use of Virtualization Technologies to Support Uninterrupted IT Services - A Case Study with Lessons Learned from The Great East Japan Earthquake. In Proc of the International Workshop on Re-think ICT infrastructure designs and operations (RIDO), June 2012.

[28] Haibo Chen, Rong Chen, Fengzhe Zhang, Binyu Zang, and Pen-Chung Yew. Mercury: Combining Performance with Dependability Using Self-Virtualization. Journal of Computer Science and Technology, 27(1):92–104, 2012.
