Noname manuscript No. (will be inserted by the editor)

Vshadow: Promoting Physical Servers into Virtualization World

Song Wu · Yongchang Li · Xinhou Wang · Hai Jin · Hanhua Chen

Received: date / Accepted: date

Abstract Although virtualization technology has been widely adopted in recent years due to its advantages in improving server utilization and reducing management costs and energy consumption, many legacy applications are still deployed on traditional physical machines. How to efficiently promote these physical servers into virtual machines (VMs) when necessary has become an interesting and challenging problem. Existing Physical-to-Virtual (P2V) conversion methods suffer from long server downtime during the conversion process, which makes them impractical and inefficient in the real world. In this paper, we analyze why previous approaches result in intolerable server downtime and propose a new P2V conversion system called Vshadow. It enables a native P2V conversion that quickly switches the physical machine to a local VM, and it combines an implicit disk replication method with live migration to promote the physical machine into a remote virtualization platform. Our experimental results show that Vshadow reduces server downtime by more than 90% compared with existing P2V conversion methods. Besides demonstrating the effectiveness and efficiency of Vshadow, we also illustrate how Vshadow can be applied to server consolidation in cloud data centers and used to build a cost-effective failover system.

Keywords P2V, Synchronization, Live Migration, Consolidation, Failover

S. Wu (✉) · Yongchang Li · Xinhou Wang · Hai Jin · Hanhua Chen
Services Computing Technology and System Lab & Cluster and Grid Computing Lab
School of Computer Science and Technology, Huazhong University of Science and Technology
Wuhan, 430074, China
E-mail: [email protected]


1 Introduction

Virtualization technology has been widely adopted in recent years due to its advantages in improving server utilization and reducing management costs and energy consumption. With the help of virtualization and cloud computing, large numbers of new applications run on virtual machines (VMs). However, many legacy applications are still deployed on traditional physical machines. How to efficiently promote these physical servers into VMs when necessary has become an interesting and challenging problem. The Physical-to-Virtual (P2V) conversion process addresses this problem by consolidating servers into a virtualization platform.

Nowadays, there exist different methods to achieve P2V conversion. A straightforward method is to create a new VM, install an operating system, and copy application data from the physical machine. This method is not practical because it is application dependent and requires many manual operations. Another widely used method is to shut down the physical machine and convert its disk to a VM disk image, which is called off-line P2V conversion [1]. Off-line conversion suffers from long service downtime because the physical machine stays down during the whole conversion process. In contrast, on-line P2V conversion [2] converts the physical machine's disk without shutting it down. Since the physical machine keeps running and generating new data, on-line P2V conversion needs a synchronization process to generate a consistent disk image, which also brings long service downtime. In summary, existing P2V conversion methods suffer from long server downtime during the conversion process, which makes them impractical and inefficient in the real world.

Service downtime in on-line and off-line P2V conversion is mainly determined by the disk replication and synchronization processes. Since the off-line method shuts down the source physical machine and the disk data stays unchanged, there is no synchronization process in the off-line method; however, its service downtime spans the whole disk replication process. For the on-line method, disk replication starts while the source physical machine is powered on, and service downtime mainly comes from the explicit disk synchronization process. If we overlap the synchronization process with the replication process and remove the unnecessary downtime, the total service downtime can be reduced significantly. To minimize the service downtime of P2V conversion, we propose an implicit disk synchronization method which avoids the downtime caused by disk synchronization by overlapping synchronization with replication. To achieve implicit disk synchronization, we also design a native P2V method which converts the source physical machine to a local VM without time-consuming disk replication.

In this paper, we propose a conversion system named Vshadow, which combines a native P2V method, an implicit disk synchronization method, and live migration to minimize the service downtime during the conversion. Vshadow converts the physical machine to a local VM that shares the original underlying hardware, with the assistance of a portable disk.


After the synchronization process, the VM can be live-migrated to the virtualization platform. We also integrate Vshadow with high-availability mechanisms for virtualized environments, such as Remus [3], to provide a service failover system for physical machines.

We highlight the contributions of this paper as follows:

– We summarize and analyze why current P2V conversion approaches result in intolerable server downtime, and propose a new P2V conversion system called Vshadow.
– Vshadow provides a general P2V conversion method with minimal service downtime. Experiments show that Vshadow reduces service downtime by 91.2% and 97.2% compared with traditional on-line and off-line conversion methods, respectively.
– We apply Vshadow to several application scenarios, such as consolidating servers in data centers and creating a cost-effective failover system for physical machines, to demonstrate that our approach can be widely applied in the real world.

The rest of the paper is organized as follows. Section 2 presents the background of our system. We introduce the design and implementation of Vshadow in Sections 3 and 4. Section 5 describes our evaluation workloads, experimental setup, and experimental results. Section 6 discusses application scenarios of Vshadow. Related work is summarized in Section 7, and we conclude the paper in Section 8.

2 Background

In this section, we give a brief review of current P2V methods. These methods can be divided into two categories: on-line and off-line. The most significant difference between them is that off-line conversion happens when the physical machine is shut down, while on-line conversion keeps the physical machine running.

[Figure: timeline of the off-line conversion stages (boot into PXE, transfer disk content, convert and configure VM, start VM, restart service); T1 = 34 min, T2 = 34 min.]
Fig. 1 An off-line P2V case using Red Hat Physical-to-Virtual solution

Off-line P2V needs a preboot execution environment (PXE) medium. The conversion performs the following steps: 1) shut down the physical machine, 2) boot the physical machine into a PXE environment (restart the system and boot into the P2V environment using a DVD), 3) transfer the physical machine's disk data (transfer disk content), 4) convert the disk data to a VM disk image (convert and configure VM), and 5) start a VM on the virtualization platform (start the VM). A case of the off-line P2V method [1], the Red Hat Physical-to-Virtual solution (VirtP2V), is shown in Fig. 1. In this case, we convert a Linux physical machine with a 15GB disk under 100Mbps network bandwidth. In the figure, T1 is the upgrading time of converting the existing physical server to the virtualization platform and T2 is the service downtime during the conversion process. As shown in the figure, the off-line P2V method suffers from service downtime as long as the whole conversion process.

[Figure: timeline of the on-line conversion stages (install agent and configure the P2V process, transfer and convert the disk, synchronization, configure VM driver and start VM, restart service); T1 = 78 min, T2 = 14 min.]
Fig. 2 An on-line P2V case using the VMware vCenter Converter Standalone

On-line P2V usually takes snapshots of the physical machine's disk and transfers the snapshots to a remote virtualization platform. After generating the initial VM disk image, on-line P2V conversion begins an explicit synchronization process between the physical machine's disk and the VM disk image. Services must be stopped before the synchronization process to guarantee disk consistency. This method brings less service downtime than off-line P2V conversion but needs special support from the operating system and file system. Fig. 2 shows a case of the on-line P2V method, the VMware vCenter Converter Standalone [2]. In this case, we convert a Windows physical machine with a 75GB disk under 100Mbps network bandwidth. We copy a 4.2GB file to the disk during conversion in order to exercise the synchronization process. In the figure, the service downtime is small compared to the total conversion time. However, this P2V method suffers from several disadvantages. First, the Volume Shadow copy Service (VSS) is necessary for disk copying and converting. Second, the synchronization process can only be used on Windows operating systems that support VSS. Third, data consistency is not guaranteed directly; users have to stop the services themselves to make sure they do not suffer from data loss.

Current P2V methods avoid complex manual operations but still suffer from long service downtime. As shown in Figs. 1 and 2, the service downtime of on-line conversion is less than that of off-line conversion because no downtime occurs during the disk transfer and convert stages of on-line P2V conversion. The service downtime caused by environment switching (application control is transferred from the physical to the virtual machine) is unavoidable in both on-line and off-line methods due to the inherent features of virtualization. However, the total service downtime can be significantly reduced if we remove the unnecessary downtime in the other stages. Our approach avoids this unnecessary downtime and completes the P2V conversion with minimal service downtime.

[Figure: Vshadow architecture. The original physical machine (CPU, memory, NIC, disk, applications, OS) is switched into a converted VM running on a VMM booted from the portable disk, which also carries the replication sender and migration sender; on the remote virtualization platform, the replication receiver maintains the virtual disk and the migration receiver starts the shadow VM next to other shadow VMs (steps 1-6).]

Fig. 3 Vshadow Architecture

3 System Design

This section summarizes the design and analysis of Vshadow. We design an optimized P2V switching system which avoids the disadvantages of traditional P2V approaches. By converting the physical machine to a local VM, synchronizing the disk image implicitly, and using live migration, we reduce the service downtime of the conversion process. Our approach minimizes the impact on services running during the P2V conversion.

3.1 Design Overview

Our goal is to achieve P2V conversion with minimal service downtime. Fig. 3 shows the architecture of our system. We begin with a traditional physical machine with its hardware components and convert it into a virtualization platform. We use a portable disk which contains a virtualization environment to switch between the physical and virtual machines. The portable disk also contains the replication and migration components. To describe our system clearly, we define the following roles of the different machines.

1) Original physical machine (OPM): the physical machine to be converted, usually with services running.
2) Converted virtual machine (CVM): the virtual machine generated by our native P2V switch method. Services running in the OPM switch to work in the CVM. The CVM utilizes the underlying hardware of the OPM exclusively.
3) Shadow virtual machine (SVM): the virtual machine generated by the migration process. When the SVM starts in the virtualization platform, the whole P2V process is complete.

We summarize our P2V system as three stages; a sketch of the whole pipeline follows the stage descriptions.

1: Native P2V Switch: the process of converting the OPM to the CVM on the same physical machine with the help of a portable disk which contains the virtualization environment (Steps 1 and 2 in Fig. 3). The environment switching happens in this stage and incurs service downtime.
2: VM Image Replication: the process of transferring the disk image of the CVM (Step 3), implicitly synchronizing it to a remote virtualization platform (Step 4), and generating a consistent disk image on the remote virtualization platform (Step 5). The disk image encapsulates the operating system, applications, and data of the OPM; we start the SVM based on it. This stage causes no service downtime in our approach.
3: VM Migration: the process of migrating from the CVM to the SVM. We could simply stop the CVM and start the SVM from the disk image mentioned above, but this would add unnecessary service downtime. In our approach, after the disk image is synchronized, we leverage live migration (Step 6) to generate the SVM and complete the whole P2V process.
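To make the control flow of these three stages concrete, the following Python sketch outlines how an orchestration script for the pipeline could look. It is only an illustration of the flow described above, not Vshadow's implementation; all helper functions, device paths, and host names are hypothetical.

```python
# Hypothetical orchestration of the three Vshadow stages (illustrative sketch only).
# Every helper below is a placeholder for the operations described in Section 3.1;
# device paths and host names are assumptions, not Vshadow's published code.

def native_p2v_switch(opm_disk: str, portable_disk: str) -> str:
    """Stage 1: reboot from the portable disk and start the CVM on the OPM's disk.
    Service downtime occurs only in this stage (environment switching)."""
    print(f"rebooting into the hypervisor on {portable_disk}; starting CVM on {opm_disk}")
    return "cvm"

def replicate_and_sync(cvm: str, remote_host: str) -> str:
    """Stage 2: replicate the CVM's disk to an encapsulated image on the remote
    platform and keep it implicitly synchronized (no service downtime)."""
    print(f"replicating the disk of {cvm} to {remote_host}")
    return f"{remote_host}:/images/cvm.img"

def live_migrate(cvm: str, remote_host: str) -> str:
    """Stage 3: live-migrate the CVM; it becomes the SVM on the remote platform."""
    print(f"live-migrating {cvm} to {remote_host}")
    return "svm"

if __name__ == "__main__":
    cvm = native_p2v_switch("/dev/sda", "/dev/sdb")
    image = replicate_and_sync(cvm, "remote-xen-host")
    svm = live_migrate(cvm, "remote-xen-host")
    print(f"P2V conversion complete: {svm} backed by {image}")
```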

3.2 System Components

So far, we have presented the overall design of our approach. In this section, we introduce the details of our system components.

3.2.1 Native P2V Switch

Traditional P2V methods convert the physical machine directly to a virtual machine on a remote virtualization platform. This leads to a time-consuming process of state conversion and transmission before the VM is generated, especially for the disk state, because the physical machine to be converted and the VM generated after the conversion are on different hosts. The most significant feature of our native P2V switch is that we achieve the P2V conversion on the same physical machine without time-consuming disk transfer.

Our native P2V switch utilizes the underlying physical hardware of the OPM. We notice that disk storage is persistent while the other states, such as memory, network, and CPU, are volatile. Once we have the entire disk state of a physical machine, we can start the computer system with the help of compatible hardware and device drivers. As shown in Fig. 3, we begin our native P2V switch by providing a portable disk, which contains a virtualization hypervisor, as a bootable device. This device contains the necessary drivers to support the original physical machine's underlying hardware, together with the replication and migration engines. With the help of the original physical machine's hardware (except the disk) and the portable disk, we reboot the OPM into a virtualization environment. This environment contains all the virtual hardware drivers for guest VMs and provides disk and network access to them. We start our CVM based on the disk of the OPM and the virtual device drivers provided by the virtualization environment. Since the CVM contains the same operating system and applications as the original physical machine, the services resume work in the CVM. The native P2V switch thus completes with no disk conversion or transmission.

Our native P2V switch has several advantages. First, our method is general and supports different types of operating systems. Second, compared with traditional methods, service downtime only exists in the environment switching process. Third, there is no resource competition between guest VMs, because the virtualization manager consumes few resources of the OPM and all the other resources can be assigned to the CVM.

3.2.2 VM Image Replication

After the native P2V switch, services are running in the CVM. Compared with traditional P2V methods, we obtain a local VM without shared storage. Our next step is to generate an encapsulated image file on a remote virtualization platform. Compared with using the whole disk as a VM image, an encapsulated image file is easy to transfer and can be used in different virtualization environments such as Xen [4], KVM [5], and VMware [6]. If the remote virtualization platform is based on shared storage, the SVM can be migrated between different hosts, which improves overall system utilization. During the VM image replication stage, we generate an empty encapsulated image file at the remote virtualization platform (e.g., Xen). Then we replicate and synchronize the disk image from the CVM to the encapsulated image file. During the disk replication and synchronization process, write operations to the CVM's disk are sent and synchronized to the remote encapsulated image file.

[Figure: loop devices expose three types of underlying storage (a whole disk, disk partitions, and an encapsulated image file) as uniform block devices for replication and synchronization.]

Fig. 4 Three types of virtual machine image, replicating and synchronizing by using loop devices

It is worth noting that the image of the CVM is based on a physical disk, which cannot be directly transferred to an encapsulated image file. Hence, we create loop devices [7] in Vshadow to enable replication between the physical disk and the encapsulated image file. As shown in Fig. 4, loop devices manage three types of underlying storage: a whole disk, disk partitions, and an encapsulated image file. Snapshot-based synchronization methods [2] guarantee disk consistency only at particular points, and services must be stopped to synchronize the disk contents changed since the nearest snapshot checkpoint. In our approach, the encapsulated disk image stays synchronized with the CVM's image, so there is no service downtime in the VM image replication stage. During disk replication and synchronization, services need not be stopped, yet the disk images remain consistent. Although the disk replication and synchronization process is time-consuming, depending on the image size and network bandwidth, it causes no service downtime. After the native P2V switch and disk replication stages, the CVM can be migrated to the remote virtualization platform.
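As an illustration of how the three storage types in Fig. 4 can be exposed uniformly, the short Python sketch below creates a sparse encapsulated image file and attaches it to a loop device with losetup; an analogous call can wrap a whole disk or a partition. The paths and size are assumptions for the example, not Vshadow's actual configuration.

```python
# Sketch: expose an encapsulated image file as a block device via a loop device,
# so it can be replicated and synchronized like a whole disk or a partition.
# Paths and sizes are illustrative assumptions, not Vshadow's actual setup.
import subprocess

def create_sparse_image(path: str, size_gb: int) -> None:
    """Create an empty (sparse) encapsulated image file of the given size."""
    with open(path, "wb") as f:
        f.truncate(size_gb * 1024 ** 3)

def attach_loop(backing: str) -> str:
    """Attach a file (or block device) to the first free loop device and
    return the loop device path, e.g. /dev/loop0."""
    out = subprocess.run(["losetup", "--find", "--show", backing],
                         check=True, capture_output=True, text=True)
    return out.stdout.strip()

if __name__ == "__main__":
    create_sparse_image("/var/lib/xen/images/cvm.img", 40)   # encapsulated image file
    img_dev = attach_loop("/var/lib/xen/images/cvm.img")
    # An analogous call can expose a whole disk ("/dev/sda") or a partition ("/dev/sda1").
    print("encapsulated image is now available as", img_dev)
```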


3.2.3 VM Migration

There are many different VM migration methods in the virtualization literature (e.g., cold migration [8], live migration [9]). The most direct way is a cold method: stop the CVM and then start the SVM on the remote virtualization platform, which introduces unnecessary service downtime. Another way is to suspend the CVM and save the memory state to a checkpoint file [10]; this file is transferred to the remote host and used to restore the SVM. This method also causes service downtime while the memory checkpoint is transferred and the VM is restored. Instead of these cold migration methods, we use live migration to migrate from the CVM to the SVM with negligible service downtime (as low as 60 ms, as reported in [9]). The migration time depends on the total memory size, the memory dirty rate, and the available network bandwidth, but the resulting service downtime is negligible compared with the downtime caused by environment switching in the native P2V process.

In summary, we have designed a P2V conversion method with three stages: a native P2V switch, VM image replication, and VM migration. Our method results in minimal service downtime and guarantees disk consistency at the same time.

3.3 System Analysis

We quantify the service downtime of the different conversion methods in order to explain clearly why our system outperforms traditional conversion methods. We focus on the time period from the point of preparing the conversion to the point when the conversion is completed, denoted as $T_{Upgrade}$, and denote the service downtime as $T_{Down}$. For the convenience of our analysis, we first define the following variables, which are illustrated in Fig. 5 showing the time sequences of the different conversion methods.

– $T_P$: preparing time, the time period before starting the conversion process. This stage contains operations like software installation, workload analysis, and portable disk preparation.
– $T_{CS}$: context switching time, the period for switching from the physical machine to the conversion environment.
– $T_I$: initializing time, the time period needed to generate the VM disk image. This is the most time-consuming part of the conversion.
– $T_S$: synchronizing time, the period needed to generate a consistent encapsulated disk image in on-line P2V conversion.
– $T_C$: configuration time, the period for preparing the VM configuration file and virtual drivers to start the VM. This happens late during on-line and off-line conversion, but in Vshadow it happens in the environment switch stage to start the CVM.
– $T_{VS}$: VM and service start time, the period of VM booting and service restarting.
– $\rho$: the ratio of $T_{Down}$ to $T_{Upgrade}$. Although $T_{Down}$ is the most important indicator for the majority of services, $\rho$ shows the percentage of downtime during the whole conversion.

The preparing time, $T_P$, differs among conversion methods and usually brings no service downtime. $T_{VS}$ is necessary to start the VM and restart the service. The configuration time, $T_C$, can be regarded as an operation performed in parallel with $T_I$.

[Figure: time sequences built from the segments $T_P$, $T_{CS}$, $T_I$, $T_S$, $T_C$, $T_{VS}$ and the resulting $T_{Down}$ and $T_{Upgrade}$ for (a) on-line conversion, (b) off-line conversion, and (c) Vshadow conversion.]

Fig. 5 Time sequences of the three different conversion processes, not to scale

Fig. 5(a) shows the time sequence of on-line P2V conversion, from which it is clear that:

$$
\begin{cases}
T_{Upgrade} = T_P + T_I + T_S + T_C + T_{VS} \\
T_{Down} = T_S + T_C + T_{VS} \\
\rho = \dfrac{T_{Down}}{T_{Upgrade}} = \dfrac{T_S + T_C + T_{VS}}{T_P + T_I + T_S + T_C + T_{VS}}
\end{cases}
\tag{1}
$$

From Equation 1, we can see that in on-line P2V conversion the service downtime mainly depends on $T_S$, because $T_C$ and $T_{VS}$ can be regarded as constants. $T_I$ and $T_S$ make up the bulk of the whole upgrading time.

Fig. 5(b) shows the time sequence of off-line P2V conversion, and we have:

$$
\begin{cases}
T_{Upgrade} = T_P + T_{CS} + T_I + T_C + T_{VS} \\
T_{Down} = T_{CS} + T_I + T_C + T_{VS} \\
\rho = \dfrac{T_{Down}}{T_{Upgrade}} = \dfrac{T_{CS} + T_I + T_C + T_{VS}}{T_P + T_{CS} + T_I + T_C + T_{VS}} \approx 1
\end{cases}
\tag{2}
$$

As shown in Equation 2, the service downtime in off-line P2V conversion is nearly the same as the total upgrading time when the preparation stage is ignored. There is no need to synchronize the converted VM disk image because the physical machine has been shut down, and thus there is no $T_S$ period.

The time sequence of Vshadow is shown in Fig. 5(c), from which we obtain:

$$
\begin{cases}
T_{Upgrade} = T_P + T_{CS} + T_C + T_{VS} + T_I \\
T_{Down} = T_{CS} + T_C + T_{VS} \\
\rho = \dfrac{T_{Down}}{T_{Upgrade}} = \dfrac{T_{CS} + T_C + T_{VS}}{T_P + T_{CS} + T_C + T_{VS} + T_I}
\end{cases}
\tag{3}
$$

For Vshadow, Equation 3 shows that the service downtime depends on the context switching period, the CVM configuration, and the service restarting operations. All three of these steps can be completed in a relatively fixed time, so $T_{Down}$ can be regarded as a constant, while the total upgrading time mainly depends on the disk replication process, which introduces no service downtime. In summary, Vshadow avoids unnecessary service downtime in the disk replication and synchronization processes and completes the P2V conversion with minimal service downtime.
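To make the comparison explicit, subtracting the downtime expression of Equation 3 from those of Equations 1 and 2 shows exactly which terms Vshadow removes:

```latex
% Downtime saved by Vshadow, derived from Equations (1)-(3):
\begin{align*}
T_{Down}^{\text{on-line}}  - T_{Down}^{\text{Vshadow}} &= (T_S + T_C + T_{VS}) - (T_{CS} + T_C + T_{VS}) = T_S - T_{CS},\\
T_{Down}^{\text{off-line}} - T_{Down}^{\text{Vshadow}} &= (T_{CS} + T_I + T_C + T_{VS}) - (T_{CS} + T_C + T_{VS}) = T_I.
\end{align*}
```

Relative to on-line conversion, Vshadow trades the explicit synchronization time $T_S$ for the much shorter context switch $T_{CS}$; relative to off-line conversion, it removes the entire initialization time $T_I$ from the downtime.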

4 Implementation

In this section, we discuss the implementation of our system components, including the native P2V switch, VM image replication, and VM migration.

4.1 Native P2V Switch

Our implementation is based on the Xen virtualization hypervisor. The portable disk used in the native P2V switch contains the Xen hypervisor and the replication engine. We take advantage of Xen's abstraction of virtual block devices to switch the OPM to a CVM which uses the underlying hardware resources exclusively. We reboot the physical machine into the Xen virtualization environment with the help of the portable disk. A management VM (Domain 0) starts and contains the necessary drivers to support other VMs (guest VMs). Then we start the CVM with the OPM's physical disk and virtual drivers. To reduce the environment switching time, we also create the CVM automatically after the Xen hypervisor starts. The CVM can be built either on the whole physical disk or on the disk partitions which contain the root file system. In the former case, the CVM can support both Linux and Windows systems in full virtualization mode; in the latter case, the CVM can only support Linux systems in para-virtualization mode.
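As an illustration of the whole-disk (full virtualization) case, a Xen guest definition for the CVM could look roughly like the sketch below. xm-style configuration files are evaluated as Python, so the block uses that syntax; the device name, memory size, and bridge name are assumptions, not the configuration shipped with Vshadow.

```python
# Hypothetical xm-style guest configuration for the CVM (illustrative sketch).
# /dev/sda, the memory size, and the bridge name are assumptions for this example.

name    = "cvm"
builder = "hvm"                    # full virtualization: boots the unmodified
                                   # Windows/Linux installation on the OPM's disk
memory  = 2048
vcpus   = 2
disk    = ["phy:/dev/sda,hda,w"]   # pass the OPM's whole physical disk through
vif     = ["bridge=xenbr0"]        # keep the CVM on the same network as the OPM
boot    = "c"                      # boot from the (virtual) hard disk
on_reboot = "restart"
on_crash  = "restart"
```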

4.2 VM Image Replication

Vshadow utilizes the DRBD [11] disk replication system to achieve synchronization between block devices. We apply several modifications and configurations to DRBD in Vshadow. First, in order to start a VM based on the replicated image, Vshadow uses DRBD's dual-primary mode during this stage. Second, Vshadow utilizes both the asynchronous and synchronous replication modes to reduce the impact on performance. Third, we use external storage to save the DRBD meta-data so that no modifications of the original physical machine's disk are needed. Once the remote image is synchronized, any disk update of the local image is propagated to the remote side. We create the loop devices and combine them with DRBD to provide replication and synchronization between the disk images of the CVM and the SVM. The synchronization traffic can be directed to a secondary network, which reduces the impact on the service running in the CVM.
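A minimal sketch of how such a DRBD resource might be driven from a script is shown below. It assumes a resource named cvm_disk has already been declared in the DRBD configuration with external meta-data, allow-two-primaries enabled, and the chosen replication protocol; the resource name and the use of drbdadm from Python are illustrative assumptions, not Vshadow's actual replication engine.

```python
# Sketch: bring up a DRBD resource that mirrors the CVM's loop device to the
# remote encapsulated image. Assumes a resource "cvm_disk" is already declared
# in the DRBD configuration (external meta-data, allow-two-primaries, chosen
# replication protocol); this is an illustration, not Vshadow's code.
import subprocess

def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def start_replication(resource: str = "cvm_disk") -> None:
    run("drbdadm", "create-md", resource)           # write meta-data (kept on external storage)
    run("drbdadm", "up", resource)                  # attach the backing device and connect peers
    run("drbdadm", "primary", "--force", resource)  # promote locally; dual-primary lets the
                                                    # CVM keep writing while the remote syncs

if __name__ == "__main__":
    start_replication()
```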


4.3 VM Migration

We configure the Xen hypervisor to enable live migration between different hosts and choose xm as the Xen toolstack to perform the live migration. During the migration, the encapsulated disk image at the remote virtualization platform is switched to the primary role in DRBD. After several iterative pre-copy rounds of memory, the CVM is suspended and the live migration process synchronizes the remaining VM state to the SVM. After the live migration, the SVM starts on the virtualization platform and the services switch to run in the SVM.
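The final hand-off can be expressed in a few commands; the sketch below promotes the remote DRBD replica and then triggers Xen's live migration with xm. The host and resource names are assumptions, and the error handling a real deployment would need is omitted.

```python
# Sketch: promote the remote replica and live-migrate the CVM to the remote host,
# turning it into the SVM. "remote-xen-host" and "cvm_disk" are illustrative names.
import subprocess

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

def migrate_cvm(domain: str = "cvm",
                remote_host: str = "remote-xen-host",
                resource: str = "cvm_disk") -> None:
    # Dual-primary: the remote side also becomes primary so the migrated guest
    # can attach its (already synchronized) encapsulated image immediately.
    run("ssh", remote_host, "drbdadm", "primary", resource)
    # Iterative pre-copy live migration; only the final stop-and-copy phase
    # contributes to service downtime.
    run("xm", "migrate", "--live", domain, remote_host)

if __name__ == "__main__":
    migrate_cvm()
```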

5 Evaluation

In this section we present an evaluation of our implementation with a number of workloads. Our results demonstrate that Vshadow does not impact services significantly when switching a physical machine to the virtualization environment. The evaluations show that our approach reduces service downtime by 91.2% and 97.2% compared with traditional on-line and off-line conversion methods, respectively. Furthermore, file system consistency is guaranteed during and after the conversion. Among the production conversion methods, we choose VMware vCenter Converter Standalone on Windows as the representative of on-line conversion, and VirtP2V on Linux for off-line conversion. We use the metrics $T_{Down}$ (service downtime) and $T_{Upgrade}$ (total conversion time) to show the impact on services. As discussed in Section 3, $T_{Down}$ is the most important metric because it represents the time when services are unavailable.

We begin by describing our experimental setup and then explore system conversions with different workloads in detail. Finally, we test the overhead of our approach on I/O performance. To accurately determine the impact on I/O performance, we need long-running services; we use static and dynamic web applications as our benchmarks to represent them. In the following experiments, we first present the bare system conversion without services to show the baseline of Vshadow and the comparison systems. Then we conduct experiments with realistic workloads on both static and dynamic web servers.

5.1 Experimental Setup

Unless otherwise stated, our evaluations are conducted on three hosts connected by a 100Mbps switched Ethernet network. The first host (called Host A) is the OPM which runs the services. The second host (called Host B) is the virtualization platform that supports the SVM. The third host (called Host C) acts as the workload generator for the services. Both Host A and Host B have a dual-core 2.1GHz Intel CPU and 4GB of memory. Service downtime is the most important indicator for the majority of services, and $T_{Down}$ in Vshadow is not affected by the time of VM image replication. In our experiments, we use a 40GB SATA disk with Windows 7 and a 40GB portable disk with Linux on Host A to support the services.¹ A 16GB USB disk with Xen virtualization and a Linux system installed is used in the native P2V conversion. The Xen 4.1.2 release and Fedora 18 with kernel version 3.4.94 are installed on the USB disk and on Host B. Apache and MySQL are installed on Host A to support the static and dynamic web services.

Table 1 I/O performance of different disks, tested by Iometer with 16KB transfer request size

Disk type                  Read throughput (MBps)  Write throughput (MBps)
40GB Windows SATA disk     46.66                   46.54
40GB Linux portable disk   37.02                   36.12
16GB USB disk              36.52                   7.10
External disk              27.69                   25.33

Since we use different types of disks, we test their I/O performance with Iometer before we begin the experiments. The disks include the SATA disk with Windows installed on Host A to test Windows system conversion, the portable disk with Linux installed on Host A to test Linux system conversion, the external disk to store the generated shadow VM image, and the portable USB disk for the native P2V conversion. As shown in Table 1, most of the throughputs are larger than the maximal network bandwidth of 100Mbps (12.5 MBps). Although the write performance of the 16GB USB disk is slower than the network, it has little or no impact on performance, because most of the writes occur on the external disk on Host B. Therefore, we conclude that disk performance is not the bottleneck in the following experiments.

5.2 Bare System Conversion

We begin our evaluation by testing the different conversion methods on a bare physical machine without a service running. Table 2 shows the summary of results for the different conversion methods without a network rate limit, which means that the conversion process can use all the available network bandwidth. The preparation stage, $T_P$, is omitted from the table because $T_P$ has no impact on service downtime, so there is no need to measure it. $T_{CS}$, $T_C$, and $T_{VS}$ are measured several times and we choose the average as the typical value. The other values in the table are measured for the different conversion methods with the same experimental setup. Since no services are running in this scenario, we do not include the live migration process in the table. As discussed in Section 3, $T_{CS}$ for the VMware on-line method and $T_S$ for the off-line method and Vshadow do not exist, so these values are not shown in Table 2.

¹ Larger disk size only increases total upgrading time.


Table 2 Summary of results for different conversion methods on Windows and Linux systems

OS       Conversion method     TCS (s)  TI (s)  TS (s)  TC (s)  TVS (s)  TDown (s)  TUpgrade (s)  ρ (%)
Windows  VMware (on-line)      -        4579    802     111     27       940        5519          17.0
Windows  Vshadow               41       3852    -       0       42       83         3935          2.1
Linux    VirtP2V (off-line)    179      4723    -       110     31       5043       5043          100
Linux    Vshadow               113      3618    -       0       30       143        3761          3.8

[Figure: service downtime and upgrading time (log-scale time in seconds) of VMware and Vshadow at 65-95% disk space utilization.]
Fig. 6 VMware Windows conversion of different disk space usage without network rate limit

[Figure: service downtime and upgrading time (log-scale time in seconds) of VMware and Vshadow under 50-100 Mbps network bandwidth limits.]
Fig. 7 Conversions of different network rate limit, Windows system with 85% disk space usage

[Figure: service downtime and upgrading time (log-scale time in seconds) of VMware and Vshadow with 0-4 GB of dirtied disk data.]
Fig. 8 Conversions of different disk changing size, Windows system with no network rate limit

Also, the VM configuration process can proceed in parallel with the disk initialization process, so the value of $T_C$ in Vshadow is regarded as 0. There are some interesting findings in the table. First, $T_{CS}$ lasts about 3 minutes in the off-line method (Linux VirtP2V) due to the manual configuration of network and virtual machine parameters. Second, $T_{CS}$ of the Vshadow method on Linux is about 1 minute larger than that on Windows; the reason is that we use the portable disk for Linux and the SATA disk for Windows, and the difference in disk performance accounts for the gap. Finally, we were surprised that VMware takes more than 800 seconds to synchronize the disk image even when Host A does not provide any service to clients. We infer from the VMware log files that temporary files generated by VMware and the VSS (Volume Shadow copy Service) of Windows may be the reasons.

Off-line P2V conversion shuts down the physical machine during the whole conversion process, so the service downtime of off-line P2V conversion is the same as the total upgrading time. As a result, the service downtime of off-line P2V conversion mainly depends on the network and disk bandwidths. The values in Table 2 are taken as the typical values for off-line P2V conversion, and we mainly focus on on-line P2V conversion and Vshadow in the following experiments.

During the P2V conversion, different disk space utilizations, network bandwidths, and disk dirty sizes lead to different results, so we conduct three groups of experiments and plot the results in Figs. 6 to 8. As shown in Fig. 6, since Vshadow replicates and synchronizes the whole disk of the physical machine regardless of how much of it is used, $T_{Down}$ and $T_{Upgrade}$ in Vshadow are not affected by the disk space utilization, whereas $T_{Down}$ and $T_{Upgrade}$ in the VMware conversion tend to increase with disk space usage.


[Figure: HTTP request and disk transfer throughput (Mbps) over time (minutes); annotations mark snapshot creation, the synchronization period $T_S$, HTTP throughput peaks, and a service restart after $T_{Down}$ = 33.2 minutes.]
Fig. 9 Throughputs of VMware Windows system conversion (on-line P2V) under static web workload

[Figure: HTTP request and disk transfer throughput (Mbps) over time (minutes); annotations mark the start and end of synchronization and of the Xen live migration, and a service restart after $T_{Down}$ = 4.1 minutes.]
Fig. 10 Throughputs of Vshadow Windows system conversion under static web workload, including VM live migration

In Fig. 7, low network bandwidth leads to a longer upgrading time in VMware, as we expected. In Vshadow, the upgrading time shows a similar trend to the VMware results, while the service downtime stays at a relatively fixed value under different network bandwidths, which also verifies the correctness of our analysis in Section 3. During the conversion of the bare system, we dirty the original disk content by different amounts and plot the results in Fig. 8. The total upgrading time increases with the disk dirty size. An anomalous result appears in Fig. 8: the service downtime increases sharply when the disk dirty size grows from 3GB to 4GB. The reason is that VMware creates volume snapshots at a late point of the disk initialization stage; the 4GB of dirtied disk content is written after the snapshots are created, while the smaller amounts of dirtied content are written before that point.

So far, we have compared the performance of our proposed approach, Vshadow, on a bare system with the performance of the traditional on-line and off-line methods. Specifically, the results presented in Table 2 show that our approach reduces service downtime by 91.2% and 97.2% compared with the traditional on-line and off-line conversion methods, respectively (see the short calculation below). In the following sections, we evaluate the impact on the services running in the physical machine during the P2V conversion. As discussed above, the subsequent analysis focuses only on Vshadow and the on-line method.
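These reduction figures follow directly from the $T_{Down}$ column of Table 2; the arithmetic below is simply a consistency check of the reported numbers.

```latex
% Downtime reductions implied by the T_Down values in Table 2:
\begin{align*}
\text{on-line (Windows):}  &\quad 1 - 83/940   \approx 0.912 = 91.2\%,\\
\text{off-line (Linux):}   &\quad 1 - 143/5043 \approx 0.972 = 97.2\%,\\
\text{Vshadow (Windows):}  &\quad \rho = 83/3935  \approx 2.1\%,\\
\text{Vshadow (Linux):}    &\quad \rho = 143/3761 \approx 3.8\%.
\end{align*}
```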

5.3 Static Workload Results

In this section, we test the P2V conversion by running a static web server on the physical machine, using both the traditional on-line P2V method and Vshadow on a Windows system. We record the service downtime and HTTP throughput to see the impact on the service during the conversion.


The service running on Host A is migrated to Host B after the conversion. Client requests are generated on Host C using SIEGE², and 25 clients are simulated to repeatedly retrieve a 512KB file from the web server, which consumes all of the 100Mbps network bandwidth. The VMs generated by VMware and Vshadow have 2GB of memory. Since the HTTP requests and the disk transfer share the 100Mbps network, the bandwidth used by the disk transfer process is limited during the two experiments.

We record the throughputs of the HTTP requests and the disk transfer for the two methods, shown in Figs. 9 and 10. We observe from the figures that the service downtime ($T_{Down}$) of the VMware and Vshadow conversions is 33.2 and 4.1 minutes, respectively. For Vshadow, the downtime is slightly larger than the value in Table 2, because the environment switching stage takes longer to complete when the server is running a web service. Furthermore, the service downtime caused by live migration is nearly negligible compared with the downtime of the environment switch [9]. During the VMware conversion, VMware configures the disk image and installs the necessary components after the synchronization stage, which makes the service downtime longer. Some interesting phenomena appear in the figures. As shown in Fig. 9, VMware takes snapshots at a late point of the disk transfer stage and the disk transfer throughput drops sharply, which causes an HTTP throughput peak. The live migration in Fig. 10 lasts about 4 minutes, and the web service recovers to its best performance after that. The bandwidth limitation mechanisms are different in the two conversion methods, which leads to a higher transfer speed and a smaller $T_{Upgrade}$ for Vshadow.

This experiment demonstrates that Vshadow is able to convert an ordinary running server to the virtualization environment. As the analysis in Section 3 shows, the service downtime of Vshadow mainly depends on the environment switching stage and can be regarded as a constant. Vshadow removes the unnecessary downtime and outperforms VMware by nearly one order of magnitude in service downtime.

5.4 Dynamic Workload Results

In this test, Host A is installed with phpBB³ (a free and popular flat-forum bulletin board solution) and acts as a dynamic web server. We create a script on Host C to simulate 100 users with read and post operations. Each user reads all the topics of the bulletin board and has a probability of 1% of replying to each topic. A reply creates a post which is inserted into the MySQL database on Host A. The VMs generated by VMware and Vshadow have 2GB of memory. We record the throughputs of the HTTP requests and the disk transfer for each method and plot the results in Figs. 11 and 12.

As shown in both figures, the HTTP request throughput is not significantly affected during the conversion because the dynamic web service workload is not bandwidth limited.

² http://www.joedog.org/siege-home/
³ https://www.phpbb.com/


[Figure: HTTP request and disk transfer throughput (Mbps) over time (minutes); annotations mark snapshot creation, the synchronization period $T_S$, and a service restart after $T_{Down}$ = 31.13 minutes.]
Fig. 11 Throughputs of VMware Windows system conversion (on-line P2V) under dynamic web workload

[Figure: HTTP request and disk transfer throughput (Mbps) over time (minutes); annotations mark the start and end of synchronization and of the Xen live migration, and a service restart after $T_{Down}$ = 2.03 minutes.]
Fig. 12 Throughputs of Vshadow Windows system conversion under dynamic web workload, including VM live migration

For the Vshadow conversion in Fig. 12, the disk synchronization throughput varies due to the dynamic workload. The service downtime ($T_{Down}$) of the VMware and Vshadow conversions is 31.13 and 2.03 minutes, respectively. We observe a forced chkdsk process when the virtual machine starts during the VMware conversion, which makes the service downtime in Fig. 11 longer. VMware does not guarantee disk consistency directly and lets users choose which services to stop themselves. During the VMware conversion, we choose to automatically stop the httpd and mysqld services before the disk synchronization process; this may be the reason for the inconsistency observed during the VMware conversion. In contrast, our system keeps the disk consistent with minimal service downtime.

5.5 I/O Performance Overhead Results

In this section, we test the overhead of Vshadow during the conversion. We first conduct an experiment to measure the disk I/O overhead caused by the virtualization environment, and then measure the bandwidth required to synchronize the disk image between the CVM and the SVM. We compare the disk I/O performance of the OPM and the CVM with Iometer to quantify the disk performance overhead caused by the virtualization environment. The I/O request size is 4KB and the experiment runs under different random access proportions. The disk I/O throughput of the CVM, normalized to the throughput of the OPM, is shown in Fig. 13. The I/O performance overhead introduced by Xen virtualization is less than 3%, which is negligible for the whole system.

Since all the experiments above run over a single switched network, we cannot directly measure the network bandwidth overhead caused by the synchronization process.

[Figure: read and write throughput of the CVM normalized to the OPM (%), plotted against the random access ratio (0-90%).]
Fig. 13 Vshadow VM disk I/O performance compared with the original physical machine

[Figure: network bandwidth (Mbps) required for synchronization, plotted against the disk dirty rate (16-2048 KB/s).]
Fig. 14 Network bandwidth used for disk synchronization with different disk dirty rates

[Figure: HTTP throughput (Kbps) over time (seconds) with Remus at 25ms and 100ms checkpoint intervals; annotations mark the points where Remus starts and where the physical machine crashes.]
Fig. 15 Integrated with Remus for service failover under different checkpoint intervals

To measure this, we test the required synchronization bandwidth under different disk dirty rates. Fig. 14 shows that the network bandwidth needed to synchronize at a 2MBps disk dirty rate is about 2.03MBps. This bandwidth will not be a problem for the running service if we synchronize the disk over a secondary network. Otherwise, we can tighten the network bandwidth limit on disk synchronization to reduce the impact on running services. Doing so makes the total upgrading time longer, but the service downtime stays the same.

6 Application Scenarios

In this section, we present two possible real-world scenarios for Vshadow. We describe the background, implementation, and experiment of each application scenario in detail.

6.1 Server Consolidation in the Datacenter

With the help of virtualization technology, servers running on physical machines can be consolidated into a virtualization platform. Server consolidation reduces the maintenance cost of physical machines, improves resource utilization, and reduces energy consumption, especially in data centers. However, because many servers still run on physical machines, transferring them to a virtualization platform is likely to bring intolerable service downtime [1].

Vshadow enables server consolidation with minimal service downtime in data centers. As shown in Fig. 16, instead of adding a portable disk to each physical machine, we can prepare a conversion server which contains the same components as the portable disk and take advantage of a network boot protocol (PXE) to boot the physical machines into the virtualization platform. After the environment switching process, we continue with the VM image replication and live migration stages of Vshadow. A large number of physical machines can be converted to the virtualization platform concurrently in this way (sketched below), which reduces the total upgrading time needed to convert the whole datacenter.
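The following Python sketch illustrates the kind of concurrent orchestration this enables; the per-host conversion function and the host names are hypothetical placeholders rather than part of Vshadow.

```python
# Sketch: convert many physical servers concurrently after they have been
# PXE-booted into the conversion environment. The convert_host() body is a
# placeholder; in practice it would drive the replication and live-migration
# stages described in Sections 4.2 and 4.3.
from concurrent.futures import ThreadPoolExecutor

def convert_host(host: str) -> str:
    # Placeholder for: replicate the host's disk to the virtualization
    # platform, then live-migrate its CVM to become an SVM.
    print(f"converting {host} ...")
    return f"{host}: converted"

if __name__ == "__main__":
    hosts = ["server-01", "server-02", "server-03"]   # illustrative host names
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        for result in pool.map(convert_host, hosts):
            print(result)
```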

[Figure: a consolidation server manages PXE boot and Vshadow conversion of multiple physical servers into the virtualization platform.]
Fig. 16 Consolidating physical servers in data center using Vshadow

[Figure: on the physical machine, the original physical machine becomes a converted VM whose disk is replicated to the virtualization platform; Remus runs on both sides to keep a shadow VM ready for service failover, alongside other shadow VMs.]
Fig. 17 Integrating Vshadow with Remus to build a cost-effective failover system

Since $T_{Down}$ can be regarded as a constant, $T_{Upgrade}$ mainly depends on the time needed for disk replication. To evaluate the upgrading time from a physical machine to the virtualization environment in a datacenter with Vshadow, we conduct an experiment with a 400GB disk and a 1Gbps network. The network bandwidth measured by iperf is 949 Mbps, and the read and write throughputs are about 126.03MBps and 118MBps, respectively. We run the disk replication process several times; the average result is 113 minutes, corresponding to an average synchronization rate of 60.1 MBps. This means that we can convert a physical machine with a 400GB disk within 2 hours while incurring less than 2 minutes of service downtime. With a rational arrangement, Vshadow can flexibly virtualize the whole datacenter while minimizing the impact on service users.
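The reported synchronization rate is consistent with the measured replication time, as the short check below shows.

```latex
% Average synchronization rate implied by replicating a 400GB disk in 113 minutes:
\[
\frac{400 \times 1024\ \text{MB}}{113 \times 60\ \text{s}} \approx 60.4\ \text{MBps}
\]
```

This matches the reported average rate of 60.1 MBps, and 113 minutes is comfortably within the 2-hour bound.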

6.2 Creating a Cost-effective Failover System

Failover systems enable services to switch to a backup computer system upon the failure of the previously active one, which is important for supporting uninterrupted IT services. However, current commercial failover systems (e.g., [12]) suffer from the following disadvantages. First, besides the customized software, every piece of hardware must be duplicated, which brings additional cost. Second, these failover systems require compatible or even identical hardware and architectures, which makes the failover system inflexible to build. Moreover, the backup host is specialized for the current host and cannot provide failover service to other hosts. To solve these problems, we create a cost-effective failover system for physical machines by integrating with Remus, a popular failover system designed for virtualized environments [3]. As shown in Fig. 17, instead of live-migrating the CVM to the SVM, we start Remus after the VM image replication stage. Then, system state is copied and transferred to the SVM periodically, and external state such as network output is held until it is acknowledged by the SVM.


The SVM acts as a backup VM and is ready to take over from the CVM. We evaluate Vshadow together with Remus using the experimental setup of Section 5. A static web service is installed, and SIEGE is used to generate a workload of 600 Kbps. We test Remus at 25ms and 100ms checkpoint intervals (i.e., 40 and 10 checkpoints per second), and the network throughputs are shown in Fig. 15. While Remus is running, we force a shutdown of the physical machine, and the service continues to run in the backup virtual machine (SVM). The experiment demonstrates that we can combine Vshadow and Remus to build a cost-effective failover system for physical machines.

7 Related Work

The P2V conversion process touches on problems such as system replication, data protection, and virtualization. In this section, we introduce several related systems and research efforts, classified into five categories: freeze and copy, operating system replication, continuous data protection, virtual machine migration, and container-based virtualization.

Freeze and copy. Freezing the current running state and resuming it at the destination is the simplest way to transfer a changing system, and it preserves consistent state as well. Cold migration in Xen uses this method with the following steps: suspend the virtual machine, create memory snapshots, copy the memory snapshots to the destination, and resume the virtual machine state. The Internet Suspend/Resume system [13] combines VMs and distributed file systems to suspend a local computer, transfer its persistent state, and resume it on another computer. A distributed file system is needed in that approach, which differs from our method.

Operating system replication. System replication at the process level was a hot topic in the 1980s, for example [14] and [15]. As stated in the survey [16], this approach is impractical for real-world applications because it suffers from residual dependencies at different levels, such as file descriptors and shared memory. Furthermore, system replication at the process level significantly impacts the replicated processes. Zap [17] is an example of system replication at the operating system level; it adds a virtualization layer to the Linux kernel. However, this approach cannot be applied to other systems because the kernel has to be rebuilt for every operating system.

Continuous data protection. Continuous data protection (CDP) methods back up storage data periodically and recover the data after a crash. The work in [18] explores a new cloud platform architecture called Data Protection as a Service, which dramatically reduces the per-application development effort required to offer data protection. Sweeper [19] provides an efficient mechanism to identify an application-consistent recovery point. The work in [20] introduces a block-level CDP architecture to cope with low bandwidth and high latency in cloud environments. These CDP methods are application dependent and mainly aim to recover storage data, while our system is application independent.


Virtual machine migration. The excellent work of [9] achieves live migration of virtual machines with negligible service downtime. MECOM [21] combines memory compression with live migration to provide fast and stable VM migration. Zephyr [22] proposes a technique to efficiently migrate a live database in shared-nothing databases for elastic cloud platforms. SecondSite [23] extends Remus to the wide area network (WAN) and solves the problems of low bandwidth, failure detection, and network configuration in WAN environments. These outstanding works deal with VMs already in the virtualization environment, but they do not consider the process of converting physical machines.

Container-based virtualization. The work in [24] presents a system-level virtualization method that provides both isolation and efficiency, but it differs from hypervisor-based virtualization methods (e.g., Xen). For high performance computing environments, the performance evaluation in [25] determines that system-level virtualization introduces less performance overhead than hypervisor-based virtualization. However, all the VMs share the same kernel with the virtualization host, which is the main restriction of this type of virtualization. BIRDS [26] achieves bare-system recovery by means of a container-based virtualization P2V method and a parallel recovery mechanism. Our conversion system differs from BIRDS not only in the virtualization method but also in design and implementation.

8 Conclusion

We present the design, implementation, and evaluation of a system named Vshadow, which converts a running service from a physical to a virtual machine. Vshadow reduces service downtime during the conversion process by 91.2% and 97.2% compared with traditional on-line and off-line conversion methods, respectively. We achieve the reduced downtime with a unique combination of a quick native P2V method, a disk replication process, and live VM migration. Moreover, Vshadow guarantees the consistency of the original physical machine's persistent disk state while introducing negligible I/O performance overhead. We also present two application scenarios, and the evaluations demonstrate that Vshadow can be applied in the real world.

Acknowledgments The research is supported by National Science Foundation of China under grant No.61232008 and No.61472151, National 863 Hi-Tech Research and Development Program under grant No.2015AA011402 and No.2014AA01A302.


References

1. https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/V2V_Guide/P2V_Migration_Moving_workloads_from_Physical_to_Virtual_Machines-Converting_Physical_Machines_to_Virtual_Machines.html
2. VMware vCenter Converter Standalone User's Guide, vCenter Converter Standalone 5.1. http://www.vmware.com/pdf/convsa_51_guide.pdf
3. B. Cully, G. Lefebvre, D. Meyer, M. Feeley, N. Hutchinson, and A. Warfield, "Remus: High availability via asynchronous virtual machine replication," in Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation (NSDI'08), 2008, pp. 161-174.
4. P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," ACM SIGOPS Operating Systems Review, vol. 37, no. 5, pp. 164-177, 2003.
5. A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori, "kvm: the Linux virtual machine monitor," in Proceedings of the Linux Symposium, vol. 1, 2007, pp. 225-230.
6. J. Sugerman, G. Venkitachalam, and B.-H. Lim, "Virtualizing I/O devices on VMware Workstation's hosted virtual machine monitor," in Proceedings of the USENIX Annual Technical Conference, 2001, pp. 1-14.
7. http://www.cl.cam.ac.uk/cgi-bin/manpage?8+losetup
8. https://pubs.vmware.com/vsphere-50/index.jsp?topic=%2Fcom.vmware.vsphere.vcenterhost.doc_50%2FGUID-326DEC3C-3EFC-4DA0-B1E9-0B2D4698CBCC.html
9. C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield, "Live migration of virtual machines," in Proceedings of the 2nd USENIX Symposium on Networked Systems Design and Implementation (NSDI'05), 2005, pp. 273-286.
10. H. Liu, H. Jin, X. Liao, B. Ma, and C. Xu, "VMckpt: lightweight and live virtual machine checkpointing," Science China Information Sciences, vol. 55, no. 12, pp. 2865-2880, 2012.
11. P. Reisner and L. Ellenberg, "Replicated storage with shared disk semantics," in Proceedings of the 12th International Linux System Technology Conference (Linux-Kongress), 2005, pp. 111-119.
12. HP NonStop Computing. http://h17007.www1.hp.com/us/en/enterprise/servers/integrity/nonstop.aspx
13. M. Kozuch and M. Satyanarayanan, "Internet suspend/resume," in Proceedings of the 4th IEEE Workshop on Mobile Computing Systems and Applications (WMCSA'02), 2002, pp. 40-46.
14. A. Barak, S. Guday, and R. G. Wheeler, The MOSIX Distributed Operating System: Load Balancing for UNIX, 1993, vol. 13.
15. F. Douglis and J. Ousterhout, "Transparent process migration: Design alternatives and the Sprite implementation," Software: Practice and Experience, vol. 21, no. 8, pp. 757-785, 1991.
16. D. S. Milojičić, F. Douglis, Y. Paindaveine, R. Wheeler, and S. Zhou, "Process migration," ACM Computing Surveys (CSUR), vol. 32, no. 3, pp. 241-299, 2000.
17. S. Osman, D. Subhraveti, G. Su, and J. Nieh, "The design and implementation of Zap: A system for migrating computing environments," ACM SIGOPS Operating Systems Review, vol. 36, no. SI, pp. 361-376, 2002.
18. D. Song, E. Shi, I. Fischer, and U. Shankar, "Cloud data protection for the masses," Computer, vol. 45, no. 1, pp. 39-45, 2012.
19. A. Verma, K. Voruganti, R. Routray, and R. Jain, "Sweeper: an efficient disaster recovery point identification mechanism," in Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST'08), 2008, p. 20.
20. G. Yu, L. Chuanyi, and W. Dongsheng, "Fast recovery and low cost coexist: When continuous data protection meets the cloud," IEICE Transactions on Information and Systems, vol. 97, no. 7, pp. 1700-1708, 2014.
21. H. Jin, L. Deng, S. Wu, X. Shi, and X. Pan, "Live virtual machine migration with adaptive memory compression," in Proceedings of the IEEE International Conference on Cluster Computing and Workshops (CLUSTER'09), 2009, pp. 1-10.


22. A. J. Elmore, S. Das, D. Agrawal, and A. El Abbadi, "Zephyr: live migration in shared nothing databases for elastic cloud platforms," in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD'11), 2011, pp. 301-312.
23. S. Rajagopalan, B. Cully, R. O'Connor, and A. Warfield, "SecondSite: disaster tolerance as a service," in ACM SIGPLAN Notices, vol. 47, no. 7, 2012, pp. 97-108.
24. S. Soltesz, H. Pötzl, M. E. Fiuczynski, A. Bavier, and L. Peterson, "Container-based operating system virtualization: a scalable, high-performance alternative to hypervisors," in ACM SIGOPS Operating Systems Review, vol. 41, no. 3, 2007, pp. 275-287.
25. M. G. Xavier, M. V. Neves, F. D. Rossi, T. C. Ferreto, T. Lange, and C. A. De Rose, "Performance evaluation of container-based virtualization for high performance computing environments," in Proceedings of the 21st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP'13), 2013, pp. 233-240.
26. H. Yu, X. Xiang, Y. Zhao, and W. Zheng, "BIRDS: A bare-metal recovery system for instant restoration of data services," IEEE Transactions on Computers, vol. 63, no. 6, pp. 1392-1407, 2014.