Dynamic Energy Allocation for Coordinated Optimization in Enterprise Workloads Rahul Khanna∗ , Kshitij A. Doshi† , Christian Le‡ , John Ping§ , Martin P Dimitrov¶ Intel Corporation, USA

{ ∗ rahul.khanna, † kshitij.a.doshi, ‡ christian.le, § john.ping, ¶ martin.p.dimitrov }@intel.com

Mariette Awad, Melissa Stockman∗∗

American University of Beirut (AUB), Lebanon
{ma162, ∗∗mas113}@aub.edu.lb

Abstract—Power optimization and power control are challenging issues for server computer systems. To optimize power in an enterprise server, one needs to observe the temporal behavior of workloads and how they contribute to relative variations in the power drawn by different server components. This depth of analysis helps to validate and quantify the various energy/performance trends important for power modeling. In this paper we discuss an adaptive infrastructure that synthesizes models to dynamically estimate throughput and latency characteristics based on component-level power distribution in a server. In this infrastructure, we capture telemetry data from a distributed set of physical and logical sensors in the system and use it to train models for various phases of the workload. Once trained, the system power, throughput, and latency models participate in an optimization heuristic that redistributes power to maximize the overall performance/watt of an enterprise server. We demonstrate modeling accuracy and the improvement in energy efficiency due to coordinated power allocation among server components.

I. INTRODUCTION

Growing volumes of information that must be processed and served at high speed and ever-increasing densities have created an upsurge of power consumption in data-centers [1], [2]. This growth has put considerable pressure on cooling and power delivery capacities and thereby driven up fixed infrastructure costs and operational expenses. As a result of these trends, server and data-center energy efficiency has been pushed to the forefront of systems research, product design, and architecture [3]–[5].

Typically, a system can be represented as a set of components whose cooperative interaction produces useful work. These components may be heterogeneous in nature and may vary in their power consumption and power control mechanisms. Conventional workload scheduling and dynamic management techniques may not always rebalance energy and temperature distribution at lowest cost. Whereas static energy-efficiency tradeoffs are made at hardware and software design and initialization time, dynamic tradeoffs are made in light of runtime information related to energy distribution and performance trends. Although important, little work has been done toward a quantitative understanding of energy trends using simple yet accurate models that can characterize full-system power.

To achieve a better understanding of the energy and performance trends, we developed an instrumentation harness that captures telemetry data from a distributed set of physical and logical sensors in the system. Physical sensors include distributed voltages and currents; logical sensors include transaction and latency QoS metrics. The unified telemetry stream and built-in analytical capability within our harness provide us with opportunities to build and employ application models as below:
• Managed – Scripting and expert tools automate data sensing, execution, and reporting operations. Collected data

is analyzed by individual experts to formulate plans and decisions.
• Predictive – Mechanisms are devised for recognition of workload patterns and early detection of phases for characterization. The knowledge base (models) recommends appropriate actions.
• Adaptive – Building on the predictive capabilities, the adaptive system takes action itself based on the situation.
• Autonomic – Policy drives system activities such as allocation of resources within a prioritization framework.

The telemetry harness, which incorporates data from physical sources as well as software, provides the comprehensive information stream necessary for driving run-time decisions on energy. The design anticipates that software-visible performance and quality-of-service feedback will need to be merged into its telemetry stream. A hardware-software cooperative strategy can thus be formulated to understand, monitor, and drive the expenditure of energy towards goals that are the primary concern of software. In particular, this methodology enables a collaborative framework in which the energy-performance-latency tradeoff is properly abstracted and made visible to software, which in turn can generate the performance feedback for action by the hardware and OS layers.

This paper incorporates a machine learning methodology that applies the telemetry harness towards intelligent provisioning and thus achieves energy/performance-optimal solutions. Intelligent provisioning leads to efficient computation, efficient cooling, reduced guardbands, and reduced thermal gradients for a given set of QoS constraints. In Section III we discuss an adaptive infrastructure that synthesizes models to dynamically estimate the throughput and latency characteristics using component-level power distribution in a server. In our analysis (Section IV) we have been able to predict the system energy with an accuracy of 99.3%, which allows us to select accurate models for throughput and response time evaluation.

II. RELATED WORK

Data correlation among processor performance, power, and thermal events has been studied extensively. Bellosa et al. [6] demonstrated strong correlations among performance events (instructions/cycle, memory references, cache references) and power consumption for the Pentium II. Bircher et al. [7] demonstrated that microprocessor performance events can accurately estimate total system power and identified performance events for modeling entire-system power. David et al. [19] utilized activity counters to predict DIMM power and used the prediction for controlling the DIMM power budget with a Running Average Power Limit (RAPL) approach. Economou et al. [8] described Mantis, a framework that generates a model by correlating AC power measurements with user-level system utilization metrics on the basis of a one-time calibration phase. Kang et al. [9] showed the use of optimized search algorithms and machine learning techniques in a processor design exploration problem to

reduce the time needed for determining the best configuration. Our work differs from the traditional approach in this respect: we propose to use component energy metrics to show strong correlation between them and QoS parameters such as throughput and response time in an enterprise workload, using a machine learning approach. Traditional workload analysis methodologies build upon simulation results obtained from isolated components, involve manual alignment of telemetry data, and include off-line post-processing. These often result in long analysis times, over-corrections, suboptimal tuning, and larger guard-bands. In our analysis we systematically address the issues related to the dynamic collection, processing, and analysis of time-series telemetry data obtained in time-aligned fashion from a variety of physical and logical sensors in the system. Furthermore, we propose to exploit that correlation to redistribute energy amongst competing components using an Adaptive Weighted Genetic Algorithm [10] fitness function.

III. SYSTEM ARCHITECTURE FOR ENERGY-AWARE COMPUTING

In a given system environment, cooperating elements play an optimization game among themselves in order to optimize the system's resource usage while abiding by prevailing global constraints. To detect drifts from compliant activity, a monitoring mechanism that operates in real time and is not dependent on the behaviors being monitored must be devised. Such a mechanism captures and analyzes trends in a system's behavior (e.g., performance data, component-wise power) in response to usage changes and identifies drifts by comparing the behavioral data against signatures of normal activity, i.e., excursions from an acceptable profile. A profile characterizes the behavior of a given subject with respect to a given object, thereby serving as a signature or description of normal activity for its respective subject(s) and object(s) [11].
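The drift-detection idea above, comparing behavioral data against a learned signature of normal activity, can be sketched as a rolling statistical profile. This is an illustrative sketch rather than the paper's implementation; the class name, window size, and z-score threshold are assumptions:

```python
from collections import deque
import math

class ActivityProfile:
    """Rolling profile of one monitored variable. An observation is flagged
    as a drift (excursion from the acceptable profile) when it deviates from
    the learned signature of normal activity by more than z_threshold
    standard deviations."""

    def __init__(self, window=256, z_threshold=3.0):
        self.samples = deque(maxlen=window)   # sliding window of history
        self.z_threshold = z_threshold

    def observe(self, x):
        """Record x and return True if it is an excursion from the profile."""
        drift = False
        if len(self.samples) >= 2:
            mean = sum(self.samples) / len(self.samples)
            var = sum((s - mean) ** 2 for s in self.samples) / (len(self.samples) - 1)
            std = math.sqrt(var)
            drift = std > 0 and abs(x - mean) / std > self.z_threshold
        self.samples.append(x)
        return drift
```

In practice one such profile could be kept per sensor, with the window and threshold tuned to that sensor's noise characteristics.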
A random variable x, observed or derived and monitored over a period of time, represents observed behavior; observations can be used together with statistical models to determine conformity. Two profiles that aid in quality-of-service (QoS) analysis are the following:
• System Usage Profile summarizes the system usage trends related to users' best-known practices. It may be expressed, for example, as average system activity or an ideal distribution of usage.
• System Activity Profile summarizes the system resource usage trends over a period of time, and performance feedback in relation to the usage trends.

A. Multi-Objective Optimization (MOP)

MOP consists of n variables, q constraints, and m objectives, where any or all of the objective functions may be linear or non-linear. Energy and performance models have a number of degrees of freedom and conflicting objectives, which makes it difficult to optimize the objectives collectively. MOP allows us to invoke trade-offs between opposing objectives and thus suppress the number of uncertain variables involved in the optimization. Our objectives are: (a) to improve overall performance/watt, (b) to operate within a power limit, and (c) to reduce performance degradation or violation of a service level agreement (SLA). We use a data analysis engine to correlate performance/latency variations with energy distribution amongst cooperating components and thereby dynamically identify optimal corrections towards compliant (SLA) operation. A necessary step towards achieving such models is to extract a cost function that relates energy to performance and to the SLA (which we achieve by integrating energy and performance metrics related to workloads). The energy and performance data are obtained using an instrumentation harness (described further in this section) as a telemetry stream; we fuse the

disparate telemetry items related to power and performance into a unified, temporally aligned telemetry stream; this prevents the instabilities and errors that arise from attempting to correlate events using out-of-phase observations. Our work enables the system to construct a global optimization scheme that works with multiple objectives, resulting in efficient use of discrete energy levels for given SLA constraints. These objectives can be summarized as follows:
• How can we maintain the total system power budget while sustaining a given throughput?
• How can we maintain response time while sustaining a given throughput?

An energy allocation strategy is optimal if it assigns power states to the corresponding devices such that the total power consumption of the system is minimal while the SLA is still met. The SLA is measured by meeting the latency requirements (Equation 3) for a given throughput. To accomplish this we use a two-tier approach. The first tier uses a support vector machine (SVM) to construct the coefficients of linear estimation models for energy, throughput, and response time (Section IV). Once the estimation model is trained, the second tier uses an adaptive weighted genetic algorithm to extract Pareto-optimal points, allowing computation of the entire Pareto front for proactive energy distribution in a single algorithm run. Success is measured by maintaining viable limits on energy, throughput, and the latency SLA. We utilize useful information from the existing SVM solutions to read the energy distribution and obtain search pressure toward a positive ideal point. In general, the adaptive weighted GA with resulting population P can be summarized as:

F_n^{max} = \max\{ f_n(x) \mid x \in P \}, \quad n = 1, 2, \ldots, N \qquad (1a)

F_n^{min} = \min\{ f_n(x) \mid x \in P \}, \quad n = 1, 2, \ldots, N \qquad (1b)

F = \sum_{n=1}^{N} \frac{f_n(x) - F_n^{min}}{F_n^{max} - F_n^{min}} \qquad (1c)

where F represents the overall fitness function over the N objectives, trained through solutions x (of the GA fitness).

B. Platform Telemetry Abstraction Model

In our SLA-driven system, the multi-objective optimization function assists in fair distribution of power amongst competing elements. These functions are constructed with the following agents:
• Telemetry bus – retrieves raw power/performance sensor data and sends control messages to devices over an efficient interconnect.
• Monitor Agent – organizes the raw sensor data, synthesizes its statistical characteristics, and distributes them internally or externally.
• Analysis Agent – makes predictive energy-distribution decisions. It executes the multi-objective GA model by proactively changing the power allocation of a random device incrementally and observing the resulting performance impact. When the performance impact is negative, it avoids similar decisions in the future; when it is positive, it favors repeating or amplifying them.
• Control Agent – propagates power control messages to the controlled device in a timely manner. A configuration function identifies the dynamic range and granularity of power control.

Platform power, performance, and thermal sensors are spread throughout the system and accessible through a variety of interconnects and communication methods. From an application standpoint, there is a significant gap between what is required for optimal functionality and the actual low-level data that the

Fig. 1. Sensor Network Model: Sensor Network Layered Architecture. S1, S2, ..., Sn represent platform sensors (temperature, power, and so on).

sensors produce. An efficient analytical layer bridges the gap between the application-level abstraction of power/performance and the view produced from sensor-level data, by refining the synthesis/analysis/eventing/presentation model successively through multiple layers of abstraction, as illustrated in Figure 1. Each layer incorporates a set of similar functions and hides lower-level details from the layer above it, thus achieving simplicity, abstraction, and ease of implementation.
1) Sensor Hardware Abstraction (SHA) Layer: This layer interacts with proprietary hardware (the platform sensor infrastructure) and communication channels to acquire sensor data in a timely manner. SHA adopts an adaptive sensor sampling policy that adapts to environmental dynamics, makes measurements only when they are needed most, and eliminates sampling redundancies.
2) Platform Sensor Analyzer: This layer processes the data to remove noise, smoothing it according to a statistical mechanism appropriate for each sensor. Additionally, trends are analyzed to dynamically train the model equations through an understanding of current and historic predicates. Multiple activity trends can be modeled into a single equation to estimate resource or performance characteristics. For example, in Equation 2, x, y, and z are the activity counters in smoothed form, and A, a, B, b, C, and c are the coefficients of the sensor equation that can be tuned by the application for a specific use.

S = A x^a + B y^b + C z^c \qquad (2)
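As an illustration of Equation 2, the following sketch evaluates the sensor equation on smoothed counters. The coefficient values and the EWMA smoothing constant are assumptions for illustration only:

```python
def smooth(prev, x, alpha=0.2):
    """Exponentially weighted moving average used to de-noise a raw activity
    counter (alpha is an assumed tuning constant, not from the paper)."""
    return alpha * x + (1 - alpha) * prev

def sensor_estimate(x, y, z, coeffs):
    """Evaluate S = A*x**a + B*y**b + C*z**c (Equation 2) on smoothed
    activity counters x, y, z. coeffs holds one (gain, exponent) pair per
    counter; the values are application-tunable."""
    (A, a), (B, b), (C, c) = coeffs
    return A * x ** a + B * y ** b + C * z ** c
```

For example, `sensor_estimate(2.0, 3.0, 4.0, [(1.0, 2), (0.5, 1), (2.0, 0.5)])` combines a quadratic, a linear, and a square-root term into a single resource estimate.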

3) Platform Sensor Event Generation: An event is an asynchronous signal from hardware (or abstracted hardware) that indicates the need for attention and executes a registered interrupt handler. Using event mechanisms, we can reduce the amount of data flowing over the communication channels and configure actionable thresholds (sensor averages and so on) with dynamic properties. These signals, when triggered, cause software execution of signal handlers, leading to actuator action and optionally to signal property changes for a subsequent trigger.

IV. MEASURING SYSTEM CHARACTERISTICS

In this section we describe the measurement infrastructure used in the experiment. An energy measurement harness connects to the main voltage regulator (VR) outputs to sense the power in different parts of the machine, as summarized in Table I. The power sensor measures the output voltage and inductor current and reports the average power to an FPGA core. The output current is measured across the inductor's intrinsic direct current resistance (DCR). The current measurement mechanism can be applied to single- and multiphase voltage regulators. The current power sensing connection architecture requires an RC

Fig. 2. Power Sensing RC Network for a non-coupled inductor.

network across the inductor to measure the voltage and current concurrently (Fig. 2). The workload used in our study is the query-only portion of the Transaction Processing over XML (TPoX) benchmark [12], version 2.0, with the Express-C edition of the IBM DB2 database management software [13] and the Red Hat Enterprise Linux version 6 operating system [14]. As the workload driver for TPoX draws very modest computational effort from the machine when using the query-only workload, for configuration simplicity the workload driver is co-hosted with the database management software on the system under test. The TPoX scale factor employed is 100 GB, and a think time ranging from 60 ms down to 1 ms is used to produce variable computational demand. The choice of the TPoX benchmark as the workload for this study is motivated by its ability to impose the kind of broad-front stress on the computing system that is representative of a modern enterprise, with its large number of threads, complex concurrency interactions, and appreciable memory footprint. IBM DB2's self-tuning memory manager eliminated the need to perform any fine-tuning in our setup as we varied the imposed workload. The system under test employs two CMP processors from the Intel® Xeon® E5-4600 series. The machine is furnished with 32 GB of DDR3 DRAM. A single Intel SATA solid-state disk drive with a capacity of 160 GB provides the mass storage for database tables and log files, with sufficient random I/O throughput to eliminate disk wait times during workload execution. The machine is equipped with two 1 Gb network interfaces; however, as the workload driver is hosted locally during the test, the network traffic during the execution of the workload is negligible. During the execution of the workload the CPU utilization approaches 100%, with about 8% of the time in system mode. The query-only portion of the TPoX benchmark presents the system with a mix of seven queries.
Each second, the workload driver reports the transactional throughput for each of the queries, and the average, minimum, maximum, and 95th-percentile response times. From these statistics, we extract the average throughput (across all queries) and the geometric mean of their response times. The throughput statistics used in this paper are normalized against a common reference for ease of analysis and to prevent unintended product comparisons.

The Monitor Agent comprises a micro-controller that collects the power and performance data through the system VRs, OS performance counters, and workload QoS calculators. The Monitor Agent buffers the temporally aligned sensor data into an internal circular buffer. The data is then extracted from this buffer and processed for rudimentary statistical functions (e.g., running averages, mean, standard deviation). The raw data along with the statistical data is made available to the analysis engine for model calculation and training.

The server hardware also comprises power management fea-

Signal | Description
VCCP   | For each multi-core processor socket, the sum of the power drawn into that processor's cores.
VSA    | For each multi-core processor socket, the power drawn by the System Agent, an entity responsible for power distribution and control to the rest of the socket.
VTT    | For each multi-core processor socket, the power drawn for socket-level caching and data movement, which includes power taken up in I/O and the shared L3 cache.
VDDQ   | Each multi-core processor socket has several DRAM interfaces. VDDQ measures the power drawn for memory attached to these interfaces.

TABLE I
POWER SENSING CAPABILITIES

Fig. 3. Component Energy Rebalancing infrastructure that uses a support vector machine (SVM) for model training and an adaptive weighted genetic algorithm for optimal distribution.

tures that can be controlled by frequency/voltage scaling and proactive tuning methodologies [15]–[18]. The memory subsystem uses the Running Average Power Limit (RAPL) methodology [19], which can simultaneously enforce multiple memory power limits. These functions, in conjunction with QoS guarantees (throughput and response time), limit the CPU/DIMM energy for a given interval. A feedback loop analyzes the penalty function and rebalances the energy distribution to reduce response-time extremes while maximizing the performance/watt metric. The response time penalty function can be summarized by the following equations:

\chi = \frac{\max(0, R - R_t) \cdot 100}{R_t} \qquad (3a)

P = 1 - \frac{1}{2}\chi \qquad (3b)

where R is the measured response time, R_t is the target response time, and P is the penalty function. Throughput and response time are trained as functions of energy distribution according to workload requirements. The dynamic training methodology results in discrete models for different phases of the workload. In our experiments these models are characterized by the total system power, calculated by summing the component energies. When the response time penalty exceeds a given threshold, the energy distribution is re-evaluated according to Equation 4:

E_j(t) = \sum_{cpu=i}^{N} \sum_{k} \alpha_k^{ij} V_k^{ij}(t) + \sum_{ch=i}^{N} \beta^{ij} V_{ddq}^{ij}(t) + K \qquad (4a)

j \in (T_M, R_M, P_M); \quad k \in (ccp, tt, sa)

For a given workload phase M, the energy distribution functions for throughput (T_M), response time (R_M), and system power (P_M) are given by Equation 4. The \alpha and \beta coefficients for the CPU and DIMM channels are calculated by a regression methodology that includes training the system for different phases of the workload. When the penalty function is triggered, the system is re-optimized to reduce the transaction response time and/or increase the throughput while enforcing the system energy budget. Multi

Fig. 4. Predicted Throughput (vs. Measured Throughput) as a function of component energy distribution.

Objective Genetic Algorithm (GA) function analysis is performed to redistribute the energy between competing components to maximize the overall fitness, which includes maintaining optimal throughput (T_M), response time (R_M), and the system energy budget (P_M). The methodology for energy redistribution is represented by Equation 1c, where f_n can be expressed as in Equation 5:

f_j(t) = 2 \cdot \frac{E_j(t) - E_j^{min}}{E_j^{min}}; \quad j \in T_M \qquad (5a)

f_j(t) = 2 \cdot \frac{E_j^{max} - E_j(t)}{E_j^{max}}; \quad j \in R_M, P_M \qquad (5b)

where E_j^{min} sets the minimum limit of the constraint and E_j^{max} sets the maximum limit of the constraint. These constraints are represented by the system power, throughput, and response time targets respectively. The GA shuffles the component energy budgets such that the representative equations (Equation 4) maximize the fitness function given by Equation 5. The effectiveness of the solution depends upon the accuracy of the system energy, performance, and throughput models and their corresponding coefficients. The analysis functions comprise machine learning heuristics. In an effort to constrain the workload energy, throughput, and response time, we use the Support Vector Machine (SVM) [20] technique to construct linear models that relate component energy to the individual objectives. Since its invention by Boser, Guyon, and Vapnik, SVM has established itself as one of the leading approaches in pattern recognition because of its strong foundation in statistical theory, finding a computationally efficient way of learning a high-dimensional feature space using kernel functions that must be positive definite. SVM for regression depends on three parameters that need to be specified and tuned for optimal performance.
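A minimal sketch of the per-objective fitness (Equation 5) and the adaptive weighted aggregate (Equation 1c) follows. It assumes the factor of 2 in Equation 5 acts as a scaling multiplier; the function and argument names are illustrative, not from the paper:

```python
def objective_fitness(E, E_min, E_max, maximize):
    """Per-objective fitness f_j of Equation 5. Throughput (j in T_M) is
    rewarded for exceeding its minimum target; response time and system
    power (j in R_M, P_M) are rewarded for staying under their maximum."""
    if maximize:                               # Equation 5a
        return 2.0 * (E - E_min) / E_min
    return 2.0 * (E_max - E) / E_max           # Equation 5b

def overall_fitness(f_values, f_mins, f_maxs):
    """Adaptive weighted aggregate F of Equation 1c: each objective's
    fitness is normalized by its observed range over the GA population P."""
    total = 0.0
    for f, f_min, f_max in zip(f_values, f_mins, f_maxs):
        span = f_max - f_min
        total += (f - f_min) / span if span > 0 else 0.0
    return total
```

The range normalization in `overall_fitness` is what makes the weighting adaptive: an objective whose values are tightly clustered across the population contributes as much selection pressure as one with a wide spread.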

V. EVALUATION

Using the experimental setup described in Section IV, we collected the coefficients (Table II) of Equation 4 by training the models for different phases of the TPoX workload. The measurement infrastructure of Section IV collects ten component energy variables which, combined with performance, SLA, and the penalty function for violating the SLA, form the model training set.
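The shape of this training step can be sketched as follows. The paper trains SVM regression models; to keep the sketch self-contained it instead fits the same linear form of Equation 4 with ordinary least squares, and the telemetry samples and weights are synthetic stand-ins:

```python
import numpy as np

# Synthetic stand-ins for the telemetry training set: ten component-energy
# features per sample plus a measured system-power target.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 50.0, size=(200, 10))    # component energies per sample
true_w = np.linspace(0.001, 0.02, 10)         # hypothetical model weights
y = X @ true_w + 2.5                          # linear model plus constant K

# Fit the linear form of Equation 4 (here by ordinary least squares;
# the paper uses SVM regression for this step).
A = np.hstack([X, np.ones((X.shape[0], 1))])  # augment with a constant column
solution, *_ = np.linalg.lstsq(A, y, rcond=None)
coeffs, constant = solution[:10], solution[10]
```

One such coefficient vector and constant would be produced per model (power, response time, throughput) and per workload phase.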

Feature  | Power Model | Response Time Model | Throughput Model
1        | 0.003228    | 0.016902            | 0.016089
2        | 0.000999    | 0.003602            | 1.083152
3        | 0.000408    | 0.059599            | 3.379966
4        | 0.001224    | 0.040463            | 1.855955
5        | 0.001009    | 0.002725            | 0.913453
6        | 0.016208    | 0.203723            | 2.894045
7        | 0.000637    | 0.002176            | 1.291055
8        | 0.003731    | 0.559411            | 38.806805
9        | 0.002100    | 0.025721            | 0.976363
10       | 0.001841    | 0.000301            | 0.21032
Constant | 2.583555    | 16.752212           | 14367.69107

TABLE II
COEFFICIENTS FOR SYSTEM POWER & LATENCY MODELS
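Using the Power Model column of Table II, the trained linear estimator of Equation 4 reduces to a weighted sum; the component-energy inputs below are hypothetical sample values, not measurements from the paper:

```python
# Power Model column of Table II: ten trained coefficients plus the constant.
power_coeffs = [0.003228, 0.000999, 0.000408, 0.001224, 0.001009,
                0.016208, 0.000637, 0.003731, 0.002100, 0.001841]
power_constant = 2.583555

def predict_system_power(component_energies):
    """System-power estimate per Equation 4: a weighted sum of the ten
    component-energy features plus the trained constant term."""
    return sum(c * e for c, e in zip(power_coeffs, component_energies)) \
        + power_constant
```

The Response Time and Throughput columns would be applied the same way, each with its own coefficient vector and constant.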

Fig. 5. Predicted System Power (vs. Measured System Power) as a function of component energy distribution.

Fig. 6. Predicted Response Time (vs. Measured Response Time) as a function of component energy distribution.

Figure 4 illustrates the throughput control function given by Equation 4. The coefficients of the throughput equation are calculated using the SVM regression function. Similarly, Figure 5 illustrates the system power control function represented by P_M, and Figure 6 illustrates the response time function represented by R_M (Equation 4). On average, our SVM-supported machine learning regression delivers accuracy between 97% and 98.5%. For simplicity, workload phases are identified by the system power profile mapped at the 0-20, 21-40, 41-60, 61-80, and 81-100 percentiles. Although each phase is trained for its own performance and latency model coefficients, system power maintains a single estimation profile. Figure 7 illustrates the penalty function as defined in Equation 3. Management software analyzes the penalty function and triggers a re-configuration event to compensate for the response time target (90th percentile) (Figure 3). A genetic algorithm (GA) based regression identifies the energy targets for the CPUs and memory per the fitness function defined in Equation 5. Correspondingly, the fitness equation models maximization of performance/watt subject to energy and latency constraints. Note that in order to meet SLA targets driven by latency constraints, it is necessary to hold back from over-allocating power to one component so that another

Fig. 7. Penalty function snapshot (sampling interval = 597-776) calculated at 90 percentile.

Fig. 8. Total Server (Node) power as a function of 90 percentile latency achieved due to optimal distribution of CPU and DIMM power for various Think Time (TT) levels.

component gets the power it needs to keep from being the cause of an SLA violation. Based on the output of the adaptive weighted GA regression, control functions (including DRAM RAPL [19] and CPU P-state control) are employed to set the power budget of each component for the next evaluation interval. Real-time penalty data and throughput data are fed back into the training cycle to harden the representative models.

In order to check for correlation or dependency among the 10 features (component energy variables) collected, we investigated which feature subsets best predict the throughput by sequentially selecting features until no further improvement in prediction occurs. Starting from an empty feature set, candidate feature subsets were created by sequentially adding each of the features not yet selected. For each candidate feature subset a 10-fold cross-validation was performed. The GA regression model accurately predicts and averts response-time violations in 98% of samples. It is also able to predict and regulate the power caps at the component level to maintain response times, to an average fitness of 97.5%.

Figure 8 illustrates total server power as a function of 90th-percentile latency for each Think-Time (TT) level. In this scenario, we consider the optimal mix of CPU and memory power scaling and illustrate only Pareto-optimal results (best power savings for a given latency). Reducing memory power impacts overall performance. Although at higher levels of throughput (i.e., at high arrival and service rates) we can achieve only marginal memory power savings, at lower rates of service (together with lower arrival rates) memory frequency reductions save power without hurting the response time. Our results demonstrate that coordinated memory and CPU power scaling yields significant power reduction at all performance levels relative to isolated tuning. In summary, components should be power-scaled through coordinated budgeting using weighted multi-objective optimization to maintain system balance for given throughput and latency targets. Figure 9 illustrates an average power allocation savings of 13%, with a standard deviation of 6.9 and peak savings of 23%. These savings are achieved by coordinated budgeting of component power using the adaptive weighted genetic algorithm for a target throughput. The experimental setup allows continuous monitoring of the workload and planning of energy allocation by predicting the effects on performance.
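The sequential forward selection with 10-fold cross-validation described above might be sketched as below. This is an illustrative version using a plain least-squares fit as the predictor; the function name and stopping rule are assumptions:

```python
import numpy as np

def forward_select(X, y, n_splits=10):
    """Greedy forward feature selection: starting from an empty set, add
    whichever remaining feature most improves the k-fold cross-validated
    squared error of a linear fit; stop when no candidate improves it."""
    n, d = X.shape
    folds = np.array_split(np.arange(n), n_splits)

    def cv_error(features):
        # Sum of squared held-out errors over all folds.
        err = 0.0
        for fold in folds:
            train = np.setdiff1d(np.arange(n), fold)
            A = np.hstack([X[train][:, features], np.ones((len(train), 1))])
            w, *_ = np.linalg.lstsq(A, y[train], rcond=None)
            B = np.hstack([X[fold][:, features], np.ones((len(fold), 1))])
            err += float(np.sum((B @ w - y[fold]) ** 2))
        return err

    selected, best = [], np.inf
    while True:
        candidates = [f for f in range(d) if f not in selected]
        if not candidates:
            break
        scores = {f: cv_error(selected + [f]) for f in candidates}
        f_best = min(scores, key=scores.get)
        if scores[f_best] >= best:
            break
        selected.append(f_best)
        best = scores[f_best]
    return selected
```

Run against synthetic data where the target depends on only two of the features, the loop picks those two out and then stops adding features.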
The GA-based energy allocation algorithm estimates DIMM and CPU power control values to balance the power supply against performance requirements driven by latency targets at a given throughput. A reconfigurable power allocation infrastructure directs power-control requests to each component.

VI. CONCLUSIONS

In this paper we discussed an adaptive infrastructure that synthesizes models to dynamically control the throughput and latency characteristics through component-level power distribution in a server. We capture the distributed power rails of system components along with the throughput and response time of an enterprise workload (TPoX). Workload modulations result in throughput and response time variations that need to be controlled in order to fulfill a response-time SLA requirement. Models that estimate system power, throughput, and response time are trained on the basis of component-level power data. The trained models participate in a multi-objective optimization game that redistributes available power to simultaneously achieve two goals: maximize the overall performance/watt of an enterprise server, and minimize the penalty from violating workload quality-of-service expectations. We used a mix of support vector machines (SVM) and adaptive weighted genetic algorithms (AWGA) for model training and for dynamic energy distribution among competing components, respectively. Model-estimated system power assists in classification of the workload phases. Throughput and latency models assist in re-allocating the power according to a performance/watt maximization function. In our analysis we have been able to predict the system energy with an accuracy of 99.3%, and the throughput and response time metrics with an accuracy between 95% and 98%.
Along with adapting to meet the throughput and latency requirements, the multi-objective optimization approach reduces server energy usage by 13-23% and predicts the future energy distribution (required for the SLA-mandated latency targets) to an average fitness of

Fig. 9. Total improvement in power allocation for TT=30 at 90 percentile latency targets.

97.5%. Future work includes incorporating more workloads and a penalty function based on throughput as well as response time.

REFERENCES
[1] U.S. EPA, "Report to Congress on server and data center energy efficiency," Tech. Rep., Aug. 2007.
[2] J. G. Koomey, "Estimating total power consumption by servers in the U.S. and the world."
[3] P. Bohrer, E. Elnozahy, T. Keller, M. Kistler, C. Lefurgy, and R. Rajamony, "The case for power management in web servers," Power Aware Computing, Jan. 2002.
[4] Intel, "First the Tick, Now the Tock: Intel Microarchitecture (Nehalem)," 2009.
[5] L. Barroso and U. Holzle, "The case for energy-proportional computing," IEEE Computer, Jan. 2007.
[6] F. Bellosa, "The benefits of event-driven energy accounting in power-sensitive systems," ACM SIGOPS European Workshop, Sep. 2000.
[7] W. Bircher and L. John, "Complete system power estimation: A trickle-down approach based on performance events," in Proc. of the 2007 IEEE Int'l Symp. on Performance Analysis of Systems and Software, Apr. 2007, pp. 158-168.
[8] D. Economou, S. Rivoire, C. Kozyrakis, and P. Ranganathan, "Full-system power analysis and modeling for server environments," in Workshop on Modeling, Benchmarking and Simulation (MOBS), June 2006.
[9] S. Kang and R. Kumar, "Magellan: A framework for fast multi-core design space exploration and optimization using search and machine learning," UIUC CRHC Technical Report, 2008, pp. 1432-1437.
[10] D. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley Publishing Company, Inc., 1989.
[11] D. E. Denning, "An intrusion-detection model," IEEE Transactions on Software Engineering, vol. SE-13, no. 2, Feb. 1987, pp. 222-232.
[12] TPoX, http://tpox.sourceforge.net/
[13] DB2 Express-C, http://www-01.ibm.com/software/data/db2/linux-unix-windows/edition-express-c.html
[14] Red Hat Linux, http://www.redhat.com/rhel/server/details/
[15] V. Kontorinis, A. Shayan, R. Kumar, and D. Tullsen, "Reducing peak power with a table-driven adaptive processor core," in Proc. of the International Symposium on Microarchitecture, 2009.
[16] K. Meng, R. Joseph, R. P. Dick, and L. Shang, "Multi-optimization power management for chip multiprocessors," in Proc. of PACT, 2008.
[17] C. Isci, A. Buyuktosunoglu, C.-Y. Cher, P. Bose, and M. Martonosi, "An analysis of efficient multi-core global power management policies: Maximizing performance for a given power budget," in Proc. of the International Symposium on Microarchitecture, 2006.
[18] W. Felter, K. Rajamani, T. Keller, and C. Rusu, "A performance-conserving approach for reducing peak power consumption in server systems," in Proceedings of ICS, June 2005.
[19] H. David, E. Gorbatov, U. Hanebutte, R. Khanna, and C. Le, "RAPL: Memory power estimation and capping," ISLPED, 2010.
[20] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT '92), ACM, New York, NY, USA, pp. 144-152.