DEPARTAMENTO DE ARQUITECTURA Y TECNOLOGÍA DE SISTEMAS INFORMÁTICOS

Facultad de Informática
Universidad Politécnica de Madrid

Ph.D. THESIS

AUTONOMIC HIGH PERFORMANCE STORAGE FOR GRID ENVIRONMENTS BASED ON LONG TERM PREDICTION

Author
Alberto Sánchez Campos
MS Computer Science, MS Marketing

Ph.D. supervisors
María de los Santos Pérez Hernández, Ph.D. Computer Science
Antonio Cortes Rosselló, Ph.D. Computer Science

2008

THESIS COMMITTEE:

CHAIRMAN: Pedro de Miguel Anasagasti

MEMBER: Jesús Carretero Pérez

MEMBER: José Luis Bosque Orero

EXTERNAL MEMBER: Heinz Stockinger

SECRETARY: José María Peña Sánchez

To my grandmother, Marina Benito. I’m proud to be her grandson.

“Any sufficiently advanced technology is indistinguishable from Magic” Arthur C. Clarke

Acknowledgements

This work has been carried out at the Departamento de Arquitectura y Sistemas Informáticos of the Facultad de Informática of the Universidad Politécnica de Madrid, in collaboration with the Barcelona Supercomputing Center. I would like to thank everyone in this department for their inestimable help, particularly those whose conversations, company and work have contributed in some way to this thesis. Chuso, Pablo, Chema, Santi, Germán, Edu, Javier, Juanma and the rest: your work and knowledge have been an invaluable contribution. Furthermore, I thank all the research centers participating in the grid testbed required for a complete evaluation of this thesis: the Instituto de Investigación en Informática of the Universidad de Castilla-La Mancha, the Escuela Superior de Ciencias Experimentales of the Universidad Rey Juan Carlos and the Departamento de Ingeniería y Tecnología de Computadores of the Universidad de Murcia.

I especially want to acknowledge the prolific teaching, research and, above all, humane work that María and Toni have carried out. I can hardly imagine having had better supervisors for my Ph.D. They have guided me through good times and bad in the best possible way, and they have always been there when I have needed them, contributing to this thesis with their common sense and deep knowledge. This work would not have been possible without them.

Finally, I would like to thank my family for their care and understanding, as well as all those people who have influenced the way I am: from my old friends, with whom I learnt to live, to the new ones, with whom I have learnt to enjoy each day.

This goes for all of you!

Alberto Sánchez Campos, 2008

Abstract

The existing gap between computing and I/O times has given rise to the so-called "I/O crisis". As long as this gap is not reduced, the I/O system will remain the bottleneck of most computing systems.

Parallel file systems constitute a solution to this problem and are used in many kinds of environments. In spite of the proliferation of cluster environments and the improvements in parallel cluster file systems, a large number of data analysis problems still cannot be tackled. In these scenarios, the concept of parallelism should be adapted to new technologies. Grid technology enables resource sharing across wide area networks, increasing the available computational power and storage capacity and thus allowing the scientific community to solve formerly unachievable problems. However, grids mainly focus on increasing the availability of resources rather than on improving the performance of applications. Parallelism could optimize data access in this kind of environment and, therefore, enhance application performance. It would be advisable to extend solutions used in clusters of workstations to a grid infrastructure with the aim of overcoming I/O problems in grid applications.

Nevertheless, the complexity and dynamism of grid environments make the management and administration of such data access systems difficult. To cope with this complexity, the system must be able to manage itself, focusing on improving the performance of I/O operations. In an I/O system, it must be taken into account that data is not usually consumed at the same time it is produced. In this sense, performance improvements are related to both the current and future states of the grid elements, since the actual access to data will take place later on. Prediction methods can be used to anticipate the future behavior of the system and to make decisions aimed at enhancing the performance of current and later I/O operations.

In summary, this work proposes the analysis, design and implementation of an architecture that solves the I/O problem in an efficient way while managing the high complexity of a heterogeneous environment.

Keywords: Parallel I/O; Grid computing; Data grid; Autonomic computing; Autonomic storage; Long term prediction; Heterogeneous environment.

Resumen

The existing imbalance between computation time and I/O time gives rise to what has become known as the I/O crisis. As long as this gap is not narrowed, I/O will remain one of the main bottlenecks of applications.

Parallel file systems constitute a solution to this problem, and a wide variety of such systems can be used in different environments. Despite the proliferation of cluster environments, the scientific community faces numerous new problems that cannot be addressed, owing to their enormous computational demands and the huge data sets they use. In this sense, it is interesting to extrapolate the concept of parallelism to new technologies capable of tackling such problems. Grid technology, by using a large number of geographically distributed heterogeneous elements, provides a great computing and storage capacity that makes it possible to solve previously intractable problems. However, its focus is mainly on increasing availability rather than on improving performance. Adopting parallelism at this level could optimize data access in this kind of environment and, consequently, improve the performance of the applications running on this platform.

The difficulty that follows from this proposal is that the complexity and dynamism of the environment make its management extremely hard. The proposed solution is for the system itself to analyze its own behavior and act accordingly, always seeking to improve the performance of I/O operations. However, in contrast to the computing needs that arise at a given moment, data is not consumed at the same time it is produced. Therefore, not only must current I/O operations be improved, but subsequent accesses (both short and long term) to the data stored at a given moment must also be optimized.

In conclusion, this work proposes the analysis, design and validation of an architecture that solves the I/O problem efficiently in heterogeneous environments, addressing the issues related to managing the complexity of such environments.

Keywords: Parallel I/O; Grid computing; Data grid; Autonomic computing; Autonomic storage; Long term prediction; Heterogeneous environment.

Contents

List of Figures

List of Tables

Chapter 1 INTRODUCTION
1.1 Motivations and scope of the work
1.2 Antecedents
1.3 Objectives
1.4 Document organization

Part I STATE-OF-THE-ART

Chapter 2 PARALLEL I/O
2.1 Secondary storage
2.2 Device-level parallelism
2.2.1 RAID
2.2.2 Heterogeneous disk arrays
2.2.2.1 V:Drive
2.2.2.2 HERA
2.2.2.3 AdaptRaid
2.3 File system-level parallelism
2.3.1 PVFS
2.3.2 GPFS
2.3.3 Lustre
2.3.4 MAPFS
2.4 Parallel I/O hierarchy

Chapter 3 DISTRIBUTED COMPUTING
3.1 Cluster computing
3.2 Grid computing
3.2.1 Grid middleware
3.2.1.1 Grid fabric
3.2.1.2 Core grid middleware
3.2.1.3 User level grid middleware
3.2.1.4 Grid Portals
3.2.2 Data grid
3.2.2.1 Data transport
3.2.2.2 Data replication and storage management
3.2.2.3 Replication strategies
3.2.3 Grid performance
3.2.3.1 Grid benchmarking
3.2.3.2 Grid monitoring

Chapter 4 AUTONOMIC COMPUTING
4.1 Autonomic elements
4.2 Autonomic levels
4.3 Autonomic projects
4.3.1 Self-* storage system
4.3.2 VIOLIN
4.3.3 Autonomic virtualized environments
4.3.4 Kendra
4.3.5 Application performance prediction and autonomic computing

Chapter 5 STATISTICAL METHODS TO MODEL AUTONOMIC SYSTEMS
5.1 Statistical methods for clustering
5.1.1 Determining the number of clusters
5.2 Statistical methods for prediction
5.2.1 On-line models
5.2.1.1 Linear regression
5.2.1.2 Exponential, polynomial and logarithmic adjustment
5.2.1.3 Exponential smoothing
5.2.2 A priori models
5.2.2.1 Petri Nets
5.2.2.2 Finite state machines
5.2.2.3 Markov chains
5.3 Statistical methods for decision making

Part II PROBLEM STATEMENT AND PROPOSAL

Chapter 6 PROBLEM STATEMENT
6.1 Grand challenge problems
6.2 I/O and software complexity crisis: Two problems not solved yet
6.3 Solutions to the I/O crisis
6.3.1 I/O solutions: Approaches based on clusters
6.3.2 I/O solutions: Approaches based on grid
6.3.2.1 Contribution
6.4 Solutions to the software complexity crisis
6.4.1 Contribution

Chapter 7 A HIGH PERFORMANCE GRID STORAGE ARCHITECTURE
7.1 A toolkit for accessing large volumes of data
7.2 Providing an efficient service-based access to data resources
7.2.1 MAPFS-Grid Parallel Data Access Service
7.3 Providing a uniform service-based access to data resources
7.3.1 MAPFS-DAI
7.4 Providing a high-performance access to data resources
7.4.1 MAPFS-DSI

Chapter 8 AN AUTONOMIC FRAMEWORK FOR GRID STORAGE
8.1 Providing Autonomic Storage
8.1.1 A system-broker-based solution
8.1.2 File data discovery
8.1.3 Fault-tolerance and file data replication
8.1.4 Security
8.2 GAS structure
8.2.1 Coexistence of GAS and I/O systems
8.2.1.1 Coexistence with MAPFS-Grid
8.2.1.2 Coexistence with SRB
8.2.1.3 Coexistence with CASTOR

Chapter 9 OBTAINING PERFORMANCE MEASUREMENTS
9.1 MonALISA-based monitoring system
9.2 GMonE: A Grid Monitoring Environment
9.2.1 Obtaining data
9.2.1.1 Monitored data aggregation
9.2.2 Gathering and managing data
9.2.3 Providing data
9.2.4 A simple grid monitoring visualization tool

Chapter 10 PREDICTION AND DECISION MAKING
10.1 Definition of states
10.2 Prediction
10.2.1 Prediction model
10.2.1.1 Usual behavior
10.2.1.2 Adaptation to new data
10.2.1.3 Results interpretation
10.2.1.4 Influence of the monitoring data period
10.3 Decision making
10.3.1 Decision making for read requests
10.3.2 Decision making for file creating requests
10.3.2.1 Goodness calculation
10.3.2.2 Goodness-based decision making stage
10.3.2.3 Influence of data distribution and block size
10.3.2.4 Performance improvement achieved by parallelism
10.3.3 Decision making for write requests on an existing file

Chapter 11 EVALUATION
11.1 Evaluation of the proposed grid storage architecture
11.1.1 Performance of MAPFS-Grid PDAS
11.1.2 Performance of MAPFS-DAI
11.1.3 Performance of MAPFS-DSI
11.2 Evaluation of the proposed grid monitoring environment
11.2.1 Data monitoring
11.2.2 Communication
11.2.3 Data recovering
11.3 Evaluation of the autonomic framework for grid storage
11.3.1 Performance of the prediction model
11.3.2 Evaluation of the made predictions
11.3.3 Performance of the decision making
11.3.4 Transfer mode
11.3.5 Autonomic capabilities
11.3.6 System performance

Part III CONCLUSIONS AND FUTURE WORK

Chapter 12 CONCLUSIONS AND FUTURE RESEARCH LINES
12.1 Contributions in data grid field
12.2 Contributions in grid management field
12.3 Benefits derived from the development of this Ph.D. thesis
12.4 Future research lines
12.4.1 Time dependent predictions
12.4.2 Peer-to-peer-based brokering
12.4.3 Self-healing

Appendix A MARKOV CHAIN RESOLUTION

Bibliography

List of Figures

2.1 Evolution of the storage capacity of hard disks [Yel06]
2.2 Evolution of the hard disk performance [Yel06]
2.3 RAID 1 [RAI]
2.4 RAID 5 [RAI]
2.5 MAPFS three-tier architecture [Pér03]
2.6 Parallel I/O hierarchy
2.7 Increase expectations of the network bandwidth and storage capacity
3.1 Brokering in a grid system
3.2 Grid Monitoring Architecture (GMA) [GMA]
4.1 Autonomic computing principles [RAC]
4.2 Maturity levels of autonomic computing [RAC]
5.1 Dendogram
5.2 Linear regression
5.3 Exponential smoothing
5.4 Markov chain
7.1 MAPFS-Grid Overview
7.2 Parallel Data Access Service (PDAS)
7.3 MAPFS-Grid PDAS providing two levels of parallelism
7.4 MAPFS-DAI within the OGSA-DAI Architecture
7.5 Two-level parallelism by using MAPFS-DAI
7.6 Two-level parallelism by using MAPFS-DSI
7.7 MAPFS-DSI within a data transference scenario
8.1 I/O operation sequence in MAPFS-Grid by using the autonomic broker
8.2 File data discovery based on EPR
8.3 GAS architecture
8.4 Autonomic computing reference architecture [IBM04]
8.5 GAS model related with the autonomic computing reference model
8.6 Coexistence of GAS and MAPFS-Grid
8.7 SRB architecture
8.8 Coexistence of GAS and SRB
8.9 CASTOR architecture [CAS]
8.10 Coexistence of GAS and CASTOR
9.1 MonALISA-based monitoring system
9.2 GMonE architecture
9.3 GsVTool
10.1 Prediction and decision making modules
10.2 Normal distribution
10.3 Markov chain state diagram
10.4 Results visualization to apply the centroid method
10.5 Achieved total transfer rate depending on time
10.6 Assigning blocks of changeable size carrying out a joint parallelism
10.7 Asynchronous sending of the optimum block size to each resource
11.1 Network topology of the test environment used to evaluate MAPFS-Grid
11.2 Evaluation of the network bandwidth in the test environment
11.3 Performance obtained by MAPFS-Grid PDAS to read a file on UPM 1
11.4 Performance obtained by MAPFS-Grid PDAS to write a file
11.5 Performance obtained by PDAS to access a 12 GB file size
11.6 Performance obtained by parallel MAPFS-Grid PDAS to read a file
11.7 Performance obtained by parallel MAPFS-Grid PDAS to write a file
11.8 Comparison between fileAccess (OGSA-DAI) and MAPFS-DAI to read a file
11.9 Comparison between fileAccess (OGSA-DAI) and MAPFS-DAI to write a file
11.10 Performance obtained by parallel MAPFS-DAI to read a file
11.11 Performance obtained by parallel MAPFS-DAI to write a file
11.12 Performance obtained by parallel MAPFS-DAI to access a 12 GB file size
11.13 Comparison between MAPFS-DSI and GridFTP file DSI to read a file
11.14 Comparison between MAPFS-DSI and GridFTP file DSI to write a file
11.15 Performance obtained by parallel MAPFS-DSI to read a file
11.16 Performance obtained by parallel MAPFS-DSI to write a file
11.17 Average time of a monitoring query
11.18 Monitored data access time
11.19 Geographical location of the storage elements in the grid testbed
11.20 Network topology of the grid testbed
11.21 Average time to predict a parameter
11.22 Workload of each decision making stage for file create operations
11.23 Comparison between round robin and asynchronous transfer modes
11.24 Average time to make predictions
11.25 Average time for reading 10 MB file size
11.26 Average time for creating and writing 10 MB file size
11.27 Average time for reading 100 MB file size
11.28 Average time for creating and writing 100 MB file size
11.29 Average time for reading 1 GB file size
11.30 Average time for creating and writing 1 GB file size
11.31 Bandwidth improvement according to file size

List of Tables

11.1 Percentage of processing workload of the prediction
11.2 Expected and obtained frequencies for the prediction of I_UPM3
11.3 Processing workload of decision making for read/write operations
11.4 Processing workload of decision making for file create operations
A.1 Successive OverRelaxation algorithm

Chapter 1 INTRODUCTION

1.1 Motivations and scope of the work

Since the birth of computers, the Input/Output (I/O) stage has been the bottleneck of most computing systems, because it imposes a significant limit on application response times. This is due to the large difference among memory access, processor execution and I/O device access times. In spite of the constant evolution of computers and their hardware components, the gap between I/O and CPU operation times has not diminished. Improvements in disk access times have not kept pace with the increase in processor performance, which has improved by more than 50% per year.

Furthermore, the huge advances in computer science have caused computing power to grow at a dizzying rate. As a consequence, more ambitious problems can be posed and solved in several fields, such as Engineering, Biology and Physics. Most of these problems represent a grand challenge for scientists, and the only way to tackle them is to analyze ever-increasing amounts of data (the volume of data analyzed can be measured in terabytes and petabytes). This great storage need has caused storage capacity to increase drastically [Hug02], faster than the bandwidth and latency of the physical devices where data is accessed, because of their dependence on mechanical components. As a result, the access to ever larger volumes of information cannot be compensated for by the increase in device performance. Thus, the imbalance between the I/O system and the computing system is aggravated even more.


In this way, the I/O system is the slowest component of a computer, and its limitations prevent data-intensive applications from achieving good performance. This problem is traditionally known as the I/O crisis [PGK88].

In this sense, data accesses have become a key factor in the performance of systems and of the applications that run on them. Any improvement aimed at reducing data access time will have a notable influence on applications. With the aim of improving the I/O system to increase whole-system performance, several alternatives have arisen. Thanks to the reduction of device prices, several technologies based on the concept of parallelism have appeared since the end of the eighties and mainly during the nineties. Parallelism can be defined as the cooperation of several hardware and/or software components to solve a given problem, in this case, data access.

The use of parallelism has become a widely used approach that can help to mitigate, to some extent, the I/O crisis. Nevertheless, parallelism implies much more complex and sophisticated systems, although different solutions have been proposed at the hardware level. These hardware-based solutions are very varied, the most notable being the Redundant Array of Independent Disks (RAID) [PCGK89]. However, the technology keeps changing as time goes by.

Nowadays, it is common to use clusters of workstations and distributed environments in a parallel way instead of traditional mainframes or supercomputers (clusters predominated in the top500 list of the world's fastest supercomputers [TSS] in 2007). Clusters increase performance by combining several computers, which are seen as a single entity, creating a single system image. Among their advantages, it is worth emphasizing the use of several I/O devices in a parallel way, the distribution of the workload among all the nodes and their low cost compared to supercomputers.

Nevertheless, some grand challenge applications, like earth observation or high energy physics, demand such high computational and storage capacities that neither current systems nor combinations of the previous technologies, like clusters, are sufficient to solve them. A paradigmatic example is the imminent switch-on of CERN's new particle accelerator, the LHC (Large Hadron Collider), which is able to generate several petabytes of data per second.

In order to solve these grand data challenges and enhance their performance, it is advisable to extrapolate the concept of parallelism to a higher level, that is, to the technologies that can solve these hard problems. Since network bandwidth is increasing faster than storage capacity [Vil01], wide area networks (WAN) can be expected to serve as the infrastructure to provide this parallelism. Thus, a new technology based on sharing resources across WANs, named grid computing, can be used in a parallel way in order to deal with these problems.

Grid technology [FK04] provides access to a pool of geographically distributed computing and data resources. The enormous number of available data resources makes it possible to create a great storage space, which makes data-intensive problems solvable. These resources can be heterogeneous, such as clusters of workstations, mainframes or servers. Since resources at any geographical location can be used, it is possible to increase the computing power and storage capacity, solving problems that were previously unapproachable.

However, the creation of this parallel grid infrastructure does not only offer advantages. On the one hand, it is necessary to take into account the increase in complexity caused by the use of parallelism in this kind of environment. On the other hand, the heterogeneity of grid elements and especially their high number make the environment difficult to manage. This complexity significantly affects the performance of the system and hinders its management.

These complexity problems merge with the new crisis detected in current software systems. The rapid growth of system complexity has made system management a very hard task, preventing the achievement of maximum performance. This problem is named the software complexity crisis [Hor01]. This crisis has not been solved yet, although there is a known approach to tackle it: autonomic computing [KC03], which aims to make the system responsible for managing itself. This idea refers to systems that can adapt themselves to the changes occurring in a dynamic environment, providing learning capabilities.

Thus, applying autonomic techniques in grid environments brings a great advantage because of the high complexity of this type of environment. Specifically, grids whose aim is the access to distributed data resources and their management, called data grids [CFK+ 00], do not usually offer autonomic capabilities in spite of their complexity, due to the huge number of heterogeneous storage resources and data sources. In this way, it becomes possible to avoid human participation in the system management, facilitating its efficient operation and increasing the performance of the I/O system.

In summary, current challenge applications require efficient I/O infrastructures that provide a huge amount of computing power and storage capabilities and that simplify their deployment and management. These two aspects are both sides of the same coin, involving the two fields introduced above, grid and autonomic computing. In this way, this work has a multidisciplinary vision. It tries to solve I/O performance problems in new-generation distributed systems, like a grid, applying optimization methods that are often used in other fields, such as parallelism and autonomic computing techniques. This approach aims to optimize the I/O phase, which is critical in data grids, with a positive and remarkable influence on the resolution of data-intensive challenge problems.

1.2 Antecedents

The present Ph.D. thesis arises from the research initiated by the Operating Systems Group of the Departamento de Arquitectura y Sistemas Informáticos of the Facultad de Informática of the Universidad Politécnica de Madrid. In parallel I/O systems, this research line lies in the design of an I/O multiagent architecture adapted to cluster environments. The idea was based on the use of agents to provide high performance for I/O operations. This work concluded with the MAPFS file system [Pér03].

Since current challenge problems require analyzing ever-increasing amounts of data, the experience of the aforementioned research group in I/O systems is a sound starting point to find solutions to optimize the I/O phase when this kind of grand data-intensive problem is addressed. Since these problems require huge volumes of data, which can be stored in a distributed way around the world, it is necessary to improve the performance of I/O accesses in the environments where grand challenge problems can be solved, like a grid. This infrastructure provides the computing and storage capacity required to deal with these problems.

The great complexity of managing these environments requires experience complementary to that of the Operating Systems Group. This involves collaboration with other research groups that work on autonomic systems. Thus, this latter research line is being developed as a collaboration between research groups belonging to both the Universidad Politécnica de Madrid and the Barcelona Supercomputing Center.

In this sense, this Ph.D. thesis combines both research lines to solve challenging data-intensive problems, sharing the basic I/O architecture and aiming at the optimization and simplification of I/O accesses. In summary, the motivation of this Ph.D. thesis is twofold:

1. Development of a high performance storage system for grid environments.

2. Design of an autonomic storage architecture to self-manage the performance of the underlying I/O system.


1.3 Objectives

The I/O crisis and its effect on data-intensive applications aimed at solving the new scientific challenges motivate this Ph.D. thesis. Most of the new challenges can only be solved by means of infrastructures providing great computing power and storage capacity. Nevertheless, the heterogeneity and distributed nature of these environments aggravate the I/O phase performance and involve great complexity, which makes its improvement difficult. Thus, the central aim of this work is to demonstrate the feasibility of applying both autonomic computing and I/O improvement techniques to highly distributed environments, like a grid, in order to enhance the manageability and performance of their I/O systems. In order to demonstrate this initial hypothesis, an I/O architecture for grid environments must be designed and implemented with the aim of providing a high-performance and self-managing solution.

There are different approaches to tackling the low performance of the I/O system. In this work, the use of parallelism has been selected, mainly for the following reasons:
1. It is one of the most used approaches to address the I/O crisis.
2. It reduces the bottleneck that the access to a single storage element constitutes.
3. It improves the use of the resources that make up the environment. Data distribution can be shared among several resources instead of overloading a single element.
Thus, the proposed architecture will be based on the parallel access to heterogeneous storage resources. The use of heterogeneous elements also enables a huge number of resources to be accessed in a parallel way. However, this heterogeneity means that the optimization of the I/O system inside each storage element cannot be done in the same way. Therefore, this work tries to cover not only the parallelism among the heterogeneous grid elements but also the optimization of the I/O inside grid resources (clusters being one of the most used elements in these environments).

On the other hand, the great heterogeneity and scale of grid infrastructures aggravate the software complexity crisis, preventing optimum operation. Thus, the reduction of complexity is a key point in optimizing these environments and therefore becomes a cornerstone of this Ph.D. thesis. In this sense, this work focuses on applying the autonomic computing paradigm to data grid environments in order to provide optimization automatically. Although the complexity reduction is closely related to the I/O field, the proposal must separate autonomic and I/O concepts. In this way, the resulting autonomic system will be able to provide autonomic capabilities not only to the proposed parallel I/O architecture but also to any data grid system.


Since a grid can be composed of hundreds of thousands of resources, the autonomic management of the I/O system is aimed at selecting the elements that are most suitable for specific I/O operations. All the proposals in the state of the art dealing with data storage in grids follow this approach, but, as far as is known, no proposal takes into account that data is often not required at the same time it is produced, so they only optimize current operations. Nevertheless, read operations require better efficiency than write operations, because data is usually read more often than it is written [RTC05] and applications demand data immediately (that is, reads) in order to analyze it. This means that not only the performance of current operations but also that of future read operations must be improved, since the actual access to data will take place later on. Thus, prediction methods must be used automatically to know how the system will behave in the long term and to make decisions that enhance the performance of current and later I/O operations.

As a first prototype of the autonomic system, the I/O system itself, created to demonstrate the I/O optimization by means of a parallel approach, will be used to show the improvements obtained by the autonomic management. Nevertheless, the autonomic system must be flexible enough to be used with other grid I/O systems if its interface is correctly implemented. The combination of these two aspects provides autonomic high performance storage for grid environments based on long term prediction. Therefore, the proposed approach has the following objectives:
• Feasibility analysis of the use of parallelism in data grid environments to increase the I/O performance.
• Investigation of data grid access methods to optimize the I/O techniques often used in these infrastructures.
• Improvement and adjustment of the monitoring techniques for distributed and complex environments to provide knowledge about the past and current system behavior.
• Formal study and definition of autonomic capabilities to reduce the complexity and optimize the current and future I/O operations in complex environments, like a grid.
To fulfill the previous objectives, the following milestones are proposed:
• The design and implementation of a parallel I/O architecture for grid infrastructures based on the use of heterogeneous storage elements, such as PCs or clusters. This proposal must suggest not only a single optimum solution, but also different ways of optimizing the I/O in these environments depending on the technique used to access data.
• The definition of an autonomic system that manages the environment complexity by itself.
• The definition of a monitoring system adapted to very complex and distributed environments.
• The definition of a formalism that allows the system to predict and make decisions to optimize I/O accesses. This must be provided in a way that is transparent to the client and automatic for the system. Additionally, the system must enable the inclusion of other functionalities, such as fault-tolerance, file data discovery and security, without modifying grid resources.

In summary, this work tries to cover two fundamental problems that influence the solution of data-intensive applications in complex environments: the I/O crisis and the software complexity crisis. The proposal must offer an integral solution that allows challenging data-intensive applications to increase their performance.

1.4 Document organization

The rest of this document is divided into three thematic blocks. Each one is divided into chapters to facilitate its understanding:
• State-of-the-art: the first part gathers the current state of the research areas related to this work, making it possible to state the problem from a needs assessment of the research literature.
– Chapter 2 analyzes the parallel I/O approach.
– Chapter 3 describes several distributed computing technologies designed to solve challenge problems that demand great computing and storage capacities.
– Chapter 4 presents the advances made to tackle the growing complexity of computer systems, which causes the software complexity crisis. The aim is to find automatisms to obtain the maximum performance of applications without requiring expert knowledge or complex control systems.
– Chapter 5 studies several statistical methods that can help to model autonomic systems. Whereas clustering is useful to group and simplify information, prediction and decision making techniques can be used to anticipate the future system behavior and select the best resources where data must be stored in order to optimize further I/O requests.
• Problem statement and proposal: the second part constitutes the core of this work. It states the problem to be solved and the proposed approach. The proposal combines the research lines of data grid and autonomic computing in order to provide a grid-based environment whose main aim is to increase the performance of the I/O phase for data-intensive applications. Furthermore, the high complexity of the system is decreased by means of autonomic techniques. Finally, this part shows the evaluation of the approach.
– Chapter 6 presents the problem statement, based on minimizing the I/O and software complexity crises in grid environments.
– Chapter 7 defines a grid-based architecture to enhance the performance of data grid applications. The aim is to provide high performance grid storage.
– Chapter 8 describes an autonomic storage-based framework to make decisions automatically in order to improve I/O requests. The proposed autonomic solution is independent of the grid I/O system used.
– Chapter 9 proposes a way of properly monitoring different measurements about the operation of a grid infrastructure. This makes it possible to know the behavior of this kind of complex and heterogeneous environment.
– Chapter 10 states the mathematical formalism required to act automatically on the system. This formalism includes both a prediction and a decision making phase.
– Chapter 11 evaluates the performance increase achieved with the proposed approach. The evaluation is seen from two points of view: firstly, the performance enhancement provided by the proposed high performance data grid storage, and then the improvements obtained by means of the autonomic system.
• Conclusions and future work: the last part shows the conclusions of this doctoral thesis and the open lines of this proposal.
– Chapter 12 extracts the most important conclusions obtained from the achievements of this work and describes the possible research lines that arise from its development.


Part I

STATE-OF-THE-ART


Chapter 2 PARALLEL I/O

Most computer science applications need information beyond the data resident in memory in order to run. Therefore, interaction with the devices in charge of providing the required information is necessary. The I/O system is responsible for managing these devices.

In recent years, the I/O system has been left behind by technological advances. Whereas processors run at clock rates above 3 GHz and accesses to the Random Access Memory (RAM) take nanoseconds, I/O devices take a long time (milliseconds in the best cases) to be accessed. Due to this slowness and the growing use of interactive applications, the I/O system has become the system bottleneck. Therefore, any improvement of the underlying I/O system can lead to an optimization of the whole system.

As can be seen, computers would lose part of their usefulness if applications could not interact with I/O devices to obtain the information required for their execution. Thus, I/O devices have a great importance. There are different types of I/O devices:
• Peripherals: they provide a user interface, e.g., the keyboard, the mouse or the printer.
• Secondary storage or storage devices: they provide non-volatile data storage, e.g., hard disks and tapes.
• Communication devices: they make the connection with other machines possible. The modem and the network card are clear examples of this type.


This work focuses on the storage devices that give support to file systems.

2.1 Secondary storage

Secondary storage devices offer the following features:
• Non-volatile data storage.
• High-speed access. They are under the RAM layer in the memory hierarchy.
• Ability to perform read and write operations.
• File system support.

Figure 2.1: Evolution of the storage capacity of hard disks [Yel06]

The principal parameters that indicate the performance of secondary storage devices, specifically hard disks, are capacity and bandwidth. According to Moore [Moo65], disk storage capacity doubles every 12 to 18 months as long as density keeps increasing. This growth is shown in Figure 2.1. Like the capacity, the disk performance is also growing, but this increase is slower than the capacity growth. Figure 2.2 shows that disk access times have been halved (from 13.25 ms to 6.5 ms) in the same period in which an approximately 36-fold capacity increase has taken place (from 2 GB to 73 GB). This poses a problem: disks can contain much more data, but the disk bandwidth does not grow proportionally to the increase in capacity.
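To make the imbalance concrete, the short Python sketch below compares the growth factors implied by the figures quoted above; the numbers are only the approximate values read from Figures 2.1 and 2.2, and the calculation is purely illustrative.

# Rough comparison of the disk figures quoted above (illustrative only):
# capacity grew from 2 GB to 73 GB while access time only halved.

capacity_old_gb, capacity_new_gb = 2, 73
access_old_ms, access_new_ms = 13.25, 6.5

capacity_growth = capacity_new_gb / capacity_old_gb      # ~36x
access_speedup = access_old_ms / access_new_ms           # ~2x

print(f"Capacity growth factor:   {capacity_growth:.1f}")
print(f"Access-time improvement:  {access_speedup:.1f}")

# Even assuming bandwidth improved as much as access time, reading the whole
# disk now takes roughly capacity_growth / access_speedup times longer.
print(f"Relative full-disk scan time: {capacity_growth / access_speedup:.1f}x")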

Furthermore, the emergence of new applications requiring larger amounts of data and faster access does not fit with traditional computing.


Figure 2.2: Evolution of the hard disk performance [Yel06]

Moreover, this problem becomes even harder due to the appearance of Grand Challenge Applications (GCAs). GCAs [GCA] represent computational problems that cannot be solved in a reasonable time with current technology. Applications that model big DNA structures or forecast climate change are some examples of this kind of application. The main problem is the great amount of data that such applications have to analyze.

Therefore, when an application uses the I/O system many times, the problem becomes critical, mainly due to two factors:
• The use of a large volume of information. I/O devices take a long time to access this amount of data.
• The I/O crisis [PGK88, KGP89, Pat94]. As has been seen, I/O systems have traditionally been the bottleneck of computing systems because of the difference between I/O device access and computing times, preventing data-intensive applications from achieving good performance. This problem is traditionally known as the I/O crisis, and it has not been solved in general-purpose distributed systems.
Thus, the problem is clear: a higher capacity is required without degrading performance because of the bandwidth (if a higher capacity is obtained, the bandwidth must be increased in order to keep the access time to the whole disk constant). Different solutions have been proposed to solve the I/O crisis. Two of the most important proposals are:
• The use of high performance storage systems and/or Storage Area Networks (SANs). SANs [GNA+ 97] connect storage devices (for instance, disks and tapes) to servers, creating a storage space that can be managed as a whole.
• The use of parallelism, distributing file data among several devices and/or servers. Parallelism is the most common choice [HO01].
This work will concentrate its efforts on the use of several disks or servers accessed in a parallel way to increase the I/O bandwidth.

2.2 Device-level parallelism

The origin of parallelism is linked to the emergence of the Redundant Array of Inexpensive Disks (RAID).

2.2.1 RAID

RAID systems were described at the beginning of the eighties [Law81, PB86], although they did not become well known until the work of a research group at UC Berkeley [PGK88, KGP89, PCGK89] at the end of the eighties.

RAID is based on accessing multiple disks as a single device. Depending on the RAID level, different improvements, in both fault-tolerance and bandwidth, can be obtained, since all the disks are accessed in a parallel way. The drawback of RAID systems is precisely their reduced reliability: the probability that some disk fails grows as the number of disks increases. Thus, it is important to provide fault-tolerance.
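The reliability argument can be quantified with a simple model: if each disk fails independently with probability p over a given period, an n-disk array without redundancy loses data with probability 1 - (1 - p)^n. The sketch below, using made-up values for p and n, shows how quickly this probability grows, which is why redundancy mechanisms are needed.

# Probability that at least one disk in an n-disk array fails, assuming
# independent failures with per-disk probability p over the same period.
# The values of p and n below are illustrative, not measurements.

def array_failure_probability(p: float, n: int) -> float:
    """P(at least one of n independent disks fails) = 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p) ** n

p = 0.03  # assumed per-disk failure probability
for n in (1, 4, 16, 64):
    print(f"{n:3d} disks -> {array_failure_probability(p, n):.1%} chance of losing data")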

There are several RAID levels, namely:
• RAID 0: it is a non-redundant disk array and, therefore, it does not provide fault-tolerance. The striping (i.e., the distribution of data among different devices in an iterative, well-balanced way) is performed by means of a round-robin data distribution among all the disks, obtaining high bandwidth and large capacity.
• RAID 1: it is an array of mirrored disks. Data is duplicated to increase its availability, but the parallelism and especially the capacity are decreased. As Figure 2.3 shows, the system is fault-tolerant, since if some disk fails, its mirrored disk can be accessed in order to obtain the required data.


Figure 2.3: RAID 1 [RAI].

• RAID 5: it is similar to RAID 0, but it uses parity to provide fault-tolerance. The parity is distributed among all the disks and is obtained by applying the XOR function to the blocks of the remaining disks. It tolerates the failure of a single disk at the cost of writing parity in the corresponding disk. Therefore, it obtains good performance, better than RAID 1, while being tolerant to the failure of a single disk. Figure 2.4 shows this structure. Read operations are efficient due to the possibility of reading in a parallel way, but write operations are more costly because of the parity calculation. The fraction of the storage capacity used for useful data is high and grows with the number of disks, although the failure recovery is costly. A simplified block-placement sketch for these three levels is shown below, after Figure 2.4.

Figure 2.4: RAID 5 [RAI].
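As a rough illustration of the three layouts described above, the following sketch maps a logical block number to physical locations for RAID 0, RAID 1 and RAID 5. The parity rotation used for RAID 5 is an assumption for the example; real controllers may rotate parity differently.

# Minimal sketch of block placement for RAID 0, RAID 1 and RAID 5 on n disks.
# The RAID 5 parity rotation shown here is one possible choice.

def raid0_place(block: int, n: int):
    """Round-robin striping: return (disk, offset) for a logical block."""
    return block % n, block // n

def raid1_place(block: int, n: int):
    """Mirroring: the same offset is written on every disk."""
    return [(disk, block) for disk in range(n)]

def raid5_place(block: int, n: int):
    """Striping with rotated parity: return (data disk, stripe, parity disk)."""
    stripe = block // (n - 1)              # each stripe holds n - 1 data blocks
    parity_disk = (n - 1) - (stripe % n)   # parity block rotates across disks
    slot = block % (n - 1)
    data_disk = slot if slot < parity_disk else slot + 1
    return data_disk, stripe, parity_disk

for b in range(8):
    print(b, raid0_place(b, 4), raid5_place(b, 4))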


One of the most important problems of RAID systems appears when small writes take place. This situation is known as the small-write problem [Gib91]. A write can be so small that it only affects some disks of the whole RAID. However, when a parity-based RAID level is used, it is necessary to read the rest of the disks to obtain the parity, and then to write both the data and the calculated parity. This means a high overhead with regard to the size of the write, degrading system performance. There are several solutions; here, just one based on the RAID system, named AutoRAID, will be analyzed.
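The cost of a small write on a parity-based array can also be seen in the read-modify-write sequence sketched below: updating a single data block requires reading the old data and the old parity, recomputing the parity with XOR, and writing both back, i.e. four disk operations for one logical write. This is a simplified model that ignores caching and full-stripe writes; the read_block and write_block callbacks are hypothetical.

# Simplified read-modify-write update of one block in a parity-based array.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(read_block, write_block, data_disk, parity_disk, offset, new_data):
    """Update one data block on a parity-based array: four disk operations."""
    old_data = read_block(data_disk, offset)       # 1. read old data
    old_parity = read_block(parity_disk, offset)   # 2. read old parity
    new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
    write_block(data_disk, offset, new_data)       # 3. write new data
    write_block(parity_disk, offset, new_parity)   # 4. write new parity

# Tiny in-memory demo with two 4-byte "disks" (offset ignored in this toy):
store = {0: b"\x00" * 4, 1: b"\x00" * 4}           # disk 0 = data, disk 1 = parity
read_block = lambda disk, off: store[disk]
def write_block(disk, off, data): store[disk] = data
small_write(read_block, write_block, 0, 1, 0, b"DATA")
print(store)   # parity equals the data, since the old contents were zeros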

AutoRAID [WGSS96] is an evolution of the disk cache concept. It is based on the joint use of several RAID levels in order to improve performance. The AutoRAID architecture is composed of two levels of hierarchy. For instance, RAID 0 can be chosen for the highest level and RAID 5 for the lowest, both on the same array. Write operations are performed on RAID 0, but if the information is essential, it is written on RAID 5. Read operations can be performed at any level of the hierarchy. When the array is not busy, the information is moved from RAID 0 to RAID 5. Thus, the performance of write operations is increased, avoiding the small-write problem, because the operation is performed on a non-parity-based RAID level.
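A toy model of the two-level idea is sketched below: writes land in the fast upper level, reads are served from whichever level holds the block, and a background step migrates blocks to the parity-protected lower level when the array is idle. The class and migration policy are invented for illustration and do not reproduce AutoRAID's actual algorithms.

# Toy two-level hierarchy: a fast upper level absorbs writes and a
# parity-protected lower level receives blocks migrated when the array is idle.

class TwoLevelArray:
    def __init__(self):
        self.upper = {}   # block id -> data (e.g. mirrored/striped level)
        self.lower = {}   # block id -> data (e.g. RAID 5 level)

    def write(self, block, data):
        self.upper[block] = data            # writes always hit the fast level

    def read(self, block):
        # a block may live in either level
        return self.upper.get(block, self.lower.get(block))

    def migrate_when_idle(self, max_blocks=64):
        """Move up to max_blocks from the upper to the lower level."""
        for block in list(self.upper)[:max_blocks]:
            self.lower[block] = self.upper.pop(block)

array = TwoLevelArray()
array.write(7, b"hot data")
array.migrate_when_idle()
print(array.read(7))   # still readable after migration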

2.2.2 Heterogeneous disk arrays

Although RAID involved a great advance in parallelism, some problems remain unsolved. In RAID, all disks are considered homogeneous, which means that all disks are treated in the same way even if heterogeneous disks are used. In that case, all disks are seen as if they had the capacity and performance of the worst of them.

Nowadays the use of heterogeneous disks is increasing, because fast technological advances mean that when a disk is replaced or added in a homogeneous RAID, the new one offers better features than the old ones. Heterogeneous disk arrays are the solution to this problem, using all the capabilities of heterogeneous disks in a parallel way. Some instances of heterogeneous disk arrays are V:Drive [BHMadH+ 04], HERA [ZG00] and AdaptRaid [CL03].

2.2.2.1 V:Drive

V:Drive [BHMadH+ 04] is an out-of-band storage virtualization system that abstracts from the physical storage system in a storage area network (SAN). Its distribution scheme, named Share-strategy [BSS02], allows the system to use heterogeneous storage components. To do so, each disk is partitioned into minimum-sized units of contiguous data blocks. To handle the heterogeneous disk features, the unit length for each disk is selected so as to reduce the heterogeneous placement problem to a homogeneous one. Then, a distribution strategy for uniform capacities [KLL+ 97] is used to determine where the access must be made.

2.2.2.2 HERA

Heterogeneous Extension of RAID (HERA) [ZG00] is a framework that extends RAID allowing different disks to be used. The contribution of HERA is mainly the application of fault-tolerance techniques to heterogeneous storage and the use of disk merging for managing heterogeneous disk arrays. Disk merging [ZG97] builds a collection of logical disk drives from an array of heterogeneous physical disks. The logical disk drives are seen as homogeneous, enabling RAID to be applied over them. A logical disk drive can be built using a fraction of the bandwidth and capacity provided by several physical disks.

2.2.2.3 AdaptRaid

AdaptRaid [CL03] is a distribution scheme that extends RAID to heterogeneous disk arrays, focusing on two cases, one without data replication (AdaptRaid0) and another with parity (AdaptRaid5). AdaptRaid tries to make the most of the disks with better capacity and bandwidth, storing more data blocks in them. Thus, the placement of whole stripes is kept as long as possible to take advantage of a higher level of parallelism.

AdaptRaid0 proposes the concept of a pattern of stripes. The idea is to see the disks as smaller than they actually are, keeping the same size proportions among all of them. Blocks are then distributed over this small array using as many full-width stripes as blocks fit in the smallest disk, taking advantage of all disks in parallel. The same process is repeated with the rest of the disks until all disks in the small array are full. The resulting pattern, in which the number of stripes and blocks per disk is fixed, is then repeated over the remaining capacity.
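The following sketch is a simplified reading of this pattern idea, not the published AdaptRaid0 algorithm: full-width stripes are used while the smallest disk still has room, and progressively narrower stripes are placed on the remaining, larger disks (capacities are expressed in blocks and are illustrative).

    def stripe_pattern(capacities):
        # capacities: blocks available on each disk within one pattern
        remaining = {disk: blocks for disk, blocks in enumerate(capacities)}
        stripes = []
        while remaining:
            rounds = min(remaining.values())      # stripes until the smallest disk fills up
            members = sorted(remaining)           # disks taking part in these stripes
            stripes += [members] * rounds
            remaining = {d: c - rounds for d, c in remaining.items() if c - rounds > 0}
        return stripes

    # stripe_pattern([4, 2, 1]) -> [[0, 1, 2], [0, 1], [0], [0]]:
    # one stripe over all disks, one over the two largest, two over the largest only.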

AdaptRaid5 follows the same approach as AdaptRaid0, keeping the parity distributed among all the disks. A stripe must have at least two blocks in order to store the parity block. To avoid small-write problems, the stripe length is restricted to a divisor of the largest stripe, which involves losing capacity. This is solved by starting each stripe in a different disk and using a Tetris-like algorithm to fill free spaces with subsequent units from further stripes.


2.3 File system-level parallelism

The access to secondary storage devices needs an abstraction layer, which hides the physical details of these devices. Furthermore, the access is not secure if the user accesses the physical level directly, where there are no restrictions. This abstraction layer is named file system.

File systems are the middleware between the devices and users or user applications that:

• Provide a logical vision of the devices.
• Offer access independently of the physical details of the devices.
• Provide protection mechanisms.

Parallel file systems can be built following the idea of parallelism shown above. According to [Sto98], a parallel file system is a file system that avoids the I/O bottleneck by logically aggregating independent storage devices and I/O nodes into a high performance storage system. The bandwidth is increased because of:

• The independent addressing of the disks: data of different files can be accessed concurrently.
• Data declustering: data of the same file can be accessed in parallel (a small mapping sketch is given below).

In short, the principal advantages obtained by means of parallel storage are the following:

• Higher capacity: the capacity of the different devices can be added up because the storage space is distributed.
• Performance increase: when the access to n devices is performed in parallel, the performance might be multiplied by n (the maximum increase that can be obtained depends on several factors).
• Fault-tolerance: redundant information can be used for information recovery in case of a failure.

Among all the parallel file systems, PVFS, GPFS, Lustre and MAPFS stand out for our purpose.
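As an illustration of data declustering, the following sketch maps a file offset to a (server, local offset) pair under a simple round-robin distribution; the stripe unit size and the number of I/O nodes are arbitrary example values, not those of any particular file system.

    STRIPE_UNIT = 64 * 1024      # bytes per stripe unit (illustrative)
    N_SERVERS = 4                # number of I/O nodes (illustrative)

    def locate(offset):
        unit = offset // STRIPE_UNIT               # which stripe unit of the file
        server = unit % N_SERVERS                  # round-robin choice of I/O node
        local = (unit // N_SERVERS) * STRIPE_UNIT + offset % STRIPE_UNIT
        return server, local

    # locate(0) -> (0, 0); locate(65536) -> (1, 0); locate(262144) -> (0, 65536)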

2.3.1 PVFS

PVFS (Parallel Virtual File System) [CLIRT00, CIRT00, CCL+ 02] is a parallel file system, currently targeted at clusters of workstations or Beowulfs [Beo]. PVFS goals are twofold: to be used as a platform for parallel I/O research, and to serve as a production file system for the cluster community. This parallel file system stripes file data across multiple disks in different nodes of a cluster. In this way, potential bandwidth is increased and network bottlenecks are minimized. PVFS offers multiple user interfaces, including MPI-IO, the traditional Linux file system interface and a native PVFS library interface.

PVFS has two releases: PVFS1 and PVFS2. PVFS1 has four major components:

1. Metadata Server: it manages all file metadata for PVFS1 files.
2. I/O Server: it handles storing and retrieving file data stored on local disks connected to the node.
3. PVFS1 Native API: it provides user-space access to the PVFS1 servers.
4. PVFS1 Linux Kernel Support: it provides the functionality necessary to mount PVFS1 file systems on Linux nodes.

PVFS1 has some problems. The most important ones are the following: (i) it is socket centric, (ii) it is single-threaded, (iii) it does not support heterogeneous systems with different endianness and (iv) it is too dependent on the OS buffering and file system characteristics. Due to these problems, PVFS2 [MLRC04] was implemented.

A PVFS2 file system consists of the following pieces:

1. The pvfs2-server: the only server process; it can play two roles, I/O server, which stores the actual data associated with each file, and Metadata Server, which stores metainformation about files.
2. The System Interface: the lowest-level API that provides access to the PVFS2 file system.
3. The Management Interface: a supplemental API that adds functionality usually not exposed to file system users.
4. The Linux Kernel Driver: a module that can be loaded into an unmodified Linux kernel in order to provide Virtual File System (VFS) support for PVFS2.
5. The ROMIO PVFS2 Device: it provides MPI-IO support for PVFS2 (a minimal usage sketch is given below).
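Since PVFS2 is usually accessed from parallel applications through MPI-IO (via the ROMIO device listed above), the following minimal sketch shows each process writing its own block of a shared file. It uses mpi4py; the file path, the ROMIO "pvfs2:" prefix and the block size are assumptions for illustration only.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Each process writes a disjoint 1 MiB block of the shared file.
    block = np.full(1024 * 1024, rank % 256, dtype=np.uint8)
    amode = MPI.MODE_WRONLY | MPI.MODE_CREATE
    fh = MPI.File.Open(comm, "pvfs2:/mnt/pvfs2/demo.dat", amode)   # illustrative path
    fh.Write_at(rank * block.nbytes, block)
    fh.Close()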

2.3.2 GPFS

The General Parallel File System (GPFS) [SH02] was originally developed by IBM on AIX for SP systems. Recently, IBM ported GPFS to Linux.


GPFS stripes all of its data across all of the storage targets attached to its servers. On Linux, GPFS operates in two modes: Network Shared Disk (NSD) and SAN direct-attached disk. Furthermore, GPFS relies on Reliable Scalable Cluster Technology (RSCT). RSCT [RSC] is a subsystem that monitors the heartbeats of GPFS member nodes. The heartbeat monitors the uptime, network and other attributes of the nodes through the RSCT daemons running on each of the nodes.

2.3.3 Lustre

Lustre is an open source parallel file system for Linux developed by Cluster File Systems (CFS) [Bra02]. The name Lustre comes from combining the words “Linux” and “cluster”. It integrates into an existing clustering environment by utilizing existing network, storage and computing resources. In addition, it uses many industry standard open source components, such as an LDAP server, XML formatted configuration files and the Portals networking protocol.

The file system architecture consists of:

• Meta Data Servers (MDS). The MDS keeps track of information about the files stored in Lustre.
• Object Storage Targets (OST). The OSTs are daemons that run on nodes directly connected to the storage targets, called Object Based Disks (OBD). The OBDs can be any kind of disks, including SCSI, IDE, SAN or RAID storage array disks.
• Clients. Clients send a request to the MDS to find out where the data resides. After the MDS responds, the clients send the request directly to the OSTs holding the data (a toy sketch of this access path is given below).

Lustre writes on supported block devices, including SAN storage targets. However, since the OSTs and clients cannot exist on the same node, clients cannot perform direct I/O to the OSTs and OBDs. Hence, all data served from the OSTs travel over the Portals-supported devices.
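The two-step access path can be illustrated with a toy sketch in which in-memory dictionaries stand in for the MDS and the OSTs; it only mimics the metadata-then-data protocol described above and is not Lustre code (file names, stripe layout and contents are invented).

    # Toy metadata: file -> striping layout; toy OSTs: per-file object chunks.
    mds = {"/data/run42": {"osts": ["ost0", "ost1"]}}
    osts = {"ost0": {"/data/run42": [b"AAAA", b"CCCC"]},
            "ost1": {"/data/run42": [b"BBBB", b"DDDD"]}}

    def read_file(path):
        layout = mds[path]                       # 1) ask the MDS where the data resides
        targets = layout["osts"]
        parts, i = [], 0
        while True:                              # 2) fetch the chunks directly from the OSTs
            chunks = osts[targets[i % len(targets)]][path]
            idx = i // len(targets)
            if idx >= len(chunks):
                break
            parts.append(chunks[idx])
            i += 1
        return b"".join(parts)

    assert read_file("/data/run42") == b"AAAABBBBCCCCDDDD"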

2.3.4 MAPFS

MAPFS (MultiAgent Parallel File System) [P´er03, PCFGR06] has been developed at the Universidad Polit´ecnica de Madrid (UPM) since 2003. It arises from an innovative idea in the field of file systems: using the paradigm of agents, tackling the problems from a different perspective, that of Distributed Artificial Intelligence (DAI). The advantage obtained from agent technology is its conceptual contribution, offering the applications a set of properties that ease their adjustment to complex and dynamic environments.

With the goal of performing its tasks, MAPFS is composed of two subsystems with different responsibilities:

1. MAPFS FS, which implements the parallel file system functionality.
2. MAPFS MAS, responsible for the information retrieval and other additional tasks.

MAPFS MAS is an independent subsystem, which provides support to the major subsystem (MAPFS FS) in three different areas:

• Information retrieval: the main task of MAPFS MAS is to facilitate the location of the information to MAPFS FS. Data is stored in I/O nodes, that is, data servers.
• Caching and prefetching services: MAPFS takes advantage of the temporal and spatial locality of data stored in servers. Cache agents in MAPFS MAS manage this feature.
• Use of hints: the use of hints related to different aspects of data distribution and access patterns allows MAPFS to increase the performance of these operations.

Figure 2.5: MAPFS three-tier architecture (client layer with the MAPFS clients and the MAPFS FS interface; middle layer with the multiagent system, the groups server and the storage groups; server layer with traditional and heterogeneous servers accessed through MPI) [P´er03]


MAPFS is a three-tier architecture, whose layers are:

• MAPFS clients: they implement the MAPFS FS functionality, providing parallelism, the file system interface and the interface to the server side for accessing data.
• Storage servers: these servers only store data and metadata. Nevertheless, a formalism called storage groups [PSPR05] is defined for providing dynamic capacities to storage servers, without modifying them.
• Multiagent subsystem: MAPFS MAS is composed of agents responsible for performing additional tasks, which can be executed on different nodes, including data servers. The tasks of the multiagent system are mainly: (i) giving support to the file system for data retrieval; (ii) cache management; (iii) storage group management; and (iv) hint creation and management.

Figure 2.5 represents the three-tier architecture, showing the three layers and the relations among them. At the top of the hierarchy, the file system interface is shown. MAPFS clients are connected to two other modules: (i) the multiagent subsystem, through the MAPFS MAS interface, and (ii) the data servers, through the corresponding access interface.

For implementing the multiagent subsystem, MPI (Message Passing Interface) [GLS94, MPIa] is used, as the sketch below illustrates. This technology provides the following features:

1. MPI is a standard message-passing interface, which allows agents to communicate with one another by means of messages.
2. The message-passing paradigm is useful for synchronizing processes.
3. MPI is broadly used in clusters of workstations.
4. It provides both a suitable framework for parallel applications and dynamic management of processes.
5. Finally, MPI provides operations for modifying the communication topologies.

In short, MAPFS can be defined as a multiagent and parallel file system, oriented firstly to its use in clusters and high performance systems. The aim is to extend it as a file system using remote and heterogeneous machines connected by means of grid technology [PCG+ 03].
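The sketch below gives a flavour of such agent communication using mpi4py; the roles, tags and message contents are invented for illustration and do not correspond to the actual MAPFS MAS protocol. A client-side process asks a hint agent (rank 0) for access hints before issuing I/O.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        # Hypothetical hint agent: answers one request from every other process.
        for _ in range(comm.Get_size() - 1):
            request = comm.recv(source=MPI.ANY_SOURCE, tag=1)
            hints = {"stripe_size": 64 * 1024, "prefetch": True}
            comm.send(hints, dest=request["reply_to"], tag=2)
    else:
        # Client-side agent requesting hints for a file it is about to access.
        comm.send({"file": "/data/experiment.dat", "reply_to": rank}, dest=0, tag=1)
        hints = comm.recv(source=0, tag=2)

Such a sketch would be launched, for instance, with mpiexec -n 4 python agents.py.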

2.4 Parallel I/O hierarchy

Parallelism can also be applied to all the levels of the hierarchy shown in Figure 2.6. A good parallel system should coordinate all the levels in order to obtain better performance.


Figure 2.6: Parallel I/O hierarchy

Besides the parallelism among disks provided by RAID, parallelism can also be achieved by using several machines in a distributed system. This can increase the I/O bandwidth [dRBC93] if data accesses are performed in parallel. In this case, parallelism is obtained by distributing the information among several servers and disks.

Figure 2.7: Moore’s law comparison between the increase expectations of the network bandwidth and storage capacity [Vil01]

Moreover, as has been previously explained, the capacity of storage devices is doubled every 12 months, but the network bandwidth is doubled every 9 months. This difference in the growth rates, observed in Figure 2.7, means that most of the limitations that prevented the network from acting as a useful element for parallel access to different servers are being eliminated. In this sense, the parallelism concept can be extrapolated to a level on top of the usual technologies, for instance, by using Wide Area Networks (WAN).
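Taking the doubling periods above at face value, a quick back-of-the-envelope computation shows how fast the gap widens in favour of the network:

    # Storage capacity doubles every 12 months; network bandwidth every 9 months.
    for years in (1, 3, 5):
        storage = 2 ** years                  # growth factor after `years` years
        network = 2 ** (12 * years / 9)
        print(f"{years} years: storage x{storage:.1f}, network x{network:.1f}, "
              f"ratio {network / storage:.2f}")

Under these assumptions, after five years network bandwidth has grown slightly more than three times as much as storage capacity.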


Chapter 3

DISTRIBUTED COMPUTING

In 1965, Gordon Moore, trying to predict technological advances, stated that “the number of transistors per chip that yields the minimum cost per transistor has increased at a rate of roughly a factor of two per year” [Moo65]. Some years later, once the technology evolution had been observed, the period was restated as 18 months.

Forty years after Moore’s law was stated, its conclusions still hold, and nowadays the available computing power would have been unimaginable some time ago. As the available resources increase, the desire for knowledge grows and the problems faced become more complex.

Current society tries to obtain the maximum benefit/cost ratio, and Computer Science follows the same principle. In the last years, this idea has caused a change of perspective in dealing with computing challenges. Whereas a few years ago they could only be faced by means of supercomputers, which were excessively expensive and hardly scalable, nowadays the cheapest available resources are used. This favors scalability, building systems from sets of computers connected by a network and taking advantage of all the resources.

An instance of this evolution is the Large Electron-Positron (LEP) particle accelerator at CERN (Conseil Européen pour la Recherche Nucléaire; in English, European Organization for Nuclear Research). At the beginning, Cray supercomputers were used to analyze its data; then clusters were used in the nineties; and finally grid technology has been used in the last years.


The problem raised in [Mar02] is that carrying out distributed computing among several machines costs less than using a single machine, but the distribution is not easy to obtain. High Throughput Computing (HTC) applications are easily adaptable because they can be split into parts, which can be assigned to each machine. On the other hand, in a High Performance Computing (HPC) environment an immediate response is required so that the application can continue, and thus distributing the calculation in an independent way is not possible.

3.1 Cluster computing

Cluster computing was the first alternative to multiprocessors, trying to obtain a better relation between performance and cost. A cluster can be defined as a set of dedicated and independent machines, connected by means of an internal network, and managed by a system that takes advantage of the existence of several machines. A cluster is expected to provide high performance, high availability, load balancing and scalability.

The machines, named nodes, can have the same hardware and operating system configuration (homogeneous cluster), different performance but similar architectures and operating systems or different hardware and operating system (heterogeneous cluster).

In order for a cluster to work, it is not enough to connect the nodes; it is also necessary to provide a cluster management system. The cluster management system interacts with the user and controls the processes that are being run to optimize the performance. It receives the jobs and redistributes them so that the processes run faster, avoiding bottlenecks. This is usually done by means of a monitoring system and specific policies that indicate where and how the jobs have to be distributed. In short, the cluster management system is a middleware between the user and the nodes that provides:

• A single access interface, named single system image, which gives the user the feeling of using a single computer.
• Tools for the optimization and maintenance of the system: checkpointing, load balancing, fault-tolerance, scheduling, etc.
• Scalability: it must automatically detect new nodes connected to the cluster.

There are several cluster management systems, like MOSIX, OpenMosix, LinuxSSI and Beowulf. MOSIX [MOS] and its extended open source release OpenMosix [OMo] provide load balancing in a transparent way, making it possible for a cluster to behave as a single machine. LinuxSSI [LSS] is an operating system for clusters based on the existing Linux-based Kerrighed technology [Ker], which provides a single system image. Beowulf [Beo] is a cluster technology based on Linux computers that form a parallel virtual supercomputer. Besides, there are queue managers that send jobs to each node depending on its workload, like OpenPBS [PBS], SGE [SGE] and Condor [Con].

Although cluster computing solves problems in a cheap way, there is a common misconception that any software will run faster on a cluster than on a single machine. To achieve this, the software has to be rewritten to take advantage of the cluster and, specifically, it must consist of independent parallel operations.

Finally, cluster computing makes possible the use of internal resources, but it does not make it possible to join different administration domains while respecting the security policies of each domain. Furthermore, clusters are not based on open standards, which slows down their growth and the communication among applications.

3.2 Grid computing

In 1995, during the SuperComputing’95 conference, it was demonstrated that it is possible to run distributed applications across 17 different research centers connected by a high-speed network. This proof of concept was the starting point of several projects with a common denominator: sharing computing resources in a distributed way.

The aim of taking advantage of different available resources, physically separated and connected by means of the Internet, generated the idea of grid computing. The term grid appears for the first time in [KF98] with an idea similar to the electrical power grid, which offers a uniform and cheap access point to electricity in any place. In the case of grid computing, the access point refers to computing and storage resources.

Therefore, a grid is understood as an enormous set of machines connected by any network, such as the Internet, a LAN or a WAN, linked with the aim of performing costly computing tasks. In short, a grid is a pool of geographically dispersed computing resources.

Unlike conventional networks, focused on the communication among devices, grid computing can take advantage of the low-workload periods of all connected resources, making possible the resolution of very complex computing problems, which cannot be solved by a single machine. A grid can profit from the computational power of unused computers, being accessed like a virtual supercomputer. In this sense, grid technology makes possible the sharing and aggregation of geographically distributed resources, which can be both supercomputers and storage systems.

At the user level, grid computing automatically handles enormous computing requests and data management. For security reasons, authentication by means of certificates is required to access grid resources.

It is necessary to take into account that grid technology does not intend to replace previous technologies, but is applied in another area. Its aim is joining resources of different authorities or virtual organizations [Fos01], respecting the security policies and management tools of each organization. Some rules have to be fulfilled:

• It must not interfere with the site or organization management.
• It must not put the users’ and organizations’ security at risk.
• It must not be necessary to change the installed operating systems, network protocols or existing services.
• Each organization, site or resource can join and leave the grid at any moment.
• It must provide a reliable infrastructure.
• It must provide support for heterogeneous components.
• It must use existing standards and technologies.

Foster defines the characteristics that any grid has to meet [Fos02]:

• “Coordinates resources that are not subject to centralized control”. In this sense, unlike a local management system, a different security policy for each site at a given time is allowed.
• “Using standard, open, general-purpose protocols and interfaces”. This makes possible the communication among all the resources, which can be heterogeneous.
• “To deliver nontrivial qualities of service”, related, for instance, to response time, throughput, availability and security.

In short, among all the benefits of grid computing, the following ones stand out:

• Unlimited computational power offered by a multitude of computers connected by a network.
• Elimination of bottlenecks of some computing processes.
• Possibility of managing multiple domains corresponding to different virtual organizations.
• Integration of systems and heterogeneous devices. A grid can be built from heterogeneous resources and it includes a big range of technologies.
• Scalability. It can grow from being composed of a few resources to millions of them. This implies a performance degradation problem as the grid size is increased; therefore, applications that need many geographically dispersed resources must be designed to be bandwidth-tolerant. Due to its scalability, a grid will never become obsolete.
• Adaptability. The unavailability of resources is usual in grid technology. In fact, the probability that a failure takes place is high because there are many resources. Applications must adapt their behavior dynamically and use the available resources in an efficient way.

3.2.1 Grid middleware

Grid middleware is composed of several service layers:

1. Grid fabric. It represents all the distributed resources that are accessible from any location on the Internet. These resources can be any kind of machines (both PCs and multiprocessors) running different operating systems (such as UNIX, Linux or Solaris), storage devices, databases, and scientific instruments like a telescope or a heat sensor.
2. Core grid middleware. It represents basic services such as remote process management, resource assignment, storage access, information discovery, security, and different aspects of quality of service (QoS) like reservation and resource brokering.
3. User-level grid middleware. Applications and tools for the development of environments and resource brokers to manage global resources.
4. Grid portals. They represent a web infrastructure where end users can run applications and gather the results in a transparent way.

3.2.1.1 Grid fabric

In this layer, several services make the efficient use of local resources possible. Among them, the following stand out:

• Queue managers. They are in charge of managing and distributing the jobs that are executed in each local resource. Sun Grid Engine [SGE] and Condor [Con] stand out due to their wide use.
• Parallel computing libraries, like MPI [MPIb] and PVM [PVM]. They are useful to take advantage of compound local resources, such as clusters or supercomputers.
• Resource monitoring, such as Hawkeye on Condor [Haw] and Ganglia [Gan]. They are used to know the performance of each single local resource at a given moment.

3.2.1.2 Core grid middleware

There are several grid technologies. The best-known ones are Legion, Unicore and Globus. Legion [GWF+ 94] is an object-oriented system developed by the University of Virginia. It provides the necessary infrastructure to support applications on a heterogeneous and geographically distributed environment composed of high performance machines. It offers a virtual environment that works in a similar way to a single machine.

Since it is based on objects, it has the following features:

• Objects represent both hardware and software components. Each object is an active process that answers the invocations of its methods by the objects that interact with it.
• As in any object-oriented system, users can re-define the functionality of the classes. This feature makes it possible to adjust the system to the users’ needs.
• Objects are independent and can communicate among themselves. The set of methods of an object is described in its interface, which must be known in order to use the object. The Interface Definition Language (IDL) is used with the aim of standardizing the interaction between different objects.

UNICORE (Uniform Interface to COmputer REsources) [ES01] is a project funded by the Ministry of Education and Research of Germany. It provides an interface to prepare and send jobs in a secure way to resources. The applications distributed by means of UNICORE are divided into parts, which run on different computers in an asynchronous or sequentially synchronized way. A UNICORE job contains a multipart application with the requirements of the resources that it needs, and the dependences among the different parts of the application.

The aims of UNICORE’s design are based on:

• Having a simple graphical user interface.
• An architecture based on the concept of abstract jobs, including security.
• Minimal interference with the local procedures.
• Use of existing technologies, taking into account new growing technologies.

Alberto S´ anchez Campos

3.2. GRID COMPUTING

31

UNICORE is designed to support batch jobs using a distributed system in which the different parts of the application can be run. In order to standardize the access to the computers, the user ID is unique.

Globus is the “de facto” standard of grid computing. It is the middleware used by some leading computing companies, such as Compaq, Cray, Sun Microsystems, IBM, Microsoft, Veridian, Fujitsu, Hitachi and NEC.

The Globus Alliance [GlA] was born from the union among the Argonne National Laboratory, the Information Sciences Institute of the University of Southern California, the University of Chicago, the University of Edinburgh and the Swedish Royal Institute of Technology. It is managed by Ian Foster and Carl Kesselman, co-authors of [KF98]. Globus is an open-source and open-architecture project that incorporates protocols and basic services for constructing grids. Furthermore, it provides support to run applications taking advantage of the offered computational power.

Nevertheless, operating systems underneath have no idea about what grid middleware is running on top. The integration between grid middleware and operating systems could address different improvements, such as proper resource management and accurate job monitoring.

Thus, XtreemOS is taking a first step trying to introduce some grid concepts into Linux [Mor07]. The XtreemOS project tries to determine which new services should be added to current operating systems to build grid infrastructures in a simple way. In essence, the idea behind XtreemOS is to integrate the grid middleware in the operating system.

From a different point of view, a complete grid operating system could be built to improve the deployment of grid middleware. In this sense, GridOS [PW03] is an operating system that makes a usual computer more suitable for grid environments.

3.2.1.3 User level grid middleware

In 2002, the grid community changed its orientation to a services model [FKNT02]. This new architecture, named Open Grid Services Architecture (OGSA), was defined by the Global Grid Forum (GGF) and provides both the creation and the maintenance of the different services offered by Virtual Organizations (VO).

The Global Grid Forum was the organization in charge of defining the service and protocol standards of grid technology. It was created in Amsterdam in 2001 with the aim of being the international forum on grid technology. Some important companies, like IBM, Microsoft, Sun and HP, participate in it. Nowadays, the GGF and the Enterprise Grid Alliance (EGA) have joined, creating the Open Grid Forum [OGF]. The EGA was formed in 2004 to focus on grid adoption in enterprise data centers.

In February 2003, the Global Grid Forum, the Globus Alliance and IBM proposed together to converge the grid computing technology towards Web services.

Thus, the GGF defined OGSA [FKNT02, FSB+ 06, OGS], the architecture necessary to provide grid services. OGSA defines the services that must be offered in grid computing and specifically addresses the following fields:

• Information services. Clients need mechanisms to discover available services by means of the specification of their features.
• Services can be dynamically created.
• Fault-tolerance mechanisms. In case of a service failure, there must be mechanisms to restart it.
• Notification. Changes occurring in a service can be notified to clients or other services.
• The work environment must take advantage of available resources by means of services.

Specifically, grid services are a considerable extension of Web services. Web services are not valid to give support to grid applications because of the following reasons:

• They are stateless, that is, they are not persistent services.
• They do not give support to features like notifications and life cycle management.

On the other hand, grid services have the following features:

• They are stateful. The state of the grid service is kept between invocations.
• They can be persistent. Several instances of the same service can be created at a given moment, destroying them when they are no longer necessary.
• They have support services.

Thanks to the relation between the GGF and the World Wide Web Consortium (W3C), founded in October 1994 to promote the World Wide Web and develop standard protocols ensuring its interoperability, grid services have been fused with Web services (WS) in a single research line named WS-Resource Framework (WSRF) [GKM+ 06]. WSRF provides access to stateful Web services; the toy sketch below contrasts the two models.
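The contrast between stateless Web services and stateful grid services can be caricatured as follows; this is only an illustration in Python, not WSRF code, and the factory and resource names are invented.

    class StatelessAdder:
        def add(self, a, b):
            # Nothing survives between invocations.
            return a + b

    class CounterResource:
        # State is kept per resource instance, across invocations.
        def __init__(self):
            self.value = 0
        def increment(self):
            self.value += 1
            return self.value

    resources = {}                      # resource id -> instance (lifetime managed explicitly)

    def create_resource(rid):
        # Factory-style creation; deleting the entry ends the resource's life cycle.
        resources[rid] = CounterResource()

    create_resource("job-42")
    resources["job-42"].increment()     # -> 1
    resources["job-42"].increment()     # -> 2: the state persisted between calls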


Although there are some well-known grid projects related to computational power, like the LHC Computing Grid (LCG) and Enabling Grids for E-SciencE (EGEE), computing is not the only important aspect of a grid. For instance, for many applications data access is more important than access to computing resources.

In short, grids can provide several types of services [BBL02]:

• Computing services. They provide tools to run complex jobs on distributed computing resources. Some examples of computational grids are the NASA Information Power Grid (IPG) [CFF+ 01], LCG [LCG] and the NSF TeraGrid [TeG].
• Data services. They provide secure data access and data management. Data can be catalogued and replicated in order to manage it. A data grid (see Section 3.2.2) offers an environment in which it is possible to manage huge data volumes in a distributed way. The European Data Grid (EDG) [CFF+ 01] and CrossGrid [CrG] projects stand out among the projects belonging to this type.
• Application services. They provide application management and transparent access to remote software and libraries. They are built on the computing and data services provided by the grid. A system used to develop such services is NetSolve [SYAD05].
• Information services. They obtain and show the information collected about the computing, data and application services.
• Knowledge services. They refer to the way the system knowledge is acquired, used, recovered and published to help users fulfill their aims. Knowledge is understood as information applied to reach an aim, solve a problem or make a decision.

In order to search, discover and select the most suitable services or resources, an element that has information about the numerous resources is required. This element is called broker [SFT00, Tor04]. The system capabilities depend on the way the broker selects the resources. This is the reason why brokering deserves a more detailed study.

The broker is vital for any grid infrastructure because its operation and performance determine the user experience and the use of the environment. Although these components have to identify, characterize, evaluate, select and reserve the most suitable resources or services for a certain job, brokers can also be simple job allocators. In this sense, many of the queue managers used in distributed systems can be applied in grid environments. Some examples are PBS and Condor, with its grid release, Condor-G.


Figure 3.1: Brokering in a grid system (a client sends discovery events to the broker, which selects among resources such as a mainframe, a personal computer and a cluster)

Nowadays there are two tendencies in relation to the brokering phase:

• Client-broker: this kind of broker is designed to comply with the client’s needs, selecting the services according to the client’s policies without having a global vision of the requests of other clients. An example of a client-broker is GridWay [GWM], which can carry out the most usual responsibilities required of a broker, such as resource assignment, task migration and fault-tolerance. For GridWay, the client-broker integration is an advantage because it is more scalable and resistant to dynamic changes, since it is only concerned with the needs of a single client.
• System-broker: in this case, the broker has a complete and global vision of the resources and clients of the grid. Thus, it can select the most suitable resources for the client jobs with the aim of improving the performance of the whole grid. Nevertheless, it concentrates the decision load in a single point, although it is possible to have a broker hierarchy to avoid this.

Nowadays the scientific community is working on improving different aspects of the services. In particular, service scalability, the introduction of autonomic computing techniques, the improvement of data accesses and fault tolerance are areas that are being researched. This Ph.D. thesis is focused on the second and third research lines. Although several improvements can still be made, there are a large number of working projects, such as LCG [LCG], EGEE [EGE], TeraGrid [TeG] and CrossGrid [CrG], that solve compute- and data-intensive problems. In addition, there are many finished projects that demonstrate that grid technology is possible, like GridSim [BM02], IPG [JGN99], EDG [CFF+ 01] and Gridbus (GRID computing and BUSiness) [Gbu]. Among them, EGEE and LCG can be emphasized.

The Large Hadron Collider (LHC) is CERN’s particle accelerator. The discovery of new fundamental particles, specifically the Higgs boson or “God particle” (the key particle to understand why matter has mass), is one of the most important aims of the LHC. It is possible to establish the existence of this kind of particles by means of a statistical analysis of the massive amounts of data collected by the LHC’s detectors ATLAS, CMS, ALICE and LHCb. The LHC is expected to collect 15 petabytes of data per year from 2007. Then, a comparison with compute-intensive theoretical simulations is required.

The analysis of this great amount of data is one of the most important scientific challenges in the world. The aim of the LHC Computing Grid (LCG) project [LCG] is to build an analysis infrastructure in order to understand this data. Furthermore, the data have to be stored in a distributed way, since their size is too big to be held in a single storage system. The LCG project interoperates with other well-known grid projects, such as EGEE, Grid3 [GR3] and The Globus Alliance.

The Enabling Grids for E-sciencE (EGEE) [EGE] project aims to develop a grid infrastructure that is available to the geographically scattered scientific community to solve different kinds of complex problems in different domains. EGEE tries to cover a wide range of both scientific and industrial applications, including Earth Sciences, High Energy Physics, Bioinformatics, Astrophysics and so on.

Two pilot applications were selected to test the performance and functionality of the EGEE infrastructure. Both the LHC Computing Grid and biomedical grids, where scientists face bioinformatics and healthcare problems, certify its use.

The problem of this infrastructure is that not all the grid requirements stated by Foster [Fos02] are fulfilled. EGEE has stated several requirements that a research center must fulfill to join the initiative. The aim was to quickly create a simple grid infrastructure to solve problems that have no solution with other technologies. Thus, most of the computers used in EGEE have the same architecture and operating system. Besides, each VO or research center loses some control over its own resources, since there is a central organization that manages the EGEE resources. This means EGEE breaks with the grid requirements.

3.2.1.4 Grid Portals

To ease the interaction between users and the grid, it is advisable to create web portals. In this way, it is possible to access the grid from any location, running applications by means of a simple graphical interface. Furthermore, the installation cost on every client machine is eliminated.

There are some web portals that make it possible to carry out different operations via the web, such as requests to the information services, file transfer and job management. However, they are tied to the project they belong to. In addition, there are some projects that ease the development of grid portals, like GridPort and GridSphere.

Grid Portal Toolkit (GridPort) [GPo] is a set of technologies designed to contribute to the development of scientific portals for computational grids. By means of GridPort, developers can extend the functionality of a web portal to run complex applications in a grid. GridPort has been used to develop scientific portals, for instance to model molecules.

GridSphere [GSp] is a project that eases the creation of grid portals based on portlets. A portlet is a component that can be integrated into a web portal and works as a web application managed by a central manager.

3.2.2 Data grid

One of the major goals of grid computing is to provide efficient access to data, since data-intensive grid applications are first-class citizens of complex problem-resolution architectures [Fos98]. Grid systems in charge of tackling and managing large amounts of data in geographically distributed environments are usually named data grids.

Data grids [CFK+ 00, Ave02, HJMS+ 00] are grids where the access to distributed data resources and their management are treated as major entities along with processing operations. Thus, data grids primarily deal with providing services and infrastructures for distributed data-intensive applications. The fundamental feature of data grids is the provision of a secure, high-performance transfer protocol for transferring large datasets and a scalable replication mechanism for ensuring data distribution on-demand.

Nowadays, there is a huge number of applications creating and operating on large amounts of data, e.g. data mining systems extracting knowledge from large volumes of data. Existing data-intensive applications have been used in several domains, such as Physics [Hol00], climate modeling [LEG+ 97], Biology [YFH+ 96] or visualization [FL99]. Data-intensive grid applications try to tackle the problems originated by the need for a high-performance I/O system in a grid infrastructure. In these architectures, data sources are distributed among different and heterogeneous nodes. In addition, a typical data grid requires access to terabyte-sized or larger datasets. For example, high-energy physics may generate terabytes of data in a single experiment. Accesses to data repositories must be made in an efficient way, in order to increase the performance of the applications used in the grid.


In order to achieve the maximum benefits of a data grid, some requirements must be met:

• Ability to search through numerous available datasets for those that are required.
• Ability to select suitable computational resources to perform data analysis.
• Ability to manage access permissions.
• Intelligent resource allocation and scheduling.

Data grids require the use of data sources, which are facilities that may accept or provide data [ACK+ 03]. These sources are composed of four components:

1. Storage systems, which include file systems, caches, databases and directory services,
2. Data types, which include files, relational databases, XML databases and others,
3. Data models, that is, different database schemas, and
4. Access mechanisms, which include file-system operations, SQL, XQuery, or XPath.

Data grids must manage all these components in a dynamic fashion. Accessing heterogeneous resources with different interfaces and functionalities is solved, in the majority of cases, by means of new services that offer a uniform access to different types of systems. Furthermore, generic data management systems involve more challenges. The two most important ones are:

1. All the layers of the grid infrastructure are characterized by their heterogeneity. This includes different storage systems, data access mechanisms and policies. Heterogeneity does not only affect the infrastructure, but also the data itself. Different kinds of data formats and data from heterogeneous sources make efficient data management more difficult.
2. This infrastructure must be able to tackle huge volumes of data, from terabytes to petabytes.

Additionally, many applications require the discovery of both data and other types of resources. A replica management system [VTF01] can keep track of the data and their locations. In addition, replication can improve the I/O performance and reliability.

Several data-intensive grids and I/O systems have been proposed to provide a suitable solution to data-intensive applications. A taxonomy of data grids is proposed in [VBR06]. The elements of this taxonomy related to data storage are: i) data transport and ii) data replication and storage.


3.2.2.1 Data transport

Data transport refers to data transfer that offers security and management of the sending. Among data transport mechanisms, SOAP [SOA07], GridFTP [ABB+ 02a], IBP [PBE+ 99], Kangaroo [TBSL01] and SRB I/O [BMRW98] stand out. They are described below.

Simple Object Access Protocol (SOAP) [SOA07] is a versatile protocol for exchanging messages over the Internet. SOAP provides a basic and standard messaging framework especially appropriate for Web services. Since the grid community has changed its orientation towards a service model [FKNT02] following the Web services approach, SOAP is considered the default data transport in this model. However, SOAP is considerably slow and does not provide security, which must be solved by the middleware, for instance, by means of the Grid Security Infrastructure (GSI) in Globus.

GridFTP [ABB+ 02a] is an extension of the standard FTP protocol, providing secure and efficient data management in grid architectures. In order to exhibit this behavior, GridFTP includes, among others:

• Grid Security Infrastructure (GSI),
• Parallel and striped data transfer,
• Support for reliable and restartable data transfer,
• Automatic negotiation of TCP buffer and window sizes,
• Instrumentation and monitoring tools.

The use of GridFTP parallel threads increases the effective network bandwidth, improving the link utilization (a small sketch of this striped, multi-stream idea closes this subsection).

Internet Backplane Protocol (IBP) [PBE+ 99, BBM+ 03] is an end-to-end data movement mechanism, which tries to optimize the data movement from one node to another by storing the data at intermediate locations. An IBP node has a temporary buffer where data is stored during a specific period. Therefore, by configuring these buffers it is possible to move data close to where it is required. Furthermore, applications can transfer data without worrying about the storage management on the individual nodes. Once the data is in the local node, applications read or write the new copy.

IBP also provides semantics similar to the UNIX system. In addition, it uses a concept named exNodes that is similar to UNIX inodes. Nevertheless, although exNodes store metadata, IBP does not provide a discovery service, being a low-level storage solution. Its security is based on capabilities for reading, writing and managing given by the server to a client.

Kangaroo [TBSL01] belongs to the Condor grid project and is a reliable data movement system. The Kangaroo service continues performing I/O operations even if the process that initiated these requests fails. The reliability is achieved by running the transfer process in the background and transparently managing faults. Furthermore, an intermediate storage element is placed between the client and the remote server to enhance its performance. Then, a background process named mover places the data in its proper location. Kangaroo provides an interface with get, put, commit and push operations. Whereas get and put are non-blocking operations, commit and push block until writes have been delivered to the next intermediate element or to the destination, respectively.

The Storage Resource Broker (SRB) I/O [BMRW98] provides a UNIX-style file I/O interface for accessing heterogeneous data distributed over wide-area nodes. SRB I/O provides streaming data transfer, making it possible to send multiple files to a storage resource using multiple streams. In addition, data can be moved from slower storage, such as tapes, to faster storage to improve the performance, and remote procedures, including SQL queries and data filtering, are supported. The fine-grained data access in SRB is provided through credentials, which are stored in the metadata catalog (MCAT). The authorization is provided by means of tickets.
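To illustrate the striped, multi-stream transfer idea shared by GridFTP and SRB I/O, the following sketch reads disjoint byte ranges of a local file concurrently with a thread pool; a real client would push each range over its own authenticated TCP connection, which is omitted here, and all names are illustrative.

    import concurrent.futures, os

    def send_range(path, offset, length, stream_id):
        # In a real transfer this range would be written to stream `stream_id`.
        with open(path, "rb") as f:
            f.seek(offset)
            data = f.read(length)
        return stream_id, len(data)

    def parallel_transfer(path, streams=4):
        size = os.path.getsize(path)
        chunk = (size + streams - 1) // streams
        with concurrent.futures.ThreadPoolExecutor(max_workers=streams) as pool:
            jobs = [pool.submit(send_range, path, i * chunk,
                                min(chunk, size - i * chunk), i)
                    for i in range(streams) if i * chunk < size]
            return [job.result() for job in jobs]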

3.2.2.2 Data replication and storage management

Replication and storage management are key factors in order to achieve efficient and reliable data access. Among the systems that provide either data replication or storage management, Giggle [CDF+ 02], GDMP [SSMD02], the Storage Resource Broker (SRB) [BMRW98], Remote I/O [FKKM97], Storage Resource Manager (SRM) [PPC+ 07] and OGSA-DAI [Mag03, MEM03] can be emphasized. They are described below.

GIGa-scale Global Location Engine (Giggle) [CDF+ 02] is a Replica Location Service (RLS) framework. The aim of the RLS is to replicate data that is written once and read many times. This service maintains information about where file replicas are physically located. An RLS is composed of a Replica Catalog and a Replica Location Index (RLI). Whereas the Replica Catalog is in charge of mapping the logical representation to the physical locations, the RLI indexes the catalog itself.

In fact, data is represented by a logical file name (LFN). The Local Replica Catalog (LRC) maps the physical locations of the file, expressed by means of a unique physical file name (PFN), to its corresponding LFN. The RLI is an index of replica catalogs, composed of mappings from LFNs to Replica Catalogs, allowing the system to be configured in different ways, such as hierarchical or centralized.


Grid Data Mirroring Package (GDMP) [SSMD02] is a secure and high-speed replication manager for data files and object databases that uses Replica Catalogs and GridFTP. GDMP is based on the publisher-subscriber model. Every server publishes the set of files added to the Replica Catalog, enabling clients to request replicas of them. In order to secure the client connection, GSI is used. Replica consistency is not an important issue in GDMP, since it was conceived for High Energy Physics experiments in which data is not updated.

SRB [BMRW98] is one of the most widely used data grid architectures since it provides a uniform interface to heterogeneous storage systems, unifying the view of the files stored in different locations. It uses replication to improve data availability and performance. The replication consistency in SRB is provided by means of synchronization and lock mechanisms propagating changes to other replicas.

File systems and databases are managed as physical storage resources (PSR), which are mapped to logical storage resources (LSR). Data is organized in SRB within a hierarchy of collections analogous to the UNIX file system hierarchy. Collections use LSRs to find data by means of their corresponding PSRs. Besides LSRs and PSRs, some file attributes are stored in the SRB metadata catalog MCAT, which allows clients to search data according to its attributes. Hierarchical data grids [CFK+ 00] expand this idea into trees of servers that replicate data from a production site.

Grid Datafarm (Gfarm) [TMM+ 02] is a high-speed I/O system designed for large-scale architectures. Although it can manage several petabytes, it requires high-speed network connections and large disk space in each storage element, which does not properly fit the grid philosophy. In this way, it can be seen as closer to the cluster computing idea.

Gfarm consists of a parallel file system that stores data by splitting it into parts on multiple disks. Each file part stored in each element has an arbitrary length. Replicas are created when the system tries to access a file part in loaded elements. However, the whole file is considered write-once, which means that a new file is created when a file is modified. The scheduling is integrated with data distribution, although no complex policies are used.

The Remote I/O (RIO) library [FKKM97] is a tool for accessing files located on remote file systems, based on the first versions of Globus. RIO follows the MPI-IO interface and allows any application that uses MPI-IO to operate unchanged in a wide-area environment. It provides portability and integration with distributed computing systems. Furthermore, it obtains good performance by using traditional I/O optimizations such as asynchronous operations and message forwarding.


The Storage Resource Manager (SRM) [PPC+ 07] interface provides a standard, uniform management interface to heterogeneous storage systems, providing a common interface to data grids and abstracting the peculiarities of each particular Mass Storage System. SRMs are middleware components whose function is to provide dynamic space allocation and file management on shared storage components of the grid. The SRM interface can be used to access different storage systems such as CASTOR (CERN Advanced STORage manager) [CAS], a scalable and distributed hierarchical storage management system developed at CERN in 1999. Other mass storage systems which provide an SRM interface are HPSS [WC95, HPS], Enstore [EMS], JASMine [HHSK05] and dCache [EFG+ 01, DCa].

OGSA-DAI [Mag03, MEM03] is a middleware to integrate data from different sources in grid environments. Besides files, other data sources, like relational databases, will become more prominent in data grids. OGSA-DAI exposes information about data resources so that clients can access data by using OGSA-DAI Web services. The aim is to provide faster methods to access data, for instance, by means of URLs or GridFTP. The challenge is to extend the existing grid mechanisms, such as replication, data transfer and scheduling, to work with these new data sources.

3.2.2.3 Replication strategies

In [GB06] three replication policies for data grids are explained. The first is replication based on location. This idea was conceived from replica selection according to the clients’ preferences, and different approaches use this kind of replica selection. In order to define the clients’ requirements and the resource features, the Condor ClassAds technology is often used [RLS98, VTF01]. The broker knows both pieces of information; thus, by means of pairs of attribute values it is possible to know whether a resource meets the requirements of a specific client. The location can be one such attribute.
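A minimal sketch of this attribute-based matchmaking (inspired by, but not using, the ClassAds library) is shown below; the resource attributes and job requirements are invented for illustration.

    def matches(requirements, resource_ad):
        # A resource matches when every required attribute satisfies its predicate.
        return all(k in resource_ad and pred(resource_ad[k])
                   for k, pred in requirements.items())

    resources = [
        {"name": "cluster-A", "free_cpus": 16, "free_disk_gb": 500, "location": "UPM"},
        {"name": "pc-B", "free_cpus": 1, "free_disk_gb": 20, "location": "URJC"},
    ]
    job = {"free_cpus": lambda v: v >= 8,
           "free_disk_gb": lambda v: v >= 100,
           "location": lambda v: v == "UPM"}      # location used as a selection attribute

    candidates = [r for r in resources if matches(job, r)]   # -> only cluster-A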

Following a location-based model, different replica creation techniques can be applied. In [RF01] several strategies are discussed to deal with the problem:

• Best client: Each storage element records the number of accesses made by each client to each replica. When the access frequency exceeds a threshold, a replica is created at the client that has accessed this replica most often (the best client); a minimal sketch of this policy is given at the end of this subsection.

• Cascading replication: The analogy used in the article for this strategy is a tiered fountain. Data originates at the top; when it fills the top ledge, it overflows to the next level, and so on. That is, when the threshold for any replica is exceeded, data is replicated at the next level on the path to the best client. Although this technique distributes the storage space well, it is only advisable in a hierarchical architecture.

• Plain caching: The client that requests data stores a local copy of it. The problem is the storage space required.

• Caching plus cascading replication: This strategy combines plain caching and cascading replication. The client stores data locally, but the system also propagates the most popular replicas down the hierarchy. Although this solution tries to solve the problems of the two other strategies, it only mitigates them.

• Fast spread: This strategy replicates data at each element along the path to the best client, using a P2P-based replication. However, this technique wastes much storage space. Although there are methods to delete unused replicas that could alleviate the problem, the technique should still be applied in hierarchical topologies.

Another replication policy is based on economy. The idea is to decide whether it is advisable to create a replica from an economic or social point of view. An instance of these economy-based strategies is [CZSS02]. The simplicity of these models, based on sale and opportunity costs, and above all the fact that current grid infrastructures have a scientific approach which is not based on pay-per-use, make this strategy unsuitable for the proposed problem.

The last replication policy is based on cost estimation [LSS02]. This strategy is very similar to the economic model, but the cost estimation is more elaborate. The cost is calculated according to different monitoring parameters, such as network latency and bandwidth, the replica size and the number of read and write operations. Nevertheless, in a similar way to cascading replication and fast spread, this strategy is only advisable for hierarchical and flat topologies.
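As a concrete illustration of the threshold-based strategies above, the following Python sketch implements the basic best-client rule: a storage element counts per-client accesses to each replica and triggers replica creation once a hypothetical threshold is exceeded. All names (StorageElement, create_replica, THRESHOLD) are illustrative and not taken from any of the cited systems.

from collections import defaultdict

THRESHOLD = 100   # hypothetical access-frequency threshold per replica

class StorageElement:
    def __init__(self, name):
        self.name = name
        # access_counts[replica][client] -> number of accesses recorded
        self.access_counts = defaultdict(lambda: defaultdict(int))

    def record_access(self, replica, client):
        """Record one access and replicate towards the best client if needed."""
        counts = self.access_counts[replica]
        counts[client] += 1
        if sum(counts.values()) > THRESHOLD:
            best_client = max(counts, key=counts.get)
            self.create_replica(replica, best_client)
            counts.clear()   # restart counting after the replica is created

    def create_replica(self, replica, client):
        # Placeholder: a real system would copy the data to storage close
        # to 'client' and register the new replica in a catalog.
        print(f"{self.name}: creating replica of {replica} near {client}")

se = StorageElement("se01")
for _ in range(101):
    se.record_access("results.dat", "clientA")   # exceeds the threshold once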

3.2.3 Grid performance

In distributed environments, some traditional parameters have been defined as performance and operation metrics of the system, such as throughput, network bandwidth and response time. A grid is the maximum expression of a distributed environment, and therefore the usual metrics can still be used, but it is also necessary to take into account other parameters such as:

• The quality of service of each resource.

• Service availability.

• The suitability of the assignment of processes to resources.

In short, in a grid environment it is very important to take into account the adaptation of applications to the infrastructure, bearing in mind the principal features of a grid:

1. Diversity of resources.

2. Dynamism of resources.

Therefore, it is necessary to analyze the way in which the used resources have been selected, above all their suitability and efficiency. [GWB+ 04] shows the conditions that a performance analysis tool for grids must fulfill. Grid performance analysis has special requirements, in particular:

1. Data acquisition should include both application and infrastructure monitoring, since grid performance is determined by the configuration of the used resources and the running applications.

2. Monitors should have an intelligent part in order to filter and pre-process monitored data, because monitoring a grid can involve a great amount of data.

3. Monitoring techniques should be able to detect performance problems automatically from their symptoms.

4. Monitoring tools should be customizable, since a grid is heterogeneous and dynamic.

Grid benchmarking and grid monitoring are the usual methods to learn the grid behavior.

3.2.3.1 Grid benchmarking

The use of benchmarks to analyze grid performance can be considered in order to define the expected quality of service and to compare scheduling algorithms, resource assignment policies, implementations and complete systems. The effort of building grid benchmarks was mainly led by the Grid Benchmarking Research Group at the now extinct Global Grid Forum.

Most of the grid benchmarks are computation intensive and they are based on the NAS Parallel Benchmark, a suite of eight modules that evaluate the performance of parallel supercomputers. Among them, the NAS Grid Benchmark [FdW02] and GridBench [TD03] can be highlighted. On the other hand, the Arithmetic Data Cube (ADC) [FS03] can be pointed out as the single known data-intensive grid benchmark.

NAS Grid Benchmark (NGB) [FdW02] is a computation-intensive benchmark. The NGB tasks are defined in terms of flow graphs, where each node represents a computing element and the computing time within it, and each arc represents the communication between computing elements. The grid performance is calculated from the log, which includes the running times in each node and the transmission times over each arc. This benchmark is very useful to analyze the critical path of a certain application.
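To make the critical-path analysis concrete, the sketch below (an illustration only, not part of NGB) computes the longest accumulated compute-plus-transfer time through a small task graph whose nodes carry computing times and whose arcs carry transmission times; the graph and all values are hypothetical.

from functools import lru_cache

# Hypothetical flow graph: computing time per node and transfer time per arc.
compute = {"A": 5.0, "B": 3.0, "C": 7.0, "D": 2.0}
transfer = {("A", "B"): 1.0, ("A", "C"): 2.0, ("B", "D"): 0.5, ("C", "D"): 1.5}
successors = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

@lru_cache(maxsize=None)
def critical_path(node):
    """Longest compute + transfer time from 'node' down to any sink."""
    tails = [transfer[(node, nxt)] + critical_path(nxt) for nxt in successors[node]]
    return compute[node] + (max(tails) if tails else 0.0)

print(critical_path("A"))   # 17.5: the path A -> C -> D dominates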

GridBench is a benchmark definition tool for a specific grid, which the user needs to define. For this purpose, it provides a Grid Benchmark Description Language (GBDL) [TD05] that can be transformed into job description languages for grid environments such as JDL (Condor) or RSL (Globus).

In order to be user-friendly, it has a graphical interface to define and run benchmarks and to analyze the obtained results. The results are based on the R-GMA architecture defined for the European Data Grid project (see Section 3.2.3.2).

The results obtained by a benchmark defined by means of GridBench are stored in an XML database together with its GBDL definition. Besides, data monitored during the execution of the benchmark can be included in order to improve the analysis. The results for a certain computing element can be published in a monitoring and discovery service such as MDS (see Section 3.2.3.2), making them easily accessible to users and schedulers.

Arithmetic Data Cube (ADC) [FS03] applies the Data Cube Operator to an arithmetic data set. The Data Cube Operator [GCB+ 97] is an On-Line Analytical Processing (OLAP) tool that processes views of a data set; OLAP provides quick answers to analytical queries in a database, allowing users to discover patterns in the data. ADC can be used as a grid benchmark since it manages large distributed data sets. Furthermore, the intensity of the benchmark can be controlled by adjusting the size of the views of the arithmetic data set.

The problem with grid benchmarks is that they run using resources of a grid but they do not necessarily use all the resources that compose it, since a grid can be made up of numerous resources. Resources are selected according to an assignment policy. Thus, the benchmarking results are only representative of the selected resources, not of the whole grid. The selected resources do not have to be the same in subsequent executions, and even if the same resources were selected, their characteristics might have changed, which makes it difficult to predict the performance.

3.2.3.2 Grid monitoring

Another approach to knowing the performance of a grid system at a given moment is grid monitoring. Monitoring does not have to involve the whole grid; it can be performed at different levels, such as:

• Specific to the application.

• Node or server level.

• Cluster or site level.

• Grid level.

However, grid monitoring does not study logs or historical data about what has happened in the system and does not analyze details about a single user.

Several working groups of the former grid forum, such as the GGF Grid Monitoring Architecture Working Group [GPW], worked to build an architecture to monitor the specific components that characterize grid platforms. The architecture is named Grid Monitoring Architecture (GMA), and it can be seen in Figure 3.2. It has a producer and a consumer. The producer registers its monitoring capability for a part of the grid in the directory service. The consumer searches the directory service for those producers that can provide the information it needs. The producer is selected according to brokering policies. Finally, the consumer communicates with the selected producer to obtain the necessary information.

Figure 3.2: Grid Monitoring Architecture (GMA) [GMA]
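The producer/consumer interaction through the directory service can be sketched in a few lines of Python; this is only a schematic illustration of the GMA roles, not the actual GMA or R-GMA API, and all class and method names are hypothetical.

class DirectoryService:
    def __init__(self):
        self.registry = {}                    # metric name -> registered producers

    def register(self, metric, producer):
        self.registry.setdefault(metric, []).append(producer)

    def lookup(self, metric):
        return self.registry.get(metric, [])

class Producer:
    def __init__(self, name, metric, source):
        self.name, self.metric, self.source = name, metric, source

    def query(self):
        return self.source()                  # produce the monitored value on demand

class Consumer:
    def __init__(self, directory):
        self.directory = directory

    def get(self, metric):
        producers = self.directory.lookup(metric)
        if not producers:
            raise LookupError(f"no producer registered for {metric}")
        # Brokering policy: here, simply the first registered producer is chosen.
        return producers[0].query()

directory = DirectoryService()
directory.register("cpu_load", Producer("node01", "cpu_load", lambda: 0.42))
print(Consumer(directory).get("cpu_load"))    # 0.42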

Different architectures, such as R-GMA [BCC+ 02], MDS [MDS], Hawkeye [Haw] and NWS [NWS], have been implemented based on the generic architecture given by the Global Grid Forum.

Relational Grid Monitoring Architecture (R-GMA) [BCC+ 02, GMA] is the monitoring architecture used in the European Data Grid (EDG) project. R-GMA implements the GMA architecture model defined by the Global Grid Forum by means of Java servlets and relational databases. This makes the details of the producer/consumer model transparent to the user. It is used for searching for grid services and for application monitoring.

Monitoring and Discovery System (MDS) [MDS] is the Globus Toolkit's information services component. It uses an extensible framework with a hierarchical structure to manage static and dynamic information. It provides an information services architecture offering mechanisms for resource discovery and monitoring.

MDS provides information about the available resources in the grid and their state. It is often used in order to publish results obtained by means of benchmarks on a certain grid. In spite of being a monitoring service, it is used more as a discovery service, because it does not provide a historical archive of how the system has behaved. Furthermore, it can be combined with other grid protocols in order to build high-level services, such as brokering and fault tolerance.

Hawkeye [Haw] is the monitor used by the Condor-G scheduler. It combines Globus's resource management with Condor's local management methods. It has mechanisms for job monitoring, notification, etc. The advantage of Hawkeye is the possibility of detecting problems automatically. It is based on the ClassAd technology, which identifies the resources available in a set of machines. Matched pairs indicate immediately whether a resource fulfills the requirements of a certain application.

Network Weather Service (NWS) [NWS] was created at the University of California. It is a distributed system that periodically monitors the system and forecasts the performance of the computing resources and the network load at a given time.

MonALISA (Monitoring Agents using a Large Integrated Services Architecture) [NLB01, NLG+ 03, MMG] provides a distributed service to monitor, control and optimize complex systems. The combination of a service-based architecture and the use of mobile agents facilitates the creation of a hierarchy of services that can be easily expanded to manage very complex systems, such as grids. The scalability of the system is achieved thanks to an engine that runs several dynamic services. These services can be discovered and used by other users or services that need the information they provide. Thus, there are two kinds of services or agents: data collector agents (monitor services) and decision making agents (aggregator services).

The monitoring system supervises in real time the grid elements, the network, and the processes that are running. The collected information is essential when high-level services are developed and especially when decisions need to be made in order to optimize grid performance. Different monitoring tools can be used as monitoring modules to collect information about grid resources.

A monitoring module is a module that collects values about the system and can be dynamically loaded. Monitoring modules can be run periodically, sending their results to the monitor service. Since modules can be dynamically loaded, it is possible to define and introduce new monitoring parameters. Thus, MonALISA makes it easy to monitor heterogeneous nodes with different response times.

Chapter 4

AUTONOMIC COMPUTING

Computing systems have made considerable advances in recent years. Furthermore, the sophistication of software increases every day. As a result, the complexity of information technology is reaching a limit that is difficult to sustain.

In this sense, system management by administrators is becoming more difficult due to the incessant increase in the complexity of the existing technology, the introduction of different system standards and the existence of complex and heterogeneous infrastructures. This entails losing performance because not all the features of the infrastructure are known.

In addition, it is necessary to take into account the continuous growth of the infrastructures. For instance, the number of users that access a server can change in an unpredictable way. Human participation in managing the possible changes is very costly, and it is not possible, or at least very complicated, to monitor the whole system in order to adapt it to its variable characteristics.

At the beginning of 2002, with the aim of managing this complexity, the idea of autonomic computing was conceived to help manage the growing needs of information technology. The concept of autonomic computing was created by IBM as an idea that enables infrastructures to adapt themselves automatically to the demands of the applications running on them, increasing the system performance. There are other terms that are losing importance as symbols of this idea, such as self-healing technology, the dynamic systems initiative, adaptive architecture, introspective computing and holistic computing.

The idea of self-healing technology [HL03] was created in 2001 by Intel, Shindy and F5 to designate intelligent software responsible for warning when applications run near maximum capacity. There is similar software offered by Concord Communications that automatically adjusts network connections according to the current characteristics of the network.

The Dynamic Systems Initiative (DSI) [DSI] is Microsoft's strategy to enter the autonomic systems field. Its aim is to build servers that can be managed according to real-time computing requirements and to manage the system behavior depending on the application needs. For instance, more servers can be activated if a web site is accessed by an unusual number of users at a given moment. Although this initiative has developed some tools that facilitate autonomic computing tasks, such as server virtualization and system monitoring, they need too much human intervention.

Adaptive infrastructure [AdI, Hoo05] is HP's contribution to the construction of self-managing infrastructures and autonomic systems. This approach has been implemented in the HP Utility Data Center, managing the whole pool of resources of a data center in a dynamic way.

The idea of introspective computing [Isa02] was conceived as the dynamic adjustment of the algorithms that run in a system based on a previous analysis of it. Different enhancements, such as power saving or fault tolerance, can contribute to this adaptation.

Holistic computing was IBM's first idea, prior to autonomic computing. The aim of this sort of computation was to examine the databases in order to obtain better efficiency.

The idea of autonomic computing gathers all the previous concepts into a single one: the adaptation of the system to changing and heterogeneous environments, with learning capabilities. Autonomic computing is inspired by biological systems, specifically by the central nervous system. The central nervous system carries out multiple tasks, but human beings are not conscious of all of them and we do not take part in its decisions. Among these tasks, the review of blood pressure, the control of the cardiac frequency, the adjustment of the body's temperature and the management of food digestion stand out. All these tasks are performed in an unconscious fashion. In fact, the central nervous system is able to attend to these tasks and other more important jobs concurrently without any failure. Moreover, it also adapts to changing corporal needs. Therefore, the aim of autonomic computing is the construction of a computing system that works in a similar way to the central nervous system. This is the key feature that autonomic computing aims to achieve: self-configuration performed without any conscious involvement of the user or developer.

Autonomic computing [RAC] refers to systems that are self-configuring and self-managing according to changes occurring in the environment, and that can heal themselves in case of failure. This kind of computation requires less human intervention for system management and administration. Nowadays, the biggest problem of most systems is defining the parameters that configure the system. System administrators and/or application programmers are usually in charge of deciding the values of these parameters. In many cases, these values are not very suitable, since they depend on the knowledge that the administrator has about how the system works. Besides, in order to obtain the maximum performance, the same system on different infrastructures needs different configuration values. Even more, the system characteristics can be dynamic, so the configuration should be modified depending on the conditions of the environment. For these reasons, it is not advisable that the whole weight of the system configuration falls on the administrator, who in most cases is not an expert in all these areas. For instance, the system administrator might not be an expert in the underlying I/O system.

In short, the aim is to obtain, in an almost magical way, a system management that adapts the system to the changing conditions of the environment, optimizes the configuration parameters and repairs the system in case of failure. This is done by means of autonomic capabilities and without human interaction.

The objective is the design of systems that are able to adjust themselves according to the changing conditions of the environment and to manage resources in an efficient way when facing different workloads. Autonomic systems must allow users to customize the fields over which they want to have greater control by adjusting the configuration parameters.

4.1 Autonomic elements

In [Hor01] the urgency is emphasized of "... design and build computing systems capable of running themselves, adjusting to varying circumstances, and preparing their resources to handle most efficiently the workloads we put upon them. These autonomic systems must anticipate needs and allow users to concentrate on what they want to accomplish rather than figuring how to rig the computing systems to get them there ..."

In order to understand this paradigm, it is important to understand the nature of autonomic computing. In [Hor01], IBM, one of the most active supporters of autonomic computing, defines the following eight key elements of this discipline:

1. "To be autonomic, a computing system needs to 'know itself' - and comprise components that also possess a system identity". An autonomic computing system is aware of all its components and their status, and it must be able to act on them.

2. "An autonomic computing system must configure and reconfigure itself under varying and unpredictable conditions". The environment in which an autonomic computing system works is dynamic, and according to these dynamic conditions the autonomic computing system must be able to reconfigure itself. Although the conditions are unpredictable, it is possible and desirable to use a system that can predict, in some sense, the future behavior. In this way, the configuration makes the performance enhancement feasible. Besides, the system adjustment must be carried out constantly, so that the system adapts itself to the environment in the best possible way at any time.

3. "An autonomic computing system never settles for the status quo - it always looks for ways to optimize its workings". An autonomic computing system monitors the overall status of the system and decides, according to an optimization plan, the parameters to be changed.

4. "An autonomic computing system must perform something akin to healing - it must be able to recover from routine and extraordinary events that might cause some of its parts to malfunction". To do so, it can use the available resources with the aim of avoiding the interruption of the activities running on it. In case the failures cannot be solved by the system, such as a hardware failure, it must warn the administrator. In short, an important feature of autonomic computing is its ability for self-healing: a system must be able to identify the causes of problems and solve them.

5. "A virtual world is no less dangerous than the physical one, so an autonomic computing system must be an expert in self-protection". An autonomic computing system must protect itself from attacks, detecting them and alerting the system administrator in case of danger. In the same way that the immune system acts, an autonomic system must be able to identify suspicious code, analyze it and distribute a cure for the whole system.

6. "An autonomic computing system knows its environment and the context surrounding its activity, and acts accordingly". An autonomic computing system must be able to discover resources and obtain information about them. Furthermore, the system makes decisions according to the information from its neighbors.

7. "An autonomic computing system cannot exist in a hermetic environment". An autonomic computing system interacts in an open and heterogeneous environment with other elements by means of open standards. This feature is especially compatible with the grid philosophy (see Section 3.2).

8. "Perhaps most critical for the user, an autonomic computing system will anticipate the optimized resources needed while keeping its complexity hidden". An autonomic computing system must be able to act in advance and in an optimized fashion in order to increase the system performance. This ability must be exercised in a transparent way. The final aim is clear: users have to achieve their goals without worrying about the system operation and its implementation.

Figure 4.1: Autonomic computing principles [RAC]

In order to achieve all these features, four generic principles (see Figure 4.1) are embedded into the autonomic computing strategy, namely [BAC03, KC03]:

1. Self-configuration, that is, the ability to configure itself according to high-level policies. The components adapt dynamically to system changes. The aim is flexible adjustment to the environment, being able to face diverse workloads.

2. Self-optimization, that is, the capacity to seek ways of enhancing performance. The system monitors the resources and makes decisions that optimize the system operation from the obtained information.

3. Self-healing, that is, the feature that allows the system to detect, diagnose and repair hardware and software problems. This skill means discovering, diagnosing and reacting to system failures according to the policies indicated by the administrator.

4. Self-protection, that is, the ability to protect the system against possible attacks. In fact, it must detect them and act according to its needs.
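In practice these principles are usually realized as a closed monitoring and actuation loop over each managed element. The sketch below is only a schematic illustration of such a loop under assumed names (sensor, policy, actuator) and thresholds; it does not correspond to any specific autonomic toolkit.

import time

def autonomic_loop(sensor, policy, actuator, period=1.0, iterations=3):
    """Schematic control loop: monitor the element, analyze against a policy, act."""
    for _ in range(iterations):
        state = sensor()              # monitor: observe the managed element
        action = policy(state)        # analyze/plan: decide a corrective action
        if action is not None:
            actuator(action)          # execute: apply the new configuration
        time.sleep(period)

# Hypothetical example: keep a request queue below a high-level threshold.
queue_length = [95]
autonomic_loop(
    sensor=lambda: queue_length[0],
    policy=lambda q: "activate_extra_server" if q > 80 else None,
    actuator=lambda action: print("applying:", action),
    period=0.01,
)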

4.2 Autonomic levels

To measure the level of an autonomic computing system, a scale from manual to autonomic is defined in Figure 4.2. Each level replaces a certain area of human intervention and decision. It is important to emphasize that complexity occurs at all system levels, such as hardware, software and management [GC03].

Figure 4.2: Maturity levels of autonomic computing [RAC]

1. The most basic level requires the administrator to install, supervise and maintain the system.

2. The second level, or managed level, uses individual tools to analyze system components, using the results for decision making.

3. The third level, or predictive level, adds new structures to the supervision tools. Now the system can suggest recommendations, the administrator being responsible for approving and carrying out the needed actions.

4. In the fourth level, or adaptive level, the system takes on more responsibilities regarding decision making. The administrator is in charge of setting the policies that the system must obey.

5. The autonomic level is the fifth level. The system must take into account the high-level policies, adapting itself in the specified way to the environment.

Nowadays some software systems of this type have been created. These systems are the first attempts to test some of the characteristics of autonomic computing. Due to the dimension of the task and the different fields of Computer Science involved, such as cybernetics and fuzzy logic, in order to advance in the deployment of autonomic computing it is necessary that strong companies join this initiative. In this sense, in the next years the computing science community might become completely involved with this technology and new projects might be accomplished. Today, IBM [IBM03], HP and Microsoft are strongly betting on the development of this technology, creating new products for all the levels of the system.

4.3 Autonomic projects

In April 2003, IBM, considered the father of autonomic computing, announced the first project to manage complex computing environments, as well as new development tools for designing autonomic systems.

With the aim of spreading the autonomic computing idea to the scientific community, it is indispensable that standards and tools be developed. With this project, IBM suggested a set of technical guides that should be designed to make efficient joint work possible. The only way to reach this goal is to use open specifications. This project developed a common terminology for autonomic systems. Furthermore, it tried to use common standards from other fields, such as the Open Grid Services Architecture (OGSA) in grid computing (see Section 3.2), in order to define the autonomic management of complex systems.

Besides this project, IBM has developed some tools in order to help in the creation of autonomic systems, namely:

• Log & Trace Tool is designed to facilitate the understanding of the reasons for failures in the system. It converts data from different system components into a common format that is accessible in the monitoring phase. The tool helps the administrator to identify the reason for a problem. In addition, it can ease the development of self-healing capabilities.

• Agent Building and Learning Environment (ABLE) is an autonomic technology based on Java for the creation of intelligent agents. Due to the adaptation of agents to heterogeneous environments, it is very useful for developing autonomic system features.

• Autonomic Monitoring Engine is designed to reduce the complexity of heterogeneous environments. It can detect failures and potential problems before they affect the system behavior. It has self-protecting capabilities, making it possible for systems to solve critical situations automatically.

• Business Workload Management is a tool that helps to avoid bottlenecks by measuring different parameters such as reaction times. Afterwards, it adjusts the resources to achieve its aims.

Nevertheless, although there are tools to develop autonomic systems, the techniques applied to provide autonomic capabilities are not standard and they depend on the domain and the tackled problem.

Nowadays, there are several initiatives and projects with the aim of providing autonomic capabilities and testing autonomic concepts. IBM and Sun are two of the most active companies supporting autonomic development with some of their projects, such as z/OS, WebSphere, DB2, Tivoli and N1.

Meanwhile, different universities and research centers are using autonomic concepts to build their own infrastructures, such as the Self-* Storage system [GSK03], VIOLIN [JX04, RMX05], Autonomic Virtualized Environments [MB06], Kendra [McC03] and Application Performance Prediction and Autonomic Computing [MK04]. More autonomic projects are presented in [ACo].

4.3.1 Self-* storage system

Self-* storage system [GSK03] is a project of Carnegie Mellon University. It designs a cluster-oriented storage file system that provides self-organizing, self-managing, self-healing and self-configuring capabilities. It is based on both artificial intelligence and system control. Despite the low cost and quality of cluster components and resources, the system offers reliability and availability.

The Self-* storage system is in charge of storing files in the corresponding cluster node. Files are split into chunks, which are stored in a way adapted to the operation of every "storage brick". This adaptation is mainly aimed at obtaining fault tolerance, although it is possible to define other goals.

4.3.2 VIOLIN

Since current network infrastructures adapt slowly to changes, virtual networks can be designed as service-oriented, value-added networks. These networks provide efficiency and flexibility, although they are hard to use because of the accumulation of both network and service functions. [JX04, RMX05] is a project of Purdue University published in 2004 that proposes a Virtual Internetworking on OverLay INfrastructure (VIOLIN) to solve the overload of the application level and its adaptation to dynamic changes.

VIOLIN networks are virtual networks that work on overlay infrastructures, for instance PlanetLab [PlL]. They are composed of software-based virtual routers, switches and end-hosts. Their main characteristics are:

1. Each VIOLIN network is a virtual world. Thus, its communications are limited to this network.

2. All network elements can be created and deleted on demand, being automatically adapted to the corresponding circumstances.

4.3.3 Autonomic virtualized environments

According to the autonomic computing group of George Mason University, autonomic techniques can be applied to decide which computing resources must be allocated to each virtual machine as the system workload changes [MB06].

The aim of this project is to find an optimal utility function that helps to make these assignments in virtual environments, CPU resources being considered the main resource to share. They propose two ways of modeling the problem. In the first, all tasks belonging to the same virtual machine have the same priority; thus, the scheduler only takes into account the different priorities among virtual machines. In the second, the system is modeled by means of virtual CPUs, allocating a virtual CPU to each virtual machine. The assignment is not carried out by means of priorities but by adjusting the time that every virtual CPU can run on the real CPU. Both models are valid, though the second one is more accurate and takes better advantage of the available CPU time.

4.3.4 Kendra

Kendra [McC03] arises from the idea that metadata allows Internet searches to be more efficient and provides useful information for data distribution and decision systems.

Kendra's decision making module supervises the operation of the Internet and makes decisions trying to optimize information delivery according to the resources available at each moment. In order to facilitate its adaptability to different environments, a set of adaptable components can be run on demand. These components enable optimistic and pessimistic configurations. This project was tested with audio servers, creating the Kendra initiative to promote a distribution market of open digital contents.

4.3.5 Application performance prediction and autonomic computing

This project [MK04] from Clemson University concerns the use of prediction as an integral part of an autonomic system that monitors, analyzes and controls its behavior. A self-managed system can use predictions to perform an analysis of the system in order to determine the ideal resource configuration dynamically.

In order to prove this approach, they predicted the behavior of a web application taking into account an end-to-end model for each client-server pair, including the network influence. The prediction is based on a stochastic model of the operation of the TCP protocol. The inputs are the sending time and the time lost by the TCP protocol, and the output is the performance, obtained by means of the throughput achieved by a TCP flow in steady state. Finally, they build a model that indicates the application behavior according to the predicted workload of the network and the server.

Chapter 5

STATISTICAL METHODS TO MODEL AUTONOMIC SYSTEMS

Understanding the system is the key point in providing autonomic capabilities to it. Some well-known statistical methods can be used to help understand a system, since there may be a regularity in its behavior that can be mathematically explained. The analysis of different decision making techniques helps to provide appropriate self-management, because it is mainly based on automatically selecting the most suitable decision at each moment. Furthermore, with the aim of providing self-optimization, knowing not only the current system behavior but also the future one by means of prediction methods allows the system to improve its performance. Finally, due to the great amount of data in a grid system, the first step is trying to acquire knowledge about the nature of the data in order to simplify it. In order to understand the data, it is possible to group it by means of clustering methods.

Therefore, the next sections present the mathematical methods that can help to model autonomic systems.

5.1 Statistical methods for clustering

The aim of cluster analysis [Try39] is grouping objects of a similar kind depending on the similarities among them. The degree of association between two objects is high if they belong to the same group and low otherwise. Clustering can be found in many usual situations, like the association of related products in a department store. Nevertheless, there are scientific research areas where clustering can also be used to organize a huge number of elements, such as stars, animal species, or consumer or patient typologies.

Clustering techniques can be classified as divisive or agglomerative methods. In a divisive method, at first all the elements belong to the same cluster, which is then gradually divided into smaller clusters. In agglomerative techniques, at first each single member constitutes a cluster; then, the elements are gradually merged until a single large cluster is formed. Finally, the suitable final number of clusters is selected using expert knowledge. There are several cluster analysis methods; the best known are joining and k-means.

Joining is a hierarchical, agglomerative clustering technique. The algorithm starts by joining the closest objects according to some measure of distance and repeats the operation successively, obtaining larger clusters; in the last step, all objects are joined together. The result of this type of clustering is a hierarchical tree called a dendrogram (see Figure 5.1). The distance measure is the key factor in this method. The distance between two elements indicates the similarity between them and thus it is the criterion for grouping elements. There are several kinds of distance, like the Euclidean distance and Ward's method [War63]. The Euclidean distance is the geometric distance between two points in any space, whereas Ward's method uses a variance analysis, minimizing the sum of squares between two clusters at each step. In order to select the final number of clusters it is necessary to inspect the dendrogram. This must be done by an expert in the analyzed field who interprets the created groups. The reliability of this method depends on the expert, and it must be used cautiously [AB84].

Figure 5.1: Dendrogram

On the other hand, k-means [Mac67] is a non-hierarchical clustering technique. This method differs from hierarchical clustering: data is partitioned, there is no dendrogram and, most importantly, the number of clusters into which data are grouped must be supplied. Using the supplied number of clusters, data is assigned randomly to each cluster. Then the elements are moved among clusters with the goal of minimizing the variability within clusters, that is, reducing the dissimilarities within clusters and maximizing the variability among clusters. Cluster variability is measured with respect to the means of the classifying variables.

Other more sophisticated methods that can be used for data partitioning, although they require more computing power, are Self-Organizing Maps (SOM) [Koh90], the Expectation-Maximization (EM) algorithm [DLR77] and Quality Threshold (QT) clustering [HKY99].
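As an illustration of the partitioning approach, a minimal k-means run with scikit-learn (assuming that library is available; any similar implementation would do) could look as follows:

import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated groups of two-dimensional points.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])

# The number of clusters k must be supplied in advance.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.cluster_centers_)    # one centroid per cluster
print(kmeans.inertia_)            # within-cluster sum of squares (variability)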

5.1.1 Determining the number of clusters

There are several techniques to determine the number of clusters. Some of them use the agglomeration coefficient, which represents the squared Euclidean distance between the two clusters that are combined at a particular stage. One rule of thumb, named the elbow criterion, plots the number of clusters against the agglomeration coefficient in the same graph. The suitable number of clusters is selected where the elbow is located. Nevertheless, the interpretation is not easy, since there may be no pronounced elbow [HS83] or there may be more than one [AB84]. In this sense, as in hierarchical methods, the number of clusters depends on the expert's opinion. Another technique deals with the incremental changes in this coefficient: a high growth of the coefficient when the number of clusters is increased implies that the new number of clusters is more suitable than the previous one. However, just as with the elbow graph method, there can be cases where the number of clusters depends on the expert's opinion. In short, although many researchers use the agglomeration coefficient to try to select an appropriate number of clusters, many of them have to apply their knowledge and the state of the art of the field in which they are working to solve this problem.

There are other kinds of methods which are not based on the agglomeration coefficient. The Cubic Clustering Criterion (CCC) [Sar83] is a measure of within-cluster homogeneity relative to between-cluster heterogeneity. The number of clusters is selected at the peak of the CCC. Nevertheless, this method tends to propose a large number of clusters [MC85].

Another method is known as Hartigan's rule [Har75], which uses the within-cluster sum of squares to estimate the number of clusters. Its goal is to minimize the within-cluster sum of squares. This rule uses the following test:

F = \frac{WGSS(k) - WGSS(k+1)}{WGSS(k+1) / (n - k - 1)}

where n is the total number of instances to be clustered, k the number of clusters and WGSS(k) (Within Group Sum of Squares) the total sum of squared distances of cluster members from their cluster centroid over all clusters when k clusters are created. Hartigan suggests that values exceeding 10 justify increasing the number of clusters from k to k + 1 [Har75].
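The test translates directly into code; the sketch below assumes the within-group sums of squares have already been computed (for instance as the k-means inertia for each candidate k), and the example values are purely illustrative.

def hartigan_statistic(wgss_k, wgss_k1, n, k):
    """Hartigan's statistic for moving from k to k + 1 clusters."""
    return (wgss_k - wgss_k1) / (wgss_k1 / (n - k - 1))

n = 100                                   # hypothetical number of instances
F = hartigan_statistic(wgss_k=500.0, wgss_k1=300.0, n=n, k=2)
print(F, "-> add one more cluster" if F > 10 else "-> keep k clusters")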

5.2 Statistical methods for prediction

Predicting a system's behavior is one of the most challenging areas. There are several algorithms and techniques developed from different points of view.

First of all, there are models based only on general input-output descriptions, called model-free approaches [Zad97]. This kind of technique includes neural networks, fuzzy systems, evolutionary algorithms and hybrid techniques. These techniques can be applied to time series analysis in order to predict the system behavior [KE02, TSL01]. Furthermore, time series analysis can identify the nature of the phenomenon represented and help understand the variables that have an influence on it. Time series are temporal sequences of measurements where successive values represent consecutive measurements taken at equally spaced time intervals.

There is a powerful approach [Val02] to determine the parameters to predict a system based on time series analysis, specially designed to deal with great amounts of heterogeneous data. It also enables the identification of global changes of state, that is, drastic changes in the system behavior.

Another point of view concerns the system structure and composition. With this idea, different prediction methods can be analyzed, selecting a combination of on-line and a priori models.

5.2.1 On-line models

An on-line model, as its name indicates, models systems from current data obtained over time. Some data stream mining methods [GZK05] can be used to build these models. Nevertheless, rough techniques, like neural networks, linear regression or exponential smoothing, are usually applied, due to the fact that the system to predict is completely unknown, since it can be influenced by different factors, like features that cannot be observed, or by high complexity. As a consequence, the system cannot be successfully modeled in any other way.

These methods are based on the idea that the current results depend on the previous ones and that there is no random factor, or that it has little influence. Therefore, the first results are expected to be of little trust, since the system behavior is not yet known. If the observed values are cyclical or follow a clear trend, these methods will obtain sufficiently accurate values. In general, their strength lies in being able to extract the different components from the observed values while avoiding the noise present in the measurements. Thus, a greater amount of data will lead to better assumptions and predictions.

However, in case of a drastic change of tendency, the predictions start to go wrong, although they progressively improve. If the assumptions turn out to be wrong, the past knowledge of the system can be discarded and a new prediction can be built from current data. In this way, these methods can model new tendencies, obtaining reasonable results. In fact, in dynamic systems where there are non-controlled inputs and the rules change over time, on-line models offer much confidence. In short, although on-line models take advantage of the historical system behavior, they give more importance to the current state, making it easier to face system changes in the short term.

Their great advantage is that they do not require too much initial effort and they adjust well when the values have a clear tendency. On the other hand, they can provide wrong results in complex systems. Finally, the result depends on the selected algorithm, since each method is specialized in different data series.

5.2.1.1 Linear regression

The aim of a regression analysis [Lin87, SR94] is to know the statistical relation existing between a dependent variable (Y) and one or more independent variables (X1, X2, X3, ..., Xn). In this sense, a functional relation between the variables must be postulated. For the sake of simplicity, the most used relation is the linear one. When there is only one independent variable, it reduces to a straight line:

\hat{Y} = b_0 + b_1 X

In an n-dimensional space:

\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + \ldots + b_n X_n

As can be seen in Figure 5.2, the algorithm tries to find the straight line that comes nearest to most of the points. Atypical data can cause problems because the straight line is diverted towards these points; thus, atypical values should be removed beforehand. The algorithm uses some kind of spatial distance to minimize the distance between all the points and the straight line. The most traditionally used distance is the vertical distance to the straight line. When more data is incorporated into the system, the adjustment is expected to be better, improving the predictions.

Figure 5.2: Linear regression

In conclusion, in a system model the aim is that the system outputs fit the current inputs. A time-dependent function can also be used to provide the outputs.
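In practice the straight line can be obtained by ordinary least squares, for instance with NumPy; the sketch below is generic and the sample values are illustrative, not tied to any particular system model.

import numpy as np

# Observed samples of the independent and dependent variables.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

# Fit Y ~ b0 + b1*X by minimizing the (vertical) squared distances.
A = np.column_stack([np.ones_like(X), X])
(b0, b1), *_ = np.linalg.lstsq(A, Y, rcond=None)
print(b0, b1)             # estimated intercept and slope
print(b0 + b1 * 6.0)      # prediction for a new input value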

5.2.1.2 Exponential, polynomial and logarithmic adjustment

These kinds of adjustment go deeper into regression, including the approximation through polynomials of different degrees. Representing oscillating behavior and tendency changes makes it possible to reduce the error. Although the adjustment is much better than with linear regression, the resolution complexity is also increased.

The following adjustment functions are used:

\hat{Y} = b_0 + b_1 \log(X)

\hat{Y} = b_0 + e^{b_1 X}

\hat{Y} = b_0 + b_1 X + b_2 X^2 + b_3 X^3 + \ldots + b_n X^n

The method involves studying the system superficially and analyzing which function adjusts better to the inputs. Then, the parameters adapted to the system behavior are obtained following the normal procedure of linear regression.
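For example, a polynomial adjustment of a chosen degree, or a logarithmic one obtained by regressing on log(X), can be fitted with the same least-squares machinery (again a generic sketch with illustrative data):

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([1.2, 4.1, 9.3, 15.8, 25.5, 35.9])

# Polynomial adjustment: fit Y ~ b0 + b1*X + b2*X^2 (degree chosen by inspection).
b2, b1, b0 = np.polyfit(X, Y, deg=2)
print(b0, b1, b2)

# Logarithmic adjustment: fit Y ~ c0 + c1*log(X) by transforming the input.
c1, c0 = np.polyfit(np.log(X), Y, deg=1)
print(c0, c1)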

5.2.1.3 Exponential smoothing

This technique is often used in time series forecasting with great variability [GJD80]. These series have numerous peaks that appear in a graphical representation. The global behavior of the series can be understood better when these peaks are smoothed, reducing local fluctuations. This process is called exponential smoothing. The following formula is used:

\hat{Y}_1 = Y_0
\hat{Y}_{n+1} = \alpha Y_n + (1 - \alpha) \hat{Y}_n

Y_0 is the first value of the series. As can be seen, recent data is more influential in the forecast than older data. Thus, although the first prediction is not good enough, new predictions will be better. The factor α is an adjustment value between 0 and 1 that has to be chosen a priori and makes the results vary:

• If α is close to 1, greater importance is given to the most recent observations.

• If α is close to 0, the importance of each observation is more equitably distributed.

In the graph of Figure 5.3, α is 0.23. As can be seen, the peaks are smoothed while the data tendency is clearly maintained.

Figure 5.3: Exponential smoothing
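The recursion above translates into a few lines of Python; the series and the value α = 0.23 are only illustrative.

def exponential_smoothing(series, alpha=0.23):
    """Return the smoothed series following the recursive formula above."""
    smoothed = [series[0]]                    # the first forecast equals the first value
    for y in series[:-1]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

data = [70, 95, 80, 110, 90, 105, 85]
print([round(v, 1) for v in exponential_smoothing(data)])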

5.2.2 A priori models

A priori models include a previous learning phase in which the system is analyzed. Unlike on-line models, where the predictions change when new data is monitored, an a priori prediction model does not change over time. In short, these methods try to infer the model that governs the system behavior and the approximate values of the control variables.

There are several a priori models, such as ad-hoc models aimed at a concrete system, or models based on Petri nets, Markov chains and finite state machines. The analytic generation of the most suitable model according to the aim of this work will be explained later.

5.2.2.1 Petri Nets

Petri Nets (PN) were created by Karl Adam Petri in 1962 [Pet62]. PNs are considered a tool for system study and they can model the system behavior and structure. Furthermore, it is possible to drive the model into extreme situations that are difficult to reach in a real system. PNs provide mathematical representations of systems so that they can be studied like automata.

In PNs, there are some fundamental concepts: places, transitions and arcs. Arcs connect places with transitions. Input places are the places from which an arc connects to a transition; the places to which arcs connect from a transition are called output places. Places can contain tokens, and the number of tokens in each place is called the marking. A transition can be fired if there is the required number of tokens and all the preconditions (represented by means of predicates) of the event are fulfilled. When a transition fires, it consumes the tokens from its input places and performs the task. Then, it atomically places a specified number of tokens into each of its output places. Since firing is nondeterministic, parallelism and concurrency concepts can be applied, allowing this kind of system to be modeled.

Stochastic Petri Nets (SPN) are a formalism based on PNs, developed in the field of computer science for modeling system performance. Transition firing is determined by a stochastic mechanism. Specifically, a clock is associated with each transition. The clock indicates the remaining time until the transition is fired. Clocks run down at a marking-dependent speed. When any clock reaches 0, a marking change occurs.
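The basic firing semantics can be simulated with a marking and a table of transitions, as in the toy sketch below (an illustration of the concepts only, not a full Petri net tool; the net and names are hypothetical):

# Marking: tokens currently held by each place.
marking = {"p_ready": 2, "p_running": 0, "p_done": 0}

# Each transition consumes tokens from input places and produces tokens in output places.
transitions = {
    "start":  {"in": {"p_ready": 1},   "out": {"p_running": 1}},
    "finish": {"in": {"p_running": 1}, "out": {"p_done": 1}},
}

def enabled(name):
    """A transition is enabled if every input place holds enough tokens."""
    return all(marking[p] >= n for p, n in transitions[name]["in"].items())

def fire(name):
    """Consume the input tokens and produce the output tokens atomically."""
    if not enabled(name):
        raise RuntimeError(f"transition {name} is not enabled")
    for p, n in transitions[name]["in"].items():
        marking[p] -= n
    for p, n in transitions[name]["out"].items():
        marking[p] += n

fire("start")
fire("finish")
print(marking)    # {'p_ready': 1, 'p_running': 0, 'p_done': 1}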

5.2.2.2 Finite state machines

Finite state machines (FSM) [Gil62] are mathematical models that were designed for recognizing strings belonging to a language.

It is possible to model systems thanks to the definition of transitions among states and actions. In this case, the state stores information about the past, a transition means a state change when the indicated conditions have been fulfilled, and an action is a description of a task performed at a certain moment. In order to model systems, each symbol of the string will be one of the possible inputs that the system receives. Therefore, the system state will only depend on the different inputs received during its past behavior, being deterministic.

Nevertheless, it is possible to define nondeterministic finite automata that make it possible to model complex systems. In this case, the transition between states does not depend only on the input symbol: each input symbol for a certain state can have several output transitions, each with an associated probability. Thus, an FSM can be represented using a state transition diagram, which can represent the system evolution.

5.2.2.3 Markov chains

Markov chains [Mar06, Mar71] represent systems that change their state over time, each change being a system transition. These changes are not predetermined, although the probability of the next state is determined by the previous state. This probability is constant over time and thus it is possible to know the stationary future state.

The states are a characterization of the situation the system is in at a certain time. This characterization can be quantitative or qualitative. The system state at a certain moment is a variable whose values can only belong to the set of system states. Thus, the system model created by means of a Markov chain is a variable that changes its value over time. This change is named a transition.

It is possible to represent a Markov chain (see Figure 5.4) as a directed graph where the states are the nodes and the transitions are the edges. Each of these transitions has a probability of occurrence.

Figure 5.4: Markov chain

It is possible to evolve the system because it is modeled as a dynamic variable. After a high number of transitions, it is possible that the system reaches a stationary state, that is, the same value is obtained as in the previous evolution step. This limit value can be used as a long-term prediction.
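This long-term prediction corresponds to the stationary distribution of the chain, which can be approximated by repeatedly applying the transition matrix; the two-state chain below is purely illustrative.

import numpy as np

# P[i, j] = probability of moving from state i to state j.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Start from an arbitrary distribution and iterate until it stops changing.
pi = np.array([1.0, 0.0])
for _ in range(1000):
    new_pi = pi @ P
    if np.allclose(new_pi, pi):
        break
    pi = new_pi

print(pi)    # stationary distribution (about [0.8, 0.2]), usable as a long-term prediction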

5.3 Statistical methods for decision making

Decision making from the analysis of system data is one of the problems that must be faced in order to find the best strategy that maximizes the expected system performance.

Decision making is used in the field of system analysis and evaluation, but it can also be found in other areas. For instance, in the health sciences, the study of different strategies is fundamental to help make decisions about a disease from the symptoms observed in a patient. Sciences like physics and chemistry are characterized by deterministic or natural laws, so that if an experiment is repeated under the same conditions, the same results are always obtained, because they are governed by natural laws. Nevertheless, when a system behavior is predicted, the situations that take place do not obey a natural law in most cases. Therefore, a probabilistic data analysis is required, considering the principle of statistical regularity. The principle of statistical regularity is the property found in a random experiment indicating that, if the number of repetitions is very high, the relative frequency of an event tends to stabilize around the value that represents the probability of this event.

The calculation of probabilities is defined as the mathematical model of regularities that can be observed in the frequency series corresponding to a random phenomenon [Rio67].

Researchers need to make decisions about the truth or falsity of a hypothesis by contrasting it with the results obtained from samples.

Keeney and Raiffa define decision analysis as a procedure that tries to help people face difficult situations [KR93]. This analysis makes better decisions possible, but it does not necessarily lead to better results, because it only estimates the probability of occurrence of an event rather than guaranteeing its appearance.

There are several methods to select the best possible decision, which offer different perspectives for tackling problems. Among others, decision trees [RS61], the Bayesian approach [Bay63], Fisher's significance tests [Fis22, Fis25], hypothesis testing [NP28, NP33], the Markovian approach [Mar06, Mar71] and linear programming [Dan63] can be highlighted. Some of these methods can be used to provide autonomic and decision-making skills, depending on the domain and the problem addressed.


Part II

PROBLEM STATEMENT AND PROPOSAL


Chapter 6

PROBLEM STATEMENT

The complexity and size of current problems pose a challenge for researchers, who need to find novel solutions and architectures for their domain-based problems. These problems usually require large-scale computing and data analysis, which implies that they cannot be solved in a reasonable amount of time with conventional computers and former technologies.

6.1 Grand challenge problems

A grand challenge problem [WHJ+ 93] is a scientific or engineering problem that usually involves analyzing an enormous volume of data and requires high performance computing resources for a long time to obtain results. Examples of these grand challenge problems can be found in several areas, such as molecular biology, molecular design, process optimization, weather forecast, climate change prediction, astronomy or fluid dynamics.

Nowadays, a wide range of these problems are arising and standing out in the computing scene. A noteworthy example can be found in the high energy physics field. The CERN particle accelerator LHC (Large Hadron Collider) is able to generate petabytes of data per second, which are analyzed by four experiments: ALICE, ATLAS, CMS and LHCb.

From this field and from another, quite different scientific field, life sciences, a new project originated: EGEE (Enabling Grids for E-sciencE) [EGE]. EGEE provides a general framework for applications related to many other areas, such as geology, earth sciences, astrophysics or computational chemistry. EGEE is intended to provide computing and storage resources: over 20,000 available CPUs and about 5 petabytes of storage.

In life sciences, and more specifically in bioinformatics, the analysis and integration of multiple databases have taken advantage of grid research. For instance, the caBIG initiative (cancer Biomedical Informatics Grid) [CaB] is a grid composed of individuals and institutions that share data and tools related to the prevention and treatment of cancer. In the same way, the myGrid project [MyG] has developed high-level services to support data-intensive in silico experiments in biology.

In a completely different field, astronomy, the AstroGrid project [AGr] provides a uniform interface and remote access to astronomy databases.

In spite of this diversity, all these projects are oriented to specific fields or applications. Nevertheless, all of them are large-scale data-intensive applications that require high performance access to huge data stores and imply great complexity because of the enormous number of managed resources. Since the I/O crisis and the software complexity crisis are not completely solved, advances are required in I/O access and management in order to solve these grand challenge problems.

6.2 I/O and software complexity crisis: Two problems not solved yet

Much of the progress made in Computer Science has arisen as a result of specific crises, each implying a revolution in the corresponding research area. For instance, the well-known software crisis [Dij72], whose notion emerged at the end of the 1960s and which is characterized by the inability of software developers to deliver good quality software products on schedule and within budget, caused the birth of software engineering [Was76].

In the same way, the I/O crisis and, more recently, the software complexity crisis have been used to name situations in which the available technology does not solve the problems that originated such crises.

The I/O crisis is caused by the performance difference between computing and I/O phases, which turns the I/O system into a "bottleneck" [PGK88]. On the other hand, a new software complexity crisis has been detected in current software systems [Hor01]. Current systems are so complex that their administration is becoming unmanageable. Neither problem has been properly solved yet, although there are some initiatives for their resolution. In the first area, many different proposals have been provided; parallel I/O systems constitute one of the most active fields in this sense. In the latter one, the growing proliferation of autonomic computing can help software developers and administrators to ease the management and administration of complex software systems.

6.3 Solutions to the I/O crisis

The huge advances achieved in the computational field have enabled the resolution of new problems and the possibility of facing new, challenging applications. Most of these applications require the management and analysis of large volumes of data. However, as previously mentioned, the I/O phase usually constitutes the bottleneck of these applications. Different solutions can be applied depending on the point of view and the technology used.

6.3.1 I/O solutions: Approaches based on clusters

By deploying groups of servers and other resources together in a cluster, it is possible to increase application performance and obtain balanced traffic and high availability. The key to a successful implementation is that the cluster appears as a single, highly available system to end users. Besides, the cluster should be as easy to manage by administrators as a single-server system. This view of the entire cluster as one large system is called single system image [BCJ01].

Cluster file systems are integrated solutions for clustered and SAN (Storage Area Network) computing environments that offer transparent data sharing. A SAN is a high-speed subnetwork of shared storage devices [GNA+ 97]. Cluster file systems are available for several operating systems.

Great efforts have been made to solve the I/O crisis in this kind of environment. Among others, parallelism helps to minimize the I/O bottleneck by logically aggregating independent storage devices and I/O nodes and accessing all of them at the same time. Therefore, parallel file systems based on clusters have constituted one of the most successful solutions of the last decade. In Section 2.3, a study of several different parallel file systems was shown.

Although parallel file systems for clusters alleviate the I/O crisis, the ever-growing computing and storage needs of grand challenge problems exceed the usual cluster capacity. Thus, this kind of problem cannot be approached by means of this infrastructure, and new alternatives must be analyzed. As can be seen in Section 3.2, grid computing is an emerging infrastructure that is able to provide the computing and storage capacities needed to solve large-scale, costly projects. In fact, it is the technology most commonly used to try to solve grand challenge problems.

However, it is possible to extract knowledge from the lessons learnt from cluster computing and to analyze how they can be applied to this new technology. Specifically, cluster file systems can be analyzed in order to select those that can be most useful for solving the challenging I/O problems in grids. Then, the best alternative may be the use of cluster solutions as building blocks of an integral proposal to solve the I/O crisis in grids.

Among the different alternatives of parallel file systems for clusters, MAPFS (see Section 2.3.4) is a very suitable option to use in a grid because of its good fit with complex and dynamic environments. This suitability is due to the fact that MAPFS uses agents to tackle the I/O problems, which conceptually contributes to the adaptation to any kind of resource. Since a grid environment is composed of different and heterogeneous resources, the adaptation of agents to these resources is a key factor that motivates the selection of MAPFS. Furthermore, as clusters are among the most used resources in a grid because of their good power/cost ratio, it is possible to improve grid data operations through parallel accesses within the cluster resources. Therefore, the parallel access of MAPFS can be used inside a cluster to optimize the whole I/O performance.

6.3.2 I/O solutions: Approaches based on grid

Grid technology is an important part of several research areas because it provides computational and storage support to applications that need great computational capacity and analyze great amounts of data. Thus, grid computing has become a key piece in building a framework in which grand challenge problems can be tackled and solved.

In terms of storage, current data-intensive applications require access to huge amounts of data, distributed across different locations linked by wide-area networks. The high latency among these locations makes the access to and management of these data sources inefficient.

One of the main goals of data grids is to develop suitable solutions for data-intensive applications by means of grid-based tools. In Section 3.2.2, a deep study of data grid systems was shown.


6.3.2.1 Contribution

In spite of the fact that different schemes have been proposed as alternatives and that grid technology allows heterogeneous resources, and specifically data resources, to be shared, only a few research works in the field of data grids are oriented to increasing performance. The orchestration and coordination of grid services and resources make performance improvements difficult, although they are a desirable aspect in this kind of environment.

Moreover, as far as is known, there are no alternatives focused on optimizing the I/O at all possible levels. That is, none of them tries to maximize performance at cluster level, which could help to optimize the I/O in data grids.

This work tries to find different solutions to provide efficient I/O access, at both resource level and grid level. Some of these ideas can be extracted from cluster computing, adapting cluster file systems, like MAPFS, to grids. Any improvement in either area reduces the I/O crisis in grid environments. Thus, a new framework should be proposed to overcome the I/O problems of grid applications. Chapter 7 is focused on defining new methods to provide high performance access to huge volumes of data in a grid environment. Since a grid is a large and complex environment, the optimization should cover several different aspects, both the usual large-scale data transfers and the uniform access to data following the current grid orientation.

6.4 Solutions to the software complexity crisis

Current grand challenge problems require new technologies to be solved. Although grid computing enables the creation of infrastructures adapted to these challenges, its great complexity makes them even more difficult to manage. These problems aggravate the software complexity crisis and hinder the solution of current problems. Autonomic computing (see Chapter 4) is the best-known approach proposed in the research literature to tackle this crisis.

Autonomic computing [RAC, BAC03] is used to describe the set of technologies that enable applications to become more self-managing. Self-management involves self-configuring, self-healing, self-optimizing, and self-protecting capabilities. The word autonomic has been borrowed from physiology; just as a human body knows when it needs to breathe, software is being developed to enable a computer system to know when it needs to configure itself.

Autonomic computing deployment is one of the most promising areas in computer science. If the environment in which this discipline is used is a grid, the advantages are even clearer, due to the complexity of these environments.

6.4.1 Contribution

Part of the complexity of grand challenge problems is due to the huge amount of data involved in such processes. In these scenarios, the I/O access stage limits the overall performance of the system. Thus, this work intends to combine solutions from parallel I/O systems and autonomic computing in order to optimize the performance of the I/O phase (which is critical) in data grids.

The idea is to provide efficient access to large volumes of data. Therefore, Chapter 8 defines an autonomic framework for grid storage, which automatically makes decisions in order to improve the I/O system performance. Autonomic decisions are based on deep knowledge about the system behavior. Thus, Chapter 9 proposes a solution for properly monitoring a grid infrastructure in spite of its complexity and heterogeneity. As a grid is composed of several different and heterogeneous resources, decision making means selecting the most suitable resources according to their fitness and efficiency. Specifically, in a data grid these resources are storage resources, and it is necessary to take into account that files can be accessed later on. Therefore, a mathematical formalism for selecting the best resources that takes future accesses into account is defined in Chapter 10, in order to improve later operations on them. These autonomic decisions must be transparent and provide both fault tolerance and performance enhancement, which reduces the complexity of the whole system, minimizing the effect of the software complexity crisis.


Chapter 7

DEFINITION OF A HIGH PERFORMANCE GRID STORAGE ARCHITECTURE

Clusters of workstations have meant a revolution in high-performance computing, since they constitute a cheap solution to supercomputing. Within the cluster philosophy, the concept of single-system image can provide a unified access to system resources, increasing the availability and uniformity of the whole system.

Nevertheless, this feature is far from the original idea of the grid problem as a “flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources” [Fos01]. Grid computing enables the virtualization of distributed computing and data resources to create a single system image. This virtualization provides flexibility and, unlike clusters and distributed computing, allows resources to be heterogeneous and geographically distributed.

Furthermore, grid computing does not fit seamlessly with cluster technology and, consequently, some efforts have been made in this direction. A significant example is the set of initiatives for porting MPI, a message-passing paradigm traditionally bound to clusters, to a grid environment [MHG03]. Some of these approaches are MPICH-G2 [G2M], LAM-MPI 7.X [LAM], PACX-MPI [GRBK98], MagPIe [KHB+ 99] and Stampi [ITKT00].


Therefore, it would also be desirable to adapt I/O solutions provided by the research made on clusters to grid environments. The main reasons are:

• To make the most of all the work done on clusters when developing grid solutions.

• Parallel I/O systems for clusters provide high performance.

• A great number of the resources of a grid may be clusters. Thus, the integration between clusters and grids is required.

Thus, it would be advisable to adapt cluster file systems to grid environments in order to provide a suitable solution to the increasingly important grid I/O problems. These problems arise from the great amount of data that has to be analyzed in order to solve grand scientific challenges.

7.1 A toolkit for accessing large volumes of data

Given the huge requirements of efficient data access by data-intensive applications in grid environments, different needs arise in grid data management and access. Three important aspects related to these needs can be noticed:

1. An efficient data service that manages file resources following the OGSA architecture is required. The Web services technology is suitable for managing services and resources through WSRF [GKM+ 06]. As far as is known, there are no WSRF-based data services designed to improve the performance of I/O operations by means of a parallel file system.

2. Every grid project usually provides "ad hoc" solutions for data management. A data service often offers a native interface, which does not provide interoperability with other I/O systems. OGSA-DAI [AAB+ 05] has emerged to provide uniform access to data sources in a grid environment. However, OGSA-DAI is not focused on the performance of the I/O operations. Therefore, providing a bridge between interoperability and performance optimization is an important need of current data grid projects.

3. The most important drawback of the previous scenario is the poor performance exhibited by Web services. In fact, the use of XML and SOAP as transfer protocol is not appropriate for performance-critical applications [CGB02]. Although there are different proposals for dealing with this decrease of performance [W3C05], [SM02], [LS00], none of them is suitable for scenarios demanding high throughput. In this context, GridFTP has emerged as an optimized protocol for the transfer of large files, since it is not based on SOAP transfers (GridFTP is not service-oriented, which breaks the current main point of the grid philosophy). Additionally, GridFTP extends the basic protocol to support data transfers among multiple servers (striping). Furthermore, GridFTP allows multiple TCP streams to be used in parallel from a source to a sink (parallelism) [ABKL05]. In spite of providing good I/O performance, it is possible to improve it even more with the application of a parallel file system.

All these problems have to be solved in order to enhance grid I/O performance. Our approach involves the development of a generic framework that provides high performance data grid storage by means of the definition of different services, suitable for the three identified scenarios. The next sections describe the three approaches, called MAPFS-Grid Parallel Data Access Service, MAPFS-DSI and MAPFS-DAI, designed and developed to face each problem. In general, in order to take advantage of ideas from cluster computing applied to grid environments, MAPFS-Grid provides a grid-like interface to a parallel file system based on clusters. Any cluster file system could be used to access data within a cluster following the MAPFS-Grid architecture. In order to test it in a concrete environment, MAPFS [PCFGR06] has been selected, due not only to the previous experience with this file system but also to its better fit with complex and dynamic environments. The feasibility of this combination is given by the fact that grid environments are composed of different and heterogeneous resources, clusters being among the most used because of their good power/cost ratio. Thus, it is possible to improve grid data operations through parallel accesses within cluster resources.

MAPFS is used in clusters of workstations. This file system distributes data stripes over all the nodes of the cluster. On the other hand, MAPFS-Grid allows heterogeneous servers connected by means of a wide-area network to be used as data repositories, by storing data in a parallel way through all the clusters and individual resources that compose the grid.

Therefore, MAPFS-Grid, as a complete suite of services, allows applications to benefit from two levels of parallelism [PSHR06]:

1. Level 1: the higher level provides parallelism among the grid storage elements, that is, inter-storage element parallelism.

2. Level 2: the lower level provides parallelism among the set of nodes of each cluster, that is, intra-cluster parallelism. At this level, MAPFS is applied.

These two levels are depicted in Figure 7.1. Both levels are integrated and cooperate with the aim of providing enhanced I/O bandwidth, as the sketch below illustrates.
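
The following Python sketch is a simplified, hypothetical illustration of how the two levels could cooperate (it is not MAPFS-Grid code): blocks of a file are first distributed round-robin among the storage elements (level 1) and then, inside each element, striped again among its cluster nodes (level 2).

def split_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Divide the file into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def two_level_distribution(data: bytes, storage_elements: dict[str, int],
                           block_size: int = 4):
    """Level 1: round-robin among storage elements.
    Level 2: round-robin among the nodes of each element (its cluster)."""
    blocks = split_blocks(data, block_size)
    names = list(storage_elements)
    # Level 1: inter-storage element parallelism.
    per_element = {name: [] for name in names}
    for i, block in enumerate(blocks):
        per_element[names[i % len(names)]].append(block)
    # Level 2: intra-cluster parallelism (MAPFS-like striping).
    layout = {}
    for name, element_blocks in per_element.items():
        nodes = storage_elements[name]
        layout[name] = {n: [] for n in range(nodes)}
        for j, block in enumerate(element_blocks):
            layout[name][j % nodes].append(block)
    return layout

# Example: two clusters with 3 and 2 nodes respectively.
print(two_level_distribution(b"ABCDEFGHIJKLMNOP", {"cluster0": 3, "cluster1": 2}))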

The high level involves better coordination among all the resources of the grid. In fact, different monitoring tools allow a broker to have global knowledge about the grid composition and the status of all its clusters. In this sense, it is possible to make decisions about the most valuable storage resources for the I/O operations. Because the performance of grid applications depends on the selected resources, the broker will be analyzed in depth in Chapter 8. The client can use this information to write a file in a parallel way, increasing the performance, because different clusters can be used for the same operation.

Figure 7.1: MAPFS-Grid Overview

In order to understand the way in which both levels work in MAPFS-Grid, an analogy can be described. The development of both levels in MAPFS-Grid can be considered analogous to the behavior of the following hypothetical infrastructure, which may be used to maximize parallelism in a grid whose computing resources are different clusters:

• MPI [MPI97] can be used in every cluster of workstations, as usual.

• Different initiatives for using MPI in grids, like MPICH-G2, can be used at the high level for running MPI applications on the grid resources, that is, the heterogeneous clusters.

Unlike this hypothetical infrastructure, MAPFS-Grid is intended to improve the I/O phase instead of enhancing the computing phase.

7.2 Providing an efficient service-based access to data resources

Nowadays, the grid community has oriented its activity towards a service model [FKNT02]. The Open Grid Services Architecture (OGSA) defines an abstract view of this trend in grid environments. This architecture has represented a 180-degree turn in the conception of current grids. From the moment of its definition, there has been a strong convergence between grid and Web services. Indeed, the grid field has been improved by Web services concepts and technologies. However, this vision does not deal with the performance aspects involved in the use of grid services.

Moreover, grid technology has been focused on high throughput computing because of its intrinsic characteristics. The great complexity caused by the use of heterogeneous resources and diverse kinds of networks makes grid management difficult. The lack of efficient management imposes an important restriction on obtaining good application performance, which limits the use of a grid to solve grand challenges. Although grid technology is expected to provide high performance computing, most grid projects are not focused on it, since nowadays there are not many performance-oriented, service-based grid techniques because of their complexity. Therefore, it would be advisable to build them, since any improvement in the I/O implies an improvement in application performance.

7.2.1 MAPFS-Grid Parallel Data Access Service

According to the service-based orientation of the grid philosophy, indicated by OGSA, the first proposal is to provide a WSRF-compliant service, named Parallel Data Access Service (PDAS), that allows parallel I/O operations to be made in a cluster environment. The conception of this service comes from Data Access and Integration Service (DAIS) [DAI]. PDAS is an adaptation of this concept from the performance and parallelism viewpoints.

Figure 7.2: Parallel Data Access Service (PDAS)

The main and basic operations provided by this service are read and write routines. An overview of this service is shown in Figure 7.2, where an I/O operation is depicted. The client needs to get a reference to the PDAS. Previously, the client must have contacted a broker service, which assigns the most suitable service to the client. Different criteria can be used at this stage. As the brokering phase plays a main role in the performance of grid applications, since this performance depends on the selected resources, the problem requires an in-depth study; it will be analyzed from Chapter 8 on. Once the client obtains the reference, it invokes the write operation on the PDAS, transferring the data to be written. The PDAS writes the client data in a parallel fashion in its associated cluster.
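
The following Python sketch is a toy illustration of this interaction (the class and method names are hypothetical; the real PDAS is a WSRF service): the client obtains a PDAS reference from a broker and invokes a write operation, and the PDAS stripes the received data over the nodes of its cluster.

class Broker:
    """Toy broker: returns a reference to the most suitable PDAS."""
    def __init__(self, services):
        self.services = services
    def select_pdas(self, criteria=None):
        # Any selection criteria could be plugged in here; we simply
        # pick the first registered service for illustration.
        return self.services[0]

class PDAS:
    """Toy Parallel Data Access Service: stripes data over its cluster nodes."""
    def __init__(self, nodes):
        self.nodes = {n: [] for n in range(nodes)}
    def write(self, data: bytes, block_size: int = 4):
        blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
        for i, block in enumerate(blocks):          # done in parallel in the real system
            self.nodes[i % len(self.nodes)].append(block)
        return len(blocks)

broker = Broker([PDAS(nodes=4)])
pdas = broker.select_pdas()           # 1. obtain a PDAS reference from the broker
written = pdas.write(b"client data")  # 2. invoke the write operation on the PDAS
print(written, pdas.nodes)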

Figure 7.3: MAPFS-Grid PDAS providing two levels of parallelism

This is the first building block of the general architecture, that is, MAPFS-Grid. The two levels of parallelism provided by PDAS are shown in Figure 7.3. The level 1 parallelism is provided by several PDAS (one in every storage element), which support a distributed data repository. The level 2 parallelism is offered directly by MAPFS in those storage elements that are clusters.

The main advantage of PDAS is that it constitutes a WSRF-compliant grid service, which provides reasonably good performance and is easy to deploy in a grid scenario whose main components are clusters of workstations.

7.3 Providing a uniform service-based access to data resources

As previously mentioned, a large number of applications must be able to "consume" huge volumes of data in order to improve their behavior. Therefore, data management and access are taking a relevant role in current applications and systems. Although the efficiency of data access is a key factor, the interoperability between different data sources and the need for "standard" interfaces to access data are attracting more and more interest in current solutions. Particularly, in grid environments this demand has originated the proliferation of specifications such as Web Services Data Access and Integration (WS-DAI), whose main goal is to define a ". . . specification for a collection of generic data interfaces that can be extended to support specific kinds of data resources . . . " [AAL+ 05]. OGSA Data Access Interface (OGSA-DAI) [KAA+ 04] is intended to provide a reference implementation of WS-DAI. The main goal of the OGSA-DAI project is to provide uniform access to data resources, being compliant with OGSA [FSB+ 06]. OGSA-DAI has been released on GT3, Axis, the OMII infrastructure and GT4, based on different specifications, such as OGSI, WS-I, WS-I+ [ARD+ 05] and WSRF respectively.

In spite of all the benefits given by OGSA-DAI, its performance can be enhanced. In fact, it is stated that OGSA-DAI developers “. . . expect to invest significant effort in engineering good performance. . . ” [AKA+ 05].

In short, our proposal is an extension of OGSA-DAI, named MAPFS-DAI, which offers two features:

1. Facilitating uniform access to data resources by means of a service-oriented architecture. With this aim, OGSA-DAI [KAA+ 04] was chosen, since it provides a reference implementation of WS-DAI.

2. Improving the performance provided by OGSA-DAI, through the use of a parallel I/O system suitable for grid environments.

Both features are combined in order to define MAPFS-DAI [SPK+ 06]. The feasibility of this combination is guaranteed by the following common characteristics of OGSA-DAI and MAPFS-Grid:

1. Both are OGSA compliant [FSB+ 06]. OGSA-DAI supports WS-I, WS-I+ and WSRF. MAPFS-Grid supports WSRF. For this reason, WSRF will be used as the common grid specification for MAPFS-DAI.

2. OGSA-DAI can be extended, adding new functionalities and activities. Moreover, OGSA-DAI provides a flexible way to add new data resources. Although OGSA-DAI is mainly intended for relational or XML databases, users can provide additional functionalities. MAPFS-DAI supports access to flat files in an efficient fashion. OGSA-DAI allows files to be accessed on grids, but so far it is focused on file formats such as OMIM, SWISSPROT and EMBL, and not on performance.

3. Both OGSA-DAI and MAPFS-Grid rely on the factory pattern for the creation of grid services associated with data.

7.3.1 MAPFS-DAI

Given the extensive number of data sources and storage systems in different grid projects, the interoperability among them is taking a relevant role in grid research. The OGSA-DAI project [KAA+ 04] intends to provide uniform access to data resources, being compliant with OGSA [FSB+ 06]. However, the performance of OGSA-DAI is quite poor.

Some of the approaches mentioned in Section 3.2.2 address several aspects related to data grids. Nevertheless, as far as is known, there is no work that deals with the combination of access uniformity and high performance. The integration of the OGSA-DAI philosophy for providing uniform access with the idea of using two parallelism levels can address this combination. MAPFS-DAI constitutes an extension of the OGSA-DAI architecture whose aim is to increase this performance.

The problem of PDAS is that it offers a native interface, which does not provide interoperability with other I/O systems. This feature goes against the principles of a growing grid environment. For this reason, MAPFS-DAI is proposed as a bridge between interoperability and performance optimization.


Figure 7.4: MAPFS-DAI within the OGSA-DAI Architecture

According to our goals, the MAPFS-DAI architecture must be an extension of the OGSA-DAI architecture [AAB+ 05]. Thus, as can be seen in Figure 7.4, the MAPFS-DAI architecture is divided into four layers:

1. Data Layer, composed of data resources. The data resources exposed by MAPFS-DAI are flat and unformatted files. On the other hand, OGSA-DAI supports other kinds of data resources, such as relational and XML databases.

2. Business Logic Layer, which is composed of:

• A suitable Data Service Resource, named File Data Service Resource, associated with flat and unformatted files.

• A MAPFS-DAI accessor, whose main goal is to control access to the underlying data resource, that is, files. The MAPFS-DAI accessor allows activities to access data resources. Activities are the operations performed by data service resources. Currently in MAPFS-DAI, two activities have been fully developed, one for reading (FileAccessActivity) and another one for writing (FileWritingActivity), which are compliant with the File Activities defined by OGSA-DAI.

3. Presentation Layer, which provides Web service interfaces to data services. MAPFS-DAI uses WSRF.

4. Client Layer, with two components: client application and client toolkit. The client toolkit eases the development of client applications by providing useful and simple tools to create the perform and response documents exchanged between the client and the server. Both documents must fulfill the requirements specified by the service schema. In this way, the storage system performance can be enhanced without changing the client application.
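
As a purely illustrative sketch (Python pseudocode rather than the real OGSA-DAI Java interfaces), the following shows the intended separation of concerns: the activities only know the accessor, and the accessor hides how the file data is actually stored, so the storage back-end can be improved without changing the client side. All class bodies here are toy stand-ins.

class MAPFSAccessor:
    """Toy stand-in for the MAPFS-DAI accessor: hides how file data is
    actually striped over the cluster behind simple read/write calls."""
    def __init__(self):
        self._store = {}          # filename -> list of blocks
    def write(self, name, data, block_size=4):
        self._store[name] = [data[i:i + block_size]
                             for i in range(0, len(data), block_size)]
    def read(self, name):
        return b"".join(self._store[name])

class FileWritingActivity:
    """Toy activity: an operation performed by a data service resource."""
    def __init__(self, accessor):
        self.accessor = accessor
    def perform(self, name, data):
        self.accessor.write(name, data)

class FileAccessActivity:
    def __init__(self, accessor):
        self.accessor = accessor
    def perform(self, name):
        return self.accessor.read(name)

accessor = MAPFSAccessor()
FileWritingActivity(accessor).perform("results.dat", b"unformatted file data")
print(FileAccessActivity(accessor).perform("results.dat"))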

Figure 7.5: Two-level parallelism by using MAPFS-DAI and OGSA-DAI compliant storage elements


The main disadvantage of providing uniform access is that performance is drastically reduced. MAPFS-DAI tries to relieve this decrease thanks to the two levels of parallelism of the MAPFS-Grid philosophy:

1. The higher level provides parallelism among the set of storage elements.

2. The lower level provides parallelism among the set of nodes of each cluster.

These two levels are depicted in Figure 7.5, which also shows the interoperability between MAPFS-DAI and other OGSA-DAI compliant systems. This is an important feature and advantage, since every storage element that exhibits the OGSA-DAI interface can be used together with MAPFS-DAI elements. This is the case of element 1 in Figure 7.5. New OGSA-DAI compliant elements could be created from Mass Storage Systems (MSS), such as CASTOR, and incorporated into this scenario. Thanks to the interoperability and the parallelism provided by OGSA-DAI and MAPFS-DAI respectively, different storage systems can be accessed in parallel.

7.4 Providing a high-performance access to data resources

In grid projects, there is usually a need to transfer large files among different virtual organizations. This is especially significant in data-intensive applications, where accessing and dealing with data is the most critical process.

In this scenario, although the service-based orientation of the grid philosophy indicated by OGSA has brought a large number of advantages, given the synergies of grid and Web services, some limitations have also arisen in this context. Maybe the most important drawback of this combination is the poor performance exhibited by Web services. In fact, the use of XML and SOAP as transfer protocol is not appropriate for performance-critical applications [CGB02], since SOAP introduces a noticeable amount of overhead. As previously mentioned, this is one of the limitations of grid services and particularly of MAPFS-Grid PDAS and MAPFS-DAI.

There are different proposals based on compressing the sent data to deal with this decrease of performance (see [W3C05, SM02, LS00]). Nevertheless, none of them is suitable for scenarios demanding high throughput. Moreover, the improvement does not imply high performance, because the transfer is again performed by means of SOAP, which is considerably slower than other alternatives, such as FTP. In this context, GridFTP [ABB+ 02b] has emerged as an optimized protocol for the transfer of large files. In order to provide high performance, unlike other grid services, the GridFTP service is not based on SOAP transfers. In addition, GridFTP provides two important characteristics [ABKL05]:

1. Striping: GridFTP extends the basic protocol to support data transfers among multiple servers.

2. Parallelism: GridFTP offers the possibility of using multiple TCP streams in parallel from a source to a sink.

Both features can be used in combination, and they offer good performance.

Although there are different approaches for increasing the performance of the transfer between client and servers (e.g., parallelism and striping), access to a single server constitutes a bottleneck in the whole system, since the I/O bandwidth may be considerably lower than the network bandwidth. Nevertheless, an advantage of GridFTP is the possibility of modifying its Data Storage Interface (DSI) in order to transform the data retrieval process. The responsibility of the DSI is to read from and write to the local storage system. The DSI is composed of several function signatures, which must be filled with suitable semantics to provide a specific functionality. An important characteristic is the fact that the DSI can be loaded at runtime. Approaches from the parallel I/O field can be successfully applied to this scenario.

Different techniques have been proposed as alternatives. However, as far as is known, none of them is focused on applying parallel I/O techniques to enhance the performance of the GridFTP server. This is the main motivation of our work: a new DSI, named MAPFS-DSI, has been proposed in order to improve the performance of GridFTP significantly.

7.4.1 MAPFS-DSI

In order to build MAPFS-DSI [SPG+ 06], it is necessary to modify the GridFTP server. The GridFTP server is composed of three modules:

1. The GridFTP Protocol module, which is responsible for reading from and writing to the network, implementing the GridFTP protocol. In order to remain interoperable, this module should not be modified.

2. The Data Transform module.

3. The Data Storage Interface.

The functionalities of the last two modules are merged in the Globus Toolkit 4 [GlA], and DSI is used to name both of them. The GridFTP server requests from the DSI module whatever operation it needs from the underlying storage system. Once the DSI performs this operation, it notifies the GridFTP server about its completion.

The flexibility of the GridFTP server has been used to transform the I/O operations: MAPFS I/O routines are used instead, enhancing the server. The result is MAPFS-DSI. MAPFS-DSI enables GridFTP clients to read and write data in a storage system based on MAPFS. As the target architecture of MAPFS is a cluster of workstations, the GridFTP server should be the master node of a cluster of workstations where MAPFS is installed. The combination of MAPFS and GridFTP, shown in Figure 7.6, aims to alleviate the bottleneck that the GridFTP server constitutes. The main idea is that the I/O system attached to the GridFTP server can be optimized by means of parallel I/O techniques.

Figure 7.6: Two-level parallelism by using MAPFS-DSI and GridFTP compliant storage elements


The interface of any DSI is composed of the following functions to be implemented:

• send (get).

• receive (put).

• simple commands, such as mkdir.

These operations have been implemented by invoking MAPFS functions. In order to provide access to MAPFS, the GridFTP server must contain two separate modules, which are designed and implemented to work together (a minimal sketch of this structure is given after this list):

• The DSI itself. It manages the GridFTP data, acting as an interface between the server and the I/O system. The aim of this module is to provide a modular abstraction layer over the storage system. This module can be loaded and switched at runtime. When the server requires action from the MAPFS system, it passes a request to this module. Nevertheless, the I/O operation is not directly performed by the DSI; for this purpose, a second module has been implemented.

• The MAPFS driver, the second module previously cited, carries the responsibility of performing the I/O operations. It receives the requests from the DSI and makes a parallel access to the cluster nodes.

This modular structure provides system flexibility, allowing the switch between different interfaces and storage systems without stopping the GridFTP server.
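
The following Python sketch (hypothetical code, not the Globus DSI C interface) illustrates this modular structure: the storage interface only forwards send/receive requests to whichever driver is loaded, so a driver performing MAPFS-like parallel accesses can replace a plain local driver without changing the server-facing interface.

class LocalDriver:
    """Toy driver that stores data in a plain dictionary."""
    def __init__(self):
        self._files = {}
    def write(self, path, data):
        self._files[path] = data
    def read(self, path):
        return self._files[path]

class MAPFSDriver(LocalDriver):
    """Toy driver standing in for MAPFS: here it only reports that the
    request would be served by a parallel access to the cluster nodes."""
    def write(self, path, data):
        print(f"striping {len(data)} bytes of {path} over the cluster nodes")
        super().write(path, data)

class DataStorageInterface:
    """Toy DSI: the GridFTP server hands every storage request to this
    module, which delegates the actual I/O to the loaded driver.  The
    driver can be switched without touching the server-facing interface."""
    def __init__(self, driver):
        self.driver = driver
    def recv(self, path, data):   # 'receive' (put)
        self.driver.write(path, data)
    def send(self, path):         # 'send' (get)
        return self.driver.read(path)

dsi = DataStorageInterface(MAPFSDriver())   # the driver could be switched at runtime
dsi.recv("/grid/file.bin", b"GridFTP payload")
print(dsi.send("/grid/file.bin"))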

Figure 7.7: MAPFS-DSI within a data transference scenario

MAPFS-DSI is embedded within the general scenario in which GridFTP is used. As can be seen in Figure 7.7, two independent parts can improve the performance of a data transfer operation: firstly, the specific features of GridFTP (parallelism and striping), which can be used with any GridFTP server; and secondly, the parallelism provided by MAPFS. This implies that the use of MAPFS within the GridFTP server offers two levels of parallelism and striping, preventing the server storage system from becoming a bottleneck in the whole data transfer process.


Chapter 8

DEFINITION OF AN AUTONOMIC FRAMEWORK FOR GRID STORAGE

In Section 7.1, a grid-based framework named MAPFS-Grid was presented, whose main goal is to enhance the performance of data grid applications. Although MAPFS-Grid is focused on high performance storage, the great complexity of grid environments causes several difficulties. In fact, since every resource of the grid can itself be composed of several components (e.g., clusters of workstations), it is necessary to optimize the I/O performance of every resource before tackling the global I/O optimization. Additionally, current high-performance I/O approaches are characterized by their complexity; these systems are based on specific features that require specialized administrator skills. Furthermore, the increasing needs of applications for larger volumes of data and higher throughput have transformed data infrastructures into grid-based frameworks, which are more suitable to meet these demands. Data grids have been defined as a set of storage resources and data retrieval elements, which allow applications to access data through specific software mechanisms [CFK+ 00]. Nevertheless, the use of a grid environment makes the management and administration of such data access systems even more complex.

In the emerging knowledge area, applications have to tackle an increasing volume of data. Even today, a solution to the problem of accessing data sources in an efficient and easy way has not been achieved. In [RTC05], several challenges for I/O systems have been identified. Among them, it is important to emphasize the need to increase the manageability of I/O systems, due to their high complexity. However, this management cannot be performed by administrators, since they should not have to know the underlying I/O system and there are numerous parameters and resources to control. Self-configuring and self-optimizing capabilities can help to manage such complex environments.

A large number of advantages can be achieved if this discipline is applied to heterogeneous and distributed environments, among which grids take on special relevance. Autonomic computing helps to reduce the complexity and drawbacks of grid environments by managing their heterogeneity, the use of wide-area networks, the existence of different virtual organizations and so on. In short, it is advisable to provide an autonomic storage system, which is in charge of providing autonomic capabilities to the storage layer. The aim is to provide self-management and self-optimization characteristics to a grid I/O system, alleviating the tasks of administrators. In this sense, autonomic computing can offer self-management features to I/O systems, which can constitute a huge step forward in the I/O field. This approach is known as autonomic storage.

Autonomic storage is based on the idea of providing efficient access to large volumes of data, such as those stored in data grids. In most systems, files are written locally and, once the write has finished, files are either sent to the server that will finally store them or replicated to the right places (according to replica management policies) [BMRW98, VTF01, KLSS05]. Thus, the main difference between this work and previous work on grid I/O is the way writes are performed. Data is not usually required by an I/O system at the same time at which it is produced; hence, to enhance the I/O performance when a write is made, not only the current status of the system should be analyzed but also its future status, since the actual access to the data will take place later on. Therefore, the main contribution of this proposal is to enhance the access to remote servers, writing data on the server that will offer the best performance when the file is going to be read. In short, the system will automatically decide the best data placement according to the expected future I/O performance, based on predictions of the future behavior.
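
As a minimal illustration of this idea (a sketch under invented assumptions, not the GAS implementation), the following Python fragment chooses the write target as the storage element with the best long-term prediction of read bandwidth, where the prediction is crudely approximated here by the mean of past observations; the real system relies on the statistical models discussed in Chapter 5.

def predicted_bandwidth(history):
    """Crude long-term prediction: the mean of the past observations.
    The real system would use the statistical models of Chapter 5."""
    return sum(history) / len(history)

def choose_write_target(servers):
    """Pick the storage element expected to serve future reads fastest."""
    return max(servers, key=lambda name: predicted_bandwidth(servers[name]))

# Observed read bandwidth (MB/s) of each storage element over time
# (made-up example values).
observations = {
    "se0": [40, 42, 38, 41],
    "se1": [55, 20, 60, 25],   # fast peaks, but irregular
    "se2": [48, 47, 49, 48],
}
print(choose_write_target(observations))   # -> 'se2'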

8.1 Providing Autonomic Storage

Data grids have emerged as appropriate environments for dealing with large volumes of data. Typical data grids follow an approach similar to the publish-subscribe model. In this kind of environment, clients can find, by means of catalogues, suitable replicas for read operations or storage locations for write operations. The research performed within this discipline has not traditionally been focused on increasing the performance of the I/O stage. The proposal of this work aims to enhance this performance in a user-transparent way. This goal can be achieved by means of autonomic computing, which provides self-management capabilities to complex systems. Following this idea, the Grid Autonomic Storage (GAS) [SPM+ 08], a self-optimizing storage-based broker that helps to make suitable decisions about the target storage resources in order to improve future I/O requests, has been designed in this work. The next subsections describe the general architecture of GAS.

8.1.1 A system-broker-based solution

In a generic data grid scenario, each file server is independent, allowing clients to use one or thousands of servers in the same way. Therefore, in this kind of environment, grid application performance depends on which nodes are chosen. A first solution to access these servers could be the use of self-contained clients. Each client maintains its own list of servers; when a client needs to read or write a file, it queries all the servers on its own list and finally makes a decision about the most suitable servers, namely the combination of servers that best fits the job's requirements. On the one hand, this solution allows very accurate selection algorithms to be set up, but on the other hand, maintaining an updated server list implies expensive resource discovery. This idea corresponds to the client-broker vision: the selection is designed to comply with client needs, selecting the services according to the client policies without having a global vision of the grid.

Another possible solution introduces a broker that centralizes the servers' information. Instead of every client maintaining its own list, this list is maintained by the broker, transparently to clients. Thus, the load related to resource discovery is removed from clients. Moreover, the broker has a complete and global vision of the grid elements and can select the most suitable resources for the client jobs in order to improve the performance of the whole grid; that is, the broker handles decision making. Clients submit resource requests using a resource description language, then the broker queries the known available servers and tries to match the resource requirements with the available resources. When the decision is made, the broker answers the client, giving the best resource among the list of matchmaking servers. Using a resource description language allows the broker to operate in a heterogeneous environment and to update the server list dynamically. The combination of dynamic updating and the capability to operate in a heterogeneous environment provides scalability.
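
The following Python sketch is a toy version of this broker-side matchmaking (the request format and server attributes are invented for illustration, not a real resource description language): the broker filters the servers that satisfy the request and returns the best of the matching ones.

# Toy broker-side matchmaking.  Server attributes and the request format
# are invented example values.
servers = [
    {"name": "se0", "free_gb": 120, "arch": "x86_64", "bandwidth": 40},
    {"name": "se1", "free_gb": 10,  "arch": "x86_64", "bandwidth": 90},
    {"name": "se2", "free_gb": 500, "arch": "ppc64",  "bandwidth": 60},
]

def matchmake(request, servers):
    """Return the matching server with the highest bandwidth, or None."""
    candidates = [s for s in servers
                  if s["free_gb"] >= request["min_free_gb"]
                  and s["arch"] == request["arch"]]
    return max(candidates, key=lambda s: s["bandwidth"], default=None)

print(matchmake({"min_free_gb": 50, "arch": "x86_64"}, servers))   # -> se0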

Although the inherent characteristics of grid environments, such as heterogeneity, resources distributed all over the world, scalability, and the involvement of multiple administrative domains, can be managed by means of such a broker, they cause huge complexity in data grids. In short, the difficulty of data grid management does not make their performance improvement easy. In this sense, capabilities such as self-configuring and self-optimizing can help to manage these systems, adapting them to changing environment conditions and alleviating their complexity.


This kind of brokering provides simplicity to the client. Most of the operations, like resource discovery, autonomic features and future behavior prediction, are performed by the broker instead of the client. Moreover, focusing on the problem this work wants to solve, this broker can keep the file system logic, relieving the client of this job. Thus, storage elements can be used as simple data repositories and clients as simple data requesters. Therefore, different clients connected to the same broker have the same global vision of the file system. In this sense, they can act together and share files, enabling researchers from different places to work together, fulfilling one of the aims of grid technology.

Since the file system logic is in the broker, it is responsible for knowing where files are stored. As there can be several replicas of a file in the system, file names represent a logical view of the file. The broker must know the relation between the logical view and the physical storage, following an idea similar to the Replica Catalog of the Replica Location Service (RLS). This implies knowing the storage elements where the file is stored and the way to access it, since files are stored in parallel.

For a given client read operation, the broker is responsible for deciding which replica provides the most efficient access to the requested file. Therefore, the replication phase has an impact on performance, and it will be explained below in Section 8.1.3. In the case of a write operation, since data is not usually required at the same time at which it is produced, the broker has to take into account the future system behavior. The aim is to improve the performance of the subsequent I/O operations made later on the same file. However, the prediction of the future behavior has to be made in the long term, since the time at which the next operations will be made is unknown. The broker informs the client about which storage elements it has to write to, and then the system itself informs the broker where the data has been stored. This is carried out by means of grid technology techniques (see Section 8.1.2).

Therefore, the broker can act as a provider of autonomic features, since it is a key element for managing the complex characteristics of data grids. In this sense, GAS can be seen as a brokering-based solution. Its internal structure is described in Section 8.2.

8.1.2 File data discovery

In Section 7.1, the two parallelism levels of the system were explained from a client point of view. As an autonomic broker is part of the system, the operation sequence takes this element into account. In order to provide file data discovery, the broker takes on a higher relevance. Thus, the operation sequence is the following, and it is shown in Figure 8.1:

1. Communication between the broker and grid elements. The broker has deep knowledge of the storage elements and especially of the file data they store, by means of resource monitoring. Thanks to this, it is able to make decisions about the best storage elements for a certain client I/O request.

2. I/O request. The final user invokes an I/O operation in the client machine. In this invocation, the user is unaware of the resource management and the system operation. The I/O request operation is as simple as opening, reading, writing, closing or transferring a file, etc. The syntax used is based on the POSIX file management syntax. The POSIX API [IEE94] is well known by system programmers and is easy to use. Nevertheless, since the POSIX syntax is not well-suited to the I/O performance requirements of applications dealing with large amounts of data, the syntax used also incorporates methods based on high performance data access interfaces, like the FTP interface.

3. Broker call. The client contacts the broker, which selects the best grid elements where the client can run.

4. Obtaining the needed storage elements. The client obtains a reference for each grid element where the I/O request should be run.

5. Task accomplishment. The client contacts each storage element in order to carry out the task entrusted by the user.

6. Task delivery. The results of the I/O requests are delivered to the final user.

Figure 8.1: I/O operation sequence in MAPFS-Grid by using the autonomic broker


The broker makes the interaction between clients and grid elements transparent. As this work is focused on a file system, the way in which the broker knows where file data is stored takes on a high relevance. This can be solved by taking advantage of the new vision of grid technology. After the failure of OGSI, due to its aim of altering Web services (WS) to keep the state of grid services between invocations, WSRF was conceived to provide stateful WS. The idea is to split a stateful WS into two concepts: the stateless service and a new concept named resource, which is responsible for providing statefulness. The resource state is defined by different parameters named resource properties. These resource properties, and therefore the resource, can then be published by using MDS in each grid element. Other grid elements that have rights to access its published MDS information, like a broker, can discover these resources and access them using their EndPoint Reference (EPR) [Rec06]. The EPR uniquely identifies a particular resource, including the location of the storage element that contains it.

Since it is possible to discover resources by means of MDS, this method enables the discovery of the storage locations of file data. Thus, an intrinsic association between file data and resource can be established. Resources have properties that can be used to store meta-information about the resource¹. As indicated, the "file resource" can contain the required meta-information to discover it and make its parallel use together with other resources possible. Therefore, its resource properties are:

• File name.

• User identification. The user public-key certificate can be used to identify the user that has created the file.

• Access rights. They define not only the user rights, but also the corresponding group rights (using CAS) and system user rights.

• Number of parts in which the file is distributed in a parallel way.

• Order of the part in the parallel distribution. The parallel distribution is made in round robin according to MAPFS². Because of this, it is required to provide the order of the element to reconstruct the information. This method of distributing data in a parallel way could be changed to adapt it to each problem. In that case, it would be necessary to analyze the meaning of the meta-information to ensure the right file reconstruction.

• Block size used to access the file data. Several storage elements with different performance measurements are accessed in parallel. In order to enhance the operation, the size accessed in each of them must be different and appropriate, so that all of them spend the same time performing the operation, taking advantage of the parallelism.

¹ These data are named WS-Resource Properties in WSRF's terminology.
² Different distribution policies could be used. Round robin is the default policy in MAPFS because, in principle, it behaves properly.


• Size of the file data stored. This information will be useful to know the file size and for space booking.

When a file is written into the system, it is divided among n grid elements, since the I/O request is made in parallel. Therefore, in later requests, there will be n EPRs to access the file, one for each of its parts stored in each grid element. Figure 8.2 shows the use of EPRs to discover data:

1. The meta-information of each resource, which represents file data, is published by the storage element.

2. Resources are automatically discovered by the broker thanks to MDS. In this way, the broker can know all the EPRs associated with the file data at a given moment.

3. The client obtains from the broker all the EPRs corresponding to each part of the file. Each EPR refers to a MAPFS file in a grid element. For instance, in Figure 8.2 EPR 1.1 corresponds to the first piece of file 1 and is stored in storage element 0. In the same way, EPR 2.1 and EPR 3.1 correspond to the second and the third pieces of file 1, respectively. The association of the EPRs of file 2 is done in an analogous way. Using these EPRs, the client can access the file data.


Figure 8.2: File data discovery based on EPR


Therefore, all the data resources that compose a file can be discovered using EPRs.
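As a rough illustration of how this meta-information can drive file reconstruction, the following sketch (Python; names such as FileResourceProperties and order_parts are hypothetical, it is not the actual MAPFS-Grid implementation) shows how a broker-side catalog could map a logical file name to the ordered list of EPRs of its parts:

```python
from dataclasses import dataclass

@dataclass
class FileResourceProperties:
    """Meta-information published as resource properties for one file part."""
    file_name: str      # logical file name
    owner_dn: str       # user public-key certificate subject (user identification)
    access_rights: str  # user, group (CAS) and system rights
    num_parts: int      # number of parts of the parallel distribution
    part_order: int     # position of this part in the round-robin layout
    block_size: int     # block size used to access this part
    stored_size: int    # size of the file data stored in this part
    epr: str            # EndPoint Reference of the resource

def order_parts(discovered: list[FileResourceProperties], file_name: str) -> list[str]:
    """Return the EPRs of a logical file ordered by their round-robin position."""
    parts = [p for p in discovered if p.file_name == file_name]
    if parts and len(parts) != parts[0].num_parts:
        raise RuntimeError("some parts of the file have not been discovered yet")
    return [p.epr for p in sorted(parts, key=lambda p: p.part_order)]

# Toy usage: two discovered parts of the same logical file.
parts = [
    FileResourceProperties("data.txt", "/C=ES/CN=alice", "rw-r--r--", 2, 1, 65536, 1024, "epr-se1-p1"),
    FileResourceProperties("data.txt", "/C=ES/CN=alice", "rw-r--r--", 2, 0, 65536, 2048, "epr-se0-p0"),
]
print(order_parts(parts, "data.txt"))  # ['epr-se0-p0', 'epr-se1-p1']
```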

In addition, several clients can carry out I/O requests on the same file at a given moment, as in a conventional file system where the same resource is accessed concurrently. Each client obtains a different file descriptor, and the MAPFS file system is in charge of managing the operations and maintaining the file coherence.

Because file data must persist until the client wants to delete the file, resource persistence is a key factor. Thus, data resources have to be permanent, being stored on disk. This makes it possible to provide fault tolerance in case of a failure of the container. If a storage element fails and the grid element is rebooted, its data resources are reconstructed from the resource properties stored on disk.

8.1.3 Fault-tolerance and file data replication

A storage system aims to give support to the data requests of applications. The number of storage elements is high, and therefore the probability that some element fails and stops providing data increases. Since it is vital that data is always available, offering fault tolerance is essential. In fact, as capacity is not a problem, because multiple storage elements offer a high aggregate capacity, it is possible to use part of the storage space to supply fault tolerance by means of replication.

Since a grid is a very complex environment and there are many possible organizations, a hierarchical topology of the grid cannot be ensured. Therefore, any replication strategy (see Section 3.2.2.3) that assumes a hierarchical organization can be harmful for the system if that organization is not maintained. Thus, the best client strategy is the most accurate option to create replicas in our system.

However, there are some differences. First, the replica must be made in the storage element nearest to the best client, instead of in the client itself, since clients do not store data. Moreover, the MAPFS-Grid system itself is used to make the replicas. Finally, the whole file is not replicated; only the most used pieces of a file are replicated. Thus, when the broker wants to select the most suitable storage elements for a certain client I/O request that involves a file already stored in the system, it has to select the best replica of each file piece. Thanks to this replication strategy, it is possible to provide fault tolerance at file level.
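As a hedged illustration of this per-piece selection, the following sketch (Python; the scoring heuristic and names such as pick_best_replicas are assumptions for illustration, not the actual broker code) selects, for each piece of a file, the replica located on the storage element with the highest currently monitored read bandwidth:

```python
def pick_best_replicas(replicas_by_piece: dict[int, list[dict]]) -> dict[int, str]:
    """For every file piece, choose the replica whose storage element currently
    offers the best monitored read bandwidth (a simple illustrative heuristic)."""
    chosen = {}
    for piece, replicas in replicas_by_piece.items():
        best = max(replicas, key=lambda r: r["read_bandwidth"])  # monitored value
        chosen[piece] = best["epr"]
    return chosen

# Example: piece 0 is replicated on two storage elements, piece 1 is not replicated.
replicas = {
    0: [{"epr": "epr-se0-p0", "read_bandwidth": 35.0},
        {"epr": "epr-se2-p0", "read_bandwidth": 80.0}],
    1: [{"epr": "epr-se1-p1", "read_bandwidth": 55.0}],
}
print(pick_best_replicas(replicas))  # {0: 'epr-se2-p0', 1: 'epr-se1-p1'}
```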

Although replication provides fault tolerance, it causes a consistency problem. If a replica is modified, the rest of the replicas have to be updated with the same changes. Since in a grid environment there are numerous resources, and many of them may not be active at every moment, finding and updating replicas is a very hard task. Moreover, this problem is not fully solved in the state of the art. One of the most interesting alternatives is based on a Consistency Service [DHJM+01, DDP+04] that propagates an update to the other sites. This service uses a catalogue named Replica Consistency Catalogue (RCS) to store the metadata required by the update process. Although the update of a replica is directly indicated to the Consistency Service by an application or a client, instead of being discovered by the system, it represents a valid attempt to face the problem.

On the other hand, if a system broker is used, the decision load is concentrated in a single point. This means that if the broker fails, the whole system could fail. However, there are some techniques to avoid this. Firstly, it is possible to achieve reliability by means of broker redundancy. Replicating the broker on other machines provides fault tolerance, following the idea of a reliable peer-to-peer (P2P) topology [YGM03]. Nevertheless, this redundancy involves a cost, both in bandwidth, because the number of messages transmitted to keep the brokers informed increases, and in the computing done in each grid resource to send these messages. Finally, another approach to avoid a single point of failure is to create a broker hierarchy. The aim of this hierarchical structure is to increase the system scalability by reducing the number of brokers. This idea is gathered from a P2P point of view in [MRLB06]. Thus, the problem of a single point of failure can be solved. However, this work is out of the scope of this Ph.D. thesis.

8.1.4 Security

The use of grid computing has implied a change in the way computation and I/O are carried out. If some years ago the use of clusters meant an evolution with regard to supercomputers because of their low cost, nowadays grid computing should also involve a change of mentality, above all for project coordinators, due to its decoupled nature. Thus, with the emergence of grid computing, it is possible to subcontract the needed resources when there are peak workloads, instead of increasing the computing capacity of one's own resources, which would be underutilized most of the time.

Nevertheless, since Foster defines a grid as one that "coordinate[s] resources that aren't subject to centralized control" [KF98], there are aspects that must be taken into account to provide the needed confidence. If there is no centralized control and resources are distributed in a wide area network crossing organizational boundaries, resources could be accessed by many different organizations, and it is necessary to control the accesses to such resources. This is the reason why security is one of the most important aspects of grid technology.

Although security is a key factor in grid technology, nowadays one of the most important grid problems is that resource sharing is only viable by means of a scenario based on reciprocal trust. Moreover, in most cases this confidence scenario only occurs in academic and scientific environments. Companies do not take part in this scenario due to the high competitiveness. Thus, it is currently unthinkable that a company would trust third parties to store its critical documents or processes, even though it could obtain a cost reduction.

In order to solve this problem, the concept of Virtual Organization (VO) is defined in [Fos01]. This article also shows the boundaries between VOs. It must be ensured that only certain organizations or users can access certain resources, and especially that they really are who they claim to be.

Grid Security Infrastructure (GSI) [FKTT98] was conceived to provide the required security in grid environments. GSI interacts with common remote access tools, such as remote login, remote file access, and programming libraries. It is based on public-key cryptography, and therefore it can be configured to guarantee privacy, integrity, and authentication. In addition, GSI supports authorization both on the server side and on the client side. By means of a list of authorized users who can access the grid elements, called gridmap and stored in the host, it is possible to restrict the accesses of users to the resources. If the user is in the list of authorized users, he is enabled to use the grid resource with a local user account, obtaining the local user's rights.

Nevertheless, there is an important problem: this option cannot specify fine-grained access control policies. The Community Authorization Service (CAS) [PWF+02] solves this problem by delegating fine-grained access control policy management to the user community itself. Resource providers maintain authority over their resources and grant the rights to the whole community. The user community is responsible for restricting the rights of each user in a fine-grained way.

Although GSI and CAS provide security, it is also required to develop policies that assure an environment of trust. In this work, this means applying a set of policies that makes it possible to store critical files in a set of storage elements in which the client trusts. In this way, the client can indicate to the broker a list of grid elements as a requirement, and the broker should choose among them.

However, security also has to be provided in the other direction. Resource providers can limit the access to their grid elements for different reasons, such as resource overload or clients they do not trust. Thus, it is required to define policies that indicate the way clients can use the resources, for instance, limiting the percentage of disk space or establishing the maximum number of clients that can access them at a given time.


In order to allow the resource provider to control the client accesses, it is necessary to create a list for each file or directory in which the access rights of every client can be defined. The same public-key certificates used in GSI can be used for identification purposes. The access rights list is exported to the broker. When the broker receives a request from a client who does not have rights to access a resource in a storage element, this storage element is directly removed from the list of candidates to carry out the operation.

Logically, these two protection mechanisms can work simultaneously, as sketched below. In a first step, storage elements are restricted according to the client requirements. Later on, the user rights to carry out the operation in each selected storage element are checked. Then, the usual decision process continues only with the storage elements that fulfill both restrictions.
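A minimal sketch of this double filtering follows (Python; the structures client_trusted and acls, and the function name filter_candidates, are assumptions for illustration rather than the actual broker data model):

```python
def filter_candidates(candidates, client_dn, client_trusted, acls):
    """Keep only storage elements that (1) the client trusts and
    (2) grant the client access rights to the requested resource."""
    step1 = [se for se in candidates if se in client_trusted]         # client-side restriction
    step2 = [se for se in step1 if client_dn in acls.get(se, set())]  # provider-side rights check
    return step2

candidates = ["se0", "se1", "se2"]
client_trusted = {"se0", "se2"}                       # elements the client trusts
acls = {"se0": {"/C=ES/CN=client"}, "se2": set()}     # per-element access rights lists
print(filter_candidates(candidates, "/C=ES/CN=client", client_trusted, acls))  # ['se0']
```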

8.2 GAS structure

The difficulty and size of current problems involve a challenge for researchers, who need to use complex solutions and architectures for their domain-based problems. Grid computing has become a key piece for the development and deployment of these infrastructures.

Often, the higher complexity is due to the huge amount of data involved in such processes. In these scenarios, the I/O access stage limits the overall performance of the system. Furthermore, the optimum configuration of these environments is usually not straightforward.

The optimization must depend on the kind of I/O operation. In the case of write operations, it is important to take into account that data is not usually required by the I/O system at the same time at which it is produced. Therefore, a prediction phase based on logs and historical data, together with a decision policy to select the most suitable storage elements, is advisable. This can enhance the next operations on the same data.

On the other hand, in the case of read operations, the optimization must improve the data access, selecting the best replica depending on the current performance of all the locations where data is stored.

As motivated in Section 8.1.1, GAS is a broker-based architecture, which can be used together with any I/O system. An example of this coexistence is MAPFS-Grid, although other I/O systems, like CASTOR or SRB, can also be used. The coexistence between GAS and grid I/O systems is shown in Section 8.2.1. In order to provide an efficient broker for I/O operations, the following functionalities have to be covered:

1. Monitoring parameters of the grid. This phase includes obtaining the monitoring data of the storage elements, their aggregation and their whole management, with the aim of understanding the grid behavior.

2. Predicting the long-term state of the grid according to the monitored data. The system aims to improve both current and later I/O operations on the same data. Thus, since the performance depends on the future state of the grid storage elements, a prediction phase, supported by a mathematical formalism, is required.

3. Making decisions about both the target of the I/O operations and the internal parameters of GAS.


Figure 8.3: GAS architecture

To achieve these goals, the architecture shown in Figure 8.3 is proposed. The grid storage elements or resources can be clusters of workstations or standalone servers with attached disks. The architecture is divided into two main components: the first one concerns the monitoring phase, and the second one the prediction and decision making phases. These last two functionalities are closely related. The next chapters describe both components in detail.


This architecture, based on monitoring and decision making, is closely related to the way autonomic computing works. The autonomic computing idea lies in monitoring the managed resources and analyzing the retrieved data. Then, the best actions are planned and executed according to the analyzed data and the defined policies, with the aim of achieving the system goals. Therefore, autonomic features can be enabled using a MAPE (Monitor-Analyze-Plan-Execute) loop (see Figure 8.4). This entire loop revolves around the knowledge about how the managed system works. This knowledge is extracted from observations and past behavior. However, not all knowledge comes from the monitoring phase; several rules can also be obtained from experts.

Figure 8.4: Autonomic computing reference architecture [IBM04]

Besides these four phases, some more elements take a significant role, since they enable the interaction with the managed system. Whereas sensors monitor and recover information about a resource, effectors are the means to act on a resource. The monitor and analyze phases process information from sensors to understand the managed system. The plan and execute phases decide the actions that should be carried out through the effectors in order to improve the system and achieve its goals.

It is possible to define the GAS architecture in the same way, following the principles of autonomic computing. Thus, the relation between GAS and the autonomic computing reference architecture can be exploited just as shown in Figure 8.5. The GAS architecture also consists of four phases, whose main responsibility is providing autonomic self-management features to I/O systems, and it uses sensors and effectors to know and act on the system. The whole process obtains a valuable, deep knowledge of the system that makes it possible to improve later decisions.


Figure 8.5: GAS model related with the autonomic computing reference model

These four phases of GAS, following the MAPE loop, are listed below; a minimal sketch of the loop follows the list.

1. System Monitoring (Monitor): It is responsible for collecting data from the managed resources. Performance parameters are monitored with the aim of improving the decisions made by the autonomic system. Furthermore, it discovers where file data is.

2. System Analysis and Prediction (Analyze): In order to calculate the optimum storage elements, the monitored data has to be analyzed. Then, a prediction model based on past behavior is used to know how every storage element will behave later on (see Section 10.2.1).

3. Decision Making (Plan): First, it analyzes the kind of operation. For read routines, this stage selects the best storage elements in which data is stored. For create and write operations, according to the predicted system state of every resource, it plans and decides the target storage elements and the action to be done (see Section 10.3).

4. I/O Request Execution (Execute): It interprets the decisions made and accesses the suitable elements together with the client. In short, it processes the actions and provides file data replication.
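The following sketch (Python, with hypothetical stand-ins for the four GAS components; it is an illustration of the loop, not the GAS implementation) shows how the phases can be chained around a shared knowledge base:

```python
def monitor():
    """Sensors: return monitored performance data per storage element (stubbed here)."""
    return {"se0": {"write_bw": 40.0}, "se1": {"write_bw": 75.0}}

def analyze_and_predict(knowledge):
    """Analyze phase: here, simply forward the latest observations as the predicted state."""
    return knowledge

def plan(request, predicted_state):
    """Plan phase: pick the elements with the highest predicted write bandwidth."""
    ranked = sorted(predicted_state, key=lambda se: predicted_state[se]["write_bw"], reverse=True)
    return ranked[: request["stripes"]]

def execute(request, targets):
    """Execute phase: the effectors would perform the parallel I/O on the chosen targets."""
    return {"file": request["file"], "targets": targets}

def mape_loop(request, knowledge):
    """One illustrative Monitor-Analyze-Plan-Execute iteration around a knowledge base."""
    knowledge.update(monitor())
    return execute(request, plan(request, analyze_and_predict(knowledge)))

print(mape_loop({"file": "results.dat", "stripes": 2}, {}))
```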

8.2.1 Coexistence of GAS and I/O systems

GAS provides autonomic management in data grid environments. Since GAS is a system broker, it is independent of the I/O system used to access data. Therefore, it can be used together with any I/O system, incorporating the required changes to recover information about, and act on, the applicable I/O system. These changes only affect the sensors and effectors, because they are the "eyes" and the "hands", respectively, used to interact with the I/O system. Nevertheless, there are sensors that provide performance information about the managed resources and not about the I/O system itself, and these do not require any modification. In this way, the bulk of the whole system stays the same, although different I/O systems can be used. The elements that represent the logic of the system are the following:

• Client: the client asks GAS for the suitable location where an I/O operation should be made. Then, it carries out the operation using the corresponding access to the grid storage system used.

• GAS: this is the autonomic part. Mainly, this component is in charge of making decisions about the target storage elements for an I/O operation, and about aspects related to the data layout and management.

• Performance Monitoring: the autonomic system invokes it in order to obtain both the performance parameters and the measurements related to the storage element. Although many I/O systems have their own monitoring subsystems, this can also be done by using GMonE (see Chapter 9).

Therefore, in order to use different data grid systems with GAS, it is required to adapt the effectors and the sensors to interact with the I/O system. Depending on the storage system, the adaptation will be different. Thus, it is necessary to analyze each system in depth. To explain the coexistence with GAS, different storage systems are analyzed below. This can be extrapolated to other grid I/O systems.

8.2.1.1 Coexistence with MAPFS-Grid

MAPFS-Grid is a high performance data grid storage system. Compared to other grid storage systems, like CASTOR, SRB, GridFTP or OGSA-DAI, it provides a two-level parallelism that allows the system to obtain a better performance. It is important to emphasize some aspects of this framework that take on a high relevance:

• MAPFS-Grid resources are clusters of workstations/servers or individual nodes. In general, any computing element with disk capacity can be considered a resource in this environment. This feature makes the definition of resources in MAPFS-Grid flexible.

• MAPFS-Grid is based on a parallel file system for clusters, named MAPFS. In MAPFS, data is striped between the nodes of the cluster, in order to take advantage of the inherent parallelism of such a layout.


Figure 8.6: Coexistence of GAS and MAPFS-Grid

In other words, since every resource of the grid can be composed of several components (e.g., clusters of workstations), it is necessary to optimize the I/O performance of every resource before tackling the global I/O optimization. Therefore, if the storage element is complex and the monitoring phase provides information about parameters that characterize this complexity, GAS can use them to enhance the performance of this kind of resources. Then, since the access is performed in a parallel way among several storage elements, the most suitable resources to perform the operation³ must be selected.

³ That is, the best target storage elements to access in parallel according to the used heuristics.


MDS is used to provide the needed knowledge about MAPFS-Grid. The file data stored in each grid storage element is published by using MDS. Then, any element can discover this file data by subscribing to the MDS of each resource. This provides the file data discovery. Therefore, GAS can process information from MDS, acting as sensors, to understand the system. Later, the MAPFS-Grid service is used as effector for performing the file system operations. It could use any of the different approaches shown in Chapter 7, such as MAPFS-Grid PDAS, MAPFS-DAI and MAPFS-DSI. Figure 8.6 shows the general interaction between GAS and MAPFS-Grid.

8.2.1.2 Coexistence with SRB

The Storage Resource Broker (SRB) [BMRW98] can be understood as a grid file system that provides a uniform interface to distributed and heterogeneous storage resources, such as file systems, like UNIX, UniTree and HPSS, and databases, like DB2 and Oracle.


Figure 8.7: SRB architecture

As can be seen in Figure 8.7, the SRB reference architecture consists of three elements:

• Metadata catalog (MCAT). The MCAT stores information concerning the managed data sets. Therefore, GAS can request information from the MCAT, used as a sensor, to know where data is.

• SRB servers. The SRB server performs the I/O operations. Although it usually interacts directly with the MCAT to know where to direct the request, decisions can be improved if GAS is consulted. This is because GAS can previously analyze both the MCAT and the monitored performance information in order to improve the decisions. Since the SRB is a uniform interface, the SRB server has to interact with several types of storage systems (both databases and file systems). The SRB offers a kind of common driver or API that can be utilized as effector to make I/O requests. Figure 8.8 shows the coexistence between GAS and SRB.

• SRB clients. The client uses the uniform API to access every kind of storage system.


Figure 8.8: Coexistence of GAS and SRB

8.2.1.3 Coexistence with CASTOR

CASTOR (CERN Advanced STORage manager) [CAS] is a hierarchical storage management (HSM) system designed to store a great volume of data. In 2007 there were roughly 60 million files and 7 petabytes of data in CASTOR, most of them generated by CERN's physics experiments. Due to this need for space, CASTOR manages disk caches, hard disks and tapes.

In CASTOR, files can be accessed using different data transfer protocols like RFIO (Remote File IO), ROOT and GridFTP. Furthermore, since CASTOR provides a Storage Resource Manager (SRM)⁴ interface, it can be accessed from SRM clients.

⁴ The SRM is a middleware component for managing storage resources on a grid. CASTOR is a Mass Storage System seen as a resource in SRM.

Figure 8.9: CASTOR architecture [CAS]

The internal operation of CASTOR is shown in Figure 8.9. The Request Handler stores the request into the database, and then the stager polls the database to serve the request. The stager is the disk pool manager in charge of making I/O requests and maintaining, together with the Central Services, a catalogue of all the files. The Central Services component is the database of the CASTOR file system, and it could be used as a sensor by GAS. The decisions about the storage procedures are made by the stager. Nevertheless, an external scheduler, GAS in this case, can be used to make decisions about other aspects, such as where and when files must be stored. In order to make these decisions, CASTOR monitors itself by means of its own monitoring subsystem, named LEMON. Then, the scheduler sends a stager job to the selected disk front-end, which acts as effector in GAS, to carry out the I/O operation on disk. Migration to tape can be performed by the tape backend.

As a summary, the Central Services, LEMON and GMonE are the sensors that allow GAS to communicate with CASTOR. Each server (both tape and disk) acts as effector. Therefore, the coexistence between GAS and CASTOR is as shown in Figure 8.10.



Figure 8.10: Coexistence of GAS and CASTOR


Chapter 9

OBTAINING PERFORMANCE MEASUREMENTS

With the aim of knowing the behavior of any system, it is advisable to take different measurements of its operation. This knowledge can then be used to improve its performance. Several methods can be applied to obtain these measurements. Among them, monitoring stands out because it provides measurements of the system operation at a given time. Grid monitoring is more complex than the monitoring that takes place in other kinds of environments, because of the huge number of resources and their heterogeneity. In particular, the great amount of resources and the high number of parameters that have to be addressed make the phase of understanding the results difficult. This is the reason why it is required to provide a global or high-level vision of the system that reduces the number of measurements and makes the data easier to understand. This has to be done without losing the meaning of the monitored data.

Final users, application developers and resource providers (administrators) can take advantage of the monitored information about the grid behavior. Furthermore, it is possible to define schedulers based on the performance of the different grid resources, assigning jobs to non-busy elements. Therefore, the knowledge associated to this information is essential in order to build an autonomic system.

There are several methods to provide grid monitoring (see Section 3.2.3.2). All the alternatives were analyzed in order to select the most suitable method for the proposed problem. MonALISA [NLG+03] constitutes a suitable grid monitoring tool for this work, due to its good adaptation to heterogeneous infrastructures. That is, it can monitor both compound grid elements (e.g. a cluster) and single elements (e.g. a workstation).

9.1 MonALISA-based monitoring system

Two of the main characteristics that have to be considered within grid environments are heterogeneity and flexibility. Each one of the grid nodes can be a completely different system (architecture, operating system, software applications, resources...) and in constant change. To deal with this, the monitoring system must be based on a tool that can be easily adapted to different kinds of environments. The modular structure of MonALISA is the key to providing this flexibility. The set of monitored parameters is not previously established, and new modules can be developed in order to meet specific requirements.

MonALISA is based on a scalable dynamic services architecture, designed to fulfill the needs of physics collaborations for monitoring global grid systems, and it is implemented using JINI/JAVA and WSDL/SOAP technologies. MonALISA enables the system to:

1. Monitor different measures of each grid element. This makes the grid behavior easier to understand.

2. Communicate with the monitoring service by means of a Web service. Using this WS, external grid elements, like a broker or a client, can request and access monitored data directly from a service.

3. Store the historical monitored values. As can be seen in Section 3.2.3.2, the different monitoring architectures, like GMA, do not store historical logs of the system operation. MonALISA provides a WS at the server side that stores historical information about the system behavior. The broker can request the information monitored during a certain period. The problem is that, after a certain time, MonALISA groups the monitored values in order not to overload its database.

4. Calculate arithmetical averages of the monitored data during a certain period. This value tries to summarize the general behavior of a parameter.

5. Include new monitoring modules in order to obtain new metrics, adapted to evaluate the system in a better way.

6. Monitor by means of monitoring tools widely used in distributed environments, like Ganglia [SKMC03].

In this sense, it is necessary that each grid element has MonALISA running on it. MonALISA is responsible for monitoring the whole resource, whether it is a compound element, like a cluster, or a single element. In order to monitor clusters, Ganglia can be used. To monitor complex and special parameters, new modules can be developed.

MonALISA is very versatile from a general point of view, since it provides a standard request mechanism. Nevertheless, although MonALISA was developed for grid environments, it has some inconveniences which must be taken into account:

• The communication model used by MonALISA, based on WS and therefore on SOAP, can take a long time and thus overload the network if a great volume of information is requested. Since MonALISA stores historical values, the requested monitored measurements can correspond to periods of several months or even years, which implies a great amount of data. Historical values should be stored in the broker in order to minimize the data sent, in a similar way to the idea represented by the Globus Archive Service. The Archive Service is intended to store and enable time-based queries on MDS.

• MonALISA groups historical monitored data beyond a certain quantity. It is impossible to obtain accurate data if data corresponding to a long period is requested. This forces the broker to store all the historical values.

Because of this, it is necessary to define a more complete grid monitoring tool based on a broker. In the case of using MonALISA, these inconveniences must be solved to provide grid state information without affecting the whole system performance. The first approach is to create a MonALISA-based grid monitoring system. This monitoring system provides a new software layer, integrated in the broker, that uses the MonALISA WS of each grid element. Therefore, this monitoring system does not perform the monitoring operation in each grid element. Instead, it uses MonALISA to obtain and manage data. MonALISA can also use Ganglia if the grid element is compound. The MonALISA system provides a distributed monitoring service in each one of the systems or resources that the grid is composed of. Then, the system requests the data from the MonALISA WS and stores these values in its own database. This means that the broker stores all the historical values.

This monitoring system will be responsible for providing the necessary information to the decision module. The integration of MonALISA with this monitoring system is depicted in Figure 9.1.

This approach presents some problems when it comes down to the nitty-gritty. Due to the proliferation of clusters because of their low cost, this kind of resource is nowadays one of the most used elements. This brings a growing computational power, but the monitoring information also grows due to the existence of several nodes per cluster. All this information, provided by a cluster monitoring tool like Ganglia, is stored in MonALISA. Thus, since MonALISA only works with the whole set of data monitored in each element, the monitoring system stores a great amount of data.


Figure 9.1: MonALISA-based monitoring system

This high volume of monitored data makes the system behavior difficult to understand. MonALISA can calculate averages of the monitored data for a certain period. Although this can reduce the data to analyze, this kind of aggregation based on long periods loses significance.

Thus, grid monitoring should try to reduce the amount of information without losing too much accuracy. One way to achieve this is to provide aggregate information for each parameter obtained within a storage or computing element. For instance, if a cluster is considered, grouping the data of each parameter obtained in all its nodes could be enough. Different formulas can be applied, according to the monitored parameter, in order to collect the desired data. Although it is possible to post-process the data to group it, as shown in Figure 9.1, since MonALISA does not group the set of values obtained for the different nodes of each cluster, the amount of stored data is too high. Therefore, it is required to make this aggregation at the grid-element level. This allows the monitoring information to be reduced and therefore makes the data analysis easier. Thus, a grid monitoring system based on an element vision is advisable. The aim is to provide a single system image of each grid element to make the understanding of the whole grid system easier.

However, this vision does not fit the most known grid monitoring tools, like MonALISA or MDS. Therefore, it is necessary to build a new, complete, non-MonALISA-based grid monitoring system based on this idea that makes the grid easier to understand. This new vision is carried out by means of the creation of GMonE.

9.2 GMonE: A Grid Monitoring Environment

Grid Monitoring Environment (GMonE) is a set of services, designed in this Ph.D. thesis to work together, intended to provide a whole grid monitoring infrastructure. Its main characteristics are:

• It provides an efficient access to information, considering response times and data size.

• It coordinates the monitored information from the whole grid, providing a unified interface for management and queries.

• It relies on the standard Globus monitoring procedures (by using MDS) for the data transfer mechanisms, which makes it more adaptable and scalable to heterogeneous environments.

GMonE works as a standalone tool that can run and work on any grid (or similar environment) in a completely independent way. It includes many important advantages with regard to the MonALISA-based approach. The most remarkable ones, and basically those that motivated the creation of GMonE, are the following:

• Independence from MonALISA: in GMonE no third-party specific monitoring tool is required.

• Plug-in modularity: in the MonALISA-based approach the modularity relied on the MonALISA modules. After removing this dependence, a new plug-in system has been included, making it easier to monitor custom metrics.

• Configurable aggregation: MonALISA is only able to perform limited statistical processing and aggregation of data. GMonE includes a mechanism to provide an aggregation equation to pre-process and group data. This equation can be configured and can be different for each parameter.

• Use of MDS: GMonE relies on the de facto standard grid monitoring tool, MDS, for the data transfer, instead of a specific WS as in MonALISA. MDS standardization makes it easy to develop third-party applications that take advantage of the GMonE features. There are faster methods to propagate data, like Java Remote Method Invocation (RMI), but in most real grid environments the use of non-grid protocols and interfaces may represent a disadvantage (e.g. routers may filter the traffic, making the system unable to function). The use of MDS guarantees the proper behavior of this propagation in grid environments.

Because GMonE is designed to provide information from the whole grid, its services should be spread throughout the entire infrastructure. They have to be designed to cooperate in order to obtain and manage the monitored data. From this point of view, the tasks performed by GMonE are:

1. Obtaining the raw monitored data.

2. Gathering and managing the monitored data.

3. Providing the post-processed monitored data.

Figure 9.2 illustrates the whole GMonE architecture. Each of the modules and services shown there is described next.

9.2.1 Obtaining data

The monitoring system must adapt itself to different kinds of resources. Thus, GMonE performs the monitoring of each grid element by means of a flexible service, named MonitorAccess.

The MonitorAccess service monitors the required parameters and, if needed, processes the data to aggregate values according to the system needs. Then it communicates automatically with the broker, sending it the monitored data by using MDS.

In summary, the main features of MonitorAccess are:

• It provides an abstraction layer that enables the system to access all the monitored information through the same interface.

• It provides a scalable and customizable structure, based on monitoring plug-ins.

• It performs user-configurable statistical processing of the monitored data. This allows the system to obtain significant information without sending much data through the network, which reduces the impact of monitoring on the system performance.

• It forwards monitored data to the broker by means of MDS.

Figure 9.2: GMonE architecture

Since MonitorAccess has a modular structure, the set of monitored parameters is not statically defined, and new plug-ins can be developed in order to fit specific requirements. Its modular structure is the key to providing the flexibility and adaptation required in complex environments.

To support this modularity, MonitorAccess provides a simple interface that can be implemented by grid developers in order to monitor specific parameters. Several plug-ins can be used simultaneously, allowing the system to offer a fully customizable set of monitored parameters. By default, MonitorAccess includes a basic Ganglia plug-in that can provide standard cluster information, like the CPU usage of each node.
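The exact plug-in interface is not detailed here; the following sketch (Python, with hypothetical names such as MonitoringPlugin and StaticCapacityPlugin) only illustrates the kind of contract such plug-ins could implement:

```python
from abc import ABC, abstractmethod

class MonitoringPlugin(ABC):
    """Illustrative contract for a MonitorAccess-style monitoring plug-in."""

    @abstractmethod
    def parameters(self) -> list[str]:
        """Names of the parameters this plug-in can monitor."""

    @abstractmethod
    def measure(self, parameter: str) -> dict[str, float]:
        """Return one value per internal node for the given parameter."""

class StaticCapacityPlugin(MonitoringPlugin):
    """Toy plug-in returning a fixed per-node storage capacity (in GB)."""
    def parameters(self):
        return ["capacity"]
    def measure(self, parameter):
        return {"node0": 500.0, "node1": 750.0}

plugin = StaticCapacityPlugin()
print(plugin.measure("capacity"))  # {'node0': 500.0, 'node1': 750.0}
```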

Because GAS aims to manage an I/O system, the monitored information must be related to the main parameters of an I/O system. Thus, some of the most important parameters to monitor are:

• Storage capacity (C) and total capacity (TC): the amount of space available and the total space for storing data.

• CPU load (L) and processing speed (S): the average percentage of CPU used in the system and its corresponding speed. Although the CPU is not a limiting factor in an I/O system, it must be taken into account to avoid assigning requests to overloaded resources, due to the use of agents and MPI¹. In any case, it can be used to provide a deeper knowledge of the grid.

• Internal I/O bandwidth (I): the effective I/O bandwidth obtained while performing both read (Ir) and write (Iw) operations by means of a file system inside the storage resource. Each storage resource is seen as a single entity; that is, if it is composed of several nodes and the access is performed in parallel, then I is the result of accessing all the nodes working together.

• Number of I/O requests (R) running on each storage element at a given moment. It provides information about the use of the storage resource.

• External I/O bandwidth (E) and latency (Lat): the effective I/O bandwidth and its corresponding latency obtained while performing operations from an external client, outside the system (e.g., through the Internet or a WAN). These measures do not just depend on the storage resource but also on the client.

Some parameters, like C and L, can be obtained directly from Ganglia. To provide the rest of the parameters, a set of modules has been developed and incorporated in MonitorAccess.

9.2.1.1 Monitored data aggregation

Grid environments are composed of different and heterogeneous resources, both compound (like clusters) and single elements. Since clusters are formed by several nodes, managing the monitoring data of such grid elements is a hard task. In this situation, most of the monitoring tools provide a set of values for a specific monitored parameter, each of these values corresponding to a different node in the grid element. This information makes it difficult to understand the behavior of the system. A possible solution is to consider a cluster as a simple machine, regardless of its composition. In order to do that, a virtual node model built from the monitored data must be defined; that is, a single system image must be provided for each cluster, aggregating all the information of its internal nodes. These values can be aggregated in different ways, depending on the parameter meaning, measurement units, etc.

Let $R_1, \ldots, R_r$ be a set of compound grid resources, like clusters, where $\forall i \in (1, r),\; R_i \equiv \{S_{i1}, S_{i2}, \ldots, S_{is}\}$, $S_{ij}$ being the node $j$ of the resource $R_i$.

¹ These technologies are used by the MAPFS file system.


It is possible to monitor different parameters $P_{ij}$ of each node $S_{ij}$ by using MonitorAccess. For instance, one parameter would be the capacity $C_{ij}$ of the node $S_{ij}$. Each parameter can be aggregated to obtain a representative value of the whole cluster. Therefore, $C_i$ is the aggregated capacity offered by the cluster $R_i$.

Different methods can be used to calculate the aggregated value:

• Average of all nodes:
$$P_i = \frac{P_{i1} + P_{i2} + P_{i3} + \ldots + P_{is}}{s}$$

• Pessimistic case, taking the worst measurement of the nodes of the cluster:
$$P_i = \max \vee \min(P_{i1}, P_{i2}, \ldots, P_{is})$$

• Optimistic case, taking the best measurement of the nodes of the cluster:
$$P_i = \min \vee \max(P_{i1}, P_{i2}, \ldots, P_{is})$$

• Individual evaluation function:
$$P_i = \max \vee \min(F(P_{i1}), F(P_{i2}), \ldots, F(P_{is}))$$

• Evaluation function including all the parameters ($P_{ij}, Q_{ij}, \ldots$). The node with the highest function value is selected as representative of the whole cluster:
$$F(P_{ij}, Q_{ij}, \ldots) = \max\{F(P_{i1}, Q_{i1}, \ldots), \ldots, F(P_{is}, Q_{is}, \ldots)\} \Leftrightarrow (P_i, Q_i, \ldots) = (P_{ij}, Q_{ij}, \ldots)$$

The most suitable aggregation method often depends on the parameter meaning. Thus, the aggregation method must be easily configurable. MonitorAccess provides a mechanism that allows the system to specify the way this aggregation is performed, depending on the parameter characteristics. An aggregation function can be specified for each monitored parameter, and it can be changed and reset at execution time. The syntax of the grouping function is the following:

• Basic arithmetic operators ($+$, $-$, $*$ and $/$).

• Real numerical constants (1, 0, 26, -367.2, etc.).

• Special operators: they represent mathematical functions applied to a set of values, namely:

  – S: sum of all the values.
  – P: multiplication of all the values.
  – M: maximum value.
  – m: minimum value.
  – n: number of values.
  – F: aggregation is not required.

Simple statistical parameters can be easily obtained by using these operators (e.g. the arithmetical mean: $S/n$).

In our case, different methods have been chosen, considering the characteristics of each parameter. For example:

• Capacity (C): individual evaluation function.
$$C_i = \min((C_{i1} \times s), (C_{i2} \times s), \ldots, (C_{is} \times s)) = m \times n$$

• CPU load (L): pessimistic case.
$$L_i = \max(L_{i1}, L_{i2}, \ldots, L_{is}) = M$$

• Internal I/O bandwidth (I):
$$I = F$$
Instead of measuring individual values of the nodes and aggregating them, this parameter is directly obtained from the performance monitoring of the underlying file system over the storage resource. If this element is a cluster, a parallel file system could be an appropriate choice.

By means of this aggregation, all the grid elements can be compared in the same way, regardless of whether an element is a cluster or not.
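As a rough sketch of how such an aggregation expression could be evaluated (Python; the function name and the reduced operator handling are assumptions, not the GMonE grammar or parser), the special operators are first reduced over the per-node values and the remaining arithmetic is then evaluated:

```python
from math import prod

def aggregate(values, expression):
    """Evaluate a simple aggregation expression such as 'S/n' or 'm*n' over the
    per-node values of one parameter (illustrative sketch, not the GMonE parser)."""
    reductions = {
        "S": sum(values),         # sum of all the values
        "P": prod(values),        # multiplication of all the values
        "M": max(values),         # maximum value
        "m": min(values),         # minimum value
        "n": float(len(values)),  # number of values
    }
    for name, value in reductions.items():   # substitute the special operators...
        expression = expression.replace(name, repr(value))
    return eval(expression)                  # ...and evaluate the arithmetic part

node_capacities = [500.0, 750.0, 600.0]
print(aggregate(node_capacities, "m*n"))   # aggregated capacity: 500.0 * 3 = 1500.0
print(aggregate(node_capacities, "S/n"))   # arithmetical mean: 616.66...
```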

9.2.2 Gathering and managing data

Once the monitored information has been delivered by the MonitorAccess service of each storage element, it is necessary to gather and manage it, in order to provide a single system image of the whole grid. The GMonEDB service collects the monitored information from each storage element in the grid and stores it in its own database.

The main features of the GMonEDB service are:

• It provides monitored information from the whole grid. As mentioned, this service gathers information from the storage resources.

• It stores data from the whole grid in its own database. The amount of data that GMonEDB has to store in its database is severely reduced compared with MonALISA, since data is aggregated at a lower level. It provides fast access to monitored information from the whole system, regardless of the way in which each resource is attached to the grid. It also provides data about the past behavior of the grid. This makes it possible to observe the grid evolution, in order to achieve a deep knowledge of the whole system. A similar functionality should be provided by the Archive Service of MDS [SDM+05], but this is still under development.

• It manages relevant data. Many of the monitoring tools for distributed systems provide information about a large number of parameters that may not be of interest at grid level. GMonEDB can be configured to obtain, store and manage only the necessary information in each particular case. If at some point the kind of data should change, the service can easily be reconfigured on-line to start obtaining the new required information.

It is important to indicate that GMonEDB is not required to be a single application in the whole grid. Several instances of this service can be used, for example, to manage different kinds of data or to provide data replication and fault tolerance.

9.2.3 Providing data

The last part of the GMonE system is used to obtain monitored information of the whole grid from the GMonEDB database. This final module is called GMonEAccess and it works as a programming library with the following features:

• It provides a common interface to access the GMonE system.

• It can be used to obtain monitored data from the whole grid and to configure and administer the GMonE environment on-line.

• It can be used to run and configure the GMonEDB service.

9.2.4 A simple grid monitoring visualization tool

In order to provide a visual interface to the GMonE system, the GMonE simple Visualization Tool (GsVTool) has been developed. GsVTool uses GMonEAccess to obtain the monitored values. The main features of this application are:

• It provides a visual representation of the system structure, including all the grid elements and the monitoring parameters associated with each one of them.

• It includes a simple mechanism to query monitored data. It shows the monitored information as graphs that represent the evolution of the desired parameter within a given time interval.

Figure 9.3: GsVTool

A typical screenshot of this application is shown in Figure 9.3.


Chapter 10

PREDICTION AND DECISION MAKING

In an I/O system, it is necessary to take into account that data is often not required at the same time at which it is produced. Furthermore, data is usually read more times than it is written, because this is the way results are examined [RTC05]. Since reads require a better efficiency than writes, because applications demand data to be immediately accessible in order to analyze it, future read operations take on a high relevance. In this sense, performance improvements are related to both the current and the future states of the grid elements, since the actual access to data will be made later on. Thus, the use of prediction methods allows us to know the future behavior of the system and make decisions aimed at enhancing the performance of current and later I/O operations.

Our autonomic approach, named GAS (see Chapter 8), tries to manage the high complexity of grid environments. Specifically, GAS contributes in two ways. First, it allows the system to face environment changes and adapt its behavior with the aim of improving its performance. Secondly, it provides self-configuring and self-optimizing features, adjusting internal parameters according to the obtained results. The system analysis and planning required to achieve this, including the prediction and decision making phases, are described in this chapter.

In order to enhance the performance of current and later I/O operations, after the monitoring stage, GAS must run the following steps:

1. Definition of the states. First, it is required to know and understand the obtained data. One way to do this is to organize and group the data.

2. Application of a prediction model.

3. Decision making about the target storage elements.

Figure 10.1 depicts the modules needed to solve these problems; a sketch of how a request is dispatched through them is given after the list.

• Request translation module: it translates an I/O operation into a request to be solved. This module analyzes the kind of operation. If it is a read operation, GAS only checks where the involved file is stored. If there is more than one replica, then GAS decides which replica offers a higher performance, according to the current information provided by GMonE. If the client requests a write operation, GAS must activate the prediction stage with the aim of choosing the best storage elements to write the file now and read it later in a parallel way.

• Prediction module: it obtains knowledge about how each grid element has behaved in the past (definition of the states) and then it predicts its behavior. This allows the system to make decisions taking into account not only the past or the present, but also the future behavior.

• Decision making module: it selects the best storage elements to carry out the I/O operation requested by the client.
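The following minimal sketch (Python; best_current_replica and predict_and_choose are hypothetical placeholders for the prediction and decision making modules) illustrates this dispatch by operation type:

```python
def handle_request(operation, file_name, catalog, monitored, predictor):
    """Dispatch an I/O request as the request translation module would do
    (illustrative only): reads use current monitored data, writes use prediction."""
    if operation == "read":
        replicas = catalog[file_name]                      # known replicas of the file
        return best_current_replica(replicas, monitored)   # pick by current performance
    elif operation in ("create", "write"):
        return predict_and_choose(predictor, monitored)    # long-term prediction phase
    raise ValueError(f"unsupported operation: {operation}")

def best_current_replica(replicas, monitored):
    """Choose the replica located on the element with the best current read bandwidth."""
    return max(replicas, key=lambda se: monitored[se]["read_bw"])

def predict_and_choose(predictor, monitored):
    """Apply a prediction model per element and keep the best predicted one."""
    predicted = {se: predictor(values) for se, values in monitored.items()}
    return max(predicted, key=lambda se: predicted[se])

monitored = {"se0": {"read_bw": 30.0}, "se1": {"read_bw": 70.0}}
catalog = {"results.dat": ["se0", "se1"]}
print(handle_request("read", "results.dat", catalog, monitored, None))  # se1
```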

10.1 Definition of states

When a data problem is faced, the first step is trying to gain knowledge about the nature of the data. In order to understand data, it is advisable to study them and the relations existing among all the data items. A challenge for researchers in many areas is how to organize observed data so that they can be analyzed easily. To do this, when the number of data items is high, it is advisable to group them into different categories or states characterized by some parameters1 .

Although there are different methods to face this problem, most of them try to solve it by means of static states. That is, they define certain states that group the data and expect new data to fall into these groups. This simplifies the comparison among the results obtained for every grid element. Nevertheless, this approach has a problem. Most systems, and especially grid systems, change as time goes by. This can be due to changes in the dependency patterns among the variables or data. In this sense, it is required that the defined states can change dynamically and that these states can be produced based on historical data.

1 The analysis and data grouping developed in this work is made for each parameter. This means that the problem can be simplified and referred to as sampling instead of cluster analysis. Even then, clustering methods are used in order to generalize the results.



Figure 10.1: Prediction and decision making modules

With these objectives, the k-means algorithm (see Section 5.1) has been selected due to its simplicity and fast execution2 . K-means provides a way to group the monitored data for each parameter and each element3 . However, this method requires knowing the number of states before running the algorithm. To select the number of states, different approaches can be used.

Therefore, an important question that needs to be answered before applying the k-means clustering algorithm is how many clusters there are in the data. This is not known a priori and, in fact, there might be no definite or unique answer as to what value k should take.

Each method shown in Section 5.1.1 to determine the number of clusters has its own limitations; that is, there are no fully satisfactory methods for determining the number of clusters [Eve80, MC85].

2 An important design restriction of the system is that it must provide a fast response. Thus, the whole approach must be based on methods that can be solved quickly.
3 It would be advisable to standardize data before grouping them. In this specific case, this is not required because only one dimension is taken into account when data are clustered.


Although there are approaches [Eve80, KS96] that advocate the use of multiple techniques and the comparison among them, this cannot be done automatically, since an expert is required to select the appropriate value by observing the results of the different methods.

Since the number of states must be automatically determined according to the monitored values and there are no known previous studies in computer systems monitoring that help to select this number, Hartigan's rule [Har75] is selected due to its simplicity and good performance.
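A minimal sketch of this state-definition step is shown below: a one-dimensional k-means grouping whose number of states is chosen with Hartigan's rule (add a cluster while H(k) = (W_k/W_{k+1} − 1)(n − k − 1) exceeds 10). The helper names and the use of plain Python lists are assumptions for illustration and do not reproduce the thesis implementation.

import random

def kmeans_1d(values, k, iters=50):
    # Simple 1-D k-means: assign each value to the nearest center and
    # recompute the centers as the group means.
    centers = random.sample(values, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    # Within-group sum of squares, used by Hartigan's rule.
    wss = sum(sum((v - c) ** 2 for v in g) for g, c in zip(groups, centers))
    return centers, wss

def choose_k_hartigan(values, k_max=10, threshold=10.0):
    n = len(values)
    k = 1
    _, w_k = kmeans_1d(values, k)
    while k < min(k_max, n - 1):
        _, w_next = kmeans_1d(values, k + 1)
        if w_next <= 0:
            break
        h = (w_k / w_next - 1.0) * (n - k - 1)
        if h <= threshold:       # adding another state no longer pays off
            break
        k, w_k = k + 1, w_next
    return k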

In addition, atypical data can cause the creation of states just to represent them, and therefore some information about the rest of the data is lost, since fewer states are used to represent the non-atypical data. Therefore, atypical data should be addressed (a sketch is given after this list):

• In order to do this, it is first necessary to check whether the data follow a Normal distribution X ∼ N (µ, σ 2 ) by using the Kolmogorov-Smirnov test4 . In this case, it is possible to remove all the data items outside the following interval, since they are considered atypical.

(µ − 2 × σ, µ + 2 × σ)

As can be seen in Figure 10.2, roughly 95% of the values of a Normal distribution lie within this range. Therefore, it is advisable to consider all data outside this range as atypical, because the study aims to collect information about the usual system behavior without concentrating it in a single value.

Figure 10.2: Normal distribution

• In the case that the data do not fit a Normal distribution, if there is a state composed of a single element, this state is removed and its data are grouped into the closest state.

4 It is also possible to extract a characteristic function X ∼ N (µ, σ 2 ) from an initial data distribution, e.g. by means of the sum of distances from every point to the n nearest neighbors.
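The atypical-data handling just described can be sketched as follows. The 0.05 significance level and the helper name remove_atypical are assumptions; the Kolmogorov-Smirnov test is taken from SciPy.

import statistics
from scipy import stats

def remove_atypical(values, alpha=0.05):
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return values
    # Test compatibility with N(mu, sigma^2).
    _, p_value = stats.kstest(values, "norm", args=(mu, sigma))
    if p_value > alpha:
        # Normal case: drop everything outside (mu - 2*sigma, mu + 2*sigma).
        return [v for v in values if mu - 2 * sigma < v < mu + 2 * sigma]
    # Non-Normal case: handled instead by removing single-element states.
    return values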


10.2 Prediction

Knowing the grid behavior is a very complex problem, and predicting the grid behavior is an even harder task. Furthermore, due to the complexity and heterogeneity of a whole grid, it is very difficult, with current algorithms, to know, understand and predict the performance of the whole grid. In this sense, it is necessary to know and predict the behavior of each grid element and extrapolate the results to the whole grid. This prediction must be based on historical monitored data about its performance.

10.2.1 Prediction model

As seen in Section 5.2, there are different points of view from which a system can be predicted. The different alternatives must be analyzed to gather the contributions of each one that help to solve the problem.

Time series analysis [Ham94] offers a powerful tool to make predictions. Nevertheless, it demands too much computation. Although the grid infrastructure itself could absorb the computing needs, too much time would be spent making the prediction. Thus, it does not fulfill the design restriction of our system. Moreover, this method provides a prediction function that depends on time. However, in an I/O system it is unknown when future operations will be made, and therefore it is not useful to know how the system will be at a certain future moment; it is more useful to make long term predictions. Thus, other kinds of methods, especially less resource-intensive techniques, must be analyzed. Nevertheless, time series analysis could be used to support several aspects of the selected methods, like state changes or determining the most valuable parameters that define the system behavior.

Both on-line and a priori models provide a different vision to address the prediction task. A priori models analyze the past behavior of the system in depth in order to forecast it. However, they expect that the system does not change its usual behavior; long term predictions can be made based on this assumption. Unlike a priori models, on-line models make it possible to face system changes in the short term. That is, they can adapt themselves to drastic changes in the system behavior because they give more importance to current data.

Since GAS aims to provide an autonomic system that can adapt itself to changes in the system, it is required to tackle the problem by means of a vision similar to on-line models. However, the problem is focused on an efficient I/O infrastructure, which enhances the I/O performance. In an I/O system, the time at which a future I/O operation will be performed is unknown. Thus, in order to improve not just current I/O operations but also future ones, long term prediction would be


useful. This kind of prediction could be made by means of a priori models.

The Markov chains approach is a very good option to know how grid elements usually behave in the future and, therefore, it can be chosen as the key model in GAS. Nevertheless, this approach does not meet the need of dealing with short-term changes. In order to provide a trade-off solution, an “enhanced” Markov chains approach is used, which includes on-line model skills. To do this, GAS makes a new prediction when new data is acquired (creating a new Markov chain from both new and former data). This provides dynamism to our prediction model, detecting drastic changes in the system and updating the predictions based on new data.

Due to the high complexity of a grid environment, tackling the system as a single element can simplify the solution of the problem, but many aspects would not be taken into account, like heterogeneity and the different number of active elements at a certain moment. Therefore, GAS deals with the problem by means of a grid element vision, that is, every grid element is analyzed to make a decision. Thus, each grid element has to be predicted.

In order to model each grid element by using a Markov chain, a deep knowledge about its past behavior is required. GMonE is used to obtain all the required monitored data. GMonE returns “processed data” in the sense that the monitored data is not raw data, but data prepared for the prediction stage. Then, the system tries to obtain basic knowledge about the monitored data by means of the data aggregation phase. This stage allows the system to know the relations among data. Therefore, it defines, in a dynamic way, the states where the grid element behavior is expected to lie. Defining the states dynamically makes it possible to adapt to the problem. Although the use of pre-defined states can be easier, since they can be defined before obtaining the data, this reduces the knowledge about the grid element and therefore the system autonomy: with pre-defined states it is possible to find empty states, losing the information such states could provide.

In this way the “reachable states” and the probability of reaching them can be known. For instance, to study the internal I/O bandwidth evolution, the following states si could be defined from historical data by using the method indicated in Section 10.1:

s1 =⇒ 0 MB/s ≤ Ii < 14.3 MB/s
s2 =⇒ 14.3 MB/s ≤ Ii < 35.2 MB/s
s3 =⇒ 35.2 MB/s ≤ Ii < 63.7 MB/s
s4 =⇒ 63.7 MB/s ≤ Ii < 87.5 MB/s
s5 =⇒ 87.5 MB/s ≤ Ii < 100 MB/s


From these states, it is possible to define a Markov chain, just as it is shown in Figure 10.3.

Figure 10.3: Markov chain state diagram

Modeling the system using Markov chains enables making long term predictions due to the stationary probability properties. These stationary probabilities can be used for the decision making phase.

10.2.1.1 Usual behavior

The normal behavior of the GAS system can be modeled by means of a regular Markov chain. This is because the Markov chain is adapted to the system itself, since the states are calculated according to the monitored data. Nevertheless, there are a few cases where it is possible to find semi-regular chains when the system is modeled. This case will be analyzed in Section 10.2.1.2.

The matrix obtained from the regular chain does not present any format that facilitates its resolution. In this sense, it is turned into a system of equations, which can be easily solved (see Appendix A). Among the resolution methods, direct methods can be ruled out due to response-time restrictions. Approximate methods are suitable for our purposes, since an approximate result is sufficient. Specifically, the selected algorithm has been Successive Over-Relaxation (SOR) [Fra50, You50], which offers performance improvements over the traditional Gauss-Seidel algorithm [Kah58].

For instance, suppose the following initial matrix of probabilities between internal I/O bandwidth states:

I =
| 0.504  0.496  0.0    0.0    0.0   |
| 0.504  0.481  0.015  0.167  0.0   |
| 0.333  0.0    0.333  0.333  0.0   |
| 0.0    0.167  0.0    0.5    0.333 |
| 0.0    0.0    0.0    0.25   0.75  |


In order to obtain a system of equations, it is first required to include equation A.6. Then, equations A.7 and A.8 are applied. Thus, the system can be solved by SOR.

| −0.496  0.496   0.0     0.0    1.0 |
| 0.504   −0.519  0.015   0.0    1.0 |
| 0.333   0.0     −0.667  0.333  1.0 |
| 0.0     0.0     0.5     −0.5   1.0 |
| 0.0     0.0     0.0     0.25   0.0 |

Finally, the previous matrix and the vector of independent components [0, 0, 0, 0, 0] are introduced in the SOR algorithm. Taking as initial hypothesis a vector with the same probabilities [0.2, 0.2, 0.2, 0.2, 0.2], the system evolves in the following way (one row per iteration, one column per state):

0.2       0.2       0.2       0.2       0.2
0.241189  0.228501  0.141546  0.198309  0.197135
0.266923  0.255676  0.100812  0.188549  0.194406
0.285005  0.278947  0.072456  0.175636  0.192470
0.299034  0.298019  0.052736  0.162363  0.191083
0.310689  0.313435  0.039036  0.150124  0.189772
0.320781  0.325944  0.029531  0.139459  0.188125
0.329739  0.336243  0.022947  0.130429  0.185880
0.337819  0.344901  0.018396  0.122862  0.182921
0.345195  0.352350  0.015262  0.116494  0.179254
0.351994  0.358902  0.013112  0.106312  0.170154
0.364234  0.370133  0.010657  0.102073  0.164978
0.369808  0.375073  0.009998  0.098198  0.159561
0.375082  0.379671  0.009568  0.094586  0.154020
...
0.475345  0.461793  0.010411  0.022376  0.030380
0.475630  0.462024  0.010417  0.022167  0.030023
0.475716  0.462093  0.010419  0.022104  0.029916
0.475797  0.462159  0.010420  0.022044  0.029814
0.475875  0.462221  0.010422  0.021988  0.029718

96 iterations were needed to solve the problem by means of SOR with an accuracy of 10^−6. Although this number of iterations is not too high and does not imply too much time to run the algorithm, it can be improved if the initial hypothesis is more accurate. In this sense, the initial hypothesis should be the probability of being in each state observed in the historical data, which is calculated at the same time the states are defined. For instance, for the same matrix shown above, the following probabilities of having been in each state in the past were calculated:

ps1 = 0.475694
ps2 = 0.465277
ps3 = 0.010416
ps4 = 0.020833
ps5 = 0.027777

Then, the system evolves in the following way:


0.475694  0.465277  0.010416  0.020833  0.027777
0.476751  0.464539  0.010435  0.020837  0.027675
0.477270  0.464172  0.010445  0.020826  0.027558
0.477524  0.463986  0.010451  0.020802  0.027461
0.477646  0.463889  0.010455  0.020772  0.027393
0.477703  0.463835  0.010457  0.020741  0.027354

In any case, the same solution is obtained. This demonstrates that the use of both initial hypotheses is valid.

π1 = 0.4777031665213671
π2 = 0.4638350339769136
π3 = 0.010457438846941251
π4 = 0.020741003038957668
π5 = 0.027354745198539962

According to the πi values (the probability of being in each state at long term) of each storage element, it is possible to select the best resource, which should improve the performance of future I/O requests.
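The resolution of the stationary probabilities by SOR can be sketched as follows. The relaxation factor omega = 1.25 is an assumption, while the 10^−6 tolerance follows the accuracy mentioned above; A, b and the initial hypothesis x0 would come from the system of equations and the historical probabilities described in this section.

def sor(A, b, x0, omega=1.25, tol=1e-6, max_iters=1000):
    # Successive Over-Relaxation for A x = b, starting from x0.
    n = len(b)
    x = list(x0)
    for iteration in range(1, max_iters + 1):
        max_delta = 0.0
        for i in range(n):
            sigma = sum(A[i][j] * x[j] for j in range(n) if j != i)
            new_xi = (1 - omega) * x[i] + omega * (b[i] - sigma) / A[i][i]
            max_delta = max(max_delta, abs(new_xi - x[i]))
            x[i] = new_xi
        if max_delta < tol:
            return x, iteration
    return x, max_iters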

As can be seen, an initial hypothesis based on the past behavior is a good starting point for the prediction. The main reason is that the past behavior has a great influence on the future operation. However, the past behavior does not force the long-term behavior to be the same. Although in the shown example the future behavior is very similar to the past one, there are several cases where this does not occur. For instance, the following transition matrix represents 5 days:

I =
| 0.555   0.333  0.111  0.0    0.0    0.0   |
| 0.154   0.385  0.230  0.0    0.076  0.154 |
| 0.0104  0.021  0.938  0.026  0.005  0.0   |
| 0.0     0.25   0.0    0.25   0.0    0.25  |
| 0.0     0.0    0.333  0.333  0.75   0.0   |
| 0.0     0.0    0.043  0.0    0.021  0.937 |

The probability of being in a certain state in the past was calculated from the obtained data, being:

ps1 = 0.03125
ps2 = 0.0451388
ps3 = 0.6701388
ps4 = 0.0694444
ps5 = 0.0138888
ps6 = 0.1701388

With this initial hypothesis, the system evolves to obtain the stationary probabilities:


0.03125   0.045138  0.670138  0.069444  0.013888  0.170138
0.03125   0.045138  0.654083  0.069028  0.013879  0.175083
0.031137  0.044689  0.643074  0.068449  0.013873  0.182110
0.030963  0.044689  0.635691  0.067851  0.013886  0.189552
0.030761  0.044396  0.630864  0.067311  0.013920  0.196509
0.030556  0.044113  0.627807  0.066862  0.013969  0.202564
0.030361  0.043859  0.625954  0.066512  0.014027  0.207580
0.030186  0.043640  0.624907  0.066255  0.014088  0.211582
0.030033  0.043459  0.624384  0.066076  0.014149  0.214676
0.029904  0.043314  0.624193  0.065961  0.014205  0.217000
0.029796  0.043199  0.624203  0.065895  0.014255  0.218694
0.029710  0.043111  0.624327  0.065865  0.014299  0.219892
0.029641  0.043045  0.624507  0.065859  0.014335  0.220707
0.029587  0.042996  0.624706  0.065869  0.014365  0.221237
0.029545  0.042961  0.624903  0.065889  0.014389  0.221559
0.029514  0.042936  0.625085  0.065913  0.014408  0.221733
0.029491  0.042919  0.625246  0.065940  0.014423  0.221806
0.029474  0.042908  0.625384  0.065965  0.014434  0.221814
0.029462  0.042901  0.625498  0.065989  0.014442  0.221782
0.029454  0.042897  0.625590  0.066010  0.014448  0.221727

19 iterations were needed to obtain the stationary probabilities. As can be seen, these probabilities differ from the historical ones, although the latter have an influence on the former.

π1 = 0.02945417474448796
π2 = 0.04289739217251692
π3 = 0.6255909098585902
π4 = 0.06601008044083241
π5 = 0.01444850165313966
π6 = 0.22172709219323386

It can be seen that all the states have their own relevance in the system. If pre-defined and static states were used, the ranges of the states could be too wide for the dataset, losing precision in the result and therefore in the obtained knowledge. Furthermore, it increases the possibility of a state appearing without transitions, which can cause the Markov chain not to fulfill the regularity property.

Nevertheless, a problem has not been solved yet: how the transition probabilities are obtained. In highly defined systems, extracting the rules from the system definition is enough to obtain the transition probabilities between states. However, in our case, the system is regulated by external factors. In this sense, the transition probabilities are extracted by studying the system behavior known thanks to GMonE. To do this, GAS asks GMonE for a monitored parameter within each time window. This request obtains a time-ordered data series. Once the value limits of each state are defined, the data series is scanned, incrementing by one the transition aij of the matrix if the current value belongs to state ai and the following one belongs to aj . In short, the transition matrix can be directly obtained from the data returned by GMonE with a simple process that counts the number of transitions. For


example, if the data series is {12,87,10,23,24,54,10,76,98,31,60,37,9,88,53,65,61,43,47,80...} and the states are defined in the following ranges [0,25)[25,50)[50,75)[75,100], the series would be transformed into the state series {1,4,1,1,1,3,1,4,4,2,3,2,1,4,3,3,3,2,2,4...}, the state transition matrix being:

| 2  0  1  3 |
| 1  1  1  1 |
| 1  2  2  0 |
| 1  0  1  1 |

Another possibility is to consider that recent information has a higher weight than old information. Thus, when counting transitions, an importance factor between 0 and 1 could be applied, reducing the influence of each transition as it recedes in time. This can be done arithmetically or exponentially. Arithmetically, the influence weight series would be {..., 1 − 3α, 1 − 2α, 1 − α, 1}, and exponentially, the series would be {..., α^3, α^2, α, 1}.

In any case, when the transition-counting phase is finished, it is necessary to normalize the matrix. As each matrix element represents a probability, its value must lie in the range [0,1] and the sum of the values of each row must be equal to one. The normalization can be made by means of Laplace's probability formula:

pi = Favorable cases / Total cases

Nevertheless, in order to obtain a confident probability value, avoiding null values, Laplace's correction formula can be used:

pi = (Favorable cases + λi) / (Total cases + Σ_{i=1..n} λi)    (10.1)

where n is the number of classes and λi is the priority for class i. In this case, it is set to 1. Formula 10.1 can be applied to each matrix element, where the number of total cases is the sum of all the values of the row and the number of favorable cases is the element value:

pij = (mij + 1) / (Σ_{j=1..n} mij + n)    (10.2)

Applying the formula 10.2, the normalized matrix would be: 

 0.3 0.1 0.2 0.4  0.25 0.25 0.25 0.25   P =  0.222 0.333 0.333 0.111  0.286 0.143 0.286 0.286 In this way, it is possible to obtain the transition matrix of each grid element. Furthermore, the calculation of this matrix does not involve too much time, as it can be seen in Section 11.3.1. However, if it were only calculated once, there would be a single prediction about the further behavior of each Alberto S´ anchez Campos


grid element. Nevertheless, as it can be seen above, new monitored data can vary the expected future state. That is, it is required to reconsider the predictions when new data is acquired.
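The transition-counting and normalization process just described can be sketched as follows; state_of and transition_matrix are hypothetical helper names, and the state boundaries are passed as the upper limits of each range.

import bisect

def state_of(value, upper_limits):
    # upper_limits, e.g. [25, 50, 75, 100] for [0,25)[25,50)[50,75)[75,100].
    return min(bisect.bisect_right(upper_limits, value), len(upper_limits) - 1)

def transition_matrix(series, upper_limits):
    n = len(upper_limits)
    counts = [[0] * n for _ in range(n)]
    states = [state_of(v, upper_limits) for v in series]
    # Count state-to-state transitions along the time-ordered series.
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    # Laplace's correction (Formula 10.2): p_ij = (m_ij + 1) / (sum_j m_ij + n).
    return [[(counts[i][j] + 1) / (sum(counts[i]) + n) for j in range(n)]
            for i in range(n)]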

10.2.1.2 Adaptation to new data

Following on-line models, it is important to take into account that new data can vary the expected future system behavior. Thus, it is necessary to analyze the past system operation every time new data is acquired. There are several ways to adapt the prediction phase to the newly known behavior.

First, GAS could perform a new prediction for each client write request. This would create several Markov chains through time, one for each request. Nevertheless, this alternative has some problems. Firstly, it could cause an overload of the autonomic system and, therefore, an increase of the write response time. To reduce this, predictions would only be made when the internal system parameters have been modified by a previous decision making step or when a time threshold (a percentage of the monitored data time M) has gone by since the last prediction. Secondly, the monitored data time parameter taken into account in the prediction acquires a great significance, since it represents the total time of the analyzed past behavior; the rest of the time is not considered in the study. Therefore, if M is not big enough, important data that occurred a long time ago may not be considered in the study. On the other hand, if M is too big, then current data causing drastic changes in the system behavior cannot be recognized until a long time has passed. These reasons make M a special parameter that has to be autonomically configured. Although it is possible to compare the results obtained with different values of M, it is not easy to decide the policy to apply in each case, because it depends completely on the data instead of on the differences observed among the results. Thus, it is advisable to analyze other alternatives where the system gathers data from the beginning of the system operation and the M parameter loses importance.

The solution is based on the idea of gathering all the data until a drastic change occurs in the system. However, this implies that the idea of performing a prediction for each client request must be rejected. In short, the predictions will be made at certain times and different Markov chains will be needed:

• A Markov chain PTi for each certain time, which represents only a time period. Each period T , a Markov chain is calculated with all the monitoring values corresponding to this period. The performed predictions are valid during the period T in which they have been calculated.

• A historical Markov chain P^h . This historical chain aggregates all the historical data. It is known that a Markov chain evolves by using the new Markov chain corresponding to the last


period, following the typical evolution:

P^2 = P · P
P^3 = P^2 · P
P^k = P^(k−1) · P

In this sense, P^h is the historical Markov chain and P_Tn is the Markov chain used in each step of the evolution:

P^h = P_T1
P^(h+1) = P^h · P_T2
P^(h+1) = P^h · P_T3
...
P^(h+1) = P^h · P_Tn

Then, it is possible to evolve the historical chain to obtain the stationary probability of staying in a certain state at long term, as shown in Section 10.2.1.1:

(P^h)* · P^h = P^h · (P^h)* = (P^h)*

It is required that the period T is the same and that the states do not change in all P_Tn in order to allow the historical Markov chain to evolve in a proper way. Finally, the stationary Markov chain can be calculated from the historical chain P^h , taking into account that its states are defined in a dynamic way only once, when this matrix is created from P_T1 . Thus, all the monitored data gathered from the beginning of the system operation can be taken into account (a sketch of this update follows below).
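The evolution of the historical chain can be sketched as follows; the class and method names are illustrative only.

def mat_mul(a, b):
    # Square matrix product, used to evolve the historical chain.
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

class HistoricalChain:
    def __init__(self, first_period_chain):
        self.p_h = first_period_chain              # P^h = P_T1

    def update(self, period_chain):
        # P^(h+1) = P^h * P_Tn for every new period.
        self.p_h = mat_mul(self.p_h, period_chain)
        return self.p_h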

The evolution of the historical chain in this way allows GAS to adapt itself to gradual changes in the system behavior. In this sense, the predictions are slightly modified as new data is collected, gradually changing their meaning in several steps. Nevertheless, this does not solve the recognition of drastic changes in the system. When a drastic change occurs in the system, the values of the monitored data change. Thus, it is possible to recognize these changes when the data collected during a certain time, together with the historical Markov chain, cause one or more of the defined states to lose their relevance. This can be discovered when P^h turns from a regular chain into a semi-regular chain. The reason is that, during a long time, the newly acquired data does not fit the states defined in the historical matrix because of a drastic change in the system. This causes empty states or states without transitions, which implies that the matrix does not fulfill the regularity property. Therefore, the stationary probabilities cannot be obtained from the historical matrix in the same way.


The problem of knowing whether the matrix is regular or semi-regular can be simplified to knowing whether all the states can be reached from any state. This problem can be addressed based on graph theory. From this point of view, the problem is reduced to knowing whether the graph created according to the relations between the states is strongly connected, by means of Kosaraju's algorithm5 . If the graph is strongly connected, then the matrix represented by this graph is regular. Otherwise, the matrix is not regular. If the matrix changes from regular into non-regular or semi-regular, it means that the new data collected after the state definition does not fit these states. This is caused by a drastic change in the system.
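The regularity check can be sketched as follows, following the simplification used in the text (a chain is treated as regular when its transition graph is strongly connected). The function name is_regular and the eps threshold are assumptions.

def is_regular(matrix, eps=1e-12):
    # Edge i -> j exists whenever p_ij > 0; check strong connectivity with a
    # forward and a backward reachability pass from the same vertex, in the
    # spirit of Kosaraju's algorithm.
    n = len(matrix)
    forward = [[j for j in range(n) if matrix[i][j] > eps] for i in range(n)]
    backward = [[j for j in range(n) if matrix[j][i] > eps] for i in range(n)]

    def reaches_all(adjacency):
        seen, stack = {0}, [0]
        while stack:
            node = stack.pop()
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return len(seen) == n

    return reaches_all(forward) and reaches_all(backward)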

As the behavior of the system has changed, the previously acquired knowledge about the system loses importance. Thus, when this change is observed, new states are defined in a dynamic way according to the last monitored values and a new historical chain is calculated, allowing drastic changes to be detected and faced by discarding past information. That is, it is required to change the states when a radical system change is perceived. In conclusion, this method maintains all the analyzed information until a change in the system behavior is detected. When a change is detected, the system tries to move forward in order to improve itself.

Nevertheless, there is a problem in this approach. The first time the historical matrix is built from P_T1 , it is possible that P_T1 is not regular. This is often because the last calculated state does not transition to a previous state during the period T . Although there are mathematical methods to solve semi-regular chains, these methods are complex and the obtained result tends to indicate staying in the last state at long term, because there is no way to reach another state. Therefore, T must be considered a key internal parameter and its adjustment should be autonomically achieved. To solve this, an empirical approach is used, based on obtaining a larger amount of information. Thus, more monitored data is requested from GMonE, enlarging the period T from a minimum to a maximum value. These values must be defined by the system administrators as a high-level policy according to the specific environment or problem to solve. Section 11.3.1 indicates the trade-off values selected for the grid testbed used in the evaluation of this work. The size by which T is increased is a proportion (1/75, because 75 steps are considered enough) of the difference between the maximum and minimum period. Maintaining the same calculated states, the aim is to check whether, during any advance of the period T , there is a transition between the problematic state and a previous state. If there is such a transition and the chain is regular, then this is the proper value of T and the process continues normally. The calculated value of T indicates how often predictions will be made until a drastic change is detected. If there is no value of T lower than its maximum value for which the chain is regular, then the prediction is simply to remain in this state at long term, since most of the analyzed time was spent in this state and it is an absorbing state. In any case, new data can modify this


prediction.

5 Kosaraju's algorithm calculates the strongly connected components of a graph using two depth-first searches [CLRS01]. If there is only a single strongly connected component, the graph is strongly connected.

10.2.1.3 Results interpretation

Once the historical chain P^h of each parameter in each storage element has been evolved, the stationary probability of staying at long term in each of the states defined by means of the clustering method is obtained. The following result is obtained for every grid element and parameter:

p(x) = π1 if x ∈ s1 ; π2 if x ∈ s2 ; π3 if x ∈ s3 ; . . . ; πn if x ∈ sn    (10.3)

Nevertheless, since each grid element has a different representative pattern, different states may have been created for each of them. This implies that the meaning of each state is different for each storage resource, even though the same parameters are compared. Therefore, the states are not comparable, and this makes decision making difficult. Thus, in order to interpret the results, it is advisable to obtain, for each parameter and grid resource, a single value that aggregates the probabilities obtained in each state for the parameter in question, that is, a numerical value that represents the average expected at long term for each parameter.

The problem of calculating this aggregated value can be found in other fields, like fuzzy logic, where a specific output is selected from diffuse premises. In fact, the problem can be approached as a piecewise-defined function that gives the probability that the value of the studied parameter is x. This function fits the results obtained in the prediction phase shown in Equation 10.3:

f(x) = y1 if x ∈ R1 ; y2 if x ∈ R2 ; y3 if x ∈ R3 ; . . . ; yn if x ∈ Rn ; 0 if x ∉ R1 ∪ R2 ∪ . . . ∪ Rn

The centroid or center of gravity from geometry is the most common method to obtain a single value that represents the whole function. The centroid computes an average of the probability function, concentrating it in a single point. The projection on the X axis of the center of gravity can be used as the representative value of the whole function. The value of the centroid g is obtained by means of the following formula:

g = Σ_{i=1..n} xi f(xi) / Σ_{i=1..n} f(xi)    (10.4)

where xi are equidistant points which cover the whole range of values where the function is not 0. The separation between these points can be adjusted according to the precision needed in the result.


For instance, the following probability values were obtained in the prediction of the I/O internal bandwidth parameter for a certain storage element:

p(I) = 0.0569061 if I ∈ [16, 25.52) ; 0.3587905 if I ∈ [25.52, 35.1) ; 0.1315845 if I ∈ [35.1, 50.3) ; 0.1831028 if I ∈ [50.3, 60.7) ; 0.2694380 if I ∈ [60.7, 64]

These results can be graphically represented. The probability of each state represents the level of the function along the interval defined by the size of the state. The resulting area can be observed in Figure 10.4, where 20 equidistant points have been selected to cover the whole range of values. This choice of 20 points is made because 4 points per state are considered enough to represent all the states. In order to obtain the centroid, the probability function is integrated by parts.

Figure 10.4: Results visualization to apply the centroid method

Using Formula 10.4, the centroid can be mathematically obtained:

g = Σ_{j=1..20} Ij p(Ij) / Σ_{j=1..20} p(Ij) = 155.18886 / 3.72358 = 41.6773

This single numerical value represents the observed parameter I at long term. This result is consistent with the expected probabilities of staying in each state and it is possible to make decisions from the comparison with the values obtained in the same parameter in every grid element.
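A sketch of the centroid computation of Formula 10.4 is shown below. The placement of the equidistant points (here, the midpoints of four sub-intervals per state) is an assumption, so the value obtained may differ slightly from the 41.6773 reported above.

def centroid(states, points_per_state=4):
    # states: list of ((low, high), probability) pairs describing p(x).
    xs, ws = [], []
    for (low, high), prob in states:
        step = (high - low) / points_per_state
        for i in range(points_per_state):
            xs.append(low + (i + 0.5) * step)   # sample point inside the state
            ws.append(prob)                     # its probability level
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)

# Example with the probabilities shown above for the internal I/O bandwidth.
g = centroid([((16, 25.52), 0.0569061), ((25.52, 35.1), 0.3587905),
              ((35.1, 50.3), 0.1315845), ((50.3, 60.7), 0.1831028),
              ((60.7, 64), 0.2694380)])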

10.2.1.4 Influence of the monitoring data period

The monitoring data period Dps refers to the time window used by GAS when asking GMonE for a parameter p monitored in a specific resource s. This produces a time-ordered data series for each period T used to make the prediction. A step in the time series involves passing a Dps period. Therefore, the


Dps determines the data set the prediction phase works with. A good adjustment of this parameter allows the system to adapt itself to system changes in a better way, avoiding overloads in the prediction algorithm (that is, the lowest possible amount of data should be used).

This adjustment should be autonomically achieved based on the obtained data. If Dps is too high, the system changes that happened during this period are not visible, losing accuracy. On the other hand, a reduction of the monitoring data period makes it possible to know the system changes in depth, but it requests a higher amount of monitoring data from GMonE, causing a performance decrease when a prediction is performed. Thus, it is necessary to establish a trade-off between the size of Dps and the caused overload.

In order to automatically select a suitable value for Dps , it is possible to compare the results obtained in the prediction phase for different Dps values. That is, a comparison between the prediction results obtained with the current value of Dps and with other values should be made. As the initial aim is the reduction of Dps to optimize the performance, a prediction is carried out with a smaller value6 than the current Dps . Then, its result is compared with the results previously obtained for the same parameter and grid resource. If the results differ7 , this means that system changes are not being properly recognized and therefore it is necessary to reduce the monitoring data period, although this involves an overload. The initial monitoring data period is 5 minutes, because it is considered relevant enough, although it is adjusted depending on the data.

On the other hand, if both predictions are similar, then the reduction of the number of observations does not provide higher accuracy. In this case, a new prediction is carried out with a larger8 Dps to find out whether it is possible to reduce the amount of data without losing accuracy. If its results are similar to the current ones, both predictions are equivalent and it is possible to increase the size of the monitoring data period in order to work with fewer observations.

The size by which the monitoring data period is reduced or increased should be a proportion (1/25, because 25 steps are considered enough) of the difference between the maximum and minimum period. Since this proportion is not a multiple of the current monitoring data period, periodic system changes can be discovered.

6 The minimal size is the monitoring period of GMonE.
7 The similarity degree indicates the adjustment between Dps and the caused overload. A similarity of 5% is considered enough in order not to lose accuracy.
8 The maximum size should be a proportion of the whole monitoring period T used to make the prediction, in order to take a minimal data set.
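The autonomic adjustment of Dps described in this subsection can be sketched as follows; predict and similarity are hypothetical callables that return, respectively, the prediction obtained with a given period and a similarity degree between two predictions.

def adjust_period(d_ps, d_min, d_max, predict, similarity, tol=0.05):
    # Step size: 1/25 of the difference between the maximum and minimum period.
    step = (d_max - d_min) / 25.0
    current = predict(d_ps)
    finer = predict(max(d_min, d_ps - step))
    if similarity(current, finer) < 1 - tol:
        return max(d_min, d_ps - step)    # changes were being missed: shrink
    coarser = predict(min(d_max, d_ps + step))
    if similarity(current, coarser) >= 1 - tol:
        return min(d_max, d_ps + step)    # fewer observations suffice: enlarge
    return d_ps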


10.3 Decision making

When all the stationary matrices for every storage element and every analyzed parameter have been obtained, the decision making phase takes place. This means selecting the most suitable storage elements in order to attend to client requests. Whereas the prediction of the future behavior is made periodically, when a new Markov chain is calculated each period T , the decision making is carried out when a client requests an I/O operation from the system. Different methods have to be applied depending on the kind of I/O operation.

Depending on the kind of file system used, such as MAPFS-Grid, SRM, SRB and so on, the degree of development of each method can be different. This is caused by the different perspectives each I/O system has for accessing data. For instance, whereas MAPFS-Grid uses two levels of parallelism, involving more difficult decisions because a higher number of elements must be selected, SRB accesses the whole file in the same resource. Although the decision making method for write operations is essentially the same for both, the decision made for SRB does not involve an analysis of how to divide the file to take advantage of the parallelism among different elements. This only means that some steps of the decision making are not processed if a single resource has to be selected. In order to explain the decision making in a general way, the most difficult process is shown, corresponding to the use of two levels of parallelism. Then, it is easy to simplify the problem for other data grid systems.

10.3.1 Decision making for read requests

This kind of request requires reading a file. As explained in Section 8.1.3, file data can be replicated in the system to provide fault tolerance capabilities. These replicas must be generated when the system is idle. Nevertheless, whole files are not replicated, but only the most used parts of each file. Replication implies that the broker has to select, for each file part, the replica that currently offers the best performance.

The discovery of file data locations is carried out by means of the monitor phase of the autonomic architecture (see Section 8.2). Each I/O system provides its own method to discover where file data is. For instance, MAPFS-Grid uses MDS and resource properties (see Section 8.1.2) to report its file data. Resource properties are the meta-information about each part of the file, which can be later replicated.

By means of the file data discovery phase shown in Section 8.1.2, the autonomic system GAS knows which storage elements store each replica of the file data. Then some important information can be extracted from the metadata or resource properties, which allows GAS to analyze the completeness


of all the data that makes up the whole file. In this case, the required meta-information is both the number of parts among which the file was distributed when it was written in a parallel way and the order of each part in the parallel distribution9 . Finally, GAS checks that there has been no loss of information, ensuring that there is at least one replica of every portion10 .

If the file is complete, then GAS decides which replicas provide the most efficient access to the requested file. In order to make this decision, the transfer time T t of each replica a is calculated. The time T t takes into account both the delay between the start of the transfer from the storage element s and the moment at which the client c receives the first communication, named latency Lat_sc , and the data transfer rate TR_sc at which the whole file data is sent. The data size DS_a corresponds to the size of the replica and is the same in each storage element that contains a replica of this information. Furthermore, this information is included in the resource properties, since it is useful to know the file size in order to reserve space.

The data transfer rate is the average number of bits per unit time between the resource and the client. Since distributed storage is addressed in this work, both the I/O and network bandwidths are limiting factors. Thus, the data transfer rate is obtained as the minimum between the internal read I/O bandwidth Ir_s of the storage element and the network bandwidth E_sc between the client and the storage resource.

TR_sc = min{Ir_s , E_sc}

Both parameters Ir_s and E_sc are obtained by means of GMonE. If GMonE does not provide a value of E_sc for this client and a certain storage element, it means that the client has never connected to this resource. Since there is no knowledge about this connection, ∞ is taken as its value, which implies that E_sc is not a limiting factor. If, for every connected client, most of the Ir_s values are higher than the corresponding E_sc , the selected replicas will be those stored in elements to which the client has never connected. As this access implies a connection, in later accesses another storage resource to which the client has never connected will be selected. This may not improve the performance of operations in the short term; however, knowledge about the connections between the client and a higher number of elements is acquired, and this knowledge can optimize the performance at a later stage.

Using the mentioned parameters Lat_sc , TR_sc and DS_a , the usual transfer time Tt_asc of every replica a from the storage element s to the client c can be calculated using the following formula:

Tt_asc = Lat_sc + DS_a / TR_sc    (10.5)

9 Round robin is the default policy used by MAPFS-Grid to make a parallel access. The order of each element is required to reconstruct the information properly.
10 Depending on the policy used to make the parallel distribution, the method to ensure the right file reconstruction can be different.

Latsc is provided by GMonE. If there is not value about Latsc , the client has never connected with this element. In this case, 0 is taken as value, which involves increasing the probability to select the non-connected storage resources from the client. This means to achieve a higher knowledge about the system, which allows GAS to enhance the system operation subsequently.

Nevertheless, the transfer time shown in Formula 10.5 does not take into account the number of messages sent between the storage element and the client to transfer the replica. Not all the data blocks are sent in the same message if the transfer is carried out by means of slices. The block size of the slice is the key factor that determines the number of messages. Just like the parameter DS_a , the block size BS_as used to write the replica is stored in the resource properties. In addition, as explained in Section 10.3.2.3, it is calculated with the aim of optimizing the I/O accesses when the file is created. To calculate the number of sent messages Nm_asc , the data size of the replica DS_a and its slice size BS_as must be considered in the following way:

Nm_asc = ⌊ DS_a / BS_as ⌋ + 1

Finally, the number of messages only affects the latency, since it is the delay needed to establish the communication. Therefore, the transfer time must be expressed as:

Tt_asc = (Lat_sc × Nm_asc) + DS_a / TR_sc    (10.6)

Once all the transfer times of the replicas have been calculated, GAS chooses the replica with the lowest time Tt_asc to read each part of the file. Then, it provides the client with a complete list of the resources where the selected replicas are, together with their optimum block size BS_as . In this way, the client can find and efficiently access all the parts of the file in a parallel way, using BS_as as block size.
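The replica selection just described can be sketched as follows, combining Formula 10.6 with the default values explained above for never-contacted resources. The dictionary layout of a replica is hypothetical.

import math

def transfer_time(lat_sc, ir_s, e_sc, data_size, block_size):
    # Missing network bandwidth means the client never connected: use infinity.
    rate = min(ir_s, e_sc if e_sc is not None else math.inf)
    # Missing latency also means no previous connection: use zero.
    latency = lat_sc if lat_sc is not None else 0.0
    messages = data_size // block_size + 1          # Nm_asc
    return latency * messages + data_size / rate    # Formula 10.6

def best_replica(replicas):
    # replicas: iterable of dicts with the fields used below.
    return min(replicas, key=lambda r: transfer_time(
        r["latency"], r["read_bw"], r["net_bw"], r["size"], r["block_size"]))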

10.3.2 Decision making for file creating requests

These requests require creating a new file of a certain size. GAS must find not only the set of storage elements that currently provide the best performance to create and write a file of the indicated size, but it must also estimate the best grid resources for performing future I/O operations. This estimation is made by means of the prediction phase shown in Section 10.2.1. Then, GAS selects the most suitable resources that improve both current and later operations. In the selected resources, the required size is reserved. In order to know whether each storage element has the required space, GAS can request information from the monitoring system. In this way, the autonomic system asks


GMonE for the free storage space (SS) in each resource, and it checks if there is room for the file.

The decision making to obtain a suitable trade-off solution between the improvement for current and later operations is a very complex process, and it can be studied from different points of view.

First, in the case that the same states are used in all the storage elements, that is, the states used to predict the behavior of the elements are predefined instead of being calculated by means of clustering methods, there is no single optimum decision. Since the states of different storage elements have the same meaning, it is possible to compare them, applying different kinds of policies:

• Optimistic or aggressive policies: These policies take into account the most favorable states of all the stationary matrices. Thus, the decision is made according to the comparison between these states. For instance, one possible optimistic policy would choose as target storage elements those with a higher probability for the most favorable state.

• Pessimistic or defensive policies: On the other hand, these policies take into account the least favorable states of all the stationary matrices. For instance, one possible pessimistic policy would choose as targeted storage elements those with lower probability for the least favorable state.

• Combination of previous policies or hybrid policies: These policies consider information about all the states of the stationary matrices, or a subset of states, but mixing both favorable and non-favorable states. For instance, one possible hybrid policy could consider the intermediate states of the stationary matrices. Therefore, in this case, the decision depends on the selected policy.

Nevertheless, when a huge number of elements work together, each element can have a different representative pattern. Thus, the definition of states adapted to the operation of each storage element is suitable, which is approached in a second point of view. Section 10.2.1.3 explained how the results obtained from the prediction phase can be interpreted and concentrated in a single value. This value represents an observed parameter at long term in a storage element.

Since several parameters could be taken into account, it is necessary to figure out a goodness value for every storage resource that concentrates all the parameters in an aggregated value representing its current and later behavior in order to be able to select the best ones.


10.3.2.1 Goodness calculation

In order to model this problem, the goodness should be calculated based on the different behavior of each storage element. The importance of every parameter in the goodness can be established by means of an importance weight. Since these weights constitute a key factor, the system itself should automatically decide their values, following high-level policies ruled by administrators, in order to optimize the goodness.

Optimization refers to finding the way to obtain the best result in a problem. That is, the aim is to minimize or maximize a function that models the problem by choosing the values of the decision variables from a feasible set. This can be carried out in three steps. The first stage consists of identifying the variables that affect the problem; the suitable values of these variables that optimize the objective are searched for. The second phase is to determine the admissible decisions, that is, the set of constraints that defines the problem. In the last stage, the benefit associated with every solution is calculated from the definition of the objective function11 . All these elements determine an optimization problem.

Therefore, an optimization model consists of an objective function and a set of constraints that restrict the values of decision variables. Optimization models are used in multiple decision making problems, such as product placement, buying stocks, and so on. In the case of the system management field, the problem is usually to improve its performance. Specifically, in our case, the problem can be simplified to optimize both storage resources and I/O operations. Since a model tries to represent some aspects of the reality about the system involved in the problem, a mathematical representation of the problem has to be built.

In order to model the system, it is necessary to specify the aims in order to subsequently define the decision variables, the objective function and the constraints that properly make up the model. The aim is to achieve the most suitable relation among the parameters that affect the system, so that the goodness of the storage elements is maximum and the decision making becomes easy. The relations among the parameters are expressed by their importance weights (that is, the weights are the decision variables) and they must fulfill high-level policies, which are defined with the aim of improving the performance in data grids.

Before defining the constraints and the objective function, since GMonE allows GAS to know a huge set of monitoring parameters, it is necessary to define those that are going to be considered to make decisions. These parameters must be related to the monitoring of the I/O system phase


in complex and distributed environments, like a grid. In Section 9.2.1, some parameters based on expert knowledge in the I/O field were selected12 . The transfer rates for read TRr_sc and write TRw_sc operations and the latency Lat_sc stand out as parameters defined by both the client c and the storage element s. As parameters related only to the behavior of each storage element s, the percentage of available storage capacity CP_s , the workload WL_s and the number of simultaneous I/O requests R_s can be highlighted. These parameters are either simple monitoring parameters offered by GMonE or they can be easily calculated from them.

11 An objective function is defined as a function associated with an optimization problem, which determines how good a solution is [AF98].

TRr_sc = min{Ir_s , E_sc}
TRw_sc = min{Iw_s , E_sc}
CP_s = C_s / TC_s
WL_s = L_s × S_s

The goodness X_s of every grid resource can be modeled taking into account all these parameters in spite of their heterogeneity. Furthermore, since performance improvements should be related not only to current but also to later I/O operations, the prediction of the parameters related to future accesses must be taken into account, together with the current ones, in the calculation of the goodness. The parameters that have influence on later accesses are those related to read accesses and to the behavior of the resource, such as T̂Rr_sc , R̂_s , ŴL_s and ĈP_s 13 , and therefore they have to be predicted following the method explained in Section 10.2.1. On the other hand, since a write access is performed just after creating the file, the prediction of the parameters related to this write, like TRw_sc and Lat_sc , is not considered. Finally, the importance of every monitoring parameter in the goodness calculation can be established by means of a certain weight W , which allows GAS to decide which parameters have more influence in order to make decisions.

X_s = W_TRr × T̂Rr_sc + W_TRw × TRw_sc + W_CP × ĈP_s + W_WL × ŴL_s + W_R × R̂_s + W_Lat × Lat_sc    (10.7)

Formula 10.7 shows the goodness of each grid element. Nevertheless, since each parameter has its own units, it is not possible to add them if the values are not previously processed. Hence, the value used for each parameter is not the monitoring value itself but a proportion of this value with regard to the values obtained by the rest of the storage elements for this parameter. That is, for each parameter P , all the monitoring values obtained for all grid resources are compared; 1 is assigned to the maximum value Pmax whereas 0 is assigned to the minimum one Pmin . The rest of the values are proportionally obtained:

P'_s = (P_s − P_min) / (P_max − P_min)

12 Note that any other different parameters could be easily included in the model.
13 The symbol P̂ means the prediction of the parameter P .


In this sense, the ranges are the same for all the parameters (a proportion between 0 and 1) and the parameter values can be added. Even so, Formula 10.7 has an interpretation problem: the meaning of the parameters can differ, since a high value can mean a benefit or a drawback depending on the parameter. As the parameters are expressed as proportions, it is only required to invert the proportion for the parameters that have the opposite meaning:

P''_s = 1 − P'_s   if the parameter has the opposite meaning

That is, where before a high value was mapped to 1, now it is mapped to 0. For instance, it is required to invert Lat_sc , WL_s and R_s , in which a high value involves worse performance, to adapt them to the rest of the parameters, like TRr_sc and TRw_sc , where a high value indicates better operation.
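A sketch of the goodness computation of Formula 10.7, including the min-max normalization and the inversion of the opposite-meaning parameters, is shown below. The parameter and weight dictionaries are hypothetical data structures.

OPPOSITE = {"Lat", "WL", "R"}     # parameters where a high value means worse

def normalise(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def goodness(elements, weights):
    # elements: {storage_id: {param: value}}, weights: {param: W_param}
    params = list(weights)
    ids = list(elements)
    norm = {}
    for p in params:
        col = normalise([elements[s][p] for s in ids])
        if p in OPPOSITE:
            col = [1.0 - v for v in col]       # P''_s = 1 - P'_s
        norm[p] = dict(zip(ids, col))
    return {s: sum(weights[p] * norm[p][s] for p in params) for s in ids}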

W_P is the importance weight assigned to each parameter P . Therefore, these weights decide which parameters are taken into account the most to make decisions. As the aim is to obtain the values of W_P that lead to the best decision making process, the decision will be easier if the goodness of every storage element is as high as possible. Thus, the objective function should represent the calculation of the weights W_P1 , W_P2 , . . . , W_Pn that obtain a goodness X_s for each element s such that there are no other weights W'_P1 , . . . , W'_Pn whose goodness X'_s assigned to a resource is higher than the one calculated with the previous weights.

However, the solution of this problem is NP-complete and it may even have no solution, since it requires maximizing the goodness of every single storage element at the same time, knowing that the weights must affect every resource in the same way. To avoid this problem, the joint system goodness can be optimized instead of the goodness of each element. In this sense, the objective function is:

max f = X1 + X2 + . . . + Xn    (10.8)

The relations among these weights must be defined by means of high-level policies adapted to this kind of environment. That is, the policies indicate which parameters are more important and can be expressed by means of constraints on the relations among the weights. Since the policies are intended to improve the I/O phase with the aim of enhancing data-intensive applications, the following restrictions have been defined from expert knowledge in the I/O field:

1. The transfer rate for read operations has more priority than that for write operations, because the improvement of later read accesses is intended: W_TRw ≤ W_TRr

2. Due to the great volume of data to store, the transfer rate for write operations is more important than the workload in the storage element: W_WL ≤ W_TRw


3. Since the problem is focused on a storage environment, the latency is a factor that influences the performance of the operations. Therefore, its weight must have more priority than the workload weight: W_WL ≤ W_Lat

4. All the parameters must be related among them. Thus, there must be a proportion among them reflecting the importance of each one: W_TRr + W_TRw + W_CP + W_WL + W_R + W_Lat = 1

5. It is important not to saturate the storage capacity of a few grid elements with the best characteristics; instead, data must be shared among as many elements as possible. This implies that the free capacity percentage is more important than the number of expected simultaneous connections and than the current write transfer rate: W_TRw ≤ W_CP, W_R ≤ W_CP

6. Finally, it is necessary to prevent the most important parameters from making the other parameters irrelevant. Thus, their growth must be limited: W_TRr ≤ 0.6, W_CP ≤ 0.6, W_Lat ≤ 0.6

In summary, the constraints that define the high-level policies are:

W_TRw ≤ W_TRr
W_WL ≤ W_TRw
W_WL ≤ W_Lat
W_TRr + W_TRw + W_CP + W_WL + W_R + W_Lat = 1
W_TRw ≤ W_CP
W_R ≤ W_CP
W_TRr ≤ 0.6
W_CP ≤ 0.6
W_Lat ≤ 0.6


Nevertheless, these policies are adapted to great amounts of data. If the data size is small^14, some constraints should be changed. The first four constraints remain the same, because they define the main aim of this kind of distributed storage environment. Regarding the rest of the policies, since the current request is small, it is more interesting that the operation finishes earlier than to account for the number of expected simultaneous accesses, because these accesses usually take less time:

W_R ≤ W_TRw

It is considered more important to control the number of I/O processes than the proportion of free storage capacity, because the amount of data involved is not large:

W_CP ≤ W_R

Nevertheless, the storage capacity must still have more priority than the workload, since data-oriented environments are addressed:

W_WL ≤ W_CP

In this case, there are two parameters whose growth is not limited by other parameters, so they have to be limited with a maximum value:

W_TRr ≤ 0.6
W_Lat ≤ 0.6

In short, the constraints for small sizes are:

W_TRw ≤ W_TRr
W_WL ≤ W_TRw
W_WL ≤ W_Lat
W_TRr + W_TRw + W_CP + W_WL + W_R + W_Lat = 1
W_R ≤ W_TRw
W_CP ≤ W_R
W_WL ≤ W_CP
W_TRr ≤ 0.6
W_Lat ≤ 0.6

14 Data sizes lower than 100 MB are considered small, since the performance improvement for accessing a 100 MB file shown in Section 11.3.6 is not meaningful enough due to its small size.

In order to solve the proposed optimization model, since all the shown constraints and the objective function are linear, linear programming is the most suitable option because it allows us to address


the problem in a simple way. Linear programming is a method to optimize the assignment of different kinds of resources, such as money, time, and so on. It assumes that both the objective function and the constraints are linear, although this is not a restriction for most optimization problems, such as transport systems, strategy games, etc.

Therefore, this problem can be solved by means of linear programming, and more specifically the simplex method [Dan63]. In this way, the objective function shown in Formula 10.8 can be expressed in its standard form:

min f = −X_1 − X_2 − ... − X_N

where N is the number of grid storage elements. Replacing X_s by its corresponding value shown in Formula 10.7:

min f = −W_TRr × (Σ_{s=1}^{N} \widehat{TRr}_sc) − ... − W_Lat × (Σ_{s=1}^{N} \widehat{Lat}_sc)   (10.9)

In the same way, the constraints can be expressed as:

W_TRw − W_TRr + SV'_1 = 0
W_WL − W_TRw + SV'_2 = 0
W_WL − W_Lat + SV'_3 = 0
W_TRr + W_TRw + W_CP + W_WL + W_R + W_Lat = 1
W_TRw − W_CP + SV'_5 = 0
W_R − W_CP + SV'_6 = 0
W_TRr + SV'_7 = 0.6
W_CP + SV'_8 = 0.6
W_Lat + SV'_9 = 0.6

where the SV'_r are slack variables that turn the inequalities into equalities. In order to dynamically obtain the suitable weights, the two-phase simplex algorithm can be applied to the objective function in its standard form, shown in Formula 10.9, together with its corresponding set of constraints. As these 9 constraints only contain 8 slack variables, an artificial variable AV_4 is required to create the necessary identity matrix. However, if the constraints are inconsistent, the problem is unfeasible. The solution to this scenario is based on successively


eliminating constraints according to the first-fail principle^15 until the problem becomes feasible. Constraint ordering methods, like the Constraint Ordering Heuristic [SB05], can be used to arrange the constraints from the most restrictive to the least restrictive following this principle.

Since the decision making is carried out when a client requests an I/O operation to the system, the weights are dynamically calculated depending on the monitoring data obtained for the specific client. The dynamic calculation of the weights allows GAS to know the goodness of each element in order to automatically decide on the most suitable elements in which to create a new file, which will improve current and later operations.
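The following Python sketch only illustrates this optimization step (it is not the GAS implementation): it solves the weight calculation with an off-the-shelf LP solver instead of a hand-coded two-phase simplex, using the large-data policies above. The normalized parameter sums are hypothetical values standing in for the output of the monitoring and prediction stages.

    from scipy.optimize import linprog

    # S[P]: sum over the N storage elements of the normalized prediction of P
    # (hypothetical values; in GAS they come from the inputs of Formula 10.7).
    S = {"TRr": 2.4, "TRw": 1.9, "CP": 3.1, "WL": 1.2, "R": 0.8, "Lat": 1.5}
    order = ["TRr", "TRw", "CP", "WL", "R", "Lat"]

    # Objective of Formula 10.9 in standard form: min -sum_P W_P * S_P
    c = [-S[p] for p in order]

    # Large-data policies as A_ub x <= b_ub (variable order: TRr, TRw, CP, WL, R, Lat)
    A_ub = [
        [-1, 1, 0, 0, 0, 0],   # W_TRw <= W_TRr
        [0, -1, 0, 1, 0, 0],   # W_WL  <= W_TRw
        [0, 0, 0, 1, 0, -1],   # W_WL  <= W_Lat
        [0, 1, -1, 0, 0, 0],   # W_TRw <= W_CP
        [0, 0, -1, 0, 1, 0],   # W_R   <= W_CP
        [1, 0, 0, 0, 0, 0],    # W_TRr <= 0.6
        [0, 0, 1, 0, 0, 0],    # W_CP  <= 0.6
        [0, 0, 0, 0, 0, 1],    # W_Lat <= 0.6
    ]
    b_ub = [0, 0, 0, 0, 0, 0.6, 0.6, 0.6]

    # The weights must add up to 1
    A_eq = [[1, 1, 1, 1, 1, 1]]
    b_eq = [1]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * 6)
    print(dict(zip(order, res.x)))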

10.3.2.2 Goodness-based decision making stage

In order to make the best decision, the average number of processes R that are going to simultaneously request I/O operations in the grid is first estimated for each element (from the prediction of the parameter R). If a great number of processes is expected, the aim is to minimize the impact on the system. In this case, only the single storage element that can store the whole data size and provides the highest goodness is selected^16, since the number of processes should not be increased more than necessary.

On the other hand, if there is no single element that can store the whole data size, or the expected number of simultaneous accesses is not too high, then several elements are selected to be accessed in a parallel way^17. As many elements as needed to fill the available network bandwidth BW_c that the client provides to GAS are selected. This involves analyzing, following the goodness order, the transfer rate TR_sc that each element provides to the client and checking whether adding it to the transfer rates of the previously selected elements stays below BW_c, so that the client is not saturated. If this is the case, the element is selected to create the file.
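A rough sketch of this selection stage is shown below. It is illustrative only: the threshold on the expected number of processes and the field names are assumptions, not part of GAS.

    def select_elements(candidates, data_size, expected_r, r_threshold, bw_c):
        """candidates: dicts with 'goodness', 'free_capacity' and 'tr_sc' (transfer
        rate to the client). Returns the storage elements chosen for a new file."""
        ranked = sorted(candidates, key=lambda e: e["goodness"], reverse=True)
        # Many simultaneous processes expected: use a single element if one fits
        if expected_r > r_threshold:
            for e in ranked:
                if e["free_capacity"] >= data_size:
                    return [e]
        # Otherwise accumulate elements, in goodness order, until the joint
        # transfer rate fills the client's available bandwidth BW_c
        selected, aggregated_tr = [], 0.0
        for e in ranked:
            selected.append(e)
            aggregated_tr += e["tr_sc"]
            if aggregated_tr >= bw_c:
                break
        return selected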

Finally, since the block size determines the number of messages exchanged between clients and storage elements, it influences the performance of the operations. Therefore, it is necessary to decide the suitable block size BS that improves the current and later I/O operations on the file and to indicate it to the client, which then uses this information to access the file. Its calculation is analyzed in depth in the next section.

15 “To succeed, try first where you are most likely to fail” [HE80].
16 Note that this is also used for grid storage systems that are not accessed in a parallel way, like SRB or SRM.
17 Due to the fact that several elements are selected, the global number of I/O processes in the whole grid is increased.


10.3.2.3 Influence of data distribution and block size

The way storage elements are accessed has a clear influence on the performance of I/O operations. Just like in every file system, data is accessed by means of slides. Thus, the size of the stripe sent to each grid resource affects the obtained performance. Given the importance of this parameter, GAS must calculate an optimum value for it.

Chen and Patterson [CP90] indicate the block size that maximizes the performance in a striped disk array. Disk arrays can be seen as a previous step towards the proposed arrays of storage elements. They propose to balance the benefit and the cost of the operation: the benefit is the transfer time of a single request, and the cost is the time spent until data actually starts being accessed. For disk arrays, the benefit can be translated into the slide size divided by the transfer rate of the disk, and the cost into its positioning time. Therefore, the following formula is obtained:

Striping unit = Z × average positioning time × data transfer rate   (10.10)

where Z is the zero-knowledge coefficient, that is, the coefficient used when no information about the workload is known. They demonstrate that Z is roughly 2/3. In a grid environment, where the disks are replaced by storage resources distributed over the Internet, the optimum block size can follow the same idea. In this case, the benefit and the cost are represented by different parameters, which must take into account the operation of the network instead of the disk, since the network establishes the communication between the client and the data. Following the idea expressed by Chen and Patterson, the benefit would be the block size divided by the transfer rate of the connection, and the cost would be its associated latency. In this sense, a suitable block size to optimize the access between a storage element s and a client c can be expressed as:

BS_sc = Z × Lat_sc × TR_sc   (10.11)

The heterogeneity of the resources and of the network between the client and the storage elements causes a problem when using Formula 10.11 in grid environments, because it is based on Formula 10.10, which was developed for homogeneous disk arrays. That is, it models an optimum block size for each storage element that is the same for all of them only if they have even characteristics, such as the same Lat and TR. If their properties are different, it does not take these differences into account in order to provide an efficient parallelism among them. Therefore, this slide size can be used when a single storage element is accessed to obtain data, e.g., in non-parallel grid systems, like SRM and SRB, or when only one resource is selected in MAPFS-Grid.


In the case that parallelism among heterogeneous networks and elements is used to improve the I/O performance, it is necessary to analyze some alternatives to achieve an efficient access. Two possible solutions are proposed to tackle this problem:

1. Assigning blocks of changeable size to each storage element to optimize the joint parallelism. The parallelism is carried out by sending blocks in a cyclical way among the selected elements^18. In a heterogeneous environment, if data blocks of the same size are transferred to the resources, optimal overall performance may not be obtained, since the slowest resource is usually the limiting factor.

In this case, the block size assigned to each resource must be calculated taking into account the rest of the elements. That is, the maximum size that each resource can receive in the same time must be calculated, so that no storage element has to wait for the rest of the elements, providing an efficient access. In order to obtain this efficient block size, the latency of the slowest element, ML, is a key factor, since the rest of the elements have already been able to send data before this delay:

ML = max{Lat_1c, Lat_2c, ..., Lat_nc}

The data sent before ML by each element is its transfer rate TR_sc multiplied by the time elapsed from the moment it receives the first communication from the client (that is, its latency Lat_sc) until ML. Then, for each element, it is only necessary to take into account the transfer rate TR_sc to the client and the transfer time Tt counted from ML in order to obtain a block size that does not involve waiting in any element. This can be expressed as:

BS_sc = (ML − Lat_sc) × TR_sc + TR_sc × Tt   (10.12)

The problem is how large Tt must be in order to obtain a positive benefit/cost ratio. When data is sent, the delay introduced by the latency means that the maximum transfer rate cannot be completely reached during the whole period. A larger block size improves I/O performance but reduces the usefulness of the system. Therefore, it is necessary to achieve a trade-off between the block size and the transmission time. To do so, it is necessary to analyze and understand the whole BS_c, since several storage elements are involved in the transfer:

BS_c = Σ_{s=1}^{n} BS_sc = Σ_{s=1}^{n} ((ML − Lat_sc) × TR_sc) + Σ_{s=1}^{n} (TR_sc × Tt)   (10.13)

From Formula 10.13, it is possible to calculate the total transfer rate TR_total achieved by the client in the studied period t = ML + Tt.

18 Round robin is the default policy.


DS_lat = Σ_{s=1}^{n} ((ML − Lat_sc) × TR_sc)

TR_c = Σ_{s=1}^{n} TR_sc

TR_total = BS_c / (ML + Tt) = (DS_lat + TR_c × Tt) / (ML + Tt)   (10.14)

TR_total can be explained through the meaning of its two terms. The first one, DS_lat, is the data size transmitted between the client and the set of storage elements before reaching ML. The second one is the data size jointly sent at the transfer rate TR_c during the remaining time Tt of the studied period. TR_c is the sum of the transfer rates between the client and each storage element, because all the elements are transferring data at TR_sc from ML onwards. This transfer rate TR_c is an asymptote for the performance due to the delay produced by the latency ML, as can be seen in Figure 10.5, obtained from Formula 10.14.

Figure 10.5: Achieved total transfer rate depending on time

The total transfer rate increases, approaching the asymptote that limits its growth. This happens because the latency has less influence as time goes by. Therefore, it is possible to analyze Tt by studying the growth of the transfer rate by means of its derivative:

TR'_total = ((DS_lat + TR_c × Tt) / (ML + Tt))' = (TR_c × ML − DS_lat) / (Tt + ML)²   (10.15)

TR'_total × (Tt + ML)² = TR_c × ML − DS_lat


TR'_total × Tt² + TR'_total × Tt × 2ML + TR'_total × ML² = TR_c × ML − DS_lat

TR'_total × Tt² + (TR'_total × 2ML) × Tt + (TR'_total × ML² + DS_lat − TR_c × ML) = 0

Thus, the value of Tt can be obtained by solving this second-order polynomial:

Tt = −ML + (√(TR'_total × (TR_c × ML − DS_lat))) / TR'_total   (10.16)

As the aim is to obtain the maximum transfer rate with the least sending time Tt, the cost is considered larger than the benefit when the tangent to the TR_total curve has a slope of less than 45°. In this way, the parallelism among storage elements assigning blocks of changeable size during ML + Tt is efficient. The limit of a 45° slope is reached when the derivative TR'_total takes the value 1, which implies that the proper value for Tt is:

Tt = −ML + √(TR_c × ML − DS_lat)

Once the value of Tt is calculated, the different block sizes BS_sc for each storage element can be obtained by applying Formula 10.12.
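As an illustration of Formulas 10.12 to 10.16 (a sketch only, with hypothetical latencies and transfer rates, not the GAS code), the per-element block sizes can be computed as follows:

    from math import sqrt

    def variable_block_sizes(lat, tr):
        """lat[s], tr[s]: predicted latency (s) and transfer rate (MB/s) between
        the client and each selected storage element s. Returns the block size
        (MB) assigned to each element so that none of them waits for the rest."""
        ml = max(lat)                                        # latency of the slowest element (ML)
        ds_lat = sum((ml - l) * r for l, r in zip(lat, tr))  # data sent before ML
        tr_c = sum(tr)                                       # joint transfer rate TR_c
        # 45-degree rule (TR'_total = 1): Tt = -ML + sqrt(TR_c * ML - DS_lat)
        tt = -ml + sqrt(max(tr_c * ml - ds_lat, 0.0))
        # Formula 10.12: BS_sc = (ML - Lat_sc) * TR_sc + TR_sc * Tt
        return [(ml - l) * r + r * tt for l, r in zip(lat, tr)]

    # Hypothetical example with three heterogeneous storage elements
    print(variable_block_sizes(lat=[0.05, 0.20, 0.35], tr=[40.0, 25.0, 10.0]))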

Figure 10.6: Assigning blocks of changeable size carrying out a joint parallelism (the GAS broker resource list stores, for each resource, the file name, the order, the number of parts, the block size and the size of the stored data)

This alternative implies a specific handling of the resource properties. Figure 10.6 shows the parameters required in the resource properties and how they are used. The block size used in each storage element to access each file must be stored in its resource properties. Then, GAS informs the client about the proper BS_sc required to efficiently access each storage


element. This block size is used by the client to recover the file, requesting the right size from each selected storage element. Therefore, the client must know the parallelism logic to access data^19.

2. Asynchronous sending of blocks of the optimum size to each resource. The block size of each element is independently obtained by means of Formula 10.11. In this sense, the faster storage elements receive more blocks according to their transfer rate. This makes reconstructing and recovering the file more difficult. In fact, every element is required to publish and inform GAS of the file parts that it contains^20. Therefore, each resource entry refers to a single file block instead of a set of blocks, as can be seen in Figure 10.7. This involves a larger number of resources, because a resource is necessary for every block into which the file is divided. The size of a file block in each storage element s is equal to the optimum calculated slide BS_sc.

Figure 10.7: Asynchronous sending of the optimum block size to each resource (each resource entry in the GAS broker list stores the file name, the order of the part, the total number of parts and its size)
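The following sketch illustrates this second alternative under simplifying assumptions (the function, its parameters and the scheduling by earliest availability are illustrative, not the MAPFS-Grid implementation): each element receives parts of its own optimum size from Formula 10.11, so faster elements end up holding more parts, and the (element, order, size) tuples represent what would be published to GAS.

    import heapq

    def split_asynchronously(file_size, elements, z):
        """elements: list of (name, lat_sc, tr_sc), with latency in s and transfer
        rate in MB/s; z is the zero-knowledge coefficient of Formula 10.11.
        Returns (element, order, part_size) tuples describing the distribution."""
        # Each element becomes available after its latency; the next part always
        # goes to the element that becomes free first.
        free_at = [(lat, name, lat, tr) for name, lat, tr in elements]
        heapq.heapify(free_at)
        parts, order, remaining = [], 1, float(file_size)
        while remaining > 0:
            t, name, lat, tr = heapq.heappop(free_at)
            size = min(z * lat * tr, remaining)      # BS_sc = Z * Lat_sc * TR_sc
            parts.append((name, order, size))
            heapq.heappush(free_at, (t + size / tr, name, lat, tr))
            order, remaining = order + 1, remaining - size
        return parts

    # Hypothetical usage: a 512 MB file over two heterogeneous elements
    print(split_asynchronously(512, [("UPM1", 0.05, 40.0), ("UPM2", 0.20, 25.0)], z=2/3))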

This second proposal provides a non-gap approach, since the parallelism logic does not require waiting for the slowest element. Furthermore, the client toolkit does not need to know how the storage elements have to be accessed, because GAS returns a complete list indicating in which element each part is stored. Then, the client accesses the elements in a parallel way, requesting each file part sequentially to reconstruct the file. Both approaches propose to adapt the slide size to the different characteristics of each resource.

19 In MAPFS-Grid, there is a client toolkit that is in charge of making parallel operations. Thus, this is transparent for the final client.
20 GAS can check whether the whole file is in the system by obtaining the total number of parts into which the file was distributed and checking that all the parts are stored by some element.


Since most known heterogeneous disk arrays (see Section 2.2.2) propose to use the same slide size, simply storing more slides in the best disks, this adaptation represents a qualitative step forward motivated by the complexity of the grid.

10.3.2.4 Performance improvement achieved by two levels of parallelism

The calculation of the block size shown in Section 10.3.2.3 only takes into account the parallelism among the elements selected to perform the I/O operation. Nevertheless, some resources can be compound, e.g., clusters. This makes it possible to take advantage of two levels of parallelism by using the several nodes that internally compose each storage element, as observed in Chapter 7. In this case, it is advisable to adapt the calculated block size, which improves the performance of the higher level of parallelism, in order to optimize both inter-storage element and intra-cluster parallelism.

The optimization at the cluster level depends on the parallel file system used inside each cluster to access data. For a specific file system, it would be necessary to analyze and calculate the block size that enhances its performance. Our proposal selected the MAPFS file system because of its good adjustment to complex and dynamic environments, like a grid, although other parallel file systems could be used, such as PVFS or Lustre. MAPFS is therefore used to explain the performance improvements achieved by means of the two levels of parallelism.

MAPFS is a parallel file system designed for clusters that accesses the whole set of nodes in a parallel way by means of slides. The size of the slide sent to each node is a configuration parameter defined by the cluster administrator according to his or her expert knowledge about the system. The parallelism is optimized when no node has to wait for the rest. In a homogeneous cluster, this is achieved when the size of the request is the same for all the nodes. Thus, the optimum block size of MAPFS is the size of the configured slide sent to a node multiplied by the number of nodes:

BS_intra-cluster = BS_node × Number of nodes

To optimize the two levels of parallelism, both the block size BS_inter-storage element of a certain storage element, which optimizes the parallelism together with the rest of the grid elements, and its BS_intra-cluster, which enhances the parallel access inside it, must be taken into account. In fact, if the block size BS_two levels parallelism finally used to access grid resources is a multiple of BS_intra-cluster, the most can be made of both levels of parallelism in an ideal way. Therefore, BS_inter-storage element is refined by means of BS_intra-cluster into an optimum value BS_two levels parallelism following the next formula:

BS_two levels parallelism = ⌊BS_inter-storage element / BS_intra-cluster⌋ × BS_intra-cluster

In the case that BS_inter-storage element is less than BS_intra-cluster, it is not possible to refine it and

BS_two levels parallelism will be equal to BS_inter-storage element.


This does not provide an optimum two-level infrastructure: increasing the block size to optimize the intra-cluster parallelism of a specific resource would make the rest of the grid resources wait for this element. However, at least the improvement provided by the first level of parallelism is achieved.
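A minimal sketch of this refinement (a hypothetical helper, not part of MAPFS-Grid) is:

    def refine_block_size(bs_inter, bs_node, num_nodes):
        """Adapt the inter-storage-element block size to the cluster's internal
        parallelism: BS_intra-cluster = BS_node * number of nodes."""
        bs_intra = bs_node * num_nodes
        if bs_inter < bs_intra:
            return bs_inter                       # cannot be refined: keep the inter-level value
        return (bs_inter // bs_intra) * bs_intra  # largest multiple of BS_intra-cluster

    # Example: a 3000 KB inter-element block refined for an 8-node cluster with 64 KB slides
    print(refine_block_size(3000, 64, 8))         # sizes in KB -> 2560 KB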

10.3.3 Decision making for write requests on an existing file

This kind of request requires writing a file that is already in the file system. Since file data can be replicated in the system, a write involves selecting the replica of each file part that enhances the performance of not only the current but also later I/O operations. However, since replication and file creation already aim to optimize later accesses, decision making for write operations is only concerned with the current access. Therefore, the decision can be made just like for read operations (see Section 10.3.1), taking into account the parameters that influence this kind of operation.

In order to select the most efficient replicas, it is first necessary to verify whether the whole set of sites where the old data is stored is able to store the new size of the file. This is verified during the whole process of selecting a replica of each file part. If there is not enough available space to store the file in the previously determined data distribution, decisions must be made about other suitable locations for the new file size, in the way explained in Section 10.3.2. Then, the file must be reconstructed following the newly selected distribution. This task can be performed through different alternatives.

The first alternative is to reconstruct the information by using brute force, which consists of three stages:

1. Read the whole file, storing it in an intermediate storage.
2. Write the file data blocks taking into account the new topology.
3. Delete the old file.

The main advantage of this alternative is its simplicity. Nevertheless, this solution is very inefficient.

The second alternative is based on the usage of selective operations. These operations read data slides from the most effective replicas that belong to the first topology, writing them directly into the new distribution of storage elements. [PSPR05] shows the file reconstruction stage by using selective operations in cluster environments. These operations can be analogously applied in a grid, this alternative being more efficient than brute force. Even so, data redistribution is an expensive task, which affects the system performance. Since this operation is necessary, the system must keep providing service during its execution.


The original file must not be eliminated until the whole operation is completed, in order to keep the service available during the redistribution. Therefore, the system has two copies or views of such a file, which GAS must manage. During the redistribution task, a file mapping must be maintained, in such a way that if another client accesses slides that have already been written by the redistributor process, the system accesses the new view; if it accesses slides that have not been written yet, the system accesses the old view.
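A simple way to picture this mapping is sketched below (an assumption-laden illustration: a per-slide bitmap, not the actual GAS data structure):

    class RedistributionMap:
        """Tracks which slides the redistributor has already rewritten, so that
        concurrent clients can be routed to the old or the new view of the file."""

        def __init__(self, num_slides):
            self.moved = [False] * num_slides

        def mark_moved(self, slide):
            # Called by the redistributor after rewriting a slide in the new topology
            self.moved[slide] = True

        def view_for(self, slide):
            # Already-rewritten slides are served from the new distribution,
            # the remaining ones from the old one
            return "new" if self.moved[slide] else "old"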

If there is enough space, GAS decides which replicas provide the most efficient write access to the requested file. Just like in the decision making for read operations, the transfer time is the key factor to select the best replicas (see Formula 10.6). Nevertheless, the data transfer rate TR_sc for sending data from the client c to each storage element s is different, since the parameters that affect write operations must be considered. In this case, the data transfer rate is obtained as the minimum between the internal I/O write bandwidth Iw_s of each site and the network bandwidth E_sc between the client and the storage resource:

TR_sc = min{Iw_s, E_sc}

After the transfer times of each replica are calculated, GAS chooses the replicas that involve the least transfer time to be written and provides the client with the list of elements where the selected replicas are stored. The client writes to these sites in a parallel way. The write access in each storage element is carried out using the block size BS_sc obtained from the resource properties of each selected replica, because it provides the parallelism logic.
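A hedged sketch of this selection for one file part (illustrative field names and units, not the GAS code) could look as follows:

    def select_write_replica(replicas, data_size):
        """replicas: list of dicts with 'site', 'iw' (internal write bandwidth,
        MB/s) and 'e' (client-site network bandwidth, MB/s); data_size in MB.
        Returns the replica with the lowest estimated write transfer time."""
        def transfer_time(r):
            tr_sc = min(r["iw"], r["e"])   # TR_sc = min{Iw_s, E_sc}
            return data_size / tr_sc
        return min(replicas, key=transfer_time)

    # Hypothetical example
    print(select_write_replica(
        [{"site": "A", "iw": 50.0, "e": 80.0}, {"site": "B", "iw": 120.0, "e": 30.0}],
        data_size=2048))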

When a write is performed on the replicas that provide the most efficient write access, the non-selected replicas must be marked as obsolete, because their data is not updated. The data updates performed to ensure consistency are explained in Section 8.1.3.


Chapter 11

EVALUATION

System evaluations are often carried out under unusual conditions, mainly for two different reasons: (i) systems are evaluated through simulations; (ii) the test environment is different from the deployment environment. The uncontrollable aspects of a grid make its proper representation through simulations difficult. Therefore, in order to evaluate the proposal properly, a real environment supporting a usual workload seems a better option.

This chapter analyzes in depth the performance and the different benefits of the proposed approach. The three parts of the whole proposal, namely i) the high performance grid storage architecture, ii) the grid monitoring environment and iii) the autonomic framework for grid storage, are evaluated separately in the next sections to simplify the process.

11.1 Evaluation of the high performance grid storage architecture MAPFS-Grid

This analysis aims at demonstrating the efficiency of the proposed two levels of parallelism (intra-cluster and inter-storage element) compared to the performance obtained by traditional grid data transfer methods.

In this case, the work environment is designed to test the benefits of the proposed grid I/O architecture, obtaining the theoretical and practical limits of the infrastructure. Thus, it is only constituted


by two compound storage elements. UPM 1 consists of 8 Intel Xeon 2.40GHz nodes with 1 GB of RAM memory, connected by a Gigabit network. The hard disk in each node provides 30 MB/s approximately. UPM 2 consists of 4 Intel Xeon 3.0GHz nodes with 2 GB of RAM memory, connected by a Gigabit network. The hard disk in each node provides 50 MB/s approximately.


Figure 11.1: Network topology of the test environment used to evaluate MAPFS-Grid (the client connected to the UPM 1 and UPM 2 storage elements)

To prevent the client from being the throughput bottleneck of this system, a high-performance client is required. An Intel Xeon 3GHz system featuring two hard disks set in a RAID 0 array is selected as client. This client offers a disk read/write bandwidth of around 130 MB/s. Client and storage elements are connected by a Gigabit network. Figure 11.1 shows the topology of the network that interconnects the different elements.

To evaluate the network, the NetPIPE (Network Protocol Independent Performance Evaluator) tool [TC02] has been used, which includes the best features of the evaluation applications ttcp and netperf. This tool characterizes the network performance under several conditions. In addition, it performs ping-pong tests, sending messages of incremental and variable size between two processes through the network or within a shared-memory architecture. Figure 11.2 shows the results of the evaluation of the network of the work environment. The maximum achieved bandwidth is approximately 112 MB/s.

Experiments have been conducted varying the parameters that have a clear influence on the performance of I/O operations on compound storage elements:

1. Number of nodes that are part of the storage element. Tests are designed to measure the performance improvements obtained by using the whole potential of a cluster, or a set of non-individual nodes working together, compared to traditional single storage elements. All tests analyze the influence of this parameter on the performance. Thus, most of the tests are made over UPM 1, because it has a higher number of nodes.


Figure 11.2: Evaluation of the network bandwidth in the test environment (bandwidth in MB/s against access size in bytes, from 2^0 to 2^30)

2. Block size BS_cs sent between the client c and the storage element s. Since the system is designed to manage massive amounts of data, the block size is highly relevant. As demonstrated in [Pér03], proper MAPFS block sizes depend on both the number of nodes and the slide size sent by the MAPFS file system to every node in a parallel way. After analyzing the results obtained in [Pér03] and checking their correct behavior, 64 KB was selected as the most suitable slide size sent by MAPFS in parallel to every node. This size corresponds to the maximum data size that DMA drivers can manage. In order to maintain the same conditions among tests, the same block sizes between client and storage element have to be used. Besides, the block size should take advantage of the parallel access to the whole set of nodes. Therefore, it must be a multiple of both the MAPFS slide size (64 KB) and the number of nodes (1, 2, 4, 8):

BS_cs = lcm(1, 2, 4, 8) × 64 KB = 512 KB

Thus, the system is tested considering the default 512 KB block size, and then 1 MB, 2 MB and 4 MB block sizes.

3. File size. The memory hierarchy is designed to take advantage of the faster accesses and lower latencies of the lower levels, such as cache and main memory, compared to the higher levels, like disk storage, in order to enhance the I/O phase. Memory accesses can also be faster than sending data through the network, as happens in the test environment. Therefore, the memory hierarchy can hide the benefit obtained by means of the parallelism among nodes when accessing small files, because most of the data is accessed in the lower levels of the hierarchy. Nevertheless, file sizes larger than the memory size require


accessing higher levels of the memory hierarchy to perform the I/O operation, where parallel accesses are very beneficial.

Different data access systems are compared to the proposed components in the next sections. In all cases, both read and write operations are evaluated.

11.1.1 Performance of the service-based high-performance access MAPFS-Grid PDAS

In the proposal explained in Section 7.2.1, named MAPFS-Grid PDAS, the I/O phase of writing/reading files is improved by means of the parallel access among the nodes that are part of a storage element. This corresponds to the intra-cluster parallelism. Some tests have been run on UPM 1 to show the performance benefits of this parallelism level.

Figure 11.3: Performance obtained by MAPFS-Grid PDAS to read a file on UPM 1 (bandwidth in MB/s against file size, for block sizes of 512 KB, 1 MB, 2 MB and 4 MB, using 1, 2, 4 and 8 nodes)


Figures 11.3 and 11.4 show the comparison between the bandwidths achieved in read and write operations, respectively, on UPM 1 by using MAPFS-Grid PDAS, according to the parameters described above.

Figure 11.4: Performance obtained by MAPFS-Grid PDAS to write a file (bandwidth in MB/s against file size, for block sizes of 512 KB, 1 MB, 2 MB and 4 MB, using 1, 2, 4 and 8 nodes)

Figures 11.3 and 11.4 show that when the file size is increased, the performance also increases, being highest for 1 GB. The reason behind this behavior is the influence of the memory hierarchy. Since UPM 1 has 1 GB of RAM memory, most of the data of a 1 GB file is accessed from memory instead of from disk. On the other hand, file sizes larger than 1 GB require accessing most of the data from disk, decreasing the performance. File sizes smaller than 1 GB take advantage of accessing everything in memory, although the cost of the initial connection is high enough to hide this benefit: the short time needed to access such a small file cannot make up for the high cost of the initial connection.

Furthermore, the results shown in Figures 11.3 and 11.4 indicate a similar improvement trend


according to the use of a higher number of nodes in both cases. Increasing the number of nodes yields a performance improvement. Nevertheless, the increase is not very meaningful, since most of the time spent in MAPFS-Grid PDAS-based accesses is taken by the network transfer instead of by the I/O access to data on disk. This is due to the overhead of the transfer protocol used by services, SOAP.

Figure 11.5: Performance obtained by PDAS to access a 12 GB file size (bandwidth in MB/s against block size, for read and write operations, using 1, 2, 4 and 8 nodes)

On the other hand, the block size is a key factor in the performance of MAPFS-Grid PDAS. Figure 11.5 shows that the block size has a clear influence on the MAPFS-Grid PDAS performance, while the improvement achieved by increasing the number of nodes is hidden due to the high influence of the transfer method. Besides, write operations are much more effective and obtain better performance than read operations. The reason is that sending SOAP data from a client to a server (corresponding to write operations) behaves much better than SOAP download transfers (read operations). Since both client and server have similar characteristics and workload, this behavior is due to the operation of the SOAP protocol.

As a result, to improve the performance of MAPFS-Grid PDAS, it is necessary to reduce the transfer time, since the data access time inside the storage element does not have a high influence on the results. Parallel techniques applied at the inter-storage element level can be used to decrease the transfer time.

Both elements, UPM 1 and UPM 2, can be used in a parallel fashion by taking advantage of the inter-storage element parallelism level, obtaining a performance improvement. Figures 11.6 and 11.7 show the comparison between the bandwidths achieved in read and write operations, respectively, on UPM 1 and UPM 2 in a parallel way by using MAPFS-Grid PDAS. In all the figures, these proposals are


called Parallel PDAS, indicating the number of nodes used by each storage element.

Figure 11.6: Performance obtained by parallel MAPFS-Grid PDAS to read a file (bandwidth in MB/s against file size, for block sizes of 512 KB, 1 MB, 2 MB and 4 MB, using 1, 2 and 4 nodes per storage element)

Like the non-parallel version of MAPFS-Grid PDAS, parallel MAPFS-Grid PDAS obtains an improvement as the file size is increased, again because of the memory hierarchy. Nevertheless, the memory size is larger in this context, because it is the sum of the memory sizes of each storage element. In this case, the memory used to store data is close to 3 GB (1 GB from UPM 1 and 2 GB from UPM 2), so 6 GB files obtain better performance. Nevertheless, 12 GB files require accessing most of the data on disk, decreasing the performance. The 100 MB file size obtains lower performance due to the high cost of the initial connection.

Regarding the number of nodes that compose each storage element, there are no meaningful differences, although Figures 11.6 and 11.7 show a slight improvement as the number of nodes increases. On the other hand, whereas the increasing trend with the block size is kept similar


Figure 11.7: Performance obtained by parallel MAPFS-Grid PDAS to write a file (bandwidth in MB/s against file size, for block sizes of 512 KB, 1 MB, 2 MB and 4 MB, using 1, 2 and 4 nodes per storage element)

to the non-parallel MAPFS-Grid PDAS version, there is a noteworthy improvement in the achieved bandwidth due to the use of the inter-storage element level of parallelism. The average improvement when using this level of parallelism is 69.28 %, with the highest improvements obtained for 512 KB blocks. For instance, increases of 55.3 % and 35.7 % are obtained when comparing Figures 11.3d and 11.4d with Figures 11.6d and 11.7d for 12 GB file read and write operations, respectively. In this case, the performance of write operations reaches the maximum performance of the SOAP protocol used in the data transfers of this kind of services. Nowadays, the maximum performance that can be achieved by SOAP is around 6-7 MB/s [SOA07], due to the different causes shown in [GMM06]. This SOAP limit constrains the improvement of the inter-cluster parallelism, but reaching it shows that MAPFS-Grid PDAS makes optimal use of the infrastructure.


11.1.2 Performance of the uniform high-performance access MAPFS-DAI

MAPFS-DAI is an extension of OGSA-DAI architecture that integrates the MAPFS file system as an accessor inside it (see Section 7.3.1). As a result, it is possible to access the data stored in


compound storage elements using MAPFS through the common interface provided by OGSA-DAI.

Figure 11.8: Comparison between fileAccess (OGSA-DAI) and MAPFS-DAI to read a file (bandwidth in MB/s against file size, for block sizes of 512 KB, 1 MB, 2 MB and 4 MB, using 1, 2, 4 and 8 nodes)

Some experiments have been run on UPM 1 to evaluate the performance of file accesses by means of the intra-cluster parallelism level provided by MAPFS-DAI, comparing it to the OGSA-DAI architecture. Figures 11.8 and 11.9 show the comparison between the performance of read and write operations, respectively, on UPM 1 by using MAPFS-DAI.

Like MAPFS-Grid PDAS, MAPFS-DAI shows a slight improvement with the parallel use of a higher number of nodes in both read and write operations. The improvement is slight because of the high influence the data transfer through the network has compared to the data access to disk. Most of the time is consumed by


Figure 11.9: Comparison between fileAccess (OGSA-DAI) and MAPFS-DAI to write a file (bandwidth in MB/s against file size, for block sizes of 512 KB, 1 MB, 2 MB and 4 MB, using 1, 2, 4 and 8 nodes)

performing the file transfer instead of accessing data, because MAPFS-DAI, like MAPFS-Grid PDAS, is based on SOAP. Furthermore, there is a meaningful overhead produced by the messages sent and the activities run to fulfil the OGSA-DAI common interface.

The comparison of Figures 11.3 and 11.4 with Figures 11.8 and 11.9, respectively, shows that the MAPFS-Grid PDAS access over a single storage element obtains better performance than MAPFS-DAI, and therefore better than OGSA-DAI compliant systems. The OGSA-DAI infrastructure, and therefore MAPFS-DAI, has lower performance due to the overhead introduced by the interoperability layer. Both MAPFS-Grid PDAS and the solutions based on OGSA-DAI are WSRF-based data services, and their performance is limited by the use of the SOAP protocol. Nevertheless, the MAPFS-DAI performance is much lower than that of MAPFS-Grid PDAS, because it is not only a protocol problem but also a message-passing problem due to the use of OGSA-DAI's common messages.

As a result, 98.7 % of the operation time, on average, is spent on the data transfer. However, as shown in Figures 11.8 and 11.9, MAPFS-DAI slightly enhances the performance of the basic


implementation of OGSA-DAI, called fileAccess. The improvement is due to the fact that the MAPFS-DAI file access is made in a parallel fashion using several nodes of a cluster, whereas OGSA-DAI accesses a single server (the master node of the cluster). This demonstrates the performance improvement of the intra-cluster


parallelism level, although the requirements of the common interface limit the enhancement.


Figure 11.10: Performance obtained by parallel MAPFS-DAI to read a file (bandwidth in MB/s against file size, for block sizes of 512 KB, 1 MB, 2 MB and 4 MB, using 1, 2 and 4 nodes per storage element)

In order to improve this low performance, a second level of parallelism has been applied using both storage elements UPM 1 and UPM 2. In all figures, these approaches are named Parallel MAPFS-DAI, indicating the number of nodes used by each storage element. Figures 11.10 and 11.11 show the bandwidths achieved in read and write operations, respectively, on UPM 1 and UPM 2 in a parallel way by using MAPFS-DAI.

Like the non-parallel version of MAPFS-DAI, the intra-cluster parallelism only yields slight differences, as can be seen in Figures 11.10 and 11.11. Nevertheless, there is a meaningful improvement due to the use of the inter-storage element parallelism. The average improvement achieved by the


Figure 11.11: Performance obtained by parallel MAPFS-DAI to write a file (bandwidth in MB/s against file size, for block sizes of 512 KB, 1 MB, 2 MB and 4 MB, using 1, 2 and 4 nodes per storage element)

inter-storage element level of parallelism compared to a single storage element is 16.81 %, obtaining the highest improvements for 512 KB block size.

As can be seen in Figure 11.12, and unlike MAPFS-Grid PDAS, the improvement is reduced as the block size is increased. Since the performance improves meaningfully with the block size, the 4 MB block size obtains good performance with inter-cluster parallelism, reaching a value near the maximum possible performance. This maximum performance of the OGSA-DAI message-passing protocol, around 750 KB/s, limits the improvement achieved by the inter-storage element parallelism. In both cases, the results show the performance improvement in the use of the OGSA-DAI-based infrastructure thanks to MAPFS-DAI. Furthermore, it is worth taking into account that different kinds of systems (such as MAPFS-DAI, fileAccess based on OGSA-DAI, CASTOR, and so on) could be used together by means of OGSA-DAI, since they provide the same interface. This feature could increase the fault tolerance if data replication is implemented within the system.


Figure 11.12: Performance obtained by parallel MAPFS-DAI to access a 12 GB file size (bandwidth in MB/s against block size, for read and write operations, comparing OGSA-DAI and parallel MAPFS-DAI with 1, 2 and 4 nodes per storage element)

11.1.3 Performance of the non-service-based high-performance access MAPFS-DSI

In Section 7.4.1, a Data Storage Interface (DSI), named MAPFS-DSI, is proposed to largely improve the performance of GridFTP. The usual Data Storage Interface provided by GridFTP, called file DSI, which only accesses the master cluster node, can be compared to the proposed MAPFS-DSI, which can access a whole cluster in a parallel way.

Some tests, shown in Figures 11.13 and 11.14, have been run on UPM 1 to measure the performance of file read and write operations using the intra-cluster parallelism level achieved by MAPFS-DSI compared to the GridFTP file DSI.

It is important to note that the time spent accessing a file in a parallel way among the nodes of the cluster is overlapped with the transfer time. In this sense, as can be seen in Figures 11.13 and 11.14, when the number of nodes increases, MAPFS-DSI provides an improvement over the GridFTP file DSI, enhancing the use of GridFTP. Only when MAPFS-DSI is used without parallelism (1 node) is its performance sometimes lower than that of the GridFTP file DSI, because of the overhead observed in MAPFS when accessing a single node. The improvement achieved by the use of several nodes in a parallel way differs according to the kind of I/O operation.

For read operations, the results shown in Figure 11.13 indicate that the improvement is limited by the behavior of the GridFTP protocol itself, used by both MAPFS-DSI and the GridFTP file DSI. In this case, GridFTP download transfers (corresponding to a read operation) only obtain bandwidths around 30-35 MB/s. As the hard disk of a single node of UPM 1 provides approximately 30 MB/s, changing block sizes and using a higher number of nodes only provides a slight improvement due to this restriction.


Figure 11.13: Comparison between MAPFS-DSI and GridFTP file DSI to read a file (bandwidth in MB/s against file size, for block sizes of 512 KB, 1 MB, 2 MB and 4 MB, using 1, 2, 4 and 8 nodes)

Furthermore, beyond a certain number of nodes there is no improvement but a decrease in performance, due to the greater coordination required among nodes. When the number of nodes is increased, the extra coordination overhead cannot be compensated by a performance improvement, because the bandwidth is limited. Since the MAPFS-DSI parallelism with 4 nodes already reaches the bandwidth limit of the GridFTP protocol for download transfers, using 8 nodes obtains worse performance because of the cost of the higher process synchronization. Regarding the file size, the high cost of the initial connection loses significance as the file size is increased, and thus the achieved bandwidth improves.

On the other hand, using a higher number of nodes in a parallel way brings a meaningful improvement for write operations. Since there is no such limitation in upload transfers, intra-cluster parallel access improves the performance as the number of nodes is increased, as can be seen in Figure 11.14. The differences are more noticeable for large file sizes. The memory hierarchy affects write accesses of files smaller than the memory size (1 GB), hindering, in part, the parallelism benefits. When the file size is increased, more data has to be written to disk and therefore the parallel access has a higher influence.

Figure 11.14: Comparison between MAPFS-DSI and GridFTP file DSI to write a file. Panels (a)-(d) plot the write bandwidth (MB/s) against file sizes of 100 MB, 1 GB, 6 GB and 12 GB for block sizes of 512 KB, 1 MB, 2 MB and 4 MB, comparing the GridFTP file DSI with MAPFS-DSI on 1, 2, 4 and 8 nodes.

However, parallelism among 8 nodes does not improve the performance for 1 MB, 2 MB and 4 MB block sizes. MAPFS has to wait for all the nodes to finish their operation before sending a new slide of information. In the case of 8-node parallelism, the MAPFS slide is 64 KB × 8 = 512 KB. Therefore, when the block size is higher than 512 KB, it is necessary to wait for the slowest node before sending the next 512 KB. As the number of nodes is high, this wait can be longer, decreasing the final performance. Furthermore, the block size has a meaningful influence on the performance of GridFTP and therefore of MAPFS-DSI. Figure 11.14 shows that, as the block size is increased, the obtained bandwidth decreases.
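A rough way to see this synchronization cost is to count how many times an 8-node transfer must wait for the slowest node inside one block; the stripe unit and node count below are taken from the text, the counting itself is only an illustrative approximation.

```python
# Rough illustration of the waits discussed above: with a 64 KB stripe unit and
# 8 nodes, one MAPFS slide covers 512 KB, so any block larger than a slide needs
# extra synchronizations with the slowest node before the next slide is sent.
STRIPE_UNIT_KB = 64
NODES = 8
SLIDE_KB = STRIPE_UNIT_KB * NODES  # 512 KB

for block_kb in (512, 1024, 2048, 4096):
    extra_waits = max(block_kb // SLIDE_KB - 1, 0)
    print(f"block of {block_kb} KB -> {extra_waits} extra wait(s) per block")
```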

Moreover, both inter-storage element and intra-cluster parallelism can be considered. A parallel GridFTP file DSI version has been built to run tests, although it can only take advantage of inter-storage element parallelism. On the other hand, MAPFS-DSI can obtain benefits from both of them.

Figures 11.15 and 11.16 show the I/O bandwidth obtained by applying both levels of parallelism on the UPM 1 and UPM 2 storage elements, using a parallel version of the GridFTP file DSI and MAPFS-DSI.

Figure 11.15: Performance obtained by parallel MAPFS-DSI to read a file. Panels (a)-(d) plot the read bandwidth (MB/s) against file sizes of 100 MB, 1 GB, 6 GB and 12 GB for block sizes of 512 KB, 1 MB, 2 MB and 4 MB, comparing the parallel GridFTP file DSI with parallel MAPFS-DSI on 1, 2 and 4 nodes.

The use of both levels of parallelism obtains noteworthy improvements, as can be seen in Figures 11.15 and 11.16. Increasing the number of storage elements and nodes means an improvement in performance. The inter-storage parallelism level also causes a great improvement in file DSI performance (parallel GridFTP file DSI), but it is still worse than that of MAPFS-DSI.

As in the non-parallel version, the bandwidth achieved differs according to the kind of I/O operation because of the behavior of the GridFTP protocol. Uploads are faster than download transfers, causing better performance for write operations, although when the block size is increased in uploads the obtained bandwidth decreases.

Figure 11.16: Performance obtained by parallel MAPFS-DSI to write a file. Panels (a)-(d) plot the write bandwidth (MB/s) against file sizes of 100 MB, 1 GB, 6 GB and 12 GB for block sizes of 512 KB, 1 MB, 2 MB and 4 MB, comparing the parallel GridFTP file DSI with parallel MAPFS-DSI on 1, 2 and 4 nodes.

Furthermore, whereas the average improvement achieved by the inter-storage element parallelism compared to a single storage element is 73.45 % for read operations, it is 88.16 % for write operations.

Finally, Figure 11.16a shows that MAPFS-DSI achieves the maximum possible network bandwidth by using the two levels of parallelism. Considering that the work environment is a real one and the network is not dedicated, the network is acting as the bottleneck of the whole system. Nevertheless, the performance obtained by the file DSI of GridFTP is limited by the I/O bandwidth of the hard disks of the cluster master node. As the predicted advances in Computer Science state that network performance will increase faster than that of the I/O system [Vil01], the latter is a much harder limitation.


11.2 Evaluation of the proposed grid monitoring environment GMonE

This section aims at evaluating the proposed grid monitoring environment GMonE. GMonE (see Section 9.2) is a key part of the autonomic system, since the system behavior is perceived through it.

GMonE is designed to provide information from the whole grid by means of an information flow similar to a client-server architecture. Clients are represented by the monitored storage elements and the server is the broker where monitored data is finally stored. This client-server design implies studying three clearly different parts of the system, representing client, environment and server, in order to evaluate GMonE:
• Data monitoring in the storage elements (clients).
• Communication between storage elements and broker (environment).
• Data recovering in the broker (server).

11.2.1 Data monitoring

An important aspect of a monitoring system is the time needed to request a value of each monitored parameter. This time refers to the response time of the MonitorAccess service (see Section 9.2.1), which is running in every storage element. Figure 11.17 shows the different monitoring times in UPM 1 of the parameters required by the autonomic system defined in Section 9.2.1.


Figure 11.17: Average time of a monitoring query

The monitoring time of each parameter depends on several factors, most of them related to the way parameters are monitored. Each parameter is monitored by a specific plug-in, causing the differences shown in Figure 11.17. The total monitoring time is 848.2 ms, which represents the workload that MonitorAccess adds in each GMonE monitoring period. For instance, if GMonE monitors data every 2 minutes, the overhead of the monitoring system on the monitored system is 0.71 %.
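The overhead figure can be reproduced with a one-line calculation; the 2-minute period is the example used above.

```python
# Back-of-the-envelope check of the monitoring overhead quoted above.
total_query_ms = 848.2        # sum of the per-parameter MonitorAccess times
period_ms = 2 * 60 * 1000     # example GMonE monitoring period of 2 minutes

print(f"overhead = {total_query_ms / period_ms:.2%}")   # ~0.71 %
```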

Furthermore, the data monitored by GMonE is more useful than the data provided by other systems, since it is aggregated (see Section 9.2.1.1), treating a compound storage element as a single machine regardless of its composition. Parameter aggregation by means of the defined functions not only makes the system behavior easier to understand but also reduces the amount of information transferred through the network, taking less than 1 ms on average when 1000 values are aggregated.

11.2.2 Communication

It is important to analyze the network overhead produced by both the monitoring services and the communication method used in a geographically scattered environment. Therefore, the communication between the MonitorAccess service in every storage element and the central monitoring service GMonEDB (see Section 9.2.2) is a key factor. The communication method used in GMonE for data transfer is the standard grid monitoring information service MDS.

The capabilities of MDS in terms of adaptability and scalability to heterogeneous and geographically distributed environments over WAN networks have been stated in [CFFK01]. Tests on performance, throughput and query response times reported in [ZFS03] for MDS-2 and in [Sch05] for MDS-4 confirm these capabilities and the low network overhead generated by this communication method.

Since GMonE uses MDS as its monitored data transfer mechanism, these results allow us to infer its scalability and low network traffic.

11.2.3 Data recovering

A monitoring system aims at providing monitored data to be processed by high-level applications. Therefore, the access time to monitored data is a key factor of a monitoring system.

GMonEAccess (see Section 9.2.3) is the part of the proposed system that is in charge of accessing the monitoring database and providing data. Different volumes of monitored data have been requested from GMonEAccess in order to evaluate the access time. Figure 11.18 shows the average access time for different volumes of required information, indicated by the request period in hours (the volume of information grows as the request period is increased). The database file size is around 300 MB.

Figure 11.18: Monitored data access time

Results show very low times when accessing the database, whatever the volume of required data. The use of a proper database index improved the access speed meaningfully, thanks to the efficient ordering of record accesses by the date of the monitored data. The average improvement achieved by the database index was 88.42 %. Furthermore, the increase in access time is not linear with respect to the increase of the request period or data volume. As a result, GMonE is suitable for requesting high volumes of monitored information. Finally, the information provided by GMonE is aggregated, reducing the later processing by high-level applications.
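The kind of date-ordered index behind this improvement can be sketched as follows; the schema is hypothetical but mirrors the five record fields described below, and SQLite is used only for illustration.

```python
# Hypothetical sketch of a date index for the monitoring database: queries by
# request period are resolved through the index instead of a full table scan.
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE monitoring (
                  parameter TEXT, element TEXT, value INTEGER,
                  date INTEGER, units TEXT)""")
db.execute("CREATE INDEX idx_monitoring_date ON monitoring(date)")

def query_period(hours):
    """Return the records monitored during the last `hours` hours."""
    since = int(time.time()) - hours * 3600
    return db.execute(
        "SELECT parameter, element, value, units FROM monitoring WHERE date >= ?",
        (since,)).fetchall()
```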

Although GMonE's database accesses provide high performance, it must be taken into account that the growth of the database size can decrease this performance. The growth rate of the database is therefore a key factor in the performance of database accesses.

GMonE has been specifically designed to obtain information about both several parameters and several grid elements. Therefore, the growth trend of GMonE's database is defined by:
• Size of each database record (SR). Each GMonE database record is composed of 5 fields:
  1. Parameter. Name in ASCII of the monitored parameter. It takes 8 bytes on average.
  2. Grid element. Name in ASCII of the grid element where MonitorAccess monitors. It takes 8 bytes on average.
  3. Value. Long value that indicates the monitored data of the specified parameter and grid element. It takes 4 bytes.
  4. Date. Long value that indicates when the monitored data was obtained. It takes 4 bytes.
  5. Units. Name in ASCII of the units in which the monitoring is made. It takes 4 bytes on average.
  Then, each record takes 30 bytes on average, SR = 30.
• Number of analyzed parameters (NP). In Section 9.2.1 the following parameters were defined to be monitored: C, TC, L, S, Ir, Iw, R, E and Lat. Then NP = 9.
• Number of monitored elements (NS).
• Monitoring period (DG). Period in minutes at which monitored data is stored in the database. For instance, if GMonE monitors data every 2 minutes, then DG = 2.
The growth trend (G) of GMonE's database in bytes per day (1440 minutes) can be expressed in the following way:

\[ G = S_R \times N_P \times N_S \times \frac{1440}{D_G} \]

Since the number of monitored elements is unknown, it is possible to calculate the growth trend (GS) of GMonE's database in bytes per day and per grid element S:

\[ G_S = S_R \times N_P \times \frac{1440}{D_G} = 30 \times 9 \times \frac{1440}{2} \approx 190\ \mathrm{KB} \]

As a result, approximately 190 KB per day and per grid element are stored in GMonE's database according to the specified values for SR, NP and DG. Although this growth is not large enough to have a meaningful influence on performance, because GMonE's database is properly indexed and accessed, if NS is high enough the access time could increase meaningfully. In that case, it would be advisable either to increase DG or to use a hierarchy of independent GMonEDB services, each one with its own database, monitoring a subset of elements and sharing the information following the broker hierarchy idea indicated in Section 8.1.3.
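The growth formula is easy to evaluate for different grid sizes; the values of SR, NP and DG are those given above, while the numbers of monitored elements are only examples.

```python
# Evaluation of the growth-trend formula above for some example grid sizes.
S_R = 30   # average record size in bytes
N_P = 9    # monitored parameters: C, TC, L, S, Ir, Iw, R, E and Lat
D_G = 2    # monitoring period in minutes

def growth_bytes_per_day(n_elements):
    """G = SR * NP * NS * 1440 / DG (bytes per day)."""
    return S_R * N_P * n_elements * 1440 / D_G

for n_s in (1, 7, 100):   # 7 matches the testbed used in Section 11.3
    print(f"NS = {n_s:3d}: {growth_bytes_per_day(n_s) / 1024:.0f} KB/day")
```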

11.3 Evaluation of the autonomic framework for grid storage GAS

In order to evaluate GAS, seven non-idle storage elements distributed over the Internet have been used, building a simple but complete grid testbed. Each storage element has a MAPFS-DSI server installed to store files, since this is the highest-performance proposed approach (see Section 11.1.3). Their geographical locations over Spain are shown in Figure 11.19. Both storage elements and client are connected through the Spanish scientific wide area network, RedIRIS.


Figure 11.19: Geographical location of the storage elements in the grid testbed

Each grid element has a different configuration:
• UPM 1: a cluster composed of 8 Intel Xeon 2.40 GHz nodes with 1 GB of RAM, connected by a Gigabit network. The hard disk of each node provides approximately 30 MB/s and its maximum network bandwidth is 15 MB/s. It is located at the Facultad de Informática of the Universidad Politécnica de Madrid (Madrid, Spain).
• UPM 2: a cluster composed of 4 Intel Xeon 3.0 GHz nodes with 2 GB of RAM, connected by a Gigabit network. The hard disk of each node provides approximately 50 MB/s and its maximum network bandwidth is 20 MB/s. It is located at the Facultad de Informática of the Universidad Politécnica de Madrid (Madrid, Spain).
• UPM 3: a cluster composed of 2 Intel Pentium IV 3.20 GHz nodes with 512 MB of RAM, connected by a Gigabit network. The hard disk of each node provides approximately 30 MB/s and its maximum network bandwidth is 5 MB/s. It is located at the Facultad de Informática of the Universidad Politécnica de Madrid (Madrid, Spain).
• URJC: an Intel Pentium IV 2.80 GHz node with 1 GB of RAM. Its maximum network bandwidth is 10 MB/s. It is located at the Escuela Superior de Ciencias Experimentales of the Universidad Rey Juan Carlos (Mostoles, Spain).
• BSC: an Intel Pentium IV 3.0 GHz node with 1 GB of RAM. Its maximum network bandwidth is 10 MB/s. It is located at the Barcelona Supercomputing Center (Barcelona, Spain).
• UCLM: a cluster composed of 8 Intel Pentium IV 3.0 GHz nodes with 2 GB of RAM. Its maximum network bandwidth is 1 MB/s. It is located at the Instituto de Investigación en Informática of the Universidad de Castilla La Mancha (Albacete, Spain).
• UM: an Intel Pentium III 1.0 GHz node with 384 MB of RAM. Its maximum network bandwidth is 1 MB/s. It is located at the Departamento de Ingeniería y Tecnología de computadores of the Universidad de Murcia (Murcia, Spain).
• Broker: the GAS broker developed in the present Ph.D. thesis runs on an Intel Xeon 3.00 GHz node with 2 GB of RAM. It is located at the Facultad de Informática of the Universidad Politécnica de Madrid (Madrid, Spain).
• Client: the client developed in this work runs on an Intel Core 2 2.13 GHz node with 1 GB of RAM, and its network bandwidth is limited to 40 MB/s. It is located at the Operating Systems Group's laboratory of the Facultad de Informática of the Universidad Politécnica de Madrid (Madrid, Spain).

Figure 11.20 shows the topology of the network that interconnects the different grid elements. The shown network bandwidths refer to the maximum values of their network interfaces; their actual values are lower because of third-party network traffic and the lack of knowledge about RedIRIS' internal topology. Furthermore, in order to understand the results, it is important to emphasize that the whole set of grid elements used in the testbed is shared with other local users.

11.3.1 Performance of the prediction model

This section analyzes the prediction model described in Section 10.2.1, in order to know the time required to make the predictions and the weight of the selected algorithms in the produced workload.

First, several predictions have been made by the broker to evaluate the complete prediction time. Every T period, a new prediction is made for each storage element and each studied parameter required in the decision making. Since the system is designed to work at long term and the aim is not to overload the system, 1 day is considered a trade-off minimum value of T. Then, the average prediction time for each grid element and parameter is analyzed.

Figure 11.20: Network topology of the grid testbed


Figure 11.21: Average time to predict a behavior parameter corresponding to a grid element

Figure 11.21 shows the average prediction time during consecutive days. As a summary, approximately 20 ms are taken to predict each parameter of each resource. As three parameters are predicted per resource, this means less than 60 ms per storage element. This low time highlights the high capacity offered by the system proposed in this work, since an ordinary machine can predict sixteen and a half resources per second, that is, 59661 per hour. Furthermore, the possibility of increasing the T period used to make the predictions underlines the scalability of the system.


With regard to the workload generated by the prediction algorithms, it is important to note that most of the prediction time is spent obtaining data, specifically in database accesses.

Table 11.1: Percentage of processing workload of the prediction
  Data obtaining   73.68 %
  Processing       26.32 %

As can be observed in Table 11.1, around three quarters of the prediction time is consumed by the database accesses needed to gather the information required to make the prediction. This indicates the importance of having fast database access. On the other hand, only around a quarter is consumed to process data, define the states and make the prediction. Moreover, the model designed in this work to make predictions (shown in Section 10.2.1) takes less than 2 ms for each parameter and storage element. This stage is clearly optimized thanks to the right selection not only of the algorithms but also of the proper initial hypotheses.

11.3.2 Evaluation of the made predictions

With the aim of understanding how well the prediction model works, it should be checked whether a set of observations collected after the prediction fits the expected distribution. Goodness-of-fit tests, and more specifically the chi-square (χ2) test, can be used to check this. The χ2 statistic calculates a discrepancy measure between the observed values Obs and the values Exp of the expected distribution in the following way:

\[ \chi^2 = \sum_{i=0}^{k} \frac{(Obs_i - Exp_i)^2}{Exp_i} \qquad (11.1) \]

In this test, both the states i and the expected frequencies Expi of every parameter p of each storage element s are calculated by means of the definition of states and the prediction phases shown in this work (see Sections 10.1 and 10.2). The predictions were made based on 10 days of historical data. These predictions are compared with the data set Obsi obtained for each parameter and grid element during the following 20 days (measured every 4 hours), classifying each value according to the previously calculated states. Expi and Obsi enable calculating the χ2 statistic. Once the χ2 statistic, shown in Formula 11.1, is calculated, it can be compared to the χ2 distribution to determine its goodness of fit and therefore its confidence level. The confidence level indicates the probability of the result being correct, that is, of the prediction being fulfilled.

Table 11.2: Expected and obtained frequencies for the prediction of I_{UPM 3}

  State                         Expected frequency   Observed frequency
  [28.05 MB/s − 38.07 MB/s)     49 %                 49 %
  [38.07 MB/s − 46.01 MB/s)      7 %                  5 %
  [46.01 MB/s − 55.38 MB/s)     11 %                 12 %
  [55.38 MB/s − 64.83 MB/s)      7 %                  5 %
  [64.83 MB/s − 73.97 MB/s)     10 %                 11 %
  [73.93 MB/s − 84.96 MB/s)      7 %                  8 %
  [84.96 MB/s − 93.09 MB/s)      9 %                 10 %

In the tests, 57.14 % of the predictions made achieve a statistically meaningful confidence level (higher than 90 %). For instance, Table 11.2 shows both the expected and the observed probability distributions for the different states of the internal read I/O bandwidth parameter Ir of the resource UPM 3. The resulting value of the chi-square statistic is χ2 = 1.59, which states that the confidence level is statistically meaningful.
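The reported value can be reproduced directly from Table 11.2 by applying Equation 11.1 to the expected and observed frequencies:

```python
# Goodness-of-fit check of Equation 11.1 using the frequencies of Table 11.2.
expected = [49, 7, 11, 7, 10, 7, 9]    # expected frequency per state (%)
observed = [49, 5, 12, 5, 11, 8, 10]   # observed frequency per state (%)

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(f"chi-square = {chi2:.2f}")      # ~1.59, as reported above
```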

An analysis of the predictions that reach a statistically meaningful confidence level yields further conclusions. The internal read I/O bandwidth parameter (Ir) has the highest probability of reaching a statistically meaningful confidence level, 71.48 %. Regarding the confidence level of the predictions for the different storage elements, no statistically meaningful differences among them are found.

11.3.3 Performance of the decision making

In order to evaluate the performance of the decision making, it is interesting to understand how much time is consumed not only by read/write operations1 but also by file create operations, since file create operations consume much more processing time because they have to calculate the most efficient location to enhance the performance of current and later I/O operations.

Table 11.3: Processing workload of decision making for read/write operations
  Data obtaining   86.6 %
  Processing       13.4 %

1 Write operations refer to writing a file that already exists in the file system.


The performance of the decision making for read/write operations depends not only on the number of parts into which the file was divided when it was created but also on the number of replicas of each part. For each replica, the broker has to obtain the corresponding current data by means of GMonE to make the decision. The average time to request this data is 30 ms per replica. Nevertheless, as can be observed in Table 11.3, the processing time to select the replicas that enhance the performance only represents a very low percentage of this time, because most of it corresponds to the cost of database accesses.

The performance of file create operations requires a deeper analysis due to its higher complexity. Thus, the weights of the different stages that this operation needs to perform are analyzed (a simplified sketch of these stages is given below). These stages or steps are the following:
1. Analysis of the active grid resources, obtaining their predictions and the historical information of each of the different analyzed parameters.
2. Obtaining the objective function and sorting the storage resources from higher to lower expected performance of current and later I/O operations, according to the predictions and the defined high-level policies.
3. Selection of the required set of resources to store the file, taking advantage of most of the client network bandwidth. Furthermore, the suitable block size to access each resource is calculated.
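The following sketch, which is not the actual GAS broker code, illustrates these three steps under simplifying assumptions: the predictions are condensed into a single expected-bandwidth score per resource, the objective function is a plain sort on that score, and block sizes are made proportional to each selected resource's share of the aggregate bandwidth.

```python
# Simplified, hypothetical sketch of the three file-create steps listed above.
def plan_file_create(predicted_bw, client_bw, base_block=2 * 1024 * 1024):
    """predicted_bw: resource name -> expected usable bandwidth (MB/s), i.e. the
    outcome of step 1. Returns the selected resources and their block sizes."""
    ranked = sorted(predicted_bw.items(), key=lambda kv: kv[1], reverse=True)  # step 2
    selected, covered = [], 0.0
    for name, bw in ranked:                                                    # step 3
        selected.append((name, bw))
        covered += bw
        if covered >= client_bw:          # enough aggregate bandwidth for the client
            break
    total = sum(bw for _, bw in selected)
    return {name: int(base_block * len(selected) * bw / total)
            for name, bw in selected}

# Example with figures loosely inspired by the testbed (purely illustrative):
print(plan_file_create({"UPM 1": 15, "UPM 2": 20, "URJC": 10, "UM": 1}, client_bw=40))
```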

Figure 11.22: Workload percentage in each decision making stage for file create operations. Legend: Step 1, obtaining current and prediction data of active resources; Step 2, optimization of the objective function; Step 3, decision making. The chart shares are 44.50 %, 37.80 % and 17.70 %.

AUTONOMIC HIGH PERFORMANCE STORAGE FOR GRID ENVIRONMENTS BASED ON LONG TERM PREDICTION

188

CHAPTER 11. EVALUATION

decision. Finally, the objective function calculation is the step that requires the least workload what states the suitable selection of the algorithms used to obtain the goodness of each resource (see Section 10.3.2.1). Table 11.4: Processing workload of decision making for file create operations Data obtaining 72.25 %

11.3.4

Processing 27.75 %

Transfer mode

The transfer mode, that is, how data blocks or slides are sent to the storage elements, can cause differences in parallelism efficiency and therefore in the final performance of I/O operations. Section 10.3.2.3 discusses two alternatives to achieve an efficient parallelism among heterogeneous elements.

The first alternative assigns different block sizes to each storage element in such a way that the transfer time of each block to each storage element is expected to be the same. This proposal aims at reducing the wait for the slowest element. The transfer operation is made by sending blocks in a parallel way following a round robin policy. On the other hand, the second proposal suggests an asynchronous sending of the optimum block size to each grid element. In this case, faster storage elements receive more blocks because they do not have to wait for slower elements.
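The difference between the two policies can be illustrated with a toy scheduling model (hypothetical per-block times, not measured data): round robin dispatches blocks in lock-step and therefore pays the slowest element every round, whereas the asynchronous mode keeps feeding whichever element becomes free.

```python
# Toy model of the two transfer modes; the per-block times are made up.
import heapq

def round_robin_time(per_block_times, n_blocks):
    """Every round sends one block to each element and waits for the slowest."""
    rounds = -(-n_blocks // len(per_block_times))   # ceil division
    return rounds * max(per_block_times)

def asynchronous_time(per_block_times, n_blocks):
    """Each element is given a new block as soon as it finishes the previous one."""
    free = [(0.0, i) for i in range(len(per_block_times))]  # (time free, element)
    heapq.heapify(free)
    for _ in range(n_blocks):
        t, i = heapq.heappop(free)
        heapq.heappush(free, (t + per_block_times[i], i))
    return max(t for t, _ in free)

times = [1.0, 1.2, 2.5]   # seconds per block on three heterogeneous elements
print(round_robin_time(times, 30), asynchronous_time(times, 30))
```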

Figure 11.23: Comparison between round robin and asynchronous transfer modes. Panels (a) and (b) plot the read and write bandwidth against file sizes of 10 MB, 100 MB and 1 GB for round robin-based and asynchronous-based access.

Some parallel tests have been made to compare the performance of both alternatives by using the same storage elements: UPM 1, UPM 2 and URJC. Figure 11.23 shows the bandwidths achieved by each proposal for read and write operations. The performance improves as the file size is increased, because the cost of the connection with the broker becomes less significant. In both operations, the asynchronous-based transfer mode provides better performance, achieving improvements of 26.15 % on average with regard to round robin-based access. Round robin-based access waits until all elements have finished their operation before sending more data. Although in the round robin-based alternative the block size is calculated so that every element takes the same time, the current resource workloads, which change dynamically, can cause faster storage elements to wait for slower elements before receiving a new request. On the other hand, no element has to wait for another in the asynchronous-based access. Thus, as the file size is increased, waits in round robin-based access can happen more often, making the improvement of using asynchronous access more noteworthy.

11.3.5 Autonomic capabilities

GAS aims at making the management of the system easy despite its high complexity. Self-configuring and self-optimizing features can help to manage complex environments. Thus, GAS provides self-configuring and self-optimizing capabilities. These features have to be evaluated to ensure their proper application.

Self-configuring can be defined as the ability of a system to configure itself according to high-level policies (see Section 4.1). Section 10.3.2.1 analyzes in depth how GAS configures the importance (WP) of every parameter (P) in order to make decisions following the high-level policies defined by system administrators. The importance is dynamically configured to improve the use and the performance of the system according to the predicted and current system behavior. Then, GAS makes decisions based on the calculated importance. Thus, the self-configuring capability is intrinsic to GAS operation.

Self-optimizing aims at enhancing the performance not only of the operations made over the system but also of the system itself. As can be observed in the next section, the decisions made by GAS are directed to improving the use and the performance of the grid I/O infrastructure, in this case MAPFS-Grid. However, there is another clear optimization aimed at enhancing the performance of the system itself. Section 10.2.1.4 discusses how this optimization is made by examining the behavior of the resources and modifying the monitoring data period (Dps) according to their stability. Changing this internal system parameter either improves the performance of the system itself or adapts the system better to changes.

Figure 11.24: Average time to make predictions (database access time and prediction time for monitoring data periods between 60 and 900 seconds)

Figure 11.24 shows the average time taken by the system to make predictions according to different values of Dps. If GAS detects a stable behavior for a specific resource s and parameter p, it increases Dps, achieving faster accesses to the database and avoiding overloads in the prediction algorithm. On the other hand, if the behavior is too variable, GAS reduces Dps to obtain more information about the resource and the parameter, adapting itself better to changes. In this case, prediction results (see Section 10.2.1.3) can change by around 16 %, approaching the resource behavior better as Dps is decreased.

System performance

In order to know the GAS influence in the whole system performance, a MAPFS-DSI client application has been used to send data to the selected resources in a parallel way. The transfer mode is asynchronous because it is the highest performance proposed approach as it is shown in Section 11.3.4. To evaluate the global behavior offered by the whole system and to perform a deep analysis, four working modes have been implemented: 1. Random mode: it selects how many and what storage resources are used for making parallel write operations in a random way. It obtains the information about where making the read operations storing where writes were made. Since write selection is random, the client does not have to request information to the broker. This implies that the client has a list of the set of resources from which it does not receive monitoring information. Thus, the client also has to decide the block sizes to access each storage resource. In order not to randomly decide these block sizes, a default size is previously configured. In this case, 2 MB is considered a reasonable block size, since the results obtained in previous analyses show that this size behaves properly in all tests. 2. Best resource mode: the a priori best resource is selected by the broker to make the I/O operations. The resource selection is made based on static characteristics of grid elements instead of dynamic workload. In this way, UPM 2 is considered as the proper selection in the work environment since it expects to provide the highest bandwidths (both network and disk I/O) and AUTONOMIC HIGH PERFORMANCE STORAGE FOR GRID ENVIRONMENTS BASED ON LONG TERM PREDICTION

Alberto S´ anchez Campos

.d o

m o

.c

Time (ms)

c u-tr a c k

C

m

w

o

.d o

w

w

w

w

w

C

lic

k

to

bu y

N O

W !

PD

W !

PD

c u-tr a c k

.c

11.3. EVALUATION OF THE AUTONOMIC FRAMEWORK FOR GRID STORAGE

191

the processor features of their nodes are good enough. The appropriate block size for effectively accessing the selected resource is calculated by the broker according to Section 10.3.2.3. 3. Decision mode: it selects storage resources according to the decision making phase explained in Section 10.3. Therefore, it has to request this information to the broker. Nevertheless, the prediction phase showed in this Ph.D. thesis is not carried out, working only with the current state of each one of the resources. Furthermore, in order to optimize the client network bandwidth, the client obtains the optimum block sizes to access every resource. 4. Prediction & decision mode: the client requests to the GAS broker that is in charge of calculating the predictions and decision making depending on not only these predictions but also current monitoring data according to Chapter 10. Just like decision mode, the client obtains the optimum block sizes to the broker. Several tests have been made to analyze the performance of I/O operations according to the client access mode and the file sizes. Due to capacity restrictions in some storage resources, these tests consist in creating and writing 10 MB, 100 MB and 1GB file size for later making consecutive read operations over the same ones each 6 hours. For read and write operations, an average value has been selected as representative time of each operation.

Figure 11.25: Average time for reading 10 MB file size (broker request time and client read access time for each mode)

Figure 11.25 shows the different average times of each client access mode for reading a 10 MB file. The random access obtains better results since the file size is very small. In the best resource, decision and prediction & decision modes, the client requests the information about where the data is from the broker, whereas in the random mode the client itself has the needed location information. The request time is the sum of the time of the connection with the broker and the time of the decision making to select the replicas that provide a currently optimum access. The request time represents approximately 20 % of the 10 MB read operation time in the broker-based modes and is therefore meaningful in their performance. The high influence of the request time, due to the small file size, means that the best option to read small files is to have the location information in the client itself, in this case, the random mode.


Figure 11.26: Average time for creating and writing 10 MB file size

Figure 11.26 analyzes the 10 MB create and write operation performance. The random access continues to be the best option on average, since the client does not have to request a decision from the broker. Nevertheless, in some random tests in which the client selected the lowest-bandwidth resources, its time was higher because of the increase in the data transfer time, which could be higher than 20 %. Concerning the broker-based modes, although a suitable resource is accessed in best resource mode, it cannot benefit from parallelism, leading to worse results than the rest of the modes. The prediction & decision mode makes a decision faster than the decision mode because the number of requests for monitoring data is lower, since it has the information aggregated in the prediction phase, whereas in the decision mode all the current data is requested from the database. Furthermore, data from the prediction phase is normalized, so the time to optimize the high-level policies is also lower because there is no atypical data.

As can be seen in Figure 11.27, the 100 MB read operation performance with a random mode client has a high average delay in the transfer time compared to Figure 11.25. This is due to the fact that this file size is considerable and the selection of non-suitable resources can meaningfully worsen the performance. This supports the selection of broker-based alternatives in this work, although the best resource mode obtains worse performance. Whereas I/O operations in random mode benefit from parallel access if several resources are randomly selected, only one resource is used in best resource mode, limiting its performance. Regarding the proposed decision-based modes, the read operation times of the decision and prediction & decision modes are very close, because for this file size there are several ideal optimum combinations in which the bandwidths are not saturated for a long time.

Figure 11.27: Average time for reading 100 MB file size

Concerning the 100 MB create and write operations, Figure 11.28 shows that the sending time for this file size differs according to the selected resources. The random mode obtains worse results than the proposed decision-based modes, although part of the time lost by its low-performance transfer is compensated by the fact that the broker is not requested. On the other hand, the selection of the most suitable resource in best resource mode does not compensate for its non-parallel access.


Figure 11.28: Average time for creating and writing 100 MB file size

As can be observed in Figure 11.29, the 1 GB read operation time in each mode is meaningfully different. A subset of the grid testbed resources has considerably higher bandwidth, making their intervention determinant. The random mode client is clearly penalized because it can work with both the best and the worst storage resources. On the other hand, the proposed decision-based mode clients have the use of ideal resources guaranteed, because the decisions are aimed at obtaining high performance, implying a clear improvement. Thus, although the random mode client does not contact the broker, eliminating that connection, it takes around 38 % more time than the decision mode client and around 46 % more time than the prediction & decision mode client.


Figure 11.29: Average time for reading 1 GB file size

The difference between the 1 GB read operation performance of both decision-based modes is due to the fact that the prediction enables accessing the data in a more constant way in the long term, if the predictions made are fulfilled. The decision mode client suffers from the grid changes because its write operations did not take into account the prediction of the grid behavior. It can be observed in Figure 11.29 that the prediction & decision mode client obtains an improvement of around 10 % with regard to the decision mode client.

On the other hand, as for the 10 MB and 100 MB file accesses, the best resource mode obtains the worst performance since the client accesses only one grid element. The average improvement of using the parallelism provided by the decision mode client vs. the non-parallel access of the best resource mode for 1 GB read operations is around 143 %, whereas the use of the prediction & decision mode client involves an enhancement of around 175 %.

Figure 11.30 shows the results obtained when creating and writing a 1 GB file, which are very similar to those obtained when reading this file size. In this case, the overhead introduced by requesting the broker is not meaningful due to the long time necessary for the file transfers and the use of parallelism, which involves a clear enhancement. Therefore, a very important improvement is obtained by the decision-based modes for large file sizes. In write operations, the prediction & decision mode client is faster than the decision mode client, due to the fact that the former works with elaborated information, the predictions, which allow the broker to make a faster decision. On the other hand, the decision mode client works with the most current information, which can be atypical, increasing the solving time of the decision making phase.

Figure 11.30: Average time for creating and writing 1 GB file size

Figure 11.31: Bandwidth improvement according to file size. Panels (a) and (b) plot the read and write bandwidth (MB/s) against file sizes of 10 MB, 100 MB and 1 GB for the random, best resource, decision and prediction & decision modes.

Figure 11.31 shows the trend of bandwidth improvement according to file size and client working mode for both read and write operations. When the file size is increased, an increment in performance is obtained because of the high cost of the initial connection and, for broker-based modes, the request time. The performance is meaningfully improved by using the decision-based modes thanks to the decision phase proposed in this work. Specifically, the prediction & decision mode, corresponding to the whole proposal of this Ph.D. thesis, obtains the highest bandwidths for both read and write operations. The transfer rates of the read and write operations exceed 25 MB/s and 20 MB/s, respectively, which are higher than the theoretical network transfer rate of any single storage element of the grid testbed. This and the high improvement of parallel accesses with regard to the non-parallel best resource mode demonstrate the benefits of the approach of this Ph.D. thesis. As a result, the application of parallelism and autonomic capabilities to data grids is a key factor for enhancing I/O access performance in these environments.


Part III

CONCLUSIONS AND FUTURE WORK


Chapter 12

CONCLUSIONS AND FUTURE RESEARCH LINES

The current Ph.D. thesis has properly fulfilled all the primary objectives indicated in Section 1.3. This work has mainly contributed both to the improvement of the I/O phase of data-intensive applications in grid environments and to the manageability of their complexity. This contribution has been materialized in the design and complete implementation of both the high performance data grid architecture MAPFS-Grid and the autonomic storage system GAS based on long term prediction.

The next sections summarize the contributions, emphasizing the achievement of the established hypotheses.

12.1 Contributions in data grid field

Since its birth, Computer Science has been creating tools to deal with the challenges of modern society. Physical and theoretical resources are being developed, providing insight on scientific and everyday problems. As the computer science tools improve, the needs of those who use them also increase. A set of issues has been defined as grand challenge applications (GCAs), which cannot be solved using current computing techniques. Innovative approaches must be applied to address GCAs.

Grid computing has emerged as a novel environment where GCAs can be faced. Grid adaptability makes it an ideal context to deploy these applications. Nevertheless, the performance of data-intensive GCAs is limited by the I/O system, since the I/O phase constitutes the bottleneck of most computing systems. Parallel file systems have arisen as an alternative to face this problem, known as the I/O crisis, in traditional infrastructures like cluster computing.

The construction of the high-performance storage system MAPFS-Grid in this work makes it possible to face the I/O crisis in data grid environments. To achieve this purpose, two areas that have been independently exploited, parallel I/O and grid technology, are fused in MAPFS-Grid to reinforce their advantages. MAPFS-Grid takes advantage of the two levels of parallelism existing in grid environments:
1. The first level provides parallelism among all the storage resources of the grid.
2. The second level improves I/O operations in those compound storage resources corresponding to clusters by means of parallelism among their internal nodes.
MAPFS-Grid provides a suite of three components, MAPFS-Grid PDAS, MAPFS-DAI and MAPFS-DSI, for efficiently accessing large amounts of data in grid environments. A deep study of the usual storage methods in data grids has been made to optimize the I/O performance depending on the technique used to access data. Every component is focused on a different aspect related to accessing data in grids, in order to face the different needs that arise in data management. All of them take advantage of the two levels of parallelism. These approaches have been evaluated in Section 11.1, demonstrating that their efficient use in data grids is feasible due to the improvement of the I/O throughput by means of the two levels of parallelism.

Each MAPFS-Grid component fits better depending on the access method or data scenario. MAPFS-DSI is useful for performance-critical applications, since the obtained performance is optimal and it is only limited by the network bandwidth. MAPFS-Grid PDAS and MAPFS-DAI are appropriate for service-oriented applications, which usually read and write parts of remote files directly. Whereas MAPFS-Grid PDAS offers better performance, MAPFS-DAI provides interoperability. Therefore, in order to obtain a high number of accessible resources (e.g. to achieve fault tolerance or data replication), MAPFS-DAI should be used. On the other hand, MAPFS-Grid PDAS is suitable for efficient service-oriented applications.

As a summary, this Ph.D. thesis lays the foundations of a high performance parallel grid storage system whose performance is only limited by the features of the transfer methods, such as the transfer protocol or the network bandwidth. These technological limitations restrict the benefits achieved by this work. Nevertheless, upcoming advances in network technologies will make the performance enhancement provided by this proposal even higher. Some advantages of the use of the proposed high performance infrastructure are:
• Availability of a huge shared storage capacity provided by several heterogeneous and geographically scattered resources. Besides, parallelism improves the use of the elements that make up the grid environment.
• Improvement of the I/O phase performance, tackling the I/O crisis in data grid environments. As a result, MAPFS-Grid is especially beneficial for HPC grid data-intensive applications.
• Flexibility of use. The different functionalities of each MAPFS-Grid component make the adaptation to any kind of data access scenario possible.

12.2 Contributions in grid management field

Although grid computing allows creating infrastructures to solve GCAs, its own characteristics imply a highly complex distributed environment. Dealing with this complexity becomes a top priority, in order to manage the environment.

Several reasons are responsible for the high complexity of grids. As a grid is a set of geographically distributed and heterogeneous resources that belong to different virtual organizations, the coordinated use of these non-dedicated resources (which are not subject to a centralized control) can cause not only management but also security and reliability issues. Therefore, the cause of grid complexity is twofold:
1. Architectural environment characteristics, such as the use of different hardware elements, including interconnection networks, computing and storage resources.
2. Non-architectural aspects, such as fault tolerance, security, asynchronism, scalability and decentralization.
Obtaining proper information about grid behavior is the first step for its management. There are several solutions designed for distributed system monitoring. Nevertheless, most of them provide detailed information about the whole set of elements that compose the environment. The large amount of information obtained in very complex environments makes system understanding difficult. Since grid environments are focused on resources and services, the monitoring system should provide relevant information from this point of view instead of non-relevant details about the internal composition of every grid element.


This work contributes to the grid monitoring field by efficiently adjusting distributed monitoring techniques to highly complex environments, like a grid. The key feature is data aggregation, which allows monitoring each storage resource as a single entity, although it is composed of several nodes. As stated in Section 11.2, the benefits of data aggregation are mainly twofold:
• Simplification of the understanding of the system behavior.
• Reduction of the quantity of transferred information.
The definition and implementation of the monitoring system GMonE, specifically adapted to the grid, provide relevant past and current data about system operation. Monitored data can be analyzed to manage the whole infrastructure.

In grid management, selecting the appropriate elements that provide high-performance data access while reducing system complexity is a challenging area. The construction of the autonomic storage framework GAS means an innovative advance in the state of the art of the grid management field to solve this problem. The performance of data-intensive applications is more related to read operation behavior, since data is usually read more times than it is written [RTC05]. Therefore, as data is not usually required at the same time at which it is produced, enhancing the performance of data-intensive applications involves improving the later read accesses rather than the current write accesses. Therefore, storage elements should be chosen depending not only on the current values of the parameters but also on predictions about them. Predictions must be made at long term because it is unknown when the future I/O operations will be made.

Following this idea, the GAS definition contributes a formal study of all the phases required to provide autonomic capabilities to reduce the complexity and manage the system with the aim of improving data accesses: self-managing capabilities, long term prediction and decision making. As a result, the proposed autonomic high performance grid storage system provides self-configuring and self-optimizing capabilities. On the one hand, GAS allows the system to adapt its behavior to environment changes, making suitable decisions according to the expected behavior of the resources. The decisions made are directed to improving both current and later I/O operations. On the other hand, it adjusts its own internal parameters to improve its performance and later decision making processes.

As demonstrated in Section 11.3, the use of the proposed autonomic storage framework provides several benefits, such as:
• Coordination and I/O operation management of all involved processes.
• Enhancement of read operation performance.
• Improvement of write operation performance, although it is not optimized since write operations aim at enhancing read operations.
• Proper workload distribution among the whole set of grid elements, seeking maximum availability of all of them.
• Autonomic improvement of the performance of GAS itself.
An exhaustive and incremental evaluation has been made in Section 11.3.6, concluding that using MAPFS-Grid and GAS together largely improves the performance of I/O operations.

In summary, the advances in the grid management field obtained in this work contribute to making grid technology easier to manage. This facilitates the acceptance of grid computing by final users and developers, which can help to turn the grid into reality.

12.3 Benefits derived from the development of this Ph.D. thesis

Grid technology makes it possible to tackle applications that cannot be solved on a single computer or cluster due to their high computing power and/or data management requirements. Several application fields, such as life sciences, physics, visualization or weather modeling, use this new paradigm to tackle their problems.

Most of these applications use systems that are in charge of information exploitation by means of data mining techniques or other information processing algorithms. All these systems require efficient information retrieval modules with high storage capacity. The high-performance grid storage system proposed in this work can be suitably adapted to this type of infrastructure. For instance, the Data Mining Grid Architecture (DMGA) [SPP+ 04, PSR+ 07] is a flexible architecture, based on the parallelism ideas proposed in this work, for providing data mining services in grid environments. An implementation of the DMGA architecture, named WekaG [PSH+ 05], has also been developed. This implementation is based on Weka [WF02], a well-known tool for developing machine-learning algorithms.

DMGA presents two different composition models: (i) horizontal composition, which offers workflow capabilities, and (ii) vertical composition, which increases the performance of inherently parallel data mining services. Vertical composition refers to the combined and parallel use of several services that provide the same functionality. This scheme is especially significant for those services that access a large volume of data, which can be distributed across diverse locations.

By using vertical composition, it is possible to improve the performance of a data mining process in a similar way to the first level of MAPFS-Grid's parallelism. In this case, each service that forms part of a vertical composition accesses a different portion of the same data set in parallel. Then, the results from every service are combined to compute the answer, which is transferred to the client. Therefore, in order to use MAPFS-Grid as the information retrieval system of DMGA, it is only necessary to modify MAPFS-Grid so that it divides the data according to the partitioning that best suits each specific problem.
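The following sketch illustrates the vertical composition idea described above: a data set is split into portions, each portion is processed by a service instance in parallel, and the partial results are combined. The function names and the thread-based execution are illustrative assumptions, not DMGA's or WekaG's actual interfaces.

    # Illustrative sketch of vertical composition: N equivalent services each
    # process a different portion of the data set in parallel, then results merge.
    from concurrent.futures import ThreadPoolExecutor

    def partition(data, n):
        size = (len(data) + n - 1) // n
        return [data[i:i + size] for i in range(0, len(data), size)]

    def service_instance(portion):
        # Stand-in for a data mining service working on its data portion.
        return sum(portion)

    def vertical_composition(data, n_services=4):
        with ThreadPoolExecutor(max_workers=n_services) as pool:
            partial = list(pool.map(service_instance, partition(data, n_services)))
        return sum(partial)  # combine partial answers before returning to the client

    print(vertical_composition(list(range(1000))))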

Furthermore, it would be advisable to have global knowledge of the whole grid behavior. A modification of the proposed broker GAS can be used to provide a complete and global view of the services and resources. In this case, GAS might separate data into subsets and distribute them depending on criteria such as data proximity, available bandwidth or CPU availability.

In general, not only data mining applications but any data-intensive grid application can benefit from the development of this Ph.D. thesis, since this work shows mechanisms that meaningfully improve the I/O phase in data grids. For instance, within the Departamento de Arquitectura y Sistemas Informáticos of the Universidad Politécnica de Madrid there is an established group in the data mining field. This research group could directly use the I/O platform obtained in this work as the storage infrastructure for its applications, improving both the data retrieval phase and information availability. The Barcelona Supercomputing Center is planning to use the results of this work to improve the use of its storage resources. Besides, the results can be extrapolated to the future Spanish Supercomputing Network in order to use the storage resources of all the network nodes in a global and transparent way. Additionally, some data analysis centers, like the National Research Council in Canada, could integrate the proposal into their data analysis phase. Due to the great volume of information that they handle, the use of the new infrastructure would represent a very worthwhile improvement.

As a result, the offer can be extended to the whole data mining and data processing community, allowing it to use the I/O platform to increase the performance of its data-intensive applications.

12.4 Future research lines

As analyzed in this chapter, the objectives stated at the beginning of this Ph.D. thesis have been satisfactorily reached. Furthermore, this work opens new research lines, presented in the following sections, that extend the achieved aims. These future lines are oriented both to increasing the functionality of the proposed solution and to developing new aspects related to it.

12.4.1 Time dependent predictions

The specific moment at which each of the several and heterogeneous grid users accesses data is a priori unknown. Therefore, the decision making phase of this work is based on long term prediction. The aim is to improve future I/O accesses as a whole instead of concentrating on a given moment corresponding to the expected time of an I/O request. Nevertheless, if the user access pattern is specified in advance, the decision making could be based on time dependent predictions aimed at enhancing data accesses at the specific point of time when a request is expected. Although user access patterns are very hard to obtain by their very nature (users do not usually know when they are going to access their own data), it would be interesting to analyze whether this knowledge leads to more accurate decisions (schedulers and batch queues could provide some information about the jobs that are going to run in the system).

[Pér03] defines I/O profiles as a set of structures aimed at specifying the I/O demands and access patterns of a certain application. These control structures are translated into hints in order to increase I/O performance. The same idea could be incorporated into the proposed autonomic framework to indicate when future I/O operations will be requested.

This information would make time dependent predictions useful for making decisions. As stated in Section 5.2, time series can be analyzed and predicted by means of different kinds of techniques, such as neural networks, fuzzy systems and evolutionary algorithms. The aim is to discover a model that represents the time series. [Val02] presents a strategy based on similarity, neuro-fuzzy networks and evolutionary algorithms for modeling time series. In this case, a family of models is explored by means of evolutionary algorithms. The quality of the model that represents the functional relationship, together with the prediction error, is obtained by means of a similarity-based neuro-fuzzy network. This approach also enables the identification of drastic changes in system behavior [VBC06], modifying the prediction model.
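As a much simpler stand-in for the neuro-fuzzy and evolutionary techniques cited above, the sketch below forecasts a monitored parameter a fixed number of steps ahead with double (Holt) exponential smoothing. The sample values and smoothing constants are invented for illustration, and the method is not the one proposed in [Val02] nor one used in this thesis.

    # Minimal illustration of a time dependent prediction: Holt's double
    # exponential smoothing forecasting h steps ahead. Not the thesis' method.
    def holt_forecast(series, h, alpha=0.5, beta=0.3):
        level, trend = series[0], series[1] - series[0]
        for y in series[1:]:
            prev_level = level
            level = alpha * y + (1 - alpha) * (level + trend)
            trend = beta * (level - prev_level) + (1 - beta) * trend
        return level + h * trend  # expected value h steps in the future

    observed_bandwidth = [80, 78, 75, 74, 70, 69, 65]   # invented samples
    print(holt_forecast(observed_bandwidth, h=3))        # forecast 3 steps ahead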

Time dependent prediction-based decisions could then be analyzed, including their accuracy and the resulting I/O performance improvement.

12.4.2 Peer-to-peer-based brokering

A single system broker, which is in charge of collecting monitored information and making suitable decisions, was selected in this work for managing grid environments. However, a single broker could become a system bottleneck and a single point of failure due to the large number of grid resources that may be involved.


A peer-to-peer protocol [MKL+ 02, FI03] might be used to address this problem. Instead of having clients request a single broker and the broker monitor the grid resources, each resource could play the roles of client (consumer), server (producer) and broker. In this way, each resource shares information about the files that it stores, rather than sharing the files themselves by means of peer-to-peer software like Freenet [CSWH01]. This approach would help to solve the bottleneck problem and make the system more robust. Nevertheless, it introduces problems related to information routing.

12.4.3 Self-healing

Self-healing is a feature of autonomic computing systems aimed at detecting and solving system problems or failures. In a data grid, this capability has special relevance because data access has to be ensured at all times. Incorporating this capability into the proposed autonomic framework could increase the guarantee of data grid operation, ensuring that stored information is available or easily recoverable in case of failure.

As can be seen in Section 8.1.3, data replication is a key factor in ensuring the availability of data access services in data grids, because of the heterogeneity and decentralization characteristics of these environments. The main difficulty of data replication in grid environments is choosing the set of replication policies that governs the system. The inherent variability of the environment makes it difficult to choose a guideline that optimizes the process. Moreover, environment conditions can change suddenly, worsening performance. Therefore, it would be advisable to use a self-adaptive approach that allows the system to overcome grid changes by readjusting its replication policies.

Thus, it would be interesting for the autonomic system to have knowledge of the different data replication policies in a grid environment and of the circumstances under which each of them yields optimal results. This knowledge first requires an in-depth study of data replication techniques and of how they can be properly applied to data grids, since concepts like failure have to be redefined or, at least, completed.

The self-healing system should choose and activate the replication policy that is most beneficial by analyzing the data monitored by GMonE (new monitoring parameters required for data replication can be easily included in GMonE thanks to its high flexibility). Furthermore, the dynamic features of the environment can cause some predictions not to fit the future behavior. As future work, it is planned that the self-healing system will replicate data to better locations at specific points in time if the expected results do not fit the real system behavior.
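A toy sketch of the kind of policy selection discussed above is shown next: replication policies are ranked against the latest aggregated monitoring values. The policy names, the metric fields and the scoring rules are purely illustrative assumptions and do not correspond to GMonE's parameters or to any policy set defined in this thesis.

    # Purely illustrative: pick the replication policy whose heuristic score is
    # highest for the current monitored conditions. Names and rules are invented.
    def score_policy(policy, metrics):
        if policy == "aggressive-replication":
            return metrics["read_rate"] - 2.0 * metrics["storage_pressure"]
        if policy == "lazy-replication":
            return metrics["storage_pressure"] - metrics["read_rate"]
        return 0.0  # "no-replication" baseline

    def choose_policy(metrics, policies=("aggressive-replication",
                                          "lazy-replication",
                                          "no-replication")):
        return max(policies, key=lambda p: score_policy(p, metrics))

    monitored = {"read_rate": 0.9, "storage_pressure": 0.2}  # invented values
    print(choose_policy(monitored))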


Appendix A

MARKOV CHAIN RESOLUTION

Markov chains [Mar06, Mar71] represent a system that changes its state through time. A Markov chain can be represented as a transition matrix. The elements of the matrix represent the state change probability. This matrix is square and it has as many rows and columns as states in the system.

      

$$
P = \begin{pmatrix}
p_{11} & p_{12} & p_{13} & \cdots & p_{1n} \\
p_{21} & p_{22} & p_{23} & \cdots & p_{2n} \\
p_{31} & p_{32} & p_{33} & \cdots & p_{3n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
p_{n1} & p_{n2} & p_{n3} & \cdots & p_{nn}
\end{pmatrix}
$$

As the system evolves to some possible states, the transition probabilities will fulfill the following property:

$$\sum_{j=1}^{n} p_{ij} = 1, \qquad p_{ij} \ge 0$$

Modeling a system using Markov chains allows us to make long term predictions due to the properties of the stationary matrix. The stationary probability is defined as the probability that the system stays in a certain state after a high number of transitions. After a large number of steps, the transition probabilities between states reach a limit value called the stationary probability. This is possible because the transition probabilities between states become stable in the distant future, and therefore it is possible to study the transition probabilities in $k$ steps:


$$p\{E_{t+k} = j \mid E_t = i\} = p\{E_k = j \mid E_0 = i\} = p_{ij}^{(k)} \tag{A.1}$$

$$p_{ij}^{(k)} = \sum_{e=1}^{n} p_{ie}^{(m)} \cdot p_{ej}^{(k-m)} \tag{A.2}$$

$$p_{ij}^{(k)} = \sum_{e=1}^{n} p_{ie} \cdot p_{ej}^{(k-1)} \tag{A.3}$$

$$p_{ij}^{(k)} = \sum_{e=1}^{n} p_{ie}^{(k-1)} \cdot p_{ej} \tag{A.4}$$

The last two equations, A.3 and A.4, are called the Chapman-Kolmogorov equations. They express the transition probabilities in step $k$ in terms of the transition probabilities in step $(k-1)$. The equations state that the $P^{(k)}$ transition matrices can be obtained as powers of the matrix $P$:

$$P^{(2)} = P \cdot P = P^2$$
$$P^{(3)} = P^{(2)} \cdot P = P \cdot P^2 = P^2 \cdot P = P^3$$
$$P^{(k)} = P^{(k-1)} \cdot P = P \cdot P^{k-1} = P^{k-1} \cdot P = P^k$$

The successive powers of $P$ indicate the transition probabilities in as many steps as the index of the power. $P^1$ represents the probability of a single transition and $P^0 = I$ is the probability in zero transitions (if no transition is made, the state remains the same and therefore this matrix is the identity matrix). Since the powers of $P$ define the probabilities for any number of transitions, it is possible to study the stationary probabilities by analyzing what happens when high powers of the matrix $P$ are calculated.
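The following short numpy sketch, with an invented 3-state transition matrix, illustrates the behavior just described: successive powers of P converge towards a matrix whose rows become identical and equal to the stationary distribution.

    # Illustration of stationary behavior through matrix powers (invented matrix).
    import numpy as np

    P = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.3, 0.5]])
    assert np.allclose(P.sum(axis=1), 1.0)   # each row sums to 1

    for k in (1, 4, 16, 64):
        Pk = np.linalg.matrix_power(P, k)
        print(k, Pk[0], Pk[1])               # rows become identical as k grows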

The stationary probability can be quite complex to calculate, depending on the chain, since there are different kinds of Markov chains. These can be classified according to two characteristics: regularity and ergodicity. According to its regularity, a chain can be regular or semi-regular. A Markov chain is regular if none of the stationary probabilities depend on the initial state; that is, all the probabilities tend to a limit value when the number of transitions increases and there is only one final set of states. As opposed to regular chains, in semi-regular chains the stationary probabilities depend on the initial state.

The ergodicity concept is related to the long term behavior of the system. According to its ergodicity, a Markov chain can be ergodic, semi-ergodic or non-ergodic. A Markov chain is ergodic if all the states can occur in the long term; thus, its behavior, that is, the transitions between states, does not vary through time. In a semi-ergodic chain, there are transitory states that can disappear in the long term. Therefore, the long term behavior differs from the short term behavior. In a non-ergodic chain, the final behavior depends on the initial situation; that is, the stationary probability of each state depends on this initial situation.

Although ergodicity reveals some interesting features of Markov chains, regularity is more relevant because it defines how the system evolves. Thus, there is a different mathematical method to solve each kind of Markov chain, regular or semi-regular. Likewise, there are different resolution methods to deal with the different kinds of ergodicity in Markov chains, but these methods follow the same philosophy: they solve the Markov chain in a similar way, distinguishing whether the chain is regular or not.

If the system is modeled as a regular chain, the initial state information is lost after a high number of transitions. This is the reason why the probability that the system stays in a specific state does not depend on the initial state; in other words, the system behavior tends to become stable. All the rows of the stationary matrix $P^*$ of a regular chain will be equal, adopting the form:

$$
P^* = \begin{pmatrix}
\pi_1 & \pi_2 & \pi_3 & \cdots & \pi_n \\
\pi_1 & \pi_2 & \pi_3 & \cdots & \pi_n \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\pi_1 & \pi_2 & \pi_3 & \cdots & \pi_n
\end{pmatrix}
$$

$\pi_1, \pi_2, \pi_3, \ldots, \pi_n$ are the entries of the stationary matrix $P^*$ and indicate the probability of being in each state in the distant future. The matrix $P^*$ fulfills the following property:

$$P^* \cdot P = P \cdot P^* = P^*$$

Thus, the following set of equations is obtained:

$$
\begin{cases}
p_{11}\,\pi_1 + p_{21}\,\pi_2 + \ldots + p_{n1}\,\pi_n = \pi_1 \\
p_{12}\,\pi_1 + p_{22}\,\pi_2 + \ldots + p_{n2}\,\pi_n = \pi_2 \\
\qquad\vdots \\
p_{1n}\,\pi_1 + p_{2n}\,\pi_2 + \ldots + p_{nn}\,\pi_n = \pi_n
\end{cases}
\tag{A.5}
$$

Unfortunately, although the system has $n$ equations and $n$ unknowns, it is underdetermined. In order to overcome this, it is necessary to add the stochastic condition:

$$\sum_{i=1}^{n} \pi_i = 1 \tag{A.6}$$

By including equation A.6 and deleting one equation of the system A.5 (which has the form $P^T \cdot X = X$), a determined system of equations is obtained that can be solved by means of a standard resolution method. Before solving it, the system must be transformed into the usual format ($A \cdot X = B$):


$$P^T \cdot X = X \iff A \cdot X = B$$

$$P^T \cdot X = X \implies P^T \cdot X - X = 0 \implies (P^T - I) \cdot X = 0$$

Then, the following statements are obtained:

$$A = P^T - I \tag{A.7}$$

$$B = 0 \tag{A.8}$$

There are two families of resolution algorithms to solve the system of equations (see the sketch after this list):

• Direct methods obtain the exact result of the system, but they are computationally expensive. Gauss-Jordan elimination [Jor73, Gil03] is the classical direct method used in the state of the art, although other methods, like matrix decompositions, can be classified as direct as well.

• Approximate methods work from an initial vector, reducing the error in each iteration. As a drawback, they usually do not offer the exact result. There are different methods, such as Jacobi [You71], Gauss-Seidel [Kah58] and Successive OverRelaxation (SOR) [Fra50, You50]. Whereas Gauss-Seidel is analogous to the Jacobi method, SOR refines both.
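As an illustration of the direct approach, the following numpy sketch builds A = P^T - I as in A.7, replaces one equation by the normalization condition A.6, and solves the resulting determined system. The 3-state matrix is invented for the example.

    # Direct computation of the stationary probabilities (illustrative matrix).
    import numpy as np

    P = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.3, 0.5]])
    n = P.shape[0]

    A = P.T - np.eye(n)          # A = P^T - I   (equation A.7)
    b = np.zeros(n)              # B = 0         (equation A.8)
    A[-1, :] = 1.0               # replace one equation by sum(pi) = 1 (A.6)
    b[-1] = 1.0

    pi = np.linalg.solve(A, b)
    print(pi, pi.sum())          # stationary distribution, sums to 1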

Table A.1: Successive OverRelaxation algorithm

k = 1;
do {
    for (i = 1; i <= n; i++) {
        σ = 0;
        for (j = 1; j < i; j++) {
            σ = σ + a[i][j] * x[k][j];
        }
        for (j = i + 1; j <= n; j++) {
            σ = σ + a[i][j] * x[k-1][j];
        }
        σ = (b[i] - σ) / a[i][i];
        x[k][i] = x[k-1][i] + ω * (σ - x[k-1][i]);
    }
    k = k + 1;
} while (k < maxIter && error(x) > minError);


The SOR algorithm pseudocode is shown in Table A.1. a[i][j] is the element $A_{ij}$ of the matrix $A$ defined in A.7. x[k][j] is element $j$ of the solution vector in iteration $k$. As can be seen in A.8, the matrix $B$ is 0 and therefore b[i] is 0. Finally, ω is an adjustment factor of the algorithm that varies the convergence speed; in principle its value should be between 0.5 and 1.
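For reference, a runnable Python version of the Table A.1 pseudocode is sketched below, applied to the same illustrative 3-state matrix. The relaxation factor, tolerance and iteration limit are arbitrary example values.

    # Runnable sketch of the SOR iteration from Table A.1 (illustrative values).
    import numpy as np

    def sor(A, b, omega=0.8, max_iter=500, min_error=1e-10):
        n = len(b)
        x = np.full(n, 1.0 / n)                  # initial guess
        for _ in range(max_iter):
            x_old = x.copy()
            for i in range(n):
                sigma = A[i, :i] @ x[:i] + A[i, i+1:] @ x_old[i+1:]
                sigma = (b[i] - sigma) / A[i, i]
                x[i] = x_old[i] + omega * (sigma - x_old[i])
            if np.linalg.norm(x - x_old) < min_error:
                break
        return x

    P = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.3, 0.5]])
    A = P.T - np.eye(3)
    b = np.zeros(3)
    A[-1, :] = 1.0                                # add the normalization equation
    b[-1] = 1.0
    print(sor(A, b))                              # matches the direct solution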


Bibliography

[AAB+ 05]

M. Antonioletti, M. Atkinson, R. Baxter, A. Borley, N. Chue Hong, P. Dantressangle, A. Hume, M. Jackson, A. Krause, S. Laws, P. Parson, N. Paton, J. Schopf, T. Sugden, P. Watson, and D. Vyvyan. OGSA-DAI status and benchmarks. In Proceeedings of UK e-Science All Hands Meeting, 2005.

[AAL+ 05]

M. Antonioletti, M. Atkinson, S. Laws, S. Malaika, N. Paton, D. Pearson, and G. Riccardi. Web Services Data Access and Integration (WS-DAI). 13th Global Grid Forum, 2005.

[AB84]

M. Aldenferder and R. Blashfield. Cluster Analysis. Sage Publications, 1984.

[ABB+ 02a]

B. Allcock, J. Bester, J. Bresnahan, A. L. Chervenak, I. Foster, C. Kesselman, S. Meder, V. Nefedova, D. Quesnal, and S. Tuecke. Data management and transfer in high performance computational grid environments. Parallel Computing Journal, 28(5):749–771, May 2002.

[ABB+ 02b]

W. Allcock, J. Bester, J. Bresnahan, A. Chervenak, L. Liming, S. Meder, and S. Tuecke. GridFTP Protocol Specification, September 2002.

[ABKL05]

W. Allcock, J. Bresnahan, R. Kettimuthu, and M. Link. The Globus Striped GridFTP Framework and Server. In Proceedings of the 2005 ACM/IEEE conference on Supercomputing (SC ’05), page 54, Washington, DC, USA, 2005. IEEE Computer Society.

[ACK+ 03]

M. Atkinson, A. L. Chervenak, P. Kunszt, I. Narang, N. W. Paton, D. Pearson, A. Shashoni, and P. Watson. Data Access, Integration and Management, chapter 22, pages 391–429. In Foster and Kesselman [FK04], December 2003.

[ACo]

Welcome to Autonomic computing.org, accessed Jun 2007 [Online]. Available: http://www.autonomiccomputing.org.

[AdI]

Adaptive Infrastructure, accessed Apr 2007 [Online]. Available: http://h71028.www7.hp.com/enterprise/cache/342611-0-0-0-121.html.

[AF98]

M. J. Atallah and S. Fox, editors. Algorithms and Theory of Computation Handbook. CRC Press Inc., Boca Raton, FL, USA, 1998.


[AGr]

AstroGrid, accessed Jun 2006 [Online]. Available: http://www.astrogrid.org.

[AKA+ 05]

M. Atkinson, K. Karasavvas, M. Antonioletti, R. Baxter, A. Borley, N. Chue Hong, A. Hume, M. Jackson, A. Krause, S. Laws, N. Paton, J. Schopf, T. Sugden, K. Tourlas, and P. Watson. A new architecture for OGSA-DAI. Proceeedings of UK e-Science All Hands Meeting, 2005.

[ARD+ 05]

M. Atkinson, D. De Roure, A. N. Dunlop, G. Fox, P. Henderson, A. J. Hey, N. W. Paton, S. Newhouse, S. Parastatidis, A. E. Trefethen, P. Watson, and J. Webber. Web service grids: an evolutionary approach. Concurrency - Practice and Experience, 17(2-4):377–389, 2005.

[Ave02]

P. Avery. Data grids: a new computational infrastructure for data-intensive science. Philosophical Transactions of the Royal Society of London Series a-Mathematical Physical and Engineering Sciences, 360(1795):1191–1209, 2002.

[BAC03]

An architectural blueprint for autonomic computing. IBM and autonomic computing, April 2003.

[Bay63]

T. Bayes. An essay towards solving a problem in the doctrine of chances. Philosphical Transactions of the Royal Society, 53:370–418, 1763.

[BBL02]

M. Baker, R. Buyya, and D. Laforenza. Grids and grid technologies for wide-area distributed computing. Software-Practice and Experience, 32(15):1437–1466, 2002.

[BBM+ 03]

A. Bassi, M. Beck, T. Moore, J. S. Plank, M. Swany, R. Wolski, and G. Fagg. The Internet Backplane Protocol: a study in resource sharing. Future Generation Computer Systems, 19(4):551–562, 2003.

[BCC+ 02]

R. Byrom, B. Coghlan, A. Cooke, R.Cordenonsi, L. Cornwall, A. Datta, A. Djaoui, L. Field, S. Fisher, S. Hicks, S. Kenny, J. Magowan, W. Nutt, D. O Callaghan, M. Oevers, N. Podhorszki, J. Ryan, M. Soni, P. Taylor, A. Wilson, and X Zhu. R-GMA: A Relational Grid Information and Monitoring System. In 2nd Cracow Grid Workshop, Cracow, Poland, 2002.

[BCJ01]

R. Buyya, T. Cortes, and H. Jin. Single System Image. International Journal of High Performance Computing Applications, 15(2):124–135, 2001.

[Beo]

Beowulf.org: The Beowulf Cluster Site, accessed Jan 2007 [Online]. Available: http://www.beowulf.org.

[BHMadH+ 04]

A. Brinkmann, M. Heidebuer, F. Meyer auf der Heide, U. Rückert, K. Salzwedel, and M. Vodisek. V:Drive – Costs and Benefits of an Out-of-Band Storage Virtualization System. In Proceedings of the 12th NASA Goddard, 21st IEEE Conference on Mass Storage Systems and Technologies (MSST), pages 153–157, College Park, Maryland, USA, April 2004.

[BM02]

R. Buyya and M. Murshed. GridSim: a toolkit for the modeling and simulation of distributed resource management and scheduling for Grid computing. Journal of Concurrency and Computation: Practice and Experience (CCPE), 14(13–15):1175– 1220, May 2002.

[BMRW98]

C. K. Baru, R. W. Moore, A. Rajasekar, and M. Wan. The SDSC storage resource broker. In S. A. MacKay and J. H. Johnson, editors, Proceedings of the 1998 conference of the Centre for Advanced Studies on Collaborative Research (CASCON), page 5, Toronto, Ontario, Canada, December 1998. IBM Press.

[Bra02]

Peter J. Braam. The Lustre storage architecture. Cluster File Systems Inc. Architecture, design, and manual for Lustre, November 2002.

[BSS02]

A. Brinkmann, K. Salzwedel, and C. Scheideler. Compact, adaptive placement schemes for non-uniform requirements. In Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures (SPAA ’02), pages 53–62, New York, NY, USA, 2002. ACM Press.

[CaB]

caBIG Web Site, accessed Jun 2006 [Online]. Available: https://cabig.nci.nih.gov/.

[CAS]

CASTOR - CERN Advanced STORage manager, accessed Apr 2007 [Online]. Available: http://castor.web.cern.ch/castor.

[CCL+ 02]

A. Ching, A. Choudhary, W. Liao, R. Ross, and W. Gropp. Noncontiguous I/O through PVFS. In Proceedings of 2002 IEEE International Conference on Cluster Computing (CLUSTER ’02), pages 405–414, Washington, DC, USA, September 2002. IEEE Computer Society.

[CDF+ 02]

A. Chervenak, E. Deelman, I. Foster, L. Guy, W. Hoschek, A. Iamnitchi, C. Kesselman, P. Kunszt, M. Ripeanu, B. Schwartzkopf, H. Stockinger, K. Stockinger, and B. Tierney. Giggle: a framework for constructing scalable replica location services. In Proceedings of the 2002 ACM/IEEE conference on Supercomputing (Supercomputing ’02), pages 1–17, Los Alamitos, CA, USA, 2002. IEEE Computer Society Press.

[CFF+ 01]

G. Cancio, S. M. Fisher, T. Folkes, F. Giacomini, W. Hoschek, D. Kelsey, and B. L. Tierney. The DataGrid Architecture. Technical Report DataGrid-12-D12.4-333671-30, EU DataGrid Project, 2001.


[CFFK01]


K. Czajkowski, S. Fitzgerald, I. Foster, and C. Kesselman. Grid information services for distributed resource sharing. In Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing (HPDC ’01), pages 181–194, Washington, DC, USA, 2001. IEEE Computer Society.

[CFK+ 00]

A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, and S. Tuecke. The Data Grid: Towards an architecture for the distributed management and analysis of large scientific datasets. Journal of Network and Computer Applications, 23(3):187–200, July 2000.

[CGB02]

K. Chiu, M. Govindaraju, and R. Bramley. Investigating the Limits of SOAP Performance for Scientific Computing. In Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing (HPDC ’02), page 246, Washington, DC, USA, 2002. IEEE Computer Society.

[CIRT00]

P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur. PVFS: A parallel file system for linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference, pages 317–327, October 2000.

[CL03]

T. Cortes and J. Labarta. Taking advantage of heterogeneity in disk arrays. Journal of Parallel and Distributed Computing, 63(4):448–464, 2003.

[CLIRT00]

P. H. Carns, W. B. Ligon III, R. B. Ross, and R Thakur. PVFS: A parallel file system for Linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference, pages 317–327, Oct 2000.

[CLRS01]

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, editors. Introduction to Algorithms, chapter 22.3: Depth-first search, pages 540–549. MIT Press, McGraw-Hill, second edition edition, 2001.

[Con]

The Condor Project, accessed Dec 2006 [Online]. Available: http://www.cs.wisc.edu/condor.

[CP90]

P. M. Chen and D. A. Patterson. Maximizing performance in a striped disk array. In Proceedings of the 17th annual international symposium on Computer Architecture (ISCA ’90), pages 322–331, New York, NY, USA, 1990. ACM Press.

[CrG]

CrossGrid Exploitation Website, accessed Jan 2007 [Online]. Available: http://www.crossgrid.org/main.html.

[CSWH01]

I. Clarke, O. Sandberg, B. Wiley, and T. W. Hong. Freenet: a distributed anonymous information storage and retrieval system. In Proceedings of the International workshop on Designing privacy enhancing technologies, pages 46–66, New York, NY, USA, 2001. Springer-Verlag New York, Inc.


[CZSS02]


M. Carman, F. Zini, L. Serafini, and K. Stockinger. Towards an Economy-Based Optimisation of File Access and Replication on a Data Grid. In Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID ’02), pages 340–345, Washington, DC, USA, 2002. IEEE Computer Society.

[DAI]

Data Access and Integration Services working group, accessed May 2007 [Online]. Available: https://forge.gridforum.org/projects/dais-wg.

[Dan63]

G. B. Dantzig. Linear Programming and Extensions. Princeton University Press, 1963.

[DCa]

dCache.ORG, accessed Apr 2007 [Online]. Available: http://www.dcache.org/.

[DDH+ 06]

B. Di Martino, J. Dongarra, A. Hoisie, L. T. Yang, and H. Zima, editors. Engineering the Grid: Status and Perspective. American Scientifc Publisher, January 2006.

[DDP+ 04]

A. Domenici, F. Donno, G. Pucciani, H. Stockinger, and K. Stockinger. Replica consistency in a Data Grid. Nuclear Instruments and Methods in Physics Research A, 534:24–28, November 2004.

[DHJM+ 01]

D. D¨ ullman, W. Hoschek, J. Jaen-Martinez, B. Segal, A. Samar, H. Stockinger, and K. Stockinger. Models for replica synchronisation and consistency in a data grid. In Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing (HPDC ’01), pages 65–67, Washington, DC, USA, 2001. IEEE Computer Society.

[Dij72]

E. W. Dijkstra. The humble programmer. Communications of the ACM, 15(10):859– 866, 1972.

[DLR77]

A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of Royal Statistical Society, Series B, 39(1):1–38, 1977.

[dRBC93]

J. del Rosario, R. Bordawekar, and A. Choundary. Improved parallel I/O via a twophase run-time access strategy. ACM Computer Architecture News, 21(5):31–38, December 1993.

[DSI]

Dynamic Systems Initiative, accessed Apr 2007 [Online]. Available: http://www.microsoft.com/dsi.

[EFG+ 01]

M. Ernst, P. Fuhrmann, M. Gasthuber, T. Mkrtchyan, and C. Waldman. dCache, a Distributed Storage Data Caching System. In Proceedings of the International Conference on Computing on High-Energy and Nuclear Physics (CHEP’01), page 244, Beijing, China, September 2001.


[EGE]

EGEE Homepage, accessed Jun 2006 [Online]. Available: http://public.eu-egee.org.

[EMS]

Fermilab MASS STORAGE SYSTEM - Enstore, accessed Apr 2007 [Online]. Available: http://www-ccf.fnal.gov/enstore/.

[ES01]

D. W. Erwin and D. F. Snelling. UNICORE: A Grid computing environment. Lecture Notes in Computer Science, 2150, 2001.

[Eve80]

B. S. Everitt. Cluster Analysis. Halsted, 1980.

[FdW02]

M. Frumkin and R. F. Van der Wijngaart. NAS Grid Benchmarks: A Tool for Grid Space Exploration. Cluster Computing, 5(3):247–255, 2002.

[FI03]

I. Foster and A. Iamnitchi. On Death, Taxes, and the Convergence of Peer-to-Peer and Grid Computing. In M. F. Kaashoek and I. Stoica, editors, Proceedings of the 2nd International Workshop on Peer-to-Peer Systems (IPTPS’03), volume 2735 of Lecture Notes in Computer Science, pages 118–128. Springer, 2003.

[Fis22]

R. A. Fisher. On the mathematical foundations of theorical statistics. Philosophical Transactions of the Royal Society of London, 222A:309 – 368, 1922.

[Fis25]

R. A. Fisher. Statistical methods for research workers. Oliver and Boyd, 1925.

[FK04]

I. Foster and C. Kesselman, editors. The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, 2004.

[FKKM97]

I. Foster, D. Kohr, R. Krishnaiyer, and J. Mogill. Remote I/O fast access to distant storage. In Proceedings of the 5th Workshop on I/O in Parallel and Distributed Systems (IOPADS ’97), pages 14–25, 1997.

[FKNT02]

I. Foster, C. Kesselman, J. M. Nick, and S. Tuecke. The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration. Published online at http://www.globus.org/research/papers/ogsa.pdf, January 2002.

[FKTT98]

I. Foster, C. Kesselman, G. Tsudik, and S. Tuecke. A security architecture for computational grids. In ACM Conference on Computer and Communications Security, pages 83–92, 1998.

[FL99]

L. A. Freitag and R. M. Loy. Adaptive, multiresolution visualization of large data sets using a distributed memory octree. In Proceedings of the 1999 ACM/IEEE conference on Supercomputing (Supercomputing ’99), page 60, Portland, OR, USA, November 1999. ACM Press and IEEE Computer Society Press.


[Fos98]


I. Foster. Computational Grids, chapter 2. In Kesselman and Foster [KF98], November 1998.

[Fos01]

I. Foster. The Anatomy of the Grid: Enabling Scalable Virtual Organizations. In Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing (Euro-Par ’01), volume 2150 of Lecture Notes In Computer Science, pages 1–4, London, UK, 2001. Springer-Verlag.

[Fos02]

I. Foster. What is the Grid? A Three Point Checklist. Grid Today, 1(6), Jul 2002.

[Fra50]

S. P. Frankel. Convergence rates of iterative treatments of partial differential equations. Mathematical Tables and Aids to Computation, 4:65–75, 1950.

[FS03]

M. A. Frumkin and L. Shabanov. Arithmetic Data Cube as a Data Intensive Benchmark. Technical Report NAS-03-005, NAS, 2003.

[FSB+ 06]

I. Foster, A. Savva, D. Berry, A. Djaoui, A. Grimshaw, B. Horn, F. Maciel, F. Siebenlist, R. Subramaniam, J. Treadwell, and J. Von Reich. The Open Grid Services Architecture. Technical Report Version 1.5, Global Grid Forum, June 2006.

[G2M]

MPICH-G2, accessed Jan 2007 [Online]. Available: http://www3.niu.edu/mpi.

[Gan]

Ganglia distributed monitoring and execution system, accessed feb 2007 [Online]. Available: http://ganglia.sourceforge.net/.

[GB06]

S. Goel and R. Buyya. Enterprise Service Computing: From Concept to Deployment, chapter Data Replication Strategies in Wide Area Distributed Systems, pages 211–241. Idea Group Inc., 2006.

[Gbu]

The Gridbus Project, accessed Nov 2006 [Online]. Available: http://www.gridbus.org.

[GC03]

A. G. Ganek and T. A. Corbi. The dawning of the autonomic computing era. IBM Systems Journal, 42(1):5–18, 2003.

[GCA]

Grand Challenge Applications, accessed Jun 2006 [Online]. Available: http://wwwfp.mcs.anl.gov/grand-challenges/.

[GCB+ 97]

J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing GroupBy, Cross-Tab, and Sub-Totals. Data Mining and Knowledge Discovery, 1(1):29–53, 1997.

[Gib91]

G. A. Gibson. Redundant Disk Arrays: Reliable, Parallel Secondary Storage. PhD thesis, EECS Department, University of California, Berkeley, 1991.


[Gil62]

A. Gill. Introduction to The Theory of Finite State Machines. McGraw-Hill, 1962.

[Gil03]

S. Gilbert. Introduction to Linear Algebra. Wellesley-Cambridge Press, 3rd edition edition, 2003.

[GJD80]

E. S. Gardner Jr. and D. G. Dannenbring. Forecasting with exponential smoothing: Some guidelines for model selection. Decision Sciences, 11:370 – 383, 1980.

[GKM+ 06]

S. Graham, A. Karmarkar, J. Mischkinsky, I. Robinson, and I. Sedukhin. Web Service Resource Framework. Technical Report Version 1.2, OASIS, 2006.

[GlA]

The Globus Alliance, accessed Feb 2007 [Online]. Available: http://www.globus.org.

[GLS94]

W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1994.

[GMA]

R-GMA: Relational Grid Monitoring Architecture, accessed Jan 2007 [Online]. Available: http://www.r-gma.org/.

[GMM06]

E. Gómez-Martínez and J. Merseguer. Impact of SOAP Implementations in the Performance of a Web Service-Based Application. In Geyong Min, Beniamino Di Martino, Laurence Tianruo Yang, Minyi Guo, and Gudula Rünger, editors, Frontiers of High Performance Computing and Networking, ISPA 2006 International Workshops, volume 4331 of Lecture Notes in Computer Science, pages 884–896. Springer, 2006.

[GNA+ 97]

G. A. Gibson, D. Nagle, K. Amiri, F. W. Chang, E. M. Feinberg, H. Gobioff, C. Lee, B. Ozceri, E. Riedel, D. Rochberg, and J. Zelenka. File server scaling with networkattached secure disks. In Proceedings of the 1997 ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’97), pages 272–284, New York, NY, USA, June 1997. ACM Press.

[GPo]

About the GridPort Toolkit - GridPort Toolkit, accessed Jan 2007 [Online]. Available: http://gridport.net/main/.

[GPW]

Grid Forum Grid Performance Working Group Home Page, accessed jun 2006 [Online]. Available: http://www-didc.lbl.gov/GGF-PERF/GMA-WG.

[GR3]

GRID3, accessed Jan 2007 [Online]. Available: http://www.ivdgl.org/grid2003/.

[GRBK98]

E. Gabriel, M. Resch, T. Beisel, and R. Keller. Distributed computing in a heterogeneous computing environment. In Proceedings of the 5th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 180–187. Springer-Verlag, 1998.


[GSK03]


G. R. Ganger, J. D. Strunk, and A. J. Klosterman. Self-* storage: brick-based storage with automated administration. Technical Report CMU-CS-03-178, Carnegie Mellon University, August 2003.

[GSp]

GridSphere, accessed Jan 2007 [Online]. Available: http://www.gridsphere.org/gridsphere/gridsphere.

[GWB+ 04]

M. Gerndt, R. Wism¨ uller, Z. Balaton, G. Gomb´as, P. Kacsuk, Z. Nemeth, N. Podhorszki, H. Truong, T. Fahringer, M. Bubak, E. Laure, and T. Margalef. Performance tools for the grid: State of the art and future. APART White Paper, 2004.

[GWF+ 94]

A. S. Grimshaw, W. A. Wulf, J. C. French, A. C. Weaver, and P. F. Reynolds Jr. Legion: The next logical step toward a nationwide virtual computer. Technical Report CS-94-21, Department of Computer Science, University of Virginia, August 1994.

[GWM]

GridWay Metascheduler: Metascheduling Technologies for the Grid, accessed Dec 2006 [Online]. Available: http://www.gridway.org/.

[GZK05]

M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy. Mining data streams: a review. SIGMOD Rec., 34(2):18–26, 2005.

[Ham94]

J. D. Hamilton. Time series analysis. Princeton University Press, 1994.

[Har75]

J. Hartigan. Clustering Algorithms. Wiley, 1975.

[Haw]

Hawkeye, accessed Feb 2007 [Online]. Available: http://www.cs.wisc.edu/condor/hawkeye/.

[HE80]

Robert M. Haralick and Gordon L. Elliott. Increasing tree search efficiency for constraint satisfaction problems. Artificial Intelligence, 14(3):263–313, 1980.

[HHSK05]

B. K. Hess, M. Haddox-Schatz, and M. A. Kowalski. The Design and Evolution of Jefferson Lab’s Jasmine Mass Storage System. In Proceedings of the 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST’05), pages 94–105. IEEE Computer Society, 2005.

[HJMS+ 00]

W. Hoschek, F. J. Jaen-Martinez, A. Samar, H. Stockinger, and K. Stockinger. Data management in an international data grid project. In Proceedings of the First IEEE/ACM International Workshop on Grid Computing(GRID 2000), December 2000.

[HKY99]

L. J. Heyer, S. Kruglyak, and S Yooseph. Exploring expression data: Identification and analysis of coexpressed genes. Genome Research, 9:1106–1115, 1999.


[HL03]


P. Hunt and D. Larson. Addressing IT Challenges with Self-Healing Technology. Technology@Intel Magazine, pages 1–6, October 2003.

[HO01]

J. H. Hartman and J. K. Ousterhout. The Zebra striped network file system. In H. Jin, T. Cortes, and R. Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, pages 309–329. IEEE Computer Society Press and Wiley, New York, NY, USA, 2001.

[Hol00]

K. Holtman. Object Level Physics Data Replication in the Grid. In Proceedings of the VII International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT 2000), pages 244–246, October 2000.

[Hoo05]

G. Hoolahan. Applying Adaptive Enterprise principles to collaborative business infrastructure-based solution designs. HP White Paper, June 2005.

[Hor01]

P. Horn. IBM’s Perspective on the state of information technology, accessed May 2005 [online]. available: http://www.research.ibm.com/autonomic/overview/, 2001.

[HPS]

HPSS - High Performance Storage System, accessed Apr 2007 [Online]. Available: http://www.hpss-collaboration.org/hpss/.

[HS83]

D. C. Hambrick and S. M. Schecter. Turnaround strategies for mature industrialproduct business units. Academy of Management Journal, 26:231–248, 1983.

[Hug02]

G. F. Hugues. Wise drives. IEEE Spectrum, August 2002.

[IBM03]

IBM - International Business Machines Corporation. Delivering the vision: autonomic computing. Technical Report The Mainstream, Issue 3, The IBM eServer zSeries and S/390 software, Aug 2003.

[IBM04]

IBM - International Business Machines Corporation. Autonomic Computing Toolkit. Developer’s Guide, 2 edition, August 2004.

[IEE94]

IEEE Computer Society. Portable Applications Standards Committee. IEEE Standard for Information Technology - Portable Operating System Interface (POSIX): System Application Program Interface (API), Amendment 1: Realtime Extension (C Language), IEEE Std 1003.1b-1993. IEEE Standards Office, New York, NY, USA, 1994.

[Isa02]

N. Isailovic. An Introspective Approach to Speculative Execution. Technical Report UCB/CSD-02-1219, U.C. Berkeley, December 2002.

[ITKT00]

T. Imamura, Y. Tsujita, H. Koide, and H. Takemiya. An Architecture of Stampi: MPI Library on a Cluster of Parallel Computers. In Proceedings of the 7th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 200–207. Springer-Verlag, 2000.

[JGN99]

W. E. Johnston, D. Gannon, and B. Nitzberg. Grids as production computing environments: The engineering aspects of NASA’s Information Power Grid. In Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing (HPDC ’99), page 34. IEEE Computer Society, 1999.

[Jor73]

W. Jordan. Handbuch der Vermessungskunde. Gysbers & Van Loon, 1873.

[JX04]

X. Jiang and D. Xu. VIOLIN: Virtual Internetworking on OverLay INfrastructure. Lecture Notes in Computer Science, 3358:937–946, 2004.

[KAA+ 04]

K. Karasavvas, M. Antonioletti, M. Atkinson, N. P. Chue Hong, T. Sugden, A. C. Hume, M. Jackson, A. Krause, and C. Palansuriya. Introduction to OGSA-DAI Services. In P. Herrero, M. S. P´erez, and V. Robles, editors, Proceedings of the First International Workshop on Scientific Applications of Grid Computing (SAG 2004), volume 3458 of Lecture Notes in Computer Science, pages 1–12. Springer, 2004.

[Kah58]

W. Kahan. Gauss-Seidel Methods of Solving Large Systems of Linear Equations. PhD thesis, University of Toronto, Toronto, Canada, 1958.

[KC03]

J. O. Kephart and D. M. Chess. The vision of autonomic computing. Computer, 36(1):41–50, 2003.

[KE02]

G. J. Klir and D. Elias. Architecture of Systems Problem Solving. Da Capo Press Incorporated, 2002.

[Ker]

Kerrighed, accessed Dec 2007 [Online]. Available: http://www.kerrighed.org.

[KF98]

C. Kesselman and I. Foster, editors. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, November 1998.

[KGP89]

R. H. Katz, G. A. Gibson, and D. A. Patterson. Disk system architectures for high performance computing. Technical Report UCB/CSD-89-497, EECS Department, University of California, Berkeley, 1989.

[KHB+ 99]

T. Kielmann, R. F. Hofman, H. E. Bal, A. Plaat, and R. A. Bhoedjang. MagPIe: MPI’s Collective Communication Operations for Clustered Wide Area Systems. In Seventh ACM SIGPLAN Symposium on Principles and Practice of ParallelProgramming (PPoPP’99), pages 131–140, Atlanta, GA, May 1999.


[KLL+ 97]


D. Karger, E. Lehman, T. Leighton, R. Panigrahy, M. Levine, and D. Lewin. Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In Proceedings of the twenty-ninth annual ACM symposium on Theory of computing (STOC ’97), pages 654–663, New York, NY, USA, 1997. ACM Press.

[KLSS05]

P. Z. Kunszt, E. Laure, H. Stockinger, and K. Stockinger. File-based replica management. Future Generation Computer Systems, 21(1):115–123, 2005.

[Koh90]

T. Kohonen. The self-organizing map. In Proceedings of the IEEE, volume 78(9), pages 1464–1480, 1990.

[KR93]

R. L. Keeney and H. Raiffa. Decisions with Multiple Objectives: Preferences and Value Tradeoffs. Cambridge University Press, 1993.

[KS96]

D. J. Ketchen and C. L. Shook. The application of cluster analysis in strategic management research: An analysis and critique. Strategic Management Journal, 17:441–458, 1996.

[LAM]

LAM / MPI parallel computing, accessed May 2007 [Online]. Available: http://www.lam-mpi.org.

[Law81]

F. D. Lawlor. Efficient mass storage parity recovery mechanism. IBM Technical Disclosure Bulletin, 24(2):986–987, July 1981.

[LCG]

LCG - LHC Computing Grid, accessed Jan 2007 [Online]. Available: http://lcg.web.cern.ch/LCG.

[LEG+ 97]

P. M. Lyster, K. Ekers, J. Guo, M. Harber, D. Lamich, J. W. Larson, R. Lucchesi, R. Rood, S. Schubert, W. Sawyer, M. Sienkiewicz, A. da Silva, J. Stobie, L. L. Takacs, R. Todling, and J. Zero. Parallel computing at the NASA data assimilation office (DAO). In Proceedings of the 1997 ACM/IEEE conference on Supercomputing (Supercomputing ’97), pages 1–18, San Jose, CA, USA, November 1997. IEEE Computer Society Press.

[Lin87]

D. V. Lindley. Regression and correlation analysis. New Palgrave: A Dictionary of Economics, 4:120 – 123, 1987.

[LS00]

H. Liefke and D. Suciu. XMILL: An Efficient Compressor for XML Data. In W. Chen, J. F. Naughton, and P. A. Bernstein, editors, SIGMOD Conference, pages 153–164. ACM, 2000.


[LSS]


LinuxSSI - XtreemOS : A Linux-based Operating System to support Virtual Organizations for next generation Grids, accessed Dec 2007 [Online]. Available: http://www.xtreemos.eu/science-and-research/plonearticlemultipage.200705-03.8942730332/linuxssi.

[LSS02]

H. Lamehamedi, B. Szymanski, and Z. Shentu. Data replication strategies in grid environments. In Proceedings of the Fifth International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP ’02), pages 378–383, Washington, DC, USA, 2002. IEEE Computer Society.

[Mac67]

J. B. MacQueen. Some methods of classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathemtical Statistics and Probability, pages 281–297, 1967.

[Mag03]

J. Magowan. A view on relational data on the grid. In Proceedings of the 17th International Symposium on Parallel and Distributed Processing (IPDPS ’03), page 90, Washington, DC, USA, 2003. IEEE Computer Society.

[Mar06]

A. A. Markov. Rasprostranenie zakona bol’shih chisel na velichiny, zavisyaschie drug ot druga. Izvestiya Fiziko-matematicheskogo obschestva pri Kazanskom universitete. 2-ya seriya, 15:135–156, 1906.

[Mar71]

A. A. Markov. Extension of the limit theorems of probability theory to a sum of variables connected in a chain. reprinted in Appendix B of: R. Howard. Dynamic Probabilistic Systems. John Wiley and Sons, volume 1: Markov Chains:552–577, 1971.

[Mar02]

J. Marco. Grids and e-Science. RedIris Bulletin, 61, September 2002.

[MB06]

D. A. Menasce and M. N. Bennani. Autonomic virtualized environments. In Proceedings of the IEEE International Conference on Autonomic and Autonomous Systems (ICAS ’06), page 28, Silicon Valley, USA, July 2006. IEEE Computer Society.

[MC85]

G. W. Milligan and M. C. Cooper. An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50:159–179, 1985.

[McC03]

J. A. McCann. Adaptivity for Improving Web Streaming Application Performance, chapter 8, pages 172–191. IGI Publishing, Hershey, PA, USA, 2003.

[MDS]

Globus: Monitoring and Discovery System, accessed Feb 2007 [Online]. Available: http://www.globus.org/mds/.

[MEM03]

S. Malaika, A. Eisenberg, and J. Melton. Standards for databases on the grid. SIGMOD Record, 32(3):92–100, 2003.


[MHG03]


M. S. M¨ uller, M. Hess, and E. Gabriel. Grid enabled MPI solutions for Clusters. In Proceedings of the 3rd IEEE International Symposium on Cluster Computing and the Grid (CCGRID ’03), pages 18–25. IEEE Computer Society, 2003.

[MK04]

J. Martin and H. Karlapudi. Web application performance prediction. In Proceedings of the IASTED International Conference on Communication and Computer Networks, pages 281–286, Boston, MA, USA, November 2004.

[MKL+ 02]

S. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J. Pruyne, B. Richard, S. Rollins, and Z. Xu. Peer-to-peer computing. Technical Report HPL-2002-57, HP Lab, 2002.

[MLRC04]

N. Miller, R. Latham, R. Ross, and P. Carns. High performance I/O: PVFS2 for clusters. ClusterWorld Magazine, April 2004.

[MMG]

MonALISA - Monitoring the Grid since 2001, accessed Feb 2007 [Online]. Available: http://monalisa.cacr.caltech.edu/.

[Moo65]

G. E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8):114–117, April 1965.

[Mor07]

Christine Morin. XtreemOS: A Grid Operating System Making your Computer Ready for Participating in Virtual Organizations. In Proceedings of the 10th IEEE International Symposium on Object and Component-Oriented Real-Time Distributed Computing (ISORC’07), pages 393–402, Los Alamitos, CA, USA, 2007. IEEE Computer Society.

[MOS]

MOSIX, accessed Dec 2006 [Online]. Available: http://www.mosix.org.

[MPIa]

Message Passing Interface (MPI) Forum, accessed Jul 2007 [Online]. Available: http://www.mpi-forum.org/.

[MPIb]

MPI - the Message Passing Interface standard, accessed Jan 2007 [Online]. Available: http://www-unix.mcs.anl.gov/mpi/.

[MPI97]

Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface, July 1997.

[MRLB06]

P. McBrien, N. Rizopoulos, C. Lazanitis, and Z. Bellahs`ene. iXPeer: Implementing layers of abstraction in P2P Schema Mapping using AutoMed. In 2nd Workshop on Innovations in Web Infrastructure, Edinburgh, UK, May 2006.

[MyG]

myGrid, accessed Jun 2006 [Online]. Available: http://www.mygrid.org.uk.


[NLB01]


H. B. Newman, I. C. Legrand, and J. J. Bunn. A Distributed Agent-based Architecture for Dynamic Services. In Proceedings of the International Conference on Computing on High-Energy and Nuclear Physics (CHEP’01), Beijing, China, September 2001.

[NLG+ 03]

H. B. Newman, I. C. Legrand, P. Galvez, R. Voicu, and C. Cirstoiu. MonALISA : A Distributed Monitoring Service Architecture. In Proceedings of the 2003 Conference on Computing on High-Energy and Nuclear Physics (CHEP03), La Jolla, California, USA, March 2003.

[NP28]

J. Neyman and E. S. Pearson. On the use and interpretation of certain test criteria for purposes of statical inference. Biometrika, 20(A):175 – 240 and 263 – 294, 1928.

[NP33]

J. Neyman and E. S. Pearson. On the problem of the most efficient test of statistical hypotheses. Philosophical Transactions of the Royal Society, Series A:231, 298 – 337, 1933.

[NWS]

Network Weather Service: Introduction, accessed Dec 2006 [Online]. Available: http://nws.cs.ucsb.edu/ewiki/.

[OGF]

Open Grid Forum, accessed Feb 2007 [Online]. Available: http://www.ogf.org/.

[OGS]

Globus: The Open Grid Service Architecture, accessed Feb 2007 [Online]. Available: http://www.globus.org/ogsa/.

[OMo]

openMosix, an Open Source Linux Cluster Project, accessed Apr 2007 [Online]. Available: http://openmosix.sourceforge.net.

[Pat94]

Y. N. Patt. The I/O subsystem: A candidate for improvement. IEEE Computer, 27(3):15–16, March 1994.

[PB86]

A. Park and K. Balasubramanian. Providing fault tolerance in parallel secondary storage systems. Technical Report CS-TR-057-86, Department of Computer Science, Princeton University, November 1986.

[PBE+ 99]

J. S. Plank, M. Beck, W. Elwasif, T. Moore, M. Swany, and R. Wolski. The Internet Backplane Protocol: Storage in the network. In Proceedings of the Network Storage Symposium (NetStore ’99), October 1999.

[PBS]

OpenPBS, accessed Jan 2007 [Online]. Available: www.openpbs.org.

[PCFGR06]

M. S. P´erez, J. Carretero, J. M. Pe˜ na F. Garc´ıa, and V. Robles. MAPFS: A flexible multiagent parallel file system for clusters. Future Generation Computer Systems, 22(5):620–632, 2006.


[PCG+ 03]


M. S. P´erez, J. Carretero, F. Garc´ıa, J. M. Pe˜ na, and V. Robles. MAPFS-Grid: A flexible architecture for data-intensive grid applications. In F. Fern´andez Rivera, M. Bubak, A. G´ omez Tato, and R. Doallo, editors, European Across Grids Conference, volume 2970 of Lecture Notes in Computer Science, pages 111–118. Springer, 2003.

[PCGK89]

D. A. Patterson, P. M. Chen, G. A. Gibson, and R. H. Katz. Introduction to Redundant Arrays of Inexpensive Disks (RAID). In Proceedings of Thirty-Fourth IEEE Computer Society International Conference: Intellectual Leverage, Digest of Papers (COMPCON ’89), pages 112–117. IEEE Computer Society, 1989.

[Pet62]

C. A. Petri. Fundamentals of a theory of asynchronous information flow. In Proceedings of IFIP Congress 62, pages 386–390, 1962.

[PGK88]

D. A. Patterson, G. Gibson, and R. H. Katz. A case for Redundant Arrays of Inexpensive Disks (RAID). In Proceedings of the 1988 ACM SIGMOD international conference on Management of Data, pages 109–116, Chicago, Illinois, USA, June 1988. ACM Press.

[PlL]

PlanetLab, An open platform for developing, deploying, and accesing planetary-scale services, accessed Jun 2007 [Online]. Available: https://www.planet-lab.org.

[PPC+ 07]

T. Perelmutov, D. Petravick, E. Corso, L. Magnoni, O. Barring, J. Baud, F. Donno, M. Litmaath, S. De Witt, J. Jensen, M. Haddox-Schatz, B. Hess, A. Kowalski, and C. Watson. The storage resource manager interface specification, version 2.2. Lawrence Berkeley National Laboratory, January 2007.

[PSH+ 05]

M. S. Pérez, A. Sánchez, P. Herrero, V. Robles, and J. M. Peña. Adapting the Weka Data Mining Toolkit to a Grid Based Environment. In P. S. Szczepaniak, J. Kacprzyk, and A. Niewiadomski, editors, Proceedings of the 3rd International Atlantic Web Intelligence Conference (AWIC 2005), volume 3528 of Lecture Notes in Computer Science, pages 492–497. Springer, 2005.

[PSHR06]

M. S. P´erez, A. S´ anchez, P. Herrero, and V. Robles. A New Approach for overcoming the I/O crisis in grid environments, chapter 19, pages 311–321. In Di Martino et al. [DDH+ 06], January 2006.

[PSPR05]

M. S. P´erez, A. S´ anchez, J. M. Pe˜ na, and V. Robles. A new formalism for dynamic reconfiguration of data servers in a cluster. Journal of Parallel and Distributed Computing, 65(10):1134–1145, 2005.


[PSR+ 07]


M. S. Pérez, A. Sánchez, V. Robles, P. Herrero, and J. M. Peña. Design and implementation of a data mining grid-aware architecture. Future Generation Computer Systems, 23(1):42–47, 2007.

[PVM]

PVM: Parallel Virtual Machine, accessed Jan 2007 [Online]. Available: http://www.csm.ornl.gov/pvm/pvm home.html.

[PW03]

P. Padala and J. N. Wilson. GridOS: Operating System Services for Grid Architectures. In T. M. Pinkston and V. K. Prasanna, editors, Proceedings of the 10th International Conference on High Performance Computing (HiPC 2003), volume 2913 of Lecture Notes in Computer Science, pages 353–362. Springer, 2003.

[PWF+02]

L. Pearlman, V. Welch, I. Foster, C. Kesselman, and S. Tuecke. A Community Authorization Service for Group Collaboration. In Proceedings of the 3rd International Workshop on Policies for Distributed Systems and Networks (POLICY’02), pages 50–59, Washington, DC, USA, 2002. IEEE Computer Society.

[Pér03]

M. S. Pérez. Arquitectura Multiagente para E/S de Alto Rendimiento en Clusters. PhD thesis, Universidad Politécnica de Madrid, 2003.

[RAC]

IBM Research Autonomic Computing, accessed Feb 2007 [Online]. Available: http://www.research.ibm.com/autonomic/.

[RAI]

RAID - Wikipedia, accessed Mar 2007 [Online]. Available: http://es.wikipedia.org/wiki/RAID.

[Rec06]

W3C Recommendation. Web Services Addressing 1.0 - Core, accessed Aug 2007 [Online]. Available: http://www.w3.org/TR/ws-addr-core/, May 2006.

[RF01]

K. Ranganathan and I. Foster. Identifying dynamic replication strategies for a high-performance data grid. In Proceedings of the Second International Workshop on Grid Computing (GRID ’01), pages 75–86, London, UK, 2001. Springer-Verlag.

[Rio67]

S. Rios. Métodos estadísticos. Ediciones del Castillo, 1967.

[RLS98]

R. Raman, M. Livny, and M. Solomon. Matchmaking: Distributed resource management for high throughput computing. In Proceedings of the 7th IEEE International Symposium on High Performance Distributed Computing (HPDC ’98), page 140, Washington, DC, USA, 1998. IEEE Computer Society.

[RMX05]

P. Ruth, P. McGachey, and D. Xu. Viocluster: Virtualization for dynamic computational domains. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER ’05), Boston, September 2005.

[RS61]

H. Raiffa and R. Schlaifer. Applied Statistical Decision Theory. Harvard University Press, 1961.

[RSC]

IBM RedBooks — Reliable Scalable Cluster Technology (RSCT), accessed Mar 2007 [Online]. Available: http://publib-b.boulder.ibm.com/abstracts/tips0090.html.

[RTC05]

R. Ross, R. Thakur, and A. Choudhary. Achievements and challenges for I/O in computational science. Journal of Physics: Conference Series, 16:501–509, 2005.

[Sar83]

W. S. Sarle. Cubic Clustering Criterion. Technical Report A-108, SAS Institute Inc., 1983.

[SB05]

M. A. Salido and F. Barber. A Non-Binary Constraint Ordering Approach to Scheduling Problems. In Proceedings of the 24th SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence (AI-2004), 2005.

[Sch05]

J. M. Schopf. Distributed monitoring and information services for the grid. Bath, UK, September 2005.

[SDM+05]

J. M. Schopf, M. D’Arcy, N. Miller, L. Pearlman, I. Foster, and C. Kesselman. Monitoring and Discovery in a Web Services Framework: Functionality and Performance of the Globus Toolkit’s MDS4. Technical Report ANL/MCS-P1248-0405, Argonne National Laboratory, April 2005.

[SFT00]

W. Smith, I. Foster, and V. Taylor. Scheduling with advanced reservations. In Proceedings of the 14th International Symposium on Parallel and Distributed Processing (IPDPS ’00), pages 127–132, Washington, DC, USA, 2000. IEEE Computer Society.

[SGE]

gridengine: Home, accessed Nov 2006 [Online]. Available: http://gridengine.sunsource.net.

[SH02]

F. Schmuck and R. Haskin. GPFS: A Shared-Disk File System for Large Computing Clusters. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST ’02), pages 231–244, Berkeley, CA, USA, 2002. USENIX Association.

[SKMC03]

F. D. Sacerdoti, M. J. Katz, M. L. Massie, and D. E. Culler. Wide Area Cluster Monitoring with Ganglia. In Proceedings of the 2003 IEEE International Conference on Cluster Computing (CLUSTER ’03), pages 289–298, 2003.

[SM02]

N. Sundaresan and R. Moussa. Algorithms and programming models for efficient representation of XML for Internet applications. Computer Networks, 39(5):681–697, 2002.

[SOA07]

SOAP Version 1.2: Messaging Framework, accessed Jun 2007 [Online]. Available: http://www.w3.org/TR/soap12-part1/, April 2007.

[SPG+06]

A. Sánchez, M. S. Pérez, P. Guéant, J. Montes, and P. Herrero. A Parallel Data Storage Interface to GridFTP. Lecture Notes in Computer Science, 4276:1203–1212, 2006.

[SPK+06]

A. Sánchez, M. S. Pérez, K. Karasavvas, P. Herrero, and A. Pérez. MAPFS-DAI, an extension of OGSA-DAI based on a parallel file system. Future Generation Computer Systems, 23:138–145, 2006.

[SPM+08]

A. Sánchez, M. S. Pérez, J. Montes, P. Guéant, and T. Cortes. A Prediction-based Autonomic Storage Architecture for Grids. Special Issue of Journal of Autonomic and Trusted Computing, (accepted, to appear), 2008.

[SPP+04]

A. Sánchez, J. M. Peña, M. S. Pérez, V. Robles, and P. Herrero. Improving distributed data mining techniques by means of a grid infrastructure. In R. Meersman, Z. Tari, and A. Corsaro, editors, Proceedings of the On the Move to Meaningful Internet Systems 2004 Workshops (OTM 2004), volume 3292 of Lecture Notes in Computer Science, pages 111–122. Springer, 2004.

[SR94]

R. R. Sokal and J. F. Rohlf. Biometry. W. H. Freeman, September 1994.

[SSMD02]

H. Stockinger, A. Samar, S. Muzaffar, and F. Donno. Grid Data Mirroring Package (GDMP). Scientific Programming, 10(2):121–133, 2002.

[Sto98]

H. Stockinger. Dictionary on parallel Input/Output. Master’s Thesis, February 1998.

[SYAD05]

K. Seymour, A. YarKhan, S. Agrawal, and J. Dongarra. Netsolve: Grid enabling scientific computing environments. In Lucio Grandinetti, editor, Grid Computing: The New Frontier of High Performance Computing, volume 14 of Advances in Parallel Computing. Elsevier, 2005.

[TBSL01]

D. Thain, J. Basney, S. Son, and M. Livny. The Kangaroo Approach to Data Movement on the Grid. In Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing (HPDC ’01), pages 325–333, Los Alamitos, CA, USA, August 2001. IEEE Computer Society.

[TC02]

D. Turner and X. Chen. Protocol-dependent message-passing performance on linux clusters. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER ’02), page 187, Washington, DC, USA, 2002. IEEE Computer Society.

[TD03]

G. Tsouloupas and M. Dikaiakos. GridBench: A Tool for Benchmarking Grids. In Proceedings of the 4th International Workshop on Grid Computing (GRID 2003), page 60, Phoenix, Arizona, USA, November 2003.

[TD05]

G. Tsouloupas and M. Dikaiakos. Design and Implementation of GridBench. In Advances in Grid Computing - EGC 2005: European Grid Conference, volume 3470 of Lecture Notes in Computer Science, pages 211–225. Springer, 2005.

[TeG]

TeraGrid [About], accessed Jan 2007 [Online]. Available: http://www.teragrid.org/about.

[TMM+02]

O. Tatebe, Y. Morita, S. Matsuoka, N. Soda, and S. Sekiguchi. Grid datafarm architecture for petascale data intensive computing. In Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID ’02), page 102, Washington, DC, USA, 2002. IEEE Computer Society.

[Tor04]

J. Tordsson. Resource brokering for grid environments. Master’s thesis, Umeå University, June 2004.

[Try39]

R. C. Tryon. Cluster analysis. Edwards Brothers, Inc., 1939.

[TSL01]

A. Tucker, S. Swift, and X. Liu. Variable grouping in multivariate time series via correlation. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 31(2):235–245, 2001.

[TSS]

TOP500 Supercomputing Sites, accessed Jan 2008 [Online]. Available: http://www.top500.org/.

[Val02]

J. J. Valdés. Time Series Models Discovery with Similarity-Based Neuro-Fuzzy Networks and Evolutionary Algorithms. In Proceedings of the 2002 IEEE World Congress on Computational Intelligence (WCCI’02), pages 2345–2350, 2002.

[VBC06]

J. J. Valdés and G. Bonham-Carter. Time dependent neural network models for detecting changes of state in complex processes: applications in earth sciences and astronomy. Neural Networks, 19(2):196–207, 2006.

[VBR06]

S. Venugopal, R. Buyya, and K. Ramamohanarao. A taxonomy of Data Grids for distributed data sharing, management, and processing. ACM Computing Surveys, 38(1):3, 2006.

[Vil01]

C. Vilett. Moore’s law vs. storage improvements vs. optical improvements. Scientific American, January 2001.

[VTF01]

S. Vazhkudai, S. Tuecke, and I. Foster. Replica Selection in the Globus Data Grid. In Proceedings of the 1st IEEE International Symposium on Cluster Computing and the Grid (CCGRID ’01), pages 106–113, Washington, DC, USA, May 15-18 2001. IEEE Computer Society.

[W3C05]

W3C. SOAP Message Transmission Optimization Mechanism, accessed Jun 2007 [Online]. Available: http://www.w3.org/TR/soap12-mtom, January 2005.

[War63]

J. Ward. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58:236–244, 1963.

[Was76]

A. I. Wasserman. A top-down view of software engineering. ACM SIGSOFT Software Engineering Notes, 1(1):8–14, 1976.

[WC95]

R. W. Watson and R. A. Coyne. The Parallel I/O Architecture of the High-Performance Storage System (HPSS). In Proceedings of the IEEE Symposium on Mass Storage Systems, pages 27–44, 1995.

[WF02]

I. H. Witten and E. Frank. Data mining: Practical machine learning tools and techniques with Java implementations. SIGMOD Record, 31(1):76–77, 2002.

[WGSS96]

J. Wilkes, R. Golding, C. Staelin, and T. Sullivan. The HP AutoRAID hierarchical storage system. ACM Transactions on Computer Systems (TOCS), 14(1):108–136, 1996.

[WHJ+93]

B. W. Wah, T. S. Huang, A. K. Joshi, D. Moldovan, J. Aloimonos, R. K. Bajcsy, D. Ballard, D. DeGroot, K. DeJong, C. R. Dyer, S. E. Fahlman, R. Grishman, L. Hirschman, R. E. Korf, S. E. Levinson, D. P. Miranker, N. H. Morgan, S. Nirenburg, T. Poggio, E. M. Riseman, C. Stanfill, S. J. Stolfo, S. L. Tanimoto, and C. Weems. Report on workshop on high performance computing and communications for grand challenge applications: Computer vision, speech and natural language processing, and artificial intelligence. IEEE Transactions on Knowledge and Data Engineering, 5(1):138–154, 1993.

[Yel06]

R. Yellin. The Data Storage Evolution. Has disk capacity outgrown its usefulness? Teradata Magazine. NCR Corporation, 2006.

[YFH+96]

S. J. Young, G. Y. Fan, D. Hessler, S. Lamont, T. T. Elvins, M. Hadida, G. Hanyzewski, J. W. Durkin, P. Hubbard, G. Kindlmann, E. Wong, D. Greenberg, S. Karin, and M. H. Ellisman. Implementing a collaboratory for microscopic digital anatomy. Supercomputer Applications and High Performance Computing, 10(2/3):170–181, 1996.

[YGM03]

B. Yang and H. Garcia-Molina. Designing a super-peer network. In Nineteenth International Conference on Data Engineering (ICDE’03), pages 49–62, 2003.

[You50]

D. Young. Iterative methods for solving partial difference equations of elliptic type. PhD thesis, Harvard University, Cambridge, 1950.

[You71]

D. Young. Iterative Solution of Large Linear Systems. Academic Press, 1971.

[Zad97]

L. A. Zadeh. The roles of fuzzy logic and soft computing in the conception, design and deployment of intelligent systems. In H. S. Nwana and N. Azarmi, editors, Software Agents and Soft Computing, volume 1198 of Lecture Notes in Computer Science, pages 183–190. Springer, 1997.

[ZFS03]

X. Zhang, J. L. Freschl, and J. M. Schopf. A performance study of monitoring and information services for distributed systems. In Proceedings of the 12th IEEE International Symposium on High Performance Distributed Computing (HPDC ’03), pages 270–282, Washington, DC, USA, 2003. IEEE Computer Society.

[ZG97]

R. Zimmermann and S. Ghandeharizadeh. Continuous display using heterogeneous disk subsystems. In Proceedings of the Fifth ACM Multimedia Conference, pages 227–236, Seattle, Washington, November 1997.

[ZG00]

R. Zimmermann and S. Ghandeharizadeh. HERA: Heterogeneous Extension of RAID. In Hamid R. Arabnia, editor, Proceedings of the 2000 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’2000). CSREA Press, 2000.
