Visualizing Static and Dynamic Relations in Information Hierarchies

Dissertation submitted for the academic degree of Doctor of Natural Sciences (Dr. rer. nat.) at Department IV of Universität Trier

submitted by Dipl.-Inform. Michael Burch

November 2009

Reviewers:

Prof. Dr. Stephan Diehl, Universität Trier
Prof. Dr. Alexandru C. Telea, University of Groningen


Statutory Declaration

I hereby declare in lieu of oath that the dissertation submitted by me has not previously been submitted, in this or a similar form, in any other doctoral procedure in Germany or abroad.

Trier, (date)

Name, signature


Zusammenfassung (Summary)

In the research field of information visualization, one of the central concerns is the efficient and clear representation of relations between objects. Numerous visualization tools with different visual metaphors owe their existence not least to the multitude of example scenarios from the real world: In bioinformatics, researchers study the interactions between proteins. On the Internet, web pages are connected by so-called hyperlinks. Function calls in software systems express which functions depend on each other. Co-author networks show which researchers frequently publish together. These four examples stand for a multitude of others. The most frequently used visual metaphor for graphically representing this kind of data is without doubt the approach based on nodes and links. Typically, this form of representation suffers from a phenomenon referred to in information visualization as visual clutter. The reason for this is the large number of edge crossings. To mitigate this problem as far as possible, a number of sophisticated algorithms have been developed in the past; they also take a whole range of further aesthetic criteria into account in order to make the layout of the so-called graph visually appealing and readable.

In many cases, relations between objects change over time. Visualizing such dynamic relations poses an additional challenge for researchers in the field of graph visualization. A naive approach could apply the same layout algorithms as for static graphs to every intermediate graph of the graph sequence and then show this sequence to the viewer as an animation. This would certainly make every single graph appear aesthetic in the sense of graph drawing, but the representation of the graph sequence as a whole would suffer from this visualization strategy. The main problem of this naive procedure is the high cognitive effort that a viewer of the animated graph sequence has to expend to follow the changes. A viewer is able to build a so-called mental map: mostly unconsciously, he memorizes the positions of nodes and edges within a very short time and can cope with minimal changes to this mental map without much effort. Mature layout algorithms for animated representations of graph sequences should therefore compute the positions of nodes and edges such that the visual representations of consecutive graphs differ only minimally.

The main contribution of this work is not to develop improved layout methods for graph animation, but to represent both static and dynamic graphs as a static visualization, apart from interactive features. More precisely, this dissertation presents graph visualizations that represent directed and weighted multi-compound graphs, that is, graphs with multi-edges whose nodes are hierarchically organized, in a single static image. The main goal is to reduce visual clutter and to allow the viewer of the visualization to preserve his mental map with as little effort as possible. To achieve this goal, we use, among others, Cartesian and radial space-filling representations. As a side effect, we obtain aesthetically appealing visualizations. First, however, this work describes static graph visualizations for rule sets that we extracted from software archives. In another work, we also use animated node-link diagrams to show changing relationships between source code and developers over the course of the software development process. Only later do we turn to the novel approach described above.

An often underestimated visualization paradigm is the representation of data in a radial, circular fashion. Although this kind of representation has a centuries-old history, its benefits have so far been evaluated only to a small extent. In a study, we tried to identify the advantages and disadvantages of the Cartesian and the radial variant of our visualization approach; both an eyetracking study and an online study were conducted to shed light on this question. Several interesting phenomena were found, besides the fact that even people without any background in graph theory and graph visualization can understand the novel approach within a very short time and apply it to complex datasets. The thesis is rounded off by a discussion of aesthetic criteria that graph visualizations for both static and dynamic graphs should follow if they are to provide useful insights for a viewer.


Abstract

The visualization of relational data is at the heart of information visualization. The prevalence of visual representations for this kind of data is based on many real-world examples spread over many application domains: protein-protein interaction networks in the field of bioinformatics, hyperlinked documents in the World Wide Web, call graphs in software systems, or co-author networks are just four instances of a rich source of relational datasets. The most common visual metaphor for this kind of data is the node-link approach, which typically suffers from visual clutter caused by many edge crossings. Many sophisticated algorithms have been developed to lay out a graph efficiently and with respect to a list of aesthetic graph drawing criteria.

Relations between objects normally change over time. Visualizing the dynamics poses an additional challenge for graph visualization researchers. Applying the same layout algorithms as for static graphs to the intermediate states of a dynamic graph is one strategy to compute layouts for an animated graph sequence that shows the dynamics. The major drawback of this approach is the high cognitive effort required from a viewer of the animation to preserve his mental map. To tackle this problem, a sophisticated layout algorithm has to inspect the whole graph sequence and compute layouts with as few changes as possible between subsequent graphs.

The main contribution and ultimate goal of this thesis is the visualization of dynamic, weighted, compound, directed multigraphs as a static image that targets visual clutter reduction and mental map preservation. To achieve this goal, we use a radial space-filling visual metaphor to represent the dynamics in relational data. As a side effect, the obtained pictures are aesthetically appealing.

In this thesis we first describe static graph visualizations for rule sets obtained by extracting knowledge from software archives under version control. In a different work we apply animated node-link diagrams to code-developer relationships to show the dynamics in software systems. An underestimated visualization paradigm is the radial representation of data. Though this kind of representation has a long history going back to centuries-old statistical graphics, little effort has been made to fully explore its benefits. We evaluated a Cartesian and a radial counterpart of a visualization technique for visually encoding transaction sequences and dynamic compound digraphs with both an eyetracking and an online study. We found some interesting phenomena, apart from the fact that even laymen in graph theory can understand the novel approach in a short time and apply it to datasets. The thesis is concluded by an aesthetic dimensions framework for dynamic graph drawing, future work, and currently open issues.


Acknowledgments

“Vergiss diejenigen nicht, die deine Lebensleiter festgehalten haben, während du von einer Stufe zur anderen hochgestiegen bist.” (“Do not forget those who held your ladder steady while you climbed from one rung to the next.”) — Gerlinde Nyncke

The pronoun “we” is used throughout the thesis even if sometimes the pronoun “I” would be more appropriate. The aim is to emphasize that the presented work is based on the thoughts and efforts of several people apart from myself. I could not have finished this thesis without their helpful support, and hence I am very thankful to have gotten to know these people.

First of all, I would like to thank Professor Dr. Stephan Diehl, my supervisor, who supported me during the last five years. He always had an open ear for me and motivated me to keep doing research in the very large field of information visualization. His constructive criticism and his helpful ideas gave me the opportunity to be productive in this research area and to finish this thesis. Special thanks go to Professor Dr. Alexandru C. Telea, my second reviewer, for taking the time to read this thesis.

The main reason to continue research in the field of information visualization was definitely the final year of my studies in Saarbrücken. I am very grateful to Andreas Zeller, Stephan Diehl, Tom Zimmermann, Carsten Görg, and my roommate Peter Weißgerber. These people made an important contribution to my work on software visualization.

I will never forget the following two years in Eichstätt, Bavaria, where I had a lot of time to work out, implement, and write down my novel ideas in the field of information visualization. There I got to know many new colleagues, among them Robin Bergenthum, Sebastian Mauser, Gabor Juhasz, Robert Lorenz, Christian Neumair, Jörg Desel, Dorothea Iglezakis, and Vesna Milijic. Among them was also Leo von Klenze, who implemented an Eclipse plug-in for the pixelmap technique.

I also gratefully thank my new colleagues in Trier, with whom I have been working for the past three years, among them Alexander Weber, Patrick Reuther, Gennadi Umanski, Oliver Zlotowski, Martin Taphorn, Daniel Schmitt, Daniel Raible, Stefan Gulan, Florian Reitz, Guido Schmitz, Peter Birke, and Jennifer Driesch. They helped me with revising my documents and supported me with helpful comments. The collaboration with Mathias Pohl and Peter Weißgerber, who implemented animated node-link diagrams and some matrix-based visualizations respectively, gave me the opportunity to compare different visualization metaphors and also provided new insights.

I would also like to thank Fabian Beck, who implemented a visualization tool in an excellent work based on my ideas, which we published at the Conference on Advanced Visual Interfaces (AVI 2008) in Naples, Italy. This work was a starting point for further collaborative work and further publications at renowned conferences. Among these was our participation in the EuroVis 2009 conference in Berlin, which was only possible with the implementation work of Martin Greilich. I am also very grateful to Fabian for proofreading this thesis.

The evaluation of our Cartesian and radial techniques was conducted in cooperation with Felix Bott. He managed an eyetracking study and had to solve lots of problems such as inviting participants, writing documents, organizing the eyetracking device, understanding its functionality, and presenting some of the results. Rainer Lutz helped him with the data acquisition. This work led to another publication and gave us the insight that even laymen can understand the novel visualization technique.

Again, I am very thankful to Stephan Diehl for giving me the opportunity to take part in many conferences throughout the world. I am deeply grateful to the many employees of the university departments who did their administrative jobs. These people were a great help to me and they were the reason why I could concentrate on my work.

My ultimate thanks go to my mother, my father, and my brothers, who have supported me all my life. They encouraged me to carry on my work and never give up, even though I sometimes went through very hard times. I appreciate their love and support. Without them I would be lost. The most important person in my life is my girlfriend Tina Gaiser, who always understood my problems and had an open ear for me. Her love gave me a solid basis for my ideas.

Michael Burch
October 2009


CONTENTS

List of Figures
List of Tables

1 Motivation
  1.1 Information Visualization
    1.1.1 The Visualization Pipeline
    1.1.2 Related Fields
  1.2 Software Visualization
    1.2.1 Visualizing the Evolution of Software
  1.3 Graph Drawing and Graph Visualization
  1.4 Contribution
  1.5 Outline

2 Related Work
  2.1 Visualization Tools for Software Evolution
    2.1.1 Code-Centric Approaches
    2.1.2 Author-Centric Approaches
    2.1.3 Three-Dimensional Approaches
  2.2 Graph Visualization
    2.2.1 Node-Link-Based Representations
    2.2.2 Matrix-Based Representations
    2.2.3 List-Based Representations
  2.3 Tree Visualization
    2.3.1 Node-Link Representations
    2.3.2 Containment Representations
    2.3.3 Layered Icicle Representations
    2.3.4 Indentation
    2.3.5 Approaches Using 3D
  2.4 Compound Graph Visualization
  2.5 Dynamic Graph Visualization
  2.6 Visualization of Time-Based Data
    2.6.1 Overview-Based Representations
    2.6.2 Animation-Based Representations

3 Visualizing Rules from Software Archives
  3.1 Data Mining in Version Archives
  3.2 Preprocessing
    3.2.1 Data Extraction
    3.2.2 Reconstruction of the Transactions
    3.2.3 Finding out the Changed Artifacts
    3.2.4 Data Cleaning
  3.3 Mining Association and Sequence Rules
    3.3.1 Association Rules
      3.3.1.1 Generating Binary Association Rules
      3.3.1.2 Visualizing Binary Association Rules
      3.3.1.3 Generating n-ary Association Rules
      3.3.1.4 Visualizing n-ary Association Rules
    3.3.2 Sequence Rules
      3.3.2.1 Generating Sequence Rules
      3.3.2.2 Visualizing as Parallel Coordinates
      3.3.2.3 Visualizing as Decision Prefix Trees
      3.3.2.4 Visualizing as Trees in a Treemap
  3.4 The EPOSee Tool
    3.4.1 How to work with EPOSee
    3.4.2 Case Study: MOZILLA
      3.4.2.1 Insights from Binary Association Rules
      3.4.2.2 Insights from n-ary Association Rules
      3.4.2.3 Insights from Sequence Rules
    3.4.3 Sequence Rules as Trees in a Treemap
      3.4.3.1 Case Study: SWT
      3.4.3.2 Case Study: Trace Routes
  3.5 Development Phases in Software Projects
    3.5.1 Transaction Overview
    3.5.2 File-author matrix
    3.5.3 Dynamic Author-File Graph
    3.5.4 Case Studies: JUNIT and TOMCAT3
      3.5.4.1 JUNIT
      3.5.4.2 TOMCAT3
  3.6 Conclusions

4 Modeling Transaction Sequences and Dynamic Compound Digraphs
  4.1 A Transaction Measure
  4.2 A Formal Definition for Compound Graphs
  4.3 A Graph Measure
  4.4 Transforming Transactions into Directed Graphs
  4.5 Conclusions

5 Visualizing Transaction Sequences and Dynamic Compound Digraphs
  5.1 TimeArcTrees
    5.1.1 TimeArcTrees—Step by Step
      5.1.1.1 A Single Graph
      5.1.1.2 Hierarchy Levels
      5.1.1.3 Aggregation of Edges
      5.1.1.4 Graph Sequence
      5.1.1.5 Aggregation of Graphs over Time
    5.1.2 Interactive Features in TimeArcTrees
    5.1.3 An Application—Shortest Paths
  5.2 Timeline Trees
    5.2.1 Visualizing the Information Hierarchy
    5.2.2 Visual Encoding of the Transaction Sequence
    5.2.3 Thumbnails as Miniature Representations
    5.2.4 Alternative Representation: Time Bars
    5.2.5 Interactive Features of Timeline Trees
    5.2.6 Application Domains
      5.2.6.1 Team Play in a Soccer Match
      5.2.6.2 Evolution of Transactions in Software Systems
      5.2.6.3 World’s Export in a Time Bars Representation
  5.3 TimeRadarTrees
    5.3.1 Visualization of a Single Digraph
    5.3.2 Visualization of a Digraph Sequence
    5.3.3 Visualization of the Hierarchy
    5.3.4 Visualization of Dynamic Compound Digraphs
    5.3.5 Visualization of the Graph Measure
    5.3.6 Interactive Features
    5.3.7 Application Domains
      5.3.7.1 Soccer Match Results
      5.3.7.2 Software Evolution
      5.3.7.3 Co-author Graphs
  5.4 A Comparison of the Techniques
    5.4.1 Scalability in TimeRadarTrees
  5.5 Conclusions

6 A Comparative Evaluation of TLT and TRT
  6.1 Cartesian vs. Radial
  6.2 An Eyetracking Study
    6.2.1 The Participants
    6.2.2 Experiment Setup
    6.2.3 Results
    6.2.4 Threats to Validity
  6.3 An Online Study
    6.3.1 The Participants
    6.3.2 Experiment Setup
    6.3.3 Results
  6.4 Conclusions

7 The Aesthetics of Dynamic Graph Visualization
  7.1 Aesthetics for Node-Link Metaphors
    7.1.1 Static Graphs
    7.1.2 Dynamic Graphs
  7.2 Aesthetics for Space-Filling Metaphors
    7.2.1 Static Graphs
    7.2.2 Dynamic Graphs
  7.3 Conclusions

8 Conclusion and Future Work
  8.1 Data Acquisition
  8.2 Trade-Offs in Layout Algorithms
  8.3 Evaluation
  8.4 The Tools on the Web

Bibliography

LIST OF FIGURES

1.1 The visualization pipeline
2.1 The GEVOL system
2.2 Stargate
2.3 A ‘software city’ in CodeCity
2.4 Node-link, matrix, and list representation
2.5 Force-directed edge bundling
2.6 The Zoomable Adjacency Matrix Explorer (ZAME)
2.7 A hierarchy in a treemap representation
2.8 The InterRing tool
2.9 Hierarchical edge bundling
3.1 Pixelmap with sorted items
3.2 Zoom function in the pixelmap
3.3 Node-link diagram of the support matrix
3.4 Association rule matrix
3.5 Parallel coordinates technique
3.6 Decision prefix tree
3.7 Prefix tree with nodes linked to a taxonomy
3.8 Straight vs. orthogonal links
3.9 Unsorted and sorted adjacency matrices
3.10 Prefix tree in a parallel coordinate plot
3.11 Comparison of related techniques
3.12 Orthogonal paths in a two-dimensional grid
3.13 EPOSee tool in binary association rule mode
3.14 EPOSee tool in the n-ary association rule mode
3.15 Enlarged histograms for two metrics
3.16 EPOSee tool in the sequence rule mode
3.17 Pixelmap of MOZILLA’s browser directory
3.18 Sequence rules of MOZILLA in a parallel coordinate plot
3.19 Parallel coordinate plot with focus on three artifacts
3.20 SWT prefix trees in root directory
3.21 Intermediate expansion step for SWT treemap
3.22 Further expansion of gtk subdirectory
3.23 Orthogonal layout of the object trees
3.24 Internet routes as Trees in a Treemap
3.25 The transaction overview
3.26 The file-author matrix
3.27 The author-file graph (AFG)
3.28 Transaction overview for JUNIT
3.29 File-author matrix for JUNIT
3.30 Author-file graph for JUNIT
3.31 Transaction overview for TOMCAT3
3.32 File-author matrix for TOMCAT3
3.33 Author-file graph for TOMCAT3
4.1 A tree showing inclusion edges and leaf nodes
4.2 A compound graph with edge weights
5.1 Node-link diagram of a weighted digraph
5.2 TimeArcTrees for one weighted digraph
5.3 Avoiding vertical edge overlap
5.4 Incoming and outgoing edge ports
5.5 Edge crossing minimization in a graph sequence
5.6 Edge length minimization in a graph sequence
5.7 Orthogonal layout of edges
5.8 The hierarchy in TimeArcTrees
5.9 Smooth animation in TimeArcTrees
5.10 Edge aggregation in TimeArcTrees
5.11 Graph sequence as node-link diagrams
5.12 Graph sequence in TimeArcTrees
5.13 Aggregated graphs in TimeArcTrees
5.14 Selecting start and target nodes
5.15 Circular bar for accumulated costs
5.16 Excerpt of the German Autobahn map
5.17 A sequence of geographic maps
5.18 Shortest paths between “Kreuz Meerbusch” and “Kamener Kreuz”
5.19 Timeline Trees for market basket data
5.20 Timeline Trees in different modes
5.21 Tooltip in Timeline Trees
5.22 Lens function in Timeline Trees
5.23 Time Bars view
5.24 Timeline Trees for a soccer match
5.25 Timeline Trees on player level
5.26 Timeline Trees of the JEDIT project
5.27 Timeline Trees of JEDIT with expanded doc subdirectory
5.28 Time Bars for the world’s export
5.29 Node-link and TimeRadarTrees of a single digraph
5.30 Node-link and TimeRadarTrees of a sequence of digraphs
5.31 Node-link and TimeRadarTrees of a compound digraph
5.32 Node-link and TimeRadarTrees of a dynamic compound digraph
5.33 Node-link and TimeRadarTrees of a weighted digraph
5.34 Soccer match results in TimeRadarTrees
5.35 Evolutionary couplings in JEDIT
5.36 Changes in a co-author graph
5.37 Comparison of TAT, TLT, and TRT
5.38 Space for incoming edges depending on several parameters
6.1 Cartesian visualization techniques
6.2 Radial visualization techniques
6.3 Cartesian visualization for trees
6.4 Radial visualization for trees
6.5 Containment hierarchy in a soccer match
6.6 Correctness of answers for both groups
6.7 Heatmap for TRT (correlation question)
6.8 Heatmap for TLT (correlation question)
6.9 Heatmap for TRT and TLT (open question)
6.10 Results for online experiment
6.11 Response times for online experiment

LIST OF TABLES

3.1 Strongly related file sets
3.2 Outliers in the support graph
4.1 Measure values for an example transaction
4.2 Measure values for an example digraph
5.1 Short notation for IP-addresses
5.2 Market baskets for several days
6.1 Participants that performed the eyetracking study
6.2 T-test analysis
6.3 Online experiment: Participants
6.4 Online experiment: T-test analysis

“The purpose of information visualization is to amplify cognitive performance, not just to create interesting pictures. Information visualizations should do for the mind what automobiles do for the feet.” — Stuart K. Card (2008)

CHAPTER 1 Motivation

Visualization techniques will become more and more important to many people from very different fields of activity. The main reason is the steadily growing quantity of generated data as well as the growing diversity of its types. Visualization tools can help to transform such data into a usable form. Nowadays we have to understand complex relations in data from very different application domains. New disciplines such as bioinformatics produce immense datasets, for example, microarray data, DNA strings, or protein-protein interaction data. Network traffic data grows as the Internet becomes ever bigger and faster. Moreover, the variety of services increases at a steady pace. This leads to a big challenge for the network management and security communities.

Some researchers have tried to estimate the amount of data produced. In December 2008, the Internet traffic all over the world was estimated to be about 5 to 8 exabytes (1 exabyte = 10^18 bytes) per month. Not only the traffic but the size of the Internet itself has grown to approximately 500 exabytes [119]. A couple of years ago, estimations about generated data concluded that it is comparable to 3 million times the amount of information contained in all books ever written [71].

The trend of data explosion is also noticeable in the branch of software engineering. The development and maintenance of software systems have become a very complex task due to the complexity of the software itself. Even simple programs can contain many thousand lines of code. These are implemented by many programmers throughout the world over several years of development time and are hierarchically organized in methods, functions, classes, files, directories, and packages. These so-called software artifacts are related to each other to some extent. The situation gets even worse because these relations are mainly not static but change over time.

The main contribution of this work is part of two different areas of research that will be briefly introduced in Sections 1.1-1.3. The visualization of evolutionary relations among software artifacts belongs to the branch of software evolution visualization, and the visualization of dynamic compound digraphs is part of graph visualization; both are subdisciplines of a field called information visualization.

In the part about software evolution visualization we deal with two types of mined rules. For this reason, we use, on the one hand, a static matrix-based graph visualization which we call pixelmap and, on the other hand, parallel coordinates to display hypervariate data. Time-aggregated relational data can give a good overview of related artifacts, but a drawback is that the evolution of the relations in a specific time interval cannot be examined. To tackle this problem, we developed visualization techniques for displaying dynamic relational data that share the common goal of preserving the viewer's mental map by showing the temporal data in a single view.

Radial representations are gaining more and more popularity and are also used in this thesis as a special visualization paradigm. This paradigm is rooted in centuries-old statistical graphics and has many benefits due to its aesthetic appeal and its compactness. We evaluated the usefulness of a Cartesian visualization tool and its radial counterpart with both an eyetracking study and an online study.

1.1 Information Visualization

Information visualization, or InfoVis for short, can be seen as a subfield of computer graphics, which is in turn a subfield of computer science. A definition is given by Card et al. [27]: “Visualization is the use of interactive visual representations of data to amplify cognition.” This citation expresses that people can use visualization to explore, understand, and draw conclusions from data. It even goes a bit further: visualizations should be interactive, in other words, the user should be able to manipulate the pictures that are displayed on screen. The role of interaction in InfoVis was also emphasized by Spence:


“Interaction between human and computer is at the heart of modern information visualization and for a single overriding reason: the enormous benefit that can accrue from being able to change one’s view of a corpus of data. Usually that corpus is so large that no single all-inclusive view is likely to lead to insight. Those who wish to acquire insight must explore, interactively, subsets of that corpus to find their way towards the view that triggers an ’a ha!’ experience” [147].

1.1.1 The Visualization Pipeline

Figure 1.1 shows a visualization of the steps that are needed to convert raw data into a usable form for the end user. This process is denoted by the term visualization pipeline.

Figure 1.1: The visualization pipeline [27] shows the steps that are needed to transform source data into an interactive view for the end user.

Before working with the data, it has to be converted into a prescribed data format ahead of the mapping stage. This is the first step of the visualization pipeline. Doubtlessly, the heart of the visualization process is the mapping of the data into a visual form. The developer of a visualization tool has to decide which visual glyphs are mapped to which data entities. The function of the third step is to transform the visual form into several views on screen. The final views are then perceived by the user. But he is not supposed to just watch the presented views; he can interact with the visualization in any of the steps that are indicated in the visualization pipeline. This frees the user from static pictures and gives him the opportunity to generate his own pictures. All these steps help a user of a visualization tool to gain insights into an abstract dataset and to draw further conclusions. There is a newer quote about information visualization by Card [26]. He states that: “The purpose of information visualization is to amplify cognitive performance, not just to create interesting pictures. Information visualizations should do for the mind what automobiles do for the feet”.
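To make the pipeline stages more concrete, the following minimal Java sketch models the three transformations as plain functions. All class and method names are hypothetical illustrations of the general flow described above, not code from any tool presented in this thesis.

```java
// Minimal sketch of the visualization pipeline (data transformation -> visual
// mapping -> view transformation). All names are hypothetical and only serve
// to illustrate the flow; user interaction may re-enter the pipeline at any stage.
import java.util.ArrayList;
import java.util.List;

public class PipelineSketch {

    record DataItem(String label, double value) {}      // prepared data
    record Glyph(String label, double x, double y) {}    // visual structure
    record View(List<Glyph> glyphs, double zoom) {}      // concrete view on screen

    // Step 1: data transformation (parsing, filtering, cleaning).
    static List<DataItem> prepare(List<String> rawLines) {
        List<DataItem> items = new ArrayList<>();
        for (String line : rawLines) {
            String[] parts = line.split(";");
            items.add(new DataItem(parts[0], Double.parseDouble(parts[1])));
        }
        return items;
    }

    // Step 2: visual mapping (data entities -> visual glyphs).
    static List<Glyph> map(List<DataItem> items) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            glyphs.add(new Glyph(items.get(i).label(), i * 10.0, items.get(i).value()));
        }
        return glyphs;
    }

    // Step 3: view transformation (glyphs -> a rendered view, here a plain value object).
    static View render(List<Glyph> glyphs, double zoom) {
        return new View(glyphs, zoom);
    }

    public static void main(String[] args) {
        List<String> raw = List.of("fileA;3.0", "fileB;7.5");
        View view = render(map(prepare(raw)), 1.0);
        // Interaction, e.g. zooming, feeds back into the view transformation:
        View zoomed = render(view.glyphs(), 2.0);
        System.out.println(zoomed);
    }
}
```

In a real tool the render step would of course produce graphical output, and interaction handlers could also trigger the earlier stages, for example to filter the data again.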


This statement expresses that the user of a modern visualization tool can explore the dataset on his own and that he is given the freedom to manipulate the provided views. Furthermore, it states that a visualization tool may be used to facilitate and accelerate the exploration process, but, on the other hand, it expresses that an intelligent user is also needed to perform the given tasks. Highly interactive features were not imaginable before the invention of graphical user interfaces, and they profit from the advances in related technology.

1.1.2 Related Fields

Scientific visualization, or SciVis for short, and geovisualization are disciplines that share a common goal with information visualization: they, too, support users in understanding and gaining insights from data. In scientific visualization the data is related to a physical model. It is primarily concerned with the representation of three-dimensional phenomena. Architectural, meteorological, medical, and biological systems typically produce spatial data—data with an inherent three-dimensional nature. The main goal is the realistic rendering of, for example, volumes, surfaces, and illumination sources. These are mostly linked to a dynamic component [66]. Researchers in SciVis use computer graphics for the computation of visual images. These can help to understand and analyze complex, often massive numerical representations of scientific concepts or results [117]. Geovisualization has many similarities with SciVis but is a slightly different discipline. The basic concept in this field is some kind of map in which the data is displayed. The book by MacEachren [114] is a good information source for the field of geovisualization. Information visualization, in contrast, has to deal with abstract and non-spatial kinds of data. These are, for example, data from engineering or software development, which are explored by the subdiscipline of InfoVis called software visualization. Over the last years, many attempts have been made to make the distinction between information visualization, scientific visualization, and geovisualization much clearer, but there always remains some kind of overlap between these disciplines. Information visualization is not unscientific and scientific visualization is not uninformative. Readers who are interested in additional literature should have a look at the work of Ed H. Chi [30].

1.2 Software Visualization

Software projects under version control have become a rich data source to be analyzed, explored, and of course visualized. Developers of large software systems are not only interested in the static structure of their system. Since configuration management systems exist, project managers also want to gain insights from the evolutionary couplings that arise during the lifetime of the system. They could even learn from history to draw conclusions for the future. With this information they could support software developers in implementing better source code [195]. But also the structure of the software project could be visualized to get an overview of the system. The many dependencies between software artifacts are what makes such systems so complex and very hard to maintain. Visualizations of the call graph, the inheritance tree, the abstract syntax tree, and other forms of such structural data can be used to derive interesting insights that could not have been found by just browsing through millions of lines of source code. Supporting all these tasks, which are absolutely necessary for software developers to produce a well-engineered system, is the goal of a branch of information visualization denoted by software visualization. John Stasko gave a definition of software visualization and expressed it as “the use of the crafts of typographic, graphic design, animation, and cinematography with modern human-computer interaction and computer graphics technology to facilitate both the human understanding and effective use of computer software” [148]. Stephan Diehl subclassifies the field of software visualization into three main subfields: the visualization of the program structure, the visualization of the program behavior, and the visualization of program evolution [43]. In this thesis we are focusing on the visualization of evolutionary couplings between software artifacts on different levels of granularity. For this reason we will present some tasks that have to be solved by techniques and tools from the domain of software evolution visualization.

1.2.1 Visualizing the Evolution of Software

Software systems evolve from the moment the very first source code is written. New functionality is added and hierarchically structured in programming-language-specific containers such as packages, classes, methods, or functions. Functionality can become deprecated and has to be replaced. Bugs may occur during the implementation phase of a software project and have to be fixed as soon as possible. Parts of the code have to be restructured and copied to other locations in the hierarchically organized system. All these tasks are very hard to manage if only the current, up-to-date state of the software system is available. This is the point where configuration management systems come into play. These systems are able to store the change data of text documents. The Source Code Control System (SCCS) [140] was the first source code revision control system that targeted this problem. Later, Walter F. Tichy developed the Revision Control System (RCS) at the beginning of the eighties [158, 160]. Version control is able to keep even large software systems well organized and maintainable; such systems nowadays typically contain millions of lines of source code spread over very many files organized in different directories and subdirectories. These are implemented by many developers all around the globe, who produce many versions of the system until it reaches its final state. Some years later, in 1989, Dick Grune developed the Concurrent Versions System (CVS), which is able to handle sets of files in one single repository; this addressed the bottleneck of RCS, which could manage only one file at a time. Even CVS has some restrictions when manipulating binary data or whole directories. An enhancement of this system was the SUBVERSION management system [151], which brought some improvements such as a more sophisticated handling of binary files. Many other configuration management systems have been developed, which are mainly based on the principles used in RCS and CVS.

Visualizing the evolution of software systems has become very important for software engineers and even more for the management team of software systems. The reasons for this are quite obvious. Understanding how the overall structure of the project changes, which parts of the system change very frequently, and which developers made the changes can help to save money, which is the most important reason for the software industry to visually explore the evolution of software systems. But even more complicated questions arise that target the analysis of dependencies in the code evolution. Trends, patterns, and anomalies during the evolution can be explored to provide suggestions for the developers and to improve their programming efficiency. On the one hand, the focus lies on the evolution of the whole system, but on the other hand, there is a need to explore the evolution of a particular software artifact or the dependencies among a restricted set of artifacts.

Software evolution visualization is a young field that has its genesis in the development of configuration management systems. Soon after their first appearance, researchers tried to analyze this new kind of data. Among the very first published tools in that domain were the SeeSoft tool [53] and the Graphical Analyzer for Software Evolution (GASE) [91]. Soon, many attempts were made to enhance visualization tools and develop novel techniques for this new data source. In Section 2.1 we give a short list of the most popular visualization tools in that field.

1.3 Graph Drawing and Graph Visualization

Graph Drawing (GD) is a branch of graph theory and is motivated by applications such as VLSI design or cartography. The aim of the field is the automatic layout and drawing of graphs. Like software visualization, it is also a subdiscipline of InfoVis.


Drawing a graph means to visually encode relations between a set of objects. The most frequently used graph drawing paradigm is the node-link approach, which can be explained very easily: place each object at a distinct location on screen and draw either a straight or a curved line between related objects. Though this approach seems very plausible, it may also suffer from visual clutter, which is caused by many edge crossings. Visual clutter is the result of a poor graph layout. Many edge crossings can confuse the viewer, but several other problems can also lead to misinterpretations of the graph data. For this reason, much research has been done on sophisticated algorithms with the goal of being time-efficient and of applying several aesthetic criteria. The need for GD is driven by many real-world examples of relational data. Graphs or networks are part of our everyday life. Social networks consist of a set of persons where each person is linked to a subset of friends—persons that are related to him or her. Human beings are related to other human beings. The family tree represents relationships among people that may stem from marriage or birth. Computers are connected and enable such a mighty invention as the Internet. There are much more complex examples where relations occur. Protein-protein interaction is analyzed by researchers in bioinformatics to understand the complex processes that happen in an organism. All these and many more application domains make graph drawing an absolutely necessary field of research. The first book that was solely dedicated to graph drawing was written by di Battista et al. [42]. Apart from representing graphs as node-link diagrams, we could also use a more compact visualization such as the adjacency matrix, which is very effective for dense graphs. Visual clutter is reduced to a minimum, but one drawback of this technique is its weakness for path-related tasks. The reason for this difficulty lies in the representative elements for each node—one in a row and one in a column. So far we have talked about static graphs, that is, graphs that do not change over time. But in many situations we are confronted with dynamic graphs, which change over time. This time-based relational data is an additional challenge for both the developer and the viewer of the visualization tool. Animation plays an important role for representing dynamic graphs but may also impose a high cognitive effort on the viewer.
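The following small Java sketch contrasts the two representations just mentioned for one and the same directed graph: a node-link view, where each node is assigned a screen position and every relation becomes a drawn line, and an adjacency matrix, where every node occupies one row and one column and a relation is a marked cell. The class and data in the sketch are hypothetical and only illustrate the general idea; they do not stem from any tool of this thesis.

```java
// Illustrative sketch contrasting a node-link representation (positions plus
// drawn lines) with an adjacency matrix (one row and one column per node).
import java.awt.geom.Point2D;
import java.util.LinkedHashMap;
import java.util.Map;

public class GraphRepresentations {
    public static void main(String[] args) {
        String[] nodes = {"A", "B", "C"};
        int[][] edges = {{0, 1}, {1, 2}, {2, 0}};   // directed edges by node index

        // Node-link: assign each node a distinct position, draw one line per edge.
        Map<String, Point2D.Double> layout = new LinkedHashMap<>();
        for (int i = 0; i < nodes.length; i++) {
            double angle = 2 * Math.PI * i / nodes.length;   // simple circular layout
            layout.put(nodes[i], new Point2D.Double(Math.cos(angle), Math.sin(angle)));
        }
        for (int[] e : edges) {
            System.out.printf("draw line %s -> %s%n",
                    layout.get(nodes[e[0]]), layout.get(nodes[e[1]]));
        }

        // Adjacency matrix: edge crossings disappear, but following a path means
        // hopping between cells, which explains the weakness for path-related tasks.
        boolean[][] adjacency = new boolean[nodes.length][nodes.length];
        for (int[] e : edges) {
            adjacency[e[0]][e[1]] = true;
        }
        for (int row = 0; row < nodes.length; row++) {
            StringBuilder sb = new StringBuilder(nodes[row] + " | ");
            for (int col = 0; col < nodes.length; col++) {
                sb.append(adjacency[row][col] ? "X " : ". ");
            }
            System.out.println(sb);
        }
    }
}
```

The circular layout keeps the example short; real graph drawing algorithms would additionally optimize the node positions with respect to the aesthetic criteria mentioned above.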

1.4 Contribution

The main contribution of this thesis addresses software evolution visualization as well as dynamic graph visualization.


Nowadays, software systems are developed by several hundred programmers at several places around the world in different time zones. The distributed work has to be coordinated to prevent errors whose debugging can be very time consuming and hence very expensive. Typically, the data produced by a software developer is stored in a software archive or repository, as presented in Section 1.2.1. Configuration management systems do not only store the latest version of the system but all versions at all check-in times, so that an earlier state of development can be reproduced. Analyzing this flood of data by hand would be a daunting task. The field of visual data mining provides a solution for this problem. In this thesis we introduce several techniques for visualizing rules extracted from software archives. Case studies for several open source software projects under version control show the usefulness of our visualization tool called EPOSee. These code-centric approaches are very useful for finding interesting patterns and anomalies in the development of open source software projects. This offers a possibility to examine the relations between software artifacts on different levels of granularity. In a different, developer-centric approach we tried to gain interesting insights into the behavior of several developers during the development process of the software system.

The major problem of our evolutionary visualization approach was the aggregation of data over time. Though the resulting visualization gave us insights into the complex dependencies of software artifacts such as files or methods, it remained unsatisfying. In the scope of this thesis we are considering evolutionary couplings of software systems, but the single phases during evolution cannot be explored when aggregating over all transactions. For this reason we developed further visualization techniques for detecting development trends, counter-trends, phases of stagnation, and very frequently committed artifacts, to mention just four kinds of findings. We introduce three different visualization approaches for the representation of sequences of transactions in information hierarchies and dynamic compound digraphs. The first one, called TimeArcTrees, uses traditional node-link diagrams to show the graph sequence and the hierarchical ordering of the graph nodes. To avoid the visual clutter that is caused by lots of edge crossings, we developed the Timeline Trees visualization technique, which uses rectangular boxes for the graph sequence instead of colored arcs. The third idea, called TimeRadarTrees, was developed under the assumption that a user might better understand radial representations of this type of data. We evaluated the Cartesian and the radial counterpart of the tools in both an eyetracking and an online study.

The visualization of dynamic compound digraphs in a single view has two major benefits over traditional dynamic graph drawings that use smooth animation. The first aspect is time complexity. Animation-based dynamic graph drawing requires sophisticated algorithms because of time complexity challenges, be it with online or offline approaches. The reason for the time consumption is the computation of graph layouts that minimize layout changes between subsequent graphs in the sequence. In the novel visualization, the graph nodes are laid out with respect to their hierarchical ordering or on the user's demand. Then the visual representations of the edges of a graph sequence are drawn in their corresponding positions. The strength of the novel technique is that layout changes are not needed; typically these would imply a significant effort for a viewer trying to preserve his mental map. The second aspect is the good overview of time-based data, which is difficult to get with animated visualizations because the cognitive effort for a viewer grows immensely.

Some of the visualization work presented herein is a collaboration with several other researchers. The EPOSee tool was implemented by myself in my diploma thesis and enhanced in the first few months of my PhD years. My idea for graphically displaying sequence rules, where the single items are attached to elements of an information hierarchy or taxonomy, was called Trees in a Treemap. Peter Weißgerber and Mathias Pohl developed the file-author matrix and the dynamic author-file graph, respectively, in a joint work. The TimeArcTrees, Timeline Trees, and TimeRadarTrees visualization tools have been developed in an excellent collaboration with two students of computer science: Martin Greilich implemented the TimeArcTrees approach and Fabian Beck the space-filling and Cartesian Timeline Trees visualization. The radial counterpart called TimeRadarTrees was developed by myself, and the evaluation with an eyetracking study was managed by Felix Bott and Rainer Lutz.

1.5 Outline

In this thesis we propose several visualization approaches to explore evolutionary couplings between software artifacts on different levels of granularity. Static, overview-based visualizations of dynamic relational data support a user in better understanding phases of evolution and in detecting trends or counter-trends. Furthermore, we illustrate the usefulness of our visualization techniques with applications from very different application domains. Both an eyetracking and an online experiment give us insights into how users interpret Cartesian and radial visualizations for sequences of transactions in information hierarchies. These studies show us possible difficulties in understanding the totally different representations.

The remainder of this thesis is organized as follows:

In Chapter 2 we will give an overview of software evolution visualization tools and also some related work on graph and tree visualization approaches, as well as on compound graph representations. Existing approaches for the visualization of time-based data and dynamic graphs are also presented in this chapter. The evaluation of visualization tools has become very important, but only a few researchers illustrate the usefulness of their techniques by a sophisticated study. Thus, related work on comparative studies is also given in this chapter.

Chapter 3 explains how evolutionary coupling data can be mined for association and sequence rules. The second part of this chapter shows how this data can be visualized. Interesting patterns and anomalies that are detected by using the visualization techniques are presented and further discussed. These code-centric visualizations focus on the evolutionary couplings of source code. In a different work we tried to explore how developers behave during the development of the software system, which can be regarded as a kind of author-centric visualization.

In Chapter 4 we give a mathematical model for measuring the strengths of transactions and edge weights. This is necessary because expanding and collapsing subhierarchies leads to aggregated values, and aggregating graphs in the sequence means adding up measures. The chapter is concluded by showing how a transaction can be transformed into a directed graph. Also, an undirected graph can be transformed into a directed one, whereas the opposite direction—directed to undirected—would in general lead to an information loss (a small illustrative sketch of this conversion follows this overview).

Chapter 5 shows three different approaches to visualize sequences of transactions in information hierarchies and dynamic compound digraphs. TimeArcTrees uses a node-link diagram to show the information hierarchy and the graph sequence, Timeline Trees uses a conventional node-link diagram for the hierarchy and a space-filling representation for the graph edges, and finally, TimeRadarTrees is the radial counterpart of the Cartesian Timeline Trees approach. It makes use of a radial node-link tree and circle sectors instead of rectangular boxes.

A comparative study of the Cartesian and the radial variant of our visualization tools is discussed in Chapter 6. An eyetracking study helps to understand the behavior of voluntary participants when solving certain tasks. Also, an online study by means of a JAVA applet gave us further interesting insights.

Chapter 7 surveys different static and dynamic graph visualization paradigms. Furthermore, we discuss some aesthetic criteria for drawing static and dynamic graphs and explain the pros and cons of the techniques with respect to either static or dynamic visualizations.

Chapter 8 shows which problems occurred when first acquiring the data, second visualizing the abstract data, and third evaluating the developed visualization techniques. We also focus on future work in this chapter.
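As announced above, the following tiny Java sketch illustrates the generic observation that undirected edges can be mapped to directed ones without information loss, while dropping edge directions cannot be undone. It is only a hedged illustration of that general statement, not the formal model developed in Chapter 4; all names are hypothetical.

```java
// Generic illustration: each undirected edge {u, v} becomes the two directed
// edges (u, v) and (v, u), so no information is lost; the reverse conversion
// discards orientation. This is not the formal model of Chapter 4.
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class EdgeDirections {
    record Edge(String from, String to) {}   // a directed edge

    // Undirected edges are given as two-element lists {u, v}.
    static Set<Edge> toDirected(List<List<String>> undirected) {
        Set<Edge> directed = new LinkedHashSet<>();
        for (List<String> e : undirected) {
            directed.add(new Edge(e.get(0), e.get(1)));
            directed.add(new Edge(e.get(1), e.get(0)));   // add both orientations
        }
        return directed;
    }

    public static void main(String[] args) {
        List<List<String>> undirected = List.of(List.of("a", "b"), List.of("b", "c"));
        System.out.println(toDirected(undirected));
        // Output contains (a,b), (b,a), (b,c), (c,b): the orientation is fully preserved.
    }
}
```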

“The secret to creativity is knowing how to hide your sources.” — Albert Einstein (1879-1955)

CHAPTER 2 Related Work

The problem of visualizing directed or undirected relations among a set of objects has become one of the main challenges in information visualization, and more specifically in the domain of graph drawing, over the last years. Moreover, the ever-growing datasets place higher and higher requirements on visualization tools with respect to a readable representation of the data.

To display the data efficiently, many sophisticated layout algorithms have been developed to minimize visual clutter, to improve the graphs with respect to aesthetic criteria, and hence to improve the understandability of graphs in a node-link visual metaphor. Most of these algorithms only deal with static graphs, those that do not change their structure over time. An even larger challenge is to generate a nearly optimal layout of a sequence of graphs by means of animated node-link diagrams. The situation worsens further when a hierarchical ordering on the graph nodes is allowed, which restricts the layout of the nodes considerably and leads to additional requirements on the layout algorithm.

The aforementioned problem occurs in many different application domains. One part of this thesis covers some approaches from the field of software evolution visualization. Software projects are typically organized in a hierarchical structure that is made up of software artifacts, for example, directories, files, or methods. These artifacts themselves may show some relations among each other. The type of relation depends on the type of metric that has to be examined. A call graph or an
inheritance tree could be the result of such an examination. The main problem arises when we have a look at the evolution of the whole project. Not only can the structure of the hierarchy change but also the relations between the software entities. If there is a weight function on the graph edges, relations may become weaker or stronger from time to time and exhibit trends or counter-trends.

In this chapter we will discuss some existing graph visualization methods and tools that have their main focus on software data and software evolution data. A special kind of graph, namely trees, has also been represented by a variety of visualization techniques. These include classical node-link diagrams, nested sets, layered icicle diagrams, indentation, and nested parentheses. Time-based data requires a time line that declares points in time to distinguish between older and newer data. Static and dynamic visualizations are two very different types of representation for time-based data. The strength of relations can vary immensely or just imperceptibly throughout an evolutionary process. We will present some existing research that tries to tackle this problem by using smooth animations. Only a handful of researchers so far have developed techniques to visually explore compound graphs and, to the best of our knowledge, no research has been published that presents an idea for visualizing dynamic compound digraphs in a single view without the use of an animated graph sequence.

2.1 Visualization Tools for Software Evolution

The visualization of evolutionary data from software archives has been addressed by many researchers in the last years. The ever-growing amount of data that is stored in a software repository makes it very important to develop tools that can support software developers in detecting patterns or anomalies in the flood of data. We will start the section on related work with some existing visualization tools that point in this direction. We distinguish between tools that visualize the evolution of the dependencies of software artifacts with a focus on the source code and those with a focus on the developers of the code. Some work is very hard to categorize in this way because both approaches are strongly intertwined with each other.

2.1.1 Code-Centric Approaches

The SeeSoft tool [53] is definitely one of the earliest representatives of software evolution visualization. SeeSoft allows exploring 50,000 lines of code simultaneously. To achieve this goal, each line of source code is mapped to a thin pixel line. The color coding depends on the statistic of interest. The most recently
changed code lines could be represented in a red color whereas the least recently changed ones are indicated by a blue color, which makes it very intuitive to separate newly and previously implemented code fragments. SeeSoft is also able to visualize the structure of a system by applying a treemap visualization, which is also possible in the Xia tool [184]. The main difference is that, with Xia, the number of changes or the last committer's name as well as the timestamp can be visually encoded and are put in the context of the structure of the whole system.

Some other tools have been implemented that are inspired by the ideas in SeeSoft. The CVS Activity Viewer [67] and the later Augur tool [68] use similar visualization techniques that map code lines to pixel lines and provide a visual combination of software entities and the activities of developers.

The Evolution Matrix [109] can be used to detect different phases during software evolution such as growth, stagnation, or shrink phases. Two-dimensional boxes are arranged in a matrix view to show single files on the horizontal axis version by version. The width and the height of the rectangular boxes indicate metric measurements of the classes, for example, the number of methods, the number of instance variables, or the like.

The Evolution Spectrograph [183] uses a quite similar approach. Files are arranged on the vertical axis and time on the horizontal one. A new version of a checked-in file is encoded by a green colored glyph that is placed in the line representing this file. Its horizontal position depends on the specific point in time that is mapped to the horizontal axis. As time passes, the green color fades to white until the next commit including this file occurs.

To understand the differences between two subsequent versions, one can use the Hipikat tool [38]. Bug reports, changes, emails, or documentation are software artifacts that can be visualized by the tool and are used to show relationships between versions.

The RelVis tool [130] uses Kiviat diagrams to visually encode a number of software metrics and the relationships between software entities. A Kiviat diagram, also known as star plot or spider web, is shown for each entity and the related entities are linked by filled rectangles drawn in the background. The Kiviat polygons are color coded differently to also show the changes between several releases.

There are also some tools that visualize dependencies of software artifacts. GEVOL [34] uses an animated graph drawing technique to represent large graphs that have a temporal component. These graphs can give insights into the evolutionary process of a software system on a more structure-based level. The developers of the GEVOL technique used a bytecode analyzer to generate the graph sequence. Figure 2.1 shows a call graph of a software project that evolves over time.

Figure 2.1: GEVOL [34] uses a graph drawing technique to represent large graphs that have a temporal component. (Courtesy of Stephen Kobourov)

CodeConnections is a tool that uses the yFiles [187] development kit to visualize data from software archives as node-link diagrams. The view consists of two panes: the top pane shows the structure of the project at one point in time together with color coded programmer names, and the navigational control is represented in the bottom pane. The whole representation serves as overview and detail.

Though SHriMP [150] cannot display evolution dynamics, it can be used to explore software hierarchies and several relationships between software entities. Nested graphs show the hierarchical order and colored arcs represent different relationships that link to the hierarchical elements.

The Evolution Radar [40] is an interactive and radial visualization technique that uses file-level and module-level information about logical couplings. The module in focus is placed in the circle center and the rest of the project is represented as circle sectors with their size depending on the size of the corresponding module. The coupling information of the focused module with the others can be gained from the distance of colored circles from the circle center.

The Gantt chart paradigm [70] is used in Historian [90] to show a file-level visualization of the evolutionary process of a software system. Unfortunately, the visualization tool does not scale to large projects.

CVSscan [172] uses a line-oriented representation for code changes. The horizontal axis encodes time whereas the columns on the vertical axis indicate the version of a file. A number of metrics as well as the source code can be shown in separate linked displays. CVSscan is similar to HistoryFlow [170], which also shows code evolution on a horizontal timeline that is separated into versions.

Very simple chart visualizations are provided by softChange [75, 76]. The researchers plot software metrics such as lines of code or number of files vs. time.
The softChange tool aims at supporting a user in answering typical questions that occur during open source software development.

2.1.2 Author-Centric Approaches

CodeSaw [79] is a social visualization tool that helps to explore distributed software development. Two independent perspectives are visualized: code repositories on the one hand and project communication on the other. CodeSaw shows peaks and valleys of productivity in a programmer's life and hence belongs to the author-centric visualization approaches.

Stargate [125] also focuses on developers and hence puts them into the center of the visualization. The structure of the system is represented by a radial layered icicle. Developers are grouped into clusters and placed inside the circular area close to the file locations they work on most. Developers are linked when they communicate via email. Trends in the developers' behavior can be uncovered very easily, as shown in Figure 2.2.

Figure 2.2: Ogawa and Ma [125] developed the Stargate tool for visualizing developer activity and the system structure simultaneously. (Courtesy of Kwan-Liu Ma)


The Growing Bloom [105] visualization uses concentric pie charts to represent the evolution of the source code and comment contributions of individual implementers to open source software projects.

In Chapter 3 we introduce the EPOSee tool [23, 24], a code-centric visualization technique to represent relations between software entities. In other works we also developed author-centric visualizations such as the transaction overview, the file-author matrix, or the dynamic author-file graph [132, 179].

2.1.3 Three-Dimensional Approaches

Other tools try to avoid the scalability problems by adding a third dimension. VRCS [108] and 3DSoftVis [137] show an additional time axis. In VRCS the version information about a history file managed by RCS is shown as a two-dimensional tree that typically changes over time. 3DSoftVis offers three kinds of visualizations for analyzing the evolution of software systems: the system structure as three-dimensional trees in a balloon layout, the three-dimensional history view of one subsystem as a sequence of two-dimensional trees, and the two-dimensional view for module evolution in a certain subsystem.

Three-dimensional visualizations of software data are difficult to create because software data has no inherent spatial structure. Hence, the visual mapping of the data to three dimensions has to be handled with care. When the data is not encoded the right way, it may be a daunting task to read and explore the visualization in full detail and to recover the encoded data. Occlusion problems and a wrong mapping of data points in the three-dimensional space may lead to misinterpretations. Animation could mitigate this situation a little bit, but it typically leads to a high cognitive effort for a viewer.

The source viewer 3D, or sv3D [115, 116] for short, is based on the ideas used in the two-dimensional SeeSoft tool and supports the visualization of large-scale software. The sv3D framework renders data from software systems as poly cylinders that show software artifacts and containers indicating aggregations of these artifacts. Software metrics are encoded in the height and color of the cylinders.

Wettel and Lanza [181] implemented a visualization tool called CodeCity that is based on a city metaphor. Classes are depicted as buildings and packages as districts. A user of their technique can easily explore even large software systems by taking a walk through this software city. The visualization benefits from the fact that a user gets more and more familiar with the city environment by inspecting the buildings, and hence, he gets familiar with the software system as well, see Figure 2.3.

The EPOSee tool [23, 24] integrates several three-dimensional bar chart visualizations to encode two kinds of metrics simultaneously as color and height of the bars in one view. Both metrics—typically support and confidence of mining rules from software archives—can interactively be exchanged with each other.


Figure 2.3: Wettel and Lanza [181] developed the CodeCity visualization tool that is based on a city metaphor. This figure shows the class-level disharmonies of the ARGOUML system. (Courtesy of Richard Wettel)

2.2 Graph Visualization

The field of graph visualization focuses on the efficient representation of relational data. The prevalent visual metaphor in that field is definitely the representation of graphs in a node-link style. A good overview of graph visualization tools and techniques can be found on visualcomplexity's webpage [171]. Actually, there are as many real-world applications for static graphs as for graphs that change over time. In the context of this thesis, we denote this type of graph as dynamic. The visualization of this kind of data gets even more problematic when graphs are very dense and a representation as a node-link diagram makes it impossible to understand and explore the relational structure given in the graph sequence. Even after applying elaborate layout algorithms, we can seldom get rid of the phenomenon that visualization researchers refer to as visual clutter.

A very common, space-filling visualization technique for graph data is the so-called matrix representation, which has some advantages with respect to visual clutter, but mainly suffers from other drawbacks. A matrix-based representation is also problematic when visualizing sequences of graphs in a single view. Good research on the comparison of the readability of graphs using node-link and matrix representations can be found in the work of Ghoniem et al. [77, 78]. Keller et al. [104] point out that matrices have many benefits over traditional node-link diagrams when visualizing very dense graphs and that visual clutter, which is typically caused by lots of edge crossings in the node-link representation, is reduced to a minimum.


A third approach to visually encode a graph is given by the so-called list representation, which shows, for each element, all adjacent elements as a list. This approach is also of limited use. A more detailed exploration of the usefulness of and the aesthetic criteria for different graph drawing approaches can be found in Chapter 7.


Figure 2.4: Three different representations for the same graph: (a) node-link diagram; (b) matrix representation; (c) list representation.

Figure 2.4 shows three different representations for a small weighted digraph example: node-link, matrix, and list representations. In the following subsections we will discuss some related work that uses either one of them or a mixture of two of them.
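To make the three encodings tangible, the following minimal sketch (not taken from any cited tool; the tiny digraph and its weights are made up for illustration) builds an adjacency matrix and an adjacency list from the same weighted edge list that a node-link diagram would draw directly:

    # Illustrative weighted digraph as an edge list: (source, target, weight).
    edges = [("A", "B", 2), ("A", "C", 1), ("B", "C", 3), ("C", "A", 1)]
    nodes = sorted({v for e in edges for v in e[:2]})
    index = {v: i for i, v in enumerate(nodes)}

    # Adjacency matrix: rows are source nodes, columns are target nodes.
    matrix = [[0] * len(nodes) for _ in nodes]
    for src, dst, weight in edges:
        matrix[index[src]][index[dst]] = weight

    # Adjacency list: each node maps to its outgoing neighbors with edge weights.
    adjacency = {v: [] for v in nodes}
    for src, dst, weight in edges:
        adjacency[src].append((dst, weight))

    print(matrix)     # [[0, 2, 1], [0, 0, 3], [1, 0, 0]]
    print(adjacency)  # {'A': [('B', 2), ('C', 1)], 'B': [('C', 3)], 'C': [('A', 1)]}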

2.2.1 Node-Link-Based Representations

Node-link diagrams are widely used for graph representations. The strength of those visualizations is their intuitiveness when interpreting a set of elements that are connected by either straight or curved lines indicating some kind of relationship. One early example of this method is the famous historical problem of the 'Seven Bridges of Königsberg'. In 1735 Leonhard Euler, a pioneer in the domain of graph theory, gave a solution to this mathematical problem [57]. The term 'graph' was introduced much later, in graphical notations of chemical structures in 1878, by the mathematician James Joseph Sylvester [153].

In graph theory, a graph is modeled by a set of elements, which are also called vertices, on the one hand, and a set of edges that express relations among those vertices, on the other hand. These edges can be visually encoded by straight lines or arcs in case the two corresponding elements are related. The shape of these visual elements may also matter. Layout algorithms have to be aware of the visual encoding of the graph's elements to avoid occlusion problems in the final layout. Nowadays, we are talking about node-link diagrams when we are referring to this visual representation of a graph. Node-link is just one visual metaphor for the
mathematical graph structure among others. Thus, the question arises which is the most efficient visual representation for this kind of data structure.

The major drawback of node-link diagrams is obvious. The more nodes and the more links are displayed, the more likely are link crossings, and finally, the more opaque a diagram will become. This problem is well known and is called visual clutter, a degradation of the visual perception. The measurement and reduction of visual clutter has ever since been a very important task for node-link representations.

Clutter can be a very confusing phenomenon in the domain of information visualization. Mostly it is caused by too many objects on a too small display. The American Heritage College Dictionary [156] defines clutter as "a confused or disordered state, caused by filling or covering with objects". This is a very interesting statement because it expresses that clutter is not only caused by having too many objects on the screen. It may also be a phenomenon of a small number of objects that are disordered and can result in a state of confusion for a user of a visualization tool.

Rosenholtz et al. [141] give another definition for visual clutter: "Clutter is the state in which an excess of items, or their representation or organization, lead to a degradation of performance at some task".

The work of Edward Tufte [162, 163, 164] provides another idea of what visual clutter can refer to. The appearance of clutter does not solely depend on the density of visual elements. Clutter can also be seen as anything that causes confusion to a user of a visualization. Tufte further points out that large datasets are not the only cause for visual clutter. A wrong design in the visual mapping can also lead to a cluttered display and hence to a confused user.

In the following sections we will present some techniques that have the common goal of minimizing visual clutter in node-link representations. This is also one of the goals of our TimeRadarTrees visualization technique that will be discussed in Section 5.3, but it uses a different visual metaphor for dynamic graph data.

An absolutely novel concept to address this problem was introduced by Holten and van Wijk [94]. The researchers developed a force-directed edge bundling approach for graphs that is self-organizing. For the bundling of edges, they no longer need a hierarchy or a control mesh to arrange the bundling, as was necessary in their previous idea, which addresses hierarchical edge bundling, see Figure 2.5.

Figure 2.5: Holten and van Wijk [94] developed the force-directed edge bundling for graphs that is a self-organizing approach. (Courtesy of Danny Holten)

The bundling approach was already used by Phan et al. [128] in the flow map layout. A hierarchical clustering of the node set, positions, and flow data was generated and supported the developers in routing the graph edges. The major problem with this technique was the limitation to a binary clustering, which means that all edges can only split into exactly two new edges.

Some approaches try to group node sets before drawing edges between individual nodes. Visual clutter is normally reduced in these so-called clustered graphs. Eades et al. [49, 62] and Kaufmann and Wagner [102] present some techniques to draw clustered graphs.

With the goal of visualizing network data as node-link diagrams with minimized visual clutter, Becker et al. [8] only draw half of an edge. The idea is based on the Gestalt visual principles [107] of continuity and closure, which state that visual elements can be perceived as a whole even if just parts of them are visible. Without an evaluation of their idea, it is hard to judge whether this approach is really useful, but it cannot be dismissed that it in fact reduces visual clutter.

2.2.2 Matrix-Based Representations

Classical node-link diagrams can easily be transformed into a matrix representation, which is also called an adjacency matrix. Nodes are represented twice, horizontally as well as vertically, which is in fact the most important drawback of this very compact technique for dense graphs. The existence of an edge can, for example, be expressed by an integer matrix entry. The value of the entry informs about the weight of the relation between the corresponding elements at the row and the column of this matrix cell. The connection of two nodes can be grasped from the fact that this cell is the intersection point of the corresponding row and column. For reasons of scalability, the matrix cells are scaled down to pixel-sized entries that are color coded according to an edge weighting function. On a 1024 by 768 screen
resolution up to 786,432 relations can be represented in a single view and hence provide a great uncluttered overview of a weighted directed graph. The visualization of a directed graph with edge weights as a color coded adjacency matrix makes the graph structure very clear with respect to visual clutter. But the presence of two representative elements for the same node makes it very difficult to track a path in the graph. The eye has to jump to and fro between two successive nodes of a path in this visual encoding. Without a sophisticated interaction mechanism it is also very difficult to find an answer to the question which nodes are linked and what is the weight of a relation in such a pixel-based representation. Some techniques have already been implemented that try to tackle this problem. The Zoomable Adjacency Matrix Explorer (ZAME) [54, 59] is a visualization tool that is based on a matrix representation. ZAME is able to handle millions of nodes and edges. Such ranges are typical for datasets generated from protein-protein interaction networks or from the links in the World Wide Web. The tool provides zooming and panning techniques, which can be used to interactively change from overview to detail, see Figure 2.6.

Figure 2.6: The Zoomable Adjacency Matrix Explorer (ZAME) [54, 59] is a visualization tool that is based on a matrix representation. (Courtesy of Niclas Elmqvist)
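The pixel-scale color coding described above can be sketched as follows; this is only an illustration, not code from any of the cited tools, and the blue-to-red ramp and the example weight matrix are assumptions:

    # Each matrix cell becomes one pixel whose color is interpolated between
    # blue (small weight) and red (large weight); white marks missing edges.
    def weight_to_rgb(weight, max_weight):
        t = weight / max_weight if max_weight else 0.0
        return (int(255 * t), 0, int(255 * (1 - t)))  # blue -> red ramp

    def matrix_to_pixels(matrix):
        max_weight = max((w for row in matrix for w in row), default=0)
        return [[weight_to_rgb(w, max_weight) if w else (255, 255, 255)
                 for w in row] for row in matrix]

    pixels = matrix_to_pixels([[0, 2, 1], [0, 0, 3], [1, 0, 0]])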

The underlying structure of a graph or a network that is represented as a pixel-based matrix can be better explored by applying sophisticated algorithms that permute the rows and columns of the matrix. Since the number of all possible permutations grows factorially with the number of nodes, this is a daunting task. Finding
an optimal layout of the graph nodes in the matrix is strongly related to the so-called Optimal Linear Arrangement problem (OLA) that is described, for example, in [73]. Becker et al. [8] use such a permutation approach to reveal the underlying structure of a network. Also in the ZAME tool [54, 59] the researchers have developed a fast and automatic reordering mechanism, which can find a good layout of the represented graph. A reordering of a matrix can help to detect subsets of nodes that are all related to each other, which is typically denoted by the term clique in graph theory. Moreover, one could find nodes that only have outgoing edges, which are called sources in graph theory.

However, path-related tasks are the weakness of matrix-based representations. The MatLink visualization tool [87] offers interactive features to support a user with path-related tasks in matrix visualizations. A linear node-link diagram on top of the matrix shows shortest paths between selected nodes.

Another hybrid representation is given by NodeTrix [88]. The authors distinguish between global sparsity and local density in social networks. Their approach targets the visualization of dense subnetworks with a matrix-based representation. Sparse subnetworks remain readable with a node-link diagram.

The MatrixExplorer [86] also combines matrices and node-link diagrams. The tool provides two combined views where a user can apply queries in one of the views. The other view is updated in a way that it displays the same information but in a different visual encoding—a principle that visualization researchers refer to as linking and brushing. The authors conjectured that matrices are generally used to permute, filter, and cluster nodes or node groups in the network and the results are then explored in the corresponding node-link diagram, which provides a smaller and sparser graph.

Inspired by metro maps, Shen and Ma [143] integrated path visualizations into a matrix-based visualization. They argued that dense graphs can be better analyzed by matrices whereas path-related tasks are difficult to solve. In their approach they try to minimize path crossings and path overlaps. Furthermore, paths can be filtered out.

In this thesis we will also present a matrix-based and interactive visualization technique in our EPOSee tool [23, 24] in Section 3.4. Furthermore, we integrated our pixelmap visualization technique into an Eclipse PlugIn [180], which also has many interactive features to explore evolutionary couplings between software artifacts.
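As a minimal illustration of the reordering idea (the permutation here is chosen by hand; real tools such as those cited above compute it with heuristics), applying the same permutation to rows and columns moves mutually related nodes next to each other so that they show up as a dense block:

    # Reorder rows and columns of an adjacency matrix with one permutation.
    def reorder(matrix, permutation):
        return [[matrix[i][j] for j in permutation] for i in permutation]

    # Nodes 0 and 2 are mutually connected in this tiny, made-up example.
    m = [[0, 0, 1],
         [0, 0, 0],
         [1, 0, 0]]
    print(reorder(m, [0, 2, 1]))  # [[0, 1, 0], [1, 0, 0], [0, 0, 0]]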

2.2.3 List-Based Representations

A space-efficient graph visualization can be obtained by an adjacency list representation. For a vertex X of an undirected graph, we store in a list all vertices that are linked by an edge to this vertex X. For the case of a directed graph, we
only store in X's list those vertices that are reached by an edge starting at this vertex X. Figure 2.4(c) shows a small example of such a list representation. Adjacency lists do not need a special ordering of the vertices, which is the main reason that makes such a visualization very difficult to read. A list representation also has some benefits. Vertices that have many outgoing edges are very easy to detect, but the opposite direction is much harder to explore: finding the vertex with the maximum number of incoming edges is practically impossible in larger graphs. A list-based graph visualization tool can only be useful when it supports the user with interactive features. Maybe the rarity of these kinds of visualization tools can be explained by the unintuitive representation of the graph. Anyhow, in the work of van Rossum [167], the author of the popular Python language, a hash table is used that stores, for each vertex, an array of its adjacent vertices. Cormen et al. [37] also use an array that is filled with indices; these point to singly linked lists that indicate the neighbors of each vertex.
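A minimal sketch of such a hash-table-based adjacency list (a Python dictionary with made-up vertex names) also illustrates the asymmetry described above: outgoing edges can be read off directly, while finding the vertex with the most incoming edges requires scanning every list:

    from collections import Counter

    adjacency = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # illustrative digraph

    out_degree_of_a = len(adjacency["A"])          # direct lookup: 2 outgoing edges
    in_degrees = Counter(v for targets in adjacency.values() for v in targets)
    most_incoming = in_degrees.most_common(1)[0]   # ('C', 2), found only by a full scan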

2.3 Tree Visualization

As trees are a subset of graphs, they can also be visualized by existing graph visualization tools, but they have one big advantage over general graphs: trees belong to the class of planar graphs, graphs that can be represented in a node-link diagram in the two-dimensional plane without any edge crossings. This characteristic feature relaxes the situation for a visualizer considerably. Moreover, a researcher can use space-filling nested techniques as used in the treemap visualization [99] or its numerous modifications [17, 145, 168]. In this thesis we have also used some tree visualizations to show an underlying hierarchical structure if one exists in the dataset. In the next sections we will introduce some existing techniques for tree or hierarchy visualization.

2.3.1 Node-Link Representations

First of all, trees can be represented by a conventional node-link diagram. Doing so, we soon reach limitations due to scalability problems. Typically, the number of nodes in a tree level grows exponentially with the depth of the tree. This means that, if we apply a top-down approach where the root node is positioned near the upper border of the view and the leaf nodes are positioned at the lower border, nodes appear more and more crowded following the vertical axis downwards. In the upper area of the display we waste much space that we would need for the leaf nodes. Many approaches have been developed to illustrate tree structures with node-link diagrams. The flood of existing techniques can be explained by this extremely
common data structure, for which we can list very many examples from very many different fields of research: file systems in computer science, evolutionary trees in biology, and product hierarchies in department stores are just three instances.

Besides the top-down or left-right layouts, some other, more space-efficient techniques have been developed. Balloon or bubble layouts [14, 82] try to overcome the space limitations by placing the root node in the center of a circle and its direct descendants or children on the circle circumference. The same layout algorithm is then recursively applied to the children until the leaves of the tree are reached. The drawback of this technique is the small space for large subhierarchies in a deeper level of the tree and the different edge lengths in the bubbles due to the different radii of the circles. As a benefit we highlight that the hierarchical structure can be uncovered much better than in other node-link approaches.

The Ringed Interactive Navigation Graph System (RINGS) [155] makes use of such a circular balloon layout. In this tool, different properties can be visualized by mapping a color coding to the representation. The focus can be set to a different child that will then move to the circle center and change the tree representation. Smooth animation is used to preserve a user's mental map when applying this kind of operation. Also Lin and Yen [111] analyzed balloon drawings of rooted trees. They give some algorithmic solutions with respect to different layout models and optimization criteria.

Radial layout approaches are somewhat related to balloon layouts, but here the leaves of a tree are mapped to just one circle. The root node is again placed in the circle center and the children are successively drawn on concentric circles according to their depths in the tree. One can get a good overview in the work of Battista et al. [42], Eades [48], and Herman et al. [89]. A radial layout is used in our technique called TimeRadarTrees in Section 5.3 to illustrate the hierarchical structure of the nodes in the graph sequence.

A space-efficient method for drawing binary trees is the H-tree technique that is commonly used in VLSI design [16]. Horizontal and vertical orthogonal splits are responsible for the tree substructures that look very similar to the letter 'H'.

The major problem when visualizing large tree structures is scalability. In the very young field of bioinformatics, genomic analyses often generate hierarchical structures of genes or proteins in the range of tens of thousands of nodes and edges. The TreeJuxtaposer by Munzner et al. [122] addresses the problem of comparing trees with more than a hundred thousand nodes, a range that is typical for phylogenetic trees, which are explored in the field of biology or bioinformatics. The researchers introduce the concept of guaranteed visibility—highlighted elements have to be trackable all the time during the exploration process.

Holten and van Wijk [93] visualize relations between matching subhierarchies using their hierarchical edge bundling technique, which additionally reduces visual clutter;
see Figure 2.9 for a compound graph visualization represented in this visual metaphor.

2.3.2 Containment Representations

A very compact and space-efficient representation of a hierarchical structure is the so-called treemap approach, which was first introduced by Shneiderman and Johnson in 1990 [99, 144]. The starting point of the idea was the general question of how the storage space on a hard disk is used. The treemap technique is very effective when visualizing the size of hierarchy nodes, and it scales up to thousands of those nodes. On the other hand, it is hardly suited for understanding the hierarchical structure; node-link diagrams perform much better for this task. Figure 2.7 shows a treemap that was created by a treemap visualization tool developed by myself in the JAVA programming language. The tool can switch between two metrics and apply different color codings. Treemap borders can be interactively displayed down to a user-defined hierarchy depth.

Figure 2.7: A treemap can be used to represent hierarchical information in a space-efficient way. The treemap in this figure encodes two metrics, one in the area of each box and the other by a color coding, which is blue to red in this figure.

The treemap approach has had many imitators over the years, and many researchers tried to enhance the original idea, which was a slice-and-dice layout. This means that the subdivision into smaller rectangles alternates between horizontal and vertical cuts. Shneiderman et al. also improved their initial layout with a so-called Nested Treemap to overcome the problem of tree structure detection. Some years later, in the Cushion Treemap approach [168], the developers used shading to improve the readability of the tree structure. During the recursive process they add ridges to each of the rectangles to emphasize the hierarchical structure. Squarified Treemaps [17] are used to avoid the thin, elongated rectangles that are normally generated by the original slice-and-dice treemap algorithm. Following this approach, the aspect ratio becomes much more balanced, which has the consequence that the boxes become nearly square and can better be recognized by the human eye. Balzer and Deussen [4] made the rectangular shape of original treemap layouts responsible for the limitations in exploring the hierarchical structure. With their Voronoi Treemap technique, the developers are able to subdivide arbitrary shapes such as triangles and circles into Voronoi polygons. There are a number of other treemap imitations—too many to discuss them all in detail—such as Bubblemaps and Quantum- [9], Ordered- [145], Clustered- [175], Cascaded- [112], Modifiable- [169], or Circular Treemaps [173] that all have some benefits and some drawbacks with respect to the original approach. A history of the treemap idea can be found at Ben Shneiderman's treemap homepage [11].

StepTree [13] extends the treemap idea to three dimensions. The authors saw similarities to boxes laid out on a warehouse floor. Stacking subdirectories on top of their parent directories should have the positive effect that one can better detect the hierarchical structure. The approach looks like a two-dimensional layered icicle (Section 2.3.3) but naturally suffers from occlusion problems.

There are also some approaches that try to overcome the poorly visualized structural information of treemaps with additional node-link diagrams. The most popular is the Elastic Hierarchies [190] visualization, which is a hybrid representation combining node-link diagrams, which have the benefit of structural clarity but the drawback of poor scalability, and treemaps, which are very space-efficient but hardly convey the hierarchical structure. In EncCon [124] the developers use the treemap information to generate node-link diagrams but do not show the treemap boxes in the final layout, which looks very space-efficient but suffers from the edge crossing problem for very large hierarchies.

To show the usefulness of our visualization techniques, we also apply them to data from the field of sports. Sports such as soccer or tennis are well known, and visualized data from this domain can also be understood by novices in information visualization. Jin and Banks [98] developed the TennisViewer for displaying data from tennis matches, which is also based on a treemap visualization derived from competition trees. The dynamics of a tennis match are mapped to two directions.
Each set is shown as a horizontal column whereas each game in a set is mapped vertically in the box that represents this set. Color coding is used to show the performance of each player separately in each game, set, and match. A similar approach is used in [165] to visually represent basketball data with a hierarchical component that subdivides the NBA into four divisions, then slices them into teams and finally into player levels. Color coding is used to indicate the points-per-season attribute.
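To make the slice-and-dice principle described earlier in this subsection concrete, here is a minimal, illustrative sketch (the node format, a name plus either a size or a list of children, is an assumption and not the data model of any cited tool): the available rectangle is split proportionally to the subtree sizes, alternating between horizontal and vertical cuts per level.

    def size(node):
        return node["size"] if "size" in node else sum(size(c) for c in node["children"])

    def slice_and_dice(node, x, y, w, h, horizontal=True, boxes=None):
        boxes = [] if boxes is None else boxes
        boxes.append((node["name"], x, y, w, h))       # one rectangle per node
        if "children" in node:
            offset, total = 0.0, size(node)
            for child in node["children"]:
                share = size(child) / total
                if horizontal:                          # split the width
                    slice_and_dice(child, x + offset * w, y, share * w, h, False, boxes)
                else:                                   # split the height
                    slice_and_dice(child, x, y + offset * h, w, share * h, True, boxes)
                offset += share
        return boxes

    tree = {"name": "root", "children": [
        {"name": "a", "size": 6},
        {"name": "b", "children": [{"name": "b1", "size": 2}, {"name": "b2", "size": 2}]}]}
    print(slice_and_dice(tree, 0, 0, 100, 100))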

2.3.3 Layered Icicle Representations

Some evaluations [6, 149] showed that a radial space-filling hierarchy visualization technique outperforms other space-filling hierarchy representations such as the treemap approach with respect to structure-based tasks. One existing work that makes use of radial space-filling techniques is the very popular Sunburst approach [149]. It is also called a radial layered icicle tree, in contrast to a Cartesian layered icicle tree where the different layers are laid out horizontally or vertically. Sunburst is equipped with interactive distortion features that its authors call angular detail, detail outside, and detail inside methods. Overview and context are preserved by smaller images of the original representation, and a user can focus on specific hierarchical elements; the transitions are supported by smooth animations.

The work of Andrews and Heidegger [3] could be seen as the precursor of Sunburst. It consists of two semi-circular discs that represent a file/directory tree and is called Information Slices. The overview of the hierarchy is shown in the left view and the detailed subhierarchy in the current focus is represented in the view on the right-hand side.

Some improvements of the Sunburst technique are imaginable, which could, for example, address distortion techniques. The developers of InterRing [185] claim that existing tools do not support interactive features such as selection and navigation techniques. They implemented a sophisticated multi-focus and context distortion approach and support a user with a lot of other navigation techniques such as zooming/panning, rotation, and drilling-down/rolling-up, to mention the major ones, see Figure 2.8.

In DocuBurst [35] the radial space-filling technique is applied to text document content. The WordNet [61] hyponymy relationship is used to generate a hierarchical structure. A word of interest is located in the circle center and serves as the root of the tree. The tool provides many interaction techniques, and color coding is used to highlight words related to the one in focus. The InfoVis toolkit [58] provides many different types of visualization techniques, with a Cartesian layered icicle tree among them.

Figure 2.8: InterRing [185] uses radial layered icicles and provides many interactive features to manipulate hierarchical data (Courtesy of Matthew Ward).
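The basic geometry of a radial layered icicle can be sketched in a few lines; this is a simplified illustration (assuming nodes with a name and an optional list of children), not the layout of Sunburst or InterRing themselves: every node receives an angular span proportional to the number of leaves in its subtree, and its ring corresponds to its depth.

    import math

    def leaves(node):
        return 1 if "children" not in node else sum(leaves(c) for c in node["children"])

    def sunburst(node, start=0.0, end=2 * math.pi, depth=0, sectors=None):
        sectors = [] if sectors is None else sectors
        sectors.append((node["name"], depth, start, end))  # ring index plus angular span
        angle = start
        for child in node.get("children", []):
            span = (end - start) * leaves(child) / leaves(node)
            sunburst(child, angle, angle + span, depth + 1, sectors)
            angle += span
        return sectors

    tree = {"name": "root", "children": [
        {"name": "a"},
        {"name": "b", "children": [{"name": "b1"}, {"name": "b2"}]}]}
    print(sunburst(tree))  # the root spans the full circle, "b" gets two thirds of it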

2.3.4 Indentation

File browsers are also tools that enable users to explore information hierarchies. A file or directory structure can be examined very quickly by expanding or collapsing subdirectories. The hierarchy elements are normally listed vertically, and the most important feature of this representation is indentation. The deeper an element is located in the hierarchy, the larger the horizontal indentation of this element becomes. Typically, the problem of this approach is scalability. Only a small fragment of a very large directory tree can be displayed at one time, and expanding subhierarchies worsens this situation of space limitations even further. A scrolling function can help a user find what he is looking for. But the expand and collapse operations used when exploring the tree make it nearly impossible to form a mental map of the hierarchical data—a negative effect that is referred to as the focus+context problem.

With respect to this problem, Chimera and Shneiderman [31] evaluated three hierarchy browsing interfaces. In their exploratory evaluation, a fully expanded stable interface, an expand/contract interface, and a multi-pane interface are evaluated by 41 novice participants. The users that explored the indented hierarchies by means of the stable interface took much longer than those using the other interfaces.
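The indentation principle itself is simple enough to be sketched in a few illustrative lines (again assuming nodes with a name and an optional list of children): each element is printed on its own line, shifted to the right proportionally to its depth.

    def print_indented(node, depth=0):
        print("  " * depth + node["name"])     # deeper elements are indented further
        for child in node.get("children", []):
            print_indented(child, depth + 1)

    tree = {"name": "root", "children": [
        {"name": "src", "children": [{"name": "main.py"}]},
        {"name": "docs"}]}
    print_indented(tree)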

2.3.5 Approaches Using 3D

Using a third dimension for the visualization of hierarchies can sometimes be very useful, but the major drawback of this approach is occlusion. To overcome this problem, such tools have to be equipped with interaction, navigation, and animation features because otherwise one cannot perceive the hierarchical structures and substructures.

One early approach is the Cone Trees visualization [28, 139], which switches to 3D to make effective use of the display space. The root of each subtree is located at the apex of a cone, and the corresponding children of this parent node are placed around the circular base of this three-dimensional cone. The tree is laid out in a top-down
layout and each node can be rotated to the front. A two-dimensional projection on the ground emphasizes the structure of the tree. The horizontally oriented version of Cone Trees is called Cam Trees. Here, too, the developers use a projection to give the user the opportunity to better perceive the hierarchical structure.

Rotation is also the key concept in Collapsible Cylindrical Trees (CCT) [39], where child nodes are mapped to rotating cylinders. Detail and context are preserved by a dynamic representation of these child nodes. The drawback of the technique is that it does not scale up to very many nodes, but the developers do not focus on visualizing large trees but rather on the first two levels of a tree. Additionally, they show a path in the hierarchy that is chosen by a user.

Stacked circular beams are the basic idea in the Beamtrees visualization technique by van Ham and van Wijk [166]. Both the size of each node and the tree structure can be discerned from this representation. The researchers found that their approach is significantly more effective than treemaps and cushion treemaps with respect to tree structure-based tasks.

Botanical trees by Kleiberg et al. [106] offer many similarities to trees in nature. The approach is based on the strand model of Holton [95] to generate a geometric model from an abstract tree. Leaf nodes are shown as fruits and intermediate nodes are mapped to branches in the visualization. Pyramid-like structures are used in Information Pyramids [2], a visualization technique for representing large hierarchies. The pyramids grow in height the deeper an information hierarchy gets.

2.4 Compound Graph Visualization

Compound graphs are a very frequently occurring data structure. In many application domains, related objects are typically hierarchically organized into categories and subcategories: methods or functions in software systems are related by calls that can be modeled by a call graph, and members of a communication network are related if they communicate with each other. These are just two common examples out of many where we are exactly in the situation of having to deal with compound graphs. In the scope of this thesis we distinguish two types of edges when we talk about compound graphs. First, there are inclusion edges that express the hierarchical ordering of the elements, and second, there are adjacency edges that tell us about relations between leaf elements of the tree.

Georg Sander from the University of Saarbrücken introduced a method for the layout of compound digraphs [142] that is based on the hierarchical layer layout method. He reports that the approach has similarities with the one presented by Sugiyama and Misue [152]. The main difference lies in the results of the final layout. With his approach, one is able to draw rectangular borders around the nested subgraphs.
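As a minimal illustration of this distinction (all names and weights are made up for the example), a compound graph can be stored as a tree of inclusion edges plus a separate set of weighted adjacency edges between leaves:

    inclusion = {                      # parent -> children (inclusion edges form a tree)
        "project": ["module_a", "module_b"],
        "module_a": ["file_1", "file_2"],
        "module_b": ["file_3"],
    }
    adjacency = [                      # (source leaf, target leaf, weight)
        ("file_1", "file_3", 2.0),
        ("file_2", "file_1", 1.0),
    ]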


In the field of bioinformatics, lots of datasets can be modeled by a graph. Moreover, the elements can be hierarchically organized. Biochemical networks, for example, can be modeled as compound graphs and also visualized. Zimanyi and Skhiri [192] present the constraint simple compound graph layout (CSCGL) algorithm in the Visual BioMaze framework [191] to automatically construct compound graph layouts.

Fekete et al. [60] overlay a treemap with curved links that are implemented using quadratic Bézier curves. These show additional relations between the hierarchical elements and profit from the fact that they can be followed by the eye better than straight lines, which often take parallel routes from their start node to their target node. The direction of an edge can be obtained by looking at the offset of curvature.

In the Arctrees [123] approach the developers represent hierarchical and non-hierarchical relations in one view, but in their work they do not denote their data structure as a compound graph. Their approach is based on the traditional space-filling treemap idea to show the information hierarchy. The additional relations are drawn as colored arcs on top of this representation.

The hierarchical edge bundling approach by Holten [92] was originally applied to compound graphs. It reduces visual clutter by bundling the adjacency edges that run between hierarchical elements. Each adjacency edge is modeled as a B-spline curve that is attracted towards the polyline along the path of inclusion edges from its start node to its target node, see Figure 2.9. The approach of Danny Holten addresses the problem of reducing visual clutter by bundling edges that point in nearly the same direction and splitting the bundles into smaller ones when these bundles point in locally different directions. Though the idea is very intuitive and generates very aesthetically pleasing pictures, it also has some drawbacks. Even after applying an edge bundling algorithm, not all edge crossings can be avoided, which should be the major goal when laying out graphs as node-link diagrams. Although edge crossings are not avoided, the approach improves the situation considerably.

There is a large class of graphs that are not static but change their structure over time. Visualizing this data type as traditional node-link diagrams leads to a dramatic worsening of the situation. The edge bundling approach might be extended by smooth animation with the goal of showing dynamic graphs. The major problem of this idea may occur when very frequently changing graphs have to be displayed, which could lead to a total loss of the mental map and a confused user.

Our Trees in a Treemap visualization [20] is similar to the idea of Fekete's curved links, but we use an orthogonal layout instead of curved links. Edge crossings are reduced by a layout optimization algorithm. The technique is explained in more detail in Section 3.3.2.4.

In Section 5.3 of this thesis we tackle this problem by introducing a novel visualization technique that can represent dynamic compound weighted multi digraphs in a single view. This idea benefits from the fact that a user's mental map is
preserved. The idea of this novel technique is to use a different visual encoding for the adjacency edges, which are responsible for the visual clutter caused by lots of edge crossings. Graphs can only be represented as node-link diagrams in the two-dimensional plane without edge crossings when they belong to the class of planar graphs, of which trees are a subset. The visualization approaches that we discussed above all have one commonality: they use space-filling visualizations for the inclusion edges and node-link diagrams for the adjacency edges. Our novel idea benefits from using space-filling visualizations for the adjacency edges and node-link diagrams for the inclusion edges, which, as we mentioned before, form a special graph called a tree and belong to the class of planar graphs. This means that they can be represented as node-link diagrams without any edge crossings anyway.

Figure 2.9: The Hierarchical Edge Bundling technique [92] reduces visual clutter by bundling the adjacency edges that point between hierarchical elements. The hierarchy is represented here by a radial layered icicle. (Courtesy of Danny Holten)
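The routing idea behind hierarchical edge bundling can be sketched as follows; this is only an illustration using the hypothetical hierarchy from the previous sketch, not Holten's implementation: the control polyline of an adjacency edge is the path of inclusion edges from the start leaf up to the lowest common ancestor and down to the target leaf, and this node path then serves as the control polygon of the B-spline.

    parent = {"module_a": "project", "module_b": "project",
              "file_1": "module_a", "file_2": "module_a", "file_3": "module_b"}

    def path_to_root(node):
        path = [node]
        while node in parent:
            node = parent[node]
            path.append(node)
        return path

    def control_path(src, dst):
        up, down = path_to_root(src), path_to_root(dst)
        ancestors = set(down)
        lca = next(n for n in up if n in ancestors)   # lowest common ancestor
        return up[:up.index(lca) + 1] + list(reversed(down[:down.index(lca)]))

    print(control_path("file_1", "file_3"))
    # ['file_1', 'module_a', 'project', 'module_b', 'file_3']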



2.5 Dynamic Graph Visualization

Though some graph visualization techniques are very useful even for extremely large and dense graphs, it is very difficult for them to concurrently handle dynamic graphs—graphs that change their structure over time. In this section we will discuss some related work that addresses this problem by animation techniques. In this context it is very important to discuss the term 'mental map', which is also called 'dynamic stability'.

Trivially, the problem of visualizing a changing graph can be solved by the following naive approach: every graph in the sequence could be laid out separately by a static graph drawing algorithm. If we follow this idea, we are confronted with the next problem—time inefficiency. Parts of the graph that remain unchanged could be disregarded by the algorithm to save computation time. Actually, this becomes a real problem for large and dense graphs and very long graph sequences. If we have to deal with small and sparse graphs, the approach does not cause any trouble. But what we have to be aware of when tackling the problem of visualizing a changing graph structure is the preservation of the mental map [50, 120].

When a picture is displayed to a human, he has an impression of that picture in his mind after a fraction of a second. The viewer has formed a so-called mental map of the image. Showing the dynamics of a graph by animation means showing the changes in the graph in one view by subsequent pictures. As far as possible, each picture in this sequence should not change too much with respect to the one presented before because it would mean a significant effort for a viewer to form a new mental map each time. Changes in this mental map should be minimal so as not to confuse a user when exploring a sequence of graphs that is presented with animation. Brandes and Wagner [15] and Friedrich and Eades [65] address this problem in their research.

Diehl and Görg [44] introduce a generic algorithm for drawing such graph sequences. Their algorithm can be applied to different static layout strategies for the graphs. The main difference to previous work in this domain is that their algorithm first considers all graphs in the sequence and then generates an adequate layout. This is what they call offline graph drawing. Their approach is the very first published work on offline dynamic graph drawing.

Many visualization tools have been developed to visualize dynamic networks. The DGD tool by Pohl and Birke [133] additionally provides instruments for statistical dynamic network analyses, and it combines several views into one by linking and brushing techniques.


The GEVOL system [34], which is shown in Figure 2.1, is an example of a visualization tool that represents dynamic graphs in a node-link metaphor by means of animation. Software evolution is shown as animated graph-based visualizations to explore changes in software dependencies such as call graphs, inheritance graphs, or control flow graphs.

In TimeRadarTrees, presented in Section 5.3, we show how to visualize a sequence of directed graphs in a single view. This novel idea has two major benefits: computation time for a new layout of each graph in the sequence is reduced to a minimum because nodes remain at fixed positions, which also gives a solution to the problem of preserving the mental map.

2.6 Visualization of Time-Based Data

Many datasets have an additional chronological component. Tufte [162] analyzed about 4000 publications between 1974 and 1980 that contain a graphic representation and came to the conclusion that more than 75 percent of the visualized data has a time-based nature, which is also denoted as time-series data. For example, meteorological services store temperatures, barometric pressure, wind direction, wind speed, or the amount of precipitation at many points in time during a day, a month, a year, or even decades. Statistical evaluation of this time-based data can give many significant insights to make better predictions for the weather forecast.

Many visualization tools have been developed to explore this time-based data more efficiently. Interaction is an indispensable technique to support a user in navigating the large datasets. Distortion of the time axis is an absolutely necessary feature for such visualizations to make regions of interest clearer. The standard approach to visualize time-based data is a two-dimensional line graph: the time is mapped to the x-axis and the variable of interest to the y-axis. MacEachren [114] introduces a questionnaire that presents a list of questions that could all be answered by means of a time-based visualization.

For time-based data the time axis always stands in the center of the visualization. Frank [64] gives a taxonomy for different time-axis representations. In his research he distinguishes

• cyclic vs. linear time
• discrete time points vs. intervals
• ordinal vs. continuous time
• ordered vs. branching time vs. time with multiple perspectives

A good overview of visualization techniques for time-based data is given by Müller et al. [121]. In their work they classify the visual representation of time-based data
into static, dynamic, and event-based representations. Static visualizations of time-based data have many benefits over dynamic or animated representations. First of all, these approaches give a better overview of the time-series dataset. Apart from scalability restrictions, a viewer can get excellent insights into the time-dependent datasets and can easily detect trends or patterns at first sight, even without interaction techniques. This is very difficult in animated representations due to the need to preserve the mental map at each transition. Animations have the significant restriction that the human eye has to concentrate on small parts of the display when first confronted with the dataset. To get better insights, we may have to run the animation several times until we get an 'aha' experience. This is not necessary with static or overview-based representations of time-based data: the viewer decides for himself where to look in the view.

Especially for trend visualization, Robertson et al. [138] evaluated the effectiveness of trend animation for multi-dimensional data. As a result, the researchers found out that animation leads to many interpretation errors for the participants. Two alternative static trend visualizations—simultaneously overlaid trends in one single display and a small multiples display that shows trends side-by-side—are significantly faster, and using the small multiples display is more accurate.

In the following we will discuss some existing tools in which either the visualization of the time-based data is static, which means that the only change in the visualization stems from user interaction, or the visual representation is dynamic and can be seen as a kind of function over time, an approach typical for animated visualizations. In the context of this thesis we also denote static representations by the term overview-based representations and likewise dynamic representations by animation-based representations.

2.6.1 Overview-Based Representations

The visualization of time-based data, and visualization in general, existed long before computers were invented, which means that no animation of the visualization was possible in earlier times. This is the reason why static or overview-based representations of time-dependent data appeared long before animation-based approaches. Though these early visualizations are very basic, because they can only handle the time-varying data of one single variable, they have many benefits. Trends and counter-trends, quantitative statements, or patterns and anomalies can be detected in a very short time because the visualizations provide a good overview of the dataset.

A very popular example of time-based data visualization, which is nowadays seen as a classic, is the map drawn by Charles Joseph Minard in 1869 [55, 162]. The chart shows the diminishing size of Napoleon's army heading to Moscow in 1812 and suffering from the icy temperatures during retreat. In this


diagram, many time-dependent variables are involved in a single view: the decreasing number of soldiers, the direction of movement, the current temperature, and, albeit very vaguely, the geographic location.

The parallel coordinates view [97] is normally used to represent hypervariate or multivariate data where one object is associated with more than three attributes. To also visualize time-based data, each vertical axis of the parallel coordinates plot is mapped to a point in time, which preserves the chronological order. The drawback of this approach emerges when many objects are drawn in the same diagram: many intersecting lines, which indicate the strength of each object's time-varying metric at each axis, cause visual clutter. Though multivariate data can be visualized by parallel coordinates that map the single variables of each object to the corresponding axes, it remains a challenge to also visualize the changes of all variables of one single object as a function over time.

The ThemeRiver technique [85] addresses this problem by drawing stacked bar charts on a continuous linear time line. Each single bar in the stacked representation indicates the value of the corresponding variable. To preserve the mental map, the stacking of the single bars has the same order at each point of time. The developers use interpolation techniques to give the viewer the impression of a flowing river. The problem of this idea lies in the distorted representation of the single variables.

2.6.2 Animation-Based Representations

The code swarm visualization [33] generates very aesthetic animations of the evolution of commits in software projects. It is a mixture of author-centric and code-centric software visualization approaches that shows the software developers and the changed files as animated colored objects. The color coding of the visually encoded files depends on the file type, be it a source code or an image file. Though it is very fascinating to watch the animations, it is also a very time-consuming task to filter out interesting insights such as trends or patterns. The question arises whether a static visualization of the same kind of data could be much more efficient for data exploration.

The GapMinder tool [72] also uses animation to show trends and counter-trends in multi-dimensional data. Bubble sizes and positions on the x- and y-axis of a coordinate system are animated over time to show the evolution of trivariate data. The technique was evaluated by Robertson et al. [138] and compared to a static trace and a static small multiples visualization. The researchers report that the participants found the animated visualization very enjoyable, but that it leads to many interpretation errors and is less time-efficient compared to the static counterparts.

In Chapter 5 we introduce a novel radial approach called TimeRadarTrees. With this technique we can visualize dynamic weighted compound digraphs as well as sequences of transactions in information hierarchies in a single view. The new idea


can therefore be regarded as a static or overview-based visualization technique for dynamic relational data. The main contribution of the novel technique is that visual clutter is reduced by drawing colored circle sectors instead of node-link diagrams for the adjacency edges. The Timeline Trees visualization technique is the Cartesian counterpart of the radial TimeRadarTrees; it uses rectangular boxes instead of circle sectors. TimeArcTrees is the third visualization approach that can also represent this kind of data. It is introduced at the beginning of the chapter because it makes use of conventional node-link diagrams for the adjacency edges as well as for the inclusion edges.

In the next chapter we focus on the extraction and visualization of mining rules from software archives under version control. In the first part we have to deal with static time-aggregated relational data in information hierarchies. The second part of the chapter introduces two visualization techniques that represent time-series data from software archives by both a static overview-based and a dynamic animation-based approach.

“Successful software changes over time.” — David L. Parnas (1994)

CHAPTER 3 Visualizing Rules from Software Archives

Software systems nowadays are very complex and contain millions of lines of source code, implemented by many developers throughout the world in many different time zones. Managing and maintaining these distributed systems is an enormously exhausting task. Software configuration management tools cannot solve all emerging problems by themselves, but they can support software developers considerably.

Typically, in these software archives or repositories, data is stored during the whole evolutionary process and tagged with additional temporal information. Hence, software projects under version control provide a rich source of time-based data for both relational and non-relational as well as source code-centric and author-centric visualization approaches. Data mining techniques are powerful tools to extract rules for software artifacts that hold at least for a certain time period of the development process of a software system. In this chapter we discuss which types of rules can be gained from software archives and how these rules can be visualized in order to explore them interactively. Our focus in this chapter lies on different visualization techniques for aggregated evolutionary data. The techniques presented herein stand in contrast to the techniques in Chapter 5, where we introduce static visualization techniques for non-aggregated time-based relational data. We illustrate the usefulness of the techniques by interesting results of some case studies.


A rule, in this context, is a statement such as 'under some circumstances A, do B' that holds with a certain probability, which we further denote as the strength of a rule. A single rule consists of underlying software artifacts that are typically organized in an information hierarchy. Later, this hierarchical information can be used to navigate better in the visualization tool, to filter and aggregate rule sets, and to explore the data on different levels of granularity. We aim at finding interesting patterns and anomalies, which we also call outliers. As an example, we apply some simple visualization techniques that are integrated in our tool called EPOSee to a mined rule set of the MOZILLA open source software project. We explain how these results could be used to support software engineers and what important insights they could give into a complex system.

3.1 Data Mining in Version Archives

The amount of stored data is doubling every three years [113], and it becomes increasingly difficult to extract patterns from datasets manually, that is, without technical support. This is exactly the point where data mining comes into play. It can be seen as the process of extracting patterns from datasets that are hidden in the flood of data and would otherwise never be retrieved [101]. Data mining is now used in many disciplines and has become a tool one can hardly do without. Large datasets are, for example, census data, transaction data of a department store, or data recorded in the software development process.

Each software system has its own specific evolution, like a genetic fingerprint. Internet browsers, computer games, management systems, and operating systems have all been developed in an evolutionary process over a certain time period. The single implementation phases can be subclassified into software design, code development, bug fixing, and many more subprocesses. 'Successful software changes over time' [126]. All these changes and all versions of very large software systems are stored in so-called software archives using configuration management systems [36]. The most popular of those are CVS, the Concurrent Versions System [5], and SUBVERSION (SVN) [129, 151].

A typical repository does not contain the runnable software. It is merely a collection of textual data that can further be split into source code and documentation files. Also binary files, images, and movies are checked in and managed in a software archive. All these artifacts are systematically organized in a directory structure and attached to a timestamp of the exact time of commit. The most frequently changed files normally are those that contain the source code and the documentation of the software project.


The source code files have a certain structure that depends on the programming language and its program constructs. The ideal case would be that a developer changes a set of files, including the documentation files, that he needs for finishing one subtask. The process of checking in a number of involved files at a specific point in time is called a transaction. A change of a textual document means adding, replacing, or deleting at least one character in the file. CVS is not able to detect how many characters a person changed because it uses a line-based diff algorithm. We can, for example, introduce a measure function that counts the number of changed lines in each file in each transaction. Formally, the i-th transaction could here be defined by

t_i := { file ∈ F | µ_trans_i({file}) > 0 }

where F is the set that contains all files in the software repository that have been present during the whole evolutionary process, and µ_trans_i({file}) stands for the number of lines that have been changed in file in the i-th sliding time window.

CVS avails itself of a change detection algorithm that can detect the difference between the current file and the already checked-in one and is therefore called a diff algorithm. Many of these algorithms exist, each with different characteristics. The main idea is to turn a text t1 into a text t2 and to use as few modifications as possible for this operation [118, 159].

A slightly different application where transaction data is explored is given by online shops. They store information about the shopping behavior of their clients, for example what they looked at before they order some goods. This information can be analyzed and used to make suggestions or advertise products to the customers when they visit the online shop the next time. A typical suggestion could look like this: 'Other clients who ordered the album Human Touch by Bruce Springsteen also ordered Both Sides by Phil Collins and Land of Confusion by Genesis'. To obtain more precise suggestions, it is very important to have a large data source. In the example above, this precision plays a crucial role in the decision whether the advertised products will be bought or not.

Zimmermann et al. [195] were inspired by the idea described above and target the examination of software archives. They identified files, program constructs, or lines of code that have been changed together very frequently. After the analysis they can give suggestions to a programmer which changes of the source code or documentation he should be aware of. Also in the context of software evolution we refer by the term data mining to the process of extracting hidden patterns from large amounts of data that sleeps unused in version archives and could be transformed into valuable information.
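The definition above translates directly into code. The following minimal sketch (our own illustration; the class and method names are hypothetical) derives the i-th transaction from a map that stores, for every file, the number of lines changed in the i-th sliding time window:

import java.util.*;

class TransactionFromMeasure {
    // changedLines.get(i) maps each file to the number of lines changed
    // in the i-th sliding time window, i.e., the measure mu_trans_i
    static Set<String> transaction(List<Map<String, Integer>> changedLines, int i) {
        Set<String> t = new HashSet<>();
        for (Map.Entry<String, Integer> entry : changedLines.get(i).entrySet()) {
            if (entry.getValue() > 0) {   // file in F with mu_trans_i({file}) > 0
                t.add(entry.getKey());
            }
        }
        return t;
    }
}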


In the following sections we give a more detailed description of the phases needed to discover knowledge in software archives. The final and most important step in the context of this thesis is the representation of the obtained information.

3.2 Preprocessing

Version management systems like CVS or SVN are very space-efficient. These tools do not store complete copies of every version that ever existed; this would lead to an enormous waste of storage space. Configuration management should also be time-efficient. For this reason, only the latest version of each file is stored as a whole, and only the differences between two successive versions are kept. To access an older version, all differences are applied iteratively, and the requested version of a file is reconstructed. The reason for storing the latest version and not the first one is that developers mainly access newer versions; following this idea, the access time is decreased considerably.

Not all data needed for an analysis is stored in such a repository, but fortunately, a great deal of this information can be reconstructed. CVS, for example, does not store which files together form one single transaction. This information has to be reconstructed by inspecting the timestamp, the committer, and the log messages of all files that come into consideration [178]. To analyze the data more efficiently, it is necessary to preprocess it and attach additional tags. For this reason, the data contained in the repository is mirrored in a database. In this early stage we perform data enhancements for a better
• reconstruction of the single transactions,
• assignment of the changes to the software artifacts, and
• data filtering.

3.2.1 Data Extraction

Draheim et al. [46], on the one hand, and Zimmermann et al. [194], on the other hand, suggest similar processes to extract versions from CVS archives and to store the data in a relational database such as PostgreSQL in a way that the data can easily be analyzed for simultaneously changed artifacts. The 'CVS log' command outputs textual information about files, directories, revisions, symbolic names (tags), and branching points. A sophisticated parsing program extracts this data from the text output and stores it in specific data tables.

Draheim et al. implemented this principle in their tool called Bloof [45, 127]. It can be used for explorative analysis of software project data and allows user defined


queries. Daniel German developed the softChange tool [75]. He is more interested in discovering 'software trails', a term that denotes specific hints about the developers' behavior left behind during the evolution of the software. To tackle this problem with his tool, he has to extract additional information and examines three different sources to meet the requirements:
• ChangeLog files: Some projects make use of ChangeLog files, which have the purpose of collecting short comments about a developer's reason for a change. These comments can be extracted and compared to the CVS log information.
• BUGZILLA bug database: Another idea is to relate commits to the bug reports stored in a so-called bug database. The log messages of CVS can systematically be examined for bug indications.
• Mailing lists: In many cases there is a strong dependency between mails and commits to the repository. In softChange the developers try to unmask these relations.
Fischer et al. [63] also used the data of the BUGZILLA database, which they first extract and then store, both the CVS data and the BUGZILLA data, in a complex structured database. Different aspects of software evolution are extracted by Hassan et al. [84]. Their extractors can be applied to both time-based and event-based snapshots.

3.2.2 Reconstruction of the Transactions

One drawback of CVS is that it divides each checkin into many single checkin commands. The files that belong to the same transaction are not stored as a whole, but each file is treated separately. As a consequence, each checked-in file is attached to a different timestamp. This makes the process of reconstructing all transactions a little bit more complicated, and a heuristic approach is used to regain the transaction information. To overcome this problem, we use the concept of logical time equality that is defined by sliding time windows: all files that are checked in at points in time that belong to the same sliding time window, with the same author and the same log message, belong to the same transaction.

The role of the sliding time window can be expressed more formally: let the sliding time window be set to Δt seconds. Then the following holds for all checkins δ_1, ..., δ_k (sorted by time(δ_i)) that belong to one transaction t:

∀ δ_i ∈ t : author(δ_i) = author(δ_1)
∀ δ_i ∈ t : log_message(δ_i) = log_message(δ_1)


∀ i ∈ {2, ..., k} : | time(δ_i) − time(δ_{i−1}) | ≤ Δt

In CVS, two revisions of the same file can never be checked in at exactly the same point in time. When using the heuristic from above, however, it cannot be avoided that two revisions of the same file end up in the same reconstructed transaction. An upper bound for the length of one transaction can help to tackle this problem, which is exactly the idea that Daniel German introduces. An alternative solution would be to reconstruct the set of transactions by using commit mails. This kind of information is sent to the mailing list each time a developer makes a checkin into the software archive.
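A minimal sketch of this heuristic (our own illustration; the Checkin class and its fields are assumptions) could look as follows. The checkins are expected to be sorted by their timestamps; a new transaction is started whenever the author, the log message, or the Δt criterion is violated:

import java.util.*;

class Checkin {
    String file, author, logMessage;
    long time;   // checkin time in seconds
}

class TransactionReconstruction {
    // groups time-sorted checkins into transactions using a sliding time
    // window of deltaT seconds, the same author, and the same log message
    static List<List<Checkin>> reconstruct(List<Checkin> checkins, long deltaT) {
        List<List<Checkin>> transactions = new ArrayList<>();
        List<Checkin> current = new ArrayList<>();
        for (Checkin c : checkins) {
            if (!current.isEmpty()) {
                Checkin first = current.get(0);
                Checkin prev = current.get(current.size() - 1);
                boolean sameAuthor = c.author.equals(first.author);
                boolean sameLog = c.logMessage.equals(first.logMessage);
                boolean inWindow = Math.abs(c.time - prev.time) <= deltaT;
                if (!(sameAuthor && sameLog && inWindow)) {
                    transactions.add(current);
                    current = new ArrayList<>();
                }
            }
            current.add(c);
        }
        if (!current.isEmpty()) {
            transactions.add(current);
        }
        return transactions;
    }
}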

3.2.3 Finding out the Changed Artifacts

So far we can only find out which single files belong to which transaction. Our analysis should be more precise in the sense that we can analyze concrete artifacts such as classes, methods, or even arrays, which are on a finer level of granularity. To extract this information, we compare the previous and the current version of all files separately by a textual diff-based analysis of all existing artifacts. An examination of the complete syntax trees would be a more exact but less efficient analysis approach. Doing so, we find out the added and deleted artifacts in each file of a transaction. The drawback of this idea is the dependency on the performance of the diff algorithm in use.

3.2.4 Data Cleaning

The steps described above do not yet lead to the final result. Before we reach this final goal, we need a data cleaning process. Two different types of checkin information are the reason for this last step; they could lead to erroneous conclusions if not treated.
• Large transactions: Global changes are the main reason for large transactions. A migration of all, or at least a large number of, source code files is to blame for this phenomenon. Sometimes software developers also do not check in their changes after each completed programming task; they collect the changes until the end of their working day and then make a final commit. Very many artifacts are involved in such a transaction though they are not logically coupled.
• Merges: Two parallel lines of development can be merged into one single line again. All changes are combined in one transaction. The files belonging to this transaction do not have to be logically coupled.
The naive approach to get rid of these large transactions is to simply ignore all transactions that contain more than a certain fixed number of artifacts; this threshold has to be chosen for each software project separately, as sketched below.
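A sketch of this size-based filter (our own illustration; the threshold is a project-specific parameter, as noted above):

import java.util.*;

class TransactionCleaning {
    // drops all transactions that touch more artifacts than a project-specific
    // threshold, since such commits are often migrations or merges
    static List<Set<String>> filterLargeTransactions(List<Set<String>> transactions, int maxArtifacts) {
        List<Set<String>> kept = new ArrayList<>();
        for (Set<String> t : transactions) {
            if (t.size() <= maxArtifacts) {
                kept.add(t);
            }
        }
        return kept;
    }
}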


We again need a heuristic approach to look for merges because CVS does not store any information about merges. Fischer et al. [63] introduce a solution to find revisions of a file that are caused by a merge.

3.3 Mining Association and Sequence Rules

Extracting data from software repositories was the subject of the previous sections. In the following sections we describe how this data can be used to generate more specific rules by means of data mining techniques. Visual representations, though very simple, can help to get insights into the large rule sets. By the term 'rule mining' we refer to data mining techniques with the goal to detect rules between software artifacts. With these rules one can state with a certain probability which artifacts have been changed together. Using the rules this way, it is possible
• to understand which artifacts have been coupled to what extent in the software development history and
• to forecast the future software development to some extent and to support developers.
To generate such rules, it is necessary to know the changes of software artifacts within the same transaction, which means within the same checkin to the software archive. In Section 3.2.2 we explained how to reconstruct such transactions on file level. However, it is easy to also allow other levels because it is known to which file revision the software artifacts belong. We could, for example, work on method level and inspect all transactions that involve methods that are changed together by the same checkin.

3.3.1 Association Rules

In this thesis we are working with two different types of rules. This section describes how association rules can be generated and, after that, how they can be visualized and analyzed.

3.3.1.1 Generating Binary Association Rules

Binary association rules describe how often and with what probability exactly two artifacts were changed at the same time. We use the notation

a ⇒ b


to denote that artifact a and artifact b have been changed simultaneously. We call a the antecedent and b the consequent of the rule. The rule has to be read in the following way: 'When artifact a has been changed, artifact b has to be changed at the same time, too.' Each mined rule is annotated with two metric values, namely support and confidence, which are defined as follows:
• As support of a rule a ⇒ b we understand how often the artifacts a and b occur in the same transaction. Thus, the support measures on how many transactions a rule is based.
• As confidence of a rule we understand the probability that the rule is correct based on the set of transactions. It is defined as the number of common changes of a and b divided by the number of all changes of a over all transactions.
The algorithm that computes binary association rules is rather simple and can easily be illustrated by the following two steps. Let k be the total number of examined transactions and n the cardinality of the artifact set that is obtained by extracting all artifacts from the transaction sequence.
1. Create an n × n matrix with natural-number entries and initialize every position of the matrix with the starting value 0. The annotations on both axes of the matrix correspond to the software artifacts to be examined and are sorted in the same way, both vertically and horizontally.
2. Examine all transactions t_0, ..., t_{k−1} step by step and, for every pair (a, b) with a, b ∈ t_j, increment the value at the matrix position corresponding to a and b by one.
The algorithm in Listing 3.1 expresses this operation more precisely.

int[][] t = ...;                 // t[j] contains the indices of the artifacts in transaction j
int[][] mat = new int[n][n];     // matrix initialized with 0 entries

for (int j = 0; j < k; j++) {                    // transactions
    for (int l = 0; l < t[j].length; l++) {      // artifacts
        for (int m = l; m < t[j].length; m++) {  // pairs
            mat[t[j][l]][t[j][m]]++;
        }
    }
}


Listing 3.1: Generating binary association rules
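Both metrics of a binary rule a ⇒ b can then be read directly off the matrix computed in Listing 3.1, as the following paragraph explains. A minimal companion sketch (our own addition, not part of the original listing; it assumes the same matrix mat as above):

int support(int[][] mat, int a, int b) {
    // number of transactions in which a and b were changed together
    return mat[a][b];
}

double confidence(int[][] mat, int a, int b) {
    // common changes of a and b divided by all changes of a;
    // the total number of changes of a is stored on the diagonal
    return mat[a][a] == 0 ? 0.0 : (double) mat[a][b] / mat[a][a];
}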

The support of the rule a ⇒ b with respect to the set of transactions T can be extracted from the matrix position for the artifact pair (a, b). To compute the confidence, we simply divide this value by the value at the position corresponding to (a, a), which is stored on the diagonal of the matrix.

3.3.1.2 Visualizing Binary Association Rules

Visualizations can help to find correlations in large datasets [103]. To visualize binary association rules, we prefer a representation as a pixelmap, which is the smallest possible representation of a matrix: the visualization encodes each binary rule in just one pixel.

Pixelmap Figure 3.1 shows an example of an enlarged pixelmap. Both axes are annotated with hierarchically sorted artifacts. The matrix cells are labelled with natural numbers that express the support values of each rule. In this example we use a red-to-blue color scale that maps higher values to red and lower values to blue. By using this color scale, we can easily detect the strength (support or confidence) of the coupling of two software artifacts. Furthermore, it is easily discernible from this space-filling quadratic scheme which are the
• maximum and minimum values,
• outliers or anomalies, or
• patterns in the binary association rule set.
As said before, we obtain a more space-filling representation and are able to visualize a maximum number of binary association rules on the screen when we encode each matrix entry in just one colored pixel.

In some applications it is very useful to encode both metrics, support as well as confidence, in a single view. For this reason we show three-dimensional bar charts as a kind of zoom function on demand. By default, the height of each bar encodes the support value of each coupling and the color encodes the confidence value; switching both metrics is also possible in the visualization tool. Figure 3.2 shows the idea of encoding both metrics with enlarged three-dimensional bar charts. In (a) the height indicates the confidence value and the color coding (red to blue) represents the support value, whereas (b) encodes the same view with a heated-object color coding. The main drawbacks of three-dimensional visualizations are occlusion problems. We overcome this problem by a color-coded crosshair function that highlights the rule


Figure 3.1: The artifacts in this enlarged pixelmap are sorted hierarchically at both axes horizontally and vertically in the same way. Color coding is used to visualize the confidence metric.

in focus in both the pixelmap and the three-dimensional bar charts. Rotating the three-dimensional view would be a better solution to tackle this problem, but this belongs to future work.

In Section 3.4 we introduce our visualization tool called EPOSee, an interactive tool that supports the exploration process of large rule sets. EPOSee integrates several visualization techniques as well as linking and brushing functions, which can be very helpful to detect patterns, outliers, and anomalies. It can be seen as a top-down visualization approach: a user first gets an overview of the possibly unknown rule set and then interactively explores the dataset on more and more detailed levels. This visual exploration process follows the so-called visualization mantra of Ben Shneiderman [10, 27], which typically consists of the following three stages:
• Overview first
• Zoom and filter
• Details on demand
The visualization mantra is widely used in interactive visualization tools that target visual exploration problems from very different application domains.

Support Graph The pixelmap technique is an excellent approach when we have to deal with very dense matrices. Typically, these appear in badly structured software


Figure 3.2: A zoom function shows three-dimensional bar charts that encode both metrics, support as well as confidence, in a single view: (a) height represents confidence, color coding visually encodes support; (b) a different color coding (heated object color scale) can give different insights in the rule set. The crosshair function is also illustrated.

systems where many artifacts are checked in at the same time by the same developer and hence are detected as logically coupled, even if they are not. The matrix metaphor can also be transformed into a conventional node-link metaphor where the artifacts are encoded by nodes and the matrix entries by the links between the nodes. The hierarchical ordering of the nodes can be visualized by the color of the nodes. Visualizing the hierarchical ordering by fixed node positions on the screen would also be a reasonable design, but it would severely restrict the layout algorithm because many node positions would already be predetermined.

With the goal to detect outliers and anomalies, we also applied a node-link visualization based on the same rule set as the matrix representation for both the support and the confidence metric. The technique is denoted Support Graph visualization because we use the support values to lay out the graph nodes. We do not encode the support values in the links, by a color coding for example, but actually use them in the following manner: nodes that represent strongly coupled artifacts are located very close to each other, whereas nodes that encode weakly coupled artifacts are placed farther apart. To achieve the final layout, the algorithm works as a force-directed approach and normally generates graph layouts in an aesthetically pleasing way. The sets of edges and vertices are attached to forces in a straightforward way: edges are interpreted as springs, which refers to Hooke's law, and nodes are regarded as electrically charged particles, which refers to Coulomb's law. The whole graph is treated as a physical system that tries, when released, to reach a state of equilibrium, a balance between all forces, by iteratively adjusting the forces between nodes and edges.
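One relaxation step of such a force-directed layout can be sketched as follows (our own simplification; the thesis does not prescribe this exact formulation). In the support graph, the desired spring length of an edge would additionally be derived from the support value, so that strongly coupled artifacts are pulled closer together:

class ForceDirectedStep {
    // pos[i] = {x, y}; edges[e] = {u, v}; springLen is the desired edge length
    // (in the support graph it would be derived from the support value)
    static void relax(double[][] pos, int[][] edges, double springLen, double repulsion) {
        int n = pos.length;
        double[][] force = new double[n][2];
        for (int i = 0; i < n; i++) {                 // Coulomb-style repulsion between all pairs
            for (int j = i + 1; j < n; j++) {
                double dx = pos[i][0] - pos[j][0], dy = pos[i][1] - pos[j][1];
                double d2 = dx * dx + dy * dy + 1e-9;
                double f = repulsion / d2;
                force[i][0] += f * dx; force[i][1] += f * dy;
                force[j][0] -= f * dx; force[j][1] -= f * dy;
            }
        }
        for (int[] e : edges) {                       // Hooke-style springs along the edges
            double dx = pos[e[1]][0] - pos[e[0]][0], dy = pos[e[1]][1] - pos[e[0]][1];
            double d = Math.sqrt(dx * dx + dy * dy) + 1e-9;
            double f = (d - springLen) / d;           // force proportional to the displacement
            force[e[0]][0] += f * dx; force[e[0]][1] += f * dy;
            force[e[1]][0] -= f * dx; force[e[1]][1] -= f * dy;
        }
        double step = 0.1;                            // damped update towards the equilibrium
        for (int i = 0; i < n; i++) {
            pos[i][0] += step * force[i][0];
            pos[i][1] += step * force[i][1];
        }
    }
}

Repeating this step until the total movement falls below a small threshold corresponds to the state of equilibrium mentioned above.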


The hierarchical ordering of the artifacts determines the color coding of the single nodes: nodes that represent items that are hierarchically close to each other receive a similar color.

Figure 3.3: The support graph visualization is a node-link approach and shows outliers by node distance and color coding. Here the browser subdirectory of the open source software project MOZILLA is represented.

Figure 3.3 represents the couplings within the browser directory of the MOZILLA open source project as a support graph. A linear optimal color scale is applied to better depict the hierarchical ordering. The layout technique reveals a clustering of the nodes if one exists, and in this example it is easy to see that the layout algorithm actually produces a clustering from the underlying support matrix. Furthermore, we can detect an outlier in the red to brown colored cluster that belongs to the base directory: the four light blue colored nodes indicate a coupling to a different hierarchy level. To fully explore the outlier, we have to request details.

3.3.1.3 Generating n-ary Association Rules

So far we examined rules with correlations among exactly two software artifacts, which we called binary association rules. Similar kinds of correlations can also be


detected for rules that have more than two artifacts within their antecedent or their consequent. Such rules express, for example: 'If a set of artifacts A has been changed, another set of artifacts B has to be changed at the same time with a certain probability, too'. n-ary association rules are always of the form A ⇒ B where A and B are disjoint sets, i.e., A ∩ B = ∅. In an analogous way to binary rules we can now define the support and the confidence metric for n-ary association rules with respect to a set of transactions T:
• Support: supp(A ⇒ B) = freq(A ∪ B)
• Confidence: conf(A ⇒ B) = freq(A ∪ B) / freq(A)
In these formulas the frequency of a set M, denoted as freq(M), is defined as

freq(M) := | { t ∈ T : M ⊆ t } |

A small code sketch for computing these metrics on a set of transactions is given below. The Apriori algorithm by Agrawal [1] generates this kind of association rules. It needs the set of transactions and a lower bound min as input. The final output of the algorithm is the set of all association rules whose support values are equal to or higher than min.

3.3.1.4 Visualizing n-ary Association Rules

Binary association rules can be visualized with very basic visualization techniques. In contrast, the more general type of n-ary association rules cannot be represented that way. The reason for the more complex visualization lies in the more complicated antecedent and consequent of a rule: these now consist of sets of artifacts instead of single artifacts. The trouble is that sets cannot be sorted hierarchically because they do not have an inherent total order. For example, it remains unclear whether the set {r/xy, s/xy, t/xy} has to be positioned next to r/x, s/x, or t/x in the final layout.

To still gain interesting insights into the dataset, we use a slightly different visualization technique. The association rule matrix [182] provides a two-dimensional view in which the single software artifacts are arranged vertically in their hierarchical order on the y-axis. The rules are arbitrarily mapped to the x-axis, whereas the single artifacts in each rule are arranged at their vertical position. An example of this kind of visual encoding is shown in Figure 3.4. In this visualization antecedent and consequent are visually encoded as colored pixels; support and confidence are visualized as colored bars.
• Red colored pixels: If a software artifact is contained in the antecedent of a rule, a red colored pixel is shown at the associated position in the association


Figure 3.4: An association rule matrix can represent n-ary association rules with colored pixels indicating antecedent and consequent, and small bar charts representing support and confidence.

rule matrix. Multiple adjacent red colored pixels in the same matrix column are caused by multiple artifacts in an antecedent that belong to the same hierarchical level, e.g., the same subdirectory.
• Blue colored pixels: Analogously, blue pixels indicate artifacts that are contained in the consequent of an association rule. Hence, adjacent blue pixels in a matrix column show that the rule suggests to change multiple artifacts of the same hierarchy level.
• Gray colored pixels: Artifacts that are not affected by any of the rules are grayed out.
Below each column of the association rule matrix view one can find a color-coded bar that indicates the support value of the rule by its length and the confidence value by its color. A detail-on-demand feature can be used to show textual information about the currently selected rule.
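Picking up the definitions from Section 3.3.1.3, the frequency, support, and confidence of n-ary rules can be computed directly on the set of transactions; a minimal sketch (our own illustration, not the implementation used in EPOSee):

import java.util.*;

class NaryRuleMetrics {
    // freq(M) := |{ t in T : M is a subset of t }|
    static int freq(List<Set<String>> transactions, Set<String> m) {
        int count = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(m)) {
                count++;
            }
        }
        return count;
    }

    // supp(A => B) = freq(A u B)
    static int support(List<Set<String>> transactions, Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return freq(transactions, union);
    }

    // conf(A => B) = freq(A u B) / freq(A)
    static double confidence(List<Set<String>> transactions, Set<String> a, Set<String> b) {
        int freqA = freq(transactions, a);
        return freqA == 0 ? 0.0 : (double) support(transactions, a, b) / freqA;
    }
}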

3.3.2 Sequence Rules

Association rules express which artifacts have been changed simultaneously. A second rule type, which we denote by sequence rule, has an additional temporal component, which provides information about the order of change of the involved artifacts.


3.3.2.1 Generating Sequence Rules

A sequence rule consists of an ordered list of elements, which is a list of software artifacts in the scope of this thesis. Let S = s_1 → ... → s_n be an arbitrary sequence consisting of n artifacts. The order of the artifacts in the rule can be read off directly: we define for every 1 ≤ i ≤ j ≤ n that artifact s_i was changed before or at the same time as artifact s_j, but not after it.

Though sequence rules have much in common with association rules, they also have some differentiating factors. Both rule types consist of two parts, antecedent and consequent. For association rules these parts are modeled by sets, whereas the underlying data structures for sequence rules are ordered lists. Sequence rules are preferable if we focus on the visual exploration of software archives. We overcome the problem of chronologically ordering the artifacts by aggregating several transactions in a meaningful way, which means that we recombine parts that belong together. This information could be acquired by collecting the set of transactions that are performed during a whole day. This type of transaction is denoted as a sequence transaction, and the chronological order of the single artifacts within such a sequence transaction can be extracted from the timestamp information of the real transactions. Analogous to association rules, we can define the metrics support and confidence for this special rule type as well, apart from the fact that the occurring sets have to be replaced by ordered lists or sequences.

3.3.2.2 Visualizing as Parallel Coordinates

Sequence rules have an inherent multivariate or hypervariate data structure, which means that all rules involve a certain, not necessarily the same, number of different attributes, namely the software artifact at the specific position in each rule. Software artifacts represent nominal data that can easily be ordered due to their lexicographical order and their hierarchical belonging. The most popular and very basic visualization technique for representing hypervariate data is the so-called parallel coordinates technique [96, 97, 177]. To visually encode sequence rules as a parallel coordinates plot, we first arrange the software artifacts vertically in a hierarchical and lexicographical ordering. The horizontal axis is equally divided into as many parallel vertical axes as artifacts exist in the longest antecedent of the visualized rule set plus the number of artifacts of the longest consequent of the rule set. On each vertical line all artifacts are represented in the same order. To complete the visualization, we draw a polyline that connects the points


Figure 3.5: An example of a parallel coordinates view [96, 97, 177] that uses color coding to differentiate between the hierarchy levels. Antecedent and consequent separate the view into two parallel coordinate blocks.

on each vertical axis corresponding to the single artifacts at this position in the rule.

A parallel coordinates plot has several drawbacks. The most important one is that several polyline segments can belong to many different sequence rules. Polylines are more or less a node-link based representation for hypervariate data and consequently suffer from visual clutter caused by many edge crossings and parallel lines. A visualization tool that provides a parallel coordinates view should definitely offer interactive features. A user should be able to interactively filter several attributes or constrain them to specific domains. This would serve as a kind of filtering function, and consequently, all corresponding rules should be highlighted, which in turn would reduce visual clutter. The ordering of the vertical axes plays an important role in this technique. A permutation of the axes could lead to a different visual appearance, but fortunately, the order of the artifacts in sequence rules is fixed. Also a color coding could help to find out how many rules belong to one particular edge and to what extent. In the example of the parallel coordinates view in Figure 3.5, we use an additional hierarchical sorting of the artifacts, which is illustrated by a color coding of the different hierarchy levels. This separation allows us to find outliers and clusters more efficiently. In Section 3.4 we apply the parallel coordinates technique to a sequence rule set mined from the MOZILLA open source software project and show which insights can be gained from sequence rules by applying this technique.

3.3.2.3 Visualizing as Decision Prefix Trees

In the decision prefix tree view in Figure 3.6(a) each sequence rule is shown as a single tree branch. The common prefixes are represented only once in the visualization. We can easily depict large trees or subtrees, and hence, frequently occurring artifact sequences. The branching nodes and the ramifications have a special color coding.


The confidence metric is indicated by the color of each single branch, and the support metric by the color of a leaf node. Apart from having a good overview of the sequence rule set, it also serves as a navigation tool. The rule in focus is highlighted, and details on demand are given in the center view, which shows the artifacts involved as well as confidence and support metric values, the rule number, and the prefix tree number. In Figure 3.6(b), one rule is focused, and all rules in the neighborhood are zoomed in. The corresponding nodes are represented as three-dimensional bars.


Figure 3.6: The decision prefix tree view can be used to detect frequently occurring common prefixes in the sequence rule set with color coded branches and leaf nodes: (a) an overview of the sequence rule set; (b) an enlarged and three-dimensional view with the green colored rule in focus.
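The prefix sharing underlying this view can be sketched with a small trie-like data structure (our own illustration; the data structures in the actual tool may differ): every sequence rule is inserted artifact by artifact, and rules with a common prefix automatically reuse the same branch.

import java.util.*;

class PrefixTreeNode {
    Map<String, PrefixTreeNode> children = new LinkedHashMap<>();
    double confidence, support;   // rendered as branch and leaf node colors in the view

    // inserts one sequence rule given as an ordered list of artifacts;
    // the metrics are attached to the node that ends the rule's branch
    void insert(List<String> rule, double conf, double supp) {
        PrefixTreeNode node = this;
        for (String artifact : rule) {
            node = node.children.computeIfAbsent(artifact, a -> new PrefixTreeNode());
        }
        node.confidence = conf;
        node.support = supp;
    }
}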

The drawback of this idea is that the hierarchical ordering of the artifacts is generally lost in this visualization. The reason is that rules with the same but permuted artifacts may belong to different subtrees. Hence, a clustering cannot be detected by using this representation. To also inspect the hierarchical belonging of the artifacts, we developed the approach described in Section 3.3.2.4, denoted as 'Trees in a Treemap'.

3.3.2.4 Visualizing as Trees in a Treemap

The Trees in a Treemap visualization is based on the same idea as the decision prefix trees in Section 3.3.2.3. First of all, we generate the prefix tree where each node is associated with an object of a taxonomy. Taxonomies are a common and powerful tool to structure information. They are special trees where each leaf node


represents some object and all intermediate nodes represent classifications of the objects represented by their child nodes. More precisely, each node of the prefix tree corresponds to a leaf node of such a taxonomy. The data model is very similar to compound digraphs apart from the fact that an artifact can appear several times in a sequence rule, and hence, more than one representative element is required for the nodes of the prefix tree. The presence of all these representatives is very important for this technique because otherwise we would lose the information which path to follow at intersection points, as is the case in conventional node-link representations: there we typically have just one unique representative, and the paths intersect to a high degree. The same problem occurred in the parallel coordinates visualization where several polyline parts are drawn on top of each other. Figure 3.7(a) shows a small example of a prefix or object tree and a taxonomy in two separate diagrams, whereas in Figure 3.7(b) the objects of the object tree are linked by straight lines to the corresponding leaves of this taxonomy.


Figure 3.7: The nodes of a decision prefix tree can be leaf nodes of a taxonomy: (a) the prefix tree and the taxonomy in two separate views; (b) the leaf nodes of the taxonomy linked to the nodes of the prefix tree by straight lines.

In our approach we visually encode the taxonomy as a treemap, hence the name of the technique, Trees in a Treemap. The prefix trees are represented by node-link diagrams, and all representative elements are placed in their corresponding treemap boxes. The exact position in a box is computed by a special layout algorithm, which reduces visual clutter. We developed two different approaches for the representation of the links:
• Straight links: This kind of representation can lead to very small angles at edge crossing points. Consequently, the viewer could be confused and draw wrong conclusions, see Figure 3.8(a).


• Orthogonal links: Angles with 90 degrees at the intersection points make an exploration easier with respect to path-related tasks, though the links become much longer and may have several bends, see Figure 3.8(b).


Figure 3.8: A small example for the Trees in a Treemap technique shows both types of links for the same object tree as in Figure 3.7: (a) with straight links; (b) with orthogonal links.

The challenge is to draw object trees such that the position of each object in the taxonomy is easily visible. The simplest way to show an object tree and the underlying taxonomy is to draw the tree diagrams of both of them next to each other, as shown in Figure 3.7(a). In this case it is not immediately obvious how often and where an object of the taxonomy occurs in the object tree. To make the associations explicit, we can draw links from each object of the taxonomy to all associated nodes in the object tree, as shown in Figure 3.7(b).

Also adjacency matrices can be used as a space-filling visualization of graphs by representing every entry of the matrix as a single pixel. Admittedly, for trees the space efficiency is not as good as for general graphs because trees do not belong to the category of dense graphs. Figure 3.9 shows the adjacency matrix of an object tree both with unsorted and with grouped objects. In Figure 3.9(a) the objects are sorted by a breadth-first traversal. As a result, rows and columns of objects that are close together in the taxonomy are also close together in the matrix. Object trees could also be represented by a parallel coordinates plot [97]. The depth of an object in the tree can be used to draw the polylines, whereas the objects are vertically sorted according to the hierarchical ordering of the leaf nodes in the taxonomy. Figure 3.10 shows this idea by a small example.

Our novel approach integrates both the taxonomy and the object tree in one view, hence, Trees in a Treemap. The treemap shows the taxonomy, and the nodes of the object tree are drawn in the boxes representing the objects associated with each node. Thus, for every node of the object tree its position in the taxonomy is easily visible. In addition, the treemap can also be used to hide details of the object tree by collapsing subtrees of the taxonomy. In Figure 3.8, for example, we can have a look at a fully expanded treemap.


Figure 3.9: Adjacency matrices can be used to represent graphs. The ordering of the representative elements at both axes is important to gain insights: (a) breadth first traversal order; (b) lexicographic order.

Each node of the object tree is represented by a small circle placed in the box of the treemap that represents the object associated with that node. Edges are either drawn as straight lines connecting these circles or as orthogonal lines, i.e., horizontal and vertical lines with 90 degree bends, as mentioned before. The root of the tree is indicated by a bigger green circle. It is also possible to draw several object trees in the same treemap.

Comparison to Other Techniques To compare the different aforementioned visualization techniques, we use the following criteria:
• Single representation: Is each object of an object tree represented in the visualization? If the taxonomy is shown, is this single representation also used by the taxonomy?
• Crossings: Visual clutter is caused by many edge crossings or overlaid objects. Is visual clutter reduced? What is the number of edge crossings in the object tree?
• Continuity: Is it easy for the human eye to follow paths or to visually solve path-related tasks? For an orthogonal layout we tend to have longer edges but fewer crossings, and these crossings have a right angle. For straight edges the angles at the crossings may become very small, and it may become very difficult to follow these edges.
• Clusters/Outliers: Does the visualization allow the detection of clusters and outliers?


Figure 3.10: Object trees can be drawn as a parallel coordinates plot with the objects sorted at the vertical axis according to their leaf word ordering in the taxonomy.

• Compactness: How much information is encoded in a certain part of the screen? Is the visualization compact or space-filling?
• Taxonomy: Is the taxonomy shown? How, and to what extent, is the taxonomy represented?
The results of our comparison are summarized in Figure 3.11. It is noteworthy that only the Trees in a Treemap visualization provides a single representation of each object while at the same time showing the full taxonomy. In addition, the orthogonal layout only leads to edge crossings at 90 degree angles and thus improves the continuity of the edges. A color coding of the edges is an additional feature that supports a viewer in solving path-related tasks even at points where many edges intersect.

Layout of Trees in a Treemap The tree representing the taxonomy can be visualized as a space-filling treemap. This kind of visualization meets our demands because each node of the taxonomy tree is encoded as a treemap box whose borders are the bounds for placing the representative tree elements. The root node of the taxonomy is represented by the whole rectangular area, whereas the boxes of child nodes are split with respect to the metric that is responsible for the box space assignment. This metric depends on the dataset we are trying to analyze. In the discipline of software visualization, or visual data mining in software archives, we are working with software artifacts such as files and methods. If each box represents a file of a software system, we could, for example, use quantitative data such as the size of each file, the number of code lines in a file, or the number of changes as a box size metric. A second metric could be represented by the color of each box, e.g., the age of a file as an ordinal data type or the person who did the last commit as a kind of nominal


data. The number of nominal data values is limited by the number of colors that a viewer can differentiate; hence, the number of developers that can be represented is limited. Another drawback does not stem from the data type in use, but rather from the intertwined visual encoding of overlapping information: if the boxes are already color coded, how can we represent the metrics that are used in the object trees on top of the treemap boxes? If these color codings are not well chosen, we obtain a visualization that can confuse a viewer. For this reason the user can interactively and independently select a color coding for both the treemap boxes representing the taxonomy and the node-link diagrams representing the object trees. The default color coding uses black treemap boxes. The fact that the elements of an object tree are only mapped to leaf nodes of the taxonomy and never to inner nodes means that the treemap visualization does not have to show intermediate levels, which would need additional display space.
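To make the box space assignment concrete, the following sketch shows a simple slice-and-dice style subdivision (our own simplification; the thesis does not fix a particular treemap algorithm here): the box of each node is divided among its children proportionally to their size metric, alternating between horizontal and vertical cuts per hierarchy level.

import java.util.*;

class TreemapNode {
    List<TreemapNode> children = new ArrayList<>();
    double size;          // e.g., file size, lines of code, or number of changes;
                          // for inner nodes assumed to be the sum of the children's sizes
    double x, y, w, h;    // assigned box

    void layout(double x, double y, double w, double h, boolean horizontal) {
        this.x = x; this.y = y; this.w = w; this.h = h;
        double total = 0;
        for (TreemapNode c : children) total += c.size;
        double offset = 0;
        for (TreemapNode c : children) {
            double share = total == 0 ? 0 : c.size / total;
            if (horizontal) {
                c.layout(x + offset, y, w * share, h, false);   // vertical cuts
                offset += w * share;
            } else {
                c.layout(x, y + offset, w, h * share, true);    // horizontal cuts
                offset += h * share;
            }
        }
    }
}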

[Figure 3.11 compares separate trees, linked trees, colored trees, the adjacency matrix in BFT order, the adjacency matrix in leaf word order, sorted parallel coordinates, and Trees in a Treemap (TiaTM) with straight and with orthogonal links with respect to single representation, crossings, continuity, detection of clusters and outliers, compactness, and the way the taxonomy is represented.]

Figure 3.11: A comparison of related visualization techniques with respect to some visualization criteria can show the benefits and drawbacks of the novel technique.


We developed three different layout techniques, starting with a very naive one that places all nodes randomly and then draws straight links between these nodes. The freedom for placing the nodes is limited by the borders of the bounding boxes. Following this approach, we obtain a very cluttered display, but the color coding helps to detect clusters and patterns. The second approach does not make use of the whole box for node placement but always draws a node in the center of its corresponding treemap box. This decreases visual clutter, but many edges are drawn on top of each other, and hence, some information about individual edges is hidden. Still, this idea has some benefits: we can easily explore which leaf nodes of the taxonomy are related to each other by a subpart of a sequence rule. The most sophisticated algorithm that we developed makes use of orthogonal edges. Path-related tasks can be solved much more easily, but the drawback of this idea is the inefficiency of the algorithm with respect to its running time, which will be discussed in the following.

For all three approaches, we assume that the treemap has already been generated as a starting point in which we have to place the nodes and links of the object trees. Suppose that all k object trees that have to be laid out consist of n nodes, and hence, of n − k edges in total.
• All nodes randomly: Placing all nodes randomly in their corresponding treemap boxes and drawing the directed links between related nodes is a very naive approach. Though this idea generates very chaotic layouts, it also has some benefits. The running time of the algorithm can be asymptotically approximated by O(n). In most cases the links do not overlap, but the angles between the links become very small, which is the major drawback of this layout algorithm. Still, the generated layout meets our demands because we are able to detect clusters or patterns by inspecting those hierarchy levels through which most rules pass.
• All nodes centered: To reduce visual clutter we always draw the nodes of the object trees in the center of their representative treemap boxes. The root nodes of the object trees are treated as special objects and are placed randomly in their representing boxes; the center point of each box is used for non-root nodes with the goal to distinguish those nodes from root nodes. The running time of this algorithm can also be asymptotically approximated by O(n). Though the layout can be computed very fast, it has one important drawback: multiple edges between the same two nodes can overlap and hence hide some information, which was already the case for the parallel coordinates plot. But the layout has its strengths if someone is interested in detecting whether there exist relations at all between different levels of the hierarchy.
• Orthogonal edge layout: The third approach has the goal to reduce the visual clutter that is caused by many edge crossings and to avoid small angles at intersecting links that may confuse a viewer. For this reason we


implemented an orthogonal variant of the layout algorithm, which can handle many different parameters such as the number of edge direction changes (or bends for short), the minimal edge length, or the maximal edge length. Length refers to the number of cells that have to be crossed on the way from a starting point to a destination point, because the algorithm uses a rectangular grid to compute the sampling points. The higher the number of edge bends or the longer an edge, the more time-consuming is the computation of the layout. But a reduction of edge crossings and edges that are as short as possible are the main aesthetic criteria in the graph visualization community, and hence, the running time of the algorithm can be kept low. We will discuss the time efficiency of the algorithm later on when showing the pseudo code for this algorithm. The benefit of this approach is that all nodes and all edges are present, do not overlap, and the number of edge crossings is reduced. Consequently, this design is well suited for path-related tasks.

Computing a layout with a minimum number of edge crossings is an algorithmic problem that is known to belong to the class of NP-complete problems [73]. To obtain a faster solution, we use a heuristic approach that restricts the number of bends or edge direction changes, which in turn restricts the search space immensely. Due to the orthogonal layout, edge directions are limited to four possibilities, namely left, right, up, and down, or west, east, north, and south when using geographic orientation. A change of direction means that an edge takes a different direction from the one it currently has. Additionally, to avoid cycles, we omit the contrary direction. Consider, as an example, an edge that has direction west; then this edge can only point to direction north or south after the change. The direction east is forbidden, and west would not change the current direction at all.

The number of direction changes is the main reason for the search space explosion. The number of different edges of at least length l_min and at most length l_max with at most c_max direction changes can also be asymptotically estimated, but deducing this formula is a little bit more complicated. For the layout of k object trees consisting of n nodes in total, we have to compute the tracks of n − k edges. The additional length restriction l for an edge and the maximal number of edge bends c_max lead to the combinatorial problem of putting l indistinguishable balls into c_max + 1 distinguishable containers with at least one ball in each container. Additionally, the number of balls in each container can be divided into exactly two classes. The reason for the extra distinction comes from the fact that exactly two different edge direction changes are possible each time an edge bends.


Putting $m$ indistinguishable balls into $k$ distinguishable containers with at least one ball in each container results in

$$\binom{m-1}{k-1} = \frac{(m-1)!}{(k-1)!\,(m-k)!}$$

different possibilities for ball arrangements, or with the aforementioned parameters

$$\binom{l-1}{c_{max}} = \frac{(l-1)!}{c_{max}!\,(l-1-c_{max})!}$$

Now we can proceed to deduce the desired formula for the number of all possible edges, denoted $\#(n, l_{min}, l_{max}, c_{max})$, and disregard the number $k$, which is comparably small to the number of all nodes. Formally, we can express $\#(n, l_{min}, l_{max}, c_{max})$ as

$$
\begin{aligned}
\#(n, l_{min}, l_{max}, c_{max}) :=\ & \sum_{i=1}^{n} \sum_{j=l_{min}}^{l_{max}} 2 \cdot 2^{c_{max}} \binom{j-1}{c_{max}} && (3.1)\\
=\ & \sum_{i=1}^{n} \sum_{j=l_{min}}^{l_{max}} 2^{c_{max}+1} \binom{j-1}{c_{max}} && (3.2)\\
=\ & 2^{c_{max}+1} \sum_{i=1}^{n} \left[ \sum_{j=0}^{l_{max}} \binom{j-1}{c_{max}} - \sum_{j=0}^{l_{min}-1} \binom{j-1}{c_{max}} \right] && (3.3)\\
=\ & 2^{c_{max}+1} \sum_{i=1}^{n} \left[ \binom{l_{max}+1}{c_{max}+1} - \binom{l_{min}}{c_{max}+1} \right] && (3.4)\\
=\ & n \, 2^{c_{max}+1} \left[ \binom{l_{max}+1}{c_{max}+1} - \binom{l_{min}}{c_{max}+1} \right] && (3.5)\\
=\ & \frac{n \, 2^{c_{max}+1}}{(c_{max}+1)!} \left[ \frac{(l_{max}+1)!}{(l_{max}-c_{max})!} - \frac{l_{min}!}{(l_{min}-c_{max}-1)!} \right] && (3.6)
\end{aligned}
$$

The formula lets us presume that the worst case running time for computing an orthogonal edge layout may grow exponentially in the number of edge bends $c_{max}$. The factorial $(c_{max}+1)!$ in the denominator of the first fraction can also be estimated by Stirling's formula:

$$(c_{max}+1)! \approx \sqrt{2\pi(c_{max}+1)} \cdot \left(\frac{c_{max}+1}{e}\right)^{c_{max}+1}$$


By replacing the denominator of the first fraction with this estimation, we obtain the result that an increase of $c_{max}$ also leads to an exponential increase of the number of all possible edges. Taking into account the aesthetic criteria that the number of edge bends should be as small as possible and the length of edges as short as possible, the running time is not as bad as the formula suggests. Moreover, the search space that consists of all possible orthogonal edges shrinks immensely once a nearly optimal edge has been found by the algorithm.

Even an optimal position of the treemap boxes in each hierarchy level could be computed to minimize the number of edge crossings and the edge lengths. But this is also a very difficult task. This problem is known as the Optimal Linear Arrangement (OLA) problem [73] and has also been proven to belong to the class of NP-complete problems. As long as we are dealing with small treemaps with only a few items in each hierarchy level, this is not a problem at all. But as the instances of this optimization problem get bigger, we have to use a heuristic approach and compute a nearly optimal solution.

Let $V$ be the set of nodes of the taxonomy and $coupling: V \times V \longrightarrow \mathbb{N}$ be a function that maps each pair of nodes of the taxonomy to the number of couplings that these two nodes have in the given dataset. Keep in mind that each leaf node corresponds to a particular artifact in the rule set and intermediate nodes serve as containers for these nodes. Treemap boxes are the visual encodings of the taxonomy nodes $v_1, \ldots, v_n \in V$ in a hierarchy level. We say that the boxes $b_1, \ldots, b_n$ in a hierarchy level are optimally linearly arranged with permutation $\pi$, if and only if the permutation $\pi: \{1, \ldots, n\} \longrightarrow \{1, \ldots, n\}$ of the boxes $b_1, \ldots, b_n$ minimizes

$$\sum_{i=1}^{n-1} coupling(v_i, v_{i+1}) \cdot \left| \pi(v_i) - \pi(v_{i+1}) \right|$$
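For a small hierarchy level, this objective can even be minimized exactly by enumerating all permutations. The following sketch illustrates the idea; it is our own simplified illustration with invented names, not the implementation used in the tool.

    import java.util.Arrays;

    // Sketch (illustration only): brute-force optimal linear arrangement of the
    // boxes of one small hierarchy level. coupling[i][j] holds the number of
    // couplings between taxonomy nodes v_i and v_j; perm[i] is the position of box b_i.
    public class OlaBruteForce {

        // cost of an arrangement according to the objective stated above
        static int cost(int[][] coupling, int[] perm) {
            int sum = 0;
            for (int i = 0; i < perm.length - 1; i++) {
                sum += coupling[i][i + 1] * Math.abs(perm[i] - perm[i + 1]);
            }
            return sum;
        }

        static int[] bestPerm;
        static int bestCost;

        // enumerate all permutations of positions 0..n-1 (feasible only for small n)
        static void search(int[][] coupling, int[] perm, int k) {
            if (k == perm.length) {
                int c = cost(coupling, perm);
                if (c < bestCost) { bestCost = c; bestPerm = perm.clone(); }
                return;
            }
            for (int i = k; i < perm.length; i++) {
                int t = perm[k]; perm[k] = perm[i]; perm[i] = t;   // swap
                search(coupling, perm, k + 1);
                t = perm[k]; perm[k] = perm[i]; perm[i] = t;       // swap back
            }
        }

        public static void main(String[] args) {
            // toy coupling matrix for four taxonomy nodes of one hierarchy level
            int[][] coupling = { {0, 5, 1, 0}, {5, 0, 2, 1}, {1, 2, 0, 4}, {0, 1, 4, 0} };
            int n = coupling.length;
            int[] perm = new int[n];
            for (int i = 0; i < n; i++) perm[i] = i;
            bestPerm = perm.clone();
            bestCost = Integer.MAX_VALUE;
            search(coupling, perm, 0);
            System.out.println("best arrangement: " + Arrays.toString(bestPerm) + ", cost " + bestCost);
        }
    }

For larger levels such a brute-force search is infeasible, which is exactly why a heuristic, nearly optimal solution is used, as described above.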

We apply the optimal linear arrangement approach recursively in all hierarchy levels, starting with the root, which is trivially optimally arranged. This algorithm represents a top-down sorting approach. After computing an optimal, or in the case of a large number of artifacts, a nearly optimal solution to the OLA problem, we start to lay out the trees in the treemap. The steps that are needed to lay out one single orthogonal edge in a treemap are explained in more detail in the following.


First, we need to compute a good position for the start and end nodes of the edge in the corresponding treemap boxes, if these positions have not already been determined by a previously laid out edge. As a heuristic, we place the nodes at a free position near the bounds of their representative boxes to keep the orthogonal path that visually represents an edge as short as possible. It may be noteworthy that the term edge refers to a mathematically modeled edge of a graph, whereas a path here denotes the route that an edge actually takes in the treemap. The term edge is detached from its visual placement, whereas the term path is not. The algorithm iteratively checks all paths contained in the search space. The starting path is always precomputed as the shortest orthogonal line from the starting node to the destination node, no matter how many crossings it has with already existing paths or nodes. This starting path is taken as the currently optimal path. The algorithm compares it with the next path from the search space that has not been examined yet and that leads from the starting point to the destination point. If there is a better one that meets the former conditions, it replaces the current path and becomes the currently best one. The algorithm terminates when all possible candidates have been compared to the currently optimal path. The algorithm is given in Listing 3.2.

    void update_tm(Point start, Point end) {
        // enumerate all candidate orthogonal paths between the two points
        SearchSpace sp = new SearchSpace(start, end);
        // the shortest orthogonal path serves as the initial, currently optimal path
        Path cur_opt = comp_firstPath(start, end);
        Path nextP;
        while (sp.hasNext()) {
            nextP = sp.get_nextPath();
            if (nextP.getScore() < cur_opt.getScore()) {
                cur_opt = nextP;    // a better candidate has been found
            }
        }
        // mark the cells of the final path in the treemap matrix
        storePath(cur_opt);
    }

Listing 3.2: Computation of an orthogonal edge

The treemap is internally represented as a two-dimensional matrix in which each matrix entry has a special property. The method storePath(Path p) stores the entry properties for the computed path. The encoded cell properties can be used to decide whether a pair of treemap coordinates has already been used by a visual element. The algorithm needs this information to decide whether there can still be a better path with fewer crossings, or whether the currently computed path cannot be improved. The function getScore() is used to decide whether a candidate path is an improvement; to this end, it takes all previously generated paths into account.
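The concrete scoring is not spelled out here; a plausible sketch, with cell encodings and weights that are our own assumptions, could combine the number of crossings with the path length as follows:

    // Sketch of a possible scoring function; the cell constants and the weight of
    // 10000 per crossing are assumptions for illustration, not the tool's actual values.
    class PathScore {
        static final int FREE = 0, PATH = 1, NODE = 2, BOX_BOUND = 3;

        // tm is the treemap grid; cells lists the grid coordinates a candidate path occupies
        static int getScore(int[][] tm, java.util.List<int[]> cells) {
            int crossings = 0;
            for (int[] c : cells) {
                if (tm[c[0]][c[1]] == NODE) return Integer.MAX_VALUE;      // never paint over a node
                if (tm[c[0]][c[1]] == BOX_BOUND) return Integer.MAX_VALUE; // keep box bounds visible
                if (tm[c[0]][c[1]] == PATH) crossings++;                   // crossing an existing path
            }
            // fewer crossings dominate; among equal crossings the shorter path wins
            return crossings * 10000 + cells.size();
        }
    }

Such a score directly reflects the rule that, among paths with the same number of crossings, the shorter one should be chosen.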


If the best possible path has been found for an edge, the treemap matrix is updated and the algorithm tries to find a layout for the next edge. The algorithm iterates over the whole edge list, which is ordered with respect to the branch lengths. In this context, length means the minimal distance between the box bound points that serve as the representative visual elements of these object tree nodes. The algorithm follows a depth-first search strategy. This keeps the memory requirements very small and even works faster than a breadth-first search strategy. We also tried a mixture of the two strategies, but there was no improvement. To draw a path between two nodes, the corresponding boxes have to be computed first. The boxes restrict the coordinates where the nodes can be placed. Next, we compute one path that has at most one direction change and that leads from the starting point to the destination box. In most cases this path is not the optimal one. So the algorithm has to look for a better one, that is, one with fewer crossings with already laid out paths. Note that there is no direct relation between the length of a path and its number of crossings. In many cases we found that this relation was even inversely proportional: the longer the path, the fewer crossings it has in general. This phenomenon makes the solution to this problem extremely difficult and requires some kind of trade-off. This is the reason why we use depth-first search in our algorithm. We search for longer paths to reduce the number of crossings. If there exist two paths with the same number of crossings, the shorter one is chosen.

Figure 3.12: The algorithm uses a two-dimensional grid to explore the search space for an optimal orthogonal edge. The two example paths from the search space differ in length but both have two direction changes.


Since the treemap is divided into a two-dimensional grid (see Figure 3.12) that consists of finitely many points, the space for drawing paths that do not overlap each other decreases very rapidly. The algorithm avoids regions where many paths are located because crossings are more likely within these regions. To tackle this problem, the algorithm selects longer paths with a small number of bends. The algorithm can be described in terms of its parameters and the problem to be solved:

• An array tr for the object trees $T_i = (V_i, E_i)$
• For each $v \in T_i$ the box bounds in the OLA treemap
• The maximal number of bends $c_{max}$
• The minimal edge length $l_{min}$
• The maximal edge length $l_{max}$

The goal of the algorithm is a nearly optimal orthogonal layout of the trees $T_i$ in the optimally linearly arranged treemap. Listing 3.3 shows the code for this algorithm.

    ObjectTree[] tr;
    Point[][] start, end;
    int[][] tm = initialize_tm();

    for (int i = 0; i < tr.length; i++) {
        // edges of one object tree, ordered by branch length
        Edge[] eds = tr[i].sort_edges();
        start = new Point[tr.length][eds.length];
        end = new Point[tr.length][eds.length];
        for (int j = 0; j < eds.length; j++) {
            start[i][j] = comp_startPoint(eds[j]);   // node positions with minimal distance
            end[i][j] = comp_endPoint(eds[j]);
            update_tm(start[i][j], end[i][j]);       // lay out one orthogonal path
        }
    }

Listing 3.3: Orthogonal layout of the object trees

The two-dimensional arrays start and end contain the x- and y-positions of the laid out nodes of the object trees. These are computed by the methods comp_startPoint(Edge e) and comp_endPoint(Edge e) respectively, which compute the points with a minimal distance. An orthogonal path between these two points is computed by the method update_tm(Point start, Point end), and the treemap array tm is updated as soon as such a path is found. This tm array is filled with integer values that are mapped to visual elements in the visual mapping step later on.


The update process is accomplished iteratively, that is, edge by edge. Parallel paths are a bit confusing and should keep some distance in the visualization. For this reason the algorithm uses a weighting function that maps points in the near environment of an already existing path to higher values than those that are farther away; a small sketch of such a weighting function is given at the end of this section. A newly computed path can only take routes that are below a given weight bound, and consequently, parallel paths can be forced to keep a minimum distance. A similar weighting function is used for the following confusing phenomena:

• Paths over a box bound: Box bounds are important to distinguish different boxes from each other. A path could hide a box bound so that the user makes wrong interpretations of the treemap.

• Paths over a node: Nodes are absolutely necessary to see where a path starts and where it ends. So these are not allowed to be painted over by other paths.

• Paths over paths: Apart from crossing a path, we want to avoid two paths lying on top of each other. The best situation is achieved when they have no point in common at all.

• Self crossings: These are also a problem because a self-crossing path could be perceived as a cycle. So we want to avoid cycles whenever possible.

In many applications we find trees of objects that are also classified in some taxonomy. In Section 3.3.2.4 we looked at several known and some novel techniques to draw this kind of tree such that the position of each object in the taxonomy is clearly visible. In particular, the orthogonal layout in Trees in a Treemap turned out to offer many advantages over the other approaches: There is a single representation for each object, the visualization is compact, the number of path crossings is reduced, the orthogonal paths are suited for path-related tasks, clusters and outliers can be detected, and above all, the taxonomy is visible in form of the treemap. In addition, parts of the treemap, and as a consequence, parts of the object trees can be collapsed and expanded again. In Section 3.4.3 we will show the usability of our novel visualization technique. We looked at two different datasets: sequence rules from software archives and network routing data. In the former case, we found, in particular, cross-cutting relations, i.e., dependencies between directories in different parts of the directory tree. In the latter case, we immediately spotted an anomaly: packets being repeatedly sent back and forth.
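Coming back to the weighting function mentioned above, a minimal sketch, with an assumed cell encoding and a simple Manhattan-distance falloff of our own choosing, could look like this:

    // Sketch of a weighting function: cells close to an already laid out path get a
    // high weight, cells farther away a lower one (encoding and falloff are assumptions).
    class PathWeights {
        static final int PATH = 1;   // cell already occupied by a path

        // returns a weight grid; a new path may only use cells whose weight is below a bound
        static int[][] weights(int[][] tm, int radius) {
            int rows = tm.length, cols = tm[0].length;
            int[][] w = new int[rows][cols];
            for (int r = 0; r < rows; r++) {
                for (int c = 0; c < cols; c++) {
                    if (tm[r][c] != PATH) continue;
                    // raise the weight of all cells within the given radius around a path cell
                    for (int dr = -radius; dr <= radius; dr++) {
                        for (int dc = -radius; dc <= radius; dc++) {
                            int rr = r + dr, cc = c + dc;
                            if (rr < 0 || cc < 0 || rr >= rows || cc >= cols) continue;
                            int dist = Math.abs(dr) + Math.abs(dc);
                            w[rr][cc] = Math.max(w[rr][cc], radius + 1 - dist);
                        }
                    }
                }
            }
            return w;
        }
    }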


3.4 The EPOSee Tool

The EPOSee tool [22, 24] was implemented in the JAVA programming language and gives a user the opportunity to interactively explore the rules retrieved by data mining. EPOSee can represent both types of rules, namely association rules and sequence rules. The following sections explain the tool in more detail and show how patterns and anomalies can be detected in a large rule set. EPOSee integrates the aforementioned visualization techniques into one tool to interactively explore mining rules extracted from software archives. We list the techniques that are supported by EPOSee in the following and show their usefulness by means of several case studies.

• Visualization of Binary Association Rules
  – Pixelmap (overview, context)
  – Support Graph (force-directed, polar layout)
  – 3D Bar Chart (of selected association rules, focus)
  – Rule Detail Window (of selected rule, focus)

• Visualization of n-ary Association Rules
  – Association Rule Matrix (overview, context)
  – Bar Charts (for support and confidence simultaneously)
  – Rule Detail Window (of selected rule, focus)
  – Item Legend Window

• Visualization of Sequence Rules
  – Parallel Coordinates View (overview, context)
  – Decision Prefix Tree (overview, context)
  – 3D Branch View (of selected sequence rules, focus)
  – Trees in a Treemap (overview, context, taxonomy)
  – Rule Detail Window (of selected rule, focus)

• Histogram (distribution of confidence and support)

All these visualization techniques can be changed interactively by a user to gain interesting insights into the very large textual rule data. How this visual exploration process works in detail will be shown in the next section.

3.4.1 How to work with EPOSee

The EPOSee tool can switch between different modes that depend on the type of rule set that is currently analyzed.


Typically, the visual data mining process works as follows: First, an overview of the rule set is provided. The user is able to detect interesting visual patterns like clusters in this early step. Next, he can inspect the rules of a visual pattern by selecting the rules involved and by viewing them in a zoomed display, which is possibly a 3D view. Additionally, filter functions allow the user to constrain the current rule set. A detail on demand function can be used in a last step to select single rules. All these steps conform to Ben Shneiderman's visualization mantra [10, 27], which states: Overview first, zoom and filter, then details on demand. In the following sections we will give a detailed description of how EPOSee integrates the different views for binary association rules, n-ary association rules, and sequence rules. Furthermore, we will see how these visualizations interact with each other. The open source software project MOZILLA serves as an example project for this visual analysis process. Figure 3.13 shows the tool in its binary association rule mode. We distinguish between binary and n-ary association rules; the latter cannot be represented by a traditional matrix visualization such as the pixelmap. For n-ary association rules we use the association rule matrix technique of Wong et al. [182].

Figure 3.13: The EPOSee tool in the binary association rule mode can represent different views integrated into one. On the left hand side we see the pixelmap view, in the middle the support graph view, and on the right hand side the three-dimensional bar chart view that serves as a zooming function.

In Figure 3.13 we see three views, which can be resized and moved on the user's demand. These visualizations show the pixelmap, the support graph, and the 3D bar chart view for a set of binary association rules. At the bottom of the window, the selected rule is represented in textual detail. When a user selects an interesting part of the pixelmap, this part is zoomed in and visualized as a 3D bar chart view. Next, the user can select a single rule from the bar chart view, which is then shown in detail. An advantage of EPOSee over single applications for each view is that the different views are synchronized. This corresponds to the concept of linking and brushing. If a rule is selected in the 3D bar chart view, for example, the corresponding artifacts are also emphasized in the support graph as well as in the pixelmap representation.

When displaying n-ary association rules, EPOSee opens a window containing the rule matrix and a second window that shows the list of artifacts involved. The user can select a single rule in the matrix, which is then shown in the detail view at the bottom of the application window. The n-ary association rule mode of our application is shown in Figure 3.14. The histogram in the bottom right corner shows the distribution of support or confidence, respectively, over the set of rules. These metric values can be represented as a histogram for all kinds of visualized rules.

Figure 3.14: The EPOSee tool in the n-ary association rule mode shows an enlarged association rule matrix, a list of involved artifacts, and a histogram in the lower right part of the view.

A histogram is a graphical representation of tabulated frequencies. These are typically shown as bar charts, which visually encode how many elements fall into particular categories that represent disjoint intervals of a domain. Categories must be visualized adjacently to allow better observations about the distribution of the values of a quantitative dataset. Enlarged example histograms are shown in Figure 3.15 for both support and confidence.
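As a minimal illustration of such a binning (our own sketch, not the tool's implementation), the rule metrics could be counted into equally wide intervals like this:

    // Sketch: count how many rules fall into each of numBins equally wide
    // intervals of a metric (support or confidence) between 0 and maxValue.
    class Histogram {
        static int[] bin(double[] values, int numBins, double maxValue) {
            int[] bins = new int[numBins];
            for (double v : values) {
                int idx = (int) (v / maxValue * numBins);
                if (idx >= numBins) idx = numBins - 1;   // put the maximum value into the last bin
                bins[idx]++;
            }
            return bins;
        }
    }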


Figure 3.15: Histograms can show the distribution of metric values of several rules over an interval: The upper part shows the support metric, the lower part the confidence metric.

If the user opens a file containing sequence rules, the EPOSee tool enters the sequence rule mode, which is illustrated in Figure 3.16. In this mode EPOSee shows the parallel coordinates view, the decision prefix tree, the 3D branch view, and the Trees in a Treemap visualization. The parallel coordinates view, the decision tree view, and the Trees in a Treemap give the viewer an overview of the rule set. Additionally, the selected rule is zoomed in the 3D branch view and is shown in detail in an additional view at the bottom of the application. Independently of the rule type, the EPOSee tool allows the user to filter the set of rules according to their support and confidence metrics. Furthermore, it provides a textual keyword search. Moreover, various color codings can be applied to each of the views separately. The items in the rules extracted from software archives are software artifacts like files, classes, methods, or functions. Instead of starting from a random order when visualizing these items in a rule, we use a total order derived from the application domain's hierarchy, e.g., methods are contained in classes, classes are contained in files, files are contained in directories, and directories are contained in other directories. In the following sections we will illustrate the techniques by means of several case studies and explain which insights can be gained from the datasets.


Figure 3.16: The EPOSee tool in the sequence rule mode shows the parallel coordinates plot on the left hand side. The view on the right hand side is split into the decision prefix tree view in the upper part and the 3D branch view in the lower part. A histogram again indicates the distribution of the values over the rule set. Also a detail on demand view is given on the left hand side of the histogram.

3.4.2 Case Study: MOZILLA

The Netscape Communications Corporation originally developed an internet client which was called MOZILLA. Several years later it became an open source software project [157]. The number of files in this project is roughly 77,000, which would result in a very large pixelmap of size 77,000 × 77,000, that is, 5,929,000,000 entries in the matrix. Counting functions rather than files would result in even larger pixelmaps. An overview of the whole dataset is not possible, even with this very compact visualization technique. In the following we therefore restrict our analysis to the browser subdirectory, which still contains 556 different files and is further subdivided into the subdirectories app, base, components, and resources. For this case study we used data mining techniques to extract rules from the MOZILLA CVS archive. We applied all the visualization techniques described above to the obtained rule sets. The goal of our analysis was to find interesting patterns and anomalies, the latter of which we call outliers. Actually, we are more interested in the outliers. As we do not know the internals of the MOZILLA project, we have to look into the files involved as well as the MOZILLA documentation to be able to explain these outliers.

3.4.2.1 Insights from Binary Association Rules

Figure 3.17: The pixelmap shows the support values of the browser subdirectory of the MOZILLA open source project. Small quadratic boxes along the diagonal are a sign of a well structured project.

Figure 3.17 shows evolutionary couplings as support values in a pixelmap. All files of the browser subdirectory of the CVS software archive of the MOZILLA project are shown in one view. The source code files of the web browser FIREFOX are contained in this directory. As mentioned before, we order the files hierarchically on both axes in the same manner. Inspecting the pixelmap, one can see that a file is more likely to be related to its neighboring files than to other files. The reason for this is quite simple: All files that are located in a square along the diagonal typically belong to the same subdirectory. Consequently, these clusters are a good sign for the hierarchical structure of the software system. If the pixelmap did not show these quadratic areas along the diagonal, it might be a sign of a poorly structured system. In that case a restructuring of the system might be advisable.
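To illustrate how such a pixelmap can be filled with data (the data structures below are our own simplification), each cell simply counts in how many transactions the two files were changed together:

    import java.util.List;
    import java.util.Set;

    // Sketch: fill an n-by-n support matrix from a list of transactions,
    // where each transaction is given as the set of indices of the files it changed.
    class PixelmapData {
        static int[][] supportMatrix(int numFiles, List<Set<Integer>> transactions) {
            int[][] support = new int[numFiles][numFiles];
            for (Set<Integer> tx : transactions) {
                for (int a : tx) {
                    for (int b : tx) {
                        if (a != b) support[a][b]++;   // co-change of files a and b
                    }
                }
            }
            return support;   // file indices are assumed to follow the hierarchical order
        }
    }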


bookmarks subdirectory                  prefwindow subdirectory
resources/locale/pref-bookmarks.dtd     content/pref-popups.xul
skin/Bookmarks-toolbar.png              content/pref-privacy.js
skin/bookmarksManager.css               content/pref-privacy.xul
                                        content/pref-proxies.js
                                        content/pref-proxies.xul
                                        content/pref-proxy-manual.xul
                                        content/pref-scripts.js

Table 3.1: The file set in the left column is strongly related to the files in the set of the right column.

Pixels representing evolutionary couplings between files of different directories are mostly located far away from the diagonal of the pixelmap and are called outliers. They indicate aspects orthogonal to the system hierarchy and can be a sign of a poor system architecture. This is the reason why software developers are more interested in outliers than in patterns. Figure 3.17 is annotated with one outlier, but apart from this one there are many more, which we did not highlight in the figure. Inspecting the figure more closely, we can detect strong relations between two sets of files in the components subdirectory. The files of the bookmarks subdirectory in the left column of Table 3.1 are strongly related to the files of the prefwindow subdirectory, which are given in the right column of the same table.

The dependencies between the files shown in Table 3.1 can be explained as follows: XUL files are the basis of the graphical user interface of the browser's implementation. These are XML files that describe the elements of the graphical user interface. As an additional feature, they link the elements to JavaScript functions that express the actions of the GUI. These functions are able to access and call the whole system functionality, for example, COM objects or C++ code. Cascading style sheets (CSS) files can be used to customize the outward appearance of the graphical elements of the GUI, for example by changing their display images. The detection of the aforementioned outlier led to a first hypothesis: All XUL files in prefwindow/content may directly or indirectly reference the DTD in bookmarks/resources/locale/pref-bookmarks.dtd. This relation would be easy to find by a textual analysis of all files. But a closer inspection revealed that the XUL files do not refer to this DTD, but rather to local DTDs such as prefwindow/locale/pref-privacy.dtd. These are related to the aforementioned DTD by the same naming conventions, that is, names and attributes. Moreover, they define the same tags to some extent. Consequently, if the naming conventions change, all DTD files have to be adapted to the new conventions. Uncovering these types of dependencies would be a daunting task by means of classical text or program analyses. Though the correlation between these two groups of files was detected by visualization techniques, we still had to inspect the corresponding source code manually to find out why the files are related. Some kind of detail on demand feature that combines the focused software artifacts with their corresponding source code snippets would therefore be very valuable to gain better and faster insights into the complex data. Though this idea looks very intuitive, it is not as straightforward as it seems. The reason for this is the different versions of the software artifacts, which make the choice of the correct artifact revision very difficult.

Another technique to explore binary association rules is the so-called support graph, which can also be used to detect clusters or outliers in the dataset. Figure 3.3 shows the support graph of the browser directory of the MOZILLA project. The graph nodes are color coded with respect to the hierarchical ordering and represent several files. A viewer can see the large cluster in the middle of the figure. The red colored part of it refers to the base subdirectory. A few light blue nodes are also drawn in this cluster, which are a sign of outliers. Table 3.2 lists the four corresponding files of the components/prefwindow/locale subdirectory that cause these outliers. On closer inspection we can find out that only the first file (pref-advanced.dtd) in the list in Table 3.2 is related to files in the base directory. The rest of the list is only related to this first one, but not to files in the base subdirectory.

components/prefwindow/locale
pref-advanced.dtd
pref-appearance.dtd
pref-applications-edit.dtd
pref-applications.dtd

Table 3.2: These four files cause the outliers in the support graph view.

3.4.2.2 Insights from n-ary Association Rules

So far we explored relations between exactly two files and called these dependencies binary association rules. In this section we will have a look at simultaneous changes of arbitrary sets of files, which we call n-ary association rules. Figure 3.4 shows the association rule matrix of the browser directory of MOZILLA. It is easily discernible that the biggest part of these n-ary association rules consists of artifacts that belong to the same subdirectory, which we call a pattern.


This pattern also involves a small number of exceptions. Several rules contain artifacts that belong to different subhierarchies and are hence denoted as outliers. One can see this kind of outlier by inspecting the subdirectories browser/base and browser/components as highlighted in the figure. If there is a change in base/content/browser.js, then components/prefwindow/locale/pref-tabs.dtd has been changed too. Ordering the rules horizontally with respect to their lexicographic order, treating antecedent and consequent as plain text, we can get even more insights into frequently occurring rule prefixes, as can be seen in Figure 3.4.

3.4.2.3 Insights from Sequence Rules

In the previous sections we looked at files that were frequently changed simultaneously. Now we look at the temporal order of the changes. Figure 3.18 shows a parallel coordinates view of the browser directory. The color of the nodes indicates the weighted sum of the support values of all rule prefixes that share this node. The color of the edges indicates the weighted sum of the confidence values. As the nodes are ordered with respect to the file hierarchy of the software system, we see multiple clusters consisting of many edges that only relate items in the same subdirectory. We also see that the files base/content/browser.js and base/content/browser.xul are related in a very interesting way to almost all JavaScript and XUL files, respectively. They are typically changed after one of these files has been changed. Figure 3.19 shows a part of the parallel coordinates view at a larger scale. There are two green edges leading to the light blue node representing the file browser.xul in the consequent column. The green edges indicate high confidence. But if we look at both items where these edges start, we see that there are no edges with high confidence pointing to these nodes. To fully understand the meaning of this phenomenon, we have to inspect the rules in the decision tree view of EPOSee. There, we found that the rule browser.js ⇒ browser.xul has a confidence value of 30 percent. The rule browser.js → browser.dtd ⇒ browser.xul has a confidence of 61 percent. In other words, the confidence increases considerably if the second file is also changed.
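As a brief reminder of how these percentages arise (using the usual definitions of support and confidence for such rules), the confidence of a rule roughly relates the support of the entire rule to the support of its antecedent:

$$confidence(A \Rightarrow B) \approx \frac{support(A \rightarrow B)}{support(A)}$$

Extending the antecedent from browser.js to browser.js → browser.dtd restricts the set of matching change sequences, and within this smaller set a subsequent change of browser.xul occurs much more often, raising the confidence from 30 to 61 percent.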


Figure 3.18: The parallel coordinates plot of the MOZILLA open source project shows strongly and weakly related software artifacts. To reduce visual clutter only rules with a minimum support of 11 are shown.


Figure 3.19: An enlarged area of the parallel coordinates plot of the MOZILLA open source software project with three files in focus shows interesting dependencies.

3.4.3 Sequence Rules as Trees in a Treemap

In this section we look at two different application domains for Trees in a Treemap. Clusters and outliers can be detected with the novel visualization approach both in software archives under version control and in traceroute data.

3.4.3.1 Case Study: SWT

To show the strengths of our visualization technique, we explored parts of the open source software project ECLIPSE SWT, the Standard Widget Toolkit. The generated sequence rule set is transformed into prefix trees or object trees by sharing common prefixes. The objects of the tree can be linked to the leaves of a taxonomy that is given by the directory structure of the software project. Figure 3.20 shows the root level of the Trees in a Treemap visualization of the decision trees we obtained for SWT on file level. As we can see, there exist several green colored nodes, each representing the root of a decision tree. These roots are placed randomly within the box, while all other nodes of the trees are located in the center of the box. There is only one red colored node in the center of the view. In Trees in a Treemap we use a different color coding for the nodes depending on the type of node:

• Green colored nodes always represent root nodes.
• Blue colored nodes represent nodes in the antecedent of a sequence rule.
• Red colored nodes represent nodes in the consequent of a sequence rule.

This additional encoding can only be used in the orthogonal layout approach and makes it easier to distinguish the type of a node, i.e., whether it is a root node, whether it is located in the antecedent of a rule, or whether it belongs to the consequent. Inspecting rules at root level with this technique does not convey much information about the rule structure itself, but we can see that there exist lots of trees, which we can examine in more detail by expanding the root box of the treemap.
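The transformation of the sequence rule set into prefix trees by sharing common prefixes, mentioned above, could be sketched roughly as follows (class names and structure are our own simplification, not the actual EPOSee data structures):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: merge sequence rules that share a common prefix into one prefix (object) tree.
    class PrefixTree {
        String item;                                   // software artifact, e.g., a file name
        List<PrefixTree> children = new ArrayList<PrefixTree>();

        PrefixTree(String item) { this.item = item; }

        PrefixTree childFor(String it) {
            for (PrefixTree c : children) {
                if (c.item.equals(it)) return c;       // shared prefix: reuse the existing branch
            }
            PrefixTree n = new PrefixTree(it);
            children.add(n);
            return n;
        }

        // each rule is given as the sequence of its items, antecedent followed by consequent
        static PrefixTree build(List<List<String>> rules) {
            PrefixTree root = new PrefixTree("<root>");
            for (List<String> rule : rules) {
                PrefixTree cur = root;
                for (String item : rule) {
                    cur = cur.childFor(item);
                }
            }
            return root;
        }
    }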


Figure 3.20: The Trees in a Treemap approach can be used on different hierarchy levels. However, the SWT prefix trees in the root directory do not convey much information.

In Figure 3.21 the treemap is expanded to an intermediate level, but not all subhierarchies are fully expanded down to the leaf level, which represents files. The paths of several subdirectories can be detected by reading the labeled treemap boxes. The set of these subdirectories contains SWT Custom Widgets, win32, common, gtk, carbon, and SWT PI, and each of them can be further expanded. In this layout the nodes are randomly placed within their corresponding treemap boxes, and the object trees can be distinguished by different color codings. Even with this very naive approach, one can detect some clusters and outliers in the rule set. We can easily see that the SWT Custom Widgets subdirectory is not related to any other subdirectory. The same is true for the carbon and the gtk subdirectories.

Figure 3.22 shows the relations in the gtk subdirectory and also makes the dependencies of the other subdirectories clearer. As mentioned before, we can easily discern that the gtk subdirectory is not related to any element of any other subdirectory. Additionally, all relations between single files of the gtk subdirectory are visible. Also some interhierarchical relations, i.e., relations between different hierarchy levels, become clearer. The common subdirectory is very closely related to files in the SWT PI/common j2se/org/swt/internal subdirectory, which is located at the upper right corner of Figure 3.22.

Figure 3.23 shows the same rule set as Figure 3.22 but with orthogonally laid out object trees. Representing the rule set this way additionally avoids the visual clutter that is typically caused by edge crossings. The algorithm generates many parallel, crossing-free lines. The orthogonal technique has its benefits when solving path-related tasks. In contrast to the layout in Figure 3.22, in the orthogonal layout we can easily see that the subdirectories common j2me and win32 are not related. Getting this insight was very difficult in Figure 3.22.

Figure 3.21: The treemap is expanded to an intermediate level of the hierarchy. Even with this naive approach one can detect patterns and outliers.

Figure 3.22: A further expansion of the gtk subdirectory makes the clustering in the SWT sequence rule set clearer.

Figure 3.23: An orthogonal layout of the object trees makes the visualization of the SWT object trees even clearer by reducing visual clutter.

This section illustrated how sequence rules from software archives can be represented as prefix trees that are laid out on top of a treemap that indicates the taxonomy. The visualization enables a viewer to inspect the additional hierarchical information about the software artifacts contained in the sequence rules. In the next section we focus on a very different application domain: trace routes between hierarchically organized IP addresses.

3.4.3.2 Case Study: Trace Routes

Figure 3.24 shows a visualization of internet routing data which was collected from October 2001 to July 2002. This dataset [25] was generated with the program traceroute, run from a host within the LAN of the Physics Department of the University of Rome 'La Sapienza'. The IP address of that host was 141.108.2.4. First, we generate one prefix tree from the underlying dataset. All routes have the common root node 141.108.2.4, a fact that results in just one single large prefix tree. The hierarchical order of the IP addresses is used to derive a taxonomy and to map those addresses to treemap boxes. In the following we will show how the Trees in a Treemap visualization can help to also understand data from this application domain, to detect patterns, and to uncover outliers.

Figure 3.24: Several internet routes from the same host can be drawn in a fully expanded hierarchy of IP addresses. Each route is colored differently to distinguish between the single paths that were taken in the hierarchy.


Inspecting Figure 3.24 we can easily see that the root node (green color) is positioned in the leftmost box, which represents IP address 141.108.2.4. This root node is the starting point of the path colored in blue/green that ends (red color) at IP address 192.9.104.17. But the path splits at IP address 212.1.200.25, which is indicated by the blue path starting from there and ending at IP address 4.1.122.230. So each time the routing path splits, an additional path in a different color is shown. Color coding is used here to distinguish between different paths. The end of a path is always represented as a red colored node. This coloring scheme helps to distinguish the routes of the packets sent. An outlier catches our eye when we have a closer look at the boxes representing IP addresses 192.9.104.15 and 192.9.104.17: The packets are sent back and forth between these two addresses and finally end up in one of them. Also in this application domain, we profit from the fact that the hierarchy of the object tree elements is integrated in the same view as the object tree itself. In the underlying dataset, each address consists of four numbers in the range of 0-255. Each of these numbers can be seen as a level of a taxonomy. The numbers from left to right indicate the levels of the network, i.e., from a net to its subnets. This additional information gives us an overview of whether the route of a packet changes between different local networks or not. The blue colored path in the upper part of Figure 3.24 consists of very long parallel lines, which alternate between IP address 157.130.60.249 and IP addresses 192.9.24.17 and 4.1.122.230, respectively. Examining the addresses in more detail, we detect that the packets were sent across more than one hierarchy level. On the other hand, we see the light blue colored path right below the dark blue colored path. It represents the route of a packet that was mainly sent between IP addresses starting with 152.*.*.*. In this example, there are few relations between different levels of the hierarchy. This means the packets did not leave this network very often. Also a comparison of two paths is possible with our visualization technique. If we look at the dark blue colored and the light blue colored path again, we find that these two paths have a common subsequence of IP addresses that is neither a prefix nor a postfix of the whole sequence. They share the same consecutive IP addresses from 152.63.19.29 to 152.63.55.82. This pattern can be detected directly because there is a sequence of neighboring blue colored node pairs. If more than one red colored node is placed within the same box, more than one path exists that ends at the same IP address. This is, for example, the case for the green colored and the light blue colored paths, which both end at IP address 192.9.24.17. So far, we examined relations between source code artifacts in software projects. The second part of this chapter introduces visualization techniques to explore code-
developer dependencies, which we call author-centric approaches, in contrast to the code-centric approaches, which do not focus on the person behind the source code.

3.5 Development Phases in Software Projects

The visualization techniques described before focus on software artifacts such as directories, files, or methods and hence can be classified as code-centric approaches. In this section we turn our attention to author-centric visualization, i.e., techniques where the developer stands in the center of evolution. This section focuses on three visualization techniques that can show how developers work together. We try to analyze whether they work as a team or whether they develop their specific parts of the software independently. Another very important difference to the aforementioned approaches is the type of data that is represented. In this part we explore time-series data instead of data that was generated by data mining techniques and aggregated over the whole evolution of a software project. Examining this time-based data with both static and dynamic (animated) visualizations can give very interesting insights into evolution phases, such as phases of stagnation, phases of active development, or phases where only certain developers are active. The usefulness of the techniques is again demonstrated by case studies of two open source software projects, namely JUNIT and TOMCAT3. We observed the development behavior of several programmers in terms of the modules they are working on and the time intervals in which they are active. Furthermore, we can detect which modules were changed by many different developers and which ones are owned by just a single author. We can ask several questions which we try to answer by means of our visualization techniques under the viewpoint that the developer stands in the center of evolution:

• Number of developers: Is the implementation work equally distributed among all developers, or is there one major actor or a small group of major developers? If there is one major developer, is his role occupied by several developers during the project's lifetime?

• File ownership: Is each developer working on his files independently of the others, or are there files that are shared by many developers at the same time?

• Development phases: Is the software project evolving at a constant speed, or are there phases during the evolution where the development is inactive? Are there alternating development phases? Are there trends or counter-trends during the evolution?


To answer these questions we developed three visualization techniques in our group that aim at visualizing time-series, developer-centered transaction data from software archives.

3.5.1 Transaction Overview

Typically, a software developer changes a set of files and later commits them to the software repository to make his changes accessible to all other developers of the system. Such a commit is called a transaction. It consists of a set of artifacts that belong together, have similar timestamps, and share the same log message. Each artifact in such a transaction is changed to some, not necessarily identical, extent. Analyzing the data flood produced by today's typically very large and complex software systems manually would be a time consuming task. To accelerate the exploration process, we developed the transaction overview, which provides an overview, as the name says, of the whole transaction set. In this view, as little information as possible should be hidden, which may be a difficult problem. Figure 3.25 shows a small example of a transaction overview plot. It consists of a Cartesian coordinate system with a horizontal axis (x-axis) and a vertical axis (y-axis). The x-axis is used as a timeline, whereas the y-axis shows both the number of changed files as well as all files in a hierarchically ordered arrangement. The author is visually encoded as a color coded bullet at the intersection point of the transaction's horizontal position and the vertical position of the number of changed files.
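CVS itself records only individual file checkins, so such transactions have to be reconstructed. A common sliding-window criterion, sketched here with our own data structures and under the assumption that the grouping works this way, combines checkins with the same author and log message whose timestamps lie close together:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: group single CVS checkins into transactions by author, log message,
    // and a sliding time window (an assumed reconstruction, shown for illustration).
    class TransactionBuilder {

        static class Checkin {
            String author, file, logMessage;
            long timestamp;
            Checkin(String author, String file, String logMessage, long timestamp) {
                this.author = author; this.file = file;
                this.logMessage = logMessage; this.timestamp = timestamp;
            }
        }

        static class Transaction {
            String author;
            List<String> files = new ArrayList<String>();
            long start;
        }

        // checkins are assumed to be sorted by timestamp
        static List<Transaction> group(List<Checkin> checkins, long maxGapMillis) {
            List<Transaction> result = new ArrayList<Transaction>();
            Transaction cur = null;
            Checkin prev = null;
            for (Checkin c : checkins) {
                boolean sameTx = prev != null
                    && prev.author.equals(c.author)
                    && prev.logMessage.equals(c.logMessage)
                    && c.timestamp - prev.timestamp <= maxGapMillis;
                if (!sameTx) {
                    cur = new Transaction();      // start a new transaction
                    cur.author = c.author;
                    cur.start = c.timestamp;
                    result.add(cur);
                }
                cur.files.add(c.file);
                prev = c;
            }
            return result;
        }
    }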

Figure 3.25: A small example shows the transaction overview visualization technique. The x-axis represents the timeline, the y-axis the number of changed files, and color coded bullets indicate the authors.

The leftmost position represents the starting point of evolution, whereas the rightmost point shows the currently last transaction of the evolutionary process. The
approach uses pixel-based vertical lines to show the point in time of each single transaction. If there are more transactions than pixels on the horizontal axis of the display, we compress those lines that overlap, but the bullets representing the authors are always shown. The transaction overview can be used to analyze several different aspects of an evolving software system, among them the following:

• Number and frequency of the transactions: The evolution of software systems can contain different development phases. In some phases many transactions occur within short time intervals, whereas during other phases the development process seems to stagnate. The frequency of the transactions can be analyzed by inspecting the density of the vertical lines.

• Number of developers: The number of different developers in several time intervals can be examined by means of the color coded bullets.

• Number of changed files: A color coded bullet is located at the y-axis position representing the number of changed files per transaction. The higher this value, the more files have been changed by this developer in this transaction. It is also very easy to compare developers in terms of the sizes of their transaction sets.

• Hierarchy level of the changed files: The vertical lines representing the transactions are additionally annotated with small yellow colored dots, which give insights about which files belong to this specific transaction. It may be noted that the files are hierarchically sorted at the vertical axis. This means that files that are located in the same directory or subdirectory are typically laid out close to each other at the vertical axis. Files in the same subdirectory are sorted lexicographically.

• Developer sequence: Inspecting the color coded bullets from left to right can help to understand which developers are working in certain time intervals. Trends and counter-trends or an alternating behavior of two or several developers can be detected very easily.

3.5.2 File-Author Matrix

The file-author matrix visualization does not display time-series data but aggregates this data over time. It indicates the dependencies between developers and files in a rectangular matrix. Figure 3.26 shows an example of the file-author matrix, which is a space-filling, two-dimensional visualization. Files are hierarchically ordered along the x-axis, and developers are located along the y-axis. The visualization is a pixel-based approach and thus can show many dependencies at the same time. The color of a pixel indicates how often a file was changed by a developer relative to the total number of changes to that file. In Figure 3.26 we use blue colored pixels to denote that a file underwent relatively few changes by a developer, whereas a red colored pixel indicates that a file was changed very frequently. The background color is used to express that the corresponding file has not been changed by this developer.

Figure 3.26: The file-author matrix visualization also uses a Cartesian coordinate system to represent the files of a software project at the x-axis and several developers at the y-axis.

It goes without saying that the order of both the files and the developers at the axes plays a significant role for pattern and outlier detection. We suggest that the best order to start with is a descending order by the number of checkins for the developers and a hierarchical order for the files. Following this approach, one can easily detect files which have been changed very frequently by the main developer and those that have not. Additionally, the hierarchical position of the files remains visible. Developers that work on particular modules can also be identified quickly. The background color is also used to encode information: To distinguish different hierarchy levels, we set the background color of the file-representing columns according to a linear gray color scale. The single color values depend on the depth of the file in the hierarchy: the deeper it is located in the hierarchy, the darker its color. This approach has two major drawbacks. Files that are located in different subhierarchies but have the same depth are typically drawn in the same color. This makes the observation of the hierarchy structure very difficult. Another problem occurs when all developers change one specific file, which could be a todo file, for example. Then the background color is not observable anymore, and thus, a linking of that file to a depth of the hierarchy becomes impossible. To tackle these problems, we use detail on demand features that show additional textual information about the file in focus.

The number of files in software projects is typically much larger than the total number of developers. For example, the TOMCAT3 project consists of 2297 files and was implemented by just 40 programmers. This is a ratio of roughly 58 files per developer. Consequently, we obtain rectangular, elongated matrices with far more columns than rows. To use the display space more efficiently, we cut the matrix horizontally and continue at the leftmost position, right under the matrix part drawn before. Following this approach, we can visualize even larger systems such as the TOMCAT3 project.
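A possible mapping from this relative change frequency to a pixel color could look like the following sketch; the concrete color scale of the tool is not specified here, so the linear blue-to-red interpolation is an assumption:

    import java.awt.Color;

    // Sketch: map the fraction "changes by this developer / all changes to this file"
    // to a color between blue (few changes) and red (many changes).
    class MatrixColor {
        static Color cellColor(int changesByDeveloper, int totalChangesToFile) {
            if (changesByDeveloper == 0) return null;   // null stands for the depth-based background color
            float ratio = (float) changesByDeveloper / totalChangesToFile;   // in (0, 1]
            return new Color(ratio, 0.0f, 1.0f - ratio); // linear interpolation from blue to red
        }
    }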


3.5.3 Dynamic Author-File Graph

The Dynamic Author-File Graph (AFG) visualization uses animation to represent time-series data. The main focus of this technique is to show the relations between the developers and the files throughout the evolutionary process of the software system by animated sequences of node-link graphs. An AFG consists of an ordered sequence of bipartite graphs that carry information about which developer changed which file during a certain time period. The technique is very similar to the one presented in the animated code swarm tool [33]. An AFG can be formally expressed as an n-tuple:

$$AFG := (G_i)_{1 \leq i \leq n} \quad \text{where} \quad \forall\, 1 \leq i \leq n:\ G_i := (D_i \cup F_i, E_i) \ \text{and}\ E_i \subseteq D_i \times F_i$$

The set of all developers that make commits to the repository during time period $i$ is denoted by $D_i$. The set of all files that were changed during this time period is expressed as $F_i$. The file-developer relationships are expressed by $E_i$. Each edge set $E_i$ can be defined in the following way:

$$E_i := \{(d, f) \in D_i \times F_i \mid \text{developer } d \text{ changed file } f \text{ during time period } i\}$$

The time intervals are flexible and have to be adjusted for each software project. The selection highly depends on the activity of the developers. Sometimes we would prefer shorter time periods that may express weekly changes, whereas for other projects longer periods are more suitable, for example, monthly changes. In the case studies we analyzed monthly changes.

In an AFG, developers are visualized as large circles and files as smaller circles. This reflects the same observation as in the file-author matrix, namely that there are far more files in a software project than developers. Figure 3.27 shows one graph of a dynamic AFG. The whole graph sequence is displayed by means of smooth animation. The layout of the graphs is computed by an extended foresighted layout that was published by Diehl et al. [44, 80]. This approach benefits from the fact that a viewer can easily preserve the mental map. We just show small excerpts of the AFGs because the sequences are very long, which means they contain many graphs. The labeling problem is solved by labeling just the most important nodes in the visualization, which again reduces visual clutter and makes the figures more readable.
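Deriving one such bipartite graph per time period from the reconstructed transactions could be sketched as follows (reusing the Transaction sketch from Section 3.5.1 above; all of this is our own illustration):

    import java.util.HashSet;
    import java.util.Set;

    // Sketch: build the node and edge sets of the author-file graph G_i for one time period.
    class AuthorFileGraph {
        Set<String> developers = new HashSet<String>();   // D_i
        Set<String> files = new HashSet<String>();        // F_i
        Set<String> edges = new HashSet<String>();        // E_i, encoded as "author->file"

        static AuthorFileGraph forPeriod(Iterable<TransactionBuilder.Transaction> transactions,
                                         long periodStart, long periodEnd) {
            AuthorFileGraph g = new AuthorFileGraph();
            for (TransactionBuilder.Transaction t : transactions) {
                if (t.start < periodStart || t.start >= periodEnd) continue;  // not in period i
                g.developers.add(t.author);
                for (String f : t.files) {
                    g.files.add(f);
                    g.edges.add(t.author + "->" + f);
                }
            }
            return g;
        }
    }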


Figure 3.27: An author-file graph of an animated sequence of graphs. The two large circles show the developers and the smaller bullets represent files that are changed by these developers.

3.5.4 Case Studies: JUNIT and TOMCAT3

The testing framework JUNIT and the application server TOMCAT3 are the focus of our illustrative case studies. JUNIT is a comparably small project in contrast to TOMCAT3. Just 7 developers worked on 629 files and made 522 transactions during the evolution of the system. In the TOMCAT3 project 40 developers took part and worked on 2297 files over 3917 transactions.

3.5.4.1 JUNIT

Figure 3.28 shows a transaction overview of the JUNIT project in a time interval starting at December 3, 2000 and ending at October 18, 2006. The names of the developers are given in textual form by their initials. One can easily grasp from the figure that the author EG (represented with magenta colored bullets) was working on this project as the very first developer. Nearly one year later KB (green) and CS (light blue) also started participating in the project, or rather, started to make commits to the repository; hence we do not know whether they really started right at their first checkin or maybe much earlier. In the middle part of the transaction overview, the developers seem to be inactive because of the missing transactions. The only active developer in this time period was CS, who made very small transactions. The largest transaction was made by DS (orange) in the last part of the evolutionary process. DS can also be identified as the main developer in this final development stage, apart from KB, who sometimes also makes some commits. Files that are contained in a transaction are additionally marked by colored, pixel-sized (black in Figure 3.28) points, which indicate the part of the system that was changed by a developer. By inspecting those pixels, we can find out that in the last third many files were changed that had never been changed before in the evolution of this system. Maybe a new component was added that provides some more features in the system.

Figure 3.28: One can easily get an overview of the transactions for the JUNIT project in the time interval starting at December 3, 2000 and ending at October 18, 2006.

Figure 3.29 shows the same dataset (JUNIT and the same time period) as Figure 3.28 in a file-author matrix representation. This different kind of visual mapping represents time-aggregated transaction data and shows developers instead of time at one of the axes. The matrix is split into two parts due to the many files. The upper part of the matrix shows the root directory, the doc subdirectory, and also the source code files of a deprecated JUNIT version. The new source code files and the tries subdirectory are visualized in the matrix part below. A more specific exploration of this subdirectory reveals that it is used by the developers to test new approaches. The chosen sorting of the developers shows that KB changes the biggest part of the files and that DS and EG are also very active authors. The remaining four developers did not take part in the development process as much as the others, if we have a closer look at the files changed by these developers. It seems as if they are inactive in the development process, as indicated by the small number of changed files. Inspecting developer EG shows an interesting phenomenon: EG mostly changed files that contain the old source code. He only changed a handful of files belonging to the new source code directory as well as some tests and files in the samples directory. EM, CW, and VB did not make changes to any of the files of the new source code, but CS got more involved in the development process of the new code. A closer inspection of the source code files can also give some more very interesting insights. We try to answer the question whether several developers work as a team or independently of all others. In both the old source code and the new one, some subdirectories exist that were changed by one up to three single developers. In the old source code nearly the whole source code was altered by EG, whereas the tests subdirectory was also changed together with KB, but to a different extent. The files belonging to the new source code directory were mostly altered by KB and DS. Only a small number of those files were changed by both developers to the same extent. The tries directory indicates an outlier because the only changes to it were done by KB. The internal directory contains the request subdirectory, which was only edited by DS. One very interesting subdirectory in software archives is always the one that contains the documentation. Here we can find out that the very active developers KB and DS were mainly working on the source code, but only a little on the documentation, which was the job of EG. The FAQ file was frequently altered by CW; it is the only file that CW changed during the evolutionary process.

Figure 3.29: The file-author matrix of JUNIT additionally shows which files have been changed very frequently by a certain developer.

In Figure 3.30 we see three snapshots of a dynamic author-file graph generated for the JUNIT project. Each graph covers one month of development time, in this figure March, April, and June of the year 2002. We can see that KB started as a developer of many files in March, but reduced this number significantly in the subsequent months. In June, he only worked on files that EG also worked on. It may be noteworthy that the number of files under development decreases during the evolution of the system. Obviously, EG and KB are the main developers in this project because they worked on the largest number of files and even shared some files.


Figure 3.30: The author-file graph representation for the JUNIT project is shown for three different points in time, namely: (a) March 2002; (b) April 2002; (c) June 2002.

3.5.4.2 TOMCAT3

Figure 3.31 shows the transaction overview for TOMCAT3 in the time period starting at October 19, 1999 and ending at November 21, 2004. In the first half of the analyzed time period, the transaction frequency is much higher than in the second half, and many developers took part. In the second half, the number of participating developers decreased very rapidly, and only the developer indicated by the yellow bullet was very active. HG and BB were the only developers at the very end of this project.

Figure 3.31: An overview of the transactions of TOMCAT3 shows the checkin behavior of the developers in the time period starting at October 19, 1999 and ending at November 21, 2004.

The complete file-developer dependencies of TOMCAT3 are shown in the split matrix in Figure 3.32. The figure is manually annotated with the directory names. The files of the subdirectories j2ee and build were altered by very few developers, whereas some other directories were changed by more than 10 developers. The directory org.apache.jasper was altered by 22 developers, but it also contains subdirectories that were changed by only a few developers, for example, j2ee.


The package org.apache.tomcat and several subpackages are definitely the core of the TOMCAT3 system. Inspecting the matrix, one can see that CM is the main developer due to the very large number of changed files. The src/native subdirectory is not the focus of CM; the major developers there are BB, RS, GS, and AL. The reason for this seems obvious but can only be conjectured: all files in this subdirectory are implemented in the C programming language. CM also made no changes in the org.apache.jasper directory, which is not part of the aforementioned system core either. We can further conjecture that the main developer CM worked on the core of the system but left the programming work for project parts that do not belong to the core to other developers. Some developers did not work on the source code of the project but mainly cared for the documentation or the web pages. For instance, LI made very few changes to the source code but, in contrast, very many changes to the documentation files in proposals.

Figure 3.32: The file-author matrix of TOMCAT3 shows the time aggregated transaction data as file-developer relations.

The animated sequence of author-file graphs for TOMCAT3 in the time period from September 2003 until February 2004 allows some very interesting conclusions. Figure 3.33 shows four snapshots of the animation as a static sequence of graphs. HG and BB exchanged their roles as the main developer: HG made many commits in September and October, but no commits at all in January and February. In contrast, BB starts in September with committing a few changes and becomes the main developer until the end of February. The two developers share only a small number of checked-in files. Consequently, one could say that TOMCAT3 was developed in separate subprocesses by developers who work independently from each other.


Figure 3.33: The author-file graph representation for the TOMCAT3 project is represented for four different points in time, namely: (a) September 2003; (b) October 2003; (c) January 2004; (d) February 2004.

3.6 Conclusions

In the first part of this chapter we introduced several visualization techniques to gain interesting insights from large rule sets generated by data mining techniques. Before going into visual details, we showed how such datasets can be extracted from software archives and preprocessed to obtain an adequate dataset. To compute rules from software archives, we typically aggregate the dependencies over time, which are caused by transactions, and obtain two types of software rules—association rules between sets of software artifacts and sequence rules that additionally carry a temporal component. To visually explore the datasets, we developed the EPOSee visualization tool, which integrates the aforementioned simple techniques into one view and combines them by linking and brushing features. The tool can represent association rules by pixelmap and support graph techniques as well as sequence rules by parallel coordinates plots, prefix trees, and an approach called Trees in a Treemap. For the latter idea we also explored traceroute data to show the usefulness of the visualization with data from a very different application domain. We showed the benefits of the techniques in several case studies with data from open source software systems such as MOZILLA. The visualization techniques used in the EPOSee tool can be classified as code-centric approaches because we do not get insights into the developers' behavior but only into the dependencies of source code artifacts such as directories, files, classes, or methods.

The second part of the chapter deals with author-centric or developer-centric visualization approaches, and also with time-series or time-based data, where we get insights into the evolutionary processes and the roles of several developers in the project. For this purpose we developed three visualization techniques—the transaction overview, the file-author matrix, and the dynamic author-file graph.


The techniques differ in several aspects. The transaction overview is a static representation of time-series data that shows the number of transactions in one single static image in chronological order with absolute time, as well as the participating developers and the files they changed. The dynamic author-file graph uses animation to represent the same kind of data as a sequence of node-link diagrams. The graphs of the sequence are laid out in such a way that a minimum of changes occurs between two subsequent graphs; the preservation of the user's mental map is the focus of this visualization. The file-author matrix is again an author-centric and time-aggregated representation of transaction data that shows the percentage of changes of one file made by a certain developer. We illustrated the roles of developers by case studies on two software projects, namely JUNIT and TOMCAT3. Some interesting patterns and anomalies have been found that could hardly be uncovered by means of manual searches. Though the author-centric visualizations supported us in analyzing the behavior of developers in software archives, they could still be enhanced by additional data sources. Bug databases are a very rich source of documented errors that occur during the development phases and are typically much better commented than the log messages of transaction data. Furthermore, the relation between several developers can be better understood by using bug databases, which is also true for email archives. Exploiting the data stored about developer conversations can also help to identify groups of people that are working as a team or single persons that are implementing independently.

“A mathematical model is a representation of the essential aspects of an existing system (or a system to be constructed) which presents knowledge of that system in usable form” — Eykhoff (1974)

CHAPTER 4 Modeling Transaction Sequences and Dynamic Compound Digraphs

As we have to deal with two types of data, namely sequences of weighted transactions and the more general sequences of directed and weighted compound multigraphs, we will first give an introduction to these two different datasets. Second, we will keep the two models separate, and third, we will explain how the transaction dataset can be transformed into a more general, directed graph dataset. Undirected graphs can also be modeled as directed graphs by replacing each undirected edge by two directed edges that point in both directions and each have the same weight as the undirected edge.

For the purposes of this thesis we model an information hierarchy as a tree where the leaf nodes represent pieces of information. We call this information the items in the rest of this thesis. As an example we may consider a directory tree—one possible information hierarchy from software systems. The elements in the rules extracted from software archives are software artifacts such as directories/subdirectories, packages, files, classes, methods, or functions. These artifacts are hierarchically ordered because directories may contain subdirectories, in which several files are located. Files consist of classes, and each class can be further subdivided into code blocks.


Such code blocks are, for example, methods or functions; other code blocks are also imaginable. The level of granularity defines which artifacts we call items, i.e., the leaf nodes of this information hierarchy. If we are working on file level, for example, these items are given by the set of all files. We can express the information hierarchy more formally. Let T = (VT, EI) be the tree that represents this hierarchy, where VT = {v1, . . . , vn} is the set of vertices of the tree induced by the information hierarchy with cardinality |VT| ≥ 2. The set EI = {e1, . . . , ek} ⊊ VT × VT contains the directed edges of this tree, which we will also call inclusion edges in the context of this thesis, and L ⊊ VT is the set of leaves, a proper subset of the set of all vertices VT of T. If the context is clear we omit the subscript T. The tree is always rooted. This means that one vertex is always designated the root, which can never be a leaf in our data model; hence the restriction |VT| ≥ 2. T is always a directed graph where the root node has only outgoing edges and the leaf nodes have only incoming edges.

Figure 4.1: A tree T with node set VT = {v1 , . . . , v9 } and inclusion edge set EI = {(v1 , v2 ), (v1 , v3 ), (v1 , v4 ), (v3 , v5 ), (v3 , v6 ), (v4 , v7 ), (v4 , v8 ), (v4 , v9 )}. The set of leaves or items is given by L = {v2 , v5 , v6 , v7 , v8 , v9 }. The root vertex is labeled by v1 .

Figure 4.1 shows a small example of a tree. The node set is given by

VT = {v1, . . . , v9}

The set of inclusion edges is indicated by

EI = {(v1, v2), (v1, v3), (v1, v4), (v3, v5), (v3, v6), (v4, v7), (v4, v8), (v4, v9)}

The set of leaf nodes or items is represented by

L = {v2, v5, v6, v7, v8, v9}

For reasons of clarity we omit the arrows in the diagram, but we keep in mind that a tree is always a directed graph.
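To make the data model concrete, the following minimal Java sketch models such a rooted information hierarchy and computes the reachable leaf nodes (the items) of a node. The class and method names are illustrative only and are not taken from any existing implementation.

    import java.util.*;

    // Minimal model of a rooted information hierarchy with inclusion edges.
    class HierarchyNode {
        final String name;
        final List<HierarchyNode> children = new ArrayList<>();

        HierarchyNode(String name) { this.name = name; }

        HierarchyNode addChild(String childName) {
            HierarchyNode child = new HierarchyNode(childName);
            children.add(child);
            return child;
        }

        boolean isLeaf() { return children.isEmpty(); }

        // Collects the leaf nodes (items) reachable from this node.
        Set<String> reachableLeaves() {
            Set<String> leaves = new LinkedHashSet<>();
            if (isLeaf()) { leaves.add(name); return leaves; }
            for (HierarchyNode c : children) leaves.addAll(c.reachableLeaves());
            return leaves;
        }

        public static void main(String[] args) {
            // The tree of Figure 4.1: v1 is the root, v2, v5, ..., v9 are the items.
            HierarchyNode v1 = new HierarchyNode("v1");
            v1.addChild("v2");
            HierarchyNode v3 = v1.addChild("v3");
            HierarchyNode v4 = v1.addChild("v4");
            v3.addChild("v5"); v3.addChild("v6");
            v4.addChild("v7"); v4.addChild("v8"); v4.addChild("v9");
            System.out.println(v1.reachableLeaves()); // [v2, v5, v6, v7, v8, v9]
        }
    }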


4.1 A Transaction Measure

In the following we describe how transactions in information hierarchies can be measured. This measure function is important because it defines the measure of a transaction for each state of the currently visible information hierarchy. Arbitrary graphs are also associated with a measure function, which can be used to model aggregated weights for all possible expansion levels of the information hierarchy; this will be described in Section 4.3. Let T = (V, E) be a tree with leaf node set L ⊊ V. First of all we define a function

rln : 2^V → 2^L, U ↦ { v ∈ L | ∃ u ∈ U : u →*_T v }

that maps a node set U ∈ 2^V to the set of its reachable leaf nodes L′ ⊆ L. The notation

v →*_T v′

denotes a possibly empty sequence of edges with v, v′ ∈ T, in other words, a path starting at v and ending at v′. It may be noted that a leaf node is reachable from itself by means of an empty sequence of edges. Trivially, the function rln has the properties rln({v}) = L if v is the root node and rln(∅) = ∅. Furthermore, it has the following general properties for U, U′ ∈ 2^V:

rln(U) = rln(U′), if U = U′
rln(U) ⊆ rln(U′), if U ⊆ U′
rln(U) ∩ rln(U′) = ∅, if for all v ∈ U and all v′ ∈ U′ neither a path v →*_T v′ nor a path v′ →*_T v exists

Given an itemset L that contains the leaf nodes of a hierarchy, the function rln induces a σ-algebra Σ over L with the following properties:

1. Σ contains the whole set L as an element: L ∈ Σ.
2. If a set L′ is contained in Σ, then so is its complementary set: L′ ∈ Σ ⇒ L \ L′ ∈ Σ.


3. The union of countably many sets is also contained in Σ: ∀ 1 ≤ i ≤ n : Li ∈ Σ ⇒ ⋃_{1≤i≤n} Li ∈ Σ.

These properties can be proved directly:

1. If we apply the function rln to the set that consists of the root node of the hierarchy as its only element, then all leaf nodes are reachable from this node set. Hence the whole set L is contained in Σ.
2. If the set L′ is the set of reachable leaf nodes of a node set U that is contained in Σ, then the complementary set of leaf nodes L \ L′ is reachable from the set of nodes that contains all parent nodes of this complementary set that are not parents of any element of L′.
3. Let L1, . . . , Ln be sets of reachable leaf nodes of the node sets U1, . . . , Un. The union L := ⋃_{1≤i≤n} Li is trivially reachable from the union of these node sets. Possibly the reachable leaf nodes of a subset of this union are also reachable from an even smaller node set that itself contains parent nodes of these leaves.

Now we can define the measure µtrans as a function with domain Σ by µtrans : Σ → ℝ₀⁺. A measure in the context of this thesis is always defined on a set of leaf nodes. As a shorter notation, we also use intermediate nodes as parameters of the measure function µ; they then stand for the leaf nodes reachable from these intermediate nodes. The mathematical properties of a measure are:

• Non-negativity, i.e., ∀ L′ ∈ Σ : µtrans(L′) ≥ 0
• µtrans(∅) = 0
• Countable additivity or σ-additivity: if (Li) is a countable set of pairwise disjoint sets in Σ with an index set I, then the measure of the union of all Li is equal to the sum of the measures of each Li:

µtrans(⋃_{i∈I} Li) = ∑_{i∈I} µtrans(Li)

The pair (L, Σ) is a measurable space where the elements of Σ are the measurable sets, and the triple (L, Σ, µtrans) is a measure space.


Furthermore, let (µtrans,i)1≤i≤n be a sequence of measures for n ∈ ℕ, where each µtrans,i : Σ → ℝ₀⁺, 1 ≤ i ≤ n, is an arbitrary measure defined on the set of items L ⊊ V. We can model each transaction ti, 1 ≤ i ≤ n, as a set of items:

ti := { v ∈ L | µtrans,i({v}) > 0 }

A sequence of transactions will be expressed by (ti)1≤i≤n; the whole sequence of transactions is denoted by tn for short. For more on measure theory we refer, for example, to the book of Bauer [83].
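As a small illustration of this definition, the following Java sketch derives a transaction ti from a per-item measure µtrans,i by collecting all items with a strictly positive measure value. The item names and values are illustrative only (they anticipate the example of Table 4.1 later in this chapter).

    import java.util.*;

    // Sketch: a transaction t_i contains exactly the items with positive measure value.
    class TransactionModel {
        static Set<String> transactionFromMeasure(Map<String, Double> muTransI) {
            Set<String> items = new TreeSet<>();
            for (Map.Entry<String, Double> e : muTransI.entrySet())
                if (e.getValue() > 0.0) items.add(e.getKey());
            return items;
        }

        public static void main(String[] args) {
            Map<String, Double> mu = new HashMap<>();
            mu.put("v4", 3.0); mu.put("v5", 8.0); mu.put("v6", 2.0); mu.put("v7", 6.0);
            mu.put("v2", 0.0); // measure 0: v2 is not part of this transaction
            System.out.println(transactionFromMeasure(mu)); // [v4, v5, v6, v7]
        }
    }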

4.2 A Formal Definition for Compound Graphs

Compound graphs are very frequently occurring data structures. Several objects, which we will call items in the rest of this thesis, may be related to some extent. Moreover, these elements are leaf nodes of a tree that represents an information hierarchy, see Figure 4.2. The adjacency edges in the compound graph in this figure are undirected relations.

Figure 4.2: Several objects are related to some extent and are additionally hierarchically organized. This data structure is denoted as a compound graph.

Many real world examples exist that contain both a hierarchical ordering among a set of objects and non-hierarchical relations between the objects. Various examples from very different application domains can be modeled as a compound graph:


• A software system is hierarchically organized. Directories/subdirectories, packages, files, classes, and methods build up a hierarchy where the leaf nodes can carry some information. These leaf nodes are more or less related to each other, depending on the examined property of the software system. Such properties could be, for example, the call graph of methods or the inheritance hierarchy.
• The world's countries can be subdivided into continents, subcontinents, regions, etc. What we get is a hierarchical ordering of the world's countries depending on their location on the globe. A relation between two countries can, for instance, be obtained by inspecting their export/import behavior for goods such as fuel.
• Social networks consist of many individuals, which could be hierarchically organized by their geographic location. Relations could, for example, be detected by analyzing the persons' communication behavior: who sends emails to whom and how many.

The last example shows that relations can also be weighted or measured. A detailed description of this kind of measure is given in Section 4.3. In the following, we will refer to the relations stemming from the hierarchy as inclusion edges, whereas the relations among the leaf nodes will be called adjacency edges. The compound graph model can deal with both types of edges at once. Let G = (VG, EA) be the adjacency graph, where VG is the set of vertices of G and EA ⊆ VG × VG is the set of adjacency edges. For reasons of simplicity we do not allow multi-edges in this model, but handling such edges is straightforward. Moreover, T = (VT, EI) represents the tree that describes the hierarchical organization of VG. Its vertices are given by the set VT and the inclusion edges by EI ⊊ VT × VT. It goes without saying that VT \ VG contains the inner nodes of the hierarchy, including the root node. Furthermore, the edge set EA is completely disjoint from the edge set EI, mathematically expressed as

EI ∩ EA = ∅

The whole compound digraph can be modeled by

C = (G, T)

A finite sequence of compound digraphs can be modeled by <C1, . . . , Cn>, where Ci = (Gi, T) expresses the sequence of adjacency edges in a constant hierarchy.
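A compact data-structure sketch of this model in Java is given below; it only mirrors the definition C = (G, T) and makes no assumptions about any concrete tool. A sequence <C1, . . . , Cn> would simply be a list of such objects sharing one hierarchy.

    import java.util.*;

    // Sketch of the compound digraph model C = (G, T): a weighted adjacency graph over
    // the leaf nodes plus a separate inclusion tree; field and method names are illustrative.
    class CompoundDigraph {
        // Weighted adjacency edges between items (leaf nodes): from -> (to -> weight).
        final Map<String, Map<String, Double>> adjacency = new HashMap<>();
        // Inclusion edges of the hierarchy T: parent -> children.
        final Map<String, List<String>> inclusion = new HashMap<>();

        void addInclusionEdge(String parent, String child) {
            inclusion.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
        }

        void addAdjacencyEdge(String from, String to, double weight) {
            adjacency.computeIfAbsent(from, k -> new HashMap<>()).put(to, weight);
        }
    }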


4.3 A Graph Measure

In Section 4.1 we explained how to handle measures on sequences of transactions. The problem was to derive a suitable σ-algebra from the set of items. For graphs, or more precisely, for digraphs, we have to work with weighted adjacency edges instead of weighted nodes as in the transaction data. Our graph model has to be extended by a weight function that is defined for each state of the corresponding tree T. If a part of the tree is collapsed, edge weights have to be aggregated. The best way to express this is again by means of a measure, as we did for the transaction data. To do this, we first introduce a σ-algebra on the set of adjacency edges EA. In this case the σ-algebra consists of sets of set pairs instead of sets of sets as in the section about transactions. The properties of a σ-algebra are again satisfied, and proving them is straightforward. The properties of the measure µgraph are more interesting to discuss. Let C = ((VG, EA), (VT, EI)) be a compound digraph with adjacency edges EA ⊆ L × L and Σ as defined in Section 4.1 on the set of leaf nodes L = VG. For L′, L″ ∈ Σ the graph measure µgraph : Σ × Σ → ℝ₀⁺ has the following properties:

• Non-negativity, i.e., ∀ L′, L″ ∈ Σ : µgraph(L′, L″) ≥ 0
• ∀ L″ ∈ Σ : µgraph(∅, L″) = 0 and simultaneously ∀ L′ ∈ Σ : µgraph(L′, ∅) = 0
• Countable additivity or σ-additivity: if (Li) is a countable set of pairwise disjoint sets in Σ with an index set I and L′, L″ ∈ Σ, then the measure of the union of all Li is equal to the sum of the measures of each Li:

µgraph(⋃_{i∈I} Li, L″) = ∑_{i∈I} µgraph(Li, L″)

and simultaneously

µgraph(L′, ⋃_{i∈I} Li) = ∑_{i∈I} µgraph(L′, Li)
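The σ-additivity directly yields the aggregation rule used when subtrees are collapsed: the weight between two collapsed nodes is the sum of the weights of all adjacency edges between their leaf sets. A minimal Java sketch of this evaluation is shown below; the storage scheme and names are assumptions for illustration.

    import java.util.*;

    // Sketch: evaluating the graph measure for two sets of leaf nodes by summing the
    // weights of all adjacency edges between them (sigma-additivity).
    class GraphMeasure {
        final Map<String, Map<String, Double>> leafEdgeWeights = new HashMap<>();

        void setWeight(String from, String to, double w) {
            leafEdgeWeights.computeIfAbsent(from, k -> new HashMap<>()).put(to, w);
        }

        // mu_graph(L', L''): aggregated weight between two leaf sets, e.g. the
        // reachable leaves of two collapsed subtrees.
        double measure(Set<String> fromLeaves, Set<String> toLeaves) {
            double sum = 0.0;
            for (String u : fromLeaves) {
                Map<String, Double> out = leafEdgeWeights.getOrDefault(u, Map.of());
                for (String v : toLeaves) sum += out.getOrDefault(v, 0.0);
            }
            return sum;
        }
    }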

4.4 Transforming Transactions into Directed Graphs

A transaction can always be transformed into a directed graph. The opposite direction is only possible if the graph represents a complete clique and each node has a self-edge. Even in this situation the edge weights have to satisfy a special form to transform the graph back into a transaction. The transformation is motivated by the data filtering functions of the visualization techniques TimeArcTrees, Timeline Trees, and TimeRadarTrees. We need this transformation because the visualization approaches for dynamic compound digraphs and sequences of transactions can handle both data formats. They read the transaction data format in a first step, but a user of the visualization tool can interactively analyze the transaction data and also manipulate the dataset. He can, for example, eliminate some edges in the visualization. With this interactive feature, the user changes the transaction format into a special digraph format, which he may store in a last step to work with it again later. The data formats therefore have to be interchangeable to some extent. Let T := (VT, EI) be a tree with VT := {v1, v2, v3, v4, v5, v6, v7}, where VT ⊋ L := {v4, v5, v6, v7} is the set of items and EI := {(v1, v2), (v1, v3), (v2, v4), (v2, v5), (v3, v6), (v3, v7)} are the inclusion edges. An example transaction may be defined on the itemset as given in Table 4.1.

Table 4.1: Four elements belong to one transaction, each of those elements to a different extent.

Measure values
µtrans({v4}) = 3
µtrans({v5}) = 8
µtrans({v6}) = 2
µtrans({v7}) = 6

The transaction in Table 4.1 can easily be transformed into a corresponding digraph G = (L, EA). The set of adjacency edges is simply EA := L × L, and the measure of each adjacency edge is given in Table 4.2.

Table 4.2: The transaction from Table 4.1 is transformed into a digraph.

µgraph({v4}, {v4}) = 3    µgraph({v4}, {v6}) = 2
µgraph({v5}, {v4}) = 3    µgraph({v5}, {v6}) = 2
µgraph({v6}, {v4}) = 3    µgraph({v6}, {v6}) = 2
µgraph({v7}, {v4}) = 3    µgraph({v7}, {v6}) = 2
µgraph({v4}, {v5}) = 8    µgraph({v4}, {v7}) = 6
µgraph({v5}, {v5}) = 8    µgraph({v5}, {v7}) = 6
µgraph({v6}, {v5}) = 8    µgraph({v6}, {v7}) = 6
µgraph({v7}, {v5}) = 8    µgraph({v7}, {v7}) = 6

In the general case we have a tree T = (VT, EI), where VT = {v1, . . . , vn} and {e1, . . . , em} =: EI ⊊ VT × VT is the set of inclusion edges. The directed graph can be constructed following the scheme from above: if L ⊊ VT is the set of leaf nodes, then EA := L × L is the set of adjacency edges and µgraph({vi}, {vj}) = µtrans({vj}) for vi, vj ∈ L. The transformation of an undirected graph into a directed graph is straightforward: each undirected edge is doubled into two directed edges pointing in both directions with the same measure, µgraph({vi}, {vj}) = µgraph({vj}, {vi}).
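The following Java sketch carries out exactly this transformation for a single transaction; the item names and the printing in main are illustrative only.

    import java.util.*;

    // Sketch of the transformation described above: every pair of items becomes a
    // directed adjacency edge whose weight equals the transaction measure of the
    // target item, i.e. mu_graph({v_i},{v_j}) = mu_trans({v_j}).
    class TransactionToDigraph {
        static Map<String, Map<String, Double>> transform(Map<String, Double> muTrans) {
            Map<String, Map<String, Double>> edges = new LinkedHashMap<>();
            for (String vi : muTrans.keySet()) {
                Map<String, Double> out = new LinkedHashMap<>();
                for (String vj : muTrans.keySet()) out.put(vj, muTrans.get(vj));
                edges.put(vi, out);
            }
            return edges;
        }

        public static void main(String[] args) {
            // The transaction of Table 4.1.
            Map<String, Double> mu = new LinkedHashMap<>();
            mu.put("v4", 3.0); mu.put("v5", 8.0); mu.put("v6", 2.0); mu.put("v7", 6.0);
            // Reproduces the edge weights of Table 4.2, e.g. all edges ending in v5 get weight 8.
            System.out.println(transform(mu));
        }
    }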

4.5 Conclusions

In this chapter we introduced the term measure for both transactions and directed graphs. The measure function guarantees that the weight of relations is defined for each visual state of a compound graph. Collapsing and expanding subhierarchies leads to an aggregation of the adjacency relations that belong to this hierarchy level. An aggregation of several graphs or transactions is also possible and leads to summing up the respective measures in the sequence. In the next chapter we introduce three novel visualization approaches that are able to handle data from the aforementioned data model, namely sequences of weighted transactions in information hierarchies and dynamic weighted compound digraphs. The information hierarchy is used to navigate in the dataset and to filter for subhierarchies. Applying one of these functions means displaying different values of the measure function in the visualization. A user-defined color coding visually encodes the strength of the measure values for the current state of the information hierarchy.


“I’ve probably reviewed over 60 papers in my career, and this paper probably presents the weirdest idea I will probably ever review.” — Unknown Reviewer of the EuroVis Conference (2008)

CHAPTER 5 Visualizing Transaction Sequences and Dynamic Compound Digraphs

Graphs are a mathematical means to express relations or dependencies among a set of objects. Compound graphs additionally encapsulate a hierarchical order on the graph nodes and are a widely used model in computer science.

Information hierarchies occur in many application domains, such as the hierarchical organization of companies, news topics and subtopics, file/directory systems, products and product groups of a department store, or phylogenetic trees in biology. The evolution of dependencies in such information hierarchies can be modeled by sequences of compound digraphs with edge weights. Many layout algorithms have been developed by graph drawing researchers to better understand interesting structures or substructures of a given compound graph. The most frequently applied visualization metaphor in the graph drawing community is definitely the conventional node-link diagram. Today, the primary challenge of graph drawing is not the invention of new visual metaphors for relational data, but rather the enhancement of layout algorithms based on node-link diagrams with respect to their running time. Moreover, the final layout results should fulfill several aesthetic criteria, which are explained in Chapter 7.


In many cases these algorithms can only deal with static graphs that do not change their structure over time. Only a few algorithms can deal with changing relations, for example, the approach by Pohl and Birke [131] that uses animated sequences of node-link diagrams. Throughout the years, there has been a lot of research on effectively visualizing information hierarchies [3, 99, 136, 149, 186]. Recently, some researchers have developed methods to tackle the problem of visualizing dependencies between elements in the hierarchy as well [60, 92, 123, 134, 190, 192]. Our novel visualization techniques are able to visualize dynamic compound digraphs as well as sequences of transactions. The main difference to existing work lies in the visual encodings for the information entities and the relations. We try to solve the problem of visual clutter by applying space-filling visual encodings to the relational data. Moreover, we try to present the data in a single view and not by animated graph sequences. In this chapter we will present the three approaches called TimeArcTrees [81], Timeline Trees [18], and TimeRadarTrees [21]. We will work out some commonalities as well as some differences of the techniques. The TimeArcTrees technique uses the conventional node-link approach for both inclusion and adjacency edges. The Timeline Trees technique tries to solve the problem of edge crossings by a space-filling visualization of the adjacency edges. The information about connectedness between two nodes can be extracted from the elementary perceptual tasks given by position in non-aligned scales, shape, area, color, and context of the corresponding rectangular boxes [32]. Shape, area, and color are processed pre-attentively when only one encoding at a time occurs; conjunction encodings are in general not processed pre-attentively. "An understanding of what is processed pre-attentively is probably the most important contribution that visual science can make to data visualization" [174]. Julesz [100] published a theory about pre-attentive vision—the instantaneous and effortless part of visual perception that the brain performs without focusing attention on local detail. The TimeRadarTrees technique is the radial counterpart of the Timeline Trees approach; it uses a radial node-link tree and/or a radial layered icicle diagram for the inclusion edges and radial space-filling circle sectors for the sequence of adjacency edges. Connectedness between two nodes is expressed by position in non-aligned scales, direction, angle, area, curvature, shape, color, and context, and hence has more distinguishing features for the exploration of relational data than the Cartesian Timeline Trees visualization.


All three visualization approaches are implemented in the JAVA programming language. They are interactive and use smooth animations to preserve the user's mental map when the diagram is transformed from one state to another. In this chapter we will illuminate the details of the visualization techniques. The interactive features of the tools and further real world datasets are also exemplified in this chapter, which is organized as follows: In Section 5.1 we explain the details of the TimeArcTrees visualization technique and how it can be used to visually solve the shortest path problem. In Section 5.2 we introduce the space-filling Timeline Trees technique, and Section 5.3 gives the technical details about its radial counterpart—the TimeRadarTrees. We conclude this chapter with Section 5.4, which compares these three visualization approaches with respect to scalability and typical graph visualization challenges.

5.1 TimeArcTrees

The TimeArcTrees approach shows a sequence of compound digraphs in a traditional node-link representation. Based on the order of the nodes in the hierarchy, the nodes of the graph are placed on a vertical axis from top to bottom for each graph of the sequence. The hierarchy is represented as a node-link diagram on the left hand side of the whole view and can be expanded or collapsed to an interactively selectable level. The visualization of the graph sequence automatically adapts to the current expansion level of the hierarchy: the frontier of the currently expanded hierarchy tree forms the nodes of the graph sequence. The mental map is preserved by smooth animation, making the transitions between different views easier to understand. The graphs of the sequence are drawn from left to right as separate node-link diagrams. In these diagrams we use the following conventions. First, all nodes of a graph are aligned vertically on top of each other. Edges are not drawn as straight lines with bends, but as arcs, to make it easier for the human eye to follow a path. Furthermore, edges fall into one of three different categories:

• Upward edges: The start node is located below the target node on the vertical axis; upward edges are placed on the left hand side of the axis.
• Self-edges: Start and end node refer to the same node; self-edges are also placed on the left hand side of the vertical axis.
• Downward edges: The start node is located above the target node; downward edges are placed on the right hand side of the axis.

Moreover, edges can be weighted. These weights are indicated by color coding using one of many color scales provided by the tool. We developed several interaction techniques to explore the data more efficiently. On top of each graph, one can find a time slider that may be used to aggregate a graph subsequence into one single graph.


These and other features will be explained in more detail in Sections 5.1.1 and 5.1.2.
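The placement convention for edges described above can be summarized in a few lines of Java; the enum and method names are illustrative sketches, not part of the TimeArcTrees implementation.

    // Sketch of the edge placement convention: based on the vertical positions (indices)
    // of start and target node, an edge is drawn as an upward edge or self-edge on the
    // left of the axis, or as a downward edge on the right of the axis.
    class EdgePlacement {
        enum Category { UPWARD_LEFT, SELF_LEFT, DOWNWARD_RIGHT }

        // Smaller index = higher position on the vertical axis (nodes are drawn top to bottom).
        static Category classify(int startIndex, int targetIndex) {
            if (startIndex == targetIndex) return Category.SELF_LEFT;
            return startIndex > targetIndex ? Category.UPWARD_LEFT : Category.DOWNWARD_RIGHT;
        }
    }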

5.1.1 TimeArcTrees—Step by Step

We illustrate our visualization technique by means of a small example. Assume the compound graph sequence <C1, . . . , Cn> that consists of n pairs Ci = (Gi, T), where Gi = (V, Ei), 1 ≤ i ≤ n. The set V contains the leaf nodes and Ei ⊆ V × V × ℝ₀⁺ denotes the set of directed, weighted edges between pairs of leaf nodes. In the context of this thesis we express an edge (u, v, µgraph(u, v)) ∈ Ei of graph Gi by the symbol

$u \xrightarrow[G_i]{\mu_{graph}(u,v)} v$

We call µgraph(u, v) the edge weight. If the context is clear, we use the shorter notation µ for the graph measure µgraph. Instead of using the term measure, we sometimes use the term edge weight as a synonym. It may also be noteworthy that we apply the measure function here to a pair of single nodes instead of node set pairs consisting of leaf nodes of a hierarchy. If single nodes are used in the measure context, the nodes have to be replaced by the set of their reachable leaf nodes, as described in Section 4.1 by the function rln. The same holds for node sets that do not already consist exclusively of leaf nodes. A path

$u \xrightarrow[G_i]{c}{}^{*}\; v$

is a possibly empty sequence of edges such that

$u := v_0 \xrightarrow[G_i]{\mu(v_0,v_1)} v_1 \xrightarrow[G_i]{\mu(v_1,v_2)} \dots \xrightarrow[G_i]{\mu(v_{n-1},v_n)} v_n =: v$

and

$c = \sum_{j=1}^{n} \mu(v_{j-1}, v_j)$

are the accumulated costs along the path. A cycle or loop is a nonempty path

$u \xrightarrow[G_i]{c}{}^{*}\; u$

In the following example we assume that the nodes of the graph sequence correspond to IP-addresses, and the edge weights represent network delays. The graph can be obtained by merging the information of multiple invocations of a traceroute program. Table 5.1 shows the nodes of the graph sequence with a short labeling, namely A, . . . , F, the corresponding IP-addresses, and the WWW domains.

Table 5.1: Short notation for IP-addresses and their WWW domains. The addresses are abbreviated by A, B, C, D, E, and F.

Node   IP-address        Domain
A      136.199.55.209    www.st.uni-trier.de
B      136.199.199.105   www.uni-trier.de
C      136.199.55.175    dbis.uni-trier.de
D      134.96.7.179      www.uni-sb.de
E      134.93.178.2      www.uni-mainz.de
F      131.246.120.51    www.uni-kaiserslautern.de

5.1.1.1 A Single Graph

To explain the TimeArcTrees technique, we start with the small graph example in Figure 5.1. The graph consists of 6 nodes and 10 weighted, directed edges, which are annotated with their corresponding integer-valued edge weights. Edges are visualized as straight lines pointing from the start node to the destination node.

Figure 5.1: Node-link diagram of a directed graph with weighted edges. Each edge is annotated with its weight.

Figure 5.2 depicts the same graph using the TimeArcTrees visualization approach. The graph nodes are vertically aligned and edges are visualized as colored arcs. The more reddish the color of an edge, the higher its weight; green, in contrast, indicates low edge weights.


Figure 5.2: Node-link diagram of a directed weighted graph in a TimeArcTrees representation. The weighted edges are represented by colored arcs.

Layout of the edges

First of all, upward edges and self-edges are always drawn on the left hand side of the vertical axis of the graph, whereas downward edges pass along the right hand side of this axis. Furthermore, the horizontal distance of an edge from the vertical axis increases with the number of nodes that are located between the start and the target node of the edge. With this edge placement, we intend to reduce edge crossings and visual clutter. Following this approach alone results in many overlapping edges, as can be seen in Figure 5.3(a): equal distances between start and destination nodes are responsible for the edge overlaps. To tackle this problem, we weaken the condition of mapping node distances to horizontal edge distances directly. Crossing edges with equal distances between start and end node have to be assigned different arc lengths, but the condition that edges with larger node distances are also mapped to larger distances from the vertical axis still has to hold. We start with edges between neighboring nodes, which have a node distance of one. Because those edges never cross each other, we do not need different arc lengths for them. Edges between non-neighboring nodes, where start and destination node have a distance larger than one, are more problematic. For the maximal distance, there is again only one possible edge, which results in one possible arc length. The number of arc lengths aln(z, l) for z nodes and edge distance l between start and target node can be computed by

aln : ℕ × ℕ → ℕ, (z, l) ↦ a(z,l) / ⌈a(z,l)/l⌉   if a(z,l) < l
                           l                      otherwise

Here a(z, l) denotes the number of edges of distance l in a graph with z nodes. The final step is to map edges with the same distance to different arc lengths.


Figure 5.3: Vertical edge overlap can confuse a viewer and has to be avoided whenever possible: (a) diagram with vertical edge overlap; (b) diagram without vertical edge overlap.

This can be done by applying the modulo function to the smaller index s of either start or target node of an edge. This results in the formula for the vertical distance of an arc:

avd : ℕ × ℕ × ℕ → ℕ, (s, z, l) ↦ s mod aln(z, l)

Following this approach results in an overlap-free edge representation, as given in Figure 5.3(b). To further reduce visual clutter caused by many edges incident to the same node, all upward edges that end in the same node share a single arrow head, as shown in Figure 5.4. The same holds for all downward edges and self-edges.
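A direct transcription of aln and avd into Java is sketched below. The helper a(z, l) is an assumption: the thesis only states that it counts the edges of distance l among z nodes, so we use z − l here as the number of possible edges of that distance on one side of the axis.

    // Sketch of the arc length assignment described above; names are illustrative only.
    class ArcLengths {
        // Assumed: number of possible edges of distance l among z vertically aligned nodes.
        static int a(int z, int l) { return Math.max(z - l, 0); }

        // Number of different arc lengths for edges of distance l (formula above).
        static int aln(int z, int l) {
            int edges = a(z, l);
            if (edges == 0) return 1;  // no edges of this distance; value is irrelevant
            if (edges < l) return edges / (int) Math.ceil((double) edges / l);
            return l;
        }

        // Vertical distance class of an arc; s is the smaller index of start or target node.
        static int avd(int s, int z, int l) {
            return s % aln(z, l);
        }
    }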

Figure 5.4: Incoming and outgoing edge ports: Upward edges and self-edges are always drawn on the left hand side of the vertical graph axis, and downward edges are drawn on the right hand side of this axis.


Since one of the most important aesthetic criteria for graphs in a node-link style is the minimization of edge crossings, we also apply an algorithm that reorders the graph nodes on the vertical axis. Finding the right node order can be reduced to the NP-complete problem of Optimal Linear Arrangement (OLA) [73]. The order of the nodes depends on the hierarchical structure, which must not be restructured, but we may rotate the hierarchy levels recursively until we reach an improved node arrangement. In short, a reduction of edge crossings can be obtained by reordering the leaf nodes of the hierarchy. This new ordering has an impact on all graphs in the sequence, but all graphs of the sequence also have to be examined by the layout algorithm. The total number of permutations of a subtree that contains n child nodes is given by n!, which is quite a lot of different node arrangements. As said before, a time-efficient algorithm for computing an optimal layout is not realistic, but we are satisfied with a fast heuristic algorithm that computes a quasi-optimal solution for this problem. The same algorithm was already used in Section 3.3.2.4 for Trees in a Treemap, where we also aim at improving the hierarchical ordering with the goal of reducing edge lengths.

Figure 5.5: The edge crossing minimization algorithm reduces the number of crossings immensely. In this figure, a reduction of 65 percent is achieved: of the originally 136 crossings, just 48 remain.

Figure 5.5 shows a graph sequence after applying the node reordering algorithm. The number of edge crossings is reduced significantly, by 65 percent, to just 48, in contrast to the 136 crossings in the original layout. A second aesthetic criterion is the reduction of edge lengths, a problem that is solved by the same reordering algorithm, but instead of minimizing the sum of overlaps we minimize the sum of edge lengths in this instance of the algorithm. It goes without saying that there may be a trade-off between both optimization problems: a reduction of edge crossings can lead to an increase of edge lengths and vice versa.


Figure 5.6: The edge length minimization algorithm reduces the sum of edge lengths. In this figure, a reduction of 26 percent is achieved—of the originally 2734 units of length just 2012 remain.

Figure 5.6 shows a graph sequence after a second application of the node reordering algorithm. Here the sum of edge lengths is taken as the reordering criterion. The units of length decrease significantly, by 26 percent, from 2734 to 2012. A unit of length is given by the distance of two neighboring nodes on the vertical line. Additionally, the tool provides an orthogonal edge layout that is very helpful for path-related tasks. Due to the straight lines and 90 degree angles at intersection points, it is easier to follow edges that cross many other edges. A drawback of the approach is caused by many edges incident to the same node. This phenomenon may confuse a viewer, who could also perceive crossing edges as edges that start at this particular node. This problem is mitigated a little by classifying edges into upward and downward edges as illustrated before. Figure 5.7 shows an orthogonal layout of a graph sequence.

5.1.1.2 Hierarchy Levels

As we deal with compound digraphs, we have to visualize the hierarchical ordering of the graph nodes as well. Figure 5.8 shows how the information hierarchy is represented. We use a traditional node-link diagram that is located on the left hand side of the graph sequence view to give an overview of the hierarchical information of the compound digraph sequence. Subtrees can be collapsed into a single node and expanded again. The size of a collapsed subtree is indicated by the color of its root node: large subtrees are represented by white hierarchy nodes, whereas small subtrees are visualized as dark blue colored nodes. The hierarchy can be transformed by clicking on a node. If the subtree of the node is expanded, it will be collapsed when clicking on the root node of the subtree.


Figure 5.7: Orthogonal layout can support a viewer when solving path-related tasks.

The same effect also works in the opposite direction. The graph is automatically updated to show only the nodes of the current frontier of the hierarchy. Smooth animations from the old view to the new one help the user preserve his mental map. Figure 5.9 shows how the view is transformed by smooth animation.

Figure 5.8: A graph in a TimeArcTrees representation: The hierarchy (from the dataset shown in Table 5.1) is shown on the left hand side. In this example, the hierarchy is completely expanded.

Figure 5.9: Smooth animation is used when collapsing parts of a hierarchy.

5.1.1.3 Aggregation of Edges

If the hierarchy is transformed by clicking on an intermediate node, the graph is adjusted as shown in Figure 5.10. The nodes of a collapsed subtree are replaced by a single node, and all edges to and from nodes within the subtree are shown as edges to and from the new node. If there are several edges to or from the same node, they are represented by a single edge only. The weight of this aggregated edge is computed either as the sum or as the average of the individual edge weights, depending on the user's preferences. For example, the edge $D \xrightarrow[G]{\mu(D,136)} 136$ in Figure 5.10 aggregates the edge $D \xrightarrow[G]{\mu(D,B)} B$ and the edge $D \xrightarrow[G]{\mu(D,A)} A$ in Figure 5.8, and its color indicates the sum of the individual edge weights, expressed by µ(D, 136).

5.1.1.4 Graph Sequence

The TimeArcTrees tool is at its best if we have to deal with sequences of graphs. Dependencies and relations among several objects normally change over time. This leads either to totally different graph structures or to only locally different substructures. The node-link diagrams in Figure 5.11 show a sequence of small graphs. For example, the edge D → B in the first graph has vanished in the second graph and reappears in the third one, but with a smaller weight than before. The node E in the last graph has neither incoming nor outgoing edges and thus is not shown at all.


Figure 5.10: Parts of the hierarchy can be expanded or collapsed to an interactively selectable level. In this example, the complete subtree with root 136 is collapsed, and all incoming, outgoing, and inner edges of this subtree are aggregated.

The same scenario as in Figure 5.11 is represented in Figure 5.12, but here we apply our novel TimeArcTrees approach. Edges are colored and the hierarchy is added on the left side of the graph sequence view.

Figure 5.11: A sequence of graphs shown as traditional node-link diagrams.

5.1.1.5 Aggregation of Graphs over Time

Subsequent graphs can be aggregated into one single graph. In the example shown in Figure 5.13, the second and the third graph of Figure 5.12 are aggregated into a new graph G2. To indicate this, there is no graph separator between the two bullets representing the two original graphs on top of each graph. In the aggregated graph, the edge $E \xrightarrow[G_2]{\mu(E,D)} D$ stems from the second graph and the edge $D \xrightarrow[G_2]{\mu(D,B)} B$ from the third graph. The color of the edge $F \xrightarrow[G_2]{\mu(F,C)} C$ represents the average of the weights of the same edge in both original graphs G2 and G3, that is

(µ2(F, C) + µ3(F, C)) / 2

Alternatively, the TimeArcTrees tool can use the sum of the weights for aggregated edges, in this example µ2(F, C) + µ3(F, C). It goes without saying that TimeArcTrees also allows to expand aggregated graphs again by reinserting graph separators.
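A small Java sketch of this temporal aggregation is given below; the edge keys, the example weights, and the handling of edges that occur in only one of the graphs (they are averaged over their occurrences) are assumptions for illustration.

    import java.util.*;

    // Sketch: aggregating subsequent graphs by combining the weights of the same edge,
    // either as their sum or as their average, as described above.
    class GraphAggregation {
        // Each graph maps an edge key such as "F->C" to its weight mu_i(F, C).
        static Map<String, Double> aggregate(List<Map<String, Double>> graphs, boolean average) {
            Map<String, Double> sums = new HashMap<>();
            Map<String, Integer> counts = new HashMap<>();
            for (Map<String, Double> g : graphs) {
                for (Map.Entry<String, Double> e : g.entrySet()) {
                    sums.merge(e.getKey(), e.getValue(), Double::sum);
                    counts.merge(e.getKey(), 1, Integer::sum);
                }
            }
            if (average)
                sums.replaceAll((edge, sum) -> sum / counts.get(edge));
            return sums;
        }

        public static void main(String[] args) {
            // Illustrative weights only, not the values of the figures.
            Map<String, Double> g2 = Map.of("E->D", 4.0, "F->C", 2.0);
            Map<String, Double> g3 = Map.of("D->B", 1.0, "F->C", 6.0);
            System.out.println(aggregate(List.of(g2, g3), true)); // F->C averaged over both graphs
        }
    }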

5.1.2 Interactive Features in TimeArcTrees

The TimeArcTrees tool has many additional interactive features that help the user to explore the given graphs. In this section we will explain some of them in more detail.

• Expanding/Collapsing of the hierarchy: The information hierarchy can be expanded or collapsed by clicking on the corresponding node. This has the positive effect that subgraphs can be combined into one single node and the graphs become much smaller. The transition is supported by smooth animation to preserve the mental map.
• Graph aggregation: Subsequent graphs can be aggregated by dissolving the graph separator between two or more subsequent graphs. The edge weights of the aggregated edges are computed either as the sum or as the average of the weights of the individual edges.


Figure 5.12: The TimeArcTrees visualization for a sequence of three graphs with the information hierarchy. Graph separators are interactive widgets that can be used to change the horizontal space for each graph, as well as to aggregate subsequent graphs.

• Node activation and deactivation: Nodes in the hierarchy can be activated and deactivated. If a node is deactivated, only those edges are visible that are not incoming or outgoing edges of this deactivated node.
• Filtering function: In order to reduce visual clutter, edges can be filtered by applying a threshold to their edge weights. In other words, if the weight of an edge is below a given threshold, it is not drawn on screen.
• Detail on demand: Detailed information can be obtained in the form of a tooltip when moving the mouse cursor over a node or edge position.
• Color coding: The tool provides a set of predefined color scales, from which the user can select an adequate one.

5.1.3 An Application—Shortest Paths

So far, we have shown how TimeArcTrees can be used to explore a sequence of graphs. In addition, the tool is able to compute shortest paths using the algorithm of Richard Bellman and Lester Ford [37], or Bellman-Ford algorithm for short. TimeArcTrees can also visualize some other graph properties such as maximum flows, which we do not discuss further in this thesis. To compute a shortest path, the user has to select start and target nodes as shown in Figure 5.14.
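For reference, a generic textbook version of the Bellman-Ford relaxation is sketched below in Java; it is not the tool's code, and the node indexing, the Edge record, and the negative-cycle handling are illustrative choices.

    import java.util.*;

    // Sketch of the Bellman-Ford algorithm on a weighted digraph with nodes 0..nodeCount-1.
    class BellmanFord {
        record Edge(int from, int to, double weight) {}

        // Returns the minimal costs from 'source' to every node, or null on a negative cycle.
        static double[] shortestPaths(int nodeCount, List<Edge> edges, int source) {
            double[] dist = new double[nodeCount];
            Arrays.fill(dist, Double.POSITIVE_INFINITY);
            dist[source] = 0.0;
            for (int i = 0; i < nodeCount - 1; i++)        // relax all edges |V|-1 times
                for (Edge e : edges)
                    if (dist[e.from] + e.weight < dist[e.to])
                        dist[e.to] = dist[e.from] + e.weight;
            for (Edge e : edges)                           // check for negative cycles
                if (dist[e.from] + e.weight < dist[e.to])
                    return null;
            return dist;
        }
    }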


Figure 5.13: The second and third graph are aggregated to one single graph. The graph separator between the two original graphs was removed.

Figure 5.14: The start node for shortest path visualization is selected by clicking at the green circle sector of a node. The target node can be selected with the red circle sector.

By clicking on the green circle sector (see Figure 5.14), a node is marked as the start node, whereas by clicking on the red circle sector, it is marked as the target node. A shortest path in a graph G from node u to node v is a path

$u \xrightarrow[G]{c}{}^{*}\; v$

with minimal costs c. To visualize a shortest path, only the edges along the path are drawn in color, while the remaining edges are drawn in light grey. At each node along a path, the accumulated costs relative to the costs of the longest shortest path are indicated by a circular bar, as shown in Figure 5.15. For each graph of the sequence, we may have a different shortest path; of these shortest paths, the longest shortest path is the one with the highest cost. To illustrate our approach, we look at the question of how the fastest way to go home on the German Autobahn changes during the day, or, more precisely, how to get from the junction "Kreuz Meerbusch" to the junction "Kamener Kreuz".


Figure 5.15: The accumulated costs along the shortest path from the start node to this node are represented by a circular bar.

Figure 5.16 shows an excerpt of a German Autobahn map that we will use in the following to geographically interpret the findings of our visual analysis. We divided the map into four regions: South, North, East, and West. From this map we derived an Autobahn graph consisting of 27 nodes and 78 edges, in which the nodes represent the junctions, and the edge weights represent the time to get by car from one junction to another. We use the following time function t to generate edge weights for the different situations: a free motorway, a slow movement on the motorway, and a traffic jam. The function maps a number of kilometers d to minutes t(d) taken for that distance.

Figure 5.16: Excerpt of the German Autobahn map: The map is divided into four regions. The start node for our shortest path example is indicated by a green flag, the target node by a red one.


t : ℝ₀⁺ → ℝ₀⁺, d ↦
    0.5d, if the motorway is jam free
    3d,   if only slow movement is possible
    6d,   if the motorway is jammed

This means, for example, that it takes a car driver 5 minutes for 10 kilometers on a jam-free motorway, half an hour when only slow movement is possible, and one hour when the motorway is jammed. We do not claim that the given times are correct for each traffic situation, but we think that they are suited for our visualization purposes. The data acquisition was accomplished manually by storing the traffic jam situation at several points of time during a whole day, which can be extracted from traffic service websites such as [176]. Typically, the time of travel changes during the day due to traffic jams, which are typically caused by rush hours, car breakdowns, or the beginning and end of holidays.

In Figure 5.17 the shortest paths are drawn into the map to preserve geographic information. For each of seven different times of day a separate map is shown. Finally, the map at the lower right shows all shortest paths in one single map. Visualizing the paths using separate maps requires a lot of screen space, showing the maps one after another using animation leads to high cognitive load, and integrating the paths into one single map produces a lot of visual clutter and makes it hard to show additional information. By visualizing the Autobahn graph instead of the map, we lose geographic information, but it allows us to apply the TimeArcTrees approach to explore the data in a space-efficient and highly interactive way. The geographical location is not what matters when trying to travel from one city to another as fast as possible. This refers to a statement of Harry Beck, the famous London underground map designer [74]. Beck states that "the most important thing is knowing how to travel to your destination and at which points you have to change one line to another".

In the example in Figure 5.18 the shortest paths are shown for 8 different points of time of the same day. Actually, the graphs for 12 am and 2 pm are aggregated in this example to save screen space, and because there was no difference between the shortest paths in both graphs. As can be seen by the closed circular bar around the target node in the graph for 4 pm, this graph contains the longest shortest path. When moving the cursor onto the circular bar, the tooltip shows that this path takes 102 minutes. In contrast, the shortest shortest path only takes 62 minutes. Surprisingly, the longest shortest path and the shortest shortest path run along the same nodes, whereas most of the other shortest paths have different intermediate nodes. Thus, in this case we found out that it does not pay off to use alternative routes during the rush hour because these are also jammed.
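The time function t can be written down directly; the following Java sketch encodes the three traffic states as an enum (an assumption for illustration) and reproduces the example of 5 minutes for 10 kilometers on a jam-free motorway.

    // Sketch of the time function t used above, mapping kilometers to minutes of travel time.
    class TravelTime {
        enum Traffic { JAM_FREE, SLOW_MOVEMENT, JAMMED }

        static double minutes(double kilometers, Traffic state) {
            switch (state) {
                case JAM_FREE:      return 0.5 * kilometers;
                case SLOW_MOVEMENT: return 3.0 * kilometers;
                case JAMMED:        return 6.0 * kilometers;
                default: throw new IllegalArgumentException();
            }
        }

        public static void main(String[] args) {
            System.out.println(minutes(10, Traffic.JAM_FREE)); // 5.0 minutes, as in the text
        }
    }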


Figure 5.17: A sequence of geographic maps shows the shortest paths at different points of time. In the lower right map all paths are shown at once.

Figure 5.18: Screenshot of the TimeArcTrees tool showing the shortest paths between the nodes “Kreuz Meerbusch” and “Kamener Kreuz” for different points of time. The graph in the middle is actually an aggregation of two graphs. The edges are colored with a blue to red color scale.


While we found the Autobahn example very intuitive, removing the spatial information and aligning all junctions vertically means that the user might misinterpret the length and direction of edges as geographical distance and orientation.

5.2 Timeline Trees

The Timeline Trees visualization technique can be used to represent sequences of transactions in information hierarchies. It could easily be extended to a visualization approach that also deals with weighted dynamic compound digraphs. In many applications, transactions between the elements of an information hierarchy occur over time. For example, the product offers of a department store can be organized into product groups and subgroups to form an information hierarchy. A market basket consisting of the products bought by a customer forms a transaction. Market baskets of one or more customers can be ordered by time into a sequence of transactions. Each item in a transaction is associated with a measure, for example, the amount paid for a product or the number of products of the same kind. The Timeline Trees visualization technique is a novel method for visualizing sequences of these kinds of transactions in information hierarchies. The visualization tool provides several interaction techniques that are explained in more detail in Section 5.2.5. Also, smooth animations help a user to track the transitions between views. We illustrate the usefulness of the approach by examples from several very different application domains. The Timeline Trees approach presented in this thesis integrates three views into one:

• Information hierarchy view: It shows the whole hierarchy to an interactively selectable level. By clicking on a node that is currently displayed as a leaf, the subtree with this node as parent node is expanded. If we click on an intermediate node, the subtree starting at that node is collapsed. Expanding or collapsing subtrees of the hierarchy can help to detect relations at different levels of abstraction.
• Timeline view: The sequence of transactions is visualized on a timeline drawn as an extension of the interactive tree. The elements of a transaction are represented by rectangular boxes that are colored and sized according to the defined measure. Together with some alternative views and further features, which are introduced further below, the timeline visualization provides an extensive tool to explore and analyze the transaction sequence.
• Thumbnail view: Miniature representations of the timeline view at each leaf node or at each collapsed node of the hierarchy enable the user to detect dependencies in which the element(s) represented by that node are involved.


A dependency between two items can be detected by one or more elementary perceptual tasks that are encoded in the position in non-aligned scales, area, shape, color, and context. To the best of our knowledge, Timeline Trees is one of the first approaches that allows users to visually explore sequences of transactions in information hierarchies in a single view using static pictures, and that additionally allows them to interactively manipulate the information hierarchy. The user can analyze the evolution of transactions and the roles of their member elements, and detect when and how strongly elements of the hierarchy are related.

5.2.1 Visualizing the Information Hierarchy

The visualization of hierarchical data is at the heart of information visualization. Information hierarchies can be obtained from both containment relations, where the parent nodes are containers for their child nodes, and subordination relations, where parent nodes are controllers of their child nodes. Examples of information hierarchies are given by the following list:

• Evolutionary or phylogenetic trees in biology (subordination relations)
• Hierarchical organization of companies (subordination relations)
• News topics and subtopics (containment relations)
• File/directory systems (containment relations)
• Products and product groups of a department store (containment relations)

We will start our explanation of Timeline Trees with the small dataset given in Table 5.2.

Table 5.2: The example shows market baskets at several days of a week.

Day         Market basket and money spent
Monday      milk $1, bananas $3
Tuesday     cheese $1, apples $3
Wednesday   milk $1, bananas $1, grapes $2
Thursday    milk $1
Friday      milk $1, cheese $3

It shows the market baskets of five consecutive days, i.e., the products and the price of each product that a person bought. In our example we use the price as a measure function. For example, the third transaction corresponding to Wednesday is

t3 = {milk, bananas, grapes}


where the value of the measure for grapes is µ3({grapes}) = 2.

Figure 5.19 shows the market basket example in the Timeline Trees visualization. The representation is composed of three views. Let us first turn our attention to the information hierarchy view.
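To make the data model more concrete, the following minimal sketch shows how such a transaction sequence with an attached measure could be represented in JAVA code; the class and field names are illustrative assumptions and are not taken from the actual implementation of the tool.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// A transaction maps each participating item to its measure value,
// e.g. the price paid for a product on one day.
class Transaction {
    final String label;                       // e.g. "Wednesday"
    final Map<String, Double> measure = new LinkedHashMap<>();

    Transaction(String label) { this.label = label; }

    Transaction add(String item, double value) {
        measure.put(item, value);
        return this;
    }
}

public class MarketBasketExample {
    public static void main(String[] args) {
        // The market baskets of Table 5.2 as a sequence of transactions.
        List<Transaction> sequence = List.of(
            new Transaction("Monday").add("milk", 1).add("bananas", 3),
            new Transaction("Tuesday").add("cheese", 1).add("apples", 3),
            new Transaction("Wednesday").add("milk", 1).add("bananas", 1).add("grapes", 2),
            new Transaction("Thursday").add("milk", 1),
            new Transaction("Friday").add("milk", 1).add("cheese", 3));

        // mu_3({grapes}) = 2
        System.out.println(sequence.get(2).measure.get("grapes"));
    }
}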

Figure 5.19: Visualization of the market basket example (see Table 5.2) as a Timeline Trees representation with color-coded measure values and additional explanations.

The hierarchy view makes use of a conventional node-link representation where the size and color of a node visually encode the number of items that are descendants of the node: One can identify nodes that are roots of large subtrees, even if they are collapsed. Similar to the TimeArcTrees approach, the most important interaction functions of the tree diagram are collapsing and expanding of nodes with smooth transitions. This enables the user to explore larger information hierarchies without losing focus and to compare data on different levels of abstraction.

The goal of the tree layout is to efficiently display the tree with labeled nodes and to emphasize the tree structure. The former goal is realized by using more horizontal space with increasing depth of the nodes. An additional feature is the space-dependent orientation of the labels, which generates a more space-filling tree layout. The nearly orthogonal layout and smaller vertical distances between siblings help to reach the latter goal. Furthermore, tooltips provide detailed information about the nodes.

5.2.2 Visual Encoding of the Transaction Sequence

In the application domains mentioned above, there are also relations between elements in the hierarchy at a given point in time or in a specific time interval. For example, employees are related if they communicate with each other, topics are related if they are covered in the daily newscast, files are related if they are changed simultaneously by the same person, or products are related if they are bought by the same customer at the same time. Related elements at a given point in time or time interval form a transaction. If we inspect a longer time period, we typically obtain a certain number of transactions, which can be chronologically ordered into a sequence of transactions. Though some researchers have developed methods to visualize relations between elements of a hierarchy [20, 60, 123, 190], only little research has been done on visualizing sequences of transactions in hierarchical data. The visualization of transaction sequences in information hierarchies is motivated by many examples from the real world.

Furthermore, the elements involved in a transaction can exhibit different extents. To model this, we associate a measure with each transaction by mapping each element of the transaction to a positive real number. Thus, in the application domains mentioned above, the conversational partners, the selection of topics, the files, and the products bought are the elements of transactions, while the duration of the communication, the extent of the coverage of each topic, the size of the modification of each file, and the amount paid for a product are the associated time-varying measures.

The Timeline Trees approach visualizes a sequence of transactions in a rectangular box style. The temporal component is encoded in a timeline that goes from left to right in the diagram, hence the name 'Timeline' Trees. In many applications time provides a natural order on the transactions. Each box represents one member element of a transaction and is positioned in the same column as the other members of this transaction and in the row of the corresponding item. The measure µi({v}) is redundantly encoded by the color and the height of the box, whereas its width is fixed. This means that the size of a box increases linearly with the measure. The user can also switch to fixed heights, because in some applications the importance of an element does not correlate with the measure. Timeline Trees also includes numerous predefined color scales for color coding, so that the user can select a suitable color scale for the task at hand. Discerning two adjacent, similarly colored boxes might be difficult, so we use a brightness gradient as a kind of cushion effect [168].
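As an illustration of this redundant encoding, the following sketch maps a measure value linearly to a box height and to a simple blue-to-red color scale, and adds a brightness gradient towards the box borders. The constants and the concrete color scale are assumptions; the exact functions used in the tool are not given here.

import java.awt.Color;

// Minimal sketch of the redundant encoding of a measure by box height and color
// (linear mappings; the concrete scales of the tool may differ).
public class BoxEncoding {
    static final int MAX_HEIGHT = 60;          // pixels for the largest measure

    // Height grows linearly with the measure value.
    static int boxHeight(double value, double maxValue) {
        return (int) Math.round(MAX_HEIGHT * value / maxValue);
    }

    // A simple blue-to-red color scale for the same value.
    static Color boxColor(double value, double maxValue) {
        float t = (float) (value / maxValue);   // 0 = small, 1 = large
        return new Color(t, 0f, 1f - t);
    }

    // Brightness gradient ("cushion") to separate adjacent, similarly colored boxes:
    // full brightness in the middle of the box, darker towards its borders.
    static Color cushion(Color base, double relativePositionInBox) {
        double d = Math.abs(relativePositionInBox - 0.5) * 2.0;  // 0 center, 1 border
        float factor = (float) (1.0 - 0.4 * d);
        return new Color((int) (base.getRed() * factor),
                         (int) (base.getGreen() * factor),
                         (int) (base.getBlue() * factor));
    }
}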


So far we discussed how to draw boxes for leaf nodes of a hierarchy. Next we look at how to tackle the problem of collapsing subtrees in order to also use the hierarchical information about the elements involved in the transaction sequence. In Timeline Trees, collapsed nodes can either be encoded as one single box with aggregated measure values or as several vertically stacked boxes, each representing the non-zero color-coded value of one single element in a transaction. Both modes can be useful for different applications and can be applied on demand.


Figure 5.20: Market basket with collapsed nodes ‘dairy’ and ‘fruit’ in different modes: (a) height represents the measure, collapsed items are stacked; (b) unified heights, summed measure values for collapsed items.
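A sketch of how the two modes of Figure 5.20 could be computed for a collapsed node, assuming hypothetical helper names: mode (a) keeps one value per leaf with a non-zero measure, while mode (b) sums them into a single aggregated value.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Minimal sketch of the two display modes for a collapsed node (assumed names;
// the tool's real hierarchy classes are not shown in the text).
class CollapsedNode {
    // Leaves below the collapsed node, e.g. {"milk", "cheese"} for 'dairy'.
    final List<String> leaves;
    CollapsedNode(List<String> leaves) { this.leaves = leaves; }

    // Mode (a): one stacked box per leaf with a non-zero value in the transaction.
    List<Double> stackedValues(Map<String, Double> transactionMeasure) {
        List<Double> values = new ArrayList<>();
        for (String leaf : leaves) {
            Double v = transactionMeasure.get(leaf);
            if (v != null && v > 0) values.add(v);
        }
        return values;
    }

    // Mode (b): a single box whose value is the sum over all leaves.
    double aggregatedValue(Map<String, Double> transactionMeasure) {
        return stackedValues(transactionMeasure).stream().mapToDouble(Double::doubleValue).sum();
    }
}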

During the interactive exploration process of a dataset, good orientation and easy access to additional information are very important. Our visualization supports these aspects by highlighting the row and column marked by the current position of the mouse cursor and by detailed tooltip texts as shown in Figure 5.21. Another very useful feature is the masking of elements. To this end, the user can select some items or collapsed nodes to form a mask set M. Only those transactions TM = {ti | ti ∩ M = M} that match the mask set will be shown. All transactions that do not contain all nodes of the mask set are faded out. As a result, the user can focus on the relations between the nodes in the mask set.
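A minimal sketch of this mask-set filter, treating transactions simply as sets of item names; the real tool additionally keeps the measure values and fades the filtered transactions out instead of removing them.

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of the mask-set filter T_M = { t_i | t_i ∩ M = M }: a transaction is kept
// only if it contains every element of the mask set M.
public class MaskFilter {
    static List<Set<String>> filter(List<Set<String>> transactions, Set<String> mask) {
        return transactions.stream()
                .filter(t -> t.containsAll(mask))   // t ∩ M = M
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Set<String>> ts = List.of(
            Set.of("milk", "bananas"),
            Set.of("cheese", "apples"),
            Set.of("milk", "bananas", "grapes"));
        // Mask set {milk, bananas}: the first and third transaction match, the second is faded out.
        System.out.println(filter(ts, Set.of("milk", "bananas")));
    }
}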


Figure 5.21: Tooltip for the collapsed ‘fruit’ node in the Wednesday transaction of the market basket example (see also Figure 5.20).

5.2.3 Thumbnails as Miniature Representations

The idea of masking transactions is extended by the thumbnail views of the timeline diagram. These thumbnails are displayed for every item or collapsed node at the right side of the tree diagram. They show the transactions from the perspective of the corresponding node, as if this node were the only element of the mask set. In other words, only those transactions the node is a member of are represented in the thumbnail using the selected color code; the remaining transactions are only drawn as gray boxes. As for the general mask set, the thumbnails are a good tool for identifying correlations between nodes, but in contrast to the mask set, the thumbnails are shown simultaneously for each item or collapsed node. To assist orientation, within a thumbnail the row of the node the thumbnail belongs to is highlighted as a slightly colored line. Furthermore, to counteract the disadvantage of the relatively small size of the thumbnails, we implemented a magnification lens that enlarges parts of the thumbnails when the mouse cursor moves over them. The technique used for this lens function is called X- and Y-distortion since it distorts the view in both the X and the Y direction. The technique offers one solution to the focus+context problem. Leung et al. [110] applied the technique to the visualization of the London Underground map. Figure 5.22 shows an example of the lens function that distorts the selected and focused region in both the X and the Y direction to additionally preserve the context.
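The text does not specify the exact distortion function, so the following sketch uses the classic fisheye function g(x) = (d+1)x/(dx+1) of Sarkar and Brown as one possible choice; applying it once to the X and once to the Y coordinate yields the kind of X- and Y-distortion described above.

// Sketch of a one-dimensional fisheye distortion that can be applied independently
// to the X and Y coordinates of a thumbnail. This is not necessarily the function
// used in the tool; it magnifies coordinates near the focus point and compresses
// the rest, while keeping the boundaries of the thumbnail fixed.
public class FisheyeLens {
    static double distort(double coord, double focus, double extent, double d) {
        double sign = coord >= focus ? 1.0 : -1.0;
        // normalized distance from the focus in [0, 1]
        double x = Math.min(1.0, Math.abs(coord - focus) / extent);
        double g = ((d + 1.0) * x) / (d * x + 1.0);   // magnified near the focus
        return focus + sign * g * extent;
    }

    public static void main(String[] args) {
        // Points close to the focus (0.5) are pushed apart, distant ones compressed.
        for (double x = 0.0; x <= 1.0; x += 0.25) {
            System.out.printf("%.2f -> %.2f%n", x, distort(x, 0.5, 0.5, 3.0));
        }
    }
}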


Figure 5.22: Thumbnail example with lens function while the mouse cursor is over the 'Defense' thumbnail (detailed view of the soccer match visualization presented in Section 5.2.6.1).

5.2.4 Alternative Representation: Time Bars

In addition to the visualization discussed above, Timeline Trees includes an alternative representation of the transactions, which is shown in Figure 5.23 for the market basket example. Here, the time or order of transactions is encoded by color, and the measure is represented by the width of the boxes instead of their height. The boxes are drawn from left to right attached to each other, instead of positioning them in separate columns as in the Timeline Trees representation. Thus, boxes related to the same transaction are no longer in the same column, but they have the same color. As the resulting representation is very similar to a bar chart, we call it Time Bars. This visual encoding of stacked bars is referred to as divided bar graphs in the literature. We use Time Bars only in addition to the default visualization because the color coding of time is not that intuitive, and discerning transactions in time is not accurate enough. But for many analyses, it provides the following advantages (a small sketch of the row layout is given after the list):

• The Time Bars form a kind of bar diagram that represents the aggregated measures of the items and currently collapsed nodes. So, for example, one can easily detect which node is the most active one.

• The shape of the diagram is much more memorable, and one establishes a sort of mental map while exploring the data. Thus, the orientation in the diagram and especially in the thumbnails is significantly better.


Figure 5.23: Visualization of the market basket example in a Time Bars view.

The reason for this lies in the additional elementary perceptual tasks given by the different shapes and areas.

• The distribution of colors gives a more holistic overview of the temporal progress of the transactions: One can detect differences at first sight. As a drawback, we see the limited ability to distinguish different colors.
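The following sketch lays out one row of a Time Bars view: widths grow linearly with the measure, boxes are attached from left to right, and the transaction index is what the color scale is later applied to. The names and the pixel scale are illustrative assumptions.

import java.util.ArrayList;
import java.util.List;

// Sketch of the Time Bars layout for one row (one item or collapsed node).
public class TimeBarsRow {
    record Box(int transactionIndex, double x, double width) {}

    static List<Box> layoutRow(double[] measuresPerTransaction, double pixelsPerUnit) {
        List<Box> boxes = new ArrayList<>();
        double x = 0;
        for (int i = 0; i < measuresPerTransaction.length; i++) {
            double w = measuresPerTransaction[i] * pixelsPerUnit;
            if (w > 0) {                 // transactions the item is not part of add no box
                boxes.add(new Box(i, x, w));
                x += w;
            }
        }
        return boxes;                    // total row length = aggregated measure of the item
    }
}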

5.2.5 Interactive Features of Timeline Trees

The Timeline Trees representation also offers the opportunity to interactively manipulate the provided views. We already mentioned some interactive features in the previous sections and give a more extended list of those features in the following:

• Expanding and collapsing of subhierarchies: Clicking on a node that represents a collapsed subhierarchy expands this hierarchy again. The same is true for expanded subhierarchies: clicking on a parent node leads to a smoothly animated collapsing of this subtree.

• Selecting a mask set: Several nodes can be selected and added to a mask set. This set is then used to filter the transaction sequence for those transactions that contain at least all elements of the mask set.

• Lens function: To get an enlarged display of a portion of the visualized data, we use a lens function. X and Y distortion is used in the lens function to tackle the focus+context problem.


• Filtering measure values: The user can select minimum and maximum threshold values, which has the effect that only those elements with measure values in this interval are shown. All others are grayed out to still indicate the elements that are filtered out.

• Box highlighting: If the mouse is located over a rectangular box, this box is highlighted, and the corresponding box in either the thumbnail view or the timeline view is highlighted, too. This linking function helps a user to understand the relations much better.

• Selecting a color scale: Color coding is used to better extract the quantitative information encoded in both the timeline view and the thumbnail view. Predefined color scales can be applied to a transaction sequence dataset to better support a user in exploring the data. Some color scales may not be suited for a given dataset, and hence, the user has to select the most appropriate one.

• Details on demand: Following Ben Shneiderman's visualization mantra, we also allow the user to request details when analyzing a dataset. The detailed information is displayed as a tooltip text.

Interactive features are very important in modern visualization tools to support a user in exploring a dataset. They free the user from just watching static pictures in which interesting phenomena about a dataset could remain hidden. By means of interaction, a user is directly involved in the visualization process and can decide about the state of the provided views himself.

5.2.6 Application Domains

To illustrate the features and the usefulness of the Timeline Trees visualization tool, we apply it to datasets from very different application domains. In the following sections we describe explicitly which interesting insights we gained.

5.2.6.1 Team Play in a Soccer Match

Soccer teams are hierarchically organized. Eleven players belong to each team and are subdivided into different team parts: the goalkeeper, the defense, the midfield, and the offense. Additionally, players have their specific location or area on the pitch where they act. The players participating in a move of the match can be seen as a transaction where each element has a measure, namely the number of ball contacts. Figure 5.24 shows the moves of the first half of a soccer match in the World Cup Championships 1990 in Italy. The dataset was recorded by hand from the match between Germany and the Netherlands, which Germany won 2 to 1.

Figure 5.24: Timeline Trees for the soccer match between Germany and the Netherlands in the World Cup Championships 1990 in Italy on team part level.


In this visualization, the organizational structure of a soccer team in terms of offense, midfield, defense, and the goalkeeper, together with the individual players, forms the hierarchy and is represented as a node-link diagram on the left-hand side. Players are related to each other if they take part in the same move, which can be observed in the thumbnail view in each of the small boxes. Here we define a move as the time period during which one team has exclusive ball possession. As the measure of a move, we use the number of ball contacts of one member of the team. Many ball contacts are indicated by higher bars and a red color, whereas a green color stands for few contacts in a move; yellow is a value in between.

In Figure 5.24 we can also make very interesting observations about the first half of the match. The hierarchy is expanded to the level of team parts. Both defenses are the parts with the most ball contacts. The goalkeepers have only very few contacts, which is an absolutely normal phenomenon. The German offense is not as active as its counterpart from the Netherlands. But the German midfield takes over this part and therefore has many more ball contacts than the one of the Netherlands. A closer look at the lowest timeline in this figure reveals that the offense of the Netherlands increases its number of ball contacts towards the end of this first half.

In Figure 5.25, the German midfield and offense as well as the defense of the Netherlands are expanded. The thumbnail view tells us that there is one transaction in which one player of each team is involved. Frank Rijkaard and Rudi Völler both received the red card and were ejected from the match. This detail-on-demand information can be requested by a tooltip when moving to the position of one of the corresponding bars. After this 21st minute of the first half, the following observation can be made. Frank Rijkaard was a defending player, and it can be expected that the other players belonging to the defense have to do the work of the missing player. And in fact this is true. The players Adrie van Tiggelen and Ronald Koeman have many more ball contacts than before this 21st minute. Another observation is that the ball contacts of the whole offense part of the Netherlands increase right after this 21st minute, and naturally, those of the defense of Germany in the same way.

5.2.6.2 Evolution of Transactions in Software Systems

Open source software systems under version control can be used to gain interesting insights into the development process of the software system. One important observation can be which files have been changed together and to what extent. Furthermore, one can infer which files have been developed in which period of time. These facts can be very helpful to support software developers during the evolutionary process of their current project. Figure 5.26 shows the Timeline Trees visualization for a time period of the development of the JEDIT [146] open source software project. In this figure, the two predominantly blue colored lines indicate that two software artifacts are in the center of the evolutionary process.

Figure 5.25: Timeline Trees for the soccer match with expanded team part subhierarchies.


Figure 5.26: Transactions of a part of the JEDIT open source software project. The doc and org subdirectories have been changed most frequently.

The upper one corresponds to the doc subdirectory and the one in the lower part represents the whole source code subdirectory of the project. Most of the transactions contain at least one file of the source code subdirectory. Documentation and source code are changed together very frequently. This can be a hint that developers almost always document their changes immediately.

A closer look at the selection of transactions by the mask set in Figure 5.27, which contains both documentation files TODO.txt and CHANGES.txt, reveals that in nearly every case when a developer changes the file CHANGES.txt he also changes the file TODO.txt. The inverse only holds in roughly 50 percent of the transactions. Our hypothesis is that, if someone makes a change to the CHANGES.txt file, he always has to adjust the TODO.txt file because the change solves a problem or implements a feature contained in the to-do list.

5.2.6.3 World's Export in a Time Bars Representation

Using time bars instead of timelines, our visualization can be used as an augmented bar chart diagram. The bars are generated by stacking the boxes of each time interval. In addition to the conventional approach, the single bars are colored with respect to their corresponding time interval. This approach can help to observe in which time interval a bar grows more rapidly than others. Figure 5.28 shows the yearly export data in terms of dollars for the whole world from 1948 to 2005.


Figure 5.27: Timeline Trees with two files in the mask set, common transactions of the masked files are highlighted.

Figure 5.28: Export data (in dollars) of the world's regions in a Time Bars view from 1948 to 2005.


The year of a transaction is indicated by color, where blue indicates older transactions and red indicates more recent ones. Green, yellow, and orange colors are used for values in between. We can immediately see that Western Europe has the biggest export value for this time interval, followed by East Asia, North America, and Central Europe. The hierarchy can be expanded to the country level to gain insights into the export data of each country of the world. Another interesting observation is that the whole continent of Africa exports less than Southern Europe, for example.

5.3 TimeRadarTrees

The TimeRadarTrees visualization technique can be seen as the radial counterpart of the Timeline Trees approach described in Section 5.2. It is a novel approach that can visually encode sequences of weighted compound digraphs and sequences of transactions in information hierarchies. It uses a radial node-link layout to draw the information hierarchy and circle sectors to represent the temporal changes of edges in the digraph sequence. The hierarchy can also be represented as a layered icicle plot outside the circle to avoid occlusion problems that may be caused by the radial node-link diagram. For the TimeRadarTrees visualization tool, we also developed several interaction techniques to explore the time-based relational data. Smooth animations help a user to track the transitions between views. Also for this technique, we will illustrate the usefulness of the approach by examples from very different application domains.

Radial representations have become a popular metaphor in the field of information visualization. The main reason for their prevalence may be their compact layout and their aesthetic appeal to a viewer. Though many researchers make use of radial representations, only little effort has been made to evaluate the usefulness of radial techniques and their predominance over Cartesian counterparts. In a recent work, Draper et al. [47] give a survey of existing radial visualizations and present a taxonomy in the form of seven design patterns, namely, the tree and the star pattern for polar plots, concentric, spiral, and Euler patterns in space-filling visualizations, and connected and disconnected patterns in ring-based representations. In terms of this taxonomy, the TimeRadarTrees technique combines elements from both the concentric radial space-filling pattern and the connected ring pattern.

The advantage of the TimeRadarTrees approach presented in this thesis is definitely the fact that the evolution of weighted dependencies in information hierarchies can be visualized in a single diagram, and hence it is a static representation of dynamic graph data. The TimeRadarTrees technique integrates three views into one to represent time-series relational data in this way:


• Interactive radial tree: It shows the whole hierarchy to an interactively selectable level. Clicking on a node on the circumference expands it; clicking on an intermediate node collapses the subtree starting at that node, and the node is put on the circumference. Expanding or collapsing subtrees of the hierarchy can help to detect relations at different levels of abstraction.

• Inner circle (Time Radar): Incoming edges of leaf nodes or collapsed subtrees are shown as colored parts of a circle sector related to that node or subtree. The color of each part encodes the weight of the edge, i.e., the strength of the dependency.

• Outer circles (Thumbnails): The smaller outer circles, which are located close to each hierarchy leaf node, show the outgoing edges of the related node. The target node of each edge can be inferred from a sector position in non-aligned scales, slope, area, shape, curvature, color, and/or context.

In the following sections we will explain how to transform conventional node-link diagrams into a TimeRadarTrees representation.

5.3.1 Visualization of a Single Digraph

We illustrate our visualization technique by starting with the representation of a single graph and then adding features step by step. As a first example, consider the node-link diagram of a single digraph that is shown in Figure 5.29(a). Nodes are represented by circles, directed edges by arrows pointing from one node to another.


Figure 5.29: A digraph in two different visual metaphors: (a) node-link diagram; (b) TimeRadarTrees.

In TimeRadarTrees there are two visual representatives for an edge, see Figure 5.29(b). For each node, its incoming edges are represented by sectors of the large circle in the middle, while the smaller circles on the circumference of the inner circle show the outgoing edges. This stands in contrast to matrix-based graph visualizations where nodes are represented twice, vertically and horizontally.

All circles are subdivided into sectors as follows. First, if a graph consists of n nodes, the circle is divided into n equally sized sectors. Each of these sectors is associated with a certain node. In the example in Figure 5.29(b), the lower left sector of all circles is associated with node D. Next, each of the sectors is subdivided into a number of smaller sectors depending on the number of incoming edges of the associated node. In the example, the three colored sectors related to node D in the inner circle represent the three incoming edges of node D, while the one big colored sector related to node A indicates that node A has only one single incoming edge; finally, the white sectors related to E and C show that these nodes have no incoming edges at all. Note that, by looking only at the inner circle, we cannot identify the nodes from which the incoming edges start. This information can be grasped by looking at the outer circles. For example, by inspecting the outer circle related to node B, we see that there is only one outgoing edge, and this outgoing edge is drawn in the part of the circle associated with node D.

In comparison with node-link diagrams, an important advantage of the TimeRadarTrees visualization is that there are no edge crossings leading to visual clutter. As a drawback, we see the difficulty of solving path-related tasks without using interactive features. Another problem of the technique is scalability in several dimensions. These include the number of nodes, the number of edges, and the number of graphs. Many visualization techniques suffer from scalability problems because of limited display space. Even if we had an infinitely large screen to display information, we could not tap the full potential because of the limited field of view of the human visual system.

While this sector-based visual metaphor for graph data might seem awkward at first sight, and it needs some training to read this representation, it will turn out useful once we add more features. The eyetracking study described in Chapter 6 showed that the visualization could be understood even by laymen—people who are not very familiar with the concept of graphs—after reading a ten-minute tutorial text.
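The subdivision described above can be expressed with a few lines of geometry. The following sketch computes the angular extent of a node's sector and of one of its incoming-edge parts; the method names are illustrative and not the tool's actual API.

// Sketch of how the inner circle of a TimeRadarTrees diagram could be subdivided
// (angles in degrees).
public class SectorLayout {
    // Each of the n nodes owns an equally sized sector of the full circle.
    static double[] nodeSector(int nodeIndex, int n) {
        double size = 360.0 / n;
        return new double[] { nodeIndex * size, (nodeIndex + 1) * size };
    }

    // The node's sector is further subdivided into one equally sized part
    // per incoming edge of that node.
    static double[] incomingEdgeSector(int nodeIndex, int n, int edgeIndex, int incomingEdges) {
        double[] sector = nodeSector(nodeIndex, n);
        double part = (sector[1] - sector[0]) / incomingEdges;
        return new double[] { sector[0] + edgeIndex * part, sector[0] + (edgeIndex + 1) * part };
    }

    public static void main(String[] args) {
        // A node at index 3 of 5 nodes with three incoming edges, as node D in Figure 5.29(b).
        double[] s = incomingEdgeSector(3, 5, 0, 3);
        System.out.printf("first incoming edge: %.1f to %.1f degrees%n", s[0], s[1]);
    }
}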

5.3.2 Visualization of a Digraph Sequence

In the next step, we visualize a sequence of graphs instead of a single graph. As an example, we use the sequence of graphs shown in Figure 5.30(a). Each graph of the sequence is shown by a separate node-link diagram. Figure 5.30(b) shows a TimeRadarTrees representation of the same sequence of graphs. Here the edges of one graph are represented by sectors of the same ring.


Figure 5.30: A sequence of digraphs in two different visual metaphors: (a) node-link diagrams; (b) TimeRadarTrees.

In this example, the innermost circle corresponds to the first graph, which we have seen before; the next graph of the sequence is represented by the inner ring, and the third graph by the outer ring. The time axis starts in the circle center. Looking at the lower left sector of the first graph, we see, for example, that there are three incoming edges for node D in the first graph, one in the second, and none in the third. By looking at the lower left sectors of the small outer circles, we detect that two of the incoming edges of D start from node A, and its third incoming edge starts from node B. Furthermore, we see that the single incoming edge in the second graph also starts from A. In comparison with animated node-link diagrams, the integration of all graphs in the sequence into a single static diagram helps the user to preserve the mental map.

5.3.3 Visualization of the Hierarchy

In a compound digraph, like the one shown in Figure 5.31(a), the nodes of the graph are additionally related to the leaves of a hierarchy. In the example, the graph and the hierarchy are both shown as node-link diagrams with additional edges connecting the nodes of the graph with the leaf nodes of the hierarchy. In Figure 5.31(b), the same compound digraph is shown as a TimeRadarTrees diagram using a radial layout to embed the node-link diagram of the hierarchy in the sector-based visualization of the graph.


Figure 5.31: A compound digraph in two different visual metaphors: (a) node-link diagram; (b) TimeRadarTrees.

5.3.4 Visualization of Dynamic Compound Digraphs

We can now combine the features from Sections 5.3.2 and 5.3.3 for visualizing sequences of compound digraphs in a static picture.


Figure 5.32: A dynamic compound digraph in two different visual metaphors: (a) aligned node-link diagrams; (b) TimeRadarTrees.

Figure 5.32(a) shows such a sequence as aligned node-link diagrams, while Figure 5.32(b) combines the two approaches discussed above to integrate all information into a single TimeRadarTrees diagram. One can easily detect that the edge from node C to node B is present in all three digraphs of the sequence. Nodes D and E have no outgoing edges, a fact that can be observed by inspecting the thumbnails that correspond to nodes D and E. Also the incoming edges can be explored more precisely with the large circle. Here one can see that node E has no incoming edges in the first graph, and node D has no incoming edges in the third graph. This means that these nodes are disconnected, node E in the first graph and node D in the third graph. Node D is the only node that has more than one incoming edge in one graph of the sequence. It can easily be detected that node D has three incoming edges in the first graph by inspecting the further subdivision of the corresponding circle sector. The source nodes can be found by looking at the thumbnail representation of node D again. Two edges are pointing from node A to node D, and one edge starts at node B.

5.3.5 Visualization of the Graph Measure

Another way to extend the visualization is to consider directed graphs with edge weights such as the one shown in the node-link diagram in Figure 5.33(a). Instead of using numbers, we can encode weights by colors, both in the node-link diagram and in the TimeRadarTrees representation, see Figure 5.33(b).


Figure 5.33: A weighted digraph in two different visual metaphors: (a) node-link diagram; (b) TimeRadarTrees.

We use the term weight synonymously with the term measure, but measure is mathematically more correct in the context of this approach. Using the data model from Section 4, the measure of the graph edges is automatically given for each state of the information hierarchy. Finally, by combining all the features discussed above, we are able to visualize sequences of compound digraphs with edge weights in a single static TimeRadarTrees diagram.


5.3.6 Interactive Features

To fully exploit the technique, we provide several interactive features to manipulate the visualized data:

• Expanding and collapsing of subhierarchies: Clicking on a node that represents a collapsed subhierarchy expands this hierarchy again. The same is true for expanded subhierarchies: clicking on a parent node leads to a smoothly animated collapsing of this subtree. The involved edges are aggregated, and the color coding of the aggregated edge is adjusted.

• Time warp: Graphs that are laid out close to the circle center are displayed within a smaller area. This fact is expressed in the formula in Section 5.4.1. The currently visible graph sequence can be rotated so that graphs from the inner area can also be pulled near the circumference of the circle and, hence, be displayed within a larger area. We call this feature the time warp function. It may be noteworthy that the order of the graphs in the sequence stays the same even after rotating the graph sequence.

• Filtering measure values: The user can select minimum and maximum threshold values, which has the effect that only those edges with measure values within this interval are shown. All others are grayed out.

• Filtering nodes and graphs: For scalability reasons, a user can interactively filter out one or more nodes in the graph sequence. This leaves more display space for the remaining nodes. Also, the graph sequence can be too long to be visualized on screen. The tool provides the opportunity to select only a subset of graphs from the sequence. The order of the represented graphs stays the same.

• Sector highlighting: If the mouse is located over a circle sector, this sector is highlighted, and the corresponding sector in either the thumbnail view or the inner circle view is highlighted, too. This linking helps a user to understand the relations much better.

• Selecting a color scale: Color coding is used to better extract the quantitative information encoded in both the inner circle view and the thumbnail view. Predefined color scales can be applied to a digraph sequence dataset to better support a user in exploring the graph data. Some color scales may not be suited for a given dataset, and hence, the user has to select the most appropriate one.

• Details on demand: Also for TimeRadarTrees, we follow the visualization mantra and hence provide a details-on-demand function. The detailed information is displayed as a tooltip text and in a separate panel.

So far, we developed these interactive features, but many more are planned to follow in order to better support a viewer when exploring a dataset. The drawback of this visualization technique is definitely the problem of solving path-related tasks. An additional feature could compute all paths or the shortest one—the one with the lowest costs—between two selected nodes, if one exists. These paths could then be represented in a node-link metaphor on top of the TimeRadarTrees to accelerate the exploration process when a viewer is interested in path detection. But many application domains provide datasets with related artifacts where only the relations play an important role, not paths over an underlying sequence of those artifacts.

5.3.7 Application Domains

To illustrate the usefulness of the TimeRadarTrees technique, we apply it to datasets from very different application domains. In the following sections we work out the findings that we uncovered by using this novel visualization approach.

5.3.7.1 Soccer Match Results

The world's national soccer teams can be hierarchically organized by first dividing them into continents and then further subdividing them by the regions of these continents, for example, North, South, East, and West. The results of soccer matches that took place within a given time interval can be used to generate a graph in the following way: The number of goals µ({A}, {B}) of team A against team B is represented by a directed edge from A to B with the weight µ({A}, {B}), and an edge leading from B to A with the weight µ({B}, {A}), i.e., the number of goals that team B has scored against A in this match. It is important to distinguish between edges with weight 0 and non-existing edges. Looking at many of these subsequent graphs in a single view can provide important insights into the soccer playing quality of national teams over a longer time period.

Figure 5.34 shows a sequence of 14 compound digraphs generated for soccer matches between national teams of Central Europe and South America from 1992 to 2005. The data acquisition was performed by a program in the JAVA programming language that extracted the soccer results from a livescores, statistics, and soccer results webpage [69]. The first observation is that there were only a few matches between teams of Central Europe and South America. This fact can be found out by having a closer look at the thumbnail circles. The thumbnail circles in the lower part only have a few colored edges in their upper part, and the thumbnail circles in the upper part have only a few colored edges in their lower part. Only the teams of Germany and Brazil have played against each other a bit more frequently. This phenomenon can be explained by the fact that both teams participated in the World Cup Championships very often and additionally reached the semifinals.
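The construction of the weighted edges from a single match result can be sketched as follows; the class and method names are illustrative assumptions, and the example result is the Germany versus Liechtenstein match mentioned below.

import java.util.HashMap;
import java.util.Map;

// Sketch of how a single match result is turned into two weighted, directed edges:
// mu({A},{B}) goals of A against B and mu({B},{A}) goals of B against A.
// Edges with weight 0 are stored explicitly and are distinct from non-existing edges.
public class MatchGraph {
    record Edge(String from, String to) {}

    final Map<Edge, Integer> weights = new HashMap<>();

    void addResult(String teamA, String teamB, int goalsA, int goalsB) {
        weights.put(new Edge(teamA, teamB), goalsA);
        weights.put(new Edge(teamB, teamA), goalsB);
    }

    public static void main(String[] args) {
        MatchGraph graph = new MatchGraph();
        graph.addResult("Germany", "Liechtenstein", 9, 1);   // 9:1 on June 4th 1996
        System.out.println(graph.weights);
    }
}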


A second observation is that the teams of South America played more matches against each other than the teams of Central Europe in the same time period, which is revealed by the more densely colored circle sectors in the thumbnails. Color coding can be used to find out which teams scored many goals against other teams. Here we use the following color coding scheme:

• dark blue indicates 0 goals
• light blue indicates 1-2 goals
• green indicates 3-4 goals
• yellow indicates 5-8 goals and
• red indicates more than 8 goals.

Looking at Figure 5.34 again, one can easily find out that Germany has a red outgoing edge to the national team of Liechtenstein. A detail-on-demand request for that edge provides the information that Germany won the match 9:1 on June 4th 1996. Moreover, it can be seen that the team of Brazil scored very frequently, and not surprisingly, the team of Liechtenstein scored very infrequently.

TimeRadarTrees can also be used to just get an overview of the evolution of incoming edges. Especially for this dataset, this means which teams have a good and which have a bad defense. In Figure 5.34, it can be seen that the team of Liechtenstein conceded many goals. For the team of Austria, we see that the situation regarding conceded goals worsens after the year 2002.

5.3.7.2 Software Evolution

Next, we look at the evolutionary coupling of software artifacts like modules, files, classes, and methods. The strength of the evolutionary coupling of two artifacts is the number of times they have been changed together. With TimeRadarTrees, we can show in which time intervals two software components have been changed together very frequently and to what extent. The number of changed lines can be color coded and shows which software artifacts have been changed together very frequently and how many lines have been involved in these changes. Such a coupling across many hierarchy levels can be a hint for a bad software system design [193]. To completely understand these phenomena, we have to inspect the source code of the corresponding software artifacts.

The TimeRadarTrees visualization in Figure 5.35 represents the co-change of files in the org and the doc subdirectories of the JEDIT software system (jedit.org). The color of an edge indicates the size of the change in terms of the number of changed lines of code. Small changes are indicated by blue, bigger ones by green, and finally, red indicates a very large number of changed code lines. In the example at hand, we can see that the files TODO.txt and CHANGES.txt were changed together very frequently, but only small changes occurred.


Figure 5.34: A TimeRadarTrees representation of the soccer match results of national teams of South America and Central Europe from 1992 to 2005.


Many of the files in the org subdirectory were changed together, and the color indicates that several lines of code were involved. The thumbnails can be used to find out that these files were mainly changed in the same transactions as the TODO.txt and CHANGES.txt files of the doc subdirectory.

Figure 5.35: Commonly checked-in files of a part of the JEDIT open source project.

5.3.7.3 Co-author Graphs

The TimeRadarTrees technique enables us to explore the changing co-author list of a certain researcher in focus. Moreover, we can understand how the co-author relations between all researchers of that list evolve over a particular time period. Researchers who publish very frequently and, furthermore, researchers who have a large common co-author list with the researcher in focus can be identified very fast. The color coding can be used to examine the number of publications of collaborating authors and to compare this number with respect to all others.

Figure 5.36 shows the co-authors of PW in the years 2004 to 2008 in a TimeRadarTrees visualization that displays one circle slice per year. A red color indicates a high number of publications per year (6-10), a green color not as many per year (3-5), and a blue color very few per year (1-2). An additional feature in this representation is the average measure, which is indicated by the thick colored annulus between the inner big circle and the thumbnails. It gives an impression of the number of publications with respect to the number of publications of all other researchers in the displayed co-author list. A closer inspection of this additional information shows that TZ and AZ are the busiest publishers in the presented list of researchers aggregated over the shown time period. The same feature is given for each thumbnail. It visually encodes the relative number of joint publications of the two corresponding researchers. We can find out very fast that TZ and AZ published together very often, and the same holds for the authors SD and PW, too.

One can easily see that SD is the author who collaborated most with PW apart from the year 2007. In this year SD did not work with researchers from PW's co-author list, but there is one single work where he is the only author. This fact can be uncovered by inspecting the large circle sector at SD's thumbnail. The author with the most publications in this co-author list is definitely TZ, who mostly published with AZ as a co-author. We could speculate that TZ and AZ are professors or heads of a research group. LK and DN only collaborated with PW in one single paper in the years 2005 and 2008, and did not publish other works. From this observation, we could speculate that LK and DN are students.

5.4 A Comparison of the Techniques

A comparison of the three approaches described in Sections 5.1, 5.2, and 5.3 for visualizing sequences of transactions in information hierarchies and weighted dynamic compound digraphs is a hard task, though they are all based on the same underlying data model. Figure 5.37 shows some differences and commonalities of the techniques with respect to several visualization criteria.

The main difference lies in the visual encodings of the data. TimeArcTrees uses a node-link metaphor to show the information hierarchy as well as to represent each graph of the sequence. Timeline Trees is a space-filling Cartesian variant that avoids the conventional node-link representation to reduce the visual clutter caused by many edge crossings.


Figure 5.36: The co-authors of PW in the years 2004 to 2008 as a TimeRadarTrees representation.

TimeRadarTrees is the radial counterpart of Timeline Trees, and hence the relational data can be extracted by additional elementary perceptual tasks: curvature, slope, and angle of the corresponding visual graph edge encodings as circle sectors.

Node-link diagrams, as in TimeArcTrees, are very easy to explore with respect to path-related tasks. The reason for this is that the direct connectedness of objects is achieved by linking those objects by a curved line. In Timeline Trees and TimeRadarTrees we have a different kind of connectedness. Related objects can be uncovered by solving several special elementary perceptual tasks. This is definitely a weaker design for the expression of connectedness, but it surely reduces visual clutter, which is the major drawback of node-link diagrams. To examine connected objects, the Timeline Trees and TimeRadarTrees approaches use two visual representatives for each edge—one in the timeline view and one in the thumbnail view, which makes path-related tasks more difficult to solve.


TAT = TimeArcTrees, TLT = Timeline Trees, TRT = TimeRadarTrees

                                   TAT                      TLT                          TRT
Inclusion edge metaphor            node-link                node-link                    radial node-link,
                                                                                         radial layered icicle
Adjacency edge metaphor            node-link                rectangular boxes            circle sectors
Representative graph elements      nodes: once,             nodes: once,                 nodes: once,
                                   edges: once              edges: twice                 edges: twice
Visual clutter reduction           medium                   high                         high
Path-related tasks                 yes                      difficult                    difficult
Extraction of relational data by   direct connectedness     position in non-aligned      position in non-aligned
                                                            scales, shape, area,         scales, shape, area,
                                                            color, context               curvature, slope, angle,
                                                                                         color, context

Figure 5.37: A comparison of TimeArcTrees, Timeline Trees, and TimeRadarTrees with respect to their visual encodings and their benefits and drawbacks at solving several graph visualization challenges.

This is actually the difference to matrix-based representations, where the nodes are represented twice, once in a row and once in a column. Another benefit of the space-filling techniques is the fact that edges are aligned on top of each other, and hence an exploration of trends and counter-trends in dynamic relational data is much easier. The novel visualization techniques are at their best when displaying dense dynamic graphs that would cause a lot of problems when using either aligned or animated node-link diagrams. Animation in node-link diagrams leads to sophisticated algorithms that have two goals: a nice layout of the graphs with respect to aesthetic criteria and the preservation of the mental map. Time-efficient algorithms are needed to tackle these problems. In Timeline Trees and TimeRadarTrees we are freed from mental map preservation, and hence we do not need sophisticated layout algorithms. But here the question arises whether a chaotic layout of the graph nodes is really appropriate, because we need some interesting points in the novel visualizations that preserve the context and may help to explore complex relations much faster. In the following section we will derive a formula to understand another drawback of TimeRadarTrees—scalability.


5.4.1 Scalability in TimeRadarTrees

The visual scalability of an information visualization tool is defined as the ability to display large amounts of data in an effective way. Eick [52] presents several factors that could affect visual scalability. Human perception, the visual metaphor, the display where the data is represented, algorithms, and computation are among these factors. Some research has also been done that focuses on the perceptual scalability of visualization [188]. In this section we want to analyze how much data can be visualized with the radial variant of our visualization tools, TimeRadarTrees.

Figure 5.38: Space for incoming edges depending on several parameters

Figure 5.38 shows the inner big circle of the TimeRadarTrees representation. In this circular area, we visualize the incoming edges of the graph sequence. We want to give an estimate of the space that remains for a single incoming edge depending on the following parameters:

• the side length a of the smallest square that contains the whole circle
• the number of leaf nodes n that are currently represented on the circle circumference
• the number of graphs k in the currently visualized graph sequence


• the number of incoming edges l_{i,j} that are adjacent to node j in graph i

If the side of the underlying square has length a, then the radius of the circle will be exactly a/2. The thickness of one annulus is then exactly a/(2k) because each annulus has the same thickness. Now we want to compute the area of one annulus, where the innermost is numbered 1 and the outermost k. Additionally, the area depends on the position of the graph in the sequence because inner graphs have less space than outer ones. We denote the area of the annulus representing graph i by A_{a,i}. The area for all incoming edges of node j in graph i and an underlying quadratic area with side length a is denoted by A_{a,i,j}. The annulus representing graph i then has the following area (using the formula for the area of a circle):

A_{a,i} = π · (ia/(2k))² − π · ((i−1)a/(2k))²          (5.1)
        = π · [ (ia/(2k))² − (ia/(2k) − a/(2k))² ]      (5.2)
        = π · ( 2 · ia²/(4k²) − a²/(4k²) )              (5.3)
        = (πa²/(4k²)) · (2i − 1)                        (5.4)

If we want to compute the area of one annulus sector, we have to divide the above formula by the number of nodes n and obtain

A_{a,i,j} = πa²(2i − 1) / (4k²n)

The area of one single incoming edge can further be computed by dividing A_{a,i,j} by the number of all incoming edges of node j in graph i, namely l_{i,j}. Finally, we arrive at the formula

A'_{a,i,j} = πa²(2i − 1) / (4k²n·l_{i,j})

The formula expresses that an increase of the number of graphs, the number of nodes, or the number of edges leads to a decrease of display space for a single edge.
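As a small sketch, the estimate can be evaluated directly; the helper below computes A'_{a,i,j} for given parameters and illustrates the quadratic influence of the number of graphs k (the names and the concrete numbers are only examples).

// Sketch of the display-space estimate derived above: the area that remains for a
// single incoming edge of node j in graph i, given the side length a of the square,
// k graphs, n nodes, and l incoming edges of that node in that graph.
public class TimeRadarSpace {
    static double edgeArea(double a, int k, int n, int i, int l) {
        return Math.PI * a * a * (2 * i - 1) / (4.0 * k * k * n * l);
    }

    public static void main(String[] args) {
        // Doubling the number of graphs k quarters the area available to an edge
        // of the innermost graph.
        System.out.println(edgeArea(800, 3, 5, 1, 3));
        System.out.println(edgeArea(800, 6, 5, 1, 3));
    }
}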

5.5 Conclusions

It may be noteworthy to say that these three novel representations focus on visualizing static hierarchies; nevertheless, minor changes in the hierarchy can simply be transformed to fit our data model. For instance, if an item moves in the hierarchy, it will be displayed in both positions and handled like two different items. Solving this problem belongs to future work.

Drawing dynamic graphs is at the heart of information visualization. A widespread visual metaphor for displaying this kind of data is the node-link diagram, which has as many benefits as drawbacks. In this section we focused on a novel visual metaphor for drawing dynamic graphs that does not make use of node-link diagrams to express relations between a number of objects. The novel idea benefits from the fact that it reduces visual clutter by encoding relations with several elementary perceptual tasks and not by direct connectedness as suggested by Gestalt theorists. To fully exploit the list of possible perceptual tasks, we use a radial representation of the data. Mappings of graph nodes to a circle circumference have one big advantage over Cartesian mappings: ambiguous edges cannot occur within the radial TimeRadarTrees technique, no matter which node order is currently applied. In the following chapter we show the usefulness of the novel technique by a comparative eyetracking and an online study.

“True genius resides in the capacity for evaluation of uncertain, hazardous, and conflicting information” — Winston Churchill

CHAPTER 6 A Comparative Evaluation of TLT and TRT

Many recently developed information visualization techniques are radial variants of originally Cartesian visualizations. Hierarchies, for example, are either represented in the traditional node-link tree layout, which is displayed in a top-down fashion or sometimes from left to right, or in a radial style. The same holds for other tree visualization approaches such as layered icicle plots or treemaps.

Radial variants are said to be less space-efficient but a hypothesis is that the structure of hierarchical data becomes clearer when a tree is mapped to a circular shape. Almost none of these radial variants have been evaluated with respect to their benefits or drawbacks over their original Cartesian visualizations. In this chapter, we compare the radial and the Cartesian variant of our visualization tools called Timeline Trees and TimeRadarTrees. Both approaches are based on the same kind of dataset and hence, are able to represent sequences of transactions in information hierarchies. To this end, we use both quantitative as well as qualitative evaluation methods including eyetracking [19].


6.1 Cartesian vs. Radial

Many visualization techniques have been developed, and many more will follow in the near future. With them, the number of radial representations grows at nearly the same rate. A radial visualization is a transformation of a visualization in a Cartesian coordinate system into one in a radial coordinate system. In the Cartesian coordinate system, the values on the horizontal axis (x-axis) are normally mapped to values on the vertical axis (y-axis). Cartesian coordinate systems describe vectors in terms of distance along each of the axes of the space. This is not true for a radial coordinate system, where the x-axis starts in the circle center and the y-axis is represented in a circular way around the circle, or vice versa.

Some basic visualization approaches have already been transformed into radial counterparts. A rose diagram is a radial variant of a bar chart. A pie chart is also a radial variant of a bar chart, but in this case the value is represented by the width instead of by the height of the bar. A star plot [29] is a radial variant of a parallel coordinates plot [97], see Figures 6.1 and 6.2; a small sketch of the underlying coordinate transformation follows after these figures.


Figure 6.1: Examples for Cartesian diagrams: (a) Bar charts with equal sized widths; (b) Bar charts with equal sized heights; (c) Parallel coordinates plot.


Figure 6.2: Examples for radial diagrams that show the same data as the Cartesian counterparts in Figure 6.1: (a) Rose diagram; (b) Pie chart; (c) Star plot.
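The transformation itself can be sketched in a few lines: the position of a value along the former x-axis is reinterpreted as an angle, and its height as a radius, before the polar point is converted back to screen coordinates. This is a minimal sketch under these assumptions; the mappings actually used for Figures 6.1 and 6.2 may differ.

// Sketch of the Cartesian-to-radial reinterpretation used, for example,
// when turning a bar chart into a rose diagram.
public class RadialTransform {
    // Value index i of m values -> angle of the sector center (radians).
    static double angle(int i, int m) {
        return 2.0 * Math.PI * (i + 0.5) / m;
    }

    // Polar (radius, angle) -> Cartesian screen coordinates around a center point.
    static double[] toScreen(double cx, double cy, double radius, double angle) {
        return new double[] { cx + radius * Math.cos(angle), cy - radius * Math.sin(angle) };
    }

    public static void main(String[] args) {
        double[] values = { 3, 1, 4, 1, 5 };
        for (int i = 0; i < values.length; i++) {
            double[] p = toScreen(200, 200, values[i] * 20, angle(i, values.length));
            System.out.printf("bar %d -> (%.1f, %.1f)%n", i, p[0], p[1]);
        }
    }
}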


Hierarchical data is also represented in several different ways. Node-link visualizations traditionally place the root of a tree at the top, in contrast to a tree in nature. Some applications even visualize trees where the root node is positioned on the left- or on the right-hand side. A file browser mainly uses orthogonal edges and indentation to show the directory structure. Not surprisingly, radial node-link visualizations have been developed [51], and the layered icicle technique has also been 'radialized', for example in Information Slices [3] or InterRing [185].


Figure 6.3: Examples for Cartesian diagrams for trees: (a) Node-link diagram of a tree; (b) Layered icicle.


Figure 6.4: Examples for radial diagrams that show the same data as the Cartesian counterparts in Figure 6.3: (a) Radial tree; (b) Radial layered icicle.

Figures 6.3 and 6.4 illustrate the transformations of hierarchical information visualization techniques from a Cartesian approach into a radial one. Trees are mainly represented in a radial style to better express the structure of the hierarchical data, not for reasons of space efficiency. Furthermore, several recently developed visualization techniques combine radial visualizations, e.g., hierarchical edge bundles [92] combine radial icicles with bundled node-link representations that show related hierarchical items, and Stargate [125] is an interactive visualization that uses radial icicles and parallel coordinates.

Looking at all these examples, the question arises what the effect of the radial transformation is on the readability and usability of the visualization. Radial visualizations are more difficult to implement, but in many cases they look more aesthetically appealing than their Cartesian counterparts. It still remains an open question whether they better support users in comprehending data and extracting knowledge. In the next sections, we present two empirical studies with the goal to compare the Cartesian Timeline Trees and the radial TimeRadarTrees variant. A more sophisticated study would be needed to compare more basic techniques such as those described in Figures 6.1 and 6.2 for quantitative and hypervariate data visualization or those in Figures 6.3 and 6.4 for relational or hierarchical data visualization. For example, several years ago, Cleveland and McGill [32] conducted a position-angle experiment, in which they compared the extraction of quantitative data from both pie charts and bar charts. Their results showed that Cartesian representations are mostly less error-prone than radial counterparts when analyzing quantitative data.

6.2 An Eyetracking Study

To attract participants to our study, we decided to use a dataset related to soccer. The reason for choosing this kind of data is that soccer is well known and easy to explain, even to persons who dislike it. Furthermore, it is a real and adequately representative dataset. A dataset that contains the number of ball contacts of players in a sequence of moves has all the features we need for an evaluation of the visualization tools. First of all, each soccer match can be organized hierarchically in the following manner: two teams participate in a match and form the first level of the hierarchy. A team may further be subdivided into team parts, normally the goalkeeper, the defense, the midfield, and the offense. These team parts in turn contain the individual players, see Figure 6.5. Each move in the match puts the participating players into a set, or a transaction in the context of this thesis. Participation in a move means that a player has one or more ball contacts until the opposing team wins possession of the ball. The number of ball contacts of each player in each move is recorded and serves as the measure value of each participating player in this transaction. As a special feature, two or more players of both teams can also be involved in the same move by a special kind of event. This holds, for example, if two players are ejected from the match simultaneously because of a red card, or are substituted within the same minute. In this way, the whole match can be modeled as a sequence of transactions with measure values. We base our experiment on a real dataset, which was manually recorded from a soccer match between the national teams of Germany and the Netherlands at the 1990 World Cup in Italy. It was the round of the last sixteen teams, which Germany won 2 to 1.
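Written down as plain data structures, this model looks as follows. The sketch is illustrative only: the team part assignments and the numbers of ball contacts are placeholders, not the manually recorded match data.

    # Hierarchy: match -> teams -> team parts -> players (the leaf nodes).
    hierarchy = {
        "Germany": {
            "goalkeeper": ["Illgner"],
            "defense": ["Kohler"],
            "midfield": ["Matthaeus"],
            "offense": ["Voeller", "Klinsmann"],
        },
        "Netherlands": {
            "goalkeeper": ["van Breukelen"],
            "defense": ["Koeman"],
            "midfield": ["Rijkaard"],
            "offense": ["van Basten", "Gullit"],
        },
    }

    # One transaction per move: the participating players together with
    # their number of ball contacts (the measure value). Values are made up.
    transactions = [
        {"Matthaeus": 3, "Voeller": 1},
        {"van Basten": 2, "Koeman": 4, "Rijkaard": 1},
        {"Klinsmann": 2, "Matthaeus": 1},
    ]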


Figure 6.5: The participating players in a soccer match can be hierarchically organized into teams and further into team parts.

6.2.1 The Participants

The experiment was performed by 35 subjects, all of them students from our university. The participants, 18 males and 17 females, were randomly split into two groups: group TLT with 17 subjects, among them 9 males and 8 females, and group TRT with 18 subjects, among them 9 males and 9 females. The subjects participated voluntarily in the evaluation experiment. As a precondition for participation, the students had to fill in a short questionnaire before the start of the experiment. We asked them about their mathematical background, their video gaming skills, and their soccer interests. As can be seen in Table 6.1, groups TLT and TRT were relatively balanced with respect to their soccer interests, but group TRT had slightly better mathematical skills and group TLT slightly more experience in video gaming. It may be noteworthy that we do not weight these criteria very highly, because the numbers only reflect each subject's self-assessment.


Table 6.1: Participants that performed the eyetracking study.

                                                   TLT     TRT
Participants
  - total                                           17      18
  - male                                             9       9
  - female                                           8       9
Mathematical skills (1 very good, 6 very bad)
  - in school (average)                            2.76    2.47
  - estimated current skills (average)             3.35    3.11
Soccer interests
  - not at all                                       4       5
  - some                                             8      10
  - very interested                                  4       3
  - plays soccer                                     2       2
3D-game playing (hours/week)                       0.82    0.67

6.2.2 Experiment Setup

The actual experiment started when the participants had finished filling in the initial questionnaire. In a second step, they were asked to read a printed tutorial text with a detailed description of one of the visualization tools (either TRT or TLT). To check whether the subjects understood the visualization techniques, we asked them some initial questions after reading the tutorial text. We gave them 10 minutes for reading the tutorial text, which is quite a short period for understanding a novel visualization technique. The actual experiment took 15 minutes and was performed with an eyetracking system (Tobii x50) that uses corneal reflection of infrared light to locate the position and movement of the eye. The tracking characteristics of the Tobii x50 are very similar to those of the Tobii 1750. However, unlike the Tobii 1750, the Tobii x50 is not integrated into a monitor, which makes it more conspicuous, and it requires a hardware calibration each time it is moved. We presented the questions and visualizations on a computer screen, and two cameras mounted on the screen recorded the eye movements at a frequency of 50 Hz, i.e., an image is taken every 20 ms. The visual representations were single screenshots of the radial TRT and the Cartesian TLT tool, which contained between 13 and 157 transactions and between 8 and 22 leaf nodes. Interactive features were not available during the experiment.


Table 6.2: T-test analysis.

Number of correct answers
                               Mean (TLT)   Mean (TRT)   T-Value   P-Value (2-sided)
  - all 16 questions                11.83        11.06    -1.089               0.284
  - counting questions               4.35         3.17    -3.436               0.002
  - correlation questions            3.65         4.17     1.037               0.520

Response time for correct answers
                               Mean (TLT)   Mean (TRT)   T-Value   P-Value (2-sided)
  - all 16 questions                16.69        21.55     3.060               0.004
  - counting questions              17.40        23.07     2.511               0.017
  - correlation questions           21.74        23.23     0.840               0.407
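The comparisons reported in Table 6.2 (and later in Table 6.4) are ordinary two-sample t-tests whose p-values are checked against a Bonferroni-Holm correction for multiple testing. The following sketch illustrates such an analysis under the assumption that SciPy is available; the score lists are made-up placeholders, not the recorded data, and the code is not the original analysis script.

    from scipy.stats import ttest_ind

    def holm_bonferroni(p_values, alpha=0.05):
        """Return, for each p-value, whether it remains significant
        after the Holm-Bonferroni correction for multiple testing."""
        order = sorted(range(len(p_values)), key=lambda i: p_values[i])
        significant = [False] * len(p_values)
        m = len(p_values)
        for rank, i in enumerate(order):
            if p_values[i] <= alpha / (m - rank):
                significant[i] = True
            else:
                break  # once one test fails, all larger p-values fail too
        return significant

    tlt_scores = [12, 11, 13, 10, 12, 11]   # hypothetical correct answers
    trt_scores = [10, 11, 9, 12, 10, 11]
    t_value, p_value = ttest_ind(tlt_scores, trt_scores)
    print(t_value, p_value, holm_bonferroni([p_value, 0.002, 0.520]))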

For the analysis of the recorded eyetracking data we used heatmap visualizations. To produce the heatmaps, the points of fixation of several test persons were combined. A fixation was registered by the system when a test person gazed at an area of 30 pixels radius for at least 100 ms.
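This fixation rule, i.e., the gaze staying within a 30 pixel radius for at least 100 ms (at least five consecutive samples at 50 Hz), corresponds to a simple dispersion-based filter. The following sketch only illustrates the criterion on a list of (x, y) gaze samples; it is not the filter built into the eyetracking software.

    import math

    SAMPLE_INTERVAL_MS = 20   # 50 Hz eyetracker: one sample every 20 ms
    MIN_DURATION_MS = 100     # minimum duration of a fixation
    MAX_RADIUS_PX = 30        # maximal distance from the fixation center

    def detect_fixations(gaze):
        """gaze: list of (x, y) screen positions, one sample per 20 ms.
        Returns (center_x, center_y, duration_ms) tuples."""
        fixations, start = [], 0
        while start < len(gaze):
            end = start + 1
            # grow the window while all samples stay within the radius
            while end < len(gaze):
                window = gaze[start:end + 1]
                cx = sum(x for x, _ in window) / len(window)
                cy = sum(y for _, y in window) / len(window)
                if all(math.hypot(x - cx, y - cy) <= MAX_RADIUS_PX
                       for x, y in window):
                    end += 1
                else:
                    break
            duration = (end - start) * SAMPLE_INTERVAL_MS
            if duration >= MIN_DURATION_MS:
                window = gaze[start:end]
                cx = sum(x for x, _ in window) / len(window)
                cy = sum(y for _, y in window) / len(window)
                fixations.append((cx, cy, duration))
            start = end
        return fixations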

6.2.3 Results

In this experiment the participants had to answer 18 questions. The last two were open questions, while the first 16 questions had clearly defined correct answers. These 16 questions and the overall results are shown in Figure 6.6. They can be grouped into three categories: warm-up questions, counting questions, and correlation questions.

Figure 6.6: Correctness of answers for both groups.


• Warm-up questions: These questions were answered correctly by more than 90 percent, sometimes even 100 percent, of the participants.

• Counting questions: This type of question focuses on counting and summing items in different scenarios. As shown in Table 6.2, TLT outperformed TRT with respect to the correctness of answers as well as with respect to the response time for correct answers. Moreover, these two results are statistically significant.¹ By examining the heatmaps we found that the participants did not use the thumbnails when answering these questions. This was expected, because the main purpose of the thumbnails is the detection of relationships.

• Correlation questions: For correlation questions, which ask about relations between items, the participants could answer more questions correctly when using TRT, as shown in Table 6.2. Unfortunately, this result is not statistically significant. After examining the heatmaps of the correlation questions we found that the participants using TRT looked at the thumbnails more intensively than those using TLT. For example, Figure 6.7 and Figure 6.8 show the heatmaps for "Which player played most often with Marco van Basten?" In the TLT heatmap, one can easily see that there was almost no fixation on the thumbnails, whereas in the TRT heatmap there was a strong fixation on the thumbnail that represents the player Marco van Basten and another one on the thumbnail that represents the player Ronald Koeman, the correct answer to this question. When looking at the heatmaps of those participants using TLT who answered correlation questions incorrectly, we often found that they did not make much use of the thumbnails.

• Open questions: All of the previous questions could be answered automatically with relatively simple database queries and no visualization at all. We think that the most important contribution of visualization tools is the exploration of large datasets, where we do not know in advance what to look for. To this end, we also showed the participants the visualizations and asked the very general question "Can you detect any trends or anomalies?" In both groups, a test person mentioned about 4.3 observations on average, but the observations varied between the two groups. For example, 14 participants using TRT found that two players (Rudi Völler and Frank Rijkaard) only took part in moves at the beginning of the visualized time period, but only 6 participants using TLT detected this anomaly. Both players were actually ejected from the match at the same time. Looking at the heatmaps shown in Figure 6.9, we realized that the participants using TLT did not inspect the periphery of the visualization, i.e., they did not fixate any of the four corners of the computer screen. Figure 6.9 also shows that, due to its radial layout, this "blinders effect" did not occur for TRT.

¹ In the tables we have set the error probability of all statistically significant results (p < 0.05 with the Bonferroni-Holm correction) in bold face.


Figure 6.7: Heatmap for TRT (correlation question): Which player played most often with Marco van Basten?


6.2.4 Threats to Validity

There are various factors that limit the validity of the results of these kinds of studies. These include, for example, the choice of the dataset, the choice of the questions, and the size of the dataset for each question. Furthermore, while the eyetracker used is not very distracting (compared, for example, to a head-mounted one), it still requires the user not to move his or her head. Finally, TRT does not exploit one of the alleged advantages of radial displays, namely the possibility to put detailed information in the center and context information in the periphery. Thus, we could not evaluate this feature.

6.3 An Online Study

While 35 participants is a relatively high number for an eyetracking study, it is too small to achieve statistically significant results for smaller effects. As a consequence, we designed an additional web-based experiment and invited people by email.


Figure 6.8: Heatmap for TLT (correlation question): Which player played most often with Marco van Basten?

6.3.1 The Participants

The population that performed the evaluation consisted of 215 web users. As before, the subjects were split into Group A (128 subjects) and Group B (87 subjects). Group A had to answer questions related to TLT visualizations, while Group B had to answer questions related to TRT visualizations. All subjects participated voluntarily in the evaluation.


Figure 6.9: The heatmaps for the open question show some differences: Can you detect any trends or anomalies? (a) Heatmap for TRT; (b) Heatmap for TLT.


6.3.2 Experiment Setup

Before the actual experiment, the participants were asked to read a PDF version of the tutorial text for one of the visualization techniques (either TRT or TLT). At the end of the tutorial text there were some initial questions to check whether they had understood how to read the visualizations. Next, they were asked to fill in a questionnaire about their mathematical background, their soccer interests, and their 3D video gaming skills. The experiment was divided into two parts. In the first part, we asked five questions related to images of the visualizations. The second part consisted of five questions related to interactive features of the visualizations. Interaction was provided by means of a Java applet. The applet was instrumented to record most interactions and also the time between starting and stopping the applet, i.e., the time it took the user to answer a question.

Table 6.3: Online experiment: Participants.

                                              TLT     TRT
Participants
  - total                                     128      87
  - male                                      100      74
  - female                                     23      10
  - unknown                                     5       3
Soccer interests
  - not at all                                  4       2
  - some                                       19      15
  - very interested                            41      32
  - plays soccer                               59      35
  - unknown                                     5       3
Mathematical skills (1 good, 6 very bad)
  - in school (average)                       2.10    2.13
  - estimated current skills (average)        2.43    2.39
3D-game playing
  - none                                       65      37
  - one hour/week                              20      13
  - more than one hour/week                    38      28
  - unknown                                     5       9


Figure 6.10: Aligned bar charts show the results of the comparative online experiment.

6.3.3 Results

The overall results are shown in Figure 6.10. Both for the questions related to the images and for those related to the interactive visualization, participants using TLT could answer the questions much better. The average number of correct answers per person was 6.59 for TLT and only 4.53 for TRT. This result is highly significant. So the question arises whether TRT had any advantage over TLT in this experiment. As can be seen in Figure 6.11, the average response times for both tools are very similar, except for one question: "Which player participated in the move with the most ball contacts?" For this question the test persons using TLT took nearly 50 percent more time than those using TRT, another statistically significant result, as can be seen in Table 6.4. To answer this question, the participants not only had to count how many boxes or sectors were colored blue, they also had to take into account the saturation of the color-coded boxes or circle sectors.

Figure 6.11: Online experiment: Average response time for correct answers.

As before, some of the questions could be categorized as counting questions and others as correlation questions. In the previous study, we had some evidence that TRT would be better for correlation questions and that the thumbnails were very useful in this case.


Table 6.4: Online experiment: T-test analysis.

                               Mean (TLT)   Mean (TRT)   T-Value   P-Value (2-sided)
Number of correct answers            6.59         4.53      4.81            0.000001
Response time
  - all answers                        79           63      1.46                0.15
  - all correct answers                57           54      0.22                0.82
  - question 9                        154           92      2.21                0.03

In the online study, we could not confirm this hypothesis. We think that one reason for this is that we used a very low resolution: the visualization covered only 750 by 500 pixels of the evaluation web page, while the rest was used for the question and the possible answers. In contrast, in the eyetracking study the visualizations used the full screen (1280 by 1024). As a result, in the thumbnails of the online experiment most sectors or boxes were represented by single pixels, whereas in the thumbnails of the eyetracking experiment they were represented by more pixels.

6.4 Conclusions

While the overall performance of the participants using TLT was better than the performance of those using TRT, the interpretation and thus the effective use of the thumbnails worked better in TRT. One reason for this might be that it is easier to distinguish and remember locations in the radial layout. Another reason may be the longer list of elementary perceptual tasks that can be used to uncover the connectedness of several objects. In the Cartesian TLT approach these were position on non-aligned scales, color, and context; in the radial approach a viewer can additionally perceive this connectedness by slope, angle, shape, area, and curvature. Radial visualizations are fancy, and for some tasks they may even be superior to their Cartesian counterparts. In our empirical study, however, the radial visualization could not keep up with the Cartesian one. Although TLT outperformed TRT overall, there is still some hope: the eyetracking experiment showed that the radial visualization did not lead to the 'blinders effect' and that the radial thumbnails were more useful than the Cartesian ones. Furthermore, the study confirmed that TimeRadarTrees can be understood in a few minutes even by laymen, i.e., people without experience in graph visualization.


The study presented in this chapter should only be considered a first step towards answering our initial question of whether radial visualizations better support users in comprehending data and extracting knowledge. Before we can understand complex visual representations of dynamic relational data in either Cartesian or radial visual metaphors, we first have to understand the basic visual elements used in these complex visualization approaches.

“There is no particular evidence that any of the lower mammals or any of the other animals have any interest in aesthetics at all. But Homo sapiens does, always has, and always will.” — Jock Sturges

CHAPTER 7 The Aesthetics of Dynamic Graph Visualization

When we talk about aesthetics we always refer to the term beauty. But what is the definition of beauty? It cannot be expressed in terms of a single real-numbered value. Aesthetics depends on the human senses and is known as the study of sensory or sensori-emotional values, also called judgments of sentiment and taste [189]. Researchers in that field define aesthetics as


“critical reflection on art, culture, and nature”. When we talk about beauty we also have to talk about the opposite term: ugliness. To measure beauty we may also measure ugliness. In some situations this may be a good strategy, but it will not by itself lead to a solution of the given problem, and this is not only a phenomenon of modern times: even in ancient times philosophers studied our judgments about beauty and ugliness. In the scope of this thesis we apply the term aesthetics to visual representations of graphs, which is a much simpler task than measuring the aesthetic appeal of arbitrary images. Much research has been done on studying static graphs in a node-link metaphor [12].


But only little effort has been invested in examining alternative visual metaphors apart from node-link diagrams with respect to aesthetic criteria. Furthermore, little progress is recognizable in the field of dynamic graph visualization. In the field of graph visualization, we denote a representation of a graph as aesthetically appealing if it satisfies a list of aesthetic criteria with respect to readability and understanding. For each single criterion we can examine whether it applies to a specific visual encoding metaphor and layout of a graph. We could iterate over the list of criteria like a checklist and come to the conclusion whether a graph fulfills most of the criteria given in that list or not. After this check, the design of the graph visualization layout algorithm could be improved by following the design principles given by an aesthetic dimensions framework for graph visualization [7]. In this thesis we present a different, space-filling and radial visual metaphor for dynamic graphs in a single view, which comes with a slightly different list of aesthetic criteria than the one for static node-link diagrams. For this reason, the list of aesthetic criteria for the node-link metaphor has to be extended with respect to different graph visualization paradigms. In this chapter we discuss aspects that may explain why one visual encoding can be considered more aesthetic than another. In the following we will examine node-link graph diagrams as well as more space-filling approaches for both static and dynamic graph data.

7.1 Aesthetics for Node-Link Metaphors

Node-link diagrams are the most common technique in graph drawing and graph visualization. Their prevalence over other graph visualization metaphors can be explained by the fact that related objects are very easy to recognize when they are directly connected by straight or curved lines. Gestalt theorists evaluated other perceptual principles such as proximity and similarity and came to the conclusion that direct connectedness is perceived best.

7.1.1 Static Graphs

Static node-link diagrams are at their best when they are used to display static graph data. But even a static graph structure that does not change over time can require very complicated and time-consuming algorithms when researchers look for sophisticated layouts with respect to several aesthetic criteria. For non-planar graphs we have to accept at least one edge crossing in the final layout, and in general a flood of edge crossings cannot be avoided even in an optimally laid out node-link diagram. Edge crossings are the worst drawback of node-link diagrams and should be avoided whenever possible, because a certain number of crossings can lead to visual clutter and confuse a user when interpreting a graph.


To tackle this problem, many researchers from the graph drawing field develop sophisticated algorithms that fulfill a list of aesthetic criteria. Making a static graph aesthetically appealing has to be understood as making a graph readable, which means communicating its underlying meaning and structure to a viewer. Aesthetics in graph visualization is less about beauty and more about readability, which implies understanding. Visualizing a graph in an aesthetically appealing manner is a major problem for dense graphs, to which trees do not belong. Trees are a special graph type that belongs to the class of planar graphs and is said to be sparse. Tree layouts are also computed with respect to a list of aesthetic criteria that aim at representing the hierarchical structure more clearly. Section 2.3.1 gives a good overview of tree diagrams in a node-link style. For arbitrary graphs, it is a big challenge to present them in a readable layout. A graph visualization may carry several pieces of information, among them the
• visual encoding of nodes (shape),
• placement of nodes,
• visual encoding of edges (shape),
• visual encoding of weights for both nodes and edges (color), and
• node labels.
Nodes should be evenly distributed over the display area, which results in a more ordered appearance and hence in a higher aesthetic appeal [41, 42, 154]. Visual graph elements should not overlap, and nodes should keep a certain distance from each other and even from drawn edges. This avoids spatial aliases, which is a very important design strategy for graph clustering. Trade-offs are a typical phenomenon when representing graphs nicely; for example, clustering nodes stands in contrast to an even node distribution. Edges should also be represented with respect to several aesthetic criteria. The most important edge placement heuristic for graph visualization is certainly the minimization of edge crossings [135]. Crossings can lead to a cluttered display and hence can confuse a user enormously when exploring graph data. The minimization of edge bends is another interesting heuristic. We followed these heuristics in our Trees in a Treemap approach in Section 3.3.2.4, where we found a trade-off between the two: reducing the number of edge crossings leads to longer edges and a higher number of edge bends in general. The display area for a graph should be as small as possible; hence a layout algorithm should produce short edges with uniform edge lengths, which leads to a more regular graph representation. A further problem appears where edges are connected to their corresponding nodes: many adjacent edges at a node should be placed at equal angles around this node,


and the enclosed angles between edges should be maximized. Moreover, edges should be laid out orthogonally, which means that they intersect at 90 degree angles exclusively. Other very important aspects when producing aesthetically appealing graphs are symmetry and the aspect ratio of the edges. These heuristics are very common for tree layouts, where substructures can be uncovered by inspecting the symmetries in the tree representation. The aforementioned aesthetic heuristics apply to static graphs in a node-link style and, to some extent, also to dynamic graphs. In the next section we describe an existing visualization paradigm for the representation of dynamic graphs: animation.
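Before turning to dynamic graphs, note that the crossing criterion discussed above can easily be made operational. The sketch below counts the pairwise crossings of straight-line edges in a given layout; it is an illustrative metric under the assumption of general position (no edge endpoint lies exactly on another edge), not one of the layout algorithms discussed in this thesis.

    def orientation(p, q, r):
        """Sign of the cross product (q - p) x (r - p)."""
        val = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
        return (val > 0) - (val < 0)

    def segments_cross(a, b, c, d):
        """True if segments a-b and c-d properly cross
        (general position assumed)."""
        return (orientation(a, b, c) != orientation(a, b, d) and
                orientation(c, d, a) != orientation(c, d, b))

    def count_edge_crossings(positions, edges):
        """positions: node -> (x, y); edges: list of (u, v) pairs."""
        crossings = 0
        for i in range(len(edges)):
            for j in range(i + 1, len(edges)):
                u1, v1 = edges[i]
                u2, v2 = edges[j]
                if {u1, v1} & {u2, v2}:
                    continue  # edges sharing a node cannot properly cross
                if segments_cross(positions[u1], positions[v1],
                                  positions[u2], positions[v2]):
                    crossings += 1
        return crossings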

7.1.2 Dynamic Graphs

Relational data that has a temporal component can be modeled by dynamic graphs, which can in turn be modeled by sequences of static graphs. Visualizing dynamic graphs in a node-link metaphor as static images with a maximum of visible information leads to a horrible waste of display space, and drawing the edges of all graphs in the sequence as one aligned node-link diagram leads to scalability problems and additional visual clutter. In short, an appropriate visualization technique for dynamic graphs in a node-link metaphor may be an animated sequence of the graphs. Several researchers have tried to develop algorithms for drawing sequences of graphs. The naive approach would be to apply algorithms for static graph layout to each of the graphs in the sequence, taking only the previous graphs in the sequence into account. This idea is denoted as the online approach, which suffers from a decisive drawback: several small changes in the graph sequence could accumulate to a large change that causes the online drawing algorithm to compute a completely new layout of the graph in order to reduce visual clutter. This is exactly the point where an additional aesthetic criterion comes into play whenever animation is used. The criterion has been coined by Misue et al. as 'preserving the mental map' [120], or, within the context of graph drawing, as 'dynamic stability'. Changes between subsequent graphs in the sequence should be as small as possible to minimize the cognitive effort for a viewer when comparing graphs. By the term 'mental map' we refer to the fact that a viewer of a graph unconsciously memorizes its abstract structural information within a split second. Three models for preserving the mental map in an animated graph when subparts of a graph change between two subsequent graphs are given in the work of Misue et al. [120]:
• The order of all nodes in the graph should be kept up, both in the horizontal and in the vertical direction.


• The distance between the nodes should be preserved.
• The dual graph should stay the same.
Diehl and Görg [44] implemented an offline graph drawing algorithm for dynamic graphs that uses smooth transitions between subsequent graphs to further support the viewer in preserving his mental map. By the term 'offline' they refer to the fact that all graphs in the sequence are considered by the algorithm in advance, which stands in contrast to 'online' approaches, where only the graphs before the currently shown one are taken into account. Apart from preserving one's mental map, animated graph visualizations have two major drawbacks. The first is the sometimes very long running time of the layout algorithms, be they offline or online. The second, and in the scope of this thesis more interesting, problem is the fact that a graph analyst does not know where to look first in an animated graph sequence. He has to let the animation run several times until he gets familiar with the dataset and can gain insights from the graph data. Static representations of dynamic graphs do not have to tackle these problems, because the viewer gets an overview of the graph sequence and can decide by himself where to look. In the next section we will look at graph visualization from a different perspective, not by means of node-link diagrams, but by means of space-filling techniques that do not use direct connections in the form of straight or curved lines between related objects.
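As an aside, the first of the three models listed above, the preservation of the horizontal and vertical node order, can be checked mechanically for two consecutive layouts. A small sketch, assuming both layouts assign (x, y) positions to the same set of nodes:

    from itertools import combinations

    def order_preserved(layout_a, layout_b):
        """Orthogonal ordering model: the relative left/right and
        above/below order of every node pair must be the same in both
        layouts (dicts mapping node -> (x, y))."""
        def sign(a, b):
            return (a > b) - (a < b)
        for u, v in combinations(layout_a, 2):
            if sign(layout_a[u][0], layout_a[v][0]) != sign(layout_b[u][0], layout_b[v][0]):
                return False  # horizontal order of u and v changed
            if sign(layout_a[u][1], layout_a[v][1]) != sign(layout_b[u][1], layout_b[v][1]):
                return False  # vertical order of u and v changed
        return True

    before = {"a": (0, 0), "b": (2, 1), "c": (4, 0)}
    after = {"a": (0, 0), "b": (3, 2), "c": (5, 0)}
    print(order_preserved(before, after))  # True: the order is unchanged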

7.2 Aesthetics for Space-Filling Metaphors

The benefit of drawing graphs in a space-filling style is an uncluttered display for dense graphs. Node-link diagrams are preferable for sparse graphs such as trees. Graph sequences of both sparse and dense graphs are more readable when space-filling graph visualizations are used.
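The distinction between sparse and dense graphs can be quantified by the graph density, i.e., the fraction of possible edges that are actually present. A small helper, written here for illustration only:

    def graph_density(num_nodes, num_edges, directed=True):
        """Fraction of all possible edges that are actually present."""
        possible = num_nodes * (num_nodes - 1)
        if not directed:
            possible //= 2
        return num_edges / possible if possible else 0.0

    # A tree on 100 nodes is sparse; a graph with most node pairs
    # connected is dense and better served by a space-filling metaphor.
    print(graph_density(100, 99, directed=False))    # ~0.02
    print(graph_density(100, 4000, directed=False))  # ~0.81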

7.2.1 Static Graphs

Much research has been done on visualizing graphs as matrix-based representations [54, 59, 86, 87, 88, 143]. Many researchers point out that matrices are difficult to interpret when solving path-related tasks. The difficulty can be explained by the existence of two representative elements for each node, one vertical and one horizontal. The viewer has to jump back and forth between rows and columns to follow a path with the eye, a task that can only be solved with high cognitive effort. Furthermore, because of the pixel-based representation, it is very easy to make a mistake in node identification at a row or a column and hence come to a wrong conclusion.


Several matrix-based visualization tools try to tackle this problem by providing an additional node-link representation on top of the matrix visualization [87]. In a different work, the researchers extend their work on matrix visualizations by differentiating between dense subparts that they display as a matrix and sparse subparts that are shown as node-link diagrams [88]. A slightly different idea for displaying graph data is the so-called list representation. A list of related vertices is shown for each vertex and may even be color coded. Each node is encoded by two kinds of representative elements: once as the vertex for which a list is shown and once as a member of the lists of its adjacent vertices. It goes without saying that list representations suffer from the same drawback as matrix-based graph visualizations when solving path-related tasks. List-based representations are even worse: when displaying them in a pixel-based visualization, the corresponding nodes cannot be uncovered unless we use color coding for the node labels, which is a daunting task. Those representations have just one benefit: nodes with very many incident edges can be spotted very quickly, because extracting quantitative data from positions on aligned scales has proved to be very effective [32]. Aesthetic criteria for static node-link based graph visualizations can also apply to matrix or list representations. A matrix representation makes use of color-coded pixels to encode relationships between several entities. This approach can be classified as a space-optimal visualization solution for dense weighted graphs. Some criteria that hold for node-link representations do not apply to a matrix approach directly; edge crossings, for example, cannot be present in matrix metaphors, because the encodings of edges do not overlap. In the following list we describe four general aesthetic criteria that apply to static graph visualizations.

• Reduce visual clutter: Visual clutter is the state in which excess visual elements or their disorganization lead to a degradation of performance at some task [141]. For dense graphs, visual clutter increases in node-link diagrams because of many edge crossings. Matrix-based representations are suited for dense graphs, but their major drawback is the problem of solving path-related tasks.

• Reduce spatial aliases: Several elements might be placed in a way that makes them easy to mistake for one another. If visual elements are put too close to each other, as is typically the case for visualizations in a matrix metaphor, one might have problems mapping visual elements to the correct relations and hence may misinterpret the graph. Spatial aliases in node-link diagrams occur when edges intersect at small angles.

• Spatial matching of multiple representatives: One object may be represented more than once in the visual encoding. The representative elements have to be matched again when reading the visualization, which can lead to misinterpretations. Matrix representations encode each graph node twice, once in a row and once in a column. The TimeRadarTrees representation encodes edges twice, as circle sectors in the big context circle and in the smaller circles that are also called thumbnails. The drawback of multiple representatives is the difficulty of following paths.


• Maximize compactness: A compact graph representation uses space (and time) efficiently for encoding the graph data. The pixel-based matrix representation can be classified as a compact visualization for dense graphs. Node-link diagrams typically encode edges with more than one pixel and hence should not be classified as a compact visualization of graph data.

Displaying dynamic data in a static picture can help to better detect trends or counter-trends in the time-varying data, though it mainly suffers from scalability problems. In the next section we examine scalability for dynamic graphs as well as several aesthetic criteria for dynamic graph visualizations in a static image.

7.2.2 Dynamic Graphs

Scalability targets the question of whether a visualization is able to handle a growing dataset while keeping the larger dataset readable. Dynamic graphs can grow in three dimensions: the number of nodes, the number of edges, and the number of graphs. Hence their aesthetic scalability has to be discussed for each of these dimensions separately.

• Scalability in the number of vertices: The readability of the visualization should be preserved for a higher number of vertices. Increasing the number of vertices could lead to an increase in the number of graph edges, too. To discuss the scalability in the number of vertices, we should assume that the graph density stays at a constant level.

• Scalability in the number of edges: The readability of the visualization should be preserved for a higher number of edges. An increase in the number of edges leads to a higher density of the graphs. This is in favour of matrix-based representations, where the required space stays the same, whereas node-link diagrams are not suited for dense graphs.

• Scalability in the number of graphs: The readability of the visualization should be preserved for a higher number of graphs. Only animations can keep up with an increasing number of graphs to some extent. For a long graph sequence the animation takes more time, and the user may have to watch the animated graphs several times until he finds interesting insights.

The visualization of dynamic graphs brings several additional aesthetic criteria into play. Trends and counter-trends are typical phenomena that should be easy to explore by means of time-based visualizations. Especially for graph data, a user should be able to detect the evolution of edge weights, vanishing or re-appearing edges, or temporal patterns.


We present three dynamic aesthetic criteria for graph visualization.

• Preserving the mental map: The term mental map refers to the abstract structural information a user forms by looking at the layout of a graph [120]. The mental map facilitates navigation in the graph and the comparison to other graphs. In dynamic graph drawing, changes to this map should be as small as possible.

• Reducing the cognitive load: The cognitive load refers to the amount of information the user has to keep in his working memory to read the visualization. To understand what information is presented by an animation, or by a sequence or aligned visualization of a graph sequence, the viewer has to keep a part of the presented information in his working memory. A visualization is useless if the way information is presented exceeds the capacity of the working memory; it may demand so much attention that the working memory cannot be refreshed for new tasks. In animations we have to remember what happened before and hence have to rely on the working memory, so the cognitive effort can be very high.

• Minimizing temporal aliases: Visual elements placed on a time axis might be mistaken for one another; these phenomena are called temporal aliases. Change detection in animated graphs is a difficult task when there is a missing correspondence between visual elements in subsequent graphs. The illusion of backward-spinning wagon wheels in Western movies illustrates that the mind may match the wrong visual elements. These confusing phenomena can also occur in static visualizations of dynamic graphs, not only in animations.

In this section we described aesthetic criteria for general dynamic graph drawing and explained the differences between animated and static space-filling representations with respect to their benefits and drawbacks.

7.3 Conclusions

Relations between a set of objects can be modeled by directed graphs. The visualization of those graphs can be guided by several visualization metaphors. Node-link diagrams are the most common technique for presenting a graph. Sophisticated algorithms have been developed with respect to their running time and their graph layouts. The visual appearance of a graph should be uncluttered and follow a list of aesthetic criteria to support a graph explorer in understanding the relational data. Though the algorithms may work well for small and sparse graphs, they may have limitations when the graphs become denser and more edges appear. Generally, an increase in the number of edges means an increase in edge crossings and a more cluttered display.


Hence, the visualization of graphs with respect to aesthetic criteria is a big challenge for graph drawing researchers, and sometimes there is a trade-off between several optimization criteria. Graphs can grow in three dimensions. An increase in the number of nodes generally means an increase in the number of edges. A third dimension occurs when we look at dynamic graphs, i.e., graphs that are not static but change over time. Typically, this kind of time-series relational data is displayed by animated sequences of graphs, which may lead to high cognitive effort for a viewer when not represented the right way. The term that comes into play is the 'mental map' that a user forms when looking at a visualization. The changes between subsequent graphs should be minimal to support the user in preserving his mental map. Static representations of dynamic data do not have to care about mental map preservation in the same way as animated representations, because the data is displayed in one view and the user can decide by himself where to look. Visualizing dynamic data in a static image may not strain the working memory as much as animated sequences do.


“My interest is in the future because I am going to spend the rest of my life there.” — Charles F. Kettering (1876-1958)

CHAPTER 8 Conclusion and Future Work

This work is a first attempt to develop a novel visual metaphor for dynamic graph visualization and hence presents a counterpart to traditional animated node-link diagrams. The focus of this work is to show the benefits and drawbacks of different visualization techniques for relational data, be it static visualizations of static relational data, dynamic visualizations of dynamic relational data, or, finally, static visualizations of dynamic relational data. Visualizing static relational data in a dynamic style by means of animation is not in the scope of this thesis. By the term animation we do not refer to the fact that a user may manipulate a visualized dataset by means of interactive features, but rather to the fact that a view changes its visual state by a predefined process.


The growing number of graph visualization tools can be explained by the many real-world examples from very different application domains that all have an underlying relational data structure. Changing file-developer relationships or evolutionary couplings from software systems under version control, network routing data, complex dependencies in sports, co-author networks, or traffic data are just a few examples out of a rich source. To examine the usefulness of our visualization approaches, we mainly focused on datasets from software systems under version control. Data from other application domains may show that a visualization technique is not merely domain-specific but can also be applied to datasets from very different fields of research.


A frequently occurring difficulty when testing the usefulness of a novel visualization approach beyond the borders of its typical application domain is to find a suitable dataset that can easily be accessed and parsed into a specific data format. Data protection and non-electronic data pose major problems for tapping a data source and hence for supporting the improvement of visualization tools.

8.1 Data Acquisition

Data sleeps unused in software archives under version control. We developed software tools to extract data about the evolutionary couplings of software artifacts and the behavior of several developers. Before applying the visualization tools to such a dataset, it has to be preprocessed and brought into the format required by a visualization tool. The advantage of configuration management systems is the fact that the data is stored on a server and hence already exists in electronic form. We do not have to care about how to acquire the data, but rather about the format into which the data has to be transformed. Acquiring data that is not under version control and not even in electronic form was a time-consuming task. For example, the data produced by a soccer match had to be recorded manually, and very small time periods of that match had to be watched several times. The major problem when acquiring data manually is the fact that we cannot guarantee that the dataset is free from defects. For example, traffic data was extracted manually from a webpage over a period of one day, and we have to rely on the traffic data on the web. If we want to explore datasets acquired over longer time periods, an extraction by hand is definitely a very time-consuming and error-prone solution. In the future we should implement a software tool that is able to connect to a webpage and store datasets in a predefined data format.
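Such a tool could be little more than a periodic fetch-and-append script. The sketch below is purely hypothetical: the URL, the parsing step, and the CSV layout are placeholders and do not describe an existing tool of this thesis.

    import csv
    import time
    import urllib.request

    SOURCE_URL = "http://example.org/traffic"   # placeholder data source
    OUTPUT_FILE = "traffic_log.csv"             # the predefined data format
    FETCH_INTERVAL_S = 3600                     # fetch once per hour

    def fetch_and_append():
        with urllib.request.urlopen(SOURCE_URL) as response:
            page = response.read().decode("utf-8", errors="replace")
        # The parsing step depends entirely on the page layout and is only
        # hinted at here; a real tool would extract the measure values.
        value = len(page)  # placeholder "measurement"
        with open(OUTPUT_FILE, "a", newline="") as f:
            csv.writer(f).writerow(
                [time.strftime("%Y-%m-%d %H:%M:%S"), value])

    if __name__ == "__main__":
        while True:
            fetch_and_append()
            time.sleep(FETCH_INTERVAL_S)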

8.2 Trade-Offs in Layout Algorithms

Visualizing an abstract dataset may sometimes be very difficult. Finding the right visual mapping for such a dataset may take some time, either to search for a suitable approach in the literature or to invent a novel technique that is suited for the visual exploration of that kind of data. In some cases it may be a hard task to represent the data in the right way. For graph visualization, we learned that there may be trade-offs between several aesthetic criteria. Should the graph visualization reduce edge lengths or minimize edge crossings if both criteria cannot be optimized at the same time?


In the case of such problems, the visualization tool may be supported by interactive features that enable a user to switch between several criteria on demand. Especially for the node-link metaphor, changing several parameters may force the layout algorithm to compute a new layout of the graph and redraw it. But in some cases, even the visual metaphor may not be the right one. Using node-link diagrams for dense graphs can be the wrong choice; maybe a matrix-based representation is much better suited. But what if the viewer wants to explore paths in the graph dataset? A hybrid visualization technique could present a solution that displays dense graphs as a matrix and sparse graphs in a node-link style. This last example shows that even visual metaphors exhibit trade-offs with respect to certain visualization goals. A reduction of visual clutter leads to a matrix-based metaphor, which causes problems for path-related tasks but scales up to thousands of nodes and edges. A good exploration of paths is possible with node-link diagrams, but those produce visual clutter and do not scale up to as many nodes as matrix representations. The same drawback holds for dynamic graphs. A minimization of visual clutter, a very good exploration of paths, and a high scalability are visualization goals that cannot all be achieved at the same time in a static picture.

8.3 Evaluation

The evaluation of visualization techniques is very important in order to understand whether a technique is really useful and superior or inferior to a technique that represents the same kind of data. Another very interesting aspect is the population that participated in the evaluation. Examining the experience of the subjects can help to understand whether the visualization technique is really understandable by laymen or only by very experienced users in a specific field of research. The visualization could then be adjusted to some extent to the demands of the viewers and hence be enhanced in the right direction. The comparative evaluation of our visualization tools in Chapter 6 gave us some insights that could never have been uncovered without the eyetracking study. Though the radial TimeRadarTrees technique cannot be said to be superior to the Cartesian Timeline Trees, we can definitely claim that the novel technique was understandable by people who have no experience in the field of graph drawing or graph visualization. The visualization tools used very complex visual encodings for representing dynamic relational data. Currently, we are planning an online study to evaluate more basic visual encodings for the same kind of data.


In our evaluation we tried to find out whether radial or Cartesian visualizations are superior. So far we have not evaluated whether the static representations of dynamic compound digraphs are superior to animated node-link diagrams. We expect the novel visualization technique to do poorly for path-related tasks but to do well for trend and counter-trend detection. The correctness of this claim has to be confirmed by a sophisticated eyetracking or online study and belongs to future work as well.

8.4 The Tools on the Web

The visualization ideas of our EPOSee tool and of the TimeArcTrees, TimelineTrees, and TimeRadarTrees tools are accessible from our webpages [56, 161], and the software can be downloaded to produce one's own screenshots of a dataset. Additionally, we uploaded our work on TimeRadarTrees to the very popular visualcomplexity visualization webpage [171] to make it accessible, and we hope to get some feedback. In the end, I would like to thank again the many people who contributed through discussions, implementations, organization, and proofreading and finally made this thesis possible. I will never forget you.

BIBLIOGRAPHY

[1] Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, VLDB'94, Proceedings of the 20th International Conference on Very Large Data Bases, pages 487–499, Santiago de Chile, Chile, September 12-15, 1994. Morgan Kaufmann.

[2] Keith Andrews. Visual Exploration of Large Hierarchies with Information Pyramids. International Conference on Information Visualization, page 793, 2002.

[3] Keith Andrews and Helmut Heidegger. Information Slices: Visualizing and Exploring Large Hierarchies using Cascading, Semi-Circular Discs; Late Breaking Hot Topic Paper. In Proceedings of the IEEE Symposium on Information Visualization (INFOVIS'98), pages 9–12, Research Triangle Park, NC, USA, October 1998.

[4] Michael Balzer and Oliver Deussen. Voronoi Treemaps. In IEEE Symposium on Information Visualization InfoVis 2005, page 7, Minneapolis, MN, USA, October 2005.

[5] Moshe Bar and Karl Fogel. Open Source Development with CVS. Paraglyph Press, Inc., Scottsdale, AZ, 3rd edition, 2003.

[6] S. Todd Barlow and Padraic Neville. A Comparison of 2-D Visualizations of Hierarchies. In Proceedings of Information Visualization 2001, pages 131–138, 2001.

[7] Fabian Beck, Michael Burch, and Stephan Diehl. Towards an Aesthetic Dimensions Framework for Dynamic Graph Visualizations. In Proceedings of 13th International Conference on Information Visualisation (IV 09), July 2009.


[8] Richard Becker, Stephen G. Eick, and Allan R. Wilks. Visualizing Network Data. IEEE Transactions on Visualization and Computer Graphics, 1:16–28, 1995.

[9] Benjamin B. Bederson. Quantum Treemaps and Bubblemaps for a Zoomable Image Browser. In Proceedings of User Interface Systems and Technology, pages 71–80. Press, 2001.

[10] Benjamin B. Bederson and Ben Shneiderman. The Craft of Information Visualization: Readings and Reflections. Morgan Kaufmann, San Francisco, CA, USA, 2003.

[11] Ben Shneiderman: Treemaps for Space-Constrained Visualization of Hierarchies. http://www.cs.umd.edu/hcil/treemap-history/.

[12] Chris Bennett, Jody Ryall, Leo Spalteholz, and Amy Gooch. The Aesthetics of Graph Visualization. In Proceedings of Computational Aesthetics in Graphics, Visualization, and Imaging, 2007.

[13] Thomas Bladh, David A. Carr, and Jeremiah Scholl. Extending Tree-Maps to Three Dimensions: A Comparative Study. In Proceedings of 6th Asia-Pacific Conference on Computer-Human Interaction APCHI, pages 50–59, 2004.

[14] Richard Boardman. Bubble Trees: The Visualization of Hierarchical Information Structures. In CHI '00: Extended Abstracts on Human Factors in Computing Systems, pages 315–316, New York, NY, USA, 2000. ACM.

[15] Ulrik Brandes and Dorothea Wagner. A Bayesian Paradigm for Dynamic Graph Layout. In Proceedings of the 5th Symposium on Graph Drawing (GD '97), volume 1353 of Lecture Notes in Computer Science, pages 236–247. Springer-Verlag, 1997.

[16] Sally Anne Browning. The Tree Machine: A Highly Concurrent Computing Environment. PhD thesis, California Institute of Technology, Pasadena, CA, USA, 1980.

[17] Mark Bruls, Kees Huizing, and Jarke J. van Wijk. Squarified Treemaps. In Proceedings of the Joint Eurographics and IEEE TCVG Symposium on Visualization, pages 33–42. Press, 2000.

[18] Michael Burch, Fabian Beck, and Stephan Diehl. Timeline Trees: Visualizing Sequences of Transactions in Information Hierarchies. In Proceedings of 9th International Working Conference on Advanced Visual Interfaces (AVI 2008), May 2008.


[19] Michael Burch, Felix Bott, Fabian Beck, and Stephan Diehl. Cartesian vs. Radial—A Comparative Evaluation of Two Visualization Tools. In Proceedings of 4th International Symposium on Visual Computing (ISVC'08), December 2008.

[20] Michael Burch and Stephan Diehl. Trees in a Treemap: Visualizing Multiple Hierarchies. In Proceedings of 13th Conference on Visualization and Data Analysis (VDA 2006), San Jose, California, US, 2006.

[21] Michael Burch and Stephan Diehl. TimeRadarTrees: Visualizing Dynamic Compound Digraphs. In Proceedings of Tenth Joint Eurographics/IEEEVGTC Symposium on Visualization (EuroVis'2008), May 2008.

[22] Michael Burch, Stephan Diehl, and Peter Weißgerber. Visual Data Mining in Software Archives. In Proceedings of ACM Symposium on Software Visualization SOFTVIS'05, pages 37–46, St. Louis, USA, 2005. ACM.

[23] Michael Burch, Stephan Diehl, and Peter Weißgerber. EPOSee: A Tool for Visualizing Software Evolution Patterns. In Sixth Workshop on Software-Reengineering, appeared in Softwaretechnik-Trends, Gesellschaft für Informatik, Bd. 24, Issue 2, Bad Honnef, Germany, May 2004.

[24] Michael Burch, Stephan Diehl, and Peter Weißgerber. EPOSee — A Tool for Visualizing Software Evolution. In VISSOFT '05: Proceedings of the 3rd IEEE International Workshop on Visualizing Software for Understanding and Analysis, page 35, Budapest, Hungary, September 2005. IEEE Computer Society.

[25] C. Project – Network Data Sets – Traceroutes Data Sets, online. http://www.cosin.org/extra/data/traceroute/imdb.html, 2005.

[26] Stuart Card. Information Visualization. Lawrence Erlbaum Associates Inc., Hillsdale, NJ, USA, 2008.

[27] Stuart Card, Jock Mackinlay, and Ben Shneiderman. Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999.

[28] Jeromy S. Carriere and Rick Kazman. Research Report: Interacting with Huge Hierarchies: Beyond Cone Trees. IEEE Symposium on Information Visualization, pages 74–81, October 30-31, 1995.

[29] John M. Chambers, William S. Cleveland, B. Kleiner, and Paul A. Tukey. Graphical Methods for Data Analysis. Chapman and Hall, New York, 1983.


[30] Ed H. Chi. A Taxonomy of Visualization Techniques Using the Data State Reference Model. IEEE Symposium on Information Visualization, page 69, 2000.

[31] Richard Chimera and Ben Shneiderman. An Exploratory Evaluation of Three Interfaces for Browsing Large Hierarchical Tables of Contents. ACM Transactions on Information Systems, 12:383–406, 1994.

[32] William S. Cleveland and Robert McGill. Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods. Journal of the American Statistical Association, 79:531–554, 1984.

[33] Code Swarm: An Experiment in Organic Software Visualization. http://vis.cs.ucdavis.edu/ ogawa/codeswarm/.

[34] Christian Collberg, Stephen Kobourov, Jasvir Nagra, Jacob Pitts, and Kevin Wampler. A System for Graph-Based Visualization of the Evolution of Software. In Proceedings of the 2003 ACM Symposium on Software Visualization, pages 77–86, New York NY, 2003. ACM Press.

[35] Christopher Collins. DocuBurst: Radial Space-Filling Visualization of Document Content. Technical Report KMDI-TR-2007-1, Knowledge Media Design Institute, University of Toronto, 2007.

[36] Reidar Conradi and Bernhard Westfechtel. Version Models for Software Configuration Management. ACM Computer Surveys, 30(2):232–282, 1998.

[37] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition, 2001.

[38] Davor Cubranic, Gail C. Murphy, Janice Singer, and Kellogg S. Booth. Hipikat: A Project Memory for Software Development. IEEE Transactions on Software Engineering, 31(6):446–465, 2005.

[39] Raimund Dachselt and Jürgen Ebert. Collapsible Cylindrical Trees: A Fast Hierarchical Navigation Technique. In INFOVIS '01: Proceedings of the IEEE Symposium on Information Visualization 2001 (INFOVIS'01), pages 79–86, Washington, DC, USA, 2001. IEEE Computer Society.

[40] Marco D'Ambros, Michele Lanza, and Mircea Lungu. The Evolution Radar: Integrating Fine-Grained and Coarse-Grained Logical Coupling Information. In Proceedings of MSR 2006 (3rd International Workshop on Mining Software Repositories), pages 26–32, May 2006.

[41] Ron Davidson and David Harel. Drawing Graphs Nicely Using Simulated Annealing. ACM Transactions on Graphics, 15(4):301–331, 1996.


[42] Giuseppe di Battista, Peter Eades, Roberto Tamassia, and Ioannis G. Tollis. Graph Drawing: Algorithms for the Visualization of Graphs. Prentice Hall, July 1998.

[43] Stephan Diehl. Software Visualization: Visualizing the Structure, Behaviour, and Evolution of Software. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2007.

[44] Stephan Diehl and Carsten Görg. Graphs, They Are Changing. In GD '02: Revised Papers from the 10th International Symposium on Graph Drawing, pages 23–30, London, UK, 2002. Springer-Verlag.

[45] Dirk Draheim and Lukasz Pekacki. Analytical Processing of Version Control Data: Towards a Process-Centric Viewpoint, 2003.

[46] Dirk Draheim and Lukasz Pekacki. Process-Centric Analytical Processing of Version Control Data. International Workshop on Principles of Software Evolution, page 131, 2003.

[47] Geoffrey M. Draper, Yarden Livnat, and Richard F. Riesenfeld. A Survey of Radial Methods for Information Visualization. IEEE Transactions on Visualization and Computer Graphics, 15(5):759–776, 2009.

[48] Peter Eades. Drawing Free Trees. In Bulletin of the Institute for Combinatorics and its Applications, 5, 2, pages 10–36, 1992.

[49] Peter Eades, Qing-Wen Feng, Xuemin Lin, and Hiroshi Nagamochi. Straight-Line Drawing Algorithms for Hierarchical Graphs and Clustered Graphs. Algorithmica, 44(1):1–32, 2006.

[50] Peter Eades, Wei Lai, Kazuo Misue, and Kozo Sugiyama. Preserving the Mental Map of a Diagram. In Proceedings of Computer Graphics, pages 24–33, 1991.

[51] Peter Eades and Sue Whitesides. Drawing Graphs in Two Layers. Theoretical Computer Science, 131(2):361–374, 1994.

[52] Stephen G. Eick. Visual Scalability. Journal of Computational and Graphical Statistics, 11:22–43, March 2002.

[53] Stephen G. Eick, Joseph L. Steffen, and Eric E. Sumner Jr. Seesoft — A Tool for Visualizing Line Oriented Software Statistics. IEEE Transactions on Software Engineering, 18(11):957–968, 1992.

[54] Niklas Elmqvist, Thanh-Nghi Do, Howard Goodell, Nathalie Henry, and Jean-Daniel Fekete. ZAME: Interactive Large-Scale Graph Visualization. In Proceedings of the IEEE Pacific Visualization Symposium, pages 215–222. IEEE, 2008.


[55] T. Todd Elvins. VisFiles: Presentation Techniques for Time-Series Data. SIGGRAPH Computer Graphics, 31(2):14–16, 1997.
[56] EPOSee - Visualization of Evolution Patterns of Software, online. http://www.eposoft.org.

[57] Leonhard Euler. Königsberg Bridge Problem. In Commentarii Academiae Scientiarum Petropolitanae 8, 1741 (originally published), pages 128–140, 1741.
[58] Jean-Daniel Fekete. The InfoVis Toolkit. In INFOVIS '04: Proceedings of the IEEE Symposium on Information Visualization, pages 167–174, Washington, DC, USA, 2004. IEEE Computer Society.
[59] Jean-Daniel Fekete, Niklas Elmqvist, Thanh-Nghi Do, Howard Goodell, and Nathalie Henry. Navigating Wikipedia with the Zoomable Adjacency Matrix Explorer. Technical Report RR:00141168, INRIA Research Report (Paris), 2007.
[60] Jean-Daniel Fekete, David Wang, Niem Dang, Aleks Aris, and Catherine Plaisant. Overlaying Graph Links on Treemaps, 2003.
[61] Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press, May 1998.
[62] Qing-Wen Feng. Algorithms for Drawing Clustered Graphs. PhD thesis, University of Newcastle, 1997.
[63] Michael Fischer, Martin Pinzger, and Harald Gall. Populating a Release History Database from Version Control and Bug Tracking Systems. In ICSM '03: Proceedings of the International Conference on Software Maintenance, page 23, Washington, DC, USA, September 2003. IEEE Computer Society.
[64] Andrew U. Frank. Different Types of 'Times' in GIS. In Spatial and Temporal Reasoning in Geographic Information Systems, pages 40–62. Oxford University Press, 1998.
[65] Carsten Friedrich and Peter Eades. Graph Drawing in Motion. Journal of Graph Algorithms and Applications, 6(3):353–370, 2002.
[66] Michael Friendly. Milestones in the History of Thematic Cartography, Statistical Graphics, and Data Visualization. In Seeing Science: Today. American Association for the Advancement of Science, February 2008.
[67] Jon Froehlich and Paul Dourish. www.ics.uci.edu/~jpd/research/seesoft.html, 2003.

[68] Jon Froehlich and Paul Dourish. Unifying Artifacts and Activities in a Visual Tool for Distributed Software Development Teams. International Conference on Software Engineering, pages 387–396, 2004.
[69] Futbol24: Livescore — Soccer Results — Fixtures — Statistics, online. http://www.futbol24.com.
[70] Henry Laurence Gantt. Work, Wages, and Profits. Engineering Magazine Co., New York, 1910.
[71] John Gantz. An Updated Forecast of Worldwide Information Growth Through 2011, March 2008.
[72] Gapminder, online. http://www.gapminder.org.
[73] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness (Series of Books in the Mathematical Sciences). W. H. Freeman & Co Ltd, January 1979.
[74] Ken Garland. Mr Beck's Underground Map. Capital Transport, 1994.
[75] Daniel M. Germán. Mining CVS Repositories, the Softchange Experience. In Proceedings of the First International Workshop on Mining Software Repositories MSR, pages 17–21, Edinburgh, Scotland, UK, 2004.
[76] Daniel M. Germán, Abram Hindle, and Norman Jordan. Visualizing the Evolution of Software Using softChange. In SEKE'04: Proceedings of the 16th International Conference on Software Engineering and Knowledge Engineering, pages 336–341, 2004.
[77] Mohammad Ghoniem, Jean-Daniel Fekete, and Philippe Castagliola. A Comparison of the Readability of Graphs Using Node-Link and Matrix-Based Representations. IEEE Symposium on Information Visualization, pages 17–24, 2004.
[78] Mohammad Ghoniem, Jean-Daniel Fekete, and Philippe Castagliola. On the Readability of Graphs Using Node-Link and Matrix-Based Representations: a Controlled Experiment and Statistical Analysis. Information Visualization, 4(2):114–135, 2005.
[79] Eric Gilbert and Karrie Karahalios. CodeSaw: A Social Visualization of Distributed Software Development. In Proceedings of Interact, pages 303–316, 2007.
[80] Carsten Görg, Peter Birke, Mathias Pohl, and Stephan Diehl. Dynamic Graph Drawing of Sequences of Orthogonal and Hierarchical Graphs. In Proceedings of International Symposium on Graph Drawing GD, September 2004.

[81] Martin Greilich, Michael Burch, and Stephan Diehl. Visualizing the Evolution of Compound Digraphs with TimeArcTrees. In Proceedings of 11th Joint Eurographics/IEEE Symposium on Visualization (EuroVis'2009), June 2009.
[82] Sebastien Grivet, David Auber, Jean-Philippe Domenger, and Guy Melancon. Bubble Tree Drawing Algorithm. In International Conference on Computer Vision and Graphics, pages 633–641. Springer, 2004.
[83] Paul R. Halmos. Measure Theory, volume 18 of Graduate Texts in Mathematics. Springer, NY, 1974.
[84] Ahmed E. Hassan and Richard C. Holt. Studying the Evolution of Software Systems Using Evolutionary Code Extractors. International Workshop on Principles of Software Evolution, pages 76–81, 2004.
[85] Susan Havre, Elizabeth Hetzler, Paul Whitney, and Lucy Nowell. ThemeRiver: Visualizing Thematic Changes in Large Document Collections. IEEE Transactions on Visualization and Computer Graphics, 8(1):9–20, 2002.
[86] Nathalie Henry and Jean-Daniel Fekete. MatrixExplorer: a Dual-Representation System to Explore Social Networks. IEEE Transactions on Visualization and Computer Graphics, 12(5):677–684, 2006.
[87] Nathalie Henry and Jean-Daniel Fekete. MatLink: Enhanced Matrix Visualization for Analyzing Social Networks. In Proceedings of the International Conference Interact, 2007.
[88] Nathalie Henry, Jean-Daniel Fekete, and Michael J. McGuffin. NodeTrix: Hybrid Representation for Analyzing Social Networks. In IEEE TVCG (InfoVis'07 Proceedings), 2007.
[89] Ivan Herman, Guy Melancon, and M. Scott Marshall. Graph Visualization and Navigation in Information Visualization: a Survey. IEEE Transactions on Visualization and Computer Graphics, 6:24–43, 2000.
[90] Historian, online. www.historian.tigris.org.
[91] Ric Holt and Jason Y. Pak. GASE: Visualizing Software Evolution-in-the-Large. In Proceedings of the Working Conference on Reverse Engineering, pages 163–167, 1996.
[92] Danny Holten. Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data. IEEE Transactions on Visualization and Computer Graphics, 12(5):741–748, September 2006.
[93] Danny Holten and Jarke J. van Wijk. Visual Comparison of Hierarchically Organized Data. Computer Graphics Forum, 27(3):759–766, 2008.

[94] Danny Holten and Jarke J. van Wijk. Force-Directed Edge Bundling for Graph Visualization. In 11th Eurographics/IEEE-VGTC Symposium on Visualization (Computer Graphics Forum; Proceedings of EuroVis 2009), pages 983–990, Berlin, Germany, 2009.
[95] Matthew Holton. Strands, Gravity and Botanical Tree Imagery. In Computer Graphics Forum, 13(1), pages 57–67, March 1994.
[96] Alfred Inselberg. The Plane with Parallel Coordinates. The Visual Computer, 1(4):69–91, December 1985.
[97] Alfred Inselberg and Bernd Dimsdale. Parallel Coordinates: A Tool for Visualizing Multi-Dimensional Geometry. In Proceedings of IEEE Visualization, pages 361–378, Los Alamitos, CA, USA, 1990. IEEE Computer Society Press.
[98] Liqun Jin and David C. Banks. TennisViewer: A Browser for Competition Trees. IEEE Computer Graphics and Applications, 17(4):63–65, 1997.
[99] Brian Johnson and Ben Shneiderman. Tree-Maps: A Space-Filling Approach to the Visualization of Hierarchical Information Structures. In Proceedings of IEEE Visualization Conference, pages 284–291, 1991.
[100] Bela Julesz. Textons, the Elements of Texture Perception, and their Interactions. Nature, 290(5802):91–97, March 1981.
[101] Mehmed Kantardzic. Data Mining: Concepts, Models, Methods and Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 2002.
[102] Michael Kaufmann and Dorothea Wagner, editors. Drawing Graphs: Methods and Models (Lecture Notes in Computer Science). Springer, Berlin, 1st edition, 2001.
[103] Daniel A. Keim. Information Visualization and Visual Data Mining. IEEE Transactions on Visualization and Computer Graphics, 8(1):1–8, 2002.
[104] René Keller, Claudia M. Eckert, and P. John Clarkson. Matrices or Node-Link Diagrams: Which Visual Representation is Better for Visualising Connectivity Models? Information Visualization, 5(1):62–76, 2006.
[105] Bernard Kerr, Li-Te Cheng, and Timothy Sweeney. Growing Bloom: Design of a Visualization of Project Evolution. In CHI '06: Extended Abstracts on Human Factors in Computing Systems, pages 93–98, New York, NY, USA, 2006. ACM.
[106] Ernst Kleiberg, Huub van de Wetering, and Jarke J. van Wijk. Botanical Visualization of Huge Hierarchies. IEEE Symposium on Information Visualization, page 87, 2001.

[107] Kurt Koffka. Principles of Gestalt Psychology. Routledge and Kegan Paul, New York, 1935.
[108] Hideki Koike and Hui-Chu Chu. VRCS: Integrating Version Control and Module Management Using Interactive 3D Graphics. IEEE Symposium on Visual Languages, page 168, 1997.
[109] Michele Lanza. The Evolution Matrix: Recovering Software Evolution Using Software Visualization Techniques. In IWPSE '01: Proceedings of the 4th International Workshop on Principles of Software Evolution, pages 37–42, New York, NY, USA, 2001. ACM.
[110] Ying K. Leung, Robert Spence, and Mark D. Apperley. Applying Bifocal Displays to Topological Maps. International Journal of Human Computer Interaction, 7(1):79–98, 1995.
[111] Chun-Cheng Lin and Hsu-Chun Yen. On Balloon Drawings of Rooted Trees. Journal of Graph Algorithms and Applications, 11(2):431–452, 2007.
[112] Hao Lu and James Fogarty. Cascaded Treemaps: Examining the Visibility and Stability of Structure in Treemaps. In GI '08: Proceedings of Graphics Interface 2008, pages 259–266, Toronto, Ontario, Canada, 2008. Canadian Information Processing Society.
[113] Peter Lyman and Hal R. Varian. How Much Information? JEP: The Journal of Electronic Publishing, 2003.
[114] Alan M. MacEachren. How Maps Work: Representation, Visualization, and Design. The Guilford Press, June 2004.
[115] Jonathan I. Maletic, Andrian Marcus, and Louis Feng. Source Viewer 3D (sv3D): A Framework for Software Visualization. In ICSE '03: Proceedings of the 25th International Conference on Software Engineering, pages 812–813, Washington, DC, USA, 2003. IEEE Computer Society.
[116] Andrian Marcus, Louis Feng, and Jonathan I. Maletic. 3D Representations for Software Visualization. In SoftVis '03: Proceedings of the 2003 ACM Symposium on Software Visualization, pages 27–ff, New York, NY, USA, 2003. ACM.
[117] Bruce H. McCormick. Visualization in Scientific Computing. SIGBIO Newsletter, 10(1):15–21, 1988.
[118] Webb Miller and Eugene W. Myers. A File Comparison Program. Software — Practice and Experience, 15(11):1025–1040, 1985.

[119] Minnesota Internet Traffic Studies (MINTS), online. http://www.dtc.umn.edu/mints/home.php.
[120] Kazuo Misue, Peter Eades, Wei Lai, and Kozo Sugiyama. Layout Adjustment and the Mental Map. Journal of Visual Languages & Computing, pages 183–210, June 1995.
[121] Wolfgang Müller and Heidrun Schumann. Visualization for Modeling and Simulation: Visualization Methods for Time-Dependent Data — An Overview. In Winter Simulation Conference, pages 737–745, 2003.
[122] Tamara Munzner, Francois Guimbretiere, Serdar Tasiran, Yunhong Zhou, and Li Zhang. TreeJuxtaposer: Scalable Tree Comparison Using Focus+Context with Guaranteed Visibility. ACM Transactions on Graphics, 22:453–462, 2003.
[123] Petra Neumann, Stefan Schlechtweg, and Sheelagh Carpendale. ArcTrees: Visualizing Relations in Hierarchical Data. In EuroVis'05 - Eurographics / IEEE VGTC Symposium on Visualization, pages 53–60, June 2005.
[124] Quang Vinh Nguyen and Mao Lin Huang. EncCon: An Approach to Constructing Interactive Visualization of Large Hierarchical Data. Information Visualization, 4(1):1–21, 2005.
[125] Michael Ogawa and Kwan-Liu Ma. StarGate: A Unified, Interactive Visualization of Software Projects. In Proceedings of IEEE VGTC Pacific Visualization Symposium (PacificVis), pages 191–198, 2008.
[126] David Lorge Parnas. Software Aging. In ICSE '94: Proceedings of the 16th International Conference on Software Engineering, pages 279–287, Los Alamitos, CA, USA, 1994. IEEE Computer Society Press.
[127] Lukasz Pekacki. http://bloof.sourceforge.net/, 2003.
[128] Doantam Phan, Ling Xiao, Ron Yeh, Pat Hanrahan, and Terry Winograd. Flow Map Layout. In INFOVIS '05: Proceedings of the 2005 IEEE Symposium on Information Visualization, page 29, Washington, DC, USA, 2005. IEEE Computer Society.
[129] Michael C. Pilato, Ben Collins-Sussman, and Brian W. Fitzpatrick. Version Control with Subversion. O'Reilly Media, Inc, June 2004.
[130] Martin Pinzger, Harald Gall, Michael Fischer, and Michele Lanza. Visualizing Multiple Evolution Metrics. In SoftVis '05: Proceedings of the 2005 ACM Symposium on Software Visualization, pages 67–75, New York, NY, USA, 2005. ACM Press.

[131] Mathias Pohl and Peter Birke. Interactive Exploration of Large Dynamic Networks. In VISUAL '08: Proceedings of the 10th International Conference on Visual Information Systems, pages 56–67, Berlin, Heidelberg, 2008. Springer-Verlag.
[132] Mathias Pohl, Michael Burch, and Peter Weißgerber. Ist Programmieren ein Mannschaftssport? In Wolf-Gideon Bleek, Jörg Raasch, and Heinz Züllighoven, editors, Software Engineering, volume 105 of LNI, pages 181–192. GI, March 2007.
[133] Mathias Pohl, Florian Reitz, and Peter Birke. As Time Goes By: Integrated Visualization and Analysis of Dynamic Networks. In AVI '08: Proceedings of the Working Conference on Advanced Visual Interfaces, pages 372–375, New York, NY, USA, 2008. ACM.
[134] Johannes A. Pretorius and Jarke J. van Wijk. Visual Analysis of Multivariate State Transition Graphs. IEEE Transactions on Visualization and Computer Graphics, 12(5):685–692, 2006.
[135] Helen C. Purchase. Metrics for Graph Drawing Aesthetics. Journal of Visual Languages and Computing, 13(5):501–516, 2002.
[136] Edward M. Reingold and John S. Tilford. Tidier Drawings of Trees. IEEE Transactions on Software Engineering, 7(2):223–228, 1981.
[137] Claudio Riva. Visualizing Software Release Histories With 3DSoftVis. In ICSE '00: Proceedings of the 22nd International Conference on Software Engineering, page 789, New York, NY, USA, 2000. ACM Press.
[138] George Robertson, Roland Fernandez, Danyel Fisher, Bongshin Lee, and John Stasko. Effectiveness of Animation in Trend Visualization. IEEE Transactions on Visualization and Computer Graphics, 14(6):1325–1332, 2008.
[139] George G. Robertson, Jock D. Mackinlay, and Stuart K. Card. Cone Trees: Animated 3D Visualizations of Hierarchical Information. In CHI '91: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 189–194, New York, NY, USA, 1991. ACM Press.
[140] Marc J. Rochkind. The Source Code Control System. IEEE Transactions on Software Engineering, SE-1(4):364–370, December 1975.
[141] Ruth Rosenholtz, Yuanzhen Li, Jonathan Mansfield, and Zhenlan Jin. Feature Congestion: A Measure of Display Clutter. In Proceedings of SIGCHI Conference on Human Factors in Computing Systems, pages 761–770. ACM Press, 2005.

[142] Georg Sander. Layout of Compound Directed Graphs. Technical report, Universität des Saarlandes, FB 14 Informatik, 1996.
[143] Zeqian Shen and Kwan-Liu Ma. Path Visualization for Adjacency Matrices. In Proceedings of EuroVis '07, pages 83–90, 2007.
[144] Ben Shneiderman. Tree Visualization with Tree-Maps: 2-D Space-Filling Approach. ACM Transactions on Graphics, 11(1):92–99, 1992.
[145] Ben Shneiderman and Martin Wattenberg. Ordered Treemap Layouts. In INFOVIS '01: Proceedings of the IEEE Symposium on Information Visualization 2001 (INFOVIS'01), pages 73–78, Washington, DC, USA, 2001. IEEE Computer Society.
[146] Slava Pestov and Contributors. jEdit Project Homepage, online. http://www.jedit.org.

[147] Robert Spence. Information Visualization: Design for Interaction (2nd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2007.
[148] John Stasko, John B. Domingue, Marc H. Brown, and Blaine A. Price. Software Visualization: Programming as a Multimedia Experience. The MIT Press, 1998.
[149] John Stasko and Eugene Zhang. Focus+Context Display and Navigation Techniques for Enhancing Radial, Space-Filling Hierarchy Visualizations. In INFOVIS '00: Proceedings of the IEEE Symposium on Information Visualization 2000, page 57, Washington, DC, USA, 2000. IEEE Computer Society.
[150] Margaret-Anne Storey and Hausi A. Müller. Manipulating and Documenting Software Structures Using Shrimp Views. In Proceedings of the 1995 International Conference on Software Maintenance, pages 275–284, Opio, France, 1995. IEEE Computer Society Press.
[151] SUBVERSION, online. http://subversion.tigris.org/.
[152] Kozo Sugiyama and Kazuo Misue. Visualization of Structural Information: Automatic Drawing of Compound Digraphs. In IEEE Transactions on Systems, Man and Cybernetics, 21(4), pages 876–892, 1991.
[153] James Joseph Sylvester. Chemistry and Algebra. In Nature, 17, page 284, 1878.
[154] Martyn Taylor and Peter Rodgers. Applying Graphical Design Techniques to Graph Visualization. In Proceedings of the International Conference on Information Visualisation, pages 651–656, 2005.

[155] Soon Tee Teoh and Kwan-Liu Ma. RINGS: A Technique for Visualizing Large Hierarchies. In Proceedings of 10th International Symposium on Graph Drawing, pages 268–275, April 2002.
[156] The American Heritage Dictionaries, editor. The American Heritage College Dictionary. Houghton Mifflin Harcourt, 4th edition, June 2004.
[157] The Mozilla Organization. http://www.mozilla.org.
[158] Walter F. Tichy. Design, Implementation, and Evaluation of a Revision Control System. In ICSE '82: Proceedings of the 6th International Conference on Software Engineering, pages 58–67, Los Alamitos, CA, USA, 1982. IEEE Computer Society Press.
[159] Walter F. Tichy. The String-to-String Correction Problem with Block Moves. ACM Transactions on Computer Systems, 2(4):309–321, 1984.
[160] Walter F. Tichy. RCS — A System for Version Control. Software — Practice and Experience, 15(7):637–654, 1985.
[161] TimeArcTrees - Timeline Trees - TimeRadarTrees, online. http://www.st.uni-trier.de/~burch/trt/trt.html.
[162] Edward R. Tufte. The Visual Display of Quantitative Information. Graphics Press, 1983.
[163] Edward R. Tufte. Envisioning Information. Graphics Press, 1990.
[164] Edward R. Tufte. Visual Explanations. Graphics Press, 1997.
[165] David Turo. Hierarchical Visualization with Treemaps: Making Sense of Pro Basketball Data. In ACM CHI '94 Conference Companion, pages 441–442. ACM, 1994.
[166] Frank van Ham and Jarke J. van Wijk. Beamtrees: Compact Visualization of Large Hierarchies. In INFOVIS '02: Proceedings of the IEEE Symposium on Information Visualization (InfoVis'02), page 93, Washington, DC, USA, 2002. IEEE Computer Society.
[167] Guido van Rossum. http://www.python.org/doc/essays/graphs/, 1998.
[168] Jarke J. van Wijk and Huub van de Wetering. Cushion Treemaps: Visualization of Hierarchical Information. In INFOVIS '99: Proceedings of the 1999 IEEE Symposium on Information Visualization, page 73, Washington, DC, USA, 1999. IEEE Computer Society Press.
[169] Frederic Vernier and Laurence Nigay. Modifiable Treemaps Containing Variable-Shaped Units, 2000.

[170] Fernanda B. Viegas, Martin Wattenberg, and Kushal Dave. Studying Cooperation and Conflict between Authors with History Flow Visualizations. In CHI'04: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 575–582. ACM Press, 2004.
[171] Visualcomplexity, online. www.visualcomplexity.com.
[172] Lucian Voinea, Alex Telea, and Jarke J. van Wijk. CVSscan: Visualization of Code Evolution. In SoftVis '05: Proceedings of the 2005 ACM Symposium on Software Visualization, pages 47–56, New York, NY, USA, 2005. ACM Press.
[173] Weixin Wang, Hui Wang, Guozhong Dai, and Hongan Wang. Visualization of Large Hierarchical Data by Circle Packing. In CHI '06: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 517–520, New York, NY, USA, 2006. ACM Press.
[174] Colin Ware. Information Visualization: Perception for Design. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2004.
[175] Martin Wattenberg. Visualizing the Stock Market. In CHI '99: Extended Abstracts on Human Factors in Computing Systems, pages 188–189, New York, NY, USA, 1999. ACM.
[176] WDR Verkehrslage, online. http://www.wdr.de/themen/verkehr/verkehrslage/.
[177] Edward J. Wegman. Hyperdimensional Data Analysis Using Parallel Coordinates. Journal of the American Statistical Association, 85(411):664–675, 1990.
[178] Peter Weißgerber, Michael Burch, and Stephan Diehl. Knowledge Discovery in Version Archives. In LWA, pages 92–99, 2004.
[179] Peter Weißgerber, Mathias Pohl, and Michael Burch. Visual Data Mining in Software Archives to Detect how Developers Work Together. In MSR '07: Mining Software Repositories (ICSE Workshop), Minneapolis, USA, 2007.
[180] Peter Weißgerber, Leo von Klenze, Michael Burch, and Stephan Diehl. Exploring Evolutionary Coupling in Eclipse. In Proceedings of International Eclipse Technology Exchange (ETX05), San Diego, California, USA, October 2005.
[181] Richard Wettel and Michele Lanza. CodeCity: 3D Visualization of Large-Scale Software. In ICSE Companion, pages 921–922, 2008.
[182] Pak Chung Wong, Paul Whitney, and Jim Thomas. Visualizing Association Rules for Text Mining. In INFOVIS '99: Proceedings of the 1999 IEEE Symposium on Information Visualization, page 120, Washington, DC, USA, 1999. IEEE Computer Society Press.

[183] Jingwei Wu, Claus W. Spitzer, Ahmed E. Hassan, and Richard C. Holt. Evolution Spectrographs: Visualizing Punctuated Change in Software Evolution. In IWPSE '04: Proceedings of the Principles of Software Evolution, 7th International Workshop, pages 57–66, Washington, DC, USA, 2004. IEEE Computer Society Press.
[184] Xiaomin Wu. Visualization of Version Control Information, 2003.
[185] Jing Yang, Matthew O. Ward, Elke A. Rundensteiner, and Anilkumar Patro. InterRing: A Visual Interface for Navigating and Manipulating Hierarchies. Information Visualization, 2(1):16–30, 2003.
[186] Ka-Ping Yee, Danyel Fisher, Rachna Dhamija, and Marti Hearst. Animated Exploration of Dynamic Graphs with Radial Layout. In INFOVIS '01: Proceedings of the IEEE Symposium on Information Visualization 2001 (INFOVIS'01), page 43, Washington, DC, USA, 2001. IEEE Computer Society Press.
[187] yFiles, yWorks, online. http://www.yworks.com/en/products_yfiles_about.htm.
[188] Beth Yost and Chris North. The Perceptual Scalability of Visualization. IEEE Transactions on Visualization and Computer Graphics, 12(5):837–844, 2006.
[189] Nick Zangwill. Aesthetic Judgment, 2008.
[190] Shengdong Zhao, Michael J. McGuffin, and Mark H. Chignell. Elastic Hierarchies: Combining Treemaps and Node-Link Diagrams. In INFOVIS '05: Proceedings of the 2005 IEEE Symposium on Information Visualization, page 8, Washington, DC, USA, 2005. IEEE Computer Society Press.
[191] Esteban Zimanyi and Sabri Skhiri dit Gabouje. Semantic Visualization of Biochemical Databases. In Semantics for GRID Databases: Proceedings of the International Conference on Semantics for a Networked World, pages 199–214. LNCS 3226, Springer, 2004.
[192] Esteban Zimanyi and Sabri Skhiri dit Gabouje. A New Constraint-Based Compound Graph Layout Algorithm for Drawing Biochemical Networks. In Fernando Ferri, editor, Visual Languages for Interactive Computing: Definitions and Formalizations. Idea Group Inc., 2006.
[193] Thomas Zimmermann, Stephan Diehl, and Andreas Zeller. How History Justifies System Architecture (or not). In Proceedings of International Workshop on Principles of Software Evolution IWPSE, page 73, September 2003.
[194] Thomas Zimmermann and Peter Weißgerber. Preprocessing CVS Data for Fine-Grained Analysis. In Proceedings of International Workshop on Mining Software Repositories MSR, pages 2–6, May 2004.

[195] Thomas Zimmermann, Peter Weißgerber, Stephan Diehl, and Andreas Zeller. Mining Version Histories to Guide Software Changes. In Proceedings of the 26th International Conference on Software Engineering ICSE, pages 563–572, Washington, DC, USA, May 2004. IEEE Computer Society Press.

Publications Related to this Work

• Michael Burch, Stephan Diehl, and Peter Weißgerber. EPOSee: A Tool for Visualizing Software Evolution Patterns. In Proceedings of Workshop on Software-Reengineering (WSR), Bad Honnef, Germany, May 2004.
• Michael Burch. Interactive Visualization of Large Rule Sets for the Detection of Patterns and Anomalies. Diploma Thesis, Saarland University, Saarbrücken, June 2004.
• Peter Weißgerber, Michael Burch, and Stephan Diehl. Knowledge Discovery in Version Archives. In Proceedings of Workshop on Knowledge Discovery AKKD (Arbeitskreis Knowledge-Discovery), Berlin, Germany, October 2004.
• Michael Burch, Stephan Diehl, and Peter Weißgerber. Visual Data Mining in Software Archives. In Proceedings of ACM Symposium on Software Visualization SOFTVIS'05, St. Louis, May 2005.
• Michael Burch, Stephan Diehl, and Peter Weißgerber. EPOSee - A Tool for Visualizing Software Evolution. In Proceedings of the 3rd IEEE International Workshop on Visualizing Software for Program Understanding and Analysis, VisSoft'05, Budapest, Hungary, September 2005.
• Peter Weißgerber, Leo von Klenze, Michael Burch, and Stephan Diehl. Exploring Evolutionary Coupling in Eclipse. eclipse Technology eXchange (eTX) Workshop at OOPSLA 2005.
• Michael Burch and Stephan Diehl. Trees in a Treemap: Visualizing Multiple Hierarchies. In Proceedings of 13th Conference on Visualization and Data Analysis (VDA 2006), San Jose, California, January 2006.
• Mathias Pohl, Michael Burch, and Peter Weißgerber. Ist Programmieren ein Mannschaftssport? SE 2007 - Conference on Software Engineering, Hamburg, Germany, March 2007. Published in Lecture Notes in Informatics (LNI) 105, Gesellschaft für Informatik.
• Peter Weißgerber, Mathias Pohl, and Michael Burch. Visual Data Mining in Software Archives to Detect how Developers Work Together. MSR 07 - Mining Software Repositories (ICSE Workshop), Minneapolis, USA, 2007.
• Michael Burch, Fabian Beck, and Stephan Diehl. Timeline Trees: Visualizing Sequences of Transactions in Information Hierarchies. In Proceedings of 9th International Working Conference on Advanced Visual Interfaces (AVI 2008), Naples, Italy, May 2008.
• Michael Burch and Stephan Diehl. TimeRadarTrees: Visualizing Dynamic Compound Digraphs. In Proceedings of Tenth Joint Eurographics/IEEE-VGTC Symposium on Visualization (EuroVis 2008), Eindhoven, The Netherlands, May 2008.

• Michael Burch, Felix Bott, Fabian Beck, and Stephan Diehl. Cartesian vs. Radial—A Comparative Evaluation of Two Visualization Tools. In Proceedings of 4th International Symposium on Visual Computing (ISVC 08), Las Vegas, Nevada, December 2008.
• Martin Greilich, Michael Burch, and Stephan Diehl. Visualizing the Evolution of Compound Digraphs with TimeArcTrees. In Proceedings of 11th Joint Eurographics/IEEE Symposium on Visualization (EuroVis 2009), Berlin, Germany, June 2009.
• Fabian Beck, Michael Burch, and Stephan Diehl. Towards an Aesthetic Dimensions Framework for Dynamic Graph Visualizations. In Proceedings of 13th International Conference on Information Visualisation (IV 09), Barcelona, Spain, July 2009.

Curriculum Vitae

Michael Burch was born on April 5, 1976 in Merzig, Germany. In 1982, he entered primary school in his home village of Bachem, Germany. From 1986 until 1995, he attended secondary school in Merzig, where he obtained the Abitur in 1995. During the following 13 months, he worked at an ambulance station to complete his civilian service. In October 1996, he began studying physics at Saarland University in Saarbrücken, Germany, and switched one year later to computer science with mathematics as a minor subject. In 2004, he was awarded a diploma in computer science. He then began his doctoral studies at the chair of software engineering of Prof. Dr. Stephan Diehl at the University of Eichstätt-Ingolstadt, and continued them at the University of Trier, Germany, from 2006.
