INTEGRATING NATURAL LANGUAGE AND PROGRAM STRUCTURE INFORMATION TO IMPROVE SOFTWARE SEARCH AND EXPLORATION

by Emily Hill

A dissertation submitted to the Faculty of the University of Delaware in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

Summer 2010

© 2010 Emily Hill

All Rights Reserved


INTEGRATING NATURAL LANGUAGE AND PROGRAM STRUCTURE INFORMATION TO IMPROVE SOFTWARE SEARCH AND EXPLORATION

by Emily Hill

Approved: B. David Saunders, Ph.D. Chair of the Department of Computer and Information Sciences

Approved: Michael J. Chajes, Ph.D. Dean of the College of Engineering

Approved: Debra Hess Norris, M.S. Vice Provost for Graduate and Professional Education

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Lori L. Pollock, Ph.D. Professor in charge of dissertation

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Vijay K. Shanker, Ph.D. Professor in charge of dissertation

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Keith S. Decker, Ph.D. Member of dissertation committee

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Brian P. Robinson, Ph.D. Member of dissertation committee

ACKNOWLEDGMENTS

My dissertation did not come forth fully-formed, but is the result of continual support and encouragement by those in my personal and professional life. First, I would like to thank my husband for his constant patience, encouragement, positive attitude, and extra chore duties. In addition, the support of family and friends has been invaluable to my continual progress, and I am grateful for their flexibility in accommodating my work commitments. My colleagues and lab mates have not only been pivotal in helping to develop, evaluate, and publish my work, but also soothed and distracted my mind when I needed it most. I would like to thank my advisors, Dr. Lori Pollock and Dr. Vijay Shanker, for their sustained guidance and support throughout the research, evaluation, and writing process. This endeavor, and my growth as a researcher, would have been impossible without their unceasing patience and effort. Finally, I would like to thank my committee members for their commitment to see me through the doctoral process, as well as for their excellent feedback and guidance.


TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ABSTRACT

Chapter

1 INTRODUCTION
    1.1 Locating Relevant Code
    1.2 Improving Software Maintenance Tools
    1.3 Thesis Overview

2 DEFINING THE PROBLEM: CONCERN LOCATION
    2.1 Motivating Example
        2.1.1 Searching for the Concern
        2.1.2 Exploring the Concern
    2.2 State of the Art in Concern Location
        2.2.1 Static Concern Location
            2.2.1.1 Navigation and Exploration Tools
            2.2.1.2 Search Tools
        2.2.2 Dynamic and Hybrid Concern Location Techniques
        2.2.3 Summary of Limitations
    2.3 Focus of the Dissertation

3 PREPARING THE SOURCE CODE FOR SEARCH
    3.1 Identifier Splitting
        3.1.1 The Identifier (Token) Splitting Problem
        3.1.2 State of the Art
        3.1.3 Automatic Token Splitting with Samurai
            3.1.3.1 Mixed-case Token Splitting
            3.1.3.2 Same-case Token Splitting
            3.1.3.3 String Scoring Function
            3.1.3.4 Analysis of Time and Space
        3.1.4 Evaluation
        3.1.5 Results and Analysis
        3.1.6 Conclusions
    3.2 Abbreviation Expansion
        3.2.1 Types of non-dictionary words
            3.2.1.1 Single-Word Abbreviations
            3.2.1.2 Multi-Word Abbreviations
            3.2.1.3 Other Types of Short Forms
        3.2.2 Automatic abbreviation expansion
            3.2.2.1 Observations and Challenges
            3.2.2.2 State of the Art
        3.2.3 The Scoped Approach
            3.2.3.1 Method-level Matching
            3.2.3.2 Most Frequent Expansion (MFE)
            3.2.3.3 Implementation
            3.2.3.4 Analysis of Time and Space
        3.2.4 Evaluation
            3.2.4.1 Experiment Design
            3.2.4.2 Threats to Validity
            3.2.4.3 Results and Analysis
        3.2.5 Discussion and Future Work
    3.3 Conclusion

4 SELECTING GOOD QUERY WORDS
    4.1 Problem
    4.2 Approach: Capturing Context with Phrases
        4.2.1 Information Extraction
        4.2.2 Search and Hierarchical Categorization
        4.2.3 Implementation
    4.3 Evaluation
        4.3.1 Independent Variable
        4.3.2 Dependent Variables and Measures
        4.3.3 Subjects
        4.3.4 Design and Methodology
        4.3.5 Threats to Validity
    4.4 Results and Analysis
    4.5 Related Work
    4.6 Conclusion

5 DEFINING A SOFTWARE WORD USAGE MODEL (SWUM)
    5.1 Introduction
    5.2 Representing Phrasal Concepts
        5.2.1 Phrase Structure
        5.2.2 Parts of Speech
    5.3 Motivating Examples
    5.4 State of the Art
    5.5 A Software Word Usage Model
        5.5.1 Formal Definition
        5.5.2 Modeling Program Structure in SWUM (SWUM_program)
        5.5.3 Modeling Program Words in SWUM (SWUM_words)
        5.5.4 Modeling Phrase Structure in SWUM (SWUM_core)
        5.5.5 Analysis of Space
    5.6 Automatically Constructing SWUM_core
        5.6.1 Developing SWUM_core Construction Rules
        5.6.2 Constructing SWUM_core
            5.6.2.1 Tagging Method Names and Identifying the Action
            5.6.2.2 Identifying VP Theme and Arguments
            5.6.2.3 Beyond Method Declarations
            5.6.2.4 Current Implementation
            5.6.2.5 Analysis of Time and Space
            5.6.2.6 Developing SWUM Construction Rules for other Languages
    5.7 Comparing SWUM with other NLP-based Identifier Analyses
    5.8 Evaluation
        5.8.1 Applying SWUM to other Software Engineering Tools
        5.8.2 Analysis of Accuracy: Utilizing SWUM for Phrase Generation
        5.8.3 Study Design
        5.8.4 Results and Discussion
    5.9 Conclusion

6 USING SWUM TO SEARCH SOURCE CODE
    6.1 A SWUM Score for Search
        6.1.1 Location
        6.1.2 Semantic Role
        6.1.3 Head Distance
        6.1.4 Usage
        6.1.5 Putting it all Together
    6.2 Evaluation
        6.2.1 Impact of SWUM on Software Search
            6.2.1.1 Design
            6.2.1.2 Results and Discussion
        6.2.2 Comparison With State of the Art
            6.2.2.1 Design
            6.2.2.2 Results
            6.2.2.3 Discussion
    6.3 Related Work
    6.4 Conclusion

7 SEARCHING AND EXPLORING THE SOURCE CODE
    7.1 Motivation
    7.2 Dora the Program Explorer
        7.2.1 Automatically identifying the relevant neighborhood
            7.2.1.1 Components of Dora’s Method Relevance Score
            7.2.1.2 Calculating Dora’s Method Relevance Score
        7.2.2 Experimental Evaluation
            7.2.2.1 Experiment Design
            7.2.2.2 Results
            7.2.2.3 Threats to Validity
            7.2.2.4 Discussion
        7.2.3 Related Work in Program Exploration
            7.2.3.1 Navigation-based Exploration Approaches
            7.2.3.2 Software Architecture Recovery
            7.2.3.3 Program Structure Visualization
            7.2.3.4 Program Structure Representations
        7.2.4 Summary
    7.3 Investigative Study: Integrating Search and Exploration
    7.4 Conclusion

8 CONCLUSIONS AND FUTURE WORK
    8.1 Future Work
        8.1.1 Extensions
        8.1.2 New Directions

BIBLIOGRAPHY

LIST OF FIGURES

1.1  Developer locating relevant code with SWUM-based software engineering tool support.

2.1  Program elements and structural dependences for the “add auction” concern.

2.2  Methods related to triggering an ‘add auction’ event with the surrounding call graph. Of the 40 methods that DoAction calls, only 2 are relevant to adding an auction.

2.3  Methods related to handling a user-triggered ‘add auction’ event, with the surrounding call graph (7 in-edges were not shown to preserve readability). Exploring 57 possible elements to determine 12 relevant ones is no trivial task.

3.1  Accuracy of token splitting approaches per category: dictionary word requiring no split (DW), nondictionary word without split (NDW), underscores and digits (US), camel case (CC), alternating case (AC), same-case (SC).

3.2  Percent of incorrectly split same-case tokens by category: nondictionary word (NDW), underscore/digit separated words (US), camel case separated words (CC).

3.3  Number of correct expansions for each technique (a) by type and (b) by length in letters. To fairly compare all the techniques to LFB, do not include the top most bar for combination word (CW).

4.1  Iterative Query Refinement and Search Process.

4.2  Example results for “convert” query.

4.3  The Contextual Search Process.

4.4  Effort and Effectiveness Results for context_H and V-DO. Effort is measured in terms of the number of queries entered, shown on the left. Effectiveness is measured in terms of the F Measure, shown on the right.

5.1  Example English parse trees and phrase structure for 3 methods.

5.2  Search results and phrasal concepts for “add item” query. All the methods returned by the search contain the words “add” and “item”, but “add item” is not necessarily the main action taken by each method.

5.3  SWUM captures phrasal concepts and links them with program structure and additional word information.

5.4  SWUM for a source code sample to be used as a running example throughout its definition.

5.5  Grammar for SWUM’s phrase structure.

6.1  AUC Results for 6 SWUM and BOW variants.

6.2  Comparing signature-based SWUM search variants by query.

6.3  Comparing information location of SWUM and BOW approaches by query.

6.4  F Measure results for state of the art search techniques.

6.5  Precision results for state of the art search techniques.

6.6  Recall results for state of the art search techniques.

6.7  F Measure results by Concern.

6.8  Precision results by Concern.

6.9  Recall results by Concern.

7.1  Program elements and structural dependences for the “add auction” concern from Chapter 2.

7.2  Precision-Recall Graph for Exploration Techniques.

7.3  F Measure across exploration techniques.

7.4  F Measure for each concern by exploration technique.

7.5  F Measure results for state of the art search techniques and exploring with Dora (S+SWUM and SH+SWUM).

7.6  Recall results for state of the art search techniques and exploring with Dora (S+SWUM and SH+SWUM).

7.7  Precision results for state of the art search techniques and exploring with Dora (S+SWUM and SH+SWUM).

LIST OF TABLES

3.1  Top 20 entries in the most frequent expansion (MFE) list for Java 5.

3.2  Programs used in the evaluation.

3.3  Distribution of short forms in abbreviation types for the gold set. The 5 abbreviation types are acronym (AC), prefix (PR), single letter prefix (SL), dropped letter (DL), combination word (CW), and other (OO). The column ‘NCW’ totals all abbreviation types but CW.

3.4  Percent correct expansions for each technique and abbreviation type. To fairly compare our Scope technique to LFB, use the NCW total accuracy.

4.1  Subject Developer Characteristics.

4.2  Experimental units for comparing context_H with V-DO.

6.1  Concerns used in Evaluation.

7.1  Concerns and queries used in evaluation, in terms of number of methods.

7.2  Program Characteristics for concerns used in the evaluation.

ABSTRACT

Today’s software is large and complex, with systems consisting of millions of lines of code. Developers who are new to a software project face significant challenges in locating code related to their maintenance tasks, such as fixing bugs or adding new features. Developers can simply be assigned a bug and told to fix it, even when they have no idea where to begin. In fact, research has shown that a developer typically spends more time locating and understanding code during maintenance than modifying it. We can significantly reduce the cost of software maintenance by reducing the time and effort needed to find and understand the code relevant to a software maintenance task. In this dissertation, we demonstrate how textual and structural information in source code can be used to improve software search and exploration tools. To facilitate integration of this information into additional software tools, we present a novel model of word usage in software. This model provides software engineering tool designers with access to both structural and linguistic information about the source code, where previously only structural information was available. We utilize textual and structural information to improve software search and program exploration tools, and evaluate them against competing state of the art approaches. Our evaluations show that combining textual and structural information can outperform competing state of the art techniques. Finally, we outline uses of the model to improve software engineering tools beyond program search and exploration.


Chapter 1 INTRODUCTION

Once a software project has been initially developed and deployed, the software’s maintenance cycle begins [GJM02]. Software maintenance is the process of evolving a software system to fix bugs, add new features, or improve performance [LST78]. Despite decades of knowledge that software engineering techniques can reduce software maintenance costs [Boe76, FPB75], focusing on fast initial product releases and leveraging existing legacy systems means that as much as 90% of software life cycle resources are spent on maintenance [Erl00].

When performing software maintenance tasks, developers must first identify the relevant code fragments to be modified, potentially within millions of lines of code. Identifying these code fragments is especially difficult for a newcomer, who is a developer faced with an unfamiliar system [SH98]. In fact, research has shown that newcomers spend more time finding and understanding code than they do actually making modifications [KMC06]. Because today’s software is so large and complex, even developers familiar with one aspect of a system may feel like newcomers when solving maintenance tasks in other components of the same software system. Thus, we can reduce maintenance costs by helping developers to more effectively find and understand code relevant to their maintenance tasks.

1.1 Locating Relevant Code

There are a number of options for finding and understanding code for software maintenance: ask an expert, read documentation, execute test cases, or investigate the actual program code. However, only the code is consistently available for all software projects (even if it cannot be executed). Program code can either be at the human-readable source code level or at the machine-readable binary code level. We focus on source code because our goal is to help human developers understand code for maintenance, which is performed at the source level. In general, approaches to find and understand source code fall into three main categories: browsing, exploring, and searching.

Browsing. A naive way to find code relevant to a maintenance task is to browse the source code by skimming the file structure and scrolling through any files that look potentially relevant. This method is ad-hoc at best, and unlikely to locate relevant code. Studies have shown that successful maintainers are more likely to search and explore the source code, rather than browse it [RC04]. In fact, a recent study showed that newcomers begin maintenance tasks by textually searching the source code 83% of the time, browsing just 17% of the time [KMC06].

Exploring and Navigating. A more reliable way to locate code relevant to a software maintenance task is to explore the source code by navigating its dependences. Similar to hyperlink and domain structures on the web, segments of source code in a software system are connected via dependences. For example, a procedure to multiply two numbers may call a procedure to add two numbers, creating a dependence link between the two procedures. Unlike the web, where these links can be automatically followed, the dependences in software cannot always be precisely resolved. Navigation and exploration tools [CR00, Rob05, SFDB07, SKM06, Tip95] use different approaches to resolve the dependences. Following the “links” created by these dependences, navigation tools save the developer time by jumping directly from one relevant segment of code to another segment related by dependences. However, a single segment of code may be connected to tens or hundreds of others, when only a handful are relevant to the maintenance task. Although manually navigating these links one at a time can save the developer time, recursively exploring them automatically can be overwhelming. In addition, for newcomers to a software system, the downside of navigation and exploration tools is that the developer needs to have a place to start exploring within the system, potentially within millions of lines of code. Since newcomers are unfamiliar with the system, this can be a non-trivial task. One way newcomers can locate starting points is to search the software.

Searching. Similar to how we use Google to search the web, software search tools use natural language queries and perform string matching with the comments and identifiers in the source code [MSRM04, PMR+06]. Search tools can quickly find relevant code scattered across a large code base, but can also return irrelevant results because the semantics of the words in the query and the code are not taken into account. For example, a query containing the word “add” can return results for appending items to a list as well as performing a mathematical sum. For most code searches, only one definition of add is desired.
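To make the ambiguity concrete, consider the following illustrative Java sketch (the class and method names are hypothetical, not taken from any of the subject programs in this dissertation). Both methods match a string-based search for “add”, yet they implement unrelated concepts, so at most one of them is likely to be relevant to any given query:

    import java.util.ArrayList;
    import java.util.List;

    // Two unrelated meanings of the word "add" in one class.
    public class AddExamples {
        private final List<String> items = new ArrayList<>();

        // "add" in the sense of appending an element to a collection
        public void addItem(String item) {
            items.add(item);
        }

        // "add" in the sense of computing a mathematical sum
        public int addValues(int a, int b) {
            return a + b;
        }
    }

A purely lexical search treats both occurrences of “add” identically; distinguishing them requires knowing how each word is used, which is the kind of information the techniques in this dissertation set out to capture.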

1.2 Improving Software Maintenance Tools

Tools for both exploring and searching source code have benefits for newcomers, but are hampered by limitations. Navigation tools are relatively precise, but require a starting point in the system. Search tools require only a natural language query, but can return many irrelevant results. We propose improving software maintenance tools by leveraging the advantages of search and exploration tools, while limiting their drawbacks. Specifically, our key insight is that both the program dependences used by exploration tools and the natural language information used by search tools are crucial to improving software maintenance tools.

When initially developing software, developers express in the code the higher-level concepts that they are trying to implement. Specifically, the algorithm is expressed through the programming language syntax and semantics, such as dependences, which we call program structure. In addition, the domain concepts are expressed through the natural language information used in meaningful identifier names and comments. Thus, as developers continually read and modify the code throughout development and maintenance, they create code artifacts that are readable as well as runnable [LBS06]. Since existing software maintenance tools typically use either program structure or natural language information, they are missing significant information that a developer uses when reading and modifying code.

1.3 Thesis Overview

In this dissertation, we address the problem of concern location for a newcomer. By leveraging the natural language information and the programming language structure and semantics in the source code, we can create program search and exploration techniques that help reduce the high cost of software maintenance. To facilitate access to this information, we propose a novel model of software word usage which leverages natural language and structure information to improve software maintenance tools, and thus reduce maintenance costs. Specifically, this dissertation makes the following contributions in helping developers locate and understand code relevant to a maintenance task:

• Source code preparation techniques which increase search accuracy by improving query word to code matching [EHPVS09, HFB+08]. Specifically, we present identifier splitting [EHPVS09] and abbreviation expansion [HFB+08] techniques that are more accurate than the existing state of the art. Both techniques help improve search accuracy by ensuring that the natural language query words match the words extracted from the source code.

• A novel query reformulation technique and tool that enables the developer to (1) quickly discriminate between relevant and irrelevant results, and (2) spark further query refinements [HPVS09]. Our approach automatically captures the context of query words in source code by extracting and generating natural language phrases from method and field signatures. An empirical evaluation with 22 developers demonstrates that our technique significantly outperforms the competing state of the art in terms of effort and effectiveness.

• Design and implementation of a novel Software Word Usage Model (SWUM) that captures the conceptual knowledge of the programmer as expressed in both natural language information and programming language structure and semantics. SWUM is designed to be an interface between software engineering tool designers and researchers working on improved linguistic analyses for software. We present an algorithm to automatically build SWUM for Java methods, types, and variables.

• A SWUM-based scoring technique and tool for searching software that is easier to use and more consistently effective than existing approaches. In our evaluation, we analyze the effects of SWUM’s various components on software search as well as compare SWUM-based search with competing state of the art approaches.

• A novel program exploration technique that leverages textual and program structure information [HPVS07]. Our experimental results demonstrate that an integrated textual- and structural-based approach is significantly more effective than a state-of-the-art structural program exploration technique. In addition, we investigate what impact exploration has on improving search results for concern location.

Figure 1.1 provides an overview of how the above contributions relate to the overall concern location process. The developer begins with an information need relating to his software maintenance task, called a concern. The developer translates this information need into a query describing the concern. Rather than executing this query on the source code directly, we propose first preprocessing the source code using source code preparation techniques such as identifier splitting and abbreviation expansion. The prepared source code is then used to construct a Software Word Usage Model (SWUM) of the source code. The prepared source code and the SWUM are used by software engineering applications such as query reformulation, software search, and program exploration to find segments of code related to the developer’s information need. The developer then views these search results to determine whether his information need is satisfied, repeating the process as necessary.

Figure 1.1: Developer locating relevant code with SWUM-based software engineering tool support.

The remainder of this dissertation is organized as follows. Chapter 2 defines the problem of concern location and provides an overview of its challenges and existing solutions. Chapter 3 specifies how we prepare the source code for search by applying identifier splitting and abbreviation expansion techniques. Chapter 4 presents our solution to helping the developer select query words and determine result relevance. Chapters 5–6 define our software word usage model and how we leverage that model for concern location in search. In Chapter 7, we present our novel exploration technique that combines both textual and structural information, and investigate the impact of exploration in combination with search for concern location.

Chapter 2 DEFINING THE PROBLEM: CONCERN LOCATION

The code relevant to a maintenance task typically involves one or more concerns of the software system. A concern is anything stakeholders of the software consider to be a conceptual unit, such as features, requirements, design idioms, or implementation mechanisms [RM07]. Before making any changes to the code when solving a maintenance task, it is important for the developer to locate and understand all the concern’s elements, or risk making incorrect modifications [LS86, RC04]. In addition, poor understanding of the concern could lead to changes that do not fit the software’s overall design and introduce future bugs [Par94]. Thus, concern location is an important step in a successful software maintenance process.

Locating a concern in source code can be especially difficult if the relevant code is scattered throughout the system, rather than located in a single file [LS86]. For example, in an object-oriented auction-sniping program, all of the code relating to an auction entry is stored in an AuctionEntry class. Code for adding, removing, or updating an auction entry is easier for a software maintainer to find and modify because it is all located in the AuctionEntry class file. In contrast, the logging functionality in the Tomcat web server cuts across 47 of the 148 source files [RM02]. Developers need techniques for locating concerns in source code to help reduce maintenance costs. Although many different approaches to concern location exist, we believe software search followed by exploration to be the most flexible and suitable technique for newcomers.

In this chapter, we present an overview of the concern location problem and existing solutions. We begin with a motivating example demonstrating how search and exploration can be used in concern location, as well as the challenges involved. Next we present the state of the art in concern location, followed by a summary of the limitations of existing techniques and an overview of the focus of this dissertation.

2.1 Motivating Example

As an example, consider the jBidWatcher application, an auction bidding, sniping, and tracking tool for online auction sites such as eBay or Yahoo. jBidWatcher includes a unique and powerful sniping feature. A snipe is a bid placed in the closing seconds of an auction in an attempt to snatch final victory away from the previous bidder. Before a user can bid on an auction, she must add the auction to the user view and data structures of the system.

Consider the following maintenance scenario. A developer is notified of a bug in the software: users have been unable to add auctions to the main program window. Specifically, users select the ‘add auction’ button but there is no response from the system. The code necessary to understand this bug would include the methods and fields that handle the user-triggered event, as well as those that add the auction to the data structures of the system.

Figure 2.1 shows the code related to the add auction concern. Oval nodes represent methods while boxes represent fields. It should be noted that some aspects of a concern may be subjective, since developers may require different amounts of information depending on their programming experience, expertise in the language, and knowledge of the underlying system. The nodes with solid lines represent the core concern nodes, which are likely to be useful to anyone understanding the concern. Dotted nodes represent selected calling context information, while dashed nodes represent lower-level implementation details. For example, the two egg-shaped methods, Map.put (X) and Set.add (Y), represent Java library methods.

Figure 2.1: Program elements and structural dependences for the “add auction” concern.

In the top left of Figure 2.1 are the methods that handle the add auction event triggered by user input: DoAction (A), DoAdd (B), and DoPasteFromClipboard (C). DoAdd prompts the user to enter an auction number and then stores this information in a system-wide queue. DoPasteFromClipboard adds an auction via the system’s copy and paste functionality and also adds the “add auction” event to the queue. The field ADD_AUCTION is a constant used to represent the “add auction” action within the queue.

After the “add auction” event is added to the event queue, another group of methods is responsible for actually adding the auction to the user view and internal data structures. JBidMouse.messageAction (D) takes action events from the queue and directs them to the appropriate handler methods, in this case, cmdAddAuction (E). This method is the start of a chain of method calls, culminating in the auction being added to the system. Its immediate callee, JBidMouse.addAuction (F), consists of two steps: creating the new auction entry and loading its pertinent information from the internet, and then adding the newly created entry to the system by calling AuctionsManager.addEntry (G).

2.1.1 Searching for the Concern

The first step in concern location is identifying a starting point in the source code that appears relevant to the maintenance task. This starting element can originate from the developer’s prior knowledge, from expert recommendations, or from search tools. If the developer is familiar with the system, they would quickly realize that there is a bug somewhere in the code that handles user-triggered add auction events. They could immediately begin exploring the source code looking for potential causes of the bug. However, if the developer is a newcomer to the system and no expert is available, they will likely need to begin by searching the source code.

The first challenge in searching source code is formulating the query. A newcomer considering this problem might not immediately formulate the query “add auction”. Other plausible queries include “enter auction”, “load auction”, or “create new auction”. However, given that the user interface menu uses the phrase “add auction”, it is likely that even a newcomer will be able to construct a good query for this particular example.

Not all concerns are so easy to construct queries for. Consider a report management system that allows users to add text fields to a report. In the user interface, this feature is called “add textfield”. However, the code implements this feature as dropNewTextField using a TextReportElement constructor. Searching for “add textfield” returns 67 statements across 25 methods, none of which are relevant.

Once the developer has constructed a query and executed it using a search technique, they have to search the results for segments of source code relevant to the concern. Poor queries can return few relevant results, or an overwhelming number of irrelevant results. For example, consider searching using a built-in search feature in a common development environment. Searching for the query “add auction” yields 50 methods, fewer than 15 of which are actually part of the concern. Thus, searching for starting program elements in concern location is a powerful tool, but can be hampered by few relevant results or too many irrelevant results.
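The report management example above can be sketched as follows; the method body and coordinates are hypothetical, but the identifiers dropNewTextField and TextReportElement follow the names given in the text:

    // The user interface calls this feature "add textfield", but the code
    // expresses it with "drop" and "TextReportElement", so a query built
    // from the user-interface vocabulary matches poorly.
    public class ReportDesigner {
        public void dropNewTextField(int x, int y) {
            TextReportElement element = new TextReportElement(x, y);
            // ... insert the element into the report model ...
        }
    }

    class TextReportElement {
        TextReportElement(int x, int y) {
            // hypothetical constructor; details omitted
        }
    }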

2.1.2 Exploring the Concern

Once the developer has identified one or more starting elements, they need to somehow locate the remaining code that is relevant to the maintenance task. Most developers will investigate code by following structural links leading from the starting element, such as links in the call graph or data dependence graph. The process of locating these remaining relevant program elements is called program exploration.

Figure 2.2: Methods related to triggering an ‘add auction’ event with the surrounding call graph. Of the 40 methods that DoAction calls, only 2 are relevant to adding an auction.

13

JBidMouse. cmdAddAuction (String)

JBidMouse. addAuction (String)

AuctionsManager. newAuctionEntry (String)

AuctionsManager. addEntry (AuctionEntry)

AuctionEntry. AuctionEntry (String)

AuctionEntry. prepareAuctionEntry (String)

FilterManager. addAuction (AuctionEntry)

Auctions. addEntry (AuctionEntry)

AuctionServer. addAuction (String)

AuctionServer. addAuction (URL, String)

AuctionServerManager. add_entry (AuctionEntry)

AuctionServer. registerAuction (AuctionEntry)

Figure 2.3: Methods related to handling a user-triggered ‘add auction’ event, with the surrounding call graph (7 in-edges were not shown to preserve readability). Exploring 57 possible elements to determine 12 relevant ones is no trivial task. 2.2

State of the Art in Concern Location When a newcomer is faced with locating a concern for a maintenance task,

she has a number of automated and non-automated options. First, she can look for an expert for the system and ask them. Unfortunately, experts for the system can be unavailable [KDV07], nonexistent for legacy systems, or the developer may feel uncomfortable approaching the expert with her question [YYN07]. Alternatively, she could consult an expert-by-proxy by looking at data collected during program investigation activities of other developers [RM03], but this information must be recorded ahead of time and will not always exist. Another option is to investigate

14

the documentation associated with the system. Although the documentation may give her clues to understanding the system, she must still locate the relevant code that might be related to the documentation. This is a non-trivial task [ACC+ 02, MM03, LFOT07] for which automated tools have been developed. Her final nonautomated option is to haphazardly browse the source code looking for relevant code, but this is not the most successful or efficient concern location technique [RC04]. Since we are focused on reducing maintenance costs, automated and semiatuomated tools are the best option for concern location. These tools can be classified by the type of source code information they use: static, dynamic, or hybrids of both static and dynamic information. Although this dissertation focuses on static approaches, we include a brief overview of the dynamic and hybrid approaches. 2.2.1

Static Concern Location Static approaches locate the program elements of a concern using only in-

formation available from the source code directly, without executing the software. There are two main categories of static approaches: navigation tools and search tools. 2.2.1.1

Navigation and Exploration Tools

Navigation and exploration tools help developers explore the program structure from a starting point in the code. In general, these fall into two main categories: semi-automated approaches, which provide automatically gathered information to the user but require the developer to initiate every navigation step (stepwise), and approaches that automatically traverse the program structure and return many related elements without user intervention (recursive). Before discussing these different types of navigation tools, we first describe some of the most popular ways to represent program structure.

15

Representing Program Structure The basic representation a program is an abstract syntax tree (AST), which is a hierarchical representation of the syntactic structure of the program. AST nodes are either leaf nodes or intermediate nodes. Leaf nodes are typically identifiers or constants (e.g., the number 7 or the string “price”), while intermediate nodes represent grammatical phrases in the language (e.g., a statement node that encapsulates the expression y = m ∗ x + b). A single function or method can be represented by a control flow graph or data dependence graph [Muc97]. A control flow graph (CFG) captures the control relationships in a method. A CFG contains a node for each contiguous block of code in the method, and contains an edge for every way that control can flow between the blocks. Data dependence graphs complement control flow graphs by capturing data relationships in a program [Muc97]. A data dependence graph typically contains a node for each operation or statement in a method, labeled with variables defined by the programmer, and edges for how the data values are combined to form new values. Any statement that assigns a value to a variable x is said to define x at that program point, and any statement that otherwise reads x is said to use x. A use of variable y is said to be data dependent on the definition of a variable x if x is somehow used to generate the value of y at that use of y. Data flow analysis can be performed to identify definition-use (def-use) chains, which link the definition of a variable with its uses. A def-use chain contains the definition of a variable and all uses it may flow to [Muc97]. Control and data flow information can be used to form a program dependence graph (PDG) [FOW87, HR92], often used in exploration tools [SFB07, Tip95, XQZ+ 05]. By default, control flow, data flow, and program dependence graphs are defined within a method (intraprocedurally), but can be extended for the entire program (interprocedurally).

16

At a higher level of abstraction, there are call graphs and type hierarchy graphs. A call graph captures the calling relationships in a program without accounting for calling order or the number of calls [Muc97]. There exists a call graph node for every method in the program, and edges represent the calling relationships between methods. For example, if method a may call method b directly, then a is a caller of b, b is a callee of a, and there exists an edge from a to b. The number of outgoing call edges from a call graph node is called fan-out, and the number of in edges is fan-in [Som10]. A type hierarchy graph captures the inheritance relationships in a program [Muc97]. A type hierarchy contains a node for every class in the program. If an edge exists from type A to type B, then B inherits functionality and/or information from A. Stepwise Navigation Tools Stepwise navigation tools begin from a relevant starting element and allow developers to explore structurally related program elements such as methods, fields, or classes. Some navigation tools allow developers to query structurally connected components one edge away [CR00, RM02, SKM06] or recommend structurally related elements 1-2 edges away [Rob05, SFDB07]. To access connected components, these tools use interprocedural information such as the call graph, type hierarchy, or field def-use information. Stepwise navigation tools suggest manageable numbers of elements to be investigated, but provide limited contextual information since the developer is only presented a small neighborhood of program elements at each step. Each successive structural element to be explored must be manually selected. If a developer were to use a stepwise navigation tool for the “add auction” concern, the developer would potentially have to initiate as many as 15 exploration steps. Chapter 7 provides more specific details about stepwise navigation tools.

17

Recursive Exploration Tools In contrast, recursive exploration tools provide more structural context by automatically exploring many structural edges away from the starting element [SFB07, Tip95, XQZ+ 05, YT07] (e.g., by including callers 5 edges up the call chain). For instance, program slicing identifies which elements of a program may affect the data values computed at some point of interest, usually by following edges in a program dependence graph [Tip95]. Unfortunately, program slices tend to be overwhelmingly large for human viewing and can be expensive to calculate on large code bases [SFB07]. As demonstrated by traditional slicing, the number of structurally connected components can grow very quickly as new program elements are added to the result set. To address this problem, some recursive navigation tools employ filtering techniques to eliminate unnecessary results. For example, Thin Slicing has been proposed as a way to reduce traditional program slices [SFB07]. For a statement s, thin slicing only includes statements that are part of the chain of assignments that computes and copies values in s. The remaining control flow statements and statements affecting pointers and heap accesses are filtered away. In addition, a textual similarity metric has been used as a stopping criteria in slicing [INMI07], which is another way of filtering. Rather than explore data dependences, some recursive navigation techniques reduce expense by exploring the call graph [HPVS07, YT07]. One approach is to filter based purely on call graph information such as the number of edges away from the starting element or the number of callees [YT07]. These filters can be further refined by using textual information [HPVS07]. Recursive exploration tools have an advantage over stepwise navigation tools in that they require less effort on behalf of the user to deliver a contextualized, local view of the relevant program elements. However, the challenge becomes defining

18

how to filter out irrelevant results. With recursive exploration, just one irrelevant result can quickly lead to many irrelevant results. Other techniques help the developer manage context as they explore by keeping track of the history of previously explored elements [JV03, KM05], by allowing developers to record pertinent artifacts [RM02], or by visualizing the exploration process [Bal04, BD06, DL05, JH06, SKM06, SEHM06, SM95]. Although such context management techniques automatically help the developer keep track of what has already been explored, these approaches do not automatically suggest elements of potential interest that the developer has not explored yet. 2.2.1.2

Search Tools

In contrast to navigation tools, which require a seed starting point in the source code, search tools allow the developer to identify relevant program elements scattered throughout the code from just a natural language query [MSRM04, PMD06, PPM+ 06, SFH+ 07]. Given a natural language query, search tools output a set of program elements relevant to the query. Search tools calculate relevance by scoring each program element based on the query and the comments and identifiers in the code. Although search tools tend to be easier for newcomers to use, they come with their own set of challenges. Search tools essentially treat concern location as an information retrieval problem [MRS08], which leads to three decisions in designing a search-based solution: information need, query mechanism, and document granularity. Information Need The information need for concern location differs from the more familiar information need of searching the web. When we search the internet, we are typically looking for results to answer a question, such as, “What is a rutabaga?”, “When 19

was Dairy Queen founded?”, or “What is the cheapest price for a MacBook?”. In answering these questions, just one relevant document will satisfy the information need. For example, to answer the information need, “What is a rutabaga?”, the search engine need only return one highly ranked document that defines a rutabaga, rather than every document which discusses rutabagas. In contrast, when a developer is looking for a concern, his information need will only be satisfied by returning every relevant document that is part of the concern. This makes the search task more difficult, since it is unlikely that all elements of the concern will be highly ranked by a single query or search mechanism. Query Mechanism There are a number of different types of queries that a developer can use when searching source code. In conjunction with identifier names or other textual ˘ information, structural queries [AJL+ 09, CM03, dAM08, HVdM06, FM10, HM05, JV03, KBT07, PP96, ZW95] use structural information such as call and use relationships [HM05, JV03] or type information [JV03, ZW95] to search software. Some structural approaches search across multiple sources of information [AJL+ 09, dAM08, FM10, KBT07], such as documentation or e-mails, in addition to the source code itself. Like navigation tools, structural queries require either a program element starting point or a known identifier name. Because newcomers by definition have little experience with the software system, a newcomer is unlikely to have a program element starting point or a known identifier name unless they have expert advice. An alternative to structural queries are text-based queries, such as regular expressions or natural language queries. Regular expression (RE) queries consist of a sequence of characters with optional wild cards or character classes, similar to the queries used to search with grep. To be effective, RE queries require the developer to have detailed knowledge of identifiers and naming conventions used in the system. 20

Because RE queries match based on sequences of characters, the developer must know exactly what words were used to describe the concept in the code. Even if the developer guesses the correct words, the developer must be aware of the naming conventions used in the code base to construct an effective query. If a developer searches for the query “remove*auction” but the code base implements the concept using “deleteAuction”, the regular expression will fail to return any relevant results. Since newcomers are unlikely to have such detailed knowledge of the word usage and naming conventions in a software system, regular expression queries can be difficult to use. In contrast, natural language (NL) queries [MSRM04, Mic02, LBN+ 09], or keyword-style queries [FHR91, PKD09], require no such prior knowledge of the code base. Because NL queries consist of a sequence of words, search tools can take advantage of the semantics of the word. For example, NL queries can be expanded with synonyms, or can match based on the root form of a query word. In addition, NL queries are considered to be more intuitive than structural queries. In fact, W¨ ursch, et al. recently developed a natural language query mechanism to wrap around structurally-oriented developer queries [WGRG10] because NL queries are so much more intuitive. Document Granularity Another decision in searching software is defining the concept of a document. In natural language texts, a document can be a web page, an article, or even an entire book. Similarly for searching software, we must first define a document. For object-oriented source code, we have the following document granularities, from coarse to fine granularity: whole system, package, file, class, program element (i.e., methods and fields), block, statement. Which level of granularity makes the most sense when searching software for maintenance?


When searching during software maintenance, a newcomer is searching within one specific project, so the whole-system level of granularity is of little use. Further, search results at the package level can still return hundreds to millions of lines of code, which will do little to reduce maintenance time. Likewise, files and classes can contain hundreds of lines of code. Recall that a concern can be scattered in files across a system [LS86]. The fine granularity of methods and fields can locate scattered concerns without pulling in unnecessary information from the rest of the class or file, which may be unrelated. Statements are at an even finer level of granularity, can also locate scattered concerns, and are often returned by regular expression queries. The drawback of statement-level results is that the results are presented at the most detailed level of source code, devoid of contextual information. To understand the document or determine its relevance, the developer has to look at the statement in the context of the method it occurs in. In terms of understandability, we believe program elements within a class are at a manageable level of granularity for search results and at an appropriate level of abstraction for comprehension.

Although program element granularity can lead to better usability and effectiveness than other document granularities, it also makes the search problem more difficult, since program elements can contain varying numbers of words as compared to NL documents. For example, in a seminal study of information retrieval methods [SB88], Salton used 6 document collections containing between 1,000 and 12,700 documents. The average number of words in these documents ranged from 20 to 53, with standard deviations of 10–23. In contrast, a sample of 8 programs containing between 19 KLOC and 1.5 million LOC consists of 3,000–20,000 program element “documents”. The average number of words in program elements ranged from 14–30, with standard deviations of 15–120. These differences illustrate how “documents” in software can significantly differ from their NL counterparts, upon which most information retrieval methods were developed.


Putting Existing Search Tools into Context

Some approaches directly apply traditional information retrieval techniques to the problem of concern location [GCS08, LKE08, MSRM04, PMD06, PPM+ 06]. These techniques use natural language queries and can be used at different levels of granularity. Chapter 6 presents specific details of these approaches. In contrast, Sourcerer [LBN+ 09] uses keyword queries in addition to program structure to find relevant examples in a large repository of open source projects, targeting code reuse rather than concern location. The information need for reuse is closer to the more traditional needs of web-based search than to concern location. Developers can also use architecture recovery techniques as a search technique for concern location [Big89, MNS01, PFGJ02]. During architecture recovery, developers use RE queries in conjunction with source structure models to locate high-level concepts in code. The developer is responsible for articulating the key concepts of the maintenance task in a regular expression query that may also require specifications as to which code structures should be searched [MNS01, PFGJ02].

Some search-based concern location approaches automatically link documentation to source code by using textual information [ACC+ 02, MM03] or a combination of textual and structural information [ZZL+ 06]. Although these approaches are fully automatic, they require accurate documentation in addition to meaningful identifier names; incomplete, nonexistent, or inaccurate documentation can hinder their search effectiveness. In addition, the documentation may be written at a coarser granularity than the maintenance task being completed. With no mechanism to retrieve finer granularity matches, the developer is forced to search for only those features that are at the documentation granularity.

Combining Natural Language and Static Program Structure Information

Combining natural language and structural information has been proposed as an effective way to identify concerns [Big89], and some search techniques have

incorporated this information to improve accuracy [SFH+ 07]. The FindConcept technique [SFH+ 07] builds a natural language representation, called the Action-Oriented Identifier Graph (AOIG), of the verb and direct object pairs for methods in source code. The key idea behind FindConcept is that methods form the core elements of concerns, and that the verb and direct object of a method’s signature and comments capture the method’s intent. The AOIG takes advantage of the fact that the same verb may act on many objects in a system, that a single object may be acted on by many verbs, and that the relationship between any one verb and any one object differentiates them from the rest. Program structure information, such as a method’s defining class, is used to accurately extract these verb and direct object pairs and link them with the underlying source code. The AOIG representation of verb and direct object pairs is used to help the user interactively construct queries by suggesting additional verbs and objects that co-occur with the initial query words in the source code. Then, the query is used to search the AOIG and return a set of program elements linked from the AOIG’s verb-object pairs. Although FindConcept was shown to have higher precision and recall overall than two existing approaches which only used NL information [SFH+ 07], there were some concern location tasks for which the existing techniques worked as well or better. Thus, there is still room for improvement in combining natural language and program structure information to create a more effective concern location tool.

2.2.2 Dynamic and Hybrid Concern Location Techniques

Dynamic concern location approaches use information from executing the source code in addition to the source code itself [EAAG08, EKS03, EV05, KQ05, LMPR07, PGM+ 07, RDP10, WS95, WHGT99]. The intuition is that methods and statements executed by a test suite are likely to be relevant to the concern. However, dynamic approaches miss relevant results that are not executed by any

test cases, and, to be effective, depend on either a test suite or the developer to execute the areas of interest in the code. To combat the drawbacks of using dynamic information, hybrid approaches that use dynamic information in conjunction with static program structure and/or textual information have been proposed [EAAG08, EKS03, LMPR07, PGM+ 07, RDP10, ZZL+ 06]. These hybrid approaches outperform existing dynamic approaches [RDP10], but still suffer from the same limitation of using dynamic information: the possibility that relevant, unexecuted program elements will be missed.

2.2.3 Summary of Limitations

Unlike dynamic approaches, static approaches to concern location do not require that the developer be able to execute the concern in the code, making them more suitable for newcomers. Recursive exploration tools provide the developer with local structural context, but require a starting element and carefully defined filters to ensure the results do not overwhelm the developer. Search tools only require a natural language query, rather than a starting program element, but their accuracy largely depends on how well the query discriminates between relevant and irrelevant results in addition to how the search mechanism matches program elements to the query.

2.3 Focus of the Dissertation

To reduce maintenance costs and to be widely applicable, we believe a concern location approach should be static, take a natural language query as input, and provide support for recursive exploration to put the search results into the context of the surrounding program elements. Our key insight is that (1) the higher level conceptual knowledge in code can be captured by analyzing the semantics of the program structure in conjunction with the natural language semantics of the software’s identifiers; and (2) this higher level semantic information can be leveraged

to create software engineering tools that are more effective, and potentially easier to use, than existing techniques. In this dissertation, we present a novel model of software word usage capable of capturing this conceptual knowledge, and leverage this semantic information to create improved search and exploration techniques for concern location.


Chapter 3 PREPARING THE SOURCE CODE FOR SEARCH

The advantage of using a natural language query to search source code is in its intuitive entry mechanism. However, there is a trade-off between intuitive NL queries and their ability to match relevant source code. Unlike traditional natural language text, identifiers in source code are not space-delimited (e.g., strlength). This requires a preprocessing step to prepare source code for search. Further, developers often make use of abbreviations to reduce typing effort (e.g., ASTVisitor). In fact, some abbreviations are more prevalent than their expanded counterparts. To combat these challenges, we have developed two approaches to automatically prepare the source code for search. In collaboration with Eric Enslen, et al. [EHPVS09], we have developed improved automatic identifier splitting techniques. In addition, we have developed an automatic abbreviation expansion technique for source code. Both of these techniques help improve search accuracy by ensuring the NL query words match the words extracted from the source code (e.g., ensuring the query word “string” will match the abbreviation str in the identifier strlength). This chapter describes our automatic identifier splitting and abbreviation expansion techniques, as well as an evaluation comparing these techniques with competing state of the art approaches.
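As a concrete illustration of how these two preprocessing steps work together at search time, the following minimal Java sketch normalizes an identifier before matching it against query words. The split and expand helpers here are deliberately trivial stand-ins (a camel-case regex and a hand-made two-entry map), not the Samurai or scoped-expansion techniques described in the rest of this chapter.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;

    public class SearchPreprocessor {
        // Trivial stand-in for the splitter of Section 3.1: split at simple
        // camel-case boundaries, underscores, and digits only.
        static List<String> split(String token) {
            return Arrays.asList(token.split("(?<=[a-z])(?=[A-Z])|_|\\d+"));
        }

        // Trivial stand-in for the expander of Section 3.2: a hand-made map.
        static String expand(String word) {
            Map<String, String> demo = Map.of("str", "string", "len", "length");
            return demo.getOrDefault(word.toLowerCase(), word.toLowerCase());
        }

        public static void main(String[] args) {
            // The query word "string" now matches the abbreviation "str".
            for (String part : split("strLength")) {
                System.out.println(part + " -> " + expand(part)); // str -> string, Length -> length
            }
        }
    }

Identifiers such as strlength, with no camel-case clue at all, defeat this naive splitter entirely; handling them is exactly what the splitting techniques of Section 3.1 address.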


3.1 Identifier Splitting

When taking advantage of the natural language information in source code, the first step is to split the identifiers into their constituent words. Programmers frequently create identifiers with multiple words, called multi-word identifiers [EHPVS09], to name the entity they want to represent (e.g., output file, mp3 list). Words within these identifiers can be delimited by non-alphabetic characters (e.g., ‘_’ and numbers) or capital letters (e.g., toString, ASTVisitorTree, newValidatingXMLInputStream, jLabel6, buildXMLforComposite). Although identifiers following such simple coding conventions are relatively straightforward to split automatically, the challenge is splitting multi-word identifiers where existing naming conventions break down or do not exist (e.g., DAYSforMONTH, GPSstate, SIMPLETYPENAME).

Why is it so important that these identifiers be split correctly? When using NL queries to search software, poorly split identifiers can reduce accuracy. For example, consider searching for a feature that adds a text field to a report in a GUI-based report processing system. In the implementation, the developers are inconsistent with how the concept of a “text field” is textually represented. The method responsible for adding the text field to the system uses capital letters to indicate splits, addTextField. However, the GUI method responsible for initiating the action does not use any splits, referring to the concept as textfield. Thus, without correct identifier splitting, search techniques will never return all relevant pieces of code for a single query (either “text field” or “textfield”).

3.1.1 The Identifier (Token) Splitting Problem

Although the motivation for splitting arises from multi-word identifiers, splitting can also be applied to string literals and comments. In fact, identifiers frequently appear in Javadoc comments as well as in sections of code that have been commented out. Thus, we focus on splitting tokens, which may be program identifiers or space-delimited strings appearing in code comments or string literals.

Token splitting is the problem of partitioning an arbitrary token into its constituent concept words, which are typically dictionary words and abbreviations. The general form of a token is a sequence of letters, digits, and special characters. In addition to using digits and special characters, another common convention for indicating word splits is camel casing [CT99, DP06, LFB07b, LBS06]. When using camel case, the first letter of every word in an identifier is capitalized (thus giving the identifier the look of a humped camel). Using capital letters to delimit words requires less typing than using special characters, while preserving readability. For example, parseTable is easier to read than parsetable. Although initially used to improve readability, camel casing can also help to split tokens for static analysis tools that use lexical information.

However, camel casing is not well-defined in certain situations, and may be modified to improve readability. Specifically, no convention exists for including acronyms within camel case tokens. For example, the whole abbreviation may be capitalized, as in ConvertASCIItoUTF, or just the first letter, as in SqlList. The decision depends on the readability of the token. In particular, SqlList is arguably more readable than SQLList, and more closely follows camel case guidelines than SQLlist. Strict camel casing may be sacrificed for readability, especially for prepositions and conjunctions, as in DAYSforMONTH, convertCEtoString, or PrintPRandOSError. In some instances, no delimiters are used for very common multi-word concepts, such as sizeof or hostname. Thus, although camel case conventions exist, different decisions are made in the interest of readability and faster typing.

Formally, we define a token t = (s0, s1, s2, ..., sn), where si is a letter, digit, or special character. The trivial first step in token splitting is to separate the token before and after each sequence of special characters and digits. Each substring is then considered as a candidate token to be further split. We refer to any substrings left after these first trivial splits as alphabetic tokens. An alphabetic token is a sequence of alternating upper and lower case letters. For example, eof, Database, startCDATA, ConvertASCIItoUTF, and buildXMLforComposite are all alphabetic tokens with varying numbers of alternations and sizes of substrings in a same-case sequence. For alphabetic tokens, there are four possible cases to consider in deciding whether to split at a given point between si and sj:

1. si is lower case and sj is upper case (e.g., getString, setPoint)

2. si is upper case and sj is lower case (e.g., getMAXstring, GPSstate, ASTVisitor)

3. both si and sj are lower case (e.g., notype, databasefield, actionparameters)

4. both si and sj are upper case (e.g., USERLIB, NONNEGATIVEDECIMALTYPE, COUNTRYCODE)

Case (1) is the natural place to split for straightforward camel case without abbreviations (e.g., isCellEditable, getDescription). However, the examples for case (2) demonstrate how following strict camel casing can produce incorrect splits (e.g., get MA Xstring, GP Sstate). We call this problem of deciding where to split between alternating lower and upper case letters the mixed-case token splitting problem. The mixed-case token splitting problem is particularly complicated by the use of acronyms. We refer to cases (3) and (4) as the same-case token splitting problem. The programmer has not used camel case within this token, and thus has provided no clues as to whether any individual words, or concepts, should be extracted.

A fully automatic program token splitting algorithm should automatically solve both the mixed-case and the same-case token splitting subproblems effectively. The algorithm should be capable of splitting a token into an arbitrary number of substrings that represent different concepts. The client software analysis tool can always merge together words that were split but could be considered together as a single concept. For example, we observed that more experienced Java programmers

would consider javax to be a single concept, the Javax API, while a novice would consider ‘java’ and ‘x’ to be separate words.
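For reference, the following sketch shows the kind of conservative division-marker splitting (later evaluated as conserv in Section 3.1.4) that handles case (1) and the straightforward part of case (2): split at special characters and digits, at every lower-to-upper transition, and before the last upper case letter in an upper case run followed by lower case. The regular expressions here are illustrative assumptions, not the evaluated implementation.

    public class ConservativeSplitter {
        // Conservative division-marker splitting: special characters and digits
        // delimit words, every lower-to-upper transition is split, and in a run
        // of upper case letters followed by lower case the split is placed
        // before the last upper case letter.
        static String split(String token) {
            return token
                .replaceAll("[^A-Za-z0-9]+", " ")                                 // special characters
                .replaceAll("(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])", " ") // letter/digit boundaries
                .replaceAll("(?<=[a-z])(?=[A-Z])", " ")                           // case (1)
                .replaceAll("(?<=[A-Z])(?=[A-Z][a-z])", " ")                      // acronym run, case (2)
                .trim();
        }

        public static void main(String[] args) {
            System.out.println(split("setIdentifierType")); // set Identifier Type
            System.out.println(split("ASTVisitorTree"));    // AST Visitor Tree
            System.out.println(split("DAYSforMONTH"));      // DAY Sfor MONTH (the known failure case)
        }
    }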

3.1.2 State of the Art

To our knowledge, Feild, Binkley, and Lawrie [FBL06, LFB07a, LFB07c] are the only other researchers to develop and evaluate techniques that address the problem of automatically partitioning multi-word program identifiers. They define a string of characters between division markers (e.g., underscores and camel case) and the endpoints of a token to be a hard word. For example, the identifier hashtable_entry contains two hard words: hashtable and entry. When a hard word consists of multiple parts, the parts are called soft words. The hard word hashtable contains two soft words: hash and table. Thus, a hard word can consist of multiple soft words. A hard word containing more than one soft word is a same-case token.

Feild, et al. present two approaches to same-case token splitting: a greedy approach and an approach based on neural networks. The greedy approach uses a dictionary word list, a list of known abbreviations, and a stop list of keywords which includes predefined identifiers, common library functions and variable names, and single letters. After returning each hard word found in one of the three word lists as a single soft word, the remaining hard words are considered for splitting. The algorithm recursively looks for the longest prefix and suffix that appear in one of the three lists. Whenever a substring is found in the lists, a division marker is placed at that position to signify a split and the algorithm continues. Thus, the greedy approach is based on a predefined dictionary of words and abbreviations, and splits are determined based on whether the word is found in the dictionary, with longer words preferred. In contrast, the neural network approach passes each hard word through a neural network to determine splits, with each network specialized to a given hard word length. Feild, et al. do not discuss the mixed-case token splitting problem, beyond stating that division markers are used to derive hard words.

3.1.3 Automatic Token Splitting with Samurai

Our automatic token splitting technique, called Samurai, is based on the hypothesis that strings in multi-word tokens are likely used elsewhere within the same program or other programs, and that the most likely split coincides with the most frequently occurring strings within easily split tokens. These tokens are initially split using a conservative token splitting algorithm based on camel case and non-alphabetic characters. The strings output by the conservative splitting algorithm are then used to calculate two frequency tables: a program-specific frequency table for the current program, and a global frequency table constructed from a large set of open source Java programs.

For each token being analyzed, Samurai starts by executing the mixedCaseSplit algorithm. Shown in Algorithm 1, mixedCaseSplit outputs a space-delimited token where each space-delimited string of letters takes the form: (a) all lower case, (b) all upper case, or (c) a single upper case letter followed by all lower case letters. The output token is then processed by the sameCaseSplit algorithm, shown in Algorithm 2, which outputs a space-delimited token in which some of its substrings have been further split and delimited by spaces. Each split is denoted by an inserted blank character, and the final split token will be a sequence of substrings of the original token with an inserted blank character at each split. The following subsections describe each of the algorithms in detail, as well as the scoring function.
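A minimal sketch of the frequency mining step is shown below. It assumes the ConservativeSplitter sketched at the end of Section 3.1.1 is available, and it case-folds the counted strings, which is a simplifying assumption; the actual tables were mined from over 9,000 SourceForge projects.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class FrequencyTableBuilder {
        // Build a program-specific frequency table from conservatively split
        // identifiers; a global table would be built the same way over a large
        // corpus of programs.
        static Map<String, Long> build(List<String> identifiers) {
            Map<String, Long> freq = new HashMap<>();
            for (String id : identifiers) {
                for (String word : ConservativeSplitter.split(id).split("\\s+")) {
                    if (!word.isEmpty()) {
                        freq.merge(word.toLowerCase(), 1L, Long::sum); // case-folding is an assumption
                    }
                }
            }
            return freq;
        }

        public static void main(String[] args) {
            System.out.println(build(List.of("getString", "toString", "strlength")));
            // e.g. {to=1, get=1, string=2, strlength=1}
        }
    }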

3.1.3.1 Mixed-case Token Splitting

The mixedCaseSplit algorithm begins by replacing special characters with blank characters and inserting a blank character before and after each sequence of digits. The splitOnLowercaseToUppercase function adds a blank character between every two-character sequence of a lower case letter followed by an upper case letter. At this point, each alphabetic substring is of the form zero or more upper case letters


Algorithm 1 mixedCaseSplit(token)

    Input:  token to be split, token
    Output: space-delimited split token, sToken

    token ← splitOnSpecialCharsAndDigits(token)
    token ← splitOnLowercaseToUppercase(token)
    sToken ← ""
    for all space-delimited substrings s in token do
        if ∃ i such that isUpper(s[i]) ∧ isLower(s[i+1]) then
            n ← length(s) − 1
            // compute score for camel case split
            if i > 0 then
                camelScore ← score(s[i, n])
            else
                camelScore ← score(s[0, n])
            end if
            // compute score for alternate split
            altScore ← score(s[i+1, n])
            // select split based on score
            if camelScore > √altScore then
                if i > 0 then
                    s ← s[0, i−1] + " " + s[i, n]
                end if
            else
                s ← s[0, i] + " " + s[i+1, n]
            end if
        end if
        sToken ← sToken + " " + s
    end for
    token ← sToken
    sToken ← ""
    for all space-delimited substrings s in token do
        sToken ← sToken + " " + sameCaseSplit(s, score(s))
    end for
    return sToken


followed by zero or more lower case characters (e.g., List, ASTVisitor, GPSstate, state, finalstate, NAMESPACE, MAX). Each mixed-case alphabetic substring is then examined to decide between the straightforward camel case split before the last upper case letter (e.g., “AST Visitor”, “GP Sstate”) or the alternate split between the last upper case letter and the first lower case letter (e.g., “ASTV isitor”, “GPS state”). The original alphabetic substring is then replaced in the token by the split substring. The split selection is determined by comparing the scores of the strings to the right of the two split points, dampening the alternate split score to favor the camel case split unless there is overwhelming evidence for the alternate split. We dampen the split score using a square root to approximate an order of magnitude difference between the camel case and alternate splits, based on our training examples. For example, in the identifier GPSstate, the word Sstate has a frequency of 286 while the word state has a frequency of 1,072,952. Thus, we have overwhelming evidence for the alternate split between GPS and state, since √1,072,952 > 286. In contrast, we do not have overwhelming evidence for the alternate split in the identifier rowScan. In rowScan, the word Scan has a frequency of 75,871, while the word can has a frequency of 939,820. Although 939,820 > 75,871, √939,820 < 75,871, leading our algorithm to correctly leave Scan as a single word.

One challenge we faced in developing a frequency-based token splitting technique is that short words occur so frequently in the source code that they tend to have much higher scores than longer words. If the algorithm does not include some way to dampen the scores of short words, tokens will be split incorrectly by splitting words that should not be split. This led us to take the square root of the substring scores before comparing them to the scores of the current and original same-case strings being analyzed. Otherwise, splits would be improperly inserted at short words when a better split existed (e.g., performed split as “per formed”).
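The decision just described can be phrased as a one-line comparison. The sketch below plugs the frequencies from the two examples above directly in as scores, which is a simplification of the full score function defined in Section 3.1.3.3.

    public class MixedCaseDecision {
        // Keep the strict camel-case split unless the dampened score of the
        // alternate right-hand substring provides overwhelming evidence.
        static boolean preferAlternateSplit(double camelScore, double altScore) {
            return Math.sqrt(altScore) > camelScore;
        }

        public static void main(String[] args) {
            // GPSstate: score("Sstate") = 286, score("state") = 1,072,952
            System.out.println(preferAlternateSplit(286, 1_072_952));  // true  -> "GPS state"
            // rowScan: score("Scan") = 75,871, score("can") = 939,820
            System.out.println(preferAlternateSplit(75_871, 939_820)); // false -> "row Scan"
        }
    }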


Algorithm 2 sameCaseSplit(s, score_ns)

    Input:  same-case string, s
    Input:  no-split score, score_ns
    Output: final space-delimited split token, splitS

    splitS ← s, n ← length(s) − 1
    i ← 0, maxScore ← −1
    while i < n do
        score_l ← score(s[0, i])
        score_r ← score(s[i+1, n])
        prefix ← isPrefix(s[0, i]) ∨ isSuffix(s[i+1, n])
        toSplit_l ← √score_l > max(score(s), score_ns)
        toSplit_r ← √score_r > max(score(s), score_ns)
        if ¬prefix ∧ toSplit_l ∧ toSplit_r then
            if (score_l + score_r) > maxScore then
                maxScore ← score_l + score_r
                splitS ← s[0, i] + " " + s[i+1, n]
            end if
        else if ¬prefix ∧ toSplit_l then
            temp ← sameCaseSplit(s[i+1, n], score_ns)
            if temp was further split then
                splitS ← s[0, i] + " " + temp
            end if
        end if
        i ← i + 1
    end while
    return splitS

After all mixed-case splitting is completed, the mixedCaseSplit algorithm calls the sameCaseSplit algorithm on each space-delimited substring of the current (possibly already split) token. The string score of the space-delimited substring is input to the sameCaseSplit to be used in making split decisions, particularly decisions in recursive calls to sameCaseSplit. The substrings returned from sameCaseSplit are concatenated with space delimiters to construct the final split token.


3.1.3.2 Same-case Token Splitting

Along with the score of the original same-case token to be split, the sameCaseSplit algorithm takes as input the substring under analysis, s, which is already (1) all lower case, (2) all upper case, or (3) a single upper case letter followed by all lower case letters. Starting with the first position in s, the algorithm examines each possible split point in s, where s is split into two substrings, called left and right, with score(left) and score(right), respectively. The split decision is based on several conditions. Intuitively, we are looking for a split with the largest possible score such that (a) the substrings are not common prefixes or suffixes and (b) there is overwhelming evidence to support the split decision. Based on exploratory data analysis, we determined “overwhelming evidence in favor of a split” to be when a dampened score for each of the potential substrings is larger than both score(s) and the score of the original same-case string (before any recursive sameCaseSplit calls). If these conditions hold, s will be split and no further splitting of s will be attempted.

If condition (a) holds but only the substring to the left of the current potential split point (left) provides overwhelming evidence of being a word, then sameCaseSplit is called recursively to determine whether right should be further split. If right results in being split, then we split between left and right as well; otherwise we do not split here, because splitting at the current point based solely on the evidence of left tends to result in improper splits (e.g., “string ified”). This is due to the fact that shorter words occur more frequently than longer words, so if left and right combine to form a word with a suffix, the word without the suffix (left) is likely to occur more frequently than left+right. Following this recursive algorithm, Samurai correctly splits nonnegativedecimaltype as “nonnegative decimal type”.


3.1.3.3 String Scoring Function

A key component of our token splitting technique is our string scoring function, score(s), which returns a score for the string s based on how frequently s appears in the program under analysis and in a more global scope of a large set of programs. The score function is called numerous times, particularly to make decisions at two key steps in our automatic token splitting algorithms:

• to score and compare the straightforward camel case split results to the alternate split results during mixed-case token splitting

• to score different substring partitions of same-case tokens to identify the best split

We compute the string score for a given string s in a program p by:

    score(s, p) = Freq(s, p) + globalFreq(s) / log10(AllStrsFreq(p))

where Freq(s, p) is the frequency of s in p, globalFreq(s) is the frequency of s in a large set of Java programs, and AllStrsFreq(p) is the total frequency of all strings in the program p. As mentioned earlier, the frequency information is mined from the source code. Over 9,000 Java SourceForge projects were mined for the occurrences of strings to create the global frequency table used in globalFreq(s). Approximately 630,000 unique conservatively split words were identified, with a total number of occurrences of 938 million. The number of occurrences of strings in a given program varies with the program size: larger programs have a higher frequency of string occurrences. The frequency analysis runs in linear time with respect to the number of identifiers in the program.

The formula for the scoring function was developed through exploratory data analysis. Sometimes the mined program-specific frequency table has a high frequency for a string that does not appear with high frequency in the larger program corpus. However, for small programs, the frequency of a string in the program itself may be too low to provide good data for splitting, motivating the inclusion of the mined global frequency table information. Thus, the string scoring function is comprised of both the frequency of the string in the program under analysis and the global frequency of strings. However, the frequency of a string within a program can be overly dominated by the global frequency of that string, due to the much larger mined data set from 9,000 programs compared to the relatively small amount of data mined from a single program. Thus, we dampen the effect of the global frequency in the score by dividing by the log of the frequency of all strings in the program under analysis. This takes into consideration the fact that we have more mined frequency data from larger programs than from smaller programs.
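A direct transcription of the scoring formula is shown below; the frequency counts in the usage example are made up purely for illustration.

    public class SamuraiScore {
        // score(s, p) = Freq(s, p) + globalFreq(s) / log10(AllStrsFreq(p))
        static double score(long freqInProgram, long globalFreq, long allStringsFreqInProgram) {
            return freqInProgram + globalFreq / Math.log10(allStringsFreqInProgram);
        }

        public static void main(String[] args) {
            // A word seen 12 times in the program and 50,000 times globally,
            // in a program with 200,000 split word occurrences in total:
            System.out.println(score(12, 50_000, 200_000)); // 12 + 50,000/5.30 ≈ 9,444
        }
    }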

3.1.3.4 Analysis of Time and Space

The identifier splitting technique takes as input two frequency tables: a global frequency list and a program-specific word list. The global frequency list is precomputed, and the current version is approximately 12 MB. The program-specific word lists can be constructed in linear time with respect to the total number of identifiers occurring in the program. The algorithm also uses stored lists of prefixes and suffixes, which are approximately 4 KB each. For every identifier split, the algorithm analyzes every character, and its running time is therefore on the order of the number of characters in the program. This time can be reduced by storing common same-case splits.
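One way to realize the caching mentioned above is to memoize same-case splits per program, so a repeated substring is analyzed only once. This is only a sketch: sameCaseSplit here is a placeholder for Algorithm 2, and keying the cache on the substring alone assumes the no-split score does not change the outcome for repeated top-level calls.

    import java.util.HashMap;
    import java.util.Map;

    public class SplitCache {
        private final Map<String, String> cache = new HashMap<>();

        // Memoized wrapper around the same-case splitting step.
        String sameCaseSplitCached(String s, double noSplitScore) {
            return cache.computeIfAbsent(s, key -> sameCaseSplit(key, noSplitScore));
        }

        // Placeholder standing in for Algorithm 2.
        private String sameCaseSplit(String s, double noSplitScore) {
            return s;
        }
    }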

3.1.4 Evaluation

We evaluated our automatic token splitting technique with two research questions in mind:


1. How does our technique for mixed-case token splitting compare with the only other automatic technique, conservative division marker (e.g., camel case and special character) splitting?

2. How does our technique for same-case token splitting compare with the state of the art Feild, Binkley, and Lawrie [FBL06] greedy approach?

Variables and Measures

The independent variable is the token splitting algorithm, which we evaluate by measuring the accuracy of each technique in identifying the correct partitioning of a gold set of tokens from Java programs. To evaluate how effectively our mixed-case token splitting performs, we implemented the straightforward splitting based on division markers and camel casing used by most researchers in software analysis who require token splitting, which we call conserv. Our implementation scans a token and splits between any alternating case from lower case to upper case and before the last upper case letter in a sequence of multiple consecutive upper case letters followed by lower case, splits between letters and digits, and treats special characters as delimiters. For simple camel case tokens, such as getString or setIdentifierType, the tokens would be correctly split as “get String” and “set Identifier Type”. For strings such as DAYSforMONTH, this conservative camel case approach will incorrectly split as “DAY Sfor MONTH”. However, the conservative camel case approach does not require any knowledge of abbreviations or common situations that do not follow this simple rule, thus making it very efficient and easy to implement.

To evaluate the effectiveness of our same-case token splitting, we implemented the greedy algorithm by Feild, Binkley, and Lawrie [FBL06]. For the three predefined word lists, we used the same ispell Version 3.1.20, a stop list of Java keywords, and an abbreviation list that we had created as part of our work on abbreviation


expansion [HFB+ 08]. Because we are evaluating with Java programs, we did not include library functions and predefined identifiers in the stop list. We did not compare with the neural network approach because it was only shown to perform well given specialized data, while the greedy approach was more consistent across data sets. Since Feild, et al. did not have an automatic front-end for splitting mixed-case tokens, we ran conservative camel case splitting for the mixed-case stage.

The dependent variable in our study is the effectiveness of each technique, measured in terms of accuracy. Accuracy is determined by counting the number of tokens that are correctly split, where a token is considered correctly split only if its splits completely match those of the human annotators who produced the gold set. Tokens with multiple splits, only some of which were correctly split by an automatic technique, were considered to be incorrectly split. Because not all trends in the data are visible in terms of accuracy, we also measured correctness in terms of the percent of incorrectly split same-case tokens, or oversplitting.

Subjects

We randomly selected tokens from 9,000 open source Java programs in SourceForge. Two human annotators, who had no knowledge of our token splitting technique, manually inspected each token in their respective sets to identify the most appropriate token splits. To construct the gold set, we continued to add tokens to each human subject’s set until we reached 1,500 nondictionary words in their set. The total number of tokens in the gold set is 8,466.

Methodology

We mined each of the 9,000 programs to produce the program-specific and global frequency tables. We ran four techniques on the entire set of tokens in the gold set: conservative camel case, Samurai, greedy, and mixedCaseSplit without the call to sameCaseSplit. We compared the output of each tool with the gold set. If

the space-delimited token generated by the automated technique was identical to the human-split token, then the automatic token split is considered to be correct. In addition to computing the overall accuracy for each tool, we computed accuracy for different token categories based on their characteristics to analyze differences in effectiveness. We also computed the amount of oversplitting performed by Samurai and greedy.

Threats to Validity

Because our technique was developed on Java programs, the results of the study may not generalize to all programming languages; however, the gold set does include tokens in natural languages other than English. As with any subjective task, it is possible that the human annotators did not identify the correct split for a given token. In some instances, the annotators kept proper names together even when they were camel cased. There were a number of same-case tokens that were ambiguous and up to personal preference. We noticed that subjects more familiar with Java programming would split differently from a novice programmer. For instance, the splitting of Javadoc, sourceforge, gmail, and gcal is subjective.

3.1.5 Results and Analysis

We present the accuracy results for our experiment in Figure 3.1. Although Samurai misses some same-case splits, Samurai is more accurate than the greedy algorithm overall.

Mixed-case

Samurai performs very similarly to conserv in mixed-case splitting. There are 1,632 instances where the split is camel case. Samurai correctly chooses the camel case split in 1,630 cases, and incorrectly chooses the alternate split in just two cases.


[Figure 3.1 appears here: a bar chart of the number of correctly split tokens (out of the 8,466 tokens in the gold set) for the NoSplit, Greedy, Mixed, Conserv, Samurai, and Ideal techniques, broken down by token category (DW, NDW, US, CC, AC, SC). See the caption below.]

Figure 3.1: Accuracy of token splitting approaches per category: dictionary word requiring no split (DW), nondictionary word without split (NDW), underscores and digits (US), camel case (CC), alternating case (AC), same-case (SC).


[Figure 3.2 appears here: a bar chart of the percent of incorrectly split same-case tokens (0–12%) for Greedy and Samurai, broken down by category (NDW, US, CC). See the caption below.]

Figure 3.2: Percent of incorrectly split same-case tokens by category: nondictionary word (NDW), underscore/digit separated words (US), camel case separated words (CC).

There are only four instances of the alternate split, and Samurai correctly selects the alternate split in one of them. conserv does not get any of these correct because it never considers alternate splits. Thus, for mixed-case splitting, even this large data set is not a large enough sample to answer this question.

Same-case

Note that in Figure 3.1, except for greedy, the techniques correctly do not split dictionary and nondictionary tokens. Also note that the greedy algorithm outperforms Samurai in terms of same-case splitting. There are 249 tokens that contain at least one same-case split. Greedy correctly splits 125 of these tokens, while Samurai correctly splits just 29.

Although Samurai splits fewer of the same-case tokens that should be split, it makes fewer mistakes than greedy in splitting strings that should not be split. There are 6,391 no-split tokens, i.e., tokens that require no split. Of these no-split tokens, 5,582 are dictionary words and 809 are nondictionary tokens. Because greedy uses the same dictionary, it is no surprise that greedy does not split the dictionary words. It should be noted that even though Samurai does not use a dictionary, it also does not split any of the dictionary words. The results for no-split nondictionary tokens are not as favorable for greedy.

Figure 3.2 presents the percent of oversplitting by greedy and Samurai. The incorrectly split same-case words are full nondictionary tokens, words that fall between underscores/digits, or camel case separated words. As expected from prior evaluation [FBL06], greedy suffers from a significant amount of oversplitting. In contrast to greedy’s 10%, Samurai’s frequency-based approach oversplits in just 1% of cases. Although Samurai splits fewer same-case tokens than greedy, it is more accurate overall by oversplitting significantly less. The data reveal that our scoring function may be overly conservative. In the future, we plan to investigate a scoring function that more accurately balances splitting same-case tokens while preserving no-split tokens.

3.1.6 Conclusions

In this section, we presented and evaluated a technique, Samurai, to automatically split tokens into sequences of words by mining the frequency of potential substrings from the source code. We evaluated Samurai against the state of the art on over 8,000 tokens. Our results show that frequency-based token splitting misses same-case splits identified by the greedy algorithm, but outperforms greedy overall by making significantly fewer oversplits. Samurai also identifies slightly more correct


splits than conservative division marker splitting, without incorrectly splitting any dictionary words.

3.2 Abbreviation Expansion

When writing software, developers often use abbreviations in identifier names, especially for identifiers that must be typed often, and for domain-specific words used in comments. In some cases, the abbreviated form of a word is so prevalent that it occurs more often than the expanded form. For example, the word ‘number’ occurs only 4,314 times in the Java 2 Platform, while its abbreviation ‘num’ occurs 5,226 times. If not taken into account, abbreviations can cause search tools to miss relevant code. Consider a developer searching for string handling code, who might enter the query ‘string’. If the abbreviation ‘str’ is used in the code instead of ‘string’, the search tool will miss relevant code.

For the purposes of this work, we define a token to be a sequence of alphabetic characters delimited by any non-alphabetic character, such as spaces or underscores. We refer to any token that is not found in an English dictionary as a non-dictionary word. We use the term short form to refer to an abbreviation, and long form for its corresponding full word expansion. One simple way to expand short forms in code is to manually create a dictionary of common short forms [RAN07]. Although most developers understand that ‘str’ is a short form for ‘string’, not all abbreviations are as easy to resolve. Consider the abbreviation ‘comp’. Depending on the context in which the word appears, ‘comp’ could mean either ‘compare’ or ‘component’. Thus, a simple dictionary of common short forms will not suffice. In addition, manually created dictionaries are limited to abbreviations known to the dictionary builders.

The hypothesis driving this work is that automatically mining short forms from the program itself can identify the most appropriate expansions of short forms


within the context of the individual occurrences. Thus, we have developed an automatic technique to mine short and long forms from a large set of programs, such that abbreviations in a program can be automatically expanded to the most appropriate long form in the context of their occurrences.

3.2.1 Types of non-dictionary words

There are many types of non-dictionary words used in program identifiers, and these non-dictionary words are not limited to abbreviations. We have found abbreviations generally fall into two categories: single-word and multi-word.

3.2.1.1 Single-Word Abbreviations

Single-word abbreviations are short forms whose long form consists of a single word. This is in contrast to non-dictionary words such as acronyms, whose long forms expand into multiple words. We have identified two major types of single word abbreviations in programs: prefixes and dropped letters. Prefix short forms are formed by dropping the latter part of a long form, retaining only the few beginning letters. Examples of prefixes include ‘attr’ (attribute), ‘obj’ (object), and ‘param’ (parameter). A subset of prefix short forms are single letter prefixes. Single letter prefixes are predominantly used for local variables with very little scope outside a class or method [LBS06]. Examples include ‘i’ (integer) and ‘e’ (exception). The second type of single-word abbreviation is dropped letter. Dropped letter short forms consist of the first letter of the long form followed by an assortment of other letters from the long form. Examples include ‘evt’ (event), ‘msg’ (message), and ‘src’ (source). Dropped letter short forms are actually a super set of prefix short forms, although they can easily expand to a much larger set of long forms. For example, the abbreviation ‘org’ can expand to be a prefix of ‘organization’ or be the less probable dropped letter ‘original’. Automatic abbreviation expansion


techniques must therefore be selective in expanding dropped letter abbreviations to potential long forms.
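To preview how that selectivity plays out, the two single-word pattern shapes used later in Section 3.2.3.1 can be built as in the sketch below. The ‘x’ special case and the restrictions of Algorithm 3 are omitted, and find() is used because the patterns are searched within method text rather than matched against isolated words.

    import java.util.regex.Pattern;

    public class AbbreviationPatterns {
        // Prefix pattern: "attr" -> "attr[a-z]+"
        static Pattern prefixPattern(String shortForm) {
            return Pattern.compile(shortForm + "[a-z]+");
        }

        // Dropped letter pattern: "msg" -> "m[a-z]*s[a-z]*g"
        static Pattern droppedLetterPattern(String shortForm) {
            StringBuilder re = new StringBuilder();
            for (int i = 0; i < shortForm.length(); i++) {
                re.append(shortForm.charAt(i));
                if (i < shortForm.length() - 1) re.append("[a-z]*");
            }
            return Pattern.compile(re.toString());
        }

        public static void main(String[] args) {
            System.out.println(prefixPattern("attr").matcher("attribute").find());      // true
            System.out.println(droppedLetterPattern("msg").matcher("message").find());  // true
            System.out.println(droppedLetterPattern("org").matcher("original").find()); // true: greedy, as noted above
        }
    }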

3.2.1.2 Multi-Word Abbreviations

Multi-word abbreviations are short forms whose long form consists of more than one word. The most common are acronyms, which arguably belong in a class of short forms separate from abbreviations. For the purposes of this work, we consider acronyms to be a special type of multi-word abbreviation. Acronyms consist of the first letters of the words in the long form. Acronyms can be so widely used that the long form is rarely seen, such as ‘ftp’, ‘xml’, or ‘gif’. Some uses of acronyms are very localized, such as what we call type acronyms. When creating local variables or naming method parameters, a common naming scheme is to use the type’s abbreviation. For example, a variable of the type ArrayIndexOutOfBoundsException may be abbreviated ‘aiobe’, or StringBuffer as ‘sb’. The second type of multi-word abbreviation includes more than just the first letters of the long form. A combination multi-word may combine single-word abbreviations, acronyms, or dictionary words. Examples include ‘oid’ (object identifier), ‘println’ (print line), and ‘doctype’ (document type). By definition, combination multi-words must contain more than two letters, otherwise the short form would be an acronym. As with the relationship between prefixes and dropped letters, acronyms are a subset of combination multi-words.
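The type acronym convention can be made concrete with a small sketch that collects the capital letters of a camel-case type name. Developers may further compress the result (as in the ‘aiobe’ example above), so this only produces the canonical form; NullPointerException is simply an additional illustrative type.

    public class TypeAcronym {
        // e.g. StringBuffer -> sb, NullPointerException -> npe
        static String acronymFor(String typeName) {
            StringBuilder sb = new StringBuilder();
            for (char c : typeName.toCharArray()) {
                if (Character.isUpperCase(c)) {
                    sb.append(Character.toLowerCase(c));
                }
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            System.out.println(acronymFor("StringBuffer"));         // sb
            System.out.println(acronymFor("NullPointerException")); // npe
        }
    }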

3.2.1.3 Other Types of Short Forms

Aside from abbreviated words, one of the most common forms of non-dictionary words in code is multiple words with no clearly delineated word boundaries. Most identifiers that consist of multiple words mark word boundaries by varying upper and lower case letters (i.e., camel casing) or by using non-alphabetic characters, as


in the examples ASTVisitor, stringBuffer, and TARGET_WINDOW. However, a programmer may not delineate word boundaries because: (1) the boundaries are trivial for a human to recognize, such as in ‘keystore’ or ‘threadgroup’; (2) the programmer favored typing fewer letters over general readability; or (3) the words appear so often together that the programmer may not realize the compound word does not exist in English. In the last case, the joined words may be considered one word by most programmers, but English dictionaries currently include only the separate parts. We have found that lack of word boundaries is especially common for collocations: words that often occur adjacent to one another and represent a conventional way of saying things [MS99]. Examples include ‘filesize’, ‘saveas’, and ‘dataset’. Components of multi-words are called soft words by Lawrie, et al. [LFB07b].

Misspellings are also present in identifiers, although less so than in comments. Examples include ‘instanciation’ (instantiation) and strike ‘trought’ (through) font format. Other types of non-dictionary words include mathematical notation, such as for vector indices or notation specific to scientific equations, and Hungarian notation [Sim99]. More common in C-based languages than in Java, Hungarian notation suggests prepending the first letter of the data type to every variable name. Finally, some identifiers are just improbably named: ‘zzzcatzzzdogzzz’. When variable names are selected with little relevance to the underlying code, deriving meaning can be impossible both for humans and for automated mining techniques.

3.2.2 Automatic abbreviation expansion

Automatically expanding abbreviations requires the following steps: (1) identifying whether a token is a non-dictionary word, and therefore a short form candidate; (2) searching for potential long forms for the given short form; and (3) selecting the most appropriate long form from among the set of mined potential long form candidates.
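These three steps can be read as a simple driver skeleton; the method names below are hypothetical placeholders for the techniques developed in the remainder of this chapter.

    import java.util.List;

    public class AbbreviationExpander {
        String expand(String token, String methodContext) {
            if (!isNonDictionaryWord(token)) {
                return token;                                              // (1) not a short form candidate
            }
            List<String> candidates = findLongForms(token, methodContext); // (2) mine candidates
            return selectMostAppropriate(token, candidates);               // (3) choose one
        }

        boolean isNonDictionaryWord(String token) {
            return true; // placeholder for a dictionary lookup
        }

        List<String> findLongForms(String shortForm, String context) {
            return List.of(); // placeholder for the scoped search of Section 3.2.3
        }

        String selectMostAppropriate(String shortForm, List<String> candidates) {
            return candidates.isEmpty() ? shortForm : candidates.get(0);
        }
    }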


For some applications, a completely automated approach may be unnecessary and the final step of selecting the most appropriate long form can be left to the human user. Examples include a program comprehension tool that automatically presents the developer with potential long forms when given an unfamiliar abbreviation as input, or a query expansion mechanism that uses human feedback to determine appropriate query expansions, which may include short forms for a long form given in the query. However, for general search or automated maintenance tools, a fully automatic approach may be more appropriate. For the remainder of this chapter we focus on completely automatic abbreviation expansion.

3.2.2.1 Observations and Challenges

To develop our automatic abbreviation expansion technique, we analyzed short forms and their corresponding long forms in 15 open source Java programs. Based on our manual inspection, we made the following observations that must be taken into account when automatically mining abbreviation expansions: Good dictionaries are hard to find. The most prevalent available English dictionaries are used for spell checking, and may include proper nouns, common abbreviations, and contractions, all of which may occur in software. However, some dictionary words are less likely to occur in code. For example, observing the token ‘io’ in software is much more likely to stand for ‘input output’ than the proper noun ‘Io’. In addition, legitimate English words may be used for abbreviations in code, such as ‘char’, ‘tab’, or ‘id’. Thus, there is a trade off in including too many or too few words in a dictionary. Too few words causes the automatic expansion to attempt to find long forms for legitimate words; too many words causes even legitimate short forms to be classified as dictionary words. Short form type is impossible to determine a priori. If it were possible to automatically identify the short form type, it would be easier to narrow down the list


of potential long forms, and therefore more accurately select the appropriate long form. Unfortunately, abbreviations are short forms for longer words, and are by their very nature less unique than the long forms themselves. Thus, the same sequence of three characters may represent different long forms depending on the context. For example, ‘def’ can refer to ‘definition’, ‘default’, or even ‘defect’. In one instance, we even observed that the acronym ‘dc’ was used to represent both ‘dynamic color’ and ‘duration color’ in different branches of the same method. Thus, we cannot rely solely on the abbreviation type to eliminate unrealistic long form candidates. The shorter the short form, the more potential long form candidates. This observation presents one of the more frustrating aspects of the automatic abbreviation expansion problem. By definition, the shorter the short form, the more potential long forms it could match. For example, a single letter abbreviation ‘i’ could conceivably match any dictionary word beginning with the letter ‘i’; whereas ‘int’ is likely to match ‘integer’, ‘interface’, or ‘interrupt’; and ‘interf’ will match ‘interface’. However, most abbreviations are short, consisting of just 1–3 letters. Thus, the majority of short forms represent the most difficult instances of the automatic abbreviation expansion problem. Some abbreviation types have more long form candidates. Specifically, acronyms and prefixes have fewer long form candidates than dropped letters and combination multi-words. For example, the prefix ‘str’ is likely to be ‘string’ or ‘stream’, whereas the dropped letter ‘str’ could match long forms ‘substring’, ‘store’, ‘september’, or ‘saturn’. Thus, automatic expansion techniques should take long form accuracy of abbreviation type into account when choosing between potential long forms.


3.2.2.2 State of the Art

To our knowledge, Lawrie, Feild, and Binkley [LFB07b] are the only other researchers to present and evaluate techniques to address the problem of automatically expanding abbreviations that occur in program identifiers. In their earlier paper [FBL06], Feild, Lawrie and Binkley evaluated three automated techniques for splitting identifiers that are not easily split by camel-casing or underscore clues left by the programmer. By first splitting the identifiers into their constituent “words”, their abbreviation analysis can focus on the individual “words” comprising each identifier. More recently, Lawrie, Feild, and Binkley (LFB) [LFB07b] presented a strategy for automatically expanding abbreviations used in identifiers by first extracting lists of potential expansions as words and phrases, and then performing a two-stage expansion for each abbreviation occurrence in the code. They create several different lists to be used during expansion of an identifier occurrence. For each function f in the program, they create a list of words contained in the comments before or within the function f or in identifiers with word boundaries (e.g., camel casing) occurring in f, and a phrase dictionary created by running the comments and multi-word identifiers through a phrase finder [FC01]. In addition to the lists for each function, they create a list of programming-language-specific words as a stop word list. Suffix-stripping (i.e., stemming) and the stop word list are used during extraction to improve accuracy. The first letter of each word in a phrase is used to build acronyms. Expansion of a given non-dictionary word occurrence in a function f involves first looking in f’s word list and phrase dictionary, and then in a natural language dictionary. A word is said to be a potential expansion of an abbreviation when the abbreviation starts with the same letter and every letter of the abbreviation occurs in the word in order. The LFB [LFB07b] technique returns a potential expansion only if there is


a single possible expansion. They leave the problem of choosing among multiple potential expansions found at either stage as future work. When they manually checked a random sample of 64 identifiers requiring expansion (from a set of C, C++, and Java code), one third were correctly split and expanded. Of the identifiers correctly split, 58% of the one–two letter forms were expanded correctly, and 64% of the over-two letter forms. Thus, only approximately 20% (60% of 33%) of the identifiers were expanded correctly. In their other quantitative study of all identifiers in their 158-program suite of over 8 million unique terms, only 7% of the total number of identifier terms were expanded by their technique; these expansions were not checked for correctness. These low precision results motivate a closer look at alternative strategies for expansion. In addition, sets of potential expansions for a given occurrence in their study ranged from 1 to 6,735, demonstrating the need for a heuristic for choosing the most appropriate expansion for a given occurrence.

Related work also includes work on restructuring program identifier names to conform to a standard in both the lexicon of the composed terms and the syntactic form of the overall identifier composition of terms [CT00]. Identifiers are split, and then a match between a standard dictionary and synonym dictionary and the identifier components is attempted. When no match is found, the user is prompted for help. No automatic abbreviation expansion is attempted. There also exist acronym expansion techniques created for use in written English text [LOPT00, Pak01]; however, their premise does not hold for software due to their reliance on textual patterns that do not occur in code and do not apply in the context of the syntactic structure of a program.
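The LFB notion of a potential expansion quoted above is essentially an anchored subsequence test, sketched below; the third call illustrates why so many candidates can qualify for a single short form.

    public class PotentialExpansion {
        // A word is a potential expansion of an abbreviation if both start with
        // the same letter and every letter of the abbreviation occurs in the
        // word in order.
        static boolean isPotentialExpansion(String abbrev, String word) {
            if (abbrev.isEmpty() || word.isEmpty() || abbrev.charAt(0) != word.charAt(0)) {
                return false;
            }
            int i = 0;
            for (int j = 0; j < word.length() && i < abbrev.length(); j++) {
                if (abbrev.charAt(i) == word.charAt(j)) {
                    i++;
                }
            }
            return i == abbrev.length();
        }

        public static void main(String[] args) {
            System.out.println(isPotentialExpansion("evt", "event"));   // true
            System.out.println(isPotentialExpansion("msg", "message")); // true
            System.out.println(isPotentialExpansion("msg", "missing")); // also true: many words qualify
        }
    }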

3.2.3 The Scoped Approach

As discussed at the beginning of Section 3.2.2, the LFB approach only addresses the first two steps of identifying non-dictionary words and potential expansions [LFB07b]. When faced with the final step of choosing between equally likely

long forms, their current approach returns nothing. With our scoped approach, we attempt to effectively solve all three steps of the automatic abbreviation expansion problem. Also in contrast to LFB, we never attempt to match short forms to an English dictionary of words, only to those dictionary words appearing within the scope of the software. This is a direct consequence of our observation that word lists for computer science are nonexistent, and many English dictionaries include too many words. Also, the only hand-tuned word lists used in our approach are a stop word list and a list of common contractions. Our list of common abbreviations is automatically derived from software, but could be improved with a hand-tuned common abbreviation list, especially for production systems.

Our automatic long form mining technique is inspired by the static scoping of variables represented by a compiler’s symbol table. When looking for potential long forms, we start at the closest scope to the short form, such as type names and statements, and gradually broaden our scope to include the method, its comments, and the class comments. If our technique is still unsuccessful in finding a long form, we attempt to find the most likely long form found within the program and in Java SE 1.5. With each successive scope we include more general, i.e., less domain-specific, information in our long form search. In this work, we assume a short form has the same long form for an entire method. Although infrequent, it is possible for a short form to have multiple long forms within a method. To handle such cases, our approach could be extended to assume a short form has the same long form only for block- or statement-level scope.

3.2.3.1 Method-level Matching

The core of our approach is our long form search technique within a method. In this section, we first describe how we search for each type of long form within a


Algorithm 3 Searching for single-word long forms. Quotes are used to indicate regular expressions.

    1:  Input: potential short form, sf
    2:  Input: regular expression to match long form, pattern
    3:  Input: method body text, method comments
    4:  Input: class comments (Prefix only)
    5:  Output: long form candidates, or null if none
    6:  if ((prefix pattern) or (sf matches "[a-z][^aeiou]+") or (length(sf) > 3))
            and (sf does not match "[a-z][aeiou][aeiou]+") then
    7:      // In the following, when a unique long form is found, return.
    8:      Search JavaDoc comments for "@param sf pattern"
    9:      Search TypeNames and corresponding declared variable names for "pattern sf"
    10:     Search MethodName for "pattern"
    11:     Search Statements for "pattern sf" and "sf pattern"
    12:     if length(sf) ≠ 2 then
    13:         Search method words for "pattern"
    14:         Search method comment words for "pattern"
    15:     end if
    16:     if (length(sf) > 1) and (prefix pattern) then
    17:         Search class comment words for "pattern"
    18:     end if
    19: end if

method, define how we select the long form from many potential long forms, and how we attempt to expand short forms initially missed at the method scope. Single-Words The first step in searching for long forms is to construct a regular expression pattern from the short form, and then use the pattern to search for long form candidates over different parts of the method body text. Our single-word search approach is presented in Algorithm 3, and the patterns used for each type of short form are described below. Stepping through Algorithm 3, line 6 prevents (1) searching for unlikely dropped letter long forms and (2) expanding short forms with many consecutive vowels as a single-word. The first three predicates in line 6 restrict the search for


dropped letter long forms to only those short forms that are longer than 3 letters or composed of all consonant letters with an optional leading vowel. We restrict the dropped letter pattern search because the pattern can greedily match many incorrect expansions. For example, if left unchecked, dropped letter may incorrectly expand ‘lang’ to ‘loading’, ‘br’ to ‘bar’, or ‘mtc’ to ‘matching’. The last predicate of line 6 ensures that we do not try to expand short forms with many consecutive vowels as a single-word. Most short forms consisting of consecutive vowels expand into multi-word long forms; consider ‘gui’ (graphical user interface), ‘ioe’ (invalid object exception), or ‘poa’ (portable object adaptor). Lines 7–19 of the algorithm describe the search process. If at any line a unique long form is found, the algorithm immediately returns. In line 8, we first search for the short form and the pattern in the method’s Java Doc comment. If unsuccessful, in line 9 we look for the short form and the pattern appearing together in a variable declaration and its type. Next we search the method name for the pattern in line 10. In line 11 we continue searching for the pattern and the short form appearing within the same statement. In line 12 we restrict our search of the general method text and comments to short forms of 3 letters long or more because short forms that are two letters long (1) are most likely to be multi-words and (2) are capable of matching many different words. Since we do not search beyond method scope for single letter prefixes, we also search the method text and comments for single letter prefixes. Thus, in lines 13–14 we search for the pattern in the method words and method comment words if the short form is not of length two. Lastly, if the pattern is a prefix and the short form is longer than a single letter, we search the class comments for the pattern in line 17. Since single letter prefix short forms are unlikely to have scope beyond a method, and since the single letter prefix pattern may match so many long forms, we do not attempt to match


single letter prefix patterns to the class comments. Likewise, since the dropped letter pattern is so greedy, we do not search for dropped letter long forms in the class comments.

Prefix Pattern

The first step in searching for prefix long forms is to construct a regular expression from the short form. The prefix pattern is thus the short form followed by the regular expression "[a-z]+": "sf[a-z]+". The letter 'x' is a special case: if a short form begins with 'x', the expression "e?x" is added to the beginning of the pattern. The pattern is then input into Algorithm 3 to search for long forms.

Dropped Letter Pattern

The regular expression pattern for dropped letter is much less conservative than the pattern used for prefixes. The dropped letter pattern is constructed by inserting the expression "[a-z]*" after every letter in the short form. Let sf = c1 c2 ... cn, where n is the length of the short form. Then the dropped letter pattern is c1[a-z]*c2[a-z]*...[a-z]*cn.

Multi-Words

As with single-words, our approach for finding multi-word long forms searches increasingly broader scopes until we find a long form candidate that matches the pattern. However, because multi-word patterns must search over spaces, it is important to limit how far the pattern should extend. For example, with a naive pattern the short form 'il' could match the phrase "it is important to limit" in the previous sentence. Thus, we preprocess the method body text and comments so that we do not search for long forms beyond variable declarations and method identifier


Algorithm 4 Searching for multi-word long forms. Quotes are used to indicate regular expressions.
1: Input: potential short form, sf
2: Input: regular expression to match long form, pattern
3: Input: method body text, method comments
4: Input: class comments (Acronym only)
5: Output: long form candidates, or null if none
6: if (acronym pattern) or (length(sf) > 3) then
7:   {In the following, when a unique long form is found, return.}
8:   Search JavaDoc comments for "@param sf pattern"
9:   Search TypeNames and corresponding declared variable names for "pattern sf"
10:  Search MethodName for "pattern"
11:  Search all identifiers in the method for "pattern" (including type names)
12:  Search string literals for "pattern"
13:  {At this point we have searched all the possible phrases in the method body}
14:  Search method comment words for "pattern"
15:  if acronym pattern then
16:    Search class comment words for "pattern"
17:  end if
18: end if

boundaries. We also split comments and string literals into phrases using punctuation ([?!,;]). So that abbreviations like ‘val’ are not expanded to ‘verify and load’, we remove common stop words from the method body text and comments. Our multi-word search approach is presented in Algorithm 4. Line 6 ensures that we do not search for many incorrect combination word long forms. Combination word patterns are much less conservative than acronyms, and can frequently match incorrect expansions. Thus, we restrict our search to short forms of length 4 letters or more. This threshold will cause our technique to miss some legitimate 3-letter combination word expansions, such as ‘oid’ (object identifier), but we feel it is necessary to restrict our search to find only the most likely long forms. It should be noted that 3-letter combination word abbreviations are not very common in practice. In the random sample of 250 non-dictionary words used in our evaluation,


only 1 short form fell into this category. Based on this sample, we expect 3-letter combination word abbreviations to account for only 4% of all combination words and just 0.4% of all non-dictionary words.

As with single-words, our technique searches for multi-word long form candidates first in JavaDoc, type names, and the method name in lines 8–10. We were unable to search for multi-words in statements due to the run-time complexity of the regular expression. Next, in lines 11–12, we search the method identifiers and string literals for the pattern, followed by method comments in line 13. Because expansions for short forms that are well understood in the context of the class may not occur within the method text and comments, we also search for acronym long forms in the class comments in line 15. As with dropped letter, we do not search for combination word patterns in the class comments because the pattern can match many incorrect long forms.

Acronym Pattern

The regular expression pattern used to search for acronym long forms is constructed by inserting the expression "[a-z]+[ ]+" after every letter in the short form. Let sf = c1 c2 ... cn, where n is the length of the short form. Then the acronym pattern is c1[a-z]+[ ]+c2[a-z]+[ ]+...[a-z]+[ ]+cn. As with prefixes, the letter 'x' is a special case. When forming the acronym pattern, any occurrence of 'x' in the short form is replaced with the expression "e?x". This enables our technique to find long forms for acronyms such as 'xml' (extensible markup language).

Combination Word Pattern

The pattern to search for combination word long forms is constructed by appending the expression "[a-z]*?[ ]*?" to every letter of the short form. Let sf = c1 c2 ... cn, where n is the length of the short form. Then the combination word pattern is c1[a-z]*?[ ]*?c2[a-z]*?[ ]*?...[a-z]*?[ ]*?cn. The pattern is constructed such that

only letters occurring in the short form can begin a word. This keeps the pattern from expanding short forms such as 'ada' to 'adding machine'. We use a less greedy wild card to favor shorter long forms with fewer spaces, such as 'period defined' for 'pdef', rather than 'period defined first'. (A short sketch at the end of this section illustrates how the four patterns can be constructed.)

Putting it all together

With a slightly different technique to search for long forms of each abbreviation type, we now have to combine them to output a single long form. The first step is identifying the order in which to apply the expansion techniques. Within the single- and multi-word types, acronyms should be matched before combination words and prefixes before dropped letter, since the greedier patterns will match all the long forms that the more conservative patterns match. However, we were not immediately sure in what order to search for acronym versus prefix, or dropped letter versus combination word. After manually inspecting hundreds of example long forms for 15 open source Java programs, we concluded that the best order to apply the long form search techniques is: acronym, prefix, dropped letter, and combination word. If none of the abbreviation type expansion techniques match locally within the method, we attempt to match the short form to common contractions, followed by our most frequent expansion (MFE) technique.

Handling multiple matches

Before presenting our MFE technique, we must address how to handle short forms whose pattern matches multiple long form candidates within the same method. Within broader scopes such as the method body or comments, it is possible for a single abbreviation type pattern to match many potential long forms. For example, the prefix pattern for 'val' may match 'value' as well as 'valid' in a method comment. Our technique for selecting between multiple long forms is as follows:


Step 1. Use the long form that most frequently matches the short form's pattern in this scope. For example, if 'value' matched the prefix pattern for 'val' three times and 'valid' only once, return 'value'.

Step 2. Group words with the same stem [Por80] and update the frequencies accordingly. For example, if the words 'default' (2 matches), 'defaults' (2 matches), and 'define' (2 matches) all match the prefix pattern for 'def', group 'default' and 'defaults' under the shortest long form, 'default' (4 matches), and return the long form with the highest frequency.

Step 3. If there is still no clear winner, continue searching for the pattern at broader scope levels. For example, if both 'string buffer' and 'sound byte' match the acronym pattern for 'sb' at the method identifier level, continue to search for the acronym pattern in string literals and comments. We store the frequencies of the tied long forms so that the most frequently occurring long form candidates are favored when searching the broader scope.

Step 4. If all else fails, abandon the search and let MFE select the long form. At this point we stop searching for long form candidates of different abbreviation types. For example, if a prefix pattern has already found long form candidates, we avoid finding dropped letter long form candidates by halting the search for a given short form within a method.
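To make the four pattern constructions above concrete, the following is a minimal Java sketch of one way to build them with java.util.regex. The helper names, the full-string matching via matches(), and the exact handling of the final letter and of 'x' are assumptions of this illustration, not the precise implementation.

import java.util.regex.Pattern;

// Illustrative construction of the four abbreviation-type patterns described
// in this section. The anchoring via matches() and the treatment of the final
// letter are assumptions of this sketch, not the exact thesis implementation.
public class AbbreviationPatterns {

    // Prefix: the short form followed by "[a-z]+"; a leading 'x' is relaxed to "e?x".
    static Pattern prefix(String sf) {
        String head = sf.startsWith("x") ? "e?" + sf : sf;
        return Pattern.compile(head + "[a-z]+");
    }

    // Dropped letter: "[a-z]*" inserted after every letter of the short form.
    static Pattern droppedLetter(String sf) {
        StringBuilder p = new StringBuilder();
        for (char c : sf.toCharArray()) p.append(c).append("[a-z]*");
        return Pattern.compile(p.toString());
    }

    // Acronym: each letter begins a word, words separated by spaces; any 'x'
    // is replaced with "e?x" so that 'xml' can match "extensible markup language".
    static Pattern acronym(String sf) {
        StringBuilder p = new StringBuilder();
        for (int i = 0; i < sf.length(); i++) {
            char c = sf.charAt(i);
            p.append(c == 'x' ? "e?x" : String.valueOf(c)).append("[a-z]+");
            if (i < sf.length() - 1) p.append("[ ]+");
        }
        return Pattern.compile(p.toString());
    }

    // Combination word: reluctant wild cards so that shorter long forms with
    // fewer spaces are favored, and only short-form letters may begin a word.
    static Pattern combinationWord(String sf) {
        StringBuilder p = new StringBuilder();
        for (char c : sf.toCharArray()) p.append(c).append("[a-z]*?[ ]*?");
        return Pattern.compile(p.toString());
    }

    public static void main(String[] args) {
        System.out.println(prefix("attr").matcher("attribute").matches());                // true
        System.out.println(droppedLetter("evt").matcher("event").matches());              // true
        System.out.println(acronym("gui").matcher("graphical user interface").matches()); // true
        System.out.println(combinationWord("pdef").matcher("period defined").matches());  // true
    }
}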

3.2.3.2 Most Frequent Expansion (MFE)

Our most frequent expansion (MFE) technique leverages successful local expansions to help derive long forms for short forms that would otherwise be missed. Although not all short form expansions are correct, the assumption is that taken over the entire program, the most frequently occurring long form will be the correct one.


Short Form   Long Form       Relative Frequency
int          integer         0.821
impl         implement       0.840
obj          object          1.000
pos          position        0.828
init         initial         0.955
len          length          0.990
attr         attribute       1.000
num          number          0.985
env          environment     0.972
val          value           0.894
str          string          0.881
buf          buffer          0.992
ctx          context         0.962
msg          message         0.977
cs           copyright sun   0.665
var          variable        0.974
elem         element         1.000
param        parameter       0.992
decl         declare         0.920
arg          argument        0.964

Table 3.1: Top 20 entries in the most frequent expansion (MFE) list for Java 5.

We calculate MFE by running our local abbreviation expansion approach over the entire program. Then, for each short form, we count how many times the short form was matched to a given long form and calculate the relative frequency with which the short form was expanded to each long form. The long form with the highest relative frequency is considered to be the most frequent expansion. As with the final step in selecting between potential long forms, we also group long forms with the same stem when creating the MFE list. However, occasionally an incorrect long form may be considered the most likely expansion. To avoid this, we only consider long forms that were matched for more than half (0.5) of the short form matches, and only short forms that were matched at least 3 times in the entire program. We based these thresholds on empirical observation.
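Operationally, building the MFE list amounts to counting, over one pass of the local expansion technique, how often each long form was produced for each short form, and keeping only dominant, sufficiently frequent expansions. A minimal Java sketch under those assumptions follows; the data structures are illustrative only, and the 0.5 and 3 thresholds are the ones stated above.

import java.util.*;

// Minimal sketch of building an MFE list from the (short form, stemmed long form)
// pairs produced by running the local expansion approach over the whole program.
public class MfeBuilder {

    static Map<String, String> buildMfeList(List<String[]> expansions) {
        // Count how many times each short form was matched to each long form.
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (String[] pair : expansions) {
            counts.computeIfAbsent(pair[0], k -> new HashMap<>())
                  .merge(pair[1], 1, Integer::sum);
        }
        Map<String, String> mfe = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            int total = e.getValue().values().stream().mapToInt(Integer::intValue).sum();
            if (total < 3) continue;                     // short form matched at least 3 times
            Map.Entry<String, Integer> best = Collections.max(
                    e.getValue().entrySet(), Map.Entry.comparingByValue());
            double relativeFrequency = best.getValue() / (double) total;
            if (relativeFrequency > 0.5) {               // keep only a dominant expansion
                mfe.put(e.getKey(), best.getKey());
            }
        }
        return mfe;
    }
}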

We apply our MFE technique at two levels: the program level and the more general Java level. The program level ideally helps expand domain-specific short forms. For example, an open source implementation of Guitar Pro has frequent occurrences of the short form 'gp'. Although 'gp' was incorrectly matched to 'graphics' 8 times, our technique correctly expanded 'gp' to 'guitar pro' with a relative frequency of 0.68. In addition to program MFE information, we also use more general programming knowledge from the Java API implementation. The top 20 entries of our MFE list for Java 5 are presented in Table 3.1. If an unexpanded short form is not present in the program MFE list, we look for it at the more general Java level. The Java MFE list can be calculated ahead of time, or even computed over a larger set of Java programs, rather than just the Java API implementation. If our scoped approach were to be applied in practice, this is the stage where a hand-tuned MFE list could be used to improve accuracy.

It is possible that some frequently occurring short forms may never occur with the correct long form, or that the short form is so prevalent in the domain that the long form does not appear anywhere at all. Examples include common acronyms such as 'xml' or domain-specific terminology such as 'ast' in compilers or 'rsa' in encryption. One solution to this problem is to hand-tune the MFE list for the most frequently occurring short forms, or train the JavaMFE approach on a larger set of Java programs as mentioned above. Another solution would be to mine potential long forms beyond the scope of Java programs by utilizing online documents related to Java or computer science in general. For example, abbreviation expansion techniques created for English [LOPT00, Pak01] could be used to mine potential long forms from online textbooks in computer science.
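Continuing the sketch above, the two-level fallback described here is simply a lookup in the program's own MFE list followed by the precomputed Java MFE list. This is an illustration only, assuming maps of the kind produced by buildMfeList.

// Illustrative two-level MFE lookup: program-level list first, then the more
// general Java-level list; returns null if neither list contains the short form.
static String expandWithMfe(String shortForm,
                            Map<String, String> programMfe,
                            Map<String, String> javaMfe) {
    String longForm = programMfe.get(shortForm);
    return longForm != null ? longForm : javaMfe.get(shortForm);
}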


3.2.3.3 Implementation

Our technique is fully automatic and is implemented as a Java Eclipse plugin with command line scripts for the MFE calculations, which could easily be added to the plugin in the future. The current implementation is designed for batch processing, but could be incrementally updated or run in the background to support software maintenance tools.

Due to computational issues involved in Java regular expressions, we limited our non-dictionary words to length 10 or less. This rules out some non boundary words that would otherwise be expanded to combination word long forms, such as 'numericfield', but relieves our implementation from attempting to find long forms for non-dictionary words like 'pppppppppppq' (which we actually came across in an open source project). However, we do not feel limiting the short form length to 10 impairs our technique, since most abbreviations are considerably less than this limit.

Our approach uses a number of word lists and dictionaries, some of which have been hand-tuned for software. For example, we have removed any words from our stop list that could be content words in software, such as 'face', 'case', and 'turn'. The word lists used in our implementation as well as descriptions of how they were derived are available online.1

3.2.3.4 Analysis of Time and Space

The algorithm takes as input a number of word lists. The dictionary is the largest, containing 124,000 words and taking up 1.2 MB of space. The supporting word lists for contractions, proper nouns, and stop words take less than 100 KB of space. The Java MFE list contains fewer than 1,200 expansions and takes up just 22 KB of space. In terms of time, the algorithm creates 4 regular expression patterns for every short form. Given a short form of length n and a class file containing m characters, the algorithm takes n × m time in the worst case. To analyze a whole program, the worst-case time is multiplied by the number of short forms in the program.

1 http://www.cis.udel.edu/~hill/amap


Program                   Version   No. Devel.   NCLOC     Types   Methods   No. Non-Dictionary
Liferay Portal            4.3.2     94           393,802   4,050   39,747    188,955
OpenOffice.org Portable   2.2.1     8            372,807   4,213   20,374    274,969
iText.NET                 1.4-1     2            361,403   4,465   34,141    217,965
Tiger Envelopes           0.8.9     1            350,046   3,005   19,706    191,787
Azureus                   3.0.3.0   7            335,515   5,335   28,255    193,757

Table 3.2: Programs used in the evaluation.

3.2.4 Evaluation

We evaluated our automatic abbreviation expansion technique with two research questions in mind:

1. How does our technique compare to the program and Java MFE approaches?

2. How does our technique compare to the state of the art LFB [LFB07b] approach?

3.2.4.1 Experiment Design

Variables and Measures

The independent variable is the abbreviation expansion technique, which we evaluated by measuring the accuracy of each technique in finding the correct long forms for a human-annotated gold set of non-dictionary words. To evaluate how important local scope and domain information is to our expansion technique, we compared our approach to storing and using just the program-level (ProgMFE) or Java-level (JavaMFE) MFE information. To expand a short form using an MFE technique, we run our local expansion algorithm once on either the entire program or Java, and calculate the MFE list. The list is used to expand every short


form. Thus, the ProgMFE approach expands every short form for a given program to the same long form, and the JavaMFE approach expands every short form to the same long form independent of the program.

We also compared our expansion technique with existing work by implementing the Lawrie, Feild, and Binkley (LFB) technique based on their description [LFB07b]. According to their paper, we implemented LFB to search for dropped letter (including prefix) and acronym expansions in dictionary words appearing in the method or comment where the short form occurs. Then, if the short form is not a Java reserved word, a dictionary of words is searched for expansions (ispell Version 3.1.20). Although the paper mentions using maximum likelihood estimation (MLE)2 to select between multiple long form candidates in the future, the existing approach has no mechanism to select between multiple long form candidates. Therefore, if there is more than one long form found in the method and comment, or in the dictionary, no long form is returned.

There was one aspect of the LFB approach that we were unable to implement at this time. In contrast to our approach, which searches for no boundary short forms based on our combination word pattern, LFB handles combination long forms by recursively searching for possible places to split the short form. For example, the identifier splitting approach would split the non-dictionary word zeroindegree into zero-in-degree. To split these non-dictionary words, LFB searches for successively shorter prefixes and suffixes of dictionary words and a list of common abbreviations. We did not have access to this list of common abbreviations, and felt it might be unfair to evaluate the effectiveness of their approach with a substituted list. The only short form types this affects are combination word (CW); thus we only compare our technique to LFB for prefix (PR), dropped letter (DL), and acronym (AC) short form types.

2 Our notion of most frequent expansion is closely related to maximum likelihood estimation [MS99]. However, because we filter on the number of matches and require the long forms in our MFE list to be matched in the majority of cases, MFE is not identical to MLE.


          AC      PR      SL      DL     CW     OO      Total NCW   Total
Count     49      59      64      9      23     46      227         250
Percent   19.6%   23.6%   25.6%   3.6%   9.2%   18.4%   90.8%       100%

Table 3.3: Distribution of short forms in abbreviation types for the gold set. The five abbreviation types are acronym (AC), prefix (PR), single letter prefix (SL), dropped letter (DL), and combination word (CW); 'other' (OO) marks non-abbreviations. The column 'NCW' totals all abbreviation types but CW.

The dependent variable in our study is the effectiveness of each technique, measured in terms of accuracy. Accuracy is determined by counting the number of short forms that are correctly expanded from a gold set of non-dictionary words. If the non-dictionary word is a short form, the technique should output the corresponding long form; otherwise the technique should output nothing (no long form). To calculate accuracy, we divide the total number of correctly expanded non-dictionary words by the total number of non-dictionary words in the gold set.

Subjects

The subjects in our study are short forms originating from Java programs. We selected 5 open source Java programs from different domains and with different numbers of developers. We chose programs with approximately equal lines of code (between 300K and 400K) to avoid bias during our random selection of non-dictionary words for the gold set. Table 3.2 shows characteristics of the subject programs.

Liferay Portal. Liferay Portal is an open source portal framework for integrated Web publishing and content management with an enterprise service bus and service-oriented architecture. Because Liferay Portal is a secure portal platform, the program text contains terms from security in addition to web publishing and content management.

OpenOffice.org Portable. The goal of the PortableApps.com project is to make applications portable by taking existing applications and packaging them to run from a portable device (e.g., USB flash drive). The program text contains terms relating to document editing, run time GUI management, and mathematical calculations.

iText.NET. iText is an open source library for creating and manipulating PDF, RTF, and HTML files in Java. For example, iText allows developers to extend the capabilities of their web server applications in order to generate a PDF document. The program text contains terms related to reading and converting PDF files.

Tiger Envelopes. Tiger Envelopes is an open source personal mail proxy that automatically encrypts and decrypts mail. The program text contains terms related to encryption and mail clients.

Azureus. Azureus is a Java-based client for sharing files using the BitTorrent file-sharing protocol. The program text contains terms related to file management, runtime GUI management, and networking.

We randomly selected 250 non-dictionary words from the 5 subject Java programs. Two human annotators who had no knowledge of our mining technique manually inspected each short form candidate to identify the abbreviation type and the most appropriate long form for the given context. This served as our gold set. Some non-dictionary words were not abbreviations at all, such as mathematical variables or the program name, and were marked as abbreviation type 'other' (OO). The distribution of short forms across abbreviation types is listed in Table 3.3. We consider an occurrence of a non-dictionary word to be unique per method. Thus, for any given method, we assume that all instances of a short form have the same long form.


Methodology

We ran each of our implemented tools (the Scope technique, LFB, JavaMFE, and ProgMFE) on the entire set of 250 non-dictionary words. We compared the output of each tool with the gold set. If the long form in the gold set and the technique's automatically determined long form have the same stem according to Porter's stemmer [Por80], then the expansion is considered to be correct. If the non-dictionary word was not an abbreviation and the technique output no long form, then the expansion is also considered to be correct. We computed the accuracy for each type of abbreviation for each tool. We then computed the accuracy of each tool for short forms by length, from one character to 10 characters long, aggregated over all types of abbreviations.
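Under this definition, accuracy is a straightforward count over the gold-set entries. The following sketch illustrates the comparison; GoldEntry and stem() are hypothetical stand-ins (a real implementation would apply Porter's stemmer, and an empty string here denotes "no long form").

import java.util.List;

// Sketch of the accuracy measure: an expansion counts as correct if its stem
// equals the stem of the gold-set long form (both empty means "no long form",
// which is correct for non-abbreviations). GoldEntry and stem() are stand-ins.
class AccuracySketch {
    record GoldEntry(String goldLongForm, String toolLongForm) {}

    static double accuracy(List<GoldEntry> gold) {
        long correct = gold.stream()
                .filter(g -> stem(g.toolLongForm()).equals(stem(g.goldLongForm())))
                .count();
        return correct / (double) gold.size();
    }

    static String stem(String phrase) {
        // Placeholder: a real implementation would apply Porter's stemmer to
        // each word of the phrase; here we only normalize case and whitespace.
        return phrase == null ? "" : phrase.trim().toLowerCase();
    }
}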

3.2.4.2 Threats to Validity

We attempted to gather a domain-independent gold set of short forms by selecting large programs with different functionality. However, many of our subject programs involve a security component, and our gold set includes a number of short forms related to network security and encryption. Therefore the results of the study may not generalize to all program domains. In addition, because our technique is developed on Java programs predominantly written in English, the results of the study may not generalize to all programming language and natural language combinations. As with any subjective task, it is possible that the human annotators did not identify the correct long form for a given short form. In some instances, a single short form may be interpreted as different long forms by different developers. To limit this threat the gold set short forms were mapped by two independent developers who were unfamiliar with any of the techniques used in this study. When the appropriate long form was unclear, the non-dictionary word was classified as type ‘other’ (OO).


Type    LFB     JavaMFE   ProgMFE   Scope
CW      0.000   0.304     0.000     0.174
DL      0.111   0.778     0.667     0.778
OO      0.826   0.652     0.609     0.478
AC      0.285   0.122     0.408     0.469
PR      0.322   0.728     0.746     0.797
SL      0.297   0.313     0.594     0.688
NCW     0.401   0.467     0.599     0.630
Total   0.364   0.452     0.544     0.588

Table 3.4: Percent correct expansions for each technique and abbreviation type. To fairly compare our Scope technique to LFB, use the NCW total accuracy.

3.2.4.3 Results and Analysis

We present the accuracy results for our experiment in Table 3.4. Overall, our approach provides a 57% improvement in accuracy over the current state of the art, LFB, when non combination word (NCW) short forms are considered. In addition, both the JavaMFE and ProgMFE approaches had higher accuracy overall than LFB. Because LFB outputs no long form rather than choosing between two potential long forms, a significant portion of the technique's correct results are due to correctly not identifying long forms. This is evidenced by LFB's high accuracy, over 80%, for the other (OO) category.

The accuracy results in Table 3.4 also demonstrate the effectiveness of using scope in correctly identifying long forms. As illustrated in Figure 3.3(a), there is a steady increase in accuracy as more local context information is used, from JavaMFE to ProgMFE to our fully contextualized Scope approach. Figure 3.3(b) shows the number of correct matches broken down by short form length rather than type. Interestingly, LFB performed best on short forms of length 2 and 1 (with accuracy of 45% and 65%, respectively), worst on short forms of length 4 (only identifying 12% correctly), and average for longer expansions. As expected, the Scope approach had the highest accuracy for short forms of length 3 (64% accuracy), as well as longer short forms of length 6 and 7 (78% and 60% accuracy, respectively).


[Figure 3.3: Number of correct expansions for each technique (a) by type and (b) by length in letters. Each panel plots the number of correct expansions (0 to 160) for LFB, JavaMFE, ProgMFE, and Scope, broken down in (a) by abbreviation type (SL, PR, AC, OO, DL, CW) and in (b) by short form length in letters. To fairly compare all the techniques to LFB, do not include the top most bar for combination word (CW).]

The MFE approaches perform similarly, although JavaMFE outperforms ProgMFE for longer length short forms and JavaMFE underperforms all techniques for length 2.

3.2.5 Discussion and Future Work

The results of our experiment demonstrate that our scoped approach is a

significant improvement over the state of the art LFB. For the 227 NCW non-dictionary words, our technique had an accuracy of 63.0%, whereas LFB correctly identified just 91 long forms for an accuracy of 40.1%. Out of the 159 incorrect expansions for LFB, 123 short forms had more than one long form candidate and were therefore missed. Thus, the majority of long forms missed by LFB were due to not choosing between multiple possible expansions.

Despite our success over LFB, there is still room for improvement. We manually investigated the short form expansions missed by the Scope approach and

identified some patterns in the set of missed and incorrect expansions which point to avenues for further research.

Humans unable to identify the long form. For some incorrect results, a human was unable to identify the long form. Out of 103 total incorrect expansions for our Scope approach, 36 were single letter abbreviations. Over half of the incorrect single letter expansions lacked a clear expansion given the context, according to the human annotators. Some single letter abbreviations, such as 'i' or 'x', are commonly used out of convenience and add no semantic value to the code. In the future, we plan to automatically identify when an abbreviation has no intended meaning rather than attempting to assign meaning where none is intended.

Incorrectly choosing between multiple candidates. A second class of missed short form expansions was due to selecting the incorrect candidate from multiple choices. For instance, the short form 'loc' was expanded to 'locate' instead of 'local'. The incorrect long form was found at the program level using the program MFE list because no expansion of 'loc' was found in the method. Both of these long forms are common within the program and occur in the same proximity. By refining our long form selection algorithm, especially at class and program levels, we hope to obtain more appropriate long forms when choosing between two seemingly acceptable long forms.

Long form based on domain knowledge absent in code. Missing domain knowledge also presents a problem for expanding abbreviations. For example, the short form 'lsup' occurred in the math-based typesetting class StarMathConverter, and no long form candidate was found in the entire program. The human annotator investigated beyond the program into domain knowledge and found that 'lsup' is a TeX command that stands for 'left superscript'. This particular short form was part of a set of abbreviations that are so common in math-based typesetting code that they are generally understood by the developers, and thus long forms are not present

anywhere in the code. Many domains have similar sets of generally understood abbreviations that make finding expansions exceedingly difficult. To remedy this problem, we plan to improve our Java MFE list by mining over more programs and hand-tuning the long forms for the most frequently occurring short forms.

Further improvements. The Scope approach could be further improved by using a specialized edit distance to eliminate unlikely dropped letter and combination word expansions. An edit distance [JM00] assigns a similarity score between two strings based on the number of edits, in terms of additions and deletions, required to convert one string, such as a short form, into the other, such as a long form candidate. For example, vowels are more likely to be dropped than consonants, so a specialized edit distance would penalize consonant additions more than vowel additions (a small illustrative sketch appears at the end of this section).

Another avenue of future research is using a long form candidate's part of speech to eliminate unlikely candidates. For example, many abbreviation expansions in our gold set are nouns, such as 'integer' or 'string', or noun phrases, such as 'extensible markup language' or 'pseudo random number generator'. However, there are two issues that must be overcome with this approach. First, part of speech is more difficult to determine for software words than for English text, because many of the sentences and phrases in software are in the imperative form, which existing part of speech taggers are not trained on. Second, the part of speech of a short form, and thus of its long form candidate, may vary depending on the location. Consider the short form 'def' in the identifiers defFont and fontDef. In defFont the appropriate long form is the verb 'define', whereas in fontDef the appropriate long form is the noun 'definition'. Thus, the location of a short form within an identifier must be taken into account along with the parts of speech of the surrounding words.

Finally, in this work we have only begun to expand abbreviations for Java


programs predominantly written in English. Although the technique can theoretically be applied to any natural language, further evaluation and development are necessary to maximize the performance of our automatic abbreviation expansion technique for other languages.
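As a concrete illustration of the specialized edit distance mentioned under Further improvements, the following sketch charges less for inserting a vowel (a letter that was likely dropped) than a consonant when growing a short form into a long form candidate. The particular costs are arbitrary and the code is illustrative only; it is not part of the implemented Scope approach.

// Sketch of a vowel-weighted edit distance between a short form and a long form
// candidate: inserting a vowel is cheap, inserting a consonant is expensive.
// The costs (0.5, 2.0, 1.0) are illustrative, not tuned values.
class WeightedEditDistance {

    static double insertCost(char c) {
        return "aeiou".indexOf(c) >= 0 ? 0.5 : 2.0;
    }

    static double distance(String shortForm, String longForm) {
        int n = shortForm.length(), m = longForm.length();
        double[][] d = new double[n + 1][m + 1];
        for (int j = 1; j <= m; j++) d[0][j] = d[0][j - 1] + insertCost(longForm.charAt(j - 1));
        for (int i = 1; i <= n; i++) d[i][0] = i;   // deleting a short-form letter costs 1
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double substitute = d[i - 1][j - 1]
                        + (shortForm.charAt(i - 1) == longForm.charAt(j - 1) ? 0.0 : 1.0);
                double delete = d[i - 1][j] + 1.0;
                double insert = d[i][j - 1] + insertCost(longForm.charAt(j - 1));
                d[i][j] = Math.min(substitute, Math.min(delete, insert));
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        // 'event' (one vowel and one consonant inserted) scores better for 'evt'
        // than 'evaluate', which also needs extra consonant and vowel insertions.
        System.out.println(distance("evt", "event") < distance("evt", "evaluate")); // true
    }
}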

3.3 Conclusion

Even though both our source code preparation techniques outperformed the

competing state of the art, the techniques require further work. For identifier splitting, we plan to continue investigating the accuracy of same-case splitting and further evaluate the mixed-case split on a broader data set in the future. One important lesson from abbreviation expansion concerns the different types of abbreviations and the regular expressions used to search for them. We use these regular expressions to search for abbreviations in their most specific and accurate location: between a variable name and its type. We can then remove these abbreviations from names to eliminate redundancy when combining variable names and types into a single phrase. These phrases are then used for query reformulation, described in the following chapter.


Chapter 4

SELECTING GOOD QUERY WORDS

The first step in producing accurate search results is to determine good query words. Good query words not only describe the concern, or information need, for which the developer is searching, but also match the words used in the software's implementation. To this end, we developed a novel query reformulation tool that enables the developer to (1) quickly discriminate between relevant and irrelevant search results, and (2) provide valuable feedback for further query refinements [HPVS09].

4.1 Problem

To identify code relevant to a concern, developers typically use an iterative

refinement process [FNL89, Hen94] to formulate a query for search, as shown in Figure 4.1. In this process, the developer enters a query into a source code search tool. Depending on the relevance of the results, the user will reformulate the query and search again. This process continues until the user is satisfied with the results (or gives up). In this process, the user has two important tasks: (1) query (re)formulation and (2) determining whether the search results are relevant.

Challenges

Studies show that formulating effective natural language queries can be as important as the search algorithm itself [Hen94]. During query formulation, the developer must guess what words were used by the original developer to implement the targeted feature. Unfortunately, the likelihood of two people choosing the same keyword for a familiar concept is only between 10-15% [FLGD87].

[Figure 4.1: Iterative Query Refinement and Search Process. The user (re)formulates a query, a search method runs the query over the source code, and the user determines the relevance of the returned search results, iterating until satisfied.]

Specifically, query formulation is complicated by the vocabulary mismatch problem [Hen94] (multiple words for the same topic), polysemy (one word with multiple meanings), and the fact that queries with words that frequently occur in the software system will return many irrelevant results [MRS08]. It is very difficult to overcome these challenges by automatically expanding a query on the user's behalf. For polysemy and word frequency, the user needs to add additional query words about the feature to restrict the search results. Such detailed knowledge about the feature exists only in the developer's mind. Further, automatically expanding a query with inappropriate synonyms can return worse results than using no expansion [SHPVS08]. Thus, we believe the role of automation is not to automatically expand the query, but to provide information about the underlying word usage in the code that will enable the human user to quickly formulate an effective query. Currently, few systems recommend alternative words to help developers reformulate poor queries [PMD06, SFH+07].

Another challenge in the iterative refinement process is discriminating between relevant and irrelevant search results. Presentation of the search results is not always adequate to determine relevance, forcing the user to further examine the code. If users cannot quickly determine that results are irrelevant, they could waste significant time investigating irrelevant code. Existing static code search techniques using natural language queries [MSRM04, PPM+06] provide little support to help developers determine whether search results are relevant beyond ranking the results [PM07, SFH+07].

Providing Automated Support

In this chapter, we present a novel approach to providing automated support to the developer both in formulating queries and discriminating between relevant and irrelevant search results. Our key insight is that the context of words surrounding the query terms in the code is important to quickly determine result relevance and reformulate queries. For example, online search engines such as Google display the context of words when searching natural language text. We automatically capture the context of the query words by extracting and generating natural language phrases from the underlying source code, where a phrase is a sequence of words (see Section 4.2.1 for a more complete definition). By associating and displaying these phrases with the program elements they describe, the user can see the context of the matches to the query words, and determine the relevance of each program element to the search. Because we provide word context for the occurrences of query words in the source code during the iterative refinement process, we call our approach contextual search.

For example, consider the search results for the query "convert" in Figure 4.2. Extracted phrases are to the left, followed by '::' and the associated method and field signatures. Phrases associated with multiple signatures are followed by the number of matching signatures and '>' to indicate the expanded hierarchy of subphrases.

Figure 4.2: Example results for "convert" query.

A subphrase is a partial match between two longer phrases that is used to group related signatures together. For example, "call convert result" is a subphrase of "generate call convert result," and the query "convert" is a subphrase of "convert result", "convert arg", "can convert", etc. By skimming the list of words occurring with "convert", we notice that convert can behave as a verb which acts on objects such as "result", "arg", or "parameter"; or convert can itself be acted upon or modified by words such as "can" and "get args to." If the user were searching for code related to "converting arguments", they could quickly scan the list of phrases and identify "convert arg" as relevant. Thus, understanding this context allows the user to quickly discard irrelevant results without having to investigate the code, and focus on groups of related signatures that are more likely to be relevant.

The phrases, which are extracted from source code, naturally form a hierarchy of related phrases. At the top of the hierarchy are more general phrases, and at the bottom are the most specific phrases, which contain the most words. Continuing with the "convert" example, the most general phrase is the query, "convert", and more specific phrases include "convert result", "can convert", and "get args to convert". Further, the phrase "convert result" is more general than "generate convert result", which occurs below it in the hierarchy.

result”, which occurs below it in the hierarchy. The leaf nodes of the hierarchy are the specific program elements that match the phrases. This phrase hierarchy allows the developer to quickly identify relevant program elements by reducing the number of relevance judgments, while the natural language phrases help the developer to formulate effective queries. Phrases for Search and Query Reformulation Our contextual approach to search and query reformulation is motivated by insights gained from Shepherd et al.’s approach to query expansion and code search [SFH+ 07]. The approach by Shepherd et al. associates (V-DO) pairs with method signatures and comments, and uses these pairs to locate action-oriented concerns that cross-cut object-oriented systems. Their previous experimental study showed that by capturing specific word relationships in identifiers, such as V-DO pairs, a query recommendation tool enabled users to produce more effective queries more consistently than with two competing search tools. However, strict V-DO queries cannot be used to search for every concern. For example, the V-DO approach cannot search for features expressed as noun phrases that have no a verb, such as “reserved keyword” or “mp3 player”. One potential approach to go beyond V-DO pairs is to capture all word relationships in software by using co-occurrences [MS99]. Although co-occurrences can find meaningful word relationships, in our investigation, we found that cooccurrences do not consistently help to find relationships that would aid in query reformulation. We tried using information about co-occurring terms within identifiers, within methods, in comment-to-method mappings, from the method name to methods called within the method’s implementation, etc. The key problem that we observed with using co-occurring word pairs for query reformulation is that cooccurrences lose information because they are not ordered. For example, the fact


that “item” and “add” co-occurred more often than due to chance was less useful than simply knowing that the phrase “add item” frequently occurred. Thus, our key observation in query reformulation is that word order matters. The ordering of words gives contextual clues about the meaning of the co-occurring words. For example, “map object” conveys a directive to map an object, whereas “object map” likely describes a hash table of objects. We take this idea further by capturing context not just with co-occuring words, but with whole phrases. Like V-DO, we capture phrases with verbs and direct objects, but we go beyond V-DO in capturing phrases that do not contain any verbs and can contain prepositions. Thus, we can more accurately represent all the phrase clues programmers have left in the identifiers in source code. 4.2

4.2 Approach: Capturing Context with Phrases

Figure 4.3 illustrates our contextual search process. There are two main

components to our approach: (1) extracting and generating natural language phrases from source code, followed by (2) search and hierarchical categorization of the phrases. Each subsection describes the challenges to be addressed, followed by our approach and detailed algorithm.

4.2.1 Information Extraction

Challenges

The first challenge we faced in developing a phrase extraction process was deciding what phrases should be extracted that best capture word context. We focused on extracting phrases from method signatures (including declaring class name, method name, type, and formal parameters) and field signatures (including declaring class name, type, and field name). We developed rules to extract verb, noun, and prepositional phrases from method and field signatures. A noun phrase (NP) is a sequence of noun modifiers, such as nouns and adjectives, followed by a

noun, and optionally followed by other modifiers or prepositional phrases [MS99]. A verb phrase (VP) is a verb followed by an NP, and does not usually include the subject of the verb. A prepositional phrase (PP) is a preposition plus an NP, and can be part of a VP or NP. For the purposes of this work, we denote an NP containing a PP as an NPP, and a VP containing a PP as a VPP.

We used exploratory data analysis techniques [MS99] to cluster groups of related method and field signatures and developed phrase extraction rules for each subgroup. For example, we developed extraction rules for static methods and fields, methods with and without parameters, and methods and fields with verbs and objects in the name. We analyzed methods with different return types [LBS06], as well as prepositions in the beginning, end, and middle of names. When possible, we generalized common rules. For instance, the same rule to extract direct objects from formal parameters was developed independently for static and non-static methods. We also dropped any rule that extracted meaningful phrases for only some signatures, if we could not determine under what conditions the rule should be applied.

The quality of our extraction rules depends on the variety of signatures under study and their naming conventions. We endeavored to study a diverse set of Java signatures and naming conventions by analyzing the most frequently occurring identifiers in a set of 9,000 open source Java programs downloaded from sourceforge.net. This set of programs contains over 18 million signatures, with 3.5 million unique names consisting of over 200 thousand unique words. This analysis technique could be replicated to generate phrases for languages other than Java. The amount of time required depends on the desired level of accuracy. Algorithm 5 is the culmination of our detailed analysis of thousands of unique method and field signatures.


[Figure 4.3: The Contextual Search Process. A preprocessing information extraction step produces an NL phrase mapping (phrase :: signature pairs) from the source code; the user then iteratively (re)formulates a query, the search and hierarchical categorization component runs it against the phrase mapping, and the user determines the relevance of the results. Examples of NL phrase mappings shown in the figure:
add item :: Books boolean addItem(BookItem biNew)
add book item :: Books boolean addItem(BookItem biNew)
bi new :: Books boolean addItem(BookItem biNew)
update event :: PlaylistFileManager void update(Event event)
compare playlist file to object :: PlaylistFile int compareTo(Object o)
load history :: History static void load()]

Automatic Phrase Extraction

Our guiding principle to phrase extraction is that providing the user with incorrect or misleading information is worse than providing no information at all. Therefore, we strived to achieve a balance between conservatively extracting information in which we have high confidence of accurately portraying word context, and yet still providing enough information that the system is useful.

Extracting Phrases from Signatures

Our extraction process has four major steps: (1) splitting identifiers into space-delimited phrases; (2) determining if the (method or field) name should be treated as an NP or VP, and whether it contains a PP; (3) identifying the verb, direct object, preposition, and indirect object of the VP; and (4) inferring arguments for VPs to generate additional phrases. Our phrase extraction technique is presented in Algorithm 5, with example extracted phrases shown in Figure 4.3.

During NP and VP extraction, we use formal parameter names and types as objects. In line 6 of Algorithm 5, all formal parameter names and non-primitive formal types are added to args. Primitive types include the basic types int, void, boolean, etc., as well as String, since it is a very common type. We add non-primitive


Algorithm 5 extractPhrasesFromSignature(sig)
1: Input: field or method signature, sig
2: Output: set of phrases for the signature, pset
3: name ← split(name(sig))
4: type ← split(typeOrReturnType(sig))
5: class ← split(declaringClass(sig))
6: args ← set of split(formals(sig))
7: pset ← ∅
8: if isConstructor(sig) then {// NP}
9:   pset ← {name} ∪ args
10: else if hasTrailingPastParticiple(name) then {// NP}
11:   pset ← {name}
12: else if hasLeadingPreposition(name) then {// NP}
13:   name ← {class + name}
14: else if hasLeadingVerb(name) then {// Construct VP}
15:   v ← getVerb(name)
16:   // Find direct object
17:   if hasObjectInName(name) then
18:     DO ← getObject(name)
19:   else if hasParameters(sig) then
20:     DO ← {getFirstFormalName(sig) ∪ getFirstFormalType(sig)}
21:   else
22:     DO ← class
23:   end if
24:   if containsPreposition(DO) then {// Generate VPs}
25:     for all prepositions p ∈ DO do
26:       DOi ← getWordsBeforePrep(DO, p)
27:       IO ← getWordsAfterPrep(DO, p)
28:       pset ← pset ∪ inferArguments(v, DOi, p, IO, args)
29:     end for
30:   else
31:     pset ← inferArguments(v, DO, ∅, ∅, args)
32:   end if
33: else {// NP}
34:   pset ← pset ∪ name
35:   if isField(sig) then
36:     pset ← pset ∪ type
37:   end if
38: end if
39: return pset


parameter types as well as parameter names to account for instances where a parameter name is an abbreviation or conveys little meaning.

Identifying NPs and NPPs

To determine if a method name should be treated as an NP, we first identify whether the signature is a constructor in line 8. If so, we consider the name and each formal argument in args as individual NPs that each map to the signature. Next, in line 10, we check if the signature contains a trailing past participle, such as in "action performed" or "key pressed." If so, we consider the name to be an NP and add it to pset, the set of phrases for the signature. These cases comprise the NPs for which we have high confidence of correctness. We next determine if the name contains a PP. In line 12, we check if the first word is a preposition, and if so, we concatenate the class name before the method name, and add the constructed phrase to pset. For example, toByteArray in class FileWriter would become "file writer to byte array." In contrast to previous work [SFH+07], we do not automatically infer a verb such as "convert" in these situations. Following our conservative policy, we treat such cases as NPPs rather than potentially erroneous VPs. Next, we attempt to identify the name as a VP. If we do not successfully identify the name as a VP, we treat the name as an NP in line 34.

Identifying and Constructing VPs and VPPs

In line 14, we consider the name to be a VP if the first word is a verb. Note that the behavior of hasLeadingVerb depends on whether a method or field signature is being analyzed. For fields, the name must begin with a verb and consist of more than one word; for methods, the name need only start with a verb. Since fields are less likely to begin with verbs and have no parameters, we only process a field name as a VP if there is also an object in the name, e.g., printWhenExpression.

After extracting the verb, we locate the verb's arguments, starting in line 17. First, we determine if there is an object in the name following the verb, as in getConnectionType. If not, we use the first formal parameter name as the object, as well as the first parameter's type if it is not primitive, as in line 20. Otherwise, we use the class name as the object. Next, in line 24, we look for any prepositions in the verb's object. For every preposition we find, we gather information about direct and indirect objects and call inferArguments to generate additional phrases. We identify the direct object from the words before the preposition, and the indirect object from the words after the preposition. We continue examining all prepositions in the verb's object because not all words that can be prepositions act as such in identifiers. For example, in "show about dialogue," "about" is actually acting as an adjective that modifies "dialogue."

Generating Additional Phrases

Based on the formal parameter names and types, we construct additional VPPs that represent the signature without repeating the name. For example, a method may have a general name, such as addItem. However, a parameter (name or type) may indicate that only a specific type of item is being added, e.g., a BookItem. By inferring the phrase "add book item" in addition to "add item," this signature will also be returned for queries such as "add book." To generate new phrases, we look for partial matches to the direct and indirect objects in the argument list of method signatures, args. Recall that all formal parameter names and non-primitive formal types were added to args earlier. For every phrase in args that overlaps one word in the direct or indirect object of the VPP, we emit all possible combinations of original and inferred direct and indirect objects as phrases. For example, for the signature Base64 static Object decodeToObject(String sourceObject), we would output the phrase "decode base 64 to source object." If the name ends in a preposition, and thus contains no indirect object, we

treat every argument as an indirect object. In general, no more than 1-2 phrases are added for signatures with parameters, but we have seen as many as 10 for complicated method names that contain prepositions.
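The first of the four extraction steps, splitting an identifier into a space-delimited phrase, can be illustrated with a simple camel-case split. The sketch below handles only the easy mixed-case and underscore cases; the splitter actually used is the technique from Chapter 3, which also handles same-case splitting.

import java.util.Arrays;
import java.util.List;

// Simplified sketch of splitting an identifier into its word sequence on
// camel-case and underscore boundaries (same-case splits are not handled here).
class IdentifierSplitter {
    static List<String> split(String identifier) {
        String spaced = identifier
                .replaceAll("([a-z0-9])([A-Z])", "$1 $2")     // addItem -> add Item
                .replaceAll("([A-Z]+)([A-Z][a-z])", "$1 $2"); // XMLParser -> XML Parser
        return Arrays.asList(spaced.toLowerCase().split("[\\s_]+"));
    }

    public static void main(String[] args) {
        System.out.println(split("addItem"));             // [add, item]
        System.out.println(split("XMLParser"));           // [xml, parser]
        System.out.println(split("printWhenExpression")); // [print, when, expression]
    }
}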

4.2.2 Search and Hierarchical Categorization

Once natural language phrases have been extracted from the source code,

the second component searches the phrases associated with each program element and groups related signatures into a hierarchy based on partial phrase matching. As illustrated in Figure 4.2, phrases at the top of the hierarchy are more general and contain fewer words, whereas phrases more deeply nested in the hierarchy are more specific and contain more words. In general, it is computationally infeasible to enumerate all possible subphrases, which is an instance of the power set problem. Thus, our technique is based on the top-down approach used in Max-Miner [RJB98] to efficiently mine long data patterns from databases. In contrast to Max-Miner, our implementation is recursive, and we use regular expressions to approximate the set operations as well as to extend the algorithm to handle sequences of words instead of sets.

In displaying search results, it is important to give an example of what is retrieved so the user can decide whether the query needs to be reformulated and in what way [FNL89]. For this reason, in addition to phrases, we also display signatures that match each phrase in the source code. Our existing implementation only places a signature in the hierarchy once, in the topmost category possible. We achieve this by sorting the candidate phrases by the number of phrase matches. We also use the total number of matches of a phrase to sort the branches when displaying the hierarchy to the user.

During the contextual search process and when building the hierarchy, query words do not have to match a phrase exactly but only preserve word order. For example, if the query is "text field," and the phrase "text field xml file" maps to a

signature, both “text field file” and “text field xml” will be considered as candidates for sub-phrase matching, even if neither phrase exactly describes a signature. In addition, our approach uses the longest possible phrase that describes a subset of signatures in the hierarchy. For example, if all the signatures matching “text field” also match “j text field”, our approach will skip the shorter phrase “text field” and only add “j text field” to the hierarchy. 4.2.3

4.2.3 Implementation

Our current implementation extracts and generates phrases from Java code as

an Eclipse plug-in. We use a morphological parser, PC-Kimmo [Ant90], to determine the possible parts of speech for individual words. The hierarchical categorization of phrases is implemented in php (http://www.cis.udel.edu/~hill/context).

Ideally, this technique should be integrated into a development environment. For this to be feasible, it must be possible to incrementally update the phrase representation as the code evolves. Because phrases are stored per method or per field, a new set of phrases can be extracted for modified program elements in the background or before the next search is executed. The initial extraction time from source code is very reasonable: 2-5 seconds for 20 KLOC, 8 seconds for 75 KLOC, and under 3 minutes for 1.5 million LOC. The extraction time depends on the length of signature names as well as the number of parameters overlapping NPs in method names. The hierarchy of phrases is constructed online based on the query with imperceptible delay for small (less than a thousand) result sets. For larger result sets, delay has been minor (less than ten seconds), but noticeable. For more substantial code sizes, a map/reduce architecture could be used to reduce overall search costs.

In this work, we have chosen not to use stemming, since removing a stem can affect the part of speech of a word. The effect of changing the part of speech of a word on the readability of these phrases has not yet been studied. Also, other approaches have used synonyms [SFH+07]. Rather than use potentially inaccurate, domain-independent synonyms [SHPVS08], we have chosen to leave exploration of synonyms for future work.

approaches have used synonyms [SFH+ 07]. Rather than use potentially inaccurate, domain-independent synonyms [SHPVS08], we have chosen to leave exploration of synonyms for future work.

Number of Software Developers in Study
No. Years    Programming Experience    Industry Experience    Frequency    Perform Maintenance    Perform Maintenance on Code Not Authored
10+ years    13                        –                      Daily        5                      2
5-9 years    4                         6                      Weekly       7                      3
1-4 years    5                         11                     Monthly      7                      7
< 1 year     –                         5                      Yearly       3                      10

Table 4.1: Subject Developer Characteristics

4.3 Evaluation

In this evaluation, we investigated the following research question: What effect does a contextual search of natural language phrases have on the effort and effectiveness of developers searching source code? To evaluate this research question, we compared our contextual search with

an approximation of the verb-DO search method [SFH+ 07] to explore whether natural language phrases beyond verb-DO improve searching capabilities, without studying effects caused by synonym recommendations or other minor algorithmic differences. In addition, we compare against a version of our tool that does not display the phrases or hierarchy, only the matching signatures. In this study, we compare the search results of 22 developers performing 28 concern location tasks.

4.3.1 Independent Variable

The independent variable in our study is the search technique: contextH , V -

DO, and contextL . The contextH search technique is the contextual search approach described in Section 4.2. We compare contextH to a V -DO approach [SFH+ 07]


which we implemented. For the purposes of evaluation, we removed as much variability as possible between V -DO and contextH by implementing V -DO within the contextH framework, while still maintaining the essence of the approach. The V -DO approach requires the user to enter verb and direct object queries as input and outputs a list of signatures. The query consists of a single verb, followed by a direct object which may be multiple words. Examples include “remove item” and “lookup performance event.” The approach will always treat the first word of the query as a verb, and matches exact V-DO phrases only [SPVS06]. For example, the query “lookup performance” will only match “lookup performance” and not “lookup performance event.” To address the vocabulary mismatch problem, V -DO displays a set of verb and object recommendations for query reformulation below the list of exact matches. One column shows a list of the verbs that co-occur in the code with the direct object in the query, sorted by frequency. Similarly, another column displays all the direct objects that co-occur in the code with the query’s verb. The extraction process for V -DO differs slightly from contextH . First, in the case of constructors, the verb “constructs” is added in front of the class name. Second, non-void methods beginning with the prepositions “from,” “to,” and “as,” are replaced with the verb “convert.” Method names beginning with “is” are replaced with “check.” Finally, no prepositional phrases are explored or new phrases generated as with contextH ; once the direct object is identified the phrase is complete. We also compare contextH to a baseline version, contextL . The contextL technique uses the same query and search technique as contextH to identify matches, but skips the hierarchical categorization step and simply displays the results in a list. We use the contextL technique to explore whether the phrase matching or the hierarchical categorization has more of an effect on the search.
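To make these verb rules concrete, the sketch below captures the special cases listed above (a minimal illustration of our re-implementation; the class and helper names are illustrative, and this is not the original V -DO tool):

    public class VdoVerbRules {

        // Applies the special-case verb rules to the first word of a split method name.
        static String verbFor(String firstWord, boolean isConstructor, boolean returnsVoid) {
            if (isConstructor) {
                return "constructs";          // the direct object is then the class name
            }
            if (!returnsVoid && (firstWord.equals("from") || firstWord.equals("to") || firstWord.equals("as"))) {
                return "convert";             // non-void method starting with a preposition
            }
            if (firstWord.equals("is")) {
                return "check";               // boolean-style "is" methods
            }
            return firstWord;                 // otherwise the first word is treated as the verb
        }

        public static void main(String[] args) {
            System.out.println(verbFor("to", false, false));     // convert
            System.out.println(verbFor("is", false, false));     // check
            System.out.println(verbFor("remove", false, true));  // remove
            System.out.println(verbFor("Cart", true, true));     // constructs
        }
    }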


4.3.2 Dependent Variables and Measures

The dependent variables in the study are user effort and search effectiveness.

We measure effort in terms of the number of queries submitted, ignoring any identical consecutive queries. We measure effectiveness by calculating the common measures of precision and recall on each search result set [MRS08]. Precision is the percentage of search results that are relevant, and captures how many irrelevant results accompany the relevant ones. Recall is the percentage of all relevant results that were correctly returned by the search, and captures how many relevant results the search missed. We combine precision and recall using the F measure, which is high only when both precision and recall are similarly high. Although an effective query would ideally have both high precision and high recall, and thus a high F measure, a single query is unlikely to achieve both. For the search techniques in this experiment, individual queries will typically capture either high recall (by returning many results) or high precision (by returning few, but very relevant, results).
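For reference, assuming the balanced (harmonic-mean) form of the F measure that is standard in information retrieval, the three measures can be written as

\[ P = \frac{|\mathit{relevant} \cap \mathit{retrieved}|}{|\mathit{retrieved}|}, \qquad R = \frac{|\mathit{relevant} \cap \mathit{retrieved}|}{|\mathit{relevant}|}, \qquad F = \frac{2PR}{P + R}. \]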

4.3.3 Subjects

The subjects of our study are the human developers and concerns. The

concerns formed the search tasks for which subject developers were required to construct queries. The concerns were also used as a gold standard, i.e., a set of relevant program elements, to evaluate effectiveness. Developers We obtained results from 22 volunteer software developers with varying levels of programming and industry experience. Table 4.1 shows characteristics of our subject population. The distribution of years of programming and industry experience 89

for each subject is displayed on the left of the table, and the frequency that they perform maintenance tasks is on the right. Although we confirmed that 27 subjects would participate, 22 completed the study. Concerns The description and contents of concerns add significant variability to the study. To control for this variability as much as possible, we used concerns from two completely different sets, which have completely different types of concern descriptions, and different methodologies for deriving the gold sets. The first set of 19 concerns is from the 45 KLOC JavaScript/ECMAScript interpreter and compiler, Rhino. The gold sets of the concerns were derived by the removal dependency rule [EZS+ 08]: under this rule, a method or field was only considered to be associated with a concern if it should be removed or modified when the concern is removed from the program. Two human analysts used this rule to determine 415 concerns for the Rhino program [EZS+ 08]. Each concern maps to a subsection of the documentation, which is used as the concern description. We selected a random subset of these 415 with some restrictions. First, the concerns varied in size from over 300 program elements to just a single element. We selected the most common sizes by restricting our sample to the middle 50% of the concerns, with sizes ranging from 4 to 25. Second, since each concern mapped to a specific section or subsection of the documentation, we wanted to have a representative sample to control for whether different sections tend to have certain types of concerns. Thus, we selected a random sample of 19 concerns such that every major section of the documentation was represented before repeats were selected. We selected 19 concerns so that the total number could be divided evenly into four groups for the design. The second set of concerns consists of 9 user-observable, action-oriented concerns from 4 programs ranging in size from 23 to 75 KLOC [SFH+ 07]. The four 90

programs are: iReport, a visual report builder and designer; jBidWatcher, an auction bidding, sniping, and tracking tool for online auction sites such as eBay or Yahoo; javaHMO, a media server for the Home Media Option from TiVo; and Jajuk, a music organizer for large music collections. The concern descriptions consist of screen shots of each concern being executed. The concern implementations were derived by a set of two human analysts who agreed on the concern implementations after executing and exploring the concerns [SFH+ 07]. Both sets of concerns were derived by two groups of independent researchers, and have been used as subjects in previous evaluations [EZS+ 08, SFH+ 07]. It should be noted that, as a compiler, Rhino is outside most of our developers' familiar domain. In addition, it is known from previous experience that the concerns from javaHMO and Jajuk are implemented using very different words than appear in the user interface, which is used for the concern description.

4.3.4 Design and Methodology

We designed the experiment to compare contextH with V -DO and contextL .

In the design, there were two blocking factors: the order that the search techniques, or treatments, were applied, and the concerns. The order the treatments were applied is important to control for learning effects. To create concern blocks, the concerns were randomly assigned to 4 groups of 7 concerns (A − D) such that each group contained 4-5 Rhino concerns and 2-3 concerns from 3 other programs. This ensured that each treatment was applied to a consistent variety of concerns. We used a randomized crossed block design [DV99] to create 8 experimental units, 4 for each comparison (contextH with V -DO and contextH with contextL ). The four experimental units used to compare contextH with V -DO are presented in Table 4.2. Similar units were used for comparing contextH with contextL . Every treatment was applied to every concern and every order, although not every order was applied to every possible combination of concern blocks. 91

Unit    Order of contextH    Concern group for contextH    Concern group for V -DO
1       1                    A                             D
2       1                    C                             B
3       2                    B                             A
4       2                    D                             C

Table 4.2: Experimental units for comparing contextH with V -DO

Subjects were initially randomly assigned to experimental units. Because not all subjects completed the study, we do not have an equal number of replications for each experimental unit. Units with fewer replications were assigned to new subjects as they volunteered; thus, every unit was completed by at least 2 subjects, but no more than 4. The subjects were asked to fill out an exit survey after completing the experiment. All the experimental materials, including the instructions for each experimental unit, are available online: http://www.cis.udel.edu/∼hill/context.

4.3.5 Threats to Validity

Studying the effects of human subjects on such an open-ended task as concern

location poses many challenges. Although we endeavored to control for variability as much as possible, there are still threats to the validity of the results. The subject concerns are an unavoidable threat. To minimize the effect that some concerns are more difficult to locate and formulate queries for, we used concerns from 5 different programs with two different types of descriptions. In addition, each participant applied each treatment to 7 concerns to avoid any one concern dominating the results. However, it is possible that the concern groups that we randomly selected were not of equivalent difficulty. We avoided this as much as possible by ensuring that each group contained concerns from at least 3 different programs, under the assumption that concerns from the same program will be of approximately the same difficulty.



Figure 4.4: Effort and Effectiveness Results for contextH and V -DO. Effort is measured in terms of the number of queries entered, shown on the left. Effectiveness is measured in terms of the F Measure, shown on the right.

The experiment was administered as a volunteer online survey to gain access to as many developers with industry experience as possible. However, this meant that subjects were not in a controlled environment, and other distractions may have influenced the attention that subjects devoted to the experiment. For example, one subject was eating during the first part of the experiment, but not the second part. Another subject took an hour break in the middle of reformulating a query for one concern. For this reason, we cannot analyze effort in terms of time, only in terms of the number of submitted queries. Again, we attempted to minimize this threat as much as possible by observing the subjects over 7 concerns per treatment.

4.4 Results and Analysis

Contextual Phrases versus V-DO Pairs We found that contextH significantly outperforms V -DO in terms of effort and effectiveness. Figure 4.4 presents the results of our comparison in a box and


whisker plot. The shaded box represents the inner 50% of the data, the middle line represents the median, the plus represents the mean, and outliers are represented by an ‘×’. In terms of effort, shown on the left, developers entered 5 more queries on average for V -DO than for contextH . In most cases, this was due to the fact that users found it difficult to formulate strict verb-direct object queries for all the concerns. One subject said,“I really liked the verb-direct object search addon, but had trouble formulating some of the mandatory verbs, for example with the sqrt2 query.” In situations where V -DO could not extract a verb, users had trouble formulating successful queries and therefore expended more effort than with contextH . V -DO’s inability to extract verbs in all situations also led to poor effectiveness, shown on the right in Figure 4.4. Although the developers found V -DO’s query recommendations to be helpful, the recommendations did not provide significantly improved results. For example, another subject said, “In the V-DO part especially, it was difficult to find an accurate list [of signatures] for each concern by specifying complete V-DO combinations.” Thus, the more flexible phrase extraction process of contextH allowed for higher F measure values. To verify our observation that contextH outperforms V -DO, we performed a two-sample t-test [DV99]. Our dependent variables, number of queries (nq) and F measure (f ), had unequal variances, leading us to use the Satterthwaite approximation. We found that contextH outperforms V -DO with statistical significance at the α = 0.05 level (nq: p = 0.0004, f : p = 0.0021). Because our experiment includes repeated measures of the same subjects, the assumption that the two samples are independent does not hold. In such situations it is more appropriate to use a mixed model. When we analyzed the data as a mixed model, we achieved the same level of significance, with similar p values.
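For readers unfamiliar with the unequal-variance (Welch/Satterthwaite) form of the two-sample t-test, the sketch below shows how such a comparison could be run with Apache Commons Math. The data arrays are placeholders rather than the actual study measurements, and this is only an illustration of the test family, not the statistical tooling used for the analysis reported here.

    import org.apache.commons.math3.stat.inference.TTest;

    public class QueryEffortComparison {
        public static void main(String[] args) {
            // Placeholder per-subject query counts; the real study data are not reproduced here.
            double[] contextQueries = {2, 3, 1, 4, 2, 3, 2, 5};
            double[] vdoQueries     = {6, 9, 7, 8, 5, 10, 7, 6};

            TTest tTest = new TTest();
            // tTest(double[], double[]) does not assume equal variances (Welch's t-test).
            double pValue = tTest.tTest(contextQueries, vdoQueries);
            System.out.printf("t = %.3f, p = %.4f%n", tTest.t(contextQueries, vdoQueries), pValue);
        }
    }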


At first inspection, the F measure values appear to be low. Although we used concerns as benchmarks, the goal of the study was not to locate the entire concern, but to locate seed starting points to begin concern location. We do not expect any single query to be capable of locating all relevant items in a concern. However, an automatic program exploration technique [HPVS07, SFB07] can explore structural edges to locate program elements not returned by the natural language search. Because most concerns in this study contain just one or two structurally connected components, non-zero F measure values translate into fairly decent concern coverage. Contextual Hierarchy We also compared our contextual search technique both with and without the hierarchical topic display, contextL . In contrast to V -DO, we did not see a significant difference between contextH and contextL . In fact, the distributions of the nq and f variables for both techniques are quite similar. We found that the results in this part of the experiment are complicated by user interface issues. First, when faced with a long list of signatures from contextL , many subjects trusted the results and simply moved on. Such behavior led to very few queries entered, and very high recall from the huge size of the result set. Second, a number of developers were frustrated by the interactive nature of contextH ’s hierarchy. They liked the hierarchy, but disliked having to click to expand every branch. Currently, it is impossible to determine how much of an effect the phrase hierarchy and topic display has on the user without further investigation of the user interface. Qualitative Results At the end of the experiment, subjects were asked to comment on the techniques they used. Of the 17 that responded, 5 commented on which tool they preferred. Four of the subjects preferred using the contextH technique:


The nice thing about the reformulation technique [contextH ] is that most of the time you put in a 1-word query and find the concern(s) quickly, because they are nicely organized in groups. This way, you definitely see other functions you might miss if your queries were longer than one word. This observation typifies newcomer behavior when formulating queries [SLS09]. However, one subject disagreed: I felt like Part II [contextH ] was too exact and that I should remove words from my query–to only one word, which for some reason wasn’t intuitive. I’ve been trained to at least use two words (unless they’re, like, proper names). The subjects also suggested improvements, such as acceptable instances for stemming. Two of the subjects would have liked at least trailing ‘s’ characters stemmed from plural nouns and third person singular verbs. One subject suggested the use of synonyms, which we plan to add in the future. Some subjects also felt that the word order restriction on the query made searching difficult; in future, the word order could be used for creating the phrase hierarchy, and not for the search mechanism. 4.5

4.5 Related Work

The most closely related work is Shepherd et al.'s approach to automati-

cally extracting V-DO pairs from source code comments and identifiers for search and query recommendations [SFH+ 07, SPVS06]. Our technique generalizes this approach by extracting NPs, VPs, NPPs and VPPs from signatures. Another approach to query recommendation automatically suggests close matches for misspelled query terms [PMD06]. Other static search techniques supporting natural language queries do not provide query recommendations [MSRM04, PPM+ 06]. There is also related work on automatically extracting topic words and phrases from source code [BLLB08, MSH08, OG05], displaying search results in a concept 96

lattice of keywords [PM07], and clustering program elements that share similar phrases [KDG07]. Although useful for exploring the overall word usage of an unfamiliar software system, these techniques are not sufficient for exploring all usage. In contrast to our approach, these approaches either filter the topics based on perceived importance to the system [BLLB08, KDG07, OG05, PM07], or do not produce human understandable topic labels [MSH08]. Since it is impossible to predict a priori what will be of interest to the developer, we let the developer filter the results with a natural language query, and have endeavored to keep our extracted phrases as human readable as possible. Existing research into design recovery and reuse has also used information from identifiers [BMW93, DBSB90, Hen94, MMM03, PDF87]. However, all of these approaches require an expert-defined domain model or knowledge base, which is not available for all software systems or domains. One approach for automatic generation of domain representations has been suggested for software artifacts, but has not yet been evaluated on source code [LVdA+ 04]. Another approach automatically constructs and categorizes reuse libraries based on comments and documentation, but does not process identifiers [MBK91]. Michail et al. use similarity of identifier names to compare and contrast software libraries for reuse [MN99]. An alternative to query reformulation is using relevance feedback [GHMM09]. Rather than modifying the initial query, a relevance feedback system requires the user to determine the relevance of retrieved documents, and uses this relevance feedback to further refine the search results. Although relevance feedback helps solve the problem of query reformulation by automatically changing the query based on user feedback, it does not provide any support to determining result relevance. In contrast, our contextual approach to query reformulation helps the user to both refine a query and determine result relevance.


4.6 Conclusion

In this chapter, we present a novel approach that provides automated support

to the developer both in formulating queries and discriminating between relevant and irrelevant search results. Our contextual search approach automatically captures the context of query words in source code by extracting and generating natural language phrases from method and field signatures. These phrases naturally form a hierarchy that allows the developer to quickly identify relevant program elements by reducing the number of relevance judgments, while the phrases help the developer to formulate effective queries. We conducted an empirical evaluation with 22 developers comparing our contextual search approach to verb-direct object search, the most closely related technique. Our results show that contextual search significantly outperforms verb-direct object search in terms of effort and effectiveness. Feedback from the subject developers indicates further areas of research.


Chapter 5

DEFINING A SOFTWARE WORD USAGE MODEL (SWUM)

In the previous chapter, we observed that the context in which a word appears helps a developer to determine whether search results are relevant to his information need. In this chapter, we generalize this notion of word context in software and present a systematic way to access this information, applicable not only to query reformulation and software search, but to a wide variety of software engineering tools.

5.1 Introduction

During software development and maintenance, human programmers read

and modify the code that they and others produce, creating code artifacts that are readable as well as runnable [LBS06]. While the programming language syntax and semantics convey the algorithm to be executed, the identifier names and comments express the higher-level conceptual algorithmic steps and domain concepts behind the implementation. For example, from the method name buildQueryForTrace, we can infer that the method’s implementation will construct (i.e., build) a query for a trace. The comment provides further elucidation: “Build the sql query string for tracing.” Thus, concepts, or ideas, behind the implementation are expressed through words found in comments and identifiers. We use the term lexical concept to describe a concept invoked by a single word, and phrasal concept to describe a concept expressed as a sequence of words [Jac90]. 99

Textual information from lexical concepts has been successfully used to build a variety of maintenance tools. As described in Chapter 2, examples include tools that support program navigation and search [HPVS07, PM07], automatic recovery of traceability links between software artifacts [ANS08, ACC+ 02], and code reuse [MBK91, MN99]. However, most existing lexical approaches treat a program as a “bag of words” [MRS08], i.e., words are viewed as independent occurrences with no relationships. Approaches that utilize lexical concepts alone miss important linguistic relationships between how words appear together in code and comments. For example, consider searching for the query “add item” in a shopping cart application. The occurrence of “add” and “item” in two separate statements of the same method does not necessarily indicate that the method is performing an “add item” action— the method may be adding an action to the system’s queue and then getting the item field of another object in the system. We argue that going beyond the lexical concepts of words to phrasal concepts can yield further improvements in software engineering tools. The contextual query reformulation technique of the previous chapter further supports this idea. Approaches that use lexical concepts not only miss important linguistic relationships, they also lose the structural context in which the words appear. In this chapter, we introduce a novel Software Word Usage Model (SWUM) that not only captures the occurrences of words in code, but also their linguistic and structural relationships. SWUM captures the conceptual knowledge of a programmer expressed through both natural language information and programming language structure and semantics, in the form of phrasal concepts. In contrast to existing lexical approaches, we take a transformative step in automatic analysis for software engineering tools by accounting for how words occur together in code, rather than just counting their frequencies.


(Figure 5.1 appears here in the original; it gives the parse trees for the three identifiers nextArea(), drawContentBorder(), and performTestsFromZipFile(File zfile), corresponding to the phrases “next area”, “draw content border”, and “perform tests from zip file”.)

Figure 5.1: Example English parse trees and phrase structure for 3 methods. In the previous chapter, we represented phrasal concepts for verb phrases and noun phrases using phrases, which are a textual representation of phrasal concepts. In this chapter, we generalize beyond a textual representation of phrasal concepts to a general model of phrase structure, with applications beyond query reformulation. 5.2

5.2 Representing Phrasal Concepts

A phrasal concept is analogous to a parse tree of an English phrase. In this

section, we present an overview of phrase structure and part-of-speech information commonly used to analyze English natural language sentences. We build on this information to model word usage in source code.

5.2.1 Phrase Structure

English phrases are modeled by parse trees consisting of phrase structure

nodes, such as verb phrases, noun phrases, or prepositional phrases. A noun phrase (NP) is a sequence of noun modifiers, such as nouns and adjectives, followed by a noun, and optionally followed by other modifiers or prepositional phrases [MS99].¹

¹ In English, NPs are defined to begin with a determiner such as “a” or “the”. Since determiners are rarely used in identifiers in software, we relax this requirement.

A verb phrase (VP) is a verb followed by an NP, and does not usually include the subject of the verb. A prepositional phrase (PP) is a preposition plus an NP, and can be part of a VP or NP (i.e., a VPP or NPP as defined in Chapter 4). Figure 5.1 presents an example NP, VP, and VP with PP for three method name identifiers. The phrase structure nodes are NP, VP, and PP, while the other nodes are parts of speech. The leaf nodes are the individual words split from within the identifier. Each word in the identifier is assigned a part of speech, which can then be used to derive the identifier's phrase structure.

5.2.2 Parts of Speech

The building block of phrase structure is part-of-speech (POS) information.

Any word in a phrase can be assigned a part of speech, called a POS tag. Although a wide variety of possible POS tags exist [MS99], we only need a handful to represent phrases in software: Noun (N) Nouns occur in the head position of a noun phrase (NP), where the head is the final word in the phrase. Head words are important because they typically represent the central theme of the NP [JM00]. For example, consider the identifier DataSourceChangeListener. Although every word in the identifier is a noun, the central concept of this NP is captured by the head word, “listener”. Noun Modifier (NM) A noun modifier is a noun occurring to the left of the head position in an NP, or an adjective. For example, in DefaultEventIndex, both the adjective “default” and the noun “event” are noun modifiers. Verb (V) The verb label is used to annotate all verbs, regardless of form (e.g., root, third person, past participle, etc.) Verb Modifier (VM) Verb modifiers are adverbs that describe the nature of the action taking place, like “quickly”, “completely”, or “simply”. Verb Particle (VPR) Verb particles are prepositions found in phrasal verbs like “set up”, “look for”, and “write down”. Preposition (P) Prepositions such as “to”, “in”, “from”, and “for” are labelled as P unless they appear as verb particles. 102

Conjunction (CJ) Conjunctions connect multiple phrases together. Examples include “and”, “or”, “but”, “if”, and “unless”. Determiner (DT) Determiners typically describe nouns, but provide little content information. Common determiners include “the”, “a”, “any”, and “all”. Pronoun (PR) Pronouns like “my”, “it”, or “this” can occur in identifiers to further describe nouns. Digit (D) Sequences of digits can also be treated as words, and labelled accordingly. Figure 5.1 shows example POS tags above each identifier word.
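One natural in-memory representation of this tagset is a simple enumeration. The sketch below is illustrative only and covers just the tags defined in this section; the two software-specific tags introduced in Section 5.5.3 (NI and VI) could be added in the same way.

    /** Part-of-speech tags used to label words split from identifiers (Section 5.2.2). */
    public enum PosTag {
        NOUN("N"),
        NOUN_MODIFIER("NM"),
        VERB("V"),
        VERB_MODIFIER("VM"),
        VERB_PARTICLE("VPR"),
        PREPOSITION("P"),
        CONJUNCTION("CJ"),
        DETERMINER("DT"),
        PRONOUN("PR"),
        DIGIT("D");

        private final String abbreviation;

        PosTag(String abbreviation) {
            this.abbreviation = abbreviation;
        }

        public String abbreviation() {
            return abbreviation;
        }
    }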

5.3 Motivating Examples

Using phrasal concepts to improve software engineering tools Consider a software maintenance tool that allows a developer to textually search for source code related to a maintenance task. Most existing search techniques use lexical concepts, that is, they treat a program as a “bag of words”. In situations where bag of words would return many results, SWUM’s phrasal concepts can differentiate between the relevant and irrelevant results. For example, consider searching for code related to the concept of adding items to a cart in a shopping system. Figure 5.2 shows three methods returned by a bag of words technique when searching for the “add item” query. The addEntry method, m1 , is highly relevant to the target concept of adding an item, containing occurrences of the “add item” phrasal concept in both the method name and body statements. In contrast, sum and loadAllItemsFromURLString, m2 and m3 respectively, are not relevant to the target concept, despite occurrences of both the words “add” and “item” in the source code. Specifically, sum uses a different sense of the word “add”. Rather than augmenting the shopping cart with new items, sum arithmetically adds the prices of the shopping cart’s contents. Since sum contains the phrasal concept “add price” and 103

(Figure 5.2 appears here in the original; it shows the three methods returned for the “add item” query together with their phrasal concepts: m1 addEntry, whose phrasal concept is “add item”; m2 sum, whose phrasal concept is “add price”; and m3 loadAllItemsFromURLString, whose main phrasal concept is “load items”.)

Figure 5.2: Search results and phrasal concepts for “add item” query. All the methods returned by the search contain the words “add” and “item”, but “add item” is not necessarily the main action taken by each method. not “add item”, SWUM can be used to distinguish between the relevant addEntry and the irrelevant sum. In contrast to sum, loadAllItemsFromURLString contains the “add item” phrasal concept. Based on m3 ’s method name, we can conclude that the main intent of this method is to load items, during which “adding items” is just one substep. By taking into account where phrasal concepts occur, SWUM can be used to differentiate between the highly relevant addEntry method and the less relevant loadAllItemsFromURLString. In summary, this example demonstrates that knowing how words occur together in phrasal concepts and where they occur can help distinguish between relevant and irrelevant search results, which can lead to improved software engineering tools. Automatically leveraging contextual information The structural and natural language information from program statements can be a rich source of conceptual program knowledge. We have designed SWUM to enable software engineering tools to automatically use this information.


For example, consider the statement:

long justAdded = System.currentTimeMillis() + 5 * Constants.ONE_MINUTE;

In this statement, the variable name justAdded forms an incomplete phrasal concept because it does not communicate what was just added. Even though the variable justAdded is of type long—an uninformative, primitive numeric type—we can infer from the surrounding expression that justAdded holds a value related to time. This information can be used in automated software engineering tools, for instance, if the developer wanted a list of all variables and methods that handle time in the system. Knowledge of the conceptual type of justAdded enables us to make further inferences about variables used with justAdded throughout the program. For example, from the expression cur > justAdded we can infer that the variable cur is also related to time, most likely capturing the ‘current time’. In this way, every segment of code can provide knowledge about the program that can be leveraged in automated software engineering tools.
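A minimal sketch of this kind of inference, under the assumption that a small seed set of time-producing expressions is available: a declared variable whose initializer mentions a seed expression is flagged as time-related, and the flag is propagated to variables compared against it. The class name, the seed set, and the string-based matching below are illustrative simplifications, not part of SWUM's definition.

    import java.util.HashSet;
    import java.util.Set;

    public class TimeConceptInference {
        // Illustrative seed: expressions assumed to produce time-related values.
        private static final Set<String> TIME_SOURCES =
                Set.of("System.currentTimeMillis()", "Constants.ONE_MINUTE");

        private final Set<String> timeRelated = new HashSet<>();

        // Seed rule: a declared variable whose initializer mentions a time source is time-related.
        void recordDeclaration(String variable, String initializerExpr) {
            for (String source : TIME_SOURCES) {
                if (initializerExpr.contains(source)) {
                    timeRelated.add(variable);
                }
            }
        }

        // Propagation rule: comparing a variable against a time-related one marks it time-related too.
        void recordComparison(String left, String right) {
            if (timeRelated.contains(left)) timeRelated.add(right);
            if (timeRelated.contains(right)) timeRelated.add(left);
        }

        public static void main(String[] args) {
            TimeConceptInference inference = new TimeConceptInference();
            inference.recordDeclaration("justAdded",
                    "System.currentTimeMillis() + 5 * Constants.ONE_MINUTE");
            inference.recordComparison("cur", "justAdded");
            System.out.println(inference.timeRelated); // contains justAdded and cur
        }
    }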

5.4 State of the Art

In this section, we briefly summarize the state of the art in capturing word re-

lationships in source code, and demonstrate why SWUM is key to extracting phrasal concepts from source code. Although there has been some work toward representing word relationships in code, to our knowledge, no existing technique automatically captures the lexical concept of a word in conjunction with the context of its surrounding phrase. For example, automatic techniques have been developed to capture co-occurring word pairs [MBK91, MS99]. Going back to our “add item to cart” example, a co-occurrence approach would view the relationships between “add” and “item”, “add” and “cart”, and “item” and “cart” as the same, even though “item” and “cart” are direct and indirect objects of the verb “add”, respectively. Thus,

105

co-occurrences do not capture any information about the nature of the relationship between words beyond that the words occur together in the same context. In contrast, the V-DO approach described in Chapters 2 and 4 uses pairs from method signatures and comments to find actions that cross-cut object-oriented systems [SFH+ 07]. Although V-DO can model “add item”, it cannot model the entire phrasal concept, “add item to cart”. Inspired by the success of V-DO [SFH+ 07], SWUM generalizes the concept behind V-DO to capture a wider variety of phrasal concepts in source code. Another potential approach is to capture phrasal concepts in source code with latent semantic analysis (LSA) [LMDK07]. LSA identifies groups of semantically related words based on how words co-occur in a document collection. However, the semantic concepts found by LSA do not define phrasal concepts found in text—the concepts are instead represented as a mathematical set that may not correspond to anything expressible in language, other than the words in the set. An approach to automatic generation of domain representations has been suggested for software artifacts such as documentation [LVdA+ 04], but has not been applied to source code. An alternative would be to automatically identify topics in source code [KDG07], and parse the topics to derive word relationships. The basic premise of this approach filters the topics based on perceived importance to the system, and therefore will miss some infrequently occurring phrasal concepts. Alternatively, reflexion models allow developers to map structural mental models of software artifacts to the source code [MNS01], but the focus is primarily on program structure, rather than natural language information. In summary, our work to extract phrasal concepts and word relationships is fundamentally different from prior work. No existing technique is capable of capturing both the textual phrases and the nature of word relationships required to model a variety of phrasal concepts for a given segment of code. With SWUM, we

106

are attempting to capture the full context of a word both within its phrasal concept as well as within the structure of the program.

5.5 A Software Word Usage Model

Intuitively, SWUM models phrasal concepts that capture both the program-

ming language (PL) and the natural language (NL) syntax and semantics found in source code. SWUM captures the PL and NL semantics with phrase structures that are linked to the underlying program structure and context-independent word information. Different phrasal concepts can be derived from these phrase structures, depending on the target software engineering tool using the information. In this section, we present our model of software word usage (SWUM), and use the concrete example in Figure 5.4 to illustrate its components. Figure 5.3 gives a high-level view of SWUM and its layers.

5.5.1 Formal Definition

Formally, a software word usage model (SWUM) for a program P is a tuple

consisting of the following components:

1. a set N of nodes, which is the disjoint union of
   • a set Np of program element nodes, where a program element node is a syntactic program element from the AST (e.g., a method declaration, a variable use, a formal parameter), and
   • a set Nw of word nodes, where a word node represents a single word used in comments and/or identifiers of the program element nodes (e.g., “print” or “error”), independent of the context in which they appear, and
   • a set Ns of phrase structure nodes, which are used to capture the natural language (NL) parse structure and semantic roles of words used in specific program element nodes (e.g., VP, NP, PP)


Figure 5.3: SWUM captures phrasal concepts and links them with program structure and additional word information.


Source code for the running example (MainObject.java):

public void handleFatalError(String error) { SysLogger.doPrint(error); }

(Figure 5.4 appears here in the original; it depicts this code in three linked layers: the model of program structure, SWUMprogram; the phrase structure model, SWUMcore; and the model of words used in the program, SWUMwords.)

Figure 5.4: SWUM for a source code sample to be used as a running example throughout its definition.


2. a set A of node annotations, which is the disjoint union of
   • a set Ap of program element annotations, where a program element annotation is a structure-based role such as caller, callee, or overridden labeling a program element node,
   • a set Aw of word node annotations, where a word node annotation is a context-independent lexical feature such as part of speech (verb, noun, preposition), possible abbreviation expansion, or frequency, labeling a word node with IR measures such as inverse document frequency (IDF), and
   • a set As of phrase node annotations, where a phrase node annotation represents the type of NL parse tree node, such as verb phrase, noun phrase, or prepositional phrase

3. for each n ∈ N , an annotation (n, x) ∈ A such that
   • if n ∈ Np and (n, x) ∈ A then x ∈ Ap
   • if n ∈ Nw and (n, x) ∈ A then x ∈ Aw
   • if n ∈ Ns and (n, x) ∈ A then x ∈ As

4. a set E of edges, which is the disjoint union of
   • the set Ep of structural edges, which represent program structure relationships such as control dependence, data dependence, or other information from the AST, PDG, or Call Graph,
   • the set Ew of word edges, which represent context-independent relations between words, such as the semantic relations between words (e.g., synonyms and antonyms) or word stems,
   • the set Es of NL parse edges, which represent semantic roles between phrase structure nodes (e.g., action and argument), and

   • the set Eb of bridge edges between program element nodes, phrase structure nodes, and word nodes

5. a set L of edge labels, which is the disjoint union of
   • the set Lp of structural edge labels,
   • the set Lw of word edge labels,
   • the set Ls of parse edge labels, and
   • the set Lb of bridge labels for bridge edges connecting from phrase structure nodes to word nodes, which represent the actual part of speech and other role information

6. for each e ∈ E, nodes src(e), dest(e) ∈ N , and label (e, x) ∈ L such that
   • if e ∈ Ep then src(e) ∈ Np , dest(e) ∈ Np , and if (e, x) ∈ E, then x ∈ Lp
   • if e ∈ Ew then src(e) ∈ Nw , dest(e) ∈ Nw , and if (e, x) ∈ E, then x ∈ Lw
   • if e ∈ Es then src(e) ∈ Ns , dest(e) ∈ Ns , and if (e, x) ∈ E, then x ∈ Ls
   • if e ∈ Eb and src(e) ∈ Np , then dest(e) ∈ Ns , and if (e, x) ∈ E, then x ∈ ∅
   • if e ∈ Eb and src(e) ∈ Ns , then dest(e) ∈ Nw , and if (e, x) ∈ E, then x ∈ Lb

111

5.5.2

Modeling Program Structure in SWUM (SWUMprogram ) Figure 5.4 shows sample program structure information from the AST, the

call graph, and formal declaration and use information. Each identifier in the example becomes a program element node ∈ Np , and each structural relationship becomes an edge ∈ Ep . For example, there is a method invocation edge from program element node handleFatalError (p1 ) to doPrint (p2 ), a formal declaration edge from handleFatalError (p1 ) to String error (p3 ), and a formal use edge from String error (p3 ) to error (p4 ). The edges are then labelled by the type of structural relationship. In general, a label ∈ Ls represents a structural relationship, which can be a control or data dependence, call edge, inheritance edge, AST edge, or a lightweight structural model relationship [MN96]. Although the example in Figure 5.4 includes nodes at the method declaration level, program element nodes can be created for an identifier at any level. Nodes can be created for classes, source files, or even packages, depending on the needs of the target software engineering tool utilizing SWUM. Nodes are annotated with structural role information, such as the “caller” and “callee” annotations on handleFatalError (p1 ) and doPrint (p2 ), respectively. In this example, both handleFatalError and doPrint are program element nodes ∈ Np , the structural edges between them are ∈ Ep , the structural role annotations are ∈ Ap , and the “call” edge label is ∈ Lp . In addition, program element nodes may be annotated with higher-level semantic roles. For example, variable declarations may be annotated by usage pattern roles such as loop iterators or temporaries [BGS05]. Alternately, field, method, and type declarations can be annotated by the role they play in the larger software system. For example, method declarations can be annotated by whether they take action that changes program state or return useful data, or perform computations that do not change the program state [LBS06]. Similarly, fields can be annotated by whether they are used as a constant or global

112

variable (e.g., COLOR_CYAN), contain a property of a class (e.g., ReportElement.transparent), or contain information that is part-of a class (e.g., HTMLTable.title). Type declarations might be annotated by the role they play in the overall system— whether they encapsulate a data object (e.g., TPSReport), encapsulate an action (e.g., ReportCompiler), play a role in an implemented design pattern (e.g., DocumentBuilderFactory), or other abstract class role [GM05, SK08, MMvD07].

5.5.3 Modeling Program Words in SWUM (SWUMwords)

There exists a word node ∈ Nw for every word token used in the comments

and identifiers of the program. The word nodes are derived by splitting the identifiers into their constituent word tokens, usually by camel casing [FBL06, LBS06] or by the techniques described in Chapter 3. Word nodes can be annotated with lexical features that are independent of specific word occurrences, including possible parts of speech, word meanings, relative frequency, or other lexical semantics [MS99]. Context-independent word relationships are used to create edges between word nodes ∈ Ew . For example, in Figure 5.4, the words “do” and “handle” are frequently used as synonyms in source code, represented by the synonym edge w1 . Additional word nodes not in the code may be added to create some of these relationships. For instance, the stem, or root form, of “logger” is not present in our code example. The word node “log” is added to Nw to create edge w2 . Word nodes are not labelled with lexical features that require contextual information, such as actual part of speech. Such occurrence-specific information is used to annotate bridge edges from phrase structure nodes to word nodes (Ns to Nw ). The actual parts of speech used to label these bridge edges include the parts of speech presented in Section 5.2 as well as two we derived especially for software: Ignorable Noun (NI) In certain situations, a head noun may not carry meaningful content information. For example, consider the two noun phrases “action entry” and “action server”. In “action entry”, the head word “entry” conveys little information. In fact, “action” could be considered to be the central 113

theme of this NP. In contrast, the central theme of “action server” is the head word, “server”. Because “entry” conveys almost no semantic information, we label it an ignorable noun. Ignorable Verb (VI) Similar to ignorable nouns, some verbs carry more information than others. In contrast to ignorable nouns, ignorable verbs typically precede the more important verb. For example, in handleGetUser, “get” is the true action taken by this method, not “handle”. Thus, “handle” is a VI.

VP    → (VG | EQVG) (NP | EQNP) PP?
VG    → (VI|VM)* (V|VI) VPR?
NP    → NM* N NI?
PP    → P NP
EQ    → EQNP | EQVG
EQNP  → NP+
EQVG  → VG+

Figure 5.5: Grammar for SWUM's phrase structure.

5.5.4 Modeling Phrase Structure in SWUM (SWUMcore)

SWUMcore contains the key components of SWUM used to represent phrasal

concepts and capture the semantic knowledge of the programmer in terms of program structure and natural language syntax and semantics. To capture NL syntax and semantics in SWUMcore , we use phrase structure information inspired by natural language parsing techniques [MS99]. We have created additional phrase structure nodes to account for semantic information unique to source code. The phrase structure information allows us to capture semantic role relationships like action, theme, and argument [MS99]. These semantic roles capture the higher level concepts of the code, and can be used to construct phrasal concepts. SWUM contains 6 phrase structure node types: Noun Phrase (NP) Noun phrases consist of a sequence of noun modifiers followed by a noun and an optional ignorable noun (NP → NM* N NI?). A labelled bridge edge exists from the NP to every word node in its phrase. For example, in Figure 5.4, the type of variable String error links to the NP, “string”, labelled s2 . 114

Prepositional Phrase (PP) A prepositional phrase consists of a preposition followed by an NP (PP → P NP). A PP has a bridge edge to its preposition as well as a parse edge to its NP object. In Figure 5.4, the VP for doPrint, labelled s3 , connects to the PP “to sys logger”. Verb Group (VG) Verb groups are used to group action words together as one conceptual phrase unit. Verb groups may consist of a sequence of verbs, verb modifiers, and an optional verb particle (VG → (VI|VM)* (V|VI) VPR?). A labelled bridge edge maps the VG node to its constituent words. For example, in Figure 5.4, the actions “handle” and “do” are part of a VG node. We use verb groups to simplify the recursive VP parse structures common in more traditional NL parsing [MS99], which recursively define a VP → V (VP | NP). Verb groups provide a phrase structure parallel to NPs, which enables both VGs and NPs to be similarly processed by software engineering tools like search. Verb Phrase (VP) Verb phrases consist of a verb group and its arguments such as the theme (i.e., direct object), secondary arguments (i.e., indirect objects), the subject, or other auxiliary arguments (VP → VG (NP | EQ) PP?). A parse edge ∈ Es exists from the VP to each of these semantic roles, if present. In Figure 5.4, both method names map to VPs (s1 and s3 ). Because method names frequently encapsulate actions, methods often map to VPs in SWUM. In Figure 5.4, the VP for handleFatalError (s1 ) maps to a verb group action, its theme, and the class as an auxiliary argument. Equivalence (EQ) Equivalence nodes capture relationships between noun phrases or verb groups that can be used interchangeably (EQ → NP+ | VG+). An equivalence node will either connect to all NPs or all VGs. In Figure 5.4, the theme in handleFatalError further describes the method’s formal parameter, String error. The equivalence node s2 is used to semantically join the two noun phrases of these program elements together. For example, in Figure 5.4, the theme for handleFatalError contains an EQ between “fatal error” and “error”. Conjunctive Phrase (CP) Conjunctive phrases unite two other phrase nodes via a conjunction, such as in lockAndInitHandle or rotateLeftOrRight. Solutions exist in the NLP community for handling many conjunctive phrase structures [KK08, Kos86, OM94], and we leave handling conjunctive phrases for future work. Figure 5.5 shows the grammar between phrase structure nodes used in SWUMcore . Bridge edges from program elements to SWUMcore phrase structure nodes can only 115

have NP or VP nodes as their destination, since these are the topmost level phrase structure nodes in the parse trees we use. This is illustrated in Figure 5.4 by bridge edges b1 , b2 , b3 , and b4 . Program elements with name and type information (i.e., formals, local variables, and fields), contain two bridge edges, one for the name and one for the type. Actual parameters and variable uses can follow structural and bridge edges to navigate to the phrase structures of their declared and actual types. Formally, the edges connecting to nodes in the SWUMcore layer conform to the following properties: • if e ∈ Eb and src(e) ∈ Np , then dest(e) ∈ {N P, V P } • if e ∈ Eb and src(e) ∈ {V G, N P }, then dest(e) ∈ Nw • if e ∈ Es and src(e) ∈ {V P }, then dest(e) ∈ Ns − {V P } • if src(e) ∈ {P P, CJ}, then dest(e) ∈ Nw ∪ {N P } • if src(e) ∈ {EQ}, then dest(e) ∈ {N P, V G} • if n ∈ {EQ} and src(ei ) = n and src(ej ) = n and dest(ei ) = X, then dest(ej ) = X Within Ns , only VG, NP, CJ, and PP nodes contain bridge edges to word nodes. The remaining edges from CJ and PP nodes connect to NPs, and EQ nodes connect to either NPs or VGs. 5.5.5

5.5.5 Analysis of Space

As formally defined, SWUM contains a node for every identifier and word

node used in the source code and comments. The number of phrase structure nodes depends on the number of identifiers. Bridge edges exist for every identifier and structural relationship being modeled, every word relationship modeled, and from every identifier to the phrase structure and constituent words. Although the number 116

of edges in SWUM will be constant with respect to the number of identifiers and words, it is a very large constant. Because a full SWUM can be prohibitively large, we recommend tailoring SWUM construction for the target application or employing SWUM for a variety of applications to help amortize the cost of construction.

5.6 Automatically Constructing SWUMcore

In this section, we describe our algorithm to automatically build SWUMcore

by extracting the necessary information from the source code. Because the exact type and nature of program structure and word usage information required depends on the target software engineering application, we leave discussion of automatically constructing these layers to the implementation section (Section 5.6.2.4). We construct SWUMcore using rules derived from naming conventions used across thousands of Java programs. We first present our methodology for rule development, followed by our phrase structure construction algorithm, which includes our approach to part of speech tagging. Rather than use the small example in Figure 5.4, we provide intuition behind each construction rule using a variety of examples from actual open source Java code. Aside from method declarations, all identifiers are treated as NPs and therefore have bridge edges from program elements in Np to NP phrase structure nodes in Ns . We have focused our efforts on constructing SWUM for methods because we believe VPs from methods are the basic semantic units of code, supported by the NPs of variables and types. 5.6.1

5.6.1 Developing SWUMcore Construction Rules

The rules for constructing SWUM were developed by analyzing method and

field naming conventions from over 9,000 open source Java programs. Our rule development approach was inspired by exploratory data analysis techniques [MS99]. Starting from a random sample of 10,000 method signatures, we would partition 117

the data set and attempt to build a rule for each partition. Partitions were refined based on whether a single rule would apply to the entire partition. For example, we could start with the assumption that all method names are in the same partition and that every method name begins with a verb. Upon applying this rule to our sample, we would observe mistakes when the name started with a noun phrase or preposition. This would give rise to two partitions in addition to starting with a verb: starting with a preposition and starting with an NP. Every new partition presented two challenges: how to automatically identify that the method name was part of that partition, and then how to construct the SWUM for the partition. Using the partitions, rather than a pre-annotated data set, allowed us to quickly view a much larger sample of the data set, by only analyzing differences between partition rule iterations. We continued refining our construction rules until we were satisfied with the level of accuracy for our target applications of search and exploration. Quantifying each rule’s accuracy and integrating it into SWUM is the subject of future work. 5.6.2

5.6.2 Constructing SWUMcore

SWUMcore (Ns , As , Es , Ls ) consists of phrase structure nodes (e.g., NP and

VP) and parse edges representing semantic roles of VPs such as the action and theme. The challenge in constructing SWUMcore is in (1) accurately identifying the parts of speech of words in identifiers, and (2) identifying the action, theme, and secondary arguments for VPs. To accurately identify the parts of speech of words and determine if a name should be classified as a VP or NP, we use two sources of information: the lexical concept of the word itself, and the position of the word in the PL syntax of the program as well as within the NL syntax of the identifier. For example, if the method name begins with a noun, determiner, or adjective (e.g, newStruct() or all interfaces()) we assume the name is an NP. Similarly, if the name ends with a 118

past tense verb (e.g., actionPerformed()), we also assume the name is an NP. In most other cases we assume the method name is a VP. The next step is to identify the action, theme, and secondary arguments for VPs. We assume all methods are VPs, regardless of whether the name is an NP or VP. However, determining the action depends on whether the name is classified as an NP or VP; if the name is an NP, a default action is inferred, otherwise the action is the VP's verb. The theme and other arguments are inferred by looking in the name, parameters, declaring class, and return type of the method signature. Our approach to constructing SWUMcore for a method declaration is presented in Algorithm 7, and the following sections describe the details of our approach. Algorithm 7 focuses on identifying parts of speech and the semantic roles of VPs, which can be mapped back to phrase structure nodes and bridge edges. Specifically, a SWUMcore contains the following components:

• A VG action
• An NP or EQ theme
• A set of PP secondary arguments
• A set of NP auxiliary arguments not used in the theme or secondary arguments
• For boolean checkers, an NP/EQ subject and a VP condition
• Two flag attributes representing whether the method is a constructor (isCtor) or whether the name is general (isGeneral)
• A bridge edge from the method declaration (m ∈ Np ) to its VP ∈ Ns
• For each actual POS tag, a bridge edge from the phrase structure node ∈ Ns to its word node ∈ Nw , annotated with its actual POS tag
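As a rough illustration of the NP-versus-VP classification and default-action inference just described (a simplification of what Algorithm 7 formalizes; the class name, tag set, and example names below are illustrative only):

    import java.util.List;

    public class MethodNameClassifier {
        enum Tag { NOUN, NOUN_MODIFIER, VERB, PAST_TENSE_VERB, DETERMINER, PREPOSITION, DIGIT }

        record TaggedWord(String word, Tag tag) { }

        // Heuristics from Section 5.6.2: names that start with a noun, noun modifier (adjective),
        // or determiner, or that end with a past-tense verb, read as NPs; otherwise as VPs.
        static boolean isNounPhrase(List<TaggedWord> name) {
            Tag first = name.get(0).tag();
            Tag last  = name.get(name.size() - 1).tag();
            return first == Tag.NOUN || first == Tag.NOUN_MODIFIER || first == Tag.DETERMINER
                    || last == Tag.PAST_TENSE_VERB;
        }

        // If the name is an NP, a default action is inferred from the return type;
        // otherwise the name's leading verb is taken as the action.
        static String action(List<TaggedWord> name, boolean returnsVoid) {
            if (isNounPhrase(name)) {
                return returnsVoid ? "handle" : "get";
            }
            return name.get(0).word();
        }

        public static void main(String[] args) {
            List<TaggedWord> allInterfaces = List.of(
                    new TaggedWord("all", Tag.DETERMINER), new TaggedWord("interfaces", Tag.NOUN));
            List<TaggedWord> nextArea = List.of(
                    new TaggedWord("next", Tag.NOUN_MODIFIER), new TaggedWord("area", Tag.NOUN));
            System.out.println(action(allInterfaces, false)); // get
            System.out.println(action(nextArea, false));      // get
        }
    }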


Algorithm 6 parsePreamble(name)
Input: split method name name = w0, w1, ..., wn; a dictionary of two-letter words (dictionary); and two frequency lists from a corpus of method names, onlyFreq and firstFreq
Output: parsed preamble and name = (preamble, name)

preamble = ""
i = 0, ilast = −1
while name ≠ ∅ ∧ ilast ≠ i do
    ilast = i
    while tag(wi) = DIGIT do
        i++
    end while
    if length(wi) = 1 ∨ (length(wi) = 2 ∧ wi ∉ dictionary) ∨ (length(wi) < 5 ∧ !contains(wi, "get") ∧ !contains(wi, "set") ∧ !isPotentialVerb(wi) ∧ onlyFreq(wi) = 0 ∧ firstFreq(wi) > 10) then
        preamble = preamble + dequeue(name)
        i++
    end if
end while
return (preamble, name)

The phrase structure nodes VG, NP, PP, and EQ are ∈ Ns. The action, theme, secondary arguments, and auxiliary arguments are parse edge labels ∈ Ls, which label edges between the VP node and the remaining phrase structure nodes.

5.6.2.1 Tagging Method Names and Identifying the Action

The first step is to split the identifiers and create sequences of word nodes ∈ Nw for each identifier in the method signature. If the method is a constructor, the name is tagged as an NP, annotated as the theme, and we infer the action EQ VG {"create", "construct"} in line 7. For the remaining method declarations, we tag any digits in the name (line 12) and tag prepositions (line 14). By tag, we mean we create a bridge edge from the program element node pen ∈ Np to the phrase structure node psn ∈ Ns, and label the bridge edges from psn to all the words contained in pen with their actual part-of-speech information.


Algorithm 7 constructPhraseStructureForMethodDeclaration(m)
 1: Input: method declaration of split identifiers, m = (name, class, return_type, formals)
 2: Output: implicit phrase structure model for m = (action, theme, subject, condition, secondaryArgs, auxArgs, isGeneral, isCtor)
 3: isGeneral = false, isCtor = false
 4: {action, theme, subject, condition, secondaryArgs, auxArgs} = ∅
 5:
 6: if isConstructor(m) then
 7:     action = {"create", "construct"}
 8:     theme = tagNP(name)
 9:     auxArgs = formals, isCtor = true
10:     return (action, theme, secondaryArgs, auxArgs, isGeneral, isCtor)
11: end if
12: name = tagDigits(name)
13: {preamble, name} = parsePreamble(name)
14: name = tagPrepositions(name)
15:
16: if name = ∅ then
17:     theme = preamble, isGeneral = true
18:     auxArgs = formals ∪ return_type ∪ class
19: else if isChecker(name) then
20:     (action, theme, secondaryArgs, auxArgs) = parseBaseVerb(name, class, return_type, formals)
21:     if theme ≠ class then
22:         subject = class
23:     end if
24:     condition = action + theme
25: else if hasGeneralName(name) ∨ isEventHandler(formals) ∨ name0 ∈ {"on", "before", "after"} ∨ (isNP(name) ∧ return_type = "void") then
26:     isGeneral = true
27:     action = "handle"
28:     theme = name
29:     auxArgs = formals ∪ return_type ∪ class
30: else if isNP(name) ∧ return_type ≠ "void" then
31:     action = "get"
32:     theme = name
33:     auxArgs = formals ∪ return_type ∪ class
34: else if startsWithPreposition(name) then
35:     if name0 ∈ {"to", "from"} then
36:         action = "convert"
37:     end if
38:     secondaryArgs = {name}
39:     auxArgs = formals ∪ return_type ∪ class
40: else
41:     (action, theme, secondaryArgs, auxArgs) = parseBaseVerb(name, class, return_type, formals)
42: end if
43: if action = "get", remove return_type from auxArgs and create an EQ with the theme
44: return (action, theme, subject, condition, secondaryArgs, auxArgs, isGeneral, isCtor)


Preamble

In addition, we tag any leading sequences of letters that appear to carry little content, called the preamble. A developer may use a preamble to differentiate different types of methods which take similar actions, or just prepend information describing the method's action. For example, consider the methods BrowserLauncher.ICStart() and Permute.gsl_permute_ushort(), with preambles "IC" and "GSL", respectively. Because our verb tagger assumes method names begin with a verb, tagging the preamble allows our verb tagger to correctly identify the verbs in the method names ("start" and "permute"). We do not discard the preamble, but assign it a special POS tag (PRE) and associate it with the action's VG to be conservative with respect to errors.

Our approach to identifying the preamble is described in Algorithm 6. It takes as input the method name and a list of two-letter dictionary words likely to occur in software (dictionary). These words include "am", "my", and "on", but do not include words from other domains like "io", from Greek mythology and astronomy, which is more commonly used as an abbreviation in code to refer to "input and output". We also include a list of positional word frequencies. In our training sample of over 9,000 open source Java programs, we counted the number of times a word occurred in the beginning, middle, or end of a method name, as well as the number of times the word occurred as the only word in the name. Our preamble function takes as input two of these frequency lists: onlyFreq and firstFreq.

Identifying the Type of Method Name

In some cases, an entire method name may be labelled as preamble. For example, consider the automatically generated method Parser.jj_3R_188(). In such cases, we assume the preamble is the theme, and mark the method as having a general name (line 17). The isGeneral flag indicates that the method name probably has little to do with the method's actions, and enables SE tools using SWUM information to handle these methods differently.

Next, we detect whether the method is a boolean checker like isVisible or containsKey. If the method name begins with a verb in third person singular form or with a modal verb like "can", "must", or "should", we assume the method is a boolean checker (a rough sketch of this test appears below). Checkers are a special class of method names that do not conform to the action-theme VP model. Checkers need a subject that indicates what is to be checked (i.e., what is visible, what contains a key). Thus, checkers contain a subject, usually the class, and the method name becomes a VP condition mapping to the action and theme (line 24).

In line 25, we try to identify the remaining general names. General names can include event-driven methods like actionPerformed(), keyPressed(), or Thread.run(). We detect these methods by looking for names like main or run, and by looking for past tense (ending in -ed) or present participle (ending in -ing) verbs at the end of the method name. In addition, methods with formal parameters of a type ending in "event" are frequently event handlers, as are method names beginning with the prepositions "on", "before", or "after". Lastly, NP method names that are void are assumed to be general (e.g., the void method Handler.characters(String characters)). We assume a method name is an NP if it begins with a noun, determiner, pronoun, or adjective, and cannot be a verb.

If the method name is an NP and has a non-void return type, we assume the method is a getter (line 30). If the method name begins with a preposition, the action and theme are unknown and we treat the name as a secondary argument. If the preposition follows a known naming convention, like "from" or "to", we can infer an action. For example, if the method starts with a common preposition like "to", as in toString(), we infer the verb group node and action edge "convert". In contrast, for less commonly used prepositions, like inSectorNorthWest(Vector), we leave the VG empty.
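The checker test just described (third person singular or modal first word) can be sketched as follows. The modal list matches the words named above, but the third-person-singular test and helper names are simplified, hypothetical stand-ins for SWUM's actual tagging rules.

import java.util.List;
import java.util.Set;

/** Sketch: detect boolean "checker" method names such as isVisible or containsKey. */
class CheckerDetectorSketch {

    // Modal verbs named in the text: "can", "must", "should".
    private static final Set<String> MODALS = Set.of("can", "must", "should");

    // Crude stand-in for "verb in third person singular form":
    // a first word ending in "s" (is, has, contains, equals, ...).
    static boolean isThirdPersonSingularVerb(String word) {
        return word.endsWith("s");
    }

    /** Returns true if the split method name looks like a boolean checker. */
    static boolean isChecker(List<String> words) {
        if (words.isEmpty()) return false;
        String first = words.get(0).toLowerCase();
        return MODALS.contains(first) || isThirdPersonSingularVerb(first);
    }

    public static void main(String[] args) {
        System.out.println(isChecker(List.of("is", "visible")));    // true
        System.out.println(isChecker(List.of("contains", "key")));  // true
        System.out.println(isChecker(List.of("can", "undo")));      // true
        System.out.println(isChecker(List.of("add", "item")));      // false
    }
}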


Algorithm 8 parseBaseVerb(name, class, return_type, formals)
 1: Input: method declaration information, (name, class, return_type, formals)
 2: Output: (action, theme, secondaryArgs, auxArgs)
 3: i = 0, (action, theme, secondaryArgs, auxArgs) = ∅
 4: // Parse verb group
 5: if isIgnorableVerb(name0) ∧ moreLikelyVerbThanNoun(name1) then
 6:     tag(name0) = VI, i++
 7: end if
 8: if isLyAdverb(namei) ∧ isPotentialVerb(namei+1) ∧ tag(namei) ∉ {DT, PR, P} then
 9:     tag(namei) = VM, i++
10: end if
11: if isIgnorableVerb(namei) then
12:     tag(namei) = VI, i++
13: else
14:     tag(namei) = V, i++
15: end if
16: if isVerbParticle(namei−1, namei) then
17:     tag(namei) = VPR, i++
18: end if
19: while tag(name0) ∈ {VI, V, VM, VPR} do
20:     action = action + dequeue(name)
21: end while    // Next, parse arguments
22: p = index of preposition in name, −1 otherwise
23: auxArgs = populateArgsFromFormals(formals)
24: if p = −1 then
25:     if name ≠ ∅ then
26:         theme = tagNP(name)
27:         checkHeadOverlap(theme, auxArgs)
28:     else if auxArgs ≠ ∅ then
29:         theme = dequeue(auxArgs)
30:     else
31:         theme = class
32:     end if
33: else
34:     Let X = name[i..p−1] and Y = name[p+1..n]
35:     if X = ∅ then
36:         X = class
37:     else
38:         checkHeadOverlap(X, auxArgs)
39:     end if
40:     if Y = ∅ then
41:         Y = dequeue(auxArgs)
42:     else
43:         checkHeadOverlap(Y, auxArgs)
44:     end if
45:     theme = X, secondaryArgs = Y
46: end if
47: auxArgs = auxArgs ∪ return_type ∪ class (unless class is theme or secondary argument)
48: return (action, theme, secondaryArgs, auxArgs)


5.6.2.2 Identifying VP Theme and Arguments

If we have not identified the method name as any other type, we assume the name starts with a verb in base form. In line 41 of Algorithm 7, we forward the remaining parsing and argument identification on to the parseBaseVerb function presented in Algorithm 8.

The first step in parsing a base verb name is to identify the verb group. In addition to the verb, the name may contain verb modifiers, verb particles, and additional verbs in the form of ignorable verbs (line 5). For example, the method name testMakeProductPrice() contains two verbs in its VG: "test" (VI) and "make" (V). Depending on the SE tool, "test" may or may not be "ignorable", but tagging "test" as VI enables us to identify another verb, "make". If we were not able to identify ignorable verbs, we would only tag "test" as the verb, and "make" would be improperly tagged as a noun modifier (NM) in the theme. To determine if the first word in the method name is an ignorable verb (VI) followed by an actual verb (V), we utilize the positional frequency lists we first used to tag the preamble. If the second word in the name, name1, is more likely to occur as the first or only word in a name rather than occur in the middle or end of a name, we assume the first word in the method name, name0, is VI (a small sketch of this frequency test appears below).

Once we have identified the verb group, we identify the theme, secondary arguments, and auxiliary arguments (line 22). The first step is to check if the name contains a preposition and populate the auxiliary arguments from the formals. For most names we add any formals of non-boolean type to auxArgs. Boolean-typed variables are typically used as flags controlling the flow of execution, and are unlikely to be the theme or secondary argument. The exceptions are names beginning with verbs that can take boolean arguments, like "check", "assert", "contains", "add", "print", "append", "push", or "set". For these boolean-argument verbs, all formal parameters are added to auxArgs.
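The frequency-based VI test can be sketched as follows. The field names (firstFreq, onlyFreq, middleFreq, lastFreq) mirror the positional counts described above, but the names, the data-structure shape, and the example counts are hypothetical and only illustrate the comparison; the real test is also applied only after the first word passes the ignorable-verb check.

/** Sketch: decide whether the first word of a name is an ignorable verb (VI). */
class IgnorableVerbSketch {

    /** Positional counts for one word, gathered from a corpus of method names. */
    static class WordStats {
        long firstFreq;   // times the word appeared first in a name
        long onlyFreq;    // times the word was the only word in a name
        long middleFreq;  // times it appeared in the middle of a name
        long lastFreq;    // times it appeared last in a name

        WordStats(long first, long only, long middle, long last) {
            this.firstFreq = first; this.onlyFreq = only;
            this.middleFreq = middle; this.lastFreq = last;
        }
    }

    /**
     * If the second word is more likely to start (or be) a name than to sit in the
     * middle or at the end, treat the first word as VI and the second as the verb.
     */
    static boolean firstWordIsIgnorableVerb(WordStats secondWord) {
        long startLike = secondWord.firstFreq + secondWord.onlyFreq;
        long innerLike = secondWord.middleFreq + secondWord.lastFreq;
        return startLike > innerLike;
    }

    public static void main(String[] args) {
        // Illustrative counts for "make" in testMakeProductPrice():
        // far more often the first word of a name than a middle or last word.
        WordStats make = new WordStats(5000, 300, 800, 400);
        System.out.println(firstWordIsIgnorableVerb(make)); // true -> "test" tagged VI
    }
}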


If the method name does not contain a preposition, we look for the theme in the rest of the name, the auxArgs, or the class. If the theme is in the name, we also look for equivalent formal parameters in auxArgs, as presented in Algorithm 9. We consider an NP ∈ auxArgs to be equivalent to the NP theme if their head parameters are the same, unless the auxiliary argument NP ends with an ignorable noun, in which case we check the penultimate word in the NP for head overlap.

If the method name contains a preposition, we look for a secondary argument in addition to the theme. Similar to our phrase generation approach in Chapter 4, we look for the theme first in the name, and then in the class. We look for the secondary argument in the name, and then in the auxArgs. If the theme or secondary argument is in the name, we check these arguments for overlap in auxArgs.

Algorithm 9 checkHeadOverlap(theme, auxArgs)
Input: set of phrase structure nodes (theme, auxArgs)
Output: possibly modified phrase structure nodes (theme, auxArgs)

for all arg ∈ auxArgs do
    if head(theme) = head(arg) ∨ (tag(head(arg)) = NI ∧ head(theme) = head−1(arg)) then
        theme = EQ(theme, arg)
        remove(arg, auxArgs)
    end if
end for
return (theme, auxArgs)

5.6.2.3 Beyond Method Declarations

To build a SWUM for a method invocation, we start with the SWUM from its declaration. Based on the argument position of the formal parameters, the actual parameters and invoked-on-expression are added to the model as EQ nodes with the formal parameters and declaring class, respectively. In some cases, multiple method invocations may be chained together as composed and nested method invocations. We consider a composed method invocation to be a sequence of method calls where each method call is invoked on the previous

method invocation expression. For example, from the composed invocation pServer.getTagName().equals("server"), we can infer that "equals" compares a server with a tag name. We consider a nested method invocation to be a sequence of method calls where a method call uses one or more method invocation expressions as actual parameters. For example, from the nested invocation server.addItem(server.getURLFromString(id)), we can infer that this particular "add item" invocation takes a "url" as an argument. In addition to being linked in the program structure, composed and nested method invocations are linked together in SWUM via their phrase structure nodes, since the arguments of one method become the arguments of another. In the example above, the theme for getURLFromString(id) would become one of the auxiliary arguments to the VP for addItem.

Our construction algorithm for SWUM currently supports variables, types, method and field signatures, and method invocations. We support logical and arithmetic expressions by creating an EQ node linking to the NPs for every variable used in the expression. Assignment statements are modeled with a bridge edge to the SWUM of the left hand side, and a separate bridge edge to the right hand side. Support for other language constructs, such as conditional statements and loops, is the subject of continuing research [SHM+10].

Additional nodes and edges may be added to capture return information. For example, if the method is a getter, its return type is considered equivalent to its theme. Within a method declaration, SWUMs for any return statements are linked with an EQ node to the return type. This return information can provide valuable insights into the actual types of variables and how general methods are used in the source code. For example, consider the nested invocation: JConfig.getCanonicalFile(JConfig.queryConfiguration("savefile")). Based on its declaration, we know that queryConfiguration has a string return type and returns a variable retVal (not shown). We also know from the declaration of

getCanonicalFile that the name of its formal parameter is "fname". Thus, since the invocation of queryConfiguration is nested inside the call to getCanonicalFile, we can also infer that this particular invocation of queryConfiguration returns the concept of an "fname". For this example, queryConfiguration would contain three equivalent arguments for the return type in auxArgs: the declared return type, "string"; the returned expression, "ret val"; and the inferred return, "fname".

5.6.2.4 Current Implementation

SWUM is currently implemented as an Eclipse plug-in, taking advantage of Eclipse's program structure information and AST for Java source code. Implementation details of statement-level SWUM analysis within method bodies are left to the target software engineering applications, such as search in Chapter 6 or comment generation [SHM+10]. Because statement-level information is more expensive to calculate, depending on the required level of precision, we leave optimizing access to that information up to the target applications.

Program Structure

Algorithms for constructing program structure models (Np, Ap, Ep and Lp) are well documented [ALSU06, FOW87]. The current version of SWUM utilizes Eclipse's call hierarchy for calling relationships, and relies on Eclipse's program element reference and declaration search mechanisms. Additional structural information, such as whether a method writes or reads a field, is detected with light-weight lexical approximations using regular expressions or by traversing the AST. SWUM currently utilizes calling relationships, field def-use information, and declaration-use edges for both variables (local, formal, field) and methods.

The current implementation of SWUM does not annotate structural nodes with semantic role information, since this information is not required for the target


applications of concern location. This information is more important for extracting UML diagrams or comment generation [SHM+10].

Program Words

The overall accuracy of SWUM depends on being able to accurately extract word nodes from identifiers and comments. In most cases, simply splitting on camel case words (e.g., ASTVisitor, getInfoByName) or non-alphabetic characters (e.g., model_type, object2byte) will suffice; a small splitting sketch follows below. This straightforward splitting has been successfully employed by tools which utilize lexical concepts [ACC+02, HPVS07, HPVS09, SFH+07]. However, sometimes there is no discernible delineation between words, especially in cases where two words together indicate a particular concept (e.g., notype, USERLIB). In addition, word node extraction is complicated by the use of abbreviations, which can be expanded as necessary to properly extract lexical features [HFB+08]. The techniques presented in Chapter 3 can be used to improve word splitting and abbreviation expansion. Synonym support is not currently implemented, since accurate synonym information for the domain of software does not yet exist [SHPVS08]. This is an area for future work.

Part of speech (POS) information for individual words is crucial to extracting accurate phrasal concepts. Although POS taggers for written English texts boast accuracy of 95-97% [MS99], the problem of identifying the correct POS for a given word in software is different from English. In software, sentences are often written in imperative form and nouns are frequently used as verbs. Consider the common English noun "fire", which is frequently used as a verb in software (e.g., "fire action event"), but not always. The word "fire" may be a noun (e.g., "fire started") or even an adjective in gaming software (e.g., "car is in a fire state"). Further, existing taggers perform poorly on unknown vocabulary [MS99]. This is a particular problem for software, where terminology frequently evolves as new technologies are developed and new domains encountered.
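The straightforward camel-case and non-alphabetic splitting mentioned above can be sketched in a few lines of Java. This is a minimal illustration under the stated assumptions, not the splitter actually used by SWUM, which also handles same-case concatenations and abbreviation expansion via the Chapter 3 techniques.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Sketch: split identifiers on camel case and non-alphabetic characters. */
class IdentifierSplitterSketch {

    static List<String> split(String identifier) {
        // Insert spaces at case and letter/digit boundaries, then split on
        // any run of non-alphanumeric characters.
        String spaced = identifier
                .replaceAll("([a-z])([A-Z])", "$1 $2")       // getInfo    -> get Info
                .replaceAll("([A-Z]+)([A-Z][a-z])", "$1 $2") // ASTVisitor -> AST Visitor
                .replaceAll("([A-Za-z])([0-9])", "$1 $2")    // object2    -> object 2
                .replaceAll("([0-9])([A-Za-z])", "$1 $2");   // 2byte      -> 2 byte
        return Arrays.stream(spaced.split("[^A-Za-z0-9]+"))
                .filter(w -> !w.isEmpty())
                .map(String::toLowerCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(split("ASTVisitor"));    // [ast, visitor]
        System.out.println(split("getInfoByName")); // [get, info, by, name]
        System.out.println(split("model_type"));    // [model, type]
        System.out.println(split("object2byte"));   // [object, 2, byte]
    }
}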

To address this problem, we use an alternative to traditional English POS taggers: the morphological parser PC-KIMMO [Ant90], which uses the structure of the word itself to suggest potential POS tags. For example, words ending in "-ed" are often verb past participles. Thus, PC-KIMMO can make informed POS guesses even for new vocabulary. The drawback is that PC-KIMMO outputs every possible POS; the final decision is left to the user. We use standard naming conventions [GJSB05, LBS06] (e.g., the fact that method names frequently begin with verbs, and variables are NPs) to disambiguate the POS for a given word node.

We also used PC-KIMMO to create a word stemmer specific to the domain of software. We select between possible parses by using word frequency information from a corpus of over 9,000 open source Java programs. Using this technique, we identified stems for 21,329 words found in the Java corpus. In the current implementation of SWUM, we have chosen to be more conservative than our prior query reformulation approach in Chapter 4, and leave inferring the unknown action and theme of PP names for future work.

5.6.2.5 Analysis of Time and Space

SWUM takes as input POS, stem, and positional frequency information, which we have pre-computed over a large set of words in software. For our current implementation, these files require just 1.6 MB of space. Alternatively, an improved POS tagger and stemmer could reduce the amount of space these lists require by parsing identifiers and stemming on demand, albeit with a potential tradeoff in time.

If IDF information is used, as with applications like search, document frequency information must be computed by performing a pass over the program before using SWUM. This information can be incrementally updated as the program changes, by reanalyzing the portions of code that have been changed. Changes to signatures require reanalyzing all uses, whereas changing a body statement will only require reanalyzing that line.

We assume SWUM takes as input a program structure representation and the above mentioned word usage information. Given these inputs, SWUM construction is linear with respect to the number of unique identifiers and constant values used in the code being analyzed. Information constructed from declarations can be stored for when uses are analyzed. As with IDF, this information can be incrementally updated as the program changes.

Our unoptimized research prototype takes just 11.3 seconds to analyze signatures from a 74 KLOC program (containing 3,462 fields and 4,712 methods) and analyzed 1.5 million LOC in just under 11 minutes (64,694 fields and 130,363 methods). Analysis of body statements as well as signatures will require additional time, and will be heavily dependent on the speed of the structural analysis. Although we have presented times for analyzing an entire program, SWUM analysis can be run on demand. Demand-driven construction is especially useful when more expensive analyses are required, as for comment generation [SHM+10].

5.6.2.6 Developing SWUM Construction Rules for other Languages

Although the rules have been developed specifically for Java, they can largely be applied to other object-oriented languages, with small tweaks for differences in language constructs and semantics. We were able to construct SWUM for sample C++ classes with little difficulty. For other types of languages, the naming convention rules may be quite different. The overall theory of SWUM and what nodes need to be constructed to capture phrasal concepts remains the same, but the construction rules may require significant adjustments. New rules can be constructed based on the rule development methodology outlined above. Depending on the level of accuracy required by the target software engineering tool, this can be a relatively straightforward process for someone proficient and experienced in the language. Language proficiency can be simulated by studying many examples from other programmers.

5.7 Comparing SWUM with other NLP-based Identifier Analyses

Our Software Word Usage Model (SWUM) is a generalization of the Action-Oriented Identifier Graph (AOIG) [SPVS06].

Like the AOIG, SWUM places emphasis on accurately modeling the actions in code by modeling method declarations. However, the AOIG only models verbs and their themes, or direct objects, and does not support the full range of phrase structures found in SWUM. Because SWUM can represent the full range of phrase structures, it can be used in SE tools that are not action-oriented, and provides greater accuracy in verb arguments for SE tools that are action-oriented.

Beyond the AOIG, other researchers have investigated naming conventions and the semantics of method signatures. Liblit et al. identified common morphological patterns and naming conventions [LBS06], which became the starting point for our rule development. Caprile and Tonella developed a language of function identifiers that shares similarities with our phrase structure node rules [CT99], and applied it to an identifier restructuring tool [CT00]. The main differences between Caprile and Tonella's grammar and SWUM's grammar are (1) our phrase structure nodes and grammar closely mimic well-known parsing structures from NLP, and can thus readily take advantage of parsing solutions found in that field; (2) we have developed an algorithm to automatically parse identifiers according to our grammar; and (3) our rules cover a broader set of identifiers and were developed using a much larger code base (18 million LOC in Java versus 230 KLOC in C). Because Caprile and Tonella's grammar was developed exclusively on C code, the similarities between their grammar and SWUM's provide further evidence that the construction rules built for Java can translate to other languages.

Method names have also been analyzed in terms of their most common meanings [DP06, HØ09b]. Høst and Østvold created a dictionary of commonly occurring method name patterns along with some of their most common attributes in terms


of implementation semantics [HØ09b]. Although the primary aim of this work is to help developers choose better names, the parsing rules and implementation attributes are similar in spirit to SWUM's grammar and role information for program elements. In fact, SWUM already takes advantage of some of their structural attributes, such as reading or writing fields. Patterns observed in the dictionary could be integrated into SWUM's naming convention rules in the future.

5.8 Evaluation

We evaluate SWUM in multiple dimensions: generality, accuracy, and effectiveness

as a basis for software engineering tools. One way to evaluate a model is to judge its applicability: the more general the model, the more widely applicable and useful it is. In this section, we summarize some of the diverse ways SWUM can be used to improve software engineering tools as well as investigate the accuracy of SWUM's construction rules as applied to phrase generation. In the following chapter, we evaluate SWUM's utility in a software search tool.

5.8.1 Applying SWUM to other Software Engineering Tools

We believe SWUM can be used to create improved human-oriented software

models and tools for (1) concern location, (2) program comprehension, (3) debugging, and (4) text mining.

Concern Location and Software Search

In Section 5.3, we motivated the need for word relationships through program search. This idea has been empirically investigated [HPVS09, SFH+07] and demonstrated to be an effective technique for query formulation and indexing. We expect the phrasal concepts modeled by SWUM to lead to even more effective query formulation tools, and further improve the contextual query reformulation technique described in Chapter 4.


Program Comprehension

Because of its emphasis on representing both the programming language and natural language semantics found in software, SWUM is a natural fit for program comprehension tools. SWUM can be used to automatically generate single-sentence comments for method declarations and invocations, which can be used to generate a simple Javadoc comment for a method declaration, or to generate a comment for a complicated method invocation statement. We envision a personal programming tutor that can automatically generate natural language comments for novice programmers working with more expertly-written code. In fact, Sridhara et al. used SWUM to automatically generate summary comments for methods [SHM+10]. SWUM can also be used to improve keyword programming [LM09], detect poorly named methods [HØ09a, LFB07a], refactor identifiers [CT00], or to generate patch documentation [BW08, BW10]. Finally, the statement-level linguistic relationships that can be inferred using SWUM can be considered similar to abstract types, which have been used for both program comprehension and test generation [GPME06, PLEB07].

Debugging

Ko and Myers developed a debugging system based on question answering [KM08]. They developed an interface that tracks output-affecting state variables through a trace, and enables the developer to select from a set of NL questions relevant to the output. SWUM could take this idea further by generating more informative questions which are specific to the underlying code, rather than relying on the current strictly predefined questions. In addition, SWUM could enable development of a true question answering debugging system that allows the developer to enter an arbitrary question expressed in natural language.


Text Mining of Source Code

SWUM can be used to develop customized NL analyses for software through text mining of source code. The node annotations and linguistic relationships in SWUM provide a natural framework for mining finer-grained information beyond simple co-occurrences. SWUM can be used to help mine verb and phrase meanings [HØ07, HØ09b], populate a software ontology [WLZR08], generate word relation databases for software [SHPVS08], or even mine commonly occurring verb-preposition relationships, which can lead to improved construction algorithms for SWUM in the future.

5.8.2 Analysis of Accuracy: Utilizing SWUM for Phrase Generation

One way to evaluate SWUM is to investigate its effectiveness for phrase generation,

which can be used in query reformulation or comment generation. If the phrases automatically generated from our model are believed by humans to accurately reflect the intent of the code, then we believe that SWUM has the potential to significantly impact the way software engineering tools are built to help software developers and maintainers. The following chapter evaluates SWUM's utility in constructing software search tools for concern location.

Although SWUM models method semantics in terms of action, theme, and secondary objects, for the purpose of phrase generation we refer to these items by their syntactic analogues of verb, direct object, and indirect object. For phrase generation, inference of an indirect object becomes important for creating a descriptive phrase. If the theme is found in the method name, our phrase generation algorithm gets an indirect object from the class and the first auxiliary argument. If the theme is a formal parameter, the indirect object is assumed to be the class. We infer a generic preposition based on the verb (e.g., "from" for "get", "to" for "set", "of" by default, etc.). Checker methods are a special case; phrases are generated using the template "check if" + subject + condition.
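A minimal sketch of this phrase template logic is shown below. The class and example arguments are drawn from the results discussed later in this section, but the code itself is an illustrative assumption; the real phrase generator also handles secondary arguments, ignorable words, and the other special cases described above.

/** Sketch: build a descriptive phrase from SWUM-style roles (verb, DO, IO). */
class PhraseGeneratorSketch {

    // Infer a generic preposition from the verb ("from" for get, "to" for set, else "of").
    static String prepositionFor(String verb) {
        switch (verb) {
            case "get": return "from";
            case "set": return "to";
            default:    return "of";
        }
    }

    /** Ordinary action-oriented methods: verb + DO + preposition + IO. */
    static String phrase(String verb, String directObject, String indirectObject) {
        return verb + " " + directObject + " "
                + prepositionFor(verb) + " " + indirectObject;
    }

    /** Checker-style template: "check if" + subject + condition. */
    static String checkerPhrase(String subject, String condition) {
        return "check if " + subject + " " + condition;
    }

    public static void main(String[] args) {
        // init(Controller c) in class MindMapMode: DO = formal, IO = class
        System.out.println(phrase("init", "controller", "mind map mode"));
        // setResizable(boolean b) in class JInternalFrame (DO pre-composed for illustration)
        System.out.println(phrase("set", "resizable of j internal frame", "boolean"));
        // equals(GlyphVector rhs) in StandardGlyphVector, rendered with the checker template
        System.out.println(checkerPhrase("standard glyph vector", "equals rhs glyph vector"));
    }
}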

We evaluated our phrase generation algorithm by asking 4 human annotators to rate the correctness of 20 randomly selected phrases in terms of accuracy and completeness.

5.8.3 Study Design

The 20 sample method declarations were randomly selected from 4 open

source Java programs: the Java 1.5 API implementation; Vuze, a bit torrent client; Freemind, a mind map application; and Gantt, an open source interface for using Gantt Charts. Five method declarations (signatures and implementing source code) were randomly selected from each program, and a single phrase was generated for each declaration. This set of 20 examples was given to our 4 subjects, all of whom have familiarity developing programs in Java. The subjects were asked to rate each phrase in terms of accuracy and completeness by using one of 4 labels:

• AC if the phrase accurately and completely captures the method's intent;
• C if the phrase is complete, but not accurate (e.g., the phrase includes the right ideas, but combines them in such a way that the phrase is no longer accurate);
• A if the phrase is accurate, but incomplete (e.g., the phrase correctly conveys part of the method's intent but not the whole intent); or
• W if the phrase is inaccurate and incomplete (e.g., the phrase does not describe the method's intent in any way).

If the subjects chose any label other than AC, they were asked to enter a more appropriate phrase, beginning with a verb, that was more accurate or complete to them. Since the subjects were given the method declaration and its implementation, it should be noted that the subjects had access to more information than our automated SWUM approach.

When a phrase is annotated as A, C, or even W, there may still be correctly generated verb, direct object (DO), or indirect object (IO) components of the phrase, or subphrases. Thus, we measure accuracy in terms of the number of correctly generated verbs, DOs, and IOs. To measure this information, we manually compared

the phrase generated by our technique with the phrase constructed by the subjects, looking for similarities. When the generated subphrase (i.e., verb, DO, or IO) exactly matched the subphrase constructed by the human subject, the generated subphrase was marked as correct. Given 20 example declarations, 4 annotators, and 3 subphrases, we had a total of 240 observations in the study.

5.8.4 Results and Discussion

Overall, our automatic phrase generation technique based on SWUM produced

accurate subphrases for 86% of the 240 example verbs, DOs, and IOs compared in the study. Our approach generated completely accurate phrases for 9 of the 20 phrases (45%), and the generated phrases contained at least 2 out of 3 correct subphrases for 17 out of 20 (85%). Further, we modeled and generated verb subphrases with an accuracy of 84%, DOs with an accuracy of 89%, and IOs with an accuracy of 85%.

Let us consider 3 of the 8 correctly generated phrases. Since methods with single verb names are the most straightforward to model, it is not surprising that our approach correctly generated the phrase "init controller of mind map mode" for the init(Controller c) declaration in class MindMapMode. In such cases, the DO is always the formal and the IO the class. A more complicated declaration is equals(GlyphVector rhs) in class StandardGlyphVector, for which we generated the phrase, "check if standard glyph vector equals rhs glyph vector." This method contains a special case verb, equals, which takes two DO arguments that we extract from the class and the first formal. Finally, let us consider the declaration of setResizable(boolean b) in class JInternalFrame. Setters are also special case verbs which have specific semantics. The DO is almost always being set to the formal, and frequently the DO is a field in the declaring class. Thus, for setResizable we correctly generate the phrase, "set resizable of j internal frame to boolean".


Although our approach accurately generated subphrases for most verbs, DOs, and IOs, there were some methods where analyzing only signature information hurt accuracy. For example, for the method usage(), we simply generated the phrase "handle usage". As our subjects observed, the method is actually printing usage to stdout. Analyzing the method invocations within the method body could overcome this problem.

There was one very poorly named method where an inaccurate verb was used in the method name. Although the method name was isIconVisible, the method actually called setIconVisible and had a void return type (contrary to the intuition that a name starting with "is" would return a boolean). This example illustrates how SWUM could be used to automatically detect poorly named methods that can hinder program comprehension.

5.9 Conclusion

In this chapter, we introduced a novel Software Word Usage Model (SWUM)

that captures the conceptual knowledge of the programmer as expressed in both natural language information and programming language structure and semantics. We have designed SWUM to be an interface between software engineering tool designers and researchers working on improved linguistic analyses for software. We presented an algorithm to automatically construct SWUM for Java methods, types, and variables. We evaluated SWUM’s accuracy in an automatic phrase generation technique that produced accurate subphrases for 86% of verbs, DOs, and IOs. In the next chapter, we evaluate SWUM’s utility in a software search tool.


Chapter 6

USING SWUM TO SEARCH SOURCE CODE

Most existing information retrieval (IR) techniques treat search documents as a bag of words, that is, the words are simply viewed as a collection with no context or relationships associated between them. In the beginning of Chapter 5, we demonstrated how bag of words approaches for software search only differentiate between relevant and irrelevant results based on word occurrences. In contrast, phrasal concepts enable us to further differentiate relevant and irrelevant results based on how the words appear together in code. Figure 5.2 illustrates this observation by contrasting the occurrence of the query terms "add" and "item" in the irrelevant method sum(String field) with the occurrences in relevant addEntry(ItemEntry ie). In fact, researchers have demonstrated that POS information can improve retrieval performance in software [CDLO+09]. In this chapter, we present a scoring technique, swum, for searching software using SWUM's phrasal concepts. We analyze the effect SWUM has on concern location and compare our technique against state of the art search tools.

6.1 A SWUM Score for Search

Our goal in developing the swum scoring function is to score every program

element based on its relevance to the query words. We view swum as a way to gather evidence in support of an element’s relevance, based on the following sources of information:


• The location of the query word occurrence in the code. For methods, the query word could occur in the signature or body of a method. For fields, the query word could also occur in the signature or "body", i.e., the initializer or comments.

• The semantic role of the query word occurrence in the phrase structure. For methods, the query word could occur within the action, theme, secondary arguments, or auxiliary arguments. Currently, fields do not contain semantic roles beyond name, type, and declaring class.

• The head distance of the query word occurrence within the phrase structure. For example, the query word could be the last word of a theme's noun phrase (i.e., in the head position), or the first. The closer a query word occurs to the head position of a phrase, the more strongly the phrase relates to the query word.

• The usage of the word in the rest of the program. If a query word frequently occurs in program elements throughout the rest of the program, it is not as good at discriminating between relevant and irrelevant results. This idea is commonly used in information retrieval techniques [MRS08].

Based on previous work in exploration [HPVS07], we know that the relationship of a method within the structural context of the program can provide further evidence of an element's relevance. However, this information is more useful in exploring, rather than searching, the source code, and we leave investigation of structural context to Chapter 7.

We developed our swum scoring function using a training data set of 5 manually-mapped concerns. The remaining subsections describe how we used that


training data to integrate location, semantic role, head distance, and usage information into a single swum scoring function. We present the scoring function top-down, with the highest-level equations presented first.

6.1.1 Location

For the purposes of search, we only differentiate between signature and body

locations, where the body includes all words from comments, identifiers, and literals in the program element. Given a program element x, we define our scoring function, swum(x), using scores for both the signature and the body:

swum(x) = signature(x) + body(x)    (6.1)

where

signature(x) = max(βsig−swum ∗ sig swum(x), βsig−lex ∗ sig lex(x))    (6.2)

and

body(x) = βbody ∗ body lex(x) / nq    (6.3)

Where nq is the number of query words. Based on our training sample, we define weights for the score coefficients as βsig−swum = 1, βsig−lex = 0.05, and βbody = 0.1. In the remainder of this subsection, we describe sig lex and body lex; the sig swum function is described when we discuss semantic roles in Section 6.1.2.

In Equation 6.2, we take the maximum of our SWUM-based score of the signature (sig swum) and a best-effort lexical score (sig lex). For sig lex, we first split and stem all the identifiers in the signature. Then, we lexically search for the query term or the stemmed query term. Using a lexical search allows us to give non-zero scores to method signatures that we cannot parse due to same-case

splitting mistakes. This is a best-effort lexical search because we still split and stem the signature first to take advantage of clearly available clues. Thus, this lexical search is a cross between bag of words and grep. Program elements only matched by sig lex and not sig swum are more likely to be irrelevant. For example, using sig lex to search for the word "adds", with stem "add", would return irrelevant words like "padding" in addition to relevant signatures like the unsplit "additem" or splittable "addsItem". Thus, we multiply sig lex by a very small coefficient and take the maximum of sig lex and sig swum to ensure signatures parsed by SWUM are always ranked more highly than signatures only matched with sig lex.

As shown in Equation 6.3, we analyze body information (body lex) with the same boolean lexical score as sig lex. We use a small coefficient (βbody = 0.1) to ensure body information is only used to break ties between similarly ranked methods. To keep the contribution of body lex bounded to a maximum of βbody in Equation 6.3, we normalize body lex by the number of query words. We also investigated more sophisticated ways of using body information, such as calculating a SWUM-based body score, but results did not improve on our training data.

Most traditional information retrieval techniques take into account the frequency of words in a document when determining relevance. Frequency-based scores are typically normalized by the length of the document, since longer documents are more likely to contain more occurrences of the query words. However, we use a boolean scoring function with body lex, rather than a frequency-based score, because we want our scoring function to be independent of a programmer's method decomposition habits. If a programmer decomposes methods into small segments, the document lengths will be small. Conversely, if a programmer does not decompose methods into substeps, the documents will be very long.


For example, consider a concern related to sorting. The relevant method populateTreeByStyle is 1500 lines long, and contains just 1 relevant statement which calls the sort function. Later, the programmer decides to refactor the long populateTreeByStyle method into a number of smaller methods. The call to the sort function is now located within a short 5-line method, called getResults. We believe that both the original 1500 line method and the refactored 5 line method are equally relevant for containing a call to the sort function. If we were to use frequency information in our scoring function, the longer populateTreeByStyle method would be penalized for the method decomposition habits of the programmer.

In general, using frequency information will bias the search in favor of either short or long methods. Our goal is to bias the search to methods containing the query terms, regardless of length. Investigations with our training data support that normalizing by method length, either in terms of lines or the number of words, does not uniformly improve our search results. We also observed this phenomenon when creating a method scoring function for exploration [HPVS07]. We hypothesize that this may be due to the fact that very short methods, such as getters and setters, tend not to be as important for comprehension purposes as longer methods. Normalizing by length would favor these shorter get and set methods over methods likely to implement important functionality for concern location and comprehension.

The remaining subsections describe additional details for how we calculate sig swum, sig lex, and body lex.
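Under the stated coefficients, the top-level score of Equations 6.1-6.3 can be sketched in a few lines of Java. The class and parameter names are hypothetical, and the sig swum, sig lex, and body lex inputs are assumed to be computed as described in the surrounding subsections.

/** Sketch: combine signature and body evidence into the overall swum score (Eqs. 6.1-6.3). */
class SwumScoreSketch {

    static final double BETA_SIG_SWUM = 1.0;
    static final double BETA_SIG_LEX  = 0.05;
    static final double BETA_BODY     = 0.1;

    /**
     * @param sigSwum        SWUM-based signature score (Eq. 6.4)
     * @param sigLex         best-effort lexical score over the signature
     * @param bodyLex        boolean lexical score over the body
     * @param numQueryWords  nq, keeps the body contribution bounded by BETA_BODY
     */
    static double swumScore(double sigSwum, double sigLex, double bodyLex, int numQueryWords) {
        double signature = Math.max(BETA_SIG_SWUM * sigSwum, BETA_SIG_LEX * sigLex);
        double body = BETA_BODY * bodyLex / numQueryWords;
        return signature + body;
    }

    public static void main(String[] args) {
        // Illustrative values only: a method whose signature parses well vs. one
        // matched only lexically; body matches act as a small tie-breaker.
        System.out.println(swumScore(1.8, 1.8, 0.9, 2)); // parsed by SWUM
        System.out.println(swumScore(0.0, 1.8, 0.9, 2)); // matched by sig lex only
    }
}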

6.1.2 Semantic Role

Semantic role information plays a key part in our swum scoring function. In

fact, the main purpose of using SWUM in a scoring function is the ability to rank occurrences of query words based on their semantic roles.


Scoring Methods

We take semantic role information into account in sig swum by weighting occurrences of query terms based on where they appear in their semantic role in the phrase structure of a method's signature:

sig swum(x) = Σ_{q ∈ query} [ idf(q) ∗ max_i ( βi ∗ head(q, i) ) ]    (6.4)

where

i ∈ {action(x), theme(x), secondaryArgs(x), auxArgs(x)}

In addition to semantic role information, Equation 6.4 brings together usage information (idf), which approximates how well the query word discriminates between relevant and irrelevant methods; and head distance (head), which approximates how strongly a phrase relates to the query word. We define these functions in Subsections 6.1.3 and 6.1.4.

We ensure a single query word can only contribute to the sig swum score once by taking the maximum contribution for any semantic role's head distance score and weight. We define a weight, βi, for each semantic role (i.e., action, theme, secondary arguments, and auxiliary arguments). If instead of taking a maximum we summed the contributions from all the roles, our search results would rank signatures with multiple occurrences of one query word higher than signatures containing one occurrence of multiple query words. For example, consider searching for the "add item" concern from Chapter 5 with methods Main.add(Item) and ItemManager.parseItems(ItemSet), where add contains 1 occurrence of 2 query words and parseItems contains 3 occurrences of 1 query word, "item". If we summed the contributions of a query word, the less relevant parseItems method, with a score of 3, would be more highly ranked than the more relevant add (score 2), even though parseItems only contains 1 query term. In contrast, by only taking the maximum of

each query word's contribution, the add(Item) method will be more highly ranked for containing multiple query words, since parseItems would only have a score of 1, and add 2.

We based the weights for the semantic role information on our intuition and results from our training data set. As observed in our prior work with verb-direct object searching [SFH+07], occurrences of query words in a method's action or theme are an important indicator of relevance. Thus, we give the action and theme the highest weights of βaction = 1 and βtheme = 1. By making the action and theme coefficients equivalent, we are more robust with respect to SWUM parse errors in identifying the action and theme in a method name. In addition, equivalent weights give a query term equal weight whether it appears as an action or a theme. This is particularly useful for query words that are used interchangeably in the action and theme role. For example, when searching for "compare track", we found relevant results with "compare" as an action (TrackComparator.compare) and as a noun modifier (Track.getCompareString). By weighing the action and theme equally, both occurrences of "compare" are scored highly. This weighting scheme, in conjunction with free-form keyword queries, gives swum an advantage over our prior verb-direct object search technique [SFH+07].

Although secondary and auxiliary arguments are less likely to capture the main intent of a method, they can help differentiate between relevant and irrelevant results. For example, consider searching for "sort style" with methods getStyle and sortXMLByStyle. Based only on action and theme, getStyle would be as highly ranked as sortXMLByStyle. By taking secondary arguments into account, sortXMLByStyle would get the higher score. Thus, we define βsecondaryArgs = 0.5 for secondary arguments found in the name, and βauxArgs = 0.25 for any remaining auxiliary arguments in the signature.
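A compact sketch of Equation 6.4 with these role weights is given below. The way occurrences and roles are represented here is a simplifying assumption for illustration; the real implementation works over SWUM's phrase structure nodes rather than plain strings, and the example numbers are made up.

import java.util.List;
import java.util.Map;

/** Sketch: SWUM-based signature score for a method (Eq. 6.4). */
class SigSwumSketch {

    // Semantic role weights derived from the training data.
    static final Map<String, Double> ROLE_WEIGHTS = Map.of(
            "action", 1.0, "theme", 1.0, "secondaryArgs", 0.5, "auxArgs", 0.25);

    /** One occurrence of a query word in some role, with its head-distance score. */
    record Occurrence(String role, double headScore) {}

    /**
     * For each query word, take the best (weight * head) contribution over all roles,
     * scale it by idf, and sum over the query words.
     */
    static double sigSwum(List<String> query,
                          Map<String, List<Occurrence>> occurrencesByWord,
                          Map<String, Double> idf) {
        double score = 0.0;
        for (String q : query) {
            double best = 0.0;
            for (Occurrence occ : occurrencesByWord.getOrDefault(q, List.of())) {
                best = Math.max(best,
                        ROLE_WEIGHTS.getOrDefault(occ.role(), 0.0) * occ.headScore());
            }
            score += idf.getOrDefault(q, 0.0) * best;
        }
        return score;
    }

    public static void main(String[] args) {
        // Query "add item" against a hypothetical method's occurrences.
        var occurrences = Map.of(
                "add",  List.of(new Occurrence("action", 1.0)),
                "item", List.of(new Occurrence("theme", 0.5),
                                new Occurrence("auxArgs", 1.0)));
        var idf = Map.of("add", 0.9, "item", 0.8);
        System.out.println(sigSwum(List.of("add", "item"), occurrences, idf)); // 1.3
    }
}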


Scoring Fields

The approach for calculating sig swum for fields is similar to that of methods. We currently do not have semantic roles defined for fields outside of their name, type, and declaring class.

sig swum(x) = Σ_{q ∈ query} [ idf(q) ∗ max_i ( βi ∗ avg(head(q, i), pnc(q, i)) ) ]    (6.5)

where

i ∈ {name(x), type(x), declaring class(x)}

For methods, the head distance (head) is enough to capture how strongly an NP is related to a query word. Because field names are not parsed, and simply treated as NPs, head distance can be misleading for longer field names. For example, when searching for "compare style", the irrelevant field startupStyle is scored the same as the relevant comparator. Our solution is to take the percent of non-content and non-query words into account, defined by the function pnc. Thus, we counterbalance head information by averaging it with pnc. We formally define the idf, head, and pnc functions in the following subsections.

6.1.3 Head Distance

Head distance approximates how strongly a noun phrase or verb group relates

to a query word. For example, consider the concept of “adding an auction” in an auction sniping program that allows users to place proxy bids on online auction sites such as eBay, as previously described in Chapter 2. The query to search for this concern would be “add auction”. If the head distance of the query word is not taken into account, the following 4 methods are all considered equally relevant:


AuctionServerManager.addAuctionServerMenus()
HTMLDump.addAuctionLink()
JBidMouse.addAuction(String auctionSource)
JBidProxy.addAuction(String auctionId)

Clearly, the last two methods, JBidMouse.addAuction and JBidProxy.addAuction, are more relevant than the first two methods. Specifically, addAuctionServerMenus() is not adding an auction but a server menu, and addAuctionLink is adding a link. In both cases, the query word is being used to modify the theme's head. Head distance allows us to differentiate between strong occurrences of query words in the head position, and less relevant occurrences to the left of the head. Thus, head distance (head) approximates how strongly a phrase p is related to a query word q, and is calculated the same for method and field signatures:

head(q, p) = 0                                 if q ∉ p
head(q, p) = 1 / (1 + min distance(q, p))      if q ∈ p    (6.6)

Where min distance is the minimum distance of the query word from the head position of the phrase. If the query word does not appear in the phrase, head = 0. Because SWUM supports equivalence nodes, we need to use the minimum distance of the query word from the head position, rather than just the distance. For example, if the two noun phrases "auction" and "auction item" are equivalent, the distance will be the minimum head distance, 1. Our approach to head distance also takes into account ignorable nouns in the head position by discounting the minimum head distance.[1] If the right most word of a noun phrase (NP) is tagged as an ignorable noun (NI), then min distance = min distance − 1.

[1] There is no need to account for ignorable verbs since those precede the verb in a VG.


In summary, for methods, the head distance (head) is enough to capture how strongly an NP is related to a query word. Because field names are not parsed, and simply treated as NPs, head distance can be misleading for longer field names. For example, when searching for "compare style", the irrelevant field startupStyle is scored the same as the relevant comparator. Our solution is to take the percent of non-content and non-query words of the field name into account. A non-content word is a determiner, pronoun, or abbreviation. We assume any single letter is an abbreviation, as well as any type abbreviations. Thus, we counterbalance head information by averaging it with the percent of non-content words (pnc). Given a query word q, a set of query words qs, and a field name n:

pnc(q, qs, n) = tf(q, n) / ncnq(qs, n)    (6.7)

Where tf is the number of times the query word appears in the name, and ncnq is the number of non-query and non-content words in the name. To calculate pnc, we divide the number of query words in the name (tf) by the number of non-content and non-query words in the name (ncnq). For example, given the query "sort", the field JLabel jlSort would have ncnq = 1 and pnc = 1, since jl is an abbreviation for the field's type name, JLabel. For methods, the head distance captures how strongly a noun phrase or verb group relates to the query word, based on the query word's distance from the head position. To more accurately capture how well a field name relates to a query word, we average head distance with the percent of non-content and non-query words in the name.
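The following sketch illustrates Equations 6.6 and 6.7 over split names. It treats phrases as plain word lists and leaves out equivalence nodes, ignorable-noun discounting, and automatic abbreviation detection, so it is an illustration of the formulas under simplifying assumptions rather than the actual scoring code.

import java.util.List;

/** Sketch: head distance (Eq. 6.6) and percent of non-content words (Eq. 6.7). */
class HeadDistanceSketch {

    /** head(q, p): 0 if q is not in the phrase, else 1 / (1 + distance from the head). */
    static double head(String queryWord, List<String> phrase) {
        int lastIndex = phrase.lastIndexOf(queryWord);
        if (lastIndex < 0) return 0.0;
        int distanceFromHead = (phrase.size() - 1) - lastIndex;
        return 1.0 / (1 + distanceFromHead);
    }

    /** pnc for a field name: query-word occurrences divided by counted words (simplified). */
    static double pnc(List<String> queryWords, List<String> fieldNameWords,
                      List<String> nonContentWords) {
        long tf = fieldNameWords.stream().filter(queryWords::contains).count();
        long counted = fieldNameWords.stream()
                .filter(w -> !nonContentWords.contains(w))
                .count();
        return counted == 0 ? 0.0 : (double) tf / counted;
    }

    public static void main(String[] args) {
        // "auction" in the head position vs. one word to the left of the head.
        System.out.println(head("auction", List.of("add", "auction")));         // 1.0
        System.out.println(head("auction", List.of("add", "auction", "link"))); // 0.5
        // Field "jlSort" for query "sort"; "jl" treated as a non-content abbreviation.
        System.out.println(pnc(List.of("sort"), List.of("jl", "sort"), List.of("jl"))); // 1.0
    }
}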

6.1.4 Usage

Usage information captures how frequently a query word appears in the rest

of the source code, rather than just in the current method or field. If a query word frequently occurs throughout a program, it can be too common to be a good discriminator between relevant and irrelevant results. We capture usage information


for a word w using the common information retrieval measure of inverse document frequency (idf) [MRS08]:

idf(w) = 1 − df(w) / N    (6.8)

Where df is the total number of program elements, or "documents", in which a query word appears, and N is the total number of program elements in the system. Traditionally, IDF is calculated by taking the log of N/df(w), with values ranging from [0, log(N)]. We slightly modified the traditional IDF scoring function by making it a linear function from [0, 1) to be able to integrate it into the three location scores: sig swum, sig lex, and body lex.

IDF ensures that if all the results contain a single query word in the head position, the most highly ranked will contain the least frequent query word. This especially helps in situations where the query words are very general, and thus poor discriminators. For example, consider searching for the query "compare style", where "style" is much more prevalent in the program than "compare". IDF ensures that the relevant method TrackComparator.compare is more highly ranked than irrelevant Util.formatStyle(String style). If IDF were not used, the prevalence of "style" would lead to more than 30 irrelevant methods with the same score as compare.

The lexical scoring functions sig lex and body lex also take advantage of IDF. Rather than simply adding +1 for the occurrence of each query word, we add the idf of the query word.
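The linear IDF variant of Equation 6.8 amounts to a one-line computation; the counts in the example are made up, and the document-frequency counting itself is assumed to happen in a separate indexing pass over the program.

/** Sketch: the linear IDF variant of Eq. 6.8, used in place of the traditional log(N/df). */
class LinearIdfSketch {

    /** df = number of program elements containing the word; n = total program elements. */
    static double idf(long df, long n) {
        return 1.0 - (double) df / n;
    }

    public static void main(String[] args) {
        // A rare word discriminates well; a ubiquitous one contributes almost nothing.
        System.out.println(idf(3, 5000));    // ~0.9994
        System.out.println(idf(4200, 5000)); // 0.16
    }
}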

6.1.5 Putting it all Together

In summary, we score individual query words based on their usage pattern in the rest of the code as well as their head distance within a phrase in sig swum. This score for a phrase is then scaled based on its semantic role, where actions and themes get the highest coefficient multiplier. If a method is difficult for SWUM to parse, we fall back on sig lex. The score from the signature is brought together with body information (body lex) in the final swum score.


Task Name | Application | Description | Gold Set Size | Textual Clues
Add Textfield | iReport | Insert a textfield into a report | 5 | Part
Compile Report | iReport | Compile the source of a report into the final report | 8 | Full
Add Auction | jBidWatcher | Add an auction to the local list of auctions to monitor | 10 | Full
Set Snipe | jBidWatcher | Set the price for the program to automatically bid on the user's behalf at a specified time | 12 | Part
Save Auctions | jBidWatcher | Save the list of auctions that the user is currently monitoring | 9 | Full
Gather Music Files | javaHMO | Inspect the local hard-drive for music files to automatically add to the music library | 4 | None
Load Movie Listings | javaHMO | Download movie listings from a remote movie listings service | 5 | None
Search for Songs | Jajuk | Search the user's music library for a track using regular expressions | 5 | None
Play Track | Jajuk | Play an audio file | 12 | Part

Table 6.1: Concerns used in Evaluation.

Evaluation To evaluate our swum scoring function, we investigated two main research

questions: 1. What impact does SWUM have on the effectiveness of software search? 2. How does swum compare to existing state of the art search tools? To evaluate these research questions, we compared the performance of swum, swum variants, and state of the art search tools on 8 concern location tasks. First, we present the subject concerns and dependent variables, which are the same for both studies. Then, we present study-specific details for answering each research question, followed by results and discussion. 150

Subject Concerns The set of concerns consists of 9 user-observable, action-oriented concerns from 4 programs ranging in size from 23 to 75 KLOC [SFH+ 07]. The four programs are: iReport, a visual report builder and designer; jBidWatcher, an auction bidding, sniping, and tracking tool for online auction sites such as eBay or Yahoo; javaHMO, a media server for the Home Media Option from TiVo; and Jajuk, a music organizer for large music collections. The concern descriptions consist of screen shots of each concern being executed from the user interface. The concern implementations were derived by two human analysts who agreed on the relevant methods after executing and manually exploring the concerns [SFH+ 07]. Information about the concerns is presented in Table 7.1, including the concern size and whether there were textual clues in the screen shot describing the concern. One of the concerns, “add auction”, had to be discarded from the study because it was used in swum’s training set, leaving us with 8 concerns. Evaluation Measures We measure effectiveness by calculating the common measures of precision and recall on each search result set [MRS08]. Precision (P) is the percent of search results that are relevant, and captures how many irrelevant results were present with the relevant results. Recall (R) is the percent of all relevant results that were correctly returned as search results, and captures how many of the actually relevant results were predicted as relevant. We combine precision and recall using the commonly used F measure (F) [MRS08], which is high only when both precision and recall are similarly high:

F = \frac{2 \cdot P \cdot R}{P + R}
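To make these measures concrete, the following minimal Python sketch (an illustration only, not part of the evaluation tooling; the method names in the usage example are hypothetical) computes precision, recall, and the F measure for a result set against a gold set:

    def precision_recall_f(results, gold):
        """results: methods returned by a search; gold: methods known to be relevant."""
        results, gold = set(results), set(gold)
        hits = len(results & gold)                    # relevant methods that were returned
        p = hits / len(results) if results else 0.0   # precision
        r = hits / len(gold) if gold else 0.0         # recall
        f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
        return p, r, f

    # Hypothetical example: 10 returned methods, 3 of 5 gold methods found,
    # giving P = 0.30, R = 0.60, F = 0.40.
    p, r, f = precision_recall_f(
        ["m%d" % i for i in range(10)], ["m1", "m2", "m3", "mA", "mB"])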

Although results for effective queries will ideally have both high precision and high recall, and thus a high F measure, it is unlikely that a single query will be capable of capturing both. Individual queries will typically be able to capture either high recall (by returning many results) or high precision (by returning few, but very relevant, results), because the measures of precision and recall are inversely related [MRS08].

6.2.1 Impact of SWUM on Software Search

In this study, we investigate research questions to evaluate SWUM's impact on software search effectiveness based on semantic roles, head distance, and location. Specifically, in terms of search effectiveness:

RQ 1: How does SWUM compare to bag of words (BOW) information retrieval techniques?
RQ 2: What effect does location have?
RQ 3: What effect do semantic roles have?

Because head distance is tightly coupled with SWUM information, we did not evaluate its contribution outside of SWUM's effect. To answer these research questions, we compare swum with a number of variants, including BOW-based scoring functions.

6.2.1.1 Design

Independent Variable

The independent variable in this study is the search technique. We compare 6 different search techniques:

• swum: the SWUM-based scoring function presented in Section 6.1.1, which uses both signature and body locations, all semantic role information, and head distance.
• sig swum: the SWUM-based scoring function presented in Section 6.1.2 that only uses information from the signature location, but includes all semantic role information.
• sig at: a refinement of sig swum that only uses the action and theme semantic roles.
• body swum: a SWUM-based scoring function applied to all the identifiers in a method's body. For each query word, body swum finds the identifier with the maximum sig swum score. The contributions of each query word are summed and normalized by the total number of query words.
• sig bow: a BOW-based scoring function applied only to the signature location. To better differentiate between search results, we count the number of times each query word appears in the signature, called term frequency (tf). We then take the natural log of the term frequency multiplied by its IDF, summed over all the query words.
• body bow: the same BOW-based scoring function as sig bow, but with information from the entire body, including the comments and signature.

Each scoring function can return a different number of results. To keep the search results comparable, we select the top 10 ranked methods as search results for each technique, similar to previous software search studies [SFH+07]. The results for each technique are sorted by score, with ties broken alphabetically by default. Usage information, as captured by IDF, is a constant across all search techniques.

Dependent Variable

Because we are only using the top 10 results for each search technique, two techniques that return the same number of relevant results will have exactly the same precision and recall. However, one technique might rank the relevant results in positions 1, 2, and 3, while the other might rank them at 6, 7, and 8. To differentiate between these two result sets, we introduce an additional measure: area under the ROC curve (AUC) [MRS08]. If the results are plotted on a graph with the number of results on the x-axis and the number of relevant results on the y-axis, the AUC is the area under the curve of this graph. To keep AUC comparable, we normalize it by the maximum possible area under the curve. An AUC value of 1 indicates every result is relevant, whereas an AUC value of 0 indicates no relevant results were returned. (A small illustrative sketch of this computation appears at the end of this design discussion.)

Subject Queries

One of the most influential factors in search effectiveness is the query. For our study of search techniques to be realistic, we need queries generated by actual newcomers that can be held constant across all search techniques. Recall that in Chapter 4, we collected queries generated by human subjects using our query reformulation tool. In this study, we use those queries to search for concerns. To be as fair as possible, we selected the top 3 performing queries for each concern in terms of precision, recall, and F measure in that study, based on words appearing in method signatures. If a query contained only one word, it was discarded. Because swum is almost identical to sig bow for single word queries, single word queries would not further our goal of investigating SWUM's effectiveness. Only 2 of the 8 concerns had more than one query after discarding single word queries, leaving us with a total of 10 queries spread over 8 concerns in the study.

Threats to Validity

The concerns and queries used in the study are a potential threat to validity. We have tried to minimize this threat by including concerns that were mapped by independent researchers and used in a previous study [SFH+07]. Further, we have tried to make the study as realistic as possible by using queries generated by newcomers [HPVS09]. Because search effectiveness is so dependent on the query and how well that query discriminates between results for a particular concern, it is possible these results will not generalize to all other concern-query combinations. Since the study is based on Java, the results from this study may not generalize to other programming languages. However, we expect these results to be relatively consistent for other object-oriented languages. In this study, we have only observed one BOW approach, which is one of the most common BOW-based information retrieval measures. However, more sophisticated BOW approaches exist that can lead to improved results for traditional retrieval of natural language documents. We have observed that for specific query-concern combinations, the more advanced techniques can produce worse results than the common techniques. Based on our investigations into these more advanced techniques, the limiting factor seems to be the use of BOW information. These results are further supported by researchers who have observed that the information retrieval method does not significantly influence retrieval accuracy in software [CDLO+09]. However, it is possible that the results observed for our BOW technique do not generalize to all BOW techniques.
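As promised under Dependent Variable above, the following sketch (our own illustration, not the study's implementation) shows one way to compute the normalized AUC for a ranked result set, where relevance is a list of 0/1 flags for the top 10 results in rank order:

    def normalized_auc(relevance):
        """Area under the (number of results vs. number of relevant results) curve,
        normalized by the maximum possible area (every result relevant)."""
        area, hits = 0.0, 0
        for flag in relevance:
            hits += flag
            area += hits                   # cumulative relevant count after each result
        max_area = sum(range(1, len(relevance) + 1))
        return area / max_area if max_area else 0.0

    # All results relevant -> 1.0; no relevant results -> 0.0;
    # relevant results ranked early score higher than the same results ranked late.
    assert normalized_auc([1, 1, 1]) == 1.0
    assert normalized_auc([0, 0, 0]) == 0.0

Under this formulation, a technique that ranks its relevant results at positions 1, 2, and 3 receives a higher AUC than one that ranks them at 6, 7, and 8, even though their precision and recall are identical.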

6.2.1.2 Results and Discussion

Figure 6.1 shows box and whisker plots capturing the overall AUC results for the 6 search techniques across all 10 queries. The shaded box represents the inner 50% of the data, the middle line represents the median, the plus represents the mean, and outliers are represented by an '×'. Although there are no statistically significant differences in the data set, the data show a number of interesting trends. In terms of location, for both SWUM and BOW it appears that using signature information is more effective than using information from the body. This observation is especially clear for the BOW scores. Comparing across locations, it appears that sig swum and sig bow produce similar

results for this data set. Although body swum and body bow have very similar results overall, body swum is more consistent. In terms of semantic roles, it appears that using information beyond action and theme yields more consistently competitive results. Judiciously combining location information in swum appears to produce the best results.

Figure 6.1: AUC Results for 6 SWUM and BOW variants.

To further investigate these observations, we plotted the AUC values for each query, shown in Figures 6.2 and 6.3. The results are sorted by swum's AUC, while keeping different queries for the same concern adjacent.

Semantic Roles

Figure 6.2 plots the AUC values by query for swum, sig swum, and sig at, sorted by swum's AUC. As expected from techniques which are only subtly different, the results for all three techniques are very similar for the first five queries. The two queries that show the most differences are for the jBidWatcher application: "save auctions" and "prepare snipe". The "save auctions" query demonstrates differences between all three techniques. Because sig at only looks for query words in the action and theme, many

of the results containing one query word, "save", are ranked the same. This ends up lowering the rank of results that contain the second query word, "auction", in other semantic roles. So while the action and theme are important for determining result relevance, utilizing the full information found in the signature helps further differentiate between relevant and irrelevant results.

Figure 6.2: Comparing signature-based SWUM search variants by query.

Another way to differentiate between relevant and irrelevant results is to use some information from the body, as demonstrated by both jBidWatcher queries. For save auctions, query words in the body broke ties between relevant and irrelevant methods with the same signature information. For example, relevant method DoSave contains only one query word in the signature, and was ranked equally with 9 other irrelevant method signatures like SearchManager.saveSearches. Since DoSave contained an identifier, string literal, and a comment containing both query terms,

it was more highly ranked by swum than the other irrelevant methods, whose bodies only contained the query word "save". Similarly, the body information in the "prepare snipe" example helped differentiate between single query word occurrences in the signature that were not supported by words in the body.

There is one concern for which body information appears to have a slight detrimental effect for swum. "Play file" is the most effective query for all three techniques. For sig swum, at rank 8 there is a sequence of 8 results that are all tied. In sig swum, this tie is broken by an alphabetic sort, which happens to rank the only relevant result in the group, Type.getPlayerImpl(), the highest. In contrast, swum uses body information and mistakenly ranks 5 irrelevant results before the relevant one. If another ordering were used to break ties between sig swum results, both techniques would be equivalent. However, since the top 7 results are all relevant for the three techniques, this is still a very positive example overall.

Figure 6.3: Comparing information location of SWUM and BOW approaches by query.

Location

Figure 6.3 plots the AUC values for swum, sig bow, body swum, and body bow. Results for sig swum are similar to those of swum, and are not shown to preserve readability. In general, the signature location illustrated by swum and sig bow is better than the body location. Signature outperforms body for 5 of the queries for 3 concerns and is approximately equivalent for 3 other concerns. However, there are two concerns for which body information is much more effective than signature information: "report compiler" and "find shows".

For "find shows", the results for body bow and body swum contain just one more relevant result than swum. The relevant result found by both locations is the method findShows. By analyzing body information, body bow and body swum find one more relevant result: findShows's caller reload. Since reload is relevant for calling findShows, signature-based search information paired with an exploration or navigation technique would perform equivalently.

The "report compiler" concern is similar in that additional relevant callers are found using body information. However, there is another reason why swum performs poorly on this concern. When SWUM identifies a method theme, the least accurate rule is when the theme is the class. In our training sample, we overcame this source of inaccuracy by reducing β_theme to β_auxArgs unless the name (i.e., the action) contains a query word. Unfortunately, this rule did not generalize to our evaluation set. Discarding this rule leads swum to have an AUC of 0.52 for "report compiler", with the top 3 results being relevant. However, discarding this rule also has the side effect of slightly reducing swum's effectiveness on "prepare snipe" and "search result". More work is necessary to examine these tradeoffs. One way to solve this problem would be for SWUM to include rule accuracy information, and to study the effects of this rule on a larger data set of concerns.


Although body information can be more accurate for some concerns and queries, in general, body information is very inconsistent and performs more poorly than signature information. Perhaps further investigation of additional concerns would yield insights as to when body information is important to improve search effectiveness. The other option is to leverage the precision of signature information and improve its recall with an exploration or navigation technique.

SWUM vs BOW

As can be seen in Figure 6.3, swum information tends to outperform bow information regardless of location. For instance, swum outperforms sig bow for 6 of the queries on 5 of the concerns and is equivalent for 3 other concerns. The only query where sig bow outperforms swum is "text field". Although both techniques found the same two relevant results, sig bow correctly identified an additional relevant element: the constructor TextFieldReportElement. Since the query words "text" and "field" were not in the head position, TextFieldReportElement was ranked lower than irrelevant results which contained "text" or "field" in the head position. Perhaps further refinements to the head distance would improve results and enable the head score to potentially ignore multiple head words.

Summary

In summary, the phrasal concepts used in swum outperform BOW for most concerns in this study. It appears that location information can be critical for search effectiveness, but whether that location is the signature or the body is concern dependent. However, body information is less consistently helpful than signature information. Finally, additional semantic role information improves search effectiveness for some concerns.

6.2.2 Comparison With State of the Art

In this study, we compare swum with existing state of the art search tools.

These tools may differ from swum in the types of information, relevance threshold, and query mechanism they use. These variables led us to the following specific research questions:

RQ 1: How does swum compare to existing state of the art search tools?
RQ 2: What effect does the query mechanism have on search effectiveness?
RQ 3: What effect does threshold have on search effectiveness?

6.2.2.1 Design

Independent Variable

The independent variable in this study is the search technique. We compare 5 search techniques: ELex, GES, FindConcept, and two variants of swum.

ELex: ELex is the regular expression, or lexical, searching functionality provided by the Eclipse IDE [IBM10]. Similar to UNIX grep, ELex allows users to search using a regular expression query over the source code files of a given project, returning an unranked list of files which contain statements matching the query. Because ELex uses a boolean relevance function (0 for irrelevant, 1 for relevant), we do not apply a threshold to ELex's results.

GES: GES, or Google Eclipse Search, integrates Google Desktop Search into the Eclipse workbench. GES allows users to search Java files with natural language queries and returns a set of files ranked by relevance to the query. We altered GES slightly to return individual methods instead of entire files, to keep the results comparable. Although Google's exact search algorithm is proprietary, the authors of GES claim that Google Desktop provides an accurate search and does not suffer from inefficient queries or inefficient re-indexing of files during evolution, an advantage over their previous prototypes [PPM+06]. To compare GES with the other techniques, we select the top 10 results, ranked by the number of query words they contain.

FindConcept: Similar to swum, FindConcept searches software using both natural language and program structure information [SFH+07]. FindConcept searches in method signatures and comments for verb and direct object pairs that match the query. FindConcept uses a unique query mechanism that requires the user to enter a verb and a direct object. FindConcept encourages the user to further refine this initial query by selecting additional synonyms or co-occurring words. Results are ordered by the amount of structural connectedness to other results in the set, and the top 10 ranked results are selected.

SWUM10: SWUM10 is the same swum scoring function evaluated in Section 6.2.1, using the top 10 ranked results. Recall that swum is the SWUM-based scoring function presented in Section 6.1.1, which uses both signature and body locations and all semantic role information. Like GES, SWUM uses natural language queries.

SWUMT: SWUMT uses the same swum scoring function as SWUM10, but with a more sophisticated threshold. For many concerns, the swum scores for the top few relevant results are significantly greater than those of the remaining irrelevant results in the top 10. In addition, some concerns contain 10 or more methods that are relevant and highly scored. To account for these factors, we use the average of the top 20 scores as the threshold: a method is included only if its swum score is greater than that average. If the top few results score much higher than the rest, the cutoff falls before rank 10 and fewer than 10 results are returned; if the scores are evenly distributed, roughly 10 results are returned; and if the scores stay high beyond rank 10 and then suddenly drop, more than 10 results are returned.
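A minimal sketch of the SWUMT cutoff just described (the helper name is ours, not part of any tool's API): results are ranked by swum score, and a method is kept only if its score exceeds the mean of the top 20 scores.

    def swumt_cutoff(scored_methods, k=20):
        """scored_methods: list of (method, swum_score) pairs.
        Returns the methods whose score exceeds the average of the top-k scores."""
        ranked = sorted(scored_methods, key=lambda ms: ms[1], reverse=True)
        top_scores = [score for _, score in ranked[:k]]
        threshold = sum(top_scores) / len(top_scores) if top_scores else 0.0
        return [(m, s) for m, s in ranked if s > threshold]

If a few scores dominate, few methods clear the mean of the top 20; if scores stay high past rank 10 and then drop, more than 10 do, which matches the behavior described above.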

Subject Queries

The results for ELex, GES, and FindConcept come from a previous concern location study [SFH+07]. In that study, 18 human subjects were asked to locate 9 concerns. Each participant used each tool to search for 3 of the concerns, resulting in 6 replications of each concern-tool combination. One of the concerns, "add auction", had to be discarded from the study because it was used in swum's training set, leaving us with 8 concerns. Because we are interested in investigating the differences between the techniques, and not how well the human subjects could use the tools, we selected a subset of the queries to be used in this evaluation. Based on the results, we selected at most three queries for each technique, per concern: the query that had the top precision, the top recall, and the top F measure. We followed a similar procedure for SWUM10 and SWUMT, using the results of the previous study in Section 6.2.1. Thus, the results presented in this study represent each search technique at its best, given a newcomer user.

Threats to Validity

As in Section 6.2.1, the concerns and queries are a potential threat to validity. We tried to ensure a fair comparison across search techniques by only using the queries for each concern that showed the search technique at its best. This helped control for unavoidable variables between the subjects using the techniques, since the queries for GES, ELex, and FindConcept were generated by a different set of subjects than for SWUM10 and SWUMT. However, this has the drawback of potentially overinflating the results for a technique. In some cases, the inflated results may be due to an exceptional subject, rather than any feature of the technique. We have tried to clearly note these potential situations in the results. As in Section 6.2.1, the results may not generalize beyond Java, although we expect the results to be consistent across other object-oriented languages.

6.2.2.2 Results

Figures 6.4, 6.5, and 6.6 show box and whisker plots capturing the overall F Measure (F), Precision (P), and Recall (R) results for the 5 search techniques. The shaded box represents the inner 50% of the data, the middle line represents the median, the plus represents the mean, and outliers are represented by an '×'. Although there are no statistically significant differences in the data set, the plots show a number of interesting trends. Based on the F Measure, ELex is inferior to the other search techniques. The SWUM-based techniques, SWUM10 and SWUMT, appear to be more consistently effective than FindConcept or GES. These results are confirmed by the precision and recall results in Figures 6.5 and 6.6.

Figure 6.4: F Measure results for state of the art search techniques.

In terms of precision, SWUMT is a clear front-runner closely followed by FindConcept. For recall, SWUM10, SWUMT, and GES appear to have similar results. It is not surprising that ELex, the technique with the worst precision, also has the best recall. For most queries in this study, ELex typically returns too many results. This ensures ELex finds many relevant results, but too many irrelevant ones. Our observations independently confirm earlier results that used tools similar to ELex for feature location [AG06]. In Figure 6.6, SWUM10 and SWUMT begin to approach ELex's high recall, without sacrificing precision. Because ELex's poor performance is largely due to its large result sets, we focus on the more competitive search techniques in the remaining results and discussion.

Figure 6.5: Precision Results for state of the art search techniques.

Figures 6.7, 6.8, and 6.9 plot the F, P, and R results by concern. As can be seen in Figure 6.7, SWUM outperforms the other search techniques for 3 concerns, GES outperforms the others for 3 concerns, FindConcept is the best for 1 concern, and they all tie for 1. Here, we analyze the cases when different techniques outperformed the others.

SWUM > FindConcept, GES

In terms of the F Measure results in Figure 6.7, SWUMT and SWUM10 outperform the other state of the art approaches for 3 of the 8 concerns. For "search for songs", SWUMT and SWUM10 have higher recall than FindConcept and GES. GES performed poorly because only 1 GES user was able to identify a query with any relevant results, and that query only returned 1 of the 5 relevant results. FindConcept

had difficulty identifying relevant results because the main query word, "search", did not occur in the verb or direct object position. In this example, SWUM's semantic roles increased effectiveness. SWUMT outperformed SWUM10 because SWUMT had higher precision.

Figure 6.6: Recall Results for state of the art search techniques.

For "play track", SWUMT outperformed SWUM10 and FindConcept, and all three phrase-based searches outperformed GES. This is a concern where phrase-based search excels. The key query word, "play", is extremely common and occurs in over 500 methods. In contrast, FindConcept's restricted search mechanism located just 34 methods using play as a verb. Although all phrase-based techniques had the same recall, SWUMT won out overall by having higher precision.

Finally, the phrase-based query reformulation technique employed by SWUMT and SWUM10 was the only technique used to find a working query for "gather music files".

Because only one subject out of 27 was able to construct a successful query, it is unclear whether this result is due to the effectiveness of the query reformulation technique or the subject's inventiveness in formulating queries. The subject who came up with the query, "gather directory", was the only subject that thought of "directory" as a relevant related word to "file" when browsing music options.

Figure 6.7: F Measure results by Concern.

GES > SWUM, FindConcept

Because GES takes advantage of more body information, it outperformed the phrase-based techniques in 3 of the concerns. Similar to Section 6.2.1, the phrase-based techniques performed poorly on "compile report", and GES was able to leverage body information into victory. For "load movie listings", GES was able to take advantage of occurrences of query words in string literals that FindConcept missed and SWUM did not highly rank. This was also true of "save auctions", where GES was able to find one more relevant result than the other techniques.

However, this last result may be due to an exceptional subject. In prior analysis of these results using all the queries, rather than just the most successful ones, FindConcept was actually found to outperform GES [SFH+07].

Figure 6.8: Precision results by Concern.

FindConcept > SWUM, GES

There was one concern where FindConcept outperformed the other search techniques. For "add text field", FindConcept's restrictive rule matching led to increased precision. Although "text" and "field" occur frequently in the program, the words rarely appear together in the object position. This led to improved precision for FindConcept. Despite SWUMT's and SWUM10's equivalent recall, FindConcept outperformed all the other techniques due to improved precision. However, it should be noted that for the "search for songs" and "load movie listings" concerns, FindConcept's restrictive rules hurt effectiveness.

Figure 6.9: Recall results by Concern.

All Equivalent

For "set snipe", no one search technique outperformed the rest. This is a concern where the subjects clearly struggled to formulate a discriminating query. The most common query, "snipe", returns too many results. The most successful queries, "prepare snipe" and "set snipe", had good precision for the phrase-based approaches, but poor recall. This is a concern where the main query word "snipe" is extremely prevalent in the system, and many semantically related words are used in the relevant results.

6.2.2.3 Discussion

Overall, SWUMT is a very competitive search technique when the query words match relevant signatures. However, when body information is important to locating the concern, GES is the best state of the art technique in this study. Although GES outperformed SWUMT, SWUM10, and FindConcept for some of the

concerns, its performance in general seems to be unpredictable. When GES did not have the best performance, it tended to be little better, and sometimes even worse, than ELex. In contrast, even though SWUMT did not always have the best results, it was usually competitive. To investigate this observation, we ranked the approaches from 1–5 based on their maximum F Measure score for each concern, giving ties the same rank. Using this measure, SWUMT is the most highly ranked technique with an average rank of 2.38 and a standard deviation (std) of 1.18. GES has an average of 2.75 (std 1.19), SWUM10 an average of 2.88 (std 1.64), FindConcept an average of 3.00 (std 0.93), and ELex an average of 3.50 (std 1.41). From these results we can see that SWUMT and GES are the best overall techniques in this study, but that SWUMT is consistently ranked more highly overall.

Search Mechanism

The most competitive search techniques were SWUMT, GES, and FindConcept. In general, FindConcept's limited rules improved precision, but hurt recall. GES had mixed results. When a user was able to construct a successful query, the body information GES used led to high recall, but at other times, reduced precision. SWUMT provided a signature-based counterpoint to GES's body information, alternating between high precision and high recall as well. SWUMT has an advantage over FindConcept in that it searches with semantic roles beyond verbs and direct objects.

Query Mechanism

Both GES and SWUMT use natural language queries, whereas FindConcept and ELex have stricter query mechanisms. ELex's regular expression queries were difficult for newcomers to construct, often returning too many or too few results [SFH+07]. FindConcept's query recommendation system helped newcomers

find more accurate queries, although FindConcept's strict verb-direct object matching requirement sometimes hurt recall. SWUMT's phrase-based query refinement also helped newcomers formulate queries for challenging concerns. Users tended to have mixed success when constructing queries with GES, although this result is largely dependent on GES's search mechanism.

Threshold

In terms of threshold, SWUMT outperformed SWUM10 in 6 out of the 8 concerns. For "prepare snipe", SWUMT cut off the results too high, whereas for "gather directory" the threshold was too low. Both result sets had large differences in score magnitude within the top 20 results. For SWUMT and FindConcept, both techniques had equivalent recall for two concerns where precision made the difference. FindConcept had higher precision for "add text field", whereas SWUMT had higher precision for "play track". Perhaps with improved head distance handling, SWUMT could approach FindConcept in terms of precision for concerns like "add text field".

6.3 Related Work

Existing approaches to textually searching software typically use regular expressions, bag of words information, or phrase information. Regular expression tools, like UNIX grep, are often built in to a developer's development environment. Regular expression queries can be effective if the developer has prior knowledge of the system's structure and naming conventions, but can be difficult for newcomers to construct. In contrast, bag of words (BOW) information allows newcomers to search using more intuitive natural language queries. Existing BOW approaches employ a variety of information retrieval mechanisms, including Latent Semantic Indexing (LSI) [MSRM04], Latent Dirichlet Allocation [LKE08], Independent Component Analysis (ICA) [GCS08], and Google Desktop Search [PPM+06]. BOW techniques can use all the information found in source code, but can also suffer from poor precision as a result. Finally, FindConcept [SFH+07] and SWUM use phrase-based information. Phrase-based techniques utilize the semantics of words within phrase structures. FindConcept uses a strict query mechanism, whereas SWUM uses a more intuitive natural language query interface. Where FindConcept restricts matches to just verbs and their objects, SWUM utilizes all the information found in a signature to help differentiate between relevant and irrelevant results. Both techniques rely heavily on signature information. How to integrate body information with phrase-based signature information remains an open problem. Chapter 2 presents alternative approaches to concern location outside of search.

Other techniques have taken advantage of semantic role, location, and head distance information. As previously mentioned, the FindConcept search technique extracts semantic role information for verbs and direct objects (similar to actions and themes) and indexes them for search. We previously developed a program exploration technique that more highly weighted signature information than information located in the body [HPVS07]. In addition, head information has been used to indicate similarity between identifiers when identifying refactoring candidates [BDL09].

6.4 Conclusion

In this chapter, we presented a novel scoring function for software based on SWUM's phrasal concepts. Our scoring function, swum, weighted query words based on their location, semantic role, head distance, and usage information. We evaluated swum by comparing it with a number of SWUM and BOW variants, as well as with competing state of the art techniques that employ different query and search mechanisms. Our results show that (1) swum's expanded semantic role information yields improvements over FindConcept and over using only the action and

theme roles, (2) body information is crucial for some concerns, but detrimental for others, (3) overall, swum is the most consistent search technique. In the future, we plan to refine swum’s use of body information and endeavor to recognize when information from the body location is critical to search effectiveness.


Chapter 7 SEARCHING AND EXPLORING THE SOURCE CODE

In the previous chapter, we presented a SWUM-based scoring function to search software and demonstrated its usefulness in locating some of the program elements related to a concern. The next step is to locate and understand the remaining concern elements. We believe recursive program exploration approaches, in conjunction with a search technique, provide a powerful concern location combination. When searching the entire software, the goal is high precision with reasonable recall. That is, a good search tool does not return too many irrelevant results, but does return some of high relevance. To improve precision, we employed a SWUM-based scoring function. In contrast, because an exploration tool is searching locally in the program structure, poor precision is less of a problem and we can focus on achieving high recall (with reasonable precision). That is, a good exploration technique finds the remaining relevant results, without returning too many irrelevant results. In this chapter, we present our approach to program exploration which uses a greedy bag of words scoring function to achieve high recall with reasonable precision. We evaluate our approach by comparing it with competing state of the art techniques. In addition, we present a case study evaluating the utility of combining search and exploration for concern location.


Figure 7.1: Program elements and structural dependences for the "add auction" concern from Chapter 2.

7.1 Motivation

Despite evidence that successful programmers use program structure as well

as identifier names to explore software [RC04], most existing program exploration techniques use either structural or textual information. By using only one type of information, automated tools ignore valuable clues about a developer's intentions [Big89], clues critical to the human program comprehension process. By utilizing textual as well as structural program information, we can create automatic program exploration tools that mirror how humans attempt to understand code.

In Chapter 2, we presented an example concern, "add auction", to illustrate the challenges in searching and exploring source code. We now employ that same example to highlight why combining textual and structural information is so important for recursive exploration tools. Recall that jBidWatcher is an auction bidding, sniping, and tracking tool for online auction sites such as eBay or Yahoo. jBidWatcher includes a unique and powerful sniping feature which allows the user to place a bid in the closing seconds of the auction. Before a user can bid on an auction, they must add the auction to the user view and data structures. A bug occurs when a user selects the "add auction" button and there is no response from the application.

Figure 7.1 shows the code related to the add auction concern. Oval nodes represent methods while boxes represent fields. The nodes with solid lines represent the core aspects of the concern, likely to be useful to anyone understanding the concern. Dotted nodes represent selected calling context information, while dashed nodes represent lower-level implementation details. For example, the two egg-shaped methods, Map.put and Set.add, represent Java library methods.

In the top left of Figure 7.1 are the methods that handle the add auction event triggered by user input: DoAction, DoAdd, and DoPasteFromClipboard. DoAdd prompts the user to enter an auction number, and then stores this information in


a system-wide queue. DoPasteFromClipboard adds an auction via the system's copy and paste functionality, and also adds the "add auction" event to the queue. The field ADD_AUCTION is a constant used to represent the "add auction" action within the queue. After the "add auction" event is added to the event queue, another group of methods is responsible for actually adding the auction to the user view and data structures. JBidMouse.messageAction takes action events from the queue and directs them to the appropriate handler methods, in this case, cmdAddAuction. This method is the start of a chain of method calls, culminating in the auction being added to the internal data structures. Its immediate caller, JBidMouse.addAuction, consists of two steps: creating the new auction entry and loading its pertinent information from the internet, and then adding the newly created entry to the system's data structures by calling AuctionManager.addEntry.

Through this example, we show how recursive exploration tools can benefit from using textual information in addition to structure:

Automatically eliminate breadth choices

Finding the methods that handle the "add auction" event by looking at the callees of DoAction is no trivial task. DoAction calls 38 methods within the same class that handle various user actions such as bidding, searching, or adding an auction. For the "add auction" event, only two of the 38 methods are relevant: DoAdd and DoPasteFromClipboard. Although method signatures are usually a good indication of relevance, in this example, a developer skimming the callees might miss the relevant method DoPasteFromClipboard. Thus, it is important for techniques to analyze the signature as well as the body source code of each method to include non-obvious relevant methods while still eliminating irrelevant call edges.


Automatically expand deep call chains

Looking at the call chain beginning with cmdAddAuction, there are 17 relevant methods and two call chains that are 8 methods deep. The few unrelated callers into the call chain are easy to eliminate using textual information from the names.

Find relevant elements missed by regular expression searches

The most effective simple search for this concern is the regular expression 'add.*auction'. Using Eclipse's [IBM10] simple textual search on methods rather than files, the query matches 50 methods, 11 of which are relevant. Within the top ten results, ranked by number of matches, just 3 are relevant: cmdAddAuction, AuctionsManager.addEntry, and AuctionServer.addAuction. Plus, 9 relevant methods are completely missed by the regular expression query.

Recursive exploration

Lastly, the jBidWatcher example in Figure 7.1 illustrates how using textual information to prune irrelevant call edges enables recursive exploration. Most existing techniques for call graph navigation either display the entire call graph [BD06, DL05] or explore a single edge at a time [SKM06]. Another navigation technique is Eclipse's frequently-used call hierarchy feature [IBM10], which allows developers to recursively view either all descendants or all ancestors of a method. These approaches lack the ability to recursively recommend relevant callers and callees. This capability would allow developers to view sibling calling relationships in addition to ancestor and descendant relationships. For example, starting from the Auctions.addEntry method, it is impossible to realize that Auctions.addEntry and AuctionEntry.AuctionEntry share the caller JBidMouse.addAuction in Eclipse's call hierarchy without changing to a different


starting point. However, by using textual information to prune irrelevant edges, exploration tools can automatically discover relationships such as shared callers with no additional user intervention.

In summary, combining both textual and structural information enables exploration tools to automatically prune irrelevant structural edges. By eliminating irrelevant edges, exploration tools can recursively search a structural program representation to provide the maintainer with a broad, high level view of the code relevant to a maintenance task, without including the entire program.

7.2 Dora the Program Explorer

Our automated exploration technique, Dora the Program Explorer (the name comes from exploradora, the Spanish word for a female explorer), takes as input a natural language query related to the maintenance task and a program structure representation to be explored. Dora then outputs a subset of the program structure relevant to the query, called a relevant neighborhood. Dora currently uses the call graph for program structure, and takes a seed method as a starting point. By recursively traversing call edges, Dora identifies the relevant neighborhood for this seed.

7.2.1 Automatically Identifying the Relevant Neighborhood

Dora's main contribution is its ability to automatically identify the relevant neighborhood by combining textual and structural information. Starting from a seed method m, Dora uses structural information by traversing structural call edges to find the set of callers and callees for m. The set of callers and callees become candidates for the relevant neighborhood. Next, Dora uses textual information by scoring each candidate's relevance to the query, which we call the method relevance score. Candidates scored higher than a given threshold, t1, are added to the relevant neighborhood. Candidates scored less than t1 but more than a threshold

Dora comes from exploradora, the Spanish word for a female explorer. 179

t2 are further explored to ensure they are not connected to more relevant methods. This use of two thresholds guards against missing very relevant methods that are connected by a borderline relevant method. Both thresholds are given a default value of t1 = 0.5 and t2 = 0.3, but are user specifiable. Finally, this exploration process is recursively repeated for each method added to the relevant neighborhood. In the remainder of this section, we describe in detail how Dora uses textual information to calculate the method relevance score. 7.2.1.1

Components of Dora’s Method Relevance Score

Term Frequency The principal component of our method relevance score is how frequently query terms appear in a method, also known as term frequency. Term frequency (tf ) is often used to determine document relevance in information retrieval (IR) [vR79]. The intuition is that the more frequently a word occurs, the more relevant the document, or method, is to the query. For example, in an addAuction method, the word ‘auction’ appears 25 times. In contrast, the word ‘sort’, a term irrelevant to adding an auction, appears only once. The drawback of term frequency is that uninformative terms appearing throughout the program can distract from less frequent, but relevant, terms. Intuitively, the more methods that include a term, the less a term discriminates between methods. To address this issue, the IR community commonly multiplies a term frequency by its inverse document frequency (idf ), called a tf-idf score [vR79]. The idf for a term t is calculated by dividing the total number of methods in a program by the number of methods that contain t, and taking the resulting number’s natural log. For example, consider the query ‘add auction’ from the auction sniping program jBidWatcher. Because the domain of jBidWatcher involves online auctions, the term ‘auction’ appears in 470 of the 1,812 methods in the program. In contrast, the word ‘add’ appears in only 261 methods. Therefore, occurrences of the 180

term ‘add’ are given a higher tf-idf score than occurrences of ‘auction’. Thus, more occurrences of the word ‘auction’ are required to get a tf-idf score as high as the less-used term ‘add’. Before counting the frequency of terms, we apply a simple preprocessing step to the query and methods. First, all the identifiers are split into terms based on nonalphabetic characters and camel case. For example, addAuction and add auction both become the terms ‘add’ and ‘auction’. Next, the terms are converted into lower case and stemmed. Stemming ensures that similar terms like ‘auction’ and ‘auctioned’ map to the same conceptual term of ‘auction.’ To count how frequently query terms appear in a method, we use a sum of the tf-idf scores for each query term appearing in the method. Method Features In addition to how frequently a term occurs, our score takes into account where the query terms appear in the method. We consider the method name to be the most important indicator of relevance. Because method names have higher visibility in a program than, say, local variables, programmers typically select very descriptive method names [LBS06]. We chose not to include other method signature information, such as the declaring class or package name, for the same reason that we use idf : classes and package names are shared by many system components and therefore are less able to differentiate between program elements. As demonstrated by our “add auction” example in Section 7.1, occasionally a method name alone does not indicate relevance to a maintenance task. Thus, we count the number of method statements containing a query term, multiplied by the term’s idf . These tf-idf scores are summed and then normalized by the method length. Because longer methods are more likely to contain more query term occurrences, dividing by the method length ensures our scores are not biased to longer methods. 181

In addition to where terms are located, we also take into consideration whether a method is binary, i.e., whether it is a library method with no source code present. Although library methods are rarely explored by developers, highly relevant library method calls can provide additional information to relevant nonlibrary methods. Therefore, we include highly relevant calls to library methods in the relevant neighborhood for context purposes. Thus, the method features that we consider for our scoring technique are name, statement, and binary. 7.2.1.2

Calculating Dora’s Method Relevance Score

Although we had some intuition into the features of a method that would be useful in determining relevance to the query, we were unsure how to weight them for the relevance score. To determine our weights, we applied logistic regression on a training set of methods. Logistic regression is a statistical technique to find the best fitting model for a binary dependent variable [All99]. Given a set of features, or independent variables, logistic regression outputs a set of feature weights β and an intercept value α that best predict the training data. Unlike linear regression methods, which output a linear model, logistic regression outputs an exponential model. Therefore, given a feature vector x of size k, weights β, and intercept α, we apply the following equation to calculate the probability p that a method is relevant to the query: p=

eα+β1 x1 +...+βk xk 1 + eα+β1 x1 +...+βk xk

This equation will always give a value between 0 and 1, making it ideal for calculating probabilities. For our training set, we used methods from nine concerns used in a previous concern location tool evaluation [SFH+ 07]. We included the methods in the concerns plus all methods one call edge away from any method in a concern. We

182

manually inspected each method and annotated them as either relevant or irrelevant. Although the manual annotations are necessarily subjective, we have tried to limit bias by combining the input of three Java programmers. After training the model, we define our method score (p): p=

e−0.5+−2.5∗bin+name+0.5∗statement 1 + e−0.5+−2.5∗bin+name+0.5∗statement

In training our model, we considered other possible method features such as comments. However, none of the other features we tried with the model added to its predictive power. We would have liked to consider other types of methods in addition to binary, such as public, private, abstract, etc., but had insufficient training data for these categories. Therefore, we focused the model on the simplest variables that best predicted relevance: binary, name, and statement. 7.2.2

Experimental Evaluation The purpose of our evaluation is two-fold:

1. To compare our integrated textual- and structural-based approach against a state-of-the-art structural-based approach 2. To demonstrate that our sophisticated textual scoring technique is an improvement over naive textual scoring techniques in identifying the relevant neighborhood 7.2.2.1

Experiment Design

Variables and Measures The independent variable in our study is the method scoring technique. The structural-based approach we compare against is Robillard’s structural topology approach, Suade [Rob05, WR07]. Suade uses the specificity and reinforcement that a

183

given method has upon another to recommend structurally relevant methods. Because we are evaluating the technique on singleton method sets, only the specificity of a program element is taken into account. We include two additional textual- and structural-based techniques in our study: boolean-AND (AN D) and boolean-OR (OR). These techniques are baselines used to evaluate Dora’s more sophisticated relevance score. These techniques output either 0 or 1: AN D outputs 1 if all query terms appear in the method; OR outputs 1 if any query term appears in the method. The dependent variable in our study is the effectiveness of each technique, measured in terms of the commonly used IR measures, precision and recall. Precision is the fraction of methods reported by the technique that are relevant, calculated by dividing the number of relevant methods reported by the total number of methods reported. Recall is the fraction of relevant methods reported, calculated by dividing the number of relevant methods reported by the total number of actual relevant methods. High precision implies a scoring technique returns few irrelevant methods, whereas high recall implies the scoring technique misses few relevant methods. Since ideal techniques have both high recall and high precision, the F measure is commonly used to combine both precision and recall into a single measure. The F measure is defined as the harmonic mean of precision and recall, and is high only when both precision and recall are high. Thus, a high F value can be interpreted as the best possible combination of precision and recall. We use the F measure to evaluate the performance of each scoring technique. Subjects To evaluate each technique, we use sets of methods from concerns as seeds. Thus, each subject in our study is a pair. To avoid investigator bias in determining our own concerns for evaluation, we selected 8 concerns from a recent study of the concept assignment problem on 4 open source Java 184

Crn. Query C3 Task progress complete C9 Update auction C10 Download thumbnail image C11 Execute auction bid C12 Delete auction C13 Toggle fold node C14 Zoom in out C16 Auto save file

Prog. Gantt JBid. JBid. JBid. JBid. Free. Free. Free.

∪ ∩3 32 2 22 3 19 6 13 2 24 6 21 6 25 2 6 2

∩2 12 11 11 7 11 14 14 5

∩2 ∪

38% 50% 58% 54% 46% 67% 56% 83%

Table 7.1: Concerns and queries used in evaluation, in terms of number of methods. Program Version Gantt 2.0.2 JBidWatcher 1.0pre6 Freemind 0.8.0

NCLOC 43,246 22,997 70,341

No. Classes 555 183 617

No. Methods 3,991 1,812 5,388

Table 7.2: Program Characteristics for concerns used in the evaluation. programs [RSH+ 07]. Methods were selected for each concern by 3 independent developers, with varying levels of agreement (overlap in methods selected). Details of the concerns are shown in Table 7.1. The column labeled ∪ is the union of methods selected by all three developers in the study, ∩3 is the number agreed upon by all three developers, ∩2 is the number at least two agreed upon, and the last column shows the percent developer agreement. Because conflicts of agreement could indicate poor quality method sets for a concern, we used only those concerns where at least two developers agreed on 35% or more of the relevant methods ( ∩∪2 in Table 7.1). We considered any method selected by at least two developers to be relevant to a concern. To put the concerns into context, the program characteristics for the concerns are presented in Table 7.2. The one component of our subject input missing from these concerns is the query. To avoid investigator bias, the queries were chosen by an independent researcher involved in the concept assignment study who had no

185

70

Dora Suade OR AND

60

Precision

50 40 30 20 10 0 0

10

20

30 40 Recall

50

60

70

Figure 7.2: Precision-Recall Graph for Exploration Techniques. knowledge of our scoring technique. The queries were selected by looking at the concern descriptions, searching the code, and a query expansion mechanism [SFH+ 07]. Thus, we evaluated each scoring technique on 8 pairs, containing a total of 160 seed methods. The median number of caller and callee edges per seed is 7, with a minimum of one edge and a maximum of 62. A total of 1885 call edges were evaluated by each technique, with overlap. Methodology Although the ideal evaluation would recursively explore from each seed method, incorrect identifications would propagate and unfairly skew the results. Therefore, our evaluation compares the effectiveness of each technique applied one edge away from a single seed method. For each method m in the set of evaluation concerns, we applied each scoring technique to all the callers and callees of m, and calculated

the precision and recall for m. We used Eclipse [IBM10] to generate the structural information used by the techniques. Both Suade and Dora output relevance scores ranging from 0 to 1. However, evaluating the techniques in terms of precision and recall requires a threshold to map these scores to 0 or 1. Because this threshold selection is a potential threat to validity, we evaluated the threshold performance on the training data set used for Dora. We evaluated the precision and recall at threshold levels varying from 0 to 1 at 0.005 intervals, and selected the threshold for each technique that maximized the mean F measure. We found that Suade performed best at a 0.3 threshold, and Dora at a threshold of t1 = 0.5 (because we are only scoring methods one edge from a seed, Dora's threshold t2 is unnecessary). Based on these thresholds, the results for Dora and Suade were partitioned into relevant (1) and irrelevant (0) scores before calculating precision and recall.

Figure 7.3: F Measure across exploration techniques.

7.2.2.2 Results

Overall, we found Dora to be the most successful technique, and structural-based Suade to be competitive with the naive textual- and structural-based OR. Of all the techniques, naive AND had the worst performance.

Precision, Recall, and Threshold Variation

Figure 7.2 shows a Precision-Recall graph for all four techniques. Suade and Dora were evaluated at thresholds ranging from 0 to 1 at 0.005 intervals (AND and OR require no threshold). Each point represents precision and recall averaged over a given threshold, with decreasing threshold values from left to right. (A high threshold means fewer methods are scored as relevant, and therefore recall is low.) For both Dora and Suade, the recall progressively increases as the threshold decreases, whereas the precision increases with the threshold and drops off again when the threshold gets too high. The peak in Dora's F measure performance at (R = 59, P = 43) occurs at the 0.5 threshold. The peak in Suade's performance at (48, 33) occurs at the 0.41 threshold. At (23, 23), AND performs similarly to Dora and Suade at higher than optimal thresholds, 0.99 and 0.64, respectively. At (60, 32), OR performs similarly to Dora and Suade, with less than peak precision for Dora, and slightly better than peak recall for Suade.

Figure 7.2 demonstrates how Dora outperforms Suade in terms of precision over many thresholds, and performs equivalently in terms of recall. Specifically, Dora outperforms Suade in precision for thresholds 0.05–0.98 (Dora) and 0.0–0.59 (Suade). Suade and Dora perform equivalently on the training set for all other thresholds.
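The threshold selection described in the Methodology (sweep cutoffs from 0 to 1 in 0.005 steps and keep the one that maximizes the mean F measure) can be sketched as follows; evaluate_mean_f is a hypothetical callback, not part of the tooling, that returns the mean F measure over the training seeds at a given cutoff:

    def best_threshold(evaluate_mean_f, step=0.005):
        """Returns the cutoff in [0, 1] that maximizes the mean F measure."""
        best_t, best_f = 0.0, -1.0
        steps = int(round(1.0 / step))
        for i in range(steps + 1):
            t = i * step
            f = evaluate_mean_f(t)
            if f > best_f:
                best_t, best_f = t, f
        return best_t, best_f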


Figure 7.4: F Measure for each concern by exploration technique.

Overall Performance

Figure 7.3 shows the overall performance of each technique. Each bar shows the distribution of F measures calculated for each seed method across all the concerns. The shaded box represents 50% of the data, from the 25th to 75th percentiles. The horizontal bar represents the median, and the plus represents the mean. Since each shaded box extends from 0, at least 25% of the 160 methods considered by each technique have 0% recall and precision. However, Dora achieves 100% precision and recall for 25% of the data—more than any other technique. Suade and OR appear to perform similarly to one another, although OR has a slightly higher mean F measure. To confirm these observations, we performed a Bonferroni mean separation test [3] at α = 0.05.

[2] Because we are only scoring methods one edge from a seed, Dora's threshold t2 is unnecessary.

[3] Before performing the contrasts, we applied the ANOVA F-Test to test that the means were significantly different (p < 0.0001). Next, we performed the Bonferroni mean separation test, which evaluates multiple mean comparisons and controls the experimentwise error rate. The error rate is controlled by adjusting the α for each pairwise comparison such that the experimentwise error rate of α = 0.05 is preserved.
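To illustrate the procedure in footnote [3], the Python fragment below runs an ANOVA F-test and then Bonferroni-adjusted pairwise comparisons. The per-concern F-measure lists are invented placeholders, and plain two-sample t-tests stand in for the actual mean separation contrasts used in the study.

# Sketch of the footnote's procedure: an overall ANOVA F-test followed by
# pairwise comparisons with a Bonferroni-adjusted per-comparison alpha.
# The F-measure samples below are placeholders, not values from the study.
from itertools import combinations
from scipy import stats

groups = {
    "AND":   [0.0, 10.0, 20.0, 5.0],
    "OR":    [30.0, 45.0, 60.0, 40.0],
    "Suade": [25.0, 40.0, 55.0, 35.0],
    "Dora":  [50.0, 65.0, 70.0, 60.0],
}

# Step 1: test whether the technique means differ at all.
f_stat, p_overall = stats.f_oneway(*groups.values())
print(f"ANOVA p = {p_overall:.4f}")

# Step 2: pairwise comparisons; dividing alpha by the number of comparisons
# keeps the experimentwise error rate at 0.05.
pairs = list(combinations(groups, 2))
alpha_per_comparison = 0.05 / len(pairs)
for a, b in pairs:
    t_stat, p = stats.ttest_ind(groups[a], groups[b])
    verdict = "significant" if p < alpha_per_comparison else "not significant"
    print(f"{a} vs {b}: p = {p:.4f} ({verdict})")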

Dora performs significantly better than structural-based Suade, with a mean difference of 12.9, although neither Dora nor Suade is significantly different from OR (with mean differences of 9.6 and 3.3, respectively). All the approaches outperform AND with statistical significance.

Performance by Concern

Figure 7.4 shows the performance for each technique, separately for each concern. Although each concern does not contain sufficient data to judge significant differences, the trends in Figure 7.4 follow the same pattern as the overall performance results in Figure 7.3, with AND < Suade ≤ OR < Dora. Concerns 9 and 12 are an exception to the general trend, with AND outperforming most techniques. For both of these concerns, AND had much higher precision than OR and Suade. This was because most of the relevant methods contained both query terms. Because AND performed so poorly for the remainder of the concerns, AND is clearly very sensitive to the query.

7.2.2.3 Threats to Validity

Because the selection of methods relevant to a concern is a necessarily subjective task, the correctness of the concerns used in the evaluation could be a threat to validity. The developers were unfamiliar with the programs and tasks assigned to them, and were advised to spend less than an hour selecting program elements relevant to the concerns. For this reason it is possible that the developers missed relevant methods or even included irrelevant ones, which could impact the precision and recall of the techniques evaluated in the experiment. However, we do not believe that the developers consistently selected methods incorrectly, and we tried to alleviate this threat by only using concerns with at least 35% agreement. Since all the techniques are subject to the same vulnerability, we do not feel this is a serious threat to the validity of the study.


Four of the evaluation concerns were from a program that was used during training, although with three different concerns. This could affect our results by potentially giving Dora an unfair advantage. However, a t-test revealed that, for each technique, there was no significant difference [4] in mean F measures between the concerns from completely new programs and the concerns from the program that contributed three of the nine training concerns.

We minimized threats to conclusion validity by carefully selecting a threshold for Dora and Suade. We evaluated precision and recall at various threshold levels on the same training set, and selected the optimal threshold for each technique to maximize the mean F measure. In addition, we have shown a precision-recall graph (Figure 7.2) that illustrates our conclusions independent of threshold selection. A second threat to conclusion validity is the application of Suade to singleton method sets, rather than partial concerns with multiple methods. Restricting the input to singleton method sets forces Suade to rely solely on its specificity scoring component, and perhaps to perform suboptimally. Because the previous Suade evaluation [Rob05] used such singleton method sets, we do not feel this is a serious threat to the conclusions of our study.

We minimized internal threats to validity by having each scoring technique operate on the same structure data. We minimized the potential for investigator bias by using concerns and queries selected by software engineers outside the investigators. Because we evaluated the techniques on 8 concerns from 3 open source Java programs, the results of this study may not generalize to all programs and combinations. In addition, any developer naming conventions could have affected the textual-based approaches.

[4] At the 5% significance level: AND p = 0.14, OR p = 0.53, Suade p = 0.77, Dora p = 0.39.


7.2.2.4 Discussion

In this study, we found that the integrated Dora approach outperformed the purely structural Suade approach, motivating further development of integrated structural and textual techniques. In addition, we found that not all integrated approaches outperformed a purely structural approach. For example, the naive textual scoring mechanism (OR) performed equivalently to structural-based Suade, and AND performed worse. Because AND requires all query terms to be present, it is very sensitive to the selected query terms. Thus, the success of a textual- and structural-based technique is highly dependent on the performance of the textual scoring technique. Of the three textual scoring techniques, Dora outperformed OR and AND. In Figure 7.3, Dora clearly outperforms OR and AND, although not with statistical significance over OR. However, Figure 7.4 shows that Dora performs as well as or better than OR in every concern.

7.2.3 Related Work in Program Exploration

7.2.3.1 Navigation-based Exploration Approaches

Navigation-based program exploration techniques help developers navigate structural dependencies. Unfortunately, most existing tools require the developer to initiate every exploration step and manually select every structure edge to be expanded [CR00, RM02, SKM06]. In contrast, Dora automatically explores highly relevant structure edges to save the developer time and effort by producing a relevant neighborhood. The most closely related exploration technique is Robillard's structural topology approach, Suade [Rob05, WR07], which automatically generates suggestions for program investigation based on a seed method set, or concern. The Suade approach uses structural relations between program elements (calling a method, being called by a method, accessing a field, and being accessed by a method) to evaluate the specificity and reinforcement that a given element has upon another.

Elements that have fewer structural relationships are considered more specific, and are therefore given a higher relevance score. Reinforcement increases the score of elements that have more structural connections to elements already in the concern (a simplified illustration appears at the end of this subsection). Our approach differs from Suade in that we use textual and structural information to explore a program, rather than structure alone. Because the two techniques use different information to score relevant program elements, it may be possible to create a hybrid approach that combines both techniques.

Another technique that automatically includes relevant structural edges is program slicing. Slicing techniques use program dependence relationships such as control and data dependence to extract the parts of a program that may affect a point of interest [SFB07, Tip95, XQZ+ 05]. The slices of the program can be used to aid program comprehension. Unfortunately, slices tend to be large and can be expensive to calculate. Reducing this expense makes slices more conservative, so they contain even more irrelevant information. To potentially reduce cost and improve relevance, our textual scoring mechanism could be applied to slicing techniques as an additional stopping criterion.
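Returning to Suade's heuristics, the following is a simplified, hypothetical Python sketch of specificity and reinforcement scoring; the actual formulas in [Rob05] are more elaborate, and the structural graph, seed concern, and averaging of the two scores below are illustrative assumptions only.

# Simplified illustration in the spirit of Suade's specificity and
# reinforcement scoring; not the actual algorithm from [Rob05].
# The call/access graph and the seed concern are hypothetical.
neighbors = {
    "saveAuctions": {"DoSave", "checkSnapshot", "internal_shutdown"},
    "DoSave": {"saveAuctions"},
    "checkSnapshot": {"saveAuctions", "loadAuctions", "purgeCache"},
    "internal_shutdown": {"saveAuctions", "closeWindows", "stopTimers"},
}
concern = {"saveAuctions"}

def score(element, concern, neighbors):
    related = neighbors.get(element, set())
    if not related:
        return 0.0
    # Fewer structural relationships => more specific => higher score.
    specificity = 1.0 / len(related)
    # More connections into the existing concern => more reinforcement.
    reinforcement = len(related & concern) / len(related)
    return (specificity + reinforcement) / 2.0

candidates = set().union(*(neighbors[m] for m in concern)) - concern
for c in sorted(candidates, key=lambda e: -score(e, concern, neighbors)):
    print(c, round(score(c, concern, neighbors), 2))

In this toy example, DoSave scores highest because it is both specific (it has a single structural relationship) and entirely connected to the existing concern.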

7.2.3.2 Software Architecture Recovery

Developers can also use architecture recovery techniques to understand a system [Big89, MNS01, PFGJ02]. During architecture recovery, developers use textual patterns in conjunction with source structure models to locate high-level concepts in code. The developer is responsible for articulating the key concepts of the maintenance task in a regular expression query that may also require specifications as to which code structures should be searched [MNS01, PFGJ02]. In contrast, Dora takes as input a simple natural-language-based query. By using natural language query terms rather than regular expressions, Dora can utilize more advanced information retrieval techniques such as stemming [Por80] and tf-idf [vR79].
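As a rough illustration of the IR machinery that natural language queries enable, the sketch below stems query and identifier words with a crude suffix stripper (standing in for a real Porter stemmer [Por80]) and ranks hypothetical methods by a simple tf-idf score [vR79]; none of this is the actual Dora implementation.

# Minimal illustration of stemming plus tf-idf weighting over identifier
# words; the crude suffix stripper stands in for a real Porter stemmer,
# and the "documents" are hypothetical bags of identifier words.
import math

def stem(word):
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

documents = {
    "saveAuctions":  "save auctions auction entry saved",
    "loadAuctions":  "load auctions from file",
    "checkSnapshot": "check snapshot of auction data",
}
corpus = {name: [stem(w) for w in text.lower().split()] for name, text in documents.items()}

def tfidf(term, doc_words, corpus):
    tf = doc_words.count(term) / len(doc_words)
    df = sum(1 for words in corpus.values() if term in words)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

query = [stem(w) for w in "saving auctions".lower().split()]
scores = {name: sum(tfidf(t, words, corpus) for t in query) for name, words in corpus.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))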

7.2.3.3 Program Structure Visualization

A number of approaches have been suggested for whole-program visualization and navigation [Bal04, BD06, DL05, JH06, SM95]. Some approaches have made efforts to restrict the information presented so as not to overwhelm the developer by adding zooming or fisheye viewing capabilities [BD06, JH06, SM95] or by aggregating dependencies to a higher level [Bal04]. By using Dora, we believe that these tools can be further improved by focusing the developer's attention on information that is likely to be relevant to a maintenance task.

7.2.3.4 Program Structure Representations

A variety of structural program models have been proposed to enhance program comprehension and facilitate maintenance tasks, such as system dependence graphs [HRB90], program slices [Tip95, XQZ+ 05], type hierarchies [DGC95], and call graphs [GDDC97]. Although our general strategy can be applied to any type of structural program model, we focus here on call graphs as an intuitive model that allows developers to quickly comprehend interactions between large sections of code. Call graphs are relatively inexpensive to calculate, and they have the advantage of representing even scattered code well because they are indifferent to class decompositions.

7.2.4 Summary

In this section, we presented a program exploration approach, Dora, that uses textual information and word location information to calculate a method relevance score. Starting from a seed method, Dora recursively traverses the call graph in search of methods relevant to the query. The experimental results demonstrate that an integrated textual- and structural-based approach is significantly more effective than a state-of-the-art structural program exploration technique, motivating further development of integrated structural and textual exploration techniques.
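The exploration loop summarized above can be pictured with the following simplified skeleton; the call graph, the purely lexical relevance function, and the fixed threshold are placeholders for illustration (Dora's actual relevance score uses term location and frequency information, and it follows callers as well as callees).

# Skeleton of recursive relevant-neighborhood exploration: starting from a
# seed, score each neighbor against the query and recurse on methods whose
# score clears a threshold. Graph and scoring are illustrative placeholders.
call_graph = {
    "doSave": ["saveAuctions", "showDialog"],
    "saveAuctions": ["writeEntry", "log"],
    "writeEntry": [],
    "showDialog": [],
    "log": [],
}

def relevance(method, query_words):
    # Placeholder lexical score; Dora's model is more sophisticated.
    name = method.lower()
    return sum(1 for w in query_words if w in name) / len(query_words)

def explore(seed, query_words, threshold=0.5, visited=None):
    visited = visited if visited is not None else set()
    visited.add(seed)
    neighborhood = {seed}
    for callee in call_graph.get(seed, []):   # callers omitted for brevity
        if callee in visited:
            continue
        if relevance(callee, query_words) >= threshold:
            neighborhood |= explore(callee, query_words, threshold, visited)
    return neighborhood

print(explore("doSave", ["save", "auctions"]))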

7.3 Investigative Study: Integrating Search and Exploration

After evaluating search and exploration separately, we investigated our intuition that a precise search tool, like SWUMT from Chapter 6, and a greedy exploration tool, like Dora, would yield the most effective concern location tool.

We investigate this claim by comparing the state of the art concern location techniques presented in Chapter 6 with techniques that add a second exploration step after search. When combining search and exploration, we expect to improve recall with a decrease in precision. The goal is to maximize the gain in recall while minimizing the reduction in precision. To capture the overall trade-offs between these inversely related measures, we measure the overall success of the techniques in terms of the F Measure, which considers precision and recall to be equally important. The seed search results represent an upper bound on precision, i.e., search and exploration will never have better precision than search alone. Similarly, the seed search results represent a lower bound on recall, meaning that search and exploration will find at least as many relevant results as search alone. In this study, we investigate whether the increase in recall is worth the decrease in precision when combining search and exploration.

Design

In this study, we compare two search and exploration variants with the 5 techniques compared in Chapter 6: ELex, a simple regular expression search; GES, a version of Google Desktop Search applied to source code; FindConcept (FC), which searches based on verb-direct object pairs; SWUM10, which uses a SWUM-based search score, swum, and returns the top 10 ranked search results as relevant; and SWUMT, which uses the same score as SWUM10 but only returns a search result as relevant if its score is greater than the average of the top 20 ranked results.
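As a small illustration of the SWUMT cutoff just described, the sketch below keeps only the results whose score exceeds the mean of the top 20 ranked results; the scored results are hypothetical, and the closing comment indicates where S+Dora and SH+Dora, described in the next paragraph, would diverge.

# Sketch of the SWUMT cutoff: rank results by their SWUM score and keep
# only those scoring above the mean of the top 20 ranked results.
# The (method, score) pairs are illustrative placeholders.
def swumt_filter(scored_results, top_n=20):
    ranked = sorted(scored_results, key=lambda pair: -pair[1])
    top = ranked[:top_n]
    cutoff = sum(score for _, score in top) / len(top)
    return [method for method, score in ranked if score > cutoff]

results = [("doSave", 0.92), ("saveAuctions", 0.88), ("log", 0.10), ("showDialog", 0.05)]
seeds = swumt_filter(results)

# S+Dora would explore one call edge away from every seed in `seeds`;
# SH+Dora explores only from the seeds a developer confirms as relevant.
print(seeds)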


Figure 7.5: F Measure results for state of the art search techniques and for exploring with Dora (S+Dora and SH+Dora).

The two search and exploration variants are SWUMT+Dora (S+Dora) and SWUMT+Human+Dora (SH+Dora). S+Dora starts from SWUMT's search results and uses Dora to explore one edge away. With SH+Dora, we assume a human views the search results and includes only relevant results as seeds for Dora's exploration process. Thus, SH+Dora starts from SWUMT's relevant search results and explores one edge away.

For this study, we used the same concerns and queries as in the search evaluation in Chapter 6. The set of concerns consists of 8 user-observable, action-oriented concerns from 4 programs ranging in size from 23 to 75 KLOC [SFH+ 07]. The concern implementations were derived by two human analysts who agreed on the relevant methods after executing and manually exploring the concerns [SFH+ 07]. To compare the techniques at their best, we selected at most 3 queries for each technique: a query with the best precision, the best recall, and the best F Measure.


Figure 7.6: Recall results for state of the art search techniques and for exploring with Dora (S+Dora and SH+Dora).

The same queries are used for search and exploration. Additional study design details are presented in Chapter 6.

Overall Results

Figures 7.5–7.7 show box and whisker plots capturing the overall F Measure (F), Precision (P), and Recall (R) results for the 5 search techniques and the two search and exploration techniques, S+Dora and SH+Dora. The shaded box represents the inner 50% of the data, the middle line represents the median, the plus represents the mean, and outliers are represented by an '×'. The F Measure results in Figure 7.5 show that the search and exploration techniques are more consistently effective than the state of the art search techniques alone. The median of S+Dora is almost as high, and SH+Dora is as high, as the top quartile of SWUMT's results. This implies that S+Dora and SH+Dora have as good results as the top 25% of SWUMT's results, but for 50% of the concerns rather than just 25%.


Figure 7.7: Precision results for state of the art search techniques and for exploring with Dora (S+Dora and SH+Dora).

In Figure 7.6 we see that using search and exploration improves recall to be competitive with the overly greedy ELex. In Figure 7.7, we see that combining search and exploration is competitive in terms of precision with the other state of the art search techniques. Since we are exploring from SWUMT, which has the highest overall precision, we do not expect search plus exploration to outperform SWUMT in terms of precision.

Observations

We performed a detailed analysis of Dora's exploration performance when combined with search. We found that S+Dora and SH+Dora perform well on 3 of the concerns, need additional query words for 2 of the concerns, add little beyond strong search results for 2 of the concerns, and need human intervention or better threshold selection to perform well on just 1 concern.

There were 3 concerns for which search and exploration yielded better results than search alone: "prepare snipe", "save auctions", and "text field". For "prepare snipe", exploration added 2 true positives with only 1 false positive, giving an overall precision value of 80% and recall of 33%. In "prepare snipe", the ideal exploration query is simply "snipe", but this single word query is too greedy for search. By using the query "prepare snipe", the query is a good discriminator for seed search and greedy enough for exploration. To further improve recall, additional related query terms such as "bid" and "amount" are necessary.

In "save auctions", exploration added 3 true positives, and 5 false positives for S+Dora and just 2 false positives for SH+Dora. Upon further investigation, these two false positives could be considered relevant for comprehension. One of the concern elements, saveAuctions, is called by three methods: DoSave, internal shutdown, and checkSnapshot. One of these callers, DoSave, is relevant according to the concern's description, which shows a user saving the auctions through the user interface. However, given the query "save auctions", the remaining two callers could provide relevant contextual information to aid comprehension of the saveAuctions method.

Exploring the "text field" concern also improved effectiveness beyond search. For "text field", exploring added 2 additional true positives, with just 2 false positives for S+Dora and only one false positive for SH+Dora. One false positive called 3 irrelevant seeds, while the other, setText, was a callee of a seed.

There were two concerns, "gather directory" and "find shows", for which the user-formulated queries were good discriminators for search, but poor queries for exploration. For the "gather directory" concern, additional synonyms and domain-related words are necessary to locate the remaining concern elements.


An ideal query for this concern would be "gather find MP3 files". Similarly, the remaining elements of "find shows" require the synonym "movie". An ideal query would be "find load movie show". Since search and exploration have competing goals, it is possible that search and exploration require different query expansion mechanisms. Another reason for this observation is that the queries used in this study were formulated by users in the query reformulation experiment in Chapter 4, and the users were restricted to single-phrase queries derived from words in program identifiers. For both of these concerns, the best queries for search and exploration required words from multiple identifiers. If the users were instead permitted to enter their own ideal query after exploring the word usage, perhaps they would have come up with more effective queries for exploration, rather than just seed search.

For two of the concerns, "play file" and "search result", the SWUMT search performed so well that exploration was unnecessary and only decreased precision. In the "play file" concern, SWUMT had 100% precision. Dora found 1 true positive, but added 6 false positives. Since the remaining relevant elements require additional query words, applying exploration reduced precision, thereby decreasing performance overall. Similarly, the search results for the "search result" concern returned all but one of the relevant elements in the concern. This particular element (getResu()) contained an abbreviation for a query term, and could be found by exploration if abbreviations are taken into account. Whether or not abbreviations are used, Dora added 3 false positives beyond SWUMT's initial results.

Lastly, there was one concern where additional human intervention or a more accurate search threshold was necessary to select seeds for exploration. For the "report compiler" concern, the query words were good search discriminators and good for exploration, but too greedy when applied to irrelevant seeds. SWUMT returns 6 seeds for this concern, only one of which is relevant. When exploring from all 6 seeds, Dora finds 3 additional true positives and 18 false positives.


If only the one relevant result is explored, Dora finds the 3 additional true positives, but with no additional false positives outside of the search results. This situation could also be mitigated with improved use of body information in search. Recall that the SWUM-based searches perform worst on "report compiler" because they do not take body information into account. If the seed search results or threshold selection for "report compiler" were improved, human intervention might not be required.

Discussion

Overall, search and exploration are more consistently effective in locating concerns than using search alone. As expected, combining exploration with search improved recall with modest decreases in precision for most of the concerns. When the developer is able to hand-pick the search result seeds to explore from, the results are even more effective.

As with search, we observed that query selection plays an important role in the effectiveness of exploration. However, the query needs are different between search and exploration. In search, queries need to be good discriminators and have high precision. In contrast, queries for exploration need to be greedier, typically need additional semantically related words such as synonyms, and should have high recall. It is possible for a single query to meet both of these needs, as with "prepare snipe", and it is feasible that the ideal queries for exploration, such as "gather find MP3 files", might also be good discriminators for a search technique like SWUMT. More investigation is needed to determine the effect of additional query words on search and exploration.

In addition, more investigation is needed to determine how to create greedier queries with more relevant words. One approach would be for the developer to use the contextual query reformulation technique presented in Chapter 4 to explore the word usage in the source code, but not restrict queries to be only phrases or subphrases found in the source code.


An alternative would be to leverage the synonym and co-occurring word suggestions of FindConcept's query reformulation step, while relaxing the strict requirement that the query must match verb-direct objects in source code and comments. In this study, we explored with queries developed using a query reformulation technique for search, which may have reduced the results for exploration. Allowing a user to enter an arbitrary query after exploring the word usage in source code could yield more realistic, and hence generalizable, results.

7.4 Conclusion

In this chapter, we presented a novel program exploration technique that uses textual information and word location information to recursively locate relevant, structurally connected program elements. In comparing our approach to the competing state of the art, we found that an integrated textual- and structural-based approach is significantly more effective than a technique using structural information alone. In addition, we investigated what impact exploration has on improving search results for concern location. Our study demonstrates that search and exploration can produce better results for concern location than search alone, but further investigation is needed into the differing query needs of search versus exploration.


Chapter 8

CONCLUSIONS AND FUTURE WORK

Today's software is large and complex, with systems consisting of millions of lines of code. New developers to a software project face significant challenges in locating code related to their maintenance tasks of fixing bugs or adding new features. In fact, research has shown that developers typically spend more time locating and understanding code during maintenance than modifying it [KMC06]. Thus, we can significantly reduce the cost of software maintenance by reducing the time and effort to find and understand the code relevant to a software maintenance task.

In this dissertation, we demonstrate how textual and structural information in source code can be used to improve software search and exploration tools. Specifically, this dissertation's main contributions are:

• Motivation for the use of textual and structural information to improve software maintenance tools and reduce the costs of concern location [HPVS07].

• Source code analysis techniques that provide more accurate textual information for software maintenance tools utilizing textual information [EHPVS09, HPVS09, HFB+ 08].

• A novel representation of source code that facilitates access to textual and structural information in source code and provides an interface between software engineering tool designers and researchers working on improved linguistic analyses for software.

• Theory and algorithms for capturing semantic concepts expressed in both natural language information and programming language structure and semantics.

• Application of natural language and structural information to improve the effectiveness of two software tools for concern location: search and exploration [HPVS07].

First, we illustrated the challenges that developers face when searching and exploring unfamiliar systems. We demonstrated how textual and structural information could be used to improve software maintenance tools. We then showed how textual information from source code could be processed for use in software maintenance tools. To facilitate integration of textual and structural information into additional software tools, we designed and implemented a novel model of word usage in software. This model provides software engineering tool designers access to both structural and linguistic information about the source code, where previously only structural information was available. Finally, we utilized textual and structural information to improve software search and program exploration tools, and evaluated against competing state of the art approaches. Our evaluations show that combining textual and structural information can outperform competing state of the art techniques.

8.1 Future Work

8.1.1 Extensions

To make natural language information more widely applicable to software engineering tools, we plan to continue refining the extraction algorithm for automatically constructing SWUM for Java, and extend it to other programming languages such as C and C++. We plan to evaluate the success of using SWUM for other search-based tasks, such as documentation to source code traceability [CDLO+ 09].


In addition, we will explore using SWUM for novice program comprehension. It is possible to use SWUM to automatically generate an English phrase for an arbitrary program statement [SHM+ 10]. We plan to investigate whether automatically generated phrases for complicated Java statements improve program comprehension time in novices.

8.1.2 New Directions

In the long term, we plan to use insights from developing SWUM and combining textual and structural information to improve software engineering tools beyond program search and exploration. UML class diagrams are used to comprehend legacy systems as well as to design and document newly created ones. Software engineering tools like IBM's Rational Software Architect can automatically extract UML class diagrams, but they extract all class entities without differentiating between the methods and fields that implement high-level concepts and the low-level implementation details that may distract from comprehension. By developing advanced linguistic and structural analyses such as those used in SWUM, we can create automated techniques capable of extracting UML diagrams at different levels of granularity. Focusing the developer's attention either on high-level, system-wide concepts or on low-level implementation details can thus enhance program comprehension.

When planning a software release and working to meet deadlines, managers and developers may need to prioritize which new features, optimizations, or bug fixes to implement first. Software change impact analysis, or simply impact analysis, can be used to estimate how much code will be affected by a change. Most existing approaches to impact analysis focus on precision, and can take as long as a day to analyze 50 million lines of code [Moo02]. However, an impact analysis technique that is capable of making decisions on demand and in real time during meetings could help guide decision making. We believe we can leverage the structural and textual information in source code to quickly approximate impact sets for decision making.


BIBLIOGRAPHY

[ACC+ 02]

Giuliano Antoniol, Gerardo Canfora, Gerardo Casazza, Andrea De Lucia, and Ettore Merlo. Recovering traceability links between code and documentation. IEEE Transactions on Software Engineering, 28(10):970–983, 2002.

[AG06]

Giuliano Antoniol and Yann-Gael Gueheneuc. Feature identification: An epidemiological metaphor. IEEE Transactions on Software Engineering, 32(9):627–641, 2006.

[AJL+ 09]

B. Ashok, Joseph Joy, Hongkang Liang, Sriram K. Rajamani, Gopal Srinivasa, and Vipindeep Vangala. DebugAdvisor: A recommender system for debugging. In ESEC/FSE ’09: Proceedings of the the 7th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, pages 373–382, New York, NY, USA, 2009. ACM.

[All99]

Paul D. Allison. Logistic Regression Using SAS: Theory and Application. SAS Institute, Inc., Cary, NC, USA, 1999.

[ALSU06]

Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools (2nd Edition). AddisonWesley Longman Publishing Co., Inc., Boston, MA, USA, 2006.

[ANS08]

A. Abadi, M. Nisenson, and Y. Simionovici. A traceability technique for specifications. ICPC ’08: Proceedings of the 16th IEEE International Conference on Program Comprehension, pages 103–112, June 2008.

[Ant90]

Evan L. Antworth. PC-KIMMO: a two-level processor for morphological analysis. Occasional Publications in Academic Computing No. 16., Dallas, TX: Summer Institute of Linguistics, 1990. http://www.sil. org/pckimmo/.

[Bal04]

F. Balmas. Displaying dependence graphs: A hierarchical approach. Journal of Software Maintenance and Evolution, 16(3):151–185, 2004.


[BD06]

Johannes Bohnet and J¨ urgen D¨ollner. Visual exploration of function call graphs for feature location in complex software systems. In SoftVis ’06: Proceedings of the 2006 ACM Symposium on Software Visualization, pages 95–104, New York, NY, USA, 2006. ACM.

[BDL09]

Mario Luca Bernardi and Giuseppe Antonio Di Lucca. A role-based crosscutting concerns mining approach to evolve java systems towards aop. In IWPSE-Evol ’09: Proceedings of the joint international and annual ERCIM workshops on Principles of Software Evolution (IWPSE) and Software Evolution (Evol) workshops, pages 63–72, New York, NY, USA, 2009. ACM.

[BGS05]

Pauli Byckling, Petri Gerdt, and Jorma Sajaniemi. Roles of variables in object-oriented programming. In OOPSLA ’05: Companion to the 20th annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 350–355, New York, NY, USA, 2005. ACM Press.

[Big89]

Ted J. Biggerstaff. Design recovery for maintenance and reuse. Computer, 22(7):36–49, 1989.

[BLLB08]

Pierre F. Baldi, Cristina V. Lopes, Erik J. Linstead, and Sushil K. Bajracharya. A theory of aspects as latent topics. In OOPSLA ’08: Proceedings of the 23rd ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 543–562, New York, NY, USA, 2008. ACM.

[BMW93]

Ted J. Biggerstaff, Bharat G. Mitbander, and Dallas Webster. The concept assignment problem in program understanding. In ICSE ’93: Proceedings of the 15th International Conference on Software Engineering, pages 482–498, 1993.

[Boe76]

B.W. Boehm. Software engineering. IEEE Transactions on Computers, C-25(12):1226–1241, December 1976.

[BW08]

Raymond P.L. Buse and Westley R. Weimer. Automatic documentation inference for exceptions. In ISSTA ’08: Proceedings of the 2008 International Symposium on Software Testing and Analysis, pages 273–282, New York, NY, USA, 2008. ACM.

[BW10]

Raymond P.L. Buse and Westley Weimer. Automatically documenting program changes. In ASE ’10: Proceedings of the 25th IEEE International Conference on Automated Software Engineering (ASE’10), 2010. To Appear. 208

[CDLO+ 09] G. Capobianco, A. De Lucia, R. Oliveto, A. Panichella, and S. Panichella. On the role of the nouns in IR-based traceability recovery. In IEEE 17th International Conference on Program Comprehension (ICPC '09), pages 148–157, May 2009.

[CM03] Davor Čubranić and Gail C. Murphy. Hipikat: Recommending pertinent software development artifacts. In ICSE '03: Proceedings of the 25th International Conference on Software Engineering, pages 408–418, 2003.

[CR00]

Kunrong Chen and V´aclav Rajlich. Case study of feature location using dependence graph. In IWPC ’00: Proceedings of the 8th International Workshop on Program Comprehension, pages 241–249, 2000.

[CT99]

Bruno Caprile and Paolo Tonella. Nomen est omen: Analyzing the language of function identifiers. In WCRE ’99: Proceedings of the 6th Working Conference on Reverse Engineering, pages 112–122, 1999.

[CT00]

Bruno Caprile and Paolo Tonella. Restructuring program identifier names. In ICSM ’00: Proceedings of the International Conference on Software Maintenance (ICSM’00), page 97, Washington, DC, USA, 2000. IEEE Computer Society.

[dAM08]

Brian de Alwis and Gail C. Murphy. Answering conceptual queries with ferret. In ICSE ’08: Proceedings of the 30th International Conference on Software Engineering, pages 21–30, New York, NY, USA, 2008. ACM.

[DBSB90]

P. T. Devanbu, R. J. Brachman, P. G. Selfridge, and B. W. Ballard. Lassie—a knowledge-based software information system. In ICSE ’90: Proceedings of the 12th International Conference on Software Engineering, pages 249–261, 1990.

[DGC95]

Jeffrey Dean, David Grove, and Craig Chambers. Optimization of object-oriented programs using static class hierarchy analysis. In ECOOP ’95: Proceedings of the 9th European Conference on ObjectOriented Programming, pages 77–101, 1995.

[DL05]

Stephane Ducasse and Michele Lanza. The class blueprint: Visually supporting the understanding of classes. IEEE Transactions on Software Engineering, 31(1):75–90, 2005.

[DP06]

Florian Deissenboeck and Markus Pizka. Concise and consistent naming. Software Quality Control, 14(3):261–282, 2006. 209

[DV99]

Angela Dean and Daniel Voss. Design and Analysis of Experiments. Springer, New York, NY, USA, 1999.

[EAAG08]

Marc Eaddy, ALfred V. Aho, Giuliano Antoniol, and Yann-Gael Gueheneuc. Cerberus: Tracing requirements to source code using information retrieval, dynamic analysis, and program analysis. In ICPC ’08: Proceedings of the 16th IEEE International Conference on Program Comprehension, Washington, DC, USA, 2008. IEEE Computer Society.

[EHPVS09] Eric Enslen, Emily Hill, Lori Pollock, and K. Vijay-Shanker. Mining source code to automatically split identifiers for software analysis. Proceedings of the 6th International Working Conference on Mining Software Repositories, MSR 2009, 0:71–80, 2009. [EKS03]

Thomas Eisenbarth, Rainer Koschke, and Daniel Simon. Locating features in source code. IEEE Transactions on Software Engineering, 29(3):210–224, 2003.

[Erl00]

Len Erlikh. Leveraging legacy system dollars for e-business. IT Professional, 2(3):17–23, 2000.

[EV05]

Andrew David Eisenberg and Kris De Volder. Dynamic feature traces: Finding features in unfamiliar code. In ICSM ’05: Proceedings of the 21st IEEE International Conference on Software Maintenance (ICSM’05), pages 337–346, Washington, DC, USA, 2005. IEEE Computer Society.

[EZS+ 08]

Marc Eaddy, Thomas Zimmermann, Kaitlin D. Sherwood, Vibhav Garg, Gail C. Murphy, Nachiappan Nagappan, and Alfred V. Aho. Do crosscutting concerns cause defects? IEEE Transactions on Software Engineering, 34(4):497–515, 2008.

[FBL06]

Henry Feild, David Binkley, and Dawn Lawrie. An empirical comparison of techniques for extracting concept abbreviations from identifiers. In Proceedings of IASTED International Conference on Software Engineering and Applications (SEA’06), November 2006.

[FC01]

Fangfang Feng and W. Bruce Croft. Probabilistic techniques for phrase extraction. Information Processing and Management, 37(2):199–220, 2001.

[FHR91]

Gerhard Fischer, Scott Henninger, and David Redmiles. Cognitive tools for locating and comprehending software objects for reuse. In ICSE ’91: 210

Proceedings of the 13th International Conference on Software Engineering, pages 318–328, Los Alamitos, CA, USA, 1991. IEEE Computer Society Press. [FLGD87]

G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais. The vocabulary problem in human-system communication. Communications of the ACM, 30(11):964–971, 1987.

[FM10]

Thomas Fritz and Gail C. Murphy. Using information fragments to answer the questions developers ask. In Proceedings of the 32nd International Conference on Software Engineering. IEEE Computer Society, May 2010.

[FNL89]

G. Fischer and H. Nieper-Lemke. Helgon: extending the retrieval by reformulation paradigm. In CHI ’89: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 357–362, 1989.

[FOW87]

Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems, 9(3):319–349, 1987.

[FPB75]

Jr. Frederick P. Brooks. The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley Longman Publishing Co., Inc., Reading, MA, USA, 1975.

[GCS08]

Scott Grant, James R. Cordy, and David Skillicorn. Automated concept location using independent component analysis. In WCRE ’08: Proceedings of the 2008 15th Working Conference on Reverse Engineering, pages 138–142, Washington, DC, USA, 2008. IEEE Computer Society.

[GDDC97] David Grove, Greg DeFouw, Jeffrey Dean, and Craig Chambers. Call graph construction in object-oriented languages. In OOPSLA ’97: Proceedings of the 12th ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications, pages 108–124, 1997. [GHMM09] G. Gay, S. Haiduc, A. Marcus, and T. Menzies. On the use of relevance feedback in ir-based concept location. In IEEE International Conference on Software Maintenance (ICSM 2009), pages 351 –360, sept. 2009. [GJM02]

Carlo Ghezzi, Mehdi Jazayeri, and Dino Mandrioli. Fundamentals of Software Engineering. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2nd edition, 2002. 211

[GJSB05]

James Gosling, Bill Joy, Guy Steele, and Gilad Bracha. Java Language Specification. Prentice Hall, 3rd edition, June 2005.

[GM05]

Joseph (Yossi) Gil and Itay Maman. Micro patterns in java code. In OOPSLA ’05: Proceedings of the 20th annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 97–116, New York, NY, USA, 2005. ACM.

[GPME06] Philip J. Guo, Jeff H. Perkins, Stephen McCamant, and Michael D. Ernst. Dynamic inference of abstract types. In ISSTA ’06: Proceedings of the 2006 International Symposium on Software Testing and Analysis, pages 255–265, New York, NY, USA, 2006. ACM Press. [Hen94]

Scott Henninger. Using iterative refinement to find reusable software. IEEE Software, 11(5):48–59, 1994.

[HFB+ 08]

Emily Hill, Zachary P. Fry, Haley Boyd, Giriprasad Sridhara, Yana Novikova, Lori Pollock, and K. Vijay-Shanker. AMAP: Automatically mining abbreviation expansions in programs to enhance software maintenance tools. In MSR ’08: Proceedings of the 5th International Working Conference on Mining Software Repositories, Washington, DC, USA, 2008. IEEE Computer Society.

[HM05]

Reid Holmes and Gail C. Murphy. Using structural context to recommend source code examples. In ICSE ’05: Proceedings of the 27th International Conference on Software Engineering, pages 117–125, 2005.

[HØ07]

Einar W. Høst and Bjarte M. Østvold. The programmer’s lexicon, volume I: The verbs. In SCAM ’07: Proceedings of the 7th IEEE International Working Conference on Source Code Analysis and Manipulation, pages 193–202, Washington, DC, USA, 2007. IEEE Computer Society.

[HØ09a]

Einar W. Høst and Bjarte M. Østvold. Debugging method names. In ECOOP ’09: Proceedings of the 23rd European Conference on ObjectOriented Programming, 2009.

[HØ09b]

Einar W. Høst and Bjarte M. Østvold. The java programmer’s phrase book. In Proceedings of the 1st International Conference on Software Language Engineering, pages 322–341, Berlin, Heidelberg, 2009. Springer-Verlag.

[HPVS07]

Emily Hill, Lori Pollock, and K. Vijay-Shanker. Exploring the neighborhood with Dora to expedite software maintenance. In ASE ’07: Proceedings of the 22nd IEEE International Conference on Automated 212

Software Engineering (ASE’07), pages 14–23, Washington, DC, USA, November 2007. IEEE Computer Society. [HPVS09]

Emily Hill, Lori Pollock, and K. Vijay-Shanker. Automatically capturing source code context of nl-queries for software maintenance and reuse. In ICSE ’09: Proceedings of the 31st International Conference on Software Engineering, 2009.

[HR92]

Susan Horwitz and Thomas Reps. The use of program dependence graphs in software engineering. In ICSE ’92: Proceedings of the 14th International Conference on Software Engineering, pages 392–411, New York, NY, USA, 1992. ACM Press.

[HRB90]

Susan Horwitz, Thomas Reps, and David Binkley. Interprocedural slicing using dependence graphs. ACM Transactions on Programming Languages and Systems, 12(1):26–60, 1990.

[HVdM06]

Elnar Hajiyev, Mathieu Verbaere, and Oege de Moor. Codequest: Scalable source code queries with datalog. In Dave Thomas, editor, ECOOP’06: Proceedings of the 20th European Conference on ObjectOriented Programming, volume 4067 of Lecture Notes in Computer Science, pages 2–27, Berlin, Germany, 2006. Springer.

[IBM10]

IBM. Eclipse IDE. Online, 2010. http://www.eclipse.org.

[INMI07]

Takashi Ishio, Ryusuke Niitani, Gail C. Murphy, and Katsuro Inoue. A program slicing approach for locating functional concerns. Technical report, Graduate School of Information Science and Technology, Osaka University, March 2007. http://sel.ist.osaka-u.ac.jp/~ishio/ TR-slicing2007.pdf.

[Jac90]

Ray Jackendoff. Semantic Structures. MIT Press, Cambridge, MA, 1990.

[JH06]

Mikkel R. Jakobsen and Kasper Hornbæk. Evaluating a fisheye view of source code. In CHI ’06: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 377–386, 2006.

[JM00]

Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2000.


[JV03]

Doug Janzen and Kris De Volder. Navigating and querying code without getting lost. In AOSD ’03: Proceedings of the 2nd International Conference on Aspect-Oriented Software Development, pages 178–187, 2003.

[KBT07]

Christoph Kiefer, Abraham Bernstein, and Jonas Tappolet. Mining software repositories with iSPARQL and a software evolution ontology. In ICSEW ’07: Proceedings of the 29th International Conference on Software Engineering Workshops, page 10, Washington, DC, USA, 2007. IEEE Computer Society.

[KDG07]

Adrian Kuhn, St´ephane Ducasse, and Tudor G´ırba. Semantic clustering: Identifying topics in source code. Information Systems and Technologies, 49(3):230–243, 2007.

[KDV07]

Andrew J. Ko, Robert DeLine, and Gina Venolia. Information needs in collocated software development teams. In ICSE ’07: Proceedings of the 29th International Conference on Software Engineering, pages 344–353, Washington, DC, USA, 2007. IEEE Computer Society.

[KK08]

Daisuke Kawahara and Sadao Kurohashi. Coordination disambiguation without any similarities. In COLING ’08: Proceedings of the 22nd International Conference on Computational Linguistics, pages 425–432, Morristown, NJ, USA, 2008. Association for Computational Linguistics.

[KM05]

Mik Kersten and Gail C. Murphy. Mylar: A degree-of-interest model for IDEs. In AOSD ’05: Proceedings of the 4th International Conference on Aspect-Oriented Software Development, pages 159–168, 2005.

[KM08]

Andrew J. Ko and Brad A. Myers. Debugging reinvented: asking and answering why and why not questions about program behavior. In ICSE ’08: Proceedings of the 30th International Conference on Software Engineering, pages 301–310, New York, NY, USA, 2008. ACM.

[KMC06]

Andrew J. Ko, Brad A. Myers, and Duen Horng Chau. A linguistic analysis of how people describe software problems. In VLHCC ’06: Proceedings of the Visual Languages and Human-Centric Computing, pages 127–134, Washington, DC, USA, 2006. IEEE Computer Society.

[Kos86]

Donald W. Kosy. Parsing conjunctions deterministically. In Proceedings of the 24th annual meeting on Association for Computational Linguistics, pages 78–84, Morristown, NJ, USA, 1986. Association for Computational Linguistics. 214

[KQ05]

Rainer Koschke and Jochen Quante. On dynamic feature location. In ASE ’05: Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, pages 86–95, New York, NY, USA, 2005. ACM.

[LBN+ 09]

Erik Linstead, Sushil Bajracharya, Trung Ngo, Paul Rigor, Cristina Lopes, and Pierre Baldi. Sourcerer: Mining and searching internetscale software repositories. Data Mining and Knowledge Discovery, 18(2):300–336, 2009.

[LBS06]

Ben Liblit, Andrew Begel, and Eve Sweetser. Cognitive perspectives on the role of naming in computer programs. In Proceedings of the 18th Annual Psychology of Programming Workshop, 2006.

[LFB07a]

Dawn Lawrie, Henry Feild, and David Binkley. An empirical study of rules for well-formed identifiers. Journal of Software Maintenance and Evolution, 19(4):205–229, 2007.

[LFB07b]

Dawn Lawrie, Henry Feild, and David Binkley. Extracting meaning from abbreviated identifiers. In SCAM ’07: Proceedings of the 7th IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM 2007), pages 213–222, 2007.

[LFB07c]

Dawn Lawrie, Henry Feild, and David Binkley. Quantifying identifier quality: An analysis of trends. Empirical Software Engineering, 12(4):359–388, 2007.

[LFOT07]

Andrea De Lucia, Fausto Fasano, Rocco Oliveto, and Genoveffa Tortora. Recovering traceability links in software artifact management systems using information retrieval methods. ACM Transactions on Software Engineering and Methodology, 16(4):13, 2007.

[LKE08]

S.K. Lukins, N.A. Kraft, and L.H. Etzkorn. Source code retrieval for bug localization using latent dirichlet allocation. In WCRE ’08: Proceedings of the 15th Working Conference on Reverse Engineering, pages 155 –164, 15-18 2008.

[LM09]

Greg Little and Robert C. Miller. Keyword programming in java. Automated Software Engineering, 16(1):145–192, 2009.

[LMDK07] Thomas K. Landauer, Danielle S. McNamara, Simon Dennis, and Walter Kintsch, editors. Handbook of Latent Semantic Analysis. Erlbaum, Mahwah, NJ, USA, 2007.


[LMPR07]

Dapeng Liu, Andrian Marcus, Denys Poshyvanyk, and Vaclav Rajlich. Feature location via information retrieval based filtering of a single scenario execution trace. In ASE ’07: Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering, pages 234–243, New York, NY, USA, 2007. ACM.

[LOPT00]

Leah S. Larkey, Paul Ogilvie, M. Andrew Price, and Brenden Tamilio. Acrophile: An automated acronym extractor and server. In DL ’00: Proceedings of the 5th ACM Conference on Digital Libraries, pages 205–214, New York, NY, USA, 2000. ACM.

[LS86]

S. Letovsky and E. Soloway. Delocalized plans and program comprehension. IEEE Software, 3(3):41–49, May 1986.

[LST78]

B. P. Lientz, E. B. Swanson, and G. E. Tompkins. Characteristics of application software maintenance. Communications of the ACM, 21(6):466–471, 1978.

[LVdA+ 04] Juan Llor´ens, Manuel Velasco, Antonio de Amescua, Jos´e A. Moreiro, and Vicente Mart´ınez. Automatic generation of domain representations using thesaurus structures. Journal of the American Society for Information Science and Technology, 55(10):846–858, 2004. [MBK91]

Yo¨elle S. Maarek, Daniel M. Berry, and Gail E. Kaiser. An information retrieval approach for automatically constructing software libraries. IEEE Transactions on Software Engineering, 17(8):800–813, 1991.

[Mic02]

Amir Michail. Browsing and searching source code of applications written using a GUI framework. In ICSE ’02: Proceedings of the 24th International Conference on Software Engineering, pages 327–337, 2002.

[MM03]

Andrian Marcus and Jonathan I. Maletic. Recovering documentationto-source-code traceability links using latent semantic indexing. In ICSE ’03: Proceedings of the 25th International Conference on Software Engineering, pages 125–135, 2003.

[MMM03]

Ettore Merlo, Ian McAdam, and Renato De Mori. Feed-forward and recurrent neural networks for source code informal information analysis. Journal of Software Maintenance, 15(4):205–244, 2003.

[MMvD07] M. Marin, L. Moonen, and A. van Deursen. Documenting typical crosscutting concerns. In WCRE ’07: Proceedings of the 14th Working Conference on Reverse Engineering, pages 31 –40, 28-31 2007. 216

[MN96]

Gail C. Murphy and David Notkin. Lightweight lexical source model extraction. ACM Transactions on Software Engineering and Methodology, 5(3):262–292, 1996.

[MN99]

Amir Michail and David Notkin. Assessing software libraries by browsing similar classes, functions and relationships. In ICSE ’99: Proceedings of the 21st International Conference on Software Engineering, pages 463–472, Los Alamitos, CA, USA, 1999. IEEE Computer Society Press.

[MNS01]

Gail C. Murphy, David Notkin, and Kevin J. Sullivan. Software reflexion models: Bridging the gap between design and implementation. IEEE Transactions on Software Engineering, 27(4):364–380, 2001.

[Moo02]

Leon Moonen. Lightweight impact analysis using island grammars. In IWPC ’02: Proceedings of the 10th International Workshop on Program Comprehension, page 219, Washington, DC, USA, 2002. IEEE Computer Society.

[MRS08]

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Sch¨ utze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.

[MS99]

Chris Manning and Hinrich Sch¨ utze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA, May 1999.

[MSH08]

Girish Maskeri, Santonu Sarkar, and Kenneth Heafield. Mining business topics in source code using latent dirichlet allocation. In ISEC ’08: Proceedings of the 1st India Software Engineering Conference, pages 113–120, 2008.

[MSRM04] Andrian Marcus, Andrey Sergeyev, V´aclav Rajlich, and Jonathan I. Maletic. An information retrieval approach to concept location in source code. In WCRE ’04: Proceedings of the 11th Working Conference on Reverse Engineering (WCRE’04), pages 214–223, 2004. [Muc97]

Steven S. Muchnick. Advanced compiler design and implementation. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.

[OG05]

Masaru Ohba and Katsuhiko Gondow. Toward mining “concept keywords” from identifiers in large software projects. In MSR ’05: Proceedings of the 2005 International Workshop on Mining Software Repositories, pages 1–5, 2005.


[OM94]

Akitoshi Okumura and Kazunori Muraki. Symmetric pattern matching analysis for english coordinate structures. In Proceedings of the 4th Conference on Applied Natural Language Processing, pages 41–46, Morristown, NJ, USA, 1994. Association for Computational Linguistics.

[Pak01]

Serguei Pakhomov. Semi-supervised maximum entropy based approach to acronym and abbreviation normalization in medical texts. In ACL ’02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 160–167, Morristown, NJ, USA, 2001. Association for Computational Linguistics.

[Par94]

David Lorge Parnas. Software aging. In ICSE ’94: Proceedings of the 16th International Conference on Software Engineering, pages 279–287, Los Alamitos, CA, USA, 1994. IEEE Computer Society Press.

[PDF87]

R. Prieto-Diaz and P. Freeman. Classifying software for reusability. IEEE Software, 4(1):6–16, 1987.

[PFGJ02]

M. Pinzger, M. Fischer, H. Gall, and M. Jazayeri. Revealer: A lexical pattern matcher for architecture recovery. In WCRE ’02: Proceedings of the 9th Working Conference on Reverse Engineering (WCRE’02), page 170, 2002.

[PGM+ 07] Denys Poshyvanyk, Yann-Gael Gueheneuc, Andrian Marcus, Giuliano Antoniol, and Vaclav Rajlich. Feature location using probabilistic ranking of methods based on execution scenarios and information retrieval. IEEE Transactions on Software Engineering, 33(6):420–432, 2007. [PKD09]

George Pallis, Asterios Katsifodimos, and Marios D. Dikaiakos. Effective keyword search for software resources installed in large-scale grid infrastructures. In WI-IAT ’09: Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, pages 482–489, Washington, DC, USA, 2009. IEEE Computer Society.

[PLEB07]

Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and Thomas Ball. Feedback-directed random test generation. In ICSE’07, Proceedings of the 29th International Conference on Software Engineering, pages 75–84, Minneapolis, MN, USA, May 23–25, 2007.

[PM07]

Denys Poshyvanyk and Andrian Marcus. Combining formal concept analysis with information retrieval for concept location in source code. In ICPC ’07: Proceedings of the 15th IEEE International Conference


on Program Comprehension, pages 37–48, Washington, DC, USA, 2007. IEEE Computer Society. [PMD06]

Denys Poshyvanyk, Andrian Marcus, and Yubo Dong. JIRiSS – an Eclipse plug-in for source code exploration. In Proceedings of the 14th International Conference on Program Comprehension (ICPC ’06), pages 252–255, 2006.

[PMR+ 06]

Denys Poshyvanyk, Andrian Marcus, Vaclav Rajlich, Yann-Gael Gueheneuc, and Giuliano Antoniol. Combining probabilistic ranking and latent semantic indexing for feature identification. In ICPC ’06: Proceedings of the 14th IEEE International Conference on Program Comprehension (ICPC’06), pages 137–148, Washington, DC, USA, 2006. IEEE Computer Society.

[Por80]

M.F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.

[PP96]

Santanu Paul and Ataul Prakash. A query algebra for program databases. IEEE Transactions on Software Engineering, 22(3):202–217, 1996.

[PPM+ 06]

Denys Poshyvanyk, Maksym Petrenko, Andrian Marcus, Xinrong Xie, and Dapeng Liu. Source code exploration with Google. In ICSM ’06: Proceedings of the 22nd IEEE International Conference on Software Maintenance (ICSM’06), pages 334–338, 2006.

[RAN07]

Per Runeson, Magnus Alexandersson, and Oskar Nyholm. Detection of duplicate defect reports using natural language processing. In ICSE ’07: Proceedings of the 29th International Conference on Software Engineering, pages 499–510, Washington, DC, USA, 2007. IEEE Computer Society.

[RC04]

Martin P. Robillard and Wesley Coelho. How effective developers investigate source code: An exploratory study. IEEE Transactions on Software Engineering, 30(12):889–903, 2004.

[RDP10]

Meghan Revelle, Bogdan Dit, and Denys Poshyvanyk. Using data fusion and web mining to support feature location in software. In IEEE 18th International Conference on Program Comprehension (ICPC ’10), June 2010.

[RJB98]

Jr. Roberto J. Bayardo. Efficiently mining long patterns from databases. In SIGMOD ’98: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pages 85–93, 1998. 219

[RM02]

Martin P. Robillard and Gail C. Murphy. Concern graphs: Finding and describing concerns using structural program dependencies. In ICSE ’02: Proceedings of the 24th International Conference on Software Engineering, pages 406–416, 2002.

[RM03]

Martin P. Robillard and Gail C. Murphy. Automatically inferring concern code from program investigation activities. In In Proceedings of the 18th International Conference on Automated Software Engineering, pages 225–234, 2003.

[RM07]

Martin P. Robillard and Gail C. Murphy. Representing concerns in source code. ACM Transactions on Software Engineering and Methodology, 16(1):3, 2007.

[Rob05]

Martin P. Robillard. Automatic generation of suggestions for program investigation. In ESEC/FSE-13: Proceedings of the 10th European Software Engineering Conference held jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 11–20, 2005.

[RSH+07]

Martin P. Robillard, David Shepherd, Emily Hill, K. Vijay-Shanker, and Lori Pollock. An empirical study of the concept assignment problem. Technical Report SOCS-TR-2007.3, School of Computer Science, McGill University, June 2007. http://www.cs.mcgill.ca/~martin/concerns/.

[SB88]

Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.

[SEHM06]

Thorsten Schafer, Michael Eichberg, Michael Haupt, and Mira Mezini. The SEXTANT software exploration tool. IEEE Transactions on Software Engineering, 32(9):753–768, 2006.

[SFB07]

Manu Sridharan, Stephen Fink, and Rastislav Bodik. Thin slicing. In PLDI ’07: Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, 2007.

[SFDB07]

Zachary M. Saul, Vladimir Filkov, Premkumar Devanbu, and Christian Bird. Recommending random walks. In ESEC-FSE ’07: Proceedings of the 6th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, pages 15–24, New York, NY, USA, 2007. ACM Press.

[SFH+07]

David Shepherd, Zachary P. Fry, Emily Hill, Lori Pollock, and K. Vijay-Shanker. Using natural language program analysis to locate and understand action-oriented concerns. In AOSD ’07: Proceedings of the 6th International Conference on Aspect-Oriented Software Development, 2007.

[SH98]

Susan Elliott Sim and Richard C. Holt. The ramp-up problem in software projects: A case study of how software immigrants naturalize. In ICSE ’98: Proceedings of the 20th International Conference on Software Engineering, pages 361–370, Washington, DC, USA, 1998. IEEE Computer Society.

[SHM+10]

Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K. Vijay-Shanker. Towards automatically generating summary comments for Java methods. In ASE ’10: Proceedings of the 25th IEEE International Conference on Automated Software Engineering (ASE’10), 2010. To appear.

[SHPVS08]

Giriprasad Sridhara, Emily Hill, Lori Pollock, and K. Vijay-Shanker. Identifying word relations in software: A comparative study of semantic similarity tools. In Proceedings of the 16th IEEE International Conference on Program Comprehension. IEEE, June 2008.

[Sim99]

Charles Simonyi. Hungarian notation. In Visual Studio 6.0 Technical Articles. Microsoft Corporation, 1999. Reprinted November 1999.

[SK08]

J. Singer and C. Kirkham. Exploiting the correspondence between micro patterns and class names. In SCAM ’08: Proceedings of the 8th IEEE International Working Conference on Source Code Analysis and Manipulation, pages 67–76, 2008.

[SKM06]

Vineet Sinha, David Karger, and Rob Miller. Relo: Helping users manage context during interactive exploratory visualization of large codebases. In Visual Languages and Human-Centric Computing (VL/HCC 2006), 2006.

[SLS09]

Jamie Starke, Chris Luce, and Jonathan Sillito. Searching and skimming: An exploratory study. In IEEE International Conference on Software Maintenance, pages 157–166. IEEE Computer Society, 2009.

[SM95]

M.-A. D. Storey and H. A. Müller. Manipulating and documenting software structures using SHriMP views. In ICSM ’95: Proceedings of the International Conference on Software Maintenance, page 275, 1995.


[Som10]

Ian Sommerville. Software Engineering (9th Edition). Pearson Addison Wesley, 2010.

[SPVS06]

David Shepherd, Lori Pollock, and K. Vijay-Shanker. Towards supporting on-demand virtual remodularization using program graphs. In AOSD ’06: Proceedings of the 5th International Conference on Aspect-Oriented Software Development, pages 3–14, 2006.

[Tip95]

Frank Tip. A survey of program slicing techniques. Journal of Programming Languages, 3(3):121–189, 1995.

[vR79]

C.J. van Rijsbergen. Information Retrieval. Butterworths, 2nd edition, 1979.

[WGRG10]

Michael Würsch, Giacomo Ghezzi, Gerald Reif, and Harald C. Gall. Supporting developers with natural language queries. In Proceedings of the 32nd International Conference on Software Engineering. IEEE Computer Society, May 2010.

[WHGT99]

W. Eric Wong, Joseph R. Horgan, Swapna S. Gokhale, and Kishor S. Trivedi. Locating program features using execution slices. In ASSET ’99: Proceedings of the 1999 IEEE Symposium on Application-Specific Systems and Software Engineering and Technology, page 194, Washington, DC, USA, 1999. IEEE Computer Society.

[WLZR08]

R. Witte, Q. Li, Y. Zhang, and J. Rilling. Text mining and software engineering: An integrated source code and document analysis approach. IET Software, 2(1):3–16, February 2008.

[WR07]

Frédéric Weigand Warr and Martin P. Robillard. Suade: Topology-based searches for software investigation. In ICSE ’07: Proceedings of the 29th International Conference on Software Engineering, pages 780–783, 2007.

[WS95]

Norman Wilde and Michael C. Scully. Software reconnaissance: Mapping program features to code. Journal of Software Maintenance, 7(1):49–62, 1995.

[XQZ+05]

Baowen Xu, Ju Qian, Xiaofang Zhang, Zhongqiang Wu, and Lin Chen. A brief survey of program slicing. SIGSOFT Software Engineering Notes, 30(2):1–36, 2005.

[YT07]

Annie T. T. Ying and Peri L. Tarr. Filtering out methods you wish you hadn’t navigated. In eclipse ’07: Proceedings of the 2007 OOPSLA workshop on eclipse technology eXchange, pages 11–15, New York, NY, USA, 2007. ACM.

[YYN07]

Yunwen Ye, Yasuhiro Yamamoto, and Kumiyo Nakakoji. A sociotechnical framework for supporting programmers. In ESEC-FSE ’07: Proceedings of the 6th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, pages 351–360, New York, NY, USA, 2007. ACM.

[ZW95]

Amy Moormann Zaremski and Jeannette M. Wing. Signature matching: A tool for using software libraries. ACM Transactions on Software Engineering and Methodology, 4(2):146–170, 1995.

[ZZL+06]

Wei Zhao, Lu Zhang, Yin Liu, Jiasu Sun, and Fuqing Yang. SNIAFL: Towards a static non-interactive approach to feature location. ACM Transactions on Software Engineering and Methodology, 15(2):195–226, 2006.
