Discovering Culturomic Trends in Large-Scale Textual Corpora

Thomas Lansdall-Welfare

A dissertation submitted to the University of Bristol in accordance with the requirements of the degree of Doctor of Philosophy in the Department of Computer Science, Faculty of Engineering.

June 2015

∼ 45,000 words

"Have courage for the great sorrows of life and patience for the small ones; and when you have laboriously accomplished your daily task, go to sleep in peace." - Victor Hugo

Abstract

The abundance of data and the ability to process it at a massive scale has transformed many areas of research in the natural sciences. These data-driven methods have recently begun to be adopted in fields that have not traditionally relied on computational approaches, such as the social sciences and humanities. As we continue forward, we are likely to see data-driven approaches spread further in these fields as more and more data is “born digital”, coupled with mass digitisation projects that aim to digitise the mountains of paper archives that still exist. In this thesis, we look at extracting, analysing and exploring data from massive textual corpora, concentrating on macroscopic trends and characteristics that can only be found when transitioning from traditional social science methods involving manual inspection, known as ‘coding’, to scalable, data-driven computational methods. A distributed architecture for large-scale text analysis was collaboratively developed during the project, serving as the infrastructure for collecting, storing and analysing data. Using this infrastructure, this thesis not only explores methods for extracting information in a scalable way but also demonstrates the types of studies that can be achieved by adopting data-driven approaches. These studies and their findings include differences in writing style across topics and news outlets; longitudinal and diurnal patterns of mood change in population-scale samples of UK social media users; and general tools and methods that can be used to interrogate and explore massive textual corpora in an interactive way. We conclude that data-driven methods for the analysis of large-scale textual corpora have now reached a point where the extraction of macroscopic trends and patterns can reveal meaningful information about the real world.

Acknowledgements

This project would not have been possible without the inspiration, guidance and help of many generous people during its completion. As such, I would like to express my appreciation for their contributions. Firstly, I would like to recognise the inspiring and enlightening role that my supervisor, Prof. Nello Cristianini, has played in my life during this time. Nello has not only provided invaluable motivation, support and guidance in academic affairs, but has been a constant source of wisdom and advice in all matters. I have truly enjoyed spending time talking and working with him over the years. Next, I would like to thank the other members of the research group who have not only provided a stimulating and positive atmosphere to work in, but have made meaningful contributions and given indispensable advice to me over the course of this work. I would like to specifically thank Dr. Ilias Flaounas for his endless advice and cooperation on numerous occasions; Dr. Saatviga Sudhahar for her helpful collaboration and hard work; Dr. Vasileios Lampos for his honesty and contribution to our joint studies; and Dr. Panagiota Antonakaki for her encouragement, support, and advice on all manner of subjects. Additionally, I would like to thank Omar, Fabon, Jason, Ricardo, Nick, Elena, and Tristan for making my time in the group fun and interesting. I am grateful to Syed Rahman for his hard work and sincere effort in keeping all the machines and hardware ticking over, as well as being patient and understanding when required. I would also like to express my thanks and recognise the contribution to my work from Dr. Tijl De Bie, Dr. Giuseppe A. Veltri, Prof. Justin Lewis and Prof. Ricardo Araya, all of whom provided invaluable insights and perspectives in our joint collaborations. No acknowledgements would be complete without also mentioning those friends and family who provided relief and distraction from the project whenever it was needed.
So, Paul, Nathan, Rob, Rob, Terry, Roz and Louise, thank you for keeping me sane and making me laugh. Above all, I am eternally grateful to my parents, Richard and Margaret, for their love and encouragement and for financially supporting me throughout this project. Without them, literally none of this would have been possible.


AUTHOR’S DECLARATION

I declare that the work in this dissertation was carried out in accordance with the Regulations of the University of Bristol. The work is original except where indicated by special reference in the text and no part of the dissertation has been submitted for any other degree. Any views expressed in the dissertation are those of the author and in no way represent those of the University of Bristol. The dissertation has not been presented to any other University for examination either in the United Kingdom or overseas.

SIGNED: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


DATE: . . . . . . . . . . . . . . . . . . .

Contents

1 Introduction . . . 21
    1.1 Thesis aims and research questions . . . 22
    1.2 Peer-reviewed publications . . . 24
    1.3 Impact and Outreach . . . 26
    1.4 Key Findings . . . 27
    1.5 Outline of the Thesis . . . 28
2 Background . . . 29
    2.1 Information Retrieval Fundamentals . . . 29
        2.1.1 Text Preprocessing . . . 29
        2.1.2 Dependency Parsing . . . 32
        2.1.3 Vector Space Representation . . . 33
    2.2 Machine Learning . . . 35
        2.2.1 Topic Classification . . . 35
        2.2.2 Association Rule Mining . . . 37
        2.2.3 Performance Measures . . . 40
    2.3 Summary . . . 42
3 Meet the Data . . . 43
    3.1 Online News Media . . . 43
        3.1.1 Feeds . . . 44
        3.1.2 Outlets . . . 46
        3.1.3 Articles . . . 47
    3.2 Social Media . . . 51
        3.2.1 Twitter Data . . . 51
        3.2.2 Collecting Tweets . . . 52
    3.3 Modular Architecture for Data Curation . . . 53
        3.3.1 Related Work . . . 54
        3.3.2 System Overview . . . 55
        3.3.3 Discussion . . . 66
        3.3.4 Contribution . . . 67
    3.4 Summary . . . 68
4 Extracting Textual Characteristics at Scale . . . 71
    4.1 Annotation for Computational Social Science . . . 72
        4.1.1 Measuring Writing Style . . . 72
        4.1.2 Writing Style in Topics and Outlets . . . 74
        4.1.3 Discussion . . . 79
        4.1.4 Contribution . . . 81
    4.2 Label Propagation for Text Classification . . . 81
        4.2.1 Related Work . . . 82
        4.2.2 Graph Construction . . . 83
        4.2.3 Label Propagation . . . 85
        4.2.4 Experiments . . . 86
        4.2.5 Discussion . . . 93
    4.3 Summary . . . 95
5 Mood Trends in Social Media . . . 97
    5.1 Introduction . . . 97
    5.2 Related Work . . . 98
    5.3 Computing a Mood Trend from Social Media . . . 100
    5.4 Longitudinal Mood Trends in Twitter Content . . . 101
        5.4.1 Longitudinal mood trends . . . 101
    5.5 Circadian Mood Trends in Twitter Content . . . 109
        5.5.1 Circadian mood trends . . . 110
    5.6 Visualising Mood Changes in Twitter . . . 116
    5.7 Discussion . . . 117
        5.7.1 Measuring mood signifiers . . . 118
        5.7.2 Twitter demographics . . . 119
        5.7.3 Attributing cause . . . 120
    5.8 Summary . . . 121
6 Large-Scale Corpus Analysis . . . 123
    6.1 Introduction . . . 124
    6.2 Methods . . . 124
        6.2.1 Extracting References . . . 125
        6.2.2 Generating Attention and Sentiment Trends . . . 125
        6.2.3 Mining Associations . . . 127
        6.2.4 Extracting Triplets and Action Clouds . . . 128
    6.3 Nuclear Power Case Study . . . 129
        6.3.1 Evolution of Attention and Sentiment . . . 130
        6.3.2 Associations . . . 132
        6.3.3 Actions and Actors . . . 133
        6.3.4 Case Study Discussion . . . 139
    6.4 Data Playground: an Interactive Tool for Corpora Analysis . . . 139
        6.4.1 Available Tools within the Data Playground . . . 140
        6.4.2 Notes on Implementation . . . 145
    6.5 Discussion . . . 149
    6.6 Summary . . . 150
7 Conclusions . . . 151
    7.1 Summary of the Thesis . . . 153
    7.2 Future Directions for Research . . . 155
A Supplementary Examples . . . 157
    A.1 Example studies using Data Playground . . . 157
B Supplementary Tables . . . 163

List of Figures

2.1 Illustrative example of a string being tokenised. . . . 30
2.2 Two (linearly separable) classes being separated by a maximum margin hyperplane, showing the support vectors. . . . 36
2.3 Illustration of the two database scans required by the FP-Growth algorithm to build an FP-Tree for mining frequent patterns. . . . 39
3.1 Sample of a news feed taken from the BBC News website. . . . 45
3.2 Sample of the XML from a news feed taken from the BBC News website. . . . 45
3.3 Annotated news article found on the BBC News website. . . . 48
3.4 Average volume of news articles collected by day of the week. . . . 50
3.5 Cumulative volume of news articles collected. . . . 50
3.6 Cumulative volume of tweets collected. Annotated change points correspond to A) Change of database system; B) Twitter API version change; and C) On-going issue with Geographical Search API. . . . 53
3.7 Simplified overview of the modular architecture for data curation, demonstrating the interaction of modules (circles) to annotate data via central blackboards. . . . 56
3.8 Communication of modules in the system via control labels written to the blackboards. . . . 59
3.9 Screenshot of the Clickable News App. . . . 62
3.10 Screenshot of the Klaxon server alert system. . . . 63
4.1 Scatter plot of topics by their mean readability score against their mean linguistic subjectivity score. . . . 77
4.2 Scatter plot of US and UK news outlets by their mean readability score against their mean linguistic subjectivity score. . . . 79
4.3 Tuning the parameters k and ε for the two graph-based methods. F1 performance for the 50 most common topics evaluated on the training set using LOO-CV for (a) ε = {0, 0.01, . . . , 1.00} and (b) k = {1, 3, . . . , 99}. It can be seen that there is a peak at ε = 0.4 and k = 5. . . . 87
4.4 Comparison of the mean F1 Score, averaged over all test set weeks, for the graph-based methods on the 50 most common topics. . . . 88
4.5 Comparison of the mean F1 Score, averaged over all test set weeks, for the graph-based methods with a single best parameter against a multi-parameter approach on the 50 most common topics. . . . 89
4.6 Comparison of the mean F1 Score, averaged over all test set weeks, for (a) ε-Neighbourhood and (b) k-NN against SVMs on the 50 most common topics. . . . 90
4.7 Comparison of the mean F1 Score, averaged over all test set weeks, for the combined method at different µ values on the 50 most common topics. It can be seen that the combined method offers an improvement over the kNN approach (µ = 0) and the SVM approach (µ = 6). . . . 92
4.8 Comparison of the mean F1 Score, averaged over all test set weeks, for the combined method using µ = 4 against (a) SVMs and (b) kNN on the 50 most common topics. . . . 93
4.9 Summary of the mean F1 Score, averaged over all test set weeks, for the graph-based methods and SVMs along with the best combined method (µ = 4) on the 50 most common topics. It can be seen that the graph-based methods are comparable with SVMs, with the combined method showing a further improvement. It should be noted that the performance of the combined method is slightly biased due to selecting for the best µ. ε-Neighbourhood has been abbreviated to εN. . . . 94
5.1 Anger trend extracted from UK Twitter content between 1st July 2009 and 19th January 2012. . . . 102
5.2 Fear trend extracted from UK Twitter content between 1st July 2009 and 19th January 2012. . . . 103
5.3 Joy trend extracted from UK Twitter content between 1st July 2009 and 19th January 2012. . . . 103
5.4 Sadness trend extracted from UK Twitter content between 1st July 2009 and 19th January 2012. . . . 104
5.5 Rate of mood change using the difference in mean for the 50 days before and after an event. Significant days (p < 0.005 for t = 10,000) are represented by red asterisks. . . . 108
5.6 Cumulative sum of the mean trend points for each mood. . . . 109
5.7 Daily mood trends for each of the four moods computed for summer (green), winter (red) and aggregated (blue) data sets. . . . 112
5.8 Autocorrelation of the four mood trends for the aggregated data. . . . 114
5.9 Visualisation of Mood in Twitter using the Grimace Project [207]. . . . 116
5.10 Facial visualisations of the collective mood in UK Twitter on (a) Christmas Day 2009, (b) 29th April 2011 (Royal Wedding), (c) 27th October 2010 (Week after cuts announced), (d) 27th October 2011 (One year after cuts), (e) 29th July 2011 (Week before riots), (f) 29th July 2010 (One year before riots). . . . 117
6.1 Relative frequency of the number of science articles mentioning ‘Nuclear Power’ between 1st May 2008 and 31st December 2013. . . . 130
6.2 Normalised difference in the number of positive to negative sentences mentioning ‘Nuclear Power’ between 1st May 2008 and 31st December 2013. . . . 130
6.3 Cumulative sum of the mean for the sentiment score of ‘Nuclear Power’ between 1st May 2008 and 31st December 2013. . . . 132
6.4 Histogram of the 10,000 greatest values found in randomised trials of the cumulative sum of the mean for the sentiment score of ‘Nuclear Power’ between 1st May 2008 and 31st December 2013. . . . 133
6.5 Associated universities (green), topics (blue) and diseases (red) found through association rule mining for ‘Nuclear Power’ before the Fukushima disaster. Edge weight denotes the confidence in the association. . . . 134
6.6 Associated universities (green), topics (blue) and diseases (red) found through association rule mining for ‘Nuclear Power’ after the Fukushima disaster. Edge weight denotes the confidence in the association. . . . 135
6.7 SVO triplet network showing the actors and actions affecting ‘Nuclear Power’ before the Fukushima disaster. Nodes represent subjects and objects in the SVO triplets, while edges show the verb relation between the subject and object of the triplet. . . . 136
6.8 SVO triplet network showing the actors and actions affecting ‘Nuclear Power’ after the Fukushima disaster. Nodes represent subjects and objects in the SVO triplets, while edges show the verb relation between the subject and object of the triplet. . . . 137
6.9 Verbs in the SVO triplets where ‘Nuclear Power’ is the object before the incident, showing the actions happening to ‘Nuclear Power’. . . . 138
6.10 Verbs in the SVO triplets where ‘Nuclear Power’ is the object after the incident, showing the actions happening to ‘Nuclear Power’. . . . 138
6.11 Timelines of three European leaders in modern English news media using Data Playground. . . . 141
6.12 Word cloud generated for the largest peak for “Francois Hollande” as seen in Figure 6.11. . . . 141
6.13 Set comparison of three major cities in modern English news media using Data Playground. . . . 143
6.14 Example of the advantages of the expressive query language available when timelines are computed on the fly, instead of being precomputed. . . . 145
A.1 Data Playground timeline for Elon Musk in the Science news corpus described in Section 6.3. . . . 158
A.2 Data Playground associations for Elon Musk in the Science news corpus described in Section 6.3. . . . 159
A.3 Data Playground triplets for Elon Musk in the Science news corpus described in Section 6.3. . . . 160
A.4 Data Playground timeline for Artificial Intelligence in the Science news corpus described in Section 6.3. . . . 161
A.5 Data Playground triplets for Artificial Intelligence in the Science news corpus described in Section 6.3. . . . 162

List of Tables

3.1 Mandatory fields for feeds in the system. . . . 46
3.2 Common optional fields for feeds in the system. . . . 46
3.3 Common fields for outlets in the system. . . . 47
3.4 Common fields for articles in the system. . . . 49
3.5 Blackboards in the system. . . . 58
3.6 Parameter settings for the modules in the system. . . . 60
4.1 Articles labelled with each of the 15 topic categories for writing style comparison. . . . 75
4.2 Articles collected from the main feed of the 16 news outlets chosen for writing style comparison. . . . 78
5.1 Correlation between mood trends for the same public holidays in different years. . . . 106
B.1 Locations by Region for Twitter collection within the United Kingdom. . . . 163
B.2 Number of articles labelled with the 50 most frequently occurring topics in the Reuters RCV1 corpus, along with the optimal hyperparameters ε and k found for label propagation (see Section 4.2). . . . 164
B.3 Anger terms present in the WordNet-Affect lexicon. . . . 165
B.4 Fear terms present in the WordNet-Affect lexicon. . . . 165
B.5 Joy terms present in the WordNet-Affect lexicon. . . . 166
B.6 Sadness terms present in the WordNet-Affect lexicon. . . . 166

CHAPTER 1

Introduction

“Data Science”, first coined in the 1960s by Peter Naur as an alternate name for the field of Computer Science [220]¹, has become an increasingly popular discipline, to the extent that in 2012 the Harvard Business Review went so far as to call being a “Data Scientist” (a practitioner of Data Science) the sexiest job of the 21st century². Nowadays, the term has evolved and taken on a more specific meaning relating to the study of the generalisable extraction of knowledge from data. The popularity of Data Science has been driven by the ubiquity of data and the need to understand and extract knowledge from it, with the total archived data capacity globally increasing by 60% between 2013 and 2014 to an estimated 197 exabytes, split among databases, email and unstructured data [61]. Organisations that utilise data at all levels of their decision making, such as Google and Facebook, are seen as modern success stories and are held up as exemplars of “The Age of Big Data”³. Indeed, the entire Big Data movement, closely related to Data Science, is often framed as a paradigm shift in the Kuhnian sense [49], where “one conceptual world view is replaced by another” [14], and, again borrowing Kuhn’s nomenclature, is referred to as “The Big Data Revolution” [152] or heralded as “The End of Theory”⁴.

In this context, particular attention is often directed towards the “data” aspect of Data Science, with discussions focussing on certain quantitative characteristics of the data (known as the five V’s of Big Data: Volume, Velocity, Variety, Veracity and Value). While these factors are very important in the collection, storage and management of the data, they often fail to address the more fundamental questions: what is the purpose of the data? What knowledge can be gained by analysing it? These kinds of questions point towards the “science” aspect of Data Science being the key contribution. John Tukey summed this up succinctly: “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data” [230]. Essentially, there needs to be a focus on how we can use the abundant data available to us to carry out systematic studies of the structure and behaviour of the world, where the sheer quantity of available data exposes new facets of knowledge that can then be extracted. None of this is to say that Big Data and Data Science are not useful and meaningful ways to conduct science, but that care should be taken to focus on the knowledge extracted from the data, rather than the data itself.

Data Science has many proven applications across a range of fields, from the Large Hadron Collider at CERN discovering the Higgs boson [2] to the 100,000 Genomes Project aiming to match genomic and clinical data in order to develop personalised therapies for a whole host of diseases [74]. The research project detailed in this thesis focuses on the application of Data Science to the Social Sciences, sometimes referred to as Computational Social Science (CSS) [128]. An exemplar in this field is the “Culturomics” study first presented in 2011 by Jean-Baptiste Michel et al. [161], which analysed five million digitised books and revealed cultural changes ranging from how quickly verbs are regularised, to the length of time it takes inventions to reach their highest frequency of mentions, to the effects of censorship on the mention of entities, while also introducing the idea that similar approaches can provide quantitative evidence for scholars in many fields.

Following this approach, this research project takes a data-driven approach to how automated methods can enable the analysis of modern-day cultural changes using online news and social media. The automation of such studies opens up the possibility of analysing entire sections of society at a large scale, using samples far larger than traditional techniques from the social sciences could permit.

¹ “Data Science” was originally used interchangeably with “Datalogy”.
² http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
³ http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html
⁴ http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory
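As a toy illustration of the frequency-trend computation at the heart of such culturomic analyses, the sketch below tracks the yearly relative frequency of a phrase across a dated corpus. The four-document corpus and the phrase are invented purely for illustration; a real study would operate over millions of documents.

```python
from collections import Counter
from datetime import date

# Hypothetical mini-corpus of (publication date, text) pairs, standing in
# for the millions of documents a real culturomic study would draw on.
corpus = [
    (date(2010, 1, 5), "nuclear power stations face review"),
    (date(2010, 6, 9), "solar power subsidies announced"),
    (date(2011, 3, 14), "nuclear power plant crisis deepens"),
    (date(2011, 3, 20), "protests over nuclear power continue"),
]

def relative_frequency(docs, phrase):
    """Yearly share of documents mentioning `phrase`."""
    mentions, totals = Counter(), Counter()
    for day, text in docs:
        totals[day.year] += 1
        if phrase in text.lower():
            mentions[day.year] += 1
    return {year: mentions[year] / totals[year] for year in sorted(totals)}

print(relative_frequency(corpus, "nuclear power"))  # {2010: 0.5, 2011: 1.0}
```

Normalising by yearly document volume, rather than reporting raw counts, is what makes trends comparable across years in which the corpus grows.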

1.1 Thesis aims and research questions

This thesis aims to take a data-driven approach to investigating important questions in areas outside of Computer Science that have yet to fully adopt computational and automated methods, in order to discover large-scale macroscopic patterns in everyday text data that would otherwise be explored manually. In this direction, we present several studies using state-of-the-art tools and techniques from the field of Data Science and its associated fields of Data Mining, Machine Learning, Information Retrieval and Natural Language Processing to solve research challenges in a diverse range of fields spanning the Social Sciences, Journalism, Health and more. Where necessary, new methods for analysing the data are introduced in order to answer the research questions posed, along with a focus on how to apply established techniques at a larger scale, allowing macroscopic trends in the data to emerge. Below we summarise the research questions posed and answered in pursuit of these aims, which enable us to conduct studies in traditionally non-computational fields.

How do we collect and store large amounts of text data? A starting point for any data-driven approach will always be the data. Careful thought must be given to determining, firstly, the best way to collect the data, ensuring that it allows the desired research questions to be properly addressed, and secondly, how to store the data in a way that allows for easy processing and analysis later on.
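To make the collect-and-store step concrete, here is a minimal sketch (not the system described in Chapter 3) that parses an RSS-style feed with the Python standard library and persists the items in SQLite. The feed XML is hard-coded in place of a real HTTP fetch, and the one-table schema is invented for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Stand-in feed content; field names follow the RSS 2.0 <item> convention.
RSS = """<rss><channel>
  <item><title>Example story</title><link>http://example.com/1</link></item>
  <item><title>Another story</title><link>http://example.com/2</link></item>
</channel></rss>"""

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE articles (link TEXT PRIMARY KEY, title TEXT)")

for item in ET.fromstring(RSS).iter("item"):
    # INSERT OR IGNORE makes repeated crawls of the same feed idempotent:
    # the link acts as a natural primary key for deduplication.
    db.execute("INSERT OR IGNORE INTO articles VALUES (?, ?)",
               (item.findtext("link"), item.findtext("title")))

print(db.execute("SELECT COUNT(*) FROM articles").fetchone()[0])  # 2
```

Keying on the article URL means the collector can poll the same feed every few minutes without duplicating stored articles, which matters once collection runs continuously for years.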

Can we extract meaningful information at scale from text data? Once the data has been collected and stored, we need to be able to extract information from it that can be used to address the questions we wish to tackle. With textual data, it is important to determine whether the extracted information is meaningful, which depends heavily on the research question being asked. A common methodological issue is validating that the extracted information is a good representation of the real world, especially in cases where a ground truth is absent.

What can the information tell us about the real world? Once we establish that we can extract meaningful information from text data, we want to use that information to answer questions about the real world using the macroscopic patterns we find in the data. This not only allows us to demonstrate the feasibility of performing such tasks, but also enables scientific findings that were previously impossible in a variety of traditional disciplines to be discovered and presented using data-driven approaches.

Can we formulate our studies into generic methodologies? After solving a particular problem in one field, we want to be able to generalise our approach to solve similar problems in entirely new fields of study. To do so, we must strive for general methodologies that can be applied in a wide range of scenarios, rather than ones developed for specific cases.

Chapter 1. Introduction

1.2 Peer-reviewed publications

In following the aims and research questions of the thesis, a number of peer-reviewed publications were generated. Here we list all relevant peer-reviewed publications co-authored by Thomas Lansdall-Welfare during the course of the Ph.D. project. Each publication is listed in chronological order, and is annotated with the varying levels of contribution, defined as follows:

• Involved in writing the paper (W)

• Contributed to the ideas, methods, algorithms or design (C)

• Provided data or implementation (D)

Scalable Corpus Annotation by Graph Construction and Label Propagation. Thomas Lansdall-Welfare, Ilias Flaounas, Nello Cristianini. Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods. Faro, Portugal. 2012. [122] – W, C, D. In this paper, we present a scalable methodology for constructing graphs of text documents based upon their content similarity, which can be used for classification using an efficient label propagation algorithm that gives comparable performance to state-of-the-art classifiers.

Effects of the Recession on Public Mood in the UK. Thomas Lansdall-Welfare, Vasileios Lampos, Nello Cristianini. Proceedings of the 21st World Wide Web Conference Companion, Lyon, France. 2012. [123] – W, C, D. In this paper, we analyse mood scores derived from Twitter content and relate them with real-life events in the UK. Interesting findings include evidence of increased levels of negative mood associated with the recession affecting the UK during the period covered by the study.

ElectionWatch: detecting patterns in news coverage of US elections. Saatviga Sudhahar, Thomas Lansdall-Welfare, Ilias Flaounas, Nello Cristianini. Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics. 2012. [219] – C, D. In this paper, we present a web tool that allows users to explore news stories concerning the 2012 US Presidential Elections via an interactive interface based upon the concept of narrative analysis.


Automatic Annotation of a Dynamic Corpus by Label Propagation. Thomas Lansdall-Welfare, Ilias Flaounas, Nello Cristianini. Mathematical Methodologies in Pattern Recognition and Machine Learning, p.19–32. Springer. 2013. [121] – W, C, D. In this paper, we are interested in the problem of automatically annotating a large, constantly expanding corpus, in the case where potentially neither the dataset nor the class of possible labels that can be used is static, and the annotation of the data needs to be efficient. This paper presents experimental results on both the efficient creation of the graph and the propagation of the labels, and is an extension of [122].

Research Methods in the Age of Digital Journalism: Massive-scale automated analysis of news-content – topics, style and gender. Ilias Flaounas, Omar Ali, Thomas Lansdall-Welfare, Tijl De Bie, Nick Mosdell, Justin Lewis, Nello Cristianini. Digital Journalism, Vol. 1(1), p.102–116. 2013. [79] – C, D. In this paper, we present an approach for the automation of several common, labour-intensive, manual tasks involved in news content analysis, and demonstrate the approach on a massive data set many orders of magnitude larger than those normally used.

On the Coverage of Science in the Media: A Big Data Study on the Impact of the Fukushima Disaster. Thomas Lansdall-Welfare, Saatviga Sudhahar, Giuseppe A. Veltri, Nello Cristianini. Proceedings of the 2014 IEEE International Conference on Big Data. Washington DC, USA. 2014. [126] – W, C, D. In this paper, we present a scalable methodology for several aspects of corpus analysis, demonstrating the approach with a case study using millions of news articles, focusing on the coverage of science in the media. We show how real-world events can alter the way that science is reported and offer methods for detecting these changes.

Measuring Gender Bias in News Images. Sen Jia, Thomas Lansdall-Welfare, Nello Cristianini. NewsWWW 2015 – 2nd workshop on Web and Data Science for News Publishing, Florence, Italy. 2015. [108] – W, C, D. In this paper, we adopt a novel approach to gender bias in the media, studying the gender of faces in images collected from news articles on a large scale. Our findings show that gender bias differs by topic, with Fashion and the Arts showing the least bias. Comparisons of gender bias by outlet suggest that tabloid-style news outlets may be less gender-biased than broadsheet-style ones, supporting


previous results from textual content analysis of news articles.

Scalable Preference Learning from Data Streams. Fabon Dzogang, Thomas Lansdall-Welfare, Saatviga Sudhahar, Nello Cristianini. NewsWWW 2015 – 2nd workshop on Web and Data Science for News Publishing, Florence, Italy. 2015. [70] – W (this paper was co-written with Fabon Dzogang), C, D. In this paper, we show how learning the preferences of users can benefit from an efficient algorithm, based on hashing representations, which enables it to be deployed on high-intensity data streams. We demonstrate the effectiveness of this approach on four real-world news streams, compare it with standard approaches, and describe a new online demonstration based on the technology.

1.3 Impact and Outreach

The work in this thesis has been generally well received by the academic community, with 84 citations to the articles published as part of this thesis at the time of writing (1st August 2015; citation count as determined by Google Scholar: https://scholar.google.co.uk/citations?user=ToF6ePQAAAAJ&hl=en). Additionally, several contributions were invited based upon work in this thesis, including a talk at the Big Data Mining event organised by the Royal Statistical Society (http://www.rss.org.uk/Images/PDF/about/rss-annual-report-2013-with-accounts.pdf), an article in Significance [124] and a book chapter in The Routledge Companion to Digital Journalism Studies [125].

Aspects of the work completed in this thesis also attracted the attention of technology and mainstream media. Some of the significant news articles or other media appearances that disseminated the work to the general public are listed in this Section, providing an indication of public interest in the research output of the thesis, while also offering a more accessible introduction.

• Television interview on BBC World News about “Effects of the Recession on Public Mood in the UK” [123]. Available at https://www.youtube.com/watch?v=EfUfyF83EAg.

• Television segment “Dans les labos d’ici et d’ailleurs: Twitter à l’étude !” (“In the labs here and elsewhere: Twitter under study!”) on Le Code Chastenay, Episode 108. Available at http://lecodechastenay.telequebec.tv/emission.aspx?id=114.

• “Can Twitter Tell Us What We Are Feeling? [VIDEO]”, in Mashable by Sonia Paul. Available at http://mashable.com/2012/04/17/twitter-feeling/.

• “Twitter reveals effects of the recession on public mood”, in New Scientist Technology Blog by Jacob Aron. Available at http://www.newscientist.com/blogs/onepercent/2012/04/twitter-reveals-effects-of-the.html.

• “Can social media detect the changes in public mood?”, in Science Daily. Available at http://www.sciencedaily.com/releases/2012/04/120417080557.htm.

• “Twitter Study Shows an Increase in Negative Mood Leading Up to Last Year’s London Riots”, in The Socionomist by Chuck Thomson. Available at http://www.socionomics.net/pdf/1209SOC.pdf.

• “Algorithm deems The Sun easiest tabloid to read, comparable to CBBC”, in Wired by David Cornish. Available at http://www.wired.co.uk/news/archive/2012-11/26/analysis-of-news-articles.

• “Detecting patterns in news coverage of US elections”, in Significance by Abdel Khairoun. Available at http://www.significancemagazine.org/details/webexclusive/3003851/Detecting-patterns-in-news-coverage-of-US-elections.html [Web archive].

1.4 Key Findings

In this section we briefly summarise the main findings of this thesis, many of which rely on large-scale automated approaches that would not be within the scope of traditional methods in their respective scholarly fields.

• Propagation of topic labels in graphs of documents can achieve state-of-the-art performance while offering a scalable and efficient implementation, paving the way for topic classification on large-scale corpora.

• Large-scale extraction of textual characteristics, such as readability and linguistic subjectivity, allows for patterns in writing style to be revealed, enabling different outlets and topics to be ranked and compared to each other.

• Longitudinal changes in mood can be detected and analysed from hundreds of millions of tweets, suggesting the recession played a key role in the collective public mood.

• Mood levels extracted from Twitter reveal patterns of diurnal variation in collective mood for the UK Twitter-using population.

• Using large-scale corpus analysis, we can observe shifts in the framing of scientific topics within the mass media, including changes in the related sentiment.


1.5 Outline of the Thesis

This Chapter provides the motivation and aims for the thesis and presents the research questions we wish to address, while also listing the output of the thesis in terms of relevant peer-reviewed publications and dissemination in the media. Chapter 2 covers fundamentals in Text Mining for methods used later on in the thesis and provides some background on the notation and general tasks which are addressed further on. Chapter 3 begins by detailing the types of data collected and used during the thesis and the infrastructure required to perform such tasks. Chapter 4 builds on this and presents two studies in extracting information from text: the first analyses how writing style in news media content differs between topics and between news sources, while the second presents methodologies for performing efficient document classification at a large scale. Chapter 5 continues with a look at detecting the mood levels of the public through the analysis of social media data, and how these relate to real-world events happening at the time. Chapter 6 presents the final study in the thesis, showing how a large-scale corpus analysis approach to the reporting of science can reveal shifts in news coverage, highlighting in particular the change in how nuclear power is framed before and after a large disaster. Finally, Chapter 7 concludes the thesis by revisiting the research questions posed in Section 1.1 in light of the studies presented and details directions in which the work could be taken in the future.


CHAPTER 2

Background

In this Chapter, we cover some of the fundamental ideas and methods in text mining and information retrieval, detailing the processing steps that commonly take place when working with unstructured text in order to convert the text into a usable format for the machine learning and data mining methods that we wish to use. We further provide detailed descriptions of the algorithms used for extracting patterns and knowledge from data in later chapters of this thesis.

2.1 Information Retrieval Fundamentals

2.1.1 Text Preprocessing

In order to work with unstructured text data, we must often perform some operations to clean the raw text data into a format which is easier to work with and to remove some of the irrelevant information that is often found in unstructured text. These steps are commonly referred to as preprocessing steps, as they happen before we process the text into a new machine-readable representation.

2.1.1.1 Tokenisation

Raw text data is stored as a string of characters, with no explicit word information. The task of tokenisation is to split the string of characters up into a sequence of words, also referred to as tokens or terms, which can then be processed. In practice, tokenisation is performed by assuming that contiguous sequences of alphabetic characters form a single token in the vocabulary, which are separated


by whitespace characters. Numeric characters that form contiguous sequences are also considered a single token, while special characters such as punctuation are treated in different ways depending on the specific character. For example, we would wish to allow an apostrophe (’) to form part of a token, such as “won’t”, but we would wish to treat a hyphen (-) as its own token, such as in a date range like “2013-2014”. For our purposes, the alphabet used is the UTF-8 character set, which is the most dominant character set used on the web [238]. An illustration of a string being tokenised can be seen in Figure 2.1.

String: The cat sat near the fire because he wasn’t warm enough.
↓ tokenisation ↓
Tokens: [The] [cat] [sat] [near] [the] [fire] [because] [he] [wasn’t] [warm] [enough] [.]

Figure 2.1: Illustrative example of a string being tokenised.

Tokenisation is obviously dependent on the language of the text being tokenised, with the given example applying to English-language text. For non-Latin-alphabet languages, however, tokenisation can be a non-trivial task, especially when there are no whitespace characters between words, such as in Chinese [106]. Even within the group of languages that use the same alphabet, the rules governing how punctuation characters should be treated will vary based upon the language, along with the vocabulary.
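The tokenisation rules described above can be sketched with a single regular expression. This is an illustrative sketch for English text only, not the exact implementation used in the thesis: alphabetic runs (allowing internal apostrophes), digit runs, or any other non-whitespace character each form a token.

```python
import re

# Alphabetic sequences (with internal apostrophes, e.g. "wasn't"), digit
# sequences, or any single other non-whitespace character become tokens.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|[0-9]+|[^\sA-Za-z0-9]")

def tokenise(text):
    return TOKEN_PATTERN.findall(text)

print(tokenise("The cat sat near the fire because he wasn't warm enough."))
# As described above, the hyphen in a date range is emitted as its own token:
print(tokenise("2013-2014"))  # ['2013', '-', '2014']
```

Note that the alternation order matters: the word-with-apostrophe branch is tried first, so “wasn’t” stays a single token rather than being split at the apostrophe.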

2.1.1.2 Sentence Boundary Disambiguation

Sentence Boundary Disambiguation (SBD) is the process by which a text is split into its constituent sentences. In practice, this process often follows the tokenisation of a text, using the punctuation tokens to determine initial candidate positions for where each sentence ends. If the token preceding the punctuation mark is not a known abbreviation, and the following token begins with a capitalised letter, then the punctuation mark is identified as a sentence boundary. This approach is implemented in the popular ANNIE plugin for GATE [52], and was used whenever SBD was required in this project. Other methods for SBD exist, including models trained using a maximum entropy approach [190], neural networks [178] and transformation-based learning [209].

2.1.1.3 Part of Speech Tagging

Part of Speech (POS) tagging is the process of identifying the role each word plays in a text, whether it is a noun, verb, adjective, etc. This is useful as it allows us to differentiate between the specific usage


of particular words, such as the word “bear” as a noun (referring to the mammal, e.g. “polar bear”) versus “bear” the verb (meaning to endure or tolerate, e.g. “bear with me”), as well as identifying the important words within a document, for example detecting proper nouns [163]. There have been many proposed algorithms for performing POS tagging [208, 147, 205, 229], many of which achieve comparable accuracy. One particular method [225], implemented in the popular Stanford parser, works by building a cyclic dependency network over the words, whereby the part of speech tag for a given word is influenced by the words and their POS tags both preceding and succeeding it. Maximum likelihood estimation methods are then used to infer the POS tag relating to each word by finding the most likely sequence of tags for the words, conditional on the words in the dependency network. Again, as with tokenisation and many other text preprocessing steps, these methods are language dependent, with a different language model required per language. For English, the set of possible POS tags that a word could receive is encoded in the Penn Treebank POS tagset [149].

2.1.1.4 Stemming and Stop Word Removal

Depending on their specific grammatical usage, semantically similar words may take different forms. In order to normalise words into their root form, a process known as stemming is performed. There are many different methods for performing stemming [84], with the most popular being the Porter stemming algorithm [185]. The Porter stemming algorithm takes a suffix-stripping approach, whereby the ends of words are removed following a set of specific language rules. The stemming of a word is entirely independent of the surrounding words, with stemming algorithms typically operating on a single word at a time. This is useful in text mining tasks as a way to reduce the dimensionality of the data, collapsing many words down into the same root form. Not only does this keep the size of the vocabulary under control, but it allows for similar words, such as “laziness” and “lazy”, to be stemmed to the same root form “lazi”, thereby enabling texts to be found containing either original word. While stemming does have advantages, there are also some drawbacks, and a balance needs to be found. Aggressive stemming leads to more words being reduced to the same root forms, but yields a higher rate of two semantically different words being reduced to the same root form. Lighter stemming has the inverse problem, in that two words with the same semantic meaning may not be stemmed to the same form. An additional processing step known as stop word removal often accompanies the stemming


process. Stop words are highly frequent words which do not carry any semantic information, and are likely to be present in every single text. By removing them from the text, we can focus on those words which are informative and discriminate one text from another [192, 234]. Examples of stop words typically include articles (a, an, the, etc.), prepositions (in, of, on, to, etc.), and conjunctions (and, because, etc.). The stop words we wish to remove are usually compiled into a list, known as a stop word list, with the most popular for English text being created by Fox [83] from the Brown corpus [85]. Removal of the stop words is then performed by using a deterministic finite state automaton [47] to match words in the text against the list. Sometimes it is also beneficial to remove words that occur very infrequently in a corpus. This approach can drastically reduce the size of the vocabulary due to Zipf’s law, which states that “the frequency of a word in a corpus is inversely proportional to its frequency position within the corpus” [251]. This means that there are only a few very frequent words, and very many infrequent words. The most infrequent words can be removed as they often do not occur frequently enough to be discriminative.
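The two steps can be illustrated together. The stop word list and suffix rules below are toy examples invented for illustration; the Porter algorithm applies a much larger, carefully conditioned rule set.

```python
# Toy stop word removal followed by a crude suffix-stripping step. These
# SUFFIX_RULES are hypothetical and far simpler than the Porter algorithm.
STOP_WORDS = {"a", "an", "the", "in", "of", "on", "to", "and", "because"}
SUFFIX_RULES = [("iness", "i"), ("ness", ""), ("ing", ""), ("s", "")]

def crude_stem(token):
    # Apply the first matching rule, guarding against over-stripping
    # very short tokens.
    for suffix, replacement in SUFFIX_RULES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)] + replacement
    return token

def preprocess(tokens):
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess(["the", "laziness", "of", "cats"]))  # ['lazi', 'cat']
```

The first rule maps “laziness” to the root form “lazi”, matching the example given above, so a search for either “laziness” or (a stemmed) “lazy” could retrieve the same texts.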

2.1.2 Dependency Parsing

Dependency parsing of a text tells us the relationship between each of the words in a sentence, often using the verb in the sentence as the structural centre. The task of a dependency parser is to determine the dependencies between each token in a sentence, and map each of these dependencies to a single relation from a set of approximately 50 possible relations [58]. This is performed by building a graph over the tokens in the sentence, where the nodes in the graph are tokens and the edges encode the relations between the connected nodes. Two mainstream approaches for building the dependency graph (or tree) exist: using generative probabilistic models to rank the best candidate analyses derived from a dynamic programming approach [37, 43]; and using deterministic greedy parsing algorithms which utilise a trained classifier [172, 191]. To help improve the accuracy of the parsing in relation to named entities specifically, we first perform coreference and anaphora resolution on the text. This involves finding each occurrence of a pronoun and replacing it with the entity being referred to, along with replacing all references to the same entity under slightly different names with the same identifier. When required, we used the MaltParser [172] implementation of a classifier-based dependency parser for English text in this project, trained on dependency trees extracted from Wall Street Journal sections of the Penn Treebank [149] using the Stanford parser [57] (see http://www.maltparser.org/mco/english_parser/engmalt.html for more details), with the coreference and anaphora resolution performed using GATE [52].

2.1.3 Vector Space Representation

Once a text has been preprocessed, we often need to convert the representation from a sequence of words to a vector of numbers. This allows the sequences to be cast into the vector space needed by many machine learning methods, as well as allowing us to assign weights to different words based upon their importance. When converting from the sequence of words to a vector of numbers, we lose the ordering of the words, with only the presence of a word in the sequence being of importance. Therefore, the vector space representation is often referred to as a “bag-of-words”. From a sequence of tokens obtained by preprocessing an unstructured text, we obtain a set of vocabulary words SV ⊆ V. We form a vector d = {w1, w2, . . . , wn} where each weight wi ∈ d denotes the weight given to the vocabulary word vi ∈ V in the vocabulary V. There are many weighting schemes that can be used in order to determine the weight that should be assigned to each vocabulary word.

2.1.3.1 Boolean Vectors

The simplest weighting scheme is to assign each weight wi ∈ d a boolean 0 or 1 value depending on whether the sequence SV contains the vocabulary word vi ∈ V, that is:

wi = 1 if vi ∈ SV, and wi = 0 otherwise.    (2.1)

Boolean vectors do not carry any information about how often a particular word occurs within a text, or give any special importance to any word over another. They are most commonly used to determine the set of documents that contain a word from an entire corpus, which can be computed efficiently by performing bitwise operations between boolean vectors.

2.1.3.2 Term Frequency Vectors

Another weighting scheme for documents, known as the term frequency (TF), is to assign a weight wi ∈ d to each vocabulary word vi ∈ V based upon how often the word appears in the sequence SV:

wi = |{vi} ∩ SV| / |SV|.    (2.2)


Each weight is equal to the number of times the word appears in the original word sequence, normalised by the total length of the sequence. This weighting scheme gives each word a relative importance based upon how often it occurs within the text. For example, given the quote “One morning I shot an elephant in my pajamas. How the elephant got into my pajamas I’ll never know.” (Groucho Marx in Animal Crackers, 1930), we can see that the words “elephant” and “pajamas” would be assigned the greatest weight in the term frequency vector (the words “in” and “my” would also receive the same weight, but are likely to be removed as stop words), since they occur twice as often as the other words. One drawback of this scheme is that it does not take into account how discriminative a particular word is when comparing this text to others within the same corpus.

2.1.3.3 Term Frequency-Inverse Document Frequency Vectors

The most common weighting scheme for text is known as the Term Frequency-Inverse Document Frequency (TF-IDF). This scheme assigns a weight wi ∈ d to each vocabulary word vi ∈ V based upon how often the word appears in the sequence SV, weighted by the logarithm of how rarely it occurs within the entire corpus C that the text SV belongs to, that is:

wi = (|{vi} ∩ SV| / |SV|) · log(|C| / |{SV ∈ C : vi ∈ SV}|).    (2.3)

Normalising the term frequency by the number of texts containing it within the whole corpus allows us to assign a greater weight to those words which occur rarely in the rest of the texts, therefore making it a more discriminative word for this text. For example, the word “vector” would likely be assigned a high TF-IDF score in a corpus of Entertainment texts due to its relatively infrequent usage, but would be assigned a much lower score within the domain of Computer Science, where it is a more common term.

2.1.3.4 Cosine Similarity

Texts that have been transformed into a vector space representation can easily be compared to one another, a common task in information retrieval, where one wishes to assess the similarity of two texts. In order to do this, we must choose a distance metric by which to measure the similarity of the vectors in question. For text, the most common distance metric is the cosine similarity, which assigns a score between 0 (no similarity) and 1 (complete similarity) to the similarity of two vectors based upon the angle between them in vector space.



The cosine similarity cos(θ) between two vectors d1 and d2 is derived from the Euclidean dot product formula and is computed as

cos(θ) = ⟨d1, d2⟩ / (‖d1‖ ‖d2‖),    (2.4)

where ⟨· , ·⟩ is the inner product between the vectors and ‖·‖ denotes the Euclidean norm of a vector.
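Equations 2.2 to 2.4 can be sketched over a toy corpus as follows. This is a minimal illustration using the standard multiplicative form of TF-IDF, not the production implementation used in this project, and the documents are invented examples.

```python
import math
from collections import Counter

# A toy corpus of already-tokenised documents.
corpus = [
    ["the", "cat", "sat", "near", "the", "fire"],
    ["the", "dog", "sat", "on", "the", "mat"],
    ["the", "cat", "chased", "the", "dog"],
]

def tf_idf(doc, corpus):
    counts = Counter(doc)
    vector = {}
    for term, count in counts.items():
        tf = count / len(doc)                       # term frequency (2.2)
        df = sum(1 for d in corpus if term in d)    # document frequency
        vector[term] = tf * math.log(len(corpus) / df)  # TF-IDF (2.3)
    return vector

def cosine(v1, v2):
    # Cosine similarity (2.4) over sparse dictionary vectors.
    dot = sum(v1[t] * v2[t] for t in v1 if t in v2)
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2)

vecs = [tf_idf(d, corpus) for d in corpus]
# "the" appears in every document, so its IDF factor, and hence its
# TF-IDF weight, is zero:
print(vecs[0]["the"])  # 0.0
print(cosine(vecs[0], vecs[2]))
```

Notice how the ubiquitous word “the” is weighted down to zero, achieving an effect similar to stop word removal, while the two cat-related documents still share a nonzero cosine similarity through the word “cat”.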

2.2 Machine Learning

Machine Learning is a subfield of Artificial Intelligence (AI) and covers the construction and study of algorithms that learn from data. As such, Machine Learning methods are often key to unlocking the potential of large scale corpora and discovering the hidden patterns contained within. During this project, we compare our own approaches with other machine learning methods as appropriate, along with making use of Machine Learning approaches, as detailed in this Section.

2.2.1 Topic Classification

Topic classification is a vital task in many situations where one has a very large number of text documents that need to be categorised. In this context, the aim of topic classification is to categorise a text document into one of any number of predefined topic categories automatically, without the need for each text document to be manually assigned to a topic. For example, a text document that mentions “David Cameron”, “Houses of Parliament” and “Westminster” might be categorised as Politics, while one mentioning “nuclear power”, “Mars rover” and “Curiosity” might be categorised as Science or Technology. Since topic classification, and more generally text categorisation, is such an important task in many domains, there is a large body of literature on the subject, surveyed in [4, 216, 199]. Many of the methods and techniques for topic classification are now considered mature, and are being applied in a wide range of new applications to support scholarly fields of inquiry [65, 111, 148]. In this work, we compare with and use statistical classifiers for topic classification which rely on data in order to learn how to classify text. We focus on supervised classification algorithms, a subset of learning algorithms which require training data in order to learn. In this setting, a linear classifier takes training data containing n data samples Zn = {(x1, y1), . . . , (xn, yn)}, where xi is a data sample and yi ∈ {−1, 1} is the label for the assigned class, and uses it to learn the model weight vector w and a bias b which can predict the label y on unseen data (known as test data). Here, we


examine binary linear classifiers, where data samples are either assigned to a class or not, with a different binary classifier used to predict each class (in this case a topic). When dealing with text, due to the very large dimension of the data (the size of the vocabulary), good classifier performance can often be achieved using simple linear classifiers [4, 197, 241]:

f(x) = ⟨w, x⟩ − b,    (2.5)

ŷ = sign(f(x)) = 1 if f(x) ≥ 0, and −1 otherwise,    (2.6)

where ŷ is the predicted label for data sample x using model w. Intuitively, a binary classifier is attempting to find the hyperplane which best separates the data based upon its labels. The learnt weight vector w is then the vector normal to the separating hyperplane.

2.2.1.1 Support Vector Machines

One algorithm for learning the weight vector w is the Support Vector Machine (SVM) [46]. SVMs attempt to learn the optimal separating hyperplane by maximising the margin to the closest samples (to the hyperplane) of each class, where the data samples lying on the boundary are known as support vectors. Figure 2.2 illustrates two classes being separated by a maximum margin hyperplane.

Figure 2.2: Two (linearly separable) classes being separated by a maximum margin hyperplane, showing the support vectors.

w is learnt by solving an optimisation problem using standard quadratic programming techniques such as sequential minimisation optimisation [184] or stochastic gradient descent [26]. It can be written in primal form using Lagrange multipliers α as follows:

arg min_{w,b} max_{α≥0} { (1/2)‖w‖² − Σ_{i=1}^{n} αi [yi(⟨w, xi⟩ − b) − 1] }.    (2.7)

In this solution, we do not allow for the case where the data is not entirely linearly separable. To account for this, Soft Margin SVMs were presented in [46], which allow for data to be on the wrong side of the boundaries by introducing non-negative slack variables ξi that measure how far on the wrong side of the boundary a misclassified data sample is. Accounting for the soft margins, the slack variables ξi are introduced to Equation 2.7 to give:

arg min_{w,ξ,b} max_{α,β≥0} { (1/2)‖w‖² + C Σ_{i=1}^{n} ξi − Σ_{i=1}^{n} αi [yi(⟨w, xi⟩ − b) − 1 + ξi] − Σ_{i=1}^{n} βi ξi }.    (2.8)

The dual form of this optimisation problem [203] can then be expressed as:

arg max_{α≥0} { Σ_{i=1}^{n} αi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj ⟨xi, xj⟩ },    (2.9)

with the model w calculated as

w = Σ_{i=1}^{n} αi yi xi.    (2.10)

The dual form given in Equation 2.9 allows for the kernel trick to be used in cases where the data is not linearly separable, replacing the inner product between xi and xj with a kernel operation, mapping the two data points into a feature space belonging to the set of Hilbert spaces [203] before calculating their similarity in the new space. Many different kernels exist, and are chosen depending on the type of data one is working with. In this project, where we are primarily working on text, we do not use non-linear kernels due to the very high dimensionality of the text feature space enabling a simple Cosine kernel, a normalised version of the linear kernel, to perform well [221]. In this work, we use the LibSVM [35] implementation of SVMs for topic classification trained on the New York Times [193] and Reuters [135] corpora, with the classifiers tuned for high precision (see Section 2.2.3) and deployed by Ilias Flaounas [78].
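Prediction with a binary linear classifier (Equations 2.5 and 2.6) reduces to an inner product and a threshold. The sketch below uses a small hypothetical weight vector over a toy vocabulary rather than weights actually learnt by LibSVM from the corpora mentioned above.

```python
# Hypothetical vocabulary and weight vector for a Politics-vs-Sport
# classifier; positive weights vote for the Politics class (label +1).
vocabulary = ["election", "parliament", "goal", "match"]
w = [1.2, 0.8, -0.9, -1.1]
b = 0.0

def predict(x):
    # f(x) = <w, x> - b  (Equation 2.5), thresholded by sign (Equation 2.6).
    f = sum(wi * xi for wi, xi in zip(w, x)) - b
    return 1 if f >= 0 else -1

politics_doc = [2, 1, 0, 0]  # term counts over the vocabulary
sport_doc = [0, 0, 3, 1]
print(predict(politics_doc))  # 1
print(predict(sport_doc))     # -1
```

In practice the vectors would be TF-IDF weighted and the vocabulary would contain hundreds of thousands of terms, but the prediction step remains this single sparse inner product, which is what makes linear classifiers attractive at scale.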

2.2.2 Association Rule Mining

Association rule mining, also referred to as frequent item set or pattern mining, is an approach for discovering interesting associations between different items within a database, the so-called association


rules and frequent item sets. The most common example to demonstrate the approach is that of finding sets of items that are bought together in the same transactions at a supermarket, e.g. when “minced beef” and “tomato ketchup” are bought in the same transaction, we often find that “bread rolls” are also purchased. The aim of association rule mining is to discover these interesting rules from large databases of transactions.

Formally, let I = {i1, i2, . . . , in} be a set of n items and D = {t1, t2, . . . , tm} be a set of m transactions called a database, where each transaction t ∈ D is a subset of the items, t ⊆ I; that is, each transaction in the database is a subset of all possible items. An association rule is then defined as an implication between item sets, such that the rule A ⇒ B denotes A implies B, given that A, B ⊆ I and A ∩ B = ∅. The measure of “interestingness” in this context is often the support of an item set or the confidence of a discovered association rule, although other measures exist. The support of an item set, supp(A), is defined as the proportion of the total transactions which contain the item set, while the confidence of an association rule, conf(A ⇒ B), is defined as the support of the union of the item sets, supp(A ∪ B), over the support of the first item set, supp(A). A frequent item set, also known as a frequent pattern, is therefore any item set which has a support greater than a chosen minimum threshold.

While many algorithms have been proposed to mine association rules [3, 5, 97, 182, 245], they generally fall into two main categories: candidate generation algorithms, of which the Apriori algorithm [5] is the most well known, and pattern growth algorithms, with the FP-Growth algorithm [97] being the most popular. Briefly, candidate generation algorithms work by iteratively generating the list of candidate item sets of length (k + 1) from the list of frequent item sets of length k and checking if the candidates of length (k + 1) are also frequent.
These methods rely on the Apriori property, which states that “if an item set of length k is not frequent, any of its supersets with length (k + 1) or greater can never be frequent” [5]. However, when very large numbers of candidate item sets exist within a database, candidate generation algorithms require a large number of scans of the database in order to check the frequency of the candidate item sets. To address this issue, pattern growth algorithms partition the database according to the frequent patterns, requiring only a few scans of the database.
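To make the support and confidence definitions concrete, both can be computed directly over a toy transaction database; a minimal sketch in Python (the items echo the supermarket example above and are purely illustrative):

```python
# Toy transaction database; each transaction is a set of items.
D = [
    {"minced beef", "tomato ketchup", "bread rolls"},
    {"minced beef", "tomato ketchup", "bread rolls", "cheese"},
    {"minced beef", "tomato ketchup"},
    {"bread rolls", "cheese"},
]

def support(itemset, database):
    """supp(A): proportion of transactions containing the item set."""
    return sum(itemset <= t for t in database) / len(database)

def confidence(A, B, database):
    """conf(A => B) = supp(A u B) / supp(A)."""
    return support(A | B, database) / support(A, database)

A = {"minced beef", "tomato ketchup"}
B = {"bread rolls"}
print(support(A | B, D))    # 0.5
print(confidence(A, B, D))  # ≈ 0.667
```

With a minimum support threshold of 0.5, the item set A ∪ B above would be extracted as a frequent pattern.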

2.2.2.1  FP-Growth

In this work, we make use of the FP-Growth algorithm [97], which we outline here. The FP-Growth algorithm is composed of two steps: first, an FP-Tree is constructed from the database using two scans of the database; then, in the second step, frequent patterns are mined directly from the FP-Tree.

Figure 2.3: Illustration of the two database scans required by the FP-Growth algorithm to build an FP-Tree for mining frequent patterns.

We begin the construction of the FP-Tree by scanning the database and finding the support of each individual item, removing those that fall below our minimum support threshold. We use this information to sort the items in each transaction by their support in descending order. Next, we perform a second scan of the database, mapping each transaction to a path in the tree where each item is a node, with the more frequent items in a path being the parents of the less frequent ones. While each transaction is mapped to a path in the tree, we keep a count of how often each node has been seen in that path. Pointers between nodes that represent the same item are maintained using a linked list, so that all nodes within the tree representing a given item can quickly be found. An example of the construction of an FP-Tree can be seen in Figure 2.3.

Once the FP-Tree has been constructed, we can mine it for frequent item sets and their association rules following a bottom-up approach, starting at the leaves of the tree and working towards the root. In this divide-and-conquer approach, we start from the least frequent item il and, using the linked list, find each node representing this item and sum the counts of how often it appears, obtaining the support for the item. If the item set has a support greater than the minimum support threshold we have set, we extract it as a frequent item set. This process is repeated for all item sets with il as a suffix, moving towards the root of the tree and following the same procedure until all frequent patterns that have item il as their suffix have been extracted. Next, we move on to the next least frequent item as the suffix, iterating the procedure until all frequent patterns have been extracted from the FP-Tree.

For this project, we use a parallel implementation of FP-Growth in Mahout³ based upon [140]. This allows us to scale the algorithm (and data) out in a distributed fashion across numerous machines using the Hadoop framework⁴, enabling us to extract frequent item sets from large databases.
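The two construction scans described above can be sketched compactly. The following is a minimal illustration of FP-Tree construction only (not the Mahout implementation used in the project), with a header table playing the role of the linked list between nodes of the same item:

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fptree(transactions, min_support):
    # First scan: count item supports and discard infrequent items.
    support = Counter(item for t in transactions for item in t)
    support = {i: c for i, c in support.items() if c >= min_support}

    root = FPNode(None, None)
    header = defaultdict(list)  # item -> nodes for that item (the "linked list")

    # Second scan: map each transaction to a path, most frequent items first.
    for t in transactions:
        items = sorted((i for i in t if i in support),
                       key=lambda i: (-support[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

transactions = [["a", "b"], ["b", "c"], ["a", "b", "c"]]
root, header = build_fptree(transactions, min_support=2)
```

Mining then proceeds bottom-up as described in the text, using the header table to locate all nodes for a given suffix item and summing their counts to obtain supports.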

³ http://mahout.apache.org/users/misc/parallel-frequent-pattern-mining.html
⁴ http://hadoop.apache.org

2.2.3  Performance Measures

When working with data, we often wish to know how well our algorithms are performing, or, put another way, how difficult the data is to predict. To quantify this notion, there is a common set of performance measures used in the fields of information retrieval and pattern recognition for assessing the performance of binary classification: the case where each data point is assigned to either a positive or a negative set, which is then assessed against the true (ground truth) assignment of the data.

2.2.3.1  Precision

Precision is a measure of the percentage of the results returned by an algorithm that are relevant, and is alternatively known as the positive predictive value. A high precision means that the vast majority of the returned results match the query or pattern we are looking for, and therefore very good predictions can be made about the data. Put another way, it means that an algorithm is very good at separating the data into those results which are relevant and those which are not. Precision is calculated as

Pr = tp / (tp + fp),   (2.11)

where tp is the number of true positives (data points labelled as positive with a positive ground truth) and fp is the number of false positives (data points labelled as positive with a negative ground truth). While a high precision is desirable, it does not give any information about how many of the data points with a positive ground truth have been labelled as relevant, simply that those which have are correct.

2.2.3.2  Recall

Recall is a measure of the percentage of the total relevant results in the data that are returned by the algorithm as relevant. This gives us an idea of the probability that any given relevant result in the data is returned by the algorithm. A high recall means that nearly all of the relevant results are returned by the algorithm, whether or not the algorithm can correctly decide if they are relevant. As such, recall alone is not enough to quantify whether an algorithm is performing well, since one can trivially achieve perfect recall by returning all results. Recall is calculated as

Rec = tp / (tp + fn),   (2.12)

where fn is the number of false negatives (data points labelled as negative with a positive ground truth).

2.2.3.3  F1 Score

In many classification tasks, it is not enough to have a high precision or a high recall alone; we wish to balance the trade-off between the two to achieve good performance overall. To do this, the F1 Score, the harmonic mean of precision and recall, is often used to obtain an average classification performance.
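All three measures can be computed directly from the true positive, false positive and false negative counts; a minimal sketch (the counts in the example are arbitrary):

```python
def precision(tp, fp):
    """Pr = tp / (tp + fp)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Rec = tp / (tp + fn)."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# e.g. 8 true positives, 2 false positives, 4 false negatives
print(precision(8, 2))    # 0.8
print(recall(8, 4))       # ≈ 0.667
print(f1_score(8, 2, 4))  # ≈ 0.727
```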


The F1 Score is calculated as

F1 = 2 · Pr · Rec / (Pr + Rec).   (2.13)

2.3  Summary

Nearly all text, whether it is found on the web, in print, in private communications or elsewhere, comes in an unstructured form. In this Chapter, we introduced the most common methods for processing text from this unstructured form into word tokens and sentences, along with adding information about the role each word token plays within the sentence or the dependencies between them. We covered how to convert these extracted tokens into a machine-readable format using a vector space representation suitable for machine learning methods. Finally, we detailed the machine learning methods used later in this project to classify documents into topic categories and discover associations between items in a database of transactions. In the next Chapter, we will cover the types of data we collect and use in the project, along with the distributed, modular infrastructure we developed in order to manage this huge collection of data.


CHAPTER 3

Meet the Data

Vast amounts of readily available digital data, coupled with powerful new computational analyses, are driving a data-driven approach to the automation of science in many disparate branches of research, stretching from biology, physics and chemistry to the humanities and social sciences [128, 240]. In the social sciences, the data-driven approach is opening possibilities for large-scale quantitative studies of human behaviour [161], language evolution [141] and more, by analysing millions of texts. Huge benefit can be obtained by transitioning some types of study from manual, labour-intensive methods to large-scale computational automation. One might even argue that, given the combination of shorter messages and the sheer quantity of social media data created every second, applying traditional manual methods is somewhat redundant without the ability to investigate a meaningful sample of the data generated.

In this Chapter, we describe the two different types of text data used for the various studies presented in this project, online news media and social media (specifically tweets), before detailing a modular architecture for collecting, processing, storing, analysing and visualising the data in a scalable, distributed manner, built on top of modern open source software tools.

3.1  Online News Media

Traditional research on news media relies on a labour-intensive step of human coding, whereby human experts read and annotate news articles by hand in an effort to determine whether the articles display a particular property of interest. Due to the manual process involved, the number of news articles which can be examined is limited. However, many of these steps can now be replicated faithfully by machine using algorithms from the general field of AI. This automation of science paves the way for inherently different studies to be imagined, allowing otherwise infeasible studies focusing on long-term trends across millions of articles to be investigated.

Before any of these large-scale automated studies can take place, we must first collect and store the data we wish to analyse, which can be a non-trivial task in itself. Here we specify what we mean by a news article, what properties an article has and how we collect them, along with the accompanying auxiliary information.

In the mediasphere, we consider the main unit of analysis to be the news article. For each news article, there are two related objects of interest, namely the feed and the news outlet that generated the news article. Each of these forms a layer in a hierarchical view of the media system we wish to study and provides additional information to the other layers through the concept of inheritance. By this, we mean that properties of a news outlet, which may have several feeds (feeds are the children of outlets), may be inherited by the feed. These feeds may in turn pass on information to all articles that they produce.

We start by covering what we mean by a news feed, as this is the entry point to how we collect the data, followed by describing what a news outlet is, and finally detailing news articles themselves. At each stage, we give examples of what the raw data looks like, along with some statistics on the volume of data we have available for studies.

3.1.1  Feeds

Feeds are a common component of the world wide web, allowing users to subscribe to websites and retrieve updates when new documents are available. Nearly all media outlets with an online presence offer feeds that they keep updated with their latest content. Indeed, a recent survey¹ by Ofcom, the regulator for the UK communications industries, found that a larger percentage of people now read the news online than in printed newspapers [174].

Feeds come in several formats, with the most commonly used being Really Simple Syndication (RSS) [28]. All formats are XML-based, offering structured information that can be read by machine and containing a list of pointers to the full web addresses of newly published content on the owner's website, along with a small amount of meta information, such as a title and a short description. An example of a feed as displayed in a web browser is shown in Figure 3.1, along with a sample of the raw XML data that we use to create a feed object in our system in Figure 3.2.

¹ Recent as of June 2014.

Figure 3.1: Sample of a news feed taken from the BBC News website.

Figure 3.2: Sample of the XML from a news feed taken from the BBC News website.

We use a web crawler developed within our research group named “Feed Finder”² [78] to automatically discover new feeds. Briefly, Feed Finder discovers feeds by crawling the web, starting from a given web address, such as the homepage of a newspaper's website or a webpage containing links to many news sources, and using a heuristic approach to find the most promising links to follow in a depth-first search for URLs which point to feeds. Once a URL for a feed has been found, a manual curation step takes place whereby interesting feeds are automatically given an initial annotation with the mandatory fields shared by all feeds, which are then verified by the user, along with optionally adding additional common fields, before being labelled as ready to be crawled and used for retrieving news articles. Table 3.1 shows the mandatory fields shared by all feeds, while Table 3.2 lists typical optional fields which might be manually added when the feed is added to the system. Further discussion on the crawling of feeds can be found in [143].

² Feed Finder was originally developed and maintained by Ilias Flaounas.

Table 3.1: Mandatory fields for feeds in the system.

  ID:          A unique identifier for the feed.
  Title:       The title of the feed, as specified by the host website.
  Description: A short description of what the feed contains, specified by the host website.
  URL:         The URL to where the feed is located on the web.

Table 3.2: Common optional fields for feeds in the system.

  Language: A short code denoting the language of the content the feed produces.
  Outlet:   An identifier for the outlet in the system which owns the feed.
  Country:  The country from which the feed's content originates.
  Topic:    Feeds which produce content specific to a particular topic are annotated with that topic.

In total, there are 4,480 news feeds monitored in the system, with a much larger list of 6,374,763 potential URLs that have been identified by Feed Finder as being feeds, but which have not yet gone through the manual step of being annotated and added to the system.
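The feed fields above map naturally onto the elements of an RSS document. A minimal sketch using Python's standard XML tooling and a hypothetical RSS 2.0 snippet in the spirit of Figure 3.2 (the snippet is invented, not actual BBC data):

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document, illustrative only.
rss = """<rss version="2.0"><channel>
  <title>Example News | Front Page</title>
  <description>Latest stories from Example News.</description>
  <link>http://www.example.com/news/</link>
  <item>
    <title>Example headline</title>
    <description>A short summary of the article.</description>
    <link>http://www.example.com/news/12345</link>
    <pubDate>Sat, 09 Jan 2010 12:00:00 GMT</pubDate>
  </item>
</channel></rss>"""

channel = ET.fromstring(rss).find("channel")

# Feed-level fields (cf. Table 3.1; the ID would be assigned by the system).
feed = {
    "title": channel.findtext("title"),
    "description": channel.findtext("description"),
    "url": channel.findtext("link"),
}

# One entry per <item>: the pointers used to discover new articles.
articles = [
    {"title": item.findtext("title"),
     "description": item.findtext("description"),
     "url": item.findtext("link")}
    for item in channel.findall("item")
]
```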

3.1.2  Outlets

We consider any organisation or media company that produces news in an online format to be a news outlet in our system. This broad definition allows for the inclusion of online newspapers, newswires and magazines, along with blogs and other websites that periodically update their content and push it to news feeds. While this does limit us to collecting data only from news outlets which have an online presence, this is not an unreasonable restriction given that the majority of global and national news outlets already do so. Each news outlet is annotated with a number of different fields, as shown in Table 3.3. The fields are manually annotated when an outlet is added to the system, usually at the same time as the first feed for that outlet.

Table 3.3: Common fields for outlets in the system.

  ID:       A unique identifier for the news outlet.
  Name:     The name of the news outlet.
  Domain:   The domain of the news outlet's home page.
  Location: The location in the world where the outlet is based.
  Language: The primary language of news produced by the outlet.
  Type:     The type of news outlet, e.g. newspaper, blog, newswire, etc.

The notion of a news outlet is important when we wish to conduct studies on the mediasphere as a whole, since it allows us to group together different feeds that belong to the same organisation and analyse all of their content in a way that is meaningful to human readers. This is also beneficial for studies that wish to model what users reading the news on the web prefer to click on [103]. In order to do this, each outlet is mapped to a unique domain name, which allows us to treat all content coming from the same domain as belonging to the same outlet. An example of this would be “The Guardian” (http://www.theguardian.com) and “The Observer” (http://observer.theguardian.com), which could be considered two separate newspapers, but are produced by the same company and reside on the same domain. Blogs are the one exception to this rule, since many blogs are hosted on the same domain, e.g. http://example.wordpress.com, and only differ by their subdomain. The list of 4,480 news feeds that have been discovered and added to our system corresponds to 1,986 news outlets, covering 136 different countries across the globe.
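The mapping from URLs to outlets via domain names, with the subdomain exception for blogs, can be sketched as follows (the `outlet_key` helper and the blog host list are hypothetical; a production version would need a full public suffix list):

```python
from urllib.parse import urlparse

# Hypothetical list of blog hosting domains where the subdomain
# distinguishes outlets rather than the registered domain.
BLOG_HOSTS = {"wordpress.com", "blogspot.com"}

def outlet_key(url):
    """Group content by registered domain, except on blog hosts."""
    host = urlparse(url).hostname
    parts = host.split(".")
    registered = ".".join(parts[-2:])  # naive: ignores suffixes like .co.uk
    return host if registered in BLOG_HOSTS else registered

print(outlet_key("http://www.theguardian.com/uk"))        # theguardian.com
print(outlet_key("http://observer.theguardian.com/home"))  # theguardian.com
print(outlet_key("http://example.wordpress.com/post/1"))   # example.wordpress.com
```

Under this rule, The Guardian and The Observer collapse into a single outlet, while each wordpress.com blog remains distinct.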

3.1.3  Articles

News articles are our main unit of analysis for online news media, and can be thought of as a news report generated by some news outlet. While we typically think of a news article as a single written work on a particular topic or event in a newspaper or online, our definition allows for a slightly more generic representation, including transcripts of television news programmes, written posts from a select number of blogs, and so on. An annotated example of a news article as seen on the web is shown in Figure 3.3.

Figure 3.3: Annotated news article found on the BBC News website.

We collect the news articles by first reading each of the news feeds and checking whether they contain any new content we have not yet seen. This process is known as crawling, as we are essentially crawling a subset of the web (our list of feeds) and discovering new web pages (news articles). Once a web page corresponding to a new news article has been found by the crawler, we enter it into our system as a news article, albeit with only limited information about it, such as its URL, the title of the article and the short description provided in the news feed, along with the timestamp for when the article was found.

Periodically, these articles are submitted to a web scraper which attempts to find the full text of the news article, along with an associated image if present, from the web page. To do this, we must first represent the web page in a tree structure, as defined by the HTML (HyperText Markup Language) content of the page. The full text of the news article is extracted from the web page by looking for the parent node in the tree whose children collectively have the largest number of textual characters. This enables us to remove information which is present on the web page but not directly related to the news article, such as the navigation of the web page or JavaScript content present on the page. Similarly, we find the image associated with a news article by extracting the meta information from the HTML head of the webpage, saving the image URL assigned to the news article by the publisher.

The result is a structured object in our system with a number of fields that we store and that are common across all articles. The list of common fields and a summary of the contents of each field can be seen in Table 3.4.

Table 3.4: Common fields for articles in the system.

  ID:          A unique identifier for the news article, generated as a hash of other common fields when the article is inserted into the database.
  Title:       The title of the news article, as specified by the feed that produced it.
  Description: The short description of the news article found in the feed that produced it.
  Content:     The full body of text for the article, found by applying an HTML scraper to the URL specified in the feed.
  Date:        The date the article was first seen in any of the feeds in the system.
  URL:         The URL of the webpage that contains the full body of the article, as specified by the feed.
  Outlet:      The identifier(s) for the parent outlet of the producing feed.
  Feed:        A list of feed identifiers for all feeds that produced the article.
  Image URL:   The URL of the image associated with this news article.

Further to the common fields, articles can also be annotated with any number of other fields as required. These typically contain information derived from the common fields, such as a topic classification applied by a classification algorithm, and will differ from article to article, as well as depending on the stage of its life-cycle an article is in. Generally, articles never lose fields, and will accrue more information over time as different algorithms are applied to them.

The average volume of articles collected each day follows a weekly periodic cycle, peaking midweek with the lowest point on Sunday, as illustrated in Figure 3.4. In total, we have collected over 111,400,000 articles between 1st May 2008 and 1st March 2015 from the 4,480 news feeds. Figure 3.5 shows the cumulative volume of articles collected over the entire time period.

Figure 3.4: Average volume of news articles collected by day of the week.

Figure 3.5: Cumulative volume of news articles collected.
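The scraping heuristic described in this section, choosing the parent node whose children collectively carry the most text, can be sketched with Python's standard HTML parser. This is a simplified illustration that ignores complications such as malformed markup and void elements; the example page is invented:

```python
from html.parser import HTMLParser

class Node:
    def __init__(self, tag, attrs, parent):
        self.tag, self.attrs, self.parent = tag, attrs, parent
        self.children = []
        self.text_len = 0  # characters of text directly inside this node

class TreeBuilder(HTMLParser):
    """Builds a simple DOM-like tree from (well-formed) HTML."""
    def __init__(self):
        super().__init__()
        self.root = Node("root", [], None)
        self.cur = self.root
        self.nodes = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.cur)
        self.cur.children.append(node)
        self.nodes.append(node)
        self.cur = node

    def handle_endtag(self, tag):
        if self.cur.parent is not None:
            self.cur = self.cur.parent

    def handle_data(self, data):
        self.cur.text_len += len(data.strip())

def main_text_node(html):
    builder = TreeBuilder()
    builder.feed(html)
    # The article body: the parent whose children collectively hold the most text.
    return max(builder.nodes, key=lambda n: sum(c.text_len for c in n.children))

page = ("<html><body>"
        "<div id='nav'><a href='/'>Home</a><a href='/about'>About</a></div>"
        "<div id='story'><p>Long paragraph of article text goes here.</p>"
        "<p>More article text making up the body of the story.</p></div>"
        "</body></html>")
story = main_text_node(page)  # the div with id='story'
```

Here the navigation div loses to the story div because its children contain far fewer textual characters, mirroring how boilerplate is discarded by the scraper.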

3.2  Social Media

The emergence of social networking websites has been the biggest phenomenon to appear on the Internet in recent years, allowing anyone with an Internet connection to publish text, images and videos about anything they wish for a potentially global audience. The reach of social media is still growing, with over 70% of adults who use the Internet now using one or more social networking sites [67]. In the UK, an estimated 32.1 million Internet users visited a social networking website at least once per month in 2013, a figure predicted to rise to 36.7 million by 2017 [73].

3.2.1  Twitter Data

Twitter is one such social networking site, allowing anyone with an account to publicly publish a message (referred to as a “tweet”) that is an expression of a moment or idea³. Originally, a tweet was limited to a string of up to 140 characters (so as to allow users to post tweets via SMS⁴), but the service has since been developed to allow images and videos to be attached to tweets, although the 140-character limit for text remains in place.

We focus on data collected from Twitter for our social media analysis for several reasons, both from a technical standpoint and for the value of the scientific analysis that it offers. Firstly, the format of tweets is of particular interest due to their brief and timely quality. This encourages users to post tweets which express their thoughts and opinions on their situation or experiences in real-time [248], making Twitter a truly valuable source of information. Secondly, Twitter users can choose to append location information to each tweet, offering the ability to study a population of users from a single geographical region (such as the United Kingdom) or the differences between geographical areas (different cities). From a technical standpoint, Twitter data is readily accessible and easy to collect, with tweets and their associated meta data being publicly available by default. In the next section, we detail how we collect tweets, the meta data provided by Twitter for each tweet, and some statistics on how many tweets we have collected thus far.

³ What is a Tweet?, https://about.twitter.com/what-is-twitter/story-of-a-tweet
⁴ How Twitter was born, http://www.140characters.com/2009/01/30/how-twitter-was-born/

3.2.2  Collecting Tweets

Twitter offers a Search Application Programming Interface (API)⁵ that allows any tweets matching a set of search criteria to be collected via a web feed, in a manner similar to news articles (see Section 3.1.3). The API provides the tweets in JavaScript Object Notation (JSON) format, ready to be stored in the database without the need to use a scraper to retrieve any further information. For each query of the Search API, we specify that we wish to collect the 100 most recent tweets from one of 54 specific geolocations. All queries are executed using a basic web crawler⁶ every ten minutes.

We focus our collection on tweets where Twitter knows the user's location, enabling us to ensure that any analysis is restricted to within the United Kingdom. We do this by specifying a radius of 10 km around each of the 54 most populated urban centres of the UK, collecting all results returned to us by Twitter. Table B.1 shows the list of cities from which we gather tweets. Collection of the tweets was originally implemented by Vasileios Lampos, and maintained by Ilias Flaounas before the system became maintained by myself.

Using the Twitter API, we have collected over 955,804,440 tweets between 1st July 2009 and 1st March 2015, each corresponding to one of the 54 locations from which we gather Twitter data. Figure 3.6 shows the cumulative volume of tweets collected over the entire time period. Annotated changes in Figure 3.6 correspond to: A) change of our underlying database system to a NoSQL distributed architecture; B) Twitter API Version 1.1 released, requiring authentication⁷; and C) an on-going issue with the Geographical Search API⁸. In this thesis, all studies presenting results on Twitter data (see Chapter 5) use data from before change point A and so are unaffected by these changes in data collection.

⁵ Twitter Search API, https://dev.twitter.com/docs/api/1.1/get/search/tweets
⁶ Twitter4J Java Library, http://twitter4j.org/en/index.html
⁷ https://blog.twitter.com/2012/changes-coming-to-twitter-api
⁸ https://twittercommunity.com/t/search-api-returning-very-sparse-geocode-results/27998
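The geographic restriction described above can be illustrated with a simple great-circle distance check. This is a sketch of the filtering idea only, not the collection code itself; the city coordinates are approximate and the list is truncated to two of the 54 centres:

```python
from math import asin, cos, radians, sin, sqrt

# Approximate coordinates for two of the 54 monitored urban centres (illustrative).
CITY_CENTRES = {
    "Bristol": (51.4545, -2.5879),
    "Manchester": (53.4808, -2.2426),
}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def within_collection_radius(lat, lon, radius_km=10):
    """True if (lat, lon) falls within radius_km of any monitored centre."""
    return any(haversine_km(lat, lon, clat, clon) <= radius_km
               for clat, clon in CITY_CENTRES.values())
```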

Figure 3.6: Cumulative volume of tweets collected. Annotated change points correspond to A) Change of database system; B) Twitter API version change; and C) On-going issue with Geographical Search API.

3.3  Modular Architecture for Data Curation

The transformation that data-driven approaches are effecting in a wide range of scientific disciplines is made possible by the convergence of two trends in technology: the availability of data and the technology to manage it, and the emergence of a new generation of powerful data-driven algorithms from the fields of machine learning and data mining. In order to realise the benefits of these approaches, we must first consider how to appropriately manage the data and automate tasks as required.

Significant progress has been made in optimising the performance of single-task algorithms, such as topic classification or named entity recognition, but basic tasks such as these, which are solvable by classical AI algorithms, do not directly coincide with the tasks which are of value to scientists. For example, the measurement of gender bias in a corpus, a typical task in the social sciences, is not the output of a single algorithm, but the combination of many elementary AI algorithms, such as the detection of named entities, co-reference resolution, topic classification and so forth. An important part of modern AI is understanding how these more complex tasks can be decomposed into modular tasks which can be performed independently, along with the inverse problem of composing many basic modules together in a way that produces complex behaviour without errors being propagated through the system.


In this Section, we describe our modular architecture for online data curation: our solution for the design of a large-scale intelligent system that combines multiple AI modules, allowing the automatic annotation of large amounts of data for analysing the content of the global media system. These modules are combined in a coherent and scalable framework where all communication between modules is achieved through reading and writing on a shared blackboard, with decisions made by the system being executed without any centralised control. The performance of each learning module is recorded each time it runs, so that any errors that arise can be quickly identified. On top of this, processing platforms offer the ability to perform large-scale analyses of the data annotated by the independent modules for scientific studies. This architecture was used for all studies in this thesis and has been used in numerous past studies within the research group.

3.3.1  Related Work

Designing, developing and integrating a range of different modular components in a distributed way is a challenging task, and one that is often necessary in order to carry out large-scale analysis of data using intelligent systems. Our system is designed specifically for the continuous monitoring of the global media system, falling into the intersection of media studies, a subfield of the social sciences, and the general field of AI and its subfields of machine learning, natural language processing and data mining. The few such systems that exist for continuous media monitoring are listed here.

TextMap Access [15] is a news analysis system that allows students and scholars to access large quantities of textual corpora, ranging from U.S. and international English-language newspapers, patents, PubMed and LiveJournal social media to many more sources, for use in answering questions in the data-driven social science domain. The system implements a processing flow where items are collected and follow a processing pipeline through the system, adding natural language processing (NLP) mark-up, sentiment and entity analysis, before making the data available for visualisation through a web-based interface. It has been used in several studies, including the correlation between the stock trading levels of companies and their coverage in news reports [247]; detecting the spatial distribution of named entities in the U.S. [159]; and the sentiment associated with different entities in the news [92].

The Europe Media Monitor (EMM) is another system for news analysis, supported by the European Commission [210], which focuses on the coverage of news in multiple languages across Europe, looking to capture the complementary information that may be present in news articles from different countries. It has been used for reconstructing the social networks of people mentioned in the news across languages [187]; tracking quotations in multilingual news [186]; and event extraction [222]. Other systems for event extraction from the news are Event Registry [129], built on top of the IJS NewsFeed [227], and the GDELT project [131]. Event Registry is a system for detecting events in large news corpora collected from the web, identifying related events in the real world across multiple languages. The GDELT project aims to map out global events by location and tone from a number of international news sources and make the daily updates available for further analysis.

The closest of these systems to ours is TextMap Access, which also focuses specifically on the mediasphere as a global system that requires scientific study into its structure and the patterns that arise from it on a macroscopic scale. Our system is primarily focussed on analysing the media system itself, rather than extracting particular knowledge, such as events, to be presented to the public. Our system differs from all of these systems in that it does not follow a fixed processing pipeline for each news article, and is instead the composition of many different learning systems which cooperate towards annotating the data and extracting knowledge for scientific study.

3.3.2  System Overview

Large-scale computational analysis systems in the AI literature [15, 129, 131] are often designed with a single purpose in mind and follow a simple pipeline-based architecture: data flows through the system linearly, with each analysis component following the next in a series of steps until a specific output is produced at the final step. Our system is instead designed to be modular and decentralised, decoupling the dependencies between analysis components and allowing each item in the system to be treated with different behaviour as necessary. An additional motivation for this design is that, in any real-world setting, long-running intelligent systems must be robust to changes in both the software and hardware of the system. This is achieved through the use of a blackboard architecture, comprised of modules and blackboards. Modules represent workers that carry out a single, specific task on items from the blackboards. Blackboards are common spaces for the modules to read data from and write data to, and provide the only means of communication between modules. Each blackboard contains a collection of similar items, for instance news articles or tweets, which are available for the modules to annotate. This architecture allows us to implement a simple solution to the challenging task of decomposing complex behaviour into a number of simple tasks. The idea of using a common space for the coordination of multiple agents via reading and writing data to a blackboard is a very old one in AI, dating back at least to the influential Pandemonium system of 1958, which coordinated the action of several daemons [200].

Chapter 3. Meet the Data

Figure 3.7: Simplified overview of the modular architecture for data curation, demonstrating the interaction of modules (circles) to annotate data via central blackboards.

This type of framework confers several key benefits that are increasingly important in modern intelligent systems: we require that the system is flexible, robust, and free of any centralised decision process. Flexibility here means that the processing steps any given item passes through are not hardwired but can easily change: modules can be added or removed without any interruption to the rest of the system and without changes to existing modules. This is achieved by enforcing that modules cannot communicate with each other, except indirectly through the shared blackboards, and by providing a common API through which all modules must read from and write to the blackboards. The framework is inherently distributed: modules are independent of one another and so can easily be placed on separate machines, building redundancy into the modules through replication, while the central blackboards exist as distributed data stores which are highly available, fault-tolerant and replicated. By distributing our system across many machines we fulfil an essential condition for ensuring the system is robust to the software and hardware issues that arise over time in any realistic scenario.

3.3. Modular Architecture for Data Curation

Decentralised decision processes are achieved through the information left indirectly by modules on the blackboards. Each module performs an action on an item in a blackboard, and then writes back onto the same item. The information left by one module can then be read by another module and affect its behaviour. This process of indirect coordination by leaving traces in the environment for other agents, thereby affecting their behaviour, is known as stigmergic communication; it is used in nature to generate coherent collective behaviour, for example in ant colonies [66], and is one of the ways in which complex systems self-organise without central planning or control, or even direct communication between individual agents.

We further augment the blackboard architecture with processing platforms, such as Hadoop and ElasticSearch, that sit on top of the architecture and offer the ability to perform ad-hoc batch queries (as often required by specific studies) over the already-annotated data from the blackboards at massive scale within the same framework. The entire system is monitored through a web front-end named SystemWatch, built on our internal API, which gathers details each time a module runs, such as its performance, along with statistics about each of the blackboards. An overview of the system is illustrated in Figure 3.7, showing the interaction between modules via a blackboard, together with the processing platforms for batch processing, for the example task of performing sentiment analysis on news articles.
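This stigmergic hand-off between modules can be sketched in a few lines. This is a hypothetical illustration only: the field and label names are invented, and an in-memory list stands in for the real distributed blackboard.

```python
# Minimal sketch of a blackboard module: it selects items carrying a
# control label, performs its task, annotates the item, and leaves a
# new control label as a trace for downstream modules to react to.

def run_module(blackboard, input_label, output_label, task):
    """Process every item tagged with input_label, then re-tag it."""
    for item in blackboard:
        if input_label in item["control_labels"]:
            item.update(task(item))                     # annotate the item
            item["control_labels"].remove(input_label)  # consume the trigger
            item["control_labels"].add(output_label)    # trace for next module

# Two cooperating "modules" that never call each other directly:
def scrape(item):
    return {"text": "full text of " + item["url"]}

def detect_language(item):
    return {"language": "en"}

blackboard = [{"url": "http://example.com/a1",
               "control_labels": {"needs_scraping"}}]

run_module(blackboard, "needs_scraping", "needs_language", scrape)
run_module(blackboard, "needs_language", "done", detect_language)
```

Note that neither task function knows the other exists; their coordination happens entirely through the labels left on the shared item.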

3.3.2.1 Blackboards

In our system, a blackboard is a shared, distributed data repository from which all modules can read and to which all can write. The system is composed of many blackboards, each dedicated to the storage and annotation of a single type of object: news articles are stored in one blackboard, for example, while tweets are stored in another. A full list of the blackboards in the system, along with a brief description of the type of objects each holds, is given in Table 3.5. Within a blackboard, a set of labels is also stored, used to annotate the objects with additional meta-information in one of two distinct ways. Content labels add meta-information that is a property of the object itself, such as the result of a topic classification or membership of a particular subset of the data.
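For illustration, a blackboard object and a label-based subset selection might look as follows. The field and label names here are hypothetical, not the system's actual schema.

```python
# A blackboard object: raw fields plus two kinds of label annotation.
article = {
    "url": "http://example.com/story",
    "title": "Example story",
    "content_labels": {"topic:politics", "lang:en"},  # properties of the item
    "control_labels": {"needs_sentiment"},            # requests to modules
}

def select_subset(items, required_labels):
    """Select items whose content labels include all required labels,
    as done when assembling the data for a particular study."""
    required = set(required_labels)
    return [it for it in items if required <= it["content_labels"]]

corpus = [article]
politics_en = select_subset(corpus, {"topic:politics", "lang:en"})
```

Content labels drive this kind of subset selection, while control labels drive processing, as described next.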

Table 3.5: Blackboards in the system.

Blackboard    Description
Articles      News articles from the web
Feeds         News feeds from the outlets
Outlets       News publishers with an online presence
URLs          Potential news feeds discovered by Feed Finder
Tweets        Tweets from the UK
Locations     Locations to gather tweets from
Historical    Newspaper articles from the British Library 19th Century Newspaper collection
Geonames      Collection of locations and their coordinates
NYTimes       New York Times corpus [193]
Reuters       Reuters news corpus [135]

These labels form part of the annotation of the data, and are used to separate the data into subsets according to the criteria required when conducting studies. Control labels, by contrast, form part of the system structure: they are the means by which modules communicate with each other and request that an object be processed by another module. Control labels are agnostic to the content of the objects, and are used only as part of the stigmergic communication between modules.

The blackboards are implemented as a set of collections in a MongoDB (http://www.mongodb.org/) database, with all queries for data by modules going through the architecture API. This ensures that indexes are in place to facilitate the quick execution of the queries the API provides. The API also offers a layer of abstraction, hiding the database layer from the functional core of each module, which allows us to change the underlying database system in the future, if required, without reprogramming each of the modules.

3.3.2.2 Modules

Modules are independent programs that run as part of the system and perform some action on the objects of a blackboard. Each module is implemented as an executable containing the core functionality of the module, together with a settings file holding the parameters to be used by the module each time it runs. This allows the settings to be changed easily without any change to the module's core code. The parameters contained in the settings file are detailed in Table 3.6.

Figure 3.8: Communication of modules in the system via control labels written to the blackboards.

Table 3.6: Parameter settings for the modules in the system.

Parameter            Description
Name                 Module name
Description          Short description of the module's function
Database Settings    Path to the database connection file
Process Limit        Maximum number of objects to process per run
Input Blackboards    Blackboard(s) to read from
Output Blackboards   Blackboard(s) to write to
Input Labels         Label(s) used to select the input data
Output Labels        Label(s) to be annotated on the output data
Input Fields         Field(s) of the objects to be read from
Output Fields        Field(s) of the objects to be written to

An additional advantage of this design is that, since many of the tasks we are interested in involve the same core module, we can generate many such modules quickly and efficiently by simply changing the settings. An example of this is topic classification: the core module executable need only implement a machine learning algorithm suitable for topic classification, which can then be reused many times with different input data to learn a different model for each topic. The modules themselves fall into one of three categories: input modules, analysis modules and output modules.

Input Modules

Modules that use some external source, such as the web, either to create new objects in a blackboard or to add new content to them are called input modules. All objects in the blackboards are generated via one of these input modules. Examples of input modules include the Feed Crawler, the Article Scraper, and the Twitter Crawler.

The Feed Crawler (implemented by Ilias Flaounas) is a module which, given a set of labelled feed objects in the Feed blackboard, visits each of their URLs on the web and retrieves an XML document in the RSS format, as previously shown in Figure 3.2. The module then parses this XML document, extracts each of the news articles, and inserts them into the blackboard specified in the module's settings.

The Article Scraper takes all the news articles carrying a control label indicating that they require scraping, and visits the URL of each news article in order to retrieve its full text from the web page. This is performed by downloading the full HTML document and building a parse tree of the document using the Java library jsoup (http://jsoup.org/). A calculation is then performed over the parse tree in which each text node is assigned an integer value based upon the length of text contained by all of its child text nodes. The parent node with the largest value is taken to be the container for the news article text, and this text is extracted and annotated back onto the object in the blackboard.

The Twitter Crawler (original implementation by Vasileios Lampos) is a module which, given a set of location objects in the Location blackboard, takes all those locations with a control label indicating that they require crawling, queries the Twitter Search API for new tweets published in each location, and saves the resulting XML documents as new objects in the Tweet blackboard.

Analysis Modules

Modules that use the attributes of an object from a blackboard to perform some type of analysis are referred to as analysis modules. These modules make up the bulk of the modules in the system and are used not only to annotate the blackboard objects but also to control the behaviour of other modules via the control labels. They often store their own private data, such as a model against which blackboard objects are compared or scored, which is not available to any other module, enforcing that all information shared between modules passes via the blackboards. The simplest analysis modules are those that perform a conditional replacement of labels on a blackboard's objects. For example, the Post Article Scraper module takes the news article objects carrying the control label indicating that they have been processed by the Article Scraper and, depending on the language of the news article as indicated by its content labels, will do one of two things.
If the article is written in English, it will apply a control label to pre-process the object's textual fields ready for the host of topic classification, sentiment analysis and appeal ranking modules that exist in the system, as well as applying the control labels for modules which do not require the pre-processing step, such as the geolocation, readability and linguistic subjectivity modules. If the article is written in one of the European languages for which the system has a statistical machine translation module, it will instead apply the control labels indicating that the object should be processed by the statistical machine translation module, in order to generate a "Machine English" article and add it to the Article blackboard.

Other examples of analysis modules in the system include topic classifiers trained using Support Vector Machines [51] on the Reuters [135] and New York Times [193] corpora; online topic classifiers trained from the blackboards using the Perceptron or other online linear algorithms; online appeal learners which use the difference between a pair of objects to work out the appeal of an object [70]; a geolocation module for finding and disambiguating mentions of locations within the text; and modules that perform sentiment and affect analysis by taking the inner product between a given sentiment lexicon and an object's text, along with many others.

Output Modules

Modules that output information for human consumption or visualisation purposes, and that do not perform any analysis or annotation of the data, are referred to as output modules. These modules are often used to provide the information displayed in web interfaces, such as the Clickable News application (http://clickable.enm.bris.ac.uk) or the Klaxon server alert system (https://systemwatch.enm.bris.ac.uk/SW2/klaxon/).

Figure 3.9: Screenshot of the Clickable News App.

The Clickable News App is a web application that recommends news articles to users based upon their location and the preferences of other users from nearby locations. This work builds on top of the work by Elena Hensinger [103] on predicting how appealing news articles are based upon their textual content, and is published in [70]. Originally, the appeal of news articles was learnt from signals provided by well-known news publishers, such as the BBC, for whom we are able to collect information about which news articles became "popular" on their front pages. The issue with this approach, however, is that we do not have any ground truth or click information on which articles were truly preferred by the readers. To remedy this, we wished to collect our own click information, in order to learn a model of what is appealing to our own readers. As part of this application, there is an output module that, given a list of annotated news articles from the past day, computes a list of the 12 most appealing news articles based upon the learnt preference model and other auxiliary information, such as whether the news article has been annotated with an image that we can present to the end user. A screenshot of the Clickable News App can be seen in Figure 3.9.

Figure 3.10: Screenshot of the Klaxon server alert system.

Monitoring the modules

Combining many different modules into a single intelligent system without central control is a challenging task. To ensure the proper functioning of the system, some monitoring of the modules is required, from basic tasks such as checking that they run when they should, to more complex operations such as monitoring whether the performance of the internal model being learnt is good enough not to propagate errors throughout the system.

We do this by first requiring that each module extends a Base Module implementation (originally implemented by Ilias Flaounas before being extended) that records details about the module each time it runs, along with its settings, allowing the results of any module run to be reproduced later. We additionally allow modules to report their own performance each time they run, leaving a trace of how well each learning module believes it is performing. This plays an important role in a system containing many modules that encapsulate machine learning methods actively learning in an online fashion from the data. If a module is under-performing and outputs annotations with a high level of error, then when later modules ingest the error-prone annotations the errors are amplified and propagated further. For this reason, monitoring the system can be crucial in catching problems early. This monitoring is performed in SystemWatch, a web interface offering a real-time overview of the system and its modules: it shows users basic statistics of the blackboards, such as the number of objects in each blackboard, how often each module runs and its performance (where appropriate) for that run, and the default settings for each module, along with the Klaxon server alert system for monitoring various aspects of the physical servers on which the entire system runs. A screenshot of Klaxon can be seen in Figure 3.10.

3.3.2.3 Platforms

In normal operation, modules typically process data that has recently been added to the system by one of the input modules. Once every relevant module has finished with an object, it will often not receive any further annotation for a long time, until a new module is added to the system. At that point it is often neither sensible nor feasible to run the new module over every single object in a blackboard sequentially, as would be typical in the daily running of a module. When a new module is added there are two options: either only objects added to the system from that point onwards are annotated by it, or a large batch process is run in order to annotate the entire past history of data with the new module. For studies based upon the data in the system, it is also often necessary to perform analyses over the entire contents of a blackboard at once. To meet these demands, we deployed batch processing platforms that sit on top of the blackboard and module architecture and give the ability to process an entire blackboard as a single job. Currently, the two processing platforms we have deployed are Apache Hadoop (http://hadoop.apache.org) and ElasticSearch (http://www.elasticsearch.org/), each of which has different use-cases, as detailed below.

Apache Hadoop

Hadoop is based upon the MapReduce framework first developed at Google, which was subsequently published [59] before being implemented by Yahoo! and released as an open-source project hosted by the Apache Software Foundation. The basic premise of MapReduce is that processing tasks can be easily parallelised over a cluster of machines by splitting the program into a series of phases. The first phase generates key-value pairs from the data relating to an attribute of the data we wish to extract; this is known as the Map phase, as each data point is mapped to one or more key-value pairs which are then output to the next phase. In the second phase, the key-value pairs are grouped by key and processed together, allowing all pairs that share the same key to be aggregated or combined as required, in the so-called Reduce phase. When the data points are independent of one another, as is often the case for news articles or tweets, this allows the processing to be easily scaled out to many machines, with the management of the job almost entirely handled by the Hadoop framework once the programs for the Map and Reduce phases have been specified.

An example use-case of Apache Hadoop in our system is when we wish to introduce a new module into the system and require that the entire past history of data is also annotated by it. Say, for example, that we introduce several new modules that perform topic classification on text. Instead of a full sequential scan of the data in a blackboard, which would take a considerable amount of time, we can implement a MapReduce solution and process all the data in a batch job.
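The two phases can be illustrated with a small pure-Python simulation. A real job would be written against the Hadoop API; the documents and topic "models" here are invented stand-ins.

```python
from collections import defaultdict

# Map phase: for each document, emit (doc_id, (topic, score)) pairs,
# one per topic classification model.
def map_phase(docs, models):
    for doc_id, text in docs.items():
        for topic, classify in models.items():
            yield doc_id, (topic, classify(text))

# Reduce phase: group the emitted pairs by key (the document id) and
# recombine them into a set of topic annotations per document.
def reduce_phase(pairs):
    grouped = defaultdict(dict)
    for doc_id, (topic, score) in pairs:
        grouped[doc_id][topic] = score
    return dict(grouped)

docs = {"a1": "election vote parliament", "a2": "goal match striker"}
models = {
    "politics": lambda t: float("vote" in t),
    "sport": lambda t: float("match" in t),
}
annotations = reduce_phase(map_phase(docs, models))
```

Because each document is mapped independently, the Map phase can be split across arbitrarily many machines, which is exactly the property Hadoop exploits.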
First, a map task would be constructed that compares each of the topic classification models in turn to the relevant text of a blackboard object, be it a news article or a tweet, and for each topic classification model emits a key-value pair where the key is an identifier for the object and the value is itself an object containing the name of the classification topic along with the score assigned to the text by the topic classifier. The reduce task would subsequently group together all the key-value pairs emitted from the Map phase, and recombine them to produce a set of topic annotations for each object by its key.

ElasticSearch

ElasticSearch is an open-source implementation of a distributed inverted index, built on top of Apache Lucene (http://lucene.apache.org). ElasticSearch maintains an inverted index representation of a corpus or set of corpora, allowing very fast retrieval of the objects whose field attributes match those in a query. Furthermore, ElasticSearch scales out the index so that it resides across multiple machines in a distributed manner, offering a RESTful API (REST: http://en.wikipedia.org/wiki/Representational_state_transfer) that allows the data to be queried from a variety of programming languages, either as part of a module or to support web interfaces.

An example use-case of ElasticSearch in our system is when we wish to select a subset of our data based upon whether it contains particular key words or phrases, rather than selecting a subset based upon the intersection of the content labels. For example, to conduct a study about how a particular event is covered in news articles, we may want to focus specifically on those news articles which mention the given event. While there is not a content label for each and every event that may appear in the corpus, we can still retrieve all news articles containing a given phrase, or set of phrases, related to the event of interest using the inverted index structure.
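The retrieval pattern can be sketched with a toy inverted index. This simulates the idea behind ElasticSearch rather than using its API; the documents and queries are invented.

```python
from collections import defaultdict

# Build an inverted index: term -> set of document ids containing it.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Retrieve documents containing every term of the query by intersecting
# posting lists, rather than scanning the whole corpus.
def search(index, query):
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {
    "a1": "Olympic games open in London",
    "a2": "Parliament debates the budget",
}
index = build_index(docs)
hits = search(index, "olympic games")
```

The cost of a query depends on the size of the posting lists involved, not on the size of the corpus, which is what makes this structure attractive at the scale of millions of articles.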

3.3.3 Discussion

The motivation guiding the development of the blackboard system is that we want to be able to collect, store and analyse large quantities of data in a scalable way, using many different machine learning and natural language processing techniques, without creating unnecessary dependencies between them. It is also important that the system is robust to changes in software and hardware, allowing it to continue functioning properly over long periods of time with minimal intervention, so that the data collected can be used for longitudinal studies of the traditional news and social media. The modular architecture facilitates this by encapsulating the programming logic in discrete modules that run independently of each other, without needing to communicate with, or even know of, the other modules in the system. When modules need to be updated, either because our requirements change or because of external changes, such as an update to the API of an external service, it is a simple matter of redeploying a single module.

Another important design feature is that documents within the blackboards are represented in a flexible, extensible way, allowing modules to operate on any document in the blackboards, whether it is a long-form news article, a short message from social media, or any other type of document. By doing so, we reduce the need to create specialised modules or algorithms for each type of document, allowing the same algorithms to be deployed for many different use cases while keeping the number of modules under control.

A drawback of the current implementation is that, should we wish to massively scale up the data collection, we would need to replicate modules, running instances on many different machines, in order to keep processing the data as it arrives. While this does not pose a problem per se, the management of an escalating number of modules may become difficult. To address this challenge, the adoption of a service-oriented architecture for the modules would be appropriate. This could be implemented by removing the algorithmic logic from the modules and offloading it to a set of distributed services which the modules call to process the data on their behalf, leaving the modules simply to decide which data needs processing and to write the results back to the blackboard, as they do currently. The service-oriented architecture would therefore expose a set number of services to which the modules can send data for processing, and these could be scaled out across as many machines as needed to keep up with demand.
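A sketch of such a service-oriented module follows. The names are hypothetical and, for illustration, the remote service is simulated by a local function; in practice it would sit behind a network API and be replicated across machines.

```python
# In a service-oriented design the module no longer contains the
# algorithm; it only decides which items need processing, calls a shared
# service, and writes the result back to the blackboard.

def sentiment_service(text):
    """Stand-in for a remote, independently scalable processing service."""
    return {"sentiment": 1 if "good" in text else -1}

def run_service_module(blackboard, call_service):
    for item in blackboard:
        if "needs_sentiment" in item["control_labels"]:
            item.update(call_service(item["text"]))  # offloaded computation
            item["control_labels"].remove("needs_sentiment")
            item["control_labels"].add("sentiment_done")

blackboard = [{"text": "a good result",
               "control_labels": {"needs_sentiment"}}]
run_service_module(blackboard, sentiment_service)
```

The module's own logic stays small and fixed, so scaling up means replicating the stateless service rather than managing ever more modules.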

3.3.4 Contribution

The system has been developed over a number of years, with many different people in the research group contributing different parts. As such, no one person is solely responsible for the entire system. Here I will attempt to assign credit where it is due for different parts of the system.

System design and API

The initial design of the blackboard system was covered in the NOAM system [80], running on top of a MySQL database, which was the previous incarnation of the news media monitoring system within the group. The system was then redesigned by Ilias Flaounas, with input from myself, with the aim of being further decentralised and distributed, requiring a change of the underlying database to MongoDB. With this, Ilias implemented the architecture API, in conjunction with myself and Panagiota Antonakaki, which allows each of the modules to communicate with the blackboards, along with storing basic information about the modules. I have since extended the architecture API and currently maintain it, adding additional functionality as needed, such as automatically recording the performance of modules on each run.

Blackboards

Data has been collected over a long period of time, starting around 2008. All of the data from the previous incarnation of the system was imported into the current system, with the modules for data collection (input modules) updated accordingly. Ilias Flaounas performed this data migration, which covered all the blackboards apart from the Historical blackboard, which was populated by myself. The ongoing collection of data, and the maintenance of all the blackboards and the machines they run on, is my responsibility.

Modules

Within the system there are a number of modules that collect, annotate and visualise the data. The original base module implementation was created by Ilias Flaounas, and has since been extended by myself to store additional information about each module when it runs. The Feed Crawler and Article Scraper modules were implemented by Ilias Flaounas based upon code by Omar Ali, and have since been updated and heavily modified by myself. The Twitter Crawler was originally implemented by Vasileios Lampos, and was further modified and maintained by Ilias Flaounas, Fabon Dzogang and myself. Working with Ilias, Panagiota Antonakaki implemented the modules which replace labels based upon certain conditions (Post Feed Crawler, Post Article Scraper, Post Article Indexer), as well as the Geolocator (based upon work by myself and Aleksandr Tkatsenko) and the modules that take the inner product between a text and a module's internal model or lexicon (the sentiment analysis modules). The TF-IDF feature extractor module, the SVM topic classifiers, and the readability and linguistic subjectivity modules were created by Ilias Flaounas. The online learning modules were implemented by myself and trained by Saatviga Sudhahar, based upon code by Panagiota Antonakaki. The scheduling and daily running of the modules is my responsibility, along with maintaining, fixing and debugging modules within the system.

Platforms

Processing platforms are used within the system when large, ad-hoc batch jobs need to be performed on the data. Both the Apache Hadoop and ElasticSearch platforms were installed, configured and deployed by myself, along with the ongoing maintenance and development of the solutions which run on them.

Interfaces

The system has several interfaces which are used either to gather information about the system or to output content for users to consume. SystemWatch was originally made by Ilias Flaounas, but has been maintained by myself to incorporate more information from the modules. The Klaxon server alert system, for monitoring the health of the blackboards, modules and underlying physical servers (disks, server load, backups etc.), was created by myself. The Clickable News application was created by myself in collaboration with Saatviga Sudhahar (providing the news to recommend) and Fabon Dzogang (training the underlying models).

3.4 Summary

In this Chapter we have introduced the notions of feeds, outlets and articles within the mediasphere, along with messages from the social network Twitter known as tweets. We have described how we collect these from the web, and given details of the news and social media corpora that will be used as part of this project. We went on to present the modular architecture developed in collaboration with others within the research group, as well as presenting a few interfaces that form part of the system. The next Chapter focuses on extracting textual characteristics from large-scale corpora and, through two studies, one on writing style in news media and the other on an efficient method for large-scale topic classification, shows how automated methods can help enrich a massive text corpus by adding useful annotation and reveal macroscopic patterns in the data.


CHAPTER 4

Extracting Textual Characteristics at Scale

When transitioning from analysing a few hundred documents, as often performed in some disciplines [90, 133], to millions of documents, there are many considerations to take into account, some of which were demonstrated in the last chapter. Perhaps most importantly, tackling these large-scale challenges requires scalable methodologies, which more often than not differ from those that perform best on smaller corpora [95]. In this Chapter, we explore scalable approaches to annotating documents with a number of concepts extracted from the text, with experimental results demonstrating the approach in each study. We report on two studies. The first, first published in [79], is a large-scale analysis of similarities in writing style among several news media outlets and between topics, with findings supporting the thesis that the application of large-scale automated methods can detect macroscopic patterns in the contents of large amounts of data. The second study, first published in [122] and extended in [121], explores a relational approach to classification using a graphical (network) representation of textual document corpora based upon document similarity. This study includes consideration of the initial graph construction, a non-trivial challenge to make scalable, and the subsequent label propagation on the graph, which our results show leads to state-of-the-art performance in topic classification.


4.1 Annotation for Computational Social Science

In the social sciences, when one wishes to perform a quantitative study, the largest limiting factor on its scope is the number of assistants available to annotate the data by hand, in a process known as coding. For example, in [90] two assistants were required to code 113 news articles, while in [133] three assistants were required to code 42 issues of two newspapers. When larger-scale studies need to be accomplished, with sample sizes of a few thousand articles, it typically requires a coding team working over a period of several months to produce the data [136, 137, 138]. Coding is a slow process involving manual analysis of the raw data, often text, in order to transform it into a format that can later be processed and analysed with statistical software. This limit on the amount of human attention that can be devoted to coding has given rise to strong interest in recent years in computational methods that can bypass the coding step and process the raw data without the need for human coding. In this section we present a large-scale study using automated annotation of a textual corpus to investigate the content of 2,490,429 articles from online news outlets. Since this is a new approach, we wished to validate the outcome by choosing two characteristics of text, namely readability and linguistic subjectivity, together referred to as the writing style, and attempting to replicate earlier findings from the social sciences, but at a much larger scale. This study highlights how automated approaches can perform content analysis at a scale which was previously impossible, with fairly predictable outcomes that validate our approach.

4.1.1 Measuring Writing Style

In this study, we wished to automate manual processes that analyse different aspects of a text for writing style. The two characteristics of text we chose to investigate, readability and linguistic subjectivity, have their roots in traditional journalism and serve as useful properties for validating that computational methods can indeed aid the social sciences in automating much of the labour currently required to perform quantitative studies.

4.1.1.1 Readability

Readability is a measure long used by scholars to determine the ease with which a given text can be comprehended [54, 109, 211]. In the domain of journalism, it has previously been found that a high readability score increases the satisfaction of news readers, and it is therefore considered a desirable property of a news article by an editor [29, 86]. Readability measures have more recently been applied to information intended for widespread consumption, such as public health, government or financial documentation [18, 132, 42], with the aim of ensuring that the information is written in a simple style that is accessible to the general population.

We used the popular Flesch Reading Ease Test (FRET) [81] for our readability measurements in this study. FRET generates a score that typically ranges from 0 to 100 for English-language texts, where a higher score denotes a text that is easier to comprehend. Helpfully, the scores can be interpreted in terms of the level of education necessary for the reader to properly understand the text: scores of 0–30 require a university education, scores of 50–60 are appropriate for readers who have finished secondary school, and scores of 90–100 are easily understandable by the average primary school leaver. The FRET score F(x_i) of an English-language text x_i is calculated from linguistic properties of the text, specifically the average number of words per sentence (WPS) and the average number of syllables per word (SPW), in the following way:

    F(x_i) = 206.835 − 1.015 · WPS(x_i) − 84.6 · SPW(x_i),    (4.1)

where the constants are language-dependent and were chosen empirically to give the aforementioned range of 0 to 100¹ [81].
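As an illustration, Equation 4.1 can be sketched in a few lines of Python. Note that the vowel-group syllable counter below is a crude stand-in for the dictionary-based syllable counting a production implementation would use, and the tokenisation is deliberately naive.

```python
import re

def count_syllables(word):
    # Crude heuristic: one syllable per group of consecutive vowels.
    # A real implementation would use a pronunciation dictionary.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Score a text with Equation 4.1; higher scores are easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence")
    wps = len(words) / len(sentences)                    # words per sentence
    spw = sum(map(count_syllables, words)) / len(words)  # syllables per word
    return 206.835 - 1.015 * wps - 84.6 * spw
```

Short, monosyllabic sentences score above 100, while long sentences full of polysyllabic words can score below 0, consistent with the score not being strictly bounded to the 0–100 range.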

4.1.1.2 Linguistic Subjectivity

Ideally, the news we read would portray only the facts about a particular topic or event. In reality, news is rarely free from bias and attempts to influence a reader's opinion to align with that of the news outlet. Linguistic subjectivity is a measure of how polarised the sentiment in a text is, and whether this sentiment is positive, negative or neutral. It represents a form of bias related to the choice of words used to convey information. Previous studies [99, 100, 232] have shown that sentiment bias can be associated with the choice of adjectives used, and the common vocabularies of several languages have already been numerically annotated with the sentiment imparted by each word [99].

We measure the linguistic subjectivity of news articles in this study by calculating the percentage of subjective adjectives relative to the total number of adjectives found in a news article. Adjectives are first identified by parsing the text with the Stanford Log-linear Part-Of-Speech Tagger [225, 226]. Each adjective is then analysed for its sentimental content and determined to be either positive, negative or neutral. This is performed by retrieving the two weights, positive and negative, for sentimental orientation found in the SentiWordNet version 3.0 database [11, 75]. Each weight can take a value between 0 and 1, where 0 represents no sentimental bias and 1 represents very strong sentimental bias in that direction. Additionally, a word can be annotated with many different weights depending on the context in which it is used. For such cases, we took the average score over all contexts, grouped by positive or negative weight. We considered an adjective to be subjective if it received a weight greater than 0.25 in either direction. Typical linguistic subjectivity scores range from 15% to 30% of the adjectives present being deemed sentimental.

¹ It should be noted that, due to the way the score was originally constructed, it is not guaranteed to fall within this range.
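The per-adjective decision can be sketched as below. The tiny lexicon is a hypothetical stand-in for SentiWordNet's per-sense (positive, negative) weights, and a POS tagger is assumed to have already extracted the adjectives.

```python
# Hypothetical per-sense weights: word -> list of (positive, negative) pairs.
# A stand-in for SentiWordNet; the words and weights are invented.
LEXICON = {
    "terrible": [(0.0, 0.75), (0.0, 0.625)],
    "nice":     [(0.5, 0.0)],
    "red":      [(0.0, 0.0)],
}

THRESHOLD = 0.25  # an adjective is 'subjective' above this averaged weight

def is_subjective(adjective):
    senses = LEXICON.get(adjective.lower(), [])
    if not senses:
        return False
    # Average each polarity over all senses, as done in the study.
    avg_pos = sum(p for p, _ in senses) / len(senses)
    avg_neg = sum(n for _, n in senses) / len(senses)
    return avg_pos > THRESHOLD or avg_neg > THRESHOLD

def linguistic_subjectivity(adjectives):
    """Percentage of an article's adjectives that are subjective."""
    if not adjectives:
        return 0.0
    subjective = sum(1 for a in adjectives if is_subjective(a))
    return 100.0 * subjective / len(adjectives)

print(linguistic_subjectivity(["terrible", "nice", "red", "red"]))  # → 50.0
```

Averaging over senses before thresholding means a word that is strongly subjective in only one rare sense may still fall below the 0.25 cut-off.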

4.1.1.3 Validating Writing Style Measures

Proper validation of methods is usually accomplished by measuring results against a commonly agreed gold standard. Unfortunately, there is no agreed gold standard for sentiment analysis [179] which could be used to validate the linguistic subjectivity score, and the same is true of the readability score. However, some sanity checks can be performed as a form of validation in the absence of a ground truth or gold standard.

For readability, this was accomplished by looking at news that is specifically written for children, and which should therefore be more readable than news targeted at adults. For this purpose, 10,000 news articles collected from the BBC children's current affairs programme CBBC-Newsround were scored for readability and compared to other topics. As expected, CBBC-Newsround articles were significantly more readable than any other topic, with a mean readability score of 62.50 and a standard error of the mean (SEM) at 99% confidence of 0.27.

As with readability, linguistic subjectivity was validated by finding a set of articles which one would reasonably expect to be more subjective and contain more bias than other news articles. This was done by collecting Opinion and Editorial (OP/ED) news articles. In total, we collected 5,766 news articles in the category OP/ED from a variety of news outlets during the period of the study. The mean linguistic subjectivity of OP/ED articles was 31.20% with an SEM of 0.20%, compared to a mean of 22.41% with an SEM of 0.02% for the other topics.
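These comparisons amount to computing each group's mean together with a 99%-confidence standard error and checking that the resulting intervals do not overlap. A minimal sketch, using made-up scores and the normal-approximation multiplier z = 2.576 for 99% confidence:

```python
import math

def mean_and_sem(scores, z=2.576):
    """Mean and half-width of a ~99% confidence interval for the mean.

    z = 2.576 is the normal-approximation multiplier for 99% confidence,
    matching the 'SEM with 99% confidence' intervals reported above."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    return mean, z * math.sqrt(var / n)

# Hypothetical readability scores for two groups of articles:
newsround = [62.1, 63.0, 62.4, 62.8, 62.2]
other = [44.5, 46.0, 45.2, 44.8, 45.5]
m1, e1 = mean_and_sem(newsround)
m2, e2 = mean_and_sem(other)
# Non-overlapping intervals support "significantly more readable":
assert m1 - e1 > m2 + e2
```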

4.1.2 Writing Style in Topics and Outlets

In this study, we focus on two elements for analysis, news outlets and topics, comparing their differences in writing style. Using the architecture described in Section 3.3, we collected 2,490,429 news

Table 4.1: Articles labelled with each of the 15 topic categories for writing style comparison.

    Topic               Articles     Topic            Articles
    Art                   42,896     Markets            24,319
    Business             126,494     Petroleum          21,236
    Crime                277,626     Politics          201,776
    Disasters             83,828     Religion           34,441
    Elections             28,656     Science            10,076
    Environment           16,103     Sports            141,665
    Fashion                1,284     Weather             8,505
    Inflation Prices       2,331     Total           1,037,359

articles from the Main feed of 498 different English-language news outlets spanning a continuous ten-month period from 1st January 2010 to 30th October 2010. News articles were first preprocessed as described in Section 2.1.1 and then automatically annotated with 15 non-exclusive topic labels (an article could belong to multiple topics) using the well-established approach of Support Vector Machines (SVMs) [51] trained on two well-known corpora, the Reuters corpus [135] and the New York Times corpus [193], with one SVM per topic. The SVMs were implemented, as described in Section 2.2.1.1, using the LibSVM library [35], with cosine similarity as the measure of similarity between news articles and the C parameter of each SVM tuned empirically. The total number of news articles falling into each topic category can be seen in Table 4.1.

We were left with 1,564,018 news articles which did not fall into any of our 15 topic categories, for two reasons. Firstly, a news article may belong to a topic which is not covered by the 15 topics that we have trained SVMs to detect, and so falls into the unlabelled category. For example, we do not have a classifier trained for Entertainment articles here, so articles on that topic are likely to be unlabelled. Secondly, as a design choice we prefer high-precision topic classification at the expense of recall. This is justified by the motivation to have very clean data (where we are confident that the labels are correct) while accepting the trade-off that we generate less labelled data as a consequence. Since we started with a reasonably large number of articles, a lower recall is affordable: the resulting topic-labelled dataset still contains 926,411 news articles with an average of 1.12 topic labels per article. Following topic classification, each article was annotated with its readability and linguistic subjectivity as described in Section 4.1.1.
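The one-classifier-per-topic, multi-label annotation scheme can be sketched as follows. A simple centroid/cosine scorer stands in for the trained LibSVM models, the decision threshold plays the role of trading recall for precision, and the topic names and training snippets are invented for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class TopicAnnotator:
    """One scorer per topic; labels are non-exclusive (multi-label)."""

    def __init__(self, threshold=0.5):
        self.centroids = {}
        self.threshold = threshold  # higher -> higher precision, lower recall

    def fit_topic(self, topic, documents):
        centroid = Counter()
        for doc in documents:
            centroid.update(Counter(doc.lower().split()))
        self.centroids[topic] = centroid

    def annotate(self, document):
        bow = Counter(document.lower().split())
        return {t for t, c in self.centroids.items()
                if cosine(bow, c) >= self.threshold}

ann = TopicAnnotator(threshold=0.3)
ann.fit_topic("sports", ["the match ended in a draw",
                         "goal scored late in the match"])
ann.fit_topic("markets", ["shares fell as the index dropped",
                          "the index rose on trading"])
print(ann.annotate("a late goal decided the match"))  # → {'sports'}
```

An article scoring above threshold for several topics receives several labels, matching the non-exclusive labelling used in the study.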
Separating the articles into groups based upon their topic and later their outlet allowed us to analyse the writing style characteristics as a feature of the two elements, topic and outlet.

4.1.2.1 Comparison by Topic

We are interested in how writing style differs among news content covering different topics. For each of the 15 chosen topics, 10,000 randomly selected articles were used to represent the topic for both the readability measure and the linguistic subjectivity score. This random selection acts as a normalisation step for the comparison between topics, which, as shown in Table 4.1, can vary greatly in the number of articles; any topic category which did not contain 10,000 articles had every available article selected.

Two additional “topics” were also included in our analysis. We created a Most Popular topic, using 24,409 articles collected from the Most Popular feeds of 16 of the 498 outlets under consideration (the ones for which we had a Most Popular feed) which also appeared in the Main feed of these outlets, following the methodology in [103]. These articles represent the news articles which appeared on the “front page” of a news outlet and subsequently became popular, allowing us to analyse the writing style of the most popular published articles and see how they compare with each of the topics. Additionally, a Random topic category was created by selecting 10,000 articles at random from all topic categories; it serves as an average score for both the readability and linguistic subjectivity measures.

The mean readability score for each topic can be seen in Figure 4.1, with error bars showing the SEM. Sports, Art and Fashion were the topics found to be most easily read, while Science, Environment and Politics were the least readable. The results are not particularly surprising, with more complicated, far-reaching topics tending to have lower readability scores than topics covering leisure or entertainment-type news. The linguistic subjectivity score for each topic can also be seen in Figure 4.1, with error bars again showing the SEM.
The ranking of topics is again somewhat predictable, with the notoriously subjective topics Fashion and Art being the two most linguistically subjective, while the somewhat drier topics of Politics, Elections and Business fill the bottom positions. The relatively high ranking of Petroleum, the third most subjective topic examined, is perhaps not entirely expected. A significant correlation of 73.49% was found between the two studied elements of writing style (Spearman correlation, p = 0.0018). This can be interpreted as: the more readable a topic is, the more linguistically subjective it tends to be. It is possible that this reveals an association between the two

elements of style that has emerged over time in journalistic writing. Sports coverage appears to be the topic that most shrugs off this convention, enjoying a high readability score while lying near the centre of the topic ranking for subjectivity. This may be due in part to the reporting of game statistics, a source of objective information broken down into short snippets for quick reading consumption.

Figure 4.1: Scatter plot of topics by their mean readability score against their mean linguistic subjectivity score.

4.1.2.2 Comparison by Outlet

We were also interested in comparing the writing style of news outlets. Sixteen major news outlets that each publish a widely distributed newspaper were selected for analysis from the 498 news outlets used in the topic analysis: eight from the UK and eight from the US. Of the eight UK news outlets chosen, four publish a broadsheet newspaper, while the remaining four are tabloids. This selection allows for a comparison between UK and US outlets, along with a comparison of UK broadsheets to UK tabloids. Table 4.2 shows the number of articles collected from the sixteen news

Table 4.2: Articles collected from the main feed of the 16 news outlets chosen for writing style comparison.

    US News Outlets            Articles     UK News Outlets     Articles
    Chicago Tribune               5,477     Daily Mail            24,326
    Daily News                    2,212     Daily Mirror           7,731
    Los Angeles Times             6,696     Daily Star             8,946
    New York Post                32,033     Daily Telegraph       22,682
    New York Times               11,508     Independent           43,557
    The Wall Street Journal      12,300     The Guardian          15,393
    The Washington Post           7,228     The Sun                9,048
    USA Today                     6,208     The Times              2,957

outlets' Main feeds during the period of this study.

The readability score for each news outlet can be seen in Figure 4.2, with error bars showing the SEM. The Sun and the Daily Mirror were the news outlets with the highest readability scores, not entirely dissimilar to CBBC-Newsround, used earlier as a sanity check for validation. The news outlets with the lowest readability scores were The Guardian and USA Today. The Guardian's readability score is significantly lower than that of the other UK broadsheets and may reflect the highly educated demographic of readers the outlet caters for. Somewhat surprisingly, both the Daily Mail and the Daily Star have lower readability scores than one would expect for a tabloid newspaper.

The linguistic subjectivity score for each news outlet can also be seen in Figure 4.2, with error bars again showing the SEM. All four of the UK tabloid newspapers top the ranking for linguistic subjectivity, with The Wall Street Journal and The Washington Post using the least subjective language. Interestingly, the New York Times, Los Angeles Times and The Times all appear to make almost identical use of subjective adjectives in their content, suggesting they are targeted at similar reader demographics.

News outlets that are close together in Figure 4.2 have similar writing styles; their readability and linguistic subjectivity are comparable, suggesting they are aimed towards the same readers. The Sun and Daily Mirror occupy their own region of the scatter plot, being the only two high-readability, high-subjectivity news outlets. It is also noticeable that there is a lack of high-readability, low-subjectivity content coming from the major news outlets. It is, however, difficult to tell

whether this can be seen as an opportunity for an outlet to position itself in this space, or whether there is a lack of readers looking for this style of content. Overall, it appears most news outlets are aimed at a reading level equivalent to somewhere between finishing secondary school and university, with a varying amount of subjectivity depending on the reader's taste.

Figure 4.2: Scatter plot of US and UK news outlets by their mean readability score against their mean linguistic subjectivity score.

4.1.3 Discussion

This study has illustrated both the need for computational methods to automate quantitative studies in the social sciences and the validity of such approaches when applied at a large scale. The findings are validated with sanity checks and support existing studies in the social sciences. Such approaches remove much of the manual labour involved in conducting content analysis studies while allowing macroscopic patterns to be detected from the content. While these methods broaden the scope of quantitative studies, it should be noted that automated approaches such as these are still in their infancy


and lack the depth that human coding can provide. Additionally, many details of the content are lost in this process, resulting in shallow but broad findings. For example, we can easily find that the New York Times has a similar level of subjective bias to The Times, but without human coding it is more difficult to determine whether they treat individual entities in the same way, or hold similar views on a multitude of events and topics.

These methods also rely on algorithms and approximations that may introduce some inaccuracies. The FRET score, for example, was approximated and empirically tuned by Rudolph Flesch to fall within the desired range of 0 to 100 [81]. While it is not strictly guaranteed to fall within this range, it does so in all but rare cases of extremely complex or extremely simplistic natural language, and the score correlates at 70% with the McCall-Crabbs reading tests [153]. The FRET score also relies on the text being tokenised and split into sentences, which may introduce some error, for instance if a token is incorrectly split into multiple tokens or a sentence boundary is not correctly identified.

The linguistic subjectivity score likewise makes some approximations in its calculation which should be kept in mind. When resolving the sentimental polarity of an adjective, we take the average sentiment score for the adjective from SentiWordNet [11, 75], which may correspond to the adjective being used in contexts other than the one in question. Another potential source of error is in the detection of adjectives by the POS tagger, since it is possible that not every adjective is detected or that terms which are not adjectives are labelled as such. Given the high performance of the Stanford parser, which achieves 97.24% accuracy in assigning POS tags [225], this is not believed to be an issue, but it should nonetheless be kept in mind when interpreting the results.
Both of these issues could be addressed by developing a more sophisticated approach to linguistic subjectivity, using supervised learning to train a regression model that predicts the subjectivity of a given text, although such a model would rely on its own approximations and on an appropriate subjectivity training set.

Each of the writing style methods in this study scales linearly in the number of documents scored, since each method relies only on local properties of the text, such as the number of words in a sentence or the syllables in each word. The most computationally expensive step in this study, assuming one has already trained the SVMs for topic classification, is the reliance on SentiWordNet as a global resource for retrieving the sentiment scores of each adjective. In practice, however, the SentiWordNet database is small enough that it can be shared via the distributed cache in a Hadoop framework, or simply copied to each machine in a high-performance computing (HPC) cluster as part of the job submission script.


More advanced studies in this area should focus on separating out which events and entities each news outlet prefers to cover in each topic, and how they treat each of these subjects. For example, do all the outlets cover the same types of crimes? How positively or negatively are different entities treated in each case? As always, one must strike a balance between using more sophisticated methods that work well at a small scale (often limited by computational complexity) and large-scale studies that allow macroscopic patterns to be observed.

4.1.4 Contribution

The work in this study on writing style in outlets and topics was jointly performed with Ilias Flaounas. As such, I will outline the contribution of each author to various aspects of the study. The methodological design of the study was performed by Ilias Flaounas. The analysis presented here, along with the implementation and calculation of the results for readability and linguistic subjectivity was completed by myself.

4.2 Label Propagation for Text Classification

Text classification is a common task in machine learning that has received much attention over the years. It has often been tackled by looking for exact solutions to difficult optimisation problems, as is the case, for example, with SVMs [51] (see Section 2.2.1.1). However, given the fast pace at which data is generated and consumed, along with the wide range of topics that documents often cover, a different way of handling this task is often required, one that can address the limitations of more traditional approaches. More explicitly, we wished to investigate relational approaches to classification with the additional properties of being online, multiclass and multi-label.

As previously outlined in Section 2.2.1, supervised learning algorithms for binary classification take a set of training data containing n data samples Z_n = {(x_1, y_1), …, (x_n, y_n)}, where x_i is a data sample and y_i ∈ {−1, 1} is the label for the assigned class, and use it to learn a model which can predict the label y on unseen data. In this setting, traditional offline or batch algorithms operate on the entire training set Z_n to learn the model. In contrast, online or incremental methods learn by operating on a single data sample at a time, incrementally adjusting the model on the fly. These methods are particularly well suited to data streams, where new labelled data is continually received and can be learnt from, and where the number of data samples is effectively limitless. Additionally, we are interested in the scenario where the number of potential classes can grow online as new labels appear within the incoming labelled data (known as multiclass classification).


Due to the conceptual ambiguity often associated with topic categories in text documents, we further wanted to allow each document to be assigned multiple labels (multi-label classification). In this setting, each labelled data sample x_i has a label set y_i ∈ 2^p, where p is the (unbounded) number of classes a document can be assigned to. In this study, first published in [122] and later extended in [121], we focus on label propagation within graphs. We summarise previously used approaches to label propagation on graphs for text classification, before detailing the two main processes into which our proposed method can be broken down: efficiently constructing the graph from the corpus of documents, followed by the propagation of labels on the graph for the purpose of classifying the documents. The methodology is empirically tested on a corpus of 800,000 documents and compared with state-of-the-art performance from SVMs. The findings show that our proposed method outperforms SVMs at this task, while also providing the additional advantages of being online, multiclass and multi-label.

4.2.1 Related Work

There is a growing interest in the problem of propagating labels in graph structures. Previous work by Angelova and Weikum [8] extensively studied the propagation of labels in web graphs, including a metric distance between labels, and assigned weights to web links based upon content similarity between the webpage documents. Many alternative label propagation algorithms have been proposed over the years, with a survey [249] giving an overview of several different approaches cast in a regularisation framework. A common drawback of these approaches is the prohibitively high cost associated with label propagation. A number of works on label propagation [33, 34, 105] concentrate on extracting a tree from the graph, using a very small number of the neighbours of each node.

While many graph-based methods do not address the problem of the initial graph construction, assuming a fully connected graph is given, or simply choosing to work on data that inherently has a graph structure, there is a large body of work dedicated to calculating the nearest neighbours of a data point. One approximate method, NN-Descent [64], shows promising results in terms of accuracy and speed for constructing k-nearest neighbour graphs, based upon the principle that “a neighbour of a neighbour is also likely to be a neighbour”. The All-Pairs algorithm [16] tackles the problem of computing the pairwise similarity matrix often used as the input graph structure in an efficient and exact manner, showing speed improvements over another inverted-list based approach, ProbeOpt-sort [195], and over well-known signature-based methods such as Locality Sensitive Hashing (LSH) [91].


4.2.2 Graph Construction

In our approach, the first step is to turn the data stream or corpus of documents that we wish to classify into a graph of documents. The graph construction X → G deals with taking a data stream or corpus X = {x_1, …, x_n} and creating a graph G = (R, E, W), where R is the set of vertices (with vertex r_i holding the bag-of-words representation of document x_i), E is the set of edges, and W is the edge weight matrix. During construction there are several design choices that must be made, each of which should be tailored to the type of data one wishes to classify. Particular care must be taken when selecting the distance metric used to assign edge weights to the graph, along with the choice of method used to trim the graph and keep it sparse for efficient computation.

4.2.2.1 Distance metric

The distance metric determines the weight of the edge between any two vertices; these weights make up the edge weight matrix W. The weight of an edge, w_ij, is used as a notion of the similarity between two vertices r_i and r_j: a high edge weight means that r_i and r_j are similar in some sense, while a low edge weight means they are dissimilar. The sense in which the two vertices are considered similar is where the choice of distance metric comes in. Since the proposed approach works with vector representations, it is neutral to the type of data used, and the distance metric should be chosen appropriately for the data at hand. For example, when working with histogram representations of images (typically done for texture and object recognition), an appropriate distance metric might be χ² or the Earth Mover's Distance [246]. In our study, we work with bag-of-words representations of text documents. The most common measure for comparing the similarity of two bags of words is the cosine similarity [146], as detailed in Section 2.1.3.4, and this is the metric used here.

4.2.2.2 Keeping the graph sparse

The first step of creating the graph would typically involve calculating the pairwise similarity score between all pairs of vertices using the chosen distance metric in order to obtain the edge weight matrix W. Many studies assume that it is feasible to create a full N × N distance matrix [107] or that a graph is already given [34, 105]. This assumption can severely limit the size of data that is manageable, constrained by the naive O(N²) time complexity of the pairwise calculation. Additionally, construction of a full graph Laplacian kernel, as required by some standard

Chapter 4. Extracting Textual Characteristics at Scale

graph labelling methods [17, 104, 250], is computationally challenging even for graphs with 10,000 vertices [105]. Another consideration is that, for scenarios where the data arrives as part of a stream, the computational time to add each point grows as a function of the number of data points, quickly leading to an intractable situation. Maintaining a sparse graph is therefore desirable both for efficiency and for reducing susceptibility to noise in the data [107].

Two popular methods for achieving sparsity are k-nearest neighbour (kNN) and ε-neighbourhood, both of which utilise the local neighbourhood properties of the vertices in the graph [32, 107, 145]. Local neighbourhood methods are important for efficiency since a data point relies only on information about other points close by, with respect to the distance metric, to determine the neighbours of a vertex. This means that no global properties need to be recalculated over the entire graph each time a new vertex is added, a consideration that has implications both for scalability and, more generally, for parallelisation. We investigate both of these local neighbourhood methods for keeping the graph sparse, during the initial construction phase and also when new vertices are added to the graph.

The first method, kNN, connects each vertex to the k most similar vertices in R, excluding itself. That is, for two vertices r_i and r_j, an edge is added if and only if the similarity between r_i and r_j is within the largest k results for vertex r_i. The second method, ε-neighbourhood, connects all vertices within a distance ε of each other, an approach similar to classical Parzen windows in machine learning [180]. This places a lower bound on the similarity between any two neighbouring vertices, i.e. only edges with a weight above the threshold ε are added to the graph.
A simple way of visualising this is to draw a sphere around each vertex with radius ε, where any vertex falling within the sphere is a neighbour of that vertex. While the first method fixes the degree distribution of the network, the second does not, resulting in fundamentally different topologies. We will investigate the effect of these different topologies from the two neighbourhood methods on labelling accuracy in Section 4.2.4. Alternative methods, such as the β-matching proposed by Jebara et al. [107], show an improvement in performance over k-nearest neighbour, but at the cost of an increased complexity of the solution and the assumption that a fully connected graph is given. Since we are interested in applying these methods to data streams with large numbers of incoming documents, the assumption that we can create a fully connected graph is a serious drawback that makes pursuing this method infeasible.
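The two sparsification rules can be sketched directly. Here each vertex's candidate similarities are assumed to be precomputed (e.g. cosine similarities from the construction step); note that kNN fixes each vertex's out-degree at k, while the ε rule does not:

```python
def knn_edges(similarities, k):
    """Keep edges to the k most similar vertices for each vertex."""
    edges = {}
    for v, sims in similarities.items():
        ranked = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
        edges[v] = {u for u, _ in ranked[:k]}
    return edges

def epsilon_edges(similarities, eps):
    """Keep only edges whose weight exceeds the threshold eps."""
    return {v: {u for u, w in sims.items() if w > eps}
            for v, sims in similarities.items()}

# Toy similarity scores for two vertices and their candidates:
sims = {
    "d1": {"d2": 0.9, "d3": 0.2, "d4": 0.6},
    "d2": {"d1": 0.9, "d3": 0.1, "d4": 0.3},
}
print(knn_edges(sims, k=1))       # each vertex keeps its single best neighbour
print(epsilon_edges(sims, 0.5))   # only edges with weight above 0.5 survive
```

The differing degree distributions are visible even in this toy example: kNN gives every vertex exactly one neighbour, while the ε rule gives d1 two neighbours and d2 one.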


4.2.2.3 Creating the graph

We can overcome the issue of O(N²) time complexity for computing the edge weight matrix by using an alternative method, converting the corpus into an inverted index where each term has a pointer to the documents the term appears within. The advantage of this approach is that the corpus is mapped into a space based upon the number of terms, rather than the number of documents. This approach relies on the size of the vocabulary |V| being much smaller than the size of the corpus N. According to Heaps' Law, the number of terms |V| appearing in a corpus grows as O(N^β), where β is a constant between 0 and 1 dependent on the text [101]. Some experiments on English text have shown that in practice β is between 0.4 and 0.6 [9, 12]. The inverted index can be built in O(N Ld) time, where Ld is the average number of terms in a document, with a space complexity of O(N Lv), where Lv is the average number of unique terms per document [243]. Finding the neighbours of a document is then trivial because of the inverted index structure. A classical approach is to use the Term Frequency-Inverse Document Frequency (TF-IDF) weighting [146]. This can be performed in O(Ld log |V|) time for each document by performing Ld binary searches over the inverted index. Assuming β from Heaps' Law takes the average value of 0.5, the time complexity for finding the neighbours of a document can be rewritten as O((Ld/2) log N). Therefore, there is a total time complexity of O(N Ld + (N Ld/2) log N) for building the index and finding the neighbours of all vertices in the graph. This is equivalent to O(N log N) under the assumption that the average document length Ld is constant. Since we wish to keep the graph sparse, even during creation, rather than building a fully connected graph and then pruning edges, the chosen neighbourhood method is used during the creation step to determine how many of the neighbours (essentially how far down the list of similar vertices) are included in the edge weight matrix. From the edge weight matrix, we can obtain a sparse edge list representation of the (unweighted) graph. This is done by listing, for each vertex r_i, every other vertex r_j which has a non-zero edge weight w_ij. The sparse edge list representation of the graph will be important for the label propagation step.
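The index-based neighbour lookup described above can be sketched as follows. This is an illustrative toy, not the thesis's implementation: hash lookups stand in for the binary searches of the complexity analysis, the corpus is hand-written, and scores are unnormalised TF-IDF dot products.

```python
import math
from collections import Counter, defaultdict

# Toy corpus standing in for news articles (illustrative data only).
docs = ["nuclear power plant safety",
        "nuclear energy debate",
        "football match report",
        "power plant emissions debate"]
tokenised = [d.split() for d in docs]
N = len(tokenised)

# Inverted index: term -> {document id: term frequency}.
index = defaultdict(dict)
for i, terms in enumerate(tokenised):
    for term, tf in Counter(terms).items():
        index[term][i] = tf

def tfidf_neighbours(i, k):
    """Rank other documents by TF-IDF dot product with document i.

    Only the postings lists of document i's own terms are touched,
    so the cost depends on document length, not on corpus size."""
    scores = defaultdict(float)
    for term, tf_i in Counter(tokenised[i]).items():
        idf = math.log(N / len(index[term]))
        for j, tf_j in index[term].items():
            if j != i:
                scores[j] += (tf_i * idf) * (tf_j * idf)
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(tfidf_neighbours(0, k=2))  # doc 3 shares two terms with doc 0, doc 1 shares one
```

Only documents sharing at least one term with the query document ever receive a score, which is exactly what keeps the resulting graph sparse.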

4.2.3

Label Propagation

Once we have the graph constructed from the corpus of documents, we wish to propagate labels along the edges of the graph for the task of classification. Label propagation aims to use a graph G = (R, E, W) to propagate labels from labelled vertices to unlabelled vertices. Each vertex r_i can

Chapter 4. Extracting Textual Characteristics at Scale

have multiple labels, i.e. a document can have multiple annotations, and each label is considered independently of the other labels assigned to a vertex. The labels assigned to the set of labelled vertices Y_l = {y_1, . . . , y_l} are used to estimate the labels Y_u = {y_{l+1}, . . . , y_{l+u}} on the unlabelled set. Carreira-Perpinan et al. [32] suggest constructing graphs from ensembles of minimum spanning trees (MSTs) as part of their label propagation algorithm, with their two methods, Perturbed MSTs (PMSTs) and Disjoint MSTs (DMSTs), having a complexity of approximately O(T N² log N) and O(N²(log N + t)) respectively, where N is the number of vertices, T is the number of MSTs ensembled in PMSTs, and t is the number of MSTs used in DMSTs.

u_m = Σ_{k=1}^{m} (t_k − μ)    (6.3)

cusum(T) = (u_1, . . . , u_M)    (6.4)
where u_m is the cumulative sum of the mean at time interval m, t_m is the sentiment score at time interval m and μ is the mean of the series T. The cumulative sum of the mean for the Nuclear Power sentiment series can be seen in Figure 6.3. Change points within the series T can then be found by taking the maximum of cusum(T), revealing that a change point is detected in March 2011, corresponding with the visible change point that is seen. The detected change point was further tested for significance by assessing whether the magnitude of the maximal cumulative sum of the mean is larger than would be expected by random chance, i.e. should we expect to see such a large shift in mean in a random series? We used random permutation testing to assess whether the change point is significant under the hypothesis that the cumulative sum of the mean u_m at each time interval is drawn from a distribution with a stationary mean. We found that the detected change point in March 2011 is indeed significant, with p < 0.001 after 10,000 trials. The histogram of maximal values found from the randomised trials can be seen in Figure 6.4.
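A minimal sketch of this change-point procedure (the cumulative sum of deviations from the mean, plus the permutation test) is given below. The sentiment series is synthetic, with an injected mean shift; it is not the thesis's Nuclear Power data.

```python
import random

def cusum(series):
    """Cumulative sum of deviations from the series mean (u_m in Eq. 6.3)."""
    mu = sum(series) / len(series)
    u, total = [], 0.0
    for t in series:
        total += t - mu
        u.append(total)
    return u

def change_point(series, trials=10_000, seed=0):
    """Index of the maximal |u_m|, plus a permutation-test p-value."""
    u = cusum(series)
    observed = max(abs(x) for x in u)
    m = max(range(len(u)), key=lambda i: abs(u[i]))
    rng = random.Random(seed)
    shuffled = list(series)
    exceed = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # permuting destroys any ordered mean shift
        if max(abs(x) for x in cusum(shuffled)) >= observed:
            exceed += 1
    return m, exceed / trials

# Synthetic sentiment series with a mean shift halfway through.
series = [0.2] * 30 + [-0.3] * 30
m, p = change_point(series, trials=1000)
print(m, p)
```

The permutation test asks exactly the question posed above: how often does a random reordering of the same values produce a cumulative-sum excursion at least as large as the observed one?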

Chapter 6. Large-Scale Corpus Analysis


Figure 6.3: Cumulative sum of the mean for the sentiment score of ‘Nuclear Power’ between 1st May 2008 and 31st December 2013.

6.3.2

Associations

Performing association rule mining on our items reveals how often items co-occur with one another during the media discourse, and in this case allows us to identify the topics, universities and diseases most associated with a particular item. We generated association networks showing the co-occurrences in the science articles for nuclear power before and after the incident, exposing the change in topics mentioned in relation to nuclear power. The edges of the networks express high co-occurrence between items in the same science articles and are weighted by the confidence of the association, as described in Section 6.2.3. Before the Fukushima disaster, nuclear power was associated with other scientific topics and institutions and only marginally related to health risks such as cancer, plague and asthma, as seen in Figure 6.5. After the Fukushima incident, there is a substantial increase in the association of nuclear power with cancer, and in particular with thyroid cancer (a common radiation-induced form of cancer), as shown in Figure 6.6. This is due to the aftermath of the incident in Japan and a more frequent media discourse on the health hazards of nuclear power, in which cancer is predominant. These findings are in line with what we describe both in terms of the evolution of articles' sentiment and the following networks of actors and actions. Another change that can be observed after the disaster is the appearance of a large number of scientific research institutes associated with nuclear power. This could suggest that it is felt that further research into nuclear power is needed, given that such incidents are still happening, or simply
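The confidence weighting used on the network edges can be sketched as follows. Section 6.2.3 is not reproduced in this chapter, so this sketch assumes the standard association-rule definition, conf(A → B) = support(A ∧ B) / support(A), over hand-written toy item sets.

```python
# Toy item sets, one per article (illustrative data only).
articles = [
    {"nuclear power", "cancer"},
    {"nuclear power", "thyroid cancer", "cancer"},
    {"nuclear power", "nuclear engineering"},
    {"cancer", "oncology"},
]

def confidence(antecedent, consequent):
    """conf(A -> B): fraction of articles mentioning A that also mention B."""
    a = sum(1 for items in articles if antecedent in items)
    ab = sum(1 for items in articles
             if antecedent in items and consequent in items)
    return ab / a if a else 0.0

# Two of the three 'nuclear power' articles also mention 'cancer'.
print(confidence("nuclear power", "cancer"))
```

Note that confidence is asymmetric: conf("nuclear power" → "cancer") differs from conf("cancer" → "nuclear power"), which is why the direction of each edge matters when reading the association networks.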

6.3. Nuclear Power Case Study


Figure 6.4: Histogram of the 10,000 greatest values found in randomised trials of the cumulative sum of the mean for the sentiment score of ‘Nuclear Power’ between 1st May 2008 and 31st December 2013.

that media outlets asked scholars from these institutions to comment on the disaster for their expert opinions on the matter. Further analysis and the involvement of social scientists would be required in order to make any concrete claims and to tease out an explanation for the observed increase in associated research institutes from the many possible speculations.

6.3.3

Actions and Actors

Extracting SVO triplets and generating a triplet network for individual items allows us to map the range of issues related to each item, clearly displaying the key actors and entities affecting and being affected by the items in turn. By increasing the level of granularity and looking directly at the actions relating to nuclear power before and after the incident, we can analyse how the media frames the topic in a different light. Figure 6.7 shows the network of actors and actions linked to nuclear power before the incident; it contains a number of policy actors and countries, revealing the debate about nuclear power as a viable alternative to fossil-based energy sources. The most frequent actors are countries or political figures, because most articles reflect the debate taking place within countries about their energy supply needs, with the actions linking them also reflecting this kind of discussion. This can be further seen in Figure 6.9, where the collective actions being applied to nuclear power are mostly centred around ‘support’, ‘embrace’ and ‘need’. However, after the Fukushima disaster the network of actors and actions changed (Figure 6.8).
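The aggregation behind Figures 6.9 and 6.10 (collecting the verbs applied to a target item) can be sketched with hand-written toy triplets. A real pipeline would extract the (subject, verb, object) triplets from parsed sentences; the data below is purely illustrative.

```python
from collections import Counter

# Hand-written toy (subject, verb, object) triplets — illustrative only.
triplets = [
    ("germany", "abandons", "nuclear power"),
    ("public", "reject", "nuclear power"),
    ("italy", "abandon", "nuclear power"),
    ("nuclear power", "causes", "deaths"),
    ("china", "supports", "nuclear power"),
]
target = "nuclear power"

# Verbs where the target is the object: actions done *to* it (cf. Figure 6.10).
incoming = Counter(v for s, v, o in triplets if o == target)
# Verbs where the target is the subject: actions it performs.
outgoing = Counter(v for s, v, o in triplets if s == target)

print(incoming.most_common())
print(outgoing.most_common())
```

Splitting the verbs by whether the target is the subject or the object of the triplet is what separates "actions happening to nuclear power" from "actions nuclear power performs" in the figures.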

Chapter 6. Large-Scale Corpus Analysis


Figure 6.5: Associated universities (green), topics (blue) and diseases (red) found through association rule mining for ‘Nuclear Power’ before the Fukushima disaster. Edge weight denotes the confidence in the association.



Figure 6.6: Associated universities (green), topics (blue) and diseases (red) found through association rule mining for ‘Nuclear Power’ after the Fukushima disaster. Edge weight denotes the confidence in the association.



Figure 6.7: SVO triplet network showing the actors and actions affecting ‘Nuclear Power’ before the Fukushima disaster. Nodes represent subjects and objects in the SVO triplets, while edges show the verb relation between the subject and object of the triplet.



Figure 6.8: SVO triplet network showing the actors and actions affecting ‘Nuclear Power’ after the Fukushima disaster. Nodes represent subjects and objects in the SVO triplets, while edges show the verb relation between the subject and object of the triplet.


Figure 6.9: Verbs in the SVO triplets where ‘Nuclear Power’ is the object before the incident, showing the actions happening to ‘Nuclear Power’.

Figure 6.10: Verbs in the SVO triplets where ‘Nuclear Power’ is the object after the incident, showing the actions happening to ‘Nuclear Power’.


6.4. Data Playground: an Interactive Tool for Corpora Analysis

The biggest change is the introduction of the public as a very important actor, along with their views and feelings about nuclear power. The role and risks associated with nuclear power re-emerged as an element of the debate. Actions such as ‘replace’, ‘reject’ and ‘abandon’ become more prominent, as seen in Figure 6.10. Additionally, in contrast to the pre-incident period, public reaction to and acceptance of nuclear power gained a prominent role in its online news representation, and we can surmise that this is the reason for the long-standing negative shift in article sentiment after the incident.

6.3.4

Case Study Discussion

Traumatic and dramatic events can have a profound impact on the way the media represent an issue, and in an age of global media such effects can potentially have a wide reach. Traditionally, it has been very difficult to ascertain which events had an impact in the public sphere, and to discern whether the impact is a long-lasting one. The level of attention, sentiment and framing (both in terms of SVO triplets and associated topics) in the media coverage of a technology can greatly affect its trajectory. The most prominent recent example is the case of biotechnology in Europe, where a public opinion backlash generated an embargo on GMO crops [87]. Actors and other issues associated with nuclear power have experienced a dramatic change from a largely positive to a clearly negative climate. Such dynamics are reminiscent of the Three Mile Island accident in 1979, which had a considerable effect on the development of the nuclear industry1. Further analysis of important events and issues in the public sphere could also be augmented by the inclusion of sentiment analysis of data from social media sources, allowing for a more direct method of gauging the public reaction to particular issues. Similarly, associations and the key actors and actions could also be extracted from social media data for a fuller picture of how events and issues unfold in the public eye.

6.4

Data Playground: an Interactive Tool for Corpora Analysis

So far in this chapter we have demonstrated with a case study that large-scale analysis of textual corpora is not only possible, but can reveal patterns about real-world events from millions of news articles. However, large textual corpora such as the online news or social media data that we collect cover many real-world events, document shifts in popular opinion and contain hidden patterns about a wide range of concepts. While automated means can reveal these hidden treasures in the data, it is often difficult for these automated means to determine which are meaningful or indeed interesting to

1 http://en.wikipedia.org/wiki/Three_Mile_Island_accident#Effect_on_nuclear_power_industry


a particular scholar or field of study, and which are simply unlikely (and therefore generally interesting to an algorithm) but essentially meaningless changes in the statistics of the data. In this vein, we present an interactive tool named “Data Playground”, providing a way for users to explore, form and test hypotheses, and generally infer patterns and trends within massive textual corpora, leveraging automated data-driven methods to explore and interrogate the data. The tool was developed in an effort to bring the worlds of quantitative and qualitative analysis closer together using the methods described in Section 6.2, building on the strengths of each to combat the weaknesses of each approach taken independently. Namely, the main critique of quantitative approaches is that only a shallow understanding of the data is accessible, with a depth of meaning missing from the automated methods. Similarly, qualitative methods are criticised for their comparatively tiny sample sizes, which can easily miss data relevant to the study at hand, or for being too closely focused on a very particular aspect.

6.4.1

Available Tools within the Data Playground

The main interface of the Data Playground allows a user to search a corpus by keyword or phrase, choosing a particular suite of tools to be run depending on the desired view of the resulting data. Currently, there are four different sets of methods available to the user, each of which starts with a high-level display of the entire corpus and allows for more detail to be extracted from the underlying text. In this way, users can start from a high-level quantitative view of the corpus as a whole, finding macroscopic patterns, which can then be interrogated for further analysis, until the user gets right down to each individual text for manual inspection if required. This combination of approaches allows the real strengths of quantitative and qualitative methods to be combined, ensuring that all relevant data is covered and can be manually inspected, while also benefiting from the vast scale only available to automated means and their ability to uncover macroscopic trends in the data.

6.4.1.1

Timelines

The first set of tools displays timelines of salience for the keywords or phrases searched for by the user. This includes both single words such as “Bristol” and longer phrases such as “University of Bristol”, collectively referred to as n-grams, with a specific n-gram that has been searched for referred to as the target n-gram. As detailed in Section 6.2.2, these timelines of salience are calculated as the number of articles that contain the n-gram over the total number of articles that fall within the bucket for that time resolution. Each n-gram is displayed as a new series within the same graph, showing

6.4. Data Playground: an Interactive Tool for Corpora Analysis


Figure 6.11: Timelines of three European leaders in modern English news media using Data Playground.

Figure 6.12: Word cloud generated for the largest peak for “Francois Hollande” as seen in Figure 6.11.

a comparison of different concepts at the same time. An example of this can be seen in Figure 6.11, where the n-grams “David Cameron”, “Angela Merkel” and “Francois Hollande” are compared in the modern English news corpus collected using our modular architecture for data curation, detailed in Section 3.3. From here, users can extract more information from the corpus by clicking on any of the time points in the timeline to compute the associated word cloud for that particular set of articles. The word cloud offers a summary of the associated articles and quickly allows the user to determine the context in which the n-gram is being used. An example of this is shown in Figure 6.12, where after clicking on the largest peak of attention for “Francois Hollande”, the word cloud displays more information revolving around the attack on the Paris offices of the magazine Charlie Hebdo in January

20152. The word cloud here is computed by displaying the most frequently occurring words (after stop word removal) within the top 1000 articles containing the target n-gram for the time interval covered by the time point clicked. Furthermore, the user can select particular words that appear in the word cloud to query for more detailed results, displaying the titles of the articles which contain both the target n-gram for the timeline and the new word selected from the word cloud. This allows users to confirm that the articles are reporting on the n-gram as they expect, but also to inquire into words which are unexpected in the word cloud, and interrogate the data for more details. Selecting an individual article title will then display the full textual content of the article to the user, enabling a close reading of the text and manual inspection (and verification) of the automated methods to take place. This ability to query the data and obtain more fine-grained information about the salience of an n-gram at any particular time in a very targeted way enables users to view the data in a new way and draw conclusions from the data using both automated and manual approaches. While this approach covers a specific concept very well, it does not offer a lot of information about related concepts, simply enabling them to be discovered from the text for further analysis using the other methods available.

6.4.1.2

Set Comparisons

An alternative to the timelines is to compare n-grams as a set, showing the relative changes in the corpus for each of the n-grams in a related set. The set comparison tools perform this by restricting the set of articles to only those which contain at least one of the n-grams from the set. The set comparison timelines are then computed as the number of articles containing one of the target n-grams in a set, divided by the total number of articles that match any of the target n-grams in the set, calculated for each individual time interval in the corpus. In this way, n-grams can be clearly and directly compared to each other, highlighting crossover points, when one n-gram becomes more prominent than another in the underlying text corpora. Here again, as with the timelines, we can also click on particular time points to compute the word cloud for the target n-gram and extract more information about any points of interest in the series, further drilling down to the titles and full content of the articles composing the word cloud. For example, in Figure 6.13 we can see that a comparison of London, New York and Los Angeles using the set comparison tools reveals that New York is covered most in the set for our English language

2 http://en.wikipedia.org/wiki/Charlie_Hebdo_shooting



Figure 6.13: Set comparison of three major cities in modern English news media using Data Playground.

news corpus, apart from four crossover points with London. These crossover points occur during the “2009 G20 London Summit”3, in early July 2009 when a house fire spread through a block of flats killing six people4, combined with heavy coverage of the “2009 flu pandemic in the United Kingdom”5, during the “2011 London summer riots”6, and during the “2012 Summer Olympics”, held in London7. Los Angeles, the least mentioned of the three cities, gains its momentary relative fame over London on three occasions: after the death of Whitney Houston8, during the “Christopher Dorner shootings and manhunt”9, and in May 2014 when “[Donald] Sterling’s wife, Shelly, reached an agreement to sell the [Los Angeles] Clippers for $2 billion to Steve Ballmer”10. As we have seen, comparing similar concepts as a relative quantity, in relation to each other, quickly allows identification of when certain n-grams become more prominent and cause crossover points to occur, along with allowing quick analysis of the reasons for the increase in attention for a given n-gram via the lower-level data representations.

3 http://en.wikipedia.org/wiki/2009_G-20_London_summit
4 http://en.wikipedia.org/wiki/2009_Lakanal_House_tower_block_fire
5 http://en.wikipedia.org/wiki/2009_flu_pandemic_in_the_United_Kingdom
6 http://en.wikipedia.org/wiki/2011_England_riots
7 http://en.wikipedia.org/wiki/2012_Summer_Olympics
8 http://en.wikipedia.org/wiki/Whitney_Houston#Death
9 http://en.wikipedia.org/wiki/Christopher_Dorner_shootings_and_manhunt
10 http://en.wikipedia.org/wiki/Donald_Sterling
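The set comparison series described in this section can be sketched as follows. The timestamped articles are hand-written toy data, and simple substring matching stands in for the tool's query engine; the key point is the restriction of the denominator to articles matching at least one n-gram in the set.

```python
from collections import defaultdict

# Timestamped toy articles: (time bucket, text) — illustrative data only.
articles = [
    ("2012-07", "london olympics open"),
    ("2012-07", "new york heatwave"),
    ("2012-07", "london crowds cheer"),
    ("2012-08", "new york storm"),
]

def set_comparison(ngrams):
    """Per time bucket, the share of set-matching articles containing each
    n-gram, restricted to articles matching at least one n-gram in the set."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for bucket, text in articles:
        hits = [g for g in ngrams if g in text]
        if hits:
            totals[bucket] += 1
            for g in hits:
                counts[bucket][g] += 1
    return {b: {g: counts[b][g] / totals[b] for g in ngrams} for b in totals}

print(set_comparison(["london", "new york"]))
```

Because the denominator is the set-matching article count rather than the whole corpus, the resulting series are directly comparable and a crossover point simply marks the bucket where one n-gram's share overtakes another's.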


6.4.1.3

Associations

As previously mentioned, using the timeline tools does not always provide much information about how an n-gram relates to other concepts without a large amount of manual inspection and close reading of the underlying texts. To aid the user in the pursuit of relational information about different concepts, we provide tools for detecting and exploring associations of entities and concepts within the data. Using the association tools, target n-grams are displayed as part of a network, as seen in Figures 6.5 & 6.6 from the Nuclear Power case study. The target n-gram forms the central node, with edges, denoting an association determined by association rule mining (see Section 6.2.3), connecting the node to the associated concepts that co-occurred within the same articles. Interrogation of the network by the user provides two options, depending on whether a node or an edge in the network is selected. When an edge in the network is selected, the word cloud containing the most frequent words after stop word removal is displayed for the set of articles which are common to both of the nodes connected by the edge. This enables more information to be found about how these two concepts are connected with each other. Similarly to the timelines and set comparison tools, selecting any word within a word cloud returns the set of articles that match that word and the n-gram of interest, allowing for manual inspection of the text to take place. When a node in the network is selected, the display of the association network changes and refocuses with the selected node as the new central node. In this way, a user can browse from concept to concept through the network via co-occurring concepts and fully explore a particular aspect of the data they may be interested in.

6.4.1.4

Triplet Networks

The last set of tools currently available relates to the extraction and formation of triplet networks from the data. Here, networks of actors and their respective actions are presented to the user as an alternative corpus browsing approach to the association networks shown with the associations tools. The networks generated here again place the target n-gram as the central node of a network, with the other nodes in the connected network being actors which are either performing an action on the target n-gram, or are having an action performed on them by the target n-gram. The edges in the network encode the action which is being performed, and are labelled as such, similarly to the triplet networks in Figures 6.7 & 6.8. Again, similarly to when exploring associations, two options are available by selecting either a node or an edge in the network for more details on the data. Selecting an edge once again displays a word cloud, this time computed from the articles in which the action is being performed on or

6.4. Data Playground: an Interactive Tool for Corpora Analysis


Figure 6.14: Example of the advantages of the expressive query language available when timelines are computed on the fly, instead of being precomputed.

by the two connected nodes. The word cloud here quickly gives some context to the action that is happening, with selected words within the word cloud further displaying the articles in which they appear, as seen in the other tools. Selecting a node in the triplet networks will refocus the network to place the selected node at the centre, and find the new actions and actors connected to the new central node. A user can therefore browse a massive textual corpus by following the actions that occurred between different actors, similar to a previously implemented interface provided in [219] for US election news articles, but in a much larger and more general way11.

6.4.2

Notes on Implementation

Due to the real-time, interactive nature of the Data Playground, some methods are prohibitively expensive to compute in real time and so are difficult to include for the user's benefit. We detail here some of the trade-offs between the amount of information extracted or displayed to the user and the need to keep everything fast to compute and display, so as not to affect the user's ability to quickly interrogate the data. We also cover an additional set of tools being considered for inclusion, and the associated challenges.


6.4.2.1

Document Frequency vs Term Frequency

Firstly, perhaps the biggest step taken to ensure a smooth and (mostly) wait-free experience is to compute article counts rather than relative term frequency counts when exploring timelines. This efficiency step means that instead of reporting the total number of times an n-gram appears divided by the total number of terms in a time interval, we report the number of articles that contain the n-gram, divided by the total number of articles in the time interval. The time saving in the computation comes from only needing to count the matching articles, instead of retrieving all matching articles and then checking each for the number of times the n-gram occurs. Steps have been taken to investigate precomputing the timelines for n-grams (up to 3-grams in this case) for various corpora using the Hadoop platform within our modular architecture (see Section 3.3.2.3), but this comes with its own advantages and disadvantages. For example, it would allow timelines across different corpora to be more easily compared when the underlying text structure is significantly different, such as when comparing relatively brief messages from social media with longer-form news articles, since the relative term frequency counts completely ignore the abstracted “article” structure and simply treat the corpus as an unstructured collection of (albeit timestamped) words. However, we would lose much of the empowering expressivity of the query language that can be accessed when the timelines are computed on the fly. An example of this, shown in Figure 6.14, is that with precomputed n-gram timelines we would not be able to display the intersection of n-grams (black series, i.e.
“nuclear power” and not “curiosity”), which are a very useful way to verify how much of the salience of the timeline is related to a particular concept. These show that the latter two peaks are almost entirely attributable to the Mars rover Curiosity being launched and landing on Mars12, while the first peak is an increase in the salience of nuclear power due to the Fukushima disaster covered in the previous case study.

6.4.2.2

Word Clouds and their Alternatives

Word clouds are used extensively in the Data Playground as a rudimentary method of “summarising” a set of relevant articles. While more sophisticated and involved methods of article summarisation [144, 170, 176] would certainly be preferable to the word cloud approach taken in this work, this process would no doubt involve several steps which must be computed on the fly each time a user selects an

11 Additional usage examples of Data Playground are available in Appendix A.1.
12 http://en.wikipedia.org/wiki/Curiosity_%28rover%29


object for further analysis. Typically, multi-document summarisation techniques are used to produce short summaries of tens of articles at a time. Since we are working in the space of corpus analysis, the most common case will be to potentially summarise tens of thousands of articles at once. Indeed, since it is likely that not every article containing a target n-gram is focussed on that particular topic, the articles would require clustering as a preprocessing step before summaries could be generated for each cluster individually. As one can see, trying to generate fast summaries for huge numbers of articles quickly leads down a path that would require its own field of research. Word clouds do fit our criteria of being able to summarise large article sets, but come with their own caveats and limitations. The main limitation for word clouds in the tool is the user's browser. Not only is screen real estate limited, but trying to cram too much into one display has a negative effect on a user's ability to process the information; white space is important [36, 142]. For these reasons, we specify that the word clouds should not render more than the 250 most frequent words at any one time for a set of articles (although the number shown is often much lower than this, due to the size available on the user's screen). The set of articles chosen to represent the relevant set for a target n-gram, whether from the timeline, set comparison or association tools, or after selecting an action in the triplet networks, is restricted to an upper limit of the 1,000 most relevant articles. While it is expected that this upper limit should not generally be reached in the current corpora of modern English news, or its subset of science news, one can easily imagine a case where a larger corpus used in the same tool does come up against the limit (social media is one example).
Under these circumstances, a user needs to be aware of such restrictions within the tool in order to take them into account. Sampling methods designed to ensure that the articles included in the word cloud are representative of the full relevant article set could be deployed, but would likely run up against the same issues as the multi-document summarisation techniques: scalability and the computational efficiency needed for near real-time generation. For all their faults and criticisms [98], in their deployed usage within Data Playground, word clouds offer a sensible balance between the high-level views of the data (timelines, associations etc.) and the low-level raw textual content of the articles.
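The 250-word cap described above amounts to a simple frequency cut over the article set. The following is a minimal sketch of that cut; the function name and toy stopword set are our own for illustration, not part of Data Playground itself.

```python
from collections import Counter

def wordcloud_terms(articles, max_terms=250, stopwords=frozenset()):
    """Rank words across a set of articles by frequency and keep only the
    top `max_terms`, mirroring the 250-word cap used in the word clouds."""
    counts = Counter()
    for article in articles:
        for token in article.lower().split():
            if token.isalpha() and token not in stopwords:
                counts[token] += 1
    return counts.most_common(max_terms)

# Toy example with a three-word cap instead of 250.
terms = wordcloud_terms(
    ["the reactor restarted", "the reactor shut down"],
    max_terms=3, stopwords=frozenset({"the"}))
```

In the deployed tool the effective cap is often lower still, as the renderer stops placing words once the available screen area is filled.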

6.4.2.3 N-grams and References

In Section 6.2.1, we detail how references to items are extracted from the text, starting from a predefined list of items of interest. However, in Data Playground, users can search for any target n-gram they wish, not necessarily from the list of predefined items. As such, we allow for new concepts and ideas to be investigated as they occur to the user, rather than requiring the types of investigation to be specified a priori. Indeed, we wish for the tool to be generic, in that it allows users to perform studies into unforeseen subjects, given a specific corpus. However, this comes with its own trade-offs between what can be computed on the fly when a user makes a query, and what needs to be precomputed. Tolerable waiting times for a web user have been found to be around two seconds, with traditional human factors guidelines suggesting that a maximum wait of 10 seconds is the limit [169]. This marks the difference between the types of tool presented in Data Playground, and what types of searches can be made in each of them.

The timelines and set comparison tools allow any n-gram to be searched for, and can easily be computed on the fly, with results typically returned in under a second per series. However, extracting references and computing their associations, and parsing the text to produce triplet networks, cannot be computed on the fly (due to computation speed) and so must be precomputed. In the case of the triplet networks, this is not a problem when one wishes to use n-grams, since if the target n-gram appears in any of the triplets, we already have the result for it. For the associations, we cannot foresee which n-grams will be searched for, or indeed know what type of thing an n-gram may belong to. For example, we predefined a list of diseases, scientific topics and science institutions in the Nuclear Power case study, since we were interested in investigating their associations with each other. In this trade-off, we felt it is more informative to know the category of a predefined list of items which may be of interest, than to have an exponentially larger space that would be difficult to search through using the tool. Once an interesting pattern has been detected and interrogated in the tool, a user can then define the lists which would be interesting for that particular pattern, and compute the associations as needed to further their study.
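One way to keep any-n-gram timeline queries inside the two-second budget is to precompute daily n-gram counts so that serving a timeline reduces to a dictionary lookup. The sketch below is a toy in-memory index illustrating that idea, not the project's actual storage infrastructure.

```python
from collections import defaultdict
from datetime import date

class TimelineIndex:
    """Toy precomputed index of daily n-gram counts: any n-gram timeline
    becomes a dictionary lookup, keeping query latency negligible."""
    def __init__(self, n=1):
        self.n = n
        self.counts = defaultdict(lambda: defaultdict(int))

    def add(self, day, text):
        # Count every n-gram in the document under its publication day.
        tokens = text.lower().split()
        for i in range(len(tokens) - self.n + 1):
            gram = " ".join(tokens[i:i + self.n])
            self.counts[gram][day] += 1

    def timeline(self, ngram, days):
        by_day = self.counts.get(ngram.lower(), {})
        return [by_day.get(d, 0) for d in days]

idx = TimelineIndex()
idx.add(date(2011, 3, 11), "Fukushima Daiichi disaster")
idx.add(date(2011, 3, 12), "Fukushima fallout")
series = idx.timeline("fukushima", [date(2011, 3, 11), date(2011, 3, 12)])
```

At corpus scale the same lookup pattern would be backed by a database rather than a Python dictionary, but the principle — pay the counting cost once at ingestion time — is the same.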

6.4.2.4 Sentiment Timelines

Data Playground in its current state does not include the ability to display sentiment timelines for target n-grams. While they can be computed for a target n-gram, as seen in Figure 6.2 of the Nuclear Power case study, it typically takes a little over a minute with our current hardware and infrastructure, depending on the target n-gram chosen. Keeping in mind our target of less than two seconds, various methods for speeding up the computation of sentiment timelines have been considered. The crux of the problem lies in the differences between textual corpora of different types. For example, we have seen mood computed for social media (see Chapter 5) and sentiment computed for traditional news media (see Section 6.2.2.2) in this project, although a slightly different method was applied in each case. The choice generally comes down to the length of the text under consideration, since longer bodies of text are more likely to contain multiple points of view and topics, while brief messages contain a single opinion or topic of discussion. In short messages, such as in social media, we assume that only a single topic or opinion is being expressed, due to the limitations on the character count enforcing brevity. Therefore, we can compute the sentiment for an entire message and apply it to any concept of interest within the message. For longer texts, such as news articles, a range of opinions and topics of discussion can be addressed within the same article, making it much more difficult to attribute article-level sentiment to any particular concept within the text. In these circumstances, we resort to taking sentence-level sentiment, assuming that sentences are analogous to social media messages in that they contain a single topic of discussion or point of view, as described in Section 6.2.2.2.
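The sentence-level strategy can be sketched as follows: split the article into sentences, score each sentence independently, and average only those sentences that mention the target concept. The polarity lexicon below is a toy stand-in for the sentiment resources actually used in the thesis.

```python
import re

# Toy polarity lexicon, purely for illustration.
POLARITY = {"disaster": -1, "leak": -1, "safe": 1, "clean": 1}

def concept_sentiment(article, concept):
    """Average the scores of only those sentences mentioning the concept,
    treating each sentence as expressing a single opinion."""
    sentences = re.split(r"(?<=[.!?])\s+", article)
    scores = [sum(POLARITY.get(w, 0) for w in re.findall(r"[a-z]+", s.lower()))
              for s in sentences if concept.lower() in s.lower()]
    return sum(scores) / len(scores) if scores else 0.0

text = ("The nuclear plant suffered a leak. "
        "Solar power is clean. "
        "Officials now say the nuclear site is safe.")
score = concept_sentiment(text, "nuclear")  # (-1 + 1) / 2 = 0.0
```

Note how the article-level score would be contaminated by the sentence about solar power, whereas the sentence-level score for "nuclear" is built only from the two sentences that mention it.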

6.5 Discussion

The benefits of analysing massive collections of text documents as a whole are becoming more apparent, with high-profile studies [130, 141, 161] and media coverage bringing interest from a range of scholarly domains that work with text. While many approaches, including ours, make use of named entities and detect specific events within the corpora [129, 131, 239], borrowing from a rich set of techniques previously developed for information retrieval purposes, the result is still often lamented as being shallow, requiring additional linking with semantic knowledge bases for real knowledge to be extracted from the data [25, 217]. However, these studies are certainly not without their merit, and offer one very detailed perspective that can be obtained from large-scale corpora, especially when combined with innovative visualisation techniques [201, 202]. In our case study on nuclear power, we demonstrated a number of views on the data, offering a number of perspectives that can be used to inspect the data and build hypotheses. We take inspiration from [161], where it was shown just how powerful timelines of words and phrases over time can be, and that although they are reasonably shallow, they provide a good entry point for investigating a hypothesis about information in the data. We also offer narrative analysis by scaling up the work presented in [218], while allowing users to interact with the data and perform manual exploration, aided by the automated methods that are becoming more and more necessary to study a particular idea.

Indeed, [38] raises the question of what tools exist, and who will create new ones, for researchers who increasingly need access to large-scale corpora and the ability to visualise them, while [171] notes that “the restrictive interfaces of online archives currently make culturomics difficult” before giving an example of how to manually query the 19th Century British Library Newspapers archive13. Of course, presenting such approaches and tools as simply visualisations of a corpus is misleading, and misrepresents the amount of data processing, annotation, analysis and statistical testing involved in creating such interactive interfaces or conducting studies in this area. As such, tools like Data Playground for exploring massive textual corpora not only epitomise the aims of this project, but require every single one of our stated research questions to first be addressed: the scalable collection, storage and annotation of large amounts of text data; the extraction of meaningful information from the data via a number of generic methods; and allowing users to investigate macroscopic patterns and trends within the data to form and test hypotheses about the real world. In general, Culturomic approaches require an interdisciplinary methodology and will continue to do so for the time being. As Culturomics gains a critical mass, its reach will extend far into the digital humanities, political sciences and sociology amongst others, shaping the types of studies that are possible, and, to quote [161], “will furnish a great cache of bones from which to reconstruct the skeleton of a new science.”

6.6 Summary

In this Chapter, we have introduced a suite of tools and methods for analysing large-scale corpora and demonstrated their ability to explore and interrogate massive collections of articles, finding interesting changes in the way nuclear power has been covered by the mass media over the past few years. We proposed a new tool, Data Playground, to help social scientists, and anyone else interested in exploring massive datasets, to investigate macroscopic patterns and trends within the data and uncover their own patterns, thereby enabling computational and traditional methods to be combined in a way that enriches either approach on its own. Finally, in the last Chapter, we conclude the project and sum up what we have learnt during its course, along with outlining a few directions for future work to further our aims of using automated data-driven methods to unlock the potential of massive-scale textual corpora in traditionally non-computational fields.

13 http://www.bl.uk/reshelp/findhelprestype/news/newspdigproj/database/

CHAPTER 7

Conclusions

In this thesis we have taken a data-driven approach to investigating important questions in disciplines that have yet to fully adopt computational and automated methods, discovering macroscopic patterns that can only be revealed through the analysis of massive corpora in large-scale studies. We presented several studies using state-of-the-art tools and techniques from the field of Data Science and its associated fields of Data Mining, Machine Learning, Information Retrieval and Natural Language Processing to solve research challenges in a number of traditionally non-computational fields, including Journalism, Health and the Social Sciences. In Section 1.1 we set out a list of four general research questions that we wanted to address during the thesis. We now revisit those questions in light of the findings of this thesis, addressing each of them in turn.

How do we collect and store large amounts of text data? A starting point for any data-driven approach will always be the data. In Chapter 3 we presented a distributed architecture for collecting, storing and analysing text data collected from the web. The distributed architecture is modular and scalable in design and allows for a generic representation of textual documents, which enables the same modules to be applied to many different types of document. We also detailed the accompanying metadata that we collect alongside each document, which allows us to partition the data into sets based upon our needs and the questions we wish to study. We concentrated on online news and social media data, each of which has value in a number of different fields due to its coverage of current events and public opinion, while also being abundantly available in vast quantities. The data collected and annotated by the presented architecture was then used in the later chapters of this thesis for each of the data-driven studies that we covered.

Can we extract meaningful information at scale from text data? Once data was collected and stored, we needed to extract information from it which could be used to address the questions we wished to tackle. We demonstrated that we can extract two elements of writing style, readability and linguistic subjectivity, by annotating large-scale corpora, and that these elements are not only correlated with each other, but also reveal patterns about which news outlets target similar readerships. We proposed new methods for topic classification using a similarity-graph approach that achieves state-of-the-art performance comparable to SVMs, and showed how a combination of the graph-based approach with SVMs could further improve performance. Additionally, we presented studies on extracting mood from social media, investigating both longitudinal and circadian patterns over time, along with ways to disseminate the results to the public. Finally, we analysed the way science is framed in news media coverage, focussing on changes in the coverage of nuclear power using a host of methods, as detailed in Chapter 6. In each of these studies we performed validation steps in order to ensure that the information we were extracting was meaningful, comparing against validation data sets or real-world events as appropriate when a ground truth was absent.

What can the information tell us about the real-world? Once we established that we can extract meaningful information from text data, we wanted to use that information to answer questions about the real world using the macroscopic patterns we found in the data.
We demonstrated that, using the information we extracted, we could discover: that editors at different news outlets favour news articles written in a particular style, and that there is a noticeable lack of outlets using a high-readability, low-subjectivity writing style, indicating a possible gap in the market; that public mood in the UK was affected by the announcement of budget cuts in the United Kingdom, and that the UK riots in August 2011, almost a year later, were also detected as a change point, with significant increases in anger and fear preceding the riots; that mood on Twitter varies throughout the day, exhibiting a strong diurnal profile, along with high levels of correlation between the summer and winter periods studied; and that the coverage of nuclear power in the media has fundamentally changed since the Fukushima Daiichi disaster in March 2011, shifting from reflecting the on-going debates within many countries about how nuclear power can help address their energy supply needs, to a media representation with a strong association with thyroid cancer, and involving the public's view of nuclear power as a prominent actor. We also gave a small glimpse of the endless possible studies that could be conducted using scalable methods for textual analysis when combined with appropriate tools that allow independent researchers to interrogate the data themselves.

Can we formulate our studies into generic methodologies and tools? After solving a particular problem in one field, we wanted to be able to generalise our approach to solve similar but unrelated problems in entirely new fields of study. In this direction, we investigated a variety of scalable approaches for analysing large-scale corpora, including analysis of writing style, scalable annotation of topic categories, extraction of mood and n-gram trends, along with finding the associations and actions being performed by key entities within a text. Furthermore, we developed an interactive tool that can be used by scholars without a background in large-scale text analysis, allowing them to explore and delve into textual corpora using a number of different tools, and thereby navigate through the data and discover patterns using timelines, associations and narrative information in an automated way.

7.1 Summary of the Thesis

1 - Introduction: In this Chapter, we provided the motivation and aims of the project and presented the research questions we wished to address, along with detailing the output of the project in terms of relevant peer-reviewed publications, impact and dissemination in the media.

2 - Background: In this Chapter, we introduced the most common methods for processing text from its unstructured form into word tokens and sentences, along with adding information about the role each word token plays within the sentence or the dependencies between them. We covered how to convert these extracted tokens into a machine-readable format using a vector space representation suitable for machine learning methods. We also detailed the machine learning methods used in later chapters to classify documents into topic categories and discover associations between items in a database of transactions.

3 - Meet the Data: In this Chapter, we introduced the notions of feeds, outlets and articles within the mediasphere, along with messages from the social network Twitter known as tweets. We described how we collect these types of data from the web, and gave details on the news and social media corpora that were used as part of this project. We went on to present the modular architecture that we developed in collaboration with others within the research group and presented a few interfaces that are part of the system.

4 - Extracting Textual Characteristics at Scale: In this Chapter, we explored scalable approaches to annotate documents with a number of characteristics extracted from text, with experimental results demonstrating the approach in each study. We reported on two studies, the first a large-scale analysis of similarities in writing style among several news media outlets and between topics, with findings supporting the thesis that the application of large-scale automated methods can detect macroscopic patterns in the contents of large amounts of data. The second study explored a relational approach to classification using a graphical representation of textual corpora based upon document similarity. The study took into consideration the initial graph construction, a non-trivial challenge to make scalable, and the subsequent label propagation on the graph, with results showing that the approach can achieve state-of-the-art performance in topic classification.

5 - Mood Trends in Social Media: In this Chapter, we explored how the rise of social media and the discovery of trends in its data has potential implications in a wide range of disciplines, from economics and politics to sociology and health. We detailed how we computed usage trends of mood-related words and studied these trends over a 31-month period, detecting the significant change points that occurred, along with another study on how patterns of circadian and seasonal changes of mood can be estimated from social media. We also looked at possible ways to validate the results obtained from social media in the absence of a ground truth, as is often the case in such circumstances. Furthermore, we spent some time on visualising the mood trends that emerged from the data, enabling a wider audience to gain an insight into the meaning of the findings.

6 - Large-Scale Corpus Analysis: In this Chapter, we explored several scalable methods to analyse concepts within large-scale news corpora, including the evolution of attention and sentiment towards concepts over time, associations between different concepts in the narrative, and the actions being performed on or by different actors within the text relating to those concepts. We presented a case study on the portrayal of 'Nuclear Power' in modern science news, uncovering how the framing of the topic has changed since the Fukushima Disaster, while demonstrating both the usefulness and feasibility of such methods for large-scale textual analysis. Additionally, we presented an interactive tool that allows users, such as historians or social scientists, to interrogate large-scale corpora using keywords, phrases and concepts in an interactive and informative way, allowing macroscopic patterns to be unveiled. This type of tool offers a way to close the gap between fully automated but shallower computational approaches and manual inspection of the data, overcoming the significant drawbacks associated with each.

7.2 Future Directions for Research

We are really only at the beginning of this line of research into Data Science and Big Data, with many fields just beginning to take their first steps into data-driven approaches. As such, there remain many open questions to address and potential directions for future research to take. To conclude the project, we will take some time to outline a number of these future directions and open questions.

As with many quantitative methods, one direction for future research is always to improve the performance of the current methods, in an effort to match and possibly exceed human ability at the given task. In this project, this could apply to improving the performance of the topic classification task, creating and benchmarking better lexicons for mood signifiers, or further developing the extraction of triplets via dependency parsing, amongst others. While all of these are possible avenues, they are already mature research topics in their own right, with only incremental progress currently being made in each.

All of the work presented so far has focussed primarily on textual corpora and the knowledge which can be extracted from them. This misses a large part of the traditional study of society and culture, which would include other media of communication, such as images, video and audio. A first example in this direction can be seen in the analysis of gender bias within news images [108]. Bringing these other forms of recorded communication into culturomic analysis would give a fuller picture of the trends and behaviours we are aiming to understand.

Another avenue of considerable interest is combining the previously established Culturomic approaches [130, 161] with semantic information from knowledge bases, as outlined in the recent vision paper by Suchanek and Preda [217].
While the work in this project extends some way towards this, including associations between concepts (entities) and narrative analysis using SVO triplets, along with the more commonly studied lexical trends found in large-scale corpora, the wikification1 of text will add another layer of context to the expanding field of Culturomics and allow machine-extracted knowledge, found in the form of knowledge bases [10, 76, 31], to be used as additional annotation for the methods proposed in this project, enabling a deeper understanding of the cultural and societal trends and changes represented within the corpora.

Further to this, broadening the coverage of such approaches would have significant benefits, empowering larger sections of society and culture to be studied and understood. Coverage can be increased in several ways, most notably by expanding the number of languages that can be automatically processed and studied, but also by turning to historical sources, extending our coverage of society and culture further back into the past. Each of these directions faces challenges, ranging from developing tools which can handle text in many languages, to the corruption of text due to age and optical character recognition (OCR) errors for historical sources, to name a couple, but these are not challenges without solutions. Active research is taking place in both machine translation [20] and processing historical texts into a usable format [183], with a number of commonly used natural language processing tools already handling a multitude of languages. The challenge of combining these no doubt disparate textual corpora to unlock their potential will require work on finding ways to analyse texts of varying quality, length, validity, style and more in a comparable way, allowing discovered patterns, trends and knowledge to be drawn from across many different languages, cultures and eras.

Finally, following in the footsteps of Adam the robot scientist [110], further automation of the scientific process involved in Culturomic studies would allow vast numbers of hypotheses to be tested and verified, sifting interesting trends and patterns from the data before the final involvement of a human is needed to interpret meaning and place the findings within their larger historical and social context, found outside of the particular set of data being investigated. A starting point for this automation may lie in borrowing methods from the high-throughput world of genomics [50, 71]. Indeed, this automation of science has long been a goal in many scientific fields, with Nature noting as early as 2006 that “scientists will increasingly have to rely on automation to extract useful knowledge from . . . vast data resources” [1].

1 Defined as “The automated process of adding links to Wikipedia to specific words and phrases in an arbitrary text (e.g. a news article).” - http://en.wiktionary.org/wiki/wikification


APPENDIX A

Supplementary Examples

A.1 Example studies using Data Playground

Here, we include a small sample of the possible studies that one could make, demonstrating the tool on the science news corpus used for the nuclear power case study in Section 6.3.

Elon Musk Elon Musk is a prominent figure within the technology community, and is described by Wikipedia as “a Canadian-American entrepreneur, engineer, inventor and investor. He is the CEO and CTO of SpaceX, CEO and product architect of Tesla Motors, and chairman of SolarCity. He is the founder of SpaceX and a cofounder of PayPal, Inc., Tesla Motors, and Zip2. He has also envisioned a conceptual high-speed transportation system known as the Hyperloop” [242]. Using the Data Playground, we can see in Figure A.1 that there are six main peaks of attention relating to him in the Science news corpus. Investigating the largest peak, on 25th May 2012, we can see from the word cloud in Figure A.1 that this relates to the SpaceX Dragon spaceship successfully attaching to the International Space Station, making it the first commercial vehicle in history to do so1.

Figure A.1: Data Playground timeline for Elon Musk in the Science news corpus described in Section 6.3

Looking at the associations for Elon Musk, we can see in Figure A.2 that he is unsurprisingly closely associated with his own companies, the space station and other related entities such as Cape Canaveral (where many space launches take place) and the Air Force (which runs Cape Canaveral). We can additionally see that within the media he is associated with other prominent technology figures, such as Richard Branson and Jeff Bezos, who have expressed an interest in commercial space flight.

Figure A.2: Data Playground associations for Elon Musk in the Science news corpus described in Section 6.3

Triplets in the Data Playground, shown in Figure A.3, reveal that Elon Musk revealed plans for his Hyperloop Alpha design on 12th August 2013, a conceptual transport system for high-speed travel using capsules that ride on a cushion of air within lower-pressure tubes2. In this mini study of a single individual, we can see that Data Playground can be used to easily capture and explore all occurrences of a particular individual within a corpus, allowing one to infer their relations to other people and organisations, and their activities as shown in the media.

Figure A.3: Data Playground triplets for Elon Musk in the Science news corpus described in Section 6.3

Artificial Intelligence The field of Artificial Intelligence (AI) is becoming more popular and more widely covered in the mainstream media, with particular events being heavily covered in the news. Again, using the Data Playground to chart the occurrences of Artificial Intelligence in the science news corpus described in Section 6.3, we can see in Figure A.4 that the main peak of attention for Artificial Intelligence was in February 2011, when an artificial agent named Watson competed on the popular television quiz show Jeopardy against two previous human champions, resulting in Watson beating the humans at the game.

Figure A.4: Data Playground timeline for Artificial Intelligence in the Science news corpus described in Section 6.3

This event also relates to the conversation taking place in the media about AI one day surpassing human intelligence, as seen in the triplets involving artificial intelligence, shown in Figure A.5, where we can see the triplets “Artificial Intelligence matches Human Intelligence”, “Artificial Intelligence surpass Human Intelligence” and “Artificial Intelligence achieve Human Intelligence”. The triplets also reveal the connections between AI and robotics, as well as that Facebook are exploring Artificial Intelligence, with a deeper look into the data explaining that this relates to Facebook's hiring of Yann LeCun, the director of the NYU Center for Data Science.

Figure A.5: Data Playground triplets for Artificial Intelligence in the Science news corpus described in Section 6.3

1 Press Release from SpaceX: http://www.spacex.com/press/2012/12/19/spacex-makes-history
2 http://www.spacex.com/sites/spacex/files/hyperloop_alpha-20130812.pdf
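The triplets quoted above have the shape (subject, verb, object). The real pipeline extracts them via dependency parsing (Section 6.2), but their shape can be illustrated with a deliberately naive extractor over simple "X verb Y" sentences; the verb list here is a tiny hypothetical sample, not the project's verb inventory.

```python
# Naive subject-verb-object extraction over simple sentences, purely to
# illustrate the shape of the triplets shown in Figures A.3 and A.5.
VERBS = {"matches", "surpass", "achieve", "reveals", "explores"}

def extract_triplets(sentences):
    triplets = []
    for s in sentences:
        tokens = s.strip(".").split()
        for i, tok in enumerate(tokens):
            # Split the sentence at the first known verb.
            if tok.lower() in VERBS and 0 < i < len(tokens) - 1:
                subj = " ".join(tokens[:i])
                obj = " ".join(tokens[i + 1:])
                triplets.append((subj, tok, obj))
                break
    return triplets

trips = extract_triplets(["Artificial Intelligence surpass Human Intelligence",
                          "Facebook explores Artificial Intelligence"])
```

A dependency parser replaces the brittle word-splitting here with grammatical structure, which is what makes the triplet networks robust to the varied phrasing found in real news text.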

APPENDIX B

Supplementary Tables

Table B.1: Locations by Region for Twitter collection within the United Kingdom.

North England: Blackburn, Blackpool, Bradford, Bolton, Huddersfield, Hull, Leeds, Liverpool, Manchester, Middlesbrough, Newcastle, Oldham, Preston, Rotherham, Sheffield, Stockport, Sunderland, York
Central England: Basildon, Birmingham, Coventry, Derby, Ipswich, Leicester, Luton, Northampton, Norwich, Nottingham, Peterborough, Southend, Stoke, Watford, Wolverhampton
South England: Brighton, Bristol, Bournemouth, Gloucester, London, Oxford, Plymouth, Poole, Portsmouth, Reading, Slough, Southampton, Swindon
Scotland: Aberdeen, Dundee, Edinburgh, Glasgow
Wales: Cardiff, Newport, Swansea
Northern Ireland: Belfast

Table B.2: Number of articles labelled with the 50 most frequently occurring topics in the Reuters RCV1 corpus, along with the optimal hyperparameters ε and k found for label propagation (see Section 4.2).

Topic                   | Articles | ε    | k
Accounts Earnings       | 81,875   | 0.5  | 11
Annual Results          | 23,212   | 0.67 | 9
Bond Markets            | 26,036   | 0.41 | 7
Bonds Debt Issues       | 11,487   | 0.44 | 5
Capacity Facilities     | 32,153   | 0.46 | 3
Comment Forecasts       | 73,092   | 0.41 | 11
Commodity Markets       | 85,100   | 0.43 | 7
Contracts Orders        | 15,331   | 0.52 | 1
Corporate Industrial    | 374,316  | 0.33 | 7
Crime Law Enforcement   | 32,219   | 0.46 | 1
Defence                 | 8,842    | 0.46 | 3
Disasters and Accidents | 8,657    | 0.39 | 3
Domestic Politics       | 56,878   | 0.39 | 7
Economic Performance    | 8,568    | 0.47 | 3
Economics               | 117,539  | 0.41 | 5
Elections               | 11,532   | 0.47 | 7
Employment Labour       | 16,900   | 0.43 | 5
Energy Markets          | 21,957   | 0.4  | 8
Equity Markets          | 48,700   | 0.4  | 3
European Community      | 19,152   | 0.38 | 5
Expenditure Revenue     | 15,768   | 0.45 | 3
Forex Markets           | 26,752   | 0.5  | 13
Funding Capital         | 41,829   | 0.4  | 5
Government Borrowing    | 27,405   | 0.46 | 5
Government Finance      | 43,128   | 0.41 | 3
Government Social       | 234,873  | 0.26 | 5
Interbank Markets       | 28,185   | 0.56 | 6
International Relations | 37,739   | 0.42 | 5
Labour                  | 11,878   | 0.48 | 5
Labour Issues           | 17,241   | 0.48 | 3
Legal Judicial          | 11,944   | 0.58 | 1
Management              | 11,354   | 0.09 | 1
Management Moves        | 10,272   | 0.09 | 1
Markets                 | 200,190  | 0.4  | 9
Markets Marketing       | 40,506   | 0.52 | 3
Merchandise Trade       | 12,634   | 0.49 | 5
Mergers Acquisitions    | 43,374   | 0.54 | 1
Metals Trading          | 12,136   | 0.48 | 7
Monetary Economics      | 27,078   | 0.44 | 5
Money Markets           | 52,972   | 0.5  | 10
Ownership Changes       | 51,480   | 0.52 | 1
Performance             | 150,164  | 0.16 | 21
Production Services     | 25,403   | 0.55 | 3
Regulation Policy       | 37,410   | 0.45 | 3
Share Capital           | 18,313   | 0.4  | 5
Soft Commodities        | 47,708   | 0.44 | 5
Sports                  | 35,317   | 0.3  | 7
Strategy Plans          | 24,325   | 0.57 | 1
Trade Reserves          | 20,722   | 0.49 | 3
War, Civil War          | 32,615   | 0.4  | 5
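The hyperparameters ε and k in Table B.2 control the similarity graph over which topic labels are propagated. The following is a generic sketch of clamped label propagation on a small graph; it illustrates the mechanism, not the exact algorithm of Section 4.2, and the toy graph and labels are our own.

```python
import numpy as np

def propagate(W, labels, n_iter=100):
    """One-vs-rest label propagation on a similarity graph: scores diffuse
    over the row-normalised adjacency matrix while labelled documents stay
    clamped to their known values (NaN marks an unlabelled document)."""
    P = W / W.sum(axis=1, keepdims=True)
    known = ~np.isnan(labels)
    f = np.where(known, labels, 0.0)
    for _ in range(n_iter):
        f = P @ f
        f[known] = labels[known]  # clamp labelled nodes each sweep
    return f

# Tiny 4-document graph: documents 0-1 and 2-3 are strong neighbours.
W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.1, 1.0, 0.0]])
labels = np.array([1.0, np.nan, 0.0, np.nan])  # doc 0 in the topic, doc 2 not
scores = propagate(W, labels)
```

In the experiments behind Table B.2, the graph edges come from document similarity thresholded at ε among each document's k nearest neighbours, which is why the two hyperparameters are tuned per topic.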

Table B.3: Anger terms present in the WordNet-Affect lexicon.

abhor, abhorr, abomin, aggrav, aggress, amok, amuck, anger, angri, angrili, animos, animu, annoi, annoy, antagon, avarici, baffl, balk, bedevil, begrudg, belliger, bitter, bother, bothersom, brood, chafe, choler, covet, crucifi, dander, despis, despit, detest, devil, discourag, disdain, displeas, displeasur, dudgeon, dun, enmiti, enrag, envi, enviabl, enviou, envious, evil, exacerb, exasper, execr, fit, frustrat, furi, furiou, furious, gall, grabbi, grasp, gravel, greedi, grievanc, grizzl, grudg, hackl, harass, harri, hate, hatr, heartburn, hostil, huffi, huffili, incens, indign, indignantli, infuri, ir, irasc, irrit, jealou, jealous, jealousi, livid, loath, mad, madden, malefic, malevol, malic, malici, malign, misanthrop, misanthropi, misogyn, misogyni, mison, murder, nettl, nettlesom, odium, offend, offens, oppress, outrag, peev, persecut, peski, pester, pestifer, piqu, pout, prehensil, provok, rag, rage, rancor, resent, resentfulli, rile, roil, scene, score, scorn, sore, spite, spleen, stew, stung, sulk, sulki, tantal, tantrum, teas, temper, torment, umbrag, veng, vengefulli, venom, vex, vexat, vexati, vindict, warpath, wrath, wroth

Table B.4: Fear terms present in the WordNet-Affect lexicon.

affright, afraid, alarm, alert, anxiou, anxious, appal, apprehens, atroci, aw, awfulli, bashfulli, bode, browbeaten, bulli, chill, constern, cow, cower, crawl, creep, cring, cruel, cruelli, cruelti, dash, daunt, diffid, dire, dismai, dread, dreadfulli, fawn, fear, fearfulli, fearsom, forebod, fright, frighten, frighteningli, grovel, hangdog, hardheart, heartless, heartlessli, hesit, hesitantli, hesitatingli, hideou, hideous, intimid, horrend, horribl, horrid, horrif, horrifi, horrifyingli, horror, hyster, hysteria, merciless, monstrous, outrag, pall, panic, panick, panicki, pitiless, premonit, presag, presenti, ruthless, scare, scari, scarili, shadow, shiveri, shyli, shyness, suspens, terribl, terrifi, terror, timid, timidli, timor, trepid, ugli, unassert, uneasili, unkind, unsur

Appendix B. Supplementary Tables

Table B.5: Joy terms present in the WordNet-Affect lexicon.

admir, ador, affect, affection, amatori, amic, amor, anticip, appreci, approv, ardor, attach, avid, bang, banter, barrack, beam, beguil, approb, belong, benef, benefic, benefici, benevol, bewitch, blith, bonheur, brotherhood, brotherli, buoyanc, captiv, captur, care, carefre, catch, chaff, charg, charit, charm, cheer, cheerfulli, cheeri, close, comfort, commend, compat, complac, concern, congratul, consol, content, crush, delight, devot, eager, eagerli, ebulli, elan, elat, embolden, emot, empath, empathet, empathi, enamor, enamour, enchant, endearingli, enjoi, enthral, enthusiasm, enthusiast, entranc, esteem, euphor, euphoria, exalt, excit, exhilar, exhort, expans, expect, exuber, exuberantli, exult, exultantli, fanci, fascin, favor, favour, festal, festiv, flush, fond, fondli, fratern, friendli, fulfil, gai, gaieti, gratifi, gala, glad, gladden, glee, gleeful, gleefulli, gloat, good, goodwil, gratifyingli, great, gusto, happi, happili, heart, hearten, hilar, hilari, identif, identifi, impress, infatu, insouci, inspir, intimaci, intox, jocular, jocund, joi, jolli, jolliti, jovial, joy, joyfulli, joyou, joyous, jubil, jubilantli, keen, kick, kid, kindheart, kindli, laudabl, lighthearted, lightsom, likabl, like, likeabl, love, loving, lovingli, loyalti, merri, merrili, merriment, mirth, mirthfulli, near, nigh, occupi, partial, penchant, perki, predilect, preen, prefer, pride, protect, proudli, rapport, recreat, regard, rejoic, relish, respect, revel, riotous, romant, rush, satisfact, satisfactori, satisfactorili, satisfi, satisfyingli, schadenfreud, scream, smug, softheart, solac, sooth, stimul, strike, sunni, suspens, sympathet, sympathi, tast, teas, tender, thrill, tickl, titil, tranc, triumph, triumphal, triumphant, triumphantli, unworri, uplift, uproari, urg, wallow, warm, warmheart, warmhearted, warmth, weak, worri, worship, zeal, zealou, zest

Table B.6: Sadness terms present in the WordNet-Affect lexicon.

aggriev, attrit, bad, bereav, bereft, blue, bore, cheerless, compunct, contrit, dark, deject, demor, deplor, depress, desol, despairingli, despond, dingi, disconsol, discourag, dishearten, dismai, dismal, dispirit, distress, dole, dolefulli, dolor, downcast, downheart, downtrodden, drab, drear, dreari, dysphor, dysphoria, execr, forlorn, forlornli, gloom, gloomi, gloomili, glum, godforsaken, grief, griev, grievou, grievous, grim, guilt, guilti, hangdog, hapless, harass, heartach, heartbreak, heartrend, heartsick, helpless, joyless, joylessli, lachrymos, laden, lament, loneli, lorn, low, melanchol, melancholi, miser, miseri, misfortun, mourn, mournfulli, oppress, pathet, penanc, penit, penitenti, persecut, piteou, piti, pitiabl, plaintiv, poor, regret, remors, remorsefulli, repent, rue, rueful, ruefulli, ruth, sad, sadden, sadli, shame, shamefac, somber, sorri, sorrow, sorrowfulli, suffer, tear, tyrann, unhappi, weep, weepi, weight, woe, woebegon, woeful, woefulli, wretch
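The entries in Tables B.3–B.6 are stemmed (e.g. "angri", "furi"), so applying them to raw text requires reducing the text to the same stemmed form. As a hedged illustration of how such a lexicon might be used — not the thesis's mood-scoring method — the sketch below approximates stemming by prefix matching; `ANGER_STEMS` is a small sample from Table B.3, and `mood_score` is a hypothetical helper.

```python
import re

# Small sample of the stemmed anger terms from Table B.3.
ANGER_STEMS = {"anger", "angri", "furi", "outrag", "resent", "wrath"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def mood_score(text, stems):
    """Fraction of tokens matched by the lexicon.  A stemmed entry is
    taken to match any token it prefixes (e.g. 'angri' matches
    'angrily'), a crude stand-in for running the same stemmer over
    both the lexicon and the input text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if any(t.startswith(s) for s in stems))
    return hits / len(tokens)

score = mood_score("The furious crowd was outraged", ANGER_STEMS)
# score -> 0.4  ("furious" and "outraged" match, out of 5 tokens)
```

Prefix matching over-generates slightly (any token beginning with a stem counts once), which is why lexicon-based scores are usually aggregated over large samples of text rather than read off single sentences.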


Bibliography

[1] Steering the future of computing. Nature, 440(7083):383–383, Mar 2006. [2] G. Aad and A. C. et al. Observation of a new particle in the search for the standard model higgs boson with the {ATLAS} detector at the {LHC}. Physics Letters B, 716(1):1 – 29, 2012. [3] R. C. Agarwal, C. C. Aggarwal, and V. Prasad. A tree projection algorithm for generation of frequent item sets. Journal of parallel and Distributed Computing, 61(3):350–371, 2001. [4] C. C. Aggarwal and C. Zhai. A survey of text classification algorithms. In Mining Text Data, pages 163–222. Springer Science + Business Media, 2012. [5] R. Agrawal, R. Srikant, et al. Fast algorithms for mining association rules. In Proc. 20th int. conf. very large data bases, VLDB, volume 1215, pages 487–499, 1994. [6] O. Ali, G. Zappella, T. De Bie, and N. Cristianini. An empirical comparison of label prediction algorithms on automatically inferred networks. In Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods, pages 259 – 268, 2012. [7] L. Anderson. Demystifying the arab spring: parsing the differences between tunisia, egypt, and libya. Foreign Affairs, pages 2–7, 2011. [8] R. Angelova and G. Weikum. Graph-based text classification: learn from your neighbors. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 485–492. ACM, 2006. [9] M. Araujo, G. Navarro, and N. Ziviani. Large text searching allowing errors. 1997. 167

Bibliography

[10] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. Springer Science + Business Media, 2007. [11] S. Baccianella, A. Esuli, and F. Sebastiani. Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In LREC, volume 10, pages 2200–2204, 2010. [12] R. Baeza-Yates and G. Navarro. Block addressing indices for approximate text retrieval. Journal of the American Society for Information Science, 51(1):69–82, 2000. [13] K. Balog and M. de Rijke. Decomposing bloggers’ moods. In WWW-2006 Workshop on the Weblogging Ecosystem. Citeseer, 2006. [14] B. Barber and T. S. Kuhn. The Structure of Scientific Revolutions., volume 28. SAGE Publications, 1963. [15] M. Bautin, C. B. Ward, A. Patil, and S. S. Skiena. Access: news and blog analysis for the social sciences. In Proceedings of the 19th international conference on World wide web - WWW '10. Association for Computing Machinery (ACM), 2010. [16] R. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proceedings of the 16th international conference on World Wide Web, pages 131–140. ACM, 2007. [17] M. Belkin, I. Matveeva, and P. Niyogi. Regularization and semi-supervised learning on large graphs. pages 624–638, 2004. [18] G. K. Berland, M. N. Elliott, L. S. Morales, J. I. Algazy, R. L. Kravitz, M. S. Broder, D. E. Kanouse, J. A. Muñoz, J.-A. Puyol, M. Lara, K. E. Watkins, H. Yang, and E. A. McGlynn. Health information on the internet. JAMA, 285(20):2612, 2001. [19] A. Bermingham and A. F. Smeaton. On using twitter to monitor political sentiment and predict election results. 2011. [20] O. Bojar, C. Buck, C. Federmann, B. Haddow, P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post, H. Saint-Amand, R. Soricut, L. Specia, and A. Tamchyna. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation. 
Association for Computational Linguistics (ACL), 2014. [21] N. Bolger, A. Davis, and E. Rafaeli. Diary methods: Capturing life as it is lived. Annual Review of Psychology, 54(1):579–616, 2003. 168

Bibliography

[22] J. Bollen, B. Gonçalves, G. Ruan, and H. Mao. Happiness is assortative in online social networks. Artificial Life, 17(3):237–251, 2011. [23] J. Bollen, H. Mao, and X. Zeng. Twitter mood predicts the stock market. Journal of Computational Science, 2011. [24] J. Bollen, A. Pepe, and H. Mao. Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena. CoRR, abs/0911.1583, pages 1–10, 2009. [25] L. Borin, D. Dubhashi, M. Forsberg, R. Johansson, D. Kokkinakis, and P. Nugues. Mining semantics for culturomics. In Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing - UnstructureNLP '13. Association for Computing Machinery (ACM), 2013. [26] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010. [27] M. Buhrmester, T. Kwang, and S. D. Gosling. Amazon's mechanical turk: A new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6(1):3–5, 2011. [28] BuiltWith.com. Feed usage statistics. http://trends.builtwith.com/feeds. Accessed: 2014-06-26. [29] J. K. Burgoon, M. Burgoon, and M. Wilkinson. Writing style as predictor of newspaper readership, satisfaction and image. Journalism & Mass Communication Quarterly, 58(2):225–231, 1981. [30] R. A. Calvo and S. D’Mello. Affect detection: An interdisciplinary review of models, methods, and their applications. Affective Computing, IEEE Transactions on, 1(1):18–37, 2010. [31] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr, and T. M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, volume 5, page 3, 2010. [32] M. Carreira-Perpinan and R. Zemel. Proximity graphs for clustering and manifold learning. In Advances in Neural Information Processing Systems 17, NIPS-17, 2004. [33] N. Cesa-Bianchi, C. Gentile, F. Vitale, and G. Zappella. Active learning on graphs via spanning trees. 2010. [34] N. 
Cesa-Bianchi, C. Gentile, F. Vitale, and G. Zappella. Random spanning trees and the prediction of weighted graphs. In Proc. of ICML, pages 175–182. Citeseer, 2010. 169

Bibliography

[35] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu. edu.tw/~cjlin/libsvm. [36] B. S. Chaparro, A. D. Shaikh, and J. R. Baker. Reading online text with a poor layout: Is performance worse? Usability News, 7(1):1–4, 2005. [37] E. Charniak. A maximum-entropy-inspired parser. In Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference, pages 132–139. Association for Computational Linguistics, 2000. [38] D. Cheney. Text mining newspapers and news content: new trends and research methodologies. 2013. [39] I. C. Choi and G. W. Comstock. Interviewer effect on responses to a questionnaire relating to mood. American journal of epidemiology, 101(1):84–92, 1975. [40] J. E. Chung and E. Mustafaraj. Can collective sentiment expressed on twitter predict political elections? In AAAI, 2011. [41] L. A. Clark, D. Watson, and J. Leeka. Diurnal variation in the positive affects. Motivation and Emotion, 13(3):205–234, 1989. [42] Z. R. Cochrane, P. Gregory, and A. Wilson. Readability of consumer health information on the internet: A comparison of u.s. government–funded and commercially funded websites. Journal of Health Communication, 17(9):1003–1010, 2012. [43] M. Collins. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 16–23. Association for Computational Linguistics, 1997. [44] H. Cooper. Obama announces killing of osama bin laden, 2011. [Online; accessed 28-November2014]. [45] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms, 1990. [46] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995. 170

Bibliography

[47] M. A. Covington, E. Roche, and Y. Schabes. Finite-State Language Processing, volume 75. JSTOR, 1999. [48] C. Crain. Surveillance society. The New Yorker, September 2006. [49] N. Cristianini. On the current paradigm in artificial intelligence. AI Commun., 27(1):37–43, 2014. [50] N. Cristianini and M. W. Hahn. Introduction to computational genomics: a case studies approach. Cambridge University Press, 2006. [51] N. Cristianini and J. Shawe-Taylor. An introduction to Support Vector Machines and other kernelbased learning methods. Cambridge University Press, 2000. [52] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02), 2002. [53] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, N. Aswani, I. Roberts, G. Gorrell, A. Funk, A. Roberts, D. Damljanovic, T. Heitz, M. A. Greenwood, H. Saggion, J. Petrak, Y. Li, and W. Peters. Text Processing with GATE (Version 6). 2011. [54] W. A. Danielson and S. D. Bryan. Readability of wire stories in eight news categories. Journalism & Mass Communication Quarterly, 41(1):105–106, 1964. [55] D. Davidov, O. Tsur, and A. Rappoport. Enhanced sentiment learning using twitter hashtags and smileys. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 241–249. Association for Computational Linguistics, 2010. [56] D. Davidov, O. Tsur, and A. Rappoport. Semi-supervised recognition of sarcastic sentences in twitter and amazon. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pages 107–116. Association for Computational Linguistics, 2010. [57] M.-C. De Marneffe, B. MacCartney, C. D. Manning, et al. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, volume 6, pages 449–454, 2006. [58] M.-C. De Marneffe and C. D. Manning. 
Stanford typed dependencies manual. URL http://nlp. stanford. edu/software/dependencies manual. pdf, 2008. [59] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008. 171

Bibliography

[60] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. JAsIs, 41(6):391–407, 1990. [61] V. Dhar. Data science and prediction. Commun. ACM, 56(12):64–73, 2013. [62] T. Dietterich. Ensemble methods in machine learning. Multiple classifier systems, pages 1–15, 2000. [63] P. S. Dodds, K. D. Harris, I. M. Kloumann, C. A. Bliss, and C. M. Danforth. Temporal patterns of happiness and information in a global social network: Hedonometrics and twitter. PLoS ONE, 6(12):e26752, 12 2011. [64] W. Dong, C. Moses, and K. Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web, pages 577–586. ACM, 2011. [65] V. D'Orazio, S. T. Landis, G. Palmer, and P. Schrodt. Separating the wheat from the chaff: Applications of automated document classification using support vector machines. Political Analysis, 22(2):224–242, 2014. [66] M. Dorigo, M. Birattari, and T. Stutzle. Ant colony optimization. Computational Intelligence Magazine, IEEE, 1(4):28–39, 2006. [67] M. Duggan and P. R. C. Aaron Smith. Social media update 2013. http://pewinternet.org/Reports/ 2013/Social-Media-Update.aspx, January 2014. [68] M. Duggan and J. Brenner. The demographics of social media users, 2012, volume 14. Pew Research Center’s Internet & American Life Project Washington, DC, 2013. [69] V. D’Urso and R. Trentin. Introduzione alla psicologia delle emozioni. Laterza, 2007. [70] F. Dzogang, T. Lansdall-Welfare, S. Sudhahar, and N. Cristianini. Scalable preference learning from data streams. In Proceedings of the 24th International Conference on World Wide Web Companion, pages 885–890. International World Wide Web Conferences Steering Committee, 2015. [71] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863– 14868, 1998. 
172

Bibliography

[72] P. Ekman. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200, 1992. [73] eMarketer. Social networking reaches nearly one in four around the world. http://www.emarketer. com/Article/Social-Networking-Reaches-Nearly-One-Four-Around-World/1009976. Publication Date: 18th June 2013. [74] G. England. Genomics england and the 100,000 genomes project. http://www.genomicsengland. co.uk/wp-content/uploads/2014/07/Narrative-Genomics-England-the-100000-GenomesProject-FINAL-28-7-14.pdf, July 2014. [75] A. Esuli and F. Sebastiani. Sentiwordnet: A publicly available lexical resource for opinion mining. In Proceedings of LREC, volume 6, pages 417–422. Citeseer, 2006. [76] O. Etzioni, M. Banko, S. Soderland, and D. S. Weld. Open information extraction from the web. volume 51, page 68. Association for Computing Machinery (ACM), 2008. [77] S. S. Fahmy, W. Wanta, T. J. Johnson, and J. Zhang. The path to war: Exploring a secondlevel agenda-building analysis examining the relationship among the media, the public and the president. International Communication Gazette, 73(4):322–342, 2011. [78] I. Flaounas. Pattern analysis of news media content. PhD thesis, University of Bristol, 2011. [79] I. Flaounas, O. Ali, T. Lansdall-Welfare, T. De Bie, N. Mosdell, J. Lewis, and N. Cristianini. Research methods in the age of digital journalism: Massive-scale automated analysis of newscontent—topics, style and gender. Digital Journalism, 1(1):102–116, 2013. [80] I. Flaounas, O. Ali, M. Turchi, T. Snowsill, F. Nicart, T. De Bie, and N. Cristianini. NOAM: News Outlets Analysis and Monitoring System. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pages 1275–1278. Association for Computing Machinery (ACM), 2011. [81] R. Flesch. A new readability yardstick. Journal of Applied Psychology, 32(3):221–233, 1948. [82] J. H. Fowler, N. A. Christakis, et al. 
Dynamic spread of happiness in a large social network: longitudinal analysis over 20 years in the framingham heart study. BMJ, 337:a2338, 2008. [83] C. Fox. A stop list for general text. SIGIR Forum, 24(1-2):19–21, 1989. [84] W. B. Frakes. Information retrieval. chapter Stemming Algorithms, pages 131–160. Prentice-Hall, Inc., 1992. 173

Bibliography

[85] W. N. Francis and H. Kucera. Brown corpus manual. Brown University, 1979. [86] J. A. Fusaro and W. M. Conover. Readability of two tabloid and two nontabloid papers. Journalism & Mass Communication Quarterly, 60(1):142–144, 1983. [87] G. Gaskell and M. W. Bauer. Biotechnology 1996-2000: the years of controversy. Science Museum London, 2001. [88] D. Gayo-Avello. A meta-analysis of state-of-the-art electoral prediction from twitter data. Social Science Computer Review, page 0894439313493979, 2013. [89] G. Gerbner and L. Gross. Living with television: The violence profile. Journal of communication, 26(2):172–194, 1976. [90] M. Gilens and C. Hertzman. Corporate ownership and news bias: Newspaper coverage of the 1996 telecommunications act. The Journal of Politics, 62(02):369–386, 2000. [91] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases, pages 518–529. Morgan Kaufmann Publishers Inc., 1999. [92] N. Godbole, M. Srinivasaiah, and S. Skiena. Large-scale sentiment analysis for news and blogs. ICWSM, 7, 2007. [93] S. Golder and M. Macy. Diurnal and seasonal mood vary with work, sleep, and daylength across diverse cultures. Science, 333(6051):1878–1881, 2011. [94] S. Grimaldi, T. Partonen, J. Haukka, A. Aromaa, and J. Lönnqvist. Seasonal vegetative and affective symptoms in the finnish general population: Testing the dual vulnerability and latitude effect hypotheses. Nord J Psychiatry, 63(5):397–404, 2009. [95] A. Halevy, P. Norvig, and F. Pereira. The unreasonable effectiveness of data. IEEE Intell. Syst., 24(2):8–12, 2009. [96] J. D. Hamilton. Time series analysis, volume 2. Princeton university press Princeton, 1994. [97] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM SIGMOD Record, volume 29, pages 1–12. ACM, 2000. 174

Bibliography

[98] J. Harris. Word clouds considered harmful. http://www.niemanlab.org/2011/10/word-cloudsconsidered-harmful/, October 2011. [99] V. Hatzivassiloglou and K. R. McKeown. Predicting the semantic orientation of adjectives. In Proceedings of the Eighth Conference on European Chapter of the Association for Computational Linguistics, EACL ’97, pages 174–181, Stroudsburg, PA, USA, 1997. Association for Computational Linguistics. [100] V. Hatzivassiloglou and J. M. Wiebe. Effects of adjective orientation and gradability on sentence subjectivity. In Proceedings of the 18th conference on Computational linguistics -. Association for Computational Linguistics (ACL), 2000. [101] H. Heaps. Information retrieval: Computational and theoretical aspects. Academic Press, Inc. Orlando, FL, USA, 1978. [102] P. Hedstrom. SOCIOLOGY: Experimental macro sociology: Predicting the next best seller. Science, 311(5762):786–787, 2006. [103] E. Hensinger, I. Flaounas, and N. Cristianini. Learning the preferences of news readers with SVM and lasso ranking. In IFIP Advances in Information and Communication Technology, pages 179–186. Springer Science + Business Media, 2010. [104] M. Herbster and M. Pontil. Prediction on a graph with a perceptron. Advances in Neural Information Orocessing Systems, 19:577, 2007. [105] M. Herbster, M. Pontil, and S. Rojas-Galeano. Fast prediction on a tree. Advances in Neural Information Processing Systems, 21:657–664, 2009. [106] C.-N. Huang and H. Zhao. Chinese word segmentation: A decade review. Journal of Chinese Information Processing, 21(3):8–20, 2007. [107] T. Jebara, J. Wang, and S.-F. Chang. Graph construction and b-matching for semi-supervised learning. In Proceedings of the 26th Annual International Conference on Machine Learning - ICML '09, pages 441–448. Association for Computing Machinery (ACM), 2009. [108] S. Jia, T. Lansdall-Welfare, and N. Cristianini. Measuring gender bias in news images. 
In Proceedings of the 24th International Conference on World Wide Web Companion, pages 893–898. International World Wide Web Conferences Steering Committee, 2015. 175

Bibliography

[109] J. L. Johns and T. E. Wheat. Newspaper readability. Reading World, 18(2):141–147, 1978. [110] R. D. King, J. Rowland, S. G. Oliver, M. Young, W. Aubrey, E. Byrne, M. Liakata, M. Markham, P. Pir, L. N. Soldatova, et al. The automation of science. Science, 324(5923):85–89, 2009. [111] M. Krallinger, F. Leitner, and A. Valencia. Retrieval and discovery of cell cycle literature and proteins by means of machine learning, text mining and network analysis. In Advances in Intelligent Systems and Computing, pages 285–292. Springer Science + Business Media, 2014. [112] A. D. Kramer, J. E. Guillory, and J. T. Hancock. Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, 111(24):8788–8790, 2014. [113] N. Kronfeld-Schor and H. Einat. Circadian rhythms and depression: Human psychopathology and animal models. Neuropharmacology, 62(1):101–114, 2012. [114] H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In Proceedings of the 19th international conference on World wide web, pages 591–600. ACM, 2010. [115] V. Lampos. Detecting events and patterns in large-scale user generated textual streams with statistical learning methods. PhD thesis, 2012. [116] V. Lampos and N. Cristianini. Nowcasting events from the social web with statistical learning. 2011. [117] V. Lampos, T. De Bie, and N. Cristianini. Flu detector-tracking epidemics on twitter. Machine Learning and Knowledge Discovery in Databases, pages 599–602, 2010. [118] V. Lampos, T. Lansdall-Welfare, R. Araya, and N. Cristianini. Analysing mood patterns in the united kingdom through twitter content. CoRR, abs/1304.5507, 2013. [119] V. Lampos, D. Preotiuc-Pietro, and T. Cohn. A user-centric model of voting intention from social media. In ACL (1), pages 993–1003, 2013. [120] G. E. Lang and K. Lang. The battle for public opinion: The president, the press, and the polls during Watergate. 
Columbia University Press New York, 1983. [121] T. Lansdall-Welfare, I. Flaounas, and N. Cristianini. Automatic annotation of a dynamic corpus by label propagation. In Mathematical Methodologies in Pattern Recognition and Machine Learning, pages 19–32. Springer Science + Business Media, 2012. 176

Bibliography

[122] T. Lansdall-Welfare, I. Flaounas, and N. Cristianini. Scalable corpus annotation by graph construction and label propagation. In Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods, pages 25–34, 2012. [123] T. Lansdall-Welfare, V. Lampos, and N. Cristianini. Effects of the recession on public mood in the uk. In Proceedings of the 21st International Conference Companion on World Wide Web, WWW ’12 Companion, pages 1221–1226, New York, NY, USA, 2012. ACM. [124] T. Lansdall-Welfare, V. Lampos, and N. Cristianini. Nowcasting the mood of the nation. Significance, 9(4):26–28, 2012. [125] T. Lansdall-Welfare, J. Lewis, and N. Cristianini. Automating massive-scale analysis of news content. In B. Franklin and S. Eldridge II, editors, The Routledge Companion to Digital Journalism Studies. Routledge, 2016. In press. [126] T. Lansdall-Welfare, S. Sudhahar, G. A. Veltri, and N. Cristianini. On the coverage of science in the media: A big data study on the impact of the fukushima disaster. In Big Data (Big Data), 2014 IEEE International Conference on, pages 60–66. IEEE, 2014. [127] J. R. Lax and J. H. Phillips. How should we estimate public opinion in the states? American Journal of Political Science, 53(1):107–121, 2009. [128] D. Lazer, A. S. Pentland, L. Adamic, S. Aral, A. L. Barabasi, D. Brewer, N. Christakis, N. Contractor, J. Fowler, M. Gutmann, et al. Life in the network: the coming age of computational social science. Science (New York, NY), 323(5915):721, 2009. [129] G. Leban, B. Fortuna, J. Brank, and M. Grobelnik. Event registry: learning about world events from news. In Proceedings of the 23rd international conference on World Wide Web companion - WWW '14, pages 107–110. International World Wide Web Conferences Steering Committee, 2014. [130] K. Leetaru. Culturomics 2.0: Forecasting large-scale human behavior using global news media tone in time and space. First Monday, 16(9), 2011. [131] K. Leetaru and P. A. Schrodt. 
Gdelt: Global data on events, location and tone, 1979-2012. International Studies Association Annual Conference, April 2013. [132] R. Lehavy, F. Li, and K. Merkley. The effect of annual report readability on analyst following and the properties of their earnings forecasts. The Accounting Review, 86(3):1087–1115, 2011. 177

Bibliography

[133] M. E. Len-Rios. Representation of women in news and photos: Comparing content to perceptions. Journal of Communication, 55(1):152–168, 2005. [134] D. Lester. Temporal variation in suicide and homicide. American Journal of Epidemiology, 109(5):517–520, 1979. [135] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004. [136] J. Lewis and S. Cushion. The thirst to be first: An analysis of breaking news stories and their impact on the quality of 24-hour news coverage in the uk. Journalism Practice, 3(3):304–318. [137] J. Lewis, S. Cushion, and J. Thomas. Immediacy, convenience or engagement? an analysis of 24-hour news channels in the uk. Journalism Studies, 6(4):461–477, 2005. [138] J. Lewis, S. Inthorn, and K. Wahl-Jorgensen. Citizens or consumers: The media and the decline of political participation, 2005. [139] A. Lewy, R. Sack, L. Miller, and T. Hoban. Antidepressant and circadian phase-shifting effects of light. Science, 235(4786):352–354, 1987. [140] H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang. Pfp: parallel fp-growth for query recommendation. In Proceedings of the 2008 ACM conference on Recommender systems, pages 107– 114. ACM, 2008. [141] E. Lieberman, J.-B. Michel, J. Jackson, T. Tang, and M. A. Nowak. Quantifying the evolutionary dynamics of language. Nature, 449(7163):713–716, 2007. [142] D.-Y. M. Lin. Evaluating older adults’ retention in hypertext perusal: Impacts of presentation media as a function of text topology. Computers in Human Behavior, 20(4):491–503, 2004. [143] B. Liu. Web Data Mining: Exploring Hyperlinks, Contents and Usage Data. Springer Science + Business Media, 2011. [144] Y. Ma and J. Wu. Combining n-gram and dependency word pair for multi-document summarization. In Computational Science and Engineering (CSE), 2014 IEEE 17th International Conference on, pages 27–31. IEEE, 2014. [145] M. Maier, U. Von Luxburg, and M. 
Hein. Influence of graph construction on graph-based clustering measures. Advances in Neural Information Processing Systems, 22:1025–1032, 2009. 178

Bibliography

[146] C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. [147] C. D. Manning. Part-of-speech tagging from 97% to 100%: Is it time for some linguistics? In Computational Linguistics and Intelligent Text Processing, pages 171–189. Springer Science + Business Media, 2011. [148] B. J. Marafino, J. M. Davies, N. S. Bardach, M. L. Dean, R. A. Dudley, and J. Boscardin. N-gram support vector machines for scalable procedure and diagnosis classification, with applications to clinical free text data from the intensive care unit. Journal of the American Medical Informatics Association, 21(5):871–875, 2014. [149] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313–330, 1993. [150] E. Martínez-Cámara, M. T. Martín-Valdivia, L. A. Urena-López, and A. R. Montejo-Ráez. Sentiment analysis in twitter. Natural Language Engineering, 20(01):1–28, 2012. [151] H. Matters.

One day in history.

https://web.archive.org/web/20061010023450/http://www.

historymatters.org.uk/output/Page95.asp, 2006.

[Online; accessed via Internet Archive 17-

February-2015]. [152] V. Mayer-Schönberger and K. Cukier. Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt, 2013. [153] W. A. McCall and L. M. Crabbs. Standard Test Lessons in Reading. Number 5. Teachers College, Columbia University, Bureau of Publications, 1926. [154] M. McCombs. The agenda-setting role of the mass media in the shaping of public opinion. In Mass Media Economics 2002 Conference, London School of Economics: http://sticerd. lse. ac. uk/dps/extra/McCombs. pdf, 2002. [155] M. McCombs and S. I. Ghanem. The convergence of agenda setting and framing. Framing public life: Perspectives on media and our understanding of the social world, pages 67–81, 2001. [156] M. E. McCombs and D. L. Shaw. The agenda-setting function of mass media. Public opinion quarterly, 36(2):176–187, 1972. 179

Bibliography

[157] M. E. McCombs and D. L. Shaw. The evolution of agenda-setting research: twenty-five years in the marketplace of ideas. Journal of Communication, 43(2):58–67, 1993.
[158] P. McKellar. The emotion of anger in the expression of human aggressiveness. British Journal of Psychology. General Section, 39(3):148–155, 1949.
[159] A. Mehler, Y. Bao, X. Li, Y. Wang, and S. Skiena. Spatial analysis of news sources. IEEE Transactions on Visualization and Computer Graphics, 12(5):765–772, 2006.
[160] P. T. Metaxas, E. Mustafaraj, and D. Gayo-Avello. How (not) to predict elections. In 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom), pages 165–171. IEEE, 2011.
[161] J.-B. Michel, Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, The Google Books Team, J. P. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, S. Pinker, M. A. Nowak, and E. L. Aiden. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182, 2011.
[162] R. Mihalcea and H. Liu. A corpus-based approach to finding happiness. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, pages 139–144, 2006.
[163] R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. In D. Lin and D. Wu, editors, Proceedings of EMNLP 2004, pages 404–411. Association for Computational Linguistics, 2004.
[164] G. Mishne and M. De Rijke. Capturing global mood levels using blog posts. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, pages 145–152, 2006.
[165] A. Mislove, S. Lehmann, Y.-Y. Ahn, J.-P. Onnela, and J. N. Rosenquist. Pulse of the nation: visualizing the mood of Twitter. http://www.ccs.neu.edu/home/amislove/twittermood/, 2010.
[166] A. Mislove, S. Lehmann, Y.-Y. Ahn, J.-P. Onnela, and J. N. Rosenquist. Understanding the demographics of Twitter users. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM), 2011.
[167] F. Morstatter, J. Pfeffer, H. Liu, and K. M. Carley. Is the sample good enough? Comparing data from Twitter's Streaming API with Twitter's Firehose. In Seventh International AAAI Conference on Weblogs and Social Media, 2013.
[168] S. Moscovici and G. Duveen. Social Representations: Explorations in Social Psychology, volume 41. Polity Press, Cambridge, 2000.


[169] F. F.-H. Nah. A study on tolerable waiting time: how long are web users willing to wait? Behaviour & Information Technology, 23(3):153–163, 2004.
[170] A. Nenkova and K. McKeown. A survey of text summarization techniques. In Mining Text Data, pages 43–76. Springer, 2012.
[171] B. Nicholson. Counting culture; or, how to read Victorian newspapers from a distance. Journal of Victorian Culture, 17(2):238–246, 2012.
[172] J. Nivre, J. Hall, J. Nilsson, A. Chanev, G. Eryigit, S. Kübler, S. Marinov, and E. Marsi. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(02):95–135, 2007.
[173] J. C. Norcross and D. J. Vangarelli. The resolution solution: Longitudinal examination of New Year's change attempts. Journal of Substance Abuse, 1(2):127–134, 1989.
[174] Ofcom. News consumption in the UK: 2014 report. http://stakeholders.ofcom.org.uk/binaries/research/tv-research/news/2014/News_Report_2014.pdf, June 2014.
[175] S. Ogden. The language of agricultural biotechnology: terminate or be terminated. Organization & Environment, 14(3):336–340, 2001.
[176] Y. Ouyang, W. Li, S. Li, and Q. Lu. Applying regression models to query-focused multi-document summarization. Information Processing & Management, 47(2):227–237, 2011.
[177] E. S. Page. Continuous inspection schemes. Biometrika, 41(1/2):100–115, 1954.
[178] D. D. Palmer. Satz – an adaptive sentence segmentation system. Technical report, 1994.
[179] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135, 2008.
[180] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
[181] S. L. B. Payne. The Art of Asking Questions: Studies in Public Opinion, volume 3. Princeton University Press, 2014.
[182] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-Mine: Hyper-structure mining of frequent patterns in large databases. In Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM 2001), pages 441–448. IEEE, 2001.


[183] M. Piotrowski. Natural language processing for historical texts. Synthesis Lectures on Human Language Technologies, 5(2):1–157, 2012.
[184] J. Platt et al. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research, 1998.
[185] M. Porter. An algorithm for suffix stripping. Program: Electronic Library and Information Systems, 14(3):130–137, 1980.
[186] B. Pouliquen, R. Steinberger, and C. Best. Automatic detection of quotations in multilingual news. In Proceedings of Recent Advances in Natural Language Processing, pages 487–492, 2007.
[187] B. Pouliquen, H. Tanev, and M. Atkinson. Extracting and learning social networks out of multilingual news. In Proceedings of the Social Networks and Application Tools Workshop, Skalica, Slovakia, September 2008. Citeseer, 2008.
[188] R. Prabowo and M. Thelwall. A comparison of feature selection methods for an evolving RSS feed corpus. Information Processing & Management, 42(6):1491–1512, 2006.
[189] A. Reyes, P. Rosso, and D. Buscaldi. From humor recognition to irony detection: The figurative language of social media. Data & Knowledge Engineering, 74:1–12, 2012.
[190] J. C. Reynar and A. Ratnaparkhi. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, ANLC '97, pages 16–19. Association for Computational Linguistics, 1997.
[191] K. Sagae and A. Lavie. A classifier-based parser with linear run-time complexity. In Proceedings of the Ninth International Workshop on Parsing Technology, pages 125–132. Association for Computational Linguistics, 2005.
[192] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[193] E. Sandhaus. The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia, 2008.
[194] J. Santino. Halloween and Other Festivals of Death and Life. University of Tennessee Press, 1994.
[195] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pages 743–754. ACM, 2004.


[196] K. E. Schroder, M. P. Carey, and P. A. Vanable. Methodological challenges in research on sexual risk behavior: II. Accuracy of self-reports. Annals of Behavioral Medicine, 26(2):104–123, 2003.
[197] H. Schütze, D. A. Hull, and J. O. Pedersen. A comparison of classifiers and document representations for the routing problem. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '95. Association for Computing Machinery (ACM), 1995.
[198] N. Schwarz and G. L. Clore. Mood, misattribution, and judgments of well-being: Informative and directive functions of affective states. Journal of Personality and Social Psychology, 45(3):513, 1983.
[199] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.
[200] O. G. Selfridge. Pandemonium: a paradigm for learning. In Mechanisation of Thought Processes, 1958.
[201] D. Shahaf, C. Guestrin, and E. Horvitz. Trains of thought. In Proceedings of the 21st International Conference on World Wide Web - WWW '12. Association for Computing Machinery (ACM), 2012.
[202] D. Shahaf, J. Yang, C. Suen, J. Jacobs, H. Wang, and J. Leskovec. Information cartography. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '13. Association for Computing Machinery (ACM), 2013.
[203] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[204] A. Signorini, A. M. Segre, and P. M. Polgreen. The use of Twitter to track levels of disease activity and public concern in the US during the influenza A H1N1 pandemic. PLoS ONE, 6(5):e19467, 2011.
[205] A. Søgaard. Semisupervised condensed nearest neighbor for part-of-speech tagging. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 48–52. Association for Computational Linguistics, 2011.
[206] C. D. Spielberger. Profile of Mood States, volume 3. American Psychological Association (APA), 1972.
[207] O. Spindler and T. Fadrus. Grimace project documentation. Vienna University of Technology, 2009.


[208] D. J. Spoustová, J. Hajič, J. Raab, and M. Spousta. Semi-supervised training for the averaged perceptron POS tagger. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics - EACL '09. Association for Computational Linguistics (ACL), 2009.
[209] E. Stamatatos, N. Fakotakis, and G. Kokkinakis. Automatic extraction of rules for sentence boundary disambiguation. In Proceedings of the Workshop on Machine Learning in Human Language Technology, pages 88–92, 1999.
[210] R. Steinberger, B. Pouliquen, and E. Van Der Goot. An introduction to the Europe Media Monitor family of applications. arXiv preprint arXiv:1309.5290, 2013.
[211] G. H. Stempel III. Readability of six kinds of content in newspapers. Newspaper Research Journal, 3(1):32–37, 1981.
[212] S. M. Stigler. Francis Galton's account of the invention of correlation. Statistical Science, 4(2):73–79, 1989.
[213] A. A. Stone, J. E. Schwartz, D. Schkade, N. Schwarz, A. Krueger, and D. Kahneman. A population approach to the study of emotion: Diurnal rhythms of a working day examined with the day reconstruction method. Emotion, 6(1):139–149, 2006.
[214] C. Strapparava and R. Mihalcea. Learning to identify emotions in text. In Proceedings of the 2008 ACM Symposium on Applied Computing, pages 1556–1560. ACM, 2008.
[215] C. Strapparava and A. Valitutti. WordNet-Affect: an affective extension of WordNet. In Proceedings of the 4th International Conference on Language Resources and Evaluation, pages 1413–1418, 2004.
[216] J.-S. Su. Advances in machine learning based text categorization. Journal of Software, 17(9):1848, 2006.
[217] F. M. Suchanek and N. Preda. Semantic culturomics. Proceedings of the VLDB Endowment, 7(12):1215–1218, 2014.
[218] S. Sudhahar. Automated Analysis of Narrative Text using Network Analysis in Large Corpora. PhD thesis, University of Bristol, 2015.
[219] S. Sudhahar, T. Lansdall-Welfare, I. Flaounas, and N. Cristianini. ElectionWatch: detecting patterns in news coverage of US elections. In Proceedings of EACL, pages 82–86. Association for Computational Linguistics, 2012.


[220] E. Sveinsdottir and E. Frøkjær. Datalogy - the Copenhagen tradition of computer science. BIT, 28(3):450–472, 1988.
[221] J. Szymański and W. Duch. Representation of hypertext documents based on terms, links and text compressibility. In Neural Information Processing. Theory and Algorithms, pages 282–289. Springer Science + Business Media, 2010.
[222] H. Tanev, V. Zavarella, J. Linge, M. Kabadjov, J. Piskorski, M. Atkinson, and R. Steinberger. Exploiting machine learning techniques to build an event extraction system for Portuguese and Spanish. Linguamática, 1(2):55–66, 2009.
[223] W. A. Taylor. Change-point analysis: a powerful new tool for detecting changes. Preprint, available at http://www.variation.com/cpa/tech/changepoint.html, 2000.
[224] M. Thelwall, K. Buckley, and G. Paltoglou. Sentiment in Twitter events. Journal of the American Society for Information Science and Technology, 62(2):406–418, 2011.
[225] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - NAACL '03. Association for Computational Linguistics (ACL), 2003.
[226] K. Toutanova and C. D. Manning. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics (ACL), 2000.
[227] M. Trampuš and B. Novak. Internals of an aggregated web news feed. In Proceedings of the Fifteenth International Information Science Conference IS SiKDD 2012, pages 431–434, 2012.
[228] A. Tsakalidis, S. Papadopoulos, A. Cristea, and Y. Kompatsiaris. Predicting elections for multiple countries using Twitter and polls.
[229] Y. Tsuruoka, Y. Miyao, and J. Kazama. Learning with lookahead: can history-based models rival globally optimized models? In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 238–246. Association for Computational Linguistics, 2011.


[230] J. W. Tukey. Sunset salvo. The American Statistician, 40(1):72, 1986.
[231] A. Tumasjan, T. O. Sprenger, P. G. Sandner, and I. M. Welpe. Predicting elections with Twitter: What 140 characters reveal about political sentiment. ICWSM, 10:178–185, 2010.
[232] P. D. Turney. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 417–424. Association for Computational Linguistics (ACL), 2002.
[233] Twitter. About Twitter. https://about.twitter.com/, 2015. [Online; accessed 16-February-2015].
[234] C. van Rijsbergen. Information Retrieval. Butterworth, 1979.
[235] G. A. Veltri. Microblogging and nanotweets: Nanotechnology on Twitter. Public Understanding of Science, 22(7):832–849, 2013.
[236] G. A. Veltri. Viva la nano-revolución! A semantic analysis of the Spanish national press. Science Communication, 35(2):143–167, 2013.
[237] J. R. Vittengl and C. S. Holt. A time-series diary study of mood and social interaction. Motivation and Emotion, 22(3):255–275, 1998.
[238] W3Techs. Usage statistics of character encodings for websites, November 2014. http://w3techs.com/technologies/overview/character_encoding/all. [Online; accessed 19-November-2014].
[239] C. B. Ward, M. Bautin, and S. Skiena. Identifying differences in news coverage between cultural/ethnic groups. In Web Intelligence and Intelligent Agent Technologies, 2009. WI-IAT '09. IEEE/WIC/ACM International Joint Conferences on, volume 3, pages 511–514. IET, 2009.
[240] D. J. Watts. A twenty-first century science. Nature, 445(7127):489, 2007.
[241] E. Wiener, J. O. Pedersen, A. S. Weigend, et al. A neural network approach to topic spotting. In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, pages 317–332. Citeseer, 1995.
[242] Wikipedia. Elon Musk — Wikipedia, The Free Encyclopedia, 2015. [Online; accessed 1-May-2015].
[243] Y. Yang, J. Zhang, and B. Kisiel. A scalability analysis of classifiers in text categorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 96–103. ACM, 2003.


[244] YouGov. YouGov / The Sun survey results. http://cdn.yougov.com/today_uk_import/yg-archives-pol-sun-riots-100811.pdf, August 2011.
[245] M. J. Zaki, S. Parthasarathy, M. Ogihara, W. Li, et al. New algorithms for fast discovery of association rules. In KDD, volume 97, pages 283–286, 1997.
[246] J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 73(2):213–238, 2007.
[247] W. Zhang and S. Skiena. Trading strategies to exploit blog and news sentiment. In ICWSM, 2010.
[248] D. Zhao and M. B. Rosson. How and why people Twitter: the role that micro-blogging plays in informal communication at work. In Proceedings of the ACM 2009 International Conference on Supporting Group Work - GROUP '09. Association for Computing Machinery (ACM), 2009.
[249] X. Zhu. Semi-supervised learning literature survey. Technical report, Computer Science, University of Wisconsin-Madison, 2007.
[250] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In International Conference on Machine Learning, volume 20, page 912, 2003.
[251] G. Zipf. Selected Studies of the Principle of Relative Frequency in Language. Harvard University Press, 1932.
