Automatic Extraction of Concealed Relations from Email Logs (Extended Abstract) ∗

Nishith Pathak
Department of Computer Science, University of Minnesota, Twin Cities, Minneapolis, MN, USA

Jaideep Srivastava
Department of Computer Science, University of Minnesota, Twin Cities, Minneapolis, MN, USA


ABSTRACT

People interact with each other for various reasons, and the purpose of a relationship shapes the characteristics of these interactions. One important characteristic is concealment. Concealed relations are often of interest, especially in the domain of counter-terrorism, where relations fostering malicious activities tend to be secretive or concealed from the general public. In this paper we propose a technique for extracting concealed relations from social network data. The technique analyzes actors' perceptions of other actors' social interactions, and it requires that these perceptions can be constructed from the social network data. One popular communication medium for which this can be done efficiently is electronic mail. The proposed technique uses the popular and robust tf-idf measure from the information retrieval literature to quantify the concept of concealment. We present experimental results from the Enron email corpus.

Categories and Subject Descriptors H.4.3 [Information Systems]: Communications Applications—email ; H.3.3 [Information Systems]: Information Search and Retrieval—tf-idf

General Terms Algorithms, Measures

Keywords Social Network Analysis, email, concealed relations, tf-idf

1. INTRODUCTION

Intuitively, a concealed relation can be defined as a relation that is strong but known to only a very small subset of actors. An instance of interaction between two actors can be anything quantifiable, such as a conversation between them or an email exchanged between them. By a strong relation we mean a pairwise interaction that is much more frequent than the average pairwise interaction in the social network. When we talk about a relationship being perceived by a third actor, we mean that this actor has observed some threshold number of instances of interaction between the two actors involved in the relation, thus allowing him/her to have sufficient belief in the existence of their social relationship.

∗ Copyright is held by the author/owner(s). NetSci2006, May 22–25, 2006, Bloomington, IN, USA.

There are various reasons that drive people to keep their social activities secret or concealed from the rest of the social network. This problem is of interest in the counter-terrorism domain, where individuals involved in malicious activities tend to entertain secret interactions. One of the primary problems in counter-terrorism is that of email surveillance, and some recent efforts in the computer science community have been directed towards this problem [2, 3]. In the social network domain, Baker and Faulkner [1] have discussed how actors involved in illegal activities tend to prioritize concealment above other factors. It would also be interesting to study the role of concealed relations in the informal network of an organization.

In this paper we propose an approach for automatically extracting concealed relations through email header analysis. Wellman [7] has identified how electronic communication is gaining importance as a medium for social networking. In the case of email communication, an actor observes only those emails which are addressed to him/her (i.e. the actor is on the To, Cc or Bcc fields). For example, consider an e-mail sent by actor A to B, with Cc to C and Bcc to D. The analysis of the header reveals the following: B and C know that A and B communicated, and that all three of them (A, B and C) know about this communication. However, neither B nor C knows that D was also sent this e-mail. A and D know everything, and both of them also know that B and C do not know of D's getting the e-mail. This analysis illustrates that a single e-mail can create different beliefs among different people, depending on whether and how they are included in it. Moreover, it also provides information about who perceives which interactions. In [5] the authors provide an approach for the construction and analysis of actors' perceptions from email logs.

The proposed approach is based on the popular tf-idf measure [6] from information retrieval. With the appropriate semantic associations, the tf-idf measure can be transformed into an efficient and robust technique for scoring and ranking relations based on their "level of concealment." In section 2 we introduce the proposed approach, followed by experimental results in section 3 and conclusions in section 4.
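The To/Cc/Bcc example above can be sketched in code. The following is a minimal illustration, not the construction used in [5]: the record layout (`from`, `to`, `cc`, `bcc` keys) and the function names are hypothetical, and it assumes that every sender-recipient pair on a header counts as one instance of interaction. Under that assumption, To and Cc recipients observe the sender's interactions with all To and Cc recipients, while the sender and each Bcc recipient additionally observe the hidden delivery.

```python
from collections import defaultdict


def pair(a, b):
    """Undirected relation key: the same tuple for (a, b) and (b, a)."""
    return tuple(sorted((a, b)))


def perceptual_counts(emails):
    """Return counts[i][(k, l)]: interactions between k and l observed by i.

    `emails` is a list of dicts with a "from" string and optional "to",
    "cc" and "bcc" lists (a hypothetical layout chosen for this sketch).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for mail in emails:
        sender = mail["from"]
        visible = [r for r in mail.get("to", []) + mail.get("cc", [])
                   if r != sender]
        hidden = [r for r in mail.get("bcc", []) if r != sender]

        # The sender observes every delivery, including Bcc ones.
        for r in visible + hidden:
            counts[sender][pair(sender, r)] += 1

        # To/Cc recipients observe only the To/Cc deliveries.
        for observer in visible:
            for r in visible:
                counts[observer][pair(sender, r)] += 1

        # A Bcc recipient observes the To/Cc deliveries plus its own copy,
        # but not the copies sent to other Bcc recipients.
        for observer in hidden:
            for r in visible + [observer]:
                counts[observer][pair(sender, r)] += 1
    return counts


# The single e-mail from the example: A writes to B, Cc C, Bcc D.
emails = [{"from": "A", "to": ["B"], "cc": ["C"], "bcc": ["D"]}]
counts = perceptual_counts(emails)
for actor in "ABCD":
    print(actor, dict(counts[actor]))
```

For this one message, B and C each record the relations A-B and A-C, whereas A and D also record A-D, which matches the beliefs described in the example.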

2. PROPOSED APPROACH

Consider a social network consisting of $N$ actors. If the set of actors is denoted by $A$, then for every actor $a_i \in A$ we construct an $N \times N$ matrix $M_i$, where entry $m^i_{kl}$ corresponds to the number of instances of interaction between $a_k$ and $a_l$ observed by $a_i$. Such matrices can be efficiently constructed from e-mail logs by analyzing their header information. Although illustrated for email, the proposed approach is applicable to any social network for which it is feasible to construct the set of matrices $M_i$ for all actors $a_i \in A$. We refer to matrix $M_i$ as the perceptual matrix for actor $a_i$.

The tf-idf measure is used to extract those words from a document in a corpus that characterize that particular document within the corpus. The tf (term frequency) denotes the frequency of the term in the document, and the idf (inverse document frequency) is a measure of the "uniqueness" of the term within the corpus. The idf of a term is the log of the ratio of the total number of documents in the corpus to the number of documents containing that term. The tf-idf score is the product of tf and idf. Terms that are highly frequent but confined to a small set of documents serve to set these few documents apart from the rest. The tf-idf score for such terms is higher, and thus these terms are identified. When a query is presented, the documents for which the query terms score highly are returned as relevant results.

Consider the set of all perceptual matrices to be analogous to a corpus, and an actor $a_i$'s perceptual matrix $M_i$ to be analogous to a document. Each relation between a pair of actors in a perceptual matrix is analogous to a term in a document, and the number of instances of each relation observed by an actor is analogous to the term frequency in the document. We can now use the idf part to score those relations that are perceived by only a small subset of actors. From this set of uniquely perceived relations, we filter out the strong ones by multiplying the idf score by the relative frequency of occurrence of that relation.
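This relative frequency can be thought of as the tf score for the relation. For a relation $r_{kl}$ between two actors $a_k$ and $a_l$, we define its tf-idf score as

$$ s_{kl} = \frac{n_{kl}}{n_{avg}} \, \log\left( \frac{N}{1 + \sum_{j=1}^{N} \delta^{j}_{kl}} \right) \tag{1} $$

where

$$ \delta^{j}_{kl} = \begin{cases} 1 & \text{if } m^{j}_{kl} \ge t \\ 0 & \text{otherwise} \end{cases} $$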

Here, $N$ is the total number of actors, $n_{kl}$ is the frequency of the interaction $r_{kl}$ (note that $n_{kl} = m^{k}_{kl} = m^{l}_{kl}$), $n_{avg}$ is the average frequency of an interaction, and $t$ is the threshold number of instances of relation $r_{kl}$ that an actor must observe in order to "perceive" that relation. The log term can be seen as the idf part and the relative frequency as the tf part. Relative frequency is used to ensure that the strength of a relation is not biased by the general level of communication activity in the social network. We add 1 in the denominator of the log term to account for the case when all $\delta^{j}_{kl}$ are 0. The score $s_{kl}$ is indicative of the level of concealment of the relation between actors $a_k$ and $a_l$. The stronger and the less perceived a relation is, the greater its score.

For a given actor $a_i$, if we replace $n_{kl}$ by $m^{i}_{kl}$ in the expression for $s_{kl}$, then we have the tf-idf score for relation $r_{kl}$ relative to actor $a_i$, denoted by $s^{i}_{kl}$. Note that the higher $s^{i}_{kl}$ is, the more privy actor $a_i$ is to the relationship between actors $a_k$ and $a_l$. The relative tf-idf score $s^{i}_{kl}$ achieves its maximum value, $s_{kl}$, for $i = k$ and $i = l$. This agrees with the intuition that the participating actors themselves are most privy to their relationship. Thus, the proposed tf-idf score $s_{kl}$ for social networks can be used to rank social relations based on their level of concealment, and the actor-relative tf-idf score $s^{i}_{kl}$ can be used to identify the actors who are privy to these concealed relations.
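The two scores defined in this section can be sketched as follows. This is an illustrative reading of the formulas rather than the authors' implementation: the function names and the dictionary layout are hypothetical, the natural logarithm is assumed because the base is not stated, and the average frequency is assumed to be the mean over relations with at least one interaction.

```python
import math


def relation_stats(perceived, t=1):
    """Collect n_kl and the number of perceivers for every relation.

    perceived[i][(k, l)] holds m^i_kl, the count of k-l interactions
    observed by actor i (a hypothetical layout for this sketch).
    """
    strength, perceivers = {}, {}
    for actor, relations in perceived.items():
        for (k, l), m in relations.items():
            if actor in (k, l):
                strength[(k, l)] = m  # n_kl = m^k_kl = m^l_kl
            if m >= t:
                perceivers[(k, l)] = perceivers.get((k, l), 0) + 1
    return strength, perceivers


def concealment_scores(perceived, n_actors, t=1):
    """s_kl = (n_kl / n_avg) * log(N / (1 + number of perceivers))."""
    strength, perceivers = relation_stats(perceived, t)
    n_avg = sum(strength.values()) / len(strength)  # assumed definition
    return {
        rel: (n_kl / n_avg) * math.log(n_actors / (1 + perceivers.get(rel, 0)))
        for rel, n_kl in strength.items()
    }


def relative_scores(perceived, n_actors, rel, t=1):
    """s^i_kl for every actor i: n_kl is replaced by m^i_kl."""
    strength, perceivers = relation_stats(perceived, t)
    n_avg = sum(strength.values()) / len(strength)
    idf = math.log(n_actors / (1 + perceivers.get(rel, 0)))
    return {actor: (relations.get(rel, 0) / n_avg) * idf
            for actor, relations in perceived.items()}


# Toy data: A-B is frequent and seen only by A and B; A-C is infrequent
# and also seen by D. The network is assumed to have N = 10 actors.
perceived = {
    "A": {("A", "B"): 6, ("A", "C"): 2},
    "B": {("A", "B"): 6},
    "C": {("A", "C"): 2},
    "D": {("A", "C"): 1},
}
scores = concealment_scores(perceived, n_actors=10)
for rel, s in sorted(scores.items(), key=lambda item: -item[1]):
    print(rel, round(s, 3))
print(relative_scores(perceived, 10, ("A", "C")))
```

In this toy network the frequent, narrowly observed relation A-B ranks above A-C, and the relative scores for A-C are highest for the two participants, as the text predicts.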

3. EXPERIMENTAL RESULTS

The Enron email corpus is a corpus of emails exchanged by 151 Enron employees spanning the period from mid-1999 to mid-2002, inclusive of the Enron crisis occurring in October 2001. For a detailed description of the Enron email dataset the reader is referred to [4]. For our experiments we considered undirected relations (i.e. emails from $a_k$ to $a_l$ and from $a_l$ to $a_k$ were aggregated under a single relation $a_k \leftrightarrow a_l$) and used a threshold of $t = 1$ for perceiving a relation. Concealed relations were extracted only for communication occurring in October 2000 and October 2001. This allows us to compare a normal month in the life of the organization (October 2000) with the crisis period (October 2001).

Tables 1 and 2 show the top 10 concealed relations along with their scores for the months of October 2000 and 2001 respectively¹. Each actor's position is abbreviated after his/her name: employee (e), president (p), vice-president (vp), director (dir), manager (man), in-house lawyer (lawyer) and not available (na). One interesting observation is the significant number of cross-hierarchical relations, such as Richard Shapiro (vp) ↔ Jeff Dasovich (e), Mike Grigsby (man) ↔ Barry Tycholiz (vp) and Mike Grigsby (man) ↔ Jay Reitmeyer (e). The relatively low scores for October 2001 also suggest that during the crisis period actors' relations became less concealed in general.

Tables 3 and 4 show the top 5 actors who are privy to each of the 3 topmost concealed relations for October 2000 and 2001 respectively. For the month of October 2000, the actor-relative scores for the top two actors, i.e. the actors involved in the relation, are much higher than those for other actors. This leads us to conclude that knowledge of the relation is largely confined to the two participating actors.

An interesting pattern can be observed for October 2001. All of the top 3 relations are among the three actors D. Steffes, Richard Shapiro and Jeff Dasovich. Notice that D. Steffes is privy to the relationship between Richard Shapiro and Jeff Dasovich. This can be inferred from the relatively close scores of all three actors for the relation Richard Shapiro ↔ Jeff Dasovich (Table 4). However, Richard Shapiro is not as privy to the relation D. Steffes ↔ Jeff Dasovich, nor is Jeff Dasovich as privy to the relation D. Steffes ↔ Richard Shapiro. This is inferred from the large difference in scores between the third actor and the top two actors in the first and third relations (Table 4). One can also find other groups of actors exhibiting pairwise concealed relations, such as Tana Jones, Sara Shackleton and Stephanie Panus, and Richard Shapiro, Jeff Dasovich and Mary Hain, both groups in October 2000.

People tend to have concealed relations with their "confidential sources" across departments and organizations. In some cases these relations might not be desirable (e.g. espionage rings) and in some cases they might actually be beneficial. The role of concealed relations in the informal networks of organizations is a subject of great interest. However, one first needs to identify these concealed relations, and this is the issue addressed in our work.

¹ In all tables, x.yz...E7 denotes x.yz... × 10^7.

Table 1: Top 10 Concealed Relations (October 2000)

Relation                                      Score
Tana Jones (e) ↔ Sara Shackleton (e)          1.7760794E7
Richard Shapiro (vp) ↔ Jeff Dasovich (e)      1.3316896E7
Marie Heard (na) ↔ Tana Jones (e)             1.2031506E7
Jeff Dasovich (e) ↔ Mary Hain (lawyer)        1.0895643E7
Stephanie Panus (e) ↔ Sara Shackleton (e)     1.0026255E7
Stacy Dickson (e) ↔ Tana Jones (e)            9685016.0
Matthew Lenhart (e) ↔ Eric Bass (trader)      8021003.5
Mark Whitt (na) ↔ Gerald Nemec (na)           7739389.0
Richard Shapiro (vp) ↔ Mary Hain (lawyer)     5182706.0
Stephanie Panus (e) ↔ Tana Jones (e)          4637158.0

Table 2: Top 10 Concealed Relations (October 2001)

Relation                                      Score
D. Steffes (vp) ↔ Jeff Dasovich (e)           1.0007493E7
Richard Shapiro (vp) ↔ Jeff Dasovich (e)      5063396.0
D. Steffes (vp) ↔ Richard Shapiro (vp)        4718486.5
Marie Heard (na) ↔ Sara Shackleton (e)        3927464.5
Kimberly Watson (e) ↔ Mark Mcconnell (na)     3759267.0
Kimberly Watson (e) ↔ Michelle Lokay (e)      3408572.5
Mike Grigsby (man) ↔ Barry Tycholiz (vp)      3079402.2
Mike Grigsby (man) ↔ Matt Smith (na)          2905096.5
Mike Grigsby (man) ↔ Jason Wolfe (na)         2902135.2
Mike Grigsby (man) ↔ Jay Reitmeyer (e)        2852143.8

Table 3: Top 5 actors for the top 3 Concealed Relations (October 2000)

Tana Jones (e) ↔ Sara Shackleton (e)
Actor                     Actor Relative Score
Tana Jones (e)            1.7760794E7
Sara Shackleton (e)       1.7760794E7
Susan Bailey (na)         5729288.5
Stephanie Panus (e)       5442824.0
Carol Clair (lawyer)      4583431.0

Richard Shapiro (vp) ↔ Jeff Dasovich (e)
Actor                     Actor Relative Score
Richard Shapiro (vp)      1.3316896E7
Jeff Dasovich (e)         1.3316896E7
Mary Hain (lawyer)        2723910.8
Robert Badeer (dir)       302656.75
B. Sanders (vp)           0.0

Marie Heard (lawyer) ↔ Tana Jones (e)
Actor                     Actor Relative Score
Marie Heard (lawyer)      1.2031506E7
Tana Jones (e)            1.2031506E7
Stacy Dickson (e)         7448075.5
Stephanie Panus (e)       1432322.1
Susan Bailey (na)         1432322.1

Table 4: Top 5 actors for the top 3 Concealed Relations (October 2001)

D. Steffes (vp) ↔ Jeff Dasovich (e)
Actor                     Actor Relative Score
D. Steffes (vp)           1.0007493E7
Jeff Dasovich (e)         1.0007493E7
Richard Shapiro (vp)      5138983.0
J. Kean (vp)              1700114.6
B. Sanders (vp)           1661475.6

Richard Shapiro (vp) ↔ Jeff Dasovich (e)
Actor                     Actor Relative Score
Richard Shapiro (vp)      5063396.0
Jeff Dasovich (e)         5063396.0
D. Steffes (vp)           4324214.0
J. Kean (vp)              1921873.0
B. Sanders (vp)           702222.8

D. Steffes (vp) ↔ Richard Shapiro (vp)
Actor                     Actor Relative Score
D. Steffes (vp)           4718486.5
Richard Shapiro (vp)      4718486.5
Jeff Dasovich (e)         780501.5
B. Sanders (vp)           496682.78
Louise Kitchen (p)        319296.06

4. CONCLUSIONS

In this paper we have presented an approach for extracting concealed relations from data on actors' perceptions. The proposed approach is particularly applicable to email networks and has useful applications, especially in the domain of counter-terrorism. Results on the Enron email corpus are promising. Our ongoing work is directed towards developing techniques for efficiently mining interesting patterns from concealed relations, which would help in understanding their role in the informal network of an organization.

5. REFERENCES

[1] W. E. Baker and R. R. Faulkner. The social organization of conspiracy: Illegal networks in the heavy electrical equipment industry. American Sociological Review, 58(6):837–860, 1993.
[2] M. W. Berry and M. Browne. Email surveillance using non-negative matrix factorization. Comput. Math. Organ. Theory, 11(3):249–264, 2005.
[3] A. Chapanond, M. S. Krishnamoorthy, and B. Yener. Graph theoretic and spectral analysis of Enron email data. Comput. Math. Organ. Theory, 11(3):265–281, 2005.
[4] J. Shetty and J. Adibi. The Enron email dataset database schema and brief statistical report. Technical report, Information Sciences Institute, 2004.
[5] N. Pathak, S. Mane, and J. Srivastava. Who thinks who knows who? Socio-cognitive analysis of an email network. Technical report, AHPCRC, 2006.
[6] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA, 1986.
[7] B. Wellman, J. Salaff, D. Dimitrova, L. Garton, M. Gulia, and C. Haythornthwaite. Computer networks as social networks: Collaborative work, telework, and virtual community. Annual Review of Sociology, 22:213–238, 1996.