
Mixing research methods to unveil work practices of dispersed Open Source communities: lessons learned from the PyPy study

Anders Sigfridsson, Anne Sheehan, Gabriela Avram
Interaction Design Centre, University of Limerick, Ireland
{anders.sigfridsson; anne.sheehan; gabriela.avram}@ul.ie

Abstract

Open Source software development projects are based on collaboration within dispersed, multifaceted, and volunteer-based communities. Studying the work practices of such a community requires drawing on several methodological approaches and employing several different research methods. In this paper we present a study of an Open Source community called PyPy. The focus of the paper is on the evolution of the study itself and how we have utilized several different research methods – including participant observation, virtual ethnography, and electronic questionnaires – to unveil the work practices of this community. We conclude by discussing the issues we have experienced whilst doing this and some relevant lessons learned regarding studying Open Source communities.

1. Introduction

Open Source Software development is a form of distributed collaboration. The success of many Open Source projects implies that Open Source communities successfully alleviate many of the challenges of geographical dispersion that occur in proprietary software organizations [1, 2]. They also manage to utilize the competencies of widely dispersed, heterogeneous groups of developers [3]. Therefore, the study of Open Source communities has become the focus of significant academic research, in particular among researchers who are interested in distributed software development in general. However, studying Open Source communities also raises a number of unique challenges – both practical and methodological – due to the very nature of the phenomenon.

Since the participants are primarily a group of volunteers who are geographically dispersed and seldom meet face-to-face, organizing on-site observations of work practices or formal interviews is usually difficult from a practical point of view. Since the practice of “sprints” is becoming more common in Open Source projects [4], there are cases (such as the one presented in this paper) where opportunities for direct interaction with collaborating developers are more likely, but this is still an exception. The collaboration is mainly mediated through simple text-based communication [5], and the core principle of openness that permeates the Open Source world means that the publicly available archives of email lists, chat transcripts, and project documentation usually constitute the researchers’ main source of information. Whilst this material is a rich source of statistical data [3], performing qualitative analysis of such an often colossal body of material can be a cumbersome task. There is also the potential for misinterpretation due to the lack of context.

Open Source communities are also a complex tapestry of heterogeneous actors and components that require a combination of methodological lenses to understand and explain [6]. Ethnographic methods are suitable to help gain an understanding of an Open Source community’s culture and practices, but since collocated events are rare in most projects this may be difficult to achieve. Virtual ethnography has been outlined as a method for ethnographies of Internet communities [7] and it has been used for looking at the Open Source phenomenon [8]. There is also a body of writings in the tradition of computer-mediated communication (CMC) research that addresses the problems of performing studies in the context of the Internet [9].

In this paper, we suggest that a combination of methodological approaches and research methods is the best way to make an in-depth, interpretive case study of an Open Source community. It is also very important to realize that the selection of research methods shapes the evolution of the research questions as well as the results of the study. We briefly present our own experience of studying an Open Source community called PyPy, where we used a combination of participant observation [10], virtual ethnography [7], and electronic questionnaires [11] to unveil their work practices for analysis. The focus is in particular on how this study evolved through exploration and how various research methods were incorporated during this process rather than as a prerequisite for it. We conclude by discussing the issues we encountered and relevant lessons learned regarding studying Open Source communities.

2. The PyPy project

In many ways, PyPy is similar to other Open Source projects. The community maintains a webpage with extensive project documentation, mailing lists, chat rooms, and an online code repository. Anyone is free to begin participating online and to download source code for using, studying, and editing. In principle, anyone is also permitted to contribute code to the repository, with both community review and the software framework setup ensuring that no one will break the code (at least, to any great extent).

But in some respects PyPy stands out from the crowd. Between January 2004 and May 2007, they applied for and received funding from the EU in order to accelerate the development of their product. Following the efforts of the community core, the structured and rather rigid EU framework was shaped to support the work of the ad-hoc, flexible Open Source community. One of the key objectives of this setup was to support and demonstrate the applicability of the sprint-driven development methodology in an Open Source context. Building on principles and practices from agile development methodologies, a “sprint” is an event where people in this otherwise distributed collaborative project come together for approximately seven days and work on specific development issues. It is a time for rather intense collaborative coding and testing, but also for socializing with people whom one would otherwise only interact with online. The EU funding has allowed PyPy to sprint systematically, every 6 weeks or so, always in different locations to accommodate as wide a group of participants as possible. The collocated events have effectively been interwoven with the dispersed collaborative practices of PyPy.

3. Methodological pillars: the foundation

Our initial contact with the PyPy community was made in August 2006 when a sprint was organized at the University of Limerick (UL), Ireland. In our research project at UL we are studying work practices and collaboration in globally distributed software development projects. In particular, we are interested in social, organizational and cultural aspects and in studying global software development as a human activity rather than a formal process or method. Our scope includes all kinds of distributed software development work, from formal projects in multinational corporations to ad-hoc Open Source projects. The fact that PyPy was organizing a sprint at UL and that they were open to letting researchers look at how they work presented us with a very convenient opportunity for investigating the Open Source way of working in general and, more specifically, PyPy’s sprint-driven development as a way of performing collaborative distributed software development. Thus, this study began as an opportunistic exploratory study with a general focus on collaborative work practices in the PyPy Open Source community.

Our research approach builds mainly on two strands of methodological thought: qualitative, ethnographically-informed empirical studies of software engineering [12, 13] and the tradition of work practice studies mainly as represented in the field of Computer-Supported Cooperative Work (CSCW) [14, 15], but it is also inspired by general studies of organizations [16, 17]. This means that we are interested in revealing the actual work practices – as compared to canonical practices and formal processes – of people in distributed software development projects and in studying this socio-technical phenomenon in its ‘natural’ state.

Applying a naturalistic research approach traditionally means the researcher must not rely solely on “artificial” settings, like experiments or formal interviews, instead drawing mainly on direct and participant observation and in situ conversations as the primary source of information [10]. Another important pillar in our methodology is the concept of “context”. The naturalistic research tradition emphasizes that human actions are based on social meaning – intentions, motives, beliefs, rules, values, and so on [18]. In order to be able to account for these aspects, studies of situated everyday practice must not separate persons acting from the social world of activity [19]. Hence, accounting for the context in which the observed activity is situated and for the relations between the persons acting and the social world is another important endeavor in work practice studies.

These two methodological pillars formed the foundation for our design of the PyPy study, but rather than working as a predefined design they helped shape the process through which this study grew. As with all research, social research of this kind begins with a set of issues or research questions. However, it is quite common that these are not fully fleshed out and set at the beginning of the research; in fact, a key characteristic of social research is that it has an open-ended nature and that an important part is exploratory observation and description of some setting or activity, something that iteratively leads to more developed research questions and further investigation of certain aspects [18]. In our case, the initial observations were purely exploratory – focusing on collaboration practices in distributed software development projects in the form of Open Source projects – but they led to a set of conclusions that in turn gave rise to more specific research questions. This, in turn, made us incorporate other methods of research in order to investigate the new issues, which will be illustrated in the following section.

4. Practical research methods: evolution

This case study has been ongoing for an extended period: it was initiated in August 2006 and is, to some extent (mainly through monitoring of online activities), still ongoing at the time of writing (June 2008). Over this period, the research has gone through two major phases. At first, it was an exploratory study aimed at investigating the collaborative practices of this particular Open Source community. This phase involved direct observation of one sprint (7 days long) and qualitative analysis of online activities, primarily of the main mailing lists used by the community. We were allowed to sit in on and observe all the group meetings during this sprint (which were also recorded on video), and also to have in situ conversations with the participants at any time during their work (none of the researchers took an active part in the actual work of the sprint). Following the interest expressed by two research groups at UL, the project administrator agreed to organize a workshop on the PyPy development methodology, which included an extended discussion session with one of the core members of the community. The study of online activities served mainly as a way of gaining insights into the dynamics of the community and of building up a historical account of the project’s development.

This first phase of our research culminated with a paper that was submitted and accepted for the OSS conference in January 2007 [20]. In this paper, it was concluded that a situated learning perspective could help account for the value of having sprints in an otherwise distributed project. Sprints, it was argued, facilitate situated learning in the project, allowing developers to come together and work through hands-on collaboration using agile practices such as pair programming and daily scrum meetings, as well as to socialize and form stronger personal relationships. Furthermore, it was stated that sprint-driven development also facilitates the expansion of the community through enculturation, because it enables new members to acquire both the necessary technical knowledge and the community membership needed to contribute to the project.

At this point, the study moved into a second phase. While the first phase had been exploratory, this second was more investigative in nature. To get a more in-depth understanding of what sprints mean for developers who participate in the project, we adopted an ethnographic approach. We expected this to allow us to gain first-hand experience of the process and acquire a participant perspective on what goes on during sprints, something that would help us shed light on the aspects we were investigating. After consultation with and approval from the core members of the community, one of the researchers was allowed to join the community as a regular newcomer. The researcher began familiarizing himself with the source code, studying the online documentation, and participating in the online community. In March 2007, the researcher attended a sprint in Hildesheim, Germany, taking the role of a regular newcomer and working on the project. The duration of this particular sprint was 3 full working days. On the first day there was an introductory tutorial for the two newcomers. After that, the newcomers were encouraged to pair up with more experienced developers and get to work on some development issues that were deemed appropriate. The pairs changed every day, so the newcomers got to work with several other developers. They also had a lot of choice as to what they were interested in working on and were expected to at least attempt to choose and work independently. The main method of record during this event was extensive note-taking – including both ongoing field notes and writing a fieldwork diary at the end of each day – and audio recording of all group meetings (mainly the daily scrum meetings). Apart from this, the researcher had the opportunity to have numerous casual conversations with the other participants about things experienced and was allowed to document much of the time spent there with photos. The regular sprint report, written by two of the core community members after the sprint, was also added to the documenting material.

Following this participatory experience, we realized that a more in-depth study of the online activities of the community was necessary to investigate fully what we were interested in. A more structured approach was adopted, though still qualitative in nature. It focused mainly on the central mailing list (PyPy-dev) and had two main purposes. Firstly, to investigate how the sprints extend the dispersed collaboration, we focused on finding aspects that are constitutively specific due to the sprints; i.e. that would not have been present were it not for the sprints. In time, this allowed us to distinguish a number of analytical categories which revealed what the collocated events add to the dispersed activities, something that could then be used to further code the material and reinforce our evidence. Secondly, in order to investigate what sprints mean to newcomers beginning to work on the project, we identified cases of newcomers approaching the community. A number of cases were chosen – including both people who eventually became core members of the community and people who made initial contact and then drifted away – and their evolving interactions with the community over time were studied. This also included accounting for how the sprint events affected the evolving interaction. The analysis also focused on understanding the participants’ own views on the meaning of sprints by identifying both discussions and singular comments where this is expressed implicitly or explicitly.

By the spring of 2008, the PyPy study had come a long way since the initial observation of a sprint in 2006.
Both the participant observation and the major analysis of the online material were concluded, but as we began working up the case regarding the two aspects we had investigated – the interweaving of collocated events with dispersed collaboration and the enculturation process for newcomers – it became apparent that we needed a way to confirm and strengthen our argument based on the explicit opinions of actual participants in the community. Formal interviews would have been difficult to organize and would have risked disrupting the relaxed atmosphere that characterized our relationship with the community members thus far in the study. Instead, we decided to use the medium which they felt most comfortable with – email – and design simple, informal questionnaires. These aimed at shedding further light on the analytical categories we had already identified, without explicitly detailing them to the respondents, thus allowing for full freedom of response. We are currently in the process of analyzing these questionnaire responses.

5. Discussion

One of the consequences of the evolution that is outlined above is that we gradually achieved an understanding, not just of the particular aspects we were investigating, but of the context we were studying as a whole. Since we were following the PyPy community’s work over time and our focus was exploratory, the particular sets of issues we were interested in were not defined from the start. This, we believe, has meant that our research focus has been tuned to what was important in this particular context, rather than adhering to a prescribed theoretical framework or empirical perspective. While not contributing notably to the production of generalizable findings (which is a separate issue, but is a general criticism often leveled at in-depth, interpretive case studies), it does contribute significantly to our ability to explain what is going on in this particular case.

However, it is also worth noting that we have experienced some problems regarding the focus. Because we have been adapting to what we found, more and more issues or potential topics of interest, not necessarily related, have emerged from the data. Because of the multifaceted nature of any social phenomenon, there is a multitude of perspectives one may assume, and depending on the researchers’ choice different sets of issues will be relevant. Therefore, it is necessary at some point to decide which aspects to focus on and from there iteratively steer the study towards investigating those in particular. Following the publication of the first paper, we decided to focus on how the sprints extend the dispersed collaboration and on how they allow for enculturation of newcomers. At this point we felt we knew the context well enough to decide on what was important. We had spent roughly 6 months following the online activities after the initial observation of a sprint, so we both knew what goes on at sprints and were aware of the general dynamics of the community.

While working up the case study data, we have also discovered that one of the sources of material seems richer than the others. At the writing stage, we find that the excerpts from the mailing lists provide more detailed evidence for our thesis than the observations from sprints and the questionnaire answers. One of the reasons for this might be the nature of the material found in the mailing lists. Since the contributors know the lists are public, it may be argued that their posts and responses are rather more detailed and considered in order to avoid ambiguity and act as a resource for the community. In addition, these mailing lists are archived and thus provide a historical record covering the entire duration of the community’s existence. Another reason for the perceived richness of the mailing list data may be the very focus of our research, since our analytical categories originated from what we observed in the mailing list interactions and were later extended to the observations and the questionnaires. So comparatively speaking, the data available from the observations of two sprints and the questionnaires is complementary but less extensive. Regardless of the actual reason, this is a sign that combining different research methods is not straightforward and will require the researcher to do some tailoring. In our case, one of the data sources has become the main basis of our analysis and the two others are complementary. But that is in response to our current focus, and it is not unreasonable to assume that if we decide in the future to highlight another aspect of the case, this may change.

6. Further work

In this paper we have shown how our study of the PyPy community evolved and we have briefly reflected on this in a discussion on the opportunities and challenges we encountered. As noted previously, it is important to realize that the selection of research methods shapes the evolution of the research questions as well as the results of the study. We hope to have provided a naturalistic description of a research study that can be used as a basis for considering and discussing the potential contribution that studies on software development practices can make. As Ian Dey puts it: “we might learn more from reflecting on how researchers actually do research, than from the more general accounts and prescriptions.” [21]. When describing the global software development (GSD) phenomenon it appears to be a ‘given’ that work is now ‘distributed’ or ‘global’, but what this means in terms of actual locally situated work practice is rarely made explicit. The bulk of GSD research published to date has been quantitative, but the nature of what we have identified as important requires qualitative approaches, because these allow the exploration and illumination of the in situ practice of software engineering [12].

The PyPy study is now nearing its conclusion. In terms of substance, we are at the stage of finalizing the in-depth analysis of the second phase of our investigation. However, we do feel that we have quite a substantial and rich collection of material, so there is certainly room for exploring other aspects of the case based on the same data set. It may be one of the inherent advantages of studying Open Source communities that the mailing list archives provide an extensive and publicly available record of the activity of the community. As for reflecting on our methodological approach, this paper should essentially be seen as an initial attempt. There is significant room for considering how an evolving research focus and the tailoring of research methods impact on the outcomes of the research. In our opinion, this is a very important area to consider, both for the case study at hand and for making relevant contributions regarding research methods to the academic community focused on distributed software development.

7. References

1. Fitzgerald, B., The Transformation of Open Source Software. MIS Quarterly, 2006. 30(3): p. 587-598.
2. Mockus, A., R. Fielding, and J.D. Herbsleb, Two Case Studies of Open Source Software Development: Apache and Mozilla. ACM Transactions on Software Engineering and Methodology, 2002. 11(3): p. 309-346.
3. Crowston, K. and B. Scozzi, Open source software projects as virtual organisations: competency rallying for software development. IEEE Proceedings Software, 2002. 149(1).
4. During, B., Sprint Driven Development: Agile Methodologies in a Distributed Open Source Project (PyPy), in XP 2006, 7th International Conference. 2006: Oulu, Finland.
5. Crowston, K., et al., Coordination of Free/Libre Open Source Software Development, in International Conference on Information Systems. 2005: Las Vegas, Nevada, USA.
6. Sack, W., et al., A Methodological Framework for Socio-Cognitive Analyses of Collaborative Design of Open Source Software. Computer Supported Cooperative Work (CSCW): An International Journal, 2006. 15: p. 229-250.
7. Hine, C., Virtual ethnography. 2000, London: SAGE Publications.
8. Bergquist, M. and J. Ljungberg, The power of gifts: organizing social relationships in open source communities. Information Systems Journal, 2001. 11: p. 305-320.
9. Markham, A.N., The Internet as research context, in Qualitative Research Practice, C. Seale, et al., Editors. 2007, SAGE Publications Ltd: London. p. 328-344.
10. Delamont, S., Ethnography and Participant Observation, in Qualitative Research Practice, C. Seale, et al., Editors. 2007, SAGE Publications Ltd: London. p. 205-217.
11. Oates, B.J., Researching Information Systems and Computing. 2006, London: SAGE Publications Ltd.
12. Dittrich, Y., et al., Editorial: For the Special Issue on Qualitative Software Engineering Research. Information and Software Technology, 2007. 49: p. 531-539.
13. Robinson, H., J. Segal, and H. Sharp, Ethnographically-informed empirical studies of software practices. Information and Software Technology, 2007. 49: p. 540-551.
14. Schmidt, K., The critical role of workplace studies in CSCW, in Workplace Studies: Recovering Work Practice and Informing Design, C. Heath, J. Hindmarsh, and P. Luff, Editors. 2000, Cambridge University Press: Cambridge.
15. Singer, J., et al., An Examination of Software Engineering Work Practices, in CASCON'97. 1997: Toronto, Ontario, Canada.
16. Orlikowski, W.J., Knowing in Practice: Enacting a Collective Capability in Distributed Organizing. Organization Science, 2002. 13(3): p. 249-273.
17. Orr, J.E., Talking about Machines: An Ethnography of a Modern Job. 1996, Ithaca, New York: IRL Press.
18. Hammersley, M. and P. Atkinson, Ethnography: Principles in Practice. 1996, New York & London: Routledge.
19. Chaiklin, S. and J. Lave, Understanding Practice: Perspectives on activity and context. Learning in doing: Social, cognitive, and computational perspectives, ed. R. Pea and J.S. Brown. 1993, Cambridge: Cambridge University Press.
20. Sigfridsson, A., et al., Sprint-driven development: working, learning and the process of enculturation in the PyPy community, in International Conference on Open Source Systems. 2007: Limerick, Ireland.
21. Dey, I., Grounded Theory, in Qualitative Research Practice, C. Seale, et al., Editors. 2007, SAGE Publications Ltd: London. p. 80-94.