Empir Software Eng (2009) 14:418–452 DOI 10.1007/s10664-008-9092-6

Empirical evaluation of an educational game on software measurement Christiane Gresse von Wangenheim & Marcello Thiry & Djone Kochanski

Published online: 1 October 2008. © Springer Science + Business Media, LLC 2008. Editor: Forrest Shull

Abstract Software measurement is considered important for improving the software process. However, teaching software measurement remains a challenging issue. Although games and simulations are regarded as powerful tools for learning, their learning effectiveness is not rigorously established. This paper describes the results of an explorative study investigating the learning effectiveness of a game prototype on software measurement, in order to make an initial judgment about its potential as an educational tool and to analyze its appropriateness, engagement, and strengths and weaknesses as guidance for its further evolution. Within the study, a series of experiments was conducted in parallel in three master courses in Brazil. Results of the study reveal that the participants consider the content and structure of the game appropriate, but no indication of a significant difference in learning effectiveness could be shown.

Keywords Measurement . Educational game . Project management . Experiment

C. Gresse von Wangenheim (*) · M. Thiry · D. Kochanski
Universidade do Vale do Itajaí (UNIVALI), Master Program on Applied Computer Science, São José, SC, Brazil
e-mail: [email protected]; [email protected]; [email protected]
C. Gresse von Wangenheim
Universidade Federal de Santa Catarina (UFSC), Graduate Program in Computer Science, Florianópolis, SC, Brazil

1 Introduction

Although there have been significant advances in the consciousness of the potential benefits of software measurement, the software industry still remains slow in establishing


measurement programs and, often, measurement initiatives continue to fail (Dekkers and McQuaid 2002; Kasunic 2006). One of the assumed reasons is a lack of education (Löper and Zehle 2003; Hock and Hui 2004), as many computer science courses still do not cover software measurement as part of their curriculum. At most, students are taught a minimum of basic knowledge on software measurement as part of an undergraduate or graduate software engineering lecture (Hock and Hui 2004; Ott 2005). In addition, many courses do not stress the importance of software measurement in practice and, consequently, fail to motivate students sufficiently, as measurement is often perceived as a complex and difficult subject (Hock and Hui 2004).

One reason for this problem is the way in which software measurement is taught. Expository lessons are still the dominant instructional technique in basically all sectors of education and training (Percival et al. 1993). While they are adequate for presenting abstract concepts and factual information, they are not the most suitable for higher-cognitive objectives aiming at the transfer of knowledge to real-life situations (Choi and Hannafin 1995). Another disadvantage is that such education is not only cost-intensive, but also time-consuming (Bruns and Gajewski 1999). And, as measurement is only one area of software engineering, there is typically not sufficient time in software engineering lectures to provide students with a solid understanding and to teach them the application of measurement in practice. Such practical constraints usually limit the exposure of students to realistic measurement programs, especially as practical exercises typically require close supervision by instructors and a significant amount of time. Therefore, it remains a challenge to teach students in a compact but interesting way, so that they understand the key concepts and are capable of applying measurement in real-world situations.
In this context, educational games have become an alternative providing various advantages (Percival et al. 1993). They can offer the virtual equivalent of a real-world experience. Thus, they can be effective in reinforcing the teaching of basic concepts by demonstrating their application and relevance, as well as in developing higher-cognitive competencies by providing illustrative case studies (Percival et al. 1993). In particular, computer-based games can allow "learning by doing" in realistic situations with immediate feedback. This can also leave students more confident in their ability to handle a similar situation in real life. Another advantage is that learners can work at their own pace, not requiring the presence of an instructor or interaction with others. And, building on the engaging nature of games, they can make learning more fun, if not easier (Kafai 2001).

In this context, we are developing a computer-based educational game on software measurement with the objective of reinforcing the remembering and understanding of basic concepts and of training measurement application. Yet, although computer games and simulations are regarded as powerful tools for learning, due to a lack of well-designed studies about their integration into teaching and learning, their success is questionable or at least not rigorously established (Akili 2007). Thus, our goal is to investigate the effectiveness of the game in order to explore its potential as an educational tool and to guide its evolution. Therefore, this paper describes a series of experiments with a first prototype of the game, which we ran as part of a software measurement module in parallel in three graduate software engineering lectures in Brazil.

The paper is structured as follows. Section 2 provides an overview on the educational game X-MED. Related work in this area is presented in Section 3. Section 4 describes the experiment. Data collection and preparation are presented in Section 5 (including the complete data set in the Appendix). Section 6 presents the data analysis, which is summarized and discussed in Section 7. Conclusions are presented in Section 8.


2 Educational Game X-MED

Our research interest is to develop a computer-based educational game on the definition and execution of a software measurement program in a hypothetical real-life scenario. A first step in this direction is X-MED v1.0 (Lino 2007), a computer-based educational game prototype on software measurement. The objective of the game is to exercise the application of software measurement in the context of project management in alignment with maturity level 2 of the CMMI-DEV v1.2 (CMMI 2006), based on GQM—Goal/Question/Metric (Basili et al. 1994) and including elements from PSM—Practical Software and Systems Measurement (McGarry et al. 2001). The instructional design of the game is being developed in accordance with the education-learning process proposed by Percival et al. (1993). The learning objective of the game is to reinforce measurement concepts and to teach the competency to apply the acquired knowledge, covering the cognitive levels remembering, understanding and applying in accordance with the revised version of Bloom's taxonomy of educational objectives (Anderson and Krathwohl 2001) (see Fig. 1). By selecting adequate solutions from a set of pre-defined alternatives in order to solve practical problems, students are supposed to learn how to develop or select adequate measurement goals, GQM plans and data collection plans, and how to verify, analyze and interpret data. In accordance with those learning objectives, the game has been designed as a single-player environment, in which the learner takes the role of a measurement analyst and defines and executes step-by-step a measurement program in a realistic scenario. The game consists of three main phases: the introduction to the game, the main game play and the finalization. The main part of the game X-MED v1.0 follows a linear flow of actions in accordance with the GQM measurement process (as shown in Table 1).
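The GQM structure that the game exercises (a measurement goal refined into questions, which are in turn refined into measures) can be sketched as a small data model. The following is a minimal illustration only; the class names, fields and example goal are our assumptions, not taken from X-MED:

```python
from dataclasses import dataclass, field

@dataclass
class Measure:
    name: str            # e.g. "actual effort per activity"
    scale: str           # e.g. "ratio", "ordinal"
    unit: str = ""       # e.g. "person-hours"

@dataclass
class Question:
    text: str
    measures: list = field(default_factory=list)

@dataclass
class GQMGoal:
    # GQM goal template: analyze <object> for the purpose of <purpose>
    # with respect to <quality focus> from the <viewpoint> in <context>.
    obj: str
    purpose: str
    quality_focus: str
    viewpoint: str
    context: str
    questions: list = field(default_factory=list)

# Hypothetical goal in the spirit of the game's project-monitoring scenario
goal = GQMGoal(
    obj="software project", purpose="monitor",
    quality_focus="schedule adherence",
    viewpoint="project manager",
    context="hypothetical software organization",
)
q1 = Question("Is the project on schedule?",
              [Measure("planned milestone date", "interval"),
               Measure("actual milestone date", "interval")])
goal.questions.append(q1)
```

A GQM plan in this sense is simply the goal object with its attached questions and measures; the game's steps 2 through 3.4 correspond to filling in this structure top-down.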
During the game, the learner executes sequentially each of the steps of a measurement program, for example, step 1—context characterization, step 2—identification of measurement goal, step 3.1—definition of abstraction sheet, and so on. Each of the steps follows the same sequence: first presenting the task description, then providing information on the task, and then requesting the player to select the most adequate solution from a set of alternatives. Based on the selected alternative, a pre-defined score and feedback are presented. In each step, a task is presented to the learner. For example, in step 2, the learner is asked to identify the most appropriate measurement goal for the given situation (Fig. 2). Then, the game presents information and material for the respective task (e.g., product description, project plan excerpt, interview records, etc.). For example, in step 2, this

6. Creating: Putting elements together to form a coherent or functional whole; reorganizing elements into a new pattern or structure through generating, planning, or producing.
5. Evaluating: Making judgments based on criteria and standards through checking and critiquing.
4. Analyzing: Breaking concepts into parts, determining how the parts relate or interrelate to one another or to an overall structure or purpose, including differentiating, organizing, and attributing.
3. Applying: Carrying out or using a procedure through executing or implementing.
2. Understanding: Constructing meaning from different types of functions like interpreting, exemplifying, classifying, summarizing, inferring, comparing, and explaining.
1. Remembering: Retrieving, recalling, or recognizing knowledge from memory. Remembering is when memory is used to produce definitions, facts, or lists, or to recite or retrieve material.

Fig. 1 Revised version of Bloom’s taxonomy of educational objectives (Anderson and Krathwohl 2001)


Table 1 Overview on the flow of actions

Phase 1—Game introduction: General explanation on the game's objective and how it works

Phase 2—Game execution (each step follows the sequence: task description, presentation of material, selection of task solution from 6 alternatives, score and feedback):
Step 1—Context characterization: Analysis of the context information on the hypothetical software organization, its product, projects and software process
Step 2—Identification of measurement goal: Identification of the most adequate measurement goal in the given situation
Step 3—Development of GQM plan:
  Step 3.1—Definition of Abstraction Sheet: Definition of an Abstraction Sheet for one measurement goal
  Step 3.2—Identification of GQM questions: Identification of two GQM questions based on the Abstraction Sheet
  Step 3.3—Definition of analysis models: Definition of analysis models with respect to two GQM questions
  Step 3.4—Specification of measures: Specification of measures with respect to two analysis models
Step 4—Development of data collection plan: Definition of data collection procedures with respect to two measures
Step 5—Data verification: Verification of a set of collected data
Step 6—Data analysis: Identification of adequate data analysis results with respect to one GQM question
Step 7—Data interpretation: Identification of adequate data interpretation results with respect to one GQM question

Phase 3—Game finalization: Presentation of total score and summary of feedback

includes a textual recording of a brainstorming meeting on the identification of measurement goals in the hypothetical software organization (see Fig. 3), besides additional background information on the organization, its product and process. Once the presented material has been analyzed by the learner, the game asks the learner to take a decision with respect to the presented task. In order to control the complexity and variability of such decisions in the measurement domain, in this initial version of X-MED, decisions are made by selecting the most adequate solution from a pre-defined set of six alternatives. For example, in step 2, the task is to select the most adequate measurement goal in the given situation from six potentially relevant measurement goals (see Fig. 4). Once an alternative is selected, the game immediately provides pre-defined feedback and a score to the student depending on the selected alternative (Fig. 5). As another example, Fig. 6 shows the game flow with respect to step 7 (data interpretation). Again, the task is presented first. Here the player has to select an adequate interpretation based on the collected data and the recorded feedback session. Six alternative interpretations are presented and, once one is selected, the game presents the respective feedback.
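The step mechanics described above (task, material, six pre-defined alternatives, immediate score and feedback, summation into a total score) can be sketched as a short loop. All step contents, scores and feedback strings below are invented for illustration and are not taken from X-MED:

```python
from dataclasses import dataclass

@dataclass
class Alternative:
    text: str
    score: int       # pre-defined partial score for this alternative
    feedback: str    # pre-defined feedback shown immediately after selection

@dataclass
class Step:
    task: str
    material: str        # documents shown before the decision
    alternatives: list   # exactly six Alternative objects
    correct: int         # index of the pre-defined correct alternative

def play(steps, choices):
    """Run the linear game flow; `choices` are the learner's selections."""
    total, report = 0, []
    for step, choice in zip(steps, choices):
        picked = step.alternatives[choice]
        total += picked.score
        report.append((step.task, picked.score, picked.feedback))
        # Regardless of the learner's choice, the game continues from the
        # pre-defined correct decision (step.correct) as the basis for the
        # next step, so a wrong answer never derails the scenario.
    return total, report

# Two toy steps with six alternatives each; only one alternative scores fully.
def toy_step(name, correct):
    alts = [Alternative(f"option {i}", 10 if i == correct else 2, "explanation")
            for i in range(6)]
    return Step(name, "context material", alts, correct)

steps = [toy_step("identify measurement goal", 3),
         toy_step("define abstraction sheet", 0)]
total, report = play(steps, [3, 1])   # first choice correct, second wrong
```

In this sketch the final report mirrors the game's closing summary: a list of per-step scores and feedback plus the total score.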



Fig. 2 Screenshot illustrating task description for the identification of measurement goal

Independent of which alternative the learner selected, the game advances to the next step, always using the pre-defined correct decision of the previous step as a basis. This means that, if a learner takes a wrong decision in a step, the game automatically uses the correct answer to continue the game. In the end, a total score is calculated as the sum of the partial scores, and a final report with all partial scores and feedback is generated. The target audience of the game is graduate students in computer science courses or software engineering professionals. The game is designed as a complement to traditional classroom or e-learning courses, providing an environment to exercise the presented


Fig. 3 Screenshot illustrating material for the identification of measurement goal



Fig. 4 Screenshot illustrating the selection of the most adequate measurement goal

concepts. It requires a basic understanding of software engineering, software measurement, project management and the CMMI framework. The game is intended to be used individually by a student, without the need for interaction with other students or an instructor. The average duration of a game session is about 2 h. We developed the game using evolutionary prototyping based on the ADDIE model (Analysis, Design, Development, Implementation, and Evaluation) (Molenda et al. 1996). The material (including alternatives, scores and feedback) was prepared based on measurement literature, reported experiences and our own experiences in applying measurement in practice. So far, X-MED v1.0 consists of only one static scenario focusing on the application of software measurement for project management. In its current version, it is not configurable or customizable to other measurement approaches or scenarios.

Fig. 5 Screenshot showing detailed feedback



(Panels: 1. Task description; 2. Record of feedback session; 3. Interpretation alternatives; 4. Feedback)

Fig. 6 Screenshots demonstrating game flow of step 7. Data interpretation

The prototype has been implemented in Java (JDK 6) as a desktop system. All three phases and the seven steps summarized in Table 1 have been implemented. Only textual and graphical elements have been applied; no multimedia elements (sound, video) were integrated. The game has been implemented in Brazilian Portuguese. A demo version of the prototype X-MED v1.0 is available at: http://www.incremental.com.br/xmed.

In this respect, X-MED v1.0 represents a simplified and limited version of an educational game. Although there are many perspectives on games for learning (Michael and Chen 2006; Abt 2002; Prensky 2001), with respect to common elements such as play, rules and competition, X-MED v1.0 can be regarded as an educational game, albeit a simple one, as it provides a contest (to achieve the maximum score) in which the player operates under rules in accordance with measurement theory to reach a specified objective (adequately define and execute parts of a hypothetical measurement program) based upon learning. Following Ellington et al. (1982), X-MED v1.0 can be regarded as a "game used as case study", as it provides a game in which the learner has to examine a realistic scenario, taking the role of a measurement analyst in a hypothetical software organization. In this first version of the game, the narrative and flow of events is strictly linear, based on a series of constrained selections with pre-defined alternatives, following the procedural steps of a GQM-based measurement program. Yet, in accordance with Greitzer et al. (2007), we consider even such a relatively simple learning experience a valuable opportunity for learning through its problem-centered focus on content-related decisions.


3 Related Work

The idea of adopting educational games for software engineering education is recent and, so far, no games exist for teaching software measurement. However, there are several games available in other software engineering areas, mainly project management (including, e.g., The Incredible Manager (Dantas et al. 2004), Project-o-poly (Buglione 2007), and management training courses (Collofello 2000)) or simulation games for the execution of software processes, such as SimSE (Oh Navaro and van der Hoek 2007), SESAM (Drappa and Ludewig 2000), SimVBSE (Jain and Boehm 2006), OSS (Sharp and Hall 2000), and Problems and Programmers (Baker et al. 2003), among others.

Several of those games or teaching methods have also been evaluated with respect to their impact on software engineering education. One of the most comprehensive evaluations in this context has been done on the educational game SimSE (Oh Navaro and van der Hoek 2007) for the creation and simulation of software process models. As part of the research on the game, a multi-angled evaluation, including an initial pilot study, an in-class study, a comparative study (designed as a formal experiment) and an observational study, has been run to provide a comprehensive picture of SimSE's overall effectiveness and a general understanding of its strengths and weaknesses. Another distinctive study is the externally replicated controlled experiment on the learning effectiveness of using a process simulation model for educating computer science students in software project management (Pfahl et al. 2003). In the experiments, a pretest–posttest control group design was used, where the experimental group applied a system dynamics simulation model and the control group used the COCOMO model as a predictive tool for project planning. The learning effectiveness was analyzed based on the scores in the tests and subjective improvement suggestions.

SESAM—Software Engineering Simulation by Animated Models (Drappa and Ludewig 2000)—has been evaluated through a case study and a controlled experiment in order to investigate whether simulation based on the SESAM model helps to improve project management education. Both studies used a pretest–posttest design comparing the performance of the participants in terms of a questionnaire score, the preparation of a project plan and the results of the simulation runs. Other evaluations of educational games generally take into consideration only the level of reactions of the participants. For example, the usefulness of the project management game The Incredible Manager (Dantas et al. 2004) within a training concept has been analyzed through two experimental studies. Within the studies, only subjective factors, such as fun and interest, as well as the identification of the game's limitations and drawbacks, have been investigated. Another example is a large study, involving more than 1,500 participants, on the investigation of a case study element of a course presented through an innovative interactive multimedia simulation of a software house, Open Software Solutions (Sharp and Hall 2000). In this simulation, the student plays the role of an employee and performs various tasks as a member of the company's project teams. The evaluation covers usability aspects, such as attractiveness, learnability and helpfulness, as well as positive and negative factors of the simulation. Another example is the incorporation of empirical studies in three industry short courses on the effects of test-driven development (TDD) on internal software quality (Janzen et al. 2007). The experiments compared an iterative test-first approach with an iterative test-last approach by analyzing various software metrics on software size, complexity, coupling,


cohesion, and testing. Additional data on programmer opinions of TDD was collected via pre- and post-experiment surveys. In contrast, experiments regarding software measurement education are extremely rare. One example is a practical experiment in teaching software metrics (Thomas 1996). However, this experiment aimed at introducing productivity measurement as an integral part of student software engineering projects and work assignments, rather than at evaluating any specific teaching method.

4 Experiment

4.1 Research Objectives

Our motivation for the study is that, although educational games are recognized as an interesting teaching means, it remains open to what degree they indeed contribute to learning (Akili 2007). Thus, as explorative research, this series of experiments aims at providing a first insight into the overall effectiveness of the prototype of the game X-MED v1.0 as an educational tool. The research goal is to evaluate these aspects from the viewpoint of the researchers in the context of graduate software engineering lectures. Our hypothesis is that the usage of the educational game X-MED has a positive learning effect on the capability of learners to define and execute measurement programs for project management in alignment with maturity level 2 (ML2) of the CMMI-DEV. We expect a positive reinforcement effect regarding the remembering and understanding of measurement concepts and a positive learning effect on the capability to apply the acquired knowledge. A second research objective of this study is to evaluate the appropriateness of the game, in terms of its content, teaching method and duration, as well as its engagement, from the viewpoint of the learners. We also want to obtain first feedback on the strengths and weaknesses of the prototype in order to guide its evolution.

4.2 Context

We have run a series of experiments within a module on software measurement as part of software engineering lectures in master courses on computer science. Within each experiment, we followed the same syllabus:

4.2.1 Syllabus of the Module "Software Measurement"

Context This module is planned to be part of a master course lecture or part of a professional training related to software engineering, software quality or software process improvement.
Pre-requisites Students are expected to have at least a bachelor's degree in computer science (or a related area) and a basic understanding of software engineering, project management and the CMMI framework.

Module Objective The objective of this module is to provide a basic understanding of the definition and execution of software measurement programs for project management in alignment with maturity level 2 of the CMMI-DEV.


Module Description Measurement concepts and terminology, overview on standard ISO/IEC 15939 (measurement information model and measurement process model), overview on measurement methods (GQM, PSM), measurement in reference models (CMMI-DEV), measurement process step-by-step: context characterization, definition of measurement goals, development of GQM plan, development of data collection plan, data collection, verification and storage, data analysis and interpretation, communication of measurement data and results.

Learning Outcomes As a result of this module, the student should have a basic understanding of software measurement and should be capable of defining and executing basic measurement programs for project management under supervision. The learning objective of the software measurement module of these university courses is directed to cognitive learning, including declarative and procedural knowledge, considering the remembering, understanding and applying levels in accordance with the revised version of Bloom's taxonomy (Anderson and Krathwohl 2001). Upon completion of this module, students will have the ability to:

Remembering: Can students RECALL information? ▪ Recall measurement concepts and terminology: measurement goal, measure, etc. ▪ Cite relevant standards, models and methods. ▪ Name the steps of the measurement process.
Understanding: Can students EXPLAIN ideas? ▪ Classify and illustrate measurement elements, such as measures. ▪ Describe the measurement process.
Applying: Can students USE a procedure? ▪ Construct measurement programs focusing on ML2 of the CMMI. ▪ Select adequate measurement elements in the definition and execution of measurement programs focusing on ML2 of the CMMI.

Teaching Methods Different teaching methods are used, including a 4-h expository lecture with in-class exercises and a 2-h application of the educational game X-MED. The main focus of the different teaching methods with respect to the intended knowledge levels is shown below:

Teaching method       Remembering   Understanding   Applying
Lecture               x             x
In-class exercises                  x               x
Game                                                x

Assessment The achievement of the expected learning outcomes is assessed by written tests with multiple choice and open questions. Tests include questions on all three knowledge levels (remembering, understanding and applying). Regarding application, the tests include questions on the application of measurement with the same focus as in the game (monitoring project schedule, effort and cost) as well as transfer questions regarding the


application of measurement for monitoring other process areas associated with ML2 of the CMMI (software configuration management, software acquisition management, etc.).

4.3 Experiment Design

We ran a series of three experiments in parallel, without any modifications, in each of the lectures. In each of the experiments, we applied a classic experimental design (randomized pre-test post-test control group design) (Takona 2002; Wohlin et al. 2000). In order to minimize the problem of selection bias, the distribution of participants to the experimental or control group was randomized in a balanced manner. In each of the experiments, groups were equally trained with respect to basic measurement concepts and the measurement process through an expository lecture and in-class exercises given by the same instructor. Then, both groups in each experiment took a pre-test prior to the application of the treatment. Afterwards, the treatment (the usage of the game) was given only to the experimental group. No treatment was applied to the control group. In the end, both groups in each experiment took the post-test. The design of each of the experiments is as follows:

Group             Assignment   Training              Pre-test   Treatment   Post-test
(A) Experimental  Random       Lecture & exercises   Test 1     Game        Test 2
(B) Control       Random       Lecture & exercises   Test 1     (none)      Test 2
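The balanced random assignment of participants to the experimental and control groups can be sketched as a generic shuffle-and-split; this is our assumption of a typical procedure, not the authors' documented implementation:

```python
import random

def balanced_random_assignment(participants, seed=None):
    """Randomly split participants into two groups whose sizes differ by at most one."""
    rng = random.Random(seed)   # fixed seed only for reproducibility of the sketch
    shuffled = participants[:]
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    experimental, control = shuffled[:mid], shuffled[mid:]
    return experimental, control

# Hypothetical class of 11 students split into groups A and B
group_a, group_b = balanced_random_assignment(list(range(11)), seed=42)
```

Because the split point is the midpoint of the shuffled list, the groups are balanced in size while every participant still has an equal chance of receiving the treatment.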

Our objective in this study was to analyze whether the game (as a complement to traditional lectures and exercises) has any impact on the learning outcome. Therefore, we applied the game (as treatment) only to the experimental group. Because we did not intend to compare the game to any other teaching means (e.g., to evaluate whether its impact is greater than that of working on case studies), we did not apply any treatment to the control group.

4.4 Hypotheses and Variables

Our research questions are:

Research Question 1 Is the learning effect on the remembering, understanding and applying levels in experimental group A higher than in control group B?

Research Question 2 Is the educational game considered appropriate, in terms of content relevancy, correctness, sufficiency, degree of difficulty, sequence, teaching method and duration, in the context for which it is intended? Is the game considered engaging? What are its strengths and weaknesses?

The objective of this research question is to obtain a subjective evaluation of these aspects from the learners' point of view, rather than to formally evaluate, e.g., the correctness or completeness of the game in accordance with measurement theory. We selected these variables in order to help identify any flaws in the game design, e.g., missing elements which the learners consider important to be included. Yet, students who are learning about software measurement might not be able to provide valuable feedback on those issues and/or might be biased. Due to those threats to validity, we do


not analyze those issues solely based on the students' evaluations. For a comprehensive analysis of the correctness, relevancy and completeness of the game, reviews were performed by software engineering experts during the development of the game. In order to analyze these research questions, we adopt Kirkpatrick's four-level model for evaluation (Kirkpatrick and Kirkpatrick 2006), a popular and widely used model for the evaluation of training and learning. This model essentially represents four successive levels of more precise measures on the effectiveness of training programs, as presented in Table 2. In accordance with Kirkpatrick's four-level model, we investigate both research questions on level one (reaction), which focuses on how the participants feel about the learning experience, by collecting data via satisfaction questionnaires. On level 1, regarding research question 1, we subjectively evaluate the perceived level of measurement competence of the participants (Variable Y.1 Measurement competency) between pre- and post-test on an ordinal 6-point scale. We also investigate research question 1 on level two (learning), which focuses on the evaluation of the increase in knowledge, by administering a pre- and post-test. On level 2, we evaluate the learning effect separately for each of the knowledge levels (Y.2 Measurement knowledge on the remembering level, Y.3 Measurement knowledge on the understanding level, Y.4 Measurement knowledge on the applying level) by comparing the average scores between pre-test and post-test (relative learning effect), and with regard to post-test performance (absolute learning effect). These expectations are formulated as follows (based on Pfahl et al. 2003):

(a) Relative learning effect: (Y.i;A)diff > (Y.i;B)diff, for i = 1,…,4
(b) Absolute learning effect: (Y.i;A)post > (Y.i;B)post, for i = 1,…,4

with:

(Y.1;X)pre: value of variable Y.1 during the pre-test of participants in group X (X = A or B).
(Y.i;X)pre: average score on questions on variable Y.i (i = 2,…,4) during the pre-test of participants in group X (X = A or B).

Table 2 Overview on Kirkpatrick's four-level model for evaluation (Kirkpatrick and Kirkpatrick 2006)

Level 1 (Reaction): Evaluates how the students felt about the training or learning experience. Evaluation tools and methods: happy-sheets; feedback forms; verbal reactions; post-training surveys.
Level 2 (Learning): Evaluates the increase in knowledge or capability (before and after). Evaluation tools and methods: assessments and tests before and after the training; interviews or observation.
Level 3 (Behavior): Evaluates the extent of applied learning back on the job (implementation). Evaluation tools and methods: observation and interviews over time to assess change, relevance of change, and sustainability of change.
Level 4 (Results): Evaluates the effect on the business or environment by the trainee. Evaluation tools and methods: long-term post-training surveys; observation as part of ongoing, sequenced training and coaching over a period of time; measures, such as re-work and errors, to assess whether participants achieved training objectives; interviews with trainees and their managers or their customer groups.

Empir Software Eng (2009) 14:418–452

(Y.1;X)post   Value of variable Y.1 during post-test of participants in group X (X=A or B).
(Y.i;X)post   Average score on questions on variable Y.i (i=2,…,4) during post-test of participants in group X (X=A or B).

(Y.i;X)diff = (Y.i;X)post − (Y.i;X)pre
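The relative and absolute learning effects defined above reduce to simple arithmetic over the test scores. A minimal sketch in Python (all scores below are hypothetical illustrations, not data from the study):

```python
# Sketch of the learning-effect computation; the scores are
# hypothetical illustrations, not data from the study.

def learning_effects(pre_scores, post_scores):
    """Return ((Y.i)diff, (Y.i)post) for one group: the relative effect
    mean(post) - mean(pre) and the absolute effect mean(post)."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(post_scores) - mean(pre_scores), mean(post_scores)

# Hypothetical average scores on one knowledge-level variable (e.g. Y.2)
a_pre, a_post = [4, 5, 3, 6], [7, 8, 6, 9]   # group A (experimental)
b_pre, b_post = [5, 4, 4, 5], [6, 6, 5, 7]   # group B (control)

rel_a, abs_a = learning_effects(a_pre, a_post)
rel_b, abs_b = learning_effects(b_pre, b_post)

# Expectations (a) and (b): (Y.i;A)diff > (Y.i;B)diff and (Y.i;A)post > (Y.i;B)post
print(rel_a, rel_b, rel_a > rel_b and abs_a > abs_b)
```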

The related null hypotheses are stated as follows:

H01a: There is no significant difference in relative learning effectiveness between group A (experimental group) and group B (control group).
H01b: There is no significant difference in absolute learning effectiveness between group A and group B.

On level 1, we also subjectively evaluate the perceived learning effect of the game (Y.5 Subjective learning effect), considering Y.5.1 Learning effect on measurement concepts and process and Y.5.2 Learning effect on measurement application, each on a four-point ordinal scale.

Regarding research question 2, we subjectively evaluate the perceived level of appropriateness (Y.6 Appropriateness) of the game by asking the participants to evaluate seven dimensions: Y.6.1 content relevancy, Y.6.2 correctness, Y.6.3 sufficiency, Y.6.4 difficulty, Y.6.5 sequence, Y.6.6 teaching method and Y.6.7 duration, each on a four-point ordinal scale. We regard the game as appropriate from the viewpoint of the learners if all participants evaluate all dimensions at least as good. Specifically with respect to correctness, content relevancy and sufficiency, the objective here is to identify any shortcomings from the viewpoint of the learners.

On level 1, we also evaluate the engagement of the game (Y.7 Engagement) by subjectively assessing whether the participants liked the game (Y.7.1 Satisfaction) and had fun (Y.7.2 Fun), each on a four-point ordinal scale. We regard the game as engaging if all participants at least liked the game and had fun while playing. Table 3 summarizes the research questions, instruments and variables with respect to the evaluation levels.

Within the context of the study, we considered the following disturbing factors:

DF.1 Personal background in terms of academic formation, training and practical experience.
DF.2 Motivation, based on the perceived importance of measurement and the participant’s interest in learning more about measurement, each on a four-point ordinal scale.
DF.3 Additional study time spent besides the lecture, in-class exercises and game application, on a 6-point ordinal scale.

Further information on the variables can also be found in Table 6.

4.5 Execution

4.5.1 Participants

We ran the series of experiments in the context of lectures in computer science master courses in Brazil. In order to obtain a larger number of participants, we performed the same experiment in parallel as part of the following lectures:

▪ “Software Process Improvement” during the third trimester of 2007 in the Graduate Program in Computer Science at the Federal University of Santa Catarina/Florianópolis.


Table 3 Evaluation overview

Research question 1: Is the learning effect on the remembering, understanding and applying level in the experimental group A higher than in control group B?
Research question 2: Is the educational game considered appropriate in terms of content relevancy, correctness, sufficiency and degree of difficulty, sequence, teaching method and duration in the context for which it is intended? Is the game considered engaging? What are its strengths and weaknesses?

Kirkpatrick's level 1 (instrument: satisfaction questionnaire). Research question 1: Y.1 Measurement competency; Y.5 Subjective learning effect. Research question 2: Y.6 Appropriateness; Y.7 Engagement.
Kirkpatrick's level 2 (instrument: pre-/post-test). Research question 1: Y.2 Measurement knowledge on the remembering level; Y.3 Measurement knowledge on the understanding level; Y.4 Measurement knowledge on the applying level. Research question 2: –.

▪ “Software Engineering” during the second semester of 2007 in the Master Program in Applied Computer Science at UNIVALI (Universidade do Vale do Itajaí)/São José.
▪ “Software Quality and Productivity” during the second semester of 2007 in the Master Program in Applied Computer Science at UNIVALI (Universidade do Vale do Itajaí)/São José.

In total, 15 students participated in the series of experiments, and all participants completed the experiment. As part of their participation, the students earned educational credits. Table 4 summarizes the personal characteristics of each of the student groups.

4.5.2 Procedure and Materials

The experiments were conducted following the schedule presented in Table 5, as part of the regular classes in weekly intervals. After a short presentation of the teaching plan, the experiment’s purpose and general organizational issues, the participants were asked to sign a consent form and to answer a questionnaire on their personal background. The questionnaire was composed of questions on their level of formation, professional experience, present position and research focus within the master program. The students also answered questions on their knowledge of and previous training in software measurement, as well as their motivation to learn about measurement and to participate in the experiment. Then, a lecture with short in-class exercises was given. Exercises included a crossword puzzle on measurement concepts and a continuous exercise on the identification of measurement objectives, the development of an abstraction sheet and the definition of data collection procedures. The lecture was held by the same instructor in all three experiments. Afterwards, feedback on the lecture and exercises was collected through a questionnaire.

At the second encounter, the pre-test was conducted to establish a baseline for the assessment of the learning effect. The pre-test was composed of 14 questions: four on the remembering level, three on the understanding level, five related to the application level in


Table 4 Overview on personal background of the participants

Personal characteristics                                     SPI/UFSC  SE/UNIVALI  SQP/UNIVALI
Number of participants                                       8         5           2
Average age [years]                                          27        37          38
Academic formation [in number of participants]
  Bachelor in Computer Science
  Bachelor in Computer Engineering
  Bachelor in Information Systems
  Bachelor in other area
  Specialization in Computer Science
Professional certifications [in number of participants]
  Implementer of the Brazilian Model for Software
    Process Improvement MPS.BR
  Brazilian Certification for Software Testing
  [counts, column assignment unclear in source: 4 2 2; 2; 1; 3; 1 2 4; 1 2; 1; 1 1]
Professional experience [average in years] as
  Software analyst or developer                              2.4       7           4
  Project manager                                            0.6       0.6         0
  SEPG member                                                0.3       0.6         0
  Software quality group member                              0.3       0.4         0
  Software process consultant                                0.3       0.4         0
  Software Engineering professor                             0.1       3.2         2.5
Current professional position (multiple options possible) [in number of participants]
  Software analyst                                           2         0           1
  Software developer                                         1         0           0
  Tester                                                     0         1           0
  Project manager                                            0         1           0
  SEPG                                                       1         0           0
  Consultant                                                 1         1           0
  Professor                                                  0         1           2
  Not working in the software domain                         3         1           0
Work load [in number of participants]
  Full-time student                                          2         0           0
  Working approx. 20 h/week                                  2         1           1
  Working approx. 40 h/week                                  4         4           1
Applied measurement in practice [in number of participants]
  No                                                         7         4           2
  Yes                                                        1         1           0
Participated already in training courses on (with more than 30 h) [in number of participants]
  Software measurement                                       0         0           0
  Software Quality                                           0         3           1
  CMMI                                                       0         1           0
  MPS.BR                                                     2         1           0
  Project management                                         1         2           1

the same domain as the game, and two transfer application questions for other contexts associated with ML 2 of the CMMI. In addition, subjective data was collected on potential disturbing factors as well as a subjective evaluation of the measurement competency level. After the pre-test, the assignment of participants to either the experimental or control


Table 5 Schedule of the experiments

Day     Content                            Forms                        Duration
1. Day  Presentation of teaching plan                                   5 min
                                           Consent form                 5 min
                                           Background questionnaire     10 min
        Lecture & in-class exercises                                    3:30 h
                                           Post-lecture questionnaire   15 min
2. Day  Pre-test                                                        1 h
                                           Test questionnaire           5 min
3. Day  Game (experimental group only)                                  2 h
                                           Game log file
                                           Post-game questionnaire      30 min
4. Day  Post-test                                                       1 h
                                           Test questionnaire           5 min

group was done randomly and communicated to the participants. Results of the pre-tests were published only at the end of the experiment, together with the results of the post-test.

At the next encounter, only the experimental group received a brief introduction to the game and then played one game session, covering all steps of the game execution (steps 1–7). At the end, the automatically generated and encrypted log file on the performance of each participant was collected. Further subjective data on the game was collected via a questionnaire, including questions on the evaluation of the game, its strengths & weaknesses, the subjective evaluation of the learning effect, and the motivation for playing. The control group underwent no treatment.

At the last encounter, both groups completed the post-test. Again, subjective data was collected on potential disturbing factors as well as a subjective evaluation of the competency level and the perceived learning impact of the game.

The in-class exercises, the game tasks and the questions of the pre- and post-test were designed to be similar in style, content and difficulty. An example of such a task is:

Imagine a company that develops e-government systems and that initiated, last year, a software process improvement initiative in alignment with ML 2 of the CMMI. One of the main characteristics of the company is that it frequently sub-contracts other software companies to supply parts of its systems. Among the suppliers are the companies e-law, big-law and best-law. In the past, there have been many problems with the products supplied by e-law, which demonstrated several defects. These defects are mainly detected during integration testing. Therefore, one of the primary focuses is on the process area Supplier Agreement Management.
Now, senior management wants to observe trends in the quality, mainly the reliability, of the products supplied to the organizational unit e-blob and, therefore, requests that you establish a measurement program. What would be an adequate measurement goal in this context?

Object: ______
Purpose: □ characterize □ monitor □ evaluate □ control □ predict □ identify causal relationships
Quality focus: ______
Viewpoint: ______
Context: ______
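The goal template used in this task follows the familiar GQM-style goal definition (object, purpose, quality focus, viewpoint, context). As a sketch, it can be captured in a small data structure; the class, field names and example values below are illustrative assumptions, not the expected answer from the study's materials:

```python
# Sketch: the measurement-goal template as a data structure.
# Class, field names and the example values are illustrative only.
from dataclasses import dataclass

PURPOSES = ("characterize", "monitor", "evaluate", "control",
            "predict", "identify causal relationships")

@dataclass
class MeasurementGoal:
    obj: str            # object of measurement, e.g. the supplied products
    purpose: str        # one of PURPOSES (the checkbox options above)
    quality_focus: str  # e.g. reliability
    viewpoint: str      # e.g. senior management
    context: str        # organizational setting

    def __post_init__(self):
        # restrict the purpose to the checkbox options of the template
        if self.purpose not in PURPOSES:
            raise ValueError(f"unknown purpose: {self.purpose!r}")

    def sentence(self):
        return (f"Analyze {self.obj} for the purpose of {self.purpose} "
                f"with respect to {self.quality_focus} from the viewpoint "
                f"of {self.viewpoint} in the context of {self.context}.")

goal = MeasurementGoal("the supplied products", "monitor", "reliability",
                       "senior management", "the organizational unit e-blob")
print(goal.sentence())
```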


5 Data Collection and Preparation

The raw data for variables Y.1 and DF.3 was collected through questionnaires administered in parallel to the pre- and post-test. Data for variables Y.2–Y.4 was collected during the pre- and post-tests. Data on Y.5, Y.6 and Y.7 was collected through a post-game questionnaire. Data on DF.1 and DF.2 was collected at the beginning of the experiment through a background questionnaire. The raw data was treated as described in Table 6 in order to prepare for data analysis.
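Preparing such multi-instrument data typically amounts to pivoting the raw records into one consolidated record per participant. A minimal sketch with invented records (the field names and values are assumptions for illustration, not the study's actual coding scheme):

```python
# Sketch: consolidating raw instrument records into one record per
# participant; all field names and values are invented examples.
raw_records = [
    {"participant": "P01", "instrument": "background", "variable": "DF.1", "value": 2},
    {"participant": "P01", "instrument": "pre-test",   "variable": "Y.2",  "value": 4},
    {"participant": "P01", "instrument": "post-test",  "variable": "Y.2",  "value": 7},
    {"participant": "P02", "instrument": "pre-test",   "variable": "Y.2",  "value": 5},
]

dataset = {}
for rec in raw_records:
    row = dataset.setdefault(rec["participant"], {})
    row[(rec["variable"], rec["instrument"])] = rec["value"]

# One consolidated row per participant, keyed by (variable, instrument)
print(dataset["P01"][("Y.2", "post-test")])
```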

6 Analysis

6.1 Data Analysis Procedure

In order to obtain greater accuracy and statistical power by increasing the sample size, we studied the combined results of the series of experiments. Therefore, we performed a joint analysis, cumulating the data from the individual experiments into one dataset. As the three experiments were run in parallel by the same research group in similar contexts under the same conditions (research hypotheses, experimental design, treatments, material, measurement instruments, etc.), we consider them as one “big” experiment. Such a procedure is applicable in this specific case, as the data of the individual studies is similar and homogeneous enough to be combined. We also regard the personal background of the participants in all three studies as comparable (Table 4), with only slightly higher experience in the SE/UNIVALI group. In any case, we deal with differences in the personal background of the participants by analyzing the difference between pre- and post-test, instead of considering only absolute post-test results.

In this series of experiments, our primary concern regarding accuracy and statistical power is the very small sample size, even when combining the data from the three studies (n=15). For such small data sets, it is basically impossible to tell whether the data comes from a normally distributed variable (Levin and Fox 2006), as with small sample sizes (n
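One analysis option for samples this small, which avoids any normality assumption, is an exact permutation test: it enumerates all possible assignments of the pooled participants to the two groups, which is cheap for n=15 (e.g., a 7/8 split gives only C(15,7)=6435 reassignments). A sketch with invented data (the group sizes and values below are illustrative, not the study's results):

```python
# Sketch: exact two-sided permutation test on the difference of group
# means; group sizes and data are invented, not the study's results.
from itertools import combinations

def exact_permutation_test(a, b):
    """Exact p-value for |mean(a) - mean(b)| under the null hypothesis
    that group labels are exchangeable."""
    pooled = a + b
    n = len(pooled)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    hits = total = 0
    for idx in combinations(range(n), len(a)):
        chosen = set(idx)
        grp_a = [pooled[i] for i in chosen]
        grp_b = [pooled[i] for i in range(n) if i not in chosen]
        diff = abs(sum(grp_a) / len(grp_a) - sum(grp_b) / len(grp_b))
        if diff >= observed - 1e-12:   # tolerance for float rounding
            hits += 1
        total += 1
    return hits / total

# Invented relative learning effects for a hypothetical 7/8 split
group_a = [3.0, 2.5, 4.0, 3.5, 2.0, 3.0, 2.5]
group_b = [1.5, 2.0, 1.0, 2.5, 1.5, 2.0, 1.0, 1.5]
p = exact_permutation_test(group_a, group_b)
print(p)
```

Because every reassignment is enumerated, the p-value is exact rather than asymptotic, which is precisely what the normality concern above calls for.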