Computers & Education 77 (2014) 1–12


Measuring Information and Communication Technology Literacy using a performance assessment: Validation of the Student Tool for Technology Literacy (ST2L)

Anne Corinne Huggins, Albert D. Ritzhaupt*, Kara Dawson
University of Florida, USA

Article history: Received 19 November 2013; Accepted 2 April 2014; Available online 16 April 2014

Abstract

This paper reports on the validation of scores from the Student Tool for Technology Literacy (ST2L), a performance-based assessment based on the National Educational Technology Standards for Students (NETS*S) used to measure middle grade students' Information and Communication Technology (ICT) literacy. Middle grade students (N = 5884) from school districts across the state of Florida were recruited for this study. The paper first provides an overview of various methods used to measure ICT literacy and related constructs, along with documented evidence of score reliability and validity. Following sound procedures based on prior research, it then provides validity and reliability evidence for the ST2L scores using both item response theory and testlet response theory, examining both the internal and external validity of the instrument. The ST2L, with minimal revision, was found to be a sound measure of ICT literacy for low-stakes assessment purposes. A discussion of the results is provided, with emphasis on the psychometric properties of the tool and some practical insights on with whom the tool should be used in future research and practice. © 2014 Elsevier Ltd. All rights reserved.

Keywords: Technology literacy; Information and Communication Technology Literacy; NETS*S; Validation; Reliability

1. Introduction

A series of recent workshops convened by the National Research Council (NRC) and co-sponsored by the National Science Foundation (NSF) and the National Institutes of Health highlighted the importance of teaching and assessing 21st century skills in K-12 education (NRC, 2011). Information and Communication Technology (ICT) literacy, or the ability to use technologies to support problem solving, critical thinking, communication, collaboration and decision-making, is a critical 21st century skill (NRC, 2011; P21, 2011). The National Educational Technology Plan (USDOE, 2010) also highlights the importance of ICT literacy for student success across all content areas, for developing skills to support lifelong learning, and for providing authentic learning opportunities that prepare students to succeed in a globally competitive workforce. It is clear that students who are ICT literate are at a distinct advantage in terms of learning in increasingly digital classrooms (NSF, 2006; USDOE, 2010), competing in an increasingly digital job market (NRC, 2008) and participating in an increasingly digital democracy (Jenkins, 2006; P21, 2011). Hence, it is critical that educators have access to measures whose scores display evidence of validity and reliability for this construct, so that they can use the measures, for example, to guide instruction and address student needs in this area.

The International Society for Technology in Education (ISTE) has developed a set of national standards for ICT literacy known as the National Educational Technology Standards for Students (ISTE, 2007). These standards are designed to consider the breadth and depth of ICT literacy and to be flexible enough to adapt as new technologies emerge; they are a revision of the 1998 version of the standards. NETS*S strands include knowledge and dispositions related to Creativity and Innovation; Communication and Collaboration; Research and Information Fluency; Critical Thinking, Problem Solving and Decision Making; Digital Citizenship; and Technology Operations and Concepts. NETS*S have been widely acclaimed and adopted in the U.S. and many countries around the world and are being used by schools for curriculum development, technology planning and school improvement plans.

* Corresponding author. School of Teaching and Learning, College of Education, University of Florida, 2423 Norman Hall, PO Box 117048, Gainesville, FL 32611, USA. Tel.: +1 352 273 4180; fax: +1 352 392 9193. E-mail address: [email protected] (A.D. Ritzhaupt).
http://dx.doi.org/10.1016/j.compedu.2014.04.005
0360-1315/© 2014 Elsevier Ltd. All rights reserved.


Yet measuring ICT literacy is a major challenge for educators and researchers. This point is reinforced by two chapters of the most recent Handbook of Educational Communications and Technology that highlight research and methods on measuring the phenomenon (Christensen & Knezek, 2014; Tristán-López & Ylizaliturri-Salcedo, 2014). Though there is disagreement on the language used to describe the construct (e.g., digital literacy, media literacy, technological literacy, technology readiness, etc.), many authors agree on the key facets that make it up, including knowledge of computer hardware and peripherals, navigation of operating systems, folder and file management, word processing, spreadsheets, databases, e-mail, web searching and much more (Tristán-López & Ylizaliturri-Salcedo, 2014). Such skills are essential for individuals in K-12, post-secondary and workplace environments.

For-profit companies have attempted to measure ICT literacy to meet the No Child Left Behind (USDOE, 2001) mandate of every child being technologically literate by 8th grade. States have employed different methods to address this mandate, with many relying on private companies. Many of these tools, such as the TechLiteracy Assessment (Learning, 2012), claim alignment with NETS*S. However, most for-profit companies provide little evidence of a rigorous design, development and validation process. The PISA (Program for International Student Assessment) indirectly measures some ICT-related items such as frequency of use and self-efficacy via self-report, but ICT literacy is not a focus of the assessment. Instead, the PISA measures the reading literacy, mathematics literacy, and science literacy of 15-year-old high school students (PISA, 2012).

Many states have adopted the TAGLIT (Taking a Good Look at Instructional Technology) (Christensen & Knezek, 2014) to meet the NCLB reporting requirements. This tool includes a suite of online assessments for students, teachers, and administrators, and the student assessment claims to be connected to the NETS*S. The questions were originally developed by the University of North Carolina Center for School Leadership Development. TAGLIT is a traditional online assessment that includes a wide range of questions focusing on the knowledge, skills, and dispositions related to ICT literacy, and it includes a reporting function for schools to use for reporting and planning purposes. However, very little research has been published on the design, development, and validation of this suite of tools for public inspection.

A promising new initiative is the first-ever National Assessment of Educational Progress (NAEP) Technology and Engineering Literacy (TEL) assessment, which is currently under development (NAEP, 2014). TEL is designed to complement other NAEP assessments in mathematics and science by focusing specifically on technology and engineering constructs. Unlike the other NAEP instruments, the TEL is completely computer-based and includes interactive scenario-based tasks in simulated software environments. The TEL was scheduled for pilot testing with 8th grade students in the fall of 2013 and is slated for release to the wider public sometime in 2014. However, a careful reading of the framework for this instrument reveals that it is not designed to measure ICT literacy alone.
Rather, the instrument focuses on three interrelated constructs: Design and Systems, Technology and Society, and Information and Communication Technology (NAEP, 2014).

This paper focuses on a performance-based instrument known as the Student Tool for Technology Literacy (ST2L), designed to measure the ICT literacy skills of middle grades students in Florida using the 2007 National Educational Technology Standards for Students (NETS*S). This is the second iteration of the ST2L, the first iteration having been aligned with the original 1998 NETS*S (Hohlfeld, Ritzhaupt, & Barron, 2010). Specifically, this paper provides validity and reliability evidence for the scores using both item response theory and testlet response theory (Wainer, Bradlow, & Wang, 2007).

2. Measuring ICT literacy

The definition, description, and measurement of ICT literacy have been under investigation primarily since the advent of the World Wide Web in the early nineties. Several scholars, practitioners, and reputable organizations have attempted to carefully define ICT literacy with associated frameworks, and to design, develop, and validate reliable measures of this multidimensional construct. In Europe, for instance, the European Computer Driving License Foundation (ECDLF) provides a framework and comprehensive assessment of ICT literacy skills used to certify professionals working in the information technology industry. This particular certificate has been adopted by 148 countries around the world in 41 different languages (Christensen & Knezek, 2014). We attempt to review some of the published measures of ICT literacy and related constructs in this short literature review. We do not claim to cover all instruments of ICT literacy; rather, we cover instruments that were published and provided evidence of both validity and reliability.

Compeau and Higgins (1995) provide one of the earlier and more popular measures of computer self-efficacy and discuss its implications for the acceptance of technology systems in the context of knowledge workers, the population with which the measure is intended to be used. Building on the work of Bandura (1986), computer self-efficacy is defined as "a judgment of one's capability to use a computer" (Compeau & Higgins, 1995, p. 192). Their study involved more than 1000 knowledge workers in Canada and several related measures, including computer affect, anxiety, and use. They designed and tested a complex path model to examine computer self-efficacy and its relationships with the other constructs. Unsurprisingly, computer self-efficacy was significantly and negatively correlated with computer anxiety, and computer use had a significant positive correlation with computer self-efficacy. This scale has been widely adopted, and the article has been cited more than 2900 times according to Google Scholar.

Parasuraman (2000) provides a comprehensive overview of the Technology Readiness Index (TRI), a multi-item scale designed to measure technology readiness, a construct similar to ICT literacy. Parasuraman (2000) defines technology readiness as "people's propensity to embrace and use new technologies for accomplishing goals in home life and at work" (p. 308). This measure is intended to be used with adults in marketing and business contexts.
The development process included dozens of technology-related focus groups to generate the initial item pool, followed by an intensive study of the psychometric properties of the scale (including factor analysis and internal consistency reliability). Though the TRI has mostly been used in the business and marketing literature, it demonstrates that other disciplines are also grappling with this complex phenomenon.

Bunz (2004) validated an instrument to assess people's fluency with the computer, e-mail, and the Web (CEW fluency). The instrument was developed based on extensive research on information and communication technology literacies, and the research was conducted in two phases. First, the instrument was tested on 284 research participants, and a principal component factor analysis with varimax rotation resulted in 21 items across four constructs: computer fluency (α = .85), e-mail fluency (α = .89), Web navigation (α = .84), and Web editing (α = .82). The four-factor solution accounted for more than 67% of the total variance. In the second phase, Bunz's (2004) 143 participants


completed the CEW scale and several other scales to demonstrate convergent validity. The correlations were strong and significant. The measure was used with students in higher education contexts. Overall, preliminary support for the scale's reliability and validity was found.

Katz and Macklin (2007) provided a comprehensive study of the ETS ICT Literacy Assessment (renamed iSkills) with more than 4000 college students from more than 30 college campuses in the U.S. The ETS assessment focuses on several dimensions of ICT literacy that are measured in a simulated software environment, including defining, accessing, managing, integrating, evaluating, creating and communicating using digital tools, communications tools, and/or networks (Katz & Macklin, 2007). The authors systematically investigated the relationships among scores on the ETS assessment, self-report measures of ICT literacy, self-sufficiency, and academic performance as measured by cumulative grade point average. The ETS assessment was found to have small to moderate statistically significant correlations with other measurements of ICT literacy, which provides evidence of convergent validity for the measurement system. ETS continues to administer the iSkills assessment to college students at select universities and provides comprehensive reporting.

Schmidt et al. (2009) developed a measure of Technological Pedagogical Content Knowledge (TPACK) for pre-service teachers based on Mishra and Koehler's (2006) discussion of the TPACK framework. Though not a pure measure of ICT literacy, the instrument includes several technology-related items that attempt to measure a pre-service teacher's knowledge, skills, and dispositions towards technology. The development of the instrument was based on an extensive review of the literature surrounding teacher use of technology, with an expert review panel appraising the items generated by the research team for relevance. The researchers then conducted a principal component analysis and internal consistency reliability analysis of the associated structure of the instrument. The instrument has been widely adopted (e.g., Abitt, 2011; Chai, Ling Koh, Tsai, & Wee Tan, 2011; Koh & Divaharan, 2011).

Hohlfeld et al. (2010) reported on the development and validation of the Student Tool for Technology Literacy (ST2L) and provided evidence that the ST2L produces valid and reliable ICT literacy scores for middle grade students in Florida based on the 1998 NETS*S. The ST2L includes more than 100 items, most of which are performance assessment items in which the learner responds to tasks in a software environment (e.g., a spreadsheet) that simulates real-world application of ICT literacy skills. The strategy for developing the tool was as follows: 1) technology standards were identified; 2) grade-level expectations/benchmarks for these standards were developed; 3) indicators for the benchmarks were outlined; and 4) specific knowledge assessment items were written and specific performance or skill assessment items were designed and programmed. Using a merger of design-based research and classical test theory, Hohlfeld et al. (2010) demonstrated the tool to be sound for its intended purpose of low-stakes assessment of ICT literacy. It is worth noting that the ST2L has been used by more than 100,000 middle grade students in the state of Florida since its formal production release (ST2L, 2013).
Across these various studies, which all address the complex topic of measuring ICT literacy, we can make a few observations. First, there is no consensus on the language used to describe this construct: computer self-efficacy, CEW fluency, ICT literacy, technology readiness, and technology proficiency are all terms used to describe a similar phenomenon. Second, each study presented here built on a conceptual framework to explain ICT literacy (e.g., social cognitive theory, NETS*S, TPACK, etc.) and used sound development and validation procedures; we feel this is an important aspect of work on ICT literacy, which must be guided by frameworks and theories to inform our research base. Third, the instruments were developed for various populations, including pre-service teachers, middle grade students, knowledge workers, college students, and more, so special attention must be paid to the population for which an ICT literacy measure is designed. Finally, there are several different methods of measuring this complex phenomenon, ranging from traditional paper/pencil instruments to online assessments to fully computer-based simulated software environments. The authors feel that the future of measuring ICT literacy should embrace objective performance-based assessment in which learners respond to tasks in a simulated software environment, like the ST2L.

3. Purpose

Following the recommendations of Hohlfeld et al. (2010), this paper presents a validation of scores on the Student Tool for Technology Literacy (ST2L), a performance assessment originally based on the 1998 NETS*S and recently revised to align with the 2007 NETS*S. The tool was developed through Enhancing Education Through Technology (EETT) funding as an instrument to assess the ICT literacy of middle grade students in Florida (Hohlfeld et al., 2010). This paper provides the validity and reliability evidence for scores on the modified instrument

Table 1
Demographic statistics.

Variable              Groups     Frequency (n)   Percentage (%)
Grade                 5          5               .09
                      6          1234            20.97
                      7          1598            27.16
                      8          3035            51.58
                      9          9               .15
                      10         1               .02
                      11         2               .03
Gender                Male       2934            49.86
                      Female     2950            50.14
Race                  Asian      125             2.12
                      Black      1114            18.93
                      Hispanic   682             11.59
                      White      3626            61.62
                      Other      337             5.73
Free/Reduced lunch    Yes        3569            60.66
                      No         2315            39.34
English with family   Yes        5508            93.61
                      No         376             6.39


according to these new standards (ISTE, 2007), with methodology operating under both item response theory and testlet response theory (Wainer et al., 2007). Specifically, this paper addresses the following research questions: (a) Do scores on the ST2L display evidence of internal structure validity? and (b) Do scores on the ST2L display evidence of external structure validity?

4. Method

4.1. Participants

Middle school teachers from 13 Florida school districts were recruited through the EETT grant program. Teachers were provided an overview of the ST2L, how to administer the tool, and how to interpret the scores, and they then administered the ST2L within their classes during the fall 2010 semester. Table 1 details demographic information for the sample of N = 5884 examinees. The bulk of students (n = 5867) were in grades 6 through 8, with a wide range of diversity in gender, race and free/reduced lunch status. A small percentage (i.e., 7%) of examinees was from families that did not speak English in the home.

4.2. Measures

ST2L: The ST2L is a performance-based assessment designed to measure middle school students' ICT literacy across relevant domains based on the 2007 NETS*S: Technology Operations and Concepts, Constructing and Demonstrating Knowledge, Communication and Collaboration, Independent Learning, and Digital Citizenship. These standards are designed to consider the breadth and depth of ICT literacy and to be flexible enough to adapt as new technologies emerge. NETS*S have been widely acclaimed and adopted in the U.S. and in many countries around the world, and they are being used by schools for curriculum development, technology planning and school improvement plans.

The ST2L includes 66 performance-based tasks and 40 selected-response items, for a total of 106 items. The selected-response item types include text-based multiple-choice and true/false items, as well as multiple-choice items with graphics and image map selections (see Fig. 1 for an example). The performance-based items require the examinee to complete multiple tasks nested within simulated software environments, and these sets of performance-based items were treated as testlets (i.e., groups of related items) in the analysis (see Fig. 2 for an example). The testlets ease the burden on the examinee, since multiple items are associated with each prompt, and they better reflect the technological performance of examinees outside of the assessment environment.

The original version of the ST2L was previously pilot tested on N = 1513 8th grade students (Hohlfeld et al., 2010). The purpose of the pilot test was to provide a preliminary demonstration of the overall assessment quality by considering classical test theory (CTT) item analyses, reliability, and validity. Pilot analysis results indicated that the original version of the ST2L was a sound low-stakes assessment tool; differences between the piloted tool and the current tool reflect changes in the national standards. In the current dataset for this study, Cronbach's alpha as a measure of internal consistency of the ST2L items was estimated as α = .96.
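As a concrete illustration of how such an internal consistency estimate is computed, the following sketch calculates Cronbach's alpha for a scored examinee-by-item matrix. It uses simulated dichotomous responses rather than the actual ST2L data, and the variable names are illustrative only.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an examinee-by-item matrix of scored responses."""
    k = scores.shape[1]                              # number of items
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of examinee total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Simulated data: 500 examinees responding to 106 dichotomous items, with
# correct-response probabilities driven by a common ability draw (not ST2L data).
rng = np.random.default_rng(2014)
ability = rng.normal(size=(500, 1))
difficulty = rng.normal(size=(1, 106))
p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
responses = (rng.random((500, 106)) < p_correct).astype(int)

print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
```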
For the ST2L assessment used in this study, there were fourteen sections of items, which map onto the NETS*S domains as defined and described by ISTE. The first section consisted of fifteen selected-response items measuring the construct of technology concepts, shortened to techConcepts in the remaining text and tables. The second consisted of four performance-based items that measured the examinee's ability to manipulate a file (i.e., techConceptsFileManip). The third and fourth sections consisted of ten and three performance-based items, respectively, that measured the examinee's ability to perform research in a word processor (i.e., researchWP). The fifth section measured the examinee's ability to perform research with a flowchart using five performance-based items (i.e., researchFlowchart). The sixth, seventh, and eighth sections measured the examinee's creative ability with technology, each with four performance-based items that focused on the use of graphics, presentations, and videos,

Fig. 1. Example multiple-selection item.


Fig. 2. Example performance-based task item.

respectively (i.e., creativityGraphics, creativityPresent, creativityVideo). The ninth, tenth, and eleventh sections consisted of eight, six, and four performance-based items, respectively, which measured examinee ability in applying technological communication through browsers (i.e., communicationBrowser) and email (i.e., communicationEmail). The twelfth and thirteenth sections measured critical thinking skills in technology with five and nine performance-based items, respectively (i.e., criticalThink). Finally, the fourteenth section measured the digital citizenship of examinees with twenty-five selected-response items (i.e., digitalCit).

PISA: The PISA questionnaire was included in this study as a criterion measure for assessing the external validity of ST2L scores. It has been rigorously analyzed to demonstrate both reliability and validity across diverse international populations (OECD, 2003). Students were asked to provide information related to their Comfort with Technology, Attitudes towards Technology, and Frequency of Use of Technology. The three constructs employed different scales, for which internal consistency in this study's dataset was α = .78, α = .89, and α = .54, respectively. The low internal consistency of the attitudes toward technology scale is expected due to its shortness (i.e., five items).

4.3. Procedures

Data were collected in the fall semester of 2010. Middle school teachers from the 13 Florida school districts were recruited through the EETT grant program and were provided an overview of the ST2L, how to administer the tool, and how to interpret the scores. Teachers then administered the ST2L within their classes during the fall 2010 semester and had the opportunity to report any problems with the administration process.

4.4. Data analysis: internal structure validity

The testlet nature of the items corresponds with a multidimensional data structure: each testlet item is expected to contribute to the ICT literacy dimension as well as to a second dimension representing the effect of the testlet in which the item is nested. Dimensionality assumptions of the testlet response model were assessed via confirmatory factor analysis (CFA). Fit of the model to the item data was assessed with the S-X² index (Orlando & Thissen, 2000, 2003). Data from the selected-response (i.e., multiple-choice/true-false) non-testlet items were then fit to a three-parameter logistic model (3PL; Birnbaum, 1968). The 3PL is defined as

"

# eai ðqs bi Þ Psi ðYi ¼ 1jqs ; ai ; bi ; ci Þ ¼ ½ci *ð1  ci Þ ; 1 þ eai ðqs bi Þ

(1)

where i refers to items, s refers to examinees, Y is an item response, q is ability (i.e., ICT literacy), a is item discrimination, b is item difficulty and c is item lower asymptote. Data from the performance-based (i.e., open-response) testlet items were fit to a two-parameter logistic testlet model (2PL; Bradlow, Wainer, & Wang, 1999). The 2PL testlet model is defined as

Psi



   Yi ¼ 1qs ; ai ; bi ; gsdðiÞ ¼

"

# eai ðqs bi gsdðiÞ Þ ; 1 þ eai ðqs bi gsdðiÞ Þ

(2)

where Ysd(i) represents a testlet (d) effect for each examinee. The testlet component (Ysd(i)) is a random effect, allowing for a variance
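To make the two response functions concrete, the brief sketch below implements the item probabilities in equations (1) and (2) directly; the parameter values shown are arbitrary placeholders for illustration and are not estimates from the ST2L calibration.

```python
import numpy as np

def p_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Probability of a correct response under the 3PL model (equation 1)."""
    logistic = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic

def p_2pl_testlet(theta: float, a: float, b: float, gamma_sd: float) -> float:
    """Probability of a correct response under the 2PL testlet model (equation 2);
    gamma_sd is the examinee-specific effect of the testlet containing the item."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b - gamma_sd)))

# Arbitrary illustrative values: an examinee of modest ability answering one
# selected-response item (with a lower asymptote) and one testlet-nested task.
theta = 0.5
print(p_3pl(theta, a=1.2, b=-0.3, c=0.20))               # e.g., a multiple-choice item
print(p_2pl_testlet(theta, a=1.0, b=0.1, gamma_sd=0.4))  # e.g., a performance task
```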


A 2PL testlet model was selected over the 3PL because the item formats did not lend themselves to meaningful chances of successful guessing. Item calibration was completed with Bayesian estimation using Markov chain Monte Carlo methods in the SCORIGHT statistical package (Wainer, Bradlow, & Wang, 2010; Wang, Bradlow, & Wainer, 2004). Item fit, item parameter estimates, testlet effect variance components, standard errors of measurement, information, reliability, and differential item functioning (DIF) were examined for internal structure validity evidence.

4.5. Data analysis: external structure validity

ICT literacy latent ability estimates were correlated with the three PISA measures (i.e., use of technology, general knowledge of technology, and attitudes toward technology) to assess external structure validity evidence. Positive, small to moderate correlations were expected with all three external criteria, with a literature-based hypothesis that comfort with technology would yield the strongest relative correlation with technology literacy (Hohlfeld et al., 2010; Katz & Macklin, 2007).

5. Results

Prior to addressing the research questions on internal and external structure validity evidence, items with constant response vectors and persons with missing data had to be addressed. Two items on the assessment had constant response vectors in this study's sample (i.e., researchWP21 was answered incorrectly by all examinees and communicationEmail24 was answered correctly by all examinees) and were therefore excluded from the analysis.

The first stage of analysis focused on determining the nature of the missing data on the remaining 104 test items used in the analysis. Basing our missing data analysis on Wainer et al. (2007), we began by coding the missing responses as omitted and calibrating the testlet model; we then coded the missing responses as wrong and recalibrated the testlet model. The theta estimates from these two calibrations correlated at r = .615 (p < .001), indicating that the choice of how to handle the missing data was non-negligible. We then identified a clear group of examinees for whom ability estimates were extremely different between the two coding methods. Specifically, they had enough missing data to result in very low ability estimates when missing responses were coded as wrong and average ability estimates with very high standard errors when missing responses were coded as omitted.

We also correlated the a and b item parameter estimates from the calibration with missing responses coded as omitted with the corresponding estimates from the calibration with missing responses coded as wrong. Discrimination (a) parameters were mostly larger when missing data were treated as omitted, and the correlation indicated that the differences between the calibrations were non-negligible (r = .629, p < .001). Difficulty (b) parameters were more similar across the calibrations, with a correlation of r = .931 (p < .001). The lack of overall similarity in results shown by these correlations indicated that coding the missing data as wrong was not a viable solution.
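As a rough, purely illustrative sketch of this kind of comparison (not the actual SCORIGHT output), the snippet below correlates ability and item parameter estimates obtained under the two missing-data codings; the arrays are randomly generated stand-ins for the two sets of calibration results.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Stand-ins for estimates from the two calibrations (missing coded as omitted vs. wrong).
theta_omitted = rng.normal(size=5884)
theta_wrong = 0.6 * theta_omitted + rng.normal(scale=0.8, size=5884)
a_omitted = rng.lognormal(mean=0.0, sigma=0.3, size=104)
a_wrong = 0.8 * a_omitted + rng.normal(scale=0.2, size=104)
b_omitted = rng.normal(size=104)
b_wrong = b_omitted + rng.normal(scale=0.3, size=104)

# Correlations analogous to those reported above (r = .615, .629, and .931 in the study).
for label, x, y in [("theta", theta_omitted, theta_wrong),
                    ("a", a_omitted, a_wrong),
                    ("b", b_omitted, b_wrong)]:
    r, p = pearsonr(x, y)
    print(f"{label}: r = {r:.3f}, p = {p:.3g}")
```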
In addition, it was clear that some individuals (n = 109) with large standard errors of ability estimates when missing data were coded as omitted had to be removed from the data set. Ultimately, they did not answer enough items to allow for accurate ability estimation, and their inclusion would therefore compromise later analyses, such as the correlations between ability and the external criteria. The majority of these 109 individuals answered only one of the 106 test items.

The remaining data set of N = 5884 examinees (i.e., those described in the Participants section above) was examined for the nature of its missingness following Enders (2010). For each item, we coded missing data as 1 and present data as 0 and treated these groups as the independent variable in t-tests in which the dependent variable was either the total score on the frequency of technology use items or the total score on the attitudes toward technology items. All t-tests for all items were non-significant, indicating that missingness was not related to frequency of technology use or attitudes toward technology. We were unable to perform t-tests on the total score of the self-efficacy with technology use items due to severe violations of distributional assumptions (i.e., a large portion of persons scored the maximum score for self-efficacy), and hence a simple mean comparison was used. The self-efficacy total scores ranged from 0 to 76, and on all items the self-efficacy mean of the group with missing data differed from the mean of the group with non-missing data by less than three points. We concluded that these differences were small and, therefore, that missingness was not related to this variable. Based on these analyses, we proceeded under the assumption that the missing data were ignorable (MAR) for the 3PL and testlet model analyses.

5.1. Internal structure validity evidence

Before fitting the 3PL and testlet models, we checked the dimensionality assumption through a CFA, with the hypothesis that each item would load onto the overall ability factor (theta) as well as onto a testlet factor associated with the testlet in which the item was nested. Fig. 3 shows an abbreviated diagram of the CFA model, fit in Mplus version 7 (Muthén & Muthén, 2012) with weighted least squares estimation with adjusted means and variances. All latent factors and item residuals were forced to an uncorrelated structure. The model fit the data to an acceptable degree, as indicated by the root mean square error of approximation (RMSEA = .068), the comparative fit index (CFI = .941) and the Tucker–Lewis index (TLI = .938). While the fit could have been improved slightly, these results were deemed acceptable for meeting the dimensionality assumptions of the item/testlet response models.

We then assessed item fit in IRTPro (Cai, Thissen, & du Toit, 2011) to determine whether each multiple-choice/true-false item fit the 3PL model and whether each open-response item fit the 2PL testlet model. This statistical package was used for item fit because the calculation of fit indices is built into the program; however, it was not used for final parameter estimation because it lacks the Bayesian estimation approaches preferred in this study. The S-X² item fit index (Orlando & Thissen, 2000, 2003) was used to assess fit of the model to the item data, and a significance level of α = .001 was used due to the sensitivity of the chi-square test to large sample sizes. A total of four (i.e.,