Educ Asse Eval Acc DOI 10.1007/s11092-015-9231-8

Towards a framework for the validation of early childhood assessment systems

Jessica Goldstein¹ & Jessica Kay Flake²

Received: 28 January 2015 / Accepted: 19 October 2015
© Springer Science+Business Media New York 2015

Abstract American early childhood education is in the midst of drastic change. In recent years, states have begun the process of overhauling early childhood education systems in response to federal grant competitions, bringing an increased focus on assessment and accountability for early learning programs. The assessment of young children is fraught with challenges; psychometricians and educational researchers must work together with the early childhood community to develop these instruments. The purpose of this paper is to present a conceptual framework for the validation of such instrumentation and examine its implications for early childhood educators. We formulate a validity argument for early childhood assessments, providing a pivotal link between validity theory and early education practice. Recommendations for the assessment field are also considered.

Keywords Validity · Early childhood · Educational assessment

Early childhood education in the USA is on the brink of great change. As of January 2014, over $1 billion in federal Race to the Top Early Learning Challenge grants were awarded to 20 states for the development and enhancement of comprehensive early childhood assessment systems. The National Research Council (2008) defined a comprehensive early childhood assessment system (CECAS) as a network of developmental screening measures, formative assessments, measures of environmental quality, measures of the quality of adult-child interactions, and a kindergarten entry assessment. In September 2013, the U.S. Department of Education awarded more than $15 million in Enhanced Assessment Grants (EAGs) for the development or enhancement of kindergarten entry

* Jessica Goldstein
  [email protected]

1 Department of Educational Psychology, University of Connecticut, 249 Glenbrook Road, Storrs, CT 06269, USA

2 Department of Psychology, Quantitative Methods, York University, 101 Behavioural Science Building, 4700 Keele St., Toronto, ON M4S 2E2, Canada


assessments. These federal investments are the result of a national need for developmentally appropriate, psychometrically sound instruments to monitor young children’s learning and development and evaluate the effectiveness of their early childhood learning programs. This need is not limited to the USA; Australia and Canada are working to develop national early childhood assessments as well (Goldfeld et al. 2009; Guhn et al. 2007). Instrumentation is needed to monitor the growth of young children across the globe; however, the assessment of young children is fraught with challenges. Psychometricians and educational researchers must work together with the early childhood community to develop these instruments. The purpose of this paper is to present work towards a conceptual framework for the validation of such instrumentation and examine its implications for early childhood educators. We focus specifically on kindergarten assessment, as this is the entry point into the public school system.

1 The importance of early childhood assessment research

Learning and development in the early years is the foundation for future educational achievement (Alexander and Entwisle 1988, 1996; Bowman et al. 2001; Neuman and Dickinson 2001). Differences in academic performance begin early and persist, and perhaps worsen, over time. One study found that 88 % of students identified as poor readers in first grade were also considered poor readers in fourth grade (Juel 1988). Other studies have found that early reading problems persisted into high school (Francis et al. 1996; Shaywitz et al. 1999). These issues are prevalent in mathematics as well. An achievement gap in mathematics can be seen in children as young as age 3 (Case and Griffin 1990; Jordan et al. 1992), and the effects of this gap linger not only through kindergarten and first grade, but can persist into middle and high school (Berkner and Chavez 1997; Braswell et al. 2001; Denton and West 2002; Entwisle and Alexander 1993; West et al. 2001a, b). Further, racial differences in academic skills are evident even before students enter elementary school (Fryer and Levitt 2006; Haskins and Rouse 2005; Loeb et al. 2007). At the start of kindergarten, Hispanic children have been found to be less ready for school than White or Black students (Duncan and Magnuson 2005; Fryer and Levitt 2004; Reardon 2003; Rumberger and Arellano 2004; Zill et al. 1995). Kindergarten is the point of entry to the American public school system and the earliest point of intervention for public school educators. Kindergarten entry assessments (KEAs) are designed to be administered by classroom teachers as they begin to understand their students. Federal funding competitions were created to ensure a level of uniformity in KEAs across the nation.
Specifically, the Race to the Top Early Learning Challenge required that a KEA "is aligned with the State's Early Learning and Development Standards and covers all Essential Domains of School Readiness… [and] is valid, reliable, and appropriate for the target population and for the purpose for which it will be used, including for English learners and children with disabilities" (U.S. Department of Education 2011b, p. 68). The federal government also requires the KEA to address the essential domains of school readiness, which are defined as language and literacy development, cognition and general knowledge (including early mathematics and early scientific development), approaches towards learning, physical well-being and motor development, and social and emotional development (U.S. Department of Education, n.d.). The multidimensional nature of these assessments and the restriction that they must be administered by classroom teachers at the start of school distinguish them from existing instrumentation designed to assess learning in the later grades.

2 Large-scale early childhood assessments

Measurement challenges result from key differences between early childhood assessments and those designed for the later grades. Early childhood experts encourage assessment of the whole child (National Research Council 2008), including the domains of social/emotional development, physical skills, attitudes towards learning, and early literacy and numeracy skills. Yet, valid and reliable assessment instruments are not as readily available for all of these domains. In addition, young children spend a great deal of time in non-school settings. As a result, families and other non-school-based caregivers play a more prominent role in child development. Moreover, there is greater variability in the timing and pace of development at younger ages. This is a challenge for early childhood educators as well as the psychometricians and educational researchers tasked with assessment development. National policies on early childhood assessment in the USA allow individual states to select instrumentation. As of March 2014, 29 states and Washington, D.C., required assessments of kindergarten students (Education Commission of the States 2014). Though most states developed their own assessments or partnered with other states to do so, several states are piloting a commercial product called Teaching Strategies GOLD (U.S. Department of Education 2015). Recent large-scale empirical research supports the validity of Teaching Strategies GOLD for use in measuring young children's progress (Kim et al. 2013; Kim and Smith 2010; Lambert et al. 2015). Other commercial measures widely used in the USA include the Work Sampling System (WSS), the Child Observation Record (COR), and the Ounce Scale. WSS incorporates checklists, portfolios of children's work, and summary reports to evaluate performance (Meisels 1996; Meisels et al. 1995).
The COR is a 30-item observational measure intended to assess the cognitive, social, and motor development of children 2.5 to 6 years of age (High/Scope Educational Research Foundation 1992). The Ounce Scale is an observation and documentation system designed for use with children from birth to 42 months (Meisels et al. 2010). A 2011 report from the Council of Chief State School Officers (CCSSO 2011) noted that "few resources have been invested in developing assessment tools that address the full range of domains of early learning and child development, and the multiple purposes of such assessments" (p. 7). The full range of domains of early learning and development includes social and emotional development, executive function, physical skills, and early literacy and numeracy skills. Scholars have also noted the need for validation of screening tools used as part of federally funded preschool programs. The Devereux Early Childhood Assessment (DECA) is a standardized rating scale of children's social-emotional adjustment developed for use by parents and teachers (Bulotsky-Shearer et al. 2013; LeBuffe and Naglieri 1999). Bulotsky-Shearer et al. (2013) note that while the instrument is widely used in federally funded programs, "to date, few empirical studies have examined the psychometric properties of the measure when used with culturally and linguistically diverse, low-income populations" (p. 795). Practitioners and researchers alike need to build a body of evidence to support the use of these measures for their intended purposes. Similar validity issues exist with the Early Childhood Environment Rating Scale-Revised (ECERS-R; Harms et al. 1998), a widely used measure of program quality. ECERS-R is a 43-item measure designed to assess the quality of group programs for preschool- to kindergarten-aged children, from 2 through 5 years of age, with a focus on interactions between program staff and children as well as between program staff and families. The tool also addresses the interactions children have with the many materials and activities in the environment, as well as those features, such as space, schedule, and materials, that support these interactions. This instrument has been used in a number of studies of the effects of early childhood education (De Kruif et al. 2000; Gilliam 2000; Jaeger and Funk 2001) and a preschool evaluation in Bangladesh (Aboud 2006). Myers (2004) noted that the instrument had been used in over 20 different countries. Gordon et al. (2013) note that "there is surprisingly little empirical evidence of the validity of the ECERS–R instrument to support its widespread use in research and policy contexts" (p. 146). These authors analyzed data from a sample of 1350 centers and preschools using the Rasch partial credit model, exploratory factor analysis, and regression. In their analyses, they identified issues with the instrument's structural validity and response process validity. They concluded that "The widespread adoption of the ECERS–R for a variety of programmatic, policy, and research purposes necessitates comprehensive validity studies that look at multiple aspects of validity using techniques drawn from both classical test theory and IRT" (p. 155).
Other researchers have questioned the validity of the ECERS-R (Hofer 2010), though it is widely used in practice. There are also examples of this at the international level. The Effective Provision of Pre-school Education (EPPE) study was designed to examine the long-term effects of individual preschool programs on children in the UK (Sylva et al. 2004). EPPE was the first major European longitudinal study of a national sample of young children's development between the ages of 3 and 7 years. The study employed both multivariate quantitative analyses and in-depth qualitative case studies. Researchers for this project developed the Early Childhood Environment Rating Scale-Extension (ECERS-E; Sylva et al. 2003) to supplement the ECERS-R with academic content. The ECERS-E has four subscales of 18 items that address literacy, mathematics, science and environment, and diversity. It was developed by early childhood authorities and practitioners and was specifically designed to address quality as defined by the academic skills included in the curriculum in England at that time. Sylva et al. (2006) compared the ECERS-E and the ECERS-R to external measures of children's language, cognition, and social/behavioral skills. They found a strong relationship between the ECERS-E and cognitive development and a strong relationship between the ECERS-R and children's social-behavioral development over time. Though both instruments are widely used internationally, comprehensive evidence to support the validity of these measures to predict achievement and support instruction is still needed. A variation of the ECERS was used in China as well. Li et al. (2011) used Messick's (1989) conceptualization of validity to present evidence for the Chinese Early Childhood Environment Rating Scale (CECERS). The instrument is an adaptation of the ECERS-R, designed to measure early childhood program quality in Chinese socio-cultural contexts. The authors presented empirical evidence using data from 1012 children in 178 classrooms and addressed content validity considerations, factor structure, concurrent validity, and criterion-related validity, as well as internal consistency reliability and inter-rater reliability. This study is unique in the literature in that it offers a "body of evidence [to support] that CECERS functioned as intended" (p. 278). The international early childhood community must demand validity evidence for new or adapted instrumentation. In England, teachers are required to complete the Early Years Foundation Stage (EYFS) Profile, which describes a child's skills at age 5 based on teachers' ongoing observations and assessments (Department for Education 2013). The EYFS Profile was designed to compare each child's development with national early learning goals in the prime areas of learning (communication and language; physical development; personal, social, and emotional development), the specific areas of learning (literacy; mathematics; understanding the world; and expressive arts and design), and characteristics of effective learning (playing and exploring; active learning; and creating and thinking critically). Data from this instrument are reported to parents and aggregated locally and nationally. The report for the instrument notes that "The EYFS Profile has been designed to be valid and reliable for these purposes" (p. 7), yet references to specific studies are not included. Again, this evidence should be provided to consumers as a matter of course. The appropriateness of the instrument for its intended purpose rests on reports of empirical evidence and qualitative research. The Early Development Instrument (EDI; Janus and Offord 2007) is another teacher observation checklist used internationally.
It addresses the following five developmental domains: physical health and well-being, social competence, emotional maturity, language and cognitive development, and communication skills and general knowledge. Though the EDI is completed for individual children in a classroom, the data are interpreted only at the group level, with scores aggregated to the school, neighborhood, or regional level. The instrument was designed as a holistic measure of child development, intended for use as a population health indicator rather than as an individual diagnostic assessment or screening tool. The psychometric properties of the EDI have been the subject of several investigations with data from Canada, Australia, the USA, and Jamaica (Andrich and Styles 2004; Janus and Offord 2007; Janus et al. 2011). Researchers have also shown the instrument to have predictive validity (Guhn et al. 2007; Silburn et al. 2007) and construct validity (Brinkman and Blackmore 2003; Brinkman et al. 2007) with data from British Columbia and Australia. As the number and use of these assessments increase, the need for an overarching validity framework becomes apparent. As the assessments described above illustrate, approaches to collecting validity evidence vary and the results can be mixed, making it difficult to navigate the information and decide which assessments are appropriate for a given purpose. Because of national initiatives to expand access to preschool and introduce large-scale assessments at kindergarten entry, researchers are at a unique point in history in which appropriate validation practice can be integrated into assessment development. Comprehensive validity studies are needed for all assessments, but they are particularly important for those measures that will be used to inform policy and practice.
In this article, we present a conceptual framework for the validation of kindergarten entry assessments using an argument-based approach (Kane 2006) and examine its implications for early childhood educators, researchers, and policy makers. While argument-based approaches to validation exist for the field of English language acquisition (Chappelle et al. 2010) and for assessments of students with severe disabilities (Goldstein and Behuniak 2011; Marion and Pellegrino 2006), there are no examples in the literature of validation procedures for early childhood assessment systems.

3 Conceptual approach

Our conceptual framework for the validation of early childhood assessment systems is a validity argument, which builds on the work of both Kane (2006, 2013) and Mislevy et al. (2003). The joint AERA, APA, NCME Standards for Educational and Psychological Testing (2014) are used by the measurement community to evaluate educational assessment programs. In the Standards, validity is defined as the "degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests" (p. 9). Kane (2006) extended the presentation of validity from the Standards by specifying that "validation requires a clear statement of the proposed interpretations and uses" (p. 23). In the context of assessment development, this statement of the proposed interpretations and uses refers to the purpose of the assessment. Validation begins with assessment purpose, and validity studies are the theory-based evidence that support the use of the assessment for the specified purpose. Kane's seminal work (2006) focuses on an argument-based approach to validation, which incorporates an interpretive argument and a validity argument. The interpretive argument "assumes that the proposed interpretations and uses will be explicitly stated as an argument, or network of inferences and supporting assumptions, leading from observations to the conclusions and decisions" (p. 17). This network of inferences and supporting assumptions builds from the assessment purpose and is the first step in the validation process. The validity argument is the evaluation of research evidence designed to support or refute the network of inferences and assumptions in the interpretive argument. In a validity argument, a set of research studies is evaluated holistically. An alternative perspective on validation is evidence-centered design (ECD), which builds on a student-centric approach to assessment design advanced by Messick (1994). Mislevy et al.
(2003) proposed the ECD framework as a tool to ensure the interconnectedness of assessment purpose, assessment elements, the conception of proficiency, and the evidentiary argument. In the ECD framework, assessment data are observations that hold clues to claims to be made about what students know or are able to do. An argument of inferences connects these observations to theory-based inferences about the domain of interest, statistical modeling, and relevant elements of the test construction and context. In the ECD perspective, every assessment design decision influences the chain of inferences that connect examinee behaviors to claims about what they know or can do. Claims about what examinees know or can do are considered throughout the design process. The uniqueness of ECD is that the approach is applied proactively to test design (Mislevy and Haertel 2006). Details of the validity framework were drawn from a comprehensive review of the literature on assessment in early childhood and validity theory, as well as ongoing conversations between these researchers and their local education agency in planning an early childhood assessment system. The needs of various stakeholder groups were also considered. While early childhood assessment systems typically address comprehensive services from birth through age 8, this framework is targeted at children's learning and development in the period around kindergarten entry, from 48 to 72 months (pre-kindergarten through kindergarten), via ongoing formative assessments and a summative kindergarten entry assessment. The stakeholder groups and desired purposes for the instrumentation are presented in Table 1. Table 2 details the assumptions subsumed within each purpose. In the following sections, we relate these assumptions to the research literature on learning and development, teachers and teaching, assessment data and data use, and the young children the systems are designed to serve.

4 Assumptions about policy and practice

Validation is a recursive process of investigating assessment assumptions. The assumptions associated with kindergarten entry assessment run both deep and wide. To accurately reflect learning and development of a group of children at any point in time, children should have had sufficient opportunities to show their skills, teachers should be skilled in recognizing and documenting growth, and the assessment instruments should effectively mirror development. These issues are especially salient in the assessment of young children. Accordingly, Sireci (2009) described validity as an "unattainable goal" (p. 28). Using an argument-based framework, we have identified potential purposes of early childhood assessments and the associated assumptions, which are included in Table 2. In the following sections, we examine the critical assumptions that underlie many of the purposes. These assumptions are potential validity studies to be explored by researchers and policy makers as they relate to learning and development in young children, teachers and their judgments, and assessment and data use.

4.1 Learning and development in young children

There are many aspects of young children's cognitive development that complicate assessment design. Unlike older students, preschool students are not yet able to sit down and take a multiple-choice test. Young children are developing their memory and attention, as well as their reading skills. It is an assumption that young children are healthy, well-rested, and physically able to demonstrate their knowledge and skills. Cognitive development is dependent on these physical skills and dispositions. Many cognitive hindrances limit young children in the test taking process, including developing abilities to comprehend assessment stimuli, process complex information, and control behaviors (Meisels 2007).
Young children may be able to show learning through behavior, but they are unable to write their thoughts or responses for data collection. Additionally, young children do not have the theory of mind to process the goals of assessment, making testing interactions confusing and difficult to execute (Nelson 1998). Young children must be assessed multiple times in a variety of ways because they learn in a manner that is more episodic than that of older students (Kagan et al. 2003).

Table 1 Stakeholder groups and system goals for a comprehensive early childhood assessment system

System goals: summarize individual level learning and development; summarize population level learning and development; monitor student learning to inform instruction; summarize learning and development of children in EC-SPED programs; communicate with families; conduct screenings for EC-SPED services; improve population level outcomes for children; improve program quality; improve teacher quality; support the teacher evaluation system; collect results-based accountability data; evaluate resource allocation; measure ROI for school readiness funding; maintain longitudinal data system.

Stakeholder groups: families; early childhood programs and teachers; receiving K schools; local education agencies/school districts; local community EC councils; EC funding agencies; state education agency; state legislators.

Table 2 Purposes and associated assumptions for a comprehensive early childhood assessment system

Purpose: Summarize learning and development of children, at the individual and population level

Assumptions:
• Children have the opportunity to learn or be exposed to experiences in the content standards.
• Children are healthy, well-rested, and physically able to demonstrate their knowledge and skills.
• Teachers have an accurate, current understanding of early learning and development for their population.
• Teachers design instruction or create learning environments to reflect state content standards.
• Teachers design instruction or create learning environments appropriate to children's stage of development.
• Teachers have basic assessment literacy skills.
• Teachers have sufficient resources to conduct assessments/data collection.
• Assessments are developmentally appropriate.
• The assessment reflects the content standards.
• All identifiable subgroups receive equal treatment in the assessment system.
• Scores on observation measures or performance tasks are consistent across raters.
• Assessment scores and performance levels accurately reflect student learning and development.
• Analyses of the internal structure of the assessment instruments comply with standard validity requirements.
• Formative assessment data can be collected in a manner that supports learning.

Purpose: Monitor student learning to inform instruction

Assumptions:
• Teachers have an accurate, current understanding of early learning and development for their population.
• Teachers design instruction or create learning environments to reflect state content standards.
• Teachers design instruction or create learning environments appropriate to children's stage of development.
• Teachers have basic assessment literacy skills.
• Teachers have sufficient resources to conduct assessments/data collection.
• Teachers experience professional development opportunities related to the early childhood assessment system.
• Teachers can design effective interventions and/or augment instruction for students based on assessment data.

Purpose: Improve outcomes for children

Assumptions:
• Teachers can design effective interventions and/or augment instruction for students based on assessment data.
• Statewide, students' skills at established system benchmarks increase over time.
• Use of the CECAS leads to improved performance on school readiness measures.
• Unintended negative consequences of the assessment system are minimized.

Purpose: Communicate with families

Assumptions:
• Families are informed of their child's progress on the assessments.
• Families have an accurate, current understanding of early learning and development for young children.
• Learning standards are distributed in a manner that is accessible for all families.

• Teachers have an accurate, current understanding of early learning and development for young children.
• Teachers can explain the cycle of instruction, assessment, and intervention effectively to families.

Purpose: Collect results-based accountability data

Assumptions:
• Teachers experience professional development opportunities related to the early childhood assessment system.
• Teachers implement the assessment instruments with fidelity.
• Assessment instruments reflect learning and development directly related to program/classroom activities.
• Assessments are developmentally appropriate.
• All identifiable subgroups receive equal treatment in the assessment system.
• Rubrics are applied consistently across programs and schools.
• Analyses of the internal structure of the assessment instruments comply with standard validity requirements.

Purpose: Evaluate resource allocation and measure return on investments in preschool funding

Assumptions:
• Increased funding is associated with improved learning and development.

Purpose: Improve program quality

Assumptions:
• Assessments and data collection instruments appropriately identify high-quality programs and programs in need of improvement.

Purpose: Improve teacher quality

Assumptions:
• Assessments and data collection instruments appropriately identify high-quality teachers and teachers in need of improvement.

Purpose: Maintain longitudinal data system

Assumptions:
• Assessment and data collection instruments are aligned with assessments and data collection instruments in the later grades.
• Students have unique identification numbers from preschool through public school.

Accurate assessment of what young children know and can do takes time. The challenge of time is overcome, to some extent, when a child has a relationship with his or her assessor. Furthering this relationship is the primary reason the assessment is usually administered by the student's teacher (Meisels). The assessment process, along with the assessment results, helps the teacher learn more about the student. The National Research Council (2008) suggests the importance of contextualization in reporting the results of direct assessments:

A child's score on a vocabulary test reflects not just the child's capacity to learn words, but also the language environment in which the child has lived since birth, the child's ease with the testing procedure, and the child's relationship with the test. The younger the child, the more important are these considerations. (p. 17)


Assessment results may reflect what a child has learned in their life and their comfort level at a particular moment, rather than what they may have gained from any particular learning experience. As each young child is the sum of their experiences, the influence of family is also a key consideration in the evaluation of the learning and development of young children. It is a strong assumption that families have an accurate, current understanding of early learning and development milestones for young children and that they have the knowledge and ability to expose their children to related learning experiences. The Early Childhood Longitudinal Study, Kindergarten Class of 1998–1999 (ECLS-K), followed a nationally representative sample of 22,000 kindergartners from the fall of 1998 through their fifth-grade year. In the study, kindergartners' performance on math, reading, and general knowledge items increased with the level of their mothers' education and was higher for children from two-parent families (Rathburn et al. 2004; Zill and West 2001). Children with more family risk factors made smaller gains in math and reading, contributing to the achievement gap between disadvantaged and more-advantaged children in the early years of school (Rathburn et al. 2004). In a study of parent-school relationships, Powell et al. (2010) found that parental involvement and parental perception of teacher responsiveness were related to children's academic and social competence. Meisels (2007) stated that "the range of opportunities to learn in early childhood describes the fundamental differences in society and especially reflects the challenges faced by impoverished and disadvantaged children prior to even arriving at the school door" (p. 36). Opportunity to learn is an equity issue. One of the key assumptions underlying the development and validation of assessment systems is shared expectations for learning and development.
Studies of kindergarten teachers’ views suggest a vision of kindergarten readiness among practitioners that is unbalanced across the domains. One study found that the top three qualities that public school kindergarten teachers consider essential for school readiness are that a child be physically healthy, rested, and well nourished; be able to communicate needs, wants, and thoughts verbally; and be enthusiastic and curious in approaching new activities (Heaviside and Farris 1993). A decade later, further research confirmed that teacher perceptions of kindergarten success rest on the child’s health, social competence, ability to communicate, and ability to follow directions (Lin et al. 2003; Wesley and Buysse 2003). It is an assumption that educators share an accurate, current understanding of learning and development for their students. In its notice to states of the enhanced assessment grant funds to support the development or enhancement of a kindergarten entry assessment (KEA), the U.S. Department of Education (2013) defined the essential domains of school readiness as the domains of language and literacy development, cognition and general knowledge (including early mathematics and early scientific development), approaches towards learning, physical well-being and motor development (including adaptive skills), and social and emotional development. Another interesting consideration is the arc of these skills across the early childhood continuum. The federal government requires the assessment of readiness as a multidimensional construct; it does not provide guidance about the developmental trajectory of the skills outside of literacy and numeracy throughout the primary grades. Further, assessment of these other skills is not required in elementary school outside of kindergarten entry assessments. It is an assumption that
kindergarten assessment and data collection instruments are aligned with assessments and data collection instruments in the later grades. The discontinuity in expectations for learning and development across schools, teachers, and older students presents a challenge to creating a valid assessment system, as alignment to standards and instruction are assumptions that underlie a focal purpose of KEAs: to summarize learning and development (see Table 2). Given these challenges in assessing young children's learning and development, these assessments may serve some purposes better than others. For example, they may not be able to summarize learning and development accurately if the assessment stimuli are not age appropriate. With developmentally appropriate stimuli, these assessments may meet that purpose but still not provide good accountability data, as learning is strongly influenced by the home environment as well as the school environment.

4.2 Teachers and teaching

A valid assessment assumes teachers can create learning environments to advance children's development. Research has shown that the early educator workforce suffers from a shortage of formally trained practitioners. In 2004, 30 % of center-based workers held a bachelor's degree; this percentage was much lower for home-based workers (11 %; Herzenberg et al. 2005). Since 2005, the training landscape of the workforce has not improved. A 2012 follow-up report, investigating workforce data from 2004 to 2010, found no significant changes in workforce education from the early 2000s (Bassok et al. 2012). To date, only 16 states require pre-service qualifications to teach in an early child care center, and 39 require annual, ongoing training (U.S. Department of Health and Human Services 2011). With such basic levels of required training for early childhood educators, appropriate pedagogy and assessment practice are not assured in early childhood classrooms.
It is an assumption that teachers can design instruction or create learning environments to reflect state content standards that are appropriate for children's stage of development. Even with formal training, successful implementation of assessment practice is challenging. For example, K-12 teachers with bachelor's degrees and formal assessment training require continued development to successfully implement and utilize assessments (Crooks 1988; Herman and Dorr-Bremme 1982; Stiggins 1999; Volante and Fazio 2007). In a review of federally funded Institute of Education Sciences (IES) research on early interventions and early childhood education, Diamond et al. (2013) noted that research is needed to identify valid and reliable ways to measure children's skills and capture their learning over time that can be easily adopted by practitioners. The validity of kindergarten entry assessments also rests on the assumption that the instruments are implemented with fidelity. Even with proper training, variability in the quality of teacher judgments within a population of teachers presents a challenging measurement issue. Children of lower socioeconomic status (SES) are more likely to be retained in kindergarten (Burkam et al. 2007) and placed into lower-level ability groups, even after controlling for sociodemographic background and measured academic ability (Tach and Farkas 2006). Beswick et al. (2005) also found that kindergarten teacher ratings of their students' skills were influenced by gender, maternal education level, and behavior. Mashburn and Henry (2004) compared preschool and
kindergarten teacher ratings of children's kindergarten readiness, academic skills, and communication skills with direct assessments of these skills. The authors found that student demographic characteristics influenced teacher ratings. Specifically, boys and younger children received lower ratings from both preschool and kindergarten teachers, and African American children received higher teacher ratings than White children. Family characteristics were also associated with kindergarten teacher ratings: children whose families received welfare had lower ratings than children whose families did not. Teachers face yet another challenge in successfully executing these assessments, as the new assessments were introduced alongside new standards and curricula. In the USA, a number of states recently revised their learning and development content standards. In 2011, 35 states, D.C., and Puerto Rico submitted applications to the Race to the Top Early Learning Challenge, which required early learning standards aligned to the Common Core State Standards (CCSS; U.S. Department of Education 2011a). States revised standards to reflect the CCSS, as well as recent advances in understanding of growth and development in young children. While standards have been revised cyclically over time, they now rise in importance because they will form the foundation of kindergarten entry assessments. Practitioner use of the learning and development standards from which assessments are designed also weighs heavily on validity. The implicit assumption that re-issued standards will change teacher practice underlies numerous purposes in Table 2. This assumption is substantial and warrants investigation. Teachers' ability to incorporate learning standards into their practice in a manner that is developmentally appropriate weighs heavily on children's opportunity to learn.
Because data collection in early childhood relies heavily on teacher report (Alvidrez and Weinstein 1999; Barnett et al. 1992), researchers must develop a body of knowledge around appropriate prompts and responses that are universally accessible for all children as they start kindergarten. The interdependency of teachers' skills in developing assessment tasks, their time to conduct the assessments appropriately, and their overall understanding of the larger assessment system weighs heavily on the validity of the overall system. Thus, the assumptions regarding teachers and their training underlie many of the purposes of early childhood assessments listed in Table 2. Research on the viability of these assumptions needs to be explicitly incorporated into the development and validation of early childhood assessments.

4.3 Assessment and data use

To summarize learning and report accountability data, the basic assumptions that pertain to all assessment instruments must be met. These assumptions are similar to those for assessments of older children. In general, it is assumed that the component instruments of an early childhood assessment system will exhibit standard psychometric properties and that these instruments will be used consistently across schools and teachers as well as for different students. Details of these requirements are included in the 2014 Standards for Educational and Psychological Testing, a professional guideline document jointly developed by the American Educational Research Association (AERA), the American Psychological Association, and the National Council on Measurement in Education (NCME) (2014) to promote the sound and ethical use of tests and to provide a basis for evaluating the quality of testing practices. The Standards
include practical examples of each classification of validity evidence: evidence based on test content, evidence based on response processes, evidence based on internal structure, evidence based on relations to other variables, and evidence based on the consequences of testing. Beyond these core assumptions, assessment programs must show evidence that the instruments are valid for special populations, including English learners and students with disabilities. Further, evidence is needed to show that the introduction of the assessment in the classroom improves outcomes for children and that unintended negative consequences are minimized. Schafer et al. (2009) reviewed public documentation, including technical manuals and federal reviews of state work, to identify validity evidence presented in support of state assessment systems. The authors found mostly evidence based on test content, which included test blueprints and alignment reports. Achievement levels were also used to support validity. Little evidence based on student response processes was identified in state technical manuals, but the authors suggested that such evidence may include depth-of-knowledge ratings from alignment reports. Evidence based on internal structure is also needed; examples included item–subtest correlations, item–test correlations, and factor analyses to demonstrate dimensionality. Typical examples of evidence based on relations to other variables included correlations with nationally marketed tests, usually comparing scores on the same subject (i.e., reading to reading). The authors found very few examples of evidence based on test consequences in state documentation. Evidence based on test consequences is particularly relevant to the early childhood context if the assessments are to serve the purpose of improving outcomes for children. Studies are needed to demonstrate that an assessment program minimizes negative consequences. Lane et al.
(1998) suggested that documentation outlining the stated purpose and intended outcomes of a testing program can provide the framework for an investigation of test consequences. Evidence of the match between the assessment purpose and use can come from examination of curricular materials, changes to professional development programs, interviews with curriculum specialists, and stakeholder surveys. Still, there are few research-based examples of evidence based on test consequences across assessment contexts (Cizek et al. 2008; Mehrens 2002; Reckase 1998).
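To make the internal-structure evidence discussed above concrete, the sketch below computes corrected item–total correlations and a Cronbach's alpha internal-consistency estimate. The item scores are entirely hypothetical and illustrative; they are not drawn from any assessment discussed in this paper, and operational analyses would require far larger samples and formal psychometric modeling.

```python
import numpy as np

# Hypothetical item-level data: 8 children x 5 dichotomous items
# (rows = children, columns = assessment items). Illustrative only.
scores = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 1, 0],
])

def corrected_item_total(scores):
    """Correlate each item with the total of the remaining items."""
    n_items = scores.shape[1]
    totals = scores.sum(axis=1)
    out = []
    for j in range(n_items):
        rest = totals - scores[:, j]  # exclude the item itself
        out.append(np.corrcoef(scores[:, j], rest)[0, 1])
    return np.array(out)

def cronbach_alpha(scores):
    """Internal-consistency estimate from item and total-score variances."""
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

print(corrected_item_total(scores).round(2))
print(round(cronbach_alpha(scores), 2))
```

Low or negative corrected item–total correlations would flag items that do not cohere with the rest of the instrument, which is one of the simplest pieces of internal-structure evidence a technical manual can report.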

5 Recommendations for policy and practice

Burgeoning early childhood assessments affect many systems: governments distribute money, teachers participate in professional development, pre-service training programs are altered, and assessments are developed and studied. These changes are for naught if these systems do not lead to improved outcomes for young children. Our validity argument is a framework for researchers to study assessment structure, data use, and assessment literacy. Early childhood assessment systems must be built on a body of evidence demonstrating the suitability of the data structure, the appropriateness of data use, and the effectiveness of implementation. Questions about educators, children, and use of data within the system abound. What is the impact of learning and development standards on teacher practice? Can these standards help educate parents on
appropriate expectations for their children? How can the system advance teacher understanding of effective assessment practice? What type of training works best for teachers who do not have advanced degrees? Can the assessments identify high-quality programs and/or high-quality teachers? Are children's lives affected by assessment data? Our work provides a structure for researchers to study these questions. Through articulating the validity argument, we identified key recommendations for moving forward. We recommend that large-scale assessments of young children be viewed as a snapshot of what individual children can do at a specific point in time, not a measure of their abilities. The absence of specific skills at the start of formal schooling may be the result of lack of experience or lack of exposure. It is also possible that children who perform well on kindergarten entry assessments were exposed to the skills on the assessments in their preschool programs. Young children learn quickly, and skills that are not present at the very start of school may emerge shortly thereafter. Moreover, children's performance is dependent upon their comfort. As young children become more comfortable with their assessor, they may become more willing to share their thoughts. It is also important to consider the alignment between each item and the content it was designed to measure. It is challenging to elicit specific behaviors from young children; poor performance may be the consequence of poor item design. We recommend the involvement of classroom educators and active administrators in the assessment development process. Active engagement of these practitioners in the development process will help ensure the authenticity of the assessment tasks and will encourage buy-in from teachers and schools. Outreach can be a continuous process throughout assessment development. It should begin when the development process begins and should continue after data reports are publicized.
Practitioners can comment on individual items, suitability of the assessment for special populations, administration guidelines, and reporting structures. This process of continuous stakeholder engagement will help support the validity of the assessment. Assessment training and ongoing professional development are also important. Early childhood education programs should cover the practice of assessment. Assessment literacy should be incorporated into ongoing professional development programs for educators and caregivers as well. The design of authentic experiences for data collection is an art that requires practice. In addition, pre-service educators should study early learning and development standards as well as current assessment systems designed to serve young children. The introduction of a new assessment system requires a significant professional development effort for practicing educators as well. All system stakeholders must understand the importance of the system, including its general purpose and value to participants. Assessors must understand administrative procedures for implementation and data collection. Finally, all of the stakeholders must have an accurate understanding of data use. The introduction of a new assessment system is a large undertaking. We recommend a blended professional development system that includes face-to-face settings, online information delivery, and online professional communities of practice.


6 Conclusion

Though the development and use of assessments for young children is fraught with complexities, we have the opportunity to face these challenges in the assessment planning stages. Here, we have created a validity argument for these assessments, identifying the many purposes and assumptions that underlie them. In doing so, we have identified major areas of exploration for researchers and practitioners. Each identified purpose has a corresponding set of assumptions. As can be seen in Table 2, many of the fundamental purposes of KEAs (summarizing learning, improving outcomes for children, and supplying accountability data) have numerous assumptions that need formal investigation. An assessment designed to summarize learning assumes that young children have had equivalent exposure to learning experiences. An assessment designed to monitor ongoing learning and development is not associated with this assumption, but its validity rests on teachers' abilities to instruct, nurture, and observe. If assessments are created to improve outcomes for young children, evidence is required to show how data are used to change programs and services. Similarly, assessments designed to evaluate program quality require evidence of this use. Any assertion on the part of assessment developers must be substantiated by the development or program team. Our work provides a structure for this process. The framework delineates possible purposes for early childhood assessments and details associated assumptions. This structure offers an opportunity for validation to begin at the start of assessment development, when there is ample time for the systematic investigation of these assumptions. Assessment developers must consider not just the items on the assessment, but the role the assessment plays in a larger system designed to serve young children. Assessments are not just a collection of test items; they should be considered system drivers with consequences for many different stakeholder groups.
Stakeholders must agree on the assessment purpose and hold to that purpose throughout the development and implementation of the assessment program. A rigorous program of research will ensure these purposes are met. Budgets for validation activities should be set in the early stages of assessment development. Discrepancies between the intention of the assessment and practitioner use must be addressed by program administrators to ensure validity. Policy makers must also commit to pre-service training for developing educators and professional development for practitioners. Training should focus on developments in the field of learning and development in young children as well as assessment literacy. Investments of federal money in early learning and development standards and corresponding assessments may help improve outcomes for children if the educators in the classroom receive appropriate training and ongoing technical assistance. Teachers must also be well-versed in reporting on development and assessment performance to parents. Validation must be integrated into the foundation of a CECAS, and its tenets must be re-examined throughout the life cycle of its component assessments. These efforts will ensure we are creating a system that will improve learning and development for children.

Acknowledgments This research was supported in part by a contract from the Connecticut State Department of Education.


References

Aboud, F. (2006). Evaluation of an early childhood preschool program in rural Bangladesh. Early Childhood Research Quarterly, 21(1), 46–60.
Alexander, K. L., & Entwisle, D. R. (1988). Achievement in the first 2 years of school: Patterns and processes. Monographs of the Society for Research in Child Development, 53(2, Serial No. 218).
Alexander, K. L., & Entwisle, D. R. (1996). Schools and children at risk. In A. Booth & J. F. Dunn (Eds.), Family school links: How do they affect educational outcomes? (pp. 67–87). Mahwah, NJ: Erlbaum.
Alvidrez, J., & Weinstein, R. S. (1999). Early teacher perceptions and later student academic achievement. Journal of Educational Psychology, 91, 731–746.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Andrich, D., & Styles, I. (2004). Final report on the psychometric analysis of the Early Development Instrument (EDI) using the Rasch Model: A technical paper commissioned for the development of the Australian Early Development Index (AEDI). Perth: Murdoch University.
Barnett, D. W., Macmann, G. M., & Carey, K. T. (1992). Early intervention and the assessment of developmental skills: Challenges and directions. Topics in Early Childhood Special Education, 12, 21–43.
Bassok, D., Fitzpatrick, M., Loeb, S., & Paglayan, A. S. (2012). The early childhood care and education workforce in the United States: Understanding changes from 1990 through 2010. Unpublished manuscript. Retrieved from http://cepa.stanford.edu/sites/default/files/AEFP_ECCE%20Workforce.pdf.
Berkner, L. K., & Chavez, L. (1997). Access to postsecondary education for the 1992 high school graduates. Statistical Analysis Report, NCES 98–105. Washington, DC: U.S. Department of Education, Office of Educational Research and Improvement, National Center for Education Statistics.
Beswick, J. F., Willms, J. D., & Sloat, E. A. (2005). A comparative study of teacher ratings of emergent literacy skills and student performance on a standardized measure. Education, 126(1), 116.
Bowman, B., Donovan, M. S., & Burns, M. S. (Eds.). (2001). Eager to learn: Educating our preschoolers. Washington, DC: National Academy Press.
Braswell, J. S., Lutkus, A. D., Grigg, W. S., Santapau, S. L., Tay-Lim, B. S.-H., & Johnson, M. S. (2001). The nation's report card: Mathematics 2000 (NCES 2001–517). Washington, DC: U.S. Department of Education, Office of Educational Research and Improvement, National Center for Education Statistics.
Brinkman, S., & Blackmore, S. (2003). Pilot study results of the Australian Early Development Instrument: A population based measure for communities and community mobilisation tool. Paper presented at the Beyond the Rhetoric in Early Intervention Conference, Adelaide.
Brinkman, S., Silburn, S., Lawrence, D., Goldfeld, S., Sayers, M., & Oberklaid, F. (2007). Investigating the validity of the Australian Early Development Index. Early Education and Development, 18(3), 427–451.
Bulotsky-Shearer, R. J., Fernandez, V. A., & Rainelli, S. (2013). The validity of the Devereux Early Childhood Assessment for culturally and linguistically diverse Head Start children. Early Childhood Research Quarterly, 28(4), 794–807.
Burkam, D. T., LoGerfo, L., Ready, D., & Lee, V. E. (2007). The differential effects of repeating kindergarten. Journal of Education for Students Placed at Risk, 12(2), 103–136.
Case, R., & Griffin, S. (1990). Child cognitive development: The role of central conceptual structures in the development of scientific and social thought. In C. A. Hauert (Ed.), Developmental psychology: Cognitive, perceptuo-motor, and neuropsychological perspectives (pp. 193–230). Amsterdam: Elsevier Science.
Chappelle, C. A., Enright, M. K., & Jamieson, J. (2010). Does an argument-based approach to validity make a difference? Educational Measurement: Issues and Practice, 29, 3–13. doi:10.1111/j.1745-3992.2009.00165.x.
Cizek, G. J., Rosenberg, S., & Koons, H. (2008). Sources of validity evidence for educational and psychological tests. Educational and Psychological Measurement, 68, 397–412.
Council of Chief State School Officers (CCSSO). (2011). Moving forward with kindergarten readiness assessment efforts: A position paper of the Early Childhood State Collaborative on Assessment and Student Standards. Washington, DC: Council of Chief State School Officers.
Crooks, T. J. (1988). The impact of classroom evaluation on students. Review of Educational Research, 58, 438–481.

De Kruif, R. E. L., McWilliam, R. A., Ridley, S. M., & Wakely, M. B. (2000). Classification of teachers' interaction behaviors in early childhood classrooms. Early Childhood Research Quarterly, 15(2), 247–268.
Denton, K., & West, J. (2002). Children's reading and mathematics achievement in kindergarten and first grade (NCES 2002–125). Washington, DC: National Center for Education Statistics.
Department for Education. (2013). Early Years Foundation Stage Profile handbook. Available at: https://www.gov.uk/government/publications/early-years-foundation-stage-profile-handbook.
Diamond, K. E., Justice, L. M., Siegler, R. S., & Snyder, P. A. (2013). Synthesis of IES research on early intervention and early childhood education (NCSER 2013–3001). U.S. Department of Education.
Duncan, G. J., & Magnuson, K. A. (2005). Can family socioeconomic resources account for racial and ethnic test score gaps? Future of Children, 15(1), 35–54.
Education Commission of the States. (2014). 50-state analysis: Kindergarten entrance assessments. Available at: http://ecs.force.com/mbdata/mbquestRT?rep=Kq1407.
Entwisle, D. R., & Alexander, K. L. (1993). Entry into schools: The beginning school transition and educational stratification in the United States. Annual Review of Sociology, 19, 401–423.
Francis, D. J., Fletcher, J. M., Shaywitz, B. A., Shaywitz, S. E., & Rourke, B. P. (1996). Defining learning and language disabilities: Conceptual and psychometric issues with the use of IQ tests. Language, Speech, and Hearing Services in Schools, 27, 132–143.
Fryer, R. G., & Levitt, S. D. (2004). Understanding the black-white test score gap in the first two years of school. Review of Economics and Statistics, 86(2), 447–464.
Fryer, R. G., & Levitt, S. D. (2006). The black-white test score gap through third grade. American Law and Economics Review, 8(2), 249–281.
Gilliam, W. S. (2000). On over-generalizing from overly-simplistic evaluations of complex social programs. Early Childhood Research Quarterly, 15(1), 67–74.
Goldfeld, S., Sayers, M., Brinkman, S., Silburn, S., & Oberklaid, F. (2009). The process and policy challenges of adapting and implementing the Early Development Instrument in Australia. Early Education & Development, 13, 978–991.
Goldstein, J., & Behuniak, P. (2011). Assumptions in alternate assessment: An argument-based approach to validation. Assessment for Effective Intervention, 36, 179–191.
Gordon, R. A., Fujimoto, K., Kaestner, R., Korenman, S., & Abner, K. (2013). An assessment of the validity of the ECERS–R with implications for measures of child care quality and relations to child development. Developmental Psychology, 41(1), 146–160.
Guhn, M., Gadermann, A., & Zumbo, B. D. (2007). Does the EDI measure school readiness in the same way across different groups of children? Early Education and Development, 18(3), 453–472.
Harms, T., Clifford, R. M., & Cryer, D. (1998). Early childhood environment rating scale (Rev. ed.). New York: Teachers College Press.
Haskins, R., & Rouse, C. (2005). Closing achievement gaps. The Future of Children Spring Policy Brief. Princeton: Princeton University and Brookings Institution.
Heaviside, S., & Farris, E. (1993). Public school kindergarten teachers' views on children's readiness for school (NCES No. 93–410). Washington, DC: U.S. Department of Education, Office of Educational Research and Improvement.
Herman, J., & Dorr-Bremme, D. (1982). Assessing students: Teachers' routine practices and reasoning. Paper presented at the annual meeting of the American Educational Research Association, New York.
Herzenberg, S., Price, M., & Bradley, D. (2005). Losing ground in early childhood education: Declining workforce qualifications in an expanding industry. Washington, DC: Economic Policy Institute.
High/Scope Educational Research Foundation. (1992). High/Scope Child Observation Record (COR) for ages 2 1/2–6. Ypsilanti, MI: High/Scope Press.
Hofer, K. G. (2010). How measurement characteristics can affect ECERS-R scores and program funding. Contemporary Issues in Early Childhood, 11(2), 175–191.
Jaeger, E., & Funk, S. (2001). The Philadelphia Child Care Quality Study: An examination of quality in selected early education and care settings. Available at: www.sju.edu/int/academics/cas/resources/cdl/resources/Phila.CC%20Study.pdf.
Janus, M., & Offord, D. (2007). Development and psychometric properties of the Early Development Instrument (EDI): A measure of children's school readiness. Canadian Journal of Behavioural Science, 39, 1–22.
Janus, M., Brinkman, S., & Duku, E. (2011). Validity and psychometric properties of the Early Development Instrument in Canada, Australia, United States, and Jamaica. Social Indicators Research, 103(2), 283–297.

Jordan, N. C., Huttenlocher, J., & Levine, S. C. (1992). Differential calculation abilities in young children from middle- and low-income families. Developmental Psychology, 28, 644–653.
Juel, C. (1988). Learning to read and write: A longitudinal study of 54 children from first through fourth grades. Journal of Educational Psychology, 80(4), 437–447.
Kagan, S. L., Scott-Little, C., & Clifford, R. M. (2003). Assessing young children: What policymakers need to know and do. In C. Scott-Little, S. L. Kagan, & R. M. Clifford (Eds.), Assessing the state of state assessments: Perspectives on assessing young children. Greensboro, NC: University of North Carolina, SERVE.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Washington, DC: The National Council on Measurement in Education & the American Council on Education.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73.
Kim, D. H., & Smith, J. D. (2010). Evaluation of two observational assessment systems for children's development and learning. NHSA Dialog, 13, 253–267.
Kim, D. H., Lambert, R. G., & Burts, D. C. (2013). Evidence of the validity of the Teaching Strategies GOLD® assessment tool for English language learners and children with disabilities. Early Education and Development, 24(4), 574–595.
Lambert, R. G., Kim, D. H., & Burts, D. C. (2015). The measurement properties of the Teaching Strategies GOLD® assessment system. Early Childhood Research Quarterly. doi:10.1016/j.ecresq.2015.05.004.
Lane, S., Parke, C. S., & Stone, C. A. (1998). A framework for evaluating the consequences of assessment programs. Educational Measurement: Issues and Practice, 17(2), 24–28.
LeBuffe, P. A., & Naglieri, J. A. (1999). DECA: Devereux Early Childhood Assessment. Lewisville: Kaplan Press.
Li, K., Hu, B., Pan, Y., Qin, J., & Fan, X. (2011). Chinese Early Childhood Environment Rating Scale (trial) (CECERS): A validity study. Early Childhood Research Quarterly, 29, 268–282.
Lin, H. L., Lawrence, F. R., & Gorrell, J. (2003). Kindergarten teachers' views of children's readiness for school. Early Childhood Research Quarterly, 18(2), 225–237.
Loeb, S., Bridges, M., Bassok, D., Fuller, B., & Rumberger, R. (2007). How much is too much? The influence of preschool centers on children's social and cognitive development. Economics of Education Review, 26(1), 52–66.
Marion, S., & Pellegrino, J. (2006). A validity framework for evaluating the technical quality of alternate assessments. Educational Measurement: Issues and Practice, 25(4), 47–57.
Mashburn, A. J., & Henry, G. T. (2004). Assessing school readiness: Validity and bias in preschool and kindergarten teachers' ratings. Educational Measurement: Issues and Practice, 23(4), 16–30.
Mehrens, W. (2002). Consequences of assessment: What is the evidence? In G. Tindal & T. Haladyna (Eds.), Large-scale assessment programs for all students: Validity, technical adequacy, and implementation. Mahwah: Lawrence Erlbaum Associates.
Meisels, S. J. (1996). Performance in context: Assessing children's achievement at the outset of school. In A. J. Sameroff & M. M. Haith (Eds.), The five to seven year shift: The age of reason and responsibility (pp. 410–431). Chicago, IL: University of Chicago Press.
Meisels, S. J. (2007). Accountability in early childhood: No easy answers. In R. C. Pianta, M. J. Cox, & K. Snow (Eds.), School readiness, early learning, and the transition to kindergarten (pp. 31–48). Baltimore: Paul H. Brookes.
Meisels, S. J., Liaw, F., Dorfman, A., & Nelson, R. F. (1995). The Work Sampling System: Reliability and validity of a performance assessment for young children. Early Childhood Research Quarterly, 10, 277–296.
Meisels, S. J., Wen, X., & Beachy-Quick, K. (2010). Authentic assessment for infants and toddlers: Exploring the reliability and validity of the Ounce Scale. Applied Developmental Science, 14, 55–71.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13–23.
Mislevy, R. J., & Haertel, G. D. (2006). Implications of evidence-centered design for educational testing. Educational Measurement: Issues and Practice, 25(4), 6–20.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). Focus article: On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3–62.
Myers, R. G. (2004). In search of quality programmes of early childhood care and education. Background paper for Education for All, Global Monitoring Report 2005. Paris, France: UNESCO. Retrieved from www.unesco.org/education/gmrdownload/references 2005.pdf.

National Research Council. (2008). Early childhood assessment: what, why, and how. Washington, DC: National Academies Press.
Nelson, K. (Ed.). (1998). Principles and recommendations for childhood assessments. DIANE Publishing.
Neuman, S. B., & Dickinson, D. K. (Eds.). (2001). Handbook of early childhood literacy research. New York: Guilford.
Powell, D. R., Son, S., File, N., & San Juan, R. R. (2010). Parent-school relationships and children’s academic and social outcomes in public school pre-kindergarten. Journal of School Psychology, 48(4), 269–292.
Rathburn, A., & West, J. (2004). From kindergarten through third grade: children’s beginning school experiences. Washington, DC: National Center for Education Statistics. Retrieved from http://nces.ed.gov/pubsearch/pubsinfo.asp?pubid2004007.
Reardon, S. F. (2003). Sources of educational inequality: the growth of racial/ethnic and socioeconomic test score gaps in kindergarten and first grade (Working Paper 03-05R). University Park: The Pennsylvania State University, Population Research Institute.
Reckase, M. (1998). Consequential validity from the test developer’s perspective. Educational Measurement: Issues and Practice, 17(2), 13–16.
Rumberger, R. W., & Arellano, B. (2004). Understanding and addressing the Latino achievement gap in California (Working Paper 2004-01). Berkeley, CA: UC Latino Policy Institute.
Schafer, W. D., Wang, J., & Wang, V. (2009). Validity in action: state assessment validity evidence for compliance with NCLB. In R. Lissitz (Ed.), The concept of validity: revisions, new directions and applications (pp. 173–193). Charlotte: Information Age Publishing Inc.
Shaywitz, S. E., Fletcher, J. M., Holahan, J. M., Shneider, A. E., Marchione, K. E., Stuebing, K. K., & Shaywitz, B. A. (1999). Persistence of dyslexia: the Connecticut longitudinal study at adolescence. Pediatrics, 104, 1351–1359.
Silburn, S., Brinkman, S., Sayers, M., Goldfeld, S., & Oberklaid, F. (2007). Establishing the construct and predictive validity of the Australian Early Development Index (AEDI). Early Human Development, 83(1), S125.
Sireci, S. G. (2009). Packing and unpacking sources of validity evidence. In R. Lissitz (Ed.), The concept of validity: revisions, new directions and applications (p. 19). Charlotte: Information Age Publishing Inc.
Stiggins, R. J. (1999). Evaluating classroom assessment training in teacher education programs. Educational Measurement: Issues and Practice, 18(1), 23–27.
Sylva, K., Siraj-Blatchford, I., & Taggart, B. (2003). Assessing quality in the early years: Early Childhood Environment Rating Scale-Extension (ECERS-E): four curricular subscales. Stoke-on-Trent: Trentham Books.
Sylva, K., Melhuish, E. C., Sammons, P., Siraj, I., & Taggart, B. (2004). The Effective Provision of Pre-School Education (EPPE) Project Technical Paper 12, the final report: effective pre-school education. London: DfES/Institute of Education, University of London.
Sylva, K., Siraj-Blatchford, I., Taggart, B., Sammons, P., Melhuish, E., Elliot, K., & Totsika, V. (2006). Capturing quality in early childhood through environmental rating scales. Early Childhood Research Quarterly, 21(1), 76–92.
Tach, L. M., & Farkas, G. (2006). Learning-related behaviors, cognitive skills, and ability grouping when schooling begins. Social Science Research, 35(4), 1048–1079.
U.S. Department of Education. (2011a, October 20). 35 states, D.C. and Puerto Rico submit applications for the Race to the Top-Early Learning Challenge. Retrieved from https://www.ed.gov/news/press-releases/35-states-dc-and-puerto-rico-submit-applications-race-top-early-learning-challenge.
U.S. Department of Education. (2011b). Race to the Top - Early Learning Challenge application for initial funding: CFDA Number: 84.412. Retrieved from http://www2.ed.gov/programs/racetothetopearlylearningchallenge/2011-412.doc.
U.S. Department of Education. (2013, May 23). Applications for new awards: Enhanced Assessment Instruments Grants Program-Enhanced Assessment Instruments-Kindergarten Entry Assessment Competition. Retrieved from https://www.federalregister.gov/articles/2013/05/23/2013-12212/applications-for-newawards-ehanced-assessment-instruments-grants-program-enhanced-assessment.
U.S. Department of Education. (2015). Kindergarten entry assessments in RTT-ELC grantee states. Retrieved from https://elc.grads360.org/services/PDCService.svc/GetPDCDocumentFile?fileId=10126.
U.S. Department of Health and Human Services. (2011). Minimum preservice qualifications and annual ongoing training hours for center teaching roles in 2011. Fairfax, VA: National Center on Child Care Quality Improvement. Retrieved from https://childcare.gov/sites/default/files/542_1305_qualstchmst_2011.pdf.
Volante, L., & Fazio, X. (2007). Exploring teacher candidates’ assessment literacy: implications for teacher education reform and professional development. Canadian Journal of Education, 30(3), 749–770.

Wesley, P. W., & Buysse, V. (2003). Making meaning of school readiness in schools and communities. Early Childhood Research Quarterly, 18(3), 351–375.
West, J., Denton, K., & Germino-Hausken, E. (2001a). America’s kindergartners: findings from the Early Childhood Longitudinal Study, kindergarten class of 1998–99. Washington, DC: National Center for Education Statistics.
West, J., Denton, K., & Reaney, L. (2001b). The kindergarten year (NCES 2001-023). Washington, DC: National Center for Education Statistics.
Zill, N., & West, J. (2001). Findings from the condition of education 2000: entering kindergarten. Washington, DC: National Center for Education Statistics.
Zill, N., Collins, M., West, J., & Hausken, E. G. (1995). Approaching kindergarten: a look at preschoolers in the United States. U.S. Department of Education, Office of Educational Research and Improvement, National Center for Education Statistics.