
To cite this article: Nicholas D. Myers, Deborah L. Feltz, & Edward W. Wolfe (2008). A Confirmatory Study of Rating Scale Category Effectiveness for the Coaching Efficacy Scale. Research Quarterly for Exercise and Sport, 79(3), 300–311. To link to this article: http://dx.doi.org/10.1080/02701367.2008.10599493


Measurement and Evaluation

Research Quarterly for Exercise and Sport
©2008 by the American Alliance for Health, Physical Education, Recreation and Dance
Vol. 79, No. 3, pp. 300–311

A Confirmatory Study of Rating Scale Category Effectiveness for the Coaching Efficacy Scale


Nicholas D. Myers, Deborah L. Feltz, and Edward W. Wolfe

This study extended validity evidence for measures of coaching efficacy derived from the Coaching Efficacy Scale (CES) by testing the rating scale categorizations suggested in previous research. Previous research provided evidence for the effectiveness of a four-category (4-CAT) structure for high school and collegiate sports coaches; it also suggested that a five-category (5-CAT) structure may be effective for youth sports coaches, because they may be more likely to endorse categories on the lower end of the scale. Coaches of youth sports (N = 492) responded to the CES items with a 5-CAT structure. Across rating scale category effectiveness guidelines, 32 of 34 evidences (94%) provided support for this structure. Data were condensed to a 4-CAT structure by collapsing responses in Category 1 (CAT-1) and Category 2 (CAT-2). Across rating scale category effectiveness guidelines, 25 of 26 evidences (96%) provided support for this structure. Findings provided confirmatory, cross-validation evidence for both the 5-CAT and 4-CAT structures. For empirical, theoretical, and practical reasons, the authors concluded that the 4-CAT structure was preferable to the 5-CAT when CES items are used to measure coaching efficacy. This conclusion is based on the findings of this confirmatory study and the more exploratory findings of Myers, Wolfe, and Feltz (2005).

Key words: coaches of youth sport, Rasch model, validity

Submitted: May 5, 2006. Accepted: August 13, 2007.

Nicholas D. Myers is with the Department of Educational and Psychological Studies at the University of Miami. Deborah L. Feltz is with the Department of Kinesiology at Michigan State University. Edward W. Wolfe is with the Department of Educational Leadership and Policy Studies at Virginia Polytechnic Institute and State University.

Validity refers to "the degree to which evidence and theory support the interpretations of scores entailed by proposed uses of tests" (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999, p. 9). Or, as Messick (1989) more fully explained in his treatise on a unified conceptualization of construct validity, "Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment" (p. 13). Within both conceptualizations, validity is a unified, although multifaceted, concept for which evidences need to be collected and evaluated across time and across purposes.

An important facet of validity evidence is how effectively an instrument's rating scale structure represents a construct, what Messick (1995) referred to as a substantive aspect of validity evidence. Respondents do not always use a rating scale in the manner intended by its constructor. Determining how a rating scale functions is important, because an effective structure increases the accuracy and precision of the resulting measures, the likelihood of measure stability, and related inferences for future samples (Linacre, 2002). "Rasch analysis provides an effective framework within which to verify, and perhaps improve, the functioning of rating scale categorization" (Linacre, 2004, p. 259). Data are fitted to a Rasch (1960) model, because select diagnostic statistics have proven useful both in determining an effective post hoc rating structure (Zhu & Kang, 1998; Zhu, Timm, & Ainsworth, 2001; Zhu, Updyke, & Lewandowski, 1997) and in cross-validating it in a confirmatory application (Zhu, 2002).


A four-step conceptual summary of this procedure is worthwhile, because application of the methodology in the sport and exercise sciences is relatively novel. This summary consolidates the work of Linacre (1999, 2002, 2004) and should not be viewed as new methodological work.


A Conceptual Four-Step Summary

Step 1. Items should be oriented in a consistent manner with the latent variable of interest. Simply, the rating scale should maintain its meaning across items that are influenced by the latent variable (e.g., 1 = the lowest amount and 5 = the highest amount of the latent variable across items). For example, if an item was negatively oriented, such that 5 indicated the lowest amount, and all of the other items were positively oriented, responses to this negatively oriented item should be reoriented to reflect the shared positive orientation of the other items (i.e., 5 reoriented to 1, 4 reoriented to 2, etc.); a sketch of this recoding appears after this summary.

Step 2. The data are fit to a Rasch model from the family of one-parameter item response theory (IRT) measurement models. IRT is well suited to analyzing rating scale data (Wright & Masters, 1982). Typically, the data are fit to the Rasch rating scale model and not a more complex Rasch model (e.g., the partial credit model) because, in part, the same rating scale structure is imposed across items. A set of items that do not share a constant rating scale would require a more complex model.

Step 3. Eight guidelines (presented in the Method section) are used to judge the empirical effectiveness of the rating scale structure in the sample. It is important to note that not all guidelines will necessarily be relevant for every application. Because of the guidelines' unique focuses, they may occasionally provide some conflicting information (similar to fit indexes in structural equation modeling). The guidelines were further classified by Linacre (2004) as either essential or helpful in regard to four areas: measure stability, measure accuracy, description of the sample, and inference for the next sample.

Step 4. If the original rating scale structure functions ineffectively, the data should be collapsed based on some general principles (Wright & Linacre, 1992) as well as the rating scale function in relation to the eight guidelines.1 The collapsed structure should (a) be interpretable, (b) not create additional modal categories, and (c) have face validity (i.e., it should take into account the person-measure distribution and the item difficulty distribution).

In summary, the goals of rating scale analyses are to determine (a) whether the original structure functions effectively and, if not, (b) a post hoc rating structure that is conceptually defensible and produces estimates with improved psychometric characteristics. When previous research has provided evidence for the ineffectiveness of the original rating scale and a more effective post hoc structure, the post hoc structure identified in the previous study should be tested to provide confirmatory, cross-validation evidence for this scale.
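As a minimal sketch of the Step 1 recoding, assuming a hypothetical negatively oriented item on a 1–5 scale (the data are illustrative, not from the CES):

```python
import numpy as np

# Reorienting a hypothetical negatively oriented item on a 1-5 scale
# (Step 1): 5 -> 1, 4 -> 2, 3 -> 3, 2 -> 4, 1 -> 5.
responses = np.array([5, 4, 4, 2, 1, 3])
reoriented = 6 - responses  # (highest category + 1) - response
print(reoriented)           # [1 2 2 4 5 3]
```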




Previous Rating Scale Effectiveness Research in Exercise and Sport Science

Zhu and his colleagues published several studies on rating scale effectiveness for responses to either psychomotor self-efficacy items (Zhu & Kang, 1998; Zhu et al., 1997) or exercise perseverance and barriers items (Zhu, 2002; Zhu et al., 2001). In the first set of studies, a post hoc procedure was used to determine an optimum condensed rating scale structure, where the original structure consisted of 5-CAT (i.e., CAT-1, CAT-2, CAT-3, CAT-4, CAT-5).2 In both exploratory studies, the authors examined all possible combinations of collapsing adjacent categories into a 2-CAT, 3-CAT, and 4-CAT structure. This resulted in 14 post hoc categorizations in both studies (see Table 1). The authors then imposed a Rasch rating scale model and compared the model-data fit statistics, category statistics, parameter estimates, and separation statistics produced by each structure. In both studies, the authors determined that a 3-CAT structure (i.e., CAT-1, CAT-2, CAT-2, CAT-2, CAT-3 or CAT-1, CAT-2, CAT-2, CAT-3, CAT-3) was the optimal categorization.

In the second set of studies, an exploratory post hoc procedure was used in Study 1 to determine an optimum rating scale structure for exercise perseverance and barrier items (Zhu et al., 2001). This structure was imposed in a subsequent study (Study 2) to provide cross-validation evidence (Zhu, 2002). In Study 1, the authors used a mechanically thorough approach to examine all possible combinations of collapsing adjacent categories into a 3-CAT and 4-CAT structure; 2-CAT structures were not created, based on previous research (Zhu & Kang, 1998; Zhu et al., 1997). This process resulted in 10 post hoc categorizations (see Table 1). The authors compared the model-data fit statistics, category statistics, parameter estimates, and separation statistics produced by each structure and determined that a 3-CAT structure was the optimal categorization. In Study 2, a subsample of respondents from Study 1 responded to the same items with the suggested 3-CAT structure. The author provided evidence that the 3-CAT rating scale derived from the exploratory study (Study 1) functioned in a similarly effective way in the confirmatory study (Study 2).

It is important to note the following about the work by Zhu and colleagues. First, their seminal research provided a foundation to build on in applied measurement. Second, the criteria by which Zhu and his colleagues created post hoc categorizations can be described as "mechanical" (Zhu & Kang, 1998, p. 230); in this paper, the process is referred to as mechanically thorough.
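To make the mechanically thorough approach concrete, the sketch below enumerates every adjacent-category collapsing of a 5-CAT scale into 2, 3, or 4 categories; the function is our illustration of the enumeration logic, not the software used in those studies.

```python
from itertools import combinations

# A minimal sketch of the "mechanically thorough" enumeration: every way
# to collapse adjacent categories of an original 5-CAT scale (1,2,3,4,5)
# into a target number of ordered categories.
def adjacent_collapsings(n_original, n_target):
    """Yield relabelings such as (1, 2, 2, 2, 3) for 5 -> 3 categories."""
    # Choose where the n_target - 1 boundaries fall among the
    # n_original - 1 gaps between adjacent categories.
    for cuts in combinations(range(1, n_original), n_target - 1):
        labels, group = [], 1
        for i in range(n_original):
            if i in cuts:
                group += 1
            labels.append(group)
        yield tuple(labels)

# The 14 post hoc categorizations examined in the first set of studies:
# four 2-CAT, six 3-CAT, and four 4-CAT structures.
structures = [s for k in (2, 3, 4) for s in adjacent_collapsings(5, k)]
print(len(structures))   # 14
print(structures[4:10])  # the six 3-CAT structures, e.g., (1, 1, 1, 2, 3)
```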


Limitations of this approach include that (a) it does not fully respect the general principles for collapsing categories proposed by Wright and Linacre (1992), and (b) it creates a large set of post hoc categorizations that are then compared by a process that may be unlikely to objectively detect meaningful differences between the functioning of two or more post hoc categorizations.

Table 1. Previous post hoc categorizations in exercise and sport science

Studies                           Original categorization        Post hoc categorization(s)
Zhu and Kang (1998);              1,2,3,4,5 (5-CAT)              1,1,1,1,2 (2-CAT); 1,1,1,2,2 (2-CAT);
Zhu, Updyke, and                                                 1,1,2,2,2 (2-CAT); 1,2,2,2,2 (2-CAT);
Lewandowski (1997)                                               1,1,1,2,3 (3-CAT); 1,1,2,3,3 (3-CAT);
                                                                 1,1,2,2,3 (3-CAT); 1,2,2,2,3 (3-CAT)**;
                                                                 1,2,2,3,3 (3-CAT); 1,2,3,3,3 (3-CAT);
                                                                 1,1,2,3,4 (4-CAT); 1,2,2,3,4 (4-CAT);
                                                                 1,2,3,3,4 (4-CAT); 1,2,3,4,4 (4-CAT)
Zhu, Timm, and                    1,2,3,4,5 (5-CAT)              1,1,1,2,3 (3-CAT); 1,1,2,3,3 (3-CAT);
Ainsworth (2001)                                                 1,1,2,2,3 (3-CAT)*; 1,2,2,2,3 (3-CAT);
                                                                 1,2,2,3,3 (3-CAT); 1,2,3,3,3 (3-CAT);
                                                                 1,1,2,3,4 (4-CAT); 1,2,2,3,4 (4-CAT);
                                                                 1,2,3,3,4 (4-CAT); 1,2,3,4,4 (4-CAT)
Zhu (2002)                        1,2,3 (3-CAT)*                 none
Myers, Wolfe, and Feltz (2005)    0,1,2,3,4,5,6,7,8,9 (10-CAT)   4,4,4,4,4,5,6,7,8,9 (6-CAT);
                                                                 5,5,5,5,5,5,6,7,8,9 (5-CAT);
                                                                 6,6,6,6,6,6,6,7,8,9 (4-CAT)*;
                                                                 7,7,7,7,7,7,7,7,8,9 (3-CAT);
                                                                 8,8,8,8,8,8,8,8,8,9 (2-CAT)
Current study                     1,2,3,4,5 (5-CAT)              2,2,3,4,5 (4-CAT)*

Note. 2-CAT = two-category structure; 3-CAT = three-category structure; 4-CAT = four-category structure; 5-CAT = five-category structure; 6-CAT = six-category structure; 10-CAT = ten-category structure; ** = selected categorization in both studies; * = selected categorization in relevant study.


This can lead the researcher to arbitrarily declare one post hoc categorization as "optimal." Third, the criteria used to determine an optimal rating scale structure were similar to, but not entirely the same as, the guidelines provided by Linacre (1999, 2002, 2004). A primary difference is that in the former approach there is a direct and primary role for person and item separation indexes as tools to select an optimum post hoc categorization (Zhu & Kang, 1998; Zhu et al., 2001; Zhu et al., 1997). When comparing the four final post hoc structures, Zhu et al. (2001) wrote, "Because the 11223 categorization had the highest item and person separations, it was chosen as the optimal categorization" (p. 110). The relevant item and person separations for the four categorizations were 9.06 and 1.98 for 11233, 10.57 and 2.62 for 11223, 10.31 and 2.59 for 12223, and 9.08 and 1.99 for 12233. It is likely that the differences in the separation statistics between 11223 and 12223 were trivial. All four post hoc categorizations suggested a 3-CAT structure. This is important because, as Zhu et al. (1997) noted, person and item separation indexes are expected to decrease somewhat when categories are collapsed; in general, the more categories, the better the discrimination. Linacre (2004) provided no explicit guideline that the categorization with the greatest person and item separations should be selected.

It is worth commenting on the use of separation indexes in evaluating various rating scale structures, but because the separation indexes are not widely understood, a more general introduction is provided. Traditionally, reliability has been depicted as the ratio of true variance to observed variance. For the Rasch model, these values are designated as the variances of the latent and estimated person measures, V(Latent)/V(Observed), referred to as person separation reliability, and indicate the degree to which the person measures are error free. Because the variance of the latent person measures cannot be observed, reliability is estimated as 1 − [V(Error)/V(Observed)], which is statistically equivalent to V(Latent)/V(Observed), where error variance equals the average of the squared errors associated with the estimated person measures. A similar index can be computed to depict the degree to which the item measures are error free, substituting the average of the squared errors associated with the estimated item calibrations for V(Error) and the variance of the estimated item calibrations for V(Observed). Person and item separation indexes are closely related to the concepts of person and item separation reliability. Specifically, the separation index equals the ratio of the standard deviations of the latent and error distributions; for example, person separation = SD(Latent)/SD(Error). Hence, separation reliability can be expressed as a function of the separation statistic: separation reliability = separation²/(1 + separation²) (Linacre, 2005).
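As a concrete illustration, the sketch below computes person separation and person separation reliability from a hypothetical set of estimated person measures and standard errors; the numbers are illustrative, not from this study.

```python
import numpy as np

# Hypothetical person measures (theta-hat, in logits) and their standard
# errors from a Rasch calibration; the values are illustrative only.
theta_hat = np.array([-1.2, 0.3, 0.8, 1.5, 2.1, 2.9, 3.4])
se = np.array([0.40, 0.35, 0.33, 0.34, 0.36, 0.41, 0.48])

v_observed = np.var(theta_hat, ddof=1)  # variance of estimated measures
v_error = np.mean(se ** 2)              # average squared standard error
v_latent = v_observed - v_error         # implied "true" (latent) variance

# Person separation reliability: 1 - V(Error)/V(Observed) = V(Latent)/V(Observed)
reliability = 1 - v_error / v_observed

# Person separation index: SD(Latent)/SD(Error)
separation = np.sqrt(v_latent) / np.sqrt(v_error)

# The identity stated above: reliability = separation^2 / (1 + separation^2)
assert np.isclose(reliability, separation**2 / (1 + separation**2))
print(round(float(reliability), 2), round(float(separation), 2))
```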


Person separation and person separation reliability are important for most measurement decisions, because most decisions draw inferences about persons. Hence, the person separation index and the person separation reliability index should be important indicators of effective rating scale structures. Considering the impact of rating scale optimization on the item separation and item separation reliability indexes may be less important for two reasons. First, we seldom seek to draw inferences about the relative difficulties of the items representing the underlying domain. When it is important to statistically distinguish between levels of item difficulty, increasing item separation reliability may be an important indicator of rating scale effectiveness. Second, because of the design process for most attitudinal instruments, item separation reliability statistics tend to approach their upper limits. This is because the process for most applications involves selecting items highly correlated with one another and then administering them to a relatively large number of persons; both practices minimize error variance. In addition, items tend to be selected to represent a wide range of difficulties, a practice that maximizes true variance. Hence, in our view, although person separation reliability is an important characteristic of the person measures and should be included as a criterion for evaluating rating scale effectiveness, it is not clear whether item separation reliability is an equally important criterion for most applications.

In closing, because collapsing rating scale categories typically results in a reduction of statistical information (i.e., latent variance), this practice often results in a reduction of both the separation indexes and the reliability of separation indexes (Zhu et al., 1997). Therefore, when evaluating rating scale category effectiveness for structures with a different number of categories (e.g., one structure has collapsed categories), an analyst should not be guided only by comparing the values of the separation indexes, but rather by the relative decrease in the separation indexes in relation to the change in the other indicators.

Previous Data From the Coaching Efficacy Scale

The Coaching Efficacy Scale (CES; Feltz, Chase, Moritz, & Sullivan, 1999) was developed to measure a coach's belief in his or her ability to influence athletes' learning and performance. The original CES used a 10-CAT structure with endpoints ranging from 0 (not at all confident) to 9 (extremely confident). Myers, Wolfe, and Feltz (2005) provided evidence, with data from high school and college coaches, for the ineffectiveness of the original rating scale structure and the effectiveness of a 4-CAT structure derived from post hoc analyses. They also noted that (a) the 4-CAT structure should be tested in a new sample to provide cross-validation evidence and (b) coaches of youth sports may benefit from a 5-CAT structure, because they are more likely to endorse categories on the lower end of the scale (Lee, Malete, & Feltz, 2002).




The methodology used by Myers, Wolfe, et al. (2005) to identify a more effective structure for the CES acknowledged the work of Zhu and colleagues (Zhu & Kang, 1998; Zhu et al., 2001; Zhu et al., 1997) but applied a slightly different process in creating post hoc categorizations. Their process was less mechanically thorough than that of Zhu and colleagues. Specifically, they noted that within the 10-CAT rating structure most invalidity evidences came from the lower half of the original scale and, therefore, produced only five post hoc structures that collapsed categories from the bottom up (see Table 1). Both practical and theoretical concerns drove this approach. Practically, Myers, Wolfe, et al. had an original rating scale with far too many categories, and evaluating all possible post hoc categorizations was unreasonable within the context of their study. Another practical advantage was that in most post hoc categorizations the original information in the upper part of the rating scale (e.g., CAT-7, CAT-8, and CAT-9), which respondents used frequently and systematically, was unaffected by the post hoc manipulation of the data.

Theoretically, the process used by Myers, Wolfe, et al. (2005) to create post hoc categorizations more fully incorporated the general principles for collapsing categories proposed by Wright and Linacre (1992) than did the mechanically thorough approach imposed by Zhu and colleagues (Zhu & Kang, 1998; Zhu et al., 2001; Zhu et al., 1997). Specifically, in the Myers, Wolfe, et al. study, most collapsed structures were interpretable, rarely created an additional modal category, and had a reasonable degree of face validity. The collapsed structures were also consistent with recommendations regarding self-efficacy measurement in sport, which note that responses to efficacy items typically indicate at least moderate confidence (Feltz & Chase, 1998). Because coaches rarely endorse categories representing less than moderate confidence, there is unlikely to be a need for many categories representing degrees of less than moderate confidence (e.g., CAT-0 through CAT-3 on the previous 10-CAT structure).

Research Questions

The purpose of this study was to extend validity evidence for coaching efficacy measures derived from the CES by testing the rating scale categorizations suggested in previous research (Myers, Wolfe, et al., 2005). The research questions for this study were:

1. Do youth sport coaches effectively use a 5-CAT rating scale structure (CAT-1, CAT-2, CAT-3, CAT-4, CAT-5) when responding to the items that define the CES?
2. Do youth sport coaches effectively use a 4-CAT rating scale structure (CAT-1 & 2, CAT-3, CAT-4, CAT-5) when responding to the items that define the CES?


Method


Sample

Data for this study were part of a larger project conducted on youth sport coaches (Roman, 2006),3 who were defined as individuals who coach children between the ages of 7 and 12 years. Participants were recruited from clinics designed for beginning and intermediate level coaches. Participation in this study was requested during the welcome statement of the coaching clinics. Participants (N = 492) were assistant coaches (n = 338) and head coaches (n = 150) of youth sports from the midwestern United States.4 Most participants were men (n = 463), were Caucasian (n = 459), had earned at least a bachelor's degree (n = 251), and were unpaid coaches (n = 462) of boys' sports (n = 437). Ethnicities other than Caucasian included African American (n = 15), Native North American Indian (n = 5), Asian (n = 3), Hispanic (n = 2), and other (n = 6). Reported age ranged from 15 to 65 years (M = 37.70, SD = 8.33), and reported years of coaching experience ranged from 0 to 32 (M = 3.51, SD = 4.12). The sports participants reported coaching included ice hockey (n = 387), basketball (n = 69), soccer (n = 16), volleyball (n = 13), football (n = 5), softball (n = 1), and other (n = 1).

Procedure

All necessary permissions were obtained prior to data collection, including permission from the relevant institutional review board. A graduate student associated with the project explained the study to each coach. All respondents provided informed consent. Coaches were guaranteed confidentiality for their responses. They completed questionnaires individually and returned them to a graduate student associated with the project.

Missing Data

Across cases and items, data were observed for 11,777 of 11,808 possible observations. There was no discernable pattern to the missing data (e.g., 17 of 24 items were missing at least one observation). Approximately 5% of participants (n = 26) were missing data, mostly for one item only (range = 1–3 missing observations). Thus, incomplete cases provided responses to most (88–96%) of the items. Unobserved data were judged to be randomly missing and were handled with joint maximum likelihood estimation in Winsteps (Wright & Linacre, 1998).

Effectiveness of the Rating Scale Structures

The effectiveness of both the 5-CAT and 4-CAT structures was evaluated based on the work of Linacre (1999, 2002, 2004), Zhu and colleagues (Zhu, 2002; Zhu & Kang, 1998; Zhu et al., 2001; Zhu et al., 1997), and Myers, Wolfe, et al. (2005).


Combining these works resulted in a four-step evaluation process.

Step 1. All CES items are positively oriented. Point-measure correlations (i.e., between raw responses to a particular item and the ability estimates of those responding to that item) were examined to confirm that responses varied in expected ways. It was necessary to clarify the latent variable of interest, because there is evidence for a multidimensional conceptualization of coaching efficacy, in which motivation, game strategy, technique, and character building define the efficacies of interest (Feltz et al., 1999; Lee et al., 2002; Myers, Wolfe, et al., 2005), and for a unidimensional conceptualization of total coaching efficacy (TCE; Feltz et al., 1999; Myers, Wolfe, et al., 2005). Because it is typical in rating scale effectiveness research to assume a unidimensional model even when there is evidence for a multidimensional model (Myers, Wolfe, et al., 2005; Zhu, 2002; Zhu & Kang, 1998; Zhu et al., 2001; Zhu et al., 1997), a unidimensional model was used in this study. Imposing a unidimensional model was reasonable because the subscales have repeatedly been shown to be moderately to highly related to one another (Feltz et al., 1999; Myers, Wolfe, et al., 2005), (a) the number of items within subscales is limited (range = 4–7), (b) TCE scores appear in the literature (Feltz et al., 1999; Malete & Feltz, 2000; Myers, Vargas-Tonsing, & Feltz, 2005), and (c) there are practical advantages to having a single rating scale structure across items. Item 7 (demonstrate the skills of my sport) was dropped, because it exhibited misfit to a unidimensional model (Myers, Wolfe, et al., 2005).

Step 2. Data were calibrated to the Rasch rating scale model (RSM; Andrich, 1978) using Winsteps (Wright & Linacre, 1998).5 The RSM describes the probability that a specific coach (n) rates a particular item (i) using a specific rating scale category (k), conditioned on the coach's efficacy (θ_n) and the item's difficulty (δ_i). The log-odds equation for this probability, ln(P_k/P_(k−1)) = θ_n − δ_i − τ_k, contains three parameters: θ_n, δ_i, and the category threshold (τ_k), the threshold between two adjacent rating scale categories. Parameters of this model were estimated from the observed data via joint maximum likelihood, which does not impose distributional assumptions on the parameter estimates (Wright & Masters, 1982). Calibration of the data to the RSM resulted in a θ estimate for each coach, a δ estimate for each item, a set of τ estimates across items, and a standard error (SE) for each estimate. Estimates were on a single linear continuum in logistic ratio units (logits). A logit is the natural logarithm of the odds of an event. Because the data in this study were polytomous, the odds were defined as the probability of a rating in one category versus the probability of a rating in the next lower category.
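For illustration, the following is a minimal sketch of this log-odds model, not the authors' Winsteps calibration. The θ and δ values are hypothetical; the thresholds are the 4-CAT estimates reported later in this article (Table 4).

```python
import numpy as np

def rsm_category_probs(theta, delta, taus):
    """P(X = k), k = 0..K, under the RSM log-odds model
    ln(P_k / P_(k-1)) = theta - delta - tau_k."""
    logits = theta - delta - np.asarray(taus)
    # Unnormalized log-probabilities: the lowest category is the
    # reference (0.0); each higher category adds the next adjacent-
    # category logit, so log_num[k] is the cumulative sum through tau_k.
    log_num = np.concatenate(([0.0], np.cumsum(logits)))
    num = np.exp(log_num - log_num.max())  # stabilize before normalizing
    return num / num.sum()

# A hypothetical coach 1 logit above an item's difficulty, with the
# 4-CAT threshold estimates reported in Table 4 of this article.
probs = rsm_category_probs(theta=1.0, delta=0.0, taus=[-2.77, -0.13, 2.90])
print(probs.round(3))  # probabilities for the four ordered categories
```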


For identification purposes, mean δ was constrained to equal zero. In general terms, negative δ estimates indicated an item was easy to endorse, positive δ estimates indicated an item was difficult to endorse, and a positive θ estimate indicated a coach was more efficacious than a coach with a negative θ estimate.

Item-level fit to the measurement model was explored for both conceptualizations prior to the rating scale analyses. Both the weighted mean-square residual (WMS, or Infit) and the unweighted mean-square residual (UWMS, or Outfit) statistics were considered for each item. The WMS statistic is sensitive to unexpected responses from pairings of coaches and items with similar θ and δ estimates (e.g., a highly efficacious coach selects CAT-1 for an item that is easy to endorse). The UWMS statistic is sensitive to unexpected responses from pairings of coaches and items with dissimilar θ and δ estimates (e.g., an inefficacious coach selects CAT-5 for an item that is difficult to endorse). For both indexes, values range from 0.00 to ∞ with an expected value of 1.00. Rating scale items with fit values from 0.60 to 1.40 illustrate adequate fit to the measurement model (Wright & Linacre, 1994). It is noted that Zhu and colleagues (Zhu, 2002; Zhu & Kang, 1998; Zhu et al., 2001; Zhu et al., 1997) invoked a heuristic range of 0.70–1.30 to indicate adequate fit, whereas Myers, Wolfe, et al. (2005) invoked a heuristic range of 0.60–1.40. Therefore, items with fit statistics of 0.60–0.70 or 1.30–1.40 were described as exhibiting marginal fit.

Step 3a. The effectiveness of both rating scale structures was evaluated based on eight guidelines (Linacre, 2004). Summaries are included in this paper, as well as the number of possible validity evidences within each guideline for both conceptualizations. The first four guidelines describe this sample's use of the rating scale. The first guideline (A) is that all categories should have at least 10 observations, which increases the likelihood of reasonably precise and stable τ estimates; this was important in this study because of previous problems in this area (Myers, Wolfe, et al., 2005). Within this guideline were five pieces and four pieces of validity evidence in the 5-CAT and 4-CAT structures, respectively. The second guideline (B) is that the frequency distribution of ratings is regular (e.g., unimodal). Within this guideline was one piece of validity evidence in the 5-CAT and 4-CAT structures, respectively. The third guideline (C) is that average measures advance monotonically with the rating scale categories. Within this guideline were four pieces and three pieces of validity evidence in the 5-CAT and 4-CAT structures, respectively. The fourth guideline (D) is that the UWMS for each category is less than 2.00. Within this guideline were five pieces and four pieces of validity evidence in the 5-CAT and 4-CAT structures, respectively.
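Guidelines A through D lend themselves to simple mechanical checks. The sketch below applies them to the 5-CAT values reported later in Table 3; the checking logic is ours, not Linacre's, and a real analysis would read these values from calibration output.

```python
import numpy as np

# Checking Guidelines A-D for the 5-CAT structure, using the values
# reported in Table 3 of this article.
counts = np.array([22, 383, 2912, 5297, 2380])        # observations per CAT
avg_measures = np.array([-2.53, -0.08, 1.02, 2.45, 4.34])
uwms = np.array([0.59, 1.04, 1.07, 1.01, 0.99])

peak = counts.argmax()
checks = {
    "A: >= 10 observations in each CAT": bool((counts >= 10).all()),
    "B: regular (unimodal) distribution": bool(
        (np.diff(counts[: peak + 1]) >= 0).all()
        and (np.diff(counts[peak:]) <= 0).all()
    ),
    "C: average measures advance": bool((np.diff(avg_measures) > 0).all()),
    "D: UWMS < 2.0 for each CAT": bool((uwms < 2.0).all()),
}
for guideline, met in checks.items():
    print(f"{guideline}: {'met' if met else 'not met'}")
```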




The next four guidelines are more concerned with parameter estimates than the previous four. The fifth guideline (E) is that the τ estimates advance with the categories. Within this guideline were three pieces and two pieces of validity evidence in the 5-CAT and 4-CAT structures, respectively. The sixth guideline is that the ratings imply the measures (F; coherence > 39%) and that the measures imply the ratings (G; coherence > 39%). Coherence refers to the degree to which observed ratings match modeled expectations for a particular category and vice versa (for a fuller description, see Myers, Wolfe, Maier, Feltz, & Reckase, 2006). It is important to delimit Guideline F, because meeting it is most important in clinical settings where "action is often based on a single observation" (Linacre, 2004, p. 271). Because the CES is not intended, and has never been used, for such purposes (i.e., predicting a coach's total efficacy based on his or her response to a single item), meeting this guideline was viewed as helpful but unnecessary for achieving an effective rating scale structure. Meeting Guideline G, on the other hand, was viewed as necessary for achieving an effective structure, because the prediction is based on the full measure.6 Within this guideline (F and G combined) were 10 pieces and eight pieces of validity evidence in the 5-CAT and 4-CAT structures, respectively. The seventh guideline (H) is that the distances between adjacent τ estimates advance by at least 0.98 logits (τ1 to τ2), 0.81 logits (τ2 to τ3), and 0.98 logits (τ3 to τ4) in the 5-CAT structure, and by at least 1.10 logits (τ1 to τ2) and 1.10 logits (τ2 to τ3) in the 4-CAT structure (Huynh, 1994). Within this guideline were three pieces and two pieces of validity evidence in the 5-CAT and 4-CAT structures, respectively. The eighth guideline (I) is that the distance between adjacent τ estimates advances by less than 5.00 logits. Within this guideline were three pieces and two pieces of validity evidence in the 5-CAT and 4-CAT structures, respectively. In sum, there were 34 pieces and 26 pieces of possible validity evidence in the 5-CAT and 4-CAT structures, respectively.
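Guidelines E, H, and I reduce to simple checks on the estimated thresholds. The sketch below applies them to the 5-CAT threshold estimates reported later in Table 3, with the Huynh (1994) minimum gaps stated above.

```python
import numpy as np

# Checking Guidelines E, H, and I for the 5-CAT structure, using the
# threshold estimates reported in Table 3 (in logits).
taus = np.array([-3.66, -1.58, 1.10, 4.14])   # tau_1..tau_4
gaps = np.diff(taus)                           # advances between thresholds
huynh_minimums = np.array([0.98, 0.81, 0.98])  # minimum gaps (Huynh, 1994)

print("E: thresholds advance:", bool((gaps > 0).all()))               # True
print("H: gaps meet minimums:", bool((gaps >= huynh_minimums).all())) # True
print("I: gaps under 5 logits:", bool((gaps < 5.0).all()))            # True
```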


Step 3b. Step 3a was repeated for head coaches and assistant coaches separately to address the implicit assumption in Step 3a that the rating scale categorization structures functioned similarly for the two subgroups. It was noted, however, that the relatively small sample of head coaches (n = 150), combined with the low percentage of responses in the first category across coaches (0.2%), could exacerbate problems at the low end of the scale.

Step 4. Due to the confirmatory nature of this study, both the 5-CAT and 4-CAT structures were determined a priori, congruent with the suggestions of Myers, Wolfe, et al. (2005). The 5-CAT structure consisted of 1 = no confidence, 2 = low confidence, 3 = moderate confidence, 4 = high confidence, and 5 = complete confidence. The 4-CAT structure collapsed the data in CAT-1 and CAT-2 while maintaining CAT-3, CAT-4, and CAT-5; a sketch of this recoding follows. The relative effectiveness of these two categorizations was a way to evaluate whether the 5-CAT structure, with CAT-1 representing no confidence, was necessary for coaches of youth sports, or whether the 4-CAT structure, without a category representing no confidence, was effective for coaches of youth sports as had been shown for high school and college coaches.
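As a minimal sketch of the Step 4 recoding, assuming hypothetical 5-CAT responses:

```python
import numpy as np

# Collapsing the 5-CAT structure to the 4-CAT structure (Step 4):
# CAT-1 and CAT-2 are combined; CAT-3, CAT-4, and CAT-5 are kept.
responses_5cat = np.array([1, 2, 3, 4, 5, 4, 2])  # hypothetical data
responses_4cat = np.where(responses_5cat == 1, 2, responses_5cat)
print(responses_4cat)  # [2 2 3 4 5 4 2] -- the 2,2,3,4,5 recode in Table 1
```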

Results


Steps 1 and 2

Point-measure correlations were positive and ranged from .51 to .77 in both the 5-CAT and 4-CAT structures, providing empirical evidence for the expected covariance between item responses and person measures and for the positive orientation of the items. As depicted in Table 2, in both conceptualizations, 22 of 23 items exhibited acceptable fit to the measurement model. Item 14, coach individual athletes on technique, exhibited only marginal fit to the model in both the 5-CAT (WMS = 1.33, UWMS = 1.31) and 4-CAT (WMS = 1.32, UWMS = 1.30) conceptualizations. This item was retained because its fit was only marginal, there is previous evidence of its good fit to this model (Myers, Wolfe, et al., 2005), some of the observed misfit was likely attributable to the imposed unidimensional measurement model, and there are appropriate opportunities in youth sports for a coach to instruct athletes on appropriate technique.

Step 3a

As depicted in Tables 3 and 4, 32 of 34 evidences provided support for the 5-CAT structure, and 25 of 26 evidences provided support for the 4-CAT structure, respectively. The person and item separation indexes changed little from the 5-CAT structure (person separation = 3.42, item separation = 11.36) to the 4-CAT structure (person separation = 3.40, item separation = 11.30). Overall, both categorizations functioned effectively.

Positive validity evidence included meeting Guidelines A through I in both structures. Guideline A was met in both the 5-CAT (five pieces of positive validity evidence) and 4-CAT (four pieces of evidence) structures, as each category had at least 10 observations (range = 22–5,297 and 383–5,297, respectively). It was noted, however, that CAT-1 received only 0.2% of the possible observations in the 5-CAT structure. Guideline B was met in both the 5-CAT (one piece of evidence) and 4-CAT (one piece of evidence) structures, as each frequency distribution

Table 2. Item fit and item difficulty

                                                               5-CAT                    4-CAT
Item                                                           WMS   UWMS   d      SE   WMS   UWMS   d      SE
 1. Maintain confidence in your athletes?                      0.86  1.01  -0.21  .08   0.85  1.01  -0.21  .08
 2. Recognize opposing team's strengths during competition?    1.20  1.19   0.78  .08   1.20  1.19   0.78  .08
 3. Mentally prepare athletes for game/meet strategies?        0.91  0.93   1.25  .08   0.92  0.92   1.25  .08
 4. Understand competitive strategies?                         1.21  1.21   0.84  .08   1.22  1.21   0.85  .08
 5. Instill an attitude of good moral character?               1.05  1.11  -1.43  .08   1.05  1.11  -1.43  .09
 6. Build the self-esteem of your athletes?                    0.91  0.99  -1.10  .09   0.90  0.98  -1.09  .09
 8. Adapt to different game/meet situations?                   0.91  0.94   1.16  .08   0.94  0.96   1.17  .08
 9. Recognize opposing team's weakness during competition?     1.07  1.05   0.70  .08   1.07  1.06   0.69  .08
10. Motivate your athletes?                                    1.08  1.11  -0.20  .08   1.08  1.11  -0.19  .08
11. Make critical decisions during competition?                0.84  0.84   0.97  .08   0.84  0.85   0.96  .08
12. Build team cohesion?                                       0.94  0.96   0.19  .08   0.94  0.96   0.19  .08
13. Instill an attitude of fair play among your athletes?      1.14  1.28  -1.27  .09   1.14  1.27  -1.27  .09
14. Coach individual athletes on technique?                    1.33  1.31   0.39  .08   1.32  1.30   0.38  .08
15. Build the self-confidence of your athletes?                0.86  0.91  -0.46  .08   0.85  0.91  -0.46  .08
16. Develop athletes' abilities?                               0.72  0.71   0.41  .08   0.70  0.70   0.41  .08
17. Maximize your team's strengths during competition?         0.70  0.70   0.89  .08   0.71  0.71   0.89  .08
18. Recognize talent in athletes?                              1.03  1.01  -0.67  .08   1.02  1.01  -0.66  .08
19. Promote good sportsmanship?                                1.05  1.16  -2.08  .09   1.04  1.16  -2.07  .09
20. Detect skill errors?                                       1.03  1.03   0.63  .08   1.04  1.03   0.63  .08
21. Adjust your game/meet strategy to fit your team's talent?  0.89  0.89   1.04  .08   0.90  0.90   1.04  .08
22. Teach the skills of your sport?                            1.22  1.20   0.30  .08   1.21  1.18   0.30  .08
23. Build team confidence?                                     0.79  0.78  -0.51  .08   0.78  0.77  -0.51  .08
24. Instill an attitude of respect for others?                 1.06  1.09  -1.64  .08   1.05  1.09  -1.63  .09

Note. 5-CAT = five-category structure; 4-CAT = four-category structure; WMS = weighted mean-square residual; UWMS = unweighted mean-square residual; d = item difficulty; SE = standard error.



Table 3. Guidelines for increasing rating scale category effectiveness: 5-CAT structure

Guideline                                          CAT   Total coaching efficacy   Validity evidence
A. ≥ 10 observations in each CAT                   1     22 (0.2%)                 PVE
                                                   2     383 (3.5%)                PVE
                                                   3     2,912 (26.5%)             PVE
                                                   4     5,297 (48.2%)             PVE
                                                   5     2,380 (21.6%)             PVE
B. Regular observation distribution                1–5   unimodal                  PVE
C. Average measures increase across CATs           1     -2.53                     —
                                                   2     -0.08                     PVE
                                                   3     1.02                      PVE
                                                   4     2.45                      PVE
                                                   5     4.34                      PVE
D. UWMS < 2.0 for each CAT                         1     0.59                      PVE
                                                   2     1.04                      PVE
                                                   3     1.07                      PVE
                                                   4     1.01                      PVE
                                                   5     0.99                      PVE
E. Thresholds advance across CATs                  1     —                         —
                                                   2     -3.66 (.23)               —
                                                   3     -1.58 (.06)               PVE
                                                   4     1.10 (.03)                PVE
                                                   5     4.14 (.03)                PVE
H. Distance between adjacent thresholds            1     —                         —
   meets minimum standard                          2     -3.66 (.23)               —
                                                   3     -1.58 (.06)               PVE
                                                   4     1.10 (.03)                PVE
                                                   5     4.14 (.03)                PVE
I. Distance between adjacent thresholds            1     —                         —
   does not exceed 5.0 logits                      2     -3.66 (.23)               —
                                                   3     -1.58 (.06)               PVE
                                                   4     1.10 (.03)                PVE
                                                   5     4.14 (.03)                PVE
F. Ratings imply measures                          1     18%                       NVE
                                                   2     5%                        NVE
                                                   3     57%                       PVE
                                                   4     77%                       PVE
                                                   5     49%                       PVE
G. Measures imply ratings                          1     100%                      PVE
                                                   2     41%                       PVE
                                                   3     60%                       PVE
                                                   4     63%                       PVE
                                                   5     72%                       PVE

Person reliability    0.92
Person separation     3.42
Item reliability      0.99
Item separation      11.36

Note. 5-CAT = five-category structure; CAT = a specific category; PVE = positive validity evidence; NVE = negative validity evidence; UWMS = unweighted mean-square residual. Values in parentheses for Guidelines E, H, and I are standard errors of the threshold estimates.




Table 4. Guidelines for increasing rating scale category effectiveness: 4-CAT structure

Guideline                                          CAT     Total coaching efficacy   Validity evidence
A. ≥ 10 observations in each CAT                   1 & 2   405 (3.7%)                PVE
                                                   3       2,912 (26.5%)             PVE
                                                   4       5,297 (48.2%)             PVE
                                                   5       2,380 (21.6%)             PVE
B. Regular observation distribution                all     unimodal                  PVE
C. Average measures increase across CATs           1 & 2   -1.42                     —
                                                   3       -0.22                     PVE
                                                   4       1.21                      PVE
                                                   5       3.11                      PVE
D. UWMS < 2.0 for each CAT                         1 & 2   1.01                      PVE
                                                   3       1.06                      PVE
                                                   4       1.01                      PVE
                                                   5       0.99                      PVE
E. Thresholds advance across CATs                  1 & 2   —                         —
                                                   3       -2.77 (.06)               —
                                                   4       -0.13 (.03)               PVE
                                                   5       2.90 (.03)                PVE
H. Distance between adjacent thresholds            1 & 2   —                         —
   meets minimum standard                          3       -2.77 (.06)               —
                                                   4       -0.13 (.03)               PVE
                                                   5       2.90 (.03)                PVE
I. Distance between adjacent thresholds            1 & 2   —                         —
   does not exceed 5.0 logits                      3       -2.77 (.06)               —
                                                   4       -0.13 (.03)               PVE
                                                   5       2.90 (.03)                PVE
F. Ratings imply measures                          1 & 2   6%                        NVE
                                                   3       58%                       PVE
                                                   4       77%                       PVE
                                                   5       49%                       PVE
G. Measures imply ratings                          1 & 2   80%                       PVE
                                                   3       60%                       PVE
                                                   4       63%                       PVE
                                                   5       72%                       PVE

Person reliability    0.92
Person separation     3.40
Item reliability      0.99
Item separation      11.30

Note. 4-CAT = four-category structure; CAT = a specific category; PVE = positive validity evidence; NVE = negative validity evidence; UWMS = unweighted mean-square residual. Values in parentheses for Guidelines E, H, and I are standard errors of the threshold estimates.


was unimodal (CAT-4 = 5,297 in both distributions). Guideline C was met in both the 5-CAT (four pieces of evidence) and 4-CAT (three pieces of evidence) structures, as the average TCE measure advanced monotonically with the categories (-2.53, -0.08, 1.02, 2.45, and 4.34, and -1.42, -0.22, 1.21, and 3.11, respectively). Guideline D was met in both the 5-CAT (five pieces of evidence) and 4-CAT (four pieces of evidence) structures, as the UWMS value for each category was less than 2.00 (range = 0.59–1.07 and 0.99–1.06, respectively). Guidelines E, H, and I were met in both the 5-CAT (nine pieces of positive validity evidence) and 4-CAT (six pieces of evidence) structures, as the τ estimates advanced with the categories, by more than the minimum distances, and by less than 5.00 logits (-3.66, -1.58, 1.10, and 4.14, and -2.77, -0.13, and 2.90, respectively). It was noted, however, that the SE for τ1, .23, was much larger than the SEs for the other τ estimates across both structures (range = .03–.06). Guideline G was met in both the 5-CAT (five pieces of evidence) and 4-CAT (four pieces of evidence) structures, as the coherence percentage for each measurement zone was greater than 39% (range = 41–100% and 60–80%, respectively).

In both structures, positive and negative validity evidence was observed for Guideline F (coherence for ratings implying measurement zones > 39%). In the 5-CAT structure, CAT-1 coherence = 18%, CAT-2 coherence = 5%, CAT-3 coherence = 57%, CAT-4 coherence = 77%, and CAT-5 coherence = 49% (three pieces of positive validity evidence and two pieces of invalidity evidence). In the 4-CAT structure, CAT-1 and CAT-2 coherence = 6%, CAT-3 coherence = 58%, CAT-4 coherence = 77%, and CAT-5 coherence = 49% (three pieces of positive validity evidence and one piece of invalidity evidence). Conceptually, a rating in the lowest category (or lowest two categories) was not very predictive of a coach's overall measure.

Step 3b

The results for assistant coaches closely paralleled the results reported in Step 3a.7 Specifically, 32 of 34 evidences provided support for the 5-CAT structure, and 25 of 26 evidences provided support for the 4-CAT structure. The area of negative validity evidence was the same as reported in Step 3a.

For head coaches, more problems were observed in the 5-CAT structure than in Step 3a. Prior to discussing these problems, it is useful to note that many of them were likely due to there being only six responses in CAT-1. The percentage of responses observed in CAT-1 for head coaches, however, was equal to the percentage of such responses for assistant coaches (0.2%). It is reasonable, therefore, to hypothesize that the relatively small sample of head coaches contributed to these additional problems.


For head coaches, 28 of the 34 evidences supported the 5-CAT structure. Areas of negative validity evidence included those reported in Step 3a in addition to failing to meet Guideline A for CAT-1, Guideline H for τ1 to τ2, and Guideline G for CAT-1 and CAT-2. The results for the 4-CAT structure, however, closely paralleled the results from Step 3a. Specifically, 25 of 26 evidences provided support for the 4-CAT structure. The area of negative validity evidence was the same as reported in Step 3a.

Step 4

From this point on, the term coaches includes assistant and head coaches. Across coaches, both the 5-CAT and 4-CAT structures functioned effectively, which provided confirmatory evidence for imposing either structure on the items that define the CES. For head coaches only, there was evidence that the 5-CAT structure functioned less effectively than the 4-CAT structure; however, the small sample size for this subgroup could have contributed to this discrepancy. For reasons described in the Discussion section, the authors concluded that the 4-CAT structure was preferable to the 5-CAT structure when the CES items are used to measure the coaching efficacy of coaches of youth, high school, and collegiate sports.

Discussion

This study extends validity evidence for coaching efficacy measures derived from the CES by testing the rating scale categorizations suggested in previous research (Myers, Wolfe, et al., 2005), which provided post hoc evidence for the effectiveness of a 4-CAT structure for high school and college coaches. That research also suggested a 5-CAT structure may be effective for youth sports coaches, because they may be more likely to endorse categories on the lower end of the scale (Lee et al., 2002). In the present study, youth coaches responded to the CES items with a 5-CAT structure. Across rating scale category effectiveness guidelines, 32 of 34 evidences supported this structure. Data were condensed to a 4-CAT structure by collapsing responses in CAT-1 and CAT-2. Across category effectiveness guidelines, 25 of 26 evidences supported this structure. Findings provided confirmatory, cross-validation evidence for both the 5-CAT and 4-CAT structures. The authors conclude, however, that the 4-CAT structure is preferable to the 5-CAT structure when the CES items are used to measure the coaching efficacy of coaches of youth, high school, and collegiate sports.

There are several reasons for this preference. Empirically, the 4-CAT and 5-CAT structures exhibit a similar degree of effectiveness, but CAT-1 in the 5-CAT structure exhibits a few problems, which appear to be influenced by the infrequent use of this category. In the 5-CAT structure, CAT-1 was selected only 0.2% of the time.


Although the raw number of observations in CAT-1 (n = 22) met the minimum heuristic (10), the low frequency with which this category was selected directly contributed to the observed problems. First, the SE (.23) for the threshold between CAT-1 and CAT-2 (τ1 = -3.66) was approximately four times as large as any of the other SEs for the τs (the next largest was .06). In the 4-CAT structure, SEs for the τs ranged from .03 to .06, which suggested improved precision. Second, ratings in CAT-1 within the 5-CAT structure were not very predictive of measurement zones for total coaching efficacy, which suggests the infrequent use of this category may have been at least somewhat nonsystematic.

The 4-CAT structure did not improve on the poor coherence for ratings in the first category implying the first measurement zone (i.e., Guideline F). This poor coherence may be partially attributable to between-item multidimensionality of the CES. It may be, for example, that a low rating on a difficult motivation item is not a good indication of an overall low coaching efficacy estimate. Still, given both the rationale for imposing a unidimensional model and the relatively low importance of meeting Guideline F in this study, this invalidity evidence is tolerable as long as the limitation is noted in subsequent research. Specifically, a low rating on a single item should not be trusted to indicate a low level of total coaching efficacy.

There are theoretical, practical, and broader field-based reasons for retaining the 4-CAT structure as opposed to the 5-CAT structure. Theoretically, maintaining the 4-CAT structure parallels previous research (Myers, Wolfe, et al., 2005) and is congruent with general guidelines for self-efficacy measurement in sport (Feltz & Chase, 1998), which imply coaches are highly unlikely to select a category indicating "no confidence" (CAT-1 in the 5-CAT structure). Practically, maintaining the 4-CAT structure allows for a uniform rating scale across coaches of youth, high school, and collegiate sports, which should be useful for researchers. Broader field-based reasons for retaining the 4-CAT structure include responding to the call to move away from the poor psychometric practice of providing too many categories for efficacy items (Myers & Feltz, 2007). Most importantly, a more effective rating scale structure (compared with the original 10-CAT structure) should provide more accurate, precise, and stable measures of coaching efficacy (Linacre, 2002).

It is interesting to note that although Zhu et al. (2001) and Myers, Wolfe, et al. (2005) used slightly different approaches to arrive at more effective post hoc rating scale structures, both structures were confirmed in follow-up studies (Zhu, 2002, and this study, respectively). Specifically, Zhu et al. used a mechanically thorough approach to create all possible adjacent-category collapsings for 3-CAT and 4-CAT structures from an original 5-CAT structure and identified a preferred 3-CAT structure. Myers, Wolfe, et al. used a less mechanically thorough procedure to reduce a 10-CAT structure to five post hoc structures and identified a preferred 4-CAT structure.




We view both approaches as reasonable ways to explore post hoc rating scale structures when the original is ineffective, but we note general weaknesses of both. Because the Zhu et al. approach is mechanically thorough, it does not appear to incorporate the general guidelines for collapsing categories (Wright & Linacre, 1992), and it can create multiple similar structures that are difficult to differentiate from a statistical perspective. Because the Myers, Wolfe, et al. approach is less mechanically thorough, it does not explore all possible rating scale structures. A hybrid of both approaches, which might minimize their weaknesses and maximize their strengths, would be to (a) construct all possible post hoc rating scale structures, (b) eliminate all structures that violate one of the general principles for collapsing categories, and (c) evaluate the relative effectiveness of the remaining post hoc categorizations; a sketch of the first two steps appears at the end of this section.

We are aware of two primary limitations of this study. First, the ability to generalize from this sample is limited by its relative homogeneity; it was composed overwhelmingly of male Caucasian coaches of boys' ice hockey. While there is no compelling theoretical reason or previous research to indicate that other groups of coaches (e.g., female non-Caucasian coaches of female athletes) would require a different rating scale structure, we are unable to provide empirical evidence on this point in this study. The primary reason for not evaluating this assumption was that the identified subgroups were not of a sufficient size to provide a reasonable degree of confidence in the results. Future research that successfully imposes designs targeted at evaluating the effectiveness of various rating scale structures for these subgroups would add to this literature. Second, while the unidimensional model imposed in this study was reasonable, it did not provide direct evidence that a rating structure that is effective across subscales is effective within each subscale. Future research that provides evidence for the effectiveness of a uniform rating scale structure imposed within each subscale would increase confidence in this assumption.

Comparison of the Two Approaches

Now that confirmatory evidence has shown the effectiveness of a rating scale structure identified using the Wright and Linacre (1992) guidelines for combining categories and Linacre's (1999, 2002, 2004) method of evaluating rating scale effectiveness, a useful future study would be to directly compare the exploratory (Zhu & Kang, 1998; Zhu et al., 2001; Zhu et al., 1997) and confirmatory (Zhu, 2002) approaches of Zhu and colleagues to the exploratory (Myers, Wolfe, et al., 2005) and confirmatory (this study) approaches of Myers and colleagues. An investigation of this sort would be well served by a two-study approach, in which Study 1 is the exploratory study and Study 2 is the confirmatory study.
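Before turning to the obstacles such a comparison would face, the sketch below illustrates steps (a) and (b) of the hybrid approach proposed above, using the 5-CAT category counts reported in Table 3. The helper logic is ours; notably, with these counts the modal-category screen eliminates no structure, so in practice the interpretability and face-validity principles and the Step 3 guidelines would have to narrow the field.

```python
from itertools import combinations

# Observed 5-CAT category counts from Table 3 of this study.
counts = [22, 383, 2912, 5297, 2380]

def is_unimodal(seq):
    """True if the sequence rises to a single peak and then falls."""
    peaked = False
    for a, b in zip(seq, seq[1:]):
        if b < a:
            peaked = True
        elif b > a and peaked:
            return False
    return True

# Step (a): every adjacent-category collapsing into 2, 3, or 4 categories.
# Step (b): keep only collapsings whose collapsed frequency distribution
# stays regular (i.e., no additional modal categories are created).
survivors = []
for k in (2, 3, 4):
    for cuts in combinations(range(1, 5), k - 1):
        bounds = (0, *cuts, 5)
        totals = [sum(counts[bounds[i]:bounds[i + 1]]) for i in range(k)]
        if is_unimodal(totals):
            survivors.append((cuts, totals))

print(len(survivors), "of 14 structures survive step (b)")
```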


There will be some major obstacles in such a study. One is language. In their exploratory approach, Zhu and colleagues reported using a method to identify the optimal categorization. We believe this implies that a single rating scale structure is optimal in the population (not in a particular sample) and that Zhu and colleagues provided a method to recover this structure from responses to a less than optimal rating scale. While the former may be true, we believe the latter remains unproven. The primary problem is that in all the exploratory studies of Zhu and his colleagues, the true optimal categorization structure was unknown; therefore, it is impossible to prove that a particular methodology recovered it. The methodology of Zhu and colleagues compared the relative functioning of various rating scales with sample data. They then narrowed the set of candidates until selecting a structure they declared to be optimal. As stated in the introduction and earlier in this discussion, a problem with this approach is that there will likely be competing structures that cannot be meaningfully distinguished based on the sample results. That said, we believe the exploratory methodology put forth by Zhu and colleagues, like the exploratory methodology used by Myers, Wolfe, et al. (2005), is a reasonable way to identify an effective rating scale structure.

A second major obstacle would be merging the confirmatory approaches. Zhu (2002) followed a subset of participants from the relevant exploratory study (i.e., Zhu et al., 2001). In the present study, the sample was not a subset of the exploratory sample (i.e., Myers, Wolfe, et al., 2005). This difference may be important, because one statistical indicator Zhu used in his confirmatory approach was the correlation of person measures from the confirmatory study with person measures from the exploratory study. A few of the other indexes used by Zhu overlap to some degree with the approach in this study (e.g., item fit, average person measure in each category, and threshold order). Other indexes used by Zhu do not overlap with the present study (e.g., the correlation among the set of item difficulties between studies). Indexes in the present study that do not explicitly overlap with those used by Zhu include Guidelines A, B, D, F, G, H, and I. A study that overcomes these and other obstacles to compare the two approaches would advance the literature. For now, there is evidence that both approaches can (a) identify an effective categorization structure through exploratory post hoc methods and (b) confirm the effectiveness of this categorization structure in a subsequent study.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

American Psychological Association. (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: Author.

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.

Bozdogan, H. (1987). Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345–370.

Feltz, D. L., & Chase, M. A. (1998). The measurement of self-efficacy and confidence in sport. In J. L. Duda (Ed.), Advancements in sport and exercise psychology measurement (pp. 65–80). Morgantown, WV: Fitness Information Technology.

Feltz, D. L., Chase, M. A., Moritz, S. E., & Sullivan, P. J. (1999). A conceptual model of coaching efficacy: Preliminary investigation and instrument development. Journal of Educational Psychology, 91, 765–776.

Feltz, D. L., Hepler, T. J., Roman, N., & Paiement, C. A. (2006). Coaching efficacy of youth sport coaches: Extending validity evidence for the Coaching Efficacy Scale. Unpublished manuscript, Michigan State University, East Lansing.

Huynh, H. (1994). On equivalence between a partial credit item and a set of independent Rasch binary items. Psychometrika, 59, 111–119.

Lee, K. S., Malete, L., & Feltz, D. L. (2002). The strength of coaching efficacy between certified and non-certified Singapore coaches. International Journal of Applied Sport Sciences, 14, 55–67.

Linacre, J. M. (1999). Investigating rating scale category utility. Journal of Outcome Measurement, 3, 103–122.

Linacre, J. M. (2002). Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3, 85–106.

Linacre, J. M. (2004). Optimizing rating scale category effectiveness. In E. V. Smith & R. M. Smith (Eds.), Introduction to Rasch measurement (pp. 258–278). Maple Grove, MN: JAM Press.

Linacre, J. M. (2005). A user’s guide to WINSTEPS/MINISTEP Rasch-model computer programs (Version 3.55). Chicago: MESA Press.

Malete, L., & Feltz, D. L. (2000). The effect of a coaching education program on coaching efficacy. The Sport Psychologist, 14, 410–417.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: Macmillan.

Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749.

Myers, N. D., & Feltz, D. L. (2007). From self-efficacy to collective efficacy in sport: Transitional methodological issues. In G. Tenenbaum & R. C. Eklund (Eds.), The handbook of sport psychology (3rd ed., pp. 799–819). New York: Wiley.

Myers, N. D., Vargas-Tonsing, T. M., & Feltz, D. L. (2005). Coaching efficacy in collegiate coaches: Sources, coaching behavior, and team variables. Psychology of Sport & Exercise, 6, 129–143.

Myers, N. D., Wolfe, E. W., & Feltz, D. L. (2005). An evaluation of the psychometric properties of the coaching efficacy scale for American coaches. Measurement in Physical Education and Exercise Science, 9, 135–160.

Myers, N. D., Wolfe, E. W., Maier, K. S., Feltz, D. L., & Reckase, M. D. (2006). Extending validity evidence for multidimensional measures of coaching competency. Research Quarterly for Exercise and Sport, 77, 451–463.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Danish Institute for Educational Research. (Reprinted 1992, Chicago: MESA Press)

Roman, N. (2006). Extending validity evidence for the coaching efficacy scale with volunteer youth sport coaches. Unpublished master’s thesis, Michigan State University, East Lansing.

Wright, B. D., & Linacre, J. M. (1992). Combining and splitting categories. Rasch Measurement Transactions, 6, 233–235.

Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.

Wright, B. D., & Linacre, J. M. (1998). Winsteps: A Rasch model computer program. Chicago: MESA Press.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago: MESA Press.

Zhu, W. (2002). A confirmatory study of Rasch-based optimal categorization of a rating scale. Journal of Applied Measurement, 3(1), 1–15.

Zhu, W., & Kang, S. J. (1998). Cross-cultural stability of the optimal categorization of a self-efficacy scale: A Rasch analysis. Measurement in Physical Education and Exercise Science, 2, 225–241.

Zhu, W., Timm, G., & Ainsworth, B. (2001). Rasch calibration and optimal categorization of an instrument measuring women’s exercise perseverance and barriers. Research Quarterly for Exercise and Sport, 72, 104–116.

Zhu, W., Updyke, W. F., & Lewandowski, C. (1997). Post-hoc Rasch analysis of optimal categorization of ordered-response scale. Journal of Outcome Measurement, 1, 286–304.

Notes

1. Technically, categories can be condensed or split in post hoc categorizations, where splitting might consist of breaking a single category into two because the original was determined to attract more than one response type. This paper focuses on condensing categories, because in the limited research in exercise and sport sciences none of the applications have found reason to split a category (Myers, Wolfe, et al., 2005; Myers et al., 2006; Zhu & Kang, 1998; Zhu et al., 2001; Zhu et al., 1997).

2. The acronym CAT is used in two ways in this paper: (a) when referring to a specific category in a specific rating scale structure (e.g., CAT-1 = first category), and (b) when referring to the number of categories in a rating scale structure (e.g., 5-CAT = five categories).

3. The larger project had three purposes: (a) evaluate the rating scale categorization structure suggested by Myers, Wolfe, et al. (2005); (b) investigate the factorial structure of responses to the CES items for youth sport coaches; and (c) investigate sources of coaching efficacy in youth sport coaches. This paper concerns purpose (a). The contributions of this study are unique and are of practical, theoretical, and methodological importance. The unique contribution of this study satisfies concerns regarding duplicate publication (APA, 2001). Purposes (b) and (c) are summarized in Feltz, Hepler, Roman, and Paiement (2006).

4. For any variable where the total number of observations does not equal 492, the discrepancy can be attributed to a respondent failing to respond to a specific demographic item.

5. As a preliminary step, whether it was reasonable to fit the data to a rating scale model, as opposed to a partial credit model, was tested by comparing values of the consistent Akaike information criterion (CAIC; Bozdogan, 1987). For both the 5-CAT and 4-CAT structures, the rating scale model (CAIC = 20380 and CAIC = 20251, respectively) fit the data better than the partial credit model (CAIC = 20559 and CAIC = 20354, respectively). A brief sketch of how the CAIC is computed follows these notes.

6. Guidelines F and G provide another example of the slightly differing methodologies for evaluating the functioning of rating scale structures used by Zhu and his colleagues (Zhu, 2002; Zhu & Kang, 1998; Zhu et al., 2001; Zhu et al., 1997) compared with Myers, Wolfe, et al. (2005) and this study: Zhu and his colleagues did not report coherence values, whereas Myers, Wolfe, et al. and this study do.

7. Tables summarizing the results from Step 3b are not provided due to space limitations. These tables, however, are available on request from the first author.
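For readers unfamiliar with the CAIC referenced in Note 5, a minimal sketch of its computation follows. The formula is from Bozdogan (1987); the log-likelihoods and parameter counts below are hypothetical rather than values from this study.

```python
import math

def caic(log_likelihood, n_parameters, n_observations):
    """Consistent Akaike information criterion (Bozdogan, 1987):
    CAIC = -2 ln L + k (ln n + 1), where k is the number of estimated
    parameters and n the number of observations. Lower is better."""
    return (-2.0 * log_likelihood
            + n_parameters * (math.log(n_observations) + 1.0))

# A rating scale model shares one set of thresholds across items, so it
# estimates fewer parameters than a partial credit model, which estimates
# thresholds separately for each item. Hypothetical values for illustration:
print(caic(log_likelihood=-10100.0, n_parameters=28, n_observations=492))
print(caic(log_likelihood=-10050.0, n_parameters=96, n_observations=492))
```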

Authors’ Notes

This research was supported in part by a Summer Research Award from the University of Miami. We acknowledge Ahnalee Brincks for editing this manuscript. Please address all correspondence concerning this article to Nicholas D. Myers, School of Education, University of Miami, 311B Merrick Building, P.O. Box 248065, Coral Gables, FL 33124-2040. E-mail: [email protected]
