Meyners Castura Carr - CATA -- PREPRINT

Existing and new approaches for the analysis of CATA data

Michael Meyners1*, John C. Castura2, and B. Thomas Carr3

1 Procter & Gamble Service GmbH, 65824 Schwalbach am Taunus, Germany
2 Compusense Inc., Guelph, Ontario, Canada
3 Carr Consulting, Wilmette, IL 60091, USA

* Email: [email protected]

Abstract Check-All-That-Apply (CATA) questionnaires have seen widespread use recently. In this paper, we briefly review some of the existing approaches to analyze data obtained from such a study. Proposed extensions to these methods include a generalization of Cochran’s Q to test for product differences across all attributes, and a more informative penalty analysis. Multidimensional alignment (MDA) is suggested as a useful tool to investigate the association between products and the attributes. Comparisons of real products with an ideal are useful in identifying specific improvements for individual products. Penalty and penalty-lift analyses are used to identify (positive and negative) drivers of liking. The methods are illustrated by means of a CATA study on whole grain breads.

Keywords: Check-All-That-Apply (CATA), Cochran’s Q, correspondence analysis, multidimensional alignment, penalty analysis, penalty-lift-analysis

Introduction Check-All-That-Apply (CATA) questions have seen increased usage recently. They are used to investigate the perceptions of consumers on a variety of attributes. A presentation by Adams, Williams, Lancaster and Foley (2007) on this method sparked considerable interest in using CATA questions to obtain a rapid profile from naïve consumers. In this paper, we briefly review some existing ways to analyze CATA data, both graphically and by means of statistical tests. The main focus is on refining these approaches to provide further insight into the data, and on adding a few complementary analyses that have, to the best of our knowledge, not been proposed so far. All methods can similarly be applied to data from applicability testing studies (cf. Ennis & Ennis, 2013).

Notation Let us consider a CATA study in which each assessor j evaluates each product k on each attribute a. Let nJ denote the total number of assessors, nK the total number of products, and nA the total number of attributes. For simplicity, we assume that each assessor evaluates each product exactly once; generalizations of some of the methods below for incomplete or replicated designs or for data with missing values are straightforward but inconvenient with regard to notation. We assume the data typically to be organized in rectangular form where each of the nJ*nK rows contains the data for one observation, i.e. one assessor and one product. The attributes under consideration are displayed in nA columns. For the attributes, let “1” indicate that this attribute has been checked by the respective assessor for this product, and “0” indicate that the attribute has not been chosen. Additional columns identify the assessor and the product for this observation; further columns might contain data from non-CATA questions such as ratings on liking. As the analyses can be broken down to analyses by attribute, we re-organize the data for each attribute a in a matrix Xa = {xajk} of dimension (nJ, nK), with assessors in rows and products in columns. The matrix Xa contains the binary data, with a 1 indicating that the attribute was selected and a 0 indicating that it was not. Henceforth, we omit the index a for simplicity unless required to make the notation clear.

Contingency table Typically, the first summary of CATA data is to determine the column sums of X, i.e. counting by product how many assessors checked the given attribute. Merging the different attributes yields the so-called contingency table. The values might be displayed as absolute counts or percentages, the latter being particularly useful if the number of evaluations differs between products for any reason. Bar charts are frequently used to visualize the contingency table (cf. Castura & Meyners, 2013).
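As a minimal sketch of this first summary, the following Python snippet counts checks per product and attribute and converts them to percentages. The record layout, the product labels and the attribute names are hypothetical and chosen for illustration only.

```python
from collections import defaultdict

# Hypothetical long-format records: (assessor, product, {attribute: 0/1}).
records = [
    ("j1", "A", {"sweet": 1, "sour": 0}),
    ("j1", "B", {"sweet": 0, "sour": 1}),
    ("j2", "A", {"sweet": 1, "sour": 1}),
    ("j2", "B", {"sweet": 1, "sour": 0}),
    ("j3", "A", {"sweet": 0, "sour": 0}),
    ("j3", "B", {"sweet": 0, "sour": 1}),
]


def contingency_table(records):
    """Count, per product and attribute, how many assessors checked it."""
    counts = defaultdict(lambda: defaultdict(int))
    n_eval = defaultdict(int)  # number of evaluations per product
    for _assessor, product, checks in records:
        n_eval[product] += 1
        for attribute, value in checks.items():
            counts[product][attribute] += value
    return counts, n_eval


def as_percentages(counts, n_eval):
    """Express the counts as percentages of the evaluations per product."""
    return {
        product: {a: 100.0 * c / n_eval[product] for a, c in row.items()}
        for product, row in counts.items()
    }


counts, n_eval = contingency_table(records)
percentages = as_percentages(counts, n_eval)
```

Percentages are computed per product, so the table remains comparable if the number of evaluations differs between products.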

Statistical test strategy for CATA The analysis of CATA is typically considered exploratory and descriptive in nature, rather than inferential. However, statistical testing for product differences and inference to determine which attributes really differentiate between products and which attributes show differences that are potentially due to chance only is useful to avoid overinterpretation of the data. If the data set includes evaluations of an ideal product, depending on the purpose of the investigation this ideal product is either treated like all other products in what follows, or it is omitted from the analysis. From now on, for convenience we assume that the study has been designed as a nonreplicated full cross-over (i.e. each panelist evaluates all products once) as the most common test design for CATA studies. Other designs might be used, but might require some modification of the approaches; some of the parametric tests might not even apply anymore, as their assumptions are violated. If deviations are relatively small, they might still provide relatively accurate results, while with larger deviations, the validity of the parametric tests might be significantly compromised.

Cochran’s Q Cochran (1950) proposes a test statistic Q to investigate differences between treatments for cross-over studies with binary outcomes (like yes/no or checked/unchecked as used for CATA). The test applies to one attribute at a time. Cochran’s Q statistic for a single attribute is, in our notation, given by

$$Q = \frac{n_K (n_K - 1) \sum_{k=1}^{n_K} (T_k - \bar{T})^2}{n_K \sum_{j=1}^{n_J} R_j - \sum_{j=1}^{n_J} R_j^2}$$

with Tk denoting the number of checks for product k with corresponding grand mean T̄, and Rj the number of products for which assessor j checked the attribute under investigation. Under the null hypothesis of no product differences, Q is asymptotically χ²-distributed with (nK–1) degrees of freedom. Tate and Brown (1970) suggest that the χ² approximation is acceptable if the corrected number of assessors times the number of products is at least 24. For the correction in the number of assessors, those with no variability across products are excluded, i.e. those that checked the respective attribute for all or for none of the products are omitted from counting. In this manuscript, we refer to the corrected number of assessors as the effective sample size. Though 24 seems a relatively easy threshold to reach in a CATA study, in small studies with few products or for comparisons of 2 or 3 products on attributes that are most often elicited on all or no products, the threshold might still be limiting. Castura and Meyners (2013) report effective sample sizes for pairwise comparisons for some attributes as low as 15 from a total of 116 assessors. An alternative proposal of Cochran (1950) is to use the ordinary F-test on binary data, thereby treating binary data as if it were continuous. Although the F-test might give very similar results in certain situations compared to Cochran’s Q test, the latter is more appropriate for binary data (Cochran, 1950; Tate & Brown, 1970); therefore we will not consider the F-test any further. If only 2 products are compared, McNemar’s test (McNemar, 1947) and Cochran’s Q test are equivalent. However, we recommend using the well-known sign test (Arbuthnott, 1710), which provides an exact version of these approaches and is simple to conduct.
A typical analysis by attribute could hence consist of Cochran’s Q followed by the sign test for each pair of products (similar to an ANOVA F-test with subsequent pairwise comparisons using Fisher’s Least Significant Difference), although this approach does not protect against multiplicity issues.
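The by-attribute analysis described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' R code: the function names and the small data matrix are hypothetical, and returning Q = 0 when no assessor discriminates is a convention assumed here.

```python
from math import comb


def cochran_q(x):
    """Cochran's Q for one attribute.

    x is a list of rows (assessors); each row holds one 0/1 value per product.
    """
    n_k = len(x[0])
    col_totals = [sum(row[k] for row in x) for k in range(n_k)]  # T_k
    row_totals = [sum(row) for row in x]                         # R_j
    t_bar = sum(col_totals) / n_k
    numerator = n_k * (n_k - 1) * sum((t - t_bar) ** 2 for t in col_totals)
    denominator = n_k * sum(row_totals) - sum(r * r for r in row_totals)
    # Assumed convention: if no assessor discriminates, report Q = 0.
    return numerator / denominator if denominator else 0.0


def effective_sample_size(x):
    """Number of assessors who checked the attribute for some, not all, products."""
    n_k = len(x[0])
    return sum(1 for row in x if 0 < sum(row) < n_k)


def sign_test_p(x, k1, k2):
    """Two-sided exact sign test comparing products k1 and k2 on one attribute."""
    only_1 = sum(1 for row in x if row[k1] == 1 and row[k2] == 0)
    only_2 = sum(1 for row in x if row[k1] == 0 and row[k2] == 1)
    n = only_1 + only_2  # effective sample size for this pair
    if n == 0:
        return 1.0
    m = min(only_1, only_2)
    tail = sum(comb(n, i) for i in range(m + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical data: 6 assessors (rows) by 3 products (columns).
x = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 0, 1],
]
q = cochran_q(x)
n_eff = effective_sample_size(x)
p_pair = sign_test_p(x, 0, 1)
```

In this toy example the effective sample size times the number of products is well below 24, so the χ² approximation for Q would not be trusted and the exact sign test is the safer choice for the pairwise comparison.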

Overall test Instead of applying Cochran’s Q or related tests by attribute, we might be interested to test for overall differences between products across all attributes. Such an omnibus test will help to protect against inflated experiment-wise error rates. The well-known Pearson’s χ²-test for contingency tables might be considered a reasonable approach, were it not for the assumption of independence of observations, which is clearly violated in typical CATA studies, because assessors often evaluate multiple (if not all) products, and they assess all attributes. As Pearson’s χ²-test relies on the assumption that all observations are independent (i.e. each participant would only assess one attribute for one of the products), it does not provide a valid test for typical CATA data. To the best of our knowledge, no valid global test for CATA data has been proposed. In order to derive an approximate solution, note that Cochran’s Q test is asymptotically χ²-distributed with (nK–1) degrees of freedom, and that the sum of two independent χ²-distributed random variables with n and m degrees of freedom, respectively, is χ²-distributed with n+m degrees of freedom. To derive an asymptotic test, assume for a moment independence in evaluations of the different attributes, i.e. evaluation of any attribute by an assessor is independent of his/her evaluations of all other attributes. If true, the sum of all Q statistics across attributes is asymptotically χ²-distributed with nA(nK–1) degrees of freedom. The sum of the Q statistics, $Q^* = \sum_{a=1}^{n_A} Q_a$ with $Q_a$ denoting Cochran’s Q for attribute a, might then be compared against the respective χ²-distribution to determine an approximate p value. This approach naturally generalizes Cochran’s Q test to multiple attributes. However, the assumption of independence rarely holds. In a hypothetical evaluation, a consumer who considers sour to apply might be less likely to use sweet to characterize that same product.
If such dependencies exist amongst attributes, then the Q statistics are also not independent. The particular attributes and products in the study will determine how strongly the assumption of independence is violated. We employ the notion of randomization tests to derive a valid overall test which takes the dependency structure into account. The same approach can be used to assess the significance of product differences separately for each attribute. Conceptually, the proposed method is very similar to the one proposed by Meyners and Pineau (2010) for Temporal Dominance of Sensations (TDS) studies, and by Wakeling, Raats and MacFie (1992) to test consensus in Generalized Procrustes Analysis (GPA). The underlying idea is that, under the null hypothesis of no product differences, the recorded data does not depend on the actual product tested, but would have been identical if any of the other products had been evaluated. Therefore, randomly shuffling the allocation of products to evaluations (while obeying the study design) should not systematically change the test statistic used. If it nevertheless does change the test statistic systematically, this provides evidence that the null hypothesis is not true. In practice, usually 1000 or 10,000 random reallocations (including the original one) are used. A proof of the validity of the concept is found in the textbook by Edgington and Onghena (2007; see also Meyners & Pineau, 2010 for a more elaborate description of the approach) and will be omitted here. It is worth noting that the concept of randomization tests provides an exact test; but as we usually perform only a random subset of all possible randomizations, the test is considered to be quasi-exact.
Arbitrary precision could (in theory) be obtained by increasing the number of re-randomizations. As above, the test statistic used is the sum of Cochran’s Q statistics for all attributes. Rather than comparing the test statistic Q* against the χ²-distribution with nA(nK–1) degrees of freedom as discussed above, we determine the so-called null-distribution by means of appropriate re-randomizations of the data, typically using the observed data plus 999 re-randomizations. Other test statistics might be used alternatively without compromising the validity of the test; the choice of the test statistic merely defines against which alternatives the test will be most sensitive. Meyners and Hartwig (2009) used the very same approach employing Pearson’s χ²-statistic. We suggest using the sum of Q statistics here as it is linked nicely with the Cochran’s Q test by attribute. Of course, as for any global test, the hurdle for significance increases with an increasing number of attributes that do not discriminate between products (i.e. attributes just representing noise). Therefore, attributes other than those of main interest could be dropped from the analysis, if not from the ballot prior to running the study. Note that the re-randomizations are conducted such that rows are permuted within consumers. Thereby, any assessor effects or potential dependencies between attributes are maintained in the data, including potential differences in the average number of attributes checked as well as systematic differences in the selection of attributes. Though these effects are not modeled, they are respected in the evaluation of product differences, yielding valid p values under a model where assessor effects and attribute dependencies are included. A typical next step would be to apply the same test for subsets of products, investigating the nature of any product differences that might exist.
Most often, these tests would be applied to pairs of products. Correction for multiplicity may be applied, but as CATA is most often used for exploratory rather than inferential data analysis, it might be reasonable to refrain from any correction; the test proposed here rather serves to avoid serious over-interpretation of data that is possibly very noisy. An alternative could be to conduct univariate analyses (one attribute at a time) across all products and then try to identify the products that are discriminated by these attributes. Either way we are interested in obtaining information at the level of individual attributes and pairs of products. The advantage of testing pairs of products first is that we avoid exhaustive investigation about the attributes on which two products might differ before we know that they differ at all. Note that the same approach might be used to compare products on a single attribute at a time. The aforementioned approximations by the χ²-distribution only hold for sufficiently large effective sample sizes. If effective sample sizes are too small for some attributes, a randomization test can be used for each single attribute. An implicit assumption for the application of randomization tests is that the order of presentation has been randomized for each assessor independently from all the others, i.e. no constraints have been imposed, such as balancing the sample order. However, experience shows that results are robust to order balancing unless there are very strong deviations from the randomization (e.g. the same order for each subject). Analytical approaches (such as Cochran’s Q test) rely on exactly the same assumption, which would be violated all the same.
It is possible to modify the randomization tests to account for design features, and also for other designs like incomplete block designs, all of which will put some constraints on the randomization. However, given the infinite number of constraints that might apply in theory, it is impossible to automate this in any software; manual project-specific modifications are required instead. The R code that is available upon request applies to the most common situation, but can be used as a starting point for generalizations. Lancaster and Foley (2007) suggest a different approach using bootstrapping and the Cochran-Armitage linear trend test (Cochran, 1954; Armitage, 1955; cf. also Agresti, 2002). As we usually do not have any prior information or beliefs about possible direction of trends, this approach only seems applicable to contrast pairs of products, where a significant trend indicates product differences. The authors use the MULTTEST procedure in SAS, which has the benefit of directly providing multiplicity correction for the pairwise comparisons, but at the cost of not performing an adequate overall test for CATA data. Agresti and Liu (1999) as well as Bilder and Loughlin (2002, 2004) propose approaches to model and statistically test in the context of categorical variables with multiple choices like in CATA. However, their approaches refer to a slightly different scenario more typical for surveys in which two (or more) variables are assessed by a single assessor, but not the scenario in which multiple products are evaluated by all assessors. This has implications on the dependency structure in the data, such that these approaches might not generalize easily to the most common CATA situation. Similar to the setup proposed by Meyners and Pineau (2010) for TDS data, the different tests provide an overview of the differences between products that can be assumed to be real and those differences that might be due to chance only.
A graphical display as proposed by Meyners and Pineau (2010) might therefore be used to visualize these results. For further (graphical) analyses, it is worth considering omission of those attributes that did not show any discrimination between the products. Some of the multivariate methods are sensitive to whether such attributes are included, so it might be useful to avoid any influence of non-significant attributes. In larger and well-planned studies, it might seem of little use to actually run an overall test. The products are typically chosen such that we already know that they differ, at least to some extent. However, a significant overall test is a good confirmation that the set of attributes was reasonably chosen, and for smaller studies, an overall test is certainly helpful to avoid interpreting noise in the data. We therefore recommend an overall test as a reaffirmation that the data is indeed interpretable in some detail, and pairwise comparisons in order to confine interpretation to attributes that really discriminate between products.
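The randomization test on the sum of Q statistics can be sketched as follows for the unconstrained, complete design. This is a minimal illustration under stated assumptions, not the R code mentioned above: the function names and the data are hypothetical, product labels are permuted independently within each assessor, and the same permutation is applied to all attributes of that assessor so that attribute dependencies are preserved.

```python
import random


def cochran_q(x):
    """Cochran's Q for one attribute (rows: assessors, columns: products)."""
    n_k = len(x[0])
    col_totals = [sum(row[k] for row in x) for k in range(n_k)]
    row_totals = [sum(row) for row in x]
    t_bar = sum(col_totals) / n_k
    num = n_k * (n_k - 1) * sum((t - t_bar) ** 2 for t in col_totals)
    den = n_k * sum(row_totals) - sum(r * r for r in row_totals)
    return num / den if den else 0.0  # assumed convention when nobody discriminates


def q_star(data):
    """Sum of Cochran's Q over all attributes; data[a] is the matrix of attribute a."""
    return sum(cochran_q(x) for x in data)


def randomization_test(data, n_rand=999, seed=0):
    """Return the observed Q* and its randomization p value."""
    rng = random.Random(seed)
    n_j, n_k = len(data[0]), len(data[0][0])
    observed = q_star(data)
    at_least_as_large = 1  # the observed allocation counts as one reallocation
    for _ in range(n_rand):
        # One permutation of the product labels per assessor.
        perms = []
        for _j in range(n_j):
            perm = list(range(n_k))
            rng.shuffle(perm)
            perms.append(perm)
        # Apply each assessor's permutation to all attributes alike.
        shuffled = [
            [[x[j][k] for k in perms[j]] for j in range(n_j)]
            for x in data
        ]
        if q_star(shuffled) >= observed - 1e-9:
            at_least_as_large += 1
    return observed, at_least_as_large / (n_rand + 1)


# Hypothetical data: two attributes, 6 assessors, 3 products.
sweet = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 0], [1, 0, 1], [0, 0, 0]]
sour = [[0, 1, 1], [0, 1, 0], [0, 1, 1], [0, 0, 1], [1, 1, 1], [0, 1, 0]]
q_obs, p_value = randomization_test([sweet, sour], n_rand=999, seed=0)
```

Passing a single attribute matrix in the list gives the by-attribute randomization test, and restricting the columns to two products gives a pairwise comparison. Designs with constraints on the presentation order would need a correspondingly constrained permutation step.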

Graphical analysis A contingency table is often displayed in a bar chart of the percentages or absolute numbers of assessors checking an attribute by product. As CATA studies typically involve a large number of attributes and several products, careful choices are required regarding which attributes to display together, and whether the bars are grouped by products or rather by attribute. Correspondence Analysis (CA) is widely used to visualize a contingency table, and might be considered as a generalization of Principal Component Analysis (PCA) for ordinary data. The method projects the data into orthogonal components such as to maximize the sequential representation of the variation in the data. Typically, only the plot of the first two components is displayed; sometimes, due to too little variation explained, additional dimensions are plotted as well. Details about CA are beyond the scope of the paper, but are available from Greenacre (2007) as well as Abdi and Williams (2010). For classical CA based on χ²-distance, Legendre and Gallagher (2001) describe in a different context that (translated to CATA data) attributes with low incidence rates can have a major and undue impact on the results. To avoid having to omit these attributes from the analysis, they propose to use the so-called Hellinger distance (Hellinger, 1909). Popper, Abdi, Williams and Kroll (2011) made similar observations with regard to rarely selected CATA terms, supporting the idea of using Hellinger distances. The R package ExPosition (Beaton, Chin Fatt, & Abdi, 2012) provides a convenient interface to determine a CA based on either χ² or Hellinger distances. As far as we are concerned, a drawback of the current version of this package (and shared by other packages) is that the aspect ratio of the default plots does not represent the relative variation explained; distances between products in the plot therefore do not necessarily respect the actual distances of the products. 
An alternative approach to derive a perceptual map is given by Multiple Factor Analysis (MFA; Escofier & Pagès, 2008). MFA allows giving the same weight to (groups of) variables, such that the perceptual map is not dominated by only a few attributes. This balanced weighting is useful if, for example, smell, taste and texture attributes should have similar impact on the results. Of course, other methods exist to derive a perceptual map, e.g., Partial Least Squares or covariance-based principal component analysis. These methods formally require scale data, but are easily performed using CATA data with most software packages. Results from these analyses are interpreted as if scale data had been used, and often resemble results from CA at least qualitatively. Applications using these methods are rarely reported for CATA data and will not be addressed further, given that CA is widely available.

Multidimensional alignment (MDA) Any perceptual map displays only two (rarely three) dimensions, thereby representing the relationship between attributes and products only incompletely. Depending on the proportion of variance explained in the figure, the true relationships might differ more or less from the visual impression from the plot. Consequently, an attribute might be related substantially less or substantially more with a product than derived from the display. Mathematically, the full information in the data is given in no more than (min(nA, nK) – 1) dimensions. As typically nA > nK, this is in up to as many dimensions as we have products, less one. Occasionally, due to perfect linear dependencies in the data, this number might be even smaller. Therefore, attributes and products in a perceptual map are vectors in a (nK – 1)-dimensional space. In order to reveal this information, Carr, Dzuroska, Taylor, Lanza, and Pansini (2009) propose to determine the cosine between pairs of vectors through

$$\cos(\angle(x, y)) = \frac{\langle x, y \rangle}{\sqrt{\langle x, x \rangle \, \langle y, y \rangle}},$$

where ⟨x, y⟩ denotes the inner (scalar) product of the vectors x and y. (Note that in a multidimensional perceptual map, both products and attributes are represented mathematically by vectors, even though in many cases only the attributes are depicted by arrows.) The cosine value can fall between -1 and +1. The angle between the vectors (or its cosine) in the full-dimensional space gives the complete information about the relationship between products and attributes. Carr et al. (2009) refer to this approach as Multidimensional Alignment (MDA). They propose to display the cosines of the angles of a product with all attributes in a bar chart. For interpretation, one needs to bear in mind that unlike correlation coefficients, absolute cosines below 0.707 (=cos(45°) = -cos(135°)) indicate hardly any relationship at all. This threshold is much larger than the threshold typically applied for interpreting correlation coefficients, where smaller values are usually considered to indicate some relationship; the threshold is high for the cosine due to its non-linearity. Consequently, a bar chart of the cosines might be slightly misleading if not carefully interpreted. Instead, Castura and Meyners (2013) propose displaying the angles directly on a reversed scale, where 180° (π radians) indicates perfect negative relation, 0° (0 radians) perfect positive relation, and 90° (π/2 radians) no relation at all. Alternatively, we propose here to display the attributes in a semicircle for each product, thereby displaying all angles between the attributes and the respective product in the multidimensional space in just two dimensions. The semicircle plot may become illegible if too many attributes are displayed simultaneously; in this case a full circle plot provides increased legibility. It should be noted that angles and their displays relate to the products only; we cannot interpret the relationship of two or more angles of attributes with one product in order to compare the attributes with each other.
Two attributes might be reasonably well correlated with a product, yet be orthogonal to each other in the multi-dimensional space. However, it would be possible to apply the same approach to study the relationship between one attribute and all others. The output would look the same, except that the attribute under consideration is not compared with itself and is therefore not shown.
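To illustrate the alignment computation, the sketch below determines the cosine and the angle between one product vector and several attribute vectors. The coordinates are hypothetical placeholders for full-dimensional coordinates (for example from a CA); how they are obtained is outside the scope of the sketch.

```python
from math import acos, degrees, sqrt


def cosine(x, y):
    """Cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / sqrt(sum(a * a for a in x) * sum(b * b for b in y))


def angle_deg(x, y):
    """Angle between x and y in degrees (0 = aligned, 180 = opposed)."""
    c = max(-1.0, min(1.0, cosine(x, y)))  # guard against rounding error
    return degrees(acos(c))


# Hypothetical coordinates of one product and four attributes.
product = [1.0, 0.0, 0.0]
attributes = {
    "sweet": [2.0, 0.0, 0.0],
    "sour": [-1.0, 0.0, 0.0],
    "crumbly": [0.0, 3.0, 0.0],
    "soft": [1.0, 1.0, 0.0],
}
angles = {name: angle_deg(product, v) for name, v in attributes.items()}
```

Here the attribute at 45° has a cosine of about 0.707, which is the threshold discussed above; reporting angles rather than cosines makes that non-linearity easier to read.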

φ-coefficient: relationship between attributes An alternative way to study the relationship between attributes is given by the φ-coefficient, a measure of correlation between two binary variables defined as

$$\varphi = \frac{n_{11} n_{00} - n_{10} n_{01}}{\sqrt{n_{1\bullet} \, n_{0\bullet} \, n_{\bullet 1} \, n_{\bullet 0}}}.$$

Thereby, n11 (n00) is the frequency that the attributes are both (not) selected, n10 (n01) the frequency that the first (second) attribute is selected but not the other, and n1• (n0•) is the marginal frequency that the first attribute was (not) selected irrespective of the second, and vice versa for n•1 and n•0. Note that φ is related to the Pearson’s χ²-statistic for 2x2 contingency tables through

$$\varphi^2 = \frac{\chi^2}{n},$$

where n denotes the total number of observations in the table.

The φ-coefficient applied across assessors allows determination of which attributes are typically checked together, and which are used rather independently to characterize the products. Multidimensional scaling (MDS) can be applied to the matrix containing the φ-coefficient for all pairs of attributes (or 1-φ in case a distance matrix is required as input to MDS). The results of MDS can then be visualized again in a two-dimensional map. Additional insight might be gained from the same analysis on subsets of the data, e.g. observations above and below the mean liking only, or based on demographic variables. Many coefficients of association could be used in place of the φ-coefficient. One possibility is the Jaccard index (Jaccard, 1901), which might perform particularly well with attributes that are rarely used to endorse the products. Lapointe and Legendre (1994) use this index in a cluster analysis of Scotch whiskies, arguing that two products sharing a characteristic was more relevant for similarity than two products both lacking a characteristic. Investigating emotions might be another application where most attributes are rarely used. In turn, there might be situations where most participants would endorse certain attributes for all products (like, e.g., brown in the example below, which is used in 72% of all observations), and where the absence of a characteristic for two products is more important with regard to their similarity than its presence. The Jaccard index might even partially mask the similarity between the attributes in such a case.
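As a small worked example, the following sketch computes the φ-coefficient and the Jaccard index for two attributes from their 0/1 vectors across observations. The data and function names are hypothetical; undefined cases (a zero denominator) are returned as NaN, which is a choice made for this sketch.

```python
from math import sqrt


def two_by_two(u, v):
    """Return (n11, n10, n01, n00) for two 0/1 vectors of equal length."""
    n11 = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(u, v) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(u, v) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    return n11, n10, n01, n00


def phi(u, v):
    """phi-coefficient of two binary variables."""
    n11, n10, n01, n00 = two_by_two(u, v)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else float("nan")


def jaccard(u, v):
    """Jaccard index: joint checks relative to observations with any check."""
    n11, n10, n01, _n00 = two_by_two(u, v)
    union = n11 + n10 + n01
    return n11 / union if union else float("nan")


# Hypothetical checks of two attributes over 8 observations.
soft = [1, 1, 1, 0, 0, 0, 1, 0]
moist = [1, 1, 0, 0, 0, 0, 1, 1]
phi_value = phi(soft, moist)
jaccard_value = jaccard(soft, moist)
```

The Jaccard index ignores the observations in which neither attribute was checked, which is why it behaves differently from φ for attributes that are used either very rarely or almost always.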

CATA with additional variables

Penalty-Lift-Analysis If liking of the products under investigation has been rated along with the CATA question(s), a penalty-lift analysis might be performed (Williams, Carr & Popper, 2011). To this end, liking is averaged across all observations (i.e. assessors and products) in which the attribute under consideration was used to characterize the product, and across those observations for which it was not. Determining the difference between these two mean values provides an estimate of the average change in liking due to this attribute applying compared to not applying as indicated in the CATA questions. For off-notes or other negatively associated attributes, liking might decrease due to an attribute applying, resulting in a negative “penalty”. The penalty-lift is typically displayed in a horizontal bar chart. Interpretation of the respective results might be misleading in case the attributes used in CATA are highly correlated in the product space under consideration. In that case, an attribute might be identified as an important driver of liking just due to its correlation with a real driver. Therefore, results of this approach should be carefully interpreted with sufficient understanding of the products evaluated and the product space they span. This caution applies to other analyses as well, but is of particular relevance here because the analysis does not give a hint on correlation at the same time. In contrast, consider perceptual maps, in which the proximity of attributes indicates high positive correlation (or negative correlation if on opposite sides), which is usually taken into account implicitly in the interpretation. Rather than aggregating across products, the analysis could be performed by product. However, that approach would primarily return the mean liking of those assessors that checked an attribute compared to that of assessors who did not do so for this particular product. If such a difference is due primarily to differences between certain consumer groups, for example, it will certainly not provide an adequate assessment about which attributes drive liking or acceptance across the product category. Furthermore, the number of (binary) observations might be too small to obtain robust results if the analysis is restricted to a single product.
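The penalty-lift for one attribute reduces to a difference of two means, as the following sketch shows. The liking scores and attribute vectors are hypothetical and pooled across assessors and products; returning None when one of the two groups is empty is a choice made for this sketch.

```python
def penalty_lift(liking, checked):
    """Mean liking where the attribute was checked minus mean liking where not.

    liking and checked are parallel lists over all observations
    (assessor-by-product evaluations); checked holds 0/1 values.
    """
    with_attr = [score for score, c in zip(liking, checked) if c == 1]
    without_attr = [score for score, c in zip(liking, checked) if c == 0]
    if not with_attr or not without_attr:
        return None  # difference undefined if one group is empty
    return sum(with_attr) / len(with_attr) - sum(without_attr) / len(without_attr)


# Hypothetical liking scores (9-point scale) and two attributes.
liking = [8, 7, 6, 3, 4, 5]
sweet = [1, 1, 1, 0, 0, 0]
stale = [0, 0, 0, 1, 1, 0]

lift_sweet = penalty_lift(liking, sweet)
lift_stale = penalty_lift(liking, stale)
```

A positive value suggests a lift and a negative value a penalty, subject to the caveat above that correlated attributes can inherit an apparent effect from a real driver.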

Comparison of products with ideal CATA question(s) can also be asked for an imagined ideal product. In this case, Cowden, Moore, and Vanleur (2009) suggest comparing the real products with the ideal based on the proportion of elicitations per product, and using a confidence interval for the number of elicitations of the real product. This approach ignores the uncertainty inherent in the number of elicitations for the ideal product, which is also a random number empirically observed from the data. We propose modifying this approach by determining the difference in the proportion of elicitations between each product and the ideal. Any assessor selecting an attribute for both the real and the ideal product (or for neither real nor ideal product) does not contribute any information on potential differences between the products, whereas any assessor using an attribute for the real but not for the ideal (or vice versa) does contribute. If there was no systematic difference between the product and the ideal in this attribute (as posited under the null hypothesis), the number of elicitations for the real product from those assessors that discriminate between the products would be binomially distributed with chance parameter ½ and a sample size equal to the effective sample size for this comparison. Based on these parameters, a (90% or 95%) confidence interval is determined easily. Effective sample sizes vary between attributes, thus confidence intervals have differing widths. The differences in proportion of elicitations can then be plotted along with a corresponding confidence interval, readily visualizing the differences between any product and the ideal.
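One possible reading of this proposal is sketched below. For one attribute it computes the difference in the proportion of elicitations between a real product and the ideal, the effective sample size, and the central range of differences expected under the null hypothesis from a Binomial(effective n, ½) distribution. The exact construction of the interval is an assumption of this sketch, as are the function names and data.

```python
from math import comb


def binomial_null_range(n, level=0.95):
    """Central range [lo, hi] of a Binomial(n, 1/2) count with coverage >= level."""
    alpha = 1.0 - level
    cumulative = 0.0
    lo = 0
    for k in range(n + 1):
        p_k = comb(n, k) / 2 ** n
        if cumulative + p_k > alpha / 2:
            lo = k  # smallest count whose lower tail exceeds alpha / 2
            break
        cumulative += p_k
    return lo, n - lo  # symmetric because the chance parameter is 1/2


def compare_with_ideal(real, ideal, level=0.95):
    """Compare 0/1 vectors (one entry per assessor) for a real product and the ideal."""
    n = len(real)
    real_only = sum(1 for r, i in zip(real, ideal) if r == 1 and i == 0)
    ideal_only = sum(1 for r, i in zip(real, ideal) if r == 0 and i == 1)
    n_eff = real_only + ideal_only  # assessors who discriminate
    difference = (real_only - ideal_only) / n
    lo, hi = binomial_null_range(n_eff, level)
    # A count c of "real only" corresponds to a difference of (2c - n_eff) / n.
    null_interval = ((2 * lo - n_eff) / n, (2 * hi - n_eff) / n)
    return {"difference": difference, "effective_n": n_eff,
            "null_interval": null_interval}


# Hypothetical data for one attribute and 12 assessors.
real = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
ideal = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
result = compare_with_ideal(real, ideal)
```

An observed difference outside the null interval points to a systematic difference between the product and the ideal on this attribute. Because the interval depends on the effective sample size, its width differs between attributes.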

Penalty Analysis A CATA study might include both scores on liking and the evaluation of an ideal product. In this case, a penalty analysis can be used (Ares, Dauber, Fernández, Giménez, & Varela, 2014). This analysis differs from the well-known penalty analysis for just-about-right (JAR) questions in that the analysis is based on the differences between real and ideal products (rather than deviation from the JAR value), and the impact on associated liking scores.

For the proposed approach, we determine whether an attribute is used either only for the ideal or only for the real product, which we define as incongruence. In contrast, if an attribute is used for both or neither of the products, it is defined as congruence. We can then determine the difference in mean liking between consumers with congruent and incongruent elicitations. This value is a reasonable estimate of the average impact on liking that the attribute has when used in a discriminating manner, compared to its use in a non-discriminating manner. We might apply the same idea across products, thereby allowing an overall evaluation.

A difficulty in interpreting the visual display of this analysis is that the approach only gives the absolute difference in liking and does not indicate whether the attribute in question actually increases or decreases liking. Interpretation of a corresponding plot must always refer back to the respective contingency table, and might require judging whether attributes are potentially positive or negative drivers of liking. Also, it is rather arbitrary whether the changes are represented with a positive or a negative sign.

This difficulty arises because both types of incongruence are treated the same way in this approach. But whether an attribute is checked for the ideal but not the real product, or vice versa, often matters. Therefore, we extend the proposal by Ares et al. (2014) to differentiate between the two cases. Substantially increased liking for observations with the attribute checked for both products (denoted by (1,1)) over those observations for which it is checked for the ideal but not the real product (denoted by (1,0)) indicates a “must-have” attribute. If the difference in liking is small, the attribute might be less relevant, even if consumers check it frequently for their ideal product. A decreased liking for (1,1) compared to (1,0) should rarely be observed, as it would indicate that presence of the attribute has a negative impact on liking, which contradicts the fact that it has been selected for the ideal product.

The other relevant comparison along the same lines is between assessors who have not checked the attribute for either of the products (denoted (0,0)) and those who have not checked it for the ideal but have done so for the real product (denoted (0,1)). A decrease in liking from (0,0) to (0,1) indicates that the attribute should be avoided for a good product; if liking is approximately equal, we conclude that presence of the attribute does no harm. Liking might also increase substantially from (0,0) to (0,1), which would indicate that, although the attribute is not necessary (or not considered necessary) for the ideal product, it nonetheless has a positive impact on liking. Therefore, determining and visualizing the different averages allows a more in-depth investigation of potential liking drivers or inhibitors, and makes it possible to differentiate “must-have” characteristics from “nice-to-have” or “to-be-avoided” attributes. As before, the data can be averaged across consumers for a more general interpretation.

To visualize the results, we plot the observed differences in liking against the percentage of consumers for which incongruence occurred. The latter percentage indicates how important the attribute under investigation is to discriminate between products across consumers.
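The cell comparisons described above can be sketched in a few lines. The analyses in this paper were run in R; the following Python sketch is only an illustration. The function name, the record layout (ideal, real, liking) and the example scores are hypothetical, and we assume here that the liking score attached to each observation is simply compared across cells.

```python
from statistics import mean

def penalty_analysis(records):
    """Summarise one attribute from (ideal, real, liking) triples.

    ideal and real are 0/1 flags for whether the attribute was checked for
    the ideal and for the real product; liking is the associated liking score.
    """
    cells = {(1, 1): [], (1, 0): [], (0, 0): [], (0, 1): []}
    for ideal, real, liking in records:
        cells[(ideal, real)].append(liking)

    def cell_mean(*keys):
        values = [x for key in keys for x in cells[key]]
        return mean(values) if values else None

    def difference(a, b):
        return None if a is None or b is None else a - b

    n = sum(len(v) for v in cells.values())
    n_incongruent = len(cells[(1, 0)]) + len(cells[(0, 1)])
    return {
        # congruent versus incongruent, both types of incongruence pooled
        "pooled": difference(cell_mean((1, 1), (0, 0)), cell_mean((1, 0), (0, 1))),
        # (1,1) versus (1,0): large positive values suggest a "must-have"
        "ideal_only": difference(cell_mean((1, 1)), cell_mean((1, 0))),
        # (0,1) versus (0,0): negative suggests "to be avoided", positive "nice-to-have"
        "real_only": difference(cell_mean((0, 1)), cell_mean((0, 0))),
        # x-axis of the plot: percentage of observations with incongruence
        "pct_incongruent": 100.0 * n_incongruent / n,
    }

# Hypothetical data, for illustration only.
example = [(1, 1, 8), (1, 1, 7), (1, 0, 5), (1, 0, 6),
           (0, 0, 7), (0, 0, 6), (0, 1, 4), (0, 1, 5)]
result = penalty_analysis(example)
print(result)
```

In this made-up example the pooled value alone would not reveal that the attribute lowers liking when it is present in the real product without being wanted in the ideal, whereas the two directed differences do.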

Example

To illustrate the approaches proposed here, we consider a CATA study on six whole grain breads. 161 consumers participated in this study after pre-selection based on product usage criteria in a screener questionnaire. There were 86 women and 75 men, all above the age of 18. All analyses were performed using R 2.15.1 (R Development Core Team, 2012). Samples were presented following a 6x6 Latin square in which order and first-position carryover effects were balanced. Overall liking of samples was evaluated on a 9-point scale before consumers indicated the CATA terms that applied to the sample. 31 CATA terms were allocated to consumers in an order defined by a Williams design with "Other" in the 32nd position. The same order of attributes was used for all evaluations by the same consumer. After evaluation of the 6 real samples, consumers evaluated liking and the same CATA questionnaire for their hypothesized ideal product. The contingency table including mean values on Overall liking for this data set is shown in Table 1.

(Table 1 about here)

The average liking of the ideal product is clearly below 9, so it does not seem safe to assume that the liking for an ideal product equals the highest possible score; this assumption, which is often made for penalty analysis, is not justified here. Only 64 of 161 consumers (40%) rated the liking of their ideal product to be 9, while 83 consumers rated the liking of their ideal product to be 8. A few (17) consumers even rated liking of their ideal product to be 7 or less.

Before going into further graphical analyses, a statistical test for differences between the real products has been performed overall and by attribute. The ideal has been omitted in this analysis as primary interest was in comparing the real products. However, the same analysis could have been run by treating the ideal as any other product. From the results in Table 2, we see a good agreement between the randomized p values and those using the asymptotic approximation to Cochran’s Q, both if applied across attributes or one attribute at a time. The effective sample size varies from 23 for warm to 156 for seeds. These effective sizes are reasonable; as shown in Table 1, warm has rarely been selected at all and hence most consumers never selected it, while only a single consumer endorsed seeds for product 2, but most did so for products 5, 6 and 1.

The last column of Table 2 includes the respective asymptotic p values if the ideal product is included in the analysis. As most attributes showed statistically significant differences between the products, this will not change a lot if another product (the ideal) is included; however, two changes are noteworthy: if the ideal is included, warm and brown turn out to be highly significant, though they were not before. Referring to Table 1, it becomes obvious that none of the products is regularly endorsed for being warm, while the majority of consumers would like the ideal to be warm. In contrast, all breads are endorsed by about three quarters of the consumers for being brown; for the ideal, it is selected by only about half of the consumers. The breads in this study might hence have been slightly too brown compared to consumers' ideal product.
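For a single attribute, Cochran’s Q, the effective sample size and a randomization p value can be sketched as follows. This is a Python illustration under stated assumptions, not the R code used for Table 2: the function names and the small data matrix are hypothetical, the effective sample size is taken to be the number of consumers whose responses vary across products, and the randomization scheme (shuffling each consumer's responses across products) is one plausible choice that we do not claim matches the original analysis.

```python
import random

def cochran_q(rows):
    """Cochran's Q for a consumers x products matrix of 0/1 elicitations."""
    k = len(rows[0])
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    total = sum(row_totals)
    denom = k * total - sum(t * t for t in row_totals)
    if denom == 0:          # no consumer discriminates between products
        return 0.0
    return (k - 1) * (k * sum(c * c for c in col_totals) - total ** 2) / denom

def effective_sample_size(rows):
    """Consumers who checked the attribute for some, but not all, products."""
    k = len(rows[0])
    return sum(1 for r in rows if 0 < sum(r) < k)

def randomization_p(rows, n_perm=999, seed=1):
    """Share of within-consumer permutations with Q at least as large as observed."""
    rng = random.Random(seed)
    observed = cochran_q(rows)
    hits = 0
    for _ in range(n_perm):
        shuffled = [rng.sample(r, len(r)) for r in rows]
        if cochran_q(shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: 8 consumers (rows) by 3 products (columns).
data = [[1, 1, 0], [1, 0, 0], [1, 1, 0], [1, 0, 0],
        [1, 1, 1], [0, 0, 0], [1, 0, 0], [1, 1, 0]]
q_obs = cochran_q(data)
p_rand = randomization_p(data)
print(q_obs, effective_sample_size(data), p_rand)
```

Consumers with constant responses contribute nothing to Q, which is why the effective sample size can be far below 161. For the overall test, the footnotes of Table 2 indicate that the Q statistics are summed across the 32 attributes, with 32 × 5 = 160 degrees of freedom for six products and 32 × 6 = 192 when the ideal is included.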

(Table 2 about here)

A noteworthy number of additional comments were given under “other”; these comments lend themselves to a different analysis, hence they will be omitted from the quantitative analysis that follows. Also, the ideal product is omitted from the graphical analysis, as primary interest is in the comparison between the real products and the product space they span. Correspondence Analysis has been used to visualize the data, using the R packages ExPosition (Beaton, Chin Fatt, & Abdi, 2012) and prettyGraphs (Beaton, 2012) with a few modifications to improve the graphical display. In this case, Hellinger’s distance has been used. The resulting plot is shown in Figure 1. The analysis indicates that the product space is low-dimensional (two or at most three dimensions; the variance explained by components 3 and 4 is about 6% and 2%, respectively). The main discrimination is apparently between products endorsed on seeds, grainy, crunchy (products 1, 3, 5) and the more traditional, springy and soft product 2. The second dimension differentiates mainly on sweetness (including molasses). The size of the circles in Figure 1 indicates how much the respective product or attribute contributes to the total variance in the data.
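To give a feel for the distance underlying this map, the following Python sketch computes Hellinger distances between product profiles, using the counts of Table 1 for products 1, 2 and 5 (31 attributes, “Other” omitted). It uses one common definition of the Hellinger distance, the Euclidean distance between square-rooted row profiles; the function name is ours, and the sketch does not reproduce the CA itself or the ExPosition implementation.

```python
import math

# Counts from Table 1 in attribute order Fresh ... Firm ("Other" omitted).
prod1 = [89, 6, 24, 76, 102, 73, 48, 89, 81, 111, 58, 118, 87, 45, 104, 15,
         41, 40, 122, 125, 76, 34, 89, 113, 62, 114, 61, 40, 42, 31, 37]
prod2 = [70, 7, 4, 95, 18, 71, 17, 3, 63, 4, 40, 53, 29, 33, 91, 2,
         1, 25, 115, 1, 41, 7, 39, 42, 6, 49, 2, 66, 61, 0, 33]
prod5 = [91, 9, 32, 81, 53, 78, 35, 91, 84, 127, 43, 122, 100, 36, 111, 19,
         63, 37, 122, 139, 54, 67, 73, 109, 21, 97, 71, 37, 46, 39, 22]

def hellinger_distance(counts_a, counts_b):
    """Euclidean distance between square-rooted profiles (counts / row total)."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    return math.sqrt(sum(
        (math.sqrt(a / total_a) - math.sqrt(b / total_b)) ** 2
        for a, b in zip(counts_a, counts_b)
    ))

d_1_5 = hellinger_distance(prod1, prod5)
d_1_2 = hellinger_distance(prod1, prod2)
print(round(d_1_5, 3), round(d_1_2, 3))
```

Products 1 and 5 turn out to be much closer to each other than product 1 is to product 2, in line with the separation of product 2 from the seeded breads described above.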

(Figure 1 about here)

Based on the CA using Hellinger’s distance, MDA was used to determine the association between products and attributes in all dimensions, i.e. including those not displayed in Figure 1. As the number of attributes (31) is quite large to be displayed simultaneously in a semi-circle plot, Figure 2 shows the results in full circles to increase legibility. Attributes displayed nearly vertically indicate high (if above the horizontal line) or low (if below the horizontal line) correlation with the respective product. Attributes displayed almost horizontally are hardly associated with this particular product. The strength of the association is further emphasized by the darkness of the font color. It is important to note that it does not matter whether an attribute is displayed to the left or to the right of the vertical line; they have been distributed across sides to maximize legibility. Further note that proximity of attributes among each other does not indicate a relationship; such relationships are not displayed in this layout.

Figure 2 shows that product 2 is highly associated with, e.g., traditional, soft and springy, which is consistent with the interpretation of Figure 1. In addition to a strong negative association with attributes like grainy, seeds and sesame, as also indicated by the two-dimensional display of the CA results in Figure 1, we find that this product is highly negatively associated with nutritious, a relationship which is less apparent from the CA. Products 3 and, to a slightly lesser extent, 1 are not very strongly associated with any of the attributes. An alternative to displaying the attributes in a circle to indicate associations would be to use an ordinary barplot. An example of such a display for results of an MDA can be found in Castura and Meyners (2013) and is omitted here for brevity.
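As a rough illustration of the quantity displayed in Figure 2, the following Python sketch computes the cosine of the angle between a product vector and an attribute vector. It rests on our reading that MDA measures this angle using the coordinates from all CA dimensions; the function name and the three-dimensional coordinates are hypothetical and are not taken from the bread data.

```python
import math

def alignment(product_coords, attribute_coords):
    """Cosine of the angle between two coordinate vectors (all dimensions)."""
    dot = sum(p * a for p, a in zip(product_coords, attribute_coords))
    norm_p = math.sqrt(sum(p * p for p in product_coords))
    norm_a = math.sqrt(sum(a * a for a in attribute_coords))
    return dot / (norm_p * norm_a)

# Hypothetical coordinates, for illustration only.
product = [0.8, -0.1, 0.05]
attributes = {
    "attribute A": [0.6, -0.05, 0.02],   # points in nearly the same direction
    "attribute B": [-0.7, 0.2, 0.0],     # points in nearly the opposite direction
    "attribute C": [0.02, 0.3, -0.4],    # roughly perpendicular
}
for name, coords in attributes.items():
    cosine = alignment(product, coords)
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cosine))))
    print(f"{name}: cosine = {cosine:.2f}, angle = {angle:.0f} degrees")
```

A cosine near 1 would correspond to an attribute drawn nearly vertically above the horizontal line in Figure 2, a cosine near -1 to one drawn nearly vertically below it, and a cosine near 0 to an almost horizontal label.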

(Figure 2 about here)

To examine the association between the different attributes, we have determined the φ-coefficient between each pair of attributes (not shown). Based on these analyses, classical (metric) Multidimensional Scaling (MDS) based on the analysis of Mardia (1978; as implemented in the R function cmdscale) was used on the distances between attributes determined by 1–φ to create a two-dimensional map of these attributes displaying their associations. It should be noted that the variation explained in the first two dimensions is low (16% and 11%, respectively), such that further components might have to be investigated (a scree plot, not shown, would indicate up to 6 components). Overall, the associations found in Figure 3, which shows the first two components of this MDS, seem very reasonable.

In the next step, we looked at the associations based on observations above and below the mean liking separately. The idea behind this is that some attributes might be co-elicited frequently for well-liked products, but co-elicited infrequently for less-liked products (or vice versa). For example, firm and dense might be liked if they go together, but not if only one of them is present; therefore, for well-liked products, typically both or neither are endorsed, but rarely only one of these attributes. For our data, the corresponding MDS gives very similar results, which are omitted for brevity. However, for other data sets, we have observed some differences in the respective MDS, indicating that some attributes are endorsed simultaneously for products with low liking but are less associated on products with high liking, or vice versa.
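The φ-coefficient and the derived distance 1–φ can be sketched as follows in Python. The function names and the small 0/1 vectors are hypothetical, and returning 0 for an attribute that is never or always checked is our own convention. The sketch stops at the distance matrix, which would then be passed to a classical MDS routine such as the R function cmdscale mentioned above.

```python
import math

def phi_coefficient(x, y):
    """Phi coefficient of two 0/1 vectors (one entry per observation)."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0   # convention chosen here: a constant attribute has no association
    return (n11 * n00 - n10 * n01) / denom

def phi_distances(columns):
    """Matrix of 1 - phi for all pairs of attributes, in the order of `names`."""
    names = list(columns)
    matrix = [[1.0 - phi_coefficient(columns[a], columns[b]) for b in names]
              for a in names]
    return names, matrix

# Hypothetical elicitations for six observations, for illustration only.
cata = {
    "firm":  [1, 1, 0, 0, 1, 0],
    "dense": [1, 1, 0, 0, 0, 0],
    "soft":  [0, 0, 1, 1, 0, 1],
}
names, distances = phi_distances(cata)
for name, row in zip(names, distances):
    print(name, [round(d, 2) for d in row])
```

Attributes that are always checked together get distance 0, unrelated attributes a distance near 1, and attributes that exclude each other a distance of 2.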

(Figure 3 about here)

In order to examine the impact of different attributes on liking, a penalty-lift analysis was performed. Figure 4 shows the results, indicating that most attributes have a rather positive connotation in this context. Not surprisingly, tasteful, satisfying, exciting, and tempting are found to be the most important drivers of liking; they increase liking by up to almost 2 points on the 9-point scale when present compared to when absent. In turn, sweet is not as strong an acceptance driver as in many other categories. The results also show that “other” is the only real inhibitor of liking, by about 1.3 points on average. Examining the open comments by means of a word cloud, created using the R package wordcloud (Fellows, 2012) and displayed in Figure 5, indicates that most of the open comments had rather negative connotations. The most frequently used terms include dry, bland, bitter, plain, boring, and tasteless; these might be added to subsequent studies. In contrast, at least one of, e.g., firm and dense might be omitted, as they seem to be highly associated (Figure 3), have a similar impact (Figure 4), and do not significantly discriminate between products (firm) or do so only borderline (dense, cf. Table 2).
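The penalty-lift value for one attribute is the mean liking of observations with the attribute checked minus the mean liking of those without it (cf. the caption of Figure 4). A minimal Python sketch, with a hypothetical function name, record layout and data:

```python
from statistics import mean

def penalty_lift(observations, attribute):
    """Mean liking when `attribute` is checked minus mean liking when it is not."""
    checked = [o["liking"] for o in observations if o["cata"][attribute] == 1]
    unchecked = [o["liking"] for o in observations if o["cata"][attribute] == 0]
    if not checked or not unchecked:
        return None   # undefined if the attribute is always or never checked
    return mean(checked) - mean(unchecked)

# Hypothetical observations (one per consumer and product), for illustration only.
observations = [
    {"liking": 8, "cata": {"tasteful": 1, "firm": 0}},
    {"liking": 7, "cata": {"tasteful": 1, "firm": 1}},
    {"liking": 5, "cata": {"tasteful": 0, "firm": 1}},
    {"liking": 4, "cata": {"tasteful": 0, "firm": 0}},
]
lift_tasteful = penalty_lift(observations, "tasteful")
lift_firm = penalty_lift(observations, "firm")
print(lift_tasteful, lift_firm)
```

Positive values point to drivers of liking and negative values to inhibitors; values near zero suggest the attribute has little bearing on liking.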

(Figure 4 about here) (Figure 5 about here)

To compare individual attributes with their hypothetical ideals, we determined the difference between the proportion of elicitations for the real and the ideal product. Figure 6 visualizes the results for bread 5. The dashed line indicates the 95% confidence interval for each individual proportion; note that the width of the confidence interval depends on the effective sample size (i.e. the number of consumers using the respective attribute in one product, but not the other). It is apparent that consumers associated the attributes homemade, satisfying, and fresh much more with their ideal product than with the real product. The consumers found the bread less associated with warm than with their ideal bread, which held for all products, as indicated earlier. On the other hand, bread 5 was probably too coarse and perhaps too brown compared to an ideal.
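The comparison of elicitation rates can be sketched as a difference of paired proportions. The text does not state how the confidence limits in Figure 6 were constructed, so the Wald-type interval below is an assumption on our part, as are the function name and the made-up counts. Only the discordant pairs, i.e. the effective sample size, enter the difference and its standard error.

```python
import math

def paired_proportion_difference(pairs, z=1.96):
    """Difference p(real) - p(ideal) from (real, ideal) 0/1 pairs, with a Wald interval."""
    n = len(pairs)
    b = sum(1 for real, ideal in pairs if real == 1 and ideal == 0)
    c = sum(1 for real, ideal in pairs if real == 0 and ideal == 1)
    diff = (b - c) / n
    se = math.sqrt(b + c - (b - c) ** 2 / n) / n
    return {
        "difference": diff,
        "lower": diff - z * se,
        "upper": diff + z * se,
        "effective_n": b + c,   # consumers checking the attribute for one product only
    }

# Hypothetical data: 100 consumers, attribute checked mostly for the ideal.
pairs = [(1, 1)] * 30 + [(1, 0)] * 10 + [(0, 1)] * 50 + [(0, 0)] * 10
summary = paired_proportion_difference(pairs)
print(summary)
```

With these counts the difference is -0.4, meaning the attribute is checked far less often for the real product than for the ideal, which is the pattern reported above for warm.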

(Figure 6 about here)

Finally, a penalty analysis has been performed. As indicated above, we did not assume the liking scores for the ideal product to be equal to 9, as that did not seem to be a viable assumption. Instead, the liking as expressed for a hypothetical ideal product was used. The results of both the approach by Ares et al. (2014) and the newly proposed one are overlaid in Figure 7. As proposed by those authors, but in contrast to Castura and Meyners (2013), we have chosen to plot the absolute change with a positive sign here, as most attributes are judged to have a rather positive connotation. Thereby, results from the approach of Ares et al. (2014) and ours are more easily compared. An attribute tasteless, if used instead of tasteful, would likely have been found in a similar position as the latter if we assume that a product that is endorsed for tasteless is not endorsed for tasteful and vice versa, illustrating the limitation of the interpretation. In contrast, our approach clearly differentiates positive from negative drivers of liking without further reference to the original contingency table as given in Table 1.

Similar to previous analyses, the same drivers of liking are typically identified. However, it is worth noting that the approach of Ares et al. (2014) does not indicate a strong impact of either warm or homemade, for example, while these attributes are apparently important “must-haves” for a good product. Similarly, crunchy (with a negative sign for the analysis by Ares et al., 2014) is apparently a “must-have” for some consumers and a “nice-to-have” property of a good product for others. Too much brown and dense are to be avoided, as these attributes appear at the lower end of the plot. A similar graph can be derived for each individual product in order to identify specific attributes to be changed for any particular bread; it is omitted here for brevity.

(Figure 7 about here)

Conclusions

CATA is a valuable method in the toolbox of sensory scientists, and many tools exist for the analysis of the resulting data. These tools have proven useful on many data sets. We have extended this toolbox by adding some complementary ways to look at the data, and by refining some of the existing proposals to better match the requirements of the researcher. Furthermore, the approaches described herein can be applied to all consumers in a study, or to individual consumer segments or consumer clusters. These methods can equally be applied to data from an applicability testing study, as well as to other scenarios in which non-CATA binary variables are collected.

Acknowledgements

The authors are grateful to Compusense Sensory & Consumer Services – and in particular Karen Phipps and Sheila Fortune – for running the consumer test on whole grain breads and making available the data set used to illustrate the methods described in this manuscript. Two anonymous reviewers provided helpful comments on an earlier version of the manuscript.

References

Abdi, H., & Williams, L. J. (2010). Correspondence Analysis. In: N. J. Salkind, D. M. Dougherty, & B. Frey (eds.): Encyclopedia of Research Design. Thousand Oaks, CA: Sage. pp. 267–278.
Adams, J., Williams, A., Lancaster, B., & Foley, M. (2007). Advantages and uses of check-all-that-apply response compared to traditional scaling of attributes. 7th Rose-Marie Pangborn Sensory Science Symposium. Minneapolis, MN, USA.
Agresti, A. (2002). Categorical data analysis (2nd ed.). Hoboken, NJ: John Wiley and Sons.
Agresti, A., & Liu, I. (1999). Modeling a categorical variable allowing arbitrarily many category choices. Biometrics, 55, 936–943.
Arbuthnott, J. (1710). An Argument for Divine Providence, Taken from the Constant Regularity Observ’d in the Births of Both Sexes. Philosophical Transactions of the Royal Society of London, 27, 186–190.
Ares, G., Dauber, C., Fernández, E., Giménez, A., & Varela, P. (2014). Penalty analysis based on CATA questions to identify drivers of liking and directions for product reformulation. Food Quality and Preference, 32A, 65–76.
Armitage, P. (1955). Tests for linear trends in proportions and frequencies. Biometrics, 11, 375–386.
Beaton, D. (2012). prettyGraphs: publication-quality graphics. R package version 1.0. http://CRAN.R-project.org/package=prettyGraphs
Beaton, D., Chin Fatt, C. R., & Abdi, H. (2012). ExPosition: Exploratory analysis with the singular value decomposition. R package version 1.1. http://CRAN.R-project.org/package=ExPosition
Bilder, C. R., & Loughin, T. M. (2002). Testing for conditional multiple marginal independence. Biometrics, 58, 200–208.
Bilder, C. R., & Loughin, T. M. (2004). Testing for marginal independence between two categorical variables with multiple responses. Biometrics, 60, 241–248.
Carr, B. T., Dzuroska, J., Taylor, R. O., Lanza, K., & Pansini, C. (2009). Multidimensional Alignment (MDA): A simple numerical tool for assessing the degree of association between products and attributes on perceptual maps. 8th Rose-Marie Pangborn Sensory Science Symposium. Florence, Italy.
Castura, J. C., & Meyners, M. (2013). Check-all-that-apply questions. In: P. Varela and G. Ares (eds.): Novel Techniques in Sensory Characterization and Consumer Profiling. Boca Raton, FL: CRC Press.
Cochran, W. G. (1950). The comparison of percentages in matched samples. Biometrika, 37, 256–266.
Cochran, W. G. (1954). Some methods for strengthening the common χ² tests. Biometrics, 10, 417–451.
Cowden, J., Moore, K., & Vanleur, K. (2009). Application of check-all-that-apply response to identify and optimize attributes important to consumer’s Ideal product. 8th Rose-Marie Pangborn Sensory Science Symposium. Florence, Italy.
Edgington, E., & Onghena, P. (2007). Randomization tests (4th ed.). Boca Raton, FL: Chapman and Hall/CRC.
Ennis, D. M., & Ennis, J. M. (2013). Analysis and Thurstonian scaling of applicability scores. Journal of Sensory Studies, in press. DOI: 10.1111/joss.12034
Escofier, B., & Pagès, J. (2008). Analyses factorielles simples et multiples; objectifs, méthodes et interprétation (4th ed.). Paris, France: Dunod.
Fellows, I. (2012). wordcloud: Word Clouds. R package version 2.2. http://CRAN.R-project.org/package=wordcloud
Greenacre, M. (2007). Correspondence Analysis in Practice (2nd ed.). Boca Raton, FL: Chapman and Hall/CRC.
Hellinger, E. (1909). Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die reine und angewandte Mathematik, 136, 210–271.
Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37, 547–579.
Lancaster, B., & Foley, M. (2007). Determining statistical significance for choose-all-that-apply question responses. 7th Rose-Marie Pangborn Sensory Science Symposium. Minneapolis, MN, USA.
Lapointe, F.-J., & Legendre, P. (1994). A classification of pure malt Scotch whiskies. Applied Statistics, 43, 237–257.
Legendre, P., & Gallagher, E. (2001). Ecologically meaningful transformations for ordination of species data. Oecologia, 129, 271–280.
Mardia, K. V. (1978). Some properties of classical multidimensional scaling. Communications in Statistics – Theory and Methods, A7, 1233–1241.
McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12, 153–157.
Meyners, M., & Hartwig, P. (2009). Consumer associations with a toddlers’ product color evaluated by a choose-all-that-apply questionnaire. 8th Rose-Marie Pangborn Sensory Science Symposium. Florence, Italy.
Meyners, M., & Pineau, N. (2010). Statistical inference for temporal dominance of sensations data using randomization tests. Food Quality and Preference, 21(7), 805–814.
Popper, R., Abdi, H., Williams, A., & Kroll, B. J. (2011). Multi-Block Hellinger Analysis for Creating Perceptual Maps from Check-All-That-Apply Questions. 9th Rose-Marie Pangborn Sensory Science Symposium, Toronto, ON, Canada.
R Development Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.
Tate, M. W., & Brown, S. M. (1970). Note on the Cochran Q Test. Journal of the American Statistical Association, 65, 155–160.
Wakeling, I. N., Raats, M. M., & MacFie, H. J. H. (1992). A new significance test for consensus in generalized procrustes analysis. Journal of Sensory Studies, 7, 91–96.
Williams, A., Carr, B. T., & Popper, R. (2011). Exploring Analysis Options for Check-All-That-Apply (CATA) Questions. 9th Rose-Marie Pangborn Sensory Science Symposium, Toronto, ON, Canada.

Table 1: Contingency table for the CATA evaluation and average liking scores for 6 breads and a hypothetical ideal whole grain bread.

Attribute     Prod 1  Prod 2  Prod 3  Prod 4  Prod 5  Prod 6   ideal   total
Fresh             89      70      71      71      91     108     142     642
Warm               6       7       4      11       9       8      90     135
Crusty            24       4      32      27      32      27      47     193
Soft              76      95      75      72      81      78      93     570
Sweet            102      18      30      97      53      81      85     466
Chewy             73      71      74      78      78      75      69     518
Tempting          48      17      16      21      35      44      75     256
Nutty             89       3      50      28      91      92      88     441
Moist             81      63      44      60      84      83     120     535
Grainy           111       4      80      35     127     128     102     587
Dense             58      40      40      56      43      51      41     329
Healthy          118      53      85      78     122     118     147     721
Texture           87      29      68      46     100      97      90     517
Comfort           45      33      18      38      36      46      86     302
Wheat            104      91     108      96     111      99      91     700
Exciting          15       2       3       8      19      24      59     130
Crunchy           41       1      32       7      63      75      36     255
Homemade          40      25      21      23      37      36     107     289
Brown            122     115     120     112     122     110      88     789
Seeds            125       1      82      24     139     133     109     613
Fullness          76      41      36      52      54      77      87     423
Coarse            34       7      39      26      67      44      18     235
Satisfying        89      39      48      57      73      93     126     525
Nutritious       113      42      76      70     109     115     134     659
Molasses          62       6      24      40      21      24      39     216
Tasteful         114      49      54      68      97     108     144     634
Rustic            61       2      36      18      71      67      72     327
Traditional       40      66      35      50      37      38      57     323
Springy           42      61      48      43      46      47      48     335
Sesame            31       0      22       7      39      42      38     179
Firm              37      33      28      33      22      33      35     221
Other             16      56      40      40      32      23      16     223
total           2169    1144    1539    1492    2141    2224    2579
Mean Liking      7.3     5.8     6.1     6.2     6.9     7.2     8.3     6.8

Table 2: Uncorrected p values from statistical testing for overall product differences and effective sample sizes. Significant p values at level 5% are set in bold.

Attribute     p value           p value          effective     p value (incl. ideal)
              (randomizations)  (Cochran’s Q)    sample size   (Cochran’s Q)
Fresh         0.001             < 0.001          132           < 0.001
Warm          0.315             0.304            23            < 0.001
Crusty        0.001             < 0.001          66            < 0.001
Soft          0.055             0.062            132           0.018
Sweet         0.001             < 0.001          134           < 0.001
Chewy         0.928             0.926            118           0.886
Tempting      0.001             < 0.001          86            < 0.001
Nutty         0.001             < 0.001          128           < 0.001
Moist         0.001             < 0.001          131           < 0.001
Grainy        0.001             < 0.001          152           < 0.001
Dense         0.039             0.037            118           0.033
Healthy       0.001             < 0.001          116           < 0.001
Texture       0.001             < 0.001          127           < 0.001
Comfort       0.001             < 0.001          86            < 0.001
Wheat         0.023             0.023            93            0.009
Exciting      0.001             < 0.001          45            < 0.001
Crunchy       0.001             < 0.001          110           < 0.001
Homemade      0.002             0.001            74            < 0.001
Brown         0.285             0.242            89            < 0.001
Seeds         0.001             < 0.001          156           < 0.001
Fullness      0.001             < 0.001          128           < 0.001
Coarse        0.001             < 0.001          98            < 0.001
Satisfying    0.001             < 0.001          137           < 0.001
Nutritious    0.001             < 0.001          126           < 0.001
Molasses      0.001             < 0.001          84            < 0.001
Tasteful      0.001             < 0.001          146           < 0.001
Rustic        0.001             < 0.001          107           < 0.001
Traditional   0.001             < 0.001          104           < 0.001
Springy       0.099             0.094            105           0.154
Sesame        0.001             < 0.001          71            < 0.001
Firm          0.277             0.256            96            0.308
Other         0.001             < 0.001          91            < 0.001
overall       0.001             < 0.001 ¹                      < 0.001 ²

¹ based on the approximate sum of Q statistics of 2863 on 160 degrees of freedom
² based on the approximate sum of Q statistics of 4431 on 192 degrees of freedom

(Figure 1 plot: map of products 1 to 6 and the 31 attributes; axes: Component 1 (72.2%) and Component 2 (16.8%).)

Figure 1: First two dimensions from the CA based on Hellinger’s distance. A symmetric display is used here; only distances between products and distances between attributes can be interpreted directly, but not those between attributes and products.

(Figure 2 plot: six circular panels, labelled product 1 to product 6, each displaying the attributes around the circle.)

Figure 2: Associations between attributes and products based on MDA.


Figure 3: Associations between attributes based on classical MDS on the φ-coefficient. The variation explained is about 16% and 11% for the first and second dimension, respectively.

(Figure 4 plot: bar chart with one bar per attribute, including “Other”; axis from -1.5 to 2.0.)

Figure 4: Results of the penalty-lift analysis. The values indicate the change in liking of observations for which the respective attribute was checked, compared to observations for which that attribute was not checked.


Figure 5: Word cloud from the open comments. Only words used at least twice across all panelists and products are shown.

(Figure 6 plot: one entry per attribute, listed from Warm to Molasses; axis from 0.3 to -0.5.)

Figure 6: Comparison of elicitation rates of bread 5 and the ideal product.

(Figure 7 plot: vertical axis “change in liking”, from -0.5 to 2.0; horizontal axis “percentage of consumers with mismatch”, from 0 to 50. Legend: pooled; must: real=0, ideal=1; nice: real=1, ideal=0. Numbered labels: 1 Molasses, 2 Fullness, 3 Soft, 4 Chewy, 5 Rustic, 6 Dense.)

Figure 7: Results of the penalty analysis based on general incongruence (bold), on incongruence in which the attribute is missing in the real but not the ideal product (“must-haves”, dark grey italics) and incongruence in which the attribute is not important for the ideal but endorsed for the real product (“nice-to-haves”, light grey plain font).