
Using census data to investigate the causes of the ecological fallacy DRAFT VERSION: NOT FOR DISTRIBUTION M. Tranmer 1 and D.G. Steel 2

Abstract

This paper shows how data from the 2% Sample of Anonymised Records (SAR) can be combined with data from the Small Area Statistics (SAS) to investigate the causes of the ecological fallacy in an Enumeration District (ED) level analysis. A range of census variables are examined in three `SAR Districts' (Local Authority Districts with populations of 120,000 or more or combinations of contiguous districts with smaller populations) in England. Results of comparable analyses from the 1986 Australian census are also given. The ecological fallacy arises when results from an analysis based on area level aggregate statistics are incorrectly assumed to apply at the individual level. In general the results are different because individuals in the same area tend to have similar characteristics: a phenomenon known as within area homogeneity. A statistical model is presented which allows for within area homogeneity. This model may be used to explain the effects of aggregation on variances, covariances and correlations. A methodology is introduced which allows aggregate level statistics to be adjusted using individual level information on those variables that explain much of the within area homogeneity. This methodology appears to be effective in adjusting census data analyses, and the results suggest that the SAR is a valuable source of adjustment information for aggregate data analyses from census and non-census sources.

Keywords: aggregation; Samples of Anonymised Records (SAR); Small Area Statistics (SAS); correlation; within area homogeneity.

1 Department of Social Statistics, University of Southampton, SO17 1BJ, UK
2 Department of Applied Statistics, University of Wollongong, NSW 2522, Australia

1 Introduction

The ecological fallacy arises when the results of an analysis based on area level aggregate statistics are incorrectly assumed to apply at the individual level. For example, it would usually be wrong to assume that correlations calculated from Enumeration District (ED) means provide good estimates of the corresponding individual level correlations in a region of interest, such as a `SAR district' (a single Local Authority District (LAD) whose population exceeds 120,000, or an amalgamation of contiguous LADs with smaller populations). Area level and individual level statistics are often quite different because the area level statistics are subject to aggregation effects. Differences in the results of statistical analyses carried out at the individual level and at higher geographic levels have been shown to exist in many empirical studies using data from a wide variety of sources (for example, Openshaw & Taylor (1979), Openshaw (1984a), Fotheringham & Wong (1991)). The release of the 2% Sample of Anonymised Records (SAR) has enabled, for the first time, a detailed investigation of the nature and extent of aggregation effects in UK census data to be carried out by combining data from the SAR with data from the Small Area Statistics (SAS) data base.

To understand the causes of the ecological fallacy a statistical model is needed through which the effects of aggregation can be determined. In a region of interest, aggregation effects arise because individuals who live close to one another tend to have a degree of similarity for a range of characteristics. The small geographic areas within the region will therefore contain people who have similar characteristics, a phenomenon known as `within area homogeneity'. For example, the region of interest may be a SAR district, which contains a number of EDs. Two individuals from the same ED will tend to be a little more alike than two individuals from different EDs, for a range of socio-economic variables.
A statistical model is presented in Section 2 which allows for within area homogeneity and clearly shows the causes of the differences in aggregate and individual level statistics and the causes of the ecological fallacy. Using this model the within ED homogeneity of census variables can be estimated by combining aggregate (SAS) data with individual (SAR) data. An important feature of this method of estimating within ED homogeneity is that it does not require the individual level data to include ED identifiers. Besides explaining the aggregation effects, these estimates of within ED homogeneity may be of substantive interest, since they reflect relationships over and above those purely at the individual level.

The model and the associated methods were used in an empirical investigation into ED level aggregation effects in three different SAR districts in England. In addition, a similar investigation was carried out using Australian census data. Here, aggregate and individual data were combined to estimate aggregation effects for the city of Adelaide.

Finally, by combining aggregate and individual level census data under the statistical model, an understanding of area structure in socio-economic data is possible. If a set of variables can be identified which explain much of the within area homogeneity, limited individual level information on these variables can be used to adjust an area level analysis for aggregation effects. The effectiveness of this approach is assessed by adjusting an ED level analysis from SAS data using information from the SAR. More generally, situations in which the SAR might be used to adjust for aggregation effects in the analysis of aggregate socio-economic data from sources other than the census are discussed.

2 A statistical model which allows for within area homogeneity

Steel & Holt (1996a) suggest two ways in which within area homogeneity may arise:

1. Individuals who live in the same area are exposed to common influences and as a result exhibit similarities.
2. Individuals with similar characteristics choose to live in the same area.

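A simple way to represent within area homogeneity is through a statistical model which includes variance components to represent area effects. For a single variable of interest, $y_1$, such a model can be specified as follows:

Model A:

\[ y_{1i} = \mu_1 + \nu_{1g} + \epsilon_{1i}, \qquad i \in g \]

where

$y_{1i}$ represents the value of $y_1$ for the $i$th individual in area $g$;
$\mu_1$ is the population mean of $y_1$ across the region of interest;
$\nu_{1g}$ is a random variable representing the area effect for the $g$th area;
$\epsilon_{1i}$ is a random variable representing the pure individual effect.

It is assumed that:

\[ E[\nu_{1g}] = 0, \quad \mathrm{Var}[\nu_{1g}] = \sigma^2_{\nu 1}; \qquad E[\epsilon_{1i}] = 0, \quad \mathrm{Var}[\epsilon_{1i}] = \sigma^2_{\epsilon 1} \]

It is also assumed that the individual and area effects, and the individual effects for different individuals, are uncorrelated, that is:

\[ \mathrm{Cov}[\nu_{1g}, \epsilon_{1i}] = 0; \qquad \mathrm{Cov}[\epsilon_{1i}, \epsilon_{1j}] = 0 \ \text{ for } i \neq j \]

Under Model A, the following properties of $y_{1i}$ apply:

\[ E[y_{1i}] = \mu_1 \]

\[ \mathrm{Var}[y_{1i}] = \sigma^2_{\nu 1} + \sigma^2_{\epsilon 1} = \sigma^2_1 \]

\[ \mathrm{Cov}[y_{1i}, y_{1j}] = \begin{cases} \sigma^2_{\nu 1} & \text{if } i \text{ and } j \ (i \neq j) \text{ are from the same area} \\ 0 & \text{otherwise} \end{cases} \]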

As an example, consider individuals within EDs. In this case, $\nu_{1g}$ represents the ED effect for the $g$th ED, which is common to all individuals within that ED. The term $\epsilon_{1i}$ represents an effect unique to the $i$th individual, over and above the ED effect. The variance of $y_{1i}$ comprises two components: the first, $\sigma^2_{\nu 1}$, due to the ED effect and the second, $\sigma^2_{\epsilon 1}$, due to the individual within ED effect. If individuals $i$ and $j$ are from the same ED, the covariance of $y_{1i}$ and $y_{1j}$ is the variance of the ED level effect, $\sigma^2_{\nu 1}$; it is this common ED effect that produces the within ED homogeneity.

A similar variance components model can be specified for a second variable of interest, $y_2$, involving area effects $\nu_{2g}$ and pure individual level effects $\epsilon_{2i}$. Model A may be expanded in an obvious way to incorporate the covariance between the area effects, $\mathrm{Cov}[\nu_{1g}, \nu_{2g}] = \sigma_{\nu 12}$, and the individual effects, $\mathrm{Cov}[\epsilon_{1i}, \epsilon_{2i}] = \sigma_{\epsilon 12}$, for the two variables. Calculation of the correlation coefficient for $y_1$ and $y_2$ also involves their covariance, $\sigma_{12}$:

\[ \mathrm{Cov}[y_{1i}, y_{2i}] = \sigma_{\nu 12} + \sigma_{\epsilon 12} = \sigma_{12} \]

Thus, under Model A, the covariance between the two variables for individuals has one component due to the covariance of the area effects of the two variables and another due to the pure individual level effects. Moreover,

\[ \mathrm{Cov}[y_{1i}, y_{2j}] = \begin{cases} \sigma_{\nu 12} & \text{if } i \text{ and } j \ (i \neq j) \text{ are from the same ED} \\ 0 & \text{otherwise} \end{cases} \]

so that the values of the two variables for two different individuals in the same ED have a covariance due to the area level effects, $\nu_{1g}$ and $\nu_{2g}$.

Hence for the region of interest, the (marginal) individual level population correlation coefficient of $y_1$ and $y_2$ across all areas is:

\[ \rho_{12} = \frac{\sigma_{12}}{\sigma_1 \sigma_2} \]

This correlation contains an individual within area component,

\[ \rho_{\epsilon 12} = \frac{\sigma_{\epsilon 12}}{\sigma_{\epsilon 1} \sigma_{\epsilon 2}} \]

and an area level component,

\[ \rho_{\nu 12} = \frac{\sigma_{\nu 12}}{\sigma_{\nu 1} \sigma_{\nu 2}}. \]

Steel & Holt (1996b) developed the statistical theory for situations in which areas are randomly formed and there is no within area homogeneity. While this approach is useful in developing a statistical framework for considering aggregation effects, in practice most areas, such as EDs, are not randomly formed and models which allow for within area homogeneity must be considered, such as Model A above.
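As a rough numerical illustration of how Model A builds the marginal correlation from its components, the following Python sketch composes $\rho_{12}$ from area level and individual level variance components. The function name and all component values are hypothetical, chosen only for illustration; they are not estimates from the census data.

```python
import math

def model_a_correlations(var_nu1, var_eps1, var_nu2, var_eps2, cov_nu12, cov_eps12):
    """Return (marginal, pure area level, pure individual level) correlations under Model A."""
    # Under Model A the total variances and covariance are sums of the two components.
    var1 = var_nu1 + var_eps1
    var2 = var_nu2 + var_eps2
    cov12 = cov_nu12 + cov_eps12
    rho_marginal = cov12 / math.sqrt(var1 * var2)
    rho_area = cov_nu12 / math.sqrt(var_nu1 * var_nu2)
    rho_individual = cov_eps12 / math.sqrt(var_eps1 * var_eps2)
    return rho_marginal, rho_area, rho_individual

# Hypothetical components: small area level variances, strongly correlated area effects.
rho, rho_area, rho_ind = model_a_correlations(0.01, 0.09, 0.01, 0.09, 0.008, 0.01)
print(round(rho, 3), round(rho_area, 3), round(rho_ind, 3))
```

With these made-up inputs the area effects are strongly correlated while the individual effects are only weakly correlated, and the marginal correlation lies between the two.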

3 Targets of inference and data availability

When a statistical analysis is carried out for a population which is split into geographic areas, such as EDs, it is important to specify clearly the parameters the analysis aims to estimate, that is, the targets of inference. It should be clear whether the targets of inference are at the individual level or the area level. In some cases interest may focus on both levels, as is the case in multilevel modelling (Jones and Duncan, 1996). When the targets of inference are at the individual level it should be considered whether the analysis aims to assess relationships within each of the areas (a conditional approach) or across the areas (a marginal approach).

As an example, suppose that a researcher is interested in the individual level correlation between $y_1$ and $y_2$ for a particular SAR district, which has been divided into EDs. The conditional approach could involve estimating a population correlation coefficient for $y_1$ and $y_2$ for each ED within the SAR district, conditional on the ED effects. These coefficients could be taken as fixed parameters for each ED, or the variation of these coefficients between EDs could be modelled in some way. This approach is typical in multilevel modelling and would normally require individual level data with ED identifiers (although it is possible using only the SAS in those cases where all of the variables of interest are cross-tabulated with each other (Duncan, Jones & Moon, 1993)). For example, in Model A the correlation is assumed to be the same in each ED, i.e. $\rho_{\epsilon 12}$, and the targets of inference in multilevel modelling would be both $\rho_{\epsilon 12}$ and $\rho_{\nu 12}$. Under a marginal approach the individual level correlation between $y_1$ and $y_2$ across all the EDs in the SAR district, $\rho_{12}$, is estimated, although to achieve this some allowance for within ED homogeneity must be made.

In the remainder of this paper it is assumed that the population of interest is contained in a SAR district and the aim of the analysis is to estimate the marginal individual level correlations between census variables $y_1$ and $y_2$, which are not necessarily cross-tabulated in the SAS.

Prior to the release of the SAR, the lowest level of data to be released within SAR districts was the Small Area Statistics (SAS). These data provide, for each of the $M$ EDs (indicated by $g$) in the SAR district, the total population $N_g$ and the totals of the variables of interest, $y_{1g}$ and $y_{2g}$. The total number of individuals in the SAR district is $N = \sum_{g=1}^{M} N_g$ and the average number of individuals per ED is $\bar{N} = N/M$. These data can be used to calculate the ED means $\bar{y}_{1g}$ and $\bar{y}_{2g}$, for $y_1$ and $y_2$ respectively. For example, the ED mean $\bar{y}_{1g}$ may be the proportion of females in ED $g$ and $\bar{y}_{2g}$ may be the proportion of people with limiting long-term illness in the same ED. From the ED means, the following statistics may then be calculated:

the overall means of $y_1$ and $y_2$,

\[ \bar{\bar{y}}_1 = \frac{1}{N} \sum_{g=1}^{M} N_g \bar{y}_{1g} \quad \text{and} \quad \bar{\bar{y}}_2 = \frac{1}{N} \sum_{g=1}^{M} N_g \bar{y}_{2g}; \]

the ED level variances of $y_1$ and $y_2$, weighted by population size,

\[ \bar{S}_1^2 = \frac{1}{M-1} \sum_{g=1}^{M} N_g (\bar{y}_{1g} - \bar{\bar{y}}_1)^2 \quad \text{and} \quad \bar{S}_2^2 = \frac{1}{M-1} \sum_{g=1}^{M} N_g (\bar{y}_{2g} - \bar{\bar{y}}_2)^2; \]

the ED level covariance between $y_1$ and $y_2$, weighted by population size,

\[ \bar{S}_{12} = \frac{1}{M-1} \sum_{g=1}^{M} N_g (\bar{y}_{1g} - \bar{\bar{y}}_1)(\bar{y}_{2g} - \bar{\bar{y}}_2); \]

and hence the ED level correlation coefficient,

\[ \bar{r}_{12} = \frac{\bar{S}_{12}}{\sqrt{\bar{S}_1^2 \bar{S}_2^2}}. \]

The ecological fallacy arises when this ED correlation coefficient, $\bar{r}_{12}$, is not equal to its corresponding individual level value.

The 2% SAR provides individual level census data for `SAR districts'. For reasons of confidentiality, the SAR data do not include ED (or ward) identifiers. The following individual level sample statistics may be calculated for a sample of individuals of size $n$ in a SAR district:

the means of $y_1$ and $y_2$ from individual level data,

\[ \bar{y}_1 = \frac{1}{n} \sum_{i=1}^{n} y_{1i} \quad \text{and} \quad \bar{y}_2 = \frac{1}{n} \sum_{i=1}^{n} y_{2i}; \]

the sample individual level variances of $y_1$ and $y_2$,

\[ S_1^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_{1i} - \bar{y}_1)^2 \quad \text{and} \quad S_2^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_{2i} - \bar{y}_2)^2; \]

the individual level covariance of $y_1$ and $y_2$,

\[ S_{12} = \frac{1}{n-1} \sum_{i=1}^{n} (y_{1i} - \bar{y}_1)(y_{2i} - \bar{y}_2); \]

and hence the individual level correlation from the sample,

\[ r_{12} = \frac{S_{12}}{\sqrt{S_1^2 S_2^2}}. \]

Notice that the statistics calculated from the person level SAR are based on a 2% sample that includes all of the EDs in the SAR district. The area level statistics calculated from the SAS are based on all people included in the census and cover all EDs in the SAR district.

Nearly all variables collected in the census are categorical, and so at the individual level the variable $y_{1i}$ takes the value 1 or 0 depending on whether the $i$th individual is in the category or not. For such variables other measures of association can be used; however, the correlation coefficient is widely employed and so it is worth considering the effects of aggregation on this measure.
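The two sets of statistics defined in this section can be sketched in Python as follows. This is a minimal illustration only: the function names are hypothetical, and the small input lists are made up rather than taken from the SAS or SAR.

```python
import math

def ed_level_statistics(sizes, means1, means2):
    """Population-weighted ED level variances, covariance and correlation from ED means."""
    M = len(sizes)
    N = sum(sizes)
    overall1 = sum(n * m for n, m in zip(sizes, means1)) / N
    overall2 = sum(n * m for n, m in zip(sizes, means2)) / N
    var1 = sum(n * (m - overall1) ** 2 for n, m in zip(sizes, means1)) / (M - 1)
    var2 = sum(n * (m - overall2) ** 2 for n, m in zip(sizes, means2)) / (M - 1)
    cov12 = sum(n * (a - overall1) * (b - overall2)
                for n, a, b in zip(sizes, means1, means2)) / (M - 1)
    return var1, var2, cov12, cov12 / math.sqrt(var1 * var2)

def individual_level_statistics(y1, y2):
    """Sample variances, covariance and correlation from individual level records."""
    n = len(y1)
    mean1 = sum(y1) / n
    mean2 = sum(y2) / n
    var1 = sum((a - mean1) ** 2 for a in y1) / (n - 1)
    var2 = sum((b - mean2) ** 2 for b in y2) / (n - 1)
    cov12 = sum((a - mean1) * (b - mean2) for a, b in zip(y1, y2)) / (n - 1)
    return var1, var2, cov12, cov12 / math.sqrt(var1 * var2)

# Made-up inputs: three EDs with their sizes and means, and four 0/1 individual records.
ed_var1, ed_var2, ed_cov12, ed_r12 = ed_level_statistics(
    [100, 200, 100], [0.4, 0.5, 0.6], [0.1, 0.2, 0.3])
s_var1, s_var2, s_cov12, s_r12 = individual_level_statistics([1, 0, 1, 0], [1, 0, 0, 0])
print(ed_r12, s_r12)
```

Note that the ED level function needs only the ED sizes and means, and the individual level function needs no ED identifiers, mirroring what the SAS and the SAR respectively make available.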

4 Expectations of area level and individual level statistics

Steel, Holt and Tranmer (1996) show that under Model A the expectations of the ED and individual level statistics are

\[ E[\bar{\bar{y}}_1] = E[\bar{y}_1] = \mu_1 \quad \text{and} \quad E[\bar{\bar{y}}_2] = E[\bar{y}_2] = \mu_2 \]

This result implies that ED level or individual level sample means may be used to estimate the population means of $y_1$ and $y_2$. The expectations of the estimates of the population mean are not affected by aggregation, although their variances may differ.

The expectations of the ED level variances of $y_1$ and $y_2$ are:

\[ E[\bar{S}_1^2] = \sigma_1^2 + (\bar{N}^* - 1)\sigma^2_{\nu 1} \quad \text{and} \quad E[\bar{S}_2^2] = \sigma_2^2 + (\bar{N}^* - 1)\sigma^2_{\nu 2} \qquad (4.1) \]

Here $\bar{N}^* = \bar{N} + \frac{\bar{N} - \bar{N}^0}{M - 1}$, where $\bar{N}^0 = \frac{1}{N} \sum_{g=1}^{M} N_g^2$. In practice $\bar{N}^*$ is very close to $\bar{N}$. These results imply that $\bar{S}_1^2$ and $\bar{S}_2^2$ are biased estimates of their corresponding population individual level variances. The bias terms involve the ED level variance components, $\sigma^2_{\nu 1}$ and $\sigma^2_{\nu 2}$, and are multiplied by a coefficient of order $\bar{N}$.
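The coefficient $\bar{N}^*$ in (4.1) depends only on the ED population sizes, so it can be computed directly from the SAS counts. The following Python sketch shows one way to do this; the function names and the ED sizes are hypothetical and serve only as an illustration.

```python
def effective_ed_size(sizes):
    """N-bar-star from (4.1): N-bar plus (N-bar minus N-bar-zero) / (M - 1)."""
    M = len(sizes)
    N = sum(sizes)
    n_bar = N / M                              # average ED population size
    n_zero = sum(s * s for s in sizes) / N     # size-weighted average ED size
    return n_bar + (n_bar - n_zero) / (M - 1)

def expected_ed_variance(total_var, area_var, sizes):
    """Expected ED level variance under Model A, following (4.1)."""
    return total_var + (effective_ed_size(sizes) - 1) * area_var

equal = effective_ed_size([500] * 10)          # equal-sized EDs: coincides with N-bar
unequal = effective_ed_size([100, 200, 300])   # unequal sizes: slightly below N-bar
print(equal, unequal, expected_ed_variance(0.162, 0.005, [500] * 10))
```

With equal ED sizes the two averages coincide and the coefficient reduces to $\bar{N}$; unequal sizes pull it slightly below $\bar{N}$, and the adjustment shrinks as the number of EDs grows.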

The expectations of the corresponding sample individual level variances of $y_1$ and $y_2$ are:

\[ E[S_1^2] = \sigma_1^2 - \left(\frac{\bar{n}^0 - 1}{n - 1}\right)\sigma^2_{\nu 1} \quad \text{and} \quad E[S_2^2] = \sigma_2^2 - \left(\frac{\bar{n}^0 - 1}{n - 1}\right)\sigma^2_{\nu 2} \qquad (4.2) \]

where $\bar{n} = n/m$, with $m$ the number of EDs represented in the sample, and $\bar{n}^0$ is the sample equivalent of $\bar{N}^0$ and is usually very close to $\bar{n}$.

These results show that the sample individual level variances of $y_1$ and $y_2$ are biased estimates of their respective population individual level variances. The bias term also involves the ED level variance component, but this time it is multiplied by a coefficient which is approximately equal to the inverse of the number of areas covered by the sample, which in this case is of order $M^{-1}$. This means that if the number of EDs in the SAR district under study is reasonably large, the bias involved in using the sample individual variance to estimate the corresponding population parameter will be very small. In contrast, the bias in using the ED level variance to estimate the corresponding individual population parameter will usually be considerable, even if $\sigma^2_{\nu 1}$ is small compared with $\sigma_1^2$, because the bias term has a coefficient of order $\bar{N}$, which for EDs is usually about 500.

If the aim of the analysis is to estimate the marginal individual level correlation coefficient between $y_1$ and $y_2$, the covariances must also be considered, since these are also used in the calculation of this coefficient. The expectations of the ED level and individual level sample covariances are:
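To give a feel for the difference in magnitude between the two bias terms in (4.1) and (4.2), the following Python sketch evaluates both for one set of inputs. All values are hypothetical, chosen to be of a plausible order (EDs of about 500 people and roughly ten sampled individuals per ED); the function names are illustrative as well.

```python
def ed_variance_bias(area_var, n_star):
    """Bias of the ED level variance as an estimate of the individual level variance, from (4.1)."""
    return (n_star - 1) * area_var

def sample_variance_bias(area_var, n, n_zero):
    """Bias of the individual level sample variance, from (4.2)."""
    return -(n_zero - 1) / (n - 1) * area_var

area_var = 0.005   # hypothetical ED level variance component
ed_bias = ed_variance_bias(area_var, n_star=500)
sample_bias = sample_variance_bias(area_var, n=4000, n_zero=10)
print(ed_bias, sample_bias)
```

Under these assumptions the ED level bias is several orders of magnitude larger than the sample bias, which is the contrast the text draws between the two sources.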

\[ E[\bar{S}_{12}] = \sigma_{12} + (\bar{N}^* - 1)\sigma_{\nu 12} \quad \text{and} \quad E[S_{12}] = \sigma_{12} - \left(\frac{\bar{n}^0 - 1}{n - 1}\right)\sigma_{\nu 12} \qquad (4.3) \]

respectively. These results show that the covariance term is biased in a similar way to the variances. If the relative biases on the covariance and each of the variances are the same then the aggregation effects will cancel in the calculation of the ED level correlation. However, in general the relative biases can differ between the variances and covariance, so that the ED level correlation is a biased estimate of its corresponding individual level value. Therefore, if researchers assume that the ED level correlation coefficient is a reasonable estimate of the individual level correlation, they risk committing the ecological fallacy.

The results (4.2) and (4.3) imply that the SAR data allow individual level correlations of variables of interest to be reasonably estimated. However, by combining the SAR with the SAS it is possible to gain much more insight into the cause of the aggregation effect, that is, the ED level population variances and covariance $\sigma^2_{\nu 1}$, $\sigma^2_{\nu 2}$ and $\sigma_{\nu 12}$. Moreover, by estimating the variance components for a range of variables, it is possible to identify the variables that are most highly associated with the within area homogeneity. Information on these variables can be used to develop methods for adjusting the results of aggregate data analyses to produce improved estimates of individual level relationships (see Section 8).

5 Investigating aggregation effects

In this section a methodology is described for combining aggregate and individual level data for the same population in order to calculate summary measures of aggregation effects for variances and covariances and assess the effects of aggregation on correlations.

Under Model A, the ED level variance is potentially a seriously biased estimate of its corresponding individual level value. It is therefore useful to develop statistics which summarize the extent of the bias for each variable. A simple measure of the aggregation bias is the `Aggregation Effect', which is defined for $y_1$ as:

\[ Q_1 = \frac{\bar{S}_1^2}{S_1^2} \]

This measure relies on the fact that $S_1^2$, calculated from the individual level data, is a reasonable estimate of $\sigma_1^2$, and the ratio $Q_1$ will give an indication of how much the ED level variance has been affected by aggregation. If there are no aggregation effects for this variable, $Q_1 = 1$.

In the statistical literature, a common measure of within area homogeneity is the intra-ED correlation, which is the correlation between the values of $y_1$ for different individuals in the same ED. For Model A this is

\[ \delta_{11} = \frac{\sigma^2_{\nu 1}}{\sigma^2_{\nu 1} + \sigma^2_{\epsilon 1}} = \frac{\sigma^2_{\nu 1}}{\sigma_1^2} \]

which is also the ratio of the ED level variance to the total variance. This will take the value 0 if $y_1$ has no within area homogeneity (i.e. each ED comprises random values of the variable $y_1$) and 1 if there is perfect homogeneity (i.e. within each ED all the values of $y_1$ are the same, although they may differ between EDs).

Estimation of these intra-ED correlations involves estimating the ED level variance components and the individual level variances. The results (4.1) and (4.2) suggest that the ED level variance components may be estimated for $y_1$ and $y_2$ as:

\[ \hat{\sigma}^2_{\nu 1} = \frac{\bar{S}_1^2 - S_1^2}{\bar{N}^* - 1} \quad \text{and} \quad \hat{\sigma}^2_{\nu 2} = \frac{\bar{S}_2^2 - S_2^2}{\bar{N}^* - 1} \]

respectively. Hence the intra-ED correlations may be estimated as:

\[ \hat{\delta}_{11} = \frac{\hat{\sigma}^2_{\nu 1}}{S_1^2} \quad \text{and} \quad \hat{\delta}_{22} = \frac{\hat{\sigma}^2_{\nu 2}}{S_2^2} \qquad (5.1) \]
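The variance based measures of this section reduce to a few lines of arithmetic once the ED level variance, the individual level sample variance and $\bar{N}^*$ are available. The Python sketch below illustrates this; the function names and the input values are hypothetical.

```python
def aggregation_effect(ed_var, ind_var):
    """Aggregation effect Q: ratio of the ED level variance to the individual level variance."""
    return ed_var / ind_var

def area_variance_component(ed_var, ind_var, n_star):
    """Estimated ED level variance component."""
    return (ed_var - ind_var) / (n_star - 1)

def intra_ed_correlation(ed_var, ind_var, n_star):
    """Estimated intra-ED correlation, as in (5.1)."""
    return area_variance_component(ed_var, ind_var, n_star) / ind_var

# Hypothetical inputs: ED level variance 1.2, individual level variance 0.2, N-bar-star 501.
q = aggregation_effect(1.2, 0.2)
sigma_nu_sq = area_variance_component(1.2, 0.2, 501)
delta = intra_ed_correlation(1.2, 0.2, 501)
print(q, sigma_nu_sq, delta)
```

In this made-up case an intra-ED correlation of only 0.01 corresponds to a sixfold inflation of the variance, because it is multiplied by a coefficient of about 500.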

for $y_1$ and $y_2$ respectively.

For the covariance of $y_1$ and $y_2$ the relevant measure of within area homogeneity is the intra-ED (cross) correlation, which is the correlation between the values of $y_1$ and $y_2$ for different individuals in the same ED. For Model A this is

\[ \delta_{12} = \frac{\sigma_{\nu 12}}{\sqrt{\sigma_1^2 \sigma_2^2}} \]

which can be estimated by

\[ \hat{\delta}_{12} = \frac{\hat{\sigma}_{\nu 12}}{\sqrt{S_1^2 S_2^2}} \quad \text{where} \quad \hat{\sigma}_{\nu 12} = \frac{\bar{S}_{12} - S_{12}}{\bar{N}^* - 1}. \]

The effect of aggregation can be directly related to these measures of ED homogeneity, for example:

\[ \bar{S}_1^2 = S_1^2 \left(1 + (\bar{N}^* - 1)\hat{\delta}_{11}\right), \qquad \bar{S}_{12} = S_{12} \left(1 + (\bar{N}^* - 1)\hat{\delta}_{12} r_{12}^{-1}\right) \qquad (5.2) \]

These results show clearly how the aggregation effects on variances and covariances are due to the area level effects, which result in the intra-area correlations as measured by $\hat{\delta}_{11}$, $\hat{\delta}_{22}$ and $\hat{\delta}_{12}$. Because $\bar{N}^*$ is often large, even small values of these parameter estimates can lead to large aggregation effects.

To obtain estimates of the relevant measures of homogeneity under this approach, it is not necessary to have individual level data with ED level identifiers. It is not even necessary for the ED and individual data to be based on the same sample, although they must relate to the same SAR district. Under Model A the only individual level information required is that which permits the calculation of reasonable estimates of the individual level variances and covariances.

The basic model used here is an example of a multilevel model (Goldstein, 1995). However, the methods used differ from standard multilevel modelling, where the data used are typically individual level and include area identifiers (in this case ED identifiers). Model A assumes that the within ED homogeneity is the same for each ED. This assumption allows the estimation of $\delta_{11}$ from data which do not include ED identifiers. If, in fact, the within ED correlation varies between EDs, $\delta_{11}$ can still be interpreted as an average within ED correlation.

The aggregation effect for correlations can also be related to $\hat{\delta}_{12}$, $\hat{\delta}_{11}$ and $\hat{\delta}_{22}$ as:

\[ \bar{r}_{12} = \frac{r_{12} \left(1 + (\bar{N}^* - 1)\hat{\delta}_{12} r_{12}^{-1}\right)}{\sqrt{\left(1 + (\bar{N}^* - 1)\hat{\delta}_{11}\right)\left(1 + (\bar{N}^* - 1)\hat{\delta}_{22}\right)}} \qquad (5.3) \]

This suggests that the effect of aggregation on correlations will depend on the relative sizes of $\delta_{11}$, $\delta_{22}$ and $\delta_{12}$.

The emphasis so far has been on estimating $\rho_{12}$, but the methods described above produce estimates of the components of the variance and covariance due to ED and pure individual level effects, and hence can produce estimates of $\rho_{\nu 12}$, the pure ED level correlation, and $\rho_{\epsilon 12}$, the pure individual level correlation.

The aggregation effects and measures of homogeneity obtained are specific to the set of EDs analysed. Had different ED boundaries been used, different aggregation effects might have resulted. This phenomenon is often referred to as the zoning problem and is part of the Modifiable Areal Unit Problem (see Openshaw, 1984b). Since our aim is to investigate the causes of the ecological fallacy for the data available, this is not of direct concern here. However, the framework developed shows that different zonings produce different values of statistics because they may produce different degrees of within area homogeneity. For example, for a particular variable the aggregation effect on its variance is maximized if zones (EDs) are formed to maximize the within area homogeneity for that variable. Openshaw (1996) has proposed that, in situations in which the formation of zones is under the researcher's control, geographical features of interest can be identified by forming the zones to maximize, for example, the zone level correlation coefficient for two variables.
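The relationship between the ED level and individual level correlations in (5.3) can be explored numerically. The Python sketch below is an illustration under stated assumptions: the function name and all input values are hypothetical, and the numerator is written as $r_{12} + (\bar{N}^* - 1)\hat{\delta}_{12}$, which is algebraically the same as the numerator of (5.3) but avoids dividing by $r_{12}$.

```python
import math

def ed_level_correlation(r12, delta11, delta22, delta12, n_star):
    """ED level correlation implied by (5.3) from the individual level correlation."""
    k = n_star - 1
    numerator = r12 + k * delta12   # equals r12 * (1 + k * delta12 / r12)
    denominator = math.sqrt((1 + k * delta11) * (1 + k * delta22))
    return numerator / denominator

# Case 1: equal relative effects on both variances and the covariance (delta12 = r12 * delta11).
cancelled = ed_level_correlation(0.2, 0.02, 0.02, 0.2 * 0.02, 501)
# Case 2: a stronger area level association between the two variables.
magnified = ed_level_correlation(0.2, 0.02, 0.02, 0.012, 501)
print(cancelled, magnified)
```

In the first case the aggregation effects cancel and the ED level correlation equals the individual level value of 0.2; in the second, the same individual level correlation is magnified to more than 0.5 at the ED level.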
From (5.3) we can see that maximizing this correlation is not necessarily achieved by maximizing the homogeneity of either or both of the variables, since $\hat{\delta}_{12}$ is also involved. Also, since the ecological correlation involves a contribution from the purely individual level correlation, which is unaffected by different zonings, we suggest that in this type of analysis attention should focus on finding zones that maximize

\[ \frac{\hat{\delta}_{12}}{\sqrt{\hat{\delta}_{11} \hat{\delta}_{22}}} \]

which is an estimate of the pure area level correlation $\rho_{\nu 12}$.
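The pure area level correlation above, and a corresponding plug-in estimate of the pure individual level correlation, can be computed from the estimated intra-ED correlations. The Python sketch below is only an illustration: the function names and inputs are hypothetical, and the individual level formula is an assumption obtained by subtracting the estimated area level components from the individual level sample variances and covariance, expressed in correlation units.

```python
import math

def pure_area_correlation(delta12, delta11, delta22):
    """Estimate of the pure area level correlation from the intra-ED correlations."""
    return delta12 / math.sqrt(delta11 * delta22)

def pure_individual_correlation(r12, delta11, delta22, delta12):
    """Plug-in estimate of the pure individual level correlation (an assumed form).

    Dividing S12 - sigma_nu12 and S1^2 - sigma_nu1^2, S2^2 - sigma_nu2^2 by the
    individual level variances gives r12 - delta12 and 1 - delta11, 1 - delta22.
    """
    return (r12 - delta12) / math.sqrt((1 - delta11) * (1 - delta22))

area_corr = pure_area_correlation(0.012, 0.02, 0.02)
ind_corr = pure_individual_correlation(0.2, 0.02, 0.02, 0.012)
print(area_corr, ind_corr)
```

With these made-up values the area effects are much more strongly correlated than the individual effects, even though both intra-ED correlations are small.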

6 Empirical Investigation

6.1 UK Census Data

The aggregation bias associated with using ED level data to estimate individual correlations was investigated for three SAR districts in England: Manchester; Reigate and Banstead with Tandridge (hereafter `Reigate'); and the Isle of Wight. The aggregate data used were obtained from the SAS data base and the individual data were obtained from the 2% SAR. A variety of socio-economic variables were considered. Some basic characteristics of the three UK districts are given in the first three columns of Table 1.

Although the 2% SAR and SAS may be used to combine individual and ED level information for the same SAR district, there are some minor inconsistencies between these data sources. Firstly, the SAS data often include imputed household information and the SAR data do not, although this inconsistency is likely to be small relative to the differences due to aggregation in statistics calculated from the SAS when compared with corresponding statistics calculated from the SAR. Secondly, as a confidentiality protection, the SAS cross-tabulation totals have had -1, 0 or 1 randomly added: a process known as `Barnardisation'. However, this slight blurring of the aggregate data is unlikely to have much effect on the results. Thirdly, the SAS and SAR may be slightly inconsistent for persons in communal establishments. Therefore this investigation was restricted to `residents in households'.

For some variables, the SAS totals are based on a 10% sample of households rather than on a 100% basis. This applies in particular to characteristics of employment and family structure. The 10% SAS totals were not used because of the added complications of combining 100% and 10% based data, although the methods described here can be modified to handle this situation (Holt & Steel, 1994).

6.2 Australian Census Data

Aggregation effects in Australian census data were also investigated using the same approach described above for the UK data. The Australian data were obtained from the 1986 census for the City of Adelaide, South Australia, which has a larger population than any of the UK SAR districts (see final column of Table 1). The small areas used were Collection Districts (CDs), which have slightly higher populations than EDs, as Table 1 shows. The individual data used were obtained from a 1% sample. Unlike the UK data, all Australian CD totals are based on 100% counts, so that a much wider set of variables was available to be examined without the added complications of combining 100% and 10% based totals. In this paper, however, only those variables common to the UK and Australia are compared. A more detailed investigation into aggregation effects in Adelaide using the full set of Australian census variables has been carried out elsewhere (Holt, Steel & Tranmer, 1996).

Table 1: Characteristics of regions of interest

                                              Area
                                              Isle of Wight   Reigate   Manchester   Adelaide
Total residents in households (1), $N$              120,312   188,700      398,165    916,938
Number of EDs (UK) / CDs (Aus), $M$                     270       371          897       1584
Mean ED population size, $\bar{N}$                    445.6     508.6        443.9      578.9
Microdata (2) sample size, $n$                         2351      3693         7613       9199

Source: 1991 Census SAS and SAR data; 1986 Australian Census Data.
(1) For the UK this is `residents in households'; for Australia this is `all residents'.
(2) For the UK this is the 2% individual SAR; in Australia this is the 1% microdata.

6.3 Results

Table 2 gives the estimated intra-ED (or CD in Australia) correlations for those socio-economic variables common to the UK and Australian census data. These estimates have been obtained using (5.1), and enable an assessment to be made of the effects of aggregation on the variances of each of these variables. The results suggest that some variables exhibit more homogeneity than others, but the relativities of the values are similar across the four areas. The values of female have low intra-ED correlations; this concurs with previous results (see, for example, Lynn & Lievesley, 1991) and is to be expected since there is no obvious substantive reason for females to be clustered in EDs. In contrast, the age groups, in particular the oldest category, show a fair degree of homogeneity. The characteristics of housing (Owner Occupier and Housing Type) are relatively homogeneous in all four areas.

Table 2: Intra-ED (and CD) correlations for 100% census items

                    Area
                    Isle of Wight   Reigate   Manchester   Adelaide
Female                    -0.0004    0.0002       0.0025     0.0030
Age 20-29                  0.0206    0.0201       0.0340     0.0189
Age 30-39                  0.0066    0.0090       0.0053     0.0162
Age 40-59                  0.0082    0.0155       0.0079     0.0183
Age 60 plus                0.0424    0.0323       0.0358     0.0810
Married                    0.0156    0.0103       0.0260     0.0159
Employed                   0.0136    0.0149       0.0286     0.0217
Unemployed                 0.0061    0.0025       0.0202     0.0117
Students                   0.0027    0.0062       0.0424     0.0182
Born UK/Aus (1)            0.0022    0.0068       0.0837     0.0278
Migrant (2)                0.0202    0.0158       0.0538     0.0703
Owner Occupier             0.1266    0.1770       0.3440     0.2209
Housing Type               0.2055    0.1754       0.2928     0.2482

Source: 1991 Census SAS and SAR data; 1986 Australian Census Data.
(1) For the UK this is HoH born UK; for Adelaide this is individual born Australia.
(2) For the UK this is 1 year migrants; for Australia this is 5 year migrants.

As an example of the effects of aggregation on an ED level variance, consider the variable `Age 60 plus' for the Reigate SAR District, which has an individual level variance of 0.162, an intra-ED correlation of 0.0323, and $\bar{N}^* \approx 509$. Using these values with the relationship given in (5.2), a value of 2.82 is obtained for the ED level variance, leading to an aggregation effect, $Q$, of 17.41. This example shows that even an apparently low intra-ED correlation results in a severe aggregation effect when it is multiplied by $(\bar{N}^* - 1)$. The median value of $\hat{\delta}$ in Table 2 is 0.0186; the variable nearest to this value is `Age 40-59' in Adelaide, which has a unit level variance of 0.16 and a CD level variance of 1.85, and hence an aggregation effect of 11.56.

A similar analysis of covariances can be carried out (Holt, Steel, Tranmer & Wrigley, 1996). However, the results for correlations are of more interest. A useful way of examining the aggregation effect for several variables is to plot the ED level correlations, $\bar{r}_{12}$, between the pairs of variables $y_1$, $y_2$ obtained from the SAS, against their corresponding individual level correlations, $r_{12}$, obtained from the SAR. If there are no aggregation effects, this plot will be approximately linear and any departure from the 45 degree line will be due to sampling variation. Figure 1 shows such a plot for a range of 21 socio-economic variables, including the variables listed in Table 2, in Reigate District. If the EDs comprised random groups of individuals then there would be no aggregation effect on the correlations, and the variance of the ED level correlation would be the same as that of a correlation obtained from an individual level sample with sample size equal to the number of EDs (Steel and Holt, 1996b). Using this result it is possible to assess whether the differences between the ED and individual level correlations are greater than could reasonably be explained by random aggregation, and many are. More importantly, the plot shows a considerable departure from the $\bar{r}_{12} = r_{12}$ (i.e. the diagonal) line, and the points lie in an approximate `S' shape, indicating that aggregation effects exist in these data. The `S' shape suggests that a modest individual level correlation tends to be magnified by the effects of aggregation, so that its corresponding ED level value is usually stronger but of the same sign, although this is not always the case. The plots for the other UK and Australian areas exhibit similar `S' shapes (plots not shown).
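The variance figures quoted for Reigate and Adelaide earlier in this section can be checked with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the inputs are the rounded values given in the text, with the coefficient taken to be 509.

```python
def ed_variance_from_delta(ind_var, delta, n_star):
    """ED level variance implied by (5.2) for a given intra-ED correlation."""
    return ind_var * (1 + (n_star - 1) * delta)

# `Age 60 plus' in Reigate: individual level variance 0.162, intra-ED correlation 0.0323.
reigate_ed_var = ed_variance_from_delta(0.162, 0.0323, 509)
reigate_q = reigate_ed_var / 0.162

# `Age 40-59' in Adelaide: unit level variance 0.16, CD level variance 1.85.
adelaide_q = 1.85 / 0.16

print(round(reigate_ed_var, 2), round(reigate_q, 2), adelaide_q)
```

The results agree with the reported ED level variance of 2.82 and aggregation effects of 17.41 and 11.56 to two decimal places.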

Figure 1: ED level vs. Individual level correlations for Reigate district (scatter plot; horizontal axis: Individual Level Correlations, vertical axis: ED Level Correlations, both running from -1.0 to 1.0)

7 A model which incorporates `grouping variables'

The results of the UK and Australian census data analysis clearly indicate that aggregation effects exist in the ED level variances and correlations because of the within ED homogeneity. Under Model A this within area homogeneity is included as unobserved area level effects. Suppose that these apparent area level effects are largely due to a set of variables, $z$, which are called `grouping variables'. If this is the case, the within area homogeneity of the variables of interest, say $y_1$ and $y_2$, may be due to their association with the grouping variables, $z$. The grouping variables may be those characteristics that determine where a person lives, or common ED influences, or a mixture of the two. They may be a subset of the variables of interest, or a set of auxiliary variables that are not directly of interest. Model A may be extended to include the grouping variables as follows:

Model B

$$ y_{1i} = \tilde{\mu}_1 + \beta_{1z}' z_i + \tilde{\nu}_{1g} + \tilde{\epsilon}_{1i}, \qquad i \in g $$

where

$y_{1i}$ represents the value of $y_1$ for the $i$th individual,

$\tilde{\mu}_1$ is the mean of $y_1$ across the population, conditional on the grouping variables, $z$,

$z_i$ is the vector of values of the grouping variables for the $i$th individual,

$\beta_{1z}$ is the vector of coefficients that relates $y_1$ to the grouping variables, $z$,

$\tilde{\nu}_{1g}$ is a random variable representing the area effect conditional on the grouping variables, $z$,

$\tilde{\epsilon}_{1i}$ is a random variable representing the pure individual effect conditional on the grouping variables, $z$.

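In Model B it is assumed that

$$ E[\tilde{\nu}_{1g} \mid z] = 0, \qquad \mathrm{Var}[\tilde{\nu}_{1g} \mid z] = \sigma^2_{\nu,1|z} $$

$$ E[\tilde{\epsilon}_{1i} \mid z] = 0, \qquad \mathrm{Var}[\tilde{\epsilon}_{1i} \mid z] = \sigma^2_{\epsilon,1|z} $$

It is also assumed that, conditional on $z$, the individual and group effects, and the individual effects for different individuals, are uncorrelated, that is:

$$ \mathrm{Cov}[\tilde{\nu}_{1g}, \tilde{\epsilon}_{1i} \mid z] = 0, \qquad \mathrm{Cov}[\tilde{\epsilon}_{1i}, \tilde{\epsilon}_{1j} \mid z] = 0 \quad \text{for } i \neq j $$

Under Model B, the following properties of $y_{1i}$ may be obtained (Steel, Holt & Tranmer, 1996).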

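$$ E[y_{1i} \mid z] = \tilde{\mu}_1 + \beta_{1z}' z_i $$

$$ \mathrm{Var}[y_{1i} \mid z] = \sigma^2_{\nu,1|z} + \sigma^2_{\epsilon,1|z} = \sigma^2_{1|z} $$

$$ \mathrm{Cov}[y_{1i}, y_{1j} \mid z] = \begin{cases} \sigma^2_{\nu,1|z} & \text{if } i \text{ and } j \text{ are from the same ED} \\ 0 & \text{otherwise} \end{cases} \qquad (i \neq j) $$

The expectation of the ED level variance of $y_1$, conditional on $z$, is

$$ E[S_1^2 \mid z] = \sigma_1^2 + \beta_{1z}' (S_{zz} - \Sigma_{zz}) \beta_{1z} + (N - 1)\sigma^2_{\nu,1|z} \qquad (7.1) $$

Here $\Sigma_{zz}$ is the population individual level variance-covariance matrix of the grouping variables, and $S_{zz}$ is the corresponding ED level variance-covariance matrix of the grouping variables.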
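The decomposition in (7.1) can be illustrated on synthetic data. The sketch below is a simulation under stated assumptions, not part of the census analysis: EDs are of equal size, the ED level variance is taken to be the ED size times the variance of the ED means, and the coefficients and variance components are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_eds, n_per_ed = 2000, 50
beta = np.array([0.5, -0.3])   # hypothetical coefficients for y1 on z
var_nu, var_eps = 0.02, 1.0    # hypothetical conditional variance components

# Grouping variables with an area component, so z is homogeneous within EDs.
z = rng.normal(0, 0.5, (n_eds, 1, 2)) + rng.normal(0, 1, (n_eds, n_per_ed, 2))
nu = rng.normal(0, np.sqrt(var_nu), (n_eds, 1))            # area effect
eps = rng.normal(0, np.sqrt(var_eps), (n_eds, n_per_ed))   # individual effect
y = 1.0 + z @ beta + nu + eps                              # Model B

# ED level quantities: ED size times the (co)variance of the ED means.
s1_sq = n_per_ed * np.var(y.mean(axis=1), ddof=1)
s_zz = n_per_ed * np.cov(z.mean(axis=1), rowvar=False)

# Individual level quantities.
sigma1_sq = np.var(y, ddof=1)
sigma_zz = np.cov(z.reshape(-1, 2), rowvar=False)

# Right-hand side of (7.1) with the known residual area variance.
rhs = sigma1_sq + beta @ (s_zz - sigma_zz) @ beta + (n_per_ed - 1) * var_nu
print(round(s1_sq, 2), round(rhs, 2), round(sigma1_sq, 2))
```

In this setup the ED level variance is several times the individual level variance, and most of the excess comes from the grouping variable term rather than from the residual area effect.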

There are two bias terms in (7.1). The first term, $\beta_{1z}' (S_{zz} - \Sigma_{zz}) \beta_{1z}$, is due to the aggregation effects of the grouping variables and depends on the relationship of the variable of interest, $y_1$, with the grouping variables, $z$. The second bias term, $(N - 1)\sigma^2_{\nu,1|z}$, involves the residual ED variance component, having allowed for the grouping variables. If the grouping variables explain most of the aggregation effects in the variables of interest, this residual term will be small. A similar result may be obtained for the ED level covariance of $y_1$ and $y_2$:

$$ E[S_{12} \mid z] = \sigma_{12} + \beta_{1z}' (S_{zz} - \Sigma_{zz}) \beta_{2z} + (N - 1)\sigma_{\nu,12|z} \qquad (7.2) $$

In (7.2) the term $\beta_{1z}' (S_{zz} - \Sigma_{zz}) \beta_{2z}$ is due to the aggregation effects of the grouping variables and depends on the relationship of the variables of interest, $y_1$ and $y_2$, with the grouping variables, $z$.

Steel and Holt (1996a) suggest a method of identifying those variables that may be important grouping variables from a set of variables for which both the ED level and the individual level variance-covariance matrices are available. The method finds those linear combinations which have maximum aggregation effect subject to being independent at the ED and individual level. These linear combinations are called Canonical Grouping Variables (CGVs).
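One way to read this description is as a generalized eigenproblem: find coefficient vectors $a$ that maximise the ratio of the ED level variance $a' S_{zz} a$ to the individual level variance $a' \Sigma_{zz} a$, with successive combinations uncorrelated under both matrices. The sketch below solves that problem with NumPy. It is an interpretation offered for illustration and may differ in detail from the procedure of Steel and Holt (1996a); the function name and the example matrices are hypothetical.

```python
import numpy as np


def canonical_grouping_variables(s_zz, sigma_zz):
    """Solve s_zz a = lambda * sigma_zz a.

    s_zz must be symmetric and sigma_zz symmetric positive definite.
    Returns the eigenvalues (largest first) and the coefficient vectors
    as columns, scaled so that a' sigma_zz a = 1.
    """
    chol = np.linalg.cholesky(sigma_zz)           # sigma_zz = chol @ chol.T
    chol_inv = np.linalg.inv(chol)
    whitened = chol_inv @ s_zz @ chol_inv.T       # symmetric, same eigenvalues
    eigvals, eigvecs = np.linalg.eigh(whitened)   # returned in ascending order
    order = np.argsort(eigvals)[::-1]
    coefs = chol_inv.T @ eigvecs[:, order]
    return eigvals[order], coefs


# Hypothetical matrices for two candidate variables.
sigma_zz = np.array([[1.0, 0.3], [0.3, 1.0]])   # individual level
s_zz = np.array([[5.0, 1.0], [1.0, 2.0]])       # ED level
effects, coefs = canonical_grouping_variables(s_zz, sigma_zz)
print(effects)
```

Each eigenvalue is the ratio of ED level to individual level variance for its combination, so the leading columns of `coefs` point to the strongest dimensions of aggregation.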

Using a CGV analysis which combines data from the SAR and SAS, it is possible to identify the main dimensions of aggregation in the census variables. Results of this analysis suggest that in Reigate District the oldest age groups and housing characteristics could be regarded as grouping variables (Steel, Holt & Tranmer, 1996). The results for the other UK districts and Australia suggested similar grouping variables, with non-white also appearing to be particularly important in Manchester. This consistency of grouping variables is encouraging as it suggests that much of the within area homogeneity in socio-economic data may be explained by a common set of variables.

8 Adjustments

In situations where individual level data are not available for the variables of interest, Holt, Steel, Tranmer & Wrigley (1996) show that under Model B an estimate of $\sigma_1^2$, which allows for the grouping variables, may be obtained using:

$$ \hat{\sigma}_1^2(z) = S_1^2 - B_{1z}' (S_{zz} - \hat{\Sigma}_{zz}) B_{1z} \qquad (8.3) $$

For example, suppose that some non-census data, such as morbidity rates, are available for EDs or wards in a SAR district, or in another region which matches the SAR geography. In this case the morbidity rates could be linked to the ED level census data from the SAS. The estimator (8.3) may be used to adjust the ED level analyses using data from the SAR as follows: $S_1^2$ is the ED level variance of $y_1$ calculated from the morbidity data; $B_{1z}$ is the vector of regression coefficients relating the variable of interest, $y_1$, to the grouping variables, calculated at the ED level from the morbidity rates and the SAS respectively; $S_{zz}$ is the ED level covariance matrix of the grouping variables obtained from the SAS; $\hat{\Sigma}_{zz}$ is an estimate of the population individual level variance-covariance matrix, $\Sigma_{zz}$, which may be obtained from the SAR. The term $B_{1z}' (S_{zz} - \hat{\Sigma}_{zz}) B_{1z}$ removes the aggregation bias due to the grouping variables $z$. This means that an improved estimate of the individual level variance of $y_1$ should be obtained. Under Model B, an estimate of $\sigma_{12}$, the covariance of $y_1$ and $y_2$, may also be obtained as:

$$ \hat{\sigma}_{12}(z) = S_{12} - B_{1z}' (S_{zz} - \hat{\Sigma}_{zz}) B_{2z} $$

where $S_{12}$ is the ED level covariance of $y_1$ and $y_2$, calculated from the administrative source. Adjusted correlations may then be obtained using:

$$ \hat{r}_{12}(z) = \frac{\hat{\sigma}_{12}(z)}{\sqrt{\hat{\sigma}_1^2(z)\, \hat{\sigma}_2^2(z)}} \qquad (8.4) $$

In general, the aggregate data used in this adjustment can come from one source and the estimate of $\Sigma_{zz}$ may come from another source, provided that the two sources relate to the same population, and aggregate and individual level data are available for the grouping variables. This means that the SAR is potentially useful for adjusting for aggregation effects in data from sources other than the census whose populations match the SAR geography.
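The adjustment can be sketched on synthetic data as follows. This is an illustration under simplifying assumptions rather than the analysis reported in the paper: EDs are of equal size, there is a single grouping variable, $y_1$ and $y_2$ have no residual area effect, and a 2% simple random sample stands in for the SAR. All function and variable names are hypothetical.

```python
import numpy as np


def adjusted_correlation(s_yy, s_zy, s_zz, sigma_zz_hat):
    """Adjusted correlation of y1 and y2, following the form of (8.3)-(8.4).

    s_yy: 2 x 2 ED level covariance matrix of (y1, y2)
    s_zy: p x 2 ED level covariances between z and (y1, y2)
    s_zz: p x p ED level covariance matrix of z
    sigma_zz_hat: p x p individual level estimate for z
    """
    b = np.linalg.solve(s_zz, s_zy)                # ED level regression coefficients
    adj = s_yy - b.T @ (s_zz - sigma_zz_hat) @ b   # adjusted variances and covariance
    return adj[0, 1] / np.sqrt(adj[0, 0] * adj[1, 1])


def ed_level_cov(x, n_per_ed):
    """ED size times the covariance matrix of ED means (equal-sized EDs)."""
    return n_per_ed * np.cov(x.mean(axis=1), rowvar=False)


rng = np.random.default_rng(1)
n_eds, n_per_ed = 1000, 50
# One grouping variable with an area component, so z is homogeneous within EDs.
z = rng.normal(0, 0.5, (n_eds, 1, 1)) + rng.normal(0, 1, (n_eds, n_per_ed, 1))
y = 0.5 * z + rng.normal(0, 1, (n_eds, n_per_ed, 2))   # y1, y2 depend on z only

both = np.concatenate([z, y], axis=2)                  # columns: z, y1, y2
s = ed_level_cov(both, n_per_ed)
s_zz, s_zy, s_yy = s[:1, :1], s[:1, 1:], s[1:, 1:]

# Individual level estimate for z from a 2% sample of individuals.
flat = both.reshape(-1, 3)
sample = flat[rng.choice(len(flat), size=len(flat) // 50, replace=False)]
sigma_zz_hat = np.cov(sample[:, :1], rowvar=False).reshape(1, 1)

r_indiv = np.corrcoef(flat[:, 1], flat[:, 2])[0, 1]
r_ed = s_yy[0, 1] / np.sqrt(s_yy[0, 0] * s_yy[1, 1])
r_adj = adjusted_correlation(s_yy, s_zy, s_zz, sigma_zz_hat)
print(round(r_indiv, 2), round(r_ed, 2), round(r_adj, 2))
```

Because the simulated variables have no residual area effect, the adjusted correlation should land close to the individual level value; with real data the residual components would leave some bias.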

In order to evaluate the effectiveness of the adjustment, an empirical analysis was carried out in which individual level information on the grouping variables from the SAR was used to adjust ED level variances and covariances, and hence correlations, from the SAS. The adjusted ED level correlations could then be compared with individual level correlations calculated from the SAR, which are reasonable estimates of the population individual level correlations. The correlations of the 21 census variables previously considered in Section 6 were examined for the population of Reigate District. Seven grouping variables were used, three of which are personal characteristics: age 45-59; age 60 plus; non-white. The remaining four may be broadly regarded as housing characteristics: owner occupied tenure; local authority rented tenure; good amenities; housing type.

Figure 2 is a plot of the adjusted ED level correlations obtained from (8.4) using the SAS and SAR, $\hat{r}_{12}(z)$ (vertical axis), against their corresponding individual level values obtained from the SAR, $r_{12}$ (horizontal axis). If a point lies close to the diagonal line, it has been adjusted well by the methodology described above. Comparing this plot with Figure 1 (the unadjusted ED level correlations against their corresponding individual level values), the points on Figure 2 appear much more linear and virtually all of the `S' shape has been removed. This suggests that a considerable improvement can be made in the estimation of individual level correlations from ED level data by allowing for grouping variables.

Figure 2: Adjusted ED level vs. individual level correlations for Reigate district. (Scatter plot: adjusted ED level correlations on the vertical axis against individual level correlations on the horizontal axis, both running from -1.0 to 1.0.)

9 Conclusions

By combining data from the SAS and SAR under a statistical model it is possible to understand the causes of the ecological fallacy and aggregation effects in an ED level analysis of UK census data. Estimation of the relevant measures of within ED homogeneity would not be possible without individual level data provided by the SAR. A key feature of the methods developed here is that they have enabled the SAR to be used even though it does not contain ED identifiers. Information from the SAR and SAS may also be used together to identify those variables that might explain much of the ED effects: the `grouping variables'. A methodology for the adjustment for aggregation effects has been demonstrated using SAS and SAR data, and the results suggest that this methodology is effective in adjusting ED level correlations. As well as being a source of the individual level data for the adjustment of the SAS, the SAR also provides a basis for comparing aggregate, adjusted and individual level statistics.

A common set of grouping variables appears to explain aggregation effects in census data for three different `SAR districts' and one Australian city. This consistency is encouraging as it suggests that the same set of variables may explain aggregation effects in other socio-economic data. If this is the case, the SAR is a useful source of the individual level information which may be used to adjust aggregate level analyses of non-census socio-economic data whose populations match the SAR geography, such as ED or ward level morbidity rates for SAR districts.

The theoretical and empirical results given here suggest that a researcher wishing to estimate marginal individual level correlations, who has only aggregate data for EDs available for the variables of interest, should attempt to obtain ED level information on the age, housing and ethnic group structure of the EDs, possibly from the SAS data. Using the SAR, the corresponding individual level variance-covariance matrix can also be obtained and the adjusted estimator $\hat{r}_{12}(z)$ can be calculated. The results here suggest that $\hat{r}_{12}(z)$ should be closer to $r_{12}$ than $\bar{r}_{12}$ is. However, the bias due to the residual ED level components $\sigma^2_{\nu,1|z}$, $\sigma^2_{\nu,2|z}$ and $\sigma_{\nu,12|z}$ will remain.

Acknowledgements

This research was supported by the Economic and Social Research Council (ESRC) (Grant number R 000236135). The authors would like to thank an anonymous referee and several of the participants of the `Research Value of Census Microdata' conference for their helpful comments.

References

Duncan C, Jones K and Moon G, 1993, Modelling ecologies: the multilevel model as a general framework for analysing census data. Paper presented at the 1991 Census Data Conference, University of Newcastle, September 1993

Fotheringham A and Wong D, 1991, "The modifiable areal unit problem in multivariate statistical analysis" Environment and Planning A 23, 1025-1044

Goldstein H, 1995, Multilevel Statistical Models (Edward Arnold, London)

Holt D and Steel D, 1994, Statistical Analysis of Aggregate Data from Overlapping Samples. University of Wollongong, Department of Applied Statistics, Preprint 2/94

Holt D, Steel D and Tranmer M, 1996, "Area homogeneity and the modifiable areal unit problem" Geographical Systems 3, 181-200

Holt D, Steel D, Tranmer M and Wrigley N, 1996, "Aggregation and ecological effects in geographically based data" Geographical Analysis 28, 244-261

Jones K and Duncan C, 1996, "People and places: the multilevel model as a general framework for the quantitative analysis of geographical data" in Spatial Analysis: Modelling in a GIS Environment Ed P. Longley and M. Batty (GeoInformation International, Cambridge) pp 79-84

Lynn P and Lievesley D, 1991, Drawing General Population Samples in Great Britain (SCPR, London)

Openshaw S, 1984a, "Ecological fallacies and the analysis of areal census data" Environment and Planning A 16, 17-31

Openshaw S, 1984b, Concepts and Techniques in Modern Geography 38. The Modifiable Areal Unit Problem (GeoBooks, Norwich)

Openshaw S, 1996, "Developing GIS-relevant zone based spatial analysis methods" in Spatial Analysis: Modelling in a GIS Environment Ed P. Longley and M. Batty (GeoInformation International, Cambridge) pp 55-78

Openshaw S and Taylor P, 1979, "A million or so correlation coefficients: three experiments on the modifiable areal unit problem" in Statistical Applications in the Social Sciences Ed N. Wrigley (Pion, London) pp 127-144

Steel D and Holt D, 1996a, "Analysing and adjusting aggregation effects: the ecological fallacy revisited" International Statistical Review 64, 39-60

Steel D and Holt D, 1996b, "Rules for random aggregation" Environment and Planning A 28, 957-978

Steel D, Holt D and Tranmer M, 1996, "Making unit level inferences from aggregated data" Survey Methodology 22, 3-15