Original Paper Received: December 4, 2000 Revision received: March 23, 2001 Accepted: April 4, 2001

Hum Hered 2001;52:191–200

Sample Size Considerations in Genetic Polymorphism Studies

Chandrika B-Rao

Functional Genomics Unit, Centre for Biochemical Technology (CSIR), Delhi University Campus, Delhi, India

Abstract

Objectives: Molecular studies for genetic polymorphisms are being carried out for a number of different applications, such as genetic disorders in different populations, pharmacogenomics, genetic identification of ethnic groups for forensic and legal applications, genetic identification of breed/stock in animals and plants for commercial applications and conservation of germ plasm. In this paper, for a random sampling scheme, we address two questions: (A) What should be the minimum size of the sample so that, with a prespecified probability, all alleles at a given locus (or haplotypes at a given set of loci) are detected? (B) What should be the sample size so that the allele frequency distribution at a given locus (or haplotype frequency distribution at a given set of loci) is estimated reliably within permissible error limits? Methods: We have used combinatorial probabilistic arguments and Monte Carlo simulations to answer these questions. Results: We found that the minimum sample size required in case A depends mainly on the prespecified probability of detecting all alleles, while in case B, it varies greatly depending on the permissible error in estimation (which will vary with the application). We have obtained the minimum sample sizes for different degrees of polymorphism at a locus under high stringency, as well as a relaxed level of permissible error. We present a detailed sampling procedure for estimating allele frequencies at a given locus, which will be of use in practical applications. Conclusion: Since the sample size required for reliable estimation of allele frequency distribution increases with the number of alleles at the locus, there is a strong case for using biallelic markers (like single nucleotide polymorphisms) when the available sample size is about 800 or less. Copyright © 2001 S. Karger AG, Basel

Key Words: Sample size · Genetic polymorphisms · Allele frequency distribution estimation · Population studies

Chandrika B-Rao, Functional Genomics Unit, Centre for Biochemical Technology (CSIR), Delhi University Campus, Mall Road, Delhi 110 007 (India). Tel. +91 11 766 6159, Fax +91 11 766 7471, E-Mail [email protected]

Introduction

Molecular studies for genetic polymorphisms are being carried out for a number of different applications, such as anthropological and evolutionary studies [1–9], genetic disorders in different populations [10–17], pharmacogenomics [18–21], genetic identification of ethnic groups for forensic and legal applications [22–24], genetic identification of breed/stock in animals and plants for commercial applications [25] and conservation of germ plasm and applications to crop improvement [26–29]. In designing these studies, an important issue is the minimum or optimum size of random sample required so that all the alleles/genotypes at one or more loci are observed with a certain prespecified probability [26–28, 31], and so that allele frequency distributions can be reliably estimated by statistical methods [31].

In most methods of genetic linkage analysis, the inferences can be sensitive to errors in the population marker allele frequencies used, often leading to an increase in the false-positive evidence for linkage. An example of how marker allele frequency misspecification can drastically affect the inference of linkage has been provided [31]. While one can avoid estimating marker allele frequencies and attendant problems if one has adequate genotyping information, this is not always possible. In a genome-wide scan using 386 markers to detect regions linked with age-related maculopathy [32], a sample of 422 chromosomes (equivalently, 211 individuals) was used to estimate marker allele frequencies. Despite the apparently large sample size, some of the alleles at marker loci were not detected by the sample. However, for genetic linkage analysis of quantitative traits using the variance components method, it has been shown that misspecification of marker allele frequency distribution does not significantly affect the conclusion of linkage but the observed variations may be very large [33]. For an isolated population of about 8,000 individuals, it has been reported [34] that, using a resampling method, samples of 20 individuals showed the same allele frequency distribution as the original 55 unrelated individuals at 6 tetranucleotide microsatellite markers; such low sample sizes are unlikely to be adequate for general, outbreeding populations.

Another motivation for this study came from several published papers in the area of phylogenetic analysis, where the study designs seem to have either too large (e.g. some isozyme studies on fish species) or too small sample sizes. For example, Jorde et al. [3] report a phylogenetic analysis of 13 human populations based on 30 tetranucleotide repeat polymorphisms. The sample sizes from some of these populations are as small as 5 individuals (e.g., Biaka Pygmy, Mbuti), whereas the degree of polymorphism at some of the loci (e.g., D19S400, D5S580) ranges from 12 to 22 alleles. A sample of 5 persons (10 chromosomes) cannot contain all the alleles at these loci, and ‘allele frequencies’ based on such small sample sizes have no meaning. Although, in a particular case, the final results of phylogenetic analysis may not deviate from the expected pattern, it is not good practice to use a poor study design if better alternatives are available.

In this paper, we have addressed two related questions: (A) Given the degree of polymorphism at a locus, that is, the number of alleles (K) and allele frequencies [f = (f1, f2, ..., fK)] at that locus [30], what is the minimum sample size (N) required to ensure that all the alleles are detected with a given prespecified probability (π)? (B) What is the minimum sample size required so that the sample allele frequencies are good estimates of the true population allele frequencies? We have also outlined a sequential sampling procedure for estimating true population allele frequencies using the answers to these questions.

These issues have earlier been treated by Ott [31] in the context of selecting suitable markers for human gene mapping. In this paper, we present a simple thumb-rule for a solution to question A, when the allele frequencies are (nearly) equal. Whereas Ott [31] provides an approximate formula for calculating the probability of finding all alleles at a locus, we have developed a software package to find the exact probability. Furthermore, Ott’s [31] formula for question B provides a lower bound (a necessary but not sufficient value) on the sample size, which is about an order of magnitude smaller than the (sufficient) value found by us. The results of his method are compared with ours.

To answer the first question, we used a combinatorial statistical mechanics approach involving multinomial probabilities. The multinomial formula is simplified in two special cases: when we use a univariate simplification to find the minimum sample size required to find the rarest allele (and assume that if the sample size is large enough to find this, it must be large enough to find alleles of higher frequency also), and when all alleles are equally frequent. These values provide lower bounds for the required sample size.

For the second question, instead of the analytical approach used by Ott [31], we use a Monte Carlo simulation method. For different population allele frequencies, we simulate M samples each of different sizes (N), and estimate the sample allele frequencies for the given sample size. Whereas Ott [31] uses an average error in allele frequencies (i.e., the actual error may be more), we take the actual permissible errors. We expect the permissible error bounds on allele frequencies to be different in different applications, based on the required biological power of discrimination between allele frequency distributions. Hence, they have to be specified individually for each application. In this paper, we have illustrated examples with different permitted error bounds ranging from 0.005 to 0.05.

Minimum Sample Size for Detecting All Alleles

Plant geneticists have made several efforts to answer this and variations of this question. They have generally used probabilistic arguments and based the calculations on different genetic models and different types of reproductive behavior [26–28]. To aid selection of markers for human gene mapping, ways of calculating sample sizes which would ensure that all alleles at the locus are detected with a high probability have been proposed [31].

Consider the following scenario: it is desired to find allele frequencies at a certain locus in a given population by any of the available experimental techniques such as DNA profiling and DNA sequencing. It is not known how many different alleles occur at the locus in the population. The size of the random sample we take from the population should be adequate to detect all the alleles at the locus; however, we may not mind missing some rare alleles whose frequency is less than a small value, ε (= 0.01, say). In the absence of any information, one could take a pilot sample of size N0 (whose value may be decided in some way; for example, by using equation 1 below) and find the number of alleles and their frequencies. Assuming these to be the true population allele frequencies, what is the probability that there exists a rare allele which was not detected by the sample?

For example, at the alcohol dehydrogenase 2 locus, the intron 3 RsaI polymorphism is an absence (allele A1) or presence (allele A2) polymorphism and hence can have at most 2 allelic forms. In the Ticuna population [35, 36], a sample of 66 individuals did not show the A2 allele [37]. The allele frequency profile for this population is therefore (1.0, 0.0). We ask the question: is it possible that the 2nd allele occurs rarely in the population and was not detected by the sample? To answer this, we assume that the true population allele frequency distribution is (0.99, 0.01). We have,

$$\Pr(\text{all } K \text{ alleles with frequency distribution } f \text{ are detected in sample of size } 2N) = (2N)! \sum_{n} \prod_{i=1}^{K} \frac{f_i^{\,n_i}}{n_i!} \tag{1}$$

where the sum runs over all $n = (n_1, n_2, \ldots, n_K)$ with $\sum_{i=1}^{K} n_i = 2N$, $n_i$ being a positive integer for each $i = 1, 2, \ldots, K$.

With K = 2, N = 66 and f = (0.99, 0.01) in equation 1, the probability of finding both the alleles is 0.73. Thus, there is a 27% chance of having missed the 2nd (rare) allele in the Ticuna population, merely because of the small size of the sample. An approximation to equation 1 was proposed [31, equation 6], which gives the probability of finding both the alleles with a sample of 66 individuals to be 0.71 (a relative error of 2.7% in the approximation).

The next question is, what should be the size of sample which will detect both alleles with a high probability, π (= 0.99, say)? To solve this, we set equation 1 equal to π, and solve for N. As this is an implicit equation in N, we need to use a numerical computational method to solve it. Before doing this, we consider two simplifications that may provide good solutions in special cases.

(a) Assuming a situation where the occurrence in the sample of only the rarest allele (of frequency fmin) is in question, the minimum sample size (Nmin) is given by

$$N_{\min} \ge \frac{\log(1 - \pi)}{2 \log(1 - f_{\min})} \tag{2}$$

In some situations, as when there is only one rare allele and the other alleles are nearly equally likely, the above equation itself provides the required value of minimum sample size. For example, in the Ticuna population considered earlier, using this equation, the minimum sample size required to have a 99% probability of finding the rare allele is 230 persons as against the 66 persons actually sampled. In other situations, it provides a lower bound to the sample size.

(b) Assuming that all alleles are equally likely, the probability of finding all of them in a sample of size 2N is the same as the probability that, in the classical occupancy or pigeonhole problem, all boxes are occupied when 2N indistinguishable balls are placed in them at random [38, equation 2.3, p. 102].

$$\Pr(\text{all } K \text{ alleles of nearly equal frequency are detected in sample of size } 2N) = \sum_{v=0}^{K} (-1)^{v} \binom{K}{v} \left(1 - \frac{v}{K}\right)^{2N} \tag{3}$$

To ensure a probability of π of finding all equally frequent alleles, equation 3 is set equal to π and solved for N. A good approximation to N by this formula for π = 0.99 is given by 4K (fig. 1) for K ≤ 35.

To solve equation 1 for N, an initial solution (lower bound) was taken to be the maximum of the two values found from equations 2 and 3 above. One can also use equation 6 of Ott [31] to find an initial approximation for the number of alleles required. For different allele frequency distributions, the probability of finding all the alleles is computed for different values of N (fig. 2). It can be seen that, for up to 5 alleles (including the worst case of 4 rare alleles of frequency 0.05 out of the 5 alleles), samples of 45 or more individuals will be able to detect all alleles with a probability of at least 95%. However, when the frequency of a single rare allele reduces to 0.01, the minimum sample size required is 230 individuals (from equation 2) for the same probability.

Fig. 2. For different degrees of polymorphism, the probability of detecting all alleles is plotted for different sample sizes. The plots for K = 2 and K = 3 with one rare allele coincide as do the plots for K = 3 and K = 4 with two rare alleles. The horizontal line represents the cut-off of 99% probability.

In some applications, alleles of frequency less than 0.05 may not be of great biological interest, except when they are responsible for genetic disorders. In the case of repeat length polymorphisms, some alleles of very low frequency (less than 0.01) may also be observed. The major results of our numerical simulations are summarized in a later section.

To return to our scenario, a sample of N1 additional individuals (such that N0 + N1 = N* computed by equation 1) can then be drawn to see if any additional alleles exist at the locus. If new alleles are found, the process has to be repeated till no more new alleles are found. At this stage, we can be quite confident that we know how many alleles occur in the population at the locus. However, the allele frequency found with this sample size may not be a good estimate of population allele frequencies.
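The calculations behind equations 1 to 3 can be sketched in a few lines of Python. This is only an illustration and not the POLYSAMP program: the function names are hypothetical, and the multinomial sum of equation 1 is evaluated through the equivalent inclusion-exclusion over sets of missing alleles (an assumption made here to avoid enumerating all count vectors), which is practical only for a small number of alleles.

```python
from itertools import combinations
from math import ceil, comb, log


def prob_all_detected(freqs, n_individuals):
    """Probability that every allele appears among 2N sampled chromosomes.

    Uses inclusion-exclusion over the sets of alleles that could be missing,
    which gives the same value as the multinomial sum in equation 1.
    The cost grows as 2**K, so this suits small K only.
    """
    draws = 2 * n_individuals
    total = 0.0
    for r in range(len(freqs) + 1):
        for missing in combinations(freqs, r):
            # Chance that a single draw avoids all alleles in `missing`;
            # clamp at 0 to absorb floating-point rounding.
            remaining = max(1.0 - sum(missing), 0.0)
            total += (-1) ** r * remaining ** draws
    return total


def n_min_rarest(f_min, pi):
    """Equation 2: individuals needed to see the rarest allele with probability pi."""
    return ceil(log(1.0 - pi) / (2.0 * log(1.0 - f_min)))


def prob_all_detected_equal(k, n_individuals):
    """Equation 3: occupancy probability for k equally frequent alleles."""
    draws = 2 * n_individuals
    return sum((-1) ** v * comb(k, v) * (1.0 - v / k) ** draws
               for v in range(k + 1))


def min_sample_size(freqs, pi, start=1):
    """Smallest N (individuals) with detection probability >= pi.

    A plain upward search; `start` can be set to a lower bound such as
    the value from equation 2 to shorten the search.
    """
    n = max(start, 1)
    while prob_all_detected(freqs, n) < pi:
        n += 1
    return n


if __name__ == "__main__":
    # Ticuna example from the text: K = 2, N = 66, f = (0.99, 0.01)
    print(round(prob_all_detected([0.99, 0.01], 66), 2))
    print(n_min_rarest(0.01, 0.99))
    # Worst case discussed in the text: 4 rare alleles of frequency 0.05
    print(min_sample_size([0.80, 0.05, 0.05, 0.05, 0.05], 0.99))
```

Because the detection probability can only grow as more chromosomes are sampled, the simple upward search in `min_sample_size` stops at the smallest adequate N; a bisection would be faster for very rare alleles.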

Fig. 1. Plot of sample size (N) required to have 99% chance of detecting all alleles at a locus, for different numbers (K) of alleles, assuming all alleles are equally frequent.

Sample Size for Reliable Estimation of Allele Frequency Distribution

Estimation of allele frequency distribution is a classical problem in statistical genetics, and several methods including maximum likelihood estimates (MLE) have been proposed [39]. However, the MLE is a good estimate only when the sample size is sufficiently large, when it may be assumed to satisfy the Hardy-Weinberg equilibrium condition as also the statistical condition of unbiasedness [39]. Statisticians typically recommend that reported allele frequencies should be accompanied by their confidence intervals, but this is seldom done in practice. A very wide confidence interval (as found from small samples) reduces the reliability of the estimated value. By rearranging the terms in the equations for confidence interval, one can calculate the sample size necessary for obtaining the desired confidence interval [39, pp. 38–39]. Whereas this approach works well for biallelic loci, it fails when larger numbers of alleles are present, as will be seen below. This question has earlier also been addressed by Ott [31, equations 7 and 8], who has proposed lower bounds which provide a necessary size of sample to be able to detect allele frequencies within given average error bounds. By this formula, for given average error value, the required sample size decreases as the square of the number of alleles at the locus. If there are 20 alleles at a locus, allowing average error of 0.05, the necessary sample size to find correct allele frequencies is less than 10, which has a probability of less than 10⁻⁷ of detecting all alleles (using equation 6 of Ott [31]).

Table 1. Actual range of values taken by each allele frequency, estimated from 10,000 simulated samples, and the number of such samples that would be rejected

Allele frequency distribution, f   Sample size, N   Samples with |error| > 0.005   Ranges of allele frequencies, over 10,000 samples
(0.34, 0.33, 0.33)                 25,000           500                            ((0.331, 0.348), (0.322, 0.338), (0.322, 0.339))
(0.50, 0.49, 0.01)                 22,000           509                            ((0.491, 0.508), (0.480, 0.499), (0.008, 0.011))
(0.98, 0.01, 0.01)                 1,750            404                            ((0.969, 0.988), (0.004, 0.018), (0.004, 0.018))
(0.25, 0.25, 0.25, 0.25)           23,250           497                            ((0.242, 0.257), (0.241, 0.258), (0.242, 0.257), (0.242, 0.256))
(0.33, 0.33, 0.33, 0.01)           24,500           499                            ((0.321, 0.337), (0.322, 0.338), (0.322, 0.338), (0.008, 0.011))
(0.49, 0.49, 0.01, 0.01)           22,000           515                            ((0.480, 0.498), (0.479, 0.499), (0.008, 0.011), (0.008, 0.011))
(0.97, 0.01, 0.01, 0.01)           2,500            439                            ((0.958, 0.979), (0.005, 0.015), (0.005, 0.016), (0.005, 0.016))

For different population allele frequency distributions (f) the sample size (N) gives about 5% of the estimated allele frequency distributions falling outside the hypersphere of radius 0.005.

We have used a Monte Carlo simulation procedure to find the sufficient sample size which reliably estimates the population allele frequency distribution for different degrees of polymorphism at a locus. The definition of ‘reliable’, as measured by closeness to population allele frequency distribution, will depend on the accuracy required by the particular application. Here, our criterion for selecting the best sample size, N*, is that: out of M (a very large integer value) simulated simple random samples of size N, less than 5% of the samples give an allele frequency distribution that is different from the true allele frequency distribution, up to given level of accuracy, for samples of size N*. Since this sample size is much higher than the minimum required for all alleles to be found in the sample, that condition is automatically satisfied. Our simulation procedure enables us to estimate the probability of accepting a wrong allele frequency distribution, so that we can find the required sample size after fixing the permissible level of errors. The specific amount of error tolerable for any given application needs to be estimated by simulation with known data. Whereas an average value of error is used by Ott [31], our method permits flexibility in fixing the actual range of permitted values.
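The selection criterion described above can be illustrated with a small simulation. The sketch below is not the ALFREQ program; the function names, the use of Python's standard `random` module and the default of 1,000 simulated samples (rather than 10,000) are all assumptions made to keep the example short and quick to run. Each individual contributes two alleles, so a sample of N individuals is simulated as 2N independent draws from the assumed population distribution.

```python
import random
from collections import Counter


def simulate_rejection_rate(freqs, n_individuals, max_error,
                            n_samples=1000, seed=0):
    """Fraction of simulated samples with any allele frequency off by > max_error."""
    rng = random.Random(seed)
    draws = 2 * n_individuals          # two alleles per individual
    alleles = range(len(freqs))
    rejected = 0
    for _ in range(n_samples):
        counts = Counter(rng.choices(alleles, weights=freqs, k=draws))
        estimates = [counts[a] / draws for a in alleles]
        if any(abs(e - f) > max_error for e, f in zip(estimates, freqs)):
            rejected += 1
    return rejected / n_samples


def smallest_adequate_size(freqs, max_error, candidates, risk=0.05, **kwargs):
    """First candidate N whose simulated rejection rate is below `risk`."""
    for n in candidates:
        if simulate_rejection_rate(freqs, n, max_error, **kwargs) < risk:
            return n
    return None  # none of the candidate sizes was adequate


if __name__ == "__main__":
    # Biallelic locus, f = (0.6, 0.4), permitted error 0.05
    print(simulate_rejection_rate([0.6, 0.4], 175, 0.05))
    print(smallest_adequate_size([0.6, 0.4], 0.05, [50, 175, 400]))
```

With only 1,000 simulated samples the estimated rejection rate is itself noisy near the 5% threshold, so borderline sample sizes may be accepted or rejected depending on the seed; increasing `n_samples` reduces this at the cost of run time.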

For the Ticuna population considered above, assuming true allele frequencies to be (f1, f2) = (0.99, 0.01), 1,000 simulated samples of size 66 each gave the range of estimated sample frequency of the first allele to be (0.962, 1.000), with only the first allele being found in 266 of the samples (about 27% chance of missing an allele, which matches our theoretical value obtained earlier). For a particular application (e.g., phylogenetic analysis), if a 27% chance of missing an allele and an additional error of between [–0.04, +0.01] in the estimation of the allele frequency would not significantly alter the inference, a sample of size 66 would be appropriate. However, these bounds should ideally be reported in a database of allele frequencies, so that users will have an idea of the amount of error they should be prepared for.

For example, wrong allele frequency distribution (error in any allele frequency being more than 0.005) has a chance of occurring in about 2.46% (estimated empirically from 10,000 simulated samples) of samples of size 30,000 if the true population allele frequency distribution were f = (0.34, 0.33, 0.33). If we permit a higher risk (5%) of getting a wrong allele frequency distribution, then a sample of size 25,000 would be adequate. Sample sizes for different degrees of polymorphism with approximately 5% chance of getting a ‘wrong’ distribution (error larger than 0.005 in any allele frequency), along with ranges of the allele frequencies are shown in table 1. The corresponding values for biallelic loci, for different bounds on permitted error, are presented in table 2. Ott’s [31] formula gives a lower bound of 4,444 for the sample size whereas the sample size indicated by the 95% confidence intervals [39] is about 16,988. Our simulations showed that a sample of 20,000 has an empirical α value of about 0.09; hence, the sample size found using confidence intervals is also an underestimate (table 3).

Table 2. The sample sizes (N) which give about 5% of the estimates of allele frequency distributions outside the pre-specified error bounds and the number of samples actually lying outside the error limits, for a biallelic locus (e.g., SNP) with different allele frequency distributions (f)

For 2 alleles, allele frequency distribution, f   Sample size, N   Number of samples to be rejected out of 10,000
(0.50, 0.50)                                      1,150            538 (|error| > 0.02)
(0.60, 0.40)                                      19,000           541 (|error| > 0.005)
(0.60, 0.40)                                      1,150            523 (|error| > 0.02)
(0.60, 0.40)                                      175              490 (|error| > 0.05)
(0.70, 0.30)                                      1,000            521 (|error| > 0.02)
(0.80, 0.20)                                      800              496 (|error| > 0.02)
(0.90, 0.10)                                      400              586 (|error| > 0.02)
(0.99, 0.01)                                      825              460 (|error| > 0.005)
(0.99, 0.01)                                      800              440 (|error| > 0.05)

To return to our earlier scenario, with a sample size of N* estimated by simulation, if the observed allele frequencies are significantly different from those estimated earlier, using the new observed frequency distribution as the true population allele frequency distribution, one should repeat the simulation, reestimate the sample size, draw additional samples if required, verify allele frequencies, and repeat the procedure till one is confident of the allele frequencies. The whole procedure for ensuring that all alleles of a given minimum frequency are sampled, and that the estimated allele frequencies are correct within permissible error, is presented in figure 3.

It may be noted that we have good reason for representing allele frequencies up to 2-decimal digit accuracy. Firstly, single-digit accuracy would exclude alleles of frequency less than 0.05 and also be inapplicable for highly polymorphic loci. Secondly, our simulations show that, for 3-decimal digit accuracy, samples of size greater than 5 × 10⁵ are required, which is impracticable. Hence, 2-decimal digit accuracy, which requires sample sizes of the order of 10²–10⁴, seems to be a reasonable choice. The results of our computations and simulations for questions A and B are summarized below.

Summary of Results

For question A (using our computer program, POLYSAMP)
(1) A sample of size 4 times the number of alleles is adequate to find all equally (or nearly equally) frequent alleles with probability 0.99. This is a good approximation to equation 3 when the number of alleles is less than 35 (fig. 1). For number of alleles from 35 up to 100, a sample size of 5 times the number of alleles provides a good approximation.
(2) When there is one rare allele and other alleles have nearly equal frequencies, a sample of size Nmin given by equation 2 is adequate to find all alleles with probability π.
(3) When there are two or more rare alleles, or the frequency distribution is highly skewed, the minimum sample size has to be found by equation 1. We have found that, when the rarest allele has frequency 0.05, for up to K = 5 alleles at a locus with (K – 1) = 4 rare alleles, a sample of about 60 persons provides a 99% probability of finding all alleles (fig. 2).

Fig. 3. Sequential sampling procedure for reliably determining degree of polymorphism at a locus.

Table 3. Comparison of minimum sample sizes calculated for average (Ott’s) or actual permissible error of 0.5% by using Ott’s lower bound, the formula for 95% confidence intervals on true allele frequencies and our simulations, for different allele frequency distributions

True frequency distribution   Sample size (Ott’s lower bound)   Sample size (95% CI)   Sample size (simulation)
(0.60, 0.40)                  4,800                             18,440                 19,000
(0.99, 0.01)                  198                               761                    825
(0.34, 0.33, 0.33)            4,444                             17,241                 25,000
(0.50, 0.49, 0.01)            3,399                             19,208                 22,000
(0.98, 0.01, 0.01)            263                               1,506                  1,750
(0.25, 0.25, 0.25, 0.25)      3,750                             14,423                 23,250
(0.97, 0.01, 0.01, 0.01)      294                               2,236                  2,500

The sample sizes for confidence intervals around each allele frequency in a distribution were computed and the maximum taken. For the 3-allele cases, when we took sample size of 20,000, we got empirical α = 0.09 (for equal allele frequency) and α = 0.06 (for one rare allele) based on 10,000 simulations, indicating that the confidence intervals also underestimate the required sample size.
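The confidence-interval column of table 3 can be approximated with a short calculation. The sketch below rests on assumptions that the text does not spell out: a normal-approximation 95% interval of half-width d around each allele frequency f, giving z²f(1 – f)/d² chromosomes with z = 1.96, halved to convert chromosomes to individuals, and the maximum taken over the alleles as the table footnote describes. The function name is hypothetical. Under these assumptions the calculation reproduces most of the tabulated values after rounding to the nearest integer.

```python
def ci_sample_size(freqs, half_width, z=1.96):
    """Individuals needed so each allele's CI half-width is at most `half_width`.

    Assumes a normal approximation to the binomial and two alleles per
    individual; returns the largest requirement over all alleles.
    """
    chromosomes = [z * z * f * (1.0 - f) / half_width ** 2 for f in freqs]
    return round(max(chromosomes) / 2.0)


if __name__ == "__main__":
    for f in [(0.60, 0.40), (0.99, 0.01), (0.98, 0.01, 0.01)]:
        print(f, ci_sample_size(f, 0.005))
```

For the three distributions printed above this gives 18,440, 761 and 1,506 individuals, matching the corresponding table 3 entries.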

For Question B (using our computer program, ALFREQ) (1) The estimated minimum sample size based on question A gives widely varying sample allele frequency distributions and is therefore inadequate for estimating the allele frequency distribution. For example, the minimum sample size of 5 from the population distribution

Genetic Polymorphism Studies

Hum Hered 2001;52:191–200

Downloaded by: Universiti Sains Malaysia 202.170.60.241 - 9/19/2017 6:01:36 PM

197

Discussion

Allele frequency distributions are required in many modern applications of genetics, and are sufficiently important that a database [37] has been created in the public domain. For such databases to be really useful, it is necessary that the sampling design ensures that the data are reliable for all purposes for which they may be used. Estimates of sample sizes based on the stringent condition that each value of allele frequency should be accurate according to the usual arithmetic rules of rounding off to the nearest 2-decimal digits are presented in this paper. It is desirable that the values reported in dedicated databases are based on adequately large sample sizes. For biallelic loci, confidence intervals [39] for the allele frequencies can be reported, but when the number of alleles is larger, our work would provide researchers with informa-

198

Hum Hered 2001;52:191–200

tion on the risk involved in using the reported allele frequency distributions as if they were the true population distributions. For example, at an SNP locus, we found that, with a sample of 105 individuals, the estimated allele frequency of 0.60 (if assumed to be the true population frequency) could actually range from 0.47 to 0.73, with over 52% of 10,000 samples giving values less than 0.58 or more than 0.62; even if we allowed an error of magnitude up to 0.05, over 13% of the samples were outside these bounds. It would be necessary to take a sample of about 1,100 persons to have the lower and upper bounds on the allele frequency to be 0.56 and 0.64, respectively, and about 5.5% of the values outside error bounds of magnitude 0.02. Whereas we have given numerical examples for 2, 3, 4 or 5 alleles per locus, the methods are applicable for more polymorphic loci also. Typical sample sizes that have so far been used by researchers to find allele frequency distributions have been of the order of 102, and very rarely, 103. Hence our result that these sample sizes are hardly likely to give reliable allele frequency distributions in most situations is significant. For most practical purposes, errors of magnitude larger than 0.005 may be tolerable, and hence, smaller sample sizes could be used; these can be found using our method after fixing the permissible error limits. This can be seen from table 2, where the required sample size for frequency distribution (0.6, 0.4) reduces from 19,000 (AerrorA ^ 0.005) to 1,150 (AerrorA ^ 0.02) and further to 175 (AerrorA ^ 0.05). Permissible error bounds for a given application may be found by simulations using different error bounds. Smaller samples can also be used for loci with low polymorphism (less than 4 alleles) and minimum allele frequency not less than 0.1, by using allele frequency values of single-digit accuracy. Furthermore, our estimates of sample sizes are based on the assumption of random mating. 
(0.6, 0.4) estimates the first allele frequency (0.6) to be in the range (0.2, 1.0) over 200 simulations. (2) To estimate allele frequencies to 3-decimal digit accuracy, the required sample size, when there are 3 or 4 alleles and any allele frequency distribution, is of the order of 10^5. Since this is impracticable, allele frequencies can be reported to at most 2-decimal digit accuracy. (3) When there are only two alleles, e.g., a single nucleotide polymorphism (SNP) or an insertion/deletion (I/D) polymorphism, samples of about 200 individuals are adequate for single-digit accuracy (permitted error between –0.05 and +0.04) if the alleles are nearly equally prevalent, but a sample size of 19,000 is required for the same frequency distribution if 2-decimal digit accuracy is required (permitted error between –0.005 and +0.004). Samples of about 800 individuals are required when there is one very rare allele (2-decimal digit accuracy) (table 2). (4) Sample sizes required for 2-decimal digit accuracy are of the order of at least 10^4 when 3–5 alleles are present and the alleles are nearly equally distributed. This is true even when there are some rare alleles and the others are nearly equally distributed. However, if there is one major allele and all the others are rare, then samples of the order of 10^3 are adequate (table 1). (5) For single-decimal digit accuracy (e.g., when the number of alleles at a locus is small and there are no rare alleles of frequency less than 0.1), sample sizes of the order of 10^2 are adequate for 2 alleles, but for 5 alleles, a sample of about 17,500 persons is required. (6) In all the above cases, the required sample size increases with the number of alleles at the locus.

In highly inbred populations, smaller sample sizes may be adequate. Special cases of other mating systems can be incorporated into our simulation program by making suitable modifications. Although we have given examples for single genetic loci in this paper, the same methods are also applicable to haplotypes. If the loci being used segregate independently of one another, the haplotype frequencies are simply the products of the individual allele frequencies. Otherwise, they need to be estimated by sampling, and each haplotype can be treated like an allele.

Any results based on simulations are sensitive to the (pseudo)random number generator used. We have implemented our computer programs on a Sun UltraSparc 60 workstation with the random number generator drand48, which is based on a linear congruential algorithm with 48-bit integer arithmetic and has a period of about 2^47 (about 10^14). Our simulation procedure requires 2 random numbers per person, hence 2N random numbers per sample, and altogether 2N × 10^4 random numbers for 10,000 such simulated samples. This number (of order greater than 10^8) is still smaller than the cycling period of the random number generator used. Hence the results obtained by us are truly representative of a simple random sampling procedure.

Our computer programs can be obtained by sending an e-mail to [email protected]. They can be used to (1) find the probability that an allele of a given frequency is missing in a sample of given size, (2) find the probability that all alleles of a given frequency distribution will be found in a sample of given size, (3) find the minimum sample size that will ensure that all alleles of a given frequency distribution are found in the sample, with a given probability, (4) given the population allele frequency distribution and a sample size, estimate the range of variation in sample allele frequencies and find the frequency with which at least one of these frequencies is outside the permissible error bounds, and (5) estimate the required sample size for getting reliable allele frequency distributions within permitted error limits, with a given margin of getting a wrong distribution (say, 5%).
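The program functions listed above can be illustrated with a minimal Python sketch covering functions (1) to (4). It is not the author's program. It assumes random mating, so that a sample of N persons amounts to 2N independent allele draws (in line with the 2 random numbers per person mentioned above), it uses inclusion-exclusion for the probability that all alleles are found, and it replaces the asymmetric rounding-based error limits with a symmetric tolerance `tol`. All function names are hypothetical.

```python
import random
from itertools import combinations


def prob_allele_missing(p, n_persons):
    """Chance that an allele of frequency p is absent from 2N independent draws."""
    return (1.0 - p) ** (2 * n_persons)


def prob_all_found(freqs, n_persons):
    """Probability that every allele appears at least once (inclusion-exclusion)."""
    draws = 2 * n_persons
    total = 0.0
    for size in range(len(freqs) + 1):
        for subset in combinations(freqs, size):
            # Probability that none of the alleles in `subset` is drawn.
            remaining = max(0.0, 1.0 - sum(subset))
            total += (-1) ** size * remaining ** draws
    return total


def min_sample_size(freqs, target):
    """Smallest number of persons N with prob_all_found(freqs, N) >= target."""
    if not 0.0 < target < 1.0 or min(freqs) <= 0.0:
        raise ValueError("need 0 < target < 1 and strictly positive frequencies")
    n = 1
    while prob_all_found(freqs, n) < target:
        n += 1
    return n


def frac_outside(freqs, n_persons, tol, n_sims=1000, seed=0):
    """Monte Carlo fraction of samples in which at least one sample allele
    frequency differs from its population value by more than tol."""
    rng = random.Random(seed)
    alleles = range(len(freqs))
    draws = 2 * n_persons
    bad = 0
    for _ in range(n_sims):
        counts = [0] * len(freqs)
        for a in rng.choices(alleles, weights=freqs, k=draws):
            counts[a] += 1
        if any(abs(c / draws - p) > tol for c, p in zip(counts, freqs)):
            bad += 1
    return bad / n_sims
```

Under these assumptions, `min_sample_size([0.5, 0.5], 0.95)` returns 3, because 6 allele draws miss one of two equally frequent alleles with probability 2 × 0.5^6 ≈ 0.03, whereas 4 draws miss one with probability 0.125. Function (5) could be approximated by increasing `n_persons` until `frac_outside` falls below the accepted margin (say, 0.05).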

Acknowledgments

This work is part of the Functional Genomics Project, funded by the Department of Biotechnology, India, to SKB. I am grateful to Prof. S.K. Brahmachari for suggesting the problem, to Dr. Somdatta Sinha for discussions and critical comments, and to Dr. Balaram Ghosh and Dr. S. Ramachandran for their comments on a draft of the manuscript. I thank Ms. Puja Dogra and Ms. Swati Mital for assistance in computer programming.
