HOW TO START THE PROGRAM

SPAms 1.0 (Simulation Program for the Analysis of ms)

Bárbara Parreira, Marie Trussart, Vítor Sousa, Richard Hudson, Lounès Chikhi October, 2008

PRESENTATION

SPAms is a graphical interface that allows the user to simulate data under three main models: (i) population size change (either instantaneous or exponential), (ii) admixture (from 2 up to 5 parental populations), and (iii) population structure (under either the n-island or the stepping-stone model). SPAms uses the ms program (Hudson, 2002) as the engine for the coalescent simulations. It converts the input parameters (e.g. effective size, mutation rate, etc.) for these models into ms commands and automatically runs the ms program for the user. The ms program produces sequence data under the infinite-sites model (haplotypes coded with 0's and 1's), but this output can be converted into microsatellite data under the stepwise mutation model using the microsat.exe program, which is also available with the ms package. SPAms can pipe ms output directly to microsat.exe to generate microsatellite data. For microsatellite data, SPAms computes a set of commonly used frequency-based statistics that are not computed by the ms program. It also provides the standard statistics for sequence data as implemented in the sample_stats.exe program provided by Richard Hudson.

SPAms is provided as a stand-alone executable, which was written and compiled using MATLAB and can be run on any computer with the Windows operating system. The programs that compute the summary statistics are written in C (ms was also written in C; see Richard Hudson's webpage for the code: http://home.uchicago.edu/~rhudson1/). It is not necessary to have MATLAB installed to run SPAms. The user only needs to install the MCRInstaller.exe program


(ca. 83.2 MB), which can be found here: http://www.igc.gulbenkian.pt/research/unit/88.

The full SPAms package contains the following files/folders:

a) SPAms.exe (the SPAms executable for Windows, 10.3KB)
b) choice.ctf (required for SPAms to work, 1.4MB)
c) ms.exe (ms compiled for Windows, 43.3KB)
d) microsat.exe (microsat compiled for Windows, 19.7KB)
e) sample_stats.exe (sample_stats compiled for Windows, 20.6KB)
f) cygwin.dll (required for some computer configurations, 1.8MB)
g) MCRInstaller (required in a computer without MATLAB, 83.3MB)
h) README.pdf (this file)
i) R_Scripts (a folder with examples, 5.3GB)

Note that the files c), d), e) and f) belong to the ms package (Hudson, 2002). They are given together with the SPAms package but can also be downloaded from Richard Hudson's web page at: http://home.uchicago.edu/~rhudson1/.

1. DEMOGRAPHIC MODELS

In the population size change model, the user can choose between an exponential and a sudden size change event.

In the admixture model, one admixture event is allowed with up to 5 parental populations. The parental populations are assumed to split from an ancestral population at the same time.

In the structured model, the user can choose between an island and a stepping-stone model. For the island model, SPAms assumes that all demes have the same population size and that the sample size is the same for all samples (there is no limit to the number of demes, other than computational). For the stepping-stone model, a square (n x n) two-dimensional grid with bouncing edges is assumed. The total number of populations must thus be a square number. If the user specifies a number of demes that is not a square, say 112, the program will automatically round it to the closest square number (here, 121). The maximum number of sampled populations is ten, but there is no limit (other than computational) on the total number of populations. Unlike in the island model, in the stepping-stone model the samples do not need to have the same size, and the location (row and column)


of the sampled populations must be specified by the user. In the structured models (n-island and stepping-stone), the migration rate is assumed to be constant and equal among the pairs of populations exchanging migrants (i.e. all populations in the n-island model, and neighboring populations in the stepping-stone model).
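The rounding of the number of demes to a square can be illustrated with a short Python sketch. The helper name `closest_square` is hypothetical, and the rule shown (pick the nearest perfect square) is an assumption based only on the 112 to 121 example above; it is not taken from the SPAms source code.

```python
import math

def closest_square(n_demes: int) -> int:
    """Return the perfect square closest to n_demes (hypothetical helper)."""
    if n_demes < 1:
        raise ValueError("the number of demes must be positive")
    side = math.isqrt(n_demes)                 # largest integer with side**2 <= n_demes
    lower, upper = side ** 2, (side + 1) ** 2
    # Two consecutive squares differ by an odd number, so ties cannot occur.
    return lower if n_demes - lower < upper - n_demes else upper

print(closest_square(112))  # 121, i.e. an 11 x 11 grid
```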

Please note that all models require the population size (past and present, or of the subpopulations). These sizes are numbers of diploid individuals, so the number you type is simulated as twice that number of genes. To simulate data from a haploid population of size N, enter N/2 as the population size. For instance, to simulate data from a haploid population of 1000 individuals, enter 500; to simulate data from a diploid population of 1000 individuals, enter 1000. The mutation parameter is also required in all models; it is the mutation rate per locus, per generation. The migration argument, required in the structured models, is the migration rate per population, per generation.
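As a quick check of this convention, the following Python sketch computes the value to enter for a haploid or a diploid population. The function name `size_to_enter` and its `ploidy` argument are illustrative only and are not part of SPAms.

```python
def size_to_enter(n_individuals: int, ploidy: int) -> float:
    """Value to type into SPAms so that the intended number of genes is simulated."""
    if ploidy not in (1, 2):
        raise ValueError("ploidy must be 1 (haploid) or 2 (diploid)")
    genes = n_individuals * ploidy   # number of gene copies in the population
    return genes / 2                 # SPAms doubles the size that is entered

print(size_to_enter(1000, ploidy=1))  # 500.0
print(size_to_enter(1000, ploidy=2))  # 1000.0
```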

2. HOW TO INSTALL AND USE THE PROGRAM

2.1 Create a folder to store ms.exe, microsat.exe, sample_stats.exe and cygwin.dll (the latter only if required). These files should all be in the same folder, which can have any name the user wishes. We strongly suggest creating this folder near the root (for instance C:\ms or C:\data_analysis\ms), as there seem to be problems running SPAms when folder names are very long. One problem comes from the fact that the folders created by SPAms can themselves have very long names, depending on the model chosen.

2.2 If you do not have MATLAB installed on your computer, you must run MCRInstaller to be able to use SPAms. To do so, copy the folder named MCRInstaller to your computer (it contains two files, MCRInstaller.exe and extractCTF.exe, which should be in the same folder). Then double-click MCRInstaller.exe and follow the instructions.

2.3 Create another folder to store the SPAms.exe and choice.ctf files. The name of this folder has no importance, but it could for instance be called SPAms.

2.4 Double-click the SPAms.exe file. A DOS window should appear, followed by a window that allows the user to choose the model type among the three models mentioned above.

2.5 Choose one of the three models.

2.6 A new window will then appear, allowing you to:

i. Choose a submodel. For the population size change model you have to choose either an exponential or a sudden size change. For the structure model you can choose between an n-island and a stepping-stone model. For the admixture model you can choose the number of parental populations.
ii. Type all the parameter values in the required spaces.
iii. Use the browsing button to tell SPAms where the ms executables have been saved (see step 2.1 above).
iv. Choose the type of marker and the statistics you want to compute.
v. Type the number of simulations wanted. We suggest making a test with a small number of simulations, say ten.
vi. Click the RUN button.

3. COMPUTED STATISTICS

SPAms computes the standard ms statistics (see table 1) together with a set of allele frequency-based statistics, which depend on both the chosen marker type and model. The standard ms statistics use the sample_stats.exe program and are only computed if you choose haplotypes. All the other statistics are computed if you choose microsatellites. The sample_stats.exe program computes the summary statistics for the entire dataset, but SPAms also allows you to compute the ms statistics for each population in the structure models (n-island and stepping-stone). To compute a given statistic, click the corresponding button (in the window that appears in 2.6). SPAms will produce warning messages when you choose a statistic that cannot be computed


with the chosen model and/or marker type. Statistics that can be computed for each marker type are shown in table 1.

Table 1. Statistics computed by SPAms

*: the ms statistics are the statistics computed by the sample_stats program: the number of pairwise differences, the number of segregating sites, Tajima's D and Fay and Wu's H.

4. OUTPUT FILES

SPAms will automatically create a folder named 'Simulations' in the folder where the SPAms.exe and choice.ctf files have been copied. Within that folder it will create another folder, whose name explicitly uses the model and the parameter values used for the simulation, where all results will be saved as text files. The files that SPAms creates for each marker type are listed below.

a) Haplotypes

• ms_command_hap.txt: parameter values and ms command;
• ms_result_hap.txt: ms output results for haplotype data. This is the standard ms output as explained in the ms user guide;
• ms_stats.txt: ms statistics calculated using sample_stats.exe for the standard ms output;
• ms_stats_popX.txt: ms statistics calculated using sample_stats.exe for population X, where X = 1, ..., n and n is the number of populations sampled.

b) Microsatellites

• ms_command_msat.txt: parameter values and ms command to produce microsatellite data;
• ms_result_msat.txt: ms output results for microsatellite data. Each line corresponds to the data from one simulation. Each value is the relative length of an allele. This is the standard ms output after using microsat.exe;
• all_frequencies.txt: allelic frequencies computed using ms_result_msat.txt. Each line corresponds to one simulation. The allele frequencies are ordered population by population;
• ms_sumstats.txt: summary statistics for microsatellite data. The first line identifies the statistics that were computed and saved in the file. The computed statistics appear in the same order. When the model assumes more than one population (for instance an island model with four sampled populations), the statistics that are computed for each population will be ordered by population (there will be four He values, one per population, and six FST values corresponding to the pairwise comparisons 1-2, 1-3, 1-4, 2-3, 2-4, 3-4). The file has one line per simulation.
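The ordering of the per-population and pairwise statistics can be sketched in Python as follows. The label format (`He_1`, `FST_1-2`) and the choice to list He before FST are assumptions made for illustration; the actual header written by SPAms is the first line of ms_sumstats.txt.

```python
from itertools import combinations

def sumstat_labels(n_pops: int) -> list[str]:
    """Illustrative labels: one He per population, one FST per pair of populations."""
    pops = range(1, n_pops + 1)
    he = [f"He_{p}" for p in pops]
    # combinations() yields the pairs in the order 1-2, 1-3, ..., (n-1)-n.
    fst = [f"FST_{a}-{b}" for a, b in combinations(pops, 2)]
    return he + fst

print(sumstat_labels(4))
```

For four sampled populations this lists the pairwise comparisons in the same order as given above (1-2, 1-3, 1-4, 2-3, 2-4, 3-4).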

5. EXAMPLES

The advantage of using a windows-based, user-friendly environment rather than typing ms commands can be seen in the next two examples. Suppose we want to simulate an admixture model with three parental populations of size 2000 each, which diverged 1000 generations before the admixture event from an ancestral population of size 10,000. Assume that the admixture event took place 200 generations ago and that the three parental populations contributed 10, 40 and 50%, respectively, to the hybrid population of size 1000. To simulate samples of 100 individuals from each of the parental and hybrid populations, and repeat this 1000 times, under a mutation rate of 10^-3 per locus and generation, the required ms command is:


ms 400 1000 -t 8 -I 4 100 100 100 100 -n 1 1 -n 2 1 -n 3 1 -n 4 0.5 -es 0.025 4 0.1 -ej 0.025 4 1 -es 0.025 5 0.444 -ej 0.025 5 2 -ej 0.025 6 3 -en 0.125 1 5 -ej 0.125 2 1 -ej 0.125 3 1

If we want microsatellite data, it is only necessary to add a piping command: | c:microsat.exe. As can be seen from this command, most of the "natural" parameters are difficult to recognize because ms uses scaled parameters. With SPAms, a few clicks will easily do the job, and the values of the parameters used are saved together with the ms command.
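To make the scaling explicit, the Python sketch below recomputes the scaled values that appear in the admixture command from the natural parameters. It assumes the usual ms conventions (theta = 4*N0*mu, times in units of 4*N0 generations, sizes relative to N0) and takes the parental size as N0. The function name is hypothetical, and this is not the code used by SPAms. The split time is passed as 1000 generations because the value 0.125 in the printed command corresponds to 1000 generations in these units.

```python
def admixture_ms_parameters(n_parental, n_hybrid, n_ancestral, mu,
                            t_admix, t_split, contributions):
    """Convert natural parameters into the scaled values used on an ms command line."""
    n0 = n_parental                       # reference size N0 (assumed: parental size)
    keep = []                             # probabilities given to successive -es flags
    remaining = 1.0
    for share in contributions[:-1]:
        keep.append(share / remaining)    # share of the lineages not yet assigned
        remaining -= share
    return {
        "theta": 4 * n0 * mu,                 # -t
        "hybrid_size": n_hybrid / n0,         # -n 4
        "ancestral_size": n_ancestral / n0,   # -en ... 1
        "t_admix": t_admix / (4 * n0),        # time of -es / first -ej flags
        "t_split": t_split / (4 * n0),        # time of -en / last -ej flags
        "es_probabilities": keep,
    }

params = admixture_ms_parameters(2000, 1000, 10_000, 1e-3, 200, 1000,
                                 [0.10, 0.40, 0.50])
for name, value in params.items():
    print(name, value)
```

The second -es probability is 0.4/0.9, which rounds to the 0.444 seen in the command: after 10% of the lineages have been assigned to the first parental population, 40% of the total is 44.4% of what remains.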

As another example, the following command allows the user to simulate a stepping-stone model with 9 populations (i.e. a 3x3 network of populations):

ms 300 1000 -t 4 -I 9 0 0 0 100 100 0 0 0 100 -m 1 2 10 -m 2 1 40 -m 1 4 10 -m 4 1 40 -m 4 5 10 -m 5 4 10 -m 2 5 10 -m 5 2 10 -m 7 8 10 -m 8 7 10 -m 3 6 10 -m 6 3 10 -m 2 3 10 -m 3 2 10 -m 4 7 10 -m 7 4 10 -m 5 6 10 -m 6 5 10 -m 5 8 10 -m 8 5 10 -m 8 9 10 -m 9 8 10 -m 6 9 10 -m 9 6 10 > ms_result.txt

As can be seen here, the command is rather long despite the fact that only nine populations are simulated. The reason for this is that it is necessary to specifically identify all the populations exchanging genes, for instance population 2 with populations 1, 3 and 5, etc. In this case, all populations are of size 1000 and three populations were sampled (4, 5 and 9, with a sample size of 100 each). The migration rate was set to 0.01, and the mutation rate to 0.001. The number of simulations is 1000.
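The list of neighboring pairs that must be spelled out can be generated programmatically. The Python sketch below assumes that populations are numbered row by row on the grid, which is consistent with population 2 exchanging genes with populations 1, 3 and 5. It writes a single uniform rate for every pair, which is a simplification: it does not attempt to reproduce the individual rate values of the printed command or the treatment of the bouncing edges. The function names are hypothetical.

```python
def stepping_stone_pairs(side: int) -> list[tuple[int, int]]:
    """Ordered (i, j) pairs of neighboring populations on a side x side grid."""
    pairs = []
    for row in range(side):
        for col in range(side):
            i = row * side + col + 1                 # populations numbered from 1
            for d_row, d_col in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                r, c = row + d_row, col + d_col
                if 0 <= r < side and 0 <= c < side:  # stay inside the grid
                    pairs.append((i, r * side + c + 1))
    return pairs

def migration_flags(side: int, rate: float) -> str:
    """Build the -m part of an ms command with one uniform (assumed) rate."""
    return " ".join(f"-m {i} {j} {rate:g}" for i, j in stepping_stone_pairs(side))

print(len(stepping_stone_pairs(3)))   # number of -m entries for a 3x3 grid
print(migration_flags(3, 10))
```

A 3x3 grid already requires 24 entries, and the count grows with the number of populations, which is why the command becomes long so quickly.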

Below we illustrate how data files produced by SPAms can be used to better understand the statistical properties of the genetic data simulated under different demographic scenarios. We performed a number of simulations and computed the distribution of summary statistics for which expected values, either approximate or exact, are known from analytical results. The examples are simulated under different demographic scenarios. All the plots presented below have been generated by the R software using scripts located in the ‘R Scripts’ folder. The corresponding data were produced using SPAms and are also provided.




Example 1: Population size change

We simulated data sets under an instantaneous population size reduction model (bottleneck), and sampled data one hundred generations after the bottleneck. Figure 1 shows the distribution of the number of alleles and of the expected heterozygosity (He) in the present population for bottlenecks of varying intensity. As expected from theoretical results, the number of alleles decreases much more quickly than He, creating a disequilibrium that has been used to detect bottlenecks in genetic data.

Figure 1. The two panels represent the number of alleles (a) and the expected heterozygosity (b) of simulated data sets sampled 100 generations after a sudden bottleneck. The strength of the bottleneck was varied as follows: the ancestral size was 10^5 and the present population size ranged from 99,999 (virtually no change in size*) to 10 (a 10,000-fold reduction). Each boxplot corresponds to a different post-bottleneck population size. The sample size was 100 for all simulations and the mutation rate was 10^-4. 1000 simulations were performed for each scenario.

*: In the current version, SPAms does not allow the population size change model to be used to simulate data from a constant population. In any case, this simulation would be trivial using ms directly (ms 100 1000 -t 40). Here, we simulated a bottleneck from 100,000 to 99,999, which is virtually equivalent to a constant population. The command is shown in the ms output (see the R Script in the corresponding folder).



Example 2: Differentiation in an n-island model using microsatellites

We have simulated data under the n-island model with a constant number of demes and different migration rates. The expected equilibrium FST value in an infinite island model is given as

$$F_{ST} = \frac{1}{1 + 4Nm + 4N\mu}$$

where N is the subpopulation size, m the migration rate and μ the per locus mutation rate. Figure 2 shows the FST distributions obtained for the different sets of simulations.

[Figure 2: density curves of the overall FST (x-axis: Overall FST; y-axis: density)]

Figure 2. The different curves correspond to different values of the migration rate, all other parameter values being equal. They correspond, from left to right to migration rates of 0.5 (black), 0.1 (blue), 0.01 (red) and 0.001 (green). In this case, the simulated data were obtained by choosing microsatellites.
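The expected values against which such distributions can be compared follow directly from the infinite island formula, FST = 1 / (1 + 4Nm + 4Nμ). The Python sketch below evaluates it for the four migration rates of Figure 2. The deme size (N = 1000) and the mutation rate (μ = 0.0001) are placeholder values chosen for illustration, since the values used for this example are not stated here.

```python
def expected_fst(n: float, m: float, mu: float) -> float:
    """Equilibrium FST in the infinite island model: 1 / (1 + 4Nm + 4N*mu)."""
    return 1.0 / (1.0 + 4.0 * n * m + 4.0 * n * mu)

N_ASSUMED = 1000      # hypothetical deme size, for illustration only
MU_ASSUMED = 1e-4     # hypothetical per-locus mutation rate, for illustration only

for m in (0.5, 0.1, 0.01, 0.001):
    print(f"m = {m}: expected FST = {expected_fst(N_ASSUMED, m, MU_ASSUMED):.4f}")
```

The expected FST increases as the migration rate decreases, in agreement with the left-to-right ordering of the curves.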



Example 3: Diversity in an n-island model for sequence data

It can be shown that, for a symmetric island model with infinite sites, the expectation of π (the average pairwise difference between sequences sampled within a deme) is n*θ, where n is the total number of demes and θ is 4Nμ (N being the deme size and μ the per locus mutation rate). Figure 3 shows the distribution of π obtained for 1000 simulations, together with the expected value and the average of the simulated data.

[Figure 3: density plot (y-axis: density)]

Figure 3. The density curve was obtained under a 1000-island model. Data sets were simulated with SPAms using the following parameters: 10 sampled subpopulations, each with sample size equal to 100; m = 0.001, μ = 0.0001 and deme size = 1000 (i.e. θ = 0.4). 1000 independent simulations were performed. The π values for each sampled subpopulation were calculated using the ms sample_stats.exe file. The distribution plotted was obtained using the values computed for all subpopulations (i.e. 10,000 independent within-population π values). The vertical dashed line represents the average computed for the simulated data and the solid line is the expected value (n*θ = 400).
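The expected value drawn in Figure 3 can be checked with a few lines of Python, using the parameter values given in the caption. The function names are illustrative.

```python
def theta(deme_size: float, mu: float) -> float:
    """Scaled mutation rate of a deme: theta = 4 * N * mu."""
    return 4.0 * deme_size * mu

def expected_pi(n_demes: int, deme_size: float, mu: float) -> float:
    """Expected within-deme pairwise difference in a symmetric island model: n * theta."""
    return n_demes * theta(deme_size, mu)

print(theta(1000, 0.0001))               # theta for a deme of size 1000
print(expected_pi(1000, 1000, 0.0001))   # expected pi for the 1000-island model
```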

References

Hudson RR (2002) Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18, 337-338.

Parreira B, Trussart M, Sousa V, Hudson RR, Chikhi L (2009) SPAms: A user-friendly software to simulate data under complex demographic models. Molecular Ecology Resources, 9, 749-753.
