Abstract: Chronic bacterial airway infections in people with cystic fibrosis are often caused by .... impedes accurate diagnosis and treatment there is an urgent need to understand the. 91 .... Supplementary tables 2-4 list the totals of the. 145.
Microbial Genomics Refined analyses suggest that recombination is a minor source of genomic diversity in Pseudomonas aeruginosa chronic cystic fibrosis infections --Manuscript Draft-Manuscript Number:
MGEN-D-15-00029R1
Full Title:
Refined analyses suggest that recombination is a minor source of genomic diversity in Pseudomonas aeruginosa chronic cystic fibrosis infections
Article Type:
Short Paper
Section/Category:
Microbial evolution and epidemiology: Population Genomics
Corresponding Author:
Craig Winstanley University of Liverpool Liverpool, UNITED KINGDOM
First Author:
David Williams
Order of Authors:
David Williams Steve Paterson Michael A Brockhurst Craig Winstanley
Abstract:
Chronic bacterial airway infections in people with cystic fibrosis are often caused by Pseudomonas aeruginosa, typically showing high phenotypic diversity among coisolates from the same sputum sample. While adaptive evolution during chronic infections has been reported, the genetic mechanisms underlying the observed rapid within-population diversification are not well understood. Two recent conflicting reports described respectively very high (Darch et al. 2015) and low (Williams at al. 2015) rates of homologous recombination in two closely related P. aeruginosa populations from the lungs of different chronically infected cystic fibrosis patients. To investigate the underlying cause of these contrasting observations, we combined the short read data sets from both studies and applied a new comparative analysis. We inferred low rates of recombination in both populations. The discrepancy in the findings of the two previous studies can be explained by differences in the application of variant calling techniques. Two novel algorithms were developed that filter false positive variants. The first algorithm filters variants on the basis of ambiguity within duplications in the reference genome. The second omits probable false positive variants at regions of non-homology between reference and sample caused by structural rearrangements. Because gains and losses of prophage or genomic islands are frequent causes of chromosomal rearrangements within microbial populations, this filter has broad appeal for mitigating false positive variant calls. Both algorithms are available in a Python package.
Downloaded from www.microbiologyresearch.org by 172.245.148.126 Powered by Editorial Manager® and IP: ProduXion Manager® from Aries Systems Corporation On: Tue, 19 Jan 2016 19:49:58
Manuscript Including References (Word document)
Click here to download Manuscript Including References (Word document) Corrected format_noTT.docx
Research Paper template
1 2 3
Refined analyses suggest that recombination is a minor source of genomic diversity in Pseudomonas aeruginosa chronic cystic fibrosis infections.
4 5
ABSTRACT
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Chronic bacterial airway infections in people with cystic fibrosis are often caused by Pseudomonas aeruginosa, typically showing high phenotypic diversity among co-isolates from the same sputum sample. While adaptive evolution during chronic infections has been reported, the genetic mechanisms underlying the observed rapid within-population diversification are not well understood. Two recent conflicting reports described respectively very high (Darch et al. 2015) and low (Williams at al. 2015) rates of homologous recombination in two closely related P. aeruginosa populations from the lungs of different chronically infected cystic fibrosis patients. To investigate the underlying cause of these contrasting observations, we combined the short read data sets from both studies and applied a new comparative analysis. We inferred low rates of recombination in both populations. The discrepancy in the findings of the two previous studies can be explained by differences in the application of variant calling techniques. Two novel algorithms were developed that filter false positive variants. The first algorithm filters variants on the basis of ambiguity within duplications in the reference genome. The second omits probable false positive variants at regions of non-homology between reference and sample caused by structural rearrangements. Because gains and losses of prophage or genomic islands are frequent causes of chromosomal rearrangements within microbial populations, this filter has broad appeal for mitigating false positive variant calls. Both algorithms are available in a Python package.
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
Research Paper template 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
DATA SUMMARY 1. Short read data for Nottingham P. aeruginosa isolates were obtained from the
European Nucleotide Archive, study: ERP005188 (http://www.ebi.ac.uk/ena/data/view/ERP005188) 2. Short read data for Liverpool P. aeruginosa isolates were obtained from the European Nucleotide Archive, study: ERP006191, sample group ERG001740, reads ERR953477-516 (http://www.ebi.ac.uk/ena/data/view/ERP006191) 3. Complete genome sequence with annotations for P. aeruginosa LESB58 was obtained from NCBI RefSeq: NC_011770.1 (http://www.ncbi.nlm.nih.gov/nuccore/NC_011770.1) 4. Complete genome sequence with annotations for P. aeruginosa LESlike7 was obtained from NCBI RefSeq: NZ_CP006981.1 (http://www.ncbi.nlm.nih.gov/nuccore/NZ_CP006981.1) 5. The Python package “Bacterial and Archaeal Genome Analyser” (BAGA) can be used to the data download the data, reproduce most of the analysis, tables and figures. The most recent version is available from git repository https://github.com/daveuu/baga and release v0.2 at URL & DOI: http://dx.doi.org/10.6084/m9.figshare.2056350 6. A script to reproduce the analysis using BAGA is available via FigShare, URL & DOI: http://dx.doi.org/10.6084/m9.figshare.2056359 7. A script to reproduce the benchmarking of variant calling using BAGA is available via FigShare, URL & DOI: http://dx.doi.org/10.6084/m9.figshare.2056365 8. Variants called against the P. aeruginosa LESB58 and LESlike7 genomes and for benchmarking are available as VCF files via FigShare, URL & DOI: http://dx.doi.org/10.6084/m9.figshare.2056326 9. Variants called against the P. aeruginosa LESB58 and LESlike7 genomes and for benchmarking are available as CSV files via FigShare, URL & DOI: http://dx.doi.org/10.6084/m9.figshare.2056356 10. The multiple sequence alignments from which the phylogeny and recombination was inferred is available via FigShare, URL & DOI: http://dx.doi.org/10.6084/m9.figshare.2056344
57 58 59
I/We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files. ☑
60 61
IMPACT STATEMENT
62
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
Research Paper template 63 64 65 66 67 68 69 70 71 72 73 74 75 76
Rapid pathogen evolution within chronic infections is a major health concern. The resulting high levels of genetic diversity within patients can make infections harder to diagnose and treat. Understanding the genetic mechanisms by which this genetic diversity is generated is therefore vitally important. Two recent studies using genomics to analyse populations of Pseudomonas aeruginosa causing chronic airway infections in cystic fibrosis patients reported conflicting findings. Estimates of the contribution of genetic exchange by homologous recombination, a process that could potentially accelerate pathogen adaptive evolution by generating diversity, differed between the two reports. We applied a new analytical approach to the genome data from these studies that, by inclusion of a stringent data-filtering regime, was designed improve accuracy. In both sets of data, we found similarly low rates of genetic exchange. This suggests that de novo mutation, not genetic exchange, is the primary mechanism driving evolutionary diversification of bacterial populations in these chronic infections.
77 78
INTRODUCTION
79
People with cystic fibrosis (CF) are susceptible to a range of bacterial airway infections, most commonly due to Pseudomonas aeruginosa, which once established are difficult to clear. Damage to lung tissue caused by these chronic infections is a major cause of patient morbidity and mortality. A number of studies have used analyses of sequentially sampled single isolate genome sequences obtained over many years from CF patient sputum to characterise adaptive evolution during chronic infection (Smith et al. 2006; Cramer et al. 2011; Marvig et al. 2013). However, recent studies have demonstrated that there is considerable phenotypic and genomic diversity within single populations of P. aeruginosa in the CF lung (Mowat et al. 2011; Workentine et al. 2013; Williams et al. 2015). Darch et al. (2015) reported large trade-offs in virulence factors, quorum sensing signals and growth among CF lung P. aeruginosa. Notably, they discovered that when multiple isolates were mixed together, resistance to antibiotics significantly increased. Because this diversity impedes accurate diagnosis and treatment there is an urgent need to understand the mechanisms by which these complex population structures have evolved.
80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
Although most patients become infected with a P. aeruginosa strain from the environment, transmissible strains can lead to cross infection between CF patients (Winstanley et al. 2009; Fothergill et al. 2012). In the UK, the most abundant single lineage associated with chronic lung infections with CF patients is the Liverpool Epidemic Strain (LES) (Fothergill et al. 2012; Martin et al. 2013). Recent reports by Darch et al. (2015) and Williams et al. (2015) estimated the amount of genetic exchange by homologous recombination in populations of the P. aeruginosa LES from chronic infections of CF airways. Both studies sequenced genomes of multiple contemporary isolates from individual patient sputum samples, but whereas Darch et al. inferred high rates of recombination correlated to phenotypic diversity
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
Research Paper template 102 103 104 105 106 107 108 109 110
Williams et al. reported much lower rates, implying a larger role for spontaneous mutations in generating diversity. In this study, we describe a novel and easily reproducible analysis of whole genome short reads from the Darch et al. (2015) and Williams et al. (2015) papers to estimate recombination rates among P. aeruginosa LES populations during chronic infection of the airways of two CF patients. We conclude that differences in the bioinformatic analyses can explain the contradictory findings between the two studies, and that although it occurs, recombination is not the major driver of the population heterogeneity observed amongst infecting populations of P. aeruginosa in these patients.
111 112
METHODS
113
The whole variant calling bioinformatic analysis pipeline can be conveniently reproduced using the freely available “Bacterial and Archaeal Genome Analyser” (BAGA) command line tool and Python 2.7 package, tested on Linux. See data bibliography for commands to reproduce the analysis and benchmarking. Each set of short reads were aligned to two reference genomes: P. aeruginosa LESB58 (Winstanley et al. 2009) and LESlike7 (Jeukens et al. 2014). Variants were called using the Genome Analysis Toolkit (GATK) Haplotype caller (McKenna et al. 2010; DePristo et al. 2011; Van der Auwera et al. 2013). Two novel variant filters were developed to mitigate false positive variants at regions likely to be affected by rearrangements between a sample and the reference and repeat regions in the reference. Variants were validated by confirming their existence in contigs generated by de novo assembly of the small subset of reads aligning to regions around variants using SPAdes (Bankevich et al. 2012). The accuracy of the pipeline was benchmarked using simulated reads containing known variants using GemSIM (McElroy et al. 2012). See supplementary materials for further details.
114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
RESULTS AND DISCUSSION
129 130
Both populations exhibit similarly low rates of recombination
131
This analysis incorporates genomic short read Illumina data from two previous studies. All short read data from the Darch et al. (2015) report were included representing 22 of the P. aeruginosa isolates obtained from a single sputum sample from a chronically infected CF patient at a Nottingham clinic. These will be referred to as the Nottingham data. A subset of the short read data from the Williams et al. (2015) report, that sequenced from 40 P. aeruginosa isolates obtained from the patient “CF03” sputum sample, were incorporated
132 133 134 135 136
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
Research Paper template 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
and will be referred to as the Liverpool data. Differences in the methods of the two previous papers are summarised in Fig. 1. In the present analysis 270 SNPs and 60 indels were called between the Liverpool sequences and the LESB58 reference sequence compared to 251 SNPs and 49 indels reported previously (Williams et al. 2015) . For the Nottingham sequences, 129 SNPs were called. This is similar in quantity to a set of 121 SNPs reported by Darch et al. (2015) from short read alignments. Of the 129 SNPs called here, 37 were polymorphic among the samples in contrast to 78 in the Darch et al (2015) data. Supplementary tables 2-4 list the totals of the different classes of variants and the effects of variant filters (see data bibliography items 8 and 9 for tables containing each variant). The congruity between SNPs called in this analysis and those reported by Darch et al. (2015) is low (Fig. 2), although 41 within sample SNPs in the latter were called as fixed here. Several variants were called in both studies but subsequently filtered by our novel algorithms. Most were adjacent to prophage in the reference that were missing in the samples, or at repetitive regions associated with paired reads mapping with unexpected separation distances (Fig. 3). Additionally, the previous study reported an absence of insertions or deletions causing frame shifts in protein coding regions either among, or compared to the reference sequence, which is atypical for P. aeruginosa isolated from chronic CF airway infections (Rau et al. 2010; Marvig et al. 2013). In this analysis, eight protein coding regions were polymorphic for frame-shift indels among the Nottingham isolates: PLES_RS00075, PLES_RS001530, PLES_RS05000 (mpl), PLES_RS06070, PLES_RS12750, PLES_RS15875 (pslJ), PLES_RS18160 (mltD), PLES_RS18915 (stk1). A further six were fixed relative to the reference genome (LESB58): PLES_RS02170 (mexB), PLES_RS16380, PLES_RS25200 (ampD), PLES_RS27535, PLES_RS28165, PLES_RS30445 (Supplementary table 1). In total, 37 indels were inferred in the Nottingham isolates relative to LESB58, 22 of which were polymorphic among them.
163 164 165 166 167 168 169 170
The overall higher number of variants, particularly indels, reported in the present analysis might reflect improved sensitivity of the newer GATK HaploType caller module (McKenna et al. 2010) over the GATK Unified Genotype caller used by Williams et al. (2015) and SAMTools (Li et al. 2009) used by Darch et al. (2015). This characteristic has been demonstrated in previous benchmarking reports (Chapman 2014). The joint-genotyping performed in the GATK-based analyses, where information across samples is combined, might also have contributed to the incongruence with the Darch et al. (2015) results.
171 172 173 174 175 176 177
Phylogenetic reconstruction (Fig. 4) recovered the two distinct lineages in Liverpool patient CF3 reported previously (Williams et al. 2015) and a single lineage of lower diversity from the Nottingham data. Despite some differences in bioinformatics (Fig. 1), the topology within each Liverpool lineage was similar to that reported previously (Williams et al. 2015) with an unweighted Robinson-Foulds distance of 44 between 79 bipartition trees (Robinson & Foulds 1981). The topology within the 22 isolate Nottingham lineage was entirely
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
Research Paper template 178 179 180 181
incongruent to that reported by Darch et al. (2015): the 43 and 41 bipartition trees differed by a distance of 42. This metric is the sum of the edges present in one tree but not the other and vice versa. In this case, 42 is the maximum distance given the number of tips and resolution of each topology i.e., the topologies could not be more incongruent.
182 183 184 185 186 187 188 189 190 191 192 193 194 195 196
Recombination was inferred using BRAT NextGen (Marttinen et al. 2012) and, for importations only, ClonalFrameML (Didelot & Wilson 2015). Among the 22 Nottingham genomes analysed together, BRAT NextGen detected a single region from 433,989 to 2,386,565 bp imported by one isolate (SED5). ClonalFrameML did not detect any imported regions. BRAT NextGen analysis of the combined Liverpool and Nottingham data yielded nine distinct origins of foreign genomic segments among the 62 sequences, with a single region of common origin among all 22 Nottingham isolates. ClonalFrameML reported six haplotypes (distinct chromosome regions), totalling 2134 bp with 8 individuals, extant or ancestral, affected. These are indicated in Fig. 4 and have only two regions, close to 0.8 Mb in the larger Liverpool clade, that are in partial agreement with the BRAT NextGen analysis. Thus, the present analysis indicates the Liverpool and Nottingham data sharing similarly low rates of recombination and does not corroborate the high rates of homologous recombination for P. aeruginosa in chronic CF lung infections reported by Darch et al. (2015).
197 198
Potential causes of incongruity between studies
199
Recombination detection sensitivity decreases with diversity (Marttinen et al. 2012), thus the lower recombination rate inferred here can be partly explained by the fewer variants found in this study, than in Darch et al (2015): 37 versus 78 SNPs. However, in addition to read mapping, multiple sequence alignment of short read de novo assemblies of the 22 genome sequences yielded a nearly 40-fold larger set of 1,436 SNPs (Fig S2). This high diversity SNP collection was used for BRAT NextGen analysis indicating many more historical recombination events. With more complexity, errors in each of the assemblies might have compounded errors of multiple alignment i.e., false positive SNPs. Alternatively, the short read mapping approaches might have missed many true positive variants, including those in variably present chromosome regions that cannot be aligned to the LESB58 reference sequence. We quantified missing regions by de novo assembly of reads that did not align, or aligned poorly for each isolate genome: on average, 38 kb among 25 contigs. Thus, the read mapping approach missed 0.57% of each genome, consistent with a pangenome analysis by Darch et al. (2015) that indicated only three genes missing in the reference and present in more than one isolate. Furthermore, their BRAT NextGen analysis indicated recombinant regions along the whole genome length. Thus, it seems most variants are likely to be in regions tested by the read mapping approaches.
200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218
We also performed a series of benchmarking and validation tests on our variant calling approach. In simulated reads from LESB58 genome sequences with known variants added in
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
Research Paper template 219 220 221 222 223
silico, all SNPs were called correctly with no false positives (Table 1). Darch et al (2015) reported no indels, whereas the present analysis found several. Across our ten simulations, 295 out of 300 indels were correct within 2 base pairs. In simulations 5-10, large deletions simulated absence of LESB58 Genomic Island 5 and prophage 5. Polymorphic false positive insertions were called by GATK, but correctly filtered by our novel “rearrangements” filter.
224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248
We validated each SNP by performing de novo assemblies of the reads aligned to the same region and checking the resulting contigs for the variant. Our novel “rearrangements” and “repeats” filters improved variant corroboration from 88-98% to 97-100%, thus they apparently omitted false positive SNPs called in short read alignments but absent in the contigs (supplementary table 5). Among variants removed by the “rearrangements” filter, were 7 SNPs and 3 indels either side of LESB58 prophage 5 – a genomic feature absent in the Nottingham isolates. To test whether the filter was performing correctly in this instance, we repeated the variant calling but used LESlike7 instead: an alternative reference genome without this prophage. No polymorphisms were called around the region in LESlike7 orthologous to the prophage insertion site, thus validating that the filter had removed false positives. The other novel filter was designed to detect repeat regions in the reference genome longer than the sequencing fragment length which is known to cause ambiguity in mapping reads (Olson et al. 2015). Repeats identified included a previously documented repeat within prophage 2 at 0.895-9 Mb and prophage 3 at 1.465-9 Mb (Winstanley et al. 2009). Nineteen SNPs were filtered from this region that also spanned numerous SNPs inferred as adaptive and recombinant by Darch et al. (2015). Notably, the divergence between these repeat units was enough to confuse the read aligner: the alignment contained double the mean read depth over one region but very few reads at the other (Fig 4). Furthermore, the “rearrangements” filter detected the edges of the repeat units, between which read pair members were arbitrarily aligned causing unexpected mapping distances. Both filters are therefore omitting confirmed false positives or variants at repeats indicated as unreliable because of improper paired-read mapping. The filtering of true positives, causing underestimation of diversity and potentially of recombination, remains a possibility. However, the benchmarking and validation suggests this effect is small.
249 250
CONCLUSION
251
Our analysis suggests homologous recombination has made a minor contribution to P. aeruginosa LES diversity in two chronic cystic fibrosis airway infections compared to spontaneous mutations. Within host adaptive evolution might therefore be limited by the mutation rate, rather than the effects of homologous recombination, potentially explaining the prevalence of hypermutator isolates in CF clinical samples (Oliver et al. 2000) including the LES (Kenna et al. 2007). A recent study of P. aeruginosa diversity in different regions of chronically infected cystic fibrosis lungs suggests that there exists strong spatial separation of subpopulations within the lung (Jorth et al. 2015). Thus genetically diverged lineages, even if coexisting within the same lung, may not encounter each other often enough to
252 253 254 255 256 257 258 259
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
Research Paper template 260 261 262 263 264
exchange DNA at detectable levels. The novel variant filters allow data-driven exclusion of probable false positives with minimal intervention from the researcher. The rearrangements filter, designed to mitigate false positives caused by missing chromosome regions, is of particular value given the abundance and rapid dynamics of prophage and genomic islands in microbial communities (Zhou et al. 2011).
265 266
ACKNOWLEDGEMENTS
267 268
This work was supported by The Wellcome Trust (award 093306/Z/10)
269 270
ABBREVIATIONS
271 272
SNP; single nucleotide polymorphism
273
indel; small insertion or deletion
274
CF; cystic fibrosis
275
GATK; Genome Analysis Toolkit
276
BAGA; Bacterial and Archaeal Genome Analyser
277 278
REFERENCES
279 280 281
Bankevich A., Nurk S., Antipov D., Gurevich A.A., Dvorkin M., Kulikov A.S., Lesin V.M., Nikolenko S.I., Pham S., other authors (2012). SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19 455-477.
282 283
Chapman, B. Validating generalized incremental joint variant calling with GATK HaplotypeCaller, FreeBayes, Platypus and samtools [https://bcbio.wordpress.com/2014/10/07/joint-calling/]
284 285 286
Cramer N., Klockgether J., Wrasman K., Schmidt M., Davenport C.F., Tümmler B. (2011). Microevolution of the major common Pseudomonas aeruginosa clones C and PA14 in cystic fibrosis lungs. Environ Microbiol 13 1690-1704.
287 288 289
Darch S.E., McNally A., Harrison F., Corander J., Barr H.L., Paszkiewicz K., Holden S., Fogarty A., Crusz S.A., Diggle S.P. (2015). Recombination is a key driver of genomic and phenotypic diversity in a Pseudomonas aeruginosa population during cystic fibrosis infection. Sci Rep 5 -.
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
Research Paper template 290 291 292
DePristo M.A., Banks E., Poplin R., Garimella K.V., Maguire J.R., Hartl C., Philippakis A.A., del Angel G., Rivas M.A., other authors (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43 491-498.
293 294
Didelot X., Wilson D.J. (2015). ClonalFrameML: Efficient Inference of Recombination in Whole Bacterial Genomes. PLoS Comput Biol 11 e1004041-.
295 296
Fothergill J.L., Walshaw M.J., Winstanley C. (2012). Transmissible strains of Pseudomonas aeruginosa in cystic fibrosis lung infections. Eur Respir J 40 227-238.
297 298 299 300
Jeukens J., Boyle B., Kukavica-Ibrulj I., Ouellet M.M., Aaron S.D., Charette S.J., Fothergill J.L., Tucker N.P., Winstanley C., Levesque R.C. (2014). Comparative Genomics of Isolates of a Pseudomonas aeruginosa Epidemic Strain Associated with Chronic Lung Infections of Cystic Fibrosis Patients. PLoS ONE 9 e87611.
301 302 303
Jorth P., Staudinger B., Wu X., Hisert K.B., Hayden H., Garudathri J., Harding C., Radey M., Rezayat A., other authors (2015). Regional Isolation Drives Bacterial Diversification within Cystic Fibrosis Lungs. Cell Host & Microbe 18 1-13.
304 305 306
Kenna D.T., Doherty C.J., Foweraker J., Macaskill L., Barcus V.A., Govan J.R.W. (2007). Hypermutability in environmental Pseudomonas aeruginosa and in populations causing pulmonary infection in individuals with cystic fibrosis. Microbiology 153 1852-1859.
307 308
Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25 2078-2079.
309 310 311
Martin K., Baddal B., Mustafa N., Perry C., Underwood A., Constantidou C., Loman N., Kenna D.T., Turton J.F. (2013). Clusters of genetically similar isolates of Pseudomonas aeruginosa from multiple hospitals in the UK. J Med Microbiol 62 988-1000.
312 313 314
Marttinen P., Hanage W., Croucher N., Connor T., Harris S., Bentley S., Corander J. (2012). Detection of recombination events in bacterial genomes from large population samples. Nucleic Acids Res 40 -.
315 316 317
Marvig R.L., Johansen H.K., Molin S., Jelsbak L. (2013). Genome Analysis of a Transmissible Lineage of Pseudomonas aeruginosa Reveals Pathoadaptive Mutations and Distinct Evolutionary Paths of Hypermutators. PLoS Genet 9 e1003741.
318 319
McElroy K.E., Luciani F., Thomas T. (2012). GemSIM: general, error-model based simulator of nextgeneration sequencing data. BMC Genomics 13 74.
320 321 322
McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., other authors (2010). The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20 1297-1303.
323 324 325
Mowat E., Paterson S., Fothergill J.L., Wright E.A., Ledson M.J., Walshaw M.J., Brockhurst M.A., Winstanley C. (2011). Pseudomonas aeruginosa population diversity and turnover in cystic fibrosis chronic infections. Am J Respir Crit Care Med 183 1674-1679.
326 327
Oliver A., Cantón R., Campo P., Baquero F., Blázquez J. (2000). High Frequency of Hypermutable Pseudomonas aeruginosa in Cystic Fibrosis Lung Infection. Science 288 1251-1253.
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
Research Paper template 328 329 330 331
Rau M.H., Hansen S.K., Johansen H.K., Thomsen L.E., Workman C.T., Nielsen K.F., Jelsbak L., Høiby N., Yang L., Molin S. (2010). Early adaptive developments of Pseudomonas aeruginosa after the transition from life in the environment to persistent colonization in the airways of human cystic fibrosis hosts. Environ Microbiol 12 1643-1658.
332 333 334
Smith E.E., Buckley D.G., Wu Z., Saenphimmachak C., Hoffman L.R., D'Argenio D.A., Miller S.I., Ramsey B.W., Speert D.P., other authors (2006). Genetic adaptation by Pseudomonas aeruginosa to the airways of cystic fibrosis patients. Proc Natl Acad Sci U S A 103 8487-8492.
335 336 337
Van der Auwera G.A., Carneiro M.O., Hartl C., Poplin R., del Angel G., Levy-Moonshine A., Jordan T., Shakir K., Roazen D., other authors (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Curr Protoc Bioinform 43 11.10.1-11.10.33.
338 339 340
Williams D., Evans B., Haldenby S., Walshaw M.J., Brockhurst M.A., Winstanley C., Paterson S. (2015). Divergent, Coexisting Pseudomonas aeruginosa Lineages in Chronic Cystic Fibrosis Lung Infections. Am J Respir Crit Care Med 191 775-785.
341 342 343 344
Winstanley C., Langille M.G., Fothergill J.L., Kukavica-Ibrulj I., Paradis-Bleau C., Sanschagrin F., Thomson N.R., Winsor G.L., Quail M.A., other authors (2009). Newly introduced genomic prophage islands are critical determinants of in vivo competitiveness in the Liverpool Epidemic Strain of Pseudomonas aeruginosa. Genome Res 19 12-23.
345 346 347
Workentine M.L., Sibley C.D., Glezerson B., Purighalla S., Norgaard-Gron J.C., Parkins M.D., Rabin H.R., Surette M.G. (2013). Phenotypic Heterogeneity of Pseudomonas aeruginosa Populations in a Cystic Fibrosis Patient. PLoS ONE 8 e60225.
348 349
Zhou Y., Liang Y., Lynch K.H., Dennis J.J., Wishart D.S. (2011). PHAST: A Fast Phage Search Tool. Nucleic Acids Res 39 W347-W352.
350 351 352
DATA BIBLIOGRAPHY
353
1. Darch, S. E., McNally, A., Harrison, F., Corander, J., Barr, H. L., Paszkiewicz, K., Holden, S.,
354
Fogarty, A., Crusz, S. A. & Diggle, S. P. European Nucleotide Archive, ERP005188 (2014). Williams, D., Evans, B., Haldenby, S., Walshaw, M. J., Brockhurst, M. A., Winstanley, C. & Paterson, S. European Nucleotide Archive, ERR953477-516 (2014). Winstanley, C., Langille, M.G., Fothergill, J.L., Kukavica-Ibrulj, I., Paradis-Bleau, C., Sanschagrin, F., Thomson, N.R., Winsor, G.L., Quail, M.A., Lennard, N., Bignell, A., Clarke, L., Seeger, K., Saunders, D., Harris, D., Parkhill, J., Hancock, R.E., Brinkman, F.S. & Levesque, R.C. RefSeq NC_011770.1 (2008). Jeukens, J., Boyle, B., Kukavica-Ibrulj, I., Ouellet, M.M., Aaron, S.D., Charette, S.J., Fothergill, J.L., Tucker, N.P., Winstanley, C. & Levesque, R.C. RefSeq NZ_CP006981.1 (2013). Williams, D., Paterson, S., Brockhurst, M.A. & Winstanley, C. Figshare 10.6084/m9.figshare.2056350 (2015). Williams, D., Paterson, S., Brockhurst, M.A. & Winstanley, C. Figshare 10.6084/m9.figshare.2056359 (2015).
355
2.
356 357
3.
358 359 360 361
4.
362 363
5.
364 365 366
6.
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
Research Paper template 367
7. Williams, D., Paterson, S., Brockhurst, M.A. & Winstanley, C. FigShare
368
10.6084/m9.figshare.2056365 (2015). 8. Williams, D., Paterson, S., Brockhurst, M.A. & Winstanley, C. FigShare 10.6084/m9.figshare.2056326 (2015). 9. Williams, D., Paterson, S., Brockhurst, M.A. & Winstanley, C. FigShare 10.6084/m9.figshare.2056356 (2015). 10. Williams, D., Paterson, S., Brockhurst, M.A. & Winstanley, C. FigShare 10.6084/m9.figshare.2056344 (2015).
369 370 371 372 373 374 375 376
FIGURES AND TABLES
377 378 379 380
Table 1. Accuracy of the variant calling pipeline for single nucleotide, small insertion and small deletion polymorphisms.
381 True Simulation Total positive no. SNPs* SNPs
382 383
False positive SNPs
False negative SNPs
True Total positive InDels InDels
False positive InDels
False negative InDels
1
51
51
0
0
30
24
6 6‡
2
48
48
0
0
30
20
10 10
3
47
47
0
0
30
24
66
4
52
52
0
0
30
23
77
5
50
50
0
0
30
22
88
6
47
47
0
0
30
21
99
7
52
52
0
0
30
21
99
8
53
53
0
0
30
21
99
9
51
51
0
0
30
20
11 10
10
50
50
0
0
30
22
14 8
*Single nucleotide polymorphisms; †Small insertion or deletion polymorphisms; ‡All but five incorrect InDels are within 2 base pairs of the correct call.
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
Research Paper template 384 385
Figure legends
386 387 388 389 390 391 392
Fig. 1. Comparison of stages of bioinformatic analyses in this and two previous studies. The relevant aim of each analysis is estimation of recombination rates among P. aeruginosa isolates from short read whole genome shotgun sequencing data. GATK: Genome Analysis Tool Kit; BAGA: Bacterial and Archaeal Genome Analyser.
Fig. 2. Congruity between two independent variant calling analyses on the same dataset.
393 394 395 396 397 398 399 400 401 402 403 404 405 406 407
Fig. 3. Sequence repeat units in P. aeruginosa LESB58 that have disrupted paired short read alignments for a closely related isolate. Light grey is depth of aligned reads designated in a “proper pair” by the aligner algorithm; dark grey, stacked above light grey, is depth of reads not in a proper pair. Light green is the ratio of reads not in a proper pair to those in a proper pair; dark green is this ratio “smoothed” by averaging over a 500 bp moving window. Read mapping for variant calling assumes a 1-to-1 orthologous relationship between sample and reference genome. The proportion of proper pair aligned reads decreases where this orthologous relationship breaks down at areas of chromosomal rearrangement, or when the orthology becomes ambiguous such as repeat regions. Variants called within a zone where the smoothed average ratio exceeds a threshold of 0.15, highlighted in red, have a higher chance of being false positives and are omitted. Variants are also omitted from neighbouring zones with discontinuous regions of aligned reads, highlighted in orange. Several variants were filtered at this region. This region was also detected by the “repeats” filter.
408 409 410 411 412 413 414 415
Fig. 4. Maximum likelihood phylogenetic reconstruction for 63 P. aeruginosa LES isolates. All were isolated from CF patient sputa. Forty were isolated from a single sample obtained at a clinic in Liverpool, UK in 2009 (ERR953477-516). Twenty two were isolated from a single sample obtained at a clinic in Nottingham, UK in approximately 2014 (ERR453458-479). The remaining outgroup isolate, LESB58, was isolated in 1988 in Liverpool, UK. The coloured circles indicate branches that imported DNA by homologous recombination as inferred by ClonalFrameML. The positions in the key correspond to the LESB58 genome sequence.
416 417 418 419 420
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
Figure 1
Click here to download Figure Figure 1 methodology comparison.png
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
Figure 2
Click here to download Figure Figure 2 Venn diagrams.png
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
Figure 3
Click here to download Figure Figure 3 repeat disruption.png
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
Figure 4
Click here to download Figure Figure 4 phylogeny with CFML imported regions.png
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
Supplementary Material Files
Click here to download Supplementary Material Files Revised Supplementary.docx
1 2
Supplementary Materials for:
3 4 5 6
Refined analyses suggest recombination is a minor source of genomic diversity in Pseudomonas aeruginosa chronic cystic fibrosis infections.
7 8 9 10 11 12 13 14 15 16
METHODS Short Read Preparation Preprocessing of short read data was implemented in the PrepareReads module of BAGA. Large FASTQ read files were subsampled to provide a 80x read coverage for mapping to a 6.6 Mb genome. Reads were trimmed using position-specific quality scores with Sickle ver. 1.33 (Joshi & Fass 2011) . Adaptor sequences from library preparation were removed using Cutadapt ver. 1.9.dev0 (Martin 2011) .
17 18
Short Read Alignment to reference genome
19
Short read alignment was implemented in the AlignReads module of BAGA. Paired end short reads were aligned to the LESB58 reference genome sequences with the MEM algorithm of BWA (Li & Durbin 2009) (github.com repository revision 9812521) estimating insert size from the data for designation of “proper pair” reads. SAMTools (github.com repository revision b0525f3) (Li et al. 2009) was used for conversion of SAM to BAM files and BAM file sorting. Duplicate reads were removed from alignments using Picard version1.135. The IndelRealigner module of the Genome Analysis Toolkit (GATK) (McKenna et al. 2010; DePristo et al. 2011) was used to simultaneously re-align reads that were likely to be near insertions or deletions.
20 21 22 23 24 25 26 27 28
Reference genome repetitive region variant filter
29
The algorithm for identifying reference genome repeat regions in which variants were deemed unreliable and omitted was implemented in the Repeats module of BAGA. First, each open reading frame nucleotide sequence in the published annotation was aligned back to the full LESB58 chromosome sequence using BWA MEM lowering alignment stringency by setting mismatch penalty to 2 and gap open penalty to 3. These initial alignments using BWA are used to seed an alignment procedure that uses optimal global (Needleman-Wunsch) alignment for better accuracy. The lower alignment stringency produces alignments more divergent than the target 98% nucleotide identity. After omitting alignments between the same sequences, the resulting collection of ORFs within
30 31 32 33 34 35 36
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
44
repeat regions were organised into contiguous blocks of homologous sequence. Each pair of homologous sequences, which may now contain more than one pair of ORFs, were aligned using the Needleman-Wunsch optimal global alignment algorithm implemented in the seq-align package (github.com/noporpoise/seq-align revision 0589383). The percent identity between the two aligned regions is calculated within each 100 bp window at 20 bp increments along a pairwise alignment and regions with more than 98% nucleotide sequence identity were deemed similar enough to yield ambiguous read alignments. Regions greater than the mean insert size are retained for filtering because smaller regions are expected to be resolved by the relative positions of paired-end reads.
45
Genome rearrangements variant filter
46
This algorithm was implemented in the Structure module of BAGA and is applied on a per sample basis. The threshold for the ratio of reads not in a proper pair, to those in a proper pair, which when exceeded defines regions in which variant calls should be omitted, was set to 0.15. At this threshold, false positive variants in the simulations where large deletions were added were successfully filtered out. This ratio, collected at each position in the reference genome, is “smoothed” by averaging over a 500 bp moving window. Proper pair status was determined by the BWA MEM aligner and was acquired from BAM files using PySAM version 0.8.3. Each region for variant omission was extended in each direction if a moving window equal to the mean insert size had at least 15% without any aligned reads. Plotting is implemented in the Structure module.
37 38 39 40 41 42 43
47 48 49 50 51 52 53 54 55 56
Variant calling
57
65
Variant calling was implemented in the CallVariants module of BAGA which follows the “Best Practices workflow” provided by the developers of GATK (Van der Auwera et al. 2013) (https://www.broadinstitute.org/gatk/guide/best-practices accessed 15/04/2015). SNPs and indels were called simultaneously via local re-assembly of haplotypes using the HaplotypeCaller module of the GATK setting –sample_ploidy to 1, –heterozygosity to 0.0001 and –indel_heterozygosity to 0.00001 to approximate the 100-1400 SNPs in the 6.6Mb genomes reported by Williams et al. (2015) and Darch et al. (2015) and the ten-fold few indels reported by Williams et al. (2015). The two datasets were called in separate joint analyses. The standard GATK hard filters for SNPs and indels were applied. Base quality scores in BAM files were recalibrated.
66
Population structure analysis
67
The following procedures were implemented in the ComparativeAnalysis module of the BAGA. A multiple sequence alignment including the reference genome and all isolates was generated from the called SNPs with all filters applied. Invariant positions were retained and columns with missing data were included except if missing in all samples. For example, all 22 Nottingham samples are missing LESB58 Genomic Island 5 which is 50 kb long. A pair of polymorphisms 50 bases either side of the insertion site of Genomic Island 5 should be considered 100 bases apart in the 22 sampled genomes, not 50,100 bases apart, as it is only in the reference genome. Maximum likelihood phylogenetic reconstruction was performed using PhyML (github.com/stephaneguindon/phyml revision e71c553) (Guindon et al. 2010) using the default settings. Recombination was inferred using ClonalFrameML
58 59 60 61 62 63 64
68 69 70 71 72 73 74 75
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
83
(github.com/xavierdidelot/ClonalFrameML revision 2d793a3) (Didelot & Wilson 2015) using the PhyML phylogeny and -ignore_incomplete_sites true. Plotting including mapping the alignment positions reported by ClonalFrameML back to the reference genome is implemented in the ComparativeAnalysis module of BAGA. Phylogeny manipulations were performed using DendroPy 4.0.3 (Sukumaran & Holder 2010). Further detection of recombination events, not implemented in BAGA, was performed using BRAT NextGen version 4/18/2011 (Marttinen et al. 2012) with 100 iterations to achieve negligible changes in model parameter estimates over the last 50 iterations. A threshold of 5% applied to 100 permutations was used for significance testing of each event.
84
In silico read generation
85
Ten sets of paired reads were generated from the LESB58 reference sequence using GemSim version 1.6 (McElroy et al. 2012) and modelling the error profile of the Illumina GA IIx with Illumina Sequencing Kit v4 chemistry. Read length was 100 bp, with insert size 350 bp and standard deviation 20. To provide 60x coverage depth, 1,980,527 read pairs were generated. Each set of reads was generated from a reference sequence on which 50 SNPs (30 shared with ~20 additionally selected at random to approximate a tree-like distribution) and 15 insertions and deletions (five shared, 10 genome specific) had been applied in silico. These simulated reads can be reproduced using the SimulateReads module of BAGA and the scripts provided in the figshare repository (http://dx.doi.org/10.6084/m9.figshare.2056365). The read data was then analysed with the BAGA pipeline described above.
76 77 78 79 80 81 82
86 87 88 89 90 91 92 93 94 95 96
Region-specific de novo assembly of reads for SNP validation
97
For each isolate read set, at each SNP position, reads mapping to 5000 bp each side were extracted from BAM files using pySAM. These were combined with the unmapped and poorly mapped reads for de novo assembly using SPAdes. By reducing the total reads per assembly we aimed to decrease assembly complexity and increase accuracy. A set of contigs were also assembled from just the unmapped and poorly mapped reads and subtracted from the contigs of each SNP-associated assembly to leave those contigs relevant to that region. These remaining contigs were pairwise aligned to the region using the optimal global Needleman-Wunsch algorithm implemented in Seqalign. The pairwise alignment was then check for the presence or absence of the SNP.
98 99 100 101 102 103 104 105 106
RESULTS AND DISCUSSION
107 108 109 110 111 112 113 114
Impact of reference genome repeats on variant calling In the present analysis, two novel filters were applied to mitigate false positive errors. For each filter, variants were omitted if called in regions where assumptions of the method were likely to be violated: either within long repeats in the reference sequence, or near break-points at rearrangements between a sample and the reference. Calling variants by mapping whole-genome shotgun reads to a reference genome assumes each gene sequence in the sampled individual must have a unique Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
115 116 117 118
equivalent in the reference. This 1-to-1 relationship means that each read will align unambiguously to the position in the reference genome that is positionally orthologous to the read origin in the sampled genome. If a read is aligned to the wrong (non-orthologous) part of a reference genome, false positive variants calls are likely.
119 120 121 122 123 124 125 126 127
The first novel variant filter in this analysis omitted variants in long repeats in the reference sequence. Sequence repeats can be caused by duplications (paralogy) or horizontally acquired DNA incorporated into a chromosome whilst also being homologous to another part of the chromosome. The BurrowWheeler Aligner used in this analysis reports an alignment quality of zero where a read maps perfectly to two parts of a genome. In these cases, the risk of false positive variants is easily mitigated by excluding reads with an alignment quality of zero. However, these zero alignment qualities will not be reported for slightly divergent repeats, which would be the case when one or more substitutions have occurred at one repeat unit but not another.
128 129 130 131 132 133 134 135 136 137 138 139
The algorithm we developed to identify repeat regions was implemented in the Repeats module of the Bacterial and Archaeal Genome Analyser (BAGA). Of the 6.6 Mb LESB58 reference genome, 75.1 kb of chromosome in 35 regions were deemed repetitive. These regions required at least 98% aligned nucleotide identity and a length greater than the mean sequencing library insert size because smaller regions are expected to be resolved by the relative positions of paired-end reads. Only contiguous repeat regions, tandem or otherwise, longer than the insert size used in library preparation were included in the filter because shorter regions should be resolvable as unique. Of the 42.7 kb in previously described long duplications in LESB58 (Winstanley et al. 2009), 23.0 kb were included in the high identity regions reported here. Among the Nottingham data, eight SNPs within these regions were omitted (Table S2). In the Liverpool data, nine SNPs were omitted. No indels were found in these regions.
140 141 142
Impact of genome sequence rearrangements on variant calling
143
The second novel variant filter used in this analysis omitted variants in regions around break-points of chromosomal rearrangements. The assumed 1-to-1 orthology required for variant calls from aligned reads can be violated at the regions bordering rearrangements between a sample and the reference genome. These structural changes in bacteria include large deletions, inversions, duplications and integrations, the last typically caused by acquisition of prophage or genomic islands. Some phage are known to carry a fragments of tRNA genes with homology to those in the host organism and integration involves replacement of parts the host version of a tRNA gene with the phage version (Campbell 1992) . Thus, regions flanking the resulting prophage may have sequence divergence caused by a more complex history than the simple orthology assumed by the read mapping method with an increased risk of artefactual, false-positive variants.
144 145 146 147 148 149 150 151 152 153
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
154 155 156 157 158 159 160 161 162 163 164 165 166
The algorithm we developed for omitting putatively unreliable variants near regions of structural rearrangements was implemented in the Structure module of the BAGA software package. The algorithm exploits the fact that sequence reads in this analysis were “paired-end”: from either end of many template DNA fragments of similar lengths. The Burrows-Wheeler Aligner used in this analysis assigns a read “proper pair” status if the aligned distance and orientation to its paired read is close to that expected given the mean template DNA fragment length. Paired-end reads on either side of a rearrangement break-point typically align to the reference at distances far greater than the fragment lengths or in different orientations. For example, in many samples, no reads mapped to the LESB58 chromosome region annotated as “LES prophage 5” indicating absence of the prophage in those samples. Paired reads would align kilobases apart on either side of the prophage if they originated from a DNA fragment spanning the orthologous site of prophage integration, in a sample lacking the prophage. These reads would not be assigned “proper pair” status being so much further apart than the mean fragment size.
167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182
Fig. 3 illustrates how our algorithm determines regions near genome rearrangements in which variant calls may be unreliable. In this example the rearrangement is a duplication, but other common rearrangements include the loss or gain of a prophage or other genomic island. These regions are defined by a deviation in the ratio of proper-pair-assigned reads to non-proper-pair-assigned reads: the more non-proper-pair-assigned reads, the less reliable the region. The initially selected region is extended while regions of the reference have zero aligned read coverage and variants called within the combined region are omitted. In the Nottingham data, 33 SNPs and 10 indels within these regions were omitted (Table S2). In the Liverpool data, 39 SNPs and eight indels were omitted. This analytic approach, by accounting for potential false positive variant calls caused by missing regions of chromosome, is of value because turnover of prophage and genomic islands among microbial genomes is often rapid (Wilmes et al. 2009). In many cases, including the analyses of the Liverpool and Nottingham datasets, guaranteeing a similar genomic composition to permit a contiguous alignment between reference and sampled genomes is difficult. Prophage and genomic islands and are abundant throughout Bacteria and Archaea (Zhou et al. 2011) so few read mapping analyses will be unaffected by these challenges.
183
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
184
TABLES
185 186
Table S1: Small insertions and deletions predicted to cause frame-shift mutations among the Nottingham isolates Chromosome
Position (bp)
Reference
Variant
Gene ID
NC_011770.1
14914
T
TGC
NC_011770.1
335247
C
NC_011770.1
467972
NC_011770.1
Gene name
Frequency
Frequency (filtered)
PLES_RS00075
10
10
CG
PLES_RS01530
2
2
GATCCTCCTCGT
G
PLES_RS02170
mexB
22
22
1037925
A
AC
PLES_RS05000
mpl
3
3
NC_011770.1
1267074
CCCGCTGGAGCT
C
PLES_RS06070
3
3
NC_011770.1
2532081
G
GT
PLES_RS12255
1
0
NC_011770.1
2639073
AG
A
PLES_RS12750
1
1
PLES_RS15460
1
0
NC_011770.1
3280614
A
AAGCGCGCGACC CGGAAGCCGTTG CCCAGCGGGGCG CCCTGT
NC_011770.1
3378367
AG
A
PLES_RS15875
pslJ
21
21
NC_011770.1
3388596
GTACTTGCCGCT GTCCAGGTAGGA CTGGGCGGTGGC CTGGTCGGGTTT CT
G
PLES_RS15915
pslB
1
0
NC_011770.1
3484676
CGCCCGAGCAA
C
PLES_RS16380
22
22
PLES_RS18160
mltD
1
1
PLES_RS18915
stk1
1
1
NC_011770.1
3898444
C
CACGATTCGTTG TCAAAAATAGCC AAGGACCCGGAC ACACGCCTGATG CG
NC_011770.1
4044433
C
CGG
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
NC_011770.1
4931526
G
GGGGATGTCGAC ATGCAA
PLES_RS23105
NC_011770.1
5398560
TCG
T
PLES_RS25200
NC_011770.1
5903286
GCT
G
NC_011770.1
6060546
C
6557920
GGGGCGGCAGCA GTGGCGTTCGGC GCAGT
NC_011770.1
2
0
22
22
PLES_RS27535
22
22
CAT
PLES_RS28165
22
22
G
PLES_RS30445
22
22
ampD
187 188
Table S2: Total polymorphisms called for the Liverpool and Nottingham datasets as cumulative filters are applied, divided by type Liverpool
190
All data
Cumulative filters applied
SNPs*
Indels†
SNPs
Indels
SNPs
Indels
None
318
68
170
47
440
89
318
68
170
47
440
89
BAGA§ reference genome repeats
309
68
162
47
424
89
BAGA genome rearrangements
270
60
129
37
364
78
GATK‡
189
Nottingham
standard 'hard' filter
*Single nucleotide polymorphisms; †Small insertion or deletion polymorphisms; ‡Genome Analysis Toolkit; §Bacterial and Archaeal Genome Analyser
191 192 193
Table S3: Polymorphisms called as fixed in samples and differing with reference for the Liverpool and Nottingham datasets as cumulative filters are applied, divided by type Liverpool
Nottingham
All data
Cumulative filters applied
SNPs*
Indels†
SNPs
Indels
SNPs
Indels
None
43
5
101
17
113
21
GATK‡ standard 'hard' filter
43
5
101
17
113
21
BAGA§ reference genome repeats
43
5
99
17
111
21
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
BAGA genome rearrangements 194 195
43
5
92
15
104
19
*Single nucleotide polymorphisms; †Small insertion or deletion polymorphisms; ‡Genome Analysis Toolkit; §Bacterial and Archaeal Genome Analyser
196 197
Table S4: Polymorphisms called among the Liverpool and Nottingham datasets as cumulative filters are applied, divided by type Liverpool
198 199
Nottingham
All data
Cumulative filters applied
SNPs*
Indels†
SNPs
Indels
SNPs
Indels
None
275
63
69
30
337
71
GATK‡ standard 'hard' filter
275
63
69
30
337
71
BAGA§ reference genome repeats
266
63
63
30
322
71
BAGA genome rearrangements
227
55
37
22
264
60
*Single nucleotide polymorphisms; †Small insertion or deletion polymorphisms; ‡Genome Analysis Toolkit; §Bacterial and Archaeal Genome Analyser
200 201
Table S5: Proportion of single nucleotide polymorphisms called by the GATK Haplotype Caller recovered in de novo assembled contigs.
202 Sample
Proportion all variants corroborated
Proportion filtered variants corroborated*
SED01
96.83%
100.00%
SED02
92.56%
96.00%
SED03
97.54%
100.00%
SED04
95.97%
100.00%
SED05
92.48%
98.04%
SED06
94.92%
100.00%
SED07
91.41%
97.94%
SED08
93.75%
97.98%
SED09
92.48%
99.02%
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
203
SED10
94.26%
97.96%
SED11
89.84%
97.96%
SED12
93.10%
96.94%
SED13
88.10%
100.00%
SED14
94.40%
100.00%
SED15
91.41%
97.96%
SED16
93.75%
99.00%
SED17
94.49%
100.00%
SED18
95.31%
100.00%
SED19
92.91%
97.98%
SED20
92.11%
96.94%
SED21
95.97%
98.98%
SED22
96.09%
100.00%
*Most of the variants removed by the novel filters were not present in the de novo assemblies
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
204
FIGURES
205 206 207 208
Figure S1. Segments of chromosome indicated by coloured bars, among the Liverpool and Nottingham isolates, inferred to have undergone homologous recombination. Regions that are the same colour
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
209 210
share a common origin whereas grey regions were not deemed significant in a permutation test at alpha = 0.05.
211
212 213 214 215 216 217 218 219
Figure S2. Total single nucleotide polymorphisms called by different methods from the same dataset. In the first method from a previously published analysis, short reads from 22 P. aeruginosa isolates were each assembled into contigs de novo, improved using an automated pipeline and finally combined into a multiple sequence alignment within which polymorphismic positions were counted. In the second two methods one from the previous study and the third from this report, variants were called among short reads aligned against a complete reference sequence from a closely related isolate (P. aeruginosa LESB58).
220 221 222
REFERENCES Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
223 224
Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files (Version 1.29) [https://github.com/najoshi/sickle]
225 226
Martin M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnetjournal 17 1.
227 228
Li H., Durbin R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25 1754-1760.
229 230
Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25 2078-2079.
231
Picard Sequence Alignment/Map file manipulation library [http://picard.sourceforge.net/]
232 233 234
McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., other authors (2010). The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20 1297-1303.
235 236 237
DePristo M.A., Banks E., Poplin R., Garimella K.V., Maguire J.R., Hartl C., Philippakis A.A., del Angel G., Rivas M.A., other authors (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43 491-498.
238 239 240
Van der Auwera G.A., Carneiro M.O., Hartl C., Poplin R., del Angel G., Levy-Moonshine A., Jordan T., Shakir K., Roazen D., other authors (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Curr Protoc Bioinform 43 11.10.1-11.10.33.
241 242
Guindon S., Dufayard J., Lefort V., Anisimova M., Hordijk W., O. G. (2010). New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 59 307-21.
243 244
Didelot X., Wilson D.J. (2015). ClonalFrameML: Efficient Inference of Recombination in Whole Bacterial Genomes. PLoS Comput Biol 11 e1004041-.
245 246
Sukumaran J., Holder M.T. (2010). DendroPy: a Python library for phylogenetic computing. Bioinformatics 26 1569-1571.
247 248
McElroy K.E., Luciani F., Thomas T. (2012). GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics 13 74.
249 250 251 252
Winstanley C., Langille M.G., Fothergill J.L., Kukavica-Ibrulj I., Paradis-Bleau C., Sanschagrin F., Thomson N.R., Winsor G.L., Quail M.A., other authors (2009). Newly introduced genomic prophage islands are critical determinants of in vivo competitiveness in the Liverpool Epidemic Strain of Pseudomonas aeruginosa. Genome Res 19 12-23.
253
Campbell A.M. (1992). Chromosomal insertion sites for phages and plasmids. J Bacteriol 174 7495-7499.
254 255
Wilmes P., Simmons S.L., Denef V.J., Banfield J.F. (2009). The dynamic genetic repertoire of microbial communities. FEMS Microbiol Rev 33 109-132.
256 257
Zhou Y., Liang Y., Lynch K.H., Dennis J.J., Wishart D.S. (2011). PHAST: A Fast Phage Search Tool. Nucleic Acids Res 39 W347-W352.
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58
258
Downloaded from www.microbiologyresearch.org by IP: 172.245.148.126 On: Tue, 19 Jan 2016 19:49:58