Microbial Genomics

Refined analyses suggest that recombination is a minor source of genomic diversity in Pseudomonas aeruginosa chronic cystic fibrosis infections

Manuscript Draft

Manuscript Number:

MGEN-D-15-00029R1

Full Title:

Refined analyses suggest that recombination is a minor source of genomic diversity in Pseudomonas aeruginosa chronic cystic fibrosis infections

Article Type:

Short Paper

Section/Category:

Microbial evolution and epidemiology: Population Genomics

Corresponding Author:

Craig Winstanley, University of Liverpool, Liverpool, UNITED KINGDOM

First Author:

David Williams

Order of Authors:

David Williams, Steve Paterson, Michael A Brockhurst, Craig Winstanley

Abstract:

Chronic bacterial airway infections in people with cystic fibrosis are often caused by Pseudomonas aeruginosa, typically showing high phenotypic diversity among co-isolates from the same sputum sample. While adaptive evolution during chronic infections has been reported, the genetic mechanisms underlying the observed rapid within-population diversification are not well understood. Two recent conflicting reports described, respectively, very high (Darch et al. 2015) and low (Williams et al. 2015) rates of homologous recombination in two closely related P. aeruginosa populations from the lungs of different chronically infected cystic fibrosis patients. To investigate the underlying cause of these contrasting observations, we combined the short read data sets from both studies and applied a new comparative analysis. We inferred low rates of recombination in both populations. The discrepancy in the findings of the two previous studies can be explained by differences in the application of variant calling techniques. Two novel algorithms were developed that filter false positive variants. The first algorithm filters variants on the basis of ambiguity within duplications in the reference genome. The second omits probable false positive variants at regions of non-homology between reference and sample caused by structural rearrangements. Because gains and losses of prophage or genomic islands are frequent causes of chromosomal rearrangements within microbial populations, this filter has broad appeal for mitigating false positive variant calls. Both algorithms are available in a Python package.


DATA SUMMARY

1. Short read data for Nottingham P. aeruginosa isolates were obtained from the European Nucleotide Archive, study: ERP005188 (http://www.ebi.ac.uk/ena/data/view/ERP005188)
2. Short read data for Liverpool P. aeruginosa isolates were obtained from the European Nucleotide Archive, study: ERP006191, sample group ERG001740, reads ERR953477-516 (http://www.ebi.ac.uk/ena/data/view/ERP006191)
3. Complete genome sequence with annotations for P. aeruginosa LESB58 was obtained from NCBI RefSeq: NC_011770.1 (http://www.ncbi.nlm.nih.gov/nuccore/NC_011770.1)
4. Complete genome sequence with annotations for P. aeruginosa LESlike7 was obtained from NCBI RefSeq: NZ_CP006981.1 (http://www.ncbi.nlm.nih.gov/nuccore/NZ_CP006981.1)
5. The Python package “Bacterial and Archaeal Genome Analyser” (BAGA) can be used to download the data and reproduce most of the analysis, tables and figures. The most recent version is available from the git repository https://github.com/daveuu/baga and release v0.2 at URL & DOI: http://dx.doi.org/10.6084/m9.figshare.2056350
6. A script to reproduce the analysis using BAGA is available via FigShare, URL & DOI: http://dx.doi.org/10.6084/m9.figshare.2056359
7. A script to reproduce the benchmarking of variant calling using BAGA is available via FigShare, URL & DOI: http://dx.doi.org/10.6084/m9.figshare.2056365
8. Variants called against the P. aeruginosa LESB58 and LESlike7 genomes and for benchmarking are available as VCF files via FigShare, URL & DOI: http://dx.doi.org/10.6084/m9.figshare.2056326
9. Variants called against the P. aeruginosa LESB58 and LESlike7 genomes and for benchmarking are available as CSV files via FigShare, URL & DOI: http://dx.doi.org/10.6084/m9.figshare.2056356
10. The multiple sequence alignments from which the phylogeny and recombination were inferred are available via FigShare, URL & DOI: http://dx.doi.org/10.6084/m9.figshare.2056344

I/We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files. ☑

IMPACT STATEMENT

Rapid pathogen evolution within chronic infections is a major health concern. The resulting high levels of genetic diversity within patients can make infections harder to diagnose and treat. Understanding the mechanisms by which this genetic diversity is generated is therefore vitally important. Two recent studies using genomics to analyse populations of Pseudomonas aeruginosa causing chronic airway infections in cystic fibrosis patients reported conflicting findings. Estimates of the contribution of genetic exchange by homologous recombination, a process that could potentially accelerate pathogen adaptive evolution by generating diversity, differed between the two reports. We applied a new analytical approach to the genome data from these studies that, by inclusion of a stringent data-filtering regime, was designed to improve accuracy. In both sets of data, we found similarly low rates of genetic exchange. This suggests that de novo mutation, not genetic exchange, is the primary mechanism driving evolutionary diversification of bacterial populations in these chronic infections.

INTRODUCTION

People with cystic fibrosis (CF) are susceptible to a range of bacterial airway infections, most commonly due to Pseudomonas aeruginosa, which once established are difficult to clear. Damage to lung tissue caused by these chronic infections is a major cause of patient morbidity and mortality. A number of studies have used analyses of sequentially sampled single isolate genome sequences obtained over many years from CF patient sputum to characterise adaptive evolution during chronic infection (Smith et al. 2006; Cramer et al. 2011; Marvig et al. 2013). However, recent studies have demonstrated that there is considerable phenotypic and genomic diversity within single populations of P. aeruginosa in the CF lung (Mowat et al. 2011; Workentine et al. 2013; Williams et al. 2015). Darch et al. (2015) reported large trade-offs in virulence factors, quorum sensing signals and growth among CF lung P. aeruginosa. Notably, they discovered that when multiple isolates were mixed together, resistance to antibiotics significantly increased. Because this diversity impedes accurate diagnosis and treatment there is an urgent need to understand the mechanisms by which these complex population structures have evolved.

Although most patients become infected with a P. aeruginosa strain from the environment, transmissible strains can lead to cross infection between CF patients (Winstanley et al. 2009; Fothergill et al. 2012). In the UK, the most abundant single lineage associated with chronic lung infections in CF patients is the Liverpool Epidemic Strain (LES) (Fothergill et al. 2012; Martin et al. 2013). Recent reports by Darch et al. (2015) and Williams et al. (2015) estimated the amount of genetic exchange by homologous recombination in populations of the P. aeruginosa LES from chronic infections of CF airways. Both studies sequenced genomes of multiple contemporary isolates from individual patient sputum samples, but whereas Darch et al. inferred high rates of recombination correlated to phenotypic diversity, Williams et al. reported much lower rates, implying a larger role for spontaneous mutations in generating diversity. In this study, we describe a novel and easily reproducible analysis of whole genome short reads from the Darch et al. (2015) and Williams et al. (2015) papers to estimate recombination rates among P. aeruginosa LES populations during chronic infection of the airways of two CF patients. We conclude that differences in the bioinformatic analyses can explain the contradictory findings between the two studies, and that although it occurs, recombination is not the major driver of the population heterogeneity observed amongst infecting populations of P. aeruginosa in these patients.

METHODS

The whole variant calling bioinformatic analysis pipeline can be conveniently reproduced using the freely available “Bacterial and Archaeal Genome Analyser” (BAGA) command line tool and Python 2.7 package, tested on Linux. See data bibliography for commands to reproduce the analysis and benchmarking. Each set of short reads was aligned to two reference genomes: P. aeruginosa LESB58 (Winstanley et al. 2009) and LESlike7 (Jeukens et al. 2014). Variants were called using the Genome Analysis Toolkit (GATK) HaplotypeCaller (McKenna et al. 2010; DePristo et al. 2011; Van der Auwera et al. 2013). Two novel variant filters were developed to mitigate false positive variants: one at regions likely to be affected by rearrangements between a sample and the reference, and one at repeat regions in the reference. Variants were validated by confirming their existence in contigs generated by de novo assembly, using SPAdes (Bankevich et al. 2012), of the small subset of reads aligning to regions around variants. The accuracy of the pipeline was benchmarked with simulated reads containing known variants, generated using GemSIM (McElroy et al. 2012). See supplementary materials for further details.

RESULTS AND DISCUSSION

Both populations exhibit similarly low rates of recombination

This analysis incorporates genomic short read Illumina data from two previous studies. All short read data from the Darch et al. (2015) report were included, representing 22 of the P. aeruginosa isolates obtained from a single sputum sample from a chronically infected CF patient at a Nottingham clinic. These will be referred to as the Nottingham data. A subset of the short read data from the Williams et al. (2015) report, comprising reads sequenced from 40 P. aeruginosa isolates obtained from the patient “CF03” sputum sample, was incorporated and will be referred to as the Liverpool data. Differences in the methods of the two previous papers are summarised in Fig. 1. In the present analysis, 270 SNPs and 60 indels were called between the Liverpool sequences and the LESB58 reference sequence, compared to 251 SNPs and 49 indels reported previously (Williams et al. 2015). For the Nottingham sequences, 129 SNPs were called. This is similar in quantity to a set of 121 SNPs reported by Darch et al. (2015) from short read alignments. Of the 129 SNPs called here, 37 were polymorphic among the samples, in contrast to 78 in the Darch et al. (2015) data. Supplementary tables 2-4 list the totals of the different classes of variants and the effects of variant filters (see data bibliography items 8 and 9 for tables containing each variant). The congruity between SNPs called in this analysis and those reported by Darch et al. (2015) is low (Fig. 2), although 41 within-sample SNPs in the latter were called as fixed here. Several variants were called in both studies but subsequently filtered by our novel algorithms. Most were adjacent to prophage in the reference that were missing in the samples, or at repetitive regions associated with paired reads mapping with unexpected separation distances (Fig. 3). Additionally, the previous study reported an absence of insertions or deletions causing frame shifts in protein coding regions, either among the isolates or compared to the reference sequence, which is atypical for P. aeruginosa isolated from chronic CF airway infections (Rau et al. 2010; Marvig et al. 2013). In this analysis, eight protein coding regions were polymorphic for frame-shift indels among the Nottingham isolates: PLES_RS00075, PLES_RS001530, PLES_RS05000 (mpl), PLES_RS06070, PLES_RS12750, PLES_RS15875 (pslJ), PLES_RS18160 (mltD), PLES_RS18915 (stk1). A further six were fixed relative to the reference genome (LESB58): PLES_RS02170 (mexB), PLES_RS16380, PLES_RS25200 (ampD), PLES_RS27535, PLES_RS28165, PLES_RS30445 (Supplementary table 1). In total, 37 indels were inferred in the Nottingham isolates relative to LESB58, 22 of which were polymorphic among them.
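The distinction drawn above between variants that are fixed relative to the reference and variants that are polymorphic among isolates can be illustrated with a minimal Python sketch. The input format (a mapping from variant position to the set of isolates carrying the non-reference allele), the function name and the toy data are all assumptions made for illustration; they are not taken from BAGA.

```python
def classify_variants(carriers_by_position, all_isolates):
    """Split variant positions into 'fixed' and 'polymorphic' sets.

    carriers_by_position: dict mapping position -> set of isolate names
        carrying the non-reference allele (hypothetical input format).
    all_isolates: set of all isolate names in the population sample.
    """
    fixed, polymorphic = set(), set()
    for position, carriers in carriers_by_position.items():
        if not carriers:
            continue  # no isolate carries the variant: nothing to classify
        if carriers == all_isolates:
            fixed.add(position)        # every isolate differs from the reference
        else:
            polymorphic.add(position)  # only some isolates carry the variant
    return fixed, polymorphic


# Toy data: three isolates and three called positions (illustrative only).
isolates = {"iso1", "iso2", "iso3"}
calls = {
    1200: {"iso1", "iso2", "iso3"},  # carried by all isolates
    5310: {"iso2"},                  # carried by one isolate
    9004: {"iso1", "iso3"},          # carried by two isolates
}
fixed, polymorphic = classify_variants(calls, isolates)
print(sorted(fixed), sorted(polymorphic))
```

Only the polymorphic class carries information about diversification within the sampled population, which is why the counts of polymorphic SNPs (37 versus 78) are the ones compared between studies.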

The overall higher number of variants, particularly indels, reported in the present analysis might reflect improved sensitivity of the newer GATK HaplotypeCaller module (McKenna et al. 2010) over the GATK UnifiedGenotyper used by Williams et al. (2015) and SAMtools (Li et al. 2009) used by Darch et al. (2015). This characteristic has been demonstrated in previous benchmarking reports (Chapman 2014). The joint genotyping performed in the GATK-based analyses, where information across samples is combined, might also have contributed to the incongruence with the Darch et al. (2015) results.

Phylogenetic reconstruction (Fig. 4) recovered the two distinct lineages in Liverpool patient CF03 reported previously (Williams et al. 2015) and a single lineage of lower diversity from the Nottingham data. Despite some differences in bioinformatics (Fig. 1), the topology within each Liverpool lineage was similar to that reported previously (Williams et al. 2015), with an unweighted Robinson-Foulds distance of 44 between 79 bipartition trees (Robinson & Foulds 1981). The topology within the 22-isolate Nottingham lineage was entirely incongruent with that reported by Darch et al. (2015): the 43 and 41 bipartition trees differed by a distance of 42. This metric is the sum of the edges present in one tree but not the other and vice versa. In this case, 42 is the maximum distance given the number of tips and resolution of each topology, i.e., the topologies could not be more incongruent.
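The Robinson-Foulds metric described above can be sketched in a few lines of Python. The sketch assumes each tree has already been reduced to a set of bipartitions of its tips; the helper names and the five-tip toy trees are illustrative assumptions and are unrelated to the trees analysed here.

```python
def bipartition(side, all_tips):
    """Canonical form of a split: the unordered pair of tip sets it induces."""
    side = frozenset(side)
    return frozenset([side, frozenset(all_tips) - side])


def robinson_foulds(bipartitions_a, bipartitions_b):
    """Unweighted Robinson-Foulds distance between two trees.

    Counts the bipartitions (edges) present in one tree but not the
    other, plus those present in the other but not the first, i.e. the
    size of the symmetric difference of the two bipartition sets.
    """
    return len(bipartitions_a ^ bipartitions_b)


tips = {"A", "B", "C", "D", "E"}

# Internal edges of the unrooted toy tree ((A,B),C,(D,E)).
tree_1 = {bipartition({"A", "B"}, tips), bipartition({"D", "E"}, tips)}
# Internal edges of the unrooted toy tree ((A,C),B,(D,E)).
tree_2 = {bipartition({"A", "C"}, tips), bipartition({"D", "E"}, tips)}

distance = robinson_foulds(tree_1, tree_2)
print(distance)
```

Storing each split as an unordered pair of tip sets means the same edge is recognised regardless of which side of the split was used to describe it.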

Recombination was inferred using BRAT NextGen (Marttinen et al. 2012) and, for importations only, ClonalFrameML (Didelot & Wilson 2015). Among the 22 Nottingham genomes analysed together, BRAT NextGen detected a single region from 433,989 to 2,386,565 bp imported by one isolate (SED5). ClonalFrameML did not detect any imported regions. BRAT NextGen analysis of the combined Liverpool and Nottingham data yielded nine distinct origins of foreign genomic segments among the 62 sequences, with a single region of common origin among all 22 Nottingham isolates. ClonalFrameML reported six haplotypes (distinct chromosome regions), totalling 2134 bp, with 8 individuals, extant or ancestral, affected. These are indicated in Fig. 4; only two regions, close to 0.8 Mb in the larger Liverpool clade, are in partial agreement with the BRAT NextGen analysis. Thus, the present analysis indicates that the Liverpool and Nottingham populations share similarly low rates of recombination and does not corroborate the high rates of homologous recombination for P. aeruginosa in chronic CF lung infections reported by Darch et al. (2015).

Potential causes of incongruity between studies

Recombination detection sensitivity decreases with diversity (Marttinen et al. 2012); thus the lower recombination rate inferred here can be partly explained by the fewer variants found in this study than in Darch et al. (2015): 37 versus 78 SNPs. However, in addition to read mapping, multiple sequence alignment of short read de novo assemblies of the 22 genome sequences yielded a nearly 40-fold larger set of 1,436 SNPs (Fig. S2). This high diversity SNP collection was used for BRAT NextGen analysis, indicating many more historical recombination events. With more complexity, errors in each of the assemblies might have compounded errors of multiple alignment, i.e., false positive SNPs. Alternatively, the short read mapping approaches might have missed many true positive variants, including those in variably present chromosome regions that cannot be aligned to the LESB58 reference sequence. We quantified missing regions by de novo assembly of reads that did not align, or aligned poorly, for each isolate genome: on average, 38 kb among 25 contigs. Thus, the read mapping approach missed 0.57% of each genome, consistent with a pangenome analysis by Darch et al. (2015) that indicated only three genes missing in the reference and present in more than one isolate. Furthermore, their BRAT NextGen analysis indicated recombinant regions along the whole genome length. Thus, it seems most variants are likely to be in regions tested by the read mapping approaches.

We also performed a series of benchmarking and validation tests on our variant calling approach. In simulated reads from LESB58 genome sequences with known variants added in silico, all SNPs were called correctly with no false positives (Table 1). Darch et al. (2015) reported no indels, whereas the present analysis found several. Across our ten simulations, 295 out of 300 indels were correct within 2 base pairs. In simulations 5-10, large deletions simulated absence of LESB58 Genomic Island 5 and prophage 5. Polymorphic false positive insertions were called by GATK, but correctly filtered by our novel “rearrangements” filter.

We validated each SNP by performing de novo assemblies of the reads aligned to the same region and checking the resulting contigs for the variant. Our novel “rearrangements” and “repeats” filters improved variant corroboration from 88-98% to 97-100%; thus they apparently omitted false positive SNPs called in short read alignments but absent in the contigs (supplementary table 5). Among variants removed by the “rearrangements” filter were 7 SNPs and 3 indels either side of LESB58 prophage 5, a genomic feature absent in the Nottingham isolates. To test whether the filter was performing correctly in this instance, we repeated the variant calling but used LESlike7 instead: an alternative reference genome without this prophage. No polymorphisms were called around the region in LESlike7 orthologous to the prophage insertion site, thus validating that the filter had removed false positives. The other novel filter was designed to detect repeat regions in the reference genome longer than the sequencing fragment length, which are known to cause ambiguity in mapping reads (Olson et al. 2015). Repeats identified included a previously documented repeat within prophage 2 at 0.895-0.899 Mb and prophage 3 at 1.465-1.469 Mb (Winstanley et al. 2009). Nineteen SNPs were filtered from this region, which also spanned numerous SNPs inferred as adaptive and recombinant by Darch et al. (2015). Notably, the divergence between these repeat units was enough to confuse the read aligner: the alignment contained double the mean read depth over one region but very few reads at the other (Fig. 4). Furthermore, the “rearrangements” filter detected the edges of the repeat units, between which read pair members were arbitrarily aligned, causing unexpected mapping distances. Both filters are therefore omitting confirmed false positives, or variants at repeats indicated as unreliable because of improper paired-read mapping. The filtering of true positives, causing underestimation of diversity and potentially of recombination, remains a possibility. However, the benchmarking and validation suggest this effect is small.
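The contig-based corroboration step described above can be illustrated with a simplified Python sketch. It assumes exact string matching of the variant allele plus a short flank of reference sequence against the assembled contigs, on either strand; how BAGA actually compares variants with SPAdes contigs is not described in this text, and the function names, flank length and toy sequences are illustrative assumptions only.

```python
def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(sequence))


def variant_in_contigs(reference, position, ref_allele, alt_allele,
                       contigs, flank=5):
    """Check whether any contig contains the alternative allele in context.

    position is the 0-based index of the first base of ref_allele in the
    reference. The query is the alternative allele flanked by up to
    `flank` reference bases on each side, searched on both strands.
    """
    if reference[position:position + len(ref_allele)] != ref_allele:
        raise ValueError("reference allele does not match the reference")
    after = position + len(ref_allele)
    left = reference[max(0, position - flank):position]
    right = reference[after:after + flank]
    query = left + alt_allele + right
    query_rc = reverse_complement(query)
    return any(query in contig or query_rc in contig for contig in contigs)


# Toy reference and a toy contig carrying an A>G change at 0-based index 14.
reference = "ACGTTGCAAGGCTTACCGATGCATTGACC"
variant_contig = "ACGTTGCAAGGCTTGCCGATGCATTGACC"

supported = variant_in_contigs(reference, 14, "A", "G", [variant_contig])
supported_rc = variant_in_contigs(
    reference, 14, "A", "G", [reverse_complement(variant_contig)])
unsupported = variant_in_contigs(reference, 14, "A", "G", [reference])
print(supported, supported_rc, unsupported)
```

A variant called from the read alignment but absent from every contig assembled from the same reads would be treated as uncorroborated. A real implementation would need to tolerate nearby variants within the flanks, which exact matching does not.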

CONCLUSION

Our analysis suggests homologous recombination has made a minor contribution to P. aeruginosa LES diversity in two chronic cystic fibrosis airway infections compared to spontaneous mutations. Within-host adaptive evolution might therefore be limited by the mutation rate, rather than the effects of homologous recombination, potentially explaining the prevalence of hypermutator isolates in CF clinical samples (Oliver et al. 2000), including the LES (Kenna et al. 2007). A recent study of P. aeruginosa diversity in different regions of chronically infected cystic fibrosis lungs suggests strong spatial separation of subpopulations within the lung (Jorth et al. 2015). Thus genetically diverged lineages, even if coexisting within the same lung, may not encounter each other often enough to exchange DNA at detectable levels. The novel variant filters allow data-driven exclusion of probable false positives with minimal intervention from the researcher. The rearrangements filter, designed to mitigate false positives caused by missing chromosome regions, is of particular value given the abundance and rapid dynamics of prophage and genomic islands in microbial communities (Zhou et al. 2011).

ACKNOWLEDGEMENTS

This work was supported by The Wellcome Trust (award 093306/Z/10).

ABBREVIATIONS

SNP; single nucleotide polymorphism

indel; small insertion or deletion

CF; cystic fibrosis

GATK; Genome Analysis Toolkit

BAGA; Bacterial and Archaeal Genome Analyser

REFERENCES

Bankevich A., Nurk S., Antipov D., Gurevich A.A., Dvorkin M., Kulikov A.S., Lesin V.M., Nikolenko S.I., Pham S., other authors (2012). SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19 455-477.

Chapman B. (2014). Validating generalized incremental joint variant calling with GATK HaplotypeCaller, FreeBayes, Platypus and samtools. [https://bcbio.wordpress.com/2014/10/07/joint-calling/]

Cramer N., Klockgether J., Wrasman K., Schmidt M., Davenport C.F., Tümmler B. (2011). Microevolution of the major common Pseudomonas aeruginosa clones C and PA14 in cystic fibrosis lungs. Environ Microbiol 13 1690-1704.

Darch S.E., McNally A., Harrison F., Corander J., Barr H.L., Paszkiewicz K., Holden S., Fogarty A., Crusz S.A., Diggle S.P. (2015). Recombination is a key driver of genomic and phenotypic diversity in a Pseudomonas aeruginosa population during cystic fibrosis infection. Sci Rep 5 -.


DePristo M.A., Banks E., Poplin R., Garimella K.V., Maguire J.R., Hartl C., Philippakis A.A., del Angel G., Rivas M.A., other authors (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43 491-498.

Didelot X., Wilson D.J. (2015). ClonalFrameML: Efficient Inference of Recombination in Whole Bacterial Genomes. PLoS Comput Biol 11 e1004041-.

Fothergill J.L., Walshaw M.J., Winstanley C. (2012). Transmissible strains of Pseudomonas aeruginosa in cystic fibrosis lung infections. Eur Respir J 40 227-238.

Jeukens J., Boyle B., Kukavica-Ibrulj I., Ouellet M.M., Aaron S.D., Charette S.J., Fothergill J.L., Tucker N.P., Winstanley C., Levesque R.C. (2014). Comparative Genomics of Isolates of a Pseudomonas aeruginosa Epidemic Strain Associated with Chronic Lung Infections of Cystic Fibrosis Patients. PLoS ONE 9 e87611.

Jorth P., Staudinger B., Wu X., Hisert K.B., Hayden H., Garudathri J., Harding C., Radey M., Rezayat A., other authors (2015). Regional Isolation Drives Bacterial Diversification within Cystic Fibrosis Lungs. Cell Host & Microbe 18 1-13.

Kenna D.T., Doherty C.J., Foweraker J., Macaskill L., Barcus V.A., Govan J.R.W. (2007). Hypermutability in environmental Pseudomonas aeruginosa and in populations causing pulmonary infection in individuals with cystic fibrosis. Microbiology 153 1852-1859.

Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25 2078-2079.

Martin K., Baddal B., Mustafa N., Perry C., Underwood A., Constantidou C., Loman N., Kenna D.T., Turton J.F. (2013). Clusters of genetically similar isolates of Pseudomonas aeruginosa from multiple hospitals in the UK. J Med Microbiol 62 988-1000.

Marttinen P., Hanage W., Croucher N., Connor T., Harris S., Bentley S., Corander J. (2012). Detection of recombination events in bacterial genomes from large population samples. Nucleic Acids Res 40 -.

Marvig R.L., Johansen H.K., Molin S., Jelsbak L. (2013). Genome Analysis of a Transmissible Lineage of Pseudomonas aeruginosa Reveals Pathoadaptive Mutations and Distinct Evolutionary Paths of Hypermutators. PLoS Genet 9 e1003741.

McElroy K.E., Luciani F., Thomas T. (2012). GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics 13 74.

McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., other authors (2010). The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20 1297-1303.

Mowat E., Paterson S., Fothergill J.L., Wright E.A., Ledson M.J., Walshaw M.J., Brockhurst M.A., Winstanley C. (2011). Pseudomonas aeruginosa population diversity and turnover in cystic fibrosis chronic infections. Am J Respir Crit Care Med 183 1674-1679.

Oliver A., Cantón R., Campo P., Baquero F., Blázquez J. (2000). High Frequency of Hypermutable Pseudomonas aeruginosa in Cystic Fibrosis Lung Infection. Science 288 1251-1253.


Rau M.H., Hansen S.K., Johansen H.K., Thomsen L.E., Workman C.T., Nielsen K.F., Jelsbak L., Høiby N., Yang L., Molin S. (2010). Early adaptive developments of Pseudomonas aeruginosa after the transition from life in the environment to persistent colonization in the airways of human cystic fibrosis hosts. Environ Microbiol 12 1643-1658.

Smith E.E., Buckley D.G., Wu Z., Saenphimmachak C., Hoffman L.R., D'Argenio D.A., Miller S.I., Ramsey B.W., Speert D.P., other authors (2006). Genetic adaptation by Pseudomonas aeruginosa to the airways of cystic fibrosis patients. Proc Natl Acad Sci U S A 103 8487-8492.

Van der Auwera G.A., Carneiro M.O., Hartl C., Poplin R., del Angel G., Levy-Moonshine A., Jordan T., Shakir K., Roazen D., other authors (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Curr Protoc Bioinform 43 11.10.1-11.10.33.

Williams D., Evans B., Haldenby S., Walshaw M.J., Brockhurst M.A., Winstanley C., Paterson S. (2015). Divergent, Coexisting Pseudomonas aeruginosa Lineages in Chronic Cystic Fibrosis Lung Infections. Am J Respir Crit Care Med 191 775-785.

Winstanley C., Langille M.G., Fothergill J.L., Kukavica-Ibrulj I., Paradis-Bleau C., Sanschagrin F., Thomson N.R., Winsor G.L., Quail M.A., other authors (2009). Newly introduced genomic prophage islands are critical determinants of in vivo competitiveness in the Liverpool Epidemic Strain of Pseudomonas aeruginosa. Genome Res 19 12-23.

Workentine M.L., Sibley C.D., Glezerson B., Purighalla S., Norgaard-Gron J.C., Parkins M.D., Rabin H.R., Surette M.G. (2013). Phenotypic Heterogeneity of Pseudomonas aeruginosa Populations in a Cystic Fibrosis Patient. PLoS ONE 8 e60225.

Zhou Y., Liang Y., Lynch K.H., Dennis J.J., Wishart D.S. (2011). PHAST: A Fast Phage Search Tool. Nucleic Acids Res 39 W347-W352.

DATA BIBLIOGRAPHY

1. Darch, S. E., McNally, A., Harrison, F., Corander, J., Barr, H. L., Paszkiewicz, K., Holden, S., Fogarty, A., Crusz, S. A. & Diggle, S. P. European Nucleotide Archive, ERP005188 (2014).
2. Williams, D., Evans, B., Haldenby, S., Walshaw, M. J., Brockhurst, M. A., Winstanley, C. & Paterson, S. European Nucleotide Archive, ERR953477-516 (2014).
3. Winstanley, C., Langille, M.G., Fothergill, J.L., Kukavica-Ibrulj, I., Paradis-Bleau, C., Sanschagrin, F., Thomson, N.R., Winsor, G.L., Quail, M.A., Lennard, N., Bignell, A., Clarke, L., Seeger, K., Saunders, D., Harris, D., Parkhill, J., Hancock, R.E., Brinkman, F.S. & Levesque, R.C. RefSeq NC_011770.1 (2008).
4. Jeukens, J., Boyle, B., Kukavica-Ibrulj, I., Ouellet, M.M., Aaron, S.D., Charette, S.J., Fothergill, J.L., Tucker, N.P., Winstanley, C. & Levesque, R.C. RefSeq NZ_CP006981.1 (2013).
5. Williams, D., Paterson, S., Brockhurst, M.A. & Winstanley, C. FigShare 10.6084/m9.figshare.2056350 (2015).
6. Williams, D., Paterson, S., Brockhurst, M.A. & Winstanley, C. FigShare 10.6084/m9.figshare.2056359 (2015).
7. Williams, D., Paterson, S., Brockhurst, M.A. & Winstanley, C. FigShare 10.6084/m9.figshare.2056365 (2015).
8. Williams, D., Paterson, S., Brockhurst, M.A. & Winstanley, C. FigShare 10.6084/m9.figshare.2056326 (2015).
9. Williams, D., Paterson, S., Brockhurst, M.A. & Winstanley, C. FigShare 10.6084/m9.figshare.2056356 (2015).
10. Williams, D., Paterson, S., Brockhurst, M.A. & Winstanley, C. FigShare 10.6084/m9.figshare.2056344 (2015).

FIGURES AND TABLES

Table 1. Accuracy of the variant calling pipeline for single nucleotide, small insertion and small deletion polymorphisms.

381 True Simulation Total positive no. SNPs* SNPs

382 383

False positive SNPs

False negative SNPs

True Total positive InDels InDels

False positive InDels

False negative InDels

1

51

51

0

0

30

24

6 6‡

2

48

48

0

0

30

20

10 10

3

47

47

0

0

30

24

66

4

52

52

0

0

30

23

77

5

50

50

0

0

30

22

88

6

47

47

0

0

30

21

99

7

52

52

0

0

30

21

99

8

53

53

0

0

30

21

99

9

51

51

0

0

30

20

11 10

10

50

50

0

0

30

22

14 8

*Single nucleotide polymorphisms; †Small insertion or deletion polymorphisms; ‡All but five incorrect InDels are within 2 base pairs of the correct call.
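
The counts in Table 1 can be summarised per simulation as recall and precision. The following is a minimal sketch and is not part of BAGA: the function name and the choice of these two summary statistics are assumptions made here for illustration, and the counts are copied from the first row of Table 1.

```python
def recall_precision(true_pos, false_pos, false_neg):
    """Return (recall, precision) for one class of variant calls.

    recall    = true positives / variants actually present
    precision = true positives / variants called
    """
    present = true_pos + false_neg
    called = true_pos + false_pos
    recall = true_pos / present if present else float("nan")
    precision = true_pos / called if called else float("nan")
    return recall, precision


# Simulation 1 in Table 1 (true positive, false positive, false negative)
snp_recall, snp_precision = recall_precision(51, 0, 0)
indel_recall, indel_precision = recall_precision(24, 6, 6)
```

Under these definitions, every simulated SNP in simulation 1 is recovered with no false calls, whereas the InDel calls are less accurate, in line with the counts reported in the table.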


Figure legends

Fig. 1. Comparison of stages of bioinformatic analyses in this and two previous studies. The relevant aim of each analysis is estimation of recombination rates among P. aeruginosa isolates from short read whole genome shotgun sequencing data. GATK: Genome Analysis Tool Kit; BAGA: Bacterial and Archaeal Genome Analyser.

Fig. 2. Congruity between two independent variant calling analyses on the same dataset.

Fig. 3. Sequence repeat units in P. aeruginosa LESB58 that have disrupted paired short read alignments for a closely related isolate. Light grey is depth of aligned reads designated in a “proper pair” by the aligner algorithm; dark grey, stacked above light grey, is depth of reads not in a proper pair. Light green is the ratio of reads not in a proper pair to those in a proper pair; dark green is this ratio “smoothed” by averaging over a 500 bp moving window. Read mapping for variant calling assumes a 1-to-1 orthologous relationship between sample and reference genome. The proportion of proper pair aligned reads decreases where this orthologous relationship breaks down at areas of chromosomal rearrangement, or where the orthology becomes ambiguous, such as in repeat regions. Variants called within a zone where the smoothed average ratio exceeds a threshold of 0.15, highlighted in red, have a higher chance of being false positives and are omitted. Variants are also omitted from neighbouring zones with discontinuous regions of aligned reads, highlighted in orange. Several variants were filtered at this region. This region was also detected by the “repeats” filter.

Fig. 4. Maximum likelihood phylogenetic reconstruction for 63 P. aeruginosa LES isolates. All were isolated from CF patient sputa. Forty were isolated from a single sample obtained at a clinic in Liverpool, UK in 2009 (ERR953477-516). Twenty-two were isolated from a single sample obtained at a clinic in Nottingham, UK in approximately 2014 (ERR453458-479). The remaining outgroup isolate, LESB58, was isolated in 1988 in Liverpool, UK. The coloured circles indicate branches that imported DNA by homologous recombination as inferred by ClonalFrameML. The positions in the key correspond to the LESB58 genome sequence.


Supplementary Materials for:

Refined analyses suggest recombination is a minor source of genomic diversity in Pseudomonas aeruginosa chronic cystic fibrosis infections.

METHODS

Short Read Preparation

Preprocessing of short read data was implemented in the PrepareReads module of BAGA. Large FASTQ read files were subsampled to provide an 80x read coverage for mapping to a 6.6 Mb genome. Reads were trimmed using position-specific quality scores with Sickle ver. 1.33 (Joshi & Fass 2011). Adaptor sequences from library preparation were removed using Cutadapt ver. 1.9.dev0 (Martin 2011).

Short Read Alignment to reference genome

Short read alignment was implemented in the AlignReads module of BAGA. Paired-end short reads were aligned to the LESB58 reference genome sequence with the MEM algorithm of BWA (Li & Durbin 2009) (github.com repository revision 9812521), estimating insert size from the data for designation of “proper pair” reads. SAMTools (github.com repository revision b0525f3) (Li et al. 2009) was used for conversion of SAM to BAM files and BAM file sorting. Duplicate reads were removed from alignments using Picard version 1.135. The IndelRealigner module of the Genome Analysis Toolkit (GATK) (McKenna et al. 2010; DePristo et al. 2011) was used to simultaneously re-align reads that were likely to be near insertions or deletions.

Reference genome repetitive region variant filter

The algorithm for identifying reference genome repeat regions in which variants were deemed unreliable and omitted was implemented in the Repeats module of BAGA. First, each open reading frame (ORF) nucleotide sequence in the published annotation was aligned back to the full LESB58 chromosome sequence using BWA MEM, lowering alignment stringency by setting the mismatch penalty to 2 and the gap open penalty to 3. These initial BWA alignments were used to seed an alignment procedure that uses optimal global (Needleman-Wunsch) alignment for better accuracy. The lower alignment stringency produces alignments more divergent than the target 98% nucleotide identity. After omitting alignments between the same sequences, the resulting collection of ORFs within repeat regions was organised into contiguous blocks of homologous sequence. Each pair of homologous sequences, which may now contain more than one pair of ORFs, was aligned using the Needleman-Wunsch optimal global alignment algorithm implemented in the seq-align package (github.com/noporpoise/seq-align revision 0589383). The percent identity between the two aligned regions was calculated within each 100 bp window at 20 bp increments along the pairwise alignment, and regions with more than 98% nucleotide sequence identity were deemed similar enough to yield ambiguous read alignments. Regions longer than the mean insert size were retained for filtering because smaller regions are expected to be resolved by the relative positions of paired-end reads.
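
The windowed identity scan described above can be sketched as follows. This is an illustration under stated assumptions and is not the code of the BAGA Repeats module: the function names are hypothetical, the input is assumed to be the two rows of an existing gapped pairwise alignment with "-" marking gaps, and identity is assumed to be the number of matching non-gap columns divided by the window length.

```python
def windowed_identity(row_a, row_b, window=100, step=20):
    """Return [(start, identity)] for windows along a gapped pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    results = []
    for start in range(0, len(row_a) - window + 1, step):
        pairs = zip(row_a[start:start + window], row_b[start:start + window])
        # A column counts as identical only if both bases match and neither is a gap
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        results.append((start, matches / window))
    return results


def similar_ranges(identities, window=100, threshold=0.98):
    """Merge overlapping or adjacent windows above the threshold into half-open ranges."""
    ranges = []
    for start, identity in identities:
        if identity <= threshold:
            continue
        end = start + window
        if ranges and start <= ranges[-1][1]:
            ranges[-1] = (ranges[-1][0], end)
        else:
            ranges.append((start, end))
    return ranges


# Toy alignment: two identical 100 bp flanks around a fully divergent 100 bp core
row_a = "A" * 300
row_b = "A" * 100 + "C" * 100 + "A" * 100
ranges = similar_ranges(windowed_identity(row_a, row_b))
```

In the toy alignment only the two flanks exceed 98% identity. Following the text, a final step would keep only those ranges longer than the mean insert size of the sequencing library.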

Genome rearrangements variant filter

This algorithm was implemented in the Structure module of BAGA and is applied on a per sample basis. The threshold for the ratio of reads not in a proper pair, to those in a proper pair, which when exceeded defines regions in which variant calls should be omitted, was set to 0.15. At this threshold, false positive variants in the simulations where large deletions were added were successfully filtered out. This ratio, collected at each position in the reference genome, is “smoothed” by averaging over a 500 bp moving window. Proper pair status was determined by the BWA MEM aligner and was acquired from BAM files using PySAM version 0.8.3. Each region for variant omission was extended in each direction if a moving window equal to the mean insert size had at least 15% without any aligned reads. Plotting is implemented in the Structure module.
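
The smoothing and thresholding steps can be sketched as below. This is a simplified illustration and not the Structure module itself: the per-position depths are assumed to have already been collected into two lists (for example from BAM files), the function names are hypothetical, and positions with no proper-pair reads are handled by dividing by at least 1, which is an assumption made here to avoid division by zero.

```python
from itertools import accumulate


def smoothed_ratio(proper, non_proper, window=500):
    """Ratio of non-proper-pair to proper-pair depth, averaged over a moving window."""
    ratios = [n / max(p, 1) for p, n in zip(proper, non_proper)]
    prefix = [0.0] + list(accumulate(ratios))  # prefix[k] = sum of ratios[:k]
    half = window // 2
    smoothed = []
    for i in range(len(ratios)):
        lo = max(0, i - half)                        # window is truncated at sequence ends
        hi = min(len(ratios), i + (window - half))
        smoothed.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return smoothed


def flag_regions(smoothed, threshold=0.15):
    """Return half-open (start, end) ranges where the smoothed ratio exceeds the threshold."""
    regions, start = [], None
    for i, value in enumerate(smoothed):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(smoothed)))
    return regions


# Toy example with a short window so the numbers are easy to follow
proper = [10] * 10
non_proper = [0, 0, 0, 0, 10, 10, 0, 0, 0, 0]
regions = flag_regions(smoothed_ratio(proper, non_proper, window=4), threshold=0.15)
```

Because the ratio is averaged before thresholding, the flagged range is wider than the two positions where non-proper-pair reads actually occur, which is the intended effect of smoothing around a break-point.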

Variant calling

Variant calling was implemented in the CallVariants module of BAGA, which follows the “Best Practices workflow” provided by the developers of GATK (Van der Auwera et al. 2013) (https://www.broadinstitute.org/gatk/guide/best-practices accessed 15/04/2015). SNPs and indels were called simultaneously via local re-assembly of haplotypes using the HaplotypeCaller module of the GATK, setting --sample_ploidy to 1, --heterozygosity to 0.0001 and --indel_heterozygosity to 0.00001 to approximate the 100-1400 SNPs in the 6.6 Mb genomes reported by Williams et al. (2015) and Darch et al. (2015) and the ten-fold fewer indels reported by Williams et al. (2015). The two datasets were called in separate joint analyses. The standard GATK hard filters for SNPs and indels were applied. Base quality scores in BAM files were recalibrated.

Population structure analysis

The following procedures were implemented in the ComparativeAnalysis module of BAGA. A multiple sequence alignment including the reference genome and all isolates was generated from the called SNPs with all filters applied. Invariant positions were retained and columns with missing data were included unless they were missing in all samples. For example, all 22 Nottingham samples are missing LESB58 Genomic Island 5, which is 50 kb long. A pair of polymorphisms 50 bases either side of the insertion site of Genomic Island 5 should be considered 100 bases apart in the 22 sampled genomes, not 50,100 bases apart, because the island is present only in the reference genome. Maximum likelihood phylogenetic reconstruction was performed using PhyML (github.com/stephaneguindon/phyml revision e71c553) (Guindon et al. 2010) using the default settings. Recombination was inferred using ClonalFrameML (github.com/xavierdidelot/ClonalFrameML revision 2d793a3) (Didelot & Wilson 2015) using the PhyML phylogeny and -ignore_incomplete_sites true. Plotting, including mapping the alignment positions reported by ClonalFrameML back to the reference genome, is implemented in the ComparativeAnalysis module of BAGA. Phylogeny manipulations were performed using DendroPy 4.0.3 (Sukumaran & Holder 2010). Further detection of recombination events, not implemented in BAGA, was performed using BRAT NextGen version 4/18/2011 (Marttinen et al. 2012) with 100 iterations to achieve negligible changes in model parameter estimates over the last 50 iterations. A threshold of 5% applied to 100 permutations was used for significance testing of each event.
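
The handling of columns missing in all samples, and the mapping of alignment positions back to reference coordinates, can be sketched as follows. This is a minimal illustration and not the ComparativeAnalysis module: the function name is hypothetical, each sample is assumed to be a string indexed by 0-based reference position, and "-" is assumed to mark missing data.

```python
def drop_all_missing_columns(sample_rows, missing="-"):
    """Remove columns that are missing in every sample.

    Returns (trimmed_rows, column_to_reference), where column_to_reference[i]
    is the 0-based reference position of column i in the trimmed alignment.
    """
    length = len(next(iter(sample_rows.values())))
    keep = [
        pos for pos in range(length)
        if any(row[pos] != missing for row in sample_rows.values())
    ]
    trimmed = {
        name: "".join(row[pos] for pos in keep)
        for name, row in sample_rows.items()
    }
    return trimmed, keep


# Toy example: a 4 bp "island" (reference positions 2-5) absent from both samples
samples = {"isolate_1": "AC----GT", "isolate_2": "AC----GA"}
trimmed, column_to_reference = drop_all_missing_columns(samples)
```

In the toy example the two flanking columns become adjacent in the trimmed alignment, so distances between polymorphisms reflect the sampled genomes, while `column_to_reference` allows positions reported on the trimmed alignment to be translated back to the reference. The reference row itself would be trimmed with the same list of kept positions.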

In silico read generation

Ten sets of paired reads were generated from the LESB58 reference sequence using GemSim version 1.6 (McElroy et al. 2012) and modelling the error profile of the Illumina GA IIx with Illumina Sequencing Kit v4 chemistry. Read length was 100 bp, with insert size 350 bp and standard deviation 20. To provide 60x coverage depth, 1,980,527 read pairs were generated. Each set of reads was generated from a reference sequence on which 50 SNPs (30 shared with ~20 additionally selected at random to approximate a tree-like distribution) and 15 insertions and deletions (five shared, 10 genome specific) had been applied in silico. These simulated reads can be reproduced using the SimulateReads module of BAGA and the scripts provided in the figshare repository (http://dx.doi.org/10.6084/m9.figshare.2056365). The read data was then analysed with the BAGA pipeline described above.
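
The number of read pairs follows from the target depth, the read length and the genome length. The sketch below reproduces that arithmetic; the function name is hypothetical, and the exact chromosome length of 6,601,757 bp for LESB58 (NC_011770.1) is an assumption introduced here, since the text above states only 6.6 Mb.

```python
def read_pairs_for_depth(genome_length, depth, read_length):
    """Read pairs needed for a mean depth, given two reads of read_length per pair."""
    bases_needed = genome_length * depth
    bases_per_pair = 2 * read_length
    return bases_needed // bases_per_pair  # rounded down to whole pairs


# Assumed LESB58 chromosome length (bp); 60x depth with 100 bp paired reads
pairs = read_pairs_for_depth(6_601_757, depth=60, read_length=100)
```

With the assumed chromosome length this gives the 1,980,527 read pairs stated above.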

Region-specific de novo assembly of reads for SNP validation

For each isolate read set, at each SNP position, reads mapping to 5000 bp each side were extracted from BAM files using PySAM. These were combined with the unmapped and poorly mapped reads for de novo assembly using SPAdes. By reducing the total reads per assembly we aimed to decrease assembly complexity and increase accuracy. A set of contigs was also assembled from just the unmapped and poorly mapped reads and subtracted from the contigs of each SNP-associated assembly to leave those contigs relevant to that region. These remaining contigs were pairwise aligned to the region using the optimal global Needleman-Wunsch algorithm implemented in seq-align. The pairwise alignment was then checked for the presence or absence of the SNP.

RESULTS AND DISCUSSION

Impact of reference genome repeats on variant calling

In the present analysis, two novel filters were applied to mitigate false positive errors. For each filter, variants were omitted if called in regions where assumptions of the method were likely to be violated: either within long repeats in the reference sequence, or near break-points at rearrangements between a sample and the reference. Calling variants by mapping whole-genome shotgun reads to a reference genome assumes each gene sequence in the sampled individual must have a unique equivalent in the reference. This 1-to-1 relationship means that each read will align unambiguously to the position in the reference genome that is positionally orthologous to the read origin in the sampled genome. If a read is aligned to the wrong (non-orthologous) part of a reference genome, false positive variant calls are likely.

The first novel variant filter in this analysis omitted variants in long repeats in the reference sequence. Sequence repeats can be caused by duplications (paralogy) or by horizontally acquired DNA that is incorporated into a chromosome whilst also being homologous to another part of the chromosome. The Burrows-Wheeler Aligner used in this analysis reports an alignment quality of zero where a read maps perfectly to two parts of a genome. In these cases, the risk of false positive variants is easily mitigated by excluding reads with an alignment quality of zero. However, these zero alignment qualities will not be reported for slightly divergent repeats, which would be the case when one or more substitutions have occurred at one repeat unit but not another.

The algorithm we developed to identify repeat regions was implemented in the Repeats module of the Bacterial and Archaeal Genome Analyser (BAGA). Of the 6.6 Mb LESB58 reference genome, 75.1 kb of chromosome in 35 regions were deemed repetitive. These regions required at least 98% aligned nucleotide identity and a length greater than the mean sequencing library insert size because smaller regions are expected to be resolved by the relative positions of paired-end reads. Only contiguous repeat regions, tandem or otherwise, longer than the insert size used in library preparation were included in the filter because shorter regions should be resolvable as unique. Of the 42.7 kb in previously described long duplications in LESB58 (Winstanley et al. 2009), 23.0 kb were included in the high identity regions reported here. Among the Nottingham data, eight SNPs within these regions were omitted (Table S2). In the Liverpool data, nine SNPs were omitted. No indels were found in these regions.

Impact of genome sequence rearrangements on variant calling

The second novel variant filter used in this analysis omitted variants in regions around break-points of chromosomal rearrangements. The assumed 1-to-1 orthology required for variant calls from aligned reads can be violated at the regions bordering rearrangements between a sample and the reference genome. These structural changes in bacteria include large deletions, inversions, duplications and integrations, the last typically caused by acquisition of prophage or genomic islands. Some phage are known to carry fragments of tRNA genes with homology to those in the host organism, and integration involves replacement of parts of the host version of a tRNA gene with the phage version (Campbell 1992). Thus, regions flanking the resulting prophage may have sequence divergence caused by a more complex history than the simple orthology assumed by the read mapping method, with an increased risk of artefactual, false-positive variants.

The algorithm we developed for omitting putatively unreliable variants near regions of structural rearrangements was implemented in the Structure module of the BAGA software package. The algorithm exploits the fact that sequence reads in this analysis were “paired-end”: from either end of many template DNA fragments of similar lengths. The Burrows-Wheeler Aligner used in this analysis assigns a read “proper pair” status if the aligned distance and orientation to its paired read is close to that expected given the mean template DNA fragment length. Paired-end reads on either side of a rearrangement break-point typically align to the reference at distances far greater than the fragment lengths or in different orientations. For example, in many samples, no reads mapped to the LESB58 chromosome region annotated as “LES prophage 5” indicating absence of the prophage in those samples. Paired reads would align kilobases apart on either side of the prophage if they originated from a DNA fragment spanning the orthologous site of prophage integration, in a sample lacking the prophage. These reads would not be assigned “proper pair” status being so much further apart than the mean fragment size.

Fig. 3 illustrates how our algorithm determines regions near genome rearrangements in which variant calls may be unreliable. In this example the rearrangement is a duplication, but other common rearrangements include the loss or gain of a prophage or other genomic island. These regions are defined by a deviation in the ratio of proper-pair-assigned reads to non-proper-pair-assigned reads: the more non-proper-pair-assigned reads, the less reliable the region. The initially selected region is extended while regions of the reference have zero aligned read coverage, and variants called within the combined region are omitted. In the Nottingham data, 33 SNPs and 10 indels within these regions were omitted (Table S2). In the Liverpool data, 39 SNPs and eight indels were omitted. This analytic approach, by accounting for potential false positive variant calls caused by missing regions of chromosome, is of value because turnover of prophage and genomic islands among microbial genomes is often rapid (Wilmes et al. 2009). In many cases, including the analyses of the Liverpool and Nottingham datasets, guaranteeing a similar genomic composition to permit a contiguous alignment between reference and sampled genomes is difficult. Prophage and genomic islands are abundant throughout Bacteria and Archaea (Zhou et al. 2011), so few read mapping analyses will be unaffected by these challenges.
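
The extension of an initially selected region across neighbouring zero-coverage positions, followed by omission of variants, can be sketched as below. This follows the simplified description in the paragraph above rather than the windowed criterion given in the Methods, and it is not the BAGA implementation: the function names, the per-position depth list and the half-open coordinate convention are assumptions made for illustration.

```python
def extend_over_zero_depth(region, depth):
    """Extend a half-open (start, end) region while adjacent positions have zero depth."""
    start, end = region
    while start > 0 and depth[start - 1] == 0:
        start -= 1
    while end < len(depth) and depth[end] == 0:
        end += 1
    return start, end


def omit_variants(variant_positions, regions):
    """Keep only variants that fall outside every half-open (start, end) region."""
    return [
        pos for pos in variant_positions
        if not any(start <= pos < end for start, end in regions)
    ]


# Toy example: a flagged region (4, 7) flanked by stretches with no aligned reads
depth = [30, 30, 0, 0, 25, 25, 25, 0, 0, 0, 30]
extended = extend_over_zero_depth((4, 7), depth)
kept = omit_variants([1, 3, 5, 9, 10], [extended])
```

In the toy example the region grows to cover both uncovered stretches, and only the variants at positions 1 and 10, which lie in normally covered sequence outside the combined region, are retained.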

TABLES

Table S1: Small insertions and deletions predicted to cause frame-shift mutations among the Nottingham isolates

| Chromosome | Position (bp) | Reference | Variant | Gene ID | Gene name | Frequency | Frequency (filtered) |
|---|---|---|---|---|---|---|---|
| NC_011770.1 | 14914 | T | TGC | PLES_RS00075 | | 10 | 10 |
| NC_011770.1 | 335247 | C | CG | PLES_RS01530 | | 2 | 2 |
| NC_011770.1 | 467972 | GATCCTCCTCGT | G | PLES_RS02170 | mexB | 22 | 22 |
| NC_011770.1 | 1037925 | A | AC | PLES_RS05000 | mpl | 3 | 3 |
| NC_011770.1 | 1267074 | CCCGCTGGAGCT | C | PLES_RS06070 | | 3 | 3 |
| NC_011770.1 | 2532081 | G | GT | PLES_RS12255 | | 1 | 0 |
| NC_011770.1 | 2639073 | AG | A | PLES_RS12750 | | 1 | 1 |
| NC_011770.1 | 3280614 | A | AAGCGCGCGACCCGGAAGCCGTTGCCCAGCGGGGCGCCCTGT | PLES_RS15460 | | 1 | 0 |
| NC_011770.1 | 3378367 | AG | A | PLES_RS15875 | pslJ | 21 | 21 |
| NC_011770.1 | 3388596 | GTACTTGCCGCTGTCCAGGTAGGACTGGGCGGTGGCCTGGTCGGGTTTCT | G | PLES_RS15915 | pslB | 1 | 0 |
| NC_011770.1 | 3484676 | CGCCCGAGCAA | C | PLES_RS16380 | | 22 | 22 |
| NC_011770.1 | 3898444 | C | CACGATTCGTTGTCAAAAATAGCCAAGGACCCGGACACACGCCTGATGCG | PLES_RS18160 | mltD | 1 | 1 |
| NC_011770.1 | 4044433 | C | CGG | PLES_RS18915 | stk1 | 1 | 1 |
| NC_011770.1 | 4931526 | G | GGGGATGTCGACATGCAA | PLES_RS23105 | | 2 | 0 |
| NC_011770.1 | 5398560 | TCG | T | PLES_RS25200 | ampD | 22 | 22 |
| NC_011770.1 | 5903286 | GCT | G | PLES_RS27535 | | 22 | 22 |
| NC_011770.1 | 6060546 | C | CAT | PLES_RS28165 | | 22 | 22 |
| NC_011770.1 | 6557920 | GGGGCGGCAGCAGTGGCGTTCGGCGCAGT | G | PLES_RS30445 | | 22 | 22 |

Table S2: Total polymorphisms called for the Liverpool and Nottingham datasets as cumulative filters are applied, divided by type

| Cumulative filters applied | Liverpool SNPs* | Liverpool Indels† | Nottingham SNPs | Nottingham Indels | All data SNPs | All data Indels |
|---|---|---|---|---|---|---|
| None | 318 | 68 | 170 | 47 | 440 | 89 |
| GATK‡ standard 'hard' filter | 318 | 68 | 170 | 47 | 440 | 89 |
| BAGA§ reference genome repeats | 309 | 68 | 162 | 47 | 424 | 89 |
| BAGA genome rearrangements | 270 | 60 | 129 | 37 | 364 | 78 |

*Single nucleotide polymorphisms; †Small insertion or deletion polymorphisms; ‡Genome Analysis Toolkit; §Bacterial and Archaeal Genome Analyser

Table S3: Polymorphisms called as fixed in samples and differing with reference for the Liverpool and Nottingham datasets as cumulative filters are applied, divided by type

| Cumulative filters applied | Liverpool SNPs* | Liverpool Indels† | Nottingham SNPs | Nottingham Indels | All data SNPs | All data Indels |
|---|---|---|---|---|---|---|
| None | 43 | 5 | 101 | 17 | 113 | 21 |
| GATK‡ standard 'hard' filter | 43 | 5 | 101 | 17 | 113 | 21 |
| BAGA§ reference genome repeats | 43 | 5 | 99 | 17 | 111 | 21 |
| BAGA genome rearrangements | 43 | 5 | 92 | 15 | 104 | 19 |

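Table S4: Polymorphisms called among the Liverpool and Nottingham datasets as cumulative filters are applied, divided by type

| Cumulative filters applied | Liverpool SNPs* | Liverpool Indels† | Nottingham SNPs | Nottingham Indels | All data SNPs | All data Indels |
|---|---|---|---|---|---|---|
| None | 275 | 63 | 69 | 30 | 337 | 71 |
| GATK‡ standard 'hard' filter | 275 | 63 | 69 | 30 | 337 | 71 |
| BAGA§ reference genome repeats | 266 | 63 | 63 | 30 | 322 | 71 |
| BAGA genome rearrangements | 227 | 55 | 37 | 22 | 264 | 60 |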

Table S5: Proportion of single nucleotide polymorphisms called by the GATK Haplotype Caller recovered in de novo assembled contigs.

| Sample | Proportion all variants corroborated | Proportion filtered variants corroborated* |
|---|---|---|
| SED01 | 96.83% | 100.00% |
| SED02 | 92.56% | 96.00% |
| SED03 | 97.54% | 100.00% |
| SED04 | 95.97% | 100.00% |
| SED05 | 92.48% | 98.04% |
| SED06 | 94.92% | 100.00% |
| SED07 | 91.41% | 97.94% |
| SED08 | 93.75% | 97.98% |
| SED09 | 92.48% | 99.02% |
| SED10 | 94.26% | 97.96% |
| SED11 | 89.84% | 97.96% |
| SED12 | 93.10% | 96.94% |
| SED13 | 88.10% | 100.00% |
| SED14 | 94.40% | 100.00% |
| SED15 | 91.41% | 97.96% |
| SED16 | 93.75% | 99.00% |
| SED17 | 94.49% | 100.00% |
| SED18 | 95.31% | 100.00% |
| SED19 | 92.91% | 97.98% |
| SED20 | 92.11% | 96.94% |
| SED21 | 95.97% | 98.98% |
| SED22 | 96.09% | 100.00% |

*Most of the variants removed by the novel filters were not present in the de novo assemblies

FIGURES

Figure S1. Segments of chromosome, indicated by coloured bars, among the Liverpool and Nottingham isolates, inferred to have undergone homologous recombination. Regions that are the same colour share a common origin whereas grey regions were not deemed significant in a permutation test at alpha = 0.05.

Figure S2. Total single nucleotide polymorphisms called by different methods from the same dataset. In the first method, from a previously published analysis, short reads from 22 P. aeruginosa isolates were each assembled into contigs de novo, improved using an automated pipeline and finally combined into a multiple sequence alignment within which polymorphic positions were counted. In the second and third methods, the second from the previous study and the third from this report, variants were called among short reads aligned against a complete reference sequence from a closely related isolate (P. aeruginosa LESB58).

REFERENCES

Joshi N.A., Fass J.N. (2011). Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files (Version 1.29) [https://github.com/najoshi/sickle]

Martin M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17 1.

Li H., Durbin R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25 1754-1760.

Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25 2078-2079.

Picard Sequence Alignment/Map file manipulation library [http://picard.sourceforge.net/]

McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., other authors (2010). The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20 1297-1303.

DePristo M.A., Banks E., Poplin R., Garimella K.V., Maguire J.R., Hartl C., Philippakis A.A., del Angel G., Rivas M.A., other authors (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43 491-498.

Van der Auwera G.A., Carneiro M.O., Hartl C., Poplin R., del Angel G., Levy-Moonshine A., Jordan T., Shakir K., Roazen D., other authors (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Curr Protoc Bioinform 43 11.10.1-11.10.33.

Guindon S., Dufayard J., Lefort V., Anisimova M., Hordijk W., Gascuel O. (2010). New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 59 307-21.

Didelot X., Wilson D.J. (2015). ClonalFrameML: Efficient Inference of Recombination in Whole Bacterial Genomes. PLoS Comput Biol 11 e1004041.

Sukumaran J., Holder M.T. (2010). DendroPy: a Python library for phylogenetic computing. Bioinformatics 26 1569-1571.

McElroy K.E., Luciani F., Thomas T. (2012). GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics 13 74.

Winstanley C., Langille M.G., Fothergill J.L., Kukavica-Ibrulj I., Paradis-Bleau C., Sanschagrin F., Thomson N.R., Winsor G.L., Quail M.A., other authors (2009). Newly introduced genomic prophage islands are critical determinants of in vivo competitiveness in the Liverpool Epidemic Strain of Pseudomonas aeruginosa. Genome Res 19 12-23.

Campbell A.M. (1992). Chromosomal insertion sites for phages and plasmids. J Bacteriol 174 7495-7499.

Wilmes P., Simmons S.L., Denef V.J., Banfield J.F. (2009). The dynamic genetic repertoire of microbial communities. FEMS Microbiol Rev 33 109-132.

Zhou Y., Liang Y., Lynch K.H., Dennis J.J., Wishart D.S. (2011). PHAST: A Fast Phage Search Tool. Nucleic Acids Res 39 W347-W352.