A priori Considerations When Conducting High-Throughput Amplicon ...

8 downloads 21208 Views 575KB Size Report
researchers with no prior experience may find it useful to be informed of best practices that can be used ... 2016. *Corresponding author (asengupta@email. arizona.edu). .... of samples may be pooled in one run, high diversity sam- ples require a .... technology. http://www.illumina.com/content/dam/illumina-marketing/docu-.
Published March 23, 2016

Agricultural & Environmental Letters Research Letter

A priori Considerations When Conducting High-Throughput Amplicon-Based Sequence Analysis Aditi Sengupta* and Warren A. Dick

Core Ideas • High-throughput sequence analysis of microbes is gaining popularity. • Researchers in agronomy, soil, crop, and environmental sciences are interested in such analyses. • We briefly discuss a priori considerations to guide such researchers. • Developing successful research questions, experiments, and data analysis is the goal.

Abstract: Amplicon-based sequencing strategies that include 16S rRNA and functional

genes, alongside “meta-omics” analyses of communities of microorganisms, have allowed researchers to pose questions and find answers to “who” is present in the environment and “what” they are doing. Next-generation sequencing approaches that aid microbial ecology studies of agricultural systems are fast gaining popularity among agronomy, crop, soil, and environmental science researchers. Given the rapid development of these high-throughput sequencing techniques, researchers with no prior experience will desire information about the best practices that can be used before actually starting high-throughput amplicon-based sequence analyses. We have outlined items that need to be carefully considered in experimental design, sampling, basic bioinformatics, sequencing of mock communities and negative controls, acquisition of metadata, and in standardization of reaction conditions as per experimental requirements. Not all considerations mentioned here may pertain to a particular study. The overall goal is to inform researchers about considerations that must be taken into account when conducting high-throughput microbial DNA sequencing and sequences analysis.

H A. Sengupta, Biosphere 2, Univ. of Arizona, Marshall Building, Room 526B, 845 N. Park Ave., PO Box 210158-B, Tucson, AZ 85721-0158; W.A. Dick, School of Environment and Natural Resources, The Ohio State Univ., 1680 Madison Ave., Wooster, OH 44691-4096.

Copyright © American Society of Agronomy, Crop Science Society of America, and Soil Science Society of America. 5585 Guilford Rd., Madison, WI 53711 USA. This is an open access article distributed under the terms of the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Agric. Environ. Lett. 1:150010 (2016) doi:10.2134/ael2015.11.0010 Received 9 Nov. 2015. Accepted 22 Feb. 2016. *Corresponding author (asengupta@email. arizona.edu).

igh-throughput amplicon-based sequencing of microbial DNA from different environments is proving to be an exceptional tool for understanding microbial ecology of various natural systems (Logares et al., 2012; Zhou et al., 2015). Amplicon-based sequencing strategies of the 16S rRNA and functional genes, alongside metagenomic analyses of microbial communities, have allowed researchers to pose questions and find answers to “who” is present in a specific environment and “what” they are doing. With tools such as Illumina’s MiSeq and the HiSeq platform (Illumina, Inc.) becoming more affordable over time, rapid application of -omics based studies in environmental systems is becoming common. Next-generation approaches to microbial ecology of agricultural systems are also fast gaining popularity among agronomy, crop, soil, and environmental science researchers. Research groups in these disciplines often have as a goal the identification of the diversity and the roles of microbes in these systems. Given the rapid development of high-throughput sequencing techniques, researchers with no prior experience may find it useful to be informed of best practices that can be used before actually starting high-throughput ampliconbased sequences analyses. General procedures of high-throughput sequencing experiments, sequencing concepts, and data analysis are available (Gogol-DÖring and Chen, 2012; Di Bella et al., 2013; Normand and Yanai, 2013). This information can direct a researcher toward mock experimental data analyses, thereby allowing the researcher to work with small experimental datasets. The resources, however, fail to focus on description of the things to be considered that a researcher must pay attention to before starting an experiment that relies heavily on data analyses of high-throughput sequences. Often, by the time researchers accomplish the task of working through a tutorial of mock experimental data, their experiments have already been conducted and data generated. Given that considerable time, money, and effort have already been invested in generating data, it is not always possible to repeat the experiments if a researcher realizes that the experimental design was insufficient. While conducting our own research on diversity of bacteria in soils under different land-use and land-management practices(Sengupta and Dick, 2015), Page 1 of 1

we noticed a glaring paucity of instructions for researchers in the disciplines of agronomy, crops, soil, and environmental science who wish to conduct high-throughput amplicon sequencing work. Stemming largely from our own experiences and discussion with peers, we focus on and present the a priori knowledge that a beginner (and experienced) researcher or laboratory group must develop and/or improve on when attempting to conduct high-throughput amplicon-based sequences analyses.

Recommendations and Discussion Considerations when planning a study involving high-throughput amplicon-based sequences analyses are described below. A graphical interpretation that summarizes these considerations is shown in Fig. 1. First, care is needed in conceptualizing the experiment. It is well worth the time to ask and establish answers to the following questions: (i) what is one trying to study and why, and (ii) how is the experiment going to be conducted? From the standpoint of soil science research, for example, one may wish to isolate microbial genomic DNA from multiple depth profiles for a single treatment. If the hypothesis involves microbial community response under different management practices, sampling may need to be on a wider geographical scale. Alternately, a research question may be aimed at answering microbial community patterns across a landscape gradient as highlighted by Sengupta et al. (2016). In such scenarios, the scale of sampling becomes important, i.e., how far spaced out should sampling locations be to capture microbial heterogeneity. Indeed, the question of scale is increasingly being discussed with respect to microbial ecology studies (Green and Bohannan, 2006; Franklin and Mills, 2007; Fierer and Lennon, 2011).

Fig. 1. Schematic of considerations outlined prior to conducting high-throughput amplicon-based sequence analysis. Page 2 of 4

A related issue arises when deciding to choose the amount of sample to be included in a study. It is suggested that researchers conduct extractions using a range of sample quantity. For example, 0.25 to 10 g of sample may be used to confirm that microbial DNA recovered after a certain amount is constant and/or DNA recovered from a combination of five 2-g samples is similar to that recovered from one 10-g sample. This exercise can then be used to address the issue of DNA extraction biases and to defend the issue of small-scale heterogeneity representative of the community found in the sample. Therefore, unless there are well-defined questions, both sampling and data analyses can easily become cumbersome. While well-defined hypotheses are true to any scientific discipline, the appeal of using automated and fast-sequencing techniques can be tempting. This may obfuscate the additional financial investments that may be needed to arrive at a sound sampling decision. One decision affected by finances is the number of samples and replicates to be sequenced. The researcher must ensure that the experimental design should, from the beginning, focus on the number of replicates per sample and the total number of samples. It is important to pay attention to the variability that is expected between/within samples and/or treatments. Vastly different communities under different treatments may not require many replicates to statistically determine community difference as opposed to highly similar communities that may require many replicates. Replication, or the lack of it, in microbial community data analysis is discussed by Lennon (2011), highlighting the issue and the choices available to researchers. The question that often arises is whether all the replicates need to be sequenced or whether they can be pooled together and treated as one sample. An associated point is the sequencing depth. If low diversity is expected, it may be sufficient to pool more numbers of samples together in one sequencing run. While, theoretically, any number of samples may be pooled in one run, high diversity samples require a greater number of sequences to capture the diversity. This may in turn limit the number of samples that can be pooled to obtain accurate diversity estimates. Other considerations include additional experimental testing that a researcher may do before or in addition to sequencing. For example, researchers may decide to first test physicochemical properties of the soil and/or outline additional experiments that need to be conducted to support sequence-analysis results, including but not restricted to, heterotrophic plate count analysis, and quantitative real-time polymerase chain reactions (qPCR). Second, one must collect information about the samples. Also termed as metadata, this information is crucial and needs to be described and maintained thoroughly. For example, when sampling soil to isolate microbial DNA for high-throughput sequencing work, metadata including GPS coordinates, soil chemical and physical properties, sampling depth, time of year, and land management among others should be recorded. The benefits of this record keeping are twofold: (i) the researcher can use statistical tests to answer hypotheses-driven questions AGRICULTURAL & ENVIRONMENTAL LETTERS

and determine if certain variables in the metadata affect microbial community composition, and (ii) keeping a record of environmental metadata can allow researchers to contribute to global projects like the Earth Microbiome Project (Gilbert et al., 2014) and the Critical Zone Observatory Program (Harpold et al., 2013). Once the experimental design and sampling strategy have been thought out, it is beneficial to practice data analyses on example datasets that are widely available online. Quantitative Insights Into Microbial Ecology (QIIME) tutorials offered by the developers of QIIME (QIIME, 2015), tutorials offered by developer of mothur (Mothur, 2015), and workshop-based data analysis offered by Explorations in Data Analysis for Metagenomic Advances in Microbial Ecology (EDAMAME) provides excellent practice data as a means of familiarizing oneself with high-throughput sequence data analyses (EDAMAME, 2015). Most of these example datasets are small and their analysis can be rapidly completed, allowing the researcher to become familiar with different software available for -omics data analyses. While the practice examples might seem easy because of the way they are structured, the exercise forces researchers to think ahead and consider (i) the robustness of their own experimental design, (ii) the possible computing power that their own dataset may require, and (iii) various ways of representing sequence analysis results. Most practice datasets have accompanying background information, such as sample collection, sample information, replicates, sequencing depth, and normalization of data. This provides researchers an opportunity to review their own experimental design. Another benefit of working with example datasets is that researchers become familiar with computing shell environments. Too often, during -omics data analyses, researchers find themselves immersed in the labyrinths of the command line computer code for the first time. Consequently, it is advisable that researchers familiarize themselves with basic command line operating system usage, including Linux and Unix, in addition to working knowledge of languages like C, Python, and R, to be able to work with sequence data. While the above outlines some recommendations to be considered before actual sequencing, other important aspects need to be kept in mind when preparing samples for sequencing. An important component of highthroughput sequencing studies is the incorporation of mock communities and negative controls. Tutorials often include sequence analysis from a mock sample of known composition (Mothur, 2015; QIIME, 2015). This provides an opportunity to determine quality of the sequencing run (Schloss et al., 2011; Hang et al., 2014). Mock community sequencing also allows the inference of errors based on library preparation methods, run, amount of input DNA, number of polymerase chain reactions (PCR), primer combination, efficacy of clustering methods for taxonomically diverse environmental samples, and sequencing biases against certain taxa (Flynn et al., 2015; Schirmer et al., 2015). Mock communities also help determine the suitability of reference databases. If functional genes of AGRICULTURAL & ENVIRONMENTAL LETTERS

interest are sequenced (apart from 16S rRNA bacterial and archaeal or 18S rRNA and internal transcribed spacer [ITS] fungal genes) inclusive of mock communities, comparison with reference databases may reveal the suitability of the reference database. This consideration may then help researchers in preparing a curated database of reference sequences suitable for their own research with respect to the environment being studied. Negative controls at all steps of sequencing, starting from DNA extraction to loading samples in the sequencer, are important for quality control (Salter et al., 2014). Negative controls should include template-free “blanks” of DNA extractions and PCR amplification blank reactions in the same run as the actual samples. Controls ensure identification of reagent and laboratory contamination. Post-sequencing, results should be inspected for contaminated genera by comparing with published lists of genera found in contaminated blanks (Tanner et al., 1998; Grahn et al., 2003; Barton et al., 2006; Laurence et al., 2014; Salter et al., 2014). Such cross-validations can allow common contaminants, if identified and found to be relevant with respect to the environment being sampled, to be subtracted computationally. There are excellent resources, including Caporaso et al. (2012) and Illumina (2015), that outline standard operating procedures for high-throughput microbial DNA sequence experiments, while others outline limitations of amplicon-based high-throughput sequence analysis (Poretsky et al., 2014; Schirmer et al., 2015). Primer design, library preparation, and PCR conditions are other areas that need to be thoroughly tested and standardized according to individual experiments. Since high-throughput amplicon-sequence analysis includes PCR reactions, it is important to have replicate reactions, restrict the number of amplification cycles to a minimum required, and optimize PCR reactions to reduce biases. Additionally, amplicon-based sequence analysis relies on reference sequence databases; therefore, the analysis is only as good as the reference database. This is especially true for environmental samples with unidentified microorganisms and, therefore, insufficient annotation in reference databases. Moreover, researchers may want to study community diversity based on functional gene sequences, instead of prokaryotic 16S rRNA and eukaryotic 18S rRNA/ITS genes. The lack of easily accessible functional gene databases is a hindrance, although researchers may benefit by accessing curated functional gene databases like those in the Ribosomal Database Project (RDP)(Cole et al., 2014). The vast amount of sequence data generated is only half of the story. We may be able to learn “who is present” in the microbial community using amplicon-based sequence analysis and determine “what are they doing” using “meta-omics” approach. It is, however, critical to link the impact of microbial diversity and consequently their function to ecological processes and ecosystem dynamics. Applying concepts of diversity derived from plant- and animal-based ecological theories to diversity patterns in microbial systems requires theoretical and Page 3 of 4

conceptual understanding of established ecological theories (Prosser et al., 2007). Technical advances in highthroughput amplicon-based sequencing will be beneficial when we are able to understand microbial community structure to their diversity at different scales and, finally, to functionality in our environment.

Summary The easy approachability of sequencing tools is allowing researchers to venture into the world of high-throughput sequencing and tap into the massive potential of this tool to study microbes in varied environmental systems. While a number of resources outline approaches to microbial sequence data analysis, we detail here important considerations that should be taken before any such sequencing is actually conducted. Careful attention to these will be useful to researchers who are enthusiastic about the potential applications of high-throughput microbial DNA sequencing and aim to apply amplicon-based sequencing studies of microbial DNA in their respective fields of agronomy, crops, soils, and the environment.

References Barton, H.A., N.M. Taylor, B.R. Lubbers, and A.C. Pemberton. 2006. DNA extraction from low-biomass carbonate rock: An improved method with reduced contamination and the low-biomass contaminant database. J. Microbiol. Methods 66(1):21–31 doi:10.1016/j. mimet.2005.10.005 Di Bella, J.M., Y. Bao, G.B. Gloor, J.P. Burton, and G. Reid. 2013. High throughput sequencing methods and analysis for microbiome research. J. Microbiol. Methods 95(3): 401–414. doi:10.1016/j. mimet.2013.08.011 Caporaso, J.G., C.L. Lauber. W.A. Walters, D. Berg-Lyons, J. Huntley, N. Fierer, S.M. Owens, J. Betley, L. Fraser, M. Bauer, N. Gormley, J.A. Gilbert, G. Smith, and R. Knight. 2012. Ultra-high-throughput microbial community analysis on the Illumina HiSeq and MiSeq platforms. ISME J. 6(8):1621–1624. Cole, J.R., Q. Wang, J.A. Fish, B. Chai, D.M. McGarrell, Y. Sun, C.T. Brown, A. Porras-Alfaro, C.R. Kuske, and J.M. Tiedje. 2014. Ribosomal database project: Data and tools for high throughput rRNA analysis. Nucleic Acids Res. 42(D1):12512–12522. doi:10.1093/nar/gku1013 EDAMAME. 2015. EDAMAME course website. http://edamame-course. github.io/ (accessed 14 Oct. 2015). Fierer, N., and J.T. Lennon. 2011. The generation and maintenance of diversity in microbial communities. Am. J. Bot. 98(3):439–448. doi:10.3732/ ajb.1000498 Flynn, J.M., E.A. Brown, F.J.J. Chain, H.J. MacIsaac, and M.E. Cristescu. 2015. Toward accurate molecular identification of species in complex environmental samples: Testing the performance of sequence filtering and clustering methods. Ecol. Evol. 5(11): 2252–2266. Franklin, R., and A. Mills. 2007. The spatial distribution of microbes in the environment. Springer Science & Business Media, Dordrecht, the Netherlands. Gilbert, J.A., J.K. Jansson, and R. Knight. 2014. The Earth microbiome project: Successes and aspirations. BMC Biol. 12(1):69. doi:10.1186/ s12915-014-0069-1 Gogol-DÖring, A., and W. Chen. 2012. An overview of the analysis of next generation sequencing data. In: J. Wang, A.C. Tan, and T. Tian, editors, Next generation microarray bioinformatics: Methods and protocols. Methods in Molecular Biology. Humana Press, Springer, New York. p. 249–257. Grahn, N., M. Olofsson, K. Ellnebo-Svedlund, H.J. Monstein, and J. Jonasson. 2003. Identification of mixed bacterial DNA contamination in broad-range PCR amplification of 16S rDNA V1 and V3 variable regions by pyrosequencing of cloned amplicons. FEMS Microbiol. Lett. 219(1):87–91. doi:10.1016/S0378-1097(02)01190-4 Green, J., and B.J.M. Bohannan. 2006. Spatial scaling of microbial biodiversity. Trends Ecol. Evol. 21(9):501–507. doi:10.1016/j.tree.2006.06.012 Page 4 of 4

Hang, J., V. Desai, N. Zavaljevski, Y. Yang, X. Lin, R. Satya, L.J. Martinez, J.M. Blaylock, R.G. Jarman, S.J. Thomas, and R.A. Kuschner. 2014. 16S rRNA gene pyrosequencing of reference and clinical samples and investigation of the temperature stability of microbiome profiles. Microbiome 2(1):31. Harpold, A.A., D. Karwan, J. Perdrial, J.A. Marshall, J. Driscoll, A. Neal, and C. Phillips. 2013. Graduate Research Group white paper: Cross-CZO research potential. Critical Zone Observatory. http://criticalzone.org/ catalina-jemez/publications/pub/grg-cross-site-white-paper-graduate-research-group-white-paper-cross-czo-re/. Illumina. 2015. An introduction to next-generation sequencing technology. http://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf (accessed 1 Sept. 2015). Laurence, M., C. Hatzis, and D.E. Brash. 2014. Common contaminants in next-generation sequencing that hinder discovery of low-abundance microbes. PLoS One 9(5): E97876. doi:10.1371/journal.pone.0097876 Lennon, J.T. 2011. Replication, lies and lesser-known truths regarding experimental design in environmental microbiology. Environ. Microbiol. 13(6):1383–1386. doi:10.1111/j.1462-2920.2011.02445.x Logares, R., T.H.A. Haverkamp, S. Kumar, A. Lanzén, A.J. Nederbragt, C. Quince, and H. Kauserud. 2012. Environmental microbiology through the lens of high-throughput DNA sequencing: Synopsis of current platforms and bioinformatics approaches. J. Microbiol. Methods 91(1):106–113. doi:10.1016/j.mimet.2012.07.017 Mothur. 2015. Mothur manual. http://www.mothur.org/wiki/Mothur_manual (accessed 16 Oct. 2015). Normand, R., and I. Yanai. 2013. An introduction to high-throughput sequencing experiments: Design and bioinformatics analysis. In: N. Shomron, editor, Deep sequencing data analysis. Methods in Molecular Biology. Humana Press, Springer, New York. p. 171–179. Poretsky, R., L.M. Rodriguez-R, C. Luo, D. Tsementzi, and K.T. Konstantinidis. 2014. Strengths and limitations of 16S rRNA gene amplicon sequencing in revealing temporal microbial community dynamics. PLoS ONE 9(4):E93827. doi:10.1371/journal.pone.0093827 Prosser, J.I., B.J.M. Bohannan, T.P. Curtis, R.J. Ellis, M.K. Firestone, R.P. Freckleton, J.L. Green, L.E. Green, K. Killham, J.J. Lennon, A.M. Osborn, M. Solan, C.J. van der Gast, and J.P.W. Young. 2007. The role of ecological theory in microbial ecology. Nat. Rev. Microbiol. 5(5): 384–392. doi:10.1038/nrmicro1643 QIIME. 2015. QIIME oveview tutorials. http://qiime.org/tutorials/index. html (accessed 14 Oct. 2015). Salter, S.J., M.J. Cox, E.M. Turek, S.T. Calus, W.O. Cookson, M.F. Moffatt, P. Turner, J. Parkhill, N.J. Loman, and A.W. Walker. 2014. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol. 12(1):87. doi:10.1186/s12915-014-0087-z Schirmer, M., U.Z. Ijaz, R. D’Amore, N. Hall, W.T. Sloan, and C. Quince. 2015. Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform. Nucleic Acids Res. doi:10.1093/ nar/gku1341 Schloss, P.D., D. Gevers, and S.L. Westcott. 2011. Reducing the effects of PCR amplification and sequencing artifacts on 16S rRNA-based studies. PLoS ONE 6(12):E27310. doi:10.1371/journal.pone.0027310 Sengupta, A., and W.A. Dick. 2015. Bacterial community diversity in soil under two tillage practices as determined by pyrosequencing. Microb. Ecol. 70:853–859. doi:10.1007/s00248-015-0609-4 Sengupta, A., L.A. Pangle, T.H.M. Volkmann, K. Dontsova, P.A. Troch, A.A. Meira, J.W. Neilson, E.A. Hunt, J. Chorover, X. Zeng, J. van Haren, G.A. Barron-Gafford, A. Bugaj, N. Abramson, M. Sibayan, and T.E. Huxman. 2016. Advancing understanding of hydrological and biogeochemical interactions in evolving landscapes through controlled experimentation and monitoring at the Landscape Evolution Observatory. In: A. Chabbi and H. Loescher, editors, Terrestrial ecosystem research infrastructures: Challenges and opportunities. CRC Press, Boca Raton, FL (in press). Tanner, M.A., B.M. Goebel, M.A. Dojka, and N.R. Pace. 1998. Specific ribosomal DNA sequences from diverse environmental settings correlate with experimental contaminants. Appl. Environ. Microbiol. 64(8):3110–3113. Zhou, J., Z. He, Y. Yang, Y. Deng, S.G. Tringe, and L. Alvarez-Cohen. 2015. High-throughput metagenomic technologies for complex microbial community analysis: Open and closed formats. MBio 6(1):E02288– E14 doi:10.1128/mBio.02288-14

AGRICULTURAL & ENVIRONMENTAL LETTERS