Next Generation Sequencing: Survey of tools for analysis
Commonly used tools, packages, web servers and databases for next generation sequencing
Aditya Arya, PhD, Pathfinder Research and Training Foundation, New Delhi
29th June 2018
Levels of NGS data analysis
Base calling
Based on the detection system and the conversion of signals into base information. Mostly performed by integrated, platform-specific tools.
Quality control
Trimming and removal of tags, barcodes and other unnecessary sequences, plus a quality check (can be performed on integrated or independent platforms).
Sequence assembly
Varies between applications such as amplicon sequencing and RNA-seq. More complex in the case of whole genome assembly (may require a reference).
Scientific information
Phylogenetic analysis (rarefaction curves), statistical analysis, network analysis (biological networks), heatmaps, clustering etc.
Image Credit: PNAS
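As a rough illustration of the quality-control level, the sketch below trims low-quality bases from the 3' end of a read. The Phred+33 encoding, the threshold of 20 and the function name trim_3prime are assumptions made for this example, not the behaviour of any particular trimming tool.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose Phred score is below min_q.

    Assumes Phred+33 encoded quality characters (an assumption;
    older data may use a different offset).
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


seq, qual = trim_3prime("ATGCATGC", "IIIIII#!")
print(seq)  # ATGCAT
```

Real trimmers usually apply sliding windows or running-sum algorithms; this only shows the basic idea of cutting where quality drops.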
2nd Introductory Course on Next generation Amplicon Sequencing. Chennai, India
Dr. Aditya Arya
Levels of NGS data analysis
Primary analysis (already done by sequencing platforms)
• Analysis of data generated from hardware
• Signal analysis
• Base calling
• Base quality scoring
• Read generation (FASTQ)
Secondary analysis
• QA filtering and variant calling
• Alignment and filtering
• Quality score recalibration
• Filtering false positives
Tertiary analysis (we mostly did this part in this training)
• Making sense out of information
• Validation
• Interpretation
• Reporting
Integrated tools as well as independent tools are available for the different steps.
Data types
What do we get from sequencing?
Individual sequences/reads: ATATATGCATGCATGC, CATGCATGCGCATGC
Contigs: ATATATGCATGCATGCATGCGCATGC, ATGCTATGCATGCATGCATGCGCATGC
Scaffolds: ATGCTATGCATGCATGCATGCGCATGCATGCTATGCATGCATGCATGCGCATGC
Genome assembly
But there are millions of such sequences, so we need algorithms and bioinformatics tools.
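To make the step from reads to a contig concrete, here is a minimal sketch that looks for suffix/prefix overlaps between the two example reads on this slide and merges them. The function names and the minimum overlap of 3 bases are assumptions for illustration; real assemblers use far more elaborate graph-based methods.

```python
def suffix_prefix_overlaps(left, right, min_len=3):
    """Return every overlap length k (longest first) where the last
    k bases of `left` equal the first k bases of `right`."""
    longest = min(len(left), len(right))
    return [k for k in range(longest, min_len - 1, -1)
            if left[-k:] == right[:k]]


def merge_reads(left, right, overlap):
    """Join two reads, keeping the overlapping bases only once."""
    return left + right[overlap:]


read1 = "ATATATGCATGCATGC"
read2 = "CATGCATGCGCATGC"
print(suffix_prefix_overlaps(read1, read2))  # [9, 5]
print(merge_reads(read1, read2, 5))          # ATATATGCATGCATGCATGCGCATGC
```

Because the reads are repetitive, two overlaps are possible; merging on the 5-base overlap reproduces the first contig shown above. Ambiguity of this kind is one reason assembly needs dedicated algorithms.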
File formats
FASTQ: originally developed at the Wellcome Trust Sanger Institute to bundle a FASTA-formatted sequence and its quality data, it has become the de facto standard for storing the output of NGS instruments.
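A minimal sketch of reading FASTQ records follows. It assumes the common four-line layout (header, sequence, "+", quality) and Phred+33 quality encoding; the function names are illustrative only.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality string) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Convert quality characters to integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]


example = io.StringIO("@read1\nATGC\n+\nII5!\n")
for name, seq, qual in read_fastq(example):
    print(name, seq, phred_scores(qual))  # read1 ATGC [40, 40, 20, 0]
```

For gzipped files, the same function can be given a handle opened with gzip.open(path, "rt").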
Sequence Alignment Map (SAM) is a text-based format, developed by Heng Li, Bob Handsaker et al., for storing biological sequences aligned to a reference sequence.
The binary, compressed implementation of SAM is known as the BAM format. Files may also be compressed with gzip (.gz).
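Since SAM is plain tab-separated text, one alignment line can be split into its eleven mandatory fields as sketched below. The example line is invented for illustration, and header lines (which start with "@") are assumed to be skipped by the caller; for real work a dedicated tool such as SAMtools is the safer choice.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Split one SAM alignment line into its 11 mandatory fields."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    # FLAG is a bit field: 0x4 = read unmapped, 0x10 = reverse strand
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)
    record["is_reverse"] = bool(record["FLAG"] & 0x10)
    return record


# Made-up alignment line, for illustration only
line = "read1\t16\tchr1\t100\t60\t8M\t*\t0\t0\tATGCATGC\tIIIIIIII"
rec = parse_sam_line(line)
print(rec["RNAME"], rec["POS"], rec["is_reverse"])  # chr1 100 True
```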
File formats (figure)
Image credit: kscbioinformatics - WordPress.com
Types of resources
More than 700 tools are available for NGS data analysis, so the choice must be made wisely.
Tools/software
Stand-alone: Mothur, FastQC | Online: Silva
Suites/packages
Stand-alone: Mothur, FastQC | Online: GenomeQuest
Web servers (allow pipeline development)
Galaxy
Databases
Primary: NCBI-Genome, ENA
Secondary: Online Resource for Community Annotation of Eukaryotes
Tertiary: Human Genome Variation db
Open Source tools
Hundreds of tools are available, but only a few are popular and widely used.
• Microbial ecology
• De novo, RNA-seq etc.
• Amplicon
• Chimera analysis
• 16S and 28S analysis
Open Source tools (Galaxy)
Peak-calling analyses for ChIP-Seq (MACS, GeneTrack indexer, Peak predictor); RNA-Seq analysis (TopHat, Cufflinks); finding small insertions, deletions and SNPs using SAMtools and GATK.
https://usegalaxy.org/
Open Source tools (Galaxy)
Not to be confused with http://galaxy.seoklab.org/, which offers:
Protein Structure Prediction
• GalaxyTBM: Protein structure prediction from sequence by template-based modeling
• GalaxyLoop: Modeling of loop and/or terminus regions specified by user
• GalaxyDom: Protein modeling unit detection for protein structure predictions
Protein Structure Refinement
• GalaxyRefine: Refinement of model structure provided by user
• GalaxyRefineComplex: Refinement of protein-protein complex model structure provided by user
Protein Interaction Prediction
• GalaxySite: Ligand binding site prediction from a given protein structure (experimental or model)
• GalaxyPepDock: Protein-peptide docking based on interaction similarity
• GalaxyHomomer: Protein homo-oligomer structure prediction from a monomer sequence or structure
• GalaxyGemini: Protein homomer structure prediction from a given protein monomer structure based on similarity
• GalaxyTongDock: Symmetric and asymmetric protein-protein docking
GPCR Applications
• Galaxy7TM: Flexible GPCR-ligand docking by structure refinement with a GPCR and a ligand structure provided by user
• GalaxyGPCRloop: Structure prediction of the second extracellular loop of GPCR
Open Source tools (QIIME)
Pronounced as "chime"
QIIME™ stands for Quantitative Insights Into Microbial Ecology. QIIME is an open-source project, developed primarily in the Knight and Caporaso labs (University of California San Diego and Northern Arizona University, USA).
http://qiime.org/
Can be used for performing
• Microbiome analysis from raw DNA sequencing data (Illumina or other platforms)
• De-multiplexing and quality filtering
• OTU picking, taxonomic assignment and phylogenetic reconstruction
• Diversity analyses and visualizations
• Producing publication-quality graphics and statistics
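As one example of what a diversity analysis computes, the sketch below calculates the Shannon diversity index from OTU counts. The OTU table is invented, and the natural logarithm is an assumption made here (tools differ in the log base they use); this is not QIIME's own code.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over observed OTUs."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical OTU counts for one sample
otu_counts = {"OTU_1": 50, "OTU_2": 30, "OTU_3": 20}
print(round(shannon_index(otu_counts.values()), 3))  # 1.03
```

A sample dominated by a single OTU scores near 0, while an even community of many OTUs scores higher.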
Open Source tools (RDP)
• The Ribosomal Database Project (RDP) provides ribosome-related data services to the scientific community, including online data analysis, rRNA-derived phylogenetic trees, and aligned and annotated rRNA sequences.
• RDP provides quality-controlled, aligned and annotated bacterial and archaeal 16S rRNA sequences, and fungal 28S rRNA sequences.
A number of additional tools are available on the portal, such as Hierarchy Browser, Probe Match and Library Compare.
https://rdp.cme.msu.edu/index.jsp
Open Source tools (Mothur)
Pronounced as "mother"
The mothur project was initiated by Dr. Patrick Schloss and his software development team in the Department of Microbiology & Immunology at the University of Michigan, mainly to meet the needs of microbial ecologists. The first version of mothur was released in February 2009 and included accelerated versions of the popular DOTUR and SONS programs.
Can be used for performing
• Microbiome analysis from raw DNA sequencing data (Illumina or other platforms)
• Diversity analyses and visualizations
• Producing publication-quality graphics and statistics
https://www.mothur.org/
https://github.com/mothur/mothur/releases/tag/v1.40.5
Open Source tools (GATK)
Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.
Can be used for performing
• Data pre-processing: FASTQ to BAM conversion, alignment to a reference, data cleanup
• Variant discovery: identification of variants such as SNVs, small indels and CNVs; output is generally in VCF format
• Additional filtering using callsets and truthsets
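Variant callers typically write their results as VCF, a tab-separated text format. The sketch below reads the eight fixed columns of a VCF data line and makes a crude SNV-versus-other distinction from allele lengths. The example records are invented, and the classification rule is a simplification assumed for illustration, not GATK's logic.

```python
def parse_vcf_line(line):
    """Parse the fixed columns (CHROM..INFO) of one VCF data line."""
    chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    alts = alt.split(",")
    # Simplified rule (assumption): single base to single base = SNV
    kinds = ["SNV" if len(ref) == 1 and len(a) == 1 else "indel/other"
             for a in alts]
    return {"chrom": chrom, "pos": int(pos), "ref": ref,
            "alts": alts, "filter": flt, "kinds": kinds}


# Invented two-record VCF, for illustration only
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t101\t.\tA\tG\t50\tPASS\t.\n"
    "chr1\t205\t.\tAT\tA\t40\tPASS\t.\n"
)
records = [parse_vcf_line(l) for l in vcf_text.splitlines()
           if not l.startswith("#")]
for rec in records:
    print(rec["chrom"], rec["pos"], rec["kinds"])
```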
https://software.broadinstitute.org/gatk/
Levels of NGS data analysis - tools
Secondary analysis
• QA filtering and variant calling
• Alignment and filtering
• Quality score recalibration
• Filtering false positives
Tools: removal of large indels/CNVs: Garbage picker/Absolute var; identification of SNVs: Absolute var; identification of small indels: GATK; and many more.
Tertiary analysis
• Making sense out of information
• Validation
• Interpretation
• Reporting
Tools: rarefaction curves: Mothur, Galaxy, R; statistical analysis: Primer, R, XLSTAT (proprietary); systems biology: Cytoscape, IPA, MetaCore; and many more.
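A rarefaction curve plots how many distinct OTUs have been seen against sequencing depth. The sketch below draws reads in one random order and records the OTU count at regular depths. The OTU table, step size and seed are assumptions for illustration; dedicated tools usually average over many random subsamples rather than using a single pass.

```python
import random


def rarefaction_curve(otu_counts, step=10, seed=0):
    """Return (depth, observed OTUs) pairs for one random read order."""
    pool = [otu for otu, n in otu_counts.items() for _ in range(n)]
    random.Random(seed).shuffle(pool)
    seen = set()
    curve = []
    for depth, otu in enumerate(pool, start=1):
        seen.add(otu)
        if depth % step == 0 or depth == len(pool):
            curve.append((depth, len(seen)))
    return curve


# Hypothetical sample with 100 reads spread over 4 OTUs
sample = {"OTU_A": 50, "OTU_B": 30, "OTU_C": 15, "OTU_D": 5}
for depth, n_otus in rarefaction_curve(sample):
    print(depth, n_otus)
```

A curve that flattens out suggests that additional sequencing would reveal few new OTUs.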
Proprietary tools
Proprietary tools (CLC workbench)
Credit: qiagen.com
Proprietary tools (DNAnexus)
Credit: dnanexus.com
Mainly developed as a community platform for NGS assay evaluation and regulatory science exploration. It claims to remove the bottleneck from your NGS data analysis pipeline.
DRAGEN
Edico Genome's DRAGEN massively accelerates secondary analysis algorithms while simultaneously improving accuracy. Introductory whole genome analysis pricing is $20 per genome.
Proprietary tools (MinKNOW)
MinKNOW produces FAST5 (HDF5) files and/or FASTQ files, according to your preference. FAST5 contains raw data and basecalling information.
Credit: Oxford Nanopore
Proprietary tools (Torrent Suite)
Credit: Thermo Scientific
Proprietary tools (BaseSpace)
Credit: illumina.com
Different pipelines
Application specific tools
De novo assembly/DNA Seq
• Obtain sequence read file(s) from sequencing machine(s).
• Look at the reads: get an understanding of what you've got and what the quality is like.
• Raw data cleanup/quality trimming if necessary.
• Choose an appropriate assembly parameter set.
• Assemble the data into contigs/scaffolds.
• Examine the output of the assembly and assess assembly quality.
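For the last step, a widely used summary of assembly quality is N50: the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal sketch, with invented contig lengths:

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths in bp
print(n50([100, 80, 50, 30, 20, 20]))  # 80
```

N50 alone does not prove an assembly is correct, so it is usually read alongside total length, contig count and other checks.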
https://www.melbournebioinformatics.org.au/tutorials/tutorials/assembly/assembly-protocol/
Tools for DNA Seq (supported platforms in brackets)
Alignment/mapping: MAQ (Illumina/ABI), BFAST (all), Novoalign (Illumina/Roche), BWA (Illumina/ABI), SOAP3 (Illumina/Roche/ABI)
De novo assembly: VCAKE (Illumina/Roche), Newbler (Roche), Velvet (Illumina/Roche/ABI)
SNV detection: GATK (Illumina/Roche/ABI), SAMtools (Illumina/Roche), VarScan/VarScan2 (Illumina/Roche/ABI), SomaticSniper (Illumina), JointSNVMix (Illumina)
Structural variation detection: BreakDancer (Illumina/Roche/ABI), VariationHunter (Illumina), SVDetect (Illumina/ABI), PEMer (Illumina/Roche/ABI)
Credit: Lee et al., 2013
RNA seq
https://sparta-teaching.readthedocs.io/en/latest/rnaseqbackground.html#basic-analysis-procedure
RNA seq (all tools listed support Illumina/Roche/ABI data)
De novo transcriptome assembly: Trinity, Trans-ABySS, Oases
Alignment/mapping: Bowtie/Bowtie2, TopHat
Counting reads per transcript: HTSeq, Cufflinks
Normalization, bias correction, and testing differential expression: DESeq, baySeq, edgeR, Cufflinks
Credit: Lee et al., 2013
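To show why raw read counts need normalization, the sketch below converts counts per transcript into TPM (transcripts per million), which scales for transcript length and sequencing depth. The counts and lengths are invented, and TPM is used here only as a simple illustration; packages such as DESeq and edgeR apply their own, different normalization methods.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million from read counts and transcript lengths."""
    # Reads per kilobase of transcript
    rates = {t: counts[t] / (lengths_bp[t] / 1000) for t in counts}
    scale = sum(rates.values()) / 1_000_000
    if scale == 0:
        return {t: 0.0 for t in counts}
    return {t: rate / scale for t, rate in rates.items()}


# Hypothetical counts: tB has 3x the reads but is also 3x longer
counts = {"tA": 100, "tB": 300}
lengths = {"tA": 1000, "tB": 3000}
print(tpm(counts, lengths))
```

In this example both transcripts end up with the same TPM, because the higher count of tB is explained entirely by its length.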
Small RNA seq (supported platforms in brackets)
Adapter trimming: cutadapt (Illumina/Roche/ABI), Flicker (Illumina), FASTX Clipper (Illumina), scythe (Illumina)
Quality control: NGS QC Toolkit (Illumina/Roche), FASTQ Quality Filter (Illumina)
Quality viewer: FastQC (Illumina/Roche), qrqc (Illumina/Roche/ABI)
Alignment/mapping: Bowtie/Bowtie2 (Illumina/Roche/ABI)
miRNA prediction (all Illumina/Roche/ABI): DSAP, miRanalyzer, miRDeep/miRDeep2, MIReNA, mirExplorer, miRTRAP, miRDeep-P
Credit: Lee et al., 2013
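Small RNA inserts are often shorter than the read, so the 3' adapter is sequenced too and has to be removed. The sketch below does exact-match trimming only; the adapter and read sequences are made up for illustration, and tools such as cutadapt additionally tolerate sequencing errors, which this sketch does not.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter (full, or partial at the read end) by exact match."""
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Adapter may be cut off by the end of the read: look for a prefix of it
    for k in range(min(len(adapter), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


adapter = "TGGAATTC"  # hypothetical adapter sequence
print(trim_adapter("ACGTACGTACTGGAATTCTC", adapter))  # ACGTACGTAC
print(trim_adapter("ACGTACGTACTGGAA", adapter))       # ACGTACGTAC
```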
Amplicon sequencing
Base caller: PyroBayes
Alignment: Cross_match, ELAND, Exonerate, Mosaik, RMAP, SHRiMP, SOAP, SSAHA2, SXOligoSearch
Assembly: ALLPATHS, SHARCGS, SHRAP, VCAKE, Velvet
Variant detection: PbShort, ssahaSNP
Alignment and variant detection: MAQ
Thank you and
happy learning