Next Generation Sequencing

Next Generation Sequencing Survey of tools for analysis

Commonly used tools, packages, web servers and databases for next-generation sequencing

Aditya Arya, PhD Pathfinder Research and Training Foundation New Delhi

29th June 2018

Levels of NGS data analysis

Base calling
Based on the detection system and the conversion of signals into base information. Mostly performed by integrated, platform-specific tools.

Quality control
Trimming and removal of tags, barcodes and other unnecessary sequences, plus a quality check (can be performed on integrated or independent platforms).

Sequence assembly
Varies between applications such as amplicon sequencing, RNA-seq etc. More complex in the case of whole-genome assembly (requires a reference).

Scientific information
Phylogenetic analysis (rarefaction curves), statistical analysis, network analysis (biological networks), heatmaps, clustering etc.

Image Credit: PNAS

2nd Introductory Course on Next generation Amplicon Sequencing. Chennai, India

Dr. Aditya Arya

Levels of NGS data analysis

Primary analysis: analysis of data generated from hardware (already done by sequencing platforms)
• Signal analysis
• Base calling
• Base quality scoring
• Read generation (Fastq)

Secondary analysis: QA filtering and variant calling
• Alignment and filtering
• Quality score recalibration

Tertiary analysis: making sense out of information (we mostly did this part in this training)
• Filtering false positives
• Validation
• Interpretation
• Reporting

Integrated tools as well as independent tools are available for the different steps.


Data types

What do we get from sequencing?

Individual sequences/reads
ATATATGCATGCATGC
CATGCATGCGCATGC

Contigs
ATATATGCATGCATGCATGCGCATGC
ATGCTATGCATGCATGCATGCGCATGC

Scaffolds
ATGCTATGCATGCATGCATGCGCATGCATGCTATGCATGCATGCATGCGCATGC

Genome assembly

But there are millions of such sequences, so we need algorithms and bioinformatics tools.
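The step from reads to a contig can be illustrated with the two example reads on this slide. The sketch below is a toy illustration only, not how production assemblers work, and the function names `suffix_prefix_overlaps` and `merge_reads` are made up for this example. It lists every overlap length at which the end of the first read matches the start of the second, and joins the reads on a chosen overlap; a 5-base overlap reproduces the first contig shown on the slide.

```python
READ_1 = "ATATATGCATGCATGC"
READ_2 = "CATGCATGCGCATGC"


def suffix_prefix_overlaps(left, right):
    """Return every length k where the last k bases of left equal the first k bases of right."""
    longest = min(len(left), len(right))
    return [k for k in range(1, longest + 1) if left[-k:] == right[:k]]


def merge_reads(left, right, overlap):
    """Join two reads that share `overlap` bases at the junction."""
    if overlap < 1 or left[-overlap:] != right[:overlap]:
        raise ValueError("reads do not overlap by that many bases")
    return left + right[overlap:]


if __name__ == "__main__":
    print(suffix_prefix_overlaps(READ_1, READ_2))
    print(merge_reads(READ_1, READ_2, 5))
```

Because the example sequence is repetitive, more than one overlap length is valid. That ambiguity is one reason real assemblers need more than pairwise string matching.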


File formats

FASTQ was originally developed at the Wellcome Trust Sanger Institute to bundle a FASTA-formatted sequence with its quality data, and has since become the de facto standard for storing the output of NGS instruments.
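As a rough illustration of how sequence and quality travel together, a FASTQ record in its common form spans four lines: a header starting with `@`, the sequence, a separator line starting with `+`, and a quality string of the same length as the sequence. The sketch below assumes that four-line layout and the Phred+33 quality encoding (older Illumina files used a different offset); the record is made up and `read_fastq` is a hypothetical helper, not part of any library.

```python
import io

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ encoding


def read_fastq(handle):
    """Yield (read_id, sequence, qualities) from a text handle of 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Each quality character encodes one Phred score as ASCII code minus the offset.
        scores = [ord(char) - PHRED_OFFSET for char in quality]
        yield header[1:], sequence, scores


# Made-up record for illustration only.
EXAMPLE = "@read1\nATGC\n+\nII5!\n"
records = list(read_fastq(io.StringIO(EXAMPLE)))
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so the last base in this made-up read (score 0) carries no confidence at all.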

Sequence Alignment Map (SAM) is a text-based format for storing biological sequences aligned to a reference sequence, developed by Heng Li, Bob Handsaker et al.
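To make "text-based" concrete: each SAM alignment line has eleven mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL), and header lines start with `@`. The following is a minimal sketch that ignores optional tags; the `SamRecord` type, the helper names and the example line are invented for illustration.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int  # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line):
    """Parse one alignment line; return None for '@' header lines."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq), cigar,
                     rnext, int(pnext), int(tlen), seq, qual)


def is_unmapped(record):
    """FLAG bit 0x4 marks a read that did not align."""
    return bool(record.flag & 0x4)


def is_reverse_strand(record):
    """FLAG bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record.flag & 0x10)


# Made-up alignment line for illustration only.
EXAMPLE_LINE = "read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tATGC\tIIII\n"
record = parse_sam_line(EXAMPLE_LINE)
```

In practice a dedicated library or SAMtools would be used; the point here is only that every alignment is one readable line of text.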

The binary, compressed implementation of SAM is known as the BAM format. Files are sometimes also compressed as gzipped (.gz) files.
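Since text files such as FASTQ may arrive either plain or gzipped, a small helper that handles both can save some trouble. This is a sketch using only the Python standard library; `open_maybe_gzipped` is a hypothetical name, and it detects gzip by the two leading magic bytes rather than by file extension.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_maybe_gzipped(path):
    """Open a text file for reading, decompressing on the fly if it is gzipped."""
    with open(path, "rb") as raw:
        is_gzipped = raw.read(2) == GZIP_MAGIC
    if is_gzipped:
        return gzip.open(path, "rt")
    return open(path, "rt")
```

This is meant for gzipped text files such as `.fastq.gz`. BAM is a binary format and needs a dedicated reader such as SAMtools.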


File formats

Image credit: kscbioinformatics - WordPress.com

Types of resources

More than 700 tools are available for NGS data analysis, so the choice must be made wisely.

Tools/Softwares (stand alone | online)
Examples: Mothur, Fastqc | Silva

Suites/packages (stand alone | online)
Examples: Mothur, Fastqc | Genomequest

Web servers (allow pipeline development)
Example: Galaxy

Databases (primary | secondary | tertiary)
Primary: NCBI-Genome, ENA
Secondary: Online Resource for Community Annotation of Eukaryotes
Tertiary: Human genome Variation db


Open Source tools

Hundreds of tools are available, but only a few are popular and widely used.

Application areas shown on the slide:
• Microbial ecology
• De novo, RNA-seq etc.
• Amplicon
• Chimera analysis
• 16S and 28S analysis


Open Source tools (Galaxy)

• Peak-calling analyses for ChIP-Seq (MACS, GeneTrack indexer, Peak predictor)
• RNA-Seq (Tophat, Cufflinks)
• Finding small insertions, deletions and SNPs using SAMtools and GATK

https://usegalaxy.org/


Open Source tools (Galaxy)

Not to be confused with http://galaxy.seoklab.org/, which offers protein modeling services:

Protein Structure Prediction
• GalaxyTBM: Protein structure prediction from sequence by template-based modeling
• GalaxyLoop: Modeling of loop and/or terminus regions specified by user
• GalaxyDom: Protein modeling unit detection for protein structure predictions

Protein Structure Refinement
• GalaxyRefine: Refinement of model structure provided by user
• GalaxyRefineComplex: Refinement of protein-protein complex model structure provided by user

Protein Interaction Prediction
• GalaxySite: Ligand binding site prediction from a given protein structure (experimental or model)
• GalaxyPepDock: Protein-peptide docking based on interaction similarity
• GalaxyHomomer: Protein homo-oligomer structure prediction from a monomer sequence or structure
• GalaxyGemini: Protein homomer structure prediction from a given protein monomer structure based on similarity
• GalaxyTongDock: Symmetric and asymmetric protein-protein docking

GPCR Applications
• Galaxy7TM: Flexible GPCR-ligand docking by structure refinement with a GPCR and a ligand structure provided by user
• GalaxyGPCRloop: Structure prediction of the second extracellular loop of GPCR


Open Source tools (QIIME)

Pronounced as "chime". QIIME™ stands for Quantitative Insights Into Microbial Ecology. QIIME is an open-source project, developed primarily in the Knight lab (University of California San Diego) and the Caporaso lab (Northern Arizona University), USA.

http://qiime.org/

Can be used for performing
• Microbiome analysis from raw DNA sequencing data (Illumina or other platforms)
• De-multiplexing and quality filtering
• OTU picking, taxonomic assignment and phylogenetic reconstruction
• Diversity analyses and visualizations
• Producing publication-quality graphics and statistics
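As an idea of what a diversity analysis computes, one widely used alpha-diversity measure is the Shannon index, H = -sum(p_i * ln p_i), where p_i is the fraction of reads assigned to taxon i. The sketch below is plain Python, not QIIME's own implementation; the OTU counts are hypothetical and the natural logarithm is an arbitrary choice here (tools may use a different base).

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("need at least one observation")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical OTU counts for two samples with the same number of reads.
even_sample = [25, 25, 25, 25]
skewed_sample = [97, 1, 1, 1]

if __name__ == "__main__":
    print(shannon_index(even_sample), shannon_index(skewed_sample))
```

Both samples contain four OTUs, but the evenly distributed one scores higher, which is the behaviour the index is designed to capture.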


Open Source tools (RDP)

The Ribosomal Database Project (RDP) provides ribosome-related data services to the scientific community, including online data analysis, rRNA-derived phylogenetic trees, and aligned and annotated rRNA sequences.

RDP provides quality-controlled, aligned and annotated bacterial and archaeal 16S rRNA sequences, and fungal 28S rRNA sequences.

A number of additional tools are available on the portal, such as the hierarchy browser, probe match, library compare etc.

https://rdp.cme.msu.edu/index.jsp


Open Source tools (Mothur)

Pronounced as "mother". The mothur project was initiated by Dr. Patrick Schloss and his software development team in the Department of Microbiology & Immunology at the University of Michigan. The first version of mothur was released in February 2009 and included accelerated versions of the popular DOTUR and SONS programs.

Can be used for performing
• Microbiome analysis from raw DNA sequencing data (Illumina or other platforms)
• Diversity analyses and visualizations
• Producing publication-quality graphics and statistics

Mainly developed to meet the needs of microbial ecologists.

https://www.mothur.org/
https://github.com/mothur/mothur/releases/tag/v1.40.5


Open Source tools (GATK)

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its processing engine and high-performance computing features are designed to handle projects of any size.

Can be used for performing
• Data pre-processing: Fastq to BAM file, alignment to reference dataset, data cleanup
• Variant discovery: identification of variants such as SNVs, small indels and CNVs; output is generally in VCF format
• Additional filtering using callsets and truthsets
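Variant callers such as GATK typically write their calls as VCF, a tab-separated text format whose data lines begin with eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and whose header lines begin with `#`. The sketch below shows a very simple quality filter on such lines. It is not GATK's filtering, which is considerably more elaborate; the helper names, the threshold of 30 and the example lines are all assumptions made for illustration.

```python
def parse_vcf_line(line):
    """Return the eight fixed VCF columns as a dict, or None for '#' header lines."""
    if line.startswith("#"):
        return None
    chrom, pos, var_id, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),  # several alternate alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info,
    }


def passes(record, min_qual=30.0):
    """Keep records marked PASS (or unfiltered '.') whose QUAL meets the threshold."""
    if record["filter"] not in ("PASS", "."):
        return False
    return record["qual"] is not None and record["qual"] >= min_qual


# Made-up VCF content for illustration only.
LINES = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t1000\t.\tA\tG\t50.0\tPASS\tDP=30\n",
    "chr1\t2000\t.\tC\tCT\t12.5\tLowQual\tDP=4\n",
]
records = [r for r in map(parse_vcf_line, LINES) if r is not None]
kept = [r for r in records if passes(r)]
```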

https://software.broadinstitute.org/gatk/

Levels of NGS data analysis - tools

Secondary analysis: QA filtering and variant calling
• Alignment and filtering
• Quality score recalibration

Tools:
Removal of large indels/CNV: Garbage picker / Absolute var
Identification of SNVs: Absolute var
Identification of small indels: GATK
... and many more

Tertiary analysis: making sense out of information
• Filtering false positives
• Validation
• Interpretation
• Reporting

Tools:
Rarefaction curve: Mothur, Galaxy, R
Statistical analysis: Primer, R, XLSTAT (proprietary)
Systems biology: Cytoscape, IPA, MetaCore
... and many more
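For readers unfamiliar with rarefaction curves: the idea is to draw random subsamples of increasing size from a sample's reads and record how many distinct OTUs appear at each depth; a curve that flattens suggests sequencing depth was sufficient. The sketch below is a plain-Python illustration with a hypothetical OTU table, not the procedure used by Mothur, Galaxy or R packages.

```python
import random


def rarefaction_curve(otu_counts, depths, replicates=10, seed=0):
    """Mean number of distinct OTUs seen when `depth` reads are drawn without replacement."""
    # Expand the table into one list entry per read.
    reads = [otu for otu, count in otu_counts.items() for _ in range(count)]
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        if depth > len(reads):
            raise ValueError("depth exceeds the number of reads in the sample")
        observed = [len(set(rng.sample(reads, depth))) for _ in range(replicates)]
        curve.append((depth, sum(observed) / replicates))
    return curve


# Hypothetical OTU table for one sample (100 reads in total).
sample = {"OTU_1": 60, "OTU_2": 25, "OTU_3": 10, "OTU_4": 4, "OTU_5": 1}
curve = rarefaction_curve(sample, depths=[1, 10, 50, 100])

if __name__ == "__main__":
    for depth, mean_otus in curve:
        print(depth, mean_otus)
```

Plotting mean OTUs against depth gives the familiar curve; at full depth every OTU in the table is recovered by construction.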


Proprietary tools


Proprietary tools (CLC workbench)

Credit: qiagen.com

Proprietary tools (DNAnexus)

Credit: dnanexus.com

DRAGEN
Mainly developed as a community platform for NGS assay evaluation and regulatory science exploration. It claims to remove the bottleneck from the NGS data analysis pipeline: Edico Genome's DRAGEN is said to massively accelerate secondary analysis algorithms while simultaneously improving accuracy. Introductory whole-genome analysis pricing is $20 per genome.


Proprietary tools (MinKNOW)

MinKNOW produces FAST5 (HDF5) files and/or FASTQ files, according to your preference. FAST5 contains raw data and basecalling information.

Credit: Oxford Nanopore

Proprietary tools (Torrent Suite)

Credit: Thermo Scientific

Proprietary tools (BaseSpace)

Credit: illumina.com


Different pipelines

Application specific tools

De novo assembly/DNA Seq

• Obtain sequence read file(s) from sequencing machine(s).
• Look at the reads: get an understanding of what you've got and what the quality is like.
• Raw data cleanup/quality trimming if necessary.
• Choose an appropriate assembly parameter set.
• Assemble the data into contigs/scaffolds.
• Examine the output of the assembly and assess assembly quality.
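For the last step, one commonly reported summary of assembly quality is N50: sort the contigs from longest to shortest and add up their lengths until the running total reaches half of the assembly size; the length of the contig that gets you there is the N50. The sketch below assumes contig lengths are already available as a list, and the lengths used are hypothetical.

```python
def n50(contig_lengths):
    """Return the contig length at which the running total, longest first, reaches half the assembly size."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in base pairs (total 300).
lengths = [100, 80, 50, 40, 30]

if __name__ == "__main__":
    print(n50(lengths))
```

N50 rewards contiguity but says nothing about correctness, so it is usually read alongside other checks such as total assembly size and number of contigs.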

https://www.melbournebioinformatics.org.au/tutorials/tutorials/assembly/assembly-protocol/

Tools for DNA Seq

Each entry is listed as tool: supported platforms.

Alignment/mapping
MAQ: Illumina/ABI
BFAST: all
Novoalign: Illumina/Roche
BWA: Illumina/ABI
SOAP3: Illumina/Roche/ABI

De novo assembly
VCAKE: Illumina/Roche
Newbler: Roche
Velvet: Illumina/Roche/ABI

SNV detection
GATK: Illumina/Roche/ABI
SAMtools: Illumina/Roche
VarScan/VarScan2: Illumina/Roche/ABI
SomaticSniper: Illumina
JointSNVMix: Illumina

Structural variation detection
BreakDancer: Illumina/Roche/ABI
VariationHunter: Illumina
SVDetect: Illumina/ABI
PEMer: Illumina/Roche/ABI

Credit: Lee et al., 2013

RNA seq

https://sparta-teaching.readthedocs.io/en/latest/rnaseqbackground.html#basic-analysis-procedure

RNA seq

All tools on this slide are listed for Illumina/Roche/ABI platforms.

De novo transcriptome assembly
Trinity, Trans-ABySS, Oases

Alignment/mapping
Bowtie/Bowtie2, TopHat

Counting reads per transcript
HTSeq, Cufflinks

Normalization, bias correction, and testing differential expression
DESeq, baySeq, edgeR, Cufflinks

Credit: Lee et al., 2013
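To see why read counts need normalization before samples are compared, consider the simplest possible scheme, counts per million (CPM), which rescales each sample so its counts sum to one million. The sketch below is only that simple scaling with hypothetical counts; DESeq, edgeR and the other tools named above use more elaborate normalization methods that this example does not reproduce.

```python
def counts_per_million(counts):
    """Scale raw read counts so that the sample's counts sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no counted reads")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Hypothetical raw counts: sample_b was sequenced twice as deeply as sample_a.
sample_a = {"geneA": 500, "geneB": 1500, "geneC": 8000}
sample_b = {"geneA": 1000, "geneB": 3000, "geneC": 16000}

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)

if __name__ == "__main__":
    print(cpm_a)
    print(cpm_b)
```

The raw counts differ two-fold between the samples, yet the scaled values are identical, showing that the difference was due to sequencing depth rather than expression.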

Small RNA seq

Each entry is listed as tool: supported platforms.

Adapter trimming
cutadapt: Illumina/Roche/ABI
Flicker: Illumina
FASTX Clipper: Illumina
scythe: Illumina

Quality control
NGS QC Toolkit: Illumina/Roche
FASTQ Quality Filter: Illumina

Quality viewer
FastQC: Illumina/Roche
qrqc: Illumina/Roche/ABI

Alignment/mapping
Bowtie/Bowtie2: Illumina/Roche/ABI

miRNA prediction
DSAP: Illumina/Roche/ABI
miRanalyzer: Illumina/Roche/ABI
miRDeep/miRDeep2: Illumina/Roche/ABI
MIReNA: Illumina/Roche/ABI
mirExplorer: Illumina/Roche/ABI
miRTRAP: Illumina/Roche/ABI
miRDeep-P: Illumina/Roche/ABI

Credit: Lee et al., 2013
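Adapter trimming matters especially for small RNA libraries because the inserts are often shorter than the read length, so the sequencer reads into the 3' adapter. The sketch below shows the basic idea with exact string matching only; real trimmers such as cutadapt also tolerate sequencing errors. The function name, the `min_overlap` default and the adapter string are assumptions for illustration; use the adapter sequence of your own library kit.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Cut the read at the first exact adapter match, or at a partial adapter at the 3' end."""
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    # No full adapter found: check whether the read ends with the start of the adapter.
    longest = min(len(adapter) - 1, len(read))
    for length in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:length]):
            return read[:-length]
    return read


ADAPTER = "AGATCGGAAGAGC"  # illustrative adapter sequence, assumed for this example

if __name__ == "__main__":
    print(trim_3prime_adapter("ACGTACGTAGATCGGAAGAGCTTT", ADAPTER))
    print(trim_3prime_adapter("ACGTACGTAGATC", ADAPTER))
```

The `min_overlap` guard avoids trimming reads that merely happen to end in one or two bases matching the adapter.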

Amplicon sequencing

Base caller
PyroBayes

Alignment
Cross_match, ELAND, Exonerate, Mosaik, RMAP, SHRiMP, SOAP, SSAHA2, SXOligoSearch

Assembly
ALLPATHS, SHARCGS, SHRAP, VCAKE, Velvet

Variant detection
PbShort, ssahaSNP

Alignment and variant detection
MAQ

Thank you and happy learning