Data Processing Methods in Metabolomics

Gavin Blackburn, Liang Zheng, David Watson
Strathclyde Institute of Pharmacy and Biomedical Sciences, John Arbuthnott Building, 27 Taylor Street, Glasgow, G4 0NR

Introduction

Metabolomics experiments can produce vast amounts of data that can be difficult to handle. Several steps may be needed to produce the best possible results from the data acquired, and there is a large array of methods for doing this. This poster focuses on some of the pre-processing and processing methods available for metabolomics data sets and on some of the software programmes available to carry out this work. Understanding how each method works is key and should be taken into consideration from the outset, whenever a metabolomics experiment is being planned.

Sieve [2]

Sieve is a piece of proprietary software developed by Vast Scientific in collaboration with Thermo Scientific. It takes the data produced by the Exactive and XL Orbitrap mass spectrometers at ScotMet, aligns it, extracts the relative ion chromatograms (RICs) for every aligned ion and presents them in a table. It also allows the user to look for specific differences between groups across all the extracted RICs. This allows for fast metabolic profiling of sample sets, making Sieve an extremely useful tool.
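The idea of extracting an ion chromatogram for a single ion can be sketched in a few lines of Python. This is only an illustration of the general principle and not the algorithm used by Sieve, which is proprietary. The scan layout, the function name `extract_ion_chromatogram` and the 5 ppm window are all assumptions made for the example.

```python
from typing import List, Tuple

# Assumed layout: each scan is a list of (m/z, intensity) pairs.
# This is a convenience for the example, not Sieve's internal format.
Scan = List[Tuple[float, float]]


def extract_ion_chromatogram(scans: List[Scan], target_mz: float,
                             ppm: float = 5.0) -> List[float]:
    """Sum the intensity found within +/- ppm of target_mz in every scan."""
    half_window = target_mz * ppm / 1e6
    chromatogram = []
    for scan in scans:
        total = sum(
            intensity
            for mz, intensity in scan
            if abs(mz - target_mz) <= half_window
        )
        chromatogram.append(total)
    return chromatogram


# Three made-up scans; the ion of interest is absent from the last one.
scans = [
    [(181.0707, 100.0), (203.0526, 10.0)],
    [(181.0708, 250.0), (250.0000, 40.0)],
    [(203.0526, 5.0)],
]
ric = extract_ion_chromatogram(scans, target_mz=181.0707, ppm=5.0)
```

Repeating this for every ion of interest gives one intensity trace per ion, which is the kind of table that can then be compared between sample groups.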

Data Pre-processing

There are several methods that can be used to pre-process the data, depending on what is required. The four techniques most commonly used are:

• Baseline correction
• Normalisation
• Scaling
• Data alignment

Baseline correction simply means setting a particular value in the data to zero and correcting the other values accordingly. For example, it is usual to correct chromatographic data to the lowest point in the data set to account for any baseline shifting between runs. This can also correct slight variations in peak height due to instrument noise. Figure 1 shows a set of chromatograms before and after baseline correction, with the most obvious areas of correction highlighted.

Figure 3. Screenshot of Sieve V1.2.1 showing an RIC present in one set of samples and not in the others.


Database searching macros

To cope with the large amount of data that Sieve can produce, a macro has been written by Liang Zheng to compare the extracted masses to a database of metabolites and assign them. The macro also splits the data into metabolites specific to the groups assigned in Sieve, and searches for and assigns adducts and background noise. This means that the large data sets produced by Sieve can be quickly assessed and used to find new metabolites for the particular sample sets. The database used is the KEGG database.
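The matching step can be illustrated with a minimal Python sketch. It is not the macro itself, whose implementation is not described here. The three-entry `DATABASE` is a hypothetical stand-in for KEGG, and the function name `assign`, the two adducts and the 3 ppm tolerance are assumptions chosen for the example.

```python
PROTON = 1.007276          # mass added by [M+H]+ (Da)
SODIUM_ADDUCT = 22.989218  # mass added by [M+Na]+ (Da)

# Hypothetical mini-database of neutral monoisotopic masses (Da),
# standing in for a real metabolite database such as KEGG.
DATABASE = {
    "D-Glucose": 180.06339,
    "L-Alanine": 89.04768,
    "Citric acid": 192.02700,
}

ADDUCTS = {"[M+H]+": PROTON, "[M+Na]+": SODIUM_ADDUCT}


def assign(observed_mz, ppm_tolerance=3.0):
    """Return (metabolite, adduct, ppm error) for every database hit."""
    hits = []
    for name, neutral_mass in DATABASE.items():
        for adduct, shift in ADDUCTS.items():
            expected = neutral_mass + shift
            error_ppm = (observed_mz - expected) / expected * 1e6
            if abs(error_ppm) <= ppm_tolerance:
                hits.append((name, adduct, round(error_ppm, 2)))
    return hits


protonated = assign(181.0707)   # candidate protonated ion
sodiated = assign(203.0526)     # candidate sodium adduct
unassigned = assign(500.0)      # no match in the mini-database
```

Reporting the adduct alongside the metabolite name makes it easier to see when two extracted masses are likely to come from the same compound.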

PyChem 3.0.5f Beta1


PyChem is open source software used to perform several different univariate and multivariate analyses, along with some data pre-processing. It supports PCA, DFA, cluster analysis and PLS-R. It also incorporates some genetic algorithms to examine which parts of the data are being used to discriminate between groups. Some knowledge of metabolomics is required to operate this package. Different class structures and sample names can be entered, which allows for fast meta-data analysis.


MZmine 2 [3]

[Figure 1 panels: "Scaled chromatograms before baseline correction" and "Scaled chromatograms after baseline correction". Axes: scaled TIC against scan number.]

Figure 1. A set of chromatograms before and after baseline correction. The most obvious areas of correction have been highlighted. Produced in PyChem 3.0.5f Beta1.
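The baseline correction described above, correcting to the lowest point, can be sketched as follows. This is a minimal illustration and not PyChem's implementation. It assumes that the correction is applied to each run separately, and the function name `baseline_correct` and the example values are invented for the purpose.

```python
def baseline_correct(chromatogram):
    """Shift a chromatogram so that its lowest point sits at zero."""
    lowest = min(chromatogram)
    return [value - lowest for value in chromatogram]


# Two made-up runs of the same sample with different baseline offsets.
runs = [
    [5.0, 6.0, 20.0, 7.0, 5.0],
    [8.0, 9.5, 24.0, 9.0, 8.5],
]
corrected = [baseline_correct(run) for run in runs]
```

After correction both runs start from zero, so the remaining differences between them reflect peak shape and height and not the offset of the baseline.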

Normalisation, in the context of metabolomics, is a technique used to transform variables to a common scale. In practical terms it can help correct for variations in concentration caused by unknown sample strength, as well as help to eliminate instrument variability. While it is an extremely useful technique, it must be remembered that it can distort results where concentration is important and, as with other pre-processing techniques, it should be used cautiously. A major factor in normalising data is deciding which data point to normalise to and how to apply this across the data sets. In most cases the choice is between normalising all the data sets to a specific reference and normalising to a specific intra-set point. As an example of the differences that can occur depending on the normalisation technique applied, the data set used previously was processed in two different ways available in PyChem 3.0.5f Beta1: normalising the most intense bin to +1 and normalising the total signal to +1. The results can be seen in Figure 2.
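The two normalisation options compared above can be written down in a few lines of Python. This is a sketch of the arithmetic only and not PyChem's code. The function names and the example signal are assumptions, and the signal is assumed to be non-negative, for example after baseline correction.

```python
def normalise_to_max(signal):
    """Scale a signal so that its most intense bin equals +1."""
    peak = max(signal)
    if peak == 0:
        raise ValueError("cannot normalise an all-zero signal")
    return [value / peak for value in signal]


def normalise_to_total(signal):
    """Scale a signal so that the sum of all bins equals +1."""
    total = sum(signal)
    if total == 0:
        raise ValueError("cannot normalise an all-zero signal")
    return [value / total for value in signal]


tic = [2.0, 4.0, 10.0, 4.0]          # made-up total ion current trace
by_max = normalise_to_max(tic)       # largest bin becomes 1
by_total = normalise_to_total(tic)   # bins now sum to 1
```

The shape of the trace is the same in both cases, but the absolute values differ considerably, which is why the choice of method needs to be made deliberately and applied consistently across a data set.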


MZmine 2 is open source software for the analysis of metabolomic mass spectrometry data. As a program it has the most functionality of any of the GUI software available, allowing the user to perform a large amount of data pre-processing, peak picking by RIC extraction and some basic multivariate analysis. It has several different visualisation modes for both the raw data and the processed data. It also has the benefit of being extensible, having been purposely written to allow users to develop their own modules should they wish. A major feature of MZmine 2 is its alignment module, which can be used to process data before exporting it for analysis in dedicated statistical software.

[Figure 2 panels: "Normalise most intense bin to +1" and "Normalise total signal to +1". Axes: arbitrary intensity units against scan number.]

Figure 2. Two methods of normalisation: normalising the most intense bin to +1 and normalising the total signal to +1. Produced in PyChem 3.0.5f Beta1.

Scaling is separated from normalisation in this context as it is specific to a particular type of normalisation, that is, setting the minimum bin to 0 and the maximum bin to +1. This reduces the scale of the variables and makes them easier to handle, especially if the data has a large dynamic range, for example mass spectrometry data spanning several orders of magnitude.

Data alignment is a complex problem that can have a considerable impact on metabolomics data, particularly chromatography-mass spectrometry data, where peak drift can be observed. Many of the statistical processing techniques used in metabolomics, such as principal component analysis, look for variation between data sets, and effects such as peak drift in chromatography can introduce false variation that affects the final results. To deal with this, the type of drift must first be identified. Simple linear drift of every peak in a chromatogram can usually be corrected quite easily, but more complex drift needs more complex alignment techniques. These are available in several different programmes, both proprietary and open source.
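Scaling and the simplest form of alignment can both be sketched in Python. The alignment part covers only the easy case, assumed here to be a constant offset of every peak by a whole number of scans. It does not address the non-linear drift handled by dedicated alignment tools. The function names `min_max_scale`, `best_shift` and `apply_shift`, and the example traces, are assumptions made for illustration.

```python
def min_max_scale(signal):
    """Set the minimum bin to 0 and the maximum bin to +1."""
    lowest, highest = min(signal), max(signal)
    span = highest - lowest
    if span == 0:
        raise ValueError("cannot scale a constant signal")
    return [(value - lowest) / span for value in signal]


def best_shift(reference, sample, max_shift=5):
    """Return the whole-scan shift that best matches sample to reference."""
    def overlap_score(shift):
        # Pair sample[i] with reference[i + shift] where both points exist.
        return sum(
            sample[i] * reference[i + shift]
            for i in range(len(sample))
            if 0 <= i + shift < len(reference)
        )
    return max(range(-max_shift, max_shift + 1), key=overlap_score)


def apply_shift(sample, shift, fill=0.0):
    """Move every point by `shift` scans, padding the vacated positions."""
    n = len(sample)
    shifted = [fill] * n
    for i, value in enumerate(sample):
        if 0 <= i + shift < n:
            shifted[i + shift] = value
    return shifted


scaled = min_max_scale([2.0, 4.0, 10.0])

reference = [0.0, 0.0, 1.0, 5.0, 1.0, 0.0, 0.0, 0.0]
drifted = [0.0, 0.0, 0.0, 0.0, 1.0, 5.0, 1.0, 0.0]  # peak two scans late
shift = best_shift(reference, drifted)
aligned = apply_shift(drifted, shift)
```

Searching over a small range of shifts and keeping the one with the largest overlap is a basic cross-correlation approach. It works when all peaks move together, and fails as soon as different peaks drift by different amounts.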

References

1. Jarvis, R.M.; Broadhurst, D.; Johnson, H.E.; O'Boyle, N. & Goodacre, R. (2006) PyChem - a multivariate analysis package for Python. Bioinformatics 22(20):2565-2566.
2. Sieve - http://www.vastscientific.com/sieve/63141_SIEVE_BR_121610.pdf
3. Katajamaa, M.; Miettinen, J. & Orešič, M. (2006) MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data. Bioinformatics 22:634-636.
4. Specalign - http://physchem.ox.ac.uk/~jwong/specalign/index.htm
5. Smith, C.A.; Want, E.J.; Tong, G.C.; Saghatelian, A.; Cravatt, B.F.; Abagyan, R. & Siuzdak, G. (2005) Metlin XCMS: global metabolite profiling incorporating LC/MS filtering, peak detection, and novel non-linear retention time alignment with open-source software. 53rd ASMS Conference on Mass Spectrometry, June 2005, San Antonio, Texas.

Figure 4. A screenshot showing several different modules available in MZmine 2. (Reproduced from http://mzmine.sourceforge.net/)

Specalign [4]

Specalign is a GUI application for aligning and visualising mass spectrometry data. Its main use is for direct analysis data sets, although it can be used for any mass spectral and chromatographic data sets. It can align a large number of mass spectra simultaneously, making it a useful tool in metabolomic data pre-processing.

XCMS [5]

XCMS is a package developed in R, a language for statistical computing and graphics. XCMS has been developed to include many options for visualisation, peak picking, non-linear retention time alignment and quantitation. It can also take advantage of the many statistical processing methods available in R to offer the user a comprehensive range of data processing techniques within one package. A knowledge of the R language is required to use this package and, unlike GUI software such as MZmine 2 and PyChem, it takes considerable effort to gain the most from it.

Conclusions

Factors present in the raw data can influence data analysis. These are summarised here, and some of the methods used to account for them are described. There is a wide range of software available to researchers working in the metabolomics field. A selection of software packages is summarised here, and ongoing work is applying a large data set to these packages to identify the advantages and disadvantages of each. This will allow the development of a rigorous workflow for data analysis.