Analytica Chimica Acta xxx (2018) 1–7


Analytica Chimica Acta journal homepage: www.elsevier.com/locate/aca

Application of ensemble deep neural network to metabolomics studies

Taiga Asakura a,1, Yasuhiro Date a,b,1, Jun Kikuchi a,b,c,*

a RIKEN Center for Sustainable Resource Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
b Graduate School of Medical Life Science, Yokohama City University, 1-7-29 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
c Graduate School of Bioagricultural Sciences, Nagoya University, 1 Furo-cho, Chikusa-ku, Nagoya, Aichi 464-8601, Japan

Article info

Article history: Received 30 September 2017; received in revised form 5 February 2018; accepted 10 February 2018; available online xxx

Keywords: Nuclear magnetic resonance; Metabolomics; Ensemble learning; Deep neural network; Machine learning

Abstract

The deep neural network (DNN) is a useful machine learning approach, although its applicability to metabolomics studies has rarely been explored. Here we describe the development of an ensemble DNN (EDNN) algorithm and its applicability to metabolomics studies. As a model case, the developed EDNN approach was applied to metabolomics data of various fish species collected from Japanese coastal and estuarine environments, and its regression performance was compared with that of conventional DNN, random forest, and support vector machine algorithms. This study also revealed that the metabolic profiles of fish muscles were correlated with fish size (growth) in a species-dependent manner. The performance of EDNN regression for fish size based on metabolic profiles was superior to that of DNN, random forest, and support vector machine algorithms. The EDNN approach, therefore, should be helpful for regression and classification analyses in metabolomics studies. © 2018 Published by Elsevier B.V.

Abbreviations: NMR, nuclear magnetic resonance; DNN, deep neural network; EDNN, ensemble DNN; RF, random forest; PCA, principal component analysis; PLS, partial least squares; SVM, support vector machine; HSQC, heteronuclear single quantum coherence; TOCSY, total correlation spectroscopy; STOCSY, statistical TOCSY; SHY, statistical heterospectroscopy; nc, number of classifiers; rrv, ratio of random variables; rwc, ratio of weighted classifiers; RMSE, root-mean-square error.

* Corresponding author. Center for Sustainable Resource Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan.
E-mail addresses: [email protected] (T. Asakura), [email protected] (Y. Date), [email protected] (J. Kikuchi).
1 These authors contributed equally to this work.

https://doi.org/10.1016/j.aca.2018.02.045
0003-2670/© 2018 Published by Elsevier B.V.

Please cite this article in press as: T. Asakura, et al., Application of ensemble deep neural network to metabolomics studies, Analytica Chimica Acta (2018), https://doi.org/10.1016/j.aca.2018.02.045

1. Introduction

Biological and environmental systems have complicated and diverse metabolic reactions, processes, and interactions at the molecular level. For the characterization and evaluation of such complicated systems, metabolomics (or metabonomics) is an important research field urgently in need of development [1]. Nuclear magnetic resonance (NMR)-based metabolomics is a remarkable and versatile approach for the comprehensive understanding of metabolic patterns and variations in systems; in particular, its distinctive advantage lies in the comparability of the generated spectroscopic data among laboratories (institutions) globally [2,3]. In addition, NMR-based metabolomics enables the acquisition of position-specific information based on isotopomer analyses even in living cells [4]. Thus, NMR-based metabolomics is widely applied in metabolic research targeting various biological and environmental samples, such as fish [5–7] and seaweeds [8–10]. Moreover, advances in NMR-based metabolomics have been driven largely by the development of various analytical methods, tools, and databases, such as the Biological Magnetic Resonance Data Bank [11], the Human Metabolome Database [12], Birmingham Metabolite Library [13], SpinAssign [14,15], SpinCouple [16], TOCCATA [17,18], NMRShiftDB [19], MetaboAnalyst [20], BATMAN [21], MVAPACK [22], SENSI [23], KODAMA [24], market basket analysis [25], and the fragment-assembly approach [26].

NMR-based metabolomics is typically performed according to the following procedures: sample collection and preparation, NMR measurements (data acquisition), spectral processing, metabolite annotations, and data mining. In the data mining step of NMR-based metabolomics studies, principal component analysis (PCA) and partial least squares (PLS) are commonly used as unsupervised and supervised approaches, respectively. However, alternatives to PLS are also needed in several cases; thus, machine learning approaches such as support vector machine (SVM) [27] and random forest (RF) [28] have been introduced into metabolomics studies [29]. More recently, feed-forward networks have been applied to metabolomics data and have demonstrated highly accurate classification of breast cancer tissues [30].
In addition, we have recently developed a deep neural network (DNN)-based analytical approach, which outperformed SVM and RF as well as conventional PLS approaches in binary classification, although the DNN-based approach had certain limitations, such as sample size and biased sample balance, in its applicability to metabolomics studies [31]. To extend the range of application of this method, we focused on the applicability of the DNN-based approach to regression problems in metabolomics studies and on the advancement of the DNN method by ensemble learning. Ensemble learning (as represented by bagging [32] and RF) is a family of machine learning methods capable of improving classification or regression performance (generalizing capability) by integrating multiple individually trained classifiers into a single classifier [33]. From this viewpoint, we considered applying the ensemble learning concept to the DNN-based approach to improve its classification and regression performance and to overcome its limitations.

In this study, we have developed an ensemble DNN (EDNN) algorithm and evaluated it by applying it to a metabolomics study. As a model case, we selected the metabolic profiles of various fish species living in Japanese coastal and estuarine environments. The coastal and estuarine environments in Japan provide an important and invaluable habitat as a “cradle” for aquatic organisms, including fish, and nurture a wide variety of fish species and aquatic biodiversity. Fish are one of the principal marine resources and have a deep relationship with their habitat, including the organic and inorganic nutrient content of the environment that they inhabit. Thus, it is presumed that fish ecology and physiology, including metabolism, are associated with environmental factors depending on their habitats. From this viewpoint, this study also focused on data mining for feature extraction in terms of the relationship between fish metabolic profiles and their physiology, using the EDNN algorithm and other conventional methods (that is, PCA, RF, SVM, and DNN).

2. Materials and methods

2.1. Dataset

Metabolomics data derived from fish muscle samples used in this study were measured in a previous study [34], together with additionally measured samples derived from coastal and estuarine zones in Japan. A total of 502 fish muscle samples from 24 species found in 20 families (Table S1) were dissected, freeze-dried, and crushed into powders using an Automill machine (Tokken, Inc., Chiba, Japan). The powdered samples were stored in a freezer prior to metabolite extraction for NMR measurements. The metabolic data were used to extract features relating fish metabolic profiles to their physiology, and a part of the dataset was used for performance evaluation of the EDNN algorithm.

2.2. NMR measurements

For the additional samples measured in this study, metabolites in fish muscles were extracted with deuterated methanol (99.8%, Cambridge Isotope Laboratories Inc., MA, USA) for NMR measurements according to the procedures reported in a previous study [35]. For annotation of metabolites, the pooled samples were measured by 1H–13C heteronuclear single quantum coherence (HSQC) and HSQC–total correlation spectroscopy (TOCSY), using 256 scans, 128 (F1, 13C) and 1024 (F2, 1H) data points, and spectral widths of 150 ppm (F1) and 14 ppm (F2), on an NMR machine (Bruker AVANCE II 700 spectrometer; Bruker BioSpin GmbH, Rheinstetten, Germany). NMR spectra were processed using TopSpin software (Bruker) and annotated using the SpinAssign program [14,15], which is the only database for metabolomics studies using methanol solvent systems, with scientific papers [36,37] and a public database (the Biological Magnetic Resonance Data Bank) [11] as references. In addition, statistical total correlation spectroscopy (STOCSY) [38] and statistical heterospectroscopy (SHY) [39] were used to support the correlation assignments observed in the HSQC and HSQC–TOCSY NMR spectra. STOCSY and SHY results were displayed as pseudo-two-dimensional correlations between the individual samples. Peaks of the same metabolite increased or decreased together according to their correlation, thereby validating the NMR assignment.

2.3. Data analysis

Each peak detected in the 1H NMR spectra was picked using the rNMR software [40] based on the region of interest, which included chemical shift and intensity information in the region. A total of 106 peaks were obtained by peak-picking. The peak-picked data were normalized by a constant sum and scaled with Z-scores. PCA was performed using R software [41]. RF analysis was performed with the “randomForest” [42] and “caret” [43] packages in R. SVM analysis was performed with the “e1071” package in R. The DNN was implemented with the “mxnet” [44] package in R, using an algorithm developed in a previous study [31]. The hyper-parameters of DNN and EDNN were set as follows: number of nodes in the first hidden layer = 512, number of nodes in the second hidden layer = 64, rectified linear unit as the activation function, number of rounds = 100, array batch size = 100, learning rate = 0.0012, and momentum = 0.6. The hyper-parameters of RF were set as ntree = 400 and mtry = 35. The hyper-parameters of SVM were set as γ = 0.01 and cost = 10. The EDNN, DNN, RF, and SVM models were evaluated with 10-fold cross-validation. The EDNN protocol (R code and analytical procedure) is available on our website (http://dmar.riken.jp/Rscripts/).
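The preprocessing and data-splitting steps named in Section 2.3 (constant-sum normalization, Z-score scaling, and a 10-fold split) can be illustrated with a minimal Python sketch. This is not the authors' R code: the function names, the toy peak table, the use of the sample standard deviation for scaling, and the way folds are assigned are all assumptions made for illustration.

```python
import math
import random


def constant_sum_normalize(rows, total=1.0):
    """Scale each sample (row) so its peak intensities sum to `total`."""
    normalized = []
    for row in rows:
        row_sum = sum(row)
        if row_sum == 0:
            raise ValueError("a spectrum with zero total intensity cannot be normalized")
        normalized.append([total * value / row_sum for value in row])
    return normalized


def zscore_columns(rows):
    """Center each variable (column) to mean 0 and scale it to unit sample SD."""
    n = len(rows)
    scaled_columns = []
    for column in zip(*rows):
        mean = sum(column) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
        # A constant column carries no information; map it to zeros (assumption).
        scaled_columns.append([(v - mean) / sd if sd > 0 else 0.0 for v in column])
    return [list(row) for row in zip(*scaled_columns)]


def kfold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


# Toy peak table (made-up numbers): 3 samples x 3 picked peaks.
peaks = [[2.0, 6.0, 2.0], [1.0, 1.0, 2.0], [3.0, 3.0, 6.0]]
normalized = constant_sum_normalize(peaks)
scaled = zscore_columns(normalized)

# Hypothetical sample count, used only to show the fold assignment.
folds = kfold_indices(n_samples=25, k=10)
```

In a cross-validation loop, each fold would serve once as the evaluation set while the remaining nine are used for model construction; whether scaling is fitted on all samples or only on the training folds is a design choice that the text does not specify.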


3. Results and discussion

3.1. EDNN algorithm

This study focused on the development of an EDNN algorithm to improve classification and regression performance. To this end, the EDNN algorithm was developed by incorporating three elements into the DNN algorithm: the generation of multiple DNN classifiers (number of classifiers, nc) based on bootstrap resampling, random sampling of the variables used to construct each classifier (ratio of random variables, rrv), and the integration of the multiple results obtained from each classifier (ratio of weighted classifiers, rwc) (Fig. 1). Bootstrap resampling provides randomly selected data (with replacement) for the construction of a classifier and the rest of the data as test data. The number of classifiers generated (that is, the number of bootstrap resampling events) is controlled by the hyper-parameter “nc” in the EDNN algorithm. In the construction of each DNN classifier, the ratio (number) of variables used for model construction is given by the hyper-parameter “rrv”. The multiple models are constructed as multiple DNN classifiers computed with the mxnet library, followed by performance evaluation of the regression (or classification) models using the test data (Fig. 1B). Finally, the obtained results are integrated based on the weighted (ranked) scores of root mean square errors (RMSEs). In this integration, the weighted scores (1 or 0) are determined by ranking the RMSE values: models ranked in the top X%, where X is controlled by the hyper-parameter “rwc”, receive a score of 1, and the remaining low-performance models are omitted. The final values are calculated by averaging the RMSE values of the selected models based on their weighted scores.

3.2. Metabolic characterization and feature extraction

For the annotation of metabolites, mixtures of all the collected fish muscles were measured by HSQC and HSQC–TOCSY NMR, and the certainty of the annotated metabolites was enhanced by STOCSY and SHY (Fig. S1). Metabolic profiles obtained from the 1H NMR measurements were evaluated using an unsupervised approach (that is, PCA) to extract characteristic features of the dataset. From this analysis, we found that fish muscle profiles were influenced by fish size (Fig. 2A). This result suggested that an increase in fish size was likely to be accompanied by an increase in histidine and essential fatty acids, both generated from dietary metabolites, and a decrease in some amino acids, such as leucine, glycine, alanine, and phenylalanine (Fig. 2B). Interestingly, these propensities seemed to be shared by multiple species, except for cod (Gadus and Theragra), which were characteristically different from other species of the same size. Gadus cod are highly philopatric, tend to eat small organisms (relative to their size), and are known as generalists with a wide feeding niche [45]. The changes in metabolism were apparently caused by feeding habits. Based on the correlation analysis between the size and metabolites (NMR peaks) of each fish species, we found that histidine was relatively rich in fish found at the surface and upper layers of offshore coastal waters, in species such as amberjack (Seriola), jack mackerel (Trachurus), and sardines (Sardinops and Engraulis), compared with fish found at the bottom (Fig. 2C). Histidine in migratory fish (Seriola, Trachurus, and Scomber) increased with size (growth), whereas that in bottom-dwelling fish and rockfish decreased with size (Fig. 2C). We hypothesized that larger fish require rich fatty acids as a sustainable energy source, and histidine has reportedly been associated with red muscle for maintaining prolonged activity [46–48].

3.3. Evaluation of regression performance

We revealed that the metabolic profiles of various fish species were correlated with fish size, and each fish species was correlated with a different metabolic profile.
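Because the evaluation in this section depends on how the EDNN combines its member models, the procedure of Section 3.1 (bootstrap resampling, random variable subsets controlled by rrv, and RMSE-ranked selection controlled by rwc) can be sketched in Python as follows. This is an illustrative sketch, not the published R/mxnet implementation: a small k-nearest-neighbour regressor stands in for each DNN, the synthetic data are made up, and averaging the predictions of the retained models is an assumption about how the integration step produces a final prediction.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


class KnnRegressor:
    """Stand-in base learner (NOT a DNN): mean target of the k nearest rows."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        preds = []
        for row in X:
            dist = [sum((a - b) ** 2 for a, b in zip(row, ref)) for ref in self.X]
            nearest = sorted(range(len(dist)), key=dist.__getitem__)[: self.k]
            preds.append(sum(self.y[i] for i in nearest) / len(nearest))
        return preds


def select(X, cols):
    """Keep only the chosen variable columns of each row."""
    return [[row[c] for c in cols] for row in X]


def fit_ensemble(X, y, nc=30, rrv=0.5, rwc=0.2, make_model=KnnRegressor, seed=0):
    """Train nc bootstrap models on random variable subsets; keep the top rwc by RMSE."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    n_vars = max(1, round(rrv * p))
    members = []
    while len(members) < nc:
        boot = [rng.randrange(n) for _ in range(n)]   # modeling data (with replacement)
        oob = sorted(set(range(n)) - set(boot))       # remaining rows act as test data
        if not oob:
            continue                                  # redraw if nothing was held out
        cols = sorted(rng.sample(range(p), n_vars))   # random subset of variables
        model = make_model().fit(select([X[i] for i in boot], cols), [y[i] for i in boot])
        score = rmse([y[i] for i in oob], model.predict(select([X[i] for i in oob], cols)))
        members.append((score, cols, model))
    members.sort(key=lambda m: m[0])                  # rank models by test RMSE
    return members[: max(1, round(rwc * nc))]         # weight 1 for top models, 0 otherwise


def predict_ensemble(members, X):
    """Average the predictions of the retained models (assumed integration rule)."""
    per_model = [model.predict(select(X, cols)) for _, cols, model in members]
    return [sum(values) / len(values) for values in zip(*per_model)]


# Synthetic data (made up): the target depends on the first of four variables.
data_rng = random.Random(1)
X = [[data_rng.uniform(0, 10) for _ in range(4)] for _ in range(60)]
y = [2.0 * row[0] + data_rng.gauss(0, 0.5) for row in X]

members = fit_ensemble(X, y, nc=30, rrv=0.5, rwc=0.2)
preds = predict_ensemble(members, X)
```

Passing a different `make_model` factory (any object with `fit` and `predict`) would swap in another base learner, which is where a DNN would go in a closer reproduction.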
Thus, we selected the relationship between fish metabolic profiles and fish sizes as a model case for performance evaluation of the developed EDNN method. Before comparing the performance of the EDNN with that of DNN, RF, and SVM, the hyper-parameter settings of the EDNN algorithm (that is, nc, rrv, and rwc) were examined using three fish species (namely, Pleuronectes yokohama, Trachurus japonicus, and Acanthogobius flavimanus) (Figs. 3 and 4). The regression performance was affected largely by the number of constructed models used for the integration, and lower values of rwc improved RMSE values. However, variables unused in model construction appeared when the rwc values were below 15%. The number of randomly selected variables used for model construction hardly affected the regression performance, but the RMSE values of T. japonicus and A. flavimanus decreased slightly when rrv values were low. In addition, the number of bootstrap resampling events hardly affected the regression performance, although performance was slightly better when nc was 300. Based on these analyses, we set the hyper-parameters nc, rrv, and rwc to 300, 20%, and 20%, respectively.

Using these hyper-parameters, we performed regression analyses with the EDNN algorithm for eight fish species (Ej, Lj, Po, Sq, Tj, Py, Sj, Af) and compared the results with those of DNN (the EDNN is built from multiple DNN classifiers), RF (an ensemble learning method commonly used in metabolomics studies [49,50]), and SVM (a machine learning approach commonly used in metabolomics studies) (Fig. 5, Figs. S2, S3, S4, and Table 1). The regression results of the EDNN algorithm indicated that the regression accuracies based on RMSEs differed largely depending on the fish species (Fig. 5). Similar trends were observed in the regression analyses by the DNN, RF, and SVM algorithms (Figs. S2, S3, and S4), indicating that machine learning approaches were capable of accurately predicting fish size from the metabolic profiles of fish muscles for several fish species, whereas the regression models for the few remaining fish species had only moderate accuracy. In the method comparison, regression accuracies (based on RMSE values) of the EDNN approach were superior to those of the DNN approach for all eight fish species. Compared with RF and SVM, the EDNN approach showed better regression performance for half of the fish species (four of eight), the exceptions being A. flavimanus, E. japonicus, L. japonicus, and S. quinqueradiata (Table 1). In addition, the EDNN approach was capable of identifying important variables that contributed to the constructed EDNN model (Fig. S5). These results indicated that the EDNN approach developed here was a helpful tool for analyzing the regression relationship between fish size and metabolic profiles of various fish muscles. The EDNN approach appears to be a powerful tool to evaluate and characterize metabolomics data derived from biological and environmental systems [51].

Fig. 1. Conceptual diagram of the ensemble deep neural network (EDNN). (A) Overall analytical flow of the EDNN algorithm. The peak-picked and scaled matrix data from raw spectra were divided into a training dataset and an evaluation dataset for cross-validation. Using training data, multiple deep neural networks (DNNs) were generated by bootstrap resampling with random sampling of variables. The regression (or classification) performance of models constructed with multiple DNNs was evaluated against the evaluation data. Models with low performance were excluded and results were then integrated. (B) Detailed algorithm underpinning the DNN, enclosed in a blue box in (A). Modeling data and test data were generated by bootstrap resampling and used for construction of DNNs and calculation of root-mean-square error (RMSE). (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

Fig. 2. 1H-NMR metabolic profiles of muscles derived from fish species listed in Table S1. (A) Principal component 1 (PC1) score plot versus fish body size (n = 502, k = 106). Symbols indicate fish samples collected from estuaries (circles), coasts (squares), offing (triangles), and the deep sea (diamonds), and surface (light blue), upper (blue), lower (green), and bottom (red) habitat layers. (B) PC1 loading plot; red bars indicate amino acids, pink bars indicate nucleic acids, and green bars indicate lipid peaks. (C) Relationship between L-histidine (7.0 ppm) intensity and body size in six fish species (Tj, Sj, Sq, Pm, R spp., and Sv). (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

Fig. 3. Evaluation of hyper-parameter settings based on the ratio of weighted classifiers (rwc). The number of classifiers (nc) was set to 150 (left) and 300 (right). The rwc was varied from 100% to 5%, and the root mean square error (RMSE; closed circles) and the number of variables used in the ensemble deep neural networks (EDNNs; open circles) were evaluated. Symbol colors indicate fish species: Tj (blue), Py (green), and Af (red). Each plot displays mean values from performance evaluations (20 times) with standard deviations. The red line indicates the lowest RMSE values when all variables were used in EDNN models. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

Fig. 4. Evaluation of hyper-parameter settings based on the ratio of random variables (rrv). The number of classifiers (nc) was set to 150 (left) and 300 (right). The rrv was varied from 95% to 5%, and the root mean square error (RMSE; closed circles) and the number of variables used in the ensemble deep neural networks (EDNNs; open circles) were evaluated. Symbol colors indicate fish species: Tj (blue), Py (green), and Af (red). Each plot displays mean values from performance evaluations (20 times) with standard deviations. The red line indicates the lowest RMSE values when all variables were used in EDNN models. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

Fig. 5. Results of EDNN regression analyses for eight fish species. The fish size was predicted from the muscle metabolic profiles of each fish species using the EDNN algorithm. The values next to the abbreviated fish names indicate root mean square errors (RMSEs).

Table 1. Summary of regression performances of ensemble deep neural network (EDNN), DNN, random forest (RF), and support vector machine (SVM) for eight fish species. The values indicate root mean square errors (RMSEs).

Fish species   EDNN    DNN     RF      SVM
Ej             2.89    3.6     0.56    0.67
Lj             5.83    5.86    4.67    5.2
Po             9.03    10.19   10.32   9.29
Sq             11.93   12.24   9.09    9.84
Tj             3.37    3.94    4.18    5.39
Py             5.17    5.72    6.61    5.81
Sj             3.38    3.7     3.4     3.55
Af             3.33    4.01    2.24    2.2

4. Conclusions

We have developed an EDNN algorithm that enables the prediction of size in several fish species from metabolic profiles of fish muscle. The EDNN approach was superior to the conventional DNN, RF, and SVM approaches in terms of regression performance based on RMSEs. The EDNN approach is applicable not only to fish metabolomics studies but also to other biological and environmental metabolomics studies, including both regression and classification analyses.

Acknowledgments

The authors wish to thank Kenji Sakata, Tomoko Shimizu, and Yachiyo Ootaka (RIKEN) for technical assistance. This work was supported in part by J.S.P.S., and also supported by Agriculture, Forestry and Fisheries Council, Japan.

Appendix A. Supplementary data

Supplementary data related to this article can be found at https://doi.org/10.1016/j.aca.2018.02.045.

References

[1] J.K. Nicholson, J.C. Lindon, E. Holmes, 'Metabonomics': understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data, Xenobiotica 29 (1999) 1181–1189.
[2] M.R. Viant, D.W. Bearden, J.G. Bundy, I.W. Burton, T.W. Collette, D.R. Ekman, V. Ezernieks, T.K. Karakach, C.Y. Lin, S. Rochfort, J.S. De Ropp, Q. Teng, R.S. Tjeerdema, J.A. Walter, H. Wu, International NMR-based environmental metabolomics intercomparison exercise, Environ. Sci. Technol. 43 (2009) 219–225.
[3] J.L. Ward, J.M. Baker, S.J. Miller, C. Deborde, M. Maucourt, B. Biais, D. Rolin, A. Moing, S. Moco, J. Vervoort, A. Lommen, H. Schafer, E. Humpfer, M.H. Beale, An inter-laboratory comparison demonstrates that [H-1]-NMR metabolite fingerprinting is a robust technique for collaborative plant metabolomic data collection, Metabolomics 6 (2010) 263–273.
[4] S. Lee, H. Wen, Y.J. An, J.W. Cha, Y.J. Ko, S.G. Hyberts, S. Park, Carbon isotopomer analysis with non-uniform sampling HSQC NMR for cell extract and live cell metabolomics studies, Anal. Chem. 89 (2017) 1078–1085.
[5] L.M. Samuelsson, L. Forlin, G. Karlsson, M. Adolfsson-Eric, D.G.J. Larsson, Using NMR metabolomics to identify responses of an environmental estrogen in blood plasma of fish, Aquat. Toxicol. 78 (2006) 341–349.
[6] A.D. Dove, J. Leisen, M. Zhou, J.J. Byrne, K. Lim-Hing, H.D. Webb, L. Gelbaum, M.R. Viant, J. Kubanek, F.M. Fernandez, Biomarkers of whale shark health: a metabolomic approach, PLoS One 7 (2012) e49379.
[7] M. Mekuchi, K. Sakata, T. Yamaguchi, M. Koiso, J. Kikuchi, Trans-omics approaches used to characterise fish nutritional biorhythms in leopard coral grouper (Plectropomus leopardus), Sci. Rep. 7 (2017) 9372.
[8] V. Gupta, R.S. Thakur, C.R.K. Reddy, B. Jha, Central metabolic processes of marine macrophytic algae revealed from NMR based metabolome analysis, RSC Adv. 3 (2013) 7037–7047.
[9] K. Ito, K. Sakata, Y. Date, J. Kikuchi, Integrated analysis of seaweed components during seasonal fluctuation by data mining across heterogeneous chemical measurements with network visualization, Anal. Chem. 86 (2014) 1098–1105.
[10] F. Wei, K. Ito, K. Sakata, Y. Date, J. Kikuchi, Pretreatment and integrated analysis of spectral data reveal seaweed similarities based on chemical diversity, Anal. Chem. 87 (2015) 2819–2826.
[11] E.L. Ulrich, H. Akutsu, J.F. Doreleijers, Y. Harano, Y.E. Ioannidis, J. Lin, M. Livny, S. Mading, D. Maziuk, Z. Miller, E. Nakatani, C.F. Schulte, D.E. Tolmie, R. Kent Wenger, H. Yao, J.L. Markley, BioMagResBank, Nucleic Acids Res. 36 (2008) D402–D408.
[12] D.S. Wishart, D. Tzur, C. Knox, R. Eisner, A.C. Guo, N. Young, D. Cheng, K. Jewell, D. Arndt, S. Sawhney, C. Fung, L. Nikolai, M. Lewis, M.A. Coutouly, I. Forsythe, P. Tang, S. Shrivastava, K. Jeroncic, P. Stothard, G. Amegbey, D. Block, D.D. Hau, J. Wagner, J. Miniaci, M. Clements, M. Gebremedhin, N. Guo, Y. Zhang, G.E. Duggan, G.D. MacInnis, A.M. Weljie, R. Dowlatabadi, F. Bamforth, D. Clive, R. Greiner, L. Li, T. Marrie, B.D. Sykes, H.J. Vogel, L. Querengesser, HMDB: the human metabolome database, Nucleic Acids Res. 35 (2007) D521–D526.
[13] C. Ludwig, J.M. Easton, A. Lodi, S. Tiziani, S.E. Manzoor, A.D. Southam, J.J. Byrne, L.M. Bishop, S. He, T.N. Arvanitis, U.L. Gunther, M.R. Viant, Birmingham Metabolite Library: a publicly accessible database of 1-D H-1 and 2-D H-1 J-resolved NMR spectra of authentic metabolite standards (BML-NMR), Metabolomics 8 (2012) 8–18.
[14] E. Chikayama, Y. Sekiyama, M. Okamoto, Y. Nakanishi, Y. Tsuboi, K. Akiyama, K. Saito, K. Shinozaki, J. Kikuchi, Statistical indices for simultaneous large-scale metabolite detections for a single NMR spectrum, Anal. Chem. 82 (2010) 1653–1658.
[15] E. Chikayama, M. Suto, T. Nishihara, K. Shinozaki, J. Kikuchi, Systematic NMR analysis of stable isotope labeled metabolite mixtures in plant and animal systems: coarse grained views of metabolic pathways, PLoS One 3 (2008) e3805.
[16] J. Kikuchi, Y. Tsuboi, K. Komatsu, M. Gomi, E. Chikayama, Y. Date, SpinCouple: development of a web tool for analyzing metabolite mixtures via two-dimensional J-resolved NMR database, Anal. Chem. 88 (2016) 659–665.
[17] K. Bingol, L. Bruschweiler-Li, D.W. Li, R. Bruschweiler, Customized metabolomics database for the analysis of NMR H-1-H-1 TOCSY and C-13-H-1 HSQC-TOCSY spectra of complex mixtures, Anal. Chem. 86 (2014) 5494–5501.
[18] K. Bingol, F.L. Zhang, L. Bruschweiler-Li, R. Bruschweiler, TOCCATA: a customized carbon total correlation spectroscopy NMR metabolomics database, Anal. Chem. 84 (2012) 9395–9401.
[19] C. Steinbeck, S. Kuhn, NMRShiftDB - compound identification and structure elucidation support through a free community-built web database, Phytochemistry 65 (2004) 2711–2717.
[20] J.G. Xia, N. Psychogios, N. Young, D.S. Wishart, MetaboAnalyst: a web server for metabolomic data analysis and interpretation, Nucleic Acids Res. 37 (2009) W652–W660.
[21] J. Hao, W. Astle, M. De Iorio, T.M.D. Ebbels, BATMAN - an R package for the automated quantification of metabolites from nuclear magnetic resonance spectra using a Bayesian model, Bioinformatics 28 (2012) 2088–2090.
[22] B. Worley, R. Powers, MVAPACK: a complete data handling package for NMR metabolomics, ACS Chem. Biol. 9 (2014) 1138–1144.
[23] T. Misawa, T. Komatsu, Y. Date, J. Kikuchi, SENSI: signal enhancement by spectral integration for the analysis of metabolic mixtures, Chem. Commun. 52 (2016) 2964–2967.
[24] S. Cacciatore, C. Luchinat, L. Tenori, Knowledge discovery by accuracy maximization, Proc. Natl. Acad. Sci. USA 111 (2014) 5117–5122.
[25] Y. Shiokawa, T. Misawa, Y. Date, J. Kikuchi, Application of market basket analysis for the visualization of transaction data based on human lifestyle and spectroscopic measurements, Anal. Chem. 88 (2016) 2714–2719.
[26] K. Ito, Y. Tsutsumi, Y. Date, J. Kikuchi, Fragment assembly approach based on graph/network theory with quantum chemistry verifications for assigning multidimensional NMR signals in metabolite mixtures, ACS Chem. Biol. 11 (2016) 1030–1038.
[27] V.N. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.
[28] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[29] P.S. Gromski, H. Muhamadali, D.I. Ellis, Y. Xu, E. Correa, M.L. Turner, R. Goodacre, A tutorial review: metabolomics and partial least squares-discriminant analysis - a marriage of convenience or a shotgun wedding, Anal. Chim. Acta 879 (2015) 10–23.
[30] F.M. Alakwaa, K. Chaudhary, L.X. Garmire, Deep learning accurately predicts estrogen receptor status in breast cancer metabolomics data, J. Proteome Res. 17 (2018) 337–347.
[31] Y. Date, J. Kikuchi, Application of a deep neural network to metabolomics studies and its performance in determining important variables, Anal. Chem. 90 (2018) 1805–1810.
[32] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140.
[33] T.G. Dietterich, Ensemble methods in machine learning, Lect. Notes Comput. Sc. 1857 (2000) 1–15.
[34] T. Asakura, K. Sakata, Y. Date, J. Kikuchi, Regional feature extraction of various fishes based on chemical and microbial variable selection using machine learning, Anal. Methods (2018) (submitted for publication).
[35] T. Asakura, K. Sakata, S. Yoshida, Y. Date, J. Kikuchi, Noninvasive analysis of metabolic changes following nutrient input into diverse fish species, as investigated by metabolic and microbial profiling approaches, PeerJ 2 (2014) e550.
[36] T. Misawa, F. Wei, J. Kikuchi, Application of two-dimensional nuclear magnetic resonance for signal enhancement by spectral integration using a large data set of metabolic mixtures, Anal. Chem. 88 (2016) 6130–6134.
[37] S. Yoshida, Y. Date, M. Akama, J. Kikuchi, Comparative metabolomic and ionomic approach for abundant fishes in estuarine environments of Japan, Sci. Rep. 4 (2014) 7005.
[38] O. Cloarec, M.E. Dumas, A. Craig, R.H. Barton, J. Trygg, J. Hudson, C. Blancher, D. Gauguier, J.C. Lindon, E. Holmes, J. Nicholson, Statistical total correlation spectroscopy: an exploratory approach for latent biomarker identification from metabolic H-1 NMR data sets, Anal. Chem. 77 (2005) 1282–1289.
[39] D.J. Crockford, E. Holmes, J.C. Lindon, R.S. Plumb, S. Zirah, S.J. Bruce, P. Rainville, C.L. Stumpf, J.K. Nicholson, Statistical heterospectroscopy, an approach to the integrated analysis of NMR and UPLC-MS data sets: application in metabonomic toxicology studies, Anal. Chem. 78 (2006) 363–371.
[40] I.A. Lewis, S.C. Schommer, J.L. Markley, rNMR: open source software for identifying and quantifying metabolites in NMR spectra, Magn. Reson. Chem. 47 (2009) S123–S126.
[41] R Core Team, R: a language and environment for statistical computing, R Foundation for Statistical Computing, 2015. https://www.R-project.org/.
[42] A. Liaw, M. Wiener, Classification and regression by randomForest, R News 2 (2002) 18–22.
[43] M. Kuhn, Building predictive models in R using the caret package, J. Stat. Softw. 28 (2008) 1–26.
[44] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, Z. Zhang, MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems, arXiv preprint (2015) arXiv:1512.01274.
[45] K. Fjosne, J. Gjosaeter, Dietary composition and the potential of food competition between 0-group cod (Gadus morhua L) and some other fish species in the littoral zone, ICES J. Mar. Sci. 53 (1996) 757–770.
[46] F.R. Antoine, C.I. Wei, R.C. Littell, M.R. Marshall, HPLC method for analysis of free amino acids in fish using o-phthaldialdehyde precolumn derivatization, J. Agr. Food. Chem. 47 (1999) 5100–5107.
[47] H.C. Wu, H.M. Chen, C.Y. Shiau, Free amino acids and peptides as related to antioxidant properties in protein hydrolysates of mackerel (Scomber austriasicus), Food Res. Int. 36 (2003) 949–957.
[48] A. Bermejo, M.A. Mondaca, M. Roeckel, M.C. Marti, Bacterial formation of histamine in jack mackerel (Trachurus symmetricus), J. Food Process Pres. 28 (2004) 201–222.
[49] H. Shima, S. Masuda, Y. Date, A. Shino, Y. Tsuboi, M. Kajikawa, et al., Exploring the impact of food on the gut ecosystem based on the combination of machine learning and network visualization, Nutrients 9 (2017) 1307.
[50] Y. Shiokawa, Y. Date, J. Kikuchi, Application of kernel principal component analysis and computational machine learning to exploration of metabolites strongly associated with diet, Sci. Rep. 8 (2018) 3426.
[51] F. Wei, K. Sakata, T. Asakura, J. Kikuchi, Systemic homeostasis in metabolome, ionome and microbiome of wild yellowfin goby in estuarine ecosystem, Sci. Rep. 8 (2018) 3478.
