Bioinformatics and Biomedical Engineering


Lecture Notes in Bioinformatics 10813

Subseries of Lecture Notes in Computer Science

LNBI Series Editors
Sorin Istrail, Brown University, Providence, RI, USA
Pavel Pevzner, University of California, San Diego, CA, USA
Michael Waterman, University of Southern California, Los Angeles, CA, USA

LNBI Editorial Board
Søren Brunak, Technical University of Denmark, Kongens Lyngby, Denmark
Mikhail S. Gelfand, IITP, Research and Training Center on Bioinformatics, Moscow, Russia
Thomas Lengauer, Max Planck Institute for Informatics, Saarbrücken, Germany
Satoru Miyano, University of Tokyo, Tokyo, Japan
Eugene Myers, Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
Marie-France Sagot, Université Lyon 1, Villeurbanne, France
David Sankoff, University of Ottawa, Ottawa, Canada
Ron Shamir, Tel Aviv University, Ramat Aviv, Tel Aviv, Israel
Terry Speed, Walter and Eliza Hall Institute of Medical Research, Melbourne, VIC, Australia
Martin Vingron, Max Planck Institute for Molecular Genetics, Berlin, Germany
W. Eric Wong, University of Texas at Dallas, Richardson, TX, USA

More information about this series at http://www.springer.com/series/5381

Ignacio Rojas • Francisco Ortuño (Eds.)

Bioinformatics and Biomedical Engineering
6th International Work-Conference, IWBBIO 2018
Granada, Spain, April 25–27, 2018
Proceedings, Part I


Editors

Ignacio Rojas
University of Granada
Granada, Spain

Francisco Ortuño
University of Granada
Granada, Spain

ISSN 0302-9743   ISSN 1611-3349 (electronic)
Lecture Notes in Bioinformatics
ISBN 978-3-319-78722-0   ISBN 978-3-319-78723-7 (eBook)
https://doi.org/10.1007/978-3-319-78723-7
Library of Congress Control Number: 2018937390
LNCS Sublibrary: SL8 – Bioinformatics

© Springer International Publishing AG, part of Springer Nature 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by the registered company Springer International Publishing AG, part of Springer Nature. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

We are proud to present the set of final accepted full papers for the sixth edition of the IWBBIO conference, the "International Work-Conference on Bioinformatics and Biomedical Engineering," held in Granada (Spain) during April 25–27, 2018.

IWBBIO 2018 sought to provide a discussion forum for scientists, engineers, educators, and students about the latest ideas and realizations in the foundations, theory, models, and applications of interdisciplinary and multidisciplinary research encompassing computer science, mathematics, statistics, biology, bioinformatics, and biomedicine. The aims of IWBBIO 2018 were to create a friendly environment that could lead to the establishment or strengthening of scientific collaborations and exchanges among attendees; therefore, IWBBIO 2018 solicited high-quality original research papers (including significant work-in-progress) on any aspect of bioinformatics, biomedicine, and biomedical engineering. We especially encouraged contributions dealing with new computational techniques and methods in machine learning; data mining; text analysis; pattern recognition; data integration; genomics and evolution; next-generation sequencing data; protein and RNA structure; protein function and proteomics; medical informatics and translational bioinformatics; computational systems biology; and modelling and simulation and their application in the life science domain, biomedicine, and biomedical engineering.

The list of topics in the successive calls for papers also evolved, resulting in the following list for the present edition:

1. Computational proteomics. Analysis of protein–protein interactions. Protein structure modelling. Analysis of protein functionality. Quantitative proteomics and PTMs. Clinical proteomics. Protein annotation. Data mining in proteomics.

2. Next-generation sequencing and sequence analysis. De novo sequencing, re-sequencing, and assembly. Expression estimation. Alternative splicing discovery. Pathway analysis. ChIP-seq and RNA-seq analysis. Metagenomics. SNP prediction.

3. High performance in bioinformatics. Parallelization for biomedical analysis. Biomedical and biological databases. Data mining and biological text processing. Large-scale biomedical data integration. Biological and medical ontologies. Novel architectures and technologies (GPU, P2P, Grid) for bioinformatics.

4. Biomedicine. Biomedical computing. Personalized medicine. Nanomedicine. Medical education. Collaborative medicine. Biomedical signal analysis. Biomedicine in industry and society. Electrotherapy and radiotherapy.

5. Biomedical engineering. Computer-assisted surgery. Therapeutic engineering. Interactive 3D modelling. Clinical engineering. Telemedicine. Biosensors and data acquisition. Intelligent instrumentation. Patient monitoring. Biomedical robotics. Bio-nanotechnology. Genetic engineering.


6. Computational systems for modelling biological processes. Inference of biological networks. Machine learning in bioinformatics. Classification for biomedical data. Microarray data analysis. Simulation and visualization of biological systems. Molecular evolution and phylogenetic modelling.

7. Health care and diseases. Computational support for clinical decisions. Image visualization and signal analysis. Disease control and diagnosis. Genome–phenome analysis. Biomarker identification. Drug design. Computational immunology.

8. E-health. E-health technology and devices. E-health information processing. Telemedicine/e-health applications and services. Medical image processing. Video techniques for medical images. Integration of classical medicine and e-health.

After a careful peer-review and evaluation process (each submission was reviewed by at least two, and on average 3.1, Program Committee members or additional reviewers), 88 papers were accepted for oral, poster, or virtual presentation, according to the recommendations of the reviewers and the authors' preferences, and for inclusion in the LNBI proceedings.

During IWBBIO 2018 several special sessions were held. Special sessions are a very useful tool for complementing the regular program with new and emerging topics of particular interest to the participating community. Special sessions emphasizing multidisciplinary and transversal aspects, as well as cutting-edge topics, were especially encouraged and welcomed. In this edition of IWBBIO 2018 they were the following:

– SS1. Generation, Management, and Biological Insights from Big Data. As sequencing technologies develop, reducing costs and increasing accuracy, research in the biological sciences is being transformed from hypothesis-driven to data-driven approaches. Big data encompasses data ranging from DNA sequences for thousands of individuals to single-cell data for thousands of cell types from one individual. This has moved the bottleneck of data generation downstream: the challenge is now to use these data to gain new knowledge, ultimately with the aim of improving the quality of human life. The important downstream challenges with big data include the development of strategies for efficient storage that make the data findable, accessible, interoperable, and reusable (FAIR), and thus usable for research. The next step is the development of new methods, including software and Web tools, to make sense of big data. The final important step is to demonstrate that big data can indeed lead to new knowledge. This session covered research topics in all three aspects of big data described here.

Organizer: Dr. Anagha Joshi, Group Leader in the Division of Developmental Biology at the Roslin Institute, University of Edinburgh, UK.
Website: https://www.ed.ac.uk/roslin/about/contact-us/staff/anagha-joshi

– SS2. Challenges in Smart and Wearable Sensor Design for Mobile Health. The analysis of data streams captured with biomedical sensors can be performed as an embedded procedure within the sensor or sensor network, or at a later stage in a receiving system.


Currently, several systems reduce the number of signals monitored via sensors (e.g., when using wearable devices) in order to save energy. In this case, the pre-processing task is embedded into the sensor or close to it. As a result, fewer data are transferred, but pattern matching becomes more complex, since cross-reference data are missing and computing power is limited. This session presented new and emerging approaches.

Organizers: Prof. Natividad Martínez Madrid, Head of the Internet of Things Laboratory and Director of the AAL-Living Lab at Reutlingen University, Germany.
Prof. Juan Antonio Ortega, Director of the Scientific Computing Centre of Andalusia (www.cica.es), Spain, and Head of the research group IDINFOR (TIC223), University of Seville, ETS Ingeniería Informática, Spain.
Prof. Ralf Seepold, Head of the Ubiquitous Computing Lab at HTWG Konstanz, Department of Computer Science, Germany.
Websites: http://iotlab.reutlingen-university.de
http://madeirasic.us.es/idinfor/
http://uc-lab.in.htwg-konstanz.de

– SS3. Challenges and Advances in Measurement and Self-Parametrization of Complex Biological Systems. Our understanding of biological systems requires progress in measurement techniques, methods, and principles of acquisition. The development of IT and of physical resolution offers novel advanced probes, devices, and interpretations, as well as new questions and possibilities. Automation of processing and analysis is increasing thanks to artificial intelligence and machine and deep learning. Proper bioinformatic parametrization for the analysis of complex systems is moving toward automatic or self-setting of the acquired biophysical attributes. This special session provided a discussion on novel techniques and measurement devices, emerging challenges for complex systems, open solutions, and future visions. Broad examples of self-parametric results supported the discussion with practical applications.

Organizer: Dipl.-Ing. Jan Urban, PhD, Head of the Laboratory of Signal and Image Processing, University of South Bohemia in České Budějovice, Faculty of Fisheries and Protection of Waters, South Bohemian Research Center of Aquaculture and Biodiversity of Hydrocenoses, Institute of Complex Systems, Czech Republic.
Website: www.frov.jcu.cz/en/institute-complex-systems/lab-signal-image-processing

– SS4. High-Throughput Bioinformatic Tools for Medical Genomics. Genomics is concerned with the sequencing and analysis of an organism's genome, taking advantage of current, cost-effective, high-throughput sequencing technologies.


Their continuous improvement is creating a new demand for enhanced high-throughput bioinformatics tools. In this context, the generation, integration, and interpretation of genetic and genomic data are driving a new era of health care and patient management. Medical genomics (or genomic medicine) is the emerging discipline that involves the use of genomic information about a patient as part of clinical care, with diagnostic or therapeutic purposes, to improve health outcomes. Moreover, it can be considered a subset of precision medicine that is having an impact in the fields of oncology, pharmacology, rare and undiagnosed diseases, and infectious diseases. The aim of this special session was to bring together researchers in medicine, genomics, and bioinformatics to translate medical genomics research into new diagnostic, therapeutic, and preventive medical approaches. Therefore, we invited authors to submit original research, new tools or pipelines (or updates thereof), and review articles on relevant topics, such as (but not limited to):

• Tools for data pre-processing (quality control and filtering)
• Tools for sequence mapping
• Tools for the comparison of two read libraries without an external reference
• Tools for genomic variants (such as variant calling or variant annotation)
• Tools for functional annotation: identification of domains, orthologues, genetic markers, controlled vocabulary (GO, KEGG, InterPro)
• Tools for biological enrichment in non-model organisms
• Tools for gene expression studies
• Tools for ChIP-seq data
• Tools for "big data" analyses
• Tools for integration in workflows

Organizers: Prof. M. Gonzalo Claros, Department of Molecular Biology and Biochemistry, University of Málaga, Spain.
Dr. Javier Pérez Florido, Bioinformatics Research Area, Fundación Progreso y Salud, Seville, Spain.

– SS5. Drug Delivery System Design Aided by Mathematical Modelling and Experiments. A drug delivery system is designed to release a controlled amount of drug to a specific target area. To devise optimization strategies for targeted drug delivery, the combined action of various processes needs to be well understood. Mathematical modelling offers a valuable tool when evaluating potential drug-carrying materials coupled with rate-controlling coatings. When applied to experimental data, simulations can yield valuable insight and guide further research with the aim of identifying and evaluating key drug release mechanisms. Although diffusion is often a primary drug release process, other effects, such as binding and dissolution as well as effects occurring at material interfaces, are no less important in describing various rate-controlling release mechanisms.


Considered systems include intraocular and soft contact lenses, orthopedic implants, arterial stents, and transdermal patches.

Organizers: PhD candidate Kristinn Gudnason, Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Iceland.
Prof. Fjola Jonsdottir, Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Iceland.
Prof. Emeritus Sven Sigurdsson, Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Iceland.
Prof. Mar Masson, Faculty of Pharmaceutical Science, University of Iceland, Iceland.

– SS6. Molecular Studies on Inorganic Nanomaterials for Therapeutical and Diagnosis Applications. Nanostructured materials science with natural origins is becoming a hot topic in nanomedicine for addressing toxicity and high-cost limitations. The absorption of pharmaceutical drugs in natural inorganic nanostructured solids is very useful for the controlled delivery of bioactive compounds. Molecular modelling and analytical spectroscopic techniques are well-established research fields for the characterization of these materials. This approach is becoming of great interest in the study of these nanocomposites and of the interactions of organics on the surfaces of inorganic solids in health applications. The aim of this session was to gather professionals from a wide scope of scientific disciplines to better understand molecular aspects of the behavior of nanocomposite components and drug design. This interdisciplinary session included contributions from computational chemistry (empirical potentials, quantum, coarse-grained, etc.), NMR, infrared, and Raman spectroscopies, as well as X-ray diffraction/neutron/synchrotron techniques. This special session is multidisciplinary in nature and is not easily placed under a single congress topic, owing to its transversal aims, which connect with several topics of the congress: computational proteomics (protein structure modelling), biomedicine (biomedical computing, nanomedicine), biomedical engineering (bio-nanotechnology), computational systems for modelling biological processes (simulation and visualization of biological systems), and health care and diseases (drug design and computational immunology). The aim was to show the potential application of computational modelling methods in nanomedicine to experimental researchers and, at the same time, to point theoreticians to possible complementary tools for experiments, generating useful discussions between experimentalists and theoreticians to promote future scientific collaborations.

Organizers: Dr. C. Ignacio Sainz-Díaz, Instituto Andaluz de Ciencias de la Tierra, CSIC/UGR, Granada, Spain.


Dr. Carola Aguzzi, Departamento de Tecnología Farmacéutica, Universidad de Granada, Granada, Spain.

– SS7. Little-Big Data. Reducing the Complexity and Facing Uncertainty of Highly Underdetermined Phenotype Prediction Problems. Phenotype prediction problems have a very underdetermined character, since the number of samples is always much smaller than the number of genes, genetic probes, SNPs, etc., that are monitored to explain a given phenotype. This generates decision problems with a huge uncertainty space. It includes a great variety of problems with high impact in translational medicine, such as the analysis of the mechanisms of action of genes in disease progression, the investigation of new therapeutic targets, the analysis of secondary effects, treatment optimization, and the analysis of the effect of mutations on the transcriptome and the proteome. The objective of the session was to present novel computational approaches that reduce the complexity of high-dimensional genetic data while keeping the main information content. Applications in cancer and genomics, as well as in rare and neurodegenerative diseases, were welcome. In particular, the design of new methods for the robust analysis of pathways involved in disease development was one of the main topics addressed in this session.

Organizer: Prof. Juan Luis Fernández-Martínez, Mathematics Department, Applied Mathematics Section, Director of the Group of Inverse Problems, Optimization and Machine Learning, University of Oviedo, Spain.

– SS8. Interpretable Models in Biomedicine and Bioinformatics. In a very short period of time, many areas of science have made a sharp transition toward data-driven methods. This new situation is clear in the life sciences and, as particular cases, in biomedicine, bioinformatics, and health care. This could be seen as a perfect scenario for the use of data analytics, from multivariate statistics to machine learning (ML) and computational intelligence (CI), but it also poses some serious challenges. One of them takes the form of the (lack of) interpretability, comprehensibility, and explainability of the models obtained through data analysis. This can be a bottleneck especially for complex nonlinear models, often affected by what has come to be known as the "black box syndrome." In some areas, such as medicine and health care, not addressing this challenge might seriously limit the chances of adoption, in real practice, of computer-based medical decision support systems (MDSS). Interpretability and explainability have become hot research issues, and there are different reasons for this. One of them is the soaring success of deep learning artificial neural networks in recent years; these models risk not being adopted in areas where human decision is key and that decision must be explained, as they are extreme "black box" cases. Another reason is the implementation of the European Union's General Data Protection Regulation (GDPR). Enforceable from May 2018, it mandates a right to explanation of all decisions made by automated or artificially intelligent algorithmic systems. Needless to say, this directly involves data analytics, and it is likely to have an impact on health care, medical decision-making, and even on bioinformatics through the use of genomics in personalized medicine.


In this session, we called for papers that broach the topics of interpretability, comprehensibility, and explainability of data models (with a non-reductive focus on ML and CI) in biomedicine, bioinformatics, and health care, from different viewpoints, including:

• Enhancement of the interpretability of existing data analysis techniques in problems related to biomedicine, bioinformatics, and health care
• New methods of model interpretation/explanation in problems related to biomedicine, bioinformatics, and health care
• Case studies in biomedicine, bioinformatics, and health care in which interpretability/comprehensibility/explainability is a key aspect of the investigation
• Methods to enhance interpretability in safety-critical areas (such as, for instance, critical care)
• Issues of ethics and social responsibility (including governance, privacy, and anonymization) in biomedicine, bioinformatics, and health care

Organizers: Prof. Alfredo Vellido, Intelligent Data Science and Artificial Intelligence (IDEAI) Research Center, Universitat Politècnica de Catalunya, Barcelona, Spain.
Prof. Sandra Ortega-Martorell, Department of Applied Mathematics, Liverpool John Moores University, Liverpool, UK.
Prof. Alessandra Tosi, Mind Foundry Ltd., Oxford, UK.
Prof. Iván Olier Caparroso, MMU Machine Learning Research Lab, Manchester Metropolitan University, Manchester, UK.

– SS9. Medical Planning: Management System for Liquid Radioactive Waste in Hospital Design. In tertiary hospitals where nuclear medicine services have been introduced, the radioactive materials used in diagnosis and/or treatment need to be handled. Hospital design and medical planning should account for these materials and the policy for their treatment. Nuclear waste is divided into solid and liquid, based on the materials used and on their half-life times, which range from a few minutes to years. In our study, the most common liquid radioactive waste is treated by smart systems that detect the material and, based on its half-life time and activity, distribute it into shielded storage tanks, from which it is released to the sewage treatment plant (STP) of the hospital after being held for the required times. The location and capacity of these tanks, together with their monitoring and control system, should be considered in the design stage, which determines the treatment processes.


Motivation and objectives for the session: the nuclear medicine department should be considered in the design stage and in the hospital's space program. The location and capacity of the storage tanks and their drainage lines should be considered in the hospital design.

Organizer: Dr. Khaled El-Sayed, Assistant Professor of Biomedical Engineering, Department of Electrical and Medical Engineering, Benha University, Egypt.

– SS10. Bioinformatics Tools to Integrate Omics Dataset and Address Biological Question. Methodological advances in 'omics' technologies allow for the high-throughput detection and monitoring of the abundance of several biological molecules. Several 'omics' platforms are being used in molecular biology and in clinical practice for understanding the molecular mechanisms underlying specific diseases, as well as for identifying trustworthy diagnostic/prognostic markers. Omics strategies include: genomics, which aims to characterize and quantify the set of genes within a single cell of an organism; transcriptomics, which analyzes the levels of mRNA transcripts; proteomics, which includes the identification of proteins and the monitoring of their abundance; metabolomics, which measures the abundance of small cellular metabolites; interactomics; and many others. However informative, no single 'omics' analysis can fully unveil the complexities of a specific biological question; therefore, to achieve a more comprehensive picture of biological processes, experimental data obtained at different layers have to be integrated and analyzed. This special session aimed to provide a description of bioinformatics strategies for integrating omics datasets to address biological questions.

Organizer: Dr. Domenica Scumaci, PhD, Laboratory of Proteomics, Department of Experimental and Clinical Medicine, Magna Græcia University of Catanzaro, Italy.

– SS11. Understanding the Mechanisms of Variant Effects on Human Disease Phenotype. Modern sequencing technologies have enabled whole-genome sequencing and the detailed quantification of germline and somatic variations, many of which are related to human disease. However, these data are necessary but not sufficient for understanding the cause and effect of these variations, as well as their mechanisms of involvement in human disease phenotypes. Disease-related variants can have an impact on DNA, RNA, and protein functions and may lead to impaired replication, transcription, signal transduction, and epigenetic regulation. This session explored computational approaches for inferring the effects of human mutations on proteins, biomolecular interactions, and cellular pathways, with the goal of elucidating mechanistic aspects of disease-causative variants.

Organizer: Anna Panchenko, PhD, Head of the Computational Biophysics Group, Computational Biology Branch, National Center for Biotechnology Information, National Institutes of Health, Bethesda, USA.


Website: https://www.ncbi.nlm.nih.gov/CBBResearch/Panchenko/

In this edition of IWBBIO, we were honored to have the following invited speakers:

1. Prof. Joaquin Dopazo, Fundacion Progreso y Salud, Clinical Bioinformatics Research Area, Seville, Spain
2. Prof. Luis Rueda, School of Computer Science, Pattern Recognition and Bioinformatics Lab, Windsor Cancer Research Group, University of Windsor, Canada
3. Dr. Anagha Joshi, Bioinformatics Group Leader, Developmental Biology Division, The Roslin Institute, University of Edinburgh, UK
4. Prof. FangXiang Wu, SMIEEE Professor, Division of Biomedical Engineering, Department of Mechanical Engineering, College of Engineering, University of Saskatchewan, Canada
5. Prof. Jiayin Wang, Xi'an Jiaotong University, China

It is important to note that, for the sake of consistency and readability of the book, the presented papers are classified under 17 chapters. The papers are organized in two volumes, arranged following the topic list included in the call for papers. The first volume (LNBI 10813), Bioinformatics and Biomedical Engineering, Part I, is divided into 11 main parts and includes contributions on:

1. Bioinformatics for health care and diseases
2. Bioinformatics tools to integrate omics datasets and address biological questions
3. Challenges and advances in measurement and self-parametrization of complex biological systems
4. Computational genomics
5. Computational proteomics
6. Computational systems for modelling biological processes
7. Drug delivery system design aided by mathematical modelling and experiments
8. Generation, management, and biological insights from big data
9. High-throughput bioinformatic tools for medical genomics
10. Next-generation sequencing and sequence analysis
11. Interpretable models in biomedicine and bioinformatics

The second volume (LNBI 10814), Bioinformatics and Biomedical Engineering, Part II, is divided into six main parts and includes contributions on:

1. Little-big data. Reducing the complexity and facing uncertainty of highly underdetermined phenotype prediction problems
2. Biomedical engineering
3. Biomedical image analysis
4. Biomedical signal analysis
5. Challenges in smart and wearable sensor design for mobile health
6. Health care and diseases

This sixth edition of IWBBIO was organized by the Universidad de Granada together with the Spanish Chapter of the IEEE Computational Intelligence Society.


We wish to thank our main sponsor as well as the Faculty of Science, the Department of Computer Architecture and Computer Technology, and CITIC-UGR of the University of Granada for their support and grants. We also wish to thank the Editors-in-Chief of several international journals for their interest in editing special issues with the best papers of IWBBIO.

We would also like to express our gratitude to the members of the different committees for their support, collaboration, and good work. We especially thank the local Organizing Committee, the Program Committee, the reviewers, and the special session organizers. We are also grateful for the EasyChair platform. Finally, we want to thank Springer, and especially Alfred Hofmann and Anna Kramer, for their continuous support and cooperation.

April 2018

Ignacio Rojas
Francisco Ortuño

Organization

Steering Committee

Miguel A. Andrade, Hesham H. Ali, Oresti Baños, Alfredo Benso, Giorgio Buttazzo, Gabriel Caffarena, Mario Cannataro, Jose María Carazo, Jose M. Cecilia, M. Gonzalo Claros, Joaquin Dopazo, Werner Dubitzky, Afshin Fassihi, Jean-Fred Fontaine, Humberto Gonzalez, Concettina Guerra, Roderic Guigo, Andy Jenkinson, Craig E. Kapfer, Narsis Aftab Kiani, Natividad Martinez, Marco Masseroli, Federico Moran, Cristian R. Munteanu, Jorge A. Naranjo, Michael Ng, Jose L. Oliver, Juan Antonio Ortega, Julio Ortega, Alejandro Pazos, Javier Perez Florido, Violeta I. Pérez Nueno, Horacio Pérez-Sánchez

University of Mainz, Germany
University of Nebraska, USA
University of Twente, The Netherlands
Politecnico di Torino, Italy
Superior School Sant'Anna, Italy
University San Pablo CEU, Spain
University Magna Graecia of Catanzaro, Italy
Spanish National Center for Biotechnology (CNB), Spain
Universidad Católica San Antonio de Murcia (UCAM), Spain
University of Malaga, Spain
Research Center Principe Felipe (CIPF), Spain
University of Ulster, UK
Universidad Católica San Antonio de Murcia (UCAM), Spain
University of Mainz, Germany
University of Basque Country (UPV/EHU), Spain
College of Computing, Georgia Tech, USA
Center for Genomic Regulation, Pompeu Fabra University, Spain
Karolinska Institute, Sweden
Reutlingen University, Germany
European Bioinformatics Institute (EBI), UK
Reutlingen University, Germany
Polytechnic University of Milan, Italy
Complutense University of Madrid, Spain
University of Coruña, Spain
New York University (NYU), Abu Dhabi
Hong Kong Baptist University, SAR China
University of Granada, Spain
University of Seville, Spain
University of Granada, Spain
University of Coruña, Spain
Genomics and Bioinformatics Platform of Andalusia, Spain
Inria Nancy Grand Est (LORIA), France
Universidad Católica San Antonio de Murcia (UCAM), Spain


Alberto Policriti, Omer F. Rana, M. Francesca Romano, Yvan Saeys, Vicky Schneider, Ralf Seepold, Mohammad Soruri, Yoshiyuki Suzuki, Oswaldo Trelles, Shusaku Tsumoto, Renato Umeton, Jan Urban, Alfredo Vellido, Wolfgang Wurst

Università di Udine, Italy
Cardiff University, UK
Superior School Sant'Anna, Italy
VIB - Ghent University
The Genome Analysis Centre (TGAC), UK
HTWG Konstanz, Germany
University of Birjand, Iran
Tokyo Metropolitan Institute of Medical Science, Japan
University of Malaga, Spain
Shimane University, Japan
CytoSolve Inc., USA
University of South Bohemia, Czech Republic
Polytechnic University of Catalonia, Spain
GSF National Research Center of Environment and Health, Germany

Program Committee and Additional Reviewers

Jesus S. Aguilar Carlos Alberola Hisham Al-Mubaid Rui Carlos Alves Yuan An Georgios Anagnostopoulos Eduardo Andrés León Antonia Aránega Saúl Ares Masanori Arita Ruben Armañanzas Joel P. Arrais Patrizio Arrigo O. Bamidele Awojoyogbe Hazem Bahig Pedro Ballester Graham Balls Ugo Bastolla Sidahmed Benabderrahmane Steffanny A. Bennett Alfredo Benso Mahua Bhattcharya Concha Bielza Armando Blanco Ignacio Blanquer Olivier Bodenreider Paola Bonizzoni

Christina Boucher Hacene Boukari Daniel Brown Fiona Browne Dongbo Bu Jeremy Buhler Keith C. C. Carlos Cano Angel Cantu Rita Casadio Daniel Castillo Osvaldo Castellanos Ting-Fung Chan Nagasuma Chandra Kun-Mao Chao Bolin Chen Brian Chen Chuming Chen Jie Chen Yuehui Chen Jianlin Cheng Shuai Cheng I-Jen Chiang Jung-Hsien Chiang Young-Rae Cho Justin Choi Petr Cisar


Darrell Conklin Clare Coveney Aedin Culhane Miguel Damas Bhaskar DasGupta Ricardo De Matos Guillermo de la Calle Javier De Las Rivas Fei Deng Marie-Dominique Devignes Sergio Diaz-Del-Pino Ramón Diaz-Uriarte Julie Dickerson Ye Duan Beatrice Duval Khaled El-Sayed Mamdoh Elsheshengy Christian Exposito Weixing Feng Jose Jesús Fernandez Gionata Fragomeni Xiaoyong Fu Alexandre G. de Brevern Eduardo Gade Gusmao Juan Manuel Galvez Pugalenthi Ganesan Jean Gao Qingsong Gao Rodolfo Garcia Lina Gaudio Mark Gerstein Razvan Ghinea Daniel Gonzalez Peña Dianjing Guo Jun-tao Guo Maozu Guo Christophe Guyeux Michael Hackenberg Michiaki Hamada Xiyi Hang Jin-Kao Hao Nurit Haspel Morihiro Hayashida Jieyue He Luis Javier Herrera Pietro Hiram

Lynette Hirschman Ralf Hofestadt Vasant Honavar Jun Hu Xiaohua Hu Jun Huan Chun-Hsi Huang Heng Huang Jimmy Huang Jingshan Huang Jianzheng Huang Seiya Imoto Jiri Jablonsky Guomin Ji Yanqing Ji Xingpeng Jiang Chandra Kambhamettu Mingon Kang Dong-Chul Kim Dongsup Kim Hyunsoo Kim Sun Kim Kengo Kinoshita Ekaterina Kldiashvili Jun Kong Tomas Koutny Natalio Krasnogor Abhay Krishan Marija Krstic-Demonacos Stephen Kwok-Wing Istvan Ladunga T. W. Lam Jorge Langa Dominique Lavenier Jose Luis Lavin Doheon Lee Xiujuan Lei André Leier Kwong-Sak Leung Chen Li Dingcheng Li Jing Li Jinyan Li Min Li Xiaoli Li Yanpeng Li


Li Liao Hongfei Lin Hongfang Liu Jinze Liu Xiaowen Liu Xiong Liu Zhenqiu Liu Zhi-Ping Liu Rémi Longuespée Miguel Angel Lopez Gordo Ernesto Lowy Jose Luis Suryani Lukman Feng luo Qin Ma Malika Mahoui Tatiana Marquez-Lago Keith Marsolo Francisco Martinez Alvarez Tatiana Maximova Roderik Melnik Pall Melsted Jordi Mestres Hussain Michelle Marianna Milano Ananda Mondal Antonio Morreale Walter N. Moss Maurice Mulvenna Enrique Muro Radhakrishnan Nagarajan Vijayaraj Nagarajan Kenta Nakai Isabel A. Nepomuceno Mohammad Nezami Anja Nohe Michael Ochs Baldomero Oliva Jose Luis Oliveira Motonori Ota David P. Tun-Wen Pai Paolo Paradisi Hyun-Seok Park Kunsoo Park Taesung Park

David Pelta Alexandre Perera María Del Mar Pérez Gómez Esteban Perez-Wohlfeil Vinhthuy Phan Antonio Pinti Héctor Pomares Mihail Popescu Benjarath Pupacdi Sanguthevar Rajasekaran Shoba Ranganathan Patrick Riley Jairo Rocha Fernando Rojas Jianhua Ruan Gregorio Rubio Antonio Rueda Irena Rusu Renata Rychtarikova Vincent Shin-Mu Tseng Mohammadmehdi Saberioon Kunihiko Sadakane Michael Sadovsky Belen San Roman Maria Jose Saez Hiroto Saigo José Salavert Carla Sancho Mestre Emmanuel Sapin Kengo Sato Jean-Marc Schwartz Russell Schwartz Jose Antonio Seoane Xuequn Shang Piramanayagam Shanmughavel Xiaoman Shawn Xinghua Shi Tetsuo Shibuya Tiratha Raj Singh Dong-Guk Shin Amandeep Sidhu Istvan Simon Richard Sinnott Jiangning Song Zhengchang Su Joakim Sundnes


Wing-Kin Sung Prashanth Suravajhala Martin Swain Sing-Hoi Sze Mehmet Tan Xing Tan Li Teng Dang Thanh Tianhai Tian Pedro Tomas Carlos Toro Carolina Torres Paolo Trunfio Esko Ukkonen Olga Valenzuela Lucia Vaira Paola Velardi Julio Vera Konstantinos Votis Slobodan Vucetic Ying-Wooi Wan Chong Wang Haiying Wang Jason Wang Jialin Wang Jian Wang Jianxin Wang Jiayin Wang Junbai Wang Junwen Wang Lipo Wang Lusheng Wang Yadong Wang Yong Wang

Ka-Chun Wong Ling-Yun Wu Xintao Wu Zhonghang Xia Fang Xiang Lei Xu Zhong Xue Patrick Xuechun Hui Yang Zhihao Yang Jingkai Yu Hong Yue Erliang Zeng Xue-Qiang Zeng Aidong Zhang Chi Zhang Jiao Zhang Jin Zhang Jingfen Zhang Kaizhong Zhang Le Zhang (Adam) Shao-Wu Zhang Xingan Zhang Zhongming Zhao Huiru Zheng Bin Zhou Shuigeng Zhou Xuezhong Zhou Daming Zhu Dongxiao Zhu Shanfeng Zhu Xiaoqin Zou Xiufen Zou Chiara Zucco


Contents – Part I

Bioinformatics for Healthcare and Diseases

Trends in Online Biomonitoring . . . 3
Antonín Bárta, Pavel Souček, Vladyslav Bozhynov, Pavla Urbanová, and Dinara Bekkozhayeova

SARAEasy: A Mobile App for Cerebellar Syndrome Quantification and Characterization . . . 15
Haitham Maarouf, Vanessa López, Maria J. Sobrido, Diego Martínez, and Maria Taboada

Case-Based Reasoning Systems for Medical Applications with Improved Adaptation and Recovery Stages . . . 26
X. Blanco Valencia, D. Bastidas Torres, C. Piñeros Rodriguez, D. H. Peluffo-Ordóñez, M. A. Becerra, and A. E. Castro-Ospina

Bioinformatics Tools to Integrate Omics Dataset and Address Biological Question

Constructing a Quantitative Fusion Layer over the Semantic Level for Scalable Inference . . . 41
Andras Gezsi, Bence Bruncsics, Gabor Guta, and Peter Antal

Challenges and Advances in Measurement and Self-Parametrization of Complex Biological Systems

Effects of External Voltage in the Dynamics of Pancreatic β-Cells: Implications for the Treatment of Diabetes . . . 57
Ramón E. R. González, José Radamés Ferreira da Silva, and Romildo Albuquerque Nogueira

ISaaC: Identifying Structural Relations in Biological Data with Copula-Based Kernel Dependency Measures . . . 71
Hossam Al Meer, Raghvendra Mall, Ehsan Ullah, Nasreddine Megrez, and Halima Bensmail

Inspecting the Role of PI3K/AKT Signaling Pathway in Cancer Development Using an In Silico Modeling and Simulation Approach . . . 83
Pedro Pablo González-Pérez and Maura Cárdenas-García


Cardiac Pulse Modeling Using a Modified van der Pol Oscillator and Genetic Algorithms . . . 96
Fabián M. Lopez-Chamorro, Andrés F. Arciniegas-Mejia, David Esteban Imbajoa-Ruiz, Paul D. Rosero-Montalvo, Pedro García, Andrés Eduardo Castro-Ospina, Antonio Acosta, and Diego Hernán Peluffo-Ordóñez

Visible Aquaphotomics Spectrophotometry for Aquaculture Systems . . . 107
Vladyslav Bozhynov, Pavel Soucek, Antonin Barta, Pavla Urbanova, and Dinara Bekkozhayeva

Resolution, Precision, and Entropy as Binning Problem in Mass Spectrometry . . . 118
Jan Urban

Discrimination Between Normal Driving and Braking Intention from Driver's Brain Signals . . . 129
Efraín Martínez, Luis Guillermo Hernández, and Javier Mauricio Antelis

Unsupervised Parametrization of Nano-Objects in Electron Microscopy . . . 139
Pavla Urbanová, Norbert Cyran, Pavel Souček, Antonín Bárta, Vladyslav Bozhynov, Dinara Bekkhozhayeva, Petr Císař, and Miloš Železný

Computational Genomics

Models of Multiple Interactions from Collinear Patterns . . . 153
Leon Bobrowski and Paweł Zabielski

Identification of the Treatment Survivability Gene Biomarkers of Breast Cancer Patients via a Tree-Based Approach . . . 166
Ashraf Abou Tabl, Abedalrhman Alkhateeb, Luis Rueda, Waguih ElMaraghy, and Alioune Ngom

Workflows and Service Discovery: A Mobile Device Approach . . . 177
Ricardo Holthausen, Sergio Díaz-Del-Pino, Esteban Pérez-Wohlfeil, Pablo Rodríguez-Brazzarola, and Oswaldo Trelles

Chloroplast Genomes Exhibit Eight-Cluster Structuredness and Mirror Symmetry . . . 186
Michael Sadovsky, Maria Senashova, and Andrew Malyshev

Are Radiosensitive and Regular Response Cells Homogeneous in Their Correlations Between Copy Number State and Surviving Fraction After Irradiation? . . . 197
Joanna Tobiasz, Najla Al-Harbi, Sara Bin Judia, Salma Majid, Ghazi Alsbeih, and Joanna Polanska


Computational Proteomics Protein Tertiary Structure Prediction via SVD and PSO Sampling . . . . . . . . . Óscar Álvarez, Juan Luis Fernández-Martínez, Ana Cernea, Zulima Fernández-Muñiz, and Andrzej Kloczkowski Fighting Fire with Fire: Computational Prediction of Microbial Targets for Bacteriocins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Edgar D. Coelho, Joel P. Arrais, and José Luís Oliveira A Graph-Based Approach for Querying Protein-Ligand Structural Patterns . . . Renzo Angles and Mauricio Arenas

211

221 235

Computational Systems for Modelling Biological Processes Predicting Disease Genes from Clinical Single Sample-Based PPI Networks . . . Ping Luo, Li-Ping Tian, Bolin Chen, Qianghua Xiao, and Fang-Xiang Wu

247

Red Blood Cell Model Validation in Dynamic Regime . . . . . . . . . . . . . . . . Kristína Kovalčíková, Alžbeta Bohiniková, Martin Slavík, Isabelle Mazza Guimaraes, and Ivan Cimrák

259

Exploiting Ladder Networks for Gene Expression Classification . . . . . . . . . . Guray Golcuk, Mustafa Anil Tuncel, and Arif Canakoglu

270

Simulation of Blood Flow in Microfluidic Devices for Analysing of Video from Real Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hynek Bachratý, Katarína Bachratá, Michal Chovanec, František Kajánek, Monika Smiešková, and Martin Slavík Alignment-Free Z-Curve Genomic Cepstral Coefficients and Machine Learning for Classification of Viruses . . . . . . . . . . . . . . . . . . . . . . . . . . . . Emmanuel Adetiba, Oludayo O. Olugbara, Tunmike B. Taiwo, Marion O. Adebiyi, Joke A. Badejo, Matthew B. Akanle, and Victor O. Matthews A Combined Approach of Multiscale Texture Analysis and Interest Point/Corner Detectors for Microcalcifications Diagnosis . . . . . . . . . . . . . . . Liliana Losurdo, Annarita Fanizzi, Teresa M. A. Basile, Roberto Bellotti, Ubaldo Bottigli, Rosalba Dentamaro, Vittorio Didonna, Alfonso Fausto, Raffaella Massafra, Alfonso Monaco, Marco Moschetta, Ondina Popescu, Pasquale Tamborra, Sabina Tangaro, and Daniele La Forgia An Empirical Study of Word Sense Disambiguation for Biomedical Information Retrieval System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mohammed Rais and Abdelmonaime Lachkar

279

290

302

314

XXIV

Contents – Part I

Drug Delivery System Design Aided by Mathematical Modelling and Experiments Modelling the Release of Moxifloxacin from Plasma Grafted Intraocular Lenses with Rotational Symmetric Numerical Framework . . . . . . . . . . . . . . Kristinn Gudnason, Sven Sigurdsson, Fjola Jonsdottir, A. J. Guiomar, A. P. Vieira, P. Alves, P. Coimbra, and M. H. Gil

329

Generation, Management and Biological Insights from Big Data Predicting Tumor Locations in Prostate Cancer Tissue Using Gene Expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Osama Hamzeh, Abedalrhman Alkhateeb, and Luis Rueda

343

Concept of a Module for Physical Security of Material Secured by LIMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pavel Blazek, Kamil Kuca, and Ondrej Krejcar

352

scFeatureFilter: Correlation-Based Feature Filtering for Single-Cell RNAseq. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Angeles Arzalluz-Luque, Guillaume Devailly, and Anagha Joshi

364

High-Throughput Bioinformatic Tools for Medical Genomics NearTrans Can Identify Correlated Expression Changes Between Retrotransposons and Surrounding Genes in Human Cancer . . . . . . . . . . . . . Rafael Larrosa, Macarena Arroyo, Rocío Bautista, Carmen María López-Rodríguez, and M. Gonzalo Claros An Interactive Strategy to Visualize Common Subgraphs in Protein-Ligand Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alexandre V. Fassio, Charles A. Santana, Fabio R. Cerqueira, Carlos H. da Silveira, João P. R. Romanelli, Raquel C. de Melo-Minardi, and Sabrina de A. Silveira

373

383

Meta-Alignment: Combining Sequence Aligners for Better Results . . . . . . . . Beat Wolf, Pierre Kuonen, and Thomas Dandekar

395

Exploiting In-memory Systems for Genomic Data Analysis . . . . . . . . . . . . . Zeeshan Ali Shah, Mohamed El-Kalioby, Tariq Faquih, Moustafa Shokrof, Shazia Subhani, Yasser Alnakhli, Hussain Aljafar, Ashiq Anjum, and Mohamed Abouelhoda

405

Improving Metagenomic Assemblies Through Data Partitioning: A GC Content Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fábio Miranda, Cassio Batista, Artur Silva, Jefferson Morais, Nelson Neto, and Rommel Ramos

415

Contents – Part I

XXV

Next Generation Sequencing and Sequence Analysis Quality Assessment of High-Throughput DNA Sequencing Data via Range Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ali Fotouhi, Mina Majidi, and M. Oğuzhan Külekci

429

A BLAS-Based Algorithm for Finding Position Weight Matrix Occurrences in DNA Sequences on CPUs and GPUs. . . . . . . . . . . . . . . . . . . . . . . . . . . Jan Fostier

439

Analyzing the Differences Between Reads and Contigs When Performing a Taxonomic Assignment Comparison in Metagenomics . . . . . . . . . . . . . . . Pablo Rodríguez-Brazzarola, Esteban Pérez-Wohlfeil, Sergio Díaz-del-Pino, Ricardo Holthausen, and Oswaldo Trelles Estimating the Length Distributions of Genomic Micro-satellites from Next Generation Sequencing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . Xuan Feng, Huan Hu, Zhongmeng Zhao, Xuanping Zhang, and Jiayin Wang CIGenotyper: A Machine Learning Approach for Genotyping Complex Indel Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tian Zheng, Yang Li, Yu Geng, Zhongmeng Zhao, Xuanping Zhang, Xiao Xiao, and Jiayin Wang Genomic Solutions to Hospital-Acquired Bacterial Infection Identification . . . Max H. Garzon and Duy T. Pham

450

461

473

486

Interpretable Models in Biomedicine and Bioinformatics Kernel Conditional Embeddings for Associating Omic Data Types . . . . . . . . Ferran Reverter, Esteban Vegas, and Josep M. Oller Metastasis of Cutaneous Melanoma: Risk Factors, Detection and Forecasting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Iker Malaina, Leire Legarreta, Maria Dolores Boyano, Jesus Gardeazabal, Carlos Bringas, Luis Martinez, and Ildefonso Martinez de la Fuente

501

511

Graph Theory Based Classification of Brain Connectivity Network for Autism Spectrum Disorder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ertan Tolan and Zerrin Isik

520

Detect and Predict Melanoma Utilizing TCBR and Classification of Skin Lesions in a Learning Assistant System . . . . . . . . . . . . . . . . . . . . . Sara Nasiri, Matthias Jung, Julien Helsper, and Madjid Fathi

531



On the Use of Betweenness Centrality for Selection of Plausible Trajectories in Qualitative Biological Regulatory Networks . . . . . . . . . . . . . Muhammad Tariq Saeed, Jamil Ahmad, and Amjad Ali

543

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

553

Contents – Part II

Little-Big Data. Reducing the Complexity and Facing Uncertainty of Highly Underdetermined Phenotype Prediction Problems Know-GRRF: Domain-Knowledge Informed Biomarker Discovery with Random Forests. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xin Guan and Li Liu Sampling Defective Pathways in Phenotype Prediction Problems via the Fisher’s Ratio Sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ana Cernea, Juan Luis Fernández-Martínez, Enrique J. deAndrés-Galiana, Francisco Javier Fernández-Ovies, Zulima Fernández-Muñiz, Oscar Alvarez-Machancoses, Leorey Saligan, and Stephen T. Sonis Sampling Defective Pathways in Phenotype Prediction Problems via the Holdout Sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Juan Luis Fernández-Martínez, Ana Cernea, Enrique J. deAndrés-Galiana, Francisco Javier Fernández-Ovies, Zulima Fernández-Muñiz, Oscar Alvarez-Machancoses, Leorey Saligan, and Stephen T. Sonis Comparison of Different Sampling Algorithms for Phenotype Prediction . . . . Ana Cernea, Juan Luis Fernández-Martínez, Enrique J. deAndrés-Galiana, Francisco Javier Fernández-Ovies, Zulima Fernández-Muñiz, Óscar Alvarez-Machancoses, Leorey Saligan, and Stephen T. Sonis

3

15

24

33

Biomedical Engineering Composite Piezoelectric Material for Biomedical Micro Hydraulic System . . . Arvydas Palevicius, Giedrius Janusas, Elingas Cekas, and YatinkumarRajeshbhai Patel

49

Trabecular Bone Score in Overweight and Normal-Weight Young Women . . . . Abdel-Jalil Berro, Marie-Louise Ayoub, Antonio Pinti, Said Ahmaidi, Georges El Khoury, César El Khoury, Eddy Zakhem, Bernard Cortet, and Rawad El Hage

59



Sarcopenia and Hip Structure Analysis Variables in a Group of Lebanese Postmenopausal Women . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Riad Nasr, Eric Watelain, Antonio Pinti, Hayman Saddik, Ghassan Maalouf, Abdel-Jalil Berro, Abir Alwan, César El Khoury, Ibrahim Fayad, and Rawad El Hage Feet Fidgeting Detection Based on Accelerometers Using Decision Tree Learning and Gradient Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Julien Esseiva, Maurizio Caon, Elena Mugellini, Omar Abou Khaled, and Kamiar Aminian Matching Confidence Masks with Experts Annotations for Estimates of Chromosomal Copy Number Alterations . . . . . . . . . . . . . . . . . . . . . . . . Jorge Muñoz-Minjares, Yuriy S. Shmaliy, Tatiana Popova, and R. J. Perez–Chimal Using Orientation Sensors to Control a FES System for Upper-Limb Motor Rehabilitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andrés F. Ruíz-Olaya, Alberto López-Delis, and Adson Ferreira da Rocha A Real-Time Research Platform for Intent Pattern Recognition: Implementation, Validation and Application . . . . . . . . . . . . . . . . . . . . . . . . Andres F. Ruiz-Olaya, Gloria M. Díaz, and Alberto López-Delis Augmented Visualization and Touchless Interaction with Virtual Organs . . . . Lucio Tommaso De Paolis Decreased Composite Indices of Femoral Neck Strength in Young Obese Women . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Abdel-Jalil Berro, Said Ahmaidi, Antonio Pinti, Abir Alwan, Hayman Saddik, Joseph Matta, Fabienne Frenn, Maroun Rizkallah, Ghassan Maalouf, and Rawad El Hage On the Use of Decision Trees Based on Diagnosis and Drug Codes for Analyzing Chronic Patients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cristina Soguero-Ruiz, Ana Alberca Díaz-Plaza, Pablo de Miguel Bohoyo, Javier Ramos-López, Manuel Rubio-Sánchez, Alberto Sánchez, and Inmaculada Mora-Jiménez

69

75

85

95

106 118

128

135

Biomedical Image Analysis Stochastic Geometry for Automatic Assessment of Ki-67 Index in Breast Cancer Preparations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marek Kowal, Marcin Skobel, Józef Korbicz, and Roman Monczak

151

Contents – Part II

Detection Methods of Static Microscopic Objects . . . . . . . . . . . . . . . . . . . . Libor Hargaš, Zuzana Loncová, Dušan Koniar, František Jablončík, and Jozef Volák Parkinson’s Disease Database Analysis of Stereotactic Coordinates Related to Clinical Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Francisco Estella, Esther Suarez, Beatriz Lozano, Elena Santamarta, Antonio Saiz, Fernando Rojas, Ignacio Rojas, and Fernando Seijo Quantitative Ultrasound of Tumor Surrounding Tissue for Enhancement of Breast Cancer Diagnosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ziemowit Klimonda, Katarzyna Dobruch-Sobczak, Hanna Piotrzkowska-Wróblewska, Piotr Karwat, and Jerzy Litniewski A Texture Analysis Approach for Spine Metastasis Classification in T1 and T2 MRI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mohamed Amine Larhmam, Saïd Mahmoudi, Stylianos Drisis, and Mohammed Benjelloun Parametric Variations of Anisotropic Diffusion and Gaussian High-Pass Filter for NIR Image Preprocessing in Vein Identification . . . . . . . . . . . . . . Ayca Kirimtat and Ondrej Krejcar

XXIX

163

176

186

198

212

FLIR vs SEEK in Biomedical Applications of Infrared Thermography . . . . . . Ayca Kirimtat and Ondrej Krejcar

221

Advances in Homotopy Applied to Object Deformation . . . . . . . . . . . . . . . . Jose Alejandro Salazar-Castro, Ana Cristina Umaquinga-Criollo, Lilian Dayana Cruz-Cruz, Luis Omar Alpala-Alpala, Catalina González-Castaño, Miguel A. Becerra-Botero, Diego Hernán Peluffo-Ordóñez, and Cesar Germán Castellanos-Domínguez

231

Thermal Imaging for Localization of Anterior Forearm Subcutaneous Veins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Orcan Alpar and Ondrej Krejcar

243

Detection of Irregular Thermoregulation in Hand Thermography by Fuzzy C-Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Orcan Alpar and Ondrej Krejcar

255

Medical Image Classification with Hand-Designed or Machine-Designed Texture Descriptors: A Performance Evaluation . . . . . . . . . . . . . . . . . . . . . Joke A. Badejo, Emmanuel Adetiba, Adekunle Akinrinmade, and Matthew B. Akanle

266

XXX

Contents – Part II

Classification of Breast Cancer Histopathological Images Using KAZE Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Daniel Sanchez-Morillo, Jesús González, Marcial García-Rojo, and Julio Ortega

276

Biomedical Signal Analysis Low Data Fusion Framework Oriented to Information Quality for BCI Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Miguel Alberto Becerra, Karla C. Alvarez-Uribe, and Diego Hernán Peluffo-Ordoñez New Parameter Available in Phonocardiogram for Blood Pressure Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Omari Tahar, Ouacif Nadia, Benali Redouane, Dib Nabil, and Bereksi-Reguig Fethi Some False ECG Waves Detections Revised by Fractal Dimensions . . . . . . . Ibticeme Sedjelmaci and Fethi Bereksi Reguig

289

301

311

Challenges in Smart and Wearable Sensor Design for Mobile Health Reconstruction of Equivalent Electrical Sources on Heart Surface . . . . . . . . . Galina V. Zhikhareva, Mikhail N. Kramm, Oleg N. Bodin, Ralf Seepold, Anton I. Chernikov, Yana A. Kupriyanova, and Natalija A. Zhuravleva

325

WearIT - A Rapid Prototyping Platform for Wearables . . . . . . . . . . . . . . . . Isabel Leber and Natividad Martínez Madrid

335

A Review of Health Monitoring Systems Using Sensors on Bed or Cushion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Massimo Conti, Simone Orcioni, Natividad Martínez Madrid, Maksym Gaiduk, and Ralf Seepold Textile Sensor Platform (TSP) - Development of a Textile Real-Time Electrocardiogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Thomas Walzer, Christian Thies, Klaus Meier, and Natividad Martínez Madrid Sensor-Mesh-Based System with Application on Sleep Study . . . . . . . . . . . . Maksym Gaiduk, Bruno Vunderl, Ralf Seepold, Juan Antonio Ortega, and Thomas Penzel Wearable Pneumatic Sensor for Non-invasive Continuous Arterial Blood Pressure Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Viacheslav Antsiperov and Gennady Mansurov

347

359

371

383

Contents – Part II

XXXI

Healthcare and Diseases Gene-Gene Interaction Analysis: Correlation, Relative Entropy and Rough Set Theory Based Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sujay Saha, Sukriti Roy, Anupam Ghosh, and Kashi Nath Dey

397

A Transferable Belief Model Decision Support Tool over Complementary Clinical Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Abderraouf Hadj Henni, David Pasquier, and Nacim Betrouni

409

An Online Viewer of FHR Signal for Research, E-Learning and Tele-Medicine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Samuel Boudet, Agathe Houzé de l’Aulnoit, Antonio Pinti, Romain Demailly, Michael Genin, Regis Beuscart, Jessica Schiro, Laurent Peyrodie, and Denis Houzé de l’Aulnoit Modeling Spread of Infectious Diseases at the Arrival Stage of Hajj . . . . . . . Sultanah M. Alshammari and Armin R. Mikler

421

430

Exploring In-Game Reward Mechanisms in Diaquarium – A Serious Game for Children with Type 1 Diabetes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ida Charlotte Rønningen, Eirik Årsand, and Gunnar Hartvigsen

443

An “Awareness” Environment for Clinical Decision Support in e-Health . . . . Obinna Anya and Hissam Tawfik

456

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

469

Bioinformatics for Healthcare and Diseases

Trends in Online Biomonitoring

Antonín Bárta, Pavel Souček, Vladyslav Bozhynov, Pavla Urbanová, and Dinara Bekkozhayeova

Laboratory of Signal and Image Processing, Faculty of Fisheries and Protection of Waters, South Bohemian Research Center of Aquaculture and Biodiversity of Hydrocenoses, Institute of Complex Systems, University of South Bohemia in České Budějovice, Zámek 136, 37333 Nové Hrady, Czech Republic [email protected]

Abstract. We are living in a digital world, and the Internet of Things is the latest, still ongoing revolution [1]. In the last ten years, the number of devices connected to the internet has increased tenfold. This revolution is happening mainly in industry, chiefly for reasons of efficiency, time and cost. Just a step behind this industrial revolution, another branch of this new market is rising: biomonitoring [2]. Even people who are not involved in research or industry are now facing the changes connected with the possibilities of online bioindicators. More and more devices for monitoring cardiovascular activity during sport, work and sickness are appearing; these personal devices usually work as a first indicator of a situation dangerous to health. Even personal electroencephalograph devices have been a hit in recent years; with such a device, the user is able to control other devices or processes using thought alone [3]. These new methods and products are not usually used in the fishery research area. We are proud to introduce a novel approach to non-invasive online monitoring systems for aquatic organisms. This innovative research is a combination of cybernetics, biophysics and zoology. The use of methods developed primarily for aquatic organisms is spreading widely into the early-warning-systems area; from this point of view, we can look at the animals as living sensors. This article is a review of three non-invasive online biomonitoring methods and one online monitoring system for crucial water parameters.

Keywords: Online · Monitoring · Aquaponics · Aquaculture · Hydroponics · Complex systems · Measurement · Control

1 Introduction

At the Institute of Complex Systems, non-invasive devices and methods were developed for the online monitoring of fish and crayfish behaviour during pollution and contamination in RAS, fish tanks and aquaponic systems [4]. The main reason behind the research is to be able to detect changes in behaviour under water conditions that are not ideal for living organisms and people. Fish and crayfish, the living bioindicators, are used as an early warning system during water quality inspection.


Water bioindicators are deeply connected with online water monitoring. At a minimum, crucial water parameters such as pH, electric conductivity, temperature and dissolved oxygen have to be monitored in parallel. In the ongoing Internet of Things revolution, online water monitoring is becoming available to ordinary people for ponds, pools and aquariums. All the data and metadata from biomonitoring systems can be misinterpreted without proper software for data and metadata management; for this reason, the last chapter of this article is dedicated to the topic of data and metadata management.

Aquaponics systems are real living examples of complex systems [5]. Both essential parts of the system, fish and plants, were developed independently for hundreds of years. Old, forgotten knowledge of the advantageous combination of fish and plants has been rediscovered and has, in recent decades, redefined the way water is used: water consumption is up to 90% lower than in traditional soil-based growing, and plants can grow three times faster. On the other hand, the system must be checked and controlled more than traditional growing systems, and electricity consumption is also higher. Aquaponics is an ancient method of growing plants together with raising fish. Fish are used as producers of crucial elements in the form of excrement. Beneficial bacteria of the genera Nitrosomonas and Nitrobacter transform ammonia, the toxic direct product from fish, into useful nitrates, which are one of the fundamental building blocks of plant growth (a small kinetic sketch of this two-step chain follows Fig. 1):

1. Nitritation: NH3 + O2 → NO2⁻ + 3H⁺ + 2e⁻ + 275 kJ energy
2. Nitratation: NO2⁻ + H2O → NO3⁻ + 2H⁺ + 2e⁻ + 76 kJ energy

The main way of growing plants in aquaponics systems comes from hydroponic growing techniques. The most common are deep water culture, the nutrient film technique, media-based growing (inert medium - hydroton) and aeroponics (a special technique where the roots are sprayed by nozzles). Tomato (Solanum lycopersicum), lettuce (Lactuca sativa) and herbs such as mint (Lamium) are the plant species most commonly grown in aquaponics systems. Root vegetables are not a good choice: the wet conditions support mold, and the final product cannot compete with plants cultivated in soil. The fish part is mostly covered by RAS (recirculating aquaculture systems) operated with freshwater species. Nile tilapia (Oreochromis niloticus), carp (Cyprinus carpio), trout (Salmo trutta) and sheatfish (Silurus glanis) are popular choices for most aquaponics growers (Fig. 1).

Fig. 1. GrowTube aeroponic system installed in the Institute of Complex Systems
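As a side note, the two reactions above are often approximated as a chain of first-order conversions. The minimal sketch below, with made-up rate constants that are not measurements from this work, illustrates why nitrite can accumulate transiently, which is one reason both ammonia and nitrite levels matter for monitoring:

```python
# Minimal two-step nitrification model (illustrative, assumed rate constants):
# ammonia -> nitrite (Nitrosomonas), nitrite -> nitrate (Nitrobacter),
# each step approximated as first-order kinetics and integrated with Euler steps.
K1, K2 = 0.8, 0.5          # assumed rate constants [1/day]
DT, DAYS = 0.01, 20.0      # Euler step size and simulated horizon [days]

nh3, no2, no3 = 1.0, 0.0, 0.0   # initial concentrations [mg/L as N]
for _ in range(int(DAYS / DT)):
    r1 = K1 * nh3 * DT      # ammonia oxidized in this step
    r2 = K2 * no2 * DT      # nitrite oxidized in this step
    nh3 -= r1
    no2 += r1 - r2          # nitrite builds up before Nitrobacter catches up
    no3 += r2

print(f"after {DAYS:.0f} days: NH3={nh3:.3f}, NO2-={no2:.3f}, NO3-={no3:.3f}")
```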

2 Invasive Methods of Aquatic Organism Biomonitoring

Fish biomonitoring in the natural environment is strictly connected with invasive methods. One of these methods uses chips inserted directly into the fish body with a special syringe. This method makes it possible to track fish movements in the natural environment: fish populations, behaviour, species distribution, and reactions to pollution or contaminants. On the other hand, it is very complicated and financially demanding to fit these chips to high numbers of fish. The biggest advantage of this technique is information about each individual fish, which can sometimes be a problem for non-invasive methods based on image processing, because they mainly work with a static camera position, and from that point of view overlapping can be a problem in final fish recognition [6] (Fig. 2).

Fig. 2. Carp (Cyprinus carpio) marking. https://www.mrk.cz/diskuse.php?id=632188

3 Non-invasive Methods of Aquatic Organism Biomonitoring

Non-invasive biomonitoring methods based on image processing were applied in aquaponics systems to provide farmers with information about fish and crayfish welfare and water quality. As mentioned above, living organisms such as fish and crayfish are in direct connection with possible pollution and biological processes in the water. In RAS (recirculating aquaculture systems) and aquaponics systems, the requirements on water quality are much higher, because biochemical processes affect all organisms in the system. At the Institute of Complex Systems, systems for non-invasive fish and crayfish monitoring were developed, and the next pages describe these techniques [7].

(a) Infrared reflection system for indoor 3D tracking of fish
Many fish rearing infrastructures are already equipped with human-operated camera systems for fish behavior monitoring, e.g. for stopping the feeding system when the fish are satiated, or for monitoring behavioral abnormalities caused by poor water quality or diseases. The novel infrared reflection (IREF) system for indoor 3D tracking of fish demonstrated in the current study allows automation of fish behavior monitoring, reducing the running costs by eliminating the need for continuous human monitoring and increasing the behavioral analysis accuracy by excluding the human subjectivity factor. The operating principle of this system is based on the strong absorption of near-infrared (NIR) light by water, which allows estimation of fish depth from the brightness of the corresponding fish object in the camera image. The use of a NIR illuminator as part of the IREF system allows fish behavior monitoring in the dark, so as not to affect the fish circadian rhythm. A system evaluation under aquaculture facility conditions with Atlantic salmon (Salmo salar), using flow-through water in tanks, showed a mean depth estimation error of 5.3 ± 4.0 (SD) cm. The physiological variations among conspecific individual fish introduced a mean depth estimation error of 1.6 ± 1.3 (SD) cm. The advantages of the IREF system over well-known stereo vision systems are lower hardware cost and a less computationally intensive 3D coordinate estimation algorithm; the disadvantage is lower accuracy, which is nevertheless acceptable for most applications of aquaculture fish monitoring. Monitoring of fish behavior allows answering many research questions in the fields of fish nutrition, welfare, health and pathology, environmental interactions and aquaculture systems design. The IREF system is capable of real-time 3D tracking of fish in water by processing the camera images. The processing includes obtaining the individual fish contours in the image, calculating the fish 3D coordinates based on brightness, and short-time matching of the same fish on subsequent images. At the same time, the software collects statistics on short-time parameters of fish trajectories for the analysis of fish behavior. Detection of individual fish over a long duration (an entire experiment) was classified as unreliable because of the high number of fish occlusions. Therefore, the system calculates only statistical features of fish shoal behavior from short-time tracks of individual fish. The short-time tracks can be determined from the input data. The information about fish speed, position and orientation is then averaged over individual tracks and over a defined time interval to provide information about fish shoal behavior (Fig. 3).


Fig. 3. Scheme of the IREF system data processing sequence.
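As a rough illustration of the depth-from-brightness principle described above (not the authors' algorithm, whose calibration procedure is not given here), one can assume Beer-Lambert attenuation of the NIR illumination along the two-way light path and invert it. The attenuation coefficient and reference brightness below are made-up placeholders that a real system would obtain by calibrating against targets at known depths:

```python
import numpy as np

# Illustrative Beer-Lambert model: brightness(d) = B_REF * exp(-K_ATTEN * 2 * d)
# (factor 2: the light travels to the fish and back to the camera).
K_ATTEN = 0.45      # assumed NIR attenuation coefficient of water [1/m]
B_REF = 200.0       # assumed object brightness at zero depth [grey levels]

def depth_from_brightness(brightness: np.ndarray) -> np.ndarray:
    """Invert the exponential attenuation model to estimate depth [m]."""
    b = np.clip(brightness, 1e-3, B_REF)       # keep the logarithm well-defined
    return np.log(B_REF / b) / (2.0 * K_ATTEN)

if __name__ == "__main__":
    # Mean brightness of three segmented fish blobs (hypothetical values)
    blob_brightness = np.array([180.0, 95.0, 40.0])
    print(depth_from_brightness(blob_brightness))  # shallow -> deep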

The IREF system is a novel approach to 3D fish behavior monitoring. It allows detection of individual fish even in case of overlap with other fish, and performs short-time individual fish tracking at high fish densities. The short-time tracking provides the operator with information for long-term statistics on fish shoal speed, orientation, depth and distance from the tank center (Fig. 4). Optionally, the statistics calculation software module can be extended with other fish trajectory parameters required during ethological experiments or normal daily operation. The advantages of the system in comparison to well-known stereo vision systems are lower hardware cost and a less computationally intensive real-time 3D coordinate estimation algorithm. The current prototype version of the IREF system is capable of running in real time without GPU acceleration, at a rate of 10 frames per second on a PC with an Intel i5 processor and 4 GB of RAM. The cost of the system's hardware is approximately 900 EUR, not including the PC [8].

Fig. 4. Fish behavior statistics over 10 min obtained using the IREF system.

(b) Noninvasive crayfish cardiac activity monitoring system
Crayfish cardiac activity monitoring and analysis are widely used in water pollution and ethological studies. A noninvasive crayfish cardiac activity monitoring (NICCAM) system permits long-term, continuous monitoring of several crayfish simultaneously. The advantages of the system are its low price, the low number of required components and the possibility of monitoring the cardiac signal shape. Calculation and analysis of parameters characterizing the shape of the double-peak cardiac activity not only reduces the number of incorrect peak detections, improving the system accuracy, but can also provide additional information on the crayfish state. The discussed preliminary experiments on the effect of food odor and chloramine-T on crayfish showed the promising potential of signal shape analysis for studying crayfish cardiac reactions to changes in the aquatic environment. High sensitivity to changes in the aquatic environment, especially to pollution, as well as the ease with which it can be used for experimentation, make the crayfish one of the most promising animals for biomonitoring of water quality. Also, the relative simplicity of its nervous system compared with vertebrates makes the crayfish a useful model for studying the mechanisms underlying behavior. Cardiac activity provides a general indication of crayfish metabolic status. It can signal the integrated impact of natural and anthropogenic stressors, and may reflect the availability of energy required for normal crayfish life, growth, and reproduction. Previous studies have shown the response of the crayfish cardiac system to selected chemical agents in water, e.g., potassium nitrate and ammonium phosphate, sodium chloride, ammonia, chlorides, and nitrite. Several approaches have been developed for monitoring crayfish cardiac activity. Invasive methods based on electrocardiogram recording require drilling the crayfish dorsal carapace directly over the heart and implanting two metal electrodes. The drawbacks of this approach include the possibility of injuring the crayfish and altering its behavior by the implanted electrodes, and the complexity of connecting the sensor device to the crayfish, which requires precise surgery. The noninvasive methods are based on measuring the amount of light scattered from the crayfish heart, modulated by changes of the heart volume while contracting. The CAPMON system proposed by Depledge and Andersen [18] utilizes a transducer consisting of an infrared (IR) light-emitting diode (LED) and a phototransistor fixed on the animal's dorsal carapace and connected to an external amplification and filtering circuit via thin flexible electrical wires. An improved and renovated CAPMON system for intertidal animals was recently developed by Burnett et al. [19]. The NICCAM system comprises a set of 16 IR optical sensors for crayfish, a multichannel 14-bit analog-to-digital converter (ADC) with USB interface, and a personal computer (PC) with software for data processing. It allows obtaining the raw cardiac activity of up to 16 crayfish simultaneously at a sampling rate of 500 samples/s, processing the data to obtain HR and 14 other inotropic/chronotropic cardiac signal parameters, and saving the calculated parameters to the local hard drive.

Trends in Online Biomonitoring

9

The software's graphical user interface is capable of displaying the raw cardiac activity signals of all monitored crayfish and their main parameters in real time. It allows the user to choose which cardiac activity parameters from the set will be calculated and recorded. Optionally, raw cardiac activity can be recorded for further manual or semiautomatic analysis (Fig. 5).

Fig. 5. NICCAM system. (A) Overview of the system. (B) Circuit diagram of IR optical sensor.
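Given the raw 500 samples/s optical signal the system records, heart rate extraction can be sketched as band-pass filtering plus peak detection with a refractory period. This is an illustrative outline only: the filter band is an assumption, and the actual NICCAM processing also derives 14 further inotropic/chronotropic parameters not shown here.

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

FS = 500  # sampling rate of the raw cardiac signal [samples/s], as in the text

def heart_rate_bpm(raw: np.ndarray) -> float:
    """Estimate heart rate from one channel of raw optical cardiac activity."""
    # Band-pass around plausible crayfish heart rates (~0.5-5 Hz);
    # these band edges are illustrative assumptions, not the NICCAM values.
    b, a = butter(3, [0.5, 5.0], btype="band", fs=FS)
    filtered = filtfilt(b, a, raw)
    # Enforce a refractory period so the double-peak beat shape mentioned in
    # the text is not counted as two beats (min 0.2 s between detected beats).
    peaks, _ = find_peaks(filtered, distance=int(0.2 * FS),
                          prominence=filtered.std())
    if len(peaks) < 2:
        return float("nan")
    beat_intervals = np.diff(peaks) / FS        # seconds between beats
    return 60.0 / beat_intervals.mean()         # beats per minute
```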

The total price of the components and materials required to manufacture one sensor is 1-2 EUR, depending on the order quantity. The IR optical sensors, which can be attached to a crayfish within a few minutes, do not affect the crayfish behavior and do not restrict its movements. Up to 16 crayfish can be monitored simultaneously, 24 h/day. The system also allows recording and storing raw cardiac activity data for further manual or semiautomatic analysis [9].

(c) FISCEAPP: Fish Skin Color Evaluation APPlication
Skin colouration in fish is of great physiological, behavioural and ecological importance; it can be considered an index of animal welfare in aquaculture, as well as an important quality factor in retail value. Currently, in order to compare colour in animals fed different diets, biochemical analysis and colourimetry of caught, mildly anaesthetized or dead fish are very accurate and meaningful measurements. A noninvasive method using digital images of the fish body was developed as a standalone application. This application deals with the computational burden and memory consumption of large input files, optimizing piecewise processing and analysis with respect to the memory/computation-time ratio. For the comparison of colour distributions across experiments and different colour spaces (RGB, CIE L*a*b*), comparable semi-equidistant binning of the multi-channel representation is introduced, derived from knowledge of the quantization levels and the Freedman-Diaconis rule. Colour calibration and the camera responsivity function were necessary parts of the measurement process. There are three typical methods of colourimetry: point measurement by colourimeters, spectral measurement, and image analysis. Colourimeters provide a single-point, or very small area, value of the colour, usually in the CIE L*a*b* colour space. Spectral measurements are able to provide the whole spectrum, which can be used to compute colour values in various colour spaces; however, it is still just a point measurement. Image analysis allows evaluating many points together, estimates the spectra, and can use any colour space transformation. Colour calibration for image analysis was still an open question. Standardized light conditions and white-balance correction on the camera are expected. The fish body in each image is expected to be the dominant object, and the background has to be semi-uniform. Expertomica Fishgui is a standalone Matlab application. The application computes the average (avg) and standard deviation (std) values of the pixels of the fish skin. The values are evaluated in the RGB, HSV, and CIE L*a*b* colour spaces, plus the value of the dominant wavelength (lambda). As the output of the processing, two graphs are plotted: the position of the average pixel in the chromaticity diagram, and the non-normalised colour distributions across the images in the RGB colour space [10] (Figs. 6 and 7).

Fig. 6. Example of Fishgui application evaluation of Silurus glanis.

Fig. 7. Example of Fish skin color evaluation application evaluation of Amphiprion ocellaris and basic set statistics.
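The per-image statistics described above can be outlined in a few lines. This sketch is not the Expertomica Fishgui code (which is Matlab); it assumes the fish-skin mask has already been obtained by segmentation, and uses scikit-image for the RGB to CIE L*a*b* conversion:

```python
import numpy as np
from skimage.color import rgb2lab

def skin_color_stats(image_rgb: np.ndarray, fish_mask: np.ndarray) -> dict:
    """Average and standard deviation of fish-skin pixels in RGB and CIE L*a*b*.

    image_rgb: H x W x 3 float array in [0, 1]; fish_mask: H x W boolean mask
    marking skin pixels (the segmentation itself is assumed done elsewhere).
    """
    rgb_pixels = image_rgb[fish_mask]              # N x 3 skin pixels
    lab_pixels = rgb2lab(image_rgb)[fish_mask]     # the same pixels in L*a*b*
    return {
        "rgb_avg": rgb_pixels.mean(axis=0), "rgb_std": rgb_pixels.std(axis=0),
        "lab_avg": lab_pixels.mean(axis=0), "lab_std": lab_pixels.std(axis=0),
    }
```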

4 Online Water Monitoring

Water quality refers to the chemical, physical, biological and radiological characteristics of water. It is a measure of the condition of water relative to the requirements of one or more biotic species and/or to any human need or purpose. It is most frequently used by reference to a set of standards against which compliance, generally achieved through treatment of the water, can be assessed. The most common standards used to assess water quality relate to the health of ecosystems, the safety of human contact, and drinking water.

Monitoring of crucial parameters in aquaponics systems is a necessity that must never be neglected. It does not matter which type of system (commercial, hobby or garden) you are using, because the "heart" of the system is, with small subtleties, always the same. Aquaponics is a connection of aquaculture and hydroponics, and both approaches to cultivation are very sensitive to their own parameters. Every aquaponics farmer must think about balancing the right water conditions to offer a friendly and convenient environment for both fish and plants. There are a couple of fundamental monitoring solutions suitable for each sector: hobby and garden users mainly use manual measuring devices based on chemical reactants, while the growing aquaponics industry is searching for automatic monitoring solutions that can transmit online information from measurements of crucial parameters (especially pH, temperature, electric conductivity, dissolved oxygen, ammonia, nitrate and iron). From the gardener's point of view, the first big limitation to setting up a commercial aquaponics farm is the lack of an online monitoring system, which is normally very expensive.

We are living in the new age of the IoT (Internet of Things). The possibility of connecting physical things to the internet is growing exponentially every day. The main advantage of this solution for the hydroponics, aquaponics and fishery industries is the reduction of human work in the common tasks connected with water quality monitoring. From the perspective of a big industrial company, where hundreds of tanks full of fish depend on optimal water parameters (pH, temperature, dissolved oxygen, ammonia level), it is critical to be able to react very quickly to changes and to have a general online overview of the water quality in fish or hydroponic tanks [11]. A human will always be behind the online water monitoring solution; the fundamental idea of such a system is to protect fish and plant welfare and, at the same time, save human work. Mini-computers that could change the game in the water monitoring industry are becoming very popular nowadays; Raspberry Pi and Arduino are the main players in this game [12].

Arduino is a widely used open-source single-board microcontroller development platform with flexible, easy-to-use hardware and software components. The Arduino Uno R3 is based on the Atmel ATmega328 microcontroller and has a clock speed of 16 MHz. It has 6 analogue inputs and 14 digital I/O pins, so it is possible to connect a number of sensors to a single Arduino board. An Arduino-compatible custom sensor expansion board, known as a shield, can be developed to plug directly into the standardized pin headers of the Arduino Uno board. The Raspberry Pi 3 Model B measures only about 3.5 by 2.5 in. (small enough to fit in a shirt pocket), and it is possible to reuse cases and other accessories designed for the earlier models [13].

AquaSheriff Online Water Monitoring System
The Institute of Complex Systems developed an automatic online monitoring solution for measuring crucial water quality parameters for fish and aquaponics farming, such as temperature, pH, electric conductivity and dissolved oxygen. Air conditions are also measured in the higher-level version of the product, which can monitor light intensity, air temperature and relative humidity online [14]. The fundamental properties of the system are the following (a minimal monitoring-loop sketch is given after Fig. 8):

1. "Plug and Plant" solution – no programming or assembling of components.
2. Quick overview of measured parameters using gauges.
3. Chart visualization for further analysis.
4. Data export to Microsoft Excel.
5. Alarm settings with minimal and maximal values for each parameter.
6. The system can run on battery alone (power bank) for portable usage.
7. The system is able to operate with various probes and sensors (Fig. 8).

Fig. 8. Gauges and charts from AquaSheriff online water monitoring platform.
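To make the alarm behaviour of property 5 concrete, the following is a minimal, hypothetical monitoring loop in the spirit of the description above. It is not AquaSheriff firmware: read_probe() is a stand-in for whatever driver the attached probes expose (here it just simulates readings), and the alarm bounds are illustrative.

```python
import random
import time

# Illustrative alarm bounds per parameter (min, max); a real deployment
# would make these user-configurable, as property 5 above describes.
ALARM_BOUNDS = {
    "ph": (6.0, 7.5),
    "temperature_c": (18.0, 28.0),
    "dissolved_oxygen_mgl": (5.0, 12.0),
    "conductivity_uscm": (300.0, 1500.0),
}

def read_probe(name: str) -> float:
    """Stand-in for a real probe driver (hypothetical); returns simulated data."""
    lo, hi = ALARM_BOUNDS[name]
    return random.uniform(lo - 1.0, hi + 1.0)

def check_once(read=read_probe) -> list:
    """Read every probe once and return alarm messages for out-of-range values."""
    alarms = []
    for name, (lo, hi) in ALARM_BOUNDS.items():
        value = read(name)
        if not lo <= value <= hi:
            alarms.append(f"ALARM {name}={value:.2f} outside [{lo}, {hi}]")
    return alarms

if __name__ == "__main__":
    while True:
        for message in check_once():
            print(message)          # a real system would notify the farmer
        time.sleep(60)              # one reading per minute
```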

5 Data and Metadata Management

The amount of data in our world has been exploding, and analyzing large data sets, so-called big data, will become a key basis of competition, underpinning new waves of productivity growth and innovation [15]. We are living in an age of "big data", which is changing all areas of humankind, including science. One of the most important issues in experimental research is the reproducibility of experiments. The reproducibility and replicability of biochemistry and biophysics experiments are becoming more and more critical given the enormous number of scientific papers published nowadays. Reproducibility is highly connected to the proper description of experimental conditions, which can influence the results of the experiment. The experimental protocol is not only the measurable conditions under which we perform our experiments, but a complete set of information called experimental metadata. To clarify the concept of experimental metadata, we must start with the general definition of metadata: source [16] defines metadata as data about data. One of the main challenges in modern science is the amount of data produced by experimental work; it is difficult to store, organize and share the scientific data and to extract the wealth of knowledge. Experimental method descriptions in scientific publications are often incomplete, which complicates experimental reproducibility. The bioWES system was created to address these issues. It provides a solution for the management of experimental data and metadata to support reproducibility [17].

6 Conclusion

We are living in a digital world. We have seen hundreds of market sectors evolve and change during the last few years; almost everything is now controlled by and connected to the internet. The fishery industry was a step behind this IoT (Internet of Things) revolution, so a big commercial potential is now arising for new startup companies. The use of minicomputers, probes and sensors, and software with a database can change the game in fishery online water monitoring and biomonitoring management. These solutions can save time and protect water organisms from fluctuations in water quality. It sometimes happens, especially in our Czech Republic region, that fishery companies have more ponds to take care of than people; from that point of view, it is very stressful and time-consuming to keep an overview of all the ponds in the company. A catastrophic scenario can develop relatively quickly, because there are no automated indicators able to contact a responsible person. In the medieval age, it was normal to have a responsible person at each pond to control water quality and fish welfare, but today's economic world tends to decrease human work and increase automated methods and solutions. From that point of view, the future of smart online water biomonitoring belongs to the combination of minicomputers, sensors and software with a database, able to alert users before a catastrophic scenario becomes reality. Living biosensors are one of the precise solutions that can predict changes in closed systems such as aquaponics or recirculating aquaculture systems.


Acknowledgments. This work was supported and co-financed by the South Bohemian Research Center of Aquaculture and Biodiversity of Hydrocenoses (CENAKVA CZ.1.05/2.1.00/01.0024); CENAKVA II (No. LO1205 under the NPU I program); and by the South Bohemia University grant GA JU 017/2016/Z.

References 1. Feki, M.A., et al.: The internet of things: the next technological revolution. Computer 46(2), 24–25 (2013) 2. Angerer, J., Ewers, U., Wilhelm, M.: Human biomonitoring: state of the art. Int. J. Hyg. Environ. Health 210(3), 201–228 (2007) 3. Konstantinidis, E., et al.: Introducing Neuroberry, a platform for pervasive EEG signaling in the IoT domain. In: Proceedings of the 5th EAI International Conference on Wireless Mobile Communication and Healthcare. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering) (2015) 4. http://www.frov.jcu.cz/en/institute-complex-systems 5. Rakocy, J.E., Masser, M.P., Losordo, T.M.: Recirculating aquaculture tank production systems: aquaponics—integrating fish and plant culture. SRAC Publ. 454, 1–16 (2006) 6. Parker, N.C., Giorgi, A.E., Heidinger, R.C., Jester Jr., D.B., Prince, E.D.: Fish-marking techniques (1990) 7. Altimiras, J., Larsen, E.: Non-invasive recording of heart rate and ventilation rate in rainbow trout during rest and swimming. Fish go wireless! J. Fish Biol. 57(1), 197–209 (2000) 8. Pautsina, A., et al.: Infrared reflection system for indoor 3D tracking of fish. Aquacult. Eng. 69, 7–17 (2015) 9. Pautsina, A., et al.: Noninvasive crayfish cardiac activity monitoring system. Limnol. Oceanogr.: Methods 12(10), 670–679 (2014) 10. Urban, J., et al.: FISCEAPP: fish skin color evaluation application. In: 17th International Conference on Digital Image Processing, Dubai, UAE (2015) 11. Gertz, E., Di Justo, P.: Environmental Monitoring with Arduino: Building Simple Devices to Collect Data About the World Around Us. O’Reilly Media Inc., Newton (2012) 12. Ferdoush, S., Li, X.: Wireless sensor network system design using Raspberry Pi and Arduino for environmental monitoring applications. Procedia Comput. Sci. 34, 103–110 (2014) 13. Bárta, A., Souček, P., Bozhynov, V., Urbanová, P.: Automatic multiparameter acquisition in aquaponics systems. In: Rojas, I., Ortuño, F. (eds.) IWBBIO 2017. LNCS, vol. 10209, pp. 712–725. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-56154-7_63 14. www.aquasheriff.eu 15. Cartho, W.: Metadata. An Overview (2015). http://www.nla.gov.au/openpublish/index.php/ nlasp/article/view/1019/1289. Accessed 22 June 2015 16. Casadevall, A., Fang, F.C.: Reproducible science. Infect. Immun. 78(12), 4972–4975 (2010). https://doi.org/10.1128/IAI.00908-10 17. Cisar, P., et al.: BioWes-from design of experiment, through protocol to repository, control, standardization and back-tracking. BioMed. Eng. OnLine 15(1), 74 (2016) 18. Depledge, M.H., Andersen, B.B.: A computer-aided physiological monitoring system for continuous, long-term recording of cardiac activity in selected invertebrates. Comp. Biochem. Physiol. A: Physiol. 96(4), 473–477 (1990) 19. Burnett, N.P., et al.: An improved noninvasive method for measuring heartbeat of intertidal animals. Limnol. Oceanogr. Methods 11(2), 91–100 (2013)

SARAEasy: A Mobile App for Cerebellar Syndrome Quantification and Characterization

Haitham Maarouf1, Vanessa López1, Maria J. Sobrido2, Diego Martínez3, and Maria Taboada1

1 Department of Electronics and Computer Science, Campus Vida, University of Santiago de Compostela, Santiago de Compostela, Spain
[email protected], [email protected], [email protected]
2 Instituto de Investigación Sanitaria (IDIS), Centro de Investigación Biomédica En Red de Enfermedades Raras (CIBERER), Santiago de Compostela, Spain
[email protected]
3 Department of Applied Physics, Campus Vida, University of Santiago de Compostela, Santiago de Compostela, Spain
[email protected]

Abstract. The assessment of latent variables in neurology is mostly achieved using clinical rating scales. Mobile applications can simplify the use of rating scales, providing a quicker quantitative evaluation of these latent variables. However, most mobile health apps do not provide user input validation, they make mistakes in their recommendations, and they are not sufficiently transparent in the way they run. The goal of this paper was to develop a novel mobile app for cerebellar syndrome quantification and clinical phenotype characterization. SARAEasy is based on the Scale for the Assessment and Rating of Ataxia (SARA), and it incorporates the clinical knowledge required to interpret the patient status through the identified phenotypic abnormalities. The quality of the clinical interpretation achieved by the app was evaluated using data records from anonymous patients suffering from SCA36, and the functionality and design were assessed through a usability survey. Our study shows that SARAEasy is able to automatically generate high-quality patient reports that summarize the set of phenotypic abnormalities explaining the achieved cerebellar syndrome quantification. SARAEasy offers low-cost cerebellar syndrome quantification and interpretation for research and clinical purposes, and may help to improve evaluation.

Keywords: SARA · Health app · Rating scales · Human phenotype ontology · Clinical archetype

1 Introduction

Clinical rating scales play a significant role in collecting standardized data, mainly in neurology. They are also used to measure so-called latent variables, those that cannot be directly observed and must be inferred from other variables. An example of a latent variable is ataxia, i.e., the lack of voluntary ability to coordinate muscle movements. The assessment of this latent variable is made indirectly through a questionnaire covering a set of clinical statements or items [1] that can be directly observed, such as abnormal gait or loss of balance. For example, the Scale for the Assessment and Rating of Ataxia (SARA) [2] is a survey evaluating motor performance in patients suffering from ataxia, by adding the scores resulting from the assessment of eight items. Most rating scales used in neurology are ordinal scales, providing facilities to rank patients in degrees of disability according to certain external criteria. The strategy of this type of scale is to obtain a single score (total score) that characterizes an individual. The total score of SARA ranges from 0 (no ataxia) to 40 (most severe ataxia). This strategy is very attractive, although it is somewhat ambiguous in that two individuals with different clinical conditions may have identical scores through different combinations of items. Consider two patients with the same clinical stage, for example, a total SARA score of 15 (moderate cerebellar syndrome). This does not ensure that their functional situation is similar: one of them could barely walk or sit unaided (notably midline ataxia), whereas the other patient could have compromised speech and limb coordination. In both cases, this functional differentiation could be inferred from the data collected with the rating scale, though it is not reflected in the total SARA score. Therefore, in cases like this, the numerical result of a scale is not enough to describe the patient's condition accurately, and inferring qualitative descriptions with diagnostic implications from the numerical scores collected using rating scales is a challenge. A solution to overcome this drawback is to incorporate the knowledge required to interpret the numerical assessments. These interpretations could be provided in various formats, depending on their use; if the focus is the preparation of patient reports, the most appropriate format is a textual summary. SARAEasy provides numerical assessments of cerebellar syndrome following the Scale for the Assessment and Rating of Ataxia (SARA), and it incorporates the clinical knowledge required to also provide accurate clinical interpretations of the numerical assessments. The outcome provided by SARAEasy can be directly incorporated into the patient report.
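The aggregation just described can be made concrete with a short sketch. The per-item maxima below follow the published SARA scale (the four limb-kinetic items are rated on both sides and the left/right mean is used); the input validation mirrors the concern about medical apps raised in Sect. 2, while the severity cut-offs are illustrative assumptions only (chosen so that a total of 15 reads as moderate, as in the example above) and are not the knowledge model used by SARAEasy:

```python
# Per-item maximum scores of the published SARA scale.
ITEM_MAX = {
    "gait": 8, "stance": 6, "sitting": 4, "speech": 6,
    "finger_chase": 4, "nose_finger": 4,
    "fast_alternating_movements": 4, "heel_shin": 4,
}
BILATERAL = {"finger_chase", "nose_finger",
             "fast_alternating_movements", "heel_shin"}

def sara_total(scores: dict) -> float:
    """Validated SARA total score in [0, 40].

    scores maps item name -> value, or -> (right, left) for bilateral items.
    """
    total = 0.0
    for item, max_score in ITEM_MAX.items():
        value = scores[item]
        if item in BILATERAL:
            right, left = value
            value = (right + left) / 2.0       # mean of the two sides
        if not 0 <= value <= max_score:
            raise ValueError(f"{item} must be in [0, {max_score}], got {value}")
        total += value
    return total

# Illustrative severity labels only (NOT the SARAEasy knowledge model):
def severity_label(total: float) -> str:
    if total < 8:
        return "mild cerebellar syndrome"
    if total < 20:
        return "moderate cerebellar syndrome"
    return "severe cerebellar syndrome"
```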

2 Methods

Ensuring the safety and quality of medical apps represents a challenge [3]. For example, current insulin dose calculators do not provide user input validation, they make mistakes in their dosage recommendations, and they are not sufficiently transparent in their operation (i.e., in the formulas they use) [4]. With the aim of guaranteeing patient safety, SARAEasy was based on three basic premises: transparency of the algorithms and knowledge used, data sharing, and data quality. In addition, patient data privacy is a key principle on which our approach is based. For patient safety purposes, SARAEasy was substantiated by the Scale for the Assessment and Rating of Ataxia (SARA), a well-validated instrument to assess the presence and severity of cerebellar ataxia [5]. Firstly, we modeled SARA using openEHR archetypes [6], as they promote computational data standardization, comparison of results across studies [7] and the integration of information from different sources. Archetypes support interoperability and can be reused across many types of healthcare applications. Secondly, mapping the archetype data structures to an ontology facilitates the automated clinical interpretation of the patient status and removes ambiguity in interpretation, especially when several patients have the same total score. We chose the Human Phenotype Ontology (HPO) [8] to represent the meaning of the SARA clinical items. For patient data privacy purposes, SARAEasy was designed to be free of patient data access: it only uses a coded identity to incorporate into the final report, and the code is not stored in the app.

2.1 Modeling the SARA by Clinical Archetypes

Currently, the main approach to developing a new archetype follows a classical methodology [9]: (1) analysis of the clinical domain and requirements, identifying the archetype content (clinical concepts and their organization) from different sources (literature, record forms, etc.); (2) selection of the archetype type, structuring the content according to that type; and (3) filling the parts of the archetype with the content. A knowledge engineer analyzed the SARA scale with the help of a neurologist, determining all the entities that should be represented, and both of them organized and structured the contents. All these entities are shown in the Data part in Fig. 1. After the analysis process, the type of the archetype was chosen: an Observation archetype was selected, as it can record directly measurable data. The archetype was then filled with metadata, including purpose, keywords, definition and authors, among other information, with the help of the openEHR Archetype Editor [10]. All the archetype entities were modeled using the Element structure, and the Elements were defined with proper data types, descriptions, comments, details, occurrences, constraints and possible values. We used two data types: Quantity for the items corresponding to arithmetic averages and the total score, and Ordinal for the rest of the items. The developed archetype, named SARA, was submitted to the Clinical Knowledge Manager (CKM) [11], a system for collaborative development, management and publishing. After revision and some modifications, the archetype was accepted for publication in the CKM, where it is publicly accessible [12]. Figure 1 displays the mind map representation of the Observation archetype.

Fig. 1. The mind map representation of the SARA observation archetype, which is publicly accessible in [12].

2.2 Mapping the Archetype SARA to the HPO

Firstly, we used the concept recognition system OBO Annotator [13] to annotate all the relevant information about the SARA survey with HPO classes. The annotated classes constituted the seed terms required to extract the HPO subontology relevant to the SARA. Next, a neurologist revised the extracted subontology, proposing a minimal extension and reorganization of the subontology classes. The additional classes and subClassOf relationships are summarized in Fig. 2.

Fig. 2. Structure of the extended HPO ontology modules. The blue rectangles represent the original HPO classes, the blue lines represent the original relationships, the red rectangles represent the added classes and the red lines represent the added relationships. (Color figure online)

In addition to the classes directly related to the eight SARA items, the neurologist identified three classes especially relevant for the clinical interpretation of cerebellar ataxia (Fig. 3): (1) Truncal Ataxia, which subsumes Gait Ataxia, Standing Instability and Sitting Imbalance; (2) Appendicular Ataxia, which subsumes Dysdiadochokinesis, Intention Tremor and Limb Dysmetria; and (3) Dysarthria. In order to determine the levels of severity that a patient has, we reused the HPO class called Severity and its subclasses Borderline, Mild, Moderate, Severe and Profound. New classes were created based on the severity levels of their superclasses, for example, Borderline Sitting Imbalance, Mild Sitting Imbalance, Moderate Sitting Imbalance or Severe Sitting Imbalance. We translated the HPO subontology to Protégé [14], and we checked the ontology consistency with the HermiT reasoner [15].

Fig. 3. Excerpt from the domain ontology
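The severity subclasses above lend themselves to a simple score-to-term mapping. The sketch below is purely illustrative: the cut-offs and the function are not the published SARAEasy model (which is expressed in GDL and OWL, see Sect. 4); they only show the idea of deriving an HPO-style phenotype term from a scored item.

```python
# Hypothetical mapping from an ordinal SARA item score to the severity
# subclasses added to the HPO module (Borderline/Mild/Moderate/Severe);
# the score cut-offs are illustrative assumptions, not the published model.
SEVERITY_LEVELS = ["Borderline", "Mild", "Moderate", "Severe"]

def phenotype_term(item: str, score: int, max_score: int):
    """Return e.g. 'Moderate Sitting Imbalance' for a scored SARA item."""
    if score == 0:
        return None                      # no abnormality observed
    fraction = score / max_score         # position within the item's range
    index = min(int(fraction * len(SEVERITY_LEVELS)), len(SEVERITY_LEVELS) - 1)
    return f"{SEVERITY_LEVELS[index]} {item}"

print(phenotype_term("Sitting Imbalance", 2, 4))  # -> 'Moderate Sitting Imbalance'
```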

3 Results

To demonstrate the functionality of our approach, we developed SARAEasy, a mobile app for cerebellar syndrome quantification and clinical phenotype characterization (Fig. 4). A proof-of-concept test was carried out to show that the SARAEasy app is able to identify and report cerebellar ataxia characteristics, such as midline or appendicular ataxia, in a straightforward and effortless way. SARAEasy faithfully mirrors the SARA archetype and was developed in Android Studio (version 2.2), the integrated development environment for Google's Android operating system. As with all Android apps, it comprises a set of interconnected activities, most of which are presented to the user as full-screen windows. In order to protect the confidentiality of patient data, SARAEasy was implemented so as not to handle patient data: it only requests a coded identity to include in the final report, plus an e-mail account for submitting the questionnaire data and the final report (Fig. 5).


Fig. 4. Main activity

Fig. 5. Entry activity for coded identification and e-mail account

Four types of activities can be distinguished in SARAEasy: item entry, item query, questionnaire modification/deletion, and questionnaire submission via e-mail. A different activity was designed for each item in the rating scale (Fig. 6), equipped with explanations and links to videos (Fig. 7). As SARA distinguishes between the right and left side for some items (mirroring the archetype), twelve item entry activities were designed. Additionally, several activities were added to facilitate navigation through the questionnaire. Finally, the questionnaire submission activity involves all the logic associated with cerebellar syndrome quantification and characterization. It implements all the information-processing units required to automatically interpret the data obtained through the item entry activities; the modeling of these units can be revised in [16]. Both the final report (Fig. 9) and the total score (Fig. 8) can be visualized before submission by e-mail. An SQLite database was designed for storing the collected and inferred data while the app is active. The database accesses are performed as background operations to avoid any slowdown.

Fig. 6. Entry activity for the item sitting

Fig. 7. Access to an explanation

Fig. 8. Total score activity

Fig. 9. Patient report automatically generated by SARAEasy

3.1 Dataset and Validation of SARAEasy

For validation purposes, two types of assessment were carried out. First, two independent neurologists validated the quality of the achieved results using data records from 28 anonymous individuals suffering from Spinocerebellar Ataxia Type 36 (SCA36).


Additionally, a usability survey was designed to evaluate the functionality and design of the app. The evaluation of the quality of the achieved results was carried out in three steps:

1. Inference of the scores and reports: We filled out the score data for each patient and extracted the following results: (1) the severity for each item, (2) the severity of the cerebellar syndrome, (3) the severity of truncal ataxia, and (4) the severity of appendicular ataxia on the right and left sides. The results were exported to an Excel file.
2. Interpretation by two independent neurologists: The total score data were sent to two neurologists, who used their expertise in ataxia to determine the severity of the cerebellar syndrome and of truncal and appendicular ataxia, if present, from the provided scores.
3. Comparison of results between the system and the human experts: The interpretations of the system and of the neurologists were imported into SPSS [17], and the weighted kappa test [18] was executed 12 times to measure the strength of agreement between the implemented system and each neurologist, and between the two neurologists themselves. Weighted kappa scores ranged from 0.62 to 0.86 (a computational sketch follows Table 1).

Usability surveys. The assessment of the functionality and design of the app was carried out through a usability survey (Table 1). The design of its questions was based on software usability questionnaires [19], especially on the System Usability Scale (SUS) questionnaire [20], which was developed by John Brooke in 1986 as part of the introduction of usability engineering to the systems of Digital Equipment Co. Ltd. The survey consisted of 13 items, covering diverse aspects: language, colors, icons' images, terminology, speed, error messages and interaction. Each item has four possible answers, ranging from 1 to 4 according to the degree of agreement (1 Strongly disagree, 2 Disagree, 3 Agree, 4 Strongly agree). The maximum overall score for the survey is 52. The app was evaluated as a proof-of-concept study with two expert and five inexpert users. All users received detailed explanations about the goal and the functionality of the app. The scores obtained from both expert and inexpert users were greater than 75% of the total score, so the app is considered to have been successfully assessed.

Table 1. Usability survey (achieved average rating per item: Expert / Inexpert / Total)

1. I think it is an easy application to use: 3.5 / 3.6 / 3.6
2. Regarding the language, I could understand the application: 4.0 / 3.2 / 3.6
3. The application is too simple: 4.0 / 3.0 / 3.5
4. The representation of the icons of the application concerning their functions: 3.0 / 3.0 / 3.0
5. The structure and organization of the system: 3.5 / 3.8 / 3.7
6. I did not find useless buttons or tabs: 4.0 / 3.8 / 3.9
7. The chosen color range is correct, since the texts and the other elements of the application are clearly visible: 4.0 / 3.8 / 3.9
8. The presentation of the product is pleasant and not shabby: 4.0 / 3.8 / 3.9
9. I did not need any help to manage the program: 2.5 / 2.8 / 2.7
10. I know at what stage I am in the application: 3.5 / 3.6 / 3.6
11. The application does not make pointless screen leaps: 4.0 / 3.8 / 3.9
12. The error messages are helpful and not confusing: 4.0 / 3.8 / 3.9
13. The processing speed of the application is fast: 3.5 / 3.0 / 3.3
Total: 47.5 / 45.0 / 46.3
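For readers reproducing step 3 outside SPSS, the weighted kappa statistic is also available in scikit-learn. The sketch below uses made-up severity labels and linear weights (one common choice for ordinal categories, in the spirit of Cohen's weighted kappa [18]) purely as an illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ordinal severity labels for the same patients, one list per rater
system_labels      = ["mild", "moderate", "moderate", "severe", "mild"]
neurologist_labels = ["mild", "moderate", "severe",   "severe", "mild"]

ORDER = {"mild": 0, "moderate": 1, "severe": 2}
y1 = [ORDER[label] for label in system_labels]
y2 = [ORDER[label] for label in neurologist_labels]

# Linear weights penalize disagreements by their ordinal distance.
kappa = cohen_kappa_score(y1, y2, weights="linear")
print(f"weighted kappa = {kappa:.2f}")
```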

4 Discussion

In this study, a proof-of-concept test was carried out to ensure the feasibility of SARAEasy for easily registering abnormalities of clinical phenotypes associated with the severity of cerebellar syndrome, such as truncal ataxia, appendicular ataxia or dysarthria. Cerebellar syndrome quantification and interpretation through phenotypic abnormalities can help to improve outcome measurements in the evaluation of spinocerebellar ataxia. Mobile apps are portable devices facilitating the registration of measurements during the clinical examination with maximum flexibility and high cost-effectiveness. However, currently available apps in different domains face limitations on sharing measurement data and also information on the formulas they use to calculate scores. SARAEasy can fill this gap by providing full access to the item data while at the same time offering reporting capacities based on high-quality patient phenotype interpretation. SARAEasy has been designed with European MEDDEV legislation in mind [21]. SARAEasy manipulates patient data, as it calculates partial and total scores and generates a clinical report explaining the quantitative scores; hence, the device should be considered low-risk. Even so, patient safety is guaranteed, as SARAEasy is based on a well-validated scale for assessing the presence and severity of cerebellar ataxia, and it has been modeled using a clinical archetype, which is now freely available in the openEHR CKM. Additionally, the clinical knowledge required to generate the final reports has been published and validated using data from SCA36. Thus, all data collected following the archetype can be shared between different healthcare systems. Regarding the clinical interpretations provided by SARAEasy in reporting, we used a combination of GDL (Guideline Definition Language) and OWL (Web Ontology Language) to model the information-processing units [16]. In the current version, SARAEasy does not yet export data in a format following the clinical archetype; this extension is clearly feasible and would achieve interoperability with current clinical information systems. With regard to the validation process, SARAEasy will be tested using more data from patients affected by diverse cerebellar ataxias, and with other neurologists from different hospitals.

5 Conclusion

Our study shows that SARAEasy is able to automatically generate high-quality patient reports that explain the total score for cerebellar syndrome quantification. This explanation is determined by phenotypic abnormalities such as appendicular or midline ataxia. SARAEasy offers low-cost cerebellar syndrome quantification and interpretation for research and clinical purposes, and may help to improve evaluation.

Funding. The work presented in this paper was supported by the National Institute of Health Carlos III [grant no. FIS2012-PI12/00373: OntoNeurophen], with FEDER national and European funding.

Acknowledgment. The authors would like to thank Dr. Manuel Arias and Dr. Ángel Sesar for participating in the validation process to test the validity of SARAEasy.

References 1. Martinez-Martin, P.: Composite rating scales. J. Neurol. Sci. 289(1), 7–11 (2010) 2. Schmitz-Hübsch, T., Du Montcel, S.T., Baliko, L., Berciano, J., Boesch, S., Depondt, C., et al.: Scale for the assessment and rating of ataxia development of a new clinical scale. Neurology 66(11), 1717–1720 (2006) 3. Wicks, P., Chiauzzi, E.: ‘Trust but verify’–five approaches to ensure safe medical apps. BMC Med. 13(1), 205 (2015) 4. Huckvale, K., Adomaviciute, S., Prieto, J.T., Leow, M.K.S., Car, J.: Smartphone apps for calculating insulin dose: a systematic assessment. BMC Med. 13(1), 106 (2015) 5. Saute, J.A.M., Donis, K.C., Serrano-Munuera, C., Genis, D., Ramirez, L.T., Mazzetti, P., Pérez, L.V., Latorre, P., Sequeiros, J., Matilla-Dueñas, A., Jardim, L.B.: Ataxia rating scales —psychometric profiles, natural history and their application in clinical trials. Cerebellum 11(2), 488–504 (2012) 6. Beale, T., Heard, S.: openEHR - Release 1.0.2. 2016. http://www.openehr.org/programs/ specification/releases/1.0.2. Accessed 04 Jan 2018 7. Min, H., Ohira, R., Collins, M.A., Bondy, J., Avis, N.E., et al.: Sharing behavioral data through a grid infrastructure using data standards. J. Am. Med. Inf. Assoc. 21(4), 642–649 (2014) 8. Köhler, S., Vasilevsky, N.A., Engelstad, M., Foster, E., McMurry, J., Aymé, S., Baynam, G., Bello, S.M., Boerkoel, C.F., Boycott, K.M.: The human phenotype ontology in 2017. Nucleic Acids Res. 45(D1), D865–D876 (2017) 9. Braun, M., Brandt, A.U., Schulz, S., Boeker, M.: Validating archetypes for the multiple sclerosis functional composite. BMC Med. Inf. Decis. Making 14(1), 64 (2014) 10. openEHR archetype editor. http://www.openehr.org/downloads/archetypeeditor/home. Accessed 04 Dec 2017 11. Clinical knowledge manager. http://openehr.org/ckm/. Accessed 26 Nov 2017 12. SARA observation archetype. http://openehr.org/ckm/#showArchetype_1013.1.2661. Accessed 26 Nov 2017 13. Taboada, M., Rodríguez, H., Martínez, D., Pardo, M., Sobrido, M.J.: Automated semantic annotation of rare disease cases: a case study. Database 2014, bau045 (2014) 14. Protégé. http://protege.stanford.edu/products.php#desktop-protege. Accessed 02 Dec 2017


15. HermiT OWL reasoner. http://www.hermit-reasoner.com/. Accessed 02 Jan 2018 16. Maarouf, H., Taboada, M., Rodriguez, H., Arias, M., Sesar, Á., Sobrido, M.J.: An ontologyaware integration of clinical models, terminologies and guidelines: an exploratory study of the scale for the assessment and rating of ataxia (SARA). BMC Med. Inf. Decis. Making 17(1), 159 (2017) 17. IBM SPSS software. https://www.ibm.com/analytics/data-science/predictive-analytics/spssstatistical-software. Accessed 20 Dec 2017 18. Cohen, J.: Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. Psychol. Bull. 70(4), 213 (1968) 19. Cortes, A.F.: Manual de Técnicas para el Diseño Participativo de Interfaces de Usuario de Sistemas basados en Software y Hardware. http://www.disenomovil.mobi/multimedia_un/ trabajo_final/03_cuestionarios_modelo_usabilidad_web.pdf. Accessed 12 Jan 2018 20. Brooke, J.: SUS-A quick and dirty usability scale. Usability eval. Ind. 189(194), 4–7 (1996) 21. Whitepaper medical apps. https://www.nictiz.nl/publicaties/infographics/infographicmedical-apps-is-certification-required. Accessed 15 Jan 2018

Case-Based Reasoning Systems for Medical Applications with Improved Adaptation and Recovery Stages

X. Blanco Valencia1, D. Bastidas Torres2, C. Piñeros Rodriguez2, D. H. Peluffo-Ordóñez2,3, M. A. Becerra4, and A. E. Castro-Ospina4

1 Universidad de Salamanca, Salamanca, Spain
2 Universidad de Nariño, Pasto, Colombia
[email protected]
3 Universidad Yachay Tech, Urcuquí, Ecuador
4 Instituto Tecnológico Metropolitano, Medellín, Colombia

Abstract. Case-Based Reasoning (CBR) systems are in constant evolution; accordingly, this article proposes improving the retrieve and adaptation stages through a different approach. A series of experiments was carried out, divided into three sections: a proper pre-processing technique, a cascade classification, and a probability estimation procedure. Every stage offers an improvement: a better data representation, a more efficient classification, and a more precise probability estimation provided by a Support Vector Machine (SVM) estimator compared with more common approaches. In conclusion, more complex techniques for classification and probability estimation are possible, improving the performance of CBR systems thanks to a lower classification error in general cases.

Keywords: Case-based reasoning · Preprocessing · Cascade classification · Probability

1 Introduction

Reasoning in humans is based on the process of remembering and applying rules, the product of various experiences that generate knowledge [1]. Case-based reasoning (CBR) is a problem-solving approach that uses past experience to tackle current problems. Technically, CBR is a methodology that has proven appropriate for applying analogy strategies in unstructured domains where knowledge acquisition is difficult [2]. It is therefore an ideal methodology for the development of support systems for medical diagnosis [3]. Through the analysis of previous cases, it can provide results that allow a better understanding of a patient and, therefore, a better diagnosis and treatment [4]. The life cycle of a CBR-based system consists of four main phases: identify the current problem and find a past case similar to the new case (retrieve); use that case to suggest a solution to the current problem (reuse/adaptation); evaluate the proposed solution (revise); and update the system to learn from the experience (retain) [5]. In this paper, we propose an improvement to case-based reasoning systems by developing an estimate of the relation of new cases to existing classes and by using multi-class cascade classifiers, giving better diagnostic assistance in medical settings. This paper is organized as follows: Sect. 2 reviews some related works and outlines the basics of CBR. Section 3 describes the operation of the proposed classification approach. Section 4 gathers some results and discussion. Finally, conclusions and final remarks are presented in Sect. 5.

X. Blanco Valencia: This work is supported by the Faculty of Engineering of the University of Salamanca.
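To fix the terminology, the following is a minimal, generic sketch of the retrieve and retain steps as nearest-neighbour search over a case base. It is a textbook illustration of the CBR cycle, not the system proposed in this paper:

```python
import numpy as np

class CaseBase:
    """Minimal CBR case base: feature vectors with known solutions (labels)."""

    def __init__(self, features: np.ndarray, solutions: list):
        self.features = features      # shape (n_cases, n_attributes)
        self.solutions = solutions

    def retrieve(self, query: np.ndarray, k: int = 3) -> list:
        """Return the k most similar past cases (Euclidean distance)."""
        distances = np.linalg.norm(self.features - query, axis=1)
        nearest = np.argsort(distances)[:k]
        return [(self.solutions[i], distances[i]) for i in nearest]

    def retain(self, features: np.ndarray, solution) -> None:
        """Store a newly solved case for future reuse."""
        self.features = np.vstack([self.features, features])
        self.solutions.append(solution)

# Reuse/adaptation could then be as simple as a majority vote over the
# retrieved solutions, with revision done by the clinician before the
# solved case is retained.
```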

2 Related Works and Background

The origin of CBR can be traced to Yale University and Schank and Abelson's work in 1977 [6]. Early explorations of CBR in the medical field were conducted by Koton [7] and Bareiss [8] in the 1980s. CBR is inspired by human reasoning, i.e., solving a problem by applying previous experiences adapted to the current situation. A case (an episodic experience) contains a problem, a solution, and its result. Clinical practice can begin with some initial experiences (resolved cases); those experiences are then used to solve a new problem, which may involve some adjustment of the previous solutions. Therefore, CBR is a reasoning process that is medically accepted and seems to attract increasing attention [4,9,10]. Anderson has demonstrated how people use past cases as models when learning to solve problems, particularly in early learning. Other results, like Kolodner's, indicate that experts who know a lot about a particular subject can remember events in their domain of expertise more easily than non-experts [11]. In the literature there is a variety of works that apply the CBR methodology to the health sector [12–14], and several of them study its evolution through the last years. For example, Bichindaritz [15] refers to CBR as an appropriate methodology for the care of the elderly and the support of people with disabilities, and in [16] it is concluded that automatic adaptation is a weakness, especially in CBR-based systems in the medical field. In [17] it is suggested to combine dimensionality reduction with CBR, in order to cope with the increasingly large, complex and uncertain data of clinical environments. Often, with an increase in the number of classes, the complexity and the computational cost increase, and classification difficulties may be present only for some classes [18,19]. The present article shows an alternative solution for the problems existing in the automatic adaptation stage.

3 Proposed Classification Approach

This proposal is aimed at improving the retrieve and adaptation stages of case-based reasoning systems through two processes. The first is focused on an appropriate preprocessing, in order to improve the representation of the case base and to obtain better classification results; in the second, the recovery and adaptation stages are combined using cascaded pattern recognition algorithms, which improves the classification result. Finally, it is proposed to estimate the probability density using support vector machines. This results in a CBR system with a greater amount of resources, giving the expert enough support to make the best decision in a medical environment.

3.1 Oversampling and Undersampling Methodology for Class Balancing

Class imbalance problems can be addressed in different ways, the most common being oversampling. This technique increases the size of the class with the fewest cases (the minority class) by adding synthetic samples until its number of records is similar to that of the majority classes. These techniques are employed to avoid the over-training problems caused by large differences in the number of samples per class. Increasing the data of the minority class results in a better classification, in exchange for a higher computational cost. The undersampling technique, by contrast, reduces the majority class to a size equal to or smaller than that of the minority class. This can be done in several ways, such as eliminating redundant samples, eliminating samples very close to their nearest neighbors, or deleting samples at random. These methods, as opposed to oversampling, remove unnecessary data and thus lower the computational cost, but the process can also remove relevant data, affecting the classification. In the present work, tests were performed with different balancing methods (undersampling, oversampling, and a mixture of both), using the classification error and the computational cost as comparison criteria. A sketch of both operations is shown below.
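The balancing steps above can be reproduced with the imbalanced-learn package. The following is a minimal sketch, assuming the case base is available as a feature matrix X and label vector y; SMOTE and edited-nearest-neighbour undersampling stand in for the SMOTE and KNN-based undersampling described here, and all parameter values are illustrative, not those of the study.

```python
# Minimal class-balancing sketch with imbalanced-learn (pip install imbalanced-learn).
# X: feature matrix, y: class labels; sizes and parameters are illustrative.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.array([0] * 180 + [1] * 20)          # imbalanced toy labels

# Oversampling: synthesize minority-class cases between existing ones.
X_over, y_over = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)

# Undersampling: drop majority-class cases via a KNN-based editing rule.
X_under, y_under = EditedNearestNeighbours(n_neighbors=3).fit_resample(X, y)

# Hybrid: oversample first, then clean with the KNN-based undersampler,
# mirroring the SMOTE-KNNU combination used later in the experiments.
X_mix, y_mix = EditedNearestNeighbours().fit_resample(X_over, y_over)
```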

3.2 Cascade Classification Methodology

CBR-based systems usually treat the retrieval and adaptation phases as independent stages; in this work these stages were integrated into a single one, resulting in computational savings. Most CBR systems are built using the KNN algorithm as part of the retrieve stage, and the adaptation stage is usually avoided because of its complexity. This article proposes a technique based on sequential classification with classifiers of different types, opening a wide research field.
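A cascade in this sense chains heterogeneous classifiers so that each stage is trained on the output of the previous one. The sketch below illustrates the idea with scikit-learn estimators; the feature-augmentation scheme is one plausible reading of "trained with the output of the classifier before", not the authors' exact implementation, and the dataset and estimators are illustrative.

```python
# Sketch of a two-stage cascade: stage 2 sees the original features plus
# stage 1's predicted class, one way to chain classifiers sequentially.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)

stage1 = GaussianNB().fit(X_tr, y_tr)
# Append stage 1's prediction as an extra feature for stage 2.
X_tr2 = np.column_stack([X_tr, stage1.predict(X_tr)])
X_te2 = np.column_stack([X_te, stage1.predict(X_te)])

stage2 = RandomForestClassifier(random_state=0).fit(X_tr2, y_tr)
print("cascade error:", 1 - stage2.score(X_te2, y_te))
```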

3.3 SVM Probability Density Methodology

As a complement and improvement of the adaptation stage, the class membership of a new case is predicted. Probability density estimation by Parzen windows is one of the most studied and best documented methods at the moment, while the Support Vector Machine has shown its capacity in pattern recognition applications in general. The SVM method consists in placing each case in an N-dimensional space, where N is the number of features, and then calculating a hyperplane that separates the classes. This proposal uses the SVM as an alternative to other estimators such as Parzen windows and KNN.
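In scikit-learn, class-membership estimates can be obtained from an SVM via Platt scaling, and a Parzen-window baseline via kernel density estimation. The snippet below sketches both under illustrative settings (dataset, bandwidth and kernel are assumptions, and this is not the authors' code).

```python
# Class-membership probabilities from an SVM (Platt scaling) and a
# Parzen-window baseline (one kernel-density model per class).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.neighbors import KernelDensity

X, y = load_iris(return_X_y=True)

svm = SVC(kernel="linear", probability=True, random_state=0).fit(X, y)
print("SVM P(class | x):", svm.predict_proba(X[:1]))

# Parzen windows: fit a density per class, combine with class priors.
densities = {c: KernelDensity(bandwidth=0.5).fit(X[y == c]) for c in np.unique(y)}
priors = {c: np.mean(y == c) for c in np.unique(y)}
log_post = np.array([densities[c].score_samples(X[:1])[0] + np.log(priors[c])
                     for c in sorted(densities)])
post = np.exp(log_post - log_post.max())
print("Parzen P(class | x):", post / post.sum())
```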

4 Results and Discussion

4.1 Database

All databases were obtained from the UCI machine learning repository [20]. Two public-domain databases with multiple medical diagnoses are considered: hypothyroidism, with 5 features distributed across 3 classes, and dermatology, with 19 features distributed across 6 classes. With the feature selection technique, the hypothyroidism database was reduced from 29 to 5 features, and the dermatology database from 33 to 19 features, making the classification procedure more efficient.

4.2 Methods

The average classification error was used as the measure of comparison between the different experiments.
1. Preprocessing: With the purpose of selecting the most relevant features, techniques such as CFS (Correlation Feature Selection) with Best-first search and InfoGain with attribute ranking were used, as found in the data mining software WEKA. Next, undersampling and oversampling procedures for class balancing were applied; 6 experiments were run to identify these procedures, using SMOTE, KNN-undersampling, ADASYN and a combination of the SMOTE and KNN-undersampling algorithms.
2. Cascade classifier algorithms: Using a different programming environment than WEKA, 5 classifiers were used (Naive Bayes, Parzen, Random Forest, KNN and SVM). Experiment 1 embeds combinations of 2 and 3 classifiers sequentially, without repeating the same one, where each classifier is trained with the output of the previous classifier. In experiment 2, a class is separated from the original database using a bi-class classifier; the result is a two-part sequential classification, the first part being a bi-class classification and the second an embedded sequential combination of 2 and 3 classifiers, with the same combinations as in experiment 1. These experiments were run 100 times for repeatability and reproducibility purposes, and every experiment used 70% of the database for training and 30% for testing.


3. Probability estimation: Parzen windows, KNN and SVM were used as probability estimators. 70% of the database was used for training and 30% to obtain the success rate between the estimator output and the real class given by the database. Execution times were also measured for each estimator. (A minimal version of this train/test evaluation loop is sketched below.)
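The repeated 70/30 evaluation protocol used throughout the experiments can be expressed in a few lines. The following sketch assumes scikit-learn; the dataset, the number of repetitions and the estimator are illustrative stand-ins.

```python
# Repeated 70/30 hold-out evaluation, reporting mean and spread of the error,
# mirroring the 100-run protocol of the experiments (estimator is illustrative).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
errors = []
for run in range(100):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=run)
    clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
    errors.append(1 - clf.score(X_te, y_te))
print(f"error: {100 * np.mean(errors):.2f} +/- {100 * np.std(errors):.2f} %")
```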

4.3 Performance Measures

The preprocessing and cascade classification experiments were compared using execution time and classification error. For probability estimation, the similarity between the estimated class and the original class, expressed as a percentage, and the execution time were used.

4.4 Experiments

1. Pre-processing
(a) Feature Selection Methods
i. Experiment 1: Without feature selection, the execution time is high when using the multilayer perceptron classifier: 24.72 s for the dermatology database and 22.68 s for hypothyroidism, whereas every other classifier showed execution times below 1 s for both databases. The highest favorable classification rates were observed using Naive Bayes on the dermatology database, with 97.81%, and random forest on the hypothyroidism database, with 99.31%.
ii. Experiment 2: The feature selection technique CfsSubsetEval with Best-first search is applied, using the same classifiers as in experiment 1. The classification results are shown in Table 1 for dermatology and Table 2 for hypothyroidism.
iii. Experiment 3: The results using InfoGainAttributeEval with the Ranker search method can also be seen in Tables 1 and 2.
In Tables 1 and 2 all classifiers show a good classification process, with success rates above 90% for both databases. For the dermatology database (Table 1) with the Best-first selector, Naive Bayes is the best classifier, with a 97.81% success rate; for the hypothyroidism database (Table 2) with the Ranker selector, random forest is the best classifier, with a 97.50% success rate.

Table 1. Dermatology database with Best first and Ranker as selectors

Classifier            | Favorable classification % (Best first / Ranker) | Poor classification % (Best first / Ranker) | Time (s) (Best first / Ranker)
NaiveBayes            | 97.81 / 97.26 | 2.18 / 2.73 | 0 / 0
Multilayer perceptron | 96.44 / 95.90 | 3.55 / 4.09 | 9.38 / 9.66
KNN (1)               | 96.44 / 95.35 | 3.55 / 4.64 | 0 / 0
SVM (Linear Kernel)   | 97.26 / 97.26 | 2.73 / 2.73 | 0.06 / 0.06
Random forest         | 96.44 / 95.62 | 3.55 / 4.37 | 0.05 / 0.03


Table 2. Hypothyroidism database with Best first and Ranker as selectors

Classifier            | Favorable classification % (Best first / Ranker) | Poor classification % (Best first / Ranker) | Time (s) (Best first / Ranker)
NaiveBayes            | 94.64 / 94.72 | 5.35 / 5.27 | 0 / 0
Multilayer perceptron | 96.10 / 96.26 | 3.89 / 3.73 | 2.94 / 3.02
KNN (1)               | 93.16 / 94.22 | 6.83 / 5.77 | 0 / 0
SVM (Linear Kernel)   | 93.13 / 93.50 | 6.86 / 6.49 | 0.2 / 0.13
Random forest         | 95.78 / 97.50 | 4.21 / 2.49 | 0.45 / 0.45

Time-wise, there is a notable decrease for most classifiers: for example, the random forest classifier on the dermatology database showed an execution time of 24.60 s without a selection method, whereas in Table 1 the execution time is 9.38 s with the Best-first selector and 9.66 s with the Ranker selector. According to Table 2, the multilayer perceptron execution time is reduced from the 22.68 s of experiment 1 to 2.94 s and 3.02 s for the Best-first and Ranker selectors, respectively. These results confirm that feature selection is a good option for data optimization, eliminating useless features and long execution times.
(b) Balancing Methods
i. Experiment 1: Using the databases with the most relevant features, the KNN-based classifier was applied without any balancing method.
ii. Experiment 2: Focusing on the minimum number of cases per class, the majority class is reduced to the size of the minority class, avoiding over-classification and verifying whether the data of the majority class are really necessary for a good classification. The dermatology database has a minimum of 20 cases (class 6), and the hypothyroidism database a minimum of 50 cases (class 3).
iii. Experiment 3: A general oversampling preprocess is applied; the most used technique is SMOTE, an algorithm that creates new synthetic data between the original samples. The amount of generated data was programmed in intervals of 50, up to the size of the majority class, to find out whether an overall increase of the data is really necessary.
iv. Experiment 4: A KNN-based undersampling method was applied, deleting unnecessary data without compromising the performance of the classification stage. The hypothyroidism majority class, with 1790 cases, is reduced to 64, a value obtained by adjusting the number of nearest neighbors and the distance so as to eliminate unnecessary data; dermatology class 1, with 112 cases, was reduced to 20 cases.
v. Experiment 5: The hybrid balancing method SMOTE-KNNU is used, with the same parameters as experiments 3 and 4. After this process, the dermatology database ends up with 72 cases per class and hypothyroidism with 100.
vi. Experiment 6: This experiment used ADASYN (adaptive synthetic sampling) as the balancing method, an extension of SMOTE that creates synthetic data not only at one point but at many points around a center; a parameter controls the percentage of data increase. Hypothyroidism was the only database on which this algorithm was implemented, owing to the neighborhood structure of the data in each class. A summary of the results is shown in Table 3, where the classification error is analyzed to choose the best option for both databases.

Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5 Experiment 6

Hypothyroidism 5.39 ± 5.39 Dermatology

2.90 ± 8.68

25.26 ± 8.78 9.20 ± 9.36

14.13 ± 5.96

2.37 ± 5.53

4.81 ± 8.82

3.28 ± 9.87

2.69 ± 12.23

4.57 ± 8.77

15.32 ± 9.91

As regards on Table 3, on hypothyroidism database, the minimum error value was in experiment 5, where classification error percentage was 2.37%. Also classification errors are uneven, for example, experiment 2 got 25.26%, and experiment 4 got 14.14%. Experiment 1 shows a lower classification error, possibly result of an overtraining process, given the excessive amount of data for class 1 training the classifier almost exclusively. For dermatology database, there is a similar behavior in all 5 experiments, the difference between the best classification on experiment 2 with 4.81% and experiment 5 with 2.69% is not bigger than 3%. Another important remark resides on experiment 1 with 2.90% as the lowest classification error for all other tests. Also dermatology database does not contain big differences in the amount of data per classes, avoiding overtraining, making it a reasonable database to be used without a pre-processing technique. 2. Cascades: The following classifiers were used in the experiments: Parzen, SVM, Random Forest, Naive bayes and KNN. (a) Experiment 1: Two and three classifiers sequence were used, with 21 and 60 combinations respectively. Hypothyroidism database with a KNNUSMOTE pre-processing and dermatology database without one are used to test every combination on a cascade classifier environment, classification error is shown by boxplots as follows: The first 5 boxes for dermatology database are individual classifiers, Fig. 1(a) shows classifier 5 as the best classifier (Naive Bayes) with the lowest error, an average of 0.04 and variance of 0.12 maximum, 6 to 25 combination displays no alteration regarding classifier 5 results. On triple combinations Fig. 1(b) shows no variation on classification error; the average was 0.04, combination number 3 shows a lower variance, although classifier 5 shows a value of 0.12. Figure 2(a) regarding hypothyroidism,

Case-Based Reasoning Systems for Medical Applications

33

0.8

0.5 0.7

0.4

0.6

0.5

Error

Error

0.3

0.2

0.4

0.3

0.2

0.1 0.1

0

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

classifiers

classifiers

(a)

Double Classifiers Combination Error in

(b)

Triple Classifiers Combination Error in Dermatology Database

Dermatology Database

Fig. 1. Dermatology database performance 0.5

0.5

0.45 0.4

0.4

0.35 0.3

Error

Error

0.3 0.25 0.2

0.2

0.15 0.1

0.1

0.05 0

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

classifiers

classifiers

(a)

Double Classifiers Combination Error in

(b)

Triple Classifiers Combination Error in Dermatology Database

Hypothyroidism Database

Fig. 2. Hypothyroidism database performance

random Forest (classifier 3) proves to be the classifier with the lowest classification error with an average of 0.04 and a variance of 0.12, double combinations 18 to 21 shows low classification error, but there is no variety between combinations. Figure 2(b) presents a similar case, combination 13 to 24 shows an average error of 0.05 and other combinations offers way higher average error and variance. Sequentially implemented classifiers do not affect classification error significantly. So other methods were implemented, trying to reduce classification error the lowest possible. (b) Experiment 2: A class was removed from the original database, so 5 binary classifiers were used for this purpose (SVM, Parzen, Random Forest, KNN and Naive Bayes), naming the desired class for removal ‘class 1’ and all the other ones as ‘class 0’, this way a second classifier, in this case a multiclass one, was trained for the remaining classes making the classification process a little bit easier. Classification error and execution time are shown on Figs. 3 and 4. Figure 3(c), the best binary classifier for dermatology database was classifier 2 (Parzen) and the best class for removal was class 3, although classifier 1 and 2 presents the same classification error, variance on both cases are bigger than classifier 2 like 0.3. Similarly, Fig. 4(c) shows that

X. Blanco Valencia et al. 0.2

0.5

0.18

0.45

0.16

0.4

0.14

0.35

0.2

0.3

Error

0.12

Error

0.3

0.25

0.1

Error

34

0.25

0.1

0.15

0.06

0.1

0.04

0.05

0.05

0.02

0

0 1

2

3

4

5

0 1

2

3

4

5

1

(b)

Class 1 Error per Classifier

(c)

Class 2 Error per Classifier

5

Class 3 Error per Classifier

0.25

0.3

0.35

0.2

Error

0.25 0.2

Error

0.25

0.3

Error

4

0.3

0.35

0.4

3

Clasifier

0.4

0.5 0.45

0.2 0.15

0.15

0.15

0.1

0.1

0.1

0.05 0.05

0.05

0

0

0 1

2

3

4

5

1

2

Clasifier

(d)

2

Clasifier

Classifier

(a)

0.15

0.2

0.08

3

4

5

1

2

Clasifier

(e)

Class 4 Error per Classifier

3

4

5

Clasifier

(f)

Class 5 Error per Classifier

Class 6 Error per Classifier

Fig. 3. Binary classifier error per class with dermatology database 0.7

0.5

0.6

0.45

0.6

0.4

0.5 0.5

0.35 0.3

0.3 0.2

Error

0.4

Error

Error

0.4

0.3

0.2 0.15

0.2

0.1

0.1

0.1

0

0 1

2

3

4

5

0.05 0 1

2

Classifier

(a)

0.25

3

4

5

1

2

Classifier

(b)

Class 1 Error per Classifier

3

4

5

Classifier

(c)

Class 2 Error per Classifier

Class 3 Error per Classifier

Fig. 4. Binary classifier error per class with hypothyroidism database

class 3 contains the lowest error and a variance of 0 with the third classifier (Random Forest). Results shown proves that a correct data separation might lead to a better classification, thus, a better adaption improvement. (c) Experiment 3: Once a class and a binary classifier were chosen, tests for a multiclass classifiers were made, using the same classifier as experiment 2. Results are shown on Fig. 5: 0.35

0.25

0.3

0.2 0.25

0.15

Error

Error

0.2 0.15 0.1

0.1

0.05

0.05

0 0 1

2

3

4

1

5

(a)

Single Combination Classifiers

Error in Dermatology Database

2

3

4

5

Classifier

Classifier

(b)

Single Combination Classifier

Error in Hypothyroidism Database

Fig. 5. Single combination multiclass classifier performance with one less class

Case-Based Reasoning Systems for Medical Applications

35

There is no significant change with experiment 1, average error and variance are similar. On the contrary hypothyroidism database shows an average error of 0 and a variance a little higher with 0.17 not showing a big improvement. (d) Experiment 4: Now a double combination of classifiers is added for a multiclass classifier, using the same combination as experiment 1 but with one less class to classify. Results are shown on Fig. 6. 0.45 0.4

0.4

0.35

0.35

0.3 0.25

Error

Error

0.3 0.25 0.2

0.2 0.15

0.15

0.1

0.1

0.05

0.05

0

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

1

2

3

classifiers

(a)

Error

Database

in

5

6

7

8

9 10 11 12 13 14 15 16 17 18 19 20

Classifier

Double Combination Clas-

sifiers

4

Dermatology

(b)

Double Combination Clas-

sifiers Error in Hypothyroidism Database

Fig. 6. Double combination multiclass classifier performance with one less class

Hypothyroidism database keeps 0 as an average error on 13, 14, 16 and 17 combination. Dermatology database does not show any improvements regarding experiment 1. (e) Experiment 5: A triple combination of classifiers is analysed, same as experiment 1, 60 combination are implemented trained with one less class database. Results are shown on Fig. 7. 0.35 0.3

Error

0.25 0.2 0.15 0.1 0.05 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Classifier

(a)

Triple Combination Classifiers Error in Dermatology Database

0.5 0.45 0.4 0.35

Error

0.3 0.25 0.2 0.15 0.1 0.05 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Classifier

(b)

Triple Combination Classifiers Error in Hypothyroidism Database

Fig. 7. Triple combination classifiers error

36

X. Blanco Valencia et al. 90

100 90

80

Success Percentage %

Success Percentage %

80 70 60 50 40 30

70

60

50

40

20 10

30 Parzen

KNN

SVM

Parzen

classifiers

(a)

Probability estimator Success

Rate Dermatology Database

KNN

SVM

classifiers

(b)

Probability estimator Success

Rate Hypothyroidism Database

Fig. 8. Probability density estimation

Most relevant result is seen on dermatology database (Fig. 7a), a lower average error was obtained on combination 13, 14, 15, 20 and 21, with a few high error values but lower variance in general. Any of the 5 combination mentioned before could improve the classifying process. On Hypothyroidism database there is no change on any values compared to any experiment made. 3. Probability (a) Experiment 1: The 3 most representative classifiers were used as density probability estimators: Parzen Windows, KNN and SVM. The results were compared with the original class of each case, counting only the good estimation over the bad ones. For every test made, values like average and minimal classification error and executed time were taken, as their respective graphics, error and time vs times the test was made. Every test was executed a 100 times, analyzing how much the results differ from one cycle to another and the average value. Probability estimation was introduced. As shown in Fig. 8, for analysis success percentage was measured and graphed on boxplots, Parzen windows estimator shows a good efficiency on hypothyroidism database Fig. 8(a) with an average of 75% success rate and low variance, instead dermatology database Fig. 8(b) has success of 8%, something uncommon for such recognized estimator. Later KNN estimator was taken, another recognized probability estimator, lower classification error was given, hypothyroidism database average has a 72% success rate on dermatology and 95% success rate was found but with a higher variance, for example 30% to 50% success rate was recorded. SVM estimator gave the lowest result and variance than others on both databases on average 75% success rate for hypothyroidism and 93% for dermatology.

5 Conclusion

– Using more complex tools in pattern recognition systems, such as cascade classification and probability density estimation, proves to improve precision and accuracy in general, showing a lower classification error and a high success rate for probability estimation. This composition in a CBR system focused on medical environments provides two independent techniques for a better diagnosis and understanding of the different problems in this area.
– Cascade classification improves on the classification error of a single classifier. However, a larger number of classifiers does not imply a better classification process: the nature and behavior of the database are very important in lowering the classification error. In hypothyroidism, lowering the classification error is a very complex process, whereas in dermatology classification errors of 0 were obtained. This means that cascade classification can be more necessary in some systems than in others, making the use of other techniques relevant.
– Probability estimation using the SVM as estimator proved to give better results for both databases than the other estimators. Although the execution time increased, a more precise and accurate system is preferable to a faster and lighter one, especially in medical environments.

References
1. Leake, D.B.: CBR in context: the present and future. In: Case-Based Reasoning, Experiences, Lessons and Future Directions, pp. 1–30 (1996)
2. Kolodner, J.L.: Maintaining organization in a dynamic long-term memory. Cogn. Sci. 7(4), 243–280 (1983)
3. Abecker, A.: Corporate memories for knowledge management in industrial practice: prospects and challenges. J. Univ. Comput. Sci. 3(8), 929–954 (1997)
4. Aamodt, A., Plaza, E.: Case-based reasoning: foundational issues, methodological variations, and system approaches. AI Commun. 7(1), 39–59 (1994)
5. Pal, S.K., Shiu, S.C.: Foundations of Soft Case-Based Reasoning, vol. 8. Wiley, Hoboken (2004)
6. Schank, R.C., Abelson, R.P.: Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures. Oxford, England (1977)
7. Koton, P.: Using experience in learning and problem solving. MIT/LCS/TR-441 (1989)
8. Bareiss, R.: Exemplar-Based Knowledge Acquisition: A Unified Approach to Concept Representation, Classification, and Learning. Academic Press Professional Inc., San Diego (1989)
9. Anderson, J.R.: The Architecture of Cognition. Harvard University Press, Cambridge (1983)
10. Kolodner, J.L.: Maintaining organization in a dynamic long-term memory. Cogn. Sci. 7(4), 243–280 (1983)
11. Shiu, S.C., Pal, S.K.: Case-based reasoning: concepts, features and soft computing. Appl. Intell. 21(3), 233–238 (2004)
12. De Paz, J.F., Bajo, J., Vera, V., Corchado, J.M.: MicroCBR: a case-based reasoning architecture for the classification of microarray data. Appl. Soft Comput. 11(8), 4496–4507 (2011)
13. De Paz, J.F., Bajo, J., López, V.F., Corchado, J.M.: Biomedic organizations: an intelligent dynamic architecture for KDD. Inf. Sci. 224, 49–61 (2013)
14. De Paz, J.F., Rodríguez, S., Bajo, J., Corchado, J.M.: Case-based reasoning as a decision support system for cancer diagnosis: a case study. Int. J. Hybrid Intell. Syst. 6(2), 97–110 (2009)


15. Bichindaritz, I., Marling, C.: Case-based reasoning in the health sciences: what's next? Artif. Intell. Med. 36(2), 127–135 (2006)
16. Juárez, J., Campos, M., Gomariz, A., Palma, J., Marin, R.: A reuse-based CBR system evaluation in critical medical scenarios. In: 21st International Conference on Tools with Artificial Intelligence, ICTAI 2009, pp. 261–268, November 2009
17. Montani, S.: How to use contextual knowledge in medical case-based reasoning systems: a survey on very recent trends. Artif. Intell. Med. 51(2), 125–131 (2011)
18. Krawczyk, B., Woźniak, M., Herrera, F.: On the usefulness of one-class classifier ensembles for decomposition of multi-class problems. Pattern Recogn. 48(12), 3969–3982 (2015)
19. Kang, S., Cho, S., Kang, P.: Multi-class classification via heterogeneous ensemble of one-class classifiers. Eng. Appl. Artif. Intell. 43, 35–43 (2015)
20. Lichman, M.: UCI Machine Learning Repository (2013)

Bioinformatics Tools to Integrate Omics Dataset and Address Biological Question

Constructing a Quantitative Fusion Layer over the Semantic Level for Scalable Inference

Andras Gezsi(1,2), Bence Bruncsics(1,2), Gabor Guta(2), and Peter Antal(1,2)

(1) Department of Measurement and Information Systems, Budapest University of Technology and Economics, Budapest, Hungary
{gezsi,bruncsics,antal}@mit.bme.hu
(2) Abiomics Europe Ltd., Budapest, Hungary
[email protected]
http://bioinfo.mit.bme.hu/

Abstract. We present a methodology and a corresponding system to bridge the gap between prioritization tools with a fixed target and unrestricted semantic queries. We describe the advantages of an intermediate level of networks of similarities and relevances: (1) it is derived from raw, linked data; (2) it ensures efficient inference over partial, inconsistent and noisy cross-domain, cross-species linked open data; (3) the preserved transparency and decomposability of the inference allow semantic filters and preferences to control and focus the inference; (4) high-dimensional, weakly significant evidences, such as overall summary statistics, can also be used in the inference; (5) quantitative and rank-based inference primitives can be defined; (6) queries are unrestricted, e.g. any variable can be prioritized; and (7) it allows wider access for non-technical experts. We provide a step-by-step guide for the methodology using a macular degeneration model, including drug, target and disease domains. The system and the model presented in the paper are available at bioinformatics.mit.bme.hu/QSF. Keywords: Semantic web · Graph databases · Linked open data · Data and knowledge fusion · Recommender systems · Explanation generation

1 Introduction

Integration of cross-domain information has been targeted at different levels: at the level of data, such as in the joint statistical analysis of cross-domain omic datasets [1]; at the level of knowledge, such as in the pharmaceutical integration approaches using semantic web technologies [2–4]; and even at the level of computational services, such as in scientific workflows [5,6]. However, a significant part of scientific knowledge is uncertain, weakly significant, poorly represented and remains inaccessible for cross-domain integration, although the importance


of the analysis and interpretation of such weak signs has already been recognized in many standalone high-dimensional omic domains. This is illustrated by data fusion in molecular similarity [7], kernel-based data and knowledge fusion [8], cross-species gene prioritization [9], Bayesian fusion [10] and network-boosted analysis of genome-wide polymorphism data [11]. Semantic technologies, relying heavily on the Resource Description Framework (RDF), provide an unprecedented basis for cross-domain data and knowledge fusion, as demonstrated by the emergence of a large-scale, unified knowledge space in the life sciences (the Life Sciences Linked Open Data Space, LSLODS; see e.g. BIO2RDF [12], CHEM2BIO2RDF [13], Open PHACTS [3], integrated WikiPathways [14], biochem4j [15], DisGeNET-RDF [16,17]). However, there are serious limitations concerning the computational complexity of inference [18], practical IT accessibility [19] and accessibility for non-technical users [3,20,21]. Furthermore, and most importantly, its ability to cope with uncertain facts, evidences and inference is still an open challenge (for representing uncertain scientific knowledge, see e.g. HELO [22]; for combinations of uncertain evidences, see e.g. [10,23–25]). To tackle these challenges, we propose the construction of an intermediate, quantitative knowledge level of structured similarities, and we created a corresponding system to demonstrate its advantages, the Quantitative Semantic Fusion (QSF) system (Fig. 1). This approach is related to multiple earlier approaches in fusion, such as (1) Linked Open Data (LOD) cubes to support computationally efficient SPARQL queries [26], (2) knowledge graphs [27], (3) probabilistic logic and Markov logic for semantic web integration inference and the approximation of inference in large-scale probabilistic graphical models [28], and (4) the relational generalization of kernel-based fusion [8,29]. We demonstrate the properties of this approach and the corresponding QSF system using a specific model for macular degeneration.

2 The Quantitative Semantic Fusion Framework

The Quantitative Semantic Fusion (QSF) System is an extensible framework that incorporates distinct annotated semantic types (also called: entities) and links between them by integrating different data sources from the Linked Open Data world. The QSF System then enables the users to quantitatively prioritize a freely chosen entity based on evidences propagated from any other, possibly multiple entities through the connecting links. Currently, the system contains genes, taxa, diseases, phenotypes, disease categories (UMLS semantic types and MeSH disease classes), pathways, substances, assays, cell lines and the targets of the compounds. Besides, associations between genes and diseases are further described by related single nucleotide polymorphisms and the source of the association information. Links define associations between entities. For example, genes and pathways are connected with a link which represents gene-pathway associations. Certain links have additional annotations which can be used for (1) weighting associations during similarity computations and/or for (2) filtering links based on the annotation values. In order to enable cross-species information fusion, we also added gene ortholog links.
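As a concrete picture of this structure, the sketch below models entities as typed nodes and links as weighted, annotated edges using networkx. The node identifiers other than the AMD UMLS ID C0242383 (taken from the model described later) are invented placeholders, and the schema does not reflect the QSF system's internal representation.

```python
# Toy entity-link structure: typed nodes, annotated weighted links.
# Identifiers other than C0242383 are placeholders, not real database IDs.
import networkx as nx

g = nx.Graph()
g.add_node("GENE_A", kind="gene")           # hypothetical gene node
g.add_node("C0242383", kind="disease")      # AMD (UMLS ID used in the model)
g.add_node("WP_EXAMPLE", kind="pathway")    # hypothetical pathway node

# Links carry annotations usable for weighting or filtering.
g.add_edge("GENE_A", "C0242383", weight=0.7, source="DisGeNET")
g.add_edge("GENE_A", "WP_EXAMPLE", weight=1.0, source="WikiPathways")

# Filtering links on an annotation value, as the QSF filters allow:
strong = [(u, v) for u, v, d in g.edges(data=True) if d["weight"] >= 0.5]
print(strong)
```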


Fig. 1. Quantitative Semantic Fusion (QSF) System (I) The QSF System incorporates distinct annotated semantic types (i.e. entities) and their quantitative pairwise relations (i.e. links) by integrating different data sources from the Linked Open Data world. Predefined entities and links from DisGeNET [16], Ensembl [30], ChEMBL [31] and WikiPathways [14] are shown in the top. Together entities and links form the structure and parameters of the QSF System. (II) The user can freely construct so-called computation graphs using the available entities and links and can select any entity as the target of the prioritization. An example computation graph is shown in the middle. Then, the user defines the (II.a) inference rules, sets (II.b) evidences of possibly multiple entities and (II.c) optionally sets filters on specific entities and links. The main results of the prioritization are (III.a) the quantitative relevance scores for the target entity and (III.b) the most dominant explanations of the prioritization results.


Furthermore, to be able to expand the evidences related to gene or substance entities, we enriched the system by adding gene-gene similarities based on Gene Ontology semantic similarity using GOssTo [32], and substance-substance similarities based on MACCS fingerprints computed by Tanimoto similarity. The user can freely construct so-called computation graphs from the available entities (used as nodes) and links (as edges). Entities can be reused in the graph, i.e. multiple nodes in the graph can have the same semantic type. The user then arbitrarily selects an entity to prioritize and gives evidences on other entities. For example, given a selection of phenotypes related to a disease, together with relevant drugs and substances in relevant clinical trials, and certain related genes in model organisms, the user may want to prioritize human genes based on all these evidences. The evidences propagate through the edges as similarity calculations between the entity vector of the source node and the row entity vectors of the linker matrix between the source and the target node. Seven different similarity calculation methods were implemented: cosine similarity, Dice and overlap coefficients, Tanimoto similarity, and three kernel-based similarities (linear, polynomial and radial basis function kernels). The vectors can be weighted by the numeric annotations of the links, and information-retrieval-based corrections can also be used. Default similarity calculations are suggested for each link type based on internal tests and cross-validation, but the default calculation mode can be overridden by the user. In the case of a node that has more than one incoming edge in the computation graph, the calculation of the scores of the node can be given by a mathematical formula over the incoming edges.
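The propagation step, comparing the evidence vector of the source entity with each row of the linker matrix, can be sketched as follows. The cosine variant is shown, and the matrices hold toy values rather than data from the system.

```python
# Evidence propagation as row-wise similarity against a linker matrix.
# link[i, j] = strength of the link between target entity i and source entity j.
import numpy as np

def propagate_cosine(evidence: np.ndarray, link: np.ndarray) -> np.ndarray:
    """Score each target entity by cosine similarity of its link row
    with the evidence vector over the source entities."""
    norms = np.linalg.norm(link, axis=1) * np.linalg.norm(evidence)
    with np.errstate(divide="ignore", invalid="ignore"):
        scores = link @ evidence / norms
    return np.nan_to_num(scores)

evidence = np.array([1.0, 0.45, 0.0, 0.2])     # soft evidences on 4 source entities
link = np.array([[1, 0, 0, 1],                 # toy incidence rows per target entity
                 [0, 1, 1, 0],
                 [1, 1, 0, 0]], dtype=float)
print(propagate_cosine(evidence, link))        # relevance score per target entity
```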

3 A Simple Model for Age-Related Macular Degeneration

To illustrate the methodology and the QSF system, a simple model was set up using age-related macular degeneration (AMD) as an example. In this case, the disease database contains 21 AMD subtypes or related diseases, but only one of them (URI: http://linkedlifedata.com/resource/umls/id/C0242383) contains relevant genetic information (including 391 genes). To expand the genetic information, the GWAS catalog [33], with 131 hits for AMD, was used as the human genetic source, and 72 rat genes from RGD (Rat Genome Database) and 42 mouse genes from MGD (Mouse Genome Database) [34] were used as ortholog genes, representing the two most common animal models for AMD. For chemical information, a currently used AMD drug and over 30 drug candidates from clinical AMD trials were used, taken from DrugBank [35]. For pathway information, three complement- and angiogenesis-related pathways were identified in the literature as underlying mechanisms.

4 Phases of the Methodology

The main phases of the methodology, starting from deriving relations from RDF resources and ending with the visualization of the most relevant proofs for an inference, are as follows: 1. Resource and model overview: Overview the modeled phenomena and the available relevant resources and their connections.


2. Model structure: Design the entities and their relations.
3. Model parameters: Derive parameters for the planned relations using SPARQL queries or direct RDF conversion (a sketch of this phase is given after Fig. 2).
4. Inference rules: Specify the inference rules for the propagation and combination of evidences, especially in multiply connected structures (with loops).
5. Evidences: Construct hard (logical) and soft (weighted) evidences.
6. Dynamic knowledge base: Define the active parts of the knowledge base, e.g. by selecting relevant model organisms and resources, and analogously disable certain parts of the knowledge base by semantic filtering.
7. Inference: Perform off-line inference using a computational cluster.
8. Results: Export prioritization and scoring results for targets, e.g. for external enrichment analysis.
9. Explanations: Export the most relevant explanations, visualized as graphs.
10. Sensitivity analysis: Check the sensitivity of the results to the settings.
The graphical user interface (GUI) of the QSF framework can be used for answering a large number of different questions using the predefined computation graphs. Furthermore, the design of the system allows the integration of new databases, and the computation graph of the evidence propagation is easily customizable. To support non-technical users, the GUI contains prepared computation graphs, which are capable of handling typical questions and demonstrating functionalities. The presented computation graph is a simple tree-based fusion model over genes, diseases, phenotypes, pathways, targets (proteins) and substances. We use this model to demonstrate and explain the QSF phases (Fig. 2).

Fig. 2. A simple computation graph for macular degeneration in the QSF system. Blue, green and yellow denote the inputs, the filters and the outputs, respectively. (Color figure online)
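Phase 3 above derives link parameters from RDF resources; one way to do this in practice is a SPARQL query issued from Python via SPARQLWrapper, as sketched below. The endpoint URL and the query predicates are illustrative assumptions; the actual queries and sources used by the QSF system are not specified here.

```python
# Illustrative phase-3 step: pull gene-disease association rows via SPARQL.
# Endpoint and predicates are examples, not the QSF system's actual sources.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://example.org/sparql")   # hypothetical endpoint
endpoint.setQuery("""
    SELECT ?gene ?disease ?score WHERE {
        ?assoc <http://example.org/gene>    ?gene ;
               <http://example.org/disease> ?disease ;
               <http://example.org/score>   ?score .
    } LIMIT 100
""")
endpoint.setReturnFormat(JSON)
rows = endpoint.query().convert()["results"]["bindings"]

# Each row becomes one weighted link in the quantitative layer.
links = [(r["gene"]["value"], r["disease"]["value"], float(r["score"]["value"]))
         for r in rows]
```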

4.1 Resource and Model Overview

The first step is the overview of the relevant information sources for the modeled phenomena, collecting information on the involved phenotypes, genes, drugs and pathways (the resources for macular degeneration are presented in Sect. 3).

4.2 Model Structure, Parameters, and Inference Rules

The second step starts with the construction of the computation graph, which contains the input nodes and the paths with possible filtering nodes. For the AMD model, the inputs and the target determine the computation graph (Fig. 2), but the framework can handle arbitrary graphs for complex models. Remark: The building blocks of the computation graph are the nodes and the edges connecting them (Fig. 1). Example: If a set of genes and substances are the available inputs and the question is about the pathways involved, then by clicking on the Gene and Substance input nodes and by selecting the Pathway node as 'Prioritization target' (see Subsect. 4.5), the relevant parts (i.e. the paths connecting the corresponding nodes) of the predefined computation graph will be selected automatically, and the Gene and Target filter nodes will be added automatically to the final computation graph.

4.3 Adding Input Evidences

The framework can incorporate three types of inputs: (1) constraint information, i.e. a list of entities without any weights; (2) evidence information, i.e. a list of entities with corresponding weights or evidences; and (3) conditional input, i.e. filter parameters on a node selecting all the entities for which the condition applies. Remark: The QSF approximates Bayesian information propagation; therefore, for quantitative results the inputs are required to represent probabilities. Any other kind of weight is also allowed and will result in a meaningful prioritization, but its quantitative interpretation is more problematic. Example: If the inputs are the drugs of running trials for a given disease, then the inputs can be added manually by clicking on 'Add constraint' or 'Add evidence', and the IDs will be shown in a list for each node (see Fig. 3B). Converting IDs: The GUI allows choosing entities one by one for any node by name or ID, but for a larger number of values the use of lists of IDs is suggested. Ensembl IDs are used for genes, UMLS IDs for diseases, HPO IDs for phenotypes, WikiPathways IDs for pathways, and ChEMBL IDs for protein targets and substances. Using these IDs, a large amount of input can be entered into the model; converting data of diverse origin to these IDs is therefore highly recommended, in order to utilize the maximum amount of data. Defining soft evidences: Quantitative evidences are weight values for each input entity representing its relevance. They can be any numeric values, but optimally they lie between 0 and 1, representing the probability of the input.


Fig. 3. GUI interface for inputs and filters: (A) Choosing prioritization target (B) Choosing a node and providing manual constraints and evidences (C) Giving constraints and evidences using lists (D) Adding filters and specifying filter conditions

Using lists: For a larger number of inputs, the usage of an input list is suggested. For example, if the drug trials for the macular degeneration disease are considered as input, the drug names are converted into ChEMBL IDs, which can then be added separated by commas. Quantitative (soft) evidences can also be used; the format is similar, but for each drug a certainty or relevance weight can be specified after an equality sign, where the number that follows is preferably a probability (Fig. 3C). In this case, the trial phase (0, I–IV) is known for the drugs, and the probabilities can be approximated by the acceptance rates of ophthalmology trials, which are 0.17 for phases 0 and I, 0.2 for phase II and 0.45 for phase III [36] (a sketch of building such a weighted list is shown below). Conditional inputs: Inputs can also be specified by using statements over any parameter of a given input node. Example statements are the following: for a disease node, the title contains the term "macular degeneration"; for a gene node, the chromosome number is 5; for a substance node, the title (or chemical name) contains a name (Fig. 3D) or a specific structure (like "Cyclopropyl-6-fluoro" and "carboxylic acid").
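The weighted-evidence list described above can be generated from trial metadata in a few lines. In the sketch below, the ChEMBL IDs are placeholders, while the phase-to-probability mapping follows the acceptance rates quoted from [36].

```python
# Build the "ID=weight" soft-evidence list from trial phases.
# ChEMBL IDs are placeholders; probabilities follow the quoted acceptance rates.
phase_prob = {"0": 0.17, "I": 0.17, "II": 0.20, "III": 0.45}

trials = {
    "CHEMBL0000001": "II",    # hypothetical drug candidate in phase II
    "CHEMBL0000002": "III",   # hypothetical drug candidate in phase III
}

evidence_list = ", ".join(f"{chembl_id}={phase_prob[phase]}"
                          for chembl_id, phase in trials.items())
print(evidence_list)   # e.g. "CHEMBL0000001=0.2, CHEMBL0000002=0.45"
```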

4.4 Adding Filters

The semantic control over the inference, e.g. filtering out gene-diseases interactions purely based on keywords, is a novel function, which is completely missing from currently prevailing monolithic gene prioritization systems. Further


improvements could be achieved by filtering out the less reliable links, e.g. the weak substance-target interactions, although the selection of threshold values for filtering is an open issue in our fusion methodology as well. Remark: The filtering method is the same for input and filter nodes (Fig. 3D), except that a filter statement in an input node (without parents) includes all the entities matching the statement, whereas in an intermediate or filter node it excludes the matching entities from further propagation. Example: In the case of macular degeneration, a wide range of sources contains data about low vision in general; therefore, filtering out common factors causing low vision, like cataract, can improve the quality of the inputs. Additionally, filtering on the Target-Substance edge allows excluding chemicals with low affinity to the target, by removing the weak associations where the pChEMBL value (−log(IC50 or Ki)) is below a certain threshold (a sketch of this filter is given below).
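A pChEMBL threshold filter over a table of substance-target links might look as follows with pandas. The column names and the cutoff of 6 (corresponding to 1 µM) are illustrative choices, not values prescribed by the system.

```python
# Drop weak substance-target links below a pChEMBL cutoff (illustrative values).
import pandas as pd

links = pd.DataFrame({
    "substance": ["CHEMBL0000001", "CHEMBL0000002", "CHEMBL0000003"],
    "target":    ["T1", "T1", "T2"],
    "pchembl":   [7.2, 4.8, 6.1],
})

cutoff = 6.0                         # hypothetical affinity threshold
strong_links = links[links["pchembl"] >= cutoff]
print(strong_links)
```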

4.5 Determining Outputs and Visualization

The next step is to define a target for the prioritization. It determines the type of the output and the path(s) of the propagation. Remark: The 'Prioritization target' determines the path(s) of the information propagation in the graph; it is therefore an interpretation or an aspect of the model. Example: If the question is which diseases are involved in a biological setup, the disease node is chosen as the prioritization target by clicking on it. It can be changed later by choosing the target from the list of the involved nodes (Figs. 2 and 3A). The GUI supports interpretation using a simple tabular result prioritization and a graphical visualization. Example: The macular degeneration model uses the known macular degeneration-related pathways, human and model-animal genes, drugs and their known targets. Choosing a disease node as the prioritization target, the results (Fig. 4) and the contributions of the individual inputs (Fig. 5) are informative for evaluating the model. Prioritization: The results contain entity identifiers, a numeric value representing the relevance of each entity and further descriptive parameters (Fig. 4). Tabular view of prioritization: The matrix view plots parallel results in columns corresponding to all the inputs together and to each individual input node. This technique supports the understanding of the contributions of the inputs and of their redundancy and complementarity. The color scheme helps the visual tracking of the entities ranked differently by the various inputs (Fig. 5).


Fig. 4. A result of disease prioritization in the macular degeneration model.

Fig. 5. The result of the disease prioritization using all macular degeneration-related inputs (leftmost column) and the contribution of the individual inputs (other columns).

Explanation visualization: To visualize the most relevant paths (i.e. the explanations) between the input nodes and the target node, an explanation graph is exported into Cytoscape. The graph can be processed further using the add-ons and resources developed by the broad community of Cytoscape (Fig. 6).

4.6 Checking Robustness of the Results

Currently, we are implementing methods to support the comparison of results under different settings, e.g. using various inference rules, evidence weighting or semantic filtering. For example, our preliminary evaluation for the disease axis using the AMD model suggests the use of Tanimoto similarity for narrow queries and cosine similarity for broader queries with heterogeneous, soft evidences, e.g. for data analytic evidences.
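The contrast between the two similarity measures mentioned above can be seen directly on small examples. The sketch below compares Tanimoto and cosine similarity on toy evidence vectors (all values invented for illustration); the continuous Tanimoto form is used.

```python
# Tanimoto vs cosine similarity on toy evidence profiles (illustrative values).
import numpy as np

def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (a @ a + b @ b - a @ b))

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

narrow = np.array([1, 0, 0, 0, 1, 0], dtype=float)   # sharp, sparse query
broad = np.array([0.9, 0.4, 0.2, 0.1, 0.8, 0.3])     # soft, heterogeneous query
profile = np.array([1, 0, 1, 0, 1, 0], dtype=float)  # one entity's link profile

print(tanimoto(narrow, profile), cosine(narrow, profile))
print(tanimoto(broad, profile), cosine(broad, profile))
```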


Fig. 6. Explanation graphs: (A) The association between a disease and a pathway can be determined by numerous genes (B) Graph representation of the most relevant explanation between pathways (green) and diseases (blue) through genes (magenta) (Color figure online)

5 Conclusion

The availability of voluminous and heterogeneous semantically linked open data and knowledge provides an unprecedented opportunity for cross-domain fusion. However, uncertainty over the measurements and knowledge fragments, and also over the evidences poses a fundamental challenge for the practical use of these resources in research and development. We proposed an intermediate level of data and knowledge to cope with high-dimensional uncertainty, at which level quantitative relevances can be propagated through similarities and the inference process can also be semantically controlled and focused. Currently, we are evaluating the quantitative performance of the QSF system in prioritization tasks. Acknowledgments. The research has been supported by the European Union, cofinanced by the European Social Fund (EFOP-3.6.2-16-2017-00013) and by OTKA 112915. This work has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant agreement No 633589. This publication reflects only the authors’ views and the Commission is not responsible for any use that may be made of the information it contains.

References
1. Zhu, Z., et al.: Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 48(5), 481–487 (2016)
2. Chen, H., Ding, L., Wu, Z., Yu, T., Dhanapalan, L., Chen, J.Y.: Semantic web for integrated network analysis in biomedicine. Briefings Bioinform. 10(2), 177–192 (2009)


3. Williams, A.J., Harland, L., Groth, P., Pettifer, S., Chichester, C., Willighagen, E.L., Evelo, C.T., Blomberg, N., Ecker, G., Goble, C., Mons, B.: Open PHACTS: semantic interoperability for drug discovery. Drug Discov. Today 17(21–22), 1188–1198 (2012)
4. Chen, B., Wang, H., Ding, Y., Wild, D.: Semantic breakthrough in drug discovery. Synth. Lect. Semant. Web 4(2), 1–142 (2014)
5. Stevens, R., Baker, P., Bechhofer, S., Ng, G., Jacoby, A., Paton, N.W., Goble, C.A., Brass, A.: TAMBIS: transparent access to multiple bioinformatics information sources. Bioinformatics 16(2), 184–186 (2000)
6. Karim, M.R., Michel, A., Zappa, A., Baranov, P., Sahay, R., Rebholz-Schuhmann, D.: Improving data workflow systems with cloud services and use of open data for bioinformatics research. Briefings Bioinform. (2017). bbx039
7. Ginn, C.M., Willett, P., Bradshaw, J.: Combination of molecular similarity measures using data fusion. Perspect. Drug Discov. Des. 20, 1–16 (2000)
8. Lanckriet, G.R., De Bie, T., Cristianini, N., Jordan, M.I., Noble, W.S.: A statistical framework for genomic data fusion. Bioinformatics 20(16), 2626–2635 (2004)
9. Tranchevent, L.C., Ardeshirdavani, A., ElShal, S., Alcaide, D., Aerts, J., Auboeuf, D., Moreau, Y.: Candidate gene prioritization with Endeavour. Nucleic Acids Res. 44(W1), W117–W121 (2016)
10. Province, M.A., Borecki, I.B.: Gathering the gold dust: methods for assessing the aggregate impact of small effect genes in genomic scans. Pac. Symp. Biocomput. 13, 190–200 (2008)
11. Nakka, P., Raphael, B.J., Ramachandran, S.: Gene and network analysis of common variants reveals novel associations in multiple complex diseases. Genetics 204(2), 783–798 (2016)
12. Callahan, A., Cruz-Toledo, J., Ansell, P., Dumontier, M.: Bio2RDF release 2: improved coverage, interoperability and provenance of life science linked data. In: ESWC 2013. LNCS, vol. 7882, pp. 200–212. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38288-8_14
13. Chen, B., Dong, X., Jiao, D., Wang, H., Zhu, Q., Ding, Y., Wild, D.J.: Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinform. 11(1), 255 (2010)
14. Waagmeester, A., Kutmon, M., Riutta, A., Miller, R., Willighagen, E.L., Evelo, C.T., Pico, A.R.: Using the semantic web for rapid integration of WikiPathways with other biological online data resources. PLoS Comput. Biol. 12(6), e1004989 (2016)
15. Swainston, N., Batista-Navarro, R., Carbonell, P., Dobson, P.D., Dunstan, M., Jervis, A.J., Vinaixa, M., Williams, A.R., Ananiadou, S., Faulon, J.L., et al.: biochem4j: integrated and extensible biochemical knowledge through graph databases. PLoS ONE 12(7), e0179130 (2017)
16. Queralt-Rosinach, N., Piñero, J., Bravo, À., Sanz, F., Furlong, L.I.: DisGeNET-RDF: harnessing the innovative power of the semantic web to explore the genetic basis of diseases. Bioinformatics 32(14), 2236–2238 (2016)
17. Piñero, J., Bravo, À., Queralt-Rosinach, N., Gutiérrez-Sacristán, A., Deu-Pons, J., Centeno, E., García-García, J., Sanz, F., Furlong, L.I.: DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 45(D1), D833–D839 (2017)


18. Gray, A.J., Groth, P., Loizou, A., Askjaer, S., Brenninkmeijer, C., Burger, K., Chichester, C., Evelo, C.T., Goble, C., Harland, L., et al.: Applying linked data approaches to pharmacology: architectural decisions and implementation. Semant. Web 5(2), 101–113 (2014)
19. Beek, W., Rietveld, L., Schlobach, S., van Harmelen, F.: LOD Laundromat: why the semantic web needs centralization (even if we don't like it). IEEE Internet Comput. 20(2), 78–81 (2016)
20. Dong, X., Ding, Y., Wang, H., Chen, B., Wild, D.: Chem2Bio2RDF dashboard: ranking semantic associations in systems chemical biology space. Future Web Collaborative Science (FWCS), WWW (2010)
21. Kamdar, M.R., Musen, M.A.: PhLeGrA: graph analytics in pharmacology over the web of life sciences linked open data. In: Proceedings of the 26th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, pp. 321–329 (2017)
22. Soldatova, L.N., Rzhetsky, A., De Grave, K., King, R.D.: Representation of probabilistic scientific knowledge. J. Biomed. Semant. 4(Suppl. 1), S7 (2013)
23. Gottlieb, A., Stein, G.Y., Ruppin, E., Sharan, R.: PREDICT: a method for inferring novel drug indications with application to personalized medicine. Mol. Syst. Biol. 7(1), 496 (2011)
24. Callahan, A., Cifuentes, J.J., Dumontier, M.: An evidence-based approach to identify aging-related genes in Caenorhabditis elegans. BMC Bioinform. 16(1), 40 (2015)
25. Fu, G., Ding, Y., Seal, A., Chen, B., Sun, Y., Bolton, E.: Predicting drug target interactions using meta-path-based semantic network analysis. BMC Bioinform. 17(1), 160 (2016)
26. Abelló, A., et al.: Fusion cubes: towards self-service business intelligence (2013)
27. Paulheim, H.: Knowledge graph refinement: a survey of approaches and evaluation methods. Semant. Web 8(3), 489–508 (2017)
28. Domingos, P., Lowd, D., Kok, S., Poon, H., Richardson, M., Singla, P.: Just add weights: Markov logic for the semantic web. In: URSW 2005–2007. LNCS (LNAI), vol. 5327, pp. 1–25. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-89765-1_1
29. De Bie, T., Tranchevent, L.C., Van Oeffelen, L.M., Moreau, Y.: Kernel-based data fusion for gene prioritization. Bioinformatics 23(13), i125–i132 (2007)
30. Yates, A., Akanni, W., Amode, M.R., Barrell, D., Billis, K., Carvalho-Silva, D., Cummins, C., Clapham, P., Fitzgerald, S., Gil, L., et al.: Ensembl 2016. Nucleic Acids Res. 44(D1), D710–D716 (2015)
31. Jupp, S., Malone, J., Bolleman, J., Brandizi, M., Davies, M., Garcia, L., Gaulton, A., Gehant, S., Laibe, C., Redaschi, N., Wimalaratne, S.M., Martin, M., Le Novère, N., Parkinson, H., Birney, E., Jenkinson, A.M.: The EBI RDF platform: linked open data for the life sciences. Bioinformatics 30(9), 1338–1339 (2014)
32. Caniza, H., Romero, A.E., Heron, S., Yang, H., Devoto, A., Frasca, M., Mesiti, M., Valentini, G., Paccanaro, A.: GOssTo: a stand-alone application and a web tool for calculating semantic similarities on the gene ontology. Bioinformatics 30(15), 2235–2236 (2014)
33. MacArthur, J., Bowler, E., Cerezo, M., Gil, L., Hall, P., Hastings, E., Junkins, H., McMahon, A., Milano, A., Morales, J., et al.: The new NHGRI-EBI catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 45(D1), D896–D901 (2017)

Constructing a Quantitative Fusion Layer over the Semantic Level

53

34. Twigger, S., Lu, J., Shimoyama, M., Chen, D., Pasko, D., Long, H., Ginster, J., Chen, C.F., Nigam, R., Kwitek, A., et al.: Rat genome database (RGD): mapping disease onto the genome. Nucleic Acids Res. 30(1), 125–128 (2002) 35. Law, V., Knox, C., Djoumbou, Y., Jewison, T., Guo, A.C., Liu, Y., Maciejewski, A., Arndt, D., Wilson, M., Neveu, V., et al.: DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res. 42(D1), D1091–D1097 (2013) 36. Thomas, D.W., Burns, J., Audette, J., Carrol, A., Dow-Hygelund, C., Hay, M.: Clinical Development Success Rates 2006–2015. Biomedtracker/BIO/Amplion, San Diego, Washington, DC, Bend (2016)

Challenges and Advances in Measurement and Self-Parametrization of Complex Biological Systems

Effects of External Voltage in the Dynamics of Pancreatic β-Cells: Implications for the Treatment of Diabetes

Ramón E. R. González1, José Radamés Ferreira da Silva2, and Romildo Albuquerque Nogueira2(✉)

1 Laboratório de Sistemas Complexos e Universalidades, Departamento de Física, Universidade Federal Rural de Pernambuco, Recife, Pernambuco 52171-900, Brazil
2 Laboratório de Biofísica Teórico-Experimental e Computacional (LABTEC), Departamento de Morfologia e Fisiologia Animal, Universidade Federal Rural de Pernambuco, Recife, Pernambuco 52171-900, Brazil
[email protected]

Highlights
• Computational simulations were used to study the pattern of the bursts in pancreatic beta cells in response to constant voltage pulses.
• Stimulation with low-amplitude voltage pulses leads to changes in the pattern of the bursts in pancreatic beta cells.

Abstract. Studies on the influence of exposure to electric and magnetic fields on pancreatic islets are still scarce and controversial, and it is difficult to compare existing studies due to the different research methods employed. Here, computational simulations were used to study the burst patterns of pancreatic beta cells exposed to constant voltage pulses. The results show that burst patterns in pancreatic beta cells depend on the applied voltage and that some voltages may even inhibit this phenomenon. There are critical voltages, such as 2.16 mV, at which the bursts change from a medium- to a slow-oscillation phase, or 3.5 mV, which induces a transition in the bursts from the slow to the fast oscillation phase. Voltage pulses higher than 3.5 mV lead to the extinction of bursts and therefore inhibit the process of insulin secretion. These results are reinforced by phase-plane analysis.

Keywords: Computational simulations · Pancreatic β-cells · Electrical activity · Voltage pulses · Phase plane

1 Introduction

The endocrine pancreas is a key regulator of glucose homeostasis. The secretion of insulin by the pancreatic β-cells in response to high levels of blood glucose controls the glucose concentration of healthy individuals [1–3]. Pancreatic β-cells are insulin-secreting


cells located in the islets of Langerhans. When insufficient insulin is released to preserve normal levels of blood glucose [4], a systemic disease called diabetes mellitus (DM) is established. Diabetes manifests itself in two main forms, types 1 and 2. Type 1 diabetes (T1D) is a chronic autoimmune disease in which pancreatic beta cells are destroyed, and patients with this form of the disease require exogenous insulin to normalize the levels of blood glucose [5–7]. Type 2 diabetes (T2DM) is characterized by a progressive decline in β-cell function, which occurs because the ATP-dependent potassium channel no longer responds to the presence of ATP. As a result, the pancreatic β-cells do not depolarize sufficiently to induce insulin secretion [8, 9]. To control the sugar content of the blood, characteristic glucose-induced changes take place in the electrical properties of the β-cell membranes, such as depolarization, in which the membrane of these cells presents voltage bursts, indicating a marked change in membrane permeability to various ions [10]. Bursting consists of an active phase of the plasma membrane, with voltages between −60 mV and −25 mV, followed by silent phases. The bursting process is repeated after each silent period when the β-cell reaches the threshold of excitability [10, 11]. Bursting events are generated by fast and slow gating mechanisms that control the membrane potential in this cell type. Some researchers have described the mechanism of insulin secretion in β-cells with mathematical models, which are important for understanding the complex dynamics of these cells [12, 13]. These models describe the voltage variation in the pancreatic β-cell membrane, using a Hodgkin-Huxley-type model to describe the kinetics of the primary ion currents involved in the phenomenon of insulin secretion and its regulation [14, 15]. Bertram et al. [16], using the dynamic clamp technique, developed a model to describe the formation of bursts, with variables that produce bursts of varying duration and amplitude, such as those observed in pancreatic beta cells. Experimental studies suggest that exposure to low-frequency electric and magnetic fields affects cellular function through effects exerted on intracellular and membrane proteins, including ion channels, membrane receptors and enzymes [17]. One example of this is the exposure of rats to electromagnetic fields, which in the long term leads to increased insulin synthesis and secretion [18]. Exposure of pancreatic islet cells to an extremely low frequency magnetic field induces a reduction in insulin secretion and a change in calcium uptake [19]. It is also known that calcium ion content and flux, as well as insulin secretion during glucose stimulation, are reduced when isolated rabbit islets are exposed to low-frequency electromagnetic fields [20]. However, studies examining the influence of exposure to electric and magnetic fields on pancreatic islets are still scarce and controversial, and it is difficult to compare existing studies due to the different research methods employed. Our research group [15] adapted the mathematical model proposed by Bertram et al. [16] by inserting a sinusoidal voltage of 60 Hz, with amplitudes between 0.5 and 4 mV, into the equations modelling electrical activity in pancreatic β-cells. The computer simulation of the Bertram et al. [16] model with the sinusoidal voltage inserted in the current equation [15] can both decrease and increase the duration of the bursts, depending on the amplitude of the applied external voltage. The interest of the


present study is to investigate the sensitivity of the dynamics involved in the bursting process of pancreatic β-cells stimulated by constant, low-amplitude voltage pulses, aiming to find applications of this experimental protocol in the treatment of diabetes.

2 Methodology

2.1 Simulation of the Electrical Activity of Pancreatic Beta-Cells

In order to simulate the effects of constant voltage pulses on the electrical activity of pancreatic β-cells, we introduced constant voltage pulses into the model proposed by Bertram et al. [16]. This model is able to simulate all three types of bursting encountered in beta cells: fast, medium and slow. We selected the medium burst because this is the most common pattern encountered in these cells. Simulation of the electrical activity of pancreatic β-cells was performed by solving the differential equations described in Bertram et al. [16] and shown below:

$$\frac{dV}{dt} = -\frac{I_{Ca} + I_K + I_{S1} + I_{S2} + I_L}{C_m} \qquad (1)$$

where dV/dt is the time derivative of the membrane potential and C_m is the capacitance of the membrane. The currents are I_Ca (calcium), I_K (potassium), I_S1 or I_KCa (calcium-activated potassium), I_S2 or I_KATP (ATP-activated potassium) and I_L (leakage), and are given by:

$$I_{Ca} = G_{Ca}\, m_{\infty}(V)\, (V - V_{Ca} + V_E) \qquad (2)$$

$$I_{K} = G_{K}\, n\, (V - V_{K} + V_E) \qquad (3)$$

$$I_{S1} = G_{S1}\, S_1\, (V - V_{K} + V_E) \qquad (4)$$

$$I_{S2} = G_{S2}\, S_2\, (V - V_{K} + V_E) \qquad (5)$$

$$I_{L} = G_{L}\, (V - V_{L} + V_E) \qquad (6)$$

The G's are the conductances for the calcium, potassium, calcium-activated potassium and ATP-activated potassium ions, respectively, as well as the conductance for the leakage current. V = V(t) is the voltage across the membrane over time; V_Ca, V_K and V_L are the reversal potentials for calcium, potassium and leakage; and V_E is the applied external voltage. The dynamics of the gating particles are represented by the following differential equations:

$$\frac{dn}{dt} = \frac{n_{\infty}(V) - n}{\tau_n} \qquad (7)$$

$$\frac{dS_1}{dt} = \frac{S_{1\infty}(V) - S_1}{\tau_{S1}} \qquad (8)$$

$$\frac{dS_2}{dt} = \frac{S_{2\infty}(V) - S_2}{\tau_{S2}} \qquad (9)$$

The gating particles m, n, S_1 and S_2 respectively control the kinetics of the calcium, voltage-dependent potassium, calcium-dependent potassium and ATP-dependent potassium channels. The steady-state activation curves of the gating particles and the time constant τ_n are sigmoid functions of the membrane voltage, as presented in the equations below:

$$m_{\infty}(V) = \frac{1}{1 + e^{(-22 - V)/7.5}} \qquad (10)$$

$$n_{\infty}(V) = \frac{1}{1 + e^{(-9 - V)/10}} \qquad (11)$$

$$S_{1\infty}(V) = \frac{1}{1 + e^{(-40 - V)/0.5}} \qquad (12)$$

$$S_{2\infty}(V) = \frac{1}{1 + e^{(-42 - V)/0.4}} \qquad (13)$$

$$\tau_n(V) = \frac{1}{1 + e^{(V + 9)/10}} \qquad (14)$$

The parameters and values used in the simulation of the electrical activity of pancreatic beta cells are presented in Table 1.

Table 1. Parameter values used in the simulation of the electrical activity of pancreatic beta cells.

Parameter                                        Value
Membrane capacitance (Cm)                        4524 fF
Conductance of the fast Ca2+ channel (GCa)       280 pS
Conductance of the fast K+ channel (GK)          1300 pS
Conductance of the leakage current (GL)          25 pS
Conductance of the slow K+ channel (GS1)         7 pS
Conductance of the very slow K+ channel (GS2)    32 pS
Reversal potential of Ca2+ (VCa)                 100 mV
Reversal potential of K+ (VK)                    −80 mV
Reversal potential of leakage (VL)               −40 mV
Time constant of S1 (τS1)                        1 s
Time constant of S2 (τS2)                        2 min


The initial conditions used were V = −43 mV, n = 0.03, S1 = 0.1 and S2 = 0.434 [21], with GS1 = 7 pS, and the integration time was equal to 600 s. To simulate the applied external voltage pulses, the value of the constant pulse voltage was added to the membrane voltage in the equations above.

2.2 Computational Routine and Statistical Analysis

In order to solve the above-mentioned system of equations, we used XPPAUT [22]. Initially, XPPAUT loads all the equations with their parameters and initial conditions. The system of equations is then solved using the CVODE routine, a package written in C for numerically solving systems of differential equations with their initial conditions [23]. The program can output graphs of the different variables under study, thus enabling continuous observation of possible changes in these variables over time. We used the ANOVA statistical test and the post-hoc Tukey test whenever necessary, at a significance level of 5%.
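Although the original study used XPPAUT with the CVODE solver, the dynamics are straightforward to reproduce in any ODE environment. The following minimal sketch re-implements Eqs. (1)–(14) with the Table 1 parameters in Python, using SciPy's LSODA integrator as an assumed stand-in for CVODE; it is an illustration, not the authors' code, and it assumes time is measured in seconds (so τS1 = 1 and τS2 = 120).

```python
# Minimal re-implementation sketch of the modified Bertram et al. model
# (Eqs. 1-14, parameters from Table 1). Assumes time in seconds.
import numpy as np
from scipy.integrate import solve_ivp

Cm = 4524.0                            # membrane capacitance (fF)
GCa, GK, GL = 280.0, 1300.0, 25.0      # conductances (pS)
GS1, GS2 = 7.0, 32.0                   # slow / very slow K+ conductances (pS)
VCa, VK, VL = 100.0, -80.0, -40.0      # reversal potentials (mV)
tauS1, tauS2 = 1.0, 120.0              # time constants: 1 s and 2 min
VE = 2.0                               # external voltage pulse amplitude (mV)

def m_inf(V):  return 1.0 / (1.0 + np.exp((-22.0 - V) / 7.5))   # Eq. (10)
def n_inf(V):  return 1.0 / (1.0 + np.exp((-9.0  - V) / 10.0))  # Eq. (11)
def s1_inf(V): return 1.0 / (1.0 + np.exp((-40.0 - V) / 0.5))   # Eq. (12)
def s2_inf(V): return 1.0 / (1.0 + np.exp((-42.0 - V) / 0.4))   # Eq. (13)
def tau_n(V):  return 1.0 / (1.0 + np.exp((V + 9.0) / 10.0))    # Eq. (14)

def rhs(t, y):
    V, n, S1, S2 = y
    ICa = GCa * m_inf(V) * (V - VCa + VE)    # Eq. (2)
    IK  = GK  * n  * (V - VK + VE)           # Eq. (3)
    IS1 = GS1 * S1 * (V - VK + VE)           # Eq. (4)
    IS2 = GS2 * S2 * (V - VK + VE)           # Eq. (5)
    IL  = GL  * (V - VL + VE)                # Eq. (6)
    dV  = -(ICa + IK + IS1 + IS2 + IL) / Cm  # Eq. (1)
    return [dV,
            (n_inf(V) - n)   / tau_n(V),     # Eq. (7)
            (s1_inf(V) - S1) / tauS1,        # Eq. (8)
            (s2_inf(V) - S2) / tauS2]        # Eq. (9)

y0 = [-43.0, 0.03, 0.1, 0.434]               # initial conditions (Sect. 2.1)
sol = solve_ivp(rhs, (0.0, 600.0), y0, method="LSODA", max_step=0.05)
```

Varying VE in this sketch reproduces the qualitative dependence of burst duration on the applied pulse amplitude discussed in the Results.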

3 Results

3.1 Simulations

The simulation of the Bertram et al. [16] model, modified by inserting constant low-amplitude external voltage pulses into the current equations, revealed that the pulses may change the behaviour of both the burst and the silent period. Figures 1A and B demonstrate that voltage pulses with amplitudes between 0.5 and 2.5 mV increase the duration of the bursts and the silent periods in comparison to the control. Voltage values higher than 2 mV changed the profile of the electrical activity of pancreatic beta cells (Fig. 1C), and values of 3.5 mV and over inhibited the electrical activity of pancreatic beta cells (Fig. 1D). The durations of the bursts for voltage pulses with amplitudes of 1, 1.5 and 2.0 mV were significantly different from the control, although a pulse of 0.5 mV did not significantly alter the duration of the burst (Fig. 2). Figure 2B demonstrates that the durations of the silent periods are significantly different from the control (0 mV) for all voltage pulse amplitudes between 0.5 and 2.0 mV. In Fig. 3 it may be observed that the kinetics of the gating particles S1 and S2 present time periods similar to those of the bursts and silent periods, both under the control condition (Fig. 3, top) and in the presence of a voltage pulse of 2 mV (Fig. 3, bottom panel). These data demonstrate that voltage pulses may change the kinetics of the gating particles S1 and S2, and thus modify the electrical activity of pancreatic beta cells, increasing their burst and silent periods. From an external voltage of 2.16 mV onwards, there is a reduction in the duration of both the silent and the burst periods (Fig. 4, top panel).


Fig. 1. Simulation of the electrical activity of pancreatic beta cells. (A) control; (B) voltage pulse of 2 mV; (C) voltage pulse of 2.5 mV; (D) voltage pulse of 3.5 mV.

Fig. 2. Duration of burst (A) and silent phase (B) at different values of voltage pulses (0.5–2 mV). Asterisks (*) indicate a statistically significant difference in relation to the control (0 mV).

The data obtained also reveal that the application of an external voltage higher than 2.16 mV produces a delay in the time required to trigger the first burst (Fig. 4, bottom panel). This stems from the fact that, for voltages higher than 2.16 mV, the gating particles that control the burst take longer to activate, which is directly reflected in the electrical activity of the burst. This relationship between the kinetics of the ion channels, the phases of the burst, and insulin secretion and exocytosis implies changes in the ionic conductance of the membrane of pancreatic beta cells, directly influencing the behaviour of the burst and the silent periods.


Fig. 3. Simulation of the electrical activity and the gating particles that regulate the K-Ca (S1) and K-ATP (S2) channels in pancreatic beta cells. Top: control; bottom: voltage pulse with an amplitude of 2 mV.


Fig. 4. Top panel: duration of burst (open circles) and silent (closed squares) phases for different values of applied external voltage. Bottom panel: beginning of the burst phase for different values of applied external voltage.


Simulations conducted by Bertram et al. [2, 3] and Watts et al. [24] reproduced three types of oscillations (slow, medium and fast), which are directly related to the values of the membrane conductance for the different ionic species. The conductance for the calcium-activated potassium channels, GS1, is approximately 2 pS, 7 pS and 20 pS for each type of oscillation, respectively [24]. The simulations performed in our work were able to reproduce the different types of oscillations found by other authors [2, 24], as can be seen in the figures shown in our results.

Fig. 5. Medium-to-slow burst transition. Voltage dynamics and gating particles for different values of the applied external voltage. (A) control; (B) 1.0 mV; (C) 2.0 mV; (D) 2.16 mV.

It is possible to observe in Fig. 5 that the kinetics of the gating particles accompany the burst phase as well as the silent phase, both in the control situation and in the presence of an external voltage, for a medium oscillation. This type of oscillation is controlled by the gating particles S1 and S2. When the external voltage is increased, a transition to the slow oscillation phase can be noted, controlled basically by the gating particle S2. This is equivalent to decreasing the calcium-activated potassium conductance (GS1). For external voltages higher than 2.16 mV the system transits to the fast oscillation phase, which is principally controlled by the gating particle S1, while the particle S2 remains practically constant. In this phase, high GS1 values, around 20 pS, can be observed. Thus, the transitions among the different oscillation modes can be controlled by the gating particles S1 and S2, which are sensitive to external voltages. In the phase transition that


Fig. 6. Slow-to-fast burst transition. Voltage dynamics and gating particles for different values of the applied external voltage. (A) 3.5 mV; (B) 4.0 mV; (C) 5.0 mV; (D) 6.0 mV.

occurs at an external voltage equal to 2.16 mV, the S2 gating particle reaches its highest values for both period and amplitude. Another phase transition can be observed at 3.5 mV, a voltage at which the bursts practically vanish and the process is principally controlled by S1 (Fig. 6).

3.2 Phase Space Analysis

We analyze the model in a 2D phase space. The trajectories of the time-dependent variables S1 and V are shown in Fig. 7. In the case of fast bursting, the slow variable S2 is nearly constant. On the other hand, for medium and slow bursting, S2 can no longer be considered constant, but it varies very slowly and can be described as a quasi-steady state [3]. In this case, we use the mean value of S2 for each external voltage value between 0 and 5.0 mV. With S2 fixed, the phase-space analysis allows the system to be described through only two variables, S1 and V, by setting dV/dt = 0, m = m∞(V) and n = n∞(V) in the equations of the model. Under these conditions, Z-curves (graphs of V versus S1(V)) were plotted. The trajectories in the phase plane can be analyzed through the intercepts of the curves S1∞(V) and dV/dt = 0. In the figure, we can see regions separated by the equilibrium curves. For external voltage values between 0 and 2.16 mV, the Z-curve shows a shift to the left. This displacement is more pronounced in the upper "knee", corresponding to the active region. For external voltage values between 2.5 and 5.0 mV the Z-curve


then presents a "flattening", characterized by the continuation of the previous displacement in the active region and by an inversion of the displacement in the lower "knee", corresponding to the silent region. Due to the shape and singular behavior of the Z-curve in the active region (that of the stable solutions), we can see an "approximation" of the Z-curve as the external voltage increases, reducing the range of voltage values accessible to the system. At the same time, exactly the opposite happens in the silent region: there is an increase in the range of accessible voltage values. In the second range of external voltages, between 2.5 mV and 5.0 mV, the active region of the Z-curve continues its

Fig. 7. Top panel: phase-plane analysis. The sigmoid black curves represent the S1 nullclines. The colored curves (Z-curves) are the V nullclines. The arrows indicate the movements of the Z-curves and the stable solutions. Bottom panel: behavior of the variable S2 as a function of the applied external voltage. The a values are the slopes of the straight lines, representing the rates of decrease of the variable S2 in each regime of the external voltage.


displacement without apparent changes until the stable solutions relative to this region disappear completely, at an external voltage of 5.0 mV. Solutions corresponding to the silent region also tend to disappear in this range of external voltage values, with the displacement to the right of the lower "knee" of the Z-curve. This result can be compared with the graphs of Figs. 1, 3, 5 and 6, where we can see a decrease in burst amplitude as the applied external voltage increases. From a simple analysis of the behavior of the variable S2 as a function of the external voltage, shown in Fig. 7, we can see that the disturbance caused by the application of external voltage pulses decreases the mean value of the variable S2, but two regimes appear: one corresponding to voltage values between 0 and 2.16 mV, and another for higher voltage values, up to 5.00 mV. The rate of decrease of the variable differs between the two regimes, and from 2.16 mV of applied external voltage, S2 begins to decrease twice as fast.
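To make the Z-curve construction concrete: with S2 frozen at its mean value and with m = m∞(V) and n = n∞(V), the condition dV/dt = 0 in Eq. (1) can be solved in closed form for S1 as a function of V. The sketch below, which reuses the parameter names and steady-state functions of the simulation sketch in Sect. 2.2, illustrates this; the mean value chosen for S2 and the voltage grid are purely illustrative.

```python
def z_curve_S1(V, S2_bar, VE=0.0):
    """V-nullcline (Z-curve): the S1 value solving dV/dt = 0 in Eq. (1)
    for a given V, with m = m_inf(V), n = n_inf(V), S2 fixed at S2_bar."""
    drive = V - VK + VE  # common K+ driving term of Eqs. (3)-(5)
    num = (GCa * m_inf(V) * (V - VCa + VE)
           + GK * n_inf(V) * drive
           + GS2 * S2_bar * drive
           + GL * (V - VL + VE))
    return -num / (GS1 * drive)

V_grid = np.linspace(-75.0, -20.0, 500)
S1_null = z_curve_S1(V_grid, S2_bar=0.4, VE=2.16)  # S2_bar is illustrative
```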

4 Conclusions

Our first conclusion from the results obtained is that, under the application of low-amplitude external voltages, the computational simulation shows modifications in the burst pattern. Physiologically, this reflects the release of insulin, being equivalent to modifying the membrane conductance to potassium ions, which may be helpful in the treatment of diabetes. A very important conclusion from these results is that, in normal individuals, the application of, or exposure to, external voltages, even of low amplitude, can cause electrical disturbances in the dynamics of pancreatic beta cells, provoking a possible decrease in hormone secretion and exocytosis. On the other hand, in the case of diabetic patients, controlled application of low-amplitude external voltages may be an alternative for the regulation and control of blood glucose. The widening of the bursts for some bands of external voltages suggests a possible increase in insulin production. The application of low-amplitude external voltages induces transitions between the different oscillation regimes leading to bursting, with external voltage values of the order of 5.00 mV reflecting the inhibition of the insulin secretion process. The external voltage values VE = 2.16 mV and VE = 3.5 mV can be used as references for therapy, depending on the characteristics of the patient and their pathology. Regarding the value VE = 3.5 mV, we can say that this external voltage induces a change in the dynamics of the insulin secretion process through the variation of the potassium conductance GS1, which is indirectly modified by the application of external voltage pulses. Finally, comparing the results of the phase-plane analysis and of the behavior of the variable S2 under external voltage with the previous ones, we can associate the regimes of decrease of S2 and the change in the direction of displacement of the Z-curves with possible transitions between the different oscillation regimes, complementing the results mentioned above.


References

1. Pedersen, M.G.: Phantom bursting is highly sensitive to noise and unlikely to account for slow bursting in β-cells: considerations in favor of metabolically driven oscillations. J. Theor. Biol. 248, 391–400 (2007)
2. Bertram, R., Rhoads, J., Cimbora, W.P.: A phantom bursting mechanism for episodic bursting. Bull. Math. Biol. 70, 1979–1993 (2008)
3. Bertram, R., Previte, J., Sherman, A., Kinard, T.A., Satin, L.S.: The phantom burster model for pancreatic beta-cells. Biophys. J. 79, 2880–2892 (2000)
4. D'Aleo, V., Mancarella, R., Del Guerra, S., Boggi, U., Filipponi, F., Marchetti, P., Lupi, R.: Direct effects of rapid-acting insulin analogues on insulin signaling in human pancreatic islets in vitro. Diabetes Metab. 37, 324–329 (2011)
5. Ashcroft, F.M., Rorsman, P.: Diabetes mellitus and the β cell: the last ten years. Cell 148, 1160–1171 (2012)
6. Colli, M.L., Moore, F., Gurzov, E.N., Ortis, F., Eizirik, D.L.: MDA5 and PTPN2, two candidate genes for type 1 diabetes, modify pancreatic beta-cell responses to the viral by-product double-stranded RNA. Hum. Mol. Genet. 19, 135–146 (2010)
7. Rorsman, P., Eliasson, L., Kanno, T., Zhang, Q., Gopel, S.: Electrophysiology of pancreatic β-cells in intact mouse islets of Langerhans. Prog. Biophys. Mol. Biol. 107, 224–235 (2011)
8. Gao, J., Zhong, X., Ding, Y., Bai, T., Wang, H., Wu, H., Liu, Y., Yang, J., Zhang, Y.: Inhibition of voltage-gated potassium channels mediates uncarboxylated osteocalcin-regulated insulin secretion in rat pancreatic β cells. Eur. J. Pharmacol. 777, 41–48 (2016)
9. Zhao, Y., Shi, K., Su, X., Xie, L., Yan, Y.: Microcystin-LR induces dysfunction of insulin secretion in rat insulinoma (INS-1) cells: implications for diabetes mellitus. J. Hazard. Mater. 314, 11–21 (2016)
10. Fridlyand, L.E., Tamarina, N., Philipson, L.H.: Bursting and calcium oscillations in pancreatic beta-cells: specific pacemakers for specific mechanisms. Am. J. Physiol. Endocrinol. Metab. 299, E517–E532 (2010)
11. Sherman, A.: Lessons from models of pancreatic beta cells for engineering glucose-sensing cells. Math. Biosci. 227, 12–19 (2010)
12. Félix-Martínez, G.J., Godínez-Fernández, J.R.: Modeling Ca2+ currents and buffered diffusion of Ca2+ in human β-cells during voltage clamp experiments. Math. Biosci. 270, 66–80 (2015)
13. Riz, M., Braun, M., Pedersen, M.G.: Mathematical modeling of heterogeneous electrophysiological responses in human β-cells. PLoS Comput. Biol. 10, e1003389 (2014)
14. Benninger, R.K.P., Piston, D.W.: Cellular communication and heterogeneity in pancreatic islet insulin secretion dynamics. Trends Endocrinol. Metab. 25, 399–406 (2014)
15. Neves, G.F., Silva, J.R.F., Moraes, R.B., Fernandes, T.S., Tenorio, B.M., Nogueira, R.A.: 60 Hz electric field changes the membrane potential during burst phase in pancreatic β-cells: in silico analysis. Acta Biotheor. 62, 133–143 (2014)
16. Bertram, R., Sherman, A., Satin, L.S.: Electrical bursting, calcium oscillations, and synchronization of pancreatic islets. In: Islam, M. (ed.) The Islets of Langerhans, vol. 654, pp. 261–279. Springer, Dordrecht (2010). https://doi.org/10.1007/978-90-481-3271-3_12
17. Grassi, C., D'Ascenzo, M., Torsello, A., Martinotti, G., Wolf, F., Cittadini, A., Azzena, G.B.: Effects of 50 Hz electromagnetic fields on voltage-gated Ca2+ channels and their role in modulation of neuroendocrine cell proliferation and death. Cell Calcium 35, 307–315 (2004)


18. Laitl-Kobierska, A., Cieslar, G., Sieron, A., Grzybek, H.: Influence of alternating extremely low frequency (ELF) magnetic field on structure and function of pancreas in rats. Bioelectromagnetics 23, 49–58 (2002)
19. Sakurai, T., Satake, A., Sumi, S., Inoue, K., Miyakoshi, J.: An extremely low frequency magnetic field attenuates insulin secretion from the insulinoma cell line RIN-m. Bioelectromagnetics 25, 160–166 (2004)
20. Jolley, W.B., Hinshaw, D.B., Knierim, K., Hinshaw, D.B.: Magnetic field effects on calcium efflux and insulin secretion in isolated rabbit islets of Langerhans. Bioelectromagnetics 4, 103–106 (1983)
21. Sheik Abdulazeez, S.: Diabetes treatment: a rapid review of the current and future scope of stem cell research. Saudi Pharm. J. 23, 333–340 (2013)
22. Ermentrout, B.: Simulating, Analyzing, and Animating Dynamical Systems. Society for Industrial and Applied Mathematics, Philadelphia (2002)
23. Cohen, S.D., Hindmarsh, A.C.: CVODE, a stiff/nonstiff ODE solver in C. https://computation.llnl.gov/casc/nsde/pubs/u121014.pdf. Accessed 12 May 2017
24. Watts, M., Tabak, J., Zimliki, C., Sherman, A., Bertram, R.: Slow variable dominance and phase resetting in phantom bursting. J. Theor. Biol. 276, 218–228 (2011)

ISaaC: Identifying Structural Relations in Biological Data with Copula-Based Kernel Dependency Measures

Hossam Al Meer1, Raghvendra Mall1(✉), Ehsan Ullah1, Nasreddine Megrez2, and Halima Bensmail1(✉)

1 Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
{rmall,eullah,hbensmail}@hbku.edu.qa
2 Department of Mathematics, Al Faisal University, Riyadh, Kingdom of Saudi Arabia
[email protected]

Abstract. The goal of this paper is to develop a novel statistical framework for inferring dependence between distributions of variables in omics data. We propose the concept of building a dependence network using copula-based kernel dependency measures to reconstruct the underlying association network between the distributions. ISaaC is utilized for reverse-engineering gene regulatory networks and is competitive with several state-of-the-art gene regulatory inference methods on the DREAM3 and DREAM4 Challenge datasets. An open-source implementation of ISaaC is available at https://bitbucket.org/HossamAlmeer/isaac/.

1 Introduction

Changes in the environment and external stimuli lead to variations in gene expression, through which living systems adapt to function properly. However, abnormalities in this tightly coordinated process are precursors to many pathologies. A vital role is played by transcription factors (TFs), proteins that bind to the DNA in the regulatory regions of specific target genes. These TFs can then repress or induce the expression of the target genes. Many such transcriptional regulations have been discovered through traditional molecular biology experiments, and several of these mechanistic regulatory interactions have been documented in TF-target gene databases [25]. With the availability of high-throughput experimental techniques for efficiently measuring gene expression, such as DNA micro-arrays and RNA-Seq, the aim now is to design computational models for gene regulatory network (GRN) inference [34]. The accurate reconstruction of GRNs from diverse gene expression information sources is one of the most important problems in biomedical research [12]. This is primarily because precisely reverse-engineered GRNs can reveal mechanistic hypotheses about differences between phenotypes and sources of diseases [25], which can ultimately help in drug discovery and


bio-engineering. The problem of inferring GRNs from heterogeneous information sources, such as dynamic time-series data, gene knockout and knockdown expressions, protein-protein interactions, etc., is one of the most actively pursued problems in computational biology [17] and has given rise to several DREAM challenges, including the DREAM3 and DREAM4 challenges. In this paper, we propose a statistical framework to infer the gene regulatory network from expression data using a copula-based kernel dependency measure. Measuring the dependence of random variables is one of the main concerns of statistical inference. A typical example is the inference of a graphical model, which expresses relations among variables in terms of independence and conditional independence. Independent component analysis [10] employs a measure of independence as its objective function, and feature selection in supervised learning looks for a set of features on which the response variable depends. The best-understood dependence measure is the Shannon mutual information [32], which has found various applications [15,16,20]. Despite being the most popular dependence measure, it is only one of many: it is a particular instance of the Rényi-α mutual information [28], and there exist others such as the Tsallis-α mutual information [38]. Other interesting dependence measures include the maximal correlation coefficient [29], the kernel mutual information [9], the generalized variance and kernel canonical correlation analysis [1], the Hilbert-Schmidt independence criterion [8], the Schweizer-Wolff measure [31], and the distance-based correlation [37]. On the other hand, kernel methods have been used successfully for capturing (conditional) dependence of variables [1,2,9,36,38]. With the ability to represent high-order moments, mapping variables into reproducing kernel Hilbert spaces (RKHSs) allows us to infer properties of the distributions, such as independence and homogeneity [28]. All the existing dependence estimators have their own shortcomings. For example, the bound on the convergence rate of the Rényi and Tsallis information estimators [22] suffers from the curse of dimensionality. The available reproducing-kernel-based dependence measures are not invariant to strictly increasing transformations of the marginal random variables. The estimator of Székely [37] is not robust: one single large enough outlier can arbitrarily ruin the estimator. Here we propose a new dependency measure, ISaaC, based on the Maximum Mean Discrepancy (MMD) [2], which overcomes the above limitations. The main idea is to combine empirical copula transformations with reproducing-kernel-based divergence estimators. The resulting estimator is consistent, robust to outliers and, since it only uses rank statistics, simple to derive and computationally cheap. The empirical copula transformation only slightly affects the convergence rate, while the resulting dependence estimator possesses all the properties mentioned above. Moreover, we propose a novel dependence measure for estimating differences in distribution, based on the MMD. We will show that it can take advantage of both the kernel and the copula trick; hence, it is applicable to all data types arising in bioinformatics, from high-dimensional vectors to strings and graphs. In our experiments, we apply ISaaC to infer gene


regulatory networks and show that it is competitive with several state-of-the-art inference methods [11,14,20,23,24,33,40] on DREAM3 and DREAM4 Challenge datasets [18,27].

2 MMD and the Two-Sample Problem

In this section, we review some important properties of the Maximum Mean Discrepancy (MMD), a quantity used to measure the distance between distributions [2,6]. An appealing property of this quantity is that it can be efficiently estimated from independent and identically distributed (i.i.d.) samples. One of the main questions asked in statistics is the two-sample or homogeneity problem [13]. The principle underlying the MMD measure is that we want to find a function that assumes different expectations on two different distributions. We then evaluate this function on empirical samples from the distributions, such that it informs us whether the distributions from which they have been drawn are likely to differ. This leads to the following statistic, which is closely related to the one proposed in [6]. Here and below, X denotes our input domain and is assumed to be a non-empty compact set. Let F be a class of functions, P and Q be probability distributions, and X = (X1, ..., Xn) and Y = (Y1, ..., Ym) be samples composed of i.i.d. observations drawn from P and Q, respectively. We define the maximum mean discrepancy (MMD) between P and Q on the function class F, and its empirical estimate, as:

$$\mathrm{MMD}[\mathcal{F}, P, Q] = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_P[f(X)] - \mathbb{E}_Q[f(Y)] \right)$$

$$\mathrm{MMD}[\mathcal{F}, \mathbf{X}, \mathbf{Y}] = \sup_{f \in \mathcal{F}} \left( \frac{1}{n} \sum_{i=1}^{n} f(X_i) - \frac{1}{m} \sum_{i=1}^{m} f(Y_i) \right)$$

Here E[·] denotes the expectation. Intuitively, it is clear that MMD[F, P, Q] will be zero if and only if P = Q. However, for finite samples X and Y, the resulting measure will in general differ from zero.

2.1 MMD for Kernel Functions

We now introduce a class of functions for which MMD may easily be computed, while retaining the ability to detect all discrepancies between P and Q without making any simplifying assumptions. Let H = {f : X → ℝ} be an RKHS with feature map φ(x) ∈ H (x ∈ X) and kernel k(x, y) = ⟨φ(x), φ(y)⟩_H. It is well known [7] that φ(x) = k(·, x) and f(x) = ⟨f, φ(x)⟩_H, which is called the reproducing property of the RKHS. We also need to define universal kernels.

Definition 1 (Universal kernel). A kernel k is universal whenever the associated RKHS is dense in the space of bounded continuous functions with respect to the ∞-norm.


It has been shown in [35] that the Gaussian and Laplace kernels are universal. For general function sets, MMD[F, P, Q] can be difficult to calculate and is not even symmetric in P and Q. Nonetheless, when F is the unit ball of a universal RKHS, then for all f ∈ F we also have −f ∈ F, which implies that MMD[F, P, Q] = MMD[F, Q, P]. Furthermore, in this case the estimator of MMD²[F, P, Q] has a simple form that makes efficient estimation possible [2].

Lemma 1. When F is the unit ball of an RKHS H and E_P[k(·, x)] < ∞, then

$$\mathrm{MMD}^2[\mathcal{F}, P, Q] = \mathbb{E}_{P,P}[k(X, X')] - 2\,\mathbb{E}_{P,Q}[k(X, Y)] + \mathbb{E}_{Q,Q}[k(Y, Y')] \qquad (1)$$

where X and X' have distribution P, Y and Y' have distribution Q, and these random variables are all independent from each other.

An unbiased estimator for MMD²[F, P, Q] (when m = n) was derived in [26] as:

$$\mathrm{MMD}^2_u[\mathcal{F}, X, Y] = \frac{1}{m(m-1)} \sum_{i \neq j} h(\Lambda_i, \Lambda_j), \qquad (2)$$

which is a one-sample U-statistic belonging to the class of unbiased statistics, with h(Λi, Λj) = k(Xi, Xj) + k(Yi, Yj) − k(Xi, Yj) − k(Xj, Yi), where Λi = (Xi, Yi) and Λ = (Λ1, ..., Λn) are i.i.d. random variables. From the right-hand side of Eq. (1), one can see that E[h(Λi, Λj)] = MMD²[F, P, Q], which proves the unbiasedness of this estimator. In the remainder of the paper, we will always assume that F is the unit ball of an RKHS H. A biased estimator (for m ≠ n) of MMD[F, P, Q] can easily be given using the law of large numbers:

$$\mathrm{MMD}^2_b[\mathcal{F}, X, Y] = \frac{1}{n(n-1)} \sum_{i \neq j} k(X_i, X_j) + \frac{1}{m(m-1)} \sum_{i \neq j} k(Y_i, Y_j) - \frac{2}{nm} \sum_{i,j=1}^{n,m} k(X_i, Y_j) \qquad (3)$$
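For concreteness, the following minimal Python sketch implements the unbiased estimator of Eq. (2), assuming a Gaussian kernel (the kernel eventually adopted in Sect. 3.2) and equal sample sizes. The O(n²) double loop is a direct transcription of the formula for clarity, not the authors' released implementation, and the bandwidth σ is an arbitrary placeholder.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d = np.atleast_1d(x) - np.atleast_1d(y)
    return float(np.exp(-np.dot(d, d) / (2.0 * sigma ** 2)))

def mmd2_unbiased(X, Y, kernel=gaussian_kernel):
    """Unbiased MMD^2 estimator of Eq. (2); assumes len(X) == len(Y)."""
    n = len(X)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += (kernel(X[i], X[j]) + kernel(Y[i], Y[j])
                          - kernel(X[i], Y[j]) - kernel(X[j], Y[i]))
    return total / (n * (n - 1))
```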

3 Copula Transformation

In this section, we review the dependence model and then the use of the copula transformation for the dependence measure. Given random variables Xi with continuous distribution functions Fi, i = 1, ..., n, and U1 = F1(X1), U2 = F2(X2), ..., Un = Fn(Xn), we have a joint distribution function F with marginal distribution functions F1, F2, ..., Fn such that:

$$F(x_1, x_2, \ldots, x_n) = C(F_1(x_1), \ldots, F_n(x_n)) \qquad (4)$$

The C(·) above is exactly a copula function, a distribution function on [0,1]^n with standard uniform marginals. C(·) is the distribution function of the random vector (U1, ..., Un), which means C(u1, ..., un) = Pr(U1 ≤ u1, ..., Un ≤ un), i.e., the cumulative distribution function (cdf) of the copula-transformed, uniformly distributed variables U1 = F1(X1), U2 = F2(X2), ..., Un = Fn(Xn). Equation (4) and Sklar's theorem [39] couple the continuous marginal distribution functions F1, F2, ..., Fn to the joint distribution function F via the copula C(·), where C(·) is a unique distribution function on the range of the cdfs Fi. This theorem decomposes the joint distribution function F(x1, x2, ..., xn) into the marginal distributions and the copula function, as indicated below:

$$F(X_1, \ldots, X_n) = P(X_1 \le x_1, \ldots, X_n \le x_n) = C(F_1(x_1), \ldots, F_n(x_n)) \qquad (5)$$

We illustrate the behavior of dependence in Fig. 1 using a three-dimensional example: the data X, Y and Z are generated from a normal distribution with correlation coefficients r = (0.4, 0.2, −0.8). Their transformations are uniformly distributed in [0, 1] but capture the dependence structure. A detailed summary of copulas is given in [21]. In our case, we use a simple and efficient algorithm in which the unknown distribution functions F1, F2, ..., Fn are estimated efficiently by the empirical distribution functions

$$\hat{F}_j(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{X_i \le x\}},$$

where 1{A} is the indicator of the event A. For a fixed x, the indicator 1{Xi ≤ x} is a Bernoulli random variable with parameter p = F(x); hence n F̂n(x) is a binomial random variable with mean n F(x) and variance n F(x)(1 − F(x)). This implies that F̂n(x) is an unbiased estimator of F(x).

Fig. 1. Copula-based dependence on random variables. (a) Three correlated variables generated using the joint distribution based on the correlation coefficients r = (0.4, 0.2, −0.8). (b) The variables U are uniformly distributed in [0, 1] and capture the true dependence structure. (c) 3D projection of the three variables.

We call the maps F and F̂ the copula transformation and the empirical copula transformation, respectively. The effect of the empirical copula


transformation can be studied by a version of the classical Dvoretzky-Kiefer-Wolfowitz theorem [22]. As a simple implication of this theorem, one can show that F̂ is a consistent estimator of F, and that the convergence is uniform. Note that:

$$\hat{F}_j(X_i) = \frac{1}{n}\,\mathrm{rank}(X_i;\, X_1, \ldots, X_n) \qquad (6)$$

where rank(x, X) is the number of elements of X less than or equal to x. The sample (Ẑ1, ..., Ẑn) = (F̂(X1), ..., F̂(Xn)) is called the empirical copula [4]. Also, observe that the random variables Ẑ1, ..., Ẑn are not even independent. Nonetheless, the empirical copula (Ẑ1, ..., Ẑn) is a good approximation of an i.i.d. sample (Z1, ..., Zn) = (F(X1), ..., F(Xn)) from the copula distribution of PX.
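Because the empirical copula uses only rank statistics, it is cheap to compute. The following sketch (our own illustration, assuming NumPy and SciPy are available) performs the column-wise rank transform of Eq. (6); rankdata with method="max" counts the number of sample elements less than or equal to each point, matching the definition of rank(x, X) above.

```python
import numpy as np
from scipy.stats import rankdata

def empirical_copula(X):
    """Map each column of the (n x d) sample X into [0, 1] via Eq. (6):
    Z_hat[i, j] = rank(X[i, j] among column j) / n."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    return np.column_stack([rankdata(X[:, j], method="max") / n
                            for j in range(X.shape[1])])
```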

(6)

where rank(x, X) is the number of elements of X less than or equal to x. The sample (Zˆ1 , . . . , Zˆn ) = (Fˆ1 , . . . , Fˆn ) is called the empirical copula [4]. Also, observe that the random variables Zˆ1 , . . . , Zˆn are not even independent. Nonetheless, the empirical copula (Zˆ1 , . . . , Zˆn ) is a good approximation of an i.i.d. sample (Z1 , . . . , Zn ) = (F (X1 ), . . . , F (Xn )) from the copula distribution of PX . 3.1

MMD Dependence Measure

We now propose a two-sample measure based on the asymptotic distribution of an unbiased estimate of MMD2 , which applies in the case where F is a unit ball in a RKHS, and m = n. Lemma 2. Let the kernel k be universal on [0, 1] × [0, 1]. Then I(X, Y ) = MMD[F, P, Q] = 0, if and only if X and Y are independent of each other. We will provide a consistent estimator I for I(·). Let k : R × R → R be a kernel function of RKHS H, and let Z = F (X, Y ) be a random variable drawn from the copula. Then according to Lemma 3.1, it is easy to see that: I 2 (X, Y ) = MMD2 (F, PZ , QU ) = EPZ ,PZ [k(Z, Z  )] − 2EPZ ,PU [k(Z, U )] + EPU ,PU [k(U, U  )]. This expression is the expected value of the kernel k evaluated in random variables drawn from the uniform and the copula distributions. Using Eq. 2 we have: n

MMD2u [F, Z, U ] =

 1 [k(Zi , Zj ) n(n − 1) i=j

+ k(Ui , Uj ) − k(Zi , Uj ) − k(Ui , Zj )] and an estimator of I 2 (X, Y ) is given by Iˆ2 (X, Y ), where n(n − 1)Iˆu2 [X, Y ] = n    k(Zˆi , Zˆj ) + k(Ui , Uj ) − k(Zˆi , Uj ) − k(Ui , Zˆj )

(7)

i=j

Let 0 ≤ k(x, y) ≤ K be a bounded kernel function. Using [26], we can prove the consistency of the dependence estimators Iˆ2 , and provide upper bounds on the rate of convergence and show that the proposed dependence estimators can be used in high-dimensions as well. Thus, it does not suffer from the curse of dimensionality. Using all of the above, we summarize the MMD dependence measure in Algorithm 1.

ISaaC: Identifying Structural Relations in Biological Data

77

Algorithm 1. MMD measure for dependence Data: Positive-definite kernel k, distributions A ∈ R and B ∈ R with n samples each and U1:n = {U1 , . . . , Un } are i.i.d samples drawn from U [0, 1]2 . Result: Iˆ2 (A, B) dependence measure. Create M = {A, B}. // Concatenate the two sample distributions for i = 1 : n do ˆi = 1 rank(Mi , {M1 , . . . , Mn }). Z n end t = 0. for i = 1 : n do for j = 1 : n do if j = i then t = t + k(Zi , Zj ) + k(Ui , Uj ) − k(Zi , Uj ) − k(Zj , Ui ). end 1 Iˆ2 (X, Y ) = Iˆ2 (X, Y ) + n(n−1) t end end

3.2

Kernel Choice

So far, we have focused on the case of universal kernels. These kernels have various favorable properties, including that universal kernels are strictly positive definite, making the kernel matrix invertible and avoiding non-uniqueness in the dual solutions. Continuous functions on X can be arbitrarily well approximated (in the ∞ -norm) using an expansion in terms of universal kernels [35]. However, note that for instance in pattern recognition, there might be situations where the best kernel for a given problem is not universal. In fact, the kernel corresponds to the choice of a prior, and thus using a kernel which does not afford approximations of arbitrary continuous functions can be very useful provided that the functions it approximates are known to be solutions of the given problem. One can use a kernel k which is bounded, symmetric and positive and satis fying |x| × k(x) −→ 0 as |x| −→ ∞ and x2 k(x) dx < ∞. Some special kernel functions are given in Table 1. All of the above seems to give similar result, especially well known ParzenRosenblatt kernel density estimator (Gaussian). We choose to use the Gaussian Table 1. An example of different types of kernel. Kernel

k(x)

Uniform

1 1 2 {|x|≤1}

Triangle

(1 − |x|)1{|x|≤1}

Epanechnikov Quartic Triweight Gaussian Cosines

3 (1 − x2 )1{|x|≤1} 4 15 (1 − x2 )2 1{|x|≤1} 16 35 (1 − x2 )3 1{|x|≤1} 36 2 √1 exp − x 2 2π π cos( π4 x)1{|x|≤1} 4

78

H. A. Meer et al.

kernel which has well-known theoretical properties [5]. Our results show that the choice of reasonable k does not seriously affect the quality of the estimator. Additionally, the choice of σ or the window width parameter of the ParzenResenblatt turns to be not so crucial for the accuracy of the estimator. Some indications about this choice are given in [3]. In all our experiments we used the Gaussian kernel.

4

Results

We conducted experiments to infer gene regulatory networks using the proposed copula-based kernel dependence measure (ISaaC) on DREAM Challenge datasets. 4.1

DREAM Challenge Results

For the first set of evaluations, we compared the performance of ISaaC on a set of universally accepted benchmark networks of 100 and more genes from the DREAM3 and DREAM4 challenges [18,19,27] against several state-of-the-art GRN reconstruction methods. For the purpose of comparison, we selected several recent methods, including ENNET [33], GENIE3 [11], iRafNet [23], ARACNE [20] and the winner of each DREAM challenge. Among all the DREAM challenge networks, we performed experiments on the in silico networks of size 100 for the DREAM3 and DREAM4 challenges. The DREAM3 challenge consists of several in silico networks whose expression matrices E are simulated using the GeneNetWeaver [30] software. In our experiments, we focus on the networks of size 100, which are the largest in the DREAM3 suite. Several additional sources of information are available for these networks, including knockout (KO), knockdown (KD) and wildtype (WT) expressions, apart from the time-series information. Most state-of-the-art techniques do not necessarily utilize all these heterogeneous information sources. However, methods like ENNET and iRafNet incorporate some of this information in a refinement step to adjust the edge weights of the inferred network. For example, the ENNET technique uses a null-mutant z-score approach [27], exploiting the knockout and knockdown information, to refine the reverse-engineered regulatory networks. In our experiments, we use the time series to infer an initial network, which is further refined using the null-mutant z-score approach proposed in [27] and used effectively in [33]. We compare the results of ISaaC with other GRN inference methods in Table 2. The DREAM4 challenge comprised 5 benchmark networks, again constructed as sub-networks of transcriptional regulations from the model organisms E. coli and S. cerevisiae. We focused on the networks of size 100 from the DREAM4 suite. Additional sources of information, including knockout, knockdown, wildtype and multifactorial data, were also available for the DREAM4 challenge. An exhaustive comparison of the performance of diverse GRN inference methods on the DREAM4 challenge is depicted in Table 3.
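As a rough sketch of how the dependence measure yields a candidate network before any refinement (the null-mutant z-score step of [27] is not shown), one can score every regulator-target pair and collect the scores into a weighted adjacency matrix. The function and argument names below (score_grn, expr, tf_idx) are hypothetical and build on the earlier isaac_dependence sketch.

```python
def score_grn(expr, tf_idx=None, **kwargs):
    """Weighted adjacency matrix of pairwise ISaaC dependence scores.
    expr: (samples x genes) expression matrix.
    tf_idx: indices of candidate regulators (all genes if None)."""
    n_genes = expr.shape[1]
    regulators = range(n_genes) if tf_idx is None else tf_idx
    W = np.zeros((n_genes, n_genes))
    for i in regulators:
        for j in range(n_genes):
            if i != j:
                W[i, j] = isaac_dependence(expr[:, i], expr[:, j], **kwargs)
    return W
```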


Table 2. Comparison of the ISaaC method with myriad inference methods on DREAM3 networks of size 100. We provide the mean AUpr and AUroc values over 10 random runs of the different inference methods. KO = knockout, KD = knockdown, WT = wildtype and MTS = modified smoothed version of the time-series data. The best results are highlighted in bold. ∗, + and − mark the quality metric values where the ISaaC, ENNET and iRafNet techniques, respectively, outperform the winner of the DREAM3 challenge.

DREAM3 experiments

Methods      Data used         Network 1        Network 2        Network 3        Network 4        Network 5
                               AUpr    AUroc    AUpr    AUroc    AUpr    AUroc    AUpr    AUroc    AUpr    AUroc
ISaaC        MTS, KO           0.712   0.962    0.803   0.975    0.557   0.888    0.512∗  0.856    0.399   0.786
ENNET        KO, KD, WT, MTS   0.627   0.901    0.865   0.963+   0.552+  0.892    0.522   0.842    0.384   0.765
GENIE3       KO, KD, WT        0.430   0.850    0.782   0.883    0.372   0.729    0.423   0.724    0.314   0.656
iRafNet      KO, KD, WT        0.528   0.878    0.812−  0.901    0.484   0.864    0.482−  0.772    0.364   0.736
ARACNE       KO, KD, WT        0.348   0.781    0.656   0.813    0.285   0.669    0.396   0.662    0.274   0.583
Winner [40]  KO, WT            0.694   0.948    0.806   0.960    0.493   0.915    0.469   0.853    0.433   0.783

Table 3. Comparison of ISaaC with myriad inference methods on DREAM4 networks of size 100. We provide the mean AUpr and AUroc values over 10 random runs of the different inference methods. KO = knockout, KD = knockdown, WT = wildtype and MTS = modified smoothed version of the time-series data. The best results are highlighted in bold. ∗, + and − mark the quality metric values where the ISaaC, ENNET and iRafNet techniques, respectively, outperform the winner of the DREAM4 challenge.

DREAM4 experiments

Methods      Data used      Network 1        Network 2        Network 3        Network 4        Network 5
                            AUpr    AUroc    AUpr    AUroc    AUpr    AUroc    AUpr    AUroc    AUpr    AUroc
ISaaC        MTS, KO        0.555∗  0.916    0.505   0.862    0.503   0.856∗   0.514   0.854∗   0.260∗  0.790∗
ENNET        KO, KD, WT     0.604   0.893    0.456+  0.856+   0.421+  0.865    0.506+  0.878    0.264+  0.828
GENIE3       KO, WT         0.338   0.864    0.309   0.748    0.277   0.782    0.267   0.808    0.114   0.720
iRafNet      KO, TS         0.552−  0.901    0.337   0.799    0.414−  0.835−   0.421−  0.847−   0.298   0.792−
ARACNE       KO, KD, WT     0.279   0.781    0.256   0.691    0.205   0.669    0.196   0.699    0.074   0.583
Winner [24]  KO             0.536   0.914    0.377   0.801    0.390   0.833    0.349   0.842    0.213   0.759

ISaaC is competitive with several state-of-the-art GRN inference methods w.r.t. AUpr and AUroc on both the DREAM3 and DREAM4 challenges. ISaaC attains the best AUroc for Networks 1, 2, 4 and 5 and the highest AUpr for Networks 1 and 3 on the DREAM3 challenge datasets. Similarly, ISaaC achieves the best AUroc for Networks 1 and 2 and defeats the winner of the DREAM4 challenge w.r.t. the AUroc metric for Networks 3, 4 and 5. Moreover, ISaaC performs best w.r.t. AUpr for Networks 2, 3 and 4 and surpasses the winner of the DREAM4 challenge for Networks 1 and 5 w.r.t. the same quality metric. This illustrates the efficiency of the proposed ISaaC approach for reverse-engineering gene regulatory networks from heterogeneous information sources. Figure 2 illustrates the difference between the gold-standard in silico network and the reverse-engineered GRN for Network 1, for both the DREAM3 and DREAM4 challenges.


(a) Gold Standard vs reverse-engineered GRN for Network 1 in case of DREAM3 challenge.

(b) Gold Standard vs inferred GRN for Network 1 in case of DREAM4 challenge.

Fig. 2. Comparison of the inferred GRNs for Network 1 with the gold-standard in silico networks for the DREAM3 and DREAM4 challenges, respectively. The nodes represent the genes, and the size of each node is proportional to its out-degree. The "red" coloured edges represent the true positives, i.e. edges present in both the gold-standard network and the inferred GRN. The "blue" coloured edges correspond to the false negatives, i.e. edges present in the gold-standard network but missing in the reverse-engineered GRN. Finally, the "green" coloured edges refer to the false positives, i.e. edges not present in the gold-standard network but appearing in the reconstructed GRN. (Color figure online)

References

1. Bach, F.R., Jordan, M.I.: Kernel independent component analysis. J. Mach. Learn. Res. 3(Jul), 1–48 (2002)
2. Borgwardt, K.M., Gretton, A., Rasch, M.J., Kriegel, H.P., Schölkopf, B., Smola, A.J.: Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22(14), e49–e57 (2006)
3. Bosq, D.: Contribution à la théorie de l'estimation fonctionnelle. Institut de statistique de l'Université de Paris, Paris (1971)


4. Dedecker, J., Doukhan, P., Lang, G., José Rafael, L., Louhichi, S., Prieur, C.: The empirical process. In: Dedecker, J., Doukhan, P., Lang, G., José Rafael, L., Louhichi, S., Prieur, C. (eds.) Weak Dependence: With Examples and Applications, pp. 223–246. Springer, New York (2007). https://doi.org/10.1007/978-0-387-69952-3_10
5. Evangelista, P.F., Embrechts, M.J., Szymanski, B.K.: Some properties of the Gaussian kernel for one class learning. In: de Sá, J.M., Alexandre, L.A., Duch, W., Mandic, D. (eds.) ICANN 2007. LNCS, vol. 4668, pp. 269–278. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74690-4_28
6. Fortet, R., Mourier, E.: Convergence de la répartition empirique vers la répartition théorique. Annales scientifiques de l'École Normale Supérieure 70(3), 267–285 (1953)
7. Gretton, A., Borgwardt, K.M., Rasch, M., Schölkopf, B., Smola, A.J.: A kernel method for the two-sample-problem. In: Advances in Neural Information Processing Systems, pp. 513–520 (2007)
8. Gretton, A., Bousquet, O., Smola, A., Schölkopf, B.: Measuring statistical dependence with Hilbert-Schmidt norms. In: Jain, S., Simon, H.U., Tomita, E. (eds.) ALT 2005. LNCS (LNAI), vol. 3734, pp. 63–77. Springer, Heidelberg (2005). https://doi.org/10.1007/11564089_7
9. Gretton, A., Herbrich, R., Smola, A.J.: The kernel mutual information. In: 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings, ICASSP 2003, vol. 4, pp. IV-880. IEEE (2003)
10. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis, vol. 46. Wiley, New Jersey (2004)
11. Irrthum, A., Wehenkel, L., Geurts, P., et al.: Inferring regulatory networks from expression data using tree-based methods. PLoS ONE 5(9), e12776 (2010)
12. Karlebach, G., Shamir, R.: Modelling and analysis of gene regulatory networks. Nat. Rev. Mol. Cell Biol. 9(10), 770–780 (2008)
13. Krus, D.J., Blackman, H.S.: Test reliability and homogeneity from the perspective of the ordinal test theory. Appl. Measur. Educ. 1(1), 79–88 (1988)
14. Mall, R., Cerulo, L., Garofano, L., Frattini, V., Kunji, K., Bensmail, H., Sabedot, T.S., Noushmehr, H., Lasorella, A., Iavarone, A., Ceccarelli, M.: RGBM: regularized gradient boosting machines for identification of the transcriptional regulators of discrete glioma subtypes. Nucleic Acids Res. gky015 (2018). https://doi.org/10.1093/nar/gky015
15. Mall, R., Jumutc, V., Langone, R., Suykens, J.A.: Representative subsets for big data learning using k-NN graphs. In: 2014 IEEE International Conference on Big Data (Big Data), pp. 37–42. IEEE (2014)
16. Mall, R., Suykens, J.A.: Very sparse LSSVM reductions for large-scale data. IEEE Trans. Neural Netw. Learn. Syst. 26(5), 1086–1097 (2015)
17. Marbach, D., Costello, J.C., Küffner, R., Vega, N.M., Prill, R.J., Camacho, D.M., Allison, K.R., Kellis, M., Collins, J.J., Stolovitzky, G., et al.: Wisdom of crowds for robust gene network inference. Nat. Methods 9(8), 796–804 (2012)
18. Marbach, D., Prill, R.J., Schaffter, T., Mattiussi, C., Floreano, D., Stolovitzky, G.: Revealing strengths and weaknesses of methods for gene network inference. Proc. Natl. Acad. Sci. 107(14), 6286–6291 (2010)
19. Marbach, D., Schaffter, T., Mattiussi, C., Floreano, D.: Generating realistic in silico gene networks for performance assessment of reverse engineering methods. J. Comput. Biol. 16(2), 229–239 (2009)


20. Margolin, A.A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Dalla Favera, R., Califano, A.: ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinform. 7(1), S7 (2006)
21. Nelsen, R.B.: An Introduction to Copulas. Springer, Heidelberg (2007). https://doi.org/10.1007/0-387-28678-0
22. Pál, D., Póczos, B., Szepesvári, C.: Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs. In: Advances in Neural Information Processing Systems, pp. 1849–1857 (2010)
23. Petralia, F., Wang, P., Yang, J., Tu, Z.: Integrative random forest for gene regulatory network inference. Bioinformatics 31(12), i197–i205 (2015)
24. Pinna, A., Soranzo, N., De La Fuente, A.: From knockouts to networks: establishing direct cause-effect relationships through graph analysis. PLoS ONE 5(10), e12912 (2010)
25. Plaisier, C.L., O'Brien, S., Bernard, B., Reynolds, S., Simon, Z., Toledo, C.M., Ding, Y., Reiss, D.J., Paddison, P.J., Baliga, N.S.: Causal mechanistic regulatory network for glioblastoma deciphered using systems genetics network analysis. Cell Syst. 3(2), 172–186 (2016)
26. Póczos, B., Ghahramani, Z., Schneider, J.: Copula-based kernel dependency measures. arXiv preprint arXiv:1206.4682 (2012)
27. Prill, R.J., Marbach, D., Saez-Rodriguez, J., Sorger, P.K., Alexopoulos, L.G., Xue, X., Clarke, N.D., Altan-Bonnet, G., Stolovitzky, G.: Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS ONE 5(2), e9202 (2010)
28. Rényi, A., et al.: On measures of entropy and information. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 547–561 (1961)
29. Sarmanov, O.: The maximum correlation coefficient (symmetrical case). Dokl. Akad. Nauk SSSR 120(4), 715–718 (1958)
30. Schaffter, T., Marbach, D., Floreano, D.: GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics 27(16), 2263–2270 (2011)
31. Schweizer, B., Wolff, E.F.: On nonparametric measures of dependence for random variables. Ann. Stat. 9(4), 879–885 (1981)
32. Shannon, C.E., Weaver, W.: The Mathematical Theory of Communication. University of Illinois Press, Urbana (1949)
33. Slawek, J., Arodź, T.: ENNET: inferring large gene regulatory networks from expression data using gradient boosting. BMC Syst. Biol. 7(1), 1 (2013)
34. van Someren, E., Wessels, L., Backer, E., Reinders, M.: Genetic network modeling. Pharmacogenomics 3(4), 507–525 (2002)
35. Steinwart, I.: On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res. 2(Nov), 67–93 (2001)
36. Sun, X., Janzing, D., Schölkopf, B., Fukumizu, K.: A kernel-based causal learning algorithm. In: Proceedings of the 24th International Conference on Machine Learning, pp. 855–862. ACM (2007)
37. Székely, G.J., Rizzo, M.L., Bakirov, N.K., et al.: Measuring and testing dependence by correlation of distances. Ann. Stat. 35(6), 2769–2794 (2007)
38. Tsallis, C.: Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 52(1), 479–487 (1988)
39. Weisstein, E.: Sklar's theorem. Retrieved 4, 15 (2011)
40. Yip, K.Y., Alexander, R.P., Yan, K.K., Gerstein, M.: Improved reconstruction of in silico gene regulatory networks by integrating knockout and perturbation data. PLoS ONE 5(1), e8121 (2010)

Inspecting the Role of PI3K/AKT Signaling Pathway in Cancer Development Using an In Silico Modeling and Simulation Approach

Pedro Pablo González-Pérez1(✉) and Maura Cárdenas-García2

1 Universidad Autónoma Metropolitana, Av. Vasco de Quiroga 4871, 05300 Ciudad de México, Mexico
[email protected]
2 Benemérita Universidad Autónoma de Puebla, 13 sur 2702, Col. Volcanes, 72420 Puebla, Mexico
[email protected]

Abstract. The PI3K/AKT signaling pathway plays a crucial role in the control of functions related to cancer biology, including cellular proliferation, survival, migration, angiogenesis and apoptosis, which makes this signaling pathway one of the main processes involved in cancer development. The analysis and prediction of anticancer targets acting on the PI3K/AKT signaling pathway require a deep understanding of its signaling elements, the complex interactions that take place between them, and the global behaviors that arise as a result; that is, a systems biology approach. Following this methodology, in this work we propose an in silico modeling and simulation approach to the PI3K class I and III signaling pathways, for exploring their effect on the AKT and SGK proteins, their relationship with the deregulated growth control in cancer and their role in metastasis, as well as for identifying possible control points. The in silico approach provides symbolic abstractions and accurate algorithms that allow dealing with crucial aspects of cellular signal transduction such as compartmentalization, topology and timing. Our results show that the activation or inhibition of target signaling elements in the overall signaling pathway can change the outcome of the cell, turning it into apoptosis or proliferation.

Keywords: PI3K/AKT signaling pathway · Cancer biology · Systems biology approach · In silico modeling and simulation

1

Introduction

The signaling pathways that involve the activation of PI3K have been widely studied, and anti-cancer drugs have even been created that inhibit different proteins of this pathway. PI3K signaling pathways play an important role in the progression of cancer, because they are involved in different cellular functions such as cellular proliferation, survival, migration and angiogenesis [1]. The identification of new therapeutic targets requires exploring and analyzing different alternative pathways that involve different classes of

84

P. P. González-Pérez and M. Cárdenas-García

PI3K protein. For this purpose, the adoption of an in silico modeling and simulation approach seems promising.
In the last few years, the modeling and simulation of molecular and cellular biology phenomena, specifically cell signal transduction networks, have found valuable support in the wide range of computational tools developed [2]. The modeling approaches supporting these computational simulation tools cover a wide spectrum ranging from mathematical models to computational models. Among the mathematical models commonly used are those based on ordinary differential equation systems, statistical methods and numerical methods. On the other hand, computational models have found strong inspiration in techniques such as cellular automata, Boolean networks, Petri nets, neural networks and multi-agent systems.
However, most of the earlier computational tools provided abstractions to simulate only intracellular signaling systems. That is, they were not suitable for simulating much larger signal transduction systems involving two or more cells. Therefore, in recent years, other major requirements in the simulation of cell signal transduction systems, such as multi-compartmentalization, location, topology and timing, have emerged, guiding the development of new computational models and tools. Examples of computational tools supporting some of these characteristics are Bio-PEPA [3], MCell [4], COPASI [5], Virtual Cell [6], CompuCell 3D [7] and BTSSOC-Cellulat [8, 9].
In this work, we propose an in silico modeling and simulation approach characterized by multi-compartmentalization, location, topology and timing, and use it to inspect the behavior of the PI3K signaling pathways, their effect on the AKT and SGK proteins, their relationship with the deregulated growth control in cancer and their role in metastasis, as well as to identify possible control points. The multi-compartmental model developed has been integrated in BTSSOC-Cellulat, a computational simulation tool for cell signal transduction systems developed by us a few years ago.

2

Materials and Methods

2.1 The Mathematical Model of Cellular Signal Transduction

In this subsection, we present the mathematical model of cellular signal transduction, characterized by the following key features: (1) tuples and tuple spaces [8, 10] as logical abstractions for the representation of cellular structures and signaling elements, i.e. cellular compartments, chemical reactions and reactants; (2) production rules as logical abstractions for the representation of cellular processes, i.e. cellular events resulting from cellular signal transduction, e.g. cancer cell proliferation, apoptosis and cell death; (3) an action selection mechanism for the selection and execution of the chemical reactions, based on Gillespie's algorithm [11], a stochastic simulation algorithm typically used to mimic systems of chemical/biochemical reactions in an efficient and accurate way; and (4) an inference machine, independent of the action selection mechanism, for the activation of cellular processes.

Inspecting the Role of PI3K/AKT Signaling Pathway in Cancer Development

85

Representation of cellular structures, cellular processes and signaling interactions. Denote by Ci, 1 ≤ i ≤ m, the i-th cell belonging to the tissue or cellular group G, which is represented by a set of n tuple spaces (TS) such that:

Ci = {TSi1, TSi2, …, TSin}   (1)

where TSi1 ∪ TSi2 ∪ … ∪ TSin = Ci. Each tuple space TSij, 1 ≤ j ≤ n, is a set of tuples, where each individual tuple (t) represents a signaling element. Denote by cr a reaction, by cp a cellular process, by r a reactant, and by p a product; therefore we have:

∀t ∈ TSij, 1 ≤ j ≤ n: t = cr, t = cp, t = r, or t = p   (2)

From (1) and (2) we have that any tuple t in any tuple space TSij, 1 ≤ j ≤ n, and therefore in cell Ci, represents either a chemical reaction (cr), a cellular process (cp), a reactant (r) or a product (p). Note that each TSij, 1 ≤ j ≤ n, represents a cell compartment, e.g. nucleus, mitochondria, cytoplasm, cell membrane, or even extracellular space, which guarantees the multicompartmental nature of the cell signaling model.
Regarding the representation of chemical reactions, expression (3) provides the symbolic abstraction that allows us to represent, and therefore manipulate, the chemical reaction schemes commonly required when modelling cellular signaling systems, such as synthesis, decomposition, and the standard equation for enzymatic reactions:

cr([(r1, reqMol1), (r2, reqMol2)], K, [(p1, pm1), (p2, pm2)])   (3)

where r1, r2 are reactants and reqMol1, reqMol2 are the numbers of molecules involved of reactants r1, r2, respectively; K is the reaction rate constant; p1, p2 are products and pm1, pm2 are the numbers of molecules formed of products p1, p2, respectively.
With regard to cellular processes, which are initially written as production rules (IF <conditions> THEN <actions>), the symbolic abstraction for their representation and manipulation is provided by expression (4):

cp([cond1, …, condp], [act1, …, actq])   (4)

Let TSij and TSik be two neighboring tuple spaces, which we represent by the tuple:

neighboring(TSij, TSik), 1 ≤ j, k ≤ n, j ≠ k   (5)

Consider also that a tuple space can have at most two neighbors, given the type of biological system that we are modelling. As already established, a tuple space (TSij) models a particular cellular compartment. Thus, an example of a tuple space with more than one neighbor is given by the "cellular membrane" tuple space, which has as neighbors the "extracellular space" and the "cytosol" tuple spaces.
The notion of neighboring tuple spaces (expression (5)) plays a key role in our signaling model, since it allows us to establish that the products formed by a reaction cr belonging to a tuple space TSij are translocated to another tuple space TSik if and only


if TSij and TSik are neighbors. The above is achieved by replacing the tuple (p1, pm1), located in the right part of expression (3), by the tuple defined by expression (6):

translocate((p1, pm1), TSik)   (6)

In this way, the continuity of the signal transduction is guaranteed through all the tuple spaces (cell compartments) that make up the cell Ci. Note that a cell compartment is modelled by a single tuple space, while a cell proper is modelled by a set of tuple spaces.
With regard to the reactants and products involved in reactions, both are represented as tuples in the tuple spaces according to the symbolic abstraction provided by expression (7):

r(ri, Moli)   (7)
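As an informal illustration of expressions (1)–(7), the sketch below encodes compartments as tuple spaces, reactants as r(ri, Moli) tuples and a reaction whose product is translocated to a neighboring compartment. It is not the tool's actual code: all species names, molecule counts and the rate constant are invented for the example.

```python
# Minimal sketch of the tuple-space abstractions of expressions (1)-(7).
# All species names, counts and the rate constant are illustrative.

# A cell C_i is a set of named tuple spaces, one per compartment (expression (1)).
cell = {
    "extracellular_space": set(),
    "cell_membrane": set(),
    "cytosol": set(),
    "nucleus": set(),
}

# Reactants as r(r_i, Mol_i) tuples (expression (7)).
cell["extracellular_space"].add(("r", "GF", 10))
cell["cell_membrane"].add(("r", "RTK", 250))

# Only adjacent compartments may exchange products (expression (5)).
neighbors = {("extracellular_space", "cell_membrane"),
             ("cell_membrane", "cytosol"),
             ("cytosol", "nucleus")}

# A reaction cr([(r1, reqMol1), ...], K, [products]) whose product is
# translocated to a neighboring tuple space (expressions (3) and (6)).
reaction = ("cr",
            (("GF", 1), ("RTK", 1)),                     # reactants, required molecules
            0.5,                                          # rate constant K (made up)
            (("translocate", ("RTK*", 1), "cytosol"),))   # translocated product

def can_translocate(src, dst):
    """Products may only move between neighboring tuple spaces."""
    return (src, dst) in neighbors or (dst, src) in neighbors

print(can_translocate("cell_membrane", "cytosol"))        # True
print(can_translocate("extracellular_space", "nucleus"))  # False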

The action selection mechanism for carrying out the signal transduction. Once all the chemical reactions and the reactants are modeled, every chemical reaction is explicitly simulated on the basis of the Gillespie algorithm [11], a stochastic simulation algorithm typically used to mimic systems of chemical/biochemical reactions in an efficient and accurate way. The simulation proceeds by choosing the next reaction to occur on the basis of a random number and its propensity function, which is calculated from the reaction rate and the number of reactants. Expressions (8), (9) and (10) below represent the core of this stochastic algorithm and are used for (1) calculation of the rate of each eligible chemical reaction, (2) selection of the next chemical reaction to be executed, and (3) determination of the time between the execution of the last and the next chemical reaction, respectively:

Rate = K · ∏_{i=1}^{2} (Moli / reqMoli)   (8)

ψ ≤ (∑_{i=1}^{n} Ratei) / RTot   (9)

where RTot is the summation of the rates of all eligible reactions.

Stoptime = −ln(τ) / RTot   (10)
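A minimal sketch of one selection step driven by expressions (8)–(10) follows; the reaction list and all numbers are made up for illustration and do not come from the PI3K/AKT model itself.

```python
import math
import random

# Illustrative reaction list: (rate constant K, [(molecule count, required molecules), ...]).
reactions = [
    (0.5, [(100, 1), (80, 1)]),
    (1.2, [(40, 2)]),
    (0.1, [(300, 1), (5, 1)]),
]

def reaction_rate(K, reactants):
    """Expression (8): Rate = K * prod_i (Mol_i / reqMol_i)."""
    rate = K
    for mol, req in reactants:
        rate *= mol / req
    return rate

rates = [reaction_rate(K, rs) for K, rs in reactions]
r_tot = sum(rates)

# Expression (9): pick the first reaction whose cumulative normalized rate
# exceeds a uniform random number psi.
psi = random.random()
next_reaction = len(rates) - 1   # fallback for floating-point rounding
cumulative = 0.0
for idx, rate in enumerate(rates):
    cumulative += rate / r_tot
    if psi <= cumulative:
        next_reaction = idx
        break

# Expression (10): waiting time before the selected reaction fires.
tau = random.random()
stop_time = -math.log(tau) / r_tot

print(f"next reaction: {next_reaction}, waiting time: {stop_time:.4f}")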

The action selection mechanism for activating cellular processes. As mentioned earlier, cellular processes are modeled as production rules, i.e. logical structures of the type IF <conditions> THEN <actions>, and represented as tuples formulated by expression (4). The execution of these production rules is carried out by an inference engine that applies logical rules to the set of production rules to produce new facts, i.e. new tuples representing cellular events or processes.
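The inference step can be pictured as a simple forward evaluation of cp tuples. The sketch below uses hypothetical thresholds and a simplified version of Rule #1 from Table 2 (the NOT condition on Bcl-2 is omitted for brevity); it is an illustration, not the tool's actual engine.

```python
# Simplified forward evaluation of cp([conditions], [actions]) tuples
# (expression (4)). Concentrations and thresholds are hypothetical.
concentrations = {"Fas-L*": 0.8, "p53*": 0.6, "p21*": 0.7}
thresholds = {"Fas-L*": 0.5, "p53*": 0.4, "p21*": 0.5}

rules = [
    # cf. Rule #1 in Table 2, simplified: all listed species above threshold.
    (["Fas-L*", "p53*", "p21*"], "CELL DEATH"),
]

def evaluate(rules, conc, th):
    events = []
    for conditions, event in rules:
        # The IF part holds when every condition is satisfied; then the THEN part fires.
        if all(conc.get(sp, 0.0) > th[sp] for sp in conditions):
            events.append(event)
    return events

print(evaluate(rules, concentrations, thresholds))  # ['CELL DEATH']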


2.2 The Implementation of the Mathematical Model: The Computational Simulation Tool for Cellular Signaling Transduction

The mathematical model of cellular signal transduction introduced above was implemented and integrated in an already existing computational simulation tool developed by us. BTSSOC-Cellulat [8, 9], as the computational simulation tool was named, is in itself an integrated virtual environment for in silico experimentation on cellular signaling transduction systems, strongly dependent on characteristics such as multi-compartmentalization, location, topology and timing. Figure 1 shows the key phases in the creation and execution of a simulation using BTSSOC-Cellulat.

Fig. 1. Workflow of the major activities carried out during the creation and execution of a cell signaling simulation using Cellulat

The computational simulation tool BTSSOC-Cellulat (in its Executable Jar File version) can be either executed or downloaded from the bioinformatics website of our research group at http://bioinformatics.cua.uam.mx/node/10. The instructions required for the download can be consulted on this website.

2.3 The PI3K/AKT Signaling Model

Normal cells self-destruct in a regulated way through apoptosis, which includes the decision to begin self-destruction and the adequate execution of the apoptotic program, or programmed cell death. Apoptosis requires the coordinated activation and execution of multiple subprograms. In contrast, cancerous cells initiate an antiapoptotic


program in order to survive. Tumor generation and progression occur when there is an imbalance among differentiation caused by mutations, proliferation of differentiated cells, and apoptosis. In tumor cells, the antiapoptotic signaling pathway associated with PI3K is activated. The uncontrolled activation of the PI3K signaling pathway contributes to cellular transformation and to tumor progression in various types of tumors, including brain, breast, ovary and renal carcinomas. PI3K controls various key functions related to the biology of cancer, including proliferation, cell survival, migration and angiogenesis [12–14].
The activity of PI3K is regulated by receptors associated with protein kinases, growth factors, cytokines and Ras. PI3K activates two phosphatases: PTEN (a tumor suppressor) and SHIP2 [12, 13, 15]. If PTEN (SHIP1) or SHIP2 inhibits PI3K, the cancer cell stops growing and dies. However, if Ras is active, it allows the cell to survive even in the presence of regulators of the PI3K/AKT pathway, as described in Fig. 2. As will be seen later, the PI3K/AKT signaling network shown in Fig. 2 is the starting point for the modeling of reactions, reactants and cellular processes.

Fig. 2. PI3K/AKT signaling pathways. Note the different final destinations that the cell could have, depending on the types of molecules within the cell and their activation states; as well as if it is a cancer cell or a normal cell. Signaling elements are represented as solid blue ellipses and cellular processes as solid green rectangles. Red arrows indicate inhibition relationships and green arrows indicate activation relationships (Color figure online)


2.4 The Modeling and Simulation Methodology

The methodology followed in this work is characterized by a continuous bidirectional feedback between the in silico modelling and simulation approach and the theoretical and experimental knowledge of PI3K/AKT. The main phases involved in this methodology are modeling, simulation and analysis. In the modeling phase, reactions, reactants and cellular processes are defined from the PI3K/AKT signaling pathways shown in Fig. 2. Note that all these signaling elements are written or recorded in the same notation commonly used when describing such elements; it is the simulation tool itself that translates them into logical abstractions based on tuple spaces, introduced in Subsect. 2.1. The simulation phase, using BTSSOC-Cellulat, follows the workflow shown in Fig. 1. Lastly, the analysis phase involves the analysis of the simulation execution and the feedback between the in silico modelling and simulation approach and the PI3K/AKT theoretical and experimental knowledge.

3

Results and Discussions

3.1 Modeling of Reactions, Reactants and Cellular Processes

Tables 1 and 2 show some examples of signaling elements (reactions and reactants) and cellular processes, respectively, defined from the PI3K/AKT signaling network shown in Fig. 2 during the modeling phase.

Table 1. Examples of signaling elements (reactions and reactants) defined as part of the modeling of the PI3K/AKT signaling pathways. Usually, the kinetic parameters are taken from breast cancer cells, as the in vitro experiments are carried out in this type of cancer. However, when the kinetic parameters have not been reported in the literature, they are taken from other types of cancer which are commonly the product of metastasis from breast cancer.

Reaction              | Concentration (μM)       | Km (μM) | Vmax (μM/mg/min) | V0      | Ref.
GF + RTK -> RTK*      | GF (0.0001), RTK (0.25)  | 34.2    | 7.6              | 0.00002 | [16]
SF + RTK -> RTK*      | SF (0.0001), RTK (0.25)  | 1.61    | 2.6              | 0.00016 | [17]
BC + BCR -> BCR*      | BC (2), BCR (0.30)       | 0.33    | 15               | 7.5     | [18]
RTK* + Ras -> Ras*    | RTK (0.25), Ras (0.7)    | 2.19    | 27.8             | 13.9    | [19]
GPCR* + PI3K -> PI3K* | GPCR (0.25), PI3K (0.16) | 1.4     | 24.6             | 3.7     | [20]


Table 2. Examples of cellular processes defined as part of the modeling of the PI3K/AKT signaling pathways.

Cell death (Rule #1):
IF active_conc("Fas-L*", AC_Fas_L*) AND AC_Fas_L* > Th_FasL AND active_conc("p53*", AC_p53*) AND AC_p53* > Th_p53 AND active_conc("p21*", AC_p21*) AND AC_p21* > Th_p21 AND NOT(active_conc("Bcl-2", _)) THEN triggered_cellular_event("CELL DEATH")

Apoptosis (Rule #2):
IF active_conc("Bad*", AC_Bad*) AND AC_Bad* > Th_Bad AND active_conc("Bax*", AC_Bax*) AND AC_Bax* > Th_Bax AND active_conc("p53*", AC_p53*) AND AC_p53* > Th_p53 AND active_conc("Bim*", AC_Bim*) AND AC_Bim* > Th_Bim THEN triggered_cellular_event("APOPTOSIS")

3.2 Simulation of the PI3K/AKT Signaling Pathway

Once the cellular compartments have been created, and the reactions with their kinetic parameters, the reactants with their initial concentrations and the cellular processes have been recorded, the simulation is ready for execution, as shown in Fig. 3.

Fig. 3. PI3K/AKT signaling pathway simulation. Cell compartments, reactions and reactants have been created as the initial components required by the simulation. Reactants are represented by solid red spheres and cellular processes as solid red ellipses. Each signaling element is identified by its name (acronym). Red arrows indicate inhibition relationships and green arrows indicate activation relationships (Color figure online)


As a result, the simulated PI3K/AKT signaling network consists of 74 nodes representing reactants, 6 nodes representing cellular processes, such as proliferation, growth, apoptosis and cell death, and 89 arcs representing reactions between the involved nodes. The overall signaling network extends across 4 cell compartments, i.e. extracellular space, cell membrane, cytosol and nucleus.

3.3 In Silico Experiments

The first in silico experiment consisted of running the simulation considering all recorded reactions and the initial concentration values of all reactants (see some examples in Table 1 and the overall PI3K/AKT signaling network in Fig. 3). According to the results observed in the tables and graphs of concentration versus time provided by the simulation tool, all the reactants initiate a mechanism of self-destruction evidenced by the appearance of pro-apoptotic signals. However, as cancerous cells release survival factors, this allows the action selection mechanism (based on the Gillespie algorithm) to find eligible reactions to be executed, which depend on the presence and type of PI3K, as well as on the presence of phosphatidylinositol and the activation of AKT.
Consequently, the incremental availability of PI3K in the cytosol compartment leads to the activation of PIP3 and, therefore, to the amplification of the signal. The activation of AKT allows the sequential activation of survival factors and cell proliferation. The activation of these effector proteins results in the following cellular processes: (1) a decrease in the expression or activity levels of several pro-apoptotic proteins, (2) an increase in the anti-apoptotic proteins Bcl-2, Bcl-xL and XIAPs, and (3) a positive feedback on the AKT protein itself. GSK3 is also a target of AKT phosphorylation, which determines its inactivity by blocking its transcriptional activity and the regulation of metabolism.
The subsequent in silico experiments consisted of varying the initial concentration of each of the reactants, either increasing it to a maximal value or decreasing it to zero. In this way we detected two signaling elements, PTEN and SHIP2, whose presence makes the tumor cells stop growing and leads to their death. Simultaneously, it was observed that the activation of PI3K and Ras results in the survival of the cell, even in the presence of PTEN and SHIP2. Therefore, PI3K and Ras must be inhibited simultaneously and the levels of PTEN and SHIP2 must be increased in order for the cell to die. To achieve this, the concentrations of the PI3K and Ras inhibitors, i.e. PTEN, SHIP2 and mTOR, were increased by scaling them by orders of magnitude of 10, 100 or 1000, which led to a gradual inhibition of PI3K and Ras, as can be seen in Figs. 4 and 5, respectively, where the results shown correspond to the order of magnitude of 10. Inhibition of PI3K prevents the activation of AKT and other proteins in the signaling cascade. As a consequence, the inhibition of Fas-L, p21 and p53 does not take place (see the PI3K/AKT signaling network shown in Fig. 2). Finally, as shown in Fig. 6, the simultaneous activation of Fas-L, p21 and p53 leads to cell death (as established in Rule #1 of Table 2).


Fig. 4. Inhibition of PI3K* (red solid squares). PI3K* virtually disappears when PTEN (not shown), SHIP2 (green solid triangles) and mTOR (blue solid circles) are increased 100-fold. x-axis: time (milliseconds); y-axis: concentration of reactants (micromolar). The symbol "*" indicates that the signaling element is active (Color figure online)

Fig. 5. Inhibition of Ras* (red solid squares). Ras* virtually disappears when PTEN (blue solid circles), SHIP2 (not shown) and mTOR (not shown) are increased 100-fold. x-axis: time (milliseconds); y-axis: concentration of reactants (micromolar). The symbol "*" indicates that the signaling element is active (Color figure online)


Fig. 6. Cell death (yellow solid rhombuses). This in silico experiment showed that by inhibiting PI3K and Ras and increasing the levels of Fas-L (blue solid circles), p21 (red solid squares) and p53 (green solid triangles), cell death increases rapidly. x-axis: time (milliseconds); y-axis: concentration of reactants (micromolar). The symbol "*" indicates that the signaling element is active (Color figure online)

4

Conclusions

Using the cell signal transduction model proposed here, integrated in the computational simulation tool BTSSOC-Cellulat, we were able to model and simulate the antiapoptotic PI3K signaling pathways, whose activation allows tumor cells to survive. Through the different in silico experiments developed, we detected proteins, i.e. PTEN and SHIP2, such that increasing their concentration while simultaneously eliminating or inhibiting others, i.e. PI3K and Ras, causes the tumor cell to die. The next phase is to test this in in vitro experiments using cells from different tumor lines and determine whether this occurs. As an initial step in this phase, the genes coding for the Fas-L, p21 and p53 proteins were monitored. During this monitoring, the lethal dose 50 of aqueous extracts of Withania somnifera was used, and it was observed that the levels of the genes increased by two orders of magnitude in the breast cancer cell lines MCF-7 (ATCC HTB-22) and MDA-MB-231. The next step is to identify these proteins in these cell lines and in cells from different stages of cancer. Alternatively, the results obtained in in vitro experiments will allow us to improve the kinetic parameters in our simulation tool, and to continue the translational research cycle between biomedical-bioinformatics models and experimental results.


References

1. Lien, E.C., Dibble, C.C., Toker, A.: PI3K signaling in cancer: beyond AKT. Curr. Opin. Cell Biol. 45, 62–71 (2017). https://doi.org/10.1016/j.ceb.2017.02.007
2. Alves, R., Antunes, F., Salvador, A.: Tools for kinetic modeling of biochemical networks. Nat. Biotechnol. 24(6), 667–672 (2006). https://doi.org/10.1038/nbt0606-667
3. Ciocchetta, F., Duguid, A., Guerriero, M.L.: A compartmental model of the cAMP/PKA/MAPK pathway in Bio-PEPA. In: Third Workshop on Membrane Computing and Biologically Inspired Process Calculi (MeCBIC) (2009). http://dx.doi.org/10.4204/EPTCS.11.5
4. Kerr, R.A., Bartol, T.M., Kaminsky, B., Dittrich, M., Chang, J.C., Baden, S.B., Sejnowski, T.J., Stiles, J.R.: Fast Monte Carlo simulation methods for biological reaction-diffusion systems in solution and on surfaces. SIAM J. Sci. Comput. 30(6), 3126–3149 (2008). https://doi.org/10.1137/070692017
5. Hoops, S., Sahle, S., Gauges, R., Lee, C., Pahle, J., Simus, N., Singhal, M., Xu, L., Mendes, P., Kummer, U.: COPASI: a complex pathway simulator. Bioinformatics 22(24), 3067–3074 (2006). https://doi.org/10.1093/bioinformatics/btl485
6. Cowan, A.E., Moraru, I.I., Schaff, J.C., Slepchenko, B.M., Loew, L.M.: Spatial modeling of cell signaling networks. Methods Cell Biol. 110, 195–221 (2012). https://doi.org/10.1016/B978-0-12-388403-9.00008-4
7. Swat, M., Thomas, G.L., Belmonte, J.M., Shirinifard, A., Hmeljak, D., Glazier, J.A.: Multi-scale modeling of tissues using CompuCell 3D. Methods Cell Biol. 110, 325–366 (2012). https://doi.org/10.1016/B978-0-12-388403-9.00013-8
8. González-Pérez, P.P., Omicini, A., Sbaraglia, M.: A biochemically inspired coordination-based model for simulating intracellular signalling pathways. J. Simul. 7(3), 216–226 (2013). https://doi.org/10.1057/jos.2012.28
9. Cárdenas-García, M., González-Pérez, P.P., Montagna, S., Cortés Sánchez, O., Caballero, E.H.: Modeling intercellular communication as a survival strategy of cancer cells: an in silico approach on a flexible bioinformatics framework. Bioinform. Biol. Insights 10, 5–18 (2016). https://doi.org/10.4137/BBI.S38075
10. Gelernter, D.: Generative communication in Linda. ACM Trans. Program. Lang. Syst. 7(1), 80–112 (1985). https://doi.org/10.1145/2363.2433
11. Gillespie, D.T.: Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81(25), 2340–2361 (1977). https://doi.org/10.1021/j100540a008
12. Downward, J.: Targeting RAS signalling pathways in cancer therapy. Nat. Rev. Cancer 3(1), 11–22 (2003). https://doi.org/10.1038/nrc969
13. Goodsell, D.S.: The molecular perspective: the ras oncogene. Oncologist 4(3), 263–264 (1999)
14. Neves, S.R., Ram, P.T., Iyengar, R.: G protein pathways. Science 296(5573), 1636–1639 (2002). https://doi.org/10.1126/science.1071550
15. González-Pérez, P.P., Cárdenas, M., Camacho, D., Franyuti, A., Rosas, O., Lagúnez-Otero, J.: Cellulat: an agent-based intracellular signalling model. Biosystems 68(2–3), 171–185 (2003). https://doi.org/10.1016/S0303-2647(02)00094-1
16. Reyton-González, M.L., Cornell-Kennon, S., Schaefer, E., Kuzmic, P.: An algebraic model to determine substrate kinetic parameters by global nonlinear fit of progress curves. Anal. Biochem. 518, 16–24 (2017). https://doi.org/10.1016/j.ab.2016.11.001
17. Azevedo-Silva, J., Queirós, O., Ribeiro, A., Baltazar, F., Young, K.H., Pedersen, P.L., Preto, A., Casal, M.: The cytotoxicity of 3-bromopyruvate in breast cancer cells depends on extracellular pH. Biochem. J. 467(2), 247–258 (2015). https://doi.org/10.1042/BJ20140921


18. Blokh, D., Stambler, I., Afrimzon, E., Shafran, Y., Korech, E., Sandbank, J., Orda, R., Zurgil, N., Deutsch, M.: The information-theory analysis of Michaelis-Menten constants for detection of breast cancer. Cancer Detect. Prev. 31(6), 489–498 (2007). https://doi.org/10.1016/j.cdp.2007.10.010
19. Paradiso, A., Cardone, R.A., Bellizzi, A., Bagorda, A., Guerro, L., Tommasino, M., Casavola, V., Reshkin, S.J.: The Na+-H+ exchanger-1 induces cytoskeletal changes involving reciprocal RhoA and Rac1 signaling, resulting in motility and invasion in MDA-MB-435 cells. Breast Cancer Res. 6(6), R616–R628 (2004). https://doi.org/10.1186/bcr922
20. Fritz, J., Dwyer-Nield, L., Malkinson, A.M.: Stimulation of neoplastic mouse lung cell proliferation by alveolar macrophage-derived, insulin-like growth factor-1 can be blocked by inhibiting MEK and PI3K activation. Mol. Cancer 10, 76–96 (2011). https://doi.org/10.1186/1476-4598-10-76

Cardiac Pulse Modeling Using a Modified van der Pol Oscillator and Genetic Algorithms

Fabián M. Lopez-Chamorro1, Andrés F. Arciniegas-Mejia1, David Esteban Imbajoa-Ruiz1, Paul D. Rosero-Montalvo2,3, Pedro García2, Andrés Eduardo Castro-Ospina4(✉), Antonio Acosta5, and Diego Hernán Peluffo-Ordóñez5

1 GIIEE Research Group, Universidad de Nariño, Pasto, Colombia
2 Facultad de Ingeniería en Ciencias Aplicadas, Universidad Técnica del Norte, Ibarra, Ecuador
3 Universidad de Salamanca, Salamanca, Spain
4 Grupo de Investigación Automática, Electrónica y Ciencias Computacionales, Instituto Tecnológico Metropolitano, Medellín, Colombia
[email protected]
5 Yachay Tech, Urcuquí, Ecuador

Abstract. This paper proposes an approach for modeling cardiac pulses from electrocardiographic (ECG) signals. A modified van der Pol oscillator model (mvP) is analyzed which, under a proper configuration, is capable of describing action potentials and can therefore be adapted to model a normal cardiac pulse. Adequate parameters of the mvP system response are estimated using non-linear dynamics methods, such as dynamic time warping (DTW). In order to obtain an adaptive response for each individual heartbeat, a parameter tuning optimization method based on a genetic algorithm is applied, which generates responses that morphologically resemble real ECG. This feature is particularly relevant since heartbeats have intrinsically strong variability in terms of both shape and length. Experiments are performed over real ECG from the MIT-BIH arrhythmia database. The application of the optimization process shows that the mvP oscillator can properly model the ideal cardiac rate pulse.

1

Introduction

An electrocardiogram is a procedure to record the electrical activity of the heart, widely used to measure the rate and regularity of heartbeats. A large number of heart diseases can be diagnosed by this method, and it is considerably preferred because it is a non-invasive procedure [1,2]. The resulting record is known as an electrocardiogram (ECG) signal. Figure 1 depicts a beat drawn from a real ECG signal.


Currently, computer-based diagnosis is extensively used for both biomedical signal acquisition and processing [3,4]. In the context of ECG signals, categorization or clustering is the main objective of the diagnostic support system once the signals are acquired [5–7]. Subsequently, the characterization of the beats of ECG signals is a key stage of the diagnostic support system, contributing essential data to determine pathologies such as coronary heart diseases and arrhythmias. In order to avoid a manual examination, sensitive to errors and subjectivity, the heartbeat inspection is often performed by dedicated analysis software. Such software should be able to review the complete ECG recording while handling the special characteristics of ECG signals, namely high length and shape fluctuation, electrode disconnection, signal length (a large quantity of beats when using Holter recordings from outpatient electrocardiography), and noise, among others.

Fig. 1. Real normal heartbeat from lead V2. Specifically, the heartbeat is extracted from record 100 of the MIT-BIH database [8].

Features such as clinical, morphological (area under the curve, polarity, heart rate variability) and transformation-based (energy, frequency-domain analysis) features, and their respective usage, are explored in several previous works [6]. Nevertheless, non-linear dynamics concepts are also present, taking into account that a time series can be embedded into a state space; that is, a state space can be reconstructed from a one-dimensional time series. Indeed, ECG signals can be characterized using their corresponding state space instead of signal features directly. Examples of this are works related to chaos theory and state-space-based features (Hurst exponent, correlation dimension, entropy), which have been widely investigated to represent ECG signals [9–11]. Building a model is another way to exploit the capabilities of non-linear dynamics in biomedical science: a model might produce a response that resembles a particular time series by following the signal form. The work described in [12] uses a generalized version of the van der Pol oscillator,


the Bonhoeffer-van der Pol model (BVP), which represents the action potential behaviour under a proper parameter configuration, as explained in detail in [13]. Accordingly, using the BVP model, some approaches have been proposed to model heartbeats [14,15]. Nonetheless, they do not explore the possibility of characterizing normal and pathological heartbeats from such a model.
In this document, a technique for ECG signal characterization is developed using the mvP model and non-linear dynamics techniques. It consists of an optimization by a genetic algorithm aimed at finding parameters that generate the most suitable response, one that morphologically resembles the signal of interest (every single heartbeat). Dynamic time warping (DTW) was chosen as the dissimilarity measure, since it is a non-linear temporal alignment method. DTW is endorsed because it has the capability of aligning only those signal segments that are most suitable for adjustment from a shape-pairing perspective [16]. Consequently, it involves morphological characteristics. Additionally, DTW allows two vectors to be compared even when they have different lengths. This useful feature makes DTW convenient for this study, since the extraction and preprocessing stages generate data sets of different lengths.
Since the objective of the model focuses on the cardiac rate pulse, the ECG requires a preparation process in which the P and T waves are removed, keeping only the QRS complex. This is achieved by using the derivative of the signal. Also, the signal is normalized to avoid amplitude effects and DC offsets. For testing, a record from the MIT-BIH arrhythmia database was used [8]. The results obtained show the ability of the mvP to represent or track the heartbeat outline, and the possibility of categorizing ECG signals using a different non-linear-dynamics-based approach. The achieved responses, given their peculiar shapes, can warn about the presence of an acquisition system failure or a specific cardiac pathology.
The rest of this paper is organized as follows: Sect. 2 outlines the parametric mvP model and its use to represent action potentials. Section 3 describes the whole methodology to adapt the mvP model to the cardiac rate pulse, as well as the corresponding parameter tuning. Some experimental results are shown in Sect. 4. Finally, Sect. 5 gathers the final remarks.

2

Modified van der Pol Model

In the mathematical modeling of cardiac rhythm, one of the most commonly used models has been the van der Pol (vdP) oscillator equation, due to the relation between its dynamic response and the qualitative properties of the heart actuation [17]. Since the vdP equation exhibits many of the features that can occur in natural systems, such as synchronization, limit cycles and chaos, this equation is appropriate for the phenomenological modeling of this type of system, including the heartbeat. The classical vdP oscillator can be represented by the following expression:

d²x/dt² + α(x² − 1) dx/dt + ω²x = F(t)   (1)


Although the classical vdP oscillator presents qualitative features similar to those of the heart actuation potential, it does not allow the values of the spontaneous depolarization time and the refraction time to be changed independently. It is important to take these two aspects into account in order to model and simulate significant physiological features of the action potentials [18]. Thus, the modified van der Pol oscillator without external driving (mvP) proposed in [18] is used in this work. The model for the mvP oscillator is described below:

d²x/dt² + α(x − v1)(x − v2) dx/dt − x(x + d)(x + e)/(ed) = 0   (2)

where α, v1, v2, d and e are system parameters. The addition of these parameters to the classical vdP oscillator makes it possible to change the firing frequency of the actuation potential without modifying the refractory period length, as explained in [18]. The state-space representation of the mvP oscillator can be written as:

ẋ1 = x2   (3)

ẋ2 = −α x2 (x1 − v1)(x1 − v2) + x1(x1 + d)(x1 + e)/(ed)   (4)

Under the constraints v1·v2 < 0 and d, e, α > 0, the response x1(t) of the mvP can generate waveforms that resemble the ideal cardiac rate pulse, as shown in Fig. 2.
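As an illustration, the following sketch numerically integrates the state-space model with SciPy using the heartbeat-1 parameters from Table 1. It is not the authors' simulation code: the cubic term is taken with the sign convention of the original relaxation-oscillator model [18], and the initial condition and time span are arbitrary.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Heartbeat-1 parameters as listed in Table 1 (alpha, v1, v2, d, e).
alpha, v1, v2, d, e = 5.5982, 2.5151, -2.5151, 10.6335, 8.26114

def mvp(t, x):
    # State-space form of the mvP, cf. Eqs. (3)-(4); the cubic term enters with
    # the negative sign of the original relaxation oscillator [18].
    x1, x2 = x
    cubic = x1 * (x1 + d) * (x1 + e) / (e * d)
    return [x2, -alpha * x2 * (x1 - v1) * (x1 - v2) - cubic]

# Initial condition and time span are illustrative.
sol = solve_ivp(mvp, (0.0, 20.0), [0.1, 0.0], max_step=0.01)
x1 = sol.y[0]
print(f"{sol.t.size} samples, x1 in [{x1.min():.2f}, {x1.max():.2f}]")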

Fig. 2. mvP system response x1 (t) for α = 0.1, β = 0.5, ε = −1.4 and μ = 3.

3

ECG Signal Model Processing

3.1 Database

The MIT-BIH arrhythmia dataset, extensively used in ECG research [8], is used to obtain experimental recordings. This data set is composed of 16 types of arrhythmia. For testing purposes, record 100 was selected.

3.2 Preprocess

With the aim of comparing the ECG signal morphology with the model response, the P and T waves are removed to keep only the cardiac pulse (QRS complex). This process is performed by deriving the ECG signal y to obtain its derived version yd, as follows:

yd[k − 1] = y[k + 1] − y[k − 1],  k = 2, …, L − 1   (5)

where L is the length of the signal. The derivative of the original ECG signal can be seen in Fig. 3. It is evident that the derivative signal presents some negative values that are not useful, so these are eliminated.

Fig. 3. Original ECG and derivative signal (normalized)

In order to adjust the maximum absolute value of the signal to one and to suppress the DC level, the model response signal and the derivative signal are normalized by applying the expression in Eq. (6):

y ← (y − μ(y)) / max|y|   (6)
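A compact sketch of the preprocessing chain of Eqs. (5)–(6) applied to a synthetic stand-in signal is shown below; a real run would load record 100 instead of the toy waveform.

```python
import numpy as np

# Synthetic stand-in for an ECG trace; the real input would be record 100
# from the MIT-BIH database.
t = np.linspace(0.0, 2.0, 720)
y = np.sin(2 * np.pi * 1.2 * t) + 0.1 * np.random.randn(t.size)

# Eq. (5): central difference yd[k-1] = y[k+1] - y[k-1], k = 2..L-1.
yd = y[2:] - y[:-2]

# Negative values of the derivative are not useful here, so clip them to zero.
yd = np.clip(yd, 0.0, None)

# Eq. (6): remove the DC level and scale the maximum absolute value to one.
yd = (yd - yd.mean()) / np.max(np.abs(yd))

print(yd.shape, round(float(np.max(np.abs(yd))), 3))  # normalized to 1.0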

Finally, the preprocessed ECG signal to be used for estimating the mvP response is shown in Fig. 4.


Fig. 4. Preprocessed ECG signal

3.3 Extraction of Heartbeats

Since a set of parameters must be found to model each heartbeat, the analysis is performed sequentially; that is, before analyzing the next heartbeat, a set of parameters is searched for through an optimization process for the current heartbeat.

3.4 Parameters Optimization

In this stage, a genetic algorithm (GA) is implemented to search for optimal values of the mvP parameters that provide a model response resembling the preprocessed ECG signal. To illustrate the optimization process, pseudo-code is shown in Algorithm 1. First, a random initial population P0 is created, composed of N individuals, each of which represents a set of parameters Xn = (αn, v1n, v2n, dn, en) for tuning the mvP. Next, the fitness of each solution is obtained by applying a dissimilarity measure between the preprocessed ECG signal and the model response. This dissimilarity is obtained through the dynamic time warping (DTW) alignment method, denoted as dtw(*,*). The DTW algorithm, without global constraints, is detailed in [16]. In this work, the Euclidean distance is used as the distance metric to evaluate the dissimilarity between the aligned signals in the DTW. Once the fitness values of the solutions in PG are calculated, the offspring population QG is produced by regular genetic operators such as tournament selection, crossover and mutation. When the offspring population is obtained, the objective function evaluation for the solutions in QG is accomplished. Then, a new population RG = PG ∪ QG is created to choose the N best solutions by means of an


elitism operator. The aforementioned process is executed during a preset number of generations Gmax for each heartbeat k. Finally, this process generates a feature set containing, for each heartbeat, the obtained parameters that best approximate the signal of interest morphologically.

Algorithm 1. Parameter optimization based on GA

Initialize population: P0 = (X1^0, X2^0, ..., XN^0)
for k = 1 to K (K: number of heartbeats)
    for G = 1 to Gmax
        Objective function evaluation (PG): computation of the dissimilarity between the model response for the parameters in PG and the preprocessed signal:
            for j = 1 to N
                f(XjG) = dtw(ydk, x1(XjG))
            end
        Tournament selection: choose the parents of the offspring population QG
        Crossover: QG is created based on the chosen parents
        Mutation: randomly alters solutions of QG
        Objective function evaluation (QG): calculation of the dissimilarity between the model response for the parameters in QG and the preprocessed ECG signal
        Combine populations: RG = PG ∪ QG
        Elitism operator: sort RG in descending order based on the objective function value and assign the N first solutions to PG+1
    end
end
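The following is a compact re-implementation sketch, not the authors' code: a plain O(nm) DTW distance serves as the fitness of a small GA. The model_response function here is a stand-in waveform; a faithful run would integrate Eqs. (3)–(4) for each candidate parameter set. Population size, mutation scale and the synthetic target are all illustrative.

```python
import random
import numpy as np

def dtw(a, b):
    """Plain dynamic time warping distance without global constraints."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])  # Euclidean distance in 1-D
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def model_response(params, length=60):
    a, v1, v2, d, e = params
    t = np.linspace(0.0, 1.0, length)
    # Stand-in waveform; a real run would integrate Eqs. (3)-(4) with `params`.
    return np.exp(-d * (t - 0.5) ** 2) * np.sin(a * t + v1 * v2) * e / 10.0

def fitness(p, target):
    return dtw(model_response(p), target)

def ga_optimize(target, pop_size=12, generations=10, low=0.1, high=15.0):
    pop = [np.random.uniform(low, high, size=5) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda p: fitness(p, target))
        parents = pop[: pop_size // 2]             # elitism: keep the best half
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = random.sample(parents, 2)     # selection among the best
            mask = np.random.rand(5) < 0.5         # uniform crossover
            child = np.where(mask, p1, p2)
            child = child + np.random.randn(5) * 0.2   # Gaussian mutation
            children.append(np.clip(child, low, high))
        pop = parents + children
    return min(pop, key=lambda p: fitness(p, target))

target = np.exp(-40.0 * (np.linspace(0.0, 1.0, 60) - 0.5) ** 2)  # idealized QRS bump
best = ga_optimize(target)
print("best parameters:", np.round(best, 3))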

4

Experimental Results

Figure 5 depicts the system response after the optimization process. It can be seen that the system response is similar to the waveform of a normal heartbeat, represented by the preprocessed ECG signal. Additionally, the aligned signals are presented in Fig. 6, where the model response fits the preprocessed ECG signal with a Euclidean distance value of 21.98. The optimal values of the parameters for each heartbeat are presented in Table 1.
With the purpose of observing the effect of changing the parameter values on the model response, the optimal values of v1, v2, d and e were modified to perform two simulation cases, as follows. In the first case, the values of α, d and e are set at their optimal values, but v1 and v2 are the optimal values increased by 0.5 and decreased by 0.5, respectively, for all the heartbeats. The system response for this case (Fig. 7) shows that the variation in the parameters v1 and v2 causes a time offset between the model response and the preprocessed ECG signal, which clearly affects the similarity between those two signals.


Fig. 5. mvP model response under optimal parameters.

Fig. 6. DTW alignment for the optimized response.

Table 1. Optimal mvP parameters for the heartbeat samples

Heartbeat | α       | v1     | v2      | d       | e
1         | 5.5982  | 2.5151 | −2.5151 | 10.6335 | 8.26114
2         | 7.0488  | 2.1632 | −2.1632 | 14.1948 | 7.87016
3         | 14.6852 | 2.8377 | −2.8377 | 13.1039 | 8.27039
4         | 14.9178 | 2.8719 | −2.8719 | 10.0548 | 13.6683

In the second case, the values of d and e are both increased by 4 with respect to their optimal values, and αn, v1n, v2n take their optimal values (Table 1). Figure 8 shows the system response for the second case, which exhibits an output that doubles the frequency of the preprocessed signal as an effect of modifying the d and e values. Undoubtedly,


Fig. 7. mvP model response for case 1.

Fig. 8. mvP model response for case 2.

the similarity of the signals is strongly affected by the change in d and e. As a consequence, the Euclidean distance between the aligned signals increases to a value of 60.17, as depicted in Fig. 9. The above facts show the importance of the automatic parameter tuning, performed through the genetic algorithm, for finding a set of parameter solutions whose dynamical-system response best approximates the heartbeat signal morphologically.


Fig. 9. DTW alignment for case 2 response.

5

Conclusions

A parameter optimization process to model the heart rate pulse has been presented as a way of characterizing ECG signals. The application of this process showed that the modified van der Pol oscillator used in this work can resemble the ideal cardiac rate pulse under optimal parameters. This research focused only on the QRS complex, with the aim of obtaining a first approximation of a real ECG signal. Thus, as future work, we aim to investigate non-linear systems based on coupled van der Pol oscillators that can emulate the P and T waves, and to apply the optimization process proposed in this work to tune the parameters so as to model a complete ECG signal.

References

1. Bayar, N., Çay, H.F., Erkal, Z., Sezer, İ., Arslan, Ş., Çağırcı, G., Çay, S., Yüksel, İ.Ö., Köklü, E.: The importance of fragmented QRS in the early detection of cardiac involvement in patients with systemic sclerosis. Anatol. J. Cardiol. 15(3), 209–212 (2015)
2. Dodo-Siddo, M., Sarr, S., Ndiaye, M., Bodian, M., Ndongo, S., et al.: Importance of electrocardiogram for detection of preclinical abnormalities in patients with rheumatoid arthritis without cardiovascular events. J. Arthritis 4(155), 2 (2015)
3. Zuluaga-Ríos, C.D., Álvarez-López, M.A., Orozco-Gutiérrez, A.A.: A comparison of robust Kalman filtering methods for artifact correction in heart rate variability analysis. TecnoLógicas 18(34), 25–35 (2015)
4. González-Barajas, J.E., Velandia-Cárdenas, C., Nieto-Camacho, J.: Implementation of real-time digital filter for the R wave detection. TecnoLógicas 18(34), 75–86 (2015)


5. Castro-Ospina, A., Castro-Hoyos, C., Peluffo-Ordonez, D., Castellanos-Dominguez, G.: Novel heuristic search for ventricular arrhythmia detection using normalized cut clustering. In: 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 7076–7079. IEEE (2013)
6. Rodríguez-Sotelo, J.L., Peluffo-Ordonez, D., Cuesta-Frau, D., Castellanos-Domínguez, G.: Unsupervised feature relevance analysis applied to improve ECG heartbeat clustering. Comput. Methods Programs Biomed. 108(1), 250–261 (2012)
7. Abawajy, J.H., Kelarev, A., Chowdhury, M.: Multistage approach for clustering and classification of ECG data. Comput. Methods Programs Biomed. 112(3), 720–730 (2013)
8. Moody, G.B., Mark, R.G.: The MIT-BIH arrhythmia database on CD-ROM and software for use with it. In: Proceedings of Computers in Cardiology 1990, pp. 185–188. IEEE (1990)
9. Jovic, A., Bogunovic, N.: Feature extraction for ECG time-series mining based on chaos theory. In: 29th International Conference on Information Technology Interfaces, ITI 2007, pp. 63–68. IEEE (2007)
10. Acharya, R., Faust, O., Kannathal, N., Chua, T., Laxminarayan, S.: Non-linear analysis of EEG signals at various sleep stages. Comput. Methods Programs Biomed. 80(1), 37–45 (2005)
11. Faust, O., Acharya, U.R., Molinari, F., Chattopadhyay, S., Tamura, T.: Linear and non-linear analysis of cardiac health in diabetic subjects. Biomed. Signal Process. Control 7(3), 295–302 (2012)
12. Peluffo-Ordóñez, D., Rodríguez-Sótelo, J., Revelo-Fuelagán, E., Ospina-Aguirre, C., Olivard-Tost, G.: Generalized Bonhoeffer-van der Pol oscillator for modelling cardiac pulse: preliminary results. In: 2015 IEEE 2nd Colombian Conference on Automatic Control (CCAC), pp. 1–6. IEEE (2015)
13. Fitzhugh, R.: Impulses and physiological states in theoretical models of nerve membrane. Biophys. J. 1(6), 445–466 (1961)
14. Ferreira, B.B., Savi, M.A., de Paula, A.S.: Chaos control applied to cardiac rhythms represented by ECG signals. Phys. Scripta 89(10), 105203 (2014)
15. Sato, S., Nomura, T., et al.: Bonhoeffer-van der Pol oscillator model of the sino-atrial node: a possible mechanism of heart rate regulation. Methods Inf. Med. 33(1), 116–119 (1994)
16. Cuesta-Frau, D., Micó-Tormos, P., Aboy, M., Biagetti, M.O., Austin, D., Quinteiro, R.A.: Enhanced modified moving average analysis of T-wave alternans using a curve matching method: a simulation study. Med. Biol. Eng. Comput. 47(3), 323–331 (2009)
17. van der Pol, B.: LXXXVIII. On relaxation-oscillations. Lond. Edinb. Dublin Philos. Mag. J. Sci. 2(11), 978–992 (1926)
18. Grudziński, K., Żebrowski, J.J.: Modeling cardiac pacemakers with relaxation oscillators. Phys. A: Stat. Mech. Appl. 336(1), 153–162 (2004)

Visible Aquaphotomics Spectrophotometry for Aquaculture Systems

Vladyslav Bozhynov(✉), Pavel Soucek, Antonin Barta, Pavla Urbanova, and Dinara Bekkozhayeva

Laboratory of Signal and Image Processing, Faculty of Fisheries and Protection of Waters, South Bohemia Research Center of Aquaculture and Biodiversity of Hydrocenoses, Institute of Complex Systems, University of South Bohemia in České Budějovice, Zámek 136, 373 33 Nové Hrady, Czech Republic
[email protected]
http://www.frov.jcu.cz/en/institute-complex-systems/lab-signal-image-processing

Abstract. Water quality is an important question for the environment as well as for aquaculture, regardless of which system is used: water treatment systems, water supply systems, pond treatment systems or aquaponics systems. For various purposes, from simple water monitoring in maintenance, regulation, control and optimization to behavior models in biometrics, biomonitoring, biophysics and bioinformatics, it is necessary to observe a wide range of variables. This article discusses and describes a method of biomonitoring called Aquaphotomics. Aquaphotomics is a term introduced to define the application of spectrophotometry in the near infrared region (NIR) in order to understand the influence of water on the structure and function of biological systems. Currently, aquaphotomics is focused on the NIR part of the light spectrum, while we want to broaden this investigation to also include the visible part.

Keywords: Aquaphotomics · Biomonitoring · Spectrophotometry · Hue · HSV · RGB · Aquaculture · Measurement · Nutrients

1

Introduction

As aquaponics systems gain popularity every day, the issue of water quality control and the concentration of critical parameters becomes more relevant. This is due to the fact that in such systems it is necessary to maintain optimal conditions for the life of fish and plants simultaneously. This requires constant monitoring of physical and chemical parameters.
This article examines the Aquaphotomics method, which provides a framework for understanding changes in a water molecular system, presented as a


water spectral pattern, to mirror the rest of the solution and to give a holistic description related to system functionality. One of its main purposes is to identify water bands as the main coordinates of future absorbance patterns to be used as a system biomarker. The Aquaphotomics method illustrates a way to identify specific water bands that bear evidence of temperature changes, concentrations of solutions of different ionic strengths and other disturbances. This method analyzes particle concentration based on the multivariate analysis of water absorbance bands [1–4].
Aquaphotomics (aqua - water; photo - light; omics - all about) is a new '-omics' discipline introduced in 2005 by Prof. Roumiana Tsenkova from the Laboratory of Bio Measurement Technology at the Faculty of Agriculture, Kobe University, Japan. It rests on a new concept known as water as a molecular mirror [5]. This work is based on the knowledge that visible changes occur when the spectral pattern of water is altered. Light from the surface of the water has a different spectrum than the incident light. These spectra are recorded by devices based on silicon semiconductors. Aquaphotomics allows the analysis of the concentration of microparticles and heavy metals in water, and the detection of microbial organisms (bacteria). An improved prediction of particle concentration is obtained by multivariate analysis based on water absorbance bands rather than by a univariate method based on the absorbance band at a certain wavelength. Particle concentration measurements in a water solution are possible because of the interaction with water "seen" by the NIR light at various water absorbance bands. Multivariate spectral analysis reveals that changes of the matrix under perturbation reflect, like a mirror, the rest of the molecules surrounded by water. As a result, characteristic water absorbance patterns are used for disease diagnosis and the measurement of minuscule concentrations of solutes [6,7].
Aquaphotomics has developed to answer basic questions related to these phenomena and to be used in other applications. For example, this method was used in understanding the role of water in biotic and abiotic stress of plants. It was found that fewer hydrogen-bonded water structures decrease in size under biotic stress [8].
Currently, aquaphotomics is focused on the NIR part of the light spectrum, while we want to broaden this investigation to also include the visible part. We know that each element or compound has a specific 'fingerprint' pattern in the absorption of electromagnetic radiation. The patterns are present as spectral series according to the Rydberg formula. Even in the visible spectrum, a fingerprinting section is present. Thus, the absorption in the visible spectrum gives unique information about the elements or compounds present. Based on this knowledge, we are working on developing a simple measuring device for aquaculture based on aquaphotomics principles, using spectrophotometers, something like photodiode array detectors [9–11].
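As a quick illustration of the visible "fingerprint" idea, the following textbook calculation uses the Rydberg formula to list the hydrogen Balmer lines that fall in or near the visible range; it is background arithmetic, not part of the measurement method itself.

```python
# Rydberg formula: 1/lambda = R * (1/n1^2 - 1/n2^2).
# For the Balmer series (n1 = 2) the first lines fall in the visible range.
R = 1.0973731568e7  # Rydberg constant, 1/m

n1 = 2
for n2 in range(3, 8):
    inv_lambda = R * (1.0 / n1**2 - 1.0 / n2**2)
    wavelength_nm = 1e9 / inv_lambda
    print(f"n2 = {n2}: {wavelength_nm:.1f} nm")
# Prints approximately 656, 486, 434, 410 and 397 nm.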

2

Device Selection

First of all, it was necessary to choose the most suitable device for further experiments. The choice was between the spectrophotometer ColorMunki and the RGB sensor EZO-RGB (Fig. 1). For our purposes, the most important parameters were accuracy, precision and stability of the values.
The spectrophotometer ColorMunki from the company X-Rite can measure a spectrum in the range from 360 to 740 nm with a step of 10 nm, as well as RGB color values. This device has its own software with the following capabilities:

• create and name unlimited custom color palettes;
• select colors from built-in libraries;
• capture any color from any substrate;
• automatically extract color from any image.

Fig. 1. Spectrophotometer ColorMunki (left side) and EZO-RGB Embedded Color Sensor (right side)

The EZO-RGB Embedded Color Sensor from the company Atlas Scientific can measure in true 8-bit RGB format (0–255) and in the CIE color space. This device also gives information about light intensity. The sensor uses a colorimetric measurement principle and consists of a photodiode array, red, green and blue filters, and three amplifiers with a current input. The RGB filters decompose the incident light into red, green and blue components. The photodiode of the corresponding color channel turns them into a photocurrent. Then the three amplifiers with a current input convert the photocurrent into a voltage. Together, the three analog outputs carry information about the color and power of the light. For our experiments, the most important advantage of this sensor was its water resistance.

2.1 Hue

RGB space is most often used in computer graphics because color is formed from these three components. The red, green and blue components use 8 bits each, taking integer values from 0 to 255. This makes 256 × 256 × 256 = 16,777,216 possible colors. However, RGB is not very effective when it comes to real images. The fact is that to preserve the color of an image, it is necessary to know and store all three RGB components, and losing one of them will greatly distort the visual quality of the image. Also, when processing images in RGB space, it is not always convenient to change only the brightness or contrast of an individual pixel, because in this case it is necessary to read all three RGB components, recalculate them for the desired brightness and write them back [12,13].
From our point of view, RGB color space is not the best option, as it does not match the concept of color as understood by humans. As a result, different colors (with different levels of illumination) can have the same RGB values. In this regard, a far more appropriate model is HSV, wherein the value of H (hue) is a parameter corresponding to the color spectrum from the physical point of view, that is, the wavelength. The value of S is the saturation, chroma or color purity: the decrease in white, or the ratio of mixing H with gray of the same intensity. V (value) is the intensity, given by the normalized sum of the RGB values (corresponding to the grayscale representation), and it expresses the relative brightness (maximum value) or darkness (minimum value) of the color. This model also most closely matches the human (psychophysical) way of perceiving color. Therefore, in the HSV color model, the color is represented by the hue [14,15].
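A minimal example of the RGB-to-HSV conversion discussed above, using Python's standard colorsys module; the sample RGB triple is the one from the caption of Fig. 4.

```python
import colorsys

# 8-bit RGB color (here the triple from Fig. 4); normalize to [0, 1] first.
r, g, b = 183, 201, 181
h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)

# H is commonly reported in degrees and corresponds to the spectral hue;
# S is the color purity and V the relative brightness.
print(f"H = {h * 360:.1f} deg, S = {s:.3f}, V = {v:.3f}")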

2.2 Detection of Device Errors

The last step in choosing a device was to compare the results of both devices with known color standards. For this purpose, GoeGuide color plates (Fig. 2),

Fig. 2. Color plates GoeGuide, spectrophotometer ColorMunki and RGB-sensor EZO-RGB


which include 2058 color etalons with known RGB values, were measured by both instruments. These data were processed using the MatLab Image Processing Toolbox in three different color spaces (RGB, CIE, HSV) by plotting graphs and comparing the measured values with the reference values. We were most interested in the results in the HSV color space, for the reason described in the previous section (see Sect. 2.1).
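The comparison itself reduces to an element-wise error statistic between the measured values and the etalon references in each color space. A minimal Python sketch of this step (the file names and array layout here are hypothetical; the actual processing used the MatLab toolbox as stated above):

```python
import numpy as np

# Hypothetical inputs: one row per etalon, columns H, S, V
reference = np.loadtxt("goeguide_reference_hsv.csv", delimiter=",")  # known values
measured = np.loadtxt("colormunki_measured_hsv.csv", delimiter=",")  # device readings

# The mean absolute error per channel quantifies the device error
mae = np.mean(np.abs(measured - reference), axis=0)
for name, err in zip(("H", "S", "V"), mae):
    print(f"{name}: mean absolute error = {err:.4f}")
```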

3 Correlation Between Crucial Parameters and Spectral Characteristics

Measurements of the spectra of water samples were carried out using the ColorMunki spectrophotometer. The typical precision (or minimal step) of photometers is between 5–20 nm; the available spectrophotometers (X-Rite) have a step of 10 nm. This precision is not sufficient for identifying a compound or element from the white absorption spectrum, since the patterns differ by a much smaller step. Therefore, to increase the harvested information, we measured the spectra of different color etalons. For this purpose, we used a ColorChecker plate (Fig. 3), which includes 24 different color etalons. For these etalons we also measured how the reference spectra should look through pure water.

Fig. 3. ColorChecker Classic card (X-Rite)

To find the relationships between the concentrations of crucial parameters and the spectral characteristics of water samples, the differences of the 24 spectra (expected − measured) were compared with known concentrations of nutrients. This gives information on the visible range of the spectrum in which the chemical absorption pattern (fingerprint, barcode) lies. Depending on the concentration of the chemical, the changes in the spectral absorption will differ. The experiments were compared with simultaneously performed measurements by several types of approaches (portable and automatic sensors, photometric).


The main purpose of these experiments is to determine which color gives the maximum amount of information, as well as to determine in which visible range of the spectrum the chemical absorption maximum lies. Thus, 36 spectral differences (36 wavelength bars) for each of the 24 colors give 864 features for pattern parametrization and identification (Fig. 4).
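The construction of this feature vector is a simple flattening of the matrix of spectral differences. A sketch under the stated dimensions (the input files are hypothetical placeholders; the real spectra come from the ColorMunki measurements):

```python
import numpy as np

N_COLORS, N_BANDS = 24, 36    # 24 etalons x 36 wavelength bars = 864 features

# Hypothetical inputs: reference spectra expected through pure water and
# spectra measured through the water sample, both of shape (24, 36)
expected = np.load("expected_spectra.npy")
measured = np.load("measured_spectra.npy")
assert expected.shape == measured.shape == (N_COLORS, N_BANDS)

# One feature vector per water sample: the flattened spectral differences
features = (expected - measured).ravel()
print(features.shape)   # -> (864,)
```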

Fig. 4. Example of spectra measurements for R = 183, G = 201, B = 181 (Color figure online)

3.1 Experiment Setup

The experiment consisted of two types of measurements, experimental and control: the first measured different spectra through the sample, the second measured the concentrations of the crucial parameters. The following parameters were measured: dissolved oxygen (DO), temperature (T), electrical conductivity (EC), pH, chlorine (Cl), ammonia (NH3) and ammonium (NH4). Experimental measurements were carried out by measuring 24 different spectra (colors) through a glass square cuvette with the water sample in a closed 'box' to avoid the influence of external light. As described above (see Sect. 3), the changes in the 24 spectra were compared for each sample to determine which color represents most of the changes, and therefore gives the maximum amount of information about a particular parameter. For the experiments, several different water samples were prepared, including technical water (from the tap), distilled water, water from an aquarium with fish, water from an aquaponics system, as well as various mixtures of these replicates in different proportions. All these samples had different temperatures, electrical conductivities and concentrations of chemical elements. This was done specifically


to gather the maximum amount of information about the changes in the spectrum with the changes in the parameters. In parallel, control measurements were made for each sample, which provided information about the concentrations of the parameters. These measurements were carried out using a classical laboratory spectrophotometer (for chemical elements) and various types of sensors (for physical parameters). To achieve maximum stability, water samples were prepared in which the concentration of only one parameter was changed while the rest remained stable. This allowed us to determine which color and which part of the spectrum (wavelength) gives the maximum information for each of the parameters. The last step was data processing, which consisted of constructing the spectra and analyzing them. To analyze the statistical data and confirm the existence of a relationship between the spectrum and the concentration of an element, the correlation coefficient and confidence (R, p) were calculated. MatLab and Excel were used for these purposes.
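The correlation step is a standard Pearson test between the control measurement of one parameter and the corresponding color value. For illustration, a Python sketch using SciPy on invented placeholder numbers (we actually used MatLab and Excel; on the real data of Fig. 8 this computation yields R = 0.6513 and p = 0.0013):

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder data invented for illustration: one value per water sample
nh3 = np.array([0.05, 0.10, 0.22, 0.41, 0.55, 0.63, 0.80, 0.95])  # normalized NH3
hue = np.array([0.12, 0.18, 0.30, 0.35, 0.52, 0.58, 0.71, 0.88])  # normalized H-value

r, p = pearsonr(nh3, hue)
print(f"R = {r:.4f}, p = {p:.4f}")
```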

4 Results

In the first part of the experiment, we compared the two color measuring devices to choose the more suitable one for further spectral measurements. The device results were compared with etalon values (known from the GoeGuide color plates) in two color spaces: RGB, as the most common, and HSV, which represents color more accurately. As can be seen in the graphs (Fig. 5), ColorMunki displays color much more accurately than EZO-RGB. Most likely, the error is caused by the fact that in the RGB color space the same color can have different values of R, G and B, depending on the illumination. For this reason, the RGB color space is not directly acceptable for color measurements; a far more appropriate model is HSV. Fortunately, the transfer from RGB to HSV does not present any problems. As can be seen from the graphs (see Fig. 6), in the HSV color space the EZO-RGB gives much better results in color display (H-value) than in the RGB color space. However, the S indicator is shifted and the V indicator is far from real. As a result, despite such a weighty advantage of the RGB sensor as water resistance, the spectrophotometer ColorMunki was chosen for the further experiments. The second part of the experiment was related to the correlation between the spectral characteristics of the water specimens and the concentrations of key parameters in them. For ease of analysis, the values were normalized from 0 to 1. The graphs (Figs. 7 and 8) show the relationship between the value of a parameter and the color (H-value), but it differs (direct, inverse) in different parts of the graph or is not visible at all. However, an analysis of the graphs showed that the dependence exists; for example, the correlation results for Fig. 8 are R = 0.6513 and p = 0.0013. More accurate and understandable are the graphs of the dependence of the spectrum on the concentration of the elements.


Fig. 5. Comparison of measured (solid line) and expected (dashed line) RGB (from top to bottom R, G, B) for the spectrophotometer ColorMunki (left side) and the RGB-sensor EZO-RGB (right side) using color etalons with known RGB values from GoeGuide plates (Color figure online)

Fig. 6. Comparison of measured (solid line) and expected (dashed line) HSV (from top to bottom H, S, V) for the spectrophotometer ColorMunki (left side) and the RGB-sensor EZO-RGB (right side) using color etalons with known RGB values from GoeGuide plates (Color figure online)


Fig. 7. Correlation graph between the value of electrical conductivity (EC) and the H-value of the yellow etalon. The blue line shows the changes in the value of electrical conductivity, which were measured for different water samples. The beige line shows the changes in the H-value of the yellow etalon. The H-value was measured through the cuvette with the same samples (Color figure online)

Fig. 8. Correlation graph between the amount of a chemical compound (NH3) and the H-value of the blue etalon. The red line shows the changes in the concentration of ammonia, measured for different water samples. The beige line shows the changes in the H-value of the blue etalon. The H-value was measured through the cuvette with the same samples (Color figure online)

This graph (Fig. 9) clearly shows how the spectrum changes with the concentration of NH3. The spectrum shifts in intensity throughout the range (360–740 nm), but more accurate measurements will be made at the sites with the largest displacement (about 450 nm). As can be seen from the graph (Fig. 10), the dependence of the spectrum on temperature is most visible in the region from 530 to 740 nm.


Fig. 9. Graph of changes in the spectrum of the blue etalon (with values R = 80, G = 91, B = 166) according to the changes in the concentration of NH3 (low and high indicate a low and high concentration of NH3 in a water sample) (Color figure online)

Fig. 10. Graph of changes in the spectrum of the yellow etalon (with values R = 231, G = 199, B = 31) according to the changes of the temperature (low and high indicate a low and high water sample temperature) (Color figure online)

5 Conclusion and Discussion

As a result of these experiments, we found a correlation of the spectrum with the concentrations of crucial parameters, with a correlation coefficient R of about 0.65, which is more than acceptable. To expand visible aquaphotomics spectrophotometry for aquaculture systems into a full-fledged biomonitoring method, it is planned to increase the number of parameters, as well as to add experiments with groups of chemical elements. The statistical data obtained as a result of the experiments will be used to create software that can determine the concentrations of elements from the spectral characteristics of water samples.

Acknowledgments. This work was supported and co-financed by the South Bohemian Research Center of Aquaculture and Biodiversity of Hydrocenoses (CENAKVA CZ.1.05/2.1.00/01.0024); 'CENAKVA II' (No. LO1205 under the NPU I program); and by the South Bohemia University grant GA JU 017/2016/Z.


References
1. Kovacs, Z., Bázár, G., Oshima, M., Shigeoka, S., Tanaka, M., Furukawa, A., Nagai, A., Osawa, M., Itakura, Y., Tsenkova, R.: Water spectral pattern as holistic marker for water quality monitoring. Talanta 147, 598–608 (2016)
2. Tsenkova, R.: Introduction: aquaphotomics: dynamic spectroscopy of aqueous and biological systems describes peculiarities of water. J. Near Infrared Spectrosc. 17(6), 303–313 (2010)
3. Tsenkova, R.: NIRS for biomonitoring. Ph.D. thesis, Hokkaido University, Japan (2004)
4. Tsenkova, R.: Aquaphotomics tenth anniversary. NIR News 27(1), 45–47 (2016)
5. Tsenkova, R., Kovacs, Z., Kubota, Y.: Aquaphotomics: near infrared spectroscopy and water states in biological systems. In: Disalvo, E.A. (ed.) Membrane Hydration. SB, vol. 71, pp. 189–211. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19060-0_8
6. Jinendra, B., et al.: Near infrared spectroscopy and aquaphotomics: novel approach for rapid in vivo diagnosis of virus infected soybean. Biochem. Biophys. Res. Commun. 397(4), 685–690 (2010)
7. Siesler, H.W., Ozaki, Y., Kawata, S., Heise, H.M. (eds.): Near-Infrared Spectroscopy: Principles, Instruments and Applications. Wiley, Chichester (2002). p. 182
8. Penuelas, J., Filella, I.: Visible and near-infrared reflectance techniques for diagnosing plant physiological status. Trends Plant Sci. 3, 151–156 (1998)
9. Averill, B.A., Eldredge, P.: Principles of General Chemistry, pp. 709–732 (2011)
10. Petty, A.: The periodic table of light. Ener. Res. J. (2012)
11. Kramida, A.E.: A critical compilation of experimental data on spectral lines and energy levels of hydrogen, deuterium, and tritium. At. Data Nucl. Data Tables 96, 586–644 (2010)
12. Pascale, D.: A review of RGB color spaces...from xyY to R'G'B'. BabelColor 18, 136–152 (2003)
13. Ibraheem, N.A., et al.: Understanding color models: a review. ARPN J. Sci. Technol. 2(3), 265–275 (2012)
14. Urban, J.: Colormetric experiments on aquatic organisms. In: Rojas, I., Ortuño, F. (eds.) IWBBIO 2017. LNCS, vol. 10208, pp. 96–107. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-56148-6_8
15. Article Color models CMYK, RGB, Lab, HSB. Electron. J. 'CIFRAmagazine' (2012)

Resolution, Precision, and Entropy as Binning Problem in Mass Spectrometry

Jan Urban(B)

Laboratory of Signal and Image Processing, Institute of Complex Systems, South Bohemian Research Center of Aquaculture and Biodiversity of Hydrocenoses, Faculty of Fisheries and Protection of Waters, University of South Bohemia in České Budějovice, Zámek 136, 373 33 Nové Hrady, Czech Republic
[email protected]

Abstract. The analysis of mass spectra depends on the initial estimation of resolution and precision. The novel method of relative entropy combines the detection of false precision, the statistical binning problem, and the change of information content into one task. The methodological approach as well as the relevant objectives are discussed in the first two parts of the work, including mathematical justification. The method of relative entropy gives results comparable to false precision detection, but uses a different approach. The solution of the binning problem is estimated via maximization of the relative entropy as a criterion parameter for objective magnitude rounding. The approach is verified on real high resolution measurements with a known presence of false precision. The method could be generalized for a wider spectrum of data binning/precision tasks.

Keywords: Resolution · Precision · Mass spectrum · Information entropy · Binning

1 Introduction

Mass spectrometry belongs among the most used instruments in bioinformatics and biophysics, mainly for metabolomics, proteomics, and lipidomics research. The market offers various types of instruments (TOF, 3QD, MALDI, frequency based). There are tens of thousands of mass spectrometry measurements every day across the laboratories around the world. The further analysis of such a huge amount of datasets has to be based on the automation of the processing steps, and therefore on self-parameterization. The crucial parameter at the beginning is used during the transformation of the profile data into centroids, since most of the de-noising, deconvolution, and alignment methods expect such values. Therefore, the error of the deprofiling method could be propagated through the analysis steps in both a qualitative and a quantitative way. Understanding the principle of data assignment during the measurement process allows us to describe the process in a mathematical way and use


such a description for the determination of the most crucial parameter. The question is not trivial, as can be documented in Fig. 1, where the same measurement is stored in two different file formats with different precision of the values. The identification of the peak is affected by the over-segmentation, or false precision, of the dataset. The methodological approaches to overcome such issues are described in the method section of this article. The centroidization of the profile (peak) data depends on three complementary attributes:

– Resolution - the ability of the measurement device to resolve two groups of values (peaks given by the uncertainty of the exact value), requiring valleys between such groups (peaks).
– Accuracy - the ability of the measurement device to give a value as close as possible to the exact value, within the given resolution and precision, depending on the device calibration.
– Precision - the number of valid digits after the decimal point, with respect to the given resolution.

Fig. 1. The difference between a good and a bad peak. Example of an over-segmented mass peak in a cumulative mass spectrum.

The question of resolution determines whether the data values belong to a given peak or not [1]. The proper estimation of the precision allows segmenting the peaks with the highest possible confidence. The possible occurrence of over-segmented data complicates straightforward magnitude rounding; an additional criterion is required to estimate the proper precision level. Since the normalized cumulative (integrated) mass spectrum represents the estimation of


Fig. 2. The difference between a good and a bad peak. The peak with fixed false precision; over-segmented data are still visible at a lower order.

the mass values distribution, the task of the precision is related to the statistical binning problem (Fig. 2). In the case of a huge dataset with an almost continuous dimension, the data have to be grouped into intervals to create the histogram. For more than a hundred years there has been an unanswered question of whether there exists a way to properly group the data into small intervals, called bins, instead of exact values. Various approaches were made to determine the correct number of bins, or the length of the bin. The conditions for binning are as follows:

– The shape of the distribution has to be approximately preserved.
– The shape has to be as smooth as possible.
– There should be no artifacts.

In the economical, financial, and educational fields, one of the most used methods is given by the Sturges rule:

k = 1 + log2(n), (1)

where n is the amount of data values and k is the number of bins [2]. However, the Sturges rule is applicable only for a limited range of dataset sizes. It is focused on the amount of values, with no consideration of the values' distribution.


A different approach, based on the statistical parameter of the inter-quartile range (IQR), was introduced by Freedman and Diaconis:

h = 2 IQR / n^(1/3), (2)

where h is the width of the bin interval [3]. The latest approach was introduced by Shimazaki and Shinomoto, based on the minimization of a risk function for the bin width h:

argmin_h (2m − v) / h^2, (3)

where m is the data mean value and v is the mean-shifted variance [4]. The method gives smooth results; however, the occurrence of artifacts is not excluded. Different lengths of the bin reveal different portions of the dataset information. With a huge amount of bins, the total information will be distributed over so many possibilities that the total information is close to zero (the information is dissolved - too many details, but the whole 'image' is lost). With a low amount of bins, the total information will be higher, since it is significantly different into which bin an exact data value belongs.
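For concreteness, the three rules of Eqs. (1)–(3) can be sketched in a few lines of Python (an illustrative transcription; following the original method [4], the mean m and the biased variance v in the Shimazaki–Shinomoto rule are taken over the bin counts):

```python
import numpy as np

def sturges_bins(x):
    """Sturges rule, Eq. (1): k = 1 + log2(n)."""
    return int(np.ceil(1 + np.log2(len(x))))

def freedman_diaconis_width(x):
    """Freedman-Diaconis rule, Eq. (2): h = 2 IQR / n^(1/3)."""
    q75, q25 = np.percentile(x, [75, 25])
    return 2 * (q75 - q25) / len(x) ** (1 / 3)

def shimazaki_shinomoto_width(x, candidates):
    """Shimazaki-Shinomoto rule, Eq. (3): minimize (2m - v) / h^2 over h."""
    best_h, best_cost = None, np.inf
    for h in candidates:
        edges = np.arange(x.min(), x.max() + h, h)
        counts, _ = np.histogram(x, bins=edges)
        cost = (2 * counts.mean() - counts.var()) / h ** 2
        if cost < best_cost:
            best_h, best_cost = h, cost
    return best_h

x = np.random.default_rng(0).normal(size=1000)
print(sturges_bins(x), freedman_diaconis_width(x),
      shimazaki_shinomoto_width(x, np.linspace(0.05, 1.0, 40)))
```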

Fig. 3. Microscopic image information entropy dependence on exposition time and shutter speed.

The relation between binning and resolution arose from optical bright field and phase contrast microscopy. For time-lapse microphotography experiments, it is necessary to keep the cells in focus. The typical focus criterion based on differences also estimates the validity of the central limit theorem, which


is applicable in the macro-world. However, in microscopy the situation with the focus is affected by the point spread function of the observed objects. The evaluation of the entropy of a set of images focused at different positions allowed selecting the position which gave the highest amount of information. It is to be expected that the image with the most information (highest entropy) represents the focused image (with the most relevant details), which in other positions are blurred and therefore informationally poor. From the mathematical point of view, the maximal value of the information entropy represents just one of the inflection points in the entropy characteristics of position, exposition, and shutter [5,6] (see Fig. 3). In this paper, the general method of binning via the inflection point of the relative entropy is presented and discussed for use in precision estimation in mass spectrometry.

2 Methods and Analysis

The metabolomic high resolution mass spectrometry datasets were processed in the following way. The vector of available m/z values was created and the number of its individual values evaluated. The mass spectrometry data arranged in a matrix (time × m/z) were used for the calculation of the cumulative (integrative) mass spectrum [7]. The cumulative mass spectrum serves as an estimation of the histogram of the m/z values [8].

2.1 False Precision

As a reference tool, the false precision elimination in liquid chromatography - mass spectrometry [8] was used, where the total amount of data is reduced by up to one order, and peak shapes are preserved or even enhanced from the noise and over-segmentation. To obtain only the significant digits after the decimal point, the relation between the value a (in false precision) and the value b (precise, with only significant digits) is given by the so-called magnitude coefficient m:

b = round(a · m) / m, (4)

where the rounding of a · m is provided to the nearest integer. Therefore, the result of the rounding is strictly dependent on the magnitude m [8].
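Equation (4) is a one-liner; a minimal Python sketch on invented m/z values (the real evaluation used Matlab, see Sect. 2.4):

```python
import numpy as np

def magnitude_round(a, digits):
    """Eq. (4): b = round(a * m) / m with magnitude m = 10**digits."""
    m = 10.0 ** digits
    return np.round(a * m) / m

# Illustrative m/z values stored with false precision
mz = np.array([523.2841937261842, 523.2846103319476, 523.2849998210331])
print(magnitude_round(mz, 3))   # -> [523.284 523.285 523.285]
```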

2.2 Entropy

The Shannon concept of entropy is considered, with knowledge of its limitations. The Shannon entropy does not consider all possible types of distributions; a generalization of the Shannon concept is given by the Rényi entropy [9,12]. The information must be additive for two independent events a, b:

I(ab) = I(a) + I(b). (5)


The information itself should depend only on the probability distribution. The previous equation is the well-known modified Cauchy functional equation, with the unique solution

I(p) = −κ log2(p), (6)

for the variable p (probability).

Fig. 4. Total entropy dependence on magnitude of binning.

In statistical thermodynamic theory, the constant κ refers to the Boltzmann constant [10]. In the Hartley measure of information, κ equals one [11,12]. Let us focus on the Hartley measure. If different amounts of information occur with different probabilities, the total amount of information is the average of the individual information, weighted by the probabilities of their individual occurrences [9,12]. Therefore, the total amount of information is

Σ_d (p_d I_d), (7)

which leads us to the definition of the Shannon entropy as a measure of information:

S = − Σ_d p_d log2(p_d). (8)

Thus, entropy is the sum of the individual informations weighted by the probabilities of their occurrences [13]. In the case of a mass spectrum, the m/z occurrences are treated as the data vector for the probability and entropy estimation. In the iterative process of binning, different distributions are created, where the parameters amount of bins k or length of bins h change the total


amount of information of the binned distribution (or mass spectrum). An increasing magnitude m (significant digit) represents shorter bins, and therefore more bins, which also means more individual values (bin representatives). A lower amount of bins means that more values were integrated together. The amount of information decreases with increasing magnitude, see Fig. 4. Therefore, with fewer bins we are getting more information, since there are fewer possible options. This could lead to the wrong conclusion that it is better to round the data as much as possible. The total entropy estimation itself is not a significant criterion parameter, since it is obvious that the value will change - because the amount of possibilities in the distribution (= the amount of bins) is changed. This modification has to be integrated into the evaluation via optimal weighting.

2.3 Weighting Factors

The amounts of total entropy are not comparable to each other, since the basis of the data (the amount of bins) is changed. Therefore, it is necessary to renormalize the entropy values to set them relative to the amount of bins. This could be done using a weighting factor for the total entropy, obtaining a relative entropy. The weighting factor could be proportional to

– the amount of bins k;
– the length of the bin h;
– the amount of data.

The amount of data should be the same in all cases; the values were just integrated into their bins. The amount of bins k and the length of the bins h are inversely proportional to each other: the fewer bins there are, the bigger the length of the bin. In this case, the amount of bins k is selected as the weighting factor. With huge amounts of bins, the total entropy is low, because the information is dissolved. With small amounts of bins, the total information is higher (Fig. 4). Anyway, there is an observable jump of the entropy information at a certain level. In other coordinates, this could be interpreted as an inflection point. As was described above, the total entropy should be weighted by the amount of bins:

re = e(d) × k_d, (9)

where re is the relative entropy, e(d) is the entropy of the dth binning, k_d is the amount of bins in the dth binning, and d is the index of the binning iteration. The relative entropy defined above is qualitatively equivalent to the total entropy divided by the length of the bin h; therefore, it is a relative entropy, relative to the bin length. The arising inflection point of the relative entropy should be located at the proper magnitude of the significant digit. The relative entropy is expected to reach its apex at the optimal relation between the total information and the amount of bins. All other binnings will have too many or too few bins (Fig. 5).
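A literal transcription of the whole procedure of Sects. 2.2–2.3 is sketched below: for each candidate magnitude, the m/z values are rounded (Eq. (4)), the occurrence counts of the resulting bin representatives give the probabilities for the Shannon entropy (Eq. (8)), and the entropy weighted by the number of occupied bins (Eq. (9)) is maximized. This is only an illustrative Python equivalent of the Matlab implementation described in Sect. 2.4:

```python
import numpy as np

def shannon_entropy(counts):
    """Eq. (8): S = -sum_d p_d log2(p_d) over the occupied bins."""
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_magnitude(mz, max_digits=7):
    """Digits after the decimal point maximizing the relative entropy, Eq. (9)."""
    best_d, best_re = 0, -np.inf
    for d in range(max_digits + 1):
        m = 10.0 ** d
        binned = np.round(mz * m) / m                 # Eq. (4)
        _, counts = np.unique(binned, return_counts=True)
        re = shannon_entropy(counts) * counts.size    # entropy x number of bins
        if re > best_re:
            best_d, best_re = d, re
    return best_d
```

As described in Sect. 3 below, one more digit is added to the located magnitude to preserve the valleys between the peaks.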


Fig. 5. Relative entropy dependence on the magnitude of binning.

2.4 Implementation

All datasets were processed in Mathworks Matlab. False precision was evaluated using Expertomica Metabolite Profiling Hi-Res [7]. The average time of computation was 27 s. The typical raw amount of individual m/z values was around 400,000. Two types of datasets were evaluated - high resolution data with a possible occurrence of false precision (xml based data), and low resolution data which are already of proper (but low) precision.

3 Results

The process of evaluation follows the logic of magnitude rounding. The datasets are not rounded to all possible values; the range is limited from zero digits after the decimal point (unit resolution) to seven digits (the current physical limitation). The steps were of one order. The most time consuming part of the evaluation is the accumulation of the data into bins. The information about the occurrence (intensity) is preserved; the position is integrated into the bin representatives (rounded values). For each magnitude, the entropy and the relative entropy re are evaluated. The rapid changes of information are located at the inflection points, which could be represented as second derivatives or relations. There is usually more than one inflection point in the distribution. To select the most representative inflection point, the concept of maximization is applied - the selected binning has to be relatively informationally rich. The maximum is not only a simple apex location for the proper rounding of high resolution data; it also preserves the position for low resolution data - the maximum is at the beginning of the interval, where extrapolation by zeros is required for the proper positioning of the inflection point.


Fig. 6. Raw cumulative mass spectrum with false precision.

In Fig. 6 is an example of the high resolution cumulative mass spectrum with a false precision of 13 digits after the decimal point. The peaks are noisy, and the borders are not clear. The reference algorithm for false precision, based on the amount of scans, estimated the proper amount of significant digits as 3 after the decimal point and adds one more to allow the existence of the valleys between the peaks; thus, the used amount is 4 digits after the decimal point. The relative entropy approach located the significant binning at the magnitude corresponding to 3 digits after the decimal point. Again, after adding one more digit to allow valleys for further deconvolution, the result is exactly the same as with the reference algorithm: 4 digits after the decimal point. Exactly the same results were obtained for randomly chosen single mass spectra (without the time axis information available). With the low resolution data, the significance of the digits is not changed. The equality of the relative entropy approach and the false precision detection [8] verifies the usability of the method. The observed results offer smoother peaks, clearer borders and valleys (Fig. 7), and very badly over-segmented peaks are raised from the noise level by one order; therefore, the signal to noise ratio is also significantly increased. The algorithm itself is relatively simple; the main idea is to realize that the data are changing during the binning, thus a direct comparison of the total information entropy is not possible. Instead, it is necessary to accept the concept of relative entropy, which can reach its apex for the proper binning. The consideration of the false precision in mass spectrometry as a special case of the binning problem allowed developing a statistical concept which improves the preprocessing of high resolution datasets, decreases the amount of data, as well as decreases the error in the centroid assignment to be propagated.


Fig. 7. Cumulative mass spectrum, precision corrected via relative entropy binning.

4 Conclusion and Discussion

The question of the proper rounding for the profile to centroid transformation in liquid chromatography - mass spectrometry is related to the presence of false precision in the dataset. With the information from the time elution, namely the amount of scans, the false precision can be detected and corrected. In this work, an extension and modification of this method, based on the binning problem and the relative entropy, was introduced. The method is more general, since it allows evaluating individual spectra, like direct injections, as well as chromatographic scans. Instead of the presence of noise, the detection method considers the changes in the total information entropy of the spectrum relative to the amount of bins. The analogy to the binning problem in statistical analysis was described and used for the development of the relative entropy concept. The search for the information inflection point was solved via the maximal value of the relative entropy. In the situation where the data are already properly binned, the results remain correct. The benefit of the described methodology is that the false precision in mass spectrometry can be detected alone as well as in tandem with liquid chromatography. Moreover, the general concept of relative entropy could serve in other disciplines where data binning is necessary. The concept of entropy itself again proved the general applicability of this parameter and its fundamental meaning across disciplines.


Acknowledgement. This work was supported by the Ministry of Education, Youth and Sports of the Czech Republic - projects ‘CENAKVA’ (No. CZ.1.05/2.1.00/01.0024) and ‘CENAKVA II’ (No. LO1205 under the NPU I program).

References
1. Urban, J., Afseth, N.K., Stys, D.: Fundamental definitions and confusions in mass spectrometry about mass assignment, centroiding and resolution. TrAC Trends Anal. Chem. 53, 126–136 (2014)
2. Sturges, H.A.: The choice of a class interval. J. Am. Stat. Assoc. 21(153), 65–66 (1926)
3. Freedman, D., Diaconis, P.: On the histogram as a density estimator: L2 theory. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 57(4), 453–476 (1981)
4. Shimazaki, H., Shinomoto, S.: A method for selecting the bin size of a time histogram. Neural Comput. 19(6), 1503–1527 (2007)
5. Urban, J., Vanek, J., Stys, D.: Using Information Entropy for Camera Settings. TCP, Prague (2008). ISBN 978-80-7080-692-0
6. Lahoda, D., Urban, J., Vanek, J., Stys, D.: Expertomica Time-Lapse with Entropy, RIV/60076658:12640/09:00010084 (2009)
7. Urban, J., Vanek, J., Soukup, J., Stys, D.: Expertomica metabolite profiling: getting more information from LC-MS using the stochastic systems approach. Bioinformatics 25(20), 2764–2767 (2009)
8. Urban, J.: False precision of mass domain in HPLC–HRMS data representation. J. Chromatogr. B 1023, 72–77 (2016)
9. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423 and 623–656 (1948)
10. Boublik, T.: Statistical Thermodynamics. Academia, San Francisco (1996)
11. Hartley, R.V.L.: Transmission of information. Bell Syst. Tech. J. 7, 535 (1928)
12. Jizba, P., Arimitsu, T.: The world according to Rényi: thermodynamics of multifractal systems. Ann. Phys. 312, 17–59 (2004)
13. Matlab User's Guide. The MathWorks, Inc., Natick, MA (1992)

Discrimination Between Normal Driving and Braking Intention from Driver's Brain Signals

Efraín Martínez, Luis Guillermo Hernández, and Javier Mauricio Antelis(B)

Tecnológico de Monterrey en Guadalajara, Av. Gral Ramón Corona 2514, 45201 Zapopan, Jalisco, Mexico
[email protected]
http://www.itesm.mx

Abstract. Advanced driver-assistance systems (ADAS) are in-car technologies that record and process vehicle and road information to take actions that reduce the risk of collision. These technologies, however, do not use information obtained directly from the driver, such as brain activity. This work proposes the recognition of brake intention using the driver's electroencephalographic (EEG) signals recorded in real driving situations. Five volunteers participated in an experiment that consisted of driving a car and braking in response to a visual stimulus. The drivers' EEG signals were collected and employed to assess two classification scenarios, pre-stimulus vs pos-stimulus and no-braking vs brake-intention. Classification results showed across-all-participants accuracies of 85.2 ± 5.7% and 79 ± 9.1%, respectively, which are above the chance level. Further analysis of the second scenario showed that the true positive rate (77.1%) and the true negative rate (79.3%) were very similar, which indicates no bias in the classification between no-braking vs brake-intention. These results show that the driver's EEG signals can be used to detect brake intention, which could be useful to take actions to avoid potential collisions.

Keywords: Brake intention · Normal driving · Electroencephalogram · Classification

1 Introduction

According to the World Health Organization (WHO), about 1.25 million people die each year as a result of road traffic crashes, and between 20 and 50 million more people suffer non-fatal injuries, with many incurring a disability as a result of their injury. For this reason, traffic accidents represent the 9th cause of death in the world and the 1st cause among people between 15–29 years old [1]. To tackle this problem, Advanced Driver-Assistance Systems (ADAS) are acquiring more relevance in smart vehicle manufacturing, especially in safety-critical areas [2,3]. There have been important improvements in the development of sensors, radars and computer vision systems to determine the distance to other cars or


objects and to enhance the visual field in order to alert drivers to potential risks [4]. However, today the driver is still the final decision-maker when a decision is to be made based on the information he/she has just received. This is because there is still controversy about whether vehicle control should be taken completely by the machine or the human [5,6]. While this situation is being addressed, efforts should also be focused on interpreting the driver's intention as fast as possible by the car's computer, using biological information such as brain and/or muscular activity (biopotentials) generated prior to executing emergency responses. One of the lines of research in this context is the detection of a driver's intention to perform voluntary movements in critical situations, like emergency braking. There are several studies focusing on detecting the driver's intention of braking using brain signals [7–9]. However, these works were carried out in simulated or virtual reality environments under controlled conditions that do not perfectly reflect real driving situations. Only a very few studies have been carried out in real driving situations [10,11]. Therefore, there is still a need to study in more detail whether the driver's brain signals can be used to detect the brake intention in realistic driving conditions. The present work proposes the use of supervised learning to discriminate braking intention from normal driving using the driver's electroencephalographic (EEG) signals recorded in a real driving environment. The experiment consisted of driving a car and pressing the brake pedal whenever a red light turned on. This visual stimulus was located in front of the driver and represented unexpected situations that indicated to the driver to brake quickly. Five subjects voluntarily participated in this study, and the recorded EEG signals were used to evaluate the classification in two scenarios: pre-stimulus versus pos-stimulus and no-braking versus brake-intention. Time-domain features computed from raw, common average reference (CAR) filtered, and independent component analysis (ICA) cleaned EEG signals were extracted to train two classification algorithms: Linear Discriminant Analysis (LDA) and Support Vector Machines (SVM). Both classifiers were used in a systematic evaluation to assess the classification accuracy. The results showed overall accuracies of 85.2 ± 5.7% and 79 ± 9.1% in the two classification scenarios with features extracted from ICA-cleaned EEG signals and an SVM as classifier. Also, the true positive rate (TPR) and true negative rate (TNR) were balanced in the classification of no-braking vs brake-intention, with TPR = 77.1% and TNR = 79.3%. These results show the feasibility of using the driver's brain signals obtained during real driving to recognize the intention to brake, which is posterior to the reception of the visual stimulus but precedes the pressing of the brake pedal. The rest of the paper is organized as follows: Sect. 2 describes the experiments and the data analysis, Sect. 3 presents and discusses the results, and Sect. 4 presents the conclusions.

2 Methods and Materials

2.1 Participants, Vehicle and Test Track

Five healthy right-handed males between 26–38 years old were enrolled in the study. They had normal or corrected-to-normal vision, had a valid driving license,


and had no history of psychiatric or neurological diseases that could affect the experimental results. Participation in the experiments was voluntary, and all procedures were conducted in accordance with the ethical guidelines of the Declaration of Helsinki [12]. Informed consent was obtained from all participants. The experiments were conducted on a 2.2 km length track on a paved unused road. This test track consisted of two 1 km length straight parallel road sections. A 2011 subcompact sedan with manual transmission was used as the test vehicle. This car employs the Controller Area Network (CAN) communication protocol ISO 15765-4 with 11-bit IDs and 500 kbaud transmission speed [13,14] to provide communication between several devices within the vehicle and the On-Board Diagnostics port (OBD-II).

2.2 EEG Recording System

To get the biopotential signals from the drivers, the portable wireless biosignal acquisition system g.Nautilus from g.tec medical engineering GmbH was used [15]. This system consists of a cap with 30 electroencephalographic (EEG) active dry electrodes that are uniformly distributed in accordance with the international 10–20 standard (see Fig. 1a) and 2 electrooculographic (EOG) Silver Chloride (AgCl) electrodes that were placed over the right eyebrow and the lower part of the orbicularis oculi muscle to pick up the potentials generated by the motion of the eyeballs (also known as the Corneal-Retinal Potential) [16]. The ground and reference electrodes were located at both projections of the mastoid processes. EEG and EOG signals were digitized with 24-bit resolution ADCs, sampled at 250 Hz and band-pass filtered between 0.1 and 100 Hz with a digital 8th order Butterworth-type filter. The g.Nautilus system contains a separate base station that receives the digitized biosignals through a 2.4 GHz wireless transmission and sends them to a PC through a USB port. This receiver has eight digital inputs that are recorded simultaneously with the EEG and EOG signals.

2.3 Vehicle Information Acquisition System

An electronic embedded system was designed to acquire, directly from the selected vehicle's engine control unit (ECU), the state (i.e., activation/deactivation) of the gas and brake pedals (see Fig. 1b). These two signals were decoded from the vehicle's On-Board Diagnostics (OBD) port using an OBD-II to Universal Asynchronous Receiver-Transmitter (UART) shield connected to a TM4C123G LaunchPad development board from Texas Instruments. The OBD-II to UART shield contains a Controller Area Network (CAN) transceiver and interpreter based on the micro-controller STN1110 [17] that provides bi-directional half-duplex communication using standard CAN protocols, also compatible with industry standard ELM327 AT commands [14]. The LaunchPad development board is based on a Cortex-M4 ARM micro-controller unit that was programmed to decode the data present in the CAN bus of the vehicle and to provide 2 digital outputs that indicate the state of the gas and brake pedals (i.e., a low/high level indicates the pedal activation/deactivation).


Fig. 1. (a) Illustration of the biopotential recording system. (b) Block diagram of the embedded system designed to acquire the vehicle's brake and gas pedal signals. This electronic system consists of an OBD-II to UART shield implemented to connect the car's ECU to a micro-controller in order to decode the gas and brake pedal information and convert it to digital outputs. These signals are then connected to the digital inputs in the receiver of the EEG acquisition system.

These digital signals are connected to the digital inputs in the receiver of the g.Nautilus recording system, and thus they are recorded simultaneously at the same sampling frequency as the driver's biopotentials.

2.4 Experimental Paradigm

The experiment consisted of each participant driving the car on the test track while their biopotentials (EEG and EOG) and the vehicle information (gas and brake pedal activation) were recorded. Subjects were instructed to drive naturally at a constant speed between 40–60 km/h. During driving, subjects had to brake in response to non-anticipated red light activations of a lamp located in front of the driver, over the windshield. Prior to the execution of the experiment, subjects were instructed to interpret this visual stimulus as a situation where they have to perform fast braking. After a brake, the driver had to resume normal driving until reaching the target speed again. Figure 2a shows a picture taken during the execution of the experiment. The experiment consisted of driving 9 rounds on the test track, where the driver had to apply the brake whenever the red light turned on. The red light was activated 10 to 12 times each round at quasi-random intervals between 5–10 s by the person conducting the experiment and without previous warning to the driver, in order to minimize adaptation effects. This visual stimulus was only presented during the straight parts of the circuit in order to reduce the mechanical artifacts produced by head or body movements of the driver. Figure 2b illustrates the braking situations performed in a round. The experimental session lasted about one hour, in which a total of approximately 110 brakes were performed by each participant.


Fig. 2. (a) Snapshot of the experimental setup with a subject driving the car and wearing the EEG cap. The red light is observed on the car windshield in front of the driver, while the host computer and the experimenter are in the back seat of the car. (b) Illustration of how braking was induced at quasi-random intervals only on the straight sections of the test track. A total of 9 rounds were traveled and 10 to 12 brakes were performed in each round. Therefore, the driver's biopotentials and vehicle information were recorded in about 110 braking situations per participant. (Color figure online)

2.5 Preprocessing and Dataset Preparation

The collected data were preprocessed in Matlab® using the FieldTrip toolbox [18]. EEG signals were preprocessed using common average reference (CAR) and independent component analysis (ICA). ICA decomposition was applied to all EEG signals, and components were selected manually in order to remove non-neural activity. Raw, CAR-filtered, and ICA-cleaned EEG signals of each participant were used separately to study the recognition of braking intention from normal driving. All data (biopotentials and vehicle information) were separately divided into trials taking as reference all time instants where visual stimuli were presented ±1.5 s. This resulted in 3 s-length stimulus-aligned trials where the reference is the appearance of the red light indicating the driver to press the brake pedal. Therefore, the segment from −1.5 s to 0 s was defined as "pre-stimulus" (normal driving with no braking information), while the segment from 0 s to 1.5 s was defined as "pos-stimulus" (this segment contains braking intention plus the braking action). Figure 3a illustrates the time-line of a 3 s-length trial. Note that the brake activation is observed within the pos-stimulus segment, and thus this segment includes both braking intention and braking. This segmentation yielded a set of trials that were used to study the classification of "pre-stimulus" versus "pos-stimulus" (classification scenario 1). Another segmentation was done by taking as reference the fastest brake activation time for each participant, which was computed using all the trials. This time instant, referred to from now on as tBmin, was used to define two new consecutive segments: "brake-intention" from 0 to tBmin and "no-braking" from −tBmin to 0. Figure 3b illustrates these two segments in a 3 s-length trial. Note that the length of the segments is less than 1.5 s; the "no-braking" segment


Fig. 3. Time-line of a 3 s-length trial where the reference 0 indicates the presentation of the visual stimulus. (a) Illustration of “pre-stimulus” (from −1.5 s to 0) and “posstimulus” (from 0 to 1.5 s) segments. (b) Illustration of “no-braking” (from −tBmin to 0) and “brake-intention” (from 0 to tBmin ) segments. (Color figure online)

contains normal driving with no braking information, and the "brake-intention" segment contains information preceding the activation of the brake pedal. This segmentation resulted in a second set of trials that was used to evaluate the recognition of braking intention from normal driving (classification scenario 2).
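The epoching step of this section can be sketched as follows (an illustrative numpy equivalent of the FieldTrip processing; the variable names are assumptions):

```python
import numpy as np

FS = 250                 # sampling rate in Hz
HALF = int(1.5 * FS)     # 1.5 s on each side of the stimulus onset

def cut_trials(data, stim_samples):
    """Cut 3 s stimulus-aligned trials from a continuous recording.

    data         : array (n_samples, n_channels) of continuous signals
    stim_samples : sample indices at which the red light turned on
    returns      : array (n_trials, 2 * HALF, n_channels)
    """
    trials = [data[s - HALF:s + HALF] for s in stim_samples
              if s - HALF >= 0 and s + HALF <= len(data)]
    return np.stack(trials)
```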

2.6 Feature Extraction and Classification

Time-domain information was used for feature extraction. For each EEG channel, the signal was divided into 10 intervals of equal length (without overlapping), and the average amplitude of the signal was computed over these intervals, i.e., 10 values were computed per channel. The values of all 30 electrodes were concatenated to construct the feature vector x(300×1). Features were computed exclusively from the segments "pre-stimulus" and "pos-stimulus" in the first set of trials (used for classification scenario 1) and from the segments "no-braking" and "brake-intention" in the second set of trials (used for classification scenario 2). Two classification methods, Linear Discriminant Analysis (LDA) and Support Vector Machines (SVM), were used to classify "pre-stimulus" versus "pos-stimulus" and "no-braking" versus "brake-intention".
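A sketch of this feature extraction in Python (illustrative only; the original analysis was done in Matlab):

```python
import numpy as np

def time_domain_features(trial):
    """Average amplitude over 10 equal, non-overlapping windows per channel.

    trial : array (n_samples, 30) -- one EEG segment with 30 channels
    returns the 300-dimensional feature vector (10 means x 30 channels).
    """
    windows = np.array_split(trial, 10, axis=0)          # 10 equal-length pieces
    means = np.stack([w.mean(axis=0) for w in windows])  # shape (10, 30)
    return means.reshape(-1, order="F")                  # channel-wise concatenation
```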

2.7 Classification Performance Evaluation

Classification performance was assessed independently for each subject through a 10-fold cross-validation procedure. For each fold, the classification accuracy was measured as the percentage of correct classifications, or accuracy = (TPR + TNR)/(TPR + TNR + FPR + FNR), where TPR is the true positive rate, FPR is the false positive rate, TNR is the true negative rate, and FNR is the false negative rate. Therefore, the confusion matrix, average, and distributions of the classification accuracy were obtained for each subject. This procedure was carried out separately for the two classification scenarios considering all combinations of (i) features extracted from raw, CAR-filtered, and ICA-cleaned EEG signals and (ii) LDA and SVM classification algorithms.
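The evaluation loop can be sketched with scikit-learn (an assumed library choice for illustration; the paper does not state the implementation, and the SVM kernel and regularization shown are library defaults, not necessarily the authors' settings):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder feature matrix (n_trials x 300) and labels; the real features
# come from Sect. 2.6 (0 = no-braking, 1 = brake-intention)
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 300)), rng.integers(0, 2, size=200)

for name, clf in [("LDA", LinearDiscriminantAnalysis()), ("SVM", SVC())]:
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: {scores.mean() * 100:.1f} +/- {scores.std() * 100:.1f} %")
```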

3 Results

Accepted emergency responses were defined as those in which the braking response was given no earlier than 300 ms and no later than 1200 ms after the presentation of the visual stimulus [7]. All subjects in the experiments had emergency braking responses between 340 and 1100 ms; hence, no trial was discarded due to too-late or too-early responses. The average response obtained was 558.2 ± 90.1 ms, and the distribution was skewed, with percentiles P5 = 420 ms, P25 = 500 ms, P50 = 552 ms (median), P75 = 600 ms and P95 = 720 ms. These response times are consistent with those reported in previous studies [7,8,10]. Table 1 summarizes the classification results across all participants (average ± standard deviation) achieved in the two classification scenarios. The results from the raw EEG data presented higher accuracies in the two classification scenarios for both classification algorithms. This is attributed to the presence of artifact components in some of the trials, mostly because of voluntary and involuntary muscular activity (i.e., eye blinks, head movements, etc.) during driving. Thus, these results might not reflect realistic recognition of braking intention from normal driving. With CAR-filtered and ICA-cleaned EEG signals, the accuracy results were slightly reduced, especially with the ICA-cleaned data. However, in this condition only neural activity remained in all EEG channels; thus, it provided the required brain information to determine if it is possible to recognize braking intention from normal driving. Considering the classification algorithm, it is important to also notice that SVM outperforms LDA in the two classification scenarios irrespective of using raw, CAR-filtered, or ICA-cleaned EEG signals. Thus, this classification method and features extracted from ICA-cleaned EEG signals were used for further analysis, as they represent the best combination to properly study the recognition of braking intention from brain signals.

Table 1. Classification accuracy results averaged across all participants for the two classification scenarios with raw, CAR-filtered and ICA-cleaned EEG signals and with LDA and SVM.


Table 2. Across-all-participants confusion matrix for "no-braking" versus "brake-intention" obtained with ICA-cleaned EEG signals and SVM. Results show consistency and balance among both classes; thus, no biasing is detected for one of the classes (TPR ≈ TNR).

Finally, regarding the two classification scenarios, the first ("pre-stimulus" versus "pos-stimulus") provided higher accuracies. However, features extracted from the pos-stimulus segment also contain brain signals recorded when drivers are actually pressing the brake pedal. This is not the case in the second classification scenario ("no-braking" versus "brake-intention"), where the features computed from the brake-intention segment contain brain signals preceding the activation of the brake pedal. Thus, subsequent analyses are only for classification scenario 2. The across-all-participants confusion matrix presented in Table 2 shows that the classification accuracy results were evenly distributed among both classes (no-braking and brake-intention). This means that the total true positive rates (TPR) were almost the same as the true negative rates (TNR), indicating consistency in the results and that they are not biased towards one of the classes. Figure 4 shows the distribution of the classification accuracy of each participant obtained with ICA-cleaned EEG signals and SVM. For all participants, the


Fig. 4. Distribution of classification accuracy of each participant in the classification of “no-braking” versus “brake-intention”. The median of the accuracy (red line) is above chance level in all subjects. (Color figure online)


median of the accuracy was 80.0%, 78.1%, 86.7%, 75.6% and 72.4% for subjects 1, 2, 3, 4 and 5, respectively (observed as the red horizontal line in each distribution), which is well above the theoretical chance level (50%) and consistent with similar offline classification scenarios, such as the ones reported in [9] of 80% and 81% using LDA and QDA classifiers. These results show that all subjects presented significant recognition of the braking intention from normal driving using brain signals.

4 Conclusions

Fast detection of braking intention (i.e., the event preceding the pressing of the brake pedal) is essential to perform rapid and controlled car stopping and maneuvering, because this could be useful to avoid potential car crashes. In this line, the present work studies the recognition of braking intention from normal driving using electroencephalographic signals from the driver. The idea of using the driver's brain signals is based on the fact that the execution of movements (such as moving the right leg to press the brake pedal) is preceded by motor-related neural processes that might induce recognizable changes in the ongoing brain signals. This work consisted of a real-driving experiment devised and carried out to obtain EEG signals from several participants who had to drive and to brake in response to a visual stimulus. The state of the car's brake pedal and of the visual stimulus that indicated to brake were also synchronously recorded with the brain signals. Time-domain features extracted from raw, CAR-filtered, and ICA-cleaned EEG signals, along with LDA and SVM classifiers, were used to evaluate the classification of "pre-stimulus" versus "pos-stimulus" and "no-braking" versus "brake-intention". The overall results presented in Table 1 showed that features computed from ICA-cleaned EEG signals and an SVM provided classification accuracies (85.2 ± 5.7% for "pre-stimulus" versus "pos-stimulus" and 79 ± 9.1% for "no-braking" versus "brake-intention") that are above the theoretical chance level. Even though raw EEG and CAR-filtered signals yielded higher accuracies, they were not considered to analyze the classification between "no-braking" and "brake-intention", as they might contain artifacts that can incorrectly enhance classification. Note that the classification results were balanced in discriminating between "no-braking" and "brake-intention", as TPR = 77.1% was very similar to TNR = 79.3% (Table 2); hence, there was no bias in the detection of one of the classes. Finally, the subject-specific results (Fig. 4) showed that the classification accuracy was consistent for all participants. On the basis of the results, this work shows that it is possible to discriminate braking intention from non-braking situations using the driver's brain signals.

Acknowledgments. This research has been funded by the National Council of Science and Technology of Mexico (CONACyT) through grants 268958 and PN2015-873. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.


References
1. WHO: Global status report on road safety 2015. World Health Organization (2015)
2. Louwerse, W.J.R., Hoogendoorn, S.P.: ADAS safety impacts on rural and urban highways, pp. 887–890, June 2004
3. Khan, J.: Using ADAS sensors in implementation of novel automotive features for increased safety and guidance, pp. 753–758, February 2016
4. Okuda, R., Kajiwara, Y., Terashima, K.: A survey of technical trend of ADAS and autonomous driving, pp. 1–4, April 2014
5. Stanton, N.A., Marsden, P.: From fly-by-wire to drive-by-wire: safety implications of automation in vehicles. Saf. Sci. 24, 35–49 (1996)
6. Young, M.S., Stanton, N.A., Harris, D.: Driving automation: learning from aviation about design philosophies. Int. J. Veh. Des. 45, 323–338 (2007)
7. Haufe, S., Treder, M., Gugler, M.F., Sagebaum, M., Curio, G., Blankertz, B.: EEG potentials predict upcoming emergency brakings during simulated driving. J. Neural Eng. 8, 056001 (2011)
8. Kim, I.H., Kim, J.W., Haufe, S., Lee, S.W.: Detection of braking intention in diverse situations during simulated driving based on EEG feature combination. J. Neural Eng. 12, 016001 (2015)
9. Khaliliardali, Z., Chavarriaga, R., Gheorghe, L.A., Millán, J.R.: Detection of anticipatory brain potentials during car driving. In: Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 3829–3832 (2012)
10. Haufe, S., Kim, J.W., Kim, I.H., Sonnleitner, A., Schrauf, M., Curio, G., Blankertz, B.: Electrophysiology-based detection of emergency braking intention in real-world driving. J. Neural Eng. 11, 056011 (2014)
11. Khaliliardali, Z., Chavarriaga Lozano, R., Zhang, H., Gheorghe, L.A., Millán, J.R.: Single trial classification of neural correlates of anticipatory behavior during real car driving (2016)
12. WHO: World medical association declaration of Helsinki. Bull. World Health Organ. 79, 373–374 (2001)
13. ISO 15765-4: Road vehicles - diagnostics on Controller Area Networks (CAN) - Part 4: Requirements for emissions-related systems. Standard, International Organization for Standardization, Geneva, January 2005
14. Elm Electronics Inc.: ELM327 OBD to RS232 Interpreter (2014)
15. g.tec: g.Nautilus wireless biosignal acquisition. http://www.gtec.at/Products/Hardware-and-Accessories/g.Nautilus-Specs-Features. Accessed 30 Sept 2016
16. Venkataramanan, S., Prabhat, P., Choudhury, S.R., Nemade, H.B., Sahambi, J.S.: Biomedical instrumentation based on electrooculogram (EOG) signal processing and application to a hospital alarm system. In: Proceedings of 2005 International Conference on Intelligent Sensing and Information Processing, pp. 535–540, January 2005
17. OBD Solutions: Multiprotocol OBD to UART Interpreter. STN1110 datasheet, November 2012
18. Oostenveld, R., Fries, P., Maris, E., Schoffelen, J.M.: FieldTrip: open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Comput. Intell. Neurosci. 2011, 1:1–1:9 (2011)

Unsupervised Parametrization of Nano-Objects in Electron Microscopy

Pavla Urbanová¹,²(✉), Norbert Cyran³, Pavel Souček², Antonín Bárta², Vladyslav Bozhynov², Dinara Bekkhozhayeva², Petr Císař², and Miloš Železný¹

¹ Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia in Pilsen, Univerzitní 8, 306 14 Pilsen, Czech Republic
[email protected]
² Faculty of Fisheries and Protection of Waters, South Bohemian Research Center of Aquaculture and Biodiversity of Hydrocenoses, Institute of Complex Systems, University of South Bohemia in České Budějovice, Zámek 136, 373 33 Nové Hrady, Czech Republic
³ Core Facility Cell Imaging and Ultrastructure Research, University of Vienna, Althanstrasse 14, 1090 Vienna, Austria

Abstract. The observation of nano-sized objects in electron microscopy demands automated evaluation of the captured images. The analysis of the digital images should focus on object detection, classification, and parametrization. In this work, three different examples of bioinformatical tasks are presented and discussed. The sphericity of such objects is one of the key parameters in nano-object detection. The parametrization has to deal with specific properties of electron microscopy images, such as a high level of noise, low contrast, uneven background, and few pixels per object. The presented approach combines unsupervised filtration and automatic object detection. The result is a software application with a simple graphical user interface.

Keywords: Electron microscopy · Nanoparticles · Sphericity · Thresholding · Images · Segmentation

1 Introduction

Image processing and analysis consist of a variety of methods and algorithms whose purpose is to obtain quantified information from the digitized image that also applies to the real object captured in the image. Most of the basic methods were developed long ago for binary images. In recent years, these methods have been massively adapted to grayscale (often inaccurately called black-and-white) pictures; their generalization to color images is still insufficient. Image processing and analysis provide a set of modular tools which, combined in the right order, can retrieve information from the captured images for which:

– a person is able to recognize the information value in qualitative terms, but its quantitative expression is time consuming and inaccurate;
– for a large number of processed data, the effectiveness and credibility of subjective evaluation by an external observer decrease, while the time consumption and the error rate of interpretation increase;
– part of the information cannot be subjectively recognized at all.

The objective of image analysis is the development of software tools which simplify and automate human-made routines in image evaluation and extend them with mathematical operations and artificial intelligence. This also offers the possibility of processing a large amount of input data and reducing the time consumption of the analysis.

Electron microscopy allows us to observe nanostructures of small objects inside biological organisms (nuclei, photosystems, membranes). The obtained images usually carry intensity (grayscale) levels only, since no color is defined in the electron beam. The image properties (contrast, bit depth, dynamic range, resolution, etc.) are task-dependent, since there is no widely accepted standardization. The background and the type or size of other objects vary for biological reasons. The regions of interest (ROI) are of very small physical size; therefore, the discretization effects on their spherical representation are of significant importance for unsupervised evaluation.

Digital images allow us to carry out post-processing and analysis of the objects in the observed scene via plenty of semi-automatic software. Accurate object recognition and further evaluation depend on many attributes, including discriminability (often incorrectly described as resolution [2]), contrast, compactness of the edges, and so on. The detection of object shape types is complicated by the discrete nature of the object values. Especially with very small objects, the deformation of the exact shapes is well known as pixelation.

Fig. 1. Example of discretization effects: a black sphere on a gray background.

In data acquisition, the spatial resolution is related to the Shannon–Nyquist–Kotelnikov theorem. If the criterion is not fulfilled, aliasing artifacts are observed. Pixelation can be considered a special type of aliasing. Especially with spherical objects, the obtained data are aliased, as they show square, cross, or stair shapes (see Fig. 1) [13–16].


On the other hand, the proximity or remoteness to sphericity is an important criterion for the classification of objects, structure identification, and image registration. Unfortunately, in the observation of very small real objects we are limited by the current state of the technology, although it has been advancing in recent years [11,12]. In the focus of interest are mainly biochemical and biophysical objects, including functional cell organelles, membrane compartments, and metabolic protein macromolecules. Observation of the intracellular environment helps us to properly understand its function and behavior. Object resolution and border identification are complicated by the point spread function (PSF) of visible light; therefore, very small objects are observed using electron microscopy, where the PSF of the electrons produces less distorted images. Three example tasks are considered:

– algal photosystems in transmission electron microscopy (TEM), 20–30 nm;
– immunolabelling of antibodies with gold nanoparticles in TEM, 5–25 nm;
– restriction enzymes in cryo electron microscopy (Cryo-EM), 18–40 nm.

2 Setup, Methods and Implementation

Photosystems are investigated for the purpose of identifying their inner structure. Photosystem I of red algae has an approximately round shape, and its diameter varies from 20 to 30 nm. Photosystem II is elliptical and of similar size. The photosystems are too small for any internal structure to be seen in the raw microscope images. To recover the structure, thousands of images are registered and accumulated into a super-resolved image (Figs. 2 and 3).

Gold nanoparticles in transmission electron microscopy (TEM) serve for immunolabeling of antibodies in biological experiments [1]. The task is to detect the presence of the nanoparticles in the image area, label their positions, and evaluate their amount. The particles are supposed to be spherical; however, there are

Fig. 2. Example of input images, from left to right: photosystems, nanoparticles, enzymes.


Fig. 3. Example of detected objects: photosystem, nanoparticle, enzyme.

various imperfections during manufacturing, as well as quantization effects of the digital images. Moreover, the size of a particle is around 50 gold atoms. The particles should always be the darkest objects, except for very large areas. The nanoparticle labeling technique is used to localize specific organic molecules by antibody treatment: the primary antibody is designed to bind to a target protein, and a secondary antibody binds to the primary one and is charged with the gold particle [8]. The samples were fixed and embedded in resin (LR White), sectioned ultrathin (50 nm), and mounted on a TEM grid. The immunostaining is made directly on the section and subsequently imaged on the TEM.

The structure of enzymes is also at the apex of cryo electron microscopy research. The enzymes can be observed from different angles and rotations. It is necessary to detect the small objects and classify them for further recomposition of the 3D enzyme structure.

2.1 Image Processing and Analysis

The set of input images was separated into training and testing subsets. The following methods were carried out, compared, and tuned on the training images; the successful approach was then evaluated on the testing images. To complete the task, theory, examples, and a comparison of different approaches were considered:

– enhancement and equalization;
– edge detection;
– automatic thresholding and segmentation;
– morphological operations;
– circle detection;
– region properties and parametrization (Figs. 4 and 5).

Most of the methods are based on the intensity histogram. Thresholding approaches separate the histogram into two parts. Unfortunately, there are many objects in the electron microscopy images (particles, organelles); therefore, direct segmentation is not efficient.


Fig. 4. Example of filtration preprocessing. From left to right: input image, background to remove.

Fig. 5. Example of filtration preprocessing. From left to right: image without background, image without noise.

The very first task is to remove noise and background from the scene. An ideal preprocessing tool is a frequency filter, since random noise is typically located in the higher frequencies and the background in the lower ones. The filtration is done using a Hamming window, whose size is estimated from the dataset:

$h_w = g^{1/4}$,  (1)

$h_s = g^{1/8}$,  (2)

where $g$ is the number of image pixels, $h_w$ is the size of the background window, and $h_s$ is the size of the noise window.
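As an illustration of this step, the sketch below applies the window sizes of Eqs. (1)–(2) in a simple FFT band-pass filter. The paper does not specify how the Hamming windows are arranged in the spectrum, so the masking scheme used here (a tapered low-frequency block for the background, a cropped spectrum border for the noise) is an assumption, and the function name is ours.

```python
import numpy as np

def frequency_filter(img):
    """Band-pass sketch: background sits in the low frequencies,
    random noise in the high ones; window sizes follow Eqs. (1)-(2)."""
    g = img.size
    hw = int(g ** 0.25)   # background window size, Eq. (1)
    hs = int(g ** 0.125)  # noise window size, Eq. (2)

    F = np.fft.fftshift(np.fft.fft2(img.astype(float)))
    rows, cols = img.shape
    cy, cx = rows // 2, cols // 2

    # Background: attenuate an hw x hw block around the DC term,
    # tapered by a 2-D Hamming window to limit ringing (assumed scheme).
    taper = np.outer(np.hamming(2 * hw), np.hamming(2 * hw))
    F[cy - hw:cy + hw, cx - hw:cx + hw] *= (1.0 - taper)

    # Noise: discard a border of width hs at the high-frequency edges.
    F[:hs, :] = 0; F[-hs:, :] = 0
    F[:, :hs] = 0; F[:, -hs:] = 0

    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))
```

For a 2044 × 2048 image this yields hw = 45 and hs = 6 pixels.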


Fig. 6. Example of Beucher morphology and Otsu segmentation of borders after Beucher morphology.

Basic morphological operations [5] were then performed using:

– edge detectors;
– Beucher morphology [6];
– Otsu between-class segmentation [7];
– object parametrization (area, perimeter, circularity, convexity, inertia, center coordinates).

Edge detectors have to deal here with low-contrast borders of small size and with discretization effects. In such cases it is advisable to use morphological operators such as the Beucher morphology, the difference between the dilation and the erosion of the image with the smallest possible structural element. This operation highlights the borders to the size of the structural element (see Fig. 6).

Dilation [9,10] is the Minkowski addition of two sets X and SE; it causes objects to grow in size and fills small holes inside:

$\delta_{SE}(X) = X \oplus SE = \bigcup_{se \in SE} (X)_{se}$.  (3)

Erosion [9,10] is a transformation dual to dilation, but not its inverse. It is the Minkowski subtraction and causes objects to shrink in size:

$\varepsilon_{SE}(X) = X \ominus SE = \bigcap_{se \in SE} (X)_{se}$.  (4)

Here SE is the morphological structural element. While the Minkowski operations are standard, the Beucher morphology is not widely known, despite being very effective.
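A minimal OpenCV rendering of the Beucher gradient described above (dilation minus erosion with the smallest structural element) could look as follows:

```python
import cv2

def beucher_gradient(img):
    """Beucher morphological gradient, cf. Eqs. (3)-(4) and Fig. 6:
    the difference between dilation and erosion with a minimal
    structural element highlights the object borders."""
    se = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    return cv2.subtract(cv2.dilate(img, se), cv2.erode(img, se))
```

OpenCV also exposes the same operation directly as cv2.morphologyEx(img, cv2.MORPH_GRADIENT, se).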

The gray-level thresholding [7] is a nonparametric method of automatic threshold selection for picture segmentation from the intensity histogram H(p). The histogram function H(p) counts the pixels f(i, j) whose intensity equals p, independently of the position (i, j):

$H(p) = \sum_{i,j} h(i, j, p)$,  (5)

$h(i, j, p) = 1 \quad \text{if } f(i, j) = p$,  (6)

$h(i, j, p) = 0 \quad \text{if } f(i, j) \neq p$.  (7)

First, the histogram function is normalized:

$o_p = \frac{H(p)}{N}$,  (8)

where N is the total number of pixels in the image. To separate the histogram into two classes, the probabilities of class occurrence and the class means are computed:

$\omega_1 = \sum_{p=1}^{k} o_p$,  (9)

$\omega_2 = \sum_{p=k+1}^{L} o_p$,  (10)

$\mu_1 = \frac{\sum_{p=1}^{k} p \, o_p}{\omega_1}$,  (11)

$\mu_2 = \frac{\sum_{p=k+1}^{L} p \, o_p}{\omega_2}$.  (12)

It is also necessary to evaluate the total mean level of the image,

$\mu_T = \sum_{p=1}^{L} p \, o_p$,  (13)

and the between-class variance,

$\sigma_B^2 = \omega_1 (\mu_1 - \mu_T)^2 + \omega_2 (\mu_2 - \mu_T)^2$.  (14)

The optimal threshold $k^*$ maximizes $\sigma_B^2$ [7,17].
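For reference, a direct (unoptimized) translation of Eqs. (5)–(14) into Python could look as follows; the exhaustive sweep over k mirrors the definitions rather than the faster cumulative-sum formulations:

```python
import numpy as np

def otsu_threshold(img, levels=256):
    """Otsu's method as in Eqs. (5)-(14): normalize the histogram and
    pick the threshold k* that maximizes the between-class variance."""
    H, _ = np.histogram(img, bins=levels, range=(0, levels))
    o = H / img.size                              # Eq. (8)
    p = np.arange(levels)
    best_k, best_var = 0, -1.0
    for k in range(1, levels - 1):
        w1, w2 = o[:k].sum(), o[k:].sum()         # Eqs. (9)-(10)
        if w1 == 0 or w2 == 0:
            continue
        mu1 = (p[:k] * o[:k]).sum() / w1          # Eq. (11)
        mu2 = (p[k:] * o[k:]).sum() / w2          # Eq. (12)
        muT = (p * o).sum()                       # Eq. (13)
        var_b = w1 * (mu1 - muT) ** 2 + w2 * (mu2 - muT) ** 2  # Eq. (14)
        if var_b > best_var:
            best_k, best_var = k, var_b
    return best_k
```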


The final step is to label the objects and evaluate their properties against a sphericity criterion. The Hough transformation [4] is not recommended for this task: the gold nanoparticles are too small and not perfectly round, and a high computational burden is a typical disadvantage of the Hough method. Therefore, a ratio between the radii computed from the area and from the perimeter [3] was adopted, instead of eccentricities and elliptical axes, which suffer from error propagation since they depend on the estimation of the object orientation. Thus

$r_a = \sqrt{\frac{Area}{\pi}}$,  (15)

while

$r_p = \frac{perimeter}{2\pi}$,  (16)

and the ratio is

$a_p = \frac{r_a}{r_p}$.  (17)

An ideal sphere in a vacuum would have a ratio equal to 1; the distance from unity therefore also represents the distance from sphericity. Due to the discrete representation, the observed deviation of the sphericity ratio for presumably spherical ROIs was up to 0.2 in both directions. The shape analysis of very small objects is therefore unable to distinguish small real deviations from sphericity, since pixelation deforms the image representation of the objects, and additional parametrization of intensity, convexity, and inertia has to be taken into account. The edge detectors are quite successful thanks to the low intensity of the particles. Otsu segmentation works perfectly to distinguish two classes of intensities. Object parametrization is the main subtask, since the specification of the criterion function determines the overall results of the detection algorithm. The processing of the TIFF images was performed using the Matlab Image Processing Toolbox and Python with OpenCV. The size of the images was 2044 × 2048 pixels.
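A minimal sketch of the sphericity criterion of Eqs. (15)–(17) in Python with OpenCV is shown below. It assumes the objects have already been segmented into a binary mask; the contour-based area and perimeter estimates are one possible choice, and a classification bound (e.g., 1.2 for round photosystems) would be applied to the returned ratios.

```python
import cv2
import numpy as np

def sphericity_ratios(binary_mask):
    """Per-object ratio a_p = r_a / r_p of Eqs. (15)-(17);
    binary_mask: uint8 image with objects as nonzero pixels."""
    contours, _ = cv2.findContours(binary_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    ratios = []
    for c in contours:
        area = cv2.contourArea(c)
        perimeter = cv2.arcLength(c, True)
        if perimeter == 0:
            continue
        ra = np.sqrt(area / np.pi)        # Eq. (15)
        rp = perimeter / (2.0 * np.pi)    # Eq. (16)
        ratios.append(ra / rp)            # Eq. (17)
    return ratios
```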

3 Results and Conclusion

The filtration preprocessing enhances the objects for segmentation and thresholding; the effects of background changes and noise are smoothed, and a supporting noise-level detection could be applied. The morphological operations highlight the edges, and the between-class variance successfully segments the objects in the scene (Fig. 6).

Fig. 7. Example of the half-diameter computed from the area and the half-diameter computed from the perimeter.


Fig. 8. Example of ratio of diameters.

The objects that passed the thresholding were labeled and their region properties were evaluated:

– perimeter;
– convex area;
– centroid coordinates;
– eccentricity.

According to the ratio $a_p$ between the perimeter diameter $2r_p$ and the area diameter $2r_a$ (see Fig. 8), the objects can be classified by their sphericity. For the round photosystems, the required sphericity is below 1.2. For the gold nanoparticles, the sphericity ratio is required to lie within a prescribed interval. For the classification of the elliptical objects and the enzymes, the same approach is used in the first step; the additional property of eccentricity then has to be calculated. The classification of the enzymes is complicated, since it is not known a priori how many classes are present. The detection of the objects is straightforward; for more complicated tasks, a multi-level parametrization of other region properties is available.

The described methods were implemented as a Mathworks Matlab application with a graphical user interface (Fig. 9). The software allows the user to read a data set of electron microscopy images, select the region of interest (the whole image or a part of it), and run the automatic evaluation. The processing and analysis carry out background and noise filtration, morphological operations, thresholding and segmentation, region labeling and properties calculation, diameter-ratio classification, and plotting of the results. All method parameters are estimated automatically.

The average time of detection was 58 s on an Intel Core2 Duo CPU E8400, 3 GHz, 4 GB RAM, using double precision. During the detection, some of the nanoparticles escaped detection in both cases: on the training and on the testing set.


Fig. 9. Example of running application with identified and counted objects.

Increasing the multiplier for the standard deviation of the circularity parameters leads to false positive results. Future work will focus on background normalization, contrast enhancement, automatic multiparametric thresholding, and advanced morphological operations.

In this work, the sphericity attribute of several object types was discussed for classification purposes. Very small objects usually occupy only a small number of pixels; therefore, the discretization of the object borders has to be taken into account. The classical Hough transformation is time consuming and valid only for well-spherical objects. However, in electron microscopy several types of objects are of interest to the experimenter, not only the spherical ones. Therefore, detection and classification should be based on the parametrization and evaluation of the sphericity attribute. A low-cost solution can be obtained using elliptical parameters and the area-to-perimeter-derived ratio.

Acknowledgement. The authors thank Jan Vaněk and Joseph de Joya for relevant discussions. The work has been partially supported by the grant of the University of West Bohemia, project No. SGS-2016-039, and by the Ministry of Education, Youth and Sports of the Czech Republic, projects 'CENAKVA' (No. CZ.1.05/2.1.00/01.0024) and 'CENAKVA II' (No. LO1205 under the NPU I program).


References

1. Von Byern, J., Dorrer, V., Merritt, D.J., Chandler, P., Stringer, I., Marchetti-Deschmann, M., McNaughton, A., Cyran, N., Thiel, K., Noeske, M., Grunwald, I.: Characterization of the fishing lines in titiwai (Arachnocampa luminosa Skuse, 1890) from New Zealand and Australia. PLoS ONE 11(12), e0162687 (2016)
2. Urban, J., Cisar, P., Pautsina, A., Soukup, J., Barta, A.: Discrete representation and morphology. In: Technical Computing Prague, p. 322 (2013)
3. Vanek, J., Urban, J., Gardian, Z.: Automated detection of photosystems II in electron microscope photographs. In: Technical Computing Prague, p. 102 (2006)
4. Duda, R.O., Hart, P.E.: Use of the Hough transformation to detect lines and curves in pictures. Commun. ACM 15(1), 11–15 (1972)
5. Urban, J.: Automatic Image Segmentation of HeLa Cells in Phase Contrast Microphotography. LAP LAMBERT Academic Publishing, Saarbrücken (2012)
6. Beucher, S.: Applications of mathematical morphology in material sciences: a review of recent developments. In: International Metallography Conference, pp. 41–46 (1995)
7. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. SMC-9, 62–66 (1979)
8. Slot, J.W., Geuze, H.J.: A new method of preparing gold probes for multiple-labeling cytochemistry. Eur. J. Cell Biol. 38(1), 87–93 (1985)
9. Sonka, M., Hlavac, V., Boyle, R.: Image Processing, Analysis, and Machine Vision. Brooks/Cole Publishing Company (1999)
10. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, vol. 5, pp. 11–15. Addison-Wesley, Boston (1992)
11. Dubochet, J., Frank, J., Henderson, R.: Nobel Prize in Chemistry (2017)
12. Betzig, E., Hell, S.W., Moerner, W.E.: Nobel Prize in Chemistry (2014)
13. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27, 356–379 (1948)
14. Shannon, C.E.: Communication in the presence of noise. Proc. IRE 37(1), 10–21 (1949)
15. Nyquist, H.: Certain topics in telegraph transmission theory. Trans. AIEE 47, 363–390 (1928)
16. Kotelnikov, V.A.: On the capacity of the 'ether' and cables in electrical communication. In: Proceedings of the 1st All-Union Conference on Technological Reconstruction of the Communications Sector and Low-Current Engineering (1933)
17. Urban, J.: Colormetric experiments on aquatic organisms. In: Rojas, I., Ortuño, F. (eds.) IWBBIO 2017. LNCS, vol. 10208, pp. 96–107. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-56148-6_8

Computational Genomics

Models of Multiple Interactions from Collinear Patterns

Leon Bobrowski¹,²(✉) and Paweł Zabielski¹

¹ Faculty of Computer Science, Bialystok University of Technology, Bialystok, Poland
{l.bobrowski,p.zabielski}@pb.edu.pl
² Institute of Biocybernetics and Biomedical Engineering, PAS, Warsaw, Poland

Abstract. A collinear pattern is made up of a large number of feature vectors located on a plane in a multidimensional feature space. A data subset located on a plane can represent linear interactions between multiple variables (features, genes). Collinear (flat) patterns can be efficiently extracted from large, multidimensional data sets through minimization of a collinearity criterion function which is convex and piecewise linear (CPL). Flat patterns extracted from representative data sets could provide an opportunity to discover new, important interaction models. As an example, exploration of data sets representing clinical practice and genetic testing could result in multiple interaction models of phenotype and genotype features.

Keywords: Data mining · Collinear patterns · Biclustering · Multiple linear interactions · CPL criterion functions

1 Introduction

Data mining methods are being developed extensively with the aim of discovering knowledge in large, multi-dimensional data sets [1,2]. Data mining is described as the process of pattern extraction from data sets: new knowledge is discovered on the basis of the patterns extracted from large data sets. An extracted pattern is expected to be a subset (cluster) of feature vectors characterized by an interesting or useful type of regularity. The extracted patterns can be represented as special clusters (clouds) of points in a multidimensional feature space. Patterns can be extracted from large data sets by using various computational tools of pattern recognition [3] or machine learning [4].

A variety of clustering techniques have been developed for data exploration and pattern extraction. Biclustering techniques are currently being developed to explore genomic data [5]. Biclustering procedures aim to extract subsets of feature vectors from data sets and, simultaneously, subsets of features characteristic for a particular pattern.

Collinear (flat) patterns can be described as subsets of numerous feature vectors situated on planes of different dimensionality in a feature space [6]. Flat patterns can be linked to degenerated vertices in the parameter space. Collinear patterns can be extracted


from large data sets through minimization of the convex and piecewise linear (CPL) criterion function [7]. The technique of flat-pattern extraction based on minimization of the collinearity CPL criterion function can be compared to techniques based on the Hough transformation [8,9].

The possibility of discovering multiple interactions on the basis of extracted collinear patterns is analyzed in the presented paper. An interaction between certain features occurs when a given feature (variable) has a different influence on the studied outcome (effect) depending on the values of other features. The considered models of interaction have the form of linear relations (equations) between selected features.

2 Dual Hyperplanes and Vertices in Parameter Space

Let us consider m feature vectors $x_j = [x_{j,1}, \dots, x_{j,n}]^T$ (j = 1, …, m) belonging to the feature space F[n] ($x_j \in F[n]$). We assume that the components $x_{j,i}$ of each feature vector $x_j$ are numerical values ($x_{j,i} \in R$ or $x_{j,i} \in \{0, 1\}$) of n features $x_i$, where i = 1, …, n. The data (learning) set C is composed of m feature vectors $x_j$:

$C = \{x_j : j = 1, \dots, m\}$.  (1)

The components $x_{j,i}$ of the j-th feature vector $x_j$ can be treated as the numerical results of n standardized examinations of the j-th object $O_j$. The feature vectors $x_j$ from the set C (1) allow us to define the following dual hyperplanes $h_j^1$ in the n-dimensional parameter space $R^n$ ($w \in R^n$):

$(\forall x_j \in C) \quad h_j^1 = \{w : x_j^T w = 1\}$,  (2)

where $w = [w_1, \dots, w_n]^T$ is the parameter (weight) vector. Each of the n unit vectors $e_i = [0, \dots, 1, \dots, 0]^T$ defines the following hyperplane $h_i^0$ in the n-dimensional parameter space $R^n$:

$(\forall i \in \{1, \dots, n\}) \quad h_i^0 = \{w : e_i^T w = 0\} = \{w : w_i = 0\}$.  (3)

Let us consider the set $S_k$ of $r_k$ linearly independent feature vectors $x_j$ ($j \in J_k$) and $n - r_k$ unit vectors $e_i$ ($i \in I_k$):

$S_k = \{x_j : j \in J_k\} \cup \{e_i : i \in I_k\}$.  (4)

ð4Þ

The intersection point of the rk hyperplanes hj (2) defined by the feature vectors xj (j 2 Jk) and the n − rk hyperplanes h0i (3) defined by the unit vectors ei (i 2 Ik) from the set Sk (4) is called the k-th vertex wk in the parameter space Rn. The vertex wk can be defined by the below set of linear equations:

Models of Multiple Interactions from Collinear Patterns

ð8j 2 Jk Þ

155

wTk xj ¼ 1

ð5Þ

ð8i 2 Ik Þ wTk ei ¼ 0

ð6Þ

and

The Eqs. (5) and (6) can be represented in the matrix form: Bk wk ¼ 10 ¼ ½1; . . .; 1; 0; . . .; 0T

ð7Þ

where Bk is the square matrix constituting the k-th basis linked to the vertex wk:  T Bk ¼ xjð1Þ ; . . .; xjðrkÞ ; eiðrk þ 1Þ ; . . .; eiðnÞ

ð8Þ

0 wk ¼ B1 k 1

ð9Þ

and

Definition 1: The rank rk (1  rk  n) of the k-th vertex wk = [wk,1, …, wk,n]T (9) is defined as the number of the non-zero components wk,i (wk,i 6¼ 0). Definition 2: The degree of degeneration dk of the vertex wk (9) of the rank rk is defined as the number dk = mk − rk, where mk is the number of such feature vectors xj from the set C (1), which define the hyperplanes h1j (2) passing through this vertex (wTk xj ¼ 1). The vertex wk (9) is degenerated if the degree of degeneration dk is greater than zero (dk > 0).

3 Vertexical Planes in Feature Space The hyperplane H(w, h) in the feature space F[n] is defined in the below manner [3]: Hðw; hÞ ¼ fx: wT x ¼ hg

ð10Þ

where w is the weight vector (w 2 Rn) and h is the threshold (h 2 R1). Remark 1: If the threshold h is different from zero (h 6¼ 0), then the hyperplane H(w, h) (10) can be represented as the hyperplane H(w′, 1) = {x: (w/h)Tx = 1} with the weight vector w′ = w/h and the threshold h equal to one (h = 1). The (rk − 1) - dimensional vertexical plane Pk(xj(1), …, xj(rk)) based on the supporting vertex wk (9) of the rank rk is defined as the standardized linear combination of the rk (rk > 1) supporting vectors xj(i) (j 2 Jk) (4) belonging to the basis Bk (8) [7]:   Pk xjð1Þ ; . . .; xjðrkÞ ¼ fx: x ¼ a1 xjð1Þ þ . . . þ ark xjðrkÞ g

ð11Þ

156

L. Bobrowski and P. Zabielski

where j(i) 2 Jk (4) and the parameters ai (ai 2 R1) fulfill the below standardization: a1 þ . . . þ ark ¼ 1

ð12Þ

Two linearly independent vectors xj(1) and xj(2) from the set C (1) support the below straight line l(xj(1), xj(2)) in the feature space F[n] (x 2 F[n]):   l xjð1Þ ; xjð2Þ ¼ fx: x ¼ xjð1Þ þ aðxjð2Þ  xjð1Þ g ¼ fx: x ¼ ð1  aÞxjð1Þ þ axjð2Þ g ð13Þ where a 2 R1. The straight line l(xj(1), xj(2)) (13) can be treated as the vertexical plane Pk(xj(1), xj(2)) (11) spanned by two supporting vectors xj(1) and xj(2) with a1 = 1 − a and a2 = a. In this case, the basis Bk (8) contains only two feature vectors xj(1) and xj(2) (rk = 2) and n − 2 unit vectors ei (i 2 Ik). As a result, the vertex wk = [wk,1, …, wk,n]T (9) contains only two nonzero components wk,i (wk,i 6¼ 0). Lemma 1: The vertexical plane Pk(xj(1), …, xj(rk)) (11) based on the vertex wk (9) with the rank rk greater than 1 (rk > 1) is equal to the hyperplane H(wk, 1) (10) defined by the vertex wk in the n - dimensional feature space F[n]. Theorem 1: The j-th feature vector xj (xj 2 C (1)) is located on the vertexical plane Pk(xj(1), …, xj(rk)) (11) if and only if the j-th dual hyperplane h1j (2) passes through the supporting vertex wk (9) of the rank rk. Proofs of the above Lemma 1 and the Theorem 1 have been given in the paper [10].

4 Convex and Piecewise Linear (CPL) Collinearity Functions Feature vectors xj from the data set C (1) allow to define the below collinearity penalty functions u1j (w) linked to the dual hyperplanes hj (2) [4]: ð8xj 2 CÞ u1j ðwÞ

  1  xT w j  T  ¼  1  xj w  ¼ T xj w  1

if

xTj w  1

if

xTj w [ 1

ð14Þ

Each of n unit vectors ei allows to define the i-th cost function u0i (w) in the n-dimensional parameter space Rn, where w = [w1, …, wn]: ð8i 2 f1; . . .; ngÞ   wi u0i ðwÞ ¼ eTi w ¼ jwi j ¼ wi

if

wi  0

if

wi [ 0

ð15Þ

Models of Multiple Interactions from Collinear Patterns

157

The cost functions u0i (w) (15) are linked to the hyperplanes h0i (3). The cost functions u0i (w) (15) like the collinearity penalty functions u1j (w) (14) are convex and piecewise linear (CPL). The k-th collinearity criterion function Uk(w) is defined as the sum of the penalty functions u1j (w) (14) linked the feature vectors xj from the given subset Ck: U k ðw Þ ¼

X

u1j ðwÞ

ð16Þ

j2JK

where Jk = {j: xj 2 Ck} and Ck  C (1). It can be proved that the minimal value Uk of the convex and piecewise linear criterion function Uk(w) (16) can be found in one of the vertices wk (9) [11]:   ð9wk Þ ð8wÞUk ðwÞ  Uk wk ¼ Uk  0

ð17Þ

The basis exchange algorithms which are similar to the linear programming allow to find efficiently the minimal value Uk ðwk Þ (17) of the criterion functions Uk(w) (16) even in case of large, multidimensional data subsets Ck [11]. Theorem 3: The minimal value Uk(wk ) (17) of the collinearity criterion function Uk(w) (16) defined on elements xj of the data subset Ck is equal to zero ðUk ðwk Þ ¼ 0Þ, if and only if all the feature vectors xj from the subset Ck can be located on some hyperplane H(w, h) (10) with h 6¼ 0 [7].

5 Optimal Vertices The optimal vertex wk (9) constitutes the minimal value Uk ðwk Þ (17) of the k-th collinearity criterion function Uk(w) (16). The k-th optimal vertex wk (17) of the rank rk is highly degenerated if the number mk of such dual hyperplanes h1j (2) which pass through this vertex is a large (mk rk). In accordance with Theorem 1 the k-th collinear (flat) pattern Fk is formed by a large number mk of such feature vectors xj which are located on the vertexical plane Pk(xj(1), …, xj(rk)) (11) in the feature space F[n] supported by the optimal vertex wk (17) of the rank rk. Each feature vector xj belonging to the flat pattern Fk defines the dual hyperplane h1j (2) which passes through the optimal vertex wk (17): Fk ¼ fxj : ðwk ÞT xj ¼ 1g

ð18Þ

It has been proved that each feature vector xj from the flat pattern Fk (18) is located on the vertexical plane Pk(xj(1), …, xj(rk)) (11) defined by the optimal vertex wk (17) of the rank rk linked to the basis Bk (8).

158

L. Bobrowski and P. Zabielski

Remark 5: The minimal value Uk(wk ) (17) of the collinearity criterion function Uk(w) (16) defined on elements xj of the data subset Ck is equal to zero ðUk ðwk Þ ¼ 0Þ, if and only if the selected data subset Ck is contained (Ck  Fk) in the flat pattern Fk (18) defined by the optimal vertex wk (17) [6]. The minimal value Uk ðwk Þ (17) of the collinearity criterion function Uk(w) (16) can always be reduced to zero ðUk ðwk Þ ¼ 0Þ, by removing such feature vector xj which does not belong to the flat pattern Fk (18). Each extracted flat patterns Fk should contain sufficiently large numbers mk of feature vectors xj. The procedure Vertex has been proposed for the purpose of the flat patterns Fk (17) extraction from large data set C (1) [10]. A variety of highly degenerated optimal vertices wk (17) can be extracted from a given large data set C (1). Each of these degenerated vertices wk (17) defines the separate flat pattern Fk (18).

6 Collinear Biclusters Vertexical feature subspaces Fk[rk] (Fk[rk]  F[n] = {x1, …, xn}) can be defined on the basis of particular vertices wk (9) [7]. The k-th vertexical feature subspace Fk[rk] is obtained from the feature space F[n] by omitting these features xi which are linked to the weights wk,i equal to zero (wk,i = 0), where wk,i is the i-th component of the k-th vertex wk = [wk,1, …, wk,n]T (9) of the rank rk (Definition 1): ð8i 2 f1; . . .; ngÞ

if wk;i ¼ 0; then the i  th feature xi is omitted in the feature subspace Fk ½rk 

ð19Þ

The k-th vertexical feature subspace Fk[rk] (Fk[yj,1]  F[n]) is composed of such rk features xi which are linked to the nonzero weights wk,i (wk,i 6¼ 0) in the vertex wk (9). We can remark that the unit vectors ei in the basis Bk (8) causes the i-th weight wk,i equal to zero (wk,i = 0) and reduction of the i-th feature xi from the feature space F [n] in accordance with the rule (17). The reduced feature vectors yj = [yj,1, …, yj,rk]T (yj 2 Fk[rk]) are obtained from the feature vectors xj = [xj,1, …, xj,n]T (xj 2 C (1)) through reducing (19) such n − rk components xj,i which are linked to the unit vectors ei in the basis Bk (8) and by a new indexing i of the remaining rk components yj,i (i = 1, …, rk). The reduced weight vector vk = [vk,1, …, vk,rk]T is obtained from the k-th vertex wk = [wk,1, …, wk,n]T (9) of the rank rk by neglecting the n − rk components wk,i equal to zero (wk,i = 0). Definition 3: The collinear bicluster Bk(mk, rk) based on the highly degenerated optimal vertex wk (17) of the rank rk is defined as the set of such mk (mk > rk) reduced feature vectors yj (yj 2 Fk[rk]) which fulfill the equation vTk yj ¼ 1: Bk ðmk ; rk Þ ¼

  yj : vTk yj ¼ 1

ð20Þ

The k-th collinear (flat) bicluster Bk(mk, rk) (20) based on the degenerated vertex wk (9) has been characterized by the two numbers rk and mk, where rk is the number of features xi in the vertexical subspace Fk[rk] (19), and mk is the number of such feature

Models of Multiple Interactions from Collinear Patterns

159

vectors xj (xj 2 F[n]) which define the dual hyperplanes h1j (2) passing through this vertex ðwTk xj ¼ 1Þ. The collinear biclusters Bk(mk, rk) (20) should be based on highly degenerated vertices wk (9) with a large number mk. Because the reduced vertex vk is obtained from the k-th vertex wk = [wk,1, …, wk,n]T (9) through neglecting (19) such n − rk components wk,i which are equal to zero (wk,i = 0) the below equalities hold: ð8j 2 f1; . . .; mgÞ vTk yj ¼ wTk xj

ð21Þ

Remark 3: The collinear bicluster Bk(mk, rk) (20) is the set of such mk reduced feature vectors yj (yj 2 Fk[nk]) which are located on the hyperplane H(vk, 1) (10) defined in the k-th vertexical feature subspace Fk[rk] by the reduced weight vector vk = [vk,1, …, vk,rk]T with the all rk components vk,i different from zero (vk,i 6¼ 0):   H ðvk ; 1Þ ¼ y: vTk y ¼ 1

ð22Þ

8i 2 f1; . . .; rk gvk;i 6¼ 0

ð23Þ

where

We can also remark that mk feature vectors xj = [xj,1, …, xj,n]T linked to the collinear bicluster Bk(mk, rk) (20) through the reduced feature vectors yj are located on the vertexical plane Pk(xj(1), …, xj(rk)) (11) of the rank rk [7]: ð8j 2 f1; . . .; mgÞ      if yj 2 Bk ðmk ; rk Þ ; then xj 2 Pk xjð1Þ ; . . .; xjðrkÞ

ð24Þ

7 Multiple Interactions Models Based on Biclusters Let us assume that a large number mk of feature vectors xj is located on the vertexical plane Pk(xj(1), …, xj(rk)) (11) supported by the reduced weight vector vk = [vk,1, …, vk, T rk] of the rank rk (Definition 1). The below equations are fulfilled with the reduced (19) weight vector vk = [vk,1, …, vk,rk]T (23):   8yj 2 Bk ðmk ; rk Þ ð19Þ vTk yj ¼ vk;1 yj;1 þ . . . þ vk;rk yj;rk ¼ 1

ð25Þ

The linear equations (25) are exactly fulfilled by components yj,ik of the reduced vectors yj from the k-th bicluster Bk(mk, rk) (20). If the number rk of elements yj of the bicluster Bk(mk, rk) (20) is large then the below model of linear interaction between selected features xi (xi 2 Fk[rk]) can be justified on the basis of the Eq. (25): a1 xið1Þ þ . . . þ ark xiðrkÞ ¼ 1

ð26Þ

160

L. Bobrowski and P. Zabielski

where ai is equal to the i-th component vk,i (vk,i 6¼ 0) of the degenerated vertex vk = [vk,1, …, vk,rk]T: ð8i 2 f1; . . .; rk gÞai ¼ vk;i

ð27Þ

and xi(l) is the i(l)-th feature belonging to the vertexical feature subspace Fk[rk] = {xi(1), …, xi(rk)} (19). Tthe k-th vertexical feature subspace Fk[rk] (19) contains such rk features xi(l) which constitute the bicluster Bk(mk, rk) (19). The rk features xi(l) constituting the bicluster Bk(mk, rk) (20) define the k-th vertexical feature subspace Fk[rk] (19) based on the highly degenerated optimal vertex wk (17). Remark 4: The Eq. (25) constitutes the model of linear interactions between rk (2  rk  n) features xi selected to the k-th bicluster Bk(mk, rk) (20) through minimization of the collinearity criterion function Uk(w) (16). The model xi0 ¼ a xi þ bða 6¼ 0; b 6¼ 0Þ of linear interactions between two features xi and xi0 can be designed on the basis of the correlation coefficient [2]. The model (26) can be treated as a generalisation of the correlation model of linear interactions to a number rk of variables xi greater than two. Remark 5: The model (26) is precisely accurate for each reduced vector yj (Definition 3) from the bicluster Bk(mk, rk) (20). This means, that components yj,i of each reduced vector yj from the bicluster Bk(mk, rk) (20) are linked by the Eq. (26). The components yj,i of the j-th reduced vectors yj from the bicluster Bk(mk, rk) (20) fulfill the Eq. (26), where all parameters vk,i are different from zero (vk,i 6¼ 0). Consequently, the l-th component yj,l depends on the remaining components yj,i (i 6¼ l): 

 8yj 2 Bk ðmk ; rk Þ ð8l 2 f1; . . .; rk gÞ

yj;l ¼ b1 yj;1 þ . . . þ brk yj;rk þ b0

ð28Þ

where ð8i 2 f1; . . .; rk gÞ bi = vk;i =vk;l and b0 = 1=vk;l . The deterministic relation (28) between components yj,i can be generalized in order to take into account random noise in modelled interactions. This generalization can be done through introducing layers with the nonnegative margin e (e  0) [10]: ð8l 2f1; . . .; rk gÞ b1 yj;1 þ . . . þ brk yj;rk þ b0  e  yj;l  b1 yj;1 þ . . . þ brk yj;rk þ b0 þ e

ð29Þ

The collinearity penalty functions uej ðwÞ and the criterion functions Uek ðwÞ with the margin e were proposed in the work [10]. ð8xj 2 CÞ uej ðwÞ ¼

1  e  w T xj

if

wT xj \1  e

0

if

1  e  w T xj  1 þ e

wT xj  1 þ e

if

wT xj [ 1 þ e

where e is a small, non-negative parameter (e  0).

ð30Þ

Models of Multiple Interactions from Collinear Patterns

161

The criterion functions Uek ðwÞ can be defined in the same way as Uk(w) (16), as the sum of the penalty functions uej ðwÞ (30). The Eqs. (28) or (29) can be used as prognostic (regression) models [2]. Such prognostic models can be extracted from the data set C (1) through the minimization of the convex and piecewise linear (CPL) criterion function Uk(w) (16) both when the margin e (29) is equal to zero (e = 0) and in the case when the margin e is greater than zero (e > 0). The basis exchange algorithm which is similar to the linear programming allows to efficiently find the optimal vertex wk (17) constituting the minimal value Uk ðwk Þ even in case of large, multidimensional data subsets Ck [11]. A variety of highly degenerated vertices wk (17) can be extracted from a given large data set C (1). Each of these degenerated vertices wk allow to define the separate flat pattern Fk (18) and the bicluster Bk(mk, rk) (20). Each of the degenerated vertices wk (26) can also define different prognostic model (28) or (29). The prognostic models obtained through minimization of the criterion functions Uk(w) (16) have a local character. The extracted model (28) is well fitted to mk reduced vectors yj from the collinear bicluster Bk(mk, rk) (20) composed from rk features xi constituting the k-th vertexical feature subspace Fk[rk] (18). The useability of the models (28) or (29) can be extended over the region of the bicluster Bk(mk, rk) (20) through introducing the margin e (29) greater than zero (e > 0). The prognostic models (28) or (29) have an approximate nature in an extended area of the bicluster Bk(mk, rk) (19) [10]. The below layer Lek ½n based on the optimal vertex wk (17) in the feature space F[n] (18) can be introduced for the bicluster Bk(mk, rk) (20) enlarged by the margin e: Lek ½n ¼ fxj 2 C ð1Þ: 1  e  ðwk ÞT xj  1 þ eg

ð31Þ

8 Examples of Computational Results The results of experiments with synthetic data sets D1 and D2 were described here to illustrate new concepts of multiple interactions modeling. The set D1 was composed from three lines l1(xj(1), xj(2)), l2(xj(3), xj(4)), and l3(xj(5), xj(6)) (13) in the two-dimensional feature space F[2], where the supporting vectors xj(i) are specified below: xjð1Þ ¼ ½0:98; 1:36T ; xjð2Þ ¼ ½0:22; 2:88T xjð3Þ ¼ ½0:56; 0:43T ; xjð4Þ ¼ ½0:23; 0:76T xjð5Þ ¼ ½0:49; 5:0T ; xjð6Þ ¼ ½0:81; 3:95T

ð32Þ

These three lines lk (k = 1, 2, 3) are shown on the Fig. 1. Each line lk contains mk = 20 points xj on the plane. A noise is represented as mk = 20 uniformly distributed points xj on the plane.

162

L. Bobrowski and P. Zabielski

Fig. 1. Configuration of feature vectors xj from set D1 on the plane ðxj 2 R2 Þ.

Fig. 2. Representation of the data set D1 in the two-dimensional parameter space R2.

The set D2 was composed from two vertexical planes P1(xj(1), xj(2), xj(3)), and P2(xj(4), xj(5), xj(6)) (11) in the three-dimensional feature space F[3] (Fig. 3). The supporting vectors xj(i) of these planes are specified below (xj 2 R3): xjð1Þ ¼ ½0:70; 0:18; 1:06T ; xjð2Þ ¼ ½0:73; 0:71; 0:85T ; xjð3Þ ¼ ½0:61; 0:14; 1:23T xjð4Þ ¼ ½0:73; 0:86; 0:05T ; xjð5Þ ¼ ½0:31; 0:56; 0:75T ; xjð6Þ ¼ ½0:35; 0:98; 0:24T ð33Þ Each of the planes Pk (k = 1, 2) contains mk = 30 points xj. A noise is represented as mk = 90 uniformly distributed points xj ðxj 2 R3 Þ. An experimental extraction of collinear patterns from the set D1 and the set D2 was done. In this experiment, the collinear patterns in the form of lines and planes were extracted from the sets D1 and D2 with different level of noise. The lines and planes were successfully extracted from the sets Dk (k = 1, 2) even when the level of noise was high.

Models of Multiple Interactions from Collinear Patterns

163

Fig. 3. Configuration of feature vectors xj from set D2 in the three-dimensional feature space F[3] ðxj 2 R3 Þ. The green points represent a noise. (Color figure online)

Each of the extracted lines lk (k = 1, 2, 3) from the set D1 can be represented by the optimal vertex wk ¼ ½wk;1 ; wk;2 T (9). The optimal vertex wk constitutes the minimal value Uk ðwk Þ (17) of the k-th collinearity criterion function Uk(w) (25). The below optimal vertices wk have been extracted from the set D1 (Fig. 2) w1 ¼ ½0:5; 0:15T ; w2 ¼ ½0:6; 0:3T ; w3 ¼ ½1; 1T

ð34Þ

These optimal vertices wk ¼ ½wk;1 ; wk;2 T allow to formulate three models I, II, and III of the linear interactions wk,1 x1 + wk,2 x2 = 1 (26) between two features x1 and x2: I: 0:5x1  0:15x2 ¼ 1 II: 0:6x1 þ 0:3x2 ¼ 1 III: x1 þ x2 ¼ 1

ð35Þ

Similarly, the planes P2 and P3 extracted from the set D2 can be represented by the below optimal vertices w2 and w3 : w1 ¼ ½0:5; 0:7; 0:6T w2 ¼ ½0:8; 0:1; 0:4T

ð36Þ

Three models I′, II′, and III′ of the linear interactions wk,1 x1 + wk,2 x2 + wk,3 x3 = 1 between three features x1, x2, and x3 now have the below form: I0 : 0:5 x1 þ 0:7 x2 þ 0:6 x3 ¼ 1 II0 : 0:8 x1 þ 0:1 x2 þ 0:4 x3 ¼ 1

ð37Þ

164

L. Bobrowski and P. Zabielski

9 Concluding Remarks Designing models of multiple interactions on the basis of discovering collinear patterns in large, multidimensional data set has been analyzed in the paper. The process of discovering collinear patterns in data set C (1) has been based on multiple minimizations of the CPL criterion functions Uk(w) (24) combined with the data subsets Ck () reduction and feature subspaces Fk[rk] (18) selection. New properties of the collinear biclustering have been examined in the context of the multiple interactions modeling. The collinear (flat) patterns Fk (17) extracted from the set C (1) of feature vectors xj (xj 2 F[n]) allow to define the collinear biclusters Bk(mk, rk) (19). The collinear bicluster Bk(mk, rk) (19) is composed of mk reduced, rk - dimensional vectors yj (yj 2 Fk[rk]). The reduced vector yj is obtained from feature vector xj through omitting such components xj,i which are linked to the weights wi equal to zero (wi = 0). All the vectors yj from the bicluster Bk(mk, rk) (19) are located on the hyperplane H(vk,1) (21) in the k-th vertexical feature subspace Fk[rk]. The deterministic models (28) of linear interactions between selected rk features xi have been formulated on the basis of collinear biclusters Bk(mk, rk) (19). The prognostic (regression) models can be formulated based on this. These regression models have a local character. This means, that the k-th model (28) is valid (exactly fitted) only to mk reduced vectors yj from the k-th collinear bicluster Bk(mk, rk) (19) and to rk features xi from the k-th feature subspace Fk[rk] (18). The deterministic models (28) or (29) of linear interactions between selected features xi can be generalized in order to include random noise. This possibility can be realized through introducing penalty functions u1j (w) (17) with the nonnegative margin e (e  0). This way, the useability of the models (28) or (29) can be extended over the region of single biclusters Bk(mk, rk) (19). Acknowledgments. The present study was supported by a grant S/WI/2/2018 from Bialystok University of Technology and founded from the resources for research by Ministry of Science and Higher Education.

References 1. Hand, D., Smyth, P., Mannila, H.: Principles of Data Mining. MIT Press, Cambridge (2001) 2. Johnson, R.A., Wichern, D.W.: Applied Multivariate Statistical Analysis. Prentice- Hall Inc., Englewood Cliffs (1991) 3. Duda, O.R., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley, New York (2001) 4. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006) 5. Madeira, S.C., Oliveira, S.L.: Biclustering algorithms for biological data analysis: a survey. IEEE Trans. Comput. Biol. Bioinform. 1(1), 24–45 (2004) 6. Bobrowski, L.: Biclustering based on collinear patterns. In: Rojas, I., Ortuño, F. (eds.) IWBBIO 2017. LNCS, vol. 10208, pp. 134–144. Springer, Cham (2017). https://doi.org/10. 1007/978-3-319-56148-6_11 7. Bobrowski, L.: Discovering main vertexical planes in a multivariate data space by using CPL functions. In: Perner, P. (ed.) ICDM 2014. LNCS (LNAI), vol. 8557, pp. 200–213. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08976-8_15

Models of Multiple Interactions from Collinear Patterns

165

8. Duda, O.R., Hart, P.E.: Use of the hough transformation to detect lines and curves. Pict. Commun. Assoc. Comput. Mach. 15(1), 11–15 (1972) 9. Ballard, D.H.: Generalizing the Hough transform to detect arbitrary shapes. Pattern Recogn. 13(2), 111–122 (1981) 10. Bobrowski, L., Zabielski, P.: Flat patterns extraction with collinearity models. In: 9th EUROSIM Congress on Modelling and Simulation, EUROSIM 2016, Oulu, Finland, 12–16 September 2016. IEEE Conference Publishing Services (CPS) (2016) 11. Bobrowski, L.: Design of piecewise linear classifiers from formal neurons by some basis exchange technique. Pattern Recogn. 24(9), 863–870 (1991)

Identification of the Treatment Survivability Gene Biomarkers of Breast Cancer Patients via a Tree-Based Approach

Ashraf Abou Tabl¹(✉), Abedalrhman Alkhateeb², Luis Rueda², Waguih ElMaraghy¹, and Alioune Ngom²

¹ Department of Mechanical, Automotive, and Materials Engineering (MAME), University of Windsor, 401 Sunset Ave, Windsor, ON N9B 3P4, Canada
{aboutaba,wem}@uwindsor.ca
² School of Computer Science, University of Windsor, 401 Sunset Ave, Windsor, ON N9B 3P4, Canada
{alkhate,lrueda,angom}@uwindsor.ca

Abstract. Studying breast cancer survivability among different patients who received various treatments may help to understand the relationship between survivability and treatment therapy based on gene expression. In this work, we built a classifier system that predicts whether a given breast cancer patient who underwent some form of treatment (hormone therapy (H), radiotherapy (R), or surgery (S)) will survive beyond five years after the treatment therapy. Our classifier is a tree-based hierarchical approach which partitions breast cancer patients according to survivability classes; each node in the tree is associated with a treatment therapy and finds a predictive subset of genes that can best predict whether a given patient will survive after that particular treatment. We applied our tree-based method to a gene expression dataset consisting of 347 treated breast cancer patients and identified potential biomarker subsets with accuracies ranging from 80.9% to 100%. We have investigated the roles of many of these biomarkers through the literature.

Keywords: Breast cancer · Survival · Gene biomarkers · Treatment therapy · Machine learning · Classification · Feature selection

1 Background

Despite the fast increase in breast cancer rates nowadays, survival rates have also increased, not least due to improvements in treatments enabled by new technologies [1]. Breast cancer is still one of the leading causes of death among women worldwide. The survival rates vary with the treatment therapy, which includes surgery, chemotherapy, hormone therapy, and radiotherapy, and the response to treatment varies from patient to patient [2].

Traditional laboratory techniques such as CAT scans and Magnetic Resonance Imaging (MRI) provide useful but very limited information. On the other hand, advances in DNA microarray technology provide high-throughput samples of gene expression. Analyzing gene expression from breast cancer patients who undergo


different treatments provides a better understanding of the disease progression. The large number of features complicates the computational model, and since it is much higher than the number of samples, it creates a problem known as the curse of dimensionality, where standard classifiers struggle to handle the volume of features and the model overfits. Therefore, feature selection techniques are applied to filter out the irrelevant features.

Mangasarian et al. utilized a linear support vector machine (SVM) to extract six features out of 31 clinical features from a data set of 253 breast cancer patients. The model classified the samples into two groups: node-positive, in which the patients have some metastasized lymph nodes, and node-negative, for patients with no metastasized lymph nodes. The six features were then used in a Gaussian SVM classifier to classify the patients into three prognostic groups: negative, middle, and positive. They found that patients in the negative group had the highest survivability, with the majority of them having had chemotherapy as a treatment [3].

Using samples from patients with high-risk clinical features in early stages of breast cancer, Cardoso et al. proposed a statistical model to decide on the necessity of chemotherapy intervention based on gene expression [4]. In earlier work, we built a prediction model for survival based on different treatments without defining the period of survivability [5]; that is, given a training set consisting of gene expression data of breast cancer patients who survived or died after receiving a treatment therapy, we built a classifier model which predicted whether a new patient will survive or die.

In this paper, we propose a classifier model to predict which breast cancer patients will survive beyond five years after undergoing a given treatment therapy. The classifier model is built on top of a feature selection model, which identifies the genes that can best distinguish among the survival classes.

2 Materials and Methods

Samples from a publicly accessible database of 2,433 breast cancer patients with survival information are used in this approach [6]. After studying the given data, a set of six classes was identified as the basis of this work; these classes are the combinations of each treatment (surgery, hormone therapy, radiotherapy) with a patient status (living or deceased). The number of samples (patients) in each class is shown in Table 1; a total of 347 patients are used in this work.

Table 1. Class list with the number of samples in each class.

Class                        Number of samples
Living and Radio (LR)        132
Deceased and Radio (DR)      19
Living and Hormone (LH)      20
Deceased and Hormone (DH)    6
Living and Surgery (LS)      130
Deceased and Surgery (DS)    40
Total                        347


Based on the available data, only three treatment therapies are covered: surgery, hormone therapy, and radiotherapy. Our proposed model is a hierarchical classifier that classifies one class versus the rest. The data set contains unbalanced classes, a problem that is well known in machine learning. The pipeline starts with feature selection methods, Chi-square [7] and Info-Gain, which are applied to limit the number of significant features (genes); a wrapper method is then used to obtain the best subset of genes that represents the model, utilizing the mRMR (minimum redundancy maximum relevance) [8] feature selection method. This is followed by several class balancing techniques, such as SMOTE [9], cost-sensitive learning [10], and resampling [11], applied before different types of classifiers, such as Naive Bayes [12] and decision trees (random forest) [13]. Finally, a small number of biomarker genes is recognized for predicting the proper treatment therapy. To the best of our knowledge, this work is the first prediction model built on the combination of treatment and patient survivability as a class.

The patient class distribution is shown in Fig. 1, which indicates the number of samples within each class. It is clear that there are variances between the classes, which requires class balancing to obtain a fair classification.

Fig. 1. Percentage of patient class distribution

2.1 Class Imbalance

This model utilizes one-versus-rest classification to handle the multiclass problem, which leads to an unbalanced class dataset at each node of the classification model. Therefore, we applied several techniques to handle this issue, such as:

• Oversampling: oversampling the minority class by using synthetic data generators. There are several algorithms to achieve this; we used one of the most popular, the Synthetic Minority Over-Sampling Technique (SMOTE).
• Cost-sensitive classifier: using penalized models that apply additional costs to the minority class to achieve class balancing. This, in turn, biases the model to pay more attention to the minority class. The algorithm used in this work is the Cost-Sensitive Classifier in Weka, with a penalty matrix to overcome the imbalance.


• Resampling: replicating the dataset, which can be done by one of two methods: adding more copies of the data instances of the minority class (oversampling), or deleting some instances of the majority class (undersampling). We used the oversampling technique.

2.2 Feature Selection

The gene expression dataset contains 24,368 genes for each of the 347 samples. The curse of dimensionality makes it difficult to classify the dataset in its current form; hence, feature selection is essential to narrow the number of genes down to a few genes at each node. Chi-square and Info-Gain are applied to select the genes with the best information gain, and then the mRMR (minimum redundancy maximum relevance) feature selection method is applied to find the best subset of significant genes. mRMR is an algorithm commonly used in a greedy search to identify the characteristics of features and correctly narrow down their relevance.

2.3 Multi-class Classification Model

We applied a multiclass approach, the one-versus-rest technique. This approach classifies one class against the rest of the classes and then removes that class from the dataset; afterwards, another class is selected to be classified against the rest, and so on. Using a greedy method to find the starting node, the method evaluates all possible combinations, such as 'DH' against the rest, then 'DR' against the rest, and so on for all six classes; the best-performing combination is then selected as the root node of the classification tree. Several classifiers were utilized to achieve these results, such as random forest and Naive Bayes. The classification model was built with 10-fold cross-validation; a sketch of one node is given below.
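The original study used Weka, so in the hedged Python sketch below scikit-learn and imbalanced-learn stand in: SelectKBest with the chi-square score replaces the Chi-square/Info-Gain filters (chi2 assumes non-negative inputs), the mRMR wrapper step is omitted for lack of a standard implementation, and the number of retained genes (k = 200) is an assumed value.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import StratifiedKFold, cross_val_score

def node_accuracy(X, y, target_class):
    """10-fold CV accuracy of one one-versus-rest node, e.g. 'DH' vs. rest.
    X: (347, 24368) expression matrix; y: treatment/survival labels."""
    y_bin = (y == target_class).astype(int)
    clf = Pipeline([
        ("filter", SelectKBest(chi2, k=200)),  # gene filtering
        # small k_neighbors: tiny minority classes (DH has only 6
        # samples) leave SMOTE few neighbors to interpolate between
        ("smote", SMOTE(k_neighbors=3, random_state=0)),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ])
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    return cross_val_score(clf, X, y_bin, cv=cv).mean()
```

With the imbalanced-learn Pipeline, SMOTE is applied only to the training folds, so the reported accuracy is not inflated by synthetic test samples.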

3 Results and Discussion

The developed multi-class model is shown in Fig. 2, which presents the final results for each node and the performance measures that were considered: accuracy, sensitivity, F1-measure, and specificity. It also shows the number of correctly and incorrectly classified instances at each node.

Figure 2 also shows that the root node, DH against the rest, gives 100% accuracy. The second node is obtained after removing the 'DH' instances from the data set and then classifying each remaining class against the rest; the best outcome was 'DR', with an accuracy of 100%. Repeating the same technique gives 'LH' as the third node, with an accuracy of 100%, and 'DS' as the fourth node, with an accuracy of 97.9%, a sensitivity of 96.9%, and a specificity of 100%, since all the DS samples were correctly classified. In the fifth and last node, which separates 'LR' from 'LS', the accuracy drops to 80.9%, because it is difficult to distinguish between the living samples of the two classes.

[Figure 2 appears here as a classification tree diagram.]

Fig. 2. Multi-class classification model with performance measures. Each node classifies one class against the remaining ones and lists its selected gene biomarkers: Node 1 (DH vs. rest; AKIP1, FGF16, AA884297, CDC42BPG, UPF3B, FAM114A1, OR2G6, ANKLE1, MGA, C14orf145; accuracy 100%), Node 2 (DR vs. rest; ASXL1, WIPI2, ASAP1, ZNF121, METTL2A, FAM170B, BG944228, PDCD7, ATL1, TRPC5, FOSB, AL71228, BF594823, FBXO41; accuracy 100%), Node 3 (LH vs. rest; DA874553, AKT1S1, CPPED1, BLP, ARFGAP2, VAMP4, CT47A1, CLASRP; accuracy 100%), Node 4 (DS vs. rest; ICOSLG, SAR1A, PRPS1, FBRSL1, INPP5F, SFMBT2; accuracy 97.9%, sensitivity 96.9%, specificity 100%), and Node 5 (LR vs. LS; C14orf166, OSTC, ZFP91, AI376590, BU753119, OR2B3, ARPC3, DSCAM; accuracy 80.9%, sensitivity 84.8%, specificity 76.9%, F-measure 81.8%). Each node also reports the counts of correctly and incorrectly classified instances.


Our method identified 47 gene biomarkers, listed in Table 2; functional validation and biological insights were obtained for some genes by studying the information available in the literature and their relation to breast cancer. The genes marked in blue are those considered for further biological relevance; see the discussion in the next section.

Table 2. Gene biomarkers for each class

4 Biological Insight

The FGF16 gene is a member of the fibroblast growth factor (FGF) family, which is involved in a variety of cellular processes such as stemness, proliferation, anti-apoptosis, drug resistance, and angiogenesis [14]. UPF3 is a regulator of nonsense transcripts, homolog B (yeast); Kechavarzi and Janga found that UPF3 is one of the actively upregulated RNA-binding proteins identified across nine human cancers, breast cancer among them [15]. ASAP1 has been shown to be a breast cancer biomarker, although its precise correlation with invasive phenotypes has not been accurately identified [16]; Sabe et al. reported that ASAP1 is abnormally overexpressed in some breast cancers and used for their invasion and metastasis. FOSB is a member of the AP-1 family of transcription factors; Bamberger et al. concluded that sharp differences in the expression pattern of AP-1 family members are present in breast tumors, and that fosB might be involved in the pathogenesis of these tumors [17]. The VAMP4 gene is a target of some cellular and circulating miRNAs in neoplastic diseases, such as miRNA-31; in any case, it has been confirmed that cellular miRNAs are involved in the development of breast cancer [18]. CT47A1 is one of seven Cancer/Testis (CT) class genes; CT genes are significantly overexpressed in ductal carcinoma in situ (DCIS) [19]. Phosphoribosyl pyrophosphate synthetase 1 (PRPS1) was found to be a direct target of miR-124 in breast cancer [20]. Nam et al. stated that ICOSLG is a potential biomarker of trastuzumab resistance in breast cancer, which affects the progression of the disease [21]. Dombkowski et al. studied several pathways in breast cancer, in which


they found that ARPC3 reveals extensive combinatorial interactions with significant implications for its potential role in breast cancer metastasis and therapeutic development [22]. The zinc finger protein 91 homolog (ZFP91) in the mouse is a methylated target gene, identified by methylated-CpG island recovery assay-assisted microarray analysis [23].

5 Visualization

Figure 3 shows a multi-dimensional representation of the plot matrix for the six biomarker genes found at node 4 for the DS class versus the rest, as an example; the figure also shows the relations of the six genes with each other. From the class column, it is clear that the samples are separable. Figure 4 shows that the expression of AKIP1 is up-regulated in the DH samples compared with the rest, while the expression of ASAP1 is down-regulated in the DR samples compared with the rest. As shown in Fig. 5, FOSB has a strong correlation coefficient with AL71228 in the DR samples, whereas it does not correlate in the remaining samples, as shown in Fig. 6.
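As a hedged illustration of how the per-group correlations behind Figs. 5 and 6 could be computed, the sketch below uses SciPy's Pearson correlation with a p < 0.05 cutoff; the expr data frame, the 'class' column, and the gene list are assumptions for the example, not the exact pipeline used for the figures.

```python
# Illustrative sketch: expr is an assumed pandas DataFrame (samples x genes)
# with a 'class' column; drawing the Circos diagram itself is omitted.
from itertools import combinations
from scipy.stats import pearsonr

def significant_links(expr, genes, alpha=0.05):
    """Return {(gene_a, gene_b): r} for gene pairs with p-value below alpha."""
    links = {}
    for a, b in combinations(genes, 2):
        r, p = pearsonr(expr[a], expr[b])
        if p < alpha:
            links[(a, b)] = r
    return links

# Compare the correlation structure of the DR samples against the rest:
# dr_links   = significant_links(expr[expr['class'] == 'DR'], node2_genes)
# rest_links = significant_links(expr[expr['class'] != 'DR'], node2_genes)
```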

Fig. 3. Node four (DS vs. rest): pairwise relations matrix of the six biomarker genes.


Fig. 4. Boxplots for the AKIP1 and ASAP1 genes at nodes one and two, showing the minimum, first quartile, median, third quartile, and maximum gene expression values for each group of samples (DH vs. rest and DR vs. rest).

Fig. 5. Circos plot for the biomarker genes at node two for the "DR" class samples, based on the correlation coefficients among gene expressions (p < 0.05).


Fig. 6. Circos plot for the biomarker genes at node two for the "Rest" class samples, based on the correlation coefficients among gene expressions (p < 0.05).

6 Conclusion

Using a machine learning model to identify gene biomarkers for breast cancer survival is a significant step towards predicting the proper treatment for each patient, and may potentially increase survival rates. This model achieves very high accuracy by using a hierarchical tree of one-versus-rest classifications. The computational model identifies sets of biomarkers for patients who received different treatments; those biomarkers can distinguish whether a patient survived or died within a five-year window for a specific treatment therapy. Validation from the literature verified the relationships between those biomarkers and breast cancer survivability. Future work includes testing these gene biomarkers in biomedical labs. This model can be extended to identify the proper biomarker genes (signature) for different cancer types, or even for patients who received more than one therapy. Considering more patient data would also allow covering the missing treatments; with such data sizes, big data tools such as Hadoop and Spark can be utilized to devise an enhanced model.

Acknowledgement. This work has been partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Windsor Essex County Cancer Centre Foundation (WECCCF) Seeds4Hope program.


References

1. Siegel, R.L., Miller, K.D., Jemal, A.: Cancer statistics, 2016. CA: Cancer J. Clin. 66(1), 7–30 (2016)
2. Miller, K.D., Siegel, R.L., Lin, C.C., Mariotto, A.B., Kramer, J.L., Rowland, J.H., Stein, K.D., Alteri, R., Jemal, A.: Cancer treatment and survivorship statistics, 2016. CA: Cancer J. Clin. 66(4), 271–289 (2016)
3. Lee, Y.-J., Mangasarian, O.L., Wolberg, W.: Breast cancer survival and chemotherapy: a support vector machine analysis. In: Discrete Mathematical Problems with Medical Applications, DIMACS Workshop, 8–10 December 1999, vol. 55, p. 1 (2000)
4. Cardoso, F., van't Veer, L.J., Bogaerts, J., Slaets, L., Viale, G., Delaloge, S., Pierga, J.-Y., Brain, E., Causeret, S., DeLorenzi, M., et al.: 70-gene signature as an aid to treatment decisions in early-stage breast cancer. N. Engl. J. Med. 375(8), 717–729 (2016)
5. Abou Tabl, A., Alkhateeb, A., ElMaraghy, W., Ngom, A.: Machine learning model for identifying gene biomarkers for breast cancer treatment survival. In: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, p. 607. ACM (2017)
6. Pereira, B., Chin, S.-F., Rueda, O.M., Vollan, H.-K.M., Provenzano, E., Bardwell, H.A., Pugh, M., Jones, L., Russell, R., Sammut, S.-J., et al.: The somatic mutation profiles of 2,433 breast cancers refines their genomic and transcriptomic landscapes. Nat. Commun. 7, 11479 (2016)
7. Mantel, N.: Chi-square tests with one degree of freedom; extensions of the Mantel–Haenszel procedure. J. Am. Stat. Assoc. 58(303), 690–700 (1963)
8. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27(8), 1226–1238 (2005)
9. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
10. Núñez, M.: Economic induction: a case study. In: EWSL, vol. 88, pp. 139–145 (1988)
11. Gross, S.: Median estimation in sample surveys. In: Proceedings of the Section on Survey Research Methods, pp. 181–184 (1980)
12. Domingos, P., Pazzani, M.: On the optimality of the simple Bayesian classifier under zero-one loss. Mach. Learn. 29(2), 103–130 (1997)
13. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
14. Katoh, M., Nakagama, H.: FGF receptors: cancer biology and therapeutics. Med. Res. Rev. 34(2), 280–300 (2014)
15. Kechavarzi, B., Janga, S.C.: Dissecting the expression landscape of RNA-binding proteins in human cancers. Genome Biol. 15(1), R14 (2014)
16. Sabe, H., Hashimoto, S., Morishige, M., Ogawa, E., Hashimoto, A., Nam, J.-M., Miura, K., Yano, H., Onodera, Y.: The EGFR-GEP100-ARF6-AMAP1 signaling pathway specific to breast cancer invasion and metastasis. Traffic 10(8), 982–993 (2009)
17. Bamberger, A.-M., Methner, C., Lisboa, B.W., Städtler, C., Schulte, H.M., Löning, T., Milde-Langosch, K.: Expression pattern of the AP-1 family in breast cancer: association of fosB expression with a well-differentiated, receptor-positive tumor phenotype. Int. J. Cancer 84(5), 533–538 (1999)
18. Allegra, A., Alonci, A., Campo, S., Penna, G., Petrungaro, A., Gerace, D., Musolino, C.: Circulating microRNAs: new biomarkers in diagnosis, prognosis and treatment of cancer. Int. J. Oncol. 41(6), 1897–1912 (2012)


19. Caballero, O.L., Shousha, S., Zhao, Q., Simpson, A.J., Coombes, R.C., Neville, A.M.: Expression of cancer/testis genes in ductal carcinoma in situ and benign lesions of the breast. Oncoscience 1(1), 14 (2014)
20. Qiu, Z., Guo, W., Wang, Q., Chen, Z., Huang, S., Zhao, F., Yao, M., Zhao, Y., He, X.: MicroRNA-124 reduces the pentose phosphate pathway and proliferation by targeting PRPS1 and RPIA mRNAs in human colorectal cancer cells. Gastroenterology 149(6), 1587–1598 (2015)
21. Nam, S., Chang, H.R., Jung, H.R., Gim, Y., Kim, N.Y., Grailhe, R., Seo, H.R., Park, H.S., Balch, C., Lee, J., et al.: A pathway-based approach for identifying biomarkers of tumor progression to trastuzumab-resistant breast cancer. Cancer Lett. 356(2), 880–890 (2015)
22. Dombkowski, A.A., Sultana, Z., Craig, D.B., Jamil, H.: In silico analysis of combinatorial microRNA activity reveals target genes and pathways associated with breast cancer metastasis. Cancer Inform. 10, 13 (2011)
23. Tommasi, S., Karm, D.L., Wu, X., Yen, Y., Pfeifer, G.P.: Methylation of homeobox genes is a frequent and early epigenetic event in breast cancer. Breast Cancer Res. 11(1), R14 (2009)

Workflows and Service Discovery: A Mobile Device Approach

Ricardo Holthausen, Sergio Díaz-Del-Pino, Esteban Pérez-Wohlfeil, Pablo Rodríguez-Brazzarola, and Oswaldo Trelles(B)

Computer Architecture Department, University of Málaga, Málaga, Spain
{ricardoholthausen,sergiodiazdp,estebanpw,pabrod,ortrelles}@uma.es

Abstract. Bioinformatics has moved from command-line standalone programs to web-service based environments. This trend has resulted in an enormous amount of online resources which can be hard to find and identify, let alone execute and exploit. Furthermore, these resources are aimed, in general, at solving specific tasks, and such tasks usually need to be combined in order to achieve the desired results. In this line, finding the appropriate set of tools to build up a workflow that solves a problem with the services available in a repository is itself a complex exercise, raising issues such as service discovery, composition and representation. On the technological side, mobile devices have experienced an incredible growth in number of users and technical capabilities. Starting from this reality, in the present paper we propose a solution for service discovery and workflow generation, while distinct approaches to representing workflows in a mobile environment are reviewed and discussed. As a proof of concept, a specific use case has been developed: we have embedded an expanded version of our Magallanes search engine into mORCA, our mobile client for bioinformatics. Such a composition delivers a powerful and ubiquitous solution that provides the user with a handy tool for not only generating and representing workflows, but also discovering services, data types, operations and service types.

Keywords: mORCA · Magallanes · Search engine · Workflow generation · Mobile devices

1 Introduction

Electronic supplementary material: The online version of this chapter (https://doi.org/10.1007/978-3-319-78723-7_15) contains supplementary material, which is available to authorized users.

In recent years, a vast amount of Life Science tools have been made accessible through the Internet [1]. Without the need for profound computational knowledge, genome browsers, sequence comparison software, and public databases, among others, can now be used through both graphical user interfaces and web services. Nowadays, a great number of online resources are available to researchers. Nonetheless,


they can face difficulties in finding, identifying and executing them. Furthermore, in order to enhance the usability of these tools, the prevailing trend is to combine them into scientifically repeatable workflows. In this sense, a complete experiment can usually be seen as a workflow [2] in which different operations are linked together based on their input and output files, providing a procedure that can be reused by other researchers, thus avoiding a tedious and repetitive task that is also prone to human error. For the past decades, repositories have been a common framework in Bioinformatics (as exemplified by the existence and growth of EBI [3], NCBI [4] and INB [5]). This type of service metadata storage has become highly relevant in this field as one of the main Internet resources used to find valuable Bioinformatics software [6]. In this line, MAPI [7] was born as a modular software framework for the standardization and use of such services through different clients via a common access layer. On the other hand, Magallanes [8] was developed as an intelligent search engine built for use with MAPI, exploiting different repositories to discover service types, operations, data types and services, along with automatic workflow composition (given two data types, the workflow source and target). Regarding workflow management, several tools have been developed over the years. Galaxy [9], for instance, allows the user to compose workflows through a friendly drag-and-drop interface, while Kepler [10] allows users to design, execute, reuse, evolve, archive and share scientific workflows. Moreover, Taverna [11] provides a workflow tool suite in which the user can use both Web Services and local tools. Another distinct project regarding workflow management is myExperiment [12], a sort of social network where users can submit and share their procedures in different formats (Taverna, RapidMiner, Galaxy, KNIME, Kepler, ...). Recently, the community has started to become aware of the significant growth in the use of mobile devices, and of the possibilities they provide by reason of their ubiquity and usability. This can be seen in the great number of Bioinformatics tools being developed to take advantage of this relatively new environment. In this line, different apps have been developed: the Galaxy Portal app [13] was born as an interface to the Galaxy system [9] for use on tablets and smartphones, whereas mORCA [14] was developed to ease the integration of Web Services using metadata available in different catalogues and repositories such as the INB repository [5]. The proposed work stands out from the previously mentioned workflow management systems by proposing a way of automatically generating workflows from a mobile device. Workflow generation is a difficult task that has been addressed through several different approaches, but less commonly on mobile devices. Therefore, the main contribution of this work is to join the concepts discussed previously (web services, repositories, workflow management and the mobile environment) by extending mORCA, combining it with the Magallanes search engine and workflow generation. This provides the latter with a new interface, and the former with two new features. Besides this, the web services currently available in mORCA are linked with the coincident search results obtained from the Magallanes search.


Thus, mORCA's functionalities are increased, facilitating the finding of relevant information and hence providing biologists with an easy-to-use mobile solution for discovering a wide range of resources both manually and automatically, taking into account the intrinsic limitations of mobile devices while leveraging their strengths. Finally, as this work brings together two elements, the mobile environment and workflow representation, the way of representing such pieces of information in this new environment is discussed. To do so, several ways of representing workflows have been analyzed in order to determine which one is preferable.

2 Methods

In order to provide mORCA with the Magallanes functionalities, a Web-Service implementation of the latter has been employed, using its Web Services Description Language (WSDL) interface [15]. Although mORCA's architecture is thoroughly explained in its paper, it is worth briefly recalling its functioning in order to gain a better grasp of the current implementation. mORCA was developed using a multi-layer architecture, in which components related to the actual tools, services and repositories are provided by MAPI, whereas a second layer works as a middleware for elements such as user authentication, repository browsing, service discovery, service parameter composition, service invocation, file management and service execution monitoring. The present work takes advantage of the first layer by making use of the web-service version of Magallanes, whose functionality is based on the client–server model and which is developed with an established technology, Java, running in a Java Servlet Container (Tomcat server). The integration of the Magallanes search engine into mORCA has been carried out using the main components of mORCA (developed with web technologies such as jQuery and jQuery Mobile). mORCA is a modular system, which facilitates the inclusion of new features (repositories, tools, types of visualizations). The main goal was to include the Magallanes functionalities while maintaining the original idea of issuing Ajax requests transparently to the user. More information about this work's methods, along with a sequence diagram, can be found in the supplementary material. The automatic workflow generation uses the algorithm described in Magallanes [8]. This algorithm finds the shortest sequence of non-redundant services that matches outputs with inputs, linking the source and target data types: it identifies all the services and operations that produce a certain output (the target), and all their input data types are then used as targets in the next step, until the source is reached; a minimal sketch of this backward search is given below.
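As an illustration of this backward search, the sketch below performs a breadth-first traversal from the target data type towards the source. The Service record and the toy registry are assumptions made for the example; they are not the actual MAPI/Magallanes data model or API.

```python
# Backward BFS: start from the target type, find services producing it, then
# treat their input types as the new targets until the source type is reached.
from collections import deque, namedtuple

Service = namedtuple('Service', ['name', 'inputs', 'output'])

def compose_workflow(services, source, target):
    """Return the shortest chain of service names turning source into target."""
    queue = deque([(target, [])])       # (data type still needed, path so far)
    seen = {target}
    while queue:
        needed, path = queue.popleft()
        if needed == source:
            return path                  # ordered source -> target
        for s in services:
            if s.output == needed:
                for t in s.inputs:
                    if t not in seen:
                        seen.add(t)
                        queue.append((t, [s.name] + path))
    return None                          # no workflow links the two data types

# Toy registry with a two-step workflow:
svcs = [Service('translate', ['dna_sequence'], 'protein_sequence'),
        Service('blastp', ['protein_sequence'], 'alignment')]
print(compose_workflow(svcs, 'dna_sequence', 'alignment'))
# -> ['translate', 'blastp']
```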


Regarding the search engine interface, the original Magallanes interface has been carefully tailored to fit the mORCA application environment and to maintain its responsive characteristics, while preserving all the functionalities available in the web version of the Magallanes search engine. As can be seen in Fig. 1(A), the user selects the repository in which to search for resources. The search terms are entered by the user in a search box, and different query options can be set in three sets of checkboxes:

– In the first one, a policy can be selected regarding the treatment of the terms entered (i.e., or, and, or the use of regular expressions).
– In the second one, the user can select the type(s) of resources he/she wants to discover.
– Finally, in the third one, two search options are added: follow links and case sensitive.

Results, comprising data types, services, service types and operations, are displayed in a dynamic list that distinguishes them by type. Each result is presented in a box whose components are:

– Result type (Data type, Service, Service type or Operation)
– Name
– A short description
– Links (optional)
– Execution link (optional, if it is a Service available in mORCA)
– Set as workflow source/target buttons (Fig. 1(C))

For reasons of simplicity, and in order to avoid an overwhelming number of outcomes, results are displayed using lazy loading: initially only ten search results are shown, and the list can be enlarged by tapping the More results button (Fig. 1(B)).

Fig. 1. Screenshots from Magallanes in mORCA. (A) The Magallanes main screen with the search-box and the different options. (B) An example of the search results, and the More results button for the lazy loading characteristic. (C) Search results of data types containing aminoacid. The first data type shown is set as the workflow source. (D) Workflow generated (represented in one of the formats that will be analyzed in the discussion). The data types are shown in orange, and the select-boxes contain the available operations from the data type above.


One characteristic added to this version of Magallanes is the direct linking of those services in the search results that are available in mORCA. Thus, the user can access the mORCA services by searching for them, instead of looking them up in mORCA's service list for the current repository. Another functionality available in Magallanes that has also been included in its integration in mORCA is the did you mean module, which suggests alternative search terms if it detects a likely typo, so the user can click on the suggested search instead of correcting the current one [8]. For the workflow visualization, the source and target data types are displayed. Right below the source data type, a select drop-down menu appears, containing all the operations that receive the source data type as input. When one of them is selected, the corresponding output data type appears below, together with another select drop-down menu with the next available operations, until the target data type is reached. An example of this workflow display can be seen in Fig. 1(D).

3 Results

In this section, two use cases of Magallanes in mORCA are demonstrated. Firstly, an example of searching for and invoking a service using the search engine is described. Secondly, a use case of workflow generation is depicted. To avoid redundancy, the term clicking will be used to mean clicking or tapping from here on.

3.1 Searching and Invoking a Service

The first action required from the end user is to (1) go to the Magallanes index page by clicking the button in mORCA's main page. Then, (2) the search options can be set: the user can select a repository to discover resources from, a way of treating the different search terms entered (And/Or/Regular expression), restrict the type of resources to be discovered (service types, operations, services or data types), and set two auxiliary options (case sensitivity and inclusion of links in the results). Once the search is done, the user can (3) navigate through the results obtained. Specifically, just ten results are displayed at first, and the user can enlarge this list by clicking the More results button available at the bottom of the page. Finally, (4) having located the desired service, the user can invoke it right from the results page by clicking the Open resource button, and make use of it as described in the mORCA documentation [14].

3.2 Generating a Workflow

The first steps (1, 2 and 3) are similar to the previous use case. Regarding the search options (step 2), it is advisable to restrict the resource type to data types only, as they are the input and output of the workflow. Once the desired results are found, (4) the source and target data types are selected by clicking the


corresponding button available at the bottom of the result box. When both ends have been selected, a button will appear at the top of the results window. By clicking it, (5) the user gains access to the generated workflow, which can be explored using the select drop-down menus and buttons on the workflow page.

4 Discussion

When it comes to Bioinformatics workflow representation, the main approach consists of different boxes for each operation or data type, interconnected by arrows indicating the order. This is a clear and concise way of depicting such a procedure, but when we switch from the WIMP paradigm to the mobile environment, whose relevance is constantly increasing, other ways have to be explored, as mobile devices have intrinsic limitations and differences with respect to the WIMP environment (e.g., screen size, tapping instead of clicking) [16]. Besides the usual way of workflow representation (Fig. 2(B)), two other ideas have been analyzed. Firstly, and trying to take advantage of mobile device characteristics, a workflow composed by Magallanes can be seen in Fig. 1(D). The idea is to provide the user with a fully responsive workflow that can be explored and edited to

Fig. 2. On the left, (B) a workflow for the identification of differentially expressed genes, generated as an image using Taverna [17]. On the right, (A) the same workflow represented as an interactive graph using web technologies on a mobile device.


some extent (i.e., the workflow input and output do not change, but the different intermediate operations can be selected) and that adapts to screen size changes (i.e., the workflow has the same shape on both mobile devices and regular web browsers). The main drawback inherent to this kind of representation lies in its limitations regarding workflow complexity: the workflows provided by the Magallanes web service are, to some extent, linear, with just one input and one output. This can be a problem when trying to represent workflows from other sources (e.g., Taverna, Galaxy). Secondly, an adaptation of the classic way of representing a workflow (i.e., boxes and arrows) is provided in Fig. 2(A). As can be seen, the boxes have been replaced by smaller bubbles. The idea behind this is to increase its adaptability to smaller screens such as those of mobile devices, as well as to add the possibility of zooming in and out, and of showing/hiding element names when zooming out (as the workflow complexity increases). This interactivity allows the user not only to preview the workflow but also to modify and adapt it before execution. In this way, and because of the dynamic nature of mORCA's interfaces, workflows customized to each user's needs could easily be produced. After considering the two approaches mentioned, the first one was selected for the integration of Magallanes in mORCA, due to the linear nature of the workflows generated by Magallanes, as well as its adaptability to mORCA's current interface. Moreover, this is the alternative in which mobile device characteristics can be better exploited. In-depth acceptance studies are candidates for future work, in order to extend this debate. What cannot be denied is that mobile devices are here to stay, and the scientific community has to reap their benefits.

5 Conclusions

Adapting services to mobile devices is a step that should be taken carefully due to the limitations of these tools in terms of diversity of screen sizes (smartphones, phablets, tablets), performance (significantly lower than a PC's), and the actual change of interface paradigm (from WIMP - Windows, Icons, Mouse, Pointer - to a touch screen). In this sense, and also given the complexity inherent in discovering not only services, but also workflows and data types (among other resources) in a repository system, an intelligent search engine has been implemented and adapted that includes all the functionalities available in the original version, together with the linking of those services available in mORCA to the search results. Thus, the functionalities of mORCA have been increased, including a way of discovering several types of resources, and also providing an easy-to-use mobile solution for workflow generation, an environment that has not been properly explored yet, even though mobile devices are pervasive nowadays. Three different methods of mobile representation have been studied to identify their strengths and limitations, in order to provide an optimal solution to the workflow representation problem. Regarding future work, more research has to be carried out, as stated previously, in order to obtain information about which is the most accepted way


of representing workflows. Furthermore, as the mobile environment is a development area under constant evolution, we must pay attention to technological advancements and improvements in order to take advantage of them. Besides, due to the modular characteristics of mORCA, more functionalities can be added in the future, expanding its capabilities and providing the community with a richer tool.

Acknowledgements. This work has been partially supported by the European project ELIXIR-EXCELERATE (grant no. 676559), the Spanish national projects Plataforma de Recursos Biomoleculares y Bioinformáticos (ISCIII-PT13.0001.0012) and RIRAAF (ISCIII-RD12/0013/0006), and the University of Málaga.

References

1. Teufel, A., Krupp, M., Weinmann, A., Galle, P.R.: Current bioinformatics tools in genomic biomedical research (review). Int. J. Mol. Med. 17, 967–973 (2006)
2. Castro, A.G., Thoraval, S., Garcia, L.J., Ragan, M.A.: Workflows in bioinformatics: meta-analysis and prototype implementation of a workflow generator. BMC Bioinf. 6, 87 (2005)
3. EBI: The European Bioinformatics Institute. https://www.ebi.ac.uk/. Accessed January 2018
4. NCBI: National Center for Biotechnology Information. https://www.ncbi.nlm.nih.gov/. Accessed January 2018
5. INB: The Spanish Institute for Bioinformatics. http://www.inab.org/. Accessed January 2018
6. Gilbert, D.: Bioinformatics software resources. Brief. Bioinf. 5(3), 300–304 (2004)
7. Karlsson, J., Trelles, O.: MAPI: a software framework for distributed biomedical applications. J. Biomed. Semant. 4, 4 (2013)
8. Ros, J., Karlsson, J., Trelles, O.: Magallanes: a web services discovery and automatic workflow composition tool. BMC Bioinf. 10, 334 (2009)
9. Afgan, E., Baker, D., van den Beek, M., Blankenberg, D., Bouvier, D., Čech, M., Chilton, J., Clements, D., Coraor, N., Eberhard, C., Grüning, B., Guerler, A., Hillman-Jackson, J., Von Kuster, G., Rasche, E., Soranzo, N., Turaga, N., Taylor, J., Nekrutenko, A., Goecks, J.: The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Res. 44(W1), W3–W10 (2016)
10. Barseghian, D., Altintas, I., Jones, M.B., Crawl, D., Potter, N., Gallagher, J., Cornillon, P., Schildhauer, M., Borer, E.T., Seabloom, E.W., Hosseini, P.R.: Workflows and extensions to the Kepler scientific workflow system to support environmental sensor data access and analysis. Ecol. Inf. 5, 42–50 (2010)
11. Wolstencroft, K., Haines, R., Fellows, D., Williams, A., Withers, D., Owen, S., Soiland-Reyes, S., Dunlop, I., Nenadic, A., Fisher, P., Bhagat, J., Belhajjame, K., Bacall, F., Hardisty, A., Nieva de la Hidalga, A., Vargas, M.P.B., Sufi, S., Goble, C.: The Taverna workflow suite: designing and executing workflows of web services on the desktop, web or in the cloud. Nucleic Acids Res. 41(W1), W557–W561 (2013)
12. Goble, C.A., Bhagat, J., Aleksejevs, S., Cruickshank, D., Michaelides, D., Newman, D., Borkum, M., Bechhofer, S., Roos, M., Li, P., De Roure, D.: myExperiment: a repository and social network for the sharing of bioinformatics workflows. Nucleic Acids Res. 38, W677–W682 (2010)


13. Brnich, C., Grytten, I., Hovig, E., Paulsen, J., Čech, M., Sandve, G.K.: Galaxy portal: interacting with the Galaxy platform through mobile devices. Bioinformatics 32(11), 1743–1745 (2016)
14. Díaz-del-Pino, S., Falgueras, J., Pérez-Wohlfeil, E., Trelles, O.: mORCA: sailing bioinformatics world with mobile devices. Bioinformatics 34(5), 869–870 (2018). https://doi.org/10.1093/bioinformatics/btx673
15. Christensen, E., Curbera, F., Meredith, G., Weerawarana, S., et al.: Web Services Description Language (WSDL) 1.1 (2001)
16. Cheung, V., Heydekorn, J., Scott, S., Dachselt, R.: Revisiting hovering: interaction guides for interactive surfaces. In: Proceedings of the 2012 ACM International Conference on Interactive Tabletops and Surfaces, ITS 2012, pp. 355–358. ACM, New York (2012)
17. Li, P., Castrillo, J.I., Velarde, G., Wassink, I., Soiland-Reyes, S., Owen, S., Withers, D., Oinn, T., Pocock, M.R., Goble, C.A., Oliver, S.G., Kell, D.B.: Performing statistical analyses on quantitative data in Taverna workflows: an example using R and maxdBrowse to identify differentially expressed genes from microarray data. BMC Bioinf. 9, 334 (2008)

Chloroplast Genomes Exhibit Eight-Cluster Structuredness and Mirror Symmetry

Michael Sadovsky1,2(B), Maria Senashova1, and Andrew Malyshev1

1 Institute of Computational Modelling of SB RAS, Akademgorodok, 660036 Krasnoyarsk, Russia
{msad,msen,amal}@icm.krasn.ru
2 Institute of Fundamental Biology and Biotechnology, Siberian Federal University, Svobodny Prospect, 79, 660049 Krasnoyarsk, Russia
http://icm.krasn.ru

Abstract. Chloroplast genomes exhibit an eight-cluster structuredness in triplet frequency space. The elements to be clustered are small fragments of a genome converted into triplet frequency dictionaries. The typical structure consists of eight clusters: six of them correspond to the three different positions of a reading frame shifted by 0, 1 and 2 nucleotides (in the two opposing strands), the seventh cluster corresponds to the junk regions of a genome, and the eighth cluster comprises the fragments with excessive GC-content bearing specific RNA genes. The structure exhibits a specific symmetry.

Keywords: Order · Probability · Triplet · Symmetry · Projection · K-means

1 Introduction

Previously, a seven-cluster pattern claimed to be universal in bacterial genomes was reported [1,2]. This structure was found to be universal for bacteria, and a very elegant theory explaining the observed patterns was proposed. Keeping in mind the most popular theory of chloroplast origin [3–6], we tried to find out whether a similar pattern is observed in chloroplast genomes. Surprisingly, an eight-cluster structure was found for chloroplasts, not the seven-cluster one, and the patterns differ rather significantly. Evidently, such studies are of great evolutionary value: by comparing various structures found in the DNA sequences of various organisms, one expects to retrieve details of the evolutionary process ranging from races and species to global ecological systems. Here one has to study a three-sided entity: structure, function, and phylogeny. Quite often all three issues are so tightly interwoven that one fails to distinguish the effects and contributions of each issue separately. Here we explore the relation between structure and taxonomy of the bearers of chloroplast


genomes. A number of papers aim to study evolutionary processes on the basis of peculiarities retrieved from genome sequence structures [7,8] or on a comparative study of some peculiar fragments of chloroplast genomes [9–14]. Let us now introduce strict definitions and exact statements. We shall consider symbol sequences over the four-letter alphabet ℵ = {A, C, G, T} of length M; the length here is just the total number of symbols (nucleotides) in a sequence. By supposition, no other symbols or gaps occur in the sequence, at least at the beginning. Any coherent string ω = ν1ν2 . . . νq of length q makes a word. The structure to be retrieved from chloroplast genomes is provided by clustering the fragments of equal length isolated within a genome, so that each fragment is converted into a triplet frequency dictionary with non-overlapping triplets and no gaps in the frame tiling. Thus, we shall keep the consideration within the study of triplet ω3 = ν1ν2ν3 frequency dictionaries only. Further, we shall consider the genomes of chloroplasts retrieved from the EMBL bank. In cases where extra symbols falling outside the alphabet ℵ occur in a sequence, they were eliminated; the elimination procedure is discussed in Subsect. 2.2.

2 Frequency Dictionary and Genome Fragmentation

Indeed, a triplet frequency dictionary could be defined in various ways. The simplest case is provided by the dictionary W(3,1), where the first index shows the length of the words counted in a dictionary, and the second one is the step length (i.e., the number of nucleotides between two sequential positions of the reading frame). The frequency dictionary itself is the list of the words (triplets, in our case) found in a sequence, each entry of the list provided with its frequency. The frequency is defined easily:

$$f_\omega = \frac{n_\omega}{N} \qquad (1)$$

where $n_\omega$ is the number of copies of the specific word $\omega$, and $N$ is the total number of counted words (with respect to copy numbers), $N = \sum_\omega n_\omega$.

For W(3,1), N = M, though this does not hold in the general case. A frequency dictionary Wq of a nucleotide sequence is claimed to be an entity bearing a lot of information on the latter [15–20]. A consistent and comprehensive study of frequency dictionaries answers the questions concerning the statistical and information properties of DNA sequences. In general, one might study a frequency dictionary W(n,m) that comprises the words of length n counted with a step of m nucleotides. For the purposes of our study, we shall consider the frequency dictionaries W(3,3). Such a frequency dictionary is defined ambiguously: there are three different start positions for triplet counting. Strictly speaking, one should study all three dictionaries


of W(3,3) type; moreover, the key issue here is that the three frequency dictionaries W(3,3) differing in start position exhibit a sound difference in their statistical properties when determined for coding and non-coding regions of a genome [1,2]. This difference yields the clustering standing behind the structuredness we are speaking about.

2.1 Genome Fragmentation

For the purposes of the study, we shall not consider all three versions of the frequency dictionary W(3,3) differing in start position; instead, we shall define the so-called phase of a fragment. Let us now describe the procedure for structuredness retrieval in more detail. Consider a genome sequence, stipulated to be a symbol sequence over the four-letter alphabet ℵ. Fix the sliding window length L and the step length R. Cover the genome with a tiling set of windows of the given length moving (for definiteness) left to right along the sequence with step R; if R < L, two windows overlap, otherwise they do not. This is the preliminary transformation of a genome; each identified fragment (of length L) is then converted into the frequency dictionary W(3,3), so that the start position of the reading frame for triplets coincides with the first nucleotide of the fragment. Thus, a genome is transformed into an ensemble of W(3,3) frequency dictionaries, each dictionary labeled with the number of its fragment as determined along the sequence; a minimal sketch of this fragmentation is given at the end of this subsection. Finally, we get an ensemble of points in a 63-dimensional metric space, where each point represents a fragment of the genome. The aim of the work is to reveal the patterns produced by the distribution of those points in the 63-dimensional space; formally, the triplet frequencies yield a 64-dimensional space, but one triplet must be excluded. The linear constraint

$$\sum_{\omega = \mathrm{AAA}}^{\mathrm{TTT}} f_\omega = 1$$

inflicts a rather strong dependence which, in turn, may bring a false signal. Thus, a triplet must be excluded; formally, any triplet may be eliminated. Practically, we excluded the triplet with the lowest standard deviation observed over the entire ensemble of frequency dictionaries.
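The sketch below illustrates the fragmentation and W(3,3) construction just described; the window length L = 603 and step R = 11 are the values used in Sect. 3, while the function names are ours.

```python
# Minimal sketch of the tiling procedure: windows of length L moved with step
# R along the sequence, each converted into a W(3,3) triplet dictionary read
# with non-overlapping triplets from the first nucleotide of the fragment.
from collections import Counter

def w33(fragment):
    """Non-overlapping triplet frequencies of a fragment."""
    triplets = [fragment[i:i + 3] for i in range(0, len(fragment) - 2, 3)]
    counts = Counter(triplets)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

def fragment_genome(seq, L=603, R=11):
    """Yield (start position, W(3,3) dictionary) for every window of the tiling."""
    for start in range(0, len(seq) - L + 1, R):
        yield start, w33(seq[start:start + L])

# Each dictionary is a point in the 63-dimensional space: 64 triplet
# frequencies minus the one excluded to break the linear constraint.
```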

2.2 Fragment Phase Definition

Previously, three versions of the W(3,3) frequency dictionary were mentioned, differing in the position of the reading frame shift. Here we did not derive all three versions of W(3,3); instead, we defined the so-called phase index for each fragment. The phase is defined by the reciprocal position of a fragment against a coding region. Thus, a fragment is labeled as

phase 0 if the start of the fragment perfectly matches the start of a coding region, or the remainder of dividing the distance from the start position of a coding region to the fragment by 3 is equal to 0;


phase 1 if the remainder of dividing the distance from the start position of a coding region to the fragment by 3 is equal to 1;
phase 2 if the remainder of dividing the distance from the start position of a coding region to the fragment by 3 is equal to 2.

If a part of a fragment falls outside a coding region, the fragment is labeled with the junk phase. Here we did not distinguish the exon–intron structure of a gene. Actually, the labeling system includes eight items: the phases F0, F1 and F2 correspond to the labels mentioned above, as determined for the leading strand; the phases B0, B1 and B2 correspond to the labels mentioned above, as determined for the lagging strand. In the latter case, the remainder was determined not from the start position of a coding region, but from its end. Finally, the special phase tail was introduced, to identify a peculiar group of fragments within a genome; a sketch of the phase labeling is given below. Here the problem of extra symbols arises. Indeed, elimination of some extras (if any) may shift the nucleotide position numbers determined along the sequence, and such a shift may affect the remainder calculation when a phase is determined. To avoid such deterioration of the borders between coding and non-coding regions, we retain the original nucleotide numbering; in other words, the elimination affected both the extras and their numbers in the sequence.
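A minimal sketch of these labeling rules is given below, assuming 0-based, half-open coordinates and a single coding region per call; overlaps with several genes, the exon–intron structure, and the tail phase are deliberately omitted.

```python
def phase(frag_start, frag_end, cds_start, cds_end, strand='+'):
    """Label a fragment F0/F1/F2 (leading strand), B0/B1/B2, or 'junk'.

    A fragment falling partly outside the coding region is labeled junk,
    as in the text; coordinates are assumed 0-based and half-open."""
    if frag_start < cds_start or frag_end > cds_end:
        return 'junk'
    if strand == '+':        # remainder of the distance from the CDS start
        return 'F%d' % ((frag_start - cds_start) % 3)
    return 'B%d' % ((cds_end - frag_end) % 3)   # counted from the CDS end
```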

3 Results

We examined 185 chloroplast genomes. Each genome was covered with a tiling set of fragments, each fragment was converted into a W(3,3) frequency dictionary, and the phase of each fragment was determined; the dictionaries were marked with the phase index as well as with the fragment number. We used the ViDaExpert software [21] to visualize and cluster the data. In the great majority of genomes, the triplets GCG and CGC have the lowest standard deviation (and were excluded). These excluded triplets form a remarkable couple: they comprise a so-called complementary palindrome, see Subsect. 3.2 for details. The great majority of the genomes exhibit a similar pattern of fragment distribution. Figure 1 shows a typical distribution pattern of the ensemble of fragments converted into W(3,3) frequency dictionaries. The picture presents the chloroplast genome of the cranberry Vaccinium macrocarpon (AC JQ248601 in the EMBL bank); the total length of the genome is 176,037 bp. The window length is L = 603 nucleotides, and the step is R = 11 nucleotides. The motivation for fixing these parameter values is the following: L is comparable to a gene length, and R provides a sufficiently dense lattice over a sequence. One may certainly choose other parameter values, but a direct check showed that the pattern is insensitive to them over a rather wide range. The points are projected from the 63-dimensional Euclidean space determined by triplet frequencies into the three-dimensional Euclidean space determined by the three main principal components [22]. Subfigure (a) shows the distribution


Fig. 1. Cranberry Vaccinium macrocarpon chloroplast genome fragments distribution: (a) profile view; (b) full face view.

in "profile" projection, where the first principal component falls in the figure plane and is directed from left to right; subfigure (b) shows the same distribution in "full face" projection, where the first principal component is normal to the figure plane. The phases are colored: phases F0 and B0 are colored amaranth and cerise, respectively; phases F1 and B1 are colored lemon and orange, respectively; finally, phases F2 and B2 are colored green and cyan, respectively. The junk phase is colored maroon. Let us now concentrate on the left subfigure of Fig. 1. It looks like a kind of fish with a short tail, and the fragments comprising this part of the distribution are those labeled with the tail phase. The occurrence of this phase distinguishes chloroplast genomes from bacterial ones. The fragments comprising the tail phase are known for their highly increased GC-content: while the genome-wide figure is 0.38, the values for the tail phase fragments tend to exceed the 0.5 level. As can be seen from Fig. 1, the tail phase consists of both junk and coding fragments. The tail phase fragments represent a densely packed cluster of tRNA genes, 16S RNA genes, 23S RNA genes and some other S RNA genes. This cluster has nothing to do with those identified through the mutual distribution of the fragments in the Euclidean space of triplet frequencies. Figure 2 shows the behaviour of GC-content along the genome: the junk phase is shown in brown, while the coding regions are shown in blue; two very distinct peaks (shown in ovals), located in the ranges ∼110,000–115,000 and ∼165,000–170,000, comprise the points forming the tail phase.


Fig. 2. GC-content of cranberry chloroplast genome determined for each fragment.

3.1 Clustering vs. Visualization

Figure 1 shows the distribution of the ensemble of W(3,3) frequency dictionaries; the question arises whether the observed preferences in phase location within the clusters really exist. In other words, one must check whether a similar clustering can be derived with some clustering technique; otherwise, one has to consider the visualized groups of points an artifact. To verify this, we carried out K-means clustering with K = 4 clusters, as sketched below. K-means clustering yields a very stable dispersion of the fragments into four classes. Figure 3 shows the clustering results. First of all, the clustering is very stable: a hundred runs of K-means resulted in the same distribution of points. Next, obviously, K-means with K = 4 is unable to dissociate the points of the junk phase from those belonging to coding regions; excluding the junk points from clustering still retains the stable separation into four classes. We did not aim to study the clustering of the fragments with an unsupervised clustering technique; on the contrary, the idea was to compare the clusters identified by phases: thus, K = 4 seems natural for such a test. The test shows good separation, so the phase-defined clusters are not artifacts. Again, the triplet GCG was excluded for the clustering implementation. Since the stability of the clustering is proven, the beams identified through visualization are not an artifact, but correspond to naturally determined structural units.
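A sketch of such a stability check is given below, assuming a (fragments × 63) frequency matrix and scikit-learn's K-means; full agreement between two runs corresponds to an adjusted Rand index of 1.

```python
# Run K-means (K = 4) many times with different seeds and check that the
# partition of the fragments is always the same up to label permutation.
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def kmeans_is_stable(points, k=4, runs=100):
    """True if `runs` differently seeded K-means runs yield one partition."""
    ref = KMeans(n_clusters=k, n_init=1, random_state=0).fit_predict(points)
    for seed in range(1, runs):
        labels = KMeans(n_clusters=k, n_init=1,
                        random_state=seed).fit_predict(points)
        if adjusted_rand_score(ref, labels) < 1.0:
            return False
    return True
```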


Fig. 3. K-means clustering (K = 4) of cranberry Vaccinium macrocarpon chloroplast genome fragments: (a) profile view; (b) full face view.

3.2 Symmetry in Genome Clustering

Let us now consider Fig. 1 in more detail. Careful examination of subfigure (b) shows the specific behaviour of the phases: the phases {F0, B0} and {F1, B1} occupy two opposing beams of the pattern shown in this figure, while the phases {F2, B2} occupy the same beam. This behaviour is not accidental: we examined 185 chloroplast genomes of ground plants, and all of them exhibit the same phase occupancy. This behaviour differs completely from that observed in bacterial genomes [1,2]. Two different symmetries stand behind the difference: translational (rotational) symmetry is observed for bacterial genomes, while mirror symmetry is observed for chloroplast ones. The phase patterns (triangles) must be projected one over another, and they are rotated in opposite directions. This symmetry has another manifestation in the discrepancy value of the second Chargaff's parity rule determined for the centers of those beams. Let us discuss it in more detail.

Chargaff's symmetry of phase clusters. The first Chargaff's parity rule stipulates a (rather tight) proximity of the fractions of A's and T's, as well as of the fractions of C's and G's, counted over a genome. The second Chargaff's parity rule says that the fractions of the strings comprising a complementary palindrome are also rather close. The latter is a couple of words (of length q) that read equally in opposite directions with respect to the complementarity rule (originally formulated for a double-stranded DNA molecule): G ↔ C and A ↔ T. The point is that the fractions (same as frequencies) are counted over a single strand, with no respect to the second one. A typical example of a couple of triplets making a complementary palindrome are the triplets that were excluded when the clustering was carried out:


GCG ⇔ CGC; another example is the couple GCCGTAGT ⇔ ACTACGGC. Two genetic entities can be compared through a discrepancy calculated over a frequency dictionary (or two of them):

$$\mu_q = \frac{2}{4^q} \sum_{\omega^* \in \Omega} \left( f_{\omega^*} - f_{\omega} \right)^2, \qquad (2)$$

where $\omega^*$ and $\omega$ are the words comprising a complementary palindrome. Thus, the symmetry observed in chloroplast genomes should manifest itself in the values of (2) determined for the various beams of the pattern shown in Fig. 1. We calculated the μ value (2) for all three beams identified in Fig. 1 (see also Fig. 3), excluding the points belonging to the tail phase and junk; these values are

μ1 = 0.001350, μ2 = 0.001224 and μ3 = 0.000290.
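The following sketch computes the discrepancy (2) for triplets (q = 3) over a mean frequency dictionary of a beam; iterating over all 64 triplets visits each of the 32 complementary-palindrome couples twice, which reproduces the factor 2 in the reconstruction of (2) above.

```python
# Sketch under the reconstruction of (2) given above; freqs is an assumed
# dict mapping all 4**q words to their (mean) frequencies in a beam.
COMP = str.maketrans('ACGT', 'TGCA')

def revcomp(word):
    """Reverse complement: pairs each word with its palindrome partner."""
    return word.translate(COMP)[::-1]

def mu(freqs, q=3):
    """Chargaff discrepancy mu_q over complementary-palindrome couples."""
    total = sum((freqs[revcomp(w)] - freqs[w]) ** 2 for w in freqs)
    return total / 4 ** q    # each couple counted twice = the factor 2 in (2)
```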

Obviously, the third beam has a discrepancy an order of magnitude smaller than the other two. These figures were obtained over the 32 couples of triplets of the arithmetic mean of the frequencies of the points of each beam. A similar pattern is observed for the inter-beam discrepancy calculations. Here, unlike in (2), one must sum the squared differences of the frequencies over all 64 couples, since there is no guarantee of the equivalence of the two differences

$$f^{(1)}_{\omega^*} - f^{(2)}_{\omega^*} \quad \text{and} \quad f^{(1)}_{\omega} - f^{(2)}_{\omega},$$

where the superscript indicates the two compared beams. The observed figures are the following:

ρ(beam1, beam2) = 0.011991, ρ(beam1, beam3) = 0.051165, ρ(beam3, beam2) = 0.054165.

Again, beam #3 is isolated from the other two. Direct comparison of the means and the clusters comprised of the various phases unambiguously proves that beam #3 is the cluster consisting of the F2 and B2 phases.

4 Discussion

The labeling system for the formally identified fragments of a sequence may seem rather strict and to provide a kind of bias in favor of the non-coding regions. A rough estimate shows that the number of border fragments (i.e., those that fall in both coding and non-coding regions of a genome) in the ensemble is small enough. Suppose the number of coding regions in a chloroplast genome is 50; then the approximate number of such border fragments is about L × R⁻¹ × 50 ≈ 3000 (indeed, with L = 603 and R = 11, one gets 603/11 × 50 ≈ 2741). This estimate nevertheless shows a rather significant bias towards junk-labeled fragments resulting from the border fragments, which may deteriorate the observed patterns. Further investigation is necessary to answer this question, although the current hypothesis is that the impact of those border fragments is not significant.


In papers [1,2], an approach to revealing structuredness in bacterial genomes based on the comparison of W(3,3) frequency dictionaries of genome fragments is presented; our results show that chloroplasts behave in a different way, always clustering into two coinciding triangles. The vertices of the latter correspond to the phases of a reading frame, comprising the fragments with identical reading frame shift (remainder value). Another important issue is that GC-content does not determine the positioning of the clusters, unlike in bacterial genomes. The mirror symmetry in frequency dictionaries of W(3,3) type is the most intriguing issue of this work: such symmetry has never been observed in bacterial genomes, nor in yeast genomes, nor in the genomes of some other higher organisms. Whether this mirror symmetry is a specific feature of chloroplasts, or is peculiar to any organelle, remains an open question. For chloroplasts, the symmetry has been checked on a number of genomes of plants of various taxa. The idea of revealing similarities between the patterns described above in chloroplast genomes and in some other genetic entities claimed to be relatives of chloroplasts was disproved: we checked several cyanobacterial genomes for the occurrence of the pattern, and nothing similar was found [23]. Careful examination of the databases compiled for each studied genome (see also Fig. 1) shows a relative maintenance of the fractions of fragments labeled Fk and Bk; indeed, the set of genomes can be separated into two subsets: the former with nFk > nBk, and the latter with nFk < nBk. Here nFk (nBk, respectively) is the fraction of fragments labeled with the Fk phase (Bk phase, respectively). We hypothesize that the minimum standard deviation triplet (whether it is GCG or CGC) is determined by the ratio of the nFk and nBk figures. Meanwhile, the most exciting observation towards the symmetry in chloroplast genomes consists in the mirror symmetry of the phase-determined clusters comprising the relevant fragments of a genome. Such symmetry also manifests in another type of symmetric-like relation expressed in terms of Chargaff's parity rule: the phases F2 and B2 always cohere into a single cluster, which is also identified as such by K-means. Verification of this clustering pattern over a number of chloroplast genomes allows us to say that it is the F2 and B2 phases that fall into the same cluster. Moreover, the location of the other phases is determined unambiguously against these two. This fact may provide an extremely fast technique for the primary annotation of a de novo assembled chloroplast genome: slicing a sequence into an ensemble of fragments as described above and clustering them takes seconds, and reveals the fragments which can almost surely be ascribed to a phase (if the hypothesis on the interrelation between the least standard deviation triplet and the ratio of the phase-differing fragments holds true); what is more important, the fragments are labeled with the reading frame shift. In conclusion, we outline a few issues falling beyond the scope of this paper, while expecting urgent research. The first issue is the study of the chloroplasts of other species that tend to show a deviation from the described


pattern (mosses, equisetum, unicellular green algae, etc.). The second issue is a more detailed study of the part of the genomes that comprises the tail phase. Finally, the third issue is the study of the "dark matter" of a genome: the fragments that correspond to non-coding regions. Some preliminary investigations show that these fragments also form various structures, and are sensitive to the taxonomy of the bearers of the genomes. A more detailed discussion of these issues falls beyond the scope of this paper.

Acknowledgement. This study was supported by research grant # 14.Y26.31.0004 from the Government of the Russian Federation.

References

1. Gorban, A.N., Zinovyev, A.Y., Popova, T.G.: Seven clusters in genomic triplet distributions. In Silico Biol. 3(4), 471–482 (2003)
2. Gorban, A.N., Zinovyev, A.Y., Popova, T.G.: Four basic symmetry types in the universal 7-cluster structure of microbial genomic sequences. In Silico Biol. 5(3), 265–282 (2005)
3. Mereschkowsky, K.S.: Theorie der zwei Plasmaarten als Grundlage der Symbiogenesis, einer neuen Lehre von der Entstehung der Organismen. Biol. Cent. 30, 353–367 (1910)
4. Mereschkovsky, K.S.: Über Natur und Ursprung der Chromatophoren im Pflanzenreiche. Biol. Zentr.-Bl. 85(18), 593–604 (1905)
5. Zimorski, V., Ku, C., Martin, W.F., Gould, S.B.: Endosymbiotic theory for organelle origins. Curr. Opin. Microbiol. 22, 38–48 (2014)
6. Raven, J.A., Allen, J.F.: Genomics and chloroplast evolution: what did cyanobacteria do for plants? Genome Biol. 4(3), 209 (2003)
7. Carbonell-Caballero, J., Alonso, R., Ibanez, V., Terol, J., Talon, M., Dopazo, J.: A phylogenetic analysis of 34 chloroplast genomes elucidates the relationships between wild and domestic species within the genus Citrus. Mol. Biol. Evol. 32(8), 2015–2035 (2015)
8. Leliaert, F., Smith, D.R., Moreau, H., Herron, M.D., Verbruggen, H., Delwiche, C.F., De Clerck, O.: Phylogeny and molecular evolution of the green algae. Crit. Rev. Plant Sci. 31, 1–46 (2012)
9. Katayama, H., Ogihara, Y.: Phylogenetic affinities of the grasses to other monocots as revealed by molecular analysis of chloroplast DNA. Curr. Genet. 29, 572–581 (1996)
10. Milanowski, R., Zakrys, B., Kwiatowski, J.: Phylogenetic analysis of chloroplast small subunit rRNA genes of the genus Euglena Ehrenberg. Int. J. Syst. Evol. Microb. 51, 773–781 (2001)
11. Marazzi, B., Endress, P.K., De Queiroz, L.P., Conti, E.: Phylogenetic relationships within Senna (Leguminosae, Cassiinae) based on three chloroplast DNA regions: patterns in the evolution of floral symmetry and extrafloral nectaries. Am. J. Bot. 93(2), 288–303 (2006)
12. Shaw, J., Lickey, E.B., Beck, J.T., Farmer, S.B., Liu, W., Miller, J., et al.: The tortoise and the hare II: relative utility of 21 noncoding chloroplast DNA sequences for phylogenetic analysis. Am. J. Bot. 92(1), 142–166 (2005)


13. Dong, W., Liu, J., Yu, J., Wang, L., Zhou, S.: Highly variable chloroplast markers for evaluating plant phylogeny at low taxonomic levels and for DNA barcoding. PLoS ONE 7(4), 1–9 (2012)
14. Gielly, L., Taberlet, P.: The use of chloroplast DNA to resolve plant phylogenies: noncoding versus rbcL sequences. Mol. Biol. Evol. 11(5), 769–777 (1994)
15. Bugaenko, N.N., Gorban, A.N., Sadovsky, M.G.: Maximum entropy method in analysis of genetic text and measurement of its information content. Open Syst. Inf. Dyn. 5, 265–278 (1998)
16. Gorban, A.N., Popova, T.G., Sadovsky, M.G., Wünsch, D.C.: Information content of the frequency dictionaries, reconstruction, transformation and classification of dictionaries and genetic texts. In: Intelligent Engineering Systems Through Artificial Neural Networks – Smart Engineering System Design, vol. 11, pp. 657–663. ASME Press, New York (2001)
17. Gorban, A.N., Popova, T.G., Sadovsky, M.G.: Classification of symbol sequences over their frequency dictionaries: towards the connection between structure and natural taxonomy. Open Syst. Inf. Dyn. 7, 1–17 (2000)
18. Sadovsky, M.G., Shchepanovsky, A.S., Putintzeva, Y.A.: Genes, information and sense: complexity and knowledge retrieval. Theory Biosci. 127, 69–78 (2008)
19. Sadovsky, M.G.: Comparison of real frequencies of strings vs. the expected ones reveals the information capacity of macromoleculae. J. Biol. Phys. 29, 23–38 (2003)
20. Sadovsky, M.G.: Information capacity of nucleotide sequences and its applications. Bull. Math. Biol. 68, 156–178 (2006)
21. http://bioinfo-out.curie.fr/projects/vidaexpert/
22. Gorban, A.N., Zinovyev, A.Y.: Principal manifolds and graphs in practice: from molecular biology to dynamical systems. Int. J. Neural Syst. 20(3), 219–232 (2010)
23. Sadovsky, M.G., Senashova, M.Y., Malyshev, A.V.: Eight cluster structuredness of genomes of ground plants. Russ. J. Gen. Biol. 79(2), 124–134 (2018)

Are Radiosensitive and Regular Response Cells Homogeneous in Their Correlations Between Copy Number State and Surviving Fraction After Irradiation?

Joanna Tobiasz1(✉), Najla Al-Harbi2, Sara Bin Judia2, Salma Majid3, Ghazi Alsbeih2, and Joanna Polanska1

1 Faculty of Automatic Control, Electronics and Computer Science, Institute of Automatic Control, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
[email protected]
2 Radiation Biology Section, Biomedical Physics Department, King Faisal Specialist Hospital and Research Centre, Riyadh 11211, Kingdom of Saudi Arabia
3 Genetics Department, King Faisal Specialist Hospital and Research Centre, Riyadh 11211, Kingdom of Saudi Arabia

Abstract. Biomarkers of radiosensitivity are currently a widespread research interest due to the demand for a sufficient method of predicting cell response to ionizing radiation. Copy Number State (CNS) alterations may significantly influence individual radiosensitivity. However, their possible impact has not been entirely investigated yet. The purpose of this research was to select markers for which a CNS change is significantly associated with the surviving fraction after irradiation with a 2 Gy dose (SF2), which is a commonly used measure of cellular radiosensitivity. Moreover, a new strategy of combining qualitative and quantitative approaches is proposed, as the identification of potential biomarkers is based not only on the overall SF2 and CNS correlation, but also on its differences between radiosensitive and regular response cell strains. Four patterns of association are considered, and a functional analysis and a Gene Ontology enrichment analysis of the obtained sets of genomic positions are performed. The proposed strategy provides a comprehensive insight into the strength and direction of the association between CNS and cellular radiosensitivity. The obtained results suggest that the commonly used approach of group comparison, based on testing two samples against each other, is not sufficient in terms of radiosensitivity, since this is not a discrete variable and the division into sensitive, normal and resistant individuals is always stipulated.

Keywords: Copy Number State · Copy Number Variations · Radiosensitivity · Genome-Wide Association Study (GWAS)

1 Introduction

Nowadays, the identification and validation of biomarkers of radiosensitivity is a scientific issue of great importance worldwide, due to both the environmental and the medical exposures people receive on a daily basis. Ionizing radiation directly and indirectly causes DNA damages. Having a harmful effect on a particular cell, and in some cases also on the entire organism, these may for instance induce cancers, especially after very high-dose exposure during nuclear disasters. However, the injurious effect of ionizing radiation is also used in radiotherapy, the aim of which is to cause tumor cell death due to multiple DNA damages that are too severe for the cell to repair.

1.1 Radiosensitivity Biomarkers in Radiotherapy Planning

Radiotherapy should be tailored in a way that exterminates as many tumor cells as needed for the cancer recovery with possibly limited radiation-induced normal tissue injury. Healthy and cancer cells differ in terms of sensitivity to the harmful effect of radiation. Nevertheless, since every tumor cell originates directly or indirectly from a normal cell, it may be assumed that tumors derived from tissues with low tolerance to radiation are more prone to experience severe irreversible DNA damage. On the other hand, the surrounding healthy tissue is also more likely to be heavily injured in that case, which may cause a variety of side effects of therapy. In conclusion, radiosensitive individuals are not only more likely to be successfully treated, but also more susceptible to the carcinogenic effects of received radiation.

Hence, the ability to predict therapeutic response is needed for radiotherapy planning in terms of doses and intervals between exposures, in order to reduce the risk of adverse effects and carcinogenesis. Biomarker identification may also be helpful in the development of the application of systemic agents with presumed radiosensitizing activity.

1.2 Current Radiosensitivity Biomarker Studies

The majority of studies concerning biomarkers of response to ionizing radiation focuses on point mutations, amplifications and deletions, or translocations [1, 2]. However, the influence of Copy Number State (CNS) on radiosensitivity has not been deeply defined yet [3].

1.3 Current Copy Number State Analysis Approaches

The majority of studies focuses on detecting differences in the copy number (CN) in comparison to the reference genome and on a functional analysis of the regions affected by those changes. Various methods of identification of significant alterations are applied, including parametric or nonparametric tests and the selection of Copy Number Variations (CNVs) that occur in a certain percentage of all analyzed samples.

2 Aim of Study

The purpose of this study is to identify Copy Number State alterations which may have an impact on the cell's sensitivity to ionizing radiation and thus may have the potential to predict a patient's response to radiation prior to therapy. The aim is to investigate radiosensitivity on the basis of the combination of two types of variables: CNS and the results of the clonogenic assay. This approach leads not only to the selection of biomarkers for which CNS and the cells' ability to survive radiation exposure are significantly correlated, but also to the identification of those markers for which this correlation is distinctly different for cell strains with low and high surviving fractions. Hence, in this study the more common qualitative comparison is combined with a quantitative one to provide a possibly comprehensive insight into the influence of CNS on radiosensitivity.

3 Material

Cell cultures established from skin fibroblasts collected from 135 non-irradiated patients served as the material for this study, which is a typical choice of cell type for the analysis of radiosensitivity [4]. Copy Number State (CNS) was measured with Affymetrix CytoScan HD microarrays, which contain 6,876,796 probes with an average probe spacing equal to 880 base pairs, providing the results for 1,953,246 non-polymorphic (Copy Number Variations) and 743,304 polymorphic (Single Nucleotide Polymorphisms) biomarkers [5]. The microarray experiment was conducted only for non-irradiated cells.

The radiosensitivity of every cell strain was characterized with the use of a clonogenic assay, with the parameter of surviving fraction at 2 Gy (SF2) serving as the measure of cellular radiosensitivity, which is a gold standard for the analysis of radioresponse. Even though clonogenic survival as a high-level global assay is not sufficient for therapy tailoring, it may be used as a predictor of various radiotherapy-related late tissue effects, including bone necrosis, skin fibrosis, erythema and telangiectasia [4, 6–9].

The parameter of surviving fraction at 2 Gy (SF2) was used to divide cell strains into two groups: radiosensitive (RS) and regular response (RR), with the SF2 threshold value defined based on biological knowledge and equal to 0.325. As a result, 52 radiosensitive and 83 regular response cell strains were obtained and used for the comparison.

4 Methods

4.1 Preprocessing and Data Preparation

The preprocessing phase was conducted with the use of Chromosome Analysis Suite 3.1 (ChAS 3.1), the software provided by the Affymetrix company and dedicated to the analysis of the results of this microarray type. The provided workflow includes, for instance, probe set summarization, normalization, scaling and variation removal [5]. The results obtained with Chromosome Analysis Suite 3.1 consist of Fold Change (FC) values (Eq. 1) for each biomarker (m) and each microarray, where every microarray corresponds to one of the 135 examined cell strains.

FC = \log_2(\mathrm{sample}_m) - \log_2(\mathrm{reference}_m) = \log_2 \frac{\mathrm{sample}_m}{\mathrm{reference}_m} \quad (1)


FC is thus equal to 0 for a biomarker whose copy number is the same for the sample cell strain and for the reference genome provided by the Affymetrix company. Positive values of FC mean that a gain of DNA occurred at the particular DNA location in the examined sample, while negative FC values indicate a loss of genomic DNA for the considered marker.

Moreover, quality control of all microarray experiment measurement files was also conducted with the use of Chromosome Analysis Suite 3.1, providing satisfactory results for all 135 cell strains. However, for further steps of the analysis, measurements for markers located on the allosomes were excluded, in order to ensure that any potential differences in the copy number between the radiosensitive and regular response groups do not result from an imbalance of patients' gender between both groups.

4.2 SF2 and Copy Number State Correlation

Pearson's correlation coefficient between the SF2 value and the FC value was computed for every autosomal marker over all cell strains (Eq. 2) [10].

r = \frac{\sum_{i=1}^{n=135} (SF2_i - \overline{SF2})(FC_i - \overline{FC})}{\sqrt{\sum_{i=1}^{n=135} (SF2_i - \overline{SF2})^2 \cdot \sum_{i=1}^{n=135} (FC_i - \overline{FC})^2}} \quad (2)
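As an illustration only (not part of the original analysis), the two quantities of Eqs. (1) and (2) can be computed directly; the array names below are placeholders for the per-marker intensities exported from ChAS 3.1 and the per-strain SF2 values.

```python
import numpy as np

def fold_change(sample: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Eq. (1): FC = log2(sample) - log2(reference), per marker."""
    return np.log2(sample) - np.log2(reference)

def pearson_r(sf2: np.ndarray, fc: np.ndarray) -> float:
    """Eq. (2): Pearson correlation of SF2 and FC over the 135 strains."""
    sf2_c = sf2 - sf2.mean()
    fc_c = fc - fc.mean()
    return float((sf2_c * fc_c).sum()
                 / np.sqrt((sf2_c ** 2).sum() * (fc_c ** 2).sum()))
```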

The set of markers for which the correlation between SF2 and FC is significant was selected with the use of Cohen’s q effect size (Eq. 3), which is defined as the difference between two Fisher transformed Pearson’s correlation coefficients [10].

q = \left| \tanh^{-1} r_1 - \tanh^{-1} r_2 \right| = \left| z_1 - z_2 \right| \quad (3)

However, in this study a one-sample case is considered. Hence, there is only one overall Pearson's correlation coefficient for each marker, which is compared to r equal to 0, corresponding to the situation where no correlation is observed. Thus, the sample difference includes only one sampling error variance instead of two, and a correction of Cohen's q for the one-sample case must be made (Eq. 4) [10].

q = \sqrt{2} \left| \tanh^{-1} r \right| \quad (4)

Threshold values of Pearson's correlation coefficients (presented in Table 1) were computed on the basis of the corrected Cohen's q equation and were used for the selection of markers with at least medium and at least large effect sizes.

Table 1. Threshold values for at least small, at least medium and at least large effect size [10]

Effect size    z      q       r
Small          0.10   0.1414  0.1405
Medium         0.30   0.4243  0.4005
Large          0.50   0.7071  0.6088
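A hedged sketch of how Eq. (4) and Table 1 translate into code; the effect-size classification below uses the |r| thresholds exactly as printed in the table, and the function names are ours, not from the paper.

```python
import numpy as np

def cohens_q_one_sample(r: float) -> float:
    """One-sample Cohen's q of Eq. (4): q = sqrt(2) * |arctanh(r)|."""
    return float(np.sqrt(2.0) * abs(np.arctanh(r)))

def effect_size(r: float) -> str:
    """Effect-size class using the |r| thresholds printed in Table 1."""
    a = abs(r)
    if a > 0.6088:
        return "large"
    if a > 0.4005:
        return "medium"
    if a > 0.1405:
        return "small"
    return "negligible"

# Example: the extreme coefficients reported in Sect. 5.1
for r in (-0.4326, 0.4718):
    print(r, round(cohens_q_one_sample(r), 4), effect_size(r))
```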


Markers for which the absolute value of the Pearson's correlation coefficient is higher than 0.4005 thus have at least a medium effect, while for those with an absolute Pearson's correlation value higher than 0.6088 the effect size is large. Hence, features having at least a medium effect were selected as those for which a significant correlation between Copy Number State and SF2 was detected.

4.3 Comparison of Correlations for Radiosensitive and Regular Response Cell Strains

Pearson's correlation coefficient between the SF2 value and the FC value was computed for every autosomal marker for each of the two groups of cell strains separately. Every pair of coefficient values was tested for homogeneity (Eq. 5) at the significance level of 0.05 [10].

t_s = \frac{z_1 - z_2}{\sqrt{(n_1 - 3)^{-1} + (n_2 - 3)^{-1}}} \quad (5)

The set of markers with heterogeneous Pearson's correlation coefficients was created. All markers with opposite signs of the coefficients were rejected to avoid the situation where there is a positive correlation for one group and a negative correlation for the other, since the groups are separated only on the basis of the predefined threshold SF2 value. SF2 is a continuous variable, so the existence of a minimum or maximum would be biologically unjustifiable. Moreover, markers for which one of the coefficients was not significantly lower or higher than 0 (the hypothesis depending on whether the second coefficient is positive or negative) were also removed, based on the test of significance after Fisher's z transformation (Eq. 6) [10].

t_s = z\sqrt{n - 3} = \tanh^{-1} r \cdot \sqrt{n - 3} \quad (6)
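For concreteness, the two tests of Eqs. (5) and (6) can be sketched as follows (a Python illustration, not the authors' code), with the group sizes of this study, n1 = 52 radiosensitive and n2 = 83 regular response strains. Two-sided p-values are returned; the one-sided variants used for the category assignment halve them.

```python
import numpy as np
from scipy.stats import norm

N_RS, N_RR = 52, 83  # group sizes from Sect. 3

def homogeneity_test(r_rs: float, r_rr: float, n1: int = N_RS, n2: int = N_RR):
    """Eq. (5): test whether two Pearson coefficients are homogeneous."""
    z1, z2 = np.arctanh(r_rs), np.arctanh(r_rr)
    ts = (z1 - z2) / np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return ts, 2.0 * norm.sf(abs(ts))  # two-sided p-value

def significance_test(r: float, n: int):
    """Eq. (6): test a single coefficient against zero via Fisher's z."""
    ts = np.arctanh(r) * np.sqrt(n - 3)
    return ts, 2.0 * norm.sf(abs(ts))  # two-sided p-value
```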

The remaining markers were divided into four categories based on the Pearson's correlation coefficients, as presented in Fig. 1. The categories correspond to the four possible patterns of the correlation between SF2 and FC (change in the copy number), assuming that the relationship between surviving fraction and CNS significantly differs between radiosensitive and regular response cell strains, while the variances are homogeneous in both groups. For the patterns of RS- and RR-dominant increase, the smaller the copy number in comparison to the reference genome, the bigger the radiosensitivity. For markers in the first of those categories, the increase of the copy number as SF2 grows is more radical for the regular response group, contrary to the markers in the second category. For the remaining two patterns, the copy number declines as SF2 increases, more sharply for radiosensitive cells in the RS-dominant category and for the regular response group in the RR-dominant one.


Fig. 1. Scheme of the separation of markers into four groups based on their correlation pattern

4.4 Functional Analysis

The obtained sets of selected markers were assigned to Ensembl Gene IDs with the biomaRt R package [11, 12]. The GOstats R package was used for an enrichment analysis of Gene Ontology terms, with an algorithm based on the hypergeometric test and the hierarchical structure of the GO database [13–15]. The resulting p-values were corrected for multiple testing with the Benjamini-Hochberg method [16].
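The enrichment itself was performed with the GOstats R package; purely as an illustration of the underlying hypergeometric test for a single GO term, a Python equivalent could look as follows (all counts are hypothetical placeholders).

```python
from scipy.stats import hypergeom

def go_term_pvalue(n_universe: int, n_annotated: int,
                   n_selected: int, n_hits: int) -> float:
    """P(X >= n_hits) when n_selected genes are drawn from a universe of
    n_universe genes, n_annotated of which carry the GO term."""
    return float(hypergeom.sf(n_hits - 1, n_universe, n_annotated, n_selected))
```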

5 Results

5.1 Pearson's Correlation Coefficient and Cohen's q

Pearson's correlation coefficient between SF2 and FC was computed for all 135 cell strains. Among the results for all autosomal markers (2,491,915 features), the minimal coefficient was equal to −0.4326, while the maximal obtained value was 0.4718. According to the results of the test of significance of Pearson's correlation coefficient, followed by the Benjamini-Hochberg procedure for multiple testing correction, correlation was proved for 65 markers at the significance level of 0.05. These features were included in the set of 71 markers for which the effect size was at least medium. However, none of the markers was proved to have an at least large effect based on the Cohen's q measure.


5.2 Qualitative Comparison of Pearson's Correlation Coefficients

According to the test for homogeneity of correlation coefficients, the p-values for 103,860 markers (4.17% of all autosomal markers) do not exceed the significance level of 0.05. The signs of the Pearson's correlation coefficients were opposite for 92,549 of these features; for the remaining 11,311 markers, the correlation is either positive or negative for both radiosensitive and regular response cells, yet the coefficients differ distinctly between those two groups. This amounts to 0.45% of all autosomal markers and 10.89% of all features with p-values from the test for homogeneity lower than the significance level.

The division into categories representing the four possible correlation patterns was based not only on the signs of the two Pearson's correlation coefficients, but also on a one-sided significance test for the cell group for which the coefficient was closer to 0. The analysis showed that the categories of RR-dominant increase and decrease, which assume a stronger correlation between SF2 and FC for the regular response cell group, remain empty. However, the RS-dominant increase and decrease patterns, which represent a stronger relationship between the variables in radiosensitive cell strains, consist of 39 and 31 markers, respectively. Scatterplots of SF2 and FC values for exemplary markers included in the RS-dominant increase and decrease categories are presented in Figs. 2 and 3, with Pearson's correlation coefficients (overall and for both groups separately), the p-value from the test for homogeneity, and the p-value from the test of significance of a coefficient after Benjamini-Hochberg correction for multiple testing. Moreover, linear regression lines for both groups separately and confidence intervals for prediction are also marked in Figs. 2 and 3.

Fig. 2. Scatterplot of SF2 and FC values for marker C-3QZLF, which is included in the "RS-dominant increase" category.


Fig. 3. Scatterplot of SF2 and FC values for marker S-3CJEA, which is included in the "RS-dominant decrease" category.

Table 2 presents the overall results of the segregation of markers for which the correlation in both groups was either positive or negative, but the Pearson's correlation coefficients were heterogeneous. The statement that there is no correlation means that, for a particular marker, the p-value from the one-sided test of significance after Fisher's z transformation was not lower than the significance level of 0.05.

Table 2. Overall summarization of Pearson's correlation coefficients-based segregation of markers with heterogeneous, both positive or both negative, correlation

Number of markers with:  ρRS < 0               No correlation  ρRS > 0               TOTAL
                         ρRR > ρRS  ρRR < ρRS  in RS           ρRR > ρRS  ρRR < ρRS
ρRR < 0                  31         0          210             -          -          241
No correlation in RR     4,818                 0               6,100                 10,918
ρRR > 0                  -          -          113             0          39         152
TOTAL                    4,849                 323             6,139                 11,311

The locations of the markers included in the RS-dominant categories across the genome are presented in Fig. 4. The distribution of the selected markers differs between the two presented correlation patterns, especially for chromosome 10. Markers located on this autosome are numerously represented in the "RS-dominant Increase" category and absent from the "RS-dominant Decrease" group.


Fig. 4. Genomic locations of markers included in the "RS-dominant Increase" (outer track) and "RS-dominant Decrease" (inner track) categories

5.3 Functional Analysis

The total of 11,311 markers with heterogeneous Pearson's correlation coefficients, for which the association is either positive or negative for both radiosensitive and regular response cells, is overlapped by 2,152 genes. The enrichment analysis based on the Gene Ontology database showed that these genes are associated with various GO terms linked to neurological dysfunctions, including "nervous system development" (GO:0007399), "trans-synaptic signaling" (GO:0099537), "postsynaptic membrane" (GO:0045211), "axon" (GO:0030424), "dendritic spine" (GO:0043197) and "transmitter-gated channel activity" (GO:0022835).

Among all features included in the RS-dominant increase or decrease categories, there were respectively 5 and 11 markers overlapped by at least one gene. The complete list of genes within which those features are located is presented in Table 3. Moreover, among the GO enrichment results for the RS-dominant increase genes, terms connected with the function of the synapse are numerously represented.


Table 3. List of genes overlapping the genomic positions of markers with at least medium Cohen's q effect size and markers included in RS-dominant categories

At least medium Cohen's q: TTLL5, D2HGDH, SHISA6, ARHGAP15, TRIO, CCSER1, RASSF3, DCC, ASH2L, CTC-459F4.3, LLNLF-65H9.1, TNRC6C, UTRN, FMO10P, SNTG1, CSGALNACT1
RS-dominant increase: NKAIN3, NLGN1, DNMBP-AS1, ADCY1, CTC-575N7.1
RS-dominant decrease: MCCC2, CHST11, NDUFV2, RP11-21J18.1, RABGAP1L, SEMA6D, NTM, EGFR, MAP2K5, OPRM1, TYR

6 Discussion

The Pearson's correlation coefficients computed for all 135 cell strains indicate that for some markers there is a significant correlation between the surviving fraction and FC. These results suggest that the relationship between radiosensitivity and CNS alterations exists even for non-irradiated cell strains, and probably it is induced only by the background radiation. Nevertheless, this dependency may not be linear, which may explain the small absolute values of the overall Pearson's correlation coefficient, which is a measure of linearity.

The performed division into categories based on the Pearson's correlation coefficients calculated for each cell strain group separately shows that both positive and negative associations between SF2 and FC tend to be more radical in radiosensitive cells than in regular response cells for the majority of the selected markers. Moreover, the high number of features (10,918 markers) with a non-significant correlation in the regular response group suggests that the linearity increases when SF2 is small, while for cells with a high surviving fraction the strength of the linear association is reduced (Table 2).

The obtained results of the functional analysis may suggest a relationship between individual radiosensitivity and neurological dysfunctions. This association is reflected in several studies [17, 18]. The gene EGFR seems to be the most interesting outcome among the results of the functional analysis of the markers included in the RS-dominant categories. Inhibition of EGFR is reported to increase radiosensitivity by suppression of the repair of radiation-induced DNA double-strand breaks [19, 20], and amplification of EGFR was shown to increase radioresistance by Ras activation, which is also associated with the MAPK pathway [21]. Moreover, deletions on chromosome 10 have been reported to increase radiosensitivity through PTEN alteration in head and neck cancer. Expression of this gene was shown to be correlated with EGFR activation, while EGFR expression is associated with DNA amplification [22].

7 Conclusions

In the performed study, an approach based on the combination of qualitative and quantitative methods was proposed as a strategy to investigate biomarkers of radiosensitivity. The obtained results indicate that a correlation-based study allows achieving a more comprehensive insight into the possible influence of the Copy Number State on processes involved in the response to radiation, even for non-irradiated cell strains. The proposed approach seems to be very promising, given that the border between the sensitive and regular reactions is based on a predefined SF2 threshold value and hence is theoretical and conventional.

Acknowledgement. This work was supported by SUT Grant Number BKM/508/RAU1/2017/25 (JT), SUT Grant Number BK–204/RAU1/2017/9 (JP) and NSTIP-KACST 11-BIO1429-20 (RAC# 2120 003) (NAL, SBJ, SM, GA).

References

1. Ree, A.H., Redalen, K.R.: Personalized radiotherapy: concepts, biomarkers and trial design. Br. J. Radiol. 88(1051), 20150009 (2015). https://doi.org/10.1259/bjr.20150009
2. Alymani, N.A., Smith, M.D., Williams, D.J., Petty, R.D.: Predictive biomarkers for personalised anti-cancer drug use: discovery to clinical implementation. Eur. J. Cancer 46, 869–879 (2010). https://doi.org/10.1016/j.ejca.2010.01.001
3. Yard, B.D., et al.: A genetic basis for the variation in the vulnerability of cancer to DNA damage. Nat. Commun. 7, 11428 (2016). https://doi.org/10.1038/ncomms11428
4. Story, M., Ding, L.H., Brock, W.A., Ang, K.K., Alsbeih, G., Minna, J., Park, S., Das, A.: Defining molecular and cellular responses after low and high linear energy transfer radiations to develop biomarkers of carcinogenic risk or therapeutic outcome. Health Phys. 103(5), 596–606 (2012). https://doi.org/10.1097/HP.0b013e3182692085
5. Thermo Fisher Scientific Inc.: Chromosome Analysis Suite 3.1 (ChAS 3.1) User Guide. http://tools.thermofisher.com/content/sfs/manuals/chas3_1_userguide.pdf
6. Tucker, S.L., Turesson, I., Thames, H.D.: Evidence for individual differences in the radiosensitivity of human skin. Eur. J. Cancer 11, 1783–1791 (1992). https://doi.org/10.1016/0959-8049(92)90004-L
7. Geara, F.B., Peters, L.J., Ang, K.K., Wike, J.L., Brock, W.A.: Prospective comparison of in vitro normal cell radiosensitivity and normal tissue reactions in radiotherapy patients. Int. J. Radiat. Oncol. Biol. Phys. 27, 1173–1179 (1993). https://doi.org/10.1016/0360-3016(93)90540-C
8. Johansen, J., Bentzen, S.M., Overgaard, J., Overgaard, M.: Evidence for a positive correlation between in vitro radiosensitivity of normal human skin fibroblasts and the occurrence of subcutaneous fibrosis after radiotherapy. Int. J. Radiat. Biol. 66, 407–412 (1994). https://doi.org/10.1080/09553009414551361


9. Johansen, J., Bentzen, S.M., Overgaard, J., Overgaard, M.: Relationship between the in vitro radiosensitivity of skin fibroblasts and the expression of subcutaneous fibrosis, telangiectasia, and skin erythema after radiotherapy. Radiother. Oncol. 40, 101–109 (1996). https://doi.org/10.1016/0167-8140(96)01777-X
10. Cohen, J.: Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, Hillsdale (2013)
11. Durinck, S., Spellman, P.T., Birney, E., Huber, W.: Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat. Protoc. 4(8), 1184–1191 (2009). https://doi.org/10.1038/nprot.2009.97
12. Durinck, S., Moreau, Y., Kasprzyk, A., Davis, S., De Moor, B., Brazma, A., Huber, W.: BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics 21(16), 3439–3440 (2005). https://doi.org/10.1093/bioinformatics/bti525
13. Falcon, S., Gentleman, R.: Using GOstats to test gene lists for GO term association. Bioinformatics 23(2), 257–258 (2007). https://doi.org/10.1093/bioinformatics/btl567
14. Ashburner, M., et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25(1), 25–29 (2000). https://doi.org/10.1038/75556
15. The Gene Ontology Consortium: Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Res. 45(D1), D331–D338 (2017). https://doi.org/10.1093/nar/gkw1108
16. Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate - a practical and powerful approach to multiple testing. J. Royal Stat. Soc. Ser. B (Methodol.) 57, 289–300 (1995). https://doi.org/10.2307/2346101
17. Zakhvataev, V.E.: Possible scenarios of the influence of low-dose ionizing radiation on neural functioning. Med. Hypotheses 85(6), 723–735 (2015). https://doi.org/10.1016/j.mehy.2015.10.020
18. Katsura, M., et al.: Effects of chronic low-dose radiation on human neural progenitor cells. Sci. Rep. 6, 20027 (2016). https://doi.org/10.1038/srep20027
19. Tanaka, T., Munshi, A., Brooks, C., Liu, J., Hobbs, M.L., Meyn, R.E.: Gefitinib radiosensitizes non-small cell lung cancer cells by suppressing cellular DNA repair capacity. Clin. Cancer Res. 14(4), 1266–1273 (2008). https://doi.org/10.1158/1078-0432.CCR-07-1606
20. Lee, H.J., et al.: Tyrosine 370 phosphorylation of ATM positively regulates DNA damage response. Cell Res. 25(2), 225–236 (2015). https://doi.org/10.1038/cr.2015.8
21. Cengel, K.A., McKenna, W.G.: Molecular targets for altering radiosensitivity: lessons from Ras as a pre-clinical and clinical model. Crit. Rev. Oncol. Hematol. 55(2), 103–116 (2005). https://doi.org/10.1016/j.critrevonc.2005.02.001
22. Pattje, W.J., Schuuring, E., Mastik, M.F., Slagter-Menkema, L., Schrijvers, M.L., Alessi, S., van der Laan, B.F., Roodenburg, J.L., Langendijk, J.A., van der Wal, J.E.: The phosphatase and tensin homologue deleted on chromosome 10 mediates radiosensitivity in head and neck cancer. Br. J. Cancer 102(12), 1778–1785 (2010). https://doi.org/10.1038/sj.bjc.6605707

Computational Proteomics

Protein Tertiary Structure Prediction via SVD and PSO Sampling

Óscar Álvarez1, Juan Luis Fernández-Martínez1(✉), Ana Cernea1, Zulima Fernández-Muñiz1, and Andrzej Kloczkowski2

1 Group of Inverse Problems, Optimization and Machine Learning, Department of Mathematics, University of Oviedo, C/ Federico García Lorca, 18, 33007 Oviedo, Spain
{UO217123,jlfm,cerneadoina,zulima}@uniovi.es
2 Battelle Center for Mathematical Medicine, Nationwide Children's Hospital, Department of Pediatrics, The Ohio State University, Columbus, OH, USA
[email protected]

Abstract. We discuss the use of the Singular Value Decomposition as a model reduction technique in protein tertiary structure prediction, alongside the uncertainty analysis associated with the tertiary protein predictions via Particle Swarm Optimization (PSO). The algorithm presented in this paper corresponds to the category of decoy-based modelling, since it first finds a good protein model located in the low-energy region of the protein energy landscape, which is used to establish a three-dimensional space where the free-energy optimization and search is performed via an exploratory version of PSO. The ultimate goal of this algorithm is to obtain a representative sample of the protein backbone structure and the alternate states in an energy region equivalent to or lower than the one corresponding to the protein model that is used to establish the expansion (model reduction), obtaining as a result other protein structures that are closer to the native structure and a measure of the uncertainty in the tertiary protein reconstruction. The strength of this methodology is that it is simple and fast, and serves to alleviate the ill-posed character of the protein structure prediction problem, which is very high-dimensional, improving the results when it is performed on a good protein model of the low-energy region. To prove this fact numerically, we present the results of the application of the SVD-PSO algorithm to a set of proteins of the CASP competition whose native structures are known.

Keywords: Particle Swarm Optimization · Protein refinement · Singular Value Decomposition · Model reduction · Protein tertiary structure prediction

1 Introduction

In computational biology, there is a wide range of problems that can be formulated as a sampling problem over a search space with multiple dimensions. Protein tertiary structure prediction and refinement is solved as the optimization (minimization) of the energy function of the protein. The protein structure prediction problem is considered one of the foremost challenges in computational biology [1].


Proteins are biopolymers that are composed of a set of peptide-bonded amino acids. The fact that many spatial conformations of proteins are possible, due to the rotation of the chain at each C_α atom, implies that a wide range of structural differences exists. These conformational differences are crucial to fully understand protein interactions, functions and evolution. Large efforts are made in protein structure prediction, since the experimental methods used to study protein structure are very costly. The computational prediction of protein structures implies the understanding of the mechanisms involved in protein structure and folding, in order to construct good physical models of the protein energy function that accurately mimic reality, and also the development of mathematical approaches to handle this problem [1, 2]. These methods are based on the optimization of the protein energy function, which depends on the protein atoms' coordinates. The forward model is crucial, because if the energy function is capable of fully describing the energetics of the protein folds, the minimum energy will correspond to the native structure, but other plausible configurations might also coexist. The fact that these algorithms are not able to sample the entire protein conformational search space implies that some modelling simplifications are needed. The use of Principal Component Analysis, performed on a set of templates to reduce the dimension, for protein tertiary structure prediction via Particle Swarm Optimization has been presented in [3], showing that the accuracy of the structure prediction depends on how the reduced PCA basis set is constructed. In particular, the quality of the a priori templates, the number of PCA terms, and also the introduction of a high-frequency term able to span high-frequency details of the protein structure play key roles in the algorithm's performance.

In this paper, we assume an accurate energy function, focusing on the method developed to sample the conformational space via the SVD-PSO algorithm. The model reduction is different from PCA, since only one good template is needed to achieve the model reduction. Also, independently of the number of atoms in the protein, the sampling is performed in a three-dimensional space. This drastic dimensionality reduction serves to sample other templates closer to the native structure and whose tertiary structure is compatible with the three SVD basis terms.

2 Protein Tertiary Structure Modelling

Our aim is to model protein tertiary structures using SVD as the model reduction technique and PSO as the global optimizer and sampler. Therefore, the algorithm presented belongs to the category of template-based modelling [4]. Proteins are modelled by their free-energy function, E(m): ℝ^n → ℝ, by finding the protein model that achieves the minimum energy value, m_p: E(m_p) = min E(m). In this case, the model parameters (m) are the protein coordinates, and their dimension is three times the number of atoms of the protein. Therefore, the prediction of the best protein structure involves the optimization of the energy function in a high-dimensional space with an intricate energy function landscape [5]. These two issues have to be carefully considered, as they may cause the failure of the optimization problem if the algorithm gets trapped in a flat valley corresponding to a local minimum which could be located far from the native backbone structure. If we assume that m_p is the global optimum, it satisfies the condition ∇E(m_p) = 0. Consequently, there is a set of models M_TOL = {m : E(m) < E_TOL}, whose energy is lower than a specific cut-off value E_TOL. This set, in the neighbourhood of m_p, can be approximated by the linear hyper-quadric [5, 6]:

(1)

( ) where HE 𝐦p is the Hessian matrix evaluated at 𝐦p. To avoid, the global optimization method to be trapped in flat, curvilinear elongated and intricate valleys, we require high explorative global optimization methods to explore the non-linear equivalence region MTOL. Algorithms such as the binary genetic and Particle Swarm Optimization (PSO) are capable of performing this task [7]. In this paper, we use an explorative member of the PSO family, denoted as RR-PSO to sample the free-energy function in a reduced space. The main difference with respect to others heuristic approaches is that RR-PSO parameters are tuned based in stochastic stability analysis results [15].

3

Protein Tertiary Structure Refinement Algorithm

3.1 The SVD-PSO Algorithm Protein prediction, as other real problems from science, has a large number of parame‐ ters. As pointed, the relatively high number of atoms and its associated coordinates determine the value of the free-energy function. This feature, alongside the accuracy required to make good predictions, make these problems highly undetermined and illposed. Consequently, good “a priori” information is required to make good predictions using global optimization methods. The high numbers of atoms precludes the use of highly explorative optimization algorithms (RR-PSO). In this paper, we show how to construct a reduced search space utilising SVD. Constructing a reduced search space via SVD helps us regularizing the inverse problem and finds the atom coordinates that minimize the protein free-energy function [8]. The utilization of SVD allows the optimization of the free-energy function to be performed in a very low dimensional search space and can be written as follows: finding

( ) ( ) ̂ 𝐤 = E 𝛍 + 𝐕𝐝 𝐚𝐤 ≤ ETOL, 𝐚k ∈ ℝd : E 𝐦

(1)

where 𝛍 is the mean protein (it could be null) and 𝐕𝐝 contains as columns the basis set of vectors provided by the SVD. Focusing on the SVD ( model) reduction, the idea consists in writing the protein in a ̂ 𝐤 ∈ M 3, natoms , storing in each column the [x, y, z] coordinates of each matrix format 𝐦 atom of the protein structure. Then, it can be factorized, as follows via the SVD: ̂ 𝐤 = 𝐔 𝚺 𝐕𝐓 = 𝐦

∑3 k=1

𝛼k 𝐮𝐤 𝐯𝐓𝐤

(2)

214

Ó. Álvarez et al.

where 𝐔, 𝐕 are orthogonal matrices whose column vectors are respectively 𝐮𝐤 , 𝐯𝐓𝐤 , and ( ) ̂ 𝐤, that has 3 non-null singular values α1 , α2 , α3 . The previous expres‐ Σ is the SVD of 𝐦 sion is known as the spectral decomposition of a matrix and, in this case, it implies that the protein tertiary structure prediction problem can be performed over the reduced basis ̂𝐤 𝐮𝐤 𝐯𝐓𝐤 without any loss of energy (information). In this reduced basis set the protein 𝐦 has only these 3 coordinates. Once the reduced base is defined, any other protein decoy will ( be spanned ) as a unique ∑ ̂ 𝐧𝐞𝐰 = 3k=1 βk 𝐮𝐤 𝐯𝐓𝐤, and the coordinates β1 , β2 , β3 are found via linear combination as 𝐦 PSO optimization. The SVD allows a drastic dimensionality reduction { } from 3natoms to 3 dimensions provided by the spectral basis set 𝐮𝟏 𝐯𝐓𝟏 , 𝐮𝟐 𝐯𝐓𝟐 , 𝐮𝟑 𝐯𝐓𝟑 . Then, the PSO sampling (while optimzing) is performed efficiently in a reduced search space as the protein atoms coordinates are not sampled independently. Consequently, the ill-deter‐ mination of the problem is reduced [9]. This procedure works fairly well due to the deterministic nature of the protein energy function landscape, and should be considered as a protein refinement method, with the advantage that the PSO sampling allows to assess the uncertainty of the protein structure reconstruction in the SVD basis set. The aim of this paper is not demonstrating superiority with respect to existing methods, but to provide a new algorithm for tertiary protein prediction refinement. 3.2 Minimization of the Free-Energy Function Most of the advances in reducing computational costs and efficiency are based on aminoacid sequences homology [3, 10–13] However, other algorithms are capable of storing the ongoing protein structure information during the sampling [14]. In this sense, PSO has been confirmed as a major improvement on sampling a specific protein backbone structure and evaluating its alternate states by Fernández-Martínez et al. [3]. Hence, we perform the minimization of the energy function for each through an explorative member of the family of Particle Swarm Optimizers (RR-PSO) [15]. RRPSO is a stochastic and evolutionary optimization algorithm, which was motivated by individual’s (particle) social behaviour [16]. The task ( )consists of sampling an appro‐ ̂ 𝐤 ≤ ETOL,. The sampled model priate protein model that satisfies the condition, E 𝐦 must be reconstructed in the original atom space in order to evaluate the energy, atom coordinates and forces. These forward calculations are performed through the Bioshell package developed by Gont et al. [17–19]. The PSO algorithm starts by defining a prismatic space of admissible protein models: lj ≤ aji ≤ uj , 1 ≤ j ≤ n, 1 ≤ i ≤ nsize

where lj , uj are the lower and upper limits for the j-th coordinate for each model and nsize is the size of the swarm. In this particular case, the sampling is performed in the threedimensional SVD reduced base. In the algorithm, each particle (model) has its own position in the reduced search space. The particle velocity corresponds to the applied atom coordinates perturbations required in order the particle to explore the search space.

Protein Tertiary Structure Prediction via SVD and PSO Sampling

4

215

Numerical Results

4.1 2L3F PDB Code Protein We applied the model reduction technique utilizing a SVD to the protein Uracil DNA glycolase from Methanosarcina acetivorans whose native structure is known and reported by the Northeast Structural Genomics Consortium Target [20]. This native structure has been obtained via Nuclear Magnetic Resonance (NMR) which helps obtaining valuable information about the 3D protein structure, dynamics, nucleic acids and its derived complexes. The assessment of the algorithm performance over a reduced search space is carried out by evaluating two different decoys corresponding to the best decoy and the 10th percentile decoy listed in the CASP9 competition. Each decoy comprises 1271 atoms corresponding to 158 residues. When these two decoys are projected over the reduced search space, the energy of each basis term comprises the three decoy eigenvalues; consequently, the protein sampling would be carried out with a lower ill-posed character while maintaining the prediction accuracy. Information about the algorithm perform‐ ance over the reduced Search Space is given in Fig. 1.

Fig. 1. Protein 2L3F. (A) Convergence curve. (B) Median dispersion curve (%).

As observed in Fig. 1A, the algorithm starts with an energy value which is very close to the optimum. Additionally, the protein refinement algorithm is strongly influenced by the “a priori” model utilized, that is, better initial models yield to better refinements. In Fig. 1B, we show the algorithm performance by plotting the median distance for each particle with respect to the centre of gravity normalized with respect to the first iteration (considered to be 100% dispersion). The qualitative assessment of the protein refinement is shown in Fig. 2, where the best configuration found for each case is presented and compared to the best prediction in CASP9 competition. In this sense, good predictions, similar to the native structure were obtained.

216

Ó. Álvarez et al.

Fig. 2. Protein 2L3F backbone structures corresponding to the (A) best model, (B) result of best model refinement, (C) result of 10th percentile decoy refinement.

We quantitatively analyse the refined structures via the Root Mean Squared Distance (RMSD) with respect to the native. Table 1 summarizes the results obtained with different expanded initial models. It can be observed how the algorithm is capable of improving almost the entire decoy set from CASP9 competition. The major drawback is that a good “a priori” model, situated within the valley where the optimum value exists, is required as a starting point as observed by the poor improvement in the energy func‐ tion. However, despite the energy function is seldom improved, the RMSD suffers improvements; due to the fact that, RR-PSO samples within the valley where the energy function does not vary substantially, however, it is capable of finding a new backbone conformation with a lower RMSD. Table 1. Summary of the computational experiments performed in this paper, via Singular Value Decomposition and Particle Swarm Optimization. The table shows the results obtained with different initial models to perform the SVD expansion (initial energy). Protein PDB code

Model

2L3F

Best model 10th percentile Best model 10th percentile Best model 10th percentile Best model 10th percentile Best model 10th percentile Best model 10th percentile Best model 10th percentile Best model 10th percentile Best model 10th percentile

2L06 2KYY 2L02 3NBM 3N1U 2X3O 3NYM 3NZL

Initial energy −342.1 −311.8 −369.9 −322.3 −273.7 −247.4 −448.6 −373.3 −253.6 −233.5 −464.4 −438.1 −369.2 −334.9 −343.6 −299.3 −209.4 −177.1

Best fit energy −341.5 −312.5 −371.4 −323.6 −277.1 −248.3 −450.1 −376.0 −249.6 −233.9 −465.8 −439.9 −369.2 −335.3 −343.0 −301.8 −210.0 −177.8

“Initial” RMSD 1.9424 2.0179 5.9876 6.6480 1.6171 3.6767 7.2553 14.5460 0.9829 1.4245 0.6949 0.8601 8.2840 11.3852 8.9442 10.8898 3.8829 4.1682

Best fit RMSD 1.8884 2.0178 5.9570 4.6003 1.6051 3.6508 7.1511 9.8897 0.9055 1.3309 0.6945 0.8945 3.2162 8.070 6.1692 6.3731 3.8128 4.1648
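As a small aside, the RMSD reported in Table 1 reduces to the following computation once the two structures are superposed and their atoms matched (a sketch; optimal superposition, e.g. via the Kabsch algorithm, is assumed to have been done beforehand).

```python
import numpy as np

def rmsd(coords_a: np.ndarray, coords_b: np.ndarray) -> float:
    """RMSD between matched atoms of two 3 x n_atoms coordinate matrices,
    assuming the structures are already optimally superposed."""
    d2 = ((coords_a - coords_b) ** 2).sum(axis=0)  # squared distance per atom
    return float(np.sqrt(d2.mean()))
```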


Figure 3 shows the median coordinates of the sampled protein models over the energy region below −200. We show the protein as a matrix with rows containing the coordinates x, y and z and columns containing the atoms. This representation helps us better visualize the uncertainty behind the coordinates, in the form of coordinate variation and interquartile range.

Fig. 3. Median protein and the protein uncertainty for predictions based on the sampling performed over (A) the best model, in the energy region below −320; (B) the 10th percentile decoy, within the energy region below −300.

This graph is used to quantify different conformations of the protein structure. In this case, the higher variations occur in the border coordinates, those atoms corresponding to the protein ends. Besides, the uncertainty is bigger for case (B), concerning the 10th percentile decoy. Therefore, the protein structure seems to be better constrained when the sampling is performed using the best decoy found.

Figure 4 shows the topography of the energy in the first two PCA coordinates, the two reduced search space coordinates that store the majority of the information.

Fig. 4. Protein 2L3F energy landscape for samplings performed over a search space obtained via SVD performed on (A) the best model and (B) the 10th percentile best decoy. It can be observed that both maps have a similar structure.


As observed, the topography is similar, with a central valley of low energies whose orientation is North-South. These graphics serve to assess the mathematical complexity of the protein tertiary structure prediction problem, by observing the intricate valleys of the energy function in lower dimensions. In this case, we have used PCA as a visualization tool to produce this plot, because the projection onto a 2D space has to be done using different sampled templates with different energies.

4.2 Prediction of Other Proteins via SVD and PSO

We present additional information to expand this research benchmark. We have tested additional proteins (21–26), utilising the SVD model reduction and the RR-PSO algorithm, to prove this technique as a protein tertiary structure refinement technique. Table 1 summarizes the results obtained, where we show the energy and RMSD of the initial model used to perform the expansion, and the same descriptors obtained after optimisation in the SVD reduced basis set. The cases of 2L06, 2X3O and 3NBM are special, since a drastic improvement with respect to each decoy has not been achieved, unlike in the other cases. We propose to use this method as a final refinement step, after a good protein model has been found via other existing methodologies. Besides, the no-free-lunch theorem in search and optimization [20] states that no algorithm is superior to the rest when it is used over the whole set of problems. Therefore, research is always needed to provide new mathematically based, elegant and simple algorithms.

5 Conclusions

In this paper, we describe a model reduction technique applied to a decoy-based modelling algorithm. The application of SVD is capable of successfully establishing a three-dimensional search space in order to perform the sampling of protein structures via RR-PSO. The SVD model reduction technique is able to preserve the complete information of a given protein backbone structure; consequently, it has been proven to further refine and lower the energy when the optimization is carried out. In this sense, it has been shown that a better refinement is achieved compared to other model reduction techniques, such as Principal Component Analysis. The main difference with respect to PCA is that the SVD model reduction is performed on one protein template, while PCA needs different templates to diagonalize their experimental covariance matrix and find the reduced basis set. Besides, independently of the number of atoms, the sampling is always performed in the SVD spectral basis set, which is three-dimensional. Additionally, the SVD model reduction combined with PSO allows us to sample the equivalent nonlinear region, which helps us understand the protein backbone structure and its alternate states. The SVD model reduction serves to alleviate the ill-posed character of this highly dimensional optimization problem without losing information when the protein that is used to calculate the basis set is expressed in this reduced search space. Therefore, the SVD-PSO methodology should be used as a protein structure refinement method.


Acknowledgements. A. K. acknowledges financial support from NSF grant DBI 1661391 and from The Research Institute at Nationwide Children’s Hospital.

References

1. Zhang, Y.: Progress and challenges in protein structure prediction. Curr. Opin. Struct. Biol. 18, 342–348 (2008)
2. Bonneau, R., Strauss, C.E., Rohl, C.A., Chivian, D., Bradley, P., Malmstrom, L., Robertson, T., Baker, D.: De novo prediction of three-dimensional structures for major protein families. J. Mol. Biol. 322, 65–78 (2002)
3. Álvarez-Machancoses, O., Fernández-Martínez, J.L., Fernández-Brillet, C., Cernea, A., Fernández-Muñiz, Z., Kloczkowski, A.: Principal component analysis in protein tertiary structure prediction. J. Bioinf. Comput. Biol. (2018). Accepted for publication
4. Fiser, A.: Template-based protein structure modeling. Methods Mol. Biol. 673, 73–94 (2010)
5. Fernández-Martínez, J.L., Fernández-Muñiz, M.Z., Tompkins, M.J.: On the topography of the cost functional in linear and nonlinear inverse problems. Geophysics 77, W1–W7 (2012)
6. Fernández-Martínez, J.L.: Model reduction and uncertainty analysis in inverse problems. Lead. Edge 34, 1006–1016 (2015)
7. Fernández-Martínez, J.L., Fernández-Álvarez, J.P., García-Gonzalo, M.E., Ménendez-Pérez, C.O., Kuzma, H.A.: Particle swarm optimization (PSO): a simple and powerful algorithm family for geophysical inversion. In: SEG Technical Program Expanded Abstracts, pp. 3568–3571 (2008)
8. Fernández-Martínez, J.L., Tompkins, M., Fernández-Muñiz, Z., Mukerji, T.: Inverse problems and model reduction techniques. In: Borgelt, C., et al. (eds.) Combining Soft Computing and Statistical Methods in Data Analysis. Advances in Intelligent and Soft Computing, vol. 77, pp. 255–262. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14746-3_32
9. Fernández-Muñiz, Z., Fernández-Martínez, J.L., Srinivasan, S., Mukerji, T.: Comparative analysis of the solution of linear continuous inverse problems using different basis expansions. J. Appl. Geophys. 113, 95–102 (2015)
10. Qian, B., Ortiz, A., Baker, D.: Improvement of comparative model accuracy by free-energy optimization along principal components of natural structural variation. Proc. Nat. Acad. Sci. 101, 15346–15351 (2004)
11. Leach, A.R.: Molecular Modelling—Principles and Applications. Prentice Hall, Upper Saddle River (1991)
12. Jones, D.T., Thornton, J.M.: Potential energy functions for threading. Curr. Opin. Struct. Biol. 6, 210–216 (1996)
13. Frantz, D.D., Freeman, D.L., Doll, J.D.: Reducing quasi-ergodic behavior in Monte Carlo simulations by J-walking: applications to atomic clusters. J. Chem. Phys. 93, 2769–2784 (1990)
14. Brunette, T.J., Brock, O.: Improving protein prediction with model-based search. Bioinformatics 21, 66–74 (2005)
15. Fernández-Martínez, J.L., García-Gonzalo, E.: Stochastic stability and numerical analysis of two novel algorithms of the PSO family: PP-PSO and RR-PSO. Int. J. Artif. Intell. Tools 21, 1240011 (2012)
16. Kennedy, J., Eberhart, R.: A new optimizer using particle swarm theory. In: Proceedings of the Sixth International Symposium on Micro Machine and Human Science (1995)


17. Gront, D., Kolinski, A.: BioShell – a package of tools for structural biology prediction. Bioinformatics 22, 621–622 (2006)
18. Gront, D., Kolinski, A.: Utility library for structural bioinformatics. Bioinformatics 24, 584–585 (2008)
19. Gniewek, P., Kolinski, A., Kloczkowski, A., Gront, D.: BioShell-Threading: a versatile Monte Carlo package for protein threading. BMC Bioinf. 22, 22 (2014)
20. Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1, 67–82 (1997)

Fighting Fire with Fire: Computational Prediction of Microbial Targets for Bacteriocins

Edgar D. Coelho1(✉), Joel P. Arrais2, and José Luís Oliveira1

1 Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Telematics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
{eduarte,jlo}@ua.pt
2 Department of Informatics Engineering (DEI), Centre for Informatics and Systems of the University of Coimbra, University of Coimbra, Coimbra, Portugal
[email protected]

Abstract. Recently, we have witnessed the emergence of bacterial strains resistant to all known antibacterials. Together with several limitations of existing experimental methods, these events justify the need for computer-aided methods to systematically and rationally identify new antibacterial agents. Here, we propose a methodology for the systematic prediction of interactions between bacteriocins and bacterial protein targets. The protein-bacteriocin interactions are predicted using a mesh of classifiers previously developed by the authors, allowing the identification of the best bacteriocin candidates for antibiotic use and of potential drug targets.

Keywords: Protein-protein interactions · Bacteriocins · Interactome · Microbiome · Machine learning · Support-vector machine

1 Introduction

Since the introduction of antimicrobial agents in the 1940s, the number of deaths attributed to infectious disease has declined drastically. However, the widespread and sometimes erroneous use of antimicrobial agents has exerted selective pressure on microbes, forcing their adaptation. Once adapted, the resulting antimicrobial-resistant organisms can pass the resistance-conferring genetic material to their offspring after replication (vertical gene transfer), and even share it with surrounding cells (horizontal gene transfer). Several strategies have been progressively implemented globally to tackle antibiotic resistance. These include infection prevention, tracking antibacterial-resistant infections, administration of antibacterial agents only when strictly necessary, and the development of new antibacterial drugs. However, the number of new antibacterial agents has steadily declined over the past 30 years, mostly due to the tremendous cost and time necessary to develop and refine a drug that is not certain to reach the market [1]. Given


the high costs, labor, and time demands associated with traditional lead screens and lead-optimization processes, computer-aided drug design becomes an essential complementary tool to reduce the costs of and accelerate these processes. Indeed, these issues propelled academia to develop computational methodologies to aid the identification of new leads for drug development, as reviewed in several works [4–7].

A strategy that stands out in terms of reducing research and development time and costs is drug repositioning. It consists of finding new targets for already commercialized drugs, rendering additional pharmacokinetic testing of the compound for absorption, distribution, metabolism, excretion and toxicity unnecessary.

In a previous work, we developed a methodology for the systematic identification of leads for drug discovery by predicting drug-target interactions between known drugs and microbial proteins [8]. However, since 70% of the currently commercialized antimicrobial agents are derived from bacterial or fungal organisms [9], we decided to look further into the proteins, metabolites, and small molecules produced and secreted by microorganisms, especially a subgroup of antimicrobial peptides called bacteriocins. Bacteriocins are bacterially produced peptides with antimicrobial activity against other bacterial species [15]. Indeed, some of the targeted bacterial species are known clinical targets, including antibiotic-resistant strains. Both broad- and narrow-spectrum bacteriocins have been identified, with the latter being able to neutralize clinical targets without harming commensal species in humans [15–17].

Despite these promising benefits, the availability of bacteriocins is very limited due to several factors, including the high cost of commercial production, possible toxicity against eukaryotic cells, susceptibility to degradation by human proteolytic enzymes, development of allergies to the bacteriocins, and decreased bioavailability and possible alterations of their physicochemical properties due to interactions with food constituents [18–21]. Still, since bacteriocins are peptides, they can be subjected to gene-based peptide engineering [15]. This strategy, coupled with nanotechnology formulations, is a promising path to overcome these limitations [22].

Here, we propose a methodology for the systematic prediction of interactions between bacteriocins and bacterial protein targets. The protein-bacteriocin interactions (PBI) are predicted using a mesh of classifiers previously developed by the authors [23], allowing the identification of the best bacteriocin candidates for antibiotic use and of potential drug targets.

2 Methods

2.1 Pipeline Overview

The proposed approach is schematized in Fig. 1. Since we aimed to predict targets for bacteriocins in the most problematic human pathogens, we searched for reports from the Centers for Disease Control and Prevention. We found that Clostridium difficile, carbapenem-resistant Enterobacteriaceae (CRE), and drug-resistant Neisseria gonorrhoeae (DRNG) are the most urgent antibacterial resistance threats.


Fig. 1. Pipeline of the proposed methodology.

The primary sequence of each protein (and peptide) in the collected data is then discretized using a substitution table based on the physicochemical similarities of the amino acids. The amino acid substitution table (Table 1) was first proposed by Shen et al. [24] to reduce the vector space dimensionality, based on the notion that mutations between amino acids that are closely related physicochemically would most likely be synonymous. Then, we apply the discrete cosine transform (DCT) to the resulting discretized sequence, allowing its representation as features. The DCT is used to describe the protein sequence taking the physicochemical properties of each amino acid group into account, while also representing proteins of different lengths with a normalized number of features.

Table 1. Amino acid substitution table.

Category   Amino acids
1          Alanine, Glycine, Valine
2          Isoleucine, Leucine, Phenylalanine, Proline
3          Tyrosine, Methionine, Threonine, Serine
4          Histidine, Asparagine, Glutamine, Tryptophan
5          Arginine, Lysine
6          Aspartic acid, Glutamic acid
7          Cysteine

Amino acids are categorized according to their specific physicochemical properties.
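To make the encoding concrete, the sketch below (ours, not the authors' code) maps one-letter residue codes to the Table 1 categories and keeps a fixed number of DCT coefficients as features; the choice of 64 coefficients is an arbitrary placeholder, since the paper selects the number of frequencies by iterative search.

import numpy as np
from scipy.fftpack import dct

# Physicochemical categories from Table 1, keyed by one-letter residue code.
CATEGORY = {'A': 1, 'G': 1, 'V': 1,
            'I': 2, 'L': 2, 'F': 2, 'P': 2,
            'Y': 3, 'M': 3, 'T': 3, 'S': 3,
            'H': 4, 'N': 4, 'Q': 4, 'W': 4,
            'R': 5, 'K': 5,
            'D': 6, 'E': 6,
            'C': 7}

def dct_features(sequence, n_features=64):
    # Discretize the sequence, then compress it to a fixed-length vector.
    signal = np.array([CATEGORY[aa] for aa in sequence if aa in CATEGORY], dtype=float)
    coeffs = dct(signal, type=2, norm='ortho')
    features = np.zeros(n_features)
    n = min(n_features, len(coeffs))
    features[:n] = coeffs[:n]  # zero-padded for sequences shorter than n_features
    return features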


Finally, we use a Gene Ontology (GO)-based mesh of classifiers to predict protein-protein interactions (PPIs). The detailed methodology is thoroughly described in [23] and is briefly explained in Subsect. 2.4.

2.2 Data Collection and Dataset Construction

Protein interaction data for the microorganisms studied in this work were downloaded from the STRING database [1]. To ensure a low percentage of false positives, we filtered the PPIs using a combined score greater than 900. After mapping the STRING identifiers to UniProtKB identifiers, we noticed that both Neisseria gonorrhoeae strains (MS 11 and PID 24) and Yersinia pestis CO 92 had few to no UniProtKB identifiers found (0 out of 2213, 0 out of 2080, and 205 out of 3848, respectively). Thus, these strains were discarded from further experiments, as were all STRING identifiers without a mapping to UniProtKB. Table 2 summarizes the number of proteins in each organism before and after mapping to UniProtKB.

Table 2. Number of proteins in each organism before and after UniProtKB mapping.

Organism                           # STRING IDs   # UniProtKB IDs
Clostridium difficile 630          3,604          3,400
Klebsiella pneumoniae KCTC 2190    4,875          4,710
Klebsiella pneumoniae MGH 78578    4,746          4,603
Salmonella enterica CT 18          4,742          4,591
Salmonella enterica LT 2           4,406          4,315
Yersinia pestis KIM 10             3,943          3,368

The second column gives the number of proteins before mapping; the third column, the number after mapping.
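As an illustration of the STRING filtering step described above, the following hypothetical snippet keeps only links with a combined score above 900; the file name is a placeholder for a STRING protein.links download.

import pandas as pd

links = pd.read_csv("protein.links.v10.txt", sep=" ")   # columns: protein1 protein2 combined_score
high_confidence = links[links["combined_score"] > 900]  # low false-positive subset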

We downloaded all bacteriocin data from BACTIBASE [2], a database designed for the characterization of bacteriocins. Of the 229 bacteriocin sequences obtained, six were removed for being incomplete. All the data described were downloaded in October 2017. Since we collected PPI data from six different bacterial species, we created six test datasets. To create these datasets, we paired the collected bacteriocins with the unique proteins obtained from each bacterial species in an all-against-all fashion. Thus, the resulting C. difficile (strain 630) test dataset comprised 301,032 PBIs, the K. pneumoniae (strain KCTC 2190) dataset 551,448 PBIs, the K. pneumoniae (strain MGH 78578) dataset 1,021,866 PBIs, the S. enterica (strain CT 18) dataset 1,019,202 PBIs, the S. enterica (strain LT 2) dataset 957,930 PBIs and, finally, the Y. pestis (strain KIM 10) dataset 760,128 PBIs.
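The all-against-all pairing itself reduces to a Cartesian product, as in this minimal sketch with made-up identifier lists:

from itertools import product

bacteriocins = ["P08083", "P05834", "D7UP04"]  # UniProtKB accessions (223 in the study)
proteins = ["Q18AK1", "Q187R6", "Q189P1"]      # e.g. the proteins of one organism

test_pairs = list(product(bacteriocins, proteins))  # every bacteriocin vs. every protein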


2.3 Prediction of Essential Proteins in Bacterial Organisms

Several factors are known to influence the activity of bacteriocins on target bacterial cells, most importantly the structure and quantity of the bacteriocin, the molecular composition of the bacterial cell membrane, and the pH of the extracellular environment [3]. While it is recognized that the clear majority of bacteriocins destroy target cells through cytoplasmic membrane permeabilization, not all bacteriocins act directly on the bacterial cell membrane. For instance, they can bind and inhibit DNA or RNA, compromise protein synthesis, disable intracellular enzymes, or affect membrane and cell wall formation [reviewed in 4]. Taking this into account, we applied a previously used strategy to predict the essential proteins in each of the collected interactomes [5]. This strategy consists of calculating two network centrality metrics for each protein: the subgraph centrality (SC) and the betweenness centrality (BC). Both metrics were calculated using the Python software package NetworkX (https://networkx.github.io/), which is used for the creation and study of complex networks.

The subgraph centrality can be calculated from the spectrum of the adjacency matrix of the network and was found to discriminate the nodes of a network better than alternative metrics (e.g., degree, closeness). In addition, it was shown that SC is more highly correlated with the lethality of individual proteins removed from the proteome than the number of links per node [6, 7]. For a given node u, the SC is given by

SC(u) = \sum_{j=1}^{N} (v_j^u)^2 \, e^{\lambda_j}   (1)

where v_j is an eigenvector of the adjacency matrix A corresponding to the eigenvalue \lambda_j, and v_j^u is its component for node u. Bottlenecks in protein networks can be identified by calculating the betweenness centrality, BC, with greater values suggesting a higher "bottleneck-ness". Bottlenecks are network nodes that have many shortest paths passing through them, making them key connector proteins. Compared with degree centrality (i.e., "hub-ness"), bottlenecks are significantly better associated with essentiality [8]. For a node v, the BC is given by

BC(v) = \sum_{s,t \in V} \frac{\sigma(s,t \mid v)}{\sigma(s,t)}   (2)

where V is the set of nodes, \sigma(s,t) is the number of shortest paths between s and t, and \sigma(s,t \mid v) is the number of those paths that pass through v. Ideally, only reviewed proteins would be used in this study. However, considering that the percentage of reviewed proteins is only 8.7% for Clostridium difficile 630, 0.3% for Klebsiella pneumoniae KCTC 2190, 15.4% for Klebsiella pneumoniae MGH 78578, 28.9% for Salmonella enterica CT 18, 41.2% for Salmonella enterica LT 2 and 27.9% for Yersinia pestis KIM 10, we decided to use unreviewed proteins as well.
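As a minimal illustration of how these two metrics can be obtained with the NetworkX package mentioned above (the toy edge list is a stand-in, not real PPI data):

import networkx as nx

G = nx.Graph()
G.add_edges_from([("dnaK", "polA"), ("polA", "guaB"),
                  ("guaB", "dnaK"), ("polA", "recA")])

sc = nx.subgraph_centrality(G)     # Eq. (1), Estrada's subgraph centrality
bc = nx.betweenness_centrality(G)  # Eq. (2); normalized by default, pass
                                   # normalized=False for raw path-count sums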


2.4 Protein-Bacteriocin Interaction (PBI) Classification

In a previous work, the authors proposed a methodology to predict protein-protein interactions (PPIs) [9]. In that approach, four different datasets were created to assess the performance of the method under different circumstances. The amino acid sequences were collected for each protein in these datasets. Each amino acid was then converted into a signal using the substitution table (Table 1), followed by reconstruction using the discrete cosine transform (DCT). The DCT was used to tackle the issue of comparing proteins of inconsistent lengths, treating each protein instead as a signal that modulates the variation of amino acids. The optimal number of frequencies to be used as features was identified through an iterative search and internally validated using 5-fold cross-validation. The final classifier was implemented using a support vector machine with a radial basis function kernel. For simplicity, we will refer to this classifier as the "generic classifier".

Since bacteriocins are peptides, the previously developed methodology can be directly applied to the prediction of PBIs. The first step consisted of collecting the FASTA sequences of all proteins and bacteriocins in our data, followed by the steps described to this point. We also collected the Gene Ontology (GO) molecular function annotations for each protein in the datasets. Proteins without GO annotations are subject to a similarity search using UniRef. In the training phase, each protein pair is grouped based on its GO molecular function annotations, thus creating highly specialized classifiers with the same parameters as the generic classifier. The generic classifier is used when no annotation is found for the input protein pair. Based on exhaustive evaluation, the final mesh model attained a consistent average AUC of 0.84.
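The following hedged sketch mirrors this setup with scikit-learn: an RBF-kernel SVM over concatenated per-protein DCT features, evaluated by 5-fold cross-validation. The feature dimensionality, hyperparameters and random data are illustrative assumptions, not the authors' values.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))   # e.g. 64 DCT coefficients per protein, concatenated per pair
y = rng.integers(0, 2, size=200)  # 1 = interacting pair, 0 = non-interacting

clf = SVC(kernel="rbf")           # radial basis function kernel, as in the text
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(auc.mean())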

3 Results and Discussion

3.1 Evaluation of the Best Predicted Targets

The analysis focused on the proteins identified as the optimal targets for inhibition, as per their SC and BC scores. Since SC is more highly correlated with the lethality of individual proteins removed from the proteome, we ordered the proteins in each organism by their SC values in descending order. Groups of proteins with identical SC values were then reordered by their BC values, also in descending order. Thus, the proteins at the top of this list were considered better targets for inhibition than those at the bottom.

DNA polymerase I was identified as the best target for inhibition in C. difficile (strain 630), followed by IMP dehydrogenase and the chaperone protein DnaK. A recent study indicates that bacterial DNA replication proteins, including DNA polymerase I, could be exploited as antibacterial targets [10]. IMP dehydrogenase, an essential enzyme for the de novo synthesis of guanine nucleotides, has also been reported as an important antibacterial target [11–13]. The DnaK protein is the bacterial homolog of the human Hsp70, playing an important role in pathogen survival under stress conditions, including antibiotic therapies. Due to its key role in such events, it has also been suggested as an important antibacterial target [14–17].
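A sketch of the ranking rule just described (values invented for illustration): sort by SC in descending order, breaking ties among identical SC values by BC, also descending.

sc = {"polA": 9.1, "dnaK": 7.4, "guaB": 7.4, "recA": 3.2}
bc = {"polA": 0.8, "dnaK": 0.5, "guaB": 0.1, "recA": 0.0}

# Negated keys give descending order on both criteria.
ranked = sorted(sc, key=lambda p: (-sc[p], -bc[p]))
print(ranked)  # ['polA', 'dnaK', 'guaB', 'recA']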


DNA polymerase I was also the best target in both K. pneumoniae KCTC 2190 and K. pneumoniae MGH 78578. In K. pneumoniae KCTC 2190, the second-best target was cystathionine beta-synthase, one of the enzymes responsible for producing hydrogen sulfide (H2S) in bacteria. H2S was found to protect bacterial cells against antibiotics and oxidative stress, and its inhibition was shown to render bacteria highly sensitive to several antibacterial agents [18]. The aerobic respiration control sensor protein (ArcB) was also a high-scoring target shared between K. pneumoniae KCTC 2190 and K. pneumoniae MGH 78578. This protein is part of a two-component system responsible for the anaerobic growth of bacterial species. Since the ArcAB system plays a role in the anaerobic growth of bacteria and is essential for antibacterial resistance under aerobic conditions [19], it could potentially be used as a drug target.

Following the previously observed pattern, DNA polymerase I was also predicted to be the best target for both S. enterica CT 18 and S. enterica LT 2. In S. enterica CT 18, DNA polymerase I was followed by ArcB, in contrast to the chaperone protein DnaK in S. enterica LT 2. The third-best predicted target in S. enterica CT 18 was guanosine 5'-monophosphate (GMP) synthase. This enzyme catalyzes the synthesis of GMP in the de novo pathway for purine nucleotide biosynthesis. GMP synthase has been reported to be required for the production of virulence factors and infection in three fungal species, Cryptococcus neoformans, Candida albicans and Aspergillus fumigatus [20, 21]. Although S. enterica CT 18 is not a fungus, some studies have reported the possible use of inhibitors of GMP synthase (and of other enzymes catalyzing reactions in the purine and pyrimidine nucleotide biosynthesis pathways) as antiviral, antifungal, antibacterial and anticancer agents [22, 23]. The third-best predicted target in S. enterica LT 2 was the protein RecA, which is responsible for recombinational DNA repair and for resuming stalled DNA synthesis. This protein has been implicated as a bacterial drug target in several studies, and it has also been reported to take part in the development of antibacterial resistance [24–26].

Lastly, Y. pestis KIM 10 seemed to be the least studied species of the six, as its top three proteins were only predicted to exist. The first target protein is named putative recombinase (UniProtKB: Q9RID0). According to its UniProtKB entry, this protein is involved in DNA repair, similarly to the protein RecA; the family and domain annotations of the entry indicate that it is part of the RecA family. The second is called putative exonuclease (UniProtKB: Q9RIC9). Following the previous line of analysis, we found that this protein has a DNA polymerase I-like domain and thus seems to play a role in bacterial DNA replication. The third protein is the putative phage tail protein (UniProtKB: Q9RIE1). According to InterPro, this protein possesses domains whose functions are not yet clear, and thus we were not able to hypothesize how it could be used as an antibacterial target.

3.2 Analysis of the Best Scoring PBIs

To evaluate the predicted PBIs, we performed individual analyses of both the proteins and the bacteriocins that attained the best predictive scores. We then analyzed the


results of the classification of our test data. The goal was to study whether the PBIs that attained the highest prediction scores could be of therapeutic use. Table 3 shows the three best scoring PBIs for each of the organisms studied in this work. In the 18 PBIs shown, only three different bacteriocins are represented: colicin N (UniProtKB: P08083), microcin B17 (UniProtKB: P05834) and enterocin Xbeta (UniProtKB: D7UP04).

Table 3. Three best scoring PBIs for each of the studied organisms.

Organism                  Bacteriocin UniProtKB   Target protein UniProtKB   Score
C. difficile 630          P08083                  Q18AK1                     0.92
C. difficile 630          P05834                  Q187R6                     0.91
C. difficile 630          P05834                  Q189P1                     0.91
K. pneumoniae KCTC 2190   P05834                  A0A0H3FL34                 0.96
K. pneumoniae KCTC 2190   D7UP04                  A0A0H3FV80                 0.94
K. pneumoniae KCTC 2190   P05834                  A0A0H3FLK2                 0.93
K. pneumoniae MGH 78578   P05834                  A6TCH9                     0.96
K. pneumoniae MGH 78578   D7UP04                  A6TBR1                     0.95
K. pneumoniae MGH 78578   D7UP04                  A6TE83                     0.94
S. enterica CT 18         P05834                  P0A286                     0.96
S. enterica CT 18         D7UP04                  Q8Z468                     0.95
S. enterica CT 18         P05834                  Q8Z0W9                     0.94
S. enterica LT 2          P05834                  P0A285                     0.96
S. enterica LT 2          D7UP04                  Q7CQ73                     0.93
S. enterica LT 2          D7UP04                  P0A1D5                     0.93
Y. pestis KIM 10          D7UP04                  Q7CKB6                     0.94
Y. pestis KIM 10          D7UP04                  Q7CHC9                     0.93
Y. pestis KIM 10          P05834                  Q8ZD70                     0.93

Colicin N is known to primarily bind lipopolysaccharide (LPS) in the outer membrane of Gram-negative bacteria. It is a transmembrane toxin that depolarizes the cytoplasmic membrane, resulting in the dissipation of cellular energy and pore formation. It is produced by Escherichia coli and shows activity against itself and closely related bacteria [27]. Although we found no reports of colicin N being able to create pores in Gram-positive bacteria, the closely related colicin M was found to possess this ability [28]. However, since Gram-positive bacteria contain only a cytoplasmic membrane, instead of the outer membrane, peptidoglycan cell wall and cytoplasmic inner membrane characteristic of Gram-negative bacteria, it could be argued that Gram-positive bacteria are less prone to antibacterial resistance [29, 30]. Colicins have been reported to show little toxicity in healthy human and tumor cells, although the latter are much more sensitive to their cytotoxic effects [31, 32].

Microcin B17 is another bacteriocin produced by E. coli, known to target DNA gyrase. It targets phylogenetically related species, and its effects on sensitive bacteria include growth inhibition, a rapid decline in DNA replication and induction of the SOS


response [33, 34]. Although the mechanism of action of microcin B17 is similar to that of fluoroquinolones, it has unfortunately been deemed unsuitable as a therapeutic drug in humans, although it still has veterinary uses [35, 36]. For this reason, we will not further analyze the predicted interactions in which microcin B17 participates.

Enterocin Xbeta is part of a two-peptide bacteriocin (enterocin X) produced by Enterococcus faecium KU-B5. The activity of enterocin X (i.e., when enterocin Xalpha and enterocin Xbeta are combined) is greater than the activity of each peptide individually [37]. Since E. faecium is a lactic acid bacterium, its enterocin X is considered fully safe for humans [36].

C. difficile (strain 630)

The first predicted interaction occurs between colicin N and fumarate hydratase class I, subunit B. Fumarate hydratase (also known as fumarase) is an enzyme that catalyzes the reversible conversion of malate to fumarate under anaerobic conditions, being essential in the tricarboxylic acid (TCA) cycle [38, 39]. Since the TCA cycle bridges several pathways of cellular metabolism, it is an attractive target for inhibition, especially in species without alternative pathways [40].

The second-best scoring PBI occurs between microcin B17 and the ABC-type transport system. While it is clear that microcin B17 cannot be used as an antibacterial in humans, it is worth noting that several studies report the relevance of ABC-type transporters as targets for antibacterial vaccines and therapies [41, 42].

The third-best predicted PBI also involved microcin B17, targeting a putative tRNA/rRNA methyltransferase. Both tRNA and rRNA methyltransferases are considered possible antibacterial targets, for different reasons. The primary functions of tRNA methyltransferases are related to the different stages of protein synthesis [43]; being essential to protein translation renders them prominent targets for antibacterials [44]. On another note, the ribosome is one of the most common targets of antibacterials, and mutations in rRNA genes, rRNA methyltransferases and ribosomal proteins lead to drug resistance. For this reason, one study investigated the possibility of using an rRNA methyltransferase inhibitor as an antimicrobial [45].

K. pneumoniae KCTC 2190 and K. pneumoniae MGH 78578

The best scoring PBI in both K. pneumoniae strains was between microcin B17 and the strain-specific DNA repair protein RecO. Similarly to the protein RecA, RecO plays a role in DNA repair [46], making it an attractive drug target. Unfortunately, the second-best scoring PBI in both K. pneumoniae strains involved an uncharacterized protein, part of the uncharacterized protein family YejG; thus, we were unable to discuss these two PBIs further. The third-best scoring PBIs for these strains, however, involved distinct targets.

In K. pneumoniae KCTC 2190, this PBI occurred between enterocin Xbeta and the 2-aminoethylphosphonate (2-AEP) transporter repressor. 2-AEP is a phosphonate, a type of organophosphorus compound characterized by the presence of one or more carbon-phosphorus bonds. Phosphonates have been shown to support bacterial growth when degraded, as they are a source of carbon, energy and phosphorus [47]. Many naturally occurring phosphonates exhibit antibiotic properties, including 2-AEP


[48]. Since there are no 2-AEP biosynthetic enzymes in humans, and given its importance to various pathogens, these enzymes could be attractive targets for inhibition [49]. In addition, it could be argued that inhibiting 2-AEP transport would compromise the ability of the pathogen to degrade it.

The third-best scoring PBI in K. pneumoniae MGH 78578 was predicted to occur between enterocin Xbeta and a putative prophage protein. Prophages are considered reservoirs of virulence, resistance and tolerance genes that can be triggered by selective pressure, spreading these genes via prophage excision and integration [50]. These characteristics make prophages attractive inhibition targets.

S. enterica CT 18 and S. enterica LT 2

The best scoring PBIs in the S. enterica strains studied here involved the DNA repair protein RecO, whose role was already discussed for the K. pneumoniae strains in the previous subsection.

The second-best scoring PBI in S. enterica CT 18 was predicted to occur between enterocin Xbeta and the Cas2 protein. Cas2 is one of four protein families associated with the clustered regularly interspaced short palindromic repeats (CRISPR) DNA family [51]. The CRISPR-Cas system is responsible for defending the prokaryotic cell against foreign genetic elements (e.g., viruses, transposable elements and conjugative plasmids) [52]. The Cas2 protein binds Cas1, creating a complex involved in CRISPR adaptation (i.e., the first stage of CRISPR immunity) [53, 54]. Consequently, we believe that inhibition of Cas2 could disrupt complex formation and thus the activity of the CRISPR-Cas system.

In S. enterica LT 2, the second-best scoring PBI was between enterocin Xbeta and a putative cytoplasmic protein. Unfortunately, this is another protein with no known function or process annotations, precluding further discussion. The same situation occurs for the third-best scoring interaction in S. enterica CT 18, between microcin B17 and a putative membrane protein.

The third-best scoring PBI in S. enterica LT 2 involved enterocin Xbeta and the 10 kDa chaperonin (Cpn10). Cpn10 is the functional partner of Cpn60, forming large multimeric protein complexes [55]. Like other chaperone proteins, they are involved in functions beyond protein folding, including possibly acting as virulence factors and underlying the onset of infections [56]. However, due to reports of the Entamoeba histolytica Cpn10 having sequence similarity with the human Cpn10 [55], we suspect this protein will not be an attractive target for inhibition.

Y. pestis KIM 10

The two best scoring interactions in Y. pestis KIM 10 were predicted to occur between enterocin Xbeta and two uncharacterized proteins, rendering further discussion problematic. The third-best scoring PBI involved microcin B17 and the DNA repair protein RecO, suggesting once again its possible use as an antibacterial target.


3.3 Future Directions

While we strove to be thorough in this work, there are a few optimizations we want to pursue. First, to get a better grasp of the possible side effects of using bacteriocins in humans, we want to develop an objective function that minimizes the effects of bacteriocins in humans while maximizing their effects on microbial species. Another possible improvement would be a new centrality measure combining the subgraph centrality and betweenness centrality measures. Finally, docking experiments for the best predicted PBI pairs, followed by in vitro testing, would provide accurate validation of this work.

Acknowledgements. This work has been supported by the NETDIAMOND project under grant number POCI-01-0145-FEDER-016385. EDC is funded by the NETDIAMOND project under grant number BPD/UI63/6085/2017.

Conflict of Interest. None declared.

References

1. Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., Huerta-Cepas, J., Simonovic, M., Roth, A., Santos, A., Tsafou, K.P., Kuhn, M., Bork, P., Jensen, L.J., von Mering, C.: STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 43, D447–D452 (2015)
2. Hammami, R., Zouhir, A., Le Lay, C., Ben Hamida, J., Fliss, I.: BACTIBASE second release: a database and tool platform for bacteriocin characterization. BMC Microbiol. 10, 22 (2010)
3. Motta, A.S., Flores, F.S., Souto, A.A., Brandelli, A.: Antibacterial activity of a bacteriocin-like substance produced by Bacillus sp. P34 that targets the bacterial cell envelope. Antonie Van Leeuwenhoek 93, 275–284 (2008)
4. Scocchi, M., Mardirossian, M., Runti, G., Benincasa, M.: Non-membrane permeabilizing modes of action of antimicrobial peptides on bacteria. Curr. Top. Med. Chem. 16, 76–88 (2016)
5. Coelho, E.D., Arrais, J.P., Oliveira, J.L.: Computational discovery of putative leads for drug repositioning through drug-target interaction prediction. PLoS Comput. Biol. 12, e1005219 (2016)
6. Estrada, E., Rodríguez-Velázquez, J.A.: Subgraph centrality in complex networks. Phys. Rev. E 71, 056103 (2005)
7. Estrada, E., Hatano, N.: Communicability in complex networks. Phys. Rev. E 77, 036111 (2008)
8. Yu, H., Kim, P.M., Sprecher, E., Trifonov, V., Gerstein, M.: The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics. PLoS Comput. Biol. 3, e59 (2007)
9. Coelho, E.D., Cruz, I.N., Santiago, A., Oliveira, J.L., Dourado, A., Arrais, J.P.: A sequence-based mesh classifier for the prediction of protein-protein interactions. ArXiv e-prints, vol. 1711 (2017)
10. van Eijk, E., Wittekoek, B., Kuijper, E.J., Smits, W.K.: DNA replication proteins as potential targets for antimicrobials in drug-resistant bacterial pathogens. J. Antimicrob. Chemother. 72, 1275–1284 (2017)
11. Shu, Q., Nair, V.: Inosine monophosphate dehydrogenase (IMPDH) as a target in drug discovery. Med. Res. Rev. 28, 219–232 (2008)


12. Hedstrom, L., Liechti, G., Goldberg, J.B., Gollapalli, D.R.: The antibiotic potential of prokaryotic IMP dehydrogenase inhibitors. Curr. Med. Chem. 18, 1909–1918 (2011)
13. Shah, C.P., Kharkar, P.S.: Inosine 5'-monophosphate dehydrogenase inhibitors as antimicrobial agents: recent progress and future perspectives. Future Med. Chem. 7, 1415–1429 (2015)
14. Kragol, G., Lovas, S., Varadi, G., Condie, B.A., Hoffmann, R., Otvos Jr., L.: The antibacterial peptide pyrrhocoricin inhibits the ATPase actions of DnaK and prevents chaperone-assisted protein folding. Biochemistry 40, 3016–3026 (2001)
15. Calloni, G., Chen, T., Schermann, S.M., Chang, H.-C., Genevaux, P., Agostini, F., Tartaglia, G.G., Hayer-Hartl, M., Hartl, F.U.: DnaK functions as a central hub in the E. coli chaperone network. Cell Rep. 1, 251–264 (2012)
16. Chiappori, F., Fumian, M., Milanesi, L., Merelli, I.: DnaK as antibiotic target: hot spot residues analysis for differential inhibition of the bacterial protein in comparison with the human Hsp70. PLoS ONE 10, e0124563 (2015)
17. Knappe, D., Goldbach, T., Hatfield, M.P., Palermo, N.Y., Weinert, S., Strater, N., Hoffmann, R., Lovas, S.: Proline-rich antimicrobial peptides optimized for binding to Escherichia coli chaperone DnaK. Protein Pept. Lett. 23, 1061–1071 (2016)
18. Shatalin, K., Shatalina, E., Mironov, A., Nudler, E.: H2S: a universal defense against antibiotics in bacteria. Science (New York, NY) 334, 986–990 (2011)
19. Loui, C., Chang, A.C., Lu, S.: Role of the ArcAB two-component system in the resistance of Escherichia coli to reactive oxygen stress. BMC Microbiol. 9, 183 (2009)
20. Chitty, J.L., Tatzenko, T.L., Williams, S.J., Koh, Y.Q.A.E., Corfield, E.C., Butler, M.S., Robertson, A.A.B., Cooper, M.A., Kappler, U., Kobe, B., Fraser, J.A.: GMP synthase is required for virulence factor production and infection by Cryptococcus neoformans. J. Biol. Chem. 292, 3049–3059 (2017)
21. Rodriguez-Suarez, R., Xu, D., Veillette, K., Davison, J., Sillaots, S., Kauffman, S., Hu, W., Bowman, J., Martel, N., Trosok, S., Wang, H., Zhang, L., Huang, L.-Y., Li, Y., Rahkhoodaee, F., Ransom, T., Gauvin, D., Douglas, C., Youngman, P., Becker, J., Jiang, B., Roemer, T.: Mechanism-of-action determination of GMP synthase inhibitors and target validation in Candida albicans and Aspergillus fumigatus. Chem. Biol. 14, 1163–1175 (2007)
22. Ishikawa, H.: Mizoribine and mycophenolate mofetil. Curr. Med. Chem. 6, 575–597 (1999)
23. Christopherson, R.I., Lyons, S.D., Wilson, P.K.: Inhibitors of de novo nucleotide biosynthesis as drugs. Acc. Chem. Res. 35, 961–971 (2002)
24. Shen, J., Zhang, J., Luo, X., Zhu, W., Yu, K., Chen, K., Li, Y., Jiang, H.: Predicting protein-protein interactions based only on sequences information. Proc. Natl. Acad. Sci. 104 (2007). http://www.pnas.org/content/104/11/4337.short
25. Nautiyal, A., Patil, K.N., Muniyappa, K.: Suramin is a potent and selective inhibitor of Mycobacterium tuberculosis RecA protein and the SOS response: RecA as a potential target for antibacterial drug discovery. J. Antimicrob. Chemother. 69, 1834–1843 (2014)
26. Alam, M.K., Alhhazmi, A., DeCoteau, J.F., Luo, Y., Geyer, C.R.: RecA inhibitors potentiate antibiotic activity and block evolution of antibiotic resistance. Cell Chem. Biol. 23, 381–391 (2016)
27. Lazdunski, C.J., Bouveret, E., Rigal, A., Journet, L., Lloubes, R., Benedetti, H.: Colicin import into Escherichia coli cells. J. Bacteriol. 180, 4993–5002 (1998)
28. Patin, D., Barreteau, H., Auger, G., Magnet, S., Crouvoisier, M., Bouhss, A., Touze, T., Arthur, M., Mengin-Lecreulx, D., Blanot, D.: Colicin M hydrolyses branched lipids II from Gram-positive bacteria. Biochimie 94, 985–990 (2012)
29. Silhavy, T.J., Kahne, D., Walker, S.: The bacterial cell envelope. Cold Spring Harb. Perspect. Biol. 2, a000414 (2010)


30. Delcour, A.H.: Outer membrane permeability and antibiotic resistance. Biochim. Biophys. Acta 1794, 808–816 (2009)
31. Murinda, S.E., Rashid, K.A., Roberts, R.F.: In vitro assessment of the cytotoxicity of nisin, pediocin, and selected colicins on simian virus 40-transfected human colon and Vero monkey kidney cells with trypan blue staining viability assays. J. Food Prot. 66, 847–853 (2003)
32. Chumchalova, J., Smarda, J.: Human tumor cells are selectively inhibited by colicins. Folia Microbiol. 48, 111–115 (2003)
33. Heddle, J.G., Blance, S.J., Zamble, D.B., Hollfelder, F., Miller, D.A., Wentzell, L.M., Walsh, C.T., Maxwell, A.: The antibiotic microcin B17 is a DNA gyrase poison: characterisation of the mode of inhibition. J. Mol. Biol. 307, 1223–1234 (2001)
34. Pierrat, O.A., Maxwell, A.: The action of the bacterial toxin microcin B17: insight into the cleavage-religation reaction of DNA gyrase. J. Biol. Chem. 278, 35016–35023 (2003)
35. Collin, F., Thompson, R.E., Jolliffe, K.A., Payne, R.J., Maxwell, A.: Fragments of the bacterial toxin microcin B17 as gyrase poisons. PLoS ONE 8, e61459 (2013)
36. Karpinski, T.M., Szkaradkiewicz, A.K.: Characteristic of bacteriocines and their application. Pol. J. Microbiol. 62, 223–235 (2013)
37. Hu, C.-B., Malaphan, W., Zendo, T., Nakayama, J., Sonomoto, K.: Enterocin X, a novel two-peptide bacteriocin from Enterococcus faecium KU-B5, has an antibacterial spectrum entirely different from those of its component peptides. Appl. Environ. Microbiol. 76, 4542–4545 (2010)
38. Woods, S.A., Guest, J.R.: Differential roles of the Escherichia coli fumarases and fnr-dependent expression of fumarase B and aspartase. FEMS Microbiol. Lett. 48, 219–224 (1987)
39. Kasbekar, M., Fischer, G., Mott, B.T., Yasgar, A., Hyvönen, M., Boshoff, H.I.M., Abell, C., Barry, C.E., Thomas, C.J.: Selective small molecule inhibitor of the Mycobacterium tuberculosis fumarate hydratase reveals an allosteric regulatory site. Proc. Nat. Acad. Sci. U.S.A. 113, 7503–7508 (2016)
40. Boshoff, H.I., Barry 3rd, C.E.: Tuberculosis - metabolism and respiration in the absence of growth. Nat. Rev. Microbiol. 3, 70–80 (2005)
41. Garmory, H.S., Titball, R.W.: ATP-binding cassette transporters are targets for the development of antibacterial vaccines and therapies. Infect. Immun. 72, 6757–6763 (2004)
42. Calcagno, A.M., Kim, I.W., Wu, C.P., Shukla, S., Ambudkar, S.V.: ABC drug transporters as molecular targets for the prevention of multidrug resistance and drug-drug interactions. Curr. Drug Deliv. 4, 324–333 (2007)
43. Hori, H.: Methylated nucleosides in tRNA and tRNA methyltransferases. Front. Genet. 5, 144 (2014)
44. Chopra, S., Reader, J.: tRNAs as antibiotic targets. Int. J. Mol. Sci. 16, 321–349 (2015)
45. Rana, A.K., Chandra, S., Siddiqi, M.I., Misra-Bhattacharya, S.: Molecular characterization of an rsmD-like rRNA methyltransferase from the Wolbachia endosymbiont of Brugia malayi and antifilarial activity of specific inhibitors of the enzyme. Antimicrob. Agents Chemother. 57, 3843–3856 (2013)
46. Makharashvili, N., Koroleva, O., Bera, S., Grandgenett, D.P., Korolev, S.: A novel structure of DNA repair protein RecO from Deinococcus radiodurans. Structure 12, 1881–1889 (2004)
47. Schowanek, D., Verstraete, W.: Phosphonate utilization by bacterial cultures and enrichments from environmental samples. Appl. Environ. Microbiol. 56, 895–903 (1990)
48. McGrath, J.W., Chin, J.P., Quinn, J.P.: Organophosphonates revealed: new insights into the microbial metabolism of ancient molecules. Nat. Rev. Microbiol. 11, 412 (2013)
49. Metcalf, W.W., van der Donk, W.A.: Biosynthesis of phosphonic and phosphinic acid natural products. Annu. Rev. Biochem. 78, 65–94 (2009)


50. Wang, X., Wood, T.K.: Cryptic prophages as targets for drug development. Drug Resist. Updates 27, 30–38 (2016)
51. Kunin, V., Sorek, R., Hugenholtz, P.: Evolutionary conservation of sequence and secondary structures in CRISPR repeats. Genome Biol. 8, R61 (2007)
52. Rath, D., Amlinger, L., Rath, A., Lundgren, M.: The CRISPR-Cas immune system: biology, mechanisms and applications. Biochimie 117, 119–128 (2015)
53. Arslan, Z., Hermanns, V., Wurm, R., Wagner, R., Pul, U.: Detection and characterization of spacer integration intermediates in type I-E CRISPR-Cas system. Nucleic Acids Res. 42, 7884–7893 (2014)
54. Nunez, J.K., Kranzusch, P.J., Noeske, J., Wright, A.V., Davies, C.W., Doudna, J.A.: Cas1-Cas2 complex formation mediates spacer acquisition during CRISPR-Cas adaptive immunity. Nat. Struct. Mol. Biol. 21, 528–534 (2014)
55. van der Giezen, M., Leon-Avila, G., Tovar, J.: Characterization of chaperonin 10 (Cpn10) from the intestinal human pathogen Entamoeba histolytica. Microbiology (Reading, England) 151, 3107–3115 (2005)
56. Henderson, B., Allan, E., Coates, A.R.M.: Stress wars: the direct role of host and bacterial molecular chaperones in bacterial infection. Infect. Immun. 74, 3693–3706 (2006)

A Graph-Based Approach for Querying Protein-Ligand Structural Patterns

Renzo Angles1,3 and Mauricio Arenas2

1 Department of Computer Science, Universidad de Talca, Talca, Chile
[email protected]
2 Department of Bioinformatics, Universidad de Talca, Talca, Chile
3 Center for Semantic Web Research, Santiago, Chile

Abstract. In the context of protein engineering and biotechnology, the discovery and characterization of structural patterns is very relevant, as it can give fundamental insights about protein structures. In this paper we present GSP4PDB, a bioinformatics web tool that lets users design, search and analyze protein-ligand structural patterns inside the Protein Data Bank (PDB). The novel feature of GSP4PDB is that a protein-ligand structural pattern is graphically designed as a graph, such that the nodes represent protein components and the edges represent structural relationships. The resulting graph pattern is transformed into a SQL query and executed in a PostgreSQL database system where the PDB data is stored. The results of the search are presented using a textual representation, and the corresponding binding sites can be visualized using a JSmol interface.

1 Introduction

In the context of protein engineering and biotechnology, structural patterns are three-dimensional structures that occur in biological molecules, such as proteins or nucleic acids, and are key to understanding their functionality. It is known that there are common patterns and preferences in the contacts between amino acid residues, or between residues and other biomolecules, such as DNA. The discovery and characterization of structural patterns is an important research topic, as it can give fundamental insight into protein structures and can aid in the prediction of unknown structures [7,8].

We concentrate our interest on structural patterns representing protein-ligand interactions [10]. Ligands are small molecules (such as oxygen, solvents and metals) that can interact with, bind and control the biological function of proteins. Protein-ligand binding kinetics describes the process underlying the association between the protein and ligand, particularly focusing on the rate at which these two partners bind to each other [4]. The study of the specific interaction of a protein with its ligand is an active research field because of the implications this has for the overall understanding of the structure and function of proteins, and in particular for the fast-growing area of rational drug design [11]. Particularly,


structure-based drug design/discovery [12] is one of the computer-aided methods by which novel drugs are designed or discovered based on the knowledge of the 3D structures of the relevant specific targets.

To the best of our knowledge, there is no standard way to model and represent protein-ligand structural patterns. Most tools (e.g. ProteinsPlus [5] and AFAL [2]) provide simple user interfaces where the characteristics and restrictions of a structural pattern are filled in a Web form. An alternative is to use a notation (i.e. a textual format) to describe a structural pattern, like the one provided by PROSITE to represent motifs. However, such notations are restricted to expressing sequence patterns (e.g. a sequence of amino acid symbols). Hence, researchers are unable to search structural patterns in a simple and natural way.

Considering the problem identified above, we propose a graph-based model for representing structural patterns. Specifically, a protein-ligand structural pattern is modeled as a graph whose nodes describe amino acids or ligands, and whose edges represent their relationships. Based on this model we have developed GSP4PDB, a web application that lets users design, search and analyze protein-ligand structural patterns inside the Protein Data Bank (PDB) [13].

The paper is organized as follows. In Sect. 2 we review protein-ligand structural patterns and define the notion of graph-based structural pattern. In Sect. 3 we describe the components of GSP4PDB. Finally, Sect. 4 presents some conclusions and future work.

2 Graph-Based Representation of Structural Patterns

2.1 Protein-Ligand Structural Patterns

From a chemical point of view, proteins are by far the most structurally complex and functionally sophisticated molecules known [1]. There are four levels of organization in the structure of a protein. The primary structure refers to the sequence of amino acids, which are linked by peptide bonds to form polypeptide chains. Polypeptide chains can fold into regular structures such as the alpha helix and the beta sheet; these substructures constitute the secondary structure of the protein. Tertiary structure refers to the full three-dimensional organization of a polypeptide chain. Finally, if a particular protein is formed by more than one polypeptide chain, the complete structure is designated as the quaternary structure [3].

The notion of structural pattern is used to describe a three-dimensional “structure” or “shape” that occurs in the secondary structure of a protein. The same structural pattern can occur in a group of proteins with a given frequency and satisfying specific criteria (e.g. atomic distance, composition, connectivity, etc.). There are several types of structural patterns, but we concentrate on those representing protein-ligand interactions [10]. In this context, a “ligand” is any molecule capable of binding to a protein with high specificity and affinity. We define a protein-ligand structural pattern as the combination of a ligand and a group of amino acids, whose three-dimensional distribution can be determined by three types of relationships: distance between two amino acids,


distance between an amino acid and the ligand, and the order of precedence (in the sequence) of an amino acid with respect to another amino acid. For instance, a C2H2-type Zinc Finger [6] can be described by a protein-ligand structural pattern where a zinc atom (the ligand) is surrounded by two cysteine and two histidine residues (the amino acids). Such a basic structure could be extended with distance conditions between the amino acids and the ligand.

A C2H2-type Zinc Finger can be written, according to the PROSITE notation, as C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H. However, this textual representation is neither intuitive nor convenient for searching structural patterns. Next we propose the use of graphs as a simple and natural way to represent and visualize structural patterns.

2.2 Graph-Based Structural Patterns

We introduce the notion of graph-based structural pattern as an abstract model for representing protein-ligand structural patterns. In general terms, a graph-based structural pattern is a graph where the nodes represent protein components (i.e. amino acids and ligands) and the edges represent structural relationships (distance between two amino acids, distance between a ligand and an amino acid, and the precedence relationship between two amino acids).

Specifically, a graph-based structural pattern is a labeled property graph, i.e. a labeled graph where nodes and edges can contain key-value pairs representing their properties (or attributes). Three types of nodes are allowed, named amino-nodes, any-nodes and ligand-nodes. An amino-node represents a specific amino acid, whose name is defined by the property “name”. An any-node represents the occurrence of “any” amino acid (as a wildcard); each any-node includes the property “class”, which allows the polarity classification of the node to be defined (i.e. non-polar, polar uncharged, positively charged and negatively charged). A ligand-node represents the ligand of the pattern and includes the property “code” to define the 3-letter code of the ligand. On the other hand, nodes can be connected by two types of edges: distance-edges and next-edges. A distance-edge is an undirected edge which represents the distance relationship between two nodes; each distance-edge includes the properties “min” and “max”, which define a minimum and a maximum distance value expressed in Angstroms (i.e. a distance range). A next-edge is a directed edge which specifies that a node X follows a node Y in the protein chain.

Figure 1 shows the graphical representation of the graph-based structural pattern for a C2H2-type Zinc Finger. Nodes are drawn as ellipses whose label determines their type. Distance-edges are represented as lines containing the label “distance” and a round square with the min and max properties. Next-edges are represented as arrows labelled with “next”. Properties (for nodes and edges) are described as expressions of the form property = "value". Note that the graph-based representation is a simple and natural way to describe and recognize the two-dimensional structure of a protein-ligand pattern. Moreover, the proposed model could be extended to represent other types of structural patterns.


Fig. 1. Graph-based structural pattern for a C2H2-type Zinc Finger.
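For illustration, a pattern like the one in Fig. 1 could be encoded as a labeled property graph along the following lines (a NetworkX sketch; the identifiers and exact property layout are our assumptions, not GSP4PDB internals):

import networkx as nx

pattern = nx.MultiDiGraph()
pattern.add_node("zn", kind="ligand", code="ZN")
for node, name in [("c1", "CYS"), ("c2", "CYS"), ("h1", "HIS"), ("h2", "HIS")]:
    pattern.add_node(node, kind="amino", name=name)

# Distance-edges (undirected in the model; stored here with a kind flag).
for aa in ("c1", "c2", "h1", "h2"):
    pattern.add_edge("zn", aa, kind="distance", min=0.5, max=7.0)

# Next-edge: the first cysteine precedes the second in the chain.
pattern.add_edge("c1", "c2", kind="next")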

3 GSP4PDB

Based on the notion of graph-based structural patterns, we have developed GSP4PDB, a bioinformatics tool that lets users design, search and analyze protein-ligand structural patterns inside the Protein Data Bank. GSP4PDB is available at https://structuralbio.utalca.cl/gsp4pdb/.

GSP4PDB is formed by three main elements: gsp4pdb-extractor, a Java tool which extracts and pre-processes data from PDB files; a PostgreSQL database system which stores and manages the protein data used by the application; and a web application which provides a graphical interface for designing and querying graph-based structural patterns. Next we describe the implementation details of these components.

3.1 Protein Data Extraction and Pre-processing

GSP4PDB was designed to work with data obtained from the Protein Data Bank (PDB) [13]. To this end, we have developed gsp4pdb-extractor, a command-line Java application which processes PDB files and exports the protein data to the PostgreSQL database. The single parameter of gsp4pdb-extractor is the directory where the PDB files are stored (we maintain a local copy of the PDB dataset using rsync). The current version is restricted to processing files encoded using the PDB format (*.pdb, *.ent or *.ent.gz). Initially, gsp4pdb-extractor explores the directory and prepares a list of files to be processed. This list is filtered according to the proteins already available in the PostgreSQL database; hence, each time gsp4pdb-extractor is executed, the PostgreSQL database is updated with the latest proteins published in the PDB repository. For each file (or protein) of the filtered list, gsp4pdb-extractor parses the file using BioJava (http://biojava.org/) and creates an object model of the protein. The main classes of the


model are Protein, SChain, Aminoacid, AminoStandard, AminoStandardList, Hetam (Ligand), AtomAmino and AtomHet. Although a protein can contain many chains, we restrict our analysis to the first one. During the creation of the object model, two distance measures (expressed in Angstroms) are pre-computed: DistanceAminoAmino, which is calculated as the distance between the alpha carbon atoms of two amino acids; and DistanceAminoHet, which corresponds to the distance between the alpha carbon atom of the amino acid and the center of mass of the ligand. Distances greater than 7.0 Angstroms are not considered. Note that PDB files do not contain information about atomic distances; after some empirical tests, we decided to pre-compute the distances in order to improve the performance of the system, as computing them at query time implies complex join operations for a relational database system. In addition to the distance relationships, we define the class NextAminoAmino to represent the sequential order between each pair of consecutive amino acids in the chain. After the object model of the protein is constructed, gsp4pdb-extractor loads the data into the PostgreSQL database using bulks of 1000 SQL instructions. Next we describe the relational model used to store and manage the protein data.
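A minimal sketch of the two pre-computed measures (plain numpy, not the Java extractor; the unweighted centroid stands in for the center of mass):

import numpy as np

CUTOFF = 7.0  # Angstroms; larger distances are discarded

def ca_ca_distance(ca1, ca2):
    # DistanceAminoAmino: alpha carbon to alpha carbon.
    return float(np.linalg.norm(np.asarray(ca1) - np.asarray(ca2)))

def ca_ligand_distance(ca, ligand_atoms):
    # DistanceAminoHet: alpha carbon to the ligand center (centroid here).
    center = np.asarray(ligand_atoms).mean(axis=0)
    return float(np.linalg.norm(np.asarray(ca) - center))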

3.2 Protein Data Storage

GSP4PDB uses a PostgreSQL database system (version 9.4) for storing and managing the protein data. The relational schema is given by the tables shown in Table 1. For each table we present its list of attributes and the corresponding number of rows. It is possible to see that a great part of the database corresponds to information related to distances, in particular distances between each pair of amino acids. Note that the number of chains is equal to the number of proteins, as we process only the first chain structure.

Table 1. Relational database schema used by GSP4PDB.

Table                  Attributes                                                 Row count
Protein                id, title, classification, organism, dep_date,             136,316
                       technique, mod_date
Chain                  id, protein_id, seqres, num_het, num_amino                 136,316
Aminoacid              id, chain_id, symbol, protein_id, next_amino               35,592,799
het                    id, chain_id, symbol, protein_id, num_atom                 475,353
distance_amino_amino   amino1_id, amino1_symbol, amino1_class, amino2_id,         182,941,759
                       amino2_symbol, amino2_class, min, max
distance_het_amino     het_id, het_symbol, amino_id, amino_symbol,                3,642,090
                       amino_class, min, max
next_amino_amino       amino1_id, amino1_symbol, amino1_class, amino2_id,         71,264,160
                       amino2_symbol, amino2_class


In practice, only the tables distance_amino_amino, distance_het_amino and next_amino_amino are necessary to search graph-based structural patterns. Note that these tables contain data from other tables, i.e. they introduce data redundancy; this unnormalized design is useful for improving query computation. It is important to note that the attributes min and max are used to represent the minimum and maximum distances between two components. Recall that these distances are pre-computed by gsp4pdb-extractor. The rest of the tables have been included to maintain additional information and for future developments.

In order to improve the response time of the database, we have created 12 B-tree indexes. Specifically, the attributes id and symbol of the three main tables were indexed. This is a preliminary configuration which we expect to improve in the future.

3.3 Web User Interface

GSP4PDB is based on a Web interface divided into two areas: the design area and the output area. The design area (Fig. 2) allows the user to “draw” a graph-based structural pattern. This area provides buttons for the three types of nodes

Fig. 2. Design area of GSP4PDB.


and the two types of edges. The graph-based structural pattern is shown in the middle, and auxiliary buttons are shown on the right-hand side. The structural pattern shown in Fig. 2 is formed by: an amino-node labeled “HIS-1”, where “HIS” is the 3-letter code of the amino acid and “1” is its node identifier; an any-node labeled “ANY-2”, where “2” is its node identifier; a ligand-node labeled “FE2”; a next-edge, drawn as an arrow labeled “next”; and two distance-edges, represented as dashed lines and labeled with the min and max distance values (where [0.5, 7.0] is the default distance range).

Fig. 3. Output area of GSP4PDB.

The output area (Fig. 3) shows the results of searching the graph-based structural pattern in the PostgreSQL database. Each result indicates the PDB ID of


the protein where the binding site has been found, and the substructure represented in textual form. The format of the results is given as follows:

Amino-Amino distance:
  (NodeLabel){AminoSymbol #Number}---dist---(NodeLabel){AminoSymbol #Number}

Ligand-Amino distance:
  [NodeLabel]{LigandSymbol #Number}---dist---(NodeLabel){AminoSymbol #Number}

Next Amino-Amino relationship:
  (NodeLabel){AminoSymbol #Number}---next--->(NodeLabel){AminoSymbol #Number}
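For downstream processing, such result lines can be parsed back into their fields; a small sketch (ours, not part of GSP4PDB) for the ligand-amino case:

import re

line = "[FE2]{FE2 #365}------6.3------(ANY.2){GLY #147}"
pattern = re.compile(r"\[(\w+)\]\{(\w+) #(\d+)\}-+([\d.]+)-+\(([\w.]+)\)\{(\w+) #(\d+)\}")
label, het, het_no, dist, amino_label, amino, amino_no = pattern.match(line).groups()
print(het, dist, amino)  # FE2 6.3 GLY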

In the example presented in Fig. 3, the Het-Amino distance edge [FE2]{FE2 #365}------6.3------(ANY.2){GLY #147} represents a distance of 6.3 Angstroms between the ligand FE2 number 365 and the amino acid GLY number 147, the latter corresponding to the ANY node occurring in the structural pattern. The results of the search can be downloaded in different formats (IDs, TXT, JSON). Additionally, each result can be analyzed graphically by clicking the 3D STRUCTURE button, which opens a popup with the JSmol visualization of the binding site.

As a real use case, Fig. 4 shows the GSP4PDB representation of the C2H2-type Zinc Finger presented in Fig. 1. The results for this structural pattern were computed in approximately 20 s. The usability of GSP4PDB was evaluated by researchers and students of the bioinformatics department at Universidad de Talca (Chile); the evaluation showed that the graph-based representation was very simple and intuitive for understanding protein-ligand interactions.

Fig. 4. A C2H2-type Zinc Finger “drawn” in GSP4PDB.

3.4 From Graph Patterns to SQL Queries

In this section we present a brief description of the method to transform a graph-based structural pattern into a SQL query expression. In general terms, the method generates a SQL sub-expression for each node-edge-node structure in the graph pattern; the final SQL query, expressing the complete graph pattern, is the composition of all the sub-expressions. The method defines transformations for the following node-edge-node structures:

1. Hetam - Distance (range) - Amino acid
2. Hetam - Distance (range) - ANY (amino acid)
3. Amino acid - Distance (range) - Amino acid
4. Amino acid - Distance (range) - ANY (amino acid)
5. ANY (amino acid) - Distance (range) - ANY (amino acid)
6. Amino acid - Next - Amino acid
7. Amino acid - Next - ANY (amino acid)
8. ANY (amino acid) - Next - Amino acid
9. ANY (amino acid) - Next - ANY (amino acid)

For instance, the SQL expression corresponding to the first node-edge-node structure is the following (the comparison operators in the range condition are reconstructed from the surrounding description of min/max distance ranges):

    SELECT het_id,
           amino_id AS amino[id amino]_id,
           amino_symbol AS amino[id amino]_symbol,
           min AS min_het_amino[id amino]
    FROM distance_het_amino
    WHERE het_symbol = '[het_symbol]'
      AND amino_symbol = '[amino_symbol]'
      AND ((min < [dmin] AND max >= [dmin])
        OR (min <= [dmax] AND max > [dmax])
        OR (min >= [dmin] AND max <= [dmax]))

The above SQL expression is a template for querying a distance relationship between a ligand and an amino acid. Note that the parameters of the template, represented with square brackets, should be replaced with values from the graph pattern in order to obtain the final SQL expression. For the sake of space, we do not present the rest of the transformations. We refer the reader to the complete documentation of GSP4PDB, which is available at http://renzoangles.net/gsp4pdb.
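As an illustration of how such a template could be instantiated, the following minimal Python sketch substitutes the bracketed parameters with values taken from a graph pattern. The fill_template helper and the example values are ours, not part of GSP4PDB; the template text follows the reconstructed example above.

```python
# Minimal sketch of template instantiation; all names are illustrative.
TEMPLATE = (
    "SELECT het_id, amino_id AS amino[id amino]_id, "
    "amino_symbol AS amino[id amino]_symbol, min AS min_het_amino[id amino] "
    "FROM distance_het_amino "
    "WHERE het_symbol = '[het_symbol]' AND amino_symbol = '[amino_symbol]' "
    "AND ((min < [dmin] AND max >= [dmin]) "
    "OR (min <= [dmax] AND max > [dmax]) "
    "OR (min >= [dmin] AND max <= [dmax]))"
)

def fill_template(template: str, params: dict) -> str:
    """Replace each [name] placeholder with its value from params."""
    sql = template
    for name, value in params.items():
        sql = sql.replace(f"[{name}]", str(value))
    return sql

# Values taken from the pattern of Fig. 2: ligand FE2 at distance
# [0.5, 7.0] from a histidine (node identifier 1).
query = fill_template(TEMPLATE, {
    "id amino": 1,
    "het_symbol": "FE2",
    "amino_symbol": "HIS",
    "dmin": 0.5,
    "dmax": 7.0,
})
print(query)
```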

4 Conclusions

This paper presents GSP4PDB, a bioinformatics tool that allows users to search protein-ligand interactions by using a simple and intuitive graphical representation based on graphs. The main elements of GSP4PDB (the pre-processing tool, the data storage system, and the web interface) were described. Currently, we are working on the optimization of the system. In particular, we are conducting empirical tests to improve the execution time of the PostgreSQL database.


As future work, we expect to extend the notion of protein-ligand structural patterns to support filters and advanced relationships (e.g. metal interaction geometries). Additionally, we will explore the use of big data technologies for storing and querying PDB data. In particular, we expect to use graph-based technologies such as Giraph, a graph processing framework built on top of Apache Hadoop.

Acknowledgments. Renzo Angles has funding from the Millennium Nucleus Center for Semantic Web Research under Grant NC120004. The first version of GSP4PDB was created by Diego Cisterna as part of his final engineering project at Universidad de Talca (Chile).


Computational Systems for Modelling Biological Processes

Predicting Disease Genes from Clinical Single Sample-Based PPI Networks

Ping Luo1, Li-Ping Tian2, Bolin Chen3, Qianghua Xiao4, and Fang-Xiang Wu1,5

1 University of Saskatchewan, Saskatoon S7N 5A9, Canada
[email protected]
2 Beijing Wuzi University, Beijing 101149, China
3 Northwestern Polytechnical University, Xi'an 710072, China
4 University of South China, HengYang 421001, China
5 School of Mathematical Sciences, Nankai University, Tianjin 300071, China

Abstract. Experimentally identifying disease genes is time-consuming and expensive, and thus it is appealing to develop computational methods for predicting disease genes. Many existing methods predict new disease genes from protein-protein interaction (PPI) networks. However, PPIs change during cells' lifetimes, and thus only using static PPI networks may degrade the performance of algorithms. In this study, we propose an algorithm for predicting disease genes based on centrality features extracted from clinical single sample-based PPI networks (dgCSN). Our dgCSN first constructs a single sample-based network from a universal static PPI network and the clinical gene expression of each case sample, and fuses them into one network according to the frequency of each edge appearing in all single sample-based networks. Then, centrality-based features are extracted from the fused network to capture the property of each gene. Finally, regression analysis is performed to predict the probability of each gene being disease-associated. The experiments show that our dgCSN achieves AUC values of 0.893 and 0.807 on Breast Cancer and Alzheimer's disease, respectively, which are better than two competing methods. Further analysis of the top 10 prioritized genes also demonstrates that dgCSN is effective for predicting new disease genes.

Keywords: Disease gene prediction · Single sample-based network · Protein-protein interaction network · Network centrality

1 Introduction

Identifying disease-associated genes helps us understand the mechanisms of diseases, which has a variety of applications, such as early disease diagnosis, disease treatment and drug development [1]. To determine whether a gene is disease-associated or not, scientists usually need to conduct a group of biological experiments, which require an enormous amount of money and time. Thus, choosing


the 'right' genes for experimental validation then becomes significant. Many studies have been conducted to develop computational algorithms for predicting or prioritizing disease genes, so that scientists can optimize the experimental validation according to the results of the algorithms, which maximizes the yield of the experiments. Among the existing methods, some use the known disease genes as candidates to search for new disease genes in the human genome. For instance, Yang et al. inferred disease-gene associations by combining protein complexes and known phenotype-gene associations [2]. Chen et al. predicted cancer-associated genes by a two-step strategy which chose genes with high probabilities of being cancer-associated in the first step and then further identified genes associated with a specific cancer in the second step [3,4]. Tang et al. prioritized disease genes by incorporating the subcellular localization information into the protein-protein interaction (PPI) networks [5]. Yang et al. trained a biased SVM to classify potential disease genes [6]. Other methods treated all the genes equally and prioritized a set of genes as candidates, such as dmGWAS [7] and Endeavour [8,9]. In the meantime, newly proposed studies tend to employ clinical data in their algorithms to improve the prediction accuracy. For example, Wang et al. searched gene modules from a PPI network weighted by GWAS and gene expression data [10]. These algorithms are successful, and most of them use PPI networks as one of the data sources. However, PPIs change during the lifetime of cells. Only using the static PPI networks downloaded from online databases may degrade the performance of algorithms. One common solution is to integrate static PPI networks with clinical data, such as GWAS and gene expression data. These clinical data are employed to weight the nodes or edges of the PPI networks, and studies have shown that this strategy is useful for predicting disease-associated genes [10–12]. However, such integration only strengthens those PPIs that have significant roles in the disease samples, while other PPIs, which may never exist in the real tissue, are still used in the algorithm. Changing the PPI networks used in an algorithm strongly affects its results. In recent years, many studies on dynamic PPIs have been reported. Studies have shown that utilizing dynamic PPI networks (DPINs) can significantly improve the accuracy of the original algorithms [13–15]. These DPINs are constructed from time series gene expression data, and each network at a time point corresponds to a sample. Although it is hard to obtain time series gene expression data from disease samples, we can use the idea of constructing DPINs to construct single sample-based PPI networks with clinical data. Each clinical sample is used to construct a single sample-based PPI network, which should contain PPIs that are more specific to that sample, and thus less noisy, than those of a universal static PPI network. It is expected that the results from these single sample-based PPI networks should be more valuable than those from the original universal static PPI network.


In this study, we propose an algorithm (called dgCSN) to predict disease genes from single sample-based PPI networks. A single sample-based network is first constructed for each clinical case sample. These single sample-based networks only contain PPIs that have a high chance of existing in the disease tissues. Then, all the single sample-based networks are merged together to form a fused network, which is used to extract network centrality-based features for the prediction. In addition to the approach to construct single sample-based networks, we also extend our previous studies in [12,16–18], and define two more kinds of centrality-based features to capture the properties for discriminating disease genes and non-disease genes. Finally, the probability of each gene being disease-associated is calculated by a logistic regression model. The work flow of dgCSN is depicted in Fig. 1.

Fig. 1. The work flow of dgCSN.

2 Method and Materials

2.1 Problem Formulation

Disease gene prediction can be formulated as a network labeling problem on a biomolecular network in which disease genes are labeled as 1 while non-disease genes are labeled as 0 [16]. Let g1, g2, ..., gh denote the h genes in the human genome. A set of binary labels x = (x1, x2, ..., xh) of these h genes is known as a configuration of the biomolecular network, and the set of all possible configurations X is a random field.


In this study, we propose a generalized model to predict disease genes. Given a prior configuration x, the posterior probability of gene gi being labeled as 1 is computed as

$$P(x_i = 1 \mid x_{[-i]}, \theta) = \frac{\exp(\theta \phi_i)}{1 + \exp(\theta \phi_i)} \qquad (1)$$

where θ is a parameter vector and φi is the feature vector of gi extracted from a biomolecular network. In our previous studies [4,12,17], φi was set to (1, Ni0, Ni1), where Ni0 and Ni1 are the numbers of neighbors of node i with label 0 and 1, respectively. The values of Ni0 and Ni1 can be computed as follows:

$$N_i^0 = \sum_{j \in nbr_i} (1 - x_j), \qquad N_i^1 = \sum_{j \in nbr_i} x_j \qquad (2)$$

where nbri is the set containing all neighbors of node i in the network. Notice that Ni0 and Ni1 can be regarded as two special types of node degrees: 0-degree and 1-degree. Degree is a typical node centrality index which characterizes the property of nodes in the network. Although our previous results have shown that features extracted according to 0-degree and 1-degree can be used to predict the probability of genes being disease-associated, degree centrality only characterizes the relationships between genes and their neighbors, which cannot reveal the whole topological structure of a network. To further improve the accuracy of the prediction, more valuable features should be incorporated into the general model. Considering that different centrality indices reveal distinct topological properties of nodes in a network, it is reasonable that features extracted via other centrality indices can more precisely capture the properties of nodes (genes) in a biomolecular network than the original 0–1 degree. The strategy of 0–1 degree can also be easily extended to other node centrality indices. According to [19], the global relationships within a biomolecular network are useful for predicting disease genes. In addition, under the structural equivalence hypothesis [20], nodes that have similar structural roles in networks should have similar properties. Thus, features that characterize the structural roles of genes in a biomolecular network should also be useful for predicting disease genes. Considering the above discussions, in this study, we define the following two more kinds of 0–1 based centrality indices to characterize both global information and local structure information from biomolecular networks.

Closeness Centrality. Closeness centrality measures the degree to which an element is close to all other elements in a network. It captures the global relationships between a node and all other nodes in the network. This specialty makes closeness centrality capable of seizing global information among nodes in a network. The 0-closeness and 1-closeness centrality of node i are defined as

$$C_i^0 = \frac{1}{n_0} \sum_{j \neq i} \frac{1 - x_j}{d(i,j)}, \qquad C_i^1 = \frac{1}{n_1} \sum_{j \neq i} \frac{x_j}{d(i,j)} \qquad (3)$$


where j is any node in the network other than i, d(i, j) is the length of the shortest path between nodes i and j, and n0 and n1 are the numbers of nodes labeled as 0 and 1 (except i), respectively. d(i, j) is set to ∞ if there is no path between nodes i and j.

Edge Clustering Coefficient. The edge clustering coefficient was first developed to identify communities in complex networks [21]. Then, this notion was applied to infer essential genes from PPI networks [22]. Existing studies have shown that the edge clustering coefficient is more efficient than the traditional node-based clustering coefficient in characterizing the local structural roles of nodes and edges in a network. Nodes with similar surrounding edge clustering coefficients may have similar functionalities [22]. In this study, the 0-edge clustering coefficient and 1-edge clustering coefficient of node i are defined as follows. If $d_i \neq 1$ and $d_j \neq 1$:

$$NE_i^0 = \sum_{j \in nbr_i} \frac{z_{i,j}}{\min(d_i - 1, d_j - 1)} (1 - x_j), \qquad NE_i^1 = \sum_{j \in nbr_i} \frac{z_{i,j}}{\min(d_i - 1, d_j - 1)} x_j \qquad (4)$$

Otherwise:

$$NE_i^0 = NE_i^1 = 0 \qquad (5)$$

where $z_{i,j}$ denotes the number of triangles built on edge (i, j), and $d_i$ and $d_j$ are the degrees of nodes i and j, respectively. The term $\min(d_i - 1, d_j - 1)$ is the maximal possible number of triangles that can be built on (i, j). Finally, for each gene gi, its corresponding feature vector is defined as:

$$\phi_i = (1, N_i^0, N_i^1, C_i^0, C_i^1, NE_i^0, NE_i^1) \qquad (6)$$

where 1 is a dummy feature and the remaining entries are the three types of 0–1 centrality indices. Then, θ in Eq. (1) can be estimated by training a logistic regression model with the feature vectors of genes in the benchmark. In this study, we use scikit-learn [23] to train the model.
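A minimal sketch of how these 0–1 centrality features could be computed with networkx and fed to scikit-learn is shown below. It is our illustration of Eqs. (2)–(6) under the stated definitions, not the authors' implementation, and all function names are ours.

```python
import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(G, labels, i):
    """Compute phi_i = (1, N0, N1, C0, C1, NE0, NE1) for node i (Eqs. 2-6)."""
    nbrs = list(G.neighbors(i))
    n0_deg = sum(1 - labels[j] for j in nbrs)   # 0-degree
    n1_deg = sum(labels[j] for j in nbrs)       # 1-degree

    # 0/1-closeness (Eq. 3): unreachable nodes contribute 1/inf = 0.
    dist = nx.single_source_shortest_path_length(G, i)
    n0 = sum(1 for j in G if j != i and labels[j] == 0)
    n1 = sum(1 for j in G if j != i and labels[j] == 1)
    c0 = sum((1 - labels[j]) / d for j, d in dist.items() if j != i) / max(n0, 1)
    c1 = sum(labels[j] / d for j, d in dist.items() if j != i) / max(n1, 1)

    # 0/1-edge clustering coefficient (Eqs. 4-5).
    ne0 = ne1 = 0.0
    di = G.degree(i)
    for j in nbrs:
        dj = G.degree(j)
        if di != 1 and dj != 1:
            z = len(list(nx.common_neighbors(G, i, j)))  # triangles on (i, j)
            ecc = z / min(di - 1, dj - 1)
            ne0 += ecc * (1 - labels[j])
            ne1 += ecc * labels[j]
    return [1.0, n0_deg, n1_deg, c0, c1, ne0, ne1]

def train(G, labels, benchmark):
    """Fit the logistic model of Eq. (1) on the benchmark genes."""
    X = np.array([features(G, labels, g) for g in benchmark])
    y = np.array([labels[g] for g in benchmark])
    return LogisticRegression().fit(X, y)
```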

2.2 Single Sample-Based Networks

The features used to estimate θ are extracted from a fused network, which merges a group of clinical single sample-based networks. These networks are constructed by combining a universal static PPI network with the gene expression data of all clinical samples. A gene gi in case sample k is considered activated when its expression value is larger than or equal to λ-fold of its mean expression value over all the control samples. In other words, gi is activated if

$$mcase[i, k] \geq \lambda \cdot mean(mcntl[i]) \qquad (7)$$


where mcase[i, k] is the expression value of gi in sample k, and mean(mcntl[i]) is the mean expression value of gi over all control samples. For each edge (i, j) in the universal static PPI network, if both gi and gj are activated in sample k, this edge exists in the network corresponding to sample k. Finally, k single sample-based networks are built for the k case samples. Let G1, G2, ..., Gk represent the k single sample-based networks. After obtaining these k networks, we fuse them into one single network. Although many network fusion methods have been proposed, most of them require a lot of time to fuse PPI networks with more than 10,000 nodes and 200,000 edges, not to mention the number of single sample-based networks, which is more than 1000 in our study. Thus, we use the following efficient strategy to fuse all the single sample-based networks into a network Gf. Let fij be the number of times edge (i, j) appears in all k single sample-based networks. When fij < ε, the edge (i, j) is not in Gf, and when fij ≥ ε, the edge (i, j) is in Gf. This strategy determines which edges should exist in the fused network because an edge (i, j) in a single sample-based network might appear by chance (an expression value higher than λ · mean(mcntl[i]) that is not caused by the disease under consideration) unless edge (i, j) appears in at least ε single sample-based networks. In other words, an edge (i, j) is considered significant only when it appears in at least ε single sample-based networks. To determine λ and ε, the algorithm is tested with λ = 1, 1.25, 1.5, 1.75, 2 and ε = 1, 2, 3, 4, 5, 10, respectively. We choose the parameters with the highest area under the receiver operating characteristic (ROC) curve (AUC). Meanwhile, fused PPI networks containing less than 90% of the known disease genes are abandoned. Finally, λ = 1.75, ε = 3 is chosen for Breast Cancer (BC) and λ = 1, ε = 1 is chosen for Alzheimer's disease (AD). The performance of dgCSN with the various parameter combinations is shown in Table 1, where '-' denotes that the corresponding parameter combination cannot generate a network containing more than 90% of the known disease genes.

Table 1. The AUC value of dgCSN with all combinations of parameters

λ      BC                                              AD
       ε=1    ε=2    ε=3    ε=4    ε=5    ε=10         ε=1    ε=2    ε=3    ε=4   ε=5   ε=10
1      0.868  0.866  0.865  0.862  0.864  0.866        0.807  0.792  0.792  -     -     -
1.25   0.873  0.873  0.875  0.877  0.879  0.884        -      -      -      -     -     -
1.5    0.885  0.883  0.851  0.868  0.872  0.885        -      -      -      -     -     -
1.75   0.887  0.891  0.893  0.885  0.849  0.796        -      -      -      -     -     -
2      0.881  0.851  0.844  0.822  0.798  -            -      -      -      -     -     -

The ε used for AD is smaller than that for BC mainly because the number of AD case samples is much smaller than that of BC. It is worth noting that although we have 1102 BC case samples, the algorithm still performs best when ε = 3. This


is mainly caused by the following reasons. Although more than 30 genes have been identified as BC associated, the formation of the tumor in a patient is not caused by all the disease genes. Each patient develops cancer because of the malfunction of a different subgroup of disease genes. The reason that an edge (a, b) appears in only 3 single sample-based networks might be that most of the samples we have are not caused by the disease genes connected with (a, b). Therefore, if we want to predict all the BC associated genes, these genes should be treated equally, and an edge should be kept in Gf even if it only exists in 3 single sample-based networks.
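The construction and fusion steps can be rendered as a short sketch that applies the activation rule of Eq. (7) and the ε frequency threshold. The variable names (expr_case, expr_control, etc.) are ours, the code assumes the PPI nodes are the row indices of the expression matrices, and it is only a schematic illustration of the procedure described above.

```python
import networkx as nx
import numpy as np

def fuse_networks(ppi, expr_case, expr_control, lam, eps):
    """Build one single sample-based network per case sample (Eq. 7)
    and fuse them by edge frequency; expr_case and expr_control are
    genes x samples matrices, ppi is the universal static network."""
    control_mean = expr_control.mean(axis=1)   # mean over control samples
    counts = {}                                # edge frequencies f_ij
    for k in range(expr_case.shape[1]):
        # A gene is activated if its expression is >= lambda-fold of its
        # mean expression over the control samples.
        active = set(np.where(expr_case[:, k] >= lam * control_mean)[0])
        for i, j in ppi.edges():
            if i in active and j in active:
                counts[(i, j)] = counts.get((i, j), 0) + 1
    # Keep an edge only if it appears in at least eps single sample networks.
    fused = nx.Graph()
    fused.add_edges_from(e for e, f in counts.items() if f >= eps)
    return fused
```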

2.3 Network Labeling

Before feature extraction, all the genes are labeled based on their associations with the disease under consideration. Disease genes are labeled as 1 while unknown genes are labeled as 0. Let dt represent the disease under consideration. Disease genes (positive instances) are the genes known to be associated with dt, while unknown genes are the genes whose relationships with dt are still unknown. To perform logistic regression, we also need negative instances, which are non-disease genes in our study. Considering that no databases contain non-disease genes, we selected them based on disease-gene association data downloaded from OMIM [24] with the strategy used in [25]. Concretely, to select non-disease genes, we build a disease gene network (DGN) based on OMIM data. In this network, each node represents either a disease or a disease-associated gene. A disease node is connected with its associated gene nodes, and two disease nodes are connected if they share at least one associated gene. If two diseases are close to each other in the DGN, they (or their neighbor diseases) may share at least one disease gene. If two diseases share some disease genes, they may have similar mechanisms. It is therefore reasonable that if the length of the shortest path between two diseases is larger than or equal to a threshold η, they may not have similar mechanisms. Then, the disease genes of one disease can be regarded as non-disease genes of the other disease. Since each disease is connected with its associated disease genes, for dt in the DGN, the shortest paths between dt and its non-disease genes should be longer than η. η is empirically set to 5 in this study. If there is no path between a gene and dt in the DGN, we also select it as a non-disease gene of dt. All the selected non-disease genes form a set Snon, from which we randomly select a number of non-disease genes for the cross validation; a sketch of this selection is shown below.
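The following is a minimal illustration of the non-disease gene selection, assuming a networkx graph dgn containing disease nodes and gene nodes as described above; the function name and arguments are ours.

```python
import networkx as nx

def select_non_disease_genes(dgn, d_t, gene_nodes, eta=5):
    """Return S_non: genes whose shortest path to disease d_t in the DGN
    is longer than eta, or which have no path to d_t at all."""
    # Distances up to eta; genes absent from this dict are farther than
    # eta or unreachable, and both cases qualify them as non-disease genes.
    dist = nx.single_source_shortest_path_length(dgn, d_t, cutoff=eta)
    return {g for g in gene_nodes if g not in dist}
```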

2.4 Data Sources

The algorithm is evaluated with BC and AD in this study. The BC associated genes are collected from the Cancer Gene Census category (CGC, http://cancer.sanger.ac.uk/census) [26]; 35 BC associated genes are used as the benchmark. The AD associated genes are collected from MalaCards: The human disease database (http://www.malacards.org/), which contains 182 potential AD associated genes. These genes are ranked by their probabilities of being AD associated.


We select the top 50 genes, among which 39 exist in the PPI network and are used as the benchmark. The BC gene expression data are downloaded from the NCI's Genomic Data Commons (GDC) [27]. GDC measures the data by the RNA-Seq technique. We download the raw mapping count files, which contain 1102 case samples and 113 control samples. The AD expression data are downloaded from Gene Expression Omnibus (GSE53697) [28], which are also measured by the RNA-Seq technique (raw counts). The count values are normalized by DESeq2 [29]. We choose this normalization method because DESeq2 has been shown to be the best algorithm for RNA-Seq data normalization for cross-sample comparison [30]. The PPI network is obtained from the InWeb InBioMap database (version 2016_09_12) [31], which consists of more than 600,000 interactions aggregated from eight source databases. We choose this database rather than other PPI databases because it contains the most comprehensive set of PPIs, which reduces the chance of missing any valuable PPIs during the construction of the single sample-based networks. Meanwhile, our network construction strategy can also filter out false PPIs in the database. We map the proteins in the network to their corresponding genes, and remove those genes that have no expression data from the network. After applying the method proposed in Sect. 2.2, the fused network contains 15552 genes for BC and 13473 genes for AD.

3 Experiments and Results

3.1 Evaluation Criteria

To evaluate our algorithm, the ROC curve and AUC value are used to measure the performance of dgCSN based on leave-one-out cross validation (LOOCV). We choose LOOCV because the number of genes used as the benchmark is small, and LOOCV is suitable for small datasets. The ROC curve is the plot of the true positive rate (TPR) against the false positive rate (FPR) at various thresholds. The TPR and FPR are defined as follows:

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{TN + FP} \qquad (8)$$

where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, respectively. In this study, a true positive is a disease gene predicted as a disease gene, a false positive is a non-disease gene predicted as a disease gene, a true negative is a non-disease gene predicted as a non-disease gene, and a false negative is a disease gene predicted as a non-disease gene. The ROC curve features the true positive rate on the Y axis and the false positive rate on the X axis, which makes the top left corner of the plot an ideal point, with a FPR of 0 and a TPR of 1; it also means that a method with a larger area under the ROC curve (AUC) is better. To maintain the balance of the features, for each disease, we perform under-sampling, which randomly selects m genes from the benchmark non-disease gene


set Snon as non-disease genes. m is the number of benchmark disease genes for each disease, which is 35 for BC and 39 for AD. During the LOOCV, a disease gene or a non-disease gene is assumed to be unknown in each round and its probability of being disease-associated is computed by the logistic model trained on the features of the other (2m − 1) genes. An AUC value is then calculated based on the prediction results of the 2m disease genes and non-disease genes. This process is performed 1000 times, and the average AUC value is regarded as the AUC value of the results. We also rank the unknown genes in descending order by their probabilities calculated by Eq. (1). The top 10 unknown genes are further analyzed from published literature to reveal the ability of dgCSN to predict new disease genes.
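The evaluation loop can be summarized as follows. The sketch assumes a feature matrix X and label vector y for the 2m benchmark genes, and is our paraphrase of the described procedure rather than the authors' code; sample_features and sample_labels are hypothetical helpers for one random under-sampling round.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut

def loocv_auc(X, y):
    """Leave-one-out cross validation: each gene is held out once and scored
    by a logistic model trained on the remaining (2m - 1) genes."""
    scores = np.zeros(len(y))
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = LogisticRegression().fit(X[train_idx], y[train_idx])
        scores[test_idx] = model.predict_proba(X[test_idx])[:, 1]
    return roc_auc_score(y, scores)

# Repeated with 1000 random negative samples; the average AUC is reported.
# aucs = [loocv_auc(sample_features(), sample_labels()) for _ in range(1000)]
# print(np.mean(aucs))
```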

3.2 Results of the AUC

In terms of AUC value, we compare dgCSN with the Re-balancing algorithm (denoted as 'Re-balanced') of Chen et al. [4] and the AIDG algorithm (denoted as 'AIDG') of Tang et al. [5]. Re-balanced has been shown to be better than many previous methods [3], such as RWR [19] and the DIR method [32]. AIDG has been shown to be better than methods such as DADA [33] and ToppNet [34].

(a) BC

(b) AD

Fig. 2. The ROC curves and AUC values

Figure 2 depicts the ROC curves and AUC values for BC and AD. dgCSN reaches 0.893 and 0.807 for BC and AD, respectively. Re-balanced achieves 0.814 and 0.540, and AIDG reaches 0.878 and 0.773 for BC and AD, respectively. The performance of dgCSN is better than Re-balanced and AIDG, especially on the AD set, where the AUC values of Re-balanced and AIDG are both less than 0.8. It is worth noting that Re-balanced was developed to predict cancer-related genes, which is the reason why its performance on AD is almost random, even though its rationale in principle allows it to predict non-cancer disease genes such as those of AD.

3.3 Top 10 Genes Analysis

To further evaluate our algorithm dgCSN, we rank the unknown genes by their probabilities of being disease-associated in descending order and search for the top 10 genes in existing publications. Most of them have been studied as oncogenes in previous studies. For BC, among the top 10 genes, 'ATR' and 'ATM' were identified as potential therapeutic targets [35], and 'RBBP8' was identified as a potential biomarker [36]. For AD, 2 of the top 10 unknown genes ('PRNP', 'DYRK1A') exist in the 182 genes collected from MalaCards, and 'DYRK1A' is identified as a potential therapeutic target in [37]. These analyses reveal that the results of our algorithm are in concert with other existing studies, suggesting that dgCSN is valuable for predicting new disease genes.

4 Conclusion

In this study, we have presented a disease gene prediction algorithm which employs single sample-based PPI networks and centrality-based features. The method first constructs subnetworks of a universal static PPI network with gene expression data from clinical samples and fuses them into one network. Logistic regression is then performed on the centrality-based features extracted from the fused network to predict the probability of each gene being disease-associated. Evaluations conducted on BC and AD reveal that our algorithm is more effective than previous methods. Further analyses of the top predicted disease genes also illustrate that dgCSN is powerful for predicting new disease genes. In the future, we could use the number of times an edge exists in all single sample-based networks to generate a weighted fused network. An appropriate weighting strategy should further improve the performance of dgCSN. In the meantime, we could also build different types of single sample-based networks by employing other types of clinical data.

Acknowledgments. This work is supported in part by the Natural Science and Engineering Research Council of Canada (NSERC), the China Scholarship Council (CSC), the National Natural Science Foundation of China under Grants No. 61571052 and No. 61602386, and the Natural Science Foundation of Shaanxi Province under Grant No. 2017JQ6008.

References 1. Moody, S.E., Boehm, J.S., Barbie, D.A., Hahn, W.C.: Functional genomics and cancer drug target discovery. Curr. Opin. Mol. Ther. 12(3), 284–293 (2010) 2. Yang, P., Li, X., Wu, M., Kwoh, C.K., Ng, S.K.: Inferring gene-phenotype associations via global protein complex network propagation. PLoS ONE 6(7), e21502 (2011) 3. Chen, B., Shang, X., Li, M., Wang, J., Wu, F.X.: A two-step logistic regression algorithm for identifying individual-cancer-related genes. In: 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 195–200. IEEE (2015)


4. Chen, B., Shang, X., Li, M., Wang, J., Wu, F.X.: Identifying individual-cancer-related genes by rebalancing the training samples. IEEE Trans. Nanobiosci. 15(4), 309–315 (2016) 5. Tang, X., Hu, X., Yang, X., Sun, Y.: An algorithm for identifying disease genes by incorporating the subcellular localization information into the protein-protein interaction networks. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 308–311. IEEE (2016) 6. Yang, P., Li, X.L., Mei, J.P., Kwoh, C.K., Ng, S.K.: Positive-unlabeled learning for disease gene identification. Bioinformatics 28(20), 2640–2647 (2012) 7. Jia, P., Zheng, S., Long, J., Zheng, W., Zhao, Z.: dmGWAS: dense module searching for genome-wide association studies in protein-protein interaction networks. Bioinformatics 27(1), 95–102 (2011) 8. Aerts, S., Lambrechts, D., Maity, S., Van Loo, P., Coessens, B., De Smet, F., Tranchevent, L.C., De Moor, B., Marynen, P., Hassan, B., et al.: Gene prioritization through genomic data fusion. Nat. Biotechnol. 24(5), 537–544 (2006) 9. Tranchevent, L.C., Ardeshirdavani, A., ElShal, S., Alcaide, D., Aerts, J., Auboeuf, D., Moreau, Y.: Candidate gene prioritization with Endeavour. Nucleic Acids Res. 44, W117–W121 (2016). https://doi.org/10.1093/nar/gkw365 10. Wang, Q., Yu, H., Zhao, Z., Jia, P.: EW_dmGWAS: edge-weighted dense module search for genome-wide association studies and gene expression profiles. Bioinformatics 31, 2591–2594 (2015). https://doi.org/10.1093/bioinformatics/btv150 11. Hou, L., Chen, M., Zhang, C.K., Cho, J., Zhao, H.: Guilt by rewiring: gene prioritization through network rewiring in genome wide association studies. Hum. Mol. Genet. 23(10), 2780–2790 (2014) 12. Luo, P., Tian, L.P., Ruan, J., Wu, F.X.: Identifying disease genes from PPI networks weighted by gene expression under different conditions. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1259–1264. IEEE (2016) 13. Wang, J., Peng, X., Li, M., Pan, Y.: Construction and application of dynamic protein interaction network based on time course gene expression data. Proteomics 13(2), 301–312 (2013) 14. Meng, X., Li, M., Wang, J., Wu, F.X., Pan, Y.: Construction of the spatial and temporal active protein interaction network for identifying protein complexes. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 631–636. IEEE (2016) 15. Chen, B., Fan, W., Liu, J., Wu, F.X.: Identifying protein complexes and functional modules from static PPI networks to dynamic PPI networks. Brief. Bioinform. 15(2), 177–194 (2013) 16. Chen, B., Wang, J., Li, M., Wu, F.X.: Identifying disease genes by integrating multiple data sources. BMC Med. Genomics 7(Suppl. 2), S2 (2014) 17. Chen, B., Li, M., Wang, J., Wu, F.X.: Disease gene identification by using graph kernels and Markov random fields. Sci. China Life Sci. 57(11), 1054–1063 (2014) 18. Chen, B., Li, M., Wang, J., Shang, X., Wu, F.X.: A fast and high performance multiple data integration algorithm for identifying human disease genes. BMC Med. Genomics 8(Suppl. 3), S2 (2015) 19. Köhler, S., Bauer, S., Horn, D., Robinson, P.N.: Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet. 82(4), 949–958 (2008) 20. Hoff, P.D., Raftery, A.E., Handcock, M.S.: Latent space approaches to social network analysis. J. Am. Stat. Assoc. 97(460), 1090–1098 (2002) 21. Radicchi, F., Castellano, C., Cecconi, F., Loreto, V., Parisi, D.: Defining and identifying communities in networks. Proc. Natl. Acad. Sci. U.S.A. 101(9), 2658–2663 (2004)


22. Wang, J., Li, M., Wang, H., Pan, Y.: Identification of essential proteins based on edge clustering coefficient. IEEE/ACM Trans. Comput. Biol. Bioinf. 9(4), 1070–1080 (2012) 23. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011) 24. McKusick, V., et al.: Online Mendelian Inheritance in Man (OMIM). McKusick-Nathans Institute for Genetic Medicine, Johns Hopkins University. National Center for Biotechnology Information, National Library of Medicine, Bethesda (2004). http://www.ncbi.nlm.nih.gov/omim/ 25. Luo, P., Tian, L.P., Ruan, J., Wu, F.: Disease gene prediction by integrating PPI networks, clinical RNA-Seq data and OMIM data. IEEE/ACM Trans. Comput. Biol. Bioinf. (2017) 26. Forbes, S.A., Beare, D., Boutselakis, H., Bamford, S., Bindal, N., Tate, J., Cole, C.G., Ward, S., Dawson, E., Ponting, L., et al.: COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 45, D777–D783 (2016). https://doi.org/10.1093/nar/gkw1121 27. Grossman, R.L., Heath, A.P., Ferretti, V., Varmus, H.E., Lowy, D.R., Kibbe, W.A., Staudt, L.M.: Toward a shared vision for cancer genomic data. N. Engl. J. Med. 375(12), 1109–1112 (2016) 28. Scheckel, C., Drapeau, E., Frias, M.A., Park, C.Y., Fak, J., Zucker-Scharff, I., Kou, Y., Haroutunian, V., Ma'ayan, A., Buxbaum, J.D., et al.: Regulatory consequences of neuronal ELAV-like protein binding to coding and non-coding RNAs in human brain. Elife 5, e10421 (2016) 29. Love, M.I., Huber, W., Anders, S.: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15(12), 550 (2014) 30. Dillies, M.A., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, N., Keime, C., Marot, G., Castel, D., Estelle, J., et al.: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief. Bioinform. 14(6), 671–683 (2013) 31. Li, T., Wernersson, R., Hansen, R.B., Horn, H., Mercer, J., Slodkowicz, G., Workman, C.T., Rigina, O., Rapacki, K., Stærfeldt, H.H., et al.: A scored human protein-protein interaction network to catalyze genomic interpretation. Nat. Methods 14(1), 61–64 (2016) 32. Chen, Y., Wang, W., Zhou, Y., Shields, R., Chanda, S.K., Elston, R.C., Li, J.: In silico gene prioritization by integrating multiple data sources. PLoS ONE 6(6), e21137 (2011) 33. Erten, S., Bebek, G., Ewing, R.M., Koyutürk, M.: DADA: degree-aware algorithms for network-based disease gene prioritization. BioData Min. 4(1), 19 (2011) 34. Chen, J., Bardes, E.E., Aronow, B.J., Jegga, A.G.: ToppGene suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res. 37(Suppl. 2), W305–W311 (2009) 35. Weber, A.M., Ryan, A.J.: ATM and ATR as therapeutic targets in cancer. Pharmacol. Ther. 149, 124–138 (2015) 36. Soria-Bretones, I., Sáez, C., Ruíz-Borrego, M., Japón, M.A., Huertas, P.: Prognostic value of CtIP/RBBP8 expression in breast cancer. Cancer Med. 2(6), 774–783 (2013) 37. Stotani, S., Giordanetto, F., Medda, F.: DYRK1A inhibition as potential treatment for Alzheimer's disease. Future Med. Chem. 8(6), 681–696 (2016)

Red Blood Cell Model Validation in Dynamic Regime

Kristína Kovalčíková1, Alžbeta Bohiniková1, Martin Slavík1, Isabelle Mazza Guimaraes2, and Ivan Cimrák1

1 Department of Software Technology, Faculty of Management Science and Informatics, University of Žilina, Žilina, Slovakia
[email protected]
http://cell-in-fluid.fri.uniza.sk
2 Science and Biotechnology Graduate Program, Fluminense Federal University, Niterói, Brazil

Abstract. Our work is set in the area of microfluidics and deals with the behavior of fluid and blood cells in microfluidic devices. The aim of this article is to validate our numerical model of the red blood cell. This is done by comparing computer simulations with an existing laboratory experiment. The experiment explores the velocity and deformation of blood cells in a hyperbolic microchannel. Our research confirms that the deformation of the red blood cell in the simulation is comparable with the results from the experiment, as long as the fluid velocity profile in the simulation fits the fluid velocity profile of the experiment. This validates the elastic parameters of the red blood cell model.

Keywords: Red blood cell · Microfluidics · Computational model · Fluid simulation

1 Introduction

Microfluidics, as one of the emerging fluid technologies, enables investigation of liquid flow in devices with very small dimensions. Many groups are exploring its potential in medically oriented domains, for example in diagnostics, treatment of diseases, drug development, and other fields. Developing and testing microchips in the laboratory is expensive and time-consuming. A solution to this problem is computational modeling of microchips and the blood flow inside them. This approach allows us to test multiple topologies of such devices, and to choose the best topology of a microfluidic channel.

K. Kovalčíková, A. Bohiniková, M. Slavík and I. Cimrák—This work was supported by the Ministry of Education, Science, Research and Sport of the Slovak Republic under the contract No. VEGA 1/0643/17 and by the Slovak Research and Development Agency under the contract No. APVV-15-0751.


There are different ways to model blood flow, and the choice of method depends on the scale that is important for a given investigation. The first approach considers the blood as a uniform liquid. It is suitable for exploring devices which deal with relatively large amounts of blood, and in which the size of the blood channels is an order of magnitude greater than the size of the blood cells. The second approach considers blood as a suspension of cells, where cells are modelled as non-deformable particles. This approach could be used, for example, in the investigation of blood sedimentation. The third approach considers each single cell as a deformable object, which has a defined relaxed shape and size. This approach can be used in the investigation of microcirculation, where the size of a blood cell is comparable to the size of the blood vessels. Another application of this approach is in the development of microchips. In our work, we use a model developed in [2] where each cell is considered as a deformable object. The implementation of this model has been added to the open-source software Espresso [3]. In this model, the flow of a liquid (blood plasma, or another laboratory solution) is calculated so that it interacts with the objects representing blood cells. This is done using the lattice-Boltzmann method. The deformability of the cell is ensured by several visco-elastic coefficients, which define the behavior of the membrane, for example its stiffness. Development and analyses of this model were done in [7,8] and also [9]. First, in order to obtain a reliable modeling tool, we need to calibrate and then validate these coefficients for a single blood cell. The visco-elastic coefficients were set using the stretching experiment of Dao [4,5]. The calibrated parameters are in Table 2. The details of the calibration process will be published in another paper. The main objective of this paper is to validate the calibrated parameters. The validation of the elastic coefficients is performed by comparing the results of numerical models with existing laboratory experiments. The laboratory experiment which we use for this purpose was described in [1]. To fulfill our aim and to perform the validation, we focus on the following goals in this article:

1. Velocity validation of a cell entering and exiting the constriction
2. Cell deformation in the slow-changing regime
3. Cell deformation in the fast-changing regime

In the first part of this article, we describe the single cell model which is to be validated. Next, we point out the model parameters to be validated, such as the elastic coefficients of the red blood cells (RBC), and we explain how their values were previously obtained. After that, the description of the biological experiment from article [1] is provided. Then we explain the computational settings of our numerical simulations. Next, the results of our research are presented, and the conclusions and discussion are provided at the end.

2 Single Cell Model

In this section, we describe our model. It has two principal components: the fluid velocity field, and the objects immersed in the flow. We describe the


model parameters which are necessary to define the behavior of the simulation components. The liquid is modeled by the lattice-Boltzmann method [6]. The physical parameters of the liquid which are important for the numerical simulation are density and viscosity. The cells in our model are considered as deformable vesicles defined by their membrane. Inside the vesicles, a liquid with the same properties as the outer liquid is present. The membrane itself is composed of a triangular mesh of points distributed over its surface, and the bonds between those mesh points. The deformability of the cell is defined by five elastic parameters. The first one is the stretching coefficient, which defines the rigidity of the bonds between the surface mesh points. The second one is the bending coefficient, which defines the reaction of the membrane to bending. The third coefficient is the local area coefficient. It ensures that the area of each triangle of the cell tends to be the same as in the relaxed state, and it defines the forces which act if a triangle is stretched or compressed and its area is modified. The fourth one is the global area coefficient. This coefficient is similar to the local area coefficient, but it does not deal with a single triangle area; it deals with the area of the whole membrane. The fifth one, the volume coefficient, ensures that the cell tends to keep a constant volume. There is also a coefficient of interaction between the fluid and the immersed objects. It defines the strength of the force transfer between the fluid and the cells. Once the parameters of the fluid and the cells are known, there are some numerical parameters to be defined. The time step is a parameter that defines the size of a single numerical step. Larger time steps make the simulation run faster, but there is a risk of instability. On the other hand, smaller time steps slow the simulation down but make it more stable. Another numerical parameter of the simulation is the size of the lattice-Boltzmann grid. It defines the spatial discretization and sets the distance between grid points in the lattice-Boltzmann mesh of the fluid.

3 Model Parameters Identification

The determination of model parameters depends on the nature of each parameter. The parameters of the fluid have a direct physical interpretation, so the values of density and viscosity are defined by the nature of the fluid used in the laboratory experiment. As we can see in Fig. 1, the process of determining the cell's parameters has several steps. The parameters of cells immersed in the flow do not have a direct

Fig. 1. Process of determining model parameters. In this paper we address step 3.


physical interpretation, so their values are to be found by comparison with laboratory experiments. The first estimation of those parameters can be evaluated analytically from the corresponding parameters of continuum models, step 1 in Fig. 1. The stretching, bending and local area moduli in our model can be estimated from the area expansion modulus (K), shear modulus (μ) and bending modulus (kc) obtained from continuum models. Another way to specify the parameters is the comparison of the model behavior with analytic solutions. The value of the friction coefficient was obtained this way. Finally, the detailed calibration and adjustment of those parameters is done by comparing the model behavior with experiments. The exact values of the five elastic coefficients of the RBC were determined by comparing the results of the numerical simulation with a stretching experiment [4,5], step 2 in Fig. 1. In this experiment, the RBC was stretched by laser tweezers. Ten different values of the stretching force were used, and the longitudinal and lateral modification of the RBC diameter was recorded for each value of the stretching force. These values created two curves of longitudinal and lateral modification of the cell's shape depending on the stretching force. Those two curves were recorded in the numerical simulations as well. The elastic coefficients of the RBC were then adjusted in order to obtain the same deformation curves as in the laboratory experiment. This calibration process is, however, not part of this manuscript and will be published elsewhere. The value of the friction coefficient used in our article is taken from Martin Bušík's thesis [8]. After the calibration of the elastic parameters of the cell and of the friction coefficient, we need to test this calibrated model, step 3 in Fig. 1. In this article, we focus on this level of validation of the model. To do this, we run simulations with different scenarios, and compare the behavior of the cells under different conditions.

4 Biological Experiment

The experiment used to validate the calibration of our model is described in detail in [1]. That article is about measuring the deformation index (DI) of RBCs and white blood cells in a microfluidic device with a hyperbolic microchannel. The laboratory experiment is performed on a microchip with two main parts. The first one serves to separate the cells from blood, and it directs those single cells to the second part of the microchip. The second part consists of several equal parallel micro-channels. Each of those micro-channels is composed of a sequence of three consecutive hyperbolic chambers, one of which is depicted in Fig. 2(a). The dimensions of the hyperbolic chamber are 55 × 383 µm (height × length), with the width narrowing from 400 to 20 µm. The fluid used in the experiment was a Dextran 40 solution (10%, w/v). The hematocrit of RBCs was about 2%. The density of this solution is 10³ kg/m³. The viscosity of the solution is 4 · 10⁻⁶ m²/s at the temperature of 22 ± 1 °C, at which all experimental assays were performed.


Fig. 2. (a) Shape and size of one of the three hyperbolic chambers in the micro-channel. (b) Four zones of the hyperbolic chamber, where the DI of the passing RBC was evaluated. Each zone is 50 µm wide.

The cells were observed during their passage through the narrow part of the channel. The deformation of the cell due to the pseudo shear flow was calculated via the formula

$$DI = \frac{L_{max} - L_{min}}{L_{max} + L_{min}} \qquad (1)$$

where Lmax and Lmin are the major and minor axis lengths of the cell visible on the camera recording the video. The DI of the passing RBCs was evaluated in four specific zones close to the opening of the narrow part of the channel, as presented in Fig. 2(b). The width of those zones is 50 µm; two of them are placed at the end of the narrow part of the channel, and the other two are placed at the beginning of the wide part of the channel. The pseudo shear rate of the fluid was evaluated in those four zones. The values of the DI and of the pseudo shear rate, obtained from [1], are presented in Table 1. To calibrate the flow rate in our simulation, we need information about the velocity of the fluid inside the channel. This information can be calculated from the measured pseudo shear rate. The relation between the velocity in the channel and the pseudo shear rate is as follows:

$$\sigma = \frac{v}{D_h} \qquad (2)$$

where Dh is the hydraulic diameter, defined as $D_h = \frac{2wh}{w+h}$ (where w is the width and h is the height of the channel section), σ is the pseudo shear rate in the corresponding section, and v is the mean velocity in the section. The resulting velocity is presented in Table 1.
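The velocity recovery from the pseudo shear rate is simple enough to express directly. The following sketch is our illustration (it reproduces the Zone 2 row of Table 1); the function names are ours.

```python
def mean_velocity(sigma, w, h):
    """Mean velocity v = sigma * D_h, with hydraulic diameter
    D_h = 2wh / (w + h); sigma in 1/s, w and h in meters (Eq. 2)."""
    d_h = 2.0 * w * h / (w + h)
    return sigma * d_h

def deformation_index(l_max, l_min):
    """Deformation index DI = (Lmax - Lmin) / (Lmax + Lmin) (Eq. 1)."""
    return (l_max - l_min) / (l_max + l_min)

# Zone 2 of Table 1: sigma = 1761 1/s, w = 21 um, h = 55 um.
v = mean_velocity(1761.0, 21e-6, 55e-6)   # ~0.0535 m/s
```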


Table 1. Values of DI and pseudo shear rate in each of the zones of the hyperbolic chamber. Data are from the laboratory experiment [1]. The average velocity, to which we fit the simulation parameters, is computed using the relation between velocity and pseudo shear rate, v = σ · 2wh/(w + h).

Zone  Pseudo shear rate (s⁻¹)  DI (from [1])  w (µm)  h (µm)  Dh (µm)  Average velocity (m/s)
1     1400                     0.36           25      55      34.375   0.0481
2     1761                     0.44           21      55      30.395   0.0535
3     100                      0.3            320     55      93.867   0.0094
4     60                       0.24           160     55      81.860   0.0049

5 Computational Setting

The dimensions of our channel are presented in Fig. 2. The fluid in the simulation is moved by an external fluid force, which is a parameter related to the pressure drop in the simulation channel; it is an internal parameter of Espresso. Its value was set to match the maximal velocity of 0.0535 m/s in the narrow part of the channel (zone 2), calculated from the biological experiment. The orientation of the RBC can be found in one of the figures in article [1], where one of the cells is captured by the camera during the experiment. It has a parachute shape, which is formed if a cell's orientation is perpendicular to the flow of the liquid. In the article, they consider cells travelling along the centerline. In our work, to study the evolution of the DI of centerline cells in the narrow part of the channel, we have defined five positions of the cell close to the center of the channel, illustrated in Fig. 3. For each of the five positions, we have run

Fig. 3. Starting position of the RBC. On the left, we can see the middle position of the RBC from the top view. On the right, we can see RBC at the same position, only zoomed in and looking along the x axis, the direction of the fluid flow. The black dots indicate 5 different starting positions of the RBC center.


Fig. 4. Different starting inclinations of the RBC. Looking along the y axis, from the side of the hyperbolic chamber (the fluid flows from left to right). The first one is perpendicular to the fluid flow. The other four are derived from the initial inclination by rotation by ±π/10 radians according to the y and z axes.

Table 2. Simulation parameters

10−7 s

lbgrid

10−6 m

External fluid force

1, 5 · 106 N/m3

Stretching coefficient

8 · 10−6 N/m

Bending coefficient

3 · 10−12 N

Local area conservation coefficient

3 · 10−6 N/m

Global area conservation coefficient 9 · 10−4 N/m Volume conservation coefficient

5 · 102 N/m2

Radius of RBC

3, 91 · 10−6 m

Friction coefficient

3, 39 · 10−9 N · s · m−1

Fluid density

103 kg · m−3

Fluid viscosity

4 · 10−6 m2 /s

a set of simulations, where we slightly modified the initial inclination of the cell (by rotating the cell around the y or z axis by ±π/10 radians, Fig. 4). The simulation parameters used in our simulations are listed in Table 2.

6 Results

The computational study was performed for 25 different starting seedings of the cell (combinations of the 5 locations of the center and the 5 rotations described in Figs. 3 and 4). Among these, some positions are symmetrical and so they provide the same evolution of the velocity and the DI during the simulation. In Fig. 9 we can see the cell's path for one of the seedings. In the simulation, the cell runs through the hyperbolic chamber twice. This was done in order to examine whether the trajectory of the cell is influenced by insufficiently precise initial conditions. Hence, in Figs. 5, 6, 7 and 8 we have values of x-coordinates from 0 to 800, which corresponds to the cell running through the channel twice. Dotted lines indicate the four zones from Fig. 2(b).


Fig. 5. Comparison of the cell velocity values between the simulation data and the data obtained from the pseudo shear rate in the laboratory experiment.

Fig. 6. Evolution of the cell velocity during its movement through the channel. The grey lines represent the positions of measurement of the velocity in the channel.

Fig. 7. Comparison between values of DI from the simulation data and from the data obtained in the laboratory experiment.


Fig. 8. Evolution of the cell DI during its movement through the channel. The grey lines represent the positions of measurement of the DI in the channel.

Fig. 9. RBC moving through the channel. Captured at each of the channel zones.

Our first aim was to validate the velocity of the cell entering and exiting the constriction. In Fig. 5 we present a graph of the cell velocity in different parts of the channel. The velocity was calculated as an average over the 25 starting positions of the cell. However, the fluctuations around the average were so small that the standard deviation is 200–1000 times smaller than the measured velocity, and so it is not visible in the graph. As an example, we present the evolution of the x-component of the cell velocity in Fig. 6. We can see that the calculated average course of the cell is not the same as the one observed in the laboratory experiment. The velocity profile in the simulation


fits the velocity profile from the laboratory experiment only in the slow-changing regime, the narrow part of the channel (zone 1 and zone 2). In order to verify the similarity of the cell deformation, we also calculated the evolution of the DI. This is presented in Fig. 7. We can see that the values of DI are comparable with the values obtained from the simulations, as long as the velocity profile from the simulation fits the velocity profile from the laboratory experiment. In Fig. 7 we can see the average values of the DI obtained from the simulation of the 25 cells, with standard deviation, and their comparison with the data from the laboratory experiment. As an example, we present the evolution of the cell DI in Fig. 8.

7 Discussion and Conclusions

The aim of this article is to make a comparison between a laboratory experiment and a simulation run under Espresso, in order to validate the values of the simulation parameters. For this purpose, we used an experiment dealing with the deformability of RBCs in a channel with variable width. The velocity of the fluid in our simulation was set in order to fit the pseudo shear rate data of the laboratory experiment. The velocity profile in our simulation fits the velocity profile calculated from the pseudo shear rate measured in the laboratory experiment in zone 1 and zone 2. With the given parameters of fluid density and fluid viscosity, we were not able to define the fluid flow in the simulation channel in such a way that it would fit the values of the laboratory pseudo shear rate data in all four of the investigated channel zones. However, our main concern is the cell deformation and not so much the fluid behaviour. One of the reasons may be the formula used to calculate the velocity from the pseudo shear rate, where the definition of the input variables could be understood in different ways. Our first conclusion was that we cannot reliably compare DI in the third and fourth zones, because the velocity profiles from the simulation and the experiment are different. The fluid flow in the fast-changing regime in our simulation is faster than the velocity computed from the pseudo shear rate data of the laboratory experiment. However, when crossing from the third to the fourth zone, the velocity profiles both in the simulation and in the laboratory experiment drop by half; only the velocity in the simulation is larger by a constant. We can also observe that the change in DI between zones 3 and 4 has a similar character in the simulation and the laboratory experiment, again differing only by a constant. The problematic part is the crossing from the second to the third zone. Therefore, we turn our focus to the narrow part of the channel, and to the values of DI calculated in this part. As shown in Fig. 5, the average velocity in zone 1 is only 4.6% slower than the velocity reported in the laboratory experiment. For zone 2 the difference is only 1.5%. Looking at Fig. 7 we can see that the values of DI are comparable. The standard deviation from the average DI value in the simulation data is 0.062 and 0.066 (zones 1 and 2). Comparing the average DI from the simulation and the value of DI reported in the laboratory experiment, we have a 9.7%


difference in the first zone of the channel and a 5.1% difference in the second zone. On the basis of these results, we successfully validated the elastic coefficients of the RBC model.

References
1. Rodriguez, R.O., Pinho, D., Faustino, V., Lima, R.: A simple microfluidic device for the deformability assessment of blood cells in a continuous flow. Biomed. Microdevices 17, 108 (2015)
2. Cimrák, I., Gusenbauer, M., Schrefl, T.: Modelling and simulation of processes in microfluidic devices for biomedical applications. Comput. Math. Appl. 64(3), 278–288 (2012)
3. Cimrák, I., Gusenbauer, M., Jančigová, I.: An ESPResSo implementation of elastic objects immersed in a fluid. Comput. Phys. Commun. 185(3), 900–907 (2014)
4. Dao, M., Lim, C.T., Suresh, S.: Mechanics of the human red blood cell deformed by optical tweezers. J. Mech. Phys. Solids 51(11), 2259–2280 (2003)
5. Dao, M., Li, J., Suresh, S.: Molecularly based analysis of deformation of spectrin network and human erythrocyte. Mater. Sci. Eng. C 26(8), 1232–1244 (2005)
6. Ahlrichs, P., Dünweg, B.: Lattice-Boltzmann simulation of polymer-solvent systems. Int. J. Mod. Phys. C 9(08), 1429–1438 (2003)
7. Ondrušová, M.: Sensitivity of red blood cell dynamics in a shear flow. Cent. Eur. Res. J. 3(1), 28–33 (2017)
8. Bušík, M.: Development and optimization model for flow cells in the fluid. Dissertation thesis, University of Žilina, Faculty of Management Science and Informatics, Department of Software Technology. Supervisor: doc. Mgr. Ivan Cimrák, Dr. Žilina, FRI ŽU, 114 p. (2017)
9. Bachratá, K., Bachratý, H.: On modeling blood flow in microfluidic devices. In: 10th International Conference ELEKTRO 2014, Slovakia, pp. 518–521. IEEE (2014)

Exploiting Ladder Networks for Gene Expression Classification
Guray Golcuk, Mustafa Anil Tuncel, and Arif Canakoglu(B)
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, 20133 Milan, Italy
{gueray.goelcuek,mustafaanil.tuncel}@mail.polimi.it, [email protected]

Abstract. The application of deep learning to biology is of increasing relevance, but it is difficult; one of the main difficulties is the lack of massive amounts of training data. However, some recent applications of deep learning to the classification of labeled cancer datasets have been successful. Along this direction, in this paper we apply Ladder networks, a recent and interesting network model, to the binary cancer classification problem. Our results improve over the state of the art in deep learning and over the conventional state of the art in machine learning; achieving such results required a careful adaptation of the available datasets and tuning of the network.

Keywords: Deep learning · Ladder network · Cancer detection · RNA-seq expression · Classification

1 Introduction

Gene expression measures the transcriptional activity of genes; its analysis has great potential to lead to biological discoveries and, in particular, can be used to explain the role of genes in causing tumors. Different forms of gene expression data (produced by micro-arrays or by next generation sequencing through RNA-seq experiments) have been used for classification and clustering studies, using different approaches. In particular, Danaee et al. [1] applied deep learning to the binary classification problem for breast cancer using the TCGA public dataset. Deep learning is a branch of machine learning; it has achieved tremendous performance in several fields such as image classification, semantic segmentation and speech recognition [2–4]. Recently, deep learning methods have also achieved success in computational biology [5]. The problem considered in [1] consists of using classified gene expression vectors representing samples taken from normal and tumor cells (hence carrying a label) and training a classifier to learn the label; this is an interesting preliminary problem for testing the usability of classifiers in medical studies. The problem is difficult in the context of deep learning, due to the high


number of genes and the small number of samples (the "small n, large p" problem) [6]. In [1], the Stacked Denoising Autoencoder (SDAE) approach was compared to conventional machine learning methodologies; a comparison of the different feature selection and classification methods is available in Table 3. Deep learning can be performed in three ways: supervised, unsupervised and semi-supervised. Semi-supervised learning [7] uses supervised learning tasks and techniques to make use of unlabeled data for training; it is recommended when the amount of labeled data is very small while the unlabeled data is much larger. In this work, we use the Ladder network [8] approach, a semi-supervised deep learning method, to classify tumorous and healthy samples of gene expression data for breast cancer, and we evaluate the Ladder network against state-of-the-art machine learning and dimensionality reduction methods; therefore, our work directly compares to [1]. Compared to the state of the art, the Ladder structure yielded stronger results than both the machine learning algorithms and the SDAE approach of [1], thanks to its improved applicability to datasets with small sample sizes and high dimensions. We considered the datasets extracted from the GMQL [9] project's public repository. They were originally published by TCGA [10] and enriched by the TCGA2BED [11] project. Figure 1 illustrates the number of patients for each cancer type and shows that there are fewer normal samples than cancerous ones; Breast Invasive Carcinoma (BRCA) has the highest number of cases. We used the TCGA RNA-seq V2 RSEM [12] gene-normalized BRCA dataset, with 1104 tumorous samples and 114 normal samples available.

Fig. 1. The number of patients for each tumor type. Tumor type abbreviations are available at: http://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations

2 Dimensionality Reduction and Machine Learning Techniques

One of the main characteristics of gene expression datasets is their high dimensionality. Therefore, a feature selection or feature extraction step is often required prior to classification. Feature selection methods attempt to identify


the most informative subset of features. A common way of performing feature selection is to first compute the chi-squared statistic between each feature and the class labels, and then select the features with the highest chi-squared scores [13]. Feature extraction methods, on the other hand, derive new features by combining the initial features of the dataset.
– Principal Component Analysis (PCA): a well-established feature extraction method that uses orthogonal transformations to derive uncorrelated features and maximize the amount of explained variance [14].
– Kernel Principal Component Analysis (KPCA): an extension of PCA that uses kernel methods; with their help, the principal components can be computed in high-dimensional feature spaces [15].
– Non-negative Matrix Factorization (NMF): a technique for reducing the dimensions of a non-negative matrix by finding two non-negative matrices whose product reconstructs an approximation of the initial matrix [16].
Support Vector Machines (SVM) were proposed by Vapnik and Cortes [17] and have been extensively used for the classification of gene expression datasets [18–21]; they can also fit non-linear data by using kernel functions. Single-layer and multi-layer perceptron architectures have also been widely used for predicting the gene expression profiles of samples in various works [22–24].
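As a concrete illustration of these techniques, the following minimal scikit-learn sketch pairs each of the named feature selection/extraction methods with an SVM classifier; the matrix shapes, component counts and random placeholder data are our own illustrative assumptions, not the exact setup evaluated later in the paper.

import numpy as np
from sklearn.decomposition import NMF, PCA, KernelPCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X = np.random.rand(100, 2000)          # placeholder expression matrix (samples x genes), non-negative
y = np.random.randint(0, 2, 100)       # placeholder tumor/normal labels

reducers = {
    "CHI2": SelectKBest(chi2, k=500),                  # keep features with highest chi-squared scores
    "PCA": PCA(n_components=50),                       # uncorrelated, variance-maximizing components
    "KPCA": KernelPCA(n_components=50, kernel="rbf"),  # principal components in a kernel-induced space
    "NMF": NMF(n_components=50, init="nndsvd", max_iter=500),  # non-negative matrix factorization
}
for name, reducer in reducers.items():
    model = make_pipeline(reducer, SVC(kernel="rbf"))  # reduction step followed by an SVM
    model.fit(X, y)
    print(name, model.score(X, y))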

3 Ladder Networks

Ladder networks are deep neural networks that combine supervised and unsupervised learning; both are trained simultaneously, without layer-wise pre-training (as used by Danaee et al. [1]). We next provide a simplified description of the ladder network implementation introduced in Rasmus et al. [8]:
1. A Ladder network has a feed-forward model that serves as the supervised-learning encoder. The complete system has two encoder paths, one clean and one corrupted; the difference between them is the Gaussian noise added to all layers of the corrupted path.
2. A decoder is used to invert the output at each layer. It uses a denoising function that reconstructs the activation of each layer of the corrupted encoder so as to approximate the activation of the clean encoder. The denoising cost is defined as the difference between the reconstructed and the clean version of that layer.
3. Since the network combines supervised and unsupervised learning, it has corresponding costs for both. The supervised cost is the difference between the output of the corrupted encoder and the desired output; the unsupervised cost is the sum of the denoising costs of all layers, scaled by significance parameters. The entire training cost is the sum of the supervised and unsupervised costs (see the sketch after this list).
4. Fully labeled and semi-supervised structures are trained to minimize this cost using an optimization technique.
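To make steps 1–3 concrete, the following NumPy sketch computes the combined cost for a toy single-layer ladder; the fixed shrinkage used as a stand-in for the denoising function g, the noise level and the weight lam are illustrative assumptions, not the actual implementation used in our experiments.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(40, 2000))              # one batch of expression vectors
labels = rng.integers(0, 2, size=40)         # tumor / normal targets
W = rng.normal(scale=0.01, size=(2000, 2))   # single encoder layer with 2 output classes

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

z_clean = x @ W                                          # clean encoder path
z_corr = (x + rng.normal(scale=0.3, size=x.shape)) @ W   # corrupted path: Gaussian noise added
y_corr = softmax(z_corr)

z_hat = 0.9 * z_corr        # decoder output via a trivial stand-in denoising function g

# supervised cost: cross-entropy between the corrupted output and the desired labels
supervised = -np.log(y_corr[np.arange(len(labels)), labels]).mean()

# unsupervised cost: denoising (reconstruction) error, scaled by a significance parameter
lam = 10.0
unsupervised = lam * np.mean((z_hat - z_clean) ** 2)

total_cost = supervised + unsupervised       # the quantity minimized in step 4
print(total_cost)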


Figure 2 illustrates the structure of a 2-layer (l = 2) ladder network, following the example in Rasmus et al. [8]. The clean path on the right (x → z(1) → z(2) → y) shares the mappings f(l) with the corrupted path on the left (x̃ → z̃(1) → z̃(2) → ỹ). On each layer, the decoder in the middle (z̃(l) → ẑ(l) → x̂) uses denoising functions g(l), and the cost functions Cd(l) try to minimize the difference between ẑ(l) and z(l).

Fig. 2. Structure of a 2-layer Ladder network. On the right is the clean path, which works as supervised learning; the leftmost (corrupted encoder) and middle (decoder) parts belong to the unsupervised learning.

The ability of the ladder network to reach high accuracy with a very small amount of labeled data on the MNIST dataset [25] suggested that it could be conveniently applied to our problem. To the best of our knowledge, this work is the first to apply the ladder network structure to gene expression datasets. Before analyzing the gene expression data, we applied preprocessing techniques to fill in missing data and to normalize all expression values, so that each gene is on the same scale; for this purpose, min-max normalization was applied. In order to test properly, all samples were divided into three mutually disjoint subsets: training, validation and test, with 60%, 20% and 20% of the samples, respectively. The configured Ladder network is freely available as a Python-based software implementation, with source code online under an MIT License: http://github.com/acanakoglu/genomics-ladder.
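A sketch of the preprocessing just described follows, assuming a dense expression matrix; the array sizes echo the BRCA dataset but the data here is random, and the stratified split is our own choice of how to realize the 60/20/20 partition.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = np.random.rand(1218, 20000)       # 1104 tumorous + 114 normal samples (placeholder values)
y = np.array([1] * 1104 + [0] * 114)

X = MinMaxScaler().fit_transform(X)   # min-max normalization of the expression values

# 60% training, then the remaining 40% split evenly into validation and test
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)
print(X_tr.shape, X_val.shape, X_te.shape)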

4 Tuning of the Ladder Network

In order to optimize the network configuration, different hyper-parameters of the network were analyzed. First, the number of layers and the structure (number of nodes) of each layer were determined. Then, the batch size for a given network was analyzed, with the aim of optimizing both the execution time and the accuracy of the network.

Table 1. Ladder network performance with different numbers of hidden layers

Layers               Accuracy  Sensitivity  Specificity  Precision  F1 score
1 hidden layer (a)   55.33     57.23        39.13        90.36      0.700
2 hidden layers (b)  97.38     98.55        86.09        98.55      0.986
3 hidden layers (c)  96.64     97.28        90.43        98.99      0.981
5 hidden layers (d)  98.69     98.64        99.13        99.91      0.993
7 hidden layers (e)  97.30     99.17        81.54        97.83      0.985
10 hidden layers (f) 97.56     98.64        87.75        98.64      0.986

The numbers of nodes: (a) 2000; (b) 2000 - 200; (c) 2000 - 200 - 20; (d) 2000 - 1000 - 500 - 250 - 10; (e) 2048 - 1024 - 512 - 256 - 128 - 64 - 32; (f) 2048 - 1024 - 512 - 256 - 128 - 64 - 32 - 16 - 8 - 4

We tuned the network using different parameters; the most relevant are the number of layers (1, 2, 3, 5, 7 or 10 hidden layers), as shown in Table 1, and the size of the labeled training feed (10, 20, 30, 40, 60, 80 and 120 labeled samples), as shown in Table 2. All evaluations were performed using the 5-fold cross validation technique. In Table 1, we analyze the effect of the number of hidden layers. As shown in the table, 5 hidden layers produce the top performance; fewer hidden layers result in lower performance, while more layers cause overfitting of the data. The structure with 5 hidden layers has 2000, 1000, 500, 250 and 10 nodes in the respective layers, plus two output nodes, one for the healthy and one for the cancerous case. The significance parameters mentioned in step 3 of the method were selected as [1000, 10, 0.1, 0.1, 0.1, 0.1, 0.1], respectively, to indicate the importance of each layer. Figure 2 illustrates the model used for the classification of the TCGA BRCA data.

We also investigated the impact of different batch sizes on the supervised part of the training; Table 2 shows that performance grows as the batch size increases up to 40 samples and remains rather stable with more samples. Since smaller batch sizes are computationally more efficient, we decided to use a batch size of 40. The terminating condition is satisfied either when the number of epochs reaches 100 or when the training accuracy exceeds 99%. With this size, the ladder network converges in about 4 min of execution time over a dataset of about 1000 gene expression records with about 20,000 genes each; execution took place on an Nvidia GeForce GTX 1060 GPU with 6 GB of RAM, using the TensorFlow library [26]. It achieves an accuracy of 98.69, sensitivity of 98.64, specificity of 99.13, precision of 99.91 and an F1 score of 0.993.

Table 2. Ladder network performance with different batch sizes

Labeled data  Accuracy  Sensitivity  Specificity  Precision  F1 score
10 labels     85.08     85.06        85.22        98.22      0.912
20 labels     89.76     98.80        50.22        89.66      0.940
30 labels     95.82     98.43        74.24        96.92      0.977
40 labels     97.64     98.64        85.87        98.53      0.987
60 labels     98.69     98.64        99.13        99.91      0.993
80 labels     97.62     98.46        89.09        98.91      0.987
120 labels    98.36     98.64        95.65        99.54      0.991
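For reference, the best-performing configuration described above can be written out as plain Python constants; the variable names are ours, and the input width of about 20,000 genes is taken from the text.

layer_sizes = [20000, 2000, 1000, 500, 250, 10, 2]       # input, 5 hidden layers, 2 output nodes
denoising_weights = [1000, 10, 0.1, 0.1, 0.1, 0.1, 0.1]  # per-layer significance parameters
batch_size = 40                                          # labeled samples per training feed
max_epochs = 100                                         # or stop once training accuracy exceeds 99%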

5 Evaluation and Conclusions

For the evaluation we used stratified k-fold cross validation [27] with k equal to 5. In other words, the data were divided into 5 equal subsets such that each fold contains approximately equal proportions of cancerous and healthy samples. In each round, 4 subsets are used for training and validation and 1 subset is used for testing; the procedure is repeated 5 times, each time excluding a different part of the data for testing. This approach was also employed in [1] and for the evaluation of the conventional machine learning algorithms defined in the previous section. The confusion matrices of the individual folds were summed up, and from the sum we calculated the accuracy, sensitivity, specificity, precision and F1 score reported in the previous section. We evaluated our ladder network algorithm by comparing its performance metrics against the results of Danaee et al.'s study [1]. A direct comparison shows that the SDAE network achieves its best result when coupled with an SVM classifier, in which case it reaches an accuracy of 98.04, slightly inferior to ours. The ladder network could be applied directly, without the need for a preliminary feature reduction step, which shows that the network learns both the important features and the classes. Since the performance of a learning algorithm depends not only on the data but also on the hyper-parameters, we performed hyper-parameter tuning on the support vector classifier along with three different dimensionality reduction algorithms, in order to observe an optimal performance from the support vector classifier. The GridSearch functionality of the scikit-learn library [28] was used for the hyper-parameter tuning. Subsequently, we compared the resulting performance of the support vector classifiers with the ladder network algorithm and report it in Table 3.
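The following sketch shows one way to realize this evaluation protocol with scikit-learn, summing the per-fold confusion matrices before deriving the metrics; the classifier and placeholder data are illustrative stand-ins for the models actually compared.

import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X = np.random.rand(1218, 200)          # placeholder feature matrix
y = np.array([1] * 1104 + [0] * 114)   # tumor = 1 (positive class)

cm = np.zeros((2, 2), dtype=int)
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    clf = SVC().fit(X[train_idx], y[train_idx])
    cm += confusion_matrix(y[test_idx], clf.predict(X[test_idx]), labels=[0, 1])

tn, fp, fn, tp = cm.ravel()            # summed over the 5 folds
accuracy = (tp + tn) / cm.sum()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
print(accuracy, sensitivity, specificity, precision, f1)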

Table 3. Algorithm comparison table

Features       Model           Accuracy  Sensitivity  Specificity  Precision  F1 score
All            Ladder network  98.69     98.64        99.13        99.91      0.993
NMF†           SVM             98.60     99.45        90.35        99.01      0.992
PCA†           SVM             94.91     94.65        97.37        99.71      0.971
CHI2†          SVM             98.28     99.45        86.84        98.65      0.990
SDAE*          ANN             96.95     98.73        95.29        95.42      0.970
SDAE*          SVM             98.04     97.21        99.11        99.17      0.981
SDAE*          SVM-RBF         98.26     97.61        99.11        99.17      0.983
DIFFEXP500*    ANN             63.04     60.56        70.76        84.58      0.704
DIFFEXP500*    SVM             57.83     64.06        46.43        70.42      0.618
DIFFEXP500*    SVM-RBF         77.39     86.69        71.29        67.08      0.755
DIFFEXP0.05*   ANN             59.93     59.93        69.95        84.58      0.701
DIFFEXP0.05*   SVM             68.70     82.73        57.50        65.04      0.637
DIFFEXP0.05*   SVM-RBF         76.96     87.56        70.48        65.42      0.747
PCA*           ANN             96.52     98.38        95.10        95.00      0.965
PCA*           SVM             96.30     94.58        98.61        98.75      0.965
PCA*           SVM-RBF         89.13     83.31        99.47        99.58      0.906
KPCA*          ANN             97.39     96.02        99.10        99.17      0.975
KPCA*          SVM             97.17     96.38        98.20        98.33      0.973
KPCA*          SVM-RBF         97.32     89.92        99.52        99.58      0.943

† To further evaluate the performance of our ladder network, the hyper-parameters of the support vector classifiers along with three different dimensionality reduction algorithms are tuned by an exhaustive search approach.
* The results are taken from Table 1 of Danaee et al. [1].

The table also shows that the ladder network algorithm improves over conventional machine learning algorithms, among which the best feature extraction method is KPCA. We also evaluated the same machine learning methods ourselves and actually obtained better results than [1], but still inferior to the results obtained with the ladder network. In conclusion, we have shown that a ladder network can be applied to the binary classification of RNA-seq expression data, and that it compares well with state-of-the-art machine learning and with the previous attempt at solving this problem with deep learning. Although the improvements are small, they demonstrate that this deep learning method can be applied directly to datasets with fewer than a thousand samples. Our results indicate that ladder networks are very promising candidates for solving classification problems over gene expression data.

Acknowledgment. This work was supported by the ERC Advanced Grant GeCo (Data-Driven Genomic Computing) (Grant No. 693174) awarded to Prof. Stefano Ceri. We thank Prof. Stefano Ceri, who provided insight and expertise that greatly assisted the research, and whose comments greatly improved the manuscript. We would also like to thank the members of the GeCo project for helpful insights.


References
1. Danaee, P., Ghaeini, R., Hendrix, D.A.: A deep learning approach for cancer detection and relevant gene identification. In: Pacific Symposium on Biocomputing, pp. 219–229. World Scientific (2017)
2. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
3. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
4. Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., et al.: Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29(6), 82–97 (2012)
5. Singh, R., Lanchantin, J., Robins, G., Qi, Y.: DeepChrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics 32(17), i639–i648 (2016)
6. Chakraborty, S., Ghosh, M., Mallick, B.K.: Bayesian non-linear regression for large p small n problems. J. Am. Stat. Assoc. (2005)
7. Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning, 1st edn. The MIT Press, Cambridge (2010)
8. Rasmus, A., Berglund, M., Honkala, M., Valpola, H., Raiko, T.: Semi-supervised learning with ladder networks. In: Advances in Neural Information Processing Systems, pp. 3546–3554 (2015)
9. Masseroli, M., Pinoli, P., Venco, F., Kaitoua, A., Jalili, V., Palluzzi, F., Muller, H., Ceri, S.: GenoMetric Query Language: a novel approach to large-scale genomic data management. Bioinformatics 31(12), 1881–1888 (2015)
10. Weinstein, J.N., Collisson, E.A., Mills, G.B., Shaw, K.R.M., Ozenberger, B.A., Ellrott, K., Shmulevich, I., Sander, C., Stuart, J.M., Cancer Genome Atlas Research Network, et al.: The Cancer Genome Atlas pan-cancer analysis project. Nat. Genet. 45(10), 1113–1120 (2013)
11. Cumbo, F., Fiscon, G., Ceri, S., Masseroli, M., Weitschek, E.: TCGA2BED: extracting, extending, integrating, and querying The Cancer Genome Atlas. BMC Bioinform. 18(1), 6 (2017)
12. Li, B., Dewey, C.N.: RSEM: accurate transcript quantification from RNA-seq data with or without a reference genome. BMC Bioinform. 12(1), 323 (2011)
13. Forman, G.: An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 3(Mar), 1289–1305 (2003)
14. Jolliffe, I.T.: Principal component analysis and factor analysis. In: Principal Component Analysis, pp. 115–128. Springer, New York (1986). https://doi.org/10.1007/978-1-4757-1904-8_7
15. Schölkopf, B., Smola, A., Müller, K.-R.: Kernel principal component analysis. In: Gerstner, W., Germond, A., Hasler, M., Nicoud, J.-D. (eds.) ICANN 1997. LNCS, vol. 1327, pp. 583–588. Springer, Heidelberg (1997). https://doi.org/10.1007/BFb0020217
16. Brunet, J.P., Tamayo, P., Golub, T.R., Mesirov, J.P.: Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl. Acad. Sci. 101(12), 4164–4169 (2004)


17. Vapnik, V., Cortes, C.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
18. Furey, T.S., Cristianini, N., Duffy, N., Bednarski, D.W., Schummer, M., Haussler, D.: Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16(10), 906–914 (2000)
19. Tuncel, M.A.: A statistical framework for the analysis of genomic data. Master's thesis, Politecnico di Milano (2017)
20. Vapnik, V.: The Nature of Statistical Learning Theory. Springer Science & Business Media, New York (2000). https://doi.org/10.1007/978-1-4757-3264-1
21. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Mach. Learn. 46(1), 389–422 (2002)
22. Wei, J.S., Greer, B.T., Westermann, F., Steinberg, S.M., Son, C.G., Chen, Q.R., Whiteford, C.C., Bilke, S., Krasnoselsky, A.L., Cenacchi, N., et al.: Prediction of clinical outcome using gene expression profiling and artificial neural networks for patients with neuroblastoma. Cancer Res. 64(19), 6883–6891 (2004)
23. Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C., et al.: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med. 7(6), 673–679 (2001)
24. Vohradsky, J.: Neural network model of gene expression. FASEB J. 15(3), 846–854 (2001)
25. Deng, L.: The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Process. Mag. 29(6), 141–142 (2012)
26. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: TensorFlow: large-scale machine learning on heterogeneous systems (2015). tensorflow.org
27. Refaeilzadeh, P., Tang, L., Liu, H.: Cross-validation. In: Encyclopedia of Database Systems, pp. 532–538. Springer, Boston (2009). https://doi.org/10.1007/978-0-387-39940-9
28. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12(Oct), 2825–2830 (2011)

Simulation of Blood Flow in Microfluidic Devices for Analysing of Video from Real Experiments
Hynek Bachratý1, Katarína Bachratá1(B), Michal Chovanec2, František Kajánek1, Monika Smiešková1, and Martin Slavík1
1 Department of Software Technology, Faculty of Management Science and Informatics, University of Žilina, Žilina, Slovakia
[email protected]
2 Department of Technical Cybernetics, Faculty of Management Science and Informatics, University of Žilina, Žilina, Slovakia
http://cell-in-fluid.fri.uniza.sk

Abstract. Simulation of microfluidic devices is a great tool for optimizing these devices. For the development of simulation models, it is necessary to ensure a sufficient degree of simulation accuracy. Accuracy is ensured by measuring appropriate values that describe the course of the simulation and can also be measured in a real experiment. The measured values simplify the real situation, so that we can develop the model for a specific purpose and measure the values that are relevant to the research. In this article we present an approach in which the data gained from simulation are used to improve the quality of processing of video from a real experiment.

Keywords: Microfluidic devices · Simulation model · Video processing · Machine learning · Trajectory prediction · Kohonen networks

1 Introduction

(This work was supported by the Ministry of Education, Science, Research and Sport of the Slovak Republic under contract No. VEGA 1/0643/17 and by the Slovak Research and Development Agency under contract No. APVV-15-0751.)

Modeling blood flow in artificial microfluidic devices is currently of great importance and has wide application. The fidelity and quality of a simulation are in all cases key for modeling the elasticity, interaction and motion of red blood cells (RBCs). This fact stems from the high hematocrit of blood and the dominant 96% representation of RBCs in its solid component. It also applies to the main focus of our research [2,4], which is the optimisation of diagnostic microfluidic


devices used for intercepting circulating tumor cells (CTCs). Their main task is to catch CTCs on the surface of a periodic obstacle array using chemical bonds. Since the diagnostically significant ratio of CTCs to RBCs ranges from 1 : 5 · 10^4 to 1 : 5 · 10^6, even in this case a correct simulation of RBC behaviour is crucial. The main goal of this article is the design of a method for universal prediction of the motion of RBCs in microfluidic devices. For this task we use an approach inspired by the basis-learning method of Kohonen networks. Below we introduce the basic description of the method, the results of its first testing on a set of simulation experiments, and a first evaluation of its correctness. We then propose one of its possible uses in the field of computer vision, for processing video recordings of real-world experiments. Obtaining these analyses is pivotal for the verification procedures of simulation models and the approaches devised, for example, in [5,6].

2 Simulation of Blood Flow Experiments

All simulation experiments used in this paper were performed in the open source software ESPResSo [1]. We used the LB module, an implementation of a lattice-Boltzmann method for fluid modelling. The red blood cells were modelled with the Object-in-fluid framework described in [3]. The red blood cell surface mesh was generated in the Gmsh software from the analytic equation of its surface, with dimensions 7.82 × 7.82 × 2.56 µm; the triangulated mesh has 141 vertices. The values of the elastic coefficients of the red blood cell model were ks = 0.0044, kb = 0.0715, kal = 0.005, kag = 1, kv = 1.25, and the mass of the cell was 8.4 pg. Interaction between the cells was modelled by membrane collision with parameters mc_K = 0.005, mc_n = 2.0, mc_cut = 0.5. Interaction between the cells and the walls or obstacles was modelled by a soft-sphere potential with parameters soft_K = 0.00035, soft_n = 1.0, soft_cut = 0.5. The fluid had a spatial lattice step grid = 1 µm, and the time step for fluid recalculation was tau = 0.2 µs. The viscosity of the fluid was 1.5 mPa·s and its density 1000 kg/m³, which corresponds to blood plasma at a temperature of 20 °C. The interaction parameter friction, which ensures the velocity transfer between the fluid and immersed objects, was set to 0.0269. The fluid was moved by an external force with value 0.005. The simulation step for the whole simulation was the same as for the fluid, 0.2 µs. ESPResSo does not have any predefined units; all units were chosen and recalculated according to the physical system we wanted to model. The simulation channels had cuboid shapes with four walls and were open in the x direction. The fluid flowed in this direction and periodic conditions were applied: whatever leaves the channel on the right side immediately re-enters the channel on the left side at the same position. Inside the channels were cylindrical obstacles with different arrangements. In this article, we used two different channel topologies, A and B (Fig. 1).
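For convenience, the parameter values listed above are collected below as plain Python constants; this is a reference sketch in our own notation, not actual ESPResSo input code.

rbc_elasticity = dict(ks=0.0044, kb=0.0715, kal=0.005, kag=1.0, kv=1.25)  # elastic coefficients
cell_mass_pg = 8.4                                                        # mass of one RBC
membrane_collision = dict(mc_K=0.005, mc_n=2.0, mc_cut=0.5)               # cell-cell interaction
soft_sphere = dict(soft_K=0.00035, soft_n=1.0, soft_cut=0.5)              # cell-wall interaction
fluid = dict(grid_um=1.0, tau_us=0.2, viscosity_mPa_s=1.5,
             density_kg_m3=1000.0, friction=0.0269, external_force=0.005)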

Fig. 1. Channels A and B with their sizes.

Table 1. Description of the simulations.

Simulation ID  Channel  Seeding  Number of cells
A20 seed a     A        a        20
A50 seed b     A        b        50
A100 seed c    A        c        100
A50 seed d     A        d        50
A50 seed e     A        e        50
B50 seed f     B        f        50
B50 seed g     B        g        50

In the described channels we simulated the flow of red blood cells, with various numbers of cells in channel A and with 50 cells in channel B. The initial seeding of the cells was random and unique for every simulation; the parameters of all simulations are listed in Table 1. Every simulation ran for 380,000 time steps, during which the fastest cells travelled about 800 µm. Every 200 time steps, the position of the center and the current velocity of every cell in the simulation were measured and saved. The output data were then post-processed, by algorithms developed in C# in Visual Studio 2015, into a form suitable for further statistical processing. The main modifications consisted in removing the onset part of the simulation and in unifying the length of all data sets. The modified data were the inputs to the machine learning algorithms, based on a Kohonen neural network, written in C++.

3 Determination of Basis of Blood Flow Velocities

Simulation of experiments with microfluidic channels will be a strong tool for finding a suitable geometry that enables the best capture of CTCs. In the current phase of our research, we are analyzing the flow of red blood cells with appropriate instruments. In addition to the statistical tools for comparing the channels [6], we have also tried to analyze the flows using machine learning methods. In this section, we present a machine learning approach inspired by Kohonen networks.


The input of the learning network is the output of the simulations in the two different channels of Fig. 1. In channel A, we performed simulations with different hematocrits: 20, 50 and 100 RBCs were simulated. For the number of 50, the channel had several seedings. As input to the learning algorithm we used the positions of the cells and their velocity vectors.

Bases as Channel Characteristics
Since each run of the experiment (seeding) is different, the paths of the cells differ and a direct comparison of trajectories or velocities cannot be used. It is necessary to specify characteristics that can be compared. These characteristics will be bases, which determine the velocity vector at each point of the channel. The goal of the experiment is to find the bases for each measurement and to verify their similarity for channels A and B. For different seedings within one channel we want to achieve a stronger correlation between the bases. Ideally, the bases should not depend on the seeding but only on the channel geometry.

Design of the Learning Network
The positions and velocities of the cells from the individual simulations serve for learning the individual basis functions. We fix their number in advance and denote it by N. In the first step, the basis functions B_i, for i ∈ {1, 2, ..., N}, are chosen randomly as 6-component vectors (positions and velocities). In the first approximation, the model is a linear combination of the bases B_i(r, v). The predicted velocity at position r is

    V(r, v) = Σ_{i=1}^{N} w_i(r) B_i(v)    (1)

where N is the number of basis functions, B_i are the bases and B_i(v) denotes the velocity component of the i-th basis, w_i(r) is the weight of the i-th basis for the blood cell position r, and V(r, v) is the predicted blood cell velocity at position r. For learning the bases we chose a mechanism very similar to the Kohonen network [9], where neural weights are adjusted in order to cover the input space in the best possible way, and the scales associated with them provide the required output value. In [10], a Kohonen network was used to solve the problem of the inverse kinematics of an arm, a task similar to our experiment, which looks for an association between position and velocity at a given point. For each experiment, we ran the learning independently. The initial basis values were selected from randomly chosen positions and velocities of the blood cells. The bases were learned by presenting the positions of the blood cells C(r) and their velocities C(v).


Each basis has two components: position B_i(r) and velocity B_i(v). Both are corrected by relationships that correspond to the stochastic gradient descent method described in [10]:

    B_i(r) ← (1 − η α_i) B_i(r) + η α_i C(r)    (2)
    B_i(v) ← (1 − η α_i) B_i(v) + η α_i C(v)    (3)

The parameter η ∈ (0, 1) represents the speed of basis adaptation, and α_i is the degree of similarity of the positions B_i(r) and C(r). It is calculated using a similarity function according to the following relationships:

    β_i = k / (k + ||B_i(r) − C(r)||²)    (4)
    α_i = β_i / Σ_{j=1}^{N} β_j    (5)

where the parameter k > 0 represents the steepness of the similarity function. From relationship (4) we get a value α_i ∈ (0, 1): for blood cell positions C close to the basis it is close to one, and for distant blood cells it approaches zero. Equation (5) normalizes the α_i so that their sum equals one. In our training, we chose the number of bases N = 10000, parameter k = 0.01 and learning rate η = 0.1. The number of learning iterations was 500,000.

Reconstruction of Cell Position and Velocity
After training the bases, it is possible to test the predictive capabilities of the model. The input is any position of a cell T(r); the goal is to predict the velocity T(v) at this point. First, the values α_i are calculated using Eqs. (4) and (5); they correspond to the weights w_i ≡ α_i. Using relationship (1), the velocity at the given point is computed. By numerical integration, T_{n+1}(r) = T_n(r) + T_n(v) · dt, it is possible to obtain the trajectory of the test cell. In this way we can determine the velocities throughout the channel space, also in places where no cell from the training set was present. This allows us to compare experiments with each other. When comparing the results, we cannot use all points because of the large amount of data; we therefore selected 1000 random positions T(r) by the Monte Carlo method. The resulting basis distance is calculated by a normalized Euclidean metric. First, the velocities predicted by the two trained models X and Y are determined for a position T(r): T_X(v) and T_Y(v), where T(r) represents a random test position. The difference is calculated as dT = V_X^T − V_Y^T, and the differences over the 1000 positions selected in this way are averaged. Each model was compared with all the other models; the resulting differences are given in Table 2.
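A compact NumPy sketch of the scheme just described is given below; it implements the update rules (2)–(3), the similarity weights (4)–(5) and the velocity prediction (1). The array shapes, the reduced number of bases and the random training stream are our own illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
N, k, eta = 1000, 0.01, 0.1              # fewer bases than the 10000 used above, for brevity
B_r = rng.uniform(0, 100, size=(N, 3))   # basis positions B_i(r)
B_v = rng.normal(size=(N, 3))            # basis velocities B_i(v)

def weights(r):
    # similarity weights alpha_i of all bases for position r, Eqs. (4)-(5)
    beta = k / (k + np.sum((B_r - r) ** 2, axis=1))
    return beta / beta.sum()

def learn(C_r, C_v):
    # one presentation of a cell position/velocity pair, Eqs. (2)-(3)
    global B_r, B_v
    a = (eta * weights(C_r))[:, None]
    B_r = (1 - a) * B_r + a * C_r
    B_v = (1 - a) * B_v + a * C_v

def predict_velocity(r):
    # predicted velocity at position r, Eq. (1) with w_i = alpha_i
    return weights(r) @ B_v

for _ in range(5000):                    # stand-in for the stream of simulated cell samples
    learn(rng.uniform(0, 100, 3), rng.normal(size=3))
print(predict_velocity(np.array([50.0, 10.0, 5.0])))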

Table 2. Error size between individual models

             A20 s a   A50 s b   A100 s c  A50 s d   A50 s e   B50 s f   B50 s g
A20 seed a   0         0.024045  0.026852  0.029994  0.028294  0.069815  0.065913
A50 seed b   0.024174  0         0.011605  0.025831  0.023463  0.062432  0.064022
A100 seed c  0.026618  0.011235  0         0.027454  0.026459  0.063747  0.061519
A50 seed d   0.028648  0.025418  0.027196  0         0.022168  0.063020  0.062761
A50 seed e   0.026980  0.026299  0.026794  0.022792  0         0.064810  0.066823
B50 seed f   0.066236  0.067181  0.063129  0.067300  0.064338  0         0.020975
B50 seed g   0.065729  0.066606  0.062175  0.063798  0.064817  0.022061  0

Experiments where a smaller error is expected (those of the same channel) are grouped in the table: simulations in channel A form the upper-left block and simulations in channel B the lower-right block. It is clear from the values that models of the same channel differ with a smaller error. The measured average error values can be interpreted as the difference of the velocity prediction, and are at the level of about 10% of the RBC velocities. The linearity and simplicity of the model, however, also cause differences within the same channel. The differences in the bases are due to the various initial conditions of the experiments, but also to the fact that the model does not consider cell collisions and the nonlinearity of the relationships. In our next steps, we want to focus on non-linear modeling, for example using deep learning techniques. To validate the results, we also performed a test where we divided the data of each experiment independently into two parts, a training and a test set, choosing 25% of the data for testing. We measured the deviation of the velocity measured in the simulation from the velocity predicted by the model. The results are shown in Table 3.

Table 3. Model prediction error

Experiment   Absolute error  Absolute vector size  MRE [%]
A20 seed a   0.029266        0.253014              11.047184
A50 seed b   0.037301        0.253148              14.865701
A100 seed c  0.021849        0.131263              16.436213
A50 seed d   0.038164        0.259319              14.196479
A50 seed e   0.040760        0.255995              14.843164
B50 seed f   0.034798        0.256949              13.244931
B50 seed g   0.038682        0.246271              15.349910

Prediction of Trajectories of RBC Movement in Channel A
It is also possible to use the machine learning model for predicting the result of an experiment. To test this hypothesis, we chose the simulation experiment A50 with


Fig. 2. Trajectories of 50 RBC obtained by simulation and trajectories obtained from the model using the rest of the simulation results as a training set.

seeding e. The model was trained on data from the remaining simulation experiments performed in the same channel; the resulting trajectories are shown in Fig. 2. A second test used for training only the data from the simulations in geometry A with 50 cells (Fig. 3). The figures show that this model can predict trajectories. Due to its simplicity, however, errors occur: for example, a blood cell may pass through fixed obstacles, or the prediction may fail to capture a change in the blood flow (highlighted image areas in Figs. 2 and 3).

Fig. 3. Trajectories of 50 simulated RBC and trajectories from model using simulation with geometry A and 50 blood cells as training set

Approximation of Trajectories in Channels with Different Geometry
The following figures illustrate the velocities C(v) of cells across channels A and B. The velocities were obtained from basis learning using data from the simulations; the model allows us to get the velocity at any point of the channel. We see that for different geometries, the velocities at the individual channel locations are different. In Fig. 4 we show the velocities from geometry A with 50


Fig. 4. The velocities obtained by simulation with geometry A and 50 cells (seed b) as a training set

Fig. 5. The velocities obtained by simulation with geometry B and 50 cells as a training set

Fig. 6. The velocities obtained by simulation with geometry A and 100 cells as a training set

Fig. 7. The velocities obtained by simulation with geometry A and 50 cells (seed d) as a training set

blood cells. In Fig. 5 we show the velocities from geometry B with 50 blood cells. Based on the obtained results, we are preparing a statistical evaluation of the correctness of the trajectory prediction. In the same channel, with different numbers of blood cells, the differences are smaller: Fig. 7 shows the velocities from geometry A with 50 blood cells (seed d), and Fig. 6 the velocities from geometry A with 100 blood cells.

4 Evaluating Simulations Going Forward

Another avenue for evaluating simulations is visual object tracking. Video data from experiments is more available than ever, and fully evaluating such video is therefore a significant source of validation data for simulations. Blood flow simulations can produce video data, which can then be compared side by side with real recordings using computer vision. The findings of our method can further improve RBC tracking in video data (Fig. 8).

Fig. 8. Example of video used for RBC tracking

Tracking of RBCs has to overcome many hurdles that hinder data gathering. Our RBC tracking currently works in three steps:
1. Subtract the background of the video.
2. Detect RBCs using the Hough transform.
3. Connect the detected RBCs into tracks.
Since the experiment videos tend to be static, we are able to use the full background, built from the whole video instead of just the last few frames. The result of this process are frames that contain much less clutter and noise. We then apply the Hough transform to these frames to detect specific geometric shapes, in our case circles and ellipses, which encompass all the possible rotations of RBCs in blood flow [8]. The output of the detection, bounding boxes in which RBCs are located, is then fed into the tracking algorithm, which in several steps constructs tracks from the isolated RBC bounding boxes (Fig. 9). Given the RBC tracks, we are then able to draw conclusions about whether our simulations reflect reality [7]. The whole process has to overcome many hurdles: detection is not perfect, so we obtain bounding boxes that do not have a counterpart in each frame. This is mostly due to video quality, which can be fixed in future experiments, and to the overlap of cells.
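The following OpenCV sketch illustrates the three steps; the video path, Hough parameters, radii and linking distance threshold are illustrative assumptions, and the nearest-neighbour linking is a simplified stand-in for the actual tracking algorithm.

import cv2
import numpy as np

cap = cv2.VideoCapture("experiment.avi")    # assumed path to an experiment recording
frames = []
ok, frame = cap.read()
while ok:
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    ok, frame = cap.read()

# step 1: background built from the whole (static) video, then subtracted per frame
background = np.median(frames, axis=0).astype(np.uint8)

tracks = []                                 # each track is a list of (x, y) centres
for gray in frames:
    fg = cv2.absdiff(gray, background)      # frame with much less clutter and noise
    # step 2: detect circular RBC silhouettes with the Hough transform
    circles = cv2.HoughCircles(fg, cv2.HOUGH_GRADIENT, dp=1, minDist=8,
                               param1=80, param2=15, minRadius=3, maxRadius=10)
    if circles is None:
        continue
    for x, y, _r in circles[0]:
        # step 3: extend the nearest existing track, or start a new one
        nearest = min(tracks, key=lambda t: np.hypot(t[-1][0] - x, t[-1][1] - y), default=None)
        if nearest is not None and np.hypot(nearest[-1][0] - x, nearest[-1][1] - y) < 15:
            nearest.append((x, y))
        else:
            tracks.append([(x, y)])
print(len(tracks), "tracks found")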


Fig. 9. Example of detected tracks

Both of these issues are fixable at the level of tracking. The current tracking algorithm has several limitations: it is not able to connect bounding boxes when they are missing in intermediate frames. Overlap can be dealt with in the same way as missing frames, by utilizing outside information such as velocity or flow information in the device. The trajectory information of RBCs in blood flow experimented with in this paper can be further used to improve tracking. Tracking a given cell provides us with its velocity, which means we can predict where the cell will be in the next frame. Velocity alone does not account for local changes in flow direction, so knowing that in a particular area the next frame will be offset in a certain direction can help a great deal with tracking precision. The main idea is to improve existing methods (like the Hough transform and tracking) with additional data (trajectory and blood flow data), which can reinforce them and increase their precision. The final result of all these efforts will be more data for the validation of RBC simulations.

5 Conclusions and Discussion

Using the basis-learning method of Kohonen networks for the trajectory prediction of RBCs in artificial microfluidic devices is, to our knowledge, innovative, and the first results point to a very good perspective for its use. The first notable use can be the analysis of video recordings of real-world experiments. Track prediction of specific cells is important for improving tracking algorithms and thereby increasing the amount of data available for the calibration comparisons between real and simulation experiments. For learning the bases, it is possible to use data from the video analysis of unproblematic, easily identifiable cell movements, or data from verified simulation experiments. Motion prediction can in turn improve the tracking of RBCs in cases of intricate cell overlap, loss of some frames due to speed, or low-quality video recordings.


The versatility of the described prediction of RBC motion in channels also offers other possibilities. The ability to predict motion in parts of a channel where no motion was registered during a real or simulation experiment can also be decisive: we can then predict the influence of interactions between RBCs and a single CTC in any part of a channel. Another possibility is to expand the method to analyze additional inputs from any real or simulation experiment. In [5] we mentioned the importance and possibilities of describing the rotation of a cell, whose data description can be used for the same kind of prediction. A further example of formally identical input is the description of bounding cuboid boxes [6], which enables the prediction of the typical skew of cells in individual parts of a channel. A last area for improvement is the fine-tuning of the basis learning for the specific situation of periodic channels, where it is possible to specify the bases with greater accuracy based on the channel topology.

References
1. Arnold, A., et al.: ESPResSo 3.1: molecular dynamics software for coarse-grained models. In: Griebel, M., Schweitzer, M. (eds.) LNCSE, pp. 1–23. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-32979-1_1
2. Cimrák, I., Gusenbauer, M., Schrefl, T.: Modelling and simulation of processes in microfluidic devices for biomedical applications. Comput. Math. Appl. 64(3), 278–288 (2012)
3. Cimrák, I., Gusenbauer, M., Jančigová, I.: An ESPResSo implementation of elastic objects immersed in a fluid. Comput. Phys. Commun. 185(3), 900–907 (2014)
4. Bachratý, H., Bachratá, K.: On modeling blood flow in microfluidic devices. In: ELEKTRO 2014, pp. 518–521. IEEE (2014)
5. Bachratý, H., Kovalčíková, K., Bachratá, K., Slavík, M.: Methods of exploring the red blood cells rotation during the simulations in devices with periodic topology. In: 2017 International Conference on Information and Digital Technologies (IDT), pp. 36–46. IEEE (2017)
6. Bachratá, K., Bachratý, H., Slavík, M.: Statistics for comparison of simulations and experiments of flow of blood cells. In: EPJ Web of Conferences, vol. 143, pp. 2002–2016. EDP Sciences (2017)
7. Tomášiková, J.: Processing and analysis of videosequences from biological experiments using special detection and tracking algorithms. Master thesis, University of Žilina, Faculty of Management Science and Informatics, Department of Software Technology. Supervisor: doc. Mgr. Ivan Cimrák, Dr., Žilina, FRI ŽU, p. 63 (2017)
8. Mučka, F.: Algorithms and their implementation for analysis and image processing from recordings of biological experiments. Master thesis, University of Žilina, Faculty of Management Science and Informatics, Department of Software Technology. Supervisor: doc. Mgr. Ivan Cimrák, Dr., Žilina, FRI ŽU, p. 61 (2017)
9. Rojas, R.: Neural Networks: A Systematic Introduction. Springer Science & Business Media, Heidelberg (1996). https://doi.org/10.1007/978-3-642-61068-4
10. Ruder, S.: An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747 (2016)

Alignment-Free Z-Curve Genomic Cepstral Coefficients and Machine Learning for Classification of Viruses
Emmanuel Adetiba1,2(&), Oludayo O. Olugbara3, Tunmike B. Taiwo3, Marion O. Adebiyi4, Joke A. Badejo1, Matthew B. Akanle1, and Victor O. Matthews1
1 Department of Electrical and Information Engineering, College of Engineering, Covenant University, Ota, Nigeria
[email protected]
2 HRA, Institute for Systems Science, Durban University of Technology, P.O. Box 1334, Durban, South Africa
3 ICT and Society Research Group, Durban University of Technology, P.O. Box 1334, Durban 4000, South Africa
4 Department of Computer and Information Science, College of Science and Technology, Covenant University, Ota, Nigeria

Abstract. Accurate detection of pathogenic viruses has become highly imperative, because viral diseases constitute a huge threat to human health and wellbeing on a global scale. However, both traditional and recent techniques for viral detection suffer from various setbacks. In addition, some of the existing alignment-free methods are limited with respect to viral detection accuracy. In this paper, we present the development of an alignment-free, digital signal processing based method for pathogenic viral detection named Z-Curve Genomic Cepstral Coefficients (ZCGCC). To evaluate the method, ZCGCC were computed from twenty-six pathogenic viral strains extracted from the ViPR corpus. A naïve Bayesian classifier, a popular machine learning method, was experimentally trained and validated using the extracted ZCGCC and other alignment-free methods in the literature. Comparative results show that the proposed ZCGCC gives good accuracy (93.0385%) and improved performance compared to existing alignment-free methods.

Keywords: Alignment-free · Bayesian · Classifier · Naïve · Pathogenic · Virus · ViPR · ZCGCC

1 Introduction

Novel and re-emerging viruses continue to surface and unleash havoc on human health worldwide. Some of these viruses spread rapidly across the globe and culminate in high morbidity and mortality. For example, the Severe Acute Respiratory Syndrome (SARS) coronavirus caused a global pandemic in 2003, which resulted in approximately 916 deaths and affected around 30 countries [1]. The most recent outbreak of Ebola Virus Disease (EVD), which was the largest in the history of the


disease, started in December 2013 (a decade after the SARS epidemic) and continued until April 2015 in countries such as southern Guinea, Liberia, Nigeria and Sierra Leone. Reports on EVD indicated a total of 15,052 laboratory-confirmed cases and 11,169 deaths [2]. Hence, the prompt and unambiguous detection of pathogenic viruses is of critical importance for the active control and prevention of viral disease outbreaks.

Next Generation Sequencing (NGS) technologies provide unprecedented opportunities to researchers with respect to the development of new methodologies for viral detection, because a plethora of viral genomic sequences from NGS-based studies are available in the public domain for unrestricted access. However, researchers have observed that, given the abundant NGS data, the analysis of such data is the most challenging aspect of genomic-based viral detection [3]. This opens up a remarkable opportunity for researchers in the bioinformatics and Genomic Signal Processing (GSP) fields [4, 5]. Genomic Signal Processing is an emerging branch of bioinformatics that involves the use of Digital Signal Processing (DSP) techniques for genomic data analysis and the use of the resulting biological facts to develop system-based applications [5].

The traditional methods mostly used to identify the origin of genome sequences are pairwise and multiple sequence alignment. However, sequence alignment methods are fraught with difficulties for the genome-wide comparative analysis of viruses, because there is a high rate of divergence between different virus sequences due to gene mutation, horizontal gene transfer, and gene duplication, insertion and deletion [8]. Likewise, there is currently no universal oligonucleotide present in all viruses that could be used for homologous searches against public databases to detect viruses [3]. To address the problems of the alignment methods, several alignment-free methods have been developed for viral detection using genomic sequences. These include k-mer methods such as G-C content, dinucleotide composition profiles and frequency chaos game representation [9–12, 26]. Another, recently developed category of alignment-free methods is the genome-space-based methods [13, 14]; the Natural Vector (NV) representation and its different variants are representative examples [13, 15, 16]. However, the performance accuracy of some of the k-mer and NV methods still leaves room for improvement [15, 16, 26].

In the study at hand, we developed GSP-based features named Z-Curve Genomic Cepstral Coefficients (ZCGCC) as an alignment-free method that can be applied for the classification of pathogenic viruses. To evaluate the developed features, we extracted the genomic sequences of twenty-six pathogenic viral strains from the Virus Pathogen Database and Analysis Resource (ViPR) corpus [5, 6]. The twenty-six viral strains belong to four pathogenic viral species (namely Enterovirus, Dengue, Hepatitis C and Ebola), which are currently attracting global attention because they cause deadly diseases [5]. Different configurations of the naïve Bayes classifier were trained and validated with the ZCGCC. The naïve Bayes classifier was selected for this study because of its attractive characteristics, which have been widely exploited for the accurate classification of genomic sequences [7].


2 Materials and Methods

2.1 Dataset

Genomic sequences of twenty-six viral strains were extracted from the Virus Pathogen Database and Analysis Resource (ViPR) corpus [6] for this study. The extracted strains belong to four pathogenic viral species, namely Ebolavirus, Dengue virus, Hepatitis C and Enterovirus D68, which have been largely responsible for epidemic disease outbreaks. All available strains of these species were selected for the study at hand, to achieve a more elaborate and robust classification than the study in [5]. The distribution of the extracted data presents a challenge known as an imbalanced dataset, which is addressed with the random oversampling strategy in this study. Furthermore, there are high variations in sequence length, even for samples that belong to the same viral strain; for example, the sequence length for the Ebola Zaire strain varies from 22 to 19,897 while for Enterovirus H it varies from 20 to 7,374. These huge differences in the number of nucleotides within the same viral strain clearly illustrate why alignment-based and some existing alignment-free methods cannot offer accurate viral detection [17], and they provide the rationale for investigating a DSP technique in the current study. In total, 1,948 samples of viral strains were extracted. Since each viral strain represents a class in the dataset, our experimental dataset consequently contains twenty-six different classes.
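A minimal sketch of the random oversampling step mentioned above is shown below; the feature width and the random class assignments are illustrative placeholders, not the actual ZCGCC dimensions.

import numpy as np

def random_oversample(X, y, rng=np.random.default_rng(0)):
    # duplicate samples of every minority class until it matches the largest class
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = np.concatenate([rng.choice(np.flatnonzero(y == c), size=n_max, replace=True)
                          for c in classes])
    return X[idx], y[idx]

X = np.random.rand(1948, 30)          # 1948 extracted samples with placeholder features
y = np.random.randint(0, 26, 1948)    # twenty-six viral strain classes
X_bal, y_bal = random_oversample(X, y)
print(np.unique(y_bal, return_counts=True))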

2.2 Z-Curve Genomic Cepstral Coefficients

Deoxyribonucleic Acid (DNA) is a biomolecule that stores the digital information constituting the genetic blueprint of living organisms [9]. Each nucleotide in a DNA sequence is one of Adenine (A), Cytosine (C), Guanine (G) and Thymine (T). DNA sequence analysis using DSP methods requires a mapping of nucleotides to appropriate numbers before any other computational operations can be performed. The selection of the representative numbers affects how well the properties of these nucleotides are reflected for the detection of valuable biological characteristics [18]. The Z-Curve genomic mapping method is selected in this study because of its reported strengths over other competing methods [19, 20, 27, 28]. The steps for computing the proposed ZCGCC are represented in the block diagram shown in Fig. 1, and the computation procedures are presented subsequently.

Step 1: The first block in Fig. 1 involves the computation of the Z-curve from the input nucleotide sequences. The Z-curve is a three-dimensional space curve that constitutes a unique numerical representation of a given DNA sequence [19]. A vital advantage of the Z-curve representation over other nucleotide numerical representation methods is its reproducibility property: once the coordinates of the Z-curve are defined, the corresponding nucleotides can be uniquely reconstructed [20]. Given a nucleotide sequence that is read from the 5' to the 3' end, with N bases inspected from the first to the n-th base, the cumulative occurrence counts of the bases A, C, G and T are represented by A_n, C_n, G_n and T_n, respectively. For points Q_i, i = 0, 1, 2, ..., n − 1, in a 3-D coordinate system, the line that connects the



Fig. 1. Functional block diagram of the Z-Curve Genomic Cepstral Coefficients (ZCGCC).

nodes Q0(x0, y0, z0), Q1(x1, y1, z1), Q2(x2, y2, z2), …, Qn(xn, yn, zn) in a successive manner is the Z-Curve of the nucleotide sequence being examined. These nodes are mathematically represented as [20, 28]:

\[
\begin{cases}
x[n] = 2(A_n + G_n) - n\\
y[n] = 2(A_n + C_n) - n\\
z[n] = 2(A_n + T_n) - n
\end{cases}
\qquad \forall\, n = 0, 1, 2, \ldots, N-1
\tag{1}
\]

where A_0 = C_0 = G_0 = T_0 = 0 and x_0 = y_0 = z_0 = 0. In order to derive biological meaning from Eq. (1), it is normalized using A_n + C_n + G_n + T_n = n to obtain:

\[
\begin{cases}
x[n] = (A_n + G_n) - (C_n + T_n) \equiv R_n - Y_n\\
y[n] = (A_n + C_n) - (G_n + T_n) \equiv M_n - K_n\\
z[n] = (A_n + T_n) - (C_n + G_n) \equiv W_n - S_n
\end{cases}
\qquad \forall\, n = 0, 1, 2, \ldots, N-1
\tag{2}
\]

where R_n, Y_n, M_n, K_n, W_n and S_n are the distributions of the purine, pyrimidine, amino, keto, weak hydrogen-bond and strong hydrogen-bond bases respectively [21]. The variables x[n], y[n] and z[n] in Eq. (2), which are also illustrated as the outputs of the first block in Fig. 1, are the three independent components of the Z-Curve, each with a distinct biological meaning. Component x[n] represents the distribution of purine/pyrimidine bases (i.e. A or G / C or T) from the first to the nth input nucleotide and possesses the following attributes:

\[
x[n] \;\text{is}\;
\begin{cases}
\text{positive} & \text{if } R_n > Y_n\\
\text{negative} & \text{if } R_n < Y_n\\
\text{zero} & \text{if } R_n = Y_n
\end{cases}
\tag{3}
\]



The second component of the Z-Curve, y[n], is the distribution of amino/keto bases (i.e. A or C / G or T) along the first to the nth input nucleotides, with the following attributes:

\[
y[n] \;\text{is}\;
\begin{cases}
\text{positive} & \text{if } M_n > K_n\\
\text{negative} & \text{if } M_n < K_n\\
\text{zero} & \text{if } M_n = K_n
\end{cases}
\tag{4}
\]

The third component of the Z-Curve, z[n], is the distribution of weak/strong hydrogen-bond bases (i.e. A or T / C or G) along the first to the nth input nucleotides, with the following characteristics:

\[
z[n] \;\text{is}\;
\begin{cases}
\text{positive} & \text{if } W_n > S_n\\
\text{negative} & \text{if } W_n < S_n\\
\text{zero} & \text{if } W_n = S_n
\end{cases}
\tag{5}
\]

Step 2: The three Z-Curve components computed in the first step, which are streams of digital signals obtained from the input nucleotides, are passed to the second block in Fig. 1. At this stage, the Discrete Fourier Transform (DFT) is applied to each digital signal individually:

\[
X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N},\qquad
Y[k] = \sum_{n=0}^{N-1} y[n]\, e^{-j 2\pi k n / N},\qquad
Z[k] = \sum_{n=0}^{N-1} z[n]\, e^{-j 2\pi k n / N},\qquad
\forall\, k = 0, 1, 2, \ldots, N-1
\tag{6}
\]

where X[k], Y[k] and Z[k] are the spectra of the digital signals. The power spectrum, a quadratic combination of these spectra, was computed for some selected pathogenic viral sequences in this study, and the outputs are presented in Sect. 3.1. Step 3: Each of the nucleotide spectra computed in the previous step contains peaks that represent the dominant frequency components of the input nucleotide signals. The smooth curve that connects the peaks of a spectrum is referred to as the spectral envelope. The spectral envelope carries the identity of the input nucleotide sequences, similar to what is obtained in other DSP applications such as speech processing and mechanical fault diagnosis [22, 23]. The separation of the spectral envelope and the spectral details from the spectrum is referred to as cepstral analysis. The procedure for cepstral analysis is represented by the third, fourth and fifth blocks in Fig. 1 and mathematically depicted as follows:


\[
c_x[n] = \sum_{k=0}^{N-1} \log(X[k])\, e^{j 2\pi k n / N},\qquad
c_y[n] = \sum_{k=0}^{N-1} \log(Y[k])\, e^{j 2\pi k n / N},\qquad
c_z[n] = \sum_{k=0}^{N-1} \log(Z[k])\, e^{j 2\pi k n / N}
\tag{7}
\]

Using Euler's formula, Eq. (7) becomes:

\[
\begin{aligned}
c_x[n] &= \sum_{k=0}^{N-1} \log(X[k])\cos\Big(\frac{2\pi k n}{N}\Big) + j \sum_{k=0}^{N-1} \log(X[k])\sin\Big(\frac{2\pi k n}{N}\Big)\\
c_y[n] &= \sum_{k=0}^{N-1} \log(Y[k])\cos\Big(\frac{2\pi k n}{N}\Big) + j \sum_{k=0}^{N-1} \log(Y[k])\sin\Big(\frac{2\pi k n}{N}\Big)\\
c_z[n] &= \sum_{k=0}^{N-1} \log(Z[k])\cos\Big(\frac{2\pi k n}{N}\Big) + j \sum_{k=0}^{N-1} \log(Z[k])\sin\Big(\frac{2\pi k n}{N}\Big)
\end{aligned}
\tag{8}
\]

In each line of Eq. (8), the cosine term is the real cepstrum and the complete expression is the complex cepstrum.

where each of c_x[n], c_y[n] and c_z[n] represents the complex Z-Curve cepstrum of the x[n], y[n] and z[n] components of the Z-Curve for the input nucleotides respectively. The complex cepstrum is a combination of the real and imaginary cepstra, as shown in Eq. (8): the real cepstrum is derived from the log-magnitude spectrum of each signal, while the imaginary cepstrum carries the phase components. The spectral envelope and spectral details are captured in the real cepstrum. It should be noted that the word "cepstrum" was coined by reversing the first syllable of "spectrum"; hence, in the cepstrum domain, "quefrency" stands for frequency and "lifter" is used in place of filter [22]. The spectral envelope corresponds to the low-quefrency components, while the spectral details are the high-quefrency components of the cepstrum. Authors in other DSP application domains have reported that the first 15 or 20 coefficients of a cepstrum aptly represent the spectral envelope [24]. As depicted in the fifth block of Fig. 1, the first 15 or 20 coefficients (the spectral envelope) of the real cepstrum are liftered using the window:

\[
w[n] =
\begin{cases}
1, & 0 \le n \le L\\
0, & \text{elsewhere}
\end{cases}
\tag{9}
\]

where L is the cut-off length of the liftering window, which can be either 15 or 20 as stated earlier. The liftering window in Eq. (9) is multiplied with each of the real cepstra of Eq. (8) to obtain:



\[
cl_x[n] = w[n]\cdot c_x[n],\qquad
cl_y[n] = w[n]\cdot c_y[n],\qquad
cl_z[n] = w[n]\cdot c_z[n]
\tag{10}
\]

where cl_x[n], cl_y[n] and cl_z[n] are the low-quefrency coefficients of c_x[n], c_y[n] and c_z[n] respectively. Step 4: In the final step, depicted by the last block of Fig. 1, the low-quefrency cepstral coefficients obtained in Step 3 are concatenated to obtain the Z-Curve Genomic Cepstral Coefficients (ZCGCC). The ZCGCC is a compact genomic feature vector representing the distribution of the dominant components of the purine, pyrimidine, amino, keto, weak and strong hydrogen-bond bases in the input nucleotide sequences. The ZCGCC feature vector is therefore an alignment-free identity of the input nucleotide sequences, and it has either 45 or 60 elements depending on whether L in Eq. (9) is 15 or 20 respectively. A naïve Bayes classifier is used hereafter in this study to determine the discriminatory potency of the ZCGCC when it is applied to extract features from the pathogenic viral dataset.
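To make Steps 1–4 concrete, the following minimal Python sketch (an editorial illustration, not the authors' released code) computes a ZCGCC vector from a raw nucleotide string. It assumes the real cepstrum is obtained as the inverse DFT of the log-magnitude spectrum, adds a small epsilon to guard against log(0), and simply skips any base other than A, C, G and T.

```python
import numpy as np

def z_curve(seq):
    """Step 1: map a nucleotide string to the three Z-Curve components
    x[n] (purine/pyrimidine), y[n] (amino/keto), z[n] (weak/strong H-bond)."""
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    x, y, z = [], [], []
    for base in seq.upper():
        if base not in counts:       # assumption: skip degenerate/unknown bases
            continue
        counts[base] += 1
        a, c, g, t = counts["A"], counts["C"], counts["G"], counts["T"]
        x.append((a + g) - (c + t))  # Rn - Yn, Eq. (2)
        y.append((a + c) - (g + t))  # Mn - Kn
        z.append((a + t) - (c + g))  # Wn - Sn
    return (np.asarray(x, float), np.asarray(y, float), np.asarray(z, float))

def real_cepstrum(component):
    """Steps 2-3: DFT (Eq. 6), log magnitude, inverse DFT -> real cepstrum."""
    spectrum = np.fft.fft(component)
    log_magnitude = np.log(np.abs(spectrum) + 1e-12)  # epsilon guards log(0)
    return np.real(np.fft.ifft(log_magnitude))

def zcgcc(seq, L=20):
    """Step 4: lifter the first L coefficients of each real cepstrum
    (Eqs. 9-10) and concatenate them into the 3*L-element ZCGCC vector."""
    return np.concatenate([real_cepstrum(c)[:L] for c in z_curve(seq)])

vector = zcgcc("ATGCGTACGTTAGCCGATAGGCTTACGATCGATCGGATCCTAGGCATGCATTAGC")
print(vector.shape)  # (60,) when L = 20; use L = 15 for the 45-element variant
```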

2.3 Experiments

In this study, three experiments were carried out on a PC with an Intel Core i5 CPU running at 2.50 GHz, 6.00 GB of RAM, and a 64-bit Windows 8 operating system. In all the experiments, the forty-five- and sixty-element ZCGCC were utilized and their performances were compared using appropriate metrics. In the first experiment, the naïve Bayes classifier was trained with the ZCGCC extracted from the imbalanced dataset. In the second experiment, random oversampling was applied to obtain a balanced dataset. The random oversampling strategy involves adding instances to the minority classes in a random manner [25]. Since the highest number of instances for any class in the dataset is 100 (Table 1), we increased the number of instances of all the minority classes (those with fewer than 100 instances) to 100 to obtain the balanced dataset. The ZCGCC feature vectors extracted from the balanced dataset were then used to train the naïve Bayes classifier. The third experiment compared the variant of the ZCGCC that gave the best result on the balanced dataset in the second experiment with two other alignment-free methods in the literature, namely the Electron-Ion Interaction Pseudopotential Genomic Cepstral Coefficient (EIIP-GCC) [5] and the Frequency Chaos Game Representation (FCGR) [26].
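For illustration, a minimal sketch of the random oversampling step under the assumptions just stated (every minority class is topped up to 100 instances by duplicating randomly drawn samples; the function name is hypothetical):

```python
import numpy as np
from sklearn.utils import resample

def random_oversample(X, y, target=100, seed=42):
    """Duplicate randomly drawn minority-class rows until each class
    has `target` instances; classes already at `target` are left as-is."""
    X_parts, y_parts = [], []
    for label in np.unique(y):
        X_class = X[y == label]
        if len(X_class) < target:
            X_class = resample(X_class, replace=True,
                               n_samples=target, random_state=seed)
        X_parts.append(X_class)
        y_parts.append(np.full(len(X_class), label))
    return np.vstack(X_parts), np.concatenate(y_parts)
```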

3 Results and Discussion

3.1 Power Spectra of the Z-Curve Encoded Viruses

Figure 2 shows the distinct power spectra of the different strains of Enterovirus, Hepatitis C, Dengue and Ebola viruses. Similar to the illustrations in Fig. 2, previous



Fig. 2. Power spectra of Z-Curve encoded Enterovirus, Hepatitis C, Dengue and Ebola viruses.

studies have also utilized the power spectra of the Z-Curve to graphically illustrate the mitochondrial DNA of Homo sapiens [27] and lung cancer biomarker genes [28, 29].

3.2 Classifier Training Results

The results of the first experiment, in which the imbalanced dataset was investigated, are shown in Table 1. Four different naïve Bayes kernel functions were tested, namely Gaussian, uniform, Epanechnikov and triangular [30]. The sixty-element ZCGCC gave higher accuracies and lower Misclassification Errors (ME) for each of the kernel functions, and the triangular function ranked best (accuracy = 91.2218%, ME = 0.0878) for the sixty-element ZCGCC. A two-sample t-test was further utilized to investigate whether the difference between the forty-five- and sixty-element ZCGCC is statistically significant. The test statistic indicates that the null hypothesis of no difference between the means of the two sets of accuracies is rejected, p < 0.05 (p = 0.0278), as is that for the two sets of MEs, p < 0.05 (p = 0.0280). This shows that the performance of the sixty-element ZCGCC is significantly better than that of the forty-five-element ZCGCC for the imbalanced dataset.
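For illustration, this comparison can be reproduced directly from the accuracy columns of Table 1 with a two-sample t-test in SciPy; the resulting p-value matches the 0.0278 reported above.

```python
from scipy.stats import ttest_ind

# Accuracies from Table 1 (imbalanced dataset), one value per kernel function.
acc_45 = [89.5277, 89.0144, 87.7823, 87.1150]  # 45-element ZCGCC
acc_60 = [91.2218, 90.6571, 90.1437, 89.3224]  # 60-element ZCGCC

t_stat, p_value = ttest_ind(acc_45, acc_60)
print(round(p_value, 4))  # 0.0278 -> null hypothesis rejected at p < 0.05
```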


Table 1. Experimental results of the imbalanced dataset with ZCGCC.

Kernel function   ZCGCC (45 elements)       ZCGCC (60 elements)
                  Accuracy (%)   ME         Accuracy (%)   ME
Triangular        89.5277        0.1047     91.2218        0.0878
Gaussian          89.0144        0.1099     90.6571        0.0934
Epanechnikov      87.7823        0.1222     90.1437        0.0986
Uniform           87.1150        0.1289     89.3224        0.1068

Table 2 shows the results of the second experiment, in which the balanced dataset obtained through random oversampling was used to train the naïve Bayes classifier. The sixty-element ZCGCC again gave higher accuracies and lower MEs for all the kernel functions compared with its forty-five-element counterpart. As in the first experiment, the triangular kernel function gave the best overall performance for the sixty-element ZCGCC (accuracy = 93.0385%, ME = 0.0696). It is also remarkable that the performance results of the ZCGCC for the balanced dataset in the second experiment are better than the corresponding results in the first experiment for all the kernel functions, which shows that the random oversampling method positively influenced the performance of the ZCGCC. Since the sixty-element ZCGCC was superior to the forty-five-element ZCGCC in the first and second experiments, we further investigated whether the improvement of the sixty-element ZCGCC on the balanced dataset (second experiment) over the sixty-element ZCGCC on the imbalanced dataset (first experiment) is statistically significant. The null hypothesis of no difference between the two sets of accuracies is rejected because p < 0.05 (p = 0.0122), and the null hypothesis of no difference between the means of the two sets of MEs is also rejected, p < 0.05 (p = 0.0122). Thus, the performance of the sixty-element ZCGCC on the balanced dataset is significantly better than on the imbalanced dataset, and the sixty-element ZCGCC is proposed as an alignment-free method for viral pathogen detection in this study based on its overall best performance.

Table 2. Experimental results of the balanced dataset with ZCGCC.

Kernel function   ZCGCC (45 elements)       ZCGCC (60 elements)
                  Accuracy (%)   ME         Accuracy (%)   ME
Triangular        91.9615        0.0804     93.0385        0.0696
Gaussian          91.6538        0.0835     92.7308        0.0727
Uniform           90.6923        0.0937     91.2308        0.0877
Epanechnikov      90.6154        0.0938     92.3462        0.0765

The third experiment compared the proposed alignment-free method (i.e. the sixty-element ZCGCC) with two other alignment-free methods in the literature, namely EIIP-GCC [6] and FCGR [26]. Table 3 shows the results of the third experiment for EIIP-GCC and FCGR using the balanced dataset.



We deem it adequate to use the balanced dataset for this comparison, since it produced the best result for the proposed alignment-free method in the second experiment. The performance results of the proposed sixty-element ZCGCC in Table 2 are better than those of EIIP-GCC in Table 3 for all the corresponding kernel functions. For instance, the triangular kernel function gave the highest accuracy of 93.0385% (ME = 0.0696) for the ZCGCC, whereas the accuracy obtained with the triangular kernel function for EIIP-GCC was 84.5% (ME = 0.1550). Furthermore, the improvement of the proposed ZCGCC over EIIP-GCC is statistically significant, p < 0.05 (p = 8.82e−06). The best performance of the proposed ZCGCC in Table 2, obtained with the triangular kernel function, is also slightly better than the highest performance of the FCGR (accuracy = 92.9231%, ME = 0.0708).

Table 3. Experimental results of the balanced dataset with EIIP-GCC and FCGR.

Kernel function   EIIP-GCC                  FCGR
                  Accuracy (%)   ME         Accuracy (%)   ME
Epanechnikov      84.6154        0.1538     92.9231        0.0708
Triangular        84.5000        0.1550     92.6923        0.0731
Uniform           83.1154        0.1688     92.3846        0.0762
Gaussian          82.7308        0.1727     91.8846        0.0812

It can be inferred from the results obtained in this study that the first 20 elements of the real cepstrum are more representative of the spectral envelope of the genomic signal. A previous study reported the development of ZCURVE_V, a gene-finding application for viruses that uses DNA sequences and the Z-Curve mathematical paradigm; its authors reported that ZCURVE_V can accurately predict genes in viral genomes as short as about 1,000 nucleotides [19]. In contrast, the alignment-free ZCGCC method proposed in this study detects viral genomes of both long and short lengths, with an accuracy that compares favorably with existing alignment-free methods in the literature.

4 Conclusion

We have reported the development of ZCGCC, an alignment-free method for virus detection. The sixty-element ZCGCC gave performance superior to EIIP-GCC and comparable to FCGR. Moreover, ZCGCC provides remarkable advantages such as low dimensionality, global genome analysis and low computational requirements, which make it a promising method for developing diagnostic tools for the detection of pathogenic viral diseases. Future work will include an investigation of the ZCGCC for the detection of other organisms in the prokaryotic and eukaryotic domains of life. We also hope to experiment with other machine learning methods to investigate the possibility of improved performance.



Acknowledgement. Funding to present this work at IWBBIO 2018 was provided by the Covenant University Centre for Research, Innovation and Development, Canaanland, Ota, Nigeria.

References

1. Xie, G., Yu, J., Duan, Z.: New strategy for virus discovery: viruses identified in human feces in the last decade. Sci. China Life Sci. 56(8), 688–696 (2013)
2. Kaushik, A., Tiwari, S., Jayant, R.D., Marty, A., Nair, M.: Towards detection and diagnosis of Ebola virus disease at point-of-care. Biosens. Bioelectron. 75, 254–272 (2016)
3. Mokili, J.L., Rohwer, F., Dutilh, B.E.: Metagenomics and future perspectives in virus discovery. Curr. Opin. Virol. 2(1), 63–77 (2012)
4. Mabrouk, M.S.: A study of the potential of EIIP mapping method in exon prediction using the frequency domain techniques. Am. J. Biomed. Eng. 2(2), 17–22 (2012)
5. Sathish Kumar, S., Duraipandian, N.: An effective identification of species from DNA sequence: a classification technique by integrating DM and ANN. Int. J. Adv. Comput. Sci. Appl. 3(8), 104–114 (2012)
6. Adetiba, E., Olugbara, O.O., Taiwo, T.B.: Identification of pathogenic viruses using genomic cepstral coefficients with radial basis function neural network. In: Pillay, N., Engelbrecht, A.P., Abraham, A., du Plessis, M.C., Snášel, V., Muda, A.K. (eds.) Advances in Nature and Biologically Inspired Computing. AISC, vol. 419, pp. 281–291. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-27400-3_25
7. Pickett, B.E., Greer, D.S., Zhang, Y.: Virus pathogen database and analysis resource (ViPR): a comprehensive bioinformatics database and analysis resource for the coronavirus research community. Viruses 4, 3209–3226 (2012)
8. Wang, Q., Garrity, G.M., Tiedje, J.M., Cole, J.R.: Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ. Microbiol. 73(16), 5261–5267 (2007)
9. Li, Y., Tian, K., Yin, C., He, R.L., Yau, S.S.T.: Virus classification in 60-dimensional protein space. Mol. Phylogenet. Evol. 99, 53–62 (2016)
10. Vinga, S., Almeida, J.: Alignment-free sequence comparison: a review. Bioinformatics 19, 513–523 (2003). https://doi.org/10.1093/bioinformatics/btg005
11. Kantorovitz, M.R., Robinson, G.E., Sinha, S.: A statistical method for alignment-free comparison of regulatory sequences. Bioinformatics 23(13), i249–i255 (2007)
12. Dai, Q., Yang, Y., Wang, T.: Markov model plus k-word distributions: a synergy that produces novel statistical measures for sequence comparison. Bioinformatics 24(20), 2296–2302 (2008)
13. Sims, G.E., Jun, S.R., Wu, G.A., Kim, S.H.: Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc. Natl. Acad. Sci. 106(8), 2677–2682 (2009)
14. Deng, M., Yu, C., Liang, Q., He, R.L., Yau, S.S.T.: A novel method of characterizing genetic sequences: genome space with biological distance and applications. PLoS One 6(3), e17293 (2011)
15. Yu, C., Liang, Q., Yin, C., He, R.L., Yau, S.S.T.: A novel construction of genome space with biological geometry. DNA Res. 17, 155–168 (2010)
16. Yu, C., Hernandez, T., Zheng, H., Yau, S.C., Huang, H.H., He, R.L., Yau, S.S.T.: Real time classification of viruses in 12 dimensions. PLoS One 8(5), e64328 (2013)



17. Huang, H.H., Yu, C., Zheng, H., Hernandez, T., Yau, S.C., He, R.L., Yau, S.S.T.: Global comparison of multiple-segmented viruses in 12-dimensional genome space. Mol. Phylogenet. Evol. 81, 29–36 (2014)
18. Anastassiou, D.: DSP in genomics: processing and frequency-domain analysis of character strings. In: Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2001), vol. 2, pp. 1053–1056. IEEE (2001)
19. Bai Arniker, S., Kwan, H.K.: Advanced numerical representation of DNA sequences. In: International Conference on Bioscience, Biochemistry and Bioinformatics IPCBEE, vol. 3, p. 1 (2012)
20. Guo, F.B., Lin, Y., Chen, L.L.: Recognition of protein-coding genes based on Z-curve algorithms. Curr. Genomics 15(2), 95–103 (2014)
21. Zhang, R., Zhang, C.T.: A brief review: the Z-curve theory and its application in genome analysis. Curr. Genomics 15(2), 78–94 (2014)
22. Cornish-Bowden, A.: Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res. 13(9), 3021 (1985)
23. Randall, R.B.: A history of cepstrum analysis and its application to mechanical problems. In: International Conference at Institute of Technology of Chartres, France, pp. 11–16 (2013)
24. Thakur, S., Adetiba, E., Olugbara, O.O., Millham, R.: Experimentation using short-term spectral features for secure mobile internet voting authentication. Math. Probl. Eng. (2015)
25. Sakshat Virtual Labs: Cepstral Analysis of Speech (2011). iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1. Accessed 28 July 2016
26. Adetiba, E., Badejo, J.A., Thakur, S., Matthews, V.O., Adebiyi, M.O., Adebiyi, E.F.: Experimental investigation of frequency chaos game representation for in silico and accurate classification of viral pathogens from genomic sequences. In: Rojas, I., Ortuño, F. (eds.) IWBBIO 2017. LNCS, vol. 10208, pp. 155–164. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-56148-6_13
27. Vijayan, K., Nair, V.V., Gopinath, D.P.: Classification of organisms using frequency-chaos game representation of genomic sequences and ANN. In: 10th National Conference on Technological Trends (NCTT 2009), pp. 6–7 (2009)
28. Shao, J., Yan, X., Shao, S.: SNR of DNA sequences mapped by general affine transformations of the indicator sequences. J. Math. Biol. 67(2), 433–451 (2013)
29. Adetiba, E., Olugbara, O.O.: Improved classification of lung cancer using radial basis function neural network with affine transforms of Voss representation. PLoS One 10(12), e0143542 (2015)
30. Mathworks: ClassificationNaiveBayes class. http://www.mathworks.com/help/stats/classificationnaivebayes-class.html. Accessed 28 July 2016

A Combined Approach of Multiscale Texture Analysis and Interest Point/Corner Detectors for Microcalcifications Diagnosis

Liliana Losurdo1(B), Annarita Fanizzi1, Teresa M. A. Basile2,3, Roberto Bellotti2,3, Ubaldo Bottigli4, Rosalba Dentamaro1, Vittorio Didonna1, Alfonso Fausto5, Raffaella Massafra1, Alfonso Monaco3, Marco Moschetta6, Ondina Popescu1, Pasquale Tamborra1, Sabina Tangaro3, and Daniele La Forgia1

1 I.R.C.C.S. "Giovanni Paolo II" National Cancer Centre, Bari, Italy
[email protected]
2 Department of Physics, University of Bari "Aldo Moro", Bari, Italy
3 Bari Division, INFN National Institute for Nuclear Physics, Bari, Italy
4 Department of Physical Sciences, Earth and Environment, University of Siena, Siena, Italy
5 Department of Diagnostic Imaging, University Hospital of Siena, Siena, Italy
6 Interdisciplinary Department of Medicine, University of Bari "Aldo Moro", Bari, Italy

Abstract. Screening programs use mammography as the primary diagnostic tool for detecting breast cancer at an early stage. The diagnosis of some lesions, such as microcalcifications, is still difficult today for radiologists. In this paper, we propose an automatic model for characterizing and discriminating tissue as normal/abnormal and benign/malignant in digital mammograms, as a support tool for radiologists. We trained a Random Forest classifier on textural features extracted from a multiscale image decomposition based on the Haar wavelet transform, combined with the interest points and corners detected by using Speeded Up Robust Features (SURF) and the Minimum Eigenvalue Algorithm (MinEigenAlg), respectively. We tested the proposed model on 192 ROIs extracted from 176 digital mammograms of a public database. The proposed model performed well in the prediction of normal/abnormal and benign/malignant ROIs, with median AUC values of 98.46% and 94.19%, respectively. The experimental results are comparable with related work. Keywords: Computer-aided diagnosis (CADx) · Microcalcifications · Digital mammograms · Haar wavelet transform · SURF · Minimum Eigenvalue Algorithm · Random Forest



1 Introduction

Breast cancer is the leading cause of death among women all over the world. It is difficult to prevent, but since the first studies [1, 2] it has been shown that an early diagnosis of breast lesions increases the chances of survival and reduces the mortality rate. Currently, screening programs use mammography [3–5] as the primary diagnostic tool for detecting breast cancer at an early stage. Despite the technological improvement of current mammographic devices, the detection of some lesions is still difficult for radiologists. In particular, microcalcifications (MCs), tiny spots of calcium deposits localized or broadly diffused along the breast areas, are often an early sign of breast cancer: about 50% of breast diseases with tumor lesions are accompanied by microcalcifications that are extremely minute (sometimes not exceeding 0.1 mm) and grouped in clusters. Analysis of MCs is usually based on the radiologist's subjective judgment; this process is sometimes difficult as well as inaccurate, resulting in lesions being missed during routine checks [6] or in many unnecessary breast biopsies being performed on benign calcification clusters. Thus, in screening programs, double reading of mammograms (two radiologists read the same mammograms) [7] has been advocated to reduce the rate of missed cancers; but the workload and cost associated with double reading are high. In addition, it is well known that there is a strong positive correlation between breast density and the risk of developing breast cancer [8], and dense breast tissue can hide lesions, causing cancer to be detected at later stages [9]. For these reasons, the need to support radiologists with an automatic tool for the identification and characterization of such lesions is evident. Several works have described computerized methods that extract features of clustered microcalcifications to improve radiologists' performance in differentiating malignant from benign clusters [10–12]. To improve the accuracy of identifying clustered microcalcification patterns through both computer-aided feature extraction and classification methods, it is worth developing a mathematical model or method by which radiologists can quantitatively evaluate the difference between a benign clustered microcalcification and its malignant counterpart. Such methods are known as Computer-Aided Detection/Diagnosis (CAD) systems, which may offer radiologists reliable support in the evaluation of mammographic images. They use computer technologies to detect abnormalities in mammograms, playing a key role in the early detection of breast cancer and helping to reduce the death rate among women with breast pathologies in a cost-effective manner [13]. Usually, CAD systems are divided into two levels: detection, responsible for identifying the suspicious regions present on the mammogram (Regions Of Interest, ROIs) and therefore for the preventive elimination of the areas not at risk; and classification of the ROIs into lesions and healthy tissue. In practice, both levels perform a classification operation and follow a pre-processing phase in which most of the noise is removed from the image and the potentially interesting breast lesion structures are enhanced. The difference lies in the fact that the detection level classifies the regions as suspicious



and unsuspicious, discarding the latter, while the classification level analyzes only the regions that "survived" the first level and ranks them as true abnormalities or false alarms. In the last ten years, CAD systems have increasingly been used as an aid by radiologists for the detection and classification of diseases, and many methods have been proposed to achieve a robust mammography-based CAD system. Although there are various types of mammographic abnormality, they can primarily be grouped into either masses or microcalcifications. In particular, several papers regarding MCs have been published, describing microcalcification detection methods ranging from image analysis to modern machine learning. The basic idea in image analysis is to identify features, for example morphological and morphometric ones, that characterize MC signals, and so to suppress the tissue background in a mammogram, for example by exploiting wavelet analysis [14, 15]. Different from these approaches, machine learning methods treat MC detection as a two-class classification problem, wherein a decision function is determined with supervised learning from data examples in mammogram images [16–18]. However, as the proposed work is concerned with microcalcification diagnosis, a brief description of some published CAD methods for MCs in mammograms follows. In the work described in [19], a dual-tree complex wavelet transform (DT-CWT) is used to assist radiologists, much like double reading. Here, after a feature extraction step based on the wavelet transform, 14 descriptors were introduced to define the characteristics of the suspicious MCs. Finally, a Support Vector Machine (SVM) classifier was used, first to classify the extracted ROIs as normal or abnormal, and then to classify the ROIs containing microcalcifications as benign or malignant. In [20] there are two stages of classification: first the mammograms were classified into normal and microcalcification tissues using wavelet features; in the second stage, individual microcalcification objects were classified as benign or malignant using a set of features extracted with the wavelet transform. In that work, two types of classifiers were used, namely Artificial Neural Network (ANN) and SVM, as in [21]. An unsupervised technique based on Generalized Gaussian Density (GGD) is proposed in [22]; the authors performed microcalcification characterization using morphologic features that feed a neuro-fuzzy system, classifying the detected breast microcalcifications into benign and malignant classes. In [23] the topology/connectivity of individual MCs within a cluster is analyzed using multiscale morphology; graph-theoretical features are then extracted, which constitute the topological feature space for modeling and classifying microcalcification clusters by means of k-nearest-neighbor (kNN) based classifiers. In a previous work [24], we developed a CAD system for full-field digital mammograms, which consisted of two main steps: a pre-processing step followed by a microcalcification detection step. First, starting from the mammogram images, a set of image analysis algorithms was exploited: the Sobel edge detection algorithm and a Gaussian filter were applied in order to improve contrast and reduce noise and, in this way, to point out regions of the breast



that potentially contain findings of interest. Secondly, a two-stage phase interleaved saturation steps on the image, aimed at removing noise from the images elaborated during the first phase, with structure-finding steps. The goal of this second phase was to exactly delineate the contour of single microcalcifications, and it was performed by means of the Hough transform [25, 26], a technique commonly used for the detection of curves such as lines, circles and ellipses. Finally, the microcalcifications were grouped into significant clusters by exploiting a set of codified domain expert rules automatically applied in the final phase of the procedure. In this work, we describe developments of that previous work on the detection of clustered microcalcifications. It is mainly based on a texture analysis process performed on the identified regions, with the aim of characterizing them as normal/abnormal and benign/malignant tissue. The method consists of a two-phase approach, feature extraction and ROI classification, in order to classify clusters of MCs in full-field digital mammograms. Indeed, for each ROI some textural features, such as classic statistical features on a multiscale decomposition based on the Haar wavelet transform [27, 28], are extracted. Moreover, we detect interest points and corners by using Speeded Up Robust Features (SURF) [29] and the Minimum Eigenvalue Algorithm (MinEigenAlg) [30], respectively. Then, training with a state-of-the-art classifier, the Random Forest [31], is performed in order to recognize clustered microcalcifications. The proposed approach is tested on full-field digital mammograms extracted from the public database BCDR (Breast Cancer Digital Repository, http://bcdr.inegi.up.pt) [32]. Finally, the model's performance is assessed in cross-validation and evaluated in terms of accuracy, sensitivity and specificity, obtaining results in agreement with the literature.
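For illustration only, here is a rough OpenCV sketch of the pre-processing and Hough steps just described; the filter sizes and Hough parameters below are illustrative guesses, not the values used in [24].

```python
import cv2

def detect_mc_candidates(mammogram):
    """Gaussian smoothing, Sobel edges, then a circular Hough transform to
    pick out tiny near-circular spots; parameter values are hypothetical."""
    blurred = cv2.GaussianBlur(mammogram, (5, 5), 0)        # noise reduction
    grad_x = cv2.Sobel(blurred, cv2.CV_64F, 1, 0, ksize=3)  # Sobel edges
    grad_y = cv2.Sobel(blurred, cv2.CV_64F, 0, 1, ksize=3)
    edges = cv2.convertScaleAbs(cv2.magnitude(grad_x, grad_y))
    # Microcalcifications are very small, so only tiny radii are searched.
    circles = cv2.HoughCircles(edges, cv2.HOUGH_GRADIENT, dp=1, minDist=5,
                               param1=100, param2=15, minRadius=1, maxRadius=10)
    return [] if circles is None else circles[0]
```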

2 Materials and Methods

2.1 Dataset Selection

The image database consisted of a set of digital mammograms selected from the Breast Cancer Digital Repository (BCDR) [32]. Currently, the BCDR contains cases of 1,734 patients with mammography and ultrasound images, clinical history, lesion segmentation and selected pre-computed image-based descriptors. Patient cases are BI-RADS classified and annotated by specialized radiologists, covering all the possibilities of diagnosis. The BCDR provides normal and annotated patient cases of breast cancer, including mammography lesion outlines, anomalies observed by radiologists, pre-computed image-based descriptors as well as related clinical data. All available medio-lateral oblique (MLO) and cranio-caudal (CC) views of the left and right breast are included in the database. The BCDR is subdivided into two different repositories: (1) a Film Mammography-based Repository (BCDR-FM) and (2) a Full-Field Digital Mammography-based Repository (BCDR-DM). In particular, BCDR-DM includes 724 patient cases with digital mammograms. The MLO and CC images are gray-level mammograms with a resolution of 3328 (width) by 4084 (height) or 2560 (width) by



3328 (height) pixels, depending on the compression plate used in the acquisition (according to the breast size of the patient). For this study, 176 digital mammograms in MLO and CC views were randomly extracted from BCDR-DM. Since only the main lesions were segmented on BCDR images, each extracted image was evaluated in double blind by two radiologists of our Institute dedicated to senological diagnostics, who manually identified and classified the ROIs containing the microcalcification clusters. In this way, we obtained 104 images with clustered MCs, from which 56 benign and 40 malignant ROIs were extracted, and 72 images without any pathology, from which 96 ROIs were randomly extracted.

2.2 Textural Features Extraction

In this paper, we propose a fully automated model for the identification and characterization of clustered microcalcifications in digital mammographic images, mainly based on a texture analysis approach. Since a fundamental property of image texture is the scale at which the image is observed and analyzed, a wavelet-transform-based multiscale texture analysis approach, specifically the Haar wavelet transform, was exploited in this work. The Haar wavelet [27, 28] is a sequence of rescaled "square-shaped" functions which together form a wavelet family or basis. The wavelet approach is similar to Fourier analysis in that it allows a target function over an interval to be represented in terms of an orthonormal basis; the Haar sequence is recognized as the first known wavelet basis and is extensively used as a teaching example. In the 2D Haar wavelet decomposition of an image, the original image is first high-pass filtered, yielding the three detail coefficient subimages (Fig. 1(a), top right: horizontal; bottom left: vertical; bottom right: diagonal), then low-pass filtered and downscaled, yielding an approximation coefficient subimage (Fig. 1(a), top left). To compute the successive level of decomposition, the process is iterated on the approximation coefficient subimage (Fig. 1(b), top left). In particular, we performed the 2D Haar transform at two levels of decomposition. However, to perform texture analysis, a number of attributes or descriptors that differentiate the textures have to be identified; such descriptors are assumed to be uniform within regions with the same texture. Many works in the literature base the texture analysis process on first- or second-order statistics computed on the image histogram. The use of such texture descriptors relies on the assumption that texture can be defined by local statistical properties of pixel gray levels. For this reason, in our study, the following features are computed for each of the eight subimages obtained in the Haar decomposition: mean, variance, skewness, kurtosis, entropy and relative smoothness; thus, for each ROI, a set of 48 Statistical Features (the SF set) was extracted.
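For illustration, a minimal sketch of the 48-feature extraction with PyWavelets; it assumes the eight subimages are the four subbands of each of the two decomposition levels, and that entropy is computed on a 256-bin gray-level histogram (both assumptions are ours, not stated choices of the authors).

```python
import numpy as np
import pywt
from scipy.stats import skew, kurtosis

def subimage_stats(subimage):
    """Six first-order descriptors of one wavelet subimage."""
    v = subimage.ravel()
    hist, _ = np.histogram(v, bins=256)
    p = hist[hist > 0] / v.size
    entropy = -np.sum(p * np.log2(p))
    smoothness = 1.0 - 1.0 / (1.0 + np.var(v))  # relative smoothness
    return [np.mean(v), np.var(v), skew(v), kurtosis(v), entropy, smoothness]

def haar_sf_set(roi):
    """SF set: 6 descriptors x 8 subimages of a two-level Haar decomposition."""
    A1, (H1, V1, D1) = pywt.dwt2(roi, "haar")   # level 1: 4 subimages
    A2, (H2, V2, D2) = pywt.dwt2(A1, "haar")    # level 2: 4 more
    features = []
    for sub in (A1, H1, V1, D1, A2, H2, V2, D2):
        features.extend(subimage_stats(sub))
    return np.asarray(features)                 # 48 values per ROI
```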

2.3 Interest Points Detection

As pointed out, microcalcifications are characterized as tiny spots of calcium deposits localized or broadly diffused along the breast areas or, in some



Fig. 1. Image decomposition. (a) One level. (b) Two levels.

cases, extremely minute and grouped in clusters. Given this particular characterization of such lesions, in our model we enriched the information coming from the texture analysis with information about the points and corners of interest that can be identified in the ROI. Specifically, they were obtained by using the SURF and Minimum Eigenvalue algorithms. The SURF method [29] is a scale- and rotation-invariant local interest point detector and descriptor. It relies on integral images for image convolutions, building on the strengths of the leading existing detectors and descriptors. For detection, it uses an integer approximation of the determinant-of-Hessian blob detector, which can be computed with three integer operations using a precomputed integral image. Its feature descriptor is based on the sum of the Haar wavelet responses around the point of interest. SURF descriptors can be used to locate and recognize objects, people or faces, to reconstruct 3D scenes, to track objects and to extract points of interest. The MinEigenAlg uses the Shi-Tomasi detector to identify the corners of interest in an image. It is based entirely on the Harris corner detector [30], with a modification in the score calculation. The Harris corner detector has a corner selection criterion: a score is calculated for each pixel and, if the score is above a certain value, the pixel is marked as a corner. The score is calculated from the two eigenvalues of the local structure matrix, which are passed to a function that combines them into a score. The Shi-Tomasi corner detector follows the Harris corner detector closely, but takes the minimum of the two eigenvalues as the score. In this preliminary approach, only the numbers of interest points (IP) and corners (IC) obtained by applying the two algorithms described above have been taken into account.
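A sketch of how the two counts might be obtained with OpenCV is given below; note that SURF is available only in the opencv-contrib build (cv2.xfeatures2d), and the thresholds are illustrative assumptions.

```python
import cv2

def interest_point_counts(gray_roi):
    """Return (IP, IC): numbers of SURF keypoints and Shi-Tomasi corners.
    `gray_roi` is assumed to be an 8-bit single-channel image."""
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    keypoints = surf.detect(gray_roi, None)

    # useHarrisDetector=False selects the minimum-eigenvalue (Shi-Tomasi) score.
    corners = cv2.goodFeaturesToTrack(gray_roi, maxCorners=0, qualityLevel=0.01,
                                      minDistance=1, useHarrisDetector=False)
    ip = len(keypoints)
    ic = 0 if corners is None else len(corners)
    return ip, ic
```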

2.4 Classification Model

The general structure of the proposed classification model is shown in Fig. 2. The method is developed in two phases. First, for each ROI, a set of features is extracted by using the methods described above. Second, two models are trained to discriminate ROIs into normal and abnormal, or microcalcification clusters into benign and malignant. As a first approach to classification, we used a state-of-the-art machine learning method used in classification, regression and



other tasks: the Random Forest [31]. A standard configuration was adopted, with 100 trees and 20 features (as described in [31]) randomly selected at each split.

Fig. 2. Flow-chart of the proposed model. In a first phase, a set of features is extracted from each ROI; then two RF classifiers are trained for the resolution of the Normal vs. Abnormal and Malignant vs. Benign problems.

The performance of the prediction model was evaluated in terms of sensitivity, specificity, accuracy, and Area Under the ROC Curve (AUC) over 100 rounds of ten-fold cross-validation.
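For illustration, a sketch of this training and evaluation protocol with scikit-learn, using synthetic placeholder data in place of the 50-element feature vectors (48 SF + IP + IC); the configuration mirrors the 100 trees, 20 features per split and 100 rounds of ten-fold cross-validation described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(192, 50)        # placeholder: one 50-feature row per ROI
y = rng.randint(0, 2, 192)   # placeholder: 0 = normal, 1 = abnormal

clf = RandomForestClassifier(n_estimators=100, max_features=20, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100, random_state=0)

# 1000 fold-level AUC scores (compute-heavy); report median and IQR.
auc = cross_val_score(clf, X, y, scoring="roc_auc", cv=cv)
print(np.median(auc), np.percentile(auc, [25, 75]))
```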

3 Results

The proposed method has been trained to discriminate normal and abnormal ROIs, and also benign and malignant microcalcification clusters, on four feature sets, that is:

– SF set;
– SF set and IP;
– SF set and IC;
– SF set, IP, and IC.

The classification performance was evaluated over 100 rounds of 10-fold cross-validation. The results for the discrimination of normal and abnormal ROIs are summarized in Fig. 3. The model trained on the SF set showed a median AUC value of 89.76% with an interquartile range (IQR) of (89.36%–90.22%), and an accuracy of 83.33% with an IQR of (82.29%–83.85%). In contrast, the model trained on the SF, IP, and IC sets obtained higher performance, with a median AUC value of 98.46% (IQR 98.30%–98.57%) and an accuracy of 95.83% (IQR 95.31%–96.35%). It is worth noting that, by adding the IC feature to the SF set, the classification performance improved significantly: in particular, the accuracy increased by 10 percentage points, whereas adding the IP feature to the SF and IC sets improved the prediction of abnormal ROIs. Fig. 4 shows the results for the discrimination of benign and malignant microcalcification clusters. The model trained on the SF set showed a median AUC



Fig. 3. Performance metrics of the normal/abnormal classification model trained with the four feature sets. The Wilcoxon-Mann-Whitney test detected a significant difference in classification accuracy between the model trained on the SF set and the others (**p-value < 0.01).

Fig. 4. Performance metrics of the benign/malignant classification model trained with the four feature sets. The Wilcoxon-Mann-Whitney test did not detect a significant difference in classification accuracy between the model trained on the SF set and the others.

value of 94.19% with an IQR of (93.47%–94.79%), and an accuracy of 87.50% with an IQR of (87.50%–88.54%), for the discrimination of benign and malignant microcalcification clusters. The performances of the four models did not show any significant difference on the benign versus malignant task.


4 Discussion and Conclusion

This paper develops our previous work on the detection of clustered microcalcifications in full-field digital mammograms. We proposed a CAD for characterizing and discriminating ROIs as normal/abnormal and benign/malignant. First, for each ROI, we extracted some textural features on a multiscale decomposition based on the Haar wavelet transform, and detected interest points and corners by using two known algorithms, SURF and MinEigenAlg. Then, we used these features to train two RF classifiers to recognize ROIs. In particular, we evaluated the classification performance of four models obtained by combining the extracted features. The model developed on the SF, IP, and IC feature set, previously described, performs well in the prediction of normal and abnormal ROIs, with a median AUC value of 98.46%, an accuracy of 95.83%, a sensitivity of 96.84%, and a specificity of 95.09%. Moreover, the prediction performance of the model developed on the same feature set to discriminate benign and malignant microcalcification clusters was a median AUC value of 94.19%, an accuracy of 88.19%, a sensitivity of 91.93%, and a specificity of 85.52%. The experimental results show that, by adding the IP and IC features to the SF set, classification accuracy for the normal/abnormal task grew by over 12 percentage points; instead, the IP and IC features did not seem to carry significant information content for the benign/malignant classification. Tables 1 and 2 show the performance of state-of-the-art models for classification into normal and abnormal ROIs, and into benign and malignant microcalcifications, respectively. Note that the various approaches use different images taken from different databases; therefore, this is a qualitative comparison. However, as shown in Tables 1 and 2, the experimental results obtained by our approach are comparable with the various approaches of the state of the art. Note that in [19, 20, 22] experimental results were obtained on datasets excessively small with respect to the number of features used, and overfitting could occur.

Table 1. Normal vs. Abnormal tissue (microcalcification): classification accuracy (Acc).

Method                 No. ROIs   Feature type                                     No. features   Classifier       Acc (%)
Jian et al. [19]       50         Textural features and wavelet coefficients      14             SVM              96
Phadke et al. [20]     52         Wavelet coefficients                            147            ANN              99
Boulehmi et al. [22]   25         Varied features                                 Unknown        Neural network   94
Proposed approach      192        Textural features and #interest points/corners  50             Random Forest    96



Table 2. Benign vs. Malignant microcalcifications: classification accuracy (Acc).

Method                 No. ROIs   Feature type                                     No. features   Classifier      Acc (%)
Jian et al. [19]       25         Textural features and wavelet coefficients      14             SVM             100
Phadke et al. [20]     26         Wavelet coefficients                            147            ANN             96
Khehra et al. [21]     380        Varied features                                 23             SMO-SVM         88
Boulehmi et al. [22]   25         Varied features                                 Unknown        ANFIS system    99
Chen et al. [23]       300        Topological features                            Unknown        kNN             85
Proposed approach      96         Textural features and #interest points/corners  50             Random Forest   88

In the next stage of our studies, it will be necessary to combine other feature parameters in addition to the points and corners of interest. The aim of this future work is to improve the performance, especially in the classification of benign and malignant microcalcifications.

Acknowledgments. This work was supported by funding from the Italian Ministry of Health "Ricerca Corrente 2016".

References

1. Elter, M., Horsch, A.: CADx of mammographic masses and clustered microcalcifications: a review. Med. Phys. 36(6), 2052–2068 (2009)
2. Howell, A.: The emerging breast cancer epidemic: early diagnosis and treatment. Breast Cancer Res. 12(4), S10 (2010)
3. Breast Cancer Facts. http://www.uthscsa.edu/hscnews/pdf/. Accessed Apr 2010
4. Fletcher, S.W., Elmore, J.G.: Mammographic screening for breast cancer. N. Engl. J. Med. 348(17), 1672–1680 (2003)
5. Elmore, J.G., Armstrong, K., Lehman, C.D., Fletcher, S.W.: Screening for breast cancer. JAMA 293(10), 1245–1256 (2006)
6. Cheng, H.D., Cai, X., Chen, X., Hu, L., Lou, X.: Computer-aided detection and classification of microcalcifications in mammograms: a survey. Pattern Recogn. 36(12), 2967–2991 (2003)
7. Brown, J., Bryan, S., Warren, R.: Mammography screening: an incremental cost effectiveness analysis of double versus single reading of mammograms. BMJ 312(7034), 809–812 (1996)
8. McCormack, V.A., dos Santos Silva, I.: Breast density and parenchymal patterns as markers of breast cancer risk: a meta-analysis. Cancer Epidemiol. Prev. Biomark. 15(6), 1159–1169 (2006)



9. Wolfe, J.N.: Breast patterns as an index of risk for developing breast cancer. Am. J. Roentgenol. 126(6), 1130–1137 (1976)
10. Jiang, Y., Nishikawa, R.M., Wolverton, D.E., Metz, C.E., Giger, M.L., Schmidt, R.A., Vyborny, C.J., Doi, K.: Malignant and benign clustered microcalcifications: automated feature analysis and classification. Radiology 198(3), 671–678 (1996)
11. Chan, H.P., Sahiner, B., Lam, K.L., Petrick, N., Helvie, M.A., Goodsitt, M.M., Adler, D.D.: Computerized analysis of mammographic microcalcifications in morphological and texture feature spaces. Med. Phys. 25(10), 2007–2019 (1998)
12. Nakayama, R., Uchiyama, Y., Watanabe, R., Katsuragawa, S., Namba, K.: Computer-aided diagnosis scheme for histological classification of clustered microcalcifications on magnification mammograms. Med. Phys. 31(4), 789–799 (2004)
13. Sampat, M.P., Markey, M.K., Bovik, A.C.: Computer-aided detection and diagnosis in mammography. Handb. Image Video Process. 2(1), 1195–1217 (2005)
14. Zhang, X., Homma, N., Goto, S., Kawasumi, Y., Ishibashi, T., Abe, M., Sugita, N., Yoshizawa, M.: A hybrid image filtering method for computer-aided detection of microcalcification clusters in mammograms. J. Med. Eng. 2013, 8 p. (2013). Article no. 615254
15. Vivona, L., Cascio, D., Fauci, F., Raso, G.: Fuzzy technique for microcalcifications clustering in digital mammograms. BMC Med. Imaging 14(1), 23 (2014)
16. Wang, J., Nishikawa, R.M., Yang, Y.: Improving the accuracy in detection of clustered microcalcifications with a context-sensitive classification model. Med. Phys. 43(1), 159–170 (2016)
17. Oliver, A., Torrent, A., Lladó, X., Tortajada, M., Tortajada, L., Sentís, M., Freixenet, J., Zwiggelaar, R.: Automatic microcalcification and cluster detection for digital and digitised mammograms. Knowl.-Based Syst. 28, 68–75 (2012)
18. Gallardo-Caballero, R., García-Orellana, C.J., García-Manso, A., González-Velasco, H.M., Macías-Macías, M.: Independent component analysis to detect clustered microcalcification breast cancers. Sci. World J. 2012, 6 p. (2012). Article no. 540457
19. Jian, W., Sun, X., Luo, S.: Computer-aided diagnosis of breast microcalcifications based on dual-tree complex wavelet transform. Biomed. Eng. Online 11(1), 96 (2012)
20. Phadke, A.C., Rege, P.P.: Detection and classification of microcalcifications using discrete wavelet transform. Int. J. Emerg. Trends Technol. Comput. Sci. 2(4), 130–134 (2013)
21. Khehra, B.S., Pharwaha, A.P.S.: Classification of clustered microcalcifications using MLFFBP-ANN and SVM. Egypt. Inform. J. 17(1), 11–20 (2016)
22. Boulehmi, H., Mahersia, H., Hamrouni, K.: A new CAD system for breast microcalcifications diagnosis. Int. J. Adv. Comput. Sci. Appl. 7(4), 133–143 (2016)
23. Chen, Z., Strange, H., Oliver, A., Denton, E.R., Boggis, C., Zwiggelaar, R.: Topological modeling and classification of mammographic microcalcification clusters. IEEE Trans. Biomed. Eng. 62(4), 1203–1214 (2015)
24. Fanizzi, A., Basile, T.M.A., Losurdo, L., Amoroso, N., Bellotti, R., Bottigli, U., Dentamaro, R., Didonna, V., Fausto, A., Massafra, R., Moschetta, M., Tamborra, P., Tangaro, S., La Forgia, D.: Hough transform for clustered microcalcifications detection in full-field digital mammograms. In: Applications of Digital Image Processing XL, vol. 10396, p. 1039616. International Society for Optics and Photonics, San Diego (2017)
25. Sklansky, J.: On the Hough technique for curve detection. IEEE Trans. Comput. C-27(10), 923–926 (1978)



26. Pedersen, S.J.K.: Circular Hough transform. Aalborg Univ. Vis. Graph. Interact. Syst. 123, 123 (2007)
27. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 3rd edn. Prentice-Hall, Upper Saddle River (2006)
28. Mallat, S.G.: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell. 11(7), 674–693 (1989)
29. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110(3), 346–359 (2008)
30. Shi, J., Tomasi, C.: Good features to track. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Proceedings CVPR 1994, pp. 593–600. IEEE (1994)
31. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
32. Ramos-Pollán, R., Guevara-López, M.A., Suárez-Ortega, C., Díaz-Herrero, G., Franco-Valiente, J.M., Rubio-del-Solar, M., González-de-Posada, N., Pires Vaz, M.A., Loureiro, J., Ramos, I.: Discovering mammography-based machine learning classifiers for breast cancer diagnosis. J. Med. Syst. 36(4), 2259–2269 (2012)

An Empirical Study of Word Sense Disambiguation for Biomedical Information Retrieval System

Mohammed Rais(✉) and Abdelmonaime Lachkar

L.I.S.A, Department of Electrical and Computer Engineering, ENSA, USMBA, Fez, Morocco
{mohammed.rais,abdelmonaime.lachkar}@usmba.ac.ma

Abstract. Document representation is an important stage in ensuring the indexation of biomedical documents. The ordinary way to represent a text is as a bag of words (BoW); this representation suffers from a lack of sense in the resulting representations, ignoring all the semantics that reside in the original text. Instead, conceptualization using background knowledge enriches document representation models. Three strategies can be used in order to realize the conceptualization task: Adding Concept, Partial Conceptualization, and Complete Conceptualization. While searching semantic resources for the senses corresponding to a polysemic term, multiple matches are detected, which introduces ambiguities into the final document representation; three strategies for disambiguation can be used: First Concept, All Concepts and Context-Based. SenseRelate is a well-known Context-Based algorithm, which uses a fixed window size and takes into consideration a distance weight based on how far the terms in the context are from the target word. This may impact the yielded concepts or senses negatively, so we propose a simple modified version of the SenseRelate algorithm, namely NoDistanceSenseRelate, which simply ignores the distance, that is, all the terms in the context have the same weight. In order to evaluate the effect of the conceptualization and disambiguation strategies on the indexing process, several experiments have been conducted in this study using the OHSUMED corpus on a biomedical information retrieval system. The results obtained on the OHSUMED corpus show that the Context-Based methods (SenseRelate and NoDistanceSenseRelate) outperform the other ones when applying the Adding Concept conceptualization strategy. The obtained results demonstrate the value of adding the senses of concepts to the term representation in the IR process. Keywords: Natural language processing · Biomedical Information Retrieval · Word Sense Disambiguation · Biomedical indexing methods · Strategy of disambiguation · Conceptualization · Sense based indexing

1 Introduction

A word having several possible meanings is ambiguous; in the biomedical domain, the word "cold" has two meanings: "a respiratory disorder" and "the absence of heat". It is



the context in which the word is used that determines its correct meaning. Word Sense Disambiguation (WSD) algorithms try to extract the proper sense or concept of an ambiguous term by using its context. WSD plays a vital role in many biomedical text-mining applications, such as biomedical information retrieval systems. More than 23 million references to journal articles are published in the Medline biomedical digital library, and managing the growing number of entries in this database has become critical; hence, natural language processing techniques are essential to organize such databases and to facilitate indexing, searching and filtering of articles. In order to represent a document by generating an indexing model, indexing methods generally use a bag-of-words representation, which is based on syntactic and statistical models. However, this technique often neglects sense in the final document representation, due to the non-use of semantic resources such as dictionaries or ontologies. In contrast, conceptualization using background knowledge enriches document representation models; three strategies can be used to realize the conceptualization task: Adding Concept, Partial Conceptualization, and Complete Conceptualization. Nevertheless, while realizing the conceptualization task by mapping terms to concepts using an ontology, multiple matches can be detected because of the natural ambiguity of the Unified Medical Language System (UMLS) thesaurus, producing an ambiguous document representation. To overcome this problem, we distinguish three strategies of disambiguation: All Concepts, First Concept and Context-Based. In past studies of biomedical WSD, authors have attempted to evaluate WSD algorithms using several semantic measures [1]; Dinh and Tamine [2] propose and evaluate two methods of Word Sense Disambiguation (WSD) for biomedical terms and integrate them into a sense-based document indexing and retrieval framework. However, the conceptualization and disambiguation strategies have not been compared and evaluated while using WSD algorithms in an information retrieval system; hence the objective of our study. In this work, we estimate the impact of the conceptualization strategies while applying the three disambiguation strategies to an information retrieval system, and we evaluate the contribution of Word Sense Disambiguation (WSD) algorithms to a biomedical information retrieval system. The remainder of this document is structured as follows: some related works are presented in Sect. 2. The task of text conceptualization is presented in Sect. 3. In Sect. 4 we give the architecture used for our evaluation system, together with the results and discussion. Finally, conclusions and future work are given in Sect. 5.
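To illustrate the Context-Based strategy discussed in this paper, the sketch below (function names and the relatedness measure are ours, not an implementation of the actual SenseRelate package) scores each candidate sense of the target word against the words of a fixed context window, with an optional 1/distance weight; dropping that weight yields the NoDistanceSenseRelate behaviour.

```python
def disambiguate(candidate_senses, context_words, relatedness,
                 distance_weighted=True):
    """Pick the sense whose summed relatedness to the context is highest.

    `context_words` are assumed to be ordered by distance from the target
    word (nearest first); `relatedness(sense, word)` stands for any semantic
    measure, e.g. one computed over UMLS, and is left abstract here."""
    best_sense, best_score = None, float("-inf")
    for sense in candidate_senses:
        score = 0.0
        for distance, word in enumerate(context_words, start=1):
            # NoDistanceSenseRelate: every context word weighs the same.
            weight = 1.0 / distance if distance_weighted else 1.0
            score += weight * relatedness(sense, word)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense
```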

2 Related Work

Document representation is typically based on the traditional 'Bag-of-Words' (BoW) approach [3]. Nevertheless, the Bag-of-Words or simply term-based representation suffers from a lack of sense in the final document representation, due to the absence of the semantics that reside in the original text. In contrast, the Concept-Based (Bag-of-Concepts)


representation, however, permits semantic integration, namely conceptualization, which enriches the document representation using background knowledge resources. Many works have investigated this area. Amine et al. [4] integrate an ontology into the document clustering process and conclude that concept-based representation provides the best results. Litvak et al. [6] introduce an ontology-based web content mining application for analyzing and classifying web documents in a given domain; the main contribution of this work is the use of a domain-based multilingual ontology in the conceptual representation of documents. Guyot et al. [5] evaluated a multilingual ontology-based approach for multilingual information retrieval. Song et al. [7] suggest an automated method for document classification using an ontology that expresses terminology information and vocabulary contained in web documents by way of a hierarchical structure.

On the other hand, using a bag-of-senses representation can in certain cases risk a loss of information, so the choice of the most appropriate sense for a term by applying WSD algorithms logically influences IR performance. Sanderson [8] presents a survey of work on word sense disambiguation and information retrieval and examines approaches that use knowledge of senses to improve retrieval effectiveness. Some other studies used the senses selected by disambiguating terms in both queries and documents to perform indexing and retrieval. Stokoe et al. [9] investigate the use of a state-of-the-art automated WSD system within a web IR framework; the focus of their research is a large-scale evaluation of both the automated WSD algorithm and the IR system, and their aim is to demonstrate the relative performance of an IR system using WSD compared with a baseline retrieval technique such as the vector space model. Kim et al. [10] propose a coarse-grained, consistent, and flexible sense tagging method to improve large-scale text retrieval performance; their sense tagger can be built without a sense-tagged corpus and performs consistent disambiguation by considering only the single most informative neighboring word as evidence for determining the sense of the target word.

Other researchers propose using knowledge sources from thesauri to study the effect of expansion. The goal of Fang [11] is to study whether query expansion using only manually created lexical resources can improve performance; the main contribution of this work is to show that query expansion using only hand-crafted lexical resources is effective in the recently proposed axiomatic framework. Agirre et al. [12] propose to use WordNet for document expansion with a new method: given a full document, a random walk algorithm over the WordNet graph ranks concepts closely related to the words in the document.

In the biomedical domain, some studies [2, 13] investigate the use of the MeSH (Medical Subject Headings) thesaurus to build an indexing model that improves the retrieval process. Dinh and Tamine [2] proposed and evaluated a sense-based approach for indexing and retrieving biomedical documents using two WSD methods for identifying ambiguous MeSH concepts, Left-To-Right WSD and Cluster-based WSD, and integrated them into a sense-based document indexing and retrieval framework. Majdoubi et al.
[13] propose a conceptual indexing system for medical articles based on the MeSH (Medical Subject Headings) thesaurus, called BIOINSY (BIOmedical Indexing SYstem), in which


a language model approach is used to disambiguate the senses of a term and determine its descriptor in the context of the document. In this paper, we use the UMLS thesaurus as the domain-specific resource for the "conceptualization" process.

3 Text Conceptualization Task

Knowledge resources such as thesauri or ontologies can be used to resolve the shortcomings of term-based representation by replacing it with a concept-based one. Figure 1 presents the conceptualization and disambiguation process for an extracted biomedical term. This section presents the text conceptualization task, introducing the text pre-processing step and then the different possible conceptualization and disambiguation strategies.

Fig. 1. The process of conceptualization and disambiguation

3.1 Text Preprocessing

The first step in text representation is to convert the documents, whose data are strings of characters and words, into a format suitable for the conceptualization process: all the words are extracted from the documents, and preprocessing treatments and content extraction help prepare these words for conceptualization.


3.2 Conceptualization

Conceptualization is the process of mapping literally occurring terms detected in the text to semantically related concepts, and then integrating these concepts into the text to produce the final conceptualized text. In this paper, we use the UMLS thesaurus as the domain-specific resource for this conceptualization. Three different strategies can be used for text conceptualization [3, 14]: Adding Concepts, Partial Conceptualization, and Complete Conceptualization (concepts only). In our study, we chose to experiment with two strategies: Adding Concepts and Complete Conceptualization.

3.3 Disambiguation Strategies

Adding or replacing terms by concepts from an ontology can in certain cases risk a loss of information: because of the natural ambiguity of the Unified Medical Language System (UMLS) thesaurus, it can introduce ambiguities into the final document representation. Albitar et al. [14] distinguish three disambiguation strategies: All Concepts, First Concept, and Context-Based. Despite its complexity, the context-based strategy is the most recommended one; it was introduced in the word sense disambiguation task. Word Sense Disambiguation (WSD) methods automatically assign the proper concept to an ambiguous term based on its context. The next section introduces the WSD task and describes the method used in this study. A minimal sketch of the conceptualization and disambiguation options appears below.
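To make the options above concrete, the following sketch applies the Adding Concepts and Complete Conceptualization (concepts only) strategies together with the First Concept and All Concepts disambiguation options to a toy term-to-concept mapping. The mapping and identifiers are illustrative only; in the actual pipeline, candidate concepts come from the UMLS Metathesaurus:

```python
# Toy term-to-concept mapping; in the real system this comes from UMLS/MetaMap.
term_concepts = {
    "cold": ["C0009443",   # hypothetically, the "common cold" concept
             "C0009264"],  # hypothetically, the "cold temperature" concept
    "fever": ["C0015967"],
}

def conceptualize(tokens, strategy="adding", disamb="first"):
    """Conceptualize a token list.

    strategy: 'adding'  -> keep each term and append its concepts
              'concept' -> replace each term by its concepts (concepts only)
    disamb:   'first'   -> keep only the first candidate concept
              'all'     -> keep every candidate concept
    """
    out = []
    for tok in tokens:
        concepts = term_concepts.get(tok, [])
        if disamb == "first":
            concepts = concepts[:1]
        if strategy == "adding":
            out.append(tok)
        out.extend(concepts)
    return out

print(conceptualize(["cold", "fever"], strategy="adding", disamb="all"))
# ['cold', 'C0009443', 'C0009264', 'fever', 'C0015967']
```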

4 Methods

4.1 Word Sense Disambiguation

A word is ambiguous when it has more than one sense; for example, the word "cold" may refer both to a respiratory disorder and to the absence of heat, and it is the context in which the word is used that determines its correct meaning. In unsupervised knowledge-based word sense disambiguation, the correct meaning (correct concept) is generally determined with semantic similarity or relatedness measures, which attempt to quantify the semantic proximity between two concepts. In our study we use the SenseRelate algorithm [15] and NoDistanceSenseRelate [16, 17].

4.2 Biomedical Information Retrieval System

To estimate the impact of the conceptualization strategies when applying the three disambiguation strategies described above in an information retrieval system, we developed a biomedical text indexing system based on the three disambiguation strategies; for the second step of the information retrieval system, we integrated the index model into the Terrier information retrieval system. In this section, we introduce the information retrieval system and then provide the dataset and flowchart of our evaluation system. The sketch below illustrates the context-based disambiguation component.
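The context-based strategy of Sect. 4.1 can be sketched as follows. Here `relatedness` stands in for any semantic similarity or relatedness measure (e.g., one of those evaluated in [15]), the context terms are assumed to be ordered by distance from the target word, and the code is an illustration rather than the authors' implementation:

```python
def disambiguate(candidate_senses, context_terms, relatedness, use_distance=True):
    """Pick the candidate sense most related to the context of the target word.

    SenseRelate-style scoring: each context term votes for a sense with a
    weight that decays with its distance from the target word; the
    NoDistanceSenseRelate variant gives every context term the same weight.
    """
    best_sense, best_score = None, float("-inf")
    for sense in candidate_senses:
        score = 0.0
        for distance, term in enumerate(context_terms, start=1):
            weight = 1.0 / distance if use_distance else 1.0
            score += weight * relatedness(sense, term)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense
```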


Information Retrieval Process. The main objective of an information retrieval system is to obtain the documents most relevant to a query's information need from a set of document resources. Our proposed approach consists of extracting relevant Metathesaurus concepts from both documents and queries using a biomedical semantic thesaurus, then representing the documents and queries by sets of concepts deduced from the Metathesaurus using the two conceptualization strategies, producing two descriptor vectors for each document (query). After that, each ambiguous Metathesaurus concept (polysemic term) is tagged with the most likely meaning given its context, based on the semantic relations between it and its neighboring words in a sentence. Once the index model is built from the document descriptor vectors, the retrieval process computes a numeric score for how well each document matches a given query and ranks the documents according to this score. Two steps are essential in our sense-based indexing and retrieval system.

Creating document and query indexes. Given the text of a document (query), finding the relevant Metathesaurus concepts yields the initial descriptor vector for the document, $V_i^d$ (for the query, $V_i^q$), containing the annotated biomedical terms $T_j$ tagged by their candidate concepts:

$$V_i^d = \{T_{1i}^d, T_{2i}^d, \ldots, T_{ni}^d\}, \qquad V_i^q = \{T_{1i}^q, T_{2i}^q, \ldots, T_{mi}^q\} \qquad (1)$$

where $V_i^d$ and $V_i^q$ are the respective sets of Metathesaurus terms, $n$ and $m$ are the numbers of terms in the document and query, respectively, and $T_{ji}^d$ and $T_{ji}^q$ are the $j$-th Metathesaurus term in document $d_i$ and query $q_i$, respectively. Formally:

$$T_j = \{C_{1j}, C_{2j}, \ldots, C_{pj}\} \qquad (2)$$

where $C_{kj}$ is the $k$-th candidate concept for term $T_j$. Hence, the descriptor vectors for documents and queries are represented by sets of concepts, and for polysemic terms ($T_j$ with more than one concept) we use the disambiguation strategies. Finally, the index model is built from the obtained sense vectors $VS_i^d$ and $VS_i^q$. Formally:

$$VS_i^d = \{C_{1i}^d, C_{2i}^d, \ldots, C_{ni}^d\}, \qquad VS_i^q = \{C_{1i}^q, C_{2i}^q, \ldots, C_{mi}^q\} \qquad (3)$$

where $C_{ji}^d$ and $C_{ji}^q$ are the $j$-th concept deduced from $T_{ji}^d$ and $T_{ji}^q$ in document $d_i$ and query $q_i$, respectively.

Returning the most relevant documents. Documents are ranked in decreasing order of predicted relevance to the query by computing the relevance score of each document with respect to the query using document weighting models.


In our case, we chose to experiment with two weighting models: TF-IDF and Okapi BM25.

TF-IDF. TF-IDF assigns a high weight to a term if it occurs frequently in the document but rarely in the whole document collection [18]. The TF-IDF weight [19] is composed of two terms. The first computes the normalized term frequency (TF), i.e., the number of times a word appears in a document, $N_w$, divided by the total number of words in that document, $N_{wd}$:

$$TF(t) = \frac{N_w}{N_{wd}}, \qquad IDF(t) = \log_e\left(\frac{N_d}{N_t}\right), \qquad Score(d, Q) = \sum_{t \in Q} TF(t) \cdot IDF(t) \qquad (4)$$

The second term is the inverse document frequency (IDF), computed as the logarithm of the number of documents in the corpus, $N_d$, divided by the number of documents in which the specific term appears, $N_t$.

Okapi BM25. Okapi BM25 is one of the strongest "simple" scoring functions and has proven a useful baseline for experiments and a useful feature for ranking:

$$\sum_{i:\, d_i = q_i = 1} \log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(df_i - r_i + 0.5)/(D - R - df_i + r_i + 0.5)} \cdot \frac{(k_1 + 1)\, tf_{i,d}}{k_1\left((1 - b) + b \cdot \frac{dl}{avg(dl)}\right) + tf_{i,d}} \cdot \frac{(k_2 + 1)\, tf_{i,q}}{k_2 + tf_{i,q}} \qquad (5)$$

The components of (5) are:

• the IDF-like ranking score defined in the previous section,
• the document term frequency $tf_{i,d}$, normalized by the ratio of the document's length $dl$ to the average length $avg(dl)$, and
• the query term frequency $tf_{i,q}$.

DataSet. To evaluate the contribution of the disambiguation strategies to a biomedical information retrieval system, we use the OHSUMED [20] test collection as used in the TREC-9 Filtering Track: a set of 348,566 references from MEDLINE, the online medical information database, consisting of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991). The available fields are title, abstract, MeSH indexing terms, author, source, and publication type. The test collection was built as part of a study assessing the use of MEDLINE by physicians in a clinical setting (Hersh and Hickam, above). Novice physicians using MEDLINE generated 106 queries, of which only a subset was used in the TREC-9 Filtering Track. For the evaluation step, we use a subset of 63 of the original queries developed by Hersh et al. for their IR experiments (OHSUMED); each query was replicated by four searchers, two


physicians experienced in searching and two medical librarians. The results were assessed for relevance by a different group of physicians using a three-point scale: definitely, possibly, or not relevant.

Flowchart System. Figure 2 presents the flowchart of the evaluation system:

Fig. 2. Flowchart of the evaluation system: biomedical information retrieval system

In this system, we proceed first by indexing documents as follows:

1. Execute MetaMap on the OHSUMED TREC corpus, annotating biomedical terms from the corpus and tagging each term with its candidate concepts using UMLS.
2. Carry out the conceptualization process with the two strategies mentioned above (Adding Concepts, concepts only); for polysemic terms, apply the disambiguation strategies: First Concept, All Concepts, and Context-Based (SenseRelate, NoDistanceSenseRelate).

After creating the descriptor vector of each document and formulating the index model, the second step computes the document relevance score, which is realized by the Terrier information retrieval system using the TF-IDF and BM25 weighting models, sketched below.
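As a concrete reference for the two weighting models, here is a minimal sketch of simplified TF-IDF and Okapi BM25 scoring over tokenized documents; Terrier's actual implementations differ in their normalization details, and the relevance-feedback part of BM25 is omitted:

```python
import math

def tf_idf_score(query, doc, docs):
    """Simplified TF-IDF score of `doc` for `query` (one common variant)."""
    n_docs = len(docs)
    score = 0.0
    for term in query:
        tf = doc.count(term) / len(doc)            # normalized term frequency
        df = sum(1 for d in docs if term in d)     # document frequency
        if df:
            score += tf * math.log(n_docs / df)    # TF * IDF
    return score

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Simplified Okapi BM25 (no relevance information, k2 term omitted)."""
    n_docs = len(docs)
    avg_dl = sum(len(d) for d in docs) / n_docs
    score = 0.0
    for term in query:
        tf = doc.count(term)
        df = sum(1 for d in docs if term in d)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        denom = tf + k1 * (1 - b + b * len(doc) / avg_dl)
        score += idf * tf * (k1 + 1) / denom
    return score

docs = [["cold", "virus"], ["cold", "weather"], ["fever", "virus"]]
print(bm25_score(["cold", "virus"], docs[0], docs))
```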

5 Results and Discussion

5.1 Evaluation Measures

Precision. A measure of the ability of a system to present only relevant items: precision = number of relevant items retrieved / total number of items retrieved.

Precision at k. Many queries have thousands of relevant documents, and few users will be interested in reading all of them, so precision at k documents (P@k) is a useful metric (e.g., P@10, or "precision at 10", corresponds to the number of relevant results on the first search results page).

Mean average precision. Average precision is the average of the precision values obtained for the set of top-k documents after each relevant document is retrieved. Mean average precision (MAP) for a set of queries is the mean of the average precision scores over the queries; we consider MAP over the total of 63 test queries.

Reciprocal rank. The reciprocal rank measure [21] favors scoring functions that rank relevant results highly: its value is inversely proportional to how far, on average, a user has to go down the ranked result list to find the first relevant result.

bpref. The bpref measure is designed for situations where relevance judgments are known to be far from complete; it computes a preference relation of whether judged relevant documents are retrieved ahead of judged irrelevant documents, and is thus based on the relative ranks of judged documents only.

5.2 Experimental Results

Tables 1 and 2 present the results obtained for the conceptualization strategies when applying the disambiguation strategies, using the TF-IDF and Okapi BM25 weighting models, respectively.
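As a concrete reference for the rank-based measures above, a minimal sketch over a single ranked result list (all identifiers and data hypothetical; MAP is obtained by averaging `average_precision` over the 63 queries):

```python
def precision_at_k(ranked, relevant, k):
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i          # precision at each relevant hit
    return total / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

ranked, relevant = ["d3", "d1", "d7", "d2"], {"d1", "d2"}
print(precision_at_k(ranked, relevant, 2),   # 0.5
      average_precision(ranked, relevant),   # (1/2 + 2/4) / 2 = 0.5
      reciprocal_rank(ranked, relevant))     # 0.5
```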


Table 1. Results of the conceptualization and disambiguation strategies using the TF-IDF model

Conceptualization  Disambiguation         MAP      bpref    recip_rank  P_5      P_10     P_20
Baseline           Term Based             11,13%   53,66%   24,85%      13,97%   12,06%   10,56%
Concept only       First Concept           9,63%   43,93%   24,90%      13,97%   10,79%    8,89%
                     vs. Term Based       -1,50%   -9,73%   +0,05%       0,00%   -1,27%   -1,67%
Concept only       All Concepts            4,09%   38,85%   12,09%       5,71%    4,60%    3,41%
                     vs. Term Based       -7,04%  -14,81%  -12,76%      -8,26%   -7,46%   -7,15%
Concept only       NoDistanceSenseRelate   9,73%   42,68%   24,41%      14,60%   10,32%    8,57%
                     vs. Term Based       -1,40%  -10,98%   -0,44%      +0,63%   -1,74%   -1,99%
Concept only       SenseRelate            10,25%   43,64%   23,89%      15,24%   10,95%    9,05%
                     vs. Term Based       -0,88%  -10,02%   -0,96%      +1,27%   -1,11%   -1,51%
Adding Concept     First Concept          11,19%   54,34%   24,64%      13,97%   12,70%   11,11%
                     vs. Term Based       +0,06%   +0,68%   -0,21%       0,00%   +0,64%   +0,55%
Adding Concept     All Concepts            4,91%   43,57%   13,72%       7,62%    6,19%    4,60%
                     vs. Term Based       -6,22%  -10,09%  -11,13%      -6,35%   -5,87%   -5,96%
Adding Concept     NoDistanceSenseRelate  11,57%   54,41%   26,34%      15,56%   13,17%   10,79%
                     vs. Term Based       +0,44%   +0,75%   +1,49%      +1,59%   +1,11%   +0,23%
Adding Concept     SenseRelate            12,33%   55,24%   27,61%      15,87%   13,81%   11,35%
                     vs. Term Based       +1,20%   +1,58%   +2,76%      +1,90%   +1,75%   +0,79%

As illustrated in Tables 1 and 2, in all cases we can observe that conceptualization using Adding Concepts with the SenseRelate and NoDistanceSenseRelate disambiguation strategies improves the outcome. The improvements in mean average precision are 1,21% and 1,20% using Okapi BM25 and TF-IDF, respectively, for Adding Concepts with the SenseRelate algorithm over the term-based approach. However, Term Based outperforms SenseRelate under the concept-only strategy, by 0,80% and 0,88% in MAP for Okapi BM25 and TF-IDF, respectively. In addition, we can observe that the MAP of the First Concept strategy with Adding Concepts outperforms the term-based approach, with an improvement of 0,19% and 0,06% using Okapi BM25 and TF-IDF, respectively, while the MAP of Term Based shows an improvement of 1,42% and 1,50% over First Concept under the concept-only strategy for Okapi BM25 and TF-IDF, respectively. This demonstrates the value of taking both the term and the concept into account in the final document representation via the Adding Concepts strategy.

The performance of the term-based representation is, in general, very close to that of First Concept, and SenseRelate, NoDistanceSenseRelate, Term Based, and First Concept always give better precision and MAP than the All Concepts strategy under both Adding Concepts and concepts only, due to the increase of the representation space.


Table 2. Results of the conceptualization and disambiguation strategies using the Okapi BM25 model

Conceptualization  Disambiguation         MAP      bpref    recip_rank  P_5      P_10     P_20
Baseline           Term Based             11,16%   53,92%   24,88%      13,97%   12,22%   10,63%
Concept only       First Concept           9,74%   43,93%   25,02%      13,97%   10,79%    8,89%
                     vs. Term Based       -1,42%   -9,99%   +0,14%       0,00%   -1,43%   -1,74%
Concept only       All Concepts            4,20%   39,43%   12,90%       5,71%    4,60%    3,41%
                     vs. Term Based       -6,96%  -14,49%  -11,98%      -8,26%   -7,62%   -7,22%
Concept only       NoDistanceSenseRelate   9,86%   43,02%   24,43%      14,60%   10,32%    8,65%
                     vs. Term Based       -1,30%  -10,90%   -0,45%      +0,63%   -1,90%   -1,98%
Concept only       SenseRelate            10,36%   43,91%   23,90%      15,24%   11,11%    9,21%
                     vs. Term Based       -0,80%  -10,01%   -0,98%      +1,27%   -1,11%   -1,42%
Adding Concept     First Concept          11,35%   54,07%   25,69%      13,97%   12,86%   11,11%
                     vs. Term Based       +0,19%   +0,15%   +0,81%       0,00%   +0,64%   +0,48%
Adding Concept     All Concepts            6,02%   51,26%   17,86%       8,89%    7,62%    5,40%
                     vs. Term Based       -5,14%   -2,66%   -7,02%      -5,08%   -4,60%   -5,23%
Adding Concept     NoDistanceSenseRelate  11,64%   54,22%   26,59%      15,56%   13,17%   10,79%
                     vs. Term Based       +0,48%   +0,30%   +1,71%      +1,59%   +0,95%   +0,16%
Adding Concept     SenseRelate            12,37%   54,98%   27,80%      15,87%   13,97%   11,43%
                     vs. Term Based       +1,21%   +1,06%   +2,92%      +1,90%   +1,75%   +0,80%

6 Conclusion and Perspectives

Word Sense Disambiguation (WSD) plays a vital role in many biomedical text-mining applications, such as information retrieval. SenseRelate is a well-known context-based WSD algorithm that uses a fixed window size and takes into consideration a distance weight based on how far the terms in the context are from the target word, which may negatively affect the yielded concepts or senses. To overcome this problem, and therefore to enhance biomedical WSD, we have proposed a simple modified version of the SenseRelate algorithm, named NoDistanceSenseRelate, which simply ignores the distance, so that all terms in the context carry the same weight.

To illustrate the efficiency of our proposal, both the SenseRelate and NoDistanceSenseRelate algorithms were integrated into a biomedical information retrieval system, and several experiments were conducted using the two weighting models Okapi BM25 and TF-IDF. The results obtained on the OHSUMED corpus show that the context-based methods (SenseRelate and NoDistanceSenseRelate) outperform the other strategies when applying


the Adding Concepts conceptualization strategy, which supports adding the sense of concepts to the term representation in the IR process. In future work, we plan to extend our evaluation to other biomedical information retrieval evaluation datasets.

References

1. Patwardhan, S., Banerjee, S., Pedersen, T.: Using measures of semantic relatedness for word sense disambiguation. In: Gelbukh, A. (ed.) CICLing 2003. LNCS, vol. 2588, pp. 241-257. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-36456-0_24
2. Dinh, D., Tamine, L.: Sense-based biomedical indexing and retrieval. In: Hopfe, C.J., Rezgui, Y., Métais, E., Preece, A., Li, H. (eds.) NLDB 2010. LNCS, vol. 6177, pp. 24-35. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-13881-2_3
3. Elberrichi, Z., Taibi, M., Belaggoun, A.: Multilingual medical documents classification based on MeSH domain ontology. CoRR abs/1206.4883 (2012)
4. Amine, A., Elberrichi, Z., Simonet, M.: Evaluation of text clustering methods using WordNet. Int. Arab J. Inf. Technol. 7, 351 (2010)
5. Guyot, J., Radhoum, S., Falquet, G.: Ontology-based multilingual information retrieval. In: CLEF (2005)
6. Litvak, M., Last, M., Kisilevich, S.: Improving classification of multilingual web documents using domain ontologies. In: KDO 2005, The Second International Workshop on Knowledge Discovery and Ontologies, Porto, Portugal (2005)
7. Song, M.-H., Lim, S.-Y., Park, S.-B., Kang, D.-J., Lee, S.-J.: An automatic approach to classify web documents using a domain ontology. In: Pal, S.K., Bandyopadhyay, S., Biswas, S. (eds.) PReMI 2005. LNCS, vol. 3776, pp. 666-671. Springer, Heidelberg (2005). https://doi.org/10.1007/11590316_107
8. Sanderson, M.: Retrieving with good sense. Inf. Retr. 2(1), 49-69 (2000)
9. Stokoe, C., Oakes, M.P., Tait, J.: Word sense disambiguation in information retrieval revisited. In: Proceedings of the 26th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 159-166 (2003)
10. Kim, S.B., Seo, H.C., Rim, H.C.: Information retrieval using word senses: root sense tagging approach. In: Proceedings of the 27th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 258-265 (2004)
11. Fang, H.: A re-examination of query expansion using lexical resources. In: Proceedings of the 46th Annual Meeting of the Association of Computational Linguistics: Human Language Technologies, pp. 139-147 (2008)
12. Agirre, E., Arregi, X., Otegi, A.: Document expansion based on WordNet for robust IR. In: Proceedings of the 23rd International Conference on Computational Linguistics, pp. 9-17 (2010)
13. Majdoubi, J., Loukil, H., Tmar, M., Gargouri, F.: An approach based on language modeling for improving biomedical information retrieval. Int. J. Knowl.-Based Intell. Eng. Syst. 16(4), 235-246 (2012)
14. Albitar, S., Fournier, S., Espinasse, B.: The impact of conceptualization on text classification. In: Wang, X.S., Cruz, I., Delis, A., Huang, G. (eds.) WISE 2012. LNCS, vol. 7651, pp. 326-339. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35063-4_24
15. McInnes, B.T., Pedersen, T.: Evaluating measures of semantic similarity and relatedness to disambiguate terms in biomedical text. J. Biomed. Inform. 46(6), 1116-1124 (2013)


16. Rais, M., Lachkar, A.: Evaluation of disambiguation strategies on biomedical text categorization. In: Ortuño, F., Rojas, I. (eds.) IWBBIO 2016. LNCS, vol. 9656, pp. 790-801. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-31744-1_68
17. Rais, M., Lachkar, A.: Biomedical word sense disambiguation context-based: improvement of the SenseRelate method. In: 2016 International Conference on Information Technology for Organizations Development (IT4OD). IEEE (2016)
18. Dittenbach, M.: Scoring and ranking techniques - TF-IDF term weighting and cosine similarity (2010). http://www.ir-facility.org/scoring-and-ranking-techniques-tf-idf-term-weighting-and-cosine-similarity
19. What does TF-IDF mean? How to compute. Information Retrieval and Text Mining. http://www.tfidf.com/
20. Hersh, W., et al.: OHSUMED: an interactive retrieval evaluation and new large test collection for research. In: 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 192-201. Springer-Verlag New York, Inc., Dublin (1994)
21. Voorhees, E.M., Harman, D.K.: TREC: Experiment and Evaluation in Information Retrieval. MIT Press, Cambridge (2005)

Drug Delivery System Design Aided by Mathematical Modelling and Experiments

Modelling the Release of Moxifloxacin from Plasma Grafted Intraocular Lenses with Rotational Symmetric Numerical Framework

Kristinn Gudnason1(B), Sven Sigurdsson1, Fjola Jonsdottir1(B), A. J. Guiomar2, A. P. Vieira3, P. Alves3, P. Coimbra3, and M. H. Gil3

1 Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavik, Iceland {krg13,fj}@hi.is
2 CIEPQPF, Departamento de Ciências da Vida, Universidade de Coimbra, Calçada Martim de Freitas, 3000-456 Coimbra, Portugal
3 CIEPQPF, Departamento de Engenharia Química, Universidade de Coimbra, R. Sílvio Lima, 3030-790 Coimbra, Portugal

Abstract. A rotational symmetric finite element model is constructed to simulate the release of moxifloxacin from different types of plasma-grafted intraocular lenses, utilizing general discontinuous boundary conditions to describe the interface between lens and outside medium. Such boundary conditions allow for the modelling of partitioning and interfacial mass transfer resistance. Due to its rotational symmetry, the shape of the optical part of the intraocular lens is fully taken into account. Two types of polyacrylates were plasma-grafted to the intraocular lens to act as barriers for the release of the loaded drug. Simulations are carried out and compared to release experiments to infer drug-material properties, which is crucial for optimising therapeutic effects.

Keywords: Plasma-grafting · Discontinuities · Finite element · Mass transfer · Partitioning · Targeted drug delivery · Ocular drug delivery

1 Introduction

Endophthalmitis, a purulent inflammation of the aqueous and vitreous humors caused by bacteria or fungi that have entered the eye, may occur after cataract surgery [1], in which the natural eye lens is replaced by an artificial intraocular lens (IOL). As it has an incidence of up to 0.2% [2] and, if left untreated, frequently results in vision loss, prophylaxis by biocide and antibiotic administration is mandatory. In Europe, preoperative antisepsis with topical application of povidone-iodine or chlorhexidine, together with postoperative topical administration of antibiotic eye drops for up to 2 weeks or intracameral cefuroxime


injection at the end of surgery, are the most common approaches; in the USA, administration of antibiotic eye drops, 1 to 3 days preoperatively and resumed immediately postoperatively for 1 week, is favored [2,3]. Although intracameral administration provides far higher antibiotic concentrations at the target site than topical administration [4], it is an invasive procedure. Topical administration employing eye drops, although a simple, non-invasive ocular drug administration route, suffers from poor patient compliance and low bioavailability due to blinking, lacrimation, flow through the nasolacrimal duct, rapid absorption into the bloodstream and poor corneal penetration. Consequently, less than 5% of the drug administered through eye drops enters the eye [5], and the required therapeutic concentration may not be attained.

The conversion of an implanted IOL into a drug release system inside the eye was proposed in the 1990s as an alternative to both intracameral injection and eye drops. Nishi et al. [6] and Tetz et al. [7] proposed and showed the efficacy of IOLs as drug delivery systems to treat eye inflammation and posterior chamber opacification, employing IOLs loaded with anti-inflammatory and anti-proliferative drugs. Antibiotic-loaded IOLs were pioneered by Shimizu et al. [8] and Kleinmann et al. [9], who also proposed their use in endophthalmitis prophylaxis. Since then, IOLs have been loaded with a variety of drugs, employing different drug loading strategies [10]. However, in spite of more than 20 years of research, drug-loaded IOLs are not commercially available, and currently there are no clinical trials in the European, American or Japanese clinical trials registers [11-15].

We have also proposed an IOL-based antibiotic-releasing system for use in prophylaxis of postoperative endophthalmitis, releasing moxifloxacin (MFX), a fourth-generation fluoroquinolone used in endophthalmitis prophylaxis [16,17]. Surface modification with a thin coating was elected, since thin coatings do not compromise bulk IOL properties and may act as barriers to the release of the drug loaded in the IOL, extending its release duration. Argon plasma-assisted graft copolymerization with acrylate monomers was the surface modification method adopted, with MFX loaded both by entrapment in the grafted polyacrylate coating and by subsequent soaking in a drug solution. In the current study, the acrylate monomers selected were 2-hydroxyethyl methacrylate (HEMA), an electrically neutral monomer already present in the employed IOLs, and 2-acrylamido-2-methylpropane sulfonic acid (AMPS), a monomer used in superabsorbent hydrogels [18], which is negatively charged at physiological pH.

As drug-loaded IOLs must reach the surgeon in a form that is ready for immediate use, they have to be stored in a hydrated, sterilized form, without losing the loaded drug. For this purpose, we sterilized our MFX-loaded IOLs in the MFX solution used in the final drug loading step and stored them afterwards for 4 weeks; drug loss during sterilization and storage, by diffusion to the storage solution, is thus avoided. The sterilization method selected was autoclaving, under conditions which are used industrially for IOLs, since MFX is thermally stable [19,20].

The model developed in this work is based on [21], a rotational symmetric numerical framework with a discontinuous interlayer condition, which we use to


Fig. 1. Left: intraocular lens (IOL), seen from top, consisting of optical lens surrounded by protruding loops called haptics which give stability once within the eye. Right: finite element triangulation of half of the optical lens cross-section.

simulate the release experiments and evaluate drug-material properties. Included in this framework are the effects of diffusion within the lens, the partitioning between lens and outside medium, and the mass transfer coefficient, which can describe resistance to drug transport across the lens boundary and thus possible surface barrier effects. As the medium surrounding the IOLs is stirred, we assume it to be homogeneous and treat the medium concentration as a scalar, which reduces computation. The intraocular lens shown in Fig. 1 consists of an optical lens surrounded by protruding loops called haptics. We focus on simulating the optical lens, which makes up the bulk of the total IOL, using its symmetry to apply the rotational symmetric numerical framework.

2 Methods

IOLs made from a hydrophilic poly[(2-hydroxyethyl methacrylate)-co-(methyl methacrylate)]-based material (EWC: 26%) were provided by PhysIOL S.A. (Liège, Belgium) in their final packaged form. They had a power of 20 diopters, a total lens diameter of 10.75 mm, an optic diameter of 6.15 mm and a center thickness of 1.0 mm. All IOLs were vacuum dried and stored dry before the surface modification. 2-Hydroxyethyl methacrylate (HEMA) and 2-acrylamido-2-methylpropane sulfonic acid (AMPS) were supplied by Sigma-Aldrich (St. Louis, USA) and moxifloxacin hydrochloride (MFX) by Carbosynth Ltd. (Compton, UK). A balanced salt solution (BSS) containing 8 g/L of NaCl, 0.4 g/L of KCl, 0.0356 g/L of Na2HPO4, 0.06 g/L of KH2PO4, 0.144 g/L of CaCl2, 0.12 g/L of MgSO4 and 0.35 g/L of NaHCO3 in Milli-Q water, with pH adjusted to 7.4, was employed in the drug release studies. All chemicals and reagents were of analytical grade and were used as supplied. Saline for injection (NaCl 0.9%) was supplied by Baxter International, Inc., Deerfield, USA.

2.1 IOL Surface Modification and Drug Loading

Surface modification by plasma-assisted grafting with HEMA was conducted in a commercial low-pressure plasma reactor (FEMTO, Diener Electronic GmbH,


Ebhausen, Germany), composed of a stainless steel chamber of 100 mm diameter and 270 mm length. Vacuum-dried (room temperature) IOLs were placed horizontally on a clean glass plate and exposed to argon plasma generated at a chamber pressure of 0.6 mbar, for 3 min, applying a power of 100 W to the electrodes. After repeating the procedure on the other face, each IOL was removed from the plasma chamber and immediately immersed in a 10% (v/v) solution of HEMA or AMPS in BSS containing MFX at 5 mg/mL. After 1 h (HEMA) or 8 h (AMPS) of grafting at 60 °C, the IOLs were placed in 200 mL of distilled water in a glass container, at 37 °C and under constant shaking at 100 rpm, and were extensively washed until no monomer and MFX could be detected by UV spectroscopy, changing the distilled water thrice a day. Finally, the modified IOLs were individually loaded with MFX by soaking in 1 mL of a 5 mg/mL MFX solution in saline for injection at pH 7 (adjusted with 1 M NaOH) for 15 h, under shaking at 100 rpm and at a temperature of 37 °C. After this drug loading step, the modified IOLs were kept in the loading solution and sterilized in an autoclave at 121 °C for 30 min, in the same batch. All sterilized samples were then stored at room temperature for 4 weeks before the drug release studies.

2.2 Drug Release Studies

BSS was selected as a model of the aqueous humor in the eye for drug release studies in batch mode, under sink conditions. MFX-loaded IOLs were placed individually in a closed vial containing 3 mL of BSS at 37 °C, under constant shaking at 100 rpm. At predetermined time intervals, 0.5 mL of the release medium was sampled and replaced with the same volume of fresh BSS, to imitate the drainage of the aqueous humor and to maintain sink conditions. The MFX concentration present in the collected samples was determined by UV-vis spectrophotometry at 290 nm, employing a Jenway 7315 UV-vis spectrophotometer (Cole-Parmer, Stone, UK). The residual absorbance at 290 nm of blank IOLs (unmodified, unloaded and sterilized IOLs) was subtracted from the absorbance of the sample under study. The difference was used to calculate the drug concentration in the release media, through a calibration curve, and to calculate the cumulative amount of drug released per mass of dry IOL. Each data point was the average of results from three samples. Results were expressed as mean ± standard deviation.
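Computing the cumulative amount released requires accounting for the drug withdrawn with each 0.5 mL sample. The sketch below shows this standard sampling correction with the volumes from the protocol above; the function and its bookkeeping are illustrative, not taken from the authors' analysis code:

```python
def cumulative_release(sampled_conc, v_total=3.0, v_sample=0.5):
    """Cumulative drug mass (mg) released at each sampling time.

    sampled_conc: measured concentrations (mg/mL) at successive time points.
    Each sample removes drug from the vial, so the mass carried away by
    earlier samples is added back to the mass currently in the vial.
    """
    cumulative, removed = [], 0.0
    for c in sampled_conc:
        cumulative.append(c * v_total + removed)
        removed += c * v_sample
    return cumulative

print(cumulative_release([0.10, 0.15, 0.18]))  # [0.3, 0.5, 0.665]
```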

3 Model

The model described below is based on a framework for diffusive transport in a rotational symmetric system with a discontinuous interlayer condition [21]. However, since the medium surrounding the IOLs is stirred, we assume it to be homogeneous and thus model its concentration as a scalar, with the added benefit of reducing computation.

3.1 Governing Equations

To model the release experiments described above, we express the change of drug concentration within the IOL with the diffusion equation and the concentration in the medium as the integrated flux from the IOL:

$$\frac{\partial C_\alpha(\mathbf{x}, t)}{\partial t} = \nabla \cdot \big(D_\alpha \nabla C_\alpha(\mathbf{x}, t)\big), \quad \mathbf{x} \in \Omega_\alpha \qquad (1)$$

$$V_\beta \frac{\partial C_\beta(t)}{\partial t} = -\int_\Gamma D_\alpha \big(\nabla C_\alpha(\mathbf{x}, t) \cdot \mathbf{n}\big)\, ds \quad \text{in } \Omega_\beta \qquad (2)$$

where Ω_α and Ω_β refer to the regions of the IOL and the outer medium, respectively, Γ refers to the interface between Ω_α and Ω_β, and n is the normal on the boundary in the direction from Ω_α into Ω_β. Here C_α denotes the concentration (mg/cm³) within the lens and D_α is the diffusion coefficient (cm²/h) within the lens. The concentration in the medium is denoted by C_β (mg/cm³) and V_β is the volume of the outer medium. Note that C_α depends on x ∈ Ω_α while C_β is spatially independent, and in (2) we have assumed a no-flow condition on the outer boundary of Ω_β. On the interior boundary Γ, between regions Ω_α and Ω_β, we employ general boundary conditions allowing for a jump in concentration between subregions:

$$-D_\alpha (\nabla C_\alpha \cdot \mathbf{n}) = K(C_\alpha - P C_\beta) \qquad (3)$$

The boundary conditions include two parameters, a partition coefficient, P, and a mass transfer coefficient, K. The important distinction between the two parameters is that P is dimensionless and as such can describe the relative difference between the concentrations on each side of the boundary that can be maintained in equilibrium. K, on the other hand, has dimension (cm/h) and can thus describe the rate at which such an equilibrium can be reached. It can also describe the flux resistance of a thin barrier [22,23]. We express Eq. (1) in cylindrical coordinates, r, φ, z, with rotational symmetry so that ∂C/∂φ = 0:

$$\frac{\partial C_\alpha}{\partial t} = \frac{1}{r} \frac{\partial}{\partial r}\left(r D_\alpha \frac{\partial C_\alpha}{\partial r}\right) + \frac{\partial}{\partial z}\left(D_\alpha \frac{\partial C_\alpha}{\partial z}\right) \qquad (4)$$

3.2 Discretization

We apply a finite element approach and thus express (4) by the following weak formulation:

$$\int_{\Omega_\alpha} r \frac{\partial C_\alpha}{\partial t} w \, dr\,dz = -\int_{\Omega_\alpha} r D_\alpha \left(\frac{\partial C_\alpha}{\partial r}\frac{\partial w}{\partial r} + \frac{\partial C_\alpha}{\partial z}\frac{\partial w}{\partial z}\right) dr\,dz + \int_{\partial\Omega_\alpha} r D_\alpha \frac{\partial C_\alpha}{\partial n} w \, ds \qquad (5)$$

where w is an arbitrary piecewise differentiable weight function over Ω_α; see [21].


We divide the subregion Ω_α into triangular elements with n_α nodal points, as shown in Fig. 1. A typical element T has vertices v_i = (r_i, z_i), i = 1, 2, 3, in anticlockwise order and area |T|. Within T we introduce three linear basis functions η_i(r, z), i = 1, 2, 3, that take the value 1 at v_i and 0 on the opposite edge. We approximate C_α(r, z, t) with

$$\hat{C}_\alpha(r, z, t) = \sum_{i=1}^{3} \hat{c}_{\alpha,i}(t)\, \eta_i(r, z)$$

where ĉ_{α,i}(t) are time-dependent coefficients that amount to the approximate values of C_α at v_i. Applying the weak formulation (5) to T, with the weight functions w = η_i(r, z), i = 1, 2, 3, we get the following local system of equations:

$$M_T \frac{d}{dt}\begin{bmatrix}\hat{c}_{\alpha,1}\\ \hat{c}_{\alpha,2}\\ \hat{c}_{\alpha,3}\end{bmatrix} = -A_T \begin{bmatrix}\hat{c}_{\alpha,1}\\ \hat{c}_{\alpha,2}\\ \hat{c}_{\alpha,3}\end{bmatrix} + \begin{bmatrix}f_{\alpha,1}\\ f_{\alpha,2}\\ f_{\alpha,3}\end{bmatrix} \qquad (6)$$

Here M_T is the local mass matrix and A_T is the local stiffness matrix, as shown in [21], and

$$f_{\alpha,i} = \int_{\partial T} (r_1\eta_1 + r_2\eta_2 + r_3\eta_3)\, D_\alpha \frac{\partial \hat{C}_\alpha}{\partial n}\, \eta_i \, ds \qquad (7)$$

is the boundary flux value around the vertex v_i. Since flux values across internal element edges cancel out by the constraint of flux continuity, we are only left with flux values along the boundary Γ. We now describe how the general boundary conditions are implemented in this case. To the surrounding medium we assign a single concentration variable ĉ_β. Now consider the edge E_{12} joining vertices v_1 and v_2 on Γ. The approximate concentration values at v_1 and v_2 within Ω_α are denoted by ĉ_{α,1} and ĉ_{α,2}, respectively, and the concentration is assumed to vary linearly between these values. By the boundary condition (3), the outward flux value associated with ĉ_{α,i}, i = 1, 2, along E_{12} will then be

$$-\int_{E_{12}} (r_1\eta_1 + r_2\eta_2)\, D_\alpha \frac{\partial \hat{C}_\alpha}{\partial n}\, \eta_i \, ds = \int_{E_{12}} (r_1\eta_1 + r_2\eta_2)\, K\big((\hat{c}_{\alpha,1} - P\hat{c}_\beta)\eta_1 + (\hat{c}_{\alpha,2} - P\hat{c}_\beta)\eta_2\big)\, \eta_i \, ds$$

along with the same expression for the flux value associated with ĉ_β, except with the sign reversed, thus taking care of the right-hand side in (2). This results in the following local flux vector:

$$\frac{|E_{12}|\, K}{12}\, F \begin{bmatrix}\hat{c}_{\alpha,1}\\ \hat{c}_{\alpha,2}\\ \hat{c}_\beta\end{bmatrix} \qquad (8)$$

where

$$F = \begin{bmatrix} 3r_1 + r_2 & r_1 + r_2 & -2P(2r_1 + r_2)\\ r_1 + r_2 & r_1 + 3r_2 & -2P(r_1 + 2r_2)\\ -2(2r_1 + r_2) & -2(r_1 + 2r_2) & 6P(r_1 + r_2) \end{bmatrix}$$
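As a quick consistency check of Eq. (8), the sketch below builds the local flux contribution and verifies that each column of the matrix sums to zero, meaning that whatever drug leaves the lens across the edge enters the medium exactly; the numeric inputs are arbitrary:

```python
import numpy as np

def local_flux_matrix(r1, r2, edge_len, K, P):
    """Local flux contribution (|E12| K / 12) * F for a boundary edge E12,
    acting on [c_alpha_1, c_alpha_2, c_beta]; F transcribed from Eq. (8)."""
    F = np.array([
        [    3*r1 + r2,         r1 + r2,   -2*P*(2*r1 + r2)],
        [      r1 + r2,       r1 + 3*r2,   -2*P*(r1 + 2*r2)],
        [-2*(2*r1 + r2), -2*(r1 + 2*r2),     6*P*(r1 + r2)],
    ])
    return edge_len * K / 12.0 * F

M = local_flux_matrix(0.2, 0.3, 0.05, 1.1e-4, 23.0)
print(np.allclose(M.sum(axis=0), 0.0))  # True: mass is conserved across the edge
```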

3.3 Time Integration

By assembling the local systems (6) along with Eq. (2) and the local flux vectors, we obtain the global system of differential equations

$$M_G \frac{d\hat{c}}{dt} = -A_G \hat{c}$$

where ĉ is an (n_α + 1)-vector consisting of the ĉ_{α,i} values at all n_α nodes within Ω_α, with ĉ_β as its last entry. M_G is the (n_α + 1) × (n_α + 1) global mass matrix, including the local mass matrix contributions and the value V_β as the [n_α + 1, n_α + 1] element. A_G is the global stiffness matrix, including the local stiffness matrix contributions as well as the local flux contributions from the matrix in (8). We employ an implicit backward Euler scheme with timestep Δt for the integration of the time term, resulting in the following linear system to be solved at each new time:

$$(M_G + \Delta t\, A_G)\, \hat{c}(t + \Delta t) = M_G\, \hat{c}(t)$$

We only have to carry out a sparse LU-factorization of the matrix (M_G + Δt A_G) at the start and then apply forward elimination and backward substitution at each time step.
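A minimal sketch of this time-stepping scheme with SciPy's sparse LU factorization; assembly of M_G and A_G is assumed to have happened elsewhere, and all names are illustrative:

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

def integrate(M_G, A_G, c0, dt, n_steps):
    """Backward Euler: solve (M_G + dt*A_G) c(t+dt) = M_G c(t) repeatedly.
    One sparse LU factorization up front, then one solve per time step."""
    M = csc_matrix(M_G)
    lu = splu(csc_matrix(M_G + dt * A_G))   # factorize once
    c = np.asarray(c0, dtype=float)
    history = [c]
    for _ in range(n_steps):
        c = lu.solve(M @ c)                 # forward/backward substitution
        history.append(c)
    return history
```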

4 Simulations

Using the model described above, simulations were carried out and compared with the experimental data (see Fig. 2) to gain insight into drug-material properties. An IOL, as depicted on the left of Fig. 1, consists of an optical biconvex lens and haptics. For simplicity, we model only the optical biconvex part, the bulk of the total IOL. The geometry of the biconvex optical lens is represented by its cross section, defined as the intersection of two circles with radii 2.32 cm and 1.1524 cm and a thickness of 1 mm at the center. The optical lens itself has a radius of 3.075 mm. Note that due to symmetry, the system being modelled is confined to the positive-r half-space with a no-flow boundary condition at r = 0, i.e., the triangulated region on the right of Fig. 1; in the concentration profiles shown in Fig. 3, however, the results are reflected across the z axis. The value of V_β is 3 mL. We assume that the plasma grafting process affected only the surface of the IOLs, so the D and P values were assumed to be the same for HEMA, AMPS, and the unmodified IOLs; the value of K, however, varied. The values of K are given in Table 1, whereas D = 8.5 · 10⁻⁶ cm²/h and P = 23. Assuming the loading had been long enough, the initial concentration distribution within the lens was assumed to be homogeneous and set to P · 5 mg/mL. The resulting three simulation curves are shown in Fig. 2.


Fig. 2. Simulations of concentration in the medium surrounding the IOL. Each curve corresponds to a different grafting material. The same diffusion and partition values were used in each simulation but the mass transfer parameter varied; see Table 1. The vertical bars show standard deviation and circles show the average value of experiments.

5 Discussion

The drops in the simulated release concentration correspond to sample extraction: when fresh medium is added, the concentration drops by 1/6. Note that before 120 h, the initial burst of the experimental release data is higher than that of the simulated release curves. However, if the haptics were added to the simulations, the added surface area would increase the initial burst of the release curve, as more drug could readily exit the IOL. We assume that the P value used to simulate such a system would be lower due to the added drug molecules within the haptics, and the initial concentration of the release stage would consequently also be lower. Another point for improvement would be to simulate the loading stage of the lens explicitly, as the initial concentration of the release stage may be inhomogeneous. Cross-sectional concentration profiles are shown in Fig. 3, showing the varying concentration throughout the lens at different times. Contour lines can be seen around the center of the lens, as drug can more easily escape through the sides, which is expected since the lens becomes thinner there. The deduced K parameter value for the AMPS system indicates more resistance at the surface than for the unmodified IOL system, as opposed to the HEMA system, which has a higher K value.


Fig. 3. Simulated concentration profiles of the AMPS system at different times, showing the cross-sectional distribution of moxifloxacin throughout the biconvex optical section of the IOL. Color indicates concentration; values can be read from the colorbar at the top of the figure. (Color figure online)


Table 1. Chosen values of K in the simulations of the IOL systems

Grafting material   AMPS         Unmodified   HEMA
K (cm/h)            1.1 · 10⁻⁴   8.5 · 10⁻⁵   6.5 · 10⁻⁵

Acknowledgments. We thank financial support from the Technical Development Fund, Iceland (grant no. 13-1309), and Fundação para a Ciência e a Tecnologia (FCT, Portugal), QREN, POFC-COMPETE and FEDER programmes (grants M-ERA.NET/0005/2012 and M-ERA.NET/0006/2012), as part of the jointly funded European M-Era.Net project titled "SurfLenses - Surface modifications to control drug release from therapeutic ophthalmic lenses". P. Alves and P. Coimbra thank FCT for personal grants SFRH/BPD/69410/2010 and SFRH/BPD/73367/2010, respectively. All authors thank Dr. Helena Filipe (Hospital das Forças Armadas, Lisbon, Portugal) for advice, and Dr. Dimitriya Bozukova (PhysIOL, S.A., Liège, Belgium) also for advice and for supplying the IOLs.

References

1. Callegan, M.C., Engelbert, M., Parke, D.W., Jett, B.D., Gilmore, M.S.: Bacterial endophthalmitis: epidemiology, therapeutics, and bacterium-host interactions. Clin. Microbiol. Rev. 15(1), 111-124 (2002)
2. Braga-Mele, R., Chang, D.F., Henderson, B.A., Mamalis, N., Talley-Rostov, A., Vasavada, A., ASCRS Cataract Clinical Committee, et al.: Intracameral antibiotics: safety, efficacy, and preparation. J. Cataract Refract. Surg. 40(12), 2134-2142 (2014)
3. Chang, D.F., Braga-Mele, R., Henderson, B.A., Mamalis, N., Vasavada, A., ASCRS Cataract Clinical Committee, et al.: Antibiotic prophylaxis of postoperative endophthalmitis after cataract surgery: results of the 2014 ASCRS member survey. J. Cataract Refract. Surg. 41(6), 1300-1305 (2015)
4. Barry, P., Cordovés, L., Gardner, S.: ESCRS guidelines for prevention and treatment of endophthalmitis following cataract surgery: data, dilemmas and conclusions. European Society of Cataract and Refractive Surgeons (2013)
5. Gaudana, R., Jwala, J., Boddu, S.H.S., Mitra, A.K.: Recent perspectives in ocular drug delivery. Pharm. Res. 26(5), 1197 (2009)
6. Nishi, O., Nishi, K., Yamada, Y., Mizumoto, Y.: Effect of indomethacin-coated posterior chamber intraocular lenses on postoperative inflammation and posterior capsule opacification. J. Cataract Refract. Surg. 21(5), 574-578 (1995)
7. Tetz, M.R., Ries, M.W., Lucas, C., Stricker, H., Völcker, H.E.: Inhibition of posterior capsule opacification by an intraocular-lens-bound sustained drug delivery system: an experimental animal study and literature review. J. Cataract Refract. Surg. 22(8), 1070-1078 (1996)
8. Shimizu, K., Kobayakawa, S., Tsuji, A., Tochikubo, T.: Biofilm formation on hydrophilic intraocular lens material. Curr. Eye Res. 31(12), 989-997 (2006)
9. Kleinmann, G., Apple, D.J., Chew, J., Hunter, B., Stevens, S., Larson, S., Mamalis, N., Olson, R.J.: Hydrophilic acrylic intraocular lens as a drug-delivery system for fourth-generation fluoroquinolones. J. Cataract Refract. Surg. 32(10), 1717-1721 (2006)


10. Liu, Y.-C., Wong, T.T., Mehta, J.S.: Intraocular lens as a drug delivery reservoir. Curr. Opin. Ophthalmol. 24(1), 53-59 (2013)
11. EU Clinical Trials Register: EMA (European Medicines Agency, London, UK) (2016). Accessed 15 Dec 2017
12. ClinicalTrials.gov: NLM (National Library of Medicine, Bethesda, MD, USA) (2016). Accessed 15 Dec 2017
13. UMIN Clinical Trials Registry (UMIN-CTR): UMINC (University Hospital Medical Information Network Center, Tokyo, Japan) (2016). Accessed 03 May 2016
14. JAPIC Clinical Trials Information: JAPIC (Japan Pharmaceutical Information Center, Tokyo, Japan) (2016). Accessed 03 May 2016
15. JMACTR List: JMACCT (Japan Medical Association Center for Clinical Trials, Tokyo, Japan) (2016). Accessed 03 May 2016
16. Vieira, A.P., Pimenta, A.F.R., Silva, D., Gil, M.H., Alves, P., Coimbra, P., Mata, J.L.G.C., Bozukova, D., Correia, T.R., Correia, I.J., et al.: Surface modification of an intraocular lens material by plasma-assisted grafting with 2-hydroxyethyl methacrylate (HEMA), for controlled release of moxifloxacin. Eur. J. Pharm. Biopharm. 120, 52-62 (2017)
17. Pimenta, A.F.R., Vieira, A.P., Colaço, R., Saramago, B., Gil, M.H., Coimbra, P., Alves, P., Bozukova, D., Correia, T.R., Correia, I.J., et al.: Controlled release of moxifloxacin from intraocular lenses modified by Ar plasma-assisted grafting with AMPS or SBMA: an in vitro study. Colloids Surf. B: Biointerfaces 156, 95-103 (2017)
18. Okay, O., Sariisik, S.B., Zor, S.D., et al.: Swelling behavior of anionic acrylamide-based hydrogels in aqueous salt solutions: comparison of experiment with theory. J. Appl. Polym. Sci. 70(3), 567-575 (1998)
19. Hubicka, U., Zuromska-Witek, B., Krzek, J., Walczak, M., Zylewski, M.: Kinetic and thermodynamic studies of moxifloxacin hydrolysis in the presence and absence of metal ions in acidic solutions. Acta Pol. Pharm. 69(5), 821-831 (2012)
20. Devi, M.L., Chandrasekhar, K.B.: A validated, specific stability-indicating RP-LC method for moxifloxacin and its related substances. Chromatographia 69(9-10), 993-999 (2009)
21. Gudnason, K., Sigurdsson, S., Jonsdottir, F.: A numerical framework for diffusive transport in rotational symmetric system with discontinuous interlayer condition. In: 9th Vienna International Conference on Mathematical Modelling (2018, to be published)
22. Gudnason, K., Sigurdsson, S., Snorradottir, B.S., Masson, M., Jonsdottir, F.: A numerical framework for drug transport in a multi-layer system with discontinuous interlayer condition. Math. Biosci. 295, 11-23 (2018)
23. Gudnason, K., Solodova, S., Vilardell, A., Masson, M., Sigurdsson, S., Jonsdottir, F.: Numerical simulation of Franz diffusion experiment: application to drug loaded soft contact lenses. J. Drug Deliv. Sci. Technol. 38, 18-27 (2017)

Generation, Management and Biological Insights from Big Data

Predicting Tumor Locations in Prostate Cancer Tissue Using Gene Expression

Osama Hamzeh(B), Abedalrhman Alkhateeb(B), and Luis Rueda(B)

School of Computer Science, University of Windsor, 401 Sunset Ave, Windsor, ON N9B 3P4, Canada {hamzeho,alkhate,lrueda}@uwindsor.ca

Abstract. Prostate cancer can be missed due to the limited number of biopsies or the ineffectiveness of standard screening methods. Finding gene biomarkers for prostate cancer location and analyzing their transcriptomics can help clinically understand the development of the disease and improve treatment efficiency. In this work, a classification model is built based on gene expression measurements of samples from patients who have cancer on the left lobe, the right lobe, or both lobes of the prostate as classes. A hybrid feature selection is used to select the best possible set of genes that can differentiate the three classes, and standard machine learning classifiers with the one-versus-all technique are used to select potential biomarkers for each laterality class. RNA-sequencing data from The Cancer Genome Atlas (TCGA) Prostate Adenocarcinoma (PRAD) cohort was used. This dataset consists of 450 samples from different patients with different cancer locations; there are three primary locations within the prostate: left, right, and bilateral. Each sample in the dataset contains expression levels for each of the 60,488 genes, given as Transcripts Per Kilobase Million (TPM) values. The results show promising prospects for predicting prostate cancer laterality: with 99% accuracy, a support vector machine based on a radial basis function kernel (SVM-RBF) was able to identify each group from the others using the selected subset of genes. Three groups of genes (RTN1, HLA-DMB, MRI1 and others) were found to be differentially expressed among the three different tumor locations. The findings were validated against multiple reports in the literature, which confirm the relationship between those genes and prostate cancer.

Keywords: Machine learning · Classification · Biomarkers · Prostate cancer laterality

1 Background

Cancer is among the leading causes of death worldwide. In 2013, there were 8.2 million deaths and 14.9 million incident cancer cases [1]. As with all cancer


diseases, investigating prostate cancer at the molecular level reveals transcriptional mechanisms of the tumour biology. Traditionally, prostate cancer studies centered primarily on finding biomarkers for differentiation between benign and cancerous tumors; more recently, studies have considered other aspects of the tumours, including progression, metastasis, location, and recurrence, among others. Traditional methods for detecting prostate cancer, such as the prostate-specific antigen (PSA) blood test, transrectal ultrasound (TRUS) guided biopsy, and the digital rectal exam (DRE), do not measure up to medical standards: the PSA blood test shows a specificity of 61% and a low sensitivity of 34.9%, while TRUS-guided biopsy and DRE are invasive [2]. Multiparametric magnetic resonance imaging (MRI) of the prostate is a functional form of imaging used to augment standard T1- and T2-weighted imaging, but it may miss up to 12% of cancer cases [3]. In addition, there is a need to reduce the number of biopsies, which often come with pain, fever, bleeding, infection, transient urinary difficulties, or other complications that require hospitalization [4]. Finding gene biomarkers of prostate cancer location and analyzing their proteomics can help clinically understand the development of the disease and improve treatment efficiency.

Machine learning approaches have been applied to prostate cancer data to identify gene biomarkers for the disease [5,6]. Using next-generation sequencing and the power of machine learning, Singireddy et al. devised an SVM classifier to identify biomarker genes associated with prostate cancer progression; the biomarkers were able to discriminate consecutive prostate cancer stages with high performance [5]. Earlier, we proposed a method for finding groups of transcripts that are differentially expressed among the different Gleason stages [7]; the identified transcripts can be used to predict the actual Gleason score for new samples, and they belong to genes that are well known to play important roles in prostate and other types of cancer. Ping Yu et al. demonstrated that their method is feasible for predicting prostate cancer aggressiveness based on gene expression patterns [8].

Machine learning approaches have also been used for cancer localization prediction [9,10]. Artan et al. proposed a prediction model based on a cost-sensitive support vector machine (SVM) to analyze a large dataset of multispectral magnetic resonance imaging (MRI); the method improves the cost-sensitive SVM with a segmentation approach that combines conditional random fields (CRF) with a cost-sensitive framework, and incorporating spatial information leads to better localization accuracy [9]. As stated earlier, prediction by imaging needs further improvement. In an attempt to find differences in gene expression levels between two lists, the first containing the expression levels of colon tumor cells and the second those of rectal tumor cells, Sanz-Pamplona et al. applied agglomerative hierarchical clustering to display the ability to classify between both lists; the two lists have very similar gene expression levels except for several HOX genes, which were found to be associated with tumor location [10].


In this work, standard classification models are built on gene expression values for three prostate cancer location classes: the left lobe, the right lobe, and both sides of the prostate. As a result of these models, we found biomarker genes that can identify the location of the cancer.

2 Materials and Methods

RNA-sequencing data from The Cancer Genome Atlas (TCGA) Prostate Adenocarcinoma (PRAD) cohort was used. This dataset consists of 450 samples from different patients with different cancer locations. There are three primary locations in which the tumor might be located within the prostate: left, right, and bilateral. Figure 1 shows the possible locations, and Table 1 gives the number of samples in each location.

Fig. 1. Possible tumor locations in prostate cancer.

Table 1. Number of samples in each prostate cancer tumor location.

Left   Bilateral   Right
18     431         38

Gene expression data was downloaded through the cBioPortal for Cancer Genomics database [11]. Each sample contains expression levels for each of the 60,488 genes; the gene expression values are given as Transcripts Per Kilobase Million (TPM). The aim of this study is to identify genes that are associated with specific tumor locations; hence, we use the genes as features and the actual locations as classes to build a model that predicts locations


for future samples. Since most of the samples are bilateral, we deal with a class imbalance problem. We used the resampling method [12] as a measure to lower the effect of this imbalance.
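The paper cites the resampling method of [12] without showing code; the following Python sketch (ours, not the authors') illustrates one simple variant, random oversampling of the minority classes. File and column names are hypothetical placeholders.

```python
# Minimal sketch (ours): random oversampling of the minority location
# classes so each class matches the majority class size.
import pandas as pd

def oversample_minorities(df, label_col="location", seed=42):
    """Duplicate minority-class samples at random until every class
    matches the size of the largest class (431 bilateral samples here)."""
    counts = df[label_col].value_counts()
    target = counts.max()
    parts = []
    for cls, n in counts.items():
        rows = df[df[label_col] == cls]
        extra = rows.sample(n=target - n, replace=True, random_state=seed)
        parts.append(pd.concat([rows, extra]))
    return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle

# Usage: one row per sample, genes as columns, plus a "location" column.
# balanced = oversample_minorities(pd.read_csv("prad_tpm_samples.csv"))
```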

2.1 Feature Selection

Since the number of features is quite high, we need machine learning techniques to lower the number of features used for classification. We applied information gain feature selection to rank all the genes by a score reflecting their information gain against the different classes. We then choose the attributes with the highest scores, discarding those with lower scores. In this paper, the information gain (IG) attribute evaluator [13] is used to evaluate each attribute. The IG of feature X with respect to class Y is calculated as follows:

IG(Y, X) = H(Y) - H(Y|X),  (1)

where

H(Y) = -\sum_{y \in Y} p(y) \log_2 p(y),  (2)

and

H(Y|X) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y|x) \log_2 p(y|x).  (3)
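As a concrete illustration of Eqs. (1)-(3) (ours, not the authors' code), the information gain of one discretized gene-expression feature can be computed as follows; the example values are hypothetical.

```python
# Information gain of a discrete feature X with respect to class Y.
import numpy as np
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y p(y) log2 p(y)."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(x, y):
    """IG(Y, X) = H(Y) - H(Y|X) for a discrete feature x."""
    n = len(y)
    h_cond = 0.0
    for value in set(x):
        subset = [y[i] for i in range(n) if x[i] == value]
        h_cond += len(subset) / n * entropy(subset)   # p(x) * H(Y|X=x)
    return entropy(y) - h_cond

x = ["high", "high", "low", "low", "high"]            # discretized TPM
y = ["left", "left", "right", "right", "bilateral"]   # tumor locations
print(information_gain(x, y))  # higher value -> more informative gene
```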

Here, H(Y) is the entropy of class Y and H(Y|X) is the conditional entropy of Y given X. The next step is to choose the best set of attributes (genes) that provides good classification among the different classes. A wrapper that binds feature selection and a classification method is used. That method is the minimum redundancy maximum relevance (mRMR) approach, which takes features that contain minimum redundancy while at the same time having high correlation to the classification variable [14]. The equations for minimizing the redundancy (W_i) and maximizing the relevancy (V_i) are as follows:

W_i = \frac{1}{|S|^2} \sum_{i,j \in S} I(i, j),  (4)

and

V_i = \frac{1}{|S|} \sum_{i \in S} I(h, i),  (5)

where S is the set of features, I(i, j) is the mutual information between features i and j, and h is the class.
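A minimal greedy sketch of the mRMR criterion in Eqs. (4)-(5) is shown below; it is our illustration, not the authors' implementation, and it assumes discretized feature vectors so that sklearn's mutual_info_score can serve as the mutual information I(.,.).

```python
# Greedy mRMR selection: at each step, add the feature with the best
# relevance-minus-redundancy score.
from sklearn.metrics import mutual_info_score

def mrmr_select(features, y, k):
    """features: {gene_name: discrete value vector}; y: class labels."""
    selected, remaining = [], set(features)
    while remaining and len(selected) < k:
        def score(f):
            relevance = mutual_info_score(y, features[f])        # V_i term
            if not selected:
                return relevance
            redundancy = sum(mutual_info_score(features[f], features[s])
                             for s in selected) / len(selected)  # W_i term
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```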

2.2 Classification

We deal with a multi-class classification problem, which is solved by using the one-versus-all approach. We have three different classes, which are the three different


locations. To apply the one-versus-all approach, we create three separate copies of the actual dataset. For each copy, we set one of the classes as positive, and the rest of the classes are combined to form the negative class. We used accuracy, sensitivity and specificity to choose the best classification method. Multiple classification methods were applied to the data to identify which methods separate the locations best. Accordingly, the probabilistic classifier Naive Bayes, which applies Bayes' theorem with the assumption of independence between the features [15], was tested. SVM was also used to build a classification model based on the features selected in the previous step [16]. The other classifier tested is the random forest [17], which builds multiple decision tree models with different samples and different initial variables. Weka open source libraries were used to run the different classification algorithms on the reduced set of features to identify which genes are differentially expressed in the different locations [18].
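The experiments were run in Weka [18]; purely as an illustration, the same one-versus-all setup with the three classifiers and 10-fold cross-validation could be sketched in Python with scikit-learn as follows (X and y are assumed to hold the selected-gene expression matrix and the location labels).

```python
# One-versus-all comparison of the three classifiers (our sketch, not the
# authors' Weka configuration). X: samples x selected-gene matrix;
# y: numpy array of labels ("left", "right", "bilateral").
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

def one_vs_all_accuracy(X, y, positive_class):
    y_bin = np.where(y == positive_class, 1, 0)   # one class vs the rest
    models = {
        "SVM-RBF": SVC(kernel="rbf"),
        "Naive Bayes": GaussianNB(),
        "Random forest": RandomForestClassifier(),
    }
    # 10-fold cross-validation, as used in the paper
    return {name: cross_val_score(clf, X, y_bin, cv=10).mean()
            for name, clf in models.items()}

# for cls in ("left", "right", "bilateral"):
#     print(cls, one_vs_all_accuracy(X, y, cls))
```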

Fig. 2. Accuracy of the different classifiers for the different locations.

3 Results and Discussion

The different classifiers produced varied results, as observed in Table 2 and in Fig. 2. The classifiers were compared based on accuracy and precision, as high accuracy with low precision is not a good criterion at all. Tables 3, 4, and 5 show the actual accuracy and precision for each classifier. The highest accuracy and precision among the different classifiers came from the SVM radial basis


function kernel (SVM-RBF) classifier, as it was able to separate the different locations with an accuracy of 99%. The random forest classifier also achieved high accuracy, while the naive Bayes classifier results were not good at all. Tables 6, 7 and 8 show the actual genes that were identified by SVM-RBF; these genes can be used to predict the location of the prostate cancer tumor from gene expression data. Throughout our model, 10-fold cross-validation was used. Our method identified 12 genes that are differentially expressed among the three different possible locations. Many of the genes identified in this work have been previously characterized and described to play some role in prostate cancer as well as other cancers. SNAI2 is a gene shown [19] to be silenced in prostate cancer and regulates neuroendocrine differentiation, metastasis-suppressor, and pluripotency gene expression.

Table 2. Accuracy for different classifiers.

  Location          Naive Bayes  Random forest  SVM-RBF
  Left vs all       88           93             99
  Bilateral vs all  82           90             99
  Right vs all      80           95             99
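For reference, since the text does not spell them out, the accuracy and precision in Tables 2, 3, 4 and 5 (and the sensitivity and specificity mentioned in Sect. 2.2) follow the standard confusion-matrix definitions:

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},

\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}.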

Table 3. Accuracy and precision for left vs rest classifiers.

  Classifier     Accuracy  Precision
  SVM RBF        99        97
  Naive Bayes    88        78
  Random forest  93        85

Table 4. Accuracy and precision for bilateral vs rest classifiers.

  Classifier     Accuracy  Precision
  SVM RBF        99        97
  Naive Bayes    82        78
  Random forest  90        85

Table 5. Accuracy and precision for right vs rest classifiers.

  Classifier     Accuracy  Precision
  SVM RBF        99        97
  Naive Bayes    80        78
  Random forest  95        85


Table 6. Genes that can predict tumors in the left part of the prostate.

  Ensembl              Gene
  ENSG00000135108.13   FBXO21
  ENSG00000139970.15   RTN1
  ENSG00000128609.13   NDUFA5
  ENSG00000172336.4    POP7

Table 7. Genes that can predict tumors in the right part of the prostate.

  Ensembl              Gene
  ENSG00000242574.7    HLA-DMB
  ENSG00000124193.13   SRSF6
  ENSG00000110321.14   EIF4G2

Table 8. Genes that can predict tumors in the bilateral parts of the prostate.

  Ensembl              Gene
  ENSG00000120697.7    ALG5
  ENSG00000279453.1    Z99129
  ENSG00000019549.7    SNAI2
  ENSG00000037757.12   MRI1
  ENSG00000178913.7    TAF7

The results shown in [20,21] indicate that increased TAF1/7 expression is associated with progression of human prostate cancers to the lethal castration-resistant state. The results reported in [22] found that tumor cell expression of HLA-DMB is associated with increased numbers of tumor-infiltrating CD8 T lymphocytes, and both are associated with improved survival in advanced serous ovarian cancer.

4 Conclusion

Identifying genes that can be used to predict tumor location in the prostate is an essential step in the actual treatment of prostate cancer, as it allows targeted medication towards the exact location of the disease without the need for a biopsy. Using next-generation sequencing and the power of machine learning, this paper proposes a new method for finding groups of genes that are differentially expressed among the different tumor locations in the human prostate. These groups of genes can potentially serve as biomarkers for prostate cancer laterality


since the literature shows that they are strongly associated with the disease, and with cancer in general. Future work includes applying the same methodology to other kinds of cancer; studying the regulation and the pathways in which the biomarker genes are involved may reveal more about their functionality. Further analysis based on literature, transcriptomics or interactomics databases, as well as wet-lab experiments, will be able to provide more information about the relevant genes that can potentially be used for diagnosis, treatment, and prognosis of the disease.

Acknowledgments. The research on data analysis has been partially funded by the Natural Sciences and Engineering Research Council of Canada, a Seeds4Hope grant from the Windsor-Essex County Cancer Centre Foundation, the University of Windsor, and a University of Windsor Undergraduate Research Grant.

References

1. Stewart, B.W., Wild, C.P., et al.: World Cancer Report 2014. Health (2017)
2. Parpart, S., Rudis, A., Schreck, A., Dewan, N., Warren, P.: Sensitivity and specificity in prostate cancer screening methods and strategies. J. Young Investig. (2007)
3. Stewart, R.W., Lizama, S., Peairs, K., Sateia, H.F., Choi, Y.: Screening for prostate cancer. In: Seminars in Oncology. Elsevier (2017)
4. Rosario, D.J., Lane, J.A., Metcalfe, C., Donovan, J.L., Doble, A., Goodwin, L., Davis, M., Catto, J.W.F., Avery, K., Neal, D.E., et al.: Short term outcomes of prostate biopsy in men tested for cancer by prostate specific antigen: prospective evaluation within ProtecT study. BMJ 344, d7894 (2012)
5. Singireddy, S., Alkhateeb, A., Rezaeian, I., Rueda, L., Cavallo-Medved, D., Porter, L.: Identifying differentially expressed transcripts associated with prostate cancer progression using RNA-Seq and machine learning techniques. In: 2015 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), pp. 1–5. IEEE (2015)
6. Alkhateeb, A., Rezaeian, I., Singireddy, S., Rueda, L.: Obtaining biomarkers in cancer progression from outliers of time-series clusters. In: 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 889–896. IEEE (2015)
7. Hamzeh, O., Alkhateeb, A., Rezaeian, I., Karkar, A., Rueda, L.: Finding transcripts associated with prostate cancer Gleason stages using next generation sequencing and machine learning techniques. In: Rojas, I., Ortuño, F. (eds.) IWBBIO 2017. LNCS, vol. 10209, pp. 337–348. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-56154-7_31
8. Yu, Y.P., et al.: Gene expression alterations in prostate cancer predicting tumor aggression and preceding development of malignancy. J. Clin. Oncol. 22(14), 2790–2799 (2004)
9. Artan, Y., Haider, M.A., Langer, D.L., van der Kwast, T.H., Evans, A.J., Yang, Y., Wernick, M.N., Trachtenberg, J., Yetik, I.S.: Prostate cancer localization with multispectral MRI using cost-sensitive support vector machines and conditional random fields. IEEE Trans. Image Process. 19(9), 2444–2455 (2010)
10. Sanz-Pamplona, R., Cordero, D., Berenguer, A., Lejbkowicz, F., Rennert, H., Salazar, R., Biondo, S., Sanjuan, X., Pujana, M.A., Rozek, L., et al.: Gene expression differences between colon and rectum tumors. Clin. Cancer Res. (2011)


11. GDC: Portal.gdc.cancer.gov (2017). https://portal.gdc.cancer.gov/. Accessed 15 Aug 2017
12. Estabrooks, A., Jo, T., Japkowicz, N.: A multiple resampling method for learning from imbalanced data sets. Comput. Intell. 20(1), 18–36 (2004)
13. Novakovic, J.: Using information gain attribute evaluation to classify sonar targets. In: 17th Telecommunications Forum TELFOR, pp. 24–26 (2009)
14. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27(8), 1226–1238 (2005)
15. Domingos, P., Pazzani, M.: On the optimality of the simple Bayesian classifier under zero-one loss. Mach. Learn. 29(2–3), 103–130 (1997)
16. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
17. Rodriguez-Galiano, F., Ghimire, B., Rogan, J., Chica-Olmo, M., Rigol-Sanchez, P.: An assessment of the effectiveness of a random forest classifier for land-cover classification. ISPRS J. Photogramm. Remote Sens. 67, 93–104 (2012)
18. Frank, E., Hall, M., Witten, I.: The WEKA Workbench. Online Appendix for Data Mining: Practical Machine Learning Tools and Techniques, 4th edn. Morgan Kaufmann (2016)
19. Esposito, S., Russo, V., Airoldi, I., Tupone, G., Sorrentino, C., Barbarito, G., Di Carlo, E.: SNAI2/Slug gene is silenced in prostate cancer and regulates neuroendocrine differentiation, metastasis-suppressor and pluripotency gene expression. Oncotarget 6(19), 17121–17134 (2015)
20. Tavassoli, P., Wafa, L.A., Cheng, H., Zoubeidi, A., Fazli, L., Gleave, M., Snoek, R., Rennie, P.S.: TAF1 differentially enhances androgen receptor transcriptional activity via its N-terminal kinase and ubiquitin-activating and -conjugating domains. Mol. Endocrinol. 24(4), 696–708 (2010). https://doi.org/10.1210/me.2009-0229
21. Bhattacharya, S., Lou, X., Hwang, P., Rajashankar, K.R., Wang, X., Gustafsson, J., Fletterick, R.J., Jacobson, R.H., Webb, P.: Structural and functional insight into TAF1-TAF7, a subcomplex of transcription factor II D. PNAS 111(25), 9103–9108 (2014). https://doi.org/10.1073/pnas.1408293111
22. Callahan, M.J., Nagymanyoki, Z., Bonome, T., et al.: Increased HLA-DMB expression in the tumor epithelium is associated with increased cytotoxic T lymphocyte infiltration and improved prognosis in advanced serous ovarian cancer. Clin. Cancer Res. 14(23), 7667–7673 (2008). https://doi.org/10.1158/1078-0432.CCR-08-0479

Concept of a Module for Physical Security of Material Secured by LIMS

Pavel Blazek1,2(✉), Kamil Kuca1, and Ondrej Krejcar1

1 Faculty of Informatics and Management, Center for Basic and Applied Research, University of Hradec Kralove, Hradec Kralove, Czech Republic
[email protected], {kamil.kuca,ondrej.krejcar}@uhk.cz
2 Faculty of Military Health Sciences, University of Defence Hradec Kralove, Hradec Kralove, Czech Republic

Abstract. Automation and miniaturization are the current trends, and the eHealth program of digitization of all parts of the healthcare system is concerned as well. Its main purpose is to improve patient care, information gathering and its provision within a given structure. Data protection constitutes an essential and integral part of the whole system at all levels, including labs. The following text is devoted to the protection of storage and manipulation of biomedical samples. An HW module connected to the lab IS is used for this purpose. The module is able not only to regulate access to the monitored material but also to send logs to a central database. The proposal is based on the requirements of minimal financial investment, ergonomic provision and the range of provided functions. By design, the Module can be inserted into the grid of sensors of the ambient lab system. Its interfaces allow it to process information from the connected sensors, to evaluate it and to create commands for the subsequent elements. The solution is demonstrated on a fridge since, in general, it is one of the most critical places in the lab.

Keywords: eHealth · Security · Ergonomics · Touch control · Laboratory

1 Introduction

Microelectronics is one of the most dynamically developing fields in the world today. Miniaturization of electronic chips, integration of various functions into one case and a low price are factors that have enabled the IoT (Internet of Things) phenomenon to expand into a wide spectrum of services and industry sectors. It is therefore obvious that these devices will find application in laboratories of various specializations. Laboratories are equipped not only with laboratory devices and computers for processing and evaluating the results of research activities, but also with other systems without which their operation would be impossible. Many of these autonomous systems secure laboratory processes, collect and evaluate online data from primitive interfaces, and in some cases save the data at defined times. Their physical separation from each other and their heterogeneous data do not enable simple and fast evaluation of information to be displayed in a proper context. Therefore, it is desirable to interconnect the systems.


Miniaturization, automation, modularization and the wide spectrum of IoT devices' functions call for replacing outdated equipment. An economic condition for such a step is a provable increase in the assets' fair value and a simplification of routine activities. These impacts depend on the intuitiveness of operating the sophisticated system, so that its interfaces can be handled simply and quickly by an operator.

2 Problem Definition

The European countries continuously fulfill the vision of a unified eHealth environment. At the level of national systems, stable environments are created which integrate data from the medical care information systems. As declared at the conference "Possibilities and Benefits of IT Trends in the Healthcare System – Health 4.0" [1], held in Prague on May 19, 2017, by connecting the border areas it is possible to share data which are continuously spread into the inland areas of cooperating countries. The exchange of patients' data is possible only by keeping security at all levels, and lab test results inherently belong to this data area. In addition to data protection, other potential security risks are concerned here. In biomedical laboratories, workers commonly handle materials and chemicals that are classified as dangerous. Closets and cool boxes are used for their storage, secured either at the level of physical access or by a mechanical lock. Furthermore, not every workplace registers the manipulation of such material at an adequate level, so that in the case of a loss or an excessive consumption it could be unequivocally determined which worker manipulated the material. Within the preparation of an experiment that contains a chemical synthesis, the needed raw materials and utilities are defined. In case more experiments are run at one time, it is obvious that the consumption of some raw materials in one experiment can cause a failure in another experiment. It is necessary that information about the actual quantity and consumption is registered by a responsible worker, who has the competence to replenish the needed material on time.

Traditionally, spreadsheets are often used in operation, either in spreadsheet programs or in applications based on local databases, from whose output the stored quantity of the monitored material can be calculated. Laboratories with a better organized operation use Laboratory Information Systems (LIS) or Laboratory Information Management Systems (LIMS) for their operational support, an integral part of which are modules that keep a registry of the monitored material. Mostly, these are fully dependent on information received from the operation by the laboratory technicians, who work with the material and who can distort the information accidentally or intentionally. Direct monitoring of the manipulation of the material leads to its deliberate utilization and minimizes the possibility of its misuse. Modern LIS/LIMS systems should be able to automate the monitoring and ordering of the material, and the technologies mentioned in the introduction should establish that status. Comfortable automatic or semiautomatic monitoring of the movement of persons and material should burden personnel minimally. Data collection from sensors does not require any manual data input. It is necessary to provide it with reliable connectivity to a


database where the data is gathered. On the other hand, monitoring which specific person manipulated the material may already require an interaction. Interfaces for communication with the system can be provided by various types of keyboards or displays. Many older applications use a set of multifunctional buttons and a line display, which shows only limited information. The user's activity then requires a specific time frame to carry out the needed sequence of actions, which can be evaluated as inadequately burdensome. In contrast, touchscreens, even though they are dimensionally larger, can provide more space for both displaying information and controls. For operating industrial application devices, Panel PCs equipped with either capacitive or resistive touchscreens are used. Each of the presented technologies has its benefits and its limitations, and not each of them is suitable for all conditions. On mobile devices for common use, the technology of capacitive touchscreens dominates thanks to its durability, whereas on industrial devices, where it is necessary to count on working in protective equipment including gloves, the resistive variant of displays dominates. The further text deals with a Module for interactive control, which is designed to be integrated with the LIMS and its environment. The Module is described in the first chapter. The second chapter clarifies the security aspects of the implementation of the Module for which it is determined. In the third chapter there is a short description of the HW part, which contributes to a better understanding of connecting the Module to the LIMS, which is treated in the fourth chapter. Finally, the fifth chapter is devoted to the formation of the graphical interface.

3 Related Works

Laboratory information systems in general are primarily dedicated to supporting the organization of routine activities in laboratories. There exist commercial and Open Source products that are modular, and from them it is possible to put together a solution suitable for the conditions of a given laboratory. For specific research areas it is also possible to obtain custom-made systems. Access to saved data is frequently based on just a few predefined levels of authorization, or some decentrally operated environments are not concerned with it at all. Systems are based on the principle of trust [2], that is, on sharing information about research activities in progress with no limitations. This model is supported by the fact that in the laboratories which have implemented such systems, there work trusted workers and teams that aspire to achieve a joint success. Unfortunately, this consideration relies on an estimation of character, and it is never possible to say how a specific worker will behave under pressure or in the situation of a forced dismissal. The development of a global society, industry spying and the connected security threats are further elements that complete the change of perspective on trust generally, not excluding the laboratories. In respect of the possible impacts, it is more desirable to prevent the threats than to deal with the consequences. Besides the fundamental principles for the construction and operation of a chemical or biological laboratory [3], the security elements and their development are emphasized more and more. An article [4] concerned with LIMS and limited access to data saved in a central database describes a possible solution on the application layer. Its general job


is to ensure the intellectual property of an organization against its divulgement and eventual misuse. Definitely, it is not a sufficient solution to avoid physical manipulation of the material. In manuals of standards in the category of physical security, procedures can be found which solve the basic monitoring of persons' movements and the permission of their entrance into defined areas. The systems for the support and security of laboratory operations are provided with sensors whose outputs can be used only in a specific system. This decentralization limits the possibility of information evaluation in the sense of data connection, and it is inapplicable for an instant decision. With IoT there comes a possibility to replace the current systems with intelligent module solutions, which offer a wider utilization and online evaluation.

4 Module for Secured LIMS

As said in the introduction, the developed environment of the secured LIMS [3] targets a higher security level through modern technological elements. Their objective is to create an interface for monitoring the movement of persons, physical material and even of saved and processed data. In a commercial environment, laboratories can purchase closets with mechanical or electronic locks that can be unlocked by a key or by entering a code. None of these variants is ready for connection with the security elements of the information system, although their purchase price is significantly lower than the designed solution. The model described further is illustratively simplified, and it is intended to use a Radio Frequency Identification (RFID) technology based device, e.g. a wristband, for clarifying a person's identification. The laboratory space is then provided with sensors interconnected with hardware elements limiting and monitoring the movement of material and persons. The authorizations regulating workers' accesses to the information systems, into the space of the laboratories and into the closets with stored chemicals are saved in a central database of persons. In practice, it is necessary to use multilevel protection in the system.

4.1 Hardware Description

The device consists of a unit frame, which is attached to the closet side together with an electronic lock. The upper part consists of a 7″ capacitive touchscreen, in whose plane an RFID scanner is situated in the common frame on the right. In the front edge under the display there is an integrated barcode scanner, which could be interchanged with another one that manages to read, for example, quick response (QR) codes. The lower part of the module consists of a slanted board with an integrated laboratory scale that can be equipped with another system to measure the quantity of stored materials. Because the device can be located in a space where workers have to use gloves because of the manipulation with dangerous materials, an easily removable stylus to control the touchscreen is fixed on the side (Fig. 1). A detailed description of the hardware is also included in a published article [5], which is devoted to the hardware solution.


Fig. 1. Concept of the module in the position with slanted laboratory scale

4.2 LIMS Connectivity

As displayed in Fig. 2, the program equipment of the Module can be integrated into the laboratory information system. There it is connected to the users' database, where their authorizations are defined; this controls access to the monitored closet. A second connection is to an audit database, which registers information about successful and unsuccessful attempts of workers to authorize and check out. A primary connection leads to a database of the materials stored in the workplace, where it is defined in which place a specific chemical is situated, including the place in a given closet. Through queries defined by a user's interaction, the Module gains information which it displays on the touchscreen, or it realizes a predefined action. Access to the database's data is active; therefore, it saves back actualized information about the replenishment or consumption of material. The database's content, in the sense of entering materials and their positions in the monitored closets, is edited from a logistics workstation.

Fig. 2. Interconnection of the module with LIMS
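To make the described interconnection more concrete, the following Python sketch (entirely illustrative; the paper specifies no schema, so all table, column and function names are our hypothetical assumptions) shows how a single check-out interaction could touch the three databases: the users database for authorization, the audit database for logging, and the materials database for the stored quantity.

```python
# Hypothetical check-out transaction for the Module (our sketch).
import sqlite3
from datetime import datetime

def check_out(conn, rfid_tag, barcode, amount_g):
    # Users database: is this RFID identity authorized for this closet?
    user = conn.execute(
        "SELECT id, authorized FROM users WHERE rfid_tag = ?",
        (rfid_tag,)).fetchone()
    ok = bool(user and user[1])
    # Audit database: log every attempt, successful or not.
    conn.execute(
        "INSERT INTO audit (user_rfid, barcode, action, success, ts) "
        "VALUES (?, ?, 'check_out', ?, ?)",
        (rfid_tag, barcode, int(ok), datetime.now().isoformat()))
    # Materials database: update the stored quantity on success.
    if ok:
        conn.execute(
            "UPDATE materials SET quantity_g = quantity_g - ? "
            "WHERE barcode = ?", (amount_g, barcode))
    conn.commit()
    return ok
```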


4.3 Graphical User Interface Development

The designed graphical user interface (GUI) of the Module has to be maximally simple and intuitive. It must not in any way limit a user or require overabundant actions. The layout of elements on the individual screens comes out of a block diagram (Fig. 3), which describes the basic control functions of the panel: users signing in to and out of the Module, detection of the status and location of stored material, and the possibility of its withdrawal and storing back in the right position.

Fig. 3. The block diagram of the basic module functions


The module operation has to be fast and intuitive for the user. For this reason it is necessary to eliminate as many controls (buttons) and peripherals (mouse) as possible. Therefore, the touchscreen is a clear choice. For effective user work, one has to be able to select a target precisely and avoid unwanted selection of contiguous targets. The application design has to take into account the technical options of the device as well as ergonomics [6], human physiology [7] and knowledge from cognitive psychology [8]. The 7″ display size comes out of the knowledge about the vision angle of the accommodated view, the considered distance of the display from the head of an operator, the quantity of information shown on the display screen and fine motor skills, and it is also given by the space needed for the installation of the integrated scanners. Besides, a larger unit would take up disproportionate space during installation. The vision angle of the accommodated view of a text is in the range of 5°–10°. At the considered distance of 1 m from the user's eyes, this results in a circular area with a minimum diameter of 19 cm and a maximum diameter of 38 cm. The touchscreen fulfils the function of both the displaying unit and the control panel. The distance of the Module from the user's head is determined so that the displayed information is legible and, at the same time, the operation is comfortable for adults of average height. The Module is placed on a closet 120–180 cm high with its top edge situated 100 cm from the floor, whereas the display angle is fixed by the Module construction to 40° from the horizontal plane. Shifting the displayed graphic and text information towards smaller details and a smaller font size leads to the reduction of the field of vision to the mentioned minimum area. In Fig. 4 the difference between the maximal and minimal areas is obvious.
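As a back-of-envelope check (ours; the paper does not show the computation), the quoted diameters are roughly consistent with the chord of the viewing cone, reading the 5°–10° range as a half-angle θ at distance L = 1 m:

d = 2L \tan\theta: \quad d(5^\circ) \approx 17.5\,\mathrm{cm}, \qquad d(10^\circ) \approx 35.3\,\mathrm{cm},

which is close to the rounded 19 cm and 38 cm above.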


Fig. 4. Change in the area proportions for different levels of detail

Also, fine motor skills play a significant role. Concentration on operating a small touchscreen, and the lower clarity given by the font size and the amount of displayed information, lead to reduced work practicability. The accuracy of selecting a target by a user's finger or stylus depends on its location on the screen. The highest accuracy is achieved in the middle, a little lower on the left or right edge, much lower on the top edge, and the lowest on the bottom edge. Using R95 [8], we can say that the achieved accuracy [9] in the centre is 7 mm, whereas in the lower corners it is 12 mm [6, 10–13]. Users are aware of this subconsciously, which implies slightly slower


selection of smaller targets in the corners or on the edges of the screen. In contrast, if a finger is already in contact with the screen, the movement accuracy is 0.1 mm [14, 15]. According to a laboratory study by MIT [16], the average finger width of an adult is 16–20 mm, which is more than the presented recommendation for mobile application development. A summary can be found on the official Ubuntu site [17], which presents the knowledge and recommendations for the development of GUIs for touch screens. Touch areas of the fingertip are 8–10 mm, whereas for the finger pads the touch areas are 10–14 mm on average. In contrast to the recommendations of Apple and Microsoft [18–20], which state that the touch area of a control should not be smaller than 9 mm with a mutual spacing of 2 mm, it presents a minimum size of 10 mm. The presented size relates to elements that are:

– often used,
– placed close to the screen edge,
– used in sequence (dialing telephone numbers).

For less frequently used elements, a square of 7 mm side with a mutual spacing of 1 mm from consecutive elements suffices. Furthermore, easy operation is influenced by these factors:

– young users have smaller fingers,
– older and more corpulent people have strong fingers,
– contrary to a mouse cursor, part of the display gets hidden while operating with a finger.

According to the standards [6], the presented optimum areas for touch operation are of bigger sizes. Whereas the aforementioned recommendations necessarily have to be implemented on displays with a diagonal of 5.5″ and smaller, for larger screens they do not apply so strictly. Following the aforementioned knowledge, the size of the basic set of controls designed for moving in the GUI was set to 20 × 20 mm with a mutual spacing of 1 mm at the screen edges. According to a study [21], buttons displayed larger than 22 × 22 mm would not bring a higher effect in hit accuracy, whereas reducing the size to 13 × 13 mm would decrease the speed of the panel users by more than 18%. This cannot be prevented in the case of entering text on a displayed keyboard while searching items; however, this function is minor for the users. The design, layout and size of the controls follow partly from the aforementioned knowledge, partly from the requirement to minimize the time consumed by the user's interaction with the module, and partly from the size of the 7″ display. The size of its real displaying area is 86 × 154 mm at a resolution of 800 × 480. The given parameters seem to be sufficient for the satisfactory display of the necessary text and graphic information and controls. The size of the central grid cells that represent the coordinates of the stored material in the shelves can be reduced down to a minimum size of 7 × 7 mm. For the seamless visualization of the shelf space it should not include more than 10 rows and 10 columns, which is technologically acceptable and compatible with the designed interface.
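A quick unit conversion (ours, not from the paper) confirms that these millimetre recommendations fit the stated panel: with an active area of 154 × 86 mm at 800 × 480 pixels, the density is roughly 5.2–5.6 px/mm.

```python
# Back-of-envelope check that the recommended control sizes fit the panel:
# 154 x 86 mm active area, 800 x 480 px resolution.
PX_PER_MM_X = 800 / 154   # ~5.2 px/mm along the long edge
PX_PER_MM_Y = 480 / 86    # ~5.6 px/mm along the short edge

def mm_to_px(w_mm, h_mm):
    return round(w_mm * PX_PER_MM_X), round(h_mm * PX_PER_MM_Y)

print(mm_to_px(20, 20))  # basic navigation button -> (104, 112) px
print(mm_to_px(7, 7))    # minimum grid cell       -> (36, 39) px
# A 10 x 10 grid of 7 mm cells spans ~70 mm and fits the 86 mm short edge.
```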


The control's area size is sufficient to avoid mistakes when touching it, with regard to its location at the edges of the screen. The number of simultaneously displayed elements never requires minimizing their size under the limit that would increase the error rate of hitting the target with a finger or stylus. Because of the supposed implementation of the system in a contaminated environment, where users have to wear protective equipment including gloves, the stylus is considered the primary pointer. Its touch area diameter corresponds with the finger pad. A graphic design of sample windows is presented in the following pictures. Figure 5 visualizes the general screen which the user sees after authentication. It contains a picture of the operated object, the name of the authenticated person, the place, the date and time, and three active buttons to enter the desired part of the system.

Fig. 5. Controls layout following the user’s authentication into the system

As is obvious from Fig. 6, after reading a barcode from a case surface while storing it into the closet, a highlighted shelf description and the specific space for its storage appear.

Fig. 6. Elements layout while inserting material into the closet

While searching for material in a closet, which is available via the "SELECT" button, a screen appears with a keyboard for entering characters, a window to display text and an application control button, as can be seen in Fig. 7.


Fig. 7. Screen for searching items of stored material

The touchscreen size corresponds with ergonomics knowledge [22–24]. It follows from the assumed distance between the user's head and the display area, the angle of the accommodated view of the displayed information, the size of the displayed information and the manipulation height of a hand on the display and laboratory scale.

5 Conclusion

The described solution is universal and does not have to be implemented only in laboratories. It can be used either autonomously or integrated into the IS. The secure LIMS environment, supported with hardware elements adequate to the described Module, is beneficial for biomedical data protection, for scientific work results and for the physical security of sensitive material. Fulfilment of this statement depends on the condition that its operation does not become a burden for users. The concept of a simple and functional GUI design is an essential step, as are the concepts of the database structure, the secure data storage and the secure interconnection of network elements. The impossibility of avoiding the control mechanisms given by the system fulfills the security requirement. Legislation puts increasing demands on information security in healthcare, especially on the protection of personal data. However, other sensitive data and physical material cannot be ignored. Costs incurred for security do not affect just clinical workplaces, but also labs in a given hierarchy. These are often expensive solutions which are accompanied by reconstructions of premises and operational restrictions. For investment protection, it is necessary to search for optimal solutions which comply with legal standards and improve working conditions. Our suggested and continuously developed system, which comprises the module, offers such a solution.

Acknowledgement. This work and the contribution were also supported by the project "Smart Solutions for Ubiquitous Computing Environments" FIM, University of Hradec Kralove, Czech Republic (under ID: UHK-FIM-SP-2018).


References

1. Possibilities and Benefits of IT Trends in Health Care - Health 4.0 – Prague Conference. http://www.cssi.cz/cssi/ehealth-40. Accessed 15 Dec 2017
2. Morisawa, H., Hirota, M., Toda, T.: Development of an open source laboratory information management system for 2D gel electrophoresis-based proteomics workflow. BMC Bioinf. (2006). https://doi.org/10.1186/1471-2105-7-430
3. ISO/IEC 17025 - General requirements for the competence of testing and calibration laboratories. https://www.iso.org/standard/39883.html. Accessed 12 Mar 2017
4. Blazek, P., Kuca, K., Jun, D., Krejcar, O.: Development of information and management system for laboratory based on open source licensed software. In: Núñez, M., Nguyen, N.T., Camacho, D., Trawiński, B. (eds.) ICCCI 2015. LNCS (LNAI), vol. 9330, pp. 377–387. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24306-1_37
5. Blazek, P., Krejcar, O., Jun, D., Kuca, K.: Device security implementation model based on internet of things for a laboratory environment. IFAC-PapersOnLine 49(25), 419–424 (2016). https://doi.org/10.1016/j.ifacol.2016.12.086. ISSN 24058963
6. ISO 9241-400:2007 - Ergonomics of human-system interaction, Part 9: Requirements for Non-keyboard Input Devices. https://www.iso.org/standard/30030.html. Accessed 11 Dec 2017
7. Human Factors and Ergonomics Society. https://www.hfes.org/. Accessed 11 July 2017
8. Touchscreen. https://en.wikipedia.org/wiki/Touchscreen#Ergonomics_and_usage. Accessed 11 Dec 2017
9. Circular error probable. https://en.wikipedia.org/wiki/Circular_error_probable. Accessed 11 Dec 2017
10. Hessey, S., Chen, S.H., White, C.: Beyond fingers and thumbs – a graceful touch UI. In: Marcus, A. (ed.) DUXU 2014. LNCS, vol. 8518, pp. 562–573. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07626-3_53
11. Cano, M.B., Perry, P., Ashman, R., Waite, K.: The influence of image interactivity upon user engagement when using mobile touch screens. Comput. Hum. Behav. 77, 406–412 (2017). https://doi.org/10.1016/j.chb.2017.03.042
12. Henze, N., Rukzio, E., Boll, S.: 100,000,000 taps: analysis and improvement of touch performance in the large. In: Proceedings of the 13th International Conference on Human Computer Interaction with Mobile Devices and Services, New York (2011). https://doi.org/10.1145/2037373.2037395
13. Parhi, P., Karlson, A.K., Bederson, B.B.: Target size study for one-handed thumb use on small touchscreen devices. In: Proceedings of the 8th Conference on Human-Computer Interaction with Mobile Devices and Services, pp. 203–210. ACM (2006). https://doi.org/10.1145/1152215.1152260
14. Harrison, C., Schwarz, J., Hudson, S.: TapSense: enhancing finger interaction on touch surfaces. In: Proceedings of UIST 2011, pp. 627–636, New York (2011). https://doi.org/10.1145/2047196.2047279
15. Zaiţi, I.-A., Vatavu, R.-D., Pentiuc, Ş.-G.: Exploring hand posture for smart mobile devices. In: Holzinger, A., Ziefle, M., Hitz, M., Debevc, M. (eds.) SouthCHI 2013. LNCS, vol. 7946, pp. 721–731. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39062-3_52
16. Dandekar, K., Raju, B.I., Srinivasan, M.A.: 3-D finite-element models of human and monkey fingertips to investigate the mechanics of tactile sense. J. Biomech. Eng. 125(5), 682–691 (2003)
17. UMEGuide/DesigningForFingerUIs. https://help.ubuntu.com/community/UMEGuide/DesigningForFingerUIs. Accessed 12 Dec 2017


18. iOS Human Interface Guidelines – Apple, 24 August 2014. https://developer.apple.com/ios/human-interface-guidelines/overview/design-principles. Accessed 12 Dec 2017
19. Metrics and Grids, Google. https://material.io/guidelines/layout/metrics-keylines.html. Accessed 12 Dec 2017
20. Design and UI for UWP apps. https://developer.microsoft.com/en-us/windows/apps/design. Accessed 12 Dec 2017
21. Sesto, M.E., Irwin, C.B., Chen, K.B., Chourasia, A.O., Wiegmann, D.A.: Effect of touch screen button size and spacing on touch characteristics of users with and without disabilities. Hum. Factors: J. Hum. Factors Ergon. Soc. 54(3), 425–436 (2012). https://doi.org/10.1177/0018720811433831
22. Conradi, J., Busch, O., Alexander, T.: Optimal touch button size for the use of mobile devices while walking. Procedia Manuf. 3, 387–394 (2015). https://doi.org/10.1016/j.promfg.2015.07.182
23. Tao, D., Yuan, J., Liu, S., Qu, X.: Effects of button design characteristics on performance and perceptions of touchscreen use. Int. J. Ind. Ergon. 64, 59–68 (2018). https://doi.org/10.1016/j.ergon.2017.12.001
24. Parhi, P., Karlson, A.K., Bederson, B.B.: Target size study for one-handed thumb use on small touchscreen devices. In: Proceedings of the 8th Conference on Human-Computer Interaction with Mobile Devices and Services, pp. 203–210. ACM (2006). https://doi.org/10.1145/1152215.1152260

scFeatureFilter: Correlation-Based Feature Filtering for Single-Cell RNAseq

Angeles Arzalluz-Luque2, Guillaume Devailly1, and Anagha Joshi1(B)

1 Division of Developmental Biology, The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian EH25 9RG, UK
[email protected]
2 Genomics of Gene Expression Laboratory, Centro de Investigación Principe Felipe (CIPF), Carrer de Eduardo Primo Yufera 3, 46012 Valencia, Spain

Abstract. Single cell RNA sequencing is becoming increasingly popular due to rapidly evolving technology, decreasing costs and its wide applicability. However, the technology suffers from a high drop-out rate and high technical noise, mainly due to the low starting material. This hinders the extraction of biological variability, or signal, from the data. One of the first steps in single cell analysis pipelines is, therefore, to filter the data to keep only the most informative features. This filtering step is often done by arbitrarily selecting a threshold. In order to establish a data-driven approach for the feature filtering step, we developed an R package, scFeatureFilter, which uses the lack of correlation between features as a proxy for the presence of high technical variability. As a result, the tool filters the input data, selecting the features where the biological variability is higher than the technical noise.

Keywords: Single cell · RNA sequencing · Feature selection

1 Introduction

Single cell RNA sequencing (scRNAseq) is used to explore the transcriptomes of cell populations and tissues to understand cellular differentiation processes and tissue heterogeneity in normal and disease states [1,2]. Due to the limited mRNA (low starting material) in a single cell, scRNAseq experiments are more prone to technical variability than bulk RNAseq. Notably, scRNAseq suffers from a high drop-out rate, leading to false negative zero-expression values. This phenomenon predominately affects lowly expressed transcripts, as highly expressed transcripts are more likely to be captured during library preparation. As a consequence, scRNAseq data is often filtered to remove noisy features (i.e. genes or transcripts) before downstream analysis. This filtering step is typically performed using spike-in controls when available, or using arbitrary rules (e.g., features with expression ≥ one TPM (Transcript per million) in at least 25% of


cells in [3], average read count ≥ one in [4], at least five reads in ≥ 10 cells in [5], at least five cells with FPKM (Fragments per kilobase per million) ≥ 10 in [6]). The choice of this threshold is highly data-dependent and, as the popularity of single-cell technologies increases, automated pipelines are required as an alternative to the manual determination of thresholds. We therefore propose a data-driven approach, implemented in an R package, scFeatureFilter, as an alternative to arbitrary threshold selection. Our approach estimates the technical variability in the data by using correlations across features to filter noisy features. scFeatureFilter takes as input an expression matrix in any expression unit normalised for library size (e.g. TPM or FPKM), and provides a filtered expression matrix as output. Furthermore, scFeatureFilter provides diagnostic plots which facilitate in-depth data exploration to guide, if required, a user-defined threshold choice. Our tool is available through GitHub (https://github.com/gdevailly/scFeatureFilter), and is submitted to Bioconductor. Complete documentation and a comprehensive vignette detailing usage (https://gdevailly.github.io/scFeatureFilterVignette.html) are included.

The method takes advantage of two properties of expression data:

– The most highly expressed genes or transcripts in a cell are relatively less affected by technical variability than lowly expressed ones.
– Gene expression is modular, i.e. genes form co-expression clusters.

In short, using the first property, we create a reference of highly expressed (low-noise) features. We then divide all other features into bins of decreasing median expression. Using the second property, we calculate the correlations between each bin of features and the reference features. Finally, comparing with randomized expression values allows us to filter noisy features.

scFeatureFilter first removes features that show zero-expression in over 75% (default value, can be user defined) of the cells. Next, the remaining features are ranked in decreasing expression order and binned into groups accordingly, where the first bin includes the most highly expressed features and will serve as the low-noise reference. The results of the binning process can be visualised as a mean - coefficient of variation (CV) plot, where CV and mean expression are typically anti-correlated (Fig. 1a, [7]). The next step is to calculate the correlation of each bin to the reference bin. First, the tool generates negative control bins (i.e. bins completely uncorrelated with the data) by shuffling the expression values of the features in the reference bin. As a positive control, the correlation of the reference bin with itself is used. Then, correlations between all features in each bin and all features in the reference bin, as well as with the control bin, are calculated. The distributions of the correlations generated for each bin (bin vs reference, and bin vs negative control) can be plotted for visualisation (Fig. 1b). The final step of the scFeatureFilter workflow is the integration of the correlation information for filtering high-noise features. We illustrate scFeatureFilter filtering using an example dataset from 32 single human embryonic stem cells (data from [8]). In Fig. 1b, the bin plots show that the correlations between feature bins and the positive low-noise reference (red lines) steadily decrease with mean expression, and become more


Fig. 1. A. Genes are binned according to their absolute mean expression level across cells. B. Distributions of the pairwise correlation coefficients between all genes in a bin and all genes in reference bin (red, thin lines), or with randomised control bins (blue, thick lines). Vertical dashed line represents the mean of each distribution. C. Mean pairwise correlation coefficient for each bin with reference bin (bars), and with randomised control bins (grey dots). Control bins are used to determine a background level (blue dashed line), to be used to set a threshold (red dashed line). (Color figure online)

similar to negative control (blue lines). The negative control bins are used to determine a background level of correlation (Fig. 1c, blue dashed line). By default, scFeatureFilter considers bins with a mean absolute correlation coefficient greater than twice this background to be significantly unaffected by noise (Fig. 1c, red dashed line).
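scFeatureFilter itself is an R/Bioconductor package; purely as an illustration of the workflow just described (zero-expression filter, binning by mean expression, correlation against the top bin, shuffled negative control, twice-background cut-off), a condensed Python sketch might look as follows. Parameter names and defaults here are ours, except the 75% zero cut-off and the twice-background rule taken from the text.

```python
# Our illustrative sketch, not the package source. expr: features x cells
# DataFrame of library-size-normalised values (e.g. TPM).
import numpy as np
import pandas as pd

def mean_abs_cor(a, b):
    """Mean absolute Pearson correlation over all feature pairs (a_i, b_j).
    Assumes no feature is constant across cells."""
    cors = np.corrcoef(a.values, b.values)   # rows of a stacked over rows of b
    return float(np.abs(cors[:len(a), len(a):]).mean())

def filter_features(expr, top_n=100, bin_size=1000, max_zeros=0.75, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Drop features with zero expression in more than 75% of the cells.
    expr = expr.loc[(expr == 0).mean(axis=1) <= max_zeros]
    # 2. Rank by mean expression; the top bin is the low-noise reference.
    expr = expr.loc[expr.mean(axis=1).sort_values(ascending=False).index]
    reference = expr.iloc[:top_n]
    # 3. Shuffled copy of the reference = negative control (background).
    control = reference.apply(lambda r: rng.permutation(r.values),
                              axis=1, result_type="broadcast")
    background = mean_abs_cor(reference, control)
    kept = [reference]
    for start in range(top_n, len(expr), bin_size):
        bin_df = expr.iloc[start:start + bin_size]
        # 4. Keep bins correlating with the reference above 2x background.
        if mean_abs_cor(bin_df, reference) > 2 * background:
            kept.append(bin_df)
        else:
            break  # bins are ordered by expression: stop at first noisy bin
    return pd.concat(kept)
```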


Fig. 2. scFeatureFilter analysis of 16 datasets from the conquerDB database [3] (top window size: 100 genes, other window size: 1000 genes). A. For the Engel 2016 dataset (see supplementary Table 1), scFeatureFilter recommends to keep only the first 100 genes. B. For one of the Shalek 2014 datasets, scFeatureFilter recommends to keep the first six bins of genes (5100 genes). C. For the Kumar 2014 dataset, scFeatureFilter recommends to keep all the bins of genes. D. Comparison between the number of features retained after scFeatureFilter (y-axis) and after the arbitrary threshold approach used by [3] (x-axis) for the 16 studied datasets.

scFeatureFilter uses individual dataset characteristics to make a filtering decision that is otherwise made arbitrarily in the absence of spike-in RNA. In addition, our tool can also estimate the quality of a given scRNAseq dataset. We applied scFeatureFilter to 16 datasets from the conquerDB database [3]. The number of filtered features was highly variable (Table 1), with results ranging from very noisy datasets where most features (genes in this case) were filtered (Fig. 2a) to a few datasets where no features were filtered (Fig. 2c). Importantly, most datasets were somewhere in between, proving the usefulness of the tool (Fig. 2b). Notably, the thresholds estimated by scFeatureFilter do not fully correspond to the arbitrary thresholds in [3] (Fig. 2d).


Fig. 3. scFeatureFilter analysis of a 48-replicate yeast bulk RNAseq dataset [9]. Gene expression values were obtained using kallisto [10] and processed with scFeatureFilter. A. The coefficient of variation stays low, independently of the mean expression level, for most of the genes. B. Correlations between each bin of features and the reference as well as negative control bins (reference bin: the top 100 highly expressed genes. Remaining bins: 500 genes). As opposed to single-cell results, the probability distribution of the resulting correlation coefficients is consistently different from the negative control correlations, even for bins of low expression. C. scFeatureFilter recommends to keep all the features for this dataset.

We also applied scFeatureFilter to a yeast bulk RNAseq experiment with 48 replicates ([9], Fig. 3). This data resembles the single cell data structure, but is free from the biases intrinsic to single-cell RNAseq. As anticipated, scFeatureFilter suggests keeping all the features for this bulk RNAseq dataset. We thus show that our method is well adapted to scRNAseq data.


Table 1. scFeatureFilter applied to some datasets of the conquerDB database [3]. n cells: number of cells in the study. scFF: number of genes kept by scFeatureFilter. arbi: number of genes kept using the arbitrary threshold of [3] (at least 25% of cells with TPM ≥ 1).

  Data set                                   Reference     Organism      n cells  scFF   arbi
  GSE48968-GPL13112 (PMID 24919153)          Shalek2014    Mus musculus  1378     11107  8087
  GSE48968-GPL17021-125bp (PMID 24919153)    Shalek2014    Mus musculus  935      3327   6902
  GSE63818-GPL16791 (PMID 26046443)          Guo2015       Homo sapiens  328      5366   11905
  GSE45719 (PMID 24408435)                   Deng2014      Mus musculus  291      18406  12538
  EMTAB2805 (PMID 25599176)                  Buettner2015  Mus musculus  288      9275   10330
  GSE52529-GPL16791 (PMID 24658644)          Trapnell2014  Homo sapiens  288      1130   13973
  GSE60749-GPL13112 (PMID 25471879)          Kumar2014     Mus musculus  268      30549  21241
  GSE74596 (PMID 27089380)                   Engel2016     Mus musculus  203      100    5651
  GSE60749-GPL17021 (PMID 25471879)          Kumar2014     Mus musculus  147      32428  27013
  GSE48968-GPL17021-25bp (PMID 24919153)     Shalek2014    Mus musculus  99       100    33845
  GSE77847 (PMID 27016502)                   Meyer2016     Mus musculus  96       100    8674
  GSE52529-GPL11154 (PMID 24658644)          Trapnell2014  Homo sapiens  84       100    14820
  GSE44183-GPL11154 (PMID 23892778)          Xue2013       Homo sapiens  29       2158   14761
  GSE41265 (PMID 23685454)                   Shalek2013    Mus musculus  18       0      8936
  GSE44183-GPL13112 (PMID 23892778)          Xue2013       Mus musculus  17       11129  14532
  GSE44183-GPL13112-trimmed (PMID 23892778)  Xue2013       Mus musculus  17       9434   14337

In conclusion, scFeatureFilter facilitates data-driven automatic feature selection in scRNAseq data. Notably, it can be easily integrated with other Bioconductor tools and workflows (e.g. scater by [11]).

Acknowledgements. We thank Dr. Anna Mantsoki for her invaluable help and input in the development process.

Funding. AJ is a Chancellor's fellow and the AJ lab is supported by institute strategic funding from the Biotechnology and Biological Sciences Research Council (BBSRC, BBSRC-BB/P013732/1-ISPG 2017/22 and BBSRC-BB/P013740/1-ISPG 2017/22). GD is funded by the People Programme (Marie Curie Actions FP7/2007-2013) under REA grant agreement No. PCOFUND-GA-2012-600181.

References

1. Tang, F., Barbacioru, C., Bao, S., Lee, C., Nordman, E., Wang, X., Lao, K., Surani, M.A.: Tracing the derivation of embryonic stem cells from the inner cell mass by single-cell RNA-seq analysis. Cell Stem Cell 6(5), 468–478 (2010)
2. Ramskold, D., Luo, S., Wang, Y.C., Li, R., Deng, Q., Faridani, O.R., Daniels, G.A., Khrebtukova, I., Loring, J.F., Laurent, L.C., Schroth, G.P., Sandberg, R.: Full-length mRNA-seq from single-cell levels of RNA and individual circulating tumor cells. Nat. Biotechnol. 30(8), 777–782 (2012)
3. Soneson, C., Robinson, M.D.: Bias, robustness and scalability in differential expression analysis of single-cell RNA-seq data. bioRxiv (2017)


4. Lun, A., McCarthy, D., Marioni, J.: A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Research 5(2122) (2016). [version 2; referees: 3 approved, 2 approved with reservations]
5. Stevant, I., Neirijnck, Y., Borel, C., Escoffier, J., Smith, L.B., Antonarakis, S.E., Dermitzakis, E.T., Nef, S.: Deciphering cell lineage specification during male sex determination with single-cell RNA sequencing. bioRxiv (2017)
6. Petropoulos, S., Edsgård, D., Reinius, B., Deng, Q., Panula, S.P., Codeluppi, S., Plaza Reyes, A., Linnarsson, S., Sandberg, R., Lanner, F.: Single-cell RNA-seq reveals lineage and X chromosome dynamics in human preimplantation embryos. Cell 165(4), 1012–1026 (2016)
7. Mantsoki, A., Devailly, G., Joshi, A.: Gene expression variability in mammalian embryonic stem cells using single cell RNA-seq data. Comput. Biol. Chem. 63, 52–61 (2016)
8. Yan, L., Yang, M., Guo, H., Yang, L., Wu, J., Li, R., Liu, P., Lian, Y., Zheng, X., Yan, J., Huang, J., Li, M., Wu, X., Wen, L., Lao, K., Li, R., Qiao, J., Tang, F.: Single-cell RNA-seq profiling of human preimplantation embryos and embryonic stem cells. Nat. Struct. Mol. Biol. 20(9), 1131–1139 (2013)
9. Gierlinski, M., Cole, C., Schofield, P., Schurch, N.J., Sherstnev, A., Singh, V., Wrobel, N., Gharbi, K., Simpson, G., Owen-Hughes, T., et al.: Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment. Bioinformatics 31(22), 3625–3630 (2015)
10. Bray, N.L., Pimentel, H., Melsted, P., Pachter, L.: Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34(5), 525–527 (2016)
11. McCarthy, D.J., Campbell, K.R., Lun, A.T.L., Wills, Q.F.: Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33(8), 1179–1186 (2017)

High-Throughput Bioinformatic Tools for Medical Genomics

NearTrans Can Identify Correlated Expression Changes Between Retrotransposons and Surrounding Genes in Human Cancer

Rafael Larrosa1, Macarena Arroyo2,3, Rocío Bautista4, Carmen María López-Rodríguez3, and M. Gonzalo Claros3(B)

1 Departamento de Arquitectura de Computadores, Universidad de Málaga, 29071 Malaga, Spain
[email protected]
2 Unidad de Gestión Clínica de Enfermedades Respiratorias, Hospital Regional Universitario de Málaga, Avda Carlos Haya s/n, Malaga, Spain
3 Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, 29071 Malaga, Spain
{macarroyo,b12loroc,claros}@uma.es
4 Plataforma Andaluza de Bioinformática, Universidad de Málaga, 29590 Malaga, Spain
[email protected]

Abstract. Recent studies using high-throughput sequencing technologies have demonstrated that transposable elements (TEs) seem to be involved not only in the onset of some cancers but also in cancer development. New dedicated tools have recently been designed to quantify the global expression of the different families of TEs from RNA-seq data, but identifying the particular, differentially expressed TEs would provide more profitable results. To fill this gap, here we present NearTrans, a bioinformatic workflow that takes advantage of gEVE (a database of endogenous viral elements) to determine differentially expressed TEs as well as the activity of genes surrounding them, in order to study whether changes in TE expression are correlated with nearby genes. A special requirement is that input RNA-seq reads must derive from normal and cancerous tissue from the same patient. NearTrans has been tested using RNA-seq data from 14 patients with prostate cancer, where two HERVs (HERVH-int and HERV17-int) and three LINE-1 elements (L1PA3, L1PA4 and L1PA7) were over-expressed at separate positions of the genome. Only one of the nearby genes (ACSM1) is over-expressed in prostate cancer, in agreement with the literature. Three (PLA2G5, UBE2MP1 and MIR4675) change their expression between normal and tumor cells, although the change is not statistically significant. The fifth (LOC101928437) is too distant from L1PA7 for a correlation to be likely. These results support that, in some cases such as the HERVs, TE expression can be governed by a genome context related to cancer, while in others, such as the LINEs, expression is less related to the genome context, even though the TEs are surrounded by genes potentially involved in cancer. Therefore, NearTrans seems to be a suitable and useful workflow to discover or corroborate genes involved in cancer that might be used as specific biomarkers for the diagnosis, prognosis or treatment of cancer.

Keywords: Transposon · Transposable element · Mobile element · RNA-seq · Workflow · Human · Cancer

1 Introduction

Currently, cancer is one of the leading causes of morbidity and mortality, with variable survival rates depending on the type of cancer. Recent studies have demonstrated that, besides the specific somatic or germinal mutations that drive tumor growth, mobile elements, also known as transposable elements (TEs), are involved in the onset of many human diseases, as well as in the development of established cancers. For example, in epithelial cancer, activation of TEs correlates with their mobilisation and genomic drift [15]. This is due to the fact that TEs are DNA molecules with the ability to move from one place to another in the genome, contributing to genomic instability and causing genetic disorders. Since nearly 50% of the human genome is composed of TEs, cells try to avoid the deleterious consequences of TE activity by inducing the inactivation of most TEs through large deletions, stop codons, and frameshift mutations within their open reading frames. It has recently been shown that some human endogenous viral elements (HEVEs) are still active and play a crucial role in placental development in various mammalian species [20].

The study of TEs using high-throughput technologies has been relegated due to the complexity of their measurement and processing, since a large number of copies of TEs are present throughout the genome. Earlier efforts led to tools such as RepEnrich [9] or TEtranscripts [14], which were designed to accurately quantify the global expression of the different families of TEs from RNA-seq data, the TE evaluation being based on RepBase. Another one, Lions [4], has been developed to quantitatively measure and compare the contribution of TE promoters to their expression in cancer. Recently, TEtools [16] has been designed to analyse TE expression using non-annotated and non-assembled genomes. But better than knowing the activity of a specific family of TEs, identifying the particular, differentially expressed TEs would provide more profitable results.

Our main objective is not the detection of TE jumps that can explain a disease, but the design of a tool that can identify which copy of the different TEs in the human genome presents differential expression when a normal cell becomes a cancer cell. To address this problem, gEVE [20], the database of endogenous viral elements (EVEs), including endogenous retroviruses, that was developed to investigate the function and evolution of TEs in mammalian genomes, seems more appropriate than RepBase. The great advantage of gEVE is that it provides nucleotide and amino acid sequences, genomic loci and functional annotations of all EVEs. In particular, this database describes 33 966 EVEs, 1782 gag elements, 1482 pro elements, 29 120 pol elements, and 1731 env elements in the human genome. As a result, we developed the bioinformatic workflow NearTrans, which is able to determine (i) differentially expressed TEs and (ii) the activity of genes surrounding them, to study whether changes in TE expression are related to nearby genes. As a biological model, prostate cancer was selected, a cancer where LINE-1 was already known to be over-expressed [9].

2 Materials and Methods

2.1 Input Data

Control (healthy prostate cells) and treatment (prostate cancer) RNA-seq reads from 14 patients from Shanghai Hospital were publicly available from BioProject PRJEB2449 [24]. The main feature of these data is that prostate cancer and nearby normal tissues were paired, since they were sequenced from the same individual. Information about EVEs in gEVE was downloaded from http://geve.med.u-tokai.ac.jp/ for the Hg38 human genome in GTF format. Structural information about the human genome Hg38 was downloaded from the UCSC web portal (http://genome.ucsc.edu/cgi-bin/hgTables). The sequences of the human genome assembly Hg38 were downloaded from NCBI (https://www.ncbi.nlm.nih.gov/assembly?term=GRCh38).

2.2 Implementation

The double task of NearTrans, related to the differential expression of TEs and the expression level of their nearby genes, was carried out as follows (Fig. 1), using the same tools for genes and TEs, and for normal and tumoral prostate, whenever possible:

1. Data quality control using SeqTrimNext (STN) [11] with the specific Illumina NGS configuration parameters to remove low-quality, ambiguous and low-complexity stretches, adaptors, organelle DNA, polyA/polyT tails, and contaminating sequences, while keeping the longest (at least 20 bp) informative part of the read.

2. Mapping of the pre-processed, useful reads to human genome hg38 using STAR v2.5 [10] with the following parameters (see the STAR help for the meaning of each parameter):

   STAR --genomeLoad NoSharedMemory --runThreadN 16 $arg read
     --outSAMstrandField intronMotif
     --sjdbGTFfile $REF/Annotation/Genes/genes.gtf
     --genomeDir $REF/Sequence/STARIndex/index_genome_STAR/
     --readFilesIn $file1 $file2 --outFilterMismatchNmax 6
     --outFileNamePrefix align_STAR_sorted
     --outSAMtype BAM SortedByCoordinate --twopassMode Basic
     --outReadsUnmapped None --chimSegmentMin 12
     --chimJunctionOverhangMin 12 --alignSJDBoverhangMin 10
     --alignMatesGapMax 200000 --alignIntronMax 200000
     --chimSegmentReadGapMax 3 --alignSJstitchMismatchNmax 5 -1 5 5

3. Use of the GTFs of hg38 and gEVE with Cufflinks (v.2.2.1) [25], followed by Cuffquant and then Cuffdiff, for assessing the expression levels of genes and TEs, respectively, between matched normal and cancer tissues, as described in [13]. cummeRbund v3.6 is then pipelined to analyse, explore, manipulate and plot (visualise) the results.

4. Selection of differentially expressed TEs using as filters an adjusted P < 0.05 and a |log2FC| > 1 (see the sketch after this list).

5. Location of nearby genes and their expression fold-change for every differentially expressed TE using BEDTools (v.2.26.0) [22], with the command

   bedtools closest -a TEs_file.bed -b genes_file.gtf -D a > nearest_genes.bed

   where TEs_file.bed contains the locations of the differentially expressed TEs in the human genome and genes_file.gtf contains the locations of all genes in the human genome.

Fig. 1. Flowchart illustrating tools and datasets provided and obtained by the NearTrans workflow.
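As an illustration of how steps 4 and 5 connect, the following Python sketch (our own illustration, not part of the published NearTrans code; the column names follow the usual Cuffdiff differential-expression output layout and the file names are hypothetical) selects the significant TEs and writes them as a BED file suitable for the bedtools closest call above.

import csv

# Step 4 filters: adjusted P (q_value) < 0.05 and |log2FC| > 1.
MAX_Q = 0.05
MIN_ABS_LOG2FC = 1.0

def select_de_tes(diff_path, bed_path):
    """Read a Cuffdiff-style table and write the significant TEs as BED."""
    with open(diff_path) as fin, open(bed_path, "w") as fout:
        for row in csv.DictReader(fin, delimiter="\t"):
            # "inf" (a TE silent in normal tissue, as for the two HERVs)
            # parses to float("inf") and so passes the fold-change filter.
            log2fc = abs(float(row["log2(fold_change)"]))
            if float(row["q_value"]) < MAX_Q and log2fc > MIN_ABS_LOG2FC:
                chrom, span = row["locus"].split(":")  # e.g. "chr16:20302-20322"
                start, end = span.split("-")
                fout.write(f"{chrom}\t{start}\t{end}\t{row['gene_id']}\n")

select_de_tes("TE_exp.diff", "TEs_file.bed")

The resulting TEs_file.bed is exactly the -a input expected by the bedtools closest command of step 5.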

3 Results

After preprocessing the raw RNA-seq datasets from the 14 prostate cancer patients in PRJEB2449, the percentage of useful reads ranges from 93.54% for patient ERR031029 to 96.16% for patient ERR031025. This clearly shows the high quality of those sequence reads, and that further analyses will not be affected by read quality. Mapping the useful reads resulted in a global 98.18% of the reads mapped on the human genome. Again, the high mapping rate confirms that results will not be affected by inadequate sequencing.


Fig. 2. Volcano plot where each TE is defined by its log2 fold-change (log2FC_TE) vs. the −log10 of its adjusted P-value (−log10 P_TE). Dots highlighted in red are those presenting a significant over-expression in prostate cancer cells. The TE corresponding to each red dot is indicated. (Color figure online)

The differentially expressed TEs are shown as red dots in Fig. 2. The three red dots with log2FC_TE closest to 0 are LINEs (L1PA3, L1PA4 and L1PA7), while the upper-right point corresponds to the two HERVs (HERVH-int and HERV17-int). All TEs were found to be over-expressed in prostate cancer: HERVs were not expressed at all in normal cells, but only in cancer cells (this is why they appear at the right border of Fig. 2 and as "Inf" in Table 1). On the contrary, LINEs (like many other TEs) were expressed in normal cells and their expression was significantly increased in tumor cells. The advantage of using gEVE is that now we know that, of the 20 699 described positions of LINE-1 in Hg38, 946 were strongly (although not significantly, adjusted P > 0.05) repressed, while 3 829 were over-expressed (but only three positions exhibit significant over-expression). The remaining 15 924 positions of LINEs can be considered unchanged, since they show a log2FC_TE of −0.06 with a standard deviation of 1.58. These results are highly compatible with the over-expression of LINE-1 already reported in prostate cancer [9], the main innovation of NearTrans being the positions of the LINE-1 copies whose over-expression is significant.

Bearing in mind the idea that a TE can only be expressed if its genomic context is not supercoiled (silenced), the chromosome region where each differentially expressed TE is located was screened for the closest gene. It can be seen that the distance between gene and TE is highly variable irrespective of the TE (Table 1). The strongest correlation was observed between the expression of HERV17-int and ACSM1, while the LINEs present the least significant correlations (adjusted P > 0.5). Interestingly, expression of MIR4675 (close to L1PA4) has not been found in the samples analysed. It seems that the HERVs are more dependent on the genetic context than the LINEs.

Table 1. Summary of differentially expressed retrotransposons in prostate cancer and their nearby genes.

TE (a)        Chr (a)  log2FC_TE  P_TE (b)     Gene           log2FC_g  P_g (b)      Distance (c)
HERV17-int    16       Inf        5.0 × 10−5   ACSM1          3.67      5.0 × 10−5   −5 045
HERVH-int     1        Inf        5.0 × 10−5   PLA2G5         0.33      0.25805      −16 585
L1PA3         16       3.51       1.0 × 10−4   UBE2MP1        −0.32     0.50         51 219
L1PA4         10       3.38       1.5 × 10−4   MIR4675        0         1            −5 728
L1PA7         X        2.81       1.5 × 10−4   LOC101928437   3.31      1            211 321

(a) TE: transposable element; Chr: chromosome. (b) P refers to the adjusted P-value of TEs (P_TE) and genes (P_g). (c) Distance, in nucleotides, from the TE to the nearby gene; negative values indicate upstream and positive values indicate downstream.

4 Discussion

The capabilities of the NearTrans workflow (Fig. 1) allowed the identification of five TEs (HERVH-int, HERV17-int, L1PA3, L1PA4 and L1PA7) with differential expression at separate positions of the human genome in prostate cancer (Fig. 2 and Table 1). In some cases (HERV17-int and L1PA7), TE over-expression appears to be correlated with high expression of their nearby genes (ACSM1 and LOC101928437, respectively). In most cases, the gene is not highly expressed or the correlation is not significant. Even though the statistical significance of these gene-TE correlations holds only in the case of HERV17-int/ACSM1, we will examine whether the nearby genes are related to prostate cancer, to know which TEs are over-expressed due to their proximity to expressed genes that have a role in the development of cancer.

Investigating the roles in prostate cancer of the genes identified by NearTrans close to the differentially expressed TEs (Table 1), we found that:

– ACSM1 has already been described as highly expressed when compared with normal prostate tissue [1–3,26], while its expression was decreased when the patients underwent androgen deprivation and antitumor chemotherapy with docetaxel [23]. It has also been described that silencing of ACSM1 in breast cancer decreases cellular invasion and progression, and it has therefore been identified as a potential biomarker for the prognosis of cancer [7].

– PLA2G5 has a variable expression profile and is involved in diseases of an immunological nature [5,8]. It was described as repressed in colon adenocarcinoma [19], acute myeloid leukemia [12] and the leukemic cell line Jurkat [17]. It has recently been related to the prostate, being highly expressed in normal epithelial cells while repressed by methylation in the diseased prostate [18]. In the NearTrans analysis, PLA2G5 has an adjusted P_g = 0.25 and a log2FC_g = 0.33 (Table 1), indicating that its expression change is neither large nor significant.

– L1PA3 is close to two pseudogenes: UBE2MP1, the ubiquitin conjugating enzyme E2 M pseudogene 1, is not apparently related to any disease, even though its upregulation was significantly involved in a pathway related to prostate cancer [21]. The HAVANA GTF for Hg38 predicts another, closer pseudogene with unknown function, VN1R68P, at only 26 nt.

– MIR4675 is a miRNA that has not been described in prostate cancer but is related to other types of tumors, including adenocarcinoma, colorectal carcinoma, non-small cell lung carcinoma and breast cancer, where its expression is inhibited with respect to normal tissue [6]. In our case, it has not been found in the samples.

– We consider that the unknown nature of LOC101928437, its distance to L1PA7 (211 321 nt) and the P_g = 1 completely discard any influence on the expression of L1PA7.

In conclusion, NearTrans seems to be a suitable and useful workflow for the detection of differentially expressed TEs and their nearby genes. It must be noted that NearTrans can be applied to any cancer or any other disease, provided that the same individual presents healthy and diseased tissues where the gene expression levels differ, and from which samples can be taken. The results presented regarding HERVs in prostate cancer suggest that they are expressed depending on the nature of the genome context. The over-expression of LINEs is compatible with previous reports [9], but NearTrans offers more detail since it also indicates which genome copy of the TE is significantly over-expressed. Interestingly, the TEs belonging to the LINE-1 family appeared as the most genomic-context independent, which supports the idea that this type of TE could be used to increase genome instability in cancer, even though the nearby genes could have a potential relation with cancer. We propose, then, that the study of TEs in cancer can help in the discovery or corroboration of genes involved in cancer, which can be used as specific biomarkers for the diagnosis, prognosis or treatment of cancer.

Acknowledgements. This work was funded by the Neumosur grants 12/2015 and 14/2016, and was also co-funded by the European Union through the ERDF 2014-2020 "Programa Operativo de Crecimiento Inteligente" to the RTA2013-00068-C03-02 of the Spanish INIA and MINECO. The authors also thankfully acknowledge the computer resources and the technical support provided by the Plataforma Andaluza de Bioinformática of the University of Málaga.

References

1. Alinezhad, S., Väänänen, R.M., Mattsson, J., Li, Y., Tallgrén, T., Tong Ochoa, N., Bjartell, A., Åkerfelt, M., Taimen, P., Boström, P.J., Pettersson, K., Nees, M.: Validation of novel biomarkers for prostate cancer progression by the combination of bioinformatics, clinical and functional studies. PLoS ONE 11(5), e0155901 (2016)
2. Alinezhad, S., Väänänen, R.M., Ochoa, N.T., Vertosick, E.A., Bjartell, A., Boström, P.J., Taimen, P., Pettersson, K.: Global expression of AMACR transcripts predicts risk for prostate cancer - a systematic comparison of AMACR protein and mRNA expression in cancerous and noncancerous prostate. BMC Urol. 16(1), 10 (2016)
3. Alinezhad, S., Väänänen, R.M., Tallgrén, T., Perez, I.M., Jambor, I., Aronen, H., Kähkönen, E., Ettala, O., Syvänen, K., Nees, M., Kallajoki, M., Taimen, P., Boström, P.J., Pettersson, K.: Stratification of aggressive prostate cancer from indolent disease—prospective controlled trial utilizing expression of 11 genes in apparently benign tissue. Urol. Oncol.: Semin. Orig. Investig. 34(6), 255.e15–255.e22 (2016)
4. Babaian, A., Lever, J., Gagnier, L., Mager, D.L.: LIONS: analysis suite for detecting and quantifying transposable element initiated transcription from RNA-seq. bioRxiv (2017)
5. Balestrieri, B., Arm, J.P.: Group V sPLA2: classical and novel functions. Biochimica et Biophysica Acta (BBA) - Mol. Cell Biol. Lipids 1761(11), 1280–1288 (2006)
6. Best, M.G., Sol, N., Kooi, I., Tannous, J., Westerman, B.A., Rustenburg, F., Schellen, P., Verschueren, H., Post, E., Koster, J., Ylstra, B., Ameziane, N., Dorsman, J., Smit, E.F., Verheul, H.M., Noske, D.P., Reijneveld, J.C., Nilsson, R.J.A., Tannous, B.A., Wesseling, P., Wurdinger, T.: RNA-seq of tumor-educated platelets enables blood-based pan-cancer, multiclass, and molecular pathway cancer diagnostics. Cancer Cell 28(5), 666–676 (2015)
7. Bockmayr, M., Klauschen, F., Györffy, B., Denkert, C., Budczies, J.: New network topology approaches reveal differential correlation patterns in breast cancer. BMC Syst. Biol. 7, 78 (2013)
8. Boilard, E., Lai, Y., Larabee, K., Balestrieri, B., Ghomashchi, F., Fujioka, D., Gobezie, R., Coblyn, J.S., Weinblatt, M.E., Massarotti, E.M., Thornhill, T.S., Divangahi, M., Remold, H., Lambeau, G., Gelb, M.H., Arm, J.P., Lee, D.M.: A novel anti-inflammatory role for secretory phospholipase A2 in immune complex-mediated arthritis. EMBO Mol. Med. 2(5), 172–187 (2010)
9. Criscione, S.W., Zhang, Y., Thompson, W., Sedivy, J.M., Neretti, N.: Transcriptional landscape of repetitive elements in normal and cancer human cells. BMC Genom. 15(583), 1–17 (2014)


10. Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., Gingeras, T.R.: STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1), 15–21 (2013)
11. Falgueras, J., Lara, A.J., Fernandez-Pozo, N., Canton, F.R., Perez-Trabado, G., Claros, M.G.: SeqTrim: a high-throughput pipeline for preprocessing any type of sequence reads. BMC Bioinform. 11(1), 38 (2010)
12. Fiancette, R., Vincent, C., Donnard, M., Bordessoule, D., Turlure, P., Trimoreau, F., Denizot, Y.: Genes encoding multiple forms of phospholipase A2 are expressed in immature forms of human leukemic blasts. Leukemia 23(6), 1196–1199 (2009)
13. Ghosh, S., Chan, C.K.K.: Analysis of RNA-seq data using TopHat and Cufflinks. Methods Mol. Biol. 1374, 339–361 (2016)
14. Jin, Y., Tam, O.H., Paniagua, E., Hammell, M.: TEtranscripts: a package for including transposable elements in differential expression analysis of RNA-seq datasets. Bioinformatics 31(22), 3593–3599 (2015)
15. Kassiotis, G.: Endogenous retroviruses and the development of cancer. J. Immunol. 192(4), 1343–1349 (2014)
16. Lerat, E., Fablet, M., Modolo, L., Lopez-Maestre, H., Vieira, C.: TEtools facilitates big data expression analysis of transposable elements and reveals an antagonism between their activity and that of piRNA genes. Nucleic Acids Res. 45(4), e17 (2017)
17. Menschikowski, M., Hagelgans, A., Kostka, H., Eisenhofer, G., Siegert, G.: Involvement of epigenetic mechanisms in the regulation of secreted phospholipase A2 expressions in Jurkat leukemia cells. Neoplasia 10(11), 1195–1203 (2008)
18. Menschikowski, M., Hagelgans, A., Nacke, B., Jandeck, C., Mareninova, O.A., Asatryan, L., Siegert, G.: Epigenetic control of group V phospholipase A2 expression in human malignant cells. Tumor Biol. 37(6), 8097–8105 (2016)
19. Mounier, C.M., Wendum, D., Greenspan, E., Fléjou, J.F., Rosenberg, D.W., Lambeau, G.: Distinct expression pattern of the full set of secreted phospholipases A2 in human colorectal adenocarcinomas: sPLA2-III as a biomarker candidate. Br. J. Cancer 98(3), 587–595 (2008)
20. Nakagawa, S., Takahashi, M.U.: gEVE: a genome-based endogenous viral element database provides comprehensive viral protein-coding sequences in mammalian genomes. Database 2016, baw087 (2016)
21. Ning, Q.Y., Wu, J.Z., Zang, N., Liang, J., Hu, Y.L., Mo, Z.N.: Key pathways involved in prostate cancer based on gene set enrichment analysis and meta-analysis. Genet. Mol. Res. 10(4), 3856–3887 (2011)
22. Quinlan, A.R., Hall, I.M.: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6), 841–842 (2010)
23. Rajan, P., Stockley, J., Sudbery, I.M., Fleming, J.T., Hedley, A., Kalna, G., Sims, D., Ponting, C.P., Heger, A., Robson, C.N., McMenemin, R.M., Pedley, I.D., Leung, H.Y.: Identification of a candidate prognostic gene signature by transcriptome analysis of matched pre- and post-treatment prostatic biopsies from patients with advanced prostate cancer. BMC Cancer 14(1), 977 (2014)
24. Ren, S., Peng, Z., Mao, J.H., Yu, Y., Yin, C., Gao, X., Cui, Z., Zhang, J., Yi, K., Xu, W., Chen, C., Wang, F., Guo, X., Lu, J., Yang, J., Wei, M., Tian, Z., Guan, Y., Tang, L., Xu, C., Wang, L., Gao, X., Tian, W., Wang, J., Yang, H., Wang, J., Sun, Y.: RNA-seq analysis of prostate cancer in the Chinese population identifies recurrent gene fusions, cancer-associated long noncoding RNAs and aberrant alternative splicings. Cell Res. 22(5), 806–821 (2012)


25. Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D.R., Pimentel, H., Salzberg, S.L., Rinn, J.L., Pachter, L.: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7(3), 562–578 (2012)
26. Väänänen, R.M., Lilja, H., Kauko, L., Helo, P., Kekki, H., Cronin, A.M., Vickers, A.J., Nurmi, M., Alanen, K., Bjartell, A., Pettersson, K.: Cancer-associated changes in the expression of TMPRSS2-ERG, PCA3 and SPINK1 in histologically benign tissue from cancerous versus non-cancerous prostatectomy specimens. Urology 83(2), 511.e1–511.e7 (2014)

An Interactive Strategy to Visualize Common Subgraphs in Protein-Ligand Interaction

Alexandre V. Fassio1, Charles A. Santana1(B), Fabio R. Cerqueira2, Carlos H. da Silveira3, João P. R. Romanelli3, Raquel C. de Melo-Minardi1, and Sabrina de A. Silveira2

1 Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais 31270-901, Brazil
{alexandrefassio,charlesas,raquelcm}@dcc.ufmg.br
2 Universidade Federal de Viçosa, Viçosa, Minas Gerais 36570-900, Brazil
{fabio.cerqueira,sabrina}@ufv.br
3 Universidade Federal de Itajubá, Itabira, Minas Gerais 35903-087, Brazil
{carlos.silveira,joaoromanelli}@unifei.edu.br

Abstract. Interactions between proteins and ligands play an important role in the biological processes of living systems. For this reason, the development of computational methods to facilitate the understanding of the ligand-receptor recognition process is fundamental, since this comprehension is a major step towards ligand prediction, target identification, lead discovery, among others. This article presents a visual interactive interface to explore protein-ligand interactions and their conserved substructures for a set of similar proteins. The protein-ligand interface is modeled as a bipartite graph, where nodes represent protein and ligand atoms, and edges depict interactions between them. Such graphs are the input to search for frequent subgraphs, which are the interaction patterns conserved over the datasets. To illustrate the potential of our strategy, we used two test datasets, Ricin and human CDK2. Availability: http://dcc.ufmg.br/~alexandrefassio/gremlin/.

Keywords: Visualization · Pattern · Protein · Ligand · Graph · Data mining

1 Introduction

Protein-ligand interactions, which refer to noncovalent bonding such as aromatic stacking, hydrogen bonding, hydrophobic forces and salt bridges, play a crucial role in molecular recognition. The conditions responsible for the binding and interaction of two or more molecules are a combination of conformational and physicochemical complementarity [1]. Hence, understanding, characterizing, and using knowledge of protein-ligand interactions can lead to target protein identification, prediction of hit as well as lead compounds and, ultimately, the determination of drug candidates [2,3].


Usual methods for in silico prediction of interactions between proteins and small molecules are classified into ligand-based and structure-based approaches. Ligand-based approaches generate or compare a candidate ligand to the known active molecules to identify compounds with similar bioactivity, whereas structure-based approaches use information about the target structure to sample candidate molecules in the target binding site [4]. Recently proposed techniques based on machine learning have reached success by taking the perspective of chemogenomics, which integrates attributes of drug compounds, proteins, and the known ligand-protein interactions into a unified mathematical framework [5]. A remarkable motivation to use such methods is that some classes of molecules can bind similar proteins, suggesting that the knowledge of some ligands for a target can be helpful to determine ligands for similar targets [6].

The mentioned approaches aim at predicting ligands or targets in a computational manner, which implies that there are some types of conserved patterns among similar ligands or receptors. In previous work we proposed GReMLIN [7], a strategy to search for conserved protein-ligand interactions in a set of related proteins, based on clustering and frequent subgraph mining, which is able to perceive structural arrangements relevant for the protein-ligand interaction. However, we realized that although our graph-based strategy is able to infer protein-ligand interaction patterns, it falls short at conveying such patterns, for two reasons: (i) it is difficult to understand a protein-ligand complex interface modeled as a graph without a visual representation of this graph, especially when we are considering many proteins at once; and (ii) given the interaction patterns, which are common substructures found in the protein-ligand interface, domain specialists would be interested in visualizing such interactions in the context of protein structures in a 3D molecule representation.

In this paper, we propose visGReMLIN, a visual interactive interface to explore protein-ligand interactions and their common substructures computed by GReMLIN. Interactive visualizations can be particularly interesting to represent complex and high-volume data as well as to support users in revealing tendencies and exceptions in those data. We provide a variety of filters to explore interactions and their patterns, as well as a text search to help users find residues/atoms in which they are particularly interested. Finally, visGReMLIN allows selecting an interaction pattern and highlighting it in the context of 2D interface graphs and in a 3D molecule viewer. visGReMLIN was implemented in Data-Driven Documents [8].

2 Problem Modeling

Given a dataset (a collection of similar proteins), the first step is to represent the interface between proteins and their ligands as graphs, in which atoms from proteins and ligands are nodes and the interactions between atoms are edges. Ligands with 6 or fewer atoms were considered crystallographic artifacts and were thus removed [2]. The interactions were computed through a Voronoi tessellation followed by a Delaunay triangulation [9,10], which is a cutoff-independent approach that avoids occlusion [11].
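As a toy illustration of that cutoff-independent idea (our own sketch, not the authors' implementation), SciPy's Delaunay triangulation yields candidate contacts as the edges of the tessellation, so no distance cutoff is needed to decide which atoms are neighbours:

import numpy as np
from scipy.spatial import Delaunay

def delaunay_contacts(coords):
    """Return index pairs of atoms sharing a Delaunay edge; `coords` is an
    (n_atoms, 3) array. Neighbourhood comes from the tessellation itself
    rather than from a distance cutoff, which avoids occluded 'contacts'."""
    pairs = set()
    for simplex in Delaunay(coords).simplices:  # each simplex: 4 vertices
        for i in range(4):
            for j in range(i + 1, 4):
                pairs.add(tuple(sorted((simplex[i], simplex[j]))))
    return pairs

# Toy example with six random "atoms".
rng = np.random.default_rng(42)
print(delaunay_contacts(rng.random((6, 3))))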


Thereafter, nodes were labeled as positively charged, negatively charged, aromatic, hydrophobic, donor, or acceptor based on previous works [12,13]. Ligand nodes were labeled by the Pmapper software from ChemAxon (Pmapper 5.3.8, 2010) at pH 7.0. According to predefined distance criteria and the types of the nodes (Table 1), edges were labeled as aromatic stacking, hydrogen bond, hydrophobic, repulsive, or salt bridge [14,15].

Table 1. Distance criteria used to compute interactions (in Å).

Interaction type     Atom types                     Min. distance   Max. distance
Aromatic stacking    2 aromatic atoms               1.5             3.5
Hydrogen bond        1 acceptor and 1 donor atom    2.0             3.0
Hydrophobic          2 hydrophobic atoms            2.0             3.8
Repulsive            2 atoms with the same charge   2.0             6.0
Salt bridges         2 atoms with opposite charge   2.0             6.0

At this point, the protein-ligand interfaces built from the proteins of the input dataset are represented as bipartite graphs in which every edge connects a protein node to a ligand node. A dataset is created with the graphs from the modeling step and a clustering analysis is performed on this dataset of graphs; more details about it can be found in [7]. Once the dataset of graphs is segmented into groups, a Frequent Subgraph Mining (FSM) experiment is conducted using the algorithm gSpan [16] to extract frequent subgraphs that represent the common substructures conserved in the protein-ligand interfaces of each group. The GReMLIN strategy workflow can be observed in Fig. 1.

Fig. 1. GReMLIN workflow to calculate protein-ligand interaction patterns.
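To sketch how such a bipartite interface graph could be assembled in code (our own illustration with toy atoms, not the GReMLIN implementation; only three of the Table 1 criteria are encoded here, since the repulsive and salt-bridge rules additionally require charge signs), consider:

import math
import networkx as nx

# Three of the Table 1 distance windows, in Å: (name, label1, label2, min, max).
CRITERIA = [
    ("aromatic stacking", "aromatic", "aromatic", 1.5, 3.5),
    ("hydrogen bond", "acceptor", "donor", 2.0, 3.0),
    ("hydrophobic", "hydrophobic", "hydrophobic", 2.0, 3.8),
]

def interface_graph(protein_atoms, ligand_atoms, contacts):
    """Build a bipartite graph: one node per atom, one labeled edge per
    contact (e.g. from the Delaunay step) that matches a distance criterion."""
    g = nx.Graph()
    for aid, atom in {**protein_atoms, **ligand_atoms}.items():
        g.add_node(aid, label=atom["label"],
                   bipartite=0 if aid in protein_atoms else 1)
    for p, l in contacts:
        pa, la = protein_atoms[p], ligand_atoms[l]
        d = math.dist(pa["xyz"], la["xyz"])
        for name, t1, t2, dmin, dmax in CRITERIA:
            if dmin <= d <= dmax and {pa["label"], la["label"]} == {t1, t2}:
                g.add_edge(p, l, interaction=name, distance=round(d, 2))
    return g

protein = {"p1": {"label": "donor", "xyz": (0.0, 0.0, 0.0)}}
ligand = {"l1": {"label": "acceptor", "xyz": (2.6, 0.0, 0.0)}}
print(interface_graph(protein, ligand, [("p1", "l1")]).edges(data=True))

Graphs of this form are exactly the kind of labeled bipartite input that a frequent subgraph miner such as gSpan consumes.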

2.1 Related Work

Some strategies focus on representing protein-ligand complexes as 2D diagrams in which the ligand and the interacting residues are depicted in a static way. LIGPLOT [19] is one of the first tools to take advantage of such a strategy, and today its results are used by other protein-ligand interaction tools (PLIC [20], PDBSum [21], PLI [22], PDBe [23]). PoseView [24] also focuses on the 2D representation of the interaction network between the complex partners. However, LIGPLOT and PoseView do not allow comparing protein-ligand interactions from different complexes. In [25], schematic diagrams for one or more complexes can be plotted in a static way, which allows comparison of interactions based on superposition of 3D structures to identify equivalences between ligands and interacting residues. Similarly, LigPlot+ [26] is a tool which generates interactive 2D protein-ligand interaction diagrams and allows superimposing these diagrams for a few complexes. Furthermore, it highlights conserved interactions for protein residues that are in equivalent 3D positions when the two structures are superposed, and shows the 3D visualization in molecular viewers. Although sc-PDB [23] is a curated database of binding sites, it also presents 2D diagrams generated by PoseView. Also, LigDig [27] is a web server for querying protein-ligand interactions which also provides 2D representations of these interactions, though this is not its main focus.

Here, we are interested in delivering a scalable interactive strategy to visualize protein-ligand interactions and their patterns across a dataset of similar proteins, allowing users to explore, filter, and highlight conserved interaction patterns in schematic 2D representations or in a 3D molecule viewer. Also, we provide visualizations of the frequencies of atoms and interactions of specific types in the dataset.

3 visGReMLIN

The main objective of visGReMLIN is to present a panoramic view of protein-ligand interaction patterns for a set of related proteins. We also want to permit users to explore these patterns in detail by size, physicochemical type of atoms, and interactions. Furthermore, the user can pose and solve general questions about conserved protein-ligand interactions.

Dataset Details: General information for a specific dataset of related proteins is displayed as a table (Fig. 2). In the column Group, we show the group from the clustering analysis. The graph identifiers, coupled with their PDB entries and chains belonging to the respective group, are provided in the PDB ids column. By clicking on the Points icon, 2D representations of all ligands from a group are exhibited and, by clicking on the Hexagon icon, graph representations of all protein-ligand interfaces from a group are provided. Users can click on PDB entries to be directed to the PDB web site.

Graph Patterns Table: Summary information regarding the frequent subgraph mining experiment is provided in this section. There are two kinds of tables, which we named Grouping columns and Simple table. Grouping columns displays tables with the number of nodes for each type of pattern (subgraph) and shows the number of types of patterns (subgraphs) with a specific size (number of nodes). This table is also segmented by Group from the clustering analysis and by Support value.

Fig. 2. Dataset details table.

Fig. 3. Simple table. (Color figure online)

In the Simple table (Fig. 3), we use a heatmap representation in which color is a pre-attentive attribute that encodes the frequency of subgraphs in shades of blue. This table helps in choosing an appropriate support value for mining protein-ligand interaction patterns. As the support increases, we lose some patterns. Therefore, by inspecting this table, we can choose the support value that helps us keep interesting patterns (those with many nodes).

Graph Patterns View: This section is organized in three subsections, which allow users to perform analytical interaction and navigation across the patterns. The subsection Options (Fig. 4) permits interacting with the patterns through filters. The common workflow is choosing a support value (based on the Graph patterns table) and exploring subgraphs using the filters below:

• Color nodes by: Nodes are colored according to atom type or molecule (one color for atoms from proteins and another color for atoms from ligands).
• Filter by atom type: Atoms of the selected type are highlighted. Possible types are acceptor, aromatic, donor, hydrophobic, negative and positive.
• Filter by interaction type: Interactions are highlighted. Possible types are aromatic stacking, hydrogen bond, hydrophobic, repulsive and salt bridge.
• Filter by group: Only graphs from the selected groups from the clustering analysis are displayed.
• Filter by vertex number: Only graphs with the selected number of vertices are displayed.
• Remove pattern selection: If a pattern was selected in the Pattern graphs section, this removes the selection, displaying all subgraphs according to the filters from the subsection Options.
• Search for a residue, ligand, or atom: Vertices from graphs that contain the residue/ligand/atom in the text search are highlighted.


In the Pattern graphs subsection (Fig. 5), users can navigate through the patterns, which are the frequent subgraphs for a dataset of graphs representing a set of related proteins. By clicking on a pattern, only subgraphs that contain such a pattern are displayed in the subsection Input graphs (Fig. 5). This subsection displays the graphs that represent interactions at the protein-ligand interface for a set of related proteins, according to the filters from the section Options and the pattern selected in Pattern graphs. By hovering the mouse over a graph, we provide the details below on demand for nodes and edges:

Fig. 4. Subsection options.

Fig. 5. Pattern Graphs and Input Graphs.

• Protein atoms: Name and number of the residue to which the atom belongs, atom name, chain, physicochemical type of the atom (Fig. 6(a)).
• Ligand atoms: Ligand name and number inside the PDB file, atom name, chain, physicochemical type of the atom (Fig. 6(b)).
• Interactions: Information about the connected atoms (residue or ligand name, number, and atom), physicochemical type of interaction and distance between the connected atoms in Å (Fig. 6(c)).

Fig. 6. Data displayed in the Input graphs section by passing the mouse over graphs for protein node (a), ligand node (b), and interaction (c), respectively.


In addition to the schematic 2D graph visualization, to support users in understanding the patterns in the context of the protein structure, we provide a 3D representation of the protein-ligand interaction graphs in a molecule viewer, accessible by clicking on the eye icon shown in Fig. 7. We also provide a general 2D visualization for ligands, shown in Fig. 8, accessible by clicking on the ligand name in any graph from the subsection Input graphs. This visualization allows users to compare and contrast ligands, revealing global trends among them for specific groups.

Fig. 7. Graph schematic 2D visualization and the corresponding 3D protein structure representation in the molecule viewer.

Fig. 8. Ligand structure in 2D representation.

4 Results and Discussion

We present interactions for two datasets, Ricin and human CDK2. Our datasets were downloaded from the PDB and comprise two sets of similar proteins from which we are interested in extracting protein-ligand interaction patterns. Further details about the datasets can be found on the visGReMLIN web site and in [15].

• CDK2: This dataset, which is based on the work in [17], comprises a specific protein for which several inhibitors are known. This same protein was crystallized with a variety of ligands, and the 73 experimental structures, with identical sequences, are available in the PDB. The authors described the development of highly potent and selective diaminothiazole inhibitors of CDK2 based on a single hit compound with weak inhibitory activity. We extracted from that work the experimentally determined binding site residues and atoms that are relevant for CDK2 interaction with ligands.

• Ricin: It is composed of 29 experimental structures from the PDB which have at least one ligand and 50% or more identity with ricin A chain (PDB 2AAI). We consider this dataset a more realistic one, as the sequences are not exactly the same, which is common, for instance, in a protein family. In [18], the authors co-crystallized ricin chain A with a transition state analogue inhibitor that mimics the sarcin-ricin recognition loop of the eukaryotic 28S rRNA. We extracted from that work the experimentally determined active site residues and atoms that are relevant for the interaction of ricin chain A with the 28S rRNA subunit.

4.1 CDK2 Pattern Analysis

Clustering analysis for the CDK2 dataset resulted in 15 groups. Visually inspecting the groups of the CDK2 dataset, in general we perceive that each one has a color signature. In other words, due to the use of the color pre-attentive attribute in nodes and edges, users can see at a glance that the atom and interaction types are similar inside each group. On the other hand, visually comparing different groups, we notice that each one has a different color signature, which means that different groups involve different atom and interaction types. This indicates that our groups meet an important requirement of clustering analysis: high intra-cluster similarity and low inter-cluster similarity. For example, Group 1 (Fig. 9) from the CDK2 dataset is very homogeneous, with the majority of its nodes being hydrophobic.

Fig. 9. Fraction of Group 1 from the CDK2 dataset. The majority of atoms are hydrophobic and the group has 1 pattern which occurs in all graphs of such group.

4.2 Ricin Pattern Analysis

For the Ricin dataset, the clustering analysis resulted in 21 groups. By comparing different groups, we note that they have different color signatures. Similarly to the CDK2 dataset, this indicates that the groups have high intra-cluster similarity and low inter-cluster similarity. For example, Group 1 has the majority of its nodes hydrophobic or aromatic/hydrophobic (Fig. 10). However, it is important to point out that Ricin is a smaller dataset than CDK2 and, even so, it resulted in a higher number of groups in the clustering

Fig. 10. Fraction of Group 1 from the Ricin dataset. The majority of its nodes are hydrophobic or aromatic/hydrophobic and the pattern occurs in 20 out of 28 graphs.


analysis, which indicates that, although CDK2 contains more graphs, they are more homogeneous than those from Ricin. The Ricin dataset contains 29 PDB entries that resulted in 197 protein-ligand interface graphs, while the CDK2 dataset is composed of 73 entries that resulted in 341 graphs. We consider CDK2 a controlled scenario, while Ricin is a more difficult and realistic scenario, as it involves a set of structures with different sequences, which happens, for instance, in protein families. It is reasonable to consider that the clustering analysis of Ricin is more challenging for data mining algorithms.

4.3 visGReMLIN Patterns Compared to Experimental Patterns

We compare the patterns computed through the visGReMLIN strategy with relevant patterns experimentally determined for CDK2 and Ricin according to [17] and [18], respectively, to verify whether our strategy is able to find the experimentally determined patterns. This is a qualitative analysis, as the residues determined as relevant in protein-ligand interactions in both studies do not represent interactions between a protein and all its possible ligands in the datasets used. However, we believe it is an interesting comparison, as these studies experimentally determined protein-ligand interactions established by Ricin and CDK2 with ligands that are very important for both proteins.

CDK2: We consider the set of binding site residues of CDK2 that interact with the 2 most potent sulfonamide analogue inhibitors developed in [17] as the experimentally determined patterns for the CDK2 dataset. Table 2 details these residues and which of them are detected using GReMLIN. Out of the 27 atoms relevant for CDK2 interaction experimentally determined in [17], our strategy found 21, which represents about 78% of such atoms. In Fig. 11(a), we provide an example where atoms CD, CE, CG from LYS33 are in the graph representing the interaction between ligand X42 and protein 3QTZ:A.

Table 2. Binding site residues of CDK2 interacting with the 2 most potent sulfonamide analogue inhibitors.

Residue   Atoms
PHE82     CE2 •, CZ •
ASP145    CB ✓, CG ✓, OD1 ✓, OD2 ✓
ASP86     N ✓, CB ✓, OD1 ✓
GLU81     O ✓
LYS33     CB ✓, CD ✓, CE ✓, NZ ✓
PHE80     CB ✓, CG ✓, CD2 ✓, CE2 •, CZ •
LYS89     CB ✓, CG •, CE ✓, NZ ✓, O ✓
LEU83     N ✓
HIS84     O ✓
GLN85     ×

✓ Residues/atoms found in patterns; • Found but not in patterns; × Not found.


Fig. 11. (a) Atoms CD, CE and CG from LYS33 (CDK2) and (b) atoms CD1, CD2, CE1, CE2, CG and CZ from TYR80 (Ricin) in the 2D graph and in a molecule viewer.

Ricin: In [18], the authors co-crystallized ricin chain A with a transition state analogue inhibitor that mimics the sarcin-ricin recognition loop of the 28S rRNA. We consider the active site residues and atoms that the authors highlight as relevant in the interaction of ricin with the 28S rRNA subunit as the experimentally determined patterns for the Ricin dataset. Out of the 23 atoms experimentally determined in [18] that are relevant for the Ricin interaction with a transition state analogue inhibitor (Table 3), our strategy found 21, which represents about 91%.

Table 3. Active site residues of Ricin chain A interacting with a cyclic transition state analogue inhibitor.

Residue    Atoms
GLY121     O ✓
TYR123     N ✓, CD2 ✓, CE2 ✓, CG ✓
ASN78      ND2 ✓
ARG180     NH1 ✓, NH2 ✓
TYR80      CD1 ✓, CD2 ✓, CE1 ✓, CE2 ✓, CG ✓, CZ ✓
VAL81      N ✓, O ✓
GLU177     OE2 ✓
ASP96      OD1 ✓, OD2 ✓
ASP100     OD2 ✓
ASP75      OD2 ✓
GLU208*    ×
ARG134*    ×

✓ Residues/atoms found in patterns; • Found but not in patterns; × Not found.

GLU208 and ARG134 do not directly interact with the ligand according to [18]. Therefore, it is expected that these residues do not appear in a pattern, as GReMLIN does not consider water-mediated interactions. Figure 11(b) shows atoms CD1, CD2, CE1, CE2, CG and CZ from TYR80 in the graph that represents the interface between protein 1IFU:A and ligand FMC.

5 Conclusion

In this paper, we propose an interactive tool to visualize conserved interactions between proteins and ligands for a set of related proteins. More specifically, we obtained a set of proteins from the PDB, computed the interactions at the protein-ligand interface, and modeled such interactions as a bipartite graph at the atomic level. Each vertex represents an atom from the protein or the ligand and each edge denotes an interaction between a protein and a ligand atom. We labeled vertices and edges with physicochemical properties of atoms and interactions, and used a strategy based on clustering analysis and frequent subgraph mining to compute conserved interactions at the protein-ligand interface. Our tool delivers the input graphs and the results of this strategy, allowing users to explore, filter, and understand conserved interaction patterns that are relevant for a variety of biological processes.

Our strategy is able to find 78% of the experimentally determined patterns for the CDK2 dataset and 91% of such patterns for the Ricin dataset in a totally automatic manner, using data available in the PDB, without any manual support from domain specialists.

As future work, we intend to implement a more general version of our tool that permits users to choose their own dataset of interest, to perform analysis of conserved patterns at the protein-ligand interface and to visualize and explore such patterns. Also, we intend to make the computation of interactions more robust, with the inclusion of angles between atoms, for example. Finally, we plan to systematically collect user feedback about visGReMLIN to improve our visualization strategy considering the needs of domain specialists.

Funding: This work has been supported by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) (grant number 477587/2013-5) and Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG).

References

1. Kahraman, A., Morris, R.J., Laskowski, R.A., Thornton, J.M.: Shape variation in protein binding pockets and their ligands. J. Mol. Biol. 368, 283–301 (2007)
2. Pires, D.E., et al.: Noise-free graph-based signatures to large-scale receptor-based ligand prediction. Bioinformatics 29(7), 855–861 (2013)
3. Medina-Franco, J.L., et al.: Chapter one - the interplay between molecular modeling and chemoinformatics to characterize protein-ligand and protein-protein interaction landscapes for drug discovery. Adv. Protein Chem. Struct. Biol. 96, 1–37 (2014)
4. Danishuddin, M., Khan, A.U.: Structure based virtual screening to discover putative drug candidates: necessary considerations and successful case studies. Methods 71, 135–145 (2015)
5. Liu, H., et al.: Improving compound-protein interaction prediction by building up highly credible negative samples. Bioinformatics 31(12), i221–i229 (2015)
6. Jacob, L., Vert, J.P.: Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics 24(19), 2149–2156 (2008)
7. Santana, C.A., et al.: GReMLIN: a graph mining strategy to infer protein-ligand interaction patterns. In: 2016 IEEE 16th International Conference on Bioinformatics and Bioengineering (BIBE), pp. 28–35. IEEE (2016)


8. Data-Driven Documents - D3. https://d3js.org/
9. Poupon, A.: Voronoi and Voronoi-related tessellations in studies of protein structure and interaction. Curr. Opin. Struct. Biol. 14(2), 233–241 (2004)
10. Senechal, M.: Spatial tessellations: concepts and applications of Voronoi diagrams. Science 260(5111), 1170–1173 (1993)
11. da Silveira, C.H., et al.: Protein cutoff scanning: a comparative analysis of cutoff dependent and cutoff free methods for prospecting contacts in proteins. Proteins: Struct. Funct. Bioinform. 74(3), 727–743 (2009)
12. Goncalves-Almeida, V.M., et al.: HydroPaCe: understanding and predicting cross-inhibition in serine proteases through hydrophobic patch centroids. Bioinformatics 28(3), 342–349 (2011)
13. Sobolev, V., Sorokine, A., Prilusky, J., Abola, E.E., Edelman, M.: Automated analysis of interatomic contacts in proteins. Bioinformatics 15(4), 327–332 (1999)
14. Mancini, A.L., et al.: STING contacts: a web-based application for identification and analysis of amino acid contacts within protein structure and across protein interfaces. Bioinformatics 20(13), 2145–2147 (2004)
15. Silveira, S.A., et al.: Revealing protein-ligand interaction patterns through frequent subgraph mining. In: Proceedings of the International Conference on Bioinformatics and Computational Biology, p. 50 (2015)
16. Yan, X., Han, J.: gSpan: graph-based substructure pattern mining. In: Proceedings of the IEEE International Conference on Data Mining (ICDM), pp. 721–724. IEEE (2002)
17. Schonbrunn, E., et al.: Development of highly potent and selective diaminothiazole inhibitors of cyclin-dependent kinases. J. Med. Chem. 56(10), 3768–3782 (2013)
18. Ho, M.C., et al.: Transition state analogues in structures of ricin and saporin ribosome-inactivating proteins. Proc. Natl. Acad. Sci. 106(48), 20276–20281 (2009)
19. Wallace, A.C., et al.: LIGPLOT: a program to generate schematic diagrams of protein-ligand interactions. Protein Eng. Des. Sel. 8(2), 127–134 (1995)
20. Anand, P., et al.: PLIC: protein-ligand interaction clusters. Database (2014)
21. De Beer, T.A., et al.: PDBsum additions. Nucleic Acids Res. 42(D1), D292–D296 (2013)
22. Gallina, A.M., et al.: PLI: a web-based tool for the comparison of protein-ligand interactions observed on PDB structures. Bioinformatics 29(3), 395–397 (2012)
23. Desaphy, J., Bret, G., Rognan, D., Kellenberger, E.: sc-PDB: a 3D-database of ligandable binding sites–10 years on. Nucleic Acids Res. 43(D1), D399–D404 (2014)
24. Stierand, K., Rarey, M.: Drawing the PDB: protein-ligand complexes in two dimensions. ACS Med. Chem. Lett. 1(9), 540–545 (2010)
25. Clark, A.M., Labute, P.: 2D depiction of protein-ligand complexes. J. Chem. Inf. Model. 47(5), 1933–1944 (2007)
26. Laskowski, R.A., Swindells, M.B.: LigPlot+: multiple ligand-protein interaction diagrams for drug discovery. J. Chem. Inf. Model. 51(10), 2778–2786 (2011)
27. Fuller, J.C., Martinez, M., Henrich, S., Stank, A., Richter, S., Wade, R.C.: LigDig: a web server for querying ligand-protein interactions. Bioinformatics 31(7), 1147–1149 (2014)

Meta-Alignment: Combining Sequence Aligners for Better Results

Beat Wolf1,2(B), Pierre Kuonen1, and Thomas Dandekar2

1 Institute of Complex Systems, University of Applied Sciences Western Switzerland, 1700 Fribourg, Switzerland
[email protected]
2 Biozentrum, University of Würzburg, 97074 Würzburg, Germany

Abstract. Analysing next generation sequencing data often involves the use of a sequence aligner to map the sequenced reads against a reference. The output of this process is the basis of many downstream analyses, and its quality is thus critical. Many different alignment tools exist, each with a multitude of options, creating a vast number of possibilities to align sequences. Choosing the correct aligner and options for a specific dataset is complex, and yet it can have a major impact on the quality of the data analysis. We propose a new approach in which we combine the output of multiple sequence aligners to create an improved sequence alignment file. Our novel approach can be used to either increase the sensitivity or the specificity of the alignment process. The software is freely available for non-commercial usage at http://gnaty.phenosystems.com/.

Keywords: Next generation sequencing · Sequence alignment · Algorithmics

1 Introduction

Sequence alignment is often one of the first steps in next generation sequencing (NGS) data analysis. A variety of methods exist to perform this task, all of them with their respective advantages and disadvantages (see [1]). Choosing the right method and its configuration options highly depends on the type of data to be analysed. This problem is not unique to sequence alignment, but also present in other steps of NGS data analysis, such as variant calling. Different methods have been published to improve variant calling by combining multiple variant calling pipelines, such as VariantMetaCaller [2] and Intersect-then-combine [3]. All those methods have in common that they only combine the results of the final variant calling step, but do not improve the actual sequence alignment. There are different methods, such as the GATK indel realigner [5], which do improve the quality of the alignment file, but only for the already aligned sequences, ignoring unaligned sequences.


We propose to combine the outputs of multiple sequence alignments, either from different tools or from the same tool with different options, and to create an improved sequence alignment file using the different alignments available for every sequence. Our method is based on the fact that sequence alignment at its core is a simple process. For every sequence produced by the sequencer, the position on the reference sequence with the lowest edit distance is searched. In the field of bioinformatics, multiple algorithms are known and well documented to perform this task, such as Smith-Waterman [7], Needleman-Wunsch [8] and Gotoh [6]. While not working exactly the same way (Smith-Waterman and Gotoh are local alignment algorithms, Needleman-Wunsch a global one), all of those algorithms allow finding the optimal alignment between two sequences based on a given alignment scoring matrix, indicating the penalties for mismatches or indels between the two sequences. Where the true difference between the aligners comes from is the fact that the search space during sequence alignment is very large, and they use heuristics to reduce this search space. The different aligners and their respective options use different heuristics, resulting in different alignments. The differences between aligners are not only present between different datasets, but also inside the same dataset, where one aligner or setting works better for a certain part of the data than another. This makes sequence alignment and the choice of the right tools difficult, which is why we propose a new approach to reduce this complexity.

Our novel approach, called meta-alignment, combines the strengths of the different aligners to produce a unified alignment out of multiple separate alignments. The output of those different aligners contains, for every aligned sequence, a location on the reference as well as a so-called CIGAR string. The CIGAR string indicates exactly how the sequence has been mapped against the reference at the indicated position, most notably whether and where indels have been detected. Aligners use a score matrix to determine what the best alignment for a sequence is. This score matrix indicates the score for a matching nucleotide, as well as the penalties for mismatches, indels and skipping parts of the sequence. Given a certain alignment of a sequence, the score can thus be calculated and compared between different aligners. This means that, given a certain score matrix, it is possible to rank the alignments given by different aligners to determine the most optimal alignment. Regardless of the specific method used by the different aligners, in the context of a (predefined or user-specified) score matrix the different alignments can be compared.

This is the core idea of our meta-alignment method. The output of multiple sequence alignments is taken, and then the best alignment, based on a given score matrix, is stored in a new alignment file. The hypothesis is that, using this approach, the quality of the alignment as well as the number of aligned sequences can be improved. This new method was developed as part of the doctoral thesis "Reducing the complexity of OMICS data analysis" [4] and extends the code base of the NGS data analysis pipeline GensearchNGS [9].
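To make the score comparison concrete, here is a minimal Python sketch of such a computation (our own illustration, not the tool's code; the scoring values are placeholders, and substitutions hidden inside 'M' operations are supplied as a separate count, e.g. derived from the NM tag):

import re

# Placeholder scoring values; the actual matrix is configurable in the tool.
MATCH, MISMATCH, GAP_OPEN, GAP_EXTEND, CLIP = 2, -4, -6, -1, -2

def alignment_score(cigar, n_mismatches=0):
    """Score an alignment from its CIGAR string under an affine-gap score
    matrix; `n_mismatches` counts substitutions hidden inside 'M' ops."""
    score = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in "M=":
            score += MATCH * length
        elif op == "X":
            score += MISMATCH * length
        elif op in "ID":            # affine gap: one open, then extensions
            score += GAP_OPEN + GAP_EXTEND * (length - 1)
        elif op == "S":
            score += CLIP * length
    # Substitutions were counted as matches above; correct for them.
    return score + n_mismatches * (MISMATCH - MATCH)

# Two candidate alignments of the same 100 bp read:
print(alignment_score("100M", n_mismatches=3))      # 3 substitutions
print(alignment_score("50M2I48M", n_mismatches=1))  # 1 substitution + 2 bp insertion

Whichever aligner produced the higher-scoring record wins, irrespective of the heuristics it used internally.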


In Sect. 2 we discuss implementation details as well as the test setup used to evaluate the validity of this approach. The results of those tests are presented in Sect. 2.1 with a discussion of the method in Sect. 3.

2 Method

To test the meta-alignment method, a stand-alone tool was developed in Java, based on the existing NGS data analysis pipeline GensearchNGS [9]. To interact with alignment files, we use the HTSJDK Java library (https://github.com/samtools/htsjdk). The tool does not use parallel or distributed computing, but focuses only on the improvement of the alignment quality. Figure 1 displays an overview of the proposed method to combine the output of multiple aligners through meta-alignment.

Fig. 1. Activity diagram of the proposed meta-alignment method

As shown in the figure, the process starts by aligning the raw FASTQ data with multiple sequence aligners. The sequences in the alignment files are then sorted by name. Those name-sorted BAM files are used as the input for the alignment merger, the heart of the meta-alignment method, which creates a merged


BAM file with the best alignment for every sequence. The following paragraphs look at this process in more detail.

The application takes multiple alignment files as input and creates a merged alignment file, using the best alignment for every sequence. The user can define multiple options for how the merging is done: not only can they specify as many alignments as they wish (which have to originate from the same source file), but they can also influence the way the best alignment is selected. In the standard mode, for every sequence the best scoring alignment is used for the final alignment. The goal is to improve the overall quality of the alignment and at the same time increase the total number of aligned sequences. A sequence only has to be aligned by one of the input aligners to be used in the final alignment. The consequence is that, although the total number of aligned sequences is expected to be higher than for any individual aligner, the number of low-quality alignments might also increase: one aligner might have rejected the alignment of a sequence for a good reason, whereas another one still aligned it, even with low quality. This is why the user has the option to require a minimal number of aligners to agree on an alignment for it to be used. This is expected to decrease the overall number of aligned sequences but, at the same time, to improve the quality of the alignment.

As specified, the input files are BAM files, usually coming directly out of the aligners. Those BAM files contain the alignments of all sequences. By default, they are either not sorted or sorted by position. To quickly find all the alignments of a particular sequence in the various alignment files provided by the user, the first step of meta-alignment is to sort the input files by sequence name. This step requires the sequences to have unique names, and those names need to be the same in the different input files. If all alignments used as input come from the same source raw data file, this requirement usually holds true.

After the creation of the name-sorted alignment files, all of them are opened in parallel and read sequence by sequence. As they are all sorted by name, finding all alignments for a specific sequence is very efficient. Once all the alignments of a particular sequence have been recovered, the best scoring alignment needs to be determined. This is done using a score matrix, which assigns a score to each nucleotide in the aligned sequence based on its relation to the reference. Different approaches exist for score matrices, but in the context of meta-alignment we use score matrices with affine gap penalties, just like the Gotoh [6] algorithm. This means that gap extensions have a different impact on the score than gap starts, because the addition of a new gap (either an insertion or a deletion) is less likely than the extension of an existing one. To build the score matrix, the following values are used:

– Match = α (when the sequence and the reference match)
– Mismatch = β (when the sequence and the reference mismatch)
– Gap start = γ (start of a deletion or an insertion)
– Gap extension = δ (extension of a deletion or an insertion)
– Skip = ε (part of the read discarded by the aligner)


For our tests, we used the following values for the score matrix: match = 2, mismatch = −3, gap start = −6, gap extension = −1, skip = −1. Those are example values based on the values used by various aligners, and they can be changed by the user.

To calculate the score of a specific sequence, we need to know how exactly it was mapped against the reference (and where). As mentioned earlier, every aligned sequence indicates its mapping against the reference through a CIGAR string. The CIGAR string indicates where the sequence aligns to the reference and where it contains insertions and deletions. One important piece of information is missing from the CIGAR string: whether a particular nucleotide matches the reference sequence or not. For this reason, every sequence needs to be mapped back (which means deletions have to be added and insertions cut out) to determine the exact score of every alignment. To determine this score, we count the matches (M), mismatches (MS), gap starts (GS), gap extensions (GE) and skips (S) for that sequence. Once determined, we can calculate the score as:

Score = α · M + β · MS + γ · GS + δ · GE + ε · S    (1)
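The production implementation is written in Java on top of HTSJDK, but the scoring and selection logic is compact enough to sketch. The following Python fragment, using the pysam library, illustrates Eq. (1) and the selection of the best-scoring record across several name-sorted BAM files. It is a minimal sketch under simplifying assumptions: each input BAM contains exactly one record per read, in identical name order and against the same reference; mismatches are approximated from the standard NM tag (edit distance minus inserted/deleted bases); and an indel of length n is charged one gap start plus n − 1 extensions. All function names are ours, not part of the tool.

import re
import pysam

# Example score matrix from the text: match, mismatch, gap start,
# gap extension and skip penalties.
ALPHA, BETA, GAMMA, DELTA, EPSILON = 2, -3, -6, -1, -1

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def score(read):
    """Score one alignment record according to Eq. (1)."""
    if read.is_unmapped or read.cigarstring is None:
        return float("-inf")
    aligned = gap_starts = gap_ext = skipped = indel_bases = 0
    for n, op in CIGAR_OP.findall(read.cigarstring):
        n = int(n)
        if op in "M=X":                # aligned bases (match or mismatch)
            aligned += n
        elif op in "ID":               # one gap start plus n - 1 extensions
            gap_starts += 1
            gap_ext += n - 1
            indel_bases += n
        elif op == "S":                # soft-clipped (skipped) part of the read
            skipped += n
    # NM is the edit distance; subtracting indel bases approximates the
    # number of mismatching nucleotides (assumes the aligner wrote NM).
    mismatches = read.get_tag("NM") - indel_bases if read.has_tag("NM") else 0
    matches = aligned - mismatches
    return (ALPHA * matches + BETA * mismatches + GAMMA * gap_starts
            + DELTA * gap_ext + EPSILON * skipped)

def merge(paths, out_path, min_agree=1):
    """Lockstep pass over name-sorted BAMs; write the best-scoring record
    for every read, optionally requiring min_agree aligners to agree."""
    bams = [pysam.AlignmentFile(p, "rb") for p in paths]
    with pysam.AlignmentFile(out_path, "wb", template=bams[0]) as out:
        for records in zip(*bams):
            best = max(records, key=score)
            if best.is_unmapped:
                continue
            agree = sum(1 for r in records if not r.is_unmapped
                        and r.reference_id == best.reference_id
                        and r.reference_start == best.reference_start)
            if agree >= min_agree:
                out.write(best)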

Once the score of every sequence is determined, the best scoring alignment is chosen and output to the final alignment file. If the user specified that multiple aligners have to agree on the position of a sequence, then after determining the best alignment, we count how many other alignments placed the sequence at the same position.

To test the impact of meta-alignment on the results of sequence alignment, we applied it to multiple datasets using three different aligners: BWA-MEM 0.7.12-r1039 [10], Bowtie 2.2.6 [11] and CUSHAW2 2.4.3 [12]. BWA-MEM and CUSHAW2 used the default alignment settings; Bowtie 2 was run with the --local option.

The first dataset was a collection of simulated datasets, created specifically to study the effect of meta-alignment over various degrees of data quality. For this purpose we simulated 11 datasets with increasing error rates. Each dataset consisted of 600 000 reads simulated from human chromosome 19. The errors ranged from no errors to 20% errors, with 50% of those errors being indels of length 1–3. The advantage of using simulated reads over real data is that the solution, i.e., the correct alignment for every sequence, is known. The results of this test are presented in Sect. 2.1.

The second dataset comes from the Genome Comparison & Analytic Testing (GCAT) project [13], which provides standardized test datasets. We used the 150bp-se-large-indel dataset to compare the three aligners individually against the result of our meta-alignment approach. Section 2.1 discusses the results of this test.

For all datasets, we tested the effect of meta-alignment in three modes. The first mode is the default mode, which takes the best alignment for a sequence from all input alignments. The two other modes required at least 2, respectively 3, input alignments to agree on the position of a sequence for it to be used in the output alignment.


The next section presents the results of these tests, and Sect. 3 gives a broader perspective on meta-alignment and its use cases.

2.1 Results

This section presents the results obtained by applying our meta-alignment approach to the previously described datasets. The section is split into two parts, the first discussing the custom datasets with variable error rates and the second using the datasets provided by GCAT.

Variable errors dataset. Using the first dataset collection, which consists of 11 datasets with gradually increasing error rates, we compared the precision and alignment rate of the three aligners as well as of the meta-alignment approach applied to the output of those same three aligners. All 11 datasets consist of 600 000 sequences originating from the human chromosome 19 (hg19). The error rates start at zero and increase up to 20%. By default, those errors are SNPs (single nucleotide polymorphisms), but a certain percentage of them is instead created as indels of length 1–3 bases. The indel rate ranges from 0% to 50%, resulting in 10% indels (50% of the SNPs are converted into indels) in the worst dataset. All 11 datasets have been aligned against the full human genome (hg19) by all three aligners. Afterwards, the meta-alignment algorithm was applied to the output of all three aligners for every dataset, once with the default setting and once requiring a minimum of 2, respectively 3, aligners to agree on the resulting alignment.

We calculated the precision as well as the alignment rate for all aligners and meta-alignment approaches. The precision is defined as P = c/(c + w), where P is the precision, c is the number of correctly aligned reads and w is the number of reads aligned at the wrong place. The alignment rate is defined as a = (c + w)/600000, where a is the alignment rate. Those two values give us an understanding of how much of the raw input data is actually aligned and to which degree the resulting alignment is correct.

We first look at the alignment rate of the different aligners and meta-alignment settings. In Fig. 2 we can see the alignment rate and how it evolves over various degrees of errors in the datasets. We can immediately observe that meta-alignment in its default mode (Meta 1) has consistently higher alignment rates than any other aligner. As the default setting of meta-alignment is to take the best alignment for every sequence out of all input alignments, this is the expected result. Any sequence aligned by only a single aligner will be found in the final alignment, and thus the total count of aligned sequences will always be higher than for any of the single aligners. Requiring two (Meta 2) or three (Meta 3) aligners to agree on the best alignment significantly reduces the number of aligned sequences. This again is an expected result, as in those two modes the goal is not to increase the total number of aligned sequences but to improve the quality of the aligned sequences.

To analyze the quality of the aligned reads, we look at the precision of the alignments. As already mentioned, the precision is defined by the percentage


Fig. 2. Alignment rate of the alignments ranging from 0% errors (Dataset 1) to 20% errors with 50% indels (Dataset 11). We can see that the higher the error rate, the fewer sequences are aligned by the different aligners.

of correctly aligned sequences among the aligned sequences only (unaligned sequences are not counted). Figure 3 shows the precision of the alignments over the same 11 datasets. The meta-alignment approaches that require two or even three alignments to agree are consistently the ones with the highest precision. Even on the dataset with the highest error rate, those two approaches reach 88.3% and 91.1%, respectively. In terms of precision, the default meta-alignment reaches 61.5% and is beaten by CUSHAW2 with 70.6%. It has to be noted, however, that the alignment rate of CUSHAW2 is only 21.7%, which is very low compared to the other aligners. What is interesting to see is that Meta 2 greatly improves the precision. The difference between Meta 2 and Meta 3 is much smaller, but comes with a big drop in the alignment rate, as seen in the previous analysis.

Those measurements show the tradeoff between the number of sequences aligned and their quality. A hypothetical aligner that aligns every sequence at a random position would reach a 100% alignment rate, no matter the quality of the input data, but its precision would be close to 0%. The output of meta-alignment cannot be better than the best aligner for each individual sequence, which limits the precision and alignment rates that can be reached. If, for example, no aligner aligns a certain sequence correctly, our method cannot correct this. With that in mind, the results of meta-alignment seem promising, especially on the datasets with high error rates.

GCAT alignment. The Genome Comparison & Analytic Testing (GCAT) project [13] provides free datasets and a website to test different aligners. To test the accuracy of alignments, the GCAT project generated a


Fig. 3. Precision of the alignments ranging from no errors (Dataset 1) to 20% errors with 50% indels (Dataset 11)

set of datasets for which the correct alignment is known for every sequence. To test an aligner, the website provides the raw FASTQ files of every dataset. The raw data can then be aligned locally against the human reference genome hg19 with any aligner, as long as a BAM file is generated. This BAM file can then be uploaded to the website, which automatically creates the statistics of the correctly and wrongly aligned reads.

We decided to test the meta-alignment method on one of the provided datasets, called 150bp-se-large-indel. This dataset contains single-end sequences with a length of 150 bp and a large number of indels, although the GCAT project does not specify how many indels were added. We used the same testing procedure as for our simulated data with the same aligners (BWA-MEM, Bowtie 2 and CUSHAW2) and also tested our meta-alignment approach with 1, 2 or 3 alignments that had to agree for an alignment to be used.

Table 1 contains the details of the performed tests. We can again see the BWA aligner obtaining very good results, having the highest precision and alignment rate when used alone. Similar to our previous test, we can see how meta-alignment is able to improve the quality of the alignment. The first meta-alignment configuration, Meta 1, is able to improve both the precision and the alignment rate compared to all other aligners. The improvement, especially in terms of alignment rate, is rather minor though, which indicates that all three source aligners had trouble with a similar set of sequences. Meta 2 and Meta 3, on the other hand, greatly increase the precision of the alignments compared to a single aligner. Especially the Meta 2 configuration, which requires 2 out of the 3 single aligners to agree, shows a good compromise between a decreased alignment rate and improved precision.

When looking at the time required to perform the meta-alignment on this dataset, we get the following values using a quad-core Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50 GHz: BWA-MEM, 13 min 13 s; Bowtie 2, 29 min 51 s;


Table 1. Meta-alignment comparison table for the 150bp-se-large-indel dataset

Aligner     Total       Correct     Wrong     Not aligned   Precision   Alignment rate
BWA-MEM     7 878 949   7 779 572    99 477    84 549       98.74%      97.69%
Bowtie 2    7 878 771   7 604 671   274 064    84 727       96.52%      95.49%
CUSHAW2     7 868 183   7 650 416   217 767    95 315       97.23%      96.07%
Meta 1      7 878 987   7 781 337    97 614    84 511       98.76%      97.71%
Meta 2      7 787 802   7 737 157    50 645   175 696       99.35%      97.16%
Meta 3      7 507 014   7 487 040    19 974   456 484       99.73%      94.02%

CUSHAW2, 15 min 25 s. The times for all three meta-alignment approaches are the same: 32 min 53 s. We can observe that the time required to perform meta-alignment is significant compared to the benefits of the method. As no focus was put on optimizing the method, there is still a lot of room for improvement in that regard. The way the method works, it lends itself ideally to being distributed, something we want to explore in the future.

3 Conclusion

As shown through our tests, the meta-alignment approach shows great potential in certain use cases, especially when working with high error rates in the data to align. The current prototype has been published as free software as part of the GNATY project suite at http://gnaty.phenosystems.com. While several limitations still need to be addressed, like the lack of paired-end support, it is already a useful tool to improve the quality of sequence alignments.

Our approach is complementary to other methods that improve sequence alignment and other downstream analyses like variant calling. It is indeed possible to combine our approach with tools like the GATK indel realigner or VariantMetaCaller, as our method uses standard BAM files as input and outputs standard BAM files again. Because of this, it is easy to integrate our method into existing pipelines. It is also important to note that the meta-alignment approach described is independent of the aligners used.

The initial implementation of our method still has one downside, which is the lack of paired-end data support. This limitation is not inherent to our approach, and support for paired-end data will be added in the future. Even with this limitation, the method is already useful for many of the newer sequencing approaches, like MinION [14], which produce reads that are not paired. Other than the lack of paired-end support, the performance of the approach is a major obstacle. The main performance bottleneck is the multiple alignments that have to be performed with the different aligners/configurations, but those alignments can easily be done in parallel, thus reducing the overall time for the analysis. Further tests will be conducted to better assess the impact of our method on downstream analyses like variant calling.


Acknowledgements. The authors thank Phenosystems SA for the opportunity to release part of their software for free.

Conflicts of Interest. The authors have no conflict of interest to declare.

References

1. Li, H., Homer, N.: A survey of sequence alignment algorithms for next-generation sequencing. Brief. Bioinform. 11(5), 473–483 (2010). https://doi.org/10.1093/bib/bbq015
2. Gézsi, A., Bolgár, B., Marx, P., Sarkozy, P., Szalai, C., Antal, P.: VariantMetaCaller: automated fusion of variant calling pipelines for quantitative, precision-based filtering. BMC Genom. 16(1), 875 (2015). https://doi.org/10.1186/s12864-015-2050-y
3. Callari, M., Sammut, S.-J., De Mattos-Arruda, L., Bruna, A., Rueda, O.M., Chin, S.-F., Caldas, C.: Intersect-then-combine approach: improving the performance of somatic variant calling in whole exome sequencing data using multiple aligners and callers. Genome Med. 9(1), 35 (2017). https://doi.org/10.1186/s13073-017-0425-1
4. Wolf, B.: Reducing the complexity of OMICS data analysis. Universität Würzburg (2017)
5. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., DePristo, M.A.: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010)
6. Gotoh, O.: An improved algorithm for matching biological sequences. J. Mol. Biol. 162(3), 705–708 (1982). https://doi.org/10.1016/0022-2836(82)90398-9
7. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981)
8. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48(3), 443–453 (1970). https://doi.org/10.1016/0022-2836(70)90057-4
9. Wolf, B., Kuonen, P., Dandekar, T., Atlan, D.: DNAseq workflow in a diagnostic context and an example of a user friendly implementation. BioMed Res. Int. 2015, 403497 (2015)
10. Li, H.: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997 [q-bio.GN] (2013)
11. Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9(4), 357–359 (2012). https://doi.org/10.1038/nmeth.1923
12. Liu, Y., Schmidt, B.: Long read alignment based on maximal exact match seeds. Bioinformatics 28(18), i318–i324 (2012). https://doi.org/10.1093/bioinformatics/bts414
13. Highnam, G., Wang, J.J., Kusler, D., Zook, J., Vijayan, V., Leibovich, N., Mittelman, D.: An analytical framework for optimizing variant discovery from personal genomes. Nat. Commun. 6, 1–6 (2015). https://doi.org/10.1038/ncomms7275
14. Mikheyev, A.S., Tin, M.M.Y.: A first look at the Oxford Nanopore MinION sequencer. Mol. Ecol. Resour. 14(6), 1097–1102 (2014). https://doi.org/10.1111/1755-0998.12324

Exploiting In-memory Systems for Genomic Data Analysis

Zeeshan Ali Shah1,2, Mohamed El-Kalioby1,2, Tariq Faquih1,2, Moustafa Shokrof2, Shazia Subhani1,2, Yasser Alnakhli2, Hussain Aljafar2, Ashiq Anjum3, and Mohamed Abouelhoda1,2(B)

1 King Faisal Specialist Hospital and Research Center (KFSHRC), Riyadh, Saudi Arabia
[email protected]
2 Saudi Human Genome Program, King Abdulaziz City for Science and Technology (KACST), Riyadh, Saudi Arabia
3 Department of Computing and Mathematics, University of Derby, Derby, UK

Abstract. With the increasing adoption of next generation sequencing technology in the medical practice, there is an increasing demand for faster data processing to gain immediate insights from the patient’s genome. Due to the extensive amount of genomic information and its big data nature, data processing takes a long time and delays are often experienced. In this paper, we show how to exploit in-memory platforms for big genomic data analysis, with a focus on the variant analysis workflow. We determine where different in-memory techniques can be used in the workflow and explore different memory-based strategies to speed up the analysis. Our experiments show promising results and encourage further research in this area, especially with the rapid advancement in memory and SSD technologies.

Keywords: Bioinformatics · Next generation sequencing · Big data · In-memory processing

1 Introduction

Genomics-based medicine, referred to as personalized or precision medicine, has become an important component of the healthcare system. This is basically due to the recent advancements in next generation sequencing (NGS) technology, which reduced the cost and time of reading the genome. NGS is currently used in the clinic to find variants (mutations) related to a disease in order to improve diagnosis and prognosis, or to find optimized treatment plans. For computational scientists, the wide use of NGS in the clinic has introduced new challenges. Clinical grade data analysis requires more optimized algorithms to reach reliable results, which accordingly increases the running time. Moreover, to reach a list of variants with the necessary information for the clinic, a sophisticated computational workflow of many software tools has to be used.


The input to this workflow is the list of NGS reads and the output is the list of significant annotated variants related to the disease.

The output of an NGS machine is a large set of short reads (DNA fragments). The number of these reads depends on the technology and the model of the NGS instrument. For Ion technology, one expects around 80 million reads per run for the Ion Proton model. For Illumina technology, one expects up to 20 billion reads per run for the recent NovaSeq model. Processing such a huge number of reads entails massive I/O operations, especially when a workflow of multiple independent programs is used. This causes two problems: first, a considerable fraction of the analysis time is spent reading and writing data; second, such an intensive I/O mode of operation reduces the lifetime of the hard disks, which interrupts operation and increases operational costs.

To solve these problems, it is important to avoid reading from and writing to the mechanical hard disk and to keep the processing in RAM as much as possible. Fortunately, the recent advancements in hardware and computer architecture coupled with modern operating systems make this possible. Currently, one can find commercially available servers at an affordable price with a RAM size reaching tens of terabytes. Parallel to this, we observe continuous advancement on the software side as well. One can find many options for RAM-resident data structures and in-memory database systems, where issues like fault tolerance, efficient synchronized read-write, and data integrity are addressed.

In this paper, we discuss how in-memory techniques can be used in clinical bioinformatics workflows. We focus on the variant analysis workflow, which is the most widely used workflow in the clinic. We explain the mode of use of in-memory systems at each step of the workflow, either within the program itself or to pass the data from one tool to the next. We also show by experiment that the use of in-memory systems at different steps indeed leads to improved running times.

This paper is organized as follows: In the following section, we shortly review different technologies like high-RAM computers and new storage technology. In the same section, we briefly review in-memory data systems and some of their uses in bioinformatics. In Sect. 3, we discuss the variant analysis workflow and its basic steps. In Sect. 4, we show how in-memory techniques can be used in the variant analysis workflow. Finally, Sects. 5 and 6 include experimental results and conclusions.

2 Background

2.1 Advanced Hardware Architecture

It is already well known that accessing data in memory is much faster than doing so on hard disks. A memory access by one CPU can take up to 200 ns, while one disk access can take up to 10 million nanoseconds (10 ms). Although different protocols to speed up the transfer of hard disk data have been developed (like SAS with 12 Gbps and SATA with 6 Gbps), reading or writing to disks is still many orders of magnitude slower than memory access.


Modern commercial servers can be equipped with huge memory; for example, a Dell PowerEdge R940 can have up to 6 TB of RAM. Furthermore, distributed shared memory architectures combine memories from different physical machines into one logical address space. The machines in this case are linked with high-speed, low-latency interconnects. The operating system can still see this system as a single computing server with the collective number of processors and RAM. This architecture can lead to a server with a much higher RAM size, in the range of tens of terabytes. Interestingly, these architectures also provide battery-powered non-volatile RAM (NVDIMM), albeit of limited size, which can help keep important information in case of power failure.

To narrow the gap between RAM and hard disk, solid state drives (SSDs) (and the new 3D XPoint of Intel) have recently become available with a faster interface based on the NVMe (Non-Volatile Memory Express) protocol. NVMe is based on PCIe and it assumes the SSDs are mounted to a physical slot, i.e., directly connected to the CPU. While the maximum throughput of SATA is 6 Gbps and that of SAS is 12 Gbps, the throughput of NVMe based on PCIe Gen3 can reach 24 Gbps. Many researchers expect that the performance of SSDs will converge to that of RAM; that is, in the near future we will have huge non-volatile high-speed memory.

2.2 In-memory Data Processing

Linux pipes: Linux systems offer two options for using the RAM instead of the disk. The first is the folder /dev/shm (tmpfs), where data actually resides in the RAM and not on the disk. The second option is the use of pipes, where the output of one program is fed to the next through intermediate buffering in the RAM. Another interesting feature of piping is that the two involved processes actually run in parallel, where the data items are processed whenever they are ready. For example, assume a task A is piped to a task B (i.e., A|B). The data item A(Di) output by A can be processed by B while A is still processing a later data item Dj, i < j. This characteristic is of great advantage when the data set is a list of items that can be processed independently of one another, like our NGS reads and variants.

In-memory database systems: The emergence of in-memory database systems dates back to the 1980s [1–4]. Recent in-memory systems come in different flavors: there are relational, column-oriented, key-value, document, and graph-oriented databases. There are also systems that offer many of these models, like Apache Ignite, Couchbase, ArangoDB and SAP HANA, among others. This is in addition to systems supporting memory-resident data structures, like the Redis system. For a survey of these systems, we refer the reader to [5,6]. For genome analysis, Schapranow et al. [7] demonstrated how in-memory systems can be used to speed up the alignment of NGS reads.
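As a small illustration of the tmpfs option, any file staged under /dev/shm is transparently served from RAM by later readers. The sketch below uses only the Python standard library; the database file name is just an example of a large, frequently read file.

import shutil
import tempfile

# /dev/shm is a tmpfs mount on most Linux systems: files placed there
# reside in RAM, so repeated reads never touch the mechanical disk.
workdir = tempfile.mkdtemp(prefix="pipeline-", dir="/dev/shm")

# Stage the (placeholder) database once; later tools read it from RAM.
shutil.copy("humandb/hg19_refGene.txt", workdir)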

3 The Variant Analysis Workflows

The variant analysis workflow is the most widely used workflow in clinical practice. The workflow is used to identify the variants in the patient’s genome and to annotate them with as much information as possible to enable interpretation and clinical decisions. In Fig. 1, we show the basic steps of the workflow. The input to the workflow is a set of NGS reads coming from the sequencing machine. The first step is to check the quality of the reads and to exclude those with low quality. The second step includes the alignment (mapping) of the reads to a reference genome. This alignment arranges the reads with respect to the reference genome to assemble the target genome, a step sometimes referred to as reference-based assembly. The program BWA [8] is the most commonly used tool for this task. Once the reads are mapped, variant calling is performed to spot the variants and to distinguish them from sequencing errors using different

Fig. 1. The variant analysis workflow. The upper part is a schematic diagram. The lower part shows the workflow as designed in the canvas of Tavaxy.


statistical models, taking technology-specific parameters into consideration. The most commonly used tool for variant calling is the GATK pipeline [9]. In fact, the variant calling process is a sub-workflow that involves many steps, as shown in the figure. The final step includes the annotation of the variants using different information and knowledge databases.

One way to implement the variant analysis workflow is to write a (shell) script that runs one phase after another. Another means is to use workflow management systems, where the workflow is visualized and non-IT experts can modify the analysis parameters without any scripting or programming skills. Example workflow management systems with a graphical user interface include Galaxy [10], Taverna [11], and Tavaxy [12,13], among others. In workflow systems, the programs run according to a certain dependency plan and the results of one program are fed to the next one. These workflow systems also have the advantage of running on a high performance computing infrastructure, where many independent tasks can run in parallel. In the lower part of Fig. 1, we show the implementation of the variant analysis workflow in Tavaxy, where each major step is represented by a node (which is actually a sub-workflow).

A drawback of these systems (including the current version of Tavaxy) is that the output of one step in the workflow is passed as input to the next step via intermediate files written to certain folders. Another drawback is that one task in the workflow cannot start before the completion of all the previous tasks it depends on. That is, these workflow systems do not directly support the use of Linux piping and the use of the main memory. In the following section, we discuss how to overcome these limitations within these workflow systems using Linux piping and in-memory systems.

4 In-memory Systems in Action

4.1 Linux Piping

The first steps in the variant analysis workflow, involving the quality check and alignment, can readily use Linux piping. The output of the program fastx for quality check can be piped to the alignment program BWA. With piping, the quality check and the alignment step run in parallel, where any good-quality read reported by fastx is directly processed by BWA. (Note that there might be some delay until BWA loads the index into the RAM.) For distributed computation on a computer cluster, the set of NGS reads can be decomposed into subsets that can be processed on different compute nodes, and within each node the quality check and alignment can still be piped. After the alignment, samtools can be used to format the output and, if required, decompose the results into subsets to be processed in parallel. Fortunately, samtools supports Linux piping.

Piping from the alignment to GATK cannot be achieved on the read level, because GATK computes some background statistics on the whole read set. One idea to overcome this limitation is to decompose the set of reads into different subsets. Each subset includes the reads covering a large segment of the genome


(the reads are sorted by samtools). In [14], a segment size of about 50 Mbp was suggested. Each of these subsets can be processed in parallel and fed one after another whenever ready. In other words, the piping works on the subset level rather than on the read level. The output of GATK is a set of variants. These variants can then be piped to the subsequent annotation step. Piping and online processing of the annotation step is addressed in more detail later in the paper. To sum up, piping can be used on the read level from the beginning of the workflow until the alignment. From the alignment to variant calling using GATK, piping is done on the level of blocks of reads. From GATK to variant annotation, piping can be used again on the variant level.

Piping in workflow management systems: Workflow management systems based on a data flow model (like Galaxy and Tavaxy) do not support such a mode of operation. Usually, for two tasks A → B, where B depends on A, the data flow model assumes that task B cannot start before task A has completed and the whole output of A has become available; intermediate files are used to store the output of A, which in turn becomes the input of B. In the piping or streaming model, we can allow B to start processing once some result data from A becomes available; that is, A and B can run in parallel. Instead of changing the workflow engine itself, one can overcome this problem by defining a new workflow node representing a sub-workflow in which tasks A and B run using Linux pipes. That is, the command executed in association with this node includes piping, in the form run(A)|run(B). Combining this with parallel processing of A → B on subsets/blocks of the reads will lead to considerable speedup.
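Such a wrapper node boils down to launching the involved tools with their standard streams connected. A minimal Python sketch of the quality-check/alignment/formatting chain is shown below; bwa and samtools are invoked as described above, while the quality-check command and all flags are placeholders to be replaced by the actual tools of the pipeline.

import subprocess

# Equivalent of the shell pipeline:
#   qc < reads.fastq | bwa mem ref.fa - | samtools sort -o aligned.bam -
with open("reads.fastq", "rb") as fq:
    qc = subprocess.Popen(["fastq_quality_filter", "-q", "20", "-p", "90"],
                          stdin=fq, stdout=subprocess.PIPE)
    bwa = subprocess.Popen(["bwa", "mem", "ref.fa", "-"],
                           stdin=qc.stdout, stdout=subprocess.PIPE)
    qc.stdout.close()       # let qc receive SIGPIPE if bwa exits early
    sort = subprocess.Popen(["samtools", "sort", "-o", "aligned.bam", "-"],
                            stdin=bwa.stdout)
    bwa.stdout.close()
    sort.wait()             # all three processes ran concurrently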

4.2 In-Memory Systems for the Variant Annotation

The Annovar system [15] is used to annotate the variants with all possible information. The pieces of annotation information include the respective gene and coding region, the frequencies in population databases, the structural effect of the mutation, and the relation to the disease. The Annovar system compiles public databases including this information and uses them for the annotation process. For hg19, Annovar has 32 text files including these databases, of total size ≈350 GB. To complete the annotation, each variant is searched for in these files to extract the respective record, if it exists.

Optimizing Annovar can be achieved in two ways:

– Decompose the list of variants into different sub-lists and process each list independently. The Annovar system uses Perl threads to achieve this. This can indeed speed up the annotation, because the queries can run quickly in parallel on the loaded databases.
– Query each variant against the different databases in parallel. This solution is not implemented in Annovar yet, as it has the following challenge: to query the


databases in parallel, all the 350 GB of files should be cached in the RAM, and that caching time could outweigh the disk-based search. For a large number of files to annotate, this strategy might pay off.

If there is enough RAM, another possible solution is to keep these databases in the RAM and use them whenever an annotation is needed. There are two means to do this:

1. Use of tmpfs: One uses the /dev/shm folder to host the annotation databases.
2. Use of an in-memory database: In this case, the Annovar databases should be hosted in an in-memory database. This option necessitates the re-implementation of the Annovar system, because some of the computation required to report the variants according to the HGVS nomenclature (www.hgvs.org) would have to be re-implemented as well (mostly the handling of ambiguities associated with indels).

In the following, we introduce an in-memory based strategy that does not change the code of Annovar and that works for large-scale genome projects requiring the annotation of a large number of variant files. Our optimization strategy is based on the observation that a large number of variants are common in the population. The strategy is that once a variant is annotated, we keep a copy of it in an in-memory database. We use the Redis in-memory system to handle the annotated variants. Redis uses key-value pairs to define and retrieve the objects. The key of a variant is its physical location in the genome and the base change, defined by the tuple (chromosome, start position, end position, reference base(s), alternative base(s)). The value is the annotation information related to the variant. When annotating a variant, we first check whether the variant is in the Redis database or not. If it exists, we report it. Otherwise, it is added to a list L. After collecting all new variants in L, we run Annovar on them. Once the new variants are annotated, they are reported and added to the Redis database. The size of the database in the main memory can be kept constant by deleting less frequent variants on a regular basis. The frequency of each variant can be kept in a separate Redis table, where the key is defined as above and the value is the number of times the variant was observed in the annotation process. As we will demonstrate by experiment, this strategy has proven very effective in reducing the annotation time.
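A minimal sketch of this caching layer with the redis-py client is shown below. The key layout follows the tuple described above; the JSON serialization, the freq: prefix for the frequency table and the run_annovar callable are our own placeholder conventions, not part of Annovar or Redis.

import json
import redis

r = redis.Redis()  # assumes a local Redis instance

def variant_key(chrom, start, end, ref, alt):
    # Key = physical location plus base change, as described above.
    return f"{chrom}:{start}:{end}:{ref}>{alt}"

def annotate(variants, run_annovar):
    """Annotate variants, serving known ones from Redis and sending only
    the misses (the list L from the text) to Annovar."""
    annotated, misses = {}, []
    for v in variants:
        key = variant_key(*v)
        cached = r.get(key)
        if cached is not None:
            annotated[v] = json.loads(cached)
        else:
            misses.append(v)
        r.incr("freq:" + key)   # separate frequency table for later eviction
    for v, record in zip(misses, run_annovar(misses)):
        r.set(variant_key(*v), json.dumps(record))
        annotated[v] = record
    return annotated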

4.3 Fault Tolerance

One of the main concerns about in-memory processing is the loss of data due to power loss, hardware malfunction or software errors. Fault tolerance mechanisms are critical to assure the consistency of transactions when failures occur. To introduce an additional layer of fault tolerance, a copy of the result data of each step in the workflow can be written in parallel to another stable storage, preferably of SSD type. Logging information about each step enables resumption of the workflow after a failure without re-computation of already finished tasks. Write-ahead logging techniques can be used to keep a copy of the variants in Redis in permanent storage and/or in battery-powered non-volatile RAM.
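For the Redis cache specifically, one readily available instance of this idea is Redis’s append-only file (AOF) persistence, which can be switched on at runtime; the fsync policy below is one possible trade-off, not a recommendation.

import redis

r = redis.Redis()
# The AOF acts as a write-ahead log: every write command is appended to
# disk and replayed after a crash, so cached annotations survive failures.
r.config_set("appendonly", "yes")
# 'everysec' fsyncs once per second, bounding the loss to ~1 s of writes.
r.config_set("appendfsync", "everysec")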


5 Experimental Results

5.1 Experiment 1: Linux Piping

The performance gain of using Linux pipes has been addressed before in [16]. We could achieve similar results on similar test data, like the standard human NGS exome dataset (NA12878) [9], whose size is about 9 Gbp. Using pipes, a speedup of about 30% could be achieved for the steps of quality check followed by read mapping, and for formatting the final BAM file after alignment to be ready for removing duplicates and variant calling. Combining this with the usual parallelization of the data processing achieves more speedup, as shown in Table 1.

Table 1. Running times in minutes using different numbers of cores. For these computations, we used a machine with 64 cores, 512 GB RAM, and an 8 TB disk.

Mode             Nodes
                 16        32        64
Without piping   452 min   237 min   170 min
With piping      385 min   194 min   139 min

5.2 Experiment 2: In-memory Databases

In this experiment, we tested our strategy for speeding up Annovar. Table 2 shows the results of our approach under different scenarios. We measured the execution of Annovar on hard disk, and measured the performance when the whole set of databases is hosted in the RAM (tmpfs). We also measured our optimization strategy, where we stored 200 previously annotated variant files (with about 8 million variants) in memory using Redis and MySQL (memory-based). For the roughly 1000 exomes and 3000 gene panel files that have been annotated, we observed an average hit rate of 93%; that is, only around 10% of the variants of each file were new and had to be passed to Annovar for annotation. We also measured the time assuming all variants are in the database, i.e., a hit rate of 100%. This is the best case, and it is unlikely to happen in practice, even with a large number of variants in the database, due to individual variation.

From the results in the table, we can confirm the advantage of hosting the Annovar databases in the RAM compared to storing them on the hard drive. We can also confirm that our strategy based on a memory-resident database (Redis or MySQL) significantly speeds up the annotation process. Overall, the annotation time could be reduced by 50% compared to the use of tmpfs only, and by 80% compared to the hard-disk based version. We also observe that Redis performs slightly better than the memory-based version of MySQL.

It is important to note that we did not observe much time reduction when processing small gene panel files. This can be attributed to the overhead of


Table 2. Running times in seconds using different variant files from Illumina and Ion technology (exomes and gene panels, GP). The column titled "HDD" includes the time when using hard drives; the column titled "tmpfs" is when the Annovar system including its databases is hosted in the RAM. The columns titled "Redis100" and "MySQL100" include the times when we store previously annotated variants in Redis and MySQL, respectively, assuming a hit rate of 100%. The column titled "Redis90" includes the time when we have an average hit rate of ≈90%. For these computations, we used a machine with 64 cores, 512 GB RAM, and an 8 TB disk.

Input                #Variants (FileSize)   HDD    tmpfs   Redis100   MySQL100   Redis90
VF1 Illumina Exome   84287 (18M)            1351   474     26         34         241
VF2 Illumina Exome   88387 (19M)            1275   480     28         34         251
VF3 Illumina Exome   88410 (19M)            1351   490     28         31         259
VF4 Illumina Exome   85307 (19M)            1275   481     27         33         246
VF5 Ion Exome        54249 (27M)            870    305     17         23         185
VF6 Ion Exome        55265 (27M)            844    307     18         21         188
VF7 Ion GP           1344 (622K)            619    213     2          3          150
VF8 Ion GP           1498 (642K)            623    215     2          3          150

reading the annotation databases. To overcome this problem for gene panels, we recommend merging multiple gene panel files and processing them as one big variant file to eliminate that overhead.

6 Conclusions

In this paper, we explored how in-memory systems (both hardware and software) can be exploited for clinical genomics data, with a focus on the variant analysis workflow. We have shown the points in the workflow where in-memory techniques, in the form of Linux piping and memory-resident databases, can be used. We demonstrated that piping techniques speed up the variant analysis workflow. We also explained how the piping techniques can be wrapped in workflow management systems, even those based on the data flow computational model. For the annotation step, we introduced a new strategy based on storing previously annotated variants in in-memory databases. Interestingly, with a reasonable number of stored variants, we can reduce the running time by about 80% compared to the disk-based systems. The use of SSDs based on the NVMe protocol is very effective and should be exploited on a large scale in genome analysis. In fact, SSDs may soon extend the RAM, which implies that sequence analysis tools should be re-engineered to make use of this feature.

Acknowledgments. This publication was supported by the Saudi Human Genome Project, King Abdulaziz City for Science and Technology (KACST).


References

1. DeWitt, D.J., Katz, R.H., Olken, F., Shapiro, L.D., et al.: Implementation techniques for main memory database systems, vol. 14, no. 2. ACM (1984)
2. Eich, M.H.: MARS: the design of a main memory database machine. Database Mach. Knowl. Base Mach. 43, 325–338 (1988)
3. Garcia-Molina, H., Salem, K.: Main memory database systems: an overview. IEEE Trans. Knowl. Data Eng. 4(6), 509–516 (1992)
4. Sikka, V., Färber, F., Lehner, W., et al.: Efficient transaction processing in SAP HANA database: the end of a column store myth. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 731–742. ACM (2012)
5. Han, J., Haihong, E., Guan, L., Jian, D.: Survey on NoSQL database. In: 6th International Conference on Pervasive Computing and Applications (ICPCA), pp. 363–366 (2011)
6. Ganesh Chandra, D.: BASE analysis of NoSQL database. Future Gener. Comput. Syst. 52, 13–21 (2015)
7. Schapranow, M.P., Plattner, H.: An in-memory database platform enabling real-time analyses of genome data. In: 2013 IEEE International Conference on Big Data, pp. 691–696, October 2013
8. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14), 1754–1760 (2009)
9. DePristo, M., Banks, E., et al.: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43(5), 491–498 (2011)
10. Goecks, J., Nekrutenko, A., Taylor, J., The Galaxy Team: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11(8), R86 (2010)
11. Hull, D., Wolstencroft, K., Stevens, R., Goble, C., et al.: Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 34, W729–W732 (2006)
12. Abouelhoda, M., Issa, S., Ghanem, M.: Tavaxy: integrating Taverna and Galaxy workflows with cloud computing support. BMC Bioinform. 13(1), 77 (2012)
13. Ali, A.A., El-Kalioby, M., Abouelhoda, M.: Supporting bioinformatics applications with hybrid multi-cloud services. In: Ortuño, F., Rojas, I. (eds.) IWBBIO 2015. LNCS, vol. 9043, pp. 415–425. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16483-0_41
14. Elshazly, H., Souilmi, Y., Tonellato, P., Wall, D., Abouelhoda, M.: MC-GenomeKey: a multicloud system for the detection and annotation of genomic variants. BMC Bioinform. 18, 49 (2017)
15. Wang, K., Li, M., Hakonarson, H.: ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38(16), e164 (2010)
16. GATK: How to map and clean up short read sequence data efficiently. https://gatkforums.broadinstitute.org/gatk/discussion/6483/how-to-map-and-clean-up-short-read-sequence-data-efficiently. Accessed December 2017

Improving Metagenomic Assemblies Through Data Partitioning: A GC Content Approach

Fábio Miranda1(B), Cassio Batista1, Artur Silva2,3, Jefferson Morais1, Nelson Neto1, and Rommel Ramos1,2,3

1 Computer Science Graduate Program, Federal University of Pará, Belém, Brazil
{fabiomm,cassiotb,jmorais,nelsonneto,rommelramos}@ufpa.br
2 Institute of Biological Sciences, Federal University of Pará, Belém, Brazil
[email protected]
3 Center of Genomics and Systems Biology, Federal University of Pará, Belém, Brazil

Abstract. Assembling metagenomic data sequenced by NGS platforms poses significant computational challenges, especially due to large volumes of data, sequencing errors, and variations in size, complexity, diversity and abundance of organisms present in a given metagenome. To overcome these problems, this work proposes an open-source, bioinformatic tool called GCSplit, which partitions metagenomic sequences into subsets using a computationally inexpensive metric: the GC content. Experiments performed on real data show that preprocessing short reads with GCSplit prior to assembly reduces memory consumption and generates higher quality results, such as an increase in the size of the largest contig and N50 metric, while both the L50 value and the total number of contigs produced in the assembly were reduced. GCSplit is available at https://github.com/mirand863/gcsplit.

Keywords: DNA sequencing · Metagenomics · Data partitioning · Bioinformatic tools · Metagenomic data preprocessing

1 Introduction

Metagenomics consists in determining the collective DNA of microorganisms that coexist as communities in a variety of environments, such as soil, sea and even the human body [1–3]. In a sense, the field of metagenomics transcends the traditional study of genes and genomes, because it allows scientists to investigate all the organisms present in a certain community, and thus to infer the consequences of the presence or absence of certain microbes. For example, sequencing the gastrointestinal microbiota enables the understanding of the role played by microbial organisms in human health [4].

Nevertheless, second generation sequencing technologies, which belong to next generation sequencing (NGS) and are still the most widespread technology on the market, are unable to completely sequence the individual genome


of each organism that comprises a metagenome. Instead, NGS platforms can sequence only small fragments of DNA from random positions, and the fragments of the different organisms are blended [5]. Hence, one of the fundamental tasks in metagenome analysis is to overlap the short reads in order to obtain longer sequences, denominated contigs, with the purpose of reconstructing each individual genome of a metagenome or representing the gene repertoire of a community [6]. This task is referred to as the metagenome assembly problem.

Roughly speaking, metagenome assembly can be done with or without the guidance of a reference genome. Reference-guided assembly can be performed by aligning reads to the genomes of cultivated microbes [7]. However, this method is rather limited, because the microbial diversity of most environments extends far beyond what is covered by the reference databases. Consequently, it is necessary to perform de novo assembly when reconstructing a metagenome that contains many unknown microorganisms.

Although it seems simple at first glance, the metagenome assembly problem is actually quite complex. Among the several challenges this task raises are the sequencing errors specific to each platform and the processing of the large volume of data produced by NGS platforms [8]. Moreover, the problem is further complicated by variations in the size of the genomes and by the complexity, diversity and abundance of each organism present in a microbial community [9]. For these reasons, metagenome assembly is a challenging problem.

To face these challenges, either de novo assembly can be performed directly by a metagenome assembler, or the short reads can be clustered in advance in order to individually assemble each organism present in the metagenome [10]. The latter approach has the advantage of reducing the computational complexity of the metagenome assembly, because the assembler processes smaller subsets of short reads; furthermore, it is possible to run the individual assembly of each genome in parallel, since those tasks are independent of each other. The reduction of computational complexity can also be achieved through digital normalization or data partitioning prior to assembly, which reduce the dataset by removing redundant sequences and divide it into groups of similar reads, respectively [11].

The main focus of this study is the application of the data partitioning method towards the reduction of computational complexity and the improvement of metagenome assembly. The developed computational approach, denominated GCSplit, uses the nucleotide composition of the reads, i.e., the amount of bases A, G, C and T present in the DNA sequences. This decision was based on the fact that the different organisms or genes that compose a metagenome have distinct GC contents, and different GC contents present coverage variation, a metric used by assemblers to reconstruct the genomes, which in turn affects the k-mer size selected to perform the sequence assembly based on NGS reads.

The rest of this paper is structured as follows. Related works on digital normalization and data partitioning are discussed in Sect. 2. Section 3 then presents the proposed algorithm. In Sect. 4, the impact of the new approach


on the performance of the metagenomic assembler metaSPAdes [12] is evaluated through experiments on real data. Finally, Sect. 5 presents the conclusions and plans for future work.

2 Related Work

In the literature there are several studies that attempt to reduce the computational complexity and improve metagenomic assemblies through data preprocessing techniques. The main approaches used are digital normalization and data partitioning, the latter being the main focus of this article. In this context, the goal of this section is to review tools that use such methodologies.

Diginorm [13] is a tool that uses the Count-Min Sketch data structure to count k-mers, with the purpose of obtaining an estimate of the sequencing coverage and reducing coverage variation by discarding redundant data. Due to this data structure, the technique keeps a constant memory usage and a linear runtime complexity for the de novo assembly in relation to the amount of input data.

Trinity's in silico normalization (TIS) [14], which belongs to the Trinity assembler algorithm package, presents an implementation that computes the median k-mer coverage for all reads of a given dataset. If the median coverage is lower than the desired value, all reads are kept. Otherwise, the reads may be kept with a probability that is equal to the ratio of the desired coverage to the median coverage.

NeatFreq [15] clusters and selects short reads based on the median k-mer frequency. The main innovation in that work is the inclusion of methods for the use of paired reads alongside the preferential selection of regions with extremely low coverage. The results achieved indicate that the coverage reduction obtained by NeatFreq increased the processing speed and reduced the memory usage during the de novo assembly of bacterial genomes.

ORNA [16] presents a novel and interesting approach that normalizes short reads to the minimum amount necessary to preserve important k-mers that connect different regions of the assembly graph. The authors treat data normalization as a set multi-cover problem and propose a heuristic algorithm. Their results show that a better normalization was achieved with ORNA when compared with similar tools. Moreover, the size of the datasets was drastically reduced without a significant loss in the quality of the assemblies.

Khmer [17,18] presents a novel data partitioning methodology, in which the main data structure, a probabilistic model called a Bloom filter, is used to obtain a compact representation for graphs. The authors' implementation can represent each k-mer using only 4 bits, which was the major factor in achieving a forty-fold memory economy while assembling a soil metagenome.

MetaPrep [19] contains efficient implementations for k-mer counting, parallel sorting, and graph connectivity and partitioning. The developed solution was evaluated on a soil metagenome dataset composed of 223 gigabases (Gb) distributed in 1.13 billion short reads. As a result of the experiment, MetaPrep took only 14 min to process this dataset using just 16 nodes of the NERSC


Edison supercomputer. The authors also assessed how MetaPrep can improve the performance of the metagenomic assembler MEGAHIT.

Latent Strain Analysis [20] is an approach that separates the DNA sequences into partitions considering the biological factor, thus allowing the individual assembly of each genome present in a metagenome. The proposed methodology assumes that the abundance of the genomes present in a sample is reflected in their k-mer abundance. The results achieved allowed the partial or almost complete assembly of bacteria whose relative abundance goes down to a minimum of 0.00001%.

During the literature review, we found no software that performs data partitioning using the information present in the nucleotide composition of short reads sequenced by NGS platforms. Hence, in this work we propose GCSplit, a tool that uses the GC content of the DNA sequences in combination with statistical metrics to partition the dataset. This new approach is promising because it is computationally inexpensive and uses information present in the reads that, as far as we know, has not been used in any other work for data partitioning. Further details about this new algorithm follow.

3 The Proposed Algorithm

GCSplit was implemented in C++ in order to facilitate communication with the library that parallelizes the critical sections of the algorithm. The software packages KmerStream [21] and metaSPAdes [12], which are executed automatically to estimate the best k-mer values and to assemble the metagenome, respectively, are dependencies of the proposed algorithm. The object-oriented programming paradigm was used to simplify the eventual addition of new assemblers or k-mer estimation programs in the future, since one would only need to implement new classes that interact with the desired programs. Figure 1 summarizes the main steps of the developed algorithm.

Fig. 1. GCSplit algorithm overview.
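To make the plug-in design concrete, a minimal C++ sketch of such an interface follows; the class and method names (Assembler, MetaSpadesAssembler, assemble) are illustrative assumptions, not GCSplit's actual API.

#include <cstdlib>
#include <string>

// Abstract interface: each concrete subclass wraps one external
// assembler, so adding a new assembler only requires a new class.
// Names are illustrative; GCSplit's real classes may differ.
class Assembler {
public:
    virtual ~Assembler() = default;
    // Assembles the paired FASTQ files of one partition and returns
    // the path of the resulting FASTA file.
    virtual std::string assemble(const std::string& left,
                                 const std::string& right) = 0;
};

class MetaSpadesAssembler : public Assembler {
public:
    std::string assemble(const std::string& left,
                         const std::string& right) override {
        // Delegate the actual work to the external metaSPAdes binary.
        std::string cmd = "metaspades.py -1 " + left + " -2 " + right + " -o out";
        std::system(cmd.c_str());
        return "out/contigs.fasta";
    }
};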


As input, GCSplit takes two paired FASTQ files, both containing the metagenome's sequences. The files are automatically passed to KmerStream, which estimates the best k-mer values for the assembly. Next, an algorithm was developed within GCSplit to partition the datasets, as shown in the pseudo-code of Algorithm 1. More specifically, the algorithm computes the GC content of all reads, making use of the OpenMP library to do so in parallel. The C++ STL, in turn, provides a parallel implementation of the merge sort algorithm, which is used to sort the sequences according to their GC content in ascending order.

Algorithm 1. Pseudo-code to partition the dataset

Input: paired read vectors L and R, number of partitions n
Output: n sets of paired FASTQ files
 1: s ← |L|
 2: for i ← 0 to s − 1 do
 3:     l ← |Li|
 4:     Li.gc ← 0
 5:     for j ← 0 to l − 1 do
 6:         if Li[j] = 'C' or Li[j] = 'G' then
 7:             Li.gc ← Li.gc + 1
 8:     Li.gc ← (Li.gc × 100) / l
 9:     Li.pair ← Ri                  // creates a pointer from Li to Ri
10: sort(L, 0, s − 1)                 // merge sort
11: p ← ⌊s / n⌋
12: r ← s mod n
13: x ← 0
14: y ← 0
15: for i ← 0 to n − 1 do
16:     if r > 0 then
17:         y ← y + p + 1
18:         r ← r − 1
19:     else
20:         y ← y + p
21:     createPartition(L, x, y − 1, string(i))
22:     x ← y
The proposed algorithm then calculates the approximate number of reads p that go into each partition and divides the dataset into n subsets/partitions based on this value. The number of partitions n, which is received as an input parameter, is expected to be much smaller than the number of reads s (i.e., s ≫ n) and, thus, s × l > s log s > n × (b − e) in the worst case. The created partitions are individually assembled by metaSPAdes. Then, another assembly is performed with SPAdes [22] to concatenate the n previous assemblies, where the assembly result with the highest N50 is used as input in the trusted-contigs parameter and the remaining ones are used as single libraries. SPAdes was used in this final step because metaSPAdes does not yet accept multiple libraries as input. Ultimately, the final output is a FASTA file that contains a high-quality assembly.
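A minimal C++/OpenMP sketch of the partitioning logic of Algorithm 1 follows, assuming a simplified Read record; the writePartition helper is hypothetical, and GCSplit's actual implementation relies on its own data structures and the parallel STL merge sort.

#include <omp.h>
#include <algorithm>
#include <string>
#include <vector>

struct Read { std::string seq; double gc = 0.0; };  // simplified record

void gcPartition(std::vector<Read>& reads, std::size_t n) {
    const std::size_t s = reads.size();
    // Compute the GC percentage of every read in parallel (the GC loop).
    #pragma omp parallel for
    for (std::size_t i = 0; i < s; ++i) {
        std::size_t gc = 0;
        for (char c : reads[i].seq)
            if (c == 'C' || c == 'G') ++gc;
        reads[i].gc = 100.0 * gc / reads[i].seq.size();
    }
    // Sort by GC content in ascending order (the merge sort step).
    std::sort(reads.begin(), reads.end(),
              [](const Read& a, const Read& b) { return a.gc < b.gc; });
    // Spread the remainder r = s mod n over the first r partitions.
    std::size_t p = s / n, r = s % n, x = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t y = x + p + (r > 0 ? 1 : 0);
        if (r > 0) --r;
        // writePartition(reads, x, y - 1, i);  // hypothetical: emit pair i
        x = y;
    }
}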

4 Evaluation

4.1 Datasets and Platform

In order to evaluate GCSplit, analyses were conducted on three real metagenomic datasets, whose samples were collected from the following environments: a moose rumen, hot springs headwaters, and a sewage treatment plant. The datasets used in the experiments are listed in Table 1.

The moose rumen sample was collected in Växjö, Sweden. This sample was sequenced with the Illumina HiSeq 2500 platform and contains 25,680,424 paired short reads of 101 base pairs (bp). These short reads can be obtained from the Sequence Read Archive (SRA) database through accession number ERR1278073.


Table 1. Dataset description

ID   Dataset                 Read count R (×10^6)  Size M (Gbp)  SRA accession number
MR   Moose rumen             25.6                  5.2           ERR1278073
HS   Hot springs             7.4                   3.7           ERR975001
STP  Sewage treatment plant  21.3                  6.4           SRR2107212, SRR2107213

The hot springs samples were collected in 2014 and 2015 at the headwaters of Little Hot Creek, located in the Long Valley Caldera, near Mammoth Lake in California, United States [23]. Sequencing was conducted using one of the following Illumina platforms: MiSeq PE250, MiSeq PE300, or HiSeq Rapid PE250. The insert size was approximately 400 bp; MiSeq runs were prepared using the Agilent SureSelect kit, while HiSeq PE250 samples were prepared using the Nextera XT library preparation kit. These data, which contain 7,422,611 short reads of 250 base pairs, can be obtained via SRA accession number ERR975001.

The sludge sample was collected in a municipal sewage treatment plant located in Argentina [24]. This sample was split into two technical replicates and sequenced with the Illumina HiSeq 1500 platform in 150 bp short reads. This dataset can be downloaded via SRA accession numbers SRR2107212 and SRR2107213, which contain 8,899,734 and 12,406,582 short reads, respectively.

The assemblies were performed on a cluster that runs the GNU/Linux x64 operating system (openSUSE) and contains 4 nodes with 64 cores and 512 Gigabytes (GB) of Random Access Memory (RAM).

4.2 Results

Table 2 shows the assembly quality results produced by metaSPAdes without preprocessing and after preprocessing with GCSplit using four partitions, for the metagenomes collected from the moose rumen (MR), the hot springs (HS), and the sewage treatment plant (STP). The best 10 k-mers estimated by KmerStream were used in those assemblies. The statistics were computed with the tool QUAST [25], while peak memory usage was extracted from the assemblers' logs.

For the MR dataset, the number of contigs was drastically reduced from 175,761 to 572 when preprocessing the data with GCSplit. The N50 value increased from 2,004 bp in the assembly without preprocessing to 164,782 bp after preprocessing, while the L50 value decreased from 23,843 to 6 contigs. This implies that to reach the N50 value of 164,782 bp, we need to sum the lengths of only 6 contigs, whereas without GCSplit 23,843 contigs were necessary to reach a much smaller N50 value, meaning that the assembly with metaSPAdes alone contains more fragmented sequences overall. There was also a reduction of 13 GB in memory consumption during the assembly with GCSplit.

Table 2. Assembly quality comparison with four partitions (metaSPAdes assembly statistics)

Dataset  Preproc.?  #Contigs  Largest contig  Total length (Mbp)  N50      L50     Memory peak (GB)
MR       Yes        572       791,884         4.3                 164,782  6       21
MR       No         175,761   408,484         276.1               2,004    23,843  34
HS       Yes        98        193,569         3.1                 109,605  11      21
HS       No         26,656    109,705         28.7                1,108    4,795   45
STP      Yes        1,562     1,020,504       19.6                85,650   59      33
STP      No         385,566   340,186         463.8               1,312    71,126  76

Moreover, the largest contig produced in the MR assembly after preprocessing the dataset with GCSplit increased from 408,484 bp to 791,884 bp. However, the total length of the assembly generated after applying GCSplit to the data dropped from 276.1 Mbp to 4.3 Mbp. The merging strategy adopted using SPAdes may be one of the reasons for this reduction, because it breaks the contigs down into smaller sequences of length k (k-mers), but other possibilities are under investigation.

The assembly of the HS sample also showed improvements with the use of GCSplit. The memory peak dropped from 45 GB to 21 GB on partitioned data. Furthermore, the number of contigs decreased from 26,656 in the assembly with metaSPAdes alone to 98 contigs after preprocessing with GCSplit, representing a 99% reduction. On the other hand, the N50 value increased significantly, yielding a 9792% growth after partitioning, which is excellent for gene prediction. Moreover, the L50 value was reduced by about 99% with the GC content partitioning. In the HS dataset, the size of the largest contig produced in the assembly after preprocessing was closer to that of the assembly with metaSPAdes alone, at 193,569 bp and 109,705 bp, respectively. The total length of the assembly also experienced a less dramatic decrease, going from 28.7 Mbp in the traditional assembly to 3.1 Mbp after using GCSplit.

For the STP sample, there was a 56% saving in memory usage with prior data partitioning. Additionally, the number of contigs was drastically reduced from 385,566 to 1,562 when the assembly was performed after preprocessing the sequences with GCSplit. The N50 value rose from 1,312 bp in the assembly with metaSPAdes alone to 85,650 bp after partitioning, while the L50 value decreased from 71,126 to 59 contigs with the aid of GCSplit. Furthermore, the largest contig produced in the STP assembly after preprocessing increased from 340,186 bp to 1,020,504 bp. Conversely, the total length of the assembly declined from 463.8 Mbp to 19.6 Mbp. This result shows that even though the amount of data decreased, the N50 value improved, which can favor gene prediction in later analyses.


Another experiment was carried out to assess whether different numbers of partitions affect the computational cost and the quality of the assembly. Table 3 shows the results produced by metaSPAdes after preprocessing with GCSplit using different numbers of partitions for the metagenome collected from the Hot Springs (HS).

Table 3. Assembly quality comparison for different numbers of partitions (metaSPAdes assembly statistics)

Dataset  #Part.  #Contigs  Largest contig  Total length (Mbp)  N50      L50  Memory peak (GB)
HS       2       205       349,825         4.7                 50,407   22   31
HS       4       82        193,547         2.7                 110,167  10   21
HS       8       66        283,547         2.5                 123,532  6    19

The results in Table 3 show that larger numbers of partitions reduce memory consumption. However, the total length they produce represents about half of the data produced when the dataset is divided into only two partitions, which may indicate a significant loss in gene representativity. Therefore, gene prediction analyses are necessary to validate whether the larger amount of data (observed in the assembly of less partitioned data) is proportional to the number of predicted Open Reading Frames (ORFs) or to the products identified when performing a homology search in public databases. All things considered, the number of partitions is a flexible parameter, so the user can evaluate several options and identify the best one, since variation across organisms can occur due to the GC content of the genome.

5 Conclusion

In this work, we developed a new bioinformatics tool called GCSplit, which partitions metagenomic data into subsets using a computationally inexpensive metric: the GC content of the sequences. GCSplit has been implemented in C++ as an open-source program, which is freely available in the following GitHub repository: https://github.com/mirand863/gcsplit. GCSplit requires GCC version 4.4.7 or higher, the OpenMP library, and the software KmerStream and metaSPAdes.

Empirical results showed that applying GCSplit to the data before assembly reduces memory consumption and generates higher-quality results, such as an increase in the size of the largest contig and in the N50 metric, while both the L50 value and the total number of contigs generated in the assembly were reduced. Although larger numbers of partitions produced less data, it is important to note that the next analysis performed after assembly is gene prediction, where longer sequences are more likely to have genes predicted, as opposed to fragmented assemblies such as those produced with metaSPAdes alone, which have smaller N50 values and larger amounts of bp. Moreover, metagenome binning can be favored by the generation of less fragmented results due to higher N50 values.

As future work, we expect to perform gene prediction to assess whether the contigs produced in the assembly with GCSplit contain meaningful information. Additionally, we also plan to test the application of GCSplit on eukaryotic datasets, and to test the Overlap-Layout-Consensus (OLC) approach in the merging step. The experiments that would allow the comparison of GCSplit with other algorithms specialized in either digital normalization or data partitioning could not be completed by the submission deadline of this article.

Acknowledgments. This research is supported in part by CNPq under grant numbers 421528/2016-8 and 304711/2015-2. The authors would also like to thank CAPES for granting scholarships. Datasets were processed in the Sagarana HPC cluster, CPAD-ICB-UFMG.

References

1. Vogel, T.M., Simonet, P., Jansson, J.K., et al.: TerraGenome: a consortium for the sequencing of a soil metagenome. Nat. Rev. Microbiol. 7, 252 (2009)
2. Venter, J.C., Remington, K., Heidelberg, J.F., et al.: Environmental genome shotgun sequencing of the Sargasso Sea. Science 304, 66-74 (2004)
3. Qin, J., Li, R., Raes, J., et al.: A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59-65 (2010)
4. Turnbaugh, P.J., Ley, R.E., Hamady, M., et al.: The human microbiome project: exploring the microbial part of ourselves in a changing world. Nature 449, 804-810 (2007)
5. Namiki, T., Hachiya, T., Tanaka, H., et al.: MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 40, e155 (2012)
6. Rodrigue, S., Materna, A.C., Timberlake, S., et al.: Unlocking short read sequencing for metagenomics. PLoS ONE 5, e11840 (2010)
7. Nielsen, H.B., Almeida, M., Juncker, A.S., et al.: Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. 32, 822-828 (2014)
8. Wojcieszek, M., Pawelkowicz, M., Nowak, R., et al.: Genomes correction and assembling: present methods and tools. In: SPIE Proceedings, vol. 9290, p. 92901X (2014)
9. Charuvaka, A., Rangwala, H.: Evaluation of short read metagenomic assembly. BMC Genom. 12, S8 (2011)
10. Rasheed, Z., Rangwala, H.: Mc-MinH: metagenome clustering using minwise based hashing. In: SIAM International Conference in Data Mining, pp. 677-685 (2013)
11. Howe, A.C., Jansson, J.K., Malfatti, S.A., et al.: Tackling soil diversity with the assembly of large, complex metagenomes. Proc. Natl. Acad. Sci. 111, 4904-4909 (2014)
12. Nurk, S., Meleshko, D., Korobeynikov, A., et al.: metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824-834 (2017)
13. Brown, C.T., Howe, A., Zhang, Q., et al.: A reference-free algorithm for computational normalization of shotgun sequencing data. arXiv:1203.4802 (2012)
14. Haas, B.J., Papanicolaou, A., Yassour, M., et al.: De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 8, 1494-1512 (2013)
15. McCorrison, J.M., Venepally, P., Singh, I., et al.: NeatFreq: reference-free data reduction and coverage normalization for de novo sequence assembly. BMC Bioinform. 15, 357 (2014)
16. Durai, D.A., Schulz, M.H.: In-silico read normalization using set multi-cover optimization. bioRxiv:133579 (2017)
17. Pell, J., Hintze, A., Canino-Koning, R., et al.: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc. Natl. Acad. Sci. 109, 13272-13277 (2012)
18. Crusoe, M.R., Alameldin, H.F., Awad, S., et al.: The khmer software package: enabling efficient nucleotide sequence analysis. F1000Research 4, 900 (2015)
19. Rengasamy, V., Medvedev, P., Madduri, K.: Parallel and memory-efficient preprocessing for metagenome assembly. In: IPDPSW, pp. 283-292 (2017)
20. Cleary, B., Brito, I.L., Huang, K., et al.: Detection of low-abundance bacterial strains in metagenomic datasets by eigengenome partitioning. Nat. Biotechnol. 33, 1053-1060 (2015)
21. Melsted, P., Halldórsson, B.V.: KmerStream: streaming algorithms for k-mer abundance estimation. Bioinformatics 30, 3541-3547 (2014)
22. Bankevich, A., Nurk, S., Antipov, D., et al.: SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455-477 (2012)
23. Stamps, B.W., Corsetti, F.A., Spear, J.R., et al.: Draft genome of a novel Chlorobi member assembled by tetranucleotide binning of a hot spring metagenome. Genome Announc. 2, e00897-e00914 (2014)
24. Ibarbalz, F.M., Orellana, E., Figuerola, E.L., et al.: Shotgun metagenomic profiles have a high capacity to discriminate samples of activated sludge according to wastewater type. Appl. Environ. Microbiol. 82, 5186-5196 (2016)
25. Gurevich, A., Saveliev, V., Vyahhi, N., et al.: QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072-1075 (2013)

Next Generation Sequencing and Sequence Analysis

Quality Assessment of High-Throughput DNA Sequencing Data via Range Analysis

Ali Fotouhi(1)(B), Mina Majidi(2), and M. Oğuzhan Külekci(3)

(1) Electronics and Communication Engineering Department, Istanbul Technical University, 34605 Istanbul, Turkey, [email protected]
(2) Department of Mathematics, Université de Sherbrooke, Sherbrooke, QC J1K 2R1, Canada, [email protected]
(3) Informatics Institute, Istanbul Technical University, 34605 Istanbul, Turkey, [email protected]

Abstract. In the recent literature, a number of studies have appeared on the quality assessment of sequencing data. These efforts, to a great extent, focused on reporting the statistical parameters regarding the distribution of the quality scores and/or the base-calls in a FASTQ file. We investigate another dimension for quality assessment, motivated by the fact that reads including long intervals having fewer errors improve the performance of the post-processing tools in the downstream analysis. Thus, the quality assessment procedures proposed in this study aim to analyze the segments of the reads that are above a certain quality. We define an interval of a read to be of desired quality when it contains at most k quality scores less than or equal to a threshold value v, for some k and v provided by the user. We present the algorithm to detect those ranges and introduce new metrics computed from their lengths. These metrics include the mean values of the longest, shortest, average, and cubic-average fragment lengths per read, as well as the coefficient of variation and the number of segments, computed over the fragments that are appropriate according to the k and v input parameters. We also provide a new software tool, QASDRA, for quality assessment of sequencing data via range analysis, which is available at https://github.com/ali-cp/QASDRA.git. QASDRA creates the quality assessment report of an input FASTQ file according to the user-specified k and v parameters. It also has the capability to filter out reads according to the metrics introduced.

Keywords: DNA sequencing data quality assessment · High-throughput DNA sequencing · Quality score

1 Introduction

With the spread of high-throughput DNA sequencing, today not only research centers, but also practitioners such as hospitals, clinics, and even individuals become customers of the sequencing centers. Each day, more sequencing data is produced than the day before. This brings a strong necessity to assess the quality of the generated data.

Previous studies [1-6] for the quality assessment of DNA sequencing data concentrated on extracting basic statistical properties such as the mean, median, and standard deviation values of the quality score distribution, where some of those efforts also included the statistical analysis of the base-call distributions, e.g., the GC or N content ratios.

It is well known that long intervals having fewer errors improve the performance of the post-processing tools in the downstream analysis of DNA sequencing data [7]. This suggests evaluating the quality of DNA sequencing data by analyzing the lengths of the fragments that are above a certain quality threshold. Such an assessment requires an explicit definition of desired quality on a read segment. We propose to identify the quality of the fragments using two parameters, v and k. The v parameter defines a threshold value such that quality scores less than or equal to v are assumed to be erroneous. In turn, the number of such errors allowed in an interval is limited by the parameter k. Based on the v and k parameters, the read segments that include at most k scores less than or equal to v are of desired quality. Finding such ranges has recently been studied in [8] as inverse range selection queries.

In this study, we focus on devising metrics based on analyzing the lengths of the intervals that include at most k quality scores less than or equal to v on the quality scores of the reads in an input FASTQ file. The proposed scheme computes a series of metrics for the quality assessment of the input file. We present QASDRA as a new quality assessment tool for DNA sequencing data based on these metrics. QASDRA creates an assessment report that includes the results, with various related plots, for the input FASTQ file according to the provided k, v parameters. Since FASTQ files can potentially be very large, random sampling of the reads with a user-specified percentage is possible with QASDRA. Additionally, filtering out the reads that are below the defined threshold is yet another capability of the developed tool.

The outline of the paper is as follows. We briefly review the previous studies in Sect. 2. Section 3 describes the proposed metrics along with the reasons for which they were devised. Before the conclusions, the empirical evaluations are given in Sect. 4, and finally, in Sect. 6, a sample report of QASDRA is provided.

2 Previous Studies

The major tools that have been proposed in the related literature for DNA sequencing data quality evaluation have focused on statistical distributions of the quality scores, the base-calls, or both. We provide a short review of those tools below.

PIQA [3] was proposed as an extension of the standard Illumina pipeline, particularly targeting the identification of various technical problems, such as defective files, mistakes in the sample/library preparation, and abnormalities in the frequencies of sequenced reads. With that purpose, it calculates statistics considering the distribution of the A-C-G-T bases. Both the base-calls and their quality scores are considered together.

SolexaQA [1] calculates sequence quality statistics and creates visual representations of data quality for second-generation sequencing data. The default metric is the mean quality score extracted from the reads, but users may also calculate the variance, minimum, and maximum quality scores observed. Additionally, the longest read segment satisfying a user-defined threshold for the minimum quality score is also provided. Based on this calculation, it supports trimming all reads such that only the longest segment satisfying the user-defined threshold remains. The longest-fragment detection provided in SolexaQA is a special case of one of our metrics; we discuss this issue in the related subsection of Sect. 3.

BIGpre [4] provides statistics such as the distributions of the mean qualities of the reads and the GC content. Its main contribution has been reported to be the extra features for alignment-free detection of duplicates in a read set.

The quality control and statistics tools in the NGS-QC toolkit [5] are yet another option for retrieving the fundamental statistics of the quality scores and the base-calls. The toolkit includes features to remove low-quality reads, decided according to the mean quality scores or the base-call distributions. Similar to SolexaQA, HTQC [2] performs quality assessment and filtration focusing on the statistical distribution of the quality scores throughout the input reads, with the main motivation of performing this process faster.

The FastQC [6] software is a commonly used quality control tool. It reports the basic statistics as well as the GC or N content, per base or per read, with a graphical user interface.

3 The Metrics and the Quality Assessment Method

The metrics we propose are based on detecting intervals of the reads that contain at most k quality scores less than or equal to a given threshold v, where k and v are parameters. This is akin to the inverse range selection queries recently investigated in [8]; we refer the reader to that paper for the algorithmic background of detecting those intervals. In this study, we focus on using this approach to provide a new way of evaluating the quality of sequencing data.

We denote the quality values of a read t by Q[t] = q1 q2 ... qℓt, where ℓt is the length of that read, and the total number of reads in the input FASTQ file is denoted by N. The inverse range selection query InvR(k, v) on Q[t] returns the set of maximal ranges {r1, r2, ..., rφt}, where each ri = ⟨si, ei⟩ denotes a maximal interval qsi qsi+1 ... qei of length |ri| = ei − si + 1 in which no more than k quality values less than or equal to v appear. Notice that ri = ⟨si, ei⟩ is a maximal range when the ⟨si, ei⟩ interval cannot be expanded either to the right or to the left without breaking this restriction. The number of such maximal ranges detected on Q[t] is denoted by φt, and the standard deviation of the maximal range lengths {|r1|, |r2|, ..., |rφt|} by σt.

We compute InvR(k, v) on each read based on the parameters k and v provided by the user, and then calculate the following metrics, which are the mean values of per-read quantities computed from the detected maximal range lengths (MRL). These proposed quality assessment metrics are summarized in Table 1.

Table 1. Proposed quality assessment metrics

Average longest maximal range length:         LMRLt = max{|r1|, ..., |rφt|};  ALMRL = (1/N) Σ_{t=1..N} LMRLt
Average shortest maximal range length:        SMRLt = min{|r1|, ..., |rφt|};  ASMRL = (1/N) Σ_{t=1..N} SMRLt
Grand average of maximal range lengths:       AMRLt = (|r1| + ... + |rφt|) / φt;  GAMRL = (1/N) Σ_{t=1..N} AMRLt
Average cubic means of maximal range lengths: CMMRLt = ((Σ_{i=1..φt} |ri|^3) / φt)^(1/3);  ACMRL = (1/N) Σ_{t=1..N} CMMRLt
Average of coefficients of variation of MRL:  CVMRLt = σt / AMRLt;  ACVMRL = (1/N) Σ_{t=1..N} CVMRLt
Average number of maximal ranges:             NMRt = φt;  ANMR = (1/N) Σ_{t=1..N} NMRt
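To illustrate how the maximal ranges can be enumerated, the following C++ sketch works directly from the positions whose quality score is at most v; this is our own straightforward formulation for illustration, not QASDRA's implementation, which builds on the inverse range selection machinery of [8].

#include <cstdint>
#include <utility>
#include <vector>

// Returns all maximal ranges [s, e] of q containing at most k scores <= v.
// If b_0 < b_1 < ... < b_{B-1} are the positions of the "low" scores, each
// maximal range is delimited by the low position just before its window of
// k allowed low scores and the low position just after it.
std::vector<std::pair<int, int>>
invRangeSelect(const std::vector<std::uint8_t>& q, int k, int v) {
    std::vector<int> low;                        // positions with q[i] <= v
    for (int i = 0; i < (int)q.size(); ++i)
        if (q[i] <= v) low.push_back(i);

    const int len = (int)q.size(), B = (int)low.size();
    std::vector<std::pair<int, int>> ranges;
    if (B <= k) {                                // the whole read qualifies
        if (len > 0) ranges.emplace_back(0, len - 1);
        return ranges;
    }
    for (int j = 0; j <= B - k; ++j) {
        int s = (j == 0) ? 0 : low[j - 1] + 1;   // one past the previous low
        int e = (j + k < B) ? low[j + k] - 1 : len - 1;
        if (s <= e) ranges.emplace_back(s, e);   // skip empty intervals
    }
    return ranges;
}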

ALMRL: The performance of the downstream processing of DNA sequences increases with longer intervals having fewer errors. For instance, it was shown in [1] that filtering the low-quality segments of the reads improves assembly performance. Thus, ALMRL aims to evaluate the quality by measuring the lengths of the longest segments that are defined to be of sufficient quality by the inverse range selection query. Notice that this metric is akin to the dynamic trimming of SolexaQA [1], which detects the longest read segment in which the minimum quality value is above a threshold. That process is a special case of the LMRL metric obtained by setting k to 1, whereas the proposed tool allows a variable number of bases, instead of 1, to be below the threshold value. This extension makes sense when one uses methods that can handle multiple errors on the reads; for example, alignment applications such as BWA [9], Bowtie [10], and others [11] have mechanisms to handle more than one error efficiently. Larger ALMRL scores indicate better quality. Since the perfect LMRL value of a read is its length, which indicates that at most k quality scores below v appear throughout the read, the best possible ALMRL score of a FASTQ file is its average read length.

ASMRL: The shortest maximal range on a read indicates the smallest distance in which there are k quality scores below v. This value becomes k in the worst case, where all the low-quality values appear consecutively. When the SMRL value of a read is significantly small, it means there is a burst error, where the erroneous values appear very close to each other. Thus, the ASMRL metric can be used to measure how uniformly the low-quality positions are distributed; when those erroneous positions do not appear close together, larger ASMRL values are expected.

GAMRL: There might be, and most probably will be, more than one maximal range on a read. The mean of those maximal range lengths is computed for each read, and the grand average of those means can be used to evaluate the overall performance of the sequencing process, assuming that higher GAMRL values indicate better quality. One may expect a given read to contain a segment of length GAMRL satisfying the queried ⟨v, k⟩ criteria. However, this measure is a bit coarse, and therefore we introduce additional metrics below to support more detailed analysis of the detected segments.

ACMRL: When the maximal range lengths are detected on a read, we would like to give more weight to the longer ones than to the short ones. Besides the GAMRL measure, we therefore also compute the cubic mean of the MRLs of a read. Notice that the cubic mean, which is the generalized mean ((1/n) Σ_{i=1..n} x_i^p)^(1/p) with p = 3, favors the longer MRL values. For instance, assume the detected MRLs on two different reads are ⟨20, 30, 40⟩ and ⟨30, 30, 30⟩. Although both averages are 30, the cubic means are 32.07 and 30, respectively. Here we can see the difference made by longer segments, which shows the power of longer reads. The ASMRL metric mainly evaluates the fragmentation and burst errors in the reads; however, even in cases of high fragmentation and burst errors, there might still be segments long enough to be helpful in downstream analysis. The ACMRL metric aims to provide a way of measuring the longest maximal range lengths relative to the number of maximal ranges detected. We would like to increase the weight of the longer maximal ranges, and thus tried generalized means with different exponents, where we empirically settled on the cubic mean as the best choice.

ACVMRL: The coefficient of variation (CV) is defined as the ratio of the standard deviation σ to the mean μ. The coefficient of variation is useful because its value is independent of the unit in which the measurement has been taken: it is a dimensionless number. For comparisons between data sets with different units or widely different means, one should use the coefficient of variation instead of the standard deviation. For example, a data set [100, 100, 100] has constant values; its standard deviation is 0 and its average is 100, so CV = 0. A data set [90, 100, 110] has more variability; its standard deviation is 8.165 and its average is 100, so CV = 0.08165. Finally, a data set [1, 5, 6, 8, 10, 40, 65, 88] has even more variability; its standard deviation is 30.78 and its average is 27.875, so CV = 1.104. Since each read has a different number of maximal ranges with different means, this concept captures the coherence of the data in terms of the computed MRLs, where higher values indicate less uniformity among the maximal range lengths. Thus, having a small ACVMRL is good in terms of quality and indicates that one may be more confident of observing the computed average values in a randomly selected read.

ANMR: The number of maximal ranges (NMR) simply counts the fragments that have been generated in each read due to the k and v values. If this number is large, low-quality bases occur spread out along the read, which causes many fragments to appear. In other words, this metric describes how close the low-quality bases are along the read: the lower (higher) this value is, the longer (shorter) the obtained segments are. By taking the average of these counts, we simply show how many fragments have been detected per read on average.
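Given the maximal range lengths of a single read, the per-read quantities of Table 1 reduce to a few lines of code; the sketch below (with illustrative names) uses the population standard deviation, which matches the CV examples above.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Per-read summaries of the maximal range lengths (MRLs) of Table 1.
struct ReadMetrics { double lmrl, smrl, amrl, cmmrl, cvmrl; std::size_t nmr; };

ReadMetrics summarize(const std::vector<int>& mrl) {  // assumes mrl non-empty
    ReadMetrics m{};
    m.nmr = mrl.size();
    double sum = 0.0, cube = 0.0;
    m.lmrl = m.smrl = mrl[0];
    for (int r : mrl) {
        m.lmrl = std::max<double>(m.lmrl, r);
        m.smrl = std::min<double>(m.smrl, r);
        sum += r;
        cube += double(r) * r * r;
    }
    m.amrl = sum / m.nmr;                       // plain mean of the MRLs
    m.cmmrl = std::cbrt(cube / m.nmr);          // cubic (generalized) mean
    double var = 0.0;
    for (int r : mrl) var += (r - m.amrl) * (r - m.amrl);
    m.cvmrl = std::sqrt(var / m.nmr) / m.amrl;  // sigma_t / AMRL_t
    return m;
}

For the MRLs ⟨20, 30, 40⟩ this yields a cubic mean of about 32.07, and for [90, 100, 110] a coefficient of variation of about 0.08165, in agreement with the examples above. Averaging these per-read values over all N reads gives the file-level metrics of Table 1.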

4 Results

We present the results of our evaluation scheme on files generated by different sequencing platforms with Illumina, IonTorrent, and PacBio equipment. We have used the individual NA12878 as published by the Coriell Cell Repository [12]. The data files used were, for Illumina [13], ERR091571_1.fastq and ERR091571_2.fastq concatenated into one file ERR091571.fastq, including 422,875,838 reads of constant length 101; for IonTorrent [14], SRR1238539.fastq, including 183,976,176 reads of lengths varying between 25 and 396; and for PacBio [15], chemistry_3, consisting of 8 FASTQ files concatenated into one, including 654,547 reads of lengths varying between 50 and 33,230. The results of the quality assessment with the proposed technique with different k and v parameters are given in Table 2.

Table 2. Quality assessment of the selected FASTQ files with different v and k values

File info                          k  v   ALMRL  ASMRL  GAMRL  ACMRL  ACVMRL  ANMR

Platform: Illumina                 2  20  94.13  84.25  86.64  88.94  0.21    4
Sequence data: NA12878             2  30  83.86  62.87  68.29  73.60  0.41    9
Name: ERR091571.fastq              3  20  95.10  87.02  89.56  90.99  0.16    4
Number of reads: 422875838         3  30  86.03  69.25  74.47  77.73  0.31    8
Quality scores: (2, 41)            4  20  95.79  88.33  90.31  92.04  0.13    4
Read lengths: (101, 101)           4  30  87.59  72.09  76.70  80.09  0.24    8

Platform: IonTorrent               2  20  55.92  3.49   14.57  26.23  1.16    47
Sequence data: NA12878             2  30  3.49   2.00   2.00   2.11   0.12    174
Name: SRR1238539.fastq             3  20  63.18  5.84   19.66  31.74  0.98    46
Number of reads: 183976176         3  30  4.54   3.00   3.00   3.12   0.10    173
Quality scores: (3, 38)            4  20  69.52  8.58   24.66  36.89  0.86    45
Read lengths: (25, 396)            4  30  5.58   4.00   4.00   4.14   0.08    172

Platform: PacBio                   2  7   55.46  2.01   11.11  17.00  0.79    1385
Sequence data: NA12878             2  11  22.14  2.00   4.13   6.38   0.74    3009
Name: chemistry_3.fastq            3  7   63.84  3.05   15.32  21.54  0.67    1384
Number of reads: 654547            3  11  25.70  3.00   6.00   8.32   0.60    3008
Quality scores: (0, 14)            4  7   71.73  4.13   19.52  26.04  0.61    1383
Read lengths: (50, 33230)          4  11  29.07  4.00   7.87   10.25  0.52    3007


We observed that the Illumina reads, according to the given parameters k = 2 and v = 20, include longer maximal ranges according to the ALMRL, ASMRL, and GAMRL metrics. On these measures, the IonTorrent and PacBio platforms returned similar results, particularly on ALMRL. The ASMRL results, 3.49 for IonTorrent and 2.01 for the PacBio data, indicate that on the selected data sets, whenever a quality score below 20 is observed, its adjacent neighbors are usually also below that quality, and hence the ASMRL values are that small. Considering the GAMRL metric, the tested IonTorrent data provides slightly longer contiguous blocks of desired quality. The average cubic means, 26.23 for IonTorrent and 17.00 for PacBio, indicate that although the ALMRL, ASMRL, and GAMRL values are close, the reads in the IonTorrent data include longer intervals than those in the PacBio data. However, the high ACVMRL value on IonTorrent shows that the PacBio data is more uniformly distributed. On the ACVMRL metric, Illumina shows a much nicer distribution. The ANMR values of these data also show how the low-quality bases are distributed along the reads, causing a large number of segmentations for IonTorrent and PacBio. Notice that on the PacBio and IonTorrent platforms, the shortest maximal range values are quite small, which means that, in general, the low-quality base-calls appear very close together. The larger ASMRL value on the Illumina data shows that the low-quality positions are spread more uniformly there.

In general, the quality of a read is likely to be assessed according to the average of the quality scores it includes. We would like to compare the newly proposed metrics against this mean quality score. With this purpose, we designed an experiment to demonstrate the capabilities of the newly introduced metrics. We start by aligning the reads with the BWA-MEM tool and observing the mapping ratio. Afterwards, we sort the reads according to each metric and filter out the worst X% from the FASTQ file, where X is roughly the unmapped ratio in the original file. For example, when we aligned the SRR1238539.fastq file of IonTorrent, the mapping rate was 97.66%; in other words, nearly 2.5% of the reads could not be aligned. We compute the average quality score of each read and sort the reads according to this metric first. We filter out the worst 2.5% and again run the aligner on the filtered reads. We expect a higher mapping ratio, since the reads that are filtered out are low-quality ones according to the "mean quality" metric. Our expectation was right: the mapping ratio on the filtered reads was 98.28%. Now we perform the same filtering, but use the LMRL metric instead of the mean quality while sorting the reads. After filtering out the worst 2.5% of the low-quality reads according to the LMRL metric, the new mapping ratio became 98.69% (for k = 2 and v = 20), which represents a better improvement than the mean-quality filtering. Similarly, we repeated the same sort, filter, and re-align steps with the remaining five metrics and observed the results shown in Table 3. Table 3 thus shows the benchmark results of the proposed metrics against the generally used mean quality assessment.
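The sort-and-filter step of this experiment can be sketched as follows; the function is our illustration of the procedure (rank the reads by a per-read score and drop the worst fraction), not QASDRA's actual filtering code.

#include <algorithm>
#include <cstddef>
#include <vector>

// Keeps the best (1 - x) fraction of reads according to a per-read score
// (e.g., LMRL or the mean quality); scores[i] belongs to read i.
// Returns the indices of the reads that survive the filter.
std::vector<std::size_t> filterWorst(const std::vector<double>& scores, double x) {
    std::vector<std::size_t> idx(scores.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    std::sort(idx.begin(), idx.end(),            // best score first
              [&](std::size_t a, std::size_t b) { return scores[a] > scores[b]; });
    idx.resize(idx.size() - std::size_t(x * idx.size()));  // drop the worst x
    return idx;
}

The surviving reads are then written back to FASTQ and realigned with BWA-MEM to obtain the mapping rates reported in Table 3.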

Table 3. Mapping rates of filtered FASTQ files with different k and v values

File info   Mapping rate  Mean quality  k  v   LMRL   SMRL   AMRL   CMMRL  CVMRL  NMR

Illumina    99.35         99.41         2  20  99.39  99.36  99.39  99.41  99.34  99.41
                                        3  20  99.39  99.36  99.39  99.41  99.34  99.41
                                        4  20  99.41  99.40  99.36  99.39  99.41  99.34

IonTorrent  97.66         98.28         2  20  98.69  97.65  98.22  98.59  98.26  97.76
                                        3  20  98.71  97.65  98.24  98.59  98.26  97.76
                                        4  20  98.73  97.65  98.31  98.59  98.26  97.76

PacBio      88.85         91.44         2  7   92.35  88.70  91.17  92.03  88.57  88.91
                                        3  7   92.42  88.70  91.18  92.00  88.57  88.91
                                        4  7   92.46  88.70  91.19  91.98  88.57  89.91

Based on these results, we can categorize our metrics into three groups:

– Metrics that improve the mapping rate compared to both the original file and the reference metric: LMRL and CMMRL.
– Metrics that improve the mapping rate compared to just the original file: AMRL.
– Metrics that in some cases improve and in some cases do not improve the mapping rate compared to the original file: SMRL, CVMRL, and NMR.

Moreover, from Table 3 we can see that for the metrics in the first and second groups, the mapping rate improves as the k value increases. On the other hand, the mapping rates associated with the third group's metrics do not vary for different k values. We observe that the longest maximal range length and cubic mean of the maximal range lengths metrics outperform the mean quality metric, especially on the IonTorrent and PacBio platforms. On Illumina, the NMR metric showed performance comparable to the mean quality score metric. These results concern the first step of the downstream analysis; depending on the application or further investigations, each metric has its own benefits.

5 Conclusion

The statistical properties of the distributions of both the quality scores and the base-calls of a sequencing experiment have been extensively explored in previous studies. We have presented an alternative approach to the quality assessment of sequencing data by analyzing the maximal ranges, which are defined as the longest segments in which no more than k scores are less than or equal to v. We have also shown how downstream analysis improves using this method. The software, developed in Python for the proposed metrics, is available at https://github.com/ali-cp/QASDRA.git for public use. Sequencing centers or their customers can use the tool to evaluate or benchmark their data. In the near future, it might become necessary to define international standards for good sequencing data, and we believe the approach presented in this study might help in creating such standards. The metrics introduced in this study may also serve for clustering/classifying the reads from different platforms or for assessing the overall success of sequencing centers.


Acknowledgements. We thank S. Andrews from Babraham Bioinformatics for providing us feedback regarding their FastQC software. This work has been partially supported by The Scientific and Technological Research Council of Turkey (TÜBİTAK) under grant number 114E293.

6 Appendix

QASDRA - Quality Assessment of Sequencing Data via Range Analysis
File Name: ERR091571.fastq / Date: Mon Jan 16 03:33:59 2017

Input Sequencing Data Digest:
  Quality Score Format:       33 ASCII-based Phred
  Quality Scores (min,max):   (2,41)
  Number of Reads:            422875838
  Processed Number of Reads:  422875838
  Read Length (min,max):      (101,101)

Computed QASDRA Vector for k = 4, v = 30:
  Average Longest Maximal Range Length:   87.59
  Average Shortest Maximal Range Length:  72.09
  Grand Average Maximal Range Length:     76.70
  Cubic Average Maximal Range Length:     80.09
  Average Coefficient of Variation:       0.24

developed by: [email protected]


References

1. Cox, M.P., Peterson, D.A., Biggs, P.J.: SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinform. 11(1), 485 (2010)
2. Yang, X., Liu, D., Liu, F., Wu, J., Zou, J., Xiao, X., Zhao, F., Zhu, B.: HTQC: a fast quality control toolkit for Illumina sequencing data. BMC Bioinform. 14(1), 1 (2013)
3. Martínez-Alcántara, A., Ballesteros, E., Feng, C., Rojas, M., Koshinsky, H., Fofanov, V., Havlak, P., Fofanov, Y.: PIQA: pipeline for Illumina G1 genome analyzer data quality assessment. Bioinformatics 25(18), 2438-2439 (2009)
4. Zhang, T., Luo, Y., Liu, K., Pan, L., Zhang, B., Yu, J., Hu, S.: BIGpre: a quality assessment package for next-generation sequencing data. Genomics Proteomics Bioinform. 9(6), 238-244 (2011)
5. Patel, R.K., Jain, M.: NGS QC toolkit: a toolkit for quality control of next generation sequencing data. PLoS ONE 7(2), e30619 (2012)
6. Andrews, S.: FastQC: a quality control tool for high throughput sequence data (2010). http://www.bioinformatics.babraham.ac.uk/projects/fastqc
7. Auwera, G.A., Carneiro, M.O., Hartl, C., Poplin, R., del Angel, G., Levy-Moonshine, A., Jordan, T., Shakir, K., Roazen, D., Thibault, J., et al.: From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline. Curr. Protoc. Bioinform. 43, 11.10.1-11.10.33 (2013)
8. Külekci, M.O.: Inverse range selection queries. In: Inenaga, S., Sadakane, K., Sakai, T. (eds.) SPIRE 2016. LNCS, vol. 9954, pp. 166-177. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46049-9_17
9. Li, H., Durbin, R.: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26(5), 589-595 (2010)
10. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10(3), 1 (2009)
11. Li, H., Homer, N.: A survey of sequence alignment algorithms for next-generation sequencing. Brief. Bioinform. 11(5), 473-483 (2010)
12. Coriell Institute: "NA12878," International HapMap Project. https://catalog.coriell.org/0/Sections/Search/Sample_Detail.aspx?Ref=GM12878
13. University of California, Berkeley: SMaSH: a benchmarking toolkit for variant calling. http://smash.cs.berkeley.edu/datasets.html
14. DNA Data Bank of Japan: "DDBJ FTP repository," DDBJ Center. ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA096/SRA096885/SRX517292
15. NCBI: 1000genomes. ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase3/integrated_sv_map/supporting/NA12878/pacbio/fastq/

A BLAS-Based Algorithm for Finding Position Weight Matrix Occurrences in DNA Sequences on CPUs and GPUs

Jan Fostier(B)

IDLab, Department of Information Technology, Ghent University - imec, Ghent, Belgium
[email protected]
http://idlab.ugent.be

Abstract. Finding all matches of a set of position weight matrices (PWMs) in large DNA sequences is a compute-intensive task. We propose a light-weight algorithm inspired by high performance computing techniques in which the problem of finding PWM occurrences is expressed in terms of matrix-matrix products which can be performed efficiently by highly optimized BLAS library implementations. The algorithm is easy to parallelize and implement on CPUs and GPUs. It is competitive on CPUs with state-of-the-art software for matching PWMs in terms of runtime while requiring far less memory. For example, both strands of the entire human genome can be scanned for 1404 PWMs in the JASPAR database in 41 min with a p-value of 10−4 using a 24-core machine. On a dual GPU system, the same task can be performed in under 5 min.

Keywords: Position weight matrix (PWM) · High performance computing (HPC) · Basic linear algebra subprograms (BLAS) · Graphics processing units (GPUs)

1 Introduction

Short biologically relevant patterns such as transcription factor binding sites are often represented using a position weight matrix (PWM), also referred to as a position-specific scoring matrix (PSSM) [1]. In contrast to consensus patterns, a PWM can model variability at each position in the pattern. A PWM representing a pattern of length m is a 4 × m matrix where each matrix element PWM(i, j) represents the log-likelihood of observing character i (0 = 'A'; 1 = 'C'; 2 = 'G'; 3 = 'T') at position j, taking into account the nucleotide composition of the background sequences. Given a sequence of length m, the PWM score of that sequence can be computed by summing over the PWM values that correspond to each nucleotide at each position in the sequence. Higher scores indicate a better correspondence to the pattern represented by the PWM.
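For reference, the per-window PWM score just described can be computed with the brute-force C++ sketch below; the position-major storage of the matrix and the mapping of unknown characters to 'T' are our simplifications for illustration.

#include <array>
#include <cstddef>
#include <string>
#include <vector>

// A PWM of length m stored position-major: pwm[j][c] is the log-likelihood
// of base c (0 = A, 1 = C, 2 = G, 3 = T) at pattern position j.
using PWM = std::vector<std::array<float, 4>>;

// Brute-force PWM score of the length-m window of seq starting at pos.
float pwmScore(const std::string& seq, std::size_t pos, const PWM& pwm) {
    auto code = [](char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            default:  return 3;   // 'T' (and, simplifying, anything else)
        }
    };
    float score = 0.0f;
    for (std::size_t j = 0; j < pwm.size(); ++j)
        score += pwm[j][code(seq[pos + j])];
    return score;
}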


Given an input sequence of length n and a PWM of length m, the PWM matching problem involves the identification of all matches of the PWM, i.e., subsequences for which the PWM score exceeds a user-defined threshold. A brute-force approach simply involves the computation of the PWM score at all positions in the input sequence and hence has a time complexity of O(nm). More complex algorithms for PWM matching build upon ideas that were initially developed for exact pattern matching and rely on the preprocessing of the input sequence and/or the preprocessing of the search matrix. In [2], a suffix tree is constructed from the input sequence and PWM matches are found using a depth-first traversal of the tree up to depth m. By using a lookahead scoring technique, subtrees that contain no PWM matches can be detected and discarded from the search procedure. A similar methodology has been implemented in PoSSuM [3], where an enhanced suffix array is used as a more memory-friendly alternative to suffix trees. Both methods, however, have the disadvantage of requiring O(n) memory to build and store the index structure. In [4], the Morris-Pratt and Knuth-Morris-Pratt algorithms are extended to PWM matching. Similarly, in [5], the Aho-Corasick, filtration, and super-alphabet techniques developed for exact string matching are generalized to PWM matching and further extended to the case where matches of multiple PWMs are searched for [6]. These algorithms are implemented in the MOODS software package [7]. Finally, in [8], some of these algorithms are implemented on graphics processing unit (GPU) architectures.

Compared with the naive brute-force algorithm, these more complex PWM matching algorithms reduce the runtime by eliminating parts of the search space that are guaranteed not to contain matches. As such, their runtime depends on the PWM threshold that is used: the higher this threshold, the more the PWM matching problem approaches that of exact pattern matching, with its O(n + m) time complexity. Because for most practical problems m takes values between 5 and 15 while n is very large, they yield a speedup of approximately one order of magnitude over the O(nm) brute-force algorithm.

In this contribution, we describe an orthogonal strategy to accelerate the brute-force algorithm. Our approach does not reduce the search space but rather improves the speed at which the brute-force algorithm can be evaluated. This is done by expressing the PWM matching problem in terms of matrix-matrix products. It is well known that matrix-matrix multiplications can be evaluated very efficiently on modern, cache-based CPUs using highly optimized Basic Linear Algebra Subroutines (BLAS) library implementations [9]. These BLAS implementations leverage SIMD (single instruction, multiple data) operations and maximally exploit spatial and temporal locality of reference, thus ensuring that most data accesses are satisfied from cache memory. As such, matrix-matrix products are among a select class of algorithms that can be evaluated with a performance that approaches the theoretical peak performance of a CPU. Optimized BLAS library implementations are provided by all major CPU vendors. Alternatively, open-source implementations such as ATLAS [10] or GotoBLAS [11] can be considered. We found that the BLAS-based approach yields a 5× to 6.4× speedup over a naive implementation of the brute-force algorithm.


Additionally, the proposed BLAS-based algorithm has minimal memory requirements, whereas more complex algorithms may require tens of GBytes of memory for large problem sizes. Finally, we also present an implementation of the BLAS-based algorithm that leverages the cuBLAS library to perform the matrix-matrix multiplications on graphics processing units (GPUs). We demonstrate that on a dual-GPU system, this yields an additional 10× speedup compared with using a 24-core CPU system. Using this GPU system, we report speedups of up to 43× compared with the state-of-the-art MOODS software package. An open-source implementation of the algorithm is available at https://github.com/biointec/blstools.

2 Algorithm Description

We consider the PWM matching problem in the general case where we have multiple PWMs over a DNA alphabet. The goal is to recast the naive algorithm into an algorithm that relies on matrix-matrix multiplications. In essence, this procedure involves three matrices:

– A pattern matrix P that contains all of the PWMs.
– A sequence matrix S that contains some sequence content.
– A result matrix R that is computed as R = P ∗ sub(S) and that contains the PWM scores of all PWMs at some positions in the sequence. The routine sub(.) denotes that a submatrix of S is used.

Below, we describe each matrix in detail. Figure 1 provides an overview of the algorithm.

2.1 Pattern Matrix P

The pattern matrix P is built once and remains fixed during the course of the algorithm. Matrix P has dimensions c × 4m, where c denotes the total number of PWMs and m = max_i(m_i) refers to the maximum PWM length, with m_i denoting the length of PWM_i. Every row of P corresponds to a single PWM. The values in a row of P are obtained by unrolling the values of the corresponding PWM. For PWMs shorter than m characters, trailing zeros are appended to the corresponding row in P. Formally:

    P(i, j) = PWM_i(j mod 4, ⌊j/4⌋)   if 0 ≤ j < 4m_i
    P(i, j) = 0                       if j ≥ 4m_i                    (1)

for all 0 ≤ i < c. In case PWM occurrences on both strands of the input sequence(s) need to be identified, an additional c rows can be added to matrix P that represent the reverse complement of each PWM.
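A direct row-major construction of P following Eq. (1) might look as follows (a sketch with illustrative names, reusing the position-major PWM type of the earlier sketch):

#include <array>
#include <cstddef>
#include <vector>

using PWM = std::vector<std::array<float, 4>>;  // pwm[position][base]

// Builds the c x 4m pattern matrix of Eq. (1) in row-major order: row i
// holds PWM_i unrolled position by position and padded with zeros.
std::vector<float> createPatternMatrix(const std::vector<PWM>& pwms,
                                       std::size_t m /* max PWM length */) {
    const std::size_t c = pwms.size(), cols = 4 * m;
    std::vector<float> P(c * cols, 0.0f);       // trailing zeros by default
    for (std::size_t i = 0; i < c; ++i)
        for (std::size_t j = 0; j < 4 * pwms[i].size(); ++j)
            P[i * cols + j] = pwms[i][j / 4][j % 4];   // PWM_i(j mod 4, j/4)
    return P;
}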


Fig. 1. The result matrix R is computed as the matrix-matrix product of pattern matrix P and a submatrix of sequence matrix S. Each row in P represents a single PWM. Matrix S represents (part of) the input sequence. Each element in R contains a PWM score at some position in the input sequence.

2.2 Sequence Matrix S

The sequence matrix S has dimensions 4(h + m − 1) × w, where h and w can be arbitrarily chosen ≥ 1 and where m again represents the maximum PWM length. The matrix S is used to encode (part of) the input sequence(s) S_DNA of exactly hw + m − 1 nucleotides. First, the string S_DNA is converted into an array S_enc of 4(hw + m − 1) zeros and ones by simply replacing character A by 1000; C by 0100; G by 0010; and T by 0001. Formally:

    S_enc(i) = 1 if S_DNA(⌊i/4⌋) = A and i mod 4 = 0
    S_enc(i) = 1 if S_DNA(⌊i/4⌋) = C and i mod 4 = 1
    S_enc(i) = 1 if S_DNA(⌊i/4⌋) = G and i mod 4 = 2
    S_enc(i) = 1 if S_DNA(⌊i/4⌋) = T and i mod 4 = 3
    S_enc(i) = 0 otherwise                                          (2)


for all 0 ≤ i < 4(hw + m − 1). The matrix S is constructed from this temporary array as follows:

    S(i, j) = S_enc(4hj + i)                                        (3)

for all 0 ≤ i < 4(h + m − 1) and 0 ≤ j < w. Every column in S contains a contiguous subarray of S_enc and thus encodes a substring of S_DNA. The bottom 4(m − 1) elements of column j are identical to the top 4(m − 1) elements of column j + 1. In other words, subsequent columns of S encode overlapping substrings of S_DNA with an overlap of m − 1 characters.
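Equations (2) and (3) translate directly into code; the sketch below (our illustration, not the blstools source) builds S in column-major order, which is convenient because the submatrices S([4o, 4(o + m)[, :) of the next subsection then become simple pointer offsets.

#include <cstddef>
#include <string>
#include <vector>

// One-hot encodes S_DNA as in Eq. (2) and lays the result out as the
// column-major 4(h+m-1) x w sequence matrix S of Eq. (3): column j holds
// S_enc[4hj .. 4hj + 4(h+m-1) - 1], so adjacent columns overlap by m - 1
// characters. The string sdna must contain exactly hw + m - 1 characters.
std::vector<float> createSequenceMatrix(const std::string& sdna,
                                        std::size_t h, std::size_t w,
                                        std::size_t m) {
    std::vector<float> senc(4 * sdna.size(), 0.0f);
    for (std::size_t i = 0; i < sdna.size(); ++i) {
        switch (sdna[i]) {
            case 'A': senc[4 * i + 0] = 1.0f; break;
            case 'C': senc[4 * i + 1] = 1.0f; break;
            case 'G': senc[4 * i + 2] = 1.0f; break;
            case 'T': senc[4 * i + 3] = 1.0f; break;
        }
    }
    const std::size_t rows = 4 * (h + m - 1);
    std::vector<float> S(rows * w);
    for (std::size_t j = 0; j < w; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            S[j * rows + i] = senc[4 * h * j + i];  // S(i, j) = S_enc(4hj + i)
    return S;
}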

2.3 Result Matrix R

The result matrix R has dimensions c × w and is computed as the matrix-matrix product of matrix P with a submatrix of S. Given an offset o with 0 ≤ o < h, R_o is computed as follows:

    R_o = P ∗ S([4o, 4(o + m)[, :)                                  (4)

where the notation S([4o, 4(o + m)[, :) refers to the 4m × w submatrix of S whose first row corresponds to the row with index 4o in S. Every element in R_o is thus computed as the dot product of a row in P and (part of) a column in S. The elements of S (zeros and ones) are multiplied with the elements of the PWM and thus generate the terms that, when added, correspond to the PWM score. As such, element R_o(i, j) contains the score for PWM_i at position (hj + o) in S_DNA.

Algorithm 1 then provides a complete description of the workflow. In the outer for-loop, a portion of the input sequence(s) of length hw + m − 1 is read into S_DNA. In the inner for-loop, the PWM scores are exhaustively computed for all c PWMs at the first hw positions of S_DNA. Therefore, the S_DNA strings at consecutive outer for-loop iterations overlap by m − 1 nucleotides.

Algorithm 1. BLAS-based PWM occurrence detection

Input: Sequence S_input (DNA sequence)
Input: PWMs = {PWM_i} (set of PWMs)
Input: thresholds = {threshold_i} (set of thresholds)
 1: P ← createPatternMatrix(PWMs)
 2: for pos = 0 to length(S_input) − 1 step hw do
 3:     S_DNA ← S_input[pos, pos + hw + m − 1[
 4:     S_enc ← encodeString(S_DNA)
 5:     S ← createSequenceMatrix(S_enc)
 6:     for o = 0 to h − 1 step 1 do
 7:         R_o ← P ∗ S([4o, 4(o + m)[, :)
 8:         reportOccurrences(R_o, o, thresholds)
 9:     end for
10: end for
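For concreteness, one inner-loop iteration (line 7 of Algorithm 1) could be issued through the CBLAS interface as in the sketch below; the paper links against MKL's sgemm, and the exact call in blstools may differ.

#include <cblas.h>

// Computes R_o = P * S([4o, 4(o+m)[, :) with single-precision GEMM.
// P is a row-major c x 4m matrix; S is column-major with 4(h+m-1) rows,
// so selecting rows [4o, 4(o+m)) is just a pointer shift of 4o floats.
// In CBLAS row-major terms, column-major S is the transpose of a row-major
// w x 4(h+m-1) matrix, hence CblasTrans on the B operand.
void scoreOffset(const float* P, const float* S, float* R,
                 int c, int m, int w, int h, int o) {
    const int rowsS = 4 * (h + m - 1);    // column stride of S
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                c, w, 4 * m,              // M, N, K
                1.0f, P, 4 * m,           // A = P, lda = 4m
                S + 4 * o, rowsS,         // B = shifted S, ldb = 4(h+m-1)
                0.0f, R, w);              // C = R, row-major c x w
}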


Note that in case the input data consists of multiple DNA sequences, these sequences can be concatenated when generating S_DNA. With minimal extra bookkeeping, one can prevent the reporting of occurrences that span adjacent DNA sequences.

2.4 Implementation Details

The algorithm is implemented in C++. Multithreading support is added through C++11 threads by parallelizing the outer for-loop in Algorithm 1. The BLAS sgemm routine [9] was used to perform the matrix-matrix multiplications using single-precision computations. Recall that PWMs with a length shorter than the maximum PWM length m are represented in the pattern matrix P by adding trailing zeros to the corresponding row. In case many PWMs have a length that is substantially shorter than m, a large fraction of P consists of zero elements. In turn, this creates overhead during the matrix-matrix product when computing the result matrix R, due to the inclusion of many terms with value zero. This overhead can easily be reduced by representing the PWMs in P in sorted order (sorted according to length). The matrix-matrix product R_o = P ∗ sub_o(S) can then be computed as a number of smaller matrix-matrix products as follows:

    R([c_i, c_{i+1}[, :) = P([c_i, c_{i+1}[, [0, 4m_i[) ∗ S([4o, 4(o + m_i)[, :)      (5)

where the interval [c_i, c_{i+1}[ corresponds to a subset of the rows in R and P, and where m_i denotes the maximum PWM length in that range. When m_i < m, overhead is reduced. For the JASPAR dataset (see description below), the idea is clarified in Fig. 2. The pattern matrix P represents 1404 PWMs with lengths

Fig. 2. Example of a pattern matrix P containing 1404 JASPAR PWMs where many rows contain trailing zeros because of differences in length of the corresponding PWMs. Matrix P can be subdivided in a number of smaller submatrices P i (shaded areas) that each contain less zero fill.


between 5 and 30 and thus exhibits substantial zero fill. By subdividing P into 10 submatrices, each representing 140 or 141 PWMs, the number of elements of P used in the matrix-matrix multiplication is more than halved. Note that the submatrices P([c_i, c_i+1[, [0, 4m_i[) should not become too thin, so that the evaluation of (5) still corresponds to a meaningful matrix-matrix product. In other words, one could avoid zero fill altogether by subdividing P into c different vectors; however, this would result in a loss of temporal cache locality and hence a considerable loss of speed. For the same performance reasons, the parameters h and w that govern the dimensions of matrix S should not be chosen too small. In our implementation, we set h = 250 and w = 1000, such that matrix S corresponds to a (1000 + 4(m − 1)) × 1000 matrix. Finally, note that BLAS routines have full support to specify submatrix ranges without any need to explicitly copy these submatrices into separate data structures.
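Following Eq. (5), a hedged C++ sketch of this blocked evaluation is given below: one sgemm call per length-sorted block of PWMs, addressing each submatrix through pointer offsets and leading dimensions rather than explicit copies, as the text describes. The block boundaries cSplit (the c_i values) and the per-block maximum lengths mMax (the m_i values) are assumed to be precomputed elsewhere.

```cpp
// Sketch of Eq. (5), assuming row-major matrices and precomputed block
// boundaries; P has full width 4m, but each block only uses its first
// 4*m_i columns.
#include <cblas.h>
#include <vector>

void computeRoBlocked(const std::vector<float>& P, const std::vector<float>& S,
                      std::vector<float>& Ro,
                      const std::vector<int>& cSplit, // c_0 = 0 < c_1 < ... < c_B = c
                      const std::vector<int>& mMax,   // m_i for each block
                      int m, int w, int o) {
    for (size_t b = 0; b + 1 < cSplit.size(); ++b) {
        const int rows = cSplit[b + 1] - cSplit[b]; // PWMs in this block
        const int mi   = mMax[b];                   // max PWM length in block
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    rows, w, 4 * mi, 1.0f,
                    // A: rows [c_i, c_i+1[ of P, first 4*m_i columns (lda = 4m)
                    P.data() + static_cast<size_t>(cSplit[b]) * 4 * m, 4 * m,
                    // B: rows [4o, 4(o + m_i)[ of S
                    S.data() + static_cast<size_t>(4 * o) * w,         w,
                    0.0f,
                    // C: rows [c_i, c_i+1[ of Ro
                    Ro.data() + static_cast<size_t>(cSplit[b]) * w,    w);
    }
}
```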

2.5 GPU Version

Through the use of the cuBLAS library [12], it is possible to execute the matrix-matrix multiplication on a graphics processing unit (GPU). The pattern matrix P is copied to the GPU memory only once, while a new sequence matrix S is copied during each outer for-loop iteration in Algorithm 1. To avoid copying the entire result matrix Ro from GPU memory to system RAM after each matrix-matrix product during each inner for-loop iteration, a kernel was developed in the CUDA language to report only the matrix indices (i, j) for which Ro(i, j) exceeds the threshold score for PWMi (a task known as stream compaction). Only those indices are copied from GPU to system RAM, thus minimizing data movements between GPU and host memory. Occurrences are written to disk by the CPU. Note that the programming effort to port the BLAS-based algorithm from CPU to GPU is minimal, as most tasks are handled by CUDA library calls (e.g. copying data between CPU and GPU, calling cublasSgemm, . . .). The only exception is the stream compaction kernel itself, which consists of 7 lines of CUDA code.
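The stream compaction kernel itself is not reproduced in the text; the following is a hypothetical CUDA sketch of such a kernel, in which each thread inspects one element of Ro and appends hits through an atomic counter. Only numHits and the first numHits entries of hits would then be copied back to host memory; the order of the reported hits is non-deterministic, which is acceptable since occurrences form a set.

```cuda
// Hypothetical stream-compaction kernel (the original 7-line kernel is not
// shown in the paper). Launch with >= c*w threads in total, e.g.
// <<<(c*w + 255)/256, 256>>>, with *numHits zeroed beforehand.
__global__ void compactOccurrences(const float* R, int c, int w,
                                   const float* thresholds,
                                   int2* hits, int* numHits) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= c * w) return;
    int i = idx / w;                       // PWM index (row of Ro)
    int j = idx % w;                       // sequence position (column of Ro)
    if (R[idx] >= thresholds[i]) {
        int slot = atomicAdd(numHits, 1);  // reserve one output slot
        hits[slot] = make_int2(i, j);
    }
}
```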

3 Benchmark Results and Discussion

The performance of the BLAS-based algorithm was benchmarked against (i) a naive, scalar algorithm and (ii) the MOODS software package [7]. To ensure a fair comparison, the naive algorithm adheres to the same code quality standards as the BLAS-based algorithm, the only difference being that three nested for-loops are used to scan for the occurrences: one for-loop over the input sequence, a second one over the different PWMs, and a third for-loop to compute the PWM score. The C++ source code was compiled against the Intel Math Kernel Library (MKL) version 2017.1.132, which implements optimized BLAS routines for Intel CPUs. In all cases, multi-threading within the MKL was disabled. In other words,


individual calls to sgemm were always executed in a single-threaded manner, but multiple calls to sgemm are issued by different threads concurrently. The CUDA code was compiled with the nvcc compiler and linked against cuBLAS from CUDA SDK version 8.0. From the JASPAR database [13], 1404 position frequency matrices were downloaded. As a sequence dataset, the human genome reference sequence (HG38) from the GATK Resource Bundle was used. Part of the tests were run only on chromosome 1 (230 Mbp). We scanned for PWM occurrences on both strands of the DNA sequences by also including the reverse complements of the PWM matrices. Thus, effectively, 2808 PWM matrices were used in total. The benchmarks were run on a node containing two 12-core Intel E5-2680v3 CPUs (24 CPU cores in total) running at 2.5 GHz with 64 GByte of RAM. The CPU is of the Haswell-EP architecture and provides AVX-256 instructions that can process 8 single-precision floating-point numbers in a single instruction. For configurations with p-value = 10−4, MOODS required >64 GByte of RAM. Those runs were performed on a single core of a system containing two 10-core Intel E5-2660v3 CPUs running at 2.6 GHz with 128 GByte of RAM. When performing the benchmarks with fewer threads than CPU cores, the remaining CPU cores were idle. The GPU runs were performed on a system with a dual nVidia 1080 Ti GPU configuration. Runtime (wall clock time) and peak resident memory use were measured using the Linux /usr/bin/time -v tool.

Table 1 shows the runtime, memory use and parallel efficiency for the different approaches when considering chromosome 1 of the human genome as input dataset.

Table 1. Benchmark results of the naive algorithm, the MOODS algorithm and the proposed BLAS-based algorithm (on CPU and GPU). In all cases, the occurrences of 1404 JASPAR PWMs were searched on both strands of human chromosome 1 for two different PWM thresholds (p-value = 10−5 and 10−4).

            p-value 10−5                                  p-value 10−4
No. cores   Wall clock  Parallel  Parallel    Memory      Wall clock  Parallel  Parallel    Memory
            time        speedup   efficiency  (GByte)     time        speedup   efficiency  (GByte)
Naive algorithm (24-core CPU system)
 1          21 999 s    -         -           0.01        22 024 s    -         -           0.01
 4           5 495 s    4.0       100%        0.01         5 506 s    4.00      100%        0.01
 8           2 752 s    7.99      100%        0.01         2 755 s    7.99      100%        0.01
24             926 s    23.76     99%         0.01           921 s    23.91     100%        0.01
MOODS (CPU system)
 1             402 s    -         -           19.02        1 028 s    -         -           64.89
BLAS-based algorithm (24-core CPU system)
 1           3 441 s    -         -           0.04         3 582 s    -         -           0.04
 4             871 s    3.95      99%         0.11           889 s    4.03      101%        0.12
 8             479 s    7.18      90%         0.20           473 s    7.57      95%         0.22
24             179 s    19.22     80%         0.59           183 s    19.57     82%         0.66
BLAS-based algorithm (dual GPU system)
 -              24 s    -         -           0.49            25 s    -         -           0.58


Even though it has perfect scaling behavior with respect to the number of CPU cores used and negligible memory use, the naive algorithm is also the slowest. MOODS has very good performance in terms of runtime, especially when taking into account that the software is single-threaded. However, it has much higher memory requirements, over 64 GByte of RAM. Additionally, both the runtime and memory use depend on the PWM thresholds that are used: more relaxed thresholds (i.e., higher p-values) result in additional resource requirements. The BLAS-based algorithm also shows very good multi-threading scaling behavior and outperforms the naive algorithm by a factor between 5× and 6.4× while still maintaining very low memory requirements. Compared with MOODS, the BLAS-based algorithm is slower when using only a single thread but outperforms the latter when using multiple cores. Additionally, like the naive algorithm, its resource requirements do not depend on the p-value that is used. Finally, the BLAS-based algorithm attains maximal performance when executed on the GPU system.

Table 2 shows runtime and memory use when considering the entire human genome as input dataset. Due to its dependence on the p-value, MOODS has runtimes ranging from 43 min to 3.5 h and memory requirements ranging from 20 GByte to 103 GByte. In contrast, the BLAS-based algorithm has a runtime that is nearly constant and requires very little memory. On the GPU system, the BLAS-based algorithm shows speedups of 9.5× and 43× over MOODS. Finding the occurrences of a PWM in a sequence can be seen as an imprecise string matching problem. When only the very best PWM matches are needed (by using a low p-value and hence a high PWM score threshold), the problem eventually approaches that of exact string matching, for which very efficient algorithms have been designed by either indexing the sequence or preprocessing the patterns. These algorithms yield O(n + m) time complexity instead of the brute-force O(nm). Nevertheless, for less strict p-values, these algorithms perform considerably worse because they cannot a priori eliminate large parts of the search space.

Table 2. Benchmark results of the MOODS algorithm and the proposed BLAS-based algorithm (on CPU and GPU). In all cases, the occurrences of 1404 JASPAR PWMs were searched on both strands of the entire human genome for three different PWM thresholds (p-value = 10−6, 10−5 and 10−4).

            p-value 10−6            p-value 10−5            p-value 10−4
No. cores   Wall clock    Memory    Wall clock    Memory    Wall clock       Memory
            time          use (GB)  time          use (GB)  time             use (GB)
MOODS (CPU system)
 1          43 min 8 s    20.71     71 min 42 s   30.25     3 h 35 min 26 s  103.20
BLAS-based algorithm (24-core CPU system)
24          36 min 39 s   0.61      37 min 29 s   0.78      40 min 50 s      3.49
BLAS-based algorithm (dual GPU system)
 -          4 min 29 s    0.51      4 min 33 s    0.73      4 min 57 s       3.86


Even though the proposed BLAS-based algorithm does not reduce the search space, it has several advantages:

– The runtime is independent of the chosen p-value, and hence of the number of occurrences that are found, at least for as long as writing the occurrences to disk does not become a bottleneck of the system.
– The memory use of the proposed algorithm is negligible and again independent of the chosen p-value. In our configuration, we effectively use only a few MBytes of RAM per thread. All matrices involved are thread-local and hence the multi-threaded algorithm scales very well to a high number of CPU cores, even on non-uniform memory architectures (NUMA).
– As the vast majority of the compute time is spent inside the BLAS library, the performance of the code is fully dependent on the quality of the BLAS implementation. As CPU vendors provide optimized BLAS libraries for their hardware, optimal performance is guaranteed on all systems, including future ones. For example, AVX-512 instructions will be available on the next generations of CPUs and will thus offer doubled performance compared to the AVX-256 system used in the benchmarks. Additionally, support for half-precision floating-point computations is increasingly being adopted and might also double the throughput.
– Arguably, the implementation of the algorithm is very simple.
– The algorithm is easily portable to GPUs through the use of the cuBLAS library, which enables very high-performance matrix-matrix multiplications on GPUs. As the peak performance of modern GPUs exceeds that of CPUs, one can observe very high performance on GPUs. The same argument holds for other co-processors/hardware accelerators.

4 Conclusion

We proposed a conceptually simple and easy-to-implement algorithm to identify position weight matrix matches in DNA sequences. The algorithm performs a brute-force evaluation of all PWM matrices at all possible starting positions in the DNA sequences; however, these evaluations are expressed entirely through matrix-matrix multiplications. On modern, cache-based CPUs that provide SIMD instructions, matrix-matrix products can be evaluated very efficiently through the use of highly optimized BLAS libraries. As a consequence, the BLAS-based algorithm outperforms the naive algorithm by a factor of 5 to 6.4. The runtime of the proposed algorithm is independent of the p-value, and hence of the PWM score threshold that is used, and the algorithm requires only very low amounts of memory. Additionally, it is trivial to parallelize and exhibits good scaling behavior. Compared with the state-of-the-art MOODS software package, which implements more sophisticated online search algorithms that reduce the search space, the proposed BLAS-based algorithm is competitive in terms of runtime while requiring less memory. On GPU systems, the BLAS-based algorithm attains maximal performance and outperforms CPU-based algorithms by a large factor.


Acknowledgments. The computational resources (Stevin Supercomputer Infrastructure) and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by Ghent University, FWO and the Flemish Government – department EWI.

References

1. Stormo, G.D.: DNA binding sites: representation and discovery. Bioinformatics 16(1), 16–23 (2000)
2. Dorohonceanu, B., Nevill-Manning, C.G.: Accelerating protein classification using suffix trees. In: Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, 19–23 August 2000, La Jolla/San Diego, CA, USA, pp. 128–133 (2000)
3. Beckstette, M., Homann, R., Giegerich, R., Kurtz, S.: Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinf. 7(1), 389+ (2006)
4. Liefooghe, A., Touzet, H., Varré, J.-S.: Self-overlapping occurrences and Knuth-Morris-Pratt algorithm for weighted matching. In: Dediu, A.H., Ionescu, A.M., Martín-Vide, C. (eds.) LATA 2009. LNCS, vol. 5457, pp. 481–492. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-00982-2_41
5. Pizzi, C., Rastas, P., Ukkonen, E.: Fast search algorithms for position specific scoring matrices. In: Hochreiter, S., Wagner, R. (eds.) BIRD 2007. LNCS, vol. 4414, pp. 239–250. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-71233-6_19
6. Pizzi, C., Rastas, P., Ukkonen, E.: Finding significant matches of position weight matrices in linear time. IEEE/ACM Trans. Comput. Biol. Bioinf. 8(1), 69–79 (2011)
7. Korhonen, J., Martinmäki, P., Pizzi, C., Rastas, P., Ukkonen, E.: MOODS: fast search for position weight matrix matches in DNA sequences. Bioinformatics 25(23), 3181–3182 (2009)
8. Giraud, M., Varré, J.S.: Parallel position weight matrices algorithms. Parallel Comput. 37(8), 466–478 (2011)
9. Dongarra, J.J., Du Croz, J., Hammarling, S., Duff, I.S.: A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw. 16(1), 1–17 (1990)
10. Whaley, R.C., Dongarra, J.J.: Automatically tuned linear algebra software. In: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, SC 1998, pp. 1–27. IEEE Computer Society, Washington, DC (1998)
11. Goto, K., van de Geijn, R.A.: Anatomy of high-performance matrix multiplication. ACM Trans. Math. Softw. 34(3), 12:1–12:25 (2008)
12. Cook, S.: CUDA Programming: A Developer's Guide to Parallel Computing with GPUs, 1st edn. Morgan Kaufmann Publishers Inc., San Francisco (2013)
13. Mathelier, A., Fornes, O., Arenillas, D.J., Chen, C.Y.Y., Denay, G., Lee, J., Shi, W., Shyr, C., Tan, G., Worsley-Hunt, R., Zhang, A.W., Parcy, F., Lenhard, B., Sandelin, A., Wasserman, W.W.: JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 44(D1), D110–D115 (2016)

Analyzing the Differences Between Reads and Contigs When Performing a Taxonomic Assignment Comparison in Metagenomics

Pablo Rodríguez-Brazzarola, Esteban Pérez-Wohlfeil, Sergio Díaz-del-Pino, Ricardo Holthausen, and Oswaldo Trelles(✉)

University of Malaga, Campus de Teatinos, 29071 Malaga, Spain
{pabrod,estebanpw,sergiodiazdp,ricardoholthausen,ortrelles}@uma.es
http://www.bitlab.es

Abstract. Metagenomics is an inherently complex field in which one of the primary goals is to determine the composition of organisms present in an environmental sample. To this end, diverse tools have been developed that are based on the similarity search results obtained from comparing a set of sequences against a database. However, to achieve this goal there are still issues to solve, such as dealing with genomic variants and detecting repeated sequences that could belong to different species in a mixture of uneven and unknown representation of organisms in a sample. Hence, the question arises of whether analyzing a sample with reads provides further understanding of the metagenome than with contigs. Assembly yields larger genomic fragments but bears the risk of producing chimeric contigs. On the other hand, reads are shorter and therefore their statistical significance is harder to assess, but there is a larger number of them. Consequently, we have developed a workflow to assess and compare the quality of each of these alternatives. Synthetic read datasets belonging to previously identified organisms are generated in order to validate the results. Afterwards, we assemble these into a set of contigs and perform a taxonomic analysis on both datasets. The tools we have developed demonstrate that analyzing with reads provides a more trustworthy representation of the species in a sample than contigs, especially in cases that present high genomic variability.

Keywords: Taxonomic assignment · Sequencing analysis · Metagenomic comparison

1 Introduction

A drastic reduction of time and cost per sequencing experiment has taken place, dropping from $10,000 per megabase down to a few cents, due to the major breakthroughs in sequencing technologies that have occurred in the last decades [1]. These techniques produce a huge amount of data, overcoming the


data generation problem, which was the main barrier during the early Genomic Era. Biologists are now facing a torrent of data, which has paved the way towards the analysis of numerous unknown biological communities and research in pioneering scientific areas such as metagenomics (beyond genomes). The goal of metagenomics is to study microbial communities, also known as microbiotas, in their natural environment, without the need to isolate and cultivate the species that make up such a community. This field brings a profound transformation to multiple fields, such as biology, medicine, ecology, agriculture, and biotechnology [2]. Despite these benefits, metagenomic sequence data presents several challenges. For instance, most communities are so diverse that most genomes are not fully represented by reads. The difficulty of performing direct comparisons through sequence alignment is even greater because distinct reads from the same gene may not overlap. However, when they do overlap, it is not always noticeable whether they are from the same or different genomes, challenging the sequence assembly. Additionally, the informatic analysis is more complicated when dealing with poor quality reads, detecting repeated sequences from similar organisms, and handling genomic variants or species that have not yet been sequenced, within a sample in which the representation of organisms is uneven and unidentified [3]. A primary objective in metagenomics is to portray the organisms present in an environmental sample. A correct classification of the species within a sample will enable further insight into several issues, such as: the microbial ecosystem models used to describe and predict community-based microbial processes, changes, and sustainability; the global-scale descriptions of the role of the human microbiome in different health states in individuals and populations; and the exploitation of the remarkably versatile and diverse biosynthetic capacities of microbial communities to generate beneficial industrial, health, and food products. Tools such as MEGAN [4], FANTOM [5] or RAST [6] perform a taxonomic analysis with reads and are also prepared to work with contigs, since each approach has advantages and disadvantages. Analyzing contigs provides larger genomic fragments; nevertheless, this entails a risk of generating chimeric contigs due to the heterogeneity of the sample. On the other hand, with reads this risk is non-existent; however, the analysis is affected by several factors such as the quality and length of the sequences, and thus may generate matches with low statistical significance. The main contributions of this paper are a set of tools that analyze the quality of the taxa assigned to the metagenomic sample and establish statistical differences between reads and contigs, in order to provide better judgement to properly identify the correct taxa distribution in a metagenomic sample. It also provides a workflow that employs the previous tools to propose suggestions on how to perform an optimal taxonomic analysis of a metagenomic sample, either with reads or with contigs.

2 Methods

The definitions, procedures and algorithms employed to compare reads and contigs when analyzing a metagenomic sample are described in this section. First, we define a set of conditions that describe the taxonomic concordance between a contig (handled as one sequence) and the reads that assemble it at a specific taxonomic rank, in order to achieve a reasonable comparison.

– General definitions: Let S be the set composed of the read and contig nucleotide sequences, and let T be the set composed of the taxa in a specific taxonomic rank plus None. Each sequence is mapped to a taxon: for s ∈ S and t ∈ T, Taxon(s) → t.
– Consistency (C): Both the read and the contig have the same taxon assigned, or neither was assigned at all: Taxon(Read) = Taxon(Contig).
– Weak Inconsistency (WI): One of the sequences has been assigned to a taxon, but the other one was not assigned to any. These relationships are classified based on which sequence was unassigned. Granted that the read does not match a taxon in the specified taxonomic rank, it is defined as a Weak Inconsistency by Read (WIR): Taxon(Read) = None ∧ Taxon(Contig) = x, with x ≠ None. If the unassigned sequence is the contig, it is classified as a Weak Inconsistency by Contig (WIC): Taxon(Read) = x ∧ Taxon(Contig) = None, with x ≠ None.
– Strong Inconsistency (SI): Both the read and the contig are assigned to a taxon in the selected taxonomic rank, but to different taxa: Taxon(Read) ≠ Taxon(Contig). Note that if either the read or the contig is not assigned, the pair is classified as a WI.

Having settled the previous definitions, a workflow has been designed with the intent of analyzing the levels of concordance between reads and the contigs they assemble and of retrieving reliable comparison results (see Supplementary Material Fig. 1). The workflow begins with metagenomic reads and a reference database as input. Firstly, the reads are assembled into contigs using MEGAHIT [7], an assembler developed for large and complex metagenomic NGS (Next Generation Sequencing) reads. Afterwards, both sets of sequences are mapped against a reference database to acquire the possible species that each sequence came from. These results are fed to MEGAN, a microbiome analysis tool that uses the last common ancestor (LCA) algorithm to assign each sequence to a taxon. Finally, the information retrieved from the previous steps is processed by our software in order to generate a set of results that assesses the quality of the taxa assigned to the reads and contigs and provides statistical insight about these results. The subsequent section provides a detailed description of the internal functioning of the workflow.
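As an illustration of the classification rules above, the following C++ sketch assigns one of the four concordance levels to a read/contig pair. Representing an unassigned taxon (None) by an empty string is an assumption of this sketch, not a detail of the authors' implementation.

```cpp
// Illustrative sketch of the consistency classification defined above.
#include <string>

enum class Concordance { C, WIR, WIC, SI };

// An empty string stands for "None" (no taxon assigned at the chosen rank).
Concordance classify(const std::string& readTaxon,
                     const std::string& contigTaxon) {
    if (readTaxon == contigTaxon) return Concordance::C;   // includes both None
    if (readTaxon.empty())        return Concordance::WIR; // read unassigned
    if (contigTaxon.empty())      return Concordance::WIC; // contig unassigned
    return Concordance::SI;                                // both assigned, different taxa
}
```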


common ancestor (LCA) algorithm to assign each sequence to a taxa. Finally, the retrieved information from previous steps is processed by our developed software in order to generate a set of results that assesses the quality of the taxa assigned to the reads and contigs and provides statistical insight about these results. The subsequent section provides a detailed description of the internal functioning of the workflow. 2.1

Detecting Differences Between the Reads and Contigs

The developed toolkit has been designed for comparing the results obtained after performing a primary sequence comparison and a biological taxonomic analysis between reads and contigs. The output information provided by this tool is composed by: 1. 2. 3. 4.

The associations for each contigs and the reads that assembled it. Concordance of the taxa assigned between the reads and contig assembled. Coverage of the reference database. Ratios of highest scoring matching species per sequence in a metagenomic dataset. 5. Correct taxonomic classification percentage. The internal procedure implemented to obtain these results are described in the following section. – The associations for each contigs and the reads which assembled it: The associations between the reads and contigs are extracted from the BLASTN [8] output obtained by performing a DNA primary sequence alignment between them. Other comparison tools can be used by adding an specific parser. This result is processed to obtain two collections of the relationships between the reads and the contigs: one in which the reads are assigned to the contig that it assembled; and the other where the contigs are partnered with the group of reads used to assemble it. – Concordance of the taxa assigned between the contigs and the reads that assembled it: Firstly, the identifier of all the sequences that have been assigned to a taxon in the selected biological classification rank are extracted from the MEGAN results. Afterwards, this information is used to classify the previously obtained associations between reads and contigs, based on the concordance level of the taxon assigned to a contig and the reads that assembles it. – Coverage of the reference database: The amount of base pairs that were aligned to the database obtained from the results after executing the BLASTN with each set of sequences is compared to the number of base pairs in such database to obtain the following results: • Total coverage of the database for each set of sequences • Total coverage of the database that the reads and contigs match together


– Ratios of highest-scoring matching species per sequence in a metagenomic dataset: The average number of top-scoring matches resulting from the sequence alignment against the reference database is calculated for each of the datasets. Afterwards, the measurement obtained from each dataset is compared to decide which one provides less variable matches.
– Correct taxonomic classification: This measurement can only be calculated when the original taxon of each read is known beforehand. The percentage of sequences assigned to the taxon to which they belong is calculated for both datasets. An assignment is correct for a read if it is matched to the correct taxon. However, it is impossible to know the proper species for a contig, because contigs can be assembled from reads that belong to different organisms. Therefore, we define the assigned taxon of a contig as the one to which the majority of the reads that assemble it belong.

3 Results and Discussion

In order to apply the previously described workflow and to obtain valid comparison results, the metagenomic reads dataset has to be properly designed, meaning that the real taxon of each read must be known beforehand, enabling us to assess and establish whether it is better to perform a taxonomic analysis with reads or with contigs. To achieve this goal, two use cases have been developed that fulfill the requirement of knowing the original species of each read. One is fully synthetic, where the reads originate from each genome in the database. The other one is semi-synthetic, where the reads come from a selection of genomes that are representative of the classes in a real metagenomic sample (see Fig. 1). The differences between the cases are the initial dataset of reads and the reference database. These metagenomic reads and the databases employed are obtained through the following approaches:

– Fully synthetic use-case/dataset (FSD): The selected database consists of the gastrointestinal tract genomes provided by the Human Microbiome Project (HMP) [10], and the reads are obtained by executing Better Emulation for Artificial Reads (BEAR) [11] with the HMP dataset. An equal number of reads is generated from each genome in the reference database to obtain a very mixed sample of reads from all the different species to which they will be compared. The total number of reads is 521,334.
– Semi-synthetic use-case/dataset (SSD): Following the class taxonomic distribution from the study "Comparative metagenomic, phylogenetic and physiological analyses of soil microbial communities across nitrogen gradients" [9], a set of genomes was selected that belonged to each of the classes. These samples were used with BEAR to generate a set of reads proportional to the class distribution obtained by analyzing the article. The remaining percentage of the metagenomic sample (9%) was obtained by generating a set of random reads that followed the nucleotide distribution of the rest of the dataset.


Fig. 1. On the left: generation of fully synthetic reads. On the right: generation of semi-synthetic reads. BEAR is referenced in the next paragraph.

In order to provide a soil sequencing framework, the reference database for the selected soil microbial genomes is RefSoil [12]. The total number of reads is 499,991.

The species distribution for both metagenomic datasets is represented in Fig. 2. The workflow depicted in the Methods section has been applied to each of the use cases with the parameters described in Table 1 of the Supplementary Material. Afterwards, the output from the developed tools for each use case is interpreted to obtain the following results:

– Comparison with the Original Distribution: The species distribution obtained by performing a taxonomic analysis with the reads and contigs is compared with the original dataset in Fig. 3. For the FSD, both reads and contigs show differences against the original dataset; however, it is not noticeable which one is more similar to the authentic dataset. This is not the case for the SSD, since the reads present an almost identical distribution of species in comparison to the original, while the contigs clearly have noticeable differences. This is further verified in the next section.
– Root Mean Square Error (RMSE) after the Taxonomic Analysis: The RMSE is calculated for both reads and contigs using the original dataset as reference. This implies that if the RMSE is lower for a set of sequences (reads or contigs), the mapping of this dataset is more appropriate to describe the ideal distribution of species in the metagenome (Table 1).


Fig. 2. Percentage distribution of the species in the original metagenomic dataset. On the left: distribution for the fully synthetic dataset. On the right: distribution for the semi-synthetic dataset.

Fig. 3. Percentage distribution of the species in the original metagenomic dataset (green), reads (red) and contigs (blue). On the left: distribution for the fully synthetic dataset. On the right: distribution for the semi-synthetic dataset. (Color figure online)

Table 1. Root Mean Squared Error of the assignment of species for reads and contigs compared to the original dataset, for both datasets.

Dataset   RMSE for FSD   RMSE for SSD
Reads     0.3187         0.4031
Contigs   0.3858         4.2534

The RMSE describes the average difference of each dataset in comparison to the original, and provides further insight into how correct the assignments of species per dataset are. Reaffirming the results from Fig. 3, it is observed that in both use cases the reads have a lower RMSE than the contigs.
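For reference, a minimal C++ sketch of one plausible way to compute this RMSE over the per-species percentage distributions follows; the exact formulation is not spelled out in the text, and the vector layout (one entry per species, same order in both vectors) is an assumption.

```cpp
// Hedged sketch: RMSE between an estimated species distribution and the
// original one (percentages per species, aligned by species index).
#include <cmath>
#include <vector>

double rmse(const std::vector<double>& estimated,
            const std::vector<double>& original) {
    double sum = 0.0;
    for (size_t k = 0; k < original.size(); ++k) {
        const double d = estimated[k] - original[k];  // per-species deviation
        sum += d * d;
    }
    return std::sqrt(sum / original.size());          // root of the mean square
}
```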


– Inconsistencies Found: A concordance level is assigned to each of the associations between a contig and the reads that assemble it. Identifying the types of inconsistencies helps in determining the reason behind the RMSE. If there are more weak inconsistencies at the species taxonomic rank, then most of the reads or contigs involved were assigned to a taxon at a higher and less specific taxonomic rank. The detected inconsistencies and the percentage of relationships they represent are shown in Table 2.

Table 2. Number of inconsistencies assigned for each use case.

Type of inconsistency           Found on FSD (%)   Found on SSD (%)
Weak inconsistency by read      21,393 (4.10)      4,003 (0.80)
Weak inconsistency by contig    24,183 (4.64)      1,622 (0.32)
Strong inconsistency            2,231 (0.45)       4,464 (0.84)

– Inconsistency Resolution: Inconsistencies can be solved by selecting a less specific taxonomic rank, covering a broader range of taxa to which a sequence could be assigned. In both use cases the sequences belong to bacteria; therefore, the discrepancy between the assignment of a contig and the reads that assemble it will be resolved, at the latest, at the taxonomic rank "Domain". This can be appreciated in Fig. 4.

Fig. 4. Percentage of inconsistencies solved at different taxonomic ranks. In both use cases, over 50% of the inconsistencies are resolved if the desired taxonomic rank to analyze is the family. On the left: inconsistency resolution for the fully synthetic dataset. On the right: inconsistency resolution for the semi-synthetic dataset.


The heterogeneity of the samples causes a noticeable amount of contigs to be assembled from reads of different species. This confirms that the inconsistencies arise during the assembly process.

– Coverage and Mapping Comparison against the Reference Database: For each use case, the ratio of top-scoring matches after performing the sequence alignment against the reference database and the percentage of nucleotides covered by the full set of sequences are depicted in Table 3.

Table 3. Mapping and coverage comparison between reads and contigs for each use case.

                                     FSD                       SSD
Measurement                          Reads    Contigs          Reads    Contigs
Ratio of matches per sequence        7.05     7.50             4.52     8.38
% Coverage of database               21.21    7.16             5.59     3.37
Common % within use case             6.42                      3.03
Common coverage % against contigs    89.66    Not applicable   89.91    Not applicable

For both use cases, reads have a lower average of top-scoring matches, since contigs tend to have more matches due to the assembly noise generated by forming contigs from reads belonging to different species. It is also noteworthy that over 85% of the nucleotides covered by contigs are also covered by reads, yet reads cover a wider range of the database, meaning that they provide additional information that may be valuable depending on the goals of the metagenome experiment.

– Correct Assignment of a Taxon for each Sequence Comparison: Both use cases fulfil the prerequisite to calculate this measurement, namely knowing beforehand the original taxon of each read. The resulting assessment is described in Table 4.

Table 4. Correct assignment for each sequence comparison.

                    FSD                       SSD
Measurement         Reads        Contigs      Reads        Contigs
Properly assigned   493,226      59,088       454,900      56,242
Wrongly assigned    3,179,682    611,595      1,933,138    387,162
Total               3,672,908    670,683      2,388,038    443,409
Correct (%)         13           9            19           13
Incorrect (%)       87           91           81           87

The low percentage of properly assigned sequences is caused by the multiple top-scoring matches for each sequence described previously. This fact generates a noticeable amount of wrong assignments, but these must be taken into account,


because in a real metagenomic sample it is impossible to know which is the correct match. This indicates that the reads provide a more accurate assignment in both use cases, whereas contigs provide less sensitive results. However, this also proves that a high percentage of the data obtained from a metagenomic sample is noise originating from various sources. Hence, it can be concluded that the assembly is not the only process that needs to be refined in order to obtain more valuable information.

4 Conclusions

Even though tools and algorithms in metagenomics have advanced, there are still shortcomings that are very difficult to solve due to the intrinsic complexity of analyzing a metagenomic sample. These errors therefore have to be properly addressed in order to build better tools in the future. Accordingly, the results obtained in this work highlight some issues to be resolved in the field of metagenomic assembly. We have presented in this work several indicators that enable a valid comparison between reads and contigs when performing a taxonomic analysis of a metagenomic sample. We have demonstrated that reads provide a more accurate assignment of taxa, that their distribution of species resembles to a larger extent the original metagenomic sample distribution, and that they provide a more specific assignment of taxa than using the contigs. The measurements established in previous sections suggest that, during the assembly process, some reads belonging to different species are put together into a contig as a result of the great heterogeneity of species in a metagenomic sample. In this same stage, another issue arises: the distribution of species assigned to contigs resembles the original less, since contig lengths vary depending on how many reads are used to assemble them, yet at the moment of assigning a taxon a contig still counts as a single sequence match even though it was formed by many reads. Moreover, reads describe more accurately the proper distribution of species in the metagenomic sample, since each read belongs to a single species and read lengths are uniformly distributed. However, when working with contigs, specificity is lost due to the possibility of creating chimeric contigs and the fact that the quality of the assembly varies strongly with the length and quality of the reads, misrepresenting the original sample. In terms of future work, the toolkit is being applied to compare the quality of different metagenomic assembly tools and to compare the quality of the assembly using different parameters. Likewise, adjusting the presented workflow to compare the functional analysis between reads and contigs would be very interesting.

Acknowledgements. The authors would like to thank Fabiola Carvalho and Ana T. R. Vasconcellos from the LNCC-Brazil for their support. This work has been partially supported by the European project ELIXIR-EXCELERATE (grant no. 676559), the Spanish national projects Plataforma de Recursos Biomoleculares y Bioinformaticos (ISCIII-PT13.0001.0012) and RIRAAF (ISCIII-RD12/0013/0006) and the University of Malaga.


References

1. National Human Genome Research Institute: The Cost of Sequencing a Human Genome. https://www.genome.gov/27565109/the-cost-of-sequencing-a-human-genome/
2. National Research Council (US) Committee on Metagenomics: Challenges and Functional Applications. The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet. National Academies Press (US), Washington (DC), Why Metagenomics? (2007). https://www.ncbi.nlm.nih.gov/books/NBK54011/
3. Sharpton, T.J.: An introduction to the analysis of shotgun metagenomic data. Front. Plant Sci. 5, 209 (2014). https://doi.org/10.3389/fpls.2014.00209
4. Huson, D.H., Auch, A.F., Qi, J., Schuster, S.C.: MEGAN analysis of metagenomic data. Genome Res. 17(3), 377–386 (2007). https://doi.org/10.1101/gr.5969107
5. Sanli, K., Karlsson, F.H., Nookaew, I., Nielsen, J.: FANTOM: functional and taxonomic analysis of metagenomes. BMC Bioinform. 14, 38 (2013). https://doi.org/10.1186/1471-2105-14-38
6. Meyer, F., et al.: The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinform. 9, 386 (2008)
7. Li, D., Liu, C.-M., Luo, R., Sadakane, K., Lam, T.-W.: MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics (2015). https://doi.org/10.1093/bioinformatics/btv033
8. Madden, T.: The BLAST sequence analysis tool. In: McEntyre, J., Ostell, J. (eds.) The NCBI Handbook [Internet]. National Center for Biotechnology Information (US), Bethesda (2002). (Chap. 16). https://www.ncbi.nlm.nih.gov/books/NBK21097/. Accessed 13 Aug 2003
9. Fierer, N., Lauber, C.L., Ramirez, K.S., Zaneveld, J., Bradford, M.A., Knight, R.: Comparative metagenomic, phylogenetic and physiological analyses of soil microbial communities across nitrogen gradients. ISME J. 6(5), 1007–1017 (2012). https://doi.org/10.1038/ismej.2011.159
10. The NIH HMP Working Group, Peterson, J., Garges, S., Giovanni, M., McInnes, P., Wang, L., Schloss, J.A., Bonazzi, V., McEwen, J.E., Wetterstrand, K.A., Deal, C., Baker, C.C., Di Francesco, V., Howcroft, T.K., Karp, R.W., Lunsford, R.D., Wellington, C.R., Belachew, T., Wright, M., Giblin, C., David, H., Mills, M., Salomon, R., Mullins, C., Akolkar, B., Begg, L., Davis, C., Grandison, L., Humble, M., Khalsa, J., Little, A.R., Peavy, H., Pontzer, C., Portnoy, M., Sayre, M.H., Starke-Reed, P., Zakhari, S., Read, J., Watson, B., Guyer, M.: The NIH human microbiome project. Genome Res. 19(12), 2317–2323 (2009). https://doi.org/10.1101/gr.096651.109
11. Johnson, S., Trost, B., Long, J.R., Pittet, V., Kusalik, A.: A better sequence-read simulator program for metagenomics. BMC Bioinform. 15(Suppl. 9), S14 (2014). https://doi.org/10.1186/1471-2105-15-S9-S14
12. Choi, J., Yang, F., Stepanauskas, R., Cardenas, E., Garoutte, A., Williams, R., Flater, J., Tiedje, J.M., Hofmockel, K.S., Gelder, B., Howe, A.: Strategies to improve reference databases for soil microbiomes. ISME J. 11(4), 829–834 (2017). https://doi.org/10.1038/ismej.2016.168

Estimating the Length Distributions of Genomic Micro-satellites from Next Generation Sequencing Data

Xuan Feng1,2, Huan Hu1,2, Zhongmeng Zhao1,2, Xuanping Zhang1,2, and Jiayin Wang1,2(✉)

1 School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, Shaanxi, China
[email protected]
2 Shaanxi Engineering Research Center of Medical and Health Big Data, Institute of Data Science and Information Quality, Xi'an Jiaotong University, Xi'an 710049, Shaanxi, China

Abstract. Genomic micro-satellites are the genomic regions that consist of short and repetitive DNA motifs. In contrast to the unique genome, genomic micro-satellites expose high intrinsic polymorphisms, which mainly derive from variability in length. Length distributions are widely used to represent these polymorphisms. Recent studies report that some micro-satellites alter their length distributions significantly in tumor tissue samples compared to the ones observed in normal samples, which has become a hot topic in cancer genomics. Several state-of-the-art approaches have been proposed to identify the length distributions from sequencing data. However, the existing approaches can only handle micro-satellites shorter than one read length, which limits the potential research on long micro-satellite events. In this article, we propose a probabilistic approach, implemented as ELMSI, that estimates the length distributions of micro-satellites longer than one read length. The core algorithm works on a set of mapped reads. It first clusters the reads, and a k-mer extension algorithm is adopted to detect the unit and breakpoints as well. Then, it conducts an expectation maximization algorithm to approach the true length distributions. According to the experiments, ELMSI is able to handle micro-satellites with a length spectrum from shorter than one read length to the 10 kbps scale. A series of comparison experiments varying the numbers of micro-satellite regions, read lengths and sequencing coverages were applied, and ELMSI outperforms MSIsensor in most of the cases.

Keywords: Genomic micro-satellite · Length distribution · Estimation approach · Next generation sequencing data

1 Introduction

Genomic micro-satellite regions were first discovered by Miesfeld and others nearly four decades ago [1]. Although most of the identified micro-satellites are considered as neutral events, many of which locate in non-coding genomic regions, more and more


evidence successively reports that the phenotypic effects of some micro-satellites are quite remarkable [2]. A genomic micro-satellite region is a genomic region that consists of perfect or near-perfect tandem iterations of short DNA motifs. It is reported that every possible motif of mono-, di-, tri- and tetra-nucleotide repeats may widely exist in human genomes [3]. In addition, different from the unique genome, a genomic micro-satellite region mainly exposes variability in length. Although many details about micro-satellite polymorphisms are unclear, it is suggested that genomic micro-satellites are highly mutable due to replication slippage during DNA replication processes [2, 3]. Replication slippage occurs when the nascent strand mismatches the template strand; the continued replication then leads to a different length from the template strand. The nascent strand mismatch generates a loop that consists of one or multiple motifs. If the loop occurs on the nascent strand, it expands the micro-satellite region. Otherwise, if the loop occurs on the template strand, it shortens the micro-satellite region [3]. Thus, when sequencing reads are sampled from a micro-satellite region, different reads often present different lengths of the micro-satellite, and a length distribution is then commonly used to represent its polymorphisms. In addition, if the length distributions for the same micro-satellite significantly alter between different tissue samples, such as a tumor sample and a normal sample from the same patient, this is known as a micro-satellite instability (MSI) event. MSIs are widely observed in cancer cases [4]. It is considered that when somatic events affect the length of a micro-satellite, normally, the mismatch repair mechanism corrects it back to the normal situation. However, if the mismatch repair mechanism is dysregulated, which may be a result of somatic inactivation via methylation [5], germline variations in MMR genes [6] or other rare somatic mutational events, the ability of correction would be largely limited [7]. For example, it is reported that up to 15%–20% of sporadic colorectal cancer cases have positive MSI events [8, 9], and 12% of advanced prostate cancer cases carry MSI events [10]. Thus, recent studies report the landscape of MSI events from pan-cancer cases, which has important clinical implications in cancer diagnostics and prognosis [11, 12]. Due to the clinical usages, detecting micro-satellites and MSIs becomes an important problem. Traditional methods usually employ polymerase chain reaction technology, whereas recent methods prefer to use next generation sequencing data. Several computational approaches have recently been proposed, such as MSIsensor [4], mSINGS [13] and MSIseq [14]. MSIsensor is among the first approaches for cancer sequencing data. It computes the length distributions of each micro-satellite in paired tumor and normal sequence data, respectively. Then, it conducts a statistical test to compute the significance between the two distributions from the paired samples. However, MSIsensor limits the lengths of MSIs to shorter than one read length. mSINGS works on target-gene captured sequencing data. It compares the numbers of the signals that reflect the repetitive micro-satellite tracts with different lengths from case samples to the numbers from control samples. The MSI status is determined by the scores of the selected micro-satellite regions. Due to its computational complexity, it is more suitable for small panels. MSIseq incorporates four machine learning frameworks, logistic regression, decision tree, random forest and the naive Bayes approach, to compare the distributions.


Although several approaches have been designed for detecting micro-satellites and identifying their status, to the best of our knowledge, none of them overcomes the one-read-length limitation. Detecting micro-satellites longer than one read length is a complicated computational problem. First, for long micro-satellites, the detector is no longer able to delimit the micro-satellites by partially mapped reads. Furthermore, if a micro-satellite is longer than the average insert size, the algorithm cannot use the paired-end reads to locally anchor the micro-satellite. In this article, we propose a novel algorithm, ELMSI, that tries to estimate the length distributions of long micro-satellites (LMSI) from next generation sequencing data. ELMSI is a probabilistic approach. The algorithm is given a set of mapped reads. It first clusters the reads and adopts a k-mer segmentation step to determine the repeat units. It then constantly approximates the distribution of a micro-satellite via an expectation maximization algorithm. To test the performance of ELMSI, we conduct a series of simulation experiments and compare the results to MSIsensor. The results demonstrate that ELMSI has better performance on multiple indicators. The recall rates and precision rates are able to reach around 81% and 72% in some common simulation settings, respectively. Moreover, it maintains satisfactory accuracy when the coverage decreases.

2 Methods

2.1 Model Framework

The proposed approach, ELMSI, consists of two components. Suppose that we are given a set of mapped reads; the first component infers the number of micro-satellite candidate regions by a clustering algorithm. In brief, the clustering algorithm is based on the distances among the initial mapping positions of the reads across each breakpoint. The number of clusters represents the number of MS regions. Then ELMSI uses a k-mer based algorithm to identify the exact breakpoints and the repeat unit as well. The second component estimates the distribution of lengths at each MS region via an EM algorithm. Here, we take the same assumption [15] that the length distribution approximates a normal distribution according to the law of large numbers. If the lengths of the MSIs are shorter than the read length, the computational problem is reduced to the problem that the existing approaches can solve. For LMSIs, the reads cannot locate them, so we cannot pinpoint the MSI lengths directly. Thus, we propose an EM algorithm to estimate the parameters of the length distribution of any LMSI. Before introducing the details of ELMSI, we list the following definitions, which are also shown in Fig. 1.

• MS-pairs: Two paired reads, one of which is perfectly mapped while the other is across the breakpoint.
• SB-reads: A read which is across the breakpoint in an MS-pair.
• PSset: A collection of binary tuples consisting of the initial positions and sequences of the SB-reads, represented by (POS, SEQ).
• Sk-mer: The sequence consisting of the first k bases.


Fig. 1. MSI sequence signatures and an example of identifying the repeat unit. The sample sequences represent the same MS region from different cells. An example of Sk-mer shows how they are used to identify the repeat unit.

2.2 Identify Breakpoints and Repeat Units of MSIs

In order to obtain the breakpoints and repeat units of MSIs, ELMSI computes features extracted from the aligned reads. We assume that the lengths of MSIs are generally less than 50000 bps. ELMSI estimates the number of MSIs by conducting a clustering algorithm according to the distances of the initial positions of the SB-reads. The clustering strategy is as follows: according to the mapping results from the PSset, two SB-reads will belong to the same cluster only if the distance between their initial positions is less than 50000 bps. Each cluster then represents a candidate MSI, and thus we have the number of MSIs. Once the number of MSIs is determined, for each MSI candidate region, ELMSI uses a k-mer based algorithm to split each read in reverse order. The process of the k-mer algorithm is also shown in Fig. 1. As the repeat units of micro-satellites are usually less than 6 bps, we set k = 6 as default. Starting from the first base of the reversed read sequence, it checks in turn whether two consecutive k-mer sequences are exact repetitions. Such a sequence is a candidate repeat unit, and its first base is a candidate breakpoint of an MSI. The same operation is applied to all reads in the candidate MS region, and the modes of the candidate repeat units and breakpoints are taken as the final results.
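A hedged C++ sketch of this k-mer check on a single read follows; the function name, the return convention and the interpretation of the offset in the reversed read are assumptions of the sketch, not details given by the authors.

```cpp
// Minimal sketch, assuming k = 6 and an exact match between two adjacent
// k-mers in the reversed read signals the candidate repeat unit.
#include <string>

// Returns the candidate repeat unit, or an empty string if none is found;
// outBreakpoint receives the offset (in the reversed read) of the candidate
// breakpoint, i.e., the first base of the repeated k-mer.
std::string findRepeatUnit(const std::string& read, int k, size_t& outBreakpoint) {
    const std::string rev(read.rbegin(), read.rend());     // split in reverse order
    for (size_t pos = 0; pos + 2 * static_cast<size_t>(k) <= rev.size(); ++pos) {
        // Compare the k-mer at pos against the k-mer immediately after it.
        if (rev.compare(pos, k, rev, pos + k, k) == 0) {
            outBreakpoint = pos;                           // candidate breakpoint
            return rev.substr(pos, k);                     // candidate repeat unit
        }
    }
    return std::string();                                  // no repetition found
}
```

Running this on every read of a candidate MS region and taking the mode of the returned units and breakpoints mirrors the final step described in the text.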

2.3 Estimating the Distribution Parameters of MSI Lengths

Suppose that the length distribution of an MSI follows a normal distribution. ELMSI considers a continuing estimation strategy, whose basic idea is to estimate an MSI


length by the coverage of the specified area containing the MSI, then use the updated MSI length to estimate the coverage of the specified area. This loop is repeated until the MSI length no longer changes significantly.

Fig. 2. The changes of coverages when an MSI event occurs and the definitions of the different read pairs. In the areas where MSIs occur, the coverages become lower.

Let WIN-bk be a window on the reference such that the breakpoint of an MSI is the midpoint of the WIN-bk; the default length of WIN-bk is set to 5000 bps. The read pairs can then be divided into the following categories:

• C-pairs: paired reads mapped to WIN-bk.
• T-pairs: paired reads mapped to the MS region.
• O-pairs: paired reads of which one is perfectly mapped to WIN-bk while the other is mapped to the MS region.
• SO-pairs: paired reads of which one is mapped to the MS region while the other spans across a breakpoint.
• S-pairs: paired reads of which one is perfectly mapped to WIN-bk while the other spans across a breakpoint.
• S-reads: the reads which span across the breakpoints in any SO-pairs and S-pairs.

Figure 2 is a graphical representation of the relevant definitions.


We can estimate the length distributions of MSIs directly from the aligned reads when their lengths are shorter than one read length. But if the lengths of MSIs are longer than one read length, once the breakpoints and the repeat units of the MSIs are identified, we set a WIN-bk with each breakpoint as the midpoint. The initial length of WIN-bk is set to 5000 bps. According to the aligned reads corresponding to WIN-bk, we can obtain the coverage of the reference in WIN-bk. The formulas are as follows:

SUMbp = NUMread × Lread    (1)

C = SUMbp / L    (2)

where SUMbp represents the total number of bases in WIN-bk, NUMread represents the total number of reads in the target area, Lread represents the read length, C represents the coverage of the target area, and L represents the length of the target area. When the WIN-bk length is fixed, SUMbp is constant. Thus, the lengths of MSIs do not affect SUMbp, but they influence the coverage C. We calculate the normal distribution parameters of the MSI lengths through the following eight steps.

Step 1: Initialization of the variables. Let m be the total number of MSIs, p be the index of the current MSI, S be the sampling counter, WIN-bk be the sequence window with the MSI's breakpoint as the midpoint, LWin be the length of WIN-bk, Laln be the total number of bases belonging to the MSI region in all S-reads, and Lset be the set of MSI lengths.
Step 1-1: Get the number of MSIs, the repeat units and the breakpoints through the first component.
Step 1-2: Cluster the paired reads within WIN-bk into the 5 categories C-pairs, T-pairs, O-pairs, S-pairs and SO-pairs.
Step 1-3: Calculate the numbers of paired reads in these categories, denoted NUMC, NUMT, NUMO, NUMS and NUMSO for C-pairs, T-pairs, O-pairs, S-pairs and SO-pairs respectively, and compute Laln.
Step 1-4: Set p = 1, S = 1, LWin = 5000 bps, L′ = 0, Lset = ∅.
Step 2: According to the paired-read clustering results of the first step, calculate the average coverage of WIN-bk as C = SUMbp / L, where SUMbp = 2 × (NUMC + NUMT + NUMO + NUMS + NUMSO) × Lread and L = L′ + LWin.
Step 3: Suppose that the coverage follows a uniform distribution; then the coverage from Step 2 reflects the coverage in the MSI area. In this step, we use the formula L″ = SUMbp / C to calculate the MSI length, where SUMbp = (2 × NUMT + NUMO + NUMSO) × Lread + Laln.
Step 4: If |L′ − L″| > δ, let L′ = L″ and repeat the Step 2 and Step 3 operations until |L′ − L″| < δ; then go to Step 5.
Step 5: The obtained MSI length is incorporated into the set: Lset = Lset ∪ {L″}.
Step 6: In order to find the normal distribution parameters of an MSI sequence, we sample 30 times by changing the size of LWin. Set S = S + 1; if S < 30, let LWin = LWin + 1000 and go to Step 1.


Step 7: The normal distribution parameter of an MSI is N(μ, σ²), where μ and σ² are the mean and variance of the lengths in Lset.
Step 8: If p < m, set p = p + 1 and go to Step 1.
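To clarify the inner loop (Steps 2–4), the following C++ sketch alternates between the coverage estimate of WIN-bk and the MSI length estimate until convergence. The function name, the default tolerance δ = 1 and the passing of precomputed pair counts are assumptions of this sketch.

```cpp
// Hedged sketch of Steps 2-4: all counts (numC, numT, ...) and lAln are
// assumed to be precomputed for the current WIN-bk (Steps 1-2 and 1-3).
#include <cmath>

double estimateMsiLength(double lWin, double lRead, double lAln,
                         long numC, long numT, long numO, long numS, long numSO,
                         double delta = 1.0) {
    double lPrev = 0.0, lCur = 0.0;  // L' starts at 0 (Step 1-4)
    do {
        lPrev = lCur;
        // Step 2: average coverage of WIN-bk, Eqs. (1)-(2), with L = L' + LWin
        const double sumWin = 2.0 * (numC + numT + numO + numS + numSO) * lRead;
        const double c = sumWin / (lPrev + lWin);
        // Step 3: re-estimate the MSI length L'' from the MS-region reads
        const double sumMsi = (2.0 * numT + numO + numSO) * lRead + lAln;
        lCur = sumMsi / c;
    } while (std::fabs(lCur - lPrev) > delta);  // Step 4: iterate to convergence
    return lCur;  // the converged L'', added to Lset in Step 5
}
```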

3 Experiments and Results To test the performance of ELMSI, we conduct the experiments on a series of simulation datasets with different configurations, which alter the numbers of MSI, coverages and read lengths. And also compare the two major indicators, precision rate and recall rate, to MSIsensor [4]. Precision rate is calculated by P ¼ T þT F and recall rate is calculated by R ¼ T þT N . There into, T is the number of MSIs be estimated which is correct; F is the number of MSIs be estimated which is not MSI; N is the number of MSIs not be estimated. 3.1

3.1 Generating the Simulation Datasets

To generate the simulation datasets, we first randomly selected a 10 Mbps region from human chromosome 19 as the reference. Then, single nucleotide variants are randomly planted into the 10 Mbps region with a 1% mutation rate. To design a complex situation, we randomly choose the MSI lengths, the repeat units and the breakpoints of the MSIs. Three major MSI length ranges are considered: (1) long MSIs, whose lengths are greater than or equal to 500 bps, (2) middle MSIs, whose lengths range from 300 bps to 500 bps, and (3) short MSIs, whose lengths range from 100 bps to 300 bps. The proportion of long MSIs is 10%, while the proportions of middle and short MSIs are 45% each. The MSIs in the different length ranges are generated randomly according to these proportions, and the standard deviation of the MSI lengths is less than 5. As aforementioned, the MSI length in a given individual follows a normal distribution. Assuming the parameters of the normal distribution are (μ, σ²), we divide this distribution into seven parts, μ − 3σ, μ − 2σ, μ − σ, μ, μ + σ, μ + 2σ and μ + 3σ, and the number of copies of each part planted into the reference is obtained by multiplying the coverage by the corresponding probability, which is 1%, 6%, 24%, 38%, 24%, 6% and 1% for each part, respectively, as sketched below. Once each part of the MSIs is planted, we merge the seven read files. All of the simulated reads are then mapped to the reference sequence, and the alignment file is given to the variant calling tools.
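The seven-part discretization can be sketched as follows; the proportions are those stated above, while the helper name and the rounding rule are our own assumptions:

```python
# Number of copies of each planted length mu + k*sigma, k = -3..3, obtained
# by multiplying the coverage by the corresponding probability.
def planted_length_copies(mu, sigma, coverage):
    probs = [0.01, 0.06, 0.24, 0.38, 0.24, 0.06, 0.01]
    plan = []
    for k, p in zip(range(-3, 4), probs):
        plan.append((int(mu + k * sigma), round(coverage * p)))
    return plan  # list of (MSI length, number of copies to plant)
```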

3.2 Performance Tests Under Different Numbers of MSIs, Coverages and Read Lengths

We test the performance of ELMSI under different numbers of MSIs, coverages and read lengths. A correct call is defined as follows: an MSI is identified with the correct repeat unit, the detected breakpoint lies within (b − 10 bps, b + 10 bps), where b is the planted breakpoint, and the actual MSI length lies within (μ − 3σ, μ + 3σ), where (μ, σ²) are the estimated normal distribution parameters.

Table 1. Recall rates and precision rates of ELMSI for different numbers of MSIs.

| Number of MSIs | Metric | 30X | 60X | 100X |
|---|---|---|---|---|
| 20 | Recall | 55% | 72% | 81% |
| 20 | Precision | 62% | 70% | 61% |
| 30 | Recall | 51% | 57% | 68% |
| 30 | Precision | 55% | 60% | 68% |
| 40 | Recall | 35% | 49% | 56% |
| 40 | Precision | 36% | 59% | 59% |
| 50 | Recall | 45% | 50% | 54% |
| 50 | Precision | 50% | 60% | 52% |

We first vary the number of MSIs from 20 to 50. In order to better reflect the effects of the different numbers of MSIs on ELMSI, we also vary the coverages among 30X, 60X and 100X; the read length is 100 bps in this set of experiments. For each number of MSIs, we repeat the same setting five times and report the averages; the results are summarized in Table 1.

The increase in the number of MSIs influences the robustness of ELMSI. In reality, since MSI is a rare mutation, there will not be many MSIs in a 10 Mbps chromosomal region, but in order to stress-test ELMSI we deliberately increase the density. We expect both the recall rates and the precision rates of ELMSI to decrease as the number of MSIs increases. From Table 1, we can see that as the number of MSIs increases from 20 to 50, both the recall rates and the precision rates of ELMSI decrease, but not by much. This shows that ELMSI performs well in this situation.

Sequencing coverage affects somatic mutation calling, which in turn would presumably affect the performance of ELMSI. To assess the influence of coverage on ELMSI's performance, we further vary the coverages from 10X to 100X; as shown in Fig. 3, the coverage intuitively reflects the changes in the recall and precision rates. In this group of experiments, we set the number of MSIs to 20 or 50 and the read length to 100 bps. The lower the coverage the sequencer samples, the higher the difficulty for a computational method. From Fig. 3, we can see that the recall rate of ELMSI increases with increasing coverage, reaching more than 80% at its highest. This indicates that the higher the coverage, the less likely ELMSI is to estimate the MSIs wrongly. The optimum coverage for ELMSI is 70X, where the precision rate reaches 72% for 20 MSIs and 60% for 50 MSIs. The simulation results show that ELMSI performs well.


[Fig. 3 graphic: coverage vs. recall/precision (read length = 100 bps); series: recall-20, precision-20, recall-50, precision-50; y-axis 0–100%, x-axis 10X–100X]

Fig. 3. Comparison of the performance of ELMSI under different coverages. (Precision: P = T/(T + F); Recall: R = T/(T + N), where T is the number of estimated MSIs that are correct, F is the number of estimated calls that are not MSIs, and N is the number of MSIs not estimated.) The x-axis represents the coverage and the y-axis the recall rate (blue) or precision rate (yellow); in recall-X and precision-X, X denotes the number of MSIs. (Color figure online)

ELMSI is also validated when the read length alters. We set the number of MSIs to 20, the coverages to 30X, 60X and 100X, and the read lengths to 100 bps and 200 bps, as two groups of experimental observations. The results are shown in Table 2.

Table 2. Recall rates and precision rates of ELMSI for different read lengths.

| Read length | Metric | 30X | 60X | 100X |
|---|---|---|---|---|
| 100 bp | Recall | 55% | 72% | 81% |
| 100 bp | Precision | 62% | 70% | 61% |
| 200 bp | Recall | 38% | 42% | 48% |
| 200 bp | Precision | 43% | 55% | 52% |

NGS is a high-throughput, low-cost technique. Its main disadvantage is the short read length: reads longer than 100 bps suffer greatly increased error rates. We therefore expect that, as the read length increases, the precision and recall rates will decrease. Table 2 confirms this view and shows that the longer the read length, the more likely ELMSI is to estimate MSIs wrongly or miss them. But even with a read length of 200 bps, the precision rate reaches 52% and the recall rate 48%, which is still reasonable performance (Fig. 4).


[Fig. 4 graphic: coverage vs. recall/precision; series: recall-RL=100bp, precision-RL=100bp, recall-RL=200bp, precision-RL=200bp; x-axis 10X–100X]

Fig. 4. Comparison of the performance of ELMSI under different read lengths. (Color figure online)

3.3 Comparison Experiments

Among the existing MSI estimation algorithms, mSINGS is suitable for small panels and has been reported only on limited exome data, and MSIseq is a classifier for MSI status that cannot identify MSI lengths; a comparison of these two algorithms with ELMSI would therefore be meaningless. MSIsensor can accurately identify the status and lengths of MSIs when the MSI lengths are shorter than the read length, so we compare ELMSI with MSIsensor. As mentioned before, MSIsensor is designed only for MSIs whose lengths are shorter than the read length. To make the comparison fair, we no longer fix the proportions of MSIs in the different length ranges (long : middle : short = 1 : 4.5 : 4.5). Instead, we conduct the comparison experiments over two MSI length ranges: (1) MSIs whose lengths range from 40 bps to 60 bps, randomly planted into the reference with a standard deviation of 5 bps, and (2) MSIs whose lengths range from 1000 bps to 10000 bps, randomly planted into the reference with a standard deviation of 50 bps. We set the number of MSIs to 20 and the coverage to 100X, and estimate the MSIs with MSIsensor and ELMSI respectively. The results are shown in Table 3.

Table 3. Comparison results of ELMSI and MSIsensor.

| Algorithm | Metric | 40 bps–60 bps | 1000 bps–10000 bps |
|---|---|---|---|
| ELMSI | Precision | 95% | 89% |
| ELMSI | Recall | 81% | 85% |
| MSIsensor | Precision | 100% | * |
| MSIsensor | Recall | 98% | * |
| | Correctly identified breakpoints | 20 | 19 |
| | Correctly identified MSI lengths | 20 | * |

Note: Under the definition of a correct call, MSIsensor cannot identify MSIs whose lengths are long, so we add two further indicators, the number of correctly identified breakpoints and the number of correctly identified MSI lengths. * indicates that the MSIs cannot be identified.

For LMSIs, the reads on both sides used to locate the candidate MSI regions are greatly weakened, which degrades such algorithms. Table 3 shows that MSIsensor failed to identify any LMSIs, which illustrates this point, whereas ELMSI performs well on LMSIs. Although the experiments show that MSIsensor performs better when the MSI lengths are short, its advantage over ELMSI there is small. In summary, ELMSI performs well in all of these situations.

4 Conclusions

In this article, we focus on the computational problem of estimating the length distributions of micro-satellite instability events. Existing approaches, such as mSINGS, MSIsensor and MSIseq, are able to handle genomic micro-satellite events whose lengths are shorter than one read length, but they often suffer accuracy loss when the lengths of the MSIs become longer. However, it is suggested that MSIs may have a wide length range, and long micro-satellite events and long MSIs may also have clinical importance. We propose an algorithm, ELMSI, to handle MSIs with a wide length range from next generation sequencing data. ELMSI computes directly on the aligned reads. It is able to report the breakpoints and repeat units of MSIs over a wide range; for short MSIs it identifies the lengths accurately, while for long MSIs it estimates the normal distribution parameters. ELMSI is among the first approaches to recognize long MSIs. The experimental results demonstrate that ELMSI is robust in its precision and recall rates under varying coverages, read lengths and numbers of MSIs. The performance of ELMSI is also compared to MSIsensor, and it proves more effective than MSIsensor for LMSIs. Across the entire set of complex situations, ELMSI keeps the precision rate above 80% and the recall rate above 70%, and its run time is short. For MSIs with a wide length range, ELMSI can identify the breakpoints and repeat units and estimate the MSI lengths. It will be useful for MSI screening, and we anticipate wider usage in cancer clinical sequencing.

Acknowledgement. This work is supported by the National Science Foundation of China (Grant No. 31701150) and the Fundamental Research Funds for the Central Universities (CXTD2017003).

References

1. Miesfeld, R., Krystal, M., Arnheim, N.: A member of a new repeated sequence family which is conserved throughout eucaryotic evolution is found between the human delta and beta globin genes. Nucleic Acids Res. 9(22), 5931–5947 (1981)
2. Ashley, C., Warren, S.: Trinucleotide repeat expansion and human disease. Annu. Rev. Genet. 16(1), 1698–1704 (1995)


3. Ellegren, H.: Microsatellites: simple sequences with complex evolution. Nat. Rev. Genet. 5(6), 435–445 (2004)
4. Niu, B., Ye, K., Zhang, Q., et al.: MSIsensor: microsatellite instability detection using paired tumor-normal sequence data. Bioinformatics 30(7), 1015 (2014)
5. Murphy, K.M., Zhang, S., Geiger, T., Hafez, M.J., Bacher, J., Berg, K.D., Eshleman, J.R.: Comparison of the microsatellite instability analysis system and the Bethesda panel for the determination of micro-satellite instability in colorectal cancers. J. Mol. Diagn. 8(3), 305–311 (2006)
6. Lu, C., Xie, M., Wendl, M., et al.: Patterns and functional implications of rare germline variants across 12 cancer types. Nat. Commun. 6(10086), 1–13 (2015)
7. Markowitz, S.D., Bertagnolli, M.M.: Molecular origins of cancer: molecular basis of colorectal cancer. N. Engl. J. Med. 361(25), 2449 (2009)
8. Kim, T.M., Laird, P.W., Park, P.J.: The landscape of microsatellite instability in colorectal and endometrial cancer genomes. Cell 155(4), 858–868 (2013)
9. Woerner, S.M., Kloor, M., Mueller, A., et al.: Microsatellite instability of selective target genes in HNPCC-associated colon adenomas. Oncogene 24(15), 2523–2535 (2005)
10. Pritchard, C.C., Morrissey, C., Kumar, A., et al.: Complex MSH2 and MSH6 mutations in hypermutated microsatellite unstable advanced prostate cancer. Nat. Commun. 5, 4988 (2014)
11. Ribic, C.M., Sargent, D.J., Moore, M.J., et al.: Tumor microsatellite instability status as a predictor of benefit from fluorouracil-based adjuvant chemotherapy for colon cancer. N. Engl. J. Med. 349(3), 247–257 (2003)
12. Pawlik, T.M., Raut, C.P., Rodriguez-Bigas, M.A.: Colorectal carcinogenesis: MSI-H versus MSI-L. Dis. Markers 20(4–5), 199–206 (2004)
13. Salipante, S.J., Scroggins, S.M., Hampel, H.L., et al.: Microsatellite instability detection by next generation sequencing. Clin. Chem. 60(9), 1192–1199 (2014)
14. Mi, N.H., Mcpherson, J.R., Cutcutache, I., et al.: MSIseq: software for assessing microsatellite instability from catalogs of somatic mutations. Sci. Rep. 5, 13321 (2015)
15. Wu, C.W., Chen, G.D., Jiang, K.C., et al.: A genome-wide study of microsatellite instability in advanced gastric carcinoma. Cancer 92(1), 92–101 (2015)

CIGenotyper: A Machine Learning Approach for Genotyping Complex Indel Calls

Tian Zheng1,3, Yang Li1,3, Yu Geng1,3, Zhongmeng Zhao1,3, Xuanping Zhang1,3, Xiao Xiao2,3, and Jiayin Wang1,3(✉)

1 Department of Computer Science and Technology, School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
[email protected]
2 School of Public Policy and Administration, Xi'an Jiaotong University, Xi'an 710049, China
3 Shaanxi Engineering Research Center of Medical and Health Big Data, Institute of Data Science and Information Quality, Xi'an Jiaotong University, Xi'an 710049, China

Abstract. Complex insertion and deletion (complex indel) is a rare category of genomic structural variations. A complex indel presents as one or multiple DNA fragments inserted into the genomic location where a deletion occurs. Several studies emphasize the importance of complex indels, and some state-of-the-art approaches have been proposed to detect them from sequencing data. However, genotyping complex indel calls is another challenging computational problem, because some features commonly used for genotyping indel calls from sequencing data can be invalidated by the components of complex indels. Thus, in this article, we propose a machine learning approach, CIGenotyper, to estimate the genotypes of complex indel calls. CIGenotyper adopts a relevance vector machine (RVM) framework. For each candidate call, it first extracts a set of features from the candidate region, which usually includes the read depth, the variant allelic frequency for aligned contigs, the numbers of splitting and discordant paired-end reads, etc. For a complex indel call, given its features to a trained RVM, the model outputs the genotype with the highest likelihood. An algorithm is also proposed to train the RVM. We compare our approach to two popular approaches, Gindel and Pindel, on multiple groups of artificial datasets. Our model outperforms them in average success rate in most of the cases when varying the coverages of the given data, the read lengths and the distributions of the lengths of the pre-set complex indels.

Keywords: Genomic structural variant · Complex indel · Genotyping problem · Relevance vector machine

1 Introduction

Benefiting from next generation sequencing, detecting genomic structural variations has become basic work in many genomic analysis pipelines. A mass of functional structural variations have been identified, many of which are reported to associate with complex


traits and diseases [1, 2]. Structural variations have many categories, which commonly include deletion, insertion, inversion, translocation, tandem repeat and their combinations [2, 3]. Recent studies emphasize the importance of combined structural variations. For example, some somatic complex indels are reported to be potentially druggable [4], while many germline complex indels are observed in population-based genomic studies [5, 6]. The latest versions of some popular approaches, such as Pindel-C [4] and SVSeq-3 [7], are able to detect complex indels with high accuracy from the second and the third generation sequencing data, respectively.

In addition to detecting the variant calls, for a diploid genome the genotypes are also important and are widely used in downstream analyses [2, 8, 9]. Thus, estimating the genotypes of structural variation calls is another important computational problem in genomics, and a series of state-of-the-art approaches have been proposed. The first category comprises the variant detection tools, such as Pindel-C and SVSeq-2 [10], which extend their algorithmic detection pipelines and then estimate the genotypes according to the local contigs and breakpoints. The second category, such as piCALL [11] and MATE-CLEVER [12], estimates the genotypes based on Bayesian frameworks given a population-based panel or a genealogical prior; these approaches are often more robust when the distributions of the read depths are less than ideal. More recent approaches, such as Gindel [13], prefer to incorporate machine learning or artificial intelligence models. These models can integrate a series of features around a variant call, which harbors a comprehensive view of the information in the sequencing data. However, to the best of our knowledge, the existing approaches usually encounter a significant accuracy loss when handling complex indel calls. This is largely due to the structure of complex indels: a complex indel is formed by simultaneously deleting and inserting DNA fragments of different sizes at a common genomic location [7], which can weaken the data signals of some features in some cases. For example, the discordant read-pair feature is invalid for those complex indels whose inserted fragments have a total length similar to that of the deleted region. A detailed observation is summarized in Sect. 2.1.

Motivated by this, in this article we propose an improved machine learning approach, CIGenotyper, to estimate the genotypes of given complex indel calls. CIGenotyper adopts a relevance vector machine (RVM) framework. The features include the read depth, the variant allelic frequency for aligned contigs, the numbers of splitting and discordant paired-end reads, etc. For each candidate call, the RVM model weights the features, which limits the impacts of invalid features, and then outputs the genotype with the highest likelihood. An algorithm is also proposed to train the RVM. To test the proposed approach, we compare it to two popular approaches, Gindel and Pindel, on multiple groups of simulation datasets. The proposed approach performs better in most of the cases when varying the coverages of the given data, the read lengths and the distributions of the lengths of the pre-set complex indels.

2 Methods

Suppose that we are given a set of mapped reads and a set of candidate indel calls. Note that the proposed approach handles both indels and complex indels. When the sequencing reads of the individual/sample are mapped to the reference genome, the


features are the data signals observed at the particular genomic region corresponding to the reference genome. If the region harbors an indel, it usually exposes a combination of feature values different from the combinations of neutral regions. Furthermore, if the indel is homozygous, it often exposes a combination of feature values different from the combinations of the heterozygous ones. Thus, CIGenotyper extracts the features around each candidate and then learns the feature patterns via the RVM model; the model outputs the genotype of each candidate with the highest likelihood. To achieve this, we try to answer three questions here: Which features should we prefer? Why choose the RVM model? How do we train the RVM model?

2.1 Key Observations on Feature Selection

Although a number of features could be observed from the mapped reads, we have to limit the number of features used in the RVM model. There are two major reasons to conduct feature selection. First, many features are neutral ones, which present similar signals whether the indel is homozygous or heterozygous; they may dilute the overall patterns and interfere with classification. Moreover, as a computational matter, the more features included in an RVM model, the more time- and space-consuming the model becomes. Here we consider three groups comprising six features.

The numbers of discordant/concordant read pairs. The first feature considers the distribution of insert sizes. For any sequencing data with a well quality-controlled library, the insert sizes often follow a normal distribution. Indels often drive the local distributions away from the normal one: an insertion may narrow the distribution, while a deletion may increase the local insert sizes. An example is shown in Fig. 1(a), where the pair of reads from hap1 spans the heterozygous deletion; when mapping them to the reference, the observed insert size is enlarged, and we define them as a discordant read-pair. The pair of reads from hap2, by contrast, is not affected by the heterozygous deletion and keeps the true insert size; we define them as a concordant read-pair. On the other hand, this feature is invalid for some complex indels. An example of such a complex indel is shown in Fig. 1(b): the inserted fragments have a total length similar to that of the deletion, so in this case the change in insert size cannot be observed significantly.
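A minimal sketch of this feature follows, assuming the library insert-size distribution N(mean, sd) is known; the three-standard-deviation cutoff is an illustrative choice, not a value prescribed by CIGenotyper:

```python
# Count discordant vs. concordant pairs by how far the observed insert size
# deviates from the expected library distribution.
def discordant_concordant(insert_sizes, mean=500.0, sd=15.0, n_sd=3.0):
    discordant = sum(1 for s in insert_sizes if abs(s - mean) > n_sd * sd)
    return discordant, len(insert_sizes) - discordant
```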

Fig. 1. Discordant and concordant read pairs: (a) a normal heterozygous deletion; (b) a heterozygous complex indel. The line with an arrow represents a read.


The numbers of partially mapped reads and one-end splitting reads. If a read can only be partially mapped to the reference genome, we define it as a partially mapped read. There are two cases: (1) if a read contains an insertion or a complex indel that is shorter than the read, as shown in Fig. 2(left), it may have two contigs mapped to the reference genome while the inserted fragment(s) is unmapped; (2) if a read contains part of an insertion or a complex indel, it could have one contig mapped to the reference genome while the inserted fragment(s) is unmapped. We also consider the splitting mapped reads, for which there are again two cases: (1) if a read spans a deletion, it can be mapped to the reference genome by splitting it into two parts; in this case the read cannot map to the reference genome directly because it harbors one or more breakpoints, but when the read is segmented at every breakpoint, each contig can be mapped to the reference genome, as shown in Fig. 2(right). (2) It is known that the inserted fragments may come from other genomic regions; in this case, the indel detection algorithms usually prefer to split the reads and then map the contigs to the original regions where the inserted fragments come from.
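As a hedged illustration of counting partially mapped reads, the sketch below works directly on CIGAR strings from the alignment, treating soft-clipped reads ('S') as candidates whose unmapped contig may be an inserted fragment; this is a simplification of the two cases above, and the function name is ours:

```python
import re

# Reads whose CIGAR contains a soft clip have a contig that did not map
# (e.g. an unmapped inserted fragment); fully matching reads map end to end.
def count_partially_mapped(cigar_strings):
    clipped = 0
    for cigar in cigar_strings:
        ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if any(op == "S" for _, op in ops):
            clipped += 1
    return clipped
```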

Fig. 2. An example of a partially mapped read (left) and a splitting read (right)

Read depth and variant allelic frequency. Read depth is a widely used feature for sequencing data, which refers to the number of reads mapped to a particular site or genomic region. Indels usually alter the read depth in different ways: (1) if a deletion occurs, as shown in Fig. 3(left), it decreases the read depth, because no read should be sampled from a deleted fragment; (2) if an insertion occurs, it also contributes little to the read depth, because reads sampled from the inserted fragment cannot be mapped to the reference genome; (3) if a complex indel occurs, it becomes a little more complicated: (a) if the inserted fragments do not come from other genomic regions, the case simplifies to a common insertion; however, (b) if the inserted fragments do come from other genomic regions, as shown in Fig. 3(right), the read depth of the deleted region decreases, while the read depth of the original regions may increase. Furthermore, as the read coverages often follow a normal distribution, for each indel candidate we consider the variant allelic frequency for both the indel and other variants nearby. The reason is that the region where an indel occurs is suggested to have a higher probability of harboring more mutational events. Thus, the variant allelic frequency is calculated as the percentage of the reads mapped to the candidate region that carry or partially carry any variants.
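The following sketch computes both features for a candidate region, assuming reads are available as (start, end, carries_variant) tuples on reference coordinates; this representation is ours, chosen for illustration:

```python
# Read depth = overlapping bases per reference position in the region;
# VAF = fraction of overlapping reads that carry (part of) any variant.
def depth_and_vaf(reads, region_start, region_end):
    overlapping = [r for r in reads
                   if r[0] < region_end and r[1] > region_start]
    bases = sum(min(end, region_end) - max(start, region_start)
                for start, end, _ in overlapping)
    depth = bases / (region_end - region_start)
    vaf = (sum(1 for _, _, v in overlapping if v) / len(overlapping)
           if overlapping else 0.0)
    return depth, vaf
```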


Fig. 3. Examples of how a heterozygous deletion (left) and a complex indel (right) alter the read depths in different regions.

2.2 RVM Framework

Some popular approaches, such as Gindel, adopt a support vector machine (SVM) as the core machine learning model. An SVM works well for indels; however, it is weakened by complex indels. As aforementioned, some complex indels may render one or more features invalid, and if this happens frequently, the accuracy and convergence of an SVM are severely hurt. A straightforward idea to overcome this weakness is to train the SVM on combinations of features, because no complex indel can make all six features invalid. Here, we instead use a relevance vector machine (RVM) model to optimize the combination. The RVM is a kind of sparse probabilistic model widely used in bioinformatics and other research fields [14, 15]. Compared to adopting an SVM model, adopting an RVM framework earns two major advantages for the problem discussed here. First, the RVM considers the prior and posterior probabilities and uses automatic relevance determination to weight the features; when complex indels interfere with the values of some features, it is able to select the suitable features for indels and complex indels respectively, and thus it improves the training process through the features attached for genotyping. Moreover, as a Bayesian learning model, the RVM outputs results with posterior probabilities, which helps to further filter the results.

In order to achieve high accuracy, we adopt a two-level RVM framework. The level-1 RVM model is given the values of the features and outputs whether an indel or a complex indel occurs at the candidate region or not. The level-2 RVM follows the level-1 RVM; its inputs are not only the values of the features but also the output of the level-1 model, and it outputs the genotypes as the final results. According to our comparison experiments, the two-level RVM framework usually outperforms a multi-category RVM framework.

Suppose that we are given N indel candidates. For candidate x_i, the values of the six features are represented as a vector [x_i], while the unknown genotype is denoted by g_i, where g_i ∈ {0, 1, 2}. The aim of the proposed RVM framework is to estimate g_i by extracting [x_i] from the sequencing data and computing the result. Thus, we have the functional relationships between the features and the genotypes as:


$$\begin{cases} g_i = y_2(x, y_1(x); \omega_1, \omega_2) + \varepsilon_i \\ y_2(x, y_1(x); \omega_1, \omega_2) = \sum_{i=1}^{N} \omega_i^{2}\, K([x, y_1(x)], [x_i, y_1(x_i)]) + \omega_0^{2} \\ y_1(x; \omega_1) = \sum_{i=1}^{N} \omega_i^{1}\, K(x, x_i) + \omega_0^{1} \end{cases}$$

where ε_i is the model residual, y_2(x, y_1(x); ω_1, ω_2) is the output of the level-2 RVM, and y_1(x; ω_1) is the output of the level-1 RVM. K([x, y_1(x)], [x_i, y_1(x_i)]) and K(x, x_i) are their respective kernel functions, and ω_1 and ω_2 are the unknown model parameters, i.e., the weights of the features. Here, we simply use the Gaussian kernel function K(x, y) = exp(−γ‖x − y‖²). A logistic regression is also incorporated to conduct the final filtering and compute the statistical significances, where the genotype probability for indel candidate x_i is:

$$p(g_i = 1 \mid \omega^{*}) = \sigma[y(x_i; \omega^{*})] = \frac{1}{1 + e^{-y(x_i;\, \omega^{*})}}$$
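A minimal sketch of the kernel and link function above; gamma, the weight layout (bias first), and the function names are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def rvm_output(x, train_x, omega, gamma=1.0):
    # y(x; omega) = omega_0 + sum_i omega_i * K(x, x_i)
    k = np.array([gaussian_kernel(x, xi, gamma) for xi in train_x])
    return omega[0] + k @ omega[1:]

def genotype_probability(x, train_x, omega, gamma=1.0):
    # logistic link sigma[y(x; omega)] as in the formula above
    return 1.0 / (1.0 + np.exp(-rvm_output(x, train_x, omega, gamma)))
```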

2.3 Training the RVM Framework and Estimating Model Parameters

Before the RVM framework is ready to use, a training set is needed. The training set should consist of thousands of indel candidates with known genotypes. It is not difficult to find such a set: many famous genomic projects, such as the 1000 Genomes Project, TCGA and ICGC, provide verified indel call sets. Simulation datasets are also helpful for training the framework, because the interactions among indel candidates are not considered in this problem. Based on the six selected features, the value of each feature is first normalized to the interval [0, 1]. Let ω_1 and ω_2 follow the same normal distribution as the distribution of ε, and let α = [α_1, α_2, …, α_N]^T be the hyperparameters. Then we have

$$p(\omega_i \mid \alpha_i) = \mathcal{N}(\omega_i \mid 0, \alpha_i^{-1})$$

$$p(\omega \mid \alpha) = \prod_{i=1}^{N} \sqrt{\frac{\alpha_i}{2\pi}}\, \exp\!\left(-\frac{\alpha_i \omega_i^2}{2}\right)$$

As we do not consider the interactions among indel candidates, g_i is independent of g_{i′}. Thus, given the weights, the joint probability of the genotypes of all given candidates is:

$$p(g \mid \omega) = \prod_{i=1}^{N} \sigma[y(x_i; \omega)]^{g_i}\, \{1 - \sigma[y(x_i; \omega)]\}^{1 - g_i}$$

Because the values of the unknown g are directly related to the weights ω only, by the Markov property we have:


$$p(g \mid \omega) = p(g \mid \omega, \alpha)$$

And thus the predictions of the genotypes g* are:

$$P(g^{*} \mid g) = \sum_{\omega} p(g^{*} \mid \omega)\, p(\omega \mid g) = \sum_{\omega} \sum_{\alpha} p(g^{*} \mid \omega, \alpha)\, p(\omega, \alpha \mid g)$$

The following algorithm is then used to train the hyperparameters α.

Step 1: Set the initial values of α. As α is a set of hyperparameters, we preset the initial value of each dimension of α to 1/N².
Step 2: Calculate the weights ω corresponding to the current α values, according to the maximum a posteriori probability:

$$\omega_{MAP} = \arg\max_{\omega}\, p(\omega \mid g, \alpha)$$

Taking the logarithm, we have

$$\omega_{MAP} \propto \arg\max_{\omega}\, \log p(g \mid \omega)\, p(\omega \mid \alpha)$$

where

$$\log p(g \mid \omega)\, p(\omega \mid \alpha) \propto \sum_{i=1}^{N} \Big[ g_i \log \sigma[y(x_i; \omega)] + (1 - g_i) \log\big(1 - \sigma[y(x_i; \omega)]\big) \Big] - \frac{1}{2}\, \omega^{T} A\, \omega$$

and A is the diagonal matrix of the α_i. Using the second-order Newton method, we have the gradient

$$\nabla = \nabla_{\omega} \log p(g \mid \omega)\, p(\omega \mid \alpha) = \Phi^{T}(g - y) - A\, \omega$$

where y is the vector with entries y_i = σ[y(x_i; ω)] and Φ is the design matrix of the features substituted into the kernel function:

$$\Phi = \begin{bmatrix} 1 & K(x_0, x_0) & K(x_0, x_1) & \cdots & K(x_0, x_{N-1}) \\ 1 & K(x_1, x_0) & K(x_1, x_1) & \cdots & K(x_1, x_{N-1}) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & K(x_{N-1}, x_0) & K(x_{N-1}, x_1) & \cdots & K(x_{N-1}, x_{N-1}) \end{bmatrix}$$

Step 3: Use the Laplace method to approximate p(ω | α). Let the ω_MAP obtained in Step 2 be the mean; then we have the covariance matrix:


$$\Sigma = (H \mid_{\omega_{MAP}})^{-1}$$

where

$$H = -\nabla^{2}_{\omega} \log p(g \mid \omega)\, p(\omega \mid \alpha) = \Phi^{T} B\, \Phi + A$$

and B is the diagonal matrix of the y_i(1 − y_i).

Step 4: Update the hyperparameters α using ω_MAP and Σ. For each iteration, we have

$$\alpha_i^{new} = \frac{1 - \alpha_i \Sigma_{i,i}}{\big(\omega_{MAP,i}^{new}\big)^{2}}$$

$$\omega_{MAP}^{new} = \omega_{MAP} + \Delta\omega, \qquad \Delta\omega = H^{-1} \nabla$$

where Σ_{i,i} is the i-th diagonal element of Σ.

Step 5: Go to Step 2 with the updated α. The algorithm terminates in one of two ways: (1) if for every α_i we have |α_i^new − α_i| ≤ t, where t is a preset threshold, or (2) if the number of iterations reaches a preset maximum T. In most cases, the algorithm finishes in the first way. In these cases, a number of the α_i grow towards infinity, while the weights ω_i corresponding to them shrink to 0; the other α_j converge to stable values, and the ω_j corresponding to them become the so-called relevance vectors.
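Putting Steps 1–5 together, the following NumPy sketch trains one binary RVM; it follows the standard sparse Bayesian updates used above, with the design matrix Phi built from the kernel (bias column first). The initialization constants and the epsilon guarding the division are our illustrative choices.

```python
import numpy as np

def train_rvm(phi, g, t=1e-3, max_iter=500):
    """phi: N x (N+1) design matrix; g: array of 0/1 labels."""
    n = phi.shape[1]
    alpha = np.full(n, 1.0 / n ** 2)   # Step 1: initialize hyperparameters
    omega = np.zeros(n)
    for _ in range(max_iter):
        y = 1.0 / (1.0 + np.exp(-phi @ omega))
        grad = phi.T @ (g - y) - alpha * omega        # gradient of log posterior
        b = y * (1.0 - y)
        h = phi.T @ (phi * b[:, None]) + np.diag(alpha)  # H = Phi^T B Phi + A
        sigma = np.linalg.inv(h)                      # Step 3: covariance
        omega = omega + sigma @ grad                  # Step 2: Newton step
        alpha_new = ((1.0 - alpha * np.diag(sigma))
                     / np.maximum(omega ** 2, 1e-12))  # Step 4: update alpha
        if np.all(np.abs(alpha_new - alpha) <= t):    # Step 5: convergence
            break
        alpha = alpha_new
    return omega, alpha
```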

3 Experiments and Results

We test the proposed approach on a set of artificial datasets and compare it to two popular approaches, Pindel [4] and Gindel [13]. The sequencing reads given to these three approaches are mapped by BWA under default parameter settings. A VCF file is also given as the list of indel candidates.

3.1 Generating Simulation Datasets

To generate the artificial datasets, we randomly sampled a 100 kbps region from chromosome 19 of the human reference genome (hg19). For each dataset, we randomly plant 160 indels. The length of an indel has the same probability of falling into each of the following four length intervals: 20–50 bps, 50–200 bps, 200–1000 bps and 1000–5000 bps. For each indel, we set an elevated region around it, whose length is 1000 bps longer than the indel. Some concomitant SNVs may also be planted in the elevated region according to a preset elevated mutation rate of 0.01; the background mutation rate for the whole region is set to 0.0001. In addition, for about one third of the complex indels, the inserted fragments are sampled from nearby regions.


Then, paired-end reads are sampled from the artificial chromosome 19. The read length is set to 100 bps, while the insert sizes follow a normal distribution with a mean of 500 bps and a standard deviation of 15 bps. A 0.5% sequencing error rate is considered in sampling the reads, as sketched below.
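A compact sketch of this sampling scheme (reverse-complementing of the right mate and quality values are omitted; names are ours):

```python
import random

def sample_read_pair(genome, read_len=100, mean_ins=500, sd_ins=15, err=0.005):
    insert = max(2 * read_len, int(random.gauss(mean_ins, sd_ins)))
    start = random.randrange(len(genome) - insert)
    left = genome[start:start + read_len]
    right = genome[start + insert - read_len:start + insert]
    # apply a 0.5% per-base sequencing error
    def noisy(seq):
        return "".join(random.choice("ACGT") if random.random() < err else c
                       for c in seq)
    return noisy(left), noisy(right)
```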

3.2 Comparison Results Under Different Coverages

We follow a comparison strategy similar to Gindel's. The sequencing coverage is usually sensitive for the accuracy of indel calling and genotyping, and here we still consider cases where the coverage is very low: although it is no longer difficult to obtain a coverage of 100X, in cancer sequencing data, where complex indels often occur, some important minor sub-clones may have quite low coverage, so we cannot simplify the problem. We vary the coverage among 4X, 6.4X, 10X, 15X and 20X; the comparison results are shown in Table 1. For each approach, the column marked "Indel" lists the accuracy on datasets that consist of indels only, and the column marked "CI included" lists the accuracy on datasets that consist of both indels and complex indels. For each simulation configuration, we repeat five times and report the average values. We do not collect the results of Pindel when the coverage is less than 10X, because Pindel has a default filtering threshold on the number of supporting reads, and applying it would make the comparison unfair. Moreover, Pindel prefers complex indels whose lengths are shorter than a read length, so its results for long complex indels are invalid. Accuracy is defined as the percentage of successfully genotyped indels among the given indels.

Table 1. The comparison results under different coverages.

| Coverage (X) | Indel length (bps) | CIGenotyper Indel | CIGenotyper CI included | Gindel Indel | Gindel CI included | Pindel Indel | Pindel CI included |
|---|---|---|---|---|---|---|---|
| 4 | 20–50 | 0.9021 | 0.7481 | 0.8987 | 0.4880 | UC | UC |
| 4 | 50–200 | 0.9346 | 0.7647 | 0.9069 | 0.4970 | UC | N/A |
| 4 | 200–1000 | 0.9565 | 0.8170 | 0.9084 | 0.4526 | UC | N/A |
| 4 | >1000 | 0.9636 | 0.8502 | 0.9132 | 0.5904 | UC | N/A |
| 6.4 | 20–50 | 0.9126 | 0.7742 | 0.9113 | 0.6179 | UC | UC |
| 6.4 | 50–200 | 0.9456 | 0.8075 | 0.9176 | 0.7188 | UC | N/A |
| 6.4 | 200–1000 | 0.9587 | 0.8217 | 0.9220 | 0.7248 | UC | N/A |
| 6.4 | >1000 | 0.9689 | 0.8668 | 0.9336 | 0.6458 | UC | N/A |
| 10 | 20–50 | 0.9569 | 0.8193 | 0.9157 | 0.7793 | 0.6022 | 0.6215 |
| 10 | 50–200 | 0.9536 | 0.8478 | 0.9273 | 0.8017 | 0.6077 | N/A |
| 10 | 200–1000 | 0.9684 | 0.8668 | 0.9204 | 0.7799 | 0.6016 | N/A |
| 10 | >1000 | 0.9706 | 0.8930 | 0.9360 | 0.8093 | 0.6413 | N/A |
| 15 | 20–50 | 0.9589 | 0.8645 | 0.9240 | 0.8015 | 0.6308 | 0.7178 |
| 15 | 50–200 | 0.9689 | 0.8811 | 0.9278 | 0.8093 | 0.6545 | N/A |
| 15 | 200–1000 | 0.9658 | 0.8977 | 0.9336 | 0.7700 | 0.6875 | N/A |
| 15 | >1000 | 0.9756 | 0.9167 | 0.9462 | 0.8573 | 0.6985 | N/A |
| 20 | 20–50 | 0.9556 | 0.9048 | 0.9375 | 0.8311 | 0.6985 | 0.7244 |
| 20 | 50–200 | 0.9589 | 0.9167 | 0.9472 | 0.8016 | 0.7683 | N/A |
| 20 | 200–1000 | 0.9623 | 0.9286 | 0.9569 | 0.8431 | 0.7859 | N/A |
| 20 | >1000 | 0.9689 | 0.9357 | 0.9588 | 0.8570 | 0.8646 | N/A |

UC denotes an unfair comparison; N/A denotes invalid outputs.

From Table 1 we can see that CIGenotyper performs quite stably: its accuracy is always higher than 90% for indels as the coverage alters, and even with a number of complex indels included in the datasets it keeps the accuracy above 75% in most cases. Gindel and Pindel also achieve a high level of accuracy for indels; however, the accuracy of Gindel drops to around 50% when complex indels are included. Thus, compared to Gindel and Pindel, CIGenotyper significantly improves the performance for complex indels: it nearly doubles the accuracy when the coverage decreases to 4X and earns around 30% higher accuracy when the coverage decreases to 6.4X.

3.3 Further Analysis on Complex Indels Under High Coverages

For those complex indels whose inserted fragments come from nearby regions, high coverage may not always be helpful, as in the case mentioned in Sect. 2.1. We conduct a group of experiments on datasets that consist of complex indels only. The length of a complex indel has the same probability of falling into each of the following four length intervals: 20–50 bps, 50–200 bps, 200–1000 bps and 1000–5000 bps. We keep the other settings but vary the coverage among 50X, 75X, 100X, 150X, 200X and 300X. The results are shown in Fig. 4, where the x-axis denotes the coverage, the left y-axis the error rates and the right y-axis the running times. From Fig. 4, we can see that CIGenotyper does not suffer an accuracy loss when the coverage goes quite high. The running times increase because the number of reads increases, almost linearly. These results show that the proposed approach is able to handle high coverage data as well.

Fig. 4. The accuracy vs. the coverage (error rates, left axis; running times, right axis)


Fig. 5. Error rates and running times for different read lengths (length interval of complex indels: 50–200 bps)

3.4 Comparison Results Under Different Read Lengths and Insert Sizes

Read length and insert size are important issues in detecting indels. We test read lengths varying from 100 bps to 225 bps; when the read length increases, the number of reads decreases to maintain the same coverage, and vice versa. The length interval of the planted complex indels is limited to 50–200 bps. The results are shown in Fig. 5: the error rates decrease as the read length increases, and the running times decrease because the number of reads decreases. We also test insert sizes ranging from 500 bps to 1800 bps, with the coverage set to 20X and the read length set to 200 bps. The results are shown in Fig. 6. The larger the insert size, the less sensitive the discordant read-pair feature becomes, so the error rates increase when a larger insert size is preset. However, according to Fig. 6, the error rate only reaches 5%, which is acceptable for most cases. Altering the insert size has a limited effect on the running times.

Fig. 6. Error rates and running times under different insert sizes


4 Conclusions

In this article, we focus on the computational problem of estimating the genotypes of indels and complex indels. We propose a machine learning approach, implemented as CIGenotyper. The proposed approach adopts a relevance vector machine (RVM) framework consisting of two RVMs: the level-1 RVM is given the values of the features and outputs whether an indel or a complex indel occurs at the candidate region, and the level-2 RVM follows it, taking as inputs both the values of the features and the output of the level-1 model, and outputs the genotypes as the final results. Six features are carefully discussed and selected. A series of experiments is conducted to test the performance of CIGenotyper. The results are compared to two popular approaches, Gindel and Pindel; CIGenotyper significantly improves the accuracy for complex indels under different coverages and read lengths. Therefore, the proposed approach fits the genotyping problem for both indels and complex indels.

Acknowledgement. This work is supported by the National Science Foundation of China (Grant No. 31701150) and the Fundamental Research Funds for the Central Universities (CXTD2017003).

References

1. The Computational Pan-Genomics Consortium: Computational pan-genomics: status, promises and challenges. Briefings Bioinf. 19(1), 118–135 (2018)
2. Lu, C., Xie, M., Wendl, M., et al.: Patterns and functional implications of rare germline variants across 12 cancer types. Nat. Commun. 6, 10086 (2015)
3. DePristo, M., Banks, E., Poplin, R., et al.: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43(5), 491–498 (2011)
4. Ye, K., Wang, J., Jayasinghe, R., et al.: Systematic discovery of complex insertions and deletions in human cancers. Nat. Med. 22(1), 97–104 (2016)
5. Iakovishina, D., Janoueix-Lerosey, I., Barillot, E., et al.: SV-Bay: structural variant detection in cancer genomes using a Bayesian approach with correction for GC-content and read mappability. Bioinformatics 32(7), 984–992 (2016)
6. Kloosterman, W., Francioli, L., Hormozdiari, F., et al.: Characteristics of de novo structural changes in the human genome. Genome Res. 25(6), 792–801 (2015)
7. Zhang, X., Chen, H., Zhang, R., et al.: Detecting complex indels with wide length-spectrum from the third generation sequencing data. BIBM 2017, 1980–1987 (2017)
8. Geng, Y., Zhao, Z., Xu, J., et al.: Identifying heterogeneity patterns of allelic imbalance on germline variants to infer clonal architecture. In: Huang, D., Jo, K., Figueroa-García, J. (eds.) ICIC 2017. LNCS, vol. 10362, pp. 286–297. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-63312-1_26
9. Geng, Y., Zhao, Z., Zhang, X., et al.: An improved burden-test pipeline for identifying associations from rare germline and somatic variants. BMC Genom. 18(7:55), 55–62 (2017)
10. Zhang, J., Wang, J., Wu, Y.: An improved approach for accurate and efficient calling of structural variations with low-coverage sequence data. BMC Bioinf. 13(6), S6 (2012)
11. Bansal, V., Libiger, O.: A probabilistic method for the detection and genotyping of small indels from population-scale sequence data. Bioinformatics 27(15), 2047–2053 (2011)


12. Marschall, T., Hajirasouliha, I., Schonhuth, A.: MATE-CLEVER: Mendelian-inheritance-aware discovery and genotyping of midsize and long indels. Bioinformatics 29(24), 3143–3150 (2013)
13. Chu, C., Zhang, J., Wu, Y.: GINDEL: accurate genotype calling of insertions and deletions from low coverage population sequence reads. PLoS One 9(11), e113324 (2014)
14. Camps-Valls, G., Martínez-Ramón, M., Rojo-Alvarez, J., et al.: Nonlinear system identification with composite relevance vector machines. IEEE Sig. Process. Lett. 14(4), 279–282 (2007)
15. Zhang, X., Xu, M., Wang, Y., et al.: A graph-based algorithm for prioritizing cancer susceptibility genes from gene fusion data. BIBM 2017, 2204–2210 (2017)

Genomic Solutions to Hospital-Acquired Bacterial Infection Identification

Max H. Garzon(✉) and Duy T. Pham

Computer Science and Bioinformatics, The University of Memphis, Memphis, TN 38152, USA
{mgarzon,dtpham}@memphis.edu

Abstract. Hospital acquired infections (HAIs) are notorious for their likelihood of fatal outcomes in infected patients due to rapid bacterial mutation rates, consequent resistance to antibiotic treatments, and stubbornness to treatment, let alone eradication, to the point that they have become a challenge to medical science. A fast and accurate method to identify HAIs will assist in diagnosis, in the identification of appropriate patient treatment, and in controlling future outbreaks. Based on recently developed new methods for genomic data extraction, representation and analysis in bioinformatics, we propose an entirely new method for species identification. The accuracy of the new methods is very competitive and in several cases outperforms the standard spectroscopic protein-based MALDI-TOF MS commonly used in clinical microbiology laboratories and public healthcare settings, at least prior to translation to a clinical setting. The proposed method relies on a model of hybridization that is robust to frameshifts and thus is likely to provide resilience to the length variability in the sonication of the samples, probably one of the major challenges in a translation to clinical settings.

Keywords: Hospital-acquired infections · Identification · Next-generation nxh microarrays · Digital genomic signature · Machine learning · Neural networks · Self-organizing maps · Random forests

1 Introduction

The morbidity and mortality rates due to hospital-acquired infections (HAI) pose a major public health concern worldwide (Magill et al. 2014). In the U.S., an estimated 1 of 25 patients admitted to hospitals gets infected with an HAI (Magill et al. 2014). Infected patients can expect to pay thousands of dollars in healthcare costs depending on the HAI site, resulting in billions of dollars spent on healthcare treatment for HAIs annually (Magill et al. 2014). Despite a number of preventative programs established to reduce the rate of infection and healthcare costs, a fast and accurate method to identify HAIs is still needed due to the rapid rate of mutation of bacterial genomes. Such a method will assist in the diagnosis (identification of the pathogen) and the selection of appropriate treatment, as well as in controlling future outbreaks.

Methods to identify bacterial pathogens can be based on proteomic or genomic data. For the proteomic method, matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF-MS) revolutionized clinical microbiology and is (or


has become) the gold standard procedure to identify bacterial pathogens in clinical microbiology (Zhou et al. 2017). This method is known to be a rapid, accurate, easy to use, and cost-effective tool for bacterial identification based on microbial proteins (Mellman et al. 2008). Spectra are obtained by measuring the exact size of peptides and small proteins, which are then compared to other spectra in a database to identify a species within a few minutes (Mellman et al. 2008). A meta-analysis study (Zhou et al. 2017) showed that MALDI-TOF-MS can identify bacterial species with about 84% accuracy (95% confidence interval [81.2%–88.9%]) and bacterial genus with about 90.9% accuracy (95% confidence interval [88.3%–93.3%]). Accuracy may vary due to the pattern recognition algorithm, the scoring criteria used to determine a species or genus match, and the comprehensiveness of the spectra database.

For genomic methods, pulsed-field gel electrophoresis (PFGE) is still widely used to identify clinical pathogens. This method starts by cutting the DNA into different-size fragments using restriction enzymes and then placing them in an electric field that sorts the fragments according to size (Sharma-Kuinkel et al. 2016). A key feature of this method is that large fragments (50 kb to 10 Mb) can be separated by changing the intensity of the electric field at the electrodes (Sharma-Kuinkel et al. 2016). This provides a finely discriminable banding pattern that uniquely characterizes a bacterium and allows for comparison on tasks such as strain identification. The drawback of this method is the large amount of time and cost required, since it can take up to four days or more to obtain results (Guadalupe et al. 2015). Further problems include the manual scoring of ambiguous bands and limitations in the method's resolution. Compared to other genomic methods (discussed next), this method may thus fail for certain types of strains (Guadalupe et al. 2015).

Another genomic method used for identification of bacterial strains is multilocus sequence typing (MLST). This method takes several housekeeping genes (usually 7) of length 400–500 bp and matches the sequenced genes to a database. It requires less work to obtain results; however, the database contains only a limited number of bacteria (http://pubmlst.org/), and conserved genes need to be known to fully discriminate species from one another (Jolley and Maiden 2014).

Recently, genomic sequencing technologies and, in particular, whole genome sequencing (WGS) are becoming commonplace in public health and hospital infection control-affiliated laboratories (Kwong et al. 2015). Whole genomes provide comprehensive genomic information for understanding infectious diseases and better resolution in characterizing strains. Generally, alignment-based (e.g. BLAST) or alignment-free methods (such as k-mers) are used to classify bacteria using the entire genome. The downside of alignment-based methods is that there is no universal cut-off percentage threshold to determine the identity of an unidentified genome (Zielezinski et al. 2017). Additionally, alignment-based methods are time and memory consuming, since the number of possible alignments increases with the length of the sequence, especially for multiple-sequence alignment (Zielezinski et al. 2017). Dynamic programming algorithms can remedy the issues of speed and memory; however, the time complexity remains in the order of the product of the lengths of the sequences (Zielezinski et al. 2017). Alignment-free methods such as k-mers are popular tools due to their speed to results. These alignment-free methods use fast algorithms to count the

488

M. H. Garzon and D. T. Pham

frequency of all possible k-mers (short DNA oligonucleotides of length k) and find unique k-mers that match to a species in a database. The disadvantage to alignment-free methods is that memory consumption increases exponentially (there are 4k possible kmers) as k grows (Zielezinski et al. 2017). In this study, we take a similar but much more refined alignment-free and genomic approach by using next generation microarrays (so-called nxh chips), as described in (Garzon and Mainali 2017). WGSs of 80 strains representing 16 different species (fully described in Sect. 3) were processed according to this method and the resulting markers (so-called digital signatures) were used for the species identification problem (precisely defined in Sect. 3). We show that these markers can be readily implemented with standard microarray technology, provide enough information to discriminate bacteria at the species level with competitive, if not superior, degrees of reliability by machine learning methods, such as neural networks (NNs), self-organizing maps (SOMs), and random forests (RFs). The markers are universal, i.e. can be used for all types of bacteria. Finally, we discuss implications and questions of further interest in Sect. 5.

2

A New Method for HAI Species Identification

An early way to capture genomic data is with microarrays (Schena 2003; Stekel 2003) and more recently, Next Generation Sequencing (NGS) (Demkow and Polski 2015) and Next Generation Microarrays (nxh DNA chips) (Garzon and Mainali 2017). Together with new data analysis and mining from machine learning, these methods can become very powerful. However, a fundamental problem with standard DNA microarrays and even NGS is that useful information is often deeply buried in a haze of noise, i.e., redundant, incomplete and uncertain data. An analysis of hybridization reliability has been presented in (Garzon and Mainali 2017) that makes evident why this problem occurs and provides an effective way to address it using a next generation of microarrays, named nxh chips. They have resulted in demonstrable increases in the confidence, accu‐ racy and reliability (false positives/negatives) of DNA chips, which become even more prominent in applications such as species identification and diagnosis, as shown below. To make this paper self-contained, we summarize the requisite results next. The critical property behind the operation of microarrays and NGS is hybridization between Watson-Crick (WC) complements, the fundamental characteristic property of DNA. If genomic analyses are to produce useful information (say, “pixels” (spots) in microarrays), knowledge of the structural properties of what we term “DNA spaces” would help. (We will frame the discussion in microarrays below, but most of these results apply to readouts of NGS as well.) Building on preliminary studies in prior work (Deaton et al. 2006), DNA chips usually require uniform length short oligos of up to n = 60-mers or so. Let B be a set of DNA n-mers of uniform length, copies of which (and their separate complements) are to be affixed to the spots of a chip. These oligo nucleotides will be called probes (e.g., in the case of a very special type of set, an nxh n-pmer in an encoding basis, a term to be precisely defined below.) In practice, this set is a judicious target-free selection of oligos that provide full but tight covering of the DNA space of all oligos of said size n, unlike for ordinary microarrays, where a full selection of genes in a target

Genomic Solutions to Hospital-Acquired Bacterial Infection Identification

489

organism is chosen. The key property is named noncrosshybridization (nxh), which comes in degrees of quality determined by a numeric parameter τ that controls for the quality of nxh and in practice serves as a proxy for the hybridization stringency of reac‐ tion conditions. The best choice would be the Gibbs energy, or approximations thereof (Deaton et al. 2006) typically set at τ = −6 kcal/mole (considered to be the standard threshold of minimum free energy required for hybridization of two strands to occur). Such a selection requires deep knowledge of the structural properties of the Gibbs energy landscapes of the full set of all n-mers (DNAn). The key tool to understand this structure is a refinement of the notion of Gibbs energy to the so-called hybridization distance (h-distance.) An example that illustrates its calculation is shown in Fig. 1(a). Approximating Gibbs Energy of Hybridization (a)

(b)

Fig. 1. (a) Computation of the h-distance h(x, y) between strands x and y of common length n. The strands x and the reverse yR of y are aligned, and the minimum difference from n of the number of overlapping WC-complementary matches (in red) across all possible frameshifts is the value hm(x, y) for the h-measure between x and y. This procedure is then repeated for the Watson-Crick complement (WC) y′ of y. The h-distance between x and y is the minimum of the two h-measures hm(x, y) and hm(x, y′) (Garzon and Bobba 2012). (b) A typical procedure to obtain a digital signature counting the number of hybridizations of fragments of a sequence x (left in blue) to the probes (copies of the sequences in an nxh set) on an nxh chip. (Color figure online)

Four key properties will nearly optimize noise removal in DNA chips: (a) The h-distance is defined in such a way that low values of h(x, y) are highly corre‐ lated with the degree of WC-complementarity of two oligos x, y or their WCcomplements (e.g., h(x, y) = 0 if an only if x, y are identical or perfect complements, i.e. h does not distinguish perfect WC-complements; such pairs are thus bundled in so-called pmers); (b) The h-distance is a true distance function among n-pmers, i.e., in addition, h(x, y) = h(y, x), and the familiar geometric triangle inequality: h(x, z) ≤ h(x, y) + h(y, z), hold for arbitrary n-pmers x, y, z; (c) If a basis B is selected so that its sequences satisfy the property h(bi, bj) ≥ 2τ, for distinct probes bi, bj in B, it is called an nxh basis of stringency τ. A noncrosshyb‐ rizing τ-set is an “orthogonal” set of oligos for hybridization, i.e., they have no redundancies in terms of hybridization to stringency τ. Moreover, for a pair of oligos x, y and criterion threshold τ, a hybridization decision based on h and τ agrees with one made on the Nearest Neighbor (NN) model of Gibbs energy about 80% of the time (i.e. the statement h(bj, xi) < τ if and only if Gibbs(x, y) < −6, is

490

M. H. Garzon and D. T. Pham

true 80% of the time.) This is a nonobvious observation based on principled design of the h-distance as an approximation of the Gibbs energy, as well as an exhaustive/ extensive check on various sizes of n (Garzon and Bobba 2012); (d) The selection of pmers in an nxh chip should be complete, i.e., provide full coverage of the space of all n–pmers, so that an arbitrary random n-mer fragment xi in a target x will hybridize to at least one (and hence, by the triangle inequality, to exactly one) probe bj. Given an nxh basis B, an nxh chip can be built on solid surface using standard biotechnology available for DNA/RNA microarrays, as described in (Garzon and Mainali 2017). Properties (a)–(d) guarantee that on an nxh chip, a random oligo xi of comparable probe size is much more likely to hybridize to fewer probes, and under appropriate stringency τ (directly related to the minimum separating h-distance between oligos in B and other conditions) to ensure hybridization to at most one probe. This socalled nxh property immediately translates into the desirable properties to the problems mentioned above. The noise is notably reduced (in fact, it is completely eliminated under ideal conditions), results will be more predictable and reproducible, and analyses will be much more reliable, as demonstrated in (Garzon and Mainali 2017). This property remains true in general for n ≤ 60 and every (shredded) genomic sequence as input. On any given DNA chip, a unique hybridization pattern (the digital signature) can be produced as described in Fig. 1(b). A given (possibly unknown) target x is fragmented by sonication (Stekel 2003) usually tagged, and poured in solution over the chip under appropriate reaction conditions enforcing stringency τ. We emphasize once again that this reduction in the number of probes is typically considered a loss of redundancy, and hence a loss of signal-to-noise ratio in standard analyses that would diminish the payoff of the readout. However, the analyses in (Garzon and Wong 2011) shows that, to the contrary, they will result in much clearer genomic signal in the chip readout x. Therefore, signatures will exhibit minimum variability, assuming saturation conditions in target concentrations and long enough relaxation time to obtain a full signal (order of hours.) The application of this indexing technique to HAI Identification thus requires addressing a fundamental general question: Does the digital signature of a sequence x contain enough information to perform species identification of HAIs? The goal of this paper is to show strong evidence for a positive answer to this question.

3 Data and Methods

The computational problem underlying species or strain identification is the recognition problem for genomic sequences. This is a well-known and difficult problem in computer science, usually unsolvable or NP-complete (i.e., likely computationally intractable) in full generality. For finite sets, such as the genes belonging to a specific organism or chromosome, it becomes more tractable, but the problem remains messy and inaccessible to efficient approaches, especially if the genomes are to be represented by compact data structures, such as digital signatures on a microarray/nxh chip. A more general problem is classification, where the organisms are grouped into species, and the corresponding identification problem asks for the species of any given strain.


In order to solve this problem, whole-genome sequencing data were obtained from Genbank (www.ncbi.nlm.nih.gov/genbank/), as shown in Table 1, for a data corpus containing 80 strains representing 16 species (5 strains per species). The computational platform used to obtain the visualization and results below was the popular package R, downloaded from http://cran.us.r-project.org/. The pmer (a DNA strand of length p bundled with its Watson-Crick complement) counts were used to obtain signatures for the organisms using Perl software.

Table 1. Species in the data sample described above. A total of 5 strains per species were downloaded from ncbi.nlm.nih.gov/genbank/, with total genome sizes (in Mb) shown in the size columns.

ID  Species          ~Size (Mb)    ID  Species           ~Size (Mb)
1   A. baumannii     20.1          9   E. faecalis       13.8
2   C. coli          8.6           10  E. faecium        14.2
3   C. jejuni        8.3           11  H. pylori         8.1
4   C. difficile     20.8          12  M. tuberculosis   22.0
5   E. coli          25.9          13  N. gonorrhoeae    11.0
6   K. pneumoniae    26.4          14  N. meningitidis   10.8
7   P. mirabilis     20.5          15  S. aureus         14.7
8   S. marcescens    26.1          16  P. aeruginosa     34.0
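As a concrete illustration of what pmer counting involves, the following minimal R sketch counts k-mers bundled with their Watson-Crick reverse complements; the function names and toy sequence are ours, not the authors' Perl implementation, and are given only under that assumption.

# Reverse complement of a DNA string (illustrative helper)
revcomp <- function(s) {
  comp <- c(A = "T", C = "G", G = "C", T = "A")
  paste(rev(comp[strsplit(s, "")[[1]]]), collapse = "")
}

# Count pmers: each k-mer and its WC reverse complement share one key
pmer_counts <- function(seq, p = 3) {
  n <- nchar(seq)
  kmers <- substring(seq, 1:(n - p + 1), p:n)   # all windows of length p
  counts <- table(kmers)
  keys <- vapply(names(counts),
                 function(k) min(k, revcomp(k)), character(1))
  tapply(as.integer(counts), keys, sum)          # bundle into pmers
}

pmer_counts("ACGTTGCA", p = 3)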

3.1 Neural Networks

An artificial neural network consists of a finite number of neuronal units connected through directed synaptic links that resemble the synaptic meshes in the mammalian brain. A neuron can be in one of various states of activation, characterized by a (real) number at any given time, but can change its activation in the next time step by applying its characteristic transfer function (a nonlinearity) to the net input from neighboring neurons (obtained as a weighted sum of the states of other neurons with a synaptic link into it). This process is iterated in parallel to update all neurons for any single data point (exemplar). The particular kind used here is a feed-forward neural net (FNN), where the neurons are arranged in layers and each neuron receives signals from neurons in the previous layer. The first is an input layer of neurons (in 1–1 correspondence with the features) reading the features in a data point; the last is an output layer producing a prediction of the response, i.e., the category to which the features in the input belong; and several neurons arranged in intermediate (so-called hidden) layers try to tease out critical distinguishing features in the input feature vector. Such networks can solve classification problems in a way similar to how the human brain works (see (Hassoun 1995) for an overview). A major advantage of this method is that no deep prior analysis needs to be carried out. The model can be trained by a learning algorithm (backpropagation below) (Hassoun 1995) to classify data by passing a number of exemplars (data points labeled with the expected correct answer, here the category to which they belong) from a training set for an appropriate number of times (or epochs), until the answers are mostly right. The quality of the model is measured by the accuracy with which it predicts the answers on a testing set of data that the network has never seen in the training phase.

Various feed-forward neural networks (FNNs) were trained for 2000 epochs by using the 'h2o' library package in R (Arora et al. 2006) with a hyperbolic tangent function as a smooth transfer function (nonlinearity). The single output unit produced normalized decimal values between 0 and 1. For species identification, each k of the 16 species was assigned a range of outputs in the interval of radius 0.03125 centered at (k/16) − 0.03125, for k = 1, …, 16. The data corpus was partitioned into a learning set (80% of the data, randomly assigned) and a testing set (the remaining 20%). As is customary in machine learning, various combinations of nxh bases, neuron types, and hidden layers were tried in an attempt to optimize performance. As an example, 3mE4-[4-3-2-1] describes an FNN with 4 input features and two hidden layers with 3 and 2 neurons, providing input to a single neuron in the output layer, using the four-feature signature vectors on the nxh basis 3mE4-2-at1.1, as described in Table 2. The h2o.predict function was then used to test the accuracy of the model. The accuracy was based on whether the predicted value obtained from h2o.predict after training of the network fell within the correct interval coding for the corresponding species of a data point (strain). This process was repeated 32 times and the average of the 32 accuracies was taken to determine the performance of a given neural network architecture.

Table 2. Noncrosshybridizing (nxh) chip designs used to obtain digital signatures.

ID  Basis          Probe length  Probes  τ
A   3mE4-2-at1.1   3             4       1.1
B   3mE4b-2at1.1   3             4       1.1
C   4mP3-3at2.1    4             3       2.1
D   8mP10-4at4.1   8             10      4.1
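The training and evaluation protocol described in this subsection might look roughly as follows with the 'h2o' package; this is a minimal sketch under our own assumptions (the file name, column names, and split call are illustrative, not taken from the paper).

library(h2o)
h2o.init()

# 'sig' holds one row per strain: feature columns plus a species id 1..16
sig <- read.csv("signatures.csv")
sig$target <- sig$species / 16 - 0.03125   # interval-coded output
hf <- as.h2o(sig)

parts <- h2o.splitFrame(hf, ratios = 0.8)  # 80%/20% train/test split
feats <- setdiff(colnames(sig), c("species", "target"))
model <- h2o.deeplearning(x = feats, y = "target",
                          training_frame = parts[[1]],
                          activation = "Tanh",  # smooth nonlinearity
                          hidden = c(3, 2),     # as in 3mE4-[4-3-2-1]
                          epochs = 2000)

pred  <- as.data.frame(h2o.predict(model, parts[[2]]))$predict
truth <- as.data.frame(parts[[2]]$target)$target
mean(abs(pred - truth) < 0.03125)          # within the coded interval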

3.2 Self-organizing Maps

A self-organizing map (SOM) is a special kind of neural network that learns from input features without feedback. This form of learning is known as unsupervised learning and is useful for clustering together groups that contain similar feature patterns. A unique aspect of SOMs is that they take high-dimensional data and are able to represent the data on a low-dimensional (typically 2D) plane (Kohonen 2013). Like NNs, SOMs also contain neuron ensembles (also called maps) that are initialized with weight values and are continually adjusted to extract a topological representation of the input data, where abstract similarities are captured as distance similarities on the map. A visualization map, usually in the form of a hexagonal or rectangular grid, allows us to understand the pattern of the input space. In order to classify a new input datum, the Euclidean distance between the input and all neuron sites is computed, and the one closest to the input takes all the reinforcement in the weights (winner-takes-all), while the weights of other neurons are decreased. After training, neuronal sites are labeled to identify the various clusters properly in the data for the testing phase on unseen data. A sketch of this workflow follows.
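A rough sketch of a supervised SOM of the kind used below, assuming the layered interface of recent versions of the 'kohonen' package (which accepts factor layers directly); the matrix X of signature features, the factor species, and the 5 × 5 grid are our illustrative choices, not the paper's settings.

library(kohonen)

# X: numeric matrix of nxh signature features (strains in rows);
# species: factor with 16 levels.
Xs <- scale(X)                               # z-score the features
train <- sample(nrow(Xs), 0.8 * nrow(Xs))    # 80%/20% split

som_fit <- supersom(list(feats = Xs[train, ],
                         label = species[train]),
                    grid = somgrid(5, 5, "hexagonal"))

# Predict species for unseen strains from the feature layer only
pr <- predict(som_fit, newdata = list(feats = Xs[-train, ]),
              whatmap = "feats")
mean(pr$predictions[["label"]] == species[-train])  # testing accuracy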


The supersom function from the 'kohonen' library package in R and its default settings were used to obtain a SOM classifier, since it supports supervised learning and prediction (Wehrens and Buydens 2007). The strains were classified into 16 species labeled 1–16. The data were partitioned into sets of 80%/20% for training and testing. Features of the nxh basis were scaled to z-scores, following the example in the package documentation. Prediction was performed using the predict function without the species label, and accuracy was based on whether the SOM correctly identified the unlabeled species from the given input features of the nxh basis. This process was repeated 32 times to obtain accuracy measurements, and the mean accuracy is reported.

3.3 Random Forests

A tree is a directed acyclic graph with a finite number of nodes and adjacencies of the parent-child type. A decision tree is a tree that solves a classification problem for certain data (e.g., bacteria) based on certain features associated with the levels of the tree. The most determinant feature determines the top adjacency, and a datum is classified into one of two subsets, roughly partitioning the data set into two halves by a certain threshold. The process continues recursively in each sub-tree at the second level based on a second feature, and culminates in a childless node (leaf). A random forest (RF) is a classification method in which a finite collection of decision trees is grown to make a prediction on unknown data (Breiman 2001). Each tree is grown using a bootstrap sample of the original data, and each node of each tree is split by a randomly selected subset of features; only the features in the best split for the node are selected. Bootstrapping and feature selection ensure that the trees are de-correlated and that the forest ensemble has low variance. Usually, the decision is based on information from the collection of decision trees, for example by majority voting. This training technique is usually responsible for the better performance of RFs on classification tasks.

The randomForest function from the 'randomForest' package in R was used to train a data set and make predictions (Liaw and Wiener 2002). The strains were classified into 16 species labeled 1–16. As before, the data were partitioned into sets of 80% for training and 20% for testing. A RF model was fitted by formulating the species label as the response variable to the feature values in the digital signatures, for each of several combinations of nxh bases. The default settings of randomForest were retained (ntree = 500). Prediction was done using the predict function without the species label for each strain, using only their features depending on the nxh basis. Accuracy performance was determined by whether the RF correctly assigned each strain to its own species. Both training and testing accuracy were measured. The training was run 32 times, where each repetition selected different samples of training and testing data, to obtain an accuracy percentage. The mean prediction accuracy percentages for training and testing are reported; a minimal sketch follows.
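A minimal sketch of this RF workflow, assuming the same illustrative data frame sig (signature features plus a species factor) introduced above.

library(randomForest)

# 'sig': illustrative data frame with feature columns plus a 'species'
# factor of 16 levels (one row per strain).
train <- sample(nrow(sig), 0.8 * nrow(sig))          # 80%/20% split
rf <- randomForest(species ~ ., data = sig[train, ], ntree = 500)

pred <- predict(rf, newdata = sig[-train, ])
mean(pred == sig$species[-train])   # testing accuracy, one repetition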

4 Results

For NNs, single and paired combinations of nxh bases performed poorly, with average accuracies below 85%. Results improved when triplet or quadruplet combinations of nxh bases were trained. As shown in Fig. 2, a triplet combination of nxh bases could achieve an average accuracy above 85% using various architectures. The best improvement in accuracy was achieved with 'A + B + C + D', a combination of the features in all four bases (as described in Table 2). Figure 2 shows the top five performing combinations of bases and neural network architectures in terms of accuracy.

Fig. 2. Neural networks can identify species of HAI strains with an accuracy of at least 85%, as shown by the top five performing feed-forward neural networks (FNNs) above. The highest accuracy was achieved with the combination of features in all bases ('A + B + C + D') in Table 2.

SOMs outperformed NNs in that combinations of fewer nxh bases were required to achieve an average accuracy above 90%. As shown in Fig. 3, nearly any combination of nxh bases could achieve an average accuracy above the average accuracy of MALDI-TOF MS (84%) at the species level. Moreover, a combination of 3mE4-2at1.1 and 4mP3-3at2.1 could achieve 99% accuracy, reducing the need for other trials.

Fig. 3. Self-organizing maps can be used on nxh digital signatures to identify the species of HAI strains with accuracy near 100% by a genomic method on next-generation nxh chips. Nearly any combination of nxh bases can identify the species of each strain with at least 95% accuracy. The mean of the testing accuracies across all combinations of nxh bases is shown by the red dotted line. (Color figure online)

Finally, RFs outperformed NNs and produced accuracy comparable to or better than that of SOMs for HAI species identification. For this method, any combination of nxh bases could achieve an average accuracy above 97%, as shown in Fig. 4. Single nxh bases such as 3mE4-2at1.1 could achieve an average accuracy of nearly 100%. Additionally, the mean testing accuracy across all possible combinations of nxh bases was higher than that of SOMs. Using RFs as a classifier would eliminate the need for multiple combinations of nxh bases, resulting in lower cost and time to results.

Fig. 4. Random forests are very likely to identify species of HAI strains with nearly perfect accuracy (97%) on combinations of the nxh bases in Table 2, outperforming neural networks and performing comparably to or better than self-organizing maps. The mean of the testing accuracies across all combinations of nxh bases is shown by the red dotted line. (Color figure online)

5 Conclusion and Further Research

We have introduced a new alignment-free method for the identification of species of HAI strains, based solely on genomic data and built on recent advances in microarray technology (next-generation nxh chips). The technology is readily implementable with current biotechnology, both in vitro and in silico, and with large genomic sequences (including whole genomes as shredded NGS readouts). The solutions provided are very competitive, if not superior, at least in principle, to the standard spectroscopic protein-based MALDI-TOF MS commonly used in clinical microbiology laboratories and public healthcare settings today. The proposed method is alignment-free and relies instead on a model of hybridization that is robust to frameshifts; it is thus likely to provide resilience to length variability in the sonication of the samples, one of the major challenges in a translation to clinical settings.

On the other hand, digital signatures have also been successfully used for phylogenetic analyses (Garzon et al. 2011, 2017), and so demonstrate a versatile and scalable range of application where other genomic methods requiring alignments may not prove feasible. This scalability is not really surprising considering that the biomarkers used in nxh bases have been designed to provide full and lean (noise-free) coverage of the DNA spaces at their levels of resolution, and therefore can be used independently of the target sequences in identification problems of other genera and families.

Acknowledgement. The use of the High Performance Computing Center (HPC) at the University of Memphis is gratefully acknowledged.


References

Guadalupe, A., Castro-Escarpulli, G., Alonso-Aguilar, N.M., Rivera, G., Bocanegra-Garcia, V., Guo, X., Juárez-Enríquez, S.R., Luna-Herrera, J., Martínez, C.M.: Identification and typing methods for the study of bacterial infections: a brief review and mycobacteria as case of study. Arch. Clin. Microbiol. 7, 3 (2015)
Arora, A., Candel, A., Lanford, J., LeDell, E., Parmar, V.: Deep learning with H2O (2006)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). https://doi.org/10.1023/A:1010933404324
Deaton, R., Chen, J., Kim, J.W., Garzon, M.H., Wood, D.H.: Test tube selection of large independent sets of DNA oligonucleotides. In: Chen, J., Jonoska, N., Rozenberg, G. (eds.) Nanotechnology: Science and Computation. NCS, pp. 147–161. Springer, Heidelberg (2006). https://doi.org/10.1007/3-540-30296-4_9
Demkow, U., Ploski, R.: Clinical Applications for Next-Generation Sequencing. Academic Press, Cambridge (2015)
Jolley, K.A., Maiden, M.C.: Using MLST to study bacterial variation: prospects in the genomic era. Future Microbiol. 9, 623–630 (2014). https://doi.org/10.2217/fmb.14.24
Kohonen, T.: Essentials of the self-organizing map. Neural Netw. 37, 52–65 (2013). https://doi.org/10.1016/j.neunet.2012.09.018
Garzon, M.H., Bobba, K.C.: A geometric approach to Gibbs energy landscapes and optimal DNA codeword design. In: Stefanovic, D., Turberfield, A. (eds.) DNA 2012. LNCS, vol. 7433, pp. 73–85. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-32208-2_6
Garzon, M.H., Mainali, S.: Towards a universal genomic positioning system: phylogenetics and species identification. In: Rojas, I., Ortuño, F. (eds.) IWBBIO 2017. LNCS, vol. 10209, pp. 469–479. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-56154-7_42
Garzon, M., Mainali, S.: Towards reliable microarray analysis and design. In: 9th International Conference on Bioinformatics and Computational Biology, ISCA, 6 pp. (2017)
Garzon, M.H., Wong, T.Y.: DNA chips for species identification and biological phylogenies. Nat. Comput. 10, 375–389 (2011)
Hassoun, M.H.: Fundamentals of Artificial Neural Networks. MIT Press, Cambridge (1995)
Kwong, J.C., McCallum, N., Sintchenko, V., Howden, B.P.: Whole genome sequencing in clinical and public health microbiology. Pathology 47, 199–210 (2015). https://doi.org/10.1097/PAT.0000000000000235
Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3), 18–22 (2002)
Magill, S.S., Edwards, J.R., Bamberg, W., Beldavs, Z.G., Dumyati, G., Kainer, M.A., Lynfield, R., Maloney, M., McAllister-Hollod, L., Nadle, J., Ray, S.M., Thompson, D.L., Wilson, L.E., Fridkin, S.K.: Multistate point-prevalence survey of health care-associated infections. New Engl. J. Med. 370, 1198–1208 (2014)
Mellmann, A., Cloud, J., Maier, T., Keckevoet, U., Ramminger, I., Iwen, P., Harmsen, D.: Evaluation of matrix-assisted laser desorption ionization-time-of-flight mass spectrometry in comparison to 16S rRNA gene sequencing for species identification of nonfermenting bacteria. J. Clin. Microbiol. 46(6), 1946–1954 (2008). https://doi.org/10.1128/JCM.00157-08
Schena, M.: Microarray Analysis. Wiley, Hoboken (2003)
Stekel, D.: Microarray Bioinformatics. Cambridge University Press, Cambridge (2003)
Sharma-Kuinkel, B.K., Rude, T.H., Fowler, V.G.: Pulse field gel electrophoresis. Methods Mol. Biol. 1373, 117–130 (2016). https://doi.org/10.1007/7651_2014_191
Wehrens, R., Buydens, L.M.C.: Self- and super-organising maps in R: the kohonen package. J. Stat. Softw. 21(5), 1–19 (2007)


Zielezinski, A., Vinga, S., Almeida, J., Karlowski, W.M.: Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol. 18, 186 (2017). https://doi.org/10.1186/s13059-017-1319-7
Zhou, Y., Shen, N., Hou, H., Lu, Y., Yu, J., Mao, L., Sun, Z.: Identification accuracy for matrix-assisted laser desorption ionization-time of flight mass spectrometry (MALDI-TOF MS) for clinical pathogenic bacteria and fungi diagnosis: a meta-analysis. Int. J. Clin. Exp. Med. 10(2), 4057–4076 (2017). www.ijcem.com. ISSN 1940-5901/IJCEM0035141

Interpretable Models in Biomedicine and Bioinformatics

Kernel Conditional Embeddings for Associating Omic Data Types

Ferran Reverter(✉), Esteban Vegas, and Josep M. Oller

Department of Genetics, Microbiology and Statistics, University of Barcelona, Diagonal 643, 08028 Barcelona, Spain
{freverter,evegas,joller}@ub.edu

Abstract. Computational methods are needed to combine diverse types of genome-wide data in a meaningful manner. Based on the kernel embedding of conditional probability distributions, a new measure for inferring the degree of association between two multivariate data sources is introduced. We analyze the performance of the proposed measure by integrating mRNA expression, DNA methylation and miRNA expression data.

1 Introduction

Modern genomic and clinical studies are in strong need of integrative machine learning models to make better use of large volumes of heterogeneous information in the deep understanding of biological systems and the development of predictive models. For example, in current biomedical research it is not uncommon to have access to a large amount of data from a single patient, such as clinical records (e.g., age, gender, medical history, pathologies and therapeutics), high-throughput omics data (e.g., genomics, transcriptomics, proteomics and metabolomics measurements) and so on. How data from multiple sources are incorporated in a learning system is a key step for successful analysis.

Some of the most powerful methods for integrating heterogeneous data types are kernel-based methods [1]. Kernel-based data integration approaches can be described in two basic steps. First, the right kernel is chosen for each data set. Second, the kernels from the different data sources are combined to give a complete representation of the available data for a given statistical task.

In this paper we propose a (to the best of our knowledge) new measure for inferring the degree of association between two multivariate data sources, based on the embedding of conditional probability distributions in the framework of kernel methods.

2 Kernel Conditional Embeddings

Reproducing Kernel Hilbert Space (RKHS) methods provide a formal framework for data modeling, where models are determined by specifying a kernel function, a loss function and a penalty function [2]. The representer theorem [2] shows that the solutions of a large class of optimization problems in RKHS can be expressed as kernel expansions over the sample points. A question that arises naturally in the context of inference concerns the representation of a probability distribution P in an RKHS. With this goal, Smola et al. [3] and Fukumizu et al. [4], among others, have introduced the RKHS versions of the fundamental multivariate statistics, the mean vector and the covariance matrix. These RKHS counterparts of the mean vector and the covariance matrix are called the mean element and the covariance operator, respectively.

Let H be an RKHS on the separable metric space X, with continuous feature mapping ϕ(x) ∈ H for each x ∈ X. The inner product between feature mappings is given by the kernel function k(x, z) := ⟨ϕ(x), ϕ(z)⟩. Let P be a probability distribution on X. We can represent P(X) as an element of the RKHS associated with the kernel k:

μX := EX[ϕ(X)] = ∫_X ϕ(x) p(x) dx.

It has been shown that if EX[k(x, x)] < ∞, μX is guaranteed to be an element of the RKHS. The embedding μX of P(X) enjoys two attractive properties. First, if the kernel is characteristic, the mapping from P(X) to μX is injective, which means that different distributions are mapped to different points in the RKHS; an example of a characteristic kernel is the Gaussian kernel. Second, the expectation of any function f ∈ H can be evaluated as a scalar product in H:

⟨μX, f⟩Hk = EX[f(X)], ∀f ∈ H.
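As a quick computational aside (our own illustration, not part of the paper): empirical mean embeddings never need to be formed explicitly, because their inner products reduce to averages of kernel evaluations. A minimal R sketch with the kernlab package, where X and Z are illustrative numeric matrices with observations in rows:

library(kernlab)

# <mu_hat_X, mu_hat_Z> = average of k(x_i, z_j) over all sample pairs
k <- rbfdot(sigma = 0.1)                  # Gaussian (characteristic) kernel
inner_mu <- mean(kernelMatrix(k, X, Z))   # inner product of two embeddings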

Let {x1, …, xm} be an i.i.d. sample from P. An empirical estimator μ̂X is defined through

μ̂X := (1/m) Σ_{i=1}^{m} ϕ(xi). (1)

Let (X, Y) be a random variable taking values on X × Y, and let (H, k) and (G, l) be RKHSs with measurable kernels on X and Y, respectively. Let ϕ(x) = k(·, x) and φ(y) = l(·, y) denote the feature maps. Following the definition of the kernel embedding of a probability distribution P(X), for the kernel embedding of a conditional distribution P(Y|X) we have

μY|x := EY|x[φ(Y)] = ∫_Y φ(y) p(y|x) dy.

Given a data set S = {(x1, y1), …, (xm, ym)} drawn i.i.d. from P(X, Y), where Φ := (φ(y1), …, φ(ym)) and Υ := (ϕ(x1), …, ϕ(xm)) are the implicitly formed feature matrices and K = Υ⊤Υ is the kernel matrix for the samples from variable X, Song et al. [5] estimate the conditional embedding as

μ̂Y|x = Σ_{i=1}^{m} βi(x) φ(yi) = Σ_{i=1}^{m} βi(x) l(·, yi) = ΦB(x), (2)

where

B(x) = (β1(x), …, βm(x))⊤ = (K + λI)⁻¹ K:x (3)


and K:x = (k(x, x1), …, k(x, xm))⊤. The empirical estimator of the conditional embedding is similar to the estimator of the ordinary embedding in Eq. (1). The difference is that, instead of applying uniform weights 1/m, the former applies non-uniform weights βi(x) to the observations, which are, in turn, determined by the value x of the conditioning variable. These non-uniform weights reflect the effects of conditioning on the embeddings.

2.1 Measuring the Discrepancy Between Conditional Embeddings

Conditional embeddings allow us to quantify the differential effect on the response vector Y when the values of the conditioning vector X vary. For instance, the conditioning values on which the vector X is fixed may correspond to the mean vectors of X measured under different experimental conditions. We propose the quantity ||μY|x1 − μY|x2||²G for measuring the differential effect on Y when conditioning X on x1 versus conditioning X on x2. From (2) we can estimate this quantity by using the statistic:

T = ||μ̂Y|x1 − μ̂Y|x2||²G
  = ⟨μ̂Y|x1 − μ̂Y|x2, μ̂Y|x1 − μ̂Y|x2⟩G
  = ⟨μ̂Y|x1, μ̂Y|x1⟩G + ⟨μ̂Y|x2, μ̂Y|x2⟩G − 2⟨μ̂Y|x1, μ̂Y|x2⟩G
  = Σ_{i,j=1}^{m} [βi(x1)βj(x1) + βi(x2)βj(x2) − 2βi(x1)βj(x2)] l(yi, yj). (4)

To assess significance, we generate a null distribution by permuting the rows of Y while keeping the rows of X fixed. Thus, after B permutations we have B datasets S1 = (X, Y1), …, SB = (X, YB), where each Yi results from a random permutation of the rows of Y. From these we compute T1, …, TB, and we estimate a p-value as the fraction of the Ti, i = 1, …, B, that are greater than T. A sketch of this computation is given below.
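A minimal R sketch of Eqs. (3)–(4) and the permutation test, under our own assumptions: X and Y are numeric matrices with matched rows, x1 and x2 are conditioning vectors (e.g., cluster centroids) assumed given, and the kernel widths and λ are illustrative.

library(kernlab)

kx <- rbfdot(sigma = 0.1)    # Gaussian kernel on X (width illustrative)
ky <- rbfdot(sigma = 0.1)    # Gaussian kernel on Y
K  <- kernelMatrix(kx, X)    # m x m kernel matrix on the conditioning data
lambda <- 0.1                # regularization parameter, illustrative

# Eq. (3): embedding weights for a conditioning value x
B <- function(x) solve(K + lambda * diag(nrow(K)),
                       kernelMatrix(kx, X, matrix(x, nrow = 1)))

# Eq. (4): T = b1'Lb1 + b2'Lb2 - 2 b1'Lb2, with L the kernel matrix on Y
T_stat <- function(Y, x1, x2) {
  L <- kernelMatrix(ky, Y)
  b1 <- B(x1); b2 <- B(x2)
  drop(t(b1) %*% L %*% b1 + t(b2) %*% L %*% b2 - 2 * t(b1) %*% L %*% b2)
}

# Permutation p-value: permute the rows of Y while keeping X fixed
T0 <- T_stat(Y, x1, x2)
Tb <- replicate(5000, T_stat(Y[sample(nrow(Y)), ], x1, x2))
mean(Tb > T0)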

3 Case Study 1: Glioblastoma Multiforme Cancer

We used data for the glioblastoma multiforme (GBM) cancer type, available from TCGA [6] (The Cancer Genome Atlas, 2008), preprocessed and provided by Wang et al. [7]. We downloaded data sets containing mRNA expression (12,042 genes), miRNA expression (534 miRNAs) and DNA methylation (1,305 genes) from 215 patients. We aim to determine the degree of association between methylation and mRNA expression. To this end, we measure the effect on mRNA (Y) when conditioning on different states of DNA methylation (X). In particular, the DNA methylation conditions are fixed at the centers of the clusters discovered by spectral clustering of the DNA methylation data. The authors of [7] obtained patient clusters for individual data types by building a patient-similarity network and clustering it using spectral clustering. A sketch of this conditioning setup follows.
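One way such conditioning values could be derived in R with kernlab (our illustration; the matrix meth of methylation profiles, patients in rows, is an assumption):

library(kernlab)

sig <- sigest(meth, scaled = FALSE)    # plausible range for the RBF width
sc  <- specc(meth, centers = 3, kernel = "rbfdot",
             kpar = list(sigma = sig[2]))

# Cluster centroids (rows), used as the conditioning values x1, x2, x3
centroids <- apply(meth, 2, function(col) tapply(col, sc@.Data, mean))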


Single data type analysis did not lead to significantly different survival profiles, but the network fused by the Similarity Network Fusion methodology showed significant differences in survival among subtypes. This fact suggests that the subtypes detected by spectral clustering have biological relevance. We therefore set the number of clusters to three. Patients were grouped into three clusters with 18, 140 and 57 patients each. Using (3) we computed B(xi), where xi, i = 1, 2, 3, denotes the mean vectors (centroids) of the clusters. Then, from (4) we computed Tij = ||μ̂Y|xi − μ̂Y|xj||²G, where the indexes i and j denote on which pair of vectors xi and xj the conditional embeddings were compared. In addition, we estimated ||μ̂Y|xi − μ̂Y|x̄||²G, i = 1, 2, 3, where x̄ denotes the overall mean vector of the DNA methylation data.

We used a Gaussian kernel for both X and Y; the kernel parameters were adjusted using the sigest function in the kernlab package [8]. The estimation is based upon the 0.1 and 0.9 quantiles of ||z − z′||², where z denotes a general observation of the space X or Y, depending on the case. Basically, any value between these two bounds will produce a good hyper-parameter estimation.

Figure 1 shows the heatmap of the kernel matrix corresponding to the methylation data. We observe that the kernel matrix reveals the same patterns of similarity found by spectral clustering. In fact, when samples are ordered according to the clusters found by spectral clustering, identified in the heatmap by the upper color bar, the similarity values in the kernel matrix show three homogeneous groups that coincide with the clusters: a small group of samples in the bottom left corner, which we call group 1; the largest group, in the central part of the heatmap, which we call group 2; and a group of samples in the upper right corner, which we identify as group 3.

Figure 2 shows the vectors B(xi), i = 1, 2, 3, and B(x̄) in the last column, which define the weights of the conditional embeddings (3). Samples, grouped according to the cluster they belong to, are in rows. For each sample, row-normalized weights are displayed. Observe that the normalized weights change consistently across conditions (cluster centroids); that is, the samples with the highest weights belong to the cluster on which we are conditioning. To assess the statistical significance of the empirical values Tij we applied a permutation-based test, using 5000 permutation samples. We observe (Table 1) that the significant pairwise comparisons are those involving group 3. On the other hand, comparisons with respect to the conditional embedding on the overall mean are only significant for clusters 2 and 3. Table 1 also includes a summary of the null distribution of the test.

Table 1. Gene expression and DNA methylation analysis (GBM). Summary of the permutation test.

Comparison  Tij     Raw p-value  Min     q1      Med     q3      Max
1 vs 2      0.0332  0.19561      0.0128  0.0218  0.0258  0.0313  0.0699
1 vs 3      0.0301  0.01397      0.0081  0.0148  0.0170  0.0200  0.0471
2 vs 3      0.0155  0.00200      0.0035  0.0061  0.0072  0.0084  0.0146
1 vs x̄      0.0318  0.20958      0.0143  0.0215  0.0256  0.0301  0.0595
2 vs x̄      0.0005  0.00399      0.0001  0.0002  0.0002  0.0002  0.0006
3 vs x̄      0.0131  0.00200      0.0042  0.0064  0.0076  0.0088  0.0128

Fig. 1. Heatmap of the kernel matrix from DNA methylation data (GBM). Clusters found by spectral clustering are also supported by the kernel matrix. (Color figure online)

Table 2. Gene expression and miRNA analysis (GBM). Summary of the permutation test.

Comparison  Tij     Raw p-value  Min     q1      Med     q3      Max
1 vs 2      0.0027  0.00200      0.0007  0.0013  0.0015  0.0017  0.0025
1 vs 3      0.0013  0.55090      0.0009  0.0012  0.0013  0.0016  0.0030
2 vs 3      0.0042  0.79840      0.0029  0.0043  0.0050  0.0059  0.0102
1 vs x̄      0.0001  0.84431      0.0000  0.0001  0.0001  0.0001  0.0001
2 vs x̄      0.0016  0.90818      0.0010  0.0018  0.0021  0.0024  0.0034
3 vs x̄      0.0010  0.49102      0.0006  0.0008  0.0009  0.0011  0.0021

In addition, we studied the association between gene expression (Y) and miRNA (X). In analogy with the previous analysis, the miRNA conditions were determined by the centroids of the clusters from the spectral clustering of the miRNA dataset. In accordance with [7], we set the number of clusters to three; the clusters have 70, 84 and 61 patients each. Then, from (4) we computed Tij = ||μ̂Y|xi − μ̂Y|xj||²G, where the indexes i and j denote on which pair of vectors xi and xj the conditional embeddings were compared.


Fig. 2. Gene expression and DNA methylation analysis (GBM). Weights that determine the conditional embedding.

Figure 3 shows the vectors that define the weights of the conditional embeddings (3). Samples in rows are grouped according to the cluster they belong to. For each sample, row-normalized weights are displayed. The normalized weights change almost consistently across conditions (cluster centroids). We applied a permutation-based test, using 5000 permutation samples, to evaluate the significance of the empirical values Tij. We observe (Table 2) that only the comparison between groups 1 and 2 is significant; no other comparison is significant, nor are the comparisons between the conditional embeddings and the overall mean embedding.

4 Case Study 2: Breast Invasive Cancer

We used data for breast invasive cancer (BIC), available from TCGA [6], preprocessed and provided by Wang et al. [7]. We downloaded data sets containing mRNA expression (17,814 genes), miRNA expression (354 miRNAs) and DNA methylation (23,094 sites) from 105 patients. In particular, the DNA methylation conditions were fixed at the centers of the clusters discovered by spectral clustering of the DNA methylation data. The number of clusters was set to five in agreement with [7], where the authors obtained patient clusters by building a patient-similarity network and clustering it using spectral clustering. The coherence of the clusters was measured using commonly used measures such as the Cox log-rank test, to evaluate the significance of the difference in survival profiles between clusters, and the silhouette score.

Figure 4 shows the heatmap of the kernel matrix corresponding to the methylation data. It is not entirely consistent with the groups determined by the spectral method; some groups seem to be somewhat heterogeneous. For instance, group 2 (blue) could probably be subdivided. Figure 5 shows the row-normalized weights for the conditional embedding. Observe that the normalized weights change consistently across conditions (cluster centroids). We observe


(Table 3) that all comparisons are significant except those between groups 1 and 5, between groups 2 and 5, and between group 1 and the overall mean embedding. Figure 6 shows the heatmap of the kernel matrix corresponding to the miRNA data. Five groups are evident; however, groups 1 (red) and 2 (blue), and groups 4 (purple) and 5 (gray), seem to be more similar to each other.

Fig. 3. Gene expression and miRNA analysis (GBM). Weights that determine the conditional embedding.

Table 3. Gene expression and DNA methylation analysis (BIC). Summary of the permutation test.

Comparison  Tij     Raw p-value  Min     q1      Med     q3      Max
1 vs 2      0.0664  0.03593      0.0372  0.0460  0.0493  0.0540  0.0849
1 vs 3      0.1093  0.00599      0.0587  0.0723  0.0772  0.0829  0.1190
1 vs 4      0.1523  0.00200      0.0473  0.0588  0.0630  0.0672  0.1270
1 vs 5      0.0856  0.22954      0.0591  0.0735  0.0790  0.0849  0.1345
2 vs 3      0.1578  0.00798      0.0793  0.0952  0.1027  0.1124  0.1688
2 vs 4      0.1863  0.00200      0.0549  0.0673  0.0738  0.0802  0.1394
2 vs 5      0.0870  0.31337      0.0643  0.0767  0.0819  0.0892  0.1375
3 vs 4      0.1135  0.00200      0.0516  0.0613  0.0655  0.0704  0.1098
3 vs 5      0.0930  0.00200      0.0350  0.0436  0.0464  0.0504  0.0783
4 vs 5      0.2199  0.00200      0.0728  0.0885  0.0938  0.1015  0.1550
1 vs x̄      0.0234  0.12774      0.0149  0.0184  0.0198  0.0219  0.0355
2 vs x̄      0.0447  0.00998      0.0237  0.0294  0.0318  0.0343  0.0607
3 vs x̄      0.0567  0.00998      0.0294  0.0360  0.0389  0.0426  0.0667
4 vs x̄      0.0960  0.00200      0.0247  0.0297  0.0318  0.0346  0.0707
5 vs x̄      0.0558  0.01397      0.0316  0.0368  0.0394  0.0430  0.0663


Fig. 4. Heatmap of the kernel matrix from DNA methylation (BIC). Clusters found by spectral clustering are in general supported by the kernel matrix, except group 2 (blue), which could be subdivided. (Color figure online)

Fig. 5. Gene expression and DNA methylation (BIC). Weights that determine the conditional embedding.

Fig. 6. Heatmap of the kernel matrix from miRNA data (BIC). (Color figure online)


Table 4. Gene expression and miRNA analysis (BIC). Summary of the permutation test.

Comparison  Tij     Raw p-value  Min     q1      Med     q3      Max
1 vs 2      0.0003  0.24551      0.0002  0.0002  0.0003  0.0003  0.0004
1 vs 3      0.0008  0.00200      0.0003  0.0004  0.0004  0.0004  0.0006
1 vs 4      0.0002  0.01597      0.0001  0.0001  0.0002  0.0002  0.0002
1 vs 5      0.0002  0.00200      0.0001  0.0001  0.0001  0.0001  0.0002
2 vs 3      0.0014  0.01397      0.0008  0.0010  0.0011  0.0012  0.0015
2 vs 4      0.0001  0.05988      0.0001  0.0001  0.0001  0.0001  0.0001
2 vs 5      0.0003  0.64471      0.0002  0.0003  0.0003  0.0004  0.0005
3 vs 4      0.0011  0.00599      0.0006  0.0008  0.0008  0.0009  0.0012
3 vs 5      0.0006  0.00200      0.0003  0.0004  0.0004  0.0004  0.0005
4 vs 5      0.0002  0.63273      0.0001  0.0002  0.0002  0.0002  0.0003
1 vs x̄      0.0001  0.00200      0.0000  0.0001  0.0001  0.0001  0.0001
2 vs x̄      0.0003  0.98603      0.0003  0.0004  0.0004  0.0004  0.0005
3 vs x̄      0.0005  0.00200      0.0002  0.0003  0.0003  0.0003  0.0004
4 vs x̄      0.0002  0.96008      0.0002  0.0002  0.0002  0.0002  0.0003
5 vs x̄      0.0001  0.03393      0.0000  0.0000  0.0000  0.0001  0.0001

Fig. 7. Gene expression and miRNA analysis (BIC). Weights that determine the conditional embedding. (Color figure online)

Figure 7 shows the row-normalized weights for the conditional embedding. Observe that the normalized weights change consistently across conditions; this is most evident in group 3 (green). We observe (Table 4) that all comparisons are significant except those between groups 2 and 4, 2 and 5, and 4 and 5. Comparisons with the overall mean embedding are significant for clusters 1, 3 and 5.

5 Conclusions

We propose a measure for integrating data in the framework of kernel methods. The methodology is based on the kernel embedding of conditional probability distributions. Our measure allows us to infer the degree of association between two types of multivariate measurements by quantifying the effect on the mean element associated with the response vector when it is conditioned on different values of the explanatory vector, representing different experimental or clinical conditions. Statistical significance of the degree of association can be assessed using a permutation test.

Acknowledgments. This research is partially supported by Grant MTM2015-64465-C2-1-R (MINECO/FEDER) from the Ministerio de Economía y Competitividad (Spain) and Project 2014 SGR 1319 of the Agència de Gestió d'Ajuts Universitaris i de Recerca (AGAUR) (Catalonia, Spain).

References

1. Gonen, M., Alpaydin, E.: Multiple kernel learning algorithms. J. Mach. Learn. Res. 12, 2211–2268 (2011)
2. Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, Cambridge (2001)
3. Smola, A., Gretton, A., Song, L., Schölkopf, B.: A Hilbert space embedding for distributions. In: Hutter, M., Servedio, R.A., Takimoto, E. (eds.) ALT 2007. LNCS (LNAI), vol. 4754, pp. 13–31. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-75225-7_5
4. Fukumizu, K., Bach, F.R., Jordan, M.I.: Kernel dimension reduction in regression. Ann. Stat. 37(4), 1871–1905 (2009). https://doi.org/10.1214/08-AOS637. https://projecteuclid.org/euclid.aos/1245332835
5. Song, L., Fukumizu, K., Gretton, A.: Kernel embeddings of conditional distributions: a unified kernel framework for nonparametric inference in graphical models. IEEE Signal Process. Mag. 30(4), 98–111 (2013)
6. The Cancer Genome Atlas Network: The Cancer Genome Atlas (2006). http://cancergenome.nih.gov/
7. Wang, B., Mezlini, A.M., Demir, F., Fiume, M., Tu, Z., Brudno, M., Haibe-Kains, B., Goldenberg, A.: Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods 11, 333 (2014). https://doi.org/10.1038/nmeth.2810
8. Karatzoglou, A., Smola, A., Hornik, K., Zeileis, A.: kernlab - an S4 package for kernel methods in R. J. Stat. Softw. 11(9), 1–20 (2004)

Metastasis of Cutaneous Melanoma: Risk Factors, Detection and Forecasting

Iker Malaina1(✉), Leire Legarreta1, Maria Dolores Boyano2, Jesus Gardeazabal3, Carlos Bringas2, Luis Martinez1, and Ildefonso Martinez de la Fuente1,4

1 Department of Mathematics, University of the Basque Country UPV/EHU, Bilbao, Spain
{iker.malaina,leire.legarreta,luis.martinez,mtpmadei}@ehu.eus
2 Department of Cell Biology and Histology, University of the Basque Country UPV/EHU, Bilbao, Spain
{lola.boyano,carlos.bringas}@ehu.eus
3 Department of Dermatology, Cruces University Hospital, Bilbao, Spain
[email protected]
4 Department of Nutrition, CEBAS-CSIC Institute, Espinardo University Campus, Murcia, Spain

Abstract. In this work, we present a quantitative analysis of cutaneous melanoma based on 615 patients attended in Cruces University Hospital between 1988 and 2012. First, we studied which characteristics are most associated with the metastasis of this kind of cancer. We observed that people with light eyes, light hair, an ulcerated nevus, or exposure to the sun during working hours had a higher risk of suffering metastasis. Besides, a large diameter or a thick nevus (measured by Breslow's depth) was also associated with this condition. Next, we evaluated the metastasis detection capability of the tests performed in this hospital, which indicated that X-rays and CT scans were the best techniques for metastasis detection, since they identified this condition successfully in 80% and 93.5% of the cases, respectively. Moreover, we concluded that the blood test was very inaccurate, since it recognized the presence of metastasis in only 40% of the cases and failed in the rest. Consequently, we suggest the replacement of this test in order to save money and time, and to avoid the misdiagnosis of cutaneous melanoma metastasis. Finally, we built a predictive model to forecast the time it takes for metastasis to occur, based on Breslow's depth. This tool could be used not only for improving appointment scheduling in the dermatology department, but also for detecting metastasis sooner.

Keywords: Cutaneous melanoma · Breslow's depth · Metastasis · Odds ratio · Linear regression

1 Introduction

Cutaneous melanoma is a type of skin cancer located in the epidermis, which develops from the pigment-containing cells known as melanocytes. This kind of cancer is considered highly invasive and has a high chance of becoming metastatic [1].



Melanocytes are in charge of generating melanin, the pigment in the skin, eyes and hair, whose main role is to protect against the sun's ultraviolet rays and, therefore, to avoid mutations in the DNA of the most exposed cells [2].

When cutaneous melanoma is detected early, for example while the main tumor is still thin (Breslow's depth below 1 mm), it can be cured. In fact, if the melanoma is located solely in the epidermis, the chances of curing it are almost complete. However, failing to make the diagnosis in time favors invasion of the dermis and, as a consequence, worsens the prognosis. For this reason, people with higher-risk phenotypes or with a family history of melanoma undergo periodic check-ups in order to detect any suspicious skin alteration.

In the Basque Country (Spain) alone, 11.47 cases per 100,000 citizens are detected each year [3]. This tumor occurs more often in women (corresponding to 2.7% of their cancers) than in men (where it represents 1.5% of male cancers). In Spain, the number of cases increases by 7% yearly, with the average age at diagnosis decreasing each year. Among the risk factors for this disease are prolonged exposure to ultraviolet rays, a light skin, the presence of moles, a family history of melanoma, and advanced age [4]. This kind of cancer can develop from an existing nevus or appear as a new one. A malignant nevus can be differentiated from a regular one by the following characteristics: asymmetry (one half does not match the other half), irregular edges (non-rounded limits), unusual color (changes from brown to red or blue tones), large diameter (greater than 6 mm), and fast evolution (quick changes in size, color or form).

The degree of malignancy of cutaneous melanoma is usually measured by Breslow's depth. This index measures (in millimeters) the depth of the lesion from the epidermis to the lowest tumor cell. Among the benefits of this method are its easy reproducibility and the fact that it directly correlates with the metastasis risk of this kind of cancer [5].

In this work, we have performed three analyses based on the characteristics of the 615 patients who attended the dermatology service of Cruces University Hospital (Spain) between 1988 and 2012 because of cutaneous melanoma. In the first study, we detect and confirm some of the risk factors associated with this disease. In the second analysis, we study the capability of the techniques used to detect the metastasis of this type of cancer, which could be used to save both money and time for patients and hospitals. Finally, in the third study, we develop an easy-to-use regression model relating the metastasis-free interval to Breslow's depth. This tool could be used both for improving appointment scheduling in the dermatology department and for detecting metastasis earlier.

2 Methodology

2.1 Data Acquisition

In this experiment, we analyzed the 615 patients diagnosed with cutaneous melanoma in the dermatology service of Cruces University Hospital between June 21, 1988 and June 16, 2012. Of these patients, 95 suffered metastasis and 45 died because of it. There were 228 males and 387 females, with ages between 17 and 96. In Fig. 1, we illustrate the increasing trend of cutaneous melanoma in our hospital during the study period, divided by gender.

Fig. 1. Number of melanoma cases divided by gender from 1995 to 2012. In blue, the number of male cases of melanoma detected in Cruces University Hospital per year. In green, the analogous representation for female cases. The cases from 1988 to 1995 have not been included in the graph for illustrative purposes. (Color figure online)

For this study, we analyzed the following qualitative variables: gender, hair color, eye color, phototype (the degree of coloration of the skin), type of sun exposure (during working hours or during spare time), location of the melanoma, elevation, ulceration (detected by the dermatologist or the histologist), kind of surface (rough or smooth), and the reason for the visit (the appearance of blood, change in volume or diameter, increase of itchiness, or referral by the family physician). In addition, we recorded the metastasis detection method (presence of symptoms, physical exploration, blood analysis, X-rays and computed tomography (CT) scan). The following quantitative variables were also gathered: age, birth date, date of every visit, metastasis detection date, Breslow's depth (measured in mm) and melanoma diameter (measured in 10⁻⁴ m).


2.2 Statistical Analysis

In order to study the factors related to an increased risk of cutaneous melanoma, we first calculated the relative frequencies in three groups: the whole sample, the patients who suffered metastasis, and the group of those who died because of it. Next, we calculated the odds ratio of each variable. This is a statistic related to risk and probability that quantifies whether the presence (or absence) of a property in one group is related to the presence (or absence) of that same property in another group [6]. Finally, we applied linear regression [7] to build a predictive model of the form:

Y = a0 + a1 X + ε, (1)

where Y is the dependent variable, X the independent variable, a0 the constant term, a1 the coefficient measuring the effect of the independent variable, and ε the error term. The significance of the results was measured by the following statistical tests: Fisher's exact test, used to evaluate independence in contingency tables; the Kolmogorov-Smirnov test, applied to compare two populations that are normally distributed; and the Wilcoxon rank-sum test, used to compare populations that do not follow a normal distribution. A sketch of these computations is given below.
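For illustration, these quantities are all short computations in base R; the data frame melanoma and its column names below are our own assumptions, not the study's actual variable names.

# 2x2 contingency table: eye color vs. metastasis status
tab <- table(melanoma$eye_color, melanoma$metastasis)
fisher.test(tab)                          # Fisher's exact test

# Odds ratio with a Woolf-type 95% confidence interval
or <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
se <- sqrt(sum(1 / tab))                  # standard error of log(OR)
exp(log(or) + c(-1, 1) * 1.96 * se)       # 95% CI for the odds ratio

# Comparing a non-normally distributed variable between two groups
wilcox.test(breslow ~ metastasis, data = melanoma)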

3 Results

In order to study some of the risk factors related to cutaneous melanoma and its malignancy, we first performed a basic statistical analysis of the relative frequencies of each characteristic. Among the key findings, we observed that patients with darker hair or darker eyes (in other words, those with a higher amount of melanin) were more protected against the metastasis of this kind of cancer than those with lighter hair or eyes (which concords with the results of previous studies [8]). Within the group of patients with light eyes, 20.5% of the cases ended in metastasis, while the percentage of metastasis in the group with dark eyes was only 13.4% (a statistically significant difference, with a p-value of 0.043 in Fisher's exact test). Similarly, 10.5% of the patients with black hair developed metastasis, while among brown-haired patients the figure rose to 19% (with a p-value of 0.023 in Fisher's test).

Patients exposed to the sun during working hours were more affected by metastasis than those exposed during spare time (which is consistent with previous analyses [9]); in the first case the percentage of cases ending in metastasis was 27.7%, while in the second it was 13.7%, with a p-value of 0.002, indicating that these groups were significantly different. Next, we computed the odds ratio for this relationship, which was 2.418, with a confidence interval CI95% = (1.408, 4.151) (for more details on the confidence interval calculation see [10]). On the one hand, this indicated that metastasis was 2.418 times more probable for those regularly exposed to the sun during working hours than for those exposed during free time. On the other hand, it suggested that regular sun exposure while working was at least 1.408 times (and at most 4.151 times) more likely to induce metastasis than exposure during free time.

In addition, our study indicated that the most common location of melanoma in males is the torso (58.5% of the cases), while in women it is more often located on the limbs (50.4% of the cases), which concords with previous studies [11]. With respect to the qualitative characteristics of the nevus, the results indicated that ulceration detected by the dermatologist was a very malignant factor, since 44.9% of the cases where blood was found suffered metastasis, while only 9.4% of the non-ulcerated cases were affected by this condition (p-value ~ 0). The odds ratio for ulceration was 7.811, with CI95% = (4.394, 13.887). When the ulceration was detected by the histologist, the metastasis percentages were 48.8% and 8.9%, respectively (p-value ~ 0), with an odds ratio of 9.757 and a confidence interval of (4.072, 23.375). Additionally, a rough surface was a risk factor, leading to metastasis in 24.8% of the cases, in contrast to nevi with a smooth surface, which metastasized only 4.8% of the time (p-value ~ 0), with an odds ratio of 6.449 (CI95% = (3.291, 12.636)).

Fig. 2. Box plot of Breslow's depth divided by survivorship. Box plot of the distributions of the Breslow's depth values for the group of patients who died because of metastasis and the group of those who suffered this condition but did not die from it. The blue boxes represent the distribution of the central 50% of the values, and the red lines represent the medians. The rest of the values are represented by the whiskers or, in the case of atypical values, by red crosses. As can be observed in the figure, the distributions of the two groups were significantly different. (Color figure online)


Analyzing the quantitative variables related to the morphology of the nevus, we observed that both the diameter and Breslow's depth were significantly different (p-values of 0.005 and 0.000, respectively) between those who ended up developing metastasis and those who did not. In fact, the diameter for those with metastasis was 164.9 ± 298.9 (median ± interquartile range), while it was 100.5 ± 159.5 for those without metastasis. In the case of Breslow's depth, the metastatic group presented a depth of 2.1 ± 2.87, while for the non-metastatic group it was 0.9 ± 1. Additionally, these two variables were found to be significantly different between the subgroup of patients who died because of metastasis and those who suffered metastasis but did not die from it. In this subcase, the diameters were 170.8 ± 156.8 and 100.5 ± 177, respectively (p-value of 0.005); the Breslow's depths, on the other hand, were 2 ± 3.5 and 1 ± 1.3, respectively (p-value ~ 0). These differences are illustrated in Fig. 2.

Then, we analyzed the reasons that led patients to visit the dermatology service. The main one was an increase in diameter (47.8% of the cases), followed by alterations in the edges of the mole (24.7% of the cases). However, the presence of blood in the nevus detected by the patient was the reason most related to metastasis: 35.7% of the cases in which the patient noticed blood ended in metastasis, while for the cases where no blood was found, only 11.7% did. The odds ratio associated with the presence of blood was 4.194 (CI95% = (2.551, 6.896)). Note that these values are lower than the ones corresponding to blood found by the dermatologist or histologist, who, with greater precision, were capable of detecting and associating more cases.

Next, we performed a second study evaluating each test performed to detect metastasis. First, we calculated the frequency with which each analysis was applied. Then, we computed the percentage of times each test was positive in cases where patients presented metastasis. These values are shown in Table 1.

Table 1. In the first row, the percentage of times each technique was used when metastasis was detected; in the second, the percentage of times the test was positive in the presence of metastasis. In each column, the respective test: symptoms, P. exploration (physical exploration), B. analysis (blood analysis), X-rays and CT scan.

            Symptoms  P. exploration  B. analysis  X-rays  CT scan
Times used  97.9%     95.8%           41.6%        20.8%   64.5%
Detection   51%       78.2%           40%          80%     93.5%

From this analysis, we observed that the most common tests were the analysis of symptoms and the physical exploration, since they are the fastest and easiest. Even though the detection rate of symptoms alone was very low, it improved notably when complemented with the physical exploration. The blood analysis was performed in less than half of the patients, and its detection capacity was very poor: even when there was metastasis, the test was negative in 60% of the cases. On the other hand, X-rays and CT scans proved very effective at tracking down metastasis, but were used less often due to their cost, the need for intravenous contrast, and the amounts of ionizing radiation involved.


Finally, we performed a third study, applying linear regression to build a predictive model. The objective of this model is to forecast the metastasis-free survival period from Breslow's depth. First of all, we verified the assumptions of standard linear regression models (weak exogeneity, linearity, homoscedasticity, independence of errors and lack of perfect multicollinearity in the predictors). In order to fulfill these conditions, we worked with the logarithms of the variables instead of the raw ones, where ln(Breslow) is the natural logarithm of Breslow's depth (measured in millimeters), and ln(months) is the natural logarithm of the months elapsed from the detection of melanoma to the detection of metastasis. The model obtained by linear regression was the following:

ln(months) = 3.836 − 0.805 · ln(Breslow), (2)

with a coefficient of determination R² of 0.546. However, in order to make more "conservative" predictions and ensure that all patients visit the dermatologist before the metastasis happens, we built a second model by using the lower end of the confidence interval of the constant of the previous model, instead of using the constant directly. Thus, the second model was defined by:

ln(months_low) = 2.55179 − 0.805 · ln(Breslow), (3)

where ln(months_low) is the natural logarithm of the lower estimate of the metastasis-free period, measured in months. The two models and the linear regression are represented in Fig. 3. Lastly, we took exponentials to ease the second model's use, and illustrated it with some examples:

months_low = 12.8302 / Breslow^0.805. (4)

Thus, for example, a patient with a Breslow's depth of 0.5 mm would not be expected to develop metastasis before 22.41 months, while for a depth of 3.5 mm, the model forecasts that such a patient could be free from metastasis for at least 4.68 months.
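A minimal R sketch of how such a model can be fitted and used; the data frame mel and its column names are our illustration, and the printed values assume the fitted coefficients reported above.

# 'mel' holds one row per metastatic patient: breslow (mm) and months
fit <- lm(log(months) ~ log(breslow), data = mel)

# Conservative variant (Eq. 3): lower end of the intercept's 95% CI
a0_low <- confint(fit)["(Intercept)", 1]
b1 <- coef(fit)["log(breslow)"]           # approx. -0.805 in the paper

# Eq. (4): lower estimate of the metastasis-free period, in months
months_low <- function(breslow) exp(a0_low) * breslow^b1
months_low(c(0.5, 3.5))                   # ~22.4 and ~4.7 with the paper's fit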


Fig. 3. Linear regression of ln(months) vs. ln(Breslow). The dots represent the values of ln(Breslow) (the natural logarithm of Breslow's depth) versus ln(months) (the natural logarithm of the months elapsed from the detection of melanoma to the detection of metastasis). The blue line represents the first linear regression model, which has a determination coefficient of 0.546, and the red line represents the second model, obtained by taking, instead of the model's constant, the lowest value of its confidence interval estimation. (Color figure online)

4 Discussion

In this work, we have performed a quantitative analysis of the 615 patients who attended the dermatology service of Cruces University Hospital (Spain) between 1988 and 2012. First, we performed a study to detect the risk factors associated with melanoma and its metastasis, which indicated that light eyes, light hair, sun exposure during working hours, a rough surface and ulceration of the nevus were indeed associated with metastasis. In fact, the presence of ulceration was the most malignant factor, with metastasis being almost 10 times more probable when this condition was present. Among the quantitative variables, we observed that increases in both diameter and depth were associated with metastasis. Moreover, these two variables were also significantly different between those who died because of metastasis and those who survived, indicating that both can be used as good indicators of prognosis.

Next, we evaluated the detection capacity of the tests performed in the hospital to detect the metastasis of cutaneous melanoma. We observed that the most efficient were the X-rays and the CT scan, which, due to their cost and side effects,


are the least used. On the other hand, we observed that the blood analysis fails to detect metastasis more often than it detects this condition properly. Therefore, we suggest that this test should be substituted (or removed) in favor of a more reliable one, in order to save money and time for both the patient and the hospital.

Finally, we built two predictive models based on Breslow's depth. Even if the inclusion of more variables would increase the determination coefficient, we decided to make the model as simple as possible in order to facilitate its use. Besides, this allows forecasting the metastasis-free period by studying only a single parameter. Between the two models, we recommend the use of the second one (the most conservative), because even if the first one adjusts better to the data, the aim of this kind of model is to anticipate the appearance of metastasis. Moreover, in addition to being used for predictive purposes, this model could improve the medical consultation system, both by removing unnecessary appointments and by scheduling visits before the forecast metastasis.

Summarizing, here we have presented a three-part study: first, we have analyzed some of the clinical risk factors associated with cutaneous melanoma and its metastasis; second, we have evaluated the detection capacity of the tests performed to detect this condition; and third, we have built a model to predict the metastasis-free period based on Breslow's depth. By preventing and anticipating cutaneous melanoma and its metastasis, our society will be able to save money, time, and, most importantly, the lives of the patients.

Acknowledgements. Work by the first and second authors was supported by the Basque Government grant IT974-16.


Graph Theory Based Classification of Brain Connectivity Network for Autism Spectrum Disorder

Ertan Tolan(B) and Zerrin Isik

Computer Engineering Department, Dokuz Eylul University, Tinaztepe Kampusu Buca, 35160 Izmir, Turkey
[email protected], [email protected]

Abstract. Connections in the human brain can be examined efficiently using brain imaging techniques such as Diffusion Tensor Imaging (DTI) and resting-state fMRI. Brain connectivity networks are constructed by using image processing and statistical methods; these networks explain how brain regions interact with each other. Brain networks can be used to train machine learning models that can help the diagnosis of neurological disorders. In this study, two types (DTI, fMRI) of brain connectivity networks are examined to retrieve graph theory based knowledge and feature vectors of samples. The classification model is developed by integrating three machine learning algorithms with a naïve voting scheme. The evaluation of the proposed model is performed on the brain connectivity samples of patients with Autism Spectrum Disorder. When the classification model is compared with another state-of-the-art study, the proposed method outperforms it. Thus, graph-based measures computed on brain connectivity networks might help to improve the diagnostic capability of in-silico methods. This study introduces a graph theory based classification model for diagnostic purposes that can be easily adapted to different neurological diseases.

Keywords: Brain connectivity network · Autism Spectrum Disorder · Graph theory · Machine learning

1 Introduction

Neurological diseases [1] are structural, biochemical, or electrical abnormalities in the brain, spinal cord, or other nerves that can result in a range of symptoms. Alzheimer's, Parkinson's, Multiple Sclerosis, and Autism Spectrum Disorder (ASD) are the most important examples of neurological diseases. Among these, ASD [2] is a brain disorder that makes it difficult or impossible to communicate with other individuals [3]. In ASD, different regions of the brain cannot work together, and most autistic individuals struggle when forced to communicate with people. However, early detection and treatment of ASD are effective in reducing this communication difficulty. With the developments in technology, structural and functional


connections in the brain can be examined using brain imaging techniques [4] (magnetic resonance imaging, positron emission tomography, functional magnetic resonance imaging).

Structural interactions [5] are anatomical alterations of neighbouring neurons or synapses; these interactions are measured via Diffusion Tensor Imaging (DTI). Structural connections also represent a network of physical connections. These connections are relatively stable at shorter time scales ranging from seconds to minutes; at longer time scales (hours to days), connectivity is subject to significant morphological change and plasticity. On the other hand, functional connections [5] are measured using the resting-state fMRI method. Functional interactions are a fundamentally statistical concept; statistical dependence might be estimated by measuring correlation or covariance. Functional connectivity is highly time-dependent compared to structural connectivity: statistical patterns between neuronal elements fluctuate on multiple time scales, some as short as tens or hundreds of milliseconds. These functional and structural connections are also used to calculate the relations between predetermined brain regions. Thus, structural and functional connectivity networks between the brain regions are established.

Neural diseases cause changes in the structure and function of the brain. When these changes are examined, the related differences between a patient's brain and a healthy individual's brain can be analyzed using the above-mentioned brain connectivity networks. Graph theory based algorithms are used to extract features from large networks [6–9]. Implementation of this idea is quite natural, since samples are defined as matrices and can be transformed into graphs easily. A binary classification model can be applied to discriminate healthy individuals and patients with ASD. A model trained with the data of diseased and healthy people might provide an estimate of the condition for a person whose brain connectivity network has been identified beforehand. Therefore, such a model can help to diagnose ASD at an early stage and to improve the treatment of the disease.

In this study, we consider each patient's brain connectivity network as a graph, and we extract features from these brain connectivity networks by using graph theoretical analysis. At this stage, we benefit from graph measures which characterize the examined graph. Then, our learning model is trained with the obtained feature vectors. Results of this model and comparisons show that our model is simple, effective, and improvable.

The rest of the paper is organized as follows. The literature of the related problem is described in the next section. Then, in Sect. 3, base models and the designed ensemble model are presented. In Sect. 4, data sets and preprocessing steps are explained. Later on, results of experiments to validate our model and results of the designed model are reported. Finally, Sect. 5 gives information on future work and concludes the study.

2 Related Work

There are several studies for different types of neurological diseases. Bassett and Sporns [10] sketch the outlines of a new interdisciplinary field, which they call


network neuroscience. They indicate that neuroscience requires statistical inference and theoretical ideas to explore brain structure and function in this big-data age. Crossley et al. [11] show that the anatomical locations of high-degree hubs can change dynamically in parallel with changing types and levels of cognitive processing, whereas the global topological graph measures of the brain functional network are relatively conserved under different cognitive conditions. Some studies using graph theory have found decreases in whole-brain connectivity and efficiency, and a shift in the balance between integration and segregation. The study of Dennis and Thompson [12] aimed to show the earliest signs of decline and treatment response in Alzheimer's disease and tried to determine how to slow or stop decline. Rubinov and Sporns [13] described a collection of measures that quantify local and global properties of complex brain networks; the accompanying brain connectivity toolbox allows researchers to start exploring network properties of complex structural and functional datasets. The approach of Guye et al. [14] has provided a formal description of the complex brain network topology in vivo as well as a quantification of its properties. Dodero et al. [15] applied the Riemannian manifold to the classification problem of brain functional networks. Woodward and Cascio [16] showed that rs-fcMRI can span human clinical populations and animal models to achieve a level of translational continuity which has eluded functional neuroimaging thus far. The integration of network- (graph-) based analysis with machine learning models might achieve better diagnosis for several neurological diseases. The computational gap in this field mainly motivated this study.

3 Materials and Methods

3.1 Brain Connectivity Network Analysis

We propose a binary classification model that discriminates ASD patients and healthy people by using their brain connectivity network data. The use of a graph-based representation for these connectivity networks is quite appropriate: each node of the graph represents an individual brain region, while each edge represents the relationship between two brain regions. Graph theory is utilized in the analysis of the structural and functional brain connectivity networks, and the global and nodal measurements are computed as the input feature vector of the machine learning model. Each brain connectivity network is represented as a matrix over 264 brain regions; the corresponding graph has 264 nodes and a fully connected edge list for each of them. Brain regions are predefined, and positions of regions are given as 3D coordinates. The right occipital pole, right precuneous cortex, and left occipital pole are some examples of brain regions. We begin by determining the graph measures to be used in our model. There are two types of measures: nodal and global. We utilize 5 global (radius, diameter, cluster, transitivity, modularity) and 8 nodal (eccentricity, path length, closeness, structure, global efficiency, betweenness, strength, triangles) measures in total.


We use global measures as-is. Since nodal measures refer to properties of individual nodes in the graph, we use the average, standard deviation, minimum, and maximum values of nodal measures to generalize them into a feature vector. NGM_min, NGM_max, NGM_std, and NGM_avg denote the derivations for a nodal measure:

$$NGM_{min} = \min(NGM_1, NGM_2, \ldots, NGM_{n-1}, NGM_n) \quad (1)$$

$$NGM_{max} = \max(NGM_1, NGM_2, \ldots, NGM_{n-1}, NGM_n) \quad (2)$$

$$NGM_{std} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(NGM_i - \overline{NGM}\right)^2} \quad (3)$$

$$NGM_{avg} = \frac{1}{N}\sum_{i=1}^{N} NGM_i \quad (4)$$

where N is the number of nodes and NGM_i is the defined measure of the i-th node in the graph. As an example, the nodal measure of betweenness is derived as betweenness_min, betweenness_max, betweenness_std, and betweenness_avg. We obtain 37 features by using the 4 kinds of derived nodal measures given above together with the global measures. Gender information of each person is also attached, so the final feature vectors comprise 38 features. To compute these measures we use the Brain Analysis using Graph Theory (BRAPH) tool [17]. BRAPH is a toolbox written in MATLAB that uses graph theory-based analysis to characterize brain connectivity. In addition, we substitute negative correlation coefficients with their absolute values to calculate some measures in which negative weights are not allowed; this transformation is also recommended by the BRAPH tool. A sketch of this feature derivation is given after Table 1. After this feature extraction step, we deal with feature selection to eliminate improper features. The Relieff algorithm [18] computes the importance of attributes, selecting relevant features using a statistical method. We apply the Relieff feature selection method in the training phase to reduce the feature size and to produce more significant attributes. We set a different number of features for each machine learning model; these feature numbers were determined by experimenting with different sets of features (Table 1).

Table 1. Total number of features used for each machine learning model. fMRI and DTI are different brain connectivity networks.

Model          fMRI  DTI
SVM            35    21
kNN            26    11
Decision Tree  24    7
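The derivation of the four summary statistics from a nodal measure (Eqs. 1–4) can be sketched as follows. This is a minimal Python illustration, not the authors' MATLAB/BRAPH code, and the variable names are ours.

    import numpy as np

    def derive_nodal_features(nodal_values):
        """Collapse one nodal measure (one value per brain region) into the
        four derived statistics of Eqs. (1)-(4): min, max, sample standard
        deviation (N-1 denominator), and average."""
        v = np.asarray(nodal_values, dtype=float)
        return [v.min(), v.max(), v.std(ddof=1), v.mean()]

    # Example: the betweenness values of the 264 regions yield 4 features;
    # 8 nodal measures x 4 statistics plus 5 global measures give 37 features,
    # and gender brings the final vector to 38.
    betweenness = np.random.rand(264)  # placeholder values for illustration
    features = derive_nodal_features(betweenness)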

3.2 Machine Learning Methods

We design a binary classifier to predict whether a given brain connectivity network belongs to an ASD patient or a typically developing (TD) one. Our machine learning methods are Support Vector Machine (SVM), k-Nearest Neighbor (kNN), and Decision Tree (DT). We chose these three distinct methods to obtain an ensemble model using a simple voting method. Our ensemble learning model works based on a majority voting scheme: a prediction for a sample is assumed to be true if it is predicted as true by at least two machine learning methods; otherwise it is labeled as a false prediction. The Statistics and Machine Learning Toolbox in MATLAB is used to implement this study; in this toolbox, SVM, kNN, and DT algorithms are provided. Parameters of each machine learning algorithm for functional connectivity and structural connectivity are used with default values in MATLAB, since the goal of this study is to explore the best performing features extracted from brain connectivity networks rather than to optimize the parameters of the machine learning algorithms. As the evaluation strategy, leave-one-out cross-validation (LOOCV) is used due to the limited sample sizes: we train each machine learning model by using n − 1 samples, and the remaining one is used as the test sample.
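A minimal sketch of this majority-voting LOOCV scheme is given below, written with scikit-learn in Python rather than the MATLAB toolbox used in the study; default hyperparameters mirror the paper's setup, and the function name is ours.

    from sklearn.model_selection import LeaveOneOut
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    def loocv_majority_vote(X, y):
        """LOOCV accuracy of a majority-vote ensemble of SVM, kNN, and DT."""
        correct = 0
        for train, test in LeaveOneOut().split(X):
            votes = []
            for clf in (SVC(), KNeighborsClassifier(), DecisionTreeClassifier()):
                clf.fit(X[train], y[train])
                votes.append(clf.predict(X[test])[0])
            # the label predicted by at least two of the three base learners wins
            prediction = max(set(votes), key=votes.count)
            correct += int(prediction == y[test][0])
        return correct / len(y)

    # Hypothetical usage with the 38-dimensional feature vectors described above:
    # accuracy = loocv_majority_vote(feature_matrix, labels)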

3.3 Dataset

We obtain our data sets from the USC Multimodal Connectivity Database [19], an openly available brain connectivity database and a part of the "Human Connectome Project". The USC Multimodal Connectivity Database is a web-based repository and analysis site for connectivity matrices that have been derived from de-identified neuroimaging data. In this repository, brain connectivity networks are presented in a pre-processed condition. Brain networks are freely provided to researchers for various neurological diseases, e.g., Autism Spectrum Disorder, Attention Deficit Hyperactivity Disorder, Obsessive-Compulsive Disorder, and Alzheimer's. Among these diseases, we chose the Autism Spectrum Disorder (ASD) connectome database, which comprises 79 functional and 94 structural connectomes (Table 2).

Table 2. The patient samples for the ASD database with two different connectivity networks. TD refers to typical development samples (i.e., healthy), ASD refers to patients diagnosed with Autism Spectrum Disorder.

Network type             TD  ASD  Total
Functional connectivity  37  42   79
Structural connectivity  43  51   94

As mentioned, we have two different data sets for ASD diagnosis: structural and functional connectivity networks. We treated these connectivity networks individually and applied all training and evaluation pipelines separately. Each sample is provided as a 264 × 264 connectivity matrix that corresponds to the brain region positions in 3D coordinates.


Fig. 1. The sample ASD-100B-DTI is shown as a graph.

Fig. 2. The sample ASD-100B-DTI is shown as a connectivity matrix.

Additionally, gender and age information of individuals is provided for each sample. We use all of these data except the age and the specific 3D coordinates. Figure 1 shows the network of the ASD-100B-DTI sample; Fig. 2 represents the brain connectivity matrix of the same sample as a 264 × 264 matrix.

4 Results

We developed three machine learning models whose decisions are aggregated by an ensemble model. We used GK-LogE from the study of Dodero et al. [15] to benchmark the performance of the generated model on the functional and structural connectivity datasets. This model applies a kernel-based classification for brain networks on the Riemannian manifold of positive definite matrices. We compared our classification results with the work of Dodero et al. because it uses the same ASD connectome dataset from the USC Multimodal Connectivity Database. Table 3 shows average accuracy values for the different machine learning models compared with GK-LogE [15]. For the functional connectivity (fMRI) dataset, SVM and the ensemble model achieved 66% and 67% average accuracy, respectively. On the other hand, GK-LogE and our ensemble model had the same performance, with 68% average accuracy, on the structural connectivity (DTI) dataset. As a result of this evaluation, our graph-based features manage to improve the classification performance on functional connectivity networks compared to the kernel-based approach proposed in the Dodero et al. study [15], while the performance of our model is on the same level for the structural connectivity networks.

Table 3. Average accuracy of the LOOCV scheme. GK-LogE shows the original results reported in the Dodero et al. [15] study.

Model           fMRI  DTI
GK-LogE [15]    0.61  0.68
SVM             0.66  0.60
kNN             0.57  0.60
DT              0.58  0.56
Ensemble model  0.67  0.68

Confusion matrices are used to compute the sensitivity and predictivity of the trained models. In Table 4, confusion matrices are shown for the structural and functional connectivity networks, respectively. The calculated precision and recall values are shown in Table 5; they are at almost the same level as the computed accuracies. We analyzed the global and nodal measures which are selected and ranked in the top places by the Relieff algorithm on the functional and structural connectivity networks. Additionally, the averages of the ranked measures in the ASD and the control sets are given in Tables 6 and 7. As seen in these tables, some measures are more important than others. For example, the path.length measure is found with two kinds of derivations (std, max) in the top-10 ranked features for both datasets. path.length is the average distance to all other nodes for a given node


Table 4. Confusion matrices for the ensemble model on the DTI and fMRI datasets.

             DTI dataset                     fMRI dataset
             Predicted: ASD  Predicted: TD   Predicted: ASD  Predicted: TD
Actual: ASD  37              14              28              14
Actual: TD   16              27              12              25

Table 5. Precision and recall values for the ensemble model.

      Precision  Recall
fMRI  0.67       0.70
DTI   0.70       0.73

Table 6. Top 10 features ranked by the Relieff algorithm for the DTI dataset and averages of the measures in the ASD and control (TD) sets.

Rank  Measure                ASD       TD
1     structure_max          9.3529    9.1860
2     closeness_min          6.0028    5.8447
3     global.efficiency_min  6.4962    6.2704
4     path.length_max        0.1887    0.1774
5     strength_std           219.1198  214.9973
6     eccentricity_max       0.2725    0.2622
7     eccentricity_avg       0.2087    0.1979
8     path.length_std        0.0170    0.0166
9     triangles_std          1357.4    1293.7
10    structure_std          2.6011    2.5629

Table 7. Top 10 features ranked by the Relieff algorithm for the fMRI dataset and averages of the measures in the ASD and control (TD) sets.

Rank  Measure                ASD       TD
1     structure_std          1.1934    1.1487
2     modularity             0.1251    0.1259
3     eccentricity_std       0.3074    0.3189
4     global.efficiency_max  0.4990    0.5242
5     strength_std           14.9012   16.8826
6     triangles_std          1167.2    1345.8
7     strength_max           107.3624  115.2444
8     path.length_std        0.3010    0.3180
9     global.efficiency_min  0.2662    0.2691
10    path.length_max        3.8613    3.8987


in a weighted graph. In the fMRI dataset, the structure measure is one of the top-ranked measures; it defines the modularity of a given node. The commonly top-ranked graph measures for both datasets are path.length, structure, and eccentricity. These measures were chosen by the algorithm because they contain discriminative properties, and our model is more successful when using these attributes. This situation can be explained by taking into consideration that, in this graph representation, nodes stand for brain regions and weighted edges denote correlations between brain regions. Higher weight values indicate significant (positive/negative) correlation between brain regions, so the measures path.length, structure, and eccentricity become more prominent. This observation makes sense, since Autism Spectrum Disorder patients have impaired communication between brain regions. Additionally, the standard deviations of the nodal measures in the functional data draw attention: when the maximum and minimum correlation values of brain regions are close to each other, the standard deviation of nodal measures can separate samples more effectively.

5 Conclusion

Early diagnosis of neurological diseases is an important research area in medicine. With the assistance of computational prediction models, early diagnosis of diseases becomes more feasible. Machine learning models are developed to generate more reliable diagnostic tools, especially for patients with neurological disorders. In this study, machine learning models use nodal and global graph measures computed over brain connectivity networks as input feature vectors. Then, three different binary classifiers (SVM, kNN, DT) are integrated to construct an ensemble prediction model. When the proposed model is evaluated on ASD patient datasets, the graph-based features improve the classification performance on functional connectivity networks compared to the state-of-the-art method [15]. The proposed model provides the same performance as the state-of-the-art method for the structural connectivity networks. Considering the feature weights calculated by the Relieff algorithm, derivations of nodal measures seem to contribute more effectively to the classification than global measures for both connectivity networks. Therefore, nodal measures could contain more valuable information for the classification of ASD patients and healthy individuals. As future work, the parameters of the classification algorithms can be tuned with optimization methods to obtain better performance. To achieve better results, a more suitable ensemble method can be defined; e.g., a stacking model might be used to create a more robust ensemble. Additionally, the proposed model can be validated on other neurological diseases found in the USC Multimodal Connectivity Database, and validating with more ASD patient data can strengthen our model. The other significant point of this study is the determination of graph-based measures. Different graph measures can be used depending on the graph type: directed and undirected graphs have different measures, so, to remove this deficiency, a directed graph can be converted into an undirected one to compute more graph measures. After some pre-processing on


graphs, the number of nodal and global graph measures can be increased. Nodal measures can also be used as individual (per-node) feature vectors of the learning model. In conclusion, the proposed study develops a classification model for diagnostic purposes. Its evaluation shows that graph-based measures computed on brain connectivity networks might help to improve the diagnostic capability of state-of-the-art computational models.

Acknowledgments. E. Tolan is supported by the 100/2000 CoHE Doctoral Scholarship Project.

References
1. Stam, C.J.: Modern network science of neurological disorders. Nat. Rev. Neurosci. 15(10), 683–695 (2014)
2. Lord, C., Cook, E.H., Leventhal, B.L., Amaral, D.G.: Autism spectrum disorders. Neuron 28(2), 355–363 (2000)
3. Rane, P., Cochran, D., Hodge, S.M., Haselgrove, C., Kennedy, D., Frazier, J.A.: Connectivity in autism: a review of MRI connectivity studies. Harv. Rev. Psychiatry 23(4), 223 (2015)
4. Posner, M.I., Raichle, M.E.: Images of Mind. Scientific American Library/Scientific American Books, New York (1994)
5. Bullmore, E., Sporns, O.: Complex brain networks: graph theoretical analysis of structural and functional systems. Nat. Rev. Neurosci. 10(3), 186–198 (2009)
6. Bassett, D.S., Bullmore, E.D.: Small-world brain networks. Neuroscientist 12(6), 512–523 (2006)
7. van den Heuvel, M.P., Sporns, O.: Network hubs in the human brain. Trends Cogn. Sci. 17(12), 683–696 (2013)
8. Reijneveld, J.C., Ponten, S.C., Berendse, H.W., Stam, C.J.: The application of graph theoretical analysis to complex networks in the brain. Clin. Neurophysiol. 118(11), 2317–2331 (2007)
9. Stam, C.J., De Haan, W., Daffertshofer, A., Jones, B.F., Manshanden, I., van Cappellen van Walsum, A.M., Berendse, H.W.: Graph theoretical analysis of magnetoencephalographic functional connectivity in Alzheimer's disease. Brain 132(1), 213–224 (2008)
10. Bassett, D.S., Sporns, O.: Network neuroscience. Nat. Neurosci. 20(3), 353–364 (2017)
11. Crossley, N.A., Mechelli, A., Vértes, P.E., Winton-Brown, T.T., Patel, A.X., Ginestet, C.E., Bullmore, E.T.: Cognitive relevance of the community structure of the human brain functional coactivation network. Proc. Nat. Acad. Sci. 110(28), 11583–11588 (2013)
12. Dennis, E.L., Thompson, P.M.: Functional brain connectivity using fMRI in aging and Alzheimer's disease. Neuropsychol. Rev. 24(1), 49–62 (2014)
13. Rubinov, M., Sporns, O.: Complex network measures of brain connectivity: uses and interpretations. Neuroimage 52(3), 1059–1069 (2010)
14. Guye, M., Bettus, G., Bartolomei, F., Cozzone, P.J.: Graph theoretical analysis of structural and functional connectivity MRI in normal and pathological brain networks. Magn. Reson. Mater. Phys. Biol. Med. 23(5–6), 409–421 (2010)
15. Dodero, L., Minh, H.Q., San Biagio, M., Murino, V., Sona, D.: Kernel-based classification for brain connectivity graphs on the Riemannian manifold of positive definite matrices. In: 2015 IEEE 12th International Symposium on Biomedical Imaging (ISBI), pp. 42–45. IEEE (2015)
16. Woodward, N.D., Cascio, C.J.: Resting-state functional connectivity in psychiatric disorders. JAMA Psychiatry 72(8), 743–744 (2015)
17. Mijalkov, M., Kakaei, E., Pereira, J.B., Westman, E., Volpe, G., Alzheimer's Disease Neuroimaging Initiative: BRAPH: a graph theory software for the analysis of brain connectivity. PLoS ONE 12(8), e0178798 (2017)
18. Kira, K., Rendell, L.A.: The feature selection problem: traditional methods and a new algorithm. In: AAAI, vol. 2 (1992)
19. Brown, J.A., Rudie, J.D., Bandrowski, A., Van Horn, J.D., Bookheimer, S.Y.: The UCLA multimodal connectivity database: a web-based platform for brain connectivity matrix sharing and analysis. Front. Neuroinformatics 6, 28 (2012)

Detect and Predict Melanoma Utilizing TCBR and Classification of Skin Lesions in a Learning Assistant System

Sara Nasiri(B), Matthias Jung, Julien Helsper, and Madjid Fathi

Department of Electrical Engineering and Computer Science, Institute of Knowledge Based Systems and Knowledge Management, University of Siegen, Hölderlinstr. 3, 57076 Siegen, Germany
{sara.nasiri,madjid.fathi}@uni-siegen.de, {julien.helsper,matthias.jung}@student.uni-siegen.de

Abstract. In this paper, case-based reasoning is used as a problem-solving method in the development of DePicT Melanoma CLASS, a textual case-based system to detect and predict melanoma utilizing text information and image classification. Each case contains a disease description and possible recommendations as references (images or texts). The case description has an image gallery and a word association profile, which captures the association strengths between the stages/types of melanoma and its symptoms and characteristics (keywords from text references). In the retrieval and reuse process, the requested problem, which arrives as a new incoming case, is first compared against all collected cases; then, the solution of the most similar case is selected and recommended to users. In this paper, support vector machine (SVM) and k-nearest neighbor (k-NN) classifiers are also used with the extracted features of skin lesions. A region growing method, initialized by seed points, is applied for the segmentation. DePicT Melanoma CLASS is tested on sample texts and 400 images from the ISIC archive dataset, including the two classes of melanoma, and it achieves 63% accuracy for the overall system.

Keywords: Case-based reasoning · Word association strength · DePicT CLASS · Segmentation · Region growing · Melanoma

1 Introduction

Melanoma results in the vast majority of skin cancer deaths; in 2016, there were an estimated 3,520 deaths from other types of skin cancer and approximately three times as many, around 10,130 deaths, from melanoma, even though this disease accounts for only 1% of all instances of skin cancer [1]. The survival rates of melanoma from early to terminal stages vary between 15 and 65% [2]; therefore, having the right information at the right time via early detection is essential to surviving this type of cancer. Accordingly, developing decision support systems has become a major area of research in this field [3]. The best path to early


detection is recognizing new or changing skin growths, especially those that appear different from other moles [1]. Even after treatment, it is very important that patients keep up on their medical history and records. The National Comprehensive Cancer Network (NCCN) creates helpful reports and resources to serve as guidelines for informing patients and other stakeholders about cancer [4]. The NCCN guideline for patients on melanoma, which is endorsed by the Melanoma Research Foundation [5] and AIM at Melanoma [6], explains which cancer tests and treatments are recommended by melanoma experts. Although CBR has been applied in a number of medical systems, only a few systems have been developed for melanoma. The CBR system of Nicolas et al. used rules to answer medical questions based on the knowledge extracted from image data [7]. Various skin lesion classifiers have been developed that use SVM- and k-NN-like interactive object recognition methodologies to perform border segmentation [8], or that extract global and local features and apply Otsu's adaptive thresholding method [9]. Sumithra et al. utilized SVM and k-NN for skin cancer classification based on region-growing segmentation, with results of 46 and 34%, respectively [10]. Although convolutional networks outperform other methods in many recognition tasks and in the classification of particular melanomas [11,12], deep networks generally require thousands of training samples. In this study, DePicT CLASS [13] was used to classify melanoma images using region growing methods (utilizing SVM and k-NN) to support patients and health providers in managing the disease. Our paper is organized as follows: Sect. 2 first briefly describes textual CBR and the preliminary concept, and then explains the word association strength in melanomic skin cancer references. This is followed in Sect. 3 by a description of image processing and segmentation with a concrete example. Section 4 explains the tools and dataset which we have used for the implementation of our system. Section 5 discusses the results and evaluation. Finally, Sect. 6 concludes the paper.

2 Textual CBR and Word Association Strength

This section describes the development of a textual-conversational case-based system for melanoma classification and treatment based on DePicT CLASS [13]. In the proposed system, text and image references and learning materials related to the disease and its treatment are attached to individual cases as case descriptions and case recommendations, respectively. Problem requests submitted as free text, images from the affected areas, and filled-out questionnaires are compared with the existing cases from the case base, and the solution of the most similar case is considered as the recommendation. In the following, it is explained how DePicT CLASS can be applied to the detection of melanoma through enrichment by an image processing algorithm. DePicT Melanoma CLASS was developed for use in the MedAusbild¹ project, an initiative research project group of

¹ MedAusbild, https://www.eti.uni-siegen.de/ws/projekte/medausbild/index.html.en?lang=en.


the Institute of Knowledge Based Systems & Knowledge Management (KBS & KM) at the University of Siegen. The case base of melanoma is built based on the AJCC² staging and the melanoma skin cancer information database³. Each case has a word association profile for the main keywords extracted from melanoma textbooks and reports (fifteen melanoma-related papers and books), from which case descriptions and references are built. The case structure (Fig. 1) comprises a case description and a recommendation, including image features, segmentation processes, identified keywords, and a word association profile.

Fig. 1. Case structure of DePicT Melanoma CLASS.

2.1 DePicT CLASS's Preliminary Concept

Each case has a word association profile based on its main keywords, defined by the domain experts, as well as other identified keywords, which are extracted from the case description and case references. The word association strengths (WAS) between the case title and case features (identified keywords) are combined within the DePicT Profile Matrix [13]:

$$\text{DePicT Profile Matrix}_{WAS} = \begin{bmatrix} WAS(1,1) & \cdots & WAS(1,n) \\ \vdots & \ddots & \vdots \\ WAS(j(t,i),1) & \cdots & WAS(j(t,i),n) \end{bmatrix} \quad (1)$$

WAS(j(t,i),1) is the numeric value of CIMAWA⁴ [14] between the title phrase of case i and the j-th identified keyword of the t-th reference extracted from the references and learning materials associated with case i. The case title phrase is a combination of the keywords into a text string. To find similar keywords and extract commonalities from the text, the system needs a similarity measure SIM. The local similarity measure of word association is calculated based on the vectors of each case (C_i) and the incoming case IC:

$$sim_{was}(was_j^{IC}, was_j^{C_i}) = was_j^{IC} \cdot was_j^{C_i} \quad (2)$$

² American Joint Committee on Cancer: Melanoma of the Skin Staging.
³ Melanoma Skin Cancer, https://www.cancer.org/cancer/melanoma-skin-cancer.html.
⁴ Concept of imitation of the human ability of word association.


where C_i is the word association profile vector of the i-th case, given as follows:

$$C_i = (WAS_{1,i}, \ldots, WAS_{m,i}) \quad (3)$$

where WAS_{1,i} is the feature value of the first word association strength of the i-th case and m is the total number of identified keywords in case i. Assume that the problem description, IC, is expressed as follows:

$$IC = (W_1, \ldots, W_k, \ldots, W_m) \quad (4)$$

where W_1 is the feature value of the first word association strength of the input case, which takes the value 1 when the request keyword appears and 0 when it is absent, and k is the total number of input keywords or common keywords between IC and C_i. The matching process can skip either the graphical or the textual similarity measure if relevant data are not present in the request. Each reference has a word association vector for each relevant keyword in the reference. DePicT CLASS checks the similarity of these vectors to the new vector created with the selected keywords input via the user request. The similarity measurement for comparing references in DePicT CLASS is given as:

$$SIM(IC, R_{t,i}) = \sum_{i=1}^{n}\sum_{j=1}^{k}\sum_{t=0}^{q} \frac{w_{tj} \times w_{ij} \times sim(R_{t,i}, IC)}{q} \quad (5)$$

where q is the total number of references in case i and R_{t,i} is the word association profile vector of the t-th reference of case i, which is given by

$$R_{t,i} = (WAS_{1,(t,i)}, \ldots, WAS_{j,(t,i)}, \ldots, WAS_{r,(t,i)}) \quad (6)$$

where WAS_{j,(t,i)} is the feature value of the word association strength of the j-th identified keyword of the t-th reference of the i-th case, and r is the total number of words in the t-th reference. The weights of identified keywords are determined based on the within-case counting frequencies:

$$w_{ij} = \frac{f_{ij}}{N} \quad (7)$$

where f_{ij} is the frequency of word j in case i and N is the total number of identified keywords, including their frequencies, in case i. w_{tj} is the weight of identified keyword j in reference t, which is expressed as follows:

$$w_{tj} = \frac{f_{tj}}{Q} \quad (8)$$

(i) Here, f_{tj} is the frequency of word j in reference t, and Q is the total number of common keywords between reference t and IC. (ii) For a reference image t, f_{tj} is the impact factor of word j in reference t, and Q is the sum of the impact factors of all common keywords between the reference image and the incoming images.
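The weighting and matching steps of Eqs. (5)–(8) can be illustrated with the following simplified Python sketch. The dictionary-based data layout and function names are our assumptions, and the aggregation is reduced to a single case/reference pair for readability.

    def keyword_weight(frequency, total):
        """Eqs. (7)-(8): a keyword weight is its frequency divided by the
        relevant total (N for a case, Q for a reference)."""
        return frequency / total

    def reference_similarity(ic_keywords, case_w, ref_w, was_profile, q):
        """Simplified reading of Eq. (5): sum the products of the case weight,
        the reference weight, and the word association strength over the
        keywords shared with the incoming case, averaged over q references."""
        score = 0.0
        for kw in ic_keywords:
            score += (ref_w.get(kw, 0.0) * case_w.get(kw, 0.0)
                      * was_profile.get(kw, 0.0))
        return score / q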

2.2 DePicT Profile Matrix of Melanoma

The DePicT Profile Matrix of melanoma has 260 (5 × 52) elements, for five cases corresponding to melanoma stages and 52 identified keywords as features. To describe the procedure of DePicT CLASS, a problem request is needed. An example of such a request is given as follows: [A few months ago, I discovered a strange spot on my arm. At first I did not mind, but now it is itching. Strangely, I was burned by the sun in the same region last summer. I was exposed to massive UV radiation there. The spot seems to have changed in the last weeks. Its shape seems awkward now. Now some lymph nodes are swollen.] From this example, the system recognized the following keywords: UV, lymph, and arm. Instead of composing a problem description, the user can also answer the questions, with the answers used to generate input features for the incoming case. Table 1 shows our DePicT Profile Matrix based on the common keywords for five cases.

Table 1. DePicT profile matrix of melanoma for the requested problem.

Keyword  ID  C1   C2    C3    C4    C5
UV       15  5.6  6.4   7.7   10.4  4.4
Lymph    22  0    25.2  26.2  29.2  13.1
Arm      34  4.5  6.9   13.7  17.5  8.7

Thus, based on the identified keywords, the common keywords from the requested problem are recognized, and an IC vector is created based on three (out of 52) common keywords:

IC ∈ [...; UV; ...; lymph; ...; arm; ...] (9)

IC = [0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 1; 0; 0; 0; 0; 0; 0; 1; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 1; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0; 0] (10)
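As a small illustration, the binary IC vector of Eq. (10) can be assembled from the recognized keywords and their positions in the keyword list; this Python sketch is ours, with the keyword IDs taken from Table 1.

    KEYWORD_IDS = {"uv": 15, "lymph": 22, "arm": 34}  # 3 of the 52 identified keywords

    def build_ic_vector(recognized, n_keywords=52):
        """Binary incoming-case vector: 1 where a request keyword appears,
        0 elsewhere (positions are 1-based in the paper)."""
        ic = [0] * n_keywords
        for kw in recognized:
            ic[KEYWORD_IDS[kw.lower()] - 1] = 1
        return ic

    ic = build_ic_vector(["UV", "lymph", "arm"])  # reproduces the vector in Eq. (10)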

Each reference has a word association vector for each relevant keyword in the reference. DePicT CLASS checks the similarity of this vector with the new (incoming) vector created from the keywords selected from a user request. In addition, the DePicT Profile Matrices (w_i) and (w_t), which are filled based on Eqs. (7) and (8), define the weights in each case and reference, respectively:

$$w_i = \begin{bmatrix} w_{11} & \cdots & w_{1j} & \cdots & w_{1k} \\ \vdots & & \vdots & & \vdots \\ w_{n1} & \cdots & w_{nj} & \cdots & w_{nk} \end{bmatrix} \quad (11)$$

$$w_t = \begin{bmatrix} w_{11} & \cdots & w_{1j} & \cdots & w_{1k} \\ \vdots & & \vdots & & \vdots \\ w_{q1} & \cdots & w_{qj} & \cdots & w_{qk} \end{bmatrix} \quad (12)$$


After defining the IC, the similarity measurement is calculated using Eq. (5) to represent the similarities between IC and each case with their respective references for the three common keywords. The similarity degrees of all cases are sorted and the most similar cases obtained. To demonstrate the performance of the case retrieval process, the similarity degree of the first case from this example is calculated below:

SIM(IC, C(R_{3,3})) = (5.6 × 0.6 × 0.8) + (5.6 × 0.4 × 0.8) + (5.6 × 0.3 × 0.8) + (0 × 0.8 × 0.4) + (0 × 0.6 × 0.4) + (0 × 0.5 × 0.4) + (4.5 × 0.2 × 0.2) + (4.5 × 0.2 × 0.2) + (4.5 × 0.3 × 0.2)/3 = 3.2 (13)

3 Image Processing and Classification

The first clinical signs of melanoma appear in the lesion area, which denotes the affected area and the corresponding spots on the skin. There are two classes of melanoma, malignant and benign, and the recognition process comprises five main steps (illustrated as a pipeline in Fig. 2):

Fig. 2. Image processing pipeline.

– Sensor: The input data used for image processing comprise images and respective points of interest (POI). The image data should be in RGB format with a minimum resolution of 200 pixels, and images should include the skin spot to be examined. If there is more than one spot, the relevant spots must be marked as POIs.
– Feature generation: Features are developed to distinguish the spots based on their respective characteristics. The spots can then be divided into the categories malignant or benign.
– Feature selection: After feature creation, the features are tested; those that significantly improve detection are selected and used, while the rest are removed.
– Classification: Input data represented by the selected features are categorized into the above two classes.
– System evaluation: After the system has been implemented, evaluation metrics are used to assess its performance.

Image pre-processing is a crucial step in the use of image data. Scaling is performed to bring images from the datasets to a correct resolution. Spot images


often contain noise such as hairs or dermatological markers. In this study, the DullRazor⁵ algorithm was used, which performs three steps: (i) first, the hair positions are determined using a morphological closing operation; (ii) then, hair pixels are replaced by bilinear interpolation of neighboring pixels; and finally (iii) the remaining pixels are smoothed using an adaptive median filter [15].

3.1 Feature Selection and Region Growing Segmentation

Region growing is a method for identifying similar areas in an image and then selecting them as a whole. As an initial step of this procedure, one or more seed points are selected. The color values of the neighboring pixels are then compared with those of the seed points; if they are sufficiently similar, the compared pixels are also selected, and similarity comparisons are then performed with their neighbors. The algorithm terminates when there are no more sufficiently similar pixels (a sketch of this procedure is given below). As the first step, the image is converted into a gray-value image. A 3 × 3 median filter (a modified decision-based unsymmetrical trimmed median filter) is used for noise reduction: it operates on a received set of values (in this case, the gray values of the nine pixels), which are sorted, with the middle value selected as the new value for the current pixel. In the case of border pixels, missing values are filled with zeros. The brightness is adjusted to increase the contrast, and the histogram of the gray values is rearranged to take advantage of the complete color space. To segment the lesion region of interest (ROI), POIs are indicated within the relevant skin regions; these are used as seeds for a region-growing algorithm that compares each pixel of the ROI to its neighbors for similarity. If the similarity is higher than a threshold, the pixel is added to the region. The method terminates if no further pixels can be recursively added. To remove remaining smaller elements, a complementary image is formed, and an opening process is used to close existing holes. Based on the characteristics of the skin spots occurring as a result of melanoma, Table 2 presents the twelve features considered, in the categories of color and shape. Because skin spots can be sharply differentiated or categorized by their color composition, the color values of the segmented images are examined in both the RGB and HSV color spaces. The average values of all three channels of the two color spaces (features 1–6) are formed as follows:

$$FV_l = \frac{1}{P}\sum_{b=1}^{P} CV_b, \quad l \in \{1, 2, 3, 4, 5, 6\} \quad (14)$$

where P is the number of pixels in the segmented region and CV_b is the pixel color value. The distance between the maximum and minimum values in HSV color space is calculated and used as feature 7 (FV_7 = FV_max − FV_min). For the classification of skin spots, both the color value distribution around a spot and the different color values within the spot are generally relevant. For this reason, three regions are defined for the next two features: Rs is defined as the region

⁵ http://www.dermweb.com/dull_razor/
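The following Python sketch illustrates a basic seeded region-growing loop of the kind described above. It compares candidate pixels against the mean gray value of the seed points, whereas the system's exact similarity criterion (pixel-to-neighbor comparison) and threshold value are implementation details we do not reproduce here; the threshold default is an assumption.

    import numpy as np
    from collections import deque

    def region_grow(gray, seeds, threshold=10.0):
        """Grow a region of interest from POI seed pixels on a grayscale image:
        4-connected neighbors are added while their intensity stays within
        `threshold` of the mean seed intensity."""
        h, w = gray.shape
        mask = np.zeros((h, w), dtype=bool)
        seed_value = np.mean([float(gray[y, x]) for x, y in seeds])
        queue = deque(seeds)
        for x, y in seeds:
            mask[y, x] = True
        while queue:
            x, y = queue.popleft()
            for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if 0 <= nx < w and 0 <= ny < h and not mask[ny, nx]:
                    if abs(float(gray[ny, nx]) - seed_value) < threshold:
                        mask[ny, nx] = True
                        queue.append((nx, ny))
        return mask  # boolean lesion (ROI) mask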


Table 2. Image processing features.

Category  Name                        Inputs  Feature numbers
Color     Average RGB channel         3       1–3
Color     Average HSV channel         3       4–6
Color     Color structure descriptor  1       7
Color     Color layout descriptor     2       8–9
Shape     Principal component ratio   1       10
Shape     Filled fitted ellipse       1       11
Shape     Unfilled fitted ellipse     1       12

surrounding the ROI, Ro represents the outside of the region, and Ri represents the core of the ROI. Accordingly, features 8 and 9 are calculated as follows:

$$FV_8 = \left|\frac{ACV_{Ri}}{P_{Ri}} - \frac{ACV_{Ro}}{P_{Ro}}\right| \quad (15)$$

$$FV_9 = \left|\frac{ACV_{Ro}}{P_{Ro}} - \frac{ACV_{Rs}}{P_{Rs}}\right| \quad (16)$$

where ACV_{Ri} is the average color value in the inner region of the ROI, ACV_{Ro} is the average color value outside the ROI, and ACV_{Rs} is the average color value in the ROI neighborhood (see Fig. 3). P_{Ro}, P_{Ri}, and P_{Rs} represent the total number of pixels in each region. The tenth feature is the principal component ratio. Benign melanomas tend to have a much more circular or elliptical shape than malignant melanomas; accordingly, an ellipse is fitted around the ROI, and the percentages of ROI pixels on the ellipse and outside of the ROI are used as features 11 and 12, respectively. For comparability, all twelve features are normalized to (0, 1) and stored in a feature vector for use in classification.

Fig. 3. Color layout descriptor (a) Input field (ISIC Archive) (b) Color layout descriptor regions, Ro: red, Rs: yellow, Ri: white. (Color figure online)
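Features 8 and 9 can be computed from boolean masks of the three regions shown in Fig. 3. In this Python sketch (ours), ACV/P is interpreted as the per-pixel mean of the region, and a single channel is used for simplicity.

    def color_layout_features(channel, ri_mask, ro_mask, rs_mask):
        """Eqs. (15)-(16): absolute differences between the per-pixel average
        color values of the inner ROI (Ri), the ROI exterior (Ro), and the
        ROI neighborhood (Rs). All masks are boolean arrays over `channel`."""
        avg = lambda m: float(channel[m].mean())  # ACV_R / P_R for region R
        fv8 = abs(avg(ri_mask) - avg(ro_mask))
        fv9 = abs(avg(ro_mask) - avg(rs_mask))
        return fv8, fv9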

3.2 DePicT Melanoma CLASS-Classification

As explained in the previous section, users can submit a request based on their situation, and our system recommends the solution related to their problem (see Fig. 4 [16]). Below, the performance of the proposed classification process


Fig. 4. DePicT Melanoma CLASS-result view.

in DePicT Melanoma CLASS is demonstrated through an example using two images, one benign and one malignant. For the benign image, the POI coordinates were set to x = 828 and y = 420, while for the malignant image they were set to x = 943 and y = 506. The pre-processing and segmentation processes described above were performed, with the results shown in Fig. 5. Expert segmentation for these examples is illustrated in Fig. 6. The values of all features for these test data are specified in Table 3. This methodology can be demonstrated by visualizing its steps for two 1600-pixel images from the dataset representing a benign (see Fig. 5-a1) and a malignant (see Fig. 5-a2) melanoma, respectively. The result for the current example is given in Table 4; the accuracy obtained was 100% using k-NN and 50% using SVM.

4 Tools and Dataset

MATLAB (2017a) was utilized to develop DePicT Melanoma CLASS, in particular using the Image Processing Toolbox, Parallel Computing Toolbox, MATLAB Compiler and Coder, and App Designer. The ISIC Archive dataset⁶ containing

⁶ https://www.isic.org, https://isic-archive.com/.


Fig. 5. Left: segmentation of malignant image. Right: segmentation of benign image.

Fig. 6. Expert segmentation: (a) benign melanoma and (b) malignant melanoma.

Table 3. Feature values of the experimented images.

Nr.  Benign  Malignant  Min     Max
1    0.2730  0.3054     0.1720  0.9127
2    0.2327  0.2491     0.0408  0.7341
3    0.2472  0.2219     0.0242  0.6632
4    0.7317  0.2044     0.0265  0.9436
5    0.1614  0.2741     0.0900  0.9579
6    0.2758  0.3158     0.1853  0.9134
7    0.9948  0.9977     0.0366  0.9989
8    0.0985  0.0335     0.0018  0.2101
9    0.3473  0.6485     0.0060  0.6358
10   0.4941  0.7431     0.3101  0.9933
11   0.9577  0.9787     0.3412  1
12   0.4491  0.4650     0.1600  0.9918

Table 4. Result of the described example.

Input (sample image)  k-NN     SVM
Benign                Correct  Correct
Malignant             Correct  Incorrect


images of benign and malignant melanoma was used for image processing and classification (300 images for training and 100 images for testing).

5 Experimental Results and Validation

DePicT Melanoma CLASS achieved promising results. Its performance in terms of evaluation scores (precision, recall, F-score, and accuracy) for k-NN, SVM, and WAS is shown in Table 5. Training on 300 images and testing on 100 images with k = 1, k-NN classified 64% of the inputs correctly (Euclidean distance), while the accuracy obtained using SVM was 62%. In evaluating the textual components of requested problems, in the form of nineteen samples extracted from melanoma forums⁷, the system achieved an accuracy of 63%.

Table 5. The comparison of the evaluation scores (precision, recall, F-score, and accuracy) of DePicT Melanoma CLASS.

Classification and local sim  TP  TN  FP  FN  Pre.  Rec.  F-m.  Acc.
k-NN                          30  34  16  20  0.65  0.6   0.62  0.64
SVM                           25  37  13  25  0.66  0.5   0.57  0.62
WAS                           8   4   7   0   0.42  0.53  0.47  0.63

6 Conclusion

In this paper, melanoma skin cancer was discussed as the domain of the current research. The main parameters and datasets for case creation were also discussed. In addition, the processes of case representation and retrieval in our application (DePicT Melanoma CLASS) were described using an example. Early melanoma detection is one of the key objectives in skin cancer treatment. We proposed a case-based system utilizing collected cases to support patients and healthcare providers through the early detection of melanoma. We used both k-NN and SVMs to classify incoming images and word association profiles obtained from requests in the form of text queries or filled-in questionnaires. Analysis of the results obtained by testing a melanoma dataset suggests that our case-based system for detecting malignant melanoma is fit for the purpose of supporting users by providing relevant information. Further work would involve extending the image processing phase by selecting more relevant features and using more testing images. The text-mining phase could also be further developed by enriching the case base with packages of synonym words, more descriptions, and more references.

⁷ MRF: https://www.melanoma.org and MIF: http://melanomainternational.org/.


References
1. American Cancer Society: Cancer facts and figures 2017. Genes Develop. 21(20), 2525–2538 (2017)
2. Ali, A.R., Deserno, T.: A systematic review of automated melanoma detection in dermatoscopic images and its ground truth data. In: Proceedings of SPIE, 8318I (2012). https://doi.org/10.1117/12.912389
3. Masood, A., Al-Jumaily, A.A.: Computer aided diagnostic support system for skin cancer: a review of techniques and algorithms. Int. J. Biomed. Imaging, vol. 2013, article ID 323268 (2013). https://doi.org/10.1155/2013/323268
4. Coit, D.G., et al.: NCCN guidelines insights: melanoma, version 3.2016. J. Nat. Compr. Cancer Netw. JNCCN 14(8), 945–958 (2016)
5. MRF: Melanoma Research Foundation (2017). https://www.melanoma.org/
6. AIM: AIM at Melanoma (2017). https://www.aimatmelanoma.org/
7. Nicolas, R., Vernet, D., Golobardes, E., Fornells, A., Puig, S., Malvehy, J.: Improving the combination of CBR systems with preprocessing rules in melanoma domain. In: Workshop Proceedings of the 8th International Conference on Case-Based Reasoning, pp. 225–234 (2009)
8. Sabouri, P., GholamHosseini, H., Larsson, T., Collins, J.: A cascade classifier for diagnosis of melanoma in clinical images. In: Conference Proceedings: Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 6748–6751. IEEE Engineering in Medicine and Biology Society (2014)
9. Kavitha, J.C., Suruliandi, A., Nagarajan, D., Nadu, T.: Melanoma detection in dermoscopic images using global and local feature extraction. Int. J. Multimedia Ubiquit. Eng. 12(5), 19–28 (2017)
10. Sumithra, R., Suhil, M., Guru, D.S.: Segmentation and classification of skin lesions for disease diagnosis. Procedia Comput. Sci. 45, 76–85 (2015)
11. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28. CoRR, abs/1505.0
12. Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., Thrun, S.: Dermatologist-level classification of skin cancer with deep neural networks. Nature 542(7639), 115–118 (2017)
13. Nasiri, S., Zenkert, J., Fathi, M.: Improving CBR adaptation for recommendation of associated references in a knowledge-based learning assistant system. Neurocomputing 250, 5–17 (2017)
14. Uhr, P., Klahold, A., Fathi, M.: Imitation of the human ability of word association. Int. J. Soft Comput. Softw. Eng. [JSCSE] 3(3), 248–254 (2013)
15. Lee, T., Ng, V., Gallagher, R., Coldman, A., McLean, D.: Dullrazor: a software approach to hair removal from images. Comput. Biol. Med. 27(6), 533–543 (1997)
16. Helsper, J., Jung, M.: Projektgruppe Wissensbasiertes System zur Unterstützung der medizinischen Ausbildung (MedAusbild), Sommersemester 2017. Technical report, Institute of KBS and KM, University of Siegen (2017)

On the Use of Betweenness Centrality for Selection of Plausible Trajectories in Qualitative Biological Regulatory Networks

Muhammad Tariq Saeed¹, Jamil Ahmad¹(B), and Amjad Ali²

¹ Research Centre for Modeling and Simulation (RCMS), National University of Sciences and Technology (NUST), Islamabad 44000, Pakistan
[email protected]
² Atta-ur-Rehman School of Applied Biosciences (ASAB), National University of Sciences and Technology (NUST), Islamabad 44000, Pakistan

Abstract. The qualitative modeling approach is widely used to study the behavior of Biological Regulatory Networks. The approach uses directed graphs, also called state graphs, to represent system dynamics. As the number of genes increases, the complexity of the state graph increases exponentially. The identification of important trajectories and the isolation of more probable dynamics from less significant ones constitute an important problem in the qualitative modeling of biological networks. In this work, we implement a parallel approach for the identification of important dynamics in qualitative models. Our implementation uses the concept of betweenness centrality. For parallelization, we used a Java-based library, MPJ Express, to implement our approach. We evaluate the performance of our implementation on the well-known case study of bacteriophage lambda, and we demonstrate its effectiveness by selecting important trajectories and correlating them with experimental data.

Keywords: Biological Regulatory Networks (BRNs) · Qualitative modeling · René Thomas framework · Network analysis · Betweenness centrality

1 Introduction

The use of computational methods and tools for the modeling and analysis of biological regulatory networks lies at the core of systems biology. A number of approaches have been developed to understand the working of biological systems. Graph-based approaches are commonly used for the modeling and analysis of various types of biological networks [6], such as protein-protein interaction (PPI) networks, signal transduction, metabolic and chemical networks. A number of data formats, such as SBML (Systems Biology Markup Language),


BioPAX and CML (Chemical Markup Language), have been made available for the storage and retrieval of biological data for further analysis. Several studies show that networks from different domains share common structural characteristics, and that most functional entities in a network can be found using graph-based approaches. The qualitative modeling approach, developed by René Thomas, is widely used for the analysis of Biological Regulatory Networks (BRNs) [4,20–22]. The qualitative framework also employs a graph-theoretic representation of regulatory systems and generates system dynamics by using a parameter-driven procedure to convert an interaction graph into a state graph (Sect. 2) [7]. In many cases, when precise experimental data (concentration levels, reaction rates, etc.) are not available, it is still possible to use qualitative modeling frameworks to model biological interactions and draw meaningful inferences (stable states, homeostasis, etc.) [9,11]. The complexity of Thomas's models is exponential in the number of genes and their qualitative expression levels, and even small networks lead to large state graphs comprising several thousand trajectories between two nodes [8,12]. Each trajectory represents a unique sequence of step-by-step alterations in the expression levels of genes/proteins. The qualitative trajectories are further analyzed using hybrid modeling to compute logical constraints for isolating target genes [1–3]. The minimization of trajectories, to select only biologically significant paths, is therefore an important problem and a precursor to the identification of potential drug targets through hybrid modeling.

1.1 Related Work

Graph-based modeling is one of the most suitable methods to model the behavior of a biological regulatory system. In particular, centrality measures have been employed in the literature to identify important entities involved in regulation. In the state graphs of Thomas's qualitative models, however, the degree of a specific state does not provide any useful insight into the functionality of the biological system under investigation. Tareen et al. [18] used betweenness centrality to isolate trajectories in the qualitative model of an environmental system. The authors applied the qualitative modeling framework to case studies of a microbial population and an atmospheric system, and further isolated the trajectories on the basis of network analysis using shortest-path betweenness centrality [10]. A similar approach was used in [16] to investigate the role of the Hexosamine Biosynthetic Pathway (HBP) in cancer progression, where the trajectories were sorted on the basis of betweenness centrality.

1.2 Our Contribution

Our work builds on existing work in the area of centrality measures presented in [10,18]. In this study, we discuss a generalized framework that can be adopted for the selection of important trajectories in Thomas's qualitative models. We demonstrate the effectiveness of this approach on a well-studied biological network that controls the immunity control mechanism in bacteriophage lambda [15,19]. Moreover, to reduce processing time, we employ a data decomposition approach: we use MPJ Express [5], a Java-based parallelization library, to distribute the set of all paths among the available processes for the computation of average betweenness centrality.

1.3 Paper Organization

The rest of the paper is organized as follows. Section 2 describes the methodology, comprising the qualitative modeling framework of René Thomas, with a focus on how to select important trajectories in qualitative models. Section 3 presents the results of experiments on the qualitative model of the immunity control mechanism in bacteriophage lambda. Finally, conclusions and future work are presented in Sect. 4.

2 Methods

2.1 Qualitative Modeling Framework

Here, we briefly revisit the formal definitions of the qualitative modeling framework, mainly adopted from [7,16,17].

Definition 1 (Directed Graph). A directed graph G is an ordered pair G = (V, E), where V is a finite set of vertices and E ⊆ V × V is a set of ordered pairs of vertices (arcs). The arc e = (u, v) connects vertex u to vertex v, where u is called the tail and v the head.

Definition 2 (Biological Regulatory Network). A Biological Regulatory Network (BRN) is a weighted directed graph G = (V, E). Biological entities such as genes and proteins are modelled by the set of nodes V, and regulations are modelled by the set of edges E ⊆ V × V. Each regulation (vi, vj) is labeled by a pair (τ, σ), where τ defines the threshold at which gene vi starts regulating gene vj, and σ ∈ {+, −} is the sign of the interaction: the “+” sign depicts activation, whereas “−” represents inhibition.

Definition 3 (State). A qualitative state of a BRN is an n-tuple S = (sv1, . . . , svn), where svi is the abstract expression level of entity vi. The dynamic behavior of the BRN depends on the available resources and on a set of positive integers called model parameters; the same BRN with different model parameters can lead to different dynamics.

Definition 4 (Resources). In G, for each biological entity, the availability of an activator or the absence of an inhibitor is considered a resource. For example, an entity regulated by one activator a and one inhibitor i has four possible resource sets: {}, {a}, {i} and {a, i}. The cartesian product of activators and inhibitors generates the different resource sets.

Definition 5 (Logical Parameters). These are the K parameters that control the discrete evolution of the entities in a BRN.
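As a concrete reading of Definitions 2 and 4, the following sketch (our own illustration, not code from this work; the names Regulation, resources and the example network are hypothetical) represents a regulation with its threshold and sign, and derives the resource set of a target entity in a given state: an activator contributes a resource when its level reaches the threshold, an inhibitor when its level stays below it.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of Definitions 2 and 4: regulations labeled (threshold, sign) and
// the resource set of a target entity in a given qualitative state.
public class ResourceSketch {

    enum Sign { ACTIVATION, INHIBITION }

    record Regulation(String regulator, String target, int threshold, Sign sign) {}

    // A regulator is a resource if it is an active activator or an inactive inhibitor.
    static List<String> resources(String target, List<Regulation> brn, Map<String, Integer> state) {
        List<String> res = new ArrayList<>();
        for (Regulation r : brn) {
            if (!r.target().equals(target)) continue;
            boolean active = state.get(r.regulator()) >= r.threshold();
            if ((r.sign() == Sign.ACTIVATION) == active) res.add(r.regulator());
        }
        return res;
    }

    public static void main(String[] args) {
        // Hypothetical two-entity network: u activates itself (threshold 2),
        // v inhibits u (threshold 1), u activates v (threshold 1).
        List<Regulation> brn = List.of(
                new Regulation("u", "u", 2, Sign.ACTIVATION),
                new Regulation("v", "u", 1, Sign.INHIBITION),
                new Regulation("u", "v", 1, Sign.ACTIVATION));
        System.out.println(resources("u", brn, Map.of("u", 0, "v", 0)));  // [v]: the inhibitor is absent
    }
}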


Definition 6 (State Graph). Let G = (V, E) be a BRN and let sνx denote the expression level of entity x in a state s ∈ S. The state graph of G is a directed graph R = (S, T), where S is the set of qualitative states and T ⊆ S × S is a relation between states, also called the transition relation, such that s → s′ ∈ T iff:

– there exists a unique x ∈ V such that s′νx ≠ sνx, with s′νx = sνx + 1 if sνx < Kx(Wνx) and s′νx = sνx − 1 if sνx > Kx(Wνx), where Wνx denotes the set of resources of x in s; and
– s′νy = sνy for all y ∈ V \ {x}.

Definition 7 (Path). A path in a qualitative state graph R = (S, T) is a finite or infinite sequence of arcs between two given vertices x and y.

The interaction graph only depicts the regulations of a biological network. In order to view the dynamic behavior, the interaction graph is converted into a state transition graph (state graph) with the help of the logical parameters. These model parameters are not known in advance, and therefore formal verification techniques such as model checking are applied for their computation. In [8], Bernot et al. introduced a method to determine logical parameters from experimental observations encoded in a temporal logic framework, namely Computation Tree Logic (CTL). A set of quantifiers is used to construct CTL formulas.
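To make the asynchronous transition rule of Definition 6 concrete, the sketch below (our own illustration; the helper paramFor, which stands for Kx(Wνx), and the parameterization in main are hypothetical) computes the successors of a qualitative state: each entity whose level differs from its logical parameter yields one successor in which that level moves a single step toward the parameter, all other levels unchanged.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

// Minimal sketch of Definition 6: asynchronous successors of a qualitative state.
// paramFor(entity, state) stands for K_entity(W_entity), the logical parameter
// of an entity under the resources available in the given state.
public final class StateGraphSketch {

    static List<int[]> successors(int[] state, BiFunction<Integer, int[], Integer> paramFor) {
        List<int[]> next = new ArrayList<>();
        for (int x = 0; x < state.length; x++) {
            int target = paramFor.apply(x, state);
            if (target != state[x]) {
                int[] s = Arrays.copyOf(state, state.length);
                s[x] += (target > state[x]) ? 1 : -1;  // move one step toward K_x
                next.add(s);                           // all other levels unchanged
            }
        }
        return next;
    }

    public static void main(String[] args) {
        // Hypothetical two-entity parameterization, for illustration only:
        // entity 0 tends to level 2 when entity 1 is absent, otherwise to 0;
        // entity 1 tends to level 1 once entity 0 is expressed.
        BiFunction<Integer, int[], Integer> k = (x, s) ->
                (x == 0) ? (s[1] == 0 ? 2 : 0)
                         : (s[0] >= 1 ? 1 : 0);
        for (int[] s : successors(new int[]{0, 0}, k))
            System.out.println(Arrays.toString(s));    // prints [1, 0]: only entity 0 can move
    }
}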

2.2 CTL Quantifiers

– A (for All paths): the path quantifier “A” states that a given property should hold on all paths originating from the given state.
– E (there Exists a path): the path quantifier “E” states that a given property must hold on at least one path originating from the given state.
– G (Globally): the state quantifier “G” states that a property holds in all states of a path originating from the given state, including the given state itself.
– F (Future state): the state quantifier “F” states that a given property must hold in some future state of the path originating from the given state.
– X (neXt state): the state quantifier “X” states that a given property must hold in the immediate successor state.

For example, the formula AG(EF(u = 0)) expresses that, from every reachable state, it always remains possible to return to a state where u = 0. Figure 1 elaborates the different steps involved in qualitative modeling on a well-known BRN of Pseudomonas aeruginosa [15].

2.3 Selection of Important Trajectories

Figure 2 shows the procedure for the selection of important trajectories. Given a dynamic qualitative model as a state graph, a source and a target vertex, and the number of processes, the procedure SELECTPATH determines the most probable trajectory arising from the source vertex and terminating at the target vertex. To reduce computation time, it performs a data decomposition by partitioning the set of all paths among the available processes (a sketch of this decomposition appears after Fig. 2).

[Fig. 1 graphic: (A) interaction graph of two entities u and v, with u activating itself at threshold 2, u activating v at threshold 1, and v inhibiting u at threshold 1; (B) observations encoded in CTL: (u = 0) ⇒ EG !(u = 2), “starting from a qualitative state where u = 0, the system will not move to a state where the expression of u is very high (i.e. u = 2)”, and (u = 2) ⇒ AX(EF(u = 2)), “when the expression of u is very high, it will continue to maintain its state”; (C) logical parameters Ku{} = 0, Ku{u} = 2, Ku{v} = 2, Ku{u+v} = 2, Kv{} = 1, Kv{u} = 1; (D) state graph over the states 00, 01, 10, 11, 20, 21.]

Fig. 1. Qualitative modeling applied to the BRN that controls mucus production in Pseudomonas aeruginosa (A–D). Following Definition 2, the interaction graph shown in (A) abstracts the two entities of the BRN responsible for mucus production in Pseudomonas aeruginosa. Activations are shown with a pointed arrow labeled with a + sign; inhibitions are shown with a blunt arrow labeled with a − sign. The weight on each arrow shows the threshold. The known experimental observations are encoded into Computation Tree Logic (CTL) using the temporal logic quantifiers introduced in Sect. 2.2 (B). SMBioNet [12] is used to compute the model parameters shown in (C). Finally, GINsim [13] is used to generate the state graph shown in (D). The state graph shows two important dynamics: the qualitative cycle (0, 0), (1, 0), (1, 1), (0, 1), (0, 0) shows the normal response of the biological system maintaining homeostasis, while the divergent behavior that leads to a deadlock is represented by the qualitative state (2, 1).

Definition 8 (Betweenness Centrality). Let R = (S, T) be a state graph and let x, y and z be three distinct states in R. Let σx,y denote the total number of paths from state x to state y, and let σx,y(z) denote the number of those paths that pass through state z. Let O denote the set of all ordered pairs (x, y) such that x, y and z are pairwise distinct. Then the betweenness centrality of state z is computed from Eq. 1:

Cb(z) = Σ(x,y)∈O [ σx,y(z) / σx,y ]    (1)
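As a concrete reading of Eq. 1, the sketch below (our own illustration; the adjacency list and the example graph are hypothetical) enumerates all simple paths between every ordered pair of states and accumulates, for a chosen state z, the fraction of x–y paths that pass through z. It follows the path-based variant used here, rather than the classical shortest-path definition of betweenness.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of Eq. 1: path-based betweenness centrality of a state z,
// over a hypothetical adjacency list for states 0..n-1.
public final class BetweennessSketch {

    static void enumerate(List<List<Integer>> adj, int cur, int target,
                          Deque<Integer> path, boolean[] onPath, List<List<Integer>> out) {
        if (cur == target) { out.add(new ArrayList<>(path)); return; }
        for (int next : adj.get(cur)) {
            if (onPath[next]) continue;            // keep paths simple (no repeated state)
            onPath[next] = true; path.addLast(next);
            enumerate(adj, next, target, path, onPath, out);
            path.removeLast(); onPath[next] = false;
        }
    }

    static double betweenness(List<List<Integer>> adj, int z) {
        int n = adj.size();
        double cb = 0.0;
        for (int x = 0; x < n; x++) {
            for (int y = 0; y < n; y++) {
                if (x == y || x == z || y == z) continue;   // x, y, z pairwise distinct
                List<List<Integer>> paths = new ArrayList<>();
                Deque<Integer> path = new ArrayDeque<>();
                boolean[] onPath = new boolean[n];
                onPath[x] = true; path.addLast(x);
                enumerate(adj, x, y, path, onPath, paths);
                if (paths.isEmpty()) continue;
                long through = paths.stream().filter(p -> p.contains(z)).count();
                cb += (double) through / paths.size();      // sigma_xy(z) / sigma_xy
            }
        }
        return cb;
    }

    public static void main(String[] args) {
        // Hypothetical 4-state graph: 0 -> 1 -> 3 and 0 -> 2 -> 3.
        List<List<Integer>> adj = List.of(List.of(1, 2), List.of(3), List.of(3), List.of());
        System.out.println(betweenness(adj, 1));  // 0.5: one of the two 0->3 paths passes through 1
    }
}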


PSEUDOCODE: Select path from source to target vertex based on avg. betweenness centrality
1: procedure SELECTPATH(Graph g, source s, target t, number_of_procs)
2:   paths[] = sort(g.getAllPaths(s, t));
3:   btw[] = g.getBtw(file btw_listing);
4:   initialize MPJ processes;
5:   for (i = rank; i < paths.length; i += number_of_procs)
6:     if (paths[i].getBtw() / paths[i].length > max)
7:       max = paths[i].getBtw() / paths[i].length; selected = paths[i];
8:   return selected;

Fig. 2. Pseudo-code to select the path from the source to the target vertex based on average betweenness centrality
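A possible realization of the data decomposition in Fig. 2 is sketched below; it is our own reconstruction, not the authors' code, and assumes the mpiJava-style API that MPJ Express provides (MPI.Init, Rank, Size, Reduce). Each rank scans a strided slice of the precomputed path list, keeps its local best average betweenness, and a reduction with MPI.MAX yields the global maximum; the avgBtw array of per-path averages and its loader are hypothetical inputs.

import mpi.MPI;

// Sketch of the SELECTPATH decomposition from Fig. 2 using MPJ Express.
// avgBtw[i] holds the average betweenness centrality of the i-th path
// from the source to the target state (hypothetical, precomputed input).
public class SelectPathMPJ {

    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int size = MPI.COMM_WORLD.Size();

        double[] avgBtw = loadAverageBetweenness();   // hypothetical loader

        // Strided decomposition: rank r handles paths r, r + size, r + 2*size, ...
        double[] localBest = { Double.NEGATIVE_INFINITY };
        for (int i = rank; i < avgBtw.length; i += size)
            if (avgBtw[i] > localBest[0]) localBest[0] = avgBtw[i];

        // Reduce the per-process maxima to the global maximum on rank 0.
        double[] globalBest = new double[1];
        MPI.COMM_WORLD.Reduce(localBest, 0, globalBest, 0, 1, MPI.DOUBLE, MPI.MAX, 0);

        if (rank == 0)
            System.out.println("highest average betweenness: " + globalBest[0]);
        MPI.Finalize();
    }

    private static double[] loadAverageBetweenness() {
        // Sample values taken from Fig. 4; in practice these come from the state graph.
        return new double[] { 0.0801, 0.1098, 0.0934, 0.0863, 0.0824, 0.0806 };
    }
}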

3 Results and Discussion

We present experimental results of our implementation with reference to the well-studied network of the immunity control mechanism in bacteriophage λ.

3.1 Immunity Control in Bacteriophage λ

Bacteriophage lambda is a temperate virus, commonly found in contaminated environments, that infects bacterial populations. The infection may lead to two different responses:

1. Lytic: in this case, the lambda virus starts synthesizing viral proteins, resulting in fast replication of the lambda DNA. Consequently, the cell lyses and a large number of new viruses are produced.
2. Lysogenic: phage λ can also switch to a passive state in which it resides in the bacterium in latent form. In this case, the lambda virus attaches itself to the host and integrates its DNA into the host chromosome, from where it is safely transmitted to the bacterial progeny.

The mechanism underlying the switch between the lytic and lysogenic states of phage λ has been a subject of much interest over the last three decades. The qualitative model shown in Fig. 3 has several trajectories. The selection of the important paths that originate from the initial state (0, 0, 0, 0) and lead to the lytic state (2, 0, 0, 0) or to the lysogenic states [(0, 2, 0, 0), (0, 3, 0, 0)] constitutes an important problem towards selecting the most relevant set of dynamics underlying immunity control in bacteriophage lambda.

3.2 Network Analysis: Selection of Trajectories

In [14], Richard et al. performed an analysis of the immunity control mechanism in phage λ using a model checking approach. We use the temporal logic properties in [14] to generate the qualitative state graph shown in Fig. 3.

[Fig. 3 graphic: (A) qualitative BRN of the immunity control mechanism in bacteriophage lambda, with the genes CI, Cro, CII and N as vertices and signed, thresholded interactions as edges; (B) qualitative state graph generated from the logical parameters, with states ordered as (CI, Cro, CII, N), where (0,0,0,0) is the initial state, (2,0,0,0) the lytic state, and (0,2,0,0) and (0,3,0,0) the lysogenic states; states are shaded from minimum to maximum betweenness centrality.]

Fig. 3. (A) The Biological Regulatory Network (BRN) shows the main regulators involved in the immunity control mechanism in bacteriophage lambda. The genes are shown as vertices and the interactions as edges. Edges labeled with a “+” sign show activation, and edges labeled with a “−” sign show inhibition; the threshold at which the target gene is regulated is also labeled. Qualitative modeling of the BRN using the René Thomas framework generates the system dynamics in the form of the state graph shown in (B). Each state in the state graph is a vector [CI, Cro, CII, N] describing the expression levels of the genes. The state graph is sorted according to betweenness centrality and rendered using the Cytoscape software.

The model parameters were provided to the GINsim tool [13] for qualitative modeling. The qualitative model was then analyzed using the algorithm shown in Fig. 2 for the selection of important trajectories. In the qualitative model, the state (0, 0, 0, 0) represents a typical start or reset state. We search all paths from state (0, 0, 0, 0) to state (0, 2, 0, 0) and, for each path length, we select a trajectory based on the maximum average betweenness centrality. Figure 4(A) shows the list of selected trajectories for each path length, sorted with respect to this criterion. Finally, the trajectory with the highest average betweenness centrality is highlighted in the qualitative model (Fig. 4(B)). The algorithm presented in Fig. 2 has been implemented using MPJ Express [5], a thread-safe Java messaging library for high-performance computing. The speedup graph in Fig. 5 shows that our implementation achieves almost linear speedup.
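As a concrete illustration of this per-length selection, the sketch below (our own reconstruction; the Path record and its fields are hypothetical) groups candidate paths by length and keeps, for each length, the one with the maximum average betweenness centrality.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: for each path length, keep the path with the highest average
// betweenness centrality. Path is a hypothetical record holding the state
// sequence and the sum of the betweenness values of its states.
public class PerLengthSelection {

    record Path(List<String> states, double betweennessSum) {
        double averageBetweenness() { return betweennessSum / states.size(); }
        int length() { return states.size(); }
    }

    static Map<Integer, Path> bestPerLength(List<Path> paths) {
        Map<Integer, Path> best = new HashMap<>();
        for (Path p : paths)
            best.merge(p.length(), p, (a, b) ->
                    a.averageBetweenness() >= b.averageBetweenness() ? a : b);
        return best;
    }

    public static void main(String[] args) {
        // Two hypothetical candidates of the same length: the second one wins.
        List<Path> paths = List.of(
                new Path(List.of("0000", "0100", "0200"), 0.18),
                new Path(List.of("0000", "1000", "0200"), 0.24));
        bestPerLength(paths).forEach((len, p) ->
                System.out.println(len + " -> " + p.states()
                        + " (avg " + p.averageBetweenness() + ")"));
    }
}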

[Fig. 4 graphic: (B) qualitative state graph of immunity control in bacteriophage lambda with the selected trajectory highlighted (states ordered as (CI, Cro, CII, N); (0,0,0,0) initial, (2,0,0,0) lytic, (0,2,0,0) and (0,3,0,0) lysogenic; states shaded from minimum to maximum betweenness centrality); (A) paths of different lengths from state 0000 to state 0200 with the highest average betweenness centrality per length (0.0801, 0.1098, 0.0934, 0.0863, 0.0824 and 0.0806); the overall maximum, 0.1098, is attained by the path 0000 → 1000 → 1100 → 1200 → 0200.]

Fig. 4. The qualitative state graph of immunity control in bacteriophage lambda is shown in (B). The highlighted path is computed by comparing the average betweenness centrality of all possible paths. For each path length, the highest average betweenness centrality is shown in (A). The results show that the path highlighted in (B) has the highest average betweenness centrality value, 0.1098.

One limitation of an exhaustive calculation on the set of all paths (from the source to the target vertex) is that the approach cannot scale to large networks. In future work, it can be improved by applying a pre-processing technique to reduce the search space.


[Fig. 5 graphic: speed-up (%) plotted against the number of threads (2–16), comparing the observed speed-up with the ideal linear speed-up.]

Fig. 5. The speedup graph shows that our implementation in MPJ Express achieves almost linear speedup.

4 Conclusion

An important problem in the qualitative modeling of biological networks is the selection of important, or more probable, trajectories from the set of all possible dynamics. In this work, we presented experimental results of an implementation that employs the concept of average betweenness centrality to address this problem. We considered the well-known case study of the immunity control mechanism in bacteriophage lambda; from the list of permissible dynamics, we isolated important trajectories and discussed their coherence with the existing literature. To reduce the computational complexity of the algorithm, we used high-level data parallelism. In the future, we aim to evaluate and improve this approach on more complicated BRNs. Moreover, we also aim to provide a GUI-based tool that facilitates the selection of important trajectories in qualitative models.

References

1. Ahmad, J., Bernot, G., Comet, J.P., Lime, D., Roux, O.: Hybrid modelling and dynamical analysis of gene regulatory networks with delays. Complexus 3(4), 231–251 (2006). https://doi.org/10.1159/000110010
2. Ahmad, J., Niazi, U., Mansoor, S., Siddique, U., Bibby, J.: Formal modeling and analysis of the MAL-associated biological regulatory network: insight into cerebral malaria. 7(3) (2012)
3. Aslam, B., Ahmad, J., Ali, A., Zafar Paracha, R., Tareen, S.H.K., Niazi, U., Saeed, T.: On the modelling and analysis of the regulatory network of dengue virus pathogenesis and clearance. Comput. Biol. Chem. 53, 277–291 (2014). http://linkinghub.elsevier.com/retrieve/pii/S1476927114001261
4. Atkinson, D.E.: Biological feedback control at the molecular level. Science 150(3698), 851–857 (1965)


5. Baker, M., Carpenter, B., Shafi, A.: MPJ Express: towards thread safe Java HPC. In: 2006 IEEE International Conference on Cluster Computing, pp. 1–10. IEEE (2006)
6. Barabasi, A.L., Oltvai, Z.N.: Network biology: understanding the cell's functional organization. Nat. Rev. Genet. 5(2), 101–113 (2004)
7. Bernot, G., Cassez, F., Comet, J.P., Delaplace, F., Müller, C., Roux, O.: Semantics of biological regulatory networks. Electron. Notes Theor. Comput. Sci. 180(3), 3–14 (2007)
8. Bernot, G., Comet, J.P., Richard, A., Guespin, J.: Application of formal methods to biological regulatory networks: extending Thomas' asynchronous logical approach with temporal logic. J. Theor. Biol. 229(3), 339–347 (2004)
9. De Jong, H.: Modeling and simulation of genetic regulatory systems: a literature review. J. Comput. Biol. 9(1), 67–103 (2002)
10. Junker, B.H., Schreiber, F.: Analysis of Biological Networks. Wiley, Hoboken (2008)
11. Karlebach, G., Shamir, R.: Modelling and analysis of gene regulatory networks. Nat. Rev. Mol. Cell Biol. 9(10), 770–780 (2008)
12. Khalis, Z., Comet, J.P., Richard, A., Bernot, G.: The SMBioNet method for discovering models of gene regulatory networks. Genes Genomes Genomics 3(1), 15–22 (2009)
13. Naldi, A., Berenguier, D., Fauré, A., Lopez, F., Thieffry, D., Chaouiya, C.: Logical modelling of regulatory networks with GINsim 2.3. Biosystems 97(2), 134–139 (2009)
14. Richard, A., Comet, J.P., Bernot, G.: Formal methods for modeling biological regulatory networks. Mod. Formal Methods Appl. 5, 83–122 (2006)
15. Richard, A., Comet, J.-P., Bernot, G.: Formal methods for modeling biological regulatory networks. In: Gabbar, H.A. (ed.) Modern Formal Methods and Applications, pp. 83–122. Springer, Dordrecht (2006). https://doi.org/10.1007/1-4020-4223-X_5
16. Saeed, M.T., Ahmad, J., Kanwal, S., Holowatyj, A.N., Sheikh, I.A., Paracha, R.Z., Shafi, A., Siddiqa, A., Bibi, Z., Khan, M., et al.: Formal modeling and analysis of the hexosamine biosynthetic pathway: role of O-linked N-acetylglucosamine transferase in oncogenesis and cancer progression. PeerJ 4, e2348 (2016)
17. Saeed, T., Ahmad, J.: A parallel approach for accelerated parameter identification of gene regulatory networks
18. Tareen, S.H.K., Ahmad, J., Roux, O.: Parametric linear hybrid automata for complex environmental systems modeling. Front. Environ. Sci. 3, 47 (2015)
19. Thieffry, D., Thomas, R.: Dynamical behaviour of biological regulatory networks – II. Immunity control in bacteriophage lambda. Bull. Math. Biol. 57(2), 277–297 (1995)
20. Thomas, R.: Boolean formalization of genetic control circuits. J. Theor. Biol. 42(3), 563–585 (1973)
21. Thomas, R.: Logical analysis of systems comprising feedback loops. J. Theor. Biol. 73(4), 631–656 (1978)
22. Thomas, R.: Regulatory networks seen as asynchronous automata: a logical description. J. Theor. Biol. 153(1), 1–23 (1991)

Author Index

Abou Tabl, Ashraf I-166 Abouelhoda, Mohamed I-405 Acosta, Antonio I-96 Adebiyi, Marion O. I-290 Adetiba, Emmanuel I-290, II-266 Ahmad, Jamil I-543 Ahmaidi, Said II-59, II-128 Akanle, Matthew B. I-290, II-266 Akinrinmade, Adekunle II-266 Al Meer, Hossam I-71 Alberca Díaz-Plaza, Ana II-135 Albuquerque Nogueira, Romildo I-57 Al-Harbi, Najla I-197 Ali, Amjad I-543 Aljafar, Hussain I-405 Alkhateeb, Abedalrhman I-166, I-343 Alnakhli, Yasser I-405 Alpala-Alpala, Luis Omar II-231 Alpar, Orcan II-243, II-255 Alsbeih, Ghazi I-197 Alshammari, Sultanah M. II-430 Álvarez, Óscar I-211 Alvarez-Machancoses, Óscar II-15, II-24, II-33 Alvarez-Uribe, Karla C. II-289 Alves, P. I-329 Alwan, Abir II-69, II-128 Aminian, Kamiar II-75 Angles, Renzo I-235 Anjum, Ashiq I-405 Antal, Peter I-41 Antelis, Javier Mauricio I-129 Antsiperov, Viacheslav II-383 Anya, Obinna II-456 Arciniegas-Mejia, Andrés F. I-96 Arenas, Mauricio I-235 Arrais, Joel P. I-221 Arroyo, Macarena I-373 Årsand, Eirik II-443 Arzalluz-Luque, Angeles I-364 Ayoub, Marie-Louise II-59 Bachratá, Katarína I-279 Bachratý, Hynek I-279

Badejo, Joke A. I-290, II-266 Bárta, Antonín I-3, I-107, I-139 Basile, Teresa M. A. I-302 Bastidas Torres, D. I-26 Batista, Cassio I-415 Bautista, Rocío I-373 Becerra, Miguel Alberto I-26, II-289 Becerra-Botero, Miguel A. II-231 Bekkhozhayeva, Dinara I-139 Bekkozhayeova, Dinara I-3 Bekkozhayeva, Dinara I-107 Bellotti, Roberto I-302 Benjelloun, Mohammed II-198 Bensmail, Halima I-71 Bereksi Reguig, Fethi II-311 Berro, Abdel-Jalil II-59, II-69, II-128 Betrouni, Nacim II-409 Beuscart, Regis II-421 Blanco Valencia, X. I-26 Blazek, Pavel I-352 Bobrowski, Leon I-153 Bodin, Oleg N. II-325 Bohiniková, Alžbeta I-259 Bohoyo, Pablo de Miguel II-135 Bottigli, Ubaldo I-302 Boudet, Samuel II-421 Boyano, Maria Dolores I-511 Bozhynov, Vladyslav I-3, I-107, I-139 Bringas, Carlos I-511 Bruncsics, Bence I-41 Canakoglu, Arif I-270 Caon, Maurizio II-75 Cárdenas-García, Maura I-83 Castellanos-Domínguez, Cesar Germán II-231 Castro-Ospina, Andrés Eduardo I-26, I-96 Cekas, Elingas II-49 Cernea, Ana I-211, II-15, II-24, II-33 Cerqueira, Fabio R. I-383 Chen, Bolin I-247 Chernikov, Anton I. II-325 Chovanec, Michal I-279 Cimrák, Ivan I-259



Císař, Petr I-139 Claros, M. Gonzalo I-373 Coelho, Edgar D. I-221 Coimbra, P. I-329 Conti, Massimo II-347 Cortet, Bernard II-59 Cruz-Cruz, Lilian Dayana II-231 Cyran, Norbert I-139 da Silva, José Radamés Ferreira I-57 da Silveira, Carlos H. I-383 Dandekar, Thomas I-395 de l’Aulnoit, Agathe Houzé II-421 de l’Aulnoit, Denis Houzé II-421 de Melo-Minardi, Raquel C. I-383 De Paolis, Lucio Tommaso II-118 deAndrés-Galiana, Enrique J. II-15, II-24, II-33 Demailly, Romain II-421 Dentamaro, Rosalba I-302 Devailly, Guillaume I-364 Dey, Kashi Nath II-397 Díaz, Gloria M. II-106 Díaz-Del-Pino, Sergio I-177 Díaz-del-Pino, Sergio I-450 Didonna, Vittorio I-302 Dobruch-Sobczak, Katarzyna II-186 Drisis, Stylianos II-198 El Hage, Rawad II-59, II-69 El Khoury, César II-59, II-69 El Khoury, Georges II-59 El-Kalioby, Mohamed I-405 ElMaraghy, Waguih I-166 Esseiva, Julien II-75 Estella, Francisco II-176 Fanizzi, Annarita I-302 Faquih, Tariq I-405 Fassio, Alexandre V. I-383 Fathi, Madjid I-531 Fausto, Alfonso I-302 Fayad, Ibrahim II-69 Feng, Xuan I-461 Fernández-Martínez, Juan Luis I-211, II-15, II-24, II-33 Fernández-Muñiz, Zulima I-211, II-15, II-24, II-33 Fernández-Ovies, Francisco Javier II-15, II-24, II-33

Ferreira da Rocha, Adson II-95 Fethi, Bereksi-Reguig II-301 Fostier, Jan I-439 Fotouhi, Ali I-429 Frenn, Fabienne II-128 Gaiduk, Maksym II-347, II-371 García, Pedro I-96 García-Rojo, Marcial II-276 Gardeazabal, Jesus I-511 Garzon, Max H. I-486 Geng, Yu I-473 Genin, Michael II-421 Gezsi, Andras I-41 Ghosh, Anupam II-397 Gil, M. H. I-329 Golcuk, Guray I-270 González, Jesús II-276 González, Ramón E. R. I-57 González-Castaño, Catalina II-231 González-Pérez, Pedro Pablo I-83 Guan, Xin II-3 Gudnason, Kristinn I-329 Guiomar, A. J. I-329 Guta, Gabor I-41 Hadj Henni, Abderraouf II-409 Hage, Rawad El II-128 Hamzeh, Osama I-343 Hargaš, Libor II-163 Hartvigsen, Gunnar II-443 Helsper, Julien I-531 Hernández, Luis Guillermo I-129 Holthausen, Ricardo I-177, I-450 Hu, Huan I-461 Imbajoa-Ruiz, David Esteban Isik, Zerrin I-520 Jablončík, František II-163 Janusas, Giedrius II-49 Jonsdottir, Fjola I-329 Joshi, Anagha I-364 Judia, Sara Bin I-197 Jung, Matthias I-531 Kajánek, František I-279 Karwat, Piotr II-186 Khaled, Omar Abou II-75 Kirimtat, Ayca II-212, II-221


Author Index

Klimonda, Ziemowit II-186 Kloczkowski, Andrzej I-211 Koniar, Dušan II-163 Korbicz, Józef II-151 Kovalčíková, Kristína I-259 Kowal, Marek II-151 Kramm, Mikhail N. II-325 Krejcar, Ondrej I-352, II-212, II-221, II-243, II-255 Kuca, Kamil I-352 Külekci, M. Oğuzhan I-429 Kuonen, Pierre I-395 Kupriyanova, Yana A. II-325

Matthews, Victor O. I-290 Mazza Guimaraes, Isabelle I-259 Megrez, Nasreddine I-71 Meier, Klaus II-359 Mikler, Armin R. II-430 Miranda, Fábio I-415 Monaco, Alfonso I-302 Monczak, Roman II-151 Morais, Jefferson I-415 Mora-Jiménez, Inmaculada II-135 Moschetta, Marco I-302 Mugellini, Elena II-75 Muñoz-Minjares, Jorge II-85

La Forgia, Daniele I-302 Lachkar, Abdelmonaime I-314 Larhmam, Mohamed Amine II-198 Larrosa, Rafael I-373 Leber, Isabel II-335 Legarreta, Leire I-511 Li, Yang I-473 Litniewski, Jerzy II-186 Liu, Li II-3 Loncová, Zuzana II-163 López, Vanessa I-15 Lopez-Chamorro, Fabián M. I-96 López-Delis, Alberto II-95, II-106 López-Rodríguez, Carmen María I-373 Losurdo, Liliana I-302 Lozano, Beatriz II-176 Luo, Ping I-247

Nabil, Dib II-301 Nadia, Ouacif II-301 Nasiri, Sara I-531 Nasr, Riad II-69 Neto, Nelson I-415 Ngom, Alioune I-166

Maalouf, Ghassan II-69, II-128 Maarouf, Haitham I-15 Madrid, Natividad Martínez II-335, II-347, II-359 Mahmoudi, Saïd II-198 Majid, Salma I-197 Majidi, Mina I-429 Malaina, Iker I-511 Mall, Raghvendra I-71 Malyshev, Andrew I-186 Mansurov, Gennady II-383 Martinez de la Fuente, Ildefonso I-511 Martínez, Diego I-15 Martínez, Efraín I-129 Martinez, Luis I-511 Massafra, Raffaella I-302 Matta, Joseph II-128

555

Oliveira, José Luís I-221 Oller, Josep M. I-501 Olugbara, Oludayo O. I-290 Orcioni, Simone II-347 Ortega, Juan Antonio II-371 Ortega, Julio II-276 Palevicius, Arvydas II-49 Pasquier, David II-409 Patel, YatinkumarRajeshbhai II-49 Peluffo-Ordóñez, Diego Hernán I-26, I-96, II-231, II-289 Penzel, Thomas II-371 Perez–Chimal, R. J. II-85 Pérez-Wohlfeil, Esteban I-177, I-450 Peyrodie, Laurent II-421 Pham, Duy T. I-486 Piñeros Rodriguez, C. I-26 Pinti, Antonio II-59, II-69, II-128, II-421 Piotrzkowska-Wróblewska, Hanna II-186 Polanska, Joanna I-197 Popescu, Ondina I-302 Popova, Tatiana II-85 Rais, Mohammed I-314 Ramos, Rommel I-415 Ramos-López, Javier II-135 Redouane, Benali II-301

556

Author Index

Reverter, Ferran I-501 Rizkallah, Maroun II-128 Rodríguez-Brazzarola, Pablo I-177, I-450 Rojas, Fernando II-176 Rojas, Ignacio II-176 Romanelli, João P. R. I-383 Rønningen, Ida Charlotte II-443 Rosero-Montalvo, Paul D. I-96 Roy, Sukriti II-397 Rubio-Sánchez, Manuel II-135 Rueda, Luis I-166, I-343 Ruíz-Olaya, Andrés F. II-95, II-106 Saddik, Hayman II-69, II-128 Sadovsky, Michael I-186 Saeed, Muhammad Tariq I-543 Saha, Sujay II-397 Saiz, Antonio II-176 Salazar-Castro, Jose Alejandro II-231 Saligan, Leorey II-15, II-24, II-33 Sánchez, Alberto II-135 Sanchez-Morillo, Daniel II-276 Santamarta, Elena II-176 Santana, Charles A. I-383 Schiro, Jessica II-421 Sedjelmaci, Ibticeme II-311 Seepold, Ralf II-325, II-347, II-371 Seijo, Fernando II-176 Senashova, Maria I-186 Shah, Zeeshan Ali I-405 Shmaliy, Yuriy S. II-85 Shokrof, Moustafa I-405 Sigurdsson, Sven I-329 Silva, Artur I-415 Silveira, Sabrina de A. I-383 Skobel, Marcin II-151 Slavík, Martin I-259, I-279 Smiešková, Monika I-279 Sobrido, Maria J. I-15 Soguero-Ruiz, Cristina II-135 Sonis, Stephen T. II-15, II-24, II-33 Souček, Pavel I-3, I-107, I-139

Suarez, Esther II-176 Subhani, Shazia I-405 Taboada, Maria I-15 Tahar, Omari II-301 Taiwo, Tunmike B. I-290 Tamborra, Pasquale I-302 Tangaro, Sabina I-302 Tawfik, Hissam II-456 Thies, Christian II-359 Tian, Li-Ping I-247 Tobiasz, Joanna I-197 Tolan, Ertan I-520 Trelles, Oswaldo I-177, I-450 Tuncel, Mustafa Anil I-270 Ullah, Ehsan I-71 Umaquinga-Criollo, Ana Cristina II-231 Urban, Jan I-118 Urbanová, Pavla I-3, I-107, I-139 Vegas, Esteban I-501 Vieira, A. P. I-329 Volák, Jozef II-163 Vunderl, Bruno II-371 Walzer, Thomas II-359 Wang, Jiayin I-461, I-473 Watelain, Eric II-69 Wolf, Beat I-395 Wu, Fang-Xiang I-247 Xiao, Qianghua I-247 Xiao, Xiao I-473 Zabielski, Paweł I-153 Zakhem, Eddy II-59 Železný, Miloš I-139 Zhang, Xuanping I-461, I-473 Zhao, Zhongmeng I-461, I-473 Zheng, Tian I-473 Zhikhareva, Galina V. II-325 Zhuravleva, Natalija A. II-325