Big Data in Healthcare: A Survey

Muhammad Mashab Farooqi, Munam Ali Shah, Abdul Wahid, Adnan Akhunzada, Faheem Khan, Noor ul Amin, and Ihsan Ali

M. M. Farooqi · M. A. Shah · A. Wahid · A. Akhunzada
Department of Computer Science, COMSATS Institute of Information Technology, Islamabad, Pakistan
e-mail: [email protected]; [email protected]

F. Khan · N. ul Amin
Department of Computer Science, Bacha Khan University Charsadda, Charsadda, Pakistan

I. Ali
Department of Computer Systems and Technology, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
e-mail: [email protected]

© Springer Nature Switzerland AG 2019
F. Khan et al. (eds.), Applications of Intelligent Technologies in Healthcare, EAI/Springer Innovations in Communication and Computing, https://doi.org/10.1007/978-3-319-96139-2_14

1 Introduction

A very large and complex collection of raw facts and figures that is very difficult to process using ordinary database management systems is called Big Data [1]. Dedicated tools, techniques and procedures are needed to create, manipulate and manage such very large data sets. Big Data is not only large in size; it is also complex, heterogeneous, noisy, longitudinal and voluminous [2]. The challenges that organizations face in handling Big Data are capturing, searching, storing and analysing the data. The five dimensions of Big Data (5 V's) are volume, variety, velocity, veracity and value [3, 4]. The main reason for the growth in the complexity and abundance of data is that medical practice is moving towards evidence-based healthcare. Another reason is the development of new technologies and tools such as mobile applications, capturing devices and sensors, which collect huge amounts of data that must be stored somewhere. Big Data also leads to increased patient social communication [5]: patients can make appointments online, search their records online and check the availability of a doctor online. Handling a large amount of data efficiently is a challenge for the organization.

In Big Data, the traditional host- or service-based model is shifting towards a data-centric architecture and model [6]. The components of the Big Data Architecture Framework are Big Data infrastructure, data structures and models, Big Data analytics, Big Data lifecycle management and Big Data security [7, 8]. Technologies such as cloud computing are necessary to provide a platform for automating all processes in data collection, storage, processing and visualization. By ensuring that the patient gets the most effective treatment, better care quality and efficiency can be achieved in healthcare [2, 9]. However, these new technologies have not only made the data huge and complex but have also made it difficult to handle and process because of its unstructured nature. Moreover, data is recorded from different devices such as sensors and smartphones within a short period of time. These data are stored in different formats, which is a further challenge in Big Data. In [9], Health-CPS, built on cloud and Big Data analytics technologies, is used by some health organizations to provide a more convenient healthcare service and environment.

To become responsive, healthcare organizations need to be agile, and they can achieve agility by optimizing their operations. Some healthcare organizations have adopted service-oriented architecture (SOA) and business process management (BPM), which can provide a flexible, dynamic and cloud-ready infrastructure [10]. Big Data analytics helps to understand large amounts of data, categorize them, predict outcomes before they happen and suggest treatments [11]. This paper reviews the characteristics of Big Data, the tools and techniques, the challenges and limitations, and the architectures used by healthcare organizations.

2 Characteristics of Big Data in Healthcare

Many researchers have studied Big Data in healthcare, and the field presents many challenges, prospects and solutions. The characteristics of Big Data are known as its dimensions, and they are discussed briefly in this section. Many authors characterize Big Data by 5 V's [3, 4, 12]: volume, variety, velocity, variability and veracity. The number of dimensions has grown over time; as of 2017, 42 V's have been identified [13]. These 42 V's are summarized in Table 1, and the number of dimensions will continue to increase as Big Data develops further. The generic concept of Big Data is portrayed in Fig. 1.

Table 1 Characteristics of Big Data: the 42 V's

Characteristics   Meaning
Volume            Size of data generated in healthcare
Velocity          The speed at which data is generated
Variability       The change in data semantics, data structure, data format and data rate. Evolving behaviour in data source
Variety           Diverse source of data from which different sources data is produced
Veracity          The quality of data that is produced, data accuracy
Value             Useful data
Validity          Data quality, governance, data management on massive
Varnish           Interaction of end-users with our work matters, and polish counts
Vault             Importance of data security
Veil              Examine latent variables from behind the curtain
Verdict           People affected by model's decision
Versed            Knowledge about: mathematics, statistics, programming, etc.
Vet               Vetting the assumptions with evidence
Viability         Difficult to build robust models
Victual           Big Data fuel of data science
Visibility        Data science provides visibility into complex data problems
Vivify            Ability of data science to cope with every real-life aspect
Voice             Ability to speak with knowledge
Volatility        Data should never be missing
Voyage            Increasing knowledge
Varifocal         It is about trees and forest
Venue             No customer workstation in cloud (logically)
Version control   You are using it right?
Vexed             Potential of data science to handle complicated problems
Vibrant           Provision of insight by data science
Viral             How fast data spreads
Virtuosity        Craze to get more knowledge about Big Data
Visualization     The way customer interacts with models
Vogue             Artificial intelligence will become?
Voodoo            Deliver results with real-world impact
Vulpine           Data leads to a new technology
Vagueness         Confusion about meaning of Big Data and tools that are used
Valour            Tackle the big problem in face of Big Data
Vane              Unclear direction of decision-making
Vanilla           Simple methods if tackled with care, can provide value
Vantage           Privileged view of complex systems
Varmint           As data gets bigger, so do software bugs
Vastness          Bigness of Big Data
Vaticination      Ability to forecast
Veer              Change direction according to customer need
Venue             Distributed heterogeneous data from multiple platforms
Vocabulary        Data models, semantics that describe the data


Fig. 1 Big Data concept. The diagram places Big Data at the centre, surrounded by ten dimensions: volume (size of data), velocity (the speed at which data is generated), variability (dynamic, evolving behaviour in data source), variety (different types of data), veracity (data accuracy), value (useful data), validity (data quality, governance, master data management on massive data), venue (distributed heterogeneous data from multiple platforms), vocabulary (data models and semantics that describe data structure) and vagueness (confusion over the meaning of Big Data and the tools used)

3 Tools and Techniques for Analysing Big Data

With the advancements in information and communication technology, healthcare data has grown from petabytes to exabytes and continues to increase [4]. At this growth rate, it is difficult to handle such large amounts of data using traditional data architectures and models. Health organizations have two options: they can use either open-source solutions or the commercial solutions available [14]. Some of the established tools, technologies and platforms for analysing Big Data are compared in Table 2.

4 Challenges and Limitations

The emerging field of Big Data poses many challenges, limitations and issues as healthcare data increases [15]. While the industry benefits from the advantages of Big Data, security and privacy are the main problems, because threats and vulnerabilities keep increasing [16]. Some of the challenges discussed by researchers are presented in this section. Figure 2 shows Big Data in the health cloud.

Table 2 Comparative analysis of tools

Tool: Google BigQuery
  Platform: Cloud-based and open-source platform
  Type of database: Columnar database
  Advantages: Replicates data across varied data centres
  Limitations: Indexes are not supported by Google BigQuery

Tool: Hadoop
  Platform: Cloud-based and open-source platform
  Type of database: Non-relational database
  Advantages: Stores data and structure
  Limitations: No security and technical support

Tool: MapReduce
  Platform: Cloud-based and open-source platform
  Type of database: Non-relational database
  Advantages: Supports both unstructured and semi-structured data
  Limitations: No indexing capability

Tool: Jaql
  Platform: Proprietary query language
  Type of database: Query language for JavaScript Object Notation (JSON)
  Advantages: Supports both structured and semi-structured data
  Limitations: No user-defined types; schema information is only a constraint on the possible values of a domain

Tool: Microsoft Windows Azure
  Platform: Cloud-based and open-source platform
  Type of database: Relational database
  Advantages: Can run relational queries against structured, unstructured and semi-structured data
  Limitations: Very large databases are not feasible on Windows Azure because of its restricted size


Fig. 2 Big Data healthcare cloud

4.1 Data Governance

Data governance is the regulation and management of data. As the healthcare industry moves towards healthcare analytics, data governance is a big challenge [16]. The data generated in healthcare is heterogeneous in nature, and it requires standardization and governance.
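To make the standardization requirement concrete, the following is a minimal sketch that maps records from two differently formatted sources onto one common schema. The source names (`ward_monitor`, `wearable`), the field names and the `Observation` type are hypothetical and chosen only for illustration; they do not come from any system described in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Common target schema (hypothetical) for a body temperature reading."""
    patient_id: str
    body_temp_celsius: float
    source: str


def from_ward_monitor(record: dict) -> Observation:
    # Assumed format: Celsius under "temp_c", patient key "pid".
    return Observation(str(record["pid"]), float(record["temp_c"]), "ward_monitor")


def from_wearable(record: dict) -> Observation:
    # Assumed format: Fahrenheit under "temperatureF", patient key "patient".
    celsius = (float(record["temperatureF"]) - 32.0) * 5.0 / 9.0
    return Observation(str(record["patient"]), round(celsius, 1), "wearable")


# One adapter per source keeps format-specific rules in a single place.
ADAPTERS = {"ward_monitor": from_ward_monitor, "wearable": from_wearable}


def standardize(tagged_records):
    """Convert (source_name, raw_record) pairs into the common schema."""
    return [ADAPTERS[source](record) for source, record in tagged_records]


if __name__ == "__main__":
    raw = [
        ("ward_monitor", {"pid": 7, "temp_c": "37.2"}),
        ("wearable", {"patient": "7", "temperatureF": 98.6}),
    ]
    for obs in standardize(raw):
        print(obs)
```

Keeping one adapter per source means that a governance rule, such as a required unit or identifier format, is enforced at the point where data enters the common store.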

4.2 Security Analytics

Predicting and analysing security threats is of the highest priority in the growing healthcare industry. The healthcare industry has suffered security attacks ranging from stealthy malware to distributed denial-of-service (DDoS) attacks. Social engineering attacks are also on the rise, and they are very difficult to predict. As the healthcare industry depends more and more on healthcare technologies to make better-informed decisions, security analytics will be the primary focus of any design for a cloud-based SaaS solution that hosts protected health information (PHI) [16]. With security analytics, healthcare IT providers can address dangers and threats in real time and mitigate them before they affect the healthcare system.


4.3 Ethical and Moral Challenges

Ethical challenges include data privacy, control of access to patients' data, confidentiality and the effective exchange of data between patients. Because of the increase in data volume, the integration of data from various sources becomes a challenge, and accessing a patient's data in an appropriate way becomes an issue. Context sensitivity is one of the ethical challenges that Big Data in healthcare faces today, and it is important to ensure it. The context of Big Data in healthcare differs in significant ways from other kinds of Big Data activities. A context-sensitive understanding may reveal that some data is not appropriate for use within corporate activity [17].

4.4 Security and Privacy Issues

Security and privacy issues are another challenge facing the integration of diverse sources of healthcare information. Healthcare data and mobile ad hoc networks are open to security threats such as inappropriate access to patients' data and unauthorized use of patient data. Healthcare providers therefore face security and privacy issues, and they fear sharing health information through electronic healthcare systems. Privacy deals with legal and ethical restrictions, and the question arises of which privacy issues must be taken care of [18]. A privacy policy is about implementing procedures for the authorization and permission of specific functions [19].
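The idea of authorizing specific functions can be illustrated with a small role-based permission check. This is a sketch under stated assumptions: the roles, the permissions and the `read_record` helper are hypothetical and are not taken from [19] or from any particular healthcare system.

```python
from enum import Enum


class Permission(Enum):
    READ_RECORD = "read_record"
    WRITE_RECORD = "write_record"


# Hypothetical policy: which role may perform which function.
ROLE_PERMISSIONS = {
    "physician": {Permission.READ_RECORD, Permission.WRITE_RECORD},
    "nurse": {Permission.READ_RECORD},
    "researcher": set(),
}


def is_authorized(role: str, permission: Permission) -> bool:
    """Return True only if the role is known and holds the permission."""
    return permission in ROLE_PERMISSIONS.get(role, set())


def read_record(role: str, records: dict, patient_id: str) -> dict:
    """Return a patient record, or raise PermissionError if not allowed."""
    if not is_authorized(role, Permission.READ_RECORD):
        raise PermissionError(f"role {role!r} may not read patient records")
    return records[patient_id]


if __name__ == "__main__":
    store = {"p1": {"name": "example patient", "blood_group": "O+"}}
    print(read_record("nurse", store, "p1"))
    try:
        read_record("researcher", store, "p1")
    except PermissionError as err:
        print("denied:", err)
```

Unknown roles receive no permissions by default, so a misconfigured or missing role fails closed rather than open.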

5 Discussion on Big Data Architecture

Big Data analytics plays an essential role in anticipating emergency situations before they occur. Big Data analytics uses Hadoop [1] for real-time analysis of enormous amounts of data. Data analytics is essential because, without proper analysis techniques, such data is useless. Hadoop is an open-source Apache Software Foundation project written in Java. It has two main components, HDFS and the MapReduce programming framework, which are closely related to each other [20]. In this approach, cost is reduced through effective analysis of the data. With the help of machine learning algorithms, patterns in the data and the relationships between them are analysed, which helps in making valuable decisions. The author worked on Big Data generation, data characteristics, security concerns in Big Data and how Big Data analytics aids in reaching valuable decisions. The factors discussed for improving the quality of healthcare are providing patient-centric services, detecting spreading diseases earlier, monitoring the hospital's quality and enhancing treatment techniques.
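The MapReduce programming model mentioned above can be sketched in plain Python as three phases: map, shuffle and reduce. This is an in-memory illustration of the model only, not Hadoop's API; the record layout and the diagnosis codes are made up for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record: dict):
    """Emit one (diagnosis_code, 1) pair per diagnosis in a patient record."""
    for code in record["diagnoses"]:
        yield code, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    patients = [
        {"diagnoses": ["E11", "I10"]},
        {"diagnoses": ["I10"]},
        {"diagnoses": []},
    ]
    print(run_job(patients))
```

In a real cluster the map and reduce functions run on many nodes over blocks stored in HDFS; here everything runs in one process so that the data flow is easy to follow.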


The Big Data lifecycle involves data collection, data cleaning, data classification, data modelling and data delivery. Security is the main concern when Big Data is processed in a distributed environment. The main challenges are to provide network-level security, to authenticate users, nodes and applications in the distributed environment, and to identify malicious hackers. The author has done some work on a secured layered architecture for Big Data. Network-level security is provided by using SSL for RPC communication between distributed nodes. Static and dynamic data can be handled by two-way communication. An attribute-based encryption method is used when transmitting data between nodes to protect the data from malicious users. The author in [21] discussed the problems that arise in HDFS because it is designed for the storage of large files. Files of up to 5 MB are considered small files. Current file systems, including distributed file systems, local file systems and object-based storage systems, are all designed for handling large files; for example, XFS, GPFS and HDFS all mainly target large files. Performance therefore degrades when small files are processed through HDFS. Storing small files increases the internal storage space required. MapReduce also needs more tasks to process a large amount of data spread over small files, which increases CPU overhead. The author proposed merging small files into a large file before loading it into HDFS to remove the problem. Storing many small files in large files reduces the number of files and increases the efficiency of data retrieval. It also decreases the pressure on the disk file system.
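The merging idea described above can be sketched as follows: small files are concatenated into one large blob, and an index records the offset and length of each original file so that it can still be retrieved individually. This is an illustrative in-memory sketch, not the scheme proposed in [21] and not an HDFS API; the function names and the index layout are assumptions.

```python
import io
from typing import Dict, Tuple

Index = Dict[str, Tuple[int, int]]  # file name -> (offset, length)


def merge_small_files(files: Dict[str, bytes]) -> Tuple[bytes, Index]:
    """Concatenate small files into one blob and build an offset index."""
    buffer = io.BytesIO()
    index: Index = {}
    for name in sorted(files):  # sorted for a deterministic layout
        data = files[name]
        index[name] = (buffer.tell(), len(data))
        buffer.write(data)
    return buffer.getvalue(), index


def read_from_merged(merged: bytes, index: Index, name: str) -> bytes:
    """Retrieve one original file from the merged blob via the index."""
    offset, length = index[name]
    return merged[offset:offset + length]


if __name__ == "__main__":
    small = {"ecg_001.csv": b"t,mv\n0,0.1\n", "note_17.txt": b"stable"}
    blob, idx = merge_small_files(small)
    print(len(blob), idx)
    print(read_from_merged(blob, idx, "note_17.txt"))
```

Only the single merged blob would then be written to the distributed file system, so the file system tracks one large file instead of many small ones, while the index keeps individual retrieval cheap.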
Healthcare data is increasing day by day; its volume, velocity, variability, variety and veracity [22] are major problems, so it is difficult to handle large amounts of data without well-planned and efficient tools and techniques. The author of [23] presents a novel Big Data framework for healthcare applications. Apache Spark is used as the Big Data analytics framework, but it cannot function alone; supporting components such as Hadoop Core and the JDK are required. The author adopts the Spring Framework because it is fault tolerant. The Spring Framework is a set of extensions to the Java programming language, and it is modular in nature (Fig. 3).

6 Conclusion

Big Data refers to large volumes of data stored at different locations. This data has very high velocity, is complex and variable, and requires advanced tools and techniques for its creation, storage, maintenance and governance. As the volume of data increases, handling Big Data raises many challenges, such as security and privacy issues, unethical use of data, poor data quality, and the storage and maintenance of the data. The models and architectures used for Big Data analysis are very difficult to implement. Despite the numerous benefits of Big Data, factors such as resistance to moving from traditional practice to ICT, as well as security challenges, are the main hindrances to the effective adoption of Big Data in healthcare.

Fig. 3 Generic architecture of Big Data. The diagram shows four stages: Big Data sources (internal and external; multiple formats, multiple locations, multiple applications); Big Data transformation (raw data; middleware; extract, transform, load; data warehouse; traditional formats such as CSV and tables; transformed data); Big Data platforms and tools (Hadoop, MapReduce, Pig, Hive, Jaql, Zookeeper, HBase, Cassandra, Oozie, Avro, Mahout and others); and Big Data analytics applications (queries, reports, OLAP, data mining)

References

1. Khan, R., Khan, S. U., Zaheer, R., & Khan, S. (2012). Future internet: The internet of things architecture, possible applications and key challenges, Proc. – 10th Int. Conf. Front. Inf. Technol. FIT 2012, pp. 257–260.
2. Yu, Y., Wang, J., & Zhou, G. (2010). The exploration in the education of professionals in applied Internet of Things Engineering, ICDLE 2010–2010 4th Int. Conf. Distance Learn. Educ. Proc., pp. 74–77.
3. Da Xu, L., He, W., & Li, S. (2014). Internet of things in industries: A survey. IEEE Transactions on Industrial Informatics, 10(4), 2233–2243.
4. Gubbi, J., Buyya, R., Marusic, S., & Palaniswami, M. (Sep. 2013). Internet of things (IoT): A vision, architectural elements, and future directions. Future Generation Computer Systems, 29(7), 1645–1660.
5. Ghosh, A., & Das, S. K. (2010). Coverage and connectivity issues in wireless sensor networks: A survey. Pervasive and Mobile Computing, 4(3), 303–334.
6. Wang, F., & Yuan, H. (2010). Challenges of the sensor web for disaster management. International Journal of Digital Earth, 3(3), 260–279.
7. Sookhak, M., et al. (2015). Remote data auditing in cloud computing environments: a survey, taxonomy, and open issues. ACM Computing Surveys (CSUR), 47(4), 65.
8. Abdelaziz, A., et al. (2017). Distributed controller clustering in software defined networks. PloS One, 12(4), e0174715.
9. Jia, X., Feng, Q., Fan, T., & Lei, Q. (2012). RFID technology and its applications in Internet of Things (IoT), 2012 2nd Int. Conf. Consum. Electron. Commun. Networks, pp. 1282–1285.
10. Armbrust, M., Fox, A., Griffith, R., Joseph, A., & Katz, R. H. (2010). Above the clouds: A Berkeley view of cloud computing, Univ. California, Berkeley, Tech. Rep. UCB, pp. 7–13.
11. Icu, D. L. & Icu, H. L. (2011, March). Efficient Novel Anti-collision Protocols for Passive RFID Tags, no.
12. Wattegama, C. (2014). ICT for disaster management. Bangkok: UNDP-APDIP.
13. Chen, Z., Li, Z., Liu, Y., Li, J., & Chen, J. (2011). Quasi real-time evaluation system for seismic disaster based on internet of things, Proc. – 2011 IEEE Int. Conf. Internet Things Cyber, Phys. Soc. Comput. iThings/CPSCom 2011, pp. 520–524.
14. Ma, Y., Liu, X., Li, X., Sun, Y., & Li, X. (2011). Rapid assessment of flood disaster loss in Sind and Punjab province, Pakistan based on RS and GIS, 2011 Int. Conf. Multimed. Technol. ICMT 2011, no. Figure 1, pp. 646–649.
15. Liu, J., Wen, J., Yang, K., Shang, Z., & Zhang, H. (2011). GIS-based analysis of flood disaster risk in LECZ of China and population exposure, Proc. – 2011 19th Int. Conf. Geoinformatics, Geoinformatics 2011, no. 40471028, pp. 0–3.
16. Seal, V., Raha, A., Maity, S., Mitra, S. K., Mukherjee, A., & Naskar, M. K. (2012). A real time multivariate robust regression based flood prediction model using polynomial approximation for wireless sensor network based flood forecasting systems (pp. 432–441). Berlin Heidelberg: Springer.
17. Ahmad, N., Hussain, M., Riaz, N., Subhani, F., Haider, S., Alamgir, K. S., & Shinwari, F. (2013). Flood prediction and disaster risk analysis using GIS based wireless sensor networks, a review. Journal of Basic and Applied Scientific Research, 3(8), 632–643.
18. Sulaiman, N. A., Husain, F., Hashim, K. A., & Samad, A. M. (2012). A study on flood risk assessment for Bandar Segamat sustainability using remote sensing and GIS approach, in 2012 IEEE Control and System Graduate Research Colloquium, pp. 386–391.
19. Dawod, G. M., & Koshak, N. A. (2011). Developing GIS-based unit hydrographs for flood Management in Makkah Metropolitan Area, Saudi Arabia. Journal of Geographic Information System, 03(02), 160–165.
20. Akar, Î., Kalkan, K., & Maktav, D. (2011). Determination of land use effects on flood risk by using integration of GIS and remote sensing, Recent Adv.
21. Al-Jabari, S., Sharkh, M., & Mimi, Z. (2010). Estimation of runoff for agricultural watershed using SCS curve number and GIS.
22. Sherief, Y. (2010). Flash floods and their effects on the development in El-Qaá plain area in South Sinai, Egypt, Diss. PhD dissertation, University of Mainz, Germany.
23. Fang, S., Xu, L., Zhu, Y., Liu, Y., Liu, Z., Pei, H., Yan, J., & Zhang, H. (2015). An integrated information system for snowmelt flood early-warning based on internet of things. Information Systems Frontiers, 17(2), 321–335.