Yesterday, Today and Tomorrow of Big Data (PDF Download Available)

10 downloads 160209 Views 415KB Size Report
Official Full-Text Publication: Yesterday, Today and Tomorrow of Big Data on ResearchGate, the professional network for scientists.
Available online at www.sciencedirect.com

ScienceDirect Procedia - Social and Behavioral Sciences 195 (2015) 1042 – 1050

World Conference on Technology, Innovation and Entrepreneurship

Yesterday, Today and Tomorrow of Big Data Hakan Özkösea,*(PLQ6HUWDo$UÕa, Cevriye Gencerb a

Gazi UniYHUVLW\.DYDNOÕGHUH$QNDUD, Turkey b Gazi University, Maltepe, Ankara, Turkey

Abstract Owing to the self-improvement desire, the human being always tries to reach to the current information and generate new ones from the data on hand. The practices are realized by processing and transforming the data, whose existence is broadly accepted, into information. Generating information from data is vitally important in terms of regulating the life. Especially firms need to store and transform data quickly and properly into information in order to achieve the objectives such as having a competitive edge, producing new products, moving the firm ahead and stabilizing the internal dynamics. The increase in the amount of data sources also increases the amount of the data acquired. Therefore storing and processing data become difficult and classical approaches remain incapable to do such transactions. By means of Big Data large amount of data with a wide range can be stored, managed and processed. Besides Big Data ensures proper information quickly and offers advantage and convenience to the firms, researchers and consumers by taking the properties of Volume, Value, Variety, Veracity and Velocity into consideration. This study consists of 5 parts. In the Introduction part the features, classification, the process, the areas of usage and the techniques of Big Data are explained. In the second part the appearance process and the advantages of the concept of Big Data are illustrated with examples. A detailed literature review is produced in the third part. The actual studies and the most interested areas of Big Data are told in this part. In the fourth part the future of the Big Data is evaluated. Besides the situation and distribution of the studies on Big Data in Turkey and all over the world is presented. In the Conclusion part, an overall assessment is included and probable troubles are mentioned. © byby Elsevier Ltd.Ltd. This is an open access article under the CC BY-NC-ND license © 2015 2015The TheAuthors. Authors.Published Published Elsevier (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of Istanbul University. Peer-review under responsibility of Istanbul Univeristy. Keywords: Big data, data, information

*

Corresponding author. E-mail address: [email protected]

1877-0428 © 2015 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of Istanbul Univeristy. doi:10.1016/j.sbspro.2015.06.147

Hakan Özköse et al. / Procedia - Social and Behavioral Sciences 195 (2015) 1042 – 1050

1043

1. Introduction Big Data has various definitions in the literature. Some of those are specified below: Big Data is the amount of data beyond the ability of technology to store, manage and process efficiently (Manyika et.al, 2011). Big Data is a term which defines the hi-tech, high speed, high-volume, complex and multivariate data to capture, store, distribute, manage and analyze the information (TechAmerica Foundation, 2014). Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization (Gartner, 2014; Gürsakal, 2014). Big Data Technologies are new generation technologies and architectures which were designed to extract value from multivariate high volume data sets efficiently by providing high speed capturing, discovering and analyzing (Gantz and Reinsel, 2011). Hashem et.al. define Big Data by combining various definitions in literature as follows: The cluster of methods and technologies in which new forms are integrated to unfold hidden values in diverse, complex and high volume data sets (Hashem et.al., 2015). As the definitions suggest, there are some points to take into consideration in big data sets. The data should be complex and multiple together with its size. Therefore conventional methods have difficulty in analyzing big data sets and new methods and technologies are needed. 1.1. Characteristics of Big Data Various studies in the literature show that big data has 3, 4 or 5 characteristics; 3 of whom are common at all: Volume, Velocity and Variety. Others are Veracity and Value (Hashem et.al, 2015; Elragal, 2014; Fadiya et.al., 2014; Yang et.al., 2014; López et.al., 2015). 5 characteristics of big data is shown in Figure 1.

Fig 1. 5 Characteristics of Big Data (Elragal, 2014)

These 5 characteristics are explained as follows (Hashem et.al, 2015; Elragal, 2014; Fadiya et.al., 2014; Yang et.al., 2014; López et.al., 2015): Volume: It is the most important characteristic of big data. It represents the size of the big data set. Variety: Various data come to the companies from numerous resources (internal or external). These data entries from separate resources cause variance in data set. External data are hardly ever structural. Velocity: The production rate of big data is notably high. The heavy increase in data means that the data should be analyzed more swiftly. The faster the data increases, the faster the need for the data increases; therefore the process shows increase as well.

1044

Hakan Özköse et al. / Procedia - Social and Behavioral Sciences 195 (2015) 1042 – 1050

Veracity: It is the accuracy of the data. The data should be acquired from correct resources and its security should be provided. Only authorized people should have the access permission. Value: A result should be generated after all of the procedures and the result should enrich the process. 1.2. Classification of Big Data The characteristic of big data can be understood better by dividing it into classes. These classes are Data Sources, Content Format, Data Stores, Data Staging and Data Processing (Hashem et.al, 2015). Data Sources : Web & Social, Machine, Sensing, Transactions and IoT : Structured, Semi-Structured and Unstructured Content Format Data Stores : Document-oriented, Column-oriented, Graph based and Key-value : Cleaning, Normalization and Transform Data Staging : Batch and Real time Data Processing 1.3.

Big Data Process

Big data process is shown below: x Data Management o Acquisition and Recording o Extraction, Cleaning and Annotation o Integration, Aggregation and Representation x Analytics o Modeling and Analysis o Interpretation Big data process is visualized by Gondomi and Haider as shown in Figure 2:

Fig. 2. Big Data Process (Gandomi and Haider, 2015)

1.4.

Usage Areas of Big Data

Big data is used efficiently in numerous fields. Some of them are listed below: x Automotive industry, x Travel and transport sector, x High technology and industry, x Financial services, x Oil and gas, x Social media and online services, x Telecommunication sector, x Public services, x Medical field, x Education and research, x Retail industry, x Health services, x Packaged consumer products, x Law enforcement and defense industry. x Media and show business,

Hakan Özköse et al. / Procedia - Social and Behavioral Sciences 195 (2015) 1042 – 1050

1045

1.5. Methods Used in Big Data 1.5.1. Text Analytics Text analytics is used for information retrieval from data. E-mails, blogs, online forums, news and call center records are all examples of text data. Text analytics involve machine learning, statistical analysis and computational linguistics. Text analytics enable to extract meaningful summaries from large scale data (Gandomi and Haider, 2015). Information Extraction, Text Summarization, Question Answering and Sentiment Analysis are some of the techniques used in text analytics. 1.5.2. Audio Analytics Audio analytics is used to extract information from unstructured audio data. Call centers and health services are commonly used utilization areas of audio analytics. Audio analytics can be used in numerous fields such as increasing the customer experience, the performance of customer representative and the sales rate; comprehending several tasks such as customer behaviors and the troubles of products (Gandomi and Haider, 2015). 1.5.3. Video Analytics Video analytics is the usage of various techniques to extract meaningful information, track and analyze video streams. Marketing and operations management is the main application area of video analytics (Gandomi and Haider, 2015). 1.5.4.

Social Media Analytics

Social media analytics is the analysis of the structured and unstructured data on the social media channels. Social media can be categorized as follows (Gandomi and Haider, 2015): x x x x x 1.5.5.

Social networks (Facebook, LinkedIn), Blogs (BlogSpot, WordPress), Microblogs (Twitter, Tumblr), Social news (Digg, Reddit), Social bookmarks (Delicious, StumbleUpon),

x x x x

Media sharing (Instagram, YouTube), Wiki (Wikipedia, Wikihow), Question-and-answer sites (Yahoo! Answers, Ask.com), Review sites (Yelp, TripAdvisor).

Predictive Analytics

Predictive analytics is based upon estimating future considering current or stale data. Predictive analysis is used to capture the relationships of data and discover the patterns. Predictive analytics which is primarily based on statistical methods, is highly applicable on many disciplines (Gandomi and Haider, 2015). 2.

Yesterday of Big Data

The extremities of big data go long way back, however it has lately been understood that most of those former studies were big data studies. For example in 1839 Matthew Fontaine Maury; the head of the Depot of Charts and Instruments of the U.S. Navy; collected data about the tides, winds and sea flows of the places he visited. There were numerous navigation books, maps and charts at the depot where he was working. The log books of the former voyages were also present there. There were lots of records about wind, water and air conditions in the log books. Maury realized that he could achieve a new voyage chart as he combines all of the data in hand. He generated new routes by utilizing them. He developed a standard form for U.S. Army battleships to expand his study and he enhanced the accuracy of the route information he had. Then he included merchant ships into his study and utilized

1046

Hakan Özköse et al. / Procedia - Social and Behavioral Sciences 195 (2015) 1042 – 1050

the log book data of them. As a result he passed on huge savings by cutting the durations across by a third. Since then he has been commemorated as “Pathfinder of the Seas” (Mayer-Schönberger and Cukier, 2013). In the academic field big data can be accepted as a new concept. The change in the interest in big data by years is shown in the Figure 3.

Fig. 3. The change in the interest in big data by years (Google Trends-Big Data, 2015)

As it is seen in the Figure 3, the interest in big data violently increases by year 2011. At the present time the search rate of big data is on its peak. Big data and data mining concepts are usually confused. Frawley et.al. define data mining as the discovery of the data which wasn’t known before and which has the makings of being useful (Frawley et.al., 1992). According to Dunham, it is detection of hidden data in the database (Dunham, 2006). Fayyad et.al. describe it as the application of specific algorithms to extract patterns from the dataset (Fayyad et.al., 1996). As it is understood from the definitions; even though big data and data mining have several steps in common, data mining doesn’t cover all properties of big data. The interest in big data increases day by day whereas the interest in data mining decreases. Google Trend Analysis gives us the comparative diagram in Figure 4.

Fig. 4. Comparison of the interests in Big Data and Data Mining (Google Trends-Big Data vs. Data Mining, 2015)

As it is seen in the Figure 4, it is obvious that the search field has been heading from Data Mining through Big Data. 3.

Today of Big Data

As it is mentioned before, the academic studies on big data have been increased by leaps and bounds since 2012 and the tendency through the studies on it increases day by day. There are numerous studies on big data in diverse fields. In this section the academic studies on big data are mentioned. The study of Xiang et.al. aimed to determine the experiences and satisfactions of hotel guests. Online customer testimonials and satisfaction rates on expedia.com were taken into consideration. After preprocessing, classification

Hakan Özköse et al. / Procedia - Social and Behavioral Sciences 195 (2015) 1042 – 1050

1047

method was applied to the data. Finally statistical relationship analysis was performed (Xiang et.al., 2015). Hashem et.al. studied diffusively on the usage of big data in cloud computing. They mentioned the mutual affinity of the two concepts in 5 case studies. After that they informed about big data storage systems. The importance of Map Reduce algorithm was also told within the study (Hashem et.al, 2015). Gandomi and Haider’s study gave general information about big data. A specific definition, the characteristics of big data and the analyses used in big data were included in this study (Gandomi and Haider, 2015). An “Arabic sentiment lexicon” was created in the text mining study of Mahyoub et.al. The study was inspired by WordNet; which was developed as an English dictionary data base (Mahyoub et.al., 2014). Elragal presented that ERP and big data could be combined, thus a strong platform could be constructed (Elragal, 2014). Du et.al. stated that real estate firms could provide a competitive advantage by using big data technologies (Du et.al., 2014). Ackerman and Angus did a big data study to visualize spatial and temporal IP mobility. They visualized the temporal IP mobility in Los Angeles, New York, London, Moscow, Tokyo and Melbourne. Moreover, they plotted the IP map of Australia and showed the hourly variation of Melbourne’s IP mobility (Ackerman and Angus, 2014). Young focused on HIV and the importance of big data studied to be safe from HIV. Young claimed that a new HIV monitoring system could be set up by analyzing social media networking, thus new ideas could be generated for early intervention and disease control. According to Young, big data was applicable to various fields (Young, 2015). Shin and Choi’s study was on ecology. In this sense, the socio-ecological effects of big data transactions; such as social dynamics, political rhetoric technological choices were analyzed. Korea’s big data studies were also mentioned and some hints and tips were offered for Korean big data entrepreneurs (Shin and Choi, 2015). Weichselbraun et.al. studied on “opinion mining” to enrich semantic knowledge. They gathered the data from Amazon (electronics and software) and IMDB (comedy, crime and drama). A quantitative evaluation was done with “sentiment analysis”. It was seen that the accuracy increased with contextualization and contextualization had a positive effect on recall and precision. Besides with concept grounding the concepts could be understood whether they were positive, negative or neutral; with word grounding the words could be understood whether they were synonyms, antonyms, emotions or explanations (Weichselbraun et.al., 2014). Quian et.al. first introduced “granular computing”. Then they defined hierarchical encoded decision table and discussed some criteria. Finally map reduce based hierarchical attribute reduction algorithm was suggested and the efficiency of the algorithm was tested with examples (Qian et.al., 2015). Jifa and Lingling explained the data, DIKW, big data and data science concepts and mentioned the relationship between them (Jifa and Lingling, 2014). Other than the academic studies firms like Google, Netflix, Amazon or Facebook carry on big data studies. Google uses big data in numerous fields and integrates it to its own work flow successfully. For example in 2009 due to the need for medical report the determination process of the diffusion area of H1N1 would take weeks; even months. However Google could have an idea about the area utilizing the storage of the search terms of the users. That meant that the area could be determined before the health organizations (MayerSchönberger and Cukier, 2013; Cook et.al., 2011). Google also presented another big data study; Google Flu Trends; which was helpful to analyze world-wide flu trends by using Google search terms (Google Flu Trends, 2015). Similarly by means of Search Google Trends, it is possible to learn the trending topics and their fluctuation (Search Google Trends, 2015). Firms like Facebook, Flickr, YouTube, Academia and The Marker need big data for link prediction. By this means it is possible to present network connections such as “You may also know…” or “You may also be interested…” (Fire et.al., 2011). Firms like eBay and Amazon use big data for offering systems. As the customer purchases an item, another item which has the possibility to be purchased is offered before the payment process considering the previous and current purchases. Thus the sales rates increase and various items which can be overlooked are served (Chen et.al., 2012).

1048

Hakan Özköse et al. / Procedia - Social and Behavioral Sciences 195 (2015) 1042 – 1050

4.

The Future of Big Data

The interest through big data increases day by day. The firms with the ability to store large amount data carry their work one step further and provide an advantage within the market. Firms like Google, Amazon, Facebook, YouTube and eBay have an advantage over the others in self-improvement and competition due to having a great quantity of data. It should not be forgotten that data processing and information retrieval is as important as data storage. Aforementioned firms which have the ability to store great quantity of data have also been highly successful in data processing and putting it into service and have become movers and shakers. Big data studies have been included in numerous fields. The importance of big data studies which have been used in a wide range of industries from automotive and communication to finance and health will increase in the future. As is seen from Figure 3, the interest in big data have increased day by day whereas the interest in data mining have diminished in importance by 2000s. Turkey have recently placed importance on big data academically. As is seen from Figure 5, the interest in big data in Turkey have just started in 2012 and shown increase over the years. The interest in big data is seen on its top in 2015 (Figure 5). Turkey have fallen behind the world in beginning the studies on big data.

Fig.5. The change of the interest by years in Turkey (Google Trends-Big Data, 2015)

$QNDUDDQGøVWDQEXODUHWKHOHDGLQJFLWLHVLQ7XUNH\RQSODFLQJ LPSRUWDQFHRQELJGDWDVWXGLHV and the mostly searched term is “What is big data” (Google Trends- The change of the interest by years in Turkey, 2015). As is seen from Figure 6, the most interested country in big data is India (Search volume index: 100). Singapore (Search volume index: 71) and South Korea (Search volume index: 57) follow India. Turkey has a relatively low search volume index considering the leading countries. The search volume index of Turkey is 4 (Google TrendsThe interest distribution of Big Data by countries and cities, 2015).

Fig. 6. The interest distribution of Big Data by countries (Google Trends- The interest distribution of Big Data by countries and cities, 2015)

Figure 7 shows the interest distribution of big data by cities. Chinchwad (Search volume index: 100) and Bangalore (Search volume index: 51) are the most interested countries in big data.

Hakan Özköse et al. / Procedia - Social and Behavioral Sciences 195 (2015) 1042 – 1050

1049

Fig. 7. The interest distribution of Big Data by cities (Google Trends- The interest distribution of Big Data by countries and cities, 2015)

Numerous searches have been made on big data. Some of them are big data analytics, Hadoop and big data (Google Trends- The interest distribution of Big Data by countries and cities, 2015). 5.

Conclusion and Recommendations

Several difficulties may show up in the acquisition, storage and processing of data. As the interest in big data increases, such difficulties will decrease or will be solved in shorter time. Turkey has fallen behind the world in terms of the academic interest and the academic studies in big data. There are plenty of academic studies in big data which can be taken by the researchers as a model. Data providers bear tremendous responsibility as much as the researchers in big data. The data providers which cannot process data will get harmed in terms of competition as they hide their data. Moreover, they prevent the data transform into knowledge. The data providers and the researchers should be in cooperation perspicuously based on the basis of trust. This will increase the reliability of the results and keep the firms advantageous in competition. Big firms like Facebook, Flickr, YouTube, Academia, The Marker, Google and Amazon have already gone towards big data because they all have broad visions. Those firms each hold its own market, expand their dominance at the same time ensure customer satisfaction. They keep their leadership and increase their market values day by day. As the studies in big data increase; technological developments and customer satisfaction will increase, diseases will be cured or precautions will be taken earlier and the general price level will decrease. References Ackermann, K., & Angus, S. D. (2014). A Resource Efficient Big Data Analysis Method for the Social Sciences: The Case of Global IP Activity. Procedia Computer Science, 29, 2360-2369. Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business Intelligence and Analytics: From Big Data to Big Impact. MIS quarterly, 36(4), 11651188. Cook, S., Conrad, C., Fowlkes, A. L., & Mohebbi, M. H. (2011). Assessing Google flu trends performance in the United States during the 2009 influenza virus A (H1N1) pandemic. PloS one, 6(8), e23610. Du, D., Li, A., & Zhang, L. (2014). Survey on the Applications of Big Data in Chinese Real Estate Enterprise. Procedia Computer Science, 30, 24-33. Dunham, M. H. (2006). Data mining: Introductory and advanced topics. Pearson Education India. Elragal, A. (2014). ERP and Big Data: The Inept Couple. Procedia Technology, 16, 242-249. Fadiya, S. O., Saydam, S., & Zira, V. V. (2014). Advancing big data for humanitarian needs. Procedia Engineering, 78, 88-95. Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI magazine, 17(3), 37. Fire, M., Tenenboim, L., Lesser, O., Puzis, R., Rokach, L., & Elovici, Y. (2011, October). Link prediction in social networks using computationally efficient topological features. 2011 IEEE International Conference on Privacy, Security, Risk, and Trust, and IEEE International Conference on Social Computing,73-80. IEEE.

1050

Hakan Özköse et al. / Procedia - Social and Behavioral Sciences 195 (2015) 1042 – 1050

Frawley, W. J., Piatetsky-Shapiro, G., & Matheus, C. J. (1992). Knowledge discovery in databases: An overview. AI magazine, 13(3), 57. Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137-144. Gantz, J., & Reinsel, D. (2011). Extracting value from chaos. IDC iview, (1142), 9-10. Gartner IT Glossary, What is Big Data?, URL: http://www.gartner.com/it-glossary/big-GDWD 6RQ(ULúLP7DULKL  Google Flu Trends. Flu Trends (07.04.2015). URL: https://www.google.org/flutrends/ Gürsakal, N., Büyük Veri, Dora Kitabevi, Bursa, 2014. Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The rise of “big data” on cloud computing: Review and open research issues. Information Systems, 47, 98-115. Jifa, G., & Lingling, Z. (2014). Data, DIKW, Big data and Data science. Procedia Computer Science, 31, 814-821. López, V., del Río, S., Benítez, J. M., & Herrera, F. (2015). Cost-sensitive linguistic fuzzy rule based classification systems under the MapReduce framework for imbalanced big data. Fuzzy Sets and Systems, 258, 5-38. Mahyoub, F. H., Siddiqui, M. A., & Dahab, M. Y. (2014). Building an Arabic Sentiment Lexicon Using Semi-Supervised Learning. Journal of King Saud University-Computer and Information Sciences, 26(4), 417-424. Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., ... & McKinsey Global Institute. (2011). Big data: The next frontier for innovation, competition, and productivity. Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt. Qian, J., Lv, P., Yue, X., Liu, C., & Jing, Z. (2015). Hierarchical attribute reduction algorithms for big data using MapReduce. Knowledge-Based Systems, 73, 18-31. Search Google Trends. Big Data (02.04.2015). URL: http://www.google.com.tr/trends/explore#q=Big%20Data Search Google Trends. Big Data vs Data Mining (02.04.2015). URL: http://www.google.com.tr/trends/explore#q=Big%20Data%2C%20Data%20Mining&cmpt=q&tz= Search Google Trends. Google Trends (07.04.2015). URL: http://www.google.com.tr/trends/?hl=tr Search Google Trends. The change of the interest by years in Turkey (08.04.2015). URL: https://www.google.com.tr/trends/explore#q=big%20data&geo=TR&date=1%2F2009%2076m&cmpt=q&tz= Search Google Trends. The interest distribution of Big Data by countries and cities (08.04.2015). URL: https://www.google.com.tr/trends/explore#q=big%20data&date=1%2F2009%2071m&cmpt=q&tz= Shin, D. H., & Choi, M. J. (2015). Ecological views of big data: Perspectives and issues. Telematics and Informatics, 32(2), 311-320. TechAmerica Foundation’s Federal Big Data Commission, Demystifying Bigdata: A Practical Guide To Transforming The Business Of Government, URL: http://www.techamerica.org/Docs/fileManager.cfm?f=techamerica-bigdatareport-ILQDOSGI 6RQ (ULúLP Tarihi:20.12.2014). Weichselbraun, A., Gindl, S., & Scharl, A. (2014). Enriching semantic knowledge bases for opinion mining in big data applications. KnowledgeBased Systems, 69, 78-85. Xiang, Z., Schwartz, Z., Gerdes, J. H., & Uysal, M. (2015). What can big data and text analytics tell us about hotel guest experience and satisfaction?. International Journal of Hospitality Management, 44, 120-130. Yang, C., Zhang, X., Zhong, C., Liu, C., Pei, J., Ramamohanarao, K., & Chen, J. (2014). A spatiotemporal compression based approach for efficient big data processing on Cloud. Journal of Computer and System Sciences, 80(8), 1563-1583. Young, S. D. (2015). A “big data” approach to HIV epidemiology and prevention. Preventive medicine, 70, 17-18.