Data Science vs Big Data @ UTM Big Data Centre

2015 International Conference on Science in Information Technology (ICSITech)

Siti Mariyam Shamsuddin, Shafaatunnur Hasan 1,2

1 UTM Big Data Centre, Ibnu Sina Institute for Scientific and Industrial Research, Universiti Teknologi Malaysia, 81310 Skudai Johor

2 Faculty of Computing, Universiti Teknologi Malaysia, 81310 Skudai Johor

Email: [email protected]; [email protected]

Abstract—The big data tsunami has recently hit Malaysia, awakening the industry and academic communities to aggressively address the insight, hindsight and foresight challenges, so that Malaysia can be among the top world players in the big data information economy over the next decade. The rapid development of Information and Communication Technology (ICT) in this era is significant because the number of users accessing data keeps growing over time. This phenomenon has been coined big data. What is big data? We regard big data as an asset that needs a unique platform to deal with the bizarre behavior of datasets whose size is beyond the ability of typical data storage to manage, mine and analyze. This bizarre behavior has three main characteristics: volume, velocity and variety, which call for new architectures, techniques, algorithms and analytics to uncover the golden, hidden knowledge from information obesity. From these perspectives, we describe our experience in setting up our Data Science/Big Data platform, algorithms and tools to align with big data plug and play within the academic environment, as well as our services to the community and industry.

Keywords—data science; big data; big data analytics & platform

I. BIG DATA: OVERVIEW, DEFINITION AND CHALLENGES

Big data is a rather new term, as indicated by [1]; it is nevertheless surprising that there is still no clear or uniform definition of it. Opinion leaders and companies working in the field each have their own opinions and definitions of big data. What is clear, however, is that big data embodies an accumulation of different technologies under the heading “Big Data”. Big data has a significant impact on economic and social well-being, and the most developed regions, such as Europe, have the biggest potential to create value through its use [2]. In Information and Communication Technology (ICT), big data refers to data collected in large quantities, together with complex data transfers and transactions, that need a good data management system or application to process, because current data management systems have difficulty handling them. The challenges include data storage, which needs large database capacity; data capture and sharing, which require high-speed infrastructure; and data visualization and analysis, which require good management tools. With the high volume of data to be collected and the large amount of data traffic, a new kind of information system is required to process the data, enhance decision making and management, and optimize performance. Billions of data transactions and streams come from devices worldwide, and one of the challenges for current data management is to serve them without data loss, reduced throughput or added latency.

A lot of academic research is being conducted in the field of big data, ranging from applications and tools to techniques and architecture. The research is interdisciplinary in nature and is generally called data science. Data scientists are being trained at undergraduate and postgraduate levels by universities to address big data manpower needs. Industry and business have gone far in designing big data solutions that suit their needs, and governments across the globe are launching big data initiatives. In view of the divergent interest in big data from distinct domains, a clear and innate appreciation of its definition, advancement, constituent technologies and challenges becomes paramount [3].

Among the challenges in the big data flavor and recipe are:

a. A lack of skilled data scientists who can collect, analyze and process big data for companies. Among the big data ingredients, advanced analytics are needed for deep knowledge digging and mining of the information obesity in dynamic data storage.

b. A sophisticated big data platform is needed for the big data environment, so investment in big data technology is expensive. This will be a barrier for companies to scale up to big data analytics unless the cost can be compensated for by integrating big data analytic technology with traditional technology.

c. A lack of willingness to share data as big data analytics becomes widespread. Nations, companies and government agencies may be unwilling to share a pool of information that could be useful for the entire world population. Policies are yet to be formulated on how to handle the data and information about people that are gathered in big data.

978-1-4799-8386-5/15/$31.00 ©2015 IEEE


II. DATA SCIENCE VS BIG DATA

Malaysia needs 1,500 data scientists by 2020. It is therefore our role to assist the government in achieving this target by producing high-quality data scientists. The distinction between data science and big data is given here for better understanding and visualization. From our perspective, data science is the extraction of knowledge from data gathered through field work, lab experiments, design of experiments (DOE) or other means that contribute to data collection. Data analytics (DA), in turn, is the granulation of raw data with the purpose of drawing conclusions about the information it contains. The intersection of these two attributes gives rise to the term Data Scientist (Fig. 1). How do these terms relate to big data analytics?

Fig. 1. Data Scientist

We define Big Data Analytics as data digging from information obesity (infobesity) to present a high-quality information diet and to provide a new information economy (infonomics), i.e., the right information at the right time to enable managers to make informed business decisions. The correlation between data science and big data is illustrated in Fig. 2. Good data scientists will not just address business problems; they will pick the right problems that have the most value to the organization. With a feasible big data platform, good data scientists can apply their expertise in advanced analytics more quickly, leading to better-informed decision making in the organization.

III. GPU BASED HADOOP CLUSTER @ UTM BIG DATA

We have successfully set up a Cloudera Hadoop cluster at our centre, at feasible cost, on a GPU platform, establishing the Hadoop Distributed File System (HDFS) and MapReduce. Hadoop is open source software written in Java and developed under the Apache Software Foundation. It is designed for storing and processing very large quantities of data, and it provides an open source implementation of Google's MapReduce. It is centered on a simple programming model known as MapReduce [7]. A few benchmark functions have been run to stress test our cluster. The TestDFSIO benchmark is a read and write test for HDFS. It is helpful for tasks such as stress testing HDFS, discovering performance bottlenecks in our network, and shaking out the hardware, OS and Hadoop setup of our cluster machines (particularly the NameNode and the DataNodes). This test provides an initial impression of how fast our cluster is in terms of I/O. Figure 3 illustrates the I/O execution time for our cluster.

Fig. 3. I/O Execution Time
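The MapReduce programming model on which Hadoop is centered can be illustrated with a small word count. The following is a minimal single-process sketch in Python, not Hadoop code and not the configuration of our cluster; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only to mirror the three stages of the model.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key.

    In a real framework this grouping happens between the map and
    reduce tasks; here it is a plain in-memory dictionary.
    """
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data needs big platforms", "data science meets big data"]
    print(word_count(sample))
```

On a cluster, the map and reduce functions run in parallel on the nodes that hold the data blocks, which is why the I/O behaviour of HDFS measured by a benchmark such as TestDFSIO matters for overall job time.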

IV. DATA SCIENCE/BIG DATA @ UTM BIG DATA CENTRE

The notions of Data Science and Big Data Analytics have been used extensively, and various agencies, including companies and institutions of higher learning, have taken large-scale awareness of these terms seriously. For this reason, UTM Big Data Centre has been set up as a centre of research excellence in Universiti Teknologi Malaysia, engaging in the areas of Big Data Analytics, Deep Learning, GPU-based Machine Learning, CUDA Programming, Predictive Analytics and Soft Computing Solutions. Our big data ecosystem involves the deployment of Hadoop technology as our platform for multi-source input of various types, whether structured, semi-structured or unstructured (Fig. 4). Since we were recently awarded GPU Research Centre status by NVIDIA Corporation, our post-analytics engine runs on the GPU platform. On this platform, we develop in-house deep learning and machine learning algorithms for advanced analytics of various data science and big data problems [4].

Fig. 2. Correlations between Data Science and Big Data under big data platform

To address the need for data scientists and big data analysts, our Centre offers two (2) professional certificates, in Data Science and in Big Data Analytics, with the following objectives:

a. To provide a high performance computing environment for advanced machine learning and business intelligence methods for data science and big data problems.

b. To intersect theory and industrial applications with data science, big data analytics and big data engineering in a high performance environment.

c. To produce data analysts, data scientists and data engineers for national wealth creation.

d. To enable data analysts, data scientists, data engineers and managers to make informed business decisions intelligently.

The Professional Certificate in Data Science mainly addresses professionals, engineers and students who have knowledge of basic sciences, arts and humanities, and other fields such as mathematics, statistics, biology, chemistry, physics, social sciences, retail and business. The modules cover exploring, digesting, ingesting and analysing data for scientific and information revenues in their respective domains. The Professional Certificate in Big Data Analytics provides in-depth knowledge of the big data environment, which involves multiple requirements, advanced processing, methods and analytics for solving big data problems. These certificates are supported by the Multimedia Development Corporation (MDeC), Malaysia, and NVIDIA Corporation (visit our website: smartdigitalcommunity.utm.my/utmbigdata/).

Fig. 4. Big Data Ecosystem @ UTM Big Data Centre

An example of our data science projects is the social network analytics of the missing flight MH370. The aim of this study is to analyze the mesoscopic and macroscopic features of that community, using social network analysis to generate a series of social networks that represent the different network communities. The dataset was collected during March and April 2014 from online sources (such as YAHOO! News Malaysia, CNN.com, the Economic Times, The New Indian Express, India Today and the Daily Express). Information was gathered by surfing hundreds of web pages that addressed Flight MH370, from when it was announced missing by Malaysia Airlines on 8 March 2014 until late April, when the international efforts to find the wreckage of the missing plane declined.

Online sources have addressed this topic from different perspectives, such as (a) telling the story of the missing aircraft, how it disappeared from radar and the possible spots of its current location, (b) providing short personal profiles of some of the passengers on board, (c) showing how the relatives of the passengers have been dealing with this issue and (d) picturing the efforts made by the international community to locate the missing plane. From our analytics, there are three larger components within the MH370 network: the largest is the Artists component (29 vertices), and the other two are the Freescale Semiconductor component (14 nodes) and the Aircraft Crew component (12 nodes) (for details refer to [5]). Other data science projects include premium claim analytics, which models the claim frequency and claim severity components of the data in order to forecast motor insurance claims accurately; anomaly detection in stock trading, which uses an intelligent immune anomaly approach to digest big stock trading data and return the spike signal of bizarre data behavior; retail analytics; and others.

V. CONCLUSION

A data centre has to serve a high volume of data transactions every day, so a big data platform is required to address the insufficient data storage and processing capacity of some providers. Big Data @ UTM Big Data Centre is an integration of Hadoop technology on a GPU platform for pre-data analytics and post-data analytics. With GPU processors, the challenges of data collection, such as data storage and the need for large database capacity, and data capture and sharing, have been solved, since the GPU provides a high-speed infrastructure and supports data visualization and analysis for good management tools. With the high volume of data to be collected and the large amount of data traffic, our new information system is able to enhance decision making and optimize management system performance. In the future, we will set up wireless 5G technology, in a mutual research collaboration with the Wireless Communication Centre (WCC), UTM, to allow our data centre management to gain a better understanding of the key operating conditions in our server room for better performance and services [6].
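The component sizes reported for the MH370 network in Section IV are the kind of result that a connected-components pass over a social graph produces. The following is a minimal sketch in Python, assuming an undirected edge list; it is not the method of [5], and the toy edges and node names are invented for illustration and are not taken from the MH370 dataset.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple


def connected_components(edges: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Return the connected components of an undirected graph, largest first."""
    # Build an adjacency list; every edge is stored in both directions.
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen: Set[str] = set()
    components: List[Set[str]] = []
    for start in list(adjacency):
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Hypothetical toy graph: three separate groups of 4, 3 and 2 nodes.
TOY_EDGES = [
    ("a1", "a2"), ("a2", "a3"), ("a3", "a4"),
    ("c1", "c2"), ("c2", "c3"),
    ("p1", "p2"),
]

if __name__ == "__main__":
    for component in connected_components(TOY_EDGES):
        print(len(component), sorted(component))
```

Ranking the components by size, as done in the last line of the function, is what allows the largest groups in a network to be singled out and labelled.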


ACKNOWLEDGMENT

This work is supported by Universiti Teknologi Malaysia under Flagship Project (Q.J130000.2428.02G38; Q.J130000.2428.03G17; Q.J130000.2428.02G50). The authors would like to thank the Research Management Centre (RMC), Universiti Teknologi Malaysia (UTM) for the support in R & D, and the Soft Computing Research Group (SCRG) for the inspiration in making this study a success.

REFERENCES

[1] Google. (2014). Google trends. Retrieved 01-10, 2014, from http://www.google.com/trends/explore#q=big%20data.

[2] McKinsey Global Institute. (2012). Big Data: The next frontier for innovation, competition and productivity.

[3] H. Hu, Y. Wen, T.-S. Chua, and X. Li, “Toward Scalable Systems for Big Data Analytics: A Technology Tutorial,” IEEE Access, vol. 2, pp. 652–687, 2014.

[4] Shafaatunnur Hasan, Siti Mariyam Shamsuddin & Noel Lopes (2014). Machine Learning Big Data Framework for Big Data Problems. International Journal of Advances in Soft Computing and Its Applications, 6(2): 1-14.

[5] Mohammed Z. Al-Taie, Siti Mariyam Shamsuddin & Nor Bahiah Ahmad. Flight MH370 Community Structure. International Journal of Advances in Soft Computing and Its Applications, 6(2): 15-34.

[6] Evizal Abdul Kadir, Siti Mariyam Shamsuddin & Tharek Abdul Rahman (2015). Big Data Network Architecture and Monitoring Using Wireless 5G Technology. International Journal of Advances in Soft Computing and Its Applications, 7(1): 1-14.

[7] Murthy, Arun, et al. (2013). Apache Hadoop YARN: Moving Beyond MapReduce and Batch Processing with Apache Hadoop 2. Pearson Education.