Visualization

Xiaogang Ma, Department of Computer Science, University of Idaho

Published as: Ma, X., 2018. Visualization. In: Schintler, L.A., McNeely, C.L. (eds.) Encyclopedia of Big Data. Springer, Cham, Switzerland. In Press. http://dx.doi.org/10.1007/978-3-319-32001-4_202-1

Synonyms
Data Visualization; Information Visualization; Visual Representation

Introduction
People use visualization for information communication. Data visualization is the study of creating visual representations of data, which carries two levels of meaning: the first is to make information visible, and the second is to make it obvious, so it is easy to understand. Visualization is pervasive throughout the data life cycle, and a recent trend is to promote its use within data analysis rather than only as a way to present results. Community standards and open source libraries set the foundation for visualization of Big Data, and domain expertise and creative ideas are needed to put those standards into innovative applications.

Visualization and Data Visualization
Visualization, in its literal meaning, is the procedure of forming a mental picture of something that is not present to the sight (Cohen et al. 2002). People can also illustrate such mental pictures on various visible media, such as paper and computer screens. Seen as a way to facilitate information communication, the meaning of visualization can be understood at two levels. The first is to make something visible, and the second is to make it obvious, so it is easy to understand (Tufte 1983). Daily experience shows that graphics are easier to read and understand than words and numbers; consider the use of maps in automotive navigation systems to show the location of an automobile and the road to the destination. This experience is confirmed by scientific findings. Studies on visual object perception explain the difference between reading graphics and reading text or numbers: the human brain deciphers image elements simultaneously but decodes language in a linear, sequential manner, and the linear process takes more time than the simultaneous one.

Data are representations of facts, and information is the meaning worked out from data. In the context of Big Data, visualization is a crucial method for meeting the considerable need to extract information from data and present it. Data visualization is the study of creating visual representations of data. In practice, data visualization means visually displaying one or more objects through the combined use of words, numbers, symbols, points, lines, color, shading, coordinate systems, and more. While there are various choices of visual representation for the same piece of data, a few general guidelines can be applied to establish effective and efficient data visualization. The first is to avoid distorting what the data have to say. That is,

the visualization should not give a false or misleading account of the data. The second is to know the audience and serve a clear purpose. For instance, the visualization can be a description of the data, a tabulation of the records, or an exploration of the information that is of interest to the audience. The third is to make large datasets coherent; careful design is required to present the data and information in an orderly and consistent way. The presidential, Senate, and House elections of the United States have been reported with well-presented data visualization, such as that on the website of The New York Times. The visualization on that website is underpinned by dynamic datasets and can show the latest records as they arrive.

Visualization in the Data Life Cycle
Visualization is crucial in the process from data to information. However, information retrieval is just one of many steps in the data life cycle, and visualization is useful throughout the whole cycle. In conventional understanding, a data life cycle begins with data collection and continues with cleansing, processing, archiving, and distribution. Those steps are from the perspective of data providers. From the perspective of data users, the life cycle then continues with data discovery, access, analysis, and repurposing. From repurposing, the life cycle may return to the collection or processing step, restarting the cycle. Recent studies add a concept step before data collection, which covers work such as conceptual, logical, and physical models for relational databases, and ontologies and vocabularies for Linked Data in the Semantic Web.

Visualization, or more specifically data visualization, supports different steps in the data life cycle. For example, the Unified Modeling Language (UML) provides a standard way to visualize the design of information systems, including the conceptual and logical models of databases. Typical relationships in UML include Association, Aggregation, and Composition at the instance level; Generalization and Realization at the class level; and general relationships such as Dependency and Multiplicity. For ontologies and vocabularies in the Semantic Web, concept maps are widely used for organizing the concepts in a subject domain and the interrelationships among those concepts. In this way a concept map is the visual representation of a knowledge base. Concept maps are more flexible than UML because they cover all the relationships defined in UML while also allowing people to create new relationships that fit the domain at hand (Ma et al. 2014).
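At its core, a concept map of this kind can be treated as a set of subject-predicate-object statements. The following minimal Python sketch (the concept and relationship names echo the Report/Publication example discussed in this entry; the `derived_from` link is a hypothetical custom relationship of the kind concept maps permit beyond UML) shows how such a knowledge base can be stored and queried:

```python
# A concept map as a set of (subject, predicate, object) triples.
# Mixes UML-like relationships (is_subclass_of, has_component) with a
# custom, domain-specific one (derived_from).
concept_map = {
    ("Report", "is_subclass_of", "Publication"),
    ("Report", "has_component", "Chapter"),
    ("Chapter", "has_component", "Figure"),
    ("Figure", "derived_from", "Dataset"),
}

def neighbors(concept):
    """Return the (predicate, object) pairs leaving a concept node."""
    return sorted((p, o) for s, p, o in concept_map if s == concept)

print(neighbors("Report"))
# [('has_component', 'Chapter'), ('is_subclass_of', 'Publication')]
```

A visual concept-map editor essentially renders such triples as labeled nodes and arrows; the flexibility comes from the predicate being an open vocabulary rather than a fixed set.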
For example, there are concept maps for the ontology of the Global Change Information System led by the U.S. Global Change Research Program. Those concept maps can show that Report is a subclass of Publication and that a report has several components, such as Chapter, Table, Figure, Array, and Image. Recent work in information technologies also enables online visual tools that capture and explore the concepts underlying collaborative science activities, which greatly facilitates collaboration between domain experts and computer scientists. Visualization is also used to facilitate data archiving, distribution, and discovery. For instance, the Tetherless World Constellation at Rensselaer Polytechnic Institute recently developed the International Open Government Dataset Catalog, a Web-based faceted browsing and search interface that helps users find datasets of interest. A facet represents part of the properties of a dataset, so faceted classification allows the assignment of a dataset to multiple taxonomies,

and then datasets can be classified and ordered in different ways. On the user interface of a data center, the faceted classification can be visualized as a number of small windows and options, which allows the data center to hide the complexity of data classification, archiving, and search on the server side.
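The mechanics behind those small windows can be sketched in a few lines. In this hypothetical example (the catalog records and facet names are invented for illustration), each facet is one property of a dataset, the interface displays the available values with counts, and selecting values from several facets at once narrows the result list:

```python
# A toy dataset catalog; each record carries several facetable properties.
datasets = [
    {"title": "Sea surface temperature", "format": "netCDF", "topic": "ocean",  "year": 2016},
    {"title": "Census blocks",           "format": "CSV",    "topic": "social", "year": 2016},
    {"title": "Coastal bathymetry",      "format": "netCDF", "topic": "ocean",  "year": 2014},
]

def facet_values(records, facet):
    """Values available for one facet, with counts (what the UI would display)."""
    counts = {}
    for r in records:
        counts[r[facet]] = counts.get(r[facet], 0) + 1
    return counts

def filter_by(records, **facets):
    """Keep only the records matching every selected facet value."""
    return [r for r in records if all(r[k] == v for k, v in facets.items())]

print(facet_values(datasets, "format"))             # {'netCDF': 2, 'CSV': 1}
print(filter_by(datasets, format="netCDF", year=2016))
```

Because every facet is just another property, the same record participates in many taxonomies at once, which is exactly what makes faceted classification more flexible than a single fixed hierarchy.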

Visual Analytics
The pervasive presence of visualization in the data life cycle shows that visualization can be applied broadly in data analytics. Yet in actual practice visualization is often treated as a method to show the result of data analysis rather than as a way to enable interaction between users and complex datasets. That is, the visualization as a result is separated from the datasets from which it was generated. Many of the data analysis and visualization tools scientists use nowadays do not allow dynamic, live links between visual representations and datasets; when a dataset changes, the visualization is not updated to reflect the change. In the context of Big Data, many socioeconomic challenges and scientific problems facing the world are increasingly linked to interdependent datasets from multiple fields of research, organizations, instruments, dimensions, and formats. Interaction is becoming an inherent characteristic of data analytics with Big Data, which requires new methodologies and technologies of data visualization to be developed and deployed. Visual analytics is a field of research that addresses this need for interactive data analysis. It combines many existing techniques from data visualization with those from computational data analysis, such as statistics and data mining. Visual analytics is especially focused on integrating interactive visual representations with the underlying computational processes. For example, the IPython Notebook provides an online collaborative environment for interactive, visual data analysis and report drafting. The IPython Notebook stores each notebook as a JavaScript Object Notation (JSON) document that contains a sequential list of input/output cells. There are several types of cells for different contents, such as text, mathematics, plots, and code, and even rich media such as video and audio.
Users can design a workflow of data analysis through the arrangement and updating of cells in a notebook. A notebook can be shared with others as a normal file, or it can be shared publicly using online services such as the IPython Notebook Viewer. A completed notebook can be converted into a number of standard output formats, such as HyperText Markup Language (HTML), HTML presentation slides, LaTeX, Portable Document Format (PDF), and more. The conversion takes only a few simple operations, so once a notebook is complete, a user only needs to press a few buttons to generate a scientific report. The notebook can be reused to analyze other datasets, and the cells inside it can also be reused in other notebooks.
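The notebook-as-JSON-document idea can be made concrete with a small sketch. This is a simplified picture of the structure, not the full notebook schema (the real format requires further fields, such as per-notebook and per-cell metadata), but it shows why a notebook can be shared, versioned, and converted like any other file: it is plain JSON holding an ordered list of cells.

```python
import json

# A simplified notebook: a JSON document with a sequential list of cells.
notebook = {
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": "# Analysis of sample data"},
        {"cell_type": "code",
         "source": "total = sum([1, 2, 3])\nprint(total)",
         "outputs": []},
    ],
}

# Round-trip through JSON text, as saving and reloading a notebook file would.
serialized = json.dumps(notebook)
restored = json.loads(serialized)
print([c["cell_type"] for c in restored["cells"]])   # ['markdown', 'code']
```

Because cells are just list entries, reordering an analysis workflow or reusing a cell in another notebook amounts to moving dictionary objects between lists.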

Standards and Best Practices
Any application of Big Data will face the challenges posed by the four dimensions of Big Data: volume, variety, velocity, and veracity. Commonly accepted standards and community consensus are a proven way to reduce the heterogeneity among the datasets at hand. Various standards have already been used in applications tackling scientific, social, and business issues,

such as the aforementioned JSON for transmitting data with human-readable text, Scalable Vector Graphics (SVG) for two-dimensional vector graphics, and GeoJSON for representing collections of georeferenced features. There are also organizations coordinating work on community standards. The World Wide Web Consortium (W3C) coordinates the development of standards for the Web; SVG, for example, is an output of the W3C. Other W3C standards include the Resource Description Framework (RDF), the Web Ontology Language (OWL), and the Simple Knowledge Organization System (SKOS), many of which are used for data in the Semantic Web. The Open Geospatial Consortium (OGC) coordinates the development of standards for geospatial data. For example, the Keyhole Markup Language (KML) was developed for presenting geospatial features in Web-based maps and virtual globes such as Google Earth. The Network Common Data Form (netCDF) was developed for encoding array-oriented data. Most recently, GeoSPARQL was developed for encoding and querying geospatial data in the Semantic Web. Standards provide only the initial elements for data visualization; domain expertise and novel ideas are needed to put standards into practice (Fox and Hendler 2011). For example, Google Motion Chart adopts the fresh idea of motion charts to extend traditional static charts, and the aforementioned IPython Notebook allows the use of several programming languages and data formats through its cells. There are various programming libraries developed for data visualization, and many of them are available on the Web. D3.js is a typical example of such open source libraries (Murray 2013). The name D3 stands for Data-Driven Documents. It is a JavaScript library that uses digital data to drive the creation and behavior of interactive graphics in Web browsers. D3.js-based visualization commonly uses JSON as the input data format and SVG as the output graphics format.
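To make the GeoJSON standard concrete, the sketch below builds a minimal GeoJSON Feature in Python, where it is an ordinary dictionary that round-trips through JSON text. The coordinates and place name are illustrative values only; a GeoJSON point is written as [longitude, latitude]:

```python
import json

# A minimal GeoJSON Feature: a geometry plus free-form properties.
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-116.999, 46.732]},
    "properties": {"name": "Moscow, Idaho"},
}

# Serialize and parse again, as a Web client exchanging GeoJSON would.
text = json.dumps(feature)
parsed = json.loads(text)
lon, lat = parsed["geometry"]["coordinates"]
print(parsed["geometry"]["type"], lon, lat)   # Point -116.999 46.732
```

Because GeoJSON is plain JSON, the same document can feed a D3.js map in the browser, a Python analysis script, and a Web API without format conversion, which is precisely the interoperability that community standards aim for.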
The OneGeology data portal provides a platform to browse geological map services across the world, using standards developed by both the OGC and the W3C, such as SKOS and the Web Map Service (WMS). GeoSPARQL is a relatively new standard for geospatial data, but there are already featured applications. The demo system of Dutch Heritage and Location shows the linked open dataset of the National Cultural Heritage, with more than 13 thousand archaeological monuments in the Netherlands. Besides GeoSPARQL, GeoJSON and a few other standards and libraries are also used in that demo system.
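A sense of what GeoSPARQL queries look like may help here. The sketch below holds one as a Python string: it asks for monuments whose geometry falls within a bounding polygon. The `geo:` and `geof:` prefixes are the standard GeoSPARQL namespaces; the data-side terms (`ex:Monument`, the polygon coordinates) are hypothetical and stand in for whatever vocabulary a heritage dataset would actually use:

```python
# A GeoSPARQL query sketch: spatial filtering over RDF data.
query = """
PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX ex:   <http://example.org/heritage#>

SELECT ?monument ?wkt
WHERE {
  ?monument a ex:Monument ;
            geo:hasGeometry ?g .
  ?g geo:asWKT ?wkt .
  FILTER(geof:sfWithin(?wkt,
    "POLYGON((4.7 52.2, 5.2 52.2, 5.2 52.5, 4.7 52.5, 4.7 52.2))"^^geo:wktLiteral))
}
"""
print("geof:sfWithin" in query)   # True
```

The query would be sent to a GeoSPARQL-capable triple store; the point of the standard is that the spatial filter (`geof:sfWithin` over a WKT literal) works the same way regardless of which endpoint hosts the data.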

Cross-references
Data Visualization; Interactive Data Visualization; Pattern Recognition; Data-Information-Knowledge-Action Model.

References
Cohen, L., Lehericy, S., Chochon, F., Lemer, C., Rivaud, S., Dehaene, S. (2002). Language-specific tuning of visual cortex? Functional properties of the Visual Word Form Area. Brain, 125(5), 1054-1069.

Fox, P., Hendler, J. (2011). Changing the equation on scientific data visualization. Science, 331(6018), 705-708.

Ma, X., Fox, P., Rozell, E., West, P., Zednik, S. (2014). Ontology dynamics in a data life cycle: Challenges and recommendations from a geoscience perspective. Journal of Earth Science, 25(2), 407-412.

Murray, S. (2013). Interactive Data Visualization for the Web. Sebastopol, CA: O'Reilly.

Tufte, E. (1983). The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press.