AN INTEGRATED TOOLSET FOR EXPLORATION OF SPATIOTEMPORAL DATA USING SELF-ORGANIZING MAPS AND GIS

_______________

A Thesis Presented to the Faculty of San Diego State University

_______________

In Partial Fulfillment of the Requirements for the Degree Master of Science in Geography

_______________

by Martin A. Lacayo-Emery Summer 2011


Copyright © 2011 by Martin A. Lacayo-Emery All Rights Reserved


ABSTRACT OF THE THESIS

An Integrated Toolset for Exploration of Spatio-Temporal Data Using Self-Organizing Maps and GIS
by Martin A. Lacayo-Emery
Master of Science in Geography
San Diego State University, 2011

Technology has created a shift to an extremely data rich environment with significant increases in the resolution and storage of spatio-temporal attribute data. This challenges traditional analysis with exponential demands on computational resources. Data mining and knowledge construction address this issue by reducing the volume and complexity of data. However, these methods do not indicate the significance of their results and thus require the intervention of an expert. This means that solutions to the proliferation of data require both computer methods and human intelligence. The method of interest for this thesis is the self-organizing map (SOM), which creates a model of the topological structure within a dataset that can be used for further analysis. These models have some spatial qualities, allowing them to be combined with geographic information systems (GIS) to provide practical interactive solutions for the analysis and exploration of large high-dimensional datasets. This approach takes advantage of existing human and computer resources. A review of the literature and existing software reveals that while the combination of SOM and GIS is not unprecedented, the approach taken here is unique. This software is a plug-in for a market-leading commercial GIS product, ArcGIS, and facilitates leveraging established GIS visualization and analysis techniques. By relying on a commercial off-the-shelf GIS, development was focused on integration and takes advantage of existing user skills. The software, called SOM Analyst, contains tools for data preprocessing, SOM computation, and SOM visualization. These tools were tested using sample data and evaluated by experts. The tools are documented in a help file, in the code, and through a tutorial and are available on CD-ROM. The CD-ROM, an appendix to the thesis, is available for viewing at the Media Center of Love Library. SOM Analyst makes it easier to use the SOM method within GIS and demonstrates how SOMs can be useful for GIS-based analysis. Conversely, it also shows how GIS enhances the spatial nature of SOM models, such that GIS becomes applicable to even nongeographic data. The practical demonstration of this mutually beneficial relationship of GIS and SOM is among the main methodological contributions of this thesis. SOM Analyst supports several common data formats and has all the basic functions needed for data preparation, SOM computation, and SOM visualization. For simplicity, only the traditional SOM method is supported, but SOM Analyst is a significant contribution to which many enhancements could be added. It serves as the basis of simple and advanced SOM visualization and analysis, including the ability to quickly and easily produce attribute trajectories. These capabilities are meant to enhance knowledge construction based on the analysis of large high-dimensional datasets.


TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES

CHAPTER
1 INTRODUCTION
  1.1 From Data to Knowledge
  1.2 Exploratory Analysis
  1.3 Methods
    1.3.1 Self-Organizing Maps
    1.3.2 Geographic Information Science
  1.4 Integrating Self-Organizing Maps and Geographic Information Systems
    1.4.1 Motivation
    1.4.2 Opportunity
    1.4.3 Goals
  1.5 Summary
2 RESEARCH OBJECTIVES AND DESIGN
  2.1 SOM Analyst
    2.1.1 Data Preprocessing Tools
    2.1.2 SOM Computation Tools
    2.1.3 SOM Visualization Tools
  2.2 Evaluation
  2.3 Documentation
  2.4 Summary
3 LITERATURE REVIEW
  3.1 Self-Organizing Maps
    3.1.1 Kohonen Map
    3.1.2 SOM Variants
  3.2 Geographic Information Sciences
  3.3 Knowledge Construction
  3.4 Integration of SOMs
    3.4.1 SOM_PAK and SOM Toolbox
    3.4.2 GeoVista Studio
    3.4.3 Private Tools
  3.5 Summary
4 METHODOLOGY
  4.1 Data Preprocessing Tools
    4.1.1 File Format Conversion
    4.1.2 Data Management
    4.1.3 Value Transformation
  4.2 SOM Computation Tools
    4.2.1 Create Initial SOM
    4.2.2 Train SOM
    4.2.3 Project Data onto SOM
    4.2.4 Calculate U-matrix
  4.3 SOM Visualization
    4.3.1 SOM Shapefile
    4.3.2 Projected Data Shapefile
    4.3.3 Grouping Shapes
  4.4 Utilities
  4.5 Leveraging ArcGIS
  4.6 Evaluation
5 RESULTS
  5.1 Limitations
  5.2 Significance
  5.3 Future Work
  5.4 Conclusions
REFERENCES
APPENDIX
A TUTORIAL
  A.1 System Requirements
  A.2 Download
  A.3 Adding the Toolbox
  A.4 Convert Data Format
  A.5 Normalize Data
  A.6 Select Variables
  A.7 Export Data
  A.8 Create Initial SOM
  A.9 Train SOM
  A.10 Calculate U-Matrix
  A.11 Project Data onto SOM
  A.12 Create SOM Shapefile
  A.13 Create Data Shapefile
  A.14 Group Data Shapefile
  A.15 Create Extent Shapefile
  A.16 Visualization
B SOM ANALYST SOFTWARE AND SUPPORT (on CD-ROM*)

*The CD-ROM is available at the Media Center of Love Library.


LIST OF TABLES

Table 1.1. Data Mining Tasks and Techniques


LIST OF FIGURES

Figure 1.1. Square and hexagonal topology of a self-organizing map.
Figure 3.1. Anatomy of a self-organizing map showing the absolute coordinates (x, y), relative coordinates (i, j), and id of units within a SOM with distance of 1 between their centers.
Figure 4.1. Data file to SOM data conversion with optional selection.
Figure A.1. The ArcGIS "Window" menu.
Figure A.2. The ArcToolbox context menu.
Figure A.3. The Add Toolbox dialog box.
Figure A.4. The ArcToolbox contents list showing SOM Analyst Tools.
Figure A.5. The SOM Analyst Tools contents.
Figure A.6. The Data File to Database File dialog box.
Figure A.7. The census data table properties.
Figure A.8. The census data table attributes.
Figure A.9. The Normalize by Variable dialog box.
Figure A.10. The normalized by variable census data table attributes.
Figure A.11. The Min-Max Normalization dialog box.
Figure A.12. The min-max normalized census data attributes table.
Figure A.13. The Select dialog box.
Figure A.14. The normalized census data table properties.
Figure A.15. The normalized census data table attributes.
Figure A.16. The Database to SOM_PAK Data dialog box.
Figure A.17. The Create Initial SOM dialog box.
Figure A.18. The Create Initial SOM progress window.
Figure A.19. The stage one Train SOM dialog box.
Figure A.20. The stage two Train SOM dialog box.
Figure A.21. The Calculate U-matrix dialog box.
Figure A.22. The U-matrix table attributes.
Figure A.23. The Project Data onto SOM dialog box.
Figure A.24. The SOM to Shapefile dialog box.
Figure A.25. The Project Data to Shapefile dialog box.
Figure A.26. The Group Shapes dialog box.
Figure A.27. The Create Extent Shapefile dialog box.
Figure A.28. The SOM Analyst Tutorial project layout.


CHAPTER 1

INTRODUCTION

Advances in technology continuously increase the amount of available data, embedding information in a haze of detail. This shift to a data rich environment challenges traditional methods with noise, bias, and diverse forms (Miller & Han, 2001). These challenges led to the development of data mining and knowledge discovery methods and their implementations in software, leveraging the same advances in technology to explore the new forms and volumes of data. However, this software is often disjoint from other methods, requiring an involved manual process for each step of analysis and often the transfer of results from one package to another. In geography this is particularly evident despite great improvements in analysis tools in the last decade. In order to address this issue, this thesis defines and implements an integrated toolset for the exploration of spatio-temporal data using self-organizing maps and geographic information systems.

1.1 FROM DATA TO KNOWLEDGE

The process of distilling data into information and information into knowledge is intricate and significantly challenging. Data are the raw values from objects, while information is their structured form that answers the factual questions who, what, when, where, and how many. In contrast, knowledge consists of the answers to the explanatory questions how and why, which allows the inference of further information and knowledge (Ackoff, 1989). Data, information, and knowledge represent areas on a continuum that runs from little understanding to full understanding; yet modern data volumes can make this historically manual process impractical because information essential for knowledge construction is hidden deep in databases that can exceed hundreds of terabytes. These databases have great potential for revealing important knowledge such as climate patterns, disease factors, and disaster risks, although their ever-increasing volumes make the leap from data to knowledge increasingly challenging.

The problem of increasing data volumes, which potentially contain more knowledge while making that knowledge harder to discover, has led to the development and use of computer-aided processes called data mining (DM) and knowledge discovery in databases (KDD) to facilitate transforming data into knowledge. DM consists of reducing data to information to find simple patterns (see Table 1.1), while KDD consists of reducing these patterns into explanations (MacEachren, Wachowicz, Edsall, Haug, & Masters, 1999; Miller & Han, 2001). Results must then be evaluated on the basis of novelty, interestingness, plausibility, and intelligibility, qualities that are difficult to quantify and thus do not lend themselves well to automation (Valdes-Perez, 1999; Yuan, Buttenfield, Gahegan, & Miller, 2001). Consequently, human intelligence with its domain knowledge guiding DM and KDD should yield better results (Miller, 2008).

Table 1.1. Data Mining Tasks and Techniques

Segmentation
  Description: Clustering: determining a finite set of implicit cases that describes the data. Classification: mapping data items into pre-defined classes.
  Techniques: Cluster analysis; Bayesian classification; decision or classification trees; artificial neural networks.

Dependency analysis
  Description: Finding rules to predict the value of some attribute based on the value of other attributes.
  Techniques: Bayesian networks; association rules.

Deviation and outlier analysis
  Description: Finding data items that exhibit unusual deviations from expectations.
  Techniques: Clustering and other data mining methods; outlier detection.

Trend detection
  Description: Lines and curves summarizing the database, often over time.
  Techniques: Regression; sequential pattern extraction.

Generalization and characterization
  Description: Compact descriptions of the data.
  Techniques: Summary rules; attribute-oriented induction.

Source: Miller, H. J., & Han, J. (2001). Geographic data mining and knowledge discovery. London; New York, NY: Taylor & Francis.

1.2 EXPLORATORY ANALYSIS

Exploratory analysis is a non-traditional approach to analysis that starts with an examination of data to help guide the formation of a hypothesis, which differs from the traditional scientific method of gathering data relevant to a predetermined hypothesis. This concept is critical because it serves as a common basis for analysis of the unknown. The traditional scientific method is dependent on the ability to hypothesize about the subject, to determine what facts are relevant, and to deduce a conclusion, all of which require an understanding of the subject that is not always available. Exploratory analysis does not require prior understanding of the subject, but rather allows the data to inform the scientist, who through inductive and abductive reasoning can arrive at a possible understanding of the subject that can then be tested. This is particularly important for spatio-temporal data where spatial and temporal dependence is evident but the domain knowledge required for an explanation is inadequate. Through the use of DM and KDD, exploratory analysis provides an approach to gleaning knowledge from the stores of massive databases. The research described here uses these concepts for the creation of a tool with the same goals.

1.3 METHODS

The novelty of this work is the ability to use a geographic information system (GIS) on both the spatio-temporal and non-spatio-temporal components of structured data by converting the latter to pseudo-spatio-temporal components using self-organizing maps (SOMs). This can be accomplished because SOMs automatically create a "map" in a user-defined area that describes where input data would be placed as spatially autocorrelated points; similar data tend to be located more closely than dissimilar data. Thus, a SOM has a tendency to preserve the topological relationships within the input data, which become analogous to a distance-similarity relationship in the output map that can then be analyzed using a GIS.

1.3.1 Self-Organizing Maps

Self-organizing maps (SOMs) are a type of artificial neural network that automatically organizes data in an artificial space such that similar data tend to be close together and dissimilar data tend to be far apart (Kohonen, Hynninen, Kangas, & Laaksonen, 1995). Typically the data used with a SOM is a large set of multivariate values, and the artificial space of the SOM is composed of a regular two-dimensional grid of square or hexagonal units (see Figure 1.1).

Figure 1.1. Square and hexagonal topology of a self-organizing map. Source: Skupin, A., & Hagelman, R. (2005). Visualizing demographic trajectories with self-organizing maps. Geoinformatica, 9(2), 159-179.

In order to determine where data should be placed, the square or hexagonal units of the SOM are first initialized with random numbers or eigenvalues for each of the data variables. The SOM then goes through a training process during which local areas are selected and their values shifted towards the input data. After training, data can be projected onto the SOM by assigning each record the location of the most similar unit. The projected data then exhibit a spatial topology that is congruent with the multivariate distances in the unprojected data. In this way the SOM in effect preserves topological relationships among high-dimensional data despite the projection into a low-dimensional space. This conversion to the domain characteristics of low-dimensional geometric space makes the application of geographic analysis possible.
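As a minimal illustration of the projection step described above, the sketch below (Python with NumPy; the array names and shapes are illustrative assumptions, not SOM Analyst code) assigns each input record the grid coordinates of its best-matching unit.

```python
import numpy as np

def project_onto_som(data, codebook, grid_shape):
    """Assign each data row the (row, col) coordinates of its most similar SOM unit.

    data       : (n_samples, n_vars) array of input vectors
    codebook   : (n_units, n_vars) array of trained SOM unit vectors
    grid_shape : (n_rows, n_cols) of the SOM grid, units stored row by row
    """
    # Euclidean distance from every sample to every unit
    dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    best_units = dists.argmin(axis=1)                # index of the best-matching unit
    rows, cols = np.divmod(best_units, grid_shape[1])
    return np.column_stack([rows, cols])             # pseudo-spatial coordinates per record

# Example: 100 records with 5 attributes projected onto a 10 x 8 SOM
rng = np.random.default_rng(0)
data = rng.random((100, 5))
codebook = rng.random((10 * 8, 5))
print(project_onto_som(data, codebook, (10, 8))[:3])
```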

1.3.2 Geographic Information Science

Geographic information science (GIScience) applies the scientific method to the analysis of geographic information and consists of theories and methods, while geographic information systems (GIS) are the software that implements those methods. GIScience has revolutionized the way and extent to which visualizing and analyzing geographic data can be accomplished. These advances are embedded in GIS, creating a platform for the systematic visualization and analysis of geographic data that engages the human visual system (MacEachren, et al., 1999).


1.4 INTEGRATING SELF-ORGANIZING MAPS AND GEOGRAPHIC INFORMATION SYSTEMS

The integration of SOMs and GIS enables a value-added use of both for the analysis of high-dimensional data. Given the ability of SOMs to create geometric spaces, it would seem natural to integrate them with GIS, yet this pairing is in fact unusual and largely unsupported by software. Since the inception of SOMs in 1982, the ability to train and visualize SOMs has grown through programs such as SOM_PAK (Kohonen, et al., 1995) and GeoVISTA Studio (Gahegan, Takatsuka, Wheeler, & Hardisty, 2002). But a decade after the earliest published work on SOM integration with GIS (Li, 1998) there has been little progress in this area despite wide use of each individually. Currently, disjoint elements and circuitous processing characterize the limited use of SOMs with GIS (Skupin & Hagelman, 2005). Large high-dimensional datasets confound traditional methods and have created an urgent need for methods that can synthesize knowledge (Yuan, et al., 2001) from what is quickly becoming a vast wasteland of data. Manual methods for analysis are simply impractical for coping with the current rate of data generation, while automated methods cannot detect the novel patterns needed for new knowledge. It is, rather, a fusion of SOMs and GIS aided by human intelligence that will best serve this interest. It may be possible to achieve unique insights necessary for knowledge construction by augmenting the ability of humans to perceive complex data visually with stimuli produced through the integration of SOMs and GIS.

1.4.1 Motivation

Scientific inquiry predominantly requires a hypothesis to test. However, there is much more potential knowledge than that which is currently hypothesized because so much remains hidden in the plethora of available data. The challenge then is how to facilitate hypothesis generation for something that is not yet conceptualized. Where a theoretical or conceptual framework for a subject of interest may be absent, the integration of SOMs and GIS can provide a context in which such knowledge could be created. Ideally this convergence would form in an integrated environment where tools provide a seamless transition from data to knowledge. This environment could then allow scientists to focus their efforts on knowledge discovery.


1.4.2 Opportunity

The development of an integrated SOM and GIS environment from scratch would disregard the tools that have been refined over the decades and the skills people have developed in using them. Fortunately, GIS has matured to the point where new tools can be easily integrated using customized scripts. These scripts can transform the traditional GIS environment into a novel platform for SOM computation and visualization. SOMs can be represented in a GIS in either raster or vector form by creating an arbitrary origin and scale from which the SOMs' units can be referenced and stored in a standard format for later use.

1.4.3 Goals

This integrated environment would be productive, efficient, and promote a deeper understanding of geographic phenomena. It would minimize learning curves by utilizing existing tools and skills and streamline the transfer of results between SOMs and GIS. Such changes are ontological as well as methodological, as they extend the GIS paradigm to include models of attribute space. This advancement, when enhanced by human intelligence, creates promising conditions for analyzing the high-dimensional data embedded in SOMs. Spatial concepts such as location and topology become analogous to classification and similarity. This paradigm shift is, in fact, a natural concept described by the first law of cognitive geography, which states that people naturally assume things that are closer are more similar (Montello, Fabrikant, Ruocco, & Middleton, 2003). Thus, representing aspatial data topologically leverages people's innate abilities to perform complex analysis. Furthermore, both SOMs and projected data can be analyzed using spatial analysis methods that could serve as proxies for what would be impractically high-dimensional analysis and provide direction for a confirmatory analysis.

1.5 SUMMARY

Improvements in analysis have not kept pace with the generation and storage of data. This has required the development of data mining and knowledge construction to summarize data. However, the significance of these summaries still requires human intervention. Guiding data mining and knowledge construction with domain knowledge creates an interactive approach that allows the informed exploration of data. The scale and complexity of modern datasets are impractical for many methods. Through the use of self-organizing maps, a model of the topological structure within a dataset can be created and used for analysis. These models have spatial characteristics and behavior (Skupin & Fabrikant, 2008) such that geographic methods and approaches can be applied. This extends the domain of both self-organizing maps and geographic information science, providing a practical solution to the analysis and exploration of large high-dimensional datasets. This approach is readily achievable and could work in synergy with existing resources.


CHAPTER 2

RESEARCH OBJECTIVES AND DESIGN

The goal of the research was the creation of a toolset that fulfills the essential needs for SOM-GIS integration, while establishing a precedent and platform for further integration. The toolset was developed and released under the name SOM Analyst. It takes the form of modular components that can function independently but can also be linked to create specialized processes. These components were designed to function in a GIS environment, eliminating the need for GIS users to learn a new interface. During an iterative evaluation process, the modules were refined and then released with documentation and a tutorial (see Appendix A) on CD-ROM (see Appendix B). The CD-ROM, an appendix to the thesis, is available for viewing at the Media Center of Love Library.

2.1 SOM ANALYST

SOM Analyst is designed to be a basic set of tools for using SOMs within an existing GIS platform. The toolset includes data preprocessing, SOM computation, and SOM visualization tools that extend GIS functions. The ability to use SOM techniques within a GIS allows for the advanced analysis of spatio-temporal attribute data by leveraging native GIS functions.

2.1.1 Data Preprocessing Tools

The goal of the data preparation tools is to enable the most common data preparation tasks needed for SOMs. These tasks include format conversions, normalization, random ordering, and set operations. The most efficient way to perform these tasks on a variety of formats is to first convert them into a common format. Each of these tools is designed to perform a basic task; common tasks that are more complex are pre-packaged combinations of the basic tasks.


2.1.2 SOM Computation Tools

The SOM computation tools operate on data at a high-dimensional level and are largely mathematical in nature. These include SOM initialization and training, metrics to evaluate the SOM, and the ability to map data onto the SOM. These tools have a high computation demand and utilize third-party components for efficiency.

2.1.3 SOM Visualization Tools

The SOM visualization tools operate on the low-dimensional results of the SOM computation tools to produce points, lines, and polygons from SOMs and projected data. This allows the visualization of the topology of both the SOM and projected data, individually and together. These tools construct geometry for SOMs and data and store it in a GIS-compatible format. They are at the core of the integration of SOMs and GIS, acting as the bridge between the otherwise separate components. Once the SOM and projected data are in a GIS, the native GIS functions are used for visualization and post-processing.

2.2 EVALUATION

In order to assure the quality of SOM Analyst, it was tested with sample data. In addition, the toolset was refined throughout the development process via consultation with experts who used SOM Analyst in research applications that ranged from demographics to climatology. The experts were self-selecting and participated at their own discretion. They included colleagues and people who expressed interest in the work. The evaluation consisted of iterative communication with the experts to understand their research needs and experiences with SOM Analyst and to revise the software, where appropriate, to improve its functionality. Further detail about the evaluations is included with the results.

2.3 DOCUMENTATION

In the absence of documentation, tools can be unusable. Thus, a help file promotes accessibility by documenting each function, and their uses are demonstrated in a brief tutorial. Additionally, the code for SOM Analyst is documented to serve as a guide for further development. This allows a developer to create additional functions that are compatible with SOM Analyst and expand its capabilities.

2.4 SUMMARY

The goal of this work was to produce functional software that allows the integration of SOMs into a GIS. This software is called SOM Analyst and contains tools for data preprocessing, SOM computation, and SOM visualization. These tools were tested using sample data and evaluated by experts. Following expert evaluation, the tools were refined and supplemented with documentation in a help file, in the code, and through a tutorial. The thesis continues with a review of relevant literature, followed by detailed descriptions of the tools in SOM Analyst, and concludes with a summary of the significance and limitations.


CHAPTER 3

LITERATURE REVIEW

This thesis is informed by exploratory data analysis, knowledge discovery, geographic information science (GIScience), and self-organizing maps (SOMs). SOMs are the means for the exploratory analysis, but it is GIScience that provides the theory and the context for generating knowledge through cognition enhanced by augmented visual perception.

3.1 SELF-ORGANIZING MAPS

Self-organizing maps (SOMs) are a type of artificial neural network that functions as an unsupervised classifier with topology preservation. This means that a SOM creates a network that automatically arranges data based on its topology, an approach which can be useful for a variety of tasks including clustering, classification, non-linear PCA (Villmann, Wieland, & Michael, 2000), vector quantization (de Bodt, Verleysen, & Cottrell, 1997), and K-means clustering (Bacao, Lobo, & Painho, 2005). There are several variants of SOMs including some that dynamically increase the number of units and others that dynamically increase the spaces between units. Despite the diverse uses and variants of SOMs, they always contain units with values, a matching function, a selection function, and a value changing function. Unlike other methods that perform similar tasks, SOMs are highly scalable to large high-dimensional datasets. The size of the SOM can be scaled in relation to the dataset and desired results, with larger SOMs for finding differences and smaller SOMs for finding similarities.

3.1.1 Kohonen Map

The Kohonen map is considered to be the original or traditional SOM and is composed of a two-dimensional grid of units. The grid can be composed of square or hexagonal units that are initialized using random numbers or eigenvalues from the data. During a training process, the SOM is exposed to vectors from the training data one at a time and adjusted to have more similar values. The training process begins by activating the single unit in the SOM that has the smallest Euclidean distance from an input vector. The active unit then creates a neighborhood by selecting all the adjacent units up to a certain distance. All the units in the neighborhood are then adjusted so that their Euclidean distances from the input vector are smaller. Through repetition of this process the SOM forms and the data topology is imprinted.

The creation of a Kohonen map requires the user to specify the topology, the neighborhood type, the x-dimension, the y-dimension, and the dimensionality of the SOM. The topology is the pattern of connections between the grid units visible at the adjacent edges (see Figure 3.1) and is usually rectangular or hexagonal. The neighborhood type is the way in which connections are made between units and is usually Gaussian. The x-dimension and the y-dimension are the number of units in the x-direction and y-direction. The dimensionality of the SOM is the number of variables the SOM is designed for, which depends exclusively on the number of variables in the data.
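To make the training step concrete, the following sketch (plain Python/NumPy rather than the SOM_PAK implementation used later in this thesis) shows one iteration on a rectangular grid: find the best-matching unit and pull its Gaussian neighborhood toward the input vector.

```python
import numpy as np

def train_step(codebook, grid_xy, x, alpha, radius):
    """One Kohonen update. codebook is (n_units, n_vars), grid_xy is (n_units, 2)
    unit coordinates, x is one input vector, alpha the learning rate, and radius
    the neighborhood radius (both typically decrease over the course of training)."""
    bmu = np.linalg.norm(codebook - x, axis=1).argmin()        # best-matching unit
    grid_dist = np.linalg.norm(grid_xy - grid_xy[bmu], axis=1) # distance on the SOM grid
    h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))          # Gaussian neighborhood
    codebook += alpha * h[:, None] * (x - codebook)            # pull neighborhood toward x
    return codebook

# Example: a 12 x 8 rectangular SOM trained on random 4-variable data
rng = np.random.default_rng(1)
xdim, ydim, nvars = 12, 8, 4
grid_xy = np.array([(i, j) for j in range(ydim) for i in range(xdim)], dtype=float)
codebook = rng.random((xdim * ydim, nvars))                    # random initialization
data = rng.random((500, nvars))
for t, x in enumerate(data):
    frac = t / len(data)
    train_step(codebook, grid_xy, x, alpha=0.5 * (1 - frac), radius=max(1.0, 6 * (1 - frac)))
print(codebook[:2].round(3))
```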

Figure 3.1. Anatomy of a self-organizing map showing the absolute coordinates (x, y), relative coordinates (i, j), and id of units within a SOM with distance of 1 between their centers.

Despite the many strengths of the traditional SOM, it is subject to some special effects that can affect its quality. The magnification effect is an increase in the area specialized to specific values because they are contained more frequently in the input data. The boundary effect is the tendency of units on the edge of SOMs to be over-specialized because of their lack of neighbors. The pinch effect is a ring pattern in the distribution of projected data caused by a small neighborhood size, which is also observable as a lack of order in the values of variables in the SOM. The collapse effect is large areas of homogenous values caused by excessive selection of neighboring units (Kohonen, 1982). Usually these effects can be tempered by using proper training parameters, but the best values can be difficult to determine. Consequently, variants on the Kohonen map have been created to address these effects and impart other qualities.

3.1.2 SOM Variants

There are a variety of SOMs, each of which addresses specific needs or concerns that Kohonen maps do not. Three types of particular interest are the growing SOM (GSOM) (Alahakoon, Halgamuge, & Srinivasan, 2000), the neural gas model (NGM) (Martinetz & Shulten, 1991), and the geographic SOM (GeoSOM) (Bacao, Lobo, & Painho, 2008). The GSOM dynamically increases the number of units as certain threshold values are reached. This addresses the issue of needing to specify the number of units at execution time. The NGM gives the units of the SOM free movement to go towards or away from other units. This more readily allows for classes in the data to separate from each other, especially in cases where there are hierarchies. This is particularly useful if the data topology is not compatible with a regular grid. The GeoSOM exploits the principle of spatial autocorrelation to seed the SOM with the attribute values from spatial data. This not only results in the SOM forming more quickly, it also promotes a topology in the SOM that is more consistent with the spatial topology of the data.

3.2 GEOGRAPHIC INFORMATION SCIENCES

In Geographic Information Science (GIScience), data are different from data in other domains because of inherent dependency that affects values and patterns. The complexities of those patterns require searches to find, relate, and interpret interesting, meaningful, and unanticipated features in large datasets with an understanding of spatial dependencies. This process benefits greatly from visualization, which can serve as a means for integrating computer-aided processes with human guidance at conceptual, implementation, and operational levels. In this way the power of the human visual system can be used with visualizations to detect patterns and redirect processing for geographic analysis (MacEachren, et al., 1999).

3.3 KNOWLEDGE CONSTRUCTION

The transition from data to knowledge is a transformation from raw values to an understanding of the system that produced those values. This process is generally called knowledge discovery, but this is perhaps a misnomer that should more appropriately be termed knowledge construction to reflect the process of synthesis involved. Knowledge construction is defined as "the active process of manipulating 'data' to arrive at abstract models of relationships" (MacEachren, et al., 1999). The process of constructing knowledge begins with distilling information from data. Information is defined as interesting patterns in data exemplified by "non-random properties and relationships … general enough to apply to new data, … [that are] non-trivial and unexpected, … lead to some effective action, [and are] simple and interpretable by humans" (Miller & Han, 2001). Once information within a dataset is established, the relationships among the information can be explored for knowledge. Those relationships can then be used with inference for verification and further knowledge discovery.

3.4 INTEGRATION OF SOMS

There are several software packages that are either dedicated to or include SOMs, but there has been little integration with GIS. There are two main packages dedicated to SOMs: SOM_PAK and SOM Toolbox. There is one GIS suite, GeoVista Studio, that includes the ability to use SOMs, but the SOM is not treated as geographic space. Generally speaking, the dedicated packages are easy to use and SOM processing is streamlined, but they lack the advanced post-processing that can be performed using a GIS suite. GeoVista Studio has SOM functionality, as well as a host of other features that make it more extensive, but also more complicated to use. In addition to these publicly available packages, there are a variety of private implementations combining SOMs and GIS; however, their availability is limited and subject to special agreements with their respective creators. While the SOM method is not especially complex, several key features and optimizations can easily be implemented incorrectly, so it is preferable to have source code available for validation.


3.4.1 SOM_PAK and SOM Toolbox

SOM_PAK and SOM Toolbox are released by Helsinki University of Technology and are considered the canonical software for SOMs. They are both open source and perform the basic computations and visualizations for SOMs. SOM_PAK is written in the C language, produces black and white visualizations in PostScript format, and is available as command line executables for Windows and DOS. SOM Toolbox is written for MATLAB, produces colored visualizations that can be saved to various formats, and is available as scripts that must be run in the MATLAB environment. The basic computations these packages perform include the initialization and training of SOMs and the projection of data onto the SOMs. Only the Euclidean distance measure is implemented for these computations. SOM_PAK and SOM Toolbox both have a simple design and are ideally suited to basic SOM needs.

3.4.2 GeoVista Studio

GeoVista Studio (Studio) is full-featured cross-platform software created to improve geoscientific analysis through an integrated analysis environment and infrastructure for hypothesis generation. It has a component-oriented design that embraces visual programming, an open architecture, simple integration, and advanced development. It is built with the philosophy that the linearization of the knowledge discovery process through separated software stalls advancements and creates a software bottleneck. Studio was developed to allow scientists to concentrate on analysis rather than programming. In order to accomplish this, a system has to satisfy the following demands: diverse functionality in a single environment; ease of use with support for complex functions; minimal programming requirements while allowing for rapid development and modification; and the sharing of models. Furthermore, it is advantageous to offer cross-platform support and internet-based deployment (Gahegan & Brodaric, 2002; Gahegan, et al., 2002; Takatsuka & Gahegan, 2002).

Gahegan et al. (2002) further explain their motivation for Studio as, in part, a response to the typical separation of geographic visualization and analysis in commercial off-the-shelf (COTS) software for geoscientific analysis, as well as COTS primarily functioning as deductive and confirmatory tools. Additionally, the difficulty of communicating and exchanging processes with other scientists was a motivation for integrating visualization and geocomputational approaches. It was anticipated that the integration would have benefits for communication, development, and validation. A GIS typically works with existing information only and does not derive information from data. Disconnects between information creation, visualization, and analysis were seen as a real information loss that could be critical for understanding a phenomenon.

3.4.3 Private Tools

In the literature, there are several private tools that integrate SOMs and aspects of GIS. While private tools are simply not available to the general public, it is important to discuss their significance and limitations. One tool called SOMViz (Gabathuler, 2009) is a particularly good example. SOMViz is implemented as a web-based application with a combination of server-side and client-side operations for maximum efficiency. It utilizes SOM_PAK for SOM operations and creates a dynamic interactive environment based on these outputs. This tool is ideally designed for its intended task, the exploration of Swiss census data using web-based SOMs, but it does not have the flexibility and feature set needed for the general use of GIS with SOMs. This broadly characterizes the limitation of such tools: while well adapted to their specific applications, they simply lack the feature set of a full commercial GIS suite.

3.5 SUMMARY

There are a few software packages that allow the use of SOMs; however, they are limited in their ability to cross into the geographic domain. SOM_PAK provides basic functionality and SOM Toolbox extends this, but both keep SOMs separate from use with GIS. GeoVista Studio brings SOMs into a broader knowledge construction platform; however, the specific advantages of treating SOMs as geographic space cannot readily be realized. There are a variety of private tools that are described in the literature as implementing some integration of SOMs and GIS, but they lack the feature set and availability for general use. Thus, there is a need for the development of a more robust and tightly integrated SOM package for GIS.


CHAPTER 4

METHODOLOGY

SOM Analyst is designed to work with ArcGIS because it is the de facto standard GIS and supports a user-friendly interface for customized scripts. Each of the SOM Analyst tools is designed so that it can be used individually or embedded in a more complex process, and each is accompanied by detailed documentation. The documentation includes a description of inputs, including formats, processes, and resulting outputs, providing all the details needed for further development. In general, implementation favored portability and readability over efficiency, except for SOM computations, which because of computation time concerns required the most efficient implementation possible. For this reason, data is read from and stored in delimited text and Dbase IV files, while geometry is stored in ESRI shapefiles, both of which enjoy wide support. This allows users of most programs to import and export their data to and from SOM Analyst.

4.1 DATA PREPROCESSING TOOLS

The data preprocessing tools consist of the file format conversions, data management (selection), and value transformations (normalizations) needed to prepare data for use with the SOM. Conversion tools allow converting between delimited text files and Dbase IV (DBF) files, needed for the other data preprocessing tools, and the final conversion from DBF files to the SOM_PAK data format. Data management tools allow the creation of subsets or supersets from data. Value transformation tools allow for the normalization of data values. Where useful, an optional flag to enable automatic data typing or retyping of values is available, which simplifies the use of native type dependent operations that may not work on character strings.

4.1.1 File Format Conversion

Data conversion from delimited text files to DBF files requires the creation of a file header that contains a label, type, and character length for each variable, and the replacement of the delimiter with sufficient spaces so that values in each column are all of the same width. This task is implemented in a tool called Data File to Database File (Figure 4.1), which has variables for the input data file, the file format, the output file, and an optional flag to indicate if the data type for each variable should be detected.

After any desired data management or value transformations have been performed it is necessary to convert the DBF to the SOM_PAK data format. The SOM_PAK data format is a space delimited text file, beginning with a header row that contains the number of columns, followed by new line delimited rows of space delimited values. Optionally, immediately following the file header line a column header line can be indicated by the number sign (#), while rows can be labeled by values placed at the end of each row. This task is implemented in a tool called Data File to SOM_PAK Data, which has variables for the input database file, the output SOM data, and optionally the columns to use as labels.

Figure 4.1. Data file to SOM data conversion with optional selection.

The conversion from a delimited text file to a DBF file and then to the SOM_PAK data format may seem somewhat circuitous. However, it in fact enforces data integrity and allows a simple way of specifying columns as containing labels. Unlike delimited text formats, the DBF file format does not allow rows with a mismatched number of columns or non-usable values such as not-a-number, null, or infinity, and it enforces the column name limits that are required for shapefiles. The limit on column names also allows for the use of the built-in table interaction features of ArcGIS so that columns can be easily selected and, in this case, have their values converted to SOM_PAK data labels.
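To make the SOM_PAK data format concrete, the sketch below writes a small table in that layout. It is an illustrative, hypothetical writer (the function and field names are assumptions), not the SOM Analyst implementation.

```python
def write_sompak_data(path, rows, column_names=None, labels=None):
    """Write numeric rows to the SOM_PAK data format.

    rows         : list of equal-length sequences of numbers (one per record)
    column_names : optional list of column names written on a '#' header line
    labels       : optional list of one label string per row, appended to each row
    """
    n_cols = len(rows[0])
    with open(path, "w") as f:
        f.write(f"{n_cols}\n")                       # header: dimensionality of the data
        if column_names:
            f.write("#" + " ".join(column_names) + "\n")
        for i, row in enumerate(rows):
            if len(row) != n_cols:
                raise ValueError("all rows must have the same number of columns")
            line = " ".join(str(v) for v in row)
            if labels:
                line += " " + labels[i]              # row label at the end of the line
            f.write(line + "\n")

# Example: three census-like records with a tract id used as the row label
write_sompak_data(
    "example.dat",
    rows=[[0.12, 0.80, 0.33], [0.45, 0.10, 0.91], [0.76, 0.55, 0.02]],
    column_names=["pct_under18", "pct_renter", "pct_college"],
    labels=["tract001", "tract002", "tract003"],
)
```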


4.1.2 Data Management

Combining data files requires appending the rows or columns from one data file to the other and adjusting column widths and data types for compatibility. This task is implemented in a tool called Combine, which has variables for the input database files, the type of combination (that is, whether to append the tables as rows or as columns), the output database file, and an optional flag to indicate if the data type for each variable should be detected. When appending rows, the type and width of each column will automatically be adjusted.

Selecting data from a file requires specifying the rows and/or columns to select and writing those values to a new file. This task is implemented in a tool called Select, which has variables for the input database file, the selection type (that is, whether to select columns or rows), and the output database file. Optionally, the selection can be specified using a starting index, a step for each index, and a stopping index, as well as a flag to retype the selected data. The selection type is only relevant when using the optional indices. The stop index can be specified using either positive or negative indices, where the first index is 0 and the last index is -1 or the length minus 1. When selecting rows it is also possible to select by the names of the columns.

Transposing a data file requires reading in the rows of data and writing them as columns and, when specified, combining columns into row-wise value pairs so that they can be used as a single header row. This task is implemented in a tool called Transpose, which has variables for the input database file, the columns to merge and transpose, the output database file, and optional flags to indicate if the header row contains value pairs and if the column values should be retyped. The data type of the new columns is determined by resolving the previous data types. However, retyping them may yield more desirable (compact) results.
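The start/step/stop indexing used by Select can be illustrated with a short sketch. The treatment of the stop index as inclusive is an assumption made for illustration; the actual tool's behavior is defined by its documentation.

```python
def select_rows(rows, start=0, step=1, stop=-1):
    """Select rows by index. stop may be negative, where -1 denotes the last row
    (i.e., the length minus 1); here the stop index itself is included."""
    if stop < 0:
        stop = len(rows) + stop          # e.g. -1 -> index of the last row
    return [rows[i] for i in range(start, stop + 1, step)]

rows = [["a", 1], ["b", 2], ["c", 3], ["d", 4], ["e", 5]]
print(select_rows(rows, start=0, step=2, stop=-1))   # every other row: a, c, e
print(select_rows(rows, start=1, step=1, stop=3))    # rows b through d
```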

4.1.3 Value Transformation

Value transformations require reading in the data and then performing a calculation that changes the value. This task is implemented in three tools called Min-Max Normalization, Normalize by Variable, and Z-score Normalization. Each has variables for the input database file, the column to normalize by, the output database file, the value to return if there is a division by zero (0 by default), the number of decimal places to round to (6 by default), and optionally the columns on which to perform the normalization. Their functions are as follows:

• Min-Max Normalization is a histogram normalization in which all values are scaled to fall between the specified minimum and maximum values; by default the minimum is scaled to 0 and the maximum to 1. Additionally, there is a variable to allow for normalizing across columns, rows, or both columns and rows (global). This normalization is useful to counteract the higher weight a variable with a larger range would otherwise receive from the SOM.

• Normalize by Variable is a variable normalization that divides values by the specified column. This is useful to counteract the higher weight that the higher value of inversely proportional variables would otherwise receive from the SOM.

• Z-score Normalization is a normalization that divides the deviation of a value from the mean by the standard deviation. This normalization is useful to counteract the higher weight a variable with a larger range and/or standard deviation would otherwise receive from the SOM.
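A minimal sketch of these three normalizations is shown below (NumPy, column-wise; the rounding and division-by-zero options of the actual tools are simplified, and the function names are assumptions).

```python
import numpy as np

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale a column so its minimum maps to new_min and its maximum to new_max."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        return np.full_like(values, new_min, dtype=float)   # avoid division by zero
    return new_min + (values - lo) * (new_max - new_min) / (hi - lo)

def normalize_by_variable(values, divisor):
    """Divide a column by another column, e.g. counts divided by total population."""
    return np.divide(values, divisor, out=np.zeros_like(values, dtype=float),
                     where=divisor != 0)                     # 0 where the divisor is 0

def z_score_normalize(values):
    """Divide each value's deviation from the mean by the standard deviation."""
    sd = values.std()
    return np.zeros_like(values, dtype=float) if sd == 0 else (values - values.mean()) / sd

population = np.array([1200.0, 850.0, 3100.0, 40.0])
renters = np.array([300.0, 200.0, 2500.0, 0.0])
print(min_max_normalize(renters))
print(normalize_by_variable(renters, population))
print(z_score_normalize(renters))
```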

4.2 SOM COMPUTATION TOOLS

The SOM computation tools are at the core of the use of SOMs and consist of creating an initial SOM, training a SOM, projecting data onto a SOM, and calculating a U-matrix. All the tools, with the exception of calculating a U-matrix, are implemented by making calls to the precompiled binaries of SOM_PAK. This has the benefit of using trusted software for computation that is highly optimized and has an established place in the scientific community, while removing the obstacle that direct command prompt interaction may pose for some users.

4.2.1 Create Initial SOM

Creating an initial SOM requires generating initial values for each variable for each unit in the SOM as specified by the parameters. This task is implemented in a tool called Create Initial SOM, which has variables for the input data, the topology type, the neighborhood type, the x-dimension, the y-dimension, the output SOM, the initialization type, and optionally a seed for the random number generator and a read buffer size. This tool generates the initial SOM that can then be trained.

4.2.2 Train SOM

Training a SOM requires an initial SOM and training data that will be used to train the SOM as specified by the parameters. This task is implemented in a tool called Train SOM and has variables for the initial SOM, the training data, the length of training, the initial learning rate, the initial neighborhood radius, the trained SOM, and, optionally, the distance metric (that is, Euclidean with the standard SOM_PAK or cosine with a version of SOM_PAK modified by the University of New Orleans' Fareed Qaddoura), a seed for the random number generator, the use of fixed points, variable weights, a read buffer, a learning type, a snapshot file, and a snapshot interval.
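Because these tools wrap the precompiled SOM_PAK binaries, the sequence below sketches what a hypothetical two-stage initialization and training run might look like from a script. The program names and flags follow the SOM_PAK 3.1 documentation (randinit, vsom); the exact arguments that SOM Analyst passes are not listed in this thesis and may differ.

```python
import subprocess

# Hypothetical two-stage training of a 12 x 8 hexagonal SOM with SOM_PAK.
# Flag names follow the SOM_PAK 3.1 documentation; the wrapper used by
# SOM Analyst may pass a different set of arguments.
subprocess.run(["randinit", "-din", "census.dat", "-cout", "census.cod",
                "-xdim", "12", "-ydim", "8",
                "-topol", "hexa", "-neigh", "gaussian", "-rand", "123"], check=True)

# Stage one: large radius and higher learning rate (rough ordering of the map)
subprocess.run(["vsom", "-din", "census.dat", "-cin", "census.cod",
                "-cout", "census_s1.cod",
                "-rlen", "10000", "-alpha", "0.05", "-radius", "10"], check=True)

# Stage two: small radius and low learning rate (fine tuning)
subprocess.run(["vsom", "-din", "census.dat", "-cin", "census_s1.cod",
                "-cout", "census_s2.cod",
                "-rlen", "100000", "-alpha", "0.02", "-radius", "3"], check=True)
```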

4.2.3 Project Data onto SOM

Projecting data onto a SOM requires unprojected data and a SOM on which to project the data and consists of finding the unit that is most similar for each row of data. The coordinates of that unit are then considered the projected values for that row of data. This task is implemented in a tool called Project Data onto SOM and has variables for the SOM, the data to project, the projected data, and optionally the distance metric to use, the skipping of masked vectors, and a read buffer.

4.2.4 Calculate U-matrix

Calculating a U-matrix requires a SOM and consists of calculating the average difference between each unit in the SOM and its adjacent neighbors. This task is implemented in a tool called Calculate U-matrix and has variables for the SOM, the output U-matrix in database file form, and optionally the distance metric to use.
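For a rectangular grid the calculation can be sketched as follows: a simplified per-unit average of 4-connected neighbor distances. The SOM_PAK-style U-matrix also interpolates values between units, which is omitted here, and this sketch is not the Calculate U-matrix implementation itself.

```python
import numpy as np

def u_matrix(codebook, xdim, ydim):
    """Average Euclidean distance from each unit to its 4-connected neighbors
    on a rectangular grid. codebook is (xdim * ydim, n_vars), stored row by row."""
    grid = codebook.reshape(ydim, xdim, -1)
    u = np.zeros((ydim, xdim))
    for j in range(ydim):
        for i in range(xdim):
            dists = []
            for dj, di in ((-1, 0), (1, 0), (0, -1), (0, 1)):   # up, down, left, right
                nj, ni = j + dj, i + di
                if 0 <= nj < ydim and 0 <= ni < xdim:
                    dists.append(np.linalg.norm(grid[j, i] - grid[nj, ni]))
            u[j, i] = np.mean(dists)
    return u

rng = np.random.default_rng(2)
print(u_matrix(rng.random((8 * 12, 4)), xdim=12, ydim=8).round(2))
```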

4.3 SOM VISUALIZATION
SOM visualization tools are at the core of the integration with ArcGIS and consist of creating a shapefile from a SOM, creating a shapefile from projected data, and grouping shapes in a shapefile. These tools create the data needed for use in ArcGIS and are the most critical part of SOM Analyst. The shapefiles are produced using a proprietary shapefile library that allows for easy customization.

4.3.1 SOM Shapefile
Creating a shapefile from a SOM requires a SOM and consists of creating a shapefile with a shape for each unit in the SOM. This task is implemented in a tool called SOM to Shapefile and has variables for the SOM, the desired shape type (that is, point or polygon, with polygon being the default value), the output shapefile, and, optionally, the SOM data file for labeling, a Cartesian quadrant into which to place the shapes (1 by default), and a size for the shapes (1 by default). The Cartesian quadrant and the size of the shapes allow for the calculation of a unit's coordinates based on the relative location of each unit in the SOM and add compatibility for other users.
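One way to derive planar coordinates for each unit from its grid position, consistent with the quadrant and size options described above, is sketched here. The half-column offset on odd rows and the sqrt(3)/2 row spacing are standard for hexagonal grids, but the exact formulas used by SOM Analyst, and the sign convention for quadrants, are assumptions.

import math

def unit_center(col, row, size=1.0, quadrant=1):
    # Offset every other row by half a cell and compress row spacing for a hexagonal layout.
    x = (col + (0.5 if row % 2 else 0.0)) * size
    y = row * size * math.sqrt(3) / 2.0
    # Mirror into the requested Cartesian quadrant (sign convention assumed).
    if quadrant in (2, 3):
        x = -x
    if quadrant in (3, 4):
        y = -y
    return x, y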

4.3.2 Projected Data Shapefile
Creating a shapefile from projected data requires a projected data file and consists of creating a shapefile with a shape for each row in the projected data file. This task is implemented in a tool called Projected Data to Shapefile and has variables for the projected data, the shape type, the projected data shapefile, and optionally the SOM data file for labeling, a Cartesian quadrant into which to place the shapes (1 by default), a size for the shapes (1 by default), the location for placement in a unit (center by default), and the distance from the center if placement is random (0.3 by default).
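The "random around center" placement can be pictured as drawing each record at a random angle within a fixed radius of its unit's center so that coincident records remain distinguishable; the uniform sampling below is an illustrative assumption rather than the tool's exact behavior.

import math, random

def jitter(cx, cy, radius=0.3):
    # Place a point at a random angle and distance (up to radius) from the unit center.
    angle = random.uniform(0.0, 2.0 * math.pi)
    r = random.uniform(0.0, radius)
    return cx + r * math.cos(angle), cy + r * math.sin(angle)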

4.3.3 Grouping Shapes
Grouping shapes in a shapefile requires a shapefile and the parameters on which to base the grouping. This task consists of reading in the shapefile and grouping the shapes based on those parameters and is implemented in a tool called Group Shape. The tool has variables for the input shapefile, the data column on which to base the grouping, the type of shape to create, the value to use in the output shape attribute table, the output shapefile, and, optionally, the column on which to sort the shapes prior to grouping. This allows for the creation of trajectories (polylines) and clusters (polygons) in a SOM for visualization and analysis.
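Schematically, building a trajectory amounts to grouping projected points by one column, sorting each group by another, and connecting the sorted points into a polyline. The dictionary-based records below are a simplification; SOM Analyst itself operates on shapefile records and attribute tables.

from collections import defaultdict

def build_trajectories(records, group_col, sort_col):
    # records: iterable of dicts that contain 'x', 'y' and the grouping/sorting columns.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_col]].append(rec)
    trajectories = {}
    for key, recs in groups.items():
        recs.sort(key=lambda r: r[sort_col])                  # e.g. sort a state's points by Year
        trajectories[key] = [(r["x"], r["y"]) for r in recs]  # ordered polyline vertices
    return trajectories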

4.4 UTILITIES
Utility tools allow the completion of tasks that do not fit into the other categories and consist of creating a bounding box (extent) shapefile and sending an email. The creation of a bounding box is implemented in a tool called Create Extent Shapefile and consists of creating a shapefile with a bounding box based on the extent listed in the input shapefile header. Sending an email is implemented in a tool called Send Email. This tool is intended for use with models and will be described in the next section.


4.5 LEVERAGING ARCGIS
ArcGIS is a mature product that includes many features that can be used for visualizing and analyzing a SOM. In addition to the basic visualization of the shapefiles, there are advanced labeling features, color schemes, and scale-dependent visualization. Native ArcGIS tools can also be used to generate new visual features such as Voronoi polygons, interpolated surfaces, and hill shading, as well as computational tools for geostatistical analysis that may serve as distortion measures or other indicators. Furthermore, through the ArcGIS graphical programming environment, ModelBuilder, the data preprocessing, SOM computation, and SOM visualization tools can be combined into reusable, data-driven models. These models can even be combined with the email-sending utility so that the user can be remotely notified of progress or even receive results. Additionally, sharing models between scientists facilitates communication and collaboration on projects using SOMs.
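Models built in ModelBuilder can also be reproduced as short geoprocessing scripts. The sketch below uses the documented ArcGIS 9.3 pattern of creating a geoprocessor object and adding a toolbox; the toolbox path and the tool/alias names (for example, DataFileToDatabaseFile_som) are hypothetical placeholders, not SOM Analyst's documented names.

import arcgisscripting

gp = arcgisscripting.create(9.3)
gp.AddToolbox(r"C:\somanalyst\guiArcGIS93.tbx")  # example path; adjust to the install location
# Chain two of the data preprocessing steps (tool and alias names are placeholders).
gp.DataFileToDatabaseFile_som("census.csv", "CSV", "census.dbf")
gp.MinMaxNormalization_som("census.dbf", "column", "norm01.dbf")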

4.6 EVALUATION
SOM Analyst was evaluated on a voluntary basis by experts who assessed its capabilities and made suggestions for enhancements. These experts chose to use SOM Analyst as part of their own work and actively provided feedback on multiple occasions over a four-year period from 2007 to 2011. The group comprised 7 people in total: 3 master's students, 1 Ph.D. candidate, 1 professional researcher, and 2 professors from 4 different institutions, with all but one of them establishing contact in person during professional activities. Their types and sources of data were broad, and included numeric and text data on demographics, politics, and the environment, from sources as varied as blogs and the US Census. I communicated with all of the experts via email, and their invaluable assessments contributed to the features, integration, and completeness of SOM Analyst.

The experts primarily chose to use SOM Analyst because of their familiarity with ArcGIS and the ease with which they could use it compared to the other tools available for SOMs. Additionally, my commitment to helping them use SOM Analyst and to responding to their needs assured their willingness to participate. The most common challenge by far was preparing data for SOM Analyst. Incompatibilities were caused by differences in character encodings (e.g., Windows vs. Mac), spacing, and empty lines, which required the creation of a more robust parsing schema. Once data were in SOM Analyst, the primary suggestions concerned enhancements in manipulating the data, including randomizing the order, normalizing values, and specifying labeling. As SOM Analyst evolved into a more complex piece of software, the need for documentation became a strong concern as well. This led to the creation of an embedded help feature, a help file, and a tutorial. In total there were more than 200 revisions to the software, comprising approximately 55% enhancements (e.g., additional file format support, normalization schema, quadrant placement), 40% documentation (e.g., wiki, help file, embedded explanations), 15% organization (e.g., directory paths, class structures, library, program flow), and 5% error corrections (e.g., value indexing, division by zero, formulas).

One user evaluated SOM Analyst in the process of completing her work in 2008, using it to spatialize and visualize 200,000+ U.S. census block groups in a high-resolution SOM of 250,000 neurons. Her data consisted of 69 attributes that included demographics, land use and cover, climate, geology, and soils. She provided extensive feedback that highlighted critical changes needed in the program before she could complete her work; this feedback identified the majority of the errors and of the sensitivities to variations in data format. Her overall evaluation was that SOM Analyst was "easy to implement and use."

One independent researcher evaluated SOM Analyst in the process of researching and writing his 2010 paper. He used SOM Analyst to spatialize and visualize the population trends of the census tracts of major U.S. urban cores from 1950 to 2000, contrasting this with k-means clustering and his primary method of bicomponent trend maps. His overall evaluation was that SOM Analyst allowed him to do "everything he needed and expected to do," and that, once he found the right documentation, it made the "complete process of operation straightforward."

One 2011 Ph.D. candidate used SOM Analyst to visualize an analysis of approximately "156,000 newspaper and magazine articles, TV and radio transcripts, and blog entries on the topic of global climate change published/broadcast between 1969 and 2009." Her computation of the SOM was done using SOM Toolbox, because SOM_PAK, on which SOM Analyst relies, limits input data size; the SOM_PAK limit of 10,000 rows of data led her to use SOM Toolbox for the computational portion of her research. Her overall evaluation was that SOM Analyst "worked quite well," but she had some lingering needs that cannot be addressed by SOM Analyst due to its dependence on SOM_PAK and ArcGIS. Similarly, the 2 gigabyte limit on the size of shapefiles and their corresponding database files, as well as restrictions on the names, number, and precision of fields, was seen as a possible future hindrance.

Other evaluations included suggestions for documentation, organization, and presentation. These ranged from simple changes in language to significant alterations in software architecture. At times these suggestions conflicted with each other. I tried to address each suggestion either directly, by implementing the changes, or by explaining why I would not make them. These evaluations, along with others, allowed SOM Analyst to be refined to a point suitable for public release. There are additional enhancements that could be made to the software and documentation; however, based on feedback from the experts who evaluated the software, the current version of SOM Analyst has achieved the goal of integrating SOMs and GIS and is ready for public release.


CHAPTER 5
RESULTS
SOM Analyst fills a need for easier use of SOMs in GIScience, as is evident from the evaluations of experts and from online tracking of access. By facilitating the use of SOMs in GIScience, SOM Analyst is expected to result in enhanced analytical capabilities (Bacao et al., 2008; Skupin & Hagelman, 2005). Nevertheless, as with all software, there are limits to its capabilities.

5.1 LIMITATIONS
The limitations of SOM Analyst fall broadly into two categories: computational and representational. SOM_PAK has preset data limits, which were too low for some experts, and while they can be sufficiently increased in most instances, this would require a nonstandard version of SOM_PAK. Computationally speaking, SOM_PAK should scale up well for large SOMs; however, it is intended for a single-core, 32-bit architecture and therefore cannot access additional processors or more than 4 GB of RAM. As to the general performance of SOM_PAK, it performs very well in comparison to other SOM implementations, yet, as expected, performance is closely related to computer equipment and can vary greatly.
ArcGIS uses many different file formats and databases. However, only the shapefile is an open and documented format that is widely used and supports both the geometry and data requirements of a SOM. Shapefiles are a robust way of storing geographic coordinates, but the data table in the dBASE IV format has limitations that can make it unsuitable for some SOM applications. These limitations include the number of columns, the width of columns in characters, and special values (e.g., null). Despite these limitations, I chose to use shapefiles because they are widely used in ArcGIS as well as other programs, and they are suitable, or can be made suitable, for most purposes.

There are a number of additional features that it would be desirable to include in SOM Analyst but that were not implemented in this version. The ability to import more data formats, and perhaps even a generalized data import tool that accepts data format templates, would be useful. Interactivity at each step in the SOM creation, computation, and visualization process, such as sampling data and suggesting parameters, would be helpful. Support for other spatialization methods or SOM variants also would be useful, because it would allow for comparisons between methods and selection of the most appropriate method for a given dataset. The current framework of SOM Analyst can serve as a guide for implementation and as a platform for integration.

5.2 SIGNIFICANCE
SOM Analyst is flexible and efficient, and when used in ArcGIS it becomes a powerful platform for analysis and visualization. It was first posted online through Google Code in June 2007 after 7 months of development. In December 2008, Google Analytics was activated to track website activity. As of April 2011, Google Analytics reported 4,556 page views from 852 visits originating in 220 cities from 49 countries. The most visits, 255, came from 30 cities (30 states) in the United States, followed by 243 visits from 11 cities in the United Kingdom. According to Google Analytics there were 310 absolute unique visitors, with an approximate average of 3 visits per user, and more than 67% of visits included more than one page, indicating interest rather than an accidental visit. This represents significant interest by the scientific community and serves as a proxy for the value of SOM Analyst. Further significance can be surmised from the fact that, as of this writing, SOM Analyst has been used in one journal article and four master's theses.
SOM Analyst represents a significant methodological contribution to GIScience by streamlining the ability to use SOMs in a familiar environment. It is an expandable platform with all the basic functions needed for data preparation, SOM computation, and SOM visualization. It supports the most common and essential data formats in use today. It serves as the basis of simple and advanced SOM visualization and analysis, including the ability to quickly and easily produce attribute trajectories (Skupin & Hagelman, 2005). Placing these capabilities readily at the hands of GIScientists enhances the knowledge construction process and provides the means for performing critical analysis on large high-dimensional datasets.


5.3 FUTURE WORK
Future enhancements to SOM Analyst could focus on data input and on the development of additional metrics. In particular, I would like it to be adopted by a larger user base that would contribute to its further development by implementing and sharing tools. Functionality that could be added includes more options for data input, as well as options for the creation of artificial data. Adding tools for creating artificial data would allow SOM Analyst to be used for theory-testing purposes. By testing SOMs in SOM Analyst with many different kinds of artificial data, guidelines could be established to help identify the optimal parameters for different kinds of data. The usability and durability of SOMs as a means for exploring spatio-temporal data also would be enhanced by the development, testing, and demonstration of additional metrics that would convey the qualities of a SOM. These metrics could then be used to demonstrate quantitatively the strengths of SOMs and enhance their standing as a tool for analysis. Results could then be examined in relation to other areas of science, such as emergent systems.

5.4 CONCLUSIONS
SOM Analyst makes it easier to use the SOM method within GIS and demonstrates how SOMs can be useful for GIS-based analysis. Conversely, it also shows how GIS enhances the spatial nature of SOM models, such that GIS becomes applicable to even nongeographic data. The practical demonstration of this mutually beneficial relationship of GIS and SOM is among the main methodological contributions of this thesis. SOM Analyst supports several common data formats and has all the basic functions needed for data preparation, SOM computation, and SOM visualization. For simplicity, only the traditional SOM method is supported, but SOM Analyst is a significant contribution to which many enhancements could be added. It serves as the basis of simple and advanced SOM visualization and analysis including the ability to quickly and easily produce attribute trajectories. These capabilities are meant to enhance knowledge construction based on the analysis of large high-dimensional datasets.


REFERENCES
Ackoff, R. L. (1989). From data to wisdom. Journal of Applied Systems Analysis, 16, 3-9.
Alahakoon, D., Halgamuge, S. K., & Srinivasan, B. (2000). Dynamic self-organizing maps with controlled growth for knowledge discovery. IEEE Transactions on Neural Networks, 11(3), 601-613.
Bacao, F., Lobo, V., & Painho, M. (2005). Self-organizing maps as substitutes for k-means clustering. Computational Science - ICCS 2005, Pt 3, 3516, 476-483.
Bacao, F., Lobo, V., & Painho, M. (2008). Applications of different self-organising map variants to geographical information science problems. In P. Agarwal & A. Skupin (Eds.), Self-Organising Maps: Applications in Geographic Information Science (pp. 21-44). Chichester, England: John Wiley & Sons, Ltd.
de Bodt, E., Verleysen, M., & Cottrell, M. (1997). Kohonen maps versus vector quantization for data analysis. In M. Verleysen (Ed.), Proceedings of ESANN'97, 5th European Symposium on Artificial Neural Networks (pp. 211-218). Brussels, Belgium: D facto.
Gabathuler, C. (2009). Web-based self-organizing maps for exploration of Swiss census data. M.S., University of Zurich, Zurich.
Gahegan, M., & Brodaric, B. (2002, July 9-12). Computational and visual support for geographical knowledge construction: Filling in the gaps between exploration and explanation. Paper presented at the International Symposium on Spatial Data Handling, Ottawa, Canada.
Gahegan, M., Takatsuka, M., Wheeler, M., & Hardisty, F. (2002). Introducing GeoVISTA Studio: An integrated suite of visualization and computational methods for exploration and knowledge construction in geography. Computers, Environment and Urban Systems, 26, 267-292.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1), 59-69.
Kohonen, T., Hynninen, J., Kangas, J., & Laaksonen, J. (1995). SOM_PAK: The self-organizing map program package (version 3.1). Espoo, Finland: Helsinki University of Technology, Laboratory of Computer and Information Science.
Li, B. (1998). Exploring spatial patterns with self-organising maps. Paper presented at the Proceedings of the Geographic Information Systems and Land Information Systems Consortium.
MacEachren, A. M., Wachowicz, M., Edsall, R., Haug, D., & Masters, R. (1999). Constructing knowledge from multivariate spatiotemporal data: Integrating geographical visualization with knowledge discovery in database methods. International Journal of Geographical Information Science, 13(4), 311-344.

Martinetz, T., & Schulten, K. (1991). A neural-gas network learns topologies. Artificial Neural Networks, I, 397-402.
Miller, H. J. (2008). Geographic data mining and knowledge discovery. In J. P. Wilson & A. S. Fotheringham (Eds.), The Handbook of Geographic Information Science (pp. 352-366). Blackwell Publishing.
Miller, H. J., & Han, J. (2001). Geographic data mining and knowledge discovery. London; New York, NY: Taylor & Francis.
Montello, D. R., Fabrikant, S. I., Ruocco, M., & Middleton, R. S. (2003). Testing the first law of cognitive geography on point-display spatializations. Proceedings, Conference on Spatial Information Theory (COSIT '03), Lecture Notes in Computer Science 2825, Ittingen (pp. 24-28). Springer.
Skupin, A., & Agarwal, P. (2008). Introduction: What is a self-organizing map? In A. Skupin & P. Agarwal (Eds.), Self-organising maps: Applications in geographic information science (pp. 1-20). Chichester, England: John Wiley & Sons, Ltd.
Skupin, A., & Fabrikant, S. I. (2008). Spatialization. In J. P. Wilson & A. S. Fotheringham (Eds.), The handbook of geographic information science (pp. 61-79). Malden, MA: Blackwell Publishing.
Skupin, A., & Hagelman, R. (2005). Visualizing demographic trajectories with self-organizing maps. GeoInformatica, 9(2), 159-179.
Takatsuka, M., & Gahegan, M. (2002). GeoVISTA Studio: A codeless visual programming environment for geoscientific data analysis and visualization. Computers & Geosciences, 28(10), 1131-1144. doi:10.1016/S0098-3004(02)00031-6
Valdes-Perez, R. E. (1999). Principles of human-computer collaboration for knowledge discovery in science. Artificial Intelligence, 107(2), 335-346.
Villmann, T., Wieland, H., & Michael, G. (2000). Data mining and knowledge discovery in medical applications using self-organizing maps. Paper presented at the Proceedings of the First International Symposium on Medical Data Analysis.
Yuan, M., Buttenfield, B. P., Gahegan, M., & Miller, H. (2001). Geospatial data mining and knowledge discovery. Washington, DC: University Consortium for Geographic Information Science. Retrieved from http://www.ucgis.org/priorities/research/research_white/2000%20Papers/emerging/gkd.pdf


APPENDIX A
TUTORIAL

This tutorial contains step-by-step instructions on how to use the provided example dataset with SOM Analyst. The source dataset for this tutorial is provided with SOM Analyst and is located in its sub-folder named dat. The file named census.csv contains gender, age, race, and housing data for each U.S. population census between the years 1900 and 1990. First, the data are converted from the comma-separated values format (.csv) to the database file format (.dbf) so that normalizations can be performed. Second, the raw count data are normalized by state population counts. Third, every variable is normalized into a 0 to 1 range and the preprocessed data are then exported to the SOM input format. Using those input data, a SOM is trained in two stages. The input data are then projected onto the finished SOM. Finally, a number of visualizations are produced.

A.1 SYSTEM REQUIREMENTS
1. Windows (any version)
2. ArcGIS 9.3 (legacy toolboxes for ArcGIS 9.0-9.2 are provided, but untested)
3. Python 2.5 (included in the default ArcGIS 9.3 installation)

A.2 DOWNLOAD
SOM Analyst is available for download from http://somanalyst.googlecode.com

A.3 ADDING THE TOOLBOX
Add the SOM Analyst toolbox to ArcGIS.
1. Open the ArcToolbox panel by clicking on the Window menu and selecting ArcToolbox. Alternatively, click on the toolbox icon on the menu bar.

Figure A.1. The ArcGIS “Window” menu.
2. Right click in the ArcToolbox panel and select Add Toolbox....


Figure A.2. The ArcToolbox context menu.
3. Browse to the location of SOM Analyst, select guiArcGIS93.tbx, and click Open.
Note: Depending on your computer setup, it may be necessary to first “connect” to the folder that contains SOM Analyst. In that case, click the icon of a folder with an arrow pointing to a globe in the dialog box.

Figure A.3. The Add Toolbox dialog box.

The SOM Analyst toolbox is now accessible through the ArcToolbox panel.

Figure A.4. The ArcToolbox contents list showing SOM Analyst Tools.
Browse through the toolbox to familiarize yourself with the tools.

Figure A.5. The SOM Analyst Tools contents.


A.4 CONVERT DATA FORMAT
Convert the data to a database file format.
1. Run the Data File to Database File tool by double clicking on it in the File Format Conversions toolbox of the Data Preprocessing toolbox.

Figure A.6. The Data File to Database File dialog box.
2. Select census.csv as the input data file.
3. Set Comma Separated Values (CSV) as the input file format.
4. Change the output database file to census.dbf.
5. Click OK to run the conversion.

In the table properties, the data type for each column is text.

Figure A.7. The census data table properties.
The values in the table are left justified, indicating that they are text.

Figure A.8. The census data table attributes.


A.5 NORMALIZE DATA
Normalize values in the database file.
1. Run the Normalize by Variable tool by double clicking on it in the Value Transformations toolbox of the Data Preprocessing toolbox.

Figure A.9. The Normalize by Variable dialog box.
2. Select census.dbf as the input database file.
3. Select Population as the normalize by column.
4. Change the output database file to normVar.dbf.

5. Select the columns male, female, Under_15, 15_64, 65_Over, Am_Indian, Asian, Black, and White in the columns to normalize field.
6. Click OK to run the normalization.
The resulting table contains population ratios.

Figure A.10. The normalized by variable census data table attributes.


7. Run the Min-Max Normalization tool by double clicking on it in the Value Transformations toolbox of the Data Preprocessing toolbox.

Figure A.11. The Min-Max Normalization dialog box.
8. Select normVar.dbf as the input database file.
9. Select column as the normalize by field.
10. Change the output database file to norm01.dbf.
11. Select the columns male, female, Under_15, 15_64, 65_Over, Am_Indian, Asian, Black, and White in the columns to normalize field.

12. Click OK to run the normalization.
The resulting table contains normalized values.

Figure A.12. The min-max normalized census data attributes table.


A.6 SELECT VARIABLES
Select the relevant variables from the database file.
1. Run the Select tool by double clicking on it in the Data Management toolbox of the Data Preprocessing toolbox.

Figure A.13. The Select dialog box.
2. Select norm01.dbf as the input database file.
3. Set columns as the selection type.
4. Change the output database file to demographics.dbf.
5. Select all columns except Owner, Renter, and Households in the columns field.

6. Enable detect data types.
7. Click OK to run the selection.
In the table properties, the value types for the columns have changed where appropriate.

Figure A.14. The normalized census data table properties.
The numeric values in the table are right justified, indicating that they are numbers.

Figure A.15. The normalized census data table attributes.

Note: Detecting data types for columns requires checking the data type of each value and can be time consuming for large datasets. This step is only necessary if performing normalizations or other calculations before using the data with a SOM.

A.7 EXPORT DATA
Export the database file to the SOM data format.
1. Run the Database File to SOM_PAK Data tool by double clicking on it in the File Format Conversions toolbox of the Data Preprocessing toolbox.

Figure A.16. The Database to SOM_PAK Data dialog box.
2. Select demographics.dbf as the input database file.
3. Change the output SOM data file to demographics.dat.
4. Select Region, Division, State, and Year in the label columns field.
5. Click OK to run the export.


A.8 CREATE INITIAL SOM
Creating the initial SOM.
1. Run the Create Initial SOM tool by double clicking on it in the SOM Computation toolbox.

Figure A.17. The Create Initial SOM dialog box.
2. Select demographics.dat as the data for SOM.
3. Select hexa as the topology of map.
4. Set 25 as the x dimension.
5. Set 25 as the y dimension.
6. Set init.cod as the initial SOM.
7. Click OK to run the creation of the initial SOM.

A window will open that indicates the progress of the process.

Figure A.18. The Create Initial SOM progress window.


A.9 TRAIN SOM
Training the SOM.
Note: The SOM will be trained in two steps. The first training will create the overall structure in the SOM. The second training will create the finer specialization.
1. Run the Train SOM tool by double clicking on it in the SOM Computation toolbox.

Figure A.19. The stage one Train SOM dialog box.
2. Select init.cod as the initial SOM.
3. Select demographics.dat as the training data.

4. Set 4900 as the length of training.
5. Set 0.04 as the initial learning rate.
6. Set 25 as the initial neighborhood radius.
7. Change the trained SOM to stage1.cod.
8. Click OK to run the training of the SOM.
A window will open that indicates the progress of the process as it did with the creation of the initial SOM.
9. Run the Train SOM tool.

Figure A.20. The stage two Train SOM dialog box.

10. Select stage1.cod as the initial SOM.
11. Select demographics.dat as the training data.
12. Set 49000 as the length of training.
13. Set 0.03 as the initial learning rate.
14. Set 5 as the initial neighborhood radius.
15. Change the trained SOM to stage2.cod.
16. Click OK to run the training of the SOM.
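For reference, a hedged sketch of the two SOM_PAK training runs that correspond to the parameters above, wrapped with Python's subprocess module; the vsom flag names follow the SOM_PAK 3.1 documentation, and the file names assume the working directory used in this tutorial.

import subprocess

# Stage one: coarse ordering (length 4900, learning rate 0.04, radius 25).
subprocess.call(["vsom", "-din", "demographics.dat", "-cin", "init.cod",
                 "-cout", "stage1.cod", "-rlen", "4900",
                 "-alpha", "0.04", "-radius", "25"])

# Stage two: fine tuning (length 49000, learning rate 0.03, radius 5).
subprocess.call(["vsom", "-din", "demographics.dat", "-cin", "stage1.cod",
                 "-cout", "stage2.cod", "-rlen", "49000",
                 "-alpha", "0.03", "-radius", "5"])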


A.10 CALCULATE U-MATRIX
Calculate the U-matrix of a SOM.
1. Run the Calculate U-matrix tool by double clicking on it in the SOM Computation toolbox.

Figure A.21. The Calculate U-matrix dialog box.
2. Select stage2.cod as the input SOM.
3. Change the output U-matrix database file to Umatrix.dbf.
4. Click OK to calculate the U-matrix.

Figure A.22. The U-matrix table attributes.


A.11 PROJECT DATA ONTO SOM
Project the data onto the SOM.
1. Run the Project Data onto SOM tool by double clicking on it in the SOM Computation toolbox.

Figure A.23. The Project Data onto SOM dialog box.
2. Select stage2.cod as the SOM.
3. Select demographics.dat as the data to project.
4. Change the projected data to demographics.bmu.
5. Click OK to project the data onto the SOM.
A window will open that indicates the progress of the process as it did with the creation of the initial SOM.


A.12 CREATE SOM SHAPEFILE
Creating the SOM shapefile.
1. Run the SOM to Shapefile tool by double clicking on it in the SOM Visualization toolbox.

Figure A.24. The SOM to Shapefile dialog box.
2. Select stage2.cod as the SOM.
3. Select polygon as the shape type.
4. Change the SOM shapefile to stage2.shp.
5. Set demographics.dat as the SOM data for variable names.
6. Enable label SOM with data labels.
7. Set Umatrix.dbf as the U-matrix.
8. Click OK to create the SOM shapefile.


A.13 CREATE DATA SHAPEFILE
Creating the data shapefile.
1. Run the Projected Data to Shapefile tool by double clicking on it in the SOM Visualization toolbox.

Figure A.25. The Project Data to Shapefile dialog box.
2. Select demographics.bmu as the projected data.
3. Select point as the shape type.
4. Change the projected data shapefile to bmu.shp.
5. Select demographics.dat as the label from SOM data.
6. Select random around center as the placement.
7. Click OK to create the data shapefile.


A.14 GROUP DATA SHAPEFILE
Grouping the shapes in the data shapefile.
1. Run the Group Shapes tool by double clicking on it in the SOM Visualization toolbox.

Figure A.26. The Group Shapes dialog box.
2. Select bmu.shp as the input shapefile.
3. Select State as the group by column.
4. Select polyline as the group type.
5. Select maximum as the value type.
6. Change the output shapefile to trajectories.shp.
7. Select Year as the sort by column.
8. Click OK to create the trajectories.


A.15 CREATE EXTENT SHAPEFILE
Creating the extent shapefile.
1. Run the Create Extent Shapefile tool by double clicking on it in the Utilities toolbox.

Figure A.27. The Create Extent Shapefile dialog box.
2. Select stage2.shp as the input shapefile.
3. Change the output shapefile to extent.shp.
4. Click OK to create the extent shapefile.


A.16 VISUALIZATION
Visualizing the SOM and projected data.
1. Open tutorial.mxd.
Note: Your map will not be identical, but it should be very similar. The frames may appear rotated due to the initial random numbers used.
The large map shows the trajectory of each state across the SOM over time on a base of the U-matrix, a measure of distortion. The trajectories are color coded by census division, and the divisions are shown in the lower right. The other frames contain the component planes, each showing the neuron weights for one variable across the entire SOM.
When examining the demographic trajectories of each state, note that each shift in the trajectory corresponds to a census year and that at the end of the trajectory is an arrow that represents the year 1990. Parallel trajectories indicate a similar change in demographics over time. Parallel trajectories are particularly evident within the South Division (West South Central Region, East South Central Region, and South Atlantic Region) and the Northeast Division (Middle Atlantic Region and New England Region). This demonstrates spatial autocorrelation and is consistent with the demographic changes over the last century. In the Northeast Division, the parallel trajectories split 40 years ago, mainly into coastal and landlocked areas, with New York and New Jersey similar to each other but dissimilar to the other coastal states.
When examining the component planes, you are seeing how the SOM allocates location based on that variable. In this map, darker color means high values and lighter color means low values. You can see that the female component plane is very dark in one corner and light in the opposite corner, with a gradual change between the two. Conversely, the male component plane is very dark in the opposite corner and has a similar pattern of gradual change. When comparing component planes to each other, you can see how the SOM weights the variables in the same location and thus derive a relationship between them. You can see that female and male have an inversely proportional relationship in the SOM that corresponds with reality; that is, a high number of females inherently means a low number of males and vice versa.

Figure A.28. The SOM Analyst Tutorial project layout.