European Educational Research Journal Volume 13 Number 2 2014 www.wwwords.eu/EERJ REPORT

Searching for Significance in Unstructured Data: text mining with Leximancer

DAVID A. THOMAS
Mathematics Centre, University of Great Falls, USA

ABSTRACT Scholars in many knowledge domains rely on sophisticated information technologies to search for and retrieve records and publications pertinent to their research interests. But what is a scholar to do when a search identifies hundreds of documents, any of which might be vital or irrelevant to his or her work? The problem is further complicated by the unstructured nature of most documents which, unlike databases, are difficult to search systematically. More and more, scholars are turning to content analysis technologies to achieve what they do not have time to do themselves – characterize the global features of a large corpus of work and identify relationships between particular concepts, themes, and methodologies. This article begins with a brief discussion of content analysis as a research methodology. The utility of this methodology is then illustrated using Leximancer, an automated content analysis and concept mapping technology. Against this conceptual and technical background, the value of content analysis is discussed in a variety of academic contexts, including individual and collaborative scholarship and academic advising.

Digital discourse (e.g. cell phones, email, threaded discussions, synchronous chats, and webinars) and documents (e.g. journal articles, conference proceedings papers, eBooks, research databases, and university-based student and administrative records) have all but replaced older communication technologies (e.g. analog phones, snail mail, and hardcopy print) in millions of workplaces and homes. Stored in many formats, digital data are easily transmitted, transformed, edited, and re-purposed. Inexpensive word processing, data management, data analysis, presentation, publication, and social networking technologies have fuelled an exponential growth in digital discourse and publication. For instance, Worldometers [1] estimated that two billion internet users sent 300 million emails per day in 2011. In the same year, more than one million books were published worldwide. To cope with this information explosion, scholars in many knowledge domains rely on sophisticated information technologies to search for and retrieve records and publications pertinent to their research interests. But what is a scholar to do when a search identifies hundreds of documents, any of which might be vital or irrelevant to his or her work? The problem is further complicated by the unstructured nature of most documents which, unlike databases, are difficult to search systematically. More and more, scholars are turning to automated content analysis technologies to achieve what they do not have time to do themselves – characterize the global features of a large corpus of work and identify relationships between particular concepts, themes, and methodologies. A growing number of computer- and network-based technologies are now available for conducting content analyses of textual information published as scholarly articles, dissertations,


http://dx.doi.org/10.2304/eerj.2014.13.2.235

books, conference proceedings, governmental reports, online forums, and so on. Some of the websites focused on identifying, describing, and reviewing these technologies include:
• Online QDA: learning qualitative data analysis on the Web http://onlineqda.hud.ac.uk/Step_by_step_software/index.php
• Computer Assisted Qualitative Data AnalysiS (CAQDAS) http://www.surrey.ac.uk/sociology/research/researchcentres/caqdas/support/choosing/
• Content-analysis.de http://www.content-analysis.de/
• Computer-Assisted Qualitative Data Analysis Software (CAQDAS) http://www.restore.ac.uk/lboro/research/software/caqdas.php
A few of the leading content analysis technologies include:
• ATLAS.ti http://www.atlasti.com/index.html
• NVIVO http://www.qsrinternational.com/
• QDA MINER http://provalisresearch.com/products/qualitative-data-analysis-software/
• MAXQDA http://www.maxqda.com/
This article begins with a brief discussion of content analysis as a research methodology. The utility of this methodology is then illustrated using Leximancer [2], an automated content analysis and concept mapping technology. Against this conceptual and technical background, the value of content analysis is discussed in a variety of academic contexts, including individual scholarship, departmental and interdisciplinary research groups, and graduate student advising.

Content Analysis

According to Smith and Humphreys (2006), there are several reasons why one would want an automated system for content analysis of documents. Researchers are subject to influences that they are unable to report (Nisbett & Wilson, 1977), which may lead to subjectivity in data analysis and the interpretation of findings. Limiting researcher subjectivity often involves extensive investments of time and money to address inter-rater reliability and other sources of bias. One goal of automated content analysis is to reduce this cost and to allow more rapid and frequent analysis and reanalysis of text.
A related goal is to facilitate the analysis of massive document sets and to do so unfettered by a priori assumptions or theoretical frameworks used by the researcher, consciously or unconsciously, as a scaffold for the identification of concepts and themes in the data (Zimitat, 2006). Since textual analysis technologies operate directly on words (as well as other symbols), a rationale for inducing relationships between words is needed. Beeferman observed that words tend to correlate with other words over a certain range within the text stream (Beeferman et al, 1997). Indeed, a word may be defined by its context in usage (Courtial, 1989; Leydesdorff & Hellsten, 2006; Smith & Humphreys, 2006, p. 262; Lee & Jeong, 2008). According to the Leximancer manual (2010), concepts are ‘collections of words that typically travel together throughout the text’. For example, in a document about climate change, the Leximancer concept ‘carbon’ might include the keywords dioxide, carbonate, footprint, and sequester. Leximancer weights these terms according to how frequently they occur in sentences containing the concept, compared to how frequently they occur elsewhere. A sentence (or group of sentences) is only tagged as containing a concept if the accumulated evidence (the sum of the weights of the keywords found) is above a set threshold. Leximancer induces the definition of each concept (i.e. the set of weighted terms) by noting the co-occurrence of words within a ‘sliding window’ as it scans blocks of text a few sentences at a time (Grimbeek et al, 2005; Liesch et al, 2011). These data are used to make two determinations: (1) the most frequently used concepts within a body of text; and, more importantly, (2) the relationships between these concepts (e.g. the co-occurrence


between concepts). During the learning process, words highly relevant to the seed are continuously updated, and eventually form a thesaurus of terms for each concept. This approach to textual analysis is well founded. The repeated co-occurrence of words within a document or a set of documents is a widely accepted principle (Stubbs, 1996; Clandinin & Connelly, 2000; Sowa, 2000). Concepts consisting of co-occurring words do reflect categories that carry meaning (Osgood et al, 1957; Leydesdorff & Hellsten, 2006). For example, a textual analysis of this article would discover that the words ‘content’ and ‘analysis’ frequently occur within the same sentence.

Investigating Professional Knowledge Bases

Leximancer has been used in a number of studies to analyse the content and evolution of professional knowledge bases embodied in journal publications and conference proceedings. Indulska and Recker (2008) studied the progress of design science research in Information Systems (IS) as seen in papers presented at the five main Association for Information Systems (AIS) sponsored IS conferences, namely the Australian Conference on Information Systems (ACIS), the Americas Conference on Information Systems (AMCIS), the European Conference on Information Systems (ECIS), the International Conference on Information Systems (ICIS) and the Pacific Asia Conference on Information Systems (PACIS), in 2005-2007. Their study sought answers to several questions, including: (1) What proportion of papers at IS conferences pertains to design science research?; (2) Is the focus on design science in IS publication outlets increasing over recent years?; (3) What are the main thematic foci of IS design science papers?; and (4) Is design science in IS concentrated within schools in specific geographical areas?
Their analyses found that design science is a growing stream of research in IS and that design science research is widespread in the research domains of process, knowledge and information management. Using a similar approach, Liesch et al (2011) examined articles published in the Journal of International Business Studies (JIBS) from 1970 until 2008. Their findings suggest that the literature has evolved from a formative stage, in which macro-environmental contextual factors were of prime consideration, to its current position, in which the firm, its strategy and its performance have become the dominant interests. Finally, Sadiq et al (2011) used Leximancer to investigate the current landscape of data quality research to create better awareness of synergies between various research communities and, subsequently, to direct attention towards holistic solutions. Their findings include a taxonomy of data quality problems; identification of the top themes, outlets and main trends in data quality research; and a detailed thematic analysis that outlines the overlaps and distinctions between the focus of IS and Computer Science publications. In this study, Leximancer was used to analyse 30 award-winning papers from the SITE 2008, SITE 2009, SITE 2010, and SITE 2011 annual meetings of the Society for Information Technology and Teacher Education (SITE), all of which were downloaded from the EdITLib digital library in 2011 [3] (see Appendix for a list of the papers analysed). For the purposes of this illustration, the number of documents selected was kept small in order to simplify inspection and interpretation of Leximancer’s graphical and tabular outputs. These papers were then stored in four folders on the author’s computer, labelled ‘Site2008’, ‘Site2009’, ‘Site2010’, and ‘Site2011’, within a folder labelled ‘Siteaward’. Prior to undertaking a content analysis of the award papers, the following research questions were asked:
1(a). Which concepts occur with the greatest frequency?
1(b). Which concepts are closely associated with one another?
2. What are the emergent themes?
3. To what extent do papers from different years emphasize similar concepts?

Introduction to Leximancer

According to Smith and Humphreys (2006), Leximancer discovers and extracts thesaurus-based concepts directly from the text data. These concepts are then coded into the text (i.e. tags are

inserted) using the thesaurus as a classifier. This process employs two stages of co-occurrence information extraction – semantic and relational – using a different algorithm for each stage. Clusters of co-occurring concepts are then aggregated into themes. The algorithms used are statistical, but they employ nonlinear dynamics and machine learning. The resulting asymmetric concept co-occurrence information is then used to generate a concept map. Leximancer processes text in a series of eight stages (see Figure 1): Load Data, Pre-Process, Concept Seeds Identification, Edit Emergent Concept Seeds, Develop Concept Thesaurus, Create Compound Concepts, Code Concepts into Text, and Generate Outputs. At each stage, the researcher is prompted to edit the manner in which Leximancer processes the data provided. The editing options provided at each of these stages enable the researcher to, in effect, change the size of the sliding window, add or delete documents, merge similar concepts (e.g. student, students, pupils), delete irrelevant or distracting concepts (e.g. recurring formatting terms like Figure and Table), propose additional concepts, and so on. All such choices are reversible, so experimentation is both possible and desirable in refining the analysis and presentation graphics. In practice, there is no limit to the number of documents that may be analysed using Leximancer, provided that the computer on which the software is running is adequate to the task.

Figure 1. Leximancer control panel.
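The concept-learning machinery described above – weighted keyword thesauri, evidence thresholds for tagging, and sliding-window co-occurrence counting – can be sketched in a few lines of Python. This is a hedged illustration rather than Leximancer's actual implementation: the function names, weights, and threshold are all invented for the example.

```python
# Minimal sketch of Leximancer-style concept tagging and sliding-window
# co-occurrence counting. Function names, weights, and the threshold are
# invented for illustration; they are not Leximancer's internals.
from collections import Counter
from itertools import combinations

def tag_concept(block_words, concept_weights, threshold):
    """Tag a text block with a concept when the summed weight of the
    concept's keywords found in the block meets the threshold."""
    evidence = sum(concept_weights.get(w, 0.0) for w in set(block_words))
    return evidence >= threshold

def cooccurrence_counts(sentences, window=3):
    """Count word co-occurrences inside a window of a few sentences,
    sliding one sentence at a time through the text."""
    counts = Counter()
    for i in range(len(sentences) - window + 1):
        words = set(w for s in sentences[i:i + window] for w in s)
        for a, b in combinations(sorted(words), 2):
            counts[(a, b)] += 1
    return counts

# The article's climate-change example: a 'carbon' concept whose keywords
# travel together in the text (weights invented).
carbon = {"dioxide": 0.9, "carbonate": 0.6, "footprint": 0.5, "sequester": 0.4}
block = ["the", "carbon", "dioxide", "footprint", "is", "rising"]
print(tag_concept(block, carbon, threshold=1.0))  # 0.9 + 0.5 = 1.4 -> True
```

In this sketch, a block containing only ‘sequester’ (weight 0.4) would fall below the threshold and remain untagged, which mirrors the accumulated-evidence rule described in the text.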

Once the analysis is complete, three additional options are available at the bottom of the control panel: Concept Map, Insight Dashboard, and Data Exports. An analysis of the SITE awards data reported in the Insight Dashboard identified 2825 context blocks, 55 concepts, and 4 categories of data corresponding to the four sets of award papers. Figure 2 shows a network-like set of nodes and connecting line segments. In Figure 2, the 55 nodes near the center of the figure correspond to the 55 concepts discovered. Further from the center are 4 clusters associated with the 30 award papers, coded by first author. This presentation separates the papers themselves from the concepts they contain. In Figure 3, the concepts discovered by Leximancer are displayed adjacent to their respective nodes. Note that the concepts teachers, technology, use, learning, and students appear as large, red dots near the center of the map, and the concepts art and gender appear as smaller blue dots on the periphery. In each concept map:
• Green concept labels represent proper names (such as people or locations). Black concept labels refer to other objects, locations, actions, and so on.

• Concept nodes are heat-mapped, in that hot colours (red, orange) denote the most relevant concepts, and cool colours (blue, green) denote the least relevant.
• The brightness of a concept’s label reflects its frequency in the text. The brighter a concept label, the more often the concept is coded in the text.
• Concepts are contextually clustered on the map. That is, concepts that frequently appear together in the text settle close to one another on the map.

Figure 2. Concept and paper nodes.

Figure 3. Emergent concepts.

An alternative look at the discovered concepts is seen in Figures 4 and 5. In Figure 4, the concepts are listed in ranked order based on the number of times they occur (the ‘Count’ column) in the set of papers. This statistic is then converted into a Relevance score, computed as the percentage frequency of text segments coded with that concept relative to the frequency of the most frequent concept in the list. The ten most frequent concepts encountered in this study were teachers, students, use, learning, technology, preservice, classroom, knowledge, education, and study. Figure 4 also shows that the concepts technology and science occur, respectively, 531 and 176 times. When a concept in the ranked list is clicked, such as technology, another screen is displayed (see Figure 5)

showing the co-occurrence of the concept technology with other discovered concepts, such as science. In Figure 5, we see that technology co-occurred with science 71 times. The Likelihood score of 40%, computed as (# co-occurrences of science and technology)/(# occurrences of science) = 71/176, may be interpreted as saying that, when the concept science occurs, the concept technology co-occurs with it 40% of the time. Clicking on the magnifying glass icon to the left of the science concept opens a window in which each of those 71 co-occurrences is displayed as seen in the original paper.

Figure 4. Ranked concepts.

Figure 5. Co-occurring concepts.
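As a check on one’s reading of these tables, the two statistics just described can be reproduced directly from the reported counts. This is a hedged sketch using the formulas as stated in the text; the function names are invented for illustration.

```python
# The Count-to-Relevance and Likelihood computations described above,
# using the counts reported in Figures 4 and 5 (function names invented).

def relevance(count, max_count):
    """A concept's count as a percentage of the most frequent concept's count."""
    return 100.0 * count / max_count

def likelihood(cooccurrences, concept_count):
    """Percentage of a concept's occurrences that co-occur with another concept."""
    return 100.0 * cooccurrences / concept_count

# technology: 531 occurrences; science: 176; co-occurrences: 71.
print(round(likelihood(71, 176)))   # 40, matching the reported Likelihood score
print(round(relevance(176, 531)))   # 33: science's count relative to technology's
```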

The Thesaurus tab shows which words are associated with any given concept. For instance, in Figure 6, the terms associated with the concept technology are listed in the ‘Word’ column with their respective weights. Clicking on the ‘…’ icon beside each Word generates the text passages

from which the Words were drawn (see Figure 7). The ‘Iterations: 7’ notation indicates that 7 iterations of learning were used to generate a stable set of concept definitions.

Figure 6. Thesaurus word definitions.

Figure 7. Thesaurus source texts.

Themes are clusters of concepts. Like concepts, themes are heat-mapped, meaning hot colours (red, orange) denote the most relevant themes, and cool colours (blue, green) denote the least relevant. In this study, Leximancer discovered 56 themes, 33 of which are seen in Figure 8. Theme

connectivity is the summed co-occurrence counts of each concept within the theme with all available concepts.

Figure 8. 33 of 56 discovered themes in tabular format.
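The theme-connectivity measure defined above can be made concrete with a toy example. Only the summing rule comes from the text; the co-occurrence counts below are invented, and this sketch counts a pair of concepts once even when both belong to the theme.

```python
# Theme connectivity: the summed co-occurrence counts of each concept in the
# theme with all available concepts. Counts are invented for illustration;
# a pair internal to the theme is counted once in this sketch.

def theme_connectivity(theme_concepts, cooc):
    """cooc maps (concept_a, concept_b) pairs to co-occurrence counts."""
    return sum(count for (a, b), count in cooc.items()
               if a in theme_concepts or b in theme_concepts)

cooc = {("teachers", "students"): 120, ("teachers", "technology"): 95,
        ("students", "learning"): 80, ("technology", "science"): 71}
print(theme_connectivity({"teachers"}, cooc))  # 120 + 95 = 215
```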

The map seen in Figure 9 shows all 56 themes and suggests a motive for reducing the number of themes displayed: too much information can obscure relationships. The ‘% Theme Size’ slider under the Concept Map may be used to adjust the number and relative sizes (i.e. generality) of the themes displayed, reducing the overall complexity of the display. A more manageable set of themes is presented in tabular form in Figure 10 and in graphical form in Figure 11.



Figure 9. All 56 emergent themes.

Figure 10. Tabular theme display.



Figure 11. Graphical theme display.

Figure 12 overlays concepts with themes. In so doing, it juxtaposes small scale (i.e. concepts) and large scale (i.e. themes) analytic elements. Metaphorically, it is at this point in the analysis that we begin to get a glimpse of the forest implicit in the corpus of work as well as the trees explicit in the individual papers.

Figure 12. Emergent concepts and themes.

We now turn to the individual papers, as represented by their first authors. In Figure 13, the papers are aggregated by year and positioned on the periphery of the network of nodes. In Figure 14, the papers also appear on the periphery, but no longer segregated by year.


Overlaying the paper and author data on the theme representations provides insights into where the themes originate. Figure 15 emphasizes that papers from the 2008 and 2009 proceedings were more closely associated with the teachers and group themes than with the gender, analysis, and online themes. The opposite impression is implicit relative to the 2010 and 2011 proceedings.

Figure 13. Papers by year.

Figure 14. Papers by author.



Figure 15. Emergent themes – papers by year.

In Figure 16, the papers and authors are individually associated with different themes, which still suggests the sort of differential associations by year seen in Figure 15, but with finer resolution. Interpreted in terms of the forest metaphor, think of the concepts as different species of trees (e.g. oak, maple, pine), the themes as different landscapes where they grow (e.g. meadow, hillside, wetland), and the papers and authors as different species of birds (e.g. robins, starlings, sparrows) that nest in the trees. Taken one step further, our metaphor says that while each species of bird (i.e. each author) prefers to nest in a particular species of tree (i.e. pursue a particular research interest) growing in a particular landscape (i.e. situated in a particular knowledge domain), they often take a broad view of things.

Figure 16. Emergent themes – papers by author.


As seen in Figure 17, the concept map uses the distance between concept nodes as a metaphor for the ‘distance’ (i.e. similarities and differences) between the concepts themselves. Recalling that each concept is defined in the Leximancer thesaurus by a list of weighted words, two concepts may be contrasted by comparing their respective word lists. The ‘pathways’ tab provides a graphical interpretation of such comparisons. Clicking two concepts (e.g. classroom and technology) illustrates the pathway between them, along with example text. The relationships between concepts are best thought of as correlations, though the text segments describing the relationship may suggest a direction of cause. Clicking on the ‘more…’ link in the text window in Figure 18 reveals the 89 occasions on which these words occur together. Hence the strong measure of association, 0.81.

Figure 17. Concept pathway graphic.

Figure 18. Concept pathway text.
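The idea of contrasting two concepts by comparing their weighted word lists can be sketched with a standard similarity measure. Leximancer’s own association statistic behind scores such as 0.81 is not specified here, so this example substitutes cosine similarity between thesaurus weight vectors as one plausible stand-in; the concept definitions are invented.

```python
# Contrasting two concepts by comparing their weighted word lists.
# Cosine similarity is used as a stand-in association measure; the
# actual Leximancer statistic may differ.
import math

def cosine_similarity(a, b):
    """Similarity of two {word: weight} thesaurus entries, in [0, 1]."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

classroom = {"lesson": 0.8, "students": 0.7, "activities": 0.5}
technology = {"computer": 0.9, "students": 0.6, "software": 0.5}
print(round(cosine_similarity(classroom, technology), 2))  # shared word: 'students'
```

Concepts whose word lists share no terms score 0, and identical lists score 1, which matches the intuition of nearby versus distant nodes on the map.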

As seen in Figures 19 and 20, a similar procedure can be used to map pathways between entire documents. In this case, the connection is somewhat weaker, with a measure of association of 0.28. In other words, the papers (and perhaps the authors) have little in common.



Figure 19. Paper pathway graphic.

Figure 20. Paper pathway text.



Figure 21. Pathways between award folders and the concepts teachers and students.

Using the ‘pathways’ tab, measures of association were obtained for the concepts teachers and students relative to the award papers grouped by years and between successive years (see Figure 21). The measures of association between years 2009 and 2011 (0.19) and 2008 and 2010 (0.27) were also obtained but are not included in Figure 21, to avoid further cluttering the image. Leximancer’s Insight Dashboard provides a number of other outputs, both tabular and graphical in nature. These outputs make reference to measures identified as frequency, strength, and prominence. Frequency is coded as a conditional probability: the chance that a given attribute is coded in a text excerpt. This corresponds to frequency of mention in the data, and is affected by the distribution of comments across the categories. The strength score is the reciprocal conditional probability: given that a particular attribute is present in a section of text, it gives the probability that the text comes from that category. Strong concepts distinguish the category from others, whether or not the attribute is mentioned frequently. Finally, the prominence score is computed as the product of the strength and frequency scores. Prominent concepts are not incidental; they are important ideas. The ten most prominent concepts among each year’s award-winning papers and the combined list appear in Figure 22. Concepts that appear more than once anywhere in the figure are highlighted. Inspection reveals major differences in these lists.

Figure 22. Ranked concept prominence by year and combined years.
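The frequency, strength, and prominence measures described above reduce to two conditional probabilities and their product, which can be computed directly from excerpt counts. The counts in this sketch are invented, and the function name is illustrative only.

```python
# Insight Dashboard measures as described in the text: frequency is
# P(concept | category), strength is P(category | concept), and prominence
# is their product. The counts below are invented for illustration.

def dashboard_scores(n_both, n_category, n_concept):
    """n_both: excerpts in the category coded with the concept;
    n_category: all excerpts in the category;
    n_concept: all excerpts coded with the concept, any category."""
    frequency = n_both / n_category   # P(concept | category)
    strength = n_both / n_concept     # P(category | concept)
    return frequency, strength, frequency * strength

freq, stren, prom = dashboard_scores(n_both=30, n_category=100, n_concept=40)
print(round(freq, 3), round(stren, 3), round(prom, 3))  # 0.3 0.75 0.225
```

Note how a concept can be frequent within a category yet weak in strength if it is spread evenly across all categories, which is exactly why some frequent concepts fail to appear among the prominent ones.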


Answering the SITE Award Research Questions

The questions asked relative to the SITE Proceedings award papers and their respective answers are as follows:

1(a). Which Concepts Occur with the Greatest Frequency?

From Figure 4, the ten most frequent concepts encountered in this study were teachers, students, use, learning, technology, preservice, classroom, knowledge, education, and study. This is hardly a surprising list for award-winning papers at a SITE Conference. On the other hand, several of these concepts are conspicuous by their absence from the list of prominent concepts found in Figure 22 (e.g. teachers, students, and learning). Since prominence is computed as the product of strength and frequency, we must conclude that these are not strong terms; that is, they are present in a diffuse manner throughout the body of work rather than concentrated in particular sections.

1(b). Which Concepts are Closely Associated with One Another?

The Insight Dashboard report rank-lists the following pairs of concepts as most prominent. Pairs that appear more than once anywhere in Figure 23 are highlighted. The prominence of the concepts professional and development in the 2008 and 2009 papers and that of gender and social in the 2010 and 2011 papers suggests a reason for the apparent separation of papers and concepts suggested in Figure 21; the authors really were addressing different issues.

2. What are the Emergent Themes?

From Figure 8, the top 20 emergent themes in rank order are: teachers, students, use, technology, learning, preservice, classroom, knowledge, study, education, content, lesson, research, experience, development, group, science, professional, school, and pedagogical. Again, this is hardly a surprising list for award-winning papers at a SITE Conference.

3. To What Extent do Papers from Different Years Emphasize Similar Concepts?
From Figures 15, 16, and 21 it is apparent that papers from 2008-2009 shared more in common with each other than with papers in 2010-2011. Pathways computed between several pairs of papers within and between these groups revealed similar patterns. So, there were meaningful similarities and differences between papers from year to year.

Figure 23. Prominent pairs of co-occurring concepts.

Implications for Scholarship, Graduate Student Advising, and Institutional Research

We now turn to the potential value of content analysis in a variety of academic contexts, including faculty scholarship, graduate student advising, and institutional research. Admittedly, many of these thoughts are speculative in nature. But the challenges they address are real. Most scholars are participating and contributing members of scholarly communities (e.g. academic departments, government laboratories, museums, archives) and/or associations (e.g. the American Education Research Association [AERA], the American Mathematical Society [AMS]). One of the functions of professional communities and associations is to sponsor, structure, and moderate an ongoing dialogue among members. Meetings and publications provide mechanisms for vetting research findings. This approach continues to motivate and facilitate the development

and testing of new ideas, so many new ideas that it is all but impossible for modern scholars to ‘keep up’ unaided by sophisticated information technologies. This article suggests how Leximancer might be used to review, not dozens, but hundreds or even thousands of documents, be they scholarly papers, government reports, interview transcripts, or any other textual material. Since most scholars are specialists, their work is situated within a specialized literature typically published in professional journals, conference proceedings, monographs, books, and online forums. Traditionally, scholars have acquired and managed the knowledge base embodied in such hardcopy publications using bookshelves, file cabinets, cross-referencing materials in hardcopy, and memory, the most fluid of all media. With automated content analysis, such as that provided by Leximancer, sophisticated knowledge management is now possible on a scale that has, until recently, been available only to information management and qualitative research specialists. Using Leximancer, individual scholars may create digital folders of seminal documents in their research domains (including raw data, reports, journal articles, books, proceedings, dissertations, and works in progress), and analyse the entire corpus of work in terms of concepts, themes, authors, and their connections. In this approach, both the forest and the boundaries of a knowledge domain may be contemplated holistically. This sort of perspective is important to both faculty and students. For instance, when approached by a graduate student struggling with his or her research proposal, an advisor might point to a content analysis and say with assurance, ‘Here is what matters most to scholars working in this knowledge domain. Start here’. Later on, when the student presents his or her research proposal for review, the content of the proposal may be compared with that of the knowledgebase.
Such comparisons naturally motivate formative questions, such as, ‘How are your research questions and theoretical framework grounded in the literature? How does your literature review compare to those of published papers? Are your research design, sampling, and data analysis procedures consistent with accepted practice? What concepts and/or methods have you included in your proposal not included in the knowledge base? How do you justify doing so?’, and so on. Hypotheses may be formulated and tested relative to differences between the student’s proposal and the knowledgebase as a whole using the Bayes Factor [4] statistic. Here is a bridge between qualitative and quantitative methodologies on which their respective proponents might meet. University administrators are constantly seeking better information relative to the effectiveness of student recruiting, transition to campus life, retention, achievement, and employment. For the most part, these data are gathered using closed response format items (i.e. true–false, multiple choice, Likert scale) and university student records. To date, no golden needle has been found in that haystack. Using content analysis methods, open response items, interview transcripts, and samples of student work can be ‘unpacked’ to find the real roots of success and failure and satisfaction and disappointment for students, faculty, and staff in their respective roles. The problem has always been how to reliably and rapidly analyse tens of thousands of documents. With automated content analysis, authentic student comments and recommendations from many sources may be mined for both insight and factual information. 
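The Bayes Factor mentioned above weighs how well two competing hypotheses explain the same data. As a hedged sketch of how it might apply to the proposal-versus-knowledgebase comparison, the example below tests a concept’s coding rate in a student proposal against two point hypotheses; for point hypotheses the Bayes factor reduces to a likelihood ratio. All counts and rates are invented for illustration.

```python
# A minimal Bayes factor for comparing a concept's coding rate in a student
# proposal against the rate in the knowledgebase. For two point hypotheses
# the Bayes factor is the ratio of binomial likelihoods. Counts invented.
from math import comb

def binomial_likelihood(k, n, p):
    """Probability of k successes in n trials at success rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def bayes_factor(k, n, p_h1, p_h2):
    """BF > 1 favors H1 (rate p_h1); BF < 1 favors H2 (rate p_h2)."""
    return binomial_likelihood(k, n, p_h1) / binomial_likelihood(k, n, p_h2)

# Concept coded in 12 of 80 proposal excerpts; knowledgebase rate 0.10 (H1)
# versus a hypothesized elevated rate of 0.25 (H2).
print(bayes_factor(k=12, n=80, p_h1=0.10, p_h2=0.25) > 1)  # True: data favor H1
```

A full Bayesian treatment would place priors over the rates rather than use point hypotheses, but even this simple form shows how a quantitative comparison between a proposal and a knowledgebase could be made.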
Implications for Collaborative Research and Development

An issue commonly faced by teams of scholars working both within and across disciplines is how to juxtapose their individual interests, expertise, and resources relative to a particular challenge or opportunity situated at the boundaries or intersections of their respective specialties. We now consider a scenario in which automated content analysis might be used to facilitate such collaborations. In this scenario, a group of faculty interested in pedagogical aspects and outcomes of technology implementation in K-12 mathematics undertakes a systematic search of the professional literature. Their goal is to create a knowledgebase in which scholarship in their individual specialties is seen relative to scholarship in other sub-domains of K-12 mathematics education, and to overarching goals and standards (e.g. the Common Core State Standards), high-stakes assessments, and the science, technology, engineering and maths (STEM) education movement. In such a venture, one scholar might focus on instructional uses of geometry modeling technologies (e.g. GeoGebra and the Geometer’s Sketchpad) in grades 6-8. Others might focus on

the use of online applets in the teaching of elementary probability, graphing technologies in high school algebra, capturing and analysing student discourse during problem-solving activities in grades 3-5, and so on. Clearly, a venture of this sort should involve the collaboration of many scholars from several countries. Working individually, each scholar would identify and download journal articles, proceedings papers, white papers, project reports, and other information resources in his or her specialty related to technology implementation in K-12 mathematics (e.g. number and operation, geometry, algebra and functions, probability and statistics, mathematical technologies, mathematical modeling, mathematical communication and discourse, problem-solving, reasoning and proof, teacher preparation and professional development, and research methodology). As specialists working in these areas, these individuals are equipped to do more than locate information; they already know which publications, organizations, and individuals are respected and trusted by their colleagues. They also know what sort of information they would value in a knowledgebase of their specializations. In this approach, participating scholars serve as a quality control filter. Once selected, documents are downloaded and saved in folders named for the sub-domains from which they were drawn (e.g. geometry). Within these folders, each document may be renamed using the lead author’s institutional affiliation (e.g. Montana State University [MSU], University of Nevada Reno [UNR], National Science Foundation [NSF], etc.), a term characterizing the focus of the technology implementation (e.g. modeling, concepts, practice, assessment, communication, etc.), the funding agency supporting the work, or any other designation that may, in the content analysis, facilitate the recognition of meaningful associations.
Multiple files associated with the same document may be indexed as 'MSU1', 'modeling1', and so on. At this point, each specialist has a folder containing the information resources he or she believes should be included in a knowledgebase focused on his or her interests.

This is also the point at which Leximancer training would be most meaningful for individual scholars. Individual and small-group training could be provided in face-to-face settings or via desktop conferencing and application-sharing technologies. Initial training would culminate in a content analysis of the information resources each scholar has gathered. Subsequent training might focus on the continuous development and reanalysis of the knowledgebase as new research and other information resources become available.

When the individual folders of all sub-domains are aggregated and analysed, an overview of scholarship on technology implementation in K-12 mathematics education should emerge that transcends the analyses by individual scholars; that is, the whole will be more than the sum of its parts. And as participating scholars continuously update their knowledgebases and share those updates with the group as a whole, the overview will also evolve.

This article does not discuss the dissemination of emergent knowledgebases. That is itself a complex topic, to be addressed in a subsequent publication. Suffice it to say, there are strategies and supporting technologies for doing so, all of which would require time, talent, and funding to develop and maintain. This article focuses on a process by which a relatively small group of scholars could create an information resource that would benefit all mathematics educators, or any group interested in applying this approach to its own knowledge domain.

Conclusion

The exponential growth of scholarship is making it more and more difficult for scholars to identify and digest new information essential to their research.
Among the many software technologies available to assist in the analysis of large data sets, Leximancer is one of the few that automatically perform the most complex and time-consuming aspects of content analysis. While the use of automation may be objectionable to some qualitative researchers, it opens up content analysis to a much wider audience of scholars. This article illustrates the use of Leximancer in the analysis of a set of conference proceedings papers. It then discusses the general implications of automated content analysis for individual faculty scholarship, academic advising, and institutional research. The implications for collaborative research are illustrated in a scenario focused on a review of the professional literature

on technology implementation in K-12 mathematics. This author would like to see that research happen and would like to participate in it. Would you? Readers are invited to contact the author to explore that possibility.

Notes

[1] Worldometers (2011) Real Time World Statistics. http://www.worldometers.info
[2] Leximancer (2010) Leximancer Manual v. 3.5. https://www.leximancer.com/
[3] Educational & Information Technology Digital Library (EdITLib). See http://www.editlib.org/
[4] http://en.wikipedia.org/wiki/Bayes_factor

References

Beeferman, D., Berger, A. & Lafferty, J. (1997) A Model of Lexical Attraction and Repulsion, in P.R. Cohen & W. Wahlster (Eds) Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pp. 373-380. Madrid: Association for Computational Linguistics.
Clandinin, D.J. & Connelly, F.M. (2000) Narrative Inquiry: experience and story in qualitative research. San Francisco, CA: Jossey-Bass Publishers.
Courtial, J.P. (1989) Qualitative Models, Quantitative Tools and Network Analysis, Scientometrics, 15(5-6), 527-534. http://dx.doi.org/10.1007/BF02017069
Grimbeek, P., Bryer, F., Davies, M. & Bartlett, B. (2005) Themes and Patterns in 3 Years of Abstracts from the 'International Conference on Cognition, Language, and Special Education Research' Identified by Leximancer Analysis, in B. Bartlett, F. Bryer & D. Roebuck (Eds) Stimulating the Action as Participants in Participatory Research, vol. 2, pp. 101-113. Brisbane, Australia: Griffith University, School of Cognition, Language, and Special Education.
Indulska, M. & Recker, J.C. (2008) Design Science in IS Research: a literature analysis, in S. Gregor & S. Ho (Eds) Proceedings of the 4th Biennial ANU Workshop on Information Systems Foundations, Canberra.
Lee, B. & Jeong, Y.-I. (2008) Mapping Korea's National R&D Domain of Robot Technology by Using Co-word Analysis, Scientometrics, 77(1), 3-19. http://dx.doi.org/10.1007/s11192-007-1819-4
Leydesdorff, L. & Hellsten, L. (2006) Measuring the Meaning of Words in Contexts: an automated analysis of controversies about 'Monarch butterflies', 'Frankenfoods', and 'stem cells', Scientometrics, 67(2), 231-258. http://dx.doi.org/10.1007/s11192-006-0096-y
Liesch, P., Håkanson, L., McGaughey, S., Middleton, S. & Cretchley, J. (2011) The Evolution of the International Business Field: a scientometric investigation of articles published in its premier journal, Scientometrics, 88(1), 17-42. http://dx.doi.org/10.1007/s11192-011-0372-3
Nisbett, R.E. & Wilson, T.D. (1977) Telling More than We Can Know: verbal reports on mental processes, Psychological Review, 84(3), 231-259. http://dx.doi.org/10.1037/0033-295X.84.3.231
Osgood, C.E., Suci, G.J. & Tannenbaum, P.H. (1957) The Measurement of Meaning. Urbana, IL: University of Illinois Press.
Sadiq, S., Yeganeh, N. & Indulska, M. (2011) 20 Years of Data Quality Research: themes, trends and synergies, in Heng Tao Shen & Yanchun Zhang (Eds) Conferences in Research and Practice in Information Technology, pp. 1-10. Proceedings of the 22nd Australasian Database Conference, Perth, 17-20 January.
Society for Information Technology & Teacher Education (SITE) (2008) K. McFerrin, R. Weber, R. Carlsen & D.A. Willis (Eds) Proceedings of the Society for Information Technology & Teacher Education International Conference 2008. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Society for Information Technology & Teacher Education (SITE) (2009) I. Gibson, R. Weber, K. McFerrin, R. Carlsen & D.A. Willis (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2009. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Society for Information Technology & Teacher Education (SITE) (2010) D. Gibson & B. Dodge (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2010. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Society for Information Technology & Teacher Education (SITE) (2011) M. Koehler & P. Mishra (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2011. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).


Smith, A. & Humphreys, M. (2006) Evaluation of Unsupervised Semantic Mapping of Natural Language with Leximancer Concept Mapping, Behavior Research Methods, 38(2), 262-279. http://dx.doi.org/10.3758/BF03192778
Sowa, J.F. (2000) Knowledge Representation: logical, philosophical, and computational functions. Pacific Grove, CA: Brooks Cole.
Stubbs, M. (1996) Text and Corpus Analysis: computer-assisted studies of language and culture. Oxford: Blackwell.
Zimitat, C. (2006) A Lexical Analysis of 1995, 2000 and 2005 Ascilite Conference Papers, in L. Markauskaite, P. Goodyear & P. Reimann (Eds) Proceedings of the 23rd Annual Conference of the Australasian Society for Computers in Learning in Tertiary Education. Who's Learning? Whose Technology?, pp. 947-951. Sydney: Sydney University Press.

APPENDIX. SITE Award Papers 2008-2011

Browne, J. (2009) An IRT Analysis of Preservice Teacher Self-efficacy in Technology Integration, in I. Gibson, R. Weber, K. McFerrin, R. Carlsen & D.A. Willis (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2009, pp. 831-844. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Cameron, L. (2008) LAMS: pre-service teachers update the old lesson plan, in K. McFerrin, R. Weber, R. Carlsen & D.A. Willis (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2008, pp. 2517-2524. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Carrington, L., Kervin, L. & Ferry, B. (2009) Enhancing the Development of Pre-service Teacher Professional Identity Through the Use of a Virtual Learning Environment, in I. Gibson, R. Weber, K. McFerrin, R. Carlsen & D.A. Willis (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2009, pp. 1402-1409. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Cavanaugh, C., Dawson, K. & Ritzhaupt, A. (2008) Conditions, Processes and Consequences of 1:1 Computing in K-12 Classrooms: the impact on teaching practices and student achievement, in K. McFerrin, R. Weber, R. Carlsen & D.A. Willis (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2008, pp. 1956-1963. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Cavin, R. (2008) Developing Technological Pedagogical Content Knowledge in Preservice Teachers Through Microteaching Lesson Study, in K. McFerrin, R. Weber, R. Carlsen & D.A. Willis (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2008, pp. 5214-5220. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Chen, D. & Couse, L. (2009) Exploring the Viability of Tablet Computers in Early Education: considering the principles of developmentally appropriate practice, in I. Gibson, R. Weber, K. McFerrin, R. Carlsen & D.A. Willis (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2009, pp. 3251-3254. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Duran, M. & Brunvand, S. (2008) Preparing Science Teachers to Teach with Technology: exploring collaborative approaches, in K. McFerrin, R. Weber, R. Carlsen & D.A. Willis (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2008, pp. 3319-3324. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Fluck, A., Ranmuthugala, D., Chin, C. & Penesis, I. (2011) Calculus in Elementary School: an example of ICT-based curriculum transformation, in M. Koehler & P. Mishra (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2011, pp. 3203-3210. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Hadjerrouit, S. (2010) An Empirical Evaluation of Technical and Pedagogical Usability Criteria for Web-Based Learning Resources with Middle School Students, in D. Gibson & B. Dodge (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2010, pp. 2231-2238. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Hall, B. (2008) Practicing Teachers' Advice to Pre-service Teachers on Technology Skills Needed in the Classroom, in K. McFerrin, R. Weber, R. Carlsen & D.A. Willis (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2008, pp. 3777-3781. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).


Harris, J., Grandgenett, N. & Hofer, M. (2010) Testing a TPACK-Based Technology Integration Assessment Rubric, in D. Gibson & B. Dodge (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2010, pp. 3833-3840. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Herrick, M., Lin, M.F.G. & Huei-Wen, C. (2011) Online Discussions: the effect of having two deadlines, in M. Koehler & P. Mishra (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2011, pp. 344-351. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Hoban, G., McDonald, D. & Ferry, B. (2009) Improving Preservice Teachers' Science Knowledge by Creating, Reviewing and Publishing Slowmations to Teacher Tube, in I. Gibson, R. Weber, K. McFerrin, R. Carlsen & D.A. Willis (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2009, pp. 3133-3140. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Jones, G. & Warren, S. (2008) Demonstration and Discussion of a 3D Online Learning Environment for Literacy, in K. McFerrin, R. Weber, R. Carlsen & D.A. Willis (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2008, pp. 1695-1698. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Kidd, J., O'Shea, P., Baker, P., Kaufman, J. & Allen, D. (2008) Student-authored Wikibooks: textbooks of the future?, in K. McFerrin, R. Weber, R. Carlsen & D.A. Willis (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2008, pp. 2644-2647. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Krauskopf, K., Zahn, C. & Hesse, F.W. (2011) Leveraging the Affordances of YouTube: pedagogical knowledge and mental models of technology affordances as predictors for pre-service teachers' planning for technology integration, in M. Koehler & P. Mishra (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2011, pp. 4372-4379. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Lai, A. (2010) An Examination of Technology-Mediated Feminist Consciousness-raising in Art Education, in D. Gibson & B. Dodge (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2010, pp. 3013-3020. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Magana de Leon, A., Brophy, S. & Newby, T. (2009) Pre-service Teachers' Perceptions of Web-based Interactive Media: three different tools, one learning goal, in I. Gibson, R. Weber, K. McFerrin, R. Carlsen & D.A. Willis (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2009, pp. 1502-1509. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Magoun, A.D., Hare, D., Saydam, A., Owens, C. & Smith, E. (2010) Assessing the Performance of College Algebra Using Modularity and Technology: a two-year case study of NCAT's supplemental model, in D. Gibson & B. Dodge (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2010, pp. 3471-3478. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Mills, L., Wakefield, J., Najmi, A., et al (2011) Validating the Computer Attitude Questionnaire NSF ITEST (CAQ N/I), in M. Koehler & P. Mishra (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2011, pp. 1572-1579. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Moore, J. (2008) The BRIDGE – teacher-directed resources and learning communities, in K. McFerrin, R. Weber, R. Carlsen & D.A. Willis (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2008, pp. 2737-2739. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Moreno, R., Abercrombie, S. & Hushman, C. (2009) Using Virtual Classroom Cases as Thinking Tools in Teacher Education, in I. Gibson, R. Weber, K. McFerrin, R. Carlsen & D.A. Willis (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2009, pp. 2615-2622. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Mumford, J., Harold, C. & Dunnerstick, R. (2008) The eVolution of Teacher Preparation Portfolios: an agenda for research, in K. McFerrin, R. Weber, R. Carlsen & D.A. Willis (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2008, pp. 112-116. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
O'Byrne, B., Bailey, D. & Murrell, S. (2011) Literacy in Multimedia Environments: preliminary findings, in M. Koehler & P. Mishra (Eds) Proceedings of Society for Information Technology & Teacher Education


International Conference 2011, pp. 1600-1606. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Plonczak, I., Joseph, R. & Stemn, B. (2009) Enhancing Preservice Elementary Teachers' Field Placements in Math and Science through Videoconferencing, in I. Gibson, R. Weber, K. McFerrin, R. Carlsen & D.A. Willis (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2009, p. 4218. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Seo, H.J. & Baek, Y. (2010) The Effects of Fantasy in an Educational Game via Interest, Intrinsic Motivation, and Storytelling on Student's Academic Achievements: a path analysis, in D. Gibson & B. Dodge (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2010, pp. 2056-2063. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Shin, T., Koehler, M., Mishra, P., et al (2009) Changing Technological Pedagogical Content Knowledge (TPACK) through Course Experiences, in I. Gibson, R. Weber, K. McFerrin, R. Carlsen & D.A. Willis (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2009, pp. 4152-4159. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Skoretz, Y. & Cottle, A. (2011) Meeting ISTE Competencies with a Problem-Based Video Framework, in M. Koehler & P. Mishra (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2011, pp. 1211-1217. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Trautman, N. & MaKinster, J. (2008) Flexibly Adaptive Professional Development for Teaching Science with Geospatial Technology, in K. McFerrin, R. Weber, R. Carlsen & D.A. Willis (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2008, pp. 4791-4805. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).
Trombley, A. (2011) Virtual Gender Roles: is gender a better predictor of internet use than sex?, in M. Koehler & P. Mishra (Eds) Proceedings of Society for Information Technology & Teacher Education International Conference 2011, pp. 789-795. Chesapeake, VA: Association for the Advancement of Computing in Education (AACE).

DAVID A. THOMAS is director of the Mathematics Centre at the University of Great Falls, USA. His teaching and research interests include geometry, geometry technologies, and the analysis of unstructured data using social network and content analysis methodologies and technologies. Correspondence: [email protected]
