ACCURACY OF TEXTUAL DOCUMENT CLUSTERING WITH SEMANTIC APPROACH

SK AHAMMAD FAHAD

MASTER OF SCIENCE IN INFORMATION AND COMMUNICATION TECHNOLOGY FACULTY OF COMPUTER AND INFORMATION TECHNOLOGY

2017/1438H


ACCURACY OF TEXTUAL DOCUMENT CLUSTERING WITH SEMANTIC APPROACH

SK AHAMMAD FAHAD MIT153BL308

Thesis submitted in fulfillment of the requirements for the degree of Master of Science in Information and Communication Technology, Faculty of Computer and Information Technology

Supervised by: Assoc. Prof. Dr. Wael Mohamed Shaher Yafooz

September 2017/ Zulhajah 1438H


ABSTRACT

In the age of information technology, textual documents are growing rapidly across the internet, e-mail, web pages, offline and online reports, journals, and articles, and they are stored in electronic database formats. Millions of new text files are created every day, but without proper classification people lose access to vast amounts of information that would be useful for many challenges in daily life. Maintaining and accessing these documents is very difficult without adequate organization; grouping documents when no class information is provided is called clustering. K-means and other classical clustering algorithms are ill-suited to natural-language text: text corpora are high-dimensional, documents carry logical structure clues, and recent segmentation techniques have taken advantage of advances in generative topic modeling algorithms, specifically designed to spot topics within text and compute word-topic distributions. Considering these challenges, this thesis proposes a semantic document clustering framework, developed on the Python platform and tested step by step. Pre-processing comprises tag elimination, stop-word removal according to the Oxford dictionary list, and lemmatization supported by the semantic information and synsets that WordNet provides for each word of the raw text. Given the limitations of K-means and other classical algorithms, the COBWEB conceptual clustering algorithm is applied to the pre-processed data. Cluster quality and accuracy are among the most significant contributions of this research: the F-measure was selected to evaluate the clusters and report their accuracy, ensuring the purity of the clustering process.

The framework was tested on 20 samples of 20 different articles; taking the minimum accuracy as the accuracy of the clusters, the developed system returns 71.42% accuracy. Several challenges remain for further experiments, such as synonymy, high dimensionality, extracting core semantics from texts, and assigning appropriate descriptions to the generated clusters. This research works toward an accurate way to cluster text documents based on semantic meaning with the help of the WordNet database.

vi

ACKNOWLEDGEMENT

First and foremost, I pay my heartiest gratitude to the almighty ALLAH for keeping me sound and healthy during the thesis period and giving me the ability to complete my research successfully. I would also like to show sincere gratitude to my research supervisor, Dr. Wael M.S. Yafooz. Without his guidance and dedicated involvement in every step of the process, this thesis would never have been accomplished; I thank him very much for his support and understanding over these past several months. His direction helped me throughout the research and the writing of this thesis, and I could not have imagined a better advisor and mentor for my M.Sc. study. His cheerful speech inspires me every time; whenever I need inspiration, I only need to meet with him. I thank all the teachers of the Faculty of Computer and Information Technology and the staff of MEDIU for their constructive criticism, valuable advice, and cordial cooperation during my study at the university. I would especially like to thank Dr. Mamoun Mohamad Jamous and Dr. Shadi M.S. Hilles, whose friendly guidance helped me through the whole graduate session; whenever I faced an academic issue, these two great people solved my problems like guardians and like friends. I remember my supervisor of amendments, Dr. Yousef Abu Baker El-Ebiary, who is not only a teacher but a man who can make anyone smile within minutes; from my first day at MEDIU his personality has held my attention. I also remember my beloved teacher Dr. Najeeb Abbas Al-Sammarraie for his kindness to me. My sincere thanks to Dr. Zainab Binti Abu Bakar for her efforts and dedication during the amendment period; her experienced direction gave me proper guidance and helped me learn a lot. I show gratitude to my beloved parents, Sk. Abul Kashem and Ainun Naher, for their guidance, love, and support throughout my whole life. My parents are my lifeline; they are the best in the universe. I remember my sister Theba and brother Faisal; because of them I have never felt alone in any situation. I remember my friend Jasim Uddin, who was with me, supportive and dedicated, through the entire graduate program. I remember Mr. Mahbub Alam: not only a teacher, he has a magical touch that can turn a student into a bright star. I dedicate my thesis to my nephew, Almas, and niece, Suhaila.

TABLE OF CONTENT

TITLE PAGE
CERTIFICATION OF DISSERTATION WORK PAGE
DECLARATION
COPYRIGHT
ABSTRACT
ACKNOWLEDGEMENT
TABLE OF CONTENT
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS

CHAPTER ONE: INTRODUCTION
1.1 Introduction
1.2 Problem Statement
1.3 Research Questions
1.4 Research Objectives
1.5 Challenges
1.6 Thesis Contribution
1.7 Acquainting of this Book

CHAPTER TWO: LITERATURE REVIEW
2.1 Introduction
2.2 Classification
    2.2.1 Classification and Clustering
2.3 Clustering
    2.3.1 Type of Clustering
    2.3.2 Type of Clusters
        2.3.2.1 Well-Separated
        2.3.2.2 Prototype-Based
        2.3.2.3 Graph-Based
        2.3.2.4 Density-Based
        2.3.2.5 Shared-Property
2.4 Textual Document Clustering
    2.4.1 Document Clustering
    2.4.2 Different Document Clustering
        2.4.2.1 Hierarchical and Flat Clustering
        2.4.2.2 Online and Offline Clustering
        2.4.2.3 Hard and Soft Clustering
        2.4.2.4 Documents-based & Keyword-based Clustering
2.5 Algorithms for Document Clustering
    2.5.1 Term Frequency
    2.5.2 Inverse Document Frequency
    2.5.3 Term Frequency-Inverse Document Frequency
    2.5.4 Chi Square Statistic
    2.5.5 Frequent Term-Based Text Clustering
    2.5.6 Frequent Term Sequence
2.6 Semantic Document Clustering
2.7 Conceptual Clustering
    2.7.1 Conceptual Clustering Algorithms
    2.7.2 COBWEB Algorithm
2.8 Lexical Database
    2.8.1 Lexical Overview
    2.8.2 Different Lexical Databases
    2.8.3 WordNet
2.9 F-measure
    2.9.1 Field of F-Measure
    2.9.2 Derivation of F-Measure
2.10 Summary of Literature Review

CHAPTER THREE: METHODOLOGY
3.1 Introduction
3.2 Research Methodology
    3.2.1 Sample Data
3.3 Framework: Steps
3.4 Pre-Processing
3.5 Stemming and Lemmatizing
3.6 Clustering
3.7 Accuracy

CHAPTER FOUR: TESTING AND RESULT ANALYSIS
4.1 Environmental Setup
    4.1.1 Hardware
    4.1.2 Soft Tools
    4.1.3 Sample Data
4.2 Document Pre-processing
4.3 Clustering and Accuracy
4.4 Discussion

CHAPTER FIVE: CONCLUSION
5.1 Conclusion
5.2 Future Work

REFERENCES


LIST OF FIGURES

Figure: 2.1 Classification tree
Figure: 2.2 Data Clustering
Figure: 2.3 Hierarchical Sequence of Various Clustering
Figure: 2.4 Different types of a cluster as illustrated by sets of two-dimensional points
Figure: 2.5 Hierarchical Clustering
Figure: 2.6 Soft and Hard Clustering
Figure: 2.7 The COBWEB Tree
Figure: 3.1 Steps of framework
Figure: 3.2 Flow-chart to remove tags from input text
Figure: 3.3 Flow-chart to get noise-free tokens
Figure: 3.4 Flow-chart to remove stop words and get fresh tokens
Figure: 3.5 Flow-chart of the stemming process
Figure: 4.1 WordNet synset replaced by root word for sample 1
Figure: 4.2 Lemmatization applied to all tokens for sample 1
Figure: 4.3 Document pre-processing graph (tokens)


LIST OF TABLES

Table: 2.1 Difference between Clustering and Classification
Table: 3.1 Five steps of Research Methodology
Table: 3.2 List of stop words
Table: 3.3 Tokens with frequency, ready for the clustering operation
Table: 4.1 Sample text file report after tokenization
Table: 4.2 Number of tokens removed and tokens left for processing after removing stop words
Table: 4.3 Token frequency with each token's source file (frequencies greater than three listed)
Table: 4.4 Clusters with their members
Table: 4.5 Clusters with the F-Measure accuracy


LIST OF ABBREVIATIONS

API - Application Programming Interface
BOW - Bag of Words
DGD - Dependency Graph-based Document
DIG - Document Index Graph
DRCC - Dual Regularized Co-Clustering
EL - Entity Lexicon
EoR - Entity-oriented Retrieval
HCMSL - High-order Co-clustering via Multiple Subspaces Learning
HOCC - High-Order Co-Clustering
IDE - Integrated Development Environment
IE - Information Extraction
IM - Intent Modifier
IR - Information Retrieval
IT - Intent Type
LDA - Latent Dirichlet Allocation
LSI - Latent Semantic Indexing
NIST - National Institute of Standards and Technology
NLP - Natural Language Processing
NLTK - Natural Language Toolkit
NP - Noun Phrase
RMC - Relational Manifold Co-clustering
TF-IDF - Term Frequency-Inverse Document Frequency
VSM - Vector Space Model
WN - WordNet
WSD - Word Sense Disambiguation

CHAPTER ONE INTRODUCTION

1.1 INTRODUCTION

A substantial part of the available data is held in text databases, which comprise vast collections of documents from varied sources such as news articles, research papers, books, digital libraries, e-mail messages, and websites. Text documents are proliferating as a result of the increasing amount of information available in electronic and digitized form, such as electronic publications, various styles of electronic records, e-mail, and the World Wide Web. Recently, most information about government, industry, business, and other institutions has been held electronically, in the form of text databases. Most of the text in these databases is semi-structured: neither fully structured nor fully free-form. Document pre-processing and clustering are very helpful tools in today's world, where a great many documents are stored and retrieved electronically. Because text data are inherently unstructured, researchers have applied many different techniques to document management. Researchers have presented knowledge-discovery systems that use knowledge extraction to derive interesting expertise and knowledge from unstructured text collections. To change the representation, and for an efficient transformation, word frequencies should be normalized with respect to their relative frequencies within a document and over the entire collection. Organizing a large text corpus makes future navigation and document browsing easier, friendlier, and more economical. It is hardly possible for a human to read through all the text documents, identify those relevant to a selected topic, and then organize such a large collection. Organizing a large amount of knowledge in a structured format lets processing techniques use or extract the desired information from otherwise unstructured document collections. For this reason, text mining techniques are useful for processing these documents.
The purpose of text mining is to structure textual document collections so as to improve users' ability to retrieve and apply the information implicitly contained in those collections. Text mining proceeds through different phases to reach this goal: pre-processing, use of WordNet, and term selection. Attribute and dimension reduction are necessary limits in text mining: term-selection methods improve performance by reducing dimensions, so text mining processes information with a reduced number of terms and accuracy improves. WordNet is the product of a research project at Princeton (Miller, 1990; Bowman, Angeli, Potts & Manning, 2015) that has tried to model the lexical knowledge of a native speaker of English. In this approach the documents are first pre-processed and their dimensionality reduced; WordNet is applied to categorize words as nouns, verbs, adverbs, and adjectives so that a better clustering algorithm can follow, and term-selection methods are then used. Documents are pre-processed in several steps: first, stop words are removed; second, stemming is performed using Porter's stemming rule; third, WordNet senses are applied; and finally, distinctive and common word sets are generated using feature-selection approaches. Traditional clustering methods are not effective on textual data, so a skillful clustering technique that can create real clusters on natural-language text had to be chosen. This thesis selected conceptual clustering and, in particular, the COBWEB clustering algorithm. Cluster accuracy is measured by the F-measure, which considers both the precision and the recall of the test to compute the score: precision is the number of correct positive results divided by the number of all positive results, and recall is the number of correct positive results divided by the number of positive results that should have been returned. The F-measure can be understood as a weighted average of precision and recall, reaching its best value at one and its worst at zero.

1.2

PROBLEM STATEMENT

Large corpora are high-dimensional with respect to words, and documents are sparse and of different lengths (Aggarwal & Zhai, 2012; Zhai & Massung, 2016). Several researchers have recognized that partitional clustering algorithms are well suited to clustering large document data sets thanks to their relatively low computational requirements (Charu, Cecilia, Joel, Phili & Jong, 1999; Steinbach, Karypis & Kumar, 2000). The presence of logical structure clues inside a document, together with scientific criteria and mathematical similarity measures, is mainly used to find thematically coherent, contiguous text blocks in unstructured documents (Lee, Han & Whang, 2007; Hung, Peng & Lee, 2015; MacQueen, 1967; Ferrari & De Castro, 2015). Some approaches use PLSA to compute word-topic distributions, fold in those distributions at the block level, and then choose segmentation points based on the similarity values of adjacent block pairs (Sun, Li, Luo & Wu, 2008; Zhang, Kang, Qian & Huang, 2014; Rangel, Faria, Lima & Oliveira, 2016). Others use LDA on a corpus of segments, compute inter-segment similarities via a Fisher kernel, and optimize segmentation via dynamic programming, or use a document-level LDA model, treat sections as new documents, predict their LDA models, and then segment via dynamic programming with probabilistic scores (Misra, Yvon, Jose & Cappe, 2009; Glavaš, Nanni & Ponzetto, 2016). Considering this discussion, four points were selected as the problems of this thesis; they are listed below:

1. It is a challenge to find the useful data in large document collections (Aggarwal & Zhai, 2012; Zhai & Massung, 2016).
2. Traditional document clusters are high-dimensional with respect to text (Misra et al., 2009; Glavaš, Nanni & Ponzetto, 2016).
3. The presence of logical structure clues within a document, scientific criteria, and applied mathematical similarity measures are chiefly used to find thematically coherent, contiguous text blocks in unstructured documents (Sun et al., 2008; Zhang et al., 2014; Rangel et al., 2016).
4. Recent segmentation techniques have taken advantage of advances in generative topic modeling algorithms, specifically designed to spot topics within text and compute word-topic distributions (Lee, Han & Whang, 2007; Hung, Peng & Lee, 2015).

1.3

Research Questions

There are three research questions:
1. What is semantic textual document clustering, and what is special about semantic relations in textual document clustering?
2. How can a proper semantic document clustering method be modeled, analyzed, and developed?
3. What are the results of testing and analyzing the proposed clustering method?


1.4

Research Objectives

The objectives are as follows:
1. To study the existing tools and techniques of semantic textual document clustering.
2. To propose and develop a model for semantic textual document clustering.
3. To test the proposed model.

1.5

Challenges

Document clustering has been studied for many decades, but it is still far from a trivial and solved problem. The challenges are:

• Choosing appropriate features of the documents to use for clustering.
• Choosing an appropriate similarity measure between documents.
• Choosing an appropriate clustering method utilizing the above similarity measure.
• Implementing the clustering algorithm in an efficient way with respect to required memory and CPU resources.
• Finding ways of assessing the quality of the performed clustering.

1.6

Thesis Contribution

The contribution is given in two perspectives:



• Applying the COBWEB conceptual clustering algorithm to semantically pre-processed tokens.



• Applying the F-measure accuracy technique to the individual semantic textual document clusters.
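The F-measure named in the second contribution combines precision and recall, as defined in Section 1.1 and derived in Section 2.9. A minimal sketch of how a single cluster might be scored against a reference class follows; the document IDs are illustrative, not taken from the thesis samples:

```python
# F-measure of one cluster against one reference class.
# Precision: correct members / all members placed in the cluster.
# Recall:    correct members / all members of the reference class.
def f_measure(cluster, reference):
    cluster, reference = set(cluster), set(reference)
    true_positives = len(cluster & reference)
    if true_positives == 0:
        return 0.0  # worst value: nothing correctly clustered
    precision = true_positives / len(cluster)
    recall = true_positives / len(reference)
    # Harmonic mean: best value 1.0, worst 0.0.
    return 2 * precision * recall / (precision + recall)

# Illustrative documents: the cluster caught d1 and d2 correctly,
# missed d4 and d5, and wrongly included d3.
score = f_measure(["d1", "d2", "d3"], ["d1", "d2", "d4", "d5"])
print(round(score, 4))  # precision 2/3, recall 1/2 -> 0.5714
```

In the thesis the minimum per-cluster accuracy over the 20 samples is taken as the reported accuracy of the system (71.42%).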

1.7

Acquainting of this Book

This thesis is organized into five chapters, with references and appendices. The five chapters are:

• CHAPTER 1: Introduction
• CHAPTER 2: Literature Review
• CHAPTER 3: Methodology
• CHAPTER 4: Testing and Result Analysis
• CHAPTER 5: Conclusion

CHAPTER 1: Introduction. Chapter One introduces the thesis with a simple overview. It starts with the Introduction, which gives a brief account of the elements and motivation of the thesis. The Problem Statement then focuses on the real-life problems that are the reason for the thesis. The Research Questions are the main questions that the whole thesis answers, and the Research Objectives describe the objectives the thesis sets out to reach. The last part of Chapter One, Challenges, lists points that foreshadow the challenges ahead.

CHAPTER 2: Literature Review. The second chapter is a literature review; every statement related to the thesis is covered here. It starts with classification, describing what classification and clustering are, then builds a clear conception of clustering, the types of clustering, and the types of clusters, and discusses the areas of clustering, including partitional and hierarchical clustering, the two major types. It then reaches textual document clustering: document clustering, different kinds of document clustering, algorithms for document clustering, and semantic document clustering. With that background it steps ahead to conceptual clustering and its algorithms, and gives a brief account of the COBWEB algorithm, which is the algorithm to be implemented. After that it presents the concept of a lexical database and the various lexical databases available, then the chosen lexical database, WordNet. Finally it comes to accuracy measurement, clarifying the concept of the F-measure and its field of application, and closes the chapter with the derivation of the F-measure.

CHAPTER 3: Methodology. This chapter gives every detail of the proposed framework and its development. It first gives an overview of the proposed model and then proceeds step by step, narrating the model at each stratum: first the details of pre-processing, then the stemming and lemmatizing process, which uses WordNet, then a brief account of the clustering process of the proposed method, which follows the COBWEB steps. At the end of Chapter Three the accuracy of the clusters produced by the proposed method is discussed.

CHAPTER 4: Testing and Result Analysis. This chapter reports everything about testing the proposed development. It starts with the environmental setup for testing, then reports the results of the document pre-processing phase, and finally the testing results for clustering and accuracy.

CHAPTER 5: Conclusion. The last chapter concludes the thesis. It draws conclusions about the proposed framework and development and then gives some ideas for future research on this topic.
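The pre-processing stages that Chapter Three details (tag removal, noise removal, stop-word removal) can be sketched with the Python standard library alone. The stop-word list here is a tiny illustrative subset of the real one, and the stemming and WordNet lemmatization steps are omitted since they rely on NLTK rather than the stdlib:

```python
import re

# Tiny illustrative stop-word subset; the framework uses a much
# fuller list (Table 3.2).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}

def strip_tags(text):
    """Tag elimination: drop HTML/XML tags from the raw input."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Noise removal: lower-case and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Stop-word removal: keep only the 'fresh' content tokens."""
    return [t for t in tokens if t not in STOP_WORDS]

raw = "<p>The clustering of the documents is performed in Python.</p>"
fresh = remove_stop_words(tokenize(strip_tags(raw)))
print(fresh)  # ['clustering', 'documents', 'performed', 'python']
```

In the actual framework these fresh tokens then pass through stemming and WordNet lemmatization before being handed to the COBWEB clustering step.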


CHAPTER TWO LITERATURE REVIEW

2.1

Introduction Cluster analysis is one amongst the foremost necessary data processing

strategies. It is a central downside in information management. Clustering is one altogether the first data analysis techniques and deals with organizing a bunch of objects during a three-dimensional space into cohesive groups named as clusters for higher management and navigation (Oikonomakou & Vazirgiannis, 2009). Many agglomeration algorithms exist in the literature; but, difficult to supply a categorization of agglomeration ways as a consequence of these categories might overlap. So as that a way might have choices from several categories, however, the first vital agglomeration ways is also classified into the next main categories hierarchic ways, partitioning ways (Drias, Cherif, & Kechid, 2016; Han, Pei & Kamber, 2011). Text cluster is to hunt out-out the team's data from the text papers and cluster these documents into the first relevant groups. Text cluster groups the document in Associate in nursing unattended suggests that and there is no label or class data. Cluster methods have to be compelled to discover the connections between the documents then supported these connections the documents area unit clustered. Provided many papers, a capable document cluster methodology might organize those immense numbers of documents into pregnant groups, which modify other browsing and navigation of this corpus be teeming easier (Xu, Liu & Gong, 2003). A basic setup of text cluster is to hunt cut-out that documents have many words in common and place these papers with the first words in common into the constant cluster. All the overall purpose cluster algorithms often applied to document/text cluster. Some algorithms are developed alone for document/text cluster. Of these algorithms are often classified into partitional, hierarchal et al. like probabilistic, graph-based, and frequent term-based. Based on their characteristics, text agglomeration may classify into entirely different classes. 
The most common classifications are hierarchical clustering (Hahsler & Bolaños, 2016) and flat clustering. Depending on when clustering is performed, or on how the result is updated when new documents are inserted, there are also online clustering and offline clustering (Hahsler & Bolaños, 2016).

Document clustering aims to segregate documents into meaningful clusters that reflect the content of each document. In a newswire, for example, manually assigning one or more categories to each document requires considerable human labor, especially given the large amount of text uploaded online daily, so cost-effective clustering is crucial. Another problem associated with document clustering is the huge number of terms. In a full matrix representation, each term is a feature and each document is an instance; in typical cases, the number of features approaches the number of words in the lexicon. This imposes a serious challenge on clustering methods, whose efficiency can be greatly degraded. However, a large share of these words are either stop words, irrelevant to the topic, or redundant, so removing them can significantly reduce dimensionality.

Data clustering partitions a set of unlabelled objects into disjoint or joint groups called clusters. In a good clustering, all the objects within a cluster are similar, while the objects in different clusters are clearly different. When the data being processed is a set of documents, the task is called document clustering. Document clustering is extremely relevant and useful in the information retrieval area. It is usually applied to a document database so that similar documents are connected within the same cluster; during retrieval, documents belonging to the same cluster as the retrieved documents can also be returned to the user. This may improve the recall of an information retrieval system.
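As a toy illustration of the dimensionality argument above, the following sketch counts the distinct terms of a tiny corpus before and after stop-word removal. The stop list here is a small hand-picked stand-in, not the full dictionary-based list the thesis later uses:

```python
from collections import Counter

# A tiny hand-picked stop list; the actual pipeline removes stop words
# against a full dictionary list, so treat this as an illustration only.
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in", "are"}

def vocabulary(docs, stop_words=frozenset()):
    """Return a Counter of distinct terms across docs, minus stop words."""
    terms = Counter()
    for doc in docs:
        for token in doc.lower().split():
            if token not in stop_words:
                terms[token] += 1
    return terms

docs = [
    "the cat sat in the garden",
    "a dog sat in the garden and the cat ran",
]

full = vocabulary(docs)
reduced = vocabulary(docs, STOP_WORDS)
# Removing stop words shrinks the feature space the clusterer must handle:
# 9 distinct terms before removal, 5 after.
print(len(full), len(reduced))
```

Even on this two-sentence corpus, nearly half the features disappear; on a realistic corpus the savings are far larger.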
Document clustering can also be applied to the retrieved documents themselves to help the user find the useful ones. The output of an information retrieval system is a ranked list ordered by estimated relevance to the query. When the information source is small and the query formulated by the user is well defined, this ranked-list approach is efficient. For a large information source such as the World Wide Web, however, and under poor query conditions (just one or two keywords), it is hard for the retrieval system to identify the items of interest to the user; usually, most of the retrieved documents are of no interest. Applying document clustering to the retrieved documents may make it


easier for the users to browse their results and to find what they need quickly (Maitri et al., 2015). In this modern age of computers, there is an answer to that limitation. One obvious reason to resort to online dictionaries and lexical databases that can be read by computers is that computers can search such alphabetical lists much faster than people can. Moreover, since dictionaries are printed from tapes that are read by computers, it is a relatively straightforward matter to convert those tapes into a suitable database; putting standard dictionaries online seems a simple and natural wedding of the old and the new. WordNet® is a large online lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept, and synsets are interlinked by means of conceptual-semantic and lexical relations. Sir James Murray's Oxford English Dictionary was compiled "on historical principles," and no one doubts the value of the OED in settling issues of word use or sense priority. By focusing on historical (diachronic) evidence, however, the OED, like other standard dictionaries, neglected questions regarding the synchronic organization of lexical knowledge.
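The lemmatization step that the thesis performs with WordNet can be illustrated with a deliberately tiny stand-in lexicon. A real pipeline would use NLTK's WordNetLemmatizer backed by the actual WordNet database; the miniature mapping below is purely hypothetical:

```python
# Toy lemma lexicon standing in for WordNet. In the actual pipeline the
# thesis uses WordNet-backed lemmatization; this mapping is invented
# solely to show the shape of the operation.
LEMMAS = {
    "ran": "run", "running": "run",
    "mice": "mouse",
    "documents": "document",
}

def lemmatize(tokens):
    """Map each token to its base form, leaving unknown tokens untouched."""
    return [LEMMAS.get(t, t) for t in tokens]

print(lemmatize(["mice", "ran", "documents", "cluster"]))
```

Mapping inflected forms onto a single lemma further shrinks the term space and lets the clusterer treat "ran" and "running" as the same feature.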

2.2 Classification

We live in a world filled with data. Every day, people deal with different kinds of data coming from all sorts of measurements and observations. Data describe the characteristics of a living species, depict the properties of a phenomenon, summarize the results of a scientific experiment, and record the dynamics of a running machinery system. More importantly, data provide a basis for further analysis, reasoning, and decisions, and ultimately for the understanding of all kinds of objects and phenomena. One of the most vital of the myriad data analysis activities is to classify or cluster data into a set of categories or clusters. Data objects that are placed in the same cluster should show similar properties based on some criteria. As one of the most primitive activities of human beings (Anderberg, 2014; Everitt, Landau & Leese, 2001), classification has played an essential and indispensable role in the history of human development. To learn a new object or understand a new phenomenon, people always try to identify its distinctive features and then compare these features with those of


known objects or phenomena, based on their similarity or dissimilarity (generalized as proximity) according to certain standards or rules.

Classification is supervised learning: a method is learned for predicting the instance class from pre-labeled (classified) instances. The clustering process, on the other hand, is a learning process without any supervision; it finds a "natural" grouping of instances given unlabelled data. Cluster analysis is the process of classifying objects into subsets that are meaningful in the context of a particular problem. A clustering is thus a special kind of classification: a type of classification imposed on a finite set of objects. Classification problems can be organized as suggested by Lance and Williams (Jain & Dubes, 1998; Fahad & Alam, 2016). Each leaf in the tree of Figure 2.1 defines a different genus of classification problem; the nodes of the tree are defined below.

Exclusive versus non-exclusive: An exclusive classification is a partition of the set of objects; every object belongs to exactly one cluster. In a non-exclusive, or overlapping, classification, each object can be assigned to several classes. A simple example makes the difference clear: classifying people by gender (male or female) is exclusive, while classifying the same people by their diseases is non-exclusive, because each person has exactly one gender but may have several diseases and can therefore belong to more than one class.

Classification
  ├─ Exclusive
  │    ├─ Supervised
  │    └─ Unsupervised
  │         ├─ Partitional
  │         └─ Hierarchical
  └─ Non-Exclusive (overlapping)

Figure 2.1: Classification tree (Fahad & Alam, 2016)
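The gender/disease example of exclusive versus overlapping assignment can be written down directly. The data and the `is_exclusive` helper below are illustrative inventions, not part of any cited work:

```python
# Exclusive classification: each object carries exactly one label.
by_gender = {"alice": "female", "bob": "male"}

# Non-exclusive (overlapping) classification: an object may carry several.
by_disease = {"alice": {"diabetes"}, "bob": {"asthma", "diabetes"}}

def is_exclusive(assignment):
    """True when every object has exactly one label (a partition)."""
    return all(
        not isinstance(labels, (set, list)) or len(labels) == 1
        for labels in assignment.values()
    )

print(is_exclusive(by_gender), is_exclusive(by_disease))
```

The overlapping case cannot be represented as a partition, which is exactly why it forms a separate branch of the classification tree.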

Supervised versus unsupervised: An unsupervised classification uses only the proximity matrix to perform the classification; it is called "unsupervised learning" in pattern recognition because no category labels denoting an a priori partition of the objects are used. A supervised classification uses category labels on the objects in addition to the proximity matrix, and the problem is then to establish a discriminant surface that separates the objects according to category. For instance, suppose several health indices are collected from smokers and non-smokers. An unsupervised classification would cluster the individuals based on similarities among the health indices and then try to determine whether smoking was a factor in the propensity of individuals toward various diseases. A supervised classification would instead study ways of discriminating smokers from non-smokers based on the health indices.

Hierarchical versus partitional: Unsupervised classifications are subdivided into hierarchical and partitional classifications by the type of structure imposed on the data. A hierarchical classification is a nested sequence of partitions, whereas a partitional classification is a single partition; a hierarchical classification is therefore a special sequence of partitional classifications. This thesis uses the term clustering for an exclusive, unsupervised, partitional classification and the term hierarchical clustering for an exclusive, unsupervised, hierarchical classification (Jain & Dubes, 1998; Fahad & Alam, 2016).

2.2.1 Classification and Clustering

Clustering and classification seem to be closely related processes, but there is a difference between them based on their meaning. In the data mining world, clustering and classification are two types of learning methods, and both characterize objects into groups by one or more features.
The key distinction is that clustering is an unsupervised learning technique used to group similar instances on the basis of features, whereas classification is a supervised learning technique used to assign predefined labels to instances on the basis of features. Table 2.1 summarizes the differences.

Table 2.1: Difference between Clustering and Classification (Narwal & Mintwal, 2013; Fahad & Alam, 2016)

Definition
    Clustering: an unsupervised learning technique used to group similar instances on the basis of features.
    Classification: a supervised learning technique used to assign predefined labels to instances on the basis of features.
Supervision
    Clustering: an unsupervised learning technique.
    Classification: a supervised learning technique.
Training set
    Clustering: a training set is not used.
    Classification: a training set is used to find similarities.
Process
    Clustering: statistical concepts are used, and the data set is split into subsets with similar features.
    Classification: members are grouped by similarity; algorithms divide the elements according to the knowledge provided by the observations of the training set.
Labels
    Clustering: there are no labels.
    Classification: there are labels for some points.
Aim
    Clustering: to group a set of objects in order to find out whether there is any relationship between them.
    Classification: to find out which class a new object belongs to from a set of predefined classes.

2.3 Clustering

In today's highly competitive business environment, clustering plays a significant role. Clustering is a major task in the data mining process, used to form groups from a given data set based on the similarity between the items. It is a major concept for uniting objects in groups (clusters) according to their similarity, and it resembles classification except that the groups are not predefined but are instead defined by the data alone.

Cluster analysis is one of the most important data mining methods and a central problem in information management. Document clustering is the act of grouping similar documents into categories, where similarity is some function on a document. Document clustering does not need a separate training process or manual tagging of groups in advance. It is the method of partitioning or grouping a given set of patterns into disjoint clusters: the documents within the same cluster are more similar, while the documents in different clusters are more dissimilar.

Figure 2.2: Data Clustering (Toor, 2014). The panels show (a) the original points and the same points grouped into (b) two, (c) four, and (d) eight clusters.

Data clustering is a data exploration technique that allows objects with similar characteristics to be grouped together to facilitate their further processing. Most of the fundamental clustering techniques were developed by the statistics or pattern recognition communities (Toor, 2014), where the objective was to cluster a modest number of data instances. In more recent years, clustering has become a key technique in data mining tasks, and this basic operation is also applied to many common tasks such as unsupervised classification, segmentation, and dissection. In the unsupervised setting, the right answers are not framed in advance or simply not told to the network.

2.3.1 Types of Clustering

Different approaches to clustering data can be described with the help of the hierarchy shown in Figure 2.3 (Jain & Dubes, 1998; Fahad & Alam, 2016); our presentation depends on the discussion in Jain and Dubes. At the top level, there is a distinction between hierarchical and partitional approaches: hierarchical methods produce a nested series of partitions, whereas partitional methods produce only one. The taxonomy of Figure 2.3 should be supplemented by a discussion of cross-cutting issues that may affect all of the approaches regardless of their placement within the taxonomy.

Agglomerative vs. divisive: This aspect relates to algorithmic structure and operation. An agglomerative approach begins with each pattern in a distinct cluster and successively merges clusters until a stopping criterion is satisfied. A divisive method starts with all patterns in a single cluster and performs splits until a stopping criterion is met.

Monothetic vs. polythetic: This aspect relates to the sequential or simultaneous use of features in the clustering process. Most algorithms are polythetic; that is, all features enter into the computation of distances between patterns, and decisions are based on those distances. A simple monothetic algorithm instead considers features sequentially to divide the given collection of patterns.

Clustering
  ├─ Hierarchical
  │    ├─ Single link
  │    └─ Complete link
  └─ Partitional
       ├─ Square error
       ├─ Graph theoretic
       ├─ Mixture resolving
       └─ Mode seeking

Figure 2.3: Hierarchy of clustering approaches (Jain & Dubes, 1998; Fahad & Alam, 2016).


Hard vs. fuzzy: A hard clustering algorithm allocates each pattern to a single cluster during its operation and in its output. A fuzzy clustering method instead assigns degrees of membership in several clusters to each input pattern; a fuzzy clustering can be converted to a hard clustering by assigning each pattern to the cluster in which it has the largest degree of membership.

Deterministic vs. stochastic: This issue is most relevant to partitional approaches designed to optimize a squared-error function. The optimization can be accomplished using traditional techniques or through a random search of the state space consisting of all possible labelings.

Incremental vs. non-incremental: This issue arises when the pattern set to be clustered is large and constraints on execution time or memory space affect the design of the algorithm. The early history of clustering methodology does not contain many examples of clustering algorithms designed to work with large data sets, but the advent of data mining has fostered the development of clustering algorithms that minimize the number of scans through the pattern set, reduce the number of patterns examined during execution, or reduce the size of the data structures used in the algorithm's operations.
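The fuzzy-to-hard conversion described above (assign each pattern to its largest membership) can be sketched in a few lines; the membership values below are made up for illustration:

```python
# Fuzzy memberships: each pattern has a degree of membership in every
# cluster. The values here are invented for illustration only.
memberships = {
    "d1": {"c1": 0.9, "c2": 0.1},
    "d2": {"c1": 0.4, "c2": 0.6},
}

def harden(fuzzy):
    """Recover a hard clustering: each pattern joins its max-membership cluster."""
    return {doc: max(degrees, key=degrees.get) for doc, degrees in fuzzy.items()}

hard = harden(memberships)
print(hard)  # each document now sits in exactly one cluster
```

Note that this conversion loses information: after hardening, d2's substantial 0.4 affinity for c1 is no longer visible.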

2.3.2 Types of Clusters

Clustering aims to find useful groups of objects (clusters), where usefulness is defined by the goals of the data analysis. Not surprisingly, there are several different notions of a cluster that prove useful in practice. To visually illustrate the differences among these types of clusters, two-dimensional points are used as the data objects, as shown in Figure 2.4, while stressing that the kinds of clusters described here are equally valid for other kinds of data (Qiu & Saprio, 2015).

2.3.2.1 Well-Separated

A cluster is a set of objects in which each object is closer (or more similar) to every other object in the cluster than to any object not in the cluster. Sometimes a threshold is used to specify that all the objects in a cluster must be sufficiently close (or similar) to one another (Qiu & Saprio, 2015). This idealistic definition of a cluster is satisfied only when the data contains natural clusters that are quite far from one another. Figure 2.4(a) gives an example of well-separated clusters consisting of two groups of points in a two-dimensional space: the distance between any two points in different groups is larger than the distance between any two points within a group. Well-separated clusters need not be globular and can have any shape.

2.3.2.2 Prototype-Based

A cluster is a set of objects in which each object is closer (more similar) to the prototype that defines the cluster than to the prototype of any other cluster. For data with continuous attributes, the prototype of a cluster is typically a centroid, i.e., the average (mean) of all the points in the cluster.

(a) Well-separated clusters: each point is closer to all of the points in its cluster than to any point in another cluster.

(b) Center-based clusters: each point is closer to the center of its cluster than to the center of any other cluster.

(c) Contiguity-based clusters: each point is closer to at least one point in its cluster than to any point in another cluster.

(d) Density-based clusters: clusters are regions of high density separated by regions of low density.

(e) Conceptual clusters: points in a cluster share some general property that derives from the entire set of points; points in the intersection of the circles belong to both.

Figure 2.4: Different types of clusters illustrated by sets of two-dimensional points (Qiu & Saprio, 2015)

When a centroid is not meaningful, for example when the data has categorical attributes, the prototype is often a medoid, i.e., the most representative point of a cluster. For many kinds of data, the prototype can be regarded as the most central point, and in such instances prototype-based clusters are commonly referred to as center-based clusters. Not surprisingly, such clusters tend to be globular. Figure 2.4(b) shows an example of center-based clusters.

2.3.2.3 Graph-Based

If the data is represented as a graph, where the nodes are objects and the links represent connections among objects, then a cluster can be defined as a connected component: a group of objects that are connected to one another but have no connection to objects outside the group. An important example of graph-based clusters are contiguity-based clusters, where two objects are connected only if they are within a specified distance of each other (Qiu & Saprio, 2015). Figure 2.4(c) shows an example of such clusters for two-dimensional points. This definition of a cluster is useful when clusters are irregular or intertwined, but it runs into trouble when noise is present since, as illustrated by the two globular clusters of Figure 2.4(c), a small bridge of points can merge two distinct clusters.

2.3.2.4 Density-Based

A cluster is a dense region of objects that is surrounded by a region of low density. Figure 2.4(d) shows some density-based clusters for data created by adding noise to the data of Figure 2.4(c). The two circular clusters are not merged, as they are in Figure 2.4(c), because the bridge between them fades into the noise. Likewise, the curve present in Figure 2.4(c) also fades into the noise and does not form a cluster in Figure 2.4(d). A density-based definition of a cluster is often used when the clusters are irregular or intertwined, and when noise and outliers are present. By contrast, a contiguity-based definition of a cluster would not work well for the data of Figure 2.4(d) (Qiu & Saprio, 2015), since the noise would tend to form bridges between clusters.
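The contiguity-based (graph-based) notion of a cluster can be sketched minimally: connect points that lie within a distance threshold and take the connected components as clusters. The points and threshold below are illustrative choices, not taken from the cited sources:

```python
from collections import deque
from math import dist  # Euclidean distance, Python 3.8+

def graph_clusters(points, threshold):
    """Contiguity-based clusters: connected components of the threshold graph."""
    n = len(points)
    # Build the adjacency lists: an edge joins points within the threshold.
    adj = [[j for j in range(n)
            if j != i and dist(points[i], points[j]) <= threshold]
           for i in range(n)]
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component, queue = [], deque([start])
        seen.add(start)
        while queue:
            i = queue.popleft()
            component.append(i)
            for j in adj[i]:
                if j not in seen:
                    seen.add(j)
                    queue.append(j)
        clusters.append(sorted(component))
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(graph_clusters(points, threshold=1.5))  # two components: [[0, 1], [2, 3]]
```

A single noise point midway between the two groups would bridge them into one component, which is exactly the fragility the text attributes to contiguity-based clusters.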


2.3.2.5 Shared-Property

A cluster can be defined as a set of objects that share some property. This definition encompasses all the previous definitions of a cluster; for example, objects in a center-based cluster share the property that they are all closest to the same centroid or medoid (Qiu & Saprio, 2015). However, the shared-property approach also includes new types of clusters. Consider the clusters shown in Figure 2.4(e): a triangular region (cluster) is adjacent to a rectangular one, and there are two intertwined circles (clusters). In both cases, a clustering algorithm would need a very specific concept of a cluster to detect these clusters successfully. The process of finding such clusters is called conceptual clustering.

2.4 Textual Document Clustering

Clustering is one of the primary data analysis techniques and deals with organizing a collection of objects in a multidimensional space into cohesive groups called clusters for better management and navigation (Oikonomakou & Vazirgiannis, 2009). Clustering is an example of unsupervised learning: whereas classification refers to a procedure that assigns data objects to a set of classes and depends on predefined classes and training examples for classifying data objects, clustering relies on no such information (Sathiyakumari, Manimekalai, Preamsudha & Scholar, 2011; Forgy, 1965; Lammersen, Schmidt & Sohle, 2015). Document clustering is helpful for many information retrieval tasks such as document browsing, organization, and viewing of retrieval results (Pantel & Lin, 2002). Many clustering algorithms exist in the literature, but it is difficult to provide a categorization of clustering methods because the categories overlap and a method may have features from several of them; still, the most important clustering methods can be classified into the main categories of hierarchical methods and partitioning methods (Drias, Cherif, & Kechid, 2016; Han, Pei & Kamber, 2011). The partitioning method attempts a flat partitioning of a collection of documents into a predefined number of disjoint clusters (Oikonomakou & Vazirgiannis, 2009). It then uses an iterative relocation technique that attempts to improve the partitioning

by moving objects from one cluster to another; partitioning methods include k-means and k-medoids (Drias, Cherif, & Kechid, 2016; Han, Pei & Kamber, 2011). Hierarchical methods produce a sequence of nested partitions (Oikonomakou & Vazirgiannis, 2009) and are further classified as either agglomerative (bottom-up) or divisive (top-down) (Han, Pei & Kamber, 2011). Document clustering is the process of grouping a set of records into clusters so that the documents within each cluster are similar to each other, in other words belong to the same topic or subtopic, while documents in different clusters belong to different topics or subtopics. A document clustering algorithm typically depends on the use of a pairwise distance measure between the individual documents to be clustered. Most of the techniques used in document clustering treat a document as a bag of words without considering the semantics of each document (El-Din, 2016). A traditional algorithm primarily uses features such as words, phrases, and sequences from the documents, based on counting and frequency of the features, to perform clustering independent of the context (El-Din, 2016; Chim & Deng, 2008; Li, Chung & Holt, 2008; Fung, Wang & Ester, 2003; Alelyani, Tang & Liu, 2013; Li, Luo & Chung, 2008). They ignore the semantics of words in documents.
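The k-means relocation scheme mentioned above can be sketched in a few lines. This toy version runs on 2-D points rather than high-dimensional document vectors, and the fixed initial centroids are an illustrative simplification (real implementations choose them randomly or with a seeding heuristic):

```python
from math import dist
from statistics import mean

def k_means(points, centroids, iterations=10):
    """Minimal K-Means sketch: alternate assignment and centroid update."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Relocation step: move each centroid to the mean of its cluster,
        # keeping the old centroid if the cluster happens to be empty.
        centroids = [
            (mean(p[0] for p in c), mean(p[1] for p in c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(centroids)  # -> [(0, 0.5), (10, 10.5)]
```

K-medoids follows the same alternation but relocates each center to the cluster member that minimizes total distance, which makes it usable when a mean is not meaningful.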

2.4.1 Document Clustering

Text clustering seeks to find group information in text documents and to cluster those documents into the most relevant groups. It groups the documents in an unsupervised manner, with no label or class information; clustering methods must discover the connections between the documents and then cluster the documents based on those connections. Given a large number of documents, a capable document clustering method can organize those huge numbers of documents into meaningful groups, which makes further browsing and navigation of the corpus much easier (Xu, Liu & Gong, 2003). A basic idea of text clustering is to find out which documents have many words in common and to place the documents with the most words in common into the same cluster. Current research efforts in document clustering have begun to focus on developing more effective clustering methods that consider the semantics between terms in documents to improve the clustering results. Text clustering aims to separate documents

into groups, where each group represents a certain topic that is different from those of the other groups. From a geometric point of view, a corpus can be seen as a set of samples on multiple manifolds, and clustering aims at grouping documents based on the fundamental structures of the manifold. Grouping records into clusters is a basic step in many applications such as categorization, retrieval, and mining of information on the web. With an accurate text clustering method, a document corpus can be organized into a meaningful cluster hierarchy, which facilitates efficient browsing and navigation of the corpus, or efficient information retrieval by focusing on relevant subsets (clusters) rather than whole collections (McKeown, Barzilay, Evans, Hatzivassiloglou, Klavans, Nenkova & Sigelman, 2002; Liu & Croft, 2004).

All general-purpose clustering algorithms can be applied to document/text clustering, and some algorithms have been developed solely for it. These algorithms can be classified into partitional, hierarchical, and others such as probabilistic, graph-based, and frequent-term-based. Partitional clustering tries to break the given data set into k disjoint classes such that the data objects in a class are closer to one another than to the data objects in other classes. The most well-known and commonly used partitional clustering algorithm is K-Means (Hartigan, 1975; Witten, Frank, Hall & Pal, 2016), together with its variants Bisecting K-Means (Forgy, 1965; Lammersen, Schmidt & Sohle, 2015) and K-Medoids (Kaufman & Rousseeuw, 2009; Drias, Cherif, & Kechid, 2016). Hierarchical clustering proceeds successively by building a tree of clusters. There are two kinds of hierarchical clustering methods: agglomerative and divisive.
An agglomerative hierarchical clustering is a bottom-up strategy that starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters until all of the objects are in a single cluster or until a user-defined criterion is met. A divisive hierarchical clustering is a top-down strategy that starts with all objects in one cluster and divides the cluster into smaller and smaller pieces until each object forms a cluster on its own or until certain termination conditions are satisfied. Regarding the distance/similarity measure, a hierarchical clustering may use the minimum distance (single-link) (Yim & Ramdeen, 2015; Sneath & Sokal, 1973), the maximum distance (complete-link) (Yim & Ramdeen, 2015; King, 1967), the mean distance, or the average distance.

Model-based clustering algorithms attempt to optimize the fit between the given data and some mathematical model, under the assumption that the data is generated by a mixture of underlying probability distributions. The self-organizing map (SOM) (Kohonen, 2012) is one of the most popular model-based algorithms; it uses neural network methods for clustering. It represents all points in a high-dimensional space by points in a low-dimensional (2-D or 3-D) space, such that the distance and proximity relationships are preserved as much as possible. It assumes that there is some topology or ordering among the input objects and that the points will eventually take on this structure in the low-dimensional space. Graph-based clustering algorithms apply graph theory to clustering. A well-known graph-based divisive clustering algorithm (Zahn, 1971) relies on constructing the minimum spanning tree (MST) of the data and then deleting the tree edges with the largest lengths to obtain clusters. Another popular graph-based clustering algorithm is MCL, the Markov Cluster algorithm (Van, 2001; Malliaros & Vazirgiannis, 2013), which is discussed in more detail later in this section.
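Zahn's MST-based divisive algorithm just described can be sketched as follows. This is a naive O(n²) illustration on toy points, not the cited implementation: build the MST with Prim's algorithm, drop the longest edges, and read clusters off the remaining components via union-find:

```python
from math import dist

def mst_clusters(points, n_clusters):
    """Zahn-style clustering: build the MST, cut its longest edges."""
    n = len(points)
    # Prim's algorithm: repeatedly attach the closest outside point.
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = min(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda e: dist(points[e[0]], points[e[1]]),
        )
        in_tree.add(j)
        edges.append((dist(points[i], points[j]), i, j))
    # Keep all but the n_clusters - 1 longest edges ...
    edges.sort()
    kept = edges[: n - n_clusters]
    # ... and merge their endpoints with a tiny union-find.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for _, i, j in kept:
        parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

labels = mst_clusters([(0, 0), (1, 0), (9, 0), (10, 0)], n_clusters=2)
print(labels)  # the two nearby pairs receive two distinct labels
```

Cutting the single long MST edge between the two pairs is exactly the "delete the largest edges" step of the cited algorithm.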

2.4.2 Different Document Clustering Approaches

Based on their characteristics, text clustering methods may be classified into different categories. The most common classifications are hierarchical clustering (Hahsler & Bolaños, 2016) and flat clustering. Depending on when clustering is performed, or on how the result is updated when new documents are inserted, there are online clustering and offline clustering (Hahsler & Bolaños, 2016). Moreover, according to whether an overlap is allowed or not, there are soft clustering (Lucic, Bachem & Krause, 2016) and hard clustering (Mario et al., 2016). Based on the features that are used, clustering algorithms may also be sorted into document-based clustering and keyword-based clustering.

2.4.2.1 Hierarchical and Flat Clustering

Hierarchical and flat clustering methods are two major categories of clustering algorithms. Just as the departments of a company can be organized in either a hierarchical or a flat style, the clusters of a document corpus can be arranged in a hierarchical tree structure or in a flat style.


Hierarchical clustering: Hierarchical clustering techniques produce a nested sequence of partitions, with a single all-inclusive cluster at the top and singleton clusters of individual points at the bottom (Steinbach, Karypis & Kumar, 2000). The hierarchical clustering can be viewed as an inverted tree: the root of the tree is the top level of clusters, the leaves of the tree are the bottom-level clusters consisting of the individual documents, and the branches of the tree are the intermediate levels of the clustering result. Viewing the tree from different levels gives different summaries of the clusters, as in the example shown in Figure 2.5:

• If the tree is cut at level 1, all input samples form one single cluster.

• If the tree is cut at level 0.8, the input samples form two clusters, C1 and C2, where C1 contains documents B1, B2, B3, and B4, and C2 contains B5 and B6.

• If the tree is cut at level 0.6, C1 splits into two sub-clusters, C11 and C12, so the documents fall into three groups, C11, C12, and C2, which respectively contain B1 and B2, B3 and B4, and B5 and B6.

• If the tree is cut at the lowest level (value 0.4), each document forms a cluster of its own.

There are two approaches to obtaining such a hierarchical clustering (Steinbach,

Karypis & Kumar, 2000): •

• Agglomerative: start from the leaves, treating each document as a separate cluster at the beginning, and repeatedly merge the most similar pair of clusters until only one single cluster is left.
• Divisive: start from the root, treating the entire document set as one cluster, and at each step divide a cluster into two (or several) sub-clusters, until each cluster contains exactly one document or until the required number of clusters is reached.
Agglomerative techniques are the more common of the two: they are relatively simple, and most common distance calculation and similarity measurement techniques can be applied. Traditional agglomerative hierarchical clustering can be summarized as follows (Steinbach, Karypis & Kumar, 2000). Given a collection of documents B = {b_1, b_2, …, b_n}:
1. Consider every document as an individual cluster. Compute the distance between all pairs of clusters and construct the n × n distance matrix, in which entry d_ij denotes the distance between cluster i and cluster j.
2. Merge the closest two clusters into a new cluster.
3. Update the distance matrix: calculate the distances between the newly generated cluster and the remaining clusters.
4. Repeat steps 2 and 3 until only one single cluster remains, which is the root cluster of the hierarchy.
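The four steps above can be sketched in Python. This is a toy illustration, not the thesis implementation; the sample documents, the Jaccard distance between term sets, and the function name are assumptions made for the example, and single-link merging is used as one of the inter-cluster measures discussed later in this section.

```python
from itertools import combinations

def single_link_agglomerative(docs, distance):
    """Agglomerative clustering following steps 1-4 above: start with
    singleton clusters and repeatedly merge the closest pair until one
    root cluster remains. Returns the merge history (the tree levels)."""
    clusters = [frozenset([i]) for i in range(len(docs))]
    history = []
    while len(clusters) > 1:
        # Step 2: find the pair of clusters with the smallest
        # single-link (nearest-member) distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda p: min(distance(docs[i], docs[j])
                              for i in p[0] for j in p[1]))
        # Step 3: merge the pair and update the cluster list.
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
        history.append(sorted(a | b))
    return history

# Toy corpus as term sets; Jaccard distance between documents.
docs = [{"cat", "dog"}, {"cat", "dog", "pet"},
        {"stock", "market"}, {"stock", "bond"}]
jaccard_dist = lambda x, y: 1 - len(x & y) / len(x | y)
print(single_link_agglomerative(docs, jaccard_dist))
# → [[0, 1], [2, 3], [0, 1, 2, 3]]
```

Reading the merge history bottom-up reproduces the inverted tree described above: the two pet documents merge first, then the two finance documents, and finally the root cluster containing all four.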


Figure 2.5: Hierarchical clustering (Hahsler & Bolaños, 2016)
When a flat partition is wanted from a hierarchical clustering, a cut is made at a specific level of the hierarchical cluster tree; at that level, every branch represents a cluster, and all the leaves (documents) under the same branch belong to one cluster. Agglomerative techniques also have to decide which inter-cluster similarity measure to use:
• Single-link measure: join the two clusters containing the two nearest documents.
• Complete-link measure: join the two clusters with the minimum "most distant" pair of documents.



• Group-average measure: join the two clusters with the minimum average document distance.
Flat Clustering: Different from hierarchical clustering, flat clustering creates one-level (unnested) partitions of documents (Hahsler & Bolaños, 2016; Steinbach, Karypis & Kumar, 2000) rather than a well-organized hierarchical cluster tree. Flat clustering techniques commonly demand the desired number of clusters K as an input parameter; they begin with a random partitioning and then keep refining it until the algorithm converges. The convergence state is the final state in which all clusters are stable and no more documents are switched between clusters. Flat clustering techniques may also produce a hierarchical cluster tree: by repeating a flat clustering from the top level (root) of the tree down to the bottom level (the leaves), a hierarchical cluster tree is generated. Hierarchical and flat clustering have their own strengths and weaknesses. Hierarchical clustering provides much detail about the whole document corpus, with clusters well organized in a tree structure; the price is a comparatively higher complexity. Flat clustering techniques, on the contrary, are commonly simple and straightforward to implement, and can be applied more efficiently than hierarchical clustering technologies. When coping with a massive document corpus, efficiency is the major concern; this thesis project therefore mainly considers flat clustering techniques. It also evaluates one hierarchical clustering algorithm, because hierarchical clustering may help more in understanding the structure and relations in a large document corpus than flat clustering does.
2.4.2.2 Online and Offline Clustering
According to when clustering is performed, clustering algorithms are divided into online clustering algorithms and offline clustering algorithms (Mario et al., 2016; Vester & Martiny, 2005). Online clustering algorithms perform document clustering upon receiving the request and return the result within a limited period.
Online clustering therefore demands high-speed operations (low complexity) and keeps the clustering result up-to-date. Normally, online clustering algorithms are applied to small or medium corpora.

Offline clustering, on the contrary, processes the documents and groups them into consonant clusters before receiving any request. When a request is received, offline clustering algorithms perform a few simple operations and then present the clustering result. Compared with online clustering, offline clustering performs most of the operations before receiving the requests; it is relatively complex (high complexity) and can be applied to large document corpora. The major disadvantage of offline clustering is that the clustering result is not up-to-date: it may not reflect the fact that a single document, or a few documents, have been added to the corpus until the full set of operations is applied again after an extended period (Mario et al., 2016). Online and offline clustering have different applications: the former is typically applied to group search results, while the latter is used to organize a document corpus. A clustering algorithm is also classified as online if it only updates the affected documents in the corpus, rather than re-clustering all documents, when new documents are added. Given an existing document corpus and its clustering result, when new documents are added to the collection, an online clustering algorithm applies cluster calculation only to the newly inserted documents and a small part of the original collection. This comparatively low calculation complexity leads to fast clustering when new documents are inserted frequently, and makes it possible to keep the clustering result up-to-date. As this thesis deals with a large document corpus rather than with clustering search results, it is mainly concerned with offline clustering.
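The online-update idea described above can be sketched as follows. This is a hypothetical illustration, not a cited algorithm: the centroids, the example vectors, and the `assign_online` helper are assumptions, and only the newly arriving document is processed, leaving the existing clustering untouched.

```python
def assign_online(new_vec, centroids):
    """Online update: place one newly inserted document vector into the
    nearest existing cluster (by squared Euclidean distance) instead of
    re-clustering the whole corpus."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(centroids)),
               key=lambda k: sqdist(new_vec, centroids[k]))

# Two existing cluster centroids; a new document arrives.
centroids = [(0.0, 0.0), (5.0, 5.0)]
print(assign_online((4.0, 6.0), centroids))  # → 1 (nearest centroid)
```

A full offline pass would instead recompute all clusters from scratch, which is more expensive but keeps the whole result consistent.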
2.4.2.3 Hard and Soft Clustering
Depending on whether overlapping is allowed in the clustering result, clustering methods may generate hard clustering results or soft ones (Mario et al., 2016). It is very common for one document to have multiple topics; it might then be tagged with multiple labels and grouped into more than one cluster. In this case, overlapping is allowed (Mario et al., 2016). For instance, consider a document which describes how scientists discovered the way bats "hear" flies and catch them, how this biological technique was applied to create modern radar, and how radar benefited military engineering. It is entirely reasonable to say that this document can be classified into "biology," "radar," "military engineering" and any other relevant classes. Soft clustering covers this kind of clustering algorithm, which may place documents into different clusters such that each document may belong to several clusters, keeping the boundaries of the clusters "soft." In summary, with soft clustering each document is probabilistically assigned to clusters (Dhillon, 2003), as shown in Figure 2.6.

(a) Soft Clustering    (b) Hard Clustering

Figure 2.6: Soft and Hard Clustering (Mario et al., 2016)
However, some situations demand that one document be organized only under its most relevant class. This type of clustering is called hard clustering, because every document belongs to exactly one cluster. It is important for hard clustering algorithms to decide which cluster is the best match. Given the document above, a reasonable choice is to cluster it into "radar," since it is mostly about the invention and the applications of radar. The concept of hard clustering is also illustrated in Figure 2.6.
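The distinction can be sketched as follows. This is a minimal illustration; the similarity scores of the bat/radar document to the three clusters are invented for the example, and the two helper functions are assumptions, not part of any cited algorithm.

```python
def soft_memberships(sims):
    """Soft clustering: normalize a document's similarity to each
    cluster into a probability distribution over the clusters."""
    total = sum(sims)
    return [s / total for s in sims]

def hard_label(sims):
    """Hard clustering: assign the single best-matching cluster."""
    return max(range(len(sims)), key=lambda k: sims[k])

# Hypothetical similarities of the bat/radar document to the
# "biology", "radar" and "military engineering" clusters.
sims = [2.0, 5.0, 3.0]
print(soft_memberships(sims))  # → [0.2, 0.5, 0.3]
print(hard_label(sims))        # → 1, i.e. the "radar" cluster
```

The soft result keeps partial membership in all three topics, while the hard result commits the document to "radar" only.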


2.4.2.4 Document-based and Keyword-based Clustering
Keyword-based and document-based clustering differ in the features on which the documents are grouped (Habibi & Popescu-Belis, 2015; Shete, Zaware, Thube, Hande & Shimpi, 2016). Document-based clustering algorithms are mainly applied to the document vector space model, in which every entry represents the term weighting of the corresponding term in the corresponding document. A document is thereby mapped to a data point in a very high-dimensional space where every term is an axis. In this space, the distances between points are calculated and compared: close data points can be merged and clustered into the same group, while distant points are separated into different groups (Habibi & Popescu-Belis, 2015; Shete et al., 2016). The corresponding documents are thus grouped or separated. As document-based clustering is based on "document distance," it is imperative to map the documents into the right space and to apply appropriate distance calculation methods. Keyword-based clustering algorithms, by contrast, choose only specific document features, and the clusters are generated based on this relatively limited number of features. These specific features are selected because they are considered the core features of the documents: they are shared by similar documents and are sparse in dissimilar documents (Habibi & Popescu-Belis, 2015; Shete et al., 2016). How to pick the most essential features is therefore a key step in keyword-based clustering.

2.5 Algorithms for Document Clustering
Document clustering aims to segregate documents into meaningful clusters that reflect the content of each document. For example, in a news wire, manually assigning one or more categories to each document requires complete human labor, especially with the large amount of text uploaded online daily; a cost-effective clustering is therefore crucial. Another problem associated with document clustering is the huge number of terms. In a full matrix representation, each term is a feature and each document is an instance; in typical cases the number of features approaches the number of words in the lexicon. This imposes a significant challenge for clustering methods, whose efficiency would be greatly degraded. However, a large share of these words are either stop words, irrelevant to the topic, or redundant; removing these extra words can significantly reduce dimensionality. Feature selection does not only reduce processing time but also improves clustering results and provides better data interpretability (Mugunthadevi, Punitha, Punithavalli & Mugunthadevi, 2011). In document clustering, a set of chosen words that are related to a particular cluster will be more informative, with respect to that cluster, than the whole set of words in the documents. Different feature selection methods have been used in document clustering recently, for example term frequency, pruning infrequent terms, pruning highly frequent terms, and entropy-based methods. Several of these methods are explained in the following subsections.
2.5.1 Term Frequency
Term Frequency (TF) is one of the earliest and simplest yet effective term selection methods, dating back to 1957 (Luhn, 1957); it is, indeed, a standard term selection technique. In a text corpus, documents that belong to the same topic are more likely to use similar words (Habibi & Popescu-Belis, 2015); these common terms are therefore a good indicator for a certain topic. A common term that is evenly distributed across completely different topics, however, is not informative, and such terms can be discarded; this is called pruning highly frequent terms. Similarly, very rare terms should be pruned, which is called pruning infrequent terms. Stop words are likely to be pruned because of their high frequency (Habibi & Popescu-Belis, 2015), while words like "abecedarian" are likely to be ignored since they are not common. TF for the term f_i with respect to the whole corpus is given by:

TF(f_i) = Σ_{j ∈ D_{f_i}} tf_{ij} ……….. (2.1)
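Equation 2.1 can be sketched in Python. This is a minimal illustration; whitespace tokenization and the three-document sample corpus are assumptions made for the example.

```python
from collections import Counter

def corpus_term_frequency(docs):
    """Eq. 2.1: TF(f_i) = sum over the documents j that contain f_i of
    tf_ij, i.e. the total number of occurrences of each term in the
    corpus."""
    tf = Counter()
    for doc in docs:
        tf.update(doc.lower().split())
    return tf

docs = ["the cat sat", "the dog sat", "the cat ran"]
tf = corpus_term_frequency(docs)
print(tf["the"])  # → 3
print(tf["cat"])  # → 2
```

Pruning highly frequent and infrequent terms then amounts to dropping the terms whose TF falls outside chosen thresholds.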

2.5.2 Inverse Document Frequency

TF is an economical term selection method; however, it is not effective with respect to term weighting, as all selected terms are assigned a constant weight, and the TF value cannot be linked to any particular document (Myint & Oo, 2016). In other words, it cannot distinguish between common words that appear in a small set of documents, which may have discriminative power for that set of documents, and frequent words that appear in all or most of the documents in the corpus. To scale a term's weight we use, instead, the inverse document frequency (IDF). IDF measures whether the term is frequent or rare across all documents:
idf(f_i) = log(|D| / |D_{f_i}|) ............. (2.2)

Where |D| represents the total number of documents and |D_{f_i}| represents the number of documents that contain the term f_i (Myint & Oo, 2016). The value of IDF is high for rare terms and low for highly frequent ones.

2.5.3 Term Frequency-Inverse Document Frequency
The two preceding measures (i.e., TF and IDF) can now be combined to deliver a weight for every term f_i in each document d_j. The resulting measure is called TF-IDF and is given by:
tf-idf(f_i, d_j) = tf_{ij} × idf(f_i) ............ (2.3)
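Equations 2.2 and 2.3 can be sketched together. This is a minimal illustration with the same assumptions as before: whitespace tokenization, a toy corpus, and no smoothing of the IDF term.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Eqs. 2.2-2.3: idf(f_i) = log(|D| / |D_{f_i}|) and
    tf-idf(f_i, d_j) = tf_ij * idf(f_i). Returns one weight
    dictionary per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency |D_{f_i}|
    idf = {t: math.log(n_docs / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
            for tokens in tokenized]

weights = tf_idf(["the cat sat", "the dog sat", "the cat ran"])
print(weights[0]["the"])  # → 0.0, since "the" occurs in every document
```

As the equations predict, a term occurring in all documents receives the lowest possible weight, while a term confined to one document ("dog") receives the full log(|D|) factor.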

tf-idf assigns a larger value to terms that occur frequently within a small set of documents (Habibi & Popescu-Belis, 2015; Myint & Oo, 2016), and such terms therefore have more discriminative power. The value becomes lower as the term occurs in more sets of documents, while the lowest value is given to terms that occur in all documents. For all document clusters, terms with a higher tf-idf are more likely to be appropriate for clustering.

2.5.4 Chi-Square Statistic
The chi-square (χ²) statistic is widely popular for supervised feature selection (Witten, Frank, Hall & Pal, 2016). It calculates the statistical dependence between a feature and a category (Hussain & Asghar, 2016). χ² for a feature with r distinct values and C categories is defined as:

χ² = Σ_{i=1}^{r} Σ_{j=1}^{C} (n_ij − μ_ij)² / μ_ij ………… (2.4)
In equation 2.4, n_ij represents the number of samples with the i-th value of the feature in the j-th class, and μ_ij = (n_i · n_j) / n (Hussain & Asghar, 2016).

Here, n represents the total number of documents in the sample. The equation can also be interpreted in terms of probabilities as:
χ²(f, c) = [p(f, c)·p(¬f, ¬c) − p(f, ¬c)·p(¬f, c)] / [p(f)·p(c)] …………… (2.5)

In equation 2.5, classes are denoted by c and terms by f; p(f, c) represents the probability of a document being in class c and containing term f, p(¬f, ¬c) is the probability of not being in class c and not containing term f, and so on. χ² cannot be directly applied in unsupervised learning such as clustering because of the absence of a class label. Y. Li et al. proposed a variation of χ² known as rχ² that overcomes some drawbacks of the original χ² and is embedded in an Expectation-Maximization (EM) scheme for use in text clustering problems (Li, Luo & Chung, 2008). They noticed that χ² cannot determine whether the dependency between the feature and the category is negative or positive, which sometimes results in ignoring relevant features and choosing irrelevant ones. Therefore, they proposed a relevance measure R to be used within the original χ² to overcome this limitation. This new measure R follows:
R(f, c) = [p(f, c)·p(¬f, ¬c) − p(f, ¬c)·p(¬f, c)] / [p(f)·p(c)] ……… (2.6)

R in Equation (2.6) will be equal to one if there is no dependency between the category and the feature, greater than one if there is a positive dependency, and less than one if the dependency is negative (Hussain & Asghar, 2016). From Equations (2.5) and (2.6), Hoffman et al. (Hussain & Asghar, 2016) proposed a new variation of χ² that is able to distinguish positive and negative relevance:

rχ²(f) = Σ_{j=1}^{C} p(R(f, c_j))·χ²(f, c_j) ……… (2.7)
Where p(R(f, c_j)) is given by p(R(f, c_j)) = R(f, c_j) / Σ_{j=1}^{C} R(f, c_j).
The larger the value of rχ² is, the more relevant the feature f will be.

As mentioned earlier, a supervised feature selection cannot be applied directly in unsupervised learning. Therefore, Li, Luo & Chung (2008) embedded their proposed technique given in Equation (2.7) into a clustering algorithm using an EM approach, with k-means as the clustering algorithm and rχ² as the feature selection technique.
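The probability form of Equation 2.5 can be sketched on a small labeled corpus. This is a toy illustration: the documents, the labels, and the `chi_square` function name are assumptions made for the example.

```python
def chi_square(docs, labels, term, cls):
    """Eq. 2.5 (as given above): chi2(f, c) =
    (p(f,c)*p(~f,~c) - p(f,~c)*p(~f,c)) / (p(f)*p(c))."""
    n = len(docs)
    p_fc   = sum(term in d and l == cls for d, l in zip(docs, labels)) / n
    p_nfnc = sum(term not in d and l != cls for d, l in zip(docs, labels)) / n
    p_fnc  = sum(term in d and l != cls for d, l in zip(docs, labels)) / n
    p_nfc  = sum(term not in d and l == cls for d, l in zip(docs, labels)) / n
    p_f = sum(term in d for d in docs) / n
    p_c = labels.count(cls) / n
    return (p_fc * p_nfnc - p_fnc * p_nfc) / (p_f * p_c)

docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "law"}, {"vote", "court"}]
labels = ["sport", "sport", "politics", "politics"]
print(chi_square(docs, labels, "goal", "sport"))  # → 1.0 (positive dependency)
print(chi_square(docs, labels, "vote", "sport"))  # → -1.0 (negative dependency)
```

The sign behavior shown here is exactly the limitation discussed above: the raw χ² statistic cannot separate positive from negative dependency, which motivates the relevance measure R of Equation 2.6.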

2.5.5 Frequent Term-Based Text Clustering
Frequent Term-Based Text Clustering (FTC) provides a practical way of reducing dimensionality in text clustering. It follows the notion of a frequent item set, which forms the basis of association rule mining. In FTC, the set of documents that contains the same frequent term set becomes a candidate cluster. Clusters may therefore overlap, since a document may contain several different item sets. This kind of clustering comes in either a flat (FTC) or a hierarchical (HFTC) variant, since item sets of all different cardinalities occur (Li, Yang & Jiang, 2017).
First, a data set D, a given minimum support value minsup, and an algorithm that finds frequent item sets must be available. The algorithm starts by finding the frequent item sets with minimum support minsup. It then runs until the number of documents covered by the selected term sets, |cov(STS)|, equals the number of documents in D (Li, Yang & Jiang, 2017). In each iteration, the algorithm calculates the entropy overlap EO for each set in the remaining term sets RTS, where EO is given by:
EO_i = Σ_{D_j ∈ C_i} −(1/F_j) · ln(1/F_j) …………. (2.8)

Where D_j is the j-th document, C_i is the i-th cluster, and F_j is the number of all frequent term sets supported by document D_j; less overlap is assumed to be better. EO equals 0 if all the documents in C_i support only one frequent item set (i.e., F_j = 1), and its value increases with the rise of F_j. This method of overlap analysis was found to provide higher clustering quality than the standard one (Li, Yang & Jiang, 2017). The best candidate set, BestSet, is the set with the minimum amount of overlap. BestSet is selected, added to STS, and excluded from RTS. Additionally, the documents that support BestSet are removed from the data set, since they have already been clustered, which dramatically reduces the number of documents; they are also removed from the document lists of RTS, which reduces the size of the remaining term sets.
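Equation 2.8 can be sketched as follows. This is a toy illustration; the cluster contents and the F_j counts are invented for the example.

```python
import math

def entropy_overlap(cluster_docs, freq_sets_per_doc):
    """Eq. 2.8: EO_i = sum over documents D_j in cluster C_i of
    -(1/F_j) * ln(1/F_j), where F_j is the number of frequent term
    sets that document D_j supports."""
    return sum(-(1 / freq_sets_per_doc[j]) * math.log(1 / freq_sets_per_doc[j])
               for j in cluster_docs)

# Every document in this candidate cluster supports exactly one
# frequent term set (F_j = 1), so the overlap is 0.
print(entropy_overlap([0, 1, 2], {0: 1, 1: 1, 2: 1}))  # → 0.0

# A cluster whose documents each support two sets overlaps more.
print(entropy_overlap([3, 4], {3: 2, 4: 2}) > 0)  # → True
```

Choosing the candidate with the minimum EO is exactly the BestSet selection step described above.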

2.5.6 Frequent Term Sequence
Similar to FTC, a clustering based on frequent term sequences (FTS) was proposed in (Li, Chung & Holt, 2008). Here, the order of terms within the document is significant. A frequent term sequence, denoted f, is a set that contains the frequent terms ⟨f_1, f_2, …, f_k⟩; the sequence means that f_2 must come after f_1 (Fazal & Rafi, 2015), although not necessarily immediately after it: there may be other, non-frequent terms between them. The same holds for any terms f_{k−1} and f_k. This definition of frequent term sequences is better adapted to the variability of human languages (Li, Chung & Holt, 2008). Like FTC, FTS starts by finding frequent term sets using an association rule mining algorithm. A frequent term set is guaranteed to contain the frequent term sequence, but not vice versa; hence, there is no need to search the whole term space for frequent term sequences. The search can be restricted to just the frequent term set space, which dramatically reduces the dimensionality. After that, FTS builds a generalized suffix tree (GST), a well-established structure for sequence pattern matching, from the documents after removing the non-frequent terms. The cluster candidates are obtained from the suffix nodes in the GST. These cluster candidates may contain subtopics that are eligible to be merged together into more general topics, so a merging step takes place (Fazal & Rafi, 2015). The authors (Li, Chung & Holt, 2008) chose to merge cluster candidates into more general topic clusters using k-mismatch rather than a similarity measure. As an example of the k-mismatch concept, given FS_i = {feature, selection, clustering} and FS_j = {feature, selection, classification}, the two sets have one mismatch; these two clusters can therefore be merged if the tolerance parameter k ≥ 1.

FTS adopted the Landau–Vishkin (LV) algorithm (Das & Mannila, 2000) to check three kinds of mismatches: insertion, deletion, and substitution. Insertion means that k or fewer terms need to be inserted into FS_j in order to match FS_i; deletion, in contrast, means that terms need to be deleted; while substitution means that terms from FS_j need to be substituted with terms from FS_i.

These merged clusters are prone to overlap. Consequently, a further merging is performed after measuring the amount of overlap using the Jaccard Index:
J(C_i, C_j) = |C_i ∩ C_j| / |C_i ∪ C_j| ………….…. (2.9)
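Equation 2.9 can be sketched directly. This is a minimal illustration; the two candidate clusters, represented as sets of document identifiers, are invented for the example.

```python
def jaccard(ci, cj):
    """Eq. 2.9: J(C_i, C_j) = |C_i ∩ C_j| / |C_i ∪ C_j|, used to
    measure the overlap between two candidate clusters."""
    return len(ci & cj) / len(ci | cj)

c1 = {"d1", "d2", "d3"}
c2 = {"d2", "d3", "d4"}
print(jaccard(c1, c2))  # → 0.5 (2 shared documents out of 4 in total)
```

Candidate clusters whose Jaccard Index exceeds a chosen threshold would be merged in the final step described above.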

2.6 Semantic Document Clustering
Clustering is one of the techniques used to improve efficiency in information retrieval, for both search and retrieval. It is a data mining tool for grouping objects into clusters: clustering divides the objects (documents) into meaningful groups based on the similarity between objects (Naik, Prajapati & Dabhi, 2015). Documents within one cluster have high similarity with one another, but low similarity with documents in other clusters (Rodrigues & Sacks, 2004). Document clustering generates clusters from the whole document collection automatically and is used in many fields, including data mining and information retrieval. In the traditional vector space model, the unique words occurring in the document set are used as the features. However, because of the synonym problem and the ambiguity problem, such a bag of words is unable to represent the content of a textual document precisely (Maitri et al., 2015). The growth of the World Wide Web has enticed many researchers to devise various methodologies for organizing such a huge information source, and quality issues come into play with respect to the quality of automatic organization and categorization.
Data clustering partitions a set of unlabelled objects into disjoint or joint groups of clusters. In a good clustering, all the objects within a cluster are similar, while the objects in different clusters are entirely different. When the data being processed is a set of documents, it is called document clustering. Document clustering is highly relevant and useful in the information retrieval field. It is usually applied to a document database so that similar documents are connected within the same cluster; during the retrieval process, documents belonging to the same cluster as the retrieved documents can also be returned to the user, which may improve the recall of an information retrieval system.
Document clustering can also be applied to the retrieved documents themselves, to help the user find the useful ones. The output of an information retrieval system is a ranked list, ordered by estimated relevance to the query. When the information database is small, and the query formulated by the user is well defined, this ranked-list approach is efficient. For a huge information source such as the World Wide Web, however, and under poor query conditions (just one or a couple of keywords), it is difficult for the retrieval system to identify the items of interest to the user; usually, most of the retrieved documents are of no interest to the user. Applying document clustering to the retrieved documents may make it easier for users to browse their results and to find what they need quickly (Maitri et al., 2015). Previous clustering methods chiefly use keyword matching over the text; however, this does not capture the meaning behind the words, which is the weak side of the traditional approach to mining text. In semantic document clustering, documents can be analyzed in two ways: syntactically and semantically (Shah & Mahajan, 2012). Syntactic parsing filters the degraded data out of the documents, so that accurate data is passed to the next step. In that next step, semantic parsing is applied to the parsed grammatical data, which clusters the documents properly and gives the user the required response at data mining time, something that is not accurate in traditional methods (Shah & Mahajan, 2012).

2.7 Conceptual Clustering
When no classification information is known about the data, a clustering algorithm is usually used to group the data such that the similarity within each cluster is larger than that between clusters. This is called learning from observations, as opposed to the classification task, which is regarded as learning from examples. A different approach is conceptual clustering. These methods are incremental and build a hierarchy of probabilistic concepts; COBWEB and its successor CLASSIT are the most notable among them. Unlike traditional hierarchical methods (which use similarity measures), they use Category Utility as the cluster quality measure. Conceptual clustering is based on numerical taxonomy (Fisher & Langley, 1986; Branch & Rocchi, 2015; Fahad & Yafooz, 2017) and was originally introduced by Michalski & Stepp (1983). Gennari et al. (Carlson, Weinberg & Fisher, 2014) described the problem of conceptual clustering in the following way:

• Given: a sequential presentation of instances and their associated descriptions;
• Find: a clustering that groups those instances into categories;
• Find: an intensional definition for each category that summarizes its instances;
• Find: a hierarchical organization for those categories.
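The Category Utility measure that COBWEB uses as its cluster quality criterion (mentioned in Section 2.7) can be sketched as follows. This is a standard formulation of CU for nominal attributes, not code from the thesis, and the example instances are invented.

```python
from collections import Counter

def category_utility(partition):
    """Category Utility for a partition of instances, each instance a
    tuple of nominal attribute values:
    CU = (1/K) * sum_k P(C_k) * (sum P(A=v|C_k)^2 - sum P(A=v)^2)."""
    all_items = [x for cluster in partition for x in cluster]
    n = len(all_items)
    n_attrs = len(all_items[0])

    def sq_prob_sum(items):
        # Sum over attributes and values of P(A_i = v)^2 within `items`.
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(x[i] for x in items)
            total += sum((c / len(items)) ** 2 for c in counts.values())
        return total

    base = sq_prob_sum(all_items)
    cu = sum(len(c) / n * (sq_prob_sum(c) - base) for c in partition)
    return cu / len(partition)

# Two pure clusters score higher than one mixed cluster.
good = [[("small", "red"), ("small", "red")],
        [("big", "blue"), ("big", "blue")]]
bad = [[("small", "red"), ("big", "blue"), ("small", "red"), ("big", "blue")]]
print(category_utility(good) > category_utility(bad))  # → True
```

A partition whose clusters make the attribute values more predictable than they are in the whole data set receives a positive score, which is what drives COBWEB's choice among its operators.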


As stated above, conceptual clustering organizes instances (tuples) into categories. This makes conceptual clustering suitable for categorical data that cannot be ordered and can only be put into categories (Vu & Parker, 2016). Despite variations in representation (Rendell, 1986) and quality judgments, all conceptual clustering systems judge category quality by looking at a description, or concept description, of the category (Vu & Parker, 2016). There are two problems that have to be addressed by an ideal clustering system:
• The clustering problem involves determining useful subsets of the object set. This consists of identifying a set of object classes, each defined as an extensional set of objects.
• The characterization problem involves determining useful concepts for each (extensionally defined) object class. This is simply the problem of learning from examples.
Fisher and Langley (Fisher & Langley, 1985; Fisher & Langley, 1986; Branch & Rocchi, 2015; Fahad & Yafooz, 2017) adapt the view of learning as search to fit conceptual clustering. Clustering and characterization dictate a two-tiered search: a search through a space of object clusters and a subordinate search through a space of concepts (Vu & Parker, 2016). In the case of hierarchical techniques, this becomes a three-tiered search, with a top-level search through a space of hierarchies. COBWEB (Fisher, 1987; Fisher & Schlimmer, 2014) is a strong conceptual clustering algorithm that has been the basis for many other algorithms, for example LABYRINTH (Thompson & Langley, 1991) and ITERATE (Biswas, Weinberg, Yang & Koller, 1991; Xu, 2014).

2.7.1 Conceptual Clustering Algorithms
Conceptual clustering is tightly related to data clustering; however, in conceptual clustering it is not only the inherent structure of the data that drives cluster formation but also the description language that is available to the learner. Thus, a statistically strong grouping in the data may fail to be extracted by the learner if the prevailing concept description language is incapable of describing that particular regularity. A good number of algorithms have been proposed for conceptual clustering; some examples are given below:

CLUSTER/2: Early work on conceptual clustering was done by Michalski and Stepp (Michalski & Stepp, 1983), who proposed the general clustering algorithm called CLUSTER/2. The choice of conceptual clustering arises from the interesting property that conceptual clustering is usually used for nominal-valued data. An extension exists for conceptual clustering that can deal with numeric data (Li & Biswas, 1998; Fahad & Yafooz, 2017), but for the purposes of this work only nominal-valued data need concern us, because the data sets being handled are inherently nominal and symbolic-valued. However, the data sets being dealt with contain a large number of attributes, and their values are non-fixed nominal values; preprocessing of the data is then an essential step to make it usable. Conceptual clustering builds a structure out of the data incrementally by trying to subdivide a group of observations into subclasses. The result is a hierarchical organization called the concept hierarchy. Each node in the hierarchy subsumes all the nodes below it, with the entire data set at the root of the hierarchy tree.
LABYRINTH: (Thompson & Langley, 1991) To incorporate a structured object into a node, LABYRINTH performs an additional search to determine the best characterization for the object. Since the values LABYRINTH uses for structured objects are stored concepts that have been returned by the previous classification, they are hierarchically related to each other. LABYRINTH uses an attribute generalization operator, analogous to the climbing-tree operator, to take advantage of these hierarchical relationships and to search for more predictive characterizations of the structured object. Traditionally, concept formation systems have started tabula rasa, i.e., without exploiting knowledge of the domain.
However, the incremental nature of Labyrinth and Cobweb means that they can revise an existing memory structure. One can simply hand-encode the initial memory and start operation from there. The initial memory thus serves to prime the learning algorithm, and the usual concept learning operators revise and extend the initial theory. Since Cobweb stores information characterizing each class individually, rather than as an organization of several component concepts, expressing knowledge about subsets of attributes is not straightforward. In contrast, Labyrinth can be primed with a class that characterizes arbitrary subsets of attributes, provided instances are decomposed in the same way. Thus, Labyrinth's use of components in classification enables it to take advantage of a form of background knowledge that is common in many domains: information about correlated sets of attributes. ITERATE: Research conducted by the group led to the development of ITERATE, a general clustering algorithm that works with combinations of numeric and non-numeric data. The main motivations for developing ITERATE were to extend previous conceptual clustering algorithms (e.g., COBWEB) to generate stable and maximally distinct partitions (Biswas, 1993) and to provide an efficient algorithm for an interactive data analysis tool. Like other conceptual clustering algorithms, ITERATE builds a concept tree from domain objects, or instances, represented as vectors of attribute-value pairs, but it tries to mitigate the effect of instance ordering on the control structure. The algorithm exploits information on the whole object set in creating the object hierarchy. More specifically, it adopts an ordering operator that pre-orders the object sequence to exploit the biases of the criterion function, forming maximally distinct classes in the initial classification tree (Gautam et al., 1995). The tree is generated in a breadth-first manner; therefore, class probabilities at a parent node are allowed to stabilize before child nodes are created. COBWEB: Cobweb is a conceptual clustering algorithm developed by Fisher (Fisher, 1987) for the analysis of categorical data that cannot be ordered. The algorithm builds a hierarchy of clusters following the divisive approach to clustering. The goal of Cobweb, like all conceptual clustering algorithms, is to create a model that can be used for future predictions (Carlson, Weinberg & Fisher, 2014).
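The concept hierarchy such an algorithm builds can be represented by nodes that store attribute-value counts for the instances they subsume. A minimal sketch (class and attribute names here are illustrative, not from the thesis):

```python
from collections import Counter

class ConceptNode:
    """One node of a concept hierarchy: counts of the instances it subsumes."""
    def __init__(self):
        self.count = 0                 # number of instances under this node
        self.av_counts = Counter()     # (attribute, value) -> count
        self.children = []             # child ConceptNodes

    def add(self, instance):
        """Incorporate one instance (a dict of attribute -> value)."""
        self.count += 1
        self.av_counts.update(instance.items())

root = ConceptNode()
root.add({"Type": "Mature", "Sex": "Male"})
root.add({"Type": "Mature", "Sex": "Female"})
print(root.count, root.av_counts[("Type", "Mature")])  # → 2 2
```

From such counts the conditional probabilities P(V|C) used by Cobweb follow directly as `av_counts[(a, v)] / count`.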
Cobweb is a relatively old algorithm, but its relevance to solving data mining problems has endured since it was introduced. Biswas et al. (Biswas et al., 1991; Xu, 2014) use Cobweb for predicting missing values. Perkowitz and Etzioni (Perkowitz & Etzioni, 2000) discuss the suitability of Cobweb for data mining on the web. Hurst et al. (Hurst, Marriott & Moulder, 2003) and Paliouras et al. (Gaigole, Patil & Chaudhari, 2013) use Cobweb on the internet, while Li et al. (Li, 2005) combine Cobweb with k-means (MacQueen, 1967; Ferrari & De Castro, 2015) to present an algorithm for large-scale clustering. The algorithm is also part of some popular general-purpose data mining tools. Two of these data mining tools are (i) Weka (Stephen, 1995), which provides an implementation of Cobweb that applies to categorical and numeric data, and (ii) OIDM [Chen & Oid-2004], which gives an implementation of Cobweb based on the original Fisher paper.

2.7.2 COBWEB Algorithm

The Cobweb algorithm is an incremental clustering algorithm that clusters one

tuple at a time in a top-down manner. It starts clustering a tuple by inserting it into the root cluster of the tree (Figure 2.7 is an example of a Cobweb tree). Inserting a new tuple into a cluster involves updating the probabilities that the cluster covers (Kavita & Pallavi, 2015).

[Figure 2.7 shows an example Cobweb tree over the attributes Type (Standard, Mature), Sex (Male, Female) and C_Skill (Expert, Novice, Intermediate). Each cluster C1-C6 in the hierarchy is annotated with its cluster probability, e.g. P(C1)=0.50, and with the conditional probabilities P(V|C) of its attribute values; in the most specific clusters each remaining value has P(V|C) = 1.00.]

Figure 2.7: The COBWEB Tree (Fisher, 1987; Fisher & Schlimmer, 2014; Carlson, Weinberg & Fisher, 2014)
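The per-cluster conditional probabilities P(V|C) shown in Figure 2.7 are simply value frequencies within a cluster, which can be computed as follows (an illustrative sketch; the variable names are not from the thesis):

```python
from collections import Counter

def conditional_probs(cluster):
    """P(attribute = value | cluster) for a list of attribute-value dicts."""
    counts = Counter((a, v) for t in cluster for a, v in t.items())
    return {av: c / len(cluster) for av, c in counts.items()}

# A singleton cluster like C5 in Figure 2.7: every value has P(V|C) = 1.00.
c5 = [{"Type": "Mature", "Sex": "Male", "C_Skill": "Expert"}]
print(conditional_probs(c5))
```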

The algorithm uses four operators to evaluate and improve the quality of the tree; the quality measure in Cobweb is category utility. The four operators are: (i) incorporate, (ii) disjunct, (iii) split, and (iv) merge. The incorporate and disjunct operators are used to build the tree, whereas the merge and split operators are used to correct any data-ordering bias in the clusters by rearranging the hierarchy.

• Incorporate: Cobweb tries the new tuple in each cluster of the evaluated level to identify the best cluster in which to incorporate the new tuple. It also records the runner-up cluster, because the other operators require it.

• Disjunct: Cobweb tries the new tuple in a new cluster that covers the tuple only.

• Split: Cobweb replaces the best cluster, identified by the incorporate operator, with its children and tries the new tuple in each child of the best cluster.

• Merge: Cobweb merges the best and runner-up clusters, identified by the incorporate operator, and tries the new tuple in the merged cluster.

According to Fisher et al. (Fisher, 1987; Fisher & Schlimmer, 2014), the incremental property can affect the quality of the clusters, as incremental algorithms are much more sensitive to the order of the data. With the merge and split operators, the algorithm corrects the ordering effect by restructuring the tree (Kavita & Pallavi, 2015). As it descends the tree, at every level Cobweb tries all four operators (incorporate, disjunct, split and merge) and identifies the best operator to apply by measuring the category utility of the clustering produced by each one. Category utility favors the operator that, when applied, produces a clustering that maximizes the potential for inferring information (Fisher, 1987; Fisher & Schlimmer, 2014; Carlson, Weinberg & Fisher, 2014). If the best operator is incorporate, the algorithm inserts the new tuple into the best cluster and proceeds to the next level. If the best operator is disjunct, the algorithm creates a new cluster in the tree. If the best operator is split, the algorithm rearranges the tree by replacing the best cluster with its children and moves to the next level. If the best operator is merge, the algorithm merges the best and runner-up clusters (both indicated by the incorporate operator) and moves to the same level.
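The category utility measure that drives these operator choices can be sketched for categorical attributes as follows. This is an illustrative simplification (function and data names are mine, not from the thesis): it rewards partitions whose clusters make attribute values more predictable than they are in the data as a whole.

```python
from collections import Counter

def category_utility(clusters):
    """Category utility of a partition of categorical tuples.

    clusters: list of clusters, each a list of dicts mapping
    attribute name -> value, e.g. {"Type": "Mature", "Sex": "Male"}.
    """
    all_tuples = [t for c in clusters for t in c]
    n = len(all_tuples)

    def sq_prob_sum(tuples):
        # sum over attributes and values of P(attribute = value)^2
        counts = Counter((a, v) for t in tuples for a, v in t.items())
        return sum((c / len(tuples)) ** 2 for c in counts.values())

    base = sq_prob_sum(all_tuples)          # predictability without clusters
    gain = 0.0
    for c in clusters:
        p_c = len(c) / n                    # P(C_k)
        gain += p_c * (sq_prob_sum(c) - base)
    return gain / len(clusters)             # averaged over the k clusters

clusters = [
    [{"Type": "Standard", "Sex": "Male"}, {"Type": "Standard", "Sex": "Male"}],
    [{"Type": "Mature", "Sex": "Female"}, {"Type": "Mature", "Sex": "Male"}],
]
print(category_utility(clusters))
```

A partition that separates the tuples well scores higher than one that lumps dissimilar tuples together, which is exactly how Cobweb compares the outcomes of its four operators.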

Function Cobweb(tuple, root)
    Incorporate the tuple into the root
    If root is a leaf node Then
        Expand the leaf node
        Return the expanded leaf node with the tuple
    Else
        Get the children of the root
        Evaluate the operators and select the best:
            a) Try incorporating the tuple into every child
            b) Try creating a new cluster with the tuple
            c) Try merging the two best clusters
            d) Try splitting the best cluster into its children
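A heavily simplified, flat version of this loop, keeping only the incorporate and disjunct operators and scoring candidate partitions with a category-utility-style gain, might look like this (an illustrative sketch under those assumptions, not the thesis implementation):

```python
from collections import Counter

def sq_sum(tuples):
    # sum over attributes and values of P(attribute = value)^2 in `tuples`
    counts = Counter((a, v) for t in tuples for a, v in t.items())
    return sum((c / len(tuples)) ** 2 for c in counts.values())

def partition_score(clusters):
    # category-utility-style gain of a partition over the unclustered data
    n = sum(len(c) for c in clusters)
    base = sq_sum([t for c in clusters for t in c])
    return sum(len(c) / n * (sq_sum(c) - base) for c in clusters) / len(clusters)

def insert_tuple(clusters, tup):
    """Try incorporate into each cluster and disjunct; keep the best partition."""
    candidates = []
    for i in range(len(clusters)):                  # operator (a): incorporate
        trial = [list(c) for c in clusters]
        trial[i].append(tup)
        candidates.append(trial)
    candidates.append([list(c) for c in clusters] + [[tup]])  # (b): disjunct
    return max(candidates, key=partition_score)

clusters = [[{"Type": "Standard", "Sex": "Male"}]]
for t in [{"Type": "Standard", "Sex": "Male"},
          {"Type": "Mature", "Sex": "Female"},
          {"Type": "Mature", "Sex": "Female"}]:
    clusters = insert_tuple(clusters, t)
print(len(clusters))  # → 2
```

On this toy data the score keeps the two Standard/Male tuples together and places the Mature/Female tuples in a second cluster, mirroring how the full algorithm lets category utility decide between growing an existing cluster and opening a new one.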

Cobweb has an additional operator used to predict missing values: the predict operator. The predict operator classifies a tuple down the tree using the incorporate operator, but it does not add the tuple to the clusters inside the tree.

2.8 Lexical Database

A lexical database is an organized description of the lexemes of a language. A lexical database tries to approximate the lexicon of a native speaker. It includes an inventory of lexemes and data about their meanings. For every sense of a lexeme, it includes such things as:

• a part-of-speech designation
• a definition
• sample sentences exemplifying this sense
• cultural annotations to point out its significance, and
• identification of semantic relationships with other lexemes.
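As an illustration, one sense of a lexeme carrying the fields listed above could be represented as a simple record (all values here are illustrative, using the brooch example discussed later in this chapter):

```python
# A hypothetical lexical-database entry for one sense of a lexeme.
entry = {
    "lexeme": "brooch",
    "pos": "noun",
    "definition": "a large decorative pin with a clasp, worn by women",
    "examples": ["She fastened the brooch to her coat."],
    "relations": {"hypernym": ["pin"], "synonym": ["breastpin"]},
}
print(entry["lexeme"], "-", entry["pos"])
```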

2.8.1 Lexical Overview

A lexical category is a category for items that are part of the lexicon of a language. These items are at the word level. A lexical category is also known as:

• part of speech
• word class
• grammatical category
• grammatical class

Lexical categories are also defined in terms of core notions or 'prototypes.' Given forms may or may not fit neatly into one of the categories; the category membership of a form can vary according to how that form is used in discourse. There are major and minor lexical categories. Every language includes a minimum of two major lexical categories:

• noun
• verb

Some languages also have two other broad categories besides noun and verb. These two are:

• adjective
• adverb

On closer observation, some languages also have some minor lexical categories, such as:

• conjunctions
• particles
• adpositions

Grammatical categories are distinct from formal relational categories such as subject, object, and predicate, or functional categories such as agent, topic or definite. A lexeme is an abstract unit representing a set of word-forms that differ only in inflection and not in core meaning; it is a part of a lexical unit. Examples: the lexeme brooch n. 'a large decorative pin with a clasp, worn by women' is a single lexical unit with one lexeme representing the two word-forms brooch and brooches. The first sense of the lexeme ignite, 'to set fire to,' is a single lexical unit with one lexeme representing several word-forms such as ignite, ignited, ignites and igniting.

2.8.2 Different Lexical Databases

There are several existing lexical databases, which can be grouped by language: Collections, French, German, English, Italian, and Spanish. The aim of this thesis is to work with the English language. Some of the lexical databases are given below:

COLLECTIONS
• The CELEX Lexical Databases (Dutch, English, German)
• List of WordNets in the world
• Multi WordNet

FRENCH
• Lexique

GERMAN
• GermaNet
• Noun Associations for German

ENGLISH
• The CMU Pronouncing Dictionary of American English
• WordNet (English)
• Dante
• MRC Psycholinguistic Database
• Lists of high-frequency English lemmas and word forms
• The Verb Semantics Ontology Project
• Twitter Current English Lexicon

ITALIAN
• LexIt, University of Pisa
• Morph-it!

SPANISH
• Spanish FrameNet (SFN)

2.8.3 WordNet

In this modern age of computers, however, there is an answer to that criticism. One obvious reason to resort to on-line dictionaries, lexical databases that can be read by computers, is that computers can search such alphabetical lists much faster than people can. Moreover, since dictionaries are printed from tapes that are read by computers, it is a relatively straightforward matter to convert those tapes into the relevant database. Putting standard dictionaries on-line seems a straightforward and natural wedding of the old and the new.

WordNet® is a large on-line lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. Sir James Murray's Oxford English Dictionary was compiled ''on historical principles,'' and no one doubts the value of the OED in settling questions of word use or sense priority. By focusing on historical (diachronic) evidence, however, the OED, like other standard dictionaries, neglected questions concerning the synchronic organization of lexical knowledge. It is now possible to envision ways in which that omission might be repaired. The twentieth century saw the emergence of psycholinguistics, an interdisciplinary field of study concerned with the cognitive bases of linguistic competence. Both linguists and psycholinguists have explored in considerable depth the factors determining the contemporary (synchronic) structure of linguistic knowledge in general, and lexical knowledge in particular; Miller and Johnson-Laird (Miller & Johnson-Laird, 1976) proposed that research concerned with the lexical component of language should be called psycholexicology. As linguistic theories evolved in recent decades, linguists became increasingly explicit about the information a lexicon must contain in order for the phonological, syntactic, and lexical components to work together in the everyday production and comprehension of linguistic messages, and those proposals have been incorporated into the work of psycholinguists. Beginning with word association studies at the turn of the century and continuing down to the sophisticated experimental tasks of the past twenty years, psycholinguists have discovered many synchronic properties of the mental lexicon that can be exploited in lexicography.
Definitions of common nouns typically give a superordinate term plus distinguishing features; that information provides the basis for organizing the noun files in WordNet. The superordinate relation (hyponymy) generates a hierarchical semantic organization that is duplicated in the noun files by the use of labeled pointers between sets of synonyms (synsets). The hierarchies are limited in depth, seldom exceeding more than a dozen levels. Distinguishing features are entered in such a way as to produce a lexical inheritance system, a system in which every word inherits the distinguishing features of all its superordinates. Three types of distinguishing features are discussed: attributes (modification), parts (meronymy), and functions (predication). The antonymy relation is found between nouns, but it is not a basic organizing principle for nouns (Miller, 1990). In the WordNet database, adjectives are divided into two major classes: descriptive and relational. Descriptive adjectives assign to their head nouns values of (typically) bipolar attributes, and are consequently organized by oppositions (antonymy) and similarity of meaning (synonymy). Descriptive adjectives that do not have direct antonyms are said to have indirect antonyms by virtue of their semantic similarity to adjectives that do have direct antonyms. WordNet contains pointers between descriptive adjectives expressing a value of an attribute and the noun by which that attribute is lexicalized. Relational adjectives are assumed to be stylistic variants of modifying nouns, and so are cross-referenced to the noun files (Christiane, Derek & Katherine, 2004). WordNet also contains a semantic network of English verbs. The semantic relations used to build the networks of nouns and adjectives cannot be applied without modification; they have to be adapted to fit the semantics of verbs, which differ significantly from those of the other lexical categories. The nature of these relations is discussed, as is their distribution across different semantic groups of verbs, which determines distinct patterns of lexicalization. In addition, four kinds of lexical entailment are distinguished, which interact in systematic ways with the semantic relations. Finally, the lexical properties of the individual verb groups are made explicit (Christiane, 1990). Several research efforts have explored the use of WordNet as background knowledge to improve document clustering by providing relations between vocabulary terms, and the results are quite different: some studies suggest that the use of WordNet is helpful for the clustering process, whereas others report that WordNet is not useful (Hotho, Staab & Stumme, 2003; Sedding & Kazakov, 2014; Moravec, Kolovrat & Snasel, 2004; Fodeh, Punch & Tan, 2009; Recupero, 2007; Yoo, Hu & Song, 2006; Wang & Hodges, 2006; Termier, Sebag & Rousset, 2001). The following researchers used WordNet and observed an improvement in the results. Hotho et al. (Hotho, Staab & Stumme, 2003) used WordNet synsets to enrich the document vector, and showed that enhancing the bag of words (BOW) with WordNet synsets for the terms in the text and their hypernyms (up to a certain distance) can produce better clusters than a plain bag-of-words representation. Recupero and Reforgiato (Recupero, 2007) and Wang and Hodges (Wang & Hodges, 2006) used WordNet as background knowledge in document clustering with different datasets; their results showed that the use of the ontology is helpful for clustering. Other researchers did not observe any improvement; a group of researchers concluded that WordNet brings no benefit because its structure does not help to find the similarity between words. Jing et al. (Jiang, Tang & Zhang, 2004) used the same technique as Hotho et al. (Hotho, Staab & Stumme, 2003) and enhanced it by computing a word similarity measure based on what they call 'mutual information' over their clustering corpus. However, their technique did not show any substantial improvement over the Hotho et al. (Hotho, Staab & Stumme, 2003) baseline. Passos and Wainer (Passos & Wainer, 2009) showed that many similarity measures between words derived from WordNet are worse than the baseline for the purposes of text clustering; WordNet does not provide good word similarity information. For a variety of reasons, measuring the similarity between two words is not one of WordNet's goals, and its structure does not match the task well: no measure directly based on WordNet can relate a verb like 'to seat' to a noun like 'chair'. Sedding and Kazakov (Sedding & Kazakov, 2014) showed that synonyms and hypernyms, disambiguated only by part-of-speech tags, are not successful in improving clustering effectiveness. This may be attributed to the noise introduced by all the false senses that are retrieved from WordNet. Fodeh et al. (Samah et al., 2009) and Termier et al. (Termier et al., 2011) used WordNet with different data sets; their results reported that the ontology concepts add no value and impair the performance of document clustering. Fodeh et al. (Samah et al., 2009) addressed the impact of incorporating ambiguous and synonymous nouns into document clustering, and showed that ambiguous and synonymous nouns play a vital role in clustering, although their disambiguation does not necessarily lead to significant improvement in cluster purity. Moravec et al. (Moravec et al., 2004) showed different results with two evaluation measures: the recall measure revealed that using WordNet improved the clustering result, whereas the precision measure showed that using WordNet did not improve the clustering process.

2.9

F-measure

The F-measure or F-score is one of the most commonly used "single number" measures in Information Retrieval, Natural Language Processing and Machine Learning. The F-measure, sometimes called the F-score or (incorrectly) the F1 metric (the β = 1 case of the more general Fβ measure), is a weighted harmonic mean of Recall and Precision (R and P). There are several motivations for this choice of mean; above all, the harmonic mean is generally appropriate when averaging rates or frequencies. The most general form, Fβ, permits differential weighting of Recall and Precision, but commonly they are given equal weight, giving rise to F1; because this case is so prevalent, it is usually what is understood when referring to the F-measure. In biomedicine, precision is termed positive predictive value (PPV) and recall is termed sensitivity, but, to our knowledge, there is no equivalent of the F-measure in that domain. The F-measure is defined as the harmonic mean of precision P and recall R:

F = 2PR / (P + R)    (2.10)

2.9.1 Field of F-Measure

The F-measure accuracy technique comes from Information Retrieval (IR). Recall, known elsewhere as Sensitivity or True Positive Rate (TPR), is the frequency with which relevant documents are retrieved. Precision is the frequency with which retrieved documents or predictions are relevant or 'correct', and is properly a form of Accuracy, also called Positive Predictive Value (PPV) or True Positive Accuracy. F is intended to merge them into a single measure of search 'effectiveness'. More generally, precision refers to the concept of consistency, or the ability to group well, while accuracy is measured by the distance of the result from the target, that is, how close it is to the goal; fidelity refers to how close the result is to a specific target on average (BS-ISO, 1994). High precision with low accuracy can be found in the presence of systematic bias. The biggest problem with Recall, Precision, F-measure, and Accuracy as used in Information Retrieval is that they are easily biased. To understand the relationships between these measures, it is better to give their formulae in two forms, one related to the raw counts and one related to normalized frequencies. These statistics are all appropriate when there is one class of items that is of interest or relevance out of a huge set of N items or instances. More generally, with multiple classes, the proportion of the total that each class represents is referred to as its prevalence (for the real positives RP, Prev = ρ = rp = RP/N). IR also assumes a retrieval mechanism that gives rise to the predicted positives PP, whose proportion represents the bias of this "classifier" (Bias = π = pp = PP/N). In this systematic notation, upper-case initialisms refer to counts of items (e.g., RP) and lower-case equivalents refer to the corresponding probabilities or proportions (rp); it is also common to use Greek equivalents for the probabilities in a mnemonic way (ρ).

2.9.2 Derivation of F-Measure

An oft-used measure in the information retrieval and natural language processing communities is the "F1-measure". According to Yang and Liu (Yang & Liu, 1999), this measure was first introduced by C.J. van Rijsbergen. They state that the F1 measure combines recall r and precision p with equal weight, in the following form:

F1(r, p) = 2rp / (r + p)    (2.11)

However, where does this form come from, and what happens when the two quantities are weighted differently? The F1-measure is in fact a harmonic mean. MathWorld defines the harmonic mean H of n numbers x1, ..., xn as:

1/H = (1/n) Σ_{i=1..n} 1/x_i    (2.12)

Applying this formula to precision and recall gives:

H = 1 / ((1/2)(1/r + 1/p)) = 2rp / (r + p)    (2.13)

Now it is clear that the F1-measure is a harmonic mean. But what is a harmonic mean? The answer is found by multiplying both sides of Equation 2.12 by H, which gives:

(1/n) Σ_{i=1..n} H/x_i = 1    (2.14)

In other words, the average of the ratios between the mean and the data points is unity.

A weighted version of the F-measure is obtained by computing a weighted average of the inverses of the values. Let r have a weight of α ∈ (0, +∞) and let p have a weight of 1; then the weighted harmonic mean of r and p is:

F_α(r, p) = 1 / ((α/(α+1)) · (1/r) + (1/(α+1)) · (1/p)) = (α+1)rp / (r + αp)    (2.15)

2.10 Summary of Literature Review
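Equations 2.11 and 2.15 translate directly into code. The sketch below also shows one common way of aggregating per-cluster precision and recall into a single clustering score, anticipating the use of the F-measure for cluster evaluation; the aggregation scheme and all names are illustrative assumptions, not something the thesis fixes at this point.

```python
def f_alpha(r, p, alpha=1.0):
    """Weighted harmonic mean of recall r and precision p (Eq. 2.15).

    alpha weights recall; alpha = 1 gives the usual F1 of Eq. 2.11.
    """
    if r + alpha * p == 0:
        return 0.0
    return (alpha + 1) * r * p / (r + alpha * p)

def clustering_f_measure(clusters, classes):
    """Size-weighted best-match F1 of clusters against reference classes.

    clusters, classes: lists of sets of document ids.
    """
    n = sum(len(c) for c in classes)
    score = 0.0
    for cls in classes:
        best = max(
            f_alpha(len(cls & clu) / len(cls), len(cls & clu) / len(clu))
            for clu in clusters
        )
        score += len(cls) / n * best
    return score

clusters = [{1, 2, 3}, {4, 5}]
classes = [{1, 2}, {3, 4, 5}]
print(clustering_f_measure(clusters, classes))
```

With alpha = 1 the harmonic mean penalizes an imbalance between recall and precision, which is exactly why the F-measure is preferred over the arithmetic mean for retrieval effectiveness.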

The literature review began with classification, and then moved on to clustering; without the concept of classification it would be hard to describe clustering, because clustering is a kind of classification performed without any information provided. Under clustering, several types of clusters were described. Clusters are the output of clustering, and in this research shared-property clusters are used. After clustering, the textual document clustering section was described with its types and details: document-based and keyword-based clustering, hard and soft clustering, online and offline clustering, and hierarchical and flat clustering were its main subsections. This research uses conceptual clustering, which falls under hierarchical and flat clustering. Among the several conceptual textual document clustering algorithms, the COBWEB clustering algorithm is used in this research to cluster the pre-processed textual data. To complete semantic document clustering, there are many lexical databases with different features, and these were described after the clustering algorithms. This research works only with English-language articles, and in this scenario WordNet is the largest English semantic relational treasury of words; this research uses the WordNet database for obtaining synsets and for lemmatization. In the last portion of the literature review, the F-measure accuracy measurement technique was described. After all the steps are successfully completed, the accuracy of the clusters needs to be ensured, and the F-measure is used in this research to ensure the accuracy of the groups.


CHAPTER THREE

METHODOLOGY

3.1 Introduction

Text clustering is a useful technique that aims at organizing large document collections into smaller, meaningful and manageable groups, and it plays a prominent role in information retrieval, browsing, and comprehension. Traditional clustering algorithms typically rely on the BOW (Bag of Words) approach, and a clear disadvantage of BOW is that it ignores the semantic relationships among words, so it cannot accurately represent the meaning of documents. Because of the rapid growth of text documents, the textual data has become diverse in vocabulary; it is high-dimensional and also carries semantic information. Therefore, document clustering techniques that can adequately represent the theme of documents and improve clustering performance, ideally processing data of a small size, are very much required. Recently, some semantic-based approaches have been developed. In this research, the full methodology is completed in five sections. The methodology starts with a feasibility study, whose results help to analyze the requirements. After obtaining the output of the requirement analysis, the thesis moves forward to design the model for textual document clustering. The modeling process ends when the development of the proposed framework in Python has been completed successfully. Testing and analysis is the last step, used to ensure the expected result. In this thesis, a number of steps are needed to finish the entire semantic clustering. The process begins by taking sample data, the abstracts of twenty articles, as the sample text. Those texts contain tags; the tags are removed and the texts are tokenized for pre-processing. After eliminating stop words, the tokens are ready for stemming, with the help of the WordNet semantic treasury, and for lemmatization, to generalize them for clustering. After those steps are complete, the COBWEB clustering algorithm is applied to the tokens, and the F-Measure is applied to the clusters to obtain their accuracy.


3.2 Research Methodology

In this research, the whole methodology is completed in five sections. Each part has some activities that produce a result, and the outputs are used as the key to start the next step. The steps are listed below:

• Feasibility Study
• Requirement Analysis
• Modeling
• Model Development
• Testing and Analysis

The first section is the feasibility study. The feasibility study section is for reading and gathering information from books, and for searching for information in relevant technological journals and conference papers. For studying and collecting information, encyclopedias are one of the biggest sources; much of the information in the feasibility study was gathered from encyclopedias, besides books, journals, and papers. Through this study, a clear scenario of textual document clustering emerged, and valuable information was gathered about semantic document clustering. The feasibility study also produced an understanding of the dataset, and it revealed the important role that natural language processing plays in semantic documents. After the feasibility study comes requirement analysis. In this section, the PyCharm IDE platform was selected for developing the proposed framework in Python. This research needs a database to store the data; the SQLite relational database system is a built-in module of Python, and it is used in this development to store the data. In the requirement analysis stage, WordNet was selected as the lexical database because of its huge semantic database of English; both the stemming and lemmatization processes were completed with the help of WordNet. Since there is a database, it needs to be viewed and examined, so a DB viewer was selected as a useful tool in the requirement analysis phase. K-means and other older clustering algorithms are not suitable for this research, because semantic clustering is used; the COBWEB conceptual clustering algorithm was therefore selected for the clustering process. At the end of the analysis, sample texts were chosen from 20 different articles, namely the abstracts of these 20 articles.


Table 3.1: Five steps of Research Methodology

Phase: Feasibility Study
  Activities: Book; Journal; Paper; Encyclopedia
  Deliverables: Information Source; Textual Document Clustering; Semantic Document Clustering; Dataset; Natural Language Processing

Phase: Requirement Analysis
  Activities: PyCharm (IDE); SQLite Relational Database; WordNet; DB Viewer; COBWEB Algorithm; Sample Text
  Deliverables: Full-Text Search; Dataset Schema; SQL Query; Stopword Removal; Lemmatization; Frequency; Semantic Document Clustering; WordNet

Phase: Modeling
  Activities: Development Platform; Natural Language Tools; Accuracy Tools
  Deliverables: PyCharm (IDE); COBWEB Concept Formation; NLTK; Synset; F-Measure

Phase: Model Development
  Activities: Coding; Experiment Design; Standard Maintenance; Sample Text
  Deliverables: Semantic Document Clustering Model; Hardware and Software Preparation; Similarity Measure

Phase: Testing and Analysis
  Activities: Pre-Processing; Clustering; Accuracy Measure
  Deliverables: F-Measure; Highly Accurate Clusters; Semantic Relations Between Words

After the study and analysis, the framework for semantic textual document clustering was modeled. The Python programming language is used to develop the model, and PyCharm (IDE) was chosen to carry out the development process. For working with natural language, Python has the Natural Language Toolkit (NLTK). NLTK has features especially for languages, such as tokenization, stemming, lemmatization, and a WordNet interface; it is a fitting tool set for natural language processing, and it was implemented at several stages of this development. After processing the text, the development moves forward to apply COBWEB clustering, and a COBWEB concept formation component was added to the proposed development. When the clustering process is complete, development focuses on the accuracy tool, the F-Measure; by adding the F-Measure to the development, the modeling was completed. In modeling, the tools and framework are made ready. The model development section is for coding and integrating the tools with Python. During development it was ensured that the system could take text as input and output the clusters with their accuracy, along with the accuracy of the whole system. In this section, some library functions were included in Python to complete the development: NLTK, concept formation, F1 score, and so on. When everything in the development section is complete, focus turns to the next step, testing and analysis. In the testing section, the sample data consists of 20 text files collected from 20 scientific research articles. First the system removes the tags, tokenizes the texts, and returns the number of tokens contained in each sample text. Next, it reports the number of tokens removed from each document and how many tokens remain. The remaining tokens are merged by the stemming process and lemmatized with the help of WordNet, and the tokens are returned after these two operations. These processed tokens are then clustered, and the result is returned as clusters with their members. Each cluster is sent to the F-measure, which returns the accuracy of each group; the minimum accuracy is taken as the accuracy of the total system.

3.2.1 Sample Data

For testing this development, some sample data is needed. The system could be tested on an enormous amount of data; however, the preliminary testing uses a limited data set, because in the testing phase the aim was to trace the data, the intermediate results and every detail. Twenty paper abstracts related to this study were selected, so technically there are 20 sample data items, and every step, from start to end, is applied to these 20 samples.
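The tag-removal, tokenization and stop-word removal steps applied to each sample text can be sketched as follows. This is a minimal, self-contained illustration: the actual framework uses NLTK with the Oxford stop-word list and WordNet lemmatization, whereas here a tiny stop-word set stands in for the full list and the lemmatization step is omitted.

```python
import re

# Stand-in for the 429-entry Oxford stop-word list used by the framework.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}

def remove_tags(text):
    """Strip markup tags, leaving only the raw text."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Lower-case and split into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def preprocess(text):
    """Tag removal -> tokenization -> stop-word removal."""
    return [t for t in tokenize(remove_tags(text)) if t not in STOP_WORDS]

sample = "<p>The clustering of textual documents is useful.</p>"
print(preprocess(sample))  # → ['clustering', 'textual', 'documents', 'useful']
```

The resulting "pure tokens" are what the later synset-replacement, lemmatization and COBWEB clustering stages operate on.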


3.3

Framework: Steps

The thesis framework identifies four phases, which together contain nine steps. The four phases are:

i. Pre-Processing
ii. Stemming and Lemmatizing
iii. Clustering
iv. Accuracy

Steps one to four belong to phase one, Pre-Processing: Sample Text Files; Remove Tags from Text; Tokenize Documents; and Remove Stopwords. Phase two has two steps: Synset Replacement (WordNet) and Lemmatization (WordNet). The third phase has only one step, Clustering (COBWEB Algorithm). The final phase has two steps: Measure Cluster Accuracy and High Accuracy Cluster.

The nine steps of the framework are:

01. Sample Text Files — take some text files as the input text.
02. Remove Tags from Text — tags only represent the text; clustering needs the raw text.
03. Tokenize Documents — tokenize all the samples and store them.
04. Remove Stopwords — remove every stop word in the Oxford Dictionary list from the tokens.
05. Synset Replacement (WordNet) — send each word token to WordNet, get its synsets, and replace.
06. Lemmatization (WordNet) — lemmatize the words to get a more appropriate result.
07. Clustering (COBWEB Algorithm) — cluster the bag of word tokens with the COBWEB clustering algorithm.
08. Measure Cluster Accuracy — send the clusters to F-Measure to get their accuracy.
09. High Accuracy Cluster — end the system by outputting the high accuracy clusters.

Figure 3.1: Steps of framework


Tags are used to represent text, but clustering needs the raw text, so the sample text must be cleaned first. The first step removes all tags from the sample text files. After removing the tags, the next step splits the sample text into tokens. In this study word tokens are needed; in some cases researchers divide the text into sentences and obtain phrase tokens, but this semantic document clustering method requires word tokens. Once the tokenization process is complete, the focus turns to removing unwanted words from the tokens. Guided by the Oxford Dictionary, a list of four hundred and twenty-nine stop words is used, and these stop words are eliminated from the tokens, leaving the system with pure tokens. With noise-free, pure text tokens, it is time for the stemming process. Stemming needs a tool or algorithm that can return the semantically similar word together with its synonyms; in this case WordNet is used. WordNet is a semantic lexical database that can give back the meaning and type of a word together with its synonyms. After the stemming and lemmatizing processes, the framework steps forward to clustering. The clustering is conceptual: for this textual document clustering, the COBWEB clustering algorithm is applied to the stemmed data, following the COBWEB algorithm steps. When the clustering process finishes, the resulting clusters are the clusters of this semantic textual document process. Finally, the F-Measure technique is applied to the clusters to confirm their accuracy; F-Measure outputs the clustering accuracy. The rest of this chapter describes all four phases in detail, with their nine steps.

3.4

Pre-Processing

For accurate text clustering, good data is generally needed; a thorough cleaning of the data is a vital step to improve the quality of data mining methods. Not only the correctness but also the consistency of values is essential. The pre-processing method plays a significant role in text clustering techniques and applications, and it is the first step in the semantic text clustering process.

Remove Tags: A text file contains tags that represent the text. These tags may encode spacing, new lines, justification, orientation, size, and so on, but when clustering is applied to the text, those tags have

nothing to do. Tags must be cleaned before a more accurate clustering result can be achieved, so all tags are removed for a better clustering process.

Figure 3.2: Flow-chart to remove tags from input text (open the text files; while tags remain, remove them using a library function or, failing that, a non-library function; once all tags are removed, save the clean tag-free text in basic text format under a new name).

The proposed development is in Python, and xml.etree is available in the Python Standard Library, so it is adapted here to serve like the current lxml version. The general syntax is:

“[
def remove_tags(text):
    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())
]”

Sometimes the standard library function cannot remove all the tags. In that case a non-library Python function without complex dependencies is used; it takes more system time, which is why the library function is tried first. In most cases the library function removes all the tags; only when it does not work is the non-library function used. The syntax in the proposed development is given below:

“[
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)
]”
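The two approaches above can be combined into a single helper that tries the standard-library parser first and falls back to the regular expression when the input is not well-formed XML. This is a sketch under that assumption; the combined fallback logic is an illustration, not the exact code of the development.

```python
import re
import xml.etree.ElementTree as ET

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    # Prefer the standard-library XML parser; if the text is not
    # well-formed XML, fall back to stripping tags with a regex.
    try:
        return ''.join(ET.fromstring(text).itertext())
    except ET.ParseError:
        return TAG_RE.sub('', text)
```

This keeps the fast, robust library path for clean input while still handling broken markup.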

Tokenizing Documents: The term tokenization is used for two major topics in the information technology world: tokenization in lexical analysis and tokenization in data security. This work concerns tokenization in lexical analysis for textual data normalization. In lexical analysis, tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements referred to as tokens. The list of tokens becomes the input for further processing such as parsing or text mining. Tokenization is useful both in linguistics, where it is a form of text segmentation, and in computer science, where it forms part of lexical analysis. It is a vital activity in any information retrieval model that simply segregates all the words, numbers, and other characters from a given document; these identified words, numbers, and other characters are referred to as tokens (Qui & Tang, 2007; Karthikeyan & Aruna, 2012). Along with token generation, this process also evaluates the frequency of the tokens present in the input documents. In this step of the proposed methodology, the texts are tokenized. The previous step produced tag-free text, saved under the given name and location; those tag-free texts are opened and split into word tokens. The development phase uses NLTK in Python; installing NLTK in the proposed semantic clustering process is much easier than traditional hand-written coding. The syntax is given below:

“[
from nltk.tokenize import word_tokenize

TEXT = "MEDIU is a Non-Profit Higher Education Institute established in 2006 in Malaysia."
token = word_tokenize(TEXT)
]”

During tokenization, some special characters become an issue. They are sometimes counted as tokens; for example, “?” counts as a token. They are indeed tokens, but semantic clustering does not need them at all; a clean, burden-free token collection is required, so those tokens are eliminated. The syntax is given below:

“[
TEXT = [word.replace("?", "") for word in TEXT]
]”
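The same cleanup can be generalized beyond “?” by stripping punctuation from every token and discarding tokens that become empty. This sketch (the helper name `clean_tokens` is illustrative) also applies the lower-casing used later in this step:

```python
import string

def clean_tokens(tokens):
    # strip punctuation from each token, drop tokens that were
    # punctuation only, and lower-case the rest
    stripped = (t.strip(string.punctuation) for t in tokens)
    return [t.lower() for t in stripped if t]
```

Handling all punctuation in one pass avoids adding a separate `replace` call for every special character.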

The textual data is clustered based on semantics, so case sensitivity is not a factor in obtaining the semantic meaning of a word; sometimes, however, case sensitivity is a major issue for accurate clustering.

Figure 3.3: Flow-chart to get noise-free tokens (take the tag-free text; tokenize it; remove noisy tokens; apply case elimination; the result is noise-free tokens).

To get appropriate noise-free tokens, all text is made lower case. The syntax is given below:

“[
TEXT = [item.lower() for item in TEXT]
]”

Remove Stop Words: There are three forms of stop words in English.

Generic stop words, misspelling stop words, and domain stop words. Generic stop words can be picked up while scanning the documents; the latter two have to wait until all documents in the corpus are read through and statistical calculations are applied. Generic stop words are non-information-bearing words within a particular language; they can be removed without considering any domain knowledge. In English these are the extremely common words such as ‘a,’ ‘an,’ ‘and,’ ‘the,’ and so on. Misspelling stop words are not real words but misspelled ones. Inevitably people may by mistake input search words that are not in the dictionaries, such as spelling “world” as “wrold.” Within a context, a human being can recognize this as a spelling error and still extract the proper meaning, but it is difficult for a computer to confirm the correct spelling.

Figure 3.4: Flow-chart to remove stop words and get fresh tokens (each noise-free token is matched against the list of stop words; matching tokens are removed, leaving the fresh tokens).

Domain stop words, in general, are not common words; they turn into stop words only under specific domain knowledge or content. For example, in a document corpus containing documents from the categories animal, automobile, geography, economy, politics, and computer, the word “computer” is not a stop word, because it is not common to all the other categories and it helps to distinguish the computer-related documents from other documents, such as animal-related or geography-related ones. However, in a corpus in which all documents discuss different aspects of computers, such as software, hardware, and computer applications, the word “computer” is too common to be included in further processing. Stop words are words that from a non-linguistic view do not carry information (Marko et al., 2011). There are words in English, such as pronouns, prepositions, and conjunctions, that are used to give structure to the language rather than content. These words, which are encountered very often and carry no useful information about the content or the category of documents, are called stop words, and removing stop words from documents is very common in information retrieval. One major property of stop words is that they are ubiquitous; the meaning of a sentence is still preserved when these stop words are removed. Stop words depend on the language, and different languages have their own stop-word lists. The proposed development focuses on the four hundred and twenty-nine stop words declared by the Oxford English Dictionary. They are given below.

Table 3.2 List of stop words

A: a, about, above, across, after, again, against, all, almost, alone, along, already, also, although, always, among, an, and, another, any, anybody, anyone, anything, anywhere, are, area, areas, around, as, ask, asked, asking, asks, at, away
B: b, back, backed, backing, backs, be, became, because, become, becomes, been, before, began, behind, being, beings, best, better, between, big, both, but, by
C: c, came, can, cannot, case, cases, certain, certainly, clear, clearly, come, could
D: d, did, differ, different, differently, do, does, done, down, downed, downing, downs, during
E: e, each, early, either, end, ended, ending, ends, enough, even, evenly, ever, every, everybody, everyone, everything, everywhere
F: f, face, faces, fact, facts, far, felt, few, find, finds, first, for, four, from, full, fully, further, furthered, furthering, furthers
G: g, gave, general, generally, get, gets, give, given, gives, go, going, good, goods, got, great, greater, greatest, group, grouped, grouping, groups
H: h, had, has, have, having, he, her, here, herself, high, higher, highest, him, himself, his, how, however
I: i, if, important, in, interest, interested, interesting, interests, into, is, it, its, itself
J: j, just
K: k, keep, keeps, kind, knew, know, known, knows
L: l, large, largely, last, later, latest, least, less, let, lets, like, likely, long, longer, longest
M: m, made, make, making, man, many, may, me, member, members, men, might, more, most, mostly, mr, mrs, much, must, my, myself
N: n, necessary, need, needed, needing, needs, never, new, newer, newest, next, no, nobody, non, noone, not, nothing, now, nowhere, number, numbers
O: o, of, off, often, old, older, oldest, on, once, one, only, open, opened, opening, opens, or, order, ordered, ordering, orders, other, others, our, out, over
P: p, part, parted, parting, parts, per, perhaps, place, places, point, pointed, pointing, points, possible, present, presented, presenting, presents, problem, problems, put, puts
Q: q, quite
R: r, rather, really, right, room, rooms
S: s, said, same, saw, say, says, second, seconds, see, seem, seemed, seeming, seems, sees, several, shall, she, should, show, showed, showing, shows, side, sides, since, small, smaller, smallest, so, some, somebody, someone, something, somewhere, state, states, still, such, sure
T: t, take, taken, than, that, the, their, them, then, there, therefore, these, they, thing, things, think, thinks, this, those, though, thought, thoughts, three, through, thus, to, today, together, too, took, toward, turn, turned, turning, turns, two
U: u, under, until, up, upon, us, use, used, uses
V: v, very
W: w, want, wanted, wanting, wants, was, way, ways, we, well, wells, went, were, what, when, where, whether, which, while, who, whole, whose, why, will, with, within, without, work, worked, working, works, would
X: x
Y: y, year, years, yet, you, young, younger, youngest, your, yours
Z: z

The whole text must be checked to see whether the listed stop words appear in the input text; any that do are removed. This development is in Python and uses the NLTK tools. The Python (NLTK) syntax for removing stop words is given below:

“[
TEXT = [x for x in TEXT if x not in stopwords]
]”
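As a concrete illustration, the list comprehension above behaves as follows with a small excerpt of the 429-word list (the excerpt is illustrative; the full list is given in Table 3.2). Matching is done on lower-cased tokens, consistent with the earlier case elimination step:

```python
# excerpt of the Oxford stop-word list from Table 3.2 (lower-cased)
stopwords = {"a", "an", "and", "the", "is", "in", "of", "on"}

def remove_stopwords(tokens):
    # keep only tokens that are not in the stop-word list
    return [x for x in tokens if x not in stopwords]
```

Using a set rather than a list for `stopwords` makes each membership test constant time, which matters when every token of every document is checked against 429 entries.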

3.5

Stemming and Lemmatizing

Stemming techniques are needed to find the root, or stem, of a word. Stemming converts words to their stems, which incorporates a good deal of language-dependent linguistic knowledge. Behind stemming lies the assumption that words with the same root describe the same or relatively close concepts in the text, so words can be conflated by matching stems (Gaigole, Patil & Chaudhari, 2013). In most languages there exist different syntactic forms (Vester & Martiny, 2005) of a word describing the same concept: in English, nouns have singular and plural forms, and verbs have present, past, and participial tenses. These different forms of the same word can be a problem for text data analysis (Grobelnik & Mladenic, 2004) because they have different spellings yet share a similar meaning. Although the stemming process can help the clustering algorithms, it can also affect them negatively if over-stemming occurs. Over-stemming means that words are wrongly stemmed together although they are sufficiently different in meaning and should not be grouped (Chris & Chris, 2011); it introduces noise into the process and leads to poor clustering performance. A good stemmer must be able to convert the different syntactic forms of a word into its normalized form, reduce the number of index terms, save memory and storage, and thereby increase the performance of the clustering algorithms to some extent, while at the same time avoiding over-stemming. In this development, WordNet is used to obtain an appropriate and effective stemming process. A token list is prepared for the entire set of tokens, and each token is sent one by one to WordNet. WordNet is queried for synonyms and returns synsets; any set of synonyms received from WordNet is called a synset. The development is in Python and uses the NLTK tools for the WordNet API. The general syntax to get the synsets is:

“[

wn.synsets('small')

output: [Synset('small.n.01'), Synset('small.n.02'), Synset('small.a.01'), Synset('minor.s.10'), Synset('little.s.03'), Synset('small.s.04'), Synset('humble.s.01'), Synset('little.s.07'), Synset('little.s.05'), Synset('small.s.08'), Synset('modest.s.02'), Synset('belittled.s.01'), Synset('small.r.01')]
]”

The syntax for getting synsets with their lemmas is given below; in this example all the synsets and lemmas are replaced by ‘small’:

“[
for ss in wn.synsets('small'):
    print(ss.name(), ss.lemma_names())

Output:
small.n.01 ['small']
small.n.02 ['small']
small.a.01 ['small', 'little']
minor.s.10 ['minor', 'modest', 'small', 'small-scale', 'pocket-size', 'pocket-sized']
little.s.03 ['little', 'small']
small.s.04 ['small']
humble.s.01 ['humble', 'low', 'lowly', 'modest', 'small']
]”

After the synsets and lemmas are replaced by their root word, the data is more normalized, which helps the clustering algorithm produce more appropriate clusters. At this stage the focus is on lemmatization to finish the stemming process. Lemmatization is the process of normalizing a word from its inflected forms; the customized WordNet lemmatizer tool is used for it. Below is an example of the nltk.stem.wordnet module that makes the lemmatization process clear.


Figure 3.5: Flow-chart of the stemming process (from the fresh tokens a fundamental token list is created; WordNet supplies synonyms (synsets); the synsets are matched against the tokens; matched tokens are replaced by their root word; lemmatization then produces the stemmed tokens).

“[

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize('dogs'))

Output: dog
]”

When the lemmatization process is complete, the tokens are stemmed; lemmatization is the last task in the stemming process. With the stemmed tokens in hand, the clustering process can now be applied.
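The effect of synset replacement can be sketched without WordNet by using a hand-made synonym-to-root map. The map below is purely illustrative; in the actual development the groups come from WordNet synsets:

```python
# illustrative synonym groups; the real ones come from WordNet synsets
SYNONYM_ROOT = {
    "little": "small", "minor": "small",
    "modest": "small", "humble": "small",
}

def replace_by_root(tokens):
    # replace every token that belongs to a synonym group
    # by the group's root word; leave other tokens unchanged
    return [SYNONYM_ROOT.get(t, t) for t in tokens]
```

After this replacement, documents that use different synonyms of the same concept share identical tokens, which is what lets the clustering step group them together.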


3.6

Clustering

In this stage clustering is applied to the stemmed tokens to obtain the clusters. Conceptual clustering is used, and in this case the COBWEB algorithm was chosen to complete the clustering process. Before clustering, one more step is needed to finally prepare the data. The clustering does not need the raw data every time, so in this development an SQLite database is used for saving tokens. For each text document an array is allocated and its tokens are saved there. With the normalized, stemmed tokens in the arrays, a table of token information is prepared in the existing SQLite database: the name of the input file and its tokens with their frequencies. Measuring the frequency requires a key token list for each document; the original list, prepared for the stemming process, is used. With NLTK, a simple count command returns the frequency of a particular word in the text. The syntax is given below:

“[
TEXT.count("cat")
]”
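Calling `.count()` once per distinct word rescans the token list each time; `collections.Counter` from the standard library computes all frequencies in a single pass and could serve the same purpose. This is a sketch of the alternative, not the code used in the development:

```python
from collections import Counter

# a tiny illustrative token list; in the development this would be
# the stemmed token array of one document
tokens = ["country", "economy", "country", "river", "country"]

freq = Counter(tokens)  # maps each token to its frequency
```

`freq.most_common()` then yields the tokens ordered by frequency, which is convenient when filling a frequency table such as Table 3.3.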

To maintain an SQLite database nothing needs to be set up; SQLite is a built-in tool in Python, and using an SQLite database from Python is straightforward. The general syntax for connecting, creating a database and a table, inserting data, and committing (saving) data is given below:

“[
import sqlite3

conn = sqlite3.connect('example.db')
c = conn.cursor()
c.execute('''CREATE TABLE stocks
             (date text, trans text, symbol text, qty real, price real)''')
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("SELECT price FROM stocks")
conn.commit()
conn.close()
]”
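Adapting the generic example to the token table of this development, the following sketch stores and queries rows shaped like Table 3.3. The table name `tokens` and the in-memory database are illustrative choices, not the exact schema of the development:

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # illustrative; a file path works the same way
c = conn.cursor()
c.execute('CREATE TABLE tokens (source TEXT, word TEXT, frequency INTEGER)')

# a few rows from Table 3.3
rows = [
    ('Text_mediu', 'university', 5),
    ('Text_mediu', 'islamic', 19),
    ('Text_malaysia', 'country', 17),
]
c.executemany('INSERT INTO tokens VALUES (?, ?, ?)', rows)
conn.commit()

# look up one frequency value
c.execute('SELECT frequency FROM tokens WHERE source=? AND word=?',
          ('Text_mediu', 'islamic'))
row = c.fetchone()
```

Parameterized queries (`?` placeholders) are used instead of string concatenation, which keeps the inserts safe and the code simple.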

With the frequency of each token in each text available in the lists, they are saved into one table. The table holds the source name of the word (token), the word (token) itself, and the frequency of the word (token).

Table 3.3 Token with frequency ready for clustering operation

Source            Word        Frequency
Text_mediu        university  05
Text_mediu        islamic     19
Text_mediu        education   08
Text_malaysia     country     17
Text_malaysia     economy     10
Text_malaysia     education   03
Text_bangladesh   country     04
Text_bangladesh   economy     12
Text_bangladesh   river       17

These data are now sent to the COBWEB algorithm for hierarchical clustering. This development project uses the Python platform. To complete the clustering of the data with the COBWEB algorithm, the tool named “concept_formation 0.1.3” [125] is used; it is a complete package for COBWEB conceptual clustering and contains the pure COBWEB clustering algorithm in code. The general COBWEB categorization routine is attached here; using the COBWEB algorithm for this development's purpose needs only slight modification. The Python code is given below:

“[
current = self.root
while current:
    if not current.children and (current.is_exact_match(instance)
                                 or current.count == 0):
        current.increment_counts(instance)
        break
    elif not current.children:
        new = current.__class__(current)
        current.parent = new
        new.children.append(current)
        if new.parent:
            new.parent.children.remove(current)
            new.parent.children.append(new)
        else:
            self.root = new
        new.increment_counts(instance)
        current = new.create_new_child(instance)
        break
    else:
        best1, best2 = current.two_best_children(instance)
        action_cu, best_action = current.get_best_operation(instance,
                                                            best1, best2)
        if best1:
            best1_cu, best1 = best1
        if best2:
            best2_cu, best2 = best2
        if best_action == 'best':
            current.increment_counts(instance)
            current = best1
        elif best_action == 'new':
            current.increment_counts(instance)
            current = current.create_new_child(instance)
            break
        elif best_action == 'merge':
            current.increment_counts(instance)
            new_child = current.merge(best1, best2)
            current = new_child
        elif best_action == 'split':
            current.split(best1)
        else:
            raise Exception('Best action choice "' + best_action +
                            '" not a recognized option. This should be '
                            'impossible...')
return current
]”

3.7

Accuracy

When an accuracy measurement technique applied to the clusters gives a satisfactory result, it can be assured that the method is useful and convenient for textual document clustering. To measure accuracy, F-Measure is applied to the clusters. F-Measure has two component measures: precision and recall. The maximum value of F-Measure is 1 and the minimum is 0; a value of 1 means the clusters are 100% accurate. This development is in Python, and the general code for retrieving the precision and recall values is given below:

precision:
“[
if (not hasattr(reference, 'intersection') or
        not hasattr(test, 'intersection')):
    raise TypeError('reference and test should be sets')
if len(test) == 0:
    return None
else:
    return len(reference.intersection(test)) / len(test)

recall:
if (not hasattr(reference, 'intersection') or
        not hasattr(test, 'intersection')):
    raise TypeError('reference and test should be sets')
if len(reference) == 0:
    return None
else:
    return len(reference.intersection(test)) / len(reference)
]”

With the precision and recall available, knowing the accuracy of a cluster is only a matter of time. According to equation 2.1:

F = 2PR / (P + R)

Here F means F-Measure, P represents the precision, and R the recall; both values are now known. The Python code at this stage of the development is:

“[
p = precision(reference, test)
r = recall(reference, test)
if p is None or r is None:
    return None
if p == 0 or r == 0:
    return 0
return 1.0 / (alpha / p + (1 - alpha) / r)
]”

The whole F-Measure process can also be completed with the NLTK tools: “[nltk.metrics.scores.f_measure(reference, test, alpha=0.5)]” does the same thing. It calculates the precision and recall and then uses those two values to calculate the accuracy of the clusters.
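The two formulations agree: with alpha = 0.5, 1/(alpha/p + (1 − alpha)/r) reduces to 2PR/(P + R). A self-contained worked sketch (the set contents are illustrative, not taken from the thesis data):

```python
def precision(reference, test):
    # fraction of the test cluster that is relevant
    return len(reference & test) / len(test) if test else None

def recall(reference, test):
    # fraction of the relevant documents found in the test cluster
    return len(reference & test) / len(reference) if reference else None

def f_measure(reference, test, alpha=0.5):
    p, r = precision(reference, test), recall(reference, test)
    if p is None or r is None:
        return None
    if p == 0 or r == 0:
        return 0
    return 1.0 / (alpha / p + (1 - alpha) / r)

reference = {"d1", "d2", "d3", "d4"}   # documents truly in the class
test = {"d1", "d2", "d5"}              # documents placed in the cluster
score = f_measure(reference, test)     # p = 2/3, r = 1/2, F = 4/7
```

With p = 2/3 and r = 1/2, both formulas give F = 4/7 ≈ 0.571, confirming the equivalence.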


CHAPTER FOUR TESTING AND RESULT ANALYSIS

4.1

Environmental Setup

Python is used as the development language, and several tools were used for testing and developing the framework. The whole development and testing was accomplished in a Windows environment. Python is a brilliant and dynamic general-purpose, procedural programming language; it is interpreted, which makes it comfortable for the programmer to find mistakes. Python was chosen because it has many useful tools that make natural language processing much faster and easier to operate. In the experiment, an IDE for Python was used on Microsoft Windows 10, and within Python many tools were used to keep the code simple.

4.1.1

Hardware

Testing and development used Windows 10 Ultimate edition on an HP TouchSmart 320 Desktop PC with the following technical hardware specification:

Display: 50.80 cm (20 inch)
Resolution: 1600 x 900 (16:9 aspect ratio)
Motherboard: Angelino2-UB
Processor: AMD A6-3600, 4 MB Cache
Memory: 8 GB, PC3-10600 MB/sec
Hard Drive: 1 TB, 7200 rpm rotational speed

4.1.2

Soft Tools

PyCharm: PyCharm Professional is a Python IDE for professional developers. PyCharm Professional Version 2016.3.2 (Build 163.10154.50, released December 30, 2016) was used; it has a 30-day free trial for everyone, a one-year license is given for education purposes, and a one-year full free professional student license was used here. It is a full-featured IDE for Python and web development. It


was chosen for several features; there are many, but the most attractive ones are given here.

Intelligent Coding Assistance: PyCharm provides smart code completion, code inspections, on-the-fly error highlighting and quick fixes, along with automatic code refactorings and rich navigation capabilities.

Built-in Developer Tools: PyCharm's huge collection of tools out of the box includes an integrated debugger and test runner; a Python profiler; a built-in terminal; integration with major VCS and important database tools; remote development capabilities with remote interpreters; an integrated SSH terminal; and integration with Docker and Vagrant.

Scientific Tools: PyCharm integrates with IPython Notebook, has an interactive Python console, and supports Anaconda as well as multiple scientific packages such as Matplotlib and NumPy.

NLTK: The Natural Language Toolkit (NLTK) has been called “a marvelous tool for teaching and working in computational linguistics using Python” and “an amazing library to play with natural language.” NLTK is a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to over fifty corpora and lexical resources such as WordNet, together with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength natural language technology libraries, and an active discussion forum. WordNet is used in Python as an integral function of NLTK: it is enough to import NLTK and WordNet for the process. NLTK was used for this development because of its rich datasets and tools for the natural language process; a lot of code would otherwise need to be written for every section, but importing NLTK brings many default library functions that help greatly and save time. Natural Language Processing with Python, written by the creators of NLTK, provides a practical introduction to programming for language processing, guiding the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing grammatical structure, and more. NLTK is open-source software: the code is distributed under the terms of the Apache License Version 2.0, the documentation under the terms of the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 US license, and the corpora under various licenses. NLTK is undergoing continual development as new modules are added and existing ones improved; dozens of corpora are available for use with NLTK, and NLTK can be interfaced to other corpora.

SQLite: SQLite is a C library that provides a lightweight disk-based database; it does not need a separate server process and allows accessing the database using a nonstandard variant of the SQL query language. Some applications can use SQLite for internal data storage. It is also possible to prototype an application using SQLite and then port the code to a larger database such as PostgreSQL or Oracle. The Python Standard Library includes a module called "sqlite3" intended for working with this database. SQLite3 is an easy-to-use database engine: self-contained, serverless, zero-configuration, and transactional. It is very fast and lightweight, and the complete database is stored in a single file; it is used in plenty of applications as internal data storage. The module is a SQL interface compliant with the DB-API 2.0 specification. SQLite is used here because it works with the common SQL language; if the database grows really large, it can be shifted to Oracle or PostgreSQL. Maneuvering the database has been simple: there are APIs for Oracle and PostgreSQL in Python, and connecting to a database API in Python is very easy.

COBWEB: concept_formation 0.1.3 is used to apply COBWEB in the program. It is a library for doing incremental concept formation using algorithms in the COBWEB family, written by Christopher MacLellan and Erik Harpstead; it is enough to import the library and COBWEB for the process. The library implements the COBWEB and COBWEB/3 algorithms. The system accepts a stream of instances, represented as dictionaries of attributes and values, and learns a concept hierarchy; the hierarchy can then be used for clustering and prediction. The library also includes TRESTLE, an extension of COBWEB and COBWEB/3 that supports structured and relational data objects, employing partial matching to rename new objects to align with previous examples and then categorizing the renamed objects. The library actually contains three main conceptual clustering algorithms, of which only COBWEB is used here; it makes the development easy and error-free.

F-Measure: The nltk.metrics.scores module provides a function named nltk.metrics.scores.f_measure, which is used to measure the clustering accuracy. Given a set of reference values and a set of test values, it returns the f-measure of the test values compared against the reference values. The f-measure is the weighted harmonic mean of precision and recall, weighted by alpha. Specifically, precision p and recall r are defined by:

    p = |reference ∩ test| / |test|

    r = |reference ∩ test| / |reference|

and the f-measure is:

    1 / (alpha / p + (1 - alpha) / r)

In this development, the function is simply called as:

    nltk.metrics.scores.f_measure(reference, test, alpha=0.5)
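For readers without NLTK at hand, the same computation can be written directly from the definitions above. This is a plain-Python sketch of the formula, not the NLTK source:

```python
def f_measure(reference, test, alpha=0.5):
    """F-measure of a test set against a reference set, per the definitions above."""
    reference, test = set(reference), set(test)
    if not reference or not test:
        return None
    p = len(reference & test) / len(test)       # precision
    r = len(reference & test) / len(reference)  # recall
    if p == 0 or r == 0:
        return 0.0
    return 1.0 / (alpha / p + (1 - alpha) / r)

# With alpha = 0.5 this reduces to the harmonic mean of precision and recall.
print(f_measure({"a", "b", "c", "d"}, {"a", "b", "x"}))
```

Here p = 2/3 and r = 1/2, so the score is 4/7 ≈ 0.571.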

4.1.3 Sample Data

For testing this development some sample data is needed. The framework could be tested on a huge amount of data; however, the preliminary testing uses a limited data set, because during the testing phase the data, the intermediate results, and every detail had to be examined. Twenty paper abstracts related to this study were selected, so technically there are 20 sample documents, and every step is applied from start to end to these 20 samples.

4.2 Document Pre-Processing

With the framework developed and the sample data in hand, testing completes the process in two phases: pre-processing first, then clustering and accuracy measurement. Taking the sample data as input, the system first removes the tags from each text file and tokenizes it. After removing the tags and tokenizing the 20 files, the token counts are reported in Table 4.1.

Table 4.1 Sample text file report after tokenizing

Name of Source File | Number of Tokens in File
Sample 1  | 213
Sample 2  | 257
Sample 3  | 127
Sample 4  | 204
Sample 5  | 451
Sample 6  | 216
Sample 7  | 108
Sample 8  | 259
Sample 9  | 151
Sample 10 | 79
Sample 11 | 149
Sample 12 | 86
Sample 13 | 154
Sample 14 | 100
Sample 15 | 132
Sample 16 | 84
Sample 17 | 152
Sample 18 | 139
Sample 19 | 114
Sample 20 | 117
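The tag-removal and tokenization step can be sketched as follows. The regular expressions are an illustrative approximation of the framework's behaviour, not its exact code:

```python
import re

def strip_tags(text):
    # Remove anything that looks like an HTML/XML tag, then collapse whitespace.
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", text)).strip()

def tokenize(text):
    # Lowercase word tokens; hyphens and digits are kept so terms like "tf-idf" survive.
    return re.findall(r"[a-z0-9][a-z0-9-]*", text.lower())

sample = "<p>Textual <b>documents</b> are clustered with tf-idf weights.</p>"
tokens = tokenize(strip_tags(sample))
print(tokens)      # ['textual', 'documents', 'are', 'clustered', 'with', 'tf-idf', 'weights']
print(len(tokens)) # 7 tokens, as Table 4.1 would count for this tiny sample
```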

After tokenizing, the system removes stop-word tokens from the tag-free token stream. Each token is matched individually against the stop-word list; whenever a token matches a stop word, the system deletes that token immediately. The stop-word list was given in Chapter 3: it contains only 429 words, but those words occur with very high frequency. The number of tokens removed and the tokens left after stop-word removal are reported in Table 4.2.

Table 4.2 Number of tokens removed and tokens left for processing after stop-word removal

Name of Source File | Tokens Removed | Tokens after Stop-Word Removal
Sample 1  | 86  | 127
Sample 2  | 105 | 152
Sample 3  | 71  | 56
Sample 4  | 86  | 118
Sample 5  | 213 | 218
Sample 6  | 103 | 113
Sample 7  | 61  | 47
Sample 8  | 128 | 131
Sample 9  | 62  | 89
Sample 10 | 34  | 45
Sample 11 | 60  | 89
Sample 12 | 49  | 37
Sample 13 | 74  | 80
Sample 14 | 47  | 53
Sample 15 | 66  | 66
Sample 16 | 37  | 47
Sample 17 | 59  | 93
Sample 18 | 60  | 79
Sample 19 | 52  | 62
Sample 20 | 71  | 46
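Stop-word removal as described, matching every token against the list, can be sketched like this; only an excerpt of the 429-word list is shown:

```python
# Excerpt of the stop-word list from Chapter 3 (illustrative subset).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "this", "with"}

def remove_stop_words(tokens):
    # Each token is matched against the list; matches are dropped immediately.
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["clustering", "of", "the", "textual", "documents", "is", "performed"]
kept = remove_stop_words(tokens)
print(kept)                    # ['clustering', 'textual', 'documents', 'performed']
print(len(tokens) - len(kept)) # 3 tokens removed
```

The counts before and after this filter are exactly what Table 4.2 reports per sample.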

In this step the system takes the help of the WordNet lexical database. The framework sends each token to WordNet and gets back up to ten synsets (synonyms) for the word. After obtaining the synonyms of a word, the system matches them against the rest of the tokens; if any token matches one of the synonyms, that token is replaced by the word whose synsets were requested. In this way a word's synonyms are replaced by the root word itself throughout the text. Figure 4.1 shows the result of WordNet synsets replaced by their root words for Sample 1. ['pages', 'stored', 'preprocessing', 'maintaining', 'parameters', 'stages', 'preprocessing', 'removing', 'stemming', 'performed', 'maintaining', 'sets', 'generated', 'extracted', 'documents', 'approaches', 'tf-idf', 'tf-df', 'tf2', 'preprocessed', 'counted', 'attributes', 'experiments', 'evaluated', 'reuters', 'subsets', 'reuters', '21578', '(atheism)', '(hardware)', '(computer', 'graphics)', 'etc', 'days', 'text', 'spontaneously', 'increasing', 'e-mail', 'electronic', 'database', 'browse', 'difficult', 'overcome', 'selection', 'selection', 'selection', 'reduction', 'relationship', 'relationship', 'terms', 'terms', 'terms', 'terms', 'terms', 'terms', 'terms', 'terms', 'using', 'using', 'using', 'using', 'background', 'knowledge', 'wordnet', 'wordnet', 'data', 'data', 'mining', 'paper', 'stages', 'stages', 'formed', 'formed', 'firstly', 'stop', 'words', 'words', 'words', 'words', 'words', 'words', 'words', 'porter', 'stemmer', 'algorithm', 'net', 'net', 'net', 'thesaurus', 'applied', 'global', 'unique', 'frequent', 'secondly', 'matrix', 'thirdly', 'documents', 'documents', 'documents', 'documents', 'documents', 'documents', 'approaches', 'based', 'minimum', 'threshold', 'frequency', 'representation', 'purpose', 'reduce', 'attributes', 'effective', 'method', 'clustering', 'accuracy', 'evaluated', 'transcription', 'wheat', 'trade', 'money', 'grain', 'ship', 'classic', '30', '20', '20', '20']

Figure 4.1: WordNet synsets replaced by root words for Sample 1
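A minimal sketch of the synset-replacement rule: in the real framework the synonyms come from WordNet via NLTK, whereas here a tiny hand-made map stands in for the lexicon, so its entries are illustrative assumptions only.

```python
# Toy stand-in for WordNet synsets; the real system queries the WordNet database.
SYNSETS = {
    "document": {"paper", "text", "report"},
    "cluster": {"group", "bunch"},
}

def replace_synonyms(tokens):
    out = list(tokens)
    for word, synonyms in SYNSETS.items():
        # Any token that appears among a word's synonyms is replaced by that root word.
        out = [word if t in synonyms else t for t in out]
    return out

print(replace_synonyms(["group", "the", "paper", "and", "report", "files"]))
# ['cluster', 'the', 'document', 'and', 'document', 'files']
```

Collapsing synonyms onto one root word is what lets later frequency counts treat "paper" and "report" as the same term.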

After this, the system focuses on lemmatization, which makes the text tokens more stable and normalized for clustering. Each word is sent to the lemmatization algorithm, which removes tense and inflection and makes the token easier to work with. As an example, if the three words Cook, Cooked, and Cooking are sent to the lemmatization process, all three become Cook. Figure 4.2 shows the tokens of Sample 1 after lemmatization. 20 20 20 30 21578 (atheism) (comput (hardware) accuraci algorithm appli approach approach aramet attribut attribut background base brows classic cluster count data data databas day difficult document document document document document document document e-mail effect electron etc evalu evalu experi firstli form form frequenc frequent gener global grain graphics) increas knowledg maintain maintain matrix method mine minimum money net net net overcom page paper perform porter preprocess preprocess preprocess purpos reduc reduct relationship relationship remov represent reuter reuter secondli select select select set ship spontan stage stage stage stem stemmer store subset term term term term term term termterm text tf2 tf-df tf-idf thesauru thirdli threshold top trade transcript uniqu use use use use wheat word word word word word word word wordnet wordnet

Figure 4.2: Lemmatization applied to all tokens of Sample 1

After completing lemmatization, the system moves to the last pre-processing step: computing token frequencies across all 20 sample files. Only tokens that appear more than twice, that is at least three times, are taken forward for clustering. Table 4.3 lists these tokens with their frequencies and source files.

Table 4.3 Token frequencies (minimum three occurrences) with each token's source file

Sample 1: 20 (3), document (7), net (3), preprocess (3), select (3), stage (3), term (6), use (4), word (7)
Sample 2: algorithm (3), appli (4), approach (6), base (3), cluster (10), document (9), estim (3), inform (5), languag (7), method (3), multilingu (8), related (3), result (3), semant (3), similar (3), supervisori (3)
Sample 3: index (3)
Sample 4: aggreg (3), algorithm (4), base (3), cliqu (5), cluster (7), content (4), document (7), engin (3), queri (3), search (4), semant (5), shorter (3)
Sample 5: accuraci (3), address (3), algorithm (8), approach (4), citat (5), citonomi (3), cluster (22), cs2c (4), cs-v (3), document (7), dynam (3), featur (4), improv (3), make (3), measur (5), model (6), ontolog (6), public (3), relat (4), search (3), select (5), semant (8)
Sample 6: analysi (3), cluster (12), document (3), inform (3), measur (3), process (4), semant (4), text (3), word (3)
Sample 7: cluster (6), conjunct (3)
Sample 8: algorithm (8), categori (3), classif (6), cobweb (7), correspond (4), creat (3), cu (3), function (6), insert (6), maximum (3), merg (4), node (10), object (5), oper (5), select (4), separ (5), tree (8), valu (5)
Sample 9: classif (3), cluster (9), data (5), mine (3), object (4), result (3), techniqu (3)
Sample 10: cluster (5), conceptu (5)
Sample 11: algorithm (5), categor (3), cluster (6), conceptu (3), imput (3), inform (4), uncertainti (3), valu (4)
Sample 12: cluster (7), techniqu (3)
Sample 13: algorithm (8), categor (3), cluster (6), data (9), develop (3), scale (3), set (3)
Sample 14: cluster (5), conceptu (3), data (6), set (3), techniqu (3)
Sample 15: informed (3), measur (5), probabl (5)
Sample 16: cluster (4), conceptu (4)
Sample 17: concept (5), domain (3), measur (4), ontolog (5), semant (5), similar (8), techniqu (5)
Sample 18: approach (8), f-measur (3), learn (4), model (4)
Sample 19: algorithm (5), document (3), modifi (3), pass (5), research (3), singl (5), version (5)
Sample 20: f-measur (6)
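The final pre-processing step, keeping only tokens that occur at least three times, can be sketched with collections.Counter:

```python
from collections import Counter

def frequent_tokens(tokens, minimum=3):
    # Keep only tokens that appear at least `minimum` times (three, in this experiment).
    counts = Counter(tokens)
    return {tok: n for tok, n in counts.items() if n >= minimum}

tokens = ["cluster"] * 5 + ["document"] * 3 + ["rare"] * 2 + ["once"]
print(frequent_tokens(tokens))  # {'cluster': 5, 'document': 3}
```

Applied per sample file, this produces exactly the word/frequency rows of Table 4.3.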

4.3 Clustering and Accuracy

In this section the conceptual clustering is performed, and it is completed with the COBWEB algorithm. Once the word frequencies for each input text are available, the words and their frequency table are fed into the clustering algorithm, which returns the clustering result shown in Table 4.4.

Table 4.4 Clusters with their members

Cluster Name | Members of Cluster (Source Files)
Algorithm | Sample 5, Sample 8, Sample 11, Sample 13, Sample 19
Approach | Sample 2, Sample 18
Citat | Sample 5
Classif | Sample 8
Cliqu | Sample 4
Cluster | Sample 2, Sample 4, Sample 5, Sample 6, Sample 7, Sample 9, Sample 10, Sample 11, Sample 12, Sample 13, Sample 14
Cobweb | Sample 8
Concept | Sample 10, Sample 17
Data | Sample 9, Sample 13, Sample 14
Document | Sample 1, Sample 2, Sample 4, Sample 5
f-measur | Sample 20
Function | Sample 8
Inform | Sample 2
Insert | Sample 8
Language | Sample 2
Measure | Sample 5, Sample 15
Model | Sample 5
Multilingu | Sample 2
Node | Sample 8
Object | Sample 8
Ontolog | Sample 5, Sample 17
Oper | Sample 8
Pass | Sample 19
Probabl | Sample 15
Select | Sample 5
Semant | Sample 4, Sample 5, Sample 17
Separ | Sample 8
Similar | Sample 17
Singl | Sample 19
Technique | Sample 17
Term | Sample 1
Tree | Sample 8
Valu | Sample 8
Version | Sample 19
Word | Sample 1
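Before clustering, each document's frequency table has to be turned into the flat attribute–value dictionary that concept_formation expects. Below is a sketch under that assumption; the library call itself is shown only as a comment, with its API assumed from the concept_formation documentation rather than confirmed here:

```python
def to_instance(freq_table):
    # concept_formation represents each example as a flat dict of attribute -> value,
    # so a document becomes {token: frequency, ...}.
    return {tok: float(n) for tok, n in freq_table}

# Frequency rows for Sample 1 taken from Table 4.3 (excerpt).
sample_1 = to_instance([("document", 7), ("word", 7), ("term", 6), ("use", 4)])
print(sample_1)

# With the library installed, instances would be fed in incrementally, e.g.
# (assumed API, per concept_formation 0.1.3):
#   from concept_formation.cobweb import CobwebTree
#   tree = CobwebTree()
#   tree.ifit(sample_1)
```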

The system takes the 20 paper abstracts as input, a total of 3,292 tokens. After the entire clustering process completes, there are only 35 clusters. Within these 35 clusters, every sample text file is associated with a cluster except Sample 3 and Sample 16; those two inputs do not have enough maturity to be assigned a cluster.

Table 4.5 Clusters with their F-Measure accuracy

Cluster Name | Accuracy by F-Measure
Algorithm | 88.57%
Approach | 92.87%
Citat | 71.42%
Classif | 85.71%
Cliqu | 85.71%
Cluster | 90.90%
Cobweb | 100%
Concept | 71.42%
Data | 85.71%
Document | 100%
f-measur | 85.71%
Function | 85.71%
Inform | 71.42%
Insert | 87.71%
Language | 100%
Measure | 71.42%
Model | 85.71%
Multilingu | 100%
Node | 100%
Object | 71.42%
Ontolog | 78.57%
Oper | 71.42%
Pass | 71.42%
Probabl | 71.42%
Select | 71.42%
Semant | 80.95%
Separ | 71.42%
Similar | 100%
Singl | 71.42%
Technique | 71.42%
Term | 85.71%
Tree | 100%
Valu | 71.42%
Version | 71.42%
Word | 100%
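Given per-cluster scores like those in Table 4.5, the overall figure follows the stated rule of taking the minimum; the dictionary below is an excerpt, not the full table:

```python
# Per-cluster F-measure scores (excerpt of Table 4.5).
scores = {"Cobweb": 100.0, "Cluster": 90.90, "Algorithm": 88.57, "Citat": 71.42}

# The system reports the lowest per-cluster score as the whole-system accuracy.
overall = min(scores.values())
print(f"overall accuracy: {overall}%")  # overall accuracy: 71.42%
```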

This clustering process is oriented toward high quality, so it focuses on the accuracy of the clusters. To determine their quality, the clusters were passed through a standard quality measurement technique: the F-measure was applied to all 35 clusters, returning the accuracy of each cluster as given in Table 4.5. The minimum cluster accuracy is taken as the accuracy of the complete system; by this rule the overall accuracy is 71.42%, although some clusters achieve a 100% accurate result.

4.4 Discussion

In this experiment there are 20 samples from 20 different paper abstracts. After removing the tags from the sample texts and tokenizing, there are 3,292 tokens, and the operation continues on those tokens. Stop-word removal then finds that 1,524 tokens are stop words; these are removed, an enormous share amounting to 46.29% of all tokens, leaving 1,768 tokens. Those tokens are sent to WordNet individually to obtain their synsets (synonyms), and each synonym occurrence is replaced by its root word. The system then proceeds to lemmatization, sending each original word to the lemmatization process. Once synset replacement and lemmatization are complete, 672 tokens are returned to the system: the system received 3,292 tokens as input and now has only 672 tokens to cluster, which is 20.41% of the inputted tokens. Document pre-processing thus normalized the data very smoothly, leaving only about one fifth of the input. Among those 672 tokens there are 144 unique tokens, which appear many times across the 20 sample inputs; the most frequent word appears 22 times, while some words appear only the minimum number of times. When the COBWEB algorithm clusters the 672 tokens, it produces 35 clusters. All sample documents are associated with a cluster except Sample 3 and Sample 16, which do not have enough maturity to be assigned one. The clustering process is oriented toward high quality, so the clusters were passed through a standard quality measurement technique: applying the F-measure to the 35 clusters assures their accuracy. Some clusters are 100% correct; however, the system takes the minimum accuracy as the overall accuracy for the whole clustering, which was 71.42%. The key results are summarized below:

• 20 samples from 20 different sources.
• A total of 3,292 tokens came from the 20 samples.
• 1,524 tokens were removed by stop-word matching: 46.29% of the total tokens.
• 1,768 tokens were left.
• After the WordNet operations (synset replacement and lemmatization) 672 tokens remain, 20.41% of the total.
• Of those 672 tokens, only 144 are unique.
• The most frequent word appears 22 times.

Figure 4.3: Document pre-processing graph: number of input words, words after stop-word removal, words after synset and lemmatization replacement, and number of unique words

• When the clustering process is completed by the COBWEB algorithm, 35 clusters are obtained.
• All sample documents are associated with a cluster except Sample 3 and Sample 16, which do not have enough maturity to be assigned one.
• The F-measure was applied to all 35 clusters.
• Several clusters are 100% accurate.
• The minimum accuracy was taken as the overall accuracy, which was 71.42%.

CHAPTER FIVE
CONCLUSION

5.1 Conclusion

Text documents are snowballing as a result of the increasing amount of data available in electronic and digitized form, such as electronic publications, various kinds of electronic records, e-mail, and the World Wide Web. Recently most information about government, industry, business, and other institutions is held electronically, in the form of text databases. Most of the text in these databases is semi-structured data: it is not fully structured. A massive document corpus can offer a great deal of useful information to people, yet it is a challenge to find the useful information among such a large number of records. The presence of logical structure clues within documents, scientific criteria, and linguistic similarity measures are mainly used to identify thematically coherent, contiguous text blocks in unstructured documents. Maintaining and accessing those documents is difficult without adequate classification, and classification performed without any provided information is called clustering. To overcome such difficulties, K-means and other older clustering algorithms are unfit to perform as expected on natural languages. The high dimensionality of texts, the presence of logical structure clues within them, and novel segmentation techniques have taken advantage of advances in generative topic modeling algorithms, which were specifically designed to identify topics within text and compute word–topic distributions. Considering those challenges, this thesis proposed a semantic document clustering framework, developed on the Python platform with every step tested.

In this research the complete methodology was built in five sections. The methodology starts with a feasibility study, whose results support the requirement analysis; after obtaining the requirement analysis output, the thesis moved on to designing the model for textual document clustering. The modeling stage ended once development of the framework in Python was completed successfully, and testing and analysis were the last step to confirm the expected result. The method begins by taking sample data from twenty articles, using their twenty abstracts as the sample texts. Those texts contain tags; the tags are removed and the texts tokenized during pre-processing. By eliminating stop words, the tokens are prepared for synset replacement with the help of the WordNet semantic thesaurus and for lemmatization, which generalizes them for clustering. Once those steps are complete, the COBWEB clustering algorithm is applied to the tokens, and the F-measure is applied to the clusters to obtain their accuracy. There are nine steps in four phases. Steps one to four form phase one, the pre-processing stage: sample text files; removing tags from the text; tokenizing the documents; and removing stop words. Phase two has two stages: synset replacement (WordNet) and lemmatization (WordNet). The third phase has a single step, clustering (COBWEB algorithm). The final phase has two stages: measuring the accuracy of the clusters and producing highly accurate clusters as output. Python was used as the development language, with several tools for testing and developing the framework, and the whole development and testing were accomplished in the Windows environment. Python is a practical and dynamic general-purpose language; being interpreted, it makes errors easy for the programmer to find, and it offers many useful tools that make scientific development faster and easier to operate. On Microsoft Windows 10 an IDE for Python was used, together with many tools that simplify coding.

With the framework developed and sample data in hand, testing completed the process in two phases: first pre-processing, then clustering and accuracy measurement. The conceptual clustering was performed with the COBWEB algorithm: once the word frequencies for each input text were available, the words and their frequency table were fed to the COBWEB algorithm, which returned the clusters as a result, and the F-measure technique was applied to the clusters to obtain the accuracy of each one. In this experiment there were twenty samples from twenty different paper abstracts. After removing the tags from the sample texts and tokenizing, there were 3,292 tokens, on which the operation continued. Stop-word removal found that 1,524 tokens were stop words; removing them eliminated an enormous share, 46.29% of the total, leaving 1,768 tokens. Those tokens were sent to WordNet individually to obtain their synsets (synonyms), each synonym was replaced by its root word, and the system then proceeded to lemmatization, sending each original word to the lemmatization process. When synset replacement and lemmatization were complete, 672 tokens were returned to the system: the system received 3,292 tokens as input and ended with only 672 tokens to cluster, 20.41% of the total input. Document pre-processing normalized the data very smoothly, leaving only about one fifth of the input. Among those 672 tokens there are 144 unique tokens, which appear repeatedly across the twenty samples; the most frequent word appears 22 times, while some words appear only a few times. When the COBWEB algorithm clustered the 672 tokens, it produced 35 clusters; all sample documents are associated with a cluster except Sample 3 and Sample 16, which do not have enough maturity to be assigned one. The clustering process is oriented toward high quality, so the clusters were passed through a standard quality measurement technique: applying the F-measure to the 35 clusters assured their accuracy. Some clusters are 100% correct; however, the system takes the minimum accuracy as the overall accuracy for the whole clustering, which was 71.42%.

5.2 Future Work

In this framework the general context was worked out and the development was carried out on the Python platform. The COBWEB clustering algorithm was used to cluster the pre-processed data, with the help of the WordNet lexical database and its synset and lemmatization features. Several improvements can update this development; in the future the following points could make semantic document clustering more capable:

• Use newer versions of conceptual clustering such as COBWEB/3, ITERATE, or LABYRINTH.
• This thesis was designed around word tokens; future work could also operate on sentence tokens.
• Only the synset feature of WordNet was used; WordNet offers many more tools, such as word types and semantic meanings, that future research can exploit.


REFERENCE Aggarwal, C. C., & Zhai, C. (2012). A survey of text clustering algorithms. In Mining text data (pp. 77-128). Springer US. Aggarwal, C. C., & Zhai, C. (Eds.). (2012). Mining text data. Springer Science & Business Media. Alelyani, S., Tang, J., & Liu, H. (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29, 110-121. Anderberg, M. R. (2014). Cluster analysis for applications: probability and mathematical statistics: a series of monographs and textbooks (Vol. 19). Academic press. Biswas, G. (1993). ITERATE: A conceptual clustering algorithm that produces stable clusters," in review. IEEE Trans. on Pattern Analysis and Machine Intelligence. Biswas, G., Weinberg, J. B., Yang, Q., & Koller, G. R. (1991, June). Conceptual clustering and exploratory data analysis. In Proceedings of the Eighth International Conference on Machine Learning (pp. 591-595). Morgan Kaufmann Publishers Inc.. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of machine Learning research, 3(Jan), 993-1022. Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326. Branch, J., & Rocchi, F. (2015). Concept Development: A Primer. Philosophy of Management, 14(2), 111-133. Carlson, B., Weinberg, J., & Fisher, D. (2014). Search control, utility, and concept induction. In Proc. 1th Intern. Conf. on Machine Learning (pp. 85-92). Chim, H., & Deng, X. (2008). Efficient phrase-based document similarity for clustering. IEEE Transactions on Knowledge and Data Engineering, 20(9), 1217-1229. Cutting, D. R., Karger, D. R., Pedersen, J. O., & Tukey, J. W. (2017). Scatter/gather: A cluster-based approach to browsing large document collections. In ACM SIGIR Forum (Vol. 51, No. 2, pp. 148-159). ACM. Das, G., & Mannila, H. (2000). Context-based similarity measures for categorical databases. 
In European Conference on Principles of Data Mining and Knowledge Discovery (pp. 201-210). Springer, Berlin, Heidelberg. 90

Dhillon, I. S. (2003). Information Theoretic Clustering, Co-clustering and Matrix Approximations. Dhillon, I. S., Mallela, S., & Modha, D. S. (2003). Information-theoretic co-clustering. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 89-98). ACM. Drias, H., Cherif, N. F., & Kechid, A. (2016). k-mm: A hybrid clustering algorithm based on k-means and k-medoids. In Advances in Nature and Biologically Inspired Computing (pp. 37-48). Springer, Cham. El-Din, D. M. (2016). Enhancement bag-of-words model for solving the challenges of sentiment analysis. International Journal of Advanced Computer Science and Applications, 7, 244-247. Everitt, B. S., Landau, S., & Leese, M. (2001). Cluster Analysis 4th Edition, Arnold. Fahad, S. A., & Alam, M. M. (2016). A Modified K-Means Algorithm for Big Data Clustering. International Journal of Science, Engineering and Computer Technology, 6(4), 129. Fahad, S. A., & Yafooz, W. M. (2017). Review on Semantic Document Clustering. International Journal of Contemporary Computer Research, 1(1), 14-30. Fazal, M. M., & Rafi, M. (2015). A Semi-supervised approach to Document Clustering with Sequence Constraints. Journal of Independent Studies and Research, 13(1), 65. Ferrari, D. G., & De Castro, L. N. (2015). Clustering algorithm selection by metalearning systems: A new distance-based problem characterization and ranking combination methods. Information Sciences, 301, 181-194. Fisher, D. H. (1987). Knowledge acquisition via incremental conceptual clustering. Machine learning, 2(2), 139-172. Fisher, D. H., & Schlimmer, J. C. (2014). Concept simplification and prediction accuracy. In Proceedings of the Fifth International Conference on Machine Learning (pp. 22-28). Fisher, D., & Langley, P. (1985). Approaches to Conceptual Clustering (No. UCIICS-85-17). CALIFORNIA UNIV IRVINE DEPT OF INFORMATION AND COMPUTER SCIENCE. Fisher, D., & Langley, P. (1986). 
Conceptual clustering and its relation to numerical taxonomy. In Artificial intelligence and statistics.

91

Fodeh, S. J., Punch, W. F., & Tan, P. N. (2009, March). Combining statistics and semantics via ensemble model for document clustering. In Proceedings of the 2009 ACM symposium on Applied Computing (pp. 1446-1450). ACM. Forgy, E. W. (1965). Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics, 21, 768-769. Fortunato, S. (2010). Community detection in graphs. Physics reports, 486(3), 75-174. Fung, B. C., Wang, K., & Ester, M. (2003, May). Hierarchical document clustering using frequent itemsets. In Proceedings of the 2003 SIAM International Conference on Data Mining (pp. 59-70). Society for Industrial and Applied Mathematics. Gaigole, P. C., Patil, L. H., & Chaudhari, P. M. (2013). Preprocessing Techniques in Text catagorization. In National Conference on Innovative Paradigms in Engineering & Technology (NVIPET-2013), Proceedings published by International Journal of Computer Applications (IJCA). Garner, S. R. (1995). Weka: The waikato environment for knowledge analysis. In Proceedings of the New Zealand computer science research students conference (pp. 57-64). Glavaš, G., Nanni, F., & Ponzetto, S. P. (2016). Unsupervised text segmentation using semantic relatedness graphs. Association for Computational Linguistics. Grobelnik, M., & Mladenic, D. (2004). Text-mining tutorial. the Proceedings of Learning Methods for Text Understanding and Mining, Grenoble, France. Gunopulos, D., & Das, G. (2001). Time series similarity measures and time series indexing. In Acm Sigmod Record (Vol. 30, No. 2, p. 624). ACM. Habibi, M., & Popescu-Belis, A. (2015). Keyword extraction and clustering for document recommendation in conversations. IEEE/ACM Transactions on audio, speech, and language processing, 23(4), 746-759. Hahsler, M., & Bolaños, M. (2016). Clustering Data Streams Based on Shared Density Between Micro-Clusters. IEEE Transactions on Knowledge and Data Engineering, 28(6), 1449-1461. Han, J., Pei, J., & Kamber, M. (2011). 
Data mining: concepts and techniques. Elsevier. Hartigan, Jhon A., & Hartigan, J. A. (1975). Clustering algorithms (Vol. 209). New York: Wiley. Hotho, A., Staab, S., & Stumme, G. (2003). Ontologies improve text document clustering. In Data Mining, 2003. ICDM 2003. Third IEEE International Conference on (pp. 541-544). IEEE. 92

Hung, C. C., Peng, W. C., & Lee, W. C. (2015). Clustering and aggregating clues of trajectories for mining trajectory patterns and routes. The VLDB Journal—The International Journal on Very Large Data Bases, 24(2), 169-192. Hurst, N., Marriott, K., & Moulder, P. (2003). Cobweb: a constraint-based WEB browser. In Proceedings of the 26th Australasian computer science conference-Volume 16 (pp. 247-254). Australian Computer Society, Inc.. Hussain, T., & Asghar, S. (2016). Chi-square based hierarchical agglomerative clustering for web sessionization. Journal of the National Science Foundation of Sri Lanka, 44(2). Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Prentice-Hall, Inc.. Jiang, D., Tang, C., & Zhang, A. (2004). Cluster analysis for gene expression data: a survey. IEEE Transactions on knowledge and data engineering, 16(11), 13701386. Karthikeyan, M., & Aruna, P. (2013). Probability based document clustering and image clustering using content-based image retrieval. Applied Soft Computing, 13(2), 959-966. Kaufman, L., & Rousseeuw, P. J. (2009). Finding groups in data: an introduction to cluster analysis (Vol. 344). John Wiley & Sons. Kavita and Bedi, P. (2015). Clustering of Categorized Text Data Using Cobweb Algorithm. International Journal of Computer Science and Information Technology Research. Vol. 3, Issue 3, pp: (249-254). King, B. (1967). Step-wise clustering procedures. Journal of the American Statistical Association, 62(317), 86-101. Kohonen, T. (2012). Self-organization and associative memory (Vol. 8). Springer Science & Business Media. Lammersen, C., Schmidt, M., & Sohler, C. (2015). Probabilistic k-median clustering in data streams. Theory of Computing Systems, 56(1), 251-290. Lee, J. G., Han, J., & Whang, K. Y. (2007, June). Trajectory clustering: a partitionand-group framework. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data (pp. 593-604). ACM. Li, C., & Biswas, G. (1998). 
Conceptual Clustering With Numeric-and-Nominal Mixed Data-A New Similarity Based System. IEEE Transcript on KCE, 1998. Li, C., Yang, C., & Jiang, Q. (2017). The research on text clustering based on LDA joint model. Journal of Intelligent & Fuzzy Systems, 32(5), 3655-3667.

93

Li, T. (2005). A general model for clustering binary data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (pp. 188-197). ACM. Li, Y., Chung, S. M., & Holt, J. D. (2008). Text document clustering based on frequent word meaning sequences. Data & Knowledge Engineering, 64(1), 381-404. Li, Y., Luo, C., & Chung, S. M. (2008). Text clustering with feature selection by using statistical data. IEEE Transactions on Knowledge and Data Engineering, 20(5), 641-652. Liao,

T. W. (2005). Clustering of recognition, 38(11), 1857-1874.

time

series

data—a

survey. Pattern

Liu, X., & Croft, W. B. (2004). Cluster-based retrieval using language models. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 186-193). ACM. Lucic, M., Bachem, O., & Krause, A. (2016). Strong coresets for hard and soft Bregman clustering with applications to exponential family mixtures. In Artificial Intelligence and Statistics (pp. 1-9). Luhn, H. P. (1957). A statistical approach to mechanized encoding and searching of literary information. IBM Journal of research and development, 1(4), 309-317. MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Vol. 1, No. 14, pp. 281-297). Madeira, S. C., & Oliveira, A. L. (2004). Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 1(1), 24-45. Malliaros, F. D., & Vazirgiannis, M. (2013). Clustering and community detection in directed networks: A survey. Physics Reports, 533(4), 95-142. McKeown, K. R., Barzilay, R., Evans, D., Hatzivassiloglou, V., Klavans, J. L., Nenkova, A., ... & Sigelman, S. (2002). Tracking and summarizing news on a daily basis with Columbia's Newsblaster. In Proceedings of the second international conference on Human Language Technology Research(pp. 280285). Morgan Kaufmann Publishers Inc.. Michalski, R. S., & Stepp, R. E. (1983). Learning from observation: Conceptual clustering. In Machine learning (pp. 331-363). Springer Berlin Heidelberg. Miller, G. A. (1990). Nouns in WordNet: a lexical inheritance system. International journal of Lexicography, 3(4), 245-264. 94

Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11), 39-41.

Miller, G. A., & Johnson-Laird, P. N. (1976). Perception and language. Cambridge: Cambridge University.

Misra, H., Yvon, F., Jose, J. M., & Cappe, O. (2009). Text segmentation via topic modeling: an analytical study. In Proceedings of the 18th ACM conference on Information and knowledge management (pp. 1553-1556). ACM.

Moravec, P., Kolovrat, M., & Snasel, V. (2004). LSI vs. Wordnet Ontology in Dimension Reduction for Information Retrieval. In Dateso (pp. 18-26).

Mugunthadevi, K., Punitha, S. C., Punithavalli, M., & Mugunthadevi, K. (2011). Survey on feature selection in document clustering. International Journal on Computer Science and Engineering, 3(3), 1240-1241.

Myint, M. Z., & Oo, Z. M. (2016). Analysis of Modified Inverse Document Frequency Variants for Word Sense Disambiguation. International Journal of Advanced Computational Engineering and Networking, 4(8).

Naik, M. P., Prajapati, H. B., & Dabhi, V. K. (2015). A survey on semantic document clustering. In Electrical, Computer and Communication Technologies (ICECCT), 2015 IEEE International Conference on (pp. 1-10). IEEE.

Narwal, M., & Mintwal, K. (2013). Comparison of the various clustering and classification algorithms of WEKA tools. Int. J. Adv. Res. Comput. Sci. Software Eng.

Oikonomakou, N., & Vazirgiannis, M. (2009). A review of web document clustering approaches. In Data Mining and Knowledge Discovery Handbook (pp. 931-948). Springer US.

Pantel, P., & Lin, D. (2002). Document clustering with committees. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 199-206). ACM.

Perkowitz, M., & Etzioni, O. (2000). Towards adaptive web sites: Conceptual framework and case study. Artificial Intelligence, 118(1-2), 245-275.

Qi, G. J., Aggarwal, C. C., & Huang, T. (2012). Community detection with edge content in social media networks. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on (pp. 534-545). IEEE.

Qiu, J., & Tang, C. (2007). Topic oriented semi-supervised document clustering. In Proceedings of SIGMOD 2007 Workshop IDAR.

Qiu, Q., & Sapiro, G. (2015). Learning transformations for clustering and classification. The Journal of Machine Learning Research, 16(1), 187-225.

Rangel, F., Faria, F., Lima, P. M. V., & Oliveira, J. (2016). Semi-Supervised Classification of Social Textual Data Using WiSARD. ESANN.

Recupero, D. R. (2007). A new unsupervised method for document clustering by using WordNet lexical and conceptual relations. Information Retrieval, 10(6), 563-579.

Rodrigues, M. M., & Sacks, L. (2004). A scalable hierarchical fuzzy clustering algorithm for text mining. In Proceedings of the 5th international conference on recent advances in soft computing (pp. 269-274).

Sathiyakumari, K., Manimekalai, G., Preamsudha, V., & Scholar, M. P. (2011). A survey on various approaches in document clustering. International Journal of Computer Technology and Application (IJCTA), 2(5), 1534-1539.

Sedding, J., & Kazakov, D. (2004). Wordnet-based text document clustering. In Proceedings of the 3rd workshop on robust methods in analysis of natural language data (pp. 104-113). Association for Computational Linguistics.

Shah, N., & Mahajan, S. (2012). Semantic based document clustering: A detailed review. International Journal of Computer Applications, 52(5).

Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888-905.

Sneath, P. H., & Sokal, R. R. (1973). Numerical taxonomy. The principles and practice of numerical classification.

Steinbach, M., Karypis, G., & Kumar, V. (2000, August). A comparison of document clustering techniques. In KDD workshop on text mining (Vol. 400, No. 1, pp. 525-526).

Sun, Q., Li, R., Luo, D., & Wu, X. (2008). Text segmentation with LDA-based Fisher kernel. In Proceedings of the 46th annual meeting of the association for computational linguistics on human language technologies: short papers (pp. 269-272). Association for Computational Linguistics.

Termier, A., Sebag, M., & Rousset, M. C. (2001). Combining Statistics and Semantics for Word and Document Clustering. In Workshop on Ontology Learning.

Thompson, K., & Langley, P. (1991). Concept formation in structured domains. Concept formation: Knowledge and experience in unsupervised learning, 127-161.

Toor, A. (2014). An advanced clustering algorithm (ACA) for clustering large data set to achieve high dimensionality. Global Journal of Computer Science and Technology, 14(2-C), 71.


Van Dongen, S. M. (2001). Graph clustering by flow simulation (Doctoral dissertation).

Vester, K. L., & Martiny, M. C. (2005). Information retrieval in document spaces using clustering (Master's thesis, Technical University of Denmark, DTU, DK-2800 Kgs. Lyngby, Denmark).

Vu, T., & Parker, D. S. (2016). K-Embeddings: Learning Conceptual Embeddings for Words using Context. In HLT-NAACL (pp. 1262-1267).

Wang, Y., & Hodges, J. (2006). Document clustering with semantic analysis. In System Sciences, 2006. HICSS'06. Proceedings of the 39th Annual Hawaii International Conference on (Vol. 3, pp. 54c-54c). IEEE.

Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann.

Xu, L. (2014, June). Ordering Effects in Clustering. In Machine Learning Proceedings 1992: Proceedings of the Ninth International Workshop (ML92) (p. 163). Morgan Kaufmann.

Xu, W., Liu, X., & Gong, Y. (2003). Document clustering based on non-negative matrix factorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 267-273). ACM.

Yang, Y., & Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval (pp. 42-49). ACM.

Yim, O., & Ramdeen, K. T. (2015). Hierarchical cluster analysis: comparison of three linkage measures and application to psychological data. Quant. Methods Psychol., 11, 8-21.

Yoo, I., Hu, X., & Song, I. Y. (2006). Integration of semantic-based bipartite graph representation and mutual refinement strategy for biomedical literature clustering. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 791-796). ACM.

Zahn, C. T. (1971). Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, 100(1), 68-86.

Zhai, C., & Massung, S. (2016). Text data management and analysis: a practical introduction to information retrieval and text mining. Morgan & Claypool.


Zhang, Q., Kang, J., Qian, J., & Huang, X. (2014). Continuous word embeddings for detecting local text reuses at the semantic level. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval (pp. 797-806). ACM.

Zhou, Y., Cheng, H., & Yu, J. X. (2009). Graph clustering based on structural/attribute similarities. Proceedings of the VLDB Endowment, 2(1), 718-729.

Zhu, Y., & Shasha, D. (2002). Statstream: Statistical monitoring of thousands of data streams in real time. In Proceedings of the 28th international conference on Very Large Data Bases (pp. 358-369). VLDB Endowment.
