Improving Text Summarization Using Fuzzy Logic

1 downloads 0 Views 1MB Size Report
by Mr. Samrat Ashok Babar Roll, No-1230020 is approved for partial ... partment of Computer Science and Engineering, Rajarambapu Institute of Technology,.

A Stage II Report On

Improving Text Summarization Using Fuzzy Logic By Mr. S.A.Babar (Roll No-1230020) M. Tech. Computer Science and Engineering Under the Guidance of Prof. S. A. Thorat Department of Information Technology

KASEGAON EDUCATION SOCIETY ’S RAJARAMBAPU INSTITUTE OF TECHNOLOGY, RAJARAMNAGAR DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING 2013-2014

i

Department of Computer Science and Engineering RAJARAMBAPU INSTITUTE OF TECHNOLOGY Rajaramnagar, Islampur - 415 414 STAGE II APPROVAL SHEET This Stage II report entitled Improving Text Summarization Using Fuzzy Logic by Mr. Samrat Ashok Babar Roll, No-1230020 is approved for partial fulfillment for the degree of Master of Technology in Computer Science & Engineering from the department of Computer Science and Engineering, Rajarambapu Institute of Technology, Rajaramnagar. Name of Chairman & Department(Examination)

Signature

Name of External Examiner

Signature

Name of Supervisor

Signature

Name of HOP

Signature

Name of Chairman(DPGC)

Signature

Date:-

Place:- Rajaramnagar

ii

DECLARATION I declare that this report reflects my thoughts about the subject in my own words. I have sufficiently cited and referenced the original sources, referred or considered in this work. I have not misrepresented or fabricated or falsified any idea/data/fact/source in this my submission. I understand that any violation of the above will be cause for disciplinary action by the Institute.

No.

Name of Student and Roll No.

1

Samrat Ashok Babar (1230020)

Date:-

Signature

Place:- Rajaramnagar

iii

Improving Text Summarization Using Fuzzy Logic Mr. S.A.Babar Department of Computer Science and Engineering RAJARAMBAPU INSTITUTE OF TECHNOLOGY Rajaramnagar, Islampur - 415 414

ABSTRACT Text summarization technique deals with the compression of large document into shorter version of text.Text summarization takes care of choosing the most significant portions of text and generates coherent summaries that express the main intent of the given document.Extraction based text summarization involves selecting sentences of high relevance (rank) from the document based on word and sentence features and put them together to generate summary.Significant text features are extracted from given document and the decision model determines the degree of importance of each sentence based on its rated features.Decision module is modeled using Fuzzy Inference System.The summary of the document is created based upon the degree of the importance of the sentences in the document. Keywords: Summarization, Extraction,fuzzy Inference System, Text Features.

iv

Contents DISSERTATION APPROVAL SHEET DECLARATION . . . . . . . . . . . ABSTRACT . . . . . . . . . . . . . CONTENTS . . . . . . . . . . . . . LIST OF TABLES . . . . . . . . . . LIST OF FIGURES . . . . . . . . . ABBREVIATIONS . . . . . . . . . . 1

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. ii . iii . iv . vi . vii . viii . ix

INTRODUCTION 1.1 General . . . . . . . . . . . . . . . . . . . 1.1.1 Introduction to Text Summarization 1.1.2 Types of Text Summarization: . . . 1.1.3 Text Summarization Approaches: . 1.1.4 Main Steps in Text Summarization : 1.1.5 Text Summarization Methods: . . . 1.1.6 Evaluating the Summary System: . 1.2 Motivation of the present work . . . . . . . 1.3 Objectives . . . . . . . . . . . . . . . . . 1.4 Layout of the thesis . . . . . . . . . . . . . 1.5 Closure . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

1 1 1 2 3 4 4 5 5 7 7 8

2

LITERATURE REVIEW 9 2.1 General . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3 Closure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3

Implementation of Existing Technique 3.1 Introduction to MATLAB: . . . . . 3.1.1 Workspace Variables . . . . 3.1.2 Calling Functions . . . . . . 3.1.3 Programming and Scripts: . 3.2 Preprocessing(Matlab) . . . . . . . 3.2.1 Segmentation: . . . . . . . 3.2.2 Removal of Stop words : . . 3.2.3 POS Tagging/Tokenization: 3.2.4 Word Stemming : . . . . . .

v

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

17 17 20 21 21 24 25 25 26 26

3.3

3.4 3.5 3.6

Feature Extraction: . . . . . . . . . . . 3.3.1 Sentence length: . . . . . . . . 3.3.2 Sentence Position: . . . . . . . 3.3.3 Numeric Data: . . . . . . . . . 3.3.4 Term Weight: . . . . . . . . . . 3.3.5 Sentence to sentence similarity: Fuzzy System: . . . . . . . . . . . . . . Experimental Results: . . . . . . . . . . Comparison Graph: . . . . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

27 27 28 29 29 30 31 32 36

4

PROPOSED WORK 37 4.1 Semantic Analysis of the Text: . . . . . . . . . . . . . . . . . . . . . . 37 4.2 Proposed Architecture: . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5

CONCLUSION

41

APPROVED COPY OF SYNOPSIS

45

ACKNOWLEDGMENTS

57

LIST OF PUBLICATIONS ON PRESENT WORK

58

VITAE

59

vi

List of Tables

vii

List of Figures 1.1

Schematic overview of the Text Summarization . . . . . . . . . . . . .

2.1

Text summarization based on fuzzy logic system architecture . . . . . . 12

3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 3.10 3.11 3.12 3.13 3.14 3.15 3.16 3.17

MATLAB Desktop Layout . . Sript file . . . . . . . . . . . . command line . . . . . . . . . Segmented file . . . . . . . . Stop Word file . . . . . . . . . Stemming file . . . . . . . . . Sentence length . . . . . . . . Sentence Position . . . . . . . Numeric Data . . . . . . . . . Term Frequency . . . . . . . . Sentence Similarity . . . . . . Input file . . . . . . . . . . . . Output File . . . . . . . . . . Manually Generated Summary Online Summarizer 1 . . . . . Online Summarizer 2 . . . . . comparison graph . . . . . . .

4.1

Proposed Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.1

Asp.Net Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

viii

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

7

19 23 24 25 26 27 28 28 29 30 30 32 33 34 35 35 36

ABBREVIATIONS TS

Text Summarization

FIS

Fuzzy Inference System

TF

Term Frequency

DUC

Document Understanding Conferences

ROUGE

Recall-Oriented Understudy for Gisting Evaluation

TF- IDF

Term Frequency-Inverse Document Frequency

POS Tagging

Part-of-Speech Tagging

ix

Chapter 1

INTRODUCTION 1.1

General Text Summarization is a process of creating a shorter version of original text

that contains the important information. The amount of information on the web is growing day by day. A considerable amount of time is wasted in searching for relevant documents. Hence text summarization technique came into existence which created a short summary for the text document by choosing important sentences of the document. An Automatic text summarization works very well on structured documents such as news articles, research publications and reports.

1.1.1

Introduction to Text Summarization Automatic Text summarization is extremely helpful in tackling the information

overload problems. It is the technique to identify the most important pieces of information from the document, omitting irrelevant information and minimizing details to generate a compact coherent summary document..Text summarization has two approaches: Extraction and Abstraction. Extraction involves selecting sentences of high relevance (rank) from the document based on word and sentence features and put them

1

together to generate summary. It uses mostly statistical methods. Abstraction procedure examine the text, interprets it and generates summary using different sentences. It uses Linguistic methods. Abstraction is the process of paraphrasing sections of the source document.

1.1.2

Types of Text Summarization: 1) Abstract vs. Extract summary - Abstraction is the process of paraphrasing

sections of the source document whereas extraction is the process of picking subset of sentences from the source document and presents them to user in form of summary that provides an overall sense of the documents content. 2) Generic vs. Query-based summary - Generic summary do not target to any particular group. It addresses broad community of readers while Query or topic focused queries are tailored to the specific needs of an individual or a particular group and represent particular topic. 3) Single vs. Multi-document summary - Single document summary provide the most relevant information contained in single document to the user that helps the user in deciding whether the document is related to the topic of interest .4) Indicative vs. informative: An indicative summary provides merely an indication of the principal subject matter or domain of the input text(s) without including its contents. After reading an informative summary, one can explain what the input text was about, but not necessarily what was contained in it. An informative summary reflects (some of) the content, and

2

allows one to describe (parts of) what was in the input text.

1.1.3

Text Summarization Approaches: There are different types of summarization approaches depending on what the

summarization method focuses on to make the summary of the text.itemise a) Statistical approaches: In Statistical methods, sentence selection is done based on word frequency, indicator phrases and other features regardless of the meaning of the words.These methods are based on the idea that text surface cues are the most obvious indication of the text contents. There are several methods for determining the key sentences such as, The Title Method,The Location Method,The Aggregation Similarity Method,The Frequency Method etc. b) Linguistic approaches: Linguistic approaches are based on considering the connections between words and trying to find the main concept by analyzing the words. There are some methods such as,Lexical chain,WordNet,Graph theory,Clustering etc. c) Rhetorical approaches: Rhetorical structure theory (RST) is based on the Rhetorical connections between different parts of the text. In this theory the Rhetoric behind the decomposed text is extracted. In summarization systems, Rhetorical structure (RS) presents the logical connections between different parts of the text and interprets these connections. These information represent the discourse structure and features of the main document

3

1.1.4

Main Steps in Text Summarization : These steps are topic identification, interpretation and summary generation.Topic

Identification:In topic identification step, the most prominent information in the text is identified. Usually the system assign different precedence to different parts of the text like sentence, words, and phrases; then a fuser module mix the scores of each part in order to find the total score for a part. At last, the system presents the N highest score parts in accordance with predefined length.Several techniques for topic identification, including methods based on Position, Cue Phrases, word frequency, and Discourse Segmentation have been reported in the literature[2]. Interpretation:Abstract summaries need to go through interpretation step. In This step, different subjects are fused in order to form a general content. Summary Generation:In this step, the system uses text generation method which itself is still an open research topic that has lots of similarities with text summarization. This step includes a range of various generation methods from very simple word or phrase printing to more sophisticated phrase merging and sentence generation

1.1.5

Text Summarization Methods: Extractive summarizer aims at picking out the most relevant sentences in the doc-

ument while also maintaining a low redundancy in the summary. There are summarization methods which woks on the most relevant sentences in the document . A. Term Frequency-Inverse Document Frequency (TFIDF) method: B. Cluster based method: 4

C. Graph theoretic approach: D. Machine Learning approach: E. LSA Method: F.Text summarization with neural networks: G.Automatic text summarization based on fuzzy logic: H.Query based extractive text summarization:

1.1.6

Evaluating the Summary System: Evaluation methods are useful in evaluating the usefulness and trustfulness of

the summary. In summary, evaluating the qualities like comprehensibility, coherence, and readability is really difficult. System evaluation might be performed manually by experts who compare different summaries and choose the best one.Two main criterion for evaluating the proficiency of a system is precision and recall which are used for specifying the similarity between the summary which is generated by a system versus the one generated by human.The weighted harmonic mean of precision and recall is called as F-measure.These terms are defined by following equations Precision = Recall =

Correct (Correct+Wrong)

Correct , (Correct+Missed)

f − measure = 2 ×

1.2

,

precison×recall precison+recall

Motivation of the present work Text Summarization is an active field of research in both the IR and NLP com-

munities .Text Summarization is increasingly being used in the commercial sector such

5

as Telephone communication industry. In data mining of text databases, E.g. Oracle’s Context. In filters for web-based information retrieval.In word Processing tools e.g. Microsoft’s Auto Summarize . A variety of new applications are using multilingual summarization, multimedia news broadcast, audio scanning services for the blind etc. To summarize news to SMS or WAP-format for mobile phones. Many approaches differ on the manner of their problem formulations.For many of these larger information management goals, automatic text summarization is an important step. It addresses the problem of selecting the most important portions of the text.High quality summarization requires sophisticated NLP techniques. This existing work uses a combination of statistical and Linguistic [4] method to improve the quality of summary. It uses Fuzzy logic for effective Text Summarization. Fuzzy logic uses decision module that determines the degree of importance of each sentence based on its rated features. Decision module is designed using a fuzzy inference system. It works in four phases i) Preprocessing of text ii) Feature extraction of both words and sentences iii) Fuzzy logic scoring. iv) .Extracting sentences of higher ranks to generate summary. The figure below shows the architecture of this technique. Although a number of tools (i.e. MS AutoSum, Summarist, etc.) is available to facilitate the text summarization process automatically, but the summarized text output is still imprecise or inaccurate. Thus, automatic text summarization research is still an ongoing process especially in this era of information overload due to the wide usage of internet. Single document summarization Simulate the work of intelligence analyst. Judge if a document is relevant to a topic of interest. Breaking a document into sections before

6

Figure 1.1: Schematic overview of the Text Summarization.

summarizing will ensure that the summary includes all the topics that were covered in the document.

1.3

Objectives Following are the objectives of proposed work:

1. To study different techniques of text summarization . 2. To present a text document in a summarize form. 3. To identify the important sentence features for extraction using fuzzy logic. 4. To analyse the generated summary with online automate summarizer.

1.4

Layout of the thesis The thesis is structured as follows. In chapter 2, we provide literature survey

on Text Summarization System. Also we analyze various techniques and approaches used for Text summary. The implementation of sentence feature extraction using Fuzzy 7

Logic technique for Text Summarization is given in chapter 3. The chapter 4 discuss about proposed Technique for the text summarization.

1.5

Closure By using Extraction approach summary can be constructed by taking the most

important sentences out of the original document .Statistical techniques based on term frequency to determine the term importance. Sentences with more important terms are extracted in higher priorities.

8

Chapter 2

LITERATURE REVIEW 2.1

General Following are the general terms used in literature survey.

• Extractive summarization method: An extractive summarization method consists of selecting important sentences, paragraphs etc. from the original document and concatenating them into shorter form. • Sentence Features: For each sentence there are 8 features and each feature has a value between 0 and 1 .The 8 features are:Title features, Sentence length,Term weight,Sentence position,Sentence to sentence similarity,Proper noun,Thematic word,and Numerical data[9].

• Sentence Scoring: In sentence scoring algorithms some numerical weight is assigned to each sentences (between 0.0-1.0) and selecting the best one. • Fuzzy Logic: The sentence features are used as input to the fuzzifier .The input membership function for each feature is divided into five fuzzy set which are composed of unimportant values (low (L) and very low (VL), Median (M) and impor-

9

tant values (high (H) and very high (VH). n inference engine, the most important part in this procedure is the definition of fuzzy IF-THEN rules. The important sentences are extracted from these rules according to eight features criteria. The last step in fuzzy logic system is the defuzzification to convert the fuzzy results from the inference engine into a crisp output for the final score of each sentence[6].

2.2

Literature Review This section covers review of Text Summarization and its techniques. In [3], paper shows recent techniques and challenges on advances of automatic

text summarization. Special attention is paid to the latest trends in text summarization. Author discusses the key challenges in automatic text summarization. These are inherent problem of overlapping of sets of similar text units or paragraphs; documents which contain long sentences are still a problem; another challenge is the word sense ambiguities which are inherent to natural language. The problem is that matching a system summary against the ideal summary is very difficult to establish. The problem of providing much accurate or efficient result for automatic text summarization. Author has discussed different types of summarization approaches [4] depending on what the summarization method focuses on to make the summary of the text. Automatic document summarization is extremely helpful in tackling the information overload problems. It is the technique to identify the most important pieces of information from the document, omitting irrelevant information and minimizing details to generate a compact coherent summary document. Author has given the types of summaries Abstract vs. Extract summary, Generic vs. Query-based summary, Single vs. Multi-

10

document summary, Indicative vs. Informative, Background vs. Justthe- news. Author has mentioned in detail the different approaches: Graph Theoretic Approach, Text summarization based on fuzzy logic, Text summarization using cluster based method. In [2], paper author has concentrating on extractive summarization methods. An extractive summary is selection of important sentences from the original text. The importance of sentences is decided based on statistical and linguistic features of sentences. There are two broad methods of text summarization extractive and abstractive summarization. An extractive summarization method consists of selecting important sentences, paragraphs etc. from the original document and concatenating them into shorter form. An abstractive summarization method consists of understanding the original text and re-telling it in fewer words. Paper [1], defines the most important criteria for a summary and different methods of text summarization as well as main steps for summarization process is discussed. There are different approaches for sentence selection are presented in order to generate a summary from a text. Author has explained detailed steps for summary of document, which are topic identification, interpretation and summary generation. Also author has given the different approaches for scoring and selecting sentences. In paper [5] , it is said that sentence scoring is the technique most used for extractive text summarization. Paper describes 15 sentence scoring algorithm and performs a quantitative and qualitative assessment of these 15 algorithms. Example Sentence length: It works as follows: (i) Calculate the largest sentence length; (ii) Penalize sentences larger than 80 percent of the largest sentence length; (iii) Calculate the Sentence Length Score for all other sentences. Author has used three different datasets (News,

11

Blogs and Article contexts) . It is a single document summarization. L. Suanmali et al.[6] has proposed text summarization based on fuzzy logic method to extract important sentences as a summary. This paper focuses on extraction approach. The sentence selection method had used to obtain the suitable sentences by assigning some numerical weighting to each sentences and selecting the best one. The 8 important features are used and calculate their score for each sentence.These are Title word, Sentence length, Sentence Position, Numerical data, Term weight, Sentence to Sentence Similarity, existence of Thematic words and Proper Nouns. Author presents the some summarization approaches and describes preprocessing and the important features. The experimental results compared with the baseline summarizer and Microsoft Word 2007 summarizers.

Figure 2.1: Text summarization based on fuzzy logic system architecture

In paper [7], the focus is on Text Summarization by Extraction using Fuzzy Logic. Author has used the hybrid method-Statistical and Linguistic methods .The evaluation is carried out by using precision and recall method . Using all the 8 feature scores, the score for each sentence are derived using fuzzy logic method. The fuzzy

12

logic method uses the fuzzy rules and triangular membership function. The fuzzy rules are in the form of IF-THEN. The triangular membership function fuzzifies each score into one of 3 values that is LOW,MEDIUM HIGH. Then we apply fuzzy rules to determine whether sentence is unimportant, average or important. This is also known as defuzzification. .

In paper [8] author has analyzed some state of methods for text summarization. He discussed the main disadvantages of these methods and proposed a new method of text summarization using fuzzy logic. The problems of machine learning method is features used in this method e.g. the occurrence of proper nouns have binary attributes such as zero and one which sometimes are not exact. It is clear that some of the attributes have more importance and some have less and so they should have balance weight in computations and hence the fuzzy logic is used to solve this problem. The implementation of text summarization is carried out using MATLAB. In this [9] paper the focus is on the automatic text summarization by sentence extraction.Author used fuzzy logic to extract the sentences. The eight features and their calculated scores are used compare the results with the baseline summarizer and Microsoft Word 2007 summarizers. Author has explained in detail the definitions and terminology of each eight sentence features with their formulas. JThis paper[10], proposed an automatic text summarization approach based on sentence extraction using fuzzy logic, genetic algorithm, semantic role labeling and their combinations to generate high quality summaries. GA used in text summarization. Author has mentioned the benefits of the genetic algorithm in the optimization problem

13

in for feature selection. Fuzzy IF- THEN rules were used to balance the weights between important and unimportant features.. In [11] ,author has applied the fuzzy logic approach for important features to extract the sentences. In this paper the result is compared with baseline summarizer and Microsoft Word 2007 summarizers. Author has used 9 sentence features and 30 test documents in DUC2002 data set. Fuzzy logic techniques in the form of approximate reasoning provide decision-support . In this [12] paper, author proposed a fuzzy-rough set aided method to extract key sentences as a summary of a document by estimating the relevance of sentences through the use of fuzzy-rough sets. This method uses senses rather than raw words to avoid the similar semantic meaning .Author has proposed the system in which the four phases of processes are organized. These are :Document Preprocessing ,Word Sense Disambiguation ,Semantic Patterns Retrieval ,Key Sentences Extraction . In this [13] paper, the investigation of the correlation between ROUGE and human evaluation of extractive meeting summaries is carried out. Both human and system generated summaries are used. The human evaluation of different summaries and calculated ROUGE scores are examined with their correlation .The better correlation can be achieved between the ROUGE scores and human evaluation. In text summarization, ROUGE has been correlate well with human evaluation when measuring match of content units. In this [14] paper author proposed a sentences clustering based summarization approach. The proposed approach consists of three steps: first clusters the sentences based on the semantic distance among sentences in the document, and then on each

14

cluster calculates the accumulative sentence similarity based on the multi features combination method, at last chooses the topic sentences by some extraction rules. The text summarization result is not only depends the sentence features, but also depends on the sentence similarity measure. In [15] paper,it is said that the summarization process has three phases: analyzing the source text, determining its salient points, and synthesizing an appropriate output. Most current work focuses on the more developed technology of summarizing a single document. In [16] paper,author proposed a method of personalized text summarization which improves the conventional automatic text summarization methods by taking into account the differences in readersŠ characteristics. A method of personalized summarization which extracts from the document information that we assume to be the most important or interesting for a particular user. Annotations (e.g. highlights) can indicate a userŠs interest in the specific parts of the document .Author used them as one of the sources of personalization. This [17] paper describes and performs a quantitative and qualitative assessment of 15 algorithms for sentence scoring available in the literature. Three different datasets (News, Blogs and Article contexts) were evaluated. Each of the 15 scoring methods is described and implemented. A quantitative and qualitative assessment of those methods using three different datasets (news, blogs, and articles context) is performed .

15

2.3

Closure We studied the different text summarization techniques. The various author has

given the different sentence features to be extracted for the summary.The fuzzy logic approach for the feature extraction solve the problem of important and less important attributes of weights of sentences.

16

Chapter 3

Implementation of Existing Technique

3.1

Introduction to MATLAB: MATLAB is a high-level language and interactive environment for numerical

computation, visualization, and programming. Using MATLAB, you can analyze data, develop algorithms, and create models and applications. The language, tools, and builtin math functions enable you to explore multiple approaches and reach a solution faster than with spreadsheets or traditional programming languages, such as C/C++ or Java˝o. You can use MATLAB for a range of applications, including signal processing and communications, image and video processing, control systems, test and measurement, computational finance, and computational biology. More than a million engineers and scientists in industry and academia use MATLAB, the language of technical computing. • Key Features :

– High-level language for numerical computation, visualization, and applica17

tion development. – Interactive environment for iterative exploration, design, and problem solving – Mathematical functions for linear algebra, statistics, Fourier analysis, filtering, optimization, numerical integration, and solving ordinary differential equations.

– Built-in graphics for visualizing data and tools for creating custom plots.

– Development tools for improving code quality and maintainability and maximizing performance.

– Tools for building applications with custom graphical interfaces. MATLAB programs are called also functions. MATLAB functions are available within the MATLAB environment as soon as you invoke the name of a function, but cannot be run separated from the MATLAB environment.Both scripts and functions are stored as so-called m files in Matlab. A script is a series of interactive MATLAB commands listed together in an m file.MATLAB functions are self-contained programs that require arguments (input variables) and, many a time, an assignment to output variables. Examples of simple MATLAB functions: A n example of a simple function that calculates the area of a trapezoidal,given the bottom width, b, the side slope z, and the flow depth y. function [A] = Area(b,y,z) 18

A = (b+z*y)*y; command line jus type and hit enter Area = inline(’(b+z*y)*y’,’b’,’y’,’z’) »b = 2; z = 1.5; y = 0.75; A = Area(b,y,z) A = 2.3438 • Desktop Basics : When you start MATLAB, the desktop appears in its default layout.

Figure 3.1: MATLAB Desktop Layout

The desktop includes these panels: • Current Folder : Access your files. • Command Window : Enter commands at the command line, indicated by the prompt (»). • Workspace : Explore data that you create or import from files. 19

• Command History : View or rerun commands that you entered at the command line. As you work in MATLAB, you issue commands that create variables and call functions. For example, create a variable named a by typing this statement at the command line: a=1 MATLAB adds variable a to the workspace and displays the result in the Command Window. a= 1 When you do not specify an output variable, MATLAB uses the variable ans, short for answer, to store the results of your calculation. sin(a) ans = 0.8415

3.1.1 Workspace Variables The workspace contains variables that you create within or import into MATLAB from data files or other programs. For example, these statements create variables A and B in the workspace.

20

3.1.2 Calling Functions MATLAB provides a large number of functions that perform computational tasks. Functions are equivalent to subroutines or methods in other programming languages. To call a function, such as max, enclose its input arguments in parentheses: A = [1 3 5]; max(A); If there are multiple input arguments, separate them with commas: B = [10 6 4]; max(A,B); Enclose any character string inputs in single quotes: disp(’hello world’); To call a function that does not require any inputs and does not return any outputs, type only the function name: clc The clc function clears the Command Window.

3.1.3 Programming and Scripts: MATLAB M-Files: MATLAB is also a powerful programming language, as well as an interactive computational environment. MATLAB also allows you to write series of commands into a file and execute the file as complete unit, like writing a function and calling it. 21

The M Files-MATLAB allows writing two kinds of program files: – Scripts-script files are program files with .m extension. In these files, you write series of commands, which you want to execute together. Scripts do not accept inputs and do not return any outputs. They operate on data in the workspace – Functions-functions files are also program files with .m extension . Functions can accept inputs and return outputs. Internal variables are local to the function. The simplest type of MATLAB program is called a script. A script is a file with a .m extension that contains multiple sequential lines of MATLAB commands and function calls. You can run a script by typing its name at the command line. Example Script:To create a script, use the edit command, edit arithmatic This opens a blank file named arithmatic.m. Enter some code that makes addition: a = 5; b = 7; c=a+b d = c + sin(b) e=5*d f = exp(-d)

22

Figure 3.2: Sript file

After creating and saving the file, you can run it in two ways: 1. Clicking the Run button on the editor window or 2. Just typing the filename (without extension) in the command prompt: » arithmatic The command window prompt displays the result: c= 12 d= 12.6570 e= 63.2849 f= 3.1852e-06

23

Figure 3.3: command line

3.2 Preprocessing(Matlab) Pre Processing is structured representation of the original text. It usually includes: a) Sentences boundary identification(segmentation). In English, sentence boundary is identified with presence of dot at the end of sentence. b) Stop-Word Elimination.Common words with no semantics and which do not aggregate relevant information to the task are eliminated. c) Stemming.The purpose of stemming is to obtain the stem or radix of each word, which emphasize its semantics. In Processing step, features influencing the relevance of sentences are decided and calculated and then weights are assigned to these features using weight learning method. Final score of each sentence is determined using Feature-weight equation. Top ranked sentences are selected for final summary.

24

3.2.1 Segmentation: In English, sentence boundary is identified with presence of dot at the end of sentence.

Figure 3.4: Segmented file

3.2.2 Removal of Stop words : Common words with no semantics and which do not aggregate relevant information to the task are eliminated. Words which rarely contribute to useful information in terms of document relevance and appear frequently in document but provide less meaning in identifying the important content of the document are removed. These words include articles, prepositions, conjunctions and other high-frequency words such as “a”, “an”, “the”, “in”, “and”, “I”, etc...

25

Figure 3.5: Stop Word file

3.2.3 POS Tagging/Tokenization: Tokenization is done by separating the input document into individual words

3.2.4 Word Stemming : Stemming.The purpose of stemming is to obtain the stem or radix of each word, which emphasize its semantics.Word stemming is the process of reducing inflected or derived words to their stem, base or root form.For example, a stemming algorithm for English should stem from the words “compute”, “computed”, “computer”, “computable”, and “computation” to its word stem, “comput”.

26

Figure 3.6: Stemming file

3.3 Feature Extraction: Sentence Features-After pre-processing, each sentence of the document is represented by an attribute vector of features. These features are attributes that attempt to represent the sentence in the task of sentence selection. We focus on eight features for each sentence. Each document is converted into a list of sentences.

3.3.1 Sentence length: The ratio of the number of words occurring in the sentence over the number of words occurring in the longest sentence of the document. F1 =

No.o f the word in the sentence No.o f words int he longest sentnce

27

Figure 3.7: Sentence length

3.3.2 Sentence Position: The first 5 sentences in a paragraph has a score value of 5/5 for first sentence, the second sentence has a score 4/5, and so on. F2(S 1) = nn ;F2(S 2) = 45 ;F2(S 3) = 35 ;F2(S 4) = 25 ; so on...

Figure 3.8: Sentence Position

28

3.3.3 Numeric Data: The ratio of the number of numerical data in sentence over the sentence length. F3(S i) =

No.o f the Numerical data in the sentence S i S entnce Length

Figure 3.9: Numeric Data

3.3.4 Term Weight: The score of this feature is given by the ratio of summation of term frequencies of all terms in a sentence over the maximum of summation values of all sentences in a document. F4 =

P

T Fi P , MAX( T Fi)

Where TF=Term Frequency, i=1 to n, n is the number of

terms in a sentence.

29

Figure 3.10: Term Frequency

3.3.5 Sentence to sentence similarity: For each sentence S, the similarity between S and every other sentence is computed by the method of token matching. F5 =

P [S im(S i,S j)] , MAX[S im(S i,S j)]

Where i=1 to N and j=1 to N.

Figure 3.11: Sentence Similarity

30

3.4 Fuzzy System: Fuzzy Logic-FL is used for solving uncertainties in a given problem.Fuzzy logic system design usually implicates selecting fuzzy rules and membership function. The selection of fuzzy rules and membership functions directly affect the performance of the fuzzy logic system. The fuzzy logic system consists of four components: fuzzifier, inference engine, defuzzifier, and the fuzzy knowledge base. In the fuzzifier, crisp inputs are translated into linguistic values using a membership function to be used to the input linguistic variables. After fuzzification, the inference engine refers to the rule base containing fuzzy IF- THEN rules to derive the linguistic values. In the last s tep, the output linguistic variables from the inference are converted to the final crisp values by the defuzzifier using membership function for representing the final sentence score. Thus each sentence is associated with 8 feature vector. Using all the 8 feature scores, the score for each sentence are derived using fuzzy logic method. The fuzzy logic method uses the fuzzy rules and triangular membership function .The fuzzy rules are in the form of IF-THEN .The triangular membership function fuzzifies each score into one of 3 values that is LOW,MEDIUM HIGH. Then we apply fuzzy rules to determine whether sentence is unimportant, average or important. This is also known as defuzzification. For example IF (F1is H) and (F2 is M) and (F3 is H) and (F4 is M) and (F5 is M) and (F6 is M) and (F7 is H) and (F8 is H) THEN (sentence is important). All the sentences of a document are ranked in a descending order based on their

31

scores. Top n sentences of highest score are extracted as document summary based on compression rate. Finally the sentences in summary are arranged in the order they occur in the original document.

3.5 Experimental Results: The input article of text file and its summary generated by our method are shown bellow. • Original input Document File:

Figure 3.12: Input file

32

• Output Summary File:

Figure 3.13: Output File

33

• Manually Generated Summary File:

Figure 3.14: Manually Generated Summary

34

• Online Summarizer 1 Result:

Figure 3.15: Online Summarizer 1

• Online Summarizer 2 Result:

Figure 3.16: Online Summarizer 2

35

3.6 Comparison Graph: The chart showing the comparison between result of online summarizer and the result of our method .

Figure 3.17: comparison graph

36

Chapter 4

PROPOSED WORK 4.1 Semantic Analysis of the Text: Conventional extraction methods cannot capture semantic relations between concepts in a text. Therefore, the use of semantic analysis to capture the semantic contents in sentences and incorporate it into the summarization method. The motivation for using Semantic Analysis in text summarization is to calculate semantic similarity between sentences score.

• Latent Semantic Analysis: Latent Semantic Analysis (LSA) is a statistical model of word usage that permits comparisons of semantic similarity between pieces of textual information. LSA was originally designed to improve the effectiveness of information retrieval methods by performing retrieval based on the derived "semantic" content of words in a query as opposed to performing direct word matching. This approach avoids some of the problems of synonymy, in which different words can be used to describe the same semantic concept. 37

The primary assumption of LSA is that there is some underlying or "latent" structure in the pattern of word usage across documents, and that statistical techniques can be used to estimate this latent structure. The term "documents" in this case, can be thought of as contexts in which words occur and also could be smaller text segments such as individual paragraphs or sentences. Through an analysis of the associations among words and documents, the method produces a representation in which words that are used in similar contexts will be more semantically associated. In order to analyze a text, LSA first generates a matrix of occurrences of each word in each document (sentences or paragraphs). LSA then uses singularvalue decomposition (SVD), a technique closely related to eigenvector decomposition and factor analysis. The SVD scaling decomposes the word-by-document matrix into a set of k, typically 100 to 300, orthogonal factors from which the original matrix can be approximated by linear combination. Instead of representing documents and terms directly as vectors of independent words, LSA represents them as continuous values on each of the k orthogonal indexing dimensions derived from the SVD analysis. Since the number of factors or dimensions is much smaller than the number of unique terms, words will not be independent. For example, if two terms are used in similar contexts (documents), they will have similar vectors in the reduced-dimensional LSA representation. One advantage of this approach is that matching can be done between two pieces of textual information, even if they have no words in common.

38

• Semantic Role Labeling (SRL) Method: The semantic role labeling method (SRL proposes the sentence similarity using the SRL that can capture semantic contents in sentences. First, it split the document into sentences and parses each sentence into the frame(s) using a semantic role parser. After argument has been defined, each argument in each sentence will be performed the semantic similarity measure between any pair of sentences using WordNet taxonomy. The SRL based text summarization has three main steps as described as following. • Identification of semantic role labeling • Calculation of semantic similarity based on WordNet thesaurus • Calculation of Sentence Score based on Semantic Similarity

39

4.2

Proposed Architecture:

Figure 4.1: Proposed Architecture

40

Chapter 5

CONCLUSION Automatic summarization is a complex task that consists of several sub-tasks. Each of the sub-task directly affects the ability to generate high quality summaries. In extraction based summarization the important part of the process is the identification of important relevant sentences of text. Use of fuzzy logic as a summarization sub-task improved the quality of summary by a great amount. In this technique we implement automatic text summarization which involves feature based extraction of important sentences using fuzzy logic method. We extracted the important features for each sentence of the document represented as the vector of features consisting of the following elements: title feature, sentence length, term weight, sentence position, sentence to sentence similarity, proper noun, thematic word and numerical data. The results show that the best average precision, recall and f-measure to summaries produced by the fuzzy method. Certainly, the experimental result is based on fuzzy logic could improve the quality of summary results that based on the general statistic method. In conclusion, we will extend the this

41

existing method using semantic analysis along with extraction of other features which could provide the sentences more important. The quality of summary can still be improved by using topic segmentation and semantic analysis of the text in addition to the features considered in this technique.

42

References

[1] Saeedeh Gholamrezazadeh ,Mohsen Amini Salehi ;"A Comprehensive Survey on Text Summarization Systems ", 978-1-4244-4946-0,2009 IEEE. [2] Vishal Gupta and Gurpreet Singh Lehal; "A survey of Text summarization techniques ",Journal of Emerging Technologies in Web Intelligence VOL 2 NO 3 August 2010. [3] Oi Mean Foong ,Alan Oxley and Suziah Sulaiman "Challenges and Trends of Automatic Text Summarization ",International Journal of Information and Telecommunication Technology Vol.1, Issue 1, 2010 . [4] Archana AB, Sunitha. C ,"An Overview on Document Summarization Techniques" ,International Journal on Advanced Computer Theory and Engineering (IJACTE) ,ISSN (Print) : 2319 ¡U 2526, Volume-1, Issue-2, 2013 . [5] Rafael Ferreira ,Luciano de Souza Cabral ,Rafael Dueire Lins ,Gabriel Pereira e Silva ,Fred Freitas ,George D.C. Cavalcanti ,Luciano Favaro , "Assessing sentence scoring techniques for extractive text summarization ",Expert Systems with Applications 40 (2013) 5755-5764 ,2013 Elsevier . [6] L. Suanmali , N. Salim and M.S. Binwahlan,"Fuzzy Logic Based Method for Improving Text Summarization" , International Journal of Computer Science and Information Security, Vol. 2, No. 1,pp. 4-10., 2009. [7] Mrs.A.R.Kulkarni , Dr.Mrs.S.S.Apte "A DOMAIN-SPECIFIC AUTOMATIC TEXT SUMMARIZATION USING FUZZY LOGIC ",International Journal of Computer Engineering and Technology (IJCET), ISSN 0976- 6367(Print), ISSN 0976 - 6375(Online) Volume 4, Issue 4, July-August (2013). [8] Farshad Kyoomarsi ,Hamid Khosravi ,Esfandiar Eslami ,Pooya Khosravyan Dehkordy; "Optimizing Text Summarization Based on Fuzzy Logic ",Seventh IEEE/ACIS International Conference on Computer and Information Science ,9780-7695-3131-1 ,2008. [9] Ladda Suanmali ,Naomie Salim and Mohammed Salem Binwahla ,"Feature-Based Sentence Extraction Using Fuzzy Inference rules ",2009 International Conference on Signal Processing Systems ,978-0-7695-3654-5 ,2009 IEEE . [10] Ladda Suanmali, Naomie Salim and Mohammed Salem Binwahlan "Fuzzy Genetic Semantic Based Text Summarization ", 2011 Ninth Ninth International Conference on Dependable, Autonomic and Secure Computing ,978-0-7695-4612-4 ,2011 IEEE . [11] Ladda Suanmali, Mohammed Salem Binwahlan and Naomie Salim "Sentence Features Fusion for Text Summarization Using Fuzzy Logic ",2009 Ninth International Conference on Hybrid Intelligent Systems ,978-0-7695-3745-0 ,2009 IEEE.

43

[12] Hsun-Hui Huang ,Yau-Hwang Kuo ,Horng-Chang Yang ,"Fuzzy-Rough Set Aided Sentence Extraction Summarization",Proceedings of the First International Conference on Innovative Computing, Information and Control (ICICICŠ06),0-76952616-0/06 ,IEEE. [13] Feifan Liu and Yang Liu, Member, IEEE "Exploring Correlation Between ROUGE and Human Evaluation on Meeting Summaries ",IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 1, JANUARY ,2010. [14] ZHANG Pei-ying ,LI Cun-he , "Automatic text summarization based on sentences clustering and extraction ",978-1-4244-4520-2 ,2009 IEEE . [15] Udo Hahn ,Inderjeet Man ,"The Challenges of Automatic Summarization ",00189162/00,2000 IEEE . [16] Róbert Móro, Mária Bieliková "Personalized Text Summarization Based on Important Terms Identification ",2012 23rd International Workshop on Database and Expert Sytems Applications ,1529-4188, 2012 IEEE . [17] Rafael Ferreira ,Luciano de Souza Cabral ,Rafael Dueire Lins ,Gabriel Pereira e Silva , "Assessing sentence scoring techniques for extractive text summarization", Expert Systems with Applications 40 (2013) 5755-5764 ,2013 Elsevier Ltd.

44

K. E. Society’s Rajarambapu Institute of Technology, Rajaramnagar An Autonomous institute SYNOPSIS OF M. Tech. DISSERTATION

1. Name of Program

: M.Tech. (Computer Science & Engineering)

2. Name of student

: Babar Samrat Ashok

3. Date of registration

: August 2013

4. Name of guide

: Prof. S. A. Thorat

5. Sponsorers Detail

: Nil

6. Proposed title

:

“ Improving Text Summarization Using Fuzzy Logic” 7. Synopsis of dissertation work 7.1 Relevance : “Text Summarization” is a process of creating a shorter version of original text that contains the important information. The amount of information on the web is growing day by day. A considerable amount of time is wasted in searching for relevant documents. Hence text summarization technique came into existence which created a short summary for the text document by choosing important sentences of the document. An automatic text summarization works very well on structured documents such as news articles, research publications and reports. Text summarization has two approaches: Extraction and Abstraction. Extraction involves selecting sentences of high relevance (rank) from the document based on word and sentence features and put them together to generate summary. It uses mostly statistical methods. Abstraction procedure examines the text, interprets it and generates summary using different sentences. It uses Linguistic methods. This paper focuses

on extractive summarization technique. It uses a combination of both Statistical and Linguistic methods on fusion of various features to generate a better quality summary. Fuzzy Logic-FL is used for solving uncertainties in a given problem. For eg. consider condition “drive slowly”. When we hear about it, we can understand what it means. But for computer it does not have knowledge about which is the slow speed, 10 kmph or 15 kmph or 25 kmph. And if we give a limit as the slowest speed for example 20 kmph is the speed limit. More than this speed is considered as fast.But if 20 kmph is the speed limit, is 20.1 kmph is fast or 19.9 kmph is slow. In order to solve these uncertainties for computer system, FL is used. There will be membership function for solving this. Fuzzy logic system design usually implicates selecting fuzzy rules and membership function. The selection of fuzzy rules and membership functions directly affect the performance of the fuzzy logic system. The fuzzy logic system consists of four components: fuzzifier, inference engine, defuzzifier, and the fuzzy knowledge base. In the fuzzifier, crisp inputs are translated into linguistic values using a membership function to be used to the input linguistic variables. After fuzzification, the inference engine refers to the rule base containing fuzzy IF- THEN rules to derive the linguistic values. In the last step, the output linguistic variables from the inference are converted to the final crisp values by the defuzzifier using membership function for representing the final sentence score. The automatic summarization means a automatically summarized output is given when an input is applied. Remember that input is well structured document. For this there are initially preprocesses such as Sentence Segmentation, Tokenization, Removing stop words and Word Stemming. Sentence Segmentation is separating document into sentences. Tokenization means separating sentences into words. Removing stop words means removing frequently occurring words such as a, an, the etc. And word stemming means removing suffixes and prefixes. After preprocessing each sentence is represented by attribute of vector of features. For each sentence there are 8 features and each feature has a value between 0 and 1. The 8 features are: Title features, Sentence length, Term weight, Sentence position, Sentence to sentence similarity, Proper noun, Thematic word and Numerical data.

The method is as follows... After extraction of 8 features the result is passed to fuzzifier then to inference engine and finally to defuzzifier. Rules for Inference engine is supplied from Fuzzy rule base. After this each sentence will have score and the sentence is sorted in the decreasing order of the score. Then 20% of this finally sorted sentences will be the summary of the given document.

7.2 Present theories and practices: In [1], defines the most important criteria for a summary and different methods of text summarization as well as main steps for summarization process is discussed. There are different approaches for sentence selection are presented in order to generate a summary from a text. Author has explained detailed steps for summary of document, which are topic identification, interpretation and summary generation. Also author has given the different approaches for scoring and selecting sentences. In [2], paper author has concentrating on extractive summarization methods. An extractive summary is selection of important sentences from the original text. The importance of sentences is decided based on statistical and linguistic features of sentences. There are two broad methods of text summarization extractive and abstractive summarization. An extractive summarization method consists of selecting important sentences, paragraphs etc. from the original document and concatenating them into shorter form. An abstractive summarization method consists of understanding the original text and re-telling it in fewer words. In [3], paper shows recent techniques and challenges on advances of automatic text summarization. Special attention is paid to the latest trends in text summarization.Author discusses the key challenges in automatic text summarization. These are inherent problem of overlapping of sets of similar text units or paragraphs; documents which contain long sentences are still a problem; another challenge is the word sense ambiguities which are inherent to natural language. The problem is that matching a system summary against the ideal summary is very difficult to establish. The problem of providing much accurate or efficient result for automatic text summarization. In [4], author has discussed different types of summarization approaches de-

pending on what the summarization method focuses on to make the summary of the text. Automatic document summarization is extremely helpful in tackling the information overload problems. It is the technique to identify the most important pieces of information from the document, omitting irrelevant information and minimizing details to generate a compact coherent summary document. Author has given the types of summaries - Abstract vs. Extract summary, Generic vs. Query-based summary, Single vs. Multi-document summary, Indicative vs. Informative, Background vs. Justthe-news. Author has mentioned in detail the different approaches: Graph Theoretic Approach, Text summarization based on fuzzy logic, Text summarization using cluster based method. In paper [5] , it is said that sentence scoring is the technique most used for extractive text summarization. Paper describes 15 sentence scoring algorithm and performs a quantitative and qualitative assessment of these 15 algorithms. Example Sentence length: It works as follows: (i) Calculate the largest sentence length; (ii) Penalize sentences larger than 80 percent of the largest sentence length; (iii) Calculate the Sentence Length Score for all other sentences. Author has used three different datasets (News, Blogs and Article contexts) . It is a single document summarization. In paper [6], author has proposed text summarization based on fuzzy logic method to extract important sentences as a summary. This paper focuses on extraction approach. The sentence selection method had used to obtain the suitable sentences by assigning some numerical weighting to each sentences and selecting the best one. The 8 important features are used and calculate their score for each sentence.These are Title word, Sentence length, Sentence Position, Numerical data, Term weight, Sentence to Sentence Similarity, existence of Thematic words and Proper Nouns. Author presents the some summarization approaches and describes preprocessing and the important features. The experimental results compared with the baseline summarizer and Microsoft Word 2007 summarizers.

Figure 5.1: Text summarization based on fuzzy logic system architecture

In paper [7], the focus is on Text Summarization by Extraction using Fuzzy Logic. Author has used the hybrid method-Statistical and Linguistic methods .The evaluation is carried out by using precision and recall method . Using all the 8 feature scores, the score for each sentence are derived using fuzzy logic method. The fuzzy logic method uses the fuzzy rules and triangular membership function. The fuzzy rules are in the form of IF-THEN. The triangular membership function fuzzifies each score into one of 3 values that is LOW,MEDIUM HIGH. Then we apply fuzzy rules to determine whether sentence is unimportant, average or important. This is also known as defuzzification. In paper [8] author has analyzed some state of methods for text summarization. He discussed the main disadvantages of these methods and proposed a new method of text summarization using fuzzy logic. The problems of machine learning method is features used in this method e.g. the occurrence of proper nouns have binary attributes such as zero and one which sometimes are not exact. It is clear that some of the attributes have more importance and some have less and so they should have balance weight in computations and hence the fuzzy logic is used to solve this problem. The implementation of text summarization is carried out using MATLAB. In this [9] paper the focus is on the automatic text summarization by sentence extraction.Author used fuzzy logic to extract the sentences. The eight features and their calculated scores are used compare the results with the baseline summarizer and Microsoft Word 2007 summarizers. Author has explained in detail the definitions and terminology of each eight sentence features with their formulas. Example, T itle f eature =

Number o f word in S i

Number o f word in T itle This paper[10], proposed an automatic text summarization approach based on sentence extraction using fuzzy logic, genetic algorithm, semantic role labeling and their combinations to generate high quality summaries. GA used in text summarization. Author has mentioned the benefits of the genetic algorithm in the optimization problem in for feature selection. Fuzzy IF- THEN rules were used to balance the weights between important and unimportant features. In [11] ,author has applied the fuzzy logic approach for important features to

extract the sentences. In this paper the result is compared with baseline summarizer and Microsoft Word 2007 summarizers. Author has used 9 sentence features and 30 test documents in DUC2002 data set. Fuzzy logic techniques in the form of approximate reasoning provide decision-support . In this [12] paper, author proposed a fuzzy-rough set aided method to extract key sentences as a summary of a document by estimating the relevance of sentences through the use of fuzzy-rough sets. This method uses senses rather than raw words to avoid the similar semantic meaning .Author has proposed the system in which the four phases of processes are organized. These are :Document Preprocessing ,Word Sense Disambiguation ,Semantic Patterns Retrieval ,Key Sentences Extraction . In this [13] paper, the investigation of the correlation between ROUGE and human evaluation of extractive meeting summaries is carried out. Both human and system generated summaries are used. The human evaluation of different summaries and calculated ROUGE scores are examined with their correlation .The better correlation can be achieved between the ROUGE scores and human evaluation. In text summarization, ROUGE has been correlate well with human evaluation when measuring match of content units. In this [14] paper author proposed a sentences clustering based summarization approach. The proposed approach consists of three steps: first clusters the sentences based on the semantic distance among sentences in the document, and then on each cluster calculates the accumulative sentence similarity based on the multi features combination method, at last chooses the topic sentences by some extraction rules. The text summarization result is not only depends the sentence features, but also depends on the sentence similarity measure. In [15] paper,it is said that the summarization process has three phases: analyzing the source text, determining its salient points, and synthesizing an appropriate output. Most current work focuses on the more developed technology of summarizing a single document. In [16] paper,author proposed a method of personalized text summarization which improves the conventional automatic text summarization methods by taking into account the differences in readers’ characteristics. A method of personalized summa-

rization which extracts from the document information that we assume to be the most important or interesting for a particular user. Annotations (e.g. highlights) can indicate a user’s interest in the specific parts of the document .Author used them as one of the sources of personalization. This [17] paper describes and performs a quantitative and qualitative assessment of 15 algorithms for sentence scoring available in the literature. Three different datasets (News, Blogs and Article contexts) were evaluated. Each of the 15 scoring methods is described and implemented. A quantitative and qualitative assessment of those methods using three different datasets (news, blogs, and articles context) is performed .

7.3 Proposed work:

In this dissertation work it is proposed to use Fuzzy based extraction method for can producing summary of text document.

Objective of Project: • To study different techniques of text summarization . • To present a text document in a summarize form . • To identify the important sentence features for extraction using fuzzy logic. • To analyse the generated summary with online automate summarizer. Phase I : Literature Survey – Literature survey of text summarization will be carried out by referring Journal of Elsevier, IEEE communication magazine, Journal of IET communication, Journal on areas in communications, Journal of IEEE system, Journal of Springer. – In this phase literature survey of different techniques for text summarization along with its limitations will be performed. Different techniques for text summarization.

1. Term Frequency-Inverse Document Frequency (TF- IDF): 2. Cluster based : 3. Graph theoretic approach :: 4. Machine Learning approach : 5. Automatic text summarization based on fuzzy logic : Phase II : Implementation of Existing Technique and Propose New Technique – In this phase, implementation of existing technique based on fuzzy logic will be performed. – Propose possible improvements in the existing technique. Phase III : Implementation of Proposed Technique – Implementation of proposed technique by considering improvements found during phase-II to achieve a better performance. – Improvements using topic segmentation and semantic analysis of the text . Phase IV :Performance Analysis and Comparison – In this phase we will analyse the performance of our proposed technique using recision, recall and F- measure . – Comparison between existing and proposed technique will be performed and finalize conclusions from the same.

8. Facilities Available : The following facilities to carry out dissertation work are available at Rajarambapu Institute of Technology, Rajaramnagar. – Programming Language : JAVA/.NET Framework/MATLAB – Other Facilities:Library,IEEE /Springer/ScienceDirect.

9. Expected date for completion of work : April 2014 10. Approximate expenditure : Nil Date : Place: Rajaramnagar. Babar Samrat Ashok Student

Prof. S.A.Thorat

Prof A. C. Adamuthe

Prof. S. S.Patil

Guide

HOP,

HOD,

Dept. of CSE,

Dept. of CSE,

Dept. of CSE,

RIT, Rajaramnagar.

RIT,Rajaramnagar.

RIT,Rajaramnagar.

53

References

[1] Saeedeh Gholamrezazadeh ,Mohsen Amini Salehi, “A Comprehensive Survey on Text Summarization Systems ”, 978-1-4244-4946-0,2009 IEEE. [2] Vishal Gupta and Gurpreet Singh Lehal "A survey of Text summarization techniques ",Journal of Emerging Technologies in Web Intelligence VOL 2 NO 3 August 2010. [3] Oi Mean Foong ,Alan Oxley and Suziah Sulaiman "Challenges and Trends of Automatic Text Summarization ",International Journal of Information and Telecommunication Technology Vol.1, Issue 1, 2010 . [4] Archana AB, Sunitha. C ,"An Overview on Document Summarization Techniques" ,International Journal on Advanced Computer Theory and Engineering ˝ 2526, Volume-1, Issue-2, 2013 . (IJACTE) ,ISSN (Print) : 2319 U [5] Rafael Ferreira ,Luciano de Souza Cabral ,Rafael Dueire Lins ,Gabriel Pereira e Silva ,Fred Freitas ,George D.C. Cavalcanti ,Luciano Favaro , "Assessing sentence scoring techniques for extractive text summarization ",Expert Systems with Applications 40 (2013) 5755-5764 ,2013 Elsevier . [6] L. Suanmali , N. Salim and M.S. Binwahlan,"Fuzzy Logic Based Method for Improving Text Summarization" , International Journal of Computer Science and Information Security, 2009, Vol. 2, No. 1,pp. 4-10. [7] Mrs.A.R.Kulkarni , Dr.Mrs.S.S.Apte "A DOMAIN-SPECIFIC AUTOMATIC TEXT SUMMARIZATION USING FUZZY LOGIC ",International Journal of

Computer Engineering and Technology (IJCET), ISSN 0976- 6367(Print), ISSN 0976 - 6375(Online) Volume 4, Issue 4, July-August (2013). [8] Farshad Kyoomarsi ,Hamid Khosravi ,Esfandiar Eslami ,Pooya Khosravyan Dehkordy; "Optimizing Text Summarization Based on Fuzzy Logic ",Seventh IEEE/ACIS International Conference on Computer and Information Science ,9780-7695-3131-1 ,2008 [9] Ladda Suanmali ,Naomie Salim and Mohammed Salem Binwahla ,"Feature-Based Sentence Extraction Using Fuzzy Inference rules ",2009 International Conference on Signal Processing Systems ,978-0-7695-3654-5 ,2009 IEEE . [10] Ladda Suanmali, Naomie Salim and Mohammed Salem Binwahlan "Fuzzy Genetic Semantic Based Text Summarization ", 2011 Ninth Ninth International Conference on Dependable, Autonomic and Secure Computing ,978-0-7695-4612-4 ,2011 IEEE . [11] Ladda Suanmali, Mohammed Salem Binwahlan and Naomie Salim "Sentence Features Fusion for Text Summarization Using Fuzzy Logic ",2009 Ninth International Conference on Hybrid Intelligent Systems ,978-0-7695-3745-0 ,2009 IEEE. [12] Hsun-Hui Huang ,Yau-Hwang Kuo ,Horng-Chang Yang ,"Fuzzy-Rough Set Aided Sentence Extraction Summarization",Proceedings of the First International Conference on Innovative Computing, Information and Control (ICICIC’06),0-76952616-0/06 ,IEEE. [13] Feifan Liu and Yang Liu, Member, IEEE "Exploring Correlation Between ROUGE and Human Evaluation on Meeting Summaries ",IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 1, JANUARY 2010 [14] ZHANG Pei-ying ,LI Cun-he , "Automatic text summarization based on sentences clustering and extraction ",978-1-4244-4520-2 ,2009 IEEE . [15] Udo Hahn ,Inderjeet Man ,"The Challenges of Automatic Summarization ",00189162/00,2000 IEEE .

[16] Róbert Móro, Mária Bieliková "Personalized Text Summarization Based on Important Terms Identification ",2012 23rd International Workshop on Database and Expert Sytems Applications ,1529-4188, 2012 IEEE . [17] Rafael Ferreira ,Luciano de Souza Cabral ,Rafael Dueire Lins ,Gabriel Pereira e Silva , "Assessing sentence scoring techniques for extractive text summarization", Expert Systems with Applications 40 (2013) 5755-5764 ,2013 Elsevier Ltd.

56

ACKNOWLEDGMENTS I must mention several individuals and organizations that were of enormous help in the development of this work. No volume of words is enough to express my gratitude towards Prof. S. A. Thorat my supervisor, philosopher and personality with a midas touch encouraged me to carry this work. His continuous invaluable knowledgably guidance throughout the course of this study helped me to complete the work up to this stage and hope will continue in further research. I am also very thankful to Examiner 1 and Examiner 2 for their valuable suggestions, critical examination of work during the progress, I am indebted to them. I am very grateful to Prof. S. S. Patil our Head of Department for his positive cooperation and immense kindly help during the period of work with him. In addition, very energetic and competitive atmosphere of the Computer Science and Engineering Department had much to do with this work. I acknowledge with thanks to faculty, teaching and non-teaching staff of the department, Central library and Colleagues. I sincerely thank to Dr. Mrs. S. S. Kulkarni our principal for supporting me to do this work and I am very much obliged to her. Last but not the least; Mr. Ashok B. Babar my father, Mrs. Mandakini A.Babar my mother, constantly supported me for this work in all aspects.

RIT, Sakhrale

Samrat Ashok Babar

June 2013

(1230020)

57

LIST OF PUBLICATIONS ON PRESENT WORK -NIL

58

VITAE

59