
Available online at www.sciencedirect.com

ScienceDirect Procedia Technology 19 (2015) 880 – 887

8th International Conference Interdisciplinarity in Engineering, INTER-ENG 2014, 9-10 October 2014, Tirgu-Mures, Romania

A novel Gaussian based similarity measure for clustering customer transactions using transaction sequence vector

M.S.B. Phridviraj (a,*), Vangipuram RadhaKrishna (b), Chintakindi Srinivas (a), C.V. GuruRao (c)

(a) Department of Computer Science & Engineering, Kakatiya Institute of Technology, Warangal, India
(b) Department of Information Technology, VNR VJIET (Autonomous), Hyderabad, India
(c) Principal & Professor of Computer Science & Engineering, S.R. Engineering College (Autonomous), Warangal, India

Abstract

Clustering transactions in sequence databases, temporal databases, and time series databases is receiving considerable attention from database researchers. Significant research is being carried out on defining and validating new similarity measures for such databases that can accurately and efficiently find the similarity between any two user transactions in a database of transactions, in order to predict user behavior. The distribution of the items present in the transactions contributes to a great extent to the degree of similarity between them, and this forms the key idea behind the proposed similarity measure. The main objective of this research is to design a similarity function for two user transactions by defining two terms, the transaction sequence vector and the transaction vector, and using them to define the proposed measure. We then analyze the measure for the worst case, average case, and best case situations. The similarity measure is Gaussian based and preserves the properties of the Gaussian function.

© 2015 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of “Petru Maior” University of Tirgu Mures, Faculty of Engineering.

Keywords: Transaction; Transaction Sequence vector; Transaction vector; feature distribution

* Corresponding author. Tel.: +919030076521 E-mail address:[email protected]

2212-0173 © 2015 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of “Petru Maior” University of Tirgu Mures, Faculty of Engineering doi:10.1016/j.protcy.2015.02.126

M.S.B. Phridviraj et al. / Procedia Technology 19 (2015) 880 – 887


1. Introduction

Clustering transactions in sequence databases, temporal databases, and time series databases is receiving considerable attention from database researchers and from the software industry. Clustering has several applications in sequence, temporal, time series, and spatial databases. Its significance comes from the needs of decision making, such as classification and prediction [21]. The input to a clustering algorithm over a database is usually a set of user transactions, and the output is a set of clusters of user transactions. An advantage of clustering in this setting is that the user transactions are defined over a fixed itemset whose items do not change frequently. In other words, the itemset is static, which eliminates the need to preprocess the transaction dataset.

The motivation for this work comes from our previous research [21, 22, 23]. In this paper, we design a similarity measure for clustering user transactions that has the Gaussian property and considers the distribution of each item of the itemset over the entire database of transactions. If the transactions arrive as a stream, we can first find the closed frequent itemsets and then apply the similarity measure to the final set of transactions, as in our previous work [22, 23, 24].

2. Proposed Similarity Measure – To Cluster User Transactions

The idea for the present similarity measure comes from our previous work [22, 23, 24], which considers feature distribution and commonality; both also hold between any pair of transactions. In this work we assume each transaction to be a sequence of 2-tuple elements, the first element being the count of an item and the second denoting the presence or absence of that item in the transaction, say $T_i$.
Table 1 below defines the function $\Phi$, which we used in previous work [23, 24]; here we use it as the second element of the 2-tuple representation. We define another function, $\Delta(I_{ik}, I_{jk})$, which stores the difference between the counts of item $k$ in transactions $T_i$ and $T_j$. Table 1 and Table 2 define the functions $\Phi$ and $\Delta$ for the binary and non-binary transaction-itemset, respectively.

Table 1. Function definitions for Ф and ∆ for Transaction-Itemset in binary form

I_ik    I_jk    Ф(I_ik, I_jk)    ∆(I_ik, I_jk)
0       0       U                0
0       1       0                1
1       0       0                1
1       1       1                0

Table 2. Function definitions for Ф and ∆ for Transaction-Itemset in non-binary form

I_ik    I_jk    Ф(I_ik, I_jk)    ∆(I_ik, I_jk)
0       0       U                0
0       C_jk    0                C_jk
C_ik    0       0                C_ik
C_ik    C_jk    1                C_ik - C_jk
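The two lookup tables above can be expressed as small functions. The following is a minimal sketch, not the authors' implementation: the names `phi`, `delta`, and the string marker `U` are assumptions introduced here for illustration, and item counts are assumed to be non-negative integers.

```python
from typing import Union

# Assumed marker for the value U in Tables 1 and 2 (item absent from both transactions).
U = "U"


def phi(c_i: int, c_j: int) -> Union[int, str]:
    """Phi from Tables 1 and 2: U if absent in both, 1 if present in both, else 0."""
    if c_i == 0 and c_j == 0:
        return U
    if c_i > 0 and c_j > 0:
        return 1
    return 0


def delta(c_i: int, c_j: int) -> int:
    """Delta from Tables 1 and 2, for binary (0/1) or non-binary counts."""
    if c_i > 0 and c_j > 0:
        return c_i - c_j  # item present in both transactions
    # At most one count is non-zero here, so the sum is that count (or 0).
    return c_i + c_j


# Binary rows of Table 1, in order: (0,0), (0,1), (1,0), (1,1)
binary_rows = [(phi(a, b), delta(a, b)) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

With binary inputs the same two functions reproduce Table 1, so a single pair of definitions covers both tables.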

Definition 1: Transaction Vector, $\Gamma_i$

Let $T_i$ be any transaction with items drawn from the itemset $I = \{I_1, I_2, I_3, \ldots, I_m\}$. The transaction vector $\Gamma_i$ is the comma-separated sequence of 2-tuples of the form $(C_{ik}, E_{ik})$, where $C_{ik}$ is the count of item $k$ in transaction $T_i$ and $E_{ik}$ denotes the presence or absence of item $k$ in $T_i$. For example, let $\Gamma_1 = \{(C_{11}, E_{11}), (C_{12}, E_{12}), \ldots, (C_{1m}, E_{1m})\}$ and $\Gamma_2 = \{(C_{21}, E_{21}), (C_{22}, E_{22}), \ldots, (C_{2m}, E_{2m})\}$ be two transaction vectors. If items are represented in binary form without counting, then $C_{ik} = 1$ if $E_{ik} = 1$ and $C_{ik} = 0$ if $E_{ik} = 0$. If the count of each item in a transaction is maintained, then $C_{ik}$ is that count when $E_{ik} = 1$ and 0 when $E_{ik} = 0$.

Definition 2: Sequence Vector

Let $\Gamma_i$ and $\Gamma_j$ be any two transaction vectors with items drawn from the itemset $I = \{I_1, I_2, I_3, \ldots, I_m\}$. The sequence vector over $\Gamma_i$ and $\Gamma_j$ is defined as $SV[\Gamma_i, \Gamma_j] = \bigcup_k \{T_k\} = \bigcup_k \{(\Delta_k^{i,j}, \Phi_k^{i,j})\}$, with $\bigcup_k$ denoting the union of all 2-tuple elements. It is represented as $SV[\Gamma_i, \Gamma_j] = [T_1, T_2, T_3, \ldots, T_m]$, where $T_k$ is the 2-tuple $T_k = ((C_{ik} - C_{jk}), \Phi(E_{ik}, E_{jk}))$. For the two transaction vectors $\Gamma_1$ and $\Gamma_2$, the sequence vector is given by Eq. 1 below:

$$SV[\Gamma_1, \Gamma_2] = [T_1, T_2, T_3, \ldots, T_m] \tag{1}$$

where

$T_1 = ((C_{11} - C_{21}), \Phi(E_{11}, E_{21}))$
$T_2 = ((C_{12} - C_{22}), \Phi(E_{12}, E_{22}))$
$T_3 = ((C_{13} - C_{23}), \Phi(E_{13}, E_{23}))$
...
$T_m = ((C_{1m} - C_{2m}), \Phi(E_{1m}, E_{2m}))$

In general, the sequence vector for any two transaction vectors $\Gamma_i$ and $\Gamma_j$ is given by

$$SV[T_i, T_j] = \bigcup_k \{T_k\} = \bigcup_k \{(\Delta_k^{i,j}, \Phi_k^{i,j})\} \tag{2}$$

with $\bigcup_k$ denoting the union of all 2-tuple elements. We now generalize the sequence vector of two transaction vectors and represent $SV[T_i, T_j]$ as

$$SV[T_i, T_j] = \{T_1, T_2, T_3, \ldots, T_m\} \tag{3}$$

where

$$T_k = (\Delta(I_{ik}, I_{jk}), \Phi(I_{ik}, I_{jk})), \quad k = 1, \ldots, m \tag{4}$$

with $\Delta(I_{ik}, I_{jk}) = |I_{ik}| - |I_{jk}|$, $\Phi(I_{ik}, I_{jk})$ the function on item $k$ with respect to the two transactions $T_i$ and $T_j$, and $m$ the number of items in the itemset. Each element of the sequence vector is thus a 2-tuple of the form $(\Delta, \Phi)$. Here $\Delta$ holds the difference between the counts of the item in transactions $T_i$ and $T_j$; in the binary case the count values are 0 or 1. Having given the required definitions, we now define the proposed similarity measure, TSIM, by Eq. 5 below:

$$TSIM = \frac{1 + S(\alpha, \beta)}{1 + \lambda} \tag{5}$$
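Definitions 1 and 2 can be sketched in a few lines of code. This is an illustrative sketch only: the function names `transaction_vector`, `phi`, and `sequence_vector`, the string marker `U`, and the `binary` flag are assumptions made here, and the example baskets are the first two transactions of the case study in Section 4.

```python
from collections import Counter
from typing import List, Sequence, Tuple, Union

U = "U"  # assumed marker for Phi when the item is absent from both transactions


def transaction_vector(items: Sequence[str], itemset: Sequence[str],
                       binary: bool = True) -> List[Tuple[int, int]]:
    """Return [(C_ik, E_ik), ...] for one transaction (Definition 1)."""
    counts = Counter(items)
    vector = []
    for item in itemset:
        present = 1 if counts[item] > 0 else 0          # E_ik
        count = present if binary else counts[item]     # C_ik
        vector.append((count, present))
    return vector


def phi(e_i: int, e_j: int) -> Union[int, str]:
    """Phi on presence flags: U if both absent, 1 if both present, else 0."""
    if e_i == 0 and e_j == 0:
        return U
    return 1 if (e_i == 1 and e_j == 1) else 0


def sequence_vector(tv_i, tv_j):
    """Return [T_1, ..., T_m] with T_k = (C_ik - C_jk, Phi(E_ik, E_jk)) as in Eq. 1."""
    return [(c_i - c_j, phi(e_i, e_j))
            for (c_i, e_i), (c_j, e_j) in zip(tv_i, tv_j)]


itemset = ["bread", "butter", "jam", "coffee", "milk"]
gamma_1 = transaction_vector(["bread", "butter", "jam"], itemset)
gamma_2 = transaction_vector(["jam", "coffee", "milk"], itemset)
sv_12 = sequence_vector(gamma_1, gamma_2)
```

The sketch keeps the signed difference of Eq. 1, whereas Table 1 lists the magnitude for mismatched items; one could equally store the absolute value, since only the magnitude of the difference distinguishes a match from a mismatch.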

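The function $S(\alpha, \beta)$ is given by

$$S(\alpha, \beta) = \frac{\sum_{k=1}^{m} \alpha(T_{ik}, T_{jk})}{\sum_{k=1}^{m} \beta(T_{ik}, T_{jk})} \tag{6}$$

where

$$\alpha(T_{ik}, T_{jk}) = \begin{cases} 0.5 \left[ 1 + e^{-\gamma^2} \right] ; & \Phi(I_{ik}, I_{jk}) = 1 \text{ and } \Delta(I_{ik}, I_{jk}) = 0 \\ -e^{-\gamma^2} ; & \Phi(I_{ik}, I_{jk}) = 0 \text{ and } \Delta(I_{ik}, I_{jk}) = 1 \\ 0 ; & \Phi(I_{ik}, I_{jk}) = U \text{ and } \Delta(I_{ik}, I_{jk}) = 0 \end{cases} \tag{7}$$

with

$$\gamma = \frac{\Delta(I_{ik}, I_{jk})}{\sigma_k} \tag{8}$$

Here, $\sigma_k$ is the standard deviation of feature (item) $k$ over all transactions in the data set.

$$\beta(T_{ik}, T_{jk}) = \begin{cases} 0 ; & \Phi(I_{ik}, I_{jk}) = U \\ 1 ; & \Phi(I_{ik}, I_{jk}) \neq U \end{cases} \tag{9}$$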
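Equations 5 to 9 can be combined into one routine. The sketch below is a hedged illustration rather than the authors' code: the name `tsim` and its parameters are assumptions, the per-item standard deviations are passed in by the caller, the value $S = -1$ used when no item occurs in either transaction follows the worst-case analysis in Section 3.2, and applying the first branch of Eq. 7 to non-binary counts that differ is an extrapolation beyond the cases the equation lists.

```python
import math
from typing import Sequence


def tsim(t_i: Sequence[int], t_j: Sequence[int],
         sigma: Sequence[float], lam: float = 1.0) -> float:
    """Similarity of two item-count vectors following Eqs. 5-9 (sketch)."""
    alpha_sum = 0.0  # numerator of S(alpha, beta), Eq. 6
    beta_sum = 0     # denominator of S(alpha, beta), Eq. 6
    for c_i, c_j, s in zip(t_i, t_j, sigma):
        if c_i == 0 and c_j == 0:
            continue  # Phi = U: alpha = 0 and beta = 0
        diff = abs(c_i - c_j)
        gamma = 0.0 if diff == 0 else diff / s      # Eq. 8
        gauss = math.exp(-(gamma ** 2))
        if c_i > 0 and c_j > 0:
            alpha_sum += 0.5 * (1.0 + gauss)        # Phi = 1
        else:
            alpha_sum -= gauss                      # Phi = 0
        beta_sum += 1                               # Phi != U, Eq. 9
    # U / U is indeterminate; Section 3.2 resolves it by returning -1.
    s_value = -1.0 if beta_sum == 0 else alpha_sum / beta_sum
    return (1.0 + s_value) / (1.0 + lam)            # Eq. 5


identical = tsim([1, 1, 1], [1, 1, 1], [0.5, 0.5, 0.5])
empty = tsim([0, 0], [0, 0], [0.5, 0.5])
opposite = tsim([1, 0], [0, 1], [1.0, 1.0])
```

A mismatch on an item with $\sigma_k = 0$ would divide by zero; this cannot arise when $\sigma_k$ is computed over a data set containing both transactions, because a mismatch implies that the item varies.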

Here, $T_{ik}$ indicates the presence or absence of the $k$th feature in the $i$th transaction. The values of $\alpha$ and $\beta$ measure the contribution of each feature to the similarity.

3. Validation of the Proposed Measure

3.1. Best Case

In the best case situation, all the items are present in both transactions of the pair considered, that is, $T_1 = \{1, 1, 1, \ldots, 1\}$ and $T_2 = \{1, 1, 1, \ldots, 1\}$ with $m$ entries each. The sequence vector is denoted by $SV_{12}$ and is represented as $SV_{12} = \langle (0, 1), (0, 1), \ldots, (0, 1) \rangle$. The value of $S(\alpha, \beta)$ is computed using Eq. 7 and Eq. 9 as shown below:

$$
\begin{aligned}
S(\alpha, \beta) &= \frac{\alpha(T_{i1}, T_{j1}) + \alpha(T_{i2}, T_{j2}) + \alpha(T_{i3}, T_{j3}) + \cdots + \alpha(T_{im}, T_{jm})}{\beta(T_{i1}, T_{j1}) + \beta(T_{i2}, T_{j2}) + \beta(T_{i3}, T_{j3}) + \cdots + \beta(T_{im}, T_{jm})} \\
&= \frac{0.5 \left[ (1 + e^{-\gamma_1^2}) + (1 + e^{-\gamma_2^2}) + (1 + e^{-\gamma_3^2}) + \cdots + (1 + e^{-\gamma_m^2}) \right]}{1 + 1 + 1 + \cdots \ (m \text{ times})} \\
&= \frac{0.5 \, (1 + 1 + 1 + \cdots \ m \text{ times}) + 0.5 \, (e^{-\gamma_1^2} + e^{-\gamma_2^2} + e^{-\gamma_3^2} + \cdots + e^{-\gamma_m^2})}{m}
\end{aligned}
\tag{10}
$$

For the best case situation, the values of $\sigma_k$ for $k = 1$ to $m$ approach zero while $\Delta(I_{ik}, I_{jk}) = 0$ for every item, so each $\gamma_k$ is taken as 0 and the values of $e^{-\gamma_1^2}, e^{-\gamma_2^2}, e^{-\gamma_3^2}, \ldots, e^{-\gamma_m^2}$ become 1. This means that Eq. 10 reduces to

$$S(\alpha, \beta) = \frac{0.5 \, m + 0.5 \, m}{m} = \frac{m}{m} = 1 \tag{11}$$

In this case, the similarity measure is

$$TSIM = \frac{S(\alpha, \beta) + 1}{\lambda + 1} = \frac{1 + 1}{1 + 1} = 1 \tag{12}$$

The value TSIM = 1 from Eq. 12 indicates that the two transactions are most similar to each other.

3.2. Worst Case

The worst case situation occurs when all the items are absent from both transactions considered, that is, $T_1 = \{0, 0, 0, \ldots, 0\}$ and $T_2 = \{0, 0, 0, \ldots, 0\}$ with $m$ entries each. The sequence vector is denoted by $SV_{12}$ and is represented as $SV_{12} = \langle (0, U), (0, U), \ldots, (0, U) \rangle$. The value of $S(\alpha, \beta)$ is computed using Eq. 7 and Eq. 9 as shown below:

$$S(\alpha, \beta) = \frac{\alpha(T_{i1}, T_{j1}) + \alpha(T_{i2}, T_{j2}) + \alpha(T_{i3}, T_{j3}) + \cdots + \alpha(T_{im}, T_{jm})}{\beta(T_{i1}, T_{j1}) + \beta(T_{i2}, T_{j2}) + \beta(T_{i3}, T_{j3}) + \cdots + \beta(T_{im}, T_{jm})} = \frac{U}{U} \ (\text{indeterminate situation}) = -1 \ (\text{so return } -1)$$

In this case, the similarity measure is

$$TSIM = \frac{S(\alpha, \beta) + 1}{\lambda + 1} = \frac{-1 + 1}{1 + 1} = 0 \tag{13}$$

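The value TSIM = 0 in Eq. 13 indicates that the two transactions are least similar, or dissimilar, with respect to each other.

3.3. Average Case Scenario

In the average case situation, $T_1 = \{1, 0, 1, 0, 1, \ldots\}$ and $T_2 = \{0, 1, 0, 1, 0, \ldots\}$ with $m$ entries each, so every item is present in exactly one of the two transactions. The sequence vector is denoted by $SV_{12}$ and is represented as $SV_{12} = \langle (1, 0), (1, 0), \ldots, (1, 0) \rangle$. The value of $S(\alpha, \beta)$ is computed using Eq. 7 and Eq. 9 as shown below:

$$
\begin{aligned}
S(\alpha, \beta) &= \frac{\alpha(T_{i1}, T_{j1}) + \alpha(T_{i2}, T_{j2}) + \alpha(T_{i3}, T_{j3}) + \cdots + \alpha(T_{im}, T_{jm})}{\beta(T_{i1}, T_{j1}) + \beta(T_{i2}, T_{j2}) + \beta(T_{i3}, T_{j3}) + \cdots + \beta(T_{im}, T_{jm})} \\
&= \frac{(-e^{-\gamma_1^2}) + (-e^{-\gamma_2^2}) + (-e^{-\gamma_3^2}) + \cdots + (-e^{-\gamma_m^2})}{1 + 1 + 1 + \cdots \ (m \text{ times})} \\
&= \frac{-\left( e^{-\gamma_1^2} + e^{-\gamma_2^2} + e^{-\gamma_3^2} + \cdots + e^{-\gamma_m^2} \right)}{m}
\end{aligned}
$$

Assuming the values of $e^{-\gamma_1^2}, e^{-\gamma_2^2}, e^{-\gamma_3^2}, \ldots, e^{-\gamma_m^2}$ are all the same, the above equation reduces to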

$$S(\alpha, \beta) = \frac{-m \, e^{-\gamma^2}}{m} = -e^{-\gamma^2} \tag{14}$$

Case 1: $\gamma = 0$. The value of the similarity measure TSIM is then given by

$$TSIM = \frac{1 - e^{-\gamma^2}}{1 + 1} = \frac{1 - 1}{1 + 1} = 0$$

Case 2: $\gamma \neq \infty$. In practice $\gamma$ is not infinite. Then the value of TSIM is

$$TSIM = \frac{1 - e^{-\gamma^2}}{1 + 1} = 0.5 \, (1 - e^{-\gamma^2}) \tag{15}$$
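As a quick check on the range of Eq. 15, the following short derivation may help. It assumes that $\gamma$ is real and finite, that the Gaussian term is $e^{-\gamma^2}$ as in Eq. 7, and that $\lambda = 1$; the value $\gamma = 2$ is chosen here purely for illustration.

```latex
0 < e^{-\gamma^{2}} \le 1
\;\Longrightarrow\;
0 \le 0.5\,\bigl(1 - e^{-\gamma^{2}}\bigr) < 0.5,
\qquad
\gamma = 2:\quad 0.5\,\bigl(1 - e^{-4}\bigr) \approx 0.4908 .
```

Under these assumptions the average case always scores below 0.5, between the worst case value 0 and the best case value 1.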

4. Case Study

Consider the transactions with the following items as in Table 2. Table 3 below shows the binary representation of the transaction-item matrix.

Table 2. User Transactions with items

Transaction    Frequent items
T1             {BREAD, BUTTER, JAM}
T2             {JAM, COFFEE, MILK}
T3             {BUTTER, JAM, COFFEE, MILK}
T4             {BREAD, BUTTER, JAM, MILK}
T5             {JAM, COFFEE}
T6             {BREAD, BUTTER, MILK}
T7             {BREAD, BUTTER, COFFEE}
T8             {BUTTER, COFFEE}
T9             {BUTTER, JAM, MILK}

Table 3. Transaction-Itemset Matrix in Binary Form

       bread    butter    jam    coffee    milk
T1     1        1         1      0         0
T2     0        0         1      1         1
T3     0        1         1      1         1
T4     1        1         1      0         1
T5     0        0         1      1         0
T6     1        1         0      0         1
T7     1        1         0      1         0
T8     0        1         0      1         0
T9     0        1         1      0         1
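The binary matrix and the per-item Gaussian terms can be derived from the item lists with a short script. This is a sketch under stated assumptions: the variable names are illustrative, and the source does not say which estimator is used for $\sigma_k$. The sketch assumes the sample standard deviation (denominator $n - 1$), which reproduces the mismatch terms 0.02732, 0.00584, and 0.01832 that appear in the sample computations.

```python
import math
import statistics

ITEMS = ["bread", "butter", "jam", "coffee", "milk"]

# Item lists of the nine user transactions in the case study.
TRANSACTIONS = {
    "T1": {"bread", "butter", "jam"},
    "T2": {"jam", "coffee", "milk"},
    "T3": {"butter", "jam", "coffee", "milk"},
    "T4": {"bread", "butter", "jam", "milk"},
    "T5": {"jam", "coffee"},
    "T6": {"bread", "butter", "milk"},
    "T7": {"bread", "butter", "coffee"},
    "T8": {"butter", "coffee"},
    "T9": {"butter", "jam", "milk"},
}

# Binary transaction-itemset matrix: one 0/1 row per transaction.
matrix = {
    name: [1 if item in basket else 0 for item in ITEMS]
    for name, basket in TRANSACTIONS.items()
}

# Per-item sample standard deviation over all transactions (assumed estimator).
sigma = [
    statistics.stdev([row[k] for row in matrix.values()])
    for k in range(len(ITEMS))
]

# Gaussian term exp(-gamma^2) for a mismatched item, where Delta = 1 and gamma = 1 / sigma_k.
mismatch_term = [math.exp(-((1.0 / s) ** 2)) for s in sigma]

for item, s, g in zip(ITEMS, sigma, mismatch_term):
    print(f"{item:7s} sigma = {s:.5f}  exp(-gamma^2) = {g:.5f}")
```

Items that are nearly always present or absent have a small $\sigma_k$, so a mismatch on them yields a smaller Gaussian term than a mismatch on an evenly distributed item.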


The sample computation below finds the similarity of transaction $T_1$ with each of the remaining transactions $T_2$ through $T_9$. The value of $\lambda$ is 1.

4.1. Sample Computations

In the computations below, $\alpha$ and $\beta$ denote the values of the numerator and denominator of the function $S(\alpha, \beta)$, respectively.

(T1, T2): α = -0.02732 - 0.00584 + 1 - 0.02732 - 0.02732 = 0.9122; β = 5; TSIM = (0.18244 + 1)/(1 + 1) = 0.59122
(T1, T3): α = -0.02732 + 1 + 1 - 0.02732 - 0.02732 = 1.91804; β = 5; TSIM = (0.38361 + 1)/(1 + 1) = 0.6918
(T1, T4): α = 1 + 1 + 1 + 0 - 0.02732 = 2.97268; β = 4; TSIM = (0.74317 + 1)/(1 + 1) = 0.87158
(T1, T5): α = -0.02732 - 0.00584 + 1 - 0.02732 + 0 = 0.93952; β = 4; TSIM = (0.23488 + 1)/(1 + 1) = 0.61744
(T1, T6): α = 1 + 1 - 0.01832 + 0 - 0.02732 = 1.95436; β = 4; TSIM = (0.48859 + 1)/2 = 0.7443
(T1, T7): α = 1 + 1 - 0.01832 - 0.02732 + 0 = 1.95436; β = 4; TSIM = (0.48859 + 1)/2 = 0.7443
(T1, T8): α = -0.02732 + 1 - 0.01832 - 0.02732 + 0 = 0.92704; β = 4; TSIM = (0.23176 + 1)/2 = 0.61588
(T1, T9): α = -0.02732 + 1 + 1 + 0 - 0.02732 = 1.94536; β = 4; TSIM = (0.48634 + 1)/2 = 0.74317

The final set of clusters formed after applying the clustering algorithm of our previous work [21, 22, 23] is Cluster-1: {T1, T2, T3, T4, T6, T9}; Cluster-2: {T7, T8}; Cluster-3: {T5}.
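The similarity of T1 to every other transaction can be recomputed with a compact script. This is a sketch, not the authors' code: the names `MATRIX`, `column_sigma`, and `tsim_binary` are introduced here for illustration, $\lambda = 1$ as stated in the text, and $\sigma_k$ is assumed to be the sample standard deviation (denominator $n - 1$) of each item column, an assumption chosen because it reproduces the Gaussian terms printed in the computations.

```python
import math
import statistics

# Binary transaction-itemset matrix (columns: bread, butter, jam, coffee, milk).
MATRIX = {
    "T1": [1, 1, 1, 0, 0],
    "T2": [0, 0, 1, 1, 1],
    "T3": [0, 1, 1, 1, 1],
    "T4": [1, 1, 1, 0, 1],
    "T5": [0, 0, 1, 1, 0],
    "T6": [1, 1, 0, 0, 1],
    "T7": [1, 1, 0, 1, 0],
    "T8": [0, 1, 0, 1, 0],
    "T9": [0, 1, 1, 0, 1],
}


def column_sigma(matrix):
    """Sample standard deviation of each item column (assumed estimator)."""
    rows = list(matrix.values())
    return [statistics.stdev([row[k] for row in rows]) for k in range(len(rows[0]))]


def tsim_binary(t_i, t_j, sigma, lam=1.0):
    """TSIM for 0/1 vectors: alpha and beta as in Eqs. 7 and 9, then Eq. 5."""
    alpha_sum, beta_sum = 0.0, 0
    for a, b, s in zip(t_i, t_j, sigma):
        if a == 0 and b == 0:
            continue                                  # Phi = U: no contribution
        beta_sum += 1
        if a == 1 and b == 1:
            alpha_sum += 1.0                          # 0.5 * (1 + e^0), gamma = 0
        else:
            alpha_sum -= math.exp(-((1.0 / s) ** 2))  # Delta = 1, gamma = 1 / sigma_k
    s_value = -1.0 if beta_sum == 0 else alpha_sum / beta_sum
    return (1.0 + s_value) / (1.0 + lam)


sigma = column_sigma(MATRIX)
similarities = {
    name: tsim_binary(MATRIX["T1"], row, sigma)
    for name, row in MATRIX.items()
    if name != "T1"
}
for name, value in similarities.items():
    print(f"TSIM(T1, {name}) = {value:.5f}")
```

Under these assumptions the sketch gives approximately 0.61744 for (T1, T5), 0.7443 for (T1, T6), and 0.61588 for (T1, T8).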


5. Conclusion

The objective of this research is to define a similarity function that finds the similarity between two transactions. This measure may then be used to cluster and classify user transactions, and can be further extended to classify users based on the transactions they carry out, which helps in predicting user behavior in advance. In this paper, we design and define a similarity measure by defining two terms, the transaction sequence vector and the transaction vector. We apply the clustering algorithm [16, 18] and show the clustering process for each transaction pair. The similarity measure is analyzed for the worst case, average case, and best case situations. To extend the clustering process to a data stream of transactions, we may use the algorithm defined in [22, 23].

References

[1] Albert Bifet, Geoff Holmes et al. 2011. Mining Frequent Closed Graphs on Evolving Data Streams. In the proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011, 591-98.
[2] Charu C. Aggarwal, Jiawei Han, Philip S. Yu. 2004. On Demand Classification of Data Streams. In the proceedings of ACM KDD'04, August 2004, USA.
[3] Hoang Thanh Lam, Toon Calders. 2010. Mining Top-K Frequent Items in a Data Stream with Flexible Sliding Windows. In the proceedings of ACM KDD'10, July 2010, USA.
[4] Cheqing Jin et al. 2003. Dynamically Maintaining Frequent Items over a Data Stream. In the proceedings of CIKM 2003, USA.
[5] Nan Jiang and Le Gruenwald. 2006. Research Issues in Data Stream Association Rule Mining. SIGMOD Record, Vol. 35, No. 1, Mar. 2006.
[6] Sudipto Guha, D. Gunopulos, N. Koudas. 2003. Correlating Synchronous and Asynchronous Data Streams. In the proceedings of SIGKDD 2003, August 24-27, 2003, USA.
[7] Yu-Bao Liu et al. 2008. Clustering Text Data Streams. Journal of Computer Science and Technology, Volume 23, Issue 1, 112-128, 2008.
[8] Dou Shen, Qiang Yang, Jian-Tao Sun, Zheng Chen. 2003. Thread Detection in Dynamic Text Message Streams. In the proceedings of SIGIR, August 6-11, 2003, USA.
[9] Jun Yan et al. 2006. A Scalable Supervised Algorithm for Dimensionality Reduction on Streaming Data. Information Sciences, An International Journal, Elsevier, Volume 176, 2042-65, 2006.
[10] L. Rutkowski et al. 2013. Decision Trees for Mining Data Streams Based on the McDiarmid's Bound. IEEE Transactions on Knowledge and Data Engineering, Volume 25(6), 2013.
[11] Jun Yan et al. 2006. Effective and Efficient Dimensionality Reduction for Large-Scale and Streaming Data Preprocessing. IEEE Transactions on Knowledge and Data Engineering, Volume 18(2), 2006.
[12] Graham Cormode et al. 2003. Comparing Data Streams Using Hamming Norms (How to Zero In). IEEE Transactions on Knowledge and Data Engineering, Volume 15(3), 2003.
[13] Chen Ling, Zou Ling-Jun, Tu Li. 2012. Clustering Algorithm for Multiple Data Streams Based on Spectral Component Similarity. Information Sciences, An International Journal, Elsevier, Volume 183, 35-47, 2012.
[14] Panagiotis Antonellis, Christos Makris, Nikos Tsirakis. 2009. Algorithms for Clustering Click Stream Data. Information Processing Letters 109, 381–385, 2009, Elsevier.
[15] Chang-Dong Wang, Dong Huang. 2013. A Support Vector Based Algorithm for Clustering Data Streams. IEEE Transactions on Knowledge and Data Engineering, Volume 25, Issue 6, 2013.
[16] Shi Zhong. 2005. Efficient Streaming Text Clustering. Neural Networks, Volume 18, 2005, 790–798, Elsevier.
[17] Pedro Pereira Rodrigues, Joao Gama, Joao Pedro Pedroso. 2008. Hierarchical Clustering of Time Series Data Streams. IEEE Transactions on Knowledge and Data Engineering, Volume 20, Issue 5, 2008.
[18] Vaneet Aggarwal, Shankar Krishnan. 2012. Achieving Approximate Soft Clustering in Data Streams, 2012.
[19] Haiyan Zhou, Xiaolin Bai, Jinsong Shan. 2011. A Rough-Set-based Clustering Algorithm for Multi-stream. Procedia Engineering 15 (2011) 1854-58.
[20] Mohamed Medhat Gaber. 2012. Advances in Data Stream Mining. WIREs Data Mining and Knowledge Discovery, Volume 2, 79–85, 2012. doi: 10.1002/widm.52.
[21] Vangipuram Radhakrishna, C. Srinivas, C. V. Guru Rao. 2013. Document Clustering Using Hybrid XOR Similarity Function for Efficient Software Component Reuse. ITQM 2013: 121-128.
[22] M.S.B. Phridviraj, C.V. GuruRao. Data Mining – Past, Present and Future – A Typical Survey on Data Streams. The 7th International Conference Interdisciplinarity in Engineering, INTER-ENG 2013, 10-11 October 2013, Petru Maior University of Tirgu Mures, Romania.
[23] M.S.B. Phridviraj et al. Clustering Text Data Streams – A Tree Based Approach with Ternary Function and Ternary Feature Vector. ITQM 2014.
[24] Vangipuram Radhakrishna, C. Srinivas, C. V. Guru Rao. Clustering and Classification of Software Component for Efficient Component Retrieval and Building Component Reuse Libraries. ITQM 2014.