An Efficient and Scalable Approach for Ontology Instance Matching

JOURNAL OF COMPUTERS, VOL. 9, NO. 8, AUGUST 2014


An Efficient and Scalable Approach for Ontology Instance Matching

Rudra Pratap Deb Nath (a), Hanif Seddiqui (a, c), Masaki Aono (b)

(a) University of Chittagong, Chittagong-4331, Bangladesh

(b) Toyohashi University of Technology, Toyohashi, Aichi, 441-8580, Japan

(c) Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, Bangladesh

Abstract—Ontology instance matching is a key interoperability enabler across heterogeneous data resources in the Semantic Web for integrating data semantically. Although most research has emphasized schema-level matching so far, research on ontology matching is shifting from the ontology schema, or concept level, to the instance level to fulfill the vision of the "Web of Data". Ontology instances define data semantically and are kept in a knowledge base. Since heterogeneous sources of massive ontology instances grow sharply day by day, scalability has become a major research concern in ontology instance matching of semantic knowledge bases. In this study, we propose a method that filters the instances of a knowledge base in two stages to address the scalability issue. The first stage groups the instances based on the relations of concepts, and the second stage further filters the instances based on the properties associated with them. Our instance matcher then works by comparing an instance within a classification group of one knowledge base only against the instances of the corresponding sub-group of the other knowledge base to achieve interoperability. We evaluate our proposed method on several benchmark data sets, namely OAEI-2009, OAEI-2010 and OAEI-2011. In comparison with other baseline methods, our proposed method shows satisfactory results.

Index Terms—Ontology Instance Matching, Record Linkage, Knowledge Base Integration, Ontology Alignment, Ontology Population, Identity Recognition, Linked Data, Anchor-flood Algorithm.

I. INTRODUCTION

Ontology, which can be defined as an "explicit formal specification of a shared conceptualization" [2], has become the backbone enabling the fulfillment of the Semantic Web vision [6]. Nowadays, ontology alignment is taken as a key technology to solve interoperability problems across heterogeneous data sources. It takes ontologies as input and determines as output an alignment, that is, a set of correspondences between the semantically related entities of those ontologies. These correspondences can be used for various tasks, such as ontology merging, data translation, query answering or navigation on the web of data. Thus, matching ontologies enables the knowledge and data expressed in the matched ontologies to interoperate. However, the success of the Semantic Web vision [1] depends on the availability of

Manuscript received June 8, 2013; revised February 18, 2014; accepted April 14, 2014. Corresponding author - Rudra Pratap Deb Nath, Email: [email protected]

© 2014 ACADEMY PUBLISHER doi:10.4304/jcp.9.8.1755-1768

semantic linked data. Semantic linked data is all about connected data, i.e. people, places and things that are connected to each other; each individual is known as an instance [8]. The linked data paradigm is a pragmatic approach for the transition of the current document-oriented web into a web of interlinked data. It describes a recommended best practice for publishing, sharing and connecting structured data on the web. However, linking on the web of data is a more generic and thus formidable task, as it aims not only at establishing links between the ontologies underlying two data sources but also at discovering links between the instances contained in those data sources [14]. Hence, with the development of linked data and various social network websites, a huge amount of semantic data is published on the web, which not only imposes new technology challenges on traditional schema-level ontology alignment algorithms, but also demands new techniques for instance matching [7].

Ontology instance matching (OIM) compares different individuals within the same or heterogeneous ontologies with the goal of identifying the same real-world objects. It also describes the degree of semantic relationship among them. The OIM problem has been widely investigated in several application domains, where it is known by different names, such as identity recognition, record linkage and entity resolution, according to the requirements that need to be satisfied. It is important to note that ontology and instance matching are similar to the ideas of database schema matching and record linkage in the database research domain. Instance matching plays a crucial role in semantic data integration, as it interconnects the islands of instances of the semantic world to address the interoperability and information integration issues.
OIM is equally important in ontology population as it helps to correctly perform the insertion and update operation and to discover relationship between the new incoming instance and the set of instance already stored in the ontology [10]. Most of the research has been done so far basically based on schema level matching while in few systems [9,11,15,18], information of instances is partially considered to support ontology schema matching. However, to cope with the demand for enabling Semantic Web technology practically, ontology instance matching is more important than schema matching. For this reason, in recent years, the research work on ontology matching is gradually shifting from the level of concepts to the level of instances [17]. Nonetheless, information of schema is equally important in the alignment of individuals that are sharing the same ontology structure.


An instance can be described by the ontology concepts, properties and the values associated with it. However, not every property contributes equally to identifying an instance uniquely; some properties have more influence on instance identification than others. For example, in spite of having different values for the property "bankAccount", two individuals of type "Employee" are the same if they have a common value for the property "social_id". So, automatically assigning a weight to each property is one of the most important tasks in OIM. Currently, a large number of ontology instances are available in semantic knowledge bases: AllegroGraph [25] contains more than one trillion triples (a triple, the basic building block of the Semantic Web, is formed as <subject> <predicate> <object>), Linked Open Data [26] contains more than fifty billion triples, and there are other knowledge bases as well, such as DBpedia [27], DBLP [28] and so on. Moreover, several groups are working to create billions of further triples representing ontology instances for the Semantic Web, which also raises the challenge of scalability in the instance matching task. Most instance matching research has been done by comparing each instance belonging to one ontology with each instance belonging to another ontology to identify instances that refer to the same real-world object. This kind of brute-force approach suffers from high memory and time consumption. Therefore, another big challenge in OIM is to accurately choose the subset of instances that are most likely to be similar to the input instance, avoiding comparisons with impertinent instances. In this study, we make an effort to improve the performance of our state-of-the-art [5,8,12] system by introducing a new technique for automatic property weight generation and incorporating it into the matching formula. Moreover, a simple and novel approach is proposed for resolving the scalability issue.
This paper elaborates the approach proposed in [24]. As the detection of mappings at the schema level directly affects instance-level matching, in this research ontology schema matching and instance matching work together to discover semantic mappings between possibly distributed and heterogeneous semantic data. An efficient and scalable ontology schema matching is described in [4]. Our system uses the Anchor-flood (Aflood) algorithm to obtain the aligned concepts and properties. According to the information of the instances of a concept, we automatically assign a weight factor, a frequency factor and a category factor to each of the properties of that concept. The weight factor reflects how many unique values a property contains. The frequency factor indicates how many instances contain values for the particular property. The category factor describes the suitability of a property for categorizing the instances of a concept. To address the scalability issue, we classify the instances of an ontology in two steps. Firstly, using the disjoint and other structural relations of the concepts in the ontology hierarchy, we produce concept-level clusters by grouping similar types of concepts and eliminating disjoint concepts; the instances of similar clusters across ontologies are then compared concept-wise. Secondly, to compare the instances of two concepts across ontologies, our system uses a novel approach to select some prominent


semantically common properties, and according to the values of those properties the instances of both concepts are partitioned. Then, our instance matcher works by comparing an instance within a classification group of one knowledge base against the instances of the corresponding sub-group of the other knowledge base.

The rest of the paper is organized as follows. Section 2 describes terminology frequently used throughout the paper. Our instance matching approach with the automatic property weight factor is narrated in section 3. An efficient approach to address the scalability issue in instance matching is depicted in section 4. An elaboration with an example is presented in section 5. Section 6 outlines the experiment and evaluation. Several related research works are concisely articulated in section 7. Final remarks and further scopes of improvement are discussed in section 8.

II. GENERAL TERMINOLOGY

This section introduces the basic definitions for familiarizing the reader with the notations and terminologies used throughout the paper.

A. Ontology

An ontology is the basic element of the Semantic Web and the semantic knowledge base. According to M. Ehrig [3], an ontology contains a core ontology, logical mappings, a knowledge base, and a lexicon. Furthermore, a core ontology is defined as a tuple of five sets: concepts, concept hierarchy or taxonomy, properties, property hierarchy, and a concept-to-property function:

    S := (C, ≤_C, R, σ, ≤_R)

consisting of two disjoint sets C and R whose elements are called concepts and relations, two partial orders, ≤_C on C called the concept hierarchy or taxonomy and ≤_R on R called the relation hierarchy, and a function σ: R → C × C called the signature of a binary relation, where σ(r) = (dom(r), ran(r)) with r ∈ R, domain dom(r), and range ran(r). Instances are the "things" represented by a concept. The ontology schema is often called the TBox.

B.
Semantic Knowledge Base

A semantic knowledge base is also referred to as an ABox, and it contains TBox-compliant statements about individuals belonging to the concepts. The semantic knowledge base is a structure

    KB := (C, R, I, ι_C, ι_R)

consisting of two disjoint sets C and R as defined before, a set I whose elements are called instances, and two functions ι_C and ι_R called concept instantiation and relation instantiation, respectively [3].

C. Ontology Alignment at the Schema Level

An alignment A is defined as a set of correspondences, quadruples <e, f, r, l>, where e and f are the two aligned entities across ontologies, r represents the relation holding between them, and l represents the level of confidence [0, 1], if present, in the alignment statement. The relation r is a simple (one-to-one equivalence) relation or a complex (subsumption or


one-to-many) relation [3]. The correspondence between e and f is called an aligned pair throughout the paper. An alignment is obtained by measuring similarity values between pairs of entities [4].
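As a concrete, purely illustrative reading of the definitions above, the core ontology tuple S := (C, ≤_C, R, σ, ≤_R) might be sketched as follows; the class and attribute names are assumptions for the example, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class CoreOntology:
    """Minimal sketch of S := (C, <=_C, R, sigma, <=_R)."""
    concepts: set = field(default_factory=set)        # C
    concept_order: set = field(default_factory=set)   # <=_C as (sub, super) pairs
    relations: set = field(default_factory=set)       # R
    signature: dict = field(default_factory=dict)     # sigma: r -> (dom(r), ran(r))
    relation_order: set = field(default_factory=set)  # <=_R as (sub, super) pairs

    def is_subconcept(self, c1, c2):
        # direct check against the declared taxonomy pairs
        return (c1, c2) in self.concept_order

onto = CoreOntology()
onto.concepts |= {"Person", "Employee", "Organization"}
onto.concept_order.add(("Employee", "Person"))
onto.relations.add("worksFor")
onto.signature["worksFor"] = ("Employee", "Organization")
print(onto.is_subconcept("Employee", "Person"))  # True
```

A knowledge base KB := (C, R, I, ι_C, ι_R) would extend this with an instance set and the two instantiation functions.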


D. Anchor-flood Algorithm

Unless schema entities, i.e. concepts and properties, are aligned across ontologies, instance matching is often not achievable. Therefore, we use our scalable and efficient ontology alignment algorithm, called Anchor-flood, to obtain an alignment (a set of aligned pairs of schema entities, such as concepts and properties) between an ontology pair [4,5,17]. Our scalable ontology alignment algorithm starts off at a seed point called an anchor (a pair of "look-alike" concepts, one from each of the two ontologies). Starting from an anchor, the algorithm collects two sets of neighboring concepts across the ontologies. It then computes the structural and terminological similarity among the collected concepts and produces a list of aligned pairs. The collected concept pairs are in turn considered as further seed points, or anchors, and the operation cycle is repeated for each of the newly found aligned concept pairs. The cycle stops when no new concept pair is left to be considered as an anchor [4,8].

E. Ontology Instance Matching

Informally, ontology instance matching (OIM), as defined before, compares different individuals within the same or heterogeneous ontologies with the goal of identifying the same real-world objects. Given two sets S (source) and T (target) of instances within the same ontology or across different ontologies, a semantic similarity measure σ: S × T → [0, 1] and a threshold δ ∈ [0, 1], the goal of the instance matching task is to compute the set M = {(s, t) | σ(s, t) ≥ δ} [14]. The instances of S are compared to the instances of T, and if their affinity is greater than the threshold value, the instance pair is considered a member of the set M; M contains all aligned instance pairs.

F. Semantic Linked Cloud (SLC)

In an ontology, neither a concept nor an instance carries its full specification in its name or URI (Uniform Resource Identifier) alone. Therefore, we consider the semantically linked information, which includes concepts, properties, their values and other instances as well. Together they form an information cloud that specifies the meaning of a particular instance. The degree of certainty is proportional to the number of semantic links associated with the instance by means of property values and other instances. The SLC is defined below:

A Semantic Link Cloud (SLC) of an instance is defined as the part of a knowledge base [3] that includes all linked concepts, properties and their instantiations which are related to specifying the instance sufficiently.

An example of a Semantic Link Cloud is shown in Fig. 1. If we see the word "Tiger" alone, it may be thought of as an animal. However, if we observe the surrounding information of "Tiger", we can conclude that "Tiger" is the nickname of the famous golfer Woods, whose birthdate is 30th Dec, 1995. Thus, after observing the SLC, we get the semantic understanding of the instance. In Fig. 1, a square represents an instance, an oval denotes a concept, and the label of an arc indicates a property. Black indicates an object-type property, whereas light yellow indicates a data-type property.

Figure 1: Semantic Link Cloud for the instance Tiger
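The SLC gathering described above can be sketched over a toy triple set; `build_slc` and the one-hop expansion rule are illustrative assumptions, not the paper's implementation.

```python
def build_slc(instance, triples):
    """Sketch of a Semantic Link Cloud over (subject, predicate, object) triples:
    take every triple touching the instance, then one extra hop of statements
    about the directly linked resources."""
    slc = {t for t in triples if t[0] == instance or t[2] == instance}
    linked = {t[0] for t in slc} | {t[2] for t in slc}
    linked.discard(instance)
    slc |= {t for t in triples if t[0] in linked}
    return slc

# toy data mirroring the Tiger example of Fig. 1
triples = [
    ("Woods", "nickname", "Tiger"),
    ("Woods", "rdf:type", "Golfer"),
    ("Woods", "birthDate", "30 Dec 1995"),
]
print(sorted(build_slc("Tiger", triples)))
# [('Woods', 'birthDate', '30 Dec 1995'), ('Woods', 'nickname', 'Tiger'),
#  ('Woods', 'rdf:type', 'Golfer')]
```

Starting from the bare label "Tiger", the cloud pulls in the type and birthdate of the linked instance Woods, which is exactly the disambiguating context the example relies on.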

III. INSTANCE MATCHING APPROACH

Several challenges arise in the instance matching process, namely value transformation, structural heterogeneity and logical heterogeneity [15].

Value transformation: This concerns variation in lexical descriptions. For instance, a person's name can be described differently across nations, and even citations of different publications and dates are presented in different formats.

Structural heterogeneity: Property-level heterogeneity is known as structural heterogeneity. An instance's lexical information can be associated with a property either directly as a sequence of characters (with a data-type property) or via other instances (in the case of an object property), imposing different levels of difficulty on instance matching. The separation of a single property into several properties, like full-name represented by first-name and last-name, introduces an extra level of difficulty. Additionally, multiple values for a single property and missing property values across knowledge bases induce heterogeneity, so that the same real-world instances are represented differently.

Logical heterogeneity: This deals with variations at the ontology concept level for a particular instance. Identical instances can be instantiated into different sub-classes of the same class, or into more general classes, without altering the meaning. On the other hand, instances defined by disjoint classes may have different meanings even if their descriptions are similar.

In OIM, these kinds of heterogeneities should be addressed, and our state-of-the-art instance matcher [5,8] handles them efficiently. As schema-level matching directly influences instance-level matching, our schema matching system and instance matcher work together. For schema-level matching (i.e. matching in terms of concepts and properties), the Anchor-flood algorithm (described in 2(D)) is used.


Our data cleansing process of normalization and error correction reduces inconsistencies caused by typographical errors. Moreover, we use regular expressions, the GeoNames Web service and personal data to resolve format heterogeneity. Instances contain lexical information as values of properties. The range of a data-type property is directly a literal value, while the range of an object property is another instance, which may in turn contain object properties representing its values. To cope with structural heterogeneity, we consolidate the property types by iteratively collecting all the values reached through the primary object-type property of the starting instance and converting them into data-type property values. Other structural heterogeneity is handled by the SLC concept (described in sub-section 2(F)). However, resolving the different kinds of heterogeneity alone does not yield better results, because not all properties should have an equal impact on identifying an instance unambiguously. So, our first effort is to improve our state-of-the-art instance matcher by automatically assigning a weight to each property.

A. Property Weight Factor

An ontology can be defined as the set of concepts of a domain and the relations among those concepts, i.e. it describes the domain at the concept layer. An instance, on the contrary, is the instantiation of a concept and contains concrete values for the properties of its concept. Hence, instances are rich in practical semantic information. Moreover, some property values play a more active role in an instance's univocal identification than others. For example, properties like name, email_address and birth_date have more influence than properties like hasAge or salary. Furthermore, in spite of having different values for the property hasAge, two instances are the same if they have a common value for the property email_address, as the data might have been captured at different times.
Hence, automatically assigning more relevance to those properties that are more significant for individual identification is one of the key challenges in the instance matching process. Although we have proposed several techniques for measuring property weight, further improvement is necessary: for example, in [8], the property weight factor decreases as more instances contain values for a property, even though it considers the uniqueness of the property. Here, we propose an effective method to impose a weight factor on each property of a concept by analyzing the property values of the instances of that concept. Our basic hypothesis of weight factor measurement is that a property for which the instances contain more unique or distinct values has higher identification capability, i.e. "the more the uniqueness, the more the weight factor". Conversely, if several instances have common values for a property, the property loses its ability to identify an instance unambiguously, as its uniqueness subsides. We define the weight factor of each property of an ontology concept C used in a knowledge base as follows:

    weightfactor(W_p) = log|i ∋ p_u| / log|i|    (1)

where W_p is the property weight, i represents an instance, |i ∋ p_u| is the number of instances that each contain a unique value for property p, and |i| is the number of instances of the concept C.

B. SLC Generation

According to the definition of a Semantic Link Cloud (SLC), the collection of the semantically linked resources of the ABox, along with the concepts and properties of the TBox, specifies an instance in sufficient depth to identify it even at a different location or with a quite different label. Therefore, our proposed method collects all the linked information from a particular instance as a reference point. The linked information is defined as the concepts, properties and values which have a direct relation to the reference instance [8, 12].

C. Instance Matching Algorithm (IMA)

Suppose two instances i1 and i2 are given. The instance affinity function IA(i1, i2) → [0, 1] provides a value in the range [0, 1]. IA is calculated by comparing the elements of the SLCs of both instances; here, we take the weight factors of the properties into account. Instances contain their lexical information as values of properties, and a string metric [13] is used for measuring the similarity between two strings. Let s_p and s_q be two strings. The string-based similarity between them is measured as follows:

    Sim(s_p, s_q) = comm(s_p, s_q) − diff(s_p, s_q) + winkler(s_p, s_q)    (2)

where comm(s_p, s_q) indicates the commonality between s_p and s_q, diff(s_p, s_q) stands for the difference, and winkler(s_p, s_q) for the improvement of the result using the method introduced by Winkler [13]. Two property values s_p and s_q are considered equal if their string similarity is greater than or equal to a predefined threshold. Therefore, we define an equality function E(p, q) as follows:

    E(p, q) = 1 if Sim(s_p, s_q) ≥ δ_1, and 0 otherwise    (3)

Let two instances ins1 and ins2 be represented by slc1 and slc2, respectively. The affinity between ins1 and ins2 is then measured as the affinity between the two SLCs, slc1 and slc2. We define the affinity between two SLCs, taking into account the weight factor automatically assigned to each property, as follows:

    IA(slc_1, slc_2) = Σ_{p∈slc_1, q∈slc_2} E(p, q)·(W_p + W_q) / (Σ_{p∈slc_1} W_p + Σ_{q∈slc_2} W_q)    (4)
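A minimal, runnable sketch of Eqs. (1), (3) and (4) follows. The Winkler-based string similarity of Eq. (2) is replaced by a stand-in exact-match test, Eq. (4) is simplified to compare only identically named properties, and a single shared weight table is assumed for both knowledge bases; the helper names are illustrative, not the paper's implementation.

```python
import math

def weight_factor(instances, prop):
    """Eq. (1): log-count of instances holding a value for prop that no other
    instance shares, over the log of the total instance count."""
    values = [ins[prop] for ins in instances if prop in ins]
    unique = [v for v in values if values.count(v) == 1]
    if len(instances) <= 1 or not unique:
        return 0.0
    return math.log(len(unique)) / math.log(len(instances))

def equal(vp, vq, delta1=1.0):
    """Eq. (3): thresholded equality; exact match stands in for Sim of Eq. (2)."""
    sim = 1.0 if vp == vq else 0.0
    return 1 if sim >= delta1 else 0

def instance_affinity(slc1, slc2, weights):
    """Eq. (4), restricted to shared property names; slc1/slc2 map property ->
    value, weights maps property -> W_p (assumed shared across both sides)."""
    matched = sum(equal(v1, slc2[p]) * (weights.get(p, 0) * 2)
                  for p, v1 in slc1.items() if p in slc2)
    total = (sum(weights.get(p, 0) for p in slc1)
             + sum(weights.get(p, 0) for p in slc2))
    return matched / total if total else 0.0

people = [{"social_id": "a1", "hasAge": 30},
          {"social_id": "b2", "hasAge": 30},
          {"social_id": "c3", "hasAge": 41}]
w = {p: weight_factor(people, p) for p in ("social_id", "hasAge")}
print(w["social_id"])  # 1.0: every social_id value is unique

a = {"social_id": "a1", "hasAge": 30}
b = {"social_id": "a1", "hasAge": 35}
print(instance_affinity(a, b, w))  # 1.0: the hasAge mismatch carries zero weight
```

The example reproduces the paper's intuition: the disagreement on hasAge does not prevent a match, because the repeated-value property earns (near-)zero weight while the unique social_id dominates.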

Algorithm 1: instanceMatch(ABox ab1, ABox ab2, Alignment A)
Input: the ABoxes contain the instance information; the Alignment provides the concept- and property-level alignment
Output: a set of aligned instance pairs
1. for each ins_i ∈ ab1
2.     slc_i = generateSLC(ins_i, ab1)
3.     for each ins_j ∈ ab2
4.         slc_j = generateSLC(ins_j, ab2)
5.         if IA(slc_i, slc_j) ≥ δ_2
6.             imatch = imatch ∪ makeAlign(ins_i, ins_j)
7. return imatch


Algorithm 1 describes a simple flow of the matching algorithm, where an ABox denotes a knowledge base. The SLC of each instance of one knowledge base is matched against the SLCs of all instances of the other knowledge base (lines 1~4). The function generateSLC(ins, ab) collects an SLC for an instance ins in ABox ab; an SLC usually contains related concepts, properties, and their consolidated values. Every value of an SLC is compared with those of the other SLC, and once the similarity value exceeds the threshold, the pair is collected as an aligned pair (lines 5~6 in Algorithm 1). Finally, the algorithm produces a list of matched instance pairs.

IV. SCALABILITY ISSUE IN INSTANCE MATCHING

Since the size of knowledge bases is increasing massively, and instance matching across heterogeneous, gigantic knowledge sources suffers from high computational complexity, optimization for scalability is increasingly becoming an indispensable feature of instance matching. The goal of scalability is to accurately select a subset of instances that are most likely to be similar to an input instance, avoiding comparing the input instance against all the instances of the ontology [17]. To address the scalability issue, a novel approach is presented here. The instances of an ontology are partitioned in two steps. Firstly, using the taxonomical and disjoint relations of the concepts in the ontology hierarchy, we produce concept-level clusters by aggregating similar types of concepts and avoiding disjoint concepts. Then, the schema-level matching algorithm is used to obtain a concept-level cluster alignment, and only the instances of relevant concepts of aligned clusters are compared across ontologies. Hence, our first step reduces the number of comparisons significantly, as comparisons of instances across impertinent concepts of the knowledge bases are no longer necessary.
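The brute-force flow of Algorithm 1 can be sketched as an executable nested loop; here `generate_slc` and `affinity` are toy stand-ins (a label-token set and Jaccard overlap), not the paper's generateSLC and IA functions.

```python
def instance_match(abox1, abox2, generate_slc, affinity, delta2=0.7):
    """Algorithm 1 sketch: compare every SLC of abox1 against every SLC of
    abox2 and keep pairs whose affinity reaches the threshold delta_2."""
    matches = []
    for ins_i in abox1:
        slc_i = generate_slc(ins_i, abox1)
        for ins_j in abox2:
            slc_j = generate_slc(ins_j, abox2)
            if affinity(slc_i, slc_j) >= delta2:
                matches.append((ins_i, ins_j))
    return matches

# toy stand-ins: an "SLC" is just the set of tokens in the instance label
gen = lambda ins, ab: set(ins.split())
aff = lambda s1, s2: len(s1 & s2) / len(s1 | s2)  # Jaccard overlap

print(instance_match(["tiger woods", "jack doe"],
                     ["woods tiger", "jane roe"], gen, aff))
# [('tiger woods', 'woods tiger')]
```

The quadratic loop makes the scalability problem of Section IV concrete: the work grows with |abox1| × |abox2|, which is exactly what the two-stage filtering below avoids.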
Secondly, to compare the instances of two concepts across ontologies, our system does not naively match each instance belonging to one concept with each instance belonging to the other. Although concept-level cluster-wise filtering trims the number of comparisons significantly, scalability is not yet resolved, as the number of instances per concept also grows with the size of the knowledge base. For example, even an older DBLP contains 400,000 authors, while there are 198,056 persons in DBpedia; a brute-force algorithm would compare every author of DBLP with each person of DBpedia, requiring 400,000 × 198,056 SLC comparisons. So, further classification is imperative. To achieve further scalability, our system uses a heuristic approach to select some prominent properties by analyzing the information of the instances, and according to the values of those properties the instances of both concepts are grouped. Then, our instance matcher, which considers the semantic



specification of the properties associated with instances, and the property weight factors, in its matching strategy, works by comparing an instance within a classification group of one knowledge base against the instances of the corresponding sub-group of the other knowledge base.

Fig. 2 illustrates the different components of our scalable instance matcher and their interrelations. From the given ontology hierarchy, we obtain the taxonomical relationships (e.g. subclass, superclass, symmetric, asymmetric, transitive, reflexive, etc.) and the disjoint relationships among the concepts of that ontology. Using those relationships, several concept-level clusters are made: for each leaf concept of the ontology, a concept-cluster is formed by including its siblings and ancestors and excluding the disjoint neighbors. We use the scalable and efficient Anchor-flood algorithm to obtain the concept-cluster alignment across ontologies. Now, only the instances of aligned clusters are matched; hence, a substantial number of comparisons are trimmed. The Primary filter takes its input from the ABoxes and the schema-level alignment and outputs the concept-cluster-wise instances. As the number of instances in the concept-clusters may also be huge, we propose a further classification technique based on analyzing the information of the instances contained in both concepts. The next component determines how the instances of two concepts across ontologies are compared. According to the information of their instances, the Factor Analyzer imposes a category factor to select the category property set. Then, our Secondary filter further classifies the instances of both concepts according to the values of the properties of the category property set; the output of this block is the aligned sub-groups. The SLC generator defines instances with their relevant information, and the Automatic Weight Generator assigns weights to the properties associated with the instances.
Finally, our Instance matcher compares only the instances of the same category across the different knowledge bases (in Fig. 2, the same color indicates the same category) to identify the same real-world objects. The following sub-sections describe the strategy in detail.

A. Property Frequency Factor

A property frequency factor is assigned to each property of a concept. The property frequency factor describes how many instances contain values for that property; thus, a property has a higher frequency factor when more instances contain values for it. For example, the name property of a person concept may have a high frequency factor, as every person must have a name. We define the frequency factor of a particular property of a concept C by the following equation:

    frequencyfactor(F_p) = log|i ∋ p| / log|i|    (5)

where F_p is the property frequency, i represents an instance, |i ∋ p| is the number of instances that contain a value for property p, and |i| is the number of instances of the concept C.
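The frequency factor of Eq. (5) can be sketched directly; the helper name and toy data are illustrative assumptions.

```python
import math

def frequency_factor(instances, prop):
    """Eq. (5): log-count of instances carrying any value for prop, over the
    log of the total instance count."""
    with_value = sum(1 for ins in instances if prop in ins)
    if len(instances) <= 1 or with_value == 0:
        return 0.0
    return math.log(with_value) / math.log(len(instances))

people = [{"name": "x"}, {"name": "y"}, {"name": "z", "salary": 5}]
print(frequency_factor(people, "name"))    # 1.0: every instance has a name
print(frequency_factor(people, "salary"))  # 0.0: log(1)/log(3)
```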


Figure 2. Components of scalable instance matching system

B. Property Category Factor

We assign a category factor to each property of a concept. It defines how appropriate a particular property is for grouping the instances of the concept. We observe that the category factor of a property depends on two quantities: first, the number of instances containing a distinct value for that property, and second, the number of instances containing that property at all. We define the category factor so that properties which have few distinct values, but for which many instances contain values, become better candidates for instance categorization. Equation (6) defines the property category factor as the ratio of the property weight factor to the property frequency factor:

    Category factor: C_p = W_p / F_p    (6)

A lower category factor indicates that a property is a better candidate for instance categorization.

C. Selection of Properties for Grouping Instances

Here we present our approach for further classifying the instances of a concept into several smaller sub-groups, so that only relevant sub-groups from the different knowledge bases are matched and irrelevant sub-groups are pruned. By the definition of the weight factor of a property (1), the property with the lowest weight factor contains the most repeated values across the instances. However, the lowest weight factor alone does not indicate whether most of the instances contain values for that particular property or not. The frequency factor of a property (5), on the other hand, captures exactly this information, i.e. the number of instances containing a value for the property. By combining the weight and frequency factors, we


define category factor. We want to select those properties which have more repeated values but high frequency i.e. lower weight factor and higher frequency factor because lower weight factor confirms that several instances share common values and higher frequency factor gives us the trust of more instances association with that property. However, for keeping the value in the range of 0 and 1, our category factor is defined inversely i.e. ratio of weight factor and frequency factor. So, lower value of category factor of a property indicate better candidate for property selection. Now, a threshold value is set to select the candidate properties from the available properties of the concept. That is, a property will be a member of a Candidate Property Set Sp if and only if its category value less than the assigned threshold. Threshold value may be different for different concepts. We also try to set it automatically rather than manually. Since we get the aligned concepts with their aligned properties pair from our schema matching algorithm, we consider the information of both aligned concepts to categorize instances of both concepts. As both aligned concepts provide different Candidate Property Sets by analyzing their own instance sets, the final set of property, by which instances of both aligned concepts are categorized, will be the semantically common properties of the Candidate Property Sets of both aligned concepts that are semantically similar. We denote the final set of property as categorized property set S. Suppose, from schema matching algorithm, it is found that concept C1 and concept C2 are aligned and the set of candidate property of C1 and C2 are SpC1 and SpC2 respectively. Therefore, the set of categorized property S, by which instances of C1 and C2 will be categorized, is the intersection of SpC1 and SpC2. i.e. ܵ ൌ ܵ௣஼భ  ‫ܵ ת‬௣஼మ

(7)
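The selection step above can be sketched as follows. This is our illustrative reading (the function and variable names are ours, not the paper's): WF is taken as |D|/|F| and FF as |F|/N, which reproduces the Cp values of the worked example in Section V; the cap at 1.0 for sparse properties is our assumption about how Cp is kept in [0, 1].

```python
def category_factor(values, num_instances):
    """Cp = WF / FF for one property (our reading of the definitions).

    WF (weight factor): distinct values / values present -- lower means
    more repetition, hence a better grouping key.
    FF (frequency factor): values present / total instances.
    """
    present = [v for v in values if v is not None]
    if not present:
        return 1.0  # property never used: worst possible score
    wf = len(set(present)) / len(present)
    ff = len(present) / num_instances
    return min(wf / ff, 1.0)  # capped at 1.0 (assumed handling of sparse properties)

def candidate_properties(instances, properties):
    """Sp: properties whose Cp falls below the automatic threshold
    delta = (max(Cp) + min(Cp)) / 2, as in the Section V example."""
    cp = {p: category_factor([i.get(p) for i in instances], len(instances))
          for p in properties}
    delta = (max(cp.values()) + min(cp.values())) / 2
    return {p for p, v in cp.items() if v < delta}

def categorized_property_set(sp_c1, sp_c2):
    # Final set S of Eq. (7): properties common to both aligned concepts.
    return sp_c1 & sp_c2
```

On a toy concept whose identifier is unique per instance but whose type repeats, only the repeating property survives the threshold, mirroring how Book_id is rejected while Book_Type is kept in the example.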

JOURNAL OF COMPUTERS, VOL. 9, NO. 8, AUGUST 2014

For example, if the Candidate Property Sets of two aligned concepts Author and Employee, each from a different ontology, are {country, department, position, salary} and {location, category, department, birth_year} respectively, then the instances of Author and Employee will be partitioned according to the values of {country, department}. Here, country and location are semantically similar though syntactically mismatched. Furthermore, instances with NULL values for the categorized properties are put into a separate cluster for better performance. In addition, we eliminate frequently updatable properties from the categorized property set so that instance categorization becomes more reliable. Only instances of the same category are then compared across the knowledge bases.

D. Scalable Ontology Instance Matching (SOIM) Algorithm

Algorithm 2 describes a simple flow of our proposed scalableInstanceMatch algorithm. The algorithm takes two knowledge bases and the ontology schema alignment pairs as parameters. Here, ABox denotes a knowledge base and A denotes the alignment pairs. From schema-level matching we get concept-cluster alignment pairs, where b indicates a concept-cluster.

Algorithm 2: scalableInstanceMatch(O1, O2, ABox ab1, ABox ab2, Alignment A)
Input: ontology O1 and its knowledge base ab1; ontology O2 and its knowledge base ab2; A provides the concept- and property-level alignment
Output: a set of aligned instance pairs
1.  for each (b1, b2) ∈ A        /* b1 and b2 are concept-clusters from O1 and O2 respectively */
2.    for each (c1, c2)          /* c1 is a concept of (O1 ∧ b1) and c2 is a concept of (O2 ∧ b2) */
3.      if (categoryPropertySet S = null && (P1 ∩ P2) != null)
            /* S is the set of properties by whose values the instances of the two concepts
               are categorized; P1 and P2 are the property sets of c1 and c2 respectively */
4.        IM = IM ∪ instanceMatch(c1, c2, A)    /* IM is a set of aligned instance pairs */
5.      elseif (categoryPropertySet S = null && (P1 ∩ P2) = null) continue
6.      else
7.        iCluster s1 = classifyInstances(c1, ab1, categoryPropertySet S)
            /* iCluster is a set of instance-clusters; each instance-cluster contains
               the instances of one category */
8.        iCluster s2 = classifyInstances(c2, ab2, categoryPropertySet S)
9.        for each category cid
10.         if (iClusterm ∈ iCluster s1 ∧ iClustern ∈ iCluster s2) ∈ cid
              /* the two instance-clusters from the different ontologies belong to the same
                 category; iClusterm and iClustern are members of s1 and s2 respectively */
11.           IM = IM ∪ instanceMatch(iClusterm, iClustern, A)
12. return IM
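The control flow of Algorithm 2 can be paraphrased in executable form as below. This is a sketch, not the authors' implementation: `scalable_instance_match`, `classify_by_category`, and the alignment tuple layout are our stand-ins for the paper's subroutines and data structures, and the `"__special__"` catch-all key is our rendering of the separate NULL-value cluster mentioned above.

```python
from collections import defaultdict

def classify_by_category(instances, S):
    """Group instances by their values for the categorized property set S.
    Instances missing a value for any property in S share one special
    cluster, compared only against the other side's special cluster."""
    groups = defaultdict(list)
    for inst in instances:
        key = tuple(inst.get(p) for p in sorted(S))
        if any(v is None for v in key):
            key = "__special__"  # our label for the NULL / catch-all category
        groups[key].append(inst)
    return groups

def scalable_instance_match(alignment, abox1, abox2, match, classify):
    """Sketch of Algorithm 2.

    alignment: iterable of (c1, c2, S, P1, P2) tuples -- aligned concept
    pairs with their categorized property set S and property sets P1, P2.
    match(insts1, insts2): the pairwise matcher standing in for Algorithm 1.
    """
    aligned_pairs = set()
    for c1, c2, S, P1, P2 in alignment:
        if not S and P1 & P2:
            # no categorization possible, but some properties align:
            # fall back to exhaustive comparison (lines 3-4)
            aligned_pairs |= match(abox1[c1], abox2[c2])
        elif not S:
            continue  # nothing comparable between the concepts (line 5)
        else:
            # partition both sides and compare only same-category
            # instance-clusters (lines 6-11)
            g1, g2 = classify(abox1[c1], S), classify(abox2[c2], S)
            for cid in g1.keys() & g2.keys():
                aligned_pairs |= match(g1[cid], g2[cid])
    return aligned_pairs
```

With a trivial name-equality matcher, only instances in the same (categorized) group are ever handed to `match`, which is the source of the comparison savings quantified in Section V.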


Our SOIM works for each concept-cluster alignment across the ontologies (line 1). Within an aligned concept-cluster pair, we compare instances concept-wise across the ontologies (line 2). If the two concepts have no Category-Property-Set but at least one property of each concept is semantically or syntactically aligned, all instances of one concept are compared with each instance of the other concept by calling our previous algorithm instanceMatch (Algorithm 1) (lines 3 and 4). Here, P1 and P2 are the property sets of concepts C1 and C2 respectively. If there is neither a Category-Property-Set nor any syntactically or semantically aligned property, the instances of the two concepts are not compared at all (line 5). However, if there is a Category-Property-Set for the two concepts, their instances are categorized according to the values of the Category-Property-Set (lines 6-8). Here, iCluster is a set of small instance groups, i.e., the instance sets of both concepts are partitioned into several smaller groups. Then only instances of the same category are compared by calling instanceMatch (Algorithm 1); cid (line 9) ranges over the set of categories. As a consequence, scalableInstanceMatch returns the set of aligned instance pairs, IM (line 12).

V. ELABORATION OF INSTANCE PARTITIONING WITH AN EXAMPLE

Suppose "Book" and "Publication" are two concepts from different ontologies and their concept-cluster is aligned. This section describes how the instances of "Book" and "Publication" are partitioned. For simplicity, only data properties are used, although most real-world ontologies contain both datatype and object properties. Table 1 shows the information of the concept Book from ontology_1. Book has several datatype properties, namely Book_id, Book_name, Publication_year, Price, First_author, Book_Type, ISBN and Publisher.
The last two rows of the table summarize the properties' distinct values and frequencies: the second-to-last row gives, for each property, the number of instances containing distinct values, and the last row gives the frequency of the property, i.e., the number of instances that carry a value for it. Table 2 exhibits the information of the concept Publication from ontology_2. The Publication concept contains Publication_id, Title, Publisher, First_author, Published_year, Book_Type, Price, and ISBN; again, the last two rows report the number of distinct values and the frequency of each property. The instances of these two concepts are to be categorized. Table 3 shows the category factors of the properties of the Book and Publication concepts.
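The two summary rows can be derived mechanically from any instance table. A minimal sketch (the `summary_rows` name and the dict-per-instance representation are ours):

```python
def summary_rows(instances, properties):
    """Per property, compute |D| (number of distinct values observed)
    and |F| (number of instances carrying a value), i.e. the last two
    rows of Tables 1 and 2."""
    d, f = {}, {}
    for p in properties:
        present = [i[p] for i in instances if i.get(p) is not None]
        d[p] = len(set(present))
        f[p] = len(present)
    return d, f
```

For instance, a Book_Type column holding CSE, CSE, IEEE and one missing value yields |D| = 2 and |F| = 3.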


Table 1: Knowledge base of the Book concept from Ontology_1 — 11 instances described by the datatype properties Book_id, Book_name, Publication_year, Price (in USD), First_author, Book_Type, ISBN and Publisher. The second-to-last row gives the number of instances containing distinct values, |D|, for each property, and the last row gives the property frequency, |F|.

Table 2: Knowledge base of the Publication concept from Ontology_2 — 10 instances described by the properties Publication_id, Title, Publisher, First_author, Published_year, Book_Type, Price (in USD) and ISBN. The last two rows again report |D| and |F| per property.

  Category factor (Cp) of the properties of Book:
    Book_id            0.90      First_author   0.81
    Book_name          1         Book_Type      0.181
    Publication_year   0.85      ISBN           0.90
    Price              1         Publisher      0.45

  Category factor (Cp) of the properties of Publication:
    Publication_id     1         Book_Type      0.2
    Title              1         Price          0.9
    First_author       1         Publisher      0.5
    Published_year     1         ISBN           0.9

Table 3: Category factor of the properties of the Book and Publication concepts.


Categorization of the instances of the Book concept, keyed by (Book_Type, Publisher):
  Category (CSE, Wiley): Semantic Web Programming
  Category (CSE, Springer): Ontology Alignment; Handbook on Ontologies; Pattern Recognition
  Category (CSE, Elsevier): Semantic Web for the Working Ontologist; Data Mining Concepts and Techniques
  Category (IEEE, Prentice Hall): Introductory Circuit Analysis; Electronic Devices and Circuit Theory
  Category (IEEE, AIPI): Study Guide for Electronic Devices and Circuit Theory
  Category (IEEE, Wiley): Introduction to Electrical Power Systems; Fundamentals of Power System Economics

Categorization of the instances of the Publication concept, keyed by (Book_Type, Publisher):
  Category (CSE, Thomson): Introduction to the Theory of Computation
  Category (CSE, Springer): Ontology Alignment; A Developer's Guide to the Semantic Web
  Category (CSE, Elsevier): Semantic Web for the Working Ontologist; Data Mining Concepts and Techniques; Handbook of Logic and Language
  Category (IEEE, AIPI): Study Guide for Electronic Devices and Circuit Theory
  Category (IEEE, Wiley): Fundamentals of Power System Economics; All New Electronics Self-Teaching Guide; Principles of Electric Machines and Power Electronics

Table 4: Categorization of the instances of the Book and Publication concepts.


Threshold value calculation for the concept Book:

  δ1 = (max(Cp) + min(Cp)) / 2 = (1 + 0.181) / 2 = 1.181 / 2 = 0.5905

Property selection from the concept Book: if the category factor of a property is less than the threshold δ1, the property is a member of the Candidate Property Set SBook, i.e., P ∈ SBook if Cp < δ1. The set of candidate properties to classify the instances of Book is SBook = {Book_Type, Publisher}.

Threshold value calculation for the concept Publication:

  δ2 = (max(Cp) + min(Cp)) / 2 = (1 + 0.2) / 2 = 0.6

Property selection from the concept Publication: if the category factor of a property is less than the threshold δ2, the property is a member of the Candidate Property Set SPublication, i.e., P ∈ SPublication if Cp < δ2. The set of candidate properties to classify the instances of Publication is SPublication = {Book_Type, Publisher}.

Selection of properties for grouping the instances of both aligned concepts: we now choose the property set S, by which the instances of both concepts (Book and Publication) are categorized, simply by taking the intersection of SBook and SPublication, i.e.,

  S = SBook ∩ SPublication = {Book_Type, Publisher} ∩ {Book_Type, Publisher} = {Book_Type, Publisher}

According to the values of Book_Type and Publisher, the instances of both concepts are then categorized; the resulting categorization is shown in Table 4.

Total comparisons: with the brute-force method, the total number of comparisons is 11 * 10 = 110. Our approach considers only instances of the same category in the matching process and performs only 22 comparisons:

  Category              Comparisons
  (CSE, Springer)       3 * 2 =  6
  (CSE, Elsevier)       2 * 3 =  6
  (IEEE, AIPI)          1 * 1 =  1
  (IEEE, Wiley)         2 * 3 =  6
  Non-matched           3 * 1 =  3
  Total                        22

Table 5: Number of required comparisons.
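The comparison counts can be checked mechanically from the Table 4 group sizes; the dictionaries below just restate those per-category counts:

```python
# Instances per (Book_Type, Publisher) category, read off Table 4.
book = {("CSE", "Wiley"): 1, ("CSE", "Springer"): 3, ("CSE", "Elsevier"): 2,
        ("IEEE", "Prentice Hall"): 2, ("IEEE", "AIPI"): 1, ("IEEE", "Wiley"): 2}
pub = {("CSE", "Thomson"): 1, ("CSE", "Springer"): 2, ("CSE", "Elsevier"): 3,
       ("IEEE", "AIPI"): 1, ("IEEE", "Wiley"): 3}

# Brute force compares every Book instance against every Publication instance.
brute_force = sum(book.values()) * sum(pub.values())  # 11 * 10 = 110

# SOIM compares only same-category groups, plus one cross-comparison of the
# non-matched (special) instances on each side.
shared = book.keys() & pub.keys()
within_category = sum(book[c] * pub[c] for c in shared)          # 6 + 6 + 1 + 6 = 19
special = (sum(v for c, v in book.items() if c not in shared)
           * sum(v for c, v in pub.items() if c not in shared))  # 3 * 1 = 3
total = within_category + special                                # 22
```

The 110-to-22 reduction here is exactly the saving Table 5 reports for this small example.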

All instances containing null or missing values for the categorized properties, or falling into a non-matched category, are put into a special category; the instances of the special category of one concept are then compared only against the special category of the other concept. We also want to eliminate frequently updatable properties from the Category-Property-Set. At the moment we do this manually; for instance, we do not want to select Price for the Category-Property-Set because it is updatable, although the example above includes it.

VI. EXPERIMENTS AND EVALUATION

A. Data Set Description

Nowadays, the Ontology Alignment Evaluation Initiative (OAEI) campaign includes an instance matching track every year [20, 23]. The motto of the track is to evaluate the performance of different tools on the task of matching RDF individuals which originate from different sources but describe the same real-world entity. With the development of the Linked Data initiative, the huge amount of semantic data published on the Web and the necessity of discovering identity links between instances from different repositories, this task has gained importance in recent years. The ISLab Instance Matching Benchmark (IIMB) uses sets of artificially generated and real test cases. These are designed to illustrate all common cases of discrepancies between individual descriptions (different value formats, modified properties, different classification schemas). IIMB is composed of a set of test cases, each one represented by a set of instances, i.e., an OWL ABox, built from an initial dataset of real linked data extracted from the web. The ABox is then automatically modified in several ways, generating a set of new ABoxes called test cases. Each test case is produced by transforming the individual descriptions in the reference ABox into new individual descriptions that are inserted into the test case at hand.
The goal of transforming the original individuals is twofold: on one side, a simulated situation is provided where data referring to the same objects are provided in different data sources; on the other side, different datasets with a variable level of data quality and complexity are generated. IIMB provides transformation techniques supporting modifications of data property values, modifications of the number and type of properties used for the individual description, and modifications of the individuals' classification. The first kind of transformation is known as data value transformation; it aims at simulating the fact that data expressing the same real object in different data sources may differ because of data errors or because of the usage of different conventional patterns for data representation. The second kind of transformation is called data structure transformation; it aims at simulating the fact that the same real object may be described using different properties/attributes in different data sources. Finally, the


third kind of transformation, called data semantic transformation, simulates the fact that the same real object may be classified in different ways in different data sources.

1) IIMB-2009 Data Set: The test-bed provides OWL/RDF data about actors, sport persons, and business firms. The main directory contains 37 sub-directories together with the original ABox and the associated TBox (abox.owl and tbox.owl). The total number of triples used in this set is 84.5K. Each sub-directory contains a modified ABox (abox.owl + tbox.owl) and the corresponding mapping to the instances in the original ABox (refalign.rdf). The original ABox contains 1700 triples. The benchmark data is divided into four major groups: value transformation (001-010), structural transformation (011-019), logical transformation (020-029) and combination transformation (030-037).

2) IIMB-2010 Data Set: The 2010 edition of IIMB is a collection of OWL ontologies consisting of 29 concepts, 20 object properties, 12 data properties and thousands of individuals divided into 80 test cases. These 80 test cases are divided into 4 sets of 20 test cases each. The first three sets are different implementations of data value, data structure and data semantic transformations, respectively, while the fourth set is obtained by combining the three kinds of transformations. IIMB 2010 was created by extracting data from Freebase, an open knowledge base that contains information about 11 million real objects including movies, books, TV shows, celebrities, locations, companies and more. Data extraction was performed using the query language JSON together with the Freebase JAVA API. The benchmark was generated in a small version consisting of 0.258M triples in total and in a large version containing 1M triples in total; the original ABoxes of the small and large versions contain 2616 and 9924 triples respectively. The IIMB-2011 data set contains 6.8 million triples in total, with 164 concepts and 45 properties, and its original ABox contains 87556 triples.


B. Evaluation Metrics and Baselines

To evaluate our proposed method, we use precision, recall, f-measure and elapsed time as measurement metrics. Precision P is defined as the ratio of the number of correctly discovered aligned pairs to the total number of discovered aligned pairs. Recall R is the ratio of the number of correctly discovered aligned pairs to the total number of correct aligned pairs. Elapsed time describes how long it takes to compare two datasets. Precision P and recall R are defined as follows:

  Precision, P = |ir ∩ icr| / |ir|    (8)

  Recall, R = |ir ∩ icr| / |ic|    (9)

where ir stands for the total set of retrieved aligned (matched) instance pairs, icr is the set of correctly aligned (matched) instance pairs, and ic denotes the set of correct aligned instance pairs across the semantic knowledge bases. We use six baselines, namely AFlood, ASMOV, DSSim, HMatch, FBEM and RiMOM [23], for the IIMB-2009 dataset, and three baselines, RiMOM, CODI [19] and ASMOV [21], for the IIMB-2010 dataset [20]. We also present a graph of knowledge-base size (in logarithmic scale) versus elapsed time for comparison.

C. Evaluation with the IIMB-2009 Data Set

An instance matching track was proposed to the participants for the first time in OAEI-2009. Fig. 3 illustrates the results of the participants [23], where AFlood is our instance matcher without considering property weights. Our system was average in performance although it was the fastest in speed. Fig. 4 illustrates the improvement of our system over the previous one in a recall-precision graph. Here, ESIM stands for our proposed Efficient and Scalable Instance Matcher.

Figure 3: Instance matching results in OAEI-2009

Figure 4: Instance matching results of ESIM with the IIMB-2009 data set
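Equations (8) and (9) translate directly into code. A small sketch treating the alignments as sets of instance-id pairs (the function name is ours), with the usual f-measure added as the harmonic mean of P and R:

```python
def precision_recall_f(retrieved, correct):
    """Eq. (8)-(9): P = |retrieved ∩ correct| / |retrieved|,
    R = |retrieved ∩ correct| / |correct|; F is their harmonic mean."""
    hits = len(retrieved & correct)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(correct) if correct else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

For example, retrieving four pairs of which two appear in a four-pair gold standard gives P = R = F = 0.5.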


Figure 5: Comparison of proposed method with IIMB-2010 data set

Figure 6: Size of knowledge base (Triple number in Log Scale) vs Required Time (In minute) graph.

D. Evaluation with the IIMB-2010 Data Set

Fig. 5 pictorially describes the strength of our system relative to the systems that participated in the OAEI-2010 competition. Our proposed system ESIM achieved on average 88% precision and 84% recall, which is better than ASMOV and RiMOM and slightly better than CODI in some cases. Fig. 6 exhibits the knowledge-base size versus required time graph, with the knowledge-base size on a log scale. The size is calculated by adding up the total number of triples of each test-bed; the original ABox contains approximately three thousand triples. The time required for comparing knowledge bases of different sizes with the original ABox is measured in minutes. In this test we consider the IIMB-2010 and IIMB-2011 data sets, using a Core2Duo computer with 4 GB RAM running 32-bit Windows 7.

VII. RELATED WORK

Linked Data, a very pragmatic approach towards achieving the vision of the Semantic Web, leads to the creation of a global data space that contains billions of assertions [14]. It is not restricted to publishing


semantic data on the web: interlinking those data has also gained much attention in recent years, which raises the demand to shift ontology matching from the schema or concept level to the instance level. Moreover, the Ontology Alignment Evaluation Initiative (OAEI) runs an instance matching campaign in the Semantic Web community every year. Many researchers are therefore investing their interest and effort in developing efficient ontology instance matchers. To be clear, we focus here on ontology instance mapping rather than ontology instance-based schema matching. Several systems have been proposed to date. In [7, 16], RiMOM is presented; its authors used several instance matching benchmark data sets, namely A-R-S, T-S-D and IIMB, to evaluate their system, with a different matching strategy for each dataset. For the A-R-S dataset, they proposed the concepts of sufficient property group and necessary property group; for the T-S-D datasets, a vector-based algorithm is used. S. Castano et al. [10] proposed their HMatch 2.0 ontology matcher, which includes instance matching functionality to fit the ontology population task in the context of their BOEMIE project. To discover semantic equivalence between persons in online profiles or elsewhere, an appropriate metric is proposed in [22] for weighting the attributes that were syntactically and/or semantically matched; the authors propose that properties with a maximum or an exact cardinality of 1 have a higher impact factor on the matching process. In [19], J. Huber et al. proposed CODI (Combinatorial Optimization for Data Integration), in which they emphasize object properties to determine the instances for which the similarity should be computed; moreover, they claim that their system computes the similarity of promising instances only. The Anchor-flood algorithm's straightforward approach is demonstrated in [5].
In [8], we achieved better results by introducing an automatic weight generation technique. To handle large knowledge bases and address the scalability issue, a scalable algorithm was proposed by M. Aono et al. [12]. In that study, the SwetoDblp and Rexa knowledge bases were classified according to the taxonomy of the ACM Computing Classification System (CCS). We achieved better outcomes for those datasets in terms of scalability, precision, and recall. However, that method is domain-specific, which raises the demand for a general and scalable approach.

VIII. CONCLUSION

An efficient method of assigning an automatic weight to each property, and of addressing the scalability issue, to improve our state-of-the-art ontology instance matching is presented. We achieve efficiency in terms of precision and recall. Our algorithm now provides not only concept- and property-level alignment but also sub-group-level and finally instance-level alignment. As the matching of instances is computed by considering the weight factors of the properties associated with them, better outcomes in terms of precision and recall are achieved. The experiments and evaluation depict how fast our scalable approach works, although we introduce some extra overhead for calculating the several factors and for matching SLCs


by considering weight factor. Moreover, it reduces the complexity in terms of time and space without compromising with precision and recall. Though our scalable algorithm takes some extra times for pre-processing, it reduces the SLCs comparisons sufficiently. However, we want to apply machine learning techniques to select a sub set of instances for pre-processing stage to avoid the consideration of all instances. Moreover, until now, Anchorflood algorithm is capable only to measure one to one or equivalent alignment. For achieving better results in terms of precision and recall, schema level matching should cover complex (one-to-many, many-to-one, many-to-many) mappings and subsumption alignment. Our future plan also includes the use of sub-set attributes section techniques to consider only relevant property values by avoiding missing values and irrelevant property values in matching process so that we can further minimize comparison cost. We will try to achieve further improvement by addressing the same individual that resides in non-aligned concepts across knowledge bases. Testing our algorithm with large knowledge bases like DBpedia and DBLP as well as to fit it with Linked Open Data (LOD) project is our ongoing research. [1]

REFERENCES

[1] T. Berners-Lee, M. Fischetti, and M. Dertouzos, "Weaving the Web: the original design and ultimate destiny of the World Wide Web," Harper, San Francisco, 1999.
[2] R. Studer, V. Benjamins, and D. Fensel, "Knowledge engineering: principles and methods," Data & Knowledge Engineering 25 (1-2), pp. 161-197, 1998.
[3] M. Ehrig, "Ontology Alignment: Bridging the Semantic Gap," Springer, New York, 2007.
[4] M. H. Seddiqui and M. Aono, "An efficient and scalable algorithm for segmented alignment of ontologies of arbitrary size," Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 2009.
[5] M. H. Seddiqui and M. Aono, "Anchor-flood: results for OAEI-2009," Proceedings of the ontology matching workshop of the 8th International Semantic Web Conference, Chantilly, VA, USA, 2009.
[6] J. Tang, J. Li, B. Liang, X. Huang, Y. Li, and K. Wang, "Using Bayesian decision for ontology mapping," Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 2006.
[7] Z. Wang, X. Zhang, L. Hou, Y. Zhao, J. Li, Y. Qi, and J. Tang, "RiMOM results for OAEI 2010," Proceedings of the ontology matching workshop of the 9th International Semantic Web Conference, Shanghai, China, 2010.
[8] M. H. Seddiqui, S. Das, I. Ahmed, R. P. D. Nath, and M. Aono, "Augmentation of ontology instance matching by automatic weight generation," World Congress on Information and Communication Technologies, India, IEEE, 2011.
[9] A. Isaac, L. V. D. Meij, S. Schlobach, and S. Wang, "An empirical study of instance-based ontology matching," ISWC/ASWC 2007.
[10] S. Castano, A. Ferrara, D. Lorusso, and S. Montanelli, "On the ontology instance matching problem," International Conference on Database and Expert Systems Applications, Italy, IEEE, 2008.
[11] K. K. Breitman, D. Brauner, M. Casanova, and A. Perazolo, "Instance-based ontology mapping," Fifth IEEE Workshop on Engineering of Autonomic and Autonomous Systems, IEEE, 2008.
[12] M. Aono and M. H. Seddiqui, "Scalability in ontology instance matching of large semantic knowledge base," Proceedings of the 9th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering and Databases, pp. 378-383, University of Cambridge, UK, 2010.
[13] W. E. Winkler, "The state of record linkage and current research problems," Technical report, Statistical Research Division, U.S. Census Bureau, Washington DC, 1999.
[14] S. Auer, J. Lehmann, and A. C. N. Ngomo, "Introduction to linked data and its life cycle on the web," Reasoning Web 2011, pp. 1-75.
[15] S. Castano, A. Ferrara, S. Montanelli, and D. Lorusso, "Ontology and instance matching," Knowledge-Driven Multimedia Information Extraction and Ontology Evolution, pp. 167-195, 2011.
[16] X. Zhang, Q. Zhong, F. Shi, J. Li, and J. Tang, "RiMOM results for OAEI 2009," Proceedings of the ontology matching workshop of the 8th International Semantic Web Conference, Chantilly, VA, USA, 2009.
[17] M. H. Seddiqui and M. Aono, "Anchor-flood: results for OAEI-2008," Proceedings of the ontology matching workshop of the 7th International Semantic Web Conference, Germany, 2008.
[18] S. Staab and R. Studer, "Handbook on Ontologies," Springer, 2004.
[19] J. Huber, T. Sztyler, J. Noessner, and C. Meilicke, "CODI: combinatorial optimization for data integration - results for OAEI 2011," Bonn, Germany, 2011.
[20] J. Euzenat, A. Ferrara, C. Hage, J. Pane, H. Stuckenschmidt, et al., "Results of the ontology alignment evaluation initiative 2010," Proceedings of the ontology matching workshop of the 9th International Semantic Web Conference, 2010.
[21] Y. R. Jean-Mary, E. P. Shironoshita, and M. R. Kabuka, "ASMOV: results for OAEI 2010," Proceedings of the ontology matching workshop of the 9th International Semantic Web Conference, 2010.
[22] K. Cortis, S. Scerri, I. Rivera, and S. Handschuh, "Discovering semantic equivalence of people behind online profiles," Proceedings of the Fifth International Workshop on Resource Discovery (RED 2012) at ESWC 2012, Heraklion, Greece, 2012.
[23] J. Euzenat, A. Ferrara, et al., "Results of the ontology alignment evaluation initiative 2009," Proceedings of the ontology matching workshop of the 8th International Semantic Web Conference, 2009.
[24] R. P. D. Nath, M. H. Seddiqui, and M. Aono, "Resolving scalability issue to ontology instance matching in semantic web," Proceedings of the 15th International Conference on Computer and Information Technology (ICCIT), Chittagong, Bangladesh, 2012.
[25] J. Aasman, "AllegroGraph," Technical Report, Franz Incorporated, 2006.
[26] C. Bizer, T. Heath, K. Idehen, and T. Berners-Lee, "Linked data on the web (LDOW 2008)," Proceedings of the International Conference on World Wide Web, ACM, 2008.
[27] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives, "DBpedia: a nucleus for a web of open data," The Semantic Web, 2007.
[28] M. Ley, "DBLP: Computer Science Bibliography."


Mr. Rudra Pratap Deb Nath is working as a lecturer in the Department of Computer Science and Engineering, University of Chittagong, Bangladesh. He completed his Master of Engineering (M.Engg.) in the Department of Computer Science and Engineering, Toyohashi University of Technology, Toyohashi, Aichi, Japan. His research areas are the semantic web, knowledge engineering, information retrieval, affective computing, and human-computer interaction. He has a good number of publications and is the coordinator of the Knowledge Engineering and Sharing Laboratory (KESL).

Dr. Hanif Seddiqui is working as an associate professor in the Department of Computer Science and Engineering, University of Chittagong, Bangladesh. He completed both his Doctor of Engineering (D.Engg.) and Master of Engineering (M.Engg.) in the Department of Computer Science and Engineering, Toyohashi University of Technology, Toyohashi, Aichi, Japan. His research areas are the semantic web, knowledge and intelligent engineering, and artificial intelligence. He received the most cited article 2006-2010 award from the Journal of Web Semantics, Elsevier.

Dr. Masaki Aono has been serving as a professor in the Department of Computer Science and Engineering, Toyohashi University of Technology, Aichi, Japan since 2003. He has numerous patents and publications in top-class journals and conferences. His research interests cover information retrieval, 3D shape modeling and retrieval, web data mining, and knowledge engineering. He is a member of ACM, IEEE, IPSJ, IEICE, JSAI and NLP. He received the most cited article 2006-2010 award from the Journal of Web Semantics, Elsevier, and a best paper award at APSIPA-2013.
