A Similarity Measure based on Semantic, Terminological and Linguistic Information

Nitish Aggarwal∗, Tobias Wunner∗◦, Mihael Arcan∗, Paul Buitelaar∗, Seán O'Riain◦

∗ Unit for Natural Language Processing and ◦ eBusiness Unit, Digital Enterprise Research Institute, National University of Ireland, Galway
[email protected]

Abstract. Ontology matching algorithms need to handle rich semantic, terminological and linguistic variations of specialized domain vocabularies. Semantic variations include synonyms or otherwise related terms, terminological variations include more or less complex terms expressing the same concept, and linguistic variations include morphological or syntactic variants. In this work we therefore exploit a deeper semantic, terminological and linguistic analysis of domain vocabularies in order to establish a more sophisticated similarity measure that caters for the specific characteristics of this data. In particular we propose "STL", a novel similarity measure that takes semantic, terminological and linguistic variation into account. This paper reports our first experiment towards implementing this methodology, on a data set of 59 financial term pairs drawn from the xEBR (XBRL European Business Registers) vocabulary and annotated by four human annotators. Results show that the STL measure improves over state-of-the-art similarity measures.

1 Introduction

Similarity is a central concept in Semantic Web and Web of Data research and its applications. Reasoning about sets of semantically described web objects, extraction, integration and retrieval of object instances, comparison and merging of object descriptions, etc. all depend on a measurement of similarity between such objects. For instance, the fundamental task of ontology matching is based on measuring the similarity between two ontologies, e.g. in terms of structural and/or string-based similarity. Structural or 'semantic' similarity measures the overlap in the occurrence and ordering of classes, whereas string-based similarity measures the overlap in class names.

Given this central importance of similarity, it is necessary to choose the best performing similarity measure in order to optimize the performance of ontology matching algorithms. Existing similarity measures, such as Levenshtein for string distance or Wu-Palmer for structural similarity, take semantic and linguistic variation into account to some extent. However, we argue that a deeper semantic, terminological and linguistic analysis of specialized domain vocabularies is needed in order to establish a more sophisticated similarity measure that caters for the specific characteristics of this data. In particular we propose 'STL', a novel similarity measure that takes semantic, terminological and linguistic variation into account. Experiments with this measure show that it outperforms existing similarity measures on a data set of 59 term pairs in the financial domain that were annotated with similarity scores by four human annotators. The remainder of this paper is organized as follows: Section 2 discusses related work, Section 3 presents the STL approach to similarity, Section 4 describes experiments on the financial data set, Section 5 discusses the results, and finally Section 6 concludes the findings of this work and outlines directions for future work.

2 Related Work

In recent years, there has been a wide variety of efforts to improve similarity measures, specifically in the context of ontology matching and alignment. Most of these approaches address the problem with methods based on string patterns or with structure-based methods. String-based measures are the simplest and most computationally efficient similarity measures; they compute the edit distance between two strings (see [8] and [2] for details). More linguistically advanced measures include a pre-stemming step that allows for a looser comparison to increase recall and precision. Kantrowitz et al. [6] showed that this effect decreases with the length of the terms.

Structure-based methods calculate similarity on the basis of a given semantic (taxonomic or ontological) structure. For instance, Rada et al. [13] proposed a simple way to measure semantic similarity by counting the number of edges between concepts c1 and c2 (restricted to the ISA relation). This approach relies on the assumption that edges represent uniform distances, which is not always the case. To overcome this problem, Wu and Palmer [15] proposed a similarity measure that uses the number of edges and the depth of the Most Specific Common Ancestor (MSCA) of the concepts being compared. Other structure-based methods use corpus evidence to remedy the problem of non-uniform distance through the notion of "information content". Information content (IC-value) is measured as the negative logarithm of the likelihood of a concept, reflecting that the more probable a concept is to appear in a corpus, the less information it conveys. Resnik [14] describes an implementation based on the IC-value of the MSCA of the concepts being compared, which reflects the commonality of two concepts. However, considering only commonality does not differentiate between the similarity of two specific concepts and the similarity of two abstract concepts if they share the same MSCA. To correct this problem, Jiang and Conrath [5] and Lin [9] present methods that also consider the IC-values of the concepts themselves. The main drawback of these measures is their requirement for large-scale corpus analysis. To overcome this problem, Pirró [11] measures similarity using intrinsic information content, which derives the information content of a concept from the number of its sub-concepts. In [12] Pirró and Euzenat showed a significant improvement in semantic similarity by also including non-taxonomic relationships. However, all these approaches only make use of structural information in the model and do not take terminological or linguistic information into account.

There are several linguistic measures ([4] and [1]) that can account for pseudo-syntactic information by analysing word order through n-gram analysis, e.g. as in the similarity between "Accrued Charges and Deferred Income" and "Accrued Income and Deferred Charges" (examples taken from the xEBR vocabulary, see Section 4 below). To do this, Islam and Inkpen [4] defined a syntactic measure that computes the similarity between two strings as the maximal word order overlap. This measure shows high correlation with human intuition, thus confirming that syntactic information plays an important role in short-text similarity. However, it is easy to construct examples with high word order overlap but low similarity, e.g. "Amount payable after more than one year" and "Amount receivable after more than one year". Oliva et al. [10] present a similarity measure for sentences and short texts that takes syntactic information, such as morphology and parse structure, into account. In this method syntactic information is obtained through a deep parsing process that finds the phrases in each sentence, and semantic information is obtained from a lexical database. Similarity is calculated as the sum of similarities between the heads of phrases that have the same syntactic role (subject, object, etc.) in both sentences. Additionally, by weighting adjectives and adverbs they could further improve this similarity measure. As shown by this overview of related work, there is much potential for taking linguistic and terminological structure into account when calculating similarity, which is therefore the objective of the research reported here.
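For reference, the structural measures surveyed above can be summarized in the following (our) notation, where msca(c1, c2) denotes the Most Specific Common Ancestor of two concepts and p(c) the corpus probability of concept c:

```latex
\begin{align*}
\mathrm{sim}_{\mathrm{WP}}(c_1,c_2) &= \frac{2\,\mathrm{depth}(\mathrm{msca}(c_1,c_2))}{\mathrm{depth}(c_1)+\mathrm{depth}(c_2)} \\
IC(c) &= -\log p(c) \\
\mathrm{sim}_{\mathrm{Res}}(c_1,c_2) &= IC(\mathrm{msca}(c_1,c_2)) \\
\mathrm{sim}_{\mathrm{Lin}}(c_1,c_2) &= \frac{2\,IC(\mathrm{msca}(c_1,c_2))}{IC(c_1)+IC(c_2)}
\end{align*}
```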

3 The STL Approach to Similarity

As mentioned above, we present a novel similarity measure that takes the domain-specific terminological and linguistic characteristics of ontologies (i.e. of ontology labels) into account. We base our approach on the three-faceted STL ontology enrichment process introduced in [16]. We calculate similarity according to semantic, terminological and linguistic variation and then take a linear combination of the three, obtained by linear regression, called STL similarity. The components are defined as follows (a minimal sketch of the components and their combination is given after this list):

– Semantic similarity (SSim) is calculated based on semantic (taxonomic or ontological) structure or by use of statistical semantic analysis methods such as Latent Semantic Analysis (LSA) [7] or Explicit Semantic Analysis (ESA) [3]. For our purposes we used the recently proposed Pirro Sim measure [11], which uses intrinsic information content, i.e. the information content of a concept defined by the number of its sub-concepts.

– Terminological similarity (TSim) is defined by maximal subterm overlap, i.e. we calculate TSim between two concepts c1 and c2 as the number of subterms ti in a termbase that can be matched on the labels of c1 and c2. A term ti is said to match on a concept when no other, longer term tj can be matched on the same concept (label). To calculate TSim we use monolingual as well as multilingual termbases, as the latter reflect terminological similarities that may be available in one language but not in others, e.g. there is no terminological similarity between the English terms "Property Plant and Equipment" and "Tangible Fixed Asset", whereas in German these concepts are actually identical on the terminological level (the German translation for both is "Sachanlagen").

– Linguistic similarity (LSim) is defined as the Dice coefficient applied to the head and modifier syntactic arguments of two terms, i.e. the ratio of common modifiers to all modifiers of the two concepts. For instance, the concepts "Financial Income" and "Net Financial Income" have three modifiers, "financial", "net" and "net financial", of which only "financial" is a common modifier.

– STL. Putting it all together, we define STL similarity (STLSim) as a linear combination of the semantic, terminological and linguistic similarity measures described above. The linear combination is obtained with a linear regression model, which can be represented by the following formula:

  STLsim = a1 * SSim + a2 * TSim + a3 * LSim + Const

where SSim, TSim and LSim are the semantic, terminological and linguistic similarities respectively, as described above, and the parameters a1, a2 and a3 represent the contribution of each.
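The following is a minimal, illustrative sketch of the terminological and linguistic components and their combination. The termbase and the modifier sets are hypothetical stand-ins for the resources described in this paper, SSim would come from a structural measure such as Pirro Sim, and the normalization of TSim by the union of matched subterms is our assumption (the definition above gives the raw overlap count):

```python
def t_sim(label1: str, label2: str, termbase: set[str]) -> float:
    """Terminological similarity: maximal subterm overlap. A term only
    counts as a match if no longer term in the termbase subsumes it
    (longest-match principle). Normalization by the union is assumed."""
    def subterms(label: str) -> set[str]:
        matched: set[str] = set()
        for term in sorted(termbase, key=len, reverse=True):
            if term in label and not any(term in longer for longer in matched):
                matched.add(term)
        return matched
    s1, s2 = subterms(label1.lower()), subterms(label2.lower())
    return len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 0.0

def l_sim(modifiers1: set[str], modifiers2: set[str]) -> float:
    """Linguistic similarity: Dice coefficient over head/modifier
    arguments, i.e. the ratio of shared modifiers to all modifiers."""
    if not modifiers1 and not modifiers2:
        return 0.0
    return 2 * len(modifiers1 & modifiers2) / (len(modifiers1) + len(modifiers2))

def stl_sim(s: float, t: float, l: float,
            a1: float, a2: float, a3: float, const: float) -> float:
    """STL similarity as a linear combination of the three components;
    the weights are fitted by linear regression (see Section 4.3)."""
    return a1 * s + a2 * t + a3 * l + const

# The "Financial Income" vs. "Net Financial Income" example from above,
# with a hypothetical three-entry termbase:
t = t_sim("Financial Income", "Net Financial Income",
          {"financial income", "income", "net"})                 # -> 0.5
l = l_sim({"financial"}, {"net", "financial", "net financial"})  # -> 0.5
```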

4 Experiments

4.1 Financial Use Case

To evaluate the STL similarity measure as well as the underlying SSim, TSim and LSim measures, we used a finance vocabulary developed by the XBRL (eXtensible Business Reporting Language) European Business Registers (xEBR) Working Group. The xEBR vocabulary is a model for describing financial accounting and profile information of business entities within Europe.

Semantic Characteristics All concepts in the xEBR model are interlinked via a taxonomic structure that uses the XBRL parent-child relation for describing the layout of concepts in a financial report, which means that the parent-child relationship does not necessarily model subclass taxonomies as in WordNet. Table 1 shows an extract of the taxonomy with the three top-level concepts "Assets", "Equity and liabilities" and "Income statement". So, for instance, "Fixed Asset" is a type (subclass) of "Asset" whereas "Work in progress" is not, although both are linked via the parent-child relationship to "Asset".

Key Balance Sheet (Financial)
├── Assets
│   ├── Fixed assets
│   │   ├── Work in progress
│   │   ├── Intangible fixed assets
│   │   └── ...
│   └── ...
├── Equity and liabilities
│   ├── Profit (loss) brought forward
│   ├── Amounts payable after more than one year
│   ├── Accrued charges and deferred income
│   └── ...
└── Income statement
    ├── Net operating income
    └── ...

Table 1. Extract of the semantic model in xEBR (financial)

Terminological Characteristics As sketched out in Table 1, the finance taxonomy is very rich in domain-specific terminology and subterm structure. For instance, the terms "Assets" and "Income" are contained in different concepts of the ontology. To uncover such structure and maximize reuse in the lexicon we need a sophisticated subterm analysis that goes beyond n-grams. For instance, the bigram "Intangible Fixed" is a substring of "Other Intangible Fixed Assets" but not a subterm.

Linguistic Characteristics Table 2 shows the linguistic characteristics of the xEBR vocabulary by its part-of-speech distribution. It can be seen that financial vocabulary terms contain more nouns and adjectives than verbs.

Part-Of-Speech  Frequency  Example
Noun            555        assets, amounts
Adjective       203        fixed, financial
Preposition     76         after, for, to, of, than
Conjunction     42         and
Verb            26         paid, owed, received
Determiner      3          the, a, an
Adverb          1          forward

Table 2. Part-of-speech characteristics of the xEBR vocabulary

Apart from semantic and terminological variations, the vocabulary also showed linguistic variations. A common linguistic variation is number, i.e. singular/plural, which was however not found in the vocabulary. As the vocabulary consists of phrase-based terms, there are dependencies solely on the phrasal level and not on the sentence level, e.g. the noun phrases "amounts receivable" and "received amounts" have different syntactic structures (noun adverb vs. adjective noun) but share the same dependency structure (amount:head, receive:modifier).

4.2 Benchmarking

To evaluate the developed similarity measure we created a benchmark data set from the financial xEBR vocabulary (available at http://nlp.deri.ie/Nitish/simsem59). The vocabulary contains 269 terms, which results in 72,361 term pairs. To reduce the annotation effort we selected 59 term pairs. This sample was selected by applying the S, T and L baseline measures to all pairs and then taking a random subset with an even distribution over the similarity scores. The reason for doing this was to avoid a bias towards particular similarity scores (very high or low) or towards term pairs that lack specific S, T or L characteristics (e.g. samples without any parent-child pairs). This benchmark data set was then manually rated by four human evaluators, using a rating score ranging from 1 for 'not that related' to 5 for 'same', as shown in Table 3. The most frequent score given by the evaluators was 3 (related), followed by 4 (narrow/broader). The least used score was 5 (same). All ratings were normalized using the mean.

Score  Label             Occurrence
1      not that related  16.9%
2      opposites         25.0%
3      related           31.7%
4      narrow/broader    24.5%
5      same              1.7%

Table 3. Human evaluator scores

4.3 Similarity Measures

In order to evaluate our approach, we implemented three semantic (S1-S3), six terminological (T1-T6), two linguistic (L1, L2) and one combined STL measure, as depicted in Table 4. We describe these measures as follows:

Semantic        PathLength (S1) | Wu-Palmer (S2) | Pirro Sim (S3)
Terminological  Unigram: Mono (T1), Multi (T2) | Bigram: Mono (T3), Multi (T4) | Subterm: Mono (T5), Multi (T6)
Linguistic      Lemma (L1) | Head&Mod (L2)
STL             linear combination of S3, T6 and L2

Table 4. An overview of similarity measures

Semantic In the case of semantic or structure-based measures, we implemented a measure based on semantic path length (S1) [13], the Wu-Palmer similarity measure (S2) [15] and the recently proposed Pirro Sim (S3) [11], which uses intrinsic information content.

Terminological To analyze the effect of terminological information on similarity, we developed six different terminological similarity measures using termbases constructed from n-gram and subterm lists. We constructed four termbases based on unigrams (T1 & T2) and bigrams (T3 & T4), but, as discussed before, n-grams are not always subterms of the vocabulary, e.g. the bigram "than one" does not reflect an actual financial term. We therefore constructed two subterm termbases (T5, T6) by including terms from several additional domain-specific financial resources such as Investopedia (www.investopedia.com) and Investorwords (www.investorwords.com). All termbases were created in a monolingual and a multilingual version.

Linguistic The role of linguistic information is measured on the level of lemmatized unigrams (L1) and of head/modifier structure (L2).

STL As discussed in Section 3, STL similarity is a linear combination of the semantic, terminological and linguistic similarity measures. For the purpose of our experiments, the STL formula used is:

  STLsim = 0.1531 * SSim + 0.5218 * TSim + 0.1041 * LSim + 0.1791

The parameter values were generated by fitting a linear regression model to the results obtained for SSim (Pirro Sim), TSim (T6, multilingual subterm) and LSim (Head&Mod).
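As an illustration of how such parameter values can be obtained, the following sketch fits the regression by ordinary least squares; the arrays are synthetic stand-ins for the per-pair SSim, TSim and LSim scores and the mean-normalized human ratings of the 59 term pairs:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 59  # number of annotated term pairs in the benchmark

# Synthetic stand-ins; in the experiment these would hold the
# SSim (S3), TSim (T6) and LSim (L2) scores and the human ratings.
s_scores, t_scores, l_scores = rng.random(n), rng.random(n), rng.random(n)
ratings = (0.15 * s_scores + 0.52 * t_scores + 0.10 * l_scores
           + 0.18 + rng.normal(0.0, 0.05, n))

# Design matrix [SSim, TSim, LSim, 1]; least squares yields a1..a3 and Const.
X = np.column_stack([s_scores, t_scores, l_scores, np.ones(n)])
(a1, a2, a3, const), *_ = np.linalg.lstsq(X, ratings, rcond=None)
print(f"STLsim = {a1:.4f}*SSim + {a2:.4f}*TSim + {a3:.4f}*LSim + {const:.4f}")
```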

5 Results & Discussion

On the semantic level, PathLength (S1), Wu-Palmer (S2) and Pirro Sim (S3) showed low correlations of 0.16, 0.18 and 0.20 respectively, because the xEBR model does not provide a proper taxonomic (subclass) structure, as explained in Section 4.1. On the terminological level, results showed that unigrams (T1 & T2) performed on average 28% better than bigrams (T3 & T4). In the bigram case, the multilingual measure (T4) outperformed its monolingual version (T3) with 0.54 vs. 0.53. The subterm measures (T5 & T6) showed the best performance among the terminological measures, with scores of 0.74 (monolingual) and 0.75 (multilingual). The multilingual subterm measure (T6) had a performance gain of 4% over the best n-gram measure (T1). On the linguistic level, the lemmatized unigram measure (L1) had a score of 0.70, whereas the head/modifier measure (L2) performed much lower at 0.51. Finally, STL, the combination of all three levels, i.e. Pirro Sim (S3) for SSim, the multilingual subterm measure (T6) for TSim and Head&Mod (L2) for LSim, performed best with the highest score of 0.78, as shown in Table 5.

Measure  Correlation
S1       0.16
S2       0.18
S3       0.20
T1       0.72
T2       0.72
T3       0.53
T4       0.54
T5       0.74
T6       0.75
L1       0.70
L2       0.51
STL      0.78

Table 5. Correlation of STL similarity measures with human evaluator scores

As can be observed from Table 5, the best correlations were achieved by the terminological measures, followed by the linguistic and semantic ones. The combination in the STL measure gave the best performance.

The semantic measures gave overall lower scores. On the terminological level we observed that the multilingual bigram and subterm measures (T4 & T6) performed better than their monolingual configurations. In fact, we could show a multilingual performance gain on the terminological level in two out of three cases (T3 vs. T4 and T5 vs. T6). The linguistic measures are positioned between the semantic and terminological measures. Even though the linguistic measures did not outperform any other measure, L2 performed better for a number of term pairs, such as "Amounts Receivable After More Than One Year" vs. "Amounts Payable After More Than One Year". Here it could better explain the low similarity of this term pair by identifying a differing modifier ("receivable" vs. "payable"), whereas the terminological measures did not take this into account, as "amount" and "after more than one year" are both valid financial terms. Finally, results showed that the STL measure outperforms all individual measures.
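For reference, the scores in Table 5 amount to computing the correlation between each measure's per-pair scores and the human ratings; a minimal sketch on stand-in arrays, assuming Pearson correlation (the paper does not state the correlation coefficient explicitly):

```python
import numpy as np

# Hypothetical per-pair scores for one measure and the corresponding
# mean-normalized human ratings (stand-in values, not benchmark data).
measure_scores = np.array([0.80, 0.10, 0.45, 0.90, 0.30])
human_ratings  = np.array([0.70, 0.20, 0.40, 1.00, 0.30])

correlation = np.corrcoef(measure_scores, human_ratings)[0, 1]
print(f"correlation = {correlation:.2f}")
```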

6 Conclusions & Future Work

In this work we presented a framework for the integration of semantic, terminological and linguistic variation into a unified similarity measure (STL), which we have shown to outperform more traditional similarity measures that typically account for only one of these variation aspects. We evaluated our results on a domain-specific vocabulary with rich semantic, terminological and linguistic variations. However, as the vocabulary was relatively small, the data was sparse and the results might therefore be biased. To remedy this aspect of the research reported so far, we have started work on the construction of a significantly larger benchmark that will be used to optimise the evaluation of the defined measures, taking into account a wider variety of semantic, terminological and linguistic variation. For now, we could however already show that the terminological measures made significant contributions to performance, and that measures based on multilingual term variation performed even better than those taking only monolingual term information into account. The linguistic measures did not yet show significant contributions to performance. In future work we therefore plan to integrate a richer set of linguistic operations, such as derivation rules, that provide a richer account of linguistic information. For instance, the verb 'recognise' could be related to the derived nominal form 'recognition' by applying the derivation rule verb_lemma+"ion".
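A toy sketch of such a derivation rule; the suffix handling here is hypothetical and would in practice come from a morphological lexicon rather than string concatenation:

```python
def derive_nominal(verb_lemma: str) -> str:
    """Toy derivation rule (verb_lemma + 'ion'); real rules need stem
    adjustments that a morphological lexicon would provide."""
    if verb_lemma.endswith("ise"):
        return verb_lemma[:-3] + "ition"   # recognise -> recognition
    return verb_lemma + "ion"              # e.g. deduct -> deduction

assert derive_nominal("recognise") == "recognition"
```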

References

1. Achananuparp, P., Hu, X., Shen, X.: The evaluation of sentence similarity measures. In: DaWaK. pp. 305–316 (2008)
2. Bergroth, L., Hakonen, H., Raita, T.: A survey of longest common subsequence algorithms. In: Proceedings of the Seventh International Symposium on String Processing and Information Retrieval (SPIRE'00). pp. 39–48 (2000)
3. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the Twentieth International Joint Conference on Artificial Intelligence. pp. 1606–1611. Hyderabad, India (2007), http://www.cs.technion.ac.il/~shaulm/papers/pdf/GabrilovichMarkovitch-ijcai2007.pdf
4. Islam, A., Inkpen, D.: Semantic text similarity using corpus-based word similarity and string similarity. ACM Trans. Knowl. Discov. Data 2, 10:1–10:25 (July 2008), http://doi.acm.org/10.1145/1376815.1376819
5. Jiang, J., Conrath, D.: Semantic similarity based on corpus statistics and lexical taxonomy. In: Proc. of the Int'l. Conf. on Research in Computational Linguistics. pp. 19–33 (1997), http://www.cse.iitb.ac.in/~cs626449/Papers/WordSimilarity/4.pdf
6. Kantrowitz, M., Mohit, B., Mittal, V.: Stemming and its effects on TFIDF ranking (poster session). In: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 357–359. SIGIR '00, ACM, New York, NY, USA (2000), http://doi.acm.org/10.1145/345508.345650
7. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Processes 25, 259–284 (1998)
8. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8), 707–710 (1966)
9. Lin, D.: An information-theoretic definition of similarity. In: Proc. of the 15th Int'l. Conf. on Machine Learning. pp. 296–304 (1998), http://portal.acm.org/citation.cfm?id=657297
10. Oliva, J., Serrano, J.I., del Castillo, M.D., Iglesias, A.: SyMSS: A syntax-based measure for short-text semantic similarity. Data Knowl. Eng. 70, 390–405 (April 2011), http://dx.doi.org/10.1016/j.datak.2011.01.002
11. Pirró, G.: A semantic similarity metric combining features and intrinsic information content. Data Knowl. Eng. 68, 1289–1308 (November 2009), http://portal.acm.org/citation.cfm?id=1630170.1630362
12. Pirró, G., Euzenat, J.: A feature and information theoretic framework for semantic similarity and relatedness. In: Proceedings of the 9th International Semantic Web Conference (ISWC'10), Volume Part I. pp. 615–630. Springer-Verlag, Berlin, Heidelberg (2010), http://portal.acm.org/citation.cfm?id=1940281.1940321
13. Rada, R., Mili, H., Bicknell, E., Blettner, M.: Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics 19(1), 17–30 (1989)
14. Resnik, P.: Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence. pp. 448–453 (1995)
15. Wu, Z., Palmer, M.: Verb semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics. pp. 133–138. ACL '94, Association for Computational Linguistics, Stroudsburg, PA, USA (1994), http://dx.doi.org/10.3115/981732.981751
16. Wunner, T., Buitelaar, P., O'Riain, S.: Semantic, terminological and linguistic interpretation of XBRL. In: Reuse and Adaptation of Ontologies and Terminologies Workshop at the 17th International Conference on Knowledge Engineering and Knowledge Management (EKAW) (2010)

Acknowledgements This work is supported in part by the European Union under Grant No. 248458 for the Monnet project, as well as by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2). We would also like to thank the evaluators for their efforts in annotating the similarity benchmark.