
Mining Linked Open Data through Semi-supervised Learning Methods based on Self-training
Nicola Fanizzi, Claudia d'Amato, Floriana Esposito
Dipartimento di Informatica, Università degli studi di Bari
Campus Universitario, Via Orabona 4, 70125 Bari, Italy
Email: [email protected]

Abstract—The paper tackles the problem of mining linked open data. The inherent lack of knowledge caused by the open-world assumption made on the semantics of the data model determines an abundance of data of uncertain classification. We present a semi-supervised machine learning approach. Specifically, a self-training strategy is adopted which iteratively uses labeled instances to predict labels also for unlabeled instances. The approach is empirically evaluated through extensive experimentation involving several different algorithms, demonstrating the added value yielded by a semi-supervised approach over standard supervised methods.

I. INTRODUCTION

More and more applications in the context of the Semantic Web are essentially based on the interoperability of data at various degrees of complexity. Frameworks like Linked Open Data (LOD) provide a way for exposing, sharing, and connecting data via dereferenceable URIs on the Web [1]. There is an increasing number of interesting open data sets available on the Web, such as DBpedia, GeoNames, MusicBrainz, WordNet, DBLP, etc., published under various licenses. Dealing with knowledge bases in the Semantic Web poses various challenges due to problems of uncertainty and sparsity in the data, as well as issues with the scalability and rigidity of the processing mechanisms. Hence the necessity of data-driven methods arises also in this context. As the size of the LOD cloud is constantly increasing, we explore the usefulness of machine learning methods applied to LOD. Indeed, machine learning has the chance of exploiting statistical regularities in the data that cannot easily be captured by logical statements, and can handle contradictory, uncertain, and missing data [2]. So far the strength of machine learning methods [3] has been exploited in this context for the preliminary phases of knowledge acquisition, construction (ontology learning [4]), and integration (ontology matching [5]). Most of these tasks are tackled by borrowing and adapting text mining techniques. Structured knowledge bases in the Semantic Web deserve specific methods that are suitable for the specific data models employed in this context [6]. We intend to investigate specific learning methods by which to successfully perform data mining tasks on data

drawn from the LOD cloud. In the line of some preliminary investigations on Semantic Web mining methods [7], [8], we focus on semi-supervised learning techniques [9], [10] for the LOD. Resorting to semi-supervised techniques is justified by the automation of the preliminary knowledge acquisition process: the acquisition of labeled data for a learning problem often requires skilled human agents (the experts) to manually classify training examples. The cost associated with the labeling process may render a fully labeled training set infeasible, whereas acquisition of unlabeled data is relatively inexpensive. This is even more compelling in a context which is particularly characterized by data sparseness, due to the inherently open and distributed nature of the knowledge bases. The paper contributes by showing how to cast specific tasks as learning problems. Given these premises, a number of machine learning algorithms have been applied to datasets extracted from the LOD cloud so that their performance could be empirically evaluated.

The paper is organized as follows. After the next Sect. II presenting the learning problem, the methods proposed for its solution are discussed in Sect. III. In Sect. IV the experiments proving the effectiveness of the semi-supervised learning approach over the supervised one are reported. Sect. V reports on relevant related work on mining the knowledge bases available in the Semantic Web. Finally, possible developments are discussed in Sect. VI.

II. ISSUES WITH LEARNING FROM LINKED OPEN DATA

A. Sparse and Missing Data

Dealing with Semantic Web knowledge bases faces various obstacles related to the very large scale and nature of such data. The worst-case complexity of the algorithms for the basic reasoning tasks indicates that they are unfeasible for application to large-scale knowledge bases, especially when these are expressed in rich standard representations. However, many reasoning engines demonstrate good scalability and performance, with a constant effort towards optimization (or approximation). The most important problem when learning from LOD is data sparseness. An abundance of missing data is due to the inherently open nature of the Semantic Web, which makes any attempt to circumscribe knowledge fail. Making

the closed-world assumption (as in the context of databases) may (partially) help in coping with data sparseness, but such an assumption would also affect the intended semantics of the data. The same would happen adopting alternative forms of (auto-)epistemic inference [11]. When querying the LOD cloud, one is interested in statements (implicit triples) that may be derived by some form of (deductive but also inductive) inference, rather than those that are already explicitly contained in the knowledge base. Hence, generally some form of inferred closure is sought, i.e. the completion of the knowledge base with (all the) statements that (definitely or likely) hold according to the intended semantics [12]. Through the mechanism of materialization a knowledge base is updated and prepared for enabling query and retrieval tasks (already supported by many tools, e.g. by Sesame, Jena, etc.). This phase is assumed as preparatory w.r.t. learning: e.g. in the probabilistic materialization method proposed in [8], statements weighted by their estimated probabilities are elicited before the actual learning activity takes place.

The cost of the labeling process (often involving human experts) may make the goal of producing a fully labeled training set infeasible, whereas the acquisition of unlabeled data is relatively inexpensive. From this perspective, semi-supervised learning is a family of machine learning techniques that make use of both labeled and unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data. Indeed, using unlabeled data in conjunction with a small amount of labeled data can produce considerable improvements in accuracy. Semi-supervised learning approaches, such as self-training, co-training, generative models, transductive SVMs, and others [10], can inductively acquire new knowledge (likely true statements) by exploiting the information conveyed by the labeled instances, to be iteratively propagated.

B. Complexity and Efficiency

There is also an issue with the scalability of the underlying learning processes, given the complexity of the datasets in terms of both size and representation. The learning algorithms are often very efficient and generally scale well in terms of the overall number of instances that are available for learning. Moreover, such algorithms need not consider the whole population but only samples drawn from it. The complexity of the representation poses new challenges for machine learning practitioners, although advances are being made on relational learning, and on linked data in particular. In particular, statistical learning procedures based on graphical models (and approximate forms of probabilistic inference) can be adopted (see Sect. V). Some are inherently able to cope with sparse data (e.g. Bayesian methods or kernel machines [13]).

III. THE PROPOSED APPROACH

While the proposed probabilistic approaches [8], [14] are able to manage missing data and seem to scale on simple datasets (basic representations), we consider a setting where, in principle, multiple linked knowledge bases are involved. Therefore, we focus on an alternative setting that aims at the integral structure of the LOD in the Semantic Web. The general strategy underlying our method is made up of a number of steps:
• selection of the target attribute: since the learning sample can be thought of as structured like a table, one of its columns will be chosen as the objective of the predictive model to be learned;
• creation of a SPARQL query for the LOD: this query will have a fixed structure but random parameters, so that a heterogeneous result set will be returned (likely including many triples that have no value for the target property);
• retrieval of the result set: a table is built for the learning algorithm, containing also missing data; the values of a continuous target attribute may be discretized into classes referring to value intervals;
• run of the learning algorithm: this action will produce a model for predicting the target property values for further instances beyond those included in the sample;
• test and validation of the resulting model: the accuracy of the predictive model is evaluated on new unseen instances that did not take part in the training phase (e.g. part of the instances extracted before).
The core activities may be looped so that the training phase may take place incrementally over a number of stages, producing more and more precise models.

A. Self-Training

As mentioned, we resort to a semi-supervised learning strategy for the core action in the work-flow above. The semi-supervised learning strategy that will be utilized is based on self-training (also known as bootstrapping).
In a nutshell, this means that during the learning process the predictions made on the grounds of the current model can be re-used in order to provide a likely correct labeling for unlabeled data, which can then be used for training the next (more accurate) models. Let us consider instances x ∈ X (so far, no specific structure is dictated for the instances, e.g. X may correspond to a set of RDF triples), some of which may be assigned a certain label (e.g. the value of a target property for which the predictive model is to be built), say y ∈ Y. The learning problem can be stated as follows:

Definition 3.1 (semi-supervised learning problem): Given a bipartite training set made up of
• labeled instances L = {(x_i, y_i) | i = 1, ..., l}, and
• unlabeled instances U = {x_j | j = l+1, ..., l+u},
induce a predictor function (classifier) f : X → Y that correctly assigns the label to the (seen and unseen) instances.

Usually l ≪ u. Differently from transductive learning (see ch. 25 in [9]), in this case it is required that f be a good predictor also for unseen data not available during training (not only for the unlabeled instances). A generic self-training procedure (see also [10]) is sketched as Algorithm 1. The loop (steps 3–9) repeats the application of a wrapped supervised learning algorithm ALG to train a classifier f on the current set of labeled data L(t) (step 4). Then a set S is formed with the newly labeled instances which received a definite classification by f (step 5). These labeled instances are added to L(t) for the next run of ALG (step 6) and removed from U(t) (step 7). The loop closes when a stable situation is reached. In alternative simplified forms, the loop ends when no unlabeled instance remains, i.e. U(t) = ∅.

Algorithm 1 Self-training procedure.
Input: L: labeled data; U: unlabeled data; ALG: base learning algorithm;
Output: f: classifier;
1: t ← 0;
2: Let L(0) = {(x_i, y_i) | i = 1, ..., l} and U(0) = {x_j | j = l+1, ..., l+u};
3: repeat
4:   f ← ALG(L(t));
5:   S ← {x ∈ U(t) | f(x) definite};
6:   L(t+1) ← L(t) ∪ {(x, f(x)) | x ∈ S};
7:   U(t+1) ← U(t) \ S;
8:   t ← t + 1;
9: until U(t) = ∅;
10: return f

Note that the initial labeled set is left unchanged, reflecting the fact that the confidence in the labels assigned beforehand is maximal. However, there are alternative self-training algorithms that allow changing the labels of all instances (not only the unlabeled ones) depending on the current model. Observe that the choice made at step 5 depends on the quality of the predictions, hence a number of further variants is possible, such as:
• consider all instances (x, f(x)), weighted by a measure of confidence;
• consider all (x, f(x)) such that p(f(x) | x) > ζ, where the threshold ζ may be set to 1/ℓ (where ℓ is the number of labels);
• consider the n most likely examples (x, f(x)); the natural number n will be called the amplitude.
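To make the procedure concrete, Algorithm 1 can be sketched as a generic wrapper in Python. The base learner below is a toy 1-NN on numeric instances with a distance-based confidence, standing in for any supervised algorithm (e.g. one of the Weka learners used later); the confidence threshold plays the role of the "definite classification" test of step 5. The function names and the confidence heuristic are illustrative assumptions, not part of the actual implementation.

```python
def self_training(labeled, unlabeled, fit, predict_proba, threshold=0.7):
    """Generic self-training wrapper (sketch of Algorithm 1).

    labeled: list of (x, y) pairs; unlabeled: list of x.
    fit(labeled) -> model; predict_proba(model, x) -> (label, confidence).
    """
    L, U = list(labeled), list(unlabeled)
    while True:
        model = fit(L)                              # step 4: train on L(t)
        S = [(x, y) for x in U
             for (y, p) in [predict_proba(model, x)]
             if p > threshold]                      # step 5: definite predictions only
        if not S:
            return model                            # stable situation reached
        L += S                                      # step 6: grow the labeled set
        picked = {x for x, _ in S}
        U = [x for x in U if x not in picked]       # step 7: shrink U(t)

# Toy base learner: 1-NN on numbers; confidence decays with the distance
# to the nearest labeled instance (an illustrative heuristic).
def fit(L):
    return L                                        # lazy learner: just store L

def predict_proba(model, x):
    nx, ny = min(model, key=lambda it: abs(it[0] - x))
    return ny, 1.0 / (1.0 + abs(nx - x))
```

For instance, `self_training([(0.0, 'a'), (10.0, 'b')], [1.0, 2.0, 8.0, 9.0], fit, predict_proba, threshold=0.3)` progressively labels the unlabeled instances nearest to the two seeds and returns a model trained on all six.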

This is one of the simplest methods for semi-supervised learning. It has the advantage of applying to existing (even complex) classifiers as a wrapper method. The known disadvantage is that early mistakes could reinforce themselves. In such cases a heuristic solution may be to un-label an instance if its confidence falls below a threshold. Besides, convergence may not be guaranteed (closed forms are known only for special cases, e.g. linear functions).

B. Implementation

The proposed work-flow has been implemented so that the approach could be empirically evaluated. The schema depicted in Fig. 1 gives an idea of how these and further components interact in the system. The external components used in the implementation are the following:
• FactForge:1 a SPARQL end-point used as an interface between LOD knowledge bases and our application (support for other end-points has also been added);
• Jena: the framework was used to manage the RDF/RDFS models and the related SPARQL queries (through both end-points and files);
• Weka:2 a well-known framework that contains a collection of machine learning algorithms used to perform data mining tasks through the mentioned self-training strategy [3].
The core component has been developed to perform learning on datasets that share the same format, so that a comparison between the purely supervised and the semi-supervised modes would be possible. The Query Creation module is based on a randomizer that instantiates a query schema; once a query is obtained and validated, the related result set is retrieved by the SPARQL iterator interfacing the selected end-point (e.g. FactForge). The results populate a properly encoded table of instances, which represents a valid input for the algorithms contained in Weka. These tables can be exploited both for the standard supervised mode and for the implementation of the semi-supervised mode based on self-training illustrated before (see Sect. III-A).
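The interaction between the Randomizer and the Query Creation module can be sketched as follows. The query schema, the class/property URIs, and the use of OPTIONAL patterns (which let instances with a missing target value through, producing the unlabeled portion of the sample) are illustrative assumptions about the fixed-structure/random-parameters design described above, not the system's actual schema.

```python
import random

# Hypothetical query schema: fixed structure, random parameters.
# The OPTIONAL blocks allow results where the target property is missing,
# which yields the unlabeled instances needed for self-training.
QUERY_SCHEMA = """SELECT ?s ?target ?a1 ?a2 WHERE {{
  ?s a <{target_class}> .
  OPTIONAL {{ ?s <{target_property}> ?target . }}
  OPTIONAL {{ ?s <{attr1}> ?a1 . }}
  OPTIONAL {{ ?s <{attr2}> ?a2 . }}
}}
OFFSET {offset} LIMIT {limit}"""

def make_random_query(target_class, target_property, attrs, limit=500, rng=random):
    """Instantiate the schema with two randomly picked attributes and a random offset."""
    attr1, attr2 = rng.sample(attrs, 2)
    return QUERY_SCHEMA.format(target_class=target_class,
                               target_property=target_property,
                               attr1=attr1, attr2=attr2,
                               offset=rng.randrange(0, 10000), limit=limit)
```

The resulting query string can then be submitted to an end-point such as FactForge (e.g. through Jena or any SPARQL client), and the returned bindings tabulated into the instance table used by the learners.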
The random query is created so that instances of the target concept can be extracted from the LOD cloud. The query is validated so that it can be satisfied and the number of labeled instances is sufficient (w.r.t. a threshold) to trigger the next learning phases. The threshold is set as the product of the number of attributes of the result set and the number of training instances considered. Hence, instances with missing values occur with a cardinality between half the mentioned threshold and half of the overall number of results. The results are also validated so that a stratified sample is obtained (see [3], ch. 5), i.e. the amount of examples for

1 Available at http://factforge.net/sparql.
2 http://www.cs.waikato.ac.nz/ml/weka/.

Figure 1. System components interaction. (Components shown: the LOD cloud accessed through Virtuoso, FactForge, or a generic end-point; the Query Creator with its Randomizer; the SPARQL iterator based on Jena; the extracted Instances; and the Supervised and Semi-Supervised Learners built on Weka, which produce the Predictive Model.)

the values to be predicted per attribute reflects an estimate of the distribution in the overall population.

IV. EXPERIMENTAL EVALUATION

We report the setup and results of an experimental evaluation of our prototype system LODMINE, aiming at assessing the added value yielded by a semi-supervised learning strategy on a specific mining task, namely the prediction of a target attribute (property).

A. Setup

In order to compare the performance of learning algorithms working in semi-supervised mode w.r.t. the standard supervised mode, the same algorithms selected from the Weka suite were run on random datasets generated by ad hoc queries to the LOD cloud as described above (the FactForge end-point was used). For each learning problem, the extracted dataset instances were divided into training and test sets according to the standard cross-validation design [3]. The problems were crafted so as to present diverse datasets that could elicit strong and weak points of the various learners, especially in terms of the percentage of labeled and unlabeled instances. The rates of labeled and unlabeled instances, and the percentage of labeled instances belonging to the majority class, in the distributions of the resulting datasets (training sets) are reported in Tab. I.

Table I
FACTS CONCERNING THE DISTRIBUTIONS OF THE TRAINING SETS: PERCENTAGES OF LABELED AND UNLABELED INSTANCES AND INSTANCES BELONGING TO THE MAJORITY CLASS.

problem  %labeled  %unlabeled  %maj.class
#1       86.67     13.33       54.52
#2       81.81     18.19       52.19
#3       51.59     48.41       69.86
#4       83.24     16.76       53.58

Seven supervised base learning algorithms implemented in the Weka suite, belonging to diverse families, were considered in this experimental evaluation:
• JRIP: a Java implementation of the propositional rule learner RIPPER;
• NAIVE BAYES (NB): a Naive Bayes classifier using estimator classes;
• J4.8: a decision tree induction algorithm based on the well-known C4.5;
• REGRESSION (REG.): implementing an import of a regression model encoded in PMML (Predictive Model Markup Language), a standard of the Data Mining Group (DMG);
• DECISION TABLES (DT): a method for building decision table majority classifiers;
• KNN: k-nearest neighbors (with k = 10);
• SVM: a support vector machine with γ = 1/k and ε = 0.1.

More details on these implementations, parameters, and the respective original algorithms can be found in [3]. For each algorithm, experiments consisted of four learning problems aiming at the prediction of (pre-discretized) values for the target properties. While their vanilla version is used in the first session, in the second one the algorithms are utilized as base learners for the self-training strategy. In this latter case, we also investigated the possibility of varying the number of elements of highest confidence taken into account by the self-training algorithm (see Algorithm 1, step 5). This number was varied from 5 to 30. This roughly corresponds to going from a more cautious to a more credulous behavior of the system w.r.t. accepting the labeling suggested by the

predictive models produced at a certain stage of training. In both the supervised and semi-supervised sessions the obtained predictor models (classifiers) were subjected to an evaluation based on a stratified eight-fold cross-validation, during which the original distribution of the dataset was preserved in each fold [3]. As suggested in [15], together with accuracy, another possible performance metric for this setting is Cohen's κ coefficient [16], computed from the confusion matrix obtained by applying the trained predictor to the instances in the test set.

B. Results

We start off presenting the session where only supervised learning on labeled training instances was performed. Then we present the session involving the same base algorithms exploited in semi-supervised learning mode, so that the unlabeled instances are also exploited. The aim is to show that semi-supervised learning is beneficial in terms of the adopted metrics.

Supervised Learning Experiments: Tab. II reports the average outcomes, in terms of accuracy, of the experimental evaluation carried out by running the base learning algorithms on the labeled instances of the training set and testing the resulting predictive model on the remaining ones. The table shows how difficult it is, for some algorithms, to induce models with a decent predictive accuracy in supervised mode when applied to this type of very sparse dataset. On average, problem #1 seems the least difficult while #4 is the hardest. There is no significant difference in the performance of the various algorithms, as testified by the low variance of the results per problem. Comparing the average performance of the single algorithms over the four problems (figures not reported in the table), there seems to be no significant difference, with values ranging between a minimum of 65.95 (Naive Bayes) and a maximum of 68.58 (Regression).
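The κ statistic used alongside accuracy is the standard Cohen coefficient, κ = (p_o − p_e)/(1 − p_e), where p_o is the observed agreement (the diagonal of the confusion matrix) and p_e the agreement expected by chance from the marginals. A minimal computation, not tied to this system:

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: actual, cols: predicted)."""
    n = float(sum(sum(row) for row in confusion))
    p_o = sum(confusion[i][i] for i in range(len(confusion))) / n      # observed agreement
    p_e = sum((sum(confusion[i]) / n) * (sum(r[i] for r in confusion) / n)
              for i in range(len(confusion)))                          # chance agreement
    return (p_o - p_e) / (1.0 - p_e)
```

A classifier that always predicts the majority class scores κ close to 0 even when its accuracy is high, which is why κ is informative on skewed datasets such as these.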
As mentioned above, the results of the experiments in supervised learning mode have also been evaluated in terms of Cohen's κ coefficient, ranging in the interval [−1, 1], where a value close to 1 means that the predictive model classifies most of the test instances correctly (diagonal of the confusion matrix). The results are reported in Tab. III. With this measure it seems that problem #3 is really the hardest. Indeed, by looking back at Tab. I, we observe that this is the problem with the most unlabeled instances. With this measure the performance of the models learned by JRIP, J4.8 and DT appears comparable to that observed with those induced by REG. However, the highest variability in the outcomes is observed for problem #1, on which the algorithms perform better than on the other problems.

Semi-Supervised Learning: Tab. V reports the outcomes of the experimental evaluation carried out using the base learning modules within the semi-supervised algorithm of

the self-training strategy presented before. We have introduced another factor of variability, the amplitude, that is the cardinality of the set of unlabeled instances that are labeled in a given round: those ranking with the highest confidence in the given label. The main goal of this empirical evaluation was to show the added value yielded by adopting a semi-supervised learning strategy, which can also avail itself of the information conveyed by the many unlabeled instances that are typical for these datasets, as discussed in the beginning. The table immediately shows this point if we compare the outcomes with those reported in Tab. II. The average improvement of all the models produced by the algorithms is evident, especially for some that performed worse in supervised mode. One noteworthy exception is the case of Naive Bayes, for which no such general improvement was observed, and even a slight degradation of the results for some problems. The simple KNN, instead, led to improvements of up to 10 points; this fact, together with its simplicity (no real training is needed, only an efficient storage/retrieval of the labeled instances), naturally suggests it as a good candidate for implementation with more sophisticated semi-supervised strategies. If we consider the dimension of the amplitude, no clear indication comes from the results in the table, as no monotonic increase or decrease was observed varying the amplitude. Moreover, the differences among the observed values are not appreciable. Some algorithms seem more stable than others w.r.t. this parameter. Also considering Tab. IV, which aggregates the results in Tab. V by averaging over the different problems, we conclude that all algorithms tend to improve their performance in semi-supervised mode independently of the particular amplitude adopted. Again, KNN is the algorithm which showed the best improvement and the best average performance (with JRIP and J4.8 close enough).

V. RELATED WORK

Semantic Web Mining constitutes a sub-field of Data Mining and Semantic Web research that has spanned the last decade [6]. Initially it targeted the semantics hidden in the documents published on the Web, with techniques borrowed from text mining. With the advent of the Web of data [1], new efforts have focused on mining ontologies in the Semantic Web. Adaptations of statistical relational learning methods to graphs or other standard representations backing the ontologies in the Semantic Web have provided new techniques for mining structured knowledge bases. By defining a Semantic Web Mining work-flow [2], existing techniques can be adapted in a principled way. In [14], [8], based on the graphical representation used in the context of the RDF data model, the authors define a statistical learning framework and derive a data matrix on which machine learning techniques are applicable. In particular

Table II
RESULTS (IN TERMS OF ACCURACY) OF THE EXPERIMENTS WITH THE BASE LEARNERS IN SUPERVISED MODE: AVERAGE AND RESPECTIVE STANDARD DEVIATION VALUES.

problem  JRIP   NB     J4.8   REG    DT     KNN    SVM    avg.   std.dev.
#1       69.63  71.04  75.20  75.26  75.28  68.69  73.54  72.66  2.84
#2       66.30  63.01  66.20  68.11  64.59  66.91  67.38  66.07  1.74
#3       70.24  68.87  69.88  69.73  70.19  69.51  68.43  69.55  0.68
#4       59.03  60.89  60.33  61.21  61.10  59.37  63.58  60.79  1.49

Table III
RESULTS (IN TERMS OF THE κ STATISTIC) OF THE EXPERIMENTS WITH THE BASE LEARNERS IN SUPERVISED MODE: AVERAGE AND RESPECTIVE STANDARD DEVIATION VALUES.

problem  JRIP  NB    J4.8  REG   DT    KNN   SVM   avg.  std.dev.
#1       0.43  0.30  0.50  0.47  0.46  0.33  0.30  0.40  0.08
#2       0.39  0.36  0.34  0.40  0.38  0.26  0.30  0.35  0.05
#3       0.03  0.09  0.03  0.00  0.02  0.08  0.06  0.04  0.03
#4       0.23  0.21  0.20  0.22  0.20  0.21  0.18  0.21  0.02

they focus on statistical learning algorithms for multivariate prediction, which they utilize to estimate the truth values of assertions that do not explicitly occur in the knowledge base. Learned tuples and their certainty values are stored so that they can possibly be integrated in querying. The authors investigate matrix completion based on techniques like singular value decomposition, non-negative matrix factorization (NNMF) and latent Dirichlet allocation (LDA).

Further Bayesian approaches exploiting the correlations between property subject and object values are investigated in [17], which presents a combination of Bayesian networks and multidimensional histograms that is able to identify the probability of these dependencies; BAY-HIST implements multidimensional histograms in order to aggregate the data associated with each node in the network.

The same authors have lately explicitly targeted the case of the LOD. In [18], [19] a graph-sampling method is proposed that models linked data as a Bayesian network and implements a Direct Sampling reasoning algorithm to approximate the ranking scores of the network. This is implemented in the BIONAV system. Adding statistical learning and inferencing capabilities to the standard reasoning techniques is the goal of SPARQL-ML [7], a framework for adding data mining support to SPARQL that was inspired by Microsoft's Data Mining Extension (DMX). This framework enables the prediction and classification of unseen data and relations in a new dataset based on a mining model induced through statistical relational learning methods, taking relations between the individual resources into account, without a prior propositionalization phase, which is recognized as an error-prone task.

Other non-parametric techniques based on kernels or similarity measures have been proposed [20], [21]. These methods are based on the definition of semantic similarity measures or kernel functions that encode forms of similarity for individuals w.r.t. a context of discerning features. Moreover, methods for training neural-symbolic classification models have also been exploited for predicting class-membership [22]. The advantage of these methods is that they can be applied especially to ontologies expressed in rich languages (based on OWL), involving not only the structured knowledge of the data graph but also the background knowledge given by the ontologies through reasoning.

VI. CONCLUSIONS AND OUTLOOK

Given the problem of performing data mining on linked open data, one of the key issues is represented by missing data and their sparseness. These issues motivated the investigation of the application of the semi-supervised learning approach to this type of data. In this paper we have presented a machine learning approach to mining tasks on LOD through the self-training strategy. The approach has been empirically evaluated with an extensive experimentation showing the superiority of the semi-supervised approach over the supervised one. The encouraging results obtained with simple algorithms, e.g. kNN, wrapped in a semi-supervised strategy, suggest them as good candidates for implementation with more sophisticated semi-supervised approaches. For instance, we are considering the usage of co-training. Future work will also concern an extension of the approach towards different tasks such as clustering or ranking.

ACKNOWLEDGMENTS

The authors are thankful to Carlo Suglia for his work on the implementation and evaluation of the proposed methods during his internship at the LACAM lab.

REFERENCES

[1] T. Berners-Lee, "Linked data," W3C Design Issues, 2006, http://www.w3.org/DesignIssues/LinkedData.html.

Table IV
AVERAGE RESULTS OVER THE FOUR PROBLEMS IN SEMI-SUPERVISED MODE FOR THE SIX AMPLITUDE LEVELS.

ampl.  JRIP   NB     J4.8   REG    DT     KNN    SVM
5      72.75  65.67  72.42  70.69  71.53  72.57  66.42
10     72.72  65.66  72.24  70.59  71.47  72.70  66.44
15     72.13  65.68  72.33  70.84  71.19  72.79  66.41
20     72.25  65.67  72.26  70.83  71.36  72.72  66.29
25     72.58  65.60  72.35  70.70  71.40  72.79  66.83
30     72.80  65.60  72.43  70.81  71.17  72.77  66.48
avg.   72.54  65.65  72.34  70.74  71.35  72.72  66.48

Table V
RESULTS OF THE EXPERIMENTS (IN TERMS OF ACCURACY) WITH THE BASE LEARNERS IN SEMI-SUPERVISED MODE FOR THE SIX AMPLITUDE LEVELS.

problem #1
ampl.  JRIP   NB     J4.8   REG    DT     KNN    SVM
30     77.70  71.41  78.28  74.47  77.52  78.43  70.74
25     77.04  71.44  77.91  74.47  77.46  78.43  70.83
20     77.15  71.41  77.91  74.47  77.52  78.43  70.74
15     78.46  71.41  77.91  74.47  77.52  78.43  70.80
10     77.30  71.41  77.91  74.47  77.46  78.43  70.83
5      77.56  71.47  78.28  74.47  77.46  78.21  70.77

problem #2
ampl.  JRIP   NB     J4.8   REG    DT     KNN    SVM
30     72.20  67.65  72.19  71.45  71.73  70.70  63.95
25     72.32  67.62  72.02  71.45  71.73  70.62  65.79
20     72.37  67.65  72.19  71.45  71.73  70.65  63.74
15     72.22  67.72  72.38  71.39  71.73  70.76  64.24
10     72.17  67.72  72.02  71.45  71.73  70.53  64.27
5      72.33  67.69  72.38  71.32  71.73  70.42  64.10

problem #3
ampl.  JRIP   NB     J4.8   REG    DT     KNN    SVM
30     72.31  71.40  72.93  71.96  71.69  72.90  70.90
25     72.62  71.40  73.08  72.06  71.69  73.15  70.81
20     72.13  71.40  72.62  72.11  71.69  72.78  70.92
15     71.90  71.40  72.62  72.19  71.69  72.96  70.93
10     72.31  71.36  72.62  72.06  71.69  72.93  71.04
5      72.74  71.52  72.70  71.80  71.90  72.76  71.17

problem #4
ampl.  JRIP   NB     J4.8   REG    DT     KNN    SVM
30     68.99  51.91  66.32  65.34  63.75  69.04  60.32
25     68.32  51.96  66.40  64.83  64.71  68.95  59.91
20     67.35  52.20  66.30  65.29  64.50  69.00  59.74
15     65.93  52.20  66.40  65.29  63.83  69.01  59.69
10     69.11  52.16  66.40  64.38  64.98  68.91  59.63
5      68.38  52.00  66.32  65.17  65.01  68.89  59.66

[2] V. Tresp, M. Bundschus, A. Rettinger, and Y. Huang, “Towards Machine Learning on the Semantic Web,” in Uncertainty Reasoning for the Semantic Web I, ser. LNAI/LNCS, P. da Costa et al., Eds. Springer, 2008, vol. 5327, ch. 17, pp. 282–314.

[7] C. Kiefer, A. Bernstein, and A. Locher, “Adding data mining support to SPARQL via statistical relational learning methods,” in Proceedings of the 5th European Semantic Web Conference, ESWC2008, ser. LNCS, S. Bechhofer et al., Eds., vol. 5021. Springer, 2008, pp. 478–492.

[3] I. H. Witten and E. Frank, Data Mining, 2nd ed. Morgan Kaufmann, 2005.

[4] P. Buitelaar and P. Cimiano, Eds., Ontology Learning and Population: Bridging the Gap between Text and Knowledge, ser. Frontiers in Artificial Intelligence and Applications. IOS Press, 2008, vol. 167.

[5] J. Euzenat and P. Shvaiko, Ontology Matching. Springer, 2007.

[6] B. Berendt, A. Hotho, and G. Stumme, "Towards semantic web mining," in Proceedings of the 1st International Semantic Web Conference, ISWC2002, ser. LNCS, I. Horrocks and J. Hendler, Eds., vol. 2342. Springer, 2002, pp. 264–278.

[8] V. Tresp, Y. Huang, M. Bundschus, and A. Rettinger, "Materializing and querying learned knowledge," in Proceedings of the 1st ESWC Workshop on Inductive Reasoning and Machine Learning for the Semantic Web, IRMLeS09, ser. CEUR Workshop Proceedings, C. d'Amato et al., Eds., vol. 474, Heraklion, Greece, 2009.

[9] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning. MIT Press, 2006.

[10] X. Zhu and A. Goldberg, Introduction to Semi-Supervised Learning, ser. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool, 2009, vol. 3, no. 1.

[11] F. Donini, M. Lenzerini, D. Nardi, and W. Nutt, "An epistemic operator for description logics," Artificial Intelligence, vol. 100, no. 1-2, pp. 225–274, 1998.

[12] A. Kiryakov, "Measurable targets for scalable reasoning," LarKC Project, Deliverable D5.5.1, 2008, http://www.larkc.eu/resources/deliverables/.

[13] A. Rettinger, U. Lösch, V. Tresp, C. d'Amato, and N. Fanizzi, "Mining the semantic web - statistical learning for next generation knowledge bases," Data Min. Knowl. Discov., vol. 24, no. 3, pp. 613–662, 2012.

[14] A. Rettinger, M. Nickles, and V. Tresp, "Statistical relational learning with formal ontologies," in Proceedings of the European Conference, ECML PKDD 2009, Part II, ser. LNAI/LNCS, W. L. Buntine et al., Eds., vol. 5782. Springer, 2009, pp. 286–301.

[15] C. d'Amato, N. Fanizzi, and F. Esposito, "A note on the evaluation of inductive concept classification procedures," in Proceedings of the 5th Workshop on Semantic Web Applications and Perspectives, SWAP2008, ser. CEUR Workshop Proceedings, A. Gangemi et al., Eds., vol. 426. CEUR-WS.org, 2008.

[16] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.

[17] E. Ruckhaus and M.-E. Vidal, "The BAY-HIST prediction model for RDF documents," in Proceedings of the 2nd ESWC Workshop on Inductive Reasoning and Machine Learning for the Semantic Web, IRMLeS10, ser. CEUR Workshop Proceedings, C. d'Amato et al., Eds., vol. 611. CEUR-WS.org, 2010, pp. 30–41.

[18] M.-E. Vidal, L. Raschid, L. Ibáñez, H. Rodríguez, J. Rivera, and E. Ruckhaus, "A ranking-based approach to discover semantic associations between linked data," in Proceedings of the 2nd ESWC Workshop on Inductive Reasoning and Machine Learning for the Semantic Web, IRMLeS10, ser. CEUR Workshop Proceedings, C. d'Amato et al., Eds., vol. 611. CEUR-WS.org, 2010, pp. 18–29.

[19] M.-E. Vidal, L. Raschid, N. Marquez, J. C. Rivera, and E. Ruckhaus, "BioNav: An ontology-based framework to discover semantic links in the cloud of linked data," in Proceedings of the 7th Extended Semantic Web Conference, ESWC2010, ser. LNCS, L. Aroyo et al., Eds., vol. 6089. Springer, 2010, pp. 441–445.

[20] C. d'Amato, N. Fanizzi, and F. Esposito, "Query answering and ontology population: An inductive approach," in Proceedings of the 5th European Semantic Web Conference, ESWC2008, ser. LNCS, S. Bechhofer et al., Eds., vol. 5021. Springer, 2008, pp. 288–302.

[21] N. Fanizzi, C. d'Amato, and F. Esposito, "Statistical learning for inductive query answering on OWL ontologies," in Proceedings of the 7th International Semantic Web Conference, ISWC2008, ser. LNCS, A. Sheth et al., Eds., vol. 5318. Springer, 2008, pp. 195–212.

[22] N. Fanizzi, C. d'Amato, and F. Esposito, "Inductive classification of semantically annotated resources through reduced coulomb energy networks," Int. J. Semantic Web Inf. Syst., vol. 5, no. 4, pp. 19–38, 2009.