Discovering Latent Structure in Clinical Databases

Jesse Davis Dept. of Computer Science Katholieke Universiteit Leuven 3000 Leuven, Belgium [email protected] Vitor Santos Costa Dept. of Computer Science Universidade do Porto Porto, Portugal [email protected]

Elizabeth Berg, David Page Dept. of Biostatistics and Medical Informatics University of Wisconsin-Madison Madison, WI {berg,page}@biostat.wisc.edu Peggy Peissig and Michael Caldwell Marshfield Clinic Marshfield, WI {peissig.peggy,caldwell.michael}@marshfieldclinic.org

Abstract

Statistical relational learning allows algorithms to simultaneously reason about complex structure and uncertainty within a given domain. One common challenge when analyzing these domains is the presence of latent structure within the data. We present a novel algorithm that automatically groups together different objects in a domain in order to uncover latent structure, including a hierarchy or even a heterarchy. We empirically evaluate our algorithm on three large real-world tasks where the goal is to predict whether a patient will have an adverse reaction to a medication. We found that the proposed approach produced a more accurate model than the baseline approach. Furthermore, we found latent structure that a medical collaborator deemed relevant and interesting.

1 Introduction

Statistical relational learning (SRL) [1] focuses on developing learning and reasoning formalisms that combine the benefits of relational representations (e.g., first-order logic) with those of probabilistic, graphical models. SRL is especially applicable to domains where (1) it is important to explicitly model uncertainty (e.g., the probabilities of predictions the model makes) and (2) the data resides in a relational database, where information from each table is needed to learn an accurate model. One particularly challenging aspect of analyzing real-world relational data is the presence of latent structure. Accurately detecting and modeling this type of structure requires combining ideas from latent variable discovery in graphical models (e.g., [2, 3]) and predicate invention in inductive logic programming (e.g., [4, 5]). Despite the fact that many domains have latent structure, very few learning algorithms [6, 7, 8] can effectively cope with its presence in rich relational domains.

To motivate the need to discover latent structure, consider the task of analyzing electronic medical records (EMRs). An EMR is a relational database that stores a patient’s clinical history (e.g., disease diagnoses, prescriptions, lab test results, etc.). Recently, there has been much interest in applying machine learning to medical data [9]. When treating a patient, a doctor must decide among many different medications that could be prescribed. These medications share varying degrees of similarity in terms of (i) how they interact with the human body to counteract disease, (ii) how they interact with other drugs, and (iii) how they interact with genetic and other variations from one human body to another. In terms of medication, a patient’s clinical history records only which drugs were prescribed for the patient (e.g., name, dosage, duration). Consequently, it can be difficult to detect interesting and meaningful patterns present in the data. For example, there may be a substantial number of people who take a class of medicines, such as statins for cholesterol, and also have a certain outcome (e.g., disease diagnosis, adverse drug reaction, etc.). The data only records which specific medicine each patient has been prescribed, and the number of people who take each individual medicine and have the outcome may be too small to meet an interestingness threshold (e.g., support threshold in association rule mining). What is missing is the ability to automatically detect related medicines and group them together.

Requiring a domain expert to hand-craft all relevant features or relations necessary for a problem is a difficult and often infeasible task. For example, should drugs be grouped by which disease they treat, by which mechanism they use, by potential side-effects or by interactions with other drugs? Ideally, the learning algorithm should automatically discover and incorporate relevant features and relations.

We propose a novel approach for automatically discovering latent structure in relational domains. Our approach represents latent structure by grouping together, possibly hierarchically, sets of objects in a domain. The proposed approach dynamically introduces latent structure in a data-driven fashion. It automatically groups together objects and/or already-existing groups by evaluating whether the proposed grouping results in a more accurate learned model. Furthermore, each object can appear in multiple different groupings, as an object may appear in multiple contexts (e.g., a drug could be in different groupings related to its mechanism, indications, contraindications, etc.). A central challenge in the domains we consider is that they contain a large number of objects (e.g., diseases, drugs). Consequently, searching for latent relationships among all objects at once is prohibitively expensive.
We introduce several search strategies that help automatically identify a small but promising subset of objects that could belong to the same latent group.

We motivate and evaluate our approach on the specific task of predicting adverse drug reactions (ADRs) from EMR data. This application is timely for a number of reasons. First, EMRs are now becoming widespread, making large amounts of this data available for research. Second, adverse drug reactions are a major risk to health, quality-of-life and the economy. ADRs are the fourth-leading cause of death in the United States. The pain reliever Vioxx™ alone was earning US$2.5 billion per year before it was found to double the risk of heart attack and was pulled from the market, while the related drug Celebrex™ also raised this risk and remains on the market. Third, accurate predictive models for ADRs are actionable: if found to be accurate in a prospective trial, such a model could easily be incorporated into EMR-based medical practices to avoid giving a drug to those at highest risk of an ADR. Using three real-world ADR tasks, we demonstrate that the proposed approach produces a more accurate model than the baseline approach. Furthermore, our algorithm uncovers latent structure that a doctor, who has expertise in our tasks of interest, deems to be interesting and relevant.

2 Background

The proposed approach builds on the SRL algorithm VISTA [10], which combines automated feature construction and model learning into a single, dynamic process. VISTA uses first-order definite clauses, which can capture relational information, to define (binary) features. These features then become nodes in a Bayesian network. VISTA only selects those features that improve the underlying statistical model. It selects features by performing the following iterative procedure until some stopping criterion has been met. The feature induction module proposes a set of candidate features to include in the model. VISTA evaluates each feature, f, by learning a new model (i.e., the structure of the Bayesian network) that incorporates f. It then estimates the generalization ability of the model with and without the new feature by calculating the area under the precision-recall curve on a tuning set (in principle, any metric is possible). VISTA selects the feature that results in the largest improvement in the model’s score and incorporates it into the model. If no feature improves the score, the process terminates.

VISTA offers several advantages for analyzing clinical data. First, the first-order rules can incorporate information from different relations within a single rule. Second, by incorporating each rule into a Bayesian network, VISTA can capture the inherent uncertainty in the data. Third, the learned rules are comprehensible to domain experts.

2.1 Statistical Model

Bayesian networks are probabilistic graphical models that encode a joint probability distribution over a set of random variables. Given a set of random variables $X = \{X_1, \ldots, X_n\}$, a Bayesian network $B = \langle G, \Theta \rangle$ is defined as follows. $G$ is a directed, acyclic graph that contains a node for each variable $X_i \in X$. For each variable (node) in the graph, the Bayesian network has a conditional probability distribution (CPD), $\theta_{X_i \mid \mathrm{Parents}(X_i)}$, giving the probability distribution over the values that variable can take for each possible setting of its parents, and $\Theta = \{\theta_{X_1}, \ldots, \theta_{X_n}\}$. A Bayesian network $B$ encodes the following probability distribution:

$$P_B(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Parents}(X_i)).$$

VISTA needs to learn the structure of a Bayesian network. That is, given a data set, it must learn both the network structure $G$ (i.e., the arcs between different variables) and the CPDs, $\theta_{X_i \mid \mathrm{Parents}(X_i)}$, for each node in the network. VISTA uses tree augmented naïve Bayes (TAN) [11] as its model. In a TAN model, there is a directed arc from the class variable to each non-class attribute (i.e., variable) in the domain. Furthermore, each non-class attribute may have at most one other parent, which allows the model to capture a limited set of dependencies between attributes. The algorithm for learning a TAN model has two nice theoretical properties [11]. First, it finds the TAN model that maximizes the log likelihood of the network structure given the data. Second, it finds this model in polynomial time.
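To make the factorization concrete, the following minimal sketch evaluates a toy TAN model with a binary class C (the ADR outcome) and two binary features. The structure (C is a parent of F1 and F2, and F1 is the one extra parent of F2), the variable names, and every probability are illustrative assumptions, not values from VISTA or the data.

```python
# Toy TAN model: class C (ADR) and binary features F1, F2.
# Assumed structure for illustration: C -> F1, C -> F2, F1 -> F2.
# All probabilities are made-up numbers, not values from the paper.

P_C = {1: 0.1, 0: 0.9}  # P(C)

# P(F1 | C): outer key is the value of C, inner key is the value of F1.
P_F1_GIVEN_C = {
    1: {1: 0.6, 0: 0.4},
    0: {1: 0.2, 0: 0.8},
}

# P(F2 | C, F1): outer key is (C, F1), inner key is the value of F2.
P_F2_GIVEN_C_F1 = {
    (1, 1): {1: 0.7, 0: 0.3},
    (1, 0): {1: 0.5, 0: 0.5},
    (0, 1): {1: 0.3, 0: 0.7},
    (0, 0): {1: 0.1, 0: 0.9},
}


def joint(c, f1, f2):
    """P(C=c, F1=f1, F2=f2) as the product of one CPD entry per node."""
    return P_C[c] * P_F1_GIVEN_C[c][f1] * P_F2_GIVEN_C_F1[(c, f1)][f2]


def posterior_adr(f1, f2):
    """P(C=1 | F1=f1, F2=f2), obtained by normalizing over both class values."""
    positive = joint(1, f1, f2)
    return positive / (positive + joint(0, f1, f2))


# The joint must sum to one over all eight assignments.
total = sum(joint(c, a, b) for c in (0, 1) for a in (0, 1) for b in (0, 1))
```

With these numbers, a patient for whom both features fire has a posterior ADR probability of 0.042 / (0.042 + 0.054) = 0.4375, compared with the prior of 0.1.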

2.2 Defining Features

VISTA uses formulas in first-order logic to define features. Technically, it uses the non-recursive Datalog subset of first-order logic, which with a closed-world assumption is equivalent to relational algebra. The alphabet for Datalog consists of three types of symbols: constants, variables, and predicates. Constants (e.g., the drug name Propranolol), which start with an upper case letter, denote specific objects in the domain. Variable symbols (e.g., disease), denoted by lower case letters, range over objects in the domain. Predicate symbols P/n, where n refers to the arity of the predicate and n ≥ 0, represent relations among objects. An example of a predicate is Diagnosis/3. A term is a constant or variable. If P/n is a predicate with arity n and t1, ..., tn are terms, then P(t1, ..., tn) is an atomic formula. An example of an atomic formula is Diagnosis(Patient123, 12-12-2010, Tuberculosis), where all three arguments are constants. This says that patient Patient123 had a diagnosis of Tuberculosis on December 12, 2010. A literal is an atomic formula or its negation. A clause is a disjunction over a finite set of literals. A definite clause is a clause that contains exactly one positive literal; it can be written as a conjunction of atomic formulas that implies another atomic formula (the positive literal), as follows:

Drug(pid, date1, Ketoconazole) ∧ WithinMonth(date1, date2) ⇒ ADR(pid, date2)

All variables in a definite clause are assumed to be universally quantified. VISTA uses definite clauses to define features for the statistical model. Each definite clause becomes a binary feature in the underlying statistical model. The feature receives a value of one for a specific patient if data about that patient can be used to satisfy (i.e., prove) the clause and it receives a value of zero otherwise. Feature definitions are constructed in the standard, top-down (i.e., general-to-specific) manner.
We briefly describe the approach here (see [12] for more details). Each induced rule begins by just containing the target attribute (in the above example this is ADR(pid, date2)) on the right-hand side of the implication. This means that the feature matches all examples. The induction algorithm then follows an iterative procedure. It generates a set of candidate refinements by conjoining predicates to the left-hand side of the rule. This has the effect of making the feature more specific (i.e., it matches fewer examples). The search can proceed in a breadth-first, best-first, or greedy (no backtracking) manner, but we employ a breadth-first search in this paper.
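As a rough illustration of how a definite clause becomes a binary feature, the sketch below evaluates the example clause Drug(pid, date1, Ketoconazole) ∧ WithinMonth(date1, date2) ⇒ ADR(pid, date2) over a handful of invented facts. The fact tables, the patient identifiers, and the 30-day reading of WithinMonth are assumptions made for the example; VISTA itself proves clauses with a logic engine rather than with hand-written loops.

```python
from datetime import date

# Hypothetical facts Drug(pid, date1, drug); invented for illustration.
DRUG = [
    ("p1", date(2010, 1, 5), "Ketoconazole"),
    ("p2", date(2010, 3, 1), "Propranolol"),
    ("p3", date(2010, 6, 10), "Ketoconazole"),
]

# One example per patient: the date2 bound by the head ADR(pid, date2).
EXAMPLES = {
    "p1": date(2010, 1, 20),
    "p2": date(2010, 3, 15),
    "p3": date(2010, 9, 1),
}


def within_month(date1, date2):
    """WithinMonth(date1, date2), read here as: date2 is 0 to 30 days after date1."""
    return 0 <= (date2 - date1).days <= 30


def feature_value(pid):
    """Return 1 if the clause body can be satisfied for this patient, else 0."""
    date2 = EXAMPLES[pid]
    for drug_pid, date1, drug in DRUG:
        if drug_pid == pid and drug == "Ketoconazole" and within_month(date1, date2):
            return 1
    return 0


values = [feature_value(pid) for pid in ("p1", "p2", "p3")]
```

Here only p1 receives a value of one: p2 took a different drug, and p3 took Ketoconazole more than a month before the example date. Conjoining a further literal to the body can only turn ones into zeros, which is the sense in which refinement makes a feature more specific.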

3 Algorithm Description

At a high level, the key innovation of our proposed approach, LUCID (Latent Uncertain Concept Invention on-Demand), occurs when constructing feature definitions. Here, the algorithm has the ability to invent (hierarchical) clusters that pertain to a subset of the constants in the domain. Intuitively, constants that appear in the same grouping share some latent relationship. Discovering and exploiting the latent structure in the feature definitions provides several benefits. First, it allows for more compact feature definitions. Second, by aggregating across groups of objects, it helps identify important features that may not otherwise be deemed relevant by the learning algorithm.

To illustrate the intuition behind our approach, we will use a running example about ADRs to the medication Warfarin™, which is a blood thinner commonly prescribed to patients at risk of having a stroke. However, Warfarin is known to increase the risk of internal bleeding for some patients. Consider the following feature definition:

Drug(pid, date1, Terconazole) ∧ Weight(pid, date1, w) ∧ w < 120 ⇒ ADR(pid)    (1)

This rule applies only to those patients who satisfy all the conditions on the left-hand side of the rule. Conditioning on whether a patient has been prescribed Terconazole limits the applicability of this rule. Terconazole is an enzyme inducer, which is a type of medication known to elevate a patient’s sensitivity to Warfarin. However, many other drugs in the enzyme inducer class (e.g., Rifampicin and Ketoconazole) are frequently prescribed instead of Terconazole, which makes this feature overly specific. A potentially stronger feature would replace Terconazole with an invented concept such as enzyme inducer or Warfarin elevator. Yet, these concepts are not explicitly encoded in clinical data. By grouping together related objects, LUCID captures latent structure and is able to learn more general features. For example, we could generalize the previous rule as follows:

Cluster1(did) ∧ Drug(pid, date1, did) ∧ Weight(pid, date1, w) ∧ w < 120 ⇒ ADR(pid)    (2)

The definition for Cluster1 represents latent structure among a group of medicines.

3.1 Representing Latent Structure

The goal of our approach is to capture hierarchical latent structure about specific constants (i.e., objects) in the domain. First, we want to capture that specific constants are interchangeable in some cases. For example, Terconazole, Rifampicin and Ketoconazole are all enzyme inducers, and a doctor could reasonably prescribe any of them. We can accomplish this by introducing a new concept, which we generically call Cluster1, as follows:

Cluster1(Terconazole)
Cluster1(Rifampicin)        definition of Cluster1    (3)
Cluster1(Ketoconazole)

These statements simply assign these drugs to Cluster1. There is no limit on the number of objects that can be assigned to each invented cluster.

Second, we want to be able to make use of previously discovered concepts to represent more high-level, hierarchical structure. We can do this in the following manner:

Cluster2(Propranolol)
Cluster2(Alpranolol)        definition of Cluster2    (4)
Cluster1(x) ⇒ Cluster2(x)

Just as before, the first two statements assign specific drugs to Cluster2. The key step is the third statement, where all the constants that have been assigned to Cluster1 are assigned to Cluster2 as well. Once a proposed grouping has been used in a feature that has been included in the model, it is available for future reuse during the learning procedure. Reusing previously discovered concepts allows the algorithm to automatically explore tradeoffs between fine-grained groupings (e.g., enzyme inducers) and more high-level groupings (e.g., Warfarin elevators) that may be present in the data. Furthermore, it allows the algorithm to build progressively more complex concepts over time.
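One way to picture definitions (3) and (4) is as a table of direct assignments plus a table of inclusion rules. The sketch below is a minimal illustration under that assumption; the dictionary layout and the function name `extension` are hypothetical and are not taken from the LUCID implementation.

```python
# Direct assignments: cluster name -> constants assigned by ground statements.
MEMBERS = {
    "Cluster1": {"Terconazole", "Rifampicin", "Ketoconazole"},
    "Cluster2": {"Propranolol", "Alpranolol"},
}

# Inclusion rules: cluster name -> clusters whose members it inherits.
# The entry below encodes Cluster1(x) => Cluster2(x).
INCLUDES = {"Cluster2": {"Cluster1"}}


def extension(cluster, _seen=None):
    """Return every constant that satisfies `cluster`, following inclusion rules."""
    seen = set() if _seen is None else _seen
    if cluster in seen:  # guard against accidentally cyclic definitions
        return set()
    seen.add(cluster)
    result = set(MEMBERS.get(cluster, ()))
    for sub_cluster in INCLUDES.get(cluster, ()):
        result |= extension(sub_cluster, seen)
    return result


cluster1 = extension("Cluster1")
cluster2 = extension("Cluster2")
```

Because nothing prevents a constant from being listed under several cluster names, or a cluster from being included by several parents, the same layout also covers overlapping groupings rather than only a strict tree.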

3.2 Learning Latent Structure

The key step in the algorithm is discovering the latent structure. Given a feature definition such as Rule (1), latent structure is learned in the following way. First, LUCID rewrites the feature definition by replacing the reference to the specific constant with a variable and conjoining an invented latent structure predicate to the rule. For example, Rule (1) would be transformed into Rule (2), where Rule (2) has the variable did instead of the constant Terconazole and contains the invented predicate Cluster1(did).

Second, LUCID learns a definition for the invented latent predicate (e.g., Cluster1 in Rule (2)). It begins by assigning the replaced constant to this cluster, which in the running example corresponds to this statement: Cluster1(Terconazole). Next, it tries to extend the definition of the cluster by identifying a set of candidate constants that could be added to the cluster. It adds each constant to the cluster in turn. The benefit of the modified cluster is measured by checking whether the model, which includes the feature that makes use of the extended cluster definition, improves. LUCID greedily selects the single constant that results in the largest improvement in the model’s score. This procedure iterates until no addition improves the model’s performance or the set of candidate constants is empty. The end result is a cluster definition as illustrated by either Cluster (3) or Cluster (4).

The central challenge in trying to discover latent structure is the large number of concepts that could be invented. For example, when predicting adverse reactions, the data contains information about thousands of drugs and diseases. Consequently, performing a complete search, where the utility of adding each constant to the cluster is evaluated separately, is prohibitively expensive. We propose two different techniques to identify promising candidates to include in a grouping.

Restrict constants to those from “near miss” examples. To illustrate this idea, consider the following two rules:

Weight(pid, date1, w) ∧ w < 120 ⇒ ADR(pid)    (5)
Drug(pid, date1, Terconazole) ∧ Weight(pid, date1, w) ∧ w < 120 ⇒ ADR(pid)    (6)

The second rule, by adding the condition Drug(pid, date1, Terconazole), applies to fewer patients. Some patients may match Rule (5), but not the more specific Rule (6), because they took a similar, but not identical, medication. Looking at the medications prescribed to these patients (i.e., those that match Rule (5) but not Rule (6)) can potentially inform the search as to which medications can be prescribed in place of Terconazole. Therefore, this strategy restricts the search to grouping together only constants that appear in examples that are covered by a rule’s immediate predecessor (i.e., Rule (5)) but not by the rule itself (i.e., Rule (6)).

Restrict constants to those correlated with initial constant. This approach performs a precomputation to identify constants that are mutually replaceable. Intuitively, the idea is to discover when one constant can be used in lieu of another (e.g., one drug is prescribed in place of another drug). This can be done by employing a variant of “guilt by association,” which says that objects are similar if they appear in similar contexts. Given a ground atom, $\mathrm{Rel}(A_1, \ldots, C_1, \ldots, A_n)$, constant $C_1$ shares a context with another constant $C_2$ (with $C_1 \neq C_2$) if replacing $C_1$ with $C_2$ results in a ground atom that appears in the data. LUCID performs a preprocessing step over the training data and computes the Pearson correlation among all pairs of constants $C_i$ and $C_j$:

$$\mathrm{corr}(C_i, C_j) = \frac{\sum_k (N_{i,k} - \bar{N}_i)(N_{j,k} - \bar{N}_j)}{\sqrt{\sum_k (N_{i,k} - \bar{N}_i)^2 \, \sum_k (N_{j,k} - \bar{N}_j)^2}}$$

where $k$ ranges over all constants of the same type ($k \neq i$ and $k \neq j$), $N_{i,k}$ is the context size (i.e., the number of times that $C_i$ shares a context with $C_k$), and $\bar{N}_i$ is the average context size for $C_i$.
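The correlation-based precomputation can be sketched as follows. The example assumes a simplified two-argument relation Drug(pid, drug), so that the context of a drug constant is just the patient identifier, and it takes the average context size over the same range of k as the sums; both choices, along with the toy facts and function names, are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical simplified facts Drug(pid, drug). The context of a drug constant
# is the rest of the atom, which here is only the patient identifier.
FACTS = [
    ("p1", "A"), ("p1", "C"), ("p1", "D"),
    ("p2", "A"), ("p2", "C"),
    ("p3", "B"), ("p3", "C"), ("p3", "D"),
    ("p4", "B"), ("p4", "C"),
    ("p5", "A"), ("p5", "E"),
]


def shared_context_counts(facts):
    """Return counts[i][k]: the number of contexts in which i and k both occur."""
    by_context = defaultdict(set)
    for context, constant in facts:
        by_context[context].add(constant)
    counts = defaultdict(lambda: defaultdict(int))
    for constants in by_context.values():
        for i, k in combinations(sorted(constants), 2):
            counts[i][k] += 1
            counts[k][i] += 1
    return counts


def corr(ci, cj, counts, constants):
    """Pearson correlation of the context-size profiles of ci and cj."""
    ks = [k for k in constants if k not in (ci, cj)]
    if not ks:
        return 0.0
    ni = [counts[ci][k] for k in ks]
    nj = [counts[cj][k] for k in ks]
    mean_i = sum(ni) / len(ks)
    mean_j = sum(nj) / len(ks)
    numerator = sum((a - mean_i) * (b - mean_j) for a, b in zip(ni, nj))
    denominator = sqrt(
        sum((a - mean_i) ** 2 for a in ni) * sum((b - mean_j) ** 2 for b in nj)
    )
    return numerator / denominator if denominator else 0.0


CONSTANTS = sorted({drug for _, drug in FACTS})
COUNTS = shared_context_counts(FACTS)
corr_ab = corr("A", "B", COUNTS, CONSTANTS)
```

In this toy data, drugs A and B are never given to the same patient, yet both tend to co-occur with C and D, so their profiles correlate strongly (about 0.87). That is the pattern one would expect from two drugs that are prescribed in place of one another.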

3.3 Internal Evaluation Metric

In principle, LUCID can use any evaluation metric for evaluating the quality of the model. We use the area under the precision-recall curve (AUC-PR). The tasks considered in this paper contain many more negative examples than positive examples, and this measure ignores the potentially large number of true negative examples. LUCID evaluates a model by looking at its AUC-PR on both its training set (used to learn the model structure and parameters) and an independent tune set. With relatively few positive examples, considering both the train and tune set scores helps make the algorithm more robust to overfitting. Furthermore, it is likely that many features will improve the model. Therefore, candidates must improve both the AUC-PR of the train set and the AUC-PR of the tune set by a certain percentage-based threshold to be considered. Again, using the threshold helps control overfitting by preventing relatively weak features (i.e., those that only improve the model score slightly) from being included.
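The acceptance test described above can be written as a small predicate. This is a sketch under one assumption: that the percentage-based threshold is a relative gain over the current score. The text does not say whether the gain is relative or absolute, and the function and parameter names are hypothetical. The default of 0.02 mirrors the 2% value used in the experiments.

```python
def accepts(old_train, old_tune, new_train, new_tune, min_gain=0.02):
    """Return True if both train and tune AUC-PR improve by at least `min_gain`.

    The gain is interpreted as relative (new >= old * (1 + min_gain)); this
    interpretation is an assumption, not something the paper specifies.
    """
    def improved(old, new):
        return new >= old * (1.0 + min_gain)

    return improved(old_train, new_train) and improved(old_tune, new_tune)


both_improve = accepts(0.30, 0.30, 0.32, 0.31)
tune_too_small = accepts(0.30, 0.30, 0.35, 0.301)
```

The second call is rejected even though the train score rises sharply, because the tune score gains less than 2%. Requiring both sets to clear the threshold is what guards against features that merely fit the training folds.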

3.4 Algorithm

Table 1: Data set characteristics.

                                                   Selective Cox-2    Warfarin       ACEI
Number of positive examples                                    160         144        102
Number of negative examples                                  2,134       1,440      1,020
Number of unique medications                                 2,590       2,316      2,044
Number of unique diagnoses                                   7,912       8,389      7,286
Number of facts in the medicine database table           3,518,467     603,503    335,065
Number of facts in the disease database table            3,653,487     691,591    436,934

LUCID essentially follows the same high-level control structure as the VISTA algorithm described in Section 2. The key difference is that it defines more candidate features, as it constructs and evaluates features that contain invented, latent predicates. However, it is prohibitively expensive to consider adding latent structure to each candidate feature. Therefore, LUCID restricts itself to adding latent concepts only to features that meet the following two conditions:

Condition 1: The rule under consideration improves the score of the model. This provides initial evidence that the rule is useful, but the algorithm may be able to improve its quality by modeling latent structure. Discarding rules that initially exhibit no improvement dramatically improves the algorithm’s efficiency.

Condition 2: The most recent condition added to the rule must refer to a constant. Furthermore, the user must have identified this type of constant as a candidate for having latent structure. This helps reduce the search space, as not all types of constants will exhibit latent structure.

For each candidate feature that meets these two criteria, LUCID attempts to discover latent structure. It invokes the procedure outlined in Subsection 3.2 and adds the feature it constructs (which contains an invented latent predicate) to the set of candidate features. Given the extended set of candidate features, LUCID selects the feature that most improves the score of the model. After adding a candidate feature to the model, all other features must be re-evaluated. Because the modified model changes the score of each candidate feature, LUCID must check each feature again to determine which ones satisfy both of the aforementioned conditions and should be augmented with a latent predicate. The modified model, if it incorporated a feature with a latent predicate, also presents expanded opportunities for latent structure discovery. Newly invented latent predicates can extend or reuse the previously introduced latent predicate definition.
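The cluster-learning procedure of Subsection 3.2, which the control loop above invokes for each qualifying feature, can be sketched as a greedy search over candidates drawn from near-miss examples. Everything named here is hypothetical: `toy_score` stands in for relearning the TAN model and measuring its AUC-PR, and the patients, drugs, and labels are invented for the example.

```python
def near_miss_candidates(covered_by_parent, covered_by_rule, constants_of):
    """Constants from examples covered by the parent rule but not by the rule."""
    candidates = set()
    for example in covered_by_parent - covered_by_rule:
        candidates |= constants_of(example)
    return candidates


def grow_cluster(seed, candidates, score):
    """Greedily add the single constant that most improves `score(cluster)`.

    Stops when no addition improves the score or no candidates remain.
    """
    cluster = {seed}
    best = score(cluster)
    remaining = set(candidates) - cluster
    while remaining:
        gains = {c: score(cluster | {c}) for c in remaining}
        winner = max(sorted(gains), key=gains.get)  # sorted() breaks ties stably
        if gains[winner] <= best:
            break
        cluster.add(winner)
        best = gains[winner]
        remaining.discard(winner)
    return cluster, best


# Invented toy data: drugs per patient and whether the patient had an ADR.
PATIENT_DRUGS = {
    "p1": {"Terconazole"}, "p2": {"Rifampicin"}, "p3": {"Ketoconazole"},
    "p4": {"Aspirin"}, "p5": {"Aspirin", "Rifampicin"}, "p6": set(),
}
HAD_ADR = {"p1": 1, "p2": 1, "p3": 1, "p4": 0, "p5": 1, "p6": 0}


def toy_score(cluster):
    """Stand-in for the model score: +1 per matched positive, -1 per matched negative."""
    matched = [p for p, drugs in PATIENT_DRUGS.items() if drugs & cluster]
    return sum(1 if HAD_ADR[p] else -1 for p in matched)


# Parent rule (5) covers every patient; rule (6) covers only the Terconazole patient.
candidates = near_miss_candidates(
    set(PATIENT_DRUGS), {"p1"}, lambda pid: PATIENT_DRUGS[pid]
)
cluster, best = grow_cluster("Terconazole", candidates, toy_score)
```

On this toy data the search adds Rifampicin, then Ketoconazole, and stops before adding Aspirin, because Aspirin would pull in a patient without an ADR and lower the score.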

4 Empirical Evaluation

In this section, we evaluate our proposed approach on three real-world data sets. As the baseline, we compare LUCID to the VISTA algorithm [10]. In all tasks, the goal is to predict at prescription time whether a patient will have an ADR that may be related to taking the medication. We first describe the data sets we use and then present and discuss our experimental results.

Task Descriptions. Our data comes from a large multispecialty clinic that has been using electronic medical records since 1985 and has electronic data back to the early 1960s. We have received institutional review board approval to undertake these studies. For all tasks, we have access to information about observations (e.g., vital signs, family history, etc.), lab test results, disease diagnoses and medications. We only consider patient data up to one week before that patient’s first prescription of the drug under consideration. This ensures that we are building predictive models only from data generated before a patient is prescribed that drug. Characteristics of each task can be found in Table 1. We now briefly describe each task.

Selective Cox-2 inhibitors (e.g., Vioxx™) are a class of pain relief drugs that were found to increase a patient’s risk of having a myocardial infarction (MI) (i.e., a heart attack) [13]. Angiotensin-converting enzyme inhibitors (ACEIs) are a class of drugs commonly prescribed to treat high blood pressure and congestive heart failure. It is known that in some people, ACEIs may result in angioedema (a swelling beneath the skin). Warfarin is a commonly prescribed blood thinner that is known to increase the risk of internal bleeding for some individuals. On each task the goal is to distinguish between patients who take the medicine and have an adverse event (i.e., positive examples) and those who do not (i.e., the negative examples).

Table 2: Average AUC-PR for each approach. The best result for each task is marked with an asterisk (*).

                        Selective Cox-2    Warfarin      ACEI
LUCID (Near Miss)                0.424*      0.203*     0.300
LUCID (Correlation)              0.362       0.149      0.312*
VISTA                            0.377       0.143      0.286

Methodology and Results. We performed stratified, ten-fold cross-validation for each task. We sub-divided the training data and used five folds for training (i.e., learning the model structure and parameters) and four folds for tuning. We require that a candidate feature result in at least a 2% improvement to the AUC-PR in order to be considered for acceptance. We set all parameters to be identical for all approaches. The only difference between the algorithms is that LUCID can introduce latent structure. Without this ability, the algorithms would construct and evaluate identical candidate feature sets.

Table 2 reports the average AUC-PRs for each task. LUCID using the “near miss” strategy (LUCID-NM) for inventing latent structure outperforms VISTA on all three tasks. On the Selective Cox-2 and Warfarin domains, LUCID-NM results in relatively large improvements in AUC-PR, of 12% and 41%, respectively, when compared to VISTA. LUCID using the “correlation” strategy (LUCID-C) performs comparably to VISTA, except on the ACEI domain, where it has the best performance overall. One possible reason that LUCID-NM does better than LUCID-C is that it more actively uses the rule to guide concept invention. By leveraging a partial feature definition, LUCID-NM is able to detect correlations among objects that may more clearly arise in the context of the rule. In contrast, LUCID-C takes a more global view with its pre-computation based strategy.

Learned Groupings. Another important evaluation measure is whether LUCID invents interesting and relevant concepts. We presented several of the invented clusterings to a medical doctor with expertise in circulatory diseases. We focus our discussion on structures from the Selective Cox-2 domain. The expert remarked on a cluster that contained the drugs diltiazem, a calcium-channel blocker, and clopidogrel (Plavix™), an antiplatelet agent. These two cardiac drugs are frequently used in acute coronary syndrome, especially after angioplasty. In terms of diseases, the expert highlighted a cluster describing cardiac catheter and coronary angioplasty, which are consistent with acute coronary syndrome and mean that a patient is at a high risk of having a heart attack (MI). Another cluster of interest involved cholecystectomy (which is a procedure to remove the gallbladder), as in female individuals the diagnosis of MI is often confused with gallbladder pain. Finally, the expert remarked on a cluster containing hearing loss as a finding that deserves further investigation.

5 Related Work

SRL lies at the intersection of relational learning and graphical model learning. In terms of relational learning, our approach is closely related to Dietterich and Michalski’s work [14]. Their work contains an operation known as internal disjunction, which replaces a constant with a disjunction of several constants. We go beyond this work by allowing re-use of an internal disjunction and, most importantly, by explicitly modeling and reasoning about uncertainty in the data and the invented predicates. The present paper is closely related to predicate invention in relational learning, especially in inductive logic programming (e.g., [4, 5]). The present work advances beyond these approaches by explicitly modeling the uncertainty as well as structure within the data. These approaches only invent new predicates when nothing else works, whereas our approach is much more liberal about detecting latent structure.

Our approach is closely related to latent variable discovery for graphical models. Introducing latent variables into a Bayesian network often results in a simpler structure, yet it is difficult to automatically discover these latent variables from data [2, 3]. Our work goes beyond these approaches by operating in a relational setting. Consequently, our new clusters are incorporated into the Bayes net only within the context of specific rules, or definite clauses. Such rules can capture a limited context around the cluster in which it is relevant.

Our work is not the first to combine ideas from latent variable discovery and predicate invention [6, 7, 8, 15]. Popescul and Ungar [15] use an initial pre-processing step that learns clusterings and then treats cluster membership as an invented feature during learning. In contrast, in the present approach the learning task guides the construction of clusterings and also allows reuse of clusters as part of new clusters. Kemp et al. [6] propose a more advanced algorithm based on an infinite relational model, which clusters entities in a domain. The cluster that an entity is assigned to should be predictive of the relationships it satisfies. A weakness of this approach is that each entity can belong to only one cluster. Kok and Domingos [7] propose an algorithm that learns multiple relational clusters (MRC). The MRC algorithm clusters both relations and entities, and relations and entities can belong to more than one cluster. However, MRC is a transductive approach, rather than an inductive approach. These approaches have been evaluated on domains that contain information on only between 100 and 200 objects. We have evaluated our approach on problems that are between one and two orders of magnitude larger. It is unlikely that these approaches would scale to problems of this size.

6 Future Work and Conclusions

We presented LUCID, a novel algorithm for latent structure discovery. We tested it on the task of learning, from electronic medical record (EMR) data, which patients are most at risk of suffering a given adverse drug reaction (ADR). LUCID improved the performance of the baseline SRL algorithm, and it produced meaningful latent structure. Important directions for further research include applications to other ADRs, other tasks in learning from EMRs, and other types of relational databases, as well as integrating LUCID with other SRL algorithms. Other important directions include theoretical analysis of LUCID and of the task of latent structure discovery in general. For example, how accurately can correct latent structure be discovered as the complexity of the latent structure varies, and as the amount of training data varies?

References

[1] L. Getoor and B. Taskar, editors. An Introduction to Statistical Relational Learning. MIT Press, 2007.
[2] G. Elidan, N. Lotner, N. Friedman, and D. Koller. Discovering hidden variables: A structure-based approach. In NIPS 13, pages 479–485, 2000.
[3] N. L. Zhang, T. D. Nielsen, and F. V. Jensen. Latent variable discovery in classification models. Artificial Intelligence in Medicine, 30(3):283–299, 2004.
[4] S. Muggleton and W. Buntine. Machine invention of first-order predicates by inverting resolution. In Proc. of the 5th ICML, pages 339–352, 1988.
[5] J. Zelle, R. Mooney, and J. Konvisser. Combining top-down and bottom-up techniques in inductive logic programming. In Proc. of the 11th ICML, pages 343–351, 1994.
[6] C. Kemp, J. Tenenbaum, T. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proc. of the 21st AAAI, 2006.
[7] S. Kok and P. Domingos. Statistical predicate invention. In Proc. of the 24th ICML, pages 433–440, 2007.
[8] Z. Xu, V. Tresp, K. Yu, and H-P. Kriegel. Infinite hidden relational models. In Proc. of the 22nd UAI, 2006.
[9] F. Farooq, B. Krishnapuram, R. Rosales, S. Yu, J. W. Shavlik, and R. Kucherlapati. Predictive models in personalized medicine: NIPS 2010 workshop report. SIGHIT Record, 1(1):23–25, 2011.
[10] J. Davis, I. Ong, J. Struyf, E. Burnside, D. Page, and V. Santos Costa. Change of representation for statistical relational learning. In Proc. of the 20th IJCAI, pages 2719–2726, 2007.
[11] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29:131–163, 1997.
[12] N. Lavrac and S. Dzeroski. Inductive Logic Programming: Techniques and Applications. Ellis Horwood, New York, 1994.
[13] P. M. Kearney, C. Baigent, J. Godwin, H. Halls, J. R. Emberson, and C. Patrono. Do selective cyclo-oxygenase-2 inhibitors and traditional non-steroidal anti-inflammatory drugs increase the risk of atherothrombosis? Meta-analysis of randomised trials. BMJ, 332:1302–1308, 2006.
[14] T. G. Dietterich and R. S. Michalski. A comparative review of selected methods for learning from examples. In Machine Learning: An Artificial Intelligence Approach, pages 41–81. 1983.
[15] A. Popescul and L. Ungar. Cluster-based concept invention for statistical relational learning. In Proc. of the 10th ACM SIGKDD, pages 665–670, 2004.
