Uncovering Functional Dependencies in MDD-Compiled Product Catalogues

Tarik Hadzic
Cork Constraint Computation Centre, University College Cork, Ireland

Barry O’Sullivan
Cork Constraint Computation Centre, University College Cork, Ireland

ABSTRACT
A functional dependency is a logical relationship amongst the attributes that define a table of data. Specifically, a functional dependency holds when the values of a subset of the attributes in a dataset determine the values of one or more other attributes. Uncovering such dependencies is useful in many domains, such as database design. We demonstrate that it is also useful in a recommendation context when datasets represent product catalogues. State-of-the-art approaches to discovering functional dependencies require a tabular representation of the data. However, product catalogues can sometimes be defined implicitly, for example, as the set of solutions to a combinatorial problem. Such combinatorial catalogues can contain a very large number of products, making standard approaches to uncovering functional dependencies inapplicable. In this paper we present the first approach to computing functional dependencies over compiled knowledge representations, which can often be small even for huge catalogues. In particular, we develop efficient algorithms that operate over decision diagrams, which allow us to handle catalogues that are out of reach for current approaches. We apply our algorithms to tabular and combinatorial benchmarks and detect a number of properties that could be considered anomalies in product catalogues.

Categories and Subject Descriptors H.3.3 [Information Systems]: Information Search and Retrieval

General Terms Algorithms, Verification

1. INTRODUCTION

Uncovering functional dependencies is an important problem in many artificial intelligence (AI) domains. Many AI datasets are represented in tabular form, defined in terms of a set of attributes. A dataset might be represented as a database or as a spreadsheet. Determining that one or more attributes are functionally determined by the values of other attributes can be of critical importance, e.g. in analyzing the effect of chemical compounds on carcinogenicity or studying the shopping habits of customers.

We suggest that uncovering functional dependencies can also be useful in the context of recommender systems. In particular, various forms of anomalies can be introduced if care is not taken when the catalogue is updated, or if the raw catalogue data is “cleaned” either by vendors before deploying it in an online e-commerce application, or by researchers before subjecting it to various forms of experimental evaluation. We show how one class of such anomalies can be detected using functional dependencies. For instance, ignoring seemingly “redundant” attributes can lead to catalogues where identical product specifications have different prices. Detecting this anomalous case reduces to verifying that the price attribute is functionally determined by the remaining attributes.

However, some product catalogues cannot be represented in a tabular form, but can be defined implicitly, as solutions to combinatorial problems. We refer to such cases as combinatorial catalogues. For example, in the holiday travel domain a solution is any itinerary that fits transportation schedules and accommodation availability. In the product configuration domain a solution is any product that is possible to manufacture in a factory. There could be a huge number of such solutions, making it infeasible to represent the catalogue in a table. Techniques depending on a tabular representation of all solutions would, therefore, be inapplicable.

In this paper we therefore present an approach to uncovering functional dependencies when a catalogue is compiled into a multivalued decision diagram (MDD). MDDs are a multi-valued extension of binary decision diagrams [1], which have proved useful for compactly representing solution sets in many combinatorial domains, such as planning, diagnosis, probabilistic reasoning, and product configuration. While the size of an MDD is, in the worst case, linear in the number of solutions, it can often be exponentially smaller. The algorithms we develop in this paper have quadratic time complexity in the size of the MDD, thus offering potentially dramatic improvements in the complexity of reasoning over large solution spaces. Our approach therefore outperforms the state of the art whenever a catalogue with a large number of solutions can be represented using a small MDD. In the experimental section we show that such catalogues frequently occur in the product configuration domain. To the best of our knowledge, this is the first approach to functional dependency detection and exploitation over compiled knowledge representations.

The remainder of this paper is organised as follows. Section 2 presents the technical background required throughout the paper. We develop MDD-based algorithms for uncovering various forms of functional dependencies in Section 3. We report on a number of experiments in Section 4 and conclude in Section 5.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. RecSys’09, October 23–25, 2009, New York, New York, USA. Copyright 2009 ACM 978-1-60558-435-5/09/10 ...$10.00.

2. BACKGROUND

Solution Sets. We consider contexts where we are given a set of variables (product attributes) $X = \{x_1, \ldots, x_n\}$ and an associated set of finite domains $D_1, \ldots, D_n$ such that each variable $x_i$ takes a value from the corresponding set of possible values $D_i$. We are also given a set of solutions (available products) $Sol \subseteq D_1 \times \ldots \times D_n$, which can be defined either explicitly, as a set of items in a product catalogue, or implicitly, as the set of solutions to a constraint satisfaction problem $\langle X, D, C \rangle$ with $C =_{def} \{c_1, \ldots, c_m\}$, in which each $c_i \in C$ restricts the set of possible choices one can make over a subset of the variables in $X$. The following example demonstrates each representation form for a T-shirt catalogue.

EXAMPLE 1 (T-SHIRT CATALOGUE). Consider a catalogue of T-shirts defined in terms of three attributes: the color (black, white, red, or blue), the size (small, medium, or large) and the print (“Men In Black”, MIB, or “Save The Whales”, STW). There are two constraints that define the set of valid combinations: the MIB print is only available on black T-shirts, and the STW print does not fit on small T-shirts. The implicit representation $\langle X, D, C \rangle$ of the T-shirt example consists of variables $X = \{x_1, x_2, x_3\}$ representing color, size and print, respectively, each taking values from the following domains: $D_1 = \{0, 1, 2, 3\}$ (black, white, red, blue), $D_2 = \{0, 1, 2\}$ (small, medium, large), and $D_3 = \{0, 1\}$ (MIB, STW). The two constraints are defined as follows: $x_3 = 0 \Rightarrow x_1 = 0$ (MIB ⇒ black), and $x_2 = 0 \Rightarrow x_3 \neq 1$ (small ⇒ not STW). ♦

An equivalent representation of the solution space of the T-shirt example, satisfying the constraints defined above, can be given explicitly as a set of items in a table, as shown in Figure 1(a). Any explicitly defined catalogue can always be interpreted as a constraint satisfaction problem having only one (tabular) constraint.
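The relationship between the implicit and the explicit representation in Example 1 can be sketched in a few lines of Python. This is only an illustration by brute-force enumeration, which is exactly what becomes infeasible for large combinatorial catalogues; the names `satisfies_constraints` and `enumerate_solutions` are hypothetical and do not come from the paper.

```python
from itertools import product

# Domains of Example 1, encoded as in the text.
COLORS = ["black", "white", "red", "blue"]   # D1 = {0, 1, 2, 3}
SIZES = ["small", "medium", "large"]         # D2 = {0, 1, 2}
PRINTS = ["MIB", "STW"]                      # D3 = {0, 1}


def satisfies_constraints(x1, x2, x3):
    """Check the two constraints of the T-shirt example."""
    mib_implies_black = (x3 != 0) or (x1 == 0)       # x3 = 0 => x1 = 0
    small_implies_not_stw = (x2 != 0) or (x3 != 1)   # x2 = 0 => x3 != 1
    return mib_implies_black and small_implies_not_stw


def enumerate_solutions():
    """Turn the implicit (CSP) representation into an explicit table."""
    domains = (range(len(COLORS)), range(len(SIZES)), range(len(PRINTS)))
    return [a for a in product(*domains) if satisfies_constraints(*a)]


solutions = enumerate_solutions()
for x1, x2, x3 in solutions:
    print(COLORS[x1], SIZES[x2], PRINTS[x3])
```

The enumeration yields the eleven rows of the explicit table in Figure 1(a).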
color    size     print
black    small    MIB
black    medium   MIB
black    medium   STW
black    large    MIB
black    large    STW
white    medium   STW
white    large    STW
red      medium   STW
red      large    STW
blue     medium   STW
blue     large    STW

(a) Explicit Representation.

(b) MDD Representation. [Diagram with nodes $u_1, \ldots, u_6$ and the terminal $\mathbf{1}$.]

Figure 1: Solution set representations for the T-shirt example.

Multivalued Decision Diagrams. Multivalued decision diagrams (MDDs) are compressed representations of solution sets $Sol \subseteq D_1 \times \ldots \times D_n$. MDDs are directed acyclic graphs that, for a sorted list of attributes, use prefix and suffix sharing to compactly represent the solution set. Each solution is encoded as a path, and every edge on the path encodes an attribute-value pair. MDDs are a generalization of binary decision diagrams (BDDs) [1], which are used as compressed representations of Boolean functions in many areas, such as verification, model checking, and VLSI design [4].

Formally, an MDD $M$ is a rooted directed acyclic graph $(V, E)$, where $V$ is a set of vertices containing the special terminal vertex $\mathbf{1}$ and a root $r \in V$. Furthermore, $var : V \to \{1, \ldots, n+1\}$ is a labeling of all nodes with a variable index such that $var(\mathbf{1}) = n+1$. Each edge $e \in E$ is defined by a triple $(u, u', a)$ of its start node $u$, its end node $u'$ and an associated value $a$. An MDD representing the T-shirt solution set is presented in Figure 1(b). MDDs are deterministic: no node may have more than one outgoing edge with the same label.

We work only with ordered MDDs. A total ordering $<$ of the variables is assumed such that for all edges $(u, u', a)$ it holds that $var(u) < var(u')$. Ordered MDDs can be regarded as being arranged in $n$ layers of vertices, each layer being labeled with the same variable index. We denote as $V_i$ the set of all nodes labeled with $x_i$, $V_i = \{u \in V \mid var(u) = i\}$. We denote as $E_i$ the set of all edges originating in $V_i$, i.e. $E_i = \{e(u, u', a) \in E \mid var(u) = i\}$. For example, in Figure 1(b), $V_3 = \{u_4, u_5, u_6\}$ and $E_3 = \{(u_4, \mathbf{1}, 0), (u_5, \mathbf{1}, 0), (u_5, \mathbf{1}, 1), (u_6, \mathbf{1}, 1)\}$.

An MDD encodes a CSP solution set $Sol \subseteq D_1 \times \ldots \times D_n$, defined over variables $\{x_1, \ldots, x_n\}$. To check whether an assignment $a = (a_1, \ldots, a_n) \in D_1 \times \ldots \times D_n$ is in $Sol$ we traverse $M$ from the root, and at every node $u$ labeled with variable $x_i$, we follow an edge labeled with $a_i$. If there is no such edge then $a$ is not a solution, $a \notin Sol$. Otherwise, if such a traversal eventually ends in the terminal $\mathbf{1}$ then $a \in Sol$.

We will denote as $p : u_1 \leadsto u_2$ any path in the MDD from $u_1$ to $u_2$. Also, edges between $u$ and $u'$ will sometimes be denoted as $e : u \to u'$. The value $a$ of an edge $e(u, u', a)$ will sometimes be denoted as $v(e)$, while the partial assignment associated with a path $p$ will be denoted as $v(p)$. We will use $Ch[u]$ to denote the set of all outgoing (child) edges of node $u$. Every path corresponds to a unique assignment. Hence, the set of all solutions represented by the MDD is $Sol = \{v(p) \mid p : r \leadsto \mathbf{1}\}$. In fact, every node $u \in V_i$ can be associated with a subset of solutions, $Sol(u) = \{v(p) \mid p : u \leadsto \mathbf{1}\} \subseteq D_i \times \ldots \times D_n$. For example, $Sol(u_3) = \{(1, 1), (2, 1)\}$.
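The membership test and the sets $Sol(u)$ described above can be sketched as follows. The edge list is an assumption: it is reconstructed from the table in Figure 1(a) together with the values of $V_3$, $E_3$ and $Sol(u_3)$ quoted in the text, since the diagram itself is not reproduced here. The dictionary encoding and the function names `contains` and `sol` are hypothetical.

```python
TERMINAL = "1"

# node -> (variable index, {edge value: child node}).
# Assumed reconstruction of the T-shirt MDD of Figure 1(b).
MDD = {
    "u1": (1, {0: "u2", 1: "u3", 2: "u3", 3: "u3"}),  # color
    "u2": (2, {0: "u4", 1: "u5", 2: "u5"}),           # size (black shirts)
    "u3": (2, {1: "u6", 2: "u6"}),                    # size (other colors)
    "u4": (3, {0: TERMINAL}),                         # print: MIB only
    "u5": (3, {0: TERMINAL, 1: TERMINAL}),            # print: MIB or STW
    "u6": (3, {1: TERMINAL}),                         # print: STW only
}
ROOT = "u1"


def contains(assignment):
    """Follow the edge labeled a_i at each layer; succeed at the terminal."""
    node = ROOT
    for value in assignment:
        if node == TERMINAL:
            return False          # assignment is longer than any path
        _, children = MDD[node]
        if value not in children:
            return False          # no edge labeled with this value
        node = children[value]
    return node == TERMINAL


def sol(node):
    """Enumerate Sol(node): the value tuples of all paths to the terminal."""
    if node == TERMINAL:
        return {()}
    _, children = MDD[node]
    return {(value,) + tail
            for value, child in children.items()
            for tail in sol(child)}


print(contains((0, 0, 0)))   # black, small, MIB
print(contains((1, 0, 1)))   # white, small, STW
print(sorted(sol("u3")))
```

Enumerating `sol(ROOT)` visits every solution and is shown only to make the definition concrete; the algorithms of Section 3 avoid it.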
Functional Dependencies. Given a solution set $Sol$ defined over variables $X = \{x_1, \ldots, x_n\}$, and given a solution $a \in Sol$, $a = (a_1, \ldots, a_n)$, we define the projection of solution $a$ onto variable $x_i$, denoted as $a[x_i]$, to be the value of the $i$-th coordinate in the tuple, $a[x_i] =_{def} a_i$. Similarly, we define the projection of $a$ onto a subset of variables $Y \subseteq X$, denoted as $a[Y]$, as the tuple of values corresponding to the variables in $Y$. For a given set of solutions $S \subseteq Sol$, we define the projection onto a subset of variables $Y$ as the collection of all projected tuples, $S[Y] =_{def} \{a[Y] \mid a \in S\}$.

For a solution set $Sol$ defined over variables $X = \{x_1, \ldots, x_n\}$ we say that a variable $x_i$ is functionally determined by a subset of variables $Y \subseteq X$, denoted as $Y \to x_i$, if for any two solutions $a_1, a_2 \in Sol$, whenever $a_1$ and $a_2$ agree on the variables $Y$ ($a_1[Y] = a_2[Y]$), they also agree on variable $x_i$ ($a_1[x_i] = a_2[x_i]$). Formally:

$Y \to x_i \Leftrightarrow_{def} \left( \forall a_1, a_2 \in Sol : a_1[Y] = a_2[Y] \Rightarrow a_1[x_i] = a_2[x_i] \right)$.

A number of approaches are known for uncovering functional dependencies. A core operation that is executed repeatedly is testing for atomic functional dependencies of the form $Y \to x_i$. State-of-the-art approaches for testing atomic dependencies first cluster the data into equivalence classes with respect to the value of $x_i$, and then make multiple linear iterations through the dataset. This incurs linear complexity in the number of solutions [2].
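As a point of reference for the tabular setting, the definition of $Y \to x_i$ can be checked directly on an explicit solution list. The sketch below is a plain single-pass check written for illustration, not the partition-based method of [2]; the function name `holds` and the 0-based variable indices are assumptions.

```python
def holds(solutions, Y, i):
    """Return True iff the variables indexed by Y determine variable i.

    Indices are 0-based positions in the solution tuples.
    """
    witness = {}  # projection a[Y] -> the value a[x_i] first seen with it
    for a in solutions:
        key = tuple(a[j] for j in Y)
        if witness.setdefault(key, a[i]) != a[i]:
            return False  # two solutions agree on Y but differ on x_i
    return True


# Explicit T-shirt table of Figure 1(a), encoded as (color, size, print).
TSHIRTS = [
    (0, 0, 0), (0, 1, 0), (0, 1, 1), (0, 2, 0), (0, 2, 1),
    (1, 1, 1), (1, 2, 1), (2, 1, 1), (2, 2, 1), (3, 1, 1), (3, 2, 1),
]
NON_BLACK = [a for a in TSHIRTS if a[0] != 0]

print(holds(TSHIRTS, [0, 1], 2))    # {color, size} -> print
print(holds(NON_BLACK, [0], 2))     # {color} -> print on non-black shirts
```

In the full catalogue, black medium T-shirts come with both prints, so color and size do not determine print; once black T-shirts are removed, every remaining product has the STW print and the dependency holds.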

3. A DECISION DIAGRAM APPROACH
We present our main contribution: efficient algorithms for detecting functional dependencies when a product (solution) set is represented compactly as a multivalued decision diagram.

3.1 Directional Dependencies
Directional dependencies $\{x_1, \ldots, x_{i-1}\} \to x_i$ state that a variable $x_i$ is determined by the subset of all variables preceding it in the variable ordering of the MDD. This is a particularly easy case to detect, as shown in the following proposition.

PROPOSITION 1. $\{x_1, \ldots, x_{i-1}\} \to x_i$ if and only if every $u \in V_i$ has exactly one outgoing edge, i.e. $|Ch[u]| = 1$.

PROOF. Suppose some $u \in V_i$ has two outgoing edges $e_1 : u \to u_1$ and $e_2 : u \to u_2$. Let $p : r \leadsto u$ be a path from the root to $u$, and let $p_1 : u_1 \leadsto \mathbf{1}$ and $p_2 : u_2 \leadsto \mathbf{1}$ be paths from $u_1$ and $u_2$ to the terminal $\mathbf{1}$, respectively. Then the paths $(p, e_1, p_1)$ and $(p, e_2, p_2)$ represent two solutions with identical assignments to the variables $\{x_1, \ldots, x_{i-1}\}$ and two different assignments to $x_i$, $v(e_1) \neq v(e_2)$, so the dependency does not hold. Conversely, because the MDD is deterministic, two solutions that agree on $\{x_1, \ldots, x_{i-1}\}$ follow the same path from the root to a single node $u \in V_i$; if $u$ has only one outgoing edge, they also agree on $x_i$.

Proposition 1 provides us with a simple test for checking whether $\{x_1, \ldots, x_{i-1}\} \to x_i$. It suffices that all nodes in $V_i$ have exactly one outgoing edge. This can be easily checked by verifying that the number of nodes and the number of outgoing edges are the same, $|V_i| = |E_i|$. The set of all variables implied by the preceding variables in the order is given by $Imp_\prec = \{x_i \mid |V_i| = |E_i|\}$.

3.2 General Dependencies
Variables determined by subsets of preceding variables in the order, $Imp_\prec$, do not account for all implied variables. If variable $x_i$ is implied by any subset $Y \subseteq X \setminus \{x_i\}$ then it is also implied by $X \setminus \{x_i\}$. Therefore, the set of all implied variables $Imp$ is the set of all $x_i$ such that $X \setminus \{x_i\} \to x_i$. To detect such variables in an MDD, we use the following proposition.

PROPOSITION 2. $X \setminus \{x_i\} \not\to x_i$ if and only if there is a node $u \in V_i$ with two outgoing edges $e_1 : u \to u_1$, $e_2 : u \to u_2$ such that $v(e_1) \neq v(e_2)$ and $Sol(u_1) \cap Sol(u_2) \neq \emptyset$.

PROOF. If $X \setminus \{x_i\} \not\to x_i$ then there are two solutions $a = (a_1, \ldots, a_n)$, $a' = (a'_1, \ldots, a'_n)$ differing only in the $i$-th coordinate, $a_i \neq a'_i$. Let $p_a$ and $p_{a'}$ be the paths encoding these solutions. These paths must be of the form $p_a = (p, e_1, p_1)$, $p_{a'} = (p, e_2, p_2)$, where $p$ is the unique path encoding $(a_1, \ldots, a_{i-1})$. Path $p$ ends in a node $u \in V_i$. Since $a_i \neq a'_i$, we have $v(e_1) \neq v(e_2)$, and since $v(p_1) = v(p_2) = (a_{i+1}, \ldots, a_n)$, we have $Sol(u_1) \cap Sol(u_2) \neq \emptyset$. Conversely, if there is a $u \in V_i$ with two outgoing edges $e_1, e_2$ such that $v(e_1) \neq v(e_2)$ and $Sol(u_1) \cap Sol(u_2) \neq \emptyset$, then there are paths $p_1 : u_1 \leadsto \mathbf{1}$, $p_2 : u_2 \leadsto \mathbf{1}$ such that $v(p_1) = v(p_2)$. It suffices to choose any path from the root to $u$, $p : r \leadsto u$, to construct paths $p_a = (p, e_1, p_1)$ and $p_{a'} = (p, e_2, p_2)$, which encode solutions differing only at the $i$-th coordinate. Hence, $X \setminus \{x_i\} \not\to x_i$ holds.

Assume that for each pair of nodes in the same layer $(u_1, u_2)$ we have precomputed Boolean indicators $D[u_1, u_2]$ as follows: $D[u_1, u_2] = 1 \Leftrightarrow Sol(u_1) \cap Sol(u_2) \neq \emptyset$. Whenever we encounter a pair of nodes $(u_1, u_2)$ such that $D[u_1, u_2] = 1$, we are guaranteed that there are at least two paths $p_1 : u_1 \leadsto \mathbf{1}$ and $p_2 : u_2 \leadsto \mathbf{1}$ encoding the same partial solution, $v(p_1) = v(p_2)$. Given such labels, we can compute all functionally determined variables using Algorithm 1. In each layer we check, for all pairs of edges with the same parent, $(e_1(u, u_1, a_1), e_2(u, u_2, a_2))$, whether $D[u_1, u_2] = 1$. As soon as such edges are found we have proven that $x_i$ is not implied and we may proceed to the next layer. The algorithm runs in $O(\sum_{i=1}^{n} |V_i| \cdot |D_i|^2)$ steps, since for each node in each layer $u \in V_i$ we compare, in the worst case, all pairs of its child edges, and there are at most $|D_i| \times (|D_i| - 1)/2$ such pairs. The space complexity is $O(\sum_{i=1}^{n} |V_i|^2)$ since we have to store Boolean indicators $D[u_1, u_2]$ for each pair $(u_1, u_2) \in V_i^2$.

Algorithm 1: Compute functionally determined variables.
    Data: MDD $M(V, E)$
    $Imp = X$;
    foreach $i = 1, \ldots, n$ do
        foreach $u \in V_i$, $|Ch[u]| > 1$ do
            foreach $e_1 : u \to u_1$, $e_2 : u \to u_2$ do
                if $v(e_1) \neq v(e_2) \wedge D[u_1, u_2] = 1$ then
                    $Imp \leftarrow Imp \setminus \{x_i\}$;
                    go to next layer;
    return $Imp$;

Compatibility pairs $D[u_1, u_2]$ can be computed in quadratic time and space using Algorithm 2. We first initialize $D[u_1, u_2] = 0$ for all pairs of nodes, except for the terminal $\mathbf{1}$, setting $D[\mathbf{1}, \mathbf{1}] = 1$. We then traverse the MDD in a bottom-up manner. It is crucial to observe that, for $u_1 \neq u_2$, $D[u_1, u_2] = 1$ if and only if there are two outgoing edges $e_1 : u_1 \to u'_1$, $e_2 : u_2 \to u'_2$ such that $v(e_1) = v(e_2) \wedge D[u'_1, u'_2] = 1$. The algorithm runs in $O(\sum_{i=1}^{n} |E_i|^2)$ time since in each layer each pair of edges in $E_i$ is compared at most once. The algorithm takes $\Theta(\sum_{i} |V_i|^2)$ space, since we introduce a compatibility indicator for each pair of nodes in the layer.

Algorithm 2: Compute Boolean indicators.
    Data: MDD $M(V, E)$
    $D[\cdot, \cdot] = 0$, $D[\mathbf{1}, \mathbf{1}] = 1$;
    foreach $i = n, \ldots, 1$ do
        foreach $(u_1, u_2) \in V_i \times V_i$ do
            if $u_1 = u_2$ then
                $D[u_1, u_2] = 1$; continue;
            foreach $e_1 : u_1 \to u'_1$, $e_2 : u_2 \to u'_2$ do
                if $v(e_1) = v(e_2) \wedge D[u'_1, u'_2] = 1$ then
                    $D[u_1, u_2] = 1$; break;
    return $D$;
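The test of Proposition 1 and Algorithms 1 and 2 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the MDD is assumed to be strictly layered (every edge leads to the next layer or to the terminal), it is encoded as a hypothetical list of per-layer dictionaries, and the T-shirt edges are the same assumed reconstruction of Figure 1(b) used earlier. Because the MDD is deterministic, comparing all edge pairs with equal values reduces to iterating over the values two nodes have in common.

```python
from itertools import combinations

TERMINAL = "1"

# layers[i] maps each node of layer i+1 to {edge value: child node}.
TSHIRT = [
    {"u1": {0: "u2", 1: "u3", 2: "u3", 3: "u3"}},                  # color
    {"u2": {0: "u4", 1: "u5", 2: "u5"}, "u3": {1: "u6", 2: "u6"}},  # size
    {"u4": {0: TERMINAL}, "u5": {0: TERMINAL, 1: TERMINAL},
     "u6": {1: TERMINAL}},                                          # print
]
# Two Boolean variables constrained by x1 = x2 (hypothetical example).
EQUAL = [
    {"r": {0: "a", 1: "b"}},
    {"a": {0: TERMINAL}, "b": {1: TERMINAL}},
]


def directional_implied(layers):
    """Proposition 1: x_i is implied by its predecessors iff |V_i| = |E_i|."""
    return {i for i, layer in enumerate(layers, start=1)
            if all(len(children) == 1 for children in layer.values())}


def compatibility(layers):
    """Algorithm 2: D[(u1, u2)] is True iff Sol(u1) and Sol(u2) intersect."""
    D = {(TERMINAL, TERMINAL): True}
    for layer in reversed(layers):            # bottom-up traversal
        for u1, ch1 in layer.items():
            for u2, ch2 in layer.items():
                if u1 == u2:
                    D[(u1, u2)] = True
                    continue
                # Edges with the same value leading to compatible children.
                D[(u1, u2)] = any(D[(ch1[a], ch2[a])]
                                  for a in ch1.keys() & ch2.keys())
    return D


def implied_variables(layers):
    """Algorithm 1: 1-based indices i such that X minus {x_i} implies x_i."""
    D = compatibility(layers)
    implied = set()
    for i, layer in enumerate(layers, start=1):
        # Two edges of one node always carry different values (dict keys).
        violated = any(D[(c1, c2)]
                       for children in layer.values()
                       for (_, c1), (_, c2)
                       in combinations(children.items(), 2))
        if not violated:
            implied.add(i)
    return implied


print(directional_implied(TSHIRT), implied_variables(TSHIRT))
print(directional_implied(EQUAL), implied_variables(EQUAL))
```

The `EQUAL` instance shows why Section 3.2 is needed: $x_2$ is implied directionally, while $x_1$ is implied only by the variable that follows it in the ordering, which the edge-count test of Proposition 1 cannot see. In the T-shirt catalogue no attribute is implied.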

3.3 Approximate Dependencies
We have so far discussed only exact dependencies, i.e. we detect only whether a variable $x_i$ is determined or not. However, a subset of variables $Y$ might have a significant implicative influence on a variable $x$ even though it does not imply it exactly. Therefore, we are interested in computing the degree, $d(Y, x) \in [0, 1]$, to which a dependency $Y \to x$ holds. There are a number of ways to define it; for example, in [2] the authors use the minimal percentage of solutions that must be removed in order for the dependency to hold. In this paper we define $d(Y, x)$ as $|Sol[Y]| / |Sol[Y \cup \{x\}]|$. When $Y \to x$, then $|Sol[Y]| = |Sol[Y \cup \{x\}]|$, and $d(Y, x) = 1$. On the other hand, when $x$ is completely undetermined by $Y$, every assignment to the $Y$ variables $a \in Sol[Y]$ can be combined with all values in the domain of $x$, $D_x$, so that $|Sol[Y \cup \{x\}]| = |Sol[Y]| \cdot |D_x|$ and $d(Y, x) = 1/|D_x|$. Hence, larger values indicate larger degrees of dependency, and when $d(Y, x) = 1$, $Y \to x$ holds exactly.

Note that such a statistic can be easily computed using a BDD-based representation of the solution set $Sol$. The BDD package BuDDy¹ supports both projection and counting operations. Counting the number of solutions in a BDD is an efficient operation, linear in the number of nodes. While computing a BDD representation of a projected solution space can in theory increase the size of the BDD, in practice projecting out variables is almost always an efficient operation that decreases the number of nodes significantly.

¹ http://sourceforge.net/projects/buddy
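To make the definition of $d(Y, x)$ concrete, the sketch below evaluates it on the explicit T-shirt table. It substitutes plain set projection over an enumerated solution list for the BDD projection and counting operations mentioned above, so it illustrates the statistic rather than the compiled computation; the names `project` and `degree` and the 0-based indices are assumptions.

```python
from fractions import Fraction
from itertools import product


def project(solutions, Y):
    """Sol[Y]: the set of solution tuples restricted to the indices in Y."""
    return {tuple(a[j] for j in Y) for a in solutions}


def degree(solutions, Y, x):
    """d(Y, x) = |Sol[Y]| / |Sol[Y u {x}]|, for an index x not in Y."""
    return Fraction(len(project(solutions, Y)),
                    len(project(solutions, list(Y) + [x])))


# T-shirt catalogue of Example 1 as (color, size, print) tuples.
TSHIRTS = [
    (c, s, p)
    for c, s, p in product(range(4), range(3), range(2))
    if (p != 0 or c == 0) and (s != 0 or p != 1)
]
NON_BLACK = [a for a in TSHIRTS if a[0] != 0]

print(degree(TSHIRTS, [0, 1], 2))     # how strongly color, size imply print
print(degree(NON_BLACK, [0, 1], 2))   # exact dependency gives 1
```

For the full catalogue there are 9 distinct (color, size) combinations among 11 products, giving a degree of 9/11; for the non-black T-shirts the print is always STW and the degree is 1.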

4. EXPERIMENTAL EVALUATION

To evaluate our approach we applied our algorithms to a set of standard tabular and combinatorial benchmarks.

Tabular Benchmarks. We have applied our techniques to four well-known product catalogues that are frequently used in recommender system research. These are related to digital cameras, laptop computers, property lettings and travel [3]. We analyzed the data under the same adjustments that are usually used for experimental evaluation. Firstly, all unique identifiers and textual descriptions were removed. Secondly, all declared domain values that appeared in the specification but not in at least one product were also removed. If some values appeared in the datasets but were not declared, we added them to the model specification.

Table 1: Basic properties of the tabular product catalogues.
Instance   Rows   Sol    Var   dmin   dmax   davg
Cameras    210    210    9     5      165    40
Laptops    693    683    14    2      438    42
Travel     1470   1461   7     4      839    134
Lettings   794    751    6     2      174    45

A summary of the properties of the instances is reported in Table 1. For each instance (Cameras, Laptops, Travel, Lettings) we show the number of rows in the initial tabular representation, the number of solutions extracted from the MDD representation, the number of variables, and the minimum/maximum/average domain size. We discovered that three out of four datasets contain duplicate entries, since the number of solutions is smaller than the number of rows. This suggests the presence of anomalies in the data.

After compiling the catalogues into MDDs we performed a dependency analysis $X \setminus \{x_i\} \to x_i$ for each instance and each attribute. For each instance and each attribute we computed a degree of dependence, $d(X \setminus \{x_i\}, x_i)$. To our surprise, the price attribute is not functionally determined in any of the catalogues, with degrees of dependence 0.9666, 0.8555, 0.809 and 0.563 for the Cameras, Laptops, Travel and Lettings datasets, respectively. This indicates that in each catalogue there are identical configurations that have different prices!
This is particularly evident in the Lettings catalogue, where the degree of dependence is 0.563. This means that each configuration of the remaining attributes has, on average, close to two different prices (1/0.563 ≈ 1.78). We discovered that such a low degree of dependence was caused by the omission of the Street attribute in our preprocessing phase. However, for the Cameras and Laptops datasets, preprocessing removed only the attributes we anticipated: a row identifier and a textual description. The lack of functional determinism implies that there are products with identical features that differ only in a textual description. For example, in the Laptops catalogue, two “Intel Pentium MMX” laptops are offered with identical features but with different prices, costing 1409 and 1508 respectively. A user navigating the catalogue has no means of understanding what causes such a difference in price. Even after reintroducing the textual description, i.e. operating over the unprocessed catalogue, the explanation the user can obtain is not satisfactory: “The first laptop costs less because its name is AMS Travel Pro 2500CS while the second laptop’s name is AMS Travel Pro 2500CT”. This indicates that an essential feature of the product, necessary for understanding the pricing structure, is missing from the catalogue and is only hinted at in a textual description through a change in a single character.

Combinatorial Benchmarks. To demonstrate the applicability of our approach to combinatorial catalogues, we generated MDDs for a number of real-world configuration instances from CLib². We present their relevant properties in Table 2. The columns indicate the instance name, number of variables, number of solutions, size of the compiled representation on disk, number of MDD nodes $|V|$ and number of MDD edges $|E|$. We can see that even small MDDs can represent huge solution sets, thus highlighting the computational advantage of our methodology. In particular, the catalogue Big-PC is considerably out of reach for approaches operating over explicit tabular representations.

Table 2: MDD properties for some combinatorial catalogues.
Instance   Var   Sol        MDD Size (KB)   MDD Nodes   MDD Edges
Bike2      34    $2^{26}$   56              933         1886
PC2        41    $2^{20}$   237             3907        6136
PC         45    $2^{20}$   298             4875        7989
Big-PC     124   $2^{83}$   7945            100272      132889

We used Algorithm 1 and Algorithm 2 from Section 3 to compute the sets of implied variables. Catalogues Bike2, PC2, PC and Big-PC have 15, 35, 39 and 114 implied variables, respectively. For the first three instances the execution time was less than a second. For the largest instance, Big-PC, it took 34 seconds. All experiments were executed on a Fedora 9 operating system, using a dual quad-core Intel Xeon processor running at 2.66 GHz.

5. CONCLUSIONS
We presented an approach to computing functional dependencies over an MDD representation of a product catalogue. Unlike state-of-the-art approaches, we do not require a tabular representation of the data, and hence can handle huge catalogues that have small MDD representations. Using functional dependencies as an analytical tool, we discovered that a set of publicly available product catalogues exhibits specific characteristics, some of which can be regarded as anomalies in the catalogue definition or in the preprocessing step of the catalogues.

Acknowledgements Tarik Hadzic is supported by a Post-doctoral Research Fellowship from the Irish Research Council for Science, Engineering and Technology. Barry O’Sullivan is supported by Science Foundation Ireland (Grant Number 05/IN/I886).

6. REFERENCES
[1] R. E. Bryant. Graph-based algorithms for Boolean function manipulation. IEEE Transactions on Computers, 35(8):677–691, 1986.
[2] Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. Tane: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100–111, March 1999.
[3] R. Nicholson, D. Bridge, and N. Wilson. Decision diagrams: Fast and flexible support for case retrieval and recommendation. In Proceedings of ECCBR 2006, pp. 136–150.
[4] I. Wegener. Branching Programs and Binary Decision Diagrams. Society for Industrial and Applied Mathematics, 2000.

² http://www.itu.dk/research/cla/externals/clib