Querying and Repairing Inconsistent Numerical Databases

S. FLESCA, F. FURFARO, F. PARISI
DEIS - Università della Calabria
Via P. Bucci - 87036 Rende (CS) - ITALY
{flesca, furfaro, fparisi}@deis.unical.it

The problem of extracting consistent information from relational databases violating integrity constraints on numerical data is addressed. In particular, aggregate constraints defined as linear inequalities on aggregate-sum queries on input data are considered. The notion of repair as consistent set of updates at attribute-value level is exploited, and the characterization of several data-complexity issues related to repairing data and computing consistent query answers is provided. Moreover, a method for computing “reasonable” repairs of inconsistent numerical databases is provided, for a restricted but expressive class of aggregate constraints. Several experiments are presented, which assess the effectiveness of the proposed approach in real-life application scenarios.

Categories and Subject Descriptors: F.2.1 [Analysis of Algorithms and Problem Complexity]: Numerical Algorithms and Problems; H.2.m [Database Management]: Miscellaneous
General Terms: Theory
Additional Key Words and Phrases: Inconsistent Databases, Aggregate Constraints, Repairs, Consistent Query Answer

1. INTRODUCTION

Research has deeply investigated several issues related to the use of integrity constraints on relational databases. In this context, a great deal of attention has been devoted to the problem of extracting reliable information from databases containing pieces of information inconsistent w.r.t. some integrity constraints. Most of the previous work in this area deals with “classical” forms of constraint (such as keys, foreign keys, functional dependencies), and proposes different strategies for updating inconsistent data reasonably, in order to make it consistent by means of minimal changes. Indeed these kinds of constraint often do not suffice to manage data consistency, as they cannot be used to define algebraic relations between stored values. In fact, this issue frequently occurs in several scenarios, such as scientific databases, statistical databases, and data warehouses, where numerical values of

This is a preliminary release of an article accepted by ACM Transactions on Database Systems. The definitive version is currently in production at ACM and, when released, will supersede this version. Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee. © 2010 ACM 0362-5915/2010/0300-0001 $5.00. ACM Transactions on Database Systems, Vol. V, No. N, February 2010, Pages 1–0??.


tuples are derivable by aggregating values stored in other tuples. In this work we focus our attention on databases where stored data violate a set of aggregate constraints, i.e., integrity constraints defined on aggregate values extracted from the database. These constraints are defined on numerical attributes (such as sale prices, costs, etc.) which represent measure values and are not intrinsically involved in other forms of constraints. The following example describes a real-life scenario where inconsistency of numerical attributes can arise, and aggregate constraints can be exploited to manage the integrity of data.

Example 1.1. Table I represents a two-year cash budget for a firm, that is, a summary of cash flows (receipts, disbursements, and cash balances) over the specified periods. Values ‘det’, ‘aggr’ and ‘drv’ in column Type stand for detail, aggregate and derived, respectively. In particular, an item of the table is aggregate if it is obtained by aggregating items of type detail of the same section, whereas a derived item is an item whose value can be computed using the values of other items of any type and belonging to any section. This example will be referred to as “Cash budget” and used as a running example. We will denote the i-th tuple in Table I as t_i.

       Year   Section         Subsection            Type   Value
t1     2003   Receipts        beginning cash        drv       20
t2     2003   Receipts        cash sales            det      100
t3     2003   Receipts        receivables           det      120
t4     2003   Receipts        total cash receipts   aggr     250
t5     2003   Disbursements   payment of accounts   det      120
t6     2003   Disbursements   capital expenditure   det        0
t7     2003   Disbursements   long-term financing   det       40
t8     2003   Disbursements   total disbursements   aggr     160
t9     2003   Balance         net cash inflow       drv       60
t10    2003   Balance         ending cash balance   drv       80

t11    2004   Receipts        beginning cash        drv       80
t12    2004   Receipts        cash sales            det      100
t13    2004   Receipts        receivables           det      100
t14    2004   Receipts        total cash receipts   aggr     200
t15    2004   Disbursements   payment of accounts   det      130
t16    2004   Disbursements   capital expenditure   det       40
t17    2004   Disbursements   long-term financing   det       20
t18    2004   Disbursements   total disbursements   aggr     190
t19    2004   Balance         net cash inflow       drv       10
t20    2004   Balance         ending cash balance   drv       90

Table I. A cash budget

A cash budget must satisfy the following integrity constraints, which define algebraic relations among numerical data:


1. for each section and year, the sum of the values of all detail items must be equal to the value of the aggregate item of the same section and year;
2. for each year, the net cash inflow must be equal to the difference between total cash receipts and total disbursements;
3. for each year, the ending cash balance must be equal to the sum of the beginning cash and the net cash inflow.

Table I was acquired by means of an Optical Character Recognition (OCR) tool from two paper documents, reporting the cash budget for the years 2003 and 2004, respectively. The original paper documents were consistent, but some symbol recognition errors occurred during the digitizing phase, as constraints 1) and 2) are not satisfied on the acquired data for year 2003. In fact:
i) in section Receipts, the aggregate value of total cash receipts is not equal to the sum of detail values of the same section;
ii) the value of net cash inflow is not equal to the difference between total cash receipts and total disbursements. □

Most of the work dealing with the problem of extracting reliable information from inconsistent databases is based on the notion of repair, i.e., a consistent set of update operations fixing the database reasonably. In the literature, two minimality criteria are generally employed for assessing the reasonability of a repair, namely the set- and card-minimality. Both these semantics aim at giving an insight on the extent to which the repair preserves the information represented in the original database. Specifically, a repair ρ is said to be set-minimal [resp., card-minimal] if there is no repair ρ′ such that ρ′ is a strict subset of ρ [resp., ρ′ contains fewer updates than ρ]. Initially, starting from the seminal work [Arenas et al. 1999], the research community has devoted its attention mainly to the use of the set-minimal semantics, for different classes of integrity constraints.
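To make the two minimality criteria concrete, the following Python sketch checks the three cash budget constraints on the 2003 rows of Table I and tests a few candidate repairs. It is only an illustration under stated assumptions: the function names (`violations`, `apply_updates`, `is_repair`, `is_set_minimal`) and the candidate repairs `r1`, `r2`, `r3` are hypothetical and not taken from the paper, set-minimality is read literally from the informal definition above (no strict subset of the updates is itself a repair), and card-minimality is evaluated only relative to the listed candidates.

```python
from itertools import combinations

# 2003 rows of Table I: subsection -> (section, type, value)
BUDGET_2003 = {
    "beginning cash":      ("Receipts", "drv", 20),
    "cash sales":          ("Receipts", "det", 100),
    "receivables":         ("Receipts", "det", 120),
    "total cash receipts": ("Receipts", "aggr", 250),
    "payment of accounts": ("Disbursements", "det", 120),
    "capital expenditure": ("Disbursements", "det", 0),
    "long-term financing": ("Disbursements", "det", 40),
    "total disbursements": ("Disbursements", "aggr", 160),
    "net cash inflow":     ("Balance", "drv", 60),
    "ending cash balance": ("Balance", "drv", 80),
}

def violations(rows):
    """Return the violated constraints of Example 1.1 as (number, section) pairs."""
    found = []
    for section in sorted({s for s, _, _ in rows.values()}):
        det = sum(v for s, t, v in rows.values() if s == section and t == "det")
        aggr = sum(v for s, t, v in rows.values() if s == section and t == "aggr")
        if det != aggr:  # constraint 1 (an empty sum counts as 0)
            found.append((1, section))
    value = lambda name: rows[name][2]
    if value("net cash inflow") != value("total cash receipts") - value("total disbursements"):
        found.append((2, None))  # constraint 2
    if value("ending cash balance") != value("beginning cash") + value("net cash inflow"):
        found.append((3, None))  # constraint 3
    return found

def apply_updates(rows, updates):
    """Apply a set of value updates, given as a dict subsection -> new value."""
    return {k: (s, t, updates.get(k, v)) for k, (s, t, v) in rows.items()}

def is_repair(rows, updates):
    """True if applying the updates yields data satisfying all three constraints."""
    return not violations(apply_updates(rows, updates))

def is_set_minimal(rows, updates):
    """True if `updates` is a repair and no strict subset of it is a repair."""
    if not is_repair(rows, updates):
        return False
    items = list(updates.items())
    return not any(
        is_repair(rows, dict(subset))
        for size in range(len(items))
        for subset in combinations(items, size)
    )

# Hypothetical candidate repairs (illustrative only, not from the paper)
r1 = {"total cash receipts": 220}
r2 = {"cash sales": 130, "net cash inflow": 90, "ending cash balance": 110}
r3 = {"total cash receipts": 220, "beginning cash": 30, "ending cash balance": 90}

candidates = [r1, r2, r3]
fewest = min(len(r) for r in candidates if is_repair(BUDGET_2003, r))
for r in candidates:
    # columns: updated items, set-minimal?, card-minimal among the candidates?
    print(sorted(r), is_set_minimal(BUDGET_2003, r), len(r) == fewest)
```

Under these assumptions, `r1` and `r2` are both set-minimal, only `r1` is card-minimal among the candidates, and `r3` is a repair that is not set-minimal because it strictly contains `r1`.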
However, more and more interest has been exhibited in the card-minimal semantics in recent works: repairs with the minimum number of performed updates were first used in [Franconi et al. 2001] (where a strategy for fixing categorical data was introduced), then discussed in [Arenas et al. 2003] (in the context of relational data violating specific forms of universal constraints), and, finally, studied in more detail in [Lopatenko and Bertossi 2007], in the presence of denial constraints.

The interest in both the set- and card-minimal semantics, in the presence of different forms of integrity constraints, is due to the fact that, depending on the particular scenario addressed, each of them can be more suitable than the other. Also in the presence of aggregate constraints (like those of Example 1.1), these semantics are suitable for different application contexts. For instance, in the scenario of Example 1.1, where inconsistency is due to acquisition errors, repairing the data by means of sets of updates of minimum cardinality seems more reasonable, since the case that the acquiring system made the minimum number of symbol-recognition errors can be considered the most probable event. The same reasoning can be applied to other scenarios dealing with numerical data acquired automatically, such as sensor networks. In this case, inconsistency is often due to some trouble occurring at the sensors while generating readings, thus repairing data by modifying the minimum number of readings is justified. On the other hand, the


set-minimal semantics appears to be more suitable in the data integration context, where assembling data from different (even consistent) databases can result in an inconsistent database, as shown in the following example.

Example 1.2. The “High Performance Computing Lab” is a laboratory of the Computer Science department of Neverland University. Until the year 2007, the laboratory periodically sent aggregate information (with different levels of granularity) about the current projects to the central administration of the university and the business offices of the faculty and the department. Here, three distinct databases (D1, D2, D3, respectively) were maintained. Specifically, the central administration of the university received and maintained the total yearly costs of the projects, while the business offices of the faculty and the department received and stored the costs of each project aggregated per year and month, respectively. In particular, the costs in D1 and D2 took into account both the equipment and the salaries of the employees, while in D3 only the sums of the salaries of the employees were reported. The relation instances contained in D1, D2 and D3 representing this information are in Table II(a), (b) and (c), respectively.

(a)
Year   Costs
2007   5000

(b)
Project   Year   Costs
X         2007   3200
Y         2007   1800

(c)
Project   Month      Year   Costs
X         January    2007   1000
X         February   2007   2000
Y         January    2007   1000
Y         February   2007   1100

Table II. Aggregate costs of projects represented in D1 (a), D2 (b), and D3 (c)

The governing council of the university has decided to integrate the data stored in these three different databases into a single database. In the new database, the following integrity constraints are defined:
(a) for each project, the yearly cost for equipment and salaries must be greater than or equal to the sum of the monthly costs for salaries in the same year;
(b) for each year, the overall cost for equipment and salaries must be equal to the sum of the yearly costs for equipment and salaries for the different projects.
Thus, when the data from D1, D2 and D3 are integrated, an inconsistency is detected, as D2 reports that the sum of all the costs for project Y in the year 2007 is 1800, while evaluating the sum of the salaries reported in D3 results in 2100. □

In the scenario of Example 1.2, as often occurs when managing inconsistencies arising with data integration, adopting the card-minimal semantics would result in discarding reasonable ways of repairing the data. In fact, if only value updates are allowed as repair primitives, the card-minimal semantics would consider only the repairs consisting of a single update as reasonable, such as that assigning 800 to the February cost for project Y taken from Table II(c). With the set-minimal semantics, also repairs with two updates will be considered, such as the repair


updating the cost of project Y from Table II(b) into 2500 and the 2007 cost from Table II(a) into 5700. Indeed, in the absence of reasonable assumptions on the most likely source of inconsistency (which is related to how data were aggregated at the laboratory, transmitted to the business offices, and finally stored in D1, D2, D3), discarding the latter repair would be unfounded.

Besides the minimality semantics adopted, the repairing strategies proposed for traditional forms of constraints differ in the update operations allowed for fixing the inconsistent data. Most of the work in the literature considers repairs consisting of tuple insertion/deletion operations on the inconsistent database. However, this repairing strategy is not suitable for contexts analogous to that of Example 1.1, that is, of data acquired by OCR tools from paper documents. In fact, using tuple insertions/deletions as basic primitives means hypothesizing that the OCR tool skipped/“invented” a whole row when acquiring the source paper document, which is rather unrealistic. In this scenario, a repairing strategy based on attribute-update operations only seems more reasonable, as updating single attribute values is the most natural way for fixing inconsistencies resulting from symbol recognition errors. The same holds in other scenarios dealing with numerical data representing pieces of information acquired automatically, such as sensor networks. In a sensor network with error-free communication channels, no reading generated by sensors can be lost, thus repairing data by adding new readings (as well as removing collected ones) makes no sense. However, also in the general case, as observed in [Wijsen 2003], a repairing strategy based on value updates can be more reasonable than strategies performing insertions and/or deletions of tuples.
In fact, aggregate constraints are defined on measure attributes only, i.e., numerical attributes which are often a small subset of the whole set of attributes. Hence, deleting tuples to make the data consistent has the side-effect of removing the (possibly consistent) information encoded in the other attributes of the deleted tuples, thus resulting in a loss of information which is avoided if only value updates are allowed. On the other hand, a repairing strategy using tuple insertions suffers from the problem that often there is no reasonable way to guess the values which should be assigned to the non-measure attributes in the tuples to be inserted.

This work provides a study of the problem of extracting reliable information from databases violating a set of aggregate constraints in a setting where an attribute-level repairing strategy is adopted. Before explaining our contribution in detail, we provide a brief overview of the main issues and results in the area of inconsistent database management.

1.1 Related Work

First theoretical approaches to the problem of dealing with incomplete and inconsistent information date back to the 1980s, but these works mainly focus on issues related to the semantics of incompleteness [Imielinski and Lipski 1984]. The problem of extracting reliable information from inconsistent data was first addressed in [Agarwal et al. 1995], where an extension of relational algebra (namely flexible algebra) was proposed to evaluate queries on data inconsistent w.r.t. key constraints. The first proof-theoretic notion of consistent query answer was introduced in [Bry 1997], expressing the idea that tuples involved in an integrity violation should not be considered in the evaluation of consistent query answering. In [Arenas et al. 1999] a


different notion of consistent answer was introduced, based on the notion of repair: a repair of an inconsistent database D is a database D′, on the same scheme as D, satisfying the given integrity constraints and which is minimally different from D. Thus, the consistent answer of a query q posed on D is the answer which is in every result of q posed on each repair D′. In particular, in [Arenas et al. 1999] the authors show that, for quantifier-free conjunctive queries and binary universal constraints, consistent answers can be evaluated without computing repairs, but by looking only at the specified constraints and rewriting the original query q into a query q′ such that the answer of q′ on D is equal to the consistent answer of q on D. The technique based on query rewriting proposed in [Arenas et al. 1999], and further developed in [Celle and Bertossi 2000], was extended in [Fuxman and Miller 2007; Fuxman et al. 2005] to work for a subclass of conjunctive queries with existential quantification in the presence of key constraints. Recently, the results provided in [Fuxman and Miller 2007] were further generalized in [Wijsen 2009].

Based on the notions of repair and consistent query answer introduced in [Arenas et al. 1999], several works investigated the problem of querying inconsistent data considering more expressive classes of queries and constraints. In [Calì et al. 2003] the complexity of the consistent query answer problem was investigated in the presence of both functional and inclusion dependencies under different assumptions on data (soundness and/or completeness). The computational complexity of the consistent query answer problem in the presence of denial constraints and inclusion dependencies was further investigated in [Chomicki et al. 2004; Chomicki and Marcinkowski 2005], for several classes of queries with existential quantification, under the assumption that data are complete (thus exploiting a repairing strategy using tuple deletions only).
The notion of consistent answer was extended to the case of aggregate queries in [Arenas et al. 2003], where consistent answers of aggregate queries were investigated in the presence of functional dependencies.

Several works exploited logic-based frameworks for investigating the problem of computing repairs and evaluating consistent query answers. Specifically, in [Arenas et al. 2000; 2003], extended disjunctive logic programs with exceptions were used for the computation of repairs. A further generalization was proposed in [Greco et al. 2003], where the authors defined a technique based on the rewriting of constraints into extended disjunctive rules with two different forms of negation (negation as failure and classical negation). This technique was shown to be sound and complete for universally quantified constraints. In [Arenas et al. 2000], a repairing framework based on a non-classical logic (the annotated predicate calculus [Kifer and Lozinskii 1992]) was proposed, which works for queries that are conjunctions or disjunctions of positive literals in the presence of universal constraints. This strategy was extended in [Barceló and Bertossi 2002] to deal with referential integrity constraints. A similar approach was proposed in [Barceló and Bertossi 2003], where repairs were specified as the stable models of a disjunctive logic program where the database predicates contain annotations as extra arguments (as opposed to annotated programs that contain annotated atoms).

All the above-cited approaches assume that tuple insertions and deletions are the basic primitives for repairing inconsistent data. In [Franconi et al. 2001; Bertossi et al. 2008; Bohannon et al. 2005; Wijsen 2003; 2004; 2005] repairs also consisting of


value-update operations were considered. In particular, [Wijsen 2003] was the first work investigating the complexity of the consistent query answer problem in a setting where the basic primitive for repairing data is the attribute-value update.

Some works are focused on problems regarding the extraction of consistent information from inconsistent data other than the consistent query answer problem. Specifically, in [Afrati and Kolaitis 2009] the repair checking problem is addressed (i.e., the problem of checking whether a repair is reasonable according to a minimality criterion), and, among the above-cited works, [Franconi et al. 2001; Bertossi et al. 2008; Bohannon et al. 2005] focused on the problem of computing a repair. The latter problem is relevant in several contexts, where users are interested in a consistent version of the data on which an analysis task can be accomplished. For instance, consider the scenario of the “Cash budget” example, and the case that the data, after being automatically acquired, must be analyzed by means of a balance analysis tool in order to detect potential liquidity crises or to estimate the financial reliability. Obviously, the information provided by the analysis tool is meaningful only if it results from consistent data. Thus, if the acquired data are inconsistent, the analysis task should be preceded by the computation of a reasonable repair. In [Franconi et al. 2001], the computation of repairs was studied for categorical data in the presence of constraints expressed as first order formulas. In [Bohannon et al. 2005], repairs on categorical data in the presence of functional and inclusion dependencies were studied. These approaches are not well-suited in our context, as they do not consider any form of aggregate constraint.

The first work investigating aggregate constraints on numerical data is [Ross et al.
1998], where the consistency problem of very general forms of aggregation was considered, but no issue related to data-repairing was investigated. In [Bertossi et al. 2008] the problem of repairing databases by fixing numerical data at attribute level was addressed. The authors showed that deciding the existence of a repair under both denial constraints (where built-in comparison predicates are allowed) and a non-linear form of multi-attribute aggregate constraints is undecidable. Then they disregarded aggregate constraints and focused on the problem of repairing data violating denial constraints, where no form of aggregation is allowed in the adopted constraints.

The form of aggregate constraints considered in this paper was introduced in our previous work [Flesca et al. 2005], where the complexity of several problems regarding the extraction of reliable information from inconsistent numerical data (i.e., repair existence, minimal repair checking, as well as consistent query answer) was investigated. In [Fazzinga et al. 2006], we also introduced the architecture of a tool for acquiring and repairing numerical data inconsistent w.r.t. a restricted form of aggregate constraints (namely, steady aggregate constraints), along with a strategy for computing reasonable repairs.

1.2 Main Contributions

The results presented in this work extend those presented in our previous works [Fazzinga et al. 2006; Flesca et al. 2005]. Thus, the problem of repairing and extracting reliable information from data violating a given set of aggregate constraints is investigated. These constraints consist of linear inequalities on aggregate-sum queries issued on numerical values stored in the database, and enable most of the conditions on numerical data which occur in practice to be expressed.


We consider repairs consisting of sets of value-update operations aiming at reconstructing the correct measure values of inconsistent data, and consider both the set- and card -minimal semantics as criteria for deciding the reasonability of repairs. We also consider the restricted form of aggregate constraints (namely, steady aggregate constraints) introduced in [Fazzinga et al. 2006], which, although less expressive than general aggregate constraints, allow us to guarantee data consistency in many real-life scenarios. For both general and steady aggregate constraints, we provide the characterization of the computational complexity of three canonical problems: repair existence, minimal repair checking, and consistent query answer, which was only stated (without proofs) in [Flesca et al. 2005] for general aggregate constraints. Moreover, we investigate the dependency of the complexity of these problems on the domain of the numerical data involved in the constraints. Furthermore, for steady aggregate constraints, we prove the correctness of the strategy for computing a reasonable repair (according to the card -minimal semantics) introduced in [Fazzinga et al. 2006]. We also prove the effectiveness of the approach for computing repairs through several experiments in real-life scenarios. Specifically, we conducted experiments on real-life data representing cash budgets, and synthetic data representing fundings and expenses for projects developed in the departments of a university. The former scenario summarizes common characteristics of a number of contexts where databases are populated by semi-automatically acquiring data from paper documents. Further examples of this (other than applications performing analysis tasks on balance sheet data) are the management of pay statements and tax returns. 
Pay statements in paper form are usually collected by banks when making decisions on loans, while tax returns in paper form are currently collected by the US Department of the Treasury for non-resident aliens. Experiments on synthetic data consider the case of already populated databases, and measure the impact of database size on the performance of our approach.

2. PRELIMINARIES

We assume classical notions of database scheme, relation scheme, and relation instances. In the following we will also use a logical formalism to represent relational databases, and relation schemes will be represented by means of sorted predicates of the form R(A1 : ∆1, . . . , An : ∆n), where R is said to be the name of the relation scheme, A1, . . . , An are attribute names (composing the set denoted as A_R), ∆1, . . . , ∆n are the corresponding domains, and n is said to be the arity of R. Each ∆i can be either Z (infinite domain of integers), Q (rationals), S (strings), or B = {0, 1}. Specifically, B is treated as a subset of Z which is not closed under + and −. Although it is not of practical interest, it is used for proof-technical reasons. For the sake of brevity, relation schemes will be often identified by their names (thus omitting their signature consisting of attribute names and domains). A tuple over a relation scheme R of arity n is a member of (B ∪ Z ∪ Q ∪ S)^n (note that this definition of tuple over a relation scheme admits tuples inconsistent w.r.t. attribute domains). A relation instance of R is a set r of tuples over R. A database scheme 𝒟 is a set of relation schemes, whereas a database instance D is a set of relation instances of


the relation schemes in 𝒟. Given a tuple t, the value of attribute A of t will be denoted as t[A]. Given a boolean formula β consisting of comparison atoms of the form X ⋄ Y, where X, Y are either attributes of relation scheme R or constants and ⋄ is a comparison operator in {=, ≠, ≤, ≥, <, >}, we say that a tuple t over R satisfies β (denoted as t |= β) if replacing the occurrences of each attribute A in β with t[A] makes β true.

Domains B, Q and Z will be said to be numerical domains, and attributes defined over B, Q or Z will be said to be numerical attributes. Given a relation scheme R, we will denote the set of its numerical attributes representing measure data as M_R (namely, Measure attributes). That is, M_R specifies the set of attributes representing measure values, such as weights, lengths, prices, etc. For instance, in the “Cash budget” example, M_R consists only of attribute Value. Given a database scheme 𝒟, we will denote as M_𝒟 the union of the sets of measure attributes associated with the relation schemes in 𝒟.

On each relation scheme R, a key constraint is assumed. Specifically, we denote as K_R the subset of A_R consisting of the names of the attributes which are a key for R. For instance, in the “Cash budget” example, K_R = {Year, Subsection}. We also denote the key of a relation scheme by underlining its key attributes. Throughout this paper, we assume that K_R ∩ M_R = ∅, i.e., measure attributes of a relation scheme R are not used to identify tuples belonging to instances of R. Although this assumption leads to a loss of generality, it is acceptable from a practical point of view, since the situations excluded by this assumption are unlikely to occur often in real-life scenarios. Clearly, this assumption holds in the scenario considered in the “Cash budget” example.
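The notions of tuple satisfaction (t |= β), key attributes and measure attributes can be illustrated with a short Python sketch. It is a minimal illustration, assuming tuples are encoded as dictionaries and β as a Python predicate; the helper names (`satisfies`, `key_holds`) and the sample formula are hypothetical, and only three rows of Table I are used.

```python
# Three tuples of the CashBudget relation from Table I (attribute -> value)
rows = [
    {"Year": 2003, "Section": "Receipts", "Subsection": "cash sales",
     "Type": "det", "Value": 100},
    {"Year": 2003, "Section": "Receipts", "Subsection": "receivables",
     "Type": "det", "Value": 120},
    {"Year": 2004, "Section": "Receipts", "Subsection": "cash sales",
     "Type": "det", "Value": 100},
]

KEY = {"Year", "Subsection"}   # key attributes of the running example
MEASURES = {"Value"}           # measure attributes of the running example

def satisfies(t, beta):
    """t |= beta: beta is evaluated with each attribute A replaced by t[A]."""
    return beta(t)

def key_holds(instance, key):
    """True if no two tuples of the instance agree on all key attributes."""
    seen = {tuple(t[a] for a in sorted(key)) for t in instance}
    return len(seen) == len(instance)

# Hypothetical formula: beta = (Year = 2003 AND Value >= 110)
beta = lambda t: t["Year"] == 2003 and t["Value"] >= 110

print([satisfies(t, beta) for t in rows])
print(key_holds(rows, KEY), KEY.isdisjoint(MEASURES))
```

In this encoding, the assumption that key and measure attributes are disjoint is simply a set-disjointness check, and the key constraint amounts to requiring distinct projections on the key attributes.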
We distinguish between measure and non-measure attributes as, in our framework, we will rely on the assumption that inconsistencies involve measure attributes only, whereas non-measure attributes are assumed to be consistent. Therefore, also key constraints are assumed to be satisfied. The same assumption was made in [Bertossi et al. 2008] (the only work in the literature dealing with repairing numerical data), and its rationale is that, in many real-life situations, even if integrity violations of measure data can coexist with integrity violations involving non-measure data, these inconsistencies can be fixed separately. For instance, in the balance sheet scenario of our running example, errors in the OCR-mediated acquisition of non-measure attributes (such as mismatches between real and acquired strings denoting item descriptions) can be repaired in a pre-processing step using a dictionary, by searching for the strings in the dictionary which are the most similar to the acquired ones. In fact, in [Fazzinga et al. 2006], a system prototype adopting such a dictionary-based repairing strategy for string attributes is described. However, the study of the problem of repairing the data when these different forms of inconsistencies cannot be fixed separately goes beyond the scope of this paper.

2.1 Domain constraints and Aggregate constraints

Several forms of constraints can be defined over a database scheme restricting the set of its valid instances. In this paper we deal with two forms of constraints: domain constraints and aggregate constraints. The former impose that, if an attribute is associated with a domain ∆ in the definition of a relation scheme, then it must take


value from ∆. The latter define algebraic relationships among measure attributes, and are formally defined in what follows.

An attribute expression e on R is either a constant or the name of a numerical (either measure or non-measure) attribute of R. Given an attribute expression e on R and a tuple t over R, we denote as e(t) the value e, if e is a constant, or the value t[e], if e is an attribute. Given a relation scheme R and a sequence y of variables, an aggregation function χ(y) on R is a triplet ⟨R, e, α(y)⟩, where e is an attribute expression on R and α(y) is a (possibly empty) boolean combination of atomic comparisons of the form X ⋄ Y, where X and Y are constants, attributes of R, or variables in y, and ⋄ is a comparison operator in {=, ≠, ≤, ≥, <, >}. When empty, α will be denoted as ⊥. Given an aggregation function χ(y) = ⟨R, e, α(y)⟩ and a sequence a of constants with |a| = |y|, χ(a) maps every instance r of R to $\sum_{t \in r \,\wedge\, t \models \alpha(a)} e(t)$, where α(a) is the (ground) boolean combination of atomic comparisons obtained from α(y) by replacing each variable in y with the corresponding value in a. We assume that, in the case that the set of tuples selected by the evaluation of an aggregation function χ is empty, χ evaluates to 0.

Example 2.1. The following aggregation functions are defined on the relation scheme CashBudget(Year, Section, Subsection, Type, Value) of Example 1.1, where K_CashBudget = {Year, Subsection}, M_CashBudget = {Value}, and the measure attribute Value ranges over domain Z:

χ1(x, y, z) = ⟨CashBudget, Value, (Section = x ∧ Year = y ∧ Type = z)⟩
χ2(x, y) = ⟨CashBudget, Value, (Year = x ∧ Subsection = y)⟩

Function χ1 returns the sum of Value of all the tuples having Section x, Year y and Type z. For instance, evaluating χ1(‘Receipts’, ‘2003’, ‘det’) on the relation instance shown in Table I results in 100 + 120 = 220, whereas χ1(‘Disbursements’, ‘2003’, ‘aggr’) returns 160. Function χ2 returns the sum of Value of all the tuples where Year = x and Subsection = y.
In our running example, as the pair ⟨Year, Subsection⟩ is a key for CashBudget, the sum returned by χ2 is an attribute value of a single tuple. For instance, χ2(‘2003’, ‘cash sales’) returns 100, whereas χ2(‘2004’, ‘net cash inflow’) returns 10. □

Definition 2.2 (Aggregate constraint). Given a database scheme D, an aggregate constraint on D is an expression of the form:

$$\forall\, \mathbf{x} \left( \phi(\mathbf{x}) \implies \sum_{i=1}^{n} c_i \cdot \chi_i(\mathbf{y}_i) \le K \right) \qquad (1)$$

where: 1. n is a positive integer, and c1 , . . . , cn , K are constants in Q; 2. φ(x) is a (possibly empty) conjunction of atoms constructed from relation names, constants, and all the variables in x; 3. each χi (yi ) is an aggregation function, where yi is a list of variables and constants, and every variable that occurs in yi also occurs in x. ACM Transactions on Database Systems, Vol. V, No. N, February 2010.


The semantics of an aggregate constraint is the “natural” one, that is, given a database instance D over the database scheme D, an aggregate constraint of the form (1) imposes that, for all the substitutions θ of the variables in x with constants of D making φ(θ(x)) true, the inequality $\sum_{i=1}^{n} c_i \cdot \chi_i(\theta(\mathbf{y}_i)) \le K$ holds on D. Observe that aggregate constraints enable equalities to be expressed as well, since an equality can be viewed as a pair of inequalities. For the sake of brevity, in the following, equalities will be written explicitly.

Example 2.3. Constraint 1 defined in Example 1.1 can be expressed as follows:

∀ x, y, s, w, v  CashBudget(y, x, s, w, v) ⟹ χ1(x, y, ‘det’) − χ1(x, y, ‘aggr’) = 0 □

For the sake of simplicity, in the following we will use a shorter notation for denoting aggregate constraints, where universal quantification is implied and variables in φ which do not occur in any aggregation function are replaced with the symbol ‘_’. For instance, the constraint of Example 2.3 can be written as:

CashBudget(y, x, _, _, _) ⟹ χ1(x, y, ‘det’) − χ1(x, y, ‘aggr’) = 0

Example 2.4. Constraints 2 and 3 of Example 1.1 can be expressed as follows:

Constraint 2: CashBudget(x, _, _, _, _) ⟹ χ2(x, ‘net cash inflow’) − (χ2(x, ‘total cash receipts’) − χ2(x, ‘total disbursements’)) = 0

Constraint 3: CashBudget(x, _, _, _, _) ⟹ χ2(x, ‘ending cash balance’) − (χ2(x, ‘beginning cash’) + χ2(x, ‘net cash inflow’)) = 0

□

The following examples show additional usages of aggregate constraints.

Example 2.5. Consider the database scheme D resulting from the integration of the three databases of Example 1.2. D contains the following relation schemes, inherited from the three source databases: R1(Year, Costs), R2(Project, Year, Costs), R3(Project, Month, Year, Costs), where M_Ri = {Costs}, for each i ∈ [1..3]. Constraints (a) and (b) introduced in Example 1.2 can be expressed as follows:

(a) R2(x, y, _) ⟹ χ3(x, y) − χ4(x, y) ≥ 0, where χ3(x, y) = ⟨R2, Costs, (Project = x ∧ Year = y)⟩ and χ4(x, y) = ⟨R3, Costs, (Project = x ∧ Year = y)⟩.

(b) R1(y, _) ⟹ χ5(y) − χ6(y) = 0, where χ5(y) = ⟨R1, Costs, Year = y⟩ and χ6(y) = ⟨R2, Costs, Year = y⟩. □

Example 2.6. Consider the database scheme consisting of the relation schemes R1(Department, Area) and R2(Project, Department, Costs). In R1, each department is associated with a research area, and, in R2, each research project is associated with the department in which it was developed and its overall costs. Consider the following integrity constraint: for every project developed in a department of the ‘database’ area, the costs must be less than or equal to 100K. This constraint is expressed by the following aggregate constraint: R1(x, ‘database’) ∧ R2(y, x, _) ⟹ χ(y) ≤ 100K, where χ(y) = ⟨R2, Costs, Project = y⟩. □
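The aggregation functions and constraints of the preceding examples can be mimicked in a few lines of Python. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the helper names (`chi`, `chi1`, `chi2`, `violations_constraint1`) and the three-row CashBudget instance are hypothetical. Only the values 100 and 120 (and their sum 220) come from Example 2.1; the subsection name 'receivables' and the aggregate value 250 are assumed for illustration.

```python
from typing import Callable, Dict, List, Tuple, Union

Row = Dict[str, Union[int, str]]

def chi(rows: List[Row], expr: Callable[[Row], int],
        alpha: Callable[[Row], bool]) -> int:
    """Aggregation function <R, e, alpha>: sum e(t) over the tuples t that
    satisfy alpha. An empty selection yields 0, as assumed in the text."""
    return sum(expr(t) for t in rows if alpha(t))

def chi1(rows: List[Row], section: str, year: int, typ: str) -> int:
    # chi1(x, y, z) = <CashBudget, Value, (Section=x AND Year=y AND Type=z)>
    return chi(rows, lambda t: t["Value"],
               lambda t: t["Section"] == section and t["Year"] == year
               and t["Type"] == typ)

def chi2(rows: List[Row], year: int, subsection: str) -> int:
    # chi2(x, y) = <CashBudget, Value, (Year=x AND Subsection=y)>
    return chi(rows, lambda t: t["Value"],
               lambda t: t["Year"] == year and t["Subsection"] == subsection)

def violations_constraint1(rows: List[Row]) -> List[Tuple[str, int]]:
    """Constraint of Example 2.3: for every (Section, Year) occurring in the
    relation, chi1(x, y, 'det') - chi1(x, y, 'aggr') must equal 0."""
    pairs = sorted({(t["Section"], t["Year"]) for t in rows})
    return [(x, y) for (x, y) in pairs
            if chi1(rows, x, y, "det") - chi1(rows, x, y, "aggr") != 0]

# Hypothetical toy instance (the value 250 and 'receivables' are assumed).
cash_budget: List[Row] = [
    {"Year": 2003, "Section": "Receipts", "Subsection": "cash sales",
     "Type": "det", "Value": 100},
    {"Year": 2003, "Section": "Receipts", "Subsection": "receivables",
     "Type": "det", "Value": 120},
    {"Year": 2003, "Section": "Receipts", "Subsection": "total cash receipts",
     "Type": "aggr", "Value": 250},
]

print(chi1(cash_budget, "Receipts", 2003, "det"))
print(violations_constraint1(cash_budget))
```

On this toy instance the detail items sum to 220 while the assumed aggregate item stores 250, so the pair (‘Receipts’, 2003) is reported as a violation of the constraint.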


Example 2.7. Consider a relation instance over the relation scheme Employee(Name, Salary, Bonus), where both Salary and Bonus are measure attributes. Consider the constraint requiring that a total amount of 30K of bonuses has been distributed among the employees receiving a salary greater than 5K. We can express this constraint by means of the aggregate constraint ⟹ χ() = 30K, where χ() = ⟨Employee, Bonus, (Salary > 5K)⟩ returns the sum of bonuses for the employees whose salary is greater than 5K. □

Observe that, according to Definition 2.2, the conjunction of atoms on the left-hand side of an aggregate constraint can be empty. Thus, an expression of the form ⟹ χ(k1, ..., kq) ≤ K, where k1, ..., kq are constants, is an aggregate constraint, whose semantics derives from assuming the left-hand side of the implication true. In the following, for the sake of readability, we will omit the symbol ‘⟹’ for this form of constraint. For instance, the aggregate constraint introduced in Example 2.7 can be written as χ() = 30K.

2.2 Numerical Database Inconsistency

According to the two above-defined forms of constraint on numerical data (domain and aggregate constraints), we consider two forms of database inconsistency. Specifically, given a database instance D over the database scheme D, we say that D is inconsistent w.r.t. D if D contains a tuple t over a relation scheme R(A1 : ∆1, ..., An : ∆n) of D such that, for some Ai ∈ M_R, it holds that t[Ai] ∉ ∆i. Moreover, given a set of aggregate constraints AC on D, we say that D is inconsistent w.r.t. AC if there is an aggregate constraint ac ∈ AC (of the form (1)) such that there is a substitution θ of variables in x with constants of D making φ(θ(x)) true and the inequality $\sum_{i=1}^{n} c_i \cdot \chi_i(\theta(\mathbf{y}_i)) \le K$ false on D. We will write D ⊭ AC (resp., D ⊭ D) to denote that D is inconsistent w.r.t. AC (resp., w.r.t. D), and D ⊨ AC (resp., D ⊨ D) otherwise.

2.3 Updates

Updates at attribute-level will be used as basic primitives for repairing the data.

Definition 2.8 (Atomic update). Let t = R(v1, ..., vn) be a tuple over the relation scheme R(A1 : ∆1, ..., An : ∆n). An atomic update on t is a triplet ⟨t, Ai, v′i⟩, where Ai ∈ M_R, v′i is a value in ∆i, and v′i ≠ vi.

Update u = ⟨t, Ai, v′i⟩ replaces t[Ai] with v′i, thus yielding the tuple u(t) = R(v1, ..., vi−1, v′i, vi+1, ..., vn). We denote the pair ⟨tuple, attribute⟩ updated by u as λ(u), that is, λ(u) = ⟨t, Ai⟩. Observe that we consider atomic updates working on the set M_R of measure attributes only, since, as explained in Section 2, non-measure attributes are a priori assumed to be correct.

Definition 2.9 (Consistent database update). Let D be a database instance and U = {u1, ..., un} be a set of atomic updates on tuples of D. The set U is said to be a consistent database update iff ∀ j, k ∈ [1..n], if j ≠ k then λ(uj) ≠ λ(uk).

Informally, a set of atomic updates U is a consistent database update iff, for each pair of updates u1, u2 ∈ U, either u1 and u2 work on distinct tuples, or they change different attributes of the same tuple.
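Definitions 2.8 and 2.9 translate almost directly into code. The sketch below is illustrative only: the representation of a database as a dictionary from tuple identifiers to measure-attribute values, and the names `lam`, `is_atomic_update`, `is_consistent_update` and `apply_updates`, are assumptions made here and do not appear in the paper. Domain membership of the new value (v′i ∈ ∆i) is deliberately not checked.

```python
from typing import Dict, List, Tuple

# Assumed representation: tuple id -> {measure attribute: value}.
Database = Dict[str, Dict[str, int]]
# An atomic update <t, A, v'>: (tuple id, measure attribute, new value).
Update = Tuple[str, str, int]

def lam(u: Update) -> Tuple[str, str]:
    """lambda(u): the <tuple, attribute> pair touched by update u."""
    return (u[0], u[1])

def is_atomic_update(db: Database, u: Update) -> bool:
    """Definition 2.8 (without the domain check): the attribute must exist
    in the tuple and the new value must differ from the stored one."""
    tid, attr, new_value = u
    return tid in db and attr in db[tid] and db[tid][attr] != new_value

def is_consistent_update(updates: List[Update]) -> bool:
    """Definition 2.9: no two updates share the same <tuple, attribute>."""
    targets = [lam(u) for u in updates]
    return len(targets) == len(set(targets))

def apply_updates(db: Database, updates: List[Update]) -> Database:
    """Return a copy of db with every atomic update applied."""
    if not is_consistent_update(updates):
        raise ValueError("two updates target the same <tuple, attribute> pair")
    result = {tid: dict(row) for tid, row in db.items()}
    for tid, attr, new_value in updates:
        result[tid][attr] = new_value
    return result

# Hypothetical two-tuple instance with a single measure attribute.
db: Database = {"t1": {"Value": 100}, "t4": {"Value": 250}}
print(apply_updates(db, [("t4", "Value", 220)]))
```

Returning a fresh copy rather than mutating the input keeps the original instance available, which is convenient when several candidate updates are compared against the same database.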


The set of pairs ⟨tuple, attribute⟩ updated by a consistent database update U will be denoted as λ(U) = ∪_{ui ∈ U} {λ(ui)}. Given a database instance D, a tuple t in D, and a consistent database update U, we denote the tuple obtained by applying the atomic updates in U of the form ⟨t, A, v⟩ on t as U(t). Moreover, we denote the database instance resulting from applying all the atomic updates in U on the tuples of D as U(D).

3. REPAIRING INCONSISTENT DATABASES

In this section we introduce the notions of repair, minimal repair (under set- and card-minimality semantics) and consistent answer. We define the repair existence problem, the minimal repair checking problem and the consistent query answer problem, and characterize their computational complexity.

Definition 3.1 (Repair). Let D be a database scheme, AC a set of aggregate constraints on D, and D an instance of D. A repair ρ for D w.r.t. D and AC is a consistent database update such that ρ(D) ⊨ D and ρ(D) ⊨ AC.

Example 3.2. The database instance in the “Cash Budget” example can be made consistent by decreasing attribute Value in t4 down to 220. That is, ρ = {⟨t4, Value, 220⟩} is a repair. □

We now characterize the complexity of the problem of deciding whether an inconsistent database can be repaired. This problem is defined as follows:

Definition 3.3 (repair existence (re)). Let D be a fixed database scheme, and AC a fixed set of aggregate constraints on D. re is the problem of deciding whether a given instance of D belongs to the set {D | D is an instance of D and there exists a repair for D w.r.t. D and AC}.

In the following, we denote with re_∆ the version of the re problem where every measure attribute in M_D is associated with the domain ∆. For instance, re_B is the version of the repair existence problem where all the measure attributes are constrained to range over the domain B. An analogous notation will be used in the following for the other decision problems that will be defined later (minimal repair checking (mrc) and consistent query answer (cqa)).

The following lemma is a preliminary result which states that, if at least one repair ρ exists, then there is a repair ρ′ of polynomial size which updates a subset of the pairs ⟨tuple, attribute⟩ updated by ρ. The size of a repair ρ is determined by the size of the representation of the (integer or rational) numbers occurring in the atomic updates belonging to ρ.

Lemma 3.4. Let D be a database scheme, AC a set of aggregate constraints on D, and D an instance of D such that D is not consistent w.r.t. D and AC. If there is a repair ρ for D w.r.t. D and AC, then there is a repair ρ′ for D w.r.t. D and AC such that λ(ρ′) ⊆ λ(ρ) and ρ′ has polynomial size w.r.t. D.

The proof of Lemma 3.4 is rather complex, thus we postpone it to the electronic Appendix A. Here, we only give an intuition on how the result can be proved and how it will be exploited throughout the paper. The idea underlying our proof is that, given a repair ρ for D w.r.t. D and AC, ρ corresponds to a solution x̂(ρ) of a


system of linear inequalities In(ρ, D, D, AC), where each variable corresponds to a pair ⟨tuple, attribute⟩ changed by ρ, and the inequalities appropriately encode all the constraints of AC violated by D and the domain constraints. The core of the proof is showing that the inequalities in In(ρ, D, D, AC) can be defined so that any other solution x̂′ of In(ρ, D, D, AC) still corresponds to a repair ρ′, in the sense that assigning the values of x̂′ to the corresponding pairs ⟨tuple, attribute⟩ of D not only makes D consistent w.r.t. the constraints of AC originally violated by D, but also does not trigger violations of constraints of AC not encoded in In(ρ, D, D, AC). The facts that: (i) In(ρ, D, D, AC) can be constructed of polynomial size w.r.t. D, (ii) the variables of In(ρ, D, D, AC) correspond only to pairs ⟨tuple, attribute⟩ changed by ρ, and (iii) every feasible system of linear inequalities admits a polynomially bounded solution, are then exploited to show that a repair ρ′ can always be found which performs a subset of the updates in ρ and is of polynomial size w.r.t. D.

The main property straightforwardly following from Lemma 3.4 (which will be exploited to characterize the re problem) is that, when deciding the existence of a repair for an inconsistent database, the search space can be restricted to polynomial-size sets of updates. This allows us to provide the upper bound on the complexity of re stated in the following theorem.

Theorem 3.5. re is in NP.

Proof. A guess-and-check strategy for deciding an instance ⟨D, AC, D⟩ of re is the following: first, we guess a database update U for D having polynomial size w.r.t. D; then, we check whether U is a repair for D w.r.t. D and AC. This strategy is complete, as Lemma 3.4 guarantees that, if a repair exists for D, then a polynomial-size repair for D exists too. NP-membership follows from the fact that testing whether U(D) ⊨ D and U(D) ⊨ AC can be accomplished in polynomial time w.r.t. the size of D, since U has polynomial size w.r.t. D.

We now focus our attention on providing a lower bound on the complexity of re. Specifically, we consider the B-variant of this problem, and we will extend this result to the other variants in Corollary 3.21, at the end of this section. The complexity characterization of the other problems studied in this section will be presented following the same organization.

Theorem 3.6. re_B is NP-hard.

Proof. We show a LOGSPACE reduction from circuit sat [Papadimitriou 1994] to re_B. Without loss of generality, we consider an instance of circuit sat where the input boolean circuit C uses NOR gates only, and assume that both inputs and gates of C are encoded as integers (we assume that the output gate of C is encoded as 0). The inputs of C will be denoted as x1, ..., xn. The boolean circuit C can be represented by means of an instance of the following database scheme D:

gate(IDGate : Z, norVal : B, orVal : B),
gateInput(IDGate : Z, IDIngoing : Z, Val : B),
input(IDInput : Z, Val : B),

where M_gate = {norVal, orVal}, M_gateInput = {Val}, M_input = {Val}. Specifically:


(1) each gate g of C will be encoded by a tuple t over gate, where t[IDGate] = g, whereas t[norVal] and t[orVal] represent the output of g and its negation, respectively;

(2) each input xi of C will be encoded by a tuple t over input, where t[IDInput] = xi, whereas t[Val] represents the truth assignment of xi (1 and 0 will be used to represent true and false, respectively);

(3) the structure of C will be encoded by tuples over gateInput. Specifically, for any gate g of C, the fact that g takes as input the output of gate g′ can be encoded by the tuple gateInput(g, g′, v), where v is the truth value of the output of g′ encoded in the instance of gate. Analogously, the fact that g takes as input the circuit-input xi can be encoded by the tuple gateInput(g, xi, v), where v is the truth value of xi encoded in the instance of input.

Consider the database instance D of D populated as follows. For each input xi in C we insert the tuple input(xi, −1), and for each gate g in C we insert the tuple gate(g, −1, −1) into D. Moreover, for each edge in C going from g′ to the gate g (where g′ is either a gate or an input of C), the tuple gateInput(g, g′, −1) is inserted into D. We recall that measure attributes are defined on domain B, thus this initial assignment (where every measure attribute is set to −1) means that D is not consistent w.r.t. D (this can be viewed as assigning an undefined truth value to every input and output).

Consider the following aggregation functions:

- NORVal(X) = ⟨gate, norVal, (IDGate = X)⟩ and ORVal(X) = ⟨gate, orVal, (IDGate = X)⟩, returning the truth value of gate X and its opposite, respectively;
- IngoingVal(X, Y) = ⟨gateInput, Val, (IDGate = X ∧ IDIngoing = Y)⟩, returning, for the gate with identifier X, the truth value of its input with identifier Y (where Y refers to either the output of a gate or an input of the circuit);
- IngoingSum(X) = ⟨gateInput, Val, (IDGate = X)⟩, returning the sum of the truth values of the inputs of gate X;
- InputVal(X) = ⟨input, Val, (IDInput = X)⟩, returning the truth value of input X.

Consider the following aggregate constraints on D:

(1) gate(X, _, _) ⇒ ORVal(X) + NORVal(X) = 1, which imposes that, for each tuple representing a NOR gate, orVal must be complementary to norVal;

(2) gate(X, _, _) ⇒ ORVal(X) − IngoingSum(X) ≤ 0, which imposes that, for each tuple representing a NOR gate, orVal cannot be greater than the sum of the truth assignments of its inputs (i.e., if all inputs are 0, orVal must be 0 too);

(3) gateInput(X, Y, _) ⇒ IngoingVal(X, Y) − ORVal(X) ≤ 0, which imposes that, for each gate g, attribute orVal must be 1 if at least one input of g has value 1;

(4) gateInput(X, Y, _) ⇒ IngoingVal(X, Y) − NORVal(Y) − InputVal(Y) = 0, which imposes that the attribute Val in each tuple of gateInput is the same as the truth value of either the ingoing gate or the ingoing input.

Let AC′ be the set consisting of the above-defined aggregate constraints. It is easy to see that every truth assignment of the inputs of C corresponds one-to-one to a repair ρ for D w.r.t. D and AC′, where, for each tuple t of gate in ρ(D), the value


of attribute norVal encodes the truth value (resulting from the truth assignment) of the output of the gate encoded by t. Hence, there are 2^n possible repairs for D w.r.t. D and AC′ (each corresponding to a truth assignment of inputs x1, ..., xn). For each of these pairs ⟨truth-assignment, repaired-database⟩, the output of the circuit resulting from the truth assignment is encoded in attribute norVal of the tuple of gate associated with the output of C. Let AC be the set of aggregate constraints obtained by augmenting AC′ with the aggregate constraint NORVal(0) = 1 (which imposes that the truth value of the output gate must be true). By applying the above-described reasoning, it is easy to see that deciding whether there is a truth assignment which makes C evaluate to true is equivalent to deciding whether there is a repair ρ for D w.r.t. D and AC.

3.1 Minimal repairs

Theorems 3.5 and 3.6 deal with the problem of deciding whether a database D violating a set of aggregate constraints AC can be repaired. If this is the case, different repairs can be performed on D yielding a new database consistent w.r.t. D and AC, although not all of them can be considered “reasonable”. For instance, if a repair exists for D changing only one value in one tuple of D, any repair updating all values in all tuples of D can be reasonably disregarded. To evaluate whether a repair should be considered “relevant” or not, we introduce two different ordering criteria on repairs, corresponding to the comparison operators ‘ […]

[…] ∈ λ(ρ), a unique variable x_{t,A} ranging on the domain of A. The intuition underlying the proof is that the existence of ρ implies the existence of a feasible set of linear inequalities In(ρ, D, D, AC) defined on the variables of Xρ such that every solution x̂ of In(ρ, D, D, AC) corresponds to a repair ρ_x̂ for D w.r.t. D and AC. In more detail, we will show that, given a solution x̂ of In(ρ, D, D, AC), and denoting the value of variable x_{t,A} in x̂ as x̂_{t,A}, the set of atomic updates

ρ_x̂ = { ⟨t, A, x̂_{t,A}⟩ | ⟨t, A⟩ ∈ λ(ρ) ∧ x̂_{t,A} ≠ t[A] }

is a repair for D w.r.t. D and AC. Specifically, In(ρ, D, D, AC) can be shown to have a solution x̂ of polynomial size w.r.t. D, so that ρ_x̂ is a polynomial-size repair for D w.r.t. D and AC which updates a subset of the values modified by ρ. The intuition underlying the construction of In(ρ, D, D, AC) is given by the following example.

Example A.1. Let D be the database scheme consisting of the relation schemes R1(K1, A, B, C), R2(K2, D, E, F), where M_D = {A, B, C, D, E, F} and measure attributes are defined over Z. Let D be the database instance of D consisting of the following relation instances (t_x denotes the tuple where the key attribute is equal to x):

This is a preliminary release of an article accepted by ACM Transactions on Database Systems. The definitive version is currently in production at ACM and, when released, will supersede this version. Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee. c 2010 ACM 0362-5915/2010/0300-0001 $5.00  ACM Transactions on Database Systems, Vol. V, No. N, February 2010.


r1       K1   A   B   C
t_a1     a1   1   2   3
t_a2     a2   1   2   87

r2       K2   D   E   F
t_b1     b1   1   4   3
t_b2     b2   2   6   6

Let AC be the singleton consisting of the following aggregate constraint:

$$ac:\ \forall x_1, x_2, x_3, x_4, x_5 \left( R_1(x_1, x_2, x_3, 3) \wedge R_2(x_4, x_2, 4, x_5) \implies \chi(x_3) \le 2 \right) \qquad (3)$$

where χ(x3) = ⟨R2, F, x3 < E⟩. Consider the following set of atomic updates on D:

ρ = {⟨t_b1, D, 3⟩, ⟨t_b1, E, 1⟩, ⟨t_b2, D, 1⟩, ⟨t_b2, E, 4⟩, ⟨t_b2, F, 1⟩}.

It is easy to see that D is not consistent w.r.t. AC, and that ρ is a repair⁵ for D w.r.t. D and AC, as the instances of R1, R2 in ρ(D) are as follows (attribute values updated by ρ are shown in parentheses):

ρ(r1)       K1   A    B    C
ρ(t_a1)     a1   1    2    3
ρ(t_a2)     a2   1    2    87

ρ(r2)       K2   D    E    F
ρ(t_b1)     b1   (3)  (1)  3
ρ(t_b2)     b2   (1)  (4)  (1)
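The two claims just made about Example A.1 (D violates ac, whereas ρ(D) satisfies it) can be checked mechanically. The following Python sketch hard-codes constraint (3) and the relation instances above; the function names `chi` and `satisfies_ac` and the encoding of tuples as plain Python tuples are assumptions made for illustration only.

```python
# Rows of r1 are (K1, A, B, C); rows of r2 are (K2, D, E, F).
r1 = [("a1", 1, 2, 3), ("a2", 1, 2, 87)]
r2 = [("b1", 1, 4, 3), ("b2", 2, 6, 6)]

# r2 after applying rho = {<t_b1,D,3>, <t_b1,E,1>, <t_b2,D,1>,
#                          <t_b2,E,4>, <t_b2,F,1>}; r1 is not touched.
rho_r2 = [("b1", 3, 1, 3), ("b2", 1, 4, 1)]

def chi(r2_rows, x3):
    """chi(x3) = <R2, F, x3 < E>: sum of F over the tuples with x3 < E."""
    return sum(f for (_, _, e, f) in r2_rows if x3 < e)

def satisfies_ac(r1_rows, r2_rows):
    """ac: R1(x1, x2, x3, 3) AND R2(x4, x2, 4, x5) ==> chi(x3) <= 2."""
    for (_, a, b, c) in r1_rows:
        if c != 3:                        # R1 atom requires C = 3
            continue
        for (_, d, e, _f) in r2_rows:
            if d == a and e == 4:         # join on x2 and E = 4
                if chi(r2_rows, b) > 2:   # right-hand side violated
                    return False
    return True

print(satisfies_ac(r1, r2))       # original instance D
print(satisfies_ac(r1, rho_r2))   # repaired instance rho(D)
```

On D the premise holds for t_a1 and t_b1 and χ(2) = 3 + 6 = 9 > 2, so the check fails; on ρ(D) the premise holds for t_a1 and t_b2 and χ(2) = 1, so the check succeeds.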

We define the system of inequalities In(ρ, D, D, AC) on the set of variables Xρ corresponding to the pairs ⟨tuple, attribute⟩ updated by ρ, i.e., Xρ = {x_{tb1,D}, x_{tb1,E}, x_{tb2,D}, x_{tb2,E}, x_{tb2,F}}. For each aggregate constraint we proceed as follows. First, we re-write the formula φ(x1, x2, x3, x4, x5) = R1(x1, x2, x3, 3) ∧ R2(x4, x2, 4, x5) into the following equivalent formula, where only new fresh variables occur in the conjunction of relation atoms:

φ′(x1, x2, x3, x4, x5) = ∃ z1, z2, z3, z4, z5, z6, z7, z8 (R1(z1, z2, z3, z4) ∧ R2(z5, z6, z7, z8) ∧ z2 = z6 ∧ z4 = 3 ∧ z7 = 4 ∧ z1 = x1 ∧ z2 = x2 ∧ z3 = x3 ∧ z5 = x4 ∧ z8 = x5)

Then, we consider the substitutions of variables z such that the conjunction of relation atoms R1(z1, z2, z3, z4) ∧ R2(z5, z6, z7, z8) evaluates to true on ρ(D). In our example, this happens for four substitutions. Two of them are the following:

- θ′ = {z1/a2, z2/1, z3/2, z4/87, z5/b1, z6/3, z7/1, z8/3}, since the conjunction R1(a2, 1, 2, 87) ∧ R2(b1, 3, 1, 3) is true on ρ(D);
- θ″ = {z1/a1, z2/1, z3/2, z4/3, z5/b2, z6/1, z7/4, z8/1}, since the conjunction R1(a1, 1, 2, 3) ∧ R2(b2, 1, 4, 1) is true on ρ(D).

Then, for each of these substitutions, we distinguish the following two cases:

⁵ We use a non-minimal repair intentionally, as this allows us to provide an insight on the meaning of the inequalities in In(ρ, D, D, AC).


Case 1: For the given constraint ac and substitution, the whole conjunction φ on the left-hand side of constraint ac is false w.r.t. ρ(D). This happens for θ′, since there is an equality atom in φ′ imposing that z7 = 4, whereas θ′(z7) = 1. We add the following inequalities to In(ρ, D, D, AC) in order to ensure that, for every solution x̂ of In(ρ, D, D, AC), the same conjunction evaluates to false w.r.t. ρ_x̂(D) too. As θ′(z7) = 1 < 4, we add the inequality x_{tb1,E} < 4 to In(ρ, D, D, AC) (we use variable x_{tb1,E} since, according to θ′, z7 corresponds to attribute E in the tuple t_b1 = R2(b1, 1, 4, 3) in D). The inequality x_{tb1,E} < 4 ensures that, in every repaired database ρ_x̂(D) (corresponding to a solution x̂ of In(ρ, D, D, AC)), ρ_x̂(t_b1)[E] < 4, that is, the above conjunction evaluates to false w.r.t. ρ_x̂(D) too. Analogously, considering the conjunct z2 = z6, since θ′(z2) = 1 < θ′(z6) = 3, we add the inequality 1 < x_{tb1,D} to In(ρ, D, D, AC).

Case 2: For the given constraint ac and substitution, the whole conjunction φ on the left-hand side of constraint ac is true w.r.t. ρ(D). It is easy to see that this happens for θ″. For the unique aggregation function χ(x3) = ⟨R2, F, x3 < E⟩ in ac, we denote with T (resp., F) the set of tuples t in r2 such that ρ(t) satisfies (resp., does not satisfy) the α-formula x3 < E, that is, T = {R2(b2, 2, 6, 6)} and F = {R2(b1, 1, 4, 3)}. We augment In(ρ, D, D, AC) as follows.

a. We add the inequality x_{tb2,F} ≤ 2 to In(ρ, D, D, AC). This takes into account the summation of the attribute expression ‘F’ evaluated on the tuples in T.

b. For the tuple in T, we add the inequality 2 < x_{tb2,E} to In(ρ, D, D, AC). This ensures that, for every repaired database ρ_x̂(D) (corresponding to a solution x̂ of In(ρ, D, D, AC)), condition t_b2 ∈ T holds.

c. For the tuple in F, we add the inequality 1 ≥ x_{tb1,E} to In(ρ, D, D, AC). This ensures that, for every repaired database ρ_x̂(D) (corresponding to a solution x̂ of In(ρ, D, D, AC)), condition t_b1 ∈ F holds. □

Statement of Lemma 3.4: Let D be a database scheme, AC a set of aggregate constraints on D, and D an instance of D such that D is not consistent w.r.t. D and AC. If there is a repair ρ for D w.r.t. D and AC, then there is a repair ρ′ for D w.r.t. D and AC such that λ(ρ′) ⊆ λ(ρ) and ρ′ has polynomial size w.r.t. D.

Proof. We first introduce the definition of In(ρ, D, D, AC), by explaining how to translate each constraint in AC into a set of (in)equalities. Then, we prove that In(ρ, D, D, AC) has at least one solution of polynomial size w.r.t. D, and that every solution of In(ρ, D, D, AC) defines a repair for D w.r.t. D and AC. Finally, we exploit these properties to prove the statement.

——————— Beginning of Definition of In(ρ, D, D, AC) ———————

The set of inequalities In(ρ, D, D, AC) is obtained by first translating each aggregate constraint ac in AC into a set of inequalities In(ρ, D, D, ac), and then assembling the sets of inequalities corresponding to the different constraints into a unique set of inequalities. We focus our attention on defining the translation of a single aggregate constraint ac in AC.


Constraint ac is of the form (1) with φ(x) = R1(x1) ∧ · · · ∧ Rm(xm), where R1, ..., Rm are (not necessarily distinct) relation names in D and each xi is a sequence of variables occurring in x and constants. Formula φ(x) can be re-written into an equivalent formula φ′(x) of the form:

∃z (Ψ(z) ∧ Υ(z, x))    (4)

where:
- z is a sequence of new fresh variables not occurring elsewhere;
- Ψ(z) = R1(z1) ∧ · · · ∧ Rm(zm);
- Υ(z, x) is a conjunction of equality atoms of the form z = T, where z occurs in z, T is either a constant or a variable in (z, x), and each variable that occurs in x appears in exactly one equality atom.

Let Θac be the set of ground substitutions of variables in z such that, for every θ ∈ Θac, θ(Ψ) evaluates to true on ρ(D). For each θ ∈ Θac and variable z occurring at the i-th position of atom Rl(zl) in Ψ, let A be the i-th attribute of relation scheme Rl. We define the substitution θ̂ of z with either variables in Xρ or constants as follows:

θ̂(z) = x_{t,A}, if ⟨t, A⟩ ∈ λ(ρ), where t is the tuple in D such that ρ(t) = θ(Rl(zl));
θ̂(z) = θ(z), otherwise.

Basically, θ̂ substitutes the variable z with the variable x_{t,A} in the case that the value of the attribute A in the tuple t in D corresponding to the tuple ρ(t) = θ(Rl(zl)) in ρ(D) has been updated by ρ. Otherwise, θ̂ substitutes z with θ(z), which is the value of attribute A in both the tuple t in D and the tuple ρ(t) in ρ(D).

The set of inequalities In(ρ, D, D, ac) corresponding to a single aggregate constraint ac is obtained by assembling different sets of inequalities, which are generated considering the substitutions in Θac separately. Specifically, for each θ ∈ Θac, we define a set of inequalities In(ρ, D, D, ac, θ) according to the following cases:

Case 1: There is no substitution θ_x of variables in x with variables in z such that θ(θ_x(φ′)) (i.e., θ(Ψ) ∧ θ(θ_x(Υ))) evaluates to true on ρ(D). Basically, in this case, we define the inequalities In(ρ, D, D, ac, θ) ensuring that, if the conjunction on the left-hand side of constraint ac is false w.r.t. ρ(D), then, for every solution x̂ of In(ρ, D, D, AC), the same conjunction evaluates to false w.r.t. ρ_x̂(D) too. Initially, In(ρ, D, D, ac, θ) is set to ∅. Then, for each equality atom in Υ of the form z = T, if the following three conditions are true:
- T is either a variable in z or a constant (i.e., T is not a variable in x);
- θ(z) ≠ θ(T);
- at least one among θ̂(z) and θ̂(T) is not a constant,
then we add to In(ρ, D, D, ac, θ) either the inequality θ̂(z) > θ̂(T) (in the case that θ(z) > θ(T)) or the inequality θ̂(z) < θ̂(T) (in the case that θ(z) < θ(T)).


Observe that the number of (in)equalities added to In(ρ, D, D, ac, θ) in this case is bounded by |Υ|, that is, the number of equality atoms occurring in Υ.

Case 2: There is a substitution θ_x of variables in x with variables in z such that θ(θ_x(φ′)) evaluates to true on ρ(D)⁶. Basically, in this case, In(ρ, D, D, ac, θ) consists of inequalities translating the fact that the right-hand side of ac holds on every database ρ_x̂(D) (where x̂ corresponds to a solution of In(ρ, D, D, AC)) whenever the variables occurring as arguments of the aggregation functions in ac have been substituted with the values which they are assigned by θ.

The right-hand side of ac is an inequality of the form $\sum_{i=1}^{n} c_i \cdot \chi_i(\mathbf{y}_i) \le K$. For each i ∈ [1..n], let χi(yi) = ⟨Rχi, ei, αi(yi)⟩. Without loss of generality, we assume that every αi is in disjunctive normal form. For every i ∈ [1..n], we denote as Ti (resp., Fi) the set of tuples in the instance ri of Rχi in D such that, for each t ∈ Ti (resp., t ∈ Fi), the corresponding tuple ρ(t) in ρ(D) satisfies (resp., does not satisfy) θ(θ_x(αi)) (see Section 2 for the notion of tuple satisfying a boolean formula). In(ρ, D, D, ac, θ) is initially set to ∅. Then, we augment In(ρ, D, D, ac, θ) with the following sets of (in)equalities:

a. first, we add to In(ρ, D, D, ac, θ) the inequality:

$$\sum_{i=1}^{n} c_i \cdot \sum_{t \in T_i \,\wedge\, \langle t, e_i \rangle \in \lambda(\rho)} x_{t,e_i} \;\le\; K - \sum_{i=1}^{n} c_i \cdot \sum_{t \in T_i \,\wedge\, \langle t, e_i \rangle \notin \lambda(\rho)} e_i(t) \qquad (5)$$

where the summation on the right-hand side represents the sum of values returned by aggregation functions whose attribute expressions consist of either constants or attribute values not changed by ρ⁷.

b. for each i ∈ [1..n] and each tuple t ∈ Ti we proceed as follows. From the definition of Ti, we have that there is at least one disjunct β in αi such that ρ(t) satisfies θ(θ_x(β)). For each β such that ρ(t) satisfies θ(θ_x(β)) we proceed as follows. Every conjunct γ in β is of the form w1 ⋄ w2, where ⋄ is a comparison operator, and w1, w2 are either variables in x, constants, or attribute names. Before defining the inequalities that must be added to In(ρ, D, D, ac, θ) for the pair ⟨i, t⟩ under consideration, we introduce the following functions defined on the set of variables, constants and attribute names occurring in β:

µ(w) = ρ(t)[w], if w is an attribute name;
µ(w) = θ(θ_x(w)), if w is a variable in x;
µ(w) = w, if w is a constant;

⁶ Observe that, as each variable x appears in exactly one equality atom in Υ, substitution θ_x is unique.
⁷ Observe that if ei is a constant, then ⟨t, ei⟩ ∉ λ(ρ).


µ̂(w) = x_{t,w}, if w is an attribute name and ⟨t, w⟩ ∈ λ(ρ);
µ̂(w) = ρ(t)[w], if w is an attribute name and ⟨t, w⟩ ∉ λ(ρ);
µ̂(w) = θ(θ_x(w)), if w is a variable in x;
µ̂(w) = w, if w is a constant.

For each conjunct w1 ⋄ w2 in β such that either µ̂(w1) or µ̂(w2) is a variable, we consider the following cases:
- if ⋄ is equal to ‘≠’, then either the inequality µ̂(w1) < µ̂(w2) (in the case that µ(w1) < µ(w2)) or the inequality µ̂(w1) > µ̂(w2) (in the case that µ(w1) > µ(w2)) is added to In(ρ, D, D, ac, θ);
- if ⋄ is different from ‘≠’, then the (in)equality µ̂(w1) ⋄ µ̂(w2) is added to In(ρ, D, D, ac, θ).

Observe that the overall number of (in)equalities added to In(ρ, D, D, ac, θ) in this case is bounded by $\sum_{i=1}^{n} |T_i| \cdot |\alpha_i|$, where |αi| denotes the number of comparison atoms occurring in αi.

c. for each i ∈ [1..n] and each tuple t ∈ Fi, we proceed as follows. Let µ and µ̂ be functions on the set of variables, constants and attribute names occurring in β defined as in the above case. From the definition of Fi it follows that every disjunct βj in αi contains at least one conjunct γj such that ρ(t) does not satisfy θ(θ_x(γj)). Specifically, conjunct γj is of the form w1 ⋄ w2, where ⋄ is a comparison operator, and w1, w2 are either variables in x, constants, or attribute names. For each disjunct βj in αi, we add to In(ρ, D, D, ac, θ) an inequality defined as follows:
- if ⋄ is ‘=’, then either the inequality µ̂(w1) < µ̂(w2) or µ̂(w1) > µ̂(w2) is added to In(ρ, D, D, ac, θ), in the case that either µ(w1) < µ(w2) or µ(w1) > µ(w2), respectively;
- otherwise, the inequality µ̂(w1) ⋄̄ µ̂(w2) is added to In(ρ, D, D, ac, θ), where ⋄̄ is the comparison operator “opposite” to ⋄ (for instance, if ⋄ is ‘≤’, then ⋄̄ is ‘>’).

Observe that the overall number of (in)equalities added to In(ρ, D, D, ac, θ) in this case is bounded by $\sum_{i=1}^{n} |F_i| \cdot |\alpha_i|$, where |αi| denotes the number of comparison atoms occurring in αi.

The above-defined rules for translating a pair ⟨ac, θ⟩ into a set of linear inequalities In(ρ, D, D, ac, θ) can be used to formally define the set In(ρ, D, D, AC). Specifically, the set of inequalities In(ρ, D, D, AC) translating the set of aggregate constraints AC is defined as follows:

$$In(\rho, D, D, AC) = \bigcup_{ac \in AC} \; \bigcup_{\theta \in \Theta_{ac}} In(\rho, D, D, ac, \theta).$$

In(ρ, D, D, AC) also contains, for each variable x_{t,A}, the specification of the domain which x_{t,A} ranges over (that is, B, Z, or Q), according to the domain associated with A.

———————— End of Definition of In(ρ, D, D, AC) ————————


For the set of inequalities In(ρ, 𝒟, D, AC) introduced above, the properties stated in the following two claims hold.

Claim A.2. In(ρ, 𝒟, D, AC) has at least one solution of polynomial size w.r.t. D.

Claim A.3. For each solution x̂ of In(ρ, 𝒟, D, AC), the set of updates ρ_x̂ is a repair for D w.r.t. 𝒟 and AC.

The proofs of these claims are postponed for the sake of readability. We first exploit them to complete the proof of this lemma. Claim A.2 implies that there is a solution x̂ of In(ρ, 𝒟, D, AC) having polynomial size w.r.t. D, and Claim A.3 implies that the corresponding set of updates ρ_x̂ is a repair for D w.r.t. 𝒟 and AC. It is easy to see that ρ_x̂ is of polynomial size w.r.t. D too (as every updated value in ρ_x̂ is equal to a value in x̂), and that ρ_x̂ updates a subset of the values updated by ρ (this trivially follows from the definition of ρ_x̂).

Proof of Claim A.2. Obviously, In(ρ, 𝒟, D, AC) has at least one solution, where each variable x_{t,A} ∈ X_ρ is assigned the value v such that ⟨t, A, v⟩ ∈ ρ (it is easy to see that, for each above-described translation rule, every (in)equality added to In(ρ, 𝒟, D, AC) is satisfied by replacing each variable x_{t,A} with ρ(t)[A], i.e. the value assigned by ρ to attribute A in tuple t). For each aggregate constraint ac ∈ AC and each θ ∈ Θ_ac, every constant occurring in the inequalities of In(ρ, 𝒟, D, ac, θ) is either a constant in ac or an attribute value in D. Hence, since |X_ρ| is bounded by the size of D, the size of each inequality is linearly bounded by the size of D. Moreover, the number of inequalities which In(ρ, 𝒟, D, ac, θ) consists of is bounded by |Υ| + 1 + Σ_{i=1..n} |r_i| · |α_i|, where r_i is the instance of R_i. In fact, if Case 1 holds, In(ρ, 𝒟, D, ac, θ) is such that |In(ρ, 𝒟, D, ac, θ)| ≤ |Υ| (see the description of Case 1). Otherwise, In(ρ, 𝒟, D, ac, θ) consists of both inequality (5) (Case 2.a) and the inequalities generated as described in Case 2.b and Case 2.c, thus |In(ρ, 𝒟, D, ac, θ)| ≤ 1 + Σ_{i=1..n} |T_i| · |α_i| + Σ_{i=1..n} |F_i| · |α_i| = 1 + Σ_{i=1..n} |r_i| · |α_i|. Furthermore, for each ac ∈ AC, it holds that |Θ_ac| = Π_{i=1..m} |r_i|. Hence, the number of (in)equalities in In(ρ, 𝒟, D, ac) is bounded by (|Υ| + 1 + Σ_{i=1..n} |r_i| · |α_i|) · Π_{i=1..m} |r_i|. Since the size of each inequality in In(ρ, 𝒟, D, ac) is linearly bounded by the size of D (as explained above), the size of In(ρ, 𝒟, D, ac) is polynomially bounded by the size of D. Finally, since the number of constraints in AC is independent of the size of D, it follows that the size of In(ρ, 𝒟, D, AC) is polynomially bounded by the size of D too. Since In(ρ, 𝒟, D, AC) has at least one solution, there is a solution of In(ρ, 𝒟, D, AC) of polynomial size w.r.t. the size of In(ρ, 𝒟, D, AC)⁸, thus of polynomial size w.r.t. D too. □

⁸ This result follows from the results provided in [Papadimitriou and Steiglitz 1998], where it was shown that every feasible instance of ili (Integer Linear Inequalities) admits a solution having polynomial size w.r.t. the size of the input system of inequalities. In [Papadimitriou and Steiglitz 1998], the extension of this result to mili (Mixed Integer Linear Inequalities), where also rational variables are allowed, was left as an exercise for the reader. In our case, In(ρ, 𝒟, D, AC) contains

also strict inequalities (which are not allowed in the formulation of mili in [Papadimitriou and Steiglitz 1998]). The complete proof of this result is reported at the end of this appendix.

Proof of Claim A.3. First, observe that λ(ρ_x̂) is a subset of λ(ρ), that is, attribute values which are not updated by ρ are not updated by ρ_x̂ either. We prove the statement by contradiction. That is, we assume that there is a solution x̂ of In(ρ, 𝒟, D, AC) such that ρ_x̂ is not a repair for D w.r.t. 𝒟 and AC. This means that there is a constraint ac in AC such that ρ_x̂(D) ⊭ ac. Constraint ac can be rewritten in the form ∀x (φ′(x) ⟹ Σ_{i=1..n} c_i · χ_i(y_i) ≤ K), where φ′ is of the form (4). The fact that ρ_x̂(D) ⊭ ac implies that there is a ground substitution θ_{ρx̂} of the variables in z and a substitution θ̄_{ρx̂} of the variables in x with variables in z such that:
(a) θ_{ρx̂}(θ̄_{ρx̂}(φ′)) is true on ρ_x̂(D), and
(b) Σ_{i=1..n} c_i · χ_i(θ_{ρx̂}(θ̄_{ρx̂}(y_i))) > K, where every χ_i is evaluated on ρ_x̂(D).

Let θ be the ground substitution of the variables in z such that, for each l ∈ [1..m], it holds that ρ(t) = θ(R_l(z_l)), where t is the tuple in D such that ρ_x̂(t) = θ_{ρx̂}(R_l(z_l)). Basically, θ_{ρx̂} and θ are ground substitutions of the variables in z such that, for each l ∈ [1..m], the tuple t′ = θ_{ρx̂}(R_l(z_l)) in ρ_x̂(D) and the tuple t″ = θ(R_l(z_l)) in ρ(D) result from updating the same tuple t in D, i.e., t′ = ρ_x̂(t) and t″ = ρ(t).

We now prove that there is a substitution θ̄ of variables in x with variables in z such that θ(θ̄(φ′)) is true on ρ(D). Let θ̄ = {x/z | the conjunct x = z appears in Υ}. First, observe that θ̄ is a substitution of variables in x with variables in z, since each variable x in x appears in exactly one conjunct in Υ (thus, θ̄ contains no two distinct pairs x/z1, x/z2 with z1 ≠ z2). We now show that θ(θ̄(φ′)) is true on ρ(D). We proceed by contradiction, that is, we assume that θ(θ̄(φ′)) is false on ρ(D). This implies that there is a conjunct z = T in Υ (where T is a variable in z or a constant) such that θ(z) ≠ θ(T). If both θ(z) and θ(T) are constants, then θ_{ρx̂}(z) ≠ θ_{ρx̂}(T) holds too (this trivially follows from the above-explained fact that attribute values which are not updated by ρ are not updated by ρ_x̂), which contradicts that θ_{ρx̂}(θ̄_{ρx̂}(φ′)) is true on ρ_x̂(D). Otherwise, if at least one among θ(z) and θ(T) is not a constant, either the inequality θ(z) < θ(T) or the inequality θ(z) > θ(T) is in In(ρ, 𝒟, D, AC) (see Case 1). Since x̂ is a solution of In(ρ, 𝒟, D, AC), inequality θ(z) < θ(T) (resp., inequality θ(z) > θ(T)) implies that θ_{ρx̂}(z) < θ_{ρx̂}(T) (resp., θ_{ρx̂}(z) > θ_{ρx̂}(T)), thus contradicting that θ_{ρx̂}(θ̄_{ρx̂}(φ′)) is true on ρ_x̂(D). This completes the proof of the existence of θ̄.

The existence of θ̄ implies that, when the pair ⟨ac, θ⟩ is considered to define the inequalities of In(ρ, 𝒟, D, ac, θ), Case 2 holds. Observe that this implies that the inequality (5) is in In(ρ, 𝒟, D, ac, θ) (see Case 2.a). We first prove that, for each aggregation function χ(y) = ⟨R, e, α(y)⟩ appearing in ac and for each tuple t belonging to the instance r of R, if ρ(t) satisfies θ(θ̄(α)) then ρ_x̂(t) satisfies θ_{ρx̂}(θ̄_{ρx̂}(α)), and vice versa.

Left-to-right implication: Reasoning by contradiction, assume that ρ_x̂(t) does not satisfy θ_{ρx̂}(θ̄_{ρx̂}(α)). The fact that ρ(t) satisfies θ(θ̄(α)) implies that there is at least one disjunct β in α such that ρ(t) satisfies θ(θ̄(β)). Since ρ_x̂(t) does not satisfy θ_{ρx̂}(θ̄_{ρx̂}(α)), it must be the case that ρ_x̂(t) does not satisfy θ_{ρx̂}(θ̄_{ρx̂}(β)). That is, there is a conjunct w1 ⋄ w2 in β such that ρ_x̂(t) does not satisfy θ_{ρx̂}(θ̄_{ρx̂}(w1 ⋄ w2)). We consider two cases separately.
- If both µ̂(w1) and µ̂(w2) are constants, then neither w1 nor w2 refers to database values updated by ρ. Thus, since ρ_x̂ updates a subset of the values modified by ρ, it must hold that ρ_x̂(t) satisfies θ_{ρx̂}(θ̄_{ρx̂}(w1 ⋄ w2)), which is a contradiction.
- If at least one among µ̂(w1) and µ̂(w2) is a variable and ⋄ is '≠', then either the inequality µ̂(w1) < µ̂(w2) or the inequality µ̂(w1) > µ̂(w2) has been added to In(ρ, 𝒟, D, ac, θ). Since x̂ satisfies the inequalities of In(ρ, 𝒟, D, ac, θ), it must be the case that ρ_x̂(t) satisfies θ_{ρx̂}(θ̄_{ρx̂}(w1 < w2)) (resp., θ_{ρx̂}(θ̄_{ρx̂}(w1 > w2))), and then it satisfies θ_{ρx̂}(θ̄_{ρx̂}(w1 ≠ w2)) too, which is a contradiction. An analogous reasoning can be exploited in the case that ⋄ is different from '≠'.

Right-to-left implication: Reasoning by contradiction, assume that ρ(t) does not satisfy θ(θ̄(α)). In this case, for every disjunct β in α, there is a conjunct γ = w1 ⋄ w2 such that ρ(t) does not satisfy θ(θ̄(w1 ⋄ w2)). We first show that this implies that ρ_x̂(t) does not satisfy θ_{ρx̂}(θ̄_{ρx̂}(w1 ⋄ w2)) either. We consider separately the following two cases.
- If both µ̂(w1) and µ̂(w2) are constants, then neither w1 nor w2 refers to database values updated by ρ. Thus, since ρ_x̂ updates a subset of the values modified by ρ, it must hold that ρ_x̂(t) does not satisfy θ_{ρx̂}(θ̄_{ρx̂}(w1 ⋄ w2)), which implies that ρ_x̂(t) does not satisfy θ_{ρx̂}(θ̄_{ρx̂}(β)).
- If at least one among µ̂(w1) and µ̂(w2) is a variable and ⋄ is '=', then either the inequality µ̂(w1) < µ̂(w2) or the inequality µ̂(w1) > µ̂(w2) has been added to In(ρ, 𝒟, D, ac, θ). Since x̂ satisfies the inequalities of In(ρ, 𝒟, D, ac, θ), it must be the case that ρ_x̂(t) satisfies θ_{ρx̂}(θ̄_{ρx̂}(w1 < w2)) (resp., θ_{ρx̂}(θ̄_{ρx̂}(w1 > w2))), and then it does not satisfy θ_{ρx̂}(θ̄_{ρx̂}(w1 = w2)). An analogous reasoning can be exploited in the case that ⋄ is different from '='.
This reasoning can be applied to every disjunct β in α, that is, for every disjunct β in α there is a conjunct γ such that ρ_x̂(t) does not satisfy θ_{ρx̂}(θ̄_{ρx̂}(γ)). This implies that ρ_x̂(t) does not satisfy θ_{ρx̂}(θ̄_{ρx̂}(α)), which contradicts the hypothesis.

We now show that the above-proved equivalence (i.e., for each ⟨R, e, α⟩ in ac and each tuple t in the instance r of R, ρ(t) satisfies θ(θ̄(α)) ⇔ ρ_x̂(t) satisfies θ_{ρx̂}(θ̄_{ρx̂}(α))), together with the fact that inequality (5) is in In(ρ, 𝒟, D, AC), implies that Σ_{i=1..n} c_i · χ_i(θ_{ρx̂}(θ̄_{ρx̂}(y_i))) ≤ K holds on ρ_x̂(D). We proceed by contradiction, that is, we assume that Σ_{i=1..n} c_i · χ_i(θ_{ρx̂}(θ̄_{ρx̂}(y_i))) > K holds on ρ_x̂(D).


For every aggregation function χ_i(y_i) = ⟨R_χi, e_i, α_i(y_i)⟩, we denote as T′_i (resp., F′_i) the set of tuples belonging to r_i ∈ D, instance of R_χi, such that, for each t ∈ T′_i (resp., t ∈ F′_i), the corresponding tuple ρ_x̂(t) in ρ_x̂(D) satisfies (resp., does not satisfy) θ_{ρx̂}(θ̄_{ρx̂}(α_i)). From the definition of aggregate function, it holds that:

\[
\sum_{i=1}^{n} c_i \cdot \chi_i\big(\theta_{\rho_{\hat{x}}}(\bar{\theta}_{\rho_{\hat{x}}}(\vec{y}_i))\big) \;=\; \sum_{i=1}^{n} c_i \cdot \sum_{t \in T'_i} e_i(\rho_{\hat{x}}(t)).
\]

Due to the contradiction hypothesis, it holds that

\[
\sum_{i=1}^{n} c_i \cdot \sum_{t \in T'_i} e_i(\rho_{\hat{x}}(t)) \;>\; K.
\]

Since for each i ∈ [1..n] and for each tuple t belonging to r_i in D, if ρ(t) satisfies θ(θ̄(α_i)) then ρ_x̂(t) satisfies θ_{ρx̂}(θ̄_{ρx̂}(α_i)), and vice versa, the latter inequality can be rewritten by making the inner summation range over the tuples of T_i instead of T′_i, that is, Σ_{i=1..n} c_i · Σ_{t ∈ T_i} e_i(ρ_x̂(t)) > K, which is equivalent to:

\[
\sum_{i=1}^{n} c_i \cdot \sum_{\substack{t \in T_i \,\wedge \\ \langle t, e_i \rangle \in \lambda(\rho)}} e_i(\rho_{\hat{x}}(t)) \;>\; K - \sum_{i=1}^{n} c_i \cdot \sum_{\substack{t \in T_i \,\wedge \\ \langle t, e_i \rangle \notin \lambda(\rho)}} e_i(\rho_{\hat{x}}(t)).
\]

Since ρ and ρ_x̂ coincide on the database values not updated by ρ, the right-hand side of the above inequality coincides with the term on the right-hand side of (5). Let K′ be this term; the above inequality implies that:

\[
\sum_{i=1}^{n} c_i \cdot \sum_{\substack{t \in T_i \,\wedge \\ \langle t, e_i \rangle \in \lambda(\rho)}} e_i(\rho_{\hat{x}}(t)) \;>\; K'.
\]

By the definition of ρ_x̂, for each i ∈ [1..n] and for each tuple t ∈ T_i such that ⟨t, e_i⟩ ∈ λ(ρ), it holds that e_i(ρ_x̂(t)) = x̂_{t,e_i}. Hence the latter inequality implies that:

\[
\sum_{i=1}^{n} c_i \cdot \sum_{\substack{t \in T_i \,\wedge \\ \langle t, e_i \rangle \in \lambda(\rho)}} \hat{x}_{t,e_i} \;>\; K',
\]

which contradicts the hypothesis that x̂ is a solution of In(ρ, 𝒟, D, AC), since the inequality Σ_{i=1..n} c_i · Σ_{t ∈ T_i ∧ ⟨t,e_i⟩ ∈ λ(ρ)} x_{t,e_i} ≤ K′ has been added to In(ρ, 𝒟, D, AC) (see Case 2.a). This completes the proof of the property that, for each solution x̂ of In(ρ, 𝒟, D, AC), the corresponding set of updates ρ_x̂ is a repair for D w.r.t. 𝒟 and AC. □

B. PROOFS OF LEMMA 3.16 AND COROLLARY 3.21

Statement of Lemma 3.16: Let 𝒟 be a database scheme, AC a set of aggregate constraints on 𝒟, D an instance of 𝒟 such that D is not consistent w.r.t. 𝒟 and AC, and q = R(a1, . . . , an) a query over 𝒟. If there is a repair ρ for D w.r.t. 𝒟 and AC such that q ∈ ρ(D) (resp. q ∉ ρ(D)), then there is a repair ρ′ for D w.r.t. 𝒟 and AC such that: (i) λ(ρ′) ⊆ λ(ρ), (ii) q ∈ ρ′(D) (resp. q ∉ ρ′(D)), and (iii) ρ′ has polynomial size w.r.t. D and q.

Proof. We consider separately the cases when q ∈ ρ(D) and q ∉ ρ(D).

Case 1: q ∈ ρ(D). Consider the system of inequalities In(ρ, 𝒟, D, AC) defined in the proof of Lemma 3.4. Let r ∈ D be the instance of the relation scheme R and τ the tuple in r coinciding with q in all the attributes in K_R. Observe that the existence of τ is implied by the existence of ρ: since no repair can change key attributes in a relation, the fact that q ∈ ρ(D) implies that there is one tuple τ in r such that for each A ∈ K_R it holds that τ[A] = q[A]. Consider the system of inequalities In⁺(ρ, 𝒟, D, AC) obtained from In(ρ, 𝒟, D, AC) by adding, for each A ∈ M_R such that ⟨τ, A⟩ ∈ λ(ρ), the equation x_{τ,A} = q[A]. Obviously, In⁺(ρ, 𝒟, D, AC) has at least one solution, where each variable x_{t,A} is assigned the value v of the atomic update ⟨t, A, v⟩ performed by ρ. Moreover, it is trivial to see that both the number of variables and (in)equalities, and the size of constants in In⁺(ρ, 𝒟, D, AC), are polynomially bounded by the size of D and q. Therefore, as In⁺(ρ, 𝒟, D, AC) has at least one solution, there is a solution x̂ for In⁺(ρ, 𝒟, D, AC) whose size is polynomially bounded by the size of D and q (this can be easily proved reasoning analogously to the proof of Claim A.2). Furthermore, reasoning analogously to the proof of Lemma 3.4, it can be proved that ρ_x̂ is a repair for D such that (i) λ(ρ_x̂) ⊆ λ(ρ), and (ii) ρ_x̂ has polynomial size w.r.t. D and q. Finally, it is trivial to see that for each attribute A in the relation scheme R it holds that q[A] is equal to τ[A] on ρ_x̂(D), i.e., q ∈ ρ_x̂(D).

Case 2: q ∉ ρ(D). Let r ∈ D be the instance of the relation scheme R. We consider the following two cases separately.

1) There is no tuple τ in r such that, for each A ∈ K_R, τ[A] = q[A]. In this case q is false in every repair (recall that no repair can change key attributes in a relation). Moreover, Lemma 3.4 ensures that there is a repair ρ′ for D w.r.t. 𝒟 and AC such that λ(ρ′) ⊆ λ(ρ) and ρ′ has polynomial size w.r.t. D. Therefore, since the size of ρ′ does not depend on the size of q, ρ′ also has polynomial size w.r.t. D and q. Hence, ρ′ satisfies conditions (i), (ii), (iii).

2) There is a tuple τ in r such that, for each A ∈ K_R, τ[A] = q[A]. We consider the following cases separately.

a) τ ∈ ρ(D). Since q ∉ ρ(D), it must be the case that q and τ differ in the value of at least one attribute. Moreover, as τ ∈ D and τ ∈ ρ(D), there is no pair ⟨τ, A⟩ ∈ λ(ρ), where A ∈ M_R. Let ρ′ be a repair for D w.r.t. 𝒟 and AC such that λ(ρ′) ⊆ λ(ρ) and ρ′ has polynomial size w.r.t. D (the existence of ρ′ is guaranteed by Lemma 3.4). Since λ(ρ′) ⊆ λ(ρ) and there is no pair ⟨τ, A⟩ ∈ λ(ρ), we have that there is no pair ⟨τ, A⟩ ∈ λ(ρ′), which means that τ ∈ ρ′(D). As, for each A ∈ K_R, τ[A] = q[A] and q differs from τ in at least one attribute value, it holds that q ∉ ρ′(D), which completes the proof for this case.


b) τ ∉ ρ(D). Consider the system of inequalities In(ρ, 𝒟, D, AC) obtained as explained in the proof of Lemma 3.4, where each variable x_{t,A} corresponds to a pair ⟨t, A⟩ ∈ λ(ρ). Since τ ∉ ρ(D), there is at least one pair ⟨τ, A⟩ ∈ λ(ρ), where A ∈ M_R. We denote as V⁺ (resp. V⁻) the set of variables x_{τ,A} such that there is an atomic update ⟨τ, A, v⟩ ∈ ρ where v > q[A] (resp. v < q[A]) holds. Let In⁺(ρ, 𝒟, D, AC) be the system of inequalities obtained by adding the following inequalities to In(ρ, 𝒟, D, AC):
- x_{τ,A} > q[A], for each x_{τ,A} ∈ V⁺;
- x_{τ,A} < q[A], for each x_{τ,A} ∈ V⁻.
Reasoning analogously to the proof of the case that q ∈ ρ(D), it can be proved that In⁺(ρ, 𝒟, D, AC) has at least one solution x̂ whose size is polynomially bounded by the size of D and q, and that the set of updates ρ_x̂ "corresponding" to x̂ is a repair for D such that (i) λ(ρ_x̂) ⊆ λ(ρ), and (ii) ρ_x̂ has polynomial size w.r.t. D and q. Finally, it is trivial to see that the new inequalities added to In(ρ, 𝒟, D, AC) imply that there is at least one attribute A ∈ M_R such that q[A] ≠ τ[A] on ρ_x̂(D). The latter, since τ[A] = q[A] holds for each A ∈ K_R, implies that q ∉ ρ_x̂(D), which completes the proof.

Statement of Corollary 3.21: There exists a LOGSPACE reduction from every B-variant of re, mrc, and cqa to its corresponding Z- and Q-variant.

Proof. The reductions of the B-variants of re, mrc, and cqa to their corresponding Z-variants follow from the fact that, given a database scheme 𝒟, it is possible to impose that any measure attribute in M_𝒟 defined on Z must take either value 0 or 1 by defining on 𝒟 a set of aggregate constraints Bin^Z_𝒟 whose size depends only on the size of the database scheme 𝒟. Specifically, Bin^Z_𝒟 can be defined as follows. Let R_1, . . . , R_n be the relation names in 𝒟. Without loss of generality, we assume that, for each i ∈ [1..n], the key of the relation scheme R_i(A_i1 : ∆_i1, . . . , A_ik_i : ∆_ik_i) is A_i1. For each i ∈ [1..n] and A_ij ∈ M_Ri such that ∆_ij = Z, let χ_ij(x) = ⟨R_i, A_ij, A_i1 = x⟩, and let Bin_ij be the set of aggregate constraints defined as follows:

\[
Bin_{ij} = \{\; R_i(x, \_, \ldots, \_) \implies \chi_{ij}(x) \le 1, \quad R_i(x, \_, \ldots, \_) \implies \chi_{ij}(x) \ge 0 \;\}.
\]
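The effect of the two constraints in Bin_ij can be checked on a toy relation. The sketch below is only illustrative: it assumes a relation is a list of dictionaries with a single-attribute key, and the names `chi` and `satisfies_bin_z` are hypothetical.

```python
def chi(relation, measure, key, x):
    """Aggregate-sum <R, measure, key = x>: sums `measure` over the tuples whose key equals x."""
    return sum(t[measure] for t in relation if t[key] == x)


def satisfies_bin_z(relation, measure, key):
    """Check chi(x) <= 1 and chi(x) >= 0 for every tuple R(x, _, ..., _)."""
    return all(0 <= chi(relation, measure, key, t[key]) <= 1 for t in relation)
```

Because the key identifies a single tuple, `chi` returns the value of the measure attribute in that tuple, so on an integer domain the two bounds leave only 0 and 1 as admissible values.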

Then, Bin^Z_𝒟 is defined as { ac ∈ Bin_ij | i, j are such that A_ij ∈ M_𝒟 ∧ ∆_ij = Z }. It is trivial to see that the pair of aggregate constraints in Bin_ij imposes that, in every instance of 𝒟, the attribute A_ij in every tuple of R_i must take its value from {0, 1}. Hence, associating Bin^Z_𝒟 with 𝒟 entails that any database instance of 𝒟 is valid if and only if, for all the tuples of the database, every measure attribute defined on Z takes its value from {0, 1}. Observe that the size of Bin^Z_𝒟 depends on the size of 𝒟 only, since both the size of each aggregate constraint in Bin^Z_𝒟 and the cardinality of Bin^Z_𝒟 are linear in the size of 𝒟.

We now exploit the definition of Bin^Z_𝒟 to define a LOGSPACE reduction from re^B to re^Z. Given an instance ⟨𝒟, AC, D⟩ of re^B, consider the instance ⟨𝒟^Z, AC^Z, D^Z⟩ of re^Z defined as follows: (i) 𝒟^Z is obtained from 𝒟 by associating the measure attributes with the domain Z instead of B, (ii) AC^Z = AC ∪ Bin^Z_𝒟, and (iii) D^Z is the database instance of 𝒟^Z containing the same tuples as D. It is straightforward to see that deciding whether there is a repair for D w.r.t. 𝒟 and AC is equivalent to deciding whether there is a repair for D^Z w.r.t. 𝒟^Z and AC^Z. Moreover, since both 𝒟^Z and AC^Z do not depend on D, it is easy to see that ⟨𝒟^Z, AC^Z, D^Z⟩ can be constructed from ⟨𝒟, AC, D⟩ by means of a LOGSPACE reduction w.r.t. the size of D. Reasoning analogously, it is easy to see that mrc^B_S (resp. cqa^B_S) with S ∈ {card, set} can be reduced to mrc^Z_S (resp. cqa^Z_S) by exploiting the set of aggregate constraints Bin^Z_𝒟.

The reductions of the B-variants of re, mrc, and cqa to their corresponding Q-variants can be provided by exploiting a set of aggregate constraints, namely Bin_𝒟, which impose that any measure attribute in M_𝒟 must take either value 0 or 1. Let R_1, . . . , R_n be the relation names in 𝒟. For each i ∈ [1..n] and A_ij ∈ M_Ri, let χ_ij be the aggregation function defined by ⟨R_i, 1, (A_ij ≠ 0 ∧ A_ij ≠ 1)⟩, and let bin_ij be the aggregate constraint defined as follows: χ_ij() ≤ 0. Basically, χ_ij() returns a value greater than 0 iff there is a tuple of R_i whose attribute A_ij has a value different from both 0 and 1. Thus, it is trivial to see that the aggregate constraint bin_ij imposes that, in every instance of 𝒟, the attribute A_ij in every tuple of R_i must take its value from {0, 1}. Hence, associating the set of aggregate constraints Bin_𝒟 = {bin_ij | i, j are such that A_ij ∈ M_𝒟} with 𝒟 entails that any database instance of 𝒟 is valid if and only if, for all the tuples of the database, every measure attribute takes its value from {0, 1}⁹. Analogously to Bin^Z_𝒟, the size of Bin_𝒟 depends on the size of 𝒟 only, since both the size of each aggregate constraint in Bin_𝒟 and the cardinality of Bin_𝒟 are linear in the size of 𝒟.
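As a small illustration of the constraint bin_ij on a rational domain, the following sketch counts the tuples whose measure value is neither 0 nor 1 and checks that the count does not exceed 0. It assumes the selection condition of χ_ij is "value different from both 0 and 1", as described in the text; the function names and the list-of-dictionaries encoding are hypothetical.

```python
from fractions import Fraction


def chi_non_binary(relation, measure):
    """Aggregate-sum <R, 1, (A != 0 and A != 1)>: counts tuples whose value is neither 0 nor 1."""
    return sum(1 for t in relation if t[measure] != 0 and t[measure] != 1)


def satisfies_bin_q(relation, measure):
    """Constraint bin_ij: the aggregate must not exceed 0."""
    return chi_non_binary(relation, measure) <= 0
```

Unlike the pair of bounds used for the integer domain, this single constraint also rules out fractional values such as 1/2, which is why it suits the Q-variants.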
Reasoning analogously to the reduction from re^B to re^Z provided above, it is easy to see that Bin_𝒟 can be exploited to reduce every instance of re^B to an instance of re^Q by means of a LOGSPACE reduction (w.r.t. the size of the database). Finally, it is clear that mrc^B_S (resp. cqa^B_S) with S ∈ {card, set} can be reduced to mrc^Q_S (resp. cqa^Q_S) by exploiting the set of aggregate constraints Bin_𝒟.

C. PROOFS OF THEOREMS STATED IN SECTION 4

In this section we provide the proofs of Theorem 4.6, Theorem 4.7, Theorem 4.8 and Theorem 4.9.

Statement of Theorem 4.5: steady-re^Q is P-complete.

Proof. We prove membership in P only, as hardness has been proved in Section 4. In Section 5 (Corollary 5.3) we proved that, in the presence of steady aggregate constraints, a repair can be computed by solving an optimization problem S*(𝒟, AC, D), containing some variables defined on the same domain as the measure attributes of the input database scheme, along with binary variables (namely, δ_i) used to count the number of measure attributes updated by the repair. Following the same arguments used in that proof, it is easy to see that considering the linear inequalities of S*(𝒟, AC, D) containing only the variables corresponding to the measure attributes results in a set of linear inequalities on rational variables, whose solutions are in one-to-one correspondence with M-bounded repairs. This system of linear inequalities has been denoted as S(𝒟, AC, D) in Section 5. Thus, membership in P follows from the facts that a repair exists iff an M-bounded repair exists, and that a set of linear inequalities on rational variables can be solved in polynomial time.

⁹ Observe that Bin_𝒟 could be exploited also to reduce the B-variant of a problem (re, mrc, and cqa) to its Z-variant instead of Bin^Z_𝒟. Indeed, in the previous case, we introduced a different form of constraints (namely, Bin^Z_𝒟) since it will be exploited also in Section 4, where complexity issues regarding a more specific form of aggregate constraints will be investigated.

Statement of Theorem 4.6: steady-mrc^Q_card is coNP-hard.

Proof. We show a LOGSPACE reduction from steady-mrc^B_card to steady-mrc^Q_card. Given an instance I^B = ⟨𝒟, AC, D, ρ⟩ of steady-mrc^B_card, we show how to construct an instance I^Q = ⟨𝒟′, AC′, D′, ρ′⟩ of steady-mrc^Q_card such that ρ′ is a card-minimal repair for D′ w.r.t. 𝒟′ and AC′ iff ρ is a card-minimal repair for D w.r.t. 𝒟 and AC. Before providing the definition of I^Q, we introduce some notation. Let 𝒟 be a database scheme consisting of m relation schemes of the form R_i(A_i1 : B, . . . , A_ip_i : B, A_i(p_i+1) : ∆_i(p_i+1), . . . , A_ik_i : ∆_ik_i) (with i ∈ [1..m]), where the first p_i attributes are the measure attributes of R_i (that is, M_Ri = {A_i1, . . . , A_ip_i}). We denote the relation instance of R_i in the database instance D as r_i, for each i ∈ [1..m]. Let id(t, A_ij) be an injective function assigning an identifier to the pair taken as argument, where t is a tuple of a relation instance r_i in D, and A_ij ∈ M_Ri. Specifically, the identifiers assigned by function id are the values in the interval [1..T], where T = Σ_{i=1..m} |r_i| × p_i. The instance I^Q is obtained from the instance I^B as follows.

1. 𝒟′ is obtained from 𝒟 by modifying the relation schemes in 𝒟 and adding two new relation schemes. Specifically, for every relation scheme R_i of 𝒟, 𝒟′ contains the following relation scheme:

R′_i(A_i1 : Q, . . . , A_ip_i : Q, A_i(p_i+1) : ∆_i(p_i+1), . . . , A_ik_i : ∆_ik_i, ID^i_1 : Z, . . . , ID^i_p_i : Z)

That is, R′_i is obtained from R_i by adding, for each measure attribute A_ij of R_i, the attribute ID^i_j ranging over Z, and changing the domain of measure attributes from B to Q (also in this case, the first p_i attributes are the measure attributes of R′_i). In the following, R′_i will be said to be the relation scheme of 𝒟′ corresponding to R_i. Moreover, the database scheme 𝒟′ also contains the new relation schemes:

WeightZero(ID : Z, Count : Z, Val : Q);
WeightOne(ID : Z, Count : Z, Val : Q);

where M_WeightZero = {Val} and M_WeightOne = {Val}. Relations WeightZero and WeightOne will be used to associate every pair ⟨t′, A_ij⟩, where t′ is a tuple of relation instance r′_i in D′ and A_ij a measure attribute of R′_i, with 2 × (T + 1) tuples. In more detail, ⟨t′, A_ij⟩ will be associated with T + 1 tuples of WeightZero and T + 1 tuples of WeightOne.

2. The database instance D′ is obtained from D in two steps. First, for each relation instance r_i of D and each tuple t of r_i, a tuple t′ is inserted into the instance r′_i of the relation scheme R′_i, defined as follows:

t′ = R′_i(t[A_i1], . . . , t[A_ik_i], id(t, A_i1), . . . , id(t, A_ip_i)).

In the following, the above-defined tuple t′ will be said to be the tuple of D′ corresponding to t. We extend the definition of function id to tuples and measure attributes of D′ as follows: id(t′, A_ij) = id(t, A_ij), where t is the tuple of D which t′ corresponds to. Then, the relation instances of relation schemes WeightZero and WeightOne are populated as follows. For each relation scheme R_i of 𝒟 and each pair ⟨t, A_ij⟩, where t is a tuple of r_i and A_ij ∈ M_Ri, we insert T + 1 tuples in the instance of WeightZero and T + 1 tuples in the instance of WeightOne. Each of the T + 1 tuples inserted into WeightZero is of the form WeightZero(id(t, A_ij), k, 0), where k ∈ [1..T + 1]. Analogously, each of the T + 1 tuples inserted into WeightOne is of the form WeightOne(id(t, A_ij), k, 0), where k ∈ [1..T + 1]. In the following, given a pair ⟨t, A_ij⟩, where t is a tuple of r_i and A_ij ∈ M_Ri, the tuple WeightZero(id(t, A_ij), k, v) (resp. WeightOne(id(t, A_ij), k, v)) will be said to be the k-th tuple of WeightZero (resp. WeightOne) corresponding to ⟨t, A_ij⟩ (or, equivalently, to ⟨t′, A_ij⟩, where t′ is the tuple of D′ corresponding to t).

3. The set of steady aggregate constraints AC′ is obtained from AC in two steps. First, for each steady aggregate constraint ac ∈ AC having the form (1) (see Definition 2.2), with φ(x) = R_1(x_1) ∧ · · · ∧ R_l(x_l), where R_1, . . . , R_l are (not necessarily distinct) relation names in 𝒟, and each element in x_j is a variable name in x or a constant, we add to AC′ the steady aggregate constraint

\[
ac' :\; \forall\, \mathbf{x}, \mathbf{x}' \; \Big( \phi'(\mathbf{x}, \mathbf{x}') \implies \sum_{i=1}^{n} c_i \cdot \chi'_i(\vec{y}_i) \le K \Big)
\]

where
a) x′ is a sequence of p new variables not appearing in ac, where p = Σ_{i=1..l} |M_Ri|;
b) φ′ is obtained from φ by replacing every atom R_i(x_i) occurring in φ with the atom R′_i(x_i, z_i), where z_i is a list of p_i = |M_Ri| variables in x′;
c) for each pair of atoms R′_i(x_i, z_i), R′_j(x_j, z_j) in φ′ (with i ≠ j), the sets of variables in z_i and z_j are disjoint;
d) every χ′_i(y_i) (with 1 ≤ i ≤ n) corresponds to the aggregation function χ_i(y_i) occurring in ac, in the sense that if χ_i(y_i) is defined over R then χ′_i(y_i) is defined over R′.

The set of steady aggregate constraints AC′ consists of the above-defined constraints obtained from AC, and new steady aggregate constraints using the aggregation functions defined in the following. For each relation scheme R′_i and attribute A_ij ∈ M_R′i, let Val^i_j(X) be the aggregation function defined by the triple ⟨R′_i, A_ij, (ID^i_j = X)⟩. Basically, Val^i_j(X) returns t′[A_ij], where t′ is the tuple of the instance of R′_i such that id(t′, A_ij) = X. Let ValZero(X, Y) and ValOne(X, Y) be the following aggregation functions:
- ValZero(X, Y) = ⟨WeightZero, Val, (ID = X) ∧ (Count = Y)⟩
- ValOne(X, Y) = ⟨WeightOne, Val, (ID = X) ∧ (Count = Y)⟩.
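The population of WeightZero and WeightOne and the evaluation of ValZero and ValOne can be sketched as follows. This is a toy illustration under stated assumptions: relations are dictionaries mapping a relation name to a list of tuples (dictionaries), identifiers are assigned in iteration order (one of many possible injective choices for id), and the function names `build_weight_tables` and `val_lookup` are hypothetical.

```python
def build_weight_tables(relations, measures):
    """Assign identifiers 1..T to the pairs (tuple, measure attribute) and
    create T + 1 tuples with Val = 0 per pair in WeightZero and in WeightOne.

    relations: {relation name: list of tuples, each tuple a dict}
    measures:  {relation name: list of measure attribute names}
    """
    ids = {}
    for name, tuples in relations.items():
        for position in range(len(tuples)):
            for attr in measures[name]:
                ids[(name, position, attr)] = len(ids) + 1  # injective, values 1..T
    T = len(ids)
    weight_zero = [{"ID": i, "Count": k, "Val": 0}
                   for i in ids.values() for k in range(1, T + 2)]
    weight_one = [{"ID": i, "Count": k, "Val": 0}
                  for i in ids.values() for k in range(1, T + 2)]
    return ids, T, weight_zero, weight_one


def val_lookup(table, x, y):
    """Aggregate-sum <table, Val, (ID = x) and (Count = y)>, as in ValZero / ValOne."""
    return sum(t["Val"] for t in table if t["ID"] == x and t["Count"] == y)
```

With one relation holding two tuples and two measure attributes, T is 4 and each weight relation receives 4 × 5 = 20 tuples, all with Val equal to 0.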


Therein, ValZero(X, Y) (resp., ValOne(X, Y)) returns the value of attribute Val in the Y-th tuple of WeightZero (resp., WeightOne) corresponding to the pair ⟨t′, A_ij⟩ identified by X (that is, the pair ⟨t′, A_ij⟩ such that X = id(t′, A_ij)). For each relation scheme R′_i and attribute A_ij ∈ M_R′i, we define the following pair of steady aggregate constraints:

zero_ij : WeightZero(X, Y, _) ⇒ Val^i_j(X) − ValZero(X, Y) = 0, which imposes that, for each tuple t″ in the instance of WeightZero, the value t″[Val] must be equal to t′[A_ij], where t′ is the tuple in the instance of R′_i such that id(t′, A_ij) = t″[ID]. That is, constraint zero_ij entails that all the tuples of WeightZero corresponding to the same pair ⟨t′, A_ij⟩ share the same value of attribute Val, and this value is equal to t′[A_ij].

one_ij : WeightOne(X, Y, _) ⇒ Val^i_j(X) + ValOne(X, Y) = 1, which imposes that, for each tuple t″ in the instance of WeightOne, the value t″[Val] must be the one's complement of t′[A_ij], where t′ is the tuple in the instance of R′_i such that id(t′, A_ij) = t″[ID]. That is, constraint one_ij entails that all the tuples of WeightOne corresponding to the same pair ⟨t′, A_ij⟩ share the same value of attribute Val, and this value is equal to 1 − t′[A_ij].

Hence, the set of steady aggregate constraints AC′ is augmented with the set of constraints {zero_ij | i ∈ [1..m] ∧ j ∈ [1..p_i]} and the set of constraints {one_ij | i ∈ [1..m] ∧ j ∈ [1..p_i]}.

4. The repair ρ′ for D′ w.r.t. 𝒟′ and AC′ is obtained from the repair ρ for D w.r.t. 𝒟 and AC in two steps. First, for each atomic update u = ⟨t, A_ij, v⟩ ∈ ρ, where t is a tuple in D over the relation scheme R_i and A_ij is a measure attribute of R_i, we add the atomic update u′ = ⟨t′, A_ij, v⟩ to ρ′, where t′ is the tuple of D′ corresponding to t. In the following, the above-defined atomic update u′ will be said to be the atomic update of ρ′ corresponding to u. Secondly, we add new atomic updates modifying relations WeightZero and WeightOne. Specifically, for each tuple t′ in r′_i and for each measure attribute A_ij of R′_i, we reason as follows:
- if ρ′ contains the atomic update ⟨t′, A_ij, 1⟩, then the following set of atomic updates is added to ρ′: {⟨t″, Val, 1⟩ | t″ = WeightZero(id(t′, A_ij), k, 0) ∧ k ∈ {1, . . . , T + 1}}. These atomic updates make the constraint zero_ij satisfied in ρ′(D′);
- otherwise, if ρ′ contains the atomic update ⟨t′, A_ij, 0⟩, then, in order to make ρ′(D′) satisfy constraint one_ij, the set of atomic updates {⟨t″, Val, 1⟩ | t″ = WeightOne(id(t′, A_ij), k, 0) ∧ k ∈ {1, . . . , T + 1}} is added to ρ′.
From the construction of ρ′, it is trivial to see that ρ′ is a repair for D′ w.r.t. 𝒟′ and AC′ (this straightforwardly follows from the facts that: i) ρ′(D′) satisfies the constraints of AC′ inherited from those of AC, since ρ is a repair for D w.r.t. 𝒟 and AC; ii) ρ′(D′) satisfies every constraint zero_ij and one_ij in AC′, as explained above).


Observe that repair ρ′ changes exactly (T + 1) · T values in relations WeightZero and WeightOne, since, for each pair ⟨t′, A_ij⟩ (where t′ is a tuple of a relation instance r′_i in D′ and A_ij ∈ M_R′i), it updates either the (T + 1) tuples of WeightZero corresponding to ⟨t′, A_ij⟩, or the (T + 1) tuples of WeightOne corresponding to ⟨t′, A_ij⟩. On the whole, ρ′ contains |λ(ρ)| + (T + 1) · T atomic updates, as it also performs |λ(ρ)| atomic updates on the tuples of the relations r′_1, . . . , r′_m corresponding to the relations of D.
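The count |λ(ρ)| + (T + 1) · T can be checked on a toy instance with the sketch below. It rests on an assumption: the choice between WeightZero and WeightOne is made on the value that each pair has after ρ is applied, which is what the stated count requires, since every one of the T pairs must trigger exactly one of the two cases. The encoding of pairs by their identifiers and the name `build_rho_prime` are hypothetical.

```python
def build_rho_prime(values, rho):
    """Build the atomic updates of rho' for a Boolean instance.

    values: {pair id: 0 or 1}, the measure values in D, with pair ids 1..T
    rho:    {pair id: new value}, the atomic updates of the repair rho
    Returns a list of updates (relation, pair id, count, new value).
    """
    T = len(values)
    repaired = {**values, **rho}  # measure values after applying rho
    # Updates inherited from rho, applied to the corresponding tuples of D'.
    rho_prime = [("R'", pair_id, None, v) for pair_id, v in rho.items()]
    for pair_id, v in repaired.items():
        # A final value of 1 forces Val = 1 in WeightZero (constraint zero_ij);
        # a final value of 0 forces Val = 1 in WeightOne (constraint one_ij).
        table = "WeightZero" if v == 1 else "WeightOne"
        rho_prime += [(table, pair_id, k, 1) for k in range(1, T + 2)]
    return rho_prime
```

With three pairs and a repair updating two of them, the sketch produces 2 + 4 · 3 = 14 atomic updates.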

It is easy to see that the above-defined instance I^Q of steady-mrc^Q_card can be obtained from the instance I^B of steady-mrc^B_card by means of a LOGSPACE computation. To complete the hardness proof, it suffices to show that ρ is a card-minimal repair for D w.r.t. 𝒟 and AC iff ρ′ is a card-minimal repair for D′ w.r.t. 𝒟′ and AC′. Before accomplishing this, we prove a preliminary result. Let ρ∗ be a card-minimal repair for D′ w.r.t. 𝒟′ and AC′. Then, the following hold:
(P1): for each pair ⟨t′, A_ij⟩, where t′ is a tuple of a relation instance r′_i of D′ and A_ij ∈ M_Ri, it holds that ρ∗(t′)[A_ij] ∈ {0, 1};
(P2): the number of atomic updates in ρ∗ of the form ⟨t″, Val, v⟩, where t″ is a tuple of either WeightZero or WeightOne, is T · (T + 1).

(P1): Assume by contradiction that there is a card-minimal repair ρ∗ for D′ w.r.t. 𝒟′ and AC′ such that there are N pairs (with 0 < N ≤ T) of the form ⟨t′, A_ij⟩ such that t′ is a tuple of a relation instance r′_i in D′, A_ij ∈ M_Ri and ρ∗(t′)[A_ij] ∉ {0, 1}. We denote the set of these N pairs as NB (which stands for “NonBoolean”). For every ⟨t′, A_ij⟩ ∈ NB, since ρ∗(D′) satisfies constraint zero_ij, each tuple t″ of the T + 1 tuples of WeightZero corresponding to ⟨t′, A_ij⟩ must be updated by ρ∗ in order to make ρ∗(t″)[Val] = ρ∗(t′)[A_ij] (we recall that attribute Val in t″ is set to 0 in D′, by construction). Thus, repair ρ∗ must contain T + 1 atomic updates performed on the T + 1 tuples of WeightZero corresponding to ⟨t′, A_ij⟩. Analogously, for each ⟨t′, A_ij⟩ ∈ NB, from the fact that ρ∗(D′) satisfies constraint one_ij, it follows that ρ∗ must contain T + 1 atomic updates performed on the T + 1 tuples of WeightOne corresponding to ⟨t′, A_ij⟩ (for each of these tuples, attribute Val is updated to 1 − ρ∗(t′)[A_ij]). Therefore, ρ∗ contains at least 2 · N · (T + 1) atomic updates performed on the tuples of WeightZero and WeightOne. Let B (which stands for “Boolean”) be the set of the T − N pairs ⟨t′, A_ij⟩ such that t′ is a tuple of a relation instance r′_i in D′, A_ij ∈ M_Ri and ρ∗(t′)[A_ij] ∈ {0, 1}. As explained above in the construction of repair ρ′, for every pair ⟨t′, A_ij⟩ ∈ B, in order to make D′ satisfy constraints zero_ij and one_ij, repair ρ∗ must contain either T + 1 atomic updates performed on the T + 1 tuples of WeightZero corresponding to ⟨t′, A_ij⟩ (in the case that ρ∗(t′)[A_ij] = 1), or T + 1 atomic updates performed on the T + 1 tuples of WeightOne corresponding to ⟨t′, A_ij⟩ (in the case that ρ∗(t′)[A_ij] = 0). Therefore, repair ρ∗ contains 2 · N · (T + 1) + (T − N) · (T + 1) = (T + N) · (T + 1) (with N > 0) atomic updates performed on the tuples of WeightZero and WeightOne.
Thus |λ(ρ∗)| ≥ (T + 1) · (T + 1) = T² + 2 · T + 1, as ρ∗ possibly contains further atomic updates on relations r′_1, . . . , r′_m. We have already


observed that |λ(ρ′)| = |λ(ρ)| + (T + 1) · T. As |λ(ρ)| ≤ T, we have that |λ(ρ′)| ≤ T + T · (T + 1) = T² + 2 · T. Therefore it holds that |λ(ρ′)| < |λ(ρ∗)|, which contradicts that ρ∗ is a card-minimal repair for D′ w.r.t. 𝒟′ and AC′.
(P2): As explained above, ρ∗ contains (T + N) · (T + 1) atomic updates performed on the tuples of WeightZero and WeightOne, where N is the number of pairs of the form ⟨t′, A_ij⟩ such that t′ is a tuple of a relation instance r′_i in D′, A_ij ∈ M_Ri and ρ∗(t′)[A_ij] ∉ {0, 1}. From (P1) we have that N = 0, thus the number of atomic updates performed on the tuples of WeightZero and WeightOne by ρ∗ is T · (T + 1).
We now complete the proof by showing that ρ is a card-minimal repair for D w.r.t. 𝒟 and AC iff ρ′ is a card-minimal repair for D′ w.r.t. 𝒟′ and AC′.
(⇒): Assume by contradiction that ρ′ is not a card-minimal repair for D′ w.r.t. 𝒟′ and AC′. Thus, there is a card-minimal repair ρ∗ for D′ w.r.t. 𝒟′ and AC′ such that |λ(ρ∗)| < |λ(ρ′)|. Let
U(ρ∗) = {⟨t′, A_ij⟩ | ⟨t′, A_ij⟩ ∈ λ(ρ∗) ∧ t′ is a tuple of a relation r′_i of D′},
U(ρ′) = {⟨t′, A_ij⟩ | ⟨t′, A_ij⟩ ∈ λ(ρ′) ∧ t′ is a tuple of a relation r′_i of D′}.
That is, U(ρ∗) (resp. U(ρ′)) is the set of pairs ⟨tuple, attribute⟩ updated by ρ∗ (resp. ρ′) where only tuples of relations r′_i (inherited from D by D′) are considered. We now show that the existence of ρ∗ implies the existence of a repair for D w.r.t. 𝒟 and AC whose cardinality is less than that of ρ. Let ρ̃∗ be the set of atomic updates on D obtained from ρ∗ as follows: for each ⟨t′, A_ij⟩ in U(ρ∗), ρ̃∗ contains the atomic update ⟨t, A_ij, ρ∗(t′)[A_ij]⟩, where t is the tuple of D corresponding to t′. Informally, ρ̃∗ and ρ∗ update the corresponding pairs of D and D′, respectively, to the same value. It is straightforward to see that ρ̃∗ is a repair for D w.r.t. 𝒟 and AC. So far we have proved that ρ̃∗ is a repair for D w.r.t.
𝒟 and AC under the contradiction hypothesis stating that ρ′ is not a card-minimal repair, while ρ∗ is a card-minimal repair. We now show that the cardinality of ρ̃∗ is less than that of ρ, that is |λ(ρ̃∗)| < |λ(ρ)|. To accomplish this, we first prove that |λ(ρ∗)| < |λ(ρ′)| implies that |U(ρ∗)| < |U(ρ′)|. Assume by contradiction that |U(ρ∗)| ≥ |U(ρ′)|. Since |λ(ρ∗)| = |U(ρ∗)| + T · (T + 1) (property P2) and |λ(ρ′)| = |U(ρ′)| + T · (T + 1) (by construction, see item (4) in this proof), from |U(ρ∗)| ≥ |U(ρ′)| we obtain that |λ(ρ∗)| ≥ |λ(ρ′)|, which contradicts that ρ∗ is card-minimal and ρ′ is not card-minimal. Therefore, the contradiction hypothesis that ρ′ is not card-minimal implies that |U(ρ∗)| < |U(ρ′)|, which means that the number of atomic updates performed by ρ∗ on relations r′_1, . . . , r′_m is less than the number of atomic updates performed by ρ′ on the same relations. Since |λ(ρ̃∗)| = |U(ρ∗)|, |U(ρ∗)| < |U(ρ′)| and |U(ρ′)| = |λ(ρ)| (by construction of ρ′ from ρ), we obtain |λ(ρ̃∗)| < |λ(ρ)|. The latter contradicts that ρ is a card-minimal repair for D w.r.t. 𝒟 and AC.
(⇐): Assume by contradiction that ρ is not a card-minimal repair for D w.r.t. 𝒟 and AC. Thus, there is a card-minimal repair ρ̃ for D w.r.t. 𝒟 and AC such that |λ(ρ̃)| < |λ(ρ)|. Let ρ̃′ be the repair for D′ w.r.t. 𝒟′ and AC′ obtained from ρ̃ by


applying the same two-step reasoning as that described in item (4) of this proof, where the construction of ρ′ from ρ is explained. That is:
- for each atomic update ⟨t, A_ij, v⟩ ∈ ρ̃, where t is a tuple in D over a relation r_i and A_ij is a measure attribute of R_i, repair ρ̃′ contains the atomic update ⟨t′, A_ij, v⟩, where t′ is the tuple of D′ corresponding to t;
- for each relation instance r′_i, each tuple t′ in r′_i and each measure attribute A_ij of R_i, repair ρ̃′ contains T + 1 atomic updates modifying either the instance of WeightZero (in order to satisfy the constraint zero_ij) or WeightOne (in order to satisfy the constraint one_ij), as explained in item (4) of this proof.
In the construction of ρ′ from ρ, we have shown that |λ(ρ′)| = |λ(ρ)| + (T + 1) · T. Reasoning analogously, it is easy to see that the construction of ρ̃′ from ρ̃ entails that |λ(ρ̃′)| = |λ(ρ̃)| + (T + 1) · T. Since |λ(ρ̃)| < |λ(ρ)| (contradiction hypothesis), we have that |λ(ρ̃′)| < |λ(ρ′)|, which contradicts that ρ′ is a card-minimal repair for D′ w.r.t. 𝒟′ and AC′.

Statement of Theorem 4.7: steady-mrc^Q_set is P-complete.

Proof. (Membership) Let ⟨𝒟, AC, D, ρ⟩ be an instance of steady-mrc^Q_set. For the sake of simplicity, and without loss of generality, we assume that 𝒟 contains a single relation scheme R(A_1, . . . , A_M), such that K_R = {A_1} and M_R = {A_M}. Let λ(ρ) = {p_1, . . . , p_|λ(ρ)|}. Given a pair p ∈ λ(ρ), let I_ρ,p = ⟨𝒟′, AC′, D′_ρ,p⟩ be the instance of steady-re^Q defined as follows:
- 𝒟′ contains the relation scheme R′(A_1, . . . , A_M, Old, Mod), which is obtained from the relation scheme of 𝒟 by adding the non-measure attributes Old and Mod (ranging over Q and B, respectively) to its definition;
- D′_ρ,p contains, for each tuple t = R(v_1, . . . , v_M) of D, the corresponding tuple t′ = R′(v_1, . . . , v_M, v_old, v_mod), where (i) v_old = t[A_M], and (ii) v_mod is either 1, if ⟨t, A_M⟩ ∉ λ(ρ) ∨ ⟨t, A_M⟩ = p, or 0, otherwise.
Basically, the boolean value stored in attribute Mod of a tuple t′ in D′_ρ,p is 1 iff either A_M has not been updated by ρ or ⟨t, A_M⟩ is the pair p. Moreover, the value assigned to Old is the original value of attribute A_M in t. As will be clearer in the following, the value assigned to attribute Mod in a tuple t′ of D′_ρ,p will be used to either forbid or allow repairs for D′_ρ,p to change the measure attribute of t′;
- AC′ contains all the constraints in AC (which are rewritten according to the definition of the database scheme 𝒟′), and the following (steady) aggregate constraint:
ac′ : R′(X, _, . . . , _) ⇒ NewVal(X) − OldVal(X) = 0,
where the aggregation functions NewVal(X) and OldVal(X) are defined as follows:
NewVal(X) = ⟨R′, A_M, (A_1 = X) ∧ (Mod = 1)⟩
OldVal(X) = ⟨R′, Old, (A_1 = X) ∧ (Mod = 1)⟩.
Basically, constraint ac′ imposes that, for each tuple t′ in D′_ρ,p identified by X, if the tuple t of D corresponding to t′ has not been changed by ρ, then no repair for D′_ρ,p w.r.t. 𝒟′ and AC′ can update t′. Moreover, ac′ ensures that no repair ρ′ for D′_ρ,p w.r.t. 𝒟′ and AC′ can update the tuple t′ of D′_ρ,p corresponding to the tuple t in p.


It is easy to see that a repair exists for D′_ρ,p w.r.t. 𝒟′ and AC′ iff a repair ρ∗ exists for D w.r.t. 𝒟 and AC such that λ(ρ∗) ⊆ λ(ρ) \ {p}, i.e., ρ∗ updates a proper subset of the values updated by ρ, as it does not update the values not updated by ρ, and it does not update the pair p. Hence, deciding the instance ⟨𝒟, AC, D, ρ⟩ of steady-mrc^Q_set is equivalent to deciding the instances I_ρ,p_1, . . . , I_ρ,p_|λ(ρ)| of steady-re^Q. That is, ρ is a set-minimal repair for D w.r.t. 𝒟 and AC iff there is no instance among I_ρ,p_1, . . . , I_ρ,p_|λ(ρ)| whose answer is yes. As steady-re^Q was shown to be decidable in polynomial time, the latter suffices to prove that steady-mrc^Q_set is in P.

(Hardness) We show a LOGSPACE reduction from a P-hard problem to steady-mrc^Q_set. Specifically, we show a reduction from steady-re^{Q+}_M, i.e., the repair-existence problem where all measure attributes are defined on the domain of non-negative rationals, and all attribute expressions occurring in aggregate constraints consist of measure attributes only. The P-hardness of steady-re^{Q+}_M straightforwardly follows from the proof of Theorem 4.5. In fact, in the latter, we proved that steady-re^Q is P-hard by considering an instance of steady-re^Q where all measure attribute values were constrained to be non-negative, and where all the aggregate constraints contain attribute expressions consisting of measure attributes only. Given an instance ⟨𝒟, AC, D⟩ of steady-re^{Q+}_M, we show how to construct an instance ⟨𝒟′, AC′, D′, ρ⟩ of steady-mrc^Q_set such that there is a repair for D w.r.t. 𝒟 and AC iff ρ is not a set-minimal repair for D′ w.r.t. 𝒟′ and AC′. The instance ⟨𝒟′, AC′, D′, ρ⟩ is obtained from ⟨𝒟, AC, D⟩ as follows.

1. The database scheme 𝒟′ is obtained from 𝒟 by associating measure attributes of 𝒟 with domain Q (instead of Q+), and then adding the relation scheme Ans(ID : Z, Val : Q) to 𝒟, where M_Ans = {Val}. For the sake of clarity, we rename the relations inherited from 𝒟 as R′_1, . . . , R′_n. Thus, 𝒟′ consists of relations R′_1, . . . , R′_n (which are inherited from 𝒟) plus relation scheme Ans. Without loss of generality, we assume that, for each i ∈ [1..n], the key of the relation scheme R′_i(A_i1 : ∆_i1, . . . , A_ik_i : ∆_ik_i) is A_i1.

2. The database instance D′ is defined as follows. The instance of Ans contains the unique tuple Ans(0, 0). As regards relations r′_1, . . . , r′_n of D′, each r′_i has the same cardinality as r_i in D, and each tuple t′ of r′_i “corresponds” to exactly one tuple t of r_i in D, in the following sense: t′ and t share the same values for non-measure attributes, whereas all the measure attributes of t′ are set to −1.

3. The set of steady aggregate constraints AC′ consists of the union of two subsets Inherited and NonNegative. Specifically, Inherited contains steady aggregate constraints obtained by modifying those of AC. In more detail, for each steady aggregate constraint ac ∈ AC (which is of the form (1), see Definition 2.2), Inherited contains the following constraint:

\[ ac' : \; \forall x \left( \phi'(x) \implies \sum_{i=1}^{n} c_i \cdot \chi'_i(y_i) + \chi_{Ans}() \le K \right) \]

where:


- φ′(x) and each aggregation function χ′_i() are obtained from φ(x) and χ_i(), respectively, by replacing any relation name R_j with the corresponding relation name R′_j;
- aggregation function χ_Ans() returns the value of the attribute Val of the unique tuple in the instance of Ans, and is defined as follows: χ_Ans() = ⟨Ans, Val, ⊥⟩.
The subset NonNegative of AC′ consists of steady aggregate constraints which impose that every measure attribute in relations r′_1, . . . , r′_n must take non-negative values. More formally, the constraints of NonNegative are defined as follows. For each i ∈ [1..n] and A_ij ∈ M_R′i, let χ_ij be the aggregation function defined as χ_ij(X) = ⟨R′_i, A_ij, (A_i1 = X)⟩, and let ac_ij be the aggregate constraint defined as follows:
ac_ij : R′_i(X, _, . . . , _) ⟹ χ_ij(X) ≥ 0.
Basically, χ_ij(X) returns the value of attribute A_ij in the tuple of R′_i whose key attribute has value X. Thus, the aggregate constraint ac_ij imposes that, in every instance of 𝒟′, the attribute A_ij in every tuple of R′_i must take non-negative values. On the whole, the set of aggregate constraints NonNegative is given by:
NonNegative = { ac_ij | R′_i ∈ 𝒟′ ∧ A_ij ∈ M_R′i }

4. Repair ρ consists of a set of atomic updates which assign 0 to all the measure attributes of relations r′_1, . . . , r′_n. That is, for each relation instance r′_i of D′, for each measure attribute A_ij of R′_i, and for each tuple t′ of r′_i, ρ contains the atomic update ⟨t′, A_ij, 0⟩. Moreover, ρ also contains the atomic update ⟨t, Val, K_min⟩, where t = Ans(0, 0) and K_min is a value different from 0 and strictly less than all constants K occurring on the right-hand side of the constraints in Inherited. It is easy to see that this set of atomic updates is a repair for D′ w.r.t. 𝒟′ and AC′. In fact, constraints in NonNegative are satisfied by ρ(D′), as every measure attribute of relations r′_1, . . . , r′_n is assigned a non-negative value in ρ(D′). Moreover, since evaluating χ_Ans() on ρ(D′) returns K_min, and since evaluating the term of the form Σ_{i=1}^{n} c_i · χ′_i(y_i) occurring in the consequent of every constraint in Inherited results in 0, it holds that ρ(D′) satisfies all the constraints in Inherited (the fact that Σ_{i=1}^{n} c_i · χ′_i(y_i) evaluates to 0 on ρ(D′) follows from the fact that the attribute expressions occurring in Inherited consist of measure attributes only, due to the definition of steady-re^{Q+}_M).
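The candidate repair of item 4 can be sketched in a few lines of Python to check that it satisfies both families of constraints. The cell identifiers, the encoding of an Inherited constraint as a list of (coefficient, cells) terms plus a constant, and the particular choice K_min = min(all K, 0) − 1 are illustrative assumptions. The source only requires K_min to be nonzero and strictly below every K.

```python
from fractions import Fraction

ANS_CELL = ("Ans", "Val")  # hypothetical identifier of attribute Val in Ans(0, 0)


def candidate_repair(measure_cells, constants):
    """Assign 0 to every measure cell and a suitable K_min to Ans.Val."""
    # One admissible choice: nonzero and strictly below every constant K.
    k_min = min(list(constants) + [0]) - 1
    rho = {cell: Fraction(0) for cell in measure_cells}
    rho[ANS_CELL] = Fraction(k_min)
    return rho


def inherited_holds(rho, terms, constant):
    """Check sum_i c_i * chi'_i + chi_Ans <= K, each chi'_i summing measure cells."""
    total = sum(c * sum(rho[cell] for cell in cells) for c, cells in terms)
    return total + rho[ANS_CELL] <= constant


def non_negative_holds(rho, measure_cells):
    """Check the NonNegative constraints on the cells inherited from D."""
    return all(rho[cell] >= 0 for cell in measure_cells)


# Hypothetical toy instance: two measure cells and two Inherited constraints.
cells = [("r1", 1, "A"), ("r1", 2, "A")]
constraints = [([(1, cells)], 5), ([(-2, cells[:1])], -3)]
rho = candidate_repair(cells, [k for _, k in constraints])
print(non_negative_holds(rho, cells))
print(all(inherited_holds(rho, terms, k) for terms, k in constraints))
```

Both checks succeed because every weighted sum of measure cells evaluates to 0 and the value stored in Ans lies below every constant K.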

Observe that, in order to make D′ satisfy the constraints in NonNegative, any repair for D′ must update at least all measure attributes in relations r′_1, . . . , r′_n, as they are initially assigned a negative value. Some repair can exist which does not change the value of attribute Val of relation scheme Ans. Therefore, given two repairs ρ_1 and ρ_2 for D′ w.r.t. 𝒟′ and AC′, it holds that λ(ρ_1) ⊂ λ(ρ_2) iff ρ_1 does not change the value of attribute Val of relation scheme Ans, whereas ρ_2 changes it. We now show that the following equivalence holds: “There is a repair for D w.r.t. 𝒟 and AC ⇔ ρ is not a set-minimal repair for D′ w.r.t. 𝒟′ and AC′”.
⇒: Let ρ∗ be a repair for D w.r.t. 𝒟 and AC, and let ρ̃ be the set of atomic updates


on D′ defined as follows: for each atomic update ⟨t, A, v⟩ ∈ ρ∗, ρ̃ contains the atomic update ⟨t′, A, v⟩, where t′ is the tuple of D′ corresponding to t. That is, ρ̃ does not change the value of attribute Val of relation scheme Ans in D′, and it changes the tuples of relations r′_1, . . . , r′_n in the same way as ρ∗ changes the corresponding tuples in D. It is straightforward to see that ρ̃(D′) satisfies both the constraints in NonNegative (as all the measure attributes of relations r′_1, . . . , r′_n are assigned values taken from ρ∗(D), which are non-negative) and the constraints in Inherited (as ρ∗(D) satisfies every ac ∈ AC and χ_Ans() returns 0). Hence, ρ̃ is a repair for D′ w.r.t. 𝒟′ and AC′. Moreover, since ρ̃ does not change the tuple in the instance of Ans, it holds that λ(ρ̃) ⊂ λ(ρ), which means that ρ is not a set-minimal repair for D′ w.r.t. 𝒟′ and AC′.
⇐: Assume that ρ is not a set-minimal repair for D′ w.r.t. 𝒟′ and AC′. Therefore there is a repair ρ̃ for D′ w.r.t. 𝒟′ and AC′ such that λ(ρ̃) ⊂ λ(ρ). As observed above, this implies that ρ̃ does not change the value of attribute Val of relation scheme Ans, thus attribute Val in the unique tuple over Ans in ρ̃(D′) is assigned 0. Let ρ∗ be the set of atomic updates on D defined as follows: for each atomic update ⟨t′, A, v⟩ ∈ ρ̃, where t′ is a tuple of D′, ρ∗ contains the atomic update ⟨t, A, v⟩, where t is the tuple of D corresponding to t′. We now show that the above-defined set of updates ρ∗ is a repair for D w.r.t. 𝒟 and AC. Since evaluating χ_Ans() on ρ̃(D′) returns 0, and ρ̃(D′) satisfies every constraint ac′ ∈ AC′, it is easy to see that ρ∗(D) satisfies every constraint ac ∈ AC, by construction of AC′. Moreover, since ρ̃(D′) satisfies NonNegative, ρ̃ assigns non-negative values to measure attributes in D′, thus ρ∗ assigns values from domain Q+ to measure attributes in D. Hence, ρ∗(D) is consistent w.r.t. 𝒟 and AC.
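The membership part of the proof of Theorem 4.7 amounts to one repair-existence test per updated pair. A minimal Python sketch of that loop follows. The callback `repair_exists` is a hypothetical stand-in for the polynomial-time procedure deciding steady-re^Q on the instance in which only the given pairs may be updated. The toy oracle below is an assumption made purely for illustration.

```python
def is_set_minimal(updated_pairs, repair_exists):
    """Return True iff no repair updates a proper subset of `updated_pairs`.

    updated_pairs : iterable of <tuple, attribute> pairs updated by rho
    repair_exists : callable taking a set U of pairs and telling whether some
                    repair updates only pairs in U (hypothetical oracle)
    """
    pairs = set(updated_pairs)
    # Every proper subset of `pairs` misses some p, so testing pairs - {p}
    # for each p covers all proper subsets with len(pairs) oracle calls.
    return not any(repair_exists(pairs - {p}) for p in pairs)


# Toy oracle for a single constraint x1 + x2 = 10 over rational values:
# a repair restricted to U exists iff the constraint already holds or U
# contains at least one of the two constrained cells.
values = {"x1": 3, "x2": 4, "x3": 5}


def toy_repair_exists(updatable):
    if values["x1"] + values["x2"] == 10:
        return True
    return bool(updatable & {"x1", "x2"})


print(is_set_minimal({"x1"}, toy_repair_exists))        # updates x1 only
print(is_set_minimal({"x1", "x2"}, toy_repair_exists))  # updates both cells
```

In the toy instance, a repair updating only x1 is set-minimal, while one updating both x1 and x2 is not, since updating x2 alone already restores consistency.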
Before providing the proof of Theorem 4.8, it is worth noting that the hardness results stated in Theorems 3.19 and 3.20 and Corollary 3.21 for the different versions of cqa still hold under the hypothesis that the database D admits at least one repair w.r.t. 𝒟 and AC. This follows from the fact that, given an atomic knowledge base T, a formula Q and a formula p, deciding whether Q is derivable from every model in T ◦_S p (resp., T ◦_D p) is Π^p_2-hard (resp., ∆^p_2[log n]-hard) even when p is satisfiable [Eiter and Gottlob 1992]. In fact, as the satisfiability of p implies that Mod(T ◦_S p) ≠ ∅ and Mod(T ◦_D p) ≠ ∅, and since, in the reductions defined in the proofs of Theorems 3.19 and 3.20, there is a one-to-one correspondence between the models of the revised knowledge base and the minimal repairs for D, we can conclude that, by means of our reductions, instances of ip where the revision formula is satisfiable are reduced to cqa instances where there is at least one repair for D. From a practical standpoint, this result implies that the computation of the consistent query answer cannot benefit (in terms of efficiency) from the fact that at least one repair is known to exist. This is the case, for instance, of our running example, as balance sheet data are known to be originally consistent. From a theoretical standpoint, this property will also be exploited in the proof of Theorem 4.8.

Statement of Theorem 4.8: steady-cqa^Q_card is ∆^p_2[log n]-hard.


Proof. As previously shown, cqa^∆_card, with ∆ ∈ {B, Z, Q}, is ∆^p_2[log n]-hard even considering only cqa^∆_card instances ⟨𝒟, D, AC, q⟩ such that there exists a repair ρ for D w.r.t. 𝒟 and AC. Thus, it is easy to see that steady-cqa^B_card is ∆^p_2[log n]-hard even considering only steady-cqa^B_card instances ⟨𝒟, D, AC, q⟩ such that there exists a repair ρ for D w.r.t. 𝒟 and AC (see Corollary 4.4). Let ⟨𝒟, D, AC, q⟩ be a steady-cqa^B_card instance such that there exists a repair ρ for D w.r.t. 𝒟 and AC. We consider the steady-cqa^Q_card instance ⟨𝒟′, D′, AC′, q′⟩, where 𝒟′, D′ and AC′ are defined as in the proof of Theorem 4.6 and q′ is a tuple obtained from q by applying the same transformation used in the proof of Theorem 4.6 to obtain a tuple t′ ∈ D′ from the corresponding tuple t ∈ D. It is easy to see that ⟨𝒟′, D′, AC′, q′⟩ can be obtained from ⟨𝒟, D, AC, q⟩ by means of a LOGSPACE transformation. We now prove that ⟨𝒟, D, AC, q⟩ is true iff ⟨𝒟′, D′, AC′, q′⟩ is true. Let ρ be a repair for D w.r.t. 𝒟 and AC. We denote as ρ′ the repair for D′ w.r.t. 𝒟′ and AC′ obtained from ρ in the same way as in the proof of Theorem 4.6. It is easy to see that q is true on ρ(D) iff q′ is true on ρ′(D′). Furthermore, as shown in the proof of Theorem 4.6, ρ is a card-minimal repair for D w.r.t. 𝒟 and AC iff ρ′ is a card-minimal repair for D′ w.r.t. 𝒟′ and AC′. Moreover, given a card-minimal repair ρ∗ for D′ w.r.t. 𝒟′ and AC′, since for each pair ⟨t′, A_ij⟩ ∈ λ(ρ∗) it holds that ρ∗(t′)[A_ij] ∈ {0, 1} (property P1, proof of Theorem 4.6), there is a card-minimal repair ρ for D w.r.t. 𝒟 and AC which ρ∗ corresponds to (in the sense that ρ∗ can be obtained starting from ρ according to the definition in the proof of Theorem 4.6, item 4). Since, for each card-minimal repair ρ for D w.r.t. 𝒟 and AC, there is a card-minimal repair ρ′ for D′ w.r.t. 𝒟′ and AC′ such that the query q is true on ρ(D) iff the corresponding query q′ is true on ρ′(D′), and vice versa, and since there exists at least one repair ρ for D w.r.t. 𝒟 and AC, it follows that ⟨𝒟, D, AC, q⟩ is true iff ⟨𝒟′, D′, AC′, q′⟩ is true.

Statement of Theorem 4.9: steady-cqa^Q_set is coNP-complete.

Proof. Membership in coNP follows from Theorem 3.19 and Theorem 4.7. In order to prove coNP-hardness, we show a LOGSPACE reduction from 3-Satisfiability (3SAT) [Papadimitriou 1994] to the complement of steady-cqa^Q_set. We use the same definitions and notations adopted in the proof of Theorem 3.12, where a reduction from 3SAT to mrc^B_S (for any S ∈ {set, card}) was shown. Thus, given a 3SAT instance ⟨U, Φ⟩, we denote the clauses of Φ as C_1, . . . , C_n, and the literals occurring in each C_i as L_i,1, L_i,2, L_i,3. A truth assignment to the literals occurring in a 3SAT instance can be translated into an instance of the following database scheme 𝒟:
Lit(ID : Z, Val : Q);
Pair(ID_L : Z, ID_¬L : Z);
Clause(Lit1 : Z, Lit2 : Z, Lit3 : Z);
Jolly(ID : Z, Val : Q);
Weight(ID : Z, Count : Z, Val : Q).
As in the hardness proof of Theorem 3.12, the above-defined database scheme


contains relations Lit, Pair, Clause and Jolly, with M_Lit = {Val}, M_Pair = ∅, M_Clause = ∅ and M_Jolly = {Val}, but in contrast with the above-mentioned proof, here measure attributes are defined on domain Q. Moreover, here the database scheme is augmented with the relation scheme Weight and we define M_Weight = {Val}. We consider a database instance D of 𝒟 populated as follows. The instances of relations Lit, Pair, Clause are obtained in the same manner as in the hardness proof of Theorem 3.12, that is, (i) for each L ∈ U, the instance of Lit contains the tuple Lit(L, −1); (ii) for each pair L, ¬L ∈ U, the instance of Pair contains the tuple Pair(L, ¬L); (iii) for each clause C_i in Φ of the form L_i,1 ∨ L_i,2 ∨ L_i,3, the instance of Clause contains the tuple Clause(L_i,1, L_i,2, L_i,3). In contrast with the hardness proof of Theorem 3.12, the instance of Jolly contains the tuple t_jolly = Jolly(id, −1). Moreover, for each literal L ∈ U, the instance of Weight contains the two tuples Weight(L, 0, 0) and Weight(L, 1, 0). In the following, these two tuples will be called the tuples associated with L. Let ValWeightZero and ValWeightOne be the aggregation functions defined as follows:
ValWeightZero(X) = ⟨Weight, Val, (ID = X) ∧ (Count = 0)⟩
ValWeightOne(X) = ⟨Weight, Val, (ID = X) ∧ (Count = 1)⟩.
Basically, for each literal L ∈ U, ValWeightZero(L) and ValWeightOne(L) return the values of attribute Val in the tuples associated with L, which are the two tuples in Weight identified by the pairs ⟨L, 0⟩ and ⟨L, 1⟩, respectively. Let ValLit and ValJolly be the aggregation functions defined as in the proof of Theorem 3.12.
Let AC consist of the following steady aggregate constraints:
ac1 : Pair(X, Y) ⇒ ValLit(X) + ValLit(Y) = 1, which imposes that, for each variable x, the truth value assigned to x must be complementary to that of ¬x (as in the proof of Theorem 3.12);
ac2 : Clause(X, Y, Z) ⇒ ValLit(X) + ValLit(Y) + ValLit(Z) + ValJolly() ≥ 1, which imposes that, for each tuple in the instance of Clause, either at least one of the literals of the corresponding clause is true, or t_jolly[Val] is required to be 1 (as in the proof of Theorem 3.12);
ac3 : Lit(X, _) ⇒ ValLit(X) ≥ 0, which imposes that, for each literal L, the value of attribute Val of the tuple of Lit encoding the truth value assigned to L must be greater than or equal to zero;
ac4 : Lit(X, _) ⇒ ValLit(X) ≤ 1, which imposes that, for each literal L, the value of attribute Val of the tuple of Lit encoding the truth value assigned to L must be less than or equal to one;
ac5 : Lit(X, _) ⇒ ValLit(X) − ValWeightZero(X) = 0, which imposes that, for each literal L, the value t[Val] of the tuple t associated with L and such that t[Count] = 0 must be equal to the value of attribute Val of the tuple in Lit identified by L;
ac6 : Lit(X, _) ⇒ ValLit(X) + ValWeightOne(X) = 1, which imposes that, for each literal L, the value t[Val] of the tuple t associated with L and such that t[Count] = 1 must be the one's complement of the value of attribute Val of the tuple in Lit identified by L;


ac7 : Jolly(_, _) ⇒ ValJolly() ≥ 0, which imposes that t_jolly[Val] ≥ 0;
ac8 : Jolly(_, _) ⇒ ValJolly() ≤ 1, which imposes that t_jolly[Val] ≤ 1.
Observe that, in order to satisfy constraints ac3 and ac7, every possible repair for D w.r.t. 𝒟 and AC must update all values of attribute Val of relation scheme Lit and the value of attribute Val of tuple t_jolly, as these attributes are assigned −1. However, some repairs can exist which do not change some attribute values in the instance of Weight. Specifically, given a repair ρ for D w.r.t. 𝒟 and AC, for each tuple t ∈ ρ(D) over Lit corresponding to a literal L ∈ U (i.e., such that t[ID] = L), the following hold:
(i) constraints ac3 and ac4 impose that 0 ≤ t[Val] ≤ 1;
(ii) if t[Val] = 0, then constraint ac6 entails that ρ contains the atomic update ⟨t_1, Val, 1⟩, where t_1 = Weight(L, 1, 0). Moreover, in order to satisfy constraint ac5, ρ does not contain any update performed on the tuple Weight(L, 0, 0) in D;
(iii) if t[Val] = 1, then ρ contains the atomic update ⟨t_0, Val, 1⟩, where t_0 = Weight(L, 0, 0) (this is implied by ac5). Moreover, ρ does not contain any atomic update performed on the tuple Weight(L, 1, 0) in D (this is due to ac6);
(iv) if 0 < t[Val] < 1, then ρ contains the atomic updates ⟨t_1, Val, v_1⟩ and ⟨t_0, Val, v_2⟩, where t_1 = Weight(L, 1, 0), t_0 = Weight(L, 0, 0), v_1 = 1 − t[Val] and v_2 = t[Val] (this is implied by both constraints ac5 and ac6).
Therefore, if a repair ρ contains the atomic update ⟨t, Val, v⟩, where t is a tuple of Lit and v ∈ {0, 1}, then ρ contains only one atomic update performed on one of the two tuples in Weight associated with t. If instead ρ contains the atomic update ⟨t, Val, v⟩, where t is a tuple of Lit and v ∉ {0, 1}, then ρ contains two atomic updates on the tuples associated with t. Thus, it is easy to see that every repair ρ for D w.r.t.
𝒟 and AC which updates the attribute Val of a tuple t in Lit to a value not in {0, 1} is not a set-minimal repair. Let q be the query Jolly(id, 1). We now prove that Φ is satisfiable iff the consistent answer of q on D w.r.t. 𝒟 and AC is false under the set-minimal semantics. Assume that Φ is satisfiable. It is easy to see that every truth value assignment of literals in U making Φ evaluate to true corresponds to a set-minimal repair ρ for D w.r.t. 𝒟 and AC. Moreover, since for each tuple in the instance of Clause at least one of the literals of the corresponding clause C_i in Φ is true, the value of attribute Val of tuple t_jolly can be updated by a set-minimal repair to any value in [0, 1]. Therefore, there is a set-minimal repair ρ∗ for D w.r.t. 𝒟 and AC such that t_jolly[Val] is strictly less than 1 in ρ∗(D). Thus, the consistent answer to the query q is false. Conversely, assume that the consistent answer of q on D under the set-minimal semantics is false. Therefore, there is a set-minimal repair ρ∗ for D w.r.t. 𝒟 and AC such that the value t_jolly[Val] is strictly less than 1 in ρ∗(D). The fact that constraint ac2 is satisfied by ρ∗(D) implies that, for each tuple in the instance of Clause, there is at least one literal of the corresponding clause C_i in Φ which is assigned 1. Therefore, there is a truth value assignment of literals in U making Φ true, i.e. Φ is satisfiable.
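Cases (ii)-(iv) above follow mechanically from ac5 and ac6, given that both Weight tuples associated with a literal initially store 0. The following Python sketch makes the case analysis explicit. The function name and the dictionary encoding (Count value mapped to the new Val) are illustrative assumptions.

```python
from fractions import Fraction


def weight_updates(lit_value):
    """Updates forced on the two Weight tuples associated with a literal.

    lit_value : value in [0, 1] assigned by a repair to Val in Lit
    Returns a dict {Count: new Val} containing only the tuples whose Val
    must change from its initial value 0.
    """
    v = Fraction(lit_value)
    if not 0 <= v <= 1:
        raise ValueError("ac3 and ac4 require a value in [0, 1]")
    # ac5 forces Val of Weight(L, 0, .) to equal v;
    # ac6 forces Val of Weight(L, 1, .) to equal 1 - v.
    required = {0: v, 1: 1 - v}
    return {count: val for count, val in required.items() if val != 0}


print(weight_updates(0))               # one update, on the Count = 1 tuple
print(weight_updates(1))               # one update, on the Count = 0 tuple
print(weight_updates(Fraction(1, 3)))  # two updates, one per tuple
```

A Boolean value therefore costs one update in Weight, whereas any fractional value costs two, which is the observation the set-minimality argument relies on.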

D. ON THE SIZE OF SOLUTIONS OF A SYSTEM OF LINEAR INEQUALITIES

In this section, we show that every feasible system of linear inequalities on rational and integer variables admits a solution whose size is polynomially bounded by the size of the system, extending the results of [Papadimitriou and Steiglitz 1998] to the presence of strict inequalities. We first introduce some preliminary notations and then provide the main result.

D.1 Preliminary notations

Given two vectors x, y, we denote with [x, y] the concatenation of x and y. Given a non-negative integer value a, we denote with size(a) the number of digits of a (size(a) = log_B(a), where B > 1 is the base of the numerical system). We denote with N the set of non-negative integers and with Q+ the set of non-negative rationals. Let A be an m × n integer matrix, b be a vector of m integers, X = [X_1, . . . , X_n] be a vector of n variables, and I, J be two disjoint subsets of [1..n] such that I ∪ J = [1..n]. Moreover, let A_1 (resp., A_2) be an m_1 × n (resp., m_2 × n) matrix which contains the first m_1 rows of A (resp., the last m_2 rows of A), with m = m_1 + m_2, and let b = [b_1, b_2], where b_1, b_2 are two vectors of m_1 and m_2 integers, respectively. We consider a standard form for Mixed Integer Linear Equalities (MILE) and Mixed Integer Linear Inequalities (MILI) systems. The standard form for MILE is:

\[ A \times X = b, \quad \text{where } \forall i \in I,\ X_i \in \mathbb{N} \ \text{ and } \ \forall i \in J,\ X_i \in \mathbb{Q}^+ \tag{6} \]

The standard form for MILI is:

\[ A_1 \times X \le b_1, \quad A_2 \times X < b_2, \quad \text{where } \forall i \in I,\ X_i \in \mathbb{N} \ \text{ and } \ \forall i \in J,\ X_i \in \mathbb{Q}^+ \tag{7} \]
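As an illustration of the MILI standard form (7), the following Python sketch checks whether a candidate assignment is a solution. The function name, the list-of-rows encoding of A_1 and A_2, and the use of 0-based variable indices are assumptions made for this example only. Exact rational arithmetic is used so that strict inequalities are tested without rounding.

```python
from fractions import Fraction


def satisfies_mili(A1, b1, A2, b2, integer_idx, x):
    """Check a candidate x against A1*x <= b1, A2*x < b2 and the domains.

    integer_idx : set of 0-based indices of the variables ranging over N;
                  all remaining variables range over the non-negative rationals.
    """
    x = [Fraction(v) for v in x]
    if any(v < 0 for v in x):
        return False  # every variable must be non-negative
    if any(x[i].denominator != 1 for i in integer_idx):
        return False  # variables in I must take integer values

    def dot(row):
        return sum(c * v for c, v in zip(row, x))

    non_strict = all(dot(row) <= rhs for row, rhs in zip(A1, b1))
    strict = all(dot(row) < rhs for row, rhs in zip(A2, b2))
    return non_strict and strict


# Hypothetical system: X1 + X2 <= 3 and X1 - X2 < 1, with X1 integer.
A1, b1 = [[1, 1]], [3]
A2, b2 = [[1, -1]], [1]
print(satisfies_mili(A1, b1, A2, b2, {0}, [1, Fraction(1, 2)]))
print(satisfies_mili(A1, b1, A2, b2, {0}, [1, 0]))
```

The first candidate satisfies both inequalities, while the second violates the strict one, since 1 − 0 is not strictly less than 1.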

We report some results in the literature which will be exploited in the following.

Fact 1. [Papadimitriou and Steiglitz 1998] Let v_1, . . . , v_k be k > 0 vectors in {0, ±1, ±2, . . . , ±a}^m, and let M = (ma)^{m+1}. The following statements are equivalent:
a) there exist k integers α_1, . . . , α_k such that:
• for each j ∈ [1..k] it holds that 0 ≤ α_j ≤ M;
• there exists a j ∈ [1..k] such that α_j ≠ 0;
• Σ_{j=1}^{k} α_j · v_j = 0;
b) there is no vector h ∈ {0, ±1, ±2, . . . , ±M}^m such that for each j ∈ [1..k] it holds that h^T × v_j ≥ 1.

The following result states that every feasible ILE (Integer Linear Equalities system), which is a MILE without rational variables, has at least one solution which is polynomially bounded by n, m, and the size of the coefficients.


Fact 2. [Papadimitriou 1981] Let S : Ax = b be an ILE consisting of m equalities with n variables such that a is the greatest absolute value of the coefficients in A and b. If there is a solution x ∈ N^n for S, then there is a solution x such that the value of every component x_i of x is bounded by n(ma)^{2m+1}.

Finally, the following result provides a polynomial bound on the search space for the existence of solutions of the linear programming problem (LP).

Fact 3. [Papadimitriou and Steiglitz 1998] Consider an LP of the form:

\[ \min(c^T \times X), \quad \text{where } A \times X = b \ \text{ and } \ \forall i \in [1..n],\ X_i \in \mathbb{Q}^+ \tag{8} \]

where A is an m × n integer matrix, X = [X_1, . . . , X_n] is a vector of variables, b and c are vectors of m and n integers, respectively, and a is the greatest integer value appearing in A and b. If there exists a solution for (8) and the optimal value of (8) is o∗, then there exists a solution x = [x_1, . . . , x_n] of (8) such that, for each i ∈ [1..n], there exist N_i ∈ N and D_i ∈ N such that x_i = N_i / D_i and both size(N_i) and size(D_i) are polynomially bounded by n, m, size(a).

D.2 Proof of the main results

The following corollary extends Fact 2 to MILE.

Corollary D.1. Let $S$ be a MILE of the form (6), such that $a$ is the greatest absolute value of the coefficients in $A$ and $b$. If there is a solution for $S$, then there is a solution $x = [x_1, \dots, x_n]$ for $S$ such that, for each $i \in [1..n]$, the following hold:
i) the value of $x_i$ is bounded by $n(ma)^{2m+1}$;
ii) if $i \in I$, then $size(x_i)$ is polynomially bounded by $n$, $m$, $size(a)$;
iii) if $i \in J$, then there are $N_i \in \mathbb{N}$ and $D_i \in \mathbb{N}$, with $D_i > 0$, such that $x_i = \frac{N_i}{D_i}$ and both $size(N_i)$ and $size(D_i)$ are polynomially bounded by $n$, $m$, $size(a)$.

Proof. For the proof of i), the reader is referred to the proof of Fact 2 in [Papadimitriou 1981], whose strategy can be applied also in the presence of rational variables. Item ii) trivially follows from i). We now prove item iii). Consider a solution $x$ of $S$ such that, for each $i \in I$, $size(x_i)$ is polynomially bounded by $n$, $m$, $size(a)$ (the existence of this $x$ is guaranteed by ii)). Let $S'$ be the LP obtained from $S$ by replacing each variable $X_i$ (with $i \in I$) with the value assigned to it in $x$ and considering a null cost vector (i.e., $c = 0$). Let $a'$ be the greatest coefficient appearing in $S'$. As $S'$ is feasible (since $S$ is feasible), Fact 3 implies that there is a solution $x'$ of $S'$ such that, for each $i \in \{1, \dots, n\} \setminus I$, there exist $N_i \in \mathbb{N}$ and $D_i \in \mathbb{N}$ such that $x'_i = \frac{N_i}{D_i}$ and both $size(N_i)$ and $size(D_i)$ are polynomially bounded by $n - |I|$, $m$, $size(a')$. It is easy to see that $size(a')$ is polynomially bounded by $n$, $m$, $size(a)$, thus both $size(N_i)$ and $size(D_i)$ are polynomially bounded by $n$, $m$, $size(a)$. Hence, the vector that agrees with $x$ on $I$ and with $x'$ on $J$ is a solution of $S$ satisfying ii) and iii).
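The step from item i) to item ii) rests on the fact that the number of digits of $n(ma)^{2m+1}$ grows polynomially in $n$, $m$ and $size(a)$. The following Python sketch, which is an illustration added here and not taken from the paper, makes this explicit. The helper names `size`, `value_bound` and `size_bound` are hypothetical, and the particular polynomial used in `size_bound` is just one convenient choice obtained from the inequality $size(xy) \le size(x) + size(y)$.

```python
def size(a, base=10):
    """Number of base-`base` digits of the non-negative integer a."""
    digits = 1
    while a >= base:
        a //= base
        digits += 1
    return digits


def value_bound(n, m, a):
    """The bound n * (m*a)^(2m+1) on the components of a solution (item i))."""
    return n * (m * a) ** (2 * m + 1)


def size_bound(n, m, a, base=10):
    """An upper bound on size(value_bound(n, m, a)).

    It follows from size(x*y) <= size(x) + size(y), applied 2m+1 times
    to the factor m*a and once more to the factor n.
    """
    return size(n, base) + (2 * m + 1) * (size(m, base) + size(a, base))


# Example: n = 3 variables, m = 2 equalities, coefficients bounded by a = 5.
print(value_bound(3, 2, 5))        # 300000
print(size(value_bound(3, 2, 5)))  # 6
print(size_bound(3, 2, 5))         # 11
```

The value bound itself is exponential in $m$, but its size is at most $size(n) + (2m+1)(size(m) + size(a))$, which is what item ii) requires.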


The following theorem states that a MILI admits a solution having polynomial size w.r.t. the size of the input system of inequalities.

Theorem D.2. Let $S$ be a MILI of the form (7), where $a$ is the greatest absolute value of the coefficients in $S$. If there exists a solution for $S$, then there exists a solution $x = [x_1, \dots, x_n]$ for $S$ such that, for each $i \in [1..n]$, the following hold:
(a) if $i \in I$, then $size(x_i)$ is polynomially bounded by $n$, $m$, $size(a)$;
(b) if $i \in J$, then there exist $N_i \in \mathbb{N}$ and $D_i \in \mathbb{N}$ such that $x_i = \frac{N_i}{D_i}$ and both $size(N_i)$ and $size(D_i)$ are polynomially bounded by $n$, $m$, $size(a)$.

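Proof. Consider the matrix $\tilde{A}$ defined as follows:

$$
\tilde{A} =
\begin{bmatrix}
A_1 & I_{m_1 \times m_1} & 0_{m_1 \times m_2} & \mathbf{0} \\
A_2 & 0_{m_2 \times m_1} & I_{m_2 \times m_2} & \mathbf{1}
\end{bmatrix}
$$

where $I_{m_1 \times m_1}$ and $I_{m_2 \times m_2}$ are the $m_1 \times m_1$ and $m_2 \times m_2$ identity matrices, $0_{m_1 \times m_2}$ and $0_{m_2 \times m_1}$ are $m_1 \times m_2$ and $m_2 \times m_1$ matrices of zeroes, $\mathbf{0}$ is a vector consisting of $m_1$ values equal to 0, and $\mathbf{1}$ is a vector consisting of $m_2$ values equal to 1. In the following, for each $i \in [1..n+m+1]$, $\tilde{A}_i$ denotes the $i$-th column of $\tilde{A}$. Let $v = n + m + 1$ and let $X_{n+1}, \dots, X_{v-1}, X_v$ be new variables not appearing in $X$. We consider the following system:

$$
\begin{array}{l}
\tilde{A} \times \tilde{X} = b \\
\text{where } X_v > 0 \\
\phantom{\text{where }} \forall i \in I,\ X_i \in \mathbb{N} \\
\phantom{\text{where }} \forall i \in [1..v] \text{ s.t. } i \notin I,\ X_i \in \mathbb{Q}^+
\end{array}
\tag{9}
$$

where $\tilde{X} = [X_1, \dots, X_n, X_{n+1}, \dots, X_{v-1}, X_v]$. It is straightforward to see that the following property holds.

P1: for each solution $x = [x_1, \dots, x_n]$ of (7) there is a solution $\tilde{x} = [x_1, \dots, x_n, x_{n+1}, \dots, x_v]$ of (9), and vice versa (that is, the first $n$ components of every solution of (9) form a solution of (7)).

Since (7) is feasible, property P1 implies that there is at least one solution of (9). Therefore, in order to complete the proof, it suffices to prove that (9) has at least one solution satisfying conditions (a) and (b). We consider the following two cases separately.

[I]: there exists a solution of (9) such that $X_v \ge 1$. Consider the following system:

$$
\begin{array}{l}
\tilde{A} \times \tilde{X} = b \\
X_v \ge 1 \\
\text{where } \forall i \in I,\ X_i \in \mathbb{N} \\
\phantom{\text{where }} \forall i \in [1..v] \text{ s.t. } i \notin I,\ X_i \in \mathbb{Q}^+
\end{array}
\tag{10}
$$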

It is straightforward to see that every solution of (10) is a solution of (9) and every solution $x = [x_1, \dots, x_v]$ of (9) with $x_v \ge 1$ is a solution of (10). Therefore, assumption [I] implies that (10) is feasible. Hence, since (10) does not contain strict inequalities, from Corollary D.1 it follows that there is a solution $x = [x_1, \dots, x_v]$ of (10) such that, for each $i \in [1..v]$, the following hold:
• if $i \in I$, then $size(x_i)$ is polynomially bounded by $v$, $m + 1$, $size(a)$;
• if $i \notin I$, then there exist $N_i \in \mathbb{N}$ and $D_i \in \mathbb{N}$ such that $x_i = \frac{N_i}{D_i}$ and both $size(N_i)$ and $size(D_i)$ are polynomially bounded by $v$, $m + 1$, $size(a)$.
As $x$ is a solution of (9) too, property P1 and the fact that $v = n + m + 1$ imply that $[x_1, \dots, x_n]$ is a solution of $S$ such that conditions (a) and (b) hold.

[II]: there is no solution of (9) such that $X_v \ge 1$. We start by observing that the following property holds.

P2: there exists a solution $x = [x_1, \dots, x_v]$ of (9) such that $0 < x_v < 1$ and, for each $i \in [1..(v-1)]$, $x_i \le (n + m + 1) \cdot (m \cdot a)^{2m+1}$.

For the proof of P2, the reader is referred to the proof of Fact 2 in [Papadimitriou and Steiglitz 1998] or [Papadimitriou 1981], whose strategy can be applied here by observing that $x_v$ is bounded by 1.

Let $\tilde{A}_Q$ (resp., $\tilde{A}_N$) be the matrix obtained from $\tilde{A}$ by removing all the columns $\tilde{A}_i$ such that $i \in I$ (resp., $i \notin I$), and let $\tilde{X}_Q$ (resp., $\tilde{X}_N$) be the vector of variables obtained from $\tilde{X}$ by removing all the variables $X_i$ such that $i \in I$ (resp., $i \notin I$). It is easy to see that (9) can be equivalently rewritten as:

$$
\begin{array}{l}
\tilde{A}_Q \times \tilde{X}_Q = b - \tilde{A}_N \times \tilde{X}_N \\
\text{where } X_v > 0 \\
\phantom{\text{where }} \forall i \in I,\ X_i \in \mathbb{N} \\
\phantom{\text{where }} \forall i \in [1..v] \text{ s.t. } i \notin I,\ X_i \in \mathbb{Q}^+
\end{array}
\tag{11}
$$

Let $x = [x_1, \dots, x_v]$ be a solution of (11) such that $0 < x_v < 1$ and, for each $i \in [1..(v-1)]$, $x_i \le (n + m + 1) \cdot (m \cdot a)^{2m+1}$ (the existence of $x$ is guaranteed by property P2), and let $x_N$ (resp., $x_Q$) be the vector obtained from $x$ by removing all values $x_i$ such that $i \notin I$ (resp., $i \in I$). Consider now the following LP:

$$
\begin{array}{l}
\max(X_v) \\
\text{where } \tilde{A}_Q \times \tilde{X}_Q = b - \tilde{A}_N \times x_N \\
\phantom{\text{where }} \forall i \in [1..v] \text{ s.t. } i \notin I,\ X_i \in \mathbb{Q}^+
\end{array}
\tag{12}
$$

Let $\tilde{a}$ be the greatest absolute value of a coefficient in $\tilde{A}_Q$ and $b - \tilde{A}_N \times x_N$. Since every value in $x_N$ is at most $(n + m + 1) \cdot (m \cdot a)^{2m+1}$ and $\tilde{A}_N$ has at most $n$ columns, it follows that $\tilde{a} \le a + n \cdot a \cdot (n + m + 1) \cdot (m \cdot a)^{2m+1}$ (for $a \ge 1$). Hence, $size(\tilde{a})$ is polynomially bounded by $n$, $m$, $size(a)$. Since $x$ is a solution of (11), $x_Q$ is a feasible solution of (12). Moreover, since $x_v$ is greater than 0, it follows that the optimal value of (12) is greater than zero. Due to Fact 3, this implies that there is a solution $\hat{x}_Q$ of (12) such that the value assigned to $X_v$ by $\hat{x}_Q$ is greater than 0 and, for each component $x$ of $\hat{x}_Q$, there are $N \in \mathbb{N}$ and $D \in \mathbb{N}$ such that $x = \frac{N}{D}$ and both $size(N)$ and $size(D)$ are polynomially bounded by $n$, $m$, $size(\tilde{a})$. The latter implies that $[\hat{x}_Q, x_N]$ is a solution of (11) which satisfies conditions (a) and (b). Since every solution of (11) is a solution of (9), this completes the proof.
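The slack-variable construction behind property P1 can be made concrete with a small Python sketch. It is an illustration under stated assumptions rather than the authors' procedure: the function names `build_A_tilde` and `extend_solution` are hypothetical, lists are 0-based, and the value chosen for $X_v$ (the smallest gap of the strict rows) is only one of the possible choices that keep all slack variables non-negative and $X_v > 0$.

```python
from fractions import Fraction


def build_A_tilde(A1, A2):
    """Return the matrix [[A1, I, 0, 0], [A2, 0, I, 1]] as a list of rows."""
    m1, m2 = len(A1), len(A2)
    rows = []
    for j, row in enumerate(A1):
        # One slack column per row, and a 0 in the last column (X_v).
        slack = [1 if k == j else 0 for k in range(m1 + m2)]
        rows.append(list(row) + slack + [0])
    for j, row in enumerate(A2):
        # One slack column per row, and a 1 in the last column (X_v).
        slack = [1 if k == m1 + j else 0 for k in range(m1 + m2)]
        rows.append(list(row) + slack + [1])
    return rows


def extend_solution(A1, b1, A2, b2, x):
    """Extend a solution x of (7) to a solution of (9)."""
    x = [Fraction(v) for v in x]

    def dot(row):
        return sum(Fraction(c) * v for c, v in zip(row, x))

    s1 = [b - dot(r) for r, b in zip(A1, b1)]   # slack of non-strict rows, >= 0
    gap = [b - dot(r) for r, b in zip(A2, b2)]  # gap of strict rows, > 0
    # Assumed choice: X_v is the smallest strict gap (1 if there are no strict rows).
    xv = min(gap) if gap else Fraction(1)
    s2 = [g - xv for g in gap]                  # remaining slack, >= 0
    return x + s1 + s2 + [xv]


# Toy system: X1 + X2 <= 3 and X1 - X2 < 1, with solution [1, 1/2].
A1, b1 = [[1, 1]], [3]
A2, b2 = [[1, -1]], [1]
A_tilde = build_A_tilde(A1, A2)
x_tilde = extend_solution(A1, b1, A2, b2, [1, Fraction(1, 2)])
# Each row of A_tilde times x_tilde reproduces the corresponding entry of b.
print([sum(c * v for c, v in zip(row, x_tilde)) for row in A_tilde] == b1 + b2)
```

Conversely, dropping the last $m + 1$ components of any solution of (9) gives back a solution of (7), since non-negative slack and $X_v > 0$ force the strict rows to hold strictly.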

ACM Transactions on Database Systems, Vol. V, No. N, February 2010.