(2) © MMXIV, Gabriel Diéguez Franzani. Total or partial reproduction is authorized, for academic purposes, by any means or procedure, provided that the bibliographic citation crediting the work and its author is included.

(3) PONTIFICIA UNIVERSIDAD CATOLICA DE CHILE, SCHOOL OF ENGINEERING. ON THE COMPLEXITY OF BIDIRECTIONAL CONSTRAINTS FOR DATA EXCHANGE. GABRIEL SIMÓN DIÉGUEZ FRANZANI. Members of the Committee: MARCELO ARENAS S., JUAN L. REUTTER D., JORGE PÉREZ R., JOSÉ LUIS ALMAZÁN C. Thesis submitted to the Office of Research and Graduate Studies in partial fulfillment of the requirements for the degree of Master of Science in Engineering. Santiago de Chile, December 2014.

(4) To everyone who feels to be my family.

(5) ACKNOWLEDGEMENTS. I would like to thank the following people (in no particular order): Marcelo Arenas for three reasons. First, for the opportunity of being part of such an amazing research group, full of incredibly talented and fun people; second, for the huge opportunity of giving lectures; and finally, for his dedication as my advisor, always having an idea when things were not working out. My “second advisor” Jorge Pérez for his time, encouragement and support, and for always challenging me with new problems. Juan Reutter for his incredibly useful insights for many proofs, for always having time for a little talk, and for always having the right professional or personal advice. José Luis Almazán, president of my thesis committee, for making the final part of this long process fast and simple. My office colleagues, the old and new (BANG!) guys, for all the good times in these years. It is an honour to be the link between two “Fishbowl 10” generations. All other friends in DCC / PUC: the ones from my undergraduate years, the ones from the master’s times, the ones who laugh at Dinkleberg, etc., and all other people in the “database gang” and the Nucleus for the fun trips and chats. My parents, my brothers, my family and my friends for their invaluable love and support, and for always believing in me. And finally, Daniela for being such a great support through all these years. Without the huge amounts of love, time, patience, kindness and fun she has given to me / had with me, I would not be here finishing this thesis. My postgraduate studies were partially funded by CONICYT Master’s scholarship CONICYT-PCHA/Magíster Nacional/2013 - 221320842 and the Millennium Nucleus Center for Semantic Web Research under Grant NC120004.

(6) TABLE OF CONTENTS.
Acknowledgements
List of Tables
Abstract
Resumen
1. Introduction
   Summary of contributions
   Thesis outline and structure
2. Preliminaries
   2.1. Query languages
   2.2. Schema mappings
      2.2.1. Specifying schema mappings
      2.2.2. Data exchange: Universal solutions and Query answering
3. Complexity of the Existence of Solutions Problem
   3.1. Data complexity
   3.2. Combined complexity
4. Complexity of Query Answering
   4.1. Data Complexity
      4.1.1. The general case
      4.1.2. The full case
   4.2. Combined Complexity
      4.2.1. The general case
      4.2.2. The full case
5. Concluding remarks

(7) References

(8) LIST OF TABLES.
3.1. Complexity of ExistenceOfSolutions for bidirectional constraints
4.1. Data complexity of CertainAnswers under bidirectional constraints
4.2. Combined complexity of CertainAnswers under bidirectional constraints

(9) ABSTRACT. Schema mappings are of fundamental importance in data management; they have proved to be the essential building block for several data-interoperability tasks such as data exchange, data integration and peer data management. Most of the research on schema mappings has focused on mappings specified by st-tgds, which, although natural and simple to specify, fail to impose enough conditions to unambiguously define which instances should be materialized when exchanging data. Recently, bidirectional constraints have been proposed to specify mappings; they impose constraints over the source and target instances participating in them at the same time, and have the potential to minimize the ambiguity in the description of the target instances. In this thesis, we expand the formal study of bidirectional constraints; in particular, we study the computational complexity of two fundamental problems in the context of data exchange: checking the existence of solutions and answering queries. In the former, we analyze both the data and combined complexity, providing upper and lower bounds for different scenarios. In the latter, we also distinguish between several query languages with different expressive powers. In the proofs we introduce some new techniques, such as a modified version of the classical chase procedure.

Keywords: Data exchange, Schema mappings, Bidirectional constraints, Query answering, Computational complexity

(10) RESUMEN. Schema mappings are of fundamental importance in data management, as they have been shown to be the basis for numerous data-interoperability tasks such as data exchange, data integration and peer data management. Most of the research on schema mappings has concentrated on mappings described by st-tgds, which, although natural and simple to specify, fail to impose enough conditions to define without ambiguity which instances should be materialized when exchanging data. Recently, the use of bidirectional constraints has been proposed for specifying schema mappings; they are able to impose constraints on the source and target instances participating in them at the same time, and have the potential to minimize the ambiguity in the description of the target instances. In this thesis we continue the formal study of bidirectional constraints. In particular, we study the computational complexity of two fundamental problems in the context of data exchange: checking the existence of solutions and answering queries. In the first case, both the data complexity and the combined complexity are analyzed, showing upper and lower bounds in different scenarios. In the second case, we additionally distinguish between several query languages with different expressive powers. In the proofs we introduce some new techniques, such as a modified version of the classical chase algorithm.

Keywords: Data exchange, Schema mappings, Bidirectional constraints, Query answering, Computational complexity

(11) 1. INTRODUCTION. A schema mapping is a high-level specification that describes how data from a source schema is to be mapped to a target schema. Schema mappings are of fundamental importance in data management today. In particular, they have proved to be the essential building block for several data-interoperability tasks such as data exchange (Fagin, Kolaitis, Miller, & Popa, 2005), data integration (Lenzerini, 2002), and peer data management (De Giacomo, Lembo, Lenzerini, & Rosati, 2007). In the relational-database context, schema mappings are usually specified in a logical language whose vocabulary is the set of relation names (or table names) of the database schemas. For example, consider two independent database schemas: S, containing relation Employee(name, lives_in, works_in), and T, containing relation Shuttle(name, destination). Relation Employee in schema S stores employee names together with the places where they live and work. Relation Shuttle in schema T is intended to store the names of employees who must take the shuttle bus to reach their place of work (destination). A possible way of relating schemas S and T is by using the following first-order logic formula:

∀x∀z ( ∃y ( Employee(x, y, z) ∧ y ≠ z ) → Shuttle(x, z) ). (1.1)

The above formula essentially states that if relation Employee stores an employee who lives in a place different from the one where she/he works, then the employee name and the place of work should be stored in relation Shuttle. Formula (1.1) describes a mapping between schemas S and T that is given by an implication where the left-hand side is a query over S and the right-hand side is a query over T. This class of implication formulas has been the most widely used to specify schema mappings both in theoretical studies (Lenzerini,

(12) 2002; Fagin, Kolaitis, Miller, & Popa, 2005; Kolaitis, 2005; De Giacomo et al., 2007; Arenas, Pérez, & Riveros, 2009; Arenas, Pérez, Reutter, & Riveros, 2010; Arenas, Pérez, & Reutter, 2011; Arenas, Pérez, & Reutter, 2013; Pérez, 2011) and in practical applications (Hernández et al., 2002; Haas, Hernández, Ho, Popa, & Roth, 2005; Bernstein, Green, Melnik, & Nash, 2006). In particular, schema mappings specified by implication formulas have been the preferred formalism for exchanging data (Fagin, Kolaitis, Miller, & Popa, 2003, 2005; Fagin, Kolaitis, & Popa, 2003; Libkin, 2006; Gottlob & Nash, 2006; De Giacomo et al., 2007). In the data exchange context one is given a source database instance and a schema mapping. The problem is then to find a target database instance that satisfies the constraints imposed by the schema mapping. Consider Formula (1.1) above and the database D1 over (the source) schema S given by

D1:  Employee
     name     lives_in    works_in
     Juan     Santiago    Valparaiso
     Diego    Santiago    Santiago

A possible solution for the data exchange problem is the database D2 over (the target) schema T given by

D2:  Shuttle
     name     destination
     Juan     Valparaiso

Notice that D1 together with D2 satisfies the constraints imposed by (1.1) (considering the standard first-order logic semantics). Nevertheless, there are other solutions for this data exchange problem. Consider the databases D3 and D4 given by

D3:  Shuttle
     name     destination
     Juan     Valparaiso
     Diego    Santiago

D4:  Shuttle
     name     destination
     Juan     Valparaiso
     Juan     Santiago
     Alberto  Curico
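Whether a candidate target instance is a solution under Formula (1.1) can be checked mechanically. The following is a minimal sketch, assuming that a relation is represented as a Python set of tuples; the function name `satisfies_1_1` and the variable names are illustrative only and are not part of the original formalism. The sketch reads D3 as the two-tuple instance and D4 as the three-tuple instance.

```python
# Illustrative encoding (an assumption of this sketch): a relation is a set of tuples.
D1_employee = {
    ("Juan", "Santiago", "Valparaiso"),
    ("Diego", "Santiago", "Santiago"),
}

def satisfies_1_1(employee, shuttle):
    """Check Formula (1.1): every employee who lives and works in different
    places must appear in Shuttle together with the place of work."""
    return all((name, works_in) in shuttle
               for (name, lives_in, works_in) in employee
               if lives_in != works_in)

D2 = {("Juan", "Valparaiso")}
D3 = {("Juan", "Valparaiso"), ("Diego", "Santiago")}
D4 = {("Juan", "Valparaiso"), ("Juan", "Santiago"), ("Alberto", "Curico")}

# Report, for each candidate target instance, whether it is a solution for D1.
for label, shuttle in [("D2", D2), ("D3", D3), ("D4", D4), ("empty", set())]:
    print(label, satisfies_1_1(D1_employee, shuttle))
```

The implication only requires the tuple (Juan, Valparaiso) to be present, so every superset of D2 passes the check, whereas the empty instance does not.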

(13) In this case, D1 together with D3 satisfies Formula (1.1), and so does D1 together with D4. Thus, the database instances D3 and D4, although less natural than D2, are also solutions for the data exchange problem. This sort of anomaly is caused by the semantics of the implication formula. Notice that the formula used to exchange data does not restrict the possibility of adding arbitrary tuples to relation Shuttle in the target database. The semantics of implication formulas has raised several issues in data exchange. One of them is the problem of deciding what is a good solution for the data exchange problem. In Fagin, Kolaitis, Miller, and Popa (2003) and Fagin, Kolaitis, Miller, and Popa (2005), it was proposed to consider the minimal (or universal) solutions as the only solutions that are good for data exchange. Although D2, D3 and D4 are considered valid solutions for the data exchange problem, only D2 is considered a good solution according to Fagin, Kolaitis, Miller, and Popa (2005). Towards solving the same problem, Libkin (2006) has proposed to change the semantics of schema mappings given by implication formulas by considering a closed-world assumption. Thus, formulas are no longer evaluated using the standard first-order logic semantics. Under the semantics proposed by Libkin (2006), D3 and D4 do not satisfy the constraints given by Formula (1.1), and thus they are no longer valid solutions for D1. Although this new semantics departs from the classical first-order logic semantics, it has proved to have good properties in terms of materialization of target instances. Other lines of research include the proposal of alternative notions for answering target queries, in particular non-monotone queries (Hernich, 2013) and aggregate queries (Afrati & Kolaitis, 2008). In Arenas, Diéguez, and Pérez (2014), it was argued that there is a simpler and more natural way of dealing with the above-mentioned issue.
In that work, we decided to follow a different approach and, instead of using ad-hoc solutions for each of the mentioned issues, we used a mapping-specification language that imposes enough constraints over possible target instances to minimize the uncertainty when exchanging data. In

(14) that way one can use standard first-order logic notions to define the semantics of mappings, the possible target solutions, and the process of answering target queries. In our example, if one wants D2 to be the solution for the data exchange problem, then that should be clear in the specification of the schema mapping. Thus, instead of an implication formula, one should use a bidirectional implication. Therefore, our schema mapping should be specified as:

∀x∀z ( ∃y ( Employee(x, y, z) ∧ y ≠ z ) ↔ Shuttle(x, z) ). (1.2)

If one considers D1 as a source database and the schema mapping specified by Formula (1.2), then the only possible solution for the data exchange problem is D2, since D2 is the only database instance over schema T that together with D1 satisfies (1.2). Thus, in Arenas, Diéguez, and Pérez (2014) we proposed to use what we call bidirectional constraints to specify mappings. These specifications impose constraints over the source and target instances participating in a mapping at the same time, and have the potential to minimize the ambiguity in the description of the target instances that should be materialized when exchanging data. Bidirectional constraints are formulas of the form ∀x̄ (ϕ(x̄) ↔ ψ(x̄)), where ϕ(x̄) is a formula over the source schema and ψ(x̄) is a formula over the target schema. One can obtain several different languages of bidirectional constraints depending on the formulas allowed in the source and target parts. Although bidirectional constraints are natural in several scenarios, they have been almost disregarded in the study and use of schema mappings. There are several reasons for this. First of all, from a logical point of view, it is simpler to deal with unidirectional implications, since bidirectional implications impose more restrictions on the database instances. Second, it is not clear how to use standard database techniques like the chase procedure (Maier, Mendelzon, & Sagiv, 1979) with bidirectional implications.
Notice that the chase procedure lies at the core of almost all algorithms used in data exchange (Fagin, Kolaitis, Miller, & Popa, 2003, 2005; Fagin, Kolaitis, & Popa, 2003; Gottlob & Nash, 2006). In Arenas, Diéguez, and Pérez (2014) it was shown that the chase is still a useful

(15) tool in this scenario, and we expand upon that in this thesis. Last but not least, the interest in dealing with schema mappings as first-class citizens in the schema mapping management area is very recent (Fagin, Kolaitis, Popa, & Tan, 2005; Madhavan & Halevy, 2003; Fagin, 2007; Kolaitis, 2005; Fagin, Kolaitis, Popa, & Tan, 2008; Arenas, Pérez, & Riveros, 2008; ten Cate & Kolaitis, 2009, 2010; Arenas, Pérez, Reutter, & Riveros, 2009b, 2009a; Arenas et al., 2010; Pérez, 2011; Melnik, Adya, & Bernstein, 2008; Bernstein, Halevy, & Pottinger, 2000; Bernstein, 2003; Melnik, 2004; Melnik, Bernstein, Halevy, & Rahm, 2005). Thus, in the short life of the area, it is natural that researchers decided to focus on well-known and well-behaved classes of formulas to define mappings. We do think that schema mappings specified by bidirectional implication formulas deserve a deep investigation, mainly because in many applications they are more natural than mappings defined by unidirectional implications. We also think that in several cases when users map data they are implicitly thinking of bidirectional constraints and not of unidirectional implication formulas. Thus, dealing directly with bidirectional constraints would have a considerable impact in practice, and the research in this area has the potential of laying the foundations for the next generation of data-interoperability tools. This thesis continues the work started in Arenas, Diéguez, and Pérez (2014) in order to study the fundamental problems that arise in data exchange and schema mapping management when mappings are specified by bidirectional constraints.

Summary of contributions

In this thesis we expand the formal study of bidirectional constraints started in Arenas, Diéguez, and Pérez (2014).
Specifically, we study the computational complexity of two fundamental problems in the context of data exchange: checking whether there exists a solution in a given data exchange setting, and answering queries in the data exchange context. In both cases, we provide results for data and combined complexity (Vardi, 1982), and we also distinguish between general dependencies and full dependencies. The latter are a widely used class of dependencies, which will be defined in the next chapter.

(16) In the first part of the thesis, we study the computational complexity of the existence of solutions problem. Regarding this problem, we have the following results:
• Data complexity:
  – PTIME-membership for mappings specified by full dependencies.
  – NP-completeness for mappings specified by general dependencies.
• Combined complexity:
  – Π^P_2-completeness for mappings specified by full dependencies.
  – NEXPTIME-completeness for mappings specified by general dependencies.

In the second part, we study the problem of answering queries in the data exchange context. In this case, besides analyzing data and combined complexity, and considering general and full dependencies, we also distinguish between several query languages with different expressive powers, including non-monotone queries. The results in this part are the following:
• In data complexity, we analyzed the complexity of the problem for five widely used query languages:
  – For mappings specified by full dependencies, the results range from PTIME-membership to coNP-completeness.
  – For mappings specified by general dependencies, the problem is coNP-complete for all the considered languages.
• In combined complexity, we again analyzed the complexity of the problem for six widely used query languages:
  – For mappings specified by full dependencies, the results range from Σ^P_2-completeness to coNEXPTIME-completeness.
  – For mappings specified by general dependencies, the problem is coNEXPTIME-complete for all the considered languages.
• Finally, we proved that the problem is undecidable for unrestricted first-order queries, even when the mapping is specified by full dependencies.

(17) Additionally, in the proofs we introduce some new techniques, such as a modified version of the classical chase procedure, showing that some typical tools from the field are still useful in our setting.

Thesis outline and structure

Chapter 2 introduces all necessary notation and concepts, including notions from relational databases, data exchange and bidirectional constraints. Chapter 3 contains the complexity analysis of the ExistenceOfSolutions problem in several scenarios, while Chapter 4 contains the corresponding analysis for the CertainAnswers problem. Finally, Chapter 5 presents some final remarks and future lines of research.

(18) 2. PRELIMINARIES. We assume some familiarity with first-order logic, computational complexity (Papadimitriou, 1994), database theory (Abiteboul, Hull, & Vianu, 1995), and data exchange (Fagin, Kolaitis, Miller, & Popa, 2005). We also assume that data is represented in the relational model. A relational schema, or just schema, is a finite set {R1, . . . , Rn} of relation symbols, each relation having a fixed arity. Given a schema R, we denote by Inst(R) the set of all instances of R.

2.1. Query languages

Information stored in databases is retrieved via queries. In this thesis we focus on queries expressed by logical formulas, and in particular formulas in fragments of first-order logic (FO). A query Q over a schema R is a first-order logic formula using R as vocabulary. Given a query Q(x̄), where x̄ is the tuple of free variables mentioned in Q, the answer of Q on a particular instance J is the set Q(J) = {t̄ | J |= Q(t̄)}, where |= denotes the standard satisfaction of FO formulas. A query without free variables is called a boolean query, and in that case we say that Q(J) = true if J |= Q, and Q(J) = false if J ⊭ Q. Besides FO, the main query languages that we consider in this thesis are the languages of conjunctive queries (CQ), unions of CQs (UCQ), and the languages obtained from them by adding the equality predicate, the inequality predicate and the negation operator (e.g. UCQ^=, CQ^≠ and UCQ^¬). Additionally, some restrictions to queries with negation are considered; in particular, CQs with one and two negations, and UCQs with one negation per disjunct (denoted by CQ^{1-¬}, CQ^{2-¬} and UCQ^{1-¬}, respectively). We also consider the class of monotone queries, denoted by Mon. This class contains all queries Q over a schema R that satisfy the following property: given two instances J1, J2 over R such that J1 ⊆ J2, it holds that Q(J1) ⊆ Q(J2). Note that this is a semantic class of queries, while the previous ones are syntactic classes.
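The difference between a monotone query and a query with negation can be illustrated on a small instance. The following is a minimal sketch, assuming a hypothetical schema with a single binary relation E and an instance encoded as a dictionary from relation names to sets of tuples; the query functions `q_pos` and `q_neg` are illustrative names and hard-code one query each rather than implementing a general evaluator.

```python
def q_pos(inst):
    """Conjunctive query Q(x) = ∃y E(x, y)."""
    return {(x,) for (x, _y) in inst["E"]}

def q_neg(inst):
    """CQ with one negated atom: Q'(x) = ∃y (E(x, y) ∧ ¬E(y, x))."""
    edges = inst["E"]
    return {(x,) for (x, y) in edges if (y, x) not in edges}

# Two instances over the hypothetical schema {E}, with J1 ⊆ J2.
J1 = {"E": {(1, 2)}}
J2 = {"E": {(1, 2), (2, 1)}}

# For the positive query, the answers on J1 are preserved on J2.
print(q_pos(J1) <= q_pos(J2))
# For the query with negation, the answer (1,) on J1 is lost on J2.
print(q_neg(J1) <= q_neg(J2))
```

A single pair of instances cannot establish monotonicity, but it suffices to witness its failure: the pair J1 ⊆ J2 above shows that Q' is not in Mon.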

(19) 2.2. Schema mappings

Schema mappings are used to define a semantic relationship between two schemas. In this thesis, we use a general definition of a schema mapping: given two schemas S and T with no relation symbol in common, a schema mapping (or just a mapping) M between S and T is a set of pairs (I, J), where I is an instance of S and J is an instance of T. That is, a mapping M is just a subset of Inst(S) × Inst(T). Given an instance I of S, a mapping M associates to I a set of possible solutions for I, denoted by Sol_M(I) and given by Sol_M(I) = {J ∈ Inst(T) | (I, J) ∈ M}. From now on, assume that we have such schemas S and T.

2.2.1. Specifying schema mappings

In practice, schema mappings are represented by using logical formulas. Again, we focus on using fragments of FO to specify mappings. Given a set Σ of FO sentences over vocabulary S ∪ T, we say that a mapping M is specified by Σ if for every pair of instances (I, J) ∈ Inst(S) × Inst(T) it holds that (I, J) ∈ M if and only if (I, J) |= Φ for every Φ ∈ Σ. For convenience, we write the last statement as (I, J) |= Σ. Therefore, we usually refer to a mapping as a triple M = (S, T, Σ), and to the set of solutions as Sol_M(I) = {J ∈ Inst(T) | (I, J) |= Σ}. The usual way to specify schema mappings was introduced by Fagin, Kolaitis, Miller, and Popa (2005). A source-to-target dependency from S to T is a formula of the form

∀x̄ (ϕ(x̄) → ψ(x̄)), (2.1)

where ϕ(x̄) is an FO-formula over S and ψ(x̄) is an FO-formula over T, both with x̄ as their tuple of free variables. We usually drop the outermost universal quantification when specifying these constraints, and thus we only write ϕ(x̄) → ψ(x̄) for formula (2.1). A mapping defined by source-to-target dependencies is called an st-mapping. Depending on which fragments of FO we use to define formulas ϕ(x̄) and ψ(x̄), we obtain a wide range of possible fragments of source-to-target dependencies. Given fragments L1 and

(20) L2 of FO^=, an L1-TO-L2 dependency is a formula of the above form in which ϕ(x̄) is an L1-formula over S and ψ(x̄) is an L2-formula over T. In Fagin, Kolaitis, Miller, and Popa (2005), the language of CQ-TO-CQ dependencies was chosen as the preferred formalism for specifying schema mappings, under the name source-to-target tuple-generating dependencies (st-tgds). In this work, we will also consider full L-TO-CQ dependencies, which are formulas in which the target part is a CQ without existential quantifiers. In this thesis, we study mappings specified by sets of formulas of the following form

∀x̄ (ϕ(x̄) ↔ ψ(x̄)), (2.2)

where ϕ(x̄) is an FO-formula over S and ψ(x̄) is an FO-formula over T, both with x̄ as their tuple of free variables. We call this formula a bidirectional constraint. We also usually drop the outermost universal quantification when specifying these constraints, and thus we only write ϕ(x̄) ↔ ψ(x̄) for formula (2.2). We say that a sentence Φ is an ⟨L1, L2⟩-dependency between S and T if Φ is a bidirectional constraint of the form (2.2) in which ϕ(x̄) is in L1 and ψ(x̄) is in L2. When the source and target schemas are clear from the context, we will only talk about ⟨L1, L2⟩-dependencies. For example, consider schemas S = {Mother(·, ·), Father(·, ·)} and T = {Parent(·, ·)}. Then the following sentence

( Father(x, y) ∨ Mother(x, y) ) ↔ Parent(x, y) (2.3)

is an example of a ⟨UCQ, CQ⟩-dependency, which states that x is a parent of y if and only if x is either the father or the mother of y. As before, an ⟨L, CQ⟩-dependency in which the target part is a CQ without existential quantifiers is called a full ⟨L, CQ⟩-dependency.

2.2.2. Data exchange: Universal solutions and Query answering

In the context of data exchange, the main task is to materialize a target instance for a given source instance. Given a mapping M specified by FO^=-TO-CQ dependencies

(21) and a source instance I, one can define a particular class of solutions in Sol_M(I) called universal solutions. These solutions are the most general among all the possible solutions for I under M (Fagin, Kolaitis, Miller, & Popa, 2005). Moreover, a particular class of universal solutions, called canonical universal solutions, can be generated (in polynomial time) by means of the classical chase procedure (Maier et al., 1979). We refer the reader to Fagin, Kolaitis, Miller, and Popa (2005) for precise definitions of these notions. We denote by chase_Σ(I) the result of applying the chase procedure to an instance I with a set Σ of dependencies. We denote by USol_M(I) and CUS_M(I) the sets of universal solutions and canonical universal solutions for I under M, respectively. In general, for st-mappings, we have that CUS_M(I) ⊊ USol_M(I) ⊊ Sol_M(I). Another important aspect in data exchange is how to answer queries over the target schema. The most accepted semantics for query answering in data exchange is the certain answers semantics: given a mapping M, a source instance I and a query Q over T, the certain answers of Q with respect to I under M is the set

certain_M(Q, I) = ⋂_{J ∈ Sol_M(I)} Q(J).

In other words, a tuple t̄ ∈ certain_M(Q, I) if t̄ ∈ Q(J) for every solution J for I under M. For boolean queries, we say that certain_M(Q, I) = true if Q(J) = true for every solution J, and that certain_M(Q, I) = false if there exists a solution J such that Q(J) = false.
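The definition of certain answers as an intersection can be sketched directly when an explicit, finite collection of solutions is at hand. The following is a minimal sketch under that assumption; the function names `certain_answers` and `certain_boolean` are illustrative. The three sample instances are the Shuttle instances D2, D3 and D4 from the introduction, used here only as a finite sample of the solutions for D1 under Formula (1.1), which has infinitely many solutions. Over a sample, the intersection can only over-approximate the true certain answers.

```python
def certain_answers(query, solutions):
    """Intersect query(J) over an explicit, finite collection of solutions.
    `query` maps an instance to a set of answer tuples."""
    answer_sets = [query(j) for j in solutions]
    if not answer_sets:
        # The convention for an empty set of solutions is not fixed in this sketch.
        raise ValueError("no solutions were provided")
    return set.intersection(*answer_sets)

def certain_boolean(query, solutions):
    """A boolean query is certainly true if it is true in every given solution."""
    truth_values = [query(j) for j in solutions]
    if not truth_values:
        raise ValueError("no solutions were provided")
    return all(truth_values)

# Finite sample of target instances; each is the content of relation Shuttle.
sample = [
    {("Juan", "Valparaiso")},
    {("Juan", "Valparaiso"), ("Diego", "Santiago")},
    {("Juan", "Valparaiso"), ("Juan", "Santiago"), ("Alberto", "Curico")},
]

def q_shuttle(shuttle):
    """Q(x, z) = Shuttle(x, z)."""
    return set(shuttle)

def q_diego_rides(shuttle):
    """Boolean query ∃z Shuttle('Diego', z)."""
    return any(name == "Diego" for (name, _dest) in shuttle)

print(certain_answers(q_shuttle, sample))
print(certain_boolean(q_diego_rides, sample))
```

Only the tuple (Juan, Valparaiso) occurs in all three sample instances, and the boolean query fails because one of them has no tuple for Diego; a boolean query that is false on a sample is false on the full set of solutions as well.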

(22) 3. COMPLEXITY OF THE EXISTENCE OF SOLUTIONS PROBLEM. As defined in Chapter 2, in a data exchange setting M = (S, T, Σ) each source instance I has a corresponding set of solutions, denoted by Sol_M(I). This set represents all the possible ways in which data from the source instance can be exchanged to the target according to the setting. Then, a natural question arises: is there any way in which we can exchange data? Or, more formally, is Sol_M(I) ≠ ∅? In the framework introduced by Fagin et al. (2005), in which Σ consists of st-tgds, this problem is trivial: for every source instance I one can always find a solution. However, in the presence of bidirectional constraints this does not necessarily hold.

Example 1. Take a simple setting M = (S, T, Σ), with Σ consisting of the st-tgds

A(x) → R(x)
B(x) → R(x)

and a source instance I = {B(1)}. It is clear that the target instance J = {R(1)} is a solution for I under M. Now consider the setting M′ = (S, T, ∆), with ∆ consisting of the bidirectional constraints

A(x) ↔ R(x)
B(x) ↔ R(x)

Note that we only changed the implication. The previous source instance I does not have any solution under M′: the fact B(1) forces R(1) to be in the target, which in turn requires A(1) to be in the source. □

The previous example shows that the existence of solutions for a particular source instance is worth studying in settings specified by bidirectional constraints and, in particular, for the fragment of ⟨UCQ^=, CQ⟩-dependencies introduced by Arenas, Diéguez, and Pérez (2014). In the following sections, we study the computational complexity of solving this problem in several scenarios. The results are summarized in Table 3.1, which also shows the theorem where each result is proved.
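The contrast in Example 1 can be reproduced by brute force over a small domain. The following is a minimal sketch, assuming that the unary relations A, B and R are encoded as Python sets and that candidate target instances are drawn from a hypothetical two-element domain; all function names are illustrative. For ∆ the outcome does not depend on the chosen domain, since the two constraints force A = R and B = R as sets.

```python
from itertools import chain, combinations

def subsets(elements):
    """All subsets of `elements`, each returned as a set."""
    elements = list(elements)
    all_combos = chain.from_iterable(
        combinations(elements, k) for k in range(len(elements) + 1))
    return [set(c) for c in all_combos]

def satisfies_sigma(A, B, R):
    """st-tgds of Example 1: A(x) → R(x) and B(x) → R(x)."""
    return A <= R and B <= R

def satisfies_delta(A, B, R):
    """Bidirectional constraints of Example 1: A(x) ↔ R(x) and B(x) ↔ R(x)."""
    return A == R and B == R

def solutions(check, A, B, domain):
    """Target instances R ⊆ domain accepted by `check` for the source (A, B)."""
    return [R for R in subsets(domain) if check(A, B, R)]

# Source instance I = {B(1)}: relation A is empty, relation B contains 1.
A, B = set(), {1}
domain = {1, 2}  # hypothetical finite domain, chosen only for this sketch

print(solutions(satisfies_sigma, A, B, domain))  # solutions under the st-tgds
print(solutions(satisfies_delta, A, B, domain))  # solutions under the bidirectional constraints
```

Under the st-tgds every R containing 1 is accepted, while under the bidirectional constraints no candidate survives.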

(23)
                      full ⟨CQ, CQ⟩     full ⟨UCQ^=, CQ⟩   general ⟨CQ, CQ⟩     general ⟨UCQ^=, CQ⟩
data complexity       in PTIME          in PTIME           NP-complete          NP-complete
                      (Theorem 2)       (Theorem 2)        (Theorem 1)          (Theorem 1)
combined complexity   Π^P_2-complete    Π^P_2-complete     NEXPTIME-complete    NEXPTIME-complete
                      (Theorem 4)       (Theorem 4)        (Theorem 3)          (Theorem 3)

TABLE 3.1. The complexity of ExistenceOfSolutions for bidirectional constraints.

3.1. Data complexity

In this section we study the data complexity (Vardi, 1982) of the ExistenceOfSolutions problem; that is, the mapping is considered to be fixed, and the input is only the source instance. This is a natural way of studying the complexity of this problem, since in practice databases are usually much larger than mapping specifications. Formally, the problem is defined as follows:

Problem:  ExistenceOfSolutions(M)
Input:    An instance I over S.
Question: Is Sol_M(I) ≠ ∅?

The following result establishes upper and lower bounds for the complexity of this problem when we consider mappings specified by ⟨UCQ^=, CQ⟩-dependencies without any further restrictions. We show that the problem is NP-complete, and hence intractable unless PTIME = NP, and that the lower bound holds even when equalities and disjunctions are disallowed.

THEOREM 1.
(1) ExistenceOfSolutions(M) is in NP for every mapping M specified by a set of ⟨UCQ^=, CQ⟩-dependencies.
(2) There exists a mapping M specified by ⟨CQ, CQ⟩-dependencies such that ExistenceOfSolutions(M) is NP-hard.

(24) PROOF. (1) Let M = (S, T, ∆) be a mapping, where ∆ is a set of ⟨UCQ^=, CQ⟩-dependencies. Consider the following set of UCQ^=-TO-CQ st-dependencies:

∆^→ = {∀x̄ (ϕ(x̄) → ∃ȳ ψ(x̄, ȳ)) | ∀x̄ (ϕ(x̄) ↔ ∃ȳ ψ(x̄, ȳ)) ∈ ∆}.

Consider the following proposition:

PROPOSITION 1. Given a source instance I, if there exists a solution J for I under M, then there exists a solution J* for I under M of polynomial size (with respect to I).

The result follows directly from Proposition 1: a nondeterministic algorithm can guess an instance J* of polynomial size and check in polynomial time that J* is a solution. We now prove Proposition 1. First, note that since J is a solution, it holds that (I, J) |= ∆, and then (I, J) |= ∆^→. In this proof we use the solution-aware chase procedure as it is used in Fuxman, Kolaitis, Miller, and Tan (2006): we chase (I, ∅) with ∆^→ and (I, J), obtaining an instance (I, J*) such that J* ⊆ J and (I, J*) |= ∆^→. It is clear that (I, ∅) ⊆ (I, J), so by the results in Fuxman et al. (2006) it holds that J* is of polynomial size w.r.t. (I, ∅) (since (I, J) |= ∆^→ and J* is the result of a solution-aware chase of (I, ∅) with ∆^→ and (I, J)). Now we need to show that J* is a solution for I under M, i.e. (I, J*) |= ∆. By contradiction, suppose that all previous statements hold, but (I, J*) ⊭ ∆. Since (I, J*) |= ∆^→, the only way this could happen is if there exists a dependency δ = ∀x̄ (ϕ_δ(x̄) ↔ ∃ȳ ψ_δ(x̄, ȳ)) ∈ ∆, where ψ_δ(x̄, ȳ) is a conjunction of target relations, such that for a tuple of constants ā it holds that I ⊭ ϕ_δ(ā) and J* |= ∃ȳ ψ_δ(ā, ȳ). Since J* ⊆ J, by monotonicity we know that {b̄ | J* |= ∃ȳ ψ_δ(b̄, ȳ)} ⊆ {b̄ | J |= ∃ȳ ψ_δ(b̄, ȳ)}, and therefore J |= ∃ȳ ψ_δ(ā, ȳ). Given that J is a solution, it holds that (I, J) |= ∆, and in particular (I, J) |= δ, and thus I |= ϕ_δ(ā), which contradicts our initial supposition.
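The construction behind Proposition 1 can be illustrated on a toy mapping. The following is a simplified sketch in the spirit of the solution-aware chase, not the procedure of Fuxman et al. (2006) itself. It assumes that both sides of each constraint are plain conjunctions of atoms whose arguments are variables (no disjunction, equality or constants), that the free variables x̄ are exactly the variables shared by the two sides, and that facts are encoded as pairs (relation name, tuple of constants). All function names and the example mapping V(x) ↔ ∃u C(x, u) are illustrative.

```python
def matches(atoms, facts, partial=None):
    """Yield each assignment (dict: variable -> constant) that extends
    `partial` and maps every atom in `atoms` onto some fact in `facts`."""
    partial = dict(partial or {})
    if not atoms:
        yield partial
        return
    (rel, args), rest = atoms[0], atoms[1:]
    for fact_rel, fact_args in sorted(facts):
        if fact_rel != rel or len(fact_args) != len(args):
            continue
        extended = dict(partial)
        # setdefault binds a fresh variable or returns its existing binding.
        if all(extended.setdefault(v, c) == c for v, c in zip(args, fact_args)):
            yield from matches(rest, facts, extended)

def shared_vars(body, head):
    """Variables occurring on both sides; assumed to be the free variables."""
    body_vars = {v for _, args in body for v in args}
    head_vars = {v for _, args in head for v in args}
    return body_vars & head_vars

def solution_aware_chase(source, constraints, solution):
    """Build J* ⊆ solution satisfying the left-to-right direction of each
    constraint, taking the witnesses for existential variables from `solution`."""
    chased = set()
    for body, head in constraints:
        free = shared_vars(body, head)
        for assignment in matches(body, source):
            frontier = {v: assignment[v] for v in free}
            if next(matches(head, chased, frontier), None) is not None:
                continue  # the head is already satisfied in J*
            witness = next(matches(head, solution, frontier), None)
            if witness is None:
                raise ValueError("the given target instance is not a solution")
            chased |= {(r, tuple(witness[v] for v in args)) for r, args in head}
    return chased

def is_solution(source, constraints, target):
    """Check both directions of every constraint by comparing the answers
    of the two sides on the free variables."""
    for body, head in constraints:
        free = sorted(shared_vars(body, head))
        lhs = {tuple(a[v] for v in free) for a in matches(body, source)}
        rhs = {tuple(a[v] for v in free) for a in matches(head, target)}
        if lhs != rhs:
            return False
    return True

# Toy mapping: V(x) ↔ ∃u C(x, u).
constraints = [([("V", ("x",))], [("C", ("x", "u"))])]
I = {("V", ("a",)), ("V", ("b",))}
J = {("C", ("a", "r")), ("C", ("a", "g")), ("C", ("b", "g"))}

J_star = solution_aware_chase(I, constraints, J)
print(sorted(J_star), is_solution(I, constraints, J_star))
```

In this sketch J* receives at most one group of head atoms per match of a source side in I, which mirrors the polynomial size bound, and every fact of J* comes from J, which is the inclusion J* ⊆ J used in the monotonicity argument.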

(2) We will perform a reduction from graph 3-COLORABILITY to the EXISTENCEOFSOLUTIONS($\mathcal{M}$) problem, for a mapping $\mathcal{M}$ built as follows. Let $G = (V, E)$ be a graph with 2 connected components: $K_3$ and the input graph itself (adding a disjoint copy of $K_3$ does not affect 3-colorability). Let $\mathcal{M} = (\mathbf{S}, \mathbf{T}, \Delta)$ be a data exchange setting such that $\mathbf{S}$ consists of binary relation $E$ and unary relations $V$, $H$ and $\mathit{Error}$, $\mathbf{T}$ consists of binary relations $E'$ and $C$, and the dependencies in $\Delta$ are the following:

    $V(x) \leftrightarrow \exists u\, C(x, u)$    (3.1)
    $E(x, y) \leftrightarrow E'(x, y)$    (3.2)
    $H(u) \leftrightarrow \exists x\, C(x, u)$    (3.3)
    $\mathit{Error}(x) \leftrightarrow \exists y \exists u\, (C(x, u) \wedge C(y, u) \wedge E'(x, y))$    (3.4)

Consider the source instance $I(G) = (V, E, H, \mathit{Error})$, where $H = \{r, g, b\}$ is a set of three colors, none of which is an element of $V$, and $\mathit{Error} = \emptyset$. It is clear that $I(G)$ can be constructed in polynomial time from $G$. We claim that $G$ is 3-colorable if and only if there is a solution for $I(G)$ under $\mathcal{M}$.

($\Rightarrow$) Since $G$ is 3-colorable, there exists a 3-coloring $\mathit{col}(x)$. Without loss of generality, we assume the colors assigned by $\mathit{col}(x)$ are the ones in $H$. We construct the solution $J$ as follows:
• Start with $J = \emptyset$.
• For each $x \in V$, we add the tuple $C(x, \mathit{col}(x))$ to $J$.
• For each $(x, y) \in E$, we add the tuple $E'(x, y)$ to $J$.

Now we show that $(I(G), J) \models \Delta$ by showing that $(I(G), J)$ satisfies each dependency:

(3.1) and (3.2) By construction.

(3.3) [$\rightarrow$] This side of the dependency states that every color is assigned to at least one vertex. As $G$ contains $K_3$ and is 3-colorable, this holds for $G$.

[$\leftarrow$] This side of the dependency states that every color that is assigned to a vertex is in $H$. Since $\mathit{col}(x)$ only assigns colors in $H$, every color mentioned in any tuple $C(x, u)$ is in $H$.

(3.4) Since $\mathit{Error} = \emptyset$, the right-hand side of this dependency must always be false. In other words, for every vertex $x$, there must be no adjacent vertex $y$ such that $x$ and $y$ have the same color assigned in $C$. As the colors in $C$ come from $\mathit{col}(x)$, which is a 3-coloring of $G$, this always holds.

($\Leftarrow$) Given $J$ such that $(I(G), J) \models \Delta$, we generate a coloring $\mathit{col}(x)$ using dependencies (3.1) and (3.3) to choose, for every vertex $x \in V$, a color $c \in H$ such that $C(x, c)$ holds in $J$. Then, $\mathit{col}(x) = c$. We now show that $\mathit{col}(x)$ is a 3-coloring. By contradiction, suppose that $\mathit{col}(x)$ is not a 3-coloring. Therefore, there exists an edge $(y, z) \in E$ such that $\mathit{col}(y) = \mathit{col}(z)$. Using dependency (3.2), we know $(y, z) \in E'$. Since the colors in $\mathit{col}(x)$ are all obtained from $C$, the right-hand side of (3.4) holds for vertex $y$. Since $J$ is a solution for $I(G)$, it follows that $y \in \mathit{Error}$, which is a contradiction since $\mathit{Error} = \emptyset$ in $I(G)$. □

A usual restriction in data exchange is to consider mappings specified by full dependencies. In this scenario, the EXISTENCEOFSOLUTIONS problem can be efficiently solved.

THEOREM 2. EXISTENCEOFSOLUTIONS($\mathcal{M}$) is solvable in polynomial time for every mapping $\mathcal{M}$ specified by full $\langle \mathrm{UCQ}^{=}, \mathrm{CQ} \rangle$-dependencies.

PROOF. Let $\mathcal{M} = (\mathbf{S}, \mathbf{T}, \Delta)$ be a mapping, where $\Delta$ is a set of full $\langle \mathrm{UCQ}^{=}, \mathrm{CQ} \rangle$-dependencies. Consider the following set of full UCQ$^{=}$-TO-CQ st-dependencies:

    $\Delta^{\to} = \{\forall \bar{x}\,(\varphi(\bar{x}) \to \psi(\bar{x})) \mid \forall \bar{x}\,(\varphi(\bar{x}) \leftrightarrow \psi(\bar{x})) \in \Delta\}$.

In this proof we use the chase procedure as it is used in Fagin, Kolaitis, Miller, and Popa (2005) with the dependencies in $\Delta^{\to}$. Note that we chase with full UCQ$^{=}$-TO-CQ

dependencies, and therefore every chase sequence yields the same result. We denote by $\mathrm{chase}_{\Delta^{\to}}(I)$ the chase result for a source instance $I$. Consider the following proposition regarding full mappings:

PROPOSITION 2. A source instance $I$ has a solution under $\mathcal{M}$ if and only if $\mathrm{chase}_{\Delta^{\to}}(I)$ is a solution for $I$ under $\mathcal{M}$.

Since this chase procedure uses UCQ$^{=}$-TO-CQ dependencies, it terminates in polynomial time, and then $\mathrm{chase}_{\Delta^{\to}}(I)$ is of polynomial size. Therefore, if Proposition 2 holds, Theorem 2 holds: we only need to compute the chase and check whether it is a solution, which can be done in polynomial time.

Now we prove Proposition 2.

($\Leftarrow$) This direction is trivial, since $\mathrm{chase}_{\Delta^{\to}}(I)$ is a solution.

($\Rightarrow$) By contradiction, suppose that $\mathrm{SOL}_{\mathcal{M}}(I) \neq \emptyset$ but $J = \mathrm{chase}_{\Delta^{\to}}(I) \notin \mathrm{SOL}_{\mathcal{M}}(I)$. Thus, we have that $(I, J) \not\models \Delta$, and then there must exist a dependency $\delta = \forall \bar{x}\,(\varphi_\delta(\bar{x}) \leftrightarrow \psi_\delta(\bar{x})) \in \Delta$, where $\psi_\delta(\bar{x})$ is a conjunction of target atoms, such that for some tuple of constants $\bar{a}$ one of the following holds:
(1) $I \models \varphi_\delta(\bar{a})$ and $J \not\models \psi_\delta(\bar{a})$.
(2) $I \not\models \varphi_\delta(\bar{a})$ and $J \models \psi_\delta(\bar{a})$.

Note that the first scenario is not possible: given that $I \models \varphi_\delta(\bar{a})$, the chase procedure would have generated the atoms in $\psi_\delta(\bar{a})$. Now, suppose statement 2 holds, and let $Q$ be the conjunctive query $\psi_\delta(\bar{a})$. Since $J \models \psi_\delta(\bar{a})$, there exists a homomorphism $h_1 : I^Q \to J$, where $I^Q$ is the canonical instance of $Q$. Consider now the st-mapping $\mathcal{M}^{\to} = (\mathbf{S}, \mathbf{T}, \Delta^{\to})$. As $\mathrm{SOL}_{\mathcal{M}}(I) \neq \emptyset$, there exists a target instance $J^*$ that is a solution for $I$ under $\mathcal{M}$, and therefore $(I, J^*) \models \Delta$. Moreover, $(I, J^*) \models \Delta^{\to}$, and thus $J^* \in \mathrm{SOL}_{\mathcal{M}^{\to}}(I)$. Now, from the results in Fagin, Kolaitis, Miller, and Popa (2005) we know that $J \in \mathrm{USOL}_{\mathcal{M}^{\to}}(I)$, that is, $J$ is a universal solution for $I$ under $\mathcal{M}^{\to}$, and then there exists a homomorphism $h_2 : J \to J^*$. Thus, there exists a homomorphism $h = h_2 \circ h_1 : I^Q \to J^*$, and then it follows that $J^* \models \psi_\delta(\bar{a})$. We also have that $(I, J^*) \models \delta$, and therefore $I \models \varphi_\delta(\bar{a})$, which contradicts statement 2.
We conclude that it cannot be that $\mathrm{SOL}_{\mathcal{M}}(I) \neq \emptyset$ but $\mathrm{chase}_{\Delta^{\to}}(I) \notin \mathrm{SOL}_{\mathcal{M}}(I)$. □
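The decision procedure behind Theorem 2 (compute the chase with the left-to-right dependencies, then test whether the result is a solution) can be sketched in Python as follows. This is only an illustrative sketch under simplifying assumptions that are not part of the thesis: each full dependency is encoded as a triple `(phi, atoms, arity)`, where `phi` returns the answers of the source query over $I$, `atoms` lists the target atoms as pairs of a relation name and the positions of $\bar{x}$ they use, and every variable of $\bar{x}` occurs in some target atom. The function names and the two-dependency example are hypothetical.

```python
from itertools import product

def chase(I, deps):
    """Chase I with the left-to-right direction of each full dependency."""
    J = {}
    for phi, atoms, _arity in deps:
        for a in phi(I):
            for rel, pos in atoms:
                J.setdefault(rel, set()).add(tuple(a[p] for p in pos))
    return J

def answers(J, atoms, arity):
    """Answers of the full conjunctive query given by `atoms` over J
    (brute force over the active domain; polynomial for a fixed mapping)."""
    dom = {c for tuples in J.values() for t in tuples for c in t}
    return {a for a in product(dom, repeat=arity)
            if all(tuple(a[p] for p in pos) in J.get(rel, set())
                   for rel, pos in atoms)}

def has_solution(I, deps):
    """By Proposition 2, I has a solution iff the chase result is a solution."""
    J = chase(I, deps)
    ok = all(set(phi(I)) == answers(J, atoms, arity)
             for phi, atoms, arity in deps)
    return ok, J

# Hypothetical full mapping: A(x) <-> R(x) AND S(x), and B(x) <-> R(x).
dep_a = (lambda I: set(I.get("A", set())), [("R", (0,)), ("S", (0,))], 1)
dep_b = (lambda I: set(I.get("B", set())), [("R", (0,))], 1)
deps = [dep_a, dep_b]

I1 = {"A": {(1,)}, "B": {(1,)}}   # the chase result {R(1), S(1)} is a solution
I2 = {"A": {(1,)}, "B": set()}    # R(1) is forced by A(1), but B(1) is absent

print(has_solution(I1, deps)[0], has_solution(I2, deps)[0])
```

For `I2` the chase must produce $R(1)$, so the right-to-left direction of the second dependency fails; this is scenario (2) in the proof above, and no solution exists.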

3.2. Combined complexity

In this section, we study the combined complexity (Vardi, 1982) of the EXISTENCEOFSOLUTIONS problem, where both the mapping and the source instance are part of the input. This is the most well-known notion of complexity, in which one does not distinguish between different parts of the input. Formally, the problem is defined as follows:

Problem:  EXISTENCEOFSOLUTIONS
Input:    A mapping $\mathcal{M} = (\mathbf{S}, \mathbf{T}, \Delta)$, where $\Delta$ is a set of bidirectional constraints, and an instance $I$ over $\mathbf{S}$.
Question: Is $\mathrm{SOL}_{\mathcal{M}}(I) \neq \emptyset$?

As in the previous section, we begin by analyzing the complexity of the EXISTENCEOFSOLUTIONS problem with unrestricted $\langle \mathrm{UCQ}^{=}, \mathrm{CQ} \rangle$-dependencies. We prove that there is an exponential blow-up with respect to the data complexity. As before, the lower bound holds without equalities or disjunctions.

THEOREM 3.
(1) EXISTENCEOFSOLUTIONS is in NEXPTIME for the class of mappings specified by $\langle \mathrm{UCQ}^{=}, \mathrm{CQ} \rangle$-dependencies.
(2) For the class of mappings specified by $\langle \mathrm{CQ}, \mathrm{CQ} \rangle$-dependencies, EXISTENCEOFSOLUTIONS is NEXPTIME-hard.

PROOF. (1) It is clear that this problem is in NEXPTIME: we non-deterministically guess a target instance $J$, and then check whether $J \in \mathrm{SOL}_{\mathcal{M}}(I)$, which can be done in exponential time. Notice that a solution $J$ of exponential size with respect to $\mathcal{M}$ and $I$ is guaranteed to exist if $\mathrm{SOL}_{\mathcal{M}}(I) \neq \emptyset$, by applying a solution-aware chase. As we already noted, the length of every solution-aware chase sequence is polynomial in the

size of the source instance, as it was shown in Fuxman et al. (2006), and this polynomial bound becomes exponential when the schema is not fixed.

(2) To prove that the problem is NEXPTIME-hard, we will show a reduction from TILING (Papadimitriou, 1994): given a set of tile types $T = \{t_0, \ldots, t_m\}$, relations $H, V \subseteq T \times T$ (which represent horizontal and vertical adjacency constraints between tile types) and an integer $n$ in unary, the problem is to determine whether there exists a tiling of a $2^n \times 2^n$ square with tiles in $T$ that places the first tile type at the origin and satisfies the constraints imposed by $H$ and $V$. Formally, a tiling is a function $f : \{0, \ldots, 2^n - 1\} \times \{0, \ldots, 2^n - 1\} \to T$ such that $f(0, 0) = t_0$ and, for all $i, j$, $(f(i, j), f(i + 1, j)) \in H$ and $(f(i, j), f(i, j + 1)) \in V$.

Given this, we build a data exchange setting $\mathcal{M} = (\mathbf{S}, \mathbf{T}, \Delta)$, with

    $\mathbf{S} = \{T(\cdot), T_0(\cdot), \ldots, T_m(\cdot), \overline{T_0}(\cdot), \overline{H}(\cdot, \cdot), \overline{V}(\cdot, \cdot), \mathit{Bin}(\cdot), \mathit{Zero}(\cdot), \mathit{One}(\cdot), A(\cdot), B, \mathit{Error}(\cdot, \cdot), \mathit{Error}_0\}$
    $\mathbf{T} = \{\mathit{Tile}, T_0', \ldots, T_m', \overline{T_0}', \overline{H}', \overline{V}', \mathit{Bin}', \mathit{Zero}', \mathit{One}'\}$

Intuitively, source relations $T$, $T_i$, $\overline{T_0}$, $\overline{H}$ and $\overline{V}$ come directly from the problem (where $\overline{H}$ is the complement of $H$, and the same holds for $\overline{V}$ and $\overline{T_0}$). Relation $A$ will be used to compute all possible positions in the square in binary, but it also includes positions that mention the value 2, in order to overcome some limitations. $\mathit{Bin}$, $\mathit{Zero}$ and $\mathit{One}$ will be used to distinguish values. Predicate $B$ has arity $2n$ and will be used to represent a special position, which will be necessary to simulate that every tile type is used in the tiling. Predicates $\mathit{Error}$ and $\mathit{Error}_0$ will represent errors in the tiling and in the initial condition, respectively. Relation $\mathit{Error}_0$ has arity $2n + 1$. Finally, the target relation $\mathit{Tile}$ has arity $2n + 1$, where the first $2n$ parameters represent a position in the square and the last parameter is the tile type assigned to it, and the remaining target relations will be copies of the corresponding source relations.
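To make the definition of a tiling used in this reduction concrete, the following Python sketch checks whether a given function $f$ is a valid tiling. It is only an illustration, and it makes one assumption that the definition above leaves implicit: the adjacency constraints are checked only for pairs of positions that both lie inside the square. The function name `is_tiling` and the two-tile example are hypothetical.

```python
def is_tiling(f, n, t0, H, V):
    """Check f : {0..2^n-1}^2 -> T against the tiling conditions.

    f is a dict from positions (i, j) to tile types; H and V are sets of
    allowed horizontally and vertically adjacent pairs of tile types.
    Assumption: constraints apply only when the neighbour is in the square.
    """
    size = 2 ** n
    if f[(0, 0)] != t0:
        return False
    for i in range(size):
        for j in range(size):
            if i + 1 < size and (f[(i, j)], f[(i + 1, j)]) not in H:
                return False
            if j + 1 < size and (f[(i, j)], f[(i, j + 1)]) not in V:
                return False
    return True

# Hypothetical example with n = 1 (a 2 x 2 square) and tile types "a", "b":
# adjacent tiles must differ, so only a checkerboard pattern is valid.
H = {("a", "b"), ("b", "a")}
V = {("a", "b"), ("b", "a")}
checkerboard = {(0, 0): "a", (1, 0): "b", (0, 1): "b", (1, 1): "a"}
uniform = {(i, j): "a" for i in range(2) for j in range(2)}

print(is_tiling(checkerboard, 1, "a", H, V), is_tiling(uniform, 1, "a", H, V))
```

Note that the square has $2^n \times 2^n$ positions, so this direct check takes time exponential in $n$; the reduction avoids writing the square down by encoding positions as $n$-bit binary words.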

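The source instance $I$ will contain the following relation instances:

    $T = \{t_0, \ldots, t_m\}$, $T_0 = \{t_0\}$, ..., $T_m = \{t_m\}$, $\overline{T_0} = T \setminus \{t_0\}$,
    $\overline{H} = (T \times T) \setminus H$, $\overline{V} = (T \times T) \setminus V$,
    $\mathit{Bin} = \{0, 1\}$, $A = \{0, 1, 2\}$, $\mathit{Zero} = \{0\}$, $\mathit{One} = \{1\}$,
    $B = \{(\bar{2}, \bar{2})\}$, $\mathit{Error} = \mathit{Error}_0 = \emptyset$

From now on, $\bar{x}$ will be shorthand for the tuple of variables $(x_1, \ldots, x_n)$, and analogously for $\bar{y}$. The set $\Delta$ will have the following dependencies:

• Copying dependencies:

    $T_0(x) \leftrightarrow T_0'(x), \; \ldots, \; T_m(x) \leftrightarrow T_m'(x)$    (3.5)
    $\overline{T_0}(x) \leftrightarrow \overline{T_0}'(x)$    (3.6)
    $\overline{H}(x, y) \leftrightarrow \overline{H}'(x, y)$    (3.7)
    $\overline{V}(x, y) \leftrightarrow \overline{V}'(x, y)$    (3.8)
    $\mathit{Bin}(x) \leftrightarrow \mathit{Bin}'(x)$    (3.9)
    $\mathit{Zero}(x) \leftrightarrow \mathit{Zero}'(x)$    (3.10)
    $\mathit{One}(x) \leftrightarrow \mathit{One}'(x)$    (3.11)

• A dependency that assigns a tile type to each position in the square, where the positions are generated using predicate $A$. We also assign tile types to the special positions that include 2's.

    $A(x_1) \wedge \ldots \wedge A(x_n) \wedge A(y_1) \wedge \ldots \wedge A(y_n) \leftrightarrow \exists z\, \mathit{Tile}(\bar{x}, \bar{y}, z)$    (3.12)

• A dependency that ensures that every tile type assigned by dependency (3.12) comes from the given set:

    $T(z) \leftrightarrow \exists \bar{x} \exists \bar{y}\, \mathit{Tile}(\bar{x}, \bar{y}, z)$    (3.13)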

Note that this dependency also forces every tile type to be used, a restriction that is not part of the problem. The following dependency solves this:

    $B(\bar{x}, \bar{y}) \leftrightarrow \exists z_0 \ldots \exists z_m\, (\mathit{Tile}(\bar{x}, \bar{y}, z_0) \wedge \ldots \wedge \mathit{Tile}(\bar{x}, \bar{y}, z_m) \wedge T_0'(z_0) \wedge \ldots \wedge T_m'(z_m))$    (3.14)

This dependency (from left to right) assigns all tile types to the special position $(\bar{2}, \bar{2})$. Note that it also says (from right to left) that if a position has all tile types assigned to it, it must be this special one. This is not a problem, since a valid tiling never uses more than one tile per position, and therefore we are not discarding possible tilings.

• A dependency that sets the first position to tile type $t_0$:

    $\mathit{Error}_0(\bar{x}, \bar{y}, z) \leftrightarrow \mathit{Tile}(\bar{x}, \bar{y}, z) \wedge \overline{T_0}'(z) \wedge \bigwedge_{i=1}^{n} (\mathit{Zero}'(x_i) \wedge \mathit{Zero}'(y_i))$    (3.15)

The intuition behind this dependency is that relation $\mathit{Error}_0$ contains erroneous positions. Since it is empty in $I$, the right-hand side must be false in every solution, and therefore position $(\bar{0}, \bar{0})$ must contain a tile of type $t_0$.

Finally, we explain how to check horizontal and vertical constraints. As it was noted by Kostylev and Reutter (2013), horizontally adjacent positions in the square have the form

    $(w_h 0 1^{n-k-1}, w_v), \; (w_h 1 0^{n-k-1}, w_v)$    (3.16)

where $k \in \{0, \ldots, n - 1\}$, $w_h$ is a binary word of length $k$ and $w_v$ is a binary word of length $n$. Similarly, vertically adjacent positions in the square have the form

    $(w_h, w_v 0 1^{n-k-1}), \; (w_h, w_v 1 0^{n-k-1})$

where $w_h$ is a binary word of length $n$ and $w_v$ is a binary word of length $k$. Thus, for each $k \in \{0, \ldots, n - 1\}$ we will have dependencies

    $\mathit{Error}(z_1, z_2) \leftrightarrow \exists \bar{p} \exists \bar{q} \exists \bar{r} \exists w_1 \exists w_2 \exists \bar{y}\, (\mathit{Tile}(\bar{p}, w_1, \bar{q}, \bar{y}, z_1) \wedge \mathit{Tile}(\bar{p}, w_2, \bar{r}, \bar{y}, z_2) \wedge \overline{H}'(z_1, z_2) \wedge \alpha_k(\bar{p}, \bar{q}, \bar{r}, w_1, w_2, \bar{y}))$    (3.17)

    $\mathit{Error}(z_1, z_2) \leftrightarrow \exists \bar{p} \exists \bar{q} \exists \bar{r} \exists w_1 \exists w_2 \exists \bar{x}\, (\mathit{Tile}(\bar{x}, \bar{p}, w_1, \bar{q}, z_1) \wedge \mathit{Tile}(\bar{x}, \bar{p}, w_2, \bar{r}, z_2) \wedge \overline{V}'(z_1, z_2) \wedge \alpha_k(\bar{p}, \bar{q}, \bar{r}, w_1, w_2, \bar{x}))$    (3.18)

that will check horizontal and vertical constraints respectively, where $\bar{p} = (p_1, \ldots, p_k)$, $\bar{q} = (q_1, \ldots, q_{n-k-1})$, $\bar{r} = (r_1, \ldots, r_{n-k-1})$ and

    $\alpha_k(\bar{p}, \bar{q}, \bar{r}, w_1, w_2, \bar{x}) = \bigwedge_{i=1}^{k} \mathit{Bin}'(p_i) \wedge \bigwedge_{i=1}^{n-k-1} (\mathit{One}'(q_i) \wedge \mathit{Zero}'(r_i)) \wedge \mathit{Zero}'(w_1) \wedge \mathit{One}'(w_2) \wedge \bigwedge_{i=1}^{n} \mathit{Bin}'(x_i)$.

Here the intuition is the same as before: since relation $\mathit{Error}$ is empty in $I$, there cannot exist two adjacent positions whose tile types are in the complement of the horizontal or vertical relation, respectively. Note that we only check the restrictions for positions in binary. Positions that contain 2's are ignored.

It is clear that we can build $\mathcal{M}$ and $I$ in polynomial time. Now we will show that there exists a $2^n \times 2^n$ tiling that satisfies the constraints if and only if there exists a solution for $I$ under $\mathcal{M}$:

($\Rightarrow$) Given that there exists a tiling, we have the function $f$, which we will use to build a solution $J$ as follows:
• Start with $J = \emptyset$.
• For each $x \in T_i$, $0 \leq i \leq m$, we add the tuple $T_i'(x)$ to $J$.
• For each $x \in \overline{T_0}$, we add the tuple $\overline{T_0}'(x)$ to $J$.
• For each $(x, y) \in \overline{H}$, we add the tuple $\overline{H}'(x, y)$ to $J$.
• For each $(x, y) \in \overline{V}$, we add the tuple $\overline{V}'(x, y)$ to $J$.

• For each $x \in \mathit{Bin}$, we add the tuple $\mathit{Bin}'(x)$ to $J$.
• For each $x \in \mathit{Zero}$, we add the tuple $\mathit{Zero}'(x)$ to $J$.
• For each $x \in \mathit{One}$, we add the tuple $\mathit{One}'(x)$ to $J$.
• For each position $(i, j) \in \{0, \ldots, 2^n - 1\}^2$, we add the tuple $\mathit{Tile}(\bar{x}_i, \bar{y}_j, f(i, j))$ to $J$, where $\bar{x}_i, \bar{y}_j$ are the binary representations of $i$ and $j$ respectively.
• For each $\bar{x}, \bar{y}$ of size $n$ composed of 0's, 1's or 2's, and such that at least one of them mentions a 2, we add the tuple $\mathit{Tile}(\bar{x}, \bar{y}, t_0)$ to $J$.
• For each $t \in T$, we add the tuple $\mathit{Tile}(\bar{2}, \bar{2}, t)$ to $J$.

Now we show that $(I, J) \models \Delta$ by showing that $(I, J)$ satisfies each dependency:

(3.5), (3.6), (3.7), (3.8), (3.9), (3.10), (3.11) and (3.12) By construction.

(3.13) [$\rightarrow$] This side of the dependency states that every tile type is assigned to at least one position. Since the special position $(\bar{2}, \bar{2})$ is assigned every tile type, this is true.
[$\leftarrow$] This side of the dependency states that every tile type assigned to a position comes from the original set $T$, which is true by the way $J$ was built.

(3.14) By construction.

(3.15) Since $\mathit{Error}_0$ is empty in $I$, the right-hand side must always be false. This holds in $J$, because position $(\bar{0}, \bar{0})$ is the only one that satisfies the $\mathit{Zero}'$ atoms, and it has only tile type $t_0$ assigned, so the atom $\overline{T_0}'(z)$ is not satisfied.

(3.17) Since $\mathit{Error}$ is empty in $I$, the right-hand side must always be false. This means that two non-compatible tile types (i.e. $(z_1, z_2) \in \overline{H}'$) cannot be assigned to horizontally adjacent positions, which are encoded as explained before. In more detail, note that any position that mentions value 2 does not satisfy the right-hand side, because all bits are forced to be 1's or 0's. Then, this could only happen for positions represented by binary words. Now, let word $\bar{p} w_1 \bar{q}$ represent some number $i \in \{0, \ldots, 2^n - 2\}$. As explained before, word $\bar{p} w_2 \bar{r}$ then represents number $i + 1$. Also,

let $\bar{y}$ represent number $j \in \{0, \ldots, 2^n - 1\}$. Since $f$ is a tiling, it holds that $(f(i, j), f(i + 1, j)) \in H$, and then $(z_1, z_2) \in H$, which implies that $(z_1, z_2) \notin \overline{H}'$, and therefore the right-hand side is always false.

(3.18) Analogous to (3.17).

In conclusion, given that there exists a tiling, we built a solution for $I$ under $\mathcal{M}$.

($\Leftarrow$) Given $J$ such that $(I, J) \models \Delta$, we generate a tiling $f$ using dependencies (3.12) and (3.13) to choose, for every position $(i, j) \in \{0, \ldots, 2^n - 1\}^2$, a tile type $t_l \in T_l$ (and then $t_l \in T$) such that $\mathit{Tile}(\bar{x}_i, \bar{y}_j, t_l)$ holds in $J$, where $\bar{x}_i, \bar{y}_j$ are the binary representations of $i$ and $j$ respectively, and then we set $f(i, j) = t_l$. Without loss of generality, suppose that we choose tile type $t_0$ for position $(0, 0)$ (this is possible because $(I, J)$ must satisfy dependency (3.15)). Now we show that $f$ is a valid tiling.

By contradiction, suppose that $f$ is not a valid tiling. Therefore, there exist $i, j$ such that $(f(i, j), f(i + 1, j)) \notin H$ or $(f(i, j), f(i, j + 1)) \notin V$. For simplicity suppose that the first statement holds (the other case is analogous). Then, $(f(i, j), f(i + 1, j)) \in \overline{H}$, and by dependency (3.7) it holds that $(f(i, j), f(i + 1, j)) \in \overline{H}'$. By dependency (3.12) (and by how $f$ was built) we know that $\mathit{Tile}(\bar{x}_i, \bar{y}_j, f(i, j))$ and $\mathit{Tile}(\bar{x}_{i+1}, \bar{y}_j, f(i + 1, j))$ hold. Now, by equation (3.16) we know that there exists some $k \in \{0, \ldots, n - 1\}$ such that $\bar{x}_i = p_1 \ldots p_k 0 1^{n-k-1}$ and $\bar{x}_{i+1} = p_1 \ldots p_k 1 0^{n-k-1}$, and then for such $k$ we know that $J$ satisfies the right-hand side of dependency (3.17), with $z_1 = f(i, j)$ and $z_2 = f(i + 1, j)$. Therefore, since $J$ is a solution, it must be that $I \models \mathit{Error}(f(i, j), f(i + 1, j))$, which is a contradiction because $\mathit{Error} = \emptyset$ in $I$. □

Even if we restrict to full dependencies, the problem can no longer be solved efficiently in terms of combined complexity.
Thus, we now prove both upper and lower bounds, showing that the problem becomes complete for a complexity class in the polynomial hierarchy that contains NP, and is therefore believed to

be intractable (see Papadimitriou (1994) for details). Again, the lower bound still holds without equalities or disjunctions.

THEOREM 4.
(1) EXISTENCEOFSOLUTIONS is in $\Pi^P_2$ for the class of mappings specified by full $\langle \mathrm{UCQ}^{=}, \mathrm{CQ} \rangle$-dependencies.
(2) For the class of mappings specified by full $\langle \mathrm{CQ}, \mathrm{CQ} \rangle$-dependencies, EXISTENCEOFSOLUTIONS is $\Pi^P_2$-hard.

PROOF. (1) If the dependencies are full, then Proposition 2 holds. Therefore, we can use a non-deterministic machine with an NP oracle that does the following:
• Guess a dependency $\varphi(\bar{x}) \leftrightarrow \psi(\bar{x})$ in $\Delta$, where $\varphi$ is a UCQ$^{=}$ query over $\mathbf{S}$ and $\psi$ is a CQ query over $\mathbf{T}$ without existential quantifiers. Let $\psi(\bar{x}) = R_1(\bar{x}_1) \wedge \ldots \wedge R_n(\bar{x}_n)$, with each $\bar{x}_i \subseteq \bar{x}$.
• Guess $n$ dependencies of the form $\varphi_k(\bar{v}_k) \leftrightarrow \psi_k(\bar{v}_k)$, $1 \leq k \leq n$, in $\Delta$, where $\varphi_k$ is a UCQ$^{=}$ query over $\mathbf{S}$ and $\psi_k$ is a CQ query over $\mathbf{T}$ without existential quantifiers, such that $\psi_k(\bar{v}_k) = \psi_k^1(\bar{w}_k) \wedge R_k(\bar{y}_k) \wedge \psi_k^2(\bar{z}_k)$, where $\psi_k^1$ and $\psi_k^2$ are (possibly empty) conjunctions of target atoms, and $\bar{w}_k, \bar{y}_k, \bar{z}_k \subseteq \bar{v}_k$.
• Guess tuples of constants $\bar{t}_1, \ldots, \bar{t}_n$ of the same arities as $R_1, \ldots, R_n$ respectively, such that $\bar{t}_1 \cup \ldots \cup \bar{t}_n = \bar{t}$ is of the same arity as $\varphi$ and has the same pattern as the right-hand side. That is, $\bar{t}_1, \ldots, \bar{t}_n$ should match $\bar{x}_1, \ldots, \bar{x}_n$.
• Guess tuples of constants $\bar{a}_k$ and $\bar{b}_k$ for each $1 \leq k \leq n$, of the same arity as $\psi_k^1$ and $\psi_k^2$ respectively, such that $\bar{a}_k \cup \bar{t}_k \cup \bar{b}_k = \bar{s}_k$ is of the same arity as $\varphi_k$ and has the same pattern as the right-hand side. That is, $\bar{a}_k$, $\bar{t}_k$ and $\bar{b}_k$ should match $\bar{w}_k$, $\bar{y}_k$ and $\bar{z}_k$ respectively.
• Ask the oracle whether $I \models \varphi_1(\bar{s}_1) \wedge \ldots \wedge \varphi_n(\bar{s}_n)$, and then whether $I \models \varphi(\bar{t})$.

• If the answers are YES and NO, the machine accepts. Otherwise, it rejects.

In other words, the machine accepts if and only if the target instance produced by the chase is not a solution, which is equivalent to saying that there is no solution for $I$ under $\mathcal{M}$. Thus, the complement of the existence-of-solutions problem is in $\Sigma^P_2$, and then the existence-of-solutions problem is in co-$\Sigma^P_2 = \Pi^P_2$.

(2) We will show a reduction from Q3SAT (Stockmeyer, 1976; Wrathall, 1976), the problem of determining whether a QBF formula of the form $\forall \bar{x} \exists \bar{y}\, \varphi(\bar{x}, \bar{y})$, where $\varphi$ is in 3-CNF and $\bar{x}$ and $\bar{y}$ form a partition of the variables mentioned in $\varphi$, is true. First, suppose that $\bar{x} = (x_1, \ldots, x_n)$. Given such a formula, we build a data exchange setting $\mathcal{M} = (\mathbf{S}, \mathbf{T}, \Delta)$, with $\mathbf{S}$ consisting of unary relation $V$ and ternary relations $N_0$, $N_1$, $N_2$ and $N_3$, and $\mathbf{T}$ consisting of $n$-ary relation $R$. The set $\Delta$ will have two dependencies:

    $V(x_1) \wedge \ldots \wedge V(x_n) \leftrightarrow R(x_1, \ldots, x_n)$
    $\exists \bar{y}\, \psi(\bar{x}, \bar{y}) \leftrightarrow R(\bar{x})$

where $\psi$ is a CQ built from $\varphi$ as we now explain. First, let $\varphi = C_1 \wedge \ldots \wedge C_m$, where each $C_i$ is a clause with three literals. Without loss of generality, suppose that negated literals are mentioned at the end of the clauses. Each clause will be replaced with a source predicate among $N_0, \ldots, N_3$ depending on how many negated literals it mentions, using the same variables in the same order. For example, a clause without negated literals will be replaced by $N_0$ over the same variables mentioned in the clause, while a clause with two negated literals will be replaced by $N_2$; e.g. clause $p \vee q \vee \neg r$ is replaced by $N_1(p, q, r)$. Then, $\psi$ is the CQ obtained by replacing all clauses in $\varphi$ following this method and removing the quantifiers.

Finally, the source instance $I$ contains the following tuples for source relations: $V = \{0, 1\}$, $N_0 = \{0, 1\}^3 \setminus \{(0, 0, 0)\}$, $N_1 = \{0, 1\}^3 \setminus \{(0, 0, 1)\}$, $N_2 =$

$\{0, 1\}^3 \setminus \{(0, 1, 1)\}$ and $N_3 = \{0, 1\}^3 \setminus \{(1, 1, 1)\}$. Intuitively, the tuples in $N_0, \ldots, N_3$ are the truth assignments that make each kind of clause true. It is clear that both $\mathcal{M}$ and $I$ can be built in polynomial time. Now we show that $\forall \bar{x} \exists \bar{y}\, \varphi(\bar{x}, \bar{y})$ is true if and only if there exists a solution for $I$ under $\mathcal{M}$:

($\Rightarrow$) It is clear that $R = \{0, 1\}^n$ is a solution for $I$ under $\mathcal{M}$. In the first place, it is the only way the first dependency can be satisfied (since the source query is satisfied by every assignment for $x_1, \ldots, x_n$, because $V(0)$ and $V(1)$ are true in $I$). Now, the second dependency is satisfied given that $\forall \bar{x} \exists \bar{y}\, \varphi(\bar{x}, \bar{y})$ is true, taking exactly the same assignments, since the $N$ predicates are defined using the semantics of propositional logic (note that a truth assignment is nothing more than a function from propositional variables to $\{0, 1\}$, which is exactly what we need for the variables in $\psi$).

($\Leftarrow$) If there exists a solution for $I$ under $\mathcal{M}$, the only possibility is that the solution contains all possible tuples for predicate $R$, since the source query of the first dependency is satisfied by every assignment for $x_1, \ldots, x_n$, as we mentioned before. Now, if a target instance such that $R = \{0, 1\}^n$ is a solution, then $\{\bar{a} \mid I \models \exists \bar{y}\, \psi(\bar{a}, \bar{y})\} = \{0, 1\}^n$, and therefore $\forall \bar{x} \exists \bar{y}\, \varphi(\bar{x}, \bar{y})$ is true: for each possible assignment for the variables in $\bar{x}$, there is an assignment for the variables in $\bar{y}$ that satisfies $\psi$, and since the definition of the $N$ predicates coincides with the semantics of propositional logic, we can take exactly the same assignments for $\varphi$. □
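The construction in part (2) can be sketched in Python as follows. This is an illustrative sketch only: the encoding of literals as pairs `(variable, negated)` and the function names are assumptions made here, not notation from the thesis. The code builds the relations $N_0, \ldots, N_3$, translates each clause into an atom over them, and then checks by brute force whether $\{\bar{a} \mid I \models \exists \bar{y}\, \psi(\bar{a}, \bar{y})\} = \{0, 1\}^n$, which by the proof above is exactly the condition for a solution to exist.

```python
from itertools import product

def build_n_relations():
    """N_k = {0,1}^3 minus the single tuple falsifying a clause whose
    last k literals are negated (zeros first, then k ones)."""
    cube = set(product((0, 1), repeat=3))
    return {k: cube - {(0,) * (3 - k) + (1,) * k} for k in range(4)}

def build_psi(clauses):
    """Replace each clause by an atom N_k(v1, v2, v3).
    A literal is a pair (variable, negated); negated literals go last."""
    atoms = []
    for clause in clauses:
        ordered = sorted(clause, key=lambda lit: lit[1])  # stable sort
        k = sum(1 for _, negated in ordered if negated)
        atoms.append((k, tuple(var for var, _ in ordered)))
    return atoms

def solution_exists(xs, ys, clauses):
    """Brute-force check that every assignment of xs can be extended to ys
    so that all atoms of psi hold in the source instance."""
    N = build_n_relations()
    psi = build_psi(clauses)

    def holds(val):
        return all(tuple(val[v] for v in vs) in N[k] for k, vs in psi)

    for xa in product((0, 1), repeat=len(xs)):
        val = dict(zip(xs, xa))
        if not any(holds({**val, **dict(zip(ys, ya))})
                   for ya in product((0, 1), repeat=len(ys))):
            return False
    return True

# forall x exists y: (x or y or y) and (not x or not y or not y)  -> true
true_qbf = [(("x", False), ("y", False), ("y", False)),
            (("x", True), ("y", True), ("y", True))]
# forall x exists y: (x or x or y) and (x or x or not y)          -> false
false_qbf = [(("x", False), ("x", False), ("y", False)),
             (("x", False), ("x", False), ("y", True))]

print(solution_exists(["x"], ["y"], true_qbf),
      solution_exists(["x"], ["y"], false_qbf))
```

The brute-force loop takes exponential time and only serves to illustrate the correspondence; the reduction itself (building $\psi$ and the instance $I$) is polynomial.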

4. COMPLEXITY OF QUERY ANSWERING

Answering queries is of fundamental importance in databases, since it is the way one can obtain the information stored in them. In the context of data exchange, a usual task is to answer queries over the target schema. Then, a natural question is how to effectively answer such queries. This question has been addressed by defining which target instance one should materialize in order to answer queries in a way that is consistent with the data in the source instance. In the work by Fagin, Kolaitis, Miller, and Popa (2005), the authors showed that one can use a universal solution to obtain the certain answers semantics for positive queries. However, in our setting a universal solution is not even guaranteed to exist.

PROPOSITION 3. There exists a mapping $\mathcal{M} = (\mathbf{S}, \mathbf{T}, \Delta)$, where $\Delta$ is a set of $\langle \mathrm{UCQ}^{=}, \mathrm{CQ} \rangle$-dependencies, such that there is a source instance $I$ for which there is no universal solution under $\mathcal{M}$.

PROOF. Take an st-mapping with the following dependencies:

    $A(x) \leftrightarrow \exists y\, (R(y) \wedge S(y))$
    $B(x) \leftrightarrow R(x)$

and a source instance $I = \{A(3), B(1), B(2)\}$. Applying the second dependency, it is clear that each solution for $I$ must contain tuples $R(1)$ and $R(2)$. Furthermore, it cannot contain any other tuples in relation $R$. Then, in any solution we will have that $R = \{1, 2\}$. Given this, it is easy to see that both $J_1 = \{R(1), R(2), S(1)\}$ and $J_2 = \{R(1), R(2), S(2)\}$ are solutions, but there is no homomorphism from one to the other. Moreover, every solution must contain $S(1)$ or $S(2)$, and therefore it is impossible to have a solution with homomorphisms to all solutions. □

Given this new scenario, it is worth studying how we can answer queries over the target schema in the presence of $\langle \mathrm{UCQ}^{=}, \mathrm{CQ} \rangle$-dependencies. Moreover, as it was shown in Arenas, Diéguez, and Pérez (2014), these dependencies have the potential to specify more tightly which solutions we should consider, and therefore it is interesting to analyze the query answering behaviour of non-monotone queries.

In this chapter, we present the results of the complexity analysis of the CERTAINANSWERS problem in several scenarios, including data and combined complexity analysis, and many query languages. Tables 4.1 and 4.2 summarize the results, showing the references to the theorems where each result is proved.

                                                        CQ                MON               UCQ$^{1\text{-}\neg}$   CQ$^{2\text{-}\neg}$   UCQ$^{\neg}$       FO
full $\langle \mathrm{UCQ}^{=}, \mathrm{CQ} \rangle$        in PTIME          in PTIME          in PTIME                coNP-complete          coNP-complete      undecidable
                                                        Theorem 8         Theorem 8         Theorem 9               Theorems 10 & 6        Theorems 10 & 6    Theorem 11
general $\langle \mathrm{UCQ}^{=}, \mathrm{CQ} \rangle$     coNP-complete     coNP-complete     coNP-complete           coNP-complete          coNP-complete      undecidable
                                                        Theorems 7 & 5    Theorems 7 & 5    Theorems 7 & 6          Theorems 7 & 6         Theorems 7 & 6     Theorem 11

TABLE 4.1. The data complexity of CERTAINANSWERS under bidirectional constraints.

                                                        CQ                     UCQ$^{\neq}$           CQ$^{1\text{-}\neg}$    UCQ$^{1\text{-}\neg}$   CQ$^{2\text{-}\neg}$    UCQ$^{\neg}$
full $\langle \mathrm{UCQ}^{=}, \mathrm{CQ} \rangle$        $\Sigma^P_2$-complete  $\Sigma^P_2$-complete  EXPTIME-complete        EXPTIME-complete        coNEXPTIME-complete     coNEXPTIME-complete
                                                        Theorem 13             Theorem 13             Theorem 14              Theorem 14              Theorems 15 & 12        Theorems 15 & 12
general $\langle \mathrm{UCQ}^{=}, \mathrm{CQ} \rangle$     coNEXPTIME-complete    coNEXPTIME-complete    coNEXPTIME-complete     coNEXPTIME-complete     coNEXPTIME-complete     coNEXPTIME-complete
                                                        Theorem 12             Theorem 12             Theorem 12              Theorem 12              Theorem 12              Theorem 12

TABLE 4.2. The combined complexity of CERTAINANSWERS under bidirectional constraints.

4.1. Data Complexity

In this section, we study the data complexity (Vardi, 1982) of the CERTAINANSWERS problem.
Similarly to Chapter 3, we consider that the mapping and the query are fixed, and the input is only the source instance, along with the tuple we wish to check. As we said before, this is a natural way of studying the complexity of this problem, since in practice databases are usually much larger than mapping specifications. Formally, the problem is defined as follows:

Problem:  CERTAINANSWERS($\mathcal{M}$, $Q$)
Input:    An $n$-tuple $\bar{a}$, and an instance $I$ over $\mathbf{S}$.
Question: Is $\bar{a}$ in $\mathrm{CERTAIN}_{\mathcal{M}}(Q, I)$?

4.1.1. The general case

The following results consider the general case; i.e., when the mapping is specified by unrestricted $\langle \mathrm{UCQ}^{=}, \mathrm{CQ} \rangle$-dependencies. Our first result establishes the upper bound for monotone queries, based on the results in Chapter 3.

THEOREM 5. CERTAINANSWERS($\mathcal{M}$, $Q$) is in coNP for every mapping $\mathcal{M}$ specified by $\langle \mathrm{UCQ}^{=}, \mathrm{CQ} \rangle$-dependencies and every query $Q$ in MON.

PROOF. From Proposition 1 we know that given a source instance $I$, if there exists a solution $J$ for $I$ under $\mathcal{M}$, there exists a solution $J^*$ for $I$ under $\mathcal{M}$ of polynomial size (with respect to $I$). The proof of that proposition uses the solution-aware chase to obtain such a solution, which is also contained in $J$. Note that chasing with any solution for $I$ will produce another solution that satisfies the previous conditions.

Now, given an $n$-ary monotone query $Q$ and an $n$-tuple $\bar{a}$, we want to know whether $\bar{a} \in \mathrm{CERTAIN}_{\mathcal{M}}(Q, I)$. A witness for the complement of this problem is a solution $J_w$ of polynomial size such that $\bar{a} \notin Q(J_w)$. Suppose that there exists some solution $J'$ such that $\bar{a} \notin Q(J')$. If we perform a solution-aware chase, we obtain a solution $J_w$ of polynomial size such that $J_w \subseteq J'$. To conclude, we need to show that $\bar{a} \notin Q(J_w)$, which follows directly from the fact that $Q$ is a monotone query: since $J_w \subseteq J'$, it cannot be that $\bar{a} \in Q(J_w)$ but $\bar{a} \notin Q(J')$. Finally, the algorithm is to guess a polynomial-size solution $J_w$ and check whether $\bar{a} \notin Q(J_w)$. □

As we mentioned before, given the greater expressive power of $\langle \mathrm{UCQ}^{=}, \mathrm{CQ} \rangle$-dependencies, it is worth studying their query answering capabilities regarding non-monotone queries. In particular, the following result establishes the upper bound

of the problem for queries with negation. This proof uses a custom version of the chase procedure, which is defined in detail in the proof.

THEOREM 6. CERTAINANSWERS($\mathcal{M}$, $Q$) is in coNP for every mapping $\mathcal{M}$ specified by $\langle \mathrm{UCQ}^{=}, \mathrm{CQ} \rangle$-dependencies, and every query $Q$ in UCQ$^{\neg}$.

PROOF. Given a query $Q$ as described, we assume it is a boolean query as in Fagin, Kolaitis, Miller, and Popa (2005). We also suppose that $Q = Q_1 \vee Q_2$, where $Q_1$ is a UCQ query without negation, and $Q_2$ is a UCQ$^{\neg}$ query with at least one negated atom per disjunct. Each of these disjuncts has the form

    $\exists \bar{x}\, \Big(\varphi(\bar{x}) \wedge \bigwedge_i \neg R_i(\bar{y}_i)\Big)$, with $\bar{y}_i \subseteq \bar{x}$,

where $\varphi$ is a conjunction of atoms and the $R_i$'s are target relations. Thus, it is easy to see that the negation of $Q_2$ yields a conjunction of a set of disjunctive tgds $\Sigma$ of the form

    $\forall \bar{x}\, \Big(\varphi(\bar{x}) \to \bigvee_i R_i(\bar{y}_i)\Big)$.

It is clear that $\mathit{certain}(Q, I) = \mathit{false}$ if and only if there exists a solution $J$ for $I$ under $\mathcal{M}$ such that $J \models \Sigma$ and $J \not\models Q_1$. Consider the following proposition:

PROPOSITION 4. Given a source instance $I$ and a query $Q$ as described, if there exists a solution $J$ for $I$ under $\mathcal{M}$ such that $J \models \Sigma$ and $J \not\models Q_1$, then there exists a solution $J^*$ of polynomial size with respect to $I$ with the same properties.

Theorem 6 follows directly, since checking the above conditions can be done in polynomial time. We will prove Proposition 4 by using a combination of the chase and disjunctive chase procedures defined in Fagin, Kolaitis, Miller, and Popa (2005) with the solution-aware chase defined in Fuxman et al. (2006), which we call the Disjunctive Solution-Aware Chase. As we are only using tgds, the definitions are rather straightforward.

Definition 1 (Disjunctive Solution-Aware Chase Step). Let $K$ be an instance and let $d$ be a disjunctive tgd $\forall \bar{x}\, (\varphi(\bar{x}) \to (R_1(\bar{y}_1) \vee \ldots \vee R_m(\bar{y}_m)))$. Let $K'$ be an instance that contains $K$ and satisfies $d$. Denote by $d_i$ the tgds obtained from $d$ of the form $\varphi(\bar{x}) \to R_i(\bar{y}_i)$ for each $i \in \{1, \ldots, m\}$, which we say are associated with $d$. Note that, because $K'$ satisfies $d$, $K'$ must satisfy at least one of the tgds associated with $d$. Then, let $D \subseteq \{1, \ldots, m\}$ be the set of the indexes of the tgds associated with $d$ that $K'$ satisfies. Let $h$ be a homomorphism from $\varphi(\bar{x})$ to $K$ such that there are no extensions of $h$ to homomorphisms $h_i'$ from $\varphi(\bar{x}) \wedge R_i(\bar{y}_i)$ to $K$, for each $i \in \{1, \ldots, m\}$. We say that $d$ can be applied to $K$ with homomorphism $h$ and solution $K'$. Note that all the $d_i$'s can be applied to $K$ with homomorphism $h$ and solution $K'$, according to the definition in Fuxman et al. (2006). For each $j \in D$, let $K_j$ be the result of applying $d_j$ to $K$ with $h$ and solution $K'$, according to the definition in Fuxman et al. (2006). We say that the result of applying $d$ to $K$ with $h$ and solution $K'$ is the set $\{K_j \mid j \in D\}$, and write $K \xrightarrow{d, h, K'} \{K_j \mid j \in D\}$. □

In addition to the chase steps defined above, we will use solution-aware chase steps as they were defined in Fuxman et al. (2006).

Definition 2 (Disjunctive Solution-Aware Chase). Let $\Delta$ be a set of tgds and let $\Sigma$ be a set of disjunctive tgds. Let $K$ be an instance and $K'$ be an instance that contains $K$ and satisfies $\Delta \cup \Sigma$.
• A solution-aware chase tree of $K$ with $\Delta \cup \Sigma$ and $K'$ is a tree such that:
  – the root is $K$, and
  – for every node $K_p$ in the tree, if $\{K_{p_1}, \ldots, K_{p_r}\}$ is the set of its children, then there must exist some dependency $d$ in $\Delta \cup \Sigma$ and homomorphism $h$ such that $K_p \xrightarrow{d, h, K'} \{K_{p_1}, \ldots, K_{p_r}\}$.
• A finite disjunctive solution-aware chase of $K$ with $\Delta \cup \Sigma$ and $K'$ is a finite solution-aware chase tree such that for each leaf $K_l$, there is no dependency $d$ in

$\Delta \cup \Sigma$ and there is no homomorphism $h$ such that $d$ can be applied to $K_l$ with $h$ and $K'$. □

It follows directly from the results in Fagin, Kolaitis, Miller, and Popa (2005) and Fuxman et al. (2006) that if the tgds are weakly acyclic, the disjunctive solution-aware chase is finite and polynomial:

PROPOSITION 5. Let $\Delta$ be a set of weakly acyclic tgds, $\Sigma$ a set of disjunctive tgds, $K$ an instance, and $K'$ an instance such that $K \subseteq K'$ and $K'$ satisfies $\Delta \cup \Sigma$. Then every solution-aware chase tree of $K$ with $\Delta \cup \Sigma$ and $K'$ is finite. Moreover, there exists a polynomial in the size of $K$ that bounds the depth of every such tree.

PROOF. Let $\Sigma'$ be the set of all tgds that are associated with some disjunctive tgd in $\Sigma$. Let $\mathcal{T}$ be a solution-aware chase tree of $K$ with $\Delta \cup \Sigma$ and $K'$. Then, every path in $\mathcal{T}$ that starts at the root is a solution-aware chase sequence of $K$ with $K'$, as it was defined in Fuxman et al. (2006), which uses dependencies in $\Delta \cup \Sigma'$. Moreover, it only uses dependencies in $\Sigma'$ that $K'$ satisfies. Now, since all the tgds in $\Sigma'$ are full, they form a weakly acyclic set together with $\Delta$, and then by the results in Fuxman et al. (2006) there exists a polynomial in the size of $K$ that bounds the length of every such path. □

Finally, we now prove Proposition 4. First, note that since $J$ is a solution, it holds that $(I, J) \models \Delta$, and then $(I, J) \models \Delta^{\to}$. Then, we non-deterministically perform a disjunctive solution-aware chase of $(I, \emptyset)$ with $\Delta^{\to} \cup \Sigma$ and $(I, J)$, guessing the sequence of dependencies and homomorphisms to be applied as well as the branch we pick at each step, arriving at a leaf $J^*$. Since $(I, \emptyset) \subseteq (I, J)$ and $(I, J) \models \Delta^{\to} \cup \Sigma$, by Proposition 5 we know that $J^*$ is of polynomial size. It is easy to see that $J^* \subseteq J$, and then $J^*$ is a solution for $I$ under $\mathcal{M}$ (as it was shown in the proof of Theorem 1). Moreover, it holds that $J^* \models \Sigma$, since it is a leaf in the chase tree.
Finally, by monotonicity it must be that J* ⊭ Q_1, and therefore J* is the instance we were looking for.
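The chase step of Definition 1 and the branch-following procedure used in the proof of Proposition 4 can be illustrated with a small sketch. It is not taken from the thesis: it assumes that every disjunctive tgd is full (no existential variables) and that atoms contain only variables, it represents a fact as a pair (relation name, tuple of constants), and the function names `homomorphisms`, `satisfies`, `disjunctive_step` and `one_branch` are hypothetical.

```python
def homomorphisms(body, instance):
    """Yield every variable assignment that maps all body atoms into instance."""
    def extend(atoms, h):
        if not atoms:
            yield dict(h)
            return
        (rel, args), rest = atoms[0], atoms[1:]
        for r, vals in instance:
            if r != rel or len(vals) != len(args):
                continue
            h2 = dict(h)
            # Bind each variable, rejecting the fact on a conflicting binding.
            if all(h2.setdefault(a, v) == v for a, v in zip(args, vals)):
                yield from extend(rest, h2)
    yield from extend(list(body), {})

def image(atom, h):
    """Apply assignment h to an atom, producing a fact."""
    rel, args = atom
    return (rel, tuple(h[a] for a in args))

def satisfies(instance, body, head):
    """Check whether instance satisfies the full tgd body -> head."""
    return all(image(head, h) in instance for h in homomorphisms(body, instance))

def disjunctive_step(K, K_prime, body, heads):
    """Return the children {K_j | j in D}, or None if d cannot be applied to K."""
    # D: indexes of the associated tgds d_i that the solution K' satisfies.
    D = [i for i, head in enumerate(heads) if satisfies(K_prime, body, head)]
    for h in homomorphisms(body, K):
        # h is applicable if no disjunct is already witnessed in K.
        if all(image(head, h) not in K for head in heads):
            return [K | {image(heads[j], h)} for j in D]
    return None

def one_branch(K, K_prime, deps):
    """Follow one root-to-leaf path of a chase tree, always taking the first child."""
    K = frozenset(K)
    while True:
        for body, heads in deps:
            children = disjunctive_step(K, K_prime, body, heads)
            if children:
                K = children[0]
                break
        else:
            return K  # no dependency can be applied: K is a leaf
```

For instance, with the disjunctive tgd V(x) → R(x) ∨ G(x), the call `disjunctive_step(K, K_prime, [("V", ("x",))], [("R", ("x",)), ("G", ("x",))])` branches only on the disjuncts that the solution `K_prime` satisfies. Every fact added this way already belongs to `K_prime`, which is why `one_branch` terminates and returns a sub-instance of the solution, mirroring the claim J* ⊆ J above.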

Following directly from the results in Chapter 3, the CERTAINANSWERS problem in the general case is intractable. Unfortunately, this holds even for boolean conjunctive queries and for dependencies with neither equalities nor disjunctions.

THEOREM 7. There exists a mapping M specified by ⟨CQ, CQ⟩-dependencies, and a query Q in CQ, such that CERTAINANSWERS(M, Q) is coNP-hard, even if Q is boolean.

PROOF. This proof is almost entirely based on the proof of Theorem 1. We again perform a reduction from 3-COLORABILITY. Recall that we have a graph G = (V, E) with no self-loops, and with 2 connected components: K_3 and the graph itself. Let M = (S, T, ∆) be a data exchange setting such that S consists of binary relation E and unary relations V and H, T consists of binary relations E′ and C, and the dependencies in ∆ are the following:

V(x) ↔ ∃u C(x, u)    (4.1)
E(x, y) ↔ E′(x, y)    (4.2)
H(u) ↔ ∃x C(x, u)    (4.3)

Finally, let q be the following query over T:

∃x∃y∃u (C(x, u) ∧ C(y, u) ∧ E′(x, y))

Given a graph G, consider the source instance I_G = (V, E, H), where H = {r, g, b} is a set of three colors, none of which is an element of V. It is clear that I_G can be constructed in polynomial time from G. We claim that G is 3-colorable if and only if certain(q, I_G) = false. In other words, we need to show that there exists a 3-coloration of G if and only if there exists a solution J for I_G under M such that q(J) = false.

(⇒) We build a solution J exactly as in the proof of Theorem 1, where we showed it was indeed a solution. In this case the latter follows immediately, because we need to satisfy fewer dependencies. Now, it is clear that q(J) = false, since otherwise there would exist adjacent vertices x and y with the same color assigned in predicate C, which is impossible since these colors were assigned using the 3-coloration of G.

(⇐) Similar to the proof of Theorem 1, given J such that (I_G, J) |= ∆ and q(J) = false, we generate a coloration col(x) using dependencies (4.1) and (4.3) to choose, for every vertex x ∈ V, a color c ∈ H such that C(x, c) holds in J, and we set col(x) = c. Dependency (4.1) guarantees that such a tuple exists, and dependency (4.3) guarantees that c ∈ H. We now show that col(x) is a 3-coloration. By contradiction, suppose that col(x) is not a 3-coloration. Therefore, there exists an edge (y, z) ∈ E such that col(y) = col(z). Using dependency (4.2), we know (y, z) ∈ E′. Since the colors in col(x) are all obtained from C, then q(J) = true, taking y, z and col(y) for the existential quantifiers, which contradicts our assumption that q(J) = false.

4.1.2. The full case

Now we restrict the problem to full dependencies. In the case of monotone queries, the problem can be solved efficiently.

THEOREM 8. CERTAINANSWERS(M, Q) can be solved in polynomial time for every mapping M specified by full ⟨UCQ=, CQ⟩-dependencies, and every query Q in MON.

PROOF. Consider the following Proposition:

PROPOSITION 6. If Q is a monotonic query over T, then for every source instance I such that SOL_M(I) ≠ ∅, it holds that certain(Q, I) = Q(chase_{∆→}(I)).

Theorem 8 follows directly from the previous Proposition. First, we need a useful Lemma regarding the full scenario:

LEMMA 1. Given a data exchange setting M = (S, T, ∆), where ∆ is a set of full UCQ= dependencies, for every source instance I such that SOL_M(I) ≠ ∅, it holds that chase_{∆→}(I) ⊆ J for every J ∈ SOL_M(I).

PROOF. By contradiction, suppose that there exists a solution J such that chase_{∆→}(I) ⊈ J. Thus, there exists a tuple R(ā) ∈ chase_{∆→}(I) such that R(ā) ∉ J.

Since R(ā) is produced by the chase procedure, there is a dependency ϕ(x̄) ↔ ψ(x̄) in ∆, where ϕ is a UCQ= query and ψ(x̄) = ψ_1(w̄) ∧ R(ȳ) ∧ ψ_2(z̄) with w̄ ∪ ȳ ∪ z̄ = x̄ and where ψ_1 and ψ_2 are (possibly empty) conjunctions of target atoms, and there exists an assignment σ : Var → Const such that I |= ϕ(σ(x̄)) and σ(ȳ) = ā. Then, as J is a solution, it must be that J |= ψ(σ(x̄)), and therefore J |= R(σ(ȳ)) = R(ā), which contradicts our initial supposition.

Now we prove the Proposition, showing the containment in both directions:

(⊆) Given that SOL_M(I) ≠ ∅, by Proposition 2 we know that chase_{∆→}(I) is a solution, and then for every tuple t̄ ∈ certain(Q, I) it holds that t̄ ∈ Q(chase_{∆→}(I)).

(⊇) Since Q is a monotonic query, by Lemma 1 we know that Q(chase_{∆→}(I)) ⊆ Q(J) for every solution J for I under M. Therefore, it holds that Q(chase_{∆→}(I)) ⊆ certain(Q, I).

Moreover, we have found a polynomial-time algorithm for answering unions of conjunctive queries with a restricted use of negation.

THEOREM 9. CERTAINANSWERS(M, Q) can be solved in polynomial time for every mapping M specified by full ⟨UCQ=, CQ⟩-dependencies, and every query Q in UCQ¬ with at most one negated atom per disjunct.

PROOF. Given a query Q as described, we assume it is a boolean query as in Fagin, Kolaitis, Miller, and Popa (2005). We also suppose that Q = Q_1 ∨ Q_2, where Q_1 is a UCQ query without negation, and Q_2 is a UCQ¬ query with exactly one negated atom per disjunct. Each of these disjuncts has the form:

∃x̄ (ϕ(x̄) ∧ ¬R(ȳ)), ȳ ⊆ x̄

where ϕ is a conjunction of atomic formulas. Thus, it is easy to see that the negation of Q_2 is equivalent to the conjunction of a set Σ of full tgds of the form:

∀x̄ (ϕ(x̄) → R(ȳ))
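The equivalence behind this step can be spelled out for a single disjunct. The following is a sketch that uses only standard first-order equivalences (quantifier duality, De Morgan's law, and the definition of implication):

```latex
\begin{align*}
\neg\,\exists \bar{x}\,\bigl(\varphi(\bar{x}) \wedge \neg R(\bar{y})\bigr)
  &\equiv \forall \bar{x}\,\neg\bigl(\varphi(\bar{x}) \wedge \neg R(\bar{y})\bigr) \\
  &\equiv \forall \bar{x}\,\bigl(\neg\varphi(\bar{x}) \vee R(\bar{y})\bigr) \\
  &\equiv \forall \bar{x}\,\bigl(\varphi(\bar{x}) \rightarrow R(\bar{y})\bigr)
\end{align*}
```

Since Q_2 is a disjunction of such disjuncts, its negation is the conjunction of the corresponding implications, and each implication is a full tgd because ȳ ⊆ x̄.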