
An Access Control Model for Querying XML Data Sabrina De Capitani di Vimercati [email protected]

Stefania Marrara [email protected]

Pierangela Samarati [email protected]

Università degli Studi di Milano, Dipartimento di Tecnologie dell’Informazione, via Bramante 65, 26013 Crema (CR), Italy

ABSTRACT

In the last few years, an increasing amount of semistructured data has become available electronically to humans and programs. In such a context, XML is rapidly emerging as the new standard for semi-structured data representation and exchange on the Internet. Securing XML data is therefore becoming increasingly important, and several methods for securing XML data have been proposed. However, these proposals do not take into consideration scenarios where users want to query XML data by using complex query languages. In this paper, we propose an extension to our previous access control model that handles the new standard query language XQuery, which is a powerful and convenient language designed for querying XML data.

Categories and Subject Descriptors
H.2.7 [Database Management]: Database Administration - Security, integrity, and protection

General Terms
Languages, Security

Keywords
XML, XML access control, XQuery

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SWS’05, November 11, 2005, Fairfax, Virginia, USA. Copyright 2005 ACM 1-59593-234-8/05/0011 ...$5.00.

1. INTRODUCTION
An increasing amount of data has become available electronically to humans and programs. Such data are managed via a large number of data models and access techniques, and may come from relational or object-oriented databases (structured data), consist of collections of text or image files (unstructured data), or can be seen as semistructured. Semistructured data are data whose schema is not fixed in advance, and whose structure may be irregular or incomplete [1]. Examples of such data arise in the World Wide Web, where data are stored in sources that do not impose a rigid structure, or when data are combined from several heterogeneous, possibly structured, data sources with different structures. In such a context, XML [3] is rapidly emerging as the new standard for semi-structured data representation and exchange on the Internet, and many XML-based applications have been developed (e.g., Open Financial Exchange, Channel Data Format, Open Software Distribution).

Securing XML data is therefore becoming increasingly important, and different access control models for XML documents have been proposed [2, 4, 9, 10]. These models work in broadly the same way: a security administrator defines a set of security policies and a conflict resolution policy. Given an access request, an algorithm computes a view of the target XML document based on the requester’s rights. In [4] authorizations can be positive or negative and can be defined either at the document level or at the Document Type Definition (DTD) level (in this case authorizations propagate to all instances of the DTD). Authorizations are characterized by a type field defining how the authorizations must be treated with respect to propagation at finer granules and overriding (exception support). The model in [2] supports two kinds of privileges: browsing (read) and authoring (write). Authorizations are specified along with propagation options. Depending on its propagation option, an authorization referring to an element may propagate to all the direct and indirect sub-elements, propagate to all the direct sub-elements only, or not propagate at all. As in the previous model, authorizations can be specified at the DTD level or at the instance level. The model in [10] supports read and write privileges. The authors define three types of propagation policies: no propagation, propagation up (an authorization referring to an element is propagated to all its parent elements), or propagation down (an authorization referring to an element is propagated to all its sub-elements). The conflict resolution policy is either “denials take precedence” or “permissions take precedence”. The main contribution of that paper is a provisional authorization that specifies an action that a user has to perform before obtaining a given privilege. The model in [9] supports the read privilege only. The authors do not define any propagation policy. The conflict resolution policy is based on the priority of the different rules. More recently, an approach has been proposed in [7] that tries to address the write privilege based on the non-standard XML update language XUpdate. The author separates the existence of an XML value from its content by adding a new position privilege that allows a user to know of the existence of a node but not its content. Nodes tagged with a position privilege are shown with a restricted label. A completely different approach to XML access control is taken in [6], where the authors pose the problem of a simple and unambiguous language to state the semantics of an access control policy and propose XPath.

All these models, however, do not take into consideration scenarios where users want to query secured XML data by using complex query languages. The objective of this paper is to extend our previous work [4, 5] by taking into account the new standard query language XQuery [11], which is a powerful and convenient language designed for querying XML data. In particular, we modify the pruning strategy to avoid inferences about the existence of negative labels and the overall structure of a document. Since XQuery includes only querying features and the update syntax is still a research issue, we still focus on the read privilege, and defer the study of the write privilege to future work, when the XQuery update syntax has been standardized.

The remainder of this paper is organized as follows. Section 2 illustrates the main concepts of the access control model. Section 3 illustrates the view computation process. Section 4 describes the query rewriting process. Section 5 concludes the paper.

2. PRELIMINARY CONCEPTS
We illustrate the main characteristics of the access control model on which our approach is based [4]. The definition of an access control model requires the specification of the subjects, objects, and privileges against which authorizations must be specified. Privileges are the actions executable on the objects. Since XQuery supports only read operations, we consider the read privilege only. This choice is motivated by the fact that current XML applications are mostly read-only and that no consensus has emerged so far on a model for XML updates. Besides these basic components, each authorization may be characterized by other components that regulate whether the authorization propagates to other objects (content at a finer granularity) and how it interplays with other authorizations (exception policy). In our model each authorization is then characterized as a 5-tuple ⟨subject, object, action, sign, type⟩, where:

• subject is the subject to which the authorization refers;

• object can be a URI or a URI followed by a path expression;

• action is the action on which the authorization is defined;

• sign indicates whether the authorization states a permission (‘+’) or a denial (‘−’);

• type defines how the authorization must be treated with respect to propagation at finer granules and overriding (exception support) and can take values L (Local), R (Recursive), LD (Local DTD), RD (Recursive DTD), LDH (Local DTD Hard), RDH (Recursive DTD Hard), LS (Local Soft), RS (Recursive Soft).

The remainder of this section discusses the subject, object, and type fields and the language for the specification of authorizations.

Figure 1: A simple example of DTD (a) and a corresponding valid XML document (b)
[The markup of both panels was lost in extraction. The surviving text of panel (b) lists a vehicle with model New Beetle, color Tornado red, and the values $ 22,015, XM Satellite Radio, and $ 375, followed by a sold vehicle with model Touareg, color Offroad gray, price $ 37,140, and a buyer with name Alice, address Freehoarses, 5, and city Sydney.]

2.1 Subjects
Subjects are entities requesting access to data. The basic concept for the characterization of a subject is the person presenting the request, to whom we refer as the user. However, the decision of whether some data may or may not be released may depend not (only) on the user identity but also on the machine from which the user connected. Therefore, we characterize each subject with a triple ⟨user-id, IP-address, sym-address⟩, where user-id is the local identity with which the user connected to the server, IP-address is the IP address of the machine from which the user is connected, and sym-address is the symbolic name associated with the machine from which the user is connected.¹ Abstractions can be defined within the domains of subjects. More precisely, with respect to the user domain, abstractions allow the definition of groups, representing named sets of users. Groups can be nested and need not be disjoint. IP addresses and symbolic names can make use of patterns denoting sets of addresses (e.g., 150.55.* denotes all the machines in subnetwork 150.55).

¹ Note that IP-address and sym-address can be omitted; in this case character ∗ is used to denote any IP address and any symbolic name.
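The subject triple and its abstractions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `GROUPS` table, the user and group names, and the helpers `is_member` and `applies_to` are all hypothetical, groups are assumed to be acyclic, and shell-style wildcard matching (`fnmatch`) stands in for the address patterns.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase

# Hypothetical group hierarchy: group name -> direct members (users or groups).
# Groups may be nested and need not be disjoint; cycles are assumed absent.
GROUPS = {
    "Public": {"Registered", "guest"},
    "Registered": {"Admin", "alice"},
    "Admin": {"bob"},
}


@dataclass(frozen=True)
class Subject:
    user: str        # user-id or group name
    ip: str = "*"    # IP address or pattern such as 150.55.*
    sym: str = "*"   # symbolic name or pattern such as *.example.org


def is_member(user: str, group: str) -> bool:
    """True if user is the group itself or belongs to it, directly or via nesting."""
    if user == group:
        return True
    return any(is_member(user, member) for member in GROUPS.get(group, ()))


def applies_to(auth: Subject, requester: Subject) -> bool:
    """True if an authorization subject covers the requesting subject."""
    return (
        is_member(requester.user, auth.user)
        and fnmatchcase(requester.ip, auth.ip)
        and fnmatchcase(requester.sym, auth.sym)
    )
```

With these definitions, an authorization granted to `Subject("Registered", "150.55.*")` would cover a requester `Subject("alice", "150.55.3.7", "pc1.example.org")`, while one granted to `Subject("Admin")` would not.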

2.2 Objects
Objects are the entities to which accesses can be requested. With reference to the objects, our model supports different levels of granularity ranging from a whole XML document/DTD (identified by its URI) to a single element within an XML document (identified by an XPath expression [12]). Note that DTDs and XML documents can be modeled graphically as trees. A DTD is represented as a labeled tree containing a node for each attribute and element in the DTD. There is an arc between an element and an element/attribute belonging to it, labeled with the cardinality of the relationship. Each XML document is described by a tree with a node for each element, attribute, and value in the document, and with an arc between each element and each of its sub-elements/attributes/values and between each attribute and each of its value(s). Each arc in the DTD tree may correspond to zero, one, or several arcs in the XML document, depending on the cardinality of the corresponding containment relationship (note that arcs are not labeled). In our model, we consider a subset of the XPath expressions defined as follows.

Definition 1: A path expression on a tree is a sequence of element names separated by character / (slash): l1/l2/.../ln. Path expressions may terminate with an attribute name as the last term of the sequence. Attribute names are syntactically distinguished by preceding them with special character @. A condition q may be defined on any label li, enclosing in square brackets a separate evaluation context containing a comparison between a relative path expression and a constant or another expression. Conditional expressions may be combined via and and or operators to build boolean expressions. Multiple conditional expressions appearing in the same path expression are considered to be anded (i.e., all the conditions must be satisfied).

As an example, consider the DTD and the corresponding valid XML document in Figure 1. The document includes a list of vehicles available and sold in a showroom of Melbourne. Each vehicle is described by its model, color, price, an optional set of accessories, and buyer (only for sold vehicles). The buyer element includes a name, address, and an optional percentage of discount on the car’s price. Path expression /showroom/vehicles/available denotes the available elements that are children of the vehicles element that is a child of the showroom element. Path expression /showroom[./@city = "Melbourne"]/vehicles/sold[./model = "Golf"]/price denotes the price elements of the Golf vehicles sold in a showroom of Melbourne.

Figure 2: DTDs of the language
[Panels (a) and (b) were lost in extraction.]

2.3 Type
Authorizations can be specified both at the document level (L, R) and at the DTD level (LD, RD). Authorizations specified at the DTD level propagate to all XML documents that are instances of that DTD. DTD level authorizations are overridden by possible authorizations specified for the instance. A recursive (R or RD) authorization defined on an element propagates to all its sub-elements, but may be overridden (exception support) by authorizations explicitly specified on the sub-element, according to the principle that the more specific authorization takes precedence [8]. By contrast, a local (L or LD) authorization does not propagate. We can imagine scenarios where, for example, one may wish to specify authorizations at the schema level that do not allow exceptions (LDH, RDH). Analogously, one may wish to specify authorizations at the instance level that behave as default rules, in case no schema level statement has been made, but are not intended to override them otherwise (LS, RS).
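The eight type codes combine three independent properties: the level at which the authorization is stated, whether it propagates, and whether it is hard or soft. The table below is a sketch that makes this decomposition explicit; the `AuthType` record, its field names, and the label "normal" for types that are neither hard nor soft are illustrative assumptions, not terminology from the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthType:
    code: str
    level: str       # "instance" (XML document) or "schema" (DTD)
    recursive: bool  # True: propagates to sub-elements; False: local
    strength: str    # "normal", "hard" (no exceptions) or "soft" (default rule)


# One entry per authorization type listed in Section 2.
TYPES = {
    t.code: t
    for t in (
        AuthType("L", "instance", False, "normal"),
        AuthType("R", "instance", True, "normal"),
        AuthType("LD", "schema", False, "normal"),
        AuthType("RD", "schema", True, "normal"),
        AuthType("LDH", "schema", False, "hard"),
        AuthType("RDH", "schema", True, "hard"),
        AuthType("LS", "instance", False, "soft"),
        AuthType("RS", "instance", True, "soft"),
    )
}


def recursive_types():
    """Codes of the types that propagate to sub-elements, in alphabetical order."""
    return sorted(code for code, t in TYPES.items() if t.recursive)
```

Keeping the properties in separate fields lets later steps ask questions such as "does this type propagate?" without parsing the code string.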

2.4 XML-based language

We now introduce a language for the specification of access restrictions on the XML data. The access rules are expressed in XML and comply with the DTDs illustrated in Figure 2. More precisely, access rules defined at the schema level are listed in an XML Authorization Schema file (XAS) which is an XML document valid with respect to the DTD in Figure 2(a). Analogously, access rules associated with an XML document are listed within an XML Authorization Document file (XAD) which is an XML document valid with respect to the DTD in Figure 2(b). The structure of these authorization files is very simple: both XAS and XAD files have a root element, called XAS and XAD, respectively. The XAS root element is characterized by attribute schema which is the schema to which the authorization file is associated. The XAD root element is characterized by attribute doc which is the XML document to which the authorization file is associated. Each authorization file includes one or

this case character ∗ is used to denote any IP address and any symbolic name.

more group-rule elements that include all rules applicable to a given subject specified in the subject attribute. Each element group-rule is then composed of rule elements characterized by attributes object, sign, and type.

Example 1: Consider the DTD and the XML document in Figure 1. The following are examples of protection requirements that can be expressed on these documents.

DTD level (applicable to all showrooms)
D1 Information about the available vehicles is publicly accessible.
D2 Information about the price of available vehicles and the price of their accessories is only accessible to registered users.
D3 Information about the sold vehicles is only accessible by users who are members of the administrative staff.
D4 The price of sold vehicles and the price of their accessories is not publicly accessible.

Instance level (applicable to the showroom in Melbourne)
I1 The price of sold vehicles and the price of their accessories is only accessible by members of the financial staff.
I2 Alice can access information about vehicles, except for discounts and except for those vehicles that have been sold to buyers living in Sydney.

Figure 3: Example of access authorizations.
[The authorization listing was lost in extraction; only its rule labels survive, in this order: D1, D2, D3, D4, D2, D3, I2, I2, I2, I1.]
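To give a feel for the file structure described in this section, the sketch below parses a small XAD and turns each rule into a 5-tuple. The XAD text is hypothetical: it is not the content of Figure 3, and the subject names (`Financial`, `Alice`), the object paths, and the choice of type R are assumptions made for illustration. Only the element and attribute names (XAD, doc, group-rule, subject, rule, object, sign, type) come from the description above, and the action is fixed to read because the model considers the read privilege only.

```python
import xml.etree.ElementTree as ET

# Hypothetical authorization document loosely inspired by requirements I1 and I2.
XAD_TEXT = """
<XAD doc="melbourne.xml">
  <group-rule subject="Financial">
    <rule object="/showroom/vehicles/sold/price" sign="+" type="R"/>
  </group-rule>
  <group-rule subject="Alice">
    <rule object="/showroom/vehicles" sign="+" type="R"/>
    <rule object="//discount" sign="-" type="R"/>
    <rule object='//sold[./buyer/city = "Sydney"]' sign="-" type="R"/>
  </group-rule>
</XAD>
"""


def load_rules(text):
    """Return a list of (subject, (uri, path), action, sign, type) tuples."""
    root = ET.fromstring(text)
    # An XAD root carries attribute doc, an XAS root carries attribute schema.
    uri = root.get("doc") or root.get("schema")
    rules = []
    for group in root.findall("group-rule"):
        for rule in group.findall("rule"):
            rules.append((
                group.get("subject"),
                (uri, rule.get("object")),
                "read",
                rule.get("sign"),
                rule.get("type"),
            ))
    return rules
```

Calling `load_rules(XAD_TEXT)` yields one tuple per rule element, with the subject inherited from the enclosing group-rule.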

Figure 3 shows the authorizations expressing these protection requirements.

3. VIEWS COMPUTATION
The focus of our work is how to enforce access control in such a way that the access control system is safe. We consider an access control system safe when the following conditions are satisfied.

C1. The system provides users with necessary schema information to facilitate query formulation.

C2. The system avoids potential inference by users.

C3. The answer returned to users contains all and only the information that the users are authorized to access (i.e., the answer does not contain any security-violating data).

Figure 4 illustrates the considered scenario. Users must explicitly open a working session by connecting to the system. Connection requires identification of the user and corresponding authentication of her identity. This identity will be used by the system for enforcing access control. (Note that this assumption does not rule out the possibility of anonymous connection. Anonymous connection may be treated with a special user identifier, for example, anonymous.) Besides the identity with which she connected to the system, a user is also associated with the IP address and/or symbolic address of the machine from which she connected to the system. Once the user has been authenticated, she can submit requests to access XML data. More precisely, we work under the assumption that when a user wants to access a specific XML document, the system first returns the corresponding schema. This schema is then used by the user to formulate an XQuery on the XML document. To guarantee condition C1, the schema should be publicly accessible. However, since different users may have different access rights on the same schema and document, our system computes a view of the original schema based on the user identity and the rules specified in the corresponding XAS and XAD. In the following, we describe how the view is computed.

3.1 View process
In the view computation process, DTD level authorizations and instance level authorizations are merged. This means that they are treated in the same way. The result is a schema that includes all and only the information that a user is authorized to access (condition C3). Note that the use of path expressions with conditions provides a mechanism for specifying authorizations on elements/attributes in a content-dependent way. It is then possible that a user can access a given element/attribute only if some conditions are satisfied. These conditions are therefore used to annotate nodes of the DTD tree and are also used in the query rewriting process (see Section 4). The view computation process is basically composed of four steps: initial labeling, conflict resolution, propagation, and pruning.²

² The view computation process is an extension of the process presented in [5]. We therefore keep the description of the common parts at a simplified level.

Figure 4: Framework for querying XML data
[Diagram lost in extraction; surviving labels: user, XQuery, Query management, XQuery’, results, Compute view, XAS, XAD, DTD, view schema, XML document, Access control system.]

3.1.0.1 Initial labeling
The purpose of this step is to associate authorizations with the corresponding elements/attributes. To this purpose, access rules in the XAS and XAD that are applicable to the requester are first determined. Since authorizations can be of different types, we associate with each node n an array, n.veclabel, of eight components, one for each authorization type, each of which is a record including four fields: sign, Allowed, Denied, and xpath. Given a node n and an authorization type t ∈ {LDH, RDH, L, R, LD, RD, LS, RS}, n.veclabel[t].Allowed and n.veclabel[t].Denied are two lists of pairs ⟨sbj, xpath⟩ storing all subjects sbj for which there exists a positive/negative authorization of type t that applies to n through conditional path expression xpath.³ Field n.veclabel[t].sign indicates the sign associated with the node according to the authorizations and the conflict resolution policy (in case of no authorization, n.veclabel[t].sign = ‘ε’). Field n.veclabel[t].xpath indicates the conditional path expressions according to which the user can access (n.veclabel[t].sign = ‘+’) or cannot access (n.veclabel[t].sign = ‘−’) node n.
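The veclabel array described above can be pictured as a dictionary with one record per authorization type. The following Python sketch is an illustration only: it assumes the rules have already been filtered to those applicable to the requester and already resolved to the name of the DTD node they apply to, and the names `Label`, `Node`, and `initial_labeling` are hypothetical. The empty string stands for ε.

```python
from dataclasses import dataclass, field

TYPES = ("LDH", "RDH", "L", "R", "LD", "RD", "LS", "RS")


@dataclass
class Label:
    sign: str = ""                               # '+', '-' or '' (standing for ε)
    allowed: list = field(default_factory=list)  # pairs (sbj, xpath), positive rules
    denied: list = field(default_factory=list)   # pairs (sbj, xpath), negative rules
    xpath: list = field(default_factory=list)    # filled in by conflict resolution


@dataclass
class Node:
    name: str
    # One Label record per authorization type, as in n.veclabel[t].
    veclabel: dict = field(default_factory=lambda: {t: Label() for t in TYPES})


def initial_labeling(nodes, rules):
    """Attach each rule (sbj, node_name, xpath, sign, type) to its node.

    The sign field is left at ε here; it is set by the conflict resolution step.
    """
    for sbj, node_name, xpath, sign, typ in rules:
        label = nodes[node_name].veclabel[typ]
        target = label.allowed if sign == "+" else label.denied
        target.append((sbj, xpath))
```

Storing the whole conditional path expression next to the subject, rather than evaluating it, matches the fact that the view is computed on the DTD and not on the document.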

³ Note that since the view is computed on the DTD, the conditions specified in the path expressions are not evaluated. However, to guarantee condition C3 we need to store the whole conditional path expressions.

3.1.0.2 Conflict resolution
Since different rules of different sign may exist for each authorization type, the determination of the final label associated with each node in the DTD requires the application of a conflict resolution policy. Different approaches can be used: we apply the most specific subject takes precedence principle together with the denials take precedence principle. For each authorization type t, the two lists n.veclabel[t].Allowed and n.veclabel[t].Denied are combined according to the above principle. Intuitively, each subject sbj in n.veclabel[t].Denied is compared with each subject sbj′ in n.veclabel[t].Allowed. If sbj is more specific than sbj′, sbj′ is removed from n.veclabel[t].Allowed. Otherwise, if sbj′ is more specific than sbj, sbj is removed from n.veclabel[t].Denied. At the end, three cases can occur:

• n.veclabel[t].Allowed and n.veclabel[t].Denied are empty: value ‘ε’ is assigned to n.veclabel[t].sign.

• n.veclabel[t].Allowed is not empty and n.veclabel[t].Denied is empty: value + is assigned to n.veclabel[t].sign and the union of all conditional path expressions specified in the pairs ⟨sbj, xpath⟩ in n.veclabel[t].Allowed is assigned to n.veclabel[t].xpath.

• n.veclabel[t].Allowed is empty and n.veclabel[t].Denied is not empty, or the two lists are both not empty: value − is assigned to n.veclabel[t].sign and the union of all conditional path expressions specified in the pairs ⟨sbj, xpath⟩ in n.veclabel[t].Denied is assigned to n.veclabel[t].xpath.

3.1.0.3 Propagation
The labels (signs) associated with each node and the corresponding set of conditional path expressions are then propagated by considering the nodes according to a preorder visit of the tree. The criterion adopted for the propagation states that authorizations on a node take precedence over those on its ancestors. Intuitively, this means that given a node n and its parent p, n.veclabel[t].sign is assigned the value of p.veclabel[t].sign only if n.veclabel[t].sign is equal to ‘ε’. The final sign finlabel and the final conditional path expressions finxpath, called query annotation, of each node n are determined by taking into consideration the fields sign and xpath, respectively, of the components of array n.veclabel considered in their priority order: LDH (local hard), RDH (recursive hard), L (local), R (recursive), LD (local, schema level), RD (recursive, schema level), LS (local soft), and RS (recursive soft). As a result of this step, each node is therefore associated with a unique label reflecting under what conditions finxpath the node can be accessed (finlabel = ‘+’) or not (finlabel = ‘−’). The final label of a node is equal to ‘ε’ in the case where no authorizations have been specified nor can be derived for the node.

3.1.0.4 Pruning
The final view on the DTD can be obtained by executing a postorder visit on the tree and removing any leaf node labeled ‘−’ or ‘ε’ and for which finxpath is empty. After the postorder visit, all the subtrees containing only nodes labeled negative or undefined have been removed. The pruned tree therefore includes only nodes labeled positive, nodes labeled negative and with a query annotation, and nodes labeled negative or undefined with one or more descendants labeled positive. To guarantee condition C2, we transform the pruned DTD tree as follows:

• nodes with a positive label are transformed in the DTD as optional elements or attributes;

• nodes with a negative or undefined label and without a query annotation are renamed to an anonymous label to hide the original label (condition C3).
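The three steps above can be sketched on a toy tree as follows. This is a simplified illustration, not the authors' algorithm: the more-specific-subject filtering is assumed to have been done before `resolve` is called, only the recursive types are propagated (an assumption based on Section 2.3, where local authorizations do not propagate), each component carries a single path expression instead of a union, and the transformation of positive nodes into optional elements is omitted. All function and field names are hypothetical, and the empty string stands for ε.

```python
from dataclasses import dataclass, field

PRIORITY = ("LDH", "RDH", "L", "R", "LD", "RD", "LS", "RS")
RECURSIVE = {"RDH", "R", "RD", "RS"}  # assumption: only these types propagate


@dataclass
class Node:
    name: str
    signs: dict = field(default_factory=dict)    # type -> '+' or '-'; missing means ε
    xpaths: dict = field(default_factory=dict)   # type -> conditional path expression
    children: list = field(default_factory=list)
    finlabel: str = ""
    finxpath: str = ""


def resolve(allowed, denied):
    """Sign of one veclabel component after the more-specific-subject filtering."""
    if not allowed and not denied:
        return ""
    return "+" if not denied else "-"


def propagate(node):
    """Preorder visit: a child inherits a recursive sign only where its own is ε."""
    for child in node.children:
        for t in RECURSIVE:
            if t in node.signs and t not in child.signs:
                child.signs[t] = node.signs[t]
                if t in node.xpaths:
                    child.xpaths[t] = node.xpaths[t]
        propagate(child)


def finalize(node):
    """Take the first non-ε component in priority order as finlabel and finxpath."""
    for t in PRIORITY:
        if t in node.signs:
            node.finlabel, node.finxpath = node.signs[t], node.xpaths.get(t, "")
            break
    for child in node.children:
        finalize(child)


def prune(node):
    """Postorder visit; returns False when the node itself should be removed."""
    node.children = [c for c in node.children if prune(c)]
    return bool(node.children) or node.finlabel == "+" or bool(node.finxpath)


def anonymize(node):
    """Hide the name of negative or undefined nodes that carry no query annotation."""
    if node.finlabel != "+" and not node.finxpath:
        node.name = "anonymous"
    for child in node.children:
        anonymize(child)
```

A typical run calls `propagate`, `finalize`, `prune`, and `anonymize` on the root, in that order. Returning a keep/remove flag from `prune` lets a parent that loses all its children be removed in the same postorder pass.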

Note that if the final DTD tree includes two or more contiguous anonymous nodes, then they are merged into one single anonymous node. Also, the conditional path expressions associated with a node of the resulting view are not visible to the user; they are used in the query rewriting process to guarantee condition C2.

Example 2: Consider user Alice, a registered user who connects to the system to access the XML document in Figure 1(b). Suppose that Alice is neither a member of the administrative staff nor a member of the financial staff. According to the authorizations stated at the DTD level and at the instance level (see Figure 3), Alice can access information about available vehicles and vehicles sold to buyers not living in Sydney. Figure 5(b) illustrates the DTD returned to Alice. Query annotation QA1 = //sold[./buyer/city = "Sydney"] is associated with element sold.

Figure 5: View returned to user Alice
[Diagram lost in extraction; surviving node labels: anonymous; available (model, color, price, accessory, description, price); sold, annotated QA1 (model, color, accessory, description, buyer, name, address, city).]

4. QUERY REWRITING
We now consider the problem of what happens when a user formulates an XQuery by exploiting the schema information obtained as a result of the view computation process described in the previous section. The XQuery is directly executed on the original XML document and therefore it must be transformed to guarantee that the answer returned includes only the information that the user can access. To address this issue, we consider a core fragment of XQuery, where the XPath expressions are those defined in Section 2 and where each XQuery may involve one document at a time. Basically, an XQuery is an expression that consists of four clauses:

• FOR binds one or more variables to a sequence of values returned by another expression (in our fragment, an XPath expression) and iterates over the values;

• LET binds one or more variables but without iterating;

• WHERE contains one or more predicates that filter the set of nodes generated by the FOR/LET clauses;

• RETURN generates the output and may contain one or more element constructors and/or references to variables. It is executed once for each element that is returned by the FOR/LET/WHERE clauses.

The query rewriting process applies to any XPath expression specified in the XQuery. Given an XPath expression p = l1/l2/.../ln, we rewrite any label path from the root to li by incorporating the query annotations associated with node li. In particular, a preliminary transformation consists in replacing any label li = anonymous with // because on the original XML document it may correspond to one or more sequential nodes. Note that we are currently working on the formalization of a query rewriting algorithm and therefore we only illustrate the rationale behind the process. As an example, suppose that Alice submits the following XQuery.

FOR $n IN document("melbourne.xml")/anonymous/sold
WHERE $n/color = "Offroad gray"
RETURN $n/buyer/name

Variable $n iterates over sold vehicles stored in the melbourne.xml document. Each vehicle $n is subject to further filtering by the WHERE clause. The RETURN clause, containing constructors of XML elements, is executed once for each vehicle that satisfies the condition of the WHERE clause.

The query rewriting process needs first to check the XPath expression xp in the FOR clause. The system determines all paths from the root of the original XML document to a node n identified by xp (in our example, xp is equal to //sold). Let P be the set of all these paths (in our example P is equal to /showroom/vehicles/sold). For each path p = l1/l2/.../ln ∈ P and for each label li in p, if li is in the view returned to the user, path p is transformed by adding the query annotation (if any) associated with li.⁴ Let l1/l2/.../li/li+1/.../ln and l1′/l2′/.../li′/li+1/.../ln be the path before and after the introduction of the query annotation, respectively. Then, if the sign associated with li is negative, the path is rewritten as (l1/l2/.../li except l1′/l2′/.../li′)/li+1/.../ln. In this way, nodes that are not accessible by the user are removed.

⁴ To this purpose, the system needs to evaluate the XPath containment between p and the query annotation associated with li.

At the end of this process, all paths p ∈ P have been transformed in such a way that they identify only the nodes that the user can access. The last step of this transformation consists in replacing path xp in the FOR clause with the union of all transformed paths. In our example, the XQuery is rewritten as follows.

FOR $n IN (document("melbourne.xml")/showroom/vehicles/sold
           except document("melbourne.xml")/showroom/vehicles/sold[./buyer/city = "Sydney"])
WHERE $n/color = "Offroad gray"
RETURN $n/buyer/name

5. CONCLUSIONS
We have presented an extension of our access control model that allows users to formulate XQueries by exploiting the schema information returned to them as a security view. This paper represents only a first step towards the development of an access control model that supports XQueries, and much work is still to be done before this model can be used in practice. A necessary improvement concerns the query rewriting process, which should be better formalized.

6. ACKNOWLEDGMENTS
This work was supported in part by the European Union within the PRIME Project in the FP6/IST Programme under contract IST-2002-507591 and by the Italian MIUR within the KIWI and MAPS projects.

7. REFERENCES
[1] Serge Abiteboul. Querying semi-structured data. In Proc. ICDT’97, 1997.

[2] E. Bertino, S. Castano, E. Ferrari, and M. Mesiti. Specifying and Enforcing Access Control Policies for XML Document Sources. World Wide Web Journal, vol. 3, Baltzer Science Publishers, 2000.

[3] W3C Consortium. XML 1.0, Feb. 1998. http://www.w3.org/XML.

[4] Ernesto Damiani, Sabrina De Capitani di Vimercati, Stefano Paraboschi, and Pierangela Samarati. Securing XML documents. Lecture Notes in Computer Science, 1777:121–??, 2000.

[5] Ernesto Damiani, Sabrina De Capitani di Vimercati, Stefano Paraboschi, and Pierangela Samarati. A fine-grained access control documents.

[6] I. Fundulaki and M. Marx. Specifying Access Control Policies for XML Documents with XPath. ACM Symp. on Access Control Models and Technologies (SACMAT), 2004. citeseer.ist.psu.edu/640891.html

[7] A. Gabillon. An authorization model for XML databases. In Proc. of the 11th ACM Conference on Computer Security (Workshop Secure Web Services), George Mason University, Fairfax, VA, USA, 2004.

[8] T.F. Lunt. Access control policies for database systems. In C.E. Landwehr, editor, Database Security, II: Status and Prospects, pages 41–52. North-Holland, Amsterdam, 1989.

[9] A. Gabillon and E. Bruno. Regulating Access to XML documents. Fifteenth Annual IFIP WG 11.3 Working Conference on Database Security, Niagara on the Lake, Ontario, Canada, July 2001.

[10] M. Kudo and S. Hada. XML Document Security based on Provisional Authorization. Proc. of the 7th ACM Conference on Computer and Communication Security, Athens, Greece, November 2000.

[11] W3C. XML query (XQuery) version 2.0, 2004. http://www.w3.org/XML/Query.

[12] World Wide Web Consortium (W3C). XML Path Language (XPath) Version 1.0, November 1999. http://www.w3.org/tr/xpath.