XML Types for C#

Erik Meijer, CLR
Wolfram Schulte, MSR

Abstract

This paper proposes a language extension that adds native support for XML to the type system of C#. In our approach XML documents or document fragments become first class citizens that benefit from the full range of features available in a modern programming language like C#. XML elements can be constructed, loaded, passed, transformed, updated, and written in a true type-safe manner. To our knowledge, no other approach to XML provides for type safety and syntactic integration of this sort. Existing approaches are either completely untyped (some form of string processing) or rely on schema-to-class translation patterns that have many limitations. Our approach is a true semantic integration via the type system. One of the consequences of type integration is that many runtime checks may be moved to compile time. This has many benefits for performance and program correctness. The type system extension is based on XML Schemas and incorporates operational aspects from XPath, XSLT and XQuery. The implementation uses the existing capabilities of the System.XML namespace in Microsoft’s Common Language Infrastructure to provide its functionality. In this document, we use the term C#-xml to mean the combination of C# and the proposed extensions. We present these extensions through a series of worked examples. Although we restrict our attention to XML and C# in this paper, our approach can be applied to integrate any standard object-oriented programming language (C#, Visual Basic, Java, etc.) with any data-structuring language (XML, SQL, etc.).

1 Introduction

XML has become the lingua franca for data exchange over the Internet [Box, Skonnard, Lam]. As an open industry standard, XML lets developers describe the data exchanged between different devices, applications or businesses. XML is used in database applications as the language to describe the structure of the database and its views as well as for data access. XML is also used for dynamic layout of web pages.

But what is so special about XML? XML schemas can describe typed content. Typing helps guarantee the interoperability of applications.

Currently, there is a plethora of special-purpose XML processing languages. [XSLT] is for transforming well-formed but untyped documents. [XDuce] and [XMLambda] are typed alternatives, but they are mainly experiments in language design; their use is restricted to a subset of [XML 1.0]. [XPath] and [XQuery] are languages for query processing. XPath is untyped; XQuery, however, is typed. Yet all of these languages have the same problems:

• they all come with their own syntax;
• they have no or only restricted type systems;
• they have no module system and no separate compilation;
• they have only a very restricted library.

Furthermore, as soon as you want to compute something beyond the special purpose for which they were developed, you reach their limits. Many of them have ended up reimplementing features that are already in programming languages, but do so only partially and in an ad hoc manner. Table 1 compares the different languages and their feature sets.

So why not use our favorite object-oriented programming language to support XML processing? The story so far is disappointing. One of the main reasons is that the type systems of XML and class-based object-oriented languages don’t match very well. In [C#], you have subtyping based on named relationships, while the type system of XML schemas is based on structural equivalence, named relationships and range restrictions. As a remedy for this problem, two approaches have been taken to integrate XML into programming languages (see also Table 1):

• either one uses XML in an untyped way,
• or else one translates XML types into classes.

A typical example of the former is the support for XML in the Common Language Infrastructure [CLI], where XML processing is supported in the form of libraries. This is extremely flexible; whenever new functionality is needed, a library is added. However, processing XML this way is not type safe. Furthermore, it is not efficient, because one always deals with untyped data, which requires later validation to be on the safe side.

The alternative, also supported in the CLI distribution, is to use a schema compiler that can map either schemas to classes or classes to schemas. As long as only very primitive forms of schemas or classes are used, the mapping works; in most cases, however, the results are disappointing. [Box], for example, noted that many XML constructs can’t be mapped easily onto programmatic types: choice, derivation by restriction, mixed content, or “nillable” elements of value types. Likewise, some programming language constructs don’t map easily onto typed XML. For instance, there is no canonical way to map arrays or typed references onto XML; that’s one of the reasons for the complexity of the SOAP specification.
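To make the cost of the untyped library approach concrete, the following is a minimal sketch using the standard System.Xml DOM classes. The element and attribute names follow the bib example used later in the paper; the class name UntypedDemo, the method TotalPrice and the sample data are assumptions made for illustration only. Note that the XPath query and the attribute name are plain strings, so a misspelling is not caught by the compiler, and a non-numeric price only fails when the code runs.

```csharp
using System;
using System.Xml;

public static class UntypedDemo
{
    // Sums the price attributes of all book elements under bib.
    // Nothing here is checked against a schema at compile time.
    public static int TotalPrice(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        int total = 0;
        // A typo in this path would simply select no nodes.
        foreach (XmlNode node in doc.SelectNodes("/bib/book"))
        {
            XmlAttribute price = node.Attributes["price"];
            if (price != null)
                total += int.Parse(price.Value); // throws FormatException at run time if not an integer
        }
        return total;
    }

    public static void Main()
    {
        // Hypothetical sample data; the third book has no price.
        string xml = "<bib><book price=\"30\"/><book price=\"45\"/><book/></bib>";
        Console.WriteLine(TotalPrice(xml)); // prints 75
    }
}
```

Every property that a schema would guarantee (the price is optional, and it is an integer when present) has to be re-checked by hand in such code.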

Language/Feature  Purpose                  Paradigm/Syntax                        Type system       Technology            Restrictions
XML Schema        Type declarations        Data defs / XML                        XML               Validating parser     Sublanguage
XPath             Projection               Functional / Directory paths           None              Interpreter/Compiler  Sublanguage
XQuery            Query language           Functional / Own                       XML               Prototype             Sublanguage
XSLT              Transformation language  Functional / XML                       None              Interpreter/Compiler  Sublanguage
XDuce             Explore DTD type system  Functional / SML                       Monomorphic DTDs  Interpreter/Compiler  Experimental
XMLambda          Explore DTD type system  Data defs; Functional / Haskell + XML  Polymorphic DTDs  Not implemented       Experimental
System.XML        XML support for C#       Imperative / C#                        C#                Library               XML processing untyped
XSD Compiler      XML support for C#       Data defs; Imperative / C#             C#                Compiler              C# and XML types don’t match
C#-xml            XML support for C#       Imperative / C# + XML + XPath          C# + XML          Precompiler           None

Table 1: Different Languages and Approaches to Using and Integrating XML

We solve these problems by integrating the type system of XML as a first-class citizen in the programming language. We use C# in this paper, but a similar approach would work for other languages. We call this C#-xml. We add XML Schemas as types, and XML document fragments as literals. The proposed type system guarantees that

• valid document fragments are well-typed expressions;
• invalid fragments are ill-typed expressions.

We also integrate ideas from XSLT and XPath for a limited form of pattern matching, and from XQuery for a limited form of set-based operations. For interoperability we provide mappings between the CLI and schema types. The resulting language is an extension of C#. A pre-compiler translates C#-xml into C#.

We assume that the reader is familiar with basic C# concepts and syntax and has a rough understanding of using XML. XML schemas are nicely explained in [XML Schema Part0].

The paper is organized as follows. Section 2 discusses the mapping of C# types to XML schemas and vice versa. Section 3 shows how we build dynamic documents. Section 4 discusses projection and selective update. Section 5 presents C#-xml’s support for iteration, here used for query processing and stream processing. Section 6 discusses C#-xml’s provisions for dealing with well-formed but not well-typed documents. Section 7 concludes.

The Appendices are not yet written. (However, by the time this document is read they should be on the Web.) Appendix A gives an example application. Appendix B defines the grammar for the extension. Appendix C describes part of the type system. Appendix D gives the signatures for the new CLI functionality.

2 XML Schemas and CLI Types

The main challenge of the integration of XML into C# is to engineer a bridge between both type systems. In the following we assume that the reader knows the type system of C# and knows a little bit about XML and the former [XML 1.0] document definitions.

XML Schemas - An Overview

The purpose of an XML schema is to define a class of XML documents. XML schemas extend [XML 1.0] document definitions considerably. Among other things, schemas provide a uniform syntax, support derivation of document types (similar to inheritance in OO languages), permit all-groups and nested definitions, and provide atomic types (similar to built-in types in programming languages). Here is a fragment of a database schema for a collection of books.

The schema declares two elements, bib and book, having type Bib and Book respectively. The bib element declaration describes an XML document tree with root <bib>..</bib> whose children consist of a list of book elements. Likewise, a book can be a root element. Each book has a mandatory isbn, optional attributes for year and price, followed by a title, followed by a non-empty list of authors. Note that the content of each element and attribute is typed. An example document that conforms to this schema contains two books: Essential XML, with authors Box, Skonnard and Lam, and Component Software, with author Szyperski.

The value of this document is a node-labeled tree (in XML parlance: an abstract information set [Infoset]). The root of the instance document is a document node; it defines the character encoding and version information. In our example it contains the bib element as its only child. The node for the bib element contains the namespace node, a node for the attribute schema location and two element subtrees. The book nodes contain attribute nodes, and title and author subtrees.
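The node-labeled tree described above can be made visible with a short sketch that loads a bib document into the System.Xml DOM and prints one line per node. This is only an illustration under stated assumptions: the attribute values (isbn, year) are invented placeholders, the namespace and schema-location attributes mentioned in the text are omitted, and the names InfosetDemo, Sample and Outline are hypothetical. The DOM is also only an approximation of the abstract information set.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class InfosetDemo
{
    // Placeholder instance document; the attribute values are invented for illustration.
    public const string Sample =
        "<?xml version=\"1.0\" encoding=\"utf-8\"?>" +
        "<bib>" +
        "<book isbn=\"1\" year=\"2000\"><title>Essential XML</title>" +
        "<author>Box</author><author>Skonnard</author><author>Lam</author></book>" +
        "<book isbn=\"2\"><title>Component Software</title><author>Szyperski</author></book>" +
        "</bib>";

    // Returns an indented outline of the tree, one entry per node.
    public static List<string> Outline(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        var lines = new List<string>();
        Walk(doc, 0, lines);
        return lines;
    }

    private static void Walk(XmlNode node, int depth, List<string> lines)
    {
        // Text nodes are labeled by their value, all other nodes by type and name.
        string label = node.NodeType == XmlNodeType.Text
            ? "Text \"" + node.Value + "\""
            : node.NodeType + " " + node.Name;
        lines.Add(new string(' ', 2 * depth) + label);

        // Only element nodes carry an attribute collection; it is null otherwise.
        if (node.Attributes != null)
            foreach (XmlAttribute a in node.Attributes)
                lines.Add(new string(' ', 2 * depth + 2) + "Attribute " + a.Name + "=" + a.Value);

        foreach (XmlNode child in node.ChildNodes)
            Walk(child, depth + 1, lines);
    }

    public static void Main()
    {
        foreach (string line in Outline(Sample))
            Console.WriteLine(line);
    }
}
```

The outline starts with the document node, followed by the XML declaration and the bib element, whose two book subtrees each contain attribute nodes and title and author children.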

Our aim is to integrate these XML schema types and conforming documents into C#. To do so let us analyze schemas more carefully.

Simple Types

A simple datatype, as defined in [XML Schema Part 1], is either primitive (e.g., xs:string, xs:boolean, xs:float, xs:double, xs:ID, xs:IDREF) or derived from another simple type by specifying a set of facets (e.g., xs:language, xs:NMTOKEN, xs:long, etc., or user defined). Facets are constraints such as minimum or maximum values of numbers or regular expressions for strings.

A type hierarchy is induced between simple types by subset ordering on the value spaces of the corresponding types. For instance, the value space of the XML type string is the set of finite-length sequences of characters. The value space of normalizedString is the set of strings that do not contain the carriage return, line feed or tab characters. Thus in XML every value of type normalizedString is a string but not vice versa; normalizedString is a (proper) subtype of string. Note that the value spaces of simple types may overlap; a simple value may be an instance of more than one schema simple type. (We defer the discussion of lists and unions of simple datatypes until we discuss the complex types.)

All of C#’s built-in types like bool, int or string have a corresponding XML simple type. For other simple types, like normalizedString, C#-xml provides corresponding structures in the namespace System.XML.Types. Constructors for these structures take C# built-in types and check whether their range restriction is met. Here is the C# struct for XML’s normalizedString.

    namespace System.XML.Types {
      public struct NormalizedString {
        private readonly string Value;

        public NormalizedString(string s) {
          // Reject carriage return, line feed and tab.
          foreach (char c in s)
            if (c == '\r' || c == '\n' || c == '\t')
              throw new RangeRestrictionException();
          Value = s;  // strings are immutable, so no copy is needed
        }

        public static implicit operator string(NormalizedString s) {
          return s.Value;
        }
      }
    }

The constructor takes a string and, if the passed string doesn’t contain any carriage returns, line feeds or tabs, assigns it to the field Value. Runtime checking is necessary, since checking whether the constraints are fulfilled is in general not possible at compile time. To access the embedded value, C#-xml uses implicit conversion operators.

The proposed encoding of simple types in C#-xml does not maintain the subtype order on XML’s simple types; for instance, in C#-xml a normalized string is not a subtype of string. However, many of the properties provided by subtypes (for instance ease of use, flexibility, and type safety) are retained because of the implicit casting that allows normalized strings to be passed to functions that expect a string. Simple types for which C#-xml provides corresponding C# structures are very limited. They have only one field, of the base type from which the type derives. Values of simple types never contain object references.
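The effect of the implicit conversion can be illustrated with a small self-contained sketch. It repeats a compact version of the struct so that it compiles on its own and declares it outside System.XML.Types; the RangeRestrictionException class defined here is a local stand-in for the one the proposed namespace would supply, and ConversionDemo, CountWords and IsNormalized are hypothetical helper names.

```csharp
using System;

// Local stand-in for the exception the proposed System.XML.Types namespace would provide.
public class RangeRestrictionException : Exception { }

public struct NormalizedString
{
    private readonly string value;

    public NormalizedString(string s)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));
        // The range restriction: no carriage return, line feed or tab.
        if (s.IndexOfAny(new[] { '\r', '\n', '\t' }) >= 0)
            throw new RangeRestrictionException();
        value = s;
    }

    // Widening to string is always safe, so it is implicit.
    public static implicit operator string(NormalizedString s) { return s.value; }
}

public static class ConversionDemo
{
    // An ordinary method that knows nothing about NormalizedString.
    public static int CountWords(string text)
    {
        return text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    // Going from string to NormalizedString needs the runtime check.
    public static bool IsNormalized(string s)
    {
        try { new NormalizedString(s); return true; }
        catch (RangeRestrictionException) { return false; }
    }

    public static void Main()
    {
        var title = new NormalizedString("Essential XML");
        Console.WriteLine(CountWords(title));           // accepted where a string is expected
        Console.WriteLine(IsNormalized("two\nlines"));  // the line feed violates the restriction
    }
}
```

The direction string to NormalizedString deliberately has no implicit operator in this sketch, which mirrors the subset ordering of the value spaces: every normalized string is a string, but the converse must be checked.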

Complex Types

Complex types contain elements and/or attributes (see [XML Schema Part 2]). Complex types build their own type hierarchy. A complex type is either a restriction of a complex base type, an extension of a simple or complex type definition, or a restriction of the super type of all complex types, called AnyComplexType (see also Section 6). Restriction and extension are based (as in class-based languages) on named relationships. The content model for complex types, however, is based on regular tree grammars. Regular tree grammars support structural subtyping.

The content model for C#-xml is adapted from [XML Schema: Formal Description]. To simplify type checking it unifies various constructs that are distinct in XML. For instance, mandatory and optional attributes (see for example the attributes isbn and price in our bib schema) are handled as if they were defined as a sequence with the appropriate number of occurrences. However, it distinguishes everything that is relevant for checking structural subtyping. In this respect C#-xml follows the design decisions taken by [XQuery Formal Semantics]. The following simplified grammar makes this more precise:

    p in primeType ::= qName | `*:*`         name(set)s

    t in typeDefn  ::= elem p { g }          global type
                    |  attr p { g }
                    |  group p { g }

    g in groupDefn ::= p                     element, group or attribute reference
                    |  t                     local type
                    |  all{g1;…;gn}          interleaved product
                    |  {g1;…;gn}             heterogeneous sequence, g1 followed by … gn
                    |  choice{g1;…;gn}       choice, g1 or … gn
                    |  mixed g               mixed content of g
                    |  g[m-n]                homogeneous sequence of type g

A prime type is either a qualified name, denoting a type with a given URI and local name, or a wildcard. The wildcard `*:*` denotes any name in any namespace. C#-xml allows one to define elements, attributes and (named) groups. We have the following group constructors:

The empty sequence (written {}) matches only the empty document; it is an identity for sequence and all. The empty choice (written choice {}) matches no document; it is a unit for choice. An interleaved product all{g1;..;gn} matches all documents that contain values in groups g1 up to gn in arbitrary order. All-groups in XML Schema are a specialization of “all” in this type system. In XSD they can consist only of global or local element declarations. The homogeneous sequence type T[m-n] matches a minimum of m values and a maximum of n values. The length of the sequence is unbounded if n = *. In the sequel we will call homogeneous sequences simply sequences.

C#-xml users normally don’t have to write type definitions using this grammar. C#-xml is able to import a schema directly; this is shown in Section 3. However, if a query results in a return type that is not predefined by a schema, then they can denote the type using the above grammar. For example, the C#-xml types for the given Bib schema are as follows:

    group Bib {
      elem bib { Book } [0-*];
    }

    group Book {
      attr isbn { int };
      attr year { int } [0-1];
      attr price { int } [0-1];
      elem title { string };
      elem author { string } [1-*];
    }

(This declaration uses the fact that the type {g1;…; gn} can be written as g1;…;gn as long as it is unambiguous.) XML schemas are much richer than what is provided in C#-xml. We deliberately decided not to include the whole type language of XML in C#-xml. To use more advanced features one has to import the corresponding schema; however the result of every query and of every literal is expressible using the given syntax.
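As a rough illustration of how occurrence bounds such as [1-*] constrain content, the sketch below checks a list of child element names against the element part of the Book group (title once, then author [1-*]). It is not the paper’s type checker: the names Particle, ContentModel and Matches are hypothetical, attributes and nested groups are ignored, and the greedy strategy is only adequate because adjacent particles in this model have different names. A full implementation would work on regular tree grammars.

```csharp
using System;
using System.Collections.Generic;

// One item of a heterogeneous sequence: an element name with bounds [Min-Max].
public sealed class Particle
{
    public readonly string Name;
    public readonly int Min;
    public readonly int Max;

    public Particle(string name, int min, int max) { Name = name; Min = min; Max = max; }
}

public static class ContentModel
{
    // Stands for the '*' upper bound of the grammar.
    public const int Unbounded = int.MaxValue;

    // Element content of the Book group: title, then author [1-*].
    public static readonly Particle[] Book =
    {
        new Particle("title", 1, 1),
        new Particle("author", 1, Unbounded)
    };

    // Greedy left-to-right match of child names against the particles.
    public static bool Matches(Particle[] model, IList<string> children)
    {
        int i = 0;
        foreach (Particle p in model)
        {
            int count = 0;
            while (i < children.Count && children[i] == p.Name && count < p.Max)
            {
                i++;
                count++;
            }
            if (count < p.Min) return false; // too few occurrences of this particle
        }
        return i == children.Count;          // no unexpected trailing children
    }

    public static void Main()
    {
        Console.WriteLine(Matches(Book, new[] { "title", "author", "author" })); // one title, two authors
        Console.WriteLine(Matches(Book, new[] { "title" }));                     // author list is empty
    }
}
```

In C#-xml this kind of check is intended to happen at compile time for literals and query results; the sketch performs it at run time only to show what the bounds mean.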

Structural Subtypes

The idea of the subtype relationship is that t is a subtype of t’ if t describes a subset of the possible values described by t’. For instance t is a subtype of {t; t}. We use the symbol