A Service Oriented Architecture for a Health Research Data Network

A Service Oriented Architecture for a Health Research Data Network Kerry Taylor, Christine O’Keefe, John Colton, Rohan Baxter, Ross Sparks, Uma Srinivasan, Mark Cameron, Laurent Lefort CSIRO ICT Centre GPO Box 664 Canberra, 2601 Australia [email protected] Abstract Sharing health data for research purposes across data custodian boundaries poses technical, organisational and ethical challenges. In beginning to address some of the technical challenges, we describe a service oriented architecture for a proposed Health Research Data Network (HRDN). The HRDN architecture supports services to manage data access and use by researchers in accordance with individual data custodian policies. The capabilities of the HRDN architecture are described using a layered service model. The four abstract layers from lowest level to the highest level are 1) Preparing, with services for managing the data collection process by data custodians; 2) Storing, with services enabling a data custodian to make data and metadata available as a network resource by means of standard interfaces; 3) Sharing, with services for resource discovery and integrated access to network resources, such as multiple data sources and software services such as transformations, and 4) Using, with analytical services for researchers. Two additional groups of services are interfaced with the services in each of the four layers. They are 1) Describing, with services for collecting and managing metadata, and 2) Protecting, with services for ensuring confidentiality and privacy protection, as well as services and tools implementing information security functions. In addition to these HRDN service groups, client-side applications are used by data custodians, service providers and researchers. For example, the Researcher’s Workbench allows a researcher to perform remote analyses on sensitive individual-level data with appropriate disclosure control before the release of results. The current implementation status is that reference implementations for about a third of the services have been completed and used to demonstrate a Researcher’s Work-

bench prototype over use cases.

1. Introduction The goal of this project is to provide the critical elements of a technology platform for a new research facility, called the Health Research Data Network (HRDN). The HRDN is a collection of software services, connected via a highbandwidth communications infrastructure and standardised interfaces, enabling use of the participating data collections by authorised participants. The overall HRDN goal is to minimise the time to insight for answering health and community sector research questions. Although rich and growing data collections already exist in Australia, the task of performing any kind of analysis across them is beset with technical difficulties, organisational and ethical constraints. For example, the data collections are geographically dispersed, are built on a variety of technologies and have differing data quality related to their varying primary purposes. Data custodians have ethical and legislative requirements for managing research use of their data. The HRDN design addresses the technical difficulties in information integration, while supporting complex organisational and ethical requirements arising from research on inter-organisational and inter-jurisdictional data collections. A further design goal is to support collaboration and knowledge reuse among HRDN participants. The two key HRDN features of custodial control over access and use of resources and privacy protection integrated into a secure end-to-end system for data sharing and analysis distinguish HRDN from other service oriented architectures for distributed data sharing and analysis. In HRDN, data custodians have the ability to specify access and use policies for each data collection: who may access it, what can be done with it, what sort of results can be

released, for what purpose and with what approvals, conditions and obligations. HRDN Infrastructure services for data integration, transformation, and analysis and results presentation interact with privacy services, which implement the data access and use policies of the data custodians within an overarching HRDN data access and use policy. The paper is structured as follows. In Section 3, the capabilities of the HRDN are described using a layered service model, with increasing capabilities offered by higher layers building on lower ones. This explains the scope and purpose of the network. Section 4 discusses the architectural principles used to guide the design of the component services in the HRDN. The network services and their interfaces to other network services or applications and their roles in delivering network capability are described in Section 5. As for any service-oriented architecture, there is a focus on establishing the services and describing the messages exchanged. The paper starts in Section 2 by using a Researcher’s Workbench example to motivate a key subset of the provided capabilities.

2. Using the HRDN 2.1 Researchers This section provides a motivating example showing the HRDN benefits for a researcher. It also motivates the scope of the HRDN capabilities described in the next section. A client-side application, called the Researcher’s Workbench, is used by a researcher on her desktop. The current implementation status of the Researcher’s Workbench is discussed in Section 6. In this example, we use the names of HRDN network services without defining them. Their definitions are provided in Section 5. Consider an epidemiologist, Emma, who is interested in studying the relationship between the post-operative survival times of colo-rectal cancer patients and their MYC gene amplification level [2]. Emma must first obtain HRDN membership through the Member Registration service. Emma then uses the Researcher’s Workbench to browse available resources, including the metadata descriptions of the available data collections. We assume Emma finds some useful resources and develops a detailed research plan. At this stage, Emma may need to apply to data custodians for specific access to their data for the purposes of her research and receive the data custodians’ individual access agreement(s) for signing via the Agreement Facilitator service. Emma’s project involves access to individuals’ medical data and so will require approval from one or more Human Research Ethics Commitees (HREC). If the data custodians are satisfied and the approvals are granted, Emma designs the MYC data resource (or model)

for her study using the Researcher’s Workbench. The model includes mappings between her purpose-built data resource and the selected existing data sources to be accessed. For example, source A may have details of the patient clinical care events, while source B contains the Births, Deaths and Marriages Register and source C is Emma’s own register of MYC amplification levels for patients. Obviously, Emma only requires a subset of data from each source for her project. The mappings describe the required subsets and any desired transformations (e.g. from date of birth to age). Emma uses the Exploratory Data Analysis (EDA) service tools in the Researcher’s Workbench to understand data coverage and quality issues, which may effect the validity of her intended analysis. The EDA service interacts with the Disclosure Risk Estimator and Disclosure Risk Mitigator to ensure presented results do not disclose confidential or personally identifiable information unless Emma has specific approval. Emma’s model requires linkage of the records for individuals from the three data sources. The workflow in the development of linking methods and any other aspects of the MYC data resource developed by Emma become part of her data model specification. The Researcher’s Workbench records Emma’s data model specification for any future audits, for results verification of results and for re-use by Emma and other authorised HRDN users. Eventually, Emma executes the resultant queries for her MYC data model. The Researcher’s Workbench interacts with the Planner and Orchestration services to plan and then schedule requests for data from sources A, B and C, linking and other transformation services. The resulting data will be stored remotely in a secure cache. The data custodians’ conditions of use constrain the invocation of HRDN services as Emma calls them to explore the cached data. Within the imposed constraints, Emma explores the distribution of her response variable (survival time) and the relationships between the response and potential explanatory variables, such as age, gender, co-morbidities and MYC level. All the while, disclosure risk assessments are made and confidentiality transforms invoked, before any results are presented to Emma. Emma will refine the MYC data model and her choice of analytical models. Version control will document iterations of the project workflow, to enable reuse of her analyses. Emma can choose to share her data models and project workflows with colleagues and the broader HRDN community.

2.2 Benefits and Costs The benefits and costs for Emma arise from the key consequence of using the HRDN: Emma never sees any confidential data unless approved and authorised to do so.

Space limitations prevent inclusion of a similar detailed example of HRDN use by data custodians. Briefly, we mention some benefits. Data custodians currently vet research proposals and, if they meet their data release policy requirements, they physically release the data to a researcher. In HRDN, the data custodian’s data release policy is present and enforced within the network, allowing an auditable trail of what is happening. In addition, confidentialisation and disclosure control services combined with remote analyses of the data may allow research questions to be answered with data that otherwise would be unavailable due to privacy concerns. Costs for providing HRDN infrastructure support could

3. Layer model Our HRDN layer model shown in Figure 1 divides the functionality of an HRDN network into layers of successively greater functionality, with the top layer closer to an end user’s need. Higher level functionality depends on the lower level functionality, however not all layers will be needed for all users. The HRDN layer model is motivated by the successful OSI ISO 7-layer model of networks.

Using

Protecting

2.3 Data Custodians

be amortised over the network participants. Currently, online access control and publishing is done on an individual ad hoc basis by the better resourced data custodians. Only a few data custodians, such as National Statistical Offices, are implementing remote data access and analysis laboratories.

Describing

Emma’s research plan, as submitted to the relevant Human Research Ethics Committees (HRECs) and data custodians, will emphasise the ability to answer the research questions without obtaining any confidental data as a part of the HRDN research methodology. We expect that this will allow wider data access approvals and faster research plan approvals than is currently available. Three examples of the benefits are now given. In some cases, data governance requirements, whether legislative or organisational, may enable remote access and use of confidentialised unit record files, with strict limits on the number of such records which can be downloaded [1]. HRDN provides the infrastructure for data custodians to do this. In other cases, data custodians are prepared to share the data for linking provided the identifying data is stripped before delivery to the researcher [11]. HRDN also provides the infrastructure to do this. The third example is where a data custodian can release data for research purposes, but has had negative experiences in the past of apparently bona fide researchers using the data for commercial purposes contrary to the data release agreement. The data custodian is then more reluctant to share data with genuine researchers like Emma. HRDN reduces the risks of abuse of data by requiring the remote use of the data. Emma’s actions on the data are checked for consistency with the release policy. Audit logs are available to confirm the uses. A disadvantage for Emma is the need to perform data cleaning, linking and the resulting analyses remotely. This requires a large change from the current practice of working in one’s own local workspace with one’s expertise in particular standalone analytical packages, such as SAS, Stata or SPLUS. This disadvantage will be reduced if familiar environments can be made available remotely. This can be enabled by allowing shared models of metadata resulting from the capture of data models and project workflows.

Sharing Storing

Preparing Figure 1. HDRN Architecture: Abstract Layer Model The Preparing layer is about collecting data by data custodians. The challenges and software services required for managing data collection are out of the scope of the present project. The Storing layer is about adding layers of software and standards around the fundamental data held by a data custodian to enable its contribution as a network resource. The Sharing layer addresses the capabilities beyond enabling access to individual data custodian data resources. Sharing services provide capabilities for integrating resources effected by the Storing layer. The resources include software services, such as transformations, linking, and analytical functions, as well as virtual views of data and shared workflows. The Using layer supports value-added interaction with the network and includes user-oriented services that interface with the lower layers. Some groups of services deliver functions to multiple layers. The Protecting and Describing service groups are

both presented as vertical bars rather than horizontal layers, because there are ‘protecting’ and ‘describing’ requirements at every horizontal layer of the network capability. Protecting services provide the functions for implementing information security and privacy protection, partitioned into two sub-groups.

Preparing

Preped

Storing

Raw

Prepared

Custodian Data

Data Custodian

Stored

Sharing

1. Information Security, which involves the protection of data from unauthorised access, modification and disclosure. In this sub-group, the content of the data and the purpose for which it is going to be used are irrelevant. The tools used are standard information tools such as encryption, PKI, digital signatures, and digital certificates. 2. Privacy Protection, which involves protecting the privacy of indviduals or, organisations. In this sub-group content and purpose are critical. The tools here include a metadata language for expressing data custodian policies for data access and use, and disclosure control processes. Describing services provide the functions for implementing metadata management. Metadata for these purposes is broadly interpreted as descriptive metadata that assist in discovering and applying resources, but does not include metadata for access rights and privacy protection since that is handled in the Protecting layer.

Sharing Rights

Data Metadata

Describing

Sharing Policies

Service Metadata

Data (Linking)

Protecting

Policies

Service Provider

Data(Sharing)

Linking Rights

Linking

Linking Service

Analysis Rights Linked Data

Using/ Analysis

Analysis Service Resource Metadata

Request

Response

Researcher

3.1 Relationship between Layers in the HRDN Layer Model A schematic representation of the relationship between the layers is shown in simplified form in Figure 2. The three square boxes represent the HRDN members: data custodians, researchers, and service providers. The two lowest layers, Preparing and Storing, interact only with the Data Custodian at the top of the figure. The highest layer, Using, interacts with the Researcher, but also the vertical Protecting and Describing services. A sub-layer within the Sharing layer is Linking. These are shown separately in the figure. The Sharing service publishes integrated data from one or more data custodians. It may use services from the Linking sublayer to do this. Outputs from both groups of services are provided to the Using layer.

Figure 2. Schematic Diagram of the main relationships betwen the layers in the HRDN layer model

At the heart of a service-oriented architecture is service composition planning and execution. Some architectural principles for the choice of services to aid service composition are 1. Services should be composable: Facilities are provided that enable users to build new composite services from pre-existing services, which may themselves be composite services.

4. Architectural Principles

2. Composition should be scalable: It should be as easy to create composite services from many existing services, as it is to create them from a few.

We now describe the architectural principles used to guide the HRDN architecture design, which is described in the next section. A service-oriented architecture requires decisions on the services to be provided and the messages to be exchanged.

3. Composition should be fast: There should be appropriate automated support for the process of building composite services, and for finding and examining existing service definitions as the building blocks of new ones.

4. Composition should be flexible: It should be straightforward to modify a composite service, either by changing the flow and ordering of events, or by substituting one service for another, where it makes sense to do so. The desired composition properties just described are greatly promoted by an appropriate design of Service types using the following principles. 1. Adopt a common, reference model for service types: Service capabilities, their interfaces, and the types used in message exchange, will be defined in a small number of uniform ways. Within this standard model, non-standard service instance-specific parameters may be supported through a universal parameter data type. 2. A standard set of service features: For example, a standard set of service features would make it (syntactically) possible to substitute one service for another in composite workflows. A common type model would encourage economical and flexible data flows in composite services, and minimise the need for prohibitive data conversion technology. 3. Canonical metadata: Canonical metadata is fundamental to supporting and attaining a semantically correct outcome for consumers of HRDN services. All HRDN services (whether they be custodian data services, analytical services or composed services) will provide canonical metadata. Finally, the services themselves should comply with available non-proprietary, open standards. The services will be made available through non-proprietary, standards-based interfaces that will enable greatest access to the services in the distributed environment of the Internet. Specifically, the ongoing development of the Web Services family of standards and related technologies such as Grid computing, are applied where appropriate. Relevant Web Service standards used currently include XML, SOAP, and WSDL. Standards that maybe used in future include the Web Service Conversation Language (WSCL) and BPEL4WS.

5. Architecture In order to realise the capability outlined in the layer model, the following basic services are proposed. They are grouped into layers according to their major role in the layer model. Table 1 outlines these basic services and their major role grouping. The provision of each of the services described in this section is subject to the user having been satisfactorily registered and authenticated, and having all the appropriate approvals and authorisations. In addition, the user may have

Layer Preparing Storing Sharing

Using

Protecting

Describing

Services Specific to data custodian organisation requirements and out of scope Specific to data custodian, but typically a commercial or open-source database engine Data Service; Orchestration Service; Cache Manager; Planner; Transformer service; Structure Registry. Linking sub-layer:: Standardisation; Indexing; Comparison; Decision Classifier; Evaluation. Surveillance Analysis Service; Exploratory Statistical Analysis Service; Statistical Model Building; Analytical Information Management. Member Registration; Security Token Validator; Action Approver; Conditions of Use Interpreter; Audit Logger; Audit Log Analyser; Security Token Issuer; Disclosure Risk Estimator; Disclosure Risk Mitigator; Session Initiator; Agreement Facilitator. Registry of Registries; Descriptive Metadata Service; Annotation Service; Metaflow Interpreter; Provenance Tracking.

Table 1. Basic HRDN Services had to sign an agreement with one or more individual data custodians, which may include agreeing to comply with further conditions and obligations, beyond those enforced by the network services. These functions are supported by services in the Protecting layer described in Section 5.5 below, where the interface between the two layers requires secure tokens to be included in the service requests. For brevity, we do not include all the details of the interactions between the services in the Sharing, Using and Describing layers and the Protecting layers in the descriptions of the services in this Section.

5.1. Preparing and Storing The Preparing and Storing layers are not described in detail. The specific services available are specific to the data custodian and are not determined by the HRDN architecture.

5.2. Sharing A Data service is the access point for data resources in the ooint is a good one. network. Primarily, a data service accepts requests for data framed as a query against a data schema, and returns the requested data as a single message. The request may be parameterised by quality indica-

tors that specify, for example, sample or perturbation options. The request may include a request for temporary data storage by the service on behalf of the client. The response data message is affixed with a small metadata summary that characterises the response by, for example, a timestamp and volume summary. Secondly, and optionally, a data service might also respond to requests for structural (schema) metadata and return that in a single message. A data service may invoke a security token validator, an action approver and an audit logger directly. An Orchestration Service is an infrastructure resource to manage multiple composite invocations of network resources. Its primary input is a workflow plan, which is represented as a graph of data-dependent service invocations. The orchestration service will manage the asynchronous invocation and dataflow of network services on behalf of the requesting client. It will produce a composite message, which is the final result of the workflow execution, and an execution trace. It will accept parameters to support workflow management, such as selecting testing modes and timeout behaviour properties. It will also respond to requests for progress reports on active workflows. A Planner accepts requests for data and data transformation services. Transformation services include linking and type transformations phrased as queries over a data integration schema. These queries are presented to the planner. The planner produces a workflow plan. A Cache Manager offers a persistent and secure data storage service on behalf of clients. It manages both data and associated metadata that have been generated through interaction with network resources. It accepts messages requesting data creation, deletion, retrieval and update. At all times, it operates in concert with the Action Approver service to ensure compliance with the custodian-specified conditions of use. It will be used together with the Statistical Model Building and Analytical Tools services to enable access to summaries or approved analyses of protected data. The Cache Manager accepts a client identification, using the Security Token Validator service to authenticate. The persistent data is not directly accessible to the authenticated client as it is effectively placed inside a security and privacy screen that ensures the conditions of use are applied. A Transformer Service provides a method for translating between data types, data representations and standard coding schemes. It accepts an input message containing input data and a description of that data, and provides an output message containing the transformed data and, optionally, metadata describing the statistical properties of the transformation. It may also accept run-time parameters to modulate the transformation behaviour. A Transformer Service is represented by a description in the Structure Registry service, and, optionally, may itself respond to requests for its description.

A Structure Registry service tracks formal metadata about data services and transformer services in the network. This information can be used by infrastructure services and application programs for automated matchmaking and service composition. Data services represented may include both primary resources, represented as services in the network, and virtual resources. Virtual resources are represented entirely as metadata in the registry, but they may be instantiated as virtual network services by network users. The structure registry may be informed about services by members of the network. The linking services group are a sub-layer of the Sharing layer. The linking services have been derived from the components of a process model for linking consisting of the following six steps: data selection; standardisation; indexing, also called blocking or clustering; comparison; decision modelling; and evaluation [4, 8]. The resulting linking services: Standardisation, Indexing, Decision Classifier, Comparison and Evaluation are specialist transformer services, and adopt the same service interface. Data selection is provided by a Data Service or a Cache Manager.

5.3. Using Analytical services operate in concert with the Disclosure Risk Estimator service and Disclosure Risk Mitigator service to ensure any specified confidentiality and privacy protection requirements are enforced. Data inputs are sourced from a Data service, or a Cache manager. The Analytical Information Management service takes data, control parameters and supporting data as input in order to generate metadata relating to the quality and fitnessfor-purpose of data. Some specialist functions provide a transformed version of the data. The Exploratory Data Analysis service produces a range of statistical summaries, in graphical form, given some input data and control parameters. The produces a summary representation of the data. The Surveillance Analysis service provides some specialist analysis functions appropriate for health surveillance. It takes data and control parameters as inputs and produces a graphic representation of trends together with summary information. This service requires the input of specialist parameters needed for the selected model style. The service produces a report comprising a description of the fitted model, although output may be modified depending on disclosure risk . The Statistical Model Building service provides ways for developing functional equations that describe how the response can be ‘best’ predicted from a set of explanatory variables. For example, the response may be the probability of a person contracting lung cancer given explanatory

variables such as their age, gender, occupation, past smoking behaviour and other important lifestyle determinants. This service provides ways of fitting functional equations for both continuous and discrete response variables. The service needs the distribution of the response, the functional form of the model, and nature of the explanatory variables to be specified. This specifies how the functional equation is estimated. After the service fits the equation, it provides graphical plots for assessing the goodness-of-fit and validating certain assumptions used for fitting these equations. The statistical significance of each explanatory variable in the equation and analysis of variance tables are also provided.

5.4. Describing The Describing services can all be modelled as specialised data services in terms of their interface, but they perform a very different role in the network, according to the semantics of the data they store. Unlike data services, they are not themselves described in the Structure Registry. A Descriptive Metadata service delivers descriptive metadata about network resources. The descriptive metadata are viewed by users, so that the applicability of data resources to user needs are assessed. The service accepts requests for metadata phrased over a metadata schema, and returns a metadata description in one of a number of standardised schema formats. A single metadata schema is used for each metadata service. Metadata services describe both data resources and algorithmic services within the network, such as analysis and linkage services. A Registry-of-Registries (RoR) service is designed to respond to user-level interactive queries over metadata resources in the network. It handles interoperability over Descriptive Metadata services. It conforms to a request/response interface pattern and so can be used interactively via a web browser client. A Metaflow Interpreter service implements reasoning over formal metadata descriptions generated by Descriptive Metadata services in the network. Given a set of metadata documents as input, together with a specification, it produces a metadata document that corresponds to the application of the specification over the inputs. It is a specialist instance of a Transformer Service, operating over metadata documents. An Annotation service supports informal markup of HRDN resources by the user community. It accepts input of annotation remarks associated with content references; and, given a context reference, can respond with the relevant annotation remarks. It may be used when retrieving data or metadata to provide an extra level of descriptive metadata that reflects community opinions or practice. A Provenance Tracking service provides a lineage track-

ing service that helps to retrace the path of discovery and causality, by providing information about which processing intermediaries have touched the data during the course of its discovery. It uses descriptive metadata service, annotation service and with the help of the metaflow interpreter provides formal data provenance records for replay and analysis. Table 2 summarises the metadata services using the capability layer model. Abstract Layer Preparing Storing Sharing Using

Functionality

Metadata service

Recording what happened Storing and describing

Descriptive Metadata service. Descriptive Metadata service. Annotation Service RoR, Metaflow Interpreter Metaflow Interpreter, Provenance tracking

Track information flow Fitness for purpose of resources, Discovery of available resources, Relationships, such as derivations, versioning, between resources.

Table 2. Describing or metadata services summarised using the capability layer model.

5.5. Protecting The security and privacy services assume deployment of all network services over a secure transport layer, such as SSL. They also assume that network services are securely deployed and protected from interference through interfaces other than the specified service interface. A Member Registration service accepts HRDN member descriptions, and optionally policy statements and purpose statements, and responds with authentication tokens designed for authentication in subsequent interaction with network resources. A Member Registration service may use a Security Token Issuer service to do this. A Security Token Issuer service generates security tokens on request, given descriptive information to be associated with the token. These tokens become optional inputs to all other network services where authentication and access control are required.

These services may use a Security Token Validator service for authentication, which accepts a token and a request for the service and produces a security token or a validation message. A Session Initiator service enables a registered member to log in to the network, authenticate their claim of identity and initiate an HRDN network session. A Session Initiator service accepts a claim of identity and security token, such as username and password, and a statement of session purpose. It then creates a session token which encodes the member identity, security token, member details, session purpose and session ID. The session token becomes an input to all other network services where authentication and access control are required. An Agreement Facilitator service oversees, validates and records a user’s acceptance of a data custodian’s access agreement, including conditions and obligations. The service takes a user’s identity claim, security token and request details and consults the data custodian’s conditions of use to generate an agreement which the user must ‘sign’ before the request can proceed. Network services may use an Action Approver, to determine whether they should respond to a service request. Here an ‘action’ is either access to a single dataset, linking of two or more datasets, analysis of a linked dataset or delivery of a data product. An Action Approver takes a service request, a session token and the relevant statement of conditions of use and determines whether the service may be offered in accord with the conditions of use. As data are manipulated in the HRDN, the conditions of use need to be adjusted according to rules. For example the conditions of use of a linked dataset must include all the conditions of use of the component datasets. This is needed to resolve any conflicts arising from combining conditions of use. A Conditions of Use Interpreter service takes the conditions of use of datasets and an action description and produces conditions of use for the new dataset arising from the given action. A Disclosure Risk Estimator estimates the risk of identification or a disclosure occurring. A Disclosure Risk Mitigator applies statistical disclosure control techniques to reduce the risk of identification or of a disclosure in a data product before release to a researcher. The aim is to reduce the risk to an acceptable level with a minimum of information loss. An Audit Logger service maintains an audit trail at the request of other network services. The Audit Log Analyser conducts analysis and data mining of the audit logs in order to verify compliance with network conditions.

6. Implementation Status Elements of the HRDN architecture and the Researcher’s Workbench are currently under active development using Web Services standards. Within the Sharing layer, we have: an implementation of a Data Service wrapper enabling data custodians to mount and share datasets; a basic Orchestration service enabling execution of a workflow; a Planning service enabling automatic generation of workflows in response to user queries; and a Structure Registry enabling static documentation of services and their interfaces. Within the Linking sub-layer of the Sharing layer, we have developed Standardisation, Indexing, Comparison, and Classification services. The Using layer currently has Exploratory Data Analysis tools and model builders for survival analysis. These are implemented in an Rweb [3] web server environment rather than through WSDL interfaces. These services and data resources are made available through the prototype Researcher Workbench. Figure 3 shows the workbench layout for interactive graphical query development. In the figure, network and local data resources are presented on the left-hand side, while a query under development is displayed in the graphical pane on the right-hand side. On completion of query development, a researcher can request materialization, during which the workbench converses with the Planning and Orchestration services to develop and execute a workflow. In addition, the Researcher’s Workbench has an interface to the Linking Services and a subset of analytical methods available. The Protecting and Describing services are the next implementation focus. These services are needed before the Custodian Workbench application can be developed as the access point for data collection publishing and access management by data custodians.

7. Related Work The proposed Health Research Data Network requires multidisciplinary research and our work is related to other work in many areas. In the following we highlight some of the key relationships. Our service oriented architecture is aligned with generic architectures for Web Services systems such as the abstract pyramid model of [12]. In the terminology of that model, we do not address top-level management services for critical applications, as these facilities are not essential for a research network, but may be adopted in time as they are developed for other purposes. We do, however, provide the top-level support for open service marketplaces, although in a scientific rather than a commercial setting. Our protecting services certainly fit into this layer of the pyramid, although our placement of these services in a vertical bar

Figure 3. Screenshot of Query Developer screen in the Researcher’s Workbench recognises both their essential role throughout the network and their value as services supporting lower-level services. The pyramid’s middle service composition layer is similar to our sharing layer, enabling service aggregation and valueadded services–in our case these are knowledge-rich virtual data sets. To the pyramid model, we add a range of describing services for management and delivery of descriptive metadata, that are essential for an information-rich research network. Our using layer is also an addition. We could alternatively view these services as basic, at the bottom of the pyramid, but this would not take account of their value-adding role over the services providing basic data access. Our overall goals are similar to those of the Clinical E-Science Framework (CLEF) [10], a research project using Grid Services for sharing clinical data for research purposes. We share a strong concern for privacy, but are aiming for a broader network participation, and are therefore designing infrastructure for distributed data integration and distributed custodial control of confidentiality and privacy protection. Our distributed service architecture resembles that of myGRID [13], with which we share the ideas of a workbench client application, distributed query processing and workflow enactment, along with Web-standard wrapped ba-

sic services. However our services are designed more directly to meet the needs of the statistically-savvy researcher, and to implement strong privacy protection. Turning to our design for describing services, our architecture is strongly influenced by developments in shared statistical metadata such as arising from the FASTER project and its precursors [9]. We aim to develop facilities for automatic generation of provenance metadata drawing on both a statistical theory approach such as [7] and a data processing tracking approach such as [14]. Distributed multi-party privacy protection protocols have been recently proposed for the linkage of clinical data at an individual level [11, 6]. A trusted third party is required to provide these linkage services. Our architecture supports the implementation of these protocols in a form more readily available to researchers. Currently, we have developed linking services based on the algorithms implemented in the Febrl package[5]. Remote access data laboratories offering on-line access to Confidentialised Unit Record Files can also be mapped onto our service oriented architecture [1]. In contrast, we are proposing to enable linking and analysis of raw datasets in a secure environment with strict authorisation procedures and confidentialisation of released data products.

8. Conclusions We have proposed a service-oriented Health Research Data Network architecture. The capability and scope of the architecture was described using a four horizontal layer model with two vertical cross-cutting layers: Protecting and Describing. The architectural principles used to guide the design of basic services and interfaces were provided. Services need to be designed with ease of composition and reuse in mind. The unique contribution of HRDN is its end-to-end (from data custodian to researcher) application of custodian specified data access and use conditions, as well as confidentialisation and privacy protection. Description of the specific details of the component technologies used in HRDN services, such as the planner for service composition, and distributed linkage algorithms, are outside the scope of this paper. There are some crucial tests to be passed before HRDN could be deployed beyond its prototype development. One test would be to gain endorsement and adoption by important data custodians. This would depend on expressiveness of the custodian conditions of use policy language and the privacy/security robustness throughout the HRDN. Adequate testing and communication of confidence in the testing to data custodians is a significant hurdle, for sensitive data collections at least. Another test for the HRDN is acceptance by researchers. Adoption of the HRDN requires a change in existing work practices, from local data storage, management and software tools to their network equivalents. Researcher acceptance is perhaps a sociological issue as much as a technical one. We are using the prototype Researcher’s Workbench to test these user adoption issues. The HRDN architecture offers the potential for realising increased efficient data sharing for research purposes while offering enhanced confidentiality and privacy protection.

9 Acknowledgments We thank Dr Dave Abel and Dr Simon Hawkins for helpful discussions. We thank Mike Kearney, Deanne Vickers, Lifang Gu, Gavin Walker, Bella Robinson, Catherine Daly, and John Donnelly for their assistance with architecture ideas and implementation.

References [1] ABS. Remote access data laboratory, www.abs.gov.au, 2003. [2] L. H. Augenlicht, S. Wadler, G. Corner, C. Richards, L. Ryan, A. S. Multani, S. Pathak, A. Benson, D. Haller,

[3] [4]

[5] [6]

[7] [8] [9] [10]

[11] [12] [13]

[14]

, and B. G. Heerdt. Low-Level c-myc Amplification in Human Colonic Carcinoma Cell Lines and Tumors: A Frequent, p53-independent Mutation Associated with Improved Outcome in a Randomized Multi-institutional Trial. Cancer Research, 57:1769–1775, 1997. P. Banfield. Rweb, http://www.math.montana.edu/Rweb/. R. Baxter, P. Christen, and T. Churches. A Comparison of fast blocking methods for record linkage. In Proc. of ACM SIGKDD’03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pages 25–27, Washington, DC, USA, August 2003. P. Christen and T. Churches. Febrl: Freely extensible biomedical record linkage Manual, release 0.2 edition, April 2003. T. Churches. A proposed architecture and method of operation for improving the protection of privacy and confidentiality in disease registers. BMC Medical Research Methodology, 3(1):1–13, 2003. M. Denk and K. Froeschl. The idaresa data mediation architecture for statistical aggregates. Research in Official Statistics, 3(1):7–38, 2000. M. Elfeky, V. Verykios, and A. Elmagarmid. TAILOR: A Record Linkage Toolbox. In Proc. of the 18th Int. Conf. on Data Engineering. IEEE, 2002. FASTER. http://www.faster-data.org. D. Kalra, P. Singleton, D. Ingram, J. Milan, J. MacKay, D. Detmer, and A. Rector. Security and confidentiality approach for the clinical e-science framework (CLEF). In Proceedings of UK e-Science All Hands Meeting 2003, volume 83, Nottingham, UK, September 2003. C. W. Kelman, A. J. Bass, and C. D. J. Holman. Research use of linked health data - a best practice protocol. Aust. N.Z. J. Public Health, 26:251–5, 2002. M. Papazoglou and D. Georgakapoulos. Service oriented computing. Communications of the ACM, 46(10):24–28, October 2003. C. Wroe, R. Stevens, C. Goble, A. Roberts, and M. Greenwood. A Suite of DAML+OIL Ontologies to Describe Bioinformatics Web Services and Data. Int. J. of Cooperative Information Systems, 12(2):197–224, 2003. J. Zhao, C. Goble, M. Greenwood, C. Wroe, and R. Stevens. Annotating, linking and browsing provenance logs for eScience. In Proceedings of the Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, volume 83, Florida USA, October 2003. CEUR Workshop Proceedings.