Proceedings of the 38th Hawaii International Conference on System Sciences - 2005

A Web Services Application for the Data Quality Management in the B2B Networked Environment

G. Shankaranarayanan and Yu Cai
Boston University School of Management
{gshankar, ycai}@bu.edu

Abstract

Characteristics of Web services such as interoperability and platform independence make Web services a promising technology for managing data quality effectively in inter-organizational information exchanges. In this paper, we describe an application of Web services for managing data quality in B2B information exchange, which is typically characterized by large volumes of information, widely distributed data sources, and frequent information interchanges. In such environments, it is important that organizations are able to evaluate the quality of the information they obtain from other organizations. We propose a framework for managing data quality in inter-organizational settings using the information product approach. We highlight the requirements for data quality management and the developing Web service standards to show why Web services offer a unique yet simple platform for managing data quality in inter-organizational settings.

1. Introduction

Large volumes and frequent information exchanges, as well as a wide variety of data sources, characterize Business-to-Business (B2B) electronic commerce environments. In organizational decision-making, relevant data gathered from outside organizational boundaries (e.g., the upstream and downstream inventory levels in a supply chain) plays an important role. Organizations appear to implicitly assume that the information/data (though we recognize the difference between data and information, in this paper we use the terms interchangeably; both are equally applicable for describing the concepts and technology presented here) obtained from other organizations is of good quality. Research indicates that poor data quality (DQ) is a serious problem within many organizations [5, 13, 17, 22]. It is estimated that the cost of data quality problems is 8-25% of an organization's revenue (or budget in non-profit organizations) [9, 18]. Moreover, addressing problems caused by poor data quality could constitute 40-60% of the expenses of a service organization or waste 40-50% of an organization's IT budget. Based on a data quality survey conducted in 2001, the Data Warehousing Institute (TDWI) estimates that poor quality customer data alone costs U.S. businesses more than $600 billion a year [8]. Although the data quality problem costs billions of dollars each year, the survey found that only 11 percent of companies had implemented data quality management programs and 48 percent of all companies still had no plans for managing data quality at that time.

Given that the quality of data within most organizations is still far from satisfactory, how can organizations be sure that the quality of data obtained from other organizations is acceptable? The same TDWI data quality survey also states that a major source of data quality problems is external data. Inter-organizational information exchange is an inseparable part of a B2B relationship. It is important to assure organizations of the quality of data they receive from other organizations. A prerequisite is that organizations must first manage data quality internally. But that is not enough. In the B2B environment, organizations use data received from another organization as inputs (either directly or after processing) to their business operations and decision-making. Data quality is therefore neither a local nor an isolated issue.

To manage and evaluate data quality in an information exchange, the following requirements must be met. First, we need a standard based on which organizations can develop, implement, and understand data quality. This would help an organization understand the implications of the data quality measures associated with data it receives from other organizations. Second, we need a method for evaluating the quality of data exchanged by organizations. Further, an organization needs to not only evaluate the quality of the data it generates but also understand how the quality of that data is impacted by the quality of data received from other organizations. In other words, it must be able to integrate its own data quality information with that received from other organizations. Besides exchanging data, organizations must also be able to exchange the data quality information (metadata) associated with the exchanged data. They must be able to integrate this data
quality information with their own relevant data quality information and manage data quality in an integrated fashion. Finally, as a B2B information exchange is characterized by multiple, different stakeholders, we must permit each stakeholder to evaluate data quality independently and on demand.

The Electronic Data Interchange (EDI) system has been used in conventional B2B transactions to reduce error and increase efficiency. However, EDI is a closed network system limited to a proprietary membership and is usually very expensive for small and medium enterprises to implement [7]. In B2B partnerships, players interact with numerous different business partners. Interoperability issues make it difficult to integrate all business information exchanges using platform-dependent EDI solutions. The B2B networked e-commerce environment requires a lightweight and open architecture that permits each business partner to exchange information conveniently and economically. XML (Extensible Markup Language), as a platform-independent language, offers such a solution for B2B information exchange. Web services, using XML, offer the ability to not only exchange information but also add functionality that manipulates this information in some desired manner. Web services can greatly facilitate communication and interoperation among heterogeneous and distributed systems and are a promising technology for addressing information exchange and integration in the B2B environment.

In this paper, we develop a technical architecture using Web services for data quality management in the B2B environment. This research offers three key contributions. First, we outline a theoretical model for data quality management in the B2B environment. We propose "DQ 9000" as a data quality standard for managing data quality in B2B information exchange. This standard is based on the Information Product (IP) approach, which views information as a product. The IP is the information deliverable that corresponds to the specific requirements of end users/consumers [23]. Examples of an IP include invoices and business reports. Second, to facilitate a common and consistent interpretation of data quality, we develop an XML-based data quality specification for exchanging data quality metadata (data about the data exchanged by organizations). We develop this detailed data quality metadata specification based on the Information Product Map (IPMAP) [19, 20]. The IPMAP, corresponding to some IP, is a graphical model that represents the manufacture of that IP. It shows the different stages involved in creating the information output (or IP), including the data elements associated with each stage as well as the flow of data elements between stages. Each manufacturing stage in the IPMAP is supplemented with metadata describing the processing, associated business rules, the system used for processing, the individual or role responsible, and the department/business unit responsible for that stage. In addition, metadata necessary for evaluating data quality (such as processing time, estimated accuracy, and estimated completeness) is also captured for each stage. For each IP, the metadata specified in its corresponding IPMAP, including the structure of the IPMAP, is defined using XML. Organizations can use this IP-XML associated with information products to manage data quality in an integrated fashion. Third, to implement the data quality management framework, we describe a technical architecture of a Web service for data quality management across organizational/system boundaries.

The remainder of this paper is organized as follows. Section 2 presents an overview of the relevant literature. Section 3 describes the data quality management standard "DQ 9000" and IP-XML for the B2B networked environment. Section 4 explains the Web service implementation of the data quality framework. A discussion of the benefits and drawbacks is presented in Section 5.

2. Relevant literature

Although the data quality literature proposes several different techniques for data quality management, none has addressed the issue of managing data quality in inter-organizational settings. Further, no research has examined data quality evaluation and management as a Web service offering. Several characteristics of data quality evaluation, as well as characteristics of Web services and Web service standards, make this marriage not only compatible but also unique.

2.1. Information product approach

The foundation of the proposed data quality management model is the Information Product (IP) approach. The IP approach takes both data and data processing into consideration to systematically evaluate and manage data quality in information systems [22, 23]. Using the IP approach, we can adopt total quality management (TQM) techniques successfully applied in physical product manufacture to manage data quality. From the IP perspective, the data/information exchanged across organizational boundaries can be viewed as information products created in one organization and consumed by another. The design of the proposed data quality management model is general enough to address inter-organizational data/information exchanges. A special case of information product exchange in the B2B environment is the purchase of information goods in dedicated marketplaces; the model is also applicable for facilitating transactions in marketplaces dedicated to information goods.


To exploit these properties of the IP and to manage data quality using the IP approach, mechanisms for systematically representing the manufacturing stages and for evaluating data quality at each stage are essential. This paper uses the IPMAP, a modeling scheme that permits explicit representation of the manufacture of an IP [19]. Adopting the IPMAP offers two benefits for managing data quality in inter-organizational settings. First, it offers an intuitive way of integrating one or more information products received from external organizations with the manufacture of an information product created within the organization, very similar to physical components/subassemblies purchased from suppliers and incorporated into products that a company manufactures. Second, techniques for evaluating data quality in a single information system have been described; these techniques can be adapted for evaluating data quality in inter-organizational settings that typically span multiple systems.

Data quality is evaluated along quality dimensions such as timeliness, completeness, accuracy, and relevance, each requiring a different technique and different metadata for evaluation. Ballou et al. have shown how timeliness can be specified using currency and age for the raw data units (at the source blocks) and consequently computed at the processing and quality blocks [5]. Accuracy can be computed using a similar logic [5, 20]. The raw data units that come in from data source blocks are assigned an accuracy value by the provider or by the decision-maker. The value assigned is between 0 and 1, with 1 indicating a very accurate value. A processing block may combine raw and/or component data units to create a different component data unit. The accuracy of the output data element in a processing block depends on the processing performed. The formula proposed is based on a generic process that combines multiple data elements to create an output; it does not take into account the type of processing performed and ignores the error (in accuracy) that might be introduced by the process itself.

The type of process that generates an output also affects its completeness [4]. The completeness of a simple (raw or unprocessed) data element is 1 if the data element has an assigned value and 0 if its value is missing or null. For a process that collates inputs (a simple process) to generate the output, the completeness of the output is the weighted average of the completeness measures of the inputs. For more complex mathematical operations that create a new value (output) from the inputs, the product of the completeness measures of the inputs provides a statistical estimate of the completeness of the output (when the inputs are large data sets and the distributions of missing values are not correlated). Further, an output can have missing values and yet be perceived as complete by the decision-maker if it includes all the values necessary for the decision task the output is used for. To accommodate such contextual evaluations, for simple processes the completeness of the output is the weighted average of the completeness of the inputs, with the weights assigned based on the relevance of each input to the decision task.

The nature of the Web service makes it possible to offer each of these evaluation methods as an independent utility. Each can perform not only the evaluation but also seek and retrieve the relevant metadata from sources that span organizational boundaries.
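To make these completeness estimates concrete, they can be written as follows (a sketch in our own notation: C_i is the completeness of input i and w_i its relevance weight; the product form assumes large input sets with uncorrelated missing values, as stated above):

\[ C_{out} = \sum_{i=1}^{n} w_i \, C_i , \qquad \sum_{i=1}^{n} w_i = 1 \quad \text{(simple collating process)} \]

\[ C_{out} \approx \prod_{i=1}^{n} C_i \quad \text{(complex operation creating a new value from its inputs)} \]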

2.2. Web services overview

In this paper we adopt the definition of a Web service offered by the W3C Web Services Architecture Working Group. A Web service is "a software application identified by a URI, whose interfaces and bindings are capable of being defined, described, and discovered as XML artifacts. A Web service supports direct interactions with other software agents using XML-based messages exchanged via Internet-based protocols." [25]. Based on this definition, a Web service is described using the Web Services Description Language (WSDL) and uses the Simple Object Access Protocol (SOAP) as the protocol for information exchange [10].

The major benefit of Web services is increased interoperability of applications across heterogeneous and distributed platforms, systems, and applications. Web services support the modular design of components that can be rearranged to suit requirements [2, 10]. A Web service has specific functionality, reusability, and high interoperability with other applications. It employs a loosely coupled approach that separates the service interfaces from specific implementations. The Web service interface is written in WSDL, which includes all the information necessary to invoke the Web service and transfer the information; a WSDL document describes the details of a Web service's functions and its interface. SOAP is a lightweight, XML-based protocol intended to replace many complicated distributed application technologies. It is the major transport mechanism for Web services and has become a W3C specification [2]. With WSDL and SOAP, different implementations can "talk" the same language to each other, and Web services are easily made "plug and play". Another important component of Web services is a registry where WSDL documents are published so that potential service requestors can find the Web service. Universal Description, Discovery, and Integration (UDDI) is a specification developed to define such a registry.

A general Web services architecture works as follows. The Web service provider publishes the service's WSDL document in the UDDI registry. The service requestor finds the service in the registry. After the service requestor obtains the WSDL document, which provides detailed information on how to access the Web service, it can communicate
with the Web service directly using SOAP messages [12]. We adopt this architecture for our implementation.

Web services have many characteristics that make them suitable for data quality management in the inter-organizational data exchange environment. First, Web services provide dynamic interoperability. In the networked B2B environment, companies establish ties with a number of business partners, and each partner employs proprietary platforms/systems. The platform independence of Web services enables dynamic and automatic interaction between heterogeneous and distributed information systems for data exchange. With WSDL and UDDI, Web services can automatically exchange data using SOAP, greatly reducing the cost and effort of finding where the data sources are and of negotiating data transfer formats/protocols. Second, Web services are conveniently accessible on demand over the Internet. Any organization or client can request the Web service from any platform, anywhere, at any time, as long as it can access the Internet [1, 14]. This allows data quality management to be implemented by various mobile devices in decentralized and distributed environments. Third, Web services are based on industry standards and specifications. The messaging, data exchange, interface description, and discovery of services are defined by the universally agreed-upon SOAP, WSDL, and UDDI. Using Web services means that all parties in a data exchange benefit from simple and efficient standard solutions. The Web service is easy to "plug and play" and to extend with new functionality.

More specifically, an architecture for data quality that employs Web services can benefit from evolving Web service standards. For instance, the OMG XML Metadata Interchange (XMI) specification can be adopted to define and exchange data quality metadata in XML. BEA, IBM, Microsoft, and SAP are developing a specification for Web Services Metadata Exchange (WS-MetadataExchange). Although it concerns the exchange of Web services endpoint metadata, it greatly benefits data quality management: it facilitates tracking the source of metadata besides the associated data and metadata, all of which are important for managing data quality. The OASIS Business Process Execution Language for Web Services (BPEL4WS) specification provides a language for the formal specification of business processes and business interaction protocols. It defines how remote services communicate, interact, and interoperate. BPEL4WS will facilitate automated process integration in both intra- and inter-organizational environments, and the specification offers a reference for the automatic and seamless integration of process metadata to manage data quality in an integrated manner in inter-organizational settings. Security is an important issue in B2B data exchange. The recently announced OASIS standard, Web Services Security (WSS) version 1.0 (WS-Security 2004), provides solutions for securing and ensuring the integrity of SOAP messages. This will benefit the exchange of data on information manufacture and usage across organizations.

Finally, Web services permit unbundling the different components of data quality evaluation: creating an IPMAP, visualizing an existing IPMAP, and evaluating quality at each stage along a given dimension. Each can be created as a separate component of the Web service and provided to the user in any order and on demand. Orchestrating the interplay between these components can be managed by the Web service.
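As a concrete illustration of this provider-registry-requestor interaction, the following sketch shows how a requestor might invoke a data quality Web service with a SOAP message using the standard Java SAAJ API (javax.xml.soap). The endpoint URL, namespace, and operation name are hypothetical placeholders, not part of any published interface.

import javax.xml.namespace.QName;
import javax.xml.soap.*;
import java.net.URL;

public class SoapInvocationSketch {
    public static void main(String[] args) throws Exception {
        // Build a SOAP request for a hypothetical "evaluateQuality" operation.
        MessageFactory messageFactory = MessageFactory.newInstance();
        SOAPMessage request = messageFactory.createMessage();
        SOAPBody body = request.getSOAPBody();
        QName operation = new QName("http://example.org/dqm", "evaluateQuality", "dqm");
        SOAPBodyElement call = body.addBodyElement(operation);
        call.addChildElement("informationProductId").addTextNode("IPa");
        request.saveChanges();

        // Send the request to the (assumed) service endpoint and print the reply.
        SOAPConnection connection = SOAPConnectionFactory.newInstance().createConnection();
        SOAPMessage response = connection.call(request, new URL("http://example.org/dqm/service"));
        response.writeTo(System.out);
        connection.close();
    }
}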

3. The data quality management framework

In the B2B networked environment, players interact with numerous different business partners. Information is processed in each organization's own proprietary information systems before it is exchanged across organizational boundaries. Without a common understanding of data quality management across organizations, it is difficult to assess the quality of information exchanged in B2B activities. The data quality standard for the B2B networked environment must address three requirements:
1. Organizations manage data quality internally. (They must know how to speak.)
2. Organizations use the IPMAP as an instrument and a similar set of metrics to evaluate each data quality dimension. (They speak the same language.)
3. In the B2B data exchange relationship, players agree to share the relevant IPMAP metadata (which includes the data quality data) associated with the exchanged data. (They talk to each other and understand each other.)

Based on the nature of the information exchanged in the B2B networked environment, we propose the following two-part data quality management framework:
1. A general data quality management standard, "DQ 9000".
2. A detailed data quality specification for information product metadata, IP-XML.

Built upon the IP approach, the data quality management framework adopts well-accepted concepts from physical product quality management practice. In the following example, we illustrate the basic concepts of the framework by examining a physical product's quality. To protect against infection by the Severe Acute Respiratory Syndrome (SARS) virus, a high-quality mask is required. But how do we ensure that a specific mask is adequate? As consumers, we can interpret product quality from different perspectives. For instance, in a mask's product description, we find the following information: the mask is manufactured by a company
complying with the ISO 9000 standards and is certified by the National Institute for Occupational Safety and Health (NIOSH) as a type N95 respirator. After further research, we find that the "N95" specification means the mask is capable of filtering at least 95% of 0.3 µm particles (non-oil-containing dusts, fumes, and mists) and can last for 4 hours in regular use. The ISO 9000 compliance information tells us that the manufacturer has implemented a set of well-established quality management standards to control product quality, which implies the mask comes out of a reasonably good production process. This is analogous to the general data quality standard "DQ 9000". The information about the technical specification tells us the detailed quality standard the product meets. This is analogous to the detailed data quality specification in the information product metadata associated with its corresponding IPMAP. As long as we have the above information (and assuming the information is authentic), we can understand the quality of that product. Following the same logic, we describe a data quality management framework for information exchange in the B2B networked environment.

3.1. "DQ 9000" quality standard

The ISO 9000 family is a set of well-established, high-level, generic, and auditable standards that specifies quality management system requirements [16]. ISO 9000 provides a conceptual framework for improving the quality of products and services in a disciplined way. Organizations typically use ISO 9000 standards to regulate their internal quality (ensuring control, repeatability, and good documentation of processes) and to assure the quality of their suppliers. DQ 9000 is a generic standard for data quality management for all data creating/consuming processes. The key elements of DQ 9000 are 1) process description; 2) record keeping; 3) administrative responsibility; and 4) a dedicated data quality assurance unit.

DQ 9000 provides guidelines to precisely document the whole information product manufacturing process. It requires a detailed description of each information-processing stage, including data collection, data manipulation, and data storage. Organizations need to keep accurate records on the source of data, the location of data, data attributes, the quality of data along multiple data quality dimensions such as accuracy, timeliness, and completeness, data processing specifications, data use, the person/role responsible for data processing/storage, and the time when these activities occur. DQ 9000 also outlines the administrative responsibility for data quality management. It calls for organizations to develop and implement high-level data quality policies, explicit data quality documentation, and comprehensive data quality operating procedures/working instructions (e.g., what should be done with missing values; how regularly the data repository should be updated). After clearly defining how information is processed in all settings, organizations must make sure that they strictly follow those rules in practice. DQ 9000 also requires organizations to establish a dedicated data quality assurance unit to oversee data quality management.

DQ 9000 is intentionally designed to be general enough to fit all kinds of organizations. Under DQ 9000 guidelines, organizations can choose various data quality management tools/techniques to meet their specific requirements. Details of this standard have been left out for brevity. We believe that developing and implementing a generic DQ 9000 quality management standard will systematically improve the data quality management process within organizations. When most organizations implement DQ 9000, the collective quality of all IPs will increase. As a result, the quality of overall information exchanges in the B2B networked environment will improve. This addresses our first requirement: organizations manage data quality internally.

3.2. The data quality metadata specification: IP-XML

The IPMAP is a graphical model that represents the manufacture of an information product [19]. It permits the systematic representation of the distribution of the data sources, the flow of data elements, and the sequence by which these data elements are processed to create the required IPs within and across organization and information system boundaries. Another important feature of the IPMAP is the metadata associated with each construct, which includes detailed information such as the flow of data through constructs, the elements of data, and the data manipulations performed. The IPMAP, as a platform-independent method, can be implemented in any organization for data quality management purposes. Moreover, IPMAP metadata can be shared with other organizations to provide a comprehensive understanding of the quality of data exchanged across B2B boundaries. An organization must adopt the IP approach for managing data quality and use the IPMAP to define the metadata for each IP it creates and exchanges. Each organization should be responsible for creating and sharing the IPMAP metadata (which includes data quality information) associated with each IP. Each organization is therefore the data creator and custodian of its IPMAP metadata. When exchanging an IP, the receiver or IP-consumer is responsible for initiating the request for, and using, the metadata associated with the IP it received.


[Figure 1. The integrated IPMAP across firm boundary. The diagram is omitted here; its blocks include the Sales source, Capture sales detail, Sales database, Upload sales, Administrative Data Repository, Collect external data, Create demand forecast, and the Operations Manager as consumer, together with each customer's local inventory source, Enter inventory detail, Upload downstream inventory, Transfer, and the Firm Boundary block.]
Figure 1 illustrates a sample IPMAP for the manufacture of a demand forecast (the IP) and shows the constructs of an IPMAP. To do demand forecasting, a firm (referred to as Firm-A) ideally needs both past sales data and its downstream customers' inventory level data. In Figure 1, "Sales" is a data source block. It represents the source of sales data that is local to Firm-A. The sales data is processed (shown by the processing block "Capture sales detail") and then stored in the data storage block "Sales database". It is then extracted and loaded into the administrative repository, where it awaits further processing. Firm-A has many customers who provide the downstream inventory data. In Figure 1, the shaded parts show the independent processing performed by each customer to generate its inventory detail (another IP). Several such IPs, each generated by a different customer, are used by Firm-A to generate its own IP, namely the demand forecast. Each external inventory data set is also loaded into the administrative repository as and when it becomes available. All of this data (sales and inventory data) is extracted from the administrative repository by a process that generates the demand forecast (Firm-A's IP). An operations manager (shown as the data consumer in Figure 1) uses this IP for decision-making.

If all organizations use the IPMAP to manage data quality and the downstream customers agree to share their inventory level data along with the corresponding IPMAP metadata, Firm-A can integrate its downstream customers' IPMAPs with its local IPMAP (linked at the "firm boundary" block in Figure 1). The boundary block in an IPMAP represents an inter-organizational (or inter-departmental) data exchange. The example illustrates how an organization can integrate data quality metadata from external sources with its own for evaluating its data quality. The integrated IPMAP also permits an organization to understand the impact of the quality of external data on the quality of its own (newly generated) data.
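As a simplified illustration of this integration, the sketch below reduces an IPMAP to a block-name-to-successor edge map (ignoring the per-block metadata) and splices a customer's IPMAP into Firm-A's at the boundary block. The representation and block names (taken from Figure 1) are illustrative only; step 9 in Section 4 performs the analogous splice on the reverse-engineered IPMAPs.

import java.util.*;

public class IpmapIntegrationSketch {

    // IPMAP reduced to a toy directed graph: block name -> downstream block names.
    static void addEdge(Map<String, Set<String>> ipmap, String from, String to) {
        ipmap.computeIfAbsent(from, k -> new LinkedHashSet<>()).add(to);
    }

    // Splice an external IPMAP into the local one at a boundary block.
    static void spliceAtBoundary(Map<String, Set<String>> local,
                                 Map<String, Set<String>> external,
                                 String boundaryBlock) {
        // Copy the external stages and flows, then route the external sink into the boundary block.
        external.forEach((from, tos) -> tos.forEach(to -> addEdge(local, from, to)));
        addEdge(local, sinkOf(external), boundaryBlock);
    }

    // The sink is any block that appears only as a target, never as a source.
    static String sinkOf(Map<String, Set<String>> ipmap) {
        Set<String> targets = new HashSet<>();
        ipmap.values().forEach(targets::addAll);
        targets.removeAll(ipmap.keySet());
        return targets.iterator().next();
    }

    public static void main(String[] args) {
        Map<String, Set<String>> firmA = new LinkedHashMap<>();
        addEdge(firmA, "Firm boundary", "Administrative Data Repository");
        addEdge(firmA, "Administrative Data Repository", "Create demand forecast");
        addEdge(firmA, "Create demand forecast", "Operations Manager");

        Map<String, Set<String>> customer1 = new LinkedHashMap<>();
        addEdge(customer1, "Customer 1's local inventory", "Enter inventory detail");
        addEdge(customer1, "Enter inventory detail", "Transfer");

        spliceAtBoundary(firmA, customer1, "Firm boundary");
        firmA.forEach((from, tos) -> System.out.println(from + " -> " + tos));
    }
}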

To facilitate this exchange, we need to define a structure for the IPMAP and its associated metadata. We propose the following structure, defined using Backus-Naur Form (BNF) and based on the metadata specifications defined in [20]:

IPMAP Metadata ::= [Block_info] [Data_info]
Block_info ::= [Block name] [Block type] [Block physical location] [Business unit or organization responsible] [Block description] [Data quality dimensions]
Data_info ::= [Data ID] [Data source] [Data destination] [Data type] [Data quality measurements] [Business logic/rules]
Block name ::= {String}
Block type ::= source block | processing block | data quality check block | organizational boundary block | information system boundary block | sink block
Block physical location ::= {String}
Business unit or organization responsible ::= {String}
Block description ::= {String}
Data quality dimension ::= {Accuracy, Timeliness, Completeness, other data quality dimensions}
Data ID ::= {String}
Data source ::= {Block name}
Data destination ::= {Block name}
Data type ::= {String}
Data quality measurement ::= {Real}

The metadata includes the structure of the IPMAP (the manufacturing stages and how information flows through these stages).

[Figure 2. Schematic representation of the web service architecture illustrating the steps in execution. The diagram is omitted here; it shows the client in Firm-A, the DQM Web service, the MX Web service, and the MX components in the customer information systems (C1 ... Cn), connected by numbered arrows that correspond to the execution steps listed below.]

Given this metadata, we can construct the IPMAP. IPMAP metadata can also provide detailed data quality specifications on how data quality is measured along established quality dimensions. We include three such dimensions (accuracy, timeliness, and completeness) in our proposed framework, as these are well addressed in the data quality literature [11, 15, 21, 24]. An IP is created from data elements (raw materials) using manufacturing processes. Using the IPMAP, we can identify these elements/components and manufacturing processes. Given the quality measurements of the elements, algorithms have been developed to compute the data quality metrics of the information product [3, 5, 6, 20]. These metrics can be computed for each intermediate stage in the IPMAP, not just the final stage. The IPMAP metadata structure can be translated, using the XMI specification, into an XML schema that defines the format of the IPMAP metadata tags, elements, and attributes. This defines the specifications for managing the quality of the information exchanged between organizations. Electronic Business XML (ebXML) is a set of XML-based specifications and protocols developed to facilitate business transactions over the Internet; it is a specific Web service component for the e-business environment. We believe IP-XML is complementary to ebXML: while ebXML addresses the IPs exchanged in B2B transactions, IP-XML is for exchanging the data quality metadata associated with each IP.
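For illustration, the sketch below shows one possible IP-XML fragment and how a receiving component might read the quality measurements from it using the standard Java DOM API. The element names are hypothetical, derived loosely from the BNF structure above; they are not part of any defined schema.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;
import java.io.ByteArrayInputStream;

public class IpXmlParseSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical IP-XML fragment; element names are illustrative only.
        String ipXml =
            "<IPMAPMetadata>" +
            "  <Block>" +
            "    <BlockName>Create demand forecast</BlockName>" +
            "    <BlockType>processing block</BlockType>" +
            "    <BusinessUnitResponsible>Operations</BusinessUnitResponsible>" +
            "  </Block>" +
            "  <Data>" +
            "    <DataID>downstream_inventory_level</DataID>" +
            "    <DataSource>Firm boundary</DataSource>" +
            "    <Accuracy>0.95</Accuracy>" +
            "    <Completeness>0.90</Completeness>" +
            "  </Data>" +
            "</IPMAPMetadata>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(ipXml.getBytes("UTF-8")));

        // Read the quality measurements attached to each data element.
        NodeList dataNodes = doc.getElementsByTagName("Data");
        for (int i = 0; i < dataNodes.getLength(); i++) {
            Element data = (Element) dataNodes.item(i);
            String id = data.getElementsByTagName("DataID").item(0).getTextContent();
            String accuracy = data.getElementsByTagName("Accuracy").item(0).getTextContent();
            System.out.println(id + " accuracy=" + accuracy);
        }
    }
}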

4. The Web service implementation

To implement the data quality management model, we propose a Web service architecture for data quality management. This architecture includes two loosely coupled components: a Web service for facilitating the exchange of IPMAP metadata (the metadata exchange service, or MX) and another Web service for data quality management (the DQM). The MX component provides IPMAP metadata in IP-XML to the DQM service upon request. Every organization that agrees to share data and metadata in the B2B exchange can implement its own MX component. The MX can access the IPMAP metadata for all information products belonging to the organization. Moreover, the MX can help collect and transfer IPMAP metadata for information products received from other organizations by working with its counterparts in those organizations. All organizations' MX information is published in the metadata access registry (MXR). The IP-XML is enveloped and transported in SOAP messages. The DQM component collects and processes the IP-XML to perform data quality evaluation and data quality management. It reverse-engineers the metadata in the IP-XML and generates the corresponding IPMAP, which describes the manner in which the sender created the exchanged IP.

The following simple scenario illustrates how the data quality Web service architecture works. Figure 2 illustrates the interactions that occur in the architecture during the process of evaluating data quality. For simplicity, we assume the client (in Firm-A) already knows how to invoke the data quality management Web service (searching the registry and identifying the appropriate Web service is completed in one or more previous steps).

1. The client (in Firm-A) sends a SOAP message to invoke the DQM Web service to evaluate the data quality of an information product (say, IPa) that Firm-A has created using data inputs from external organizations.
2. The DQM Web service requests the MX to obtain the metadata for IPa.
3. The MX retrieves the metadata corresponding to IPa from Firm-A's metadata repository.
4. The MX then sends this metadata to the DQM. The DQM interprets it to identify the data creators responsible for the inputs; this information is available as part of the metadata associated with the IPMAP.
5. Using this metadata, the DQM requests the MX for the IPMAP metadata corresponding to each input.
6. For each input, the MX consults the access registry and communicates with the MX in the organization that created that input, which retrieves the metadata and transfers it to the requesting MX.
7. The IP-XML document for each input is then submitted to the DQM.
8. The DQM generates the IPMAPs corresponding to each customer from the metadata retrieved from that customer.
9. The DQM integrates the IPMAPs. It does so by replacing each input block (to the left of the organizational boundary block) with its corresponding IPMAP generated in step 8.
10. The quality metrics, as well as other useful process metadata received, are maintained by the DQM and associated with the block (processing stage) in the integrated IPMAP that the metadata describes.

11. The DQM then notifies the client that the integration is complete and that data quality evaluation can be initiated.
12. The computation of each quality dimension, and the computation of overall quality, is each an independent utility of the DQM Web service. Each works on the integrated IPMAP and uses the associated metadata for data quality evaluation. The client can now request one or more of these utilities. Each utility returns the computed values and presents them to the client using a GUI for visually rendering the information.

If a new IP is created within the receiving organization using the exchanged IP as an input, the DQM Web service also integrates the reverse-engineered IPMAP with the IPMAP corresponding to the new IP. Based on the integrated IPMAP, the accuracy, timeliness, and/or completeness can be calculated at any stage of the IPMAP to evaluate the quality of the IP. Furthermore, if a data quality problem is identified at some stage in the IPMAP, then, using the IPMAP metadata, the service offers the ability (as another independent utility) to trace back and pinpoint the stages that may have caused it (some of these stages may be external to the organization) and the ability to look ahead and target the stages and IP(s) that are likely to be affected by the problem.

Currently, the two major platforms for implementing Web services are Microsoft's .NET framework and Sun Microsystems' J2EE. We have adopted the J2EE platform to implement the data quality Web service. The logical architecture of the data quality service implementation is shown in Figure 3 (it applies to both the IPMAP metadata exchange Web service and the data quality management Web service). The Web service runs on Apache's Axis platform with Java servlets. After the client sends a SOAP message to invoke the Web service, the Axis translator converts the XML message into Java MessageContext objects and forwards them to the service engine. The service functions deliver the appropriate messages to a back-end RDBMS (the metadata repository) in order to access the information necessary for handling client requests. The database server (an MS-SQL database server in this case) holds pertinent information on IPMAP structures as well as data quality metadata and communicates with the service via JDBC.

[Figure 3. The DQ web service implementation. The diagram is omitted here; it shows the client's SOAP message passing through the translator (XML message to/from Java invocation/response), the Java MessageContext, and the service engine on the Web services platform (Apache Axis), which accesses the SQL database server via JDBC.]
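To illustrate the service-to-repository path just described, the following is a minimal sketch of a service class of the kind that Axis 1.x can expose through its RPC provider (each public method becomes a Web service operation), retrieving a stored accuracy measure from the metadata repository over JDBC. The class name, connection URL, credentials, and table/column names are illustrative assumptions rather than details of the actual implementation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class DQMService {

    // Connection details are assumptions for illustration only.
    private static final String DB_URL =
            "jdbc:sqlserver://localhost:1433;databaseName=ipmap_metadata";
    private static final String DB_USER = "dqm";
    private static final String DB_PASSWORD = "secret";

    // Returns the stored accuracy measure for an information product, or -1 if unknown.
    public double getAccuracy(String informationProductId) throws SQLException {
        String sql = "SELECT accuracy FROM ip_quality WHERE ip_id = ?"; // hypothetical table
        try (Connection con = DriverManager.getConnection(DB_URL, DB_USER, DB_PASSWORD);
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, informationProductId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getDouble("accuracy") : -1.0;
            }
        }
    }
}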

5. Discussion

In this paper, we have described the data quality problem in the B2B networked environment and proposed a solution: the data quality management standard model and the data quality Web service architecture. The B2B environment is characterized by heterogeneous and distributed platforms/systems/applications. Different partners use proprietary systems and process data independently. These characteristics make data quality management in B2B information exchanges difficult. From a technical perspective, the core barrier to data quality management in the B2B environment is the lack of a standard for managing data quality. XML is a simple but powerful method for describing and defining data. It is rapidly becoming an industry-standard method for data exchange between applications. XML further circumvents the problem of semantic mismatch between applications because it can carry the
semantic content (a Document Type Definition or an XML Schema Document). The agreement on meanings has to be established up front and can be achieved by adopting the DQ 9000 standard as well as the IPMAP for metadata specification. By combining the strengths of the IPMAP method (its capability for managing data quality) and of XML (its structure and standardization), we propose a generic data quality standard based on IPMAP metadata in XML. The proposed IP-XML is complementary to ebXML and can be extended into an industry standard for data quality management.

Another obstacle to data quality management in B2B settings is the difficulty of exchanging data across heterogeneous platforms. Organizational/system boundaries greatly hinder integration efforts in the B2B networked environment. The essence of Web services is interoperability, which is based on interfaces and standards. The three major components of the Web services architecture (SOAP, WSDL, and UDDI) are all built upon XML. Therefore, Web services can provide a standard communication and interoperation infrastructure across different platforms and implementations. Moreover, Web services offer the ability to integrate "islands" of data in an automated fashion. Since data quality metadata exists as independent "islands" within each organization's information system, the Web service provides a convenient tool for integrating such islands for data quality management. Hence, we propose a solution using Web services for B2B data quality management. We believe that the Web services architecture offers a way to overcome the difficulties associated with integrating the management of data quality across multiple cooperating organizations in the B2B data exchange environment.

A key to successfully managing data quality is clearly the adoption of a data quality standard and the commitment to manage data quality internally. Further, the adoption of a standard set of methods for evaluating data quality helps organizations understand and interpret data quality evaluations performed by other organizations. Organizations today understand the importance of data quality, and many are making a commitment to manage quality. No universally accepted standards currently exist for data quality management. Clearly, standards are necessary to manage data quality in inter-organizational settings, and we have proposed DQ 9000 as a first step towards defining such a standard. The IPMAP is just one, though arguably the most dominant, technique for managing and evaluating data quality. We have described this implementation to illustrate how data quality can be managed in inter-organizational information exchanges by viewing the IPMAP as a possible standard. Should other, better techniques be proposed in the future, this model for data quality management can be extended to include them. Each new technique, if implemented as a Web service, would allow organizations to choose from among the available services and implement the one that best suits their requirements.

One potential application of the data quality management Web service is third-party data quality verification. Prior to formally entering a business partnership, an organization may want to verify whether the data quality management procedures adopted by the potential partner meet its data quality management requirements. The data quality verification Web service provides a reference and an up-front quality assurance mechanism in the data exchange environment by playing the role of a "virtual" third party. We believe the data quality Web service architecture proposed in this paper provides a technologically feasible and cost-efficient solution to the data quality management challenge in B2B settings. It has great practical significance in facilitating B2B e-commerce and organizational data quality management.

In this paper, we emphasize the technological perspective of data quality management. Data quality evaluation consists of many different functions. Some functions are interdependent while others are not. For instance, evaluating the accuracy of an IP depends on the generation of its corresponding IPMAP but not on the evaluation of any other dimension. Further, evaluating data quality at a given stage in the IPMAP depends only on the evaluation of data quality in all of its preceding stages. In fast-paced decision environments such as a B2B information exchange, the Web service permits us to unbundle these functional components and offer them independently as Web services. The Web service also allows us to define the interplay between dependent functional components and orchestrate their execution.

6. References

[1] Adam, C. "Why Web Services?", www.webservices.org, 2002.
[2] Albrecht, C. C. "How Clean is the Future of SOAP?", Communications of the ACM, 47, 2, 2004, pp. 66-68.
[3] Ballou, D. P. and Pazer, H. L. "Designing Information Systems to Optimize the Accuracy-timeliness Tradeoff", Information Systems Research, 6, 1, 1995, pp. 51-72.
[4] Ballou, D. P. and Pazer, H. L. "Modeling Completeness versus Consistency Tradeoffs in Information Decision Contexts", IEEE Transactions on Knowledge and Data Engineering, 15, 1, 2003, pp. 240-243.
[5] Ballou, D. P., Wang, R. Y., Pazer, H. and Tayi, G. K. "Modeling information manufacturing systems to determine information product quality", Management Science, 44, 4, 1998, pp. 462-484.
[6] Cai, Y., Shankaranarayanan, G. and Ziad, M. "Evaluating Completeness of an Information Product", in Proceedings of the Ninth Americas Conference on Information Systems (AMCIS), Tampa, 2003, pp. 2274-2281.
[7] Deitel, H. M., Deitel, P. J., DuWaldt, B. and Trees, L. K. Web Services: A Technical Introduction, Prentice Hall, New Jersey, 2003.
[8] Eckerson, W. W. Data Quality and the Bottom Line, The Data Warehousing Institute, Seattle, WA, 2002.
[9] English, L. Improving Data Warehouse and Business Information Quality, John Wiley & Sons, NY, 1999.
[10] Ferris, C. and Farrell, J. "What are Web Services?", Communications of the ACM, 46, 6, 2003, pp. 31.
[11] Fox, C., Levitin, A. and Redman, T. C. "The notion of data and its quality dimensions", Information Processing & Management, 30, 1, 1994, pp. 9-19.
[12] Iyer, B., Freedman, J., Gaynor, M. and Wyner, G. "Web Services: Enabling Dynamic Business Networks", Communications of the AIS, Vol. 11, 2003.
[13] Kahn, B. K., Strong, D. M. and Wang, R. Y. "Information Quality Benchmarks: Product and Service Performance", Communications of the ACM, 45, 4, 2002, pp. 184-193.
[14] Kotok, A. "Tell me about Web services, and make it quick", www.webservices.org, 2002.
[15] Miller, H. "The multiple dimensions of information quality", Information Systems Management, 13, 2, 1996, pp. 79-82.
[16] Paulk, M. C. "How ISO 9001 Compares With the CMM", IEEE Software, January 1995, pp. 74-83.
[17] Pipino, L. L., Lee, Y. W. and Wang, R. Y. "Data Quality Assessment", Communications of the ACM, 45, 4, 2002, pp. 212-218.
[18] Redman, T. C. "The impact of poor data quality on the typical enterprise", Communications of the ACM, 41, 2, 1998, pp. 79-82.
[19] Shankaranarayanan, G., Wang, R. and Ziad, M. "IPMAP: Representing the Manufacture of an Information Product", in Proceedings of the MIT Data Quality Conference (IQ 2000), Boston, MA, 2000.
[20] Shankaranarayanan, G., Ziad, M. and Wang, R. Y. "Managing Data Quality in Dynamic Decision Environments: An Information Product Approach", Journal of Database Management, Vol. 14, 2003, pp. 14-32.
[21] Wand, Y. and Wang, R. Y. "Anchoring data quality dimensions in ontological foundations", Communications of the ACM, 39, 11, 1996, pp. 86-95.
[22] Wang, R. Y. "A product perspective on Total Data Quality Management", Communications of the ACM, 41, 2, 1998, pp. 58-65.
[23] Wang, R. Y., Lee, Y. W., Pipino, L. L. and Strong, D. M. "Manage Your Information as a Product", Sloan Management Review, 39, 4, 1998, pp. 95-105.
[24] Wang, R. Y. and Strong, D. M. "Beyond accuracy: What data quality means to data consumers", Journal of Management Information Systems, 12, 4, 1996, pp. 5-34.
[25] W3C Web Services Architecture Working Group, Web Services Architecture Requirements, W3C Working Draft (Aug. 19, 2002); www.w3.org/TR/2002/WD-wsa-reqs-20020819.
