Exploring the Relationship between Age and Work

An Automated Data Quality Test Approach

Hajar Homayouni, Sudipto Ghosh, Indrakshi Ray
Department of Computer Science

[Figure: data warehouse diagram with the labels Transactions, DATA WAREHOUSE, Reports, Analysis, and Decisions]

Data Quality Tests

Data Quality Test Approach

Results

Validate data in a data store to detect violations of the syntactic and semantic constraints imposed by application domain experts and the data model.

Syntactic constraint validation
• Checks that an attribute conforms to the syntactic specifications in the data model

Semantic constraint validation
• Checks that an attribute value conforms to the specifications stated by domain experts

• Uses an unsupervised deep neural network to discover constraints in unlabeled data
• Flags as faulty the records that do not conform to the discovered constraints
• Uses unsupervised clustering to group the faulty records

Number of runs
• 100 iterations of a deep neural network, with 50 to 100 neurons in 2 to 5 hidden layers, to discover the constraints
• 400 iterations of clustering to group the faulty records

Total Time to Detect Faulty Records in Four Datasets

Constraint Discovery Module

Temperature must be a numeric value

• A domain expert marks as correct each flagged record that is actually faulty

Constraints

Dataset

Inspection Module

Testing Module Groups of Faulty Records

Data Records

Inspected Faulty Records


If Rainfall is greater than 80%, Relative_humidity must not be zero
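The two example constraints on this poster (Temperature must be numeric; Rainfall above 80% implies a nonzero Relative_humidity) can be written as simple record-level checks. The following is a minimal sketch, assuming records are dictionaries and rainfall is stored as a percentage. The function names and the sample records are hypothetical and serve only as illustration; they are not part of the authors' tool.

```python
def is_numeric(value):
    """Return True for int or float values (booleans are excluded)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def violates_syntactic(record):
    # Syntactic constraint: Temperature must be a numeric value.
    return not is_numeric(record.get("Temperature"))


def violates_semantic(record):
    # Semantic constraint: if Rainfall is greater than 80%,
    # Relative_humidity must not be zero.
    rainfall = record.get("Rainfall")
    humidity = record.get("Relative_humidity")
    if not (is_numeric(rainfall) and is_numeric(humidity)):
        return False  # not evaluable here; left to syntactic checks
    return rainfall > 80 and humidity == 0


# Hypothetical sample records, invented for illustration only.
records = [
    {"id": 1, "Temperature": 21.5, "Rainfall": 85, "Relative_humidity": 70},
    {"id": 2, "Temperature": "warm", "Rainfall": 10, "Relative_humidity": 40},
    {"id": 3, "Temperature": 18.0, "Rainfall": 90, "Relative_humidity": 0},
]

syntactic_faults = [r["id"] for r in records if violates_syntactic(r)]
semantic_faults = [r["id"] for r in records if violates_semantic(r)]
print(syntactic_faults, semantic_faults)
```

Hand-written checks like these cover only the constraints someone thought to specify, which is the gap the constraint discovery step is meant to address.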

Domain-independent approaches (Informatica)
• Can check only syntactic constraints, not semantic ones

Domain-specific approaches (Achilles and PEDSnet)
• Can check only the constraints specified by experts, who may miss important constraints

Research Questions and Goal

Questions:
• How can we identify and validate constraints that may be missed by domain experts?
• What types of constraints can we identify using the data?
• What types of faults can we detect based on the identified constraints?
• How can we incorporate domain knowledge into the constraint identification and fault detection phases?

Goal: Develop an automated data quality test approach that:
• Discovers, in unlabeled data records, the constraints that must be satisfied
• Detects faulty records that do not satisfy the discovered constraints
• Uses domain knowledge to validate the detected faulty records and improve the constraint discovery phase

Data Quality Test Tool Prototype

Record ID   Weight   Height   BMI
1           110      5.41     3.76
2           132      5.57     4.90
3           100      5.24     3.64
4           154      5.54     5.32
5           180      5.90     5.17

Constraints

BMI = Weight / Height²

Group_1

Record ID   Weight   Height   BMI
2           132      5.57     4.90
4           154      5.54     5.32
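The constraint BMI = Weight / Height² can be checked directly against the five example records on this poster. The following is a minimal sketch, assuming a rounding tolerance of 0.01; the tolerance and the function name are illustrative assumptions. It hard-codes the constraint rather than discovering it with a neural network, so it shows only the testing step.

```python
# Example records from the poster: (record_id, weight, height, bmi).
records = [
    (1, 110, 5.41, 3.76),
    (2, 132, 5.57, 4.90),
    (3, 100, 5.24, 3.64),
    (4, 154, 5.54, 5.32),
    (5, 180, 5.90, 5.17),
]

TOLERANCE = 0.01  # assumed slack for two-decimal rounding


def violates_bmi_constraint(weight, height, bmi, tol=TOLERANCE):
    """Return True when the stored BMI differs from Weight / Height**2."""
    expected = weight / height ** 2
    return abs(bmi - expected) > tol


faulty_ids = [
    rid for rid, weight, height, bmi in records
    if violates_bmi_constraint(weight, height, bmi)
]
print(faulty_ids)  # [2, 4]
```

Under this tolerance the check flags records 2 and 4, the two records listed in Group_1.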

Group_2

Height − BMI < 1

Record ID   Weight   Height   BMI
1           110      5.41     3.76
3           100      5.24     3.64

Label faulty data records

Limitations of Existing Approaches

Evaluation

Goals: Demonstrate that the test approach can detect
1) Faults that were already detected by the existing tools
2) New faults that were not previously detected by the existing tools

Subjects: Four datasets created using multiple table joins in a health data warehouse

Dataset     Size        Known Faulty Records (%)   Total Time (min)
Dataset 1   94,165      0.02                       0.8
Dataset 2   600,000     18.33                      1.5
Dataset 3   600,000     41.00                      1.6
Dataset 4   1,000,000   0.07                       2.5

Metric
• Previously Detected (PD) = |E ∩ A| / |E|
• Newly Detected (ND) = |A − E| / |A|
• Undetected (UD) = |E − A| / |E|
  where E is the set of faulty records detected by existing approaches and A is the set of faulty records detected by our approach
• Total Time: Time to train the model and detect the faulty records

[Figure: sets E and A with the PD, ND, and UD regions]

[Figure: Known and New Faults Detected by Our Approach (PD, ND, UD)]

Conclusions

Our automated data quality test approach:
• Detected between 96.14% and 100% of the previously detected faults in the four datasets
• Detected between 0% and 16.75% of faults that were not previously detected
  ◦ These are suspicious records that were missed by the domain experts
• Missed between 0% and 3.86% of the previously detected faults
  ◦ This indicates that the autoencoder could not discover all of the associations among the data attributes
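The PD, ND, and UD percentages follow from set operations on E (faulty records found by existing approaches) and A (faulty records found by this approach). The following is a minimal sketch of that arithmetic. The function name and the record IDs are hypothetical, and the results are scaled to percentages as an assumption to match how the poster reports them.

```python
def detection_metrics(existing, ours):
    """Compute PD, ND, and UD (in percent) from two collections of record IDs."""
    E, A = set(existing), set(ours)
    pd = 100 * len(E & A) / len(E) if E else 0.0  # Previously Detected
    ud = 100 * len(E - A) / len(E) if E else 0.0  # Undetected
    nd = 100 * len(A - E) / len(A) if A else 0.0  # Newly Detected
    return {"PD": pd, "ND": nd, "UD": ud}


# Hypothetical record IDs, invented for illustration only.
existing_faults = {1, 2, 3, 4}
our_faults = {2, 3, 4, 5, 6}

metrics = detection_metrics(existing_faults, our_faults)
print(metrics)  # {'PD': 75.0, 'ND': 40.0, 'UD': 25.0}
```

PD and UD share the denominator |E| and partition E, so they always sum to 100%, which is why a PD of 96.14% pairs with a UD of 3.86%.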

Future Work
• Extend the approach to find undetected faults using other machine learning techniques
• Improve the constraint discovery module using domain knowledge
• Evaluate the approach using data stores from different domains