An Automated Data Quality Test Approach
Hajar Homayouni, Sudipto Ghosh, Indrakshi Ray
Department of Computer Science

[Figure: data warehouse diagram. Labels: Transactions, Reports, Data Warehouse, Analysis, Decisions]
Data Quality Tests
Validate data in a data store to detect violations of the syntactic and semantic constraints imposed by application domain experts and the data model
  Syntactic constraint validation: check that an attribute conforms to the syntactic specifications in the data model
  Semantic constraint validation: check that an attribute value conforms to the specifications stated by domain experts

Data Quality Test Approach
  Uses an unsupervised deep neural network to discover constraints in unlabeled data
  Flags as faulty those records that do not conform to the discovered constraints
  Uses unsupervised clustering to group the faulty records

Results
Number of runs:
  100 iterations of a deep neural network with 50 to 100 neurons in 2 to 5 hidden layers, which discovers the constraints
  400 iterations of clustering, which groups the faulty records
Total Time to Detect Faulty Records in Four Datasets
[Figure: workflow of the approach in four numbered steps. Labels: Dataset, Data Records, Constraint Discovery Module, Constraints, Testing Module, Groups of Faulty Records, Inspection Module, Inspected Faulty Records, Label faulty data records]
Inspection: a domain expert confirms which of the flagged records are actually faulty
Example syntactic constraint: Temperature must be a numeric value
Example semantic constraint: If Rainfall is greater than 80%, Relative_humidity must not be zero
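The two example constraints (Temperature must be numeric; Rainfall above 80% requires a non-zero Relative_humidity) can be sketched as a syntactic check and a semantic check. This is a minimal illustration, not the poster's tool: the function names, the `Record_ID` field, and the three sample records are hypothetical, and Rainfall is assumed to be stored as a percentage.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def check_syntactic(record: Record) -> List[str]:
    """Syntactic check: Temperature must be a numeric value."""
    violations = []
    value = record.get("Temperature")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        violations.append("Temperature must be a numeric value")
    return violations


def check_semantic(record: Record) -> List[str]:
    """Semantic check: if Rainfall > 80%, Relative_humidity must not be zero."""
    violations = []
    rainfall = record.get("Rainfall")
    humidity = record.get("Relative_humidity")
    if isinstance(rainfall, (int, float)) and rainfall > 80 and humidity == 0:
        violations.append("Rainfall > 80% requires non-zero Relative_humidity")
    return violations


def validate(record: Record) -> List[str]:
    """Run both kinds of checks and collect every violated constraint."""
    return check_syntactic(record) + check_semantic(record)


# Hypothetical sample records, invented for illustration only.
records = [
    {"Record_ID": 1, "Temperature": 21.5, "Rainfall": 85, "Relative_humidity": 70},
    {"Record_ID": 2, "Temperature": "warm", "Rainfall": 10, "Relative_humidity": 40},
    {"Record_ID": 3, "Temperature": 18.0, "Rainfall": 90, "Relative_humidity": 0},
]

report = {r["Record_ID"]: validate(r) for r in records}
for record_id, violations in report.items():
    print(record_id, violations or "ok")
```

The syntactic check needs only the data model (a type), while the semantic check relates two attributes and therefore needs domain knowledge.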
Limitations of Existing Approaches
  Domain-independent approaches (Informatica): can only check for syntactic constraints, not semantic ones
  Domain-specific approaches (Achilles and PEDSnet): can only check for constraints specified by experts, who may miss important constraints

Research Questions and Goal
Questions:
  How can we identify and validate constraints that may be missed by domain experts?
  What types of constraints can we identify using the data?
  What types of faults can we detect based on the identified constraints?
  How can we incorporate domain knowledge into the constraint identification and fault detection phases?
Goal: Develop an automated data quality test approach that:
  Discovers the constraints that unlabeled data records must satisfy
  Detects faulty records that do not satisfy the discovered constraints
  Uses domain knowledge to validate the detected faulty records and improve the constraint discovery phase

Data Quality Test Tool Prototype

Record ID   Weight   Height   BMI
1           110      5.41     3.76
2           132      5.57     4.90
3           100      5.24     3.64
4           154      5.54     5.32
5           180      5.90     5.17

Constraints
BMI = Weight / (Height)^2
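The BMI constraint can be checked directly against the five example records (Record ID, Weight, Height, BMI). The sketch below also tests the second expression that appears on the poster, Height - BMI < 1. Treating each expression as a constraint whose violation defines one group of faulty records is an assumption, as are the 0.005 rounding tolerance and all function and variable names.

```python
# The five example records from the poster.
records = [
    {"id": 1, "weight": 110, "height": 5.41, "bmi": 3.76},
    {"id": 2, "weight": 132, "height": 5.57, "bmi": 4.90},
    {"id": 3, "weight": 100, "height": 5.24, "bmi": 3.64},
    {"id": 4, "weight": 154, "height": 5.54, "bmi": 5.32},
    {"id": 5, "weight": 180, "height": 5.90, "bmi": 5.17},
]

# Assumption: stored BMI values are rounded to two decimals, so a
# difference of up to half a unit in the last place is not a fault.
TOLERANCE = 0.005


def violates_bmi_formula(r):
    """True if the stored BMI disagrees with Weight / Height^2."""
    expected = r["weight"] / r["height"] ** 2
    return abs(r["bmi"] - expected) > TOLERANCE


def violates_height_bmi_gap(r):
    """True if Height - BMI < 1 does not hold."""
    return not (r["height"] - r["bmi"] < 1)


constraints = {
    "BMI = Weight / Height^2": violates_bmi_formula,
    "Height - BMI < 1": violates_height_bmi_gap,
}

# Group the faulty record IDs by the constraint they violate.
groups = {
    name: [r["id"] for r in records if is_violated(r)]
    for name, is_violated in constraints.items()
}
print(groups)
# {'BMI = Weight / Height^2': [2, 4], 'Height - BMI < 1': [1, 3]}
```

Under these assumptions the two groups contain records 2 and 4 and records 1 and 3, which matches the two grouped tables on the poster; record 5 satisfies both expressions.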
Group_1
Record ID   Weight   Height   BMI
2           132      5.57     4.90
4           154      5.54     5.32

Group_2
Record ID   Weight   Height   BMI
1           110      5.41     3.76
3           100      5.24     3.64
Group_2
Height – BMI < 1

Evaluation
Goals: Demonstrate that the test approach can detect
1) Faults that were already detected by the existing tools
2) New faults that were not previously detected by the existing tools
Subjects: Four datasets created using multiple table joins in a health data warehouse
Metrics:
  E: set of faulty records detected by existing approaches
  A: set of faulty records detected by our approach
  Previously Detected (PD) = |E ∩ A| / |E|
  Newly Detected (ND) = |A − E| / |A|
  Undetected (UD) = |E − A| / |E|
  Total Time: time to train the model and detect the faulty records
[Figure: Venn diagram of sets E and A with regions labeled UD, PD, and ND]

Metric                     Dataset 1   Dataset 2   Dataset 3   Dataset 4
Size                       94,165      600,000     600,000     1,000,000
Known Faulty Records (%)   0.02        18.33       41.00       0.07
Total Time (min)           0.8         1.5         1.6         2.5

[Figure: Known and New Faults Detected by Our Approach. Series: PD, ND, UD]

Conclusions
Our automated data quality test approach:
  Detected between 96.14% and 100% of previously detected faults in the four datasets
  Detected between 0% and 16.75% of faults that were not previously detected. These are suspicious records that were missed by the domain experts
  Missed between 0% and 3.86% of faults that were previously detected. This indicates that the autoencoder could not discover all of the associations among the data attributes
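The PD, ND, and UD metrics used in the evaluation are ratios over two sets of faulty records, E (existing approaches) and A (our approach), so they can be computed with plain set operations. The sketch below is illustrative only: the function name and the record IDs in the usage example are hypothetical and are not taken from the four datasets.

```python
def detection_metrics(existing, ours):
    """Compute PD, ND, and UD from two sets of faulty record IDs.

    existing: set E, faulty records detected by existing approaches
    ours:     set A, faulty records detected by the new approach
    """
    if not existing or not ours:
        raise ValueError("both sets must be non-empty to form the ratios")
    return {
        "PD": len(existing & ours) / len(existing),  # |E ∩ A| / |E|
        "ND": len(ours - existing) / len(ours),      # |A − E| / |A|
        "UD": len(existing - ours) / len(existing),  # |E − A| / |E|
    }


# Hypothetical record IDs, invented for illustration.
E = {1, 2, 3, 4}
A = {2, 3, 4, 5, 6}

metrics = detection_metrics(E, A)
print(metrics)  # {'PD': 0.75, 'ND': 0.4, 'UD': 0.25}
```

PD and UD share the denominator |E| and partition E, so they always sum to 1. This is consistent with the reported ranges: 96.14% previously detected corresponds to 3.86% undetected.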
Future Work
  Extend the approach to find undetected faults using other machine learning techniques
  Improve the constraint discovery module using domain knowledge
  Evaluate the approach using data stores from different domains