Data Quality Case Study Prepared by ORC Macro



Data Correction
Background
– Data Correction Tracking system
– SAS AF query application
– Guidelines
Profile Analysis
– SSNs
– Names

Profile Analysis—SSNs
Persons: n = 346,381 PPRFs
– Missing SSNs: n = 96,097 (27%)
– Valid-looking SSNs: n = 234,311 (68%)
– Invalid SSNs: n = 15,973 (5%)
Shared SSNs: n = 7,100
Repeated SSNs: n = 3,406
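A minimal sketch of how such an SSN profile could be produced. The record layout, the field name "ssn", and the validity rule below are illustrative assumptions, not the rules actually used in the case study:

import re
from collections import Counter

# Illustrative validity check: nine digits with no all-zero area, group, or serial.
VALID_SSN = re.compile(r"^(?!000)\d{3}-?(?!00)\d{2}-?(?!0000)\d{4}$")

def classify_ssn(ssn):
    """Classify an SSN value as 'missing', 'valid-looking', or 'invalid'."""
    if not ssn or not str(ssn).strip():
        return "missing"
    return "valid-looking" if VALID_SSN.match(str(ssn).strip()) else "invalid"

def profile_ssns(profiles):
    """Count profiles per SSN category and find SSNs shared by multiple profiles."""
    categories = Counter(classify_ssn(p.get("ssn")) for p in profiles)
    valid = Counter(p["ssn"] for p in profiles
                    if classify_ssn(p.get("ssn")) == "valid-looking")
    shared = {ssn: n for ssn, n in valid.items() if n > 1}
    return categories, shared

# Toy example.
records = [{"ssn": "123-45-6789"}, {"ssn": "123-45-6789"},
           {"ssn": ""}, {"ssn": "000-12-3456"}]
print(profile_ssns(records))
# -> (Counter({'valid-looking': 2, 'missing': 1, 'invalid': 1}), {'123-45-6789': 2})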

Profile Analysis—SSNs
Shared SSNs (n = 7,100)
– Same or similar names: 73% (candidates for collapse)
– Different names: 27% (candidates for correction)

Profile Analysis—Names
Persons: n = 346,381
– Unique names: n = 232,172
– Repeated names: n = 114,209 (in 30,447 name groups)
  – Individual persons: n = 34,909
  – Possible duplicates: n = 79,300
Overall: unique persons 77% (n = 267,081); possible duplicates 23% (n = 79,300)
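A rough sketch of the name-grouping step behind these counts. The field names and the normalization rule are assumptions, and treating all but one profile per group as a possible duplicate is a simplification; the case study evidently used additional evidence (such as SSNs) to separate individual persons from possible duplicates:

from collections import defaultdict

def name_key(last, first):
    """Normalize a name to a grouping key (an assumed, simplified rule)."""
    return (last.strip().lower(), first.strip().lower())

def profile_names(profiles):
    """Group profiles by normalized name and summarize possible duplicates."""
    groups = defaultdict(list)
    for p in profiles:
        groups[name_key(p["last_name"], p["first_name"])].append(p)
    repeated = {k: v for k, v in groups.items() if len(v) > 1}
    return {
        "unique_names": sum(1 for v in groups.values() if len(v) == 1),
        "repeated_name_profiles": sum(len(v) for v in repeated.values()),
        "name_groups": len(repeated),
        "possible_duplicates": sum(len(v) - 1 for v in repeated.values()),
    }

# Toy example (field names "last_name"/"first_name" are assumptions).
people = [{"last_name": "Smith", "first_name": "John"},
          {"last_name": "SMITH", "first_name": "john "},
          {"last_name": "Jones", "first_name": "Mary"}]
print(profile_names(people))
# -> {'unique_names': 1, 'repeated_name_profiles': 2, 'name_groups': 1, 'possible_duplicates': 1}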

Profile Analysis—Names
Name groups: n = 30,447
– Sequential/multiple profiles: n = 20,375
  – Contracts: n = 18,650 (91%)
  – Other: n = 1,725 (9%)
114,209 profiles with repeated names
– Invalid/missing SSNs: n = 83,521
– Apparent valid SSNs: n = 30,668
  – Shared SSNs: n = 2,092
  – Typo/data entry: n = 3,622
  – Unique SSNs: n = 24,954

OLTP—Commons Cases
Definition
Statistics
Status

Data Correction
Identifying the extent of the problem
Investigating based on type of error
Validating the investigation
Implementing the change
Tracking the identification, investigation, validation, and implementation
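A minimal sketch of a record that tracks a case through these stages. The CorrectionCase class and its field names are illustrative assumptions, not the actual tracking system:

from dataclasses import dataclass, field
from datetime import date

# Stages mirror the workflow listed above.
STAGES = ("identified", "investigated", "validated", "implemented")

@dataclass
class CorrectionCase:
    """Illustrative record for one data-correction case."""
    person_id: int
    error_type: str          # e.g. "typo/data entry", "shared SSN"
    history: dict = field(default_factory=dict)   # stage -> date completed

    def advance(self, stage, when=None):
        """Record that a workflow stage was completed."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.history[stage] = when or date.today()

    def status(self):
        """Return the furthest stage completed so far."""
        done = [s for s in STAGES if s in self.history]
        return done[-1] if done else "open"

# Example: walk one case through the whole workflow.
case = CorrectionCase(person_id=3070908, error_type="conflicting middle initial")
for stage in STAGES:
    case.advance(stage)
print(case.status())   # -> "implemented"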

Data Correction—An Example
PERSON ID 3070908—PPRF record
Identification of problem
– Two different middle initials found
Investigation of problem
– TA module
– Scripts run
Validation of information
– Name, SSN, degree(s), grant(s)
– Sources

Data Correction—An Example
PERSON ID 3070908—PPRF record
Implementation of correction
– Grants report submitted to NIH OD
Tracking of correction
– Internal tracking system
Post-correction
– Loss of control of data

Developing a Data Quality Business Plan

Focus of Our Activities
Examination of the Database, Procedures, and Interface
Development of Modified Use Cases
– Unified Modeling Language
Identification and Extraction of Business Rules
Identification of Business Model

Data Quality Issues
Type-over of information
Generation of duplicate persons
Collapsing
Changes in degree and address data
Generation of orphans

Type-Over Practices
Intentions:
– Assign a new principal investigator (PI) to a grant
– Change the name of a PI on a grant
– Correct a misspelled name
Consequences:
– Inclusion of incorrect information in a person profile
– Absence of linkages between PIs and grant applications
– Creation of false linkages between PIs and grant applications

Factors Affecting Quality
Relatively easy access to person-related data elements
Lack of self-validation routines
Interface issues

Solutions
Restricted access
Quality control validation
Interface simplification
Self-validation algorithm

Data Quality Validation
Who does it?
– ICs
– A Quality Assurance group
– Other
How is it done?
– Staging areas
– Manual and intelligent filtering
– Architecture
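One way to read "staging areas" combined with "manual and intelligent filtering" is sketched below: incoming changes land in a staging area, a filter auto-accepts low-risk changes, and anything touching identifiers goes to a manual-review queue. The routing rule is purely an assumption for illustration:

# Fields whose changes are risky enough to warrant a human reviewer (an assumption).
IDENTIFIER_FIELDS = {"ssn", "last_name", "first_name"}

def route_change(change, looks_like_duplicate):
    """Decide whether a staged change can be auto-accepted or needs manual review."""
    if change.get("field") in IDENTIFIER_FIELDS:
        return "manual_review"
    if looks_like_duplicate(change):
        return "manual_review"
    return "auto_accept"

# Example: a degree update passes; a last-name change is held for review.
print(route_change({"field": "degree", "new_value": "PhD"}, lambda c: False))       # auto_accept
print(route_change({"field": "last_name", "new_value": "Smyth"}, lambda c: False))  # manual_review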

GM Module Screen GM1040

GM Module Screen COM1100

Self Validation
Name-matching algorithm
Consistency checking
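A rough illustration of the name-matching idea (not the algorithm actually deployed): a self-validation routine could normalize names and flag near matches for review before a profile is saved. The normalization rules and similarity threshold below are assumptions:

from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())

def name_similarity(a, b):
    """Similarity ratio between two normalized names (0.0 to 1.0)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def flag_possible_duplicates(new_name, existing_names, threshold=0.85):
    """Return existing names similar enough to the new name to warrant review."""
    return [n for n in existing_names if name_similarity(new_name, n) >= threshold]

# Example: a near-identical entry is flagged instead of silently creating a duplicate.
print(flag_possible_duplicates("Smith, John A.", ["Smith, John A", "Jones, Mary"]))
# -> ['Smith, John A']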

Higher-Level Analysis
The following are being examined relative to their effect on quality:
– Commons interface with IMPAC II
– Database redundancy
– Business rules in the database
– Master person file
– Front-end design
– Human factors
– Ownership

Development of a Data Quality Model

Major Goals
Quality improvement plan for personal identifiers
Evaluate the different identification algorithms currently in use for IMPAC II
Develop identification algorithm(s) and procedures
Serve as consultant and guarantor of efficacy of algorithm implementation

Moving Forward
Understanding the technical infrastructure
Identification of specific areas of concern
Development/proposal of data quality expectations
Development/proposal of appropriate, acceptable solutions

Data Quality White Paper
Knowledge assets are very real and carry tremendous value.
Outline
– Definition
– Rules
– Risks and Costs
– NIH Expectations
– Process
– Measurements/Metrics
– Testing
– Continuous Improvements
– Conclusions

Conclusion
Examination of the Database, Procedures, and Interface
Development of Modified Use Cases
– Unified Modeling Language
Identification and Extraction of Business Rules
Identification of Business Model
Development/Proposal of Appropriate, Acceptable Solutions
Development/Proposal of Data Quality Expectations
Identification of Specific Areas of Concern
Understanding the Technical Infrastructure
