Virtual University of Pakistan Data Warehousing Lecture-17 Issues

15 Slides153.00 KB

Virtual University of Pakistan Data Warehousing Lecture-17 Issues of ETL Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics Research www.nu.edu.pk/cairindex.asp National University of Computers & Emerging Sciences, Islamabad Email: [email protected] Ahsan Abdullah 1

Issues of ETL 2 Ahsan Abdullah

Why ETL Issues? Data from different source systems will be different, poorly documented and dirty. Lot of analysis required. Easy to collate addresses and names? Not really. No address or name standards. Use software for standardization. Very expensive, as any “standards” vary from country to country, not large enough market. 3 Ahsan Abdullah

Why ETL Issues? Things would have been simpler in the presence of operational systems, but that is not always the case Manual data collection and entry. Nothing wrong with that, but potential to introduces lots of problems. Data is never perfect. The cost of perfection, extremely high vs. its value. 4 Ahsan Abdullah

“Some” Issues Usually, if not always underestimated Diversity in source systems and platforms Inconsistent data representations Complexity of transformations Rigidity and unavailability of legacy systems Volume of legacy data Web scrapping 5 Ahsan Abdullah

Complexity of problem/work underestimated Work seems to be deceptively simple. People start manually building the DWH. Programmers underestimate the task. Impressions could be deceiving. Traditional DBMS rules and concepts break down for very large heterogeneous historical databases. 6 Ahsan Abdullah

Diversity in source systems and platforms Platform OS DBMS MIS/ERP Main Frame VMS Oracle SAP Mini Computer Unix Informix PeopleSoft Desktop Win NT Access JD Edwards DOS Text file Dozens of source systems across organizations Numerous source systems within an organization Need specialist for each 7 Ahsan Abdullah

Inconsistent data representations Same data, different representation Date value representations Examples: 970314 03/14/1997 March 14 1997 1997-03-14 14-MAR-1997 2450521.5 (Julian date format) Gender value representations Examples: - Male/Female - 0/1 - M/F - PM/PF 8 Ahsan Abdullah

Multiple sources for same data element Need to rank source systems on a per data element basis. Take data element from source system with highest rank where element exists. “Guessing” gender from name Something is better than nothing? Must sometimes establish “group ranking” rules to maintain data integrity. First, middle and family name from two systems of different rank. People using middle name as first name. 9 Ahsan Abdullah

Complexity of required transformations Simple one-to-one scalar transformations - 0/1 M/F One-to-many element transformations - 4 x 20 address field House/Flat, Road/Street, Area/Sector, City. Many-to-many element transformations - House-holding (who live together) and individualization (who are same) and same lands. 10 Ahsan Abdullah

Rigidity and unavailability of legacy systems Very difficult to add logic to or increase performance of legacy systems. Utilization of expensive legacy systems is optimized. Therefore, want to off-load transformation cycles to open systems environment. This often requires new skill sets. Need efficient and easy way to deal with incompatible mainframe data formats. 11 Ahsan Abdullah

Volume of legacy data Talking about not weekly data, but data spread over years. Historical data on tapes that are serial and very slow to mount etc. Need lots of processing and I/O to effectively handle large data volumes. Need efficient interconnect bandwidth to transfer large amounts of data from legacy sources to DWH. 12 Ahsan Abdullah

Web scrapping Lot of data in a web page, but is mixed with a lot of “junk”. Problems: Limited query interfaces Fill in forms “Free text” fields E.g. addresses Inconsistent output i.e., html tags which mark interesting fields might be different on different pages. Rapid change without notice. 13 Ahsan Abdullah

Beware of data quality (or lack of it) Data quality is always worse than expected. Will have a couple of lectures on data quality and its management. It is not a matter of few hundred rows. Data recorded for running operations is not usually good enough for decision support. Correct totals don’t guarantee data quality. Not knowing gender does not hurt POS. Centurion customers popping up. Ahsan Abdullah 14

ETL vs. ELT There are two fundamental approaches to data acquisition: ETL: Extract, Transform, Load in which data transformation takes place on a separate transformation server. ELT: Extract, Load, Transform in which data transformation takes place on the data warehouse server. Combination of both is also possible 15 Ahsan Abdullah

Back to top button