ETL Explanation



Extract Transform Load (ETL)

The back-end processes involved in collecting data from various sources, preparing the data with respect to the requirements and loading it in the warehouse/mart.
             
Extraction – This is the process where we select the data from various source systems and transport it in an ETL server. Source systems could range from one to many in number, and similar or completely disparate in nature.
             
Cleansing – Once the data has been extracted, we need to check whether the data is valid and ensure consistency when coming from disparate sources. We check for valid values of data by performing ETL data validations rules. The rules could be defined to check for correct data formats (Numeric, Date, Text), data within the range, correct business codes, check for junk values. We can ensure consistency of data when it is collected from disparate sources by making sure that the same business attribute will have the same representation. Example would be the gender of customer may be represented as “m” and “f” in one source system and “male” and “female” in another source system. In this case we will make sure we convert “male” to “m” and “female” to “f” so that the gender coming from the second source system is consistent with that from the first source system.
             
Transformation – This is the process where we apply the business rules to the source data before loading it to the warehouse/mart. Transformation as the term suggests converts the data based on the business rules. Transformation also adds further business context to the data. The types of transformations we generally use in ETL are sort data, merge data, compare data, generate running numbers, aggregate data, change data type etc.
             
Loading – Once the source data is transformed by applying business rules, we load it into the warehouse/mart. There are two types of loads we perform in data warehousing.
History Load / Bulk Load – This is the process of loading the historical data over a period of time into the warehouse. This is usually a one time activity where we take the historical business data and bulk load into the warehouse/mart.
Incremental Load – This follows the history load. Based on the availability of the current source data we load the data into the warehouse. This is an ongoing process and is performed periodically based on the availability of source data (eg daily, weekly, monthly etc).

No comments:

Post a Comment