
Thursday, February 24, 2011

ETL Methodology


Overview


This document is designed for use by business associates and technical resources to better understand the process of building a data warehouse and the methodology employed to build the EDW.

This methodology has been designed to provide the following benefits:
  1. A high level of performance
  2. Scalability to any size
  3. Ease of maintenance
  4. Boiler-plate development
  5. Standard documentation techniques


ETL Definitions


ETL (Extract Transform Load) – the physical process of extracting data from a source system, transforming the data to the desired state, and loading it into a database
EDW (Enterprise Data Warehouse) – the logical data warehouse designed for enterprise information storage and reporting
DM (Data Mart) – a small subset of a data warehouse specifically defined for a subject area


Documentation Specifications


A primary driver of the entire process is accurate business information requirements.  TDD Consulting will use standard documents prepared by the Project Management Institute for requirements gathering, project signoff, and compiling all testing information.

ETL Naming Conventions


To maintain consistency, all ETL processes will follow a standard naming methodology.

Tables

All destination tables will utilize the following naming convention:
   EDW_<SUBJECT>_<TYPE>

There are six types of tables used in a data warehouse: Fact, Dimension, Aggregate, Staging, Temp, and Audit.  A quick overview of each type is given below, followed by sample names.

Fact – a table type that contains atomic data
Dimension – a table type that contains referential data needed by the fact tables
Aggregate – a table type used to aggregate data, forming a pre-computed answer to a business question (ex. totals by day)
Staging – a table type used to store data during ETL processing; the data is not removed immediately after processing
Temp – a table type used during ETL processing that can be truncated immediately afterwards (ex. storing order IDs for lookup)
Audit – a table type used to keep track of the ETL process (ex. processing times by job)

Each type of table will be kept in a separate schema.  This will decrease maintenance work and time spent looking for a specific table.

EDW_RX_FACT – fact table containing RX subject matter
EDW_TIME_DIM – dimension table containing TIME subject matter
EDW_CUSTOMER_AG – aggregate table containing CUSTOMER subject matter
ETL_PROCESS_AUDIT – audit table containing PROCESS data
STG_DI_CUSTOMER – staging table sourced from the DI system, used for CUSTOMER data processing
ETL_ADDRESS_TEMP – temp table used for ADDRESS processing
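
As an illustration of the convention and the schema separation, a minimal Oracle-style sketch is shown below; the FACT schema name and the data columns are assumptions for the example rather than part of the standard.

  -- Hypothetical fact table in its own FACT schema, named per EDW_<SUBJECT>_<TYPE>.
  CREATE TABLE FACT.EDW_CUSTOMER_FACT (
    CUSTOMER_KEY  NUMBER,   -- assumed surrogate key resolved by dimension lookup
    ORDER_AMOUNT  NUMBER,   -- assumed atomic measure
    BATCH_NUMBER  NUMBER    -- batch number loaded with the data (see Auditing below)
  );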

ETL Processing


The following types of ETL jobs will be used for processing.  The list below gives each job type, its function, its naming convention, and an example name.



Extract – extracts information from a source system and places it in a staging table
  Naming convention: Extract<Source><Subject> (ex. ExtractDICustomer)
Source – sources information from STG tables and performs column validation
  Naming convention: Source<Table> (ex. SourceSTGDICustomer)
LoadTemp – loads temp tables used in processing
  Naming convention: LoadTemp<Table> (ex. LoadTempETLAddressTemp)
LookupDimension – looks up dimension tables
  Naming convention: LookupDimension<Subject> (ex. LookupDimensionCustomer)
Transform – transforms the subject area data and generates insert files
  Naming convention: Transform<Subject> (ex. TransformCustomer)
QualityCheck – checks the quality of the data before it is loaded into the EDW
  Naming convention: QualityCheck<Subject> (ex. QualityCheckCustomer)
Load – loads the data into the EDW
  Naming convention: Load<Table> (ex. LoadEDWCustomerFact)


ETL Job Standards


All ETL jobs will be created with a boiler-plate approach.  This approach allows for rapid creation of similar jobs while keeping maintenance low.

Comments


Every job will have a standard comment template that specifically spells out the following attributes of the job:

Job Name:            LoadEDWCustomerFact
Purpose:               Load the EDW_Customer_Fact table
Predecessor:         QualityCheckCustomer
Date:                     April 21, 2006
Author:                 Wes Dumey
Revision History: 
April 21, 2006 – Created the job from standard template
April 22, 2006 – Added error checking for table insert

In addition, a job data dictionary will describe every job in a table so that it can be easily searched via standard SQL.
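
A minimal sketch of what such a dictionary could look like follows; the table name ETL_JOB_DICTIONARY and its columns are assumptions for illustration, mirroring the comment template above.

  -- Hypothetical job data dictionary, searchable with standard SQL.
  CREATE TABLE ETL_JOB_DICTIONARY (
    JOB_NAME     VARCHAR2(100),  -- ex. LoadEDWCustomerFact
    PURPOSE      VARCHAR2(500),  -- ex. Load the EDW_Customer_Fact table
    PREDECESSOR  VARCHAR2(100),  -- ex. QualityCheckCustomer
    AUTHOR       VARCHAR2(100),
    CREATED_DATE DATE
  );

  -- Find every job touching the CUSTOMER subject area and its predecessor.
  SELECT JOB_NAME, PREDECESSOR
    FROM ETL_JOB_DICTIONARY
   WHERE JOB_NAME LIKE '%Customer%';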


Persistent Staging Areas


Data will be received from the source systems in its native format.  The data will be stored in a PSA table following the naming standards listed previously.  The table will contain the following layout:

ROW_NUMBER (NUMBER) – unique for each row in the PSA
DATE (DATE) – date the row was placed in the PSA
STATUS_CODE (CHAR(1)) – indicates the status of the row (‘I’ inducted, ‘P’ processed, ‘R’ rejected)
ISSUE_CODE (NUMBER) – code uniquely identifying problems with the data if STATUS_CODE = ‘R’
BATCH_NUMBER (NUMBER) – batch number used to process the data (auditing)
Data columns follow these control columns.
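
A sketch of this layout as Oracle-style DDL is shown below, using STG_DI_CUSTOMER from the naming examples; the STG schema name, the trailing data columns, and the PSA_DATE column name (DATE itself is a reserved word in most SQL dialects) are assumptions.

  -- Hypothetical PSA table for CUSTOMER data received from the DI system.
  CREATE TABLE STG.STG_DI_CUSTOMER (
    ROW_NUMBER    NUMBER,        -- unique for each row in the PSA
    PSA_DATE      DATE,          -- date the row was placed in the PSA
    STATUS_CODE   CHAR(1),       -- 'I' inducted, 'P' processed, 'R' rejected
    ISSUE_CODE    NUMBER,        -- populated when STATUS_CODE = 'R'
    BATCH_NUMBER  NUMBER,        -- batch number used to process the data
    -- data columns in the source's native format follow, for example:
    CUSTOMER_ID   VARCHAR2(20),  -- assumed
    CUSTOMER_NAME VARCHAR2(100), -- assumed
    STATE_CODE    VARCHAR2(2)    -- assumed
  );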


Auditing


The ETL methodology maintains a process for providing audit and logging capabilities. 

For each run of the process, a unique batch number composed of the time segments is created.  This batch number is loaded with the data into the PSA and all target tables.  In addition, an entry with the following data elements will be made into the ETL_PROCESS_AUDIT table.

DATE (DATE) – (index) run date
BATCH_NUMBER (NUMBER) – batch number of the process
PROCESS_NAME (VARCHAR) – name of the process that was executed
PROCESS_RUN_TIME (TIMESTAMP) – time (HH:MI:SS) of process execution
PROCESS_STATUS (CHAR) – ‘S’ SUCCESS, ‘F’ FAILURE
ISSUE_CODE (NUMBER) – code of the issue related to the process failure (if ‘F’)
RECORD_PROCESS_COUNT (NUMBER) – row count of records processed during the run

The audit process will allow for efficient logging of process execution and encountered errors.
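
A sketch of how the batch number and the audit entry might be produced, assuming Oracle-style SQL; the run-date column is shown as RUN_DATE because DATE is a reserved word, and the YYYYMMDDHH24MISS batch format and the literal values are illustrative assumptions.

  -- Hypothetical DDL for the audit table described above.
  CREATE TABLE ETL_PROCESS_AUDIT (
    RUN_DATE             DATE,
    BATCH_NUMBER         NUMBER,
    PROCESS_NAME         VARCHAR2(100),
    PROCESS_RUN_TIME     TIMESTAMP,
    PROCESS_STATUS       CHAR(1),
    ISSUE_CODE           NUMBER,
    RECORD_PROCESS_COUNT NUMBER
  );

  -- Audit entry written at the end of a run; the batch number is built
  -- from time segments of the current date and time.
  INSERT INTO ETL_PROCESS_AUDIT
    (RUN_DATE, BATCH_NUMBER, PROCESS_NAME, PROCESS_RUN_TIME,
     PROCESS_STATUS, ISSUE_CODE, RECORD_PROCESS_COUNT)
  VALUES
    (TRUNC(SYSDATE),
     TO_NUMBER(TO_CHAR(SYSDATE, 'YYYYMMDDHH24MISS')),
     'LoadEDWCustomerFact', SYSTIMESTAMP, 'S', NULL, 125000);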

Quality


Due to the sensitive nature of data within the EDW, data quality is a driving priority.  Quality will be handled through the following processes:

  1. Source job – the source job will contain a quick data scrubbing mechanism that verifies the data conforms to the expected type (numeric fields contain numbers, character fields contain letters).
  2. Transform – the transform job will carry metadata matching the target table and verify that NULL values are not loaded into NOT NULL columns and that the data is transformed correctly.
  3. QualityCheck – a separate job is created to do a cursory check on a few identified columns and verify that the correct data is loaded into those columns.

Source Quality


A data scrubbing mechanism will be constructed.  This mechanism will check identified columns for anomalies (ex. embedded carriage returns) and for value domains.  If an error is discovered, the data is fixed and a record is written to the ETL_QUALITY_ISSUES table (see below for the table definition).
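
A minimal sketch of one such scrub, assuming Oracle-style SQL and the hypothetical STG_DI_CUSTOMER layout sketched earlier: embedded carriage returns are stripped from CUSTOMER_NAME and each fix is logged to ETL_QUALITY_ISSUES (defined later in this document); the ENTRY_DATE column name and the issue code 101 are assumptions.

  -- Log each value about to be fixed, then apply the fix in the PSA.
  INSERT INTO ETL_QUALITY_ISSUES
    (ENTRY_DATE, BATCH_NUMBER, PROCESS_NAME, COLUMN_NAME,
     COLUMN_VALUE, EXPECTED_VALUE, ISSUE_CODE, SEVERITY)
  SELECT TRUNC(SYSDATE), BATCH_NUMBER, 'SourceSTGDICustomer', 'CUSTOMER_NAME',
         CUSTOMER_NAME, REPLACE(CUSTOMER_NAME, CHR(13), ''), 101, 'L'
    FROM STG.STG_DI_CUSTOMER
   WHERE INSTR(CUSTOMER_NAME, CHR(13)) > 0;

  UPDATE STG.STG_DI_CUSTOMER
     SET CUSTOMER_NAME = REPLACE(CUSTOMER_NAME, CHR(13), '')
   WHERE INSTR(CUSTOMER_NAME, CHR(13)) > 0;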

Transform Quality


The transformation job will employ a matching metadata technique. If the target table enforces NOT NULL constraints, a check will be built into the job preventing NULLS from being loaded and causing a job stream abend.
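
One way such a check might look, assuming Oracle-style SQL and the hypothetical PSA sketch above: rows that would violate a NOT NULL column on the target are marked rejected in the PSA instead of abending the load; the specific column and issue code 201 are assumptions.

  -- Reject rows whose CUSTOMER_ID would violate a NOT NULL target column.
  UPDATE STG.STG_DI_CUSTOMER
     SET STATUS_CODE = 'R',
         ISSUE_CODE  = 201      -- assumed code for "NULL in NOT NULL column"
   WHERE CUSTOMER_ID IS NULL
     AND STATUS_CODE = 'I';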



Quality Check


Quality check is the last point of validation within the job stream. QC can be configured to check any percentage of rows (0-100%) and any number of columns (1-X).  QC is designed to pay attention to the most valuable or vulnerable rows within the data sets. QC will use a modified version of the data scrubbing engine used during the source job to derive correct values and reference the rules listed in the ETL_QC_DRIVER table.  Any suspect rows will be pulled from the insert/update files, updated in the PSA table to an ‘R’ status, and assigned an issue code for the failure.
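
A rough sketch of the PSA side of this step, assuming Oracle-style SQL; because ETL_QC_DRIVER is not defined in this document, its COLUMN_NAME/ALLOWED_VALUE layout, the 10% sample, the batch number literal, and issue code 301 are all assumptions.

  -- Sample roughly 10% of the current batch and reject rows whose checked
  -- column falls outside the value domain listed in ETL_QC_DRIVER.
  UPDATE STG.STG_DI_CUSTOMER psa
     SET psa.STATUS_CODE = 'R',
         psa.ISSUE_CODE  = 301                  -- assumed "QC domain failure" code
   WHERE psa.BATCH_NUMBER = 20110224130500      -- illustrative current batch
     AND ORA_HASH(psa.ROW_NUMBER, 99) < 10      -- roughly a 10% sample
     AND NOT EXISTS (
           SELECT 1
             FROM ETL_QC_DRIVER d               -- assumed layout
            WHERE d.COLUMN_NAME   = 'STATE_CODE'
              AND d.ALLOWED_VALUE = psa.STATE_CODE
         );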

Logging of Data Failures


Data that fails the QC job will not be loaded into the EDW, based on the defined rules.  An entry will be made in the ETL_QUALITY_ISSUES table below.  A severity indicator, assigned according to the rules (‘H’ HIGH, ‘L’ LOW), will allow resources to be used efficiently when tracing errors.

ETL_QUALITY_ISSUES


DATE (DATE) – date of entry
BATCH_NUMBER (NUMBER) – batch number of the process creating the entry
PROCESS_NAME (VARCHAR) – name of the process creating the entry
COLUMN_NAME (VARCHAR) – name of the column failing validation
COLUMN_VALUE (VARCHAR) – value of the column failing validation
EXPECTED_VALUE (VARCHAR) – expected value of the column failing validation
ISSUE_CODE (NUMBER) – issue code assigned to the error
SEVERITY (CHAR) – ‘H’ HIGH, ‘L’ LOW


ETL_QUALITY_AUDIT

DATE (DATE) – date of entry
BATCH_NUMBER (NUMBER) – batch number of the process creating the entry
PROCESS_NAME (VARCHAR) – name of the process creating the entry
RECORD_PROCESS_COUNT (NUMBER) – number of records processed
RECORD_COUNT_CHECKED (NUMBER) – number of records checked
PERCENTAGE_CHECKED (NUMBER) – percentage of the data set that was checked
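
As a usage sketch, a query like the one below could tie the quality tables back to a run by batch number so that high-severity failures are traced first; the ENTRY_DATE column name stands in for the reserved word DATE, as in the earlier sketches.

  -- Hypothetical triage query: high-severity failures from the last day,
  -- with the run-level counts from the quality audit alongside each issue.
  SELECT qi.BATCH_NUMBER, qi.PROCESS_NAME, qi.COLUMN_NAME,
         qi.COLUMN_VALUE, qi.EXPECTED_VALUE, qi.ISSUE_CODE,
         qa.RECORD_PROCESS_COUNT, qa.PERCENTAGE_CHECKED
    FROM ETL_QUALITY_ISSUES qi
    JOIN ETL_QUALITY_AUDIT  qa
      ON qa.BATCH_NUMBER = qi.BATCH_NUMBER
     AND qa.PROCESS_NAME = qi.PROCESS_NAME
   WHERE qi.SEVERITY   = 'H'
     AND qi.ENTRY_DATE >= TRUNC(SYSDATE) - 1
   ORDER BY qi.BATCH_NUMBER, qi.ISSUE_CODE;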



Closing


After reading this document you should have a better understanding of the issues associated with ETL processing.  This methodology has been created to address as many of those issues as possible while providing a high level of performance, ease of maintenance, and scalability, and it remains workable in a real-time ETL processing scenario.
