testing

Thursday, February 24, 2011

Testing Data Warehouse – A Four Step Approach


Testing Data Warehouse – A Four Step Approach

In today's fast paced business environment, it is almost always an unstated fact that the success of any Data Warehouse solution lies in its ability to not only analyze vast quantities of data over time but also to provide stakeholders and end-users meaningful options that are based on real-time data. This requirement mandates an extremely efficient system that can extract, transform, cleanse and load data from the source systems on a 24*7 basis without impacting the performance, scalability or causing system downtime.

One of the key elements contributing to the success of a Data Warehouse solution is the ability of the test team to plan, design and execute a set of effective tests that will help identify multiple issues related to data inconsistency, data quality, data security, failures in the extract, transform and load (ETL) process, performance related issues, accuracy of business flows and fitness for use from an end user perspective.

The primary focus of testing should be on the ETL process. This includes, validating the loading of all required rows, correct execution of all transformations and successful completion of the cleansing operation. The team also needs to thoroughly test SQL queries, stored procedures or queries that produce aggregate or summary tables. Keeping in tune with emerging trends, it is also important for test team to design and execute a set of tests that are customer experience -centric.
Fig 1: Key components of an effective Data Warehouse test strategy

As shown in the above picture, the focus of Data Warehouse test strategy is primarily on four key aspects including:

  • Data Quality Validation
  • End User & BI / Report Testing
  • Load and Performance Testing
  • End-to-End (E2E) Regression and Integration Testing

Data Quality Validation:
An essential part of the overall ETL test strategy is validating data for accuracy, which is core to any Data Warehouse tests. Validating data for quality includes test for data completeness, data transformation and data quality.

  • Data Completeness Tests: are designed to verify that all the expected data loads into the data warehouse. This includes running detailed tests to verify that all records are completely loaded without errors in content quality or quantity.
  • Data Transformation Tests: are designed to verify the accuracy of the transformation logic or transformation business rules. This can, at times, be a complex activity hence teams should consider using automated tools as part of the test strategy. Integration tests are generally a part of data transformation tests. This is covered in more detail in a separate section below.
  • Data Quality Tests: are designed to validate system behavior when data is rejected (example: data inaccuracy or missing data) during data correction and substitution. Scenario-based tests and Validation tests for the solution’s reporting feature are part of Data Quality Tests.

At a bare minimum data quality validation should ensure:
  • Extraction of data to the required fields
  • Proper functioning of the extraction logic for each source system (historical and incremental loads)
  • Availability of security access to source systems for extraction scripts
  • Updates to extract audit log and time stamping as per requirements
  • Completeness and accuracy of “Source to Extraction Destination” Transaction scripts, which are transforming the data as per the expected logic
  • Historical load transformation for historical snap-shots is working
  • Incremental load transformation for historical snap-shots is working
  • Detailed and aggregated data sets are created, and are matching
  • Transaction audit log and time stamping
  • No pilferage of data during Transformation process and also during historical and incremental load
  • Real-time or near-real time data loading occurs without impacting performance adversely
  • Multi-pass SQL statements update all the temporary tables with real-time or near-real time reporting and analytics

End User and BI / Report Testing:
Testing for accuracy of reports is another critical aspect in Data warehouse testing. Extreme care should be taken while testing, as reports are probably the only experience most users have with the Data Warehouse and Business Intelligence (DW/BI) system. The guiding philosophy is that testing reports should be as clear and self-explanatory as possible. Usability, performance, data accuracy and preview and/or export to different formats are areas where most of the failures occur.

When designing tests for End user and BI / Report testing, some key points to address include:

  • Data display on the business views and dashboard are as expected
  • Users can see reports according to their user profile (authentication and authorization)
  • Verification of Report format and content by appropriate end users
  • Verification on the Accuracy and completeness of the scheduled reports
  • OLAP, Drill down report, cross tab report, parent / child report etc are all working as expected
  • 'Analysis Functions' and 'Data Analysis' are working
  • No pilferage of data between the source systems and the views
  • Testing of Replicated reports from old system to new system for consistency of business rules
  • Previewing and/or exporting of reports to different formats such as spreadsheet, pdf, html, e-mail displays accurate and consistent data
  • Print facility, where applicable, produces expected output
  • Where graphs and data in tabular format exist, both should reflect consistent data



Load and Performance Testing:  
With increasing volume of data, stability and scalability become critical test parameters. Under stress from large transactional data volumes, data warehouses will typically not scale, and eventually fail, unless they are tested and issues are fixed. To avoid such problems, it is essential that the test team design and execute series of tests that validate the performance and scalability of the system under different loads. As part of this activity, the following tests can be executed:
  • Shutdown the server during batch process and validate the result
  • Perform ETL with load that is twice or thrice the maximum possible imagined data (for which the capacity is planned)
  • Run huge volumes of ad-hoc queries mimicked from multiple users simultaneously
  • Run large number of scheduled reports
  • Monitor the timing of the reject processes and check system behavior when handling large volumes of rejected data

E2E Integration and Regression Testing:
Integration tests show how the application fits into the overall flow of all upstream and downstream applications. When designing Integration Tests, the focus of the tester should be on the following topics:
·         How the overall process can break and focus on the integrations between different systems and their subsystems
·         Validating system behavior when different types of data (different user profiles, different data types, different data volumes etc) get processed and communication to the subsequent system
·         Run custom-designed regression tests that simulate end user behavior (ensures success of user-acceptance tests).

Usage of techniques like scenarios based testing, risk based testing and model based testing will enhance the effectiveness of testing. In addition, it is always a good idea to consider creating different tests by using test design techniques like Boundary Value Analysis (BVA) and Equivalence Partitioning (EP).

Summary: While basic testing philosophies hold good while testing a Data Warehouse implementation, it is important for test teams to understand that testing a Data Warehouse implementation is a different ball game. Since a Data Warehouse primary deals with data, a major portion of the test effort is spent on planning, designing and executing tests that are data oriented. These tests include running SQL queries, validating that ETL executes as expected, exceptions are handled effectively, application performance meets the SLAs and finally, ensuring that the integration points are working as expected. Planning and designing most of the test cases require the test team to have experience in SQL and performance testing. It will also be helpful if the team members have experience in debugging performance bottlenecks.

Another dimension of Data Warehouse testing is the dependency of the tests on the test environment. Since it is a known fact that in general, a test environment will not be as robust (high end servers, clustering, load balancing, data volumes and data accuracy), this will have an impact on some of the tests. For example, simulated or masked data that might not be reflecting all the characters of production data may restrict the accuracy of performance tests. In other cases, some of the jobs may not fail under simulated test environment. Test teams should be wary of such limitations and should factor such risks when designing their tests.

Detecting all possible defects may be complex, however a little bit of planning will go a long way in identifying the most obvious and costly defects early in the life cycle. Finally, a group of Data Warehouse Architects, Business Analysts and Test teams working together during the initial planning and design phase is one of the time tested approaches that can help in identifying and eliminating potential failures.

1 comment:

  1. Effective testing of the data warehouse gives business users the confidence that the information contained in the data warehouse is correct.STC Technologies|STC Technologies

    ReplyDelete