testing

Sunday, August 7, 2011

General Interview Questions

Tell us about yourself.
Why do you want to join us?
What would you like to be doing five years from now?
Do you prefer working with others or alone?
What are your biggest accomplishments?
What are your favorite subjects?
Why should we hire you?
What are your hobbies?
What is the worst feedback you have ever received?
What is the most difficult situation you have faced?
If you were given a project in some other field, would you work on it?
When you're disappointed, how do you overcome it?
What position would you like to have in a group?
Why didn't you take up the GRE?
How do you feel your technical interview went?
Why didn't you take a job after B.Tech? (Mainly for M.Techs.)
If you got a vast amount of money, what would you do with it?
Are you skeptical or enthusiastic?
Are you pessimistic or optimistic?
Do you want to ask us anything?
If you were asked to lead a group and your group did not work, what would you do?

Monday, February 28, 2011

LoadRunner Interview Questions - 1


LoadRunner Interview Questions and Answers

What is load testing?
Load testing checks whether the application works correctly under the load that results from a large number of simultaneous users and transactions, and determines whether it can handle peak usage periods.
What is Performance testing?
Timing for both read and update transactions should be gathered to determine whether system functions are being performed in an acceptable timeframe. This should be done standalone and then in a multi user environment to determine the effect of multiple transactions on the timing of a single transaction.
Did you use LoadRunner? What version?
Yes. Version 7.2.
Explain the Load testing process?
Step 1: Planning the test.
Here, we develop a clearly defined test plan to ensure the test scenarios we develop will accomplish load-testing objectives.
Step 2: Creating Vusers.

Here, we create Vuser scripts that contain tasks performed by each Vuser, tasks performed by Vusers as a whole, and tasks measured as transactions.
Step 3: Creating the scenario.

A scenario describes the events that occur during a testing session. It includes a list of machines, scripts, and Vusers that run during the scenario. We create scenarios using LoadRunner Controller. We can create manual scenarios as well as goal-oriented scenarios. In manual scenarios, we define the number of Vusers, the load generator machines, and percentage of Vusers to be assigned to each script. For web tests, we may create a goal-oriented scenario where we define the goal that our test has to achieve. LoadRunner automatically builds a scenario for us.
Step 4: Running the scenario.
We emulate load on the server by instructing multiple Vusers to perform tasks simultaneously. Before the testing, we set the scenario configuration and scheduling. We can run the entire scenario, Vuser groups, or individual Vusers.
Step 5: Monitoring the scenario.
We monitor scenario execution using the LoadRunner online runtime, transaction, system resource, Web resource, Web server resource, Web application server resource, database server resource, network delay, streaming media resource, firewall server resource, ERP server resource, and Java performance monitors.
Step 6: Analyzing test results.

During scenario execution, LoadRunner records the performance of the application under different loads. We use LoadRunner’s graphs and reports to analyze the application’s performance.
When do you do load and performance Testing?
We perform load testing once we are done with interface (GUI) testing. Modern system architectures are large and complex. Whereas single-user testing focuses primarily on the functionality and user interface of a system component, application testing focuses on the performance and reliability of an entire system. For example, a typical application-testing scenario might depict 1000 users logging in simultaneously to a system. This gives rise to issues such as: what is the response time of the system, does it crash, does it work with different software applications and platforms, can it handle hundreds or thousands of users, etc. This is when we do load and performance testing.
What are the components of LoadRunner?
The components of LoadRunner are the Virtual User Generator (VuGen), the Controller and the Agent process, LoadRunner Analysis and Monitoring, and LoadRunner Books Online.
What Component of LoadRunner would you use to record a Script?
The Virtual User Generator (VuGen) component is used to record a script. It enables you to develop Vuser scripts for a variety of application types and communication protocols.
What Component of LoadRunner would you use to play back the script in multi user mode?
The Controller component is used to play back the script in multi-user mode. This is done during a scenario run, where a Vuser script is executed by a number of Vusers in a group.
What is a rendezvous point?
You insert rendezvous points into Vuser scripts to emulate heavy user load on the server. Rendezvous points instruct Vusers to wait during test execution for multiple Vusers to arrive at a certain point, in order that they may simultaneously perform a task. For example, to emulate peak load on the bank server, you can insert a rendezvous point instructing 100 Vusers to deposit cash into their accounts at the same time.
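As a sketch, in the script this is a single call (the rendezvous and transaction names below are illustrative); the rendezvous release policy itself is configured in the Controller:

    Action()
    {
        /* all Vusers wait here until the configured number arrive,
           then perform the deposit at the same moment */
        lr_rendezvous("deposit_cash");

        lr_start_transaction("deposit");
        /* ... recorded deposit steps ... */
        lr_end_transaction("deposit", LR_AUTO);

        return 0;
    }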
What is a scenario?
A scenario defines the events that occur during each testing session. For example, a scenario defines and controls the number of users to emulate, the actions to be performed, and the machines on which the virtual users run their emulations.
Explain the recording mode for web Vuser script?
We use VuGen to develop a Vuser script by recording a user performing typical business processes on a client application. VuGen creates the script by recording the activity between the client and the server. For example, in web-based applications, VuGen monitors the client end of the application and traces all the requests sent to, and received from, the server. We use VuGen to: monitor the communication between the application and the server; generate the required function calls; and insert the generated function calls into a Vuser script.
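As an illustration, a recorded web Vuser script is ordinary C assembled from web_* calls; the URL and form fields below are hypothetical:

    Action()
    {
        /* VuGen generates a web_url call for each recorded page request */
        web_url("home",
            "URL=http://example.com/index.html",
            LAST);

        /* ...and a web_submit_data call for each recorded form submission */
        web_submit_data("login",
            "Action=http://example.com/login",
            "Method=POST",
            ITEMDATA,
            "Name=username", "Value=jdoe", ENDITEM,
            "Name=password", "Value=secret", ENDITEM,
            LAST);

        return 0;
    }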
Why do you create parameters?
Parameters are like script variables. They are used to vary the input to the server and to emulate real users: different sets of data are sent to the server each time the script is run. This better simulates the usage model for more accurate testing from the Controller, since one script can emulate many different users on the system.
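For example, after parameterizing the hypothetical login step shown earlier, the recorded literals are replaced with {parameter} references that draw a fresh row from a data file on each iteration:

    web_submit_data("login",
        "Action=http://example.com/login",
        "Method=POST",
        ITEMDATA,
        /* {userName} and {password} come from the parameter data file,
           so each Vuser/iteration logs in as a different user */
        "Name=username", "Value={userName}", ENDITEM,
        "Name=password", "Value={password}", ENDITEM,
        LAST);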
What is correlation? Explain the difference between automatic correlation and manual correlation?
Correlation is used to obtain data which are unique for each run of the script and which are generated by nested queries. Correlation provides the value to avoid errors arising out of duplicate values and also optimizes the code (to avoid nested queries). Automatic correlation is where we set some rules for correlation; it can be application-server specific. Here values are replaced by data created by these rules. In manual correlation, we scan for the value we want to correlate and use Create Correlation to correlate it.
How do you find out where correlation is required? Give few examples from your projects?

Two ways: first, we can scan for correlations and see the list of values which can be correlated, and from this pick a value to be correlated. Secondly, we can record two scripts and compare them; we can look at the difference file to see the values which need to be correlated. In my project, there was a unique id generated for each customer. It was the Insurance Number; it was generated automatically, it was sequential, and this value was unique. I had to correlate this value in order to avoid errors while running my script. I did this using scan for correlation.
Where do you set automatic correlation options?
Automatic correlation from the web point of view can be set in the Recording Options, Correlation tab. Here we can enable correlation for the entire script and choose either issue online messages or offline actions, where we can define rules for that correlation. Automatic correlation for a database can be done by using show output window, scanning for correlation, picking the correlate query tab, and choosing which query value we want to correlate. If we know the specific value to be correlated, we just use Create Correlation for the value and specify how the value is to be created.
What is a function to capture dynamic values in the web Vuser script?
The web_reg_save_param function saves dynamic data information to a parameter.
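A sketch of typical usage (the boundaries and page are illustrative): the registration call must come before the step whose server response contains the dynamic value:

    /* capture the session id returned by the next request, e.g.
       <input name="sessionId" value="XYZ123"> in the response body */
    web_reg_save_param("sessionId",
        "LB=name=\"sessionId\" value=\"",
        "RB=\"",
        "Ord=1",
        LAST);

    web_url("home",
        "URL=http://example.com/index.html",
        LAST);

    /* the captured value can now be replayed anywhere as {sessionId} */
    lr_output_message("captured session id: %s", lr_eval_string("{sessionId}"));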
When do you disable log in Virtual User Generator, When do you choose standard and extended logs?
Once we debug our script and verify that it is functional, we can enable logging for errors only. When we add a script to a scenario, logging is automatically disabled. Standard Log Option: when you select Standard log, it creates a standard log of the functions and messages sent during script execution, to use for debugging. Disable this option for large load testing scenarios. Extended Log Option: select Extended log to create an extended log, including warnings and other messages. Disable this option for large load testing scenarios as well. We can specify which additional information should be added to the extended log using the Extended log options.
How do you debug a LoadRunner script?
VuGen contains two options to help debug Vuser scripts: the Run Step by Step command and breakpoints. The Debug settings in the Options dialog box allow us to determine the extent of the trace to be performed during scenario execution. The debug information is written to the Output window. We can manually set the message class within the script using the lr_set_debug_message function. This is useful if we want to receive debug information about a small section of the script only.
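For instance, to get detailed tracing for one suspect block only (a sketch; LR_MSG_CLASS_EXTENDED_LOG and the switch constants are standard VuGen flags, and the {orderId} parameter is hypothetical):

    /* raise the message level for the suspect section only */
    lr_set_debug_message(LR_MSG_CLASS_EXTENDED_LOG, LR_SWITCH_ON);

    lr_debug_message(LR_MSG_CLASS_EXTENDED_LOG,
        "about to submit order id %s", lr_eval_string("{orderId}"));
    /* ... suspect steps ... */

    lr_set_debug_message(LR_MSG_CLASS_EXTENDED_LOG, LR_SWITCH_OFF);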
How do you write user defined functions in LR? Give me few functions you wrote in your previous project?
Before we create a user-defined function, we need to create the external library (DLL) containing the function. We add this library to the VuGen bin directory. Once the library is added, we assign the user-defined function as a parameter. The function should have the following format: __declspec(dllexport) char* <function name>(char*, char*). GetVersion, GetCurrentTime, and GetPlatform are some of the user-defined functions used in my earlier project.
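A minimal sketch of such a function and its use (the DLL name is hypothetical; GetVersion is one of the functions named above, with an illustrative body):

    /* DLL side: compiled separately and copied to the VuGen bin directory */
    __declspec(dllexport) char *GetVersion(char *arg1, char *arg2)
    {
        return "1.0.3";   /* illustrative value */
    }

    /* script side: load the library, then call the function directly */
    Action()
    {
        lr_load_dll("myfuncs.dll");   /* hypothetical DLL name */
        lr_output_message("build under test: %s", GetVersion("", ""));
        return 0;
    }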
What are the changes you can make in run-time settings?
The Run-Time Settings that we make are: a) Pacing - this includes the iteration count. b) Log - under this we have Disable Logging, Standard Log, and Extended Log. c) Think Time - here we have two options: Ignore think time and Replay think time. d) General - under the General tab we can set the Vusers to run as a process or as threads (multithreading), and whether each step should be marked as a transaction.
Where do you set Iteration for Vuser testing?
We set iterations in the Run-Time Settings of VuGen: Run-Time Settings > Pacing tab > set the number of iterations.
How do you perform functional testing under load?
Functionality under load can be tested by running several Vusers concurrently. By increasing the amount of Vusers, we can determine how much load the server can sustain.
What is Ramp up? How do you set this?
This option is used to gradually increase the amount of Vusers/load on the server. An initial value is set, and a value to wait between intervals can be specified. To set Ramp Up, go to 'Scenario Scheduling Options'.
What is the advantage of running the Vuser as thread?
VuGen provides the facility to use multithreading. This enables more Vusers to be run per generator. If the Vuser is run as a process, the same driver program is loaded into memory for each Vuser, thus taking up a large amount of memory. This limits the number of Vusers that can be run on a single generator. If the Vuser is run as a thread, only one instance of the driver program is loaded into memory for the given number of Vusers (say 100). Each thread shares the memory of the parent driver program, thus enabling more Vusers to be run per generator.
If you want to stop the execution of your script on error, how do you do that?
The lr_abort function aborts the execution of a Vuser script. It instructs the Vuser to stop executing the Actions section, execute the vuser_end section and end the execution. This function is useful when you need to manually abort a script execution as a result of a specific error condition. When you end a script using this function, the Vuser is assigned the status "Stopped". For this to take effect, we have to first uncheck the “Continue on error” option in Run-Time Settings.
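A sketch of the usual pattern (the URL and messages are illustrative), with "Continue on error" unchecked as noted above:

    web_url("checkout",
        "URL=http://example.com/checkout",
        LAST);

    /* stop this Vuser if the server did not answer with HTTP 200 */
    if (web_get_int_property(HTTP_INFO_RETURN_CODE) != 200) {
        lr_error_message("checkout failed - aborting Vuser");
        lr_abort();   /* runs vuser_end, Vuser ends with status Stopped */
    }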
What is the relation between Response Time and Throughput?
The Throughput graph shows the amount of data in bytes that the Vusers received from the server per second. When we compare this with the transaction response time, we will notice that as throughput decreased, the response time also decreased. Similarly, the peak throughput and the highest response time would occur at approximately the same time.
Explain the Configuration of your systems?
The configuration of our systems refers to that of the client machines on which we run the Vusers. The configuration of any client machine includes its hardware settings, memory, operating system, software applications, development tools, etc. This system component configuration should match with the overall system configuration that would include the network infrastructure, the web server, the database server, and any other components that go with this larger system so as to achieve the load testing objectives.
How do you identify the performance bottlenecks?
Performance bottlenecks can be detected by using monitors. These monitors might be application server monitors, web server monitors, database server monitors, and network monitors. They help in finding out the troubled areas in our scenario which cause increased response time. The measurements made usually include response time, throughput, hits per second, network delay graphs, etc.
If web server, database and Network are all fine where could be the problem?
The problem could be in the system itself or in the application server or in the code written for the application.
How did you find web server related issues?
Using Web resource monitors, we can find the performance of web servers. Using these monitors we can analyze the throughput on the web server, the number of hits per second that occurred during the scenario, the number of HTTP responses per second, and the number of downloaded pages per second.
How did you find database related issues?
By running the Database monitor with the help of the Data Resource Graph, we can find database-related issues. For example, you can specify the resource you want to measure before running the Controller, and then see the database-related issues.
What is the difference between Overlay graph and Correlate graph?
Overlay Graph: overlays the content of two graphs that share a common x-axis. The left y-axis of the merged graph shows the current graph's values, and the right y-axis shows the values of the graph that was merged. Correlate Graph: plots the y-axes of two graphs against each other. The active graph's y-axis becomes the x-axis of the merged graph, and the y-axis of the graph that was merged becomes the merged graph's y-axis.
How did you plan the Load? What are the Criteria?
A load test is planned to decide the number of users, what kind of machines we are going to use, and from where they are run. It is based on two important documents: the Task Distribution Diagram and the Transaction Profile. The Task Distribution Diagram gives us the information on the number of users for a particular transaction and the time of the load. The peak usage and off-usage are decided from this diagram. The Transaction Profile gives us the information about the transaction names and their priority levels with regard to the scenario we are designing.
What does vuser_init action contain?
The vuser_init action contains procedures to log in to a server.
What does vuser_end action contain?
The vuser_end section contains log-off procedures.
What is think time? How do you change the threshold?
Think time is the time that a real user waits between actions. For example, when a user receives data from a server, the user may wait several seconds to review the data before responding; this delay is known as the think time. Changing the threshold: the threshold level is the level below which recorded think time will be ignored. The default value is five (5) seconds. We can change the think time threshold in the Recording Options of VuGen.
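In the script, recorded pauses show up as lr_think_time calls between steps (a sketch; the page and the 12-second pause are illustrative). How they replay - as recorded, ignored, or limited - is controlled in the Run-Time Settings:

    web_url("account_summary",
        "URL=http://example.com/account",
        LAST);

    /* the recorded user paused 12 seconds reviewing the page; pauses
       below the recording threshold are not generated at all */
    lr_think_time(12);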
What is the difference between standard log and extended log?
The standard log sends a subset of the functions and messages sent during script execution to a log; the subset depends on the Vuser type. The extended log sends detailed script execution messages to the output log. This is mainly used during debugging, when we want information about parameter substitution, data returned by the server, or an advanced trace.
Explain the following functions: lr_debug_message, lr_output_message, lr_error_message, lrd_stmt, lrd_fetch.
lr_debug_message - sends a debug message to the output log when the specified message class is set.
lr_output_message - sends notifications to the Controller Output window and the Vuser log file.
lr_error_message - sends an error message to the LoadRunner Output window.
lrd_stmt - associates a character string (usually a SQL statement) with a cursor; this function sets a SQL statement to be processed.
lrd_fetch - fetches the next row from the result set.
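A short sketch showing the three message functions side by side (the values are illustrative):

    int rows = 42;   /* illustrative value */

    /* written only when the matching message class is switched on */
    lr_debug_message(LR_MSG_CLASS_EXTENDED_LOG, "rows fetched: %d", rows);

    /* always written to the Controller Output window and the Vuser log */
    lr_output_message("iteration complete, %d rows processed", rows);

    /* flagged as an error in the LoadRunner Output window */
    lr_error_message("unexpected row count: %d", rows);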
Throughput
If the throughput scales upward as time progresses and the number of Vusers increases, this indicates that the bandwidth is sufficient. If the graph were to remain relatively flat as the number of Vusers increased, it would be reasonable to conclude that the bandwidth is constraining the volume of data delivered.
Types of Goals in Goal-Oriented Scenario
LoadRunner provides you with five different types of goals in a goal-oriented scenario:
The number of concurrent Vusers
The number of hits per second
The number of transactions per second
The number of pages per minute
The transaction response time that you want your scenario to reach
Analysis Scenario (Bottlenecks):
In the Running Vusers graph correlated with the Response Time graph, you can see that as the number of Vusers increases, the average response time of the check-itinerary transaction gradually increases. In other words, the average response time steadily increases as the load increases. At 56 Vusers there is a sudden, sharp increase in the average response time; we say that the test broke the server. That is the mean time before failure (MTBF). The response time clearly began to degrade when there were more than 56 Vusers running simultaneously.


Thursday, February 24, 2011

Testing Data Warehouse – A Four Step Approach



In today's fast-paced business environment, it is almost always an unstated fact that the success of any Data Warehouse solution lies in its ability not only to analyze vast quantities of data over time but also to provide stakeholders and end users meaningful options that are based on real-time data. This requirement mandates an extremely efficient system that can extract, transform, cleanse, and load data from the source systems on a 24x7 basis without impacting performance or scalability, or causing system downtime.

One of the key elements contributing to the success of a Data Warehouse solution is the ability of the test team to plan, design and execute a set of effective tests that will help identify multiple issues related to data inconsistency, data quality, data security, failures in the extract, transform and load (ETL) process, performance related issues, accuracy of business flows and fitness for use from an end user perspective.

The primary focus of testing should be on the ETL process. This includes validating the loading of all required rows, the correct execution of all transformations, and the successful completion of the cleansing operation. The team also needs to thoroughly test SQL queries, stored procedures, and queries that produce aggregate or summary tables. Keeping in tune with emerging trends, it is also important for the test team to design and execute a set of tests that are customer-experience-centric.
Fig 1: Key components of an effective Data Warehouse test strategy

As shown in the figure above, the focus of a Data Warehouse test strategy is primarily on four key aspects:

  • Data Quality Validation
  • End User & BI / Report Testing
  • Load and Performance Testing
  • End-to-End (E2E) Regression and Integration Testing

Data Quality Validation:
An essential part of the overall ETL test strategy is validating data for accuracy, which is core to any Data Warehouse test. Validating data for quality includes tests for data completeness, data transformation, and data quality.

  • Data Completeness Tests: are designed to verify that all the expected data is loaded into the data warehouse. This includes running detailed tests to verify that all records are completely loaded, without errors in content quality or quantity (a minimal row-count sketch follows this list).
  • Data Transformation Tests: are designed to verify the accuracy of the transformation logic or transformation business rules. This can at times be a complex activity; hence, teams should consider using automated tools as part of the test strategy. Integration tests are generally a part of data transformation tests; this is covered in more detail in a separate section below.
  • Data Quality Tests: are designed to validate system behavior when data is rejected (for example, due to data inaccuracy or missing data) during data correction and substitution. Scenario-based tests and validation tests for the solution's reporting feature are part of Data Quality Tests.
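As a trivial illustration of the completeness idea (a sketch, not a real harness; the file names are hypothetical), a row-count reconciliation between a source extract and the loaded target might look like:

    #include <stdio.h>

    /* count newline-terminated records in an export file */
    static long count_rows(const char *path)
    {
        FILE *f = fopen(path, "r");
        long rows = 0;
        int ch;
        if (f == NULL) return -1;
        while ((ch = fgetc(f)) != EOF)
            if (ch == '\n') rows++;
        fclose(f);
        return rows;
    }

    int main(void)
    {
        long src = count_rows("source_extract.csv");   /* hypothetical file */
        long tgt = count_rows("warehouse_load.csv");   /* hypothetical file */

        if (src < 0 || tgt < 0) { fprintf(stderr, "missing extract file\n"); return 2; }
        if (src != tgt) {
            fprintf(stderr, "completeness check FAILED: source=%ld target=%ld\n", src, tgt);
            return 1;
        }
        printf("completeness check passed: %ld rows\n", src);
        return 0;
    }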

At a bare minimum, data quality validation should ensure:
  • Extraction of data to the required fields
  • Proper functioning of the extraction logic for each source system (historical and incremental loads)
  • Availability of security access to source systems for extraction scripts
  • Updates to the extract audit log and time stamping as per requirements
  • Completeness and accuracy of "Source to Extraction Destination" transaction scripts, which transform the data as per the expected logic
  • Historical load transformation for historical snapshots is working
  • Incremental load transformation for historical snapshots is working
  • Detailed and aggregated data sets are created and match each other
  • Transaction audit logging and time stamping
  • No pilferage of data during the transformation process or during historical and incremental loads
  • Real-time or near-real-time data loading occurs without adversely impacting performance
  • Multi-pass SQL statements update all the temporary tables for real-time or near-real-time reporting and analytics

End User and BI / Report Testing:
Testing the accuracy of reports is another critical aspect of Data Warehouse testing. Extreme care should be taken while testing, as reports are probably the only experience most users have with the Data Warehouse and Business Intelligence (DW/BI) system. The guiding philosophy is that reports should be as clear and self-explanatory as possible. Usability, performance, data accuracy, and preview and/or export to different formats are the areas where most failures occur.

When designing tests for End user and BI / Report testing, some key points to address include:

  • Data displayed on the business views and dashboards is as expected
  • Users can see reports according to their user profile (authentication and authorization)
  • Verification of report format and content by appropriate end users
  • Verification of the accuracy and completeness of the scheduled reports
  • OLAP, drill-down reports, cross-tab reports, parent/child reports, etc. are all working as expected
  • 'Analysis Functions' and 'Data Analysis' are working
  • No pilferage of data between the source systems and the views
  • Testing of reports replicated from the old system to the new system for consistency of business rules
  • Previewing and/or exporting reports to different formats such as spreadsheet, PDF, HTML, and e-mail displays accurate and consistent data
  • The print facility, where applicable, produces the expected output
  • Where graphs and data in tabular format exist, both should reflect consistent data



Load and Performance Testing:  
With increasing volumes of data, stability and scalability become critical test parameters. Under stress from large transactional data volumes, data warehouses will typically fail to scale, and eventually fail, unless they are tested and the issues found are fixed. To avoid such problems, it is essential that the test team design and execute a series of tests that validate the performance and scalability of the system under different loads. As part of this activity, the following tests can be executed:
  • Shut down the server during a batch process and validate the result
  • Perform ETL with a load two or three times the maximum anticipated data volume (for which capacity is planned)
  • Run huge volumes of ad-hoc queries, mimicking multiple simultaneous users
  • Run a large number of scheduled reports
  • Monitor the timing of the reject processes and check system behavior when handling large volumes of rejected data

E2E Integration and Regression Testing:
Integration tests show how the application fits into the overall flow of all upstream and downstream applications. When designing integration tests, the focus of the tester should be on the following topics:
  • How the overall process can break, with focus on the integrations between different systems and their subsystems
  • Validating system behavior when different types of data (different user profiles, data types, data volumes, etc.) are processed and communicated to the subsequent system
  • Running custom-designed regression tests that simulate end-user behavior (this helps ensure the success of user-acceptance tests)

Usage of techniques like scenario-based testing, risk-based testing, and model-based testing will enhance the effectiveness of testing. In addition, it is always a good idea to consider creating different tests by using test design techniques like Boundary Value Analysis (BVA) and Equivalence Partitioning (EP).

Summary: While basic testing philosophies hold good when testing a Data Warehouse implementation, it is important for test teams to understand that testing a Data Warehouse implementation is a different ball game. Since a Data Warehouse primarily deals with data, a major portion of the test effort is spent on planning, designing, and executing tests that are data-oriented. These tests include running SQL queries, validating that ETL executes as expected, that exceptions are handled effectively, that application performance meets the SLAs, and finally, ensuring that the integration points are working as expected. Planning and designing most of the test cases requires the test team to have experience in SQL and performance testing. It will also be helpful if the team members have experience in debugging performance bottlenecks.

Another dimension of Data Warehouse testing is the dependency of the tests on the test environment. Since, in general, a test environment will not be as robust as production (high-end servers, clustering, load balancing, data volumes, and data accuracy), this will have an impact on some of the tests. For example, simulated or masked data that does not reflect all the characteristics of production data may limit the accuracy of performance tests. In other cases, some of the jobs may not fail in a simulated test environment. Test teams should be wary of such limitations and should factor in such risks when designing their tests.

Detecting all possible defects may be complex; however, a little planning will go a long way in identifying the most obvious and costly defects early in the life cycle. Finally, a group of Data Warehouse architects, business analysts, and test teams working together during the initial planning and design phase is one of the time-tested approaches that can help in identifying and eliminating potential failures.

ETL Process Definitions and Deliverables



  • 1.0 Define Requirements – In this process you should understand the business needs by gathering information from the users.  You should understand what data is needed and whether it is available.  Resources should be identified for information or help with the process.
    • Deliverables
      • A logical description of how you will extract, transform, and load the data.
      • Sign-off of the customer(s).
    • Standards
      • Document ETL business requirements specification using either the ETL Business Requirements Specification Template, your own team-specific business requirements template or system, or Oracle Designer.
    • Templates
      • ETL Business Requirements Specification Template
  • 2.0 Create Physical Design – In this process you should define your inputs and outputs by documenting record layouts.  You should also identify and define the location of source and target, file/table sizing information, volume information, and how the data will be transformed.
    • Deliverables
      • Input and output record layouts
      • Location of source and target
      • File/table sizing information
      • File/table volume information
      • Documentation on how the data will be transformed, if at all
    • Standards
      • Complete ETL Business Requirements Specification using one of the methods documented in the previous steps.
      • Start ETL Mapping Specification
    • Templates
      • ETL Business Requirements Specification Template
      • ETL Mapping Specification Template
  • 3.0 Design Test Plan – Understand what the data combinations are and define what results are expected.  Remember to include error checks.  Decide how many test cases need to be built.  Look at technical risk and include security.  Test business requirements.
    • Deliverables
      • ETL Test Plan
      • ETL Performance Test Plan
    • Standards
      • Document ETL test plan and performance plan using either the standard templates listed below or your own team-specific template(s).
    • Templates
      • ETL Test Plan Template
      • ETL Performance Test Plan Template
  • 4.0 Create ETL Process – Start creating the actual Informatica ETL process.  The developer is actually doing some testing in this process.
    • Deliverables
      • Mapping Specification
      • Mapping
      • Workflow
      • Session
    • Standards
      • Start the ETL Object Migration Form
      • Start Database Object Migration Form (if applicable)
      • Complete ETL Mapping Specification
      • Complete cleanup process for log and bad files – Refer to Standard_ETL_File_Cleanup.doc
      • Follow Informatica Naming Standards
    • Templates
      • ETL Object Migration Form
      • ETL Mapping Specification Template
      • Database Object Migration Form (if applicable)
  • 5.0 Test Process – The developer does the following types of tests: unit, volume, and performance.
    • Deliverables
      • ETL Test Plan
      • ETL Performance Test Plan
    • Standards
      • Complete ETL Test Plan
      • Complete ETL Performance Test Plan
    • Templates
      • ETL Test Plan Template
      • ETL Performance Test Plan
  • 6.0 Walkthrough ETL Process – Within the walkthrough, the following factors should be addressed: common modules (reusable objects), efficiency of the ETL code, business logic, accuracy, and standardization.
    • Deliverables
      • ETL process that has been reviewed
    • Standards
      • Conduct ETL Process Walkthrough
    • Templates
      • ETL Mapping Walkthrough Checklist Template
  • 7.0 Coordinate Move to QA – The developer works with the ETL Administrator to organize ETL Process move to QA.
    • Deliverables
      • ETL process moved to QA
    • Standards
      • Complete ETL Object Migration Form
      • Complete Unix Job Setup Request Form
      • Complete Database Object Migration Form (if applicable)
    • Templates
      • ETL Object Migration Form
      • Unix Job Setup Request Form
      • Database Object Migration Form
  • 8.0 Test Process – At this point, the developer once again tests the process after it has been moved to QA.
    • Deliverables
      • Tested ETL process
    • Standards
      • Developer validates ETL Test Plan and ETL Performance Test Plan
    • Templates
      • ETL Test Plan Template
      • ETL Performance Test Plan Template
  • 9.0 User Validates Data – The user validates the data and makes sure it satisfies the business requirements.
    • Deliverables
      • Validated ETL process
    • Standards
      • Validate Business Requirement Specifications with the data
    • Templates
      • ETL Business Requirement Specifications Template
  • 10.0 Coordinate Move to Production - The developer works with the ETL Administrator to organize ETL Process move to Production.
    • Deliverables
      • Accurate and efficient ETL process moved to production
    • Standards
      • Complete ETL Object Migration Form
      • Complete Unix Job Setup Request Form
      • Complete Database Object Migration Form (if applicable)
    • Templates
      • ETL Object Migration Form
      • Unix Job Setup Request Form
      • Database Object Migration Form (if applicable)
  • 11.0 Maintain ETL Process – There are a couple of situations to consider when maintaining an ETL process: maintenance when an ETL process breaks, and maintenance when an ETL process needs updating.
    • Deliverables
      • Accurate and efficient ETL process in production
    • Standards
      • Updated Business Requirements Specification (if needed)
      • Updated Mapping Specification (if needed)
      • Revised mapping in appropriate folder
      • Updated ETL Object Migration Form
      • Developer checks final results in production
      • All monitoring (finding problems) of the ETL process is the responsibility of the project team
    • Templates
      • Business Requirements Specification Template
      • Mapping Specification Template
      • ETL Object Migration Form
      • Unix Job Setup Request Form
      • Database Object Migration Form (if applicable)

ETL Methodology


Overview


This document is designed for use by business associates and technical resources to better understand the process of building a data warehouse and the methodology employed to build the EDW.

This methodology has been designed to provide the following benefits:
  1. A high level of performance
  2. Scalability to any size
  3. Ease of maintenance
  4. Boiler-plate development
  5. Standard documentation techniques


ETL Definitions


  • ETL – Extract Transform Load: the physical process of extracting data from a source system, transforming the data to the desired state, and loading it into a database
  • EDW – Enterprise Data Warehouse: the logical data warehouse designed for enterprise information storage and reporting
  • DM – Data Mart: a small subset of a data warehouse specifically defined for a subject area

Documentation Specifications


A primary driver of the entire process is accurate business information requirements.  TDD Consulting will use standard documents prepared by the Project Management Institute for requirements gathering, project signoff, and compiling all testing information.

ETL Naming Conventions


To maintain consistency, all ETL processes will follow a standard naming methodology.

Tables

All destination tables will utilize the following naming convention:
   EDW_<SUBJECT>_<TYPE>

There are six types of tables used in a data warehouse: Fact, Dimension, Aggregate, Staging, Temp, and Audit.  A quick overview of the table types follows, with sample names listed after it.

Fact – a table type that contains atomic data
Dimension – a table type that contains referential data needed by the fact tables
Aggregate – a table type used to aggregate data, forming a pre-computed answer to a business question (e.g. totals by day)
Staging – a table type used to store data during ETL processing, where the data is not removed immediately
Temp – a table type used during ETL processing whose contents can be truncated immediately afterwards (e.g. storing order ids for lookup)
Audit – a table type used to keep track of the ETL process (e.g. processing times by job)

Each type of table will be kept in a separate schema.  This will decrease maintenance work and time spent looking for a specific table.

Sample table names:
  • EDW_RX_FACT – fact table containing RX subject matter
  • EDW_TIME_DIM – dimension table containing TIME subject matter
  • EDW_CUSTOMER_AG – aggregate table containing CUSTOMER subject matter
  • ETL_PROCESS_AUDIT – audit table containing PROCESS data
  • STG_DI_CUSTOMER – staging table sourced from the DI system, used for CUSTOMER data processing
  • ETL_ADDRESS_TEMP – temp table used for ADDRESS processing

ETL Processing


The following types of ETL jobs will be used for processing. The list below gives each job type, its function, its naming convention, and an example name.

  • Extract – extracts information from a source system and places it in a staging table. Naming: Extract<Source><Subject>, e.g. ExtractDICustomer
  • Source – sources information from STG tables and performs column validation. Naming: Source<Table>, e.g. SourceSTGDICustomer
  • LoadTemp – loads temp tables used in processing. Naming: LoadTemp<Table>, e.g. LoadTempETLAddressTemp
  • LookupDimension – looks up dimension tables. Naming: LookupDimension<Subject>, e.g. LookupDimensionCustomer
  • Transform – transforms the subject-area data and generates insert files. Naming: Transform<Subject>, e.g. TransformCustomer
  • QualityCheck – checks the quality of the data before it is loaded into the EDW. Naming: QualityCheck<Subject>, e.g. QualityCheckCustomer
  • Load – loads the data into the EDW. Naming: Load<Table>, e.g. LoadEDWCustomerFact

ETL Job Standards


All ETL jobs will be created with a boiler-plate approach.  This approach allows for rapid creation of similar jobs while keeping maintenance low.

Comments


Every job will have a standard comment template that specifically spells out the following attributes of the job:

Job Name:            LoadEDWCustomerFact
Purpose:               Load the EDW_Customer_Fact table
Predecessor:         QualityCheckCustomer
Date:                     April 21, 2006
Author:                 Wes Dumey
Revision History: 
April 21, 2006 – Created the job from standard template
April 22, 2006 – Added error checking for table insert

In addition, there will be a job data dictionary that describes every job in a table, so that it can be easily searched via standard SQL.

Persistent Staging Areas


Data will be received from the source systems in their native format.  The data will be stored in a PSA table following the naming standards listed previously.  The table will contain the following layout:

Column        Data Type  Explanation
ROW_NUMBER    NUMBER     Unique for each row in the PSA
DATE          DATE       Date the row was placed in the PSA
STATUS_CODE   CHAR(1)    Indicates the status of the row ('I' inducted, 'P' processed, 'R' rejected)
ISSUE_CODE    NUMBER     Code uniquely identifying the problem with the data if STATUS_CODE = 'R'
BATCH_NUMBER  NUMBER     Batch number used to process the data (auditing)
(data columns follow)


Auditing


The ETL methodology maintains a process for providing audit and logging capabilities. 

For each run of the process, a unique batch number composed of time segments is created.  This batch number is loaded with the data into the PSA and all target tables.  In addition, an entry with the following data elements will be made in the ETL_PROCESS_AUDIT table.

Column                Data Type  Explanation
DATE                  DATE       (Indexed) run date
BATCH_NUMBER          NUMBER     Batch number of the process
PROCESS_NAME          VARCHAR    Name of the process that was executed
PROCESS_RUN_TIME      TIMESTAMP  Time (HH:MI:SS) of process execution
PROCESS_STATUS        CHAR       'S' SUCCESS, 'F' FAILURE
ISSUE_CODE            NUMBER     Code of the issue related to a process failure (if 'F')
RECORD_PROCESS_COUNT  NUMBER     Row count of records processed during the run

The audit process will allow for efficient logging of process execution and encountered errors.
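A sketch of building such a time-segment batch number (the YYYYMMDDHHMISS layout is assumed for illustration, matching the sample dates used elsewhere in this document):

    #include <stdlib.h>
    #include <time.h>

    /* build a batch number such as 20060421143055 from the current time */
    long long make_batch_number(void)
    {
        char buf[15];
        time_t now = time(NULL);
        strftime(buf, sizeof buf, "%Y%m%d%H%M%S", localtime(&now));
        return atoll(buf);
    }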

Quality


Due to the sensitive nature of data within the EDW, data quality is a driving priority.  Quality will be handled through the following processes:

  1. Source job – the source job will contain a quick data-scrubbing mechanism that verifies the data conforms to the expected type (numeric fields contain numbers, character fields contain letters); a type-check sketch follows this list.
  2. Transform – the transform job will contain matching metadata for the target table and will verify that NULL values are not loaded into NOT NULL columns and that the data is transformed correctly.
  3. QualityCheck – a separate job is created to do a cursory check on a few identified columns and verify that the correct data is loaded into these columns.
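A minimal sketch of the type check described in step 1, assuming each column has a declared expected type of 'N' (numeric) or 'C' (character):

    #include <ctype.h>

    /* returns 1 if every character in value matches the expected type:
       'N' = numeric column (digits only), 'C' = character column (letters only) */
    int conforms(const char *value, char expected_type)
    {
        for (; *value != '\0'; value++) {
            if (expected_type == 'N' && !isdigit((unsigned char)*value))
                return 0;
            if (expected_type == 'C' && !isalpha((unsigned char)*value))
                return 0;
        }
        return 1;
    }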

Source Quality


A data-scrubbing mechanism will be constructed.  This mechanism will check identified columns for any anomalies (e.g. embedded carriage returns) and for value domains.  If an error is discovered, the data is fixed and a record is written to the ETL_QUALITY_ISSUES table (see below for the table definition).

Transform Quality


The transformation job will employ a matching-metadata technique. If the target table enforces NOT NULL constraints, a check will be built into the job preventing NULLs from being loaded and causing a job-stream abend.



Quality Check


Quality check is the last point of validation within the job stream. QC can be configured to check any percentage of rows (0-100%) and any number of columns (1-X).  QC is designed to pay attention to the most valuable or vulnerable rows within the data sets. QC will use a modified version of the data-scrubbing engine used during the source job to derive correct values, referencing the rules listed in the ETL_QC_DRIVER table.  Any suspect rows will be pulled from the insert/update files and updated to an 'R' status in the PSA table, and an issue code will be created for the failure.
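As an illustration of the configurable sampling rate (a sketch; the percentage rule is assumed, not prescribed), each row can be admitted to the QC sample like this:

    #include <stdlib.h>

    /* returns 1 if this row should be quality-checked;
       pct is the configured percentage of rows to check (0-100) */
    int in_qc_sample(int pct)
    {
        return (rand() % 100) < pct;
    }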

Logging of Data Failures


Data that fails the QC job will not be loaded into the EDW, based on defined rules.  An entry will be made in the following table (ETL_QUALITY_ISSUES).  An indicator will show the severity of the failure as defined in the rules ('H' HIGH, 'L' LOW).  This indicator will allow resources to be used efficiently to trace errors.

ETL_QUALITY_ISSUES


Column          Data Type  Explanation
DATE            DATE       Date of entry
BATCH_NUMBER    NUMBER     Batch number of the process creating the entry
PROCESS_NAME    VARCHAR    Name of the process creating the entry
COLUMN_NAME     VARCHAR    Name of the column failing validation
COLUMN_VALUE    VARCHAR    Value of the column failing validation
EXPECTED_VALUE  VARCHAR    Expected value of the column failing validation
ISSUE_CODE      NUMBER     Issue code assigned to the error
SEVERITY        CHAR       'H' HIGH, 'L' LOW


ETL_QUALITY_AUDIT

Column                Data Type  Explanation
DATE                  DATE       Date of entry
BATCH_NUMBER          NUMBER     Batch number of the process creating the entry
PROCESS_NAME          VARCHAR    Name of the process creating the entry
RECORD_PROCESS_COUNT  NUMBER     Number of records processed
RECORD_COUNT_CHECKED  NUMBER     Number of records checked
PERCENTAGE_CHECKED    NUMBER     Percentage of records checked out of the data set



Closing


After reading this ETL document you should have a better understanding of the issues associated with ETL processing.  This methodology has been created to address as many of the pitfalls as possible while providing a high level of performance and ease of maintenance, and remaining scalable and workable in a real-time ETL processing scenario.