testing

Sunday, August 7, 2011

General Interview Questions

Tell us about yourself.
Why do you want to join us?
What would you like to be doing five years from now?
Do you prefer working with others or alone?
What are your biggest accomplishments?
What are your favorite subjects?
Why should we hire you?
What are your hobbies?
What is the worst feedback you have ever received?
What is the most difficult situation you have faced?
If you were given a project in some other field, would you work on it?
When you're disappointed, how do you overcome it?
What position would you like to have in a group?
Why didn't you take up the GRE?
How do you feel your technical interview went?
Why didn't you take a job after B.Tech? (Mainly for M.Techs.)
If you got a vast amount of money, what would you do with it?
Are you skeptical or enthusiastic?
Are you pessimistic or optimistic?
Do you want to ask us anything?
If you were asked to lead a group and your group did not work, what would you do?

Monday, February 28, 2011

LoadRunner Interview Questions - 1


LoadRunner Interview Questions and Answers

What is load testing?
Load testing checks whether the application works correctly under the load that results from a large number of simultaneous users and transactions, and determines whether it can handle peak usage periods.
What is Performance testing?
Timing for both read and update transactions should be gathered to determine whether system functions are being performed in an acceptable timeframe. This should be done standalone and then in a multi user environment to determine the effect of multiple transactions on the timing of a single transaction.
Did you use LoadRunner? What version?
Yes. Version 7.2.
Explain the Load testing process?
Step 1: Planning the test.
Here, we develop a clearly defined test plan to ensure the test scenarios we develop will accomplish load-testing objectives.
Step 2: Creating Vusers.

Here, we create Vuser scripts that contain tasks performed by each Vuser, tasks performed by Vusers as a whole, and tasks measured as transactions.
Step 3: Creating the scenario.

A scenario describes the events that occur during a testing session. It includes a list of machines, scripts, and Vusers that run during the scenario. We create scenarios using LoadRunner Controller. We can create manual scenarios as well as goal-oriented scenarios. In manual scenarios, we define the number of Vusers, the load generator machines, and percentage of Vusers to be assigned to each script. For web tests, we may create a goal-oriented scenario where we define the goal that our test has to achieve. LoadRunner automatically builds a scenario for us.
Step 4: Running the scenario.
We emulate load on the server by instructing multiple Vusers to perform tasks simultaneously. Before the testing, we set the scenario configuration and scheduling. We can run the entire scenario, Vuser groups, or individual Vusers.
Step 5: Monitoring the scenario.
We monitor scenario execution using the LoadRunner online runtime, transaction, system resource, Web resource, Web server resource, Web application server resource, database server resource, network delay, streaming media resource, firewall server resource, ERP server resource, and Java performance monitors.
Step 6: Analyzing test results.

During scenario execution, LoadRunner records the performance of the application under different loads. We use LoadRunner’s graphs and reports to analyze the application’s performance.
When do you do load and performance Testing?
We perform load testing once we are done with interface (GUI) testing. Modern system architectures are large and complex. Whereas single-user testing focuses primarily on the functionality and user interface of a system component, application testing focuses on the performance and reliability of an entire system. For example, a typical application-testing scenario might depict 1000 users logging in simultaneously to a system. This gives rise to issues such as: what is the response time of the system, does it crash, does it work with different software applications and platforms, can it handle hundreds or thousands of users, etc. This is when we do load and performance testing.
What are the components of LoadRunner?
The components of LoadRunner are the Virtual User Generator (VuGen), the Controller and the Agent process, LoadRunner Analysis and Monitoring, and LoadRunner Books Online.
What Component of LoadRunner would you use to record a Script?
The Virtual User Generator (VuGen) component is used to record a script. It enables you to develop Vuser scripts for a variety of application types and communication protocols.
What Component of LoadRunner would you use to play back the script in multi user mode?
The Controller component is used to play back the script in multi-user mode. This is done during a scenario run, where a Vuser script is executed by a number of Vusers in a group.
What is a rendezvous point?
You insert rendezvous points into Vuser scripts to emulate heavy user load on the server. Rendezvous points instruct Vusers to wait during test execution for multiple Vusers to arrive at a certain point, in order that they may simultaneously perform a task. For example, to emulate peak load on the bank server, you can insert a rendezvous point instructing 100 Vusers to deposit cash into their accounts at the same time.
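As a sketch, in the script this is a single call (the rendezvous and transaction names below are illustrative); the rendezvous release policy itself is configured in the Controller:

    Action()
    {
        /* all Vusers wait here until the configured number arrive,
           then perform the deposit at the same moment */
        lr_rendezvous("deposit_cash");

        lr_start_transaction("deposit");
        /* ... recorded deposit steps ... */
        lr_end_transaction("deposit", LR_AUTO);

        return 0;
    }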
What is a scenario?
A scenario defines the events that occur during each testing session. For example, a scenario defines and controls the number of users to emulate, the actions to be performed, and the machines on which the virtual users run their emulations.
Explain the recording mode for web Vuser script?
We use VuGen to develop a Vuser script by recording a user performing typical business processes on a client application. VuGen creates the script by recording the activity between the client and the server. For example, in web-based applications, VuGen monitors the client end of the application and traces all the requests sent to, and received from, the server. We use VuGen to: monitor the communication between the application and the server; generate the required function calls; and insert the generated function calls into a Vuser script.
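As an illustration, a recorded web Vuser script is ordinary C assembled from web_* calls; the URL and form fields below are hypothetical:

    Action()
    {
        /* VuGen generates a web_url call for each recorded page request */
        web_url("home",
            "URL=http://example.com/index.html",
            LAST);

        /* ...and a web_submit_data call for each recorded form submission */
        web_submit_data("login",
            "Action=http://example.com/login",
            "Method=POST",
            ITEMDATA,
            "Name=username", "Value=jdoe", ENDITEM,
            "Name=password", "Value=secret", ENDITEM,
            LAST);

        return 0;
    }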
Why do you create parameters?
Parameters are like script variables. They are used to vary the input to the server and to emulate real users: different sets of data are sent to the server each time the script is run. This better simulates the usage model for more accurate testing from the Controller, since one script can emulate many different users on the system.
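For example, after parameterizing the hypothetical login step shown earlier, the recorded literals are replaced with {parameter} references that draw a fresh row from a data file on each iteration:

    web_submit_data("login",
        "Action=http://example.com/login",
        "Method=POST",
        ITEMDATA,
        /* {userName} and {password} come from the parameter data file,
           so each Vuser/iteration logs in as a different user */
        "Name=username", "Value={userName}", ENDITEM,
        "Name=password", "Value={password}", ENDITEM,
        LAST);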
What is correlation? Explain the difference between automatic correlation and manual correlation?
Correlation is used to obtain data which are unique for each run of the script and which are generated by nested queries. Correlation provides the value to avoid errors arising out of duplicate values and also optimizes the code (to avoid nested queries). Automatic correlation is where we set some rules for correlation; it can be application-server specific. Here values are replaced by data created by these rules. In manual correlation, we scan for the value we want to correlate and use Create Correlation to correlate it.
How do you find out where correlation is required? Give few examples from your projects?

Two ways: first, we can scan for correlations and see the list of values which can be correlated, and from this pick a value to be correlated. Secondly, we can record two scripts and compare them; we can look at the difference file to see the values which need to be correlated. In my project, there was a unique id generated for each customer. It was the Insurance Number; it was generated automatically, it was sequential, and this value was unique. I had to correlate this value in order to avoid errors while running my script. I did this using scan for correlation.
Where do you set automatic correlation options?
Automatic correlation from the web point of view can be set in the Recording Options, Correlation tab. Here we can enable correlation for the entire script and choose either issue online messages or offline actions, where we can define rules for that correlation. Automatic correlation for a database can be done by using show output window, scanning for correlation, picking the correlate query tab, and choosing which query value we want to correlate. If we know the specific value to be correlated, we just use Create Correlation for the value and specify how the value is to be created.
What is a function to capture dynamic values in the web Vuser script?
The web_reg_save_param function saves dynamic data information to a parameter.
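A sketch of typical usage (the boundaries and page are illustrative): the registration call must come before the step whose server response contains the dynamic value:

    /* capture the session id returned by the next request, e.g.
       <input name="sessionId" value="XYZ123"> in the response body */
    web_reg_save_param("sessionId",
        "LB=name=\"sessionId\" value=\"",
        "RB=\"",
        "Ord=1",
        LAST);

    web_url("home",
        "URL=http://example.com/index.html",
        LAST);

    /* the captured value can now be replayed anywhere as {sessionId} */
    lr_output_message("captured session id: %s", lr_eval_string("{sessionId}"));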
When do you disable log in Virtual User Generator, When do you choose standard and extended logs?
Once we debug our script and verify that it is functional, we can enable logging for errors only. When we add a script to a scenario, logging is automatically disabled. Standard Log Option: when you select Standard log, it creates a standard log of the functions and messages sent during script execution, to use for debugging. Disable this option for large load testing scenarios. Extended Log Option: select Extended log to create an extended log, including warnings and other messages. Disable this option for large load testing scenarios as well. We can specify which additional information should be added to the extended log using the Extended log options.
How do you debug a LoadRunner script?
VuGen contains two options to help debug Vuser scripts: the Run Step by Step command and breakpoints. The Debug settings in the Options dialog box allow us to determine the extent of the trace to be performed during scenario execution. The debug information is written to the Output window. We can manually set the message class within the script using the lr_set_debug_message function. This is useful if we want to receive debug information about a small section of the script only.
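For instance, to get detailed tracing for one suspect block only (a sketch; LR_MSG_CLASS_EXTENDED_LOG and the switch constants are standard VuGen flags, and the {orderId} parameter is hypothetical):

    /* raise the message level for the suspect section only */
    lr_set_debug_message(LR_MSG_CLASS_EXTENDED_LOG, LR_SWITCH_ON);

    lr_debug_message(LR_MSG_CLASS_EXTENDED_LOG,
        "about to submit order id %s", lr_eval_string("{orderId}"));
    /* ... suspect steps ... */

    lr_set_debug_message(LR_MSG_CLASS_EXTENDED_LOG, LR_SWITCH_OFF);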
How do you write user defined functions in LR? Give me few functions you wrote in your previous project?
Before we create a user-defined function, we need to create the external library (DLL) containing the function. We add this library to the VuGen bin directory. Once the library is added, we assign the user-defined function as a parameter. The function should have the following format: __declspec(dllexport) char* <function name>(char*, char*). GetVersion, GetCurrentTime, and GetPlatform are some of the user-defined functions used in my earlier project.
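A minimal sketch of such a function and its use (the DLL name is hypothetical; GetVersion is one of the functions named above, with an illustrative body):

    /* DLL side: compiled separately and copied to the VuGen bin directory */
    __declspec(dllexport) char *GetVersion(char *arg1, char *arg2)
    {
        return "1.0.3";   /* illustrative value */
    }

    /* script side: load the library, then call the function directly */
    Action()
    {
        lr_load_dll("myfuncs.dll");   /* hypothetical DLL name */
        lr_output_message("build under test: %s", GetVersion("", ""));
        return 0;
    }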
What are the changes you can make in run-time settings?
The Run-Time Settings that we make are: a) Pacing - this includes the iteration count. b) Log - under this we have Disable Logging, Standard Log, and Extended Log. c) Think Time - here we have two options: Ignore think time and Replay think time. d) General - under the General tab we can set the Vusers to run as a process or as threads (multithreading), and whether each step should be marked as a transaction.
Where do you set Iteration for Vuser testing?
We set iterations in the Run-Time Settings of VuGen: Run-Time Settings > Pacing tab > set the number of iterations.
How do you perform functional testing under load?
Functionality under load can be tested by running several Vusers concurrently. By increasing the amount of Vusers, we can determine how much load the server can sustain.
What is Ramp up? How do you set this?
This option is used to gradually increase the amount of Vusers/load on the server. An initial value is set, and a value to wait between intervals can be specified. To set Ramp Up, go to 'Scenario Scheduling Options'.
What is the advantage of running the Vuser as thread?
VuGen provides the facility to use multithreading. This enables more Vusers to be run per generator. If the Vuser is run as a process, the same driver program is loaded into memory for each Vuser, thus taking up a large amount of memory. This limits the number of Vusers that can be run on a single generator. If the Vuser is run as a thread, only one instance of the driver program is loaded into memory for the given number of Vusers (say 100). Each thread shares the memory of the parent driver program, thus enabling more Vusers to be run per generator.
If you want to stop the execution of your script on error, how do you do that?
The lr_abort function aborts the execution of a Vuser script. It instructs the Vuser to stop executing the Actions section, execute the vuser_end section and end the execution. This function is useful when you need to manually abort a script execution as a result of a specific error condition. When you end a script using this function, the Vuser is assigned the status "Stopped". For this to take effect, we have to first uncheck the “Continue on error” option in Run-Time Settings.
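A sketch of the usual pattern (the URL and messages are illustrative), with "Continue on error" unchecked as noted above:

    web_url("checkout",
        "URL=http://example.com/checkout",
        LAST);

    /* stop this Vuser if the server did not answer with HTTP 200 */
    if (web_get_int_property(HTTP_INFO_RETURN_CODE) != 200) {
        lr_error_message("checkout failed - aborting Vuser");
        lr_abort();   /* runs vuser_end, Vuser ends with status Stopped */
    }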
What is the relation between Response Time and Throughput?
The Throughput graph shows the amount of data in bytes that the Vusers received from the server per second. When we compare this with the transaction response time, we will notice that as throughput decreased, the response time also decreased. Similarly, the peak throughput and the highest response time would occur at approximately the same time.
Explain the Configuration of your systems?
The configuration of our systems refers to that of the client machines on which we run the Vusers. The configuration of any client machine includes its hardware settings, memory, operating system, software applications, development tools, etc. This system component configuration should match with the overall system configuration that would include the network infrastructure, the web server, the database server, and any other components that go with this larger system so as to achieve the load testing objectives.
How do you identify the performance bottlenecks?
Performance bottlenecks can be detected by using monitors. These monitors might be application server monitors, web server monitors, database server monitors, and network monitors. They help in finding out the troubled areas in our scenario which cause increased response time. The measurements made usually include response time, throughput, hits per second, network delay graphs, etc.
If web server, database and Network are all fine where could be the problem?
The problem could be in the system itself or in the application server or in the code written for the application.
How did you find web server related issues?
Using Web resource monitors, we can find the performance of web servers. Using these monitors we can analyze the throughput on the web server, the number of hits per second that occurred during the scenario, the number of HTTP responses per second, and the number of downloaded pages per second.
How did you find database related issues?
By running the Database monitor with the help of the Data Resource Graph, we can find database-related issues. For example, you can specify the resource you want to measure before running the Controller, and then see the database-related issues.
What is the difference between Overlay graph and Correlate graph?
Overlay Graph: overlays the content of two graphs that share a common x-axis. The left y-axis of the merged graph shows the current graph's values, and the right y-axis shows the values of the graph that was merged. Correlate Graph: plots the y-axes of two graphs against each other. The active graph's y-axis becomes the x-axis of the merged graph, and the y-axis of the graph that was merged becomes the merged graph's y-axis.
How did you plan the Load? What are the Criteria?
A load test is planned to decide the number of users, what kind of machines we are going to use, and from where they are run. It is based on two important documents: the Task Distribution Diagram and the Transaction Profile. The Task Distribution Diagram gives us the information on the number of users for a particular transaction and the time of the load. The peak usage and off-usage are decided from this diagram. The Transaction Profile gives us the information about the transaction names and their priority levels with regard to the scenario we are designing.
What does vuser_init action contain?
The vuser_init action contains procedures to log in to a server.
What does vuser_end action contain?
The vuser_end section contains log-off procedures.
What is think time? How do you change the threshold?
Think time is the time that a real user waits between actions. For example, when a user receives data from a server, the user may wait several seconds to review the data before responding; this delay is known as the think time. Changing the threshold: the threshold level is the level below which recorded think time will be ignored. The default value is five (5) seconds. We can change the think time threshold in the Recording Options of VuGen.
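In the script, recorded pauses show up as lr_think_time calls between steps (a sketch; the page and the 12-second pause are illustrative). How they replay - as recorded, ignored, or limited - is controlled in the Run-Time Settings:

    web_url("account_summary",
        "URL=http://example.com/account",
        LAST);

    /* the recorded user paused 12 seconds reviewing the page; pauses
       below the recording threshold are not generated at all */
    lr_think_time(12);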
What is the difference between standard log and extended log?
The standard log sends a subset of the functions and messages sent during script execution to a log; the subset depends on the Vuser type. The extended log sends detailed script execution messages to the output log. This is mainly used during debugging, when we want information about parameter substitution, data returned by the server, or an advanced trace.
Explain the following functions: lr_debug_message, lr_output_message, lr_error_message, lrd_stmt, lrd_fetch.
lr_debug_message - sends a debug message to the output log when the specified message class is set.
lr_output_message - sends notifications to the Controller Output window and the Vuser log file.
lr_error_message - sends an error message to the LoadRunner Output window.
lrd_stmt - associates a character string (usually a SQL statement) with a cursor; this function sets a SQL statement to be processed.
lrd_fetch - fetches the next row from the result set.
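A short sketch showing the three message functions side by side (the values are illustrative):

    int rows = 42;   /* illustrative value */

    /* written only when the matching message class is switched on */
    lr_debug_message(LR_MSG_CLASS_EXTENDED_LOG, "rows fetched: %d", rows);

    /* always written to the Controller Output window and the Vuser log */
    lr_output_message("iteration complete, %d rows processed", rows);

    /* flagged as an error in the LoadRunner Output window */
    lr_error_message("unexpected row count: %d", rows);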
Throughput
If the throughput scales upward as time progresses and the number of Vusers increases, this indicates that the bandwidth is sufficient. If the graph were to remain relatively flat as the number of Vusers increased, it would be reasonable to conclude that the bandwidth is constraining the volume of data delivered.
Types of Goals in Goal-Oriented Scenario
LoadRunner provides you with five different types of goals in a goal-oriented scenario:
The number of concurrent Vusers
The number of hits per second
The number of transactions per second
The number of pages per minute
The transaction response time that you want your scenario to reach
Analysis Scenario (Bottlenecks):
In the Running Vusers graph correlated with the Response Time graph, you can see that as the number of Vusers increases, the average response time of the check-itinerary transaction gradually increases. In other words, the average response time steadily increases as the load increases. At 56 Vusers there is a sudden, sharp increase in the average response time; we say that the test broke the server. That is the mean time before failure (MTBF). The response time clearly began to degrade when there were more than 56 Vusers running simultaneously.


Thursday, February 24, 2011

Testing Data Warehouse – A Four Step Approach



In today's fast-paced business environment, it is almost always an unstated fact that the success of any Data Warehouse solution lies in its ability not only to analyze vast quantities of data over time but also to provide stakeholders and end users meaningful options that are based on real-time data. This requirement mandates an extremely efficient system that can extract, transform, cleanse, and load data from the source systems on a 24x7 basis without impacting performance or scalability, or causing system downtime.

One of the key elements contributing to the success of a Data Warehouse solution is the ability of the test team to plan, design and execute a set of effective tests that will help identify multiple issues related to data inconsistency, data quality, data security, failures in the extract, transform and load (ETL) process, performance related issues, accuracy of business flows and fitness for use from an end user perspective.

The primary focus of testing should be on the ETL process. This includes validating the loading of all required rows, the correct execution of all transformations, and the successful completion of the cleansing operation. The team also needs to thoroughly test SQL queries, stored procedures, and queries that produce aggregate or summary tables. Keeping in tune with emerging trends, it is also important for the test team to design and execute a set of tests that are customer-experience-centric.
Fig 1: Key components of an effective Data Warehouse test strategy

As shown in the figure above, the focus of a Data Warehouse test strategy is primarily on four key aspects:

  • Data Quality Validation
  • End User & BI / Report Testing
  • Load and Performance Testing
  • End-to-End (E2E) Regression and Integration Testing

Data Quality Validation:
An essential part of the overall ETL test strategy is validating data for accuracy, which is core to any Data Warehouse test. Validating data for quality includes tests for data completeness, data transformation, and data quality.

  • Data Completeness Tests: are designed to verify that all the expected data is loaded into the data warehouse. This includes running detailed tests to verify that all records are completely loaded, without errors in content quality or quantity (a minimal row-count sketch follows this list).
  • Data Transformation Tests: are designed to verify the accuracy of the transformation logic or transformation business rules. This can at times be a complex activity; hence, teams should consider using automated tools as part of the test strategy. Integration tests are generally a part of data transformation tests; this is covered in more detail in a separate section below.
  • Data Quality Tests: are designed to validate system behavior when data is rejected (for example, due to data inaccuracy or missing data) during data correction and substitution. Scenario-based tests and validation tests for the solution's reporting feature are part of Data Quality Tests.
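As a trivial illustration of the completeness idea (a sketch, not a real harness; the file names are hypothetical), a row-count reconciliation between a source extract and the loaded target might look like:

    #include <stdio.h>

    /* count newline-terminated records in an export file */
    static long count_rows(const char *path)
    {
        FILE *f = fopen(path, "r");
        long rows = 0;
        int ch;
        if (f == NULL) return -1;
        while ((ch = fgetc(f)) != EOF)
            if (ch == '\n') rows++;
        fclose(f);
        return rows;
    }

    int main(void)
    {
        long src = count_rows("source_extract.csv");   /* hypothetical file */
        long tgt = count_rows("warehouse_load.csv");   /* hypothetical file */

        if (src < 0 || tgt < 0) { fprintf(stderr, "missing extract file\n"); return 2; }
        if (src != tgt) {
            fprintf(stderr, "completeness check FAILED: source=%ld target=%ld\n", src, tgt);
            return 1;
        }
        printf("completeness check passed: %ld rows\n", src);
        return 0;
    }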

At a bare minimum, data quality validation should ensure:
  • Extraction of data to the required fields
  • Proper functioning of the extraction logic for each source system (historical and incremental loads)
  • Availability of security access to source systems for extraction scripts
  • Updates to the extract audit log and time stamping as per requirements
  • Completeness and accuracy of "Source to Extraction Destination" transaction scripts, which transform the data as per the expected logic
  • Historical load transformation for historical snapshots is working
  • Incremental load transformation for historical snapshots is working
  • Detailed and aggregated data sets are created and match each other
  • Transaction audit logging and time stamping
  • No pilferage of data during the transformation process or during historical and incremental loads
  • Real-time or near-real-time data loading occurs without adversely impacting performance
  • Multi-pass SQL statements update all the temporary tables for real-time or near-real-time reporting and analytics

End User and BI / Report Testing:
Testing the accuracy of reports is another critical aspect of Data Warehouse testing. Extreme care should be taken while testing, as reports are probably the only experience most users have with the Data Warehouse and Business Intelligence (DW/BI) system. The guiding philosophy is that reports should be as clear and self-explanatory as possible. Usability, performance, data accuracy, and preview and/or export to different formats are the areas where most failures occur.

When designing tests for End user and BI / Report testing, some key points to address include:

  • Data displayed on the business views and dashboards is as expected
  • Users can see reports according to their user profile (authentication and authorization)
  • Verification of report format and content by appropriate end users
  • Verification of the accuracy and completeness of the scheduled reports
  • OLAP, drill-down reports, cross-tab reports, parent/child reports, etc. are all working as expected
  • 'Analysis Functions' and 'Data Analysis' are working
  • No pilferage of data between the source systems and the views
  • Testing of reports replicated from the old system to the new system for consistency of business rules
  • Previewing and/or exporting reports to different formats such as spreadsheet, PDF, HTML, and e-mail displays accurate and consistent data
  • The print facility, where applicable, produces the expected output
  • Where graphs and data in tabular format exist, both should reflect consistent data



Load and Performance Testing:  
With increasing volumes of data, stability and scalability become critical test parameters. Under stress from large transactional data volumes, data warehouses will typically fail to scale, and eventually fail, unless they are tested and the issues found are fixed. To avoid such problems, it is essential that the test team design and execute a series of tests that validate the performance and scalability of the system under different loads. As part of this activity, the following tests can be executed:
  • Shut down the server during a batch process and validate the result
  • Perform ETL with a load two or three times the maximum anticipated data volume (for which capacity is planned)
  • Run huge volumes of ad-hoc queries, mimicking multiple simultaneous users
  • Run a large number of scheduled reports
  • Monitor the timing of the reject processes and check system behavior when handling large volumes of rejected data

E2E Integration and Regression Testing:
Integration tests show how the application fits into the overall flow of all upstream and downstream applications. When designing integration tests, the focus of the tester should be on the following topics:
  • How the overall process can break, with focus on the integrations between different systems and their subsystems
  • Validating system behavior when different types of data (different user profiles, data types, data volumes, etc.) are processed and communicated to the subsequent system
  • Running custom-designed regression tests that simulate end-user behavior (this helps ensure the success of user-acceptance tests)

Usage of techniques like scenario-based testing, risk-based testing, and model-based testing will enhance the effectiveness of testing. In addition, it is always a good idea to consider creating different tests by using test design techniques like Boundary Value Analysis (BVA) and Equivalence Partitioning (EP).

Summary: While basic testing philosophies hold good when testing a Data Warehouse implementation, it is important for test teams to understand that testing a Data Warehouse implementation is a different ball game. Since a Data Warehouse primarily deals with data, a major portion of the test effort is spent on planning, designing, and executing tests that are data-oriented. These tests include running SQL queries, validating that ETL executes as expected, that exceptions are handled effectively, that application performance meets the SLAs, and finally, ensuring that the integration points are working as expected. Planning and designing most of the test cases requires the test team to have experience in SQL and performance testing. It will also be helpful if the team members have experience in debugging performance bottlenecks.

Another dimension of Data Warehouse testing is the dependency of the tests on the test environment. Since, in general, a test environment will not be as robust as production (high-end servers, clustering, load balancing, data volumes, and data accuracy), this will have an impact on some of the tests. For example, simulated or masked data that does not reflect all the characteristics of production data may limit the accuracy of performance tests. In other cases, some of the jobs may not fail in a simulated test environment. Test teams should be wary of such limitations and should factor in such risks when designing their tests.

Detecting all possible defects may be complex; however, a little planning will go a long way in identifying the most obvious and costly defects early in the life cycle. Finally, a group of Data Warehouse architects, business analysts, and test teams working together during the initial planning and design phase is one of the time-tested approaches that can help in identifying and eliminating potential failures.

ETL Process Definitions and Deliverables



  • 1.0 Define Requirements – In this process you should understand the business needs by gathering information from the users.  You should understand what data is needed and whether it is available.  Resources should be identified for information or help with the process.
    • Deliverables
      • A logical description of how you will extract, transform, and load the data.
      • Sign-off of the customer(s).
    • Standards
      • Document ETL business requirements specification using either the ETL Business Requirements Specification Template, your own team-specific business requirements template or system, or Oracle Designer.
    • Templates
      • ETL Business Requirements Specification Template
  • 2.0 Create Physical Design – In this process you should define your inputs and outputs by documenting record layouts.  You should also identify and define the location of source and target, file/table sizing information, volume information, and how the data will be transformed.
    • Deliverables
      • Input and output record layouts
      • Location of source and target
      • File/table sizing information
      • File/table volume information
      • Documentation on how the data will be transformed, if at all
    • Standards
      • Complete ETL Business Requirements Specification using one of the methods documented in the previous steps.
      • Start ETL Mapping Specification
    • Templates
      • ETL Business Requirements Specification Template
      • ETL Mapping Specification Template
  • 3.0 Design Test Plan – Understand what the data combinations are and define what results are expected.  Remember to include error checks.  Decide how many test cases need to be built.  Look at technical risk and include security.  Test business requirements.
    • Deliverables
      • ETL Test Plan
      • ETL Performance Test Plan
    • Standards
      • Document ETL test plan and performance plan using either the standard templates listed below or your own team-specific template(s).
    • Templates
      • ETL Test Plan Template
      • ETL Performance Test Plan Template
  • 4.0 Create ETL Process – Start creating the actual Informatica ETL process.  The developer is actually doing some testing in this process.
    • Deliverables
      • Mapping Specification
      • Mapping
      • Workflow
      • Session
    • Standards
      • Start the ETL Object Migration Form
      • Start Database Object Migration Form (if applicable)
      • Complete ETL Mapping Specification
      • Complete cleanup process for log and bad files – Refer to Standard_ETL_File_Cleanup.doc
      • Follow Informatica Naming Standards
    • Templates
      • ETL Object Migration Form
      • ETL Mapping Specification Template
      • Database Object Migration Form (if applicable)
  • 5.0 Test Process – The developer does the following types of tests: unit, volume, and performance.
    • Deliverables
      • ETL Test Plan
      • ETL Performance Test Plan
    • Standards
      • Complete ETL Test Plan
      • Complete ETL Performance Test Plan
    • Templates
      • ETL Test Plan Template
      • ETL Performance Test Plan
  • 6.0 Walkthrough ETL Process – Within the walkthrough, the following factors should be addressed: common modules (reusable objects), efficiency of the ETL code, business logic, accuracy, and standardization.
    • Deliverables
      • ETL process that has been reviewed
    • Standards
      • Conduct ETL Process Walkthrough
    • Templates
      • ETL Mapping Walkthrough Checklist Template
  • 7.0 Coordinate Move to QA – The developer works with the ETL Administrator to organize ETL Process move to QA.
    • Deliverables
      • ETL process moved to QA
    • Standards
      • Complete ETL Object Migration Form
      • Complete Unix Job Setup Request Form
      • Complete Database Object Migration Form (if applicable)
    • Templates
      • ETL Object Migration Form
      • Unix Job Setup Request Form
      • Database Object Migration Form
  • 8.0 Test Process – At this point, the developer once again tests the process after it has been moved to QA.
    • Deliverables
      • Tested ETL process
    • Standards
      • Developer validates ETL Test Plan and ETL Performance Test Plan
    • Templates
      • ETL Test Plan Template
      • ETL Performance Test Plan Template
  • 9.0 User Validates Data – The user validates the data and makes sure it satisfies the business requirements.
    • Deliverables
      • Validated ETL process
    • Standards
      • Validate Business Requirement Specifications with the data
    • Templates
      • ETL Business Requirement Specifications Template
  • 10.0 Coordinate Move to Production - The developer works with the ETL Administrator to organize ETL Process move to Production.
    • Deliverables
      • Accurate and efficient ETL process moved to production
    • Standards
      • Complete ETL Object Migration Form
      • Complete Unix Job Setup Request Form
      • Complete Database Object Migration Form (if applicable)
    • Templates
      • ETL Object Migration Form
      • Unix Job Setup Request Form
      • Database Object Migration Form (if applicable)
  • 11.0 Maintain ETL Process – There are a couple of situations to consider when maintaining an ETL process: maintenance when an ETL process breaks, and maintenance when an ETL process needs updating.
    • Deliverables
      • Accurate and efficient ETL process in production
    • Standards
      • Updated Business Requirements Specification (if needed)
      • Updated Mapping Specification (if needed)
      • Revised mapping in appropriate folder
      • Updated ETL Object Migration Form
      • Developer checks final results in production
      • All monitoring (finding problems) of the ETL process is the responsibility of the project team
    • Templates
      • Business Requirements Specification Template
      • Mapping Specification Template
      • ETL Object Migration Form
      • Unix Job Setup Request Form
      • Database Object Migration Form (if applicable)

ETL Methodology


Overview


This document is designed for use by business associates and technical resources to better understand the process of building a data warehouse and the methodology employed to build the EDW.

This methodology has been designed to provide the following benefits:
  1. A high level of performance
  2. Scalability to any size
  3. Ease of maintenance
  4. Boiler-plate development
  5. Standard documentation techniques


ETL Definitions


  • ETL – Extract Transform Load: the physical process of extracting data from a source system, transforming the data to the desired state, and loading it into a database
  • EDW – Enterprise Data Warehouse: the logical data warehouse designed for enterprise information storage and reporting
  • DM – Data Mart: a small subset of a data warehouse specifically defined for a subject area

Documentation Specifications


A primary driver of the entire process is accurate business information requirements.  TDD Consulting will use standard documents prepared by the Project Management Institute for requirements gathering, project signoff, and compiling all testing information.

ETL Naming Conventions


To maintain consistency, all ETL processes will follow a standard naming methodology.

Tables

All destination tables will utilize the following naming convention:
   EDW_<SUBJECT>_<TYPE>

There are six types of tables used in a data warehouse: Fact, Dimension, Aggregate, Staging, Temp, and Audit.  A quick overview of the table types follows, with sample names listed after it.

Fact – a table type that contains atomic data
Dimension – a table type that contains referential data needed by the fact tables
Aggregate – a table type used to aggregate data, forming a pre-computed answer to a business question (e.g. totals by day)
Staging – a table type used to store data during ETL processing, where the data is not removed immediately
Temp – a table type used during ETL processing whose contents can be truncated immediately afterwards (e.g. storing order ids for lookup)
Audit – a table type used to keep track of the ETL process (e.g. processing times by job)

Each type of table will be kept in a separate schema.  This will decrease maintenance work and time spent looking for a specific table.

Sample table names:
  • EDW_RX_FACT – fact table containing RX subject matter
  • EDW_TIME_DIM – dimension table containing TIME subject matter
  • EDW_CUSTOMER_AG – aggregate table containing CUSTOMER subject matter
  • ETL_PROCESS_AUDIT – audit table containing PROCESS data
  • STG_DI_CUSTOMER – staging table sourced from the DI system, used for CUSTOMER data processing
  • ETL_ADDRESS_TEMP – temp table used for ADDRESS processing

ETL Processing


The following types of ETL jobs will be used for processing. The list below gives each job type, its function, its naming convention, and an example name.

  • Extract – extracts information from a source system and places it in a staging table. Naming: Extract<Source><Subject>, e.g. ExtractDICustomer
  • Source – sources information from STG tables and performs column validation. Naming: Source<Table>, e.g. SourceSTGDICustomer
  • LoadTemp – loads temp tables used in processing. Naming: LoadTemp<Table>, e.g. LoadTempETLAddressTemp
  • LookupDimension – looks up dimension tables. Naming: LookupDimension<Subject>, e.g. LookupDimensionCustomer
  • Transform – transforms the subject-area data and generates insert files. Naming: Transform<Subject>, e.g. TransformCustomer
  • QualityCheck – checks the quality of the data before it is loaded into the EDW. Naming: QualityCheck<Subject>, e.g. QualityCheckCustomer
  • Load – loads the data into the EDW. Naming: Load<Table>, e.g. LoadEDWCustomerFact

ETL Job Standards


All ETL jobs will be created with a boiler-plate approach.  This approach allows for rapid creation of similar jobs while keeping maintenance low.

Comments


Every job will have a standard comment template that specifically spells out the following attributes of the job:

Job Name:            LoadEDWCustomerFact
Purpose:               Load the EDW_Customer_Fact table
Predecessor:         QualityCheckCustomer
Date:                     April 21, 2006
Author:                 Wes Dumey
Revision History: 
April 21, 2006 – Created the job from standard template
April 22, 2006 – Added error checking for table insert

In addition, there will be a job data dictionary that describes every job in a table, so that it can be easily searched via standard SQL.

Persistent Staging Areas


Data will be received from the source systems in their native format.  The data will be stored in a PSA table following the naming standards listed previously.  The table will contain the following layout:

Column        Data Type  Explanation
ROW_NUMBER    NUMBER     Unique for each row in the PSA
DATE          DATE       Date the row was placed in the PSA
STATUS_CODE   CHAR(1)    Indicates the status of the row ('I' inducted, 'P' processed, 'R' rejected)
ISSUE_CODE    NUMBER     Code uniquely identifying the problem with the data if STATUS_CODE = 'R'
BATCH_NUMBER  NUMBER     Batch number used to process the data (auditing)
(data columns follow)


Auditing


The ETL methodology maintains a process for providing audit and logging capabilities. 

For each run of the process, a unique batch number composed of time segments is created.  This batch number is loaded with the data into the PSA and all target tables.  In addition, an entry with the following data elements will be made in the ETL_PROCESS_AUDIT table.

Column                Data Type  Explanation
DATE                  DATE       (Indexed) run date
BATCH_NUMBER          NUMBER     Batch number of the process
PROCESS_NAME          VARCHAR    Name of the process that was executed
PROCESS_RUN_TIME      TIMESTAMP  Time (HH:MI:SS) of process execution
PROCESS_STATUS        CHAR       'S' SUCCESS, 'F' FAILURE
ISSUE_CODE            NUMBER     Code of the issue related to a process failure (if 'F')
RECORD_PROCESS_COUNT  NUMBER     Row count of records processed during the run

The audit process will allow for efficient logging of process execution and encountered errors.
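A sketch of building such a time-segment batch number (the YYYYMMDDHHMISS layout is assumed for illustration, matching the sample dates used elsewhere in this document):

    #include <stdlib.h>
    #include <time.h>

    /* build a batch number such as 20060421143055 from the current time */
    long long make_batch_number(void)
    {
        char buf[15];
        time_t now = time(NULL);
        strftime(buf, sizeof buf, "%Y%m%d%H%M%S", localtime(&now));
        return atoll(buf);
    }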

Quality


Due to the sensitive nature of data within the EDW, data quality is a driving priority.  Quality will be handled through the following processes:

  1. Source job – the source job will contain a quick data-scrubbing mechanism that verifies the data conforms to the expected type (numeric fields contain numbers, character fields contain letters); a type-check sketch follows this list.
  2. Transform – the transform job will contain matching metadata for the target table and will verify that NULL values are not loaded into NOT NULL columns and that the data is transformed correctly.
  3. QualityCheck – a separate job is created to do a cursory check on a few identified columns and verify that the correct data is loaded into these columns.
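A minimal sketch of the type check described in step 1, assuming each column has a declared expected type of 'N' (numeric) or 'C' (character):

    #include <ctype.h>

    /* returns 1 if every character in value matches the expected type:
       'N' = numeric column (digits only), 'C' = character column (letters only) */
    int conforms(const char *value, char expected_type)
    {
        for (; *value != '\0'; value++) {
            if (expected_type == 'N' && !isdigit((unsigned char)*value))
                return 0;
            if (expected_type == 'C' && !isalpha((unsigned char)*value))
                return 0;
        }
        return 1;
    }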

Source Quality


A data-scrubbing mechanism will be constructed.  This mechanism will check identified columns for any anomalies (e.g. embedded carriage returns) and for value domains.  If an error is discovered, the data is fixed and a record is written to the ETL_QUALITY_ISSUES table (see below for the table definition).

Transform Quality


The transformation job will employ a matching-metadata technique. If the target table enforces NOT NULL constraints, a check will be built into the job preventing NULLs from being loaded and causing a job-stream abend.



Quality Check


Quality check is the last point of validation within the job stream. QC can be configured to check any percentage of rows (0-100%) and any number of columns (1-X).  QC is designed to pay attention to the most valuable or vulnerable rows within the data sets. QC will use a modified version of the data-scrubbing engine used during the source job to derive correct values, referencing the rules listed in the ETL_QC_DRIVER table.  Any suspect rows will be pulled from the insert/update files and updated to an 'R' status in the PSA table, and an issue code will be created for the failure.
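As an illustration of the configurable sampling rate (a sketch; the percentage rule is assumed, not prescribed), each row can be admitted to the QC sample like this:

    #include <stdlib.h>

    /* returns 1 if this row should be quality-checked;
       pct is the configured percentage of rows to check (0-100) */
    int in_qc_sample(int pct)
    {
        return (rand() % 100) < pct;
    }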

Logging of Data Failures


Data that fails the QC job will not be loaded into the EDW, based on defined rules.  An entry will be made in the following table (ETL_QUALITY_ISSUES).  An indicator will show the severity of the failure as defined in the rules ('H' HIGH, 'L' LOW).  This indicator will allow resources to be used efficiently to trace errors.

ETL_QUALITY_ISSUES


Column          Data Type  Explanation
DATE            DATE       Date of entry
BATCH_NUMBER    NUMBER     Batch number of the process creating the entry
PROCESS_NAME    VARCHAR    Name of the process creating the entry
COLUMN_NAME     VARCHAR    Name of the column failing validation
COLUMN_VALUE    VARCHAR    Value of the column failing validation
EXPECTED_VALUE  VARCHAR    Expected value of the column failing validation
ISSUE_CODE      NUMBER     Issue code assigned to the error
SEVERITY        CHAR       'H' HIGH, 'L' LOW


ETL_QUALITY_AUDIT

Column                Data Type  Explanation
DATE                  DATE       Date of entry
BATCH_NUMBER          NUMBER     Batch number of the process creating the entry
PROCESS_NAME          VARCHAR    Name of the process creating the entry
RECORD_PROCESS_COUNT  NUMBER     Number of records processed
RECORD_COUNT_CHECKED  NUMBER     Number of records checked
PERCENTAGE_CHECKED    NUMBER     Percentage of records checked out of the data set



Closing


After reading this ETL document you should have a better understanding of the issues associated with ETL processing.  This methodology has been created to address as many of the pitfalls as possible while providing a high level of performance and ease of maintenance, and remaining scalable and workable in a real-time ETL processing scenario.