
Thursday, February 24, 2011

ETL Interview Questions


Q. What is a staging area? Do we need it? What is the purpose of a staging area?

A1. A staging area is a place where you hold temporary tables on the data warehouse server. Staging tables are connected to the work area or fact tables. We need a staging area to hold the data, and to perform data cleansing and merging, before loading the data into the warehouse.
A2.
The staging area is:

* One or more database schemas or file stores used to “stage” data extracted from the source OLTP systems prior to being published to the “warehouse”, where it is visible to end users.
* Data in the staging area is NOT visible to end users for queries, reports or analysis of any kind. It does not hold completed data ready for querying.
* It may hold intermediate results (if data is pipelined through a process).
* Equally it may hold “state” data – the keys of the data held in the warehouse, used to detect whether incoming data includes New or Updated rows (or Deleted, for that matter).
* It is likely to be equal in size to (or maybe larger than) the “presentation area” itself.
* Although the “state” data (eg. the last sequence loaded) may be backed up, much of the staging area data is automatically replaced during the ETL load processes, and with care it can avoid adding to the backup effort. The presentation area, however, may need backup in many cases.
* It may include some metadata, which may be used by analysts or operators monitoring the state of previous loads (eg. audit information, summary totals of rows loaded, etc.).
* It’s likely to hold details of “rejected” entries – data which has failed quality tests and may need correction and re-submission to the ETL process.
* It’s likely to have few indexes (compared to the “presentation area”) and to hold data in a quite normalized form. The presentation area (the part the end users see) is, by comparison, likely to be more highly indexed (mainly bitmap indexes), with highly de-normalized tables (the Dimension tables, anyway).



The staging area exists to be a separate “back room” or “engine room” of the warehouse, where the data can be transformed, corrected and prepared for the warehouse.


It should ONLY be accessible to the ETL processes working on the data, or administrators monitoring or managing the ETL process.

In summary, a typical warehouse generally has three distinct areas:

1. Several source systems which provide data. These can include databases (Oracle, SQL Server, Sybase etc.), files or spreadsheets.
2. A single “staging area” which may use one or more database schemas or file stores (depending upon warehouse load volumes).
3. One or more “visible” data marts or a single “warehouse presentation area” where data is made visible to end user queries. This is what many people think of as the warehouse – although the entire system is the warehouse – it depends upon your perspective.


The “staging area” is the middle bit.
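The “state” data bullet above – comparing incoming keys against the keys already held in the warehouse to detect New and Updated rows – can be sketched with standard Unix tools. The file names and key values below are invented purely for illustration:

```shell
# Hypothetical key files -- names and keys are illustrative, not from any real system.
printf '100\n101\n102\n' > warehouse_keys.txt   # keys already held in the warehouse
printf '101\n102\n103\n' > incoming_keys.txt    # keys arriving in this load

# comm needs sorted input. -13 suppresses lines unique to file1 and lines common
# to both, leaving keys present only in the incoming feed: the New rows.
comm -13 warehouse_keys.txt incoming_keys.txt > new_keys.txt

# -12 leaves only the keys present in both files: candidate Updated rows.
comm -12 warehouse_keys.txt incoming_keys.txt > update_keys.txt
```

In a real staging area the same comparison would usually be done with a database join against the staged key table, but the logic is identical.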

Q. What are active transformation / Passive transformations?

A1. An active transformation can change the number of rows it outputs, while a passive transformation does not change the number of rows and passes through the same number of rows it was given as input.

A2.
Transformations can be active or passive. An active transformation can change the number of rows that pass through it, such as a Filter transformation that removes rows that do not meet the filter condition. A passive transformation does not change the number of rows that pass through it, such as an Expression transformation that performs a calculation on data and passes all rows through the transformation.

Active transformations:

Advanced External Procedure

Aggregator

Application Source Qualifier

Filter

Joiner

Normalizer

Rank

Router

Update Strategy



Passive transformations:

Expression

External Procedure

Mapplet - Input

Lookup

Sequence generator

XML Source Qualifier

Mapplet - Output
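A loose Unix analogy (not Informatica itself) makes the active/passive distinction concrete: a filter may change the row count, while a per-row calculation cannot. The sample file below is invented for illustration:

```shell
# Hypothetical 3-row source file.
printf 'sale,100\nrefund,-20\nsale,55\n' > rows.csv

# "Active" behaviour: filtering can change the row count (3 rows in, 2 rows out),
# like a Filter transformation dropping rows that fail the condition.
grep '^sale' rows.csv > filtered.csv

# "Passive" behaviour: a per-row calculation passes every row through
# (3 rows in, 3 rows out), like an Expression transformation.
awk -F, '{print $1 "," $2 * 2}' rows.csv > calculated.csv

wc -l filtered.csv calculated.csv
```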


Q. What is the difference between Power Center & Power Mart?

A1.
Power Mart is designed for:
low-range warehouses
only local repositories
mainly desktop environments.

Power Center is designed for:
high-end warehouses
global as well as local repositories
ERP support.


Q. What is the difference between etl tool and olap tools?


A1.
An ETL tool is meant for extracting data from legacy systems and loading it into a specified database, with some data cleansing along the way.

ex: Informatica, DataStage, etc.

OLAP is meant for reporting. In OLAP, data is available in a multidimensional model, so you can write simple queries to extract data from the database.

ex: Business Objects, Cognos, etc.

A2.
ETL tools are used to extract, transform and load the data into a data warehouse / data mart.
OLAP tools are used to create cubes/reports for business analysis from the data warehouse / data mart.


Q. What are the various ETL tools? - Name a few

A1.
- Informatica
- Ab Initio
- DataStage
- Cognos Decision Stream
- Oracle Warehouse Builder
- Business Objects XI (Extreme Insight)
- SAP Business Warehouse
- SAS Enterprise ETL Server


Q. What are the various transformation available?


A1.
Transformations play an important role in a data warehouse. Transformations are applied when data is moved from source to destination, depending upon the criteria. Some of the transformations are Aggregator, Lookup, Filter, Source Qualifier, Sequence Generator and Expression.

A2.
The various types of transformations in Informatica:
source qualifier
aggregator
sequence generator
sorter
router
filter
lookup
update strategy
joiner
normalizer
expression

A3.
Aggregator Transformation
Expression Transformation
Filter Transformation
Joiner Transformation
Lookup Transformation
Normalizer Transformation
Rank Transformation
Router Transformation
Sequence Generator Transformation
Stored Procedure Transformation
Sorter Transformation
Update Strategy Transformation
XML Source Qualifier Transformation
Advanced External Procedure Transformation
External Procedure Transformation


Q. What are the different Lookup methods used in Informatica?

A1.
A connected Lookup receives input from the pipeline, sends output back to the pipeline, and can return any number of columns.

An unconnected Lookup is called from within another transformation and can return only one column.



Q. Let's suppose we have some 10,000-odd records in the source system. When we load them into the target, how do we ensure that none of the 10,000 records loaded contain any garbage values?

How do we test it? We can't check every record, as the number of records is huge.


A1.
Data quality checks come in a number of forms:

1. For FACT table rows, is there a valid lookup against each of the Dimensions?

2. For FACT or DIMENSION rows, for each value:

* Is it Null when it shouldn’t be?
* Is the data type correct (eg. Number, Date)?
* Is the range of values or format correct?
* Is the row valid with relation to all the other source system business rules?

There is no magic way of checking the integrity of data.

You could simply count the number of rows in and out again and assume it’s all OK, but for a fact table (at the very minimum) you’ll need to cope with failed Dimension lookups (typically from late arriving Dimension rows).

The classic solution is to include a Dimension Key Zero and Minus One (Null and Invalid) in your Dimension table. Null columns are set to the Zero key, and a lookup failure to the Minus One. You may need to store and re-cycle rows with failed lookups and treat these as updates – so if the missing Dimension row appears later, the data is corrected.
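The Zero / Minus One key technique can be sketched with awk; the dimension keys and fact rows below are made up for illustration:

```shell
# Invented sample data: two valid dimension keys, and fact rows with one
# good key, one unknown key, and one null (empty) key.
printf '10\n20\n' > dim_keys.txt
printf 'a,10\nb,30\nc,\n' > fact_rows.csv

# First pass (NR==FNR) loads the dimension keys; second pass maps each fact
# row's key: empty -> 0 (the "Null" row), unknown -> -1 (the "Invalid" row).
awk -F, 'NR==FNR {dim[$1]; next}
         {k = $2
          if (k == "")          k = 0
          else if (!(k in dim)) k = -1
          print $1 "," k}' dim_keys.txt fact_rows.csv > resolved.csv
```

Rows that landed on the Minus One key can later be re-processed as updates once the missing dimension row arrives.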

Otherwise, you’ve no option. If the incoming data is from an unreliable source, you’ll need to check its validity or accept that the warehouse includes wrong results.

If the warehouse includes a high percentage of incorrect or misleading values – what’s the point of having it?


A2. To do this, you must profile the data at the source to know the domain of all the values, get the actual number of rows in the source, and get the types of the data in the source. After it is loaded into the target, this process can be repeated – checking the data values with respect to range, type, etc., and also checking the actual number of rows inserted. If the results before and after match, then we are OK. This process is typically automated in ETL tools.
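A minimal sketch of the count-reconciliation part of this check, using ordinary shell tools (the extract files and the 0..1000 range are hypothetical):

```shell
# Invented extracts; in practice these would be dumps of source and target.
printf 'id,amount\n1,10\n2,20\n3,30\n' > source_extract.csv
printf 'id,amount\n1,10\n2,20\n3,30\n' > target_extract.csv

# Compare row counts (minus the header line on each side).
src=$(($(wc -l < source_extract.csv) - 1))
tgt=$(($(wc -l < target_extract.csv) - 1))
[ "$src" -eq "$tgt" ] && echo "row counts match: $src" \
                      || echo "MISMATCH: source=$src target=$tgt"

# Simple domain check: flag any amount outside an agreed range (0..1000 here).
awk -F, 'NR > 1 && ($2 < 0 || $2 > 1000) {print "out of range:", $0}' target_extract.csv
```

Matching counts alone prove little – they must be combined with the type, range and lookup checks described above.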


Q. What is an ODS (operational data store)?

A1. The ODS comes between the staging area and the data warehouse. The data in the ODS is at a low level of granularity.

Once data is populated in the ODS, aggregated data is loaded into the EDW through the ODS.

A2. The ODS is the Operational Data Store, which holds transactional data and serves as a source for the warehouse. Data from the ODS is staged, transformed and then moved to the data warehouse.


Unix Shortcuts


Unix Commands
Command/Syntax
What it will do
awk/nawk [options] file
scan for patterns in a file and process the results
cat [options] file
concatenate (list) a file
cd [directory]
change directory
chgrp [options] group file
change the group of the file
chmod [options] file
change file or directory access permissions
chown [options] owner file
change the ownership of a file; can only be done by the superuser
chsh (passwd -e/-s) username login_shell
change the user's login shell (often only by the superuser)
cmp [options] file1 file2
compare two files and list where differences occur (text or binary files)
compress [options] file
compress file and save it as file.Z
cp [options] file1 file2
copy file1 into file2; file2 shouldn't already exist. This command creates or overwrites file2.
cut (options) [file(s)]
cut specified field(s)/character(s) from lines in file(s)
date [options]
report the current date and time
dd [if=infile] [of=outfile] [operand=value]
copy a file, converting between ASCII and EBCDIC or swapping byte order, as specified
diff [options] file1 file2
compare the two files and display the differences (text files only)
df [options] [resource]
report the summary of disk blocks and inodes free and in use
du [options] [directory or file]
report amount of disk space in use
echo [text string]
echo the text string to stdout
ed or ex [options] file
Unix line editors
emacs [options] file
full-screen editor
expr arguments
evaluate the arguments. Used to do arithmetic, etc. in the shell.
file [options] file
classify the file type
find directory [options] [actions]
find files matching a type or pattern
finger [options] user[@hostname]
report information about users on local and remote machines
ftp [options] host
transfer file(s) using file transfer protocol
grep [options] 'search string' argument
egrep [options] 'search string' argument
fgrep [options] 'search string' argument
search the argument (in this case probably a file) for all occurrences of the search string, and list them.
gzip [options] file
gunzip [options] file
zcat [options] file
compress or uncompress a file. Compressed files are stored with a .gz ending
head [-number] file
display the first 10 (or number of) lines of a file
hostname
display or set (super-user only) the name of the current machine
kill [options] [-SIGNAL] [pid#] [%job]
send a signal to the process with the process id number (pid#) or job control number (%n). The default signal is to kill the process.
ln [options] source_file target
link the source_file to the target
lpq [options]
lpstat [options]
show the status of print jobs
lpr [options] file
lp [options] file
print to defined printer
lprm [options]
cancel [options]
remove a print job from the print queue
ls [options] [directory or file]
list directory contents or file permissions
mail [options] [user]
mailx [options] [user]
Mail [options] [user]
simple email utility available on Unix systems. Type a period as the first character on a new line to send message out, question mark for help.
man [options] command
show the manual (man) page for a command
mkdir [options] directory
make a directory
more [options] file
less [options] file
pg [options] file
page through a text file
mv [options] file1 file2
move file1 into file2
od [options] file
octal dump a binary file, in octal, ASCII, hex, decimal, or character mode.
passwd [options]
set or change your password
paste [options] file
paste field(s) onto the lines in file
pr [options] file
filter the file and print it on the terminal
ps [options]
show status of active processes
pwd
print working (current) directory
rcp [options] hostname
remotely copy files from this machine to another machine
rlogin [options] hostname
login remotely to another machine
rm [options] file
remove (delete) a file or directory (-r recursively deletes the directory and its contents) (-i prompts before removing files)
rmdir [options] directory
remove a directory
rsh [options] hostname
remote shell to run on another machine
script file
saves everything that appears on the screen to file until exit is executed
sed [options] file
stream editor for editing files from a script or from the command line
sort [options] file
sort the lines of the file according to the options chosen
source file
. file
read commands from the file and execute them in the current shell. source: C shell, .: Bourne shell.
strings [options] file
report any sequence of 4 or more printable characters ending in <NL> or <NULL>. Usually used to search binary files for ASCII strings.
stty [options]
set or display terminal control options
tail [options] file
display the last few lines (or parts) of a file
tar key[options] [file(s)]
tape archiver--refer to man pages for details on creating, listing, and retrieving from archive files. Tar files can be stored on tape or disk.
tee [options] file
copy stdout to one or more files
telnet [host [port]]
communicate with another host using telnet protocol
touch [options] [date] file
create an empty file, or update the access time of an existing file
tr [options] string1 string2
translate the characters in string1 from stdin into those in string2 in stdout
uncompress file.Z
uncompress file.Z and save it as a file
uniq [options] file
remove repeated lines in a file
uudecode [file]
decode a uuencoded file, recreating the original file
uuencode [file] new_name
encode binary file to 7-bit ASCII, useful when sending via email, to be decoded as new_name at destination
vi [options] file
visual, full-screen editor
wc [options] [file(s)]
display word (or character or line) count for file(s)
whereis [options] command
report the binary, source, and man page locations for the command named
which command
reports the path to the command or the shell alias in use
who or w
report who is logged in and what processes are running
zcat file.Z
concatenate (list) uncompressed file to screen, leaving file compressed on disk
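Many of the commands above are most useful combined with pipes. For example, a common pattern for summarising a log file (the file and its contents are invented for illustration):

```shell
# Invented log file for illustration.
printf 'ERROR disk full\nINFO started\nERROR disk full\nERROR net down\n' > app.log

# grep selects the error lines, sort groups duplicates together, uniq -c
# counts each distinct line, sort -rn ranks by count, head keeps the top few.
grep '^ERROR' app.log | sort | uniq -c | sort -rn | head -5
```

Note that sort must come before uniq: uniq only collapses adjacent duplicate lines.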