Q. What is a staging area? Do we need it? What is the purpose of a staging area?
A1. A staging area is a place where you hold temporary tables on the data warehouse server. Staging tables are connected to the work area or fact tables. We basically need a staging area to hold the data and to perform data cleansing and merging before loading the data into the warehouse.
A2.
The staging area is:-
* One or more database schema(s) or file stores used to “stage” data extracted from the source OLTP systems prior to being published to the “warehouse” where it is visible to end users.
* Data in the staging area is NOT visible to end users for queries, reports or analysis of any kind. It does not hold completed data ready for querying.
* It may hold intermediate results (if data is pipelined through a process).
* Equally, it may hold “state” data – the keys of the data held in the warehouse, used to detect whether incoming data includes new or updated rows (or deleted ones, for that matter). A small sketch of this appears at the end of this answer.
* It is likely to be equal in size to (or maybe larger than) the “presentation area” itself.
* Although the “state” data – e.g. the last sequence loaded – may be backed up, much of the staging area data is automatically replaced during the ETL load processes and can, with care, avoid adding to the backup effort. The presentation area, however, may need backup in many cases.
* It may include some metadata, which may be used by analysts or operators monitoring the state of previous loads (e.g. audit information, summary totals of rows loaded, etc.).
* It’s likely to hold details of “rejected” entries – data which has failed quality tests and may need correction and re-submission to the ETL process.
* It’s likely to have few indexes (compared to the “presentation area”) and to hold data in a quite normalized form. The presentation area (the bit the end users see) is, by comparison, likely to be more highly indexed (mainly bitmap indexes), with highly de-normalized tables (the Dimension tables, anyway).
The staging area exists to be a separate “back room” or “engine room” of the warehouse where the data can be transformed, corrected and prepared for the warehouse.
It should ONLY be accessible to the ETL processes working on the data, or administrators monitoring or managing the ETL process.
In summary, a typical warehouse generally has three distinct areas:-
1. Several source systems which provide data. These can include databases (Oracle, SQL Server, Sybase, etc.), files, or spreadsheets.
2. A single “staging area” which may use one or more database schemas or file stores (depending upon warehouse load volumes).
3. One or more “visible” data marts or a single “warehouse presentation area” where data is made visible to end user queries. This is what many people think of as the warehouse – although the entire system is the warehouse – it depends upon your perspective.
The “staging area” is the middle bit.
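As a rough illustration of the “state” data idea above, here is a minimal Python sketch (the key and column names are hypothetical) that compares incoming keys against keys already loaded, to classify staged rows as new, updated, or unchanged:

```python
# Minimal sketch of staging-area change detection, assuming a
# hypothetical "state" store of previously loaded keys and row hashes.
import hashlib

def row_hash(row):
    """Fingerprint a row so updates can be detected cheaply."""
    return hashlib.md5(repr(sorted(row.items())).encode()).hexdigest()

def classify(incoming_rows, state):
    """Split staged rows into new / updated / unchanged.

    `state` maps business key -> hash of the row as last loaded.
    """
    new, updated, unchanged = [], [], []
    for row in incoming_rows:
        key = row["customer_id"]  # hypothetical business key
        h = row_hash(row)
        if key not in state:
            new.append(row)
        elif state[key] != h:
            updated.append(row)
        else:
            unchanged.append(row)
        state[key] = h            # refresh the staged "state"
    return new, updated, unchanged

state = {1: row_hash({"customer_id": 1, "name": "Ann"})}
rows = [{"customer_id": 1, "name": "Ann"}, {"customer_id": 2, "name": "Bob"}]
print([len(x) for x in classify(rows, state)])  # [1, 0, 1]
```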
Q. What are active transformation / Passive transformations?
A1. An active transformation can change the number of rows output by the transformation, while a passive transformation does not change the number of rows and passes through the same number of rows that it was given as input.
A2.
Transformations can be active or passive. An active transformation can change the number of rows that pass through it, such as a Filter transformation that removes rows that do not meet the filter condition. A passive transformation does not change the number of rows that pass through it, such as an Expression transformation that performs a calculation on data and passes all rows through the transformation. (A small sketch contrasting the two follows the lists below.)
Active transformations:
Advanced External Procedure
Aggregator
Application Source Qualifier
Filter
Joiner
Normalizer
Rank
Router
Update Strategy
Passive transformations:
Expression
External Procedure
Mapplet Input
Lookup
Sequence Generator
XML Source Qualifier
Mapplet Output
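To make the distinction concrete, here is a small Python sketch (analogy only, not Informatica code) contrasting a Filter-style transformation, which can drop rows, with an Expression-style transformation, which changes row contents but passes every row through:

```python
# Sketch: active vs. passive transformations over a list of rows.
rows = [{"amount": 120}, {"amount": -5}, {"amount": 40}]

# Active (Filter-like): the output row count can differ from the input.
filtered = [r for r in rows if r["amount"] > 0]

# Passive (Expression-like): every input row passes through;
# only the row contents change.
with_tax = [{**r, "amount_with_tax": r["amount"] * 1.1} for r in rows]

print(len(rows), len(filtered), len(with_tax))  # 3 2 3
```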
Q. What is the difference between Power Center & Power Mart?
A1. PowerMart is designed for:
low-end warehouses
local repositories only
mainly desktop environments.
PowerCenter is designed for:
high-end warehouses
global as well as local repositories
ERP support.
Q. What is the difference between etl tool and olap tools?
A1.
An ETL tool is meant for extracting data from legacy systems and loading it into a specified database, with some data cleansing along the way.
e.g. Informatica, DataStage, etc.
OLAP tools are meant for reporting; in OLAP, data is available in a multidimensional model, so you can write simple queries to extract data from the database.
e.g. Business Objects, Cognos, etc.
A2.
ETL tools are used to extract, transform, and load data into a data warehouse / data mart.
OLAP tools are used to create cubes/reports for business analysis from the data warehouse / data mart.
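As a toy illustration of this extract-transform-load flow (the file name, fields, and SQLite target are hypothetical stand-ins, not anything a particular ETL tool prescribes):

```python
# Toy ETL sketch: extract from a CSV, cleanse/transform, load into a
# SQLite table standing in for the warehouse.
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Cleansing: trim names, drop rows with no amount, cast types.
    return [(r["name"].strip(), float(r["amount"]))
            for r in rows if r.get("amount")]

def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract("sales.csv")), conn)  # "sales.csv" is hypothetical
```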
Q. What are the various ETL tools? - Name a few
A1.- Informatica
- Ab Initio
- DataStage
- Cognos Decision Stream
- Oracle Warehouse Builder
- Business Objects XI (Extreme Insight)
- SAP Business Warehouse
- SAS Enterprise ETL Server
Q. What are the various transformation available?
A1.
Transformations play an important role in a data warehouse. Transformations are applied when data is moved from source to destination, depending on the criteria. Some of the transformations are Aggregator, Lookup, Filter, Source Qualifier, Sequence Generator, and Expression.
A2.
The various types of transformations in Informatica:
Source Qualifier
Aggregator
Sequence Generator
Sorter
Router
Filter
Lookup
Update Strategy
Joiner
Normalizer
Expression
A3.
Aggregator Transformation
Expression Transformation
Filter Transformation
Joiner Transformation
Lookup Transformation
Normalizer Transformation
Rank Transformation
Router Transformation
Sequence Generator Transformation
Stored Procedure Transformation
Sorter Transformation
Update Strategy Transformation
XML Source Qualifier Transformation
Advanced External Procedure Transformation
External Procedure Transformation
Q. What are the different Lookup methods used in Informatica?
A1. A connected Lookup receives input from the pipeline, sends output to the pipeline, and can return any number of columns.
An unconnected Lookup is called from another transformation as an expression and can return only one column.
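As a loose analogy in Python (not Informatica syntax; the dictionary and column names are hypothetical), a connected lookup behaves like a join that can add several columns to each row in the pipeline, while an unconnected lookup behaves like a function call that returns a single value:

```python
# Analogy only: lookup behaviour sketched with a Python dict.
customers = {101: {"name": "Ann", "region": "EMEA"}}

# Connected-style: enrich each pipeline row with multiple looked-up columns.
rows = [{"customer_id": 101, "amount": 50.0}]
enriched = [{**r, **customers.get(r["customer_id"], {})} for r in rows]

# Unconnected-style: invoked like a function call, returns exactly one value.
def lkp_region(customer_id):
    return customers.get(customer_id, {}).get("region")

print(enriched[0]["region"], lkp_region(101))  # EMEA EMEA
```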
Q. Let's suppose we have some 10,000-odd records in the source system; when we load them into the target, how do we ensure that none of the 10,000 records loaded to the target contain any garbage values?
How do we test it? We can't check every record, as the number of records is huge.
A1.
Data Quality checks come in a number of forms:-
1. For FACT table rows, is there a valid lookup against each of the Dimensions
2. For FACT or DIMENSION rows, for each value:-
* Is it Null when it shouldn’t be
* Is the Data Type correct (e.g. Number, Date)
* Is the range of values or format correct
* Is the row valid with relation to all the other source system business rules?
There is no magic way of checking the integrity of data.
You could simply count the number of rows in and out again and assume it’s all OK, but for a fact table (at the very minimum) you’ll need to cope with failed Dimension lookups (typically from late arriving Dimension rows).
The classic solution is to include a Dimension Key Zero and Minus One (Null and Invalid) in your Dimension table. Null columns are set to the Zero key, and a lookup failure to the Minus One. You may need to store and re-cycle rows with failed lookups and treat these as updates – so if the missing Dimension row appears, the data is corrected.
Otherwise, you've no option: if the incoming data is from an unreliable source, you'll need to check its validity or accept that the warehouse includes wrong results.
If the warehouse includes a high percentage of incorrect or misleading values – what's the point of having it?
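A minimal sketch of the Zero / Minus One dimension-key idea above (the table and key names are hypothetical):

```python
# Sketch: defaulting fact rows to reserved dimension keys.
# Key 0 = the "Null" member, key -1 = the "Invalid / lookup failed" member.
dim_customer = {101: "Ann", 102: "Bob"}  # surrogate key -> attributes

def resolve_key(natural_key):
    if natural_key is None:
        return 0    # Null source value -> Zero key
    if natural_key not in dim_customer:
        return -1   # failed lookup -> Minus One key; re-cycle the row later
    return natural_key

facts = [{"cust": 101}, {"cust": None}, {"cust": 999}]
print([resolve_key(f["cust"]) for f in facts])  # [101, 0, -1]
```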
A2. To do this, you must profile the data at the source to know the domain of all the values, get the actual number of rows in the source, and get the types of the data in the source. After the data is loaded into the target, this process can be repeated, i.e. checking the data values with respect to range, type, etc. and also checking the actual number of rows inserted. If the results before and after match, then we are OK. This process is typically automated in ETL tools.
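A rough sketch of this before/after reconciliation, comparing row counts and a simple value profile (the column name is hypothetical):

```python
# Sketch: compare a simple profile of source rows vs. loaded target rows.
def profile(rows):
    amounts = [r["amount"] for r in rows if r["amount"] is not None]
    return {
        "row_count": len(rows),
        "null_amounts": sum(1 for r in rows if r["amount"] is None),
        "min_amount": min(amounts, default=None),
        "max_amount": max(amounts, default=None),
    }

source = [{"amount": 10}, {"amount": None}, {"amount": 99}]
target = [{"amount": 10}, {"amount": None}, {"amount": 99}]
assert profile(source) == profile(target), "source/target profiles differ"
```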
Q. What is ODS (Operational Data Store)?
A1. The ODS sits between the staging area and the data warehouse. The data in the ODS is at a low level of granularity. Once data is populated in the ODS, aggregated data is loaded into the EDW through the ODS.
A2. ODS is the Operational Data Store, which is also said to hold transactional data; the ODS is the source of a warehouse. Data from the ODS is staged, transformed, and then moved to the data warehouse.