ETL Tool – Informatica – Part 1

Informatica:

Informatica is a powerful ETL tool from Informatica Corporation, a leading provider of enterprise data integration and ETL software.

The important Informatica Components are:

  • Power Exchange
  • Power Center
  • Power Center Connect
  • Power Channel
  • Metadata Exchange
  • Power Analyzer
  • Super Glue

In Informatica, all the metadata information about source systems, target systems and transformations is stored in the Informatica repository. Informatica’s Power Center Client and Repository Server access this repository to store and retrieve metadata.


Source and Target:

Consider a bank that has many branches throughout the world. In each branch, data may be stored in different source systems such as Oracle, SQL Server, Teradata, etc. When the bank decides to integrate its data from these sources to support management decisions, it may choose one or more systems such as Oracle, SQL Server or Teradata as its data warehouse target. Many organisations prefer Informatica for this ETL process because Informatica is powerful in designing and building data warehouses: it can connect to several sources and targets, extract metadata from them, and transform and load the data into the target systems.

Guidelines to work with Informatica Power Center:

Repository: This is where all the metadata information is stored in the Informatica suite. The Power Center Client and the Repository Server access this repository to retrieve, store and manage metadata.

Power Center Client: The Informatica client is used for managing users, defining source and target system definitions, creating mappings and mapplets, creating sessions and running workflows, etc.

Repository Server: This repository server takes care of all the connections between the repository and the Power Center Client.

Power Center Server: The Power Center Server extracts data from the sources and loads it into the targets.

Designer: Source Analyzer, Mapping Designer and Warehouse Designer are tools that reside within the Designer.

Source Analyzer is used for extracting metadata from source systems.

Mapping Designer is used to create mappings between sources and targets. A mapping is a pictorial representation of the flow of data from source to target.

Warehouse Designer is used for extracting metadata from target systems, or the target metadata can be created in the Designer itself.

Data Cleansing: PowerCenter’s data cleansing technology improves data quality by validating, correctly naming and standardizing address data. A person’s address may not be the same in all source systems because of typos, and the postal code or city name may not match the street address. These errors can be corrected through the data cleansing process, and the standardized data can then be loaded into the target system (data warehouse).
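
As a rough illustration of what such a cleansing step does (PowerCenter performs this through its own data cleansing components, not hand-written code), here is a minimal Python sketch that standardizes a hypothetical address record; the field names and rules are assumptions made purely for illustration:

    import re

    def cleanse_address(record):
        """Standardize a hypothetical address record (illustrative only)."""
        cleaned = dict(record)
        # Collapse repeated spaces and normalize casing on name, street and city.
        for field in ("name", "street", "city"):
            value = re.sub(r"\s+", " ", cleaned.get(field, "").strip())
            cleaned[field] = value.title()
        # Keep only digits in the postal code and flag anything that is not 5 digits.
        zip_digits = re.sub(r"\D", "", cleaned.get("zip", ""))
        cleaned["zip"] = zip_digits
        cleaned["zip_valid"] = len(zip_digits) == 5
        return cleaned

    print(cleanse_address({"name": "kEvin CURtis", "street": "12  main st",
                           "city": "new york", "zip": "10001-"}))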

Transformation: Transformations help to transform the source data according to the requirements of the target system. Sorting, filtering, aggregation and joining are some examples of transformations. Transformations ensure the quality of the data being loaded into the target, and this happens during the mapping from source to target.
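
To make the idea concrete, the sketch below applies a filter and an aggregation to a few made-up rows in plain Python; it only illustrates the kind of logic that Filter and Aggregator transformations express graphically inside a mapping:

    from collections import defaultdict

    rows = [
        {"branch": "NY", "amount": 120.0},
        {"branch": "NY", "amount": -5.0},   # invalid row, to be filtered out
        {"branch": "LA", "amount": 80.0},
    ]

    # Filtering: keep only rows with a valid amount.
    valid_rows = [r for r in rows if r["amount"] > 0]

    # Aggregation: total amount per branch.
    totals = defaultdict(float)
    for r in valid_rows:
        totals[r["branch"]] += r["amount"]

    print(dict(totals))   # {'NY': 120.0, 'LA': 80.0}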

Workflow Manager: Workflows help to load the data from source to target in a defined sequence. For example, if the fact tables are loaded before the lookup (dimension) tables, the target system will raise an error because the fact table violates its foreign key constraints. To avoid this, workflows are created to enforce the correct order of loading from source to target.
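
As a loose analogy only (the Workflow Manager itself is a graphical tool, not code), the sketch below runs hypothetical load tasks in an explicit order so that the parent tables are always populated before the fact table that references them:

    def load_customer_dim():
        print("loading CUSTOMER dimension")

    def load_product_dim():
        print("loading PRODUCT dimension")

    def load_sales_fact():
        print("loading SALES fact (references CUSTOMER and PRODUCT)")

    # The ordered list plays the role of a workflow: dimensions first, fact last,
    # so the foreign key checks on the fact table never fail.
    workflow = [load_customer_dim, load_product_dim, load_sales_fact]
    for task in workflow:
        task()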

Workflow Monitor: This monitor is helpful in monitoring and tracking the workflows created in each Power Center Server.

Power Center Connect: This component helps to extract data and metadata from ERP systems such as PeopleSoft, SAP and Siebel, from messaging systems such as IBM’s MQSeries, and from other third-party applications.

Power Exchange: This component helps to access and extract data and metadata from mainframe and other legacy systems (for example IMS, VSAM and DB2) that the standard connectors cannot reach directly.

 

Next ⇒ ETL Tool – Informatica – Part 2

 

Informatica ETL Interview Questions and Answers – Part 3

Informatica Interview Questions & Answers:

P. How are DTM Buffer Size and Buffer Block Size related?

  • The number of buffer blocks in a session = DTM Buffer Size / Buffer Block Size. Default settings create enough buffer blocks for 83 sources and targets. If the session contains more than 83, you might need to increase DTM Buffer Size or decrease Default Buffer Block Size.
  • [(total number of sources) + (total number of targets)] * 2 = (session buffer blocks)
  • (session buffer blocks) = 0.9 * (DTM Buffer Size) / (Default Buffer Block Size) * (number of partitions). A worked example follows this list.
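
The worked example below applies the two formulas above as a small Python calculation; the session settings are made-up numbers used only to illustrate the arithmetic:

    # Hypothetical session settings (assumptions for illustration only).
    dtm_buffer_size = 12_000_000        # bytes
    default_buffer_block_size = 64_000  # bytes
    number_of_partitions = 1
    num_sources, num_targets = 5, 3

    # Blocks the session needs: (sources + targets) * 2
    required_blocks = (num_sources + num_targets) * 2

    # Blocks the session gets: 0.9 * DTM Buffer Size / Buffer Block Size * partitions
    available_blocks = (0.9 * dtm_buffer_size / default_buffer_block_size
                        * number_of_partitions)

    print(f"required={required_blocks}, available={available_blocks:.0f}")
    if available_blocks < required_blocks:
        print("Increase DTM Buffer Size or decrease the Default Buffer Block Size.")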

Q. What is the difference between a shortcut and a reusable transformation?

  • A shortcut is a reference (link) to an object in a shared folder; shortcuts are commonly used for sources and targets that are to be shared between different environments or projects. A shortcut is created by assigning ‘Shared’ status to a folder within the Repository Manager and then dragging objects from this folder into another open folder. This provides a single point of control and reference for the object, so multiple projects do not all have to import sources and targets into their local folders.
  • A reusable transformation is usually something that is kept local to a folder. An example would be a reusable Sequence Generator used for allocating warehouse customer IDs, which is useful if you are loading customer details from multiple source systems and allocating a unique ID to each new source key. Many mappings can use the same sequence, and the sessions all draw from the same continuous pool of sequence numbers (a small Python analogy follows this list).
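
As a loose analogy only (not how PowerCenter implements it), the sketch below shows several hypothetical mappings drawing surrogate keys from one shared sequence, which is the effect a reusable Sequence Generator provides:

    import itertools

    # One shared sequence, standing in for a reusable Sequence Generator.
    customer_key_seq = itertools.count(start=1)

    def load_customers(source_name, new_customers, seq):
        """Each 'mapping' draws keys from the same continuous pool."""
        return [(next(seq), source_name, cust) for cust in new_customers]

    print(load_customers("CRM", ["Alice", "Bob"], customer_key_seq))   # keys 1 and 2
    print(load_customers("Billing", ["Carol"], customer_key_seq))      # key 3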

R. What are the types of groups in a Router transformation?

  • Input group.
  • Output groups, which are of two types: user-defined groups and the default group.
  • The Designer copies property information from the input ports of the input group to create a set of output ports for each output group.
  • The default group cannot be modified or deleted.

S. What are the output files that the Informatica server creates during a session run?

  • Informatica server log: The Informatica server (on UNIX) creates a log for all status and error messages (default name: pm.server.log). It also creates an error log for error messages. These files are created in the Informatica home directory.
  • Session log file: The Informatica server creates a session log file for each session. It writes information about the session into the log file, such as the initialization process, the creation of SQL commands for the reader and writer threads, errors encountered and the load summary. The amount of detail in the session log file depends on the tracing level that you set.
  • Session detail file: This file contains load statistics for each target in the mapping. Session details include information such as the table name and the number of rows written or rejected. You can view this file by double-clicking on the session in the monitor window.
  • Performance detail file: This file contains session performance details that help you identify where performance can be improved. To generate this file, select the performance detail option in the session property sheet.
  • Reject file: This file contains the rows of data that the writer does not write to targets.
  • Control file: The Informatica server creates a control file and a target file when you run a session that uses the external loader. The control file contains information about the target flat file, such as the data format and loading instructions for the external loader.
  • Post session email: Post-session email allows you to automatically communicate information about a session run to designated recipients. You can create two different messages: one if the session completes successfully and the other if the session fails.
  • Indicator file: If you use a flat file as a target, you can configure the Informatica server to create an indicator file. For each target row, the indicator file contains a number to indicate whether the row was marked for insert, update, delete or reject.
  • Output file: If session writes to a target file, the Informatica server creates the target file based on file properties entered in the session property sheet.
  • Cache files: When the Informatica server creates a memory cache, it also creates cache files. The Informatica server creates index and data cache files for the Aggregator, Rank, Joiner and Lookup transformations (when caching is used).

 

Informatica ETL Interview Questions and Answers – Part 1

Informatica Interview Questions & Answers:

A. What are the differences between Connected and Unconnected lookup?

  • A connected lookup participates in the mapping (data flow) just like any other transformation. An unconnected lookup is used when a lookup function is called from an expression transformation, in which case the lookup does not appear in the main data flow of the mapping. A connected lookup can return more than one value (output port), whereas an unconnected lookup returns only one output port. Unconnected lookups are reusable.
  • A connected transformation is connected to other transformations or directly to the target table in the mapping. An unconnected transformation is not connected to other transformations in the mapping; it is called within another transformation and returns a value to that transformation (a loose Python analogy follows this list).
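
A very loose Python analogy of the two styles is sketched below; the lookup table and field names are invented for illustration, and the real transformations are configured graphically, not coded:

    # A lookup source: customer master keyed by id (made-up data).
    CUSTOMER_LOOKUP = {101: {"name": "Alice", "tier": "Gold"}}

    # "Connected" style: the lookup sits in the data flow and can return several ports.
    def connected_flow(row):
        match = CUSTOMER_LOOKUP.get(row["cust_id"], {})
        row["cust_name"] = match.get("name")
        row["cust_tier"] = match.get("tier")
        return row

    # "Unconnected" style: called like a function from an expression, one return value.
    def lkp_customer_name(cust_id):
        return CUSTOMER_LOOKUP.get(cust_id, {}).get("name")

    def expression_transform(row):
        row["greeting"] = "Hello " + str(lkp_customer_name(row["cust_id"]))
        return row

    print(connected_flow({"cust_id": 101, "amount": 50}))
    print(expression_transform({"cust_id": 101}))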

B. When do we use dynamic cache and static cache in connected and unconnected lookup transformations?

  • A dynamic cache is used when the lookup or target table changes during the session, for example when updating a master table or loading SCD (Slowly Changing Dimension) Type 1 targets.
  • A static cache is used when the lookup source does not change during the session, for example a flat file lookup.

C. What is the tracing level?

  • Tracing level determines the amount of detail written to the session log file.
  • The tracing levels are Terse, Normal, Verbose Initialization and Verbose Data.
  • For example, when a session log is created:
    • Normal: logs initialization and status information, errors encountered and a summary of the session results.
    • Verbose Data: logs detailed information for each and every row that passes through the mapping.

D. Which transformations support sorted input?

  • The Aggregator, Joiner and Lookup transformations support sorted input, which increases session performance.

E. Name 4 output files that the Informatica server creates during a session run:

  • Session log
  • Workflow log
  • Errors log
  • Badfile

F. What is the difference between stop and abort?

  • The Stop command halts the reading process immediately and does not have a timeout period.
  • The Abort command gives the Informatica server a timeout period of 60 seconds to finish the DTM process; otherwise it kills the DTM process.

G. What is Update Override? What are the differences between SQL Override and Update Override?

  • Update Override is an option available on the TARGET instance. By default, the target table is updated based on primary key values. To update the target table on non-primary-key values, we can generate the default query and override it according to the requirement. For example, if we want to update a record in the target table only when a column value = ‘AAA’, we can include this condition in the WHERE clause of the default query.
  • SQL Override is an option available in the Source Qualifier and Lookup transformations, where we can include joins, filters, GROUP BY and ORDER BY.

 

Next⇒ Informatica Interview Questions Continued …Page 2

ETL Concepts

ETL Concepts:

ETL stands for extraction, transformation, and loading. It refers to the methods involved in accessing and manipulating source data and loading it into the target database.

The first step in the ETL process is mapping the data between the source systems and the target database (data warehouse or data mart). The second step is cleansing the source data in the staging area. The third step is transforming the cleansed source data and then loading it into the target system.
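
A minimal, hedged sketch of these steps in plain Python is shown below; the sample rows, field names and rules are assumptions made purely for illustration, and an ETL tool such as Informatica expresses the same steps through mappings and workflows rather than hand-written code:

    def extract():
        """Source rows as they might arrive from one branch system (made-up sample)."""
        return [
            {"cust_id": "  101 ", "country": "us", "balance": "1200.50"},
            {"cust_id": "102", "country": "IN", "balance": "980"},
        ]

    def cleanse(rows):
        """Staging step: trim fields and normalize simple codes."""
        for row in rows:
            row["cust_id"] = row["cust_id"].strip()
            row["country"] = row["country"].strip().upper()
        return rows

    def transform(rows):
        """Apply business rules, for example converting balances to numbers."""
        for row in rows:
            row["balance"] = float(row["balance"])
        return rows

    def load(rows, target_table):
        """Append the prepared rows to the target (a list standing in for a table)."""
        target_table.extend(rows)
        return target_table

    warehouse_customers = []
    load(transform(cleanse(extract())), warehouse_customers)
    print(warehouse_customers)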

Note that ETT (Extraction, Transformation, Transportation) and ETM (Extraction, Transformation, Move) are sometimes used instead of ETL.

Data comes in different forms, structured or unstructured, in huge volumes and at high velocity.

Data in Structured Format:

  • Example: .csv files, .xls files, .dat files, .xml files etc.
  • Structured data is stored for several years within the Organization’s high end servers.
  • The need for data analysis (slicing and dicing) of these structured data formats led to the evolution of ETL.

In earlier times, data from different source systems such as .csv, .xls, .dat and .xml files was loaded into the target database by traditional programming methods. The same data can be loaded much more quickly with ETL tools, by making use of the different components they provide.

From the ETL perspective, an ETL developer has to know how to identify and connect to the source systems, apply the transformations, and load the data into the target database.

Extraction and loading are easy compared to transformation, since the transformation step has to deal with the different anomalies introduced during data entry.

Common Anomalies found in the Source Systems:

1. Name: Combination of upper-case and lower-case characters with no predefined format.
Example: kEvin CURtis, STEven CURtis
2. Indicator Columns: Yes, No, Y, N
3. Gender Columns: M, F, Male, Female
4. Format Columns: xxx-xx-xxxx, xx-xxx-xxxx
5. Currency Columns: $1234, $1234.00, $1234.50
6. Date Columns: MM/DD/YYYY, DD/MM/YYYY formats

The above-mentioned problems are resolved by the ETL developer using the ETL tools, and the corrected data is loaded into the target system.
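
A hedged sketch of how such anomalies might be standardized is given below in plain Python; a real ETL tool does this through its own transformation components, and the mapping tables and date handling here are assumptions chosen only to mirror the examples above:

    from datetime import datetime

    INDICATOR_MAP = {"YES": "Y", "NO": "N", "Y": "Y", "N": "N"}
    GENDER_MAP = {"M": "Male", "MALE": "Male", "F": "Female", "FEMALE": "Female"}

    def standardize_name(name):
        # "kEvin CURtis" -> "Kevin Curtis"
        return " ".join(part.capitalize() for part in name.split())

    def standardize_date(value):
        # Try the two formats listed above; assumes MM/DD/YYYY should be tried first.
        for fmt in ("%m/%d/%Y", "%d/%m/%Y"):
            try:
                return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
            except ValueError:
                continue
        return None  # unparseable dates would be routed to a reject file

    print(standardize_name("kEvin CURtis"))      # Kevin Curtis
    print(INDICATOR_MAP.get("Yes".upper()))      # Y
    print(GENDER_MAP.get("male".upper()))        # Male
    print(standardize_date("12/31/2023"))        # 2023-12-31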

Sample ETL Process Flow Diagram:

[Diagram: sample ETL process flow]

Glossary of ETL Terms (Reference: www.Oracle.com):

Source System:

A database, application, file, or other storage facility from which the data in a data warehouse is derived.

Mapping:

The definition of the relationship and data flow between source and target objects.

Metadata:

Data that describes data and other structures, such as objects, business rules, and processes. For example, the schema design of a data warehouse is typically stored in a repository as metadata, which is used to generate scripts used to build and populate the
data warehouse. A repository contains metadata.

Staging Area:

A place where data is processed before entering the warehouse.

Cleansing:

The process of resolving inconsistencies and fixing the anomalies in source data, typically as part of the ETL process.

Transformation:

The process of manipulating data. Any manipulation beyond copying is a transformation. Examples include cleansing, aggregating, and integrating data from multiple sources.

Transportation:

The process of moving copied or transformed data from a source to a data warehouse.

Target System:

A database, application, file, or other storage facility to which the transformed source data is loaded in a data warehouse.

 

Reach US!!!

 

  • We provide online training in advanced OLTP Data Modeling and DIMENSIONAL Data Modeling.
  • We also teach data structures with the Data Analytics Software “R”.
  • We provide online Data Modeling Project Support when you get stuck with the projects you are involved in.
  • We can analyse your Business Requirements, understand them and suggest solutions to create OLTP Data Models and Dimensional Data Models.
  • We provide Data Modeling Interview Preparation Sessions with a lot of Data Modeling Interview Questions/Answers, which will help you to clear any interview.

 

If you are interested, please reach us at Training@LearnDataModeling.com or 91-9080157239