Wednesday, 22 January 2014

Performance Points

1) Avoid unnecessary type conversions: set the OSH_PRINT_SCHEMAS environment variable to verify that run time schemas match the job design column definitions. If you are using stage variables on a Transformer stage, ensure that their data types match the expected result types.
2) Use Transformer stages sparingly and wisely. Transformer stages can slow down your job. Do not use multiple stages where the functionality could be incorporated into a single stage, and use other stage types to perform simple transformation operations.
3) Increase Sort performance where possible. Careful job design can improve the performance of sort operations, both in standalone Sort stages and in on-link sorts specified in the Inputs page Partitioning tab of other stage types.
4) Remove Unneeded Columns. Remove unneeded columns as early as possible within the job flow. Every additional unused column requires additional buffer memory, which can impact performance and make each row transfer from one stage to the next more expensive. If possible, when reading from databases, use a select list to read just the columns required, rather than the entire table.
5) Avoid reading from sequential files using the Same partitioning method. Unless you have specified more than one source file, this will result in the entire file being read into a single partition, making the entire downstream flow run sequentially unless you explicitly repartition.
The environment variable APT_STRING_PADCHAR can be used to change the default pad character from an ASCII NUL (0x00) to another character; for example, an ASCII space (0x20).
6) It is important to note that the individual SQL statements required by a Sparse Lookup are an expensive operation from a performance perspective. In most cases, it is faster to use a DataStage Join stage between the input and DB2 reference data than it is to perform a Sparse Lookup (see the sketch that follows this list).
7) For scenarios where the number of input rows is significantly smaller (1:100 or more) than the number of reference rows in a DB2 or Oracle table, a Sparse Lookup may be appropriate. CPU-intensive applications, which typically perform multiple CPU-demanding operations on each record, benefit from the greatest possible parallelism, up to the capacity supported by your system.
8) Parallel jobs with large memory requirements can benefit from parallelism if they act on data that has been partitioned and if the required memory is also divided among partitions.
9)Applications that are disk- or I/O-intensive, such as those that extract data from and load data into RDBMSs, benefit from configurations in which the number of logical nodes equals the number of disk spindles being accessed. For example, if a table is fragmented 16 ways inside a database or if a data set is spread across 16 disk drives, set up a node pool consisting of 16 processing nodes.
10)For some jobs, especially those that are disk-intensive, you must sometimes configure your system to prevent the RDBMS from having either to redistribute load data or to re-partition the data from an extract operation.
11) The speed of communication among stages should be optimized by your configuration. For example, jobs whose stages exchange large amounts of data should be assigned to nodes where stages communicate by either shared memory (in an SMP environment) or a high-speed link (in an MPP environment). The relative placement of jobs whose stages share small amounts of data is less important.
12) Turn off Runtime Column Propagation wherever it is not required.
13) If you are using a Copy or a Filter stage immediately before or after a Transformer stage, you are reducing efficiency by using more stages than necessary, because a Transformer can do the job of both a Copy stage and a Filter stage.
14) If a Copy stage has a single input and a single output, set the Force option to True so that DataStage does not try to optimize the job by removing the Copy operation.
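
To make points 6 and 7 concrete, here is a small illustration. It is plain Python with an in-memory SQLite table standing in for the DB2 or Oracle reference table (the table and column names are invented for the example; this is not DataStage code). It contrasts the per-row SQL of a Sparse Lookup with a single bulk read followed by an in-memory match, which is roughly what a Join or normal Lookup does.

import sqlite3

# Stand-in for the DB2/Oracle reference table (names are illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ref_customers (customer_id INTEGER PRIMARY KEY, city TEXT)")
conn.executemany("INSERT INTO ref_customers VALUES (?, ?)",
                 [(i, "CYPRESS") for i in range(1, 100001)])

input_rows = [(1, "UMA"), (2, "POOJITHA"), (3, "KATRU")]

# Sparse-Lookup style: one SQL statement per input row.  With only a handful of
# input rows (point 7) this is cheap and avoids reading the whole reference table,
# but with millions of input rows the per-statement round trips dominate (point 6).
sparse_result = []
for customer_id, name in input_rows:
    row = conn.execute("SELECT city FROM ref_customers WHERE customer_id = ?",
                       (customer_id,)).fetchone()
    sparse_result.append((customer_id, name, row[0] if row else None))

# Join style: read the reference data once, then match rows in memory.
reference = dict(conn.execute("SELECT customer_id, city FROM ref_customers"))
join_result = [(cid, name, reference.get(cid)) for cid, name in input_rows]

assert sparse_result == join_result   # same answer, very different cost profile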

Pivot Enterprise stage

The Pivot Enterprise stage is a processing stage that pivots data horizontally and vertically.
Pivot Types:
1) Horizontal
2) Vertical
Horizontal pivoting maps a set of columns in an input row to a single column in multiple output rows.
The output data of the horizontal pivot action typically has fewer columns, but more rows than the input
data.
Input Data:
CUSTOMER_ID  CUSTOMER_NAME  JAN_EXPENSE  FEB_EXPENSE  MAR_EXPENSE
1            UMA            100          200          300
2            POOJITHA       200          300          400
Output Data:
CUSTOMER_ID  CUSTOMER_NAME  Q1EXPENSE
1            UMA            100
1            UMA            200
1            UMA            300
2            POOJITHA       200
2            POOJITHA       300
2            POOJITHA       400
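To see the column-to-row mapping outside the Designer, here is a short Python/pandas sketch of the same horizontal pivot (purely illustrative; it is not what the stage generates):

import pandas as pd

# Wide input: one row per customer, one column per month (the sample data above).
wide = pd.DataFrame({
    "CUSTOMER_ID":   [1, 2],
    "CUSTOMER_NAME": ["UMA", "POOJITHA"],
    "JAN_EXPENSE":   [100, 200],
    "FEB_EXPENSE":   [200, 300],
    "MAR_EXPENSE":   [300, 400],
})

# Horizontal pivot: map the three month columns onto a single output column,
# producing one output row per customer per month.
long_rows = (wide.melt(id_vars=["CUSTOMER_ID", "CUSTOMER_NAME"],
                       value_vars=["JAN_EXPENSE", "FEB_EXPENSE", "MAR_EXPENSE"],
                       value_name="Q1EXPENSE")
                 .drop(columns="variable")
                 .sort_values("CUSTOMER_ID", kind="mergesort"))  # stable sort keeps month order

print(long_rows)   # six rows, matching the output table above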
Specify a horizontal operation in the Pivot Enterprise stage.
Create a basic DataStage job using a Sequential File stage to read the input data, a Pivot Enterprise stage, and a Sequential File stage as the output to create the output data.
Basic Job Design:
Step 1) Configure the Sequential File stage to read the input data (a comma-delimited input file).
Step 2) Configure the Pivot Enterprise stage. Double-click the Pivot Enterprise stage and configure it as described below.
*Go to the properties page and select Pivot Type = Horizontal
Click on the Pivot Properties tab on the stage Properties page and follow these steps to configure the horizontal pivot:
You can generate a pivot index that will assign an index number to each row within sets of horizontally
pivoted data:
Now click on the Mapping tab of the Output page.
Now configure the output Sequential File stage to create a delimited output file.
Compile and run the job and check the output.
Vertical pivot operation:
Vertical pivoting maps a set of rows in the input data to single or multiple output columns. The array
size determines the number of rows in the output data. The output data of the vertical pivot action
typically has more columns, but fewer rows than the input data.
Input Data:
CUSTOMER_ID  CUSTOMER_NAME  Q1EXPENSE
1            UMA            100
1            UMA            200
1            UMA            300
2            POOJITHA       200
2            POOJITHA       300
2            POOJITHA       400
Output Data:
CUSTOMER_ID  CUSTOMER_NAME  JAN_EXPENSE  FEB_EXPENSE  MAR_EXPENSE  Q1AVERAGE
1            UMA            100          200          300          200
2            POOJITHA       200          300          400          300
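The same reshaping can be sketched in plain Python (again only an illustration, not the stage itself). It shows what the array size of 3 means: the three Q1EXPENSE values per customer are collected, spread into one column each, and the quarterly average is derived from them.

# Long input: three Q1EXPENSE rows per customer (the sample data above).
rows = [
    (1, "UMA",      100), (1, "UMA",      200), (1, "UMA",      300),
    (2, "POOJITHA", 200), (2, "POOJITHA", 300), (2, "POOJITHA", 400),
]

# Vertical pivot with an array size of 3: collect the values for each key,
# spread them into three month columns, and compute Q1AVERAGE.
grouped = {}
for customer_id, name, expense in rows:
    grouped.setdefault((customer_id, name), []).append(expense)

months = ["JAN_EXPENSE", "FEB_EXPENSE", "MAR_EXPENSE"]
for (customer_id, name), expenses in grouped.items():
    out = {"CUSTOMER_ID": customer_id, "CUSTOMER_NAME": name}
    out.update(zip(months, expenses))              # array size 3 -> three columns
    out["Q1AVERAGE"] = sum(expenses) // len(expenses)
    print(out)    # one output row per customer, matching the table above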
Now open the Pivot Enterprise stage and go to the Properties tab on the Stage page.
Go to the Pivot Properties tab on the Stage page and click the Load button to load the metadata.
Now specify the array size as 3, because we have 3 monthly expense values per customer in the Q1EXPENSE column of the input data and we need to convert them into 1 output row.
Configure the Mapping tab on the Output page of the stage.
Compile and run the job:

REMOVE-DUPLICATES STAGE

In this article I will explain how to use the Remove Duplicates stage to select unique records from the incoming data based on keys.
Input :
CUSTOMER_ID  CUSTOMER_NAME  CUSTOMER_CITY  DATE
1            UMA            BUENA PARK     20100101
1            UMA            CYPRESS        20110101
If you look at the above input, we have two records with the same customer name.
Business logic: select the most recent customer record.
We need to remove duplicates from the incoming data based on the CUSTOMER_ID and CUSTOMER_NAME keys and select the most recent record based on the DATE.
Expected Output (most recent record based on DATE):
CUSTOMER_ID  CUSTOMER_NAME  CUSTOMER_CITY  DATE
1            UMA            CYPRESS        20110101
In order to implement the above business logic we need to use the Remove Duplicates stage in our DataStage job design.
We are using a Sequential File stage to read the input data.
Step 1) Configure the Sequential File stage.
We have defined Date as the data type for the DATE field, and by default DataStage uses %yyyy-%mm-%dd as the date format string. We need to double-click row 4 (the DATE column) in the columns grid and configure the date format as %yyyy%mm%dd.
Now add the Remove Duplicates stage to the job canvas by dragging it from the palette. Double-click the Remove Duplicates stage and go to the Properties tab on the Stage page.
Select the keys that define the duplicates based on the given business logic. In our case CUSTOMER_ID and CUSTOMER_NAME are the keys that define duplicates in the data.
We need to partition and sort the data on the duplicate keys to make sure rows with the same keys go to the same DataStage partition node.
Go to the Partitioning tab on the Input page of the Remove Duplicates stage.
Partition type: Hash
Now we need to sort the data on the DATE column (no need to partition on it) in order to select the single record with the latest date.
Make sure the order of the partition and sort key columns matches the keys that define the duplicate rows.
Now go to the Properties tab on the Stage page and select Duplicate To Retain = Last. We configured the DATE sort in ascending order, which is why Duplicate To Retain = Last is selected.
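Logically the stage behaves like the following small Python sketch: hash-partition on the duplicate keys, sort each partition on the keys plus DATE, and keep the last row per key. This is only an illustration of the logic, not DataStage code, and the two-node partition count is an assumption for the example.

# Sample rows: (CUSTOMER_ID, CUSTOMER_NAME, CUSTOMER_CITY, DATE)
rows = [
    (1, "UMA", "BUENA PARK", "20100101"),
    (1, "UMA", "CYPRESS",    "20110101"),
]

NODES = 2   # illustrative partition count

# Hash partitioning: rows with the same duplicate keys land on the same node.
partitions = [[] for _ in range(NODES)]
for row in rows:
    key = (row[0], row[1])                     # CUSTOMER_ID, CUSTOMER_NAME
    partitions[hash(key) % NODES].append(row)

survivors = []
for part in partitions:
    # On-link sort: duplicate keys first, then DATE ascending.
    part.sort(key=lambda r: (r[0], r[1], r[3]))
    # Duplicate To Retain = Last: overwriting keeps the last (most recent) row per key.
    last_per_key = {}
    for row in part:
        last_per_key[(row[0], row[1])] = row
    survivors.extend(last_per_key.values())

print(survivors)    # [(1, 'UMA', 'CYPRESS', '20110101')]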
Now add a Sequential File stage as the output stage to create an output file, which completes our job design.
Compile and run the job:
Now look at the Director log to see whether there are any warnings.
I found the following warning in the job log:
stg_out: When checking operator: A sequential operator cannot preserve the partitioning
 of the parallel data set on input port 0
To remove this warning from the job log, go to the Advanced tab on the Stage page and clear the preserve partition flag.
Now compile and run the job.
The job finished successfully without any warnings.
Now I will change the partition and sort order of the columns in the Partitioning tab of the stage:
I ran the job with the above change and got the following warning in the job log:
stg_remDup: When checking operator: User inserted sort “stg_remDup.Lnk_Input_remdup_Sort” does not fulfill the sort requirements of the downstream operator “stg_remDup”
So we need to make sure the order of the partition and sort key columns matches the keys that define the duplicate rows, to avoid the above warning and incorrect results in the output.

JOIN STAGE

The Join stage is a processing stage. It performs join operations on two or more data sets input to the
stage and then outputs the resulting data set. The Join stage is one of three stages that join tables based
on the values of key columns. The other two are the Lookup stage and the Merge stage.
The three stages differ mainly in the memory they use, the treatment of rows with unmatched keys, and
their requirements for data being input.
If the reference datasets take up a large amount of memory relative to the physical RAM of the computer you are running on, then a Lookup stage might thrash, because the reference datasets might not fit in RAM along with everything else that has to be in RAM. This results in very slow performance, since each lookup operation can, and typically
does, cause a page fault and an I/O operation.
So, if the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on
the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all
highly optimized and sequential. Once the sort is over the join processing is very fast and never involves
paging or other I/O.
The sections below show the Join stage configuration in DataStage jobs:
Left Dataset:
CUSTOMER_ID  CUSTOMER_NAME
1            UMA
2            POOJI
3            KATRU
Right Dataset or Intermediate Dataset:
CUSTOMER_ID  CITY     ZIP_CODE
1            CYPRESS  90630
2            CYPRESS  90630
Scenario 1) Inner Join
An inner join transfers records from the input data sets whose key columns contain equal values to the output data set. Records whose key columns do not contain equal values are dropped.
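Applied to the sample data above, the inner join keeps only customers 1 and 2. The short Python sketch below shows which rows survive; it is purely illustrative and uses a simple hash match, whereas the Join stage itself performs a sort-merge join.

left = [
    {"CUSTOMER_ID": 1, "CUSTOMER_NAME": "UMA"},
    {"CUSTOMER_ID": 2, "CUSTOMER_NAME": "POOJI"},
    {"CUSTOMER_ID": 3, "CUSTOMER_NAME": "KATRU"},
]
right = [
    {"CUSTOMER_ID": 1, "CITY": "CYPRESS", "ZIP_CODE": "90630"},
    {"CUSTOMER_ID": 2, "CITY": "CYPRESS", "ZIP_CODE": "90630"},
]

# Index the right input on the join key, then keep only left rows that match.
right_by_key = {r["CUSTOMER_ID"]: r for r in right}
inner = [{**l, **right_by_key[l["CUSTOMER_ID"]]}
         for l in left if l["CUSTOMER_ID"] in right_by_key]

print(inner)    # customer 3 (KATRU) is dropped: no matching key on the right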
Step 1) Configure Sequential File stages to read both inputs.
LeftInput
Right_Intermediate_Input
Step 2) Configure the Join stage.
In order to specify the join key columns, the column names must be the same in both inputs. In the case of the Lookup stage this is not required, which is one difference between the Join and Lookup stages.
Click on the Properties tab on the Stage page and select the key columns to perform the join. In our case we are using CUSTOMER_ID as the key column.
The Join stage expects both inputs to be partitioned and sorted on the join key columns. Go to the Partitioning tab on the Input page and select hash partitioning.
Go to the Advanced tab on the Stage page and clear the preserve partition flag to avoid a warning when writing into the Sequential File stage. If you are using a Data Set instead, you can use Propagate, because the Data Set is a parallel stage and will keep the exact partitions.
Go to the Mapping tab of the Output page and select the required columns for the output.
Go to the Link Ordering tab in the stage page and check the input link order.
Compile and run the job:
I am adding a new column, CUSTOMER_NAME, to the right link to explain a scenario where non-key column names should be different in the two inputs; otherwise you will get a warning from the DataStage job.

Compile and run the job and check the job log. We got the following warning in the job log:
stg_InnerJoin: When checking operator: Dropping component “CUSTOMER_NAME” because of a prior component with the same name.
Note: non-key column names should be different in the two inputs to the Join stage to avoid the above warning.
Scenario 2) Left Outer Join
Compile and run the job and check results:
If you observe the output you can see all 3 left rows came out from the join stage.
The Join stage does not have a reject link to capture unmatched rows. Add a Transformer stage after the Join stage, check for unmatched rows, and write them to another link.
When a match does not occur, the Join stage inserts values into the unmatched non-key column(s) using the following rules:
a) If the non-key column is defined as nullable (on the Join input links) then Enterprise Edition
will insert NULL values in the unmatched columns
b) If the non-key column is defined as not-nullable, then Enterprise Edition inserts “default”
values based on the data type. For example, the default value for an Integer is zero, the default
value for a Varchar is an empty string (“”), and the default value for a Char is a string of
padchar characters equal to the length of the Char column.
For this reason, care must be taken to change the column properties to allow NULL values
before the Join. This is most easily done by inserting a Copy stage and mapping a column from
NON-NULLABLE to NULLABLE.
A Transformer stage can be used to test for NULL values in unmatched columns.
In most cases, it is best to use a Column Generator to add an ‘indicator’ column, with a constant value,
to each of the inner links and test that column for the constant after you have performed the join. This
isolates your match/no-match logic from any changes in the metadata. This is also handy with Lookups
that have multiple reference links.
Add a Column Generator stage, define a MATCH_IND column, and specify the value = 'Y'. In the Transformer stage check: if MATCH_IND = 'Y' then matched, else unmatched.
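A minimal Python sketch of this indicator technique (the MATCH_IND column and the constant 'Y' are the ones chosen above; the rest is illustrative and not DataStage code):

left = [
    {"CUSTOMER_ID": 1, "CUSTOMER_NAME": "UMA"},
    {"CUSTOMER_ID": 2, "CUSTOMER_NAME": "POOJI"},
    {"CUSTOMER_ID": 3, "CUSTOMER_NAME": "KATRU"},
]
right = [
    {"CUSTOMER_ID": 1, "CITY": "CYPRESS", "ZIP_CODE": "90630"},
    {"CUSTOMER_ID": 2, "CITY": "CYPRESS", "ZIP_CODE": "90630"},
]

# Column Generator step: add the constant indicator column to the inner (right) input.
right = [{**r, "MATCH_IND": "Y"} for r in right]

# Left outer join: unmatched rows get None here; in the stage they would get NULL
# only if the columns are nullable, otherwise the type's default value (0, "", pad chars).
right_by_key = {r["CUSTOMER_ID"]: r for r in right}
no_match = {"CITY": None, "ZIP_CODE": None, "MATCH_IND": None}
joined = [{**l, **right_by_key.get(l["CUSTOMER_ID"], no_match)} for l in left]

# Transformer step: test the indicator column, not the data columns.
for row in joined:
    status = "matched" if row["MATCH_IND"] == "Y" else "unmatched"
    print(row["CUSTOMER_ID"], row["CUSTOMER_NAME"], status)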
Compile and run the job: