Wednesday, 22 January 2014

Performance Points

1) Avoid unnecessary type conversions: set the OSH_PRINT_SCHEMAS environment variable to verify that run time schemas match the job design column definitions. If you are using stage variables on a Transformer stage, ensure that their data types match the expected result types.
2) Use Transformer stages sparingly and wisely. Transformer stages can slow down your job. Do not use multiple stages where the functionality could be incorporated into a single stage, and use other stage types to perform simple transformation operations.
3) Increase Sort performance where possible. Careful job design can improve the performance of sort operations, both in standalone Sort stages and in on-link sorts specified in the Inputs page Partitioning tab of other stage types.
4) Remove Unneeded Columns. Remove unneeded columns as early as possible within the job flow. Every additional unused column requires additional buffer memory, which can impact performance and make each row transfer from one stage to the next more expensive. If possible, when reading from databases, use a select list to read just the columns required, rather than the entire table.
5) Avoid reading from sequential files using the Same partitioning method. Unless you have specified more than one source file, this will result in the entire file being read into a single partition, making the entire downstream flow run sequentially unless you explicitly repartition.
The environment variable APT_STRING_PADCHAR can be used to change the default pad character from an ASCII NUL (0x00) to another character; for example, an ASCII space (0x20).
6) It is important to note that the individual SQL statements required by a Sparse Lookup are an expensive operation from a performance perspective. In most cases, it is faster to use a DataStage Join stage between the input and DB2 reference data than it is to perform a Sparse Lookup (see the sketch that follows this list).
7) For scenarios where the number of input rows is significantly smaller (1:100 or more) than the number of reference rows in a DB2 or Oracle table, a Sparse Lookup may be appropriate. CPU-intensive applications, which typically perform multiple CPU-demanding operations on each record, benefit from the greatest possible parallelism, up to the capacity supported by your system.
8) Parallel jobs with large memory requirements can benefit from parallelism if they act on data that has been partitioned and if the required memory is also divided among partitions.
9)Applications that are disk- or I/O-intensive, such as those that extract data from and load data into RDBMSs, benefit from configurations in which the number of logical nodes equals the number of disk spindles being accessed. For example, if a table is fragmented 16 ways inside a database or if a data set is spread across 16 disk drives, set up a node pool consisting of 16 processing nodes.
10)For some jobs, especially those that are disk-intensive, you must sometimes configure your system to prevent the RDBMS from having either to redistribute load data or to re-partition the data from an extract operation.
11) The speed of communication among stages should be optimized by your configuration. For example, jobs whose stages exchange large amounts of data should be assigned to nodes where stages communicate by either shared memory (in an SMP environment) or a high-speed link (in an MPP environment). The relative placement of jobs whose stages share small amounts of data is less important.
12) Turn off Runtime Column Propagation wherever it is not required.
13) If you are using a Copy or a Filter stage immediately before or after a Transformer stage, you are reducing efficiency by using more stages than necessary, because a Transformer can do the job of both a Copy stage and a Filter stage.
14) If a Copy stage has a single input and a single output, set the Force option to True so that DataStage does not try to optimize the job by removing the Copy operation.
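
To make points 6 and 7 concrete, here is a small illustration. It is plain Python with an in-memory SQLite table standing in for the DB2 or Oracle reference table (the table and column names are invented for the example; this is not DataStage code). It contrasts the per-row SQL of a Sparse Lookup with a single bulk read followed by an in-memory match, which is roughly what a Join or normal Lookup does.

import sqlite3

# Stand-in for the DB2/Oracle reference table (names are illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ref_customers (customer_id INTEGER PRIMARY KEY, city TEXT)")
conn.executemany("INSERT INTO ref_customers VALUES (?, ?)",
                 [(i, "CYPRESS") for i in range(1, 100001)])

input_rows = [(1, "UMA"), (2, "POOJITHA"), (3, "KATRU")]

# Sparse-Lookup style: one SQL statement per input row.  With only a handful of
# input rows (point 7) this is cheap and avoids reading the whole reference table,
# but with millions of input rows the per-statement round trips dominate (point 6).
sparse_result = []
for customer_id, name in input_rows:
    row = conn.execute("SELECT city FROM ref_customers WHERE customer_id = ?",
                       (customer_id,)).fetchone()
    sparse_result.append((customer_id, name, row[0] if row else None))

# Join style: read the reference data once, then match rows in memory.
reference = dict(conn.execute("SELECT customer_id, city FROM ref_customers"))
join_result = [(cid, name, reference.get(cid)) for cid, name in input_rows]

assert sparse_result == join_result   # same answer, very different cost profile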

Pivot Enterprise stage

The Pivot Enterprise stage is a processing stage that pivots data horizontally and vertically.
Pivot Types:
1) Horizontal
2) Vertical
Horizontal pivoting maps a set of columns in an input row to a single column in multiple output rows.
The output data of the horizontal pivot action typically has fewer columns, but more rows than the input
data.
Input Data:
CUSTOMER_ID  CUSTOMER_NAME  JAN_EXPENSE  FEB_EXPENSE  MAR_EXPENSE
1            UMA            100          200          300
2            POOJITHA       200          300          400
Output Data:
CUSTOMER_ID  CUSTOMER_NAME  Q1EXPENSE
1            UMA            100
1            UMA            200
1            UMA            300
2            POOJITHA       200
2            POOJITHA       300
2            POOJITHA       400
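To see the column-to-row mapping outside the Designer, here is a short Python/pandas sketch of the same horizontal pivot (purely illustrative; it is not what the stage generates):

import pandas as pd

# Wide input: one row per customer, one column per month (the sample data above).
wide = pd.DataFrame({
    "CUSTOMER_ID":   [1, 2],
    "CUSTOMER_NAME": ["UMA", "POOJITHA"],
    "JAN_EXPENSE":   [100, 200],
    "FEB_EXPENSE":   [200, 300],
    "MAR_EXPENSE":   [300, 400],
})

# Horizontal pivot: map the three month columns onto a single output column,
# producing one output row per customer per month.
long_rows = (wide.melt(id_vars=["CUSTOMER_ID", "CUSTOMER_NAME"],
                       value_vars=["JAN_EXPENSE", "FEB_EXPENSE", "MAR_EXPENSE"],
                       value_name="Q1EXPENSE")
                 .drop(columns="variable")
                 .sort_values("CUSTOMER_ID", kind="mergesort"))  # stable sort keeps month order

print(long_rows)   # six rows, matching the output table above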
Specify a horizontal operation in the Pivot Enterprise stage.
Create a basic DataStage job using a Sequential File stage to read the input data, a Pivot Enterprise stage, and a Sequential File stage as the output to create the output data.
Basic Job Design:
Step 1) Configure the Sequential File stage to read the input data (a comma-delimited input file).
Step 2) Configure the Pivot Enterprise stage. Double-click the Pivot Enterprise stage and configure it as described below.
*Go to the properties page and select Pivot Type = Horizontal
Click on the Pivot Properties tab on the stage Properties page and follow these steps to configure the horizontal pivot:
You can generate a pivot index that will assign an index number to each row within sets of horizontally
pivoted data:
Now click on the Mapping tab of the Output page.
Now configure the output Sequential File stage to create a delimited output file.
Compile and run the job and check the output.
Vertical pivot operation:
Vertical pivoting maps a set of rows in the input data to single or multiple output columns. The array
size determines the number of rows in the output data. The output data of the vertical pivot action
typically has more columns, but fewer rows than the input data.
Input Data:
CUSTOMER_ID  CUSTOMER_NAME  Q1EXPENSE
1            UMA            100
1            UMA            200
1            UMA            300
2            POOJITHA       200
2            POOJITHA       300
2            POOJITHA       400
Output Data:
CUSTOMER_ID  CUSTOMER_NAME  JAN_EXPENSE  FEB_EXPENSE  MAR_EXPENSE  Q1AVERAGE
1            UMA            100          200          300          200
2            POOJITHA       200          300          400          300
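The same reshaping can be sketched in plain Python (again only an illustration, not the stage itself). It shows what the array size of 3 means: the three Q1EXPENSE values per customer are collected, spread into one column each, and the quarterly average is derived from them.

# Long input: three Q1EXPENSE rows per customer (the sample data above).
rows = [
    (1, "UMA",      100), (1, "UMA",      200), (1, "UMA",      300),
    (2, "POOJITHA", 200), (2, "POOJITHA", 300), (2, "POOJITHA", 400),
]

# Vertical pivot with an array size of 3: collect the values for each key,
# spread them into three month columns, and compute Q1AVERAGE.
grouped = {}
for customer_id, name, expense in rows:
    grouped.setdefault((customer_id, name), []).append(expense)

months = ["JAN_EXPENSE", "FEB_EXPENSE", "MAR_EXPENSE"]
for (customer_id, name), expenses in grouped.items():
    out = {"CUSTOMER_ID": customer_id, "CUSTOMER_NAME": name}
    out.update(zip(months, expenses))              # array size 3 -> three columns
    out["Q1AVERAGE"] = sum(expenses) // len(expenses)
    print(out)    # one output row per customer, matching the table above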
Now open the Pivot Enterprise stage and go to the Properties tab on the Stage page.
Go to the Pivot Properties tab on the Stage page and click the Load button to load the metadata.
Now specify the array size as 3, because we have 3 monthly expense values per customer in the Q1EXPENSE column of the input data and we need to convert them into 1 output row.
Configure the Mapping tab on the Output page of the stage.
Compile and run the job:

REMOVE-DUPLICATES STAGE

In this article I will explain how to use the Remove Duplicates stage to select unique records from the incoming data based on keys.
Input :
CUSTOMER_ID  CUSTOMER_NAME  CUSTOMER_CITY  DATE
1            UMA            BUENA PARK     20100101
1            UMA            CYPRESS        20110101
If you look at the above input, we have two records with the same customer name.
Business logic: select the most recent customer record.
We need to remove duplicates from the incoming data based on the CUSTOMER_ID and CUSTOMER_NAME keys and select the most recent record based on the DATE.
Expected Output (most recent record based on DATE):
CUSTOMER_ID  CUSTOMER_NAME  CUSTOMER_CITY  DATE
1            UMA            CYPRESS        20110101
In order to implement the above business logic we need to use the Remove Duplicates stage in our DataStage job design.
We are using a Sequential File stage to read the input data.
Step 1) Configure the Sequential File stage.
We have defined Date as the data type for the DATE field, and by default DataStage uses %yyyy-%mm-%dd as the date format string. We need to double-click row 4 (the DATE column) in the columns grid and configure the date format as %yyyy%mm%dd.
Now add the Remove Duplicates stage to the job canvas by dragging it from the palette. Double-click the Remove Duplicates stage and go to the Properties tab on the Stage page.
Select the keys that define the duplicates based on the given business logic. In our case CUSTOMER_ID and CUSTOMER_NAME are the keys that define duplicates in the data.
We need to partition and sort the data on the duplicate keys to make sure rows with the same keys go to the same DataStage partition node.
Go to the Partitioning tab on the Input page of the Remove Duplicates stage.
Partition type: Hash
Now we need to sort the data on the DATE column (no need to partition on it) in order to select the single record with the latest date.
Make sure the order of the partition and sort key columns matches the keys that define the duplicate rows.
Now go to the Properties tab on the Stage page and select Duplicate To Retain = Last. We configured the DATE sort in ascending order, which is why Duplicate To Retain = Last is selected.
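Logically the stage behaves like the following small Python sketch: hash-partition on the duplicate keys, sort each partition on the keys plus DATE, and keep the last row per key. This is only an illustration of the logic, not DataStage code, and the two-node partition count is an assumption for the example.

# Sample rows: (CUSTOMER_ID, CUSTOMER_NAME, CUSTOMER_CITY, DATE)
rows = [
    (1, "UMA", "BUENA PARK", "20100101"),
    (1, "UMA", "CYPRESS",    "20110101"),
]

NODES = 2   # illustrative partition count

# Hash partitioning: rows with the same duplicate keys land on the same node.
partitions = [[] for _ in range(NODES)]
for row in rows:
    key = (row[0], row[1])                     # CUSTOMER_ID, CUSTOMER_NAME
    partitions[hash(key) % NODES].append(row)

survivors = []
for part in partitions:
    # On-link sort: duplicate keys first, then DATE ascending.
    part.sort(key=lambda r: (r[0], r[1], r[3]))
    # Duplicate To Retain = Last: overwriting keeps the last (most recent) row per key.
    last_per_key = {}
    for row in part:
        last_per_key[(row[0], row[1])] = row
    survivors.extend(last_per_key.values())

print(survivors)    # [(1, 'UMA', 'CYPRESS', '20110101')]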
Now add a Sequential File stage as the output stage to create an output file, which completes our job design.
Compile and run the job:
Now look at the Director log to see whether there are any warnings.
I found the following warning in the job log:
stg_out: When checking operator: A sequential operator cannot preserve the partitioning
 of the parallel data set on input port 0
To remove this warning from the job log, go to the Advanced tab on the Stage page and clear the preserve partition flag.
Now compile and run the job.
The job finished successfully without any warnings.
Now I will change the partition and sort order of the columns in the Partitioning tab of the stage:
I ran the job with the above change and got the following warning in the job log:
stg_remDup: When checking operator: User inserted sort “stg_remDup.Lnk_Input_remdup_Sort” does not fulfill the sort requirements of the downstream operator “stg_remDup”
So we need to make sure the order of the partition and sort key columns matches the keys that define the duplicate rows, to avoid the above warning and incorrect results in the output.

JOIN STAGE

The Join stage is a processing stage. It performs join operations on two or more data sets input to the
stage and then outputs the resulting data set. The Join stage is one of three stages that join tables based
on the values of key columns. The other two are the Lookup stage and the Merge stage.
The three stages differ mainly in the memory they use, the treatment of rows with unmatched keys, and
their requirements for data being input.
If the reference datasets take up a large amount of memory relative to the physical RAM of the computer you are running on, then a Lookup stage might thrash, because the reference datasets might not fit in RAM along with everything else that has to be in RAM. This results in very slow performance, since each lookup operation can, and typically
does, cause a page fault and an I/O operation.
So, if the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on
the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all
highly optimized and sequential. Once the sort is over the join processing is very fast and never involves
paging or other I/O.
The sections below show the Join stage configuration in DataStage jobs:
Left Dataset:
CUSTOMER_ID  CUSTOMER_NAME
1            UMA
2            POOJI
3            KATRU
Right Dataset or Intermediate Dataset:
CUSTOMER_ID  CITY     ZIP_CODE
1            CYPRESS  90630
2            CYPRESS  90630
Scenario 1) Inner Join
An inner join transfers records from the input data sets whose key columns contain equal values to the output data set. Records whose key columns do not contain equal values are dropped.
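Applied to the sample data above, the inner join keeps only customers 1 and 2. The short Python sketch below shows which rows survive; it is purely illustrative and uses a simple hash match, whereas the Join stage itself performs a sort-merge join.

left = [
    {"CUSTOMER_ID": 1, "CUSTOMER_NAME": "UMA"},
    {"CUSTOMER_ID": 2, "CUSTOMER_NAME": "POOJI"},
    {"CUSTOMER_ID": 3, "CUSTOMER_NAME": "KATRU"},
]
right = [
    {"CUSTOMER_ID": 1, "CITY": "CYPRESS", "ZIP_CODE": "90630"},
    {"CUSTOMER_ID": 2, "CITY": "CYPRESS", "ZIP_CODE": "90630"},
]

# Index the right input on the join key, then keep only left rows that match.
right_by_key = {r["CUSTOMER_ID"]: r for r in right}
inner = [{**l, **right_by_key[l["CUSTOMER_ID"]]}
         for l in left if l["CUSTOMER_ID"] in right_by_key]

print(inner)    # customer 3 (KATRU) is dropped: no matching key on the right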
Step 1) Configure Sequential File stages to read both inputs.
LeftInput
Right_Intermediate_Input
Step 2) Configure the Join stage.
In order to specify the join key columns, the column names must be the same in both inputs. In the case of the Lookup stage this is not required, which is one difference between the Join and Lookup stages.
Click on the Properties tab on the Stage page and select the key columns to perform the join. In our case we are using CUSTOMER_ID as the key column.
The Join stage expects both inputs to be partitioned and sorted on the join key columns. Go to the Partitioning tab on the Input page and select hash partitioning.
Go to the Advanced tab on the Stage page and clear the preserve partition flag to avoid a warning when writing into the Sequential File stage. If you are using a Data Set instead, you can use Propagate, because the Data Set is a parallel stage and will keep the exact partitions.
Go to the Mapping tab of the Output page and select the required columns for the output.
Go to the Link Ordering tab in the stage page and check the input link order.
Compile and run the job:
I am adding a new column, CUSTOMER_NAME, to the right link to explain a scenario where non-key column names should be different in the two inputs; otherwise you will get a warning from the DataStage job.

Compile and run the job and check the job log. We got the following warning in the job log:
stg_InnerJoin: When checking operator: Dropping component “CUSTOMER_NAME” because of a prior component with the same name.
Note: non-key column names should be different in the two inputs to the Join stage to avoid the above warning.
Scenario 2) Left Outer Join
Compile and run the job and check results:
If you observe the output you can see all 3 left rows came out from the join stage.
The Join stage does not have a reject link to capture unmatched rows. Add a Transformer stage after the Join stage, check for unmatched rows, and write them to another link.
When a match does not occur, the Join stage inserts values into the unmatched non-key column(s) using the following rules:
a) If the non-key column is defined as nullable (on the Join input links) then Enterprise Edition
will insert NULL values in the unmatched columns
b) If the non-key column is defined as not-nullable, then Enterprise Edition inserts “default”
values based on the data type. For example, the default value for an Integer is zero, the default
value for a Varchar is an empty string (“”), and the default value for a Char is a string of
padchar characters equal to the length of the Char column.
For this reason, care must be taken to change the column properties to allow NULL values
before the Join. This is most easily done by inserting a Copy stage and mapping a column from
NON-NULLABLE to NULLABLE.
A Transformer stage can be used to test for NULL values in unmatched columns.
In most cases, it is best to use a Column Generator to add an ‘indicator’ column, with a constant value,
to each of the inner links and test that column for the constant after you have performed the join. This
isolates your match/no-match logic from any changes in the metadata. This is also handy with Lookups
that have multiple reference links.
Add a Column Generator stage, define a MATCH_IND column, and specify the value = 'Y'. In the Transformer stage check: if MATCH_IND = 'Y' then matched, else unmatched.
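A minimal Python sketch of this indicator technique (the MATCH_IND column and the constant 'Y' are the ones chosen above; the rest is illustrative and not DataStage code):

left = [
    {"CUSTOMER_ID": 1, "CUSTOMER_NAME": "UMA"},
    {"CUSTOMER_ID": 2, "CUSTOMER_NAME": "POOJI"},
    {"CUSTOMER_ID": 3, "CUSTOMER_NAME": "KATRU"},
]
right = [
    {"CUSTOMER_ID": 1, "CITY": "CYPRESS", "ZIP_CODE": "90630"},
    {"CUSTOMER_ID": 2, "CITY": "CYPRESS", "ZIP_CODE": "90630"},
]

# Column Generator step: add the constant indicator column to the inner (right) input.
right = [{**r, "MATCH_IND": "Y"} for r in right]

# Left outer join: unmatched rows get None here; in the stage they would get NULL
# only if the columns are nullable, otherwise the type's default value (0, "", pad chars).
right_by_key = {r["CUSTOMER_ID"]: r for r in right}
no_match = {"CITY": None, "ZIP_CODE": None, "MATCH_IND": None}
joined = [{**l, **right_by_key.get(l["CUSTOMER_ID"], no_match)} for l in left]

# Transformer step: test the indicator column, not the data columns.
for row in joined:
    status = "matched" if row["MATCH_IND"] == "Y" else "unmatched"
    print(row["CUSTOMER_ID"], row["CUSTOMER_NAME"], status)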
Compile and run the job: