DataStage Help Topics: JOIN STAGE

Join Stage:

The Join stage is a processing stage. It performs join operations on two or more data sets input to the
stage and then outputs the resulting data set. The Join stage is one of three stages that join tables based
on the values of key columns. The other two are:

1)Lookup Stage(http://datastage4u.wordpress.com/2011/04/30/lookup-stage/)
2)Merge Stage(http://datastage4u.wordpress.com/2011/04/30/merge-stage)

The three stages differ mainly in the memory they use, the treatment of rows with unmatched keys, and
their requirements for data being input.

If these take up a large amount of memory relative to the physical RAM memory size of the computer you are running on, then a lookup stage might thrash because the reference datasets might not fit in RAM along with everything else that has to be in RAM. This results in very slow performance since each lookup operation can, and typically
does, cause a page fault and an I/O operation.
So, if the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on
the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all
highly optimized and sequential. Once the sort is over the join processing is very fast and never involves
paging or other I/O.

In the below sections will show join stage configuration in datastage jobs:

Left Dataset:

CUSTOMER_ID	CUSTOMER_NAME
1	UMA
2	POOJI
3	KATRU

Right Dataset or Intermediate Dataset:

CUSTOMER_ID	CITY	ZIP_CODE
1	CYPRESS	90630
2	CYPRESS	90630

Scenario 1)

InnerJoin:Inner transfers records from input data sets whose key columns contain equal values to the output
data set. Records whose key columns do not contain equal values are dropped.

Step1)Configure sequential file stage to read both inputs.

LeftInput

Right_Intermediate_Input

Step2)Configure Join stage

In order to specify join key columns both column names should be same from both the inputs.In case of lookup stage this is not required which is also a difference between join and lookup stages.

Click on properties tab in the stage page and select key columns to perform the join.In our case we are using customer_id as key column .

stg_Join_properties

Join stage expects both inputs need to be partitioned and sorted based on join key columns.Go to the partitioning tab of input stage page and select hash partitioning as shown in below figure.

Go to the Advanced tab in stage page and clear the preserve partition flag to avoid warning while writing into sequential file stage.If you are using dataset then you can use propagate becuase dataset is parallel stage it will keep the exact partitions.

Go to the Mapping tab of the output page and select the required columns in the output

Go to the Link Ordering tab in the stage page and check the input link order.

Compile and run the job:

I am adding new column Customer Name in the right link to explain a scenario where non-key column name should be different from both inputs otherwise will get warning from the datastage job.

Compile and run the job and check the job log.We got following warning in the job.

stg_InnerJoin: When checking operator: Dropping component “CUSTOMER_NAME” because of a prior component with the same name.

Note:Non Key column names should be different from both the inputs to the join stage to avoid the above warning.

2)Left Outer Join

Compile and run the job and check results:

If you observe the output you can see all 3 left rows came out from the join stage.

Join stage doesnot have reject link to capture unmatched rows.Add a transformer stage after join stage and check for unmacted rows and write it to a another link.

when a match does not occur, the Join stage inserts values into the unmatched non-key column(s) using the following rules:

a) If the non-key column is defined as nullable (on the Join input links) then Enterprise Edition
will insert NULL values in the unmatched columns
b) If the non-key column is defined as not-nullable, then Enterprise Edition inserts “default”
values based on the data type. For example, the default value for an Integer is zero, the default
value for a Varchar is an empty string (“”), and the default value for a Char is a string of
padchar characters equal to the length of the Char column.
For this reason, care must be taken to change the column properties to allow NULL values
before the Join. This is most easily done by inserting a Copy stage and mapping a column from
NON-NULLABLE to NULLABLE.
A Transformer stage can be used to test for NULL values in unmatched columns.
In most cases, it is best to use a Column Generator to add an ‘indicator’ column, with a constant value,
to each of the inner links and test that column for the constant after you have performed the join. This
isolates your match/no-match logic from any changes in the metadata. This is also handy with Lookups
that have multiple reference links.

Add a column generator stage and define MATCH_IND column and specify value = ‘Y’.In the transformer stage check If MATCH_IND=’Y’ then matched else unmatched.

Compile and run the job:

DataStage Help Topics

Wednesday, 22 January 2014

JOIN STAGE

1 comment: