Wednesday 22 January 2014

REMOVE-DUPLICATES STAGE

In this article I will explain how to use the Remove Duplicates stage to select unique records from the incoming data based on key columns.
Input:

CUSTOMER_ID | CUSTOMER_NAME | CUSTOMER_CITY | DATE
1           | UMA           | BUENA PARK    | 20100101
1           | UMA           | CYPRESS       | 20110101
If you look at the input above, we have two records with the same customer ID and customer name.
Business Logic: Select the most recent customer record.
We need to remove duplicates from the incoming data based on the CUSTOMER_ID and CUSTOMER_NAME keys, and select the most recent record based on DATE.
Expected Output (most recent record based on DATE):

CUSTOMER_ID | CUSTOMER_NAME | CUSTOMER_CITY | DATE
1           | UMA           | CYPRESS       | 20110101
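Before building the job, the business logic itself can be sketched in plain Python (this is an illustrative sketch, not DataStage code; the dictionary-based grouping is my own construction for the example):

```python
# Keep the most recent record per (CUSTOMER_ID, CUSTOMER_NAME) key.
rows = [
    {"CUSTOMER_ID": 1, "CUSTOMER_NAME": "UMA", "CUSTOMER_CITY": "BUENA PARK", "DATE": "20100101"},
    {"CUSTOMER_ID": 1, "CUSTOMER_NAME": "UMA", "CUSTOMER_CITY": "CYPRESS",    "DATE": "20110101"},
]

latest = {}
for row in rows:
    key = (row["CUSTOMER_ID"], row["CUSTOMER_NAME"])
    # YYYYMMDD strings compare correctly as text, so a simple string
    # comparison is enough to find the newer record.
    if key not in latest or row["DATE"] > latest[key]["DATE"]:
        latest[key] = row

print(list(latest.values()))  # one record per key: the CYPRESS row
```

This is exactly what the Remove Duplicates stage does for us declaratively: the keys define the groups, and a sort order plus a retain rule picks the survivor.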
To implement the business logic above we need to use the Remove Duplicates stage in our DataStage job design.
We are using a Sequential File stage to read the input data.
Step 1) Configure the Sequential File stage
We have defined Date as the datatype for the DATE field, and by default DataStage uses %yyyy-%mm-%dd as the date format string. We need to double click on row 4 in the screen above and configure the date format as %yyyy%mm%dd.
Now add the Remove Duplicates stage to the job canvas by dragging it from the palette. Double click the Remove Duplicates stage and go to the Properties tab on the Stage page.
Select the keys that define the duplicates based on the given business logic. In our case CUSTOMER_ID and CUSTOMER_NAME are the keys that define duplicates in the data.
We need to partition and sort the data on the duplicate keys to make sure rows with the same keys go to the same DataStage partition node.
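Why partitioning matters here can be sketched outside DataStage. The snippet below is a hypothetical Python illustration (not DataStage code): a hash of the duplicate-key columns decides which partition a row goes to, so rows sharing the same keys always land together and can be compared locally. The partition count of 4 is an assumption for the example.

```python
import zlib

NUM_PARTITIONS = 4  # assumed partition/node count, for illustration only

def partition_for(customer_id, customer_name):
    # Deterministic hash of the duplicate keys; every row with the
    # same (CUSTOMER_ID, CUSTOMER_NAME) maps to the same partition,
    # so duplicates never end up on different nodes.
    key = f"{customer_id}|{customer_name}".encode()
    return zlib.crc32(key) % NUM_PARTITIONS

# Both duplicate rows from our input map to the same partition number:
p1 = partition_for(1, "UMA")
p2 = partition_for(1, "UMA")
```

If duplicates were spread across partitions (for example by round-robin), each node would see only one copy and nothing would be removed.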
Go to the Partition tab on the Input page of the Remove Duplicates stage.
Partition Type: Hash
Now we need to sort the data on the DATE column (no need to partition on it) in order to select the single record with the latest date.
Please make sure the order of the partition and sort key columns matches the keys that define the duplicate rows.
Now go to the Properties tab on the Stage page and select Duplicate To Retain = Last. We configured the DATE sort in ascending order, which is why I selected Duplicate To Retain = Last, as shown in the figure below.
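The interaction between the sort direction and the retain setting can be sketched as follows (illustrative Python, not DataStage code):

```python
# One duplicate group, sorted ascending by DATE (as in our job).
group = [
    ("UMA", "BUENA PARK", "20100101"),
    ("UMA", "CYPRESS",    "20110101"),
]
group.sort(key=lambda r: r[2])  # ascending by DATE

retained = group[-1]  # Duplicate To Retain = Last -> newest record
# retained == ('UMA', 'CYPRESS', '20110101')
```

Had we sorted DATE descending, we would pick Duplicate To Retain = First instead; either combination yields the most recent record.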
Now add a Sequential File stage as the output stage to create the output file; that completes our job design, as shown below.
Compile and run the job:
Now look at the director log to see if there are any warnings.
I found the following warning in the job log:
stg_out: When checking operator: A sequential operator cannot preserve the partitioning of the parallel data set on input port 0
To remove this warning from the job log, go to the Advanced tab on the Stage page and clear the Preserve partitioning flag.
Now compile and run the job.
This job finished successfully without any warnings.
Now I will change the partition and sort order of the columns in the Partition tab of the Stage page:
I ran the job with the above change and got the following warning in the job log:
stg_remDup: When checking operator: User inserted sort “stg_remDup.Lnk_Input_remdup_Sort” does not fulfill the sort requirements of the downstream operator “stg_remDup”
So we need to make sure the order of the partition and sort key columns matches the keys that define the duplicate rows, to avoid the warning above and incorrect results in the output.
