Remove Duplicates Stage:
In this article I will explain how to use remove duplicate stage to select unique record from the incoming data based on keys.
Input :
CUSTOMER_ID | CUSTOMER_NAME | CUSTOMER_CITY | DATE |
1 | UMA | BUENA PARK | 20100101 |
1 | UMA | CYPRESS | 20110101 |
If you look at above input we have two records with same customer name.
Business Logic:Select a most recent customer record.
we need to remove duplicates from the incoming data based on customer_name and customer_id keys and select most recent record base on the date.
Expected Output:(most recent record based on Date)
CUSTOMER_ID | CUSTOMER_NAME | CUSTOMER_CITY | DATE |
1 | UMA | CYPRESS | 20110101 |
In order to implement the above business logic we need to use Remove Duplicates Stage in our datastage job design.
We are using sequential file stage to read the input data.
Step1) configure sequential file stage
We have defined data as datatype for DATE field and by default datastage uses %yyyy-%mm-%dd as date format string.We need to double click on row 4 in the above screen and configure date format as %yyyy%mm%dd.
Now add remove duplicate stage to the job canvas by drag and drop from palette.Double click on remove dup stage and go to the properties tab on the stage page.
select keys the define the duplicates based on the given business logic.In our case CUSTOMER_ID and CUSTOMER_NAME are keys that defined duplicates in the data.
Go to the partition tab in the input page of the rem dup stage.
Partition Type:Hash
Now go the properties tab in the stage page and select Duplicate To Retain = Last.we configured DATE sorting in ascending that is the reason i have selected Duplicate To Retain = Last as shown in below figure.
Compile and run the job:
stg_out: When checking operator: A sequential operator cannot preserve the partitioning
of the parallel data set on input port 0
of the parallel data set on input port 0
To remove this warning from the job log go to the Advances tab in the stage page and clear preserver partition flag.
Now i will change the partition and sort order of the columns in the partition tab of the stage page:
stg_remDup: When checking operator: User inserted sort “stg_remDup.Lnk_Input_remdup_Sort” does not fulfill the sort requirements of the downstream operator “stg_remDup”
So we need to make sure the order of the partition and sort key columns should be same with keys that define the duplicate rows to avoid above warning and incorrect results in the output.
No comments:
Post a Comment