Lookup Stage:
The Lookup stage is most appropriate when the reference data for all lookup stages in a job is small enough to fit into available physical memory. Each lookup reference requires a contiguous block of shared memory. If the Data Sets are larger than available memory resources, the JOIN or MERGE stage should be used.
Lookup stages do not require data on the input link or reference links to be sorted. Be aware, though, that large in-memory lookup tables will degrade performance because of their paging requirements. Each record of the output data set contains columns from a source record plus columns from all the corresponding lookup records where corresponding source and lookup records have the same value for the lookup key columns. The lookup key columns do not have to have the same names in the primary and the reference links.
The optional reject link carries source records that do not have a corresponding entry in the input lookup tables.
You can also perform a range lookup, which compares the value of a source column to a range of values between two lookup table columns. If the source column value falls within the required range, a row is passed to the output link. Alternatively, you can compare the value of a lookup column to a range of values between two source columns. Range lookups must be based on column values, not constant values. Multiple ranges are supported.
There are some special partitioning considerations for Lookup stages. You need to ensure that the data being looked up in the lookup table is in the same partition as the input data referencing it. One way of doing this is to partition the lookup tables using the Entire method.
Lookup stage Configuration:Equal lookup
You can specify what action need to perform if lookup fails.
Scenario1: Continue
Scenario2:Fail
stg_Lkp,0: Failed a key lookup for record 2 Key Values: CUSTOMER_ID: 3
Scenari03:Drop
If we select reject as lookup failure condition then we need to add reject link otherwise we get compilation error.
Business scenario:we have input data with customer id and customer name and transaction date.We have customer dimension table with customer address information.Customer can have multiple records with different start and active dates and we want to select the record where incoming transcation date falls between start and end date of the customer from dim table.
Ex Input Data:
CUSTOMER_ID | CUSTOMER_NAME | TRANSACTION_DT |
1 | UMA | 2011-03-01 |
1 | UMA | 2010-05-01 |
Ex Di Data:
CUSTOMER_ID | CITY | ZIP_CODE | START_DT | END_DT |
1 | BUENA PARK | 90620 | 2010-01-01 | 2010-12-31 |
1 | CYPRESS | 90630 | 2011-01-01 | 2011-04-30 |
Expected Output:
CUSTOMER_ID | CUSTOMER_NAME | TRANSACTION_DT | CITY | ZIP_CODE |
1 | UMA | 2011-03-01 | CYPRESS | 90630 |
1 | UMA | 2010-05-01 | BUENA PARK | 90620 |
Configure the lookup stage as shown below.Double click on Lnk_input.TRANSACTION_DATE column.(specifying condition on the input link)
You need to specify return multiple rows from the reference link otherwise you will get following warning in the job log.Even though we have two distinct rows base on customer_id,start_dt and end_dt columns but datastage is considering duplicate rows based on customer_id key only.
stg_Lkp,0: Ignoring duplicate entry; no further warnings will be issued for this table
Compile and Run the job:
No comments:
Post a Comment