Saturday, 8 March 2014

New Features in DataStage 8.5


Ten Reasons Why You Need DataStage 8.5

I have taken a look through the new functions and capabilities of DataStage 8.5 and come up with a top ten list of why you should upgrade to it.

Information Server 8.5 came out a couple of weeks ago and is currently available on IBM Passport Advantage for existing customers and from IBM PartnerWorld for IM partners.  The XML pack described below is available as a separate download from the IBM Fix Central website.

This is a list of the ten best things in DataStage 8.5.  Most of these are improvements in DataStage Parallel Jobs only while a couple of them will help Server Job customers as well.

1. It’s Faster
 Faster, faster, faster.  Many tasks in DataStage 8.5 are at least 40% faster than in 8.1: starting DataStage, opening a job and running a parallel job are all quicker, and runtime performance has improved as well.

2. It Is Now an XML ETL Tool
Previous versions of DataStage were mediocre at processing XML.  DataStage 8.5 is a great XML processing tool: it can open, understand and store XML schema files.  I wrote a longer post about just this pack in New Hierarchical Transformer Makes DataStage a Great XML Tool, and if you have XML files without schemas you can follow a tip at the DataStage Real Time blog: The new XML Pack in 8.5… generating xsd's.
The new XML read and transform stages are much better at reading large and complex XML files and processing them in parallel:
DataStage 8.5 XML Job

3. Transformer Looping
The best Transformer yet.  The DataStage 8.5 parallel Transformer is the best version so far, thanks to new support for looping inside a Transformer and for performing transformations across a group of records.
With looping inside a Transformer you can output multiple rows for each input row.
DataStage 8.5 Transformer Looping

Transformer Remembering
The DataStage 8.5 Transformer has remembering and key-change detection, something ETL experts have been hand-coding into DataStage for years using well-known workarounds.  A key change in a DataStage job involves a group of records with a shared key, where you want to process that group as a type of array inside the overall recordset.
I am going to make a longer post about that later, but in short there are two new cache functions inside a Transformer, SaveInputRecord() and GetSavedInputRecord(), that let you save a record and retrieve it later to compare two or more records inside a Transformer.
There are also new system variables and functions for looping and key-change detection: @ITERATION, LastRow(), which indicates the last row in a job, and LastRowInGroup(InputColumn), which indicates that a particular column value will change in the next record.
A common use is aggregation: a group of rows sharing a key is cached, then looped over to emit each row together with a group total.
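Here is that idea expressed as a rough shell/awk sketch rather than DataStage syntax (the file sales.csv and its key,amount layout are invented for illustration).  Caching rows and replaying them when the key changes mirrors what SaveInputRecord(), GetSavedInputRecord() and LastRowInGroup() let you do inside the Transformer:

# Input is sorted by key; each line is "key,amount".
awk -F',' '
function flush(   i) {
    for (i = 1; i <= n; i++) print saved[i] "," total   # replay cached rows with the group total appended
    n = 0; total = 0
}
NR > 1 && $1 != prev { flush() }   # key change detected
{ saved[++n] = $0; total += $2; prev = $1 }
END { flush() }
' sales.csv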
4. Easy to Install
Easier to install and more robust.  DataStage 8.5 has the best installer of any version of DataStage ever.  Mind you, I jumped aboard the DataStage train at version 3.6, so I cannot vouch for earlier installers, but 8.5 has the best wizard, the best prerequisite checking and the best recovery.  It also has the IBM Support Assistant packs for Information Server that make debugging and reporting of PMRs to IBM much easier.  There is also a Guide to Migrating to InfoSphere Information Server 8.5 that explains how to migrate from most earlier versions.
Patch Merge – that’s right, patch merge.  The new installer has the ability to merge patches and fixes into the install for easier management of patches and fixes.

5. Check In and Check Out Jobs
Check in and check out version control.  DataStage 8.5 Manager comes with direct access to the source control functions of CVS and Rational ClearCase in an Eclipse workspace.  You can send artefacts to the source control system and restore a DataStage component from the source control system.
DataStage 8.5 Check In
DataStage 8.5 comes with out of the box menu integration with CVS and Rational ClearCase but for other source control systems you need to use the Eclipse source control plugins.

6. High Availability Easier than ever
High Availability – the version 8.5 installation guide has over thirty pages on Information Server topologies including a bunch of high availability scenarios across all tiers of the product.  On top of that there are new chapters for the high availability of the metadata repository, the services layer and the DataStage engine.
  • Horizontal and vertical scaling and load balancing.
  • Cluster support for WebSphere Application Server.
  • Cluster support for the XMETA repository: DB2 HADR/Cluster or Oracle RAC.
  • Improved failover support on the engine.

7. New Information Architecture Diagramming Tool
InfoSphere Blueprint Director – DataStage 8.5 comes with a free new product for creating diagrams of an information architecture and linking elements in the diagram directly into DataStage jobs and Metadata Workbench metadata.  Solution architects can draw a diagram of a data integration solution including sources, warehouses and repositories.
DataStage 8.5 Blueprint Director

8. Vertical Pivot
Vertical pivot is now available: it pivots multiple input rows that share a common key into single output rows with multiple columns, with support for key-based groups, columnar pivot and aggregate functions.
You can also do this type of vertical pivoting in the new Transformer using the column change detection and row cache, but the Vertical Pivot stage makes it easier as a specialised stage.
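As a rough illustration of what the stage does (a shell/awk sketch, not DataStage syntax), rows sharing a key collapse into one output row with the values spread across columns:

echo 'Jason Bourne,New York
Jason Bourne,Madrid
Jason Bourne,New Delhi' |
awk -F',' '
NR > 1 && $1 != prev { print line; line = "" }   # key change: emit the pivoted row
{ line = (line == "" ? $1 : line) "," $2; prev = $1 }
END { print line }'
# Output: Jason Bourne,New York,Madrid,New Delhi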


9. z/OS File Stage
The z/OS File stage makes it easier to process complex flat files by providing native support for mainframe file formats: VSAM files (KSDS, ESDS, RRDS), sequential QSAM, BDAM and BSAM files, fixed- and variable-length records, and single or multiple record types per file.
DataStage 8.5 zOS File Stage 
10.  Balanced Optimizer Comes Home
In DataStage 8.5 the Balanced Optimizer has been merged into the Designer, with a number of usability improvements that make DataStage a better ETLT or ELT option.  Balanced Optimizer looks at a normal DataStage job and produces a version that pushes some of the steps down onto a source or target database engine.  That is, it balances the load between the ETL engine and the database engines.
Version 8.5 has improved logging, improved impact analysis support and easier management of optimised versions of jobs in terms of creating, deleting, renaming, moving, compiling and deploying them.



IBM DataStage 8.5 Newly Added Features:

DataStage 8.5 is out and IBM has made some significant improvements this time around. Let's look at some of the important enhancements in the new DataStage 8.5 version.
  • XML data

DataStage has historically been inefficient at handling XML files, but in 8.5 IBM has given us a great XML processing package. DataStage 8.5 can now process large XML files (over 30 GB) with ease. Also, we can now process XML data in parallel.
The new XML transform stage can combine data from multiple sources into a single XML output stream. If you think that is cool, it can also do it the other way around, i.e., parse a single XML input into multiple output streams.
It can also convert data from one XML format to another.

  • Transformer Stage

It is one of the most used and most important stages in DataStage, and it just got better in 8.5.
a.     Transformer Looping:
Over the years DataStage programmers have been using workarounds to implement this concept. Now IBM has included it directly in the transformer stage.
There are two types of looping’s available
Output looping: Where we can output multiple output links for a single input link
Ex:
Input record:

Salesman_name   City_1     City_2   City_3
Jason Bourne    New York   Madrid   New Delhi

Output records:

Salesman_name   City
Jason Bourne    New York
Jason Bourne    Madrid
Jason Bourne    New Delhi

This is achieved using the new system variable @ITERATION.
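The same pivot can be sketched outside DataStage with a shell/awk one-liner (purely illustrative, not DataStage syntax): the for loop plays the role that @ITERATION plays in the Transformer, emitting one output row per city column.

echo 'Jason Bourne,New York,Madrid,New Delhi' |
awk -F',' '{ for (i = 2; i <= NF; i++) print $1 "," $i }'   # one row per city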
Input looping: we can now aggregate input records within the transformer and attach the aggregated values to the original input rows as they are sent to the output.

b.    Transformer change detection:
SaveInputRecord() – saves the current input record to a cache for use in later transformations within the job
GetSavedInputRecord() – retrieves a saved record when it is required for comparisons

c.     System Variables:
                               i.        @ITERATION: used in the looping mechanism
                             ii.        LastRow(): indicates the last row in the job
                            iii.        LastRowInGroup(): indicates the last row in a group, based on the key column

d.    New NULL Handling features:
In DataStage 8.5 we need not explicitly handle NULL values: records are no longer dropped when the target column is nullable, functions can be used over columns that contain NULLs without explicit NULL handling, and stage variables are now nullable by default.
The environment variable APT_TRANSFORM_COMPILE_OLD_NULL_HANDLING is provided for backward compatibility.
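If existing jobs depend on the old behaviour, the variable can be defined at the project level through the Administrator client or in the job's environment; as a minimal sketch (the value shown is an assumption, check your installation's documentation):

# Revert a job's Transformers to pre-8.5 NULL handling (sketch)
export APT_TRANSFORM_COMPILE_OLD_NULL_HANDLING=1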

e.     New Date functions:
There is a host of new date functions incorporated into DataStage 8.5. I personally found the functions below most useful.
DateFromComponents(years, months, daysofmonth)
Ex: DateFromComponents(2012, 07, 20) will output 2012-07-20

DateOffsetByComponents(basedate, years offset, month offset, daysofmonth offset)
Ex: DateOffsetByComponents(2012-07-20, 2, 1, 1) will output 2014-08-21
DateOffsetByComponents(2012-07-20, -4, 0, 0) will output 2008-07-20
I will write another detailed blog on the new date functions shortly.

  • Parallel Debugger:

DataStage 8.5 now has built-in debugger functionality: we can set breakpoints on the links in our jobs.
When a job is run in debug mode, it stops when it encounters a breakpoint. From there we can step to the next action on that link or skip to the next row of data.
Refer to the link below for details:
http://datastageinfoguide.blogspot.in/2013/01/new-debug-feature-in-datastage-85.html


  • Functionality Enhancements:

-       Mask encryption for before and after job subroutines
-       Ability to copy permissions from one project to a new project
-       Improvements in the multi-client manager
-       New audit tracing and enhanced exception dialog
-       Enhanced project creation failure details

  • Vertical Pivoting:

At long last, vertical pivoting has been added.
 
  • Integration with CVS

DataStage 8.5 now integrates directly with version control systems like CVS, so we can check in and check out directly from DataStage.

  • Information Architecture Diagramming Tool:

Now solution architects can draw detailed integration solution plans for data warehouses from within DataStage

  • Balanced Optimizer:

As you all know, DataStage is an ETL tool, but with Balanced Optimizer now integrated directly we also get ELT (Extract, Load and Transform) capability.
With this we can extract the data, load it, and then perform the transformations inside the database engine.

  • It's Fast!

DataStage 8.5 is considerably faster than its previous version (8.1). Tasks like saving, renaming, compiling are faster by nearly 40%. The run time performance of jobs has also improved.
  
  • The parallel engine

The parallel engine has been tuned to improve performance, and resource usage is reduced by about 5% compared to DataStage 8.1.

Compile DataStage Job From Client Command Line

Compiling DataStage jobs from a Windows client machine / Windows command prompt.



You can find the compilation command in the C:\IBM\InformationServer\Clients\Classic folder.

dscc.exe - compiles DataStage jobs.

cd C:\IBM\InformationServer\Clients\Classic
dscc.exe /H ABC123 /U dsadm /P dsadm TestProject /J TestDSJob
dscc.exe /H hostname /U username /P password project_name /J jobname

Note: The job name can be the name of a single job, or * to compile all the jobs in the project.

The dscc command takes the following arguments:
  • /H hostname specifies the InfoSphere DataStage server where the job or jobs reside.
  • /U username is the user name to use when attaching to the project.
  • /P password is the password to use when attaching to the project.
  • project_name is the project to which the job or jobs belong.
  • /J jobname | * | category_name\* specifies the jobs to be compiled. Use jobname to specify a single job, * to compile all jobs in the project, and category_name\* to compile all jobs in that category (this will not include categories within that category). You can specify job sequences as well as parallel or server jobs.
  • /R routinename | * | category_name\* specifies the routines to be compiled. Use routinename to specify a single routine, * to compile all routines in the project, and category_name\* to compile all routines in that category (this will not include categories within that category).
  • /F forces compilation (for parallel jobs).
  • /OUC compiles only uncompiled jobs.
  • /RD reportname specifies a name and destination for a compilation report. Specify DESKTOP\filename to write it to your desktop or .\filename to write it to the current working directory.
The options are not case sensitive.
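Putting several of these options together, a nightly batch-style invocation might look like the following (the host name, credentials, project and report name are placeholders); it compiles every job in the project that is not yet compiled and writes a report to the desktop:

cd C:\IBM\InformationServer\Clients\Classic
dscc.exe /H ETLHOST /U dsadm /P secret MyProject /J * /OUC /RD DESKTOP\compile_report.txt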

Create/Delete/View DS Projects Using Command

dsadmin commands to create, delete and view DataStage projects from the command line (Unix/Linux)


This post will help you understand creating, deleting and viewing DataStage projects from the command line (Unix).

Note: Always use the dsadm user to create or delete projects.

Check the list of DataStage projects from the command line:
--------------------------------------------------------------------

DSHOME=`cat /.dshome`
echo $DSHOME

cd $DSHOME
. ./dsenv
cd bin
./dsjob -lprojects          # list the DataStage projects
./dsjob -ljobs project      # list the jobs in the given project
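Building on those two commands, a small loop can inventory every job in every project (a sketch; it assumes dsenv has been sourced as above, that you are in $DSHOME/bin, and that dsjob -lprojects prints one project name per line with no spaces in the names):

for p in `./dsjob -lprojects`; do
    echo "== $p =="        # project header
    ./dsjob -ljobs $p      # jobs in that project
done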


Create Projects From Command Line
-----------------------------------------------

DSHOME=`cat /.dshome`
cd $DSHOME
. ./dsenv
cd bin
./dsadmin -domain XYZ1234 -user dsadm -password dsadm -server XYZ1234 -createproject Test1 -location /tmp/Test1

Syntax: ./dsadmin -domain [domain] -user [user] -password [password] -server [server] -createproject [project name] -location [project path/NewProjectName]
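A defensive variant only creates the project if it does not already exist (a sketch, run from $DSHOME/bin after sourcing dsenv as in the block above; the domain, server, credentials and path are the same placeholders as in the example):

if ./dsjob -lprojects | grep -qx Test1; then
    echo "Project Test1 already exists"
else
    ./dsadmin -domain XYZ1234 -user dsadm -password dsadm -server XYZ1234 -createproject Test1 -location /tmp/Test1
fi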



Delete Projects From Command Line
-----------------------------------------------

DSHOME=`cat /.dshome`
cd $DSHOME
. ./dsenv
cd bin
./dsadmin -domain XYZ1234 -user dsadm -password dsadm -server XYZ1234 -deleteproject Test1
Syntax: ./dsadmin -domain [domain] -user [user] -password [password] -server [server] -deleteproject [project name]

UNIX: Chmod Calculator

Unix file permissions in detail - chmod calculator:

This post gives you complete details about Unix file permissions. The chmod calculator helps you find the numbers to use in chmod commands.

CHMOD:

Chmod stands for "change mode" and is a command often needed when installing scripts (CGI, PHP, etc.) on a UNIX server: after uploading a file (with FTP) you may need to change its permissions. Basically, it tells the server who can make what changes to the file or folder, i.e. whether a script can only read the information or can also write to it.

CHMOD CALCULATOR:

Some FTP programs present a permissions dialog that maps onto the table below.
It may help to check the table first so that you really are setting the right permissions.


Symbolic Notation   Octal Notation   Description
----------          000              no permissions
---x--x--x          111              execute
--w--w--w-          222              write
--wx-wx-wx          333              write & execute
-r--r--r--          444              read
-r-xr-xr-x          555              read & execute
-rw-rw-rw-          666              read & write
-rwxrwxrwx          777              read, write, & execute


Legend:

CODE   DESCRIPTION
1      execute
2      write
3      write & execute
4      read
5      read & execute
6      read & write (no execute)
7      full access

The first digit applies to the owner, the second to the group, and the third to others.

Common ones:

PERMISSION    COMMAND
rwxrwxrwx     chmod 777 filename
rwxrwxr-x     chmod 775 filename
rwxr-xr-x     chmod 755 filename
rw-rw-rw-     chmod 666 filename
rw-rw-r--     chmod 664 filename
rw-r--r--     chmod 644 filename


WHAT DO THE CHMOD SETTINGS MEAN?

Chmod tells the server the access privileges for a file. For example, common file settings are:
777: all can read(4) / write(2) / execute(1) the file.
755: owner can do all, group / others can read / execute.
644: owner can read / write, group / others can read only.
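
To see the mapping in practice, the same permission can be set in octal or symbolic form and verified with ls -l (deploy.sh is a placeholder file name):

chmod 755 deploy.sh              # rwxr-xr-x: owner can do all, group/others read & execute
chmod u=rwx,g=rx,o=rx deploy.sh  # identical result in symbolic form
ls -l deploy.sh                  # first column shows -rwxr-xr-x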


COMMON CHMOD SETTINGS

A typical ls -l listing shows the user/group/other permissions at the start of each line:
-rw-r--r--  1 DYADAV  dstage  164870 27 Feb 17:58 a.txt

DataStage Job Runtime Architecture - OSH

This post covers the generation of the OSH (Orchestrate® Shell Script) script, the execution flow of IBM InfoSphere DataStage on the Information Server engine, and the roles of the Conductor node, Section Leaders and Players in a DataStage job.

OSH script:
The IBM InfoSphere DataStage and QualityStage Designer client creates IBM InfoSphere DataStage jobs that are compiled into parallel job flows, and reusable components that execute on the parallel Information Server engine. It allows you to use familiar graphical point-and-click techniques to develop job flows for extracting, cleansing, transforming, integrating, and loading data into target files, target systems, or packaged applications.

The Designer generates all the code. It generates the OSH (Orchestrate Shell Script) and C++ code for any Transformer stages used.

Briefly, the Designer performs the following tasks:
1. Validates link requirements, mandatory stage options, transformer logic, etc.
2. Generates the OSH representation of data flows and stages (representations of framework "operators").
3. Generates transform code for each Transformer stage, which is compiled into C++ and then into corresponding native operators.
4. Reusable BuildOp stages can be compiled using the Designer GUI or from the command line.

Here is a brief primer on the OSH:
1. Comment blocks introduce each operator; their order is determined by the order in which stages were added to the canvas.
2. OSH uses the familiar syntax of the UNIX shell: operator name, schema, operator options ("-name value" format), inputs (indicated by n<, where n is the input number), and outputs (indicated by n>, where n is the output number).
3. For every operator, input and/or output data sets are numbered sequentially starting from zero.
4. Virtual data sets (in memory native representation of data links) are generated to connect operators.

Note: The actual execution order of operators is dictated by input/output designators, and not by their placement on the diagram. The data sets connect the OSH operators. These are “virtual data sets”, that is, in memory data flows. Link names are used in data set names — it is therefore good practice to give the links meaningful names.
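As a purely illustrative sketch (not output from a real job; the stage, operator options and link names are invented), a generated stage tends to follow the shape described above: a comment block naming the stage, the operator and its options, then numbered virtual data sets for inputs and outputs:

#### STAGE: Sort_Customers
## Operator
tsort
## Operator options
-key 'CUSTOMER_ID'
## General options
[ident('Sort_Customers')]
## Inputs
0< 'Read_Customers:Lnk_Raw.v'
## Outputs
0> 'Sort_Customers:Lnk_Sorted.v'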

Framework (Information Server engine) terms and DataStage terms are equivalent, and the GUI frequently uses terms from both paradigms. Runtime messages use framework terminology because the framework engine is where execution occurs. The following list shows the equivalency between framework and DataStage terms:

1. Schema corresponds to table definition
2. Property corresponds to format
3. Type corresponds to SQL type and length
4. Virtual data set corresponds to link
5. Record/field corresponds to row/column
6. Operator corresponds to stage
7. Step, flow, OSH command correspond to a job
8. Framework corresponds to Information Server Engine

Execution flow
When you execute a job, the generated OSH and the contents of the configuration file ($APT_CONFIG_FILE) are used to compose a "score". This is similar to a SQL query optimization plan. At runtime, IBM InfoSphere DataStage identifies the degree of parallelism and node assignments for each operator, and inserts sorts and partitioners as needed to ensure correct results. It also defines the connection topology (virtual data sets/links) between adjacent operators/stages, and inserts buffer operators to prevent deadlocks (for example, in fork-joins). It also defines the number of actual OS processes. Multiple operators/stages are combined within a single OS process as appropriate, to improve performance and optimize resource requirements.

The job score is used to fork processes with communication interconnects for data, messages and control. Processing begins after the job score and processes are created. Job processing ends when the last row of data is processed by the final operator, a fatal error is encountered by any operator, or the job is halted by DataStage job control or human intervention such as the DataStage Director STOP command.

Job scores are divided into two sections: data sets (partitioning and collecting) and operators (node/operator mapping). Both sections identify sequential or parallel processing. The execution "orchestra" manages control and message flow across processes and consists of the conductor node and one or more processing nodes. Actual data flows from player to player; the conductor and section leaders are only used to control process execution through control and message channels.

 

The Conductor is the initial framework process. It creates the Section Leader (SL) processes (one per node), consolidates messages to the DataStage log, and manages orderly shutdown. The Conductor node runs the start-up process, and the Conductor also communicates with the players.
Note:
1. Set $APT_STARTUP_STATUS to show each step of the job startup, and $APT_PM_SHOW_PIDS to show process IDs in the DataStage log.
2. You can direct the score to the job log by setting $APT_DUMP_SCORE. To identify the score dump, look for "main program: This step....".
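These switches are just environment variables, so one way to trace a run is the sketch below; the project and job names are placeholders, and note that for the values to reach the job they typically need to be defined in the job or project environment (for example through the Administrator client) rather than only in the calling shell:

export APT_STARTUP_STATUS=1   # show each step of job startup
export APT_PM_SHOW_PIDS=1     # show player process IDs in the log
export APT_DUMP_SCORE=1       # dump the job score to the log
$DSHOME/bin/dsjob -run -jobstatus MyProject MyJob
# then look for "main program: This step...." in the Director log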

A Section Leader is a process that forks player processes (one per stage) and manages up/down communications. SLs communicate between the conductor and player processes only. For a given parallel configuration file, one section leader is started for each logical node.

Players are the actual processes associated with the stages. Each player sends stderr and stdout to its section leader, establishes connections to other players for data flow, and cleans up on completion. Each player has to be able to communicate with every other player. There are separate communication channels (pathways) for control, errors, messages and data. The data channel does not go through the section leader or conductor, as this would limit scalability; data flows directly from upstream operator to downstream operator.