Saturday, 8 March 2014

New Features in DataStage 8.5


Ten Reasons Why You Need DataStage 8.5

I have taken a look through the new functions and capabilities of DataStage 8.5 and come up with a top ten list of why you should upgrade to it.

Information Server 8.5 came out a couple of weeks ago and is currently available on IBM Passport Advantage for existing customers and from IBM PartnerWorld for IM partners.  The XML pack described below is available as a separate download from the IBM Fix Central website.

This is a list of the ten best things in DataStage 8.5.  Most of these are improvements in DataStage Parallel Jobs only while a couple of them will help Server Job customers as well.

1. It’s Faster
 Faster, faster, faster.  Many tasks in DataStage 8.5 are at least 40% faster than in 8.1: starting DataStage, opening a job and running a parallel job are all quicker, and runtime performance has improved as well.

2. It Is Now an XML ETL Tool
Previous versions of DataStage were mediocre at processing XML.  DataStage 8.5 is a great XML processing tool: it can open, understand and store XML schema files.  I wrote a longer post about just this pack in New Hierarchical Transformer Makes DataStage a Great XML Tool, and if you have XML files without schemas you can follow a tip at the DataStage Real Time blog: The new XML Pack in 8.5… generating xsd's.
The new XML read and transform stages are much better at reading large and complex XML files and processing them in parallel:
DataStage 8.5 XML Job

3. Transformer Looping
The best Transformer yet.  The DataStage 8.5 parallel Transformer is the best version so far, thanks to new support for looping inside a Transformer and for performing transformations across a group of records.
With looping inside a Transformer you can output multiple rows for each input row.
DataStage 8.5 Transformer Looping

Transformer Remembering
The DataStage 8.5 Transformer has remembering and key-change detection, something ETL experts have been hand-coding into DataStage for years using well-known workarounds.  A key change in a DataStage job involves a group of records with a shared key, where you want to process that group as a type of array inside the overall recordset.
I am going to make a longer post about that later, but in short there are two new cache functions inside a Transformer, SaveInputRecord() and GetSavedInputRecord(), that let you save a record and retrieve it later to compare two or more records inside a Transformer.
There are also new system variables and functions for looping and key-change detection: @ITERATION, LastRow(), which indicates the last row in a job, and LastRowInGroup(InputColumn), which indicates that a particular column value will change in the next record.
A common use is aggregation: a group of rows sharing a key is cached, then looped over to emit each row together with a group total.
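Here is that idea expressed as a rough shell/awk sketch rather than DataStage syntax (the file sales.csv and its key,amount layout are invented for illustration).  Caching rows and replaying them when the key changes mirrors what SaveInputRecord(), GetSavedInputRecord() and LastRowInGroup() let you do inside the Transformer:

# Input is sorted by key; each line is "key,amount".
awk -F',' '
function flush(   i) {
    for (i = 1; i <= n; i++) print saved[i] "," total   # replay cached rows with the group total appended
    n = 0; total = 0
}
NR > 1 && $1 != prev { flush() }   # key change detected
{ saved[++n] = $0; total += $2; prev = $1 }
END { flush() }
' sales.csv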
4. Easy to Install
Easier to install and more robust.  DataStage 8.5 has the best installer of any version of DataStage ever.  Mind you, I jumped aboard the DataStage train at version 3.6, so I cannot vouch for earlier installers, but 8.5 has the best wizard, the best prerequisite checking and the best recovery.  It also has the IBM Support Assistant packs for Information Server that make debugging and reporting of PMRs to IBM much easier.  There is also a Guide to Migrating to InfoSphere Information Server 8.5 that explains how to migrate from most earlier versions.
Patch Merge – that’s right, patch merge.  The new installer has the ability to merge patches and fixes into the install for easier management of patches and fixes.

5. Check In and Check Out Jobs
Check in and check out version control.  DataStage 8.5 Manager comes with direct access to the source control functions of CVS and Rational ClearCase in an Eclipse workspace.  You can send artefacts to the source control system and restore a DataStage component from the source control system.
DataStage 8.5 Check In
DataStage 8.5 comes with out of the box menu integration with CVS and Rational ClearCase but for other source control systems you need to use the Eclipse source control plugins.

6. High Availability Easier than ever
High Availability – the version 8.5 installation guide has over thirty pages on Information Server topologies including a bunch of high availability scenarios across all tiers of the product.  On top of that there are new chapters for the high availability of the metadata repository, the services layer and the DataStage engine.
  • Horizontal and vertical scaling and load balancing.
  • Cluster support for WebSphere Application Server.
  • Cluster support for the XMETA repository: DB2 HADR/Cluster or Oracle RAC.
  • Improved failover support on the engine.

7. New Information Architecture Diagramming Tool
InfoSphere Blueprint Director – DataStage 8.5 comes with a free new product for creating diagrams of an information architecture and linking elements in the diagram directly into DataStage jobs and Metadata Workbench metadata.  Solution architects can draw a diagram of a data integration solution including sources, warehouses and repositories.
DataStage 8.5 Blueprint Director

8. Vertical Pivot
Vertical pivot is now available: it pivots multiple input rows that share a common key into single output rows with multiple columns, with support for key-based groups, columnar pivot and aggregate functions.
You can also do this type of vertical pivoting in the new Transformer using the column change detection and row cache, but the Vertical Pivot stage makes it easier as a specialised stage.
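As a rough illustration of what the stage does (a shell/awk sketch, not DataStage syntax), rows sharing a key collapse into one output row with the values spread across columns:

echo 'Jason Bourne,New York
Jason Bourne,Madrid
Jason Bourne,New Delhi' |
awk -F',' '
NR > 1 && $1 != prev { print line; line = "" }   # key change: emit the pivoted row
{ line = (line == "" ? $1 : line) "," $2; prev = $1 }
END { print line }'
# Output: Jason Bourne,New York,Madrid,New Delhi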


9. z/OS File Stage
The z/OS File stage makes it easier to process complex flat files by providing native support for mainframe file formats: VSAM files (KSDS, ESDS, RRDS), sequential QSAM, BDAM and BSAM files, fixed- and variable-length records, and single or multiple record types per file.
DataStage 8.5 zOS File Stage 
10.  Balanced Optimizer Comes Home
In DataStage 8.5 the Balanced Optimizer has been merged into the Designer, with a number of usability improvements that make DataStage a better ETLT or ELT option.  Balanced Optimizer looks at a normal DataStage job and produces a version that pushes some of the steps down onto a source or target database engine.  That is, it balances the load between the ETL engine and the database engines.
Version 8.5 has improved logging, improved impact analysis support and easier management of optimised versions of jobs in terms of creating, deleting, renaming, moving, compiling and deploying them.



IBM DataStage 8.5 Newly Added Features:

DataStage 8.5 is out and IBM has made some significant improvements this time around. Let's look at some of the important enhancements in the new DataStage 8.5 version.
  • XML data

DataStage has historically been inefficient at handling XML files, but in 8.5 IBM has given us a great XML processing package. DataStage 8.5 can now process large XML files (over 30 GB) with ease. Also, we can now process XML data in parallel.
The new XML transform stage can combine data from multiple sources into a single XML output stream. If you think that is cool, it can also do it the other way around, i.e., parse a single XML input into multiple output streams.
It can also convert data from one XML format to another.

  • Transformer Stage

It is one of the most used and most important stages in DataStage, and it just got better in 8.5.
a.     Transformer Looping:
Over the years DataStage programmers have been using workarounds to implement this concept. Now IBM has included it directly in the transformer stage.
There are two types of looping’s available
Output looping: Where we can output multiple output links for a single input link
Ex:
Input record:

Salesman_name   City_1     City_2   City_3
Jason Bourne    New York   Madrid   New Delhi

Output records:

Salesman_name   City
Jason Bourne    New York
Jason Bourne    Madrid
Jason Bourne    New Delhi

This is achieved using the new system variable @ITERATION.
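The same pivot can be sketched outside DataStage with a shell/awk one-liner (purely illustrative, not DataStage syntax): the for loop plays the role that @ITERATION plays in the Transformer, emitting one output row per city column.

echo 'Jason Bourne,New York,Madrid,New Delhi' |
awk -F',' '{ for (i = 2; i <= NF; i++) print $1 "," $i }'   # one row per city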
Input looping: we can now aggregate input records within the transformer and attach the aggregated values to the original input rows as they are sent to the output.

b.    Transformer change detection:
SaveInputRecord() – saves the current input record to a cache for use in later transformations within the job
GetSavedInputRecord() – retrieves a saved record when it is required for comparisons

c.     System Variables:
                               i.        @ITERATION: used in the looping mechanism
                             ii.        LastRow(): indicates the last row in the job
                            iii.        LastRowInGroup(): indicates the last row in a group, based on the key column

d.    New NULL Handling features:
In DataStage 8.5 we need not explicitly handle NULL values: records are no longer dropped when the target column is nullable, functions can be used over columns that contain NULLs without explicit NULL handling, and stage variables are now nullable by default.
The environment variable APT_TRANSFORM_COMPILE_OLD_NULL_HANDLING is provided for backward compatibility.
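If existing jobs depend on the old behaviour, the variable can be defined at the project level through the Administrator client or in the job's environment; as a minimal sketch (the value shown is an assumption, check your installation's documentation):

# Revert a job's Transformers to pre-8.5 NULL handling (sketch)
export APT_TRANSFORM_COMPILE_OLD_NULL_HANDLING=1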

e.     New Date functions:
There is a host of new date functions incorporated into DataStage 8.5. I personally found the functions below most useful.
DateFromComponents(years, months, daysofmonth)
Ex: DateFromComponents(2012, 07, 20) will output 2012-07-20

DateOffsetByComponents(basedate, years offset, month offset, daysofmonth offset)
Ex: DateOffsetByComponents(2012-07-20, 2, 1, 1) will output 2014-08-21
DateOffsetByComponents(2012-07-20, -4, 0, 0) will output 2008-07-20
I will write another detailed blog on the new date functions shortly.

  • Parallel Debugger:

DataStage 8.5 now has built-in debugger functionality: we can set breakpoints on the links in our jobs.
When a job is run in debug mode, it stops when it encounters a breakpoint. From there we can step to the next action on that link or skip to the next row of data.
Refer to the link below for details:
http://datastageinfoguide.blogspot.in/2013/01/new-debug-feature-in-datastage-85.html


  • Functionality Enhancements:

-       Mask encryption for before and after job subroutines
-       Ability to copy permissions from one project to a new project
-       Improvements in the multi-client manager
-       New audit tracing and enhanced exception dialog
-       Enhanced project creation failure details

  • Vertical Pivoting:

At long last, vertical pivoting has been added.
 
  • Integration with CVS

DataStage 8.5 now integrates directly with version control systems like CVS, so we can check in and check out directly from DataStage.

  • Information Architecture Diagramming Tool:

Now solution architects can draw detailed integration solution plans for data warehouses from within DataStage

  • Balanced Optimizer:

As you all know, DataStage is an ETL tool, but with Balanced Optimizer now integrated directly we also get ELT (Extract, Load and Transform) capability.
With this we can extract the data, load it, and then perform the transformations inside the database engine.

  • It's Fast!

DataStage 8.5 is considerably faster than its previous version (8.1). Tasks like saving, renaming, compiling are faster by nearly 40%. The run time performance of jobs has also improved.
  
  • The parallel engine

The parallel engine has been tuned to improve performance, and resource usage is reduced by about 5% compared to DataStage 8.1.

Compile DataStage Job From Client Command Line

Compiling DataStage jobs from a Windows client machine / Windows command prompt.



You can find the compilation command in the C:\IBM\InformationServer\Clients\Classic folder.

dscc.exe - compiles DataStage jobs.

cd C:\IBM\InformationServer\Clients\Classic
dscc.exe /H ABC123 /U dsadm /P dsadm TestProject /J TestDSJob
dscc.exe /H hostname /U username /P password project_name /J jobname

Note: The job name can be the name of a single job, or * to compile all the jobs in the project.

The dscc command takes the following arguments:
  • /H hostname specifies the InfoSphere DataStage server where the job or jobs reside.
  • /U username is the user name to use when attaching to the project.
  • /P password is the password to use when attaching to the project.
  • project_name is the project to which the job or jobs belong.
  • /J jobname | * | category_name\* specifies the jobs to be compiled. Use jobname to specify a single job, * to compile all jobs in the project, and category_name\* to compile all jobs in that category (this will not include categories within that category). You can specify job sequences as well as parallel or server jobs.
  • /R routinename | * | category_name\* specifies the routines to be compiled. Use routinename to specify a single routine, * to compile all routines in the project, and category_name\* to compile all routines in that category (this will not include categories within that category).
  • /F forces compilation (for parallel jobs).
  • /OUC compiles only uncompiled jobs.
  • /RD reportname specifies a name and destination for a compilation report. Specify DESKTOP\filename to write it to your desktop or .\filename to write it to the current working directory.
The options are not case sensitive.
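Putting several of these options together, a nightly batch-style invocation might look like the following (the host name, credentials, project and report name are placeholders); it compiles every job in the project that is not yet compiled and writes a report to the desktop:

cd C:\IBM\InformationServer\Clients\Classic
dscc.exe /H ETLHOST /U dsadm /P secret MyProject /J * /OUC /RD DESKTOP\compile_report.txt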

Create/Delete/View DS Projects Using Command

dsadmin commands to create, delete and view DataStage projects from the command line (Unix/Linux)


This post will help you understand creating, deleting and viewing DataStage projects from the command line (Unix).

Note: Always use the dsadm user to create or delete projects.

Check the list of DataStage projects from the command line:
--------------------------------------------------------------------

DSHOME=`cat /.dshome`
echo $DSHOME

cd $DSHOME
. ./dsenv
cd bin
./dsjob -lprojects          # list the DataStage projects
./dsjob -ljobs project      # list the jobs in the given project
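Building on those two commands, a small loop can inventory every job in every project (a sketch; it assumes dsenv has been sourced as above, that you are in $DSHOME/bin, and that dsjob -lprojects prints one project name per line with no spaces in the names):

for p in `./dsjob -lprojects`; do
    echo "== $p =="        # project header
    ./dsjob -ljobs $p      # jobs in that project
done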


Create Projects From Command Line
-----------------------------------------------

DSHOME=`cat /.dshome`
cd $DSHOME
. ./dsenv
cd bin
./dsadmin -domain XYZ1234 -user dsadm -password dsadm -server XYZ1234 -createproject Test1 -location /tmp/Test1

Syntax: ./dsadmin -domain [domain] -user [user] -password [password] -server [server] -createproject [project name] -location [project path/NewProjectName]
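A defensive variant only creates the project if it does not already exist (a sketch, run from $DSHOME/bin after sourcing dsenv as in the block above; the domain, server, credentials and path are the same placeholders as in the example):

if ./dsjob -lprojects | grep -qx Test1; then
    echo "Project Test1 already exists"
else
    ./dsadmin -domain XYZ1234 -user dsadm -password dsadm -server XYZ1234 -createproject Test1 -location /tmp/Test1
fi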



Delete Projects From Command Line
-----------------------------------------------

DSHOME=`cat /.dshome`
cd $DSHOME
. ./dsenv
cd bin
./dsadmin -domain XYZ1234 -user dsadm -password dsadm -server XYZ1234 -deleteproject Test1
Syntax: ./dsadmin -domain [domain] -user [user] -password [password] -server [server] -deleteproject [project name]

UNIX: Chmod Calculator

Unix file permissions in detail - chmod calculator:

This post gives you complete details about Unix file permissions. The chmod calculator helps you find the numbers to use in chmod commands.

CHMOD:

Chmod stands for "change mode" and is a command often needed when installing scripts (CGI, PHP, etc.) on a UNIX server: after uploading a file (with FTP) you may need to change its permissions. Basically, it tells the server who can make what changes to the file or folder, i.e. whether a script can only read the information or can also write to it.

CHMOD CALCULATOR:

Some FTP programs present a permissions dialog that maps onto the table below.
It may help to check the table first so that you really are setting the right permissions.


Symbolic Notation   Octal Notation   Description
----------          000              no permissions
---x--x--x          111              execute
--w--w--w-          222              write
--wx-wx-wx          333              write & execute
-r--r--r--          444              read
-r-xr-xr-x          555              read & execute
-rw-rw-rw-          666              read & write
-rwxrwxrwx          777              read, write, & execute


Legend:

CODE   DESCRIPTION
1      execute
2      write
3      write & execute
4      read
5      read & execute
6      read & write (no execute)
7      full access

The first digit applies to the owner, the second to the group, and the third to others.

Common ones:

PERMISSION    COMMAND
rwxrwxrwx     chmod 777 filename
rwxrwxr-x     chmod 775 filename
rwxr-xr-x     chmod 755 filename
rw-rw-rw-     chmod 666 filename
rw-rw-r--     chmod 664 filename
rw-r--r--     chmod 644 filename


WHAT DO THE CHMOD SETTINGS MEAN?

Chmod tells the server the access privileges for a file. For example, common file settings are:
777: all can read(4) / write(2) / execute(1) the file.
755: owner can do all, group / others can read / execute.
644: owner can read / write, group / others can read only.
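
To see the mapping in practice, the same permission can be set in octal or symbolic form and verified with ls -l (deploy.sh is a placeholder file name):

chmod 755 deploy.sh              # rwxr-xr-x: owner can do all, group/others read & execute
chmod u=rwx,g=rx,o=rx deploy.sh  # identical result in symbolic form
ls -l deploy.sh                  # first column shows -rwxr-xr-x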


COMMON CHMOD SETTINGS

A typical ls -l listing shows the user/group/other permissions at the start of each line:
-rw-r--r--  1 DYADAV  dstage  164870 27 Feb 17:58 a.txt

DataStage Job Runtime Architecture - OSH

This post covers the generation of the OSH (Orchestrate® Shell Script) script, the execution flow of IBM InfoSphere DataStage on the Information Server engine, and the roles of the Conductor node, Section Leaders and Players in a DataStage job.

OSH script:
The IBM InfoSphere DataStage and QualityStage Designer client creates IBM InfoSphere DataStage jobs that are compiled into parallel job flows, and reusable components that execute on the parallel Information Server engine. It allows you to use familiar graphical point-and-click techniques to develop job flows for extracting, cleansing, transforming, integrating, and loading data into target files, target systems, or packaged applications.

The Designer generates all the code. It generates the OSH (Orchestrate Shell Script) and C++ code for any Transformer stages used.

Briefly, the Designer performs the following tasks:
1. Validates link requirements, mandatory stage options, transformer logic, etc.
2. Generates the OSH representation of data flows and stages (representations of framework "operators").
3. Generates transform code for each Transformer stage, which is compiled into C++ and then into corresponding native operators.
4. Reusable BuildOp stages can be compiled using the Designer GUI or from the command line.

Here is a brief primer on the OSH:
1. Comment blocks introduce each operator; their order is determined by the order in which stages were added to the canvas.
2. OSH uses the familiar syntax of the UNIX shell: operator name, schema, operator options ("-name value" format), inputs (indicated by n<, where n is the input number), and outputs (indicated by n>, where n is the output number).
3. For every operator, input and/or output data sets are numbered sequentially starting from zero.
4. Virtual data sets (in memory native representation of data links) are generated to connect operators.

Note: The actual execution order of operators is dictated by input/output designators, and not by their placement on the diagram. The data sets connect the OSH operators. These are “virtual data sets”, that is, in memory data flows. Link names are used in data set names — it is therefore good practice to give the links meaningful names.
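As a purely illustrative sketch (not output from a real job; the stage, operator options and link names are invented), a generated stage tends to follow the shape described above: a comment block naming the stage, the operator and its options, then numbered virtual data sets for inputs and outputs:

#### STAGE: Sort_Customers
## Operator
tsort
## Operator options
-key 'CUSTOMER_ID'
## General options
[ident('Sort_Customers')]
## Inputs
0< 'Read_Customers:Lnk_Raw.v'
## Outputs
0> 'Sort_Customers:Lnk_Sorted.v'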

Framework (Information Server engine) terms and DataStage terms are equivalent, and the GUI frequently uses terms from both paradigms. Runtime messages use framework terminology because the framework engine is where execution occurs. The following list shows the equivalency between framework and DataStage terms:

1. Schema corresponds to table definition
2. Property corresponds to format
3. Type corresponds to SQL type and length
4. Virtual data set corresponds to link
5. Record/field corresponds to row/column
6. Operator corresponds to stage
7. Step, flow, OSH command correspond to a job
8. Framework corresponds to Information Server Engine

Execution flow
When you execute a job, the generated OSH and the contents of the configuration file ($APT_CONFIG_FILE) are used to compose a "score". This is similar to a SQL query optimization plan. At runtime, IBM InfoSphere DataStage identifies the degree of parallelism and node assignments for each operator, and inserts sorts and partitioners as needed to ensure correct results. It also defines the connection topology (virtual data sets/links) between adjacent operators/stages, and inserts buffer operators to prevent deadlocks (for example, in fork-joins). It also defines the number of actual OS processes. Multiple operators/stages are combined within a single OS process as appropriate, to improve performance and optimize resource requirements.

The job score is used to fork processes with communication interconnects for data, messages and control. Processing begins after the job score and processes are created. Job processing ends when the last row of data is processed by the final operator, a fatal error is encountered by any operator, or the job is halted by DataStage job control or human intervention such as the DataStage Director STOP command.

Job scores are divided into two sections: data sets (partitioning and collecting) and operators (node/operator mapping). Both sections identify sequential or parallel processing. The execution "orchestra" manages control and message flow across processes and consists of the conductor node and one or more processing nodes. Actual data flows from player to player; the conductor and section leaders are only used to control process execution through control and message channels.

 

The Conductor is the initial framework process. It creates the Section Leader (SL) processes (one per node), consolidates messages to the DataStage log, and manages orderly shutdown. The Conductor node runs the start-up process, and the Conductor also communicates with the players.
Note:
1. Set $APT_STARTUP_STATUS to show each step of the job startup, and $APT_PM_SHOW_PIDS to show process IDs in the DataStage log.
2. You can direct the score to the job log by setting $APT_DUMP_SCORE. To identify the score dump, look for "main program: This step....".
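These switches are just environment variables, so one way to trace a run is the sketch below; the project and job names are placeholders, and note that for the values to reach the job they typically need to be defined in the job or project environment (for example through the Administrator client) rather than only in the calling shell:

export APT_STARTUP_STATUS=1   # show each step of job startup
export APT_PM_SHOW_PIDS=1     # show player process IDs in the log
export APT_DUMP_SCORE=1       # dump the job score to the log
$DSHOME/bin/dsjob -run -jobstatus MyProject MyJob
# then look for "main program: This step...." in the Director log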

A Section Leader is a process that forks player processes (one per stage) and manages up/down communications. SLs communicate between the conductor and player processes only. For a given parallel configuration file, one section leader is started for each logical node.

Players are the actual processes associated with the stages. Each player sends stderr and stdout to its section leader, establishes connections to other players for data flow, and cleans up on completion. Each player has to be able to communicate with every other player. There are separate communication channels (pathways) for control, errors, messages and data. The data channel does not go through the section leader or conductor, as this would limit scalability; data flows directly from upstream operator to downstream operator.