Saturday, 2 November 2013

Datastage Scenario - 1

Scenario:

I have strings
MumbaiBangaloreChennai,
NaviMumbaiBangaloreChennai,
MumbaiPuneBangaloreKolkataChennai,
PuneBangaloreKolkataChennai,

I need to replace Kolkata with Calcutta if it exists in the string. How can it be achieved? This is in DataStage 8.1.


Solution:

If Index(InputCol, "Kolkata", 1) = 0 Then InputCol Else InputCol[1, Index(InputCol, "Kolkata", 1) - 1] : "Calcutta" : InputCol[Index(InputCol, "Kolkata", 1) + 7, Len(InputCol)]

Here Index finds the starting position of "Kolkata"; the remainder of the string begins 7 characters later (the length of "Kolkata"), and Len(InputCol) is a safe length argument for everything up to the end of the string.
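The same replacement can be sanity-checked outside DataStage with standard UNIX tools; a quick sed sketch over the sample strings (illustration only, the job itself uses the transformer derivation above):

```shell
# Quick check of the same substitution with sed.
for s in MumbaiBangaloreChennai NaviMumbaiBangaloreChennai \
         MumbaiPuneBangaloreKolkataChennai PuneBangaloreKolkataChennai
do
  echo "$s" | sed 's/Kolkata/Calcutta/'
done
# Strings without "Kolkata" pass through unchanged, e.g.
# MumbaiPuneBangaloreKolkataChennai -> MumbaiPuneBangaloreCalcuttaChennai
```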



Orchadmin in Datastage- Dataset Deletion, Copy, Extraction



DataStage stores data in a persistent, internal (DataStage-specific) format in the form of Data Sets. Orchestrate Data Sets aid the parallel processing of data and give much faster performance. They help achieve end-to-end parallelism by writing data in partitioned form and maintaining the sort order. An Orchestrate Data Set consists of one or more data files stored on multiple processing nodes. A parallel data set is represented on disk by:



• A single descriptor file - defines the record schema of the data set and the location of all data files in the set. It does not contain the actual data.
• Data files (which contain the actual data) located on one or more processing nodes.


Orchadmin Utility

This is the Orchestrate Administrator utility. It can perform operations on Data Sets that cannot be performed by normal UNIX file commands. The basic syntax is:


orchadmin [command] [options] [descriptor_files]


Commands

The various commands that are available with orchadmin are dump, delete, truncate, copy and describe.


Dump Command


This command writes records from a given data set to standard output, from where they can be redirected to a sequential file. The syntax is:
Syntax-orchadmin dump [options] descriptor_file
If no option is specified, all the records are written to standard output.

ex
1) orchadmin dump test.ds

2) orchadmin dump test.ds > temp.txt

In the second example, the temp.txt file will contain the data present in test.ds.
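Because dump output goes to standard output, it composes with ordinary shell scripting. A minimal sketch (the /data/datasets path is an assumption; substitute your own location) that dumps every data set in a directory to a matching text file:

```shell
# Hedged sketch: dump each *.ds descriptor in a directory to a .txt file.
for ds in /data/datasets/*.ds; do
  [ -e "$ds" ] || continue            # skip when the glob matches nothing
  orchadmin dump "$ds" > "${ds%.ds}.txt"
done
```

The `${ds%.ds}` expansion strips the .ds suffix, so test.ds is dumped to test.txt alongside it.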


Delete Command

The UNIX rm command deletes only the descriptor file; the actual data is not deleted, as it resides in the data files on the processing nodes. To remove the persistent data, the conventional approach is to use Data Set Management within DataStage.

Orchadmin utility simplifies the whole process by providing the delete command. The syntax is:


Syntax-orchadmin delete [-option] ds_1 ... ds_N

ex-orchadmin delete test.ds


Describe Command

This command outputs a report about the datasets specified. The syntax is:
orchadmin describe [-options] descriptor_file
ex- orchadmin describe test.ds

Copy Command

This command creates an identical data set with the same column definitions and number of records. The orchadmin copy command can be used to take backups of existing data sets.

Syntax-orchadmin copy source-ds target-ds

ex-orchadmin copy temp.ds temp_target.ds


Note:
1) If you use the UNIX cp command, only the descriptor file is copied, and the copy points to the same data files residing on the processing nodes.
2) Type orchadmin at the command prompt to get help information about this utility.
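Because cp is unsafe for data sets, backups should go through orchadmin copy. A small sketch (dataset names are assumptions) that takes a dated backup:

```shell
# Hedged sketch: dated backup of a data set via orchadmin copy.
# (A plain cp would duplicate only the descriptor file, as noted above.)
stamp=$(date +%Y%m%d)
if command -v orchadmin >/dev/null 2>&1; then
  orchadmin copy temp.ds "temp_backup_${stamp}.ds"
fi
```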

Datastage Parallel Processing

http://www.youtube.com/watch?v=flkc3AFH8EM
https://www.youtube.com/watch?v=QvZdgcW7iIk





dsjob -run Command / How to Run Datastage Job from Unix Command Line

How to Run Datastage Job from Unix Command Line:

Most data warehousing projects require your jobs to run in batches at specified time slots. In such cases the DataStage jobs are usually scheduled using an external scheduling tool like ESP Scheduler, Control-M, Autosys, etc.

This is made possible by writing scripts that run your jobs through the command line. The command line is a very powerful interface to DataStage that lets you do more than just run a job. The guides in the DataStage documentation are very helpful for exploring everything that can be done through the command line.

However, this covers the basics you will need to carry out your execution.

In UNIX, the DataStage home directory location is specified in the ".dshome" file, which is present in the root directory. Before you can run your DataStage commands, you will have to run the following commands:


cd `cat /.dshome`

This will change the location to the home directory. By default this will be /opt/IBM/InformationServer/Server/DSEngine

. ./dsenv > /dev/null 2>&1


This will source the dsenv file, which contains all the environment variables. Without doing this, your DataStage commands won't run from the command prompt.
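In a wrapper script these two steps are typically bundled into one small function; a sketch, assuming the standard /.dshome marker file is present on the engine host:

```shell
# Hedged sketch: source the DataStage environment before any dsjob call.
ds_env_setup() {
  dshome=$(cat /.dshome) || return 1   # engine home, e.g. .../Server/DSEngine
  cd "$dshome" || return 1
  . ./dsenv > /dev/null 2>&1
}
```

Call ds_env_setup at the top of any script that uses dsjob.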

After you have done this, you can use any DataStage command to interact with the server. The main command is dsjob.

To run a job:

Using the dsjob command you can start, stop, reset, or run the job in validation mode.

dsjob -run -mode VALIDATE/RESET/RESTART project_name job_name

This command runs the job in validation mode. Similarly, you can use RESET or RESTART instead of VALIDATE depending on what type of run you want.

If you want a normal run, you do not need to specify the -mode keyword, as shown below:

dsjob -run project_name job_name|job_name.invocationid

Running with the invocation ID means the job is run with that specific invocation ID.
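In batch scripts it is common to add -jobstatus so that dsjob waits for the job to finish and returns the finishing status as its exit code. A helper sketch (the 1/2/3 code mapping reflects commonly documented dsjob behaviour; verify it on your version):

```shell
# Hedged sketch: translate a dsjob -run -jobstatus exit code into text.
classify_status() {
  case "$1" in
    1) echo "finished OK" ;;
    2) echo "finished with warnings" ;;
    3) echo "aborted" ;;
    *) echo "failed or unknown (code $1)" ;;
  esac
}

# Usage on the DataStage server:
#   dsjob -run -jobstatus project_name job_name
#   classify_status $?
```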

Now, if you have parameter or parameter set values to set, this can also be done as shown below:

dsjob -run -param variable_name="VALUE" -param psParameterSet="vsValueSet" project_name job_name
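When a job takes many parameters, building the -param arguments from a file keeps the wrapper script generic. A sketch, assuming a simple name=value file with one entry per line (values containing spaces would need extra quoting):

```shell
# Hedged sketch: turn a name=value file into dsjob -param arguments.
build_params() {
  args=""
  while IFS= read -r line; do
    args="$args -param $line"
  done < "$1"
  printf '%s' "$args"
}

# Usage:
#   dsjob -run $(build_params my_job.params) project_name job_name
```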

To stop a job:

Stopping a job is fairly simple. You might not actually need it, but it is still worth a look. It acts the same way as stopping a running job from the DataStage Director.

dsjob -stop project_name job_name|job_name.invocationid

To list projects, jobs, stages in jobs, links in jobs, parameters in jobs and invocations of jobs

The dsjob command can very easily give you all of the above based on different keywords. This is useful if you want a report of what is being used in which project.

The various commands are shown below

'dsjob -lprojects'
will give you a list of all the projects on the server

'dsjob -ljobs project_name'
will give you a list of jobs in a particular project

'dsjob -lstages project_name job_name'

will give you a list of all the stages used in your job. Replacing -lstages with -llinks will give you a list of all the links in your job. Using -lparams will give you a list of all the parameters used in your job. Using -linvocations will give you a list of all the invocations of your multiple-instance job.

To generate reports of a job

You can get the basic information of a job by using the 'jobinfo' option as shown below:

dsjob -jobinfo  project_name job_name

Running this command gives you a short report of your job, which includes: the current status of the job, the name of any controlling job, the date and time the job started, the wave number of the last or current run (an internal InfoSphere DataStage reference number), and the user status.

You can get a more detailed report using the below command

dsjob -report project_name job_name BASIC|DETAIL|XML

BASIC means the report contains very basic information, such as the start/end time of the job, the time elapsed, and the current status. DETAIL, as the name indicates, gives a very detailed report on the job, down to the stage and link level. XML gives the same detailed report in XML format.

To access logs:

You can use the below command to get the latest 5 fatal errors from the log of the job that was just run:

dsjob -logsum -type FATAL -max 5 project_name job_name

You can get different types of information based on the keyword you specify for -type. The full list of allowable types is available in the help guide for reference.
There are a number of other options available to get different log information; you can explore these in more detail in the developer guide. With the DataStage commands you can administer jobs, run jobs, maintain jobs, handle errors, prepare meaningful job logs, and even prepare reports. The possibilities are endless. If you like to code, you won't mind spending your time exploring the command-line options available.




Hope this helps