Saturday 25 January 2014

CONFIGURATION FILE in DataStage


In DataStage, the degree of parallelism, the resources used, and so on are all determined at run time, based entirely on the configuration provided in the APT configuration file. This is one of the biggest strengths of DataStage. If you change your processing configuration, or move to a different server or platform, you do not have to redesign your jobs, since all jobs depend on this configuration file for execution; you only have to update the file. DataStage jobs determine which node to run each process on, where to store temporary data, and where to store dataset data from the entries provided in the configuration file. A default configuration file is available whenever the server is installed. You can typically find it under the <>\IBM\InformationServer\Server\Configurations folder with the name default.apt. Bear in mind that you will have to optimise this configuration for your server based on your resources.
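Since everything hinges on which configuration file is picked up at run time, it can help to check that before a run. The sketch below assumes the file is selected through the APT_CONFIG_FILE environment variable; the helper name active_config_file is made up for this example and is not part of DataStage.

```python
import os
from pathlib import Path


def active_config_file():
    """Return the path named by APT_CONFIG_FILE, or None if unset or missing.

    Assumption: the parallel engine reads the configuration file location
    from the APT_CONFIG_FILE environment variable.
    """
    value = os.environ.get("APT_CONFIG_FILE")
    if not value:
        return None
    path = Path(value)
    return path if path.is_file() else None


path = active_config_file()
print(path if path else "APT_CONFIG_FILE is not set or does not point to a file")
```

Running this in the same environment as the job shows whether the variable points to a real file.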
Basically, the configuration file lists the processing nodes and specifies the disk space provided for each of them. When we talk about processing nodes, remember that these are logical processing nodes specified in the configuration file. So if you have more than one CPU, this does not mean the nodes in your configuration file correspond to these CPUs. It is possible to have more than one logical node on a single physical node. However, you should be careful about how many logical nodes you configure on a single physical node. Increasing the number of nodes increases the degree of parallelism, but it does not necessarily mean better performance, because it also results in more processes. If your underlying system does not have the capacity to handle this load, you will end up with a very inefficient configuration.
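To get a feel for how quickly the process count grows with the number of logical nodes, here is a rough estimate. It assumes one conductor process per job, one section leader per logical node, and one player process per stage per node, with no stages combined into a single process. Those assumptions, the function name, and the 10-stage job are illustrative only and are not taken from a real job.

```python
def estimate_processes(logical_nodes, stages):
    """Rough upper bound on OS processes started for one parallel job.

    Assumption: 1 conductor, 1 section leader per logical node, and
    1 player per stage per logical node (no operator combination).
    """
    conductor = 1
    section_leaders = logical_nodes
    players = logical_nodes * stages
    return conductor + section_leaders + players


# Hypothetical job with 10 stages, run under different configuration files.
for nodes in (1, 2, 4, 8):
    print(f"{nodes} node(s): about {estimate_processes(nodes, 10)} processes")
```

Under these assumptions, going from 1 to 8 logical nodes takes the same job from about 12 processes to about 89, which is why piling logical nodes onto a small machine can hurt rather than help.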
Now let's try our hand at interpreting a configuration file, using the sample below.
{
    node "node1"
    {
        fastname "SVR1"
        pools ""
        resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
        resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
    }
    node "node2"
    {
        fastname "SVR1"
        pools ""
        resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
        resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
    }
    node "node3"
    {
        fastname "SVR2"
        pools "" "sort"
        resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
        resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
    }
}
This is a 3-node configuration file. Let's go through the basic entries and what they represent.
Fastname – This refers to the name of the node on the fast network. From this we can infer that node1 and node2 are on the same physical node (SVR1), whereas node3 is on a different physical node (SVR2). So for node1 and node2, all the resources are shared: the disk and scratch disk specified are actually shared between those two logical nodes. Node3, on the other hand, has its own disk and scratch disk space.
Pools – Pools allow us to group processing nodes based on their functions and characteristics. If you see an entry such as "node0", or one of the reserved node pool names such as "sort" or "db2", it means that the node is part of the specified pool. By default a node is associated with the default pool, which is indicated by "". If you look at node3, you can see that it is also associated with the sort pool. This ensures that the sort stage runs only on nodes that are part of the sort pool.
Resource disk – This specifies the location on your server where the processing node writes its data set files. As you might know, when DataStage creates a dataset, the file you see does not contain the actual data; it points to the place where the actual data is stored. That place is what this line specifies.
Resource scratchdisk – This specifies the location of temporary files created during DataStage processing, for example by lookups and sorts. If the node is part of the sort pool, the scratch disk can also be made part of the sort scratch disk pool, which ensures that the temporary files created during a sort are stored only in this location. If no such pool is specified, DataStage checks whether any scratch disk resources on the nodes that the sort is specified to run on belong to the default scratch disk pool, and if so uses that space.
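The same reading of the sample can be done mechanically. The following is a minimal sketch, not a full parser for the configuration syntax: it assumes one entry per line, as in the sample, and the names parse_nodes, by_host and sort_nodes are invented for this example.

```python
import re
from collections import defaultdict

# The 3-node sample from this post, written with plain double quotes.
SAMPLE = '''
{
node "node1"
{
fastname "SVR1"
pools ""
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
node "node2"
{
fastname "SVR1"
pools ""
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
node "node3"
{
fastname "SVR2"
pools "" "sort"
resource disk "C:/IBM/InformationServer/Server/Datasets/Node1" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch/Node1" {pools ""}
}
}
'''

QUOTED = re.compile(r'"([^"]*)"')


def parse_nodes(text):
    """Collect fastname, node pools and resource paths for each node.

    Assumes one entry per line; brace-only lines are ignored.
    """
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("node "):
            current = {"fastname": None, "pools": [], "disk": [], "scratchdisk": []}
            nodes[QUOTED.findall(line)[0]] = current
        elif current is None:
            continue
        elif line.startswith("fastname"):
            current["fastname"] = QUOTED.findall(line)[0]
        elif line.startswith("pools"):
            current["pools"] = QUOTED.findall(line)
        elif line.startswith("resource"):
            kind = line.split()[1]          # "disk" or "scratchdisk"
            path = QUOTED.findall(line)[0]  # first quoted string is the path
            current.setdefault(kind, []).append(path)
    return nodes


nodes = parse_nodes(SAMPLE)

# Logical nodes that share a fastname sit on the same physical machine.
by_host = defaultdict(list)
for name, node in nodes.items():
    by_host[node["fastname"]].append(name)

# Nodes that belong to the reserved "sort" node pool.
sort_nodes = [name for name, node in nodes.items() if "sort" in node["pools"]]

print(dict(by_host))  # {'SVR1': ['node1', 'node2'], 'SVR2': ['node3']}
print(sort_nodes)     # ['node3']
```

Grouping by fastname gives the physical layout, and filtering on the pools entry shows which nodes a pool-constrained stage could use.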


[Figure: sample resource allocation for 1-node and 4-node configurations]

SAMPLE CONFIGURATION FILES

 

Configuration file for a simple SMP

 

A basic configuration file for a single-machine, two-node server (2 CPUs) is shown below. The file defines 2 nodes (node1 and node2) on a single server, dev (an IP address may be given instead of a hostname), with 3 disk resources (d1 and d2 for data, and Scratch as scratch space).

The configuration file is shown below: 



{
    node "node1"
    {
        fastname "dev"
        pools ""
        resource disk "/IIS/Config/d1" { }
        resource disk "/IIS/Config/d2" { }
        resource scratchdisk "/IIS/Config/Scratch" { }
    }

    node "node2"
    {
        fastname "dev"
        pools ""
        resource disk "/IIS/Config/d1" { }
        resource scratchdisk "/IIS/Config/Scratch" { }
    }
}
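Node definitions on a single machine follow a regular pattern, so a file like this can be generated rather than typed by hand. Below is a minimal sketch, assuming every logical node shares the same disks and scratch path (unlike the sample, where node2 lists only d1). make_smp_config and its parameters are names invented for this example and are not part of DataStage.

```python
def make_smp_config(fastname, node_count, disks, scratch):
    """Build configuration text with node_count logical nodes on one host.

    Assumption: all nodes use the same disk and scratch paths and belong
    only to the default pool "".
    """
    lines = ["{"]
    for i in range(1, node_count + 1):
        lines.append(f'    node "node{i}"')
        lines.append("    {")
        lines.append(f'        fastname "{fastname}"')
        lines.append('        pools ""')
        for disk in disks:
            lines.append(f'        resource disk "{disk}" {{pools ""}}')
        lines.append(f'        resource scratchdisk "{scratch}" {{pools ""}}')
        lines.append("    }")
    lines.append("}")
    return "\n".join(lines) + "\n"


config_text = make_smp_config(
    "dev", 2, ["/IIS/Config/d1", "/IIS/Config/d2"], "/IIS/Config/Scratch"
)
print(config_text)
```

Changing node_count is then the only edit needed to try a different degree of parallelism on the same server.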
          

 

 

Configuration file for a cluster / MPP / grid


The sample configuration file for a cluster or grid of 4 machines is shown below.
The configuration defines 4 nodes (node[1-4]), node pools (n[1-4] and s[1-4]), the resource pools bigdata and sort, and a temporary (scratch) space on each node.



{
            node "node1"
            {
                        fastname "dev1"
                        pools "" "n1" "s1" "sort"
                        resource disk "/IIS/Config1/d1" {}
                        resource disk "/IIS/Config1/d2" {pools "bigdata"}
                        resource scratchdisk "/IIS/Config1/Scratch" {pools "sort"}
            }

            node "node2"
            {
                        fastname "dev2"
                        pools "" "n2" "s2"
                        resource disk "/IIS/Config2/d1" {}
                        resource disk "/IIS/Config2/d2" {pools "bigdata"}
                        resource scratchdisk "/IIS/Config2/Scratch" {}
            }

            node "node3"
            {
                        fastname "dev3"
                        pools "" "n3" "s3"
                        resource disk "/IIS/Config3/d1" {}
                        resource scratchdisk "/IIS/Config3/Scratch" {}
            }

            node "node4"
            {
                        fastname "dev4"
                        pools "n4" "s4"
                        resource disk "/IIS/Config4/d1" {}
                        resource scratchdisk "/IIS/Config4/Scratch" {}
            }
}
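To see what the node pools in this sample imply, the sketch below copies the pool entries of the four nodes into a dictionary and lists the members of a given pool. The names NODE_POOLS and nodes_in_pool are invented for this example. Note that node4, as written in the sample, does not list the default pool "", so under the assumption that unconstrained stages run on the default pool it would only be used by stages constrained to n4 or s4.

```python
# Node pool membership copied from the cluster sample above.
NODE_POOLS = {
    "node1": ["", "n1", "s1", "sort"],
    "node2": ["", "n2", "s2"],
    "node3": ["", "n3", "s3"],
    "node4": ["n4", "s4"],
}


def nodes_in_pool(pool):
    """Return the nodes whose pools entry contains the given pool name."""
    return [name for name, pools in NODE_POOLS.items() if pool in pools]


print("default pool:", nodes_in_pool(""))   # ['node1', 'node2', 'node3']
print("sort pool:", nodes_in_pool("sort"))  # ['node1']
print("n4 pool:", nodes_in_pool("n4"))      # ['node4']
```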




Resource disk: A disk path is defined here. The data files of a dataset are stored on the resource disk.

Resource scratchdisk: A path to a folder is also defined here. Parallel job stages use this path to buffer data while the job runs.


COURTESY: http://mydatastage-notes.blogspot.in/2013/04/configuration-file.html
