Real-time DataStage Interview Questions

What is the difference between the Sequential File and File Set stages?

The Sequential File stage is used to read and write flat files; it can be configured to execute in both parallel and sequential mode, and we cannot perform lookups directly against it. The File Set stage reads and writes a set of data files described by a control file, and it executes only in parallel mode. Its main advantage is working around the 2 GB limit some systems place on the size of a single file, because the data is distributed across files on the processing nodes to prevent overruns.

What is the main function of the staging area in a DWH?

Preparing the extracted data to be loaded into the DWH: examining each extracted file, reviewing the business rules, performing the various data transformation functions, resolving inconsistencies, sorting and merging data, etc.

If 3 tables have different columns, say the first table has 4 columns, the second table has 3 columns and the third table has 2 columns, how do we capture the data using a Funnel stage in parallel jobs?

We can use a Column Generator stage on each of the inputs in order to make the metadata the same for all 3 inputs, because the prerequisite for the Funnel stage is that the metadata must be identical; then we can funnel the data based on the key column.
 
A sequential file has 2 columns and I want to load only one column to the target. How do I do it?

DataStage: use a Modify stage to drop the unwanted column.
Unix: use the cut command.
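
For the Unix approach, a minimal sketch (assuming a comma-delimited file and that only the first field is wanted; the file names and delimiter are illustrative):

# keep only the first field of a comma-delimited file
cut -d',' -f1 source.txt > target.txt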
 
How can we run the same job 2 times in 1 day?

We can do this in different ways:

1. We can write a user-defined shell script.
2. We can do it with a job sequence, using Start Loop and End Loop activities.
3. We can do it by scheduling with DataStage Director.

We can also use an invocation ID to run the job more than once in one day.

In detail: there is an option in Job Properties ---> Allow Multiple Instance. Enable it, and at run time DataStage asks for an invocation ID; give a different ID for each run you want on that day.
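
As a sketch of the shell-script option, once Allow Multiple Instance is enabled the job can be started from the dsjob command with the invocation ID appended to the job name after a dot (the project name, job name and IDs below are only illustrative):

# first run of the day
dsjob -run -jobstatus MyProject LoadCustomers.RUN1
# second run of the same job on the same day, different invocation ID
dsjob -run -jobstatus MyProject LoadCustomers.RUN2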
 
How to Convert the columns into rows?
Using the Pivot stage we can convert columns into rows.
 
If we take 2 tables (like emp and dept) and use a Join stage, how do we improve the performance?

Whenever we join 2 tables on key columns, set modulus partitioning if the key column is numeric and hash partitioning if it is non-numeric. Compared to a Lookup, a Join gives better performance for large volumes.

How would I run job1, then job3, then job2 in a sequence of job1, job2, job3?

You can run job1 and then job3 (and then job2) in two ways: by using conditional triggers on the job activities, or by using a Nested Condition activity.

I have a source file which contains duplicate data. My requirement is that unique data should pass to one file and duplicate data should pass to another file. How?

All you have to do is link your source to a Sort stage, sort the data and generate a Key Change column. Key Change column = 1 means the record is the first (unique) occurrence, while Key Change column = 0 marks the duplicates.

Then put a Filter stage and route the data into two different outputs based on the generated Key Change column.

Is it possible to load data from two tables into one sequential file? If possible, how?

Yes, you can join the two tables and load the data into a sequential file: take the two tables, use a Join stage to join the data on the key, and then load the result into the sequential file.

How can we create a surrogate key in the Transformer stage? I want it in parallel mode.

In the Transformer stage properties there is a tab called "Surrogate Key" where we can define it. The source of the key values can be:
a. Flat file: a state file which holds the SK values and from which the SK values are generated.
b. DB sequence: a database sequence object used to generate the SK values.

I have one table and I want to divide it into two different tables, like even rows and odd rows. How can I do it?

Oracle stage ---> Surrogate Key Generator ---> Transformer

The Transformer uses the mod function in its output constraints:

mod(empkey,2) = 0 ---> even rows
mod(empkey,2) <> 0 ---> odd rows

where empkey is generated by the Surrogate Key Generator.

On how many columns can we perform aggregation in the Aggregator stage?

We can perform aggregation grouping on any number of columns, e.g. group by deptno, deptname.

How many nodes are supported by one CPU in parallel jobs?

A node here is a logical processing unit, and parallel jobs support uniprocessor, SMP, MPP and clustered systems, so the number of nodes supported depends on the hardware architecture.
 
At the source level I have 40 columns and I want only 20 columns at the target. What are the various ways to get this?

Use a Modify stage and drop the columns that are not needed, or use a Transformer and map only the columns that are needed (make sure that RCP is not enabled).

When we have a remove-duplicates option in the Sort stage, why do we have a Remove Duplicates stage in PX?

The Remove Duplicates stage has more options than the Sort stage when removing duplicates. If you have a small volume of data you can go with the Sort stage to remove duplicates; if you have a large volume of data, go for the Remove Duplicates stage.
 
The source has 1000 records and the job runs on three nodes; how many records are executed on each node?

We cannot say exactly how many records are executed on each node; it all depends on the partitioning method used. If we use round robin, the records are distributed roughly equally among the nodes. If a hash method is used, records with the same key values are assigned to the same node, so it depends on the data.

What are the types of nodes?

A node refers to a logical processing unit defined in the configuration file:
1-node file: for sequential execution and lighter reports
Medium-N node file: supports a mix of pipeline and data partition parallelism
Big-N node file: aims at full data-partitioned parallelism
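
The nodes themselves are defined in the parallel configuration file pointed to by APT_CONFIG_FILE. As a minimal sketch, a one-node configuration file could be created and selected as shown below (the host name and resource paths are illustrative, not real settings):

# write a one-node configuration file and point the engine at it
cat > /tmp/one_node.apt <<'EOF'
{
  node "node1"
  {
    fastname "etlserver"
    pools ""
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
}
EOF
export APT_CONFIG_FILE=/tmp/one_node.apt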

What is a time dimension and how do we populate it?

A time dimension holds one row per calendar period (day, month, quarter, year) and every DWH has one; you can load the time dimension through a PL/SQL script.

I have an Oracle Enterprise stage, an Oracle bulk loader stage and an ODBC stage, and I want to load data into the target. Which of these stages should I use and why?

It depends on the load: if it is an incremental load, go for the Oracle Enterprise stage, but if it is the initial load, go for the Oracle bulk loader stage. Oracle Enterprise is faster than ODBC because Oracle Enterprise runs in parallel whereas ODBC runs sequentially.
 
How can we perform the second extraction from the client database without picking up the data that was already loaded in the first extraction?

Use a Change Capture stage and make sure the copy records are dropped (Drop Output For Copy = True).
 
How do we display all the duplicate records from an Oracle table?

select empno from emp group by empno having count(empno) > 1;
To see the full duplicate rows, wrap it in a subquery: select * from emp where empno in (select empno from emp group by empno having count(empno) > 1);

I am running a job with 1000 records. If the job aborts after loading 400 records into the target, I want the load to continue from record 401. How do we do it? This scenario is not for a sequence job, it is within the job itself, e.g. Seq File ---> Transformer ---> Dataset.

In the Data Set stage ---> Update Policy:
Use existing (discard records)
Use existing (discard schema and records)
 
I have a sequential file that contains 10 million records to load into a target database, and the load takes a lot of time. How do we do performance tuning in that situation?

Sequential File ---> Output ---> Available properties to add: Number Of Readers Per Node = 2 or more.

What is the use of the invocation ID?

It is an ID that we specify to identify each instance when a job is used in multiple instances: one unique ID for each instance of the job. When we enable a job to run in multiple instances, the DataStage engine prompts for the invocation ID of that instance. When we check the log in Director, we can identify the run by the job name followed by its invocation ID.

Where do we use the Column Generator stage in a real-time scenario?

The Column Generator stage adds columns to incoming data and generates mock data for these columns for each data row processed; the new data set is then output.

For example, suppose we have two files X and Y and we are funnelling them. File X has only 2 columns and file Y has 3 columns. For the Funnel the metadata must be the same, so to achieve this we add one extra column to file X using a Column Generator stage, which populates mock data in that third column.
 
How do we identify whether a Lookup behaves as an inner join or a left outer join?

Double-click on the Lookup stage ---> Lookup Failure: if it is set to Drop, it behaves as an inner join; if it is set to Continue, it behaves as a left outer join.
 
How do we cleanse data?

Data cleansing is a process of identifying and fixing data inconsistencies and inaccuracies.
For example:
data inaccuracy (the same value spelled differently):
    hyd
    Hydrabad
after cleansing:
    hyderabad
    hyderabad
data inconsistency (inconsistent precision):
    10.78
    10.23465
after cleansing:
    10.78
    10.23
 
What are the environment settings for DataStage while working on parallel jobs?

Mainly we need 3 environment variables to run a DataStage PX job:
1. APT_CONFIG_FILE
2. APT_RECORD_COUNT
3. APT_DUMP_SCORE
 
How do we create a user-defined environment variable (parameter)?

You can create parameters in 2 ways: 1. job-level parameters and 2. project-level parameters.

Job level: when you want to create job-level parameters, go to Job Properties, define them there and use them.

Project level: when you want to create project-level parameters, go to DataStage Administrator, click on Environment Variables and define the variable you need; then open Designer, go to Job Properties, add the variables you defined in Administrator as parameters, and use them.
 
30 jobs are running in Unix and I want to find out my job. How do I do this? Give me the command.

ps -ef | grep <job name or PID>
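
As a sketch, DataStage jobs usually appear on the engine host as DSD.RUN processes, so the running job can be located by its name (the job name LoadCustomers below is only illustrative):

ps -ef | grep DSD.RUN | grep -i LoadCustomers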
 
How do we move a project from development to UAT?

By using the DataStage Manager we can move the project from Dev to UAT: export the project from the DEV server to your local machine in .dsx format (project.dsx), then import the same .dsx file into the UAT server using the DataStage Manager.
 
How do we do error handling in DataStage?

Error handling can be done by using reject links: the errors coming out of a job are captured into a sequential file through the reject link, and that file is then read by another job which loads these exceptions/errors into the database.
 
Can you please tell me how we can call shell scripts / Unix commands in a job sequence?

There are two scenarios where you might want to call a script.

Scenario 1 (a dependency exists between the script and a job): a job has to be executed first, then the script has to run, and only upon completion of the script should the second job be invoked. In this case, develop a sequence job where the first Job Activity invokes the first job, then an Execute Command activity calls the script you want to invoke by typing "sh <script name>" in the Command property of the activity, and then another Job Activity calls the second job.

Scenario 2 (the script and the job are independent): in this case, in your parallel job (say job1), under Job Properties you can find "After-job subroutine"; select "ExecSH" and pass the name of the script you would like to execute. Once job1 completes, the script gets invoked. The job succeeding job1 (say job2) does not wait for the script to finish.
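
As a sketch of the kind of script such an Execute Command activity or ExecSH subroutine might call (the script name, paths and file names are purely illustrative):

#!/bin/sh
# archive_load.sh - archive the landed file after the load job finishes
SRC=/data/landing/customers.dat
ARCHIVE_DIR=/data/archive
mv "$SRC" "$ARCHIVE_DIR/customers_$(date +%Y%m%d).dat"

In the Execute Command activity, the Command property would then be: sh /scripts/archive_load.sh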
 
 
 
