-
Process large-volume data
-
Using datasets
-
Use only for intermediate results
- Persistent data
- Maintain partitioning and sort order
- Avoid format conversion and serial I/O
-
Not for long-term backup and recovery of source data
- Because Data Sets are configuration-file dependent
- To create restart points to rerun jobs or sequences
- Allows running smaller jobs concurrently (based on available system resources)
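- As a sketch, intermediate Data Sets can be inspected and removed with the
orchadmin utility (the path below is illustrative; the tool reads the same
configuration file the job used, and the delete subcommand appears as rm in
some versions):

    orchadmin describe /data/stage/daily_sales.ds
    orchadmin delete /data/stage/daily_sales.ds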
-
Use multiple parallel configuration files
- Helps measure system performance across different configurations (see the sketch below)
- Overall performance depends on proper configuration of scratch disks,
resource disks, and the underlying hardware architecture
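- A minimal one-node configuration file as a sketch (host name and paths are
illustrative); adding nodes, or spreading resource and scratch disks across
file systems, is the kind of tuning the note above refers to:

    {
        node "node1" {
            fastname "etl_server"
            pools ""
            resource disk "/ibm/ds/resource1" {pools ""}
            resource scratchdisk "/ibm/ds/scratch1" {pools ""}
        }
    }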
-
Modular development
- Use shared containers for reusable logic
-
Parameterize jobs
- Run a job multiple times with different parameter values
- Use the Director's multiple-invocation capability
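- A hedged sketch using the dsjob command line (project, job, parameter, and
invocation names are illustrative); the .InvId suffix selects one invocation
of a multiple-instance job:

    dsjob -run -param TargetDB=DEV -param RunDate=2024-01-31 MyProject MyJob.InvId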
-
Designing for good performance
-
Avoid unnecessary data type conversion
- Be careful to use correct source data types (especially with Oracle)
- Set the OSH_PRINT_SCHEMAS environment variable to
verify that runtime schemas match the job design column definitions (see below)
- Stage variables in a Transformer stage: their data types
should match the expected result types
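- For example, as a sketch (set at job level, or at project level via the
Administrator; a value of 1 or True is assumed to enable it), the runtime
record schemas are then written to the job log:

    OSH_PRINT_SCHEMAS=1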
-
Use Transformer stages sparingly and wisely
-
Use Modify Stage
- For explicit type conversion
- NULL handling
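- A hedged sketch of Modify stage specifications (column names are illustrative;
handle_null is the documented NULL-handling function, and the first line
renames a column while relying on the default string-to-integer conversion):

    order_id:int32 = order_id_str
    qty = handle_null(qty, -1)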
-
Use Copy Stage
- Placeholder
- Renaming columns
- Dropping columns
- Implicit type conversion
-
BASIC Transformer stage
- Take advantage of user-defined BASIC routines
- Causes performance problems when used in parallel jobs
-
Custom stage
- For complex, reusable logic that causes
performance problems in a Transformer stage
-
Remove unneeded columns as early as possible
- Every unused column consumes additional buffer memory
-
Combining data
- Join vs. Lookup: prefer Lookup when the reference data fits in memory; prefer Join for large reference volumes
-
Increase sort performance where possible
- Coordinate the sort strategy with the hash partitioning strategy
- If data is already sorted and partitioned on the primary key,
use the "Don't Sort (Previously Sorted)" key mode for those columns
(a secondary sort key may still be applied)
- Writing to parallel Data Sets preserves partitioning and sort order,
so use Same partitioning when reading them back to keep that order
-
Tune "Restrict memory usage" option - Per partitioning (20MB -default)
- Only available for standalone sort, not for inline link sort
-
Default and explicit type conversion
-
Explicit type conversion
- Via the Modify or Transformer stages
-
Points on type conversions
- When converting from variable-length to fixed-length strings using
default conversions, parallel jobs pad the remaining length
with NULL (ASCII zero) characters.
- The APT_STRING_PADCHAR environment variable can be used
to change the default pad character from an ASCII NULL (0x0)
to another character, for example an ASCII space (0x20) or
a Unicode space (U+0020).
- As an alternative, the PadString function can be used
to pad a variable-length (Varchar) string to a specified length
using a specified pad character. Note that PadString
does not work with fixed-length (Char) strings;
convert Char to Varchar first (see the sketch below).
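- For example, in a Transformer derivation (link and column names are
illustrative, and the third argument is assumed here to be the number of pad
characters to append):

    PadString(inLink.code, " ", 5)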
-
Using Sequential File stage
-
Avoid reading from sequential files with the Same partitioning method
- It makes the entire downstream flow run sequentially
- Be careful with bounded-length delimited Varchar columns:
extra characters are discarded if the data exceeds the bound
- If a column is nullable, you must define the null field
value and length in the Edit Column Meta Data dialog box (see below)
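- A hedged sketch of the equivalent record schema for a comma-delimited file
(names illustrative; null_field gives the string that represents NULL in the
file):

    record {delim=','} (
        cust_id: int32;
        cust_name: nullable string[max=30] {null_field='NULL'};
    )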
-
Using DB stages
-
Use "Enterprise DB" stages than plug-in stages
- Avoid generating target tables from DB stages
-
Enterprise database stages give the best parallel performance
- Use the Open/Close command properties to run pre- and post-processing SQL
-
DB Lookup
- Data read by any database stage can serve as the reference
input to a Lookup stage. By default, this reference data is
loaded into memory like any other reference link.
-
Sparse Lookup
- DB2 & Oracle can use this
- Available to reference lookup no intermittent stage
- Send individual SQL statement to reference DB for each incoming lookup row
- Use only when I/P rows are lesser than no. of reference row to DB
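- A hedged sketch of the per-row SQL a sparse lookup issues (table and column
names illustrative; ORCHESTRATE.column is the placeholder for the incoming key
value in user-defined lookup SQL):

    SELECT ref.description
    FROM ref_table ref
    WHERE ref.key_col = ORCHESTRATE.key_col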
-
DB2 Design tips
-
Write vs. Load
-
Write method (using insert, update, upsert, and delete)
- All operations are logged (in the database log)
- Supports time- and row-based commit intervals
- Communicates directly with the DB2 database
nodes to execute instructions in parallel
-
Load Method
- Requires DBADM privilege
- Takes an exclusive lock on the table
- Nonrecoverable if it fails
- The table is left in a load-pending state
- Rerun the load, or run it in truncate mode, to recover from the pending state
-
Oracle Design tips
-
Importing oracle table definitions
- Using the Designer: Import -> Orchestrate schema definitions
- Using the orchdbutil utility
-
Loading and indexes
- Use the "Parallel direct path load" write method
- You must specify index handling options (Rebuild or Maintenance)
-
To avoid index maintenance
- Set the environment variable
APT_ORACLE_LOAD_OPTIONS to OPTIONS(DIRECT=TRUE, PARALLEL=FALSE)
- Or use the "Upsert" write method
-
Teradata Design tips
-
Additional connection properties in the Teradata Enterprise stage
- sessionsperplayer
- requestedsessions
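- As a sketch (values and exact option-string syntax are illustrative):
requestedsessions caps the total number of Teradata sessions requested, and
sessionsperplayer sets how many sessions each player process opens:

    sessionsperplayer=4, requestedsessions=32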