-
Process large-volume data
-
Using datasets
-
Use only for intermediate results
- Persistent data
- Maintain partitioning and sort order
- Avoid format conversion and serial I/O
-
Not for long-term backup and recovery of source data
- Because Data Sets are configuration-file dependent
- To create restart points to rerun jobs or sequences
- Allows running smaller jobs concurrently (based on available system resources)
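- As a sketch, intermediate Data Sets can be inspected and removed with the
orchadmin utility (the path below is illustrative; the tool reads the same
configuration file the job used, and the delete subcommand appears as rm in
some versions):

    orchadmin describe /data/stage/daily_sales.ds
    orchadmin delete /data/stage/daily_sales.ds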
-
Use multiple parallel configuration files
- Helps measure system performance across different configurations (see the sketch below)
- Overall performance depends on proper configuration of scratch disks,
resource disks, and the underlying hardware architecture
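- A minimal one-node configuration file as a sketch (host name and paths are
illustrative); adding nodes, or spreading resource and scratch disks across
file systems, is the kind of tuning the note above refers to:

    {
        node "node1" {
            fastname "etl_server"
            pools ""
            resource disk "/ibm/ds/resource1" {pools ""}
            resource scratchdisk "/ibm/ds/scratch1" {pools ""}
        }
    }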
-
Modular development
- Use shared containers for reusable logic
-
Parameterize jobs
- Run a job multiple times with different parameter values
- Use the Director's multiple-invocation capability
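- A hedged sketch using the dsjob command line (project, job, parameter, and
invocation names are illustrative); the .InvId suffix selects one invocation
of a multiple-instance job:

    dsjob -run -param TargetDB=DEV -param RunDate=2024-01-31 MyProject MyJob.InvId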
-
Designing for good performance
-
Avoid unnecessary data type conversion
- Be careful to use correct source data types (especially with Oracle)
- Set the OSH_PRINT_SCHEMAS environment variable to
verify that runtime schemas match the job design column definitions (see below)
- Stage variables in a Transformer stage: their data types
should match the expected result types
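- For example, as a sketch (set at job level, or at project level via the
Administrator; a value of 1 or True is assumed to enable it), the runtime
record schemas are then written to the job log:

    OSH_PRINT_SCHEMAS=1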
-
Use Transformer stages sparingly and wisely
-
Use Modify Stage
- For explicit type conversion
- NULL handling
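- A hedged sketch of Modify stage specifications (column names are illustrative;
handle_null is the documented NULL-handling function, and the first line
renames a column while relying on the default string-to-integer conversion):

    order_id:int32 = order_id_str
    qty = handle_null(qty, -1)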
-
Use Copy Stage
- Placeholder
- Renaming columns
- Dropping columns
- Implicit type conversion
-
BASIC Transformer stage
- Take advantage of user-defined BASIC routines
- Causes performance problems when used in parallel jobs
-
Custom stage
- For complex, reusable logic that causes
performance problems in a Transformer stage
-
Remove unneeded columns as early as possible
- Every unused column consumes additional buffer memory
-
Combining data
- Join vs. Lookup: prefer Lookup when the reference data fits in memory; prefer Join for large reference volumes
-
Increase sort performance where possible
- Coordinate the sort strategy with the hash partitioning strategy
- If data is already sorted and partitioned on the primary key,
use the "Don't Sort (Previously Sorted)" key mode for those columns
(a secondary sort key may still be applied)
- Writing to parallel Data Sets preserves partitioning and sort order,
so use Same partitioning when reading them back to keep that order
-
Tune "Restrict memory usage" option - Per partitioning (20MB -default)
- Only available for standalone sort, not for inline link sort
-
Default and explicit type conversion
-
Explicit type conversion
- Via the Modify or Transformer stages
-
Points on type conversions
- When converting from variable-length to fixed-length strings using
default conversions, parallel jobs pad the remaining length
with NULL (ASCII zero) characters.
- The APT_STRING_PADCHAR environment variable can be used
to change the default pad character from an ASCII NULL (0x0)
to another character, for example an ASCII space (0x20) or
a Unicode space (U+0020).
- As an alternative, the PadString function can be used
to pad a variable-length (Varchar) string to a specified length
using a specified pad character. Note that PadString
does not work with fixed-length (Char) strings;
convert Char to Varchar first (see the sketch below).
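- For example, in a Transformer derivation (link and column names are
illustrative, and the third argument is assumed here to be the number of pad
characters to append):

    PadString(inLink.code, " ", 5)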
-
Using Sequential File stage
-
Avoid reading from sequential files with the Same partitioning method
- It makes the entire downstream flow run sequentially
- Be careful with bounded-length delimited Varchar columns:
extra characters are discarded if the data exceeds the bound
- If a column is nullable, you must define the null field
value and length in the Edit Column Meta Data dialog box (see below)
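- A hedged sketch of the equivalent record schema for a comma-delimited file
(names illustrative; null_field gives the string that represents NULL in the
file):

    record {delim=','} (
        cust_id: int32;
        cust_name: nullable string[max=30] {null_field='NULL'};
    )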
-
Using DB stages
-
Use "Enterprise DB" stages than plug-in stages
- Avoid generating target tables from DB stages
-
Enterprise database stages give the best parallel performance
- Use the Open/Close command properties to run pre- and post-processing SQL
-
DB Lookup
- Data read by any database stage can serve as the reference
input to a Lookup stage. By default, this reference data is
loaded into memory like any other reference link.
-
Sparse Lookup
- DB2 & Oracle can use this
- Available to reference lookup no intermittent stage
- Send individual SQL statement to reference DB for each incoming lookup row
- Use only when I/P rows are lesser than no. of reference row to DB
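- A hedged sketch of the per-row SQL a sparse lookup issues (table and column
names illustrative; ORCHESTRATE.column is the placeholder for the incoming key
value in user-defined lookup SQL):

    SELECT ref.description
    FROM ref_table ref
    WHERE ref.key_col = ORCHESTRATE.key_col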
-
DB2 Design tips
-
Write vs. Load
-
Write method (using insert, update, upsert, and delete)
- All operations are logged (in the database log)
- Supports time- and row-based commit intervals
- Communicates directly with the DB2 database
nodes to execute instructions in parallel
-
Load Method
- Requires DBADM privilege
- Takes an exclusive lock on the table
- Nonrecoverable if it fails
- The table is left in a load-pending state
- Rerun the load, or run it in truncate mode, to recover from the pending state
-
Oracle Design tips
-
Importing oracle table definitions
- Using the Designer: Import -> Orchestrate schema definitions
- Using the orchdbutil utility
-
Loading and indexes
- Use the "Parallel direct path load" write method
- You must specify index handling options (Rebuild or Maintenance)
-
To avoid index maintenance
- Set the environment variable
APT_ORACLE_LOAD_OPTIONS to OPTIONS(DIRECT=TRUE, PARALLEL=FALSE)
- Or use the "Upsert" write method
-
Teradata Design tips
-
Additional connection properties in the Teradata Enterprise stage
- sessionsperplayer
- requestedsessions
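- As a sketch (values and exact option-string syntax are illustrative):
requestedsessions caps the total number of Teradata sessions requested, and
sessionsperplayer sets how many sessions each player process opens:

    sessionsperplayer=4, requestedsessions=32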