-
Comparison: Join, Lookup, Merge
-
Model
- Join: RDBMS-style relational
- Lookup: Source with in-RAM lookup (LU) tables
- Merge: Master-Update(s)
-
Memory Usage
- Join: Light
- Lookup: Heavy
- Merge: Light
-
# and Names of Inputs
- Join: 2 or more (left, right)
- Lookup: 1 source, N lookup tables
- Merge: 1 master, N update(s)
-
Mandatory Input Sort
- Join: All inputs
- Lookup: No
- Merge: All inputs
-
Duplicates in Primary I/P
- Join: OK (cross products)
- Lookup: OK
- Merge: Warning
-
Duplicates in Secondary I/P
- Join: OK (cross products)
- Lookup: Warning!
- Merge: OK only when N = 1
-
Options on Unmatched Primary
- Join: Keep (left, right), Drop (inner)
- Lookup: [Fail] | Continue | Drop | Reject
- Merge: [Keep] | Drop
-
Options on Unmatched Secondary
- Join: Keep (left, right), Drop (inner)
- Lookup: None
- Merge: Captured in reject set(s)
-
On Match, Secondary Entries Are
- Join: Captured
- Lookup: Captured
- Merge: Consumed
-
# of Outputs
- Join: 1
- Lookup: 1 out (1 reject)
- Merge: 1 out (N reject(s))
-
Captured in Reject Set(s)
- Join: Nothing (N/A)
- Lookup: Unmatched primary entries
- Merge: Unmatched secondary entries
-
Horizontally
- These stages differ mainly in
1. Memory usage
2. Treatment of rows with unmatched key values
3. Input requirements (sorted, deduplicated, etc.)
-
Join stage
- Link ordering is important:
it identifies the primary and reference links
-
Multiple inputs/Single Output
- Input notations: left, right & intermediate
- Inputs must be key-partitioned and sorted
-
Types
- Inner
-
Outer
- Left Outer
- Right Outer
- Full outer
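The four join types can be sketched in plain Python over lists of dicts. This is an illustrative sketch of the semantics only, not DataStage code; note how duplicate key values produce cross products, as the comparison above states.

```python
# Minimal sketch of the Join stage's four types (inner, left, right, full).
from collections import defaultdict

def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; duplicate keys give cross products."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    out, matched = [], set()
    for l in left:
        hits = index.get(l[key], [])
        for r in hits:
            out.append({**l, **r})       # matched rows combine columns
            matched.add(id(r))
        if not hits and how in ("left", "full"):
            out.append(dict(l))          # keep unmatched left row
    if how in ("right", "full"):
        out += [dict(r) for r in right if id(r) not in matched]
    return out

left  = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}]
right = [{"k": 1, "b": "p"}, {"k": 3, "b": "q"}]
print(len(join(left, right, "k", "inner")))  # 1: only k=1 matches
print(len(join(left, right, "k", "full")))   # 3: the match plus 2 unmatched rows
```

An inner join keeps only the matched row; left/right additionally keep their side's unmatched rows; full keeps both.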
-
Performance Considerations
-
Pre-sort makes join lightweight
- Few rows need to be in memory
-
RDBMS Style
- Cross products on duplicates
- Matching entries are reusable for multiple matches
- Non matching entries can be captured
- No fail/Reject option for missed matches
- JOIN Vs LOOKUP
-
Lookup stage
- 1 Stream I/P (Source), Multiple reference link,
1 Stream O/P (Target), 1 (Optional) reject link
- Can return multiple matching rows
-
Lookup file
- Indexed by key
-
Small enough to fit into memory
- Built in memory from lookup files
- Known as hash files
- A sequential file can be a lookup file:
choose "Key" from the Columns tab
-
Lookup failure actions
-
Fail
- Reports an error;
the job fails immediately (DEFAULT)
-
Drop
- I/P row with failed lookup is dropped
- Inner Join
- Merge/Drop
-
Continue
- I/P row with failed lookup
goes to O/P with either
default or null values filled
- Join/Outer
- Merge/Keep
-
Reject
- I/P row with failed
lookup goes to reject link
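The four failure actions above can be sketched as follows. This is an assumed, simplified model for illustration, not DataStage code; the `on_failure` parameter name and the `RuntimeError` are inventions of the sketch.

```python
# Sketch of the four lookup-failure actions: Fail, Drop, Continue, Reject.
def lookup(source, reference, key, on_failure="fail"):
    """Look up each source row in `reference`; return (output, rejects)."""
    index = {r[key]: r for r in reference}
    ref_cols = {c for r in reference for c in r} - {key}
    output, rejects = [], []
    for row in source:
        match = index.get(row[key])
        if match is not None:
            output.append({**row, **match})
        elif on_failure == "fail":
            raise RuntimeError(f"lookup failed for key {row[key]!r}")
        elif on_failure == "drop":
            continue                               # row silently dropped
        elif on_failure == "continue":
            output.append({**row, **{c: None for c in ref_cols}})  # nulls fill
        elif on_failure == "reject":
            rejects.append(row)                    # goes to the reject link
    return output, rejects

src = [{"k": 1}, {"k": 2}]
ref = [{"k": 1, "val": "a"}]
out, rej = lookup(src, ref, "k", on_failure="reject")
print(out, rej)  # [{'k': 1, 'val': 'a'}] [{'k': 2}]
```

With `continue`, the unmatched k=2 row still reaches the output with `val` set to None; with `drop` it vanishes; with `fail` the whole run aborts.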
-
Partitioning Care
-
On MPP
- Use Entire partitioning
- Or use hash partitioning with the same key as the source link
-
On SMP
- Entire Partition
- Shared memory, No duplicate copies
- Or Auto Partition
-
Sorting
- Lookup stages do not require input or reference links to be sorted
- Range lookup
-
DB2 or Oracle as lookup
-
Sparse
- Sparse should only be used when the primary (stream)
source is much smaller than the DB2/Oracle lookup reference
- Normal
-
Merge stage
-
Multiple Input link
-
One master link and any number of update links
- As many reject links as update links
- Input must be sorted and key partitioned
-
Single output link
- All the columns from master record +
additional columns from update records
-
Follows the Master/Update model
- 1 sorted, duplicate-free master input (primary)
- 1 or more sorted update links (secondary)
- Optional reject links, one per update input link
-
Key column(s) must be specified
- The master/update merge happens on these keys
- Single or multiple keys from master and update records
-
Concept
-
A master row and one or more update rows
are merged if they have the same value in the user-specified key column(s)
- Allows composite keys
- Matched update rows are consumed
- Unmatched update rows in port n can be captured in output port n
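The Merge-stage concept can be sketched in Python: matched update rows are consumed, unmatched update rows from link n land in reject list n, and the Keep/Drop option governs unmatched master rows. Illustrative semantics only; the `keep_unmatched_master` flag is this sketch's stand-in for the stage's Keep/Drop option.

```python
# Sketch of Merge-stage semantics over a master list and N update lists.
def merge(master, updates, key, keep_unmatched_master=True):
    """Merge master with N update lists on `key`; return (output, rejects)."""
    indexes = [{u[key]: u for u in upd} for upd in updates]
    consumed = [set() for _ in updates]
    output = []
    for m in master:
        row, matched = dict(m), False
        for n, idx in enumerate(indexes):
            u = idx.get(m[key])
            if u is not None:
                row.update(u)            # update columns added to master row
                consumed[n].add(m[key])  # matched update row is consumed
                matched = True
        if matched or keep_unmatched_master:   # Keep vs Drop option
            output.append(row)
    rejects = [[u for u in upd if u[key] not in consumed[n]]
               for n, upd in enumerate(updates)]
    return output, rejects

master  = [{"k": 1, "m": "a"}, {"k": 2, "m": "b"}]
updates = [[{"k": 1, "u": "x"}, {"k": 3, "u": "y"}]]
out, rej = merge(master, updates, "k")
print(out)     # k=1 row gains column u; k=2 kept without updates
print(rej[0])  # [{'k': 3, 'u': 'y'}] -- unmatched update row captured
```

With `keep_unmatched_master=False` the k=2 master row is dropped, mirroring the Drop option.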
-
Performance requirement
- Presort makes component lightweight
-
Merge options
-
Keep
- Keep means that unmatched rows
(those without any updates) from the
master link are output;
-
Drop
- Drop means that unmatched rows are dropped instead.
-
Vertically
-
Aggregator
- Grouping or Summarization
- Zero or more key columns that
define aggregation or grouping unit
-
Aggregation types
- Count rows
- Calculation
- Re-calculation
-
Grouping Method
-
Sort method
- Only a single group is kept in memory.
When a new group is seen, current group
is written out
- Sort mode requires the input data set to have been
partitioned and sorted with all of the grouping keys
specified as hashing and sorting keys
- Can handle unlimited number of groups
-
Hash Method(Default)
- Use hash mode for a relatively small number of groups;
generally, fewer than about 1000 groups per megabyte
of memory.
- Intermediate results for each
group are stored in the hash table
- Final result are written to the output
after all input has been processed
- Use when the number of unique groups is small
- No sort required
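The memory trade-off between the two grouping methods can be sketched as follows (illustrative Python, not DataStage): hash mode keeps one running result per group in a table, while sort mode streams sorted input and only ever holds the current group.

```python
# Hash vs. sort grouping methods, computing a per-group sum.
from itertools import groupby

def hash_aggregate(rows, key, value):
    totals = {}                          # one entry per group, all in memory
    for r in rows:
        totals[r[key]] = totals.get(r[key], 0) + r[value]
    return totals                        # emitted after all input processed

def sort_aggregate(sorted_rows, key, value):
    out = {}
    for k, grp in groupby(sorted_rows, key=lambda r: r[key]):
        out[k] = sum(r[value] for r in grp)  # only the current group in memory
    return out

rows = [{"g": "a", "v": 1}, {"g": "b", "v": 2}, {"g": "a", "v": 3}]
print(hash_aggregate(rows, "g", "v"))                        # {'a': 4, 'b': 2}
srt = sorted(rows, key=lambda r: r["g"])
print(sort_aggregate(srt, "g", "v"))                         # {'a': 4, 'b': 2}
```

Both give the same result, but `sort_aggregate` depends on its input being pre-sorted, which is exactly the sort method's requirement above.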
-
Remove duplicates
-
More sophisticated way
- Can choose to retain the first or last row
-
Other way
-
Using Sort component
- A stable sort always retains the first row in the group
- With a non-stable sort the result is indeterminate
- No choice of which row to keep
-
Sort
-
Why?
- Some stages require sorted input (e.g. Merge, Join)
- Some stages use less memory with sorted input (e.g. Aggregator)
- Has more options to set
- Visible in the flow
-
Alternative (link Sort)
-
Within stages, via the input link "Partitioning" tab
- Partitioning must be set to something other than "Auto"
-
Types of Sort
-
Sequential Sort
- Set Sequential through "Advanced" tab
-
Parallel Sort
- Use hash partitioning
-
Total Sort
- Must be range partitioned
-
Temp Disk space usage
- Scratch disks in the "sort" disk pool (can be created in the config file)
- Scratch disks in the default disk pool (scratch disks are included here by default)
- Directory specified by the TMPDIR environment variable
- The directory /tmp
-
How can the sorted order be disturbed?
-
Repartitioning
- Use "Same" partitioning to avoid damaging the sort order
- Must be careful: repartitioning can destroy the sort order
-
Parallel Sorted data on sequential stages
- An appropriate collector is required
- Input range-partitioned: use "Ordered" collection
- Input hash-partitioned: use "Sort Merge" collection
-
Sort Utility
- DataStage (default)
- UNIX
-
Important properties to set
-
Stable sort
- True: to guarantee that this sort operation will not
rearrange records that are already in a properly
sorted data set.(Default)
- False : no prior ordering of records is
guaranteed to be preserved by the sorting
operation.
-
Create cluster key change column
- True to create a "clusterKeyChange" column in each
o/p record, whose value is 1 for the first record and
0 for the other records in a group; groups are defined when
"Sort mode" is "Don't sort (previously sorted)"
-
Create Keychange column
- True to create a "keyChange" column in each record,
whose value is 1 for the first record and
0 for the other records in a group
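The keyChange column's behavior can be sketched in a few lines of Python (illustrative only; the function name is this sketch's own):

```python
# keyChange column: 1 for the first record of each key group in sorted
# input, 0 for every subsequent record in the same group.
def add_keychange(sorted_rows, key):
    out, prev = [], object()         # sentinel: first row always differs
    for r in sorted_rows:
        out.append({**r, "keyChange": 1 if r[key] != prev else 0})
        prev = r[key]
    return out

rows = [{"k": 1}, {"k": 1}, {"k": 2}]
print([r["keyChange"] for r in add_keychange(rows, "k")])  # [1, 0, 1]
```

Downstream stages can then detect group boundaries by testing keyChange == 1 instead of re-comparing keys.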
-
Restrict memory usage
- 20 MB by default
- Can be tuned to increase the performance of this stage
- The specified memory should be less than the memory available on the processing node
-
Funneling
-
Funnel stage
-
Multiple input data sets to a single output data set
- Metadata of the input data sets must be identical
- Combines data from several identical
sources into one single large output set
-
Types
-
Continuous
- Combines the records from the input links with no guaranteed order
-
Sort Funnel
- Input records maintain the
order based on keys defined
- All input data sets must be sorted by the same key
columns as to be used by the Funnel operation.
- Typically Hash Partitioned
- Allows 1 primary key
and multiple secondary keys
-
Sequence
- Copies all records from the first input link, then the second (and so on), to the output link
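The Sort Funnel and Sequence types can be sketched over plain Python lists (Continuous is omitted since it guarantees no particular order). Illustrative semantics only; the function names are this sketch's own.

```python
# Sort Funnel: all inputs pre-sorted on the key; the merge preserves order.
# Sequence: copies all rows from the first link, then the second, and so on.
import heapq
from itertools import chain

def sort_funnel(key, *inputs):
    """Merge pre-sorted inputs, keeping them ordered on `key`."""
    return list(heapq.merge(*inputs, key=lambda r: r[key]))

def sequence_funnel(*inputs):
    """Concatenate the inputs link by link."""
    return list(chain(*inputs))

a = [{"k": 1}, {"k": 3}]
b = [{"k": 2}, {"k": 4}]
print([r["k"] for r in sort_funnel("k", a, b)])   # [1, 2, 3, 4]
print([r["k"] for r in sequence_funnel(a, b)])    # [1, 3, 2, 4]
```

Note that `sort_funnel` only works correctly if every input is already sorted on the same key, matching the Sort Funnel requirement above.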