-
Comparison: Join, Lookup, Merge
-
Model
- Join: RDBMS-style relational
- Lookup: Source with in-RAM lookup (LU) tables
- Merge: Master-Update(s)
-
Memory Usage
- Join: Light
- Lookup: Heavy
- Merge: Light
-
# and Names of Inputs
- Join: 2 or more (left, right)
- Lookup: 1 source, N lookup tables
- Merge: 1 master, N update(s)
-
Mandatory Input Sort
- Join: All inputs
- Lookup: No
- Merge: All inputs
-
Duplicates in Primary I/P
- Join: OK (cross products)
- Lookup: OK
- Merge: Warning
-
Duplicates in Secondary I/P
- Join: OK (cross products)
- Lookup: Warning!
- Merge: OK only when N = 1
-
Options on Unmatched Primary
- Join: Keep (left, right), Drop (inner)
- Lookup: [Fail] | Continue | Drop | Reject
- Merge: [Keep] | Drop
-
Options on Unmatched Secondary
- Join: Keep (left, right), Drop (inner)
- Lookup: None
- Merge: Captured in reject set(s)
-
On Match, Secondary Entries Are
- Join: Captured
- Lookup: Captured
- Merge: Consumed
-
# of Outputs
- Join: 1
- Lookup: 1 out (1 reject)
- Merge: 1 out (N reject(s))
-
Captured in Reject Set(s)
- Join: Nothing (N/A)
- Lookup: Unmatched primary entries
- Merge: Unmatched secondary entries
-
Horizontally
- These stages differ mainly in
1. Memory usage
2. Treatment of rows with unmatched key values
3. Input requirements (sorted, deduplicated, etc.)
-
Join stage
- Link ordering is important:
it identifies the primary and reference links
-
Multiple inputs/Single Output
- Input notations: left, right & intermediate
- Inputs must be key-partitioned and sorted
-
Types
- Inner
-
Outer
- Left Outer
- Right Outer
- Full outer
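The four join types can be sketched in plain Python over lists of dicts. This is an illustrative sketch of the semantics only, not DataStage code; note how duplicate key values produce cross products, as the comparison above states.

```python
# Minimal sketch of the Join stage's four types (inner, left, right, full).
from collections import defaultdict

def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; duplicate keys give cross products."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    out, matched = [], set()
    for l in left:
        hits = index.get(l[key], [])
        for r in hits:
            out.append({**l, **r})       # matched rows combine columns
            matched.add(id(r))
        if not hits and how in ("left", "full"):
            out.append(dict(l))          # keep unmatched left row
    if how in ("right", "full"):
        out += [dict(r) for r in right if id(r) not in matched]
    return out

left  = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}]
right = [{"k": 1, "b": "p"}, {"k": 3, "b": "q"}]
print(len(join(left, right, "k", "inner")))  # 1: only k=1 matches
print(len(join(left, right, "k", "full")))   # 3: the match plus 2 unmatched rows
```

An inner join keeps only the matched row; left/right additionally keep their side's unmatched rows; full keeps both.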
-
Performance Considerations
-
Pre-sort makes join lightweight
- Few rows need to be in memory
-
RDBMS Style
- Cross products on duplicates
- Matching entries are reusable for multiple matches
- Non matching entries can be captured
- No fail/Reject option for missed matches
- JOIN Vs LOOKUP
-
Lookup stage
- 1 Stream I/P (Source), Multiple reference link,
1 Stream O/P (Target), 1 (Optional) reject link
- Can return multiple matching rows
-
Lookup file
- Indexed by key
-
Small enough to fit into memory
- Built in memory from lookup files
- Known as hash files
- A sequential file can be a lookup file:
choose "Key" from the Columns tab
-
Lookup failure actions
-
Fail
- Reports an error;
the job fails immediately (DEFAULT)
-
Drop
- I/P row with failed lookup is dropped
- Inner Join
- Merge/Drop
-
Continue
- I/P row with failed lookup
goes to O/P with either
default or null values filled
- Join/Outer
- Merge/Keep
-
Reject
- I/P row with failed
lookup goes to reject link
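The four failure actions above can be sketched as follows. This is an assumed, simplified model for illustration, not DataStage code; the `on_failure` parameter name and the `RuntimeError` are inventions of the sketch.

```python
# Sketch of the four lookup-failure actions: Fail, Drop, Continue, Reject.
def lookup(source, reference, key, on_failure="fail"):
    """Look up each source row in `reference`; return (output, rejects)."""
    index = {r[key]: r for r in reference}
    ref_cols = {c for r in reference for c in r} - {key}
    output, rejects = [], []
    for row in source:
        match = index.get(row[key])
        if match is not None:
            output.append({**row, **match})
        elif on_failure == "fail":
            raise RuntimeError(f"lookup failed for key {row[key]!r}")
        elif on_failure == "drop":
            continue                               # row silently dropped
        elif on_failure == "continue":
            output.append({**row, **{c: None for c in ref_cols}})  # nulls fill
        elif on_failure == "reject":
            rejects.append(row)                    # goes to the reject link
    return output, rejects

src = [{"k": 1}, {"k": 2}]
ref = [{"k": 1, "val": "a"}]
out, rej = lookup(src, ref, "k", on_failure="reject")
print(out, rej)  # [{'k': 1, 'val': 'a'}] [{'k': 2}]
```

With `continue`, the unmatched k=2 row still reaches the output with `val` set to None; with `drop` it vanishes; with `fail` the whole run aborts.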
-
Partitioning Care
-
On MPP
- Use Entire partitioning
- Or use hash partitioning with the same key as the source link
-
On SMP
- Entire Partition
- Shared memory, No duplicate copies
- Or Auto Partition
-
Sorting
- Lookup stages do not require input or reference links to be sorted
- Range lookup
-
DB2 or Oracle as lookup
-
Sparse
- Sparse should only be used when the primary (stream)
source is much smaller than the DB2/Oracle lookup reference
- Normal
-
Merge stage
-
Multiple Input link
-
One master link and any number of update links
- As many reject links as update links
- Input must be sorted and key partitioned
-
Single output link
- All the columns from master record +
additional columns from update records
-
Follows the Master/Update model
- 1 sorted, duplicate-free master input (primary)
- 1 or more sorted update links (secondary)
- Optional reject links, one per update input link
-
Key column(s) must be specified
- The master/update merge happens on these keys
- Single or multiple keys from master and update records
-
Concept
-
A master row and one or more update rows
are merged if they have the same value in the user-specified key column(s)
- Allows composite keys
- Matched update rows are consumed
- Unmatched update rows in port n can be captured in output port n
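The Merge-stage concept can be sketched in Python: matched update rows are consumed, unmatched update rows from link n land in reject list n, and the Keep/Drop option governs unmatched master rows. Illustrative semantics only; the `keep_unmatched_master` flag is this sketch's stand-in for the stage's Keep/Drop option.

```python
# Sketch of Merge-stage semantics over a master list and N update lists.
def merge(master, updates, key, keep_unmatched_master=True):
    """Merge master with N update lists on `key`; return (output, rejects)."""
    indexes = [{u[key]: u for u in upd} for upd in updates]
    consumed = [set() for _ in updates]
    output = []
    for m in master:
        row, matched = dict(m), False
        for n, idx in enumerate(indexes):
            u = idx.get(m[key])
            if u is not None:
                row.update(u)            # update columns added to master row
                consumed[n].add(m[key])  # matched update row is consumed
                matched = True
        if matched or keep_unmatched_master:   # Keep vs Drop option
            output.append(row)
    rejects = [[u for u in upd if u[key] not in consumed[n]]
               for n, upd in enumerate(updates)]
    return output, rejects

master  = [{"k": 1, "m": "a"}, {"k": 2, "m": "b"}]
updates = [[{"k": 1, "u": "x"}, {"k": 3, "u": "y"}]]
out, rej = merge(master, updates, "k")
print(out)     # k=1 row gains column u; k=2 kept without updates
print(rej[0])  # [{'k': 3, 'u': 'y'}] -- unmatched update row captured
```

With `keep_unmatched_master=False` the k=2 master row is dropped, mirroring the Drop option.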
-
Performance requirement
- Presort makes component lightweight
-
Merge options
-
Keep
- Keep means that unmatched rows
(those without any updates) from the
master link are output;
-
Drop
- Drop means that unmatched rows are dropped instead.
-
Vertically
-
Aggregator
- Grouping or Summarization
- Zero or more key columns that
define aggregation or grouping unit
-
Aggregation types
- Count rows
- Calculation
- Re-calculation
-
Grouping Method
-
Sort method
- Only a single group is kept in memory.
When a new group is seen, current group
is written out
- Sort mode requires the input data set to have been
partitioned and sorted with all of the grouping keys
specified as hashing and sorting keys
- Can handle unlimited number of groups
-
Hash Method(Default)
- Use hash mode for a relatively small number of groups;
generally, fewer than about 1000 groups per megabyte
of memory.
- Intermediate results for each
group are stored in the hash table
- Final result are written to the output
after all input has been processed
- Use when the number of unique groups is small
- No sort required
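The memory trade-off between the two grouping methods can be sketched as follows (illustrative Python, not DataStage): hash mode keeps one running result per group in a table, while sort mode streams sorted input and only ever holds the current group.

```python
# Hash vs. sort grouping methods, computing a per-group sum.
from itertools import groupby

def hash_aggregate(rows, key, value):
    totals = {}                          # one entry per group, all in memory
    for r in rows:
        totals[r[key]] = totals.get(r[key], 0) + r[value]
    return totals                        # emitted after all input processed

def sort_aggregate(sorted_rows, key, value):
    out = {}
    for k, grp in groupby(sorted_rows, key=lambda r: r[key]):
        out[k] = sum(r[value] for r in grp)  # only the current group in memory
    return out

rows = [{"g": "a", "v": 1}, {"g": "b", "v": 2}, {"g": "a", "v": 3}]
print(hash_aggregate(rows, "g", "v"))                        # {'a': 4, 'b': 2}
srt = sorted(rows, key=lambda r: r["g"])
print(sort_aggregate(srt, "g", "v"))                         # {'a': 4, 'b': 2}
```

Both give the same result, but `sort_aggregate` depends on its input being pre-sorted, which is exactly the sort method's requirement above.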
-
Remove duplicates
-
More sophisticated way
- Can choose to retain the first or last row
-
Other way
-
Using Sort component
- A stable sort always retains the first row in the group
- With a non-stable sort the result is indeterminate
- No choice of which row to keep
-
Sort
-
Why?
- Some stages require sorted input (e.g. Merge, Join)
- Some stages use less memory with sorted input (e.g. Aggregator)
- Has more options to set
- Visible in the flow
-
Alternative (link Sort)
-
Within stages, via the input link "Partitioning" tab
- Partitioning must be set to something other than "Auto"
-
Types of Sort
-
Sequential Sort
- Set Sequential through "Advanced" tab
-
Parallel Sort
- Use hash partitioning
-
Total Sort
- Must be range partitioned
-
Temp Disk space usage
- Scratch disks in the "sort" disk pool (can be created in the config file)
- Scratch disks in the default disk pool (scratch disks are included here by default)
- Directory specified by the TMPDIR environment variable
- The directory /tmp
-
How can the sorted order be disturbed?
-
Repartitioning
- Use "Same" partitioning to avoid damaging the sort order
- Must be careful: repartitioning can destroy the sort order
-
Parallel Sorted data on sequential stages
- An appropriate collector is required
- Input range-partitioned: use "Ordered" collection
- Input hash-partitioned: use "Sort Merge" collection
-
Sort Utility
- DataStage (default)
- UNIX
-
Important properties to set
-
Stable sort
- True: to guarantee that this sort operation will not
rearrange records that are already in a properly
sorted data set.(Default)
- False : no prior ordering of records is
guaranteed to be preserved by the sorting
operation.
-
Create cluster key change column
- True to create a "clusterKeyChange" column in each
o/p record, whose value is 1 for the first record and
0 for the other records in a group; groups are defined when
"Sort mode" is "Don't sort (previously sorted)"
-
Create Keychange column
- True to create a "keyChange" column in each record,
whose value is 1 for the first record and
0 for the other records in a group
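The keyChange column's behavior can be sketched in a few lines of Python (illustrative only; the function name is this sketch's own):

```python
# keyChange column: 1 for the first record of each key group in sorted
# input, 0 for every subsequent record in the same group.
def add_keychange(sorted_rows, key):
    out, prev = [], object()         # sentinel: first row always differs
    for r in sorted_rows:
        out.append({**r, "keyChange": 1 if r[key] != prev else 0})
        prev = r[key]
    return out

rows = [{"k": 1}, {"k": 1}, {"k": 2}]
print([r["keyChange"] for r in add_keychange(rows, "k")])  # [1, 0, 1]
```

Downstream stages can then detect group boundaries by testing keyChange == 1 instead of re-comparing keys.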
-
Restrict memory usage
- 20 MB by default
- Can be tuned to increase the performance of this stage
- The specified memory should be less than the memory available on the processing node
-
Funneling
-
Funnel stage
-
Multiple input data sets to a single output data set
- Metadata of the input data sets must be identical
- Combines data from several identical
sources into one single large output set
-
Types
-
Continuous
- Combines the records from the input links with no guaranteed order
-
Sort Funnel
- Input records maintain the
order based on keys defined
- All input data sets must be sorted by the same key
columns as to be used by the Funnel operation.
- Typically Hash Partitioned
- Allows 1 primary key
and multiple secondary keys
-
Sequence
- Copies all records from the first input link, then the second (and so on), to the output link
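The Sort Funnel and Sequence types can be sketched over plain Python lists (Continuous is omitted since it guarantees no particular order). Illustrative semantics only; the function names are this sketch's own.

```python
# Sort Funnel: all inputs pre-sorted on the key; the merge preserves order.
# Sequence: copies all rows from the first link, then the second, and so on.
import heapq
from itertools import chain

def sort_funnel(key, *inputs):
    """Merge pre-sorted inputs, keeping them ordered on `key`."""
    return list(heapq.merge(*inputs, key=lambda r: r[key]))

def sequence_funnel(*inputs):
    """Concatenate the inputs link by link."""
    return list(chain(*inputs))

a = [{"k": 1}, {"k": 3}]
b = [{"k": 2}, {"k": 4}]
print([r["k"] for r in sort_funnel("k", a, b)])   # [1, 2, 3, 4]
print([r["k"] for r in sequence_funnel(a, b)])    # [1, 3, 2, 4]
```

Note that `sort_funnel` only works correctly if every input is already sorted on the same key, matching the Sort Funnel requirement above.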