Ab Initio Parallel Processing

Depth of Parallelism
Data Skew
1. m_du, m_df, and m_ls can be used to identify data skew
Component Parallelism
1. Program components execute simultaneously on different branches of the graph
2. E.g., sort the customer and transaction data on separate branches of the same graph, then join them on a key
Pipeline Parallelism
1. Occurs when a connected sequence of program components on the same branch of a graph execute simultaneously.
2. Not available with every Ab Initio component
  1. Pipeline parallelism is broken when a Sort component is used (it must read all input records before it can write any output)
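The effect of Sort on pipelining can be pictured with Python generators. This is only an analogy: generators take turns rather than running simultaneously, and the stage names below are hypothetical, not Ab Initio components. A streaming stage hands each record downstream as soon as it arrives, while a sort stage has to consume its whole input before it emits anything.

```python
events = []  # order in which the stages touch records


def read_records(records):
    for rec in records:
        events.append(("read", rec))
        yield rec


def reformat(stream):
    # Streaming stage: passes each record on as soon as it arrives.
    for rec in stream:
        events.append(("reformat", rec))
        yield rec * 10


def sort_stage(stream):
    # Blocking stage: sorted() consumes the entire input before any output.
    for rec in sorted(stream):
        events.append(("sorted_out", rec))
        yield rec


# Streaming pipeline: reads and reformats alternate record by record.
streamed = list(reformat(read_records([3, 1, 2])))
streaming_events = list(events)

events.clear()

# Pipeline containing a sort: every read happens before the first write.
sorted_out = list(sort_stage(read_records([3, 1, 2])))
sorting_events = list(events)
```

In `streaming_events` each read is followed immediately by its reformat; in `sorting_events` all three reads come first, which is the point at which the pipeline stalls.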
Data Parallelism
1. Partitions
  1. Partition components
    1. Partition By Key
      1. It distributes data records to its output flow partitions according to key values
      2. Runtime behaviour
      3. Reads records in arbitrary order from the in port
      4. Distributes them to the flows connected to the out port, according to the key parameter, writing records with the same key value to the same output flow
      5. Parameter
      6. key
      7. See also the Partition by Key and Sort component
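A minimal sketch of the key-based routing described above, assuming a hash of the key value picks the output flow. The notes do not say how the component maps keys to flows, so the `crc32` hash, the function name, and the sample records are all illustrative assumptions. The property that matters is that records with the same key value always land in the same flow.

```python
import zlib


def partition_by_key(records, key, num_flows):
    """Route each record to a flow chosen from a hash of its key value."""
    flows = [[] for _ in range(num_flows)]
    for rec in records:
        # crc32 is a stand-in hash (an assumption); unlike Python's built-in
        # hash() for strings, it gives the same value on every run.
        digest = zlib.crc32(str(rec[key]).encode("utf-8"))
        flows[digest % num_flows].append(rec)
    return flows


records = [
    {"cust_id": "A1", "amount": 10},
    {"cust_id": "B2", "amount": 25},
    {"cust_id": "A1", "amount": 7},
    {"cust_id": "C3", "amount": 40},
    {"cust_id": "B2", "amount": 3},
]
flows = partition_by_key(records, "cust_id", 3)
```

Because the flow depends only on the key value, a few very frequent keys can overload one partition, which is one way data skew arises.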
    2. Partition by Expression
      1. It distributes data records to its output flow partitions according to a specified DML expression.
      2. Runtime behaviour
      3. Reads records in arbitrary order from the flows connected to the in port
      4. Distributes the records to the flows connected to the out port, according to the expression in the function parameter
      5. Parameter
      6. Function
      7. The expression must evaluate to a number between 0 and the number of flows connected to the out port minus 1.
      8. Partition by Expression routes the record to the flow number returned by this expression.
      9. Flow numbers start at 0.
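      10. E.g., with the DML expression zipcode/10000, zipcode 30338 yields 3 once the result is truncated to an integer, so the record goes to flow number 3, which is the fourth output flow because numbering starts at 0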
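The routing rule above can be sketched in Python. The function names are hypothetical, and the use of `//` (truncating integer division) is an assumption about how the zipcode expression becomes a whole flow number. The range check mirrors the requirement that the result lie between 0 and the number of output flows minus 1.

```python
def partition_by_expression(records, function, num_flows):
    """Route each record to the flow number returned by `function`."""
    flows = [[] for _ in range(num_flows)]
    for rec in records:
        flow_number = function(rec)
        # The expression must stay within 0 .. num_flows - 1.
        if not 0 <= flow_number < num_flows:
            raise ValueError(f"flow number {flow_number} is out of range")
        flows[flow_number].append(rec)
    return flows


def zipcode_region(rec):
    # Leading digit of a five-digit zipcode, via integer division.
    return rec["zipcode"] // 10000


records = [{"zipcode": 30338}, {"zipcode": 2139}, {"zipcode": 94105}]
flows = partition_by_expression(records, zipcode_region, 10)
```

Here zipcode 30338 ends up in `flows[3]`, the fourth flow, because flow numbers start at 0.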
    3. Partition by Range
      1. Tightly coupled with the Find Splitters component
      2. Find Splitters component
      3. Sorts data records according to a key specifier, then finds the ranges of key values that divide the total number of input data records approximately evenly into a specified number of partitions.
      4. Parameters of Find Splitters
      5. Key
      6. Name(s) of the key field(s) and the sequence specifier(s) you want Find Splitters to use when it orders data records and sets splitter points.
      7. num_partitions
      8. Number of partitions into which you want to divide the total number of data records evenly.
      9. Runtime behaviour of Find Splitters
      10. Reads records from the in port
      11. Sorts the records according to the key specifier in the key parameter
      12. Writes a set of splitter points to the out port in a format suitable for the split port of Partition by Range
      13. Typically, you route the output from the out port of Find Splitters to the split port of PARTITION BY RANGE.
      14. Partition by Range itself has two input ports
      15. in port
      16. split port
      17. Parameter of Partition by Range
      18. key
      19. Note: The field(s) specified must exist in the record formats for both the in and split ports, and must be of the same type in both record formats.
      20. Runtime behaviour of Partition by Range
      21. Reads splitter records from the split port, and assumes that these records are sorted according to the key parameter.
      22. Determines whether the number of flows connected to the out port is equal to n (where n-1 represents the number of splitter records).
      23. If not, Partition by Range writes an error message and stops the execution of the graph.
      24. Reads data records from the flows connected to the in port in arbitrary order.
      25. Distributes the data records to the flows connected to the out port according to the values of the key field(s), as follows:
      26. Assigns records with key values less than or equal to the first splitter record to the first output flow.
      27. Assigns records with key values greater than the first splitter record, but less than or equal to the second splitter record to the second output flow, and so on.
      28. Important considerations when using this component with Find Splitters
      29. Use the same key specifier for both components.
      30. Make the number of partitions on the flow connected to the out port of Partition by Range the same as the value in the num_partitions parameter of Find Splitters.
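How the two components cooperate can be sketched as follows. This is a simplified model with hypothetical function names working on bare key values; picking the last key of each equal-sized slice as the splitter is an assumption, since the notes only say the split should be approximately even. The assignment rule follows the notes: keys less than or equal to the first splitter go to the first flow, and so on.

```python
from bisect import bisect_left


def find_splitters(keys, num_partitions):
    """Return num_partitions - 1 splitter points (assumes enough keys)."""
    ordered = sorted(keys)
    # Last key of each of the first num_partitions - 1 equal-sized slices.
    return [ordered[(i * len(ordered)) // num_partitions - 1]
            for i in range(1, num_partitions)]


def partition_by_range(keys, splitters, num_flows):
    # n output flows require exactly n - 1 splitter records.
    if num_flows != len(splitters) + 1:
        raise ValueError("need exactly one more output flow than splitters")
    flows = [[] for _ in range(num_flows)]
    for k in keys:
        # bisect_left counts the splitters strictly less than k, so a key
        # less than or equal to the first splitter lands in flow 0.
        flows[bisect_left(splitters, k)].append(k)
    return flows


keys = [7, 2, 5, 8, 1, 4, 6, 3]
splitters = find_splitters(keys, 4)             # [2, 4, 6]
flows = partition_by_range(keys, splitters, 4)  # [[2, 1], [4, 3], [5, 6], [7, 8]]
```

Each flow holds one contiguous key range, and records within a flow keep their arrival order, which is why a per-partition sort is still needed afterwards for a global sort.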
    4. Partition with Load Balance
      1. Distributes data records to its output flow partitions by writing more records to the flow partitions that consume records faster
      2. No parameters
      3. Runtime behaviour
      4. Reads records in arbitrary order from the flows connected to its in port
      5. Distributes those records among the flows connected to its out port by sending more records to the flows that consume records faster
      6. Partition with Load Balance writes data records until each flow's output buffer fills up.
      7. Important Note
      8. Although Partition with Load Balance balances the workload between CPUs, the resulting number of data records in each partition can be unbalanced. You can use PARTITION BY ROUND-ROBIN to balance the number of data records among partitions.
    5. Partition by Round-robin
      1. Distributes blocks of data records evenly to each output flow in round-robin fashion
      2. Parameter
      3. block_size
      4. Number of records distributed to one flow before distributing the same number to the next flow. Default is 1.
      5. Runtime behaviour
      6. Reads records from the in port.
      7. Distributes them in block_size chunks to its output flows according to the order in which the flows are connected
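The block-wise distribution can be sketched in a few lines of Python. The function name is hypothetical and the list index stands in for the order in which the flows are connected.

```python
def partition_by_round_robin(records, num_flows, block_size=1):
    """Deal records to the flows in blocks of block_size, in flow order."""
    flows = [[] for _ in range(num_flows)]
    for i, rec in enumerate(records):
        # Record i belongs to block i // block_size; blocks cycle over flows.
        flows[(i // block_size) % num_flows].append(rec)
    return flows


flows = partition_by_round_robin(list(range(10)), num_flows=3, block_size=2)
# flows == [[0, 1, 6, 7], [2, 3, 8, 9], [4, 5]]
```

The record counts differ by at most one block, regardless of the record contents, which is why this method evens out partition sizes.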
    6. Partition by Percentage
      1. Distributes a specified percentage of the total number of input data records to each output flow
      2. Parameter
      3. Percentage
      4. Comma-separated list of percentages, each between 1 and 100
      5. You can assign a different percentage to each output flow
      6. Runtime behaviour
      7. Reads records from the in port
      8. Writes a specified percentage of the input records to each flow on the out port
      9. Additional input port
      10. It has an in port and a pct port
      11. The percentages can also be supplied by connecting the output of any component that produces a list of percentages to the pct port of Partition by Percentage. Use decimal('\n') as the record format for the pct port.
    7. Broadcast
      1. Broadcast arbitrarily combines all the data records it receives into a single flow and writes a copy of that flow to each of its output flow partitions.
      2. Runtime behaviour
      3. Reads records from all flows on the in port
      4. Combines the records arbitrarily into a single flow
      5. Copies all the records to all the flow partitions connected to the out port
      6. Use
      7. Use Broadcast to increase data parallelism when you have connected a single fan-out flow to the out port or to increase component parallelism when you have connected multiple straight flows to the out port.
2. Using Parallel files
  1. Multifile
    1. Structure
      1. Control file
      2. Data files, which are located on different disks
    2. Ad-hoc multifile
    3. Referencing multifile
      1. E.g., mfile:/ab1/initi/test.dat
    4. Multifile co-op commands
      1. m_mkfs - Create a multifile system; m_rmfs - Remove a multifile system
      2. m_mkdir, m_rmdir - Create and remove a multidirectory
      3. m_cp - Copy a multifile
      4. m_mv - Move a multifile
      5. m_chmod - Change mode
      6. m_touch - Create an empty multifile
      7. m_ls - List multifiles
      8. m_ls also reports the data skew percentage
      9. m_du - Print disk usage
      10. m_du reports disk usage in KB
      11. The -partitions option reports the size of each partition
      12. m_df - Print information about a multifile system
      13. m_expand - Prints to stdout various information about the multifile, multidirectory, file, or directory
3. Flow
  1. Straight
    1. Connects components that have the same depth of parallelism: parallel to parallel, serial to serial
  2. Fan-in
    1. Connects a component with a greater depth of parallelism to one with a lesser depth.
    2. A kind of many-to-one relationship (not always; see the next item)
    3. E.g., 4-way to serial, 4-way to 2-way
    4. You can only use a fan-in flow when the result of dividing the greater number of partitions by the lesser number of partitions is an integer. If this is not the case, you must use an all-to-all flow.
  3. Fan-out
    1. Connects a component with a lesser number of partitions to one with a greater number of partitions
    2. A kind of one-to-many relationship (not always; see the next item)
    3. E.g., serial to 4-way parallel, 2-way parallel to 4-way parallel
  4. All-to-All
    1. Used in two situations
      1. Connect components with different numbers of partitions, when the result of dividing the greater number of partitions by the lesser number is not an integer
      2. Repartition data, using components with the same or different numbers of partitions
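The choice between flow types, based on partition counts alone, can be summarised as a small decision function. This is a sketch with a hypothetical function name. The notes state the integer-division rule explicitly for fan-in; applying the same rule to fan-out is an inference from the all-to-all condition. The function does not cover the second all-to-all case, repartitioning, where an all-to-all flow is used even if the counts match.

```python
def flow_type(src_partitions, dst_partitions):
    """Pick a flow type from the partition counts of the two components."""
    if src_partitions == dst_partitions:
        return "straight"
    greater = max(src_partitions, dst_partitions)
    lesser = min(src_partitions, dst_partitions)
    # Fan-in and fan-out need the greater count to divide evenly.
    if greater % lesser != 0:
        return "all-to-all"
    return "fan-in" if src_partitions > dst_partitions else "fan-out"


examples = {
    (4, 1): flow_type(4, 1),  # 4-way to serial
    (4, 2): flow_type(4, 2),  # 4-way to 2-way
    (1, 4): flow_type(1, 4),  # serial to 4-way
    (3, 2): flow_type(3, 2),  # 3 / 2 is not an integer
}
```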
4. Repartitioning
  1. Changing one or both of the following
    1. The degree of parallelism of partitioned data
    2. The grouping of records within the partitions of partitioned data
  2. Why repartition
    1. To read a partitioned data file with a program component that has a greater number of partitions, which increases processing speed
    2. To connect two processing stages that have different degrees of parallelism
    3. To balance the load across CPUs
    4. To perform a global sort
  3. Example
    1. Sorting a multifile
Departitioning
1. Concatenate
  1. appends multiple flow partitions of data records one after another
  2. No Parameter
  3. Runtime behaviour
    1. Reads all the data records from the first flow connected to the in port (counting from top to bottom on the graph) and copies them to the out port.
    2. Then reads all the data records from the second flow connected to the in port and appends them to those of the first flow, and so on.
  4. Automatic flow buffering should be on to avoid deadlock
  5. No default record assignment. The input record format must be identical to the output record format
2. Merge
  1. combines data records from multiple flow partitions that have been sorted according to the same key specifier, and maintains the sort order
  2. Parameter
    1. Key
  3. Caution
    1. Every input flow partition must already be sorted according to the same key specifier; Merge preserves that order but does not sort the records itself
  4. Normally used after Sort components
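As an analogy for what Merge does, Python's `heapq.merge` combines inputs that are already sorted and keeps the result sorted. The partitions and the `id` field below are made-up sample data. The second result shows what appending the same partitions one after another (as Concatenate does) would give instead.

```python
import heapq
from itertools import chain
from operator import itemgetter

# Three partitions, each already sorted on the key "id".
partitions = [
    [{"id": 1}, {"id": 4}, {"id": 9}],
    [{"id": 2}, {"id": 3}],
    [{"id": 5}, {"id": 8}],
]

# Merge: repeatedly take the smallest head record across all partitions.
merged_ids = [rec["id"] for rec in heapq.merge(*partitions, key=itemgetter("id"))]
# merged_ids == [1, 2, 3, 4, 5, 8, 9]

# Concatenate: append the partitions in order; global order is lost.
concatenated_ids = [rec["id"] for rec in chain(*partitions)]
# concatenated_ids == [1, 4, 9, 2, 3, 5, 8]
```

If an input partition were not sorted on the key, the merged output would not be sorted either, which is why Merge normally follows a Sort.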
3. Gather
  1. combines data records from multiple flow partitions arbitrarily
  2. Runtime behaviour
    1. Reads data records from the flows connected to the in port.
    2. Combines the records arbitrarily.
    3. Writes the combined records to the out port.
  3. Usage
    1. Reduce data parallelism, by connecting a single fan-in flow to the in port
    2. Reduce component parallelism, by connecting multiple straight flows to the in port
  4. No parameters
  5. No gather for sort component
    1. You do not need to use a Gather component when connecting a fan-in or all-to-all flow to the in port of a Sort component, because Sort can gather internally on its in port.
4. Interleave
  1. combines blocks of data records from multiple flow partitions in round-robin fashion
  2. Runtime Behaviour
    1. Reads the number of data records specified in the blocksize parameter from the first flow connected to the in port
    2. Reads the number of data records specified in the blocksize parameter from the next flow, and so on
    3. Writes the records to the out port
  3. Parameter
    1. Blocksize
      1. Works like the block_size parameter of Partition by Round-robin
  4. Related to Partition by Round-robin
  5. Can cause deadlock
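The block-wise reading performed by Interleave can be sketched as below. The function name is hypothetical, and dropping a flow once it runs dry is an assumption, since the notes do not describe what happens with uneven flows. The sample input is what a 3-way round-robin partitioning of the records 0 to 9 with a block size of 2 would produce, so interleaving with the same block size restores the original order.

```python
from itertools import islice


def interleave(flows, blocksize=1):
    """Read blocksize records from each flow in turn until all are empty."""
    iterators = [iter(flow) for flow in flows]
    out = []
    while iterators:
        still_open = []
        for it in iterators:
            block = list(islice(it, blocksize))
            out.extend(block)
            # A short block means the flow is exhausted (assumed handling).
            if len(block) == blocksize:
                still_open.append(it)
        iterators = still_open
    return out


flows = [[0, 1, 6, 7], [2, 3, 8, 9], [4, 5]]
restored = interleave(flows, blocksize=2)
# restored == [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Because the real component reads from one flow at a time in a fixed order, it has to wait for that flow even when others have data ready, which is how the deadlock risk noted above can arise.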
LAYOUT
1. What does a layout specify?
  1. The location of files
  2. The number and locations of the partitions of multifiles
  3. The number of, and the locations in which, the partitions of program components execute
2. Critical Concerns
  1. The Co>Operating System must be installed on the computers specified by the layout.
  2. The run host must be able to connect to the computers specified by the layout.
  3. The layout must allow enough space for the files the graph needs to write there.
  4. The permissions in the directories of the layout must allow the graph to write files there.
3. Who uses a layout?
  1. Intermediate File components
  2. Phases, checkpoints, and watchers
  3. Buffered flows
  4. Many program components, such as Sort