Dataflow Across Pipeline Stages#

Stage 3 Dataflow#

At the highest level we have the following in the context of pipeline stages 2-4:

@startuml
skinparam defaultTextAlignment center
!theme cyborg

' Configuration
left to right direction
skinparam DefaultFontSize 48
skinparam DefaultFontColor #black
skinparam stateFontStyle bold

state "2. Execute\nExperiments\n" as stage2 {
  state "Raw Output Data" as raw #lightcyan
}
state "3. Process\nExperiment\nOutputs" as stage3 {
  state "Processed Output Data" as proc #lightcyan
}
state "4. Generate\nProducts\n" as stage4 {
  state "Products " as products #lightcyan
}

raw --> proc
proc --> products
@enduml

The Raw Output Data files from experimental runs are processed during stage 3 into Processed Output Data files. In stage 4, those processed files are turned into products of various sorts. Each stage 4 product is sourced from a single data file, to encourage and enable reuse of code across projects. As such, it is the job of the active stage 3 plugins to make sure that all the data needed to generate a given product appears in the same file. The process of doing this is called Data Collation.

Important

Stage 3 operates at the level of Raw Output Data files and Experimental Runs, while stage 4 operates at the level of Collated Output Data files, Processed Output Data files and Experiments.

With that framing in mind, we can dive into the dataflow in detail.

Intra-Experiment Dataflow#

Within stage 3, the first type of data processing that occurs is intra-experiment data processing. Consider the data from stage 2 for a single Experimental Run $j$ from Experiment $i$ in a Batch Experiment, where each run produces $k$ raw output files. We could represent the output data abstractly as:

For intra-experiment data processing, all of the per-run outputs are matched across Experimental Runs within an Experiment, and processed in some way (e.g., generating statistical distributions). Crucially, the processing is done at the level of entire files (i.e., it is a file-level reduce operation). For example, if runs produce a foo.csv file, then every column in foo.csv will be present in the corresponding Processed Output Data files as well.

This can be visualized as follows:

Some examples of plugins performing this reduce operation:

Statistics Generation
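The file-level reduce described above can be sketched in a few lines. This is a minimal illustration only, not SIERRA's implementation: the function name `reduce_runs` and the in-memory table layout (a list of row dictionaries per run) are assumptions made for the example. It takes the same file from every Experimental Run and produces mean and standard deviation tables that keep every column of the input.

```python
import statistics


def reduce_runs(run_tables):
    """Reduce the same file from several runs into mean and stddev tables.

    run_tables: one table per Experimental Run (hypothetical layout); each
    table is a list of rows, each row a dict of column name -> float.
    """
    # Runs may differ in length; only rows present in every run are reduced.
    n_rows = min(len(table) for table in run_tables)
    columns = list(run_tables[0][0].keys())

    mean_table, stddev_table = [], []
    for r in range(n_rows):
        mean_row, stddev_row = {}, {}
        for col in columns:
            values = [table[r][col] for table in run_tables]
            mean_row[col] = statistics.fmean(values)
            stddev_row[col] = statistics.pstdev(values)
        mean_table.append(mean_row)
        stddev_table.append(stddev_row)
    return mean_table, stddev_table


# Two runs of a hypothetical foo.csv with columns "a" and "b".
runs = [
    [{"a": 1.0, "b": 2.0}, {"a": 3.0, "b": 4.0}],
    [{"a": 3.0, "b": 6.0}, {"a": 5.0, "b": 8.0}],
]
mean, stddev = reduce_runs(runs)
```

Because the reduction is applied to every column, the output tables have the same columns as the per-run inputs, which is the file-level property the text relies on.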

Inter-Experiment Dataflow#

Within stage 3, the second type of data processing that occurs is inter-experiment data processing. Consider again the data from stage 2 for a single Experimental Run $j$ from Experiment $i$ in a Batch Experiment, where each run produces $k$ raw output files. We could represent the output data abstractly as follows, using Experimental Run as SCOPE:

An important point here is that the SIERRA built-in stage 3 processing plugins do not process every raw output file in this manner, only those which will be used during stage 4 to produce something the user has asked for. Generally this means that there is a .yaml file somewhere in a Project which lists the Products the user wants to generate. This list is matched against the raw output files, and only matching files are processed. Thus, SIERRA avoids spending time on raw output files that no requested product uses.
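The matching step can be illustrated with a small sketch. The names here are hypothetical: `select_files_to_process` and the `src_stem` key are invented for the example and are not SIERRA's actual YAML schema or API. The idea is only that the set of requested products determines which raw output files get processed.

```python
from pathlib import PurePath


def select_files_to_process(raw_files, product_specs):
    """Keep only the raw output files that some requested product needs.

    product_specs: list of dicts, each with a hypothetical "src_stem" key
    naming the raw output file (without extension) the product is built from.
    """
    wanted = {spec["src_stem"] for spec in product_specs}
    return [f for f in raw_files if PurePath(f).stem in wanted]


raw_files = ["foo.csv", "bar.csv", "baz.csv"]
product_specs = [{"src_stem": "foo"}, {"src_stem": "baz"}]
selected = select_files_to_process(raw_files, product_specs)
```

In this example `bar.csv` is skipped because no product refers to it.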

Tip

Processed Output Data files can be thought of as time-series data at the level of Experimental Runs.

Some examples of plugins performing this reduce operation:

Intra-Experiment Data Collation

Stage 4 Dataflow#

At the highest level we have the following in the context of pipeline stages 3-5:

@startuml
skinparam defaultTextAlignment center
!theme cyborg

' Configuration
left to right direction
skinparam DefaultFontSize 48
skinparam DefaultFontColor #black
skinparam stateFontStyle bold

state "3. Process\nExperiment\nOutputs" as stage3 {
  state "Processed Experiment\nOutputs" as proc #lightcyan
}
state "4. Generate\nProducts\n" as stage4 {
  state "Products" as products #lightcyan {
    state "Intra-experiment Products" as intra_prod #lightcoral
    state "Inter-experiment Products" as inter_prod #lightcoral
  }
}
state "5. Compare\nProducts\n" as stage5 {
  state "Inter-batch Products" as inter_batch #lightcyan
}

stage3 --> stage4
stage4 --> stage5
@enduml

After Stage 3 Dataflow, data is in Processed Output Data files and/or Collated Output Data files. In stage 4, the Processed Output Data files can be taken and directly converted to products along one of two paths using appropriate plugins:

Intra-experiment products such as graphs and videos, which are built from a single processed output data file.
Inter-experiment products such as graphs, which are built by joining together identical sections/slices of the processed output data files for a single experiment.

As in the stage 3 dataflow, processing in stage 4 is generally done at the level of entire files.

Intra-Experiment Dataflow#

@startuml
skinparam defaultTextAlignment center
!theme cyborg

' Configuration
left to right direction
skinparam DefaultFontSize 48
skinparam DefaultFontColor #black
skinparam stateFontStyle bold

state "Processed Experiment\nOutputs" as proc #lightcyan {
  state "file 0" as filep0 #darkturquoise
  state "file 1" as filep1 #limegreen
  state "..." as filepx #green
  state "file k" as filepk #lightseagreen
  filepx -[hidden]r-> filepk
  filep1 -[hidden]r-> filepx
  filep0 -[hidden]r-> filep1
}
state "Intra-Experiment\nProducts" as prod #lightcyan {
  state "product 0" as productp0 #darkturquoise
  state "product 1" as productp1 #limegreen
  state "..." as productpx #green
  state "product k" as productpk #lightseagreen
  productpx -[hidden]r-> productpk
  productp1 -[hidden]r-> productpx
  productp0 -[hidden]r-> productp1
}

filep0 --> productp0
filep1 --> productp1
filepk --> productpk
filepx --> productpx
@enduml

There isn't really any dataflow for intra-experiment products, because there is a 1:1 mapping between the Processed Output Data file and the Product: all the data needed to generate a given product is within a single file.

Inter-Experiment Dataflow#

Inter-experiment processing in stage 4 is Data Collation, but this time at the level of Experiments rather than Experimental Runs:

This process in stage 4 can be visualized as follows for a single Batch Experiment, using Experiment as SCOPE. Input files in this case are Processed Output Data, and output files are Collated Output Data at the experiment level. Each output file is a summary of a batch experiment along some axis of interest.

Once processed, products can be generated directly from the inter-experiment files with a 1:1 mapping as above.
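Collation at the level of Experiments can be sketched as follows. This is an illustrative sketch under assumptions, not SIERRA's code: `collate_column` and the in-memory layout (one list of row dictionaries per Experiment) are hypothetical. One column of interest is pulled out of the same Processed Output Data file in every Experiment and placed side by side in a single table, so that a product can later be generated from that one table alone.

```python
def collate_column(per_experiment_tables, column):
    """Gather one column from every experiment into a single table.

    per_experiment_tables: dict of experiment name -> list of row dicts
    (hypothetical layout for the same processed file in each experiment).
    Returns a list of rows with one column per experiment.
    """
    names = list(per_experiment_tables)
    # Only rows present in every experiment are collated.
    n_rows = min(len(per_experiment_tables[name]) for name in names)
    return [
        {name: per_experiment_tables[name][r][column] for name in names}
        for r in range(n_rows)
    ]


processed = {
    "exp0": [{"x": 1, "y": 2}, {"x": 3, "y": 4}],
    "exp1": [{"x": 5, "y": 6}, {"x": 7, "y": 8}],
}
collated = collate_column(processed, "x")
```

The resulting table has one column per Experiment, which is one way to read "a summary of a batch experiment along some axis of interest".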

Stage 5 Inter-Batch Dataflow#

After Stage 4 Dataflow, data is in Processed Output Data files and/or Collated Output Data files. In stage 5, the Collated Output Data files can be taken and further collated to create Inter-Batch Data files. The dataflow for this can be visualized as follows, with Batch Experiment as SCOPE.
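As a rough sketch of this further collation, assume each Batch Experiment has already produced a collated table with one column per Experiment. The function `collate_batches`, the table layout, and the `batch/experiment` column naming are all assumptions made for illustration and are not taken from SIERRA.

```python
def collate_batches(collated_by_batch):
    """Merge per-batch collated tables into one inter-batch table.

    collated_by_batch: dict of batch name -> list of row dicts, where each
    row maps experiment name -> value (hypothetical layout).
    Columns in the result are named "<batch>/<experiment>".
    """
    batches = list(collated_by_batch)
    # Only rows present in every batch are merged.
    n_rows = min(len(collated_by_batch[b]) for b in batches)
    merged = []
    for r in range(n_rows):
        row = {}
        for batch in batches:
            for exp, value in collated_by_batch[batch][r].items():
                row[f"{batch}/{exp}"] = value
        merged.append(row)
    return merged


inter_batch = collate_batches({
    "batchA": [{"exp0": 1, "exp1": 5}],
    "batchB": [{"exp0": 2, "exp1": 6}],
})
```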
