Dataflow Across Pipeline Stages#
Stage 3 Dataflow#
At the highest level we have the following in the context of pipeline stages 2-4:
The Raw Output Data files from experimental runs are processed during stage 3 into Processed Output Data files. In stage 4, those processed files are turned into products of various sorts. Every stage 4 product is sourced from a single data file, to encourage and enable reuse of code across projects. It is therefore the job of the active stage 3 plugins to make sure that all the data needed to generate a given product appears in the same file. This process is called Data Collation.
Important
Stage 3 operates at the level of Raw Output Data files and Experimental Runs, while stage 4 operates at the level of Collated Output Data files, Processed Output Data files and Experiments.
With that framing in mind, we can dive into the dataflow in detail.
Intra-Experiment Dataflow#
Within stage 3, the first type of data processing that occurs is intra-experiment data processing. If we look at the data from stage 2 for a single Experimental Run \(j\) from Experiment \(i\) in a Batch Experiment, where the run produces \(k\) raw output files, we could represent the output data abstractly as:
For intra-experiment data processing, all of the per-run outputs are matched
across Experimental Runs within an
Experiment, and processed in some way (e.g., generating
statistical distributions). Crucially, the processing
is done at the level of entire files (i.e., it is a file-level reduce
operation). For example, if runs produce a foo.csv file, then every column
in foo.csv will be present in the corresponding Processed Output
Data files as well.
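As a rough illustration of a file-level reduce, the sketch below averages every column of a same-named CSV file across runs, row by row. It is not SIERRA's implementation: the function name `reduce_runs`, the in-memory CSV strings, and the choice of the mean as the reduction are all assumptions made for the example.

```python
import csv
import io
import statistics


def reduce_runs(run_csvs):
    """Average every column of a same-named CSV across runs, row by row.

    run_csvs: list of CSV texts, one per Experimental Run (hypothetical input).
    Returns (header, rows), where each row holds the per-column means.
    """
    tables = [list(csv.reader(io.StringIO(text))) for text in run_csvs]
    header = tables[0][0]

    # A file-level reduce assumes every run produced the same columns.
    if any(table[0] != header for table in tables):
        raise ValueError("runs do not share the same columns")

    # Only reduce over the rows that every run has.
    n_rows = min(len(table) for table in tables) - 1
    rows = []
    for r in range(1, n_rows + 1):
        rows.append([
            statistics.mean(float(table[r][c]) for table in tables)
            for c in range(len(header))
        ])
    return header, rows


# Two hypothetical runs that each produced a foo.csv with identical columns.
run0 = "time,speed\n0,1.0\n1,3.0\n"
run1 = "time,speed\n0,2.0\n1,5.0\n"
header, mean_rows = reduce_runs([run0, run1])
```

Note that every column of the input (`time` and `speed`) is carried through to the reduced result, which is the property the paragraph above describes.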
This can be visualized as follows:
Some examples of plugins performing this reduce operation:
Inter-Experiment Dataflow#
Within stage 3, the second type of data processing that occurs is inter-experiment data processing. If we look at the data from stage 2 for a single Experimental Run \(j\) from Experiment \(i\) in a Batch Experiment, where the run produces \(k\) raw output files, we could represent the output data abstractly as follows, using Experimental Run as SCOPE:
An important point here is that, within the SIERRA builtin stage 3 processing
plugins, not all raw output files get processed in this manner: only those
which are going to be used during stage 4 to produce something the user has
specified. Generally this means that there is a .yaml file somewhere in a
Project which lists the Products that a user wants to generate. This list
is matched against the raw output files, and only matching files are
processed. As a result, SIERRA does not spend time processing files that no
product needs.
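The matching step described above might look something like the following sketch. The matching rule used here (comparing a file's stem to a list of names) and the names `select_files_to_process` and `product_sources` are hypothetical; the actual plugins may match differently.

```python
from pathlib import PurePath


def select_files_to_process(raw_files, product_sources):
    """Keep only raw output files that some requested product is built from.

    ASSUMPTION: a file matches when its stem (name without extension)
    appears in product_sources. This rule is illustrative only.
    """
    wanted = set(product_sources)
    return [f for f in raw_files if PurePath(f).stem in wanted]


# Hypothetical raw outputs of a run, and hypothetical names taken from
# the product list in a project's .yaml file.
raw_files = ["foo.csv", "bar.csv", "debug.csv"]
product_sources = ["foo", "bar"]

selected = select_files_to_process(raw_files, product_sources)
```

Here `debug.csv` is skipped because no requested product refers to it.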
Tip
Processed Output Data files can be thought of as time-series data at the level of Experimental Runs.
Some examples of plugins performing this reduce operation:
Stage 4 Dataflow#
At the highest level we have the following in the context of pipeline stages 3-5:
After Stage 3 Dataflow, data is in Processed Output Data files and/or Collated Output Data files. In stage 4, the Processed Output Data files can be taken and directly converted to products along one of two paths using appropriate plugins:
Intra-experiment products such as graphs and videos, which are built from a single processed output data file.
Inter-experiment products such as graphs, which are built by joining together identical sections/slices of the processed output data files for a single experiment.
As in the stage 3 dataflow, processing in stage 4 generally operates at the level of whole files.
Intra-Experiment Dataflow#
There isn't really any dataflow for intra-experiment products, because there is a 1:1 mapping between the Processed Output Data file and the Product: all the data needed to generate a given product is within a single file.
Inter-Experiment Dataflow#
Inter-experiment processing in stage 4 is Data Collation, but this time at the level of Experiments rather than Experimental Runs:
This process in stage 4 can be visualized as follows for a single Batch Experiment, using Experiment as SCOPE. Input files in this case are Processed Output Data files, and output files are Collated Output Data files at the experiment level. Each output file is a summary of a batch experiment along some axis of interest.
Once processed, products can be generated directly from the inter-experiment files with a 1:1 mapping, as above.
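To make experiment-level collation concrete, the sketch below pulls the same column out of each experiment's processed file and gathers the slices into a single structure, one entry per experiment. The function `collate_column`, the experiment names, and the in-memory CSV strings are assumptions for illustration, not SIERRA's API.

```python
import csv
import io


def collate_column(processed_csvs, column):
    """Gather one column from each experiment's processed file.

    processed_csvs: dict mapping a (hypothetical) experiment name to the
    CSV text of its Processed Output Data file.
    Returns a dict mapping experiment name -> list of values from `column`.
    """
    collated = {}
    for exp_name, text in processed_csvs.items():
        reader = csv.DictReader(io.StringIO(text))
        collated[exp_name] = [float(row[column]) for row in reader]
    return collated


# Hypothetical processed files for two experiments in one batch experiment.
processed = {
    "exp0": "time,speed\n0,1.5\n1,4.0\n",
    "exp1": "time,speed\n0,2.5\n1,6.0\n",
}

collated = collate_column(processed, "speed")
```

The resulting structure holds everything needed for a single inter-experiment product (for example, one line per experiment on a graph) in one place, which is what allows the 1:1 mapping from file to product.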
Stage 5 Inter-Batch Dataflow#
After Stage 4 Dataflow, data is in Processed Output Data files and/or Collated Output Data files. In stage 5, the Collated Output Data files can be taken and further collated to create Inter-Batch Data files. The dataflow for this can be visualized as follows, with Batch Experiment as SCOPE.
Each output file is a summary of a set of batch experiments along some axis of interest.