..
   Copyright 2025 John Harwell, All rights reserved.

   SPDX-License-Identifier: MIT

.. _concepts/dataflow:

===============================
Dataflow Across Pipeline Stages
===============================

.. _concepts/dataflow/stage3:

Stage 3 Dataflow
================

At the highest level we have the following in the context of pipeline stages
2-4:

.. plantuml::

   skinparam defaultTextAlignment center
   !theme cyborg

   ' Configuration
   left to right direction
   skinparam DefaultFontSize 48
   skinparam DefaultFontColor #black
   skinparam stateFontStyle bold

   state "2. Execute\nExperiments\n" as stage2 {
     state "Raw Output Data" as raw #lightcyan
   }
   state "3. Process\nExperiment\nOutputs" as stage3 {
     state "Processed Output Data" as proc #lightcyan
   }
   state "4. Generate\nProducts\n" as stage4 {
     state "Products " as products #lightcyan
   }

   raw --> proc
   proc --> products

The :term:`Raw Output Data` files from experimental runs are processed during
stage 3 into :term:`Processed Output Data` files. In stage 4 those processed
files are turned into :term:`products <Product>` of various sorts. All stage 4
products are sourced from a *single* data file, to encourage and enable
reusability of code across projects. As such, it is the job of active stage 3
plugins to make sure all the data needed to generate a given product appear in
the same file. The process of doing this is called :term:`Data Collation`.

.. IMPORTANT:: Stage 3 operates at the level of :term:`Raw Output Data` files
               and :term:`Experimental Runs <Experimental Run>`, while stage 4
               operates at the level of :term:`Collated Output Data` files,
               :term:`Processed Output Data` files and :term:`Experiments
               <Experiment>`.

With that framing in mind, we can dive into the dataflow in detail.

Intra-Experiment Dataflow
-------------------------

Within stage 3 the first type of data processing that occurs is
*intra*-experiment data processing.
If we look at the data from stage 2 for a single :term:`Experimental Run`
:math:`j` from :term:`Experiment` :math:`i` in a :term:`Batch Experiment`,
where the run produces :math:`k` raw output files, we can represent the output
data abstractly as:

.. plantuml::

   skinparam defaultTextAlignment center
   !theme cyborg

   ' Configuration
   left to right direction
   skinparam DefaultFontSize 24
   skinparam DefaultFontColor #black
   skinparam stateFontStyle bold

   state "run j" as runj #skyblue {
     state "file 0" as filej0 #darkturquoise
     state "file 1" as filej1 #limegreen
     state "..." as filejx #green
     state "file k" as filejk #lightseagreen

     filej0 -[hidden]r-> filej1
     filej0 -[hidden]d-> filejx
     filej1 -[hidden]d-> filejk
     filejx -[hidden]r-> filejk
   }

For intra-experiment data processing, all of the per-run outputs are matched
across :term:`Experimental Runs <Experimental Run>` within an
:term:`Experiment`, and processed in some way (e.g., :ref:`generating
statistical distributions `). Crucially, the processing is done at the level
of *entire files* (i.e., it is a file-level reduce operation). For example, if
runs produce a ``foo.csv`` file, then every column in ``foo.csv`` will be
present in the corresponding :term:`Processed Output Data` files as well. This
can be visualized as follows:

.. plantuml::

   skinparam defaultTextAlignment center
   !theme cyborg

   ' Configuration
   skinparam DefaultFontSize 48
   skinparam DefaultFontColor #black
   skinparam stateBorderThickness 8
   skinparam stateFontStyle bold

   state "run 0" as run0 #skyblue {
     state "file 0" as file00 #darkturquoise
     state "file 1" as file01 #limegreen
     state "..." as file0x #green
     state "file k" as file0k #lightseagreen

     file00 -[hidden]r-> file01
     file00 -[hidden]d-> file0x
     file01 -[hidden]d-> file0k
     file0x -[hidden]r-> file0k
   }

   state "run 1" as run1 #skyblue {
     state "file 0" as file10 #darkturquoise
     state "file 1" as file11 #limegreen
     state "..." as file1x #green
     state "file k" as file1k #lightseagreen

     file10 -[hidden]r-> file11
     file10 -[hidden]d-> file1x
     file11 -[hidden]d-> file1k
     file1x -[hidden]r-> file1k
   }

   state "..." as runx #skyblue

   state "run j" as runj #skyblue {
     state "file 0" as filej0 #darkturquoise
     state "file 1" as filej1 #limegreen
     state "..." as filejx #green
     state "file k" as filejk #lightseagreen

     filej0 -[hidden]r-> filej1
     filej0 -[hidden]d-> filejx
     filej1 -[hidden]d-> filejk
     filejx -[hidden]r-> filejk
   }

   state "Processed outputs" as intra #skyblue {
     state "file 0" as filep0 #darkturquoise
     state "file 1" as filep1 #limegreen
     state "..." as filepx #green
     state "file k" as filepk #lightseagreen

     filep0 -[hidden]r-> filep1
     filep1 -[hidden]r-> filepx
     filepx -[hidden]r-> filepk
   }

   run0 -[hidden]r-> run1
   run1 -[hidden]r-> runx
   runx -[hidden]r-> runj

   run1 -d-> intra
   run0 -d-> intra
   runx -d-> intra
   runj -d-> intra

Some examples of plugins performing this reduce operation:

- :ref:`plugins/proc/statistics`

Inter-Experiment Dataflow
-------------------------

Within stage 3 the second type of data processing that occurs is
*inter*-experiment data processing. If we look at the data from stage 2 for a
single :term:`Experimental Run` :math:`j` from :term:`Experiment` :math:`i` in
a :term:`Batch Experiment`, where the run produces :math:`k` raw output files,
we can represent the processing abstractly as follows, using
:term:`Experimental Run` as SCOPE:

.. figure:: /figures/data-collation.png

Importantly, the built-in SIERRA stage 3 processing plugins do not process
every raw output file in this manner, only those which will be used during
stage 4 to generate something the user has asked for. Generally this means
that there is a ``.yaml`` file somewhere in a :term:`Project` which lists the
:term:`Products <Product>` a user wants to generate. This list is matched
against the raw output files, and only matching files are processed.
Thus, SIERRA does not spend time processing raw output files which no
requested product depends on.

.. TIP:: :term:`Processed Output Data` files can be thought of as time-series
         data at the level of :term:`Experimental Runs <Experimental Run>`.

Some examples of plugins performing this reduce operation:

- :ref:`plugins/proc/collate`

.. _concepts/dataflow/stage4:

Stage 4 Dataflow
================

At the highest level we have the following in the context of pipeline stages
3-5:

.. plantuml::

   skinparam defaultTextAlignment center
   !theme cyborg

   ' Configuration
   left to right direction
   skinparam DefaultFontSize 48
   skinparam DefaultFontColor #black
   skinparam stateFontStyle bold

   state "3. Process\nExperiment\nOutputs" as stage3 {
     state "Processed Experiment\nOutputs" as proc #lightcyan
   }
   state "4. Generate\nProducts\n" as stage4 {
     state "Products" as products #lightcyan {
       state "Intra-experiment Products" as intra_prod #lightcoral
       state "Inter-experiment Products" as inter_prod #lightcoral
     }
   }
   state "5. Compare\nProducts\n" as stage5 {
     state "Inter-batch Products" as inter_batch #lightcyan
   }

   stage3 --> stage4
   stage4 --> stage5

After :ref:`concepts/dataflow/stage3`, data is in :term:`Processed Output Data`
files and/or :term:`Collated Output Data` files. In stage 4, the
:term:`Processed Output Data` files can be taken and directly converted to
products along one of two paths using appropriate plugins:

- Intra-experiment products such as graphs and videos, which are built from a
  single processed output data file.

- Inter-experiment products such as graphs, which are built by joining together
  identical sections/slices of the processed output data files for a single
  experiment.

As with the stage 3 dataflow, stage 4 generally operates at the file level.

Intra-Experiment Dataflow
-------------------------

.. plantuml::

   skinparam defaultTextAlignment center
   !theme cyborg

   ' Configuration
   left to right direction
   skinparam DefaultFontSize 48
   skinparam DefaultFontColor #black
   skinparam stateFontStyle bold

   state "Processed Experiment\nOutputs" as proc #lightcyan {
     state "file 0" as filep0 #darkturquoise
     state "file 1" as filep1 #limegreen
     state "..." as filepx #green
     state "file k" as filepk #lightseagreen

     filepx -[hidden]r-> filepk
     filep1 -[hidden]r-> filepx
     filep0 -[hidden]r-> filep1
   }

   state "Intra-Experiment\nProducts" as prod #lightcyan {
     state "product 0" as productp0 #darkturquoise
     state "product 1" as productp1 #limegreen
     state "..." as productpx #green
     state "product k" as productpk #lightseagreen

     productpx -[hidden]r-> productpk
     productp1 -[hidden]r-> productpx
     productp0 -[hidden]r-> productp1
   }

   filep0 --> productp0
   filep1 --> productp1
   filepk --> productpk
   filepx --> productpx

There isn't really any dataflow for intra-experiment products, because there is
a 1:1 mapping between the :term:`Processed Output Data` file and the
:term:`Product`: all the data needed to generate a given product is within a
single file.

Inter-Experiment Dataflow
-------------------------

Inter-experiment processing in stage 4 is :term:`Data Collation`, but this time
at the level of :term:`Experiments <Experiment>` rather than
:term:`Experimental Runs <Experimental Run>`. This process can be visualized as
follows for a single :term:`Batch Experiment`, using :term:`Experiment` as
SCOPE. Input files in this case are :term:`Processed Output Data`, and output
files are :term:`Collated Output Data` at the experiment level. Each output
file is a summary of a batch experiment along some axis of interest.

.. figure:: /figures/data-collation.png

Once processed, products can be generated directly from the inter-experiment
files with a 1:1 mapping as above.

.. _concepts/dataflow/stage5:

Stage 5 Inter-Batch Dataflow
============================

After :ref:`concepts/dataflow/stage4`, data is in :term:`Processed Output Data`
files and/or :term:`Collated Output Data` files. In stage 5, the
:term:`Collated Output Data` files can be taken and further collated to create
:term:`Inter-Batch Data` files. The dataflow for this can be visualized as
follows, with :term:`Batch Experiment` as SCOPE.

.. figure:: /figures/data-collation.png

Each output file is a summary of a set of batch experiments along some axis of
interest.
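
The collation pattern used with a run, an experiment, or a batch experiment as
SCOPE can be sketched in a few lines of Python. This is only an illustration
and not SIERRA's implementation: it assumes that collation means pulling the
same named column out of each SCOPE unit's data and placing those columns side
by side in one table, so that a product can be generated from a single file.
The ``collate()`` function, the ``Table`` type, and the ``throughput`` and
``latency`` column names are all hypothetical.

```python
from typing import Dict, List

# Hypothetical stand-in for a data file: column name -> list of values.
Table = Dict[str, List[float]]


def collate(tables: Dict[str, Table], column: str) -> Table:
    """Gather ``column`` from every SCOPE unit into a single table.

    ``tables`` maps the name of each SCOPE unit (a run, an experiment, or a
    batch experiment) to its data. The result has one column per SCOPE unit,
    named after that unit. This is an assumed reading of data collation.
    """
    collated: Table = {}
    for scope_name, table in tables.items():
        if column not in table:
            raise KeyError(f"{scope_name!r} has no column {column!r}")
        # Copy so the collated table does not alias the input data.
        collated[scope_name] = list(table[column])
    return collated


# Two batch experiments, each summarized by a small table.
batches = {
    "batch0": {"throughput": [1.0, 2.0, 3.0], "latency": [9.0, 8.0, 7.0]},
    "batch1": {"throughput": [2.0, 4.0, 6.0], "latency": [5.0, 4.0, 3.0]},
}

# One output table for the axis of interest, with one column per batch.
inter_batch = collate(batches, "throughput")
print(inter_batch)
```

Under these assumptions the same function applies at every level: only the
meaning of the keys in ``tables`` changes with the choice of SCOPE.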