Statistics Generation#

When doing Monte Carlo simulations, or dealing with any sort of Engine or Project which contains randomness, data analysis at the ensemble level is required. This plugin supports such analysis by automatically computing statistics, e.g., to enable plotting 95% confidence intervals on graph products in stage 4.

This plugin processes data at the file level for each Experimental Run. All Raw Output Data files produced by each run are gathered, statistics are calculated, and the results are written out as described in the Runtime Directory Tree.

This plugin requires that the selected storage plugin supports pd.DataFrame objects.

When run:

  • Floating point numeric data is rounded to 8 decimals.

  • Integer data is not rounded.

  • Categorical data is "averaged" via mode(). Thus, only df-stats=mean is supported for categorical data.
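
The rules above can be illustrated with a small pandas sketch. This is not the plugin's implementation; the function name `ensemble_mean` and the sample columns are hypothetical, and leaving the mean of integer columns unrounded is only one possible reading of "not rounded".

```python
import pandas as pd


def ensemble_mean(run_frames):
    """Combine same-shaped per-run dataframes into one 'mean' dataframe.

    Hypothetical helper for illustration only (not part of SIERRA).
    """
    # Stack all runs; index level 0 is the run number, level 1 the original row.
    stacked = pd.concat(run_frames, keys=range(len(run_frames)))
    by_row = stacked.groupby(level=1)

    out = {}
    for col in stacked.columns:
        if pd.api.types.is_float_dtype(stacked[col]):
            # Floating point data: average, then round to 8 decimals.
            out[col] = by_row[col].mean().round(8)
        elif pd.api.types.is_integer_dtype(stacked[col]):
            # Integer data: averaged but NOT rounded (assumed interpretation).
            out[col] = by_row[col].mean()
        else:
            # Categorical data: "averaged" by taking the most frequent value.
            out[col] = by_row[col].agg(lambda s: s.mode().iloc[0])
    return pd.DataFrame(out)


runs = [
    pd.DataFrame({"x": [0.1234567891, 1.0], "n": [1, 2], "state": ["a", "b"]}),
    pd.DataFrame({"x": [0.1234567893, 2.0], "n": [2, 2], "state": ["a", "c"]}),
    pd.DataFrame({"x": [0.1234567895, 3.0], "n": [2, 5], "state": ["b", "c"]}),
]
result = ensemble_mean(runs)
print(result)
```

When several categories are tied, `mode()` returns them sorted and this sketch takes the first one; how the plugin breaks ties is not stated here.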

Note

This plugin is not intended for use with projects whose output is deterministic. That is, if you always use --n-runs=1 because your code doesn't have any randomness/produces deterministic output, then you should consider using Pseudo-Statistics instead of this plugin.

Ordering Considerations#

Data Decompression should precede this plugin in the --proc chain if you previously compressed the data.

Usage#

This plugin can be selected by adding proc.statistics to the list passed to --proc. When active, it will create <batchroot>/statistics, and all statistics generated during stage 3 will accrue under this root directory. Each experiment will get its own directory in this root for its statistics. E.g.:

|-- <batchroot>
    |-- statistics
        |-- c1-exp0
        |-- c1-exp1
        |-- c1-exp2
        |-- c1-exp3
        |-- exec

exec/ contains automatically generated statistics about SIERRA runtime w.r.t. each experiment. These are useful for capturing the runtime of specific experiments to better plan/schedule time on HPC clusters. They are not currently in dataframe/.csv format, though that might change in the future.
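
As a rough illustration of the layout above, the following sketch lists the per-experiment statistics directories under a batch root while skipping exec/. The helper name `experiment_stat_dirs` and the path in the usage comment are hypothetical; only the `statistics` and `exec` directory names come from the tree shown above.

```python
from pathlib import Path


def experiment_stat_dirs(batchroot):
    """Return per-experiment statistics directories, sorted by name.

    Hypothetical helper for illustration only (not part of SIERRA).
    """
    stats_root = Path(batchroot) / "statistics"
    return sorted(
        p for p in stats_root.iterdir()
        # exec/ holds SIERRA runtime statistics, not experiment statistics.
        if p.is_dir() and p.name != "exec"
    )


# Usage (the path is a placeholder):
# for d in experiment_stat_dirs("/path/to/batchroot"):
#     print(d.name)
```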

Cmdline Interface#

sierra - CLI interface#

sierra [--dist-stats {none,all,conf95,bw}]

sierra Multi-stage options#

Options which are used in multiple pipeline stages

  • --dist-stats DIST_STATS -

    Specify what kinds of statistics, if any, should be calculated on the distribution of experimental data during stage 3 for inclusion on graphs during stage 4:

    • none - Only calculate and show raw mean on graphs.

    • conf95 - Calculate standard deviation of experimental distribution and show 95% confidence interval on relevant graphs.

    • bw - Calculate statistics necessary to show box and whisker plots around each point in the graph (Summary Line graphs only).

    • all - Generate all possible statistics, and plot all possible statistics on graphs.

    (default: none)
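
To make the conf95 and bw options more concrete, here is a minimal sketch of the kind of quantities involved. It assumes a normal approximation (mean ± 1.96 standard errors) for the 95% interval and the conventional 1.5×IQR fences for box and whisker plots; the exact formulas SIERRA uses are not stated here, and the function and column names are hypothetical.

```python
import numpy as np
import pandas as pd


def conf95_band(samples):
    """Lower/upper bounds of an approximate 95% confidence interval.

    `samples` has one row per experimental run and one column per point on
    the graph. Uses the normal approximation (assumption, not SIERRA's
    documented formula).
    """
    mean = samples.mean(axis=0)
    std = samples.std(axis=0, ddof=1)  # sample standard deviation
    half_width = 1.96 * std / np.sqrt(len(samples))
    return mean - half_width, mean + half_width


def box_whisker_stats(samples):
    """Quartiles and 1.5*IQR fences for each column of `samples`."""
    q1 = samples.quantile(0.25)
    median = samples.quantile(0.5)
    q3 = samples.quantile(0.75)
    iqr = q3 - q1
    return pd.DataFrame({
        "q1": q1,
        "median": median,
        "q3": q3,
        "lo_fence": q1 - 1.5 * iqr,
        "hi_fence": q3 + 1.5 * iqr,
    })


samples = pd.DataFrame({"t0": [1.0, 2.0, 3.0, 4.0, 5.0]})
lo, hi = conf95_band(samples)
bw = box_whisker_stats(samples)
print(lo["t0"], hi["t0"])
print(bw)
```

With only a handful of runs, a Student's t multiplier would give a wider and more appropriate interval than 1.96; the normal approximation is used here only to keep the sketch short.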

Configuration#

This plugin reads graphs.yaml for intra- and inter-experiment graphs. If either is present, then it only gathers and processes data for the selected graphs. If graphs.yaml is missing or doesn't contain specs for those graph types, then all output data files are gathered and processed. This can take a looooonnngggg time, depending on the amount of data produced and the filesystem speed.