Statistics Generation#
When running Monte Carlo simulations, or working with any Engine or Project that contains randomness, data analysis at the ensemble level is required. This plugin supports such analysis by automatically computing statistics, e.g., to enable plotting 95% confidence intervals on graph products in stage 4.
This plugin processes at the file level for each Experimental Run. All Raw Output Data files produced by each run are gathered and statistics calculated, and the results written out as described in the Runtime Directory Tree.
This plugin requires that the selected storage plugin
supports pd.DataFrame objects.
When run:
Floating point numeric data is rounded to 8 decimals.
Integer data is not rounded.
Categorical data is "averaged" via
mode(). Thus, only df-stats=mean is supported for categorical data.
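The rules above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the plugin's actual code: the plugin itself works on pd.DataFrame objects, while the sketch uses plain dicts of lists, and the names ensemble_average and average_cell are hypothetical. It also assumes that "not rounded" for integer data means the mean is computed and left unrounded.

```python
from statistics import fmean, mode


def average_cell(values):
    """Combine the values found at one (row, column) position across runs."""
    first = values[0]
    if isinstance(first, float):
        # Floating point data: mean, rounded to 8 decimals.
        return round(fmean(values), 8)
    if isinstance(first, int):
        # Integer data: mean, left unrounded (assumption, see lead-in).
        return fmean(values)
    # Categorical data: "averaged" by taking the most common value.
    return mode(values)


def ensemble_average(runs):
    """Average one output table across runs, column by column.

    `runs` holds one dict per Experimental Run, mapping a column name to
    that run's list of values. All runs are assumed to have the same shape.
    """
    averaged = {}
    for column in runs[0]:
        # zip(*...) groups the values at the same row across all runs.
        rows = zip(*(run[column] for run in runs))
        averaged[column] = [average_cell(values) for values in rows]
    return averaged


# Three hypothetical runs of the same experiment, two rows each.
runs = [
    {"speed": [0.1, 0.2], "collisions": [1, 4], "state": ["idle", "moving"]},
    {"speed": [0.2, 0.4], "collisions": [2, 4], "state": ["idle", "moving"]},
    {"speed": [0.4, 0.3], "collisions": [2, 5], "state": ["moving", "moving"]},
]
averaged = ensemble_average(runs)
```

Because mode() is the only aggregation that makes sense for a categorical column, there is no standard deviation or quartile to compute for it, which is why only the mean-style output applies to such data.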
Note
This plugin is not intended for use with projects whose output is
deterministic. That is, if you always use --n-runs=1 because your
code doesn't have any randomness/produces deterministic output, then
you should consider using Pseudo-Statistics instead of
this plugin.
Ordering Considerations#
Data Decompression should precede this plugin in the --proc
chain if you previously compressed the data.
Usage#
This plugin can be selected by adding proc.statistics to the list passed to
--proc. When active it will create <batchroot>/statistics, and all
statistics generated during stage 3 will accrue under this root directory. Each
experiment will get its own directory in this root for its
statistics. E.g.:
|-- <batchroot>
    |-- statistics
        |-- c1-exp0
        |-- c1-exp1
        |-- c1-exp2
        |-- c1-exp3
        |-- exec
exec/ contains automatically generated statistics about SIERRA runtime w.r.t. each
experiment. These are useful for capturing the runtime of specific experiments
to better plan/schedule time on HPC clusters. They are not currently in
dataframe/.csv format, though that might change in the future.
Cmdline Interface#
sierra - CLI interface#
sierra [--dist-stats {none,all,conf95,bw}]
sierra Multi-stage options#
Options which are used in multiple pipeline stages
--dist-stats DIST_STATS - Specify what kinds of statistics, if any, should be calculated on the distribution of experimental data during stage 3 for inclusion on graphs during stage 4 (default: none):
none - Only calculate and show raw mean on graphs.
conf95 - Calculate standard deviation of experimental distribution and show 95% confidence interval on relevant graphs.
bw - Calculate statistics necessary to show box and whisker plots around each point in the graph (Summary Line graphs only).
all - Generate all possible statistics, and plot all possible statistics on graphs.
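To make the conf95 and bw choices concrete, here is a minimal sketch of the quantities each one needs for a single graph point, computed from that point's values across runs. The source does not state the exact formulas the plugin uses, so everything here is an assumption: the interval uses the common normal approximation (mean ± 1.96 · s / √n), the whiskers are placed at the min and max, and the function names are hypothetical.

```python
from math import sqrt
from statistics import fmean, quantiles, stdev


def conf95_interval(samples):
    """Return (low, high) bounds of a 95% confidence interval on the mean.

    Assumes a normal approximation; the plugin may compute this differently.
    """
    mean = fmean(samples)
    # Sample standard deviation scaled by sqrt(n) gives the standard error.
    half_width = 1.96 * stdev(samples) / sqrt(len(samples))
    return mean - half_width, mean + half_width


def box_whisker_stats(samples):
    """Return the five numbers needed to draw a box and whisker plot."""
    # n=4 yields the three quartile cut points; "inclusive" treats the
    # samples as the whole population of runs rather than a sample of it.
    q1, median, q3 = quantiles(samples, n=4, method="inclusive")
    return {
        "min": min(samples),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(samples),
    }


# Values of one metric at one graph point, one per run (made-up data).
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
low, high = conf95_interval(samples)
box = box_whisker_stats(samples)
```

Both functions need at least two runs to produce a result, which is consistent with the note above that this plugin is not meant for --n-runs=1 projects.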
Configuration#
This plugin reads graphs.yaml for intra- and inter-experiment graphs. If
either is present, then it gathers and processes data only for the selected
graphs. If graphs.yaml is missing or doesn't contain specs for those graph
types, then all output data files are gathered and processed. This can take a
very long time, depending on the amount of data produced and the filesystem
speed.