HPC Execution Environment Plugins

SIERRA is capable of adapting its runtime infrastructure to a number of different HPC environments so that experiments can be run efficiently on whatever computational resources a researcher has access to. Supported environments that come with SIERRA are listed on this page.

These plugins have been tested with the following platforms (they may also work with other platforms out of the box):

SIERRA makes the following assumptions about the HPC environments corresponding to the plugins listed on this page:

HPC Environment Assumptions

  • Assumption: All nodes allocated to SIERRA have the same # of cores. This may be less than the total # of cores available on each compute node, which can happen if the HPC environment allows node sharing and the job SIERRA runs in is allocated fewer than the total # of cores on a given node.

    Rationale: Simplicity. If allocated nodes had different core counts, SIERRA would have to do more of the work of an HPC scheduler and match jobs to nodes. This may be an avenue for future improvement.

  • Assumption: All nodes have a shared filesystem.

    Rationale: This is a standard feature of HPC environments. If for some reason it is not true, stage 2 outputs will have to be placed manually so that it is as if everything ran on a common filesystem before running any later stages.

Local HPC Plugin

This HPC environment can be selected via --exec-env=hpc.local.

This is the default HPC environment, in which SIERRA runs all experiments on the computer from which it was launched, using GNU parallel. The # of simultaneous simulations is determined by:

# cores on machine / # threads per experimental run

If more simulations are requested than can be run in parallel, SIERRA will start additional simulations as currently running simulations finish.

No additional configuration/environment variables are needed with this HPC environment for use with SIERRA.

ARGoS Considerations

The # threads per experimental run is defined with --physics-n-engines, and that option is required for this HPC environment during stage 1.
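
As a rough sketch of a stage 1 invocation with this plugin (assuming the sierra-cli entry point and ARGoS as the platform; everything below other than --exec-env and --physics-n-engines is a placeholder for whatever your experiment already uses):

    # Hypothetical local invocation: give each experimental run 4 physics
    # engine threads. On a 16-core machine this yields 16 / 4 = 4
    # simultaneous simulations; any additional runs are started as the
    # currently running ones finish.
    SIERRA_COMMON_OPTS="..."   # placeholder for your usual platform/project/experiment options
    sierra-cli $SIERRA_COMMON_OPTS \
               --exec-env=hpc.local \
               --physics-n-engines=4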

PBS HPC Plugin

This HPC environment can be selected via --exec-env=hpc.pbs.

In this HPC environment, SIERRA will spread experiments across multiple nodes allocated by a PBS-compatible scheduler such as Moab. The following describes the PBS-SIERRA interface. Some PBS environment variables are used by SIERRA to configure experiments during stages 1 and 2 (see the TORQUE-PBS docs for their meaning); if they are not defined, SIERRA will throw an error.

PBS-SIERRA interface

  • PBS_NUM_PPN: Used to calculate the # threads per experimental run for each allocated compute node via floor(PBS_NUM_PPN / --exec-jobs-per-node). That is, --exec-jobs-per-node is required for PBS HPC environments.

  • PBS_NODEFILE: Used to obtain the list of nodes allocated to a job, which SIERRA can direct GNU parallel to use for experiments.

  • PBS_JOBID: Used to create the UUID nodelist file passed to GNU parallel, guaranteeing no collisions (i.e., simultaneous SIERRA invocations sharing allocated nodes) if multiple jobs are started from the same directory.

The following environment variables are used in the PBS HPC environment:

  • SIERRA_ARCH: Used to enable architecture/OS-specific builds of simulators for maximum speed at runtime on clusters.

  • PARALLEL: Used to transfer environment variables into the GNU parallel environment. This must always be done, because PBS does not transfer variables automatically and because GNU parallel starts another level of child shells.
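
As an illustrative sketch only (the resource requests, the SIERRA_ARCH value, the PARALLEL contents, and the sierra-cli placeholder invocation below are assumptions, not prescribed by SIERRA; only --exec-env=hpc.pbs and --exec-jobs-per-node come from this page), a PBS job script might look like:

    #!/bin/bash -l
    # Hypothetical PBS job script: 4 nodes, 8 cores per node.
    #PBS -l nodes=4:ppn=8
    #PBS -l walltime=02:00:00

    # Assumed value: select an architecture/OS specific simulator build.
    export SIERRA_ARCH=x86_64

    # PBS does not transfer variables automatically and GNU parallel starts
    # another level of child shells, so forward what the runs need via the
    # PARALLEL variable (GNU parallel reads default options from it).
    export PARALLEL="--env PATH --env LD_LIBRARY_PATH --env SIERRA_ARCH"

    cd $PBS_O_WORKDIR

    # With ppn=8 and --exec-jobs-per-node=2, each experimental run gets
    # floor(8 / 2) = 4 threads.
    SIERRA_COMMON_OPTS="..."   # placeholder for your usual platform/project/experiment options
    sierra-cli $SIERRA_COMMON_OPTS \
               --exec-env=hpc.pbs \
               --exec-jobs-per-node=2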

SLURM HPC Plugin

https://slurm.schedmd.com/documentation.html

This HPC environment can be selected via --exec-env=hpc.slurm.

In this HPC environment, SIERRA will spread experiments across multiple nodes allocated by the SLURM scheduler. The following describes the SLURM-SIERRA interface. Some SLURM environment variables are used by SIERRA to configure experiments during stages 1 and 2 (see the SLURM docs for their meaning); if they are not defined, SIERRA will throw an error.

SLURM-SIERRA interface

  • SLURM_CPUS_PER_TASK: Used to set the # threads per experimental run for each allocated compute node. Command line override: none.

  • SLURM_TASKS_PER_NODE: Used to set the # parallel jobs per allocated compute node. Command line override: --exec-jobs-per-node.

  • SLURM_JOB_NODELIST: Used to obtain the list of nodes allocated to a job, which SIERRA can direct GNU parallel to use for experiments. Command line override: none.

  • SLURM_JOB_ID: Used to create the UUID nodelist file passed to GNU parallel, guaranteeing no collisions (i.e., simultaneous SIERRA invocations sharing allocated nodes) if multiple jobs are started from the same directory. Command line override: none.

The following environment variables are used in the SLURM HPC environment:

  • SIERRA_ARCH: Used to enable architecture/OS-specific builds of simulators for maximum speed at runtime on clusters.

  • PARALLEL: Used to transfer environment variables into the GNU parallel environment. This must be done even though SLURM can transfer variables automatically, because GNU parallel starts another level of child shells.
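
As an illustrative sketch only (the #SBATCH resource requests, the SIERRA_ARCH value, the PARALLEL contents, and the sierra-cli placeholder invocation below are assumptions; only --exec-env=hpc.slurm comes from this page), a SLURM job script might look like:

    #!/bin/bash -l
    # Hypothetical SLURM job script: 4 nodes, 2 parallel jobs per node,
    # 4 threads per experimental run. SLURM exports these requests as
    # SLURM_TASKS_PER_NODE and SLURM_CPUS_PER_TASK, which SIERRA reads.
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=2
    #SBATCH --cpus-per-task=4
    #SBATCH --time=02:00:00

    # Assumed value: select an architecture/OS specific simulator build.
    export SIERRA_ARCH=x86_64

    # GNU parallel starts another level of child shells, so forward the
    # variables the runs need even though SLURM propagates the environment
    # to the job itself.
    export PARALLEL="--env PATH --env LD_LIBRARY_PATH --env SIERRA_ARCH"

    SIERRA_COMMON_OPTS="..."   # placeholder for your usual platform/project/experiment options
    sierra-cli $SIERRA_COMMON_OPTS \
               --exec-env=hpc.slurm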

Adhoc HPC Plugin

This HPC environment can be selected via --exec-env=hpc.adhoc.

In this HPC environment, SIERRA will run experiments spread across an ad-hoc network of compute nodes. SIERRA makes the following assumptions about the compute nodes it is allocated for each invocation:

  • All nodes have a shared filesystem.

The following environment variables are used in the Adhoc HPC environment:

  • SIERRA_NODEFILE: Contains the hostnames/IP addresses of all compute nodes SIERRA can use, in the same format as GNU parallel's --sshloginfile. Command line override: --nodefile. Either SIERRA_NODEFILE must be defined or --nodefile must be passed; if neither is true, SIERRA will throw an error.
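
As a rough sketch (the hostnames, core counts, file location, and sierra-cli placeholder invocation below are made up), creating and using a nodefile might look like:

    # Hypothetical nodefile in GNU parallel's --sshloginfile format:
    # one node per line, optionally prefixed with the # of cores to use.
    printf '8/node1.example.com\n8/node2.example.com\n' > "$HOME/sierra-nodes.txt"

    export SIERRA_NODEFILE="$HOME/sierra-nodes.txt"

    SIERRA_COMMON_OPTS="..."   # placeholder for your usual platform/project/experiment options
    sierra-cli $SIERRA_COMMON_OPTS \
               --exec-env=hpc.adhoc
    # Equivalently, skip SIERRA_NODEFILE and pass --nodefile="$HOME/sierra-nodes.txt".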