Running Rapthor on the SKAO AWS cluster

The recommended way to run Rapthor on the SKAO AWS development cluster is to use the pre-installed rapthor Spack module (you can see details of the Spack package here). Loading this module also loads all of Rapthor’s dependencies, including WSClean and DP3.

$ module use "/shared/fsx1/spack/modules/2025.07.3/linux-ubuntu22.04-x86_64_v3"
$ module load py-rapthor

To ensure that PyBDSF can find the correct Boost libraries, you must also load the boost module and add its library directory to LD_LIBRARY_PATH:

$ module load boost
$ export LD_LIBRARY_PATH=$BOOST_ROOT/lib:$LD_LIBRARY_PATH

Rapthor is now ready to run.
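
Before submitting a job, it can be worth confirming that the executables the modules should provide are actually on your PATH. The following is a minimal sketch: check_tools is a hypothetical helper (not part of Rapthor), and the executable names rapthor, wsclean and DP3 are assumptions that may differ on your installation.

```shell
# Hypothetical helper: report which of the given commands are missing from PATH.
# Returns 0 if all commands are found, 1 otherwise.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Assumed executable names; adjust them if your installation differs.
check_tools rapthor wsclean DP3 \
  || echo "environment incomplete; re-check the module load steps" >&2
```

If any name is reported as missing, repeating the module use and module load steps in the same shell is the first thing to try.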

Note

We recommend running rapthor as a SLURM job submitted from the headnode. Example SLURM scripts that will set up the required environment variables, run and benchmark rapthor using SKA tools are available for a single-node run and a multi-node run. See below for details.

Starting a Rapthor run

Rapthor can be run from the command line using:

$ rapthor rapthor.parset

where rapthor.parset is the parset described in The Rapthor parset. A number of options are available (see Running Rapthor for details).

Warning

Rapthor attempts to resume from a previous state if output files from a previous run are left in the working directory (see Resuming an interrupted run). This means that changes to your parset may not be respected unless you remove or rename the previous output folder and delete the contents of your scratch/temporary directories.
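
One way to guarantee a clean start without losing earlier results is to move the previous output folder aside before rerunning. The sketch below assumes nothing about Rapthor itself: archive_dir is a hypothetical helper, and the path you pass to it is whatever working directory your parset points to.

```shell
# Hypothetical helper: rename an existing directory by appending a timestamp,
# so the next run starts from a clean state while the old results are kept.
archive_dir() {
  local dir="${1%/}"            # strip a trailing slash, if any
  [ -d "$dir" ] || return 0     # nothing to archive
  local stamp
  stamp=$(date +%Y%m%dT%H%M%S)
  mv -- "$dir" "${dir}.${stamp}"
}
```

For example, `archive_dir /path/to/previous_output` renames that folder to `previous_output.<timestamp>`. The scratch and temporary directories still need to be emptied separately.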

Warning

Due to storage limits on the default /tmp directory on AWS, it is best to create a new temporary folder on the shared /shared/fsx1 directory. You will then need to set local_scratch_dir and global_scratch_dir in the parset, as well as TMPDIR in the SLURM script, to this path. This is necessary because Toil/CWL, which Rapthor uses to run its workflows, creates intermediate files in TMPDIR, local_scratch_dir and global_scratch_dir during the run, and these may exceed the available space on /tmp. Note, however, that the filter_skymodel step will always set /tmp as the temporary directory. This is a workaround for the length limit on socket file paths (107 bytes on Unix systems), which causes problems with long path names during multiprocessing (used by PyBDSF). Because Toil builds the names of its temporary storage paths from random hexadecimal strings, paths under a long global_scratch_dir or local_scratch_dir base location can exceed this limit, resulting in errors.
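
The setup described above can be sketched as follows. The directory name under /shared/fsx1 is an assumption, path_fits_socket_limit is a hypothetical helper, and the 40-byte margin reserved for Toil's randomly named subdirectories is a guess rather than a documented figure. The parset lines are shown as comments.

```shell
# Hypothetical helper: succeed only if a base path is short enough to leave
# room for further path components below it within the 107-byte socket limit.
# The default 40-byte margin is an assumption, not a documented value.
path_fits_socket_limit() {
  local path="$1" margin="${2:-40}" limit=107
  local len
  len=$(printf '%s' "$path" | wc -c)
  [ "$((len + margin))" -le "$limit" ]
}

# Assumed location; any writable directory under /shared/fsx1 will do.
SCRATCH="/shared/fsx1/${USER:-user}/rapthor_tmp"

if path_fits_socket_limit "$SCRATCH"; then
  echo "scratch path OK: $SCRATCH"
else
  echo "scratch path may be too long for socket files: $SCRATCH" >&2
fi

# In the SLURM script (on the cluster):
#   mkdir -p "$SCRATCH" && export TMPDIR="$SCRATCH"
# In the parset, point both scratch settings to the same path:
#   local_scratch_dir = /shared/fsx1/<user>/rapthor_tmp
#   global_scratch_dir = /shared/fsx1/<user>/rapthor_tmp
```

Keeping the base path short is the main design choice here, since everything Toil creates is appended to it.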

Running rapthor on a single node

For runs on a single node (i.e., when batch_system = single_machine), the recommended method of running Rapthor on the SKAO cluster is to submit a SLURM job from the headnode.

An example SLURM script for a single node run is provided in the examples directory, together with a corresponding example parset.

Copy these files and edit them as needed: update the paths to your data set and scratch directories, and adjust the cluster configuration so that the resources requested in your SLURM script match those in the parset. Then submit the job using sbatch. This will allocate a compute node and run all workflows on this node.
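
A mismatch between the parset and the SLURM request can be caught at the top of the job script. The sketch below is one possible check, not part of the example scripts: parset_get and check_cores are hypothetical helpers, and the parsing assumes simple `key = value` lines and ignores section headers.

```shell
# Hypothetical helper: print the value of "key = value" from an INI-style parset.
# Only the first match is printed; trailing "#" or ";" comments are dropped.
parset_get() {
  local key="$1" file="$2"
  sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*\([^#;]*\).*$/\1/p" "$file" \
    | sed 's/[[:space:]]*$//' | head -n 1
}

# Hypothetical helper: fail if the parset asks for more cores than were allocated.
check_cores() {
  local parset="$1" allocated="$2"
  local wanted
  wanted=$(parset_get max_cores "$parset")
  if [ -z "$wanted" ]; then
    echo "max_cores not set in $parset" >&2
    return 2
  fi
  if [ "$wanted" -gt "$allocated" ]; then
    echo "parset asks for $wanted cores but only $allocated are allocated" >&2
    return 1
  fi
}
```

Inside a job submitted with `--cpus-per-task`, SLURM sets SLURM_CPUS_PER_TASK, so a line such as `check_cores rapthor.parset "$SLURM_CPUS_PER_TASK" || exit 1` placed before the `rapthor rapthor.parset` call stops the job early instead of failing partway through.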

Running rapthor on multiple nodes

For runs that use multiple nodes of a compute cluster (i.e., when batch_system = slurm), the recommended method of running Rapthor on the SKAO cluster is to submit a SLURM job from the headnode.

An example SLURM script for a multi-node run is provided in the examples directory, together with a corresponding example parset.

Copy these files and edit them as needed (update the paths to your data set and temporary directories, and the cluster configuration), then submit the job using sbatch. This will allocate a compute node to act as the “leader” node, which Toil will use to orchestrate the allocation of other nodes for the different workflows.

Warning

Ensure that max_cores and max_threads match the nodes on the partition(s) you specify in your SLURM script. If you specify more cores than are available, rapthor will fail to run.
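
The number of CPUs per node on a partition can be queried with sinfo and compared against the value you intend to use for max_cores. The following is a minimal sketch: min_cpus and cores_fit are hypothetical helpers, and the partition name and core count in the comment are placeholders.

```shell
# Hypothetical helper: read one CPU count per line on stdin and print the smallest.
# Non-digit characters (such as a trailing "+") are stripped first.
min_cpus() {
  tr -cd '0-9\n' | sort -n | head -n 1
}

# Hypothetical helper: succeed if the requested core count fits on the
# smallest node type listed on stdin.
cores_fit() {
  local requested="$1" smallest
  smallest=$(min_cpus)
  [ -n "$smallest" ] && [ "$requested" -le "$smallest" ]
}

# On the cluster, feed it the per-node CPU counts of a partition, for example:
#   sinfo -h -p <partition> -o '%c' | cores_fit 32 || echo "max_cores too large" >&2
```

Checking against the smallest node is deliberately conservative, since Toil may place a workflow on any node of the partition.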

Known issues

Both single-node and multi-node runs are run with benchmarking activated. However, if mpi is enabled, benchmarking currently does not monitor all nodes of a multi-node run, because of the way rapthor uses salloc to allocate interactive nodes for wsclean-mp.
The “leader” node will be idle for most of the rapthor run, since Toil uses this node to orchestrate the allocation of other nodes. If mpi is enabled, a further node will be idle during imaging steps, since this node is only used to allocate additional nodes for wsclean-mp.

Troubleshooting a run

See the Installation FAQ for tips on troubleshooting Rapthor.

Developing rapthor on the SKAO AWS cluster

To test the latest changes to the rapthor pipeline or develop on your own branch:

Clone the rapthor repository
Start an interactive compute node on AWS (using srun)
Edit and source this shell script. This will set up a virtual Python environment with rapthor installed in editable mode.
Run pytest to ensure your environment is set up correctly.

Note

To avoid unexpected behaviour when testing code changes by running rapthor, always use a fresh output directory and remove all temporary files from previous runs. If rapthor is run with the same parset as before, it will try to resume from the previous state (see Resuming an interrupted run).

Note

When starting an interactive node for testing, make sure you request enough resources (e.g. cpus-per-task) to satisfy the cluster parameters in your parset (e.g. max_cores).
