I’ve been using Snakemake on HPC clusters for years, mostly for my own projects. Now I need to set up a pipeline that others can run too. Here’s how I set things up and how I added custom modules to the pipeline.
How to run jobs on an HPC cluster
I used to run jobs by passing --cluster along with qsub commands, and using --cluster-config for cluster- and workflow-specific settings.
snakemake -p --cluster-config hpc.json \
--cluster "qsub -j oe -l walltime={cluster.time} -l select=1:ncpus={cluster.ncpus}:mem={cluster.mem}" \
--jobs 100 \
--latency-wait 15
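The hpc.json file held per-rule scheduler settings that the qsub template above pulled in via {cluster.time}, {cluster.ncpus}, and {cluster.mem}. A minimal example might look like this (the rule name "bigjob" and the values are hypothetical):

```json
{
    "__default__": {"time": "04:00:00", "ncpus": 4, "mem": "16gb"},
    "bigjob": {"time": "24:00:00", "ncpus": 16, "mem": "64gb"}
}
```

Rules without their own entry fall back to __default__.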
Since Snakemake version 8, things have changed a lot: --cluster is deprecated, Snakemake now uses executor plugins to talk to cluster job schedulers, and using --profile is the recommended way.
If you’re using PBS Pro like I am, you first need to install the executor plugin snakemake-executor-plugin-cluster-generic. There are plugins for other schedulers too; check the Snakemake plugin catalog.
pip3 install snakemake-executor-plugin-cluster-generic
After that, Snakemake’s CLI will show extra options:
# snakemake -h
cluster-generic executor settings:
--cluster-generic-submit-cmd VALUE
Command for submitting jobs
--cluster-generic-status-cmd VALUE
Command for retrieving job status
--cluster-generic-cancel-cmd VALUE
Command for cancelling jobs. Expected to take one or more jobids as arguments.
--cluster-generic-cancel-nargs VALUE
Number of jobids to pass to cancel_cmd. If more are given, cancel_cmd will be called multiple times.
--cluster-generic-sidecar-cmd VALUE
Command for sidecar process.
You don’t need to pass all of these via the CLI. It’s better to use a profile and define them in a config file.
# profile/pbspro/config.yaml
executor: cluster-generic
cluster-generic-submit-cmd: "pbs-submit.py"
cluster-generic-cancel-cmd: "qdel"
jobscript: "jobscript.sh"
jobs: 500
use-envmodules: True # if you need to load modules in job scripts
use-apptainer: True
printshellcmds: True
latency-wait: 15
retries: 0
keep-going: True
A default jobscript.sh comes with the plugin (under snakemake_interface_executor_plugins/executors), but you can provide your own in the profile directory and point to it in config.yaml.
In the profile config above, I set cluster-generic-submit-cmd to pbs-submit.py because I use a custom wrapper script to submit jobs. This gives me more flexibility than calling qsub directly: I can control which queue to use (e.g., gpu, bigmem) depending on the job’s resource requests. This is specific to our HPC cluster.
# profile/pbspro/pbs-submit.py
#!/usr/bin/env python3
import sys
import subprocess

from snakemake.utils import read_job_properties

jobscript = sys.argv[-1]
job_properties = read_job_properties(jobscript)

resource_list = []
queue = ""

if "threads" in job_properties:
    resource_list.append(f"ncpus={job_properties['threads']}")

resources = job_properties.get("resources", {})
if "gpu" in resources:
    resource_list.append(f"ngpus={resources['gpu']}")
    queue = "-q gpu"
elif "nvidia_gpu" in resources:
    resource_list.append(f"ngpus={resources['nvidia_gpu']}")
    queue = "-q gpu"
if "mem" in resources:
    resource_list.append(f"mem={resources['mem']}gb")
    if resources["mem"] > 200:
        queue = "-q bigmem"
if "walltime" in resources:
    resource_list.append(f"walltime={resources['walltime']}:00:00")

# join only the fields that were set, so qsub doesn't get stray commas in -l
cmd = f"qsub {queue} -r n -l {','.join(resource_list)} {jobscript}"
res = subprocess.run(cmd, check=True, shell=True, stdout=subprocess.PIPE)

# print the job ID so Snakemake can track the job
print(res.stdout.decode().strip())
What is this script doing?
- Reads job properties using snakemake.utils.read_job_properties()
- Dynamically sets:
- ncpus based on threads
- mem, walltime from rule resources
- ngpus if GPU resources are requested
- Chooses the gpu or bigmem queue based on resource usage
- Builds and submits a qsub command
- Prints the job ID to stdout (Snakemake uses this to track the job)
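For example, a rule like the following (the rule name, files, and values are hypothetical) would be routed to the bigmem queue and submitted with -l ncpus=8,mem=250gb,walltime=12:00:00:

```python
# Hypothetical rule: threads and resources feed the pbs-submit.py wrapper above
rule assemble:
    input: "reads/{sample}.fq.gz"
    output: "assembly/{sample}.fa"
    threads: 8
    resources:
        mem=250,       # in GB; > 200 routes the job to the bigmem queue
        walltime=12    # in hours
    shell: "assembler -t {threads} {input} > {output}"
```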
Containerization
Using module load is a simple way to manage environments on HPC, but it’s not always ideal. Many Snakemake workflows use Conda so they can run anywhere, but Conda can be a pain on HPC, especially if there’s a proxy or restricted internet.
Instead, I build an Apptainer container (.sif) with all the tools I need and specify it in the workflow:
container: "/path/to/my.sif" # "docker://user/my-snakemake-pipeline:latest"
For rules that don’t need containerized tools, just set:
container: None
If use-apptainer is enabled, Snakemake will check whether apptainer is available, so make sure to load the module first. Since each cluster job runs its own Snakemake command via a jobscript, you also need to load Apptainer inside the jobscript. You can add it manually:
# profile/pbspro/jobscript.sh
#!/usr/bin/env bash
# properties = {properties}
# This script gets executed on a compute node on the cluster
echo -e "JOB ID\t$PBS_JOBID"
echo "================================="
# Load apptainer module for the Snakemake execution
module load apptainer
{exec_job}
Now each job runs inside the container, and tool versions stay consistent across platforms.
Run it on HPC for different projects
You can either copy the workflow’s config file and edit it, or override specific config values via the command line:
snakemake --configfile myconfig.yaml --config samples=samples.csv \
-s path/to/workflow/Snakefile \
--directory work \
--profile /path/to/profile/pbspro
I like to keep all intermediate files of each project in its own work directory (Nextflow style), which you can specify with --directory. But if you use relative paths for samples or configfile, jobs in the workflow might not find those files.
I just found out that Path.cwd() won’t return the path where I launch Snakemake; instead it returns the working directory set by --directory. So I added this in the .smk file:
# rules/common.smk
import os
from pathlib import Path

samples = Path(config["samples"])
if not samples.is_absolute():
    # $PWD is where snakemake was launched, not the --directory working dir
    samples = Path(os.environ["PWD"]) / samples
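The same resolution logic can be sketched outside Snakemake; the launch directory here is a hypothetical example, and on a POSIX system $PWD would be used when none is given:

```python
import os
from pathlib import Path

def resolve_from_launch_dir(path, launch_dir=None):
    """Resolve a possibly-relative path against the directory where
    snakemake was launched ($PWD), not the --directory working dir."""
    path = Path(path)
    if path.is_absolute():
        return path
    return Path(launch_dir or os.environ["PWD"]) / path

print(resolve_from_launch_dir("samples.csv", "/home/me/project"))
# -> /home/me/project/samples.csv
print(resolve_from_launch_dir("/data/samples.csv"))
# -> /data/samples.csv
```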
This ensures paths are resolved relative to where you launched Snakemake. If you don’t want to type the profile path every time, you can:
- Put your profile in ~/.config/snakemake/, then just use --profile pbspro
- Rename the profile directory to default, and Snakemake will pick it up automatically
A better way: snk!
I really like this nice and neat package by Wytamma. It makes it easy to manage the workflow.
After installing snk, you can install the pipeline (online or locally), along with any dependencies:
snk install /path/to/pipeline -n my-pipeline -d pandas -d snakemake-executor-plugin-cluster-generic
It turns your workflow into a CLI app, which is so cool. You can configure, run scripts, and execute the pipeline like this:
# Check and edit pipeline config
my-pipeline config > config.yaml
# Run custom script from the pipeline
my-pipeline script run somescript
# Run the full pipeline
my-pipeline run --profile pbspro --configfile config.yaml
It can install any workflow as long as it is properly structured, like those in the catalog.
├── config
│ ├── config.yaml
│ └── samples.csv
├── README.md
└── workflow
├── profiles
├── rules
├── schema
├── scripts
└── Snakefile
If your pipeline doesn’t follow the standard structure, you can still point snk at the pieces:
snk install --config path/to/config /path/to/pipeline
snk install --snakefile path/to/Snakefile /path/to/pipeline
This is how I set up a Snakemake pipeline on our HPC, but there are more plugins and features to explore. I’m especially interested in snkmt; it looks like an nf-tower (now Seqera Platform) equivalent for Snakemake. I might try it out in the future.