diff --git a/docs/_running_slurm.rst b/docs/_running_slurm.rst new file mode 100644 index 000000000..3f9998e84 --- /dev/null +++ b/docs/_running_slurm.rst @@ -0,0 +1,265 @@ +Running workflow on a server or cluster +============================================ + +In some situations, users may be unable to upload data to public Galaxy instances due to legal or policy restrictions. +Even if they have access to local cluster systems, these environments often do not permit hosting web servers or lack +dedicated personnel to maintain a Galaxy instance. +Planemo addresses this challenge by enabling users to launch a temporary local Galaxy instance, allowing workflow +execution without the need for a permanent server setup. +Jobs can be executed directly on the server or submitted to a workload manager such as Slurm. + +This guide explains how to run a workflow: + 1. On a standalone server + 2. On a Slurm cluster using DRMAA + + +Requirements +--------------------- + +- planemo: install instructions are available at :doc:`readme` +- DRMAA library +- Tools dependencies +- Database (optional) + +DRMAA library +~~~~~~~~~~~~~~~~~~ +To enable Galaxy to submit jobs to a cluster via Slurm, you need the DRMAA +(Distributed Resource Management Application API) library, which provides a standard interface +for job submission. If your system does not already have the DRMAA library installed, you must +compile it manually. For Slurm, the recommended implementation is +`slurm-drmaa `__. + +Follow these steps to download and build the DRMAA library for Slurm: + +:: + + # Download and extract source code + wget https://github.com/natefoo/slurm-drmaa/releases/download/1.1.4/slurm-drmaa-1.1.4.tar.gz + tar -xvzf slurm-drmaa-1.1.4.tar.gz + cd slurm-drmaa-1.1.4 && mkdir dist + + # Configure build + ./autogen.sh + ./configure --prefix=${PWD}/dist + + # Compile and install + make + make install + + # print path to library + LIBDRMAA=`ls ${PWD}/dist/lib/libdrmaa.so` + echo "Library: ${LIBDRMAA}" + + +Tool dependencies +~~~~~~~~~~~~~~~~~~ +To run a tool/workflow on a server or cluster, the required tools must +be available on the system. Galaxy can manage dependencies +using Conda, Docker or Singularity. While Conda is an option, Docker or +Singularity is recommended for its greater reliability. One of the following +environments need to be available on the system: + +- Conda +- Docker +- Singularity + + +Database +~~~~~~~~~~~~~~~~~~ +Galaxy tracks jobs and their statuses using a database, with SQLite as the default option. +However, when running complex workflows, SQLite can encounter issues if multiple jobs attempt +to update the database simultaneously. This can lead to database locking and premature workflow +termination. + +To prevent these issues, it's recommended to use a PostgreSQL database instead of SQLite. If a +PostgreSQL database isn't readily available, Planemo can launch a temporary PostgreSQL instance +using Singularity by adding the ``--database_type postgres_singularity`` to the run command. + + +Running on workflow +------------------ +Planemo can be configured in many ways to suit different user needs. In this guide, we'll demonstrate +how to run a workflow on both a local laptop/server and a SLURM cluster, using the same example workflow +introduced in `the basic `_ + +Laptop/server +~~~~~~~~~~~~~~~~~~ +:: + + planemo run tutorial.ga tutorial-job.yml \ + --download_outputs \ + --output_directory . \ + --output_json output.json \ + --biocontainers + +This command runs the workflow locally and attempts to resolve tool dependencies using containers. If the +workflow completes successfully, the output files will be downloaded to the current directory. + +Slurm +~~~~~~~~~~~~~~~~~~ + +To run the workflow on a SLURM cluster, you’ll need to configure the ``job_config.yaml`` file to set up a job +runner that uses the DRMAA library for job submission. + + +**job_config.yaml** +:: + + runners: + local: + load: galaxy.jobs.runners.local:LocalJobRunner + slurm: + load: galaxy.jobs.runners.slurm:SlurmJobRunner + drmaa_library_path: PATH/slurm-drmaa-1.1.4/dist/lib/libdrmaa.so + + execution: + default: slurm + environments: + local: + runner: local + slurm: + runner: slurm + native_specification: "-p PARTITION_NAME --time=02:00:00 --nodes=1" + singularity_enabled: true + singularity_volumes: $defaults + tools: + - class: local + environment: local + +This example runs the workflow on a SLURM cluster, using Singularity to resolve tool dependencies. However, it +can be easily adapted to use Docker or Conda instead. For more details on configuring the ``job_config.yaml`` file, +refer to the `job configuration documentation `__ + +**Note:** before using the example remember to update ``PATH`` and ``PARTITION_NAME`` values to match your system’s configuration. + +Run using the default SQLite database +^^^^^^^^^^^^^^ +:: + + planemo run tutorial.ga tutorial-job.yml \ + --download_outputs \ + --output_directory . \ + --output_json output.json \ + --job_config_file job_config.yaml \ + --biocontainers + +Run using a temporary PostgreSQL database +^^^^^^^^^^^^^^ +:: + + planemo run tutorial.ga tutorial-job.yml \ + --download_outputs \ + --output_directory . \ + --output_json output.json \ + --job_config_file job_config.yaml \ + --biocontainers \ + --database_type postgres_singularity + +Slurm with TPV +~~~~~~~~~~~~~~~~~~ +To make it easier to configure resource used by the different tools we can +use `TPV (Total Perspective Vortex) `_ to +set the resources used by each individual tool. This well require some changes to the +``job_config.yaml``. + +**job_config.yaml** +:: + + runners: + local: + load: galaxy.jobs.runners.local:LocalJobRunner + + drmaa: + load: galaxy.jobs.runners.drmaa:DRMAAJobRunner + drmaa_library_path: PATH/slurm-drmaa-1.1.4/dist/lib/libdrmaa.so + + execution: + default: tpv + environments: + local: + runner: local + tpv: + runner: dynamic + function: map_tool_to_destination + rules_module: tpv.rules + tpv_config_files: + - https://gxy.io/tpv/db.yml + - PATH/destinations.yaml + + tools: + - class: local + environment: local + +**destinations.yaml** +:: + + destinations: + tpvdb_drmaa: + runner: drmaa + params: + native_specification: "-p PARTION_NAME --time=02:00:00 --nodes=1 --ntasks={cores} --ntasks-per-node={cores}" + singularity_enabled: true + singularity_volumes: $defaults + +Like the previous example, this setup submits jobs to a SLURM cluster and uses Singularity to resolve dependencies. The +key difference is the use of **TPV**, which allows you to define resource requirements—such as the number of cores used +by bwa mem—based on settings from the shared configuration a ``https://gxy.io/tpv/db.yml`` + +These defaults can be customized either by providing an entirely new db.yaml file or by overriding specific tool settings +in a separate YAML file. For more details, see the , see `using-the-shared-database `__ section of the TPV documentation. + +**Note:** even if your SLURM scheduler does not use ntasks, you should still set it when a tool is intended to use more +than one core. If not specified, Galaxy will default to using a single core for that tool. + +Troubleshooting +------------------ + +**Temp direcory not shared between nodes** + +Galaxy uses a temporary directory when creating and running jobs. +This directory must be accessible to both the server that creates +the job and the compute node that executes it. If this folder is +not shared, local jobs will succeed, while those submitted to a +separate compute node will fail. To resolve this issue, configure +a shared folder as the temporary directory. + +:: + + TMPDIR=PATH_TO_SHARED_TEMPORARY_FOLDER planemo run ... + +**Locked database** + +While running jobs, Galaxy tracks their status using a database. +The default database is SQLite, which may encounter issues when +handling a high number of concurrent jobs. In such cases, you may see errors like this: + + +:: + + cursor.execute(statement, parameters) + sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked + [SQL: SELECT kombu_queue.id AS kombu_queue_id, kombu_queue.name AS kombu_queue_name + FROM kombu_queue + WHERE kombu_queue.name = ? + LIMIT ? OFFSET ?] + +The solution is to switch to a more robust database, such as PostgreSQL. + +**Multiple galaxy instances** + +When you run ``planemo run`` a temporary Galaxy instance is created and +in some cases that instance my not be properly shut down. This can cause +the following error: + +:: + + bioblend.ConnectionError: Unexpected HTTP status code: 500: Internal Server Error: make sure that you don't already have a running galaxy instance. + +You will be able to see if instances are still running using the following commands: + +:: + + # Find running galaxy instances + ps aux | grep galaxy + # Find running planemo commands + ps aux | grep planemo diff --git a/docs/running.rst b/docs/running.rst index e130d373a..e20d9c3f3 100644 --- a/docs/running.rst +++ b/docs/running.rst @@ -20,4 +20,5 @@ the ``planemo run`` command, which allows Galaxy tools and workflows to be executed simply via the command line. .. include:: _running_intro.rst -.. include:: _running_external.rst \ No newline at end of file +.. include:: _running_external.rst +.. include:: _running_slurm.rst \ No newline at end of file