From 3d4e5464c1d3fef4f87226676176ea90ddeb9b1d Mon Sep 17 00:00:00 2001
From: Kaelyn Ferris <43348706+kaelynj@users.noreply.github.com>
Date: Mon, 22 Sep 2025 11:22:33 -0400
Subject: [PATCH] Add draft pages of slurm plugin

---
 docs/guides/_toc.json        |  16 ++-
 docs/guides/slurm-hpc-ux.mdx | 213 +++++++++++++++++++++++++++++++++++
 docs/guides/slurm-plugin.mdx | 151 +++++++++++++++++++++++++
 3 files changed, 379 insertions(+), 1 deletion(-)
 create mode 100644 docs/guides/slurm-hpc-ux.mdx
 create mode 100644 docs/guides/slurm-plugin.mdx

diff --git a/docs/guides/_toc.json b/docs/guides/_toc.json
index 57e29a70ca0..78daa2b8140 100644
--- a/docs/guides/_toc.json
+++ b/docs/guides/_toc.json
@@ -127,7 +127,6 @@
         "title": "Set up custom roles",
         "url": "/docs/guides/custom-roles"
       }
-
     ]
   },
   {
@@ -613,6 +612,21 @@
         }
       ]
     },
+    {
+      "title": "High-Performance Compute",
+      "children": [
+        {
+          "title": "Spank Plugins for Slurm",
+          "url": "/docs/guides/slurm-plugin",
+          "isNew": "true"
+        },
+        {
+          "title": "HPC User Experience",
+          "url": "/docs/guides/slurm-hpc-ux",
+          "isNew": "true"
+        }
+      ]
+    },
     {
       "title": "Visualization",
       "children": [
diff --git a/docs/guides/slurm-hpc-ux.mdx b/docs/guides/slurm-hpc-ux.mdx
new file mode 100644
index 00000000000..5e8f0b2ec12
--- /dev/null
+++ b/docs/guides/slurm-hpc-ux.mdx
@@ -0,0 +1,213 @@
HPC user experience, HPC developer experience and usage patterns
=================================================================

## Content

- [Principles](#principles)
- [Connecting physical resources to Slurm resources and how to use them](#connecting-physical-resources-to-slurm-resources-and-how-to-use-them)
  - [HPC admin scope](#hpc-admin-scope)
  - [HPC user scope](#hpc-user-scope)
  - [HPC application scope](#hpc-application-scope)
  - [Backend specifics](#backend-specifics)
    - [IBM Direct Access API](#ibm-direct-access-api)
    - [Qiskit Runtime Service](#qiskit-runtime-service)
    - [Pasqal Cloud Services](#pasqal-cloud-services)
    - [Pasqal on-prem devices](#pasqal-on-prem-devices)
- [Examples](#examples)
  - [Running jobs with dependencies](#running-jobs-with-dependencies)
  - [Running a job with several Slurm QPU resources](#running-a-job-with-several-slurm-qpu-resources)
  - [Running primitives directly](#running-primitives-directly)
  - [Other workflow tools](#other-workflow-tools)

See [Overview](./overview.md) for a glossary of terms.

## Principles

Slurm QPU resource definitions determine which physical resources can be used by Slurm jobs.
User source code should be agnostic to specific backend instances, and even backend types, as far as possible.
This keeps source code portable, while the QPU selection criteria are part of the resource definition (which is considered configuration as opposed to source code).
The source code does not have to handle, and is not involved in, resource reservation (that is done when Slurm jobs are assigned QPU resources and start running, if applicable on the backend) or execution modes like sessions (these are automatically in place while the job is running, if applicable on the backend).
This makes the source code more portable between similar QPU resource types accessed through different backend access methods (such as IBM's Direct Access API and IBM's Qiskit Runtime service through IBM Quantum Platform).
All backend types (such as IBM's Direct Access API, IBM's Qiskit Runtime service, or Pasqal's backends) follow these principles.

## Connecting physical resources to Slurm resources and how to use them

Note that the exact syntax is subject to change -- this is a sketch of the UX at this time.
### HPC admin scope

Using the SPANK plugin configuration, HPC administrators define which physical resources can be provided to Slurm jobs.
This configuration contains all the information needed to have Slurm jobs access the physical resources, such as endpoints and access credentials -- note that some parts of the configuration, such as credentials, can be sensitive information.

See the file [qrmi_config.json.example](../plugins/spank_qrmi/qrmi_config.json.example) for a comprehensive example.

In `slurm.conf`, `qpu` generic resources can be assigned to some or all nodes for usage:

```
...
GresTypes=qpu,name
NodeName=node[1-5000] Gres=qpu,name:ibm_fez
...
```

### HPC user scope

HPC users submit jobs using QPU resources that are tied to Slurm QPU resources.
The `name` attribute references what the HPC administrator has defined.
Mid-term, backend selection can be based on criteria other than a predefined name that refers to a specific backend (for example, capacity and error-rate qualifiers that help downselect among the defined set of backends).

There might be additional environment variables required, depending on the backend type.

SBATCH parameters point to one or more QPU resources assigned to the application as generic resources.
Environment variables set by the plugin provide the necessary information to the application (see the [HPC application scope](#hpc-application-scope) section for details).

```shell
#SBATCH --time=100
#SBATCH --output=
#SBATCH --gres=qpu:1
#SBATCH --qpu=ibm_fez
#SBATCH --... # other options

srun ...
```

To use more QPU resources, add more QPUs to the `--qpu` parameter:

```shell
#SBATCH --time=100
#SBATCH --output=
#SBATCH --gres=qpu:3
#SBATCH --qpu=my_local_qpu,ibm_fez,ibm_marrakesh
#SBATCH --... # other options

srun ...
```

### HPC application scope

HPC applications use the Slurm QPU resources assigned to the Slurm job.

Environment variables provide more details for use by the application, e.g. `SLURM_JOB_QPU_RESOURCES`, which lists the quantum resource names (comma separated if several are provided).
These variables are used by QRMI.
See the README files in the various QRMI flavor directories ([ibm](https://github.com/qiskit-community/qrmi/blob/main/examples/qiskit_primitives/ibm/README.md), [pasqal](https://github.com/qiskit-community/qrmi/blob/main/examples/qiskit_primitives/pasqal/README.md)) for details.
```python
from dotenv import load_dotenv

from qiskit import QuantumCircuit
from qiskit.transpiler.preset_passmanagers import generate_preset_pass_manager

# using an IBM QRMI flavor:
from qrmi.primitives import QRMIService
from qrmi.primitives.ibm import SamplerV2, get_target

# define circuit
circuit = QuantumCircuit(2)
circuit.h(0)
circuit.cx(0, 1)
circuit.measure_all()

# instantiate QRMI service and get a quantum resource (we take the first one should there be several of them)
# inject credentials needed for accessing the service at this point
load_dotenv()
service = QRMIService()

resources = service.resources()
qrmi = resources[0]

# generate the transpiler target from backend configuration & properties and transpile
target = get_target(qrmi)
pm = generate_preset_pass_manager(
    optimization_level=1,
    target=target,
)

isa_circuit = pm.run(circuit)

# run the circuit
options = {}
sampler = SamplerV2(qrmi, options=options)

job = sampler.run([(isa_circuit,)])
print(f">>> Job ID: {job.job_id()}")

result = job.result()
print(f">>> {result}")
```

See the [examples directory](https://github.com/qiskit-community/qrmi/tree/main/examples/qiskit_primitives/) for example files.

### Backend specifics

#### IBM Direct Access API

##### HPC admin scope

Configuration of Direct Access API backends (HPC admin scope) includes endpoints and credentials for the Direct Access endpoint, the authentication services, and the S3 endpoint.
Specifically, this includes:

* IBM Cloud API key for creating bearer tokens
* endpoint of the Direct Access API
* S3 bucket and access details

Access credentials should not be visible to HPC users or other non-privileged users on the system.
Therefore, sensitive data can be put in separate files, which can be access protected accordingly.

Note that Slurm has full access to the backend.
This has several implications:

* the Slurm plugin is responsible for multi-tenancy (ensuring that users don't see results of other users' jobs)
* vetting users (who is allowed to access the QPU) and enforcing that access is up to the HPC cluster side
* the capacity and priority of QPU usage is managed solely through Slurm; there is no other scheduling of users involved outside of Slurm

##### HPC user scope

Execution lanes are not exposed to the HPC administrator or user directly.
Instead, mid-term, there can be two different modes that HPC users can specify:

* `exclusive=true` specifies that no other jobs can use the resource at the same time. An exclusive-mode job gets all execution lanes and cannot run at the same time as a non-exclusive job
* `exclusive=false` allows other jobs to run in parallel. In that case, there can be as many jobs as there are execution lanes at the same time, and the job essentially only gets one lane

#### Qiskit Runtime Service

##### HPC user scope

It is expected that users specify additional access details in environment variables.
Specifically, this includes:

* Qiskit Runtime service instance (CRN, Cloud Resource Name)
* endpoint for Qiskit Runtime (unless auto-detected from the CRN)
* API key which has access to the CRN
* S3 instance, bucket, and access token/credentials for data transfers

This determines under which user and service instance the Qiskit Runtime service is used.
Accordingly, IBM Quantum Platform's scheduling considers the capabilities of that user and service instance.

At this time, users have to provide the above details (no shared cluster-wide Quantum access).
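For orientation, such details could be exported in the batch script before `srun` launches the application. The snippet below is a sketch only: the variable names are placeholders invented for this page, not the authoritative ones -- consult the QRMI README files linked above for the variables that are actually read.

```shell
# Illustrative only -- variable names are placeholders, not QRMI's actual ones.
export EXAMPLE_QRS_CRN="crn:v1:bluemix:public:quantum-computing:..."  # Qiskit Runtime service instance (CRN)
export EXAMPLE_QRS_API_KEY="$(cat "$HOME/.qrs_api_key")"              # API key with access to the CRN
export EXAMPLE_QRS_S3_BUCKET="my-transfer-bucket"                     # S3 bucket for input/output data
export EXAMPLE_QRS_S3_CREDENTIALS="$HOME/.qrs_s3_credentials"         # S3 access credentials

srun python my_quantum_workload.py
```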
#### Pasqal Cloud Services

##### HPC admin scope

There is no specific setup required from HPC admins for PCS usage.

##### HPC user scope

It is expected that users specify additional access details in environment variables.
Specifically, this currently includes:

* PCS resource to target (FRESNEL, EMU_FRESNEL, EMU_MPS)
* Authorization token

#### Pasqal on-prem devices

TBD.

## Examples

### Running jobs with dependencies

FIXME: show example with 1 classical job => 1 quantum job (python pseudo code) => 1 classical job.
Main topic: show dependencies

### Running a job with several Slurm QPU resources

FIXME: show example (quantum only, python, is good enough) where several backends are defined, referenced, and used.
Main topic: show how ids play an important role in that case

### Running primitives directly

FIXME: show example of qrun -- same SBATCH, but different executable.
Main topic: present qrun as an option
FIXME: define/finalize qrun at some time (parameters etc.)

### Other workflow tools

FIXME: show how other workflow tooling could play into that

diff --git a/docs/guides/slurm-plugin.mdx b/docs/guides/slurm-plugin.mdx
new file mode 100644
index 00000000000..82eb0f1de35
--- /dev/null
+++ b/docs/guides/slurm-plugin.mdx
@@ -0,0 +1,151 @@
Spank plugins for Slurm to support quantum resources
====================================================

## Content

- [Context](#context)
- [Definitions](#definitions)
  - [QPU](#qpu)
  - [Quantum computer](#quantum-computer)
  - [Spank plugins](#spank-plugins)
  - [Spank quantum plugin](#spank-quantum-plugin)
  - [Qiskit primitives (Sampler and Estimator)](#qiskit-primitives-sampler-and-estimator)
- [Vendor-Specific Context: IBM](#vendor-specific-context-ibm)
- [Vendor-Specific Definitions: IBM](#vendor-specific-definitions-ibm)
  - [IBM Quantum Platform](#ibm-quantum-platform)
  - [Direct Access API](#direct-access-api)
- [Vendor-Specific Definitions: Pasqal](#vendor-specific-definitions-pasqal)
  - [Pasqal Cloud Service](#pasqal-cloud-service)
  - [Pulser](#pulser)
- [High Level Structure](#high-level-structure)
- [Quantum resource for workload management system](#quantum-resource-for-workload-management-system)
- [Quantum resource API](#quantum-resource-api)
- [Integration Flow](#integration-flow)
- [High Level Flow of Quantum Plugin](#high-level-flow-of-quantum-plugin)
- [General architecture of plugin](#general-architecture-of-plugin)
- [Architectural Tenets](#architectural-tenets)

See [UX](./slurm-hpc-ux) for HPC user experience, HPC developer experience, and usage patterns.

## Context

Overview of involved components, personas, and backend service options:

![context diagram](./images/context_diagram.png)

## Definitions

### QPU

A `QPU` includes all of the hardware responsible for accepting an executable quantum instruction set, or a quantum circuit, and returning an accurate answer. That means the QPU includes the quantum chip(s) in a superconducting quantum computer, as well as additional components such as amplifiers, control electronics, and instruments.

### Quantum Computer

A `Quantum Computer` comprises the QPU and the classical compute needed to execute requests coming in through an API (its endpoint).

### Spank plugins

`SPANK` provides a very generic interface for stackable plug-ins which may be used to dynamically modify the job launch code in Slurm.
See the [Slurm SPANK documentation](https://slurm.schedmd.com/spank.html).

### Spank quantum plugin

A Slurm plugin that manages the operation of quantum jobs. It handles Slurm resources related to quantum computing and is configured so that jobs can execute on Quantum Computers.
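For orientation, SPANK plugins are enabled cluster-wide through Slurm's `plugstack.conf`. The entry below is a sketch only: the plugin file name, install path, and argument are assumptions made for this page, not the plugin's documented configuration.

```
# /etc/slurm/plugstack.conf (illustrative)
# <optional|required>  <path to plugin>                <plugin arguments>
required                /usr/lib64/slurm/spank_qrmi.so  /etc/slurm/qrmi_config.json
```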
### Qiskit primitives (Sampler and Estimator)

The two most common tasks for quantum computers are sampling quantum states and calculating expectation values. These tasks motivated the design of the Qiskit primitives: `Estimator` and `Sampler`.

- Estimator computes expectation values of observables with respect to states prepared by quantum circuits.
- Sampler samples the output register from quantum circuit execution.

In short, the computational model introduced by the Qiskit primitives moves quantum programming one step closer to where classical programming is today, where the focus is less on the hardware details and more on the results you are trying to achieve.

## Vendor-Specific Context: IBM

Extension of the context overview of involved components, personas, and backend service options for IBM:

![context diagram IBM](./images/context_diagram_ibm.png)

## Vendor-Specific Definitions: IBM

### IBM Quantum Platform

Cloud-based quantum computing service providing access to IBM's fleet of quantum backends. Sometimes abbreviated as IQP.

### Direct Access API

Local interface to an IBM Quantum Computer. Sometimes abbreviated as DA API. Below the Direct Access API, classical preparation of jobs prior to the actual quantum execution can run in parallel (called *lanes* in the API definition).

## Vendor-Specific Definitions: Pasqal

### Pasqal Cloud Service

Cloud-based quantum computing service providing access to Pasqal QPUs and emulators. Sometimes abbreviated as PCS.

### Pulser

Pasqal's native programming library ([GitHub](https://github.com/pasqal-io/pulser)). Supported in low-level interfaces such as the QRMI.

## High Level Structure

At a high level, there are three domains:

* HPC users, consuming Slurm resources and using access to Quantum Computers through these resources
* HPC admins, configuring Slurm and managing access and mapping to available Quantum Computers
* Quantum Computer providers, offering access to Quantum/QPU resources on Quantum Computers

![High Level Structure](./images/high_level_structure.png)

## Quantum resource for workload management system

The general GRES (custom resource) for quantum computers is the QPU.
All quantum resources have an identity and map to a Quantum Computer's quantum resource (i.e. map to a QPU).

Additional resource definitions might be needed depending on the hardware vendor's implementation. Some vendors expose parallelism within a quantum computer as execution lanes, threads, parts of devices, and so on. Therefore, we define a quantum resource as an abstraction composed of a physical device and a notion of parallelism.

![resource definition](./images/resource_definition.png)

The QPU resource definition does not expose individual parallelism abstracts. Each backend flavor can have specific qualifiers for how to use backend-specific capabilities. For example, in a common use case, if a user wants exclusive use of a backend, all parallel job-preparation units are available to that job; if not, several users can submit jobs and share these units. Because execution lanes in the DA API do not have identities that could be managed explicitly, only quantities and exclusive/shared use should be user controlled.

![resource mapping](./images/resource_mapping.png)

## Quantum resource API

Any type of resource should implement the resource control interface. The flow of working with a resource follows the pattern `acquire resource` → `execute` → `release resource`.
The implementation of this interface may vary from platform to platform (a minimal sketch of the pattern is shown at the end of this page).

![resource control api](./images/resource_control_api.png)

## Integration Flow

Similar to any GRES resource (GPU, AIU, etc.), we treat the QPU as a GRES and acquire it for the whole duration of the job.
Primitive calls manage the data and the calls towards the Quantum Computer (in most cases through Slurm, to govern user access to the Slurm-level quantum resource and potentially inject Quantum Computer access credentials).

![integration Flow](./images/integration_flow.png)

This avoids drawbacks of other options, e.g. having the user application's primitive call create additional Slurm jobs that send primitive data towards the Quantum Computer.
Keeping the logic for sending data towards the Quantum Computer in Qiskit-level code reduces complexity and latency, and avoids complexity in error handling.

## High Level Flow of Quantum Plugin

This is the high-level flow from when Slurm jobs are started to how requests find their way to a Quantum Computer.
Requests refer to any interaction between the application and the Quantum Computer (including, for example, getting configuration information that is needed for transpilation).

![High Level Flow -- png editable with draw.io, please keep it that way](./images/high-level-plugin-flow.png)

## General architecture of plugin

The quantum plugin uses the sequence of SPANK callback events during job execution to inject the logic necessary to work with the quantum computer API.

1. Job execution flow
   1. Prolog
      1. Handle secrets
      2. Acquire resource
      3. Create network middleware
   2. Task init
      1. Handle options
      2. Set env variables
   3. Epilog
      1. Remove middleware
      2. Release resource

![general architecture](./images/general_architecture.png)

## Architectural Tenets

* A Slurm-level QPU resource maps to a physical resource of a Quantum Computer
  * Quantum backend selection is part of that Slurm resource definition and is considered configuration, not source code
  * Qiskit-level code can refer to that Slurm resource and access/use the quantum resource behind it. Qiskit-level code should avoid naming the desired backend directly (it should be part of the Slurm resource definition instead)
  * Slurm QPU resources have an identity (allowing Qiskit-level code to bind against them)
  * additional qualifiers of the Slurm QPU resource are backend-type specific
  * parallelism abstracts (such as execution lanes, which are anonymous units that prepare jobs in parallel for quantum execution, which itself is still serialized) are hidden behind the Slurm QPU resource. Qualifiers may be used to deal with specifics (such as whether these lanes are held exclusively for one user or shared access is possible)
* Quantum resources are acquired/locked before usage
  * as identification/selection of the quantum resource happens through the Slurm resource, transpilation can only happen after that
  * initially, transpilation will happen after acquiring the resource, which can lead to slightly lower QPU utilization, as other jobs may be locked out. This may (should!) be improved in a later phase and requires an extended concept (such as a first step that defines the resource, which may or may not result in actions, and a second step that locks it for execution) -- more details are required at a later time!
* Primitive calls will trigger submission towards the Quantum Computer
  * The flow is triggered from the Qiskit-level code, without scheduling additional intermediate Slurm jobs
  * The network flow can go through the Slurm plugin, to govern user access or to manage access to the Quantum Computer
  * The network/data flow can be specific to a backend type (using intermediate storage to hold input/output data, or sending data inline)
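To make the `acquire resource` → `execute` → `release resource` pattern from the [Quantum resource API](#quantum-resource-api) section concrete, the following is a minimal, self-contained sketch of what such a resource control interface could look like. It is illustrative only: the class and method names are assumptions made for this page and do not reflect QRMI's actual API.

```python
from abc import ABC, abstractmethod
from contextlib import contextmanager


class QuantumResource(ABC):
    """Illustrative resource-control interface: acquire -> execute -> release (not QRMI's actual API)."""

    @abstractmethod
    def acquire(self) -> str:
        """Lock the resource for this job and return an acquisition token."""

    @abstractmethod
    def execute(self, token: str, payload: dict) -> dict:
        """Send a payload (e.g. a primitive request) to the acquired resource."""

    @abstractmethod
    def release(self, token: str) -> None:
        """Release the resource so other jobs can use it."""


@contextmanager
def acquired(resource: QuantumResource):
    """Ensure the resource is always released, even if execution fails."""
    token = resource.acquire()
    try:
        yield token
    finally:
        resource.release(token)


class FakeQPUResource(QuantumResource):
    """Stand-in implementation so the sketch can run end to end."""

    def acquire(self) -> str:
        return "token-0001"

    def execute(self, token: str, payload: dict) -> dict:
        return {"token": token, "echo": payload, "status": "completed"}

    def release(self, token: str) -> None:
        pass


if __name__ == "__main__":
    qpu = FakeQPUResource()
    with acquired(qpu) as token:
        result = qpu.execute(token, {"circuits": ["isa_circuit"], "shots": 1024})
    print(result)
```

In the plugin architecture above, the acquire and release steps would map to the prolog and epilog, while the execute step is driven by the primitive calls in the user's application.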