Replies: 2 comments
This does not look like anything podman is involved in; I read it as: your app requires a CUDA capability that your GPU simply does not have. As such I do not believe this is a podman bug, and I am converting it to a discussion.
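To check this locally, the cutoff can be encoded in a small helper. This is a hypothetical sketch, not from the thread: it assumes FlashAttention-2's minimum compute capability is 8.0 (Ampere or newer), as the flash-attention README documents; the function name is invented for illustration.

```shell
# Hypothetical helper: does a compute capability string like "7.0" meet
# the minimum (8.0, Ampere+) that FlashAttention-2 requires?
supports_fa2() {
  awk -v c="$1" 'BEGIN { exit !(c + 0 >= 8.0) }'
}

# Capability 7.0 is Volta (e.g. V100), below the cutoff:
supports_fa2 "7.0" || echo "FlashAttention V2 not supported"
# Capability 8.0 (e.g. A100) passes:
supports_fa2 "8.0" && echo "FlashAttention V2 supported"
```

On the host, the actual capability can be read with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` (needs a reasonably recent driver) and fed to the helper.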
https://github.com/Dao-AILab/flash-attention says:
Issue Description
Running the command
podman run -it --rm --security-opt=label=disable --device nvidia.com/gpu=all --shm-size 1g -p 8080:80 -v /opt/lorax/data:/data:Z ghcr.io/predibase/lorax:latest --model-id mistralai/Mistral-7B-Instruct-v0.1
I get the error
GPU with CUDA capability 7 0 is not supported for Flash Attention V2
I am running on RHEL 9 with this podman version:
Steps to reproduce the issue
podman run -it --rm --security-opt=label=disable --device nvidia.com/gpu=all --shm-size 1g -p 8080:80 -v /opt/lorax/data:/data:Z ghcr.io/predibase/lorax:latest --model-id mistralai/Mistral-7B-Instruct-v0.1
GPU with CUDA capability 7 0 is not supported for Flash Attention V2
Describe the results you received
Startup fails with the error:
GPU with CUDA capability 7 0 is not supported for Flash Attention V2
Describe the results you expected
No startup error
podman info output
Podman in a container: No
Privileged Or Rootless: Rootless
Upstream Latest Release: No
Additional environment details
Additional information
I ran the steps described at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html:
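For reference, the CDI setup on that NVIDIA page boils down to generating a device specification and listing the device names it exposes. This is a sketch of the documented steps, not output from this report, and it assumes the NVIDIA Container Toolkit (`nvidia-ctk`) is installed:

```shell
# Generate the CDI specification for the installed NVIDIA GPUs:
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# List the device names the spec exposes; the "nvidia.com/gpu=all"
# passed to --device above must match one of these entries:
nvidia-ctk cdi list
```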
and I also updated /etc/subuid and /etc/subgid to run rootless.
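The subordinate ID entries mentioned above typically look like this (the username and ranges below are placeholders, not values from this report; the range must not overlap another user's):

```
# /etc/subuid and /etc/subgid each need one line for the rootless user:
<user>:100000:65536
```

After editing the files, running `podman system migrate` makes rootless podman pick up the new mapping.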
Full stack trace of the error:
nvidia-smi provides: