I did this:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
import time

# Measure the start time
start_time = time.time()

# This snippet assumes a CUDA GPU is available (flash_attention_2 below is GPU-only)
device = "cuda:0"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model and place it directly on the GPU via device_map;
# attn_implementation="flash_attention_2" requires the flash-attn package
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    device_map=device,
    attn_implementation="flash_attention_2",
)

# No explicit model.to(device) is needed: device_map already placed the model

processor = AutoProcessor.from_pretrained(model_id)
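
The snippet breaks off here. A minimal sketch of the likely continuation, based on the unused imports (pipeline, load_dataset) and the start_time measurement; the dataset name is only an illustrative choice, not from the original post:

# Build an ASR pipeline around the loaded model and processor
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
)

# Transcribe a sample clip (illustrative dataset, not from the original post)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])

# Report elapsed wall-clock time
print(f"Elapsed: {time.time() - start_time:.1f} s")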
