In this work we implement, in C++, FlashAttention-based hardware accelerators built around our proposed ExpMul operator, which fuses the floating-point exponential function and the subsequent multiplication into simple add and shift operations in fixed-point arithmetic, without any additional conversion back to the floating-point domain, since the result is produced directly as a floating-point number. To evaluate power metrics, we run inference with Google's FLAN-T5 LLM. More specifically, we run the PyTorch model from Hugging Face and extract the inter-layer results for the different tasks included in the GLUE benchmark, which serve as inputs to `main.cc`.
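To illustrate the idea behind ExpMul, the following minimal sketch is hypothetical code written for this README: the function name `expmul_sketch`, the single chord segment for 2^f, and all constants are illustrative assumptions, not the repository's fixed-point datapath. It shows the decomposition exp(x)·v = 2^⌊x·log2(e)⌋ · 2^f · v, where the integer part reduces to an addition on the exponent field of v and the fractional part 2^f is handled by a piecewise-linear approximation such as the one generated by `utils/gen_pwl_coeff.py`.

```cpp
// Hypothetical sketch of the idea behind ExpMul (illustrative only; not the
// repository's fixed-point hardware datapath): exp(x) * v = 2^(x*log2(e)) * v.
// The integer part of x*log2(e) only shifts the result's exponent, so it turns
// into an integer addition on the IEEE-754 exponent field; the fractional part
// 2^f is covered by a piecewise-linear approximation (cf. gen_pwl_coeff.py).
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

float expmul_sketch(float x, float v) {
    const float log2e = 1.4426950408889634f;
    float y = x * log2e;
    int   i = static_cast<int>(std::floor(y));   // integer part -> exponent add
    float f = y - static_cast<float>(i);         // fractional part in [0, 1)

    // Single-segment chord approximation of 2^f on [0, 1); the real design uses
    // fixed-point piecewise-linear coefficients for better accuracy.
    float pow2f = 1.0f + f;

    // Scale v by 2^i by adding i directly to its exponent field
    // (overflow/underflow and subnormal handling omitted for brevity).
    uint32_t bits;
    std::memcpy(&bits, &v, sizeof(bits));
    int32_t exp_field = static_cast<int32_t>((bits >> 23) & 0xFF) + i;
    bits = (bits & ~(0xFFu << 23)) | (static_cast<uint32_t>(exp_field) << 23);
    float v_scaled;
    std::memcpy(&v_scaled, &bits, sizeof(v_scaled));

    return pow2f * v_scaled;  // approximates exp(x) * v
}

int main() {
    float x = -1.25f, v = 0.8f;
    std::printf("sketch: %f   reference: %f\n", expmul_sketch(x, v), std::exp(x) * v);
    return 0;
}
```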
Most of the floating-point functionality utilizes the Fast-Float4HLS library, which is publicly available on GitHub.
This repository is organized as follows:
```
.
├── src
│ ├── attention.h
│ ├── bf16_arithm.h
│ ├── defines.h
│ ├── file_io.h
│ ├── fused_operators.h
│ ├── logging.h
│ ├── main.cc
│ ├── math_ops.h
│ └── reduction.h
│
├── utils
│ ├── gen_pwl_coeff.py
│ └── pack.py
│
├── LICENSE
├── README.md
└── setup.sh
```

### ./src/

This directory contains the C++ implementation of the FlashAttention-based accelerators with the ExpMul operator.

- `attention.h` contains the implementation of the FlashAttention accelerators
- `fused_operators.h` contains the implementation of the ExpMul operator
### ./utils/

This directory contains Python utility scripts.

### ./setup.sh

A bash script to fetch all required dependencies.

## TODO

- Python scripts for automatically loading and extracting FLAN-T5 inputs on GLUE.
- Fix dependency issues regarding the HLS math library and Fast-Float4HLS.
## Contributors

Currently active: Kosmas Alexandridis and Giorgos Dimitrakopoulos
## License

Fused-ExpMul is licensed under the MIT License. You are free to redistribute work derived from Fused-ExpMul.