This repository implements the attention-condenser-based TinySpeech architecture, aimed at reducing the dependence on (and the computational cost of) sparse convolution layers. This brings the parameter count down by orders of magnitude compared to previous low-footprint attempts, while maintaining similar accuracy. We also provide a training and inference engine for the Z, Y, and X families, achieving 91%+ accuracy. Moreover, this project contains drivers to run the model on the VSDSquadron Mini board, which carries the CH32V003 MCU and is equipped with only 2 KB SRAM and 16 KB flash.
- VSDSquadron Mini
- ESP32-WROOM Module
- Miscellaneous (wires, battery, etc.)
First, install the required packages with `pip install -r requirements.txt`.
You can train the TinySpeech family of models using CLI arguments or one of the provided experiment configs:
python train.py --save_pth "models" --quant --quant_type 8 --model_type Z --epochs 50 --batch_size 64 --lr 0.01 --momentum 0.9 --seed 42 --device "cuda"
# OR
python train.py --config tinyspeechz_google_speech.yaml
Training supports one of four quantization modes: `quant_mode = ["DQ", "SQ", "QAT", "UN"]`. These expand to Dynamic Quantization, Static Quantization, Quantization-Aware Training (QAT), and Unquantized.
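For orientation, the sketch below shows what these four modes roughly correspond to in standard PyTorch terms (using `torch.ao.quantization`). It is an illustration of the terminology only, and not necessarily how `train.py` wires them up internally.

```python
# Rough illustration of the four quant_mode values in standard PyTorch terms;
# not necessarily how train.py implements them internally.
import torch
import torch.ao.quantization as tq

def apply_quant_mode(model: torch.nn.Module, quant_mode: str):
    if quant_mode == "UN":      # Unquantized: keep the model in fp32
        return model
    if quant_mode == "DQ":      # Dynamic Quantization: int8 weights, activations quantized at runtime
        return tq.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    if quant_mode == "SQ":      # Static Quantization: calibrate, then convert weights and activations
        model.qconfig = tq.get_default_qconfig("fbgemm")
        prepared = tq.prepare(model.eval())
        # ... run calibration batches through `prepared` here ...
        return tq.convert(prepared)
    if quant_mode == "QAT":     # Quantization-Aware Training: fake-quantize during training
        model.qconfig = tq.get_default_qat_qconfig("fbgemm")
        return tq.prepare_qat(model.train())
    raise ValueError(f"unknown quant_mode: {quant_mode}")
```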
All the layers required to run the model are specified in the ./verification folder, which contains a sub-folder for each layer with a .c and a .py file. First, we generate a tensor and run it through a layer/activation function used in ./training/tinyspeech.py or ./training/modules.py. Then, we save the result as an int8 or float binary, which is subsequently loaded into C for testing against the custom implementation.
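A minimal sketch of that flow for a single layer might look like the following; the `verification/relu/` paths and the ReLU stand-in are illustrative, the actual folders and reference modules are the ones in ./verification and ./training.

```python
# Minimal sketch of the per-layer verification flow described above.
# Paths and the ReLU stand-in are illustrative only.
import numpy as np
import torch

torch.manual_seed(42)

# 1. Generate a test tensor and run it through the reference PyTorch layer/activation.
x = torch.randn(1, 8, 32)               # (batch, channels, time) -- example shape
y = torch.nn.functional.relu(x)         # stand-in for the layer under test

# 2. Save the input/output pair as raw binaries that the C test code can load.
x.numpy().astype(np.float32).tofile("verification/relu/input_f32.bin")
y.numpy().astype(np.float32).tofile("verification/relu/expected_f32.bin")

# For quantized layers, an int8 dump would be written instead, e.g.:
x_q = torch.clamp((x * 127).round(), -128, 127).to(torch.int8)
x_q.numpy().tofile("verification/relu/input_i8.bin")
```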
You can test all the layers by running `make verify_all` from the root directory of the project.
Run `python -m torch.utils.bottleneck train.py --config <your_config_yaml>` to profile training efficiency.
Attention condensers are designed to replace or reduce the need for traditional convolutional layers, which are typically resource-intensive. The idea is to leverage a self-contained self-attention mechanism that can effectively capture and model both local and cross-channel activation relationships within the input data.
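As a rough illustration (not the exact module from ./training/modules.py), an attention condenser can be sketched as a condense → embed → expand pipeline whose output selectively scales the input activations:

```python
# Simplified sketch of an attention condenser, for intuition only; the module
# used in this project (./training/modules.py) may differ in its embedding
# structure, dimensionality, and scaling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCondenser(nn.Module):
    def __init__(self, channels: int, mid_channels: int = 8):
        super().__init__()
        self.condense = nn.MaxPool1d(kernel_size=2)      # condensation: reduce resolution
        self.embed = nn.Sequential(                      # lightweight embedding of activation relationships
            nn.Conv1d(channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(mid_channels, channels, kernel_size=3, padding=1),
        )
        self.scale = nn.Parameter(torch.ones(1))         # learned scale

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        a = self.condense(v)                             # condense the input activations
        a = self.embed(a)                                # model local and cross-channel relationships
        a = F.interpolate(a, size=v.shape[-1])           # expand back to the input resolution
        a = torch.sigmoid(a)                             # selective attention values
        return v * a * self.scale + v                    # attended output with residual connection

# Example: a (batch, channels, time) activation map
out = AttentionCondenser(channels=16)(torch.randn(1, 16, 64))
```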
For now, we only support quantization-aware training for the TinySpeech-Z and TinySpeech-M variants, given their smaller size. In Quantization-Aware Training (QAT), the model weights are not fully converted to 4 bits during training. Instead, the weights remain in their higher-precision format (typically fp32) but are simulated at lower precision (e.g., 4 bits) during the forward and backward passes. This approach provides the benefits of quantization while still leveraging the precision of higher bit-widths for weight updates.
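A minimal sketch of this fake-quantization idea, assuming symmetric per-tensor 4-bit weights and a straight-through estimator (the actual modules adapted from BitNetMCU may differ in detail):

```python
# Minimal sketch of fake quantization with a straight-through estimator,
# assuming symmetric per-tensor 4-bit weights; illustrative only.
import torch

def fake_quantize_4bit(w: torch.Tensor) -> torch.Tensor:
    """Simulate 4-bit weights in the forward pass while keeping fp32 masters."""
    qmax = 7                                             # symmetric signed 4-bit range [-7, 7]
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    # Straight-through estimator: forward uses w_q, backward sees identity,
    # so gradients update the full-precision weights.
    return w + (w_q - w).detach()

# Usage inside a layer's forward pass (illustrative):
weight_fp32 = torch.randn(16, 16, requires_grad=True)
weight_used = fake_quantize_4bit(weight_fp32)            # quantized values, fp32 gradient path
```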
- Our quantization-aware training modules were adapted from BitNetMCU: High Accuracy Low-Bit Quantized Neural Networks on a low-end Microcontroller. This project itself was inspired by their early work on simple 3-layer CNN inference on a low-end MCU.
If you find our work useful, please cite us.
@software{araviki-2024-qp_tinyspeech,
  title={QP-TinySpeech: Extremely Low-Bit Quantized + Pruned TinySpeech-Z for low-power MCUs},
  author={Ravikiran, Akshath Raghav},
  year={2024}
}
Original Paper:
@misc{wong-etal-2020-tinyspeech,
  title={TinySpeech: Attention Condensers for Deep Speech Recognition Neural Networks on Edge Devices},
  author={Wong, Alexander and Famouri, Mahmoud and Pavlova, Maya and Surana, Siddharth},
  year={2020},
  eprint={2008.04245},
  archivePrefix={arXiv}
}

