English | 简体中文
New generation lightweight DiffSinger automatic phoneme annotation tool
For a better solution, please see here.
Currently, the project supports Chinese, English, and Japanese (Japanese recognition is less reliable and requires a larger model).
- Support for Chinese
- Support for English
- Support for Japanese
- torch
- faster-whisper
- pykakasi
fast-phasr-next requires Python 3.8 or later. We strongly recommend creating a virtual environment via Conda or venv before installing dependencies.
- install
```shell
# cpu
pip install -r requirement.txt

# gpu
conda install cudatoolkit -y
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirement.txt
```

This project uses faster-whisper, which reimplements OpenAI's Whisper model on top of CTranslate2, a fast inference engine for Transformer models. This implementation is up to 4x faster than openai/whisper at the same accuracy while using less memory. Efficiency can be improved further with 8-bit quantization on both CPU and GPU.
In a test environment with an RTX 3060 Laptop GPU (6 GB VRAM), using the large-v3 model in fp16, labeling a 6–10 s audio clip takes only about 0.7 s, and a labeling test on 50 audio clips reached about 98.71% accuracy.
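As a rough sanity check on those figures, the implied real-time factor can be computed directly (a back-of-the-envelope sketch in pure Python, using only the numbers quoted above):

```python
# Figure quoted above: ~0.7 s to label a 6-10 s clip.
LABEL_TIME_S = 0.7

def real_time_factor(clip_seconds: float, label_seconds: float = LABEL_TIME_S) -> float:
    """Seconds of audio labeled per second of compute."""
    return clip_seconds / label_seconds

print(round(real_time_factor(6), 1))   # -> 8.6
print(round(real_time_factor(10), 1))  # -> 14.3
```

So labeling runs roughly 9-14x faster than real time on this hardware, before any batching.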
| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed (compared with the original project) |
|---|---|---|---|---|---|
| tiny | 39 M | tiny.en | tiny | ~1 GB | ~128x |
| base | 74 M | base.en | base | ~1 GB | ~64x |
| small | 244 M | small.en | small | ~2 GB | ~36x |
| medium | 769 M | medium.en | medium | ~5 GB | ~8x |
| large | 1550 M | N/A | large | ~10 GB | ~4x |
```shell
python main.py -d [import directory] -m [model, default="large"] -l [language, default="Chinese"] --device [default="cuda"] --compute_type [default="float16"]
```
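The flags above map naturally onto Python's argparse. The sketch below is a hypothetical reconstruction of how such a CLI could be defined; the option names and defaults come from the usage line above, not from the actual main.py source:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the CLI; defaults taken from the usage line.
    p = argparse.ArgumentParser(description="DiffSinger automatic phoneme annotation")
    p.add_argument("-d", "--directory", required=True, help="import directory of audio files")
    p.add_argument("-m", "--model", default="large", help="Whisper model size")
    p.add_argument("-l", "--language", default="Chinese", help="language of the audio")
    p.add_argument("--device", default="cuda", help="inference device (cuda or cpu)")
    p.add_argument("--compute_type", default="float16", help="e.g. float16 or int8")
    return p

args = build_parser().parse_args(["-d", "wavs"])
print(args.model, args.device)  # -> large cuda
```

For example, `python main.py -d wavs -m medium --device cpu --compute_type int8` would label the `wavs` directory on CPU with 8-bit quantization.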
