Montreal Forced Aligner (MFA) needs a lot of environment dependencies
This is not my main area of development at the moment, so I'll put it on hold for now.
However, the main workflow should be as follows: CosyVoice (the TTS model) first synthesizes the audio, and then MFA aligns the subtitle text to that audio to get the timings. This approach is likely to be the most accurate.
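
A rough sketch of the alignment half of that workflow, assuming the audio has already been written to a WAV file by the TTS step (the CosyVoice call itself is not shown here). The helper names `prepare_corpus`, `build_mfa_command`, and `align` are hypothetical, and `english_us_arpa` is only an assumed choice of pretrained dictionary and acoustic model that would have to be downloaded with `mfa model download` first. MFA expects each audio file to sit next to a transcript file with the same base name, which is what the first helper sets up.

```python
import shutil
import subprocess
from pathlib import Path


def prepare_corpus(wav_path: Path, subtitle_text: str, corpus_dir: Path) -> Path:
    """Place the TTS audio and its subtitle text side by side for MFA.

    MFA pairs each audio file with a transcript of the same base name
    (here a .lab file). Returns the path of the transcript that was written.
    """
    corpus_dir.mkdir(parents=True, exist_ok=True)
    target_wav = corpus_dir / wav_path.name
    if wav_path.resolve() != target_wav.resolve():
        shutil.copy(wav_path, target_wav)
    lab_path = target_wav.with_suffix(".lab")
    lab_path.write_text(subtitle_text.strip() + "\n", encoding="utf-8")
    return lab_path


def build_mfa_command(
    corpus_dir: Path,
    output_dir: Path,
    dictionary: str = "english_us_arpa",      # assumption: pretrained dictionary name
    acoustic_model: str = "english_us_arpa",  # assumption: pretrained acoustic model name
) -> list[str]:
    """Build the `mfa align` invocation: corpus, dictionary, acoustic model, output."""
    return ["mfa", "align", str(corpus_dir), dictionary, acoustic_model, str(output_dir)]


def align(corpus_dir: Path, output_dir: Path) -> None:
    """Run MFA; the aligned TextGrid files end up in output_dir."""
    if shutil.which("mfa") is None:
        raise RuntimeError("The `mfa` command was not found; install MFA and its dependencies first.")
    subprocess.run(build_mfa_command(corpus_dir, output_dir), check=True)
```

The word-level timestamps in the resulting TextGrid files could then be converted into subtitle cues. That conversion is left out, since it depends on the subtitle format in use.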