A complete reconstruction of EmiyaEngine using neural networks: an upsampling/restoration model suited to common lossy audio.
*(Comparison images: lossy AAC input vs. upsampled lossless FLAC output.)*
EmiyaEngineNN is a high-fidelity broadband audio upsampling model based on a modification
of BAE-Net.
Compared to the original design, the network capacity has been increased to roughly three times the original by significantly widening the FFT window (1576 -> 3072), changing the channel counts of the intermediate layers, and similar modifications.
This helps the model handle the more complex spectra of general lossy audio, rather than just VCTK speech.
On the engineering side, the design references kokoro: the STFT/iSTFT are built into the network, so the model computes end to end and avoids the alignment cost of separate pre- and post-processing.
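As an illustration of that idea, here is a minimal sketch of a wrapper that runs the STFT and iSTFT inside `forward`. The class name, hop size, and the `core` module are illustrative assumptions, not the actual EmiyaEngineNN code:

```python
import torch
import torch.nn as nn

class SpectralWrapper(nn.Module):
    """Waveform in, waveform out: the STFT/iSTFT live inside the graph."""
    def __init__(self, core: nn.Module, n_fft: int = 3072):
        super().__init__()
        self.core = core                       # spectral-domain network (assumed)
        self.n_fft = n_fft
        self.hop = n_fft // 4
        self.register_buffer("window", torch.hann_window(n_fft))

    def forward(self, wav: torch.Tensor) -> torch.Tensor:  # wav: (batch, samples)
        spec = torch.stft(wav, self.n_fft, self.hop,
                          window=self.window, return_complex=True)
        spec = self.core(spec)                 # enhance the complex spectrum
        return torch.istft(spec, self.n_fft, self.hop,
                           window=self.window, length=wav.shape[-1])

# Smoke test with an identity core: output is a near-perfect reconstruction.
out = SpectralWrapper(nn.Identity())(torch.randn(1, 32000))
```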
The environment uses Python 3.12 + PyTorch 2.7.1 + ONNX 1.18.0.
Prepare a directory named `dataset`, put the audio files into it, and run `train_aio.py` to begin training.
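The CLI below consumes an ONNX model, so the trained checkpoint has to be exported first. A minimal sketch of such an export follows, assuming a (batch, samples) waveform input; this is not the repository's actual export code, and note that ONNX opset 17 provides a native STFT operator but no iSTFT, so a real export may need to handle that step differently:

```python
import torch

def export_onnx(model: torch.nn.Module, path: str = "model.onnx") -> None:
    """Export a trained waveform-to-waveform model with a dynamic time axis."""
    model.eval()
    dummy = torch.randn(1, 32000)  # one second of 32 kHz audio (illustrative)
    torch.onnx.export(
        model, (dummy,), path,
        input_names=["waveform"], output_names=["restored"],
        dynamic_axes={"waveform": {1: "samples"}, "restored": {1: "samples"}},
        opset_version=17,
    )
```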
If you just want to hear the results, you can download the prebuilt binary from the Releases page.
It accepts common lossy audio formats as input (e.g., MP3, AAC, Opus); the output is always losslessly compressed FLAC.
```
zansei.exe model.onnx input.mp3 output.flac
```
Note that the program internally downsamples the audio to 32 kHz to discard the empty upper spectrum and improve output quality.
Feeding it lossless audio, or any input that carries information above this range, may therefore degrade the result.
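For reference, the preprocessing amounts to something like the following. This is a sketch using torchaudio, not the binary's actual internals:

```python
import torchaudio

TARGET_SR = 32000  # the model's working rate

waveform, sr = torchaudio.load("input.mp3")  # (channels, samples)
if sr != TARGET_SR:
    # Everything above 16 kHz is discarded; for lossy sources that band is
    # mostly empty anyway, and the model regenerates it.
    waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
```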
Training used 226 stereo recordings randomly selected from a personal music library and ran for about 90 hours, interrupted once by a machine failure and restart.
At the last checkpoint, the MS-STFT weighted loss was about 8.1 and the discriminator loss about 0.33; other metrics were lost in the failure.
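For context, an MS-STFT loss compares magnitudes at several FFT resolutions. A generic sketch follows; the FFT sizes, weighting, and epsilon are illustrative, and the exact loss behind the 8.1 figure is not documented here:

```python
import torch
import torch.nn.functional as F

def ms_stft_loss(pred: torch.Tensor, target: torch.Tensor,
                 fft_sizes=(512, 1024, 2048)) -> torch.Tensor:
    """Spectral-convergence + log-magnitude L1, averaged over resolutions.
    pred/target: (batch, samples) waveforms."""
    total = pred.new_zeros(())
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=pred.device)
        P = torch.stft(pred, n_fft, n_fft // 4, window=window,
                       return_complex=True).abs()
        T = torch.stft(target, n_fft, n_fft // 4, window=window,
                       return_complex=True).abs()
        sc = torch.linalg.norm(T - P) / torch.linalg.norm(T)  # spectral convergence
        mag = F.l1_loss(torch.log(P + 1e-7), torch.log(T + 1e-7))
        total = total + sc + mag
    return total / len(fft_sizes)
```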