# EmiyaEngineNN

[English](README.md) | [简体中文](README_CN.md)

A completely reconstructed EmiyaEngine using neural networks.
An upsampling/restoration model suitable for common lossy audio.

---

## Methodology

EmiyaEngineNN is a high-fidelity broadband audio upsampling model based on a modified
[BAE-Net](https://github.com/yuguochencuc/BAE-Net).
Compared to the original design, network capacity has been roughly tripled by substantially
widening the FFT window (1576 -> 3072), increasing the channel counts of the intermediate layers,
and other changes, so that the model can handle the more complex scenarios of general lossy audio
rather than just VCTK speech.
On the engineering side, the design follows [kokoro](https://github.com/hexgrad/kokoro):
STFT/iSTFT are built into the network for end-to-end computation, which removes the alignment
cost of separate pre- and post-processing.
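The STFT-in-network idea can be sketched as a thin PyTorch wrapper. This is an illustrative
sketch, not the project's actual code: `STFTWrapper` and its placeholder core are hypothetical,
and only the 3072-sample window follows the README.

```python
import torch
import torch.nn as nn

class STFTWrapper(nn.Module):
    """Illustrative sketch: wrap a spectral model with STFT/iSTFT so the
    whole graph consumes and produces raw waveforms end to end."""

    def __init__(self, core: nn.Module, n_fft: int = 3072, hop: int = 768):
        super().__init__()
        self.core, self.n_fft, self.hop = core, n_fft, hop
        self.register_buffer("window", torch.hann_window(n_fft))

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # (batch, samples) -> complex spectrogram (batch, freq_bins, frames)
        spec = torch.stft(wav, self.n_fft, self.hop, window=self.window,
                          return_complex=True)
        spec = self.core(spec)  # spectral enhancement would happen here
        # back to the time domain with the same analysis parameters
        return torch.istft(spec, self.n_fft, self.hop, window=self.window,
                           length=wav.shape[-1])

# With an identity core, the wrapper simply round-trips the waveform.
model = STFTWrapper(nn.Identity())
out = model(torch.randn(1, 32000))
print(out.shape)  # torch.Size([1, 32000])
```

Because the transform lives inside the module, an exported graph takes waveforms directly and no
external framing/overlap-add code has to stay in sync with the model.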

## Usage

The environment uses Python 3.12 + PyTorch 2.7.1 + ONNX 1.18.0.
Prepare a directory named `dataset`, put the audio files into it, and start `train_aio.py` to
begin training.
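The setup step can be sketched in Python; `music_library` here is an illustrative source path,
not part of the project.

```python
from pathlib import Path
import shutil

# Collect training audio into ./dataset before launching `python train_aio.py`.
# "music_library" is a hypothetical folder holding your own recordings.
dataset = Path("dataset")
dataset.mkdir(exist_ok=True)
for src in Path("music_library").glob("*.flac"):
    shutil.copy2(src, dataset / src.name)
print(dataset.is_dir())
```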
If you just want to hear the results, you can download the prebuilt binary from the Releases
page.
It accepts common lossy audio formats as input (e.g., MP3, AAC, Opus); the output is always
losslessly compressed FLAC.

```shell
zansei.exe model.onnx input.mp3 output.flac
```

Note that the program internally downsamples the audio to 32 kHz to discard the empty upper
spectrum and optimize output quality.
Feeding it lossless audio, or any other input carrying information above this range, may degrade
the result.
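The choice of 32 kHz makes sense because its 16 kHz Nyquist limit roughly matches the cutoff of
typical lossy codecs. As a rough illustration (the binary's internal resampler is not specified,
so `scipy.signal.resample_poly` is used here as a stand-in):

```python
import numpy as np
from scipy.signal import resample_poly

sr_in, sr_out = 44100, 32000
wav = np.random.randn(2, sr_in)  # one second of stereo noise as a stand-in
# 32000 / 44100 reduces to 320 / 441, so resample with up=320, down=441;
# everything above the new 16 kHz Nyquist limit is filtered out
wav_32k = resample_poly(wav, up=320, down=441, axis=1)
print(wav_32k.shape)  # (2, 32000)
```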

## Training Details

Training used 226 stereo recordings randomly selected from a personal music library and ran for
about 90 hours, interrupted once by a machine failure and restart.
At the last checkpoint, the MS-STFT weighted loss was about 8.1 and the discriminator loss about
0.33; the remaining metrics were lost in the failure.