Sg4Dylan/EmiyaEngineNN


EmiyaEngineNN

EmiyaEngine, rebuilt from the ground up with neural networks: an upsampling/restoration model for common lossy audio.

[Figure: lossy AAC input compared with the upsampled lossless FLAC output]

Methodology

EmiyaEngineNN is a high-fidelity broadband audio upsampling model derived from BAE-Net.
Compared with the original design, the network capacity has been raised to roughly three times the original, mainly by widening the FFT window substantially (1576 -> 3072) and changing the channel counts of the intermediate layers, along with other adjustments.
The extra capacity lets the model cope with the more complex material found in general lossy audio, rather than only VCTK speech.
On the engineering side, the design follows kokoro: the STFT and iSTFT are built into the network so that computation is end-to-end, which reduces the cost of keeping pre- and post-processing aligned.
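To illustrate what "STFT/iSTFT built into the network" means, here is a minimal, framework-free sketch: a single callable takes a waveform, transforms it, passes the spectrogram through a core, and transforms it back. It is not the project's code. `EndToEndSketch`, `one_sided_bins`, the Hann window, the hop size, and the identity `core` are all assumptions made for illustration; in a PyTorch model, `torch.stft` and `torch.istft` (or equivalent layers) would play the roles of `stft` and `istft`.

```python
import cmath
import math


def one_sided_bins(n_fft):
    """Number of frequency bins in the one-sided spectrum of a real signal."""
    return n_fft // 2 + 1


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft(x, n_fft, hop):
    """Windowed one-sided DFT of each frame (naive DFT, kept simple for clarity)."""
    w = hann(n_fft)
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        seg = [x[start + i] * w[i] for i in range(n_fft)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
            for k in range(one_sided_bins(n_fft))
        ])
    return frames


def istft(frames, n_fft, hop):
    """Inverse DFT per frame, then overlap-add normalised by the summed squared window."""
    w = hann(n_fft)
    length = n_fft + hop * (len(frames) - 1)
    out = [0.0] * length
    norm = [0.0] * length
    for t, spec in enumerate(frames):
        # Rebuild the full spectrum from the one-sided half (Hermitian symmetry, even n_fft).
        full = spec + [spec[n_fft - k].conjugate() for k in range(n_fft // 2 + 1, n_fft)]
        for n in range(n_fft):
            v = sum(full[k] * cmath.exp(2j * math.pi * k * n / n_fft) for k in range(n_fft)).real / n_fft
            out[t * hop + n] += v * w[n]
            norm[t * hop + n] += w[n] ** 2
    # Samples covered only by a zero window value cannot be recovered; emit 0.0 there.
    return [o / d if d > 1e-8 else 0.0 for o, d in zip(out, norm)]


class EndToEndSketch:
    """Waveform in, waveform out: the transforms live inside the model (hypothetical class)."""

    def __init__(self, n_fft, hop, core=lambda frames: frames):
        self.n_fft = n_fft
        self.hop = hop
        self.core = core  # placeholder for the spectrogram-domain network (identity here)

    def __call__(self, waveform):
        spec = stft(waveform, self.n_fft, self.hop)
        return istft(self.core(spec), self.n_fft, self.hop)
```

With the identity core, `EndToEndSketch(n_fft=16, hop=4)` reproduces its input (apart from the very first sample, where the window is zero), which shows that the caller never needs to handle framing or alignment. If "FFT window" refers to the FFT size, `one_sided_bins` also shows how much wider the spectrogram becomes: 789 bins for 1576 versus 1537 bins for 3072.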

Usage

The project was developed with Python 3.12, PyTorch 2.7.1, and ONNX 1.18.0.
To train, create a directory named dataset, place your audio files in it, and run train_aio.py.

If you only want to try the model, download the prebuilt binary from the Release page.
It accepts common lossy audio formats (e.g., MP3, AAC, Opus) as input and always writes losslessly compressed FLAC.

zansei.exe model.onnx input.mp3 output.flac
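To process several files, the command above can be wrapped in a short script. The sketch below is one possible way to do it and relies only on the three positional arguments shown above. The helper names (`build_command`, `plan_batch`, `run_batch`) and the extension filter are illustrative assumptions, not part of the project.

```python
import subprocess
from pathlib import Path

# Assumption: filter by file extension; the list mirrors the formats named in this README.
LOSSY_EXTS = {".mp3", ".aac", ".opus"}


def build_command(exe, model, src, dst):
    """Argument list for one run: <exe> <model.onnx> <input> <output.flac>."""
    return [str(exe), str(model), str(src), str(dst)]


def plan_batch(exe, model, sources, out_dir):
    """Return one command per lossy input, writing <stem>.flac into out_dir."""
    jobs = []
    for src in map(Path, sources):
        if src.suffix.lower() in LOSSY_EXTS:
            dst = Path(out_dir) / (src.stem + ".flac")
            jobs.append(build_command(exe, model, src, dst))
    return jobs


def run_batch(jobs):
    """Run the planned commands one after another, stopping at the first failure."""
    for cmd in jobs:
        subprocess.run(cmd, check=True)
```

For example, `run_batch(plan_batch("zansei.exe", "model.onnx", Path("music").iterdir(), "restored"))` would convert every matching file in a hypothetical `music` folder, provided the `restored` folder already exists.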

Note that the program internally downsamples the input to 32 kHz to remove the empty upper spectrum and improve output quality.
Feeding it lossless audio, or any other input that carries real content above the resulting 16 kHz limit, may therefore degrade the audio.
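The reasoning behind this warning is the Nyquist limit: a signal sampled at a given rate can only represent frequencies up to half that rate. The small sketch below works out which band is discarded for a given input rate. The function names are hypothetical, and the 32 kHz default is taken from the paragraph above.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency representable at the given sample rate."""
    return sample_rate_hz / 2


def discarded_band(input_rate_hz, internal_rate_hz=32000):
    """Return the (low, high) band in Hz lost by resampling, or None if nothing is lost."""
    low = nyquist_hz(internal_rate_hz)
    high = nyquist_hz(input_rate_hz)
    return (low, high) if high > low else None


# A 44.1 kHz file can carry content up to 22.05 kHz; after resampling to 32 kHz,
# anything between 16 kHz and 22.05 kHz is gone before the model sees the audio.
band = discarded_band(44100)
```

For a typical lossy file that band is already empty, so nothing of value is removed; for a lossless source it may contain real signal.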

Training Details

The model was trained on 226 stereo recordings randomly selected from a personal music library. Training ran for about 90 hours and was interrupted along the way by a machine failure and restart.
At the last checkpoint, the weighted MS-STFT loss was about 8.1 and the discriminator loss was about 0.33. The other metrics were lost in the failure.
