
Commit ab08d21 ("init")
1 parent 2a63b9f

File tree: 5 files changed, +1901 −2 lines

.gitignore

Lines changed: 1 addition & 1 deletion

@@ -173,7 +173,7 @@ cython_debug/
 # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
 # and can be added to the global gitignore or merged into this file. For a more nuclear
 # option (not recommended) you can uncomment the following to ignore the entire idea folder.
-#.idea/
+.idea/

 # Abstra
 # Abstra is an AI-powered process automation framework.

README.md

Lines changed: 42 additions & 1 deletion

Removed the old one-line description ("Fully Neural Networked EmiyaEngine. An upsampler for general music audio.") and added the following content:

# EmiyaEngineNN

[English](README.md) | [简体中文](README_CN.md)

A completely reconstructed EmiyaEngine using neural networks.
An upsampling/restoration model suitable for common lossy audio.

---

## Methodology

EmiyaEngineNN is a high-fidelity broadband audio upsampling model based on a modification of [BAE-Net](https://github.com/yuguochencuc/BAE-Net).
Compared to the original design, network capacity has been increased to roughly three times the original by significantly widening the FFT window (1576 -> 3072), changing the channel counts of the intermediate layers, and other adjustments.
This better adapts the model to the more complex scenarios of general lossy audio, rather than just VCTK speech.
On the engineering side, the design follows [kokoro](https://github.com/hexgrad/kokoro): STFT/iSTFT are built into the network for end-to-end computation, reducing the alignment cost of pre- and post-processing.
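The "STFT/iSTFT built into the network" idea described above can be sketched in PyTorch. This is a minimal illustration under stated assumptions, not the actual BAE-Net architecture: the `SpectralWrapper` name and the placeholder `Conv2d` are hypothetical; only the 3072-sample FFT window comes from the text.

```python
import torch
import torch.nn as nn

class SpectralWrapper(nn.Module):
    """Toy module showing end-to-end STFT -> network -> iSTFT."""
    def __init__(self, n_fft: int = 3072, hop: int = 768):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.register_buffer("window", torch.hann_window(n_fft))
        # Placeholder for the real network: maps a real/imag spectrogram
        # (batch, 2, freq, time) to a spectrogram of the same shape.
        self.net = nn.Conv2d(2, 2, kernel_size=3, padding=1)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:  # wav: (B, samples)
        spec = torch.stft(wav, self.n_fft, self.hop,
                          window=self.window, return_complex=True)
        x = torch.stack([spec.real, spec.imag], dim=1)
        y = self.net(x)
        out = torch.complex(y[:, 0], y[:, 1])
        # length= keeps the output aligned sample-for-sample with the input,
        # which is the "reduced alignment cost" the text mentions.
        return torch.istft(out, self.n_fft, self.hop,
                           window=self.window, length=wav.shape[-1])

wav = torch.randn(1, 32000)          # one second of noise at 32 kHz
out = SpectralWrapper()(wav)
```

Because the transforms live inside `forward`, waveform in and waveform out stay the same length with no external framing or overlap-add bookkeeping.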
## Usage

The environment uses Python 3.12 + PyTorch 2.7.1 + ONNX 1.18.0.
Prepare a directory named `dataset`, put your audio files into it, and run `train_aio.py` to start training.

If you just want to try the effect, you can download the prebuilt binary from the Release page.
It accepts common lossy audio formats as input (e.g., MP3, AAC, Opus); the output is always losslessly compressed FLAC.

```shell
zansei.exe model.onnx input.mp3 output.flac
```

Note that the program internally downsamples the audio to 32 kHz to remove the empty upper spectrum and optimize output quality.
Feeding it lossless audio, or other inputs that carry information above this frequency range, may degrade the result.
## Training Details

Training used 226 stereo recordings randomly selected from a personal music library and ran for about 90 hours; it was interrupted once by a machine failure and restart.
At the last checkpoint, the MS-STFT weighted loss was about 8.1 and the discriminator loss was about 0.33. Other metrics were lost in the failure.
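For context on the quoted MS-STFT figure, a generic multi-scale STFT magnitude loss looks roughly like the sketch below. The scales, hop sizes, and uniform weighting are illustrative assumptions, not the values used in this training run:

```python
import torch

def ms_stft_loss(pred: torch.Tensor, target: torch.Tensor,
                 scales=(512, 1024, 2048)) -> torch.Tensor:
    """Mean L1 distance between STFT magnitudes at several resolutions."""
    loss = 0.0
    for n_fft in scales:
        w = torch.hann_window(n_fft)
        sp = torch.stft(pred, n_fft, n_fft // 4,
                        window=w, return_complex=True).abs()
        st = torch.stft(target, n_fft, n_fft // 4,
                        window=w, return_complex=True).abs()
        loss = loss + (sp - st).abs().mean()
    return loss / len(scales)

x = torch.randn(1, 16000)
zero = ms_stft_loss(x, x)   # identical signals -> zero loss
```

Comparing across several window sizes penalizes both fine temporal and fine spectral errors, which is why this family of losses is common for audio generation.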

README_CN.md

Lines changed: 35 additions & 0 deletions

New file: a Simplified Chinese translation of the README above. In English:

# EmiyaEngineNN

[English](README.md) | [简体中文](README_CN.md)

An EmiyaEngine completely rebuilt with neural networks.
An upsampling/restoration model for common lossy audio.

---

## Methodology

EmiyaEngineNN is a high-fidelity broadband audio upsampling model modified from [BAE-Net](https://github.com/yuguochencuc/BAE-Net).
Compared with the original design, network capacity was increased to about three times by greatly widening the FFT window (1576 -> 3072), changing the intermediate-layer channel counts, and similar adjustments.
This adapts it better to the more complex scenarios of general lossy audio, rather than just VCTK speech.
On the engineering side it follows the design of [kokoro](https://github.com/hexgrad/kokoro), building STFT/iSTFT into the network for end-to-end computation and reducing the alignment cost of pre- and post-processing.

## Usage

The environment uses Python 3.12 + PyTorch 2.7.1 + ONNX 1.18.0.
Prepare a directory named `dataset`, drop your audio files into it, and start `train_aio.py` to begin training.

If you just want to see the effect, download the binary from the Release page.
It accepts common lossy formats as input (e.g., MP3, AAC, Opus); the output is always losslessly compressed FLAC.

```shell
zansei.exe model.onnx input.mp3 output.flac
```

Note that the program internally downsamples the audio to 32 kHz to remove the empty spectrum and optimize output quality.
Using lossless audio, or other inputs containing information above this frequency range, may degrade the result.

## Training Details

Training used 226 stereo recordings randomly selected from a personal music library and ran for about 90 hours; it was interrupted by a machine failure and restart.
The MS-STFT weighted loss at the last checkpoint was about 8.1 and the discriminator loss about 0.33; other metrics were lost in the failure.
