Replies: 6 comments 7 replies
-
Thank you for experimenting with the parameters and sharing your thoughts! My experience is very similar. The default settings are generally a good starting point.
As you mentioned, the tags are usually ignored or have little impact. Often the lyrics are very dominant and the instrumental accompaniment is too quiet, making parts of the song sound a cappella.
-
I don't think I've ever had the duration come out different from what I requested (using the gradio app). My problem is more that you have to find the right duration for the lyrics to fit, but you can't easily experiment with it: you can't just change the duration to make the song fit without generating a whole new song, which may in turn require a different duration. So the best one can do is estimate what duration might work and then generate a few versions of the song, hoping one has the right tempo for the chosen duration. This seems to be less of a problem with short lyrics, which adapt much more easily to the wanted duration. I also tried to generate step by step using the extend function, but I haven't yet managed to get the new segment to match the style of the previous one.

English pronunciation mostly works well, but I couldn't get it to pronounce "character" correctly. German works better than expected, but still makes some of the mistakes I would expect from a model trained mainly on English and Chinese. I also don't think the German example on the project page is a good one. I'm not sure whether they couldn't find a better one or just didn't understand enough German to spot the mistakes, but it skips enough words of the lyrics that the song no longer makes sense. From my experiments, one can generate better German songs than that. I also think the lyrics themselves lack a clear theme, and the song might work better if the verses rhymed more. Perhaps we should find enough native speakers of all the supported languages to help create a set of high-quality examples.
-
The reason is that I can't understand German, so it's difficult for me to evaluate and judge. If anyone finds a better German example, you are welcome to post it in our Discord sharing channel, and I will update it on our project page.
-
I found that increasing the guidance interval often leads to better tag following, fewer lyrics artefacts (such as skipped or mixed-up lines) and cleaner vocals, but it also makes the song less varied. 0.5 is a good starting point, and one might experiment with going up to 0.75. At 1.0, however, the vocals become slightly distorted and too loud (which is a problem in ComfyUI, since its native workflow doesn't expose this setting).

I experimented with Russian lyrics and the results were generally pretty good. A few things are noticeable, though: word stresses are sometimes wrong, and a slight foreign accent can be heard sometimes, but not always. Wrong stresses can be dealt with using the únicode áccent márk (illustrated), known as the combining acute accent (U+0301), placed right after the stressed vowel (see the first snippet after this comment).

There's a certain nuance with choruses: if a song has different versions of a chorus that don't quite match, the model may mix up their words and lines. There's an artistic device, probably not used often enough, where two identical choruses are followed by a modified one at the end of the song. It breaks expectations and hits hard (if done right), but ACE-Step really doesn't like it. I wonder if attention weights could fix it, the same way it's done in ComfyUI/A1111: you mark (a part of the sentence:1.4) like this, and the bracketed tokens get their attention multiplied by the coefficient, 1.4 in this case. It helps enforce concepts the model is otherwise ignoring. Just an idea; I don't know how attention is implemented here or whether it would work (also sketched after this comment).

Specifying BPM works often enough to stabilize the tempo and the beat. It may not be precise, but it gets quite close. Sometimes the beat gets messed up anyway and the song devolves into a literal cacophony.

It was hard to make synthwave/retrowave songs (more misses than hits); the model prefers to make alt rock/pop instead. It could be a good LoRA candidate, because sometimes it works and just needs to be reinforced.

One pattern I started noticing: on the last line of a verse or chorus the music stops, then starts again on the next line, creating a drop effect. While it sounds kinda cool, it might get old fast. I suppose the model can't produce a smooth enough transition between song parts, so it resorts to a proven, working drop.

The popular easy-listening genres such as pop or funk work extremely well; the songs are often very catchy and deserve to be played on the radio!

The thing I'd personally love to have is key changes: a simple trick, but it can add a lot of emotion if done right. The problem, though, is that modern human-produced music has almost lost it as well! See https://tedium.co/2022/11/09/the-death-of-the-key-change/
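A tiny illustration of the stress trick (plain Python, nothing ACE-Step-specific); the mark is just a combining character appended after the stressed vowel:

```python
# U+0301 COMBINING ACUTE ACCENT goes right after the stressed vowel.
castle = "за\u0301мок"  # renders as "за́мок" (stress on the first syllable, "castle")
lock = "замо\u0301к"    # renders as "замо́к" (stress on the second syllable, "lock")
print(castle, lock)
```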
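And to make the attention-weighting idea concrete, here is a rough sketch of how the A1111/ComfyUI-style `(text:weight)` syntax could be parsed. The helper names are made up, and whether multiplying the matched tokens' embeddings would actually steer ACE-Step is untested:

```python
import re

# Matches "(some text:1.4)"; group 1 is the text, group 2 the weight.
WEIGHT_RE = re.compile(r"\(([^():]+):([0-9.]+)\)")

def parse_weights(prompt: str) -> list[tuple[str, float]]:
    """Split a prompt into (text, weight) chunks; unmarked text gets 1.0."""
    chunks, pos = [], 0
    for m in WEIGHT_RE.finditer(prompt):
        if m.start() > pos:
            chunks.append((prompt[pos:m.start()], 1.0))
        chunks.append((m.group(1), float(m.group(2))))
        pos = m.end()
    if pos < len(prompt):
        chunks.append((prompt[pos:], 1.0))
    return chunks

print(parse_weights("soft vocals, (driving synth bass:1.4), 120 bpm"))
# [('soft vocals, ', 1.0), ('driving synth bass', 1.4), (', 120 bpm', 1.0)]
```

Downstream, each chunk's token embeddings (or attention scores) would be multiplied by the chunk's weight before cross-attention, which is roughly what A1111 does.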
-
Interesting thing I just found with the parameters: even if you're making instrumental music, it's important to put some weight (like 0.3) on Guidance Scale Lyric. If you leave it at 0, for some reason the model doesn't do a good job of following the text tags. My guess is that when the code combines the conditional and unconditional predictions, there's some funky divide-by-zero issue going on when the scale is exactly 0 (a sketch of what I mean follows this comment).
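To illustrate the guess, here is a minimal sketch of a generic two-condition classifier-free guidance mix; the names, default scales, and structure are my assumptions, not ACE-Step's actual code:

```python
def cfg_mix(eps_uncond, eps_tag, eps_lyric,
            scale_tag=15.0, scale_lyric=0.3):
    # Each condition pulls the prediction away from the unconditional
    # branch in proportion to its scale. A scale of 0.0 removes the
    # lyric term entirely, so the lyrics no longer steer the output.
    return (eps_uncond
            + scale_tag * (eps_tag - eps_uncond)
            + scale_lyric * (eps_lyric - eps_uncond))
```

In a clean mix like this, 0 would only disable the lyric term without touching the tags, which is why a numerical issue somewhere (e.g. normalizing by the scale) seems plausible when tag following degrades too. That's purely speculation on my part, though.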
-
I've noticed that the updated version has increased the default number of diffusion steps. In my experience there isn't much difference between many steps and fewer (as long as it's not too low), but I wonder whether higher CFG needs more steps here, similar to Stable Diffusion.
Some parameter effects I noticed:
I would really appreciate some more explanation of what each parameter actually does. For a start it can even be quite technical; the effect on the generated output probably needs to be explored separately anyway. I hope we'll find more on this when the paper is published.
I know about CFG and diffusion steps, but I still need to read up on APG, and I can only guess what granularity means from the word itself. I also wonder about the effect of the "when to apply CFG" parameter; I tried changing it, but couldn't pin down its exact effect on the output (my guess is sketched below).
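My current reading of "when to apply CFG" is that it's the guidance interval: the extra unconditional pass (and the CFG mix) runs only for a fraction of the denoising steps. A minimal sketch, assuming the interval is taken from the middle of the schedule; the function and parameter names are mine:

```python
def in_guidance_interval(step: int, total_steps: int,
                         guidance_interval: float = 0.5) -> bool:
    # 0.5 -> apply CFG only in the middle 50% of steps
    # (i.e. from 25% to 75% of the schedule).
    start = int(total_steps * (0.5 - guidance_interval / 2))
    end = int(total_steps * (0.5 + guidance_interval / 2))
    return start <= step < end

# With 60 steps and interval 0.5, CFG runs on steps 15..44; the
# remaining steps do a single conditional pass (which is also faster).
```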
Please post what parameters worked well for you.