Replies: 6 comments 7 replies
-
Thank you for experimenting with the parameters and sharing your thoughts! My experience is very similar. The default settings are generally a good starting point.
As you mentioned, the tags are usually ignored or have little impact. Often the lyrics are very dominant and the instrumental accompaniment is too quiet, making parts of the song sound a cappella.
-
I don't think I've ever had the duration come out different from what I requested (using the gradio app). My problem is more that you have to find the right duration for the lyrics to fit, but you can't easily experiment with it: you can't just change the duration to make the song fit without generating a whole new song, which may in turn require a different duration. So the best one can do is estimate what duration might work and then generate a few versions of the song, hoping one has the right tempo for the chosen duration. This seems to be less of a problem with short lyrics, which adapt much more easily to the wanted duration. I also tried to generate step by step using the extend function, but I haven't yet managed to get the new segment to match the style of the previous one.

English pronunciation mostly works well, but I couldn't get it to pronounce "character" correctly. German works better than expected, but still makes some of the mistakes I would expect from a model trained mainly on English and Chinese. I also don't think the German example on the project page is a good one. I'm not sure whether they couldn't find a better one or just didn't understand enough German to spot the mistakes, but it skips enough words of the lyrics that the song no longer makes sense. From my experiments, one can generate better German songs than that. I also think the lyrics themselves lack a clear theme, and the song might work better if the verses rhymed more. Perhaps we should find enough native speakers of all the supported languages to help create a set of high-quality examples.
-
The reason is that I can't understand German, so it's difficult for me to evaluate and judge. If anyone finds a better German example, you are welcome to post it in our Discord sharing channel, and I will update it on our project page.
-
I found that increasing the guidance interval often leads to better tag following, fewer lyrics artefacts (such as skipped or mixed-up lines) and cleaner vocals, but it also makes the song less varied. 0.5 is a good starting point, and one might experiment with going up to 0.75. At 1.0, however, the vocals become slightly distorted and too loud (which is a problem in ComfyUI, since its native workflow doesn't expose this setting).

I experimented with Russian lyrics and the results were generally pretty good. A few things are noticeable, though: word stresses are sometimes wrong, and a slight foreign accent can be heard sometimes, but not always. Wrong stresses can be dealt with using the únicode áccent márk (illustrated), known as the combining acute accent (U+0301), placed right after the stressed vowel (see the first snippet after this comment).

There's a certain nuance with choruses: if a song has different versions of a chorus that don't quite match, the model may mix up their words and lines. There's an artistic device, probably not used often enough, where two identical choruses are followed by a modified one at the end of the song. It breaks expectations and hits hard (if done right), but ACE-Step really doesn't like it. I wonder if attention weights could fix it, the same way it's done in ComfyUI/A1111: you mark (a part of the sentence:1.4) like this, and the bracketed tokens get their attention multiplied by the coefficient, 1.4 in this case. It helps enforce concepts the model is otherwise ignoring. Just an idea; I don't know how attention is implemented here or whether it would work (also sketched after this comment).

Specifying BPM works often enough to stabilize the tempo and the beat. It may not be precise, but it gets quite close. Sometimes the beat gets messed up anyway and the song devolves into a literal cacophony.

It was hard to make synthwave/retrowave songs (more misses than hits); the model prefers to make alt rock/pop instead. It could be a good LoRA candidate, because sometimes it works and just needs to be reinforced.

One pattern I started noticing: on the last line of a verse or chorus the music stops, then starts again on the next line, creating a drop effect. While it sounds kinda cool, it might get old fast. I suppose the model can't produce a smooth enough transition between song parts, so it resorts to a proven, working drop.

The popular easy-listening genres such as pop or funk work extremely well; the songs are often very catchy and deserve to be played on the radio!

The thing I'd personally love to have is key changes: a simple trick, but it can add a lot of emotion if done right. The problem, though, is that modern human-produced music has almost lost it as well! See https://tedium.co/2022/11/09/the-death-of-the-key-change/
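A tiny illustration of the stress trick (plain Python, nothing ACE-Step-specific); the mark is just a combining character appended after the stressed vowel:

```python
# U+0301 COMBINING ACUTE ACCENT goes right after the stressed vowel.
castle = "за\u0301мок"  # renders as "за́мок" (stress on the first syllable, "castle")
lock = "замо\u0301к"    # renders as "замо́к" (stress on the second syllable, "lock")
print(castle, lock)
```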
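And to make the attention-weighting idea concrete, here is a rough sketch of how the A1111/ComfyUI-style `(text:weight)` syntax could be parsed. The helper names are made up, and whether multiplying the matched tokens' embeddings would actually steer ACE-Step is untested:

```python
import re

# Matches "(some text:1.4)"; group 1 is the text, group 2 the weight.
WEIGHT_RE = re.compile(r"\(([^():]+):([0-9.]+)\)")

def parse_weights(prompt: str) -> list[tuple[str, float]]:
    """Split a prompt into (text, weight) chunks; unmarked text gets 1.0."""
    chunks, pos = [], 0
    for m in WEIGHT_RE.finditer(prompt):
        if m.start() > pos:
            chunks.append((prompt[pos:m.start()], 1.0))
        chunks.append((m.group(1), float(m.group(2))))
        pos = m.end()
    if pos < len(prompt):
        chunks.append((prompt[pos:], 1.0))
    return chunks

print(parse_weights("soft vocals, (driving synth bass:1.4), 120 bpm"))
# [('soft vocals, ', 1.0), ('driving synth bass', 1.4), (', 120 bpm', 1.0)]
```

Downstream, each chunk's token embeddings (or attention scores) would be multiplied by the chunk's weight before cross-attention, which is roughly what A1111 does.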
-
Interesting thing I just found with the parameters: even if you're making instrumental music, it's important to put some weight (like 0.3) on Guidance Scale Lyric. If you leave it at 0, for some reason the model doesn't do a good job of following the text tags. My guess is that when the code combines the conditional and unconditional predictions, there's some funky divide-by-zero issue going on when the scale is exactly 0 (a sketch of what I mean follows this comment).
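To illustrate the guess, here is a minimal sketch of a generic two-condition classifier-free guidance mix; the names, default scales, and structure are my assumptions, not ACE-Step's actual code:

```python
def cfg_mix(eps_uncond, eps_tag, eps_lyric,
            scale_tag=15.0, scale_lyric=0.3):
    # Each condition pulls the prediction away from the unconditional
    # branch in proportion to its scale. A scale of 0.0 removes the
    # lyric term entirely, so the lyrics no longer steer the output.
    return (eps_uncond
            + scale_tag * (eps_tag - eps_uncond)
            + scale_lyric * (eps_lyric - eps_uncond))
```

In a clean mix like this, 0 would only disable the lyric term without touching the tags, which is why a numerical issue somewhere (e.g. normalizing by the scale) seems plausible when tag following degrades too. That's purely speculation on my part, though.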
-
I've noticed that the updated version has increased the default number of diffusion steps. In my experience there isn't much difference between many steps and fewer (as long as it's not too low), but I wonder whether higher CFG needs more steps here, similar to Stable Diffusion.
Some parameter effects I noticed:
I would really appreciate some more explanation of what each parameter actually does. For a start it can even be quite technical; the effect on the generated output probably needs to be explored separately anyway. I hope we'll find more on this when the paper is published.
I know about CFG and diffusion steps, but I still need to read up on APG, and I can only guess what granularity means from the word itself. I also wonder about the effect of the "when to apply CFG" parameter; I tried changing it, but couldn't pin down its exact effect on the output (my guess is sketched below).
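My current reading of "when to apply CFG" is that it's the guidance interval: the extra unconditional pass (and the CFG mix) runs only for a fraction of the denoising steps. A minimal sketch, assuming the interval is taken from the middle of the schedule; the function and parameter names are mine:

```python
def in_guidance_interval(step: int, total_steps: int,
                         guidance_interval: float = 0.5) -> bool:
    # 0.5 -> apply CFG only in the middle 50% of steps
    # (i.e. from 25% to 75% of the schedule).
    start = int(total_steps * (0.5 - guidance_interval / 2))
    end = int(total_steps * (0.5 + guidance_interval / 2))
    return start <= step < end

# With 60 steps and interval 0.5, CFG runs on steps 15..44; the
# remaining steps do a single conditional pass (which is also faster).
```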
Please post what parameters worked well for you.