
First, I created a sparse attention module for Flux 1.D and tested it. Some thoughts and questions... #107

@ukaprch

Description

Using the Wan model as a starting point, I created a Flux 1.D version. I normally use SageAttention and wanted to see what speed-up or quality differences SpargeAttn might bring. I am using the current GitHub diffusers library for testing.

  1. I did not notice any speed boost using SpargeAttn vs. SageAttention. I compile both and run as usual, using the pre-built Windows 10 wheels from https://github.com/woct0rdho/SpargeAttn/releases and https://github.com/woct0rdho/SageAttention/releases together with their matching Triton wheels. Does SpargeAttn only pay off for very long prompts (high token counts)? I only tested with a short prompt: a cat holding a sign that says "Hello World".
    For my test I used the top-k mode via the SpargeAttn API spas_sage2_attn_meansim_topk_cuda with a value of 0.5.
    I also noticed that SpargeAttn did not adhere to the text portion of my prompt as well as SageAttention did.
    I applied SpargeAttn to the Flux transformer in both places (see the sketch after this list):
    transformer_blocks (self-attention)
    single_transformer_blocks (cross-attention)
    Should I omit cross-attention for SpargeAttn?

  2. It is unclear what values you would recommend for mode = "cdfthreshd". A quick search suggests 90%-95% (i.e. 0.90-0.95). Is that correct?

  3. Are you folks even considering a version for Flux 1.D?
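For reference, here is a minimal sketch of the kind of hook I am experimenting with: monkey-patching torch.nn.functional.scaled_dot_product_attention (which the diffusers Flux attention processors call), and using forward hooks so the single_transformer_blocks could be kept on the stock kernel if the answer to my cross-attention question is to omit them. The kernel name spas_sage2_attn_meansim_topk_cuda is the one from my test above; the spas_sage_attn import path, the topk keyword name, and the assumption that the kernel accepts SDPA-layout (batch, heads, seq, head_dim) tensors are my guesses for illustration, not the documented interface.

```python
# Minimal sketch of the test setup described above.
# Assumptions: import path, topk kwarg name, and SDPA-compatible tensor layout.
import torch
import torch.nn.functional as F
from diffusers import FluxPipeline
from spas_sage_attn import spas_sage2_attn_meansim_topk_cuda  # assumed import path

_orig_sdpa = F.scaled_dot_product_attention
_use_sparge = True  # toggled off around blocks kept on the stock kernel


def sparge_sdpa(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False, scale=None, **kwargs):
    # Drop-in replacement for F.scaled_dot_product_attention, which the diffusers
    # Flux attention processors call with (batch, heads, seq, head_dim) tensors.
    if not _use_sparge or attn_mask is not None or is_causal:
        return _orig_sdpa(q, k, v, attn_mask=attn_mask, dropout_p=dropout_p,
                          is_causal=is_causal, scale=scale, **kwargs)
    # topk=0.5 is the value from my test; the cdfthreshd-style control
    # (e.g. 0.90-0.95, per question 2) would be swapped in here instead.
    return spas_sage2_attn_meansim_topk_cuda(q, k, v, topk=0.5)  # assumed kwarg name


F.scaled_dot_product_attention = sparge_sdpa

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")


# If the answer to the cross-attention question is "omit it", these hooks keep the
# single-stream blocks on the original kernel while the dual-stream blocks stay sparse.
def _disable_sparge(*_):
    global _use_sparge
    _use_sparge = False


def _enable_sparge(*_):
    global _use_sparge
    _use_sparge = True


for block in pipe.transformer.single_transformer_blocks:
    block.register_forward_pre_hook(_disable_sparge)
    block.register_forward_hook(_enable_sparge)

image = pipe('a cat holding a sign that says "Hello World"', num_inference_steps=28).images[0]
image.save("sparge_test.png")
```

If the transformer is compiled with torch.compile, the patch needs to be in place before the first compiled forward pass so the compiled graph picks up the sparse kernel.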

Thanks for your wonderful project.
