Commit 07e4dc5

Songyuanwei and SamitHuang authored

update readme (mindspore-lab#959)

* update readme
* update
* Update README.md
* Update README_CN.md
* fix typo

Co-authored-by: songyuanwei <song.yuanwei@huawei.com>
Co-authored-by: Samit <285365963@qq.com>

1 parent fbed6d8 · commit 07e4dc5
File tree: 109 files changed, +293 −201 lines changed


docs/diffusers/optimization/fp16.md
Lines changed: 2 additions & 2 deletions

@@ -18,7 +18,7 @@ There are several ways to optimize Diffusers for inference speed, such as reduci
  Optimizing for inference speed or reduced memory usage can lead to improved performance in the other category, so you should try to optimize for both whenever you can. This guide focuses on inference speed, but you can learn more about lowering memory usage in the [Reduce memory usage](memory.md) guide.

- The inference times below are obtained from generating a single 512x512 image from the prompt "a photo of an astronaut riding a horse on mars" with 50 DDIM steps on a Ascend 910B in Graph mode.
+ The inference times below are obtained from generating a single 512x512 image from the prompt "a photo of an astronaut riding a horse on mars" with 50 DDIM steps on a Ascend Atlas 800T A2 machine in Graph mode.

  | setup | latency | speed-up |
  |----------|---------|----------|

@@ -48,7 +48,7 @@ You could also use a distilled Stable Diffusion model and autoencoder to speed u
  Read the [Open-sourcing Knowledge Distillation Code and Weights of SD-Small and SD-Tiny](https://huggingface.co/blog/sd_distillation) blog post to learn more about how knowledge distillation training works to produce a faster, smaller, and cheaper generative model.

- The inference times below are obtained from generating 4 images from the prompt "a photo of an astronaut riding a horse on mars" with 25 PNDM steps on a Ascend 910B. Each generation is repeated 3 times with the distilled Stable Diffusion v1.4 model by [Nota AI](https://hf.co/nota-ai).
+ The inference times below are obtained from generating 4 images from the prompt "a photo of an astronaut riding a horse on mars" with 25 PNDM steps on a Ascend Atlas 800T A2 machine. Each generation is repeated 3 times with the distilled Stable Diffusion v1.4 model by [Nota AI](https://hf.co/nota-ai).

  | setup | latency | speed-up |
  |------------------------------|---------|----------|
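The changed lines above describe benchmark tables whose "speed-up" column is each setup's latency measured against a baseline. As a minimal sketch of that derivation (the latency values below are hypothetical placeholders, not measurements from the docs):

```python
# Sketch of how a "speed-up" column is derived from a latency column:
# every setup is compared against the full-precision baseline.

def speed_up(baseline_latency_s: float, latency_s: float) -> float:
    """Return the speed-up factor of a setup relative to the baseline."""
    return baseline_latency_s / latency_s

# Hypothetical latencies in seconds (assumption, for illustration only).
setups = {"fp32": 10.0, "fp16": 4.0}

table = {name: round(speed_up(setups["fp32"], lat), 2) for name, lat in setups.items()}
print(table)  # the baseline is 1.0x by definition; lower latency -> larger speed-up
```

By construction the baseline row always reads 1.0x, which is why the docs' tables anchor every comparison to the fp32 setup.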

docs/diffusers/stable_diffusion.md
Lines changed: 2 additions & 2 deletions

@@ -61,7 +61,7 @@ image
  <img src="https://github.com/user-attachments/assets/67b06273-9081-4b4f-a31f-585b23f70f27">
  </div>

- This process took ~5.6 seconds on a Ascend 910B in Graph mode. By default, the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) runs inference with full `float32` precision for 50 inference steps. You can speed this up by switching to a lower precision like `float16` or running fewer inference steps.
+ This process took ~5.6 seconds on a Ascend Atlas 800T A2 machine in Graph mode. By default, the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) runs inference with full `float32` precision for 50 inference steps. You can speed this up by switching to a lower precision like `float16` or running fewer inference steps.

  Let's start by loading the model in `float16` and generate an image:

@@ -163,7 +163,7 @@ make_image_grid(images, rows=2, cols=4)
  <img src="https://github.com/user-attachments/assets/5028a23d-7acd-4bb0-8633-38f8371eb393">
  </div>

- Whereas before you couldn't even generate a batch of 4 images, now you can generate a batch of 8 images at ~1.6 seconds per image! This is probably the fastest you can go on a Ascend 910B without sacrificing quality.
+ Whereas before you couldn't even generate a batch of 4 images, now you can generate a batch of 8 images at ~1.6 seconds per image! This is probably the fastest you can go on a Ascend Atlas 800T A2 machine without sacrificing quality.

  ## Quality
docs/diffusers/using-diffusers/marigold_usage.md
Lines changed: 1 addition & 1 deletion

@@ -131,7 +131,7 @@ Points on the shoulders pointing up with a large `Y` promote green color.
  ### Speeding up inference

  The above quick start snippets are already optimized for speed: they load the LCM checkpoint, use the `fp16` variant of weights and computation, and perform just one denoising diffusion step.
- The `pipe(image)` call completes in 180ms on Ascend 910B in Graph mode.
+ The `pipe(image)` call completes in 180ms on Ascend Atlas 800T A2 machines in Graph mode.
  Internally, the input image is encoded with the Stable Diffusion VAE encoder, then the U-Net performs one denoising step, and finally, the prediction latent is decoded with the VAE decoder into pixel space.
  In this case, two out of three module calls are dedicated to converting between pixel and latent space of LDM.
  Because Marigold's latent space is compatible with the base Stable Diffusion, it is possible to speed up the pipeline call by more than 3x (85ms on RTX 3090) by using a [lightweight replacement of the SD VAE](../api/models/autoencoder_tiny.md):
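The quoted passage explains that two of the three module calls in Marigold's one-step pipeline are VAE pixel/latent conversions, which is why swapping in a tiny VAE yields more than a 3x speed-up. A back-of-envelope sketch, with hypothetical per-module timings (only the 180 ms total comes from the docs; the split and the tiny-VAE cost factor are assumptions):

```python
# Why replacing the SD VAE speeds up a one-step pipeline by >3x:
# if the two VAE calls dominate the budget, shrinking them collapses the total.

# Hypothetical split of the 180 ms pipeline call (assumption, for illustration).
timings_ms = {"vae_encode": 70, "unet_step": 40, "vae_decode": 70}
total_ms = sum(timings_ms.values())  # 180 ms, matching the docs' figure

tiny_vae_factor = 10  # assume the tiny VAE is ~10x cheaper than the SD VAE
optimized_ms = (
    timings_ms["unet_step"]
    + (timings_ms["vae_encode"] + timings_ms["vae_decode"]) / tiny_vae_factor
)
print(total_ms / optimized_ms)  # exceeds 3x whenever the VAE calls dominate
```

The exact factor depends on the real per-module split, but the structural point holds: with one denoising step, the VAE, not the U-Net, is the bottleneck.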

examples/animatediff/README.md
Lines changed: 2 additions & 2 deletions

@@ -4,7 +4,7 @@ This repository is the MindSpore implementation of [AnimateDiff](https://arxiv.o
  ## Features

- - [x] Text-to-video generation with AnimdateDiff v2, supporting 16 frames @512x512 resolution on Ascend 910*
+ - [x] Text-to-video generation with AnimdateDiff v2, supporting 16 frames @512x512 resolution on Ascend Atlas 800T A2 machines
  - [x] MotionLoRA inference
  - [x] Motion Module Training
  - [X] Motion LoRA Training

@@ -253,7 +253,7 @@ Here are some generation results after lora fine-tuning on 512x512 resolution an
  ## Performance (AnimateDiff v2)

- Experiments are tested on ascend 910* graph mode.
+ Experiments are tested on Ascend Atlas 800T A2 machines with graph mode.

  ### Inference
examples/animatediff/ad/modules/attention.py
Lines changed: 1 addition & 1 deletion

@@ -121,7 +121,7 @@ def __init__(self, dim, heads=4, dim_head=32):
  class CrossAttention(nn.Cell):
      """
-     Flash attention doesnot work well (leading to noisy images) for SD1.5-based models on 910B up to MS2.2.1-20231122 version,
+     Flash attention doesnot work well (leading to noisy images) for SD1.5-based models on Ascend Atlas 800T A2 machines up to MS2.2.1-20231122 version,
      due to the attention head dimension is 40, num heads=5. Require test on future versions
      """

examples/animatediff/args_train.py
Lines changed: 6 additions & 1 deletion

@@ -40,7 +40,12 @@ def parse_args():
  )
  # ms
  parser.add_argument("--device_target", type=str, default="Ascend", help="Ascend or GPU")
- parser.add_argument("--max_device_memory", type=str, default=None, help="e.g. `30GB` for 910a, `59GB` for 910b")
+ parser.add_argument(
+     "--max_device_memory",
+     type=str,
+     default=None,
+     help="e.g. `30GB` for Ascend 910, `59GB` for Ascend Atlas 800T A2 machines",
+ )
  parser.add_argument("--mode", default=0, type=int, help="Specify the mode: 0 for graph mode, 1 for pynative mode")
  parser.add_argument("--use_parallel", default=False, type=str2bool, help="use parallel")
  parser.add_argument(
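The hunk above only reflows the `--max_device_memory` definition and updates its help text. A self-contained sketch of the argument as it behaves after the change (the `build_parser` wrapper is added here for illustration; in the repo the argument lives inside `parse_args`, and the parsed string is typically forwarded to MindSpore's device-memory context setting):

```python
# Standalone sketch of the --max_device_memory argument from the diff above.
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical wrapper around the repo's parse_args() argument setup."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--max_device_memory",
        type=str,
        default=None,
        help="e.g. `30GB` for Ascend 910, `59GB` for Ascend Atlas 800T A2 machines",
    )
    return parser


# When omitted, the value stays None and the framework default applies.
default_args = build_parser().parse_args([])
# When given, the raw string (including the unit suffix) is kept as-is.
args = build_parser().parse_args(["--max_device_memory", "59GB"])
print(args.max_device_memory)  # -> 59GB
```

Keeping the value a plain string (rather than parsing the `GB` suffix) matches how the downstream context setter consumes it.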

examples/animatediff/train.py
Lines changed: 1 addition & 1 deletion

@@ -459,7 +459,7 @@ def main(args):
      use_lora=args.motion_lora_finetune,
      lora_rank=args.motion_lora_rank,
      param_save_filter=[".temporal_transformer."] if args.save_mm_only else None,
-     record_lr=False,  # TODO: check LR retrival for new MS on 910b
+     record_lr=False,  # TODO: check LR retrival for new MS on Ascend Atlas 800T A2 machines
  )
  callback.append(save_cb)
  if args.profile:

examples/autoencoders/README.md
Lines changed: 1 addition & 1 deletion

@@ -66,7 +66,7 @@ For detailed arguments, please run `python infer.py -h`.
  ### Performance

- We split the CelebA-HQ dataset into 24,000 images for training and 6,000 images for testing. Experiments are tested on ascend 910* with graph mode.
+ We split the CelebA-HQ dataset into 24,000 images for training and 6,000 images for testing. Experiments are tested on Ascend Atlas 800T A2 machines with graph mode.

  - mindspore 2.5.0
examples/diffusers/cogvideox_factory/README.md
Lines changed: 3 additions & 3 deletions

@@ -2,7 +2,7 @@
  Fine-tune the Cog family of video models on Ascend hardware for custom video generation ⚡️📼

- > Our development and validation are based on Ascend 910* hardware, with the following environment:
+ > Our development and validation are based on Ascend Atlas 800T A2 hardware, with the following environment:
  > | mindspore | ascend driver | firmware | cann toolkit/kernel |
  > |:----------:|:--------------:|:-----------:|:------------------:|
  > | 2.5 | 24.1.RC2 | 7.5.0.1.129 | 8.0.0.beta1 |

@@ -379,15 +379,15 @@ NODE_RANK="0"
  | CogvideoX 1.5 T2V 20B | 8 | 2 | 4 | zero3 | ON | 1x77x768x1360 | bf16 | O1 | 20.1 | 35.7 GB |
  | CogvideoX 1.5 T2V 30B | 8 | 2 | 4 | zero3 | ON | 1x77x768x1360 | bf16 | O1 | 26.5 | 47.3 GB |

- The data above were obtained on the Disney dataset, on 910* hardware.
+ The data above were obtained on the Disney dataset, on Ascend Atlas 800T A2 training servers.

  ### Inference

  | model | cards | DP | SP | zero | video shape | precision | jit level | s/step | total cost |
  |:-----------------:|:-----:|:--:|:--:|:-----:|:-------------:|:---------:|:---------:|:------:|:----------:|
  | CogvideoX 1.5 T2V 5B | 8 | 1 | 8 | zero3 | 1x77x768x1360 | bf16 | O1 | 3.21 | ~ 5min |

- The data above were obtained on 910* hardware.
+ The data above were obtained on Ascend Atlas 800T A2 training servers.

  ## Differences from the original repo & feature limitations
examples/diffusers/controlnet/README_flux.md
Lines changed: 1 addition & 1 deletion

@@ -44,7 +44,7 @@ We also support importing data from jsonl(xxx.jsonl),using `--jsonl_for_train` t
  ## Training

- Our experiments were conducted on a single 64GB 910* NPU.
+ Our experiments were conducted on a single 64GB Ascend Atlas 800T A2 NPU.

  We can define the num_layers, num_single_layers, which determines the size of the control.