
Commit a3fcd90

suluyan and gemini-code-assist[bot] authored
feat: video readme_en (#856)
Co-authored-by: suluyan <suluyan.sly@alibaba-inc.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
1 parent 58ff89e commit a3fcd90

2 files changed: +193 −107 lines changed


projects/singularity_cinema/README.md

Lines changed: 5 additions & 5 deletions
@@ -132,7 +132,7 @@ video_generator:
 
 ---
 
-### 3) Example Run Commands
+### 4) Example Run Commands
 
 Based on the default YAML, override key configurations for LLM / MLLM / text-to-image / text-to-video via the command line.
 
@@ -175,7 +175,7 @@ ms-agent run --project singularity_cinema \
 
 ---
 
-### 4) Output and Failure Retry
+### 5) Output and Failure Retry
 
 - The run takes about 20 minutes.
 - The generated video is written to `output_video/` under the command execution directory (controlled by `--output_dir`) as final_video.mp4
@@ -202,17 +202,17 @@ ms-agent run --project singularity_cinema \
 - Output: updated remotion_code/segment_N.py files
 6. Render Remotion code
 - Input: remotion_code/segment_N.py
-- Output: list of remotion_render/scene_N folders; if a step in segments.txt includes manim requirements, the corresponding folder will contain a remotion.mov file
+- Output: list of remotion_render/scene_N folders; if a step in segments.txt includes remotion requirements, the corresponding folder will contain a remotion.mov file
 7. Generate text-to-image prompts
 - Input: segments.txt
 - Output: illustration_prompts/segment_N.txt (N is the segment number, starting from 1)
 8. Text-to-image generation
 - Input: list of illustration_prompts/segment_N.txt
 - Output: list of images/illustration_N.png (N starting from 1)
-10. Generate the background: a solid-color image with the short-video title and slogans
+9. Generate the background: a solid-color image with the short-video title and slogans
 - Input: title.txt
 - Output: background.jpg
-11. Compose the final video
+10. Compose the final video
 - Input: all files from previous steps. This step has a long period with no log output and does not consume tokens.
 - Output: final_video.mp4
 ---
Lines changed: 188 additions & 102 deletions
@@ -1,152 +1,238 @@
 # SingularityCinema
 
-A lightweight and excellent short video generator
+A lightweight short-video generator: it uses large language models to generate a **script and storyboard**, then automatically produces **voice-over / (optional) subtitles / images / (optional) text-to-video**, and finally composes them into a short video.
+
+---
+
+## Showcase
+
+[![Video Preview](./show_case/deploy_llm.png)](http://modelscope.oss-cn-beijing.aliyuncs.com/ms-agent/show_case/video/deploy_llm_claude_sonnet_4_5_mllm_gemini_3_pro_image_gen_gemini_3_pro_image.mp4)
+[![Video Preview](./show_case/silu.png)](http://modelscope.oss-cn-beijing.aliyuncs.com/ms-agent/show_case/video/silu_claude_sonnet_4_5_mllm_gemini_3_pro_image_gen_gemini_3_pro_image.mp4)
+[![Video Preview](./show_case/deploy_llm_en.png)](http://modelscope.oss-cn-beijing.aliyuncs.com/ms-agent/show_case/video/en_deploy_llm_claude_sonnet_4_5_mllm_gemini_3_pro_image_gen_gemini_3_pro_image.mp4)
 
 ## Installation
 
-This project requires Python and Node.js environments.
+This project requires both Python and Node.js.
 
-1. **Prerequisites**
-   - **Python**: Version >= 3.10 is required. Using [Conda](https://docs.conda.io/projects/conda/en/stable/user-guide/install/index.html) is recommended.
-   - **Node.js**: Required if you use the default Remotion engine. Install [Node.js](https://nodejs.org/) (Version >= 16 recommended).
-   - **FFmpeg**: Install [ffmpeg](https://www.ffmpeg.org/download.html#build-windows) and add it to your PATH.
+1. **Environment setup**
+   - **Python**: version >= 3.10. Using [Conda](https://docs.conda.io/projects/conda/en/stable/user-guide/install/index.html) to create a virtual environment is recommended.
+   - **Node.js**: if you use the default Remotion engine to generate videos, you must install [Node.js](https://nodejs.org/) (recommended version >= 16).
+   - **FFmpeg**: install [ffmpeg](https://www.ffmpeg.org/download.html#build-windows) and add it to your environment variables.
 
-2. **Clone Code**
-   ```shell
+2. **Get the code**
+   ```bash
   git clone https://github.com/modelscope/ms-agent.git
   cd ms-agent
   ```
 
-3. **Install Python Dependencies**
-   ```shell
+3. **Install dependencies**
+   ```bash
   pip install .
   cd projects/singularity_cinema
   pip install -r requirements.txt
   ```
 
+---
+
 ## Compatibility and Limitations
 
-SingularityCinema generates scripts and storyboards based on large language models and produces short videos.
+SingularityCinema generates scripts and storyboards using LLMs and produces short videos.
 
 ### Compatibility
-
-- Short video types: Educational, economic videos, especially those containing charts, formulas, and principle explanations
-- Language: No restrictions, subtitles and voice follow your original query and document materials
-- Reading external materials: Supports plain text, does not support multimodal
-- Secondary development: Complete code is in stepN/agent.py with no license restrictions, free for secondary development and commercial use
-- Please note and comply with the commercial licenses of background music and fonts you use
+- Short video types: science popularization, economics (especially those involving charts/tables, formulas, and principle explanations)
+- Languages: unlimited (subtitle and voice-over language follow your query and materials)
+- External materials: supports reading plain text (does not support direct multimodal material input)
+- Secondary development: the workflow can be found in `projects/singularity_cinema/workflow.yaml`; the core implementation is under `projects/singularity_cinema`,
+  in each step’s `agent.py`, which can be extended and used commercially.
+- Please note and comply with the commercial licenses of background music, fonts, etc. that you use.
 
 ### Limitations
+- The quality varies significantly across different LLM/AIGC models. It is recommended to use verified combinations and test on your own. The current default configuration can be found in `projects/singularity_cinema/agent.yaml`.
 
-- LLM test range: Claude, effects with other models untested
-- AIGC model test range: Qwen-Image, effects with other models untested
+---
 
-## Running
+## Run
 
-1. Prepare API Key
+### 1) Prepare API Keys
 
-### Prepare LLM Key
-
-Taking Claude as an example, you need to first apply for or purchase Claude model access. The Claude Key can be set in environment variables:
+**Prepare an LLM key**
 
+Using Gemini as an example, you need to apply for or purchase access to Gemini models. Runtime parameters:
 ```shell
-OPENAI_API_KEY=xxx-xxx
+--llm.openai_base_url https://generativelanguage.googleapis.com/v1beta/openai/ \
+--llm.model gemini-3-pro \
+--llm.openai_api_key {your_api_key_of_openai_base_url} \
 ```
 
-### Prepare ModelScope Text-to-Image Key
-
-The default model is currently Qwen-Image. The ModelScope API Key can be applied for [here](https://www.modelscope.cn/my/myaccesstoken). Then set it in environment variables:
+**Prepare a text-to-image model key**
 
+Using ModelScope’s Qwen/Qwen-Image-2512 as an example. ModelScope provides a small free quota per account daily. If you hit rate limits during high-frequency usage, simply rerun the same command to retry; it will resume from the failure point.
 ```shell
-T2I_API_KEY=ms-xxx-xxx
+--image_generator.api_key {your_modelscope_api_key} \
+--image_generator.type modelscope \
+--image_generator.model Qwen/Qwen-Image-2512 \
 ```
 
-### Prepare an MLLM to check animation layouts
+**Prepare a multimodal LLM for quality inspection**
 
+Using Gemini as an example, you need to apply for or purchase access to Gemini models. Runtime parameters:
 ```shell
-MANIM_TEST_API_KEY=xxx-xxx
+--mllm_openai_base_url https://generativelanguage.googleapis.com/v1beta/openai/ \
+--mllm_openai_api_key {your_api_key_of_mllm_openai_base_url} \
+--mllm_model gemini-3-pro \
 ```
 
-2. Prepare your short video materials
+### 2) Prepare materials (optional)
 
-You can choose to generate a video with a single sentence, for example:
+You can generate a video with just one sentence, for example:
+```text
+Generate a short video describing GDP-related economics knowledge, about 3 minutes long.
+```
 
+You can also reference a local text file, for example:
 ```text
-Generate a short video describing GDP economic knowledge, approximately 3 minutes long.
+Generate a short video describing large-model technologies. Read /home/user/llm.txt for details.
 ```
 
-Or use your previously collected text materials:
+---
 
-```text
-Generate a short video describing large language model technology, read /home/user/llm.txt for detailed content
+### 3) Configuration Notes
+
+The current default configuration is in `projects/singularity_cinema/agent.yaml`. At runtime, command-line arguments override the corresponding default parameters in the YAML. Specifically:
+
+- If a field name is **unique** in the config, you can override it directly with the same-name argument, e.g.:
+  - `--openai_api_key ...`
+- If a field name is **not unique / conflicts** (e.g., multiple modules have `api_key`), you can specify it using a **multi-level path**, e.g.:
+  - `--image_generator.api_key ...`
+  - `--video_generator.api_key ...`
+
+> Rule of thumb:
+> - “Unique field” uses `--field`
+> - “Nested / potentially conflicting field” uses `--a.b.c`
+
+Relevant structure in the default YAML (excerpt):
+```yaml
+llm:
+  model: claude-sonnet-4-5-20250929  # LLM model name (e.g., gemini-3-pro)
+  openai_api_key: ""  # Required: API Key (provider key matching openai_base_url)
+  openai_base_url: "https://dashscope.aliyuncs.com/compatible-mode/v1"  # OpenAI-compatible Base URL (varies by provider)
+
+mllm:
+  mllm_model: gemini-3-pro-preview  # Multimodal model name (e.g., gemini-3-pro)
+  mllm_openai_api_key: ""  # Required: multimodal model API Key
+  mllm_openai_base_url: "https://dashscope.aliyuncs.com/compatible-mode/v1"  # OpenAI-compatible Base URL for MLLM
+
+image_generator:
+  api_key: ""  # Required: provider API Key
+  type: dashscope  # provider/platform type: modelscope | dashscope | google
+  model: gemini-3-pro-image-preview  # model ID/name supported by the selected type
+
+video_generator:
+  api_key: ""  # provider API Key
+  type: dashscope  # modelscope | dashscope | google
+  model: sora-2-2025-10-06  # video model ID/name supported by the selected type
 ```
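The `--field` vs `--a.b.c` override rule above can be sketched in a few lines of Python. This is an illustrative toy, not ms-agent's actual argument parsing; the config keys are taken from the YAML excerpt above.

```python
# Illustrative sketch of dotted CLI overrides applied to a nested config
# (not ms-agent's real parser; key names follow the YAML excerpt above).
def apply_override(config: dict, dotted_key: str, value) -> dict:
    """Walk the dotted path and set the leaf value in place."""
    keys = dotted_key.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})  # descend, creating levels if needed
    node[keys[-1]] = value
    return config

cfg = {
    "image_generator": {"api_key": "", "type": "dashscope"},
    "video_generator": {"api_key": "", "type": "dashscope"},
}
# `api_key` exists in two modules, so the multi-level path picks exactly one:
apply_override(cfg, "image_generator.api_key", "ms-xxx")
apply_override(cfg, "video_generator.type", "modelscope")
```

The dotted path disambiguates between the two `api_key` fields, which is why conflicting names require the `--a.b.c` form.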
 
-3. Run command
+---
+
+### 4) Example Commands
+
+Based on the default YAML, override key configurations for LLM / MLLM / text-to-image / text-to-video via the command line.
+
+Below are the two examples used to generate the video previews on this page:
+- Before running, replace `{path_to_ms-agent}` in the query with your local reference file path.
+- Replace the `api_key` values with real API keys.
+
+```bash
+# For the English version, replace the query content with:
+# "Convert /home/user/workspace/ms-agent/projects/singularity_cinema/test_files/J.部署.md
+# into a short video in a blue-themed style, making sure to use the important images from the document.
+# The short video must be in English."
+ms-agent run --project singularity_cinema \
+  --query "把/{path_to_ms-agent}/projects/singularity_cinema/test_files/J.部署.md转为短视频,蓝色风格,注意使用其中重要的图片" \
+  --trust_remote_code true \
+  --openai_base_url https://api.anthropic.com/v1/ \
+  --llm.model claude-sonnet-4-5 \
+  --openai_api_key {your_api_key_of_anthropic} \
+  --mllm_openai_base_url https://generativelanguage.googleapis.com/v1beta/openai/ \
+  --mllm_openai_api_key {your_api_key_of_gemini} \
+  --mllm_model gemini-3-pro-preview \
+  --image_generator.api_key {your_api_key_of_gemini} \
+  --image_generator.type google \
+  --image_generator.model gemini-3-pro-image-preview
+```
 
-```shell
-ms-agent run --config "projects/singularity_cinema" --query "Your custom theme, see description above" --load_cache true --trust_remote_code true
+```bash
+ms-agent run --project singularity_cinema \
+  --query "Please create a short video introducing the Silk Road, with a consistent visual style." \
+  --trust_remote_code true \
+  --openai_base_url https://api.anthropic.com/v1/ \
+  --llm.model claude-sonnet-4-5 \
+  --openai_api_key {your_api_key_of_anthropic} \
+  --mllm_openai_base_url https://generativelanguage.googleapis.com/v1beta/openai/ \
+  --mllm_openai_api_key {your_api_key_of_gemini} \
+  --mllm_model gemini-3-pro-preview \
+  --image_generator.api_key {your_api_key_of_gemini} \
+  --image_generator.type google \
+  --image_generator.model gemini-3-pro-image-preview
 ```

90-
4. The run takes approximately 20 minutes. The video is generated at output/final_video.mp4. After generation, you can review this file, compile the parts that don't meet requirements, input them into the command line input, and the workflow will continue improving. If requirements are met, input quit or exit and the program will automatically terminate.
91-
92-
5. If the execution fails, such as URL call timeout or file generation failure, you can re-run the command above. ms-agent saves execution information in the output/memory folder, and after re-running the command, it will continue from where it failed.
93-
* If you want to regenerate from scratch, please rename or move the output folder elsewhere, or delete the corresponding memory and input files.
94-
* You can delete input files for only specific scenes/shots, so that re-execution will only process those corresponding scenes/shots. This is also the principle behind the manual feedback correction in the final step.
95-
96-
## Technical Principles
97-
98-
1. Generate basic script based on user requirements
99-
* Input: User requirements, may read user-specified files
100-
* Output: Script file script.txt, original requirement file topic.txt, short video name file title.txt
101-
2. Split storyboard design based on script
102-
* Input: topic.txt, script.txt
103-
* Output: segments.txt, storyboard list describing narration, background image generation requirements, foreground manim animation requirements
104-
3. Generate audio narration for storyboards
105-
* Input: segments.txt
106-
* Output: audio/audio_N.mp3 list, N is segment number starting from 1, and root directory audio_info.txt containing audio duration
107-
4. Generate manim animation code based on voice duration
108-
* Input: segments.txt, audio_info.txt
109-
* Output: Manim code file list manim_code/segment_N.py, N is segment number starting from 1
110-
5. Fix manim code
111-
* Input: manim_code/segment_N.py N is segment number starting from 1, code_fix/code_fix_N.txt error prediction file
112-
* Output: Updated manim_code/segment_N.py files
113-
6. Render manim code
114-
* Input: manim_code/segment_N.py
115-
* Output: manim_render/scene_N folder list, if segments.txt contains manim requirements for a step, the corresponding folder will have a manim.mov file
180+
---
181+
182+
### 5) Output and Failure Retry
183+
184+
- The run typically takes about 20 minutes.
185+
- The generated video is output to `output_video/` under your command execution directory (controlled by `--output_dir`) as `final_video.mp4`.
186+
- If the run fails (timeout/interruption/missing files), you can rerun the command directly: the system will read execution info in `output_video` and resume from the breakpoint.
187+
- To regenerate from scratch: rename/delete the `output_video` directory.
188+
- To rerun only part of a storyboard: delete only the corresponding files for that segment; rerunning will execute only those segments.
189+
190+
---
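The resume-from-breakpoint behaviour described above can be sketched roughly as follows. This is a hypothetical illustration of the idea, not the project's actual code: a step runs only when its expected output file is missing, so rerunning the same command skips completed work.

```python
import tempfile
from pathlib import Path

def run_step_if_needed(output_file: Path, produce) -> str:
    """Run `produce` only when `output_file` is missing (resume semantics)."""
    if output_file.exists():
        return "skipped"  # already produced by a previous run
    output_file.parent.mkdir(parents=True, exist_ok=True)
    produce(output_file)
    return "ran"

with tempfile.TemporaryDirectory() as tmp:
    # hypothetical per-segment artifact, mirroring audio/audio_N.mp3
    target = Path(tmp) / "output_video" / "audio" / "audio_1.mp3"
    first = run_step_if_needed(target, lambda p: p.write_bytes(b"fake-mp3"))
    second = run_step_if_needed(target, lambda p: p.write_bytes(b"fake-mp3"))
```

Deleting only one segment's files therefore causes a rerun to regenerate just that segment, which is the mechanism the bullet points above describe.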
+
+## Technical Workflow
+
+1. Generate a base script from user requirements
+   - Input: user requirements; may read a user-specified file
+   - Output: script file `script.txt`, original request file `topic.txt`, short-video title file `title.txt`
+2. Split the script into storyboard segments
+   - Input: `topic.txt`, `script.txt`
+   - Output: `segments.txt`, a list of segments describing narration, background image generation requirements, and foreground Remotion animation requirements
+3. Generate audio narration for each segment
+   - Input: `segments.txt`
+   - Output: `audio/audio_N.mp3` list (N starts from 1), plus `audio_info.txt` in the root directory containing audio durations
+4. Generate Remotion animation code based on audio duration
+   - Input: `segments.txt`, `audio_info.txt`
+   - Output: Remotion code files `remotion_code/segment_N.py` (N starts from 1)
+5. Fix Remotion code
+   - Input: `remotion_code/segment_N.py` (N starts from 1), predicted-error file `code_fix/code_fix_N.txt`
+   - Output: updated `remotion_code/segment_N.py`
+6. Render Remotion code
+   - Input: `remotion_code/segment_N.py`
+   - Output: `remotion_render/scene_N` folder list; if a segment includes Remotion requirements in `segments.txt`, the corresponding folder will contain `remotion.mov`
 7. Generate text-to-image prompts
-   * Input: segments.txt
-   * Output: illustration_prompts/segment_N.txt, N is segment number starting from 1
-8. Text-to-image
-   * Input: illustration_prompts/segment_N.txt list
-   * Output: images/illustration_N.png list, N is segment number starting from 1
-9. Generate subtitles
-   * Input: segments.txt
-   * Output: subtitles/bilingual_subtitle_N.png list, N is segment number starting from 1
-10. Generate background, a solid color image with short video title and slogans
-   * Input: title.txt
-   * Output: background.jpg
-11. Composite complete video
-   * Input: All previous file information
-   * Output: final_video.mp4
-12. Human feedback
-
-## Adjustable Parameters
-
-Most adjustable parameters are in agent.yaml. Before running, you can modify this file for customization.
-
-Some important parameters are listed below:
-
-- llm: This group of parameters controls the LLM's url, apikey, etc.
-- generation_config: This group of parameters controls LLM generation parameters
-- prompt.system: Controls the system for script generation stage
-  - If you want to modify the system for storyboard generation, you can modify the system in step2_segment/agent.py
-- text2image: Text-to-image model parameters, including url, model id, etc.
-- t2i_transition: Background image effect, default is ken-burns effect
-- t2i_style: Image style, you can set your desired text-to-image style
-- t2i_num_parallel: Text-to-image call parallelism. Default is 1 to prevent rate limiting
-- llm_num_parallel: LLM call parallelism, default is 10
-- video: Video generation bitrate and other parameters
-- voice/voices: edge_tts voice settings, if you have other voice options, you can add them here
-- subtitle_translate: Multilingual subtitle language, if not set, no translation is performed
-- slogan: Displayed on the right side of the screen, generally shows producer name and short video collection
-- fonts: The recommended fonts list
+   - Input: `segments.txt`
+   - Output: `illustration_prompts/segment_N.txt` (N starts from 1)
+8. Text-to-image generation
+   - Input: list of `illustration_prompts/segment_N.txt`
+   - Output: list of `images/illustration_N.png` (N starts from 1)
+9. Generate a background image (solid color) with the short-video title and slogans
+   - Input: `title.txt`
+   - Output: `background.jpg`
+10. Compose the final video
+    - Input: all files from previous steps. This step may take a long time with no logs and does not consume tokens.
+    - Output: `final_video.mp4`
+
+---
+
+## Tunable Parameters (Overview)
+
+Most parameters are in the default `agent.yaml`. Recommended practice: **do not modify the default YAML**; override what you need via command-line arguments.
+
+Common examples:
+- LLM/MLLM: `--openai_base_url`, `--openai_api_key`, `--llm.model`, `--mllm_model`, etc.
+- Text-to-image / Text-to-video:
+  - `--image_generator.type`, `--image_generator.model`, `--image_generator.api_key`
+  - `--video_generator.type`, `--video_generator.model`, `--video_generator.api_key`
+- Parallelism: `--t2i_num_parallel`, `--t2v_num_parallel`, `--llm_num_parallel`
+- Video params: `--video.fps`, `--video.bitrate`, etc.
+- Toggles: `--use_subtitle`, `--use_text2video`, `--use_doc_image`, etc.
