
Commit a3fcd90

suluyan and gemini-code-assist[bot] authored
feat: video readme_en (#856)
Co-authored-by: suluyan <suluyan.sly@alibaba-inc.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
1 parent 58ff89e commit a3fcd90

2 files changed: +193 −107 lines changed


projects/singularity_cinema/README.md

Lines changed: 5 additions & 5 deletions
@@ -132,7 +132,7 @@ video_generator:
 
 ---
 
-### 3) Example Run Commands
+### 4) Example Run Commands
 
 Based on the default YAML, override key configurations for LLM / MLLM / text-to-image / text-to-video via the command line.
 
@@ -175,7 +175,7 @@ ms-agent run --project singularity_cinema \
 
 ---
 
-### 4) Output and Failure Retry
+### 5) Output and Failure Retry
 
 - The run takes about 20 minutes.
 - The generated video is written to `output_video/` under the command execution directory (controlled by `--output_dir`) as final_video.mp4
@@ -202,17 +202,17 @@ ms-agent run --project singularity_cinema \
 - Output: updated remotion_code/segment_N.py files
 6. Render Remotion code
 - Input: remotion_code/segment_N.py
-- Output: list of remotion_render/scene_N folders; if a step in segments.txt includes manim requirements, the corresponding folder will contain a remotion.mov file
+- Output: list of remotion_render/scene_N folders; if a step in segments.txt includes remotion requirements, the corresponding folder will contain a remotion.mov file
 7. Generate text-to-image prompts
 - Input: segments.txt
 - Output: illustration_prompts/segment_N.txt (N is the segment number, starting from 1)
 8. Text-to-image generation
 - Input: list of illustration_prompts/segment_N.txt
 - Output: list of images/illustration_N.png (N starting from 1)
-10. Generate the background: a solid-color image with the short-video title and slogans
+9. Generate the background: a solid-color image with the short-video title and slogans
 - Input: title.txt
 - Output: background.jpg
-11. Compose the final video
+10. Compose the final video
 - Input: all files from previous steps. This step has a long period with no log output and does not consume tokens.
 - Output: final_video.mp4
 ---
Lines changed: 188 additions & 102 deletions
@@ -1,152 +1,238 @@
 # SingularityCinema
 
-A lightweight and excellent short video generator
+A lightweight short-video generator: it uses large language models to generate a **script and storyboard**, then automatically produces **voice-over / (optional) subtitles / images / (optional) text-to-video**, and finally composes them into a short video.
+
+---
+
+## Showcase
+
+[![Video Preview](./show_case/deploy_llm.png)](http://modelscope.oss-cn-beijing.aliyuncs.com/ms-agent/show_case/video/deploy_llm_claude_sonnet_4_5_mllm_gemini_3_pro_image_gen_gemini_3_pro_image.mp4)
+[![Video Preview](./show_case/silu.png)](http://modelscope.oss-cn-beijing.aliyuncs.com/ms-agent/show_case/video/silu_claude_sonnet_4_5_mllm_gemini_3_pro_image_gen_gemini_3_pro_image.mp4)
+[![Video Preview](./show_case/deploy_llm_en.png)](http://modelscope.oss-cn-beijing.aliyuncs.com/ms-agent/show_case/video/en_deploy_llm_claude_sonnet_4_5_mllm_gemini_3_pro_image_gen_gemini_3_pro_image.mp4)
 
 ## Installation
 
-This project requires Python and Node.js environments.
+This project requires both Python and Node.js.
 
-1. **Prerequisites**
-   - **Python**: Version >= 3.10 is required. Using [Conda](https://docs.conda.io/projects/conda/en/stable/user-guide/install/index.html) is recommended.
-   - **Node.js**: Required if you use the default Remotion engine. Install [Node.js](https://nodejs.org/) (Version >= 16 recommended).
-   - **FFmpeg**: Install [ffmpeg](https://www.ffmpeg.org/download.html#build-windows) and add it to your PATH.
+1. **Environment setup**
+   - **Python**: version >= 3.10. Using [Conda](https://docs.conda.io/projects/conda/en/stable/user-guide/install/index.html) to create a virtual environment is recommended.
+   - **Node.js**: if you use the default Remotion engine to generate videos, you must install [Node.js](https://nodejs.org/) (recommended version >= 16).
+   - **FFmpeg**: install [ffmpeg](https://www.ffmpeg.org/download.html#build-windows) and add it to your environment variables.
 
-2. **Clone Code**
-   ```shell
+2. **Get the code**
+   ```bash
   git clone https://github.com/modelscope/ms-agent.git
   cd ms-agent
   ```
 
-3. **Install Python Dependencies**
-   ```shell
+3. **Install dependencies**
+   ```bash
   pip install .
   cd projects/singularity_cinema
   pip install -r requirements.txt
   ```
 
+---
+
 ## Compatibility and Limitations
 
-SingularityCinema generates scripts and storyboards based on large language models and produces short videos.
+SingularityCinema generates scripts and storyboards using LLMs and produces short videos.
 
 ### Compatibility
-
-- Short video types: Educational, economic videos, especially those containing charts, formulas, and principle explanations
-- Language: No restrictions, subtitles and voice follow your original query and document materials
-- Reading external materials: Supports plain text, does not support multimodal
-- Secondary development: Complete code is in stepN/agent.py with no license restrictions, free for secondary development and commercial use
-- Please note and comply with the commercial licenses of background music and fonts you use
+- Short video types: science popularization, economics (especially those involving charts/tables, formulas, and principle explanations)
+- Languages: unlimited (subtitle and voice-over language follow your query and materials)
+- External materials: supports reading plain text (does not support direct multimodal material input)
+- Secondary development: the workflow can be found in `projects/singularity_cinema/workflow.yaml`; the core implementation is under `projects/singularity_cinema`,
+  in each step’s `agent.py`, which can be extended and used commercially.
+- Please note and comply with the commercial licenses of background music, fonts, etc. that you use.
 
 ### Limitations
+- The quality varies significantly across different LLM/AIGC models. It is recommended to use verified combinations and test on your own. The current default configuration can be found in `projects/singularity_cinema/agent.yaml`.
 
-- LLM test range: Claude, effects with other models untested
-- AIGC model test range: Qwen-Image, effects with other models untested
+---
 
-## Running
+## Run
 
-1. Prepare API Key
+### 1) Prepare API Keys
 
-### Prepare LLM Key
-
-Taking Claude as an example, you need to first apply for or purchase Claude model access. The Claude Key can be set in environment variables:
+**Prepare an LLM key**
 
+Using Gemini as an example, you need to apply for or purchase access to Gemini models. Runtime parameters:
 ```shell
-OPENAI_API_KEY=xxx-xxx
+--llm.openai_base_url https://generativelanguage.googleapis.com/v1beta/openai/ \
+--llm.model gemini-3-pro \
+--llm.openai_api_key {your_api_key_of_openai_base_url} \
 ```
 
-### Prepare ModelScope Text-to-Image Key
-
-The default model is currently Qwen-Image. The ModelScope API Key can be applied for [here](https://www.modelscope.cn/my/myaccesstoken). Then set it in environment variables:
+**Prepare a text-to-image model key**
 
+Using ModelScope’s Qwen/Qwen-Image-2512 as an example. ModelScope provides a small free quota per account daily. If you hit rate limits during high-frequency usage, simply rerun the same command to retry; it will resume from the failure point.
 ```shell
-T2I_API_KEY=ms-xxx-xxx
+--image_generator.api_key {your_modelscope_api_key} \
+--image_generator.type modelscope \
+--image_generator.model Qwen/Qwen-Image-2512 \
 ```
 
-### Prepare an MLLM to check animation layouts
+**Prepare a multimodal LLM for quality inspection**
 
+Using Gemini as an example, you need to apply for or purchase access to Gemini models. Runtime parameters:
 ```shell
-MANIM_TEST_API_KEY=xxx-xxx
+--mllm_openai_base_url https://generativelanguage.googleapis.com/v1beta/openai/ \
+--mllm_openai_api_key {your_api_key_of_mllm_openai_base_url} \
+--mllm_model gemini-3-pro \
 ```
 
-2. Prepare your short video materials
+### 2) Prepare materials (optional)
 
-You can choose to generate a video with a single sentence, for example:
+You can generate a video with just one sentence, for example:
+```text
+Generate a short video describing GDP-related economics knowledge, about 3 minutes long.
+```
 
+You can also reference a local text file, for example:
 ```text
-Generate a short video describing GDP economic knowledge, approximately 3 minutes long.
+Generate a short video describing large-model technologies. Read /home/user/llm.txt for details.
 ```
 
-Or use your previously collected text materials:
+---
 
-```text
-Generate a short video describing large language model technology, read /home/user/llm.txt for detailed content
+### 3) Configuration Notes
+
+The current default configuration is in `projects/singularity_cinema/agent.yaml`. At runtime, command-line arguments override the corresponding default parameters in the YAML. Specifically:
+
+- If a field name is **unique** in the config, you can override it directly with the same-name argument, e.g.:
+  - `--openai_api_key ...`
+- If a field name is **not unique / conflicts** (e.g., multiple modules have `api_key`), you can specify it using a **multi-level path**, e.g.:
+  - `--image_generator.api_key ...`
+  - `--video_generator.api_key ...`
+
+> Rule of thumb:
+> - “Unique field” uses `--field`
+> - “Nested / potentially conflicting field” uses `--a.b.c`
+
+Relevant structure in the default YAML (excerpt):
+```yaml
+llm:
+  model: claude-sonnet-4-5-20250929  # LLM model name (e.g., gemini-3-pro)
+  openai_api_key: ""  # Required: API Key (provider key matching openai_base_url)
+  openai_base_url: "https://dashscope.aliyuncs.com/compatible-mode/v1"  # OpenAI-compatible Base URL (varies by provider)
+
+mllm:
+  mllm_model: gemini-3-pro-preview  # Multimodal model name (e.g., gemini-3-pro)
+  mllm_openai_api_key: ""  # Required: multimodal model API Key
+  mllm_openai_base_url: "https://dashscope.aliyuncs.com/compatible-mode/v1"  # OpenAI-compatible Base URL for MLLM
+
+image_generator:
+  api_key: ""  # Required: provider API Key
+  type: dashscope  # provider/platform type: modelscope | dashscope | google
+  model: gemini-3-pro-image-preview  # model ID/name supported by the selected type
+
+video_generator:
+  api_key: ""  # provider API Key
+  type: dashscope  # modelscope | dashscope | google
+  model: sora-2-2025-10-06  # video model ID/name supported by the selected type
 ```
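The `--field` vs `--a.b.c` override rule above can be sketched in a few lines of Python. This is an illustrative toy, not ms-agent's actual argument parsing; the config keys are taken from the YAML excerpt above.

```python
# Illustrative sketch of dotted CLI overrides applied to a nested config
# (not ms-agent's real parser; key names follow the YAML excerpt above).
def apply_override(config: dict, dotted_key: str, value) -> dict:
    """Walk the dotted path and set the leaf value in place."""
    keys = dotted_key.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})  # descend, creating levels if needed
    node[keys[-1]] = value
    return config

cfg = {
    "image_generator": {"api_key": "", "type": "dashscope"},
    "video_generator": {"api_key": "", "type": "dashscope"},
}
# `api_key` exists in two modules, so the multi-level path picks exactly one:
apply_override(cfg, "image_generator.api_key", "ms-xxx")
apply_override(cfg, "video_generator.type", "modelscope")
```

The dotted path disambiguates between the two `api_key` fields, which is why conflicting names require the `--a.b.c` form.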
 
-3. Run command
+---
+
+### 4) Example Commands
+
+Based on the default YAML, override key configurations for LLM / MLLM / text-to-image / text-to-video via the command line.
+
+Below are the two examples used to generate the video previews on this page:
+- Before running, replace `{path_to_ms-agent}` in the query with your local reference file path.
+- Replace the `api_key` values with real API keys.
+
+```bash
+# For the English version, replace the query content with:
+# "Convert /home/user/workspace/ms-agent/projects/singularity_cinema/test_files/J.部署.md
+# into a short video in a blue-themed style, making sure to use the important images from the document.
+# The short video must be in English."
+ms-agent run --project singularity_cinema \
+  --query "把/{path_to_ms-agent}/projects/singularity_cinema/test_files/J.部署.md转为短视频,蓝色风格,注意使用其中重要的图片" \
+  --trust_remote_code true \
+  --openai_base_url https://api.anthropic.com/v1/ \
+  --llm.model claude-sonnet-4-5 \
+  --openai_api_key {your_api_key_of_anthropic} \
+  --mllm_openai_base_url https://generativelanguage.googleapis.com/v1beta/openai/ \
+  --mllm_openai_api_key {your_api_key_of_gemini} \
+  --mllm_model gemini-3-pro-preview \
+  --image_generator.api_key {your_api_key_of_gemini} \
+  --image_generator.type google \
+  --image_generator.model gemini-3-pro-image-preview
+```
 
-```shell
-ms-agent run --config "projects/singularity_cinema" --query "Your custom theme, see description above" --load_cache true --trust_remote_code true
+```bash
+ms-agent run --project singularity_cinema \
+  --query "Please create a short video introducing the Silk Road, with a consistent visual style." \
+  --trust_remote_code true \
+  --openai_base_url https://api.anthropic.com/v1/ \
+  --llm.model claude-sonnet-4-5 \
+  --openai_api_key {your_api_key_of_anthropic} \
+  --mllm_openai_base_url https://generativelanguage.googleapis.com/v1beta/openai/ \
+  --mllm_openai_api_key {your_api_key_of_gemini} \
+  --mllm_model gemini-3-pro-preview \
+  --image_generator.api_key {your_api_key_of_gemini} \
+  --image_generator.type google \
+  --image_generator.model gemini-3-pro-image-preview
 ```

90-
4. The run takes approximately 20 minutes. The video is generated at output/final_video.mp4. After generation, you can review this file, compile the parts that don't meet requirements, input them into the command line input, and the workflow will continue improving. If requirements are met, input quit or exit and the program will automatically terminate.
91-
92-
5. If the execution fails, such as URL call timeout or file generation failure, you can re-run the command above. ms-agent saves execution information in the output/memory folder, and after re-running the command, it will continue from where it failed.
93-
* If you want to regenerate from scratch, please rename or move the output folder elsewhere, or delete the corresponding memory and input files.
94-
* You can delete input files for only specific scenes/shots, so that re-execution will only process those corresponding scenes/shots. This is also the principle behind the manual feedback correction in the final step.
95-
96-
## Technical Principles
97-
98-
1. Generate basic script based on user requirements
99-
* Input: User requirements, may read user-specified files
100-
* Output: Script file script.txt, original requirement file topic.txt, short video name file title.txt
101-
2. Split storyboard design based on script
102-
* Input: topic.txt, script.txt
103-
* Output: segments.txt, storyboard list describing narration, background image generation requirements, foreground manim animation requirements
104-
3. Generate audio narration for storyboards
105-
* Input: segments.txt
106-
* Output: audio/audio_N.mp3 list, N is segment number starting from 1, and root directory audio_info.txt containing audio duration
107-
4. Generate manim animation code based on voice duration
108-
* Input: segments.txt, audio_info.txt
109-
* Output: Manim code file list manim_code/segment_N.py, N is segment number starting from 1
110-
5. Fix manim code
111-
* Input: manim_code/segment_N.py N is segment number starting from 1, code_fix/code_fix_N.txt error prediction file
112-
* Output: Updated manim_code/segment_N.py files
113-
6. Render manim code
114-
* Input: manim_code/segment_N.py
115-
* Output: manim_render/scene_N folder list, if segments.txt contains manim requirements for a step, the corresponding folder will have a manim.mov file
180+
---
181+
182+
### 5) Output and Failure Retry
183+
184+
- The run typically takes about 20 minutes.
185+
- The generated video is output to `output_video/` under your command execution directory (controlled by `--output_dir`) as `final_video.mp4`.
186+
- If the run fails (timeout/interruption/missing files), you can rerun the command directly: the system will read execution info in `output_video` and resume from the breakpoint.
187+
- To regenerate from scratch: rename/delete the `output_video` directory.
188+
- To rerun only part of a storyboard: delete only the corresponding files for that segment; rerunning will execute only those segments.
189+
190+
---
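The resume-from-breakpoint behaviour described above can be sketched roughly as follows. This is a hypothetical illustration of the idea, not the project's actual code: a step runs only when its expected output file is missing, so rerunning the same command skips completed work.

```python
import tempfile
from pathlib import Path

def run_step_if_needed(output_file: Path, produce) -> str:
    """Run `produce` only when `output_file` is missing (resume semantics)."""
    if output_file.exists():
        return "skipped"  # already produced by a previous run
    output_file.parent.mkdir(parents=True, exist_ok=True)
    produce(output_file)
    return "ran"

with tempfile.TemporaryDirectory() as tmp:
    # hypothetical per-segment artifact, mirroring audio/audio_N.mp3
    target = Path(tmp) / "output_video" / "audio" / "audio_1.mp3"
    first = run_step_if_needed(target, lambda p: p.write_bytes(b"fake-mp3"))
    second = run_step_if_needed(target, lambda p: p.write_bytes(b"fake-mp3"))
```

Deleting only one segment's files therefore causes a rerun to regenerate just that segment, which is the mechanism the bullet points above describe.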
+
+## Technical Workflow
+
+1. Generate a base script from user requirements
+   - Input: user requirements; may read a user-specified file
+   - Output: script file `script.txt`, original request file `topic.txt`, short-video title file `title.txt`
+2. Split the script into storyboard segments
+   - Input: `topic.txt`, `script.txt`
+   - Output: `segments.txt`, a list of segments describing narration, background image generation requirements, and foreground Remotion animation requirements
+3. Generate audio narration for each segment
+   - Input: `segments.txt`
+   - Output: `audio/audio_N.mp3` list (N starts from 1), plus `audio_info.txt` in the root directory containing audio durations
+4. Generate Remotion animation code based on audio duration
+   - Input: `segments.txt`, `audio_info.txt`
+   - Output: Remotion code files `remotion_code/segment_N.py` (N starts from 1)
+5. Fix Remotion code
+   - Input: `remotion_code/segment_N.py` (N starts from 1), predicted-error file `code_fix/code_fix_N.txt`
+   - Output: updated `remotion_code/segment_N.py`
+6. Render Remotion code
+   - Input: `remotion_code/segment_N.py`
+   - Output: `remotion_render/scene_N` folder list; if a segment includes Remotion requirements in `segments.txt`, the corresponding folder will contain `remotion.mov`
 7. Generate text-to-image prompts
-   * Input: segments.txt
-   * Output: illustration_prompts/segment_N.txt, N is segment number starting from 1
-8. Text-to-image
-   * Input: illustration_prompts/segment_N.txt list
-   * Output: images/illustration_N.png list, N is segment number starting from 1
-9. Generate subtitles
-   * Input: segments.txt
-   * Output: subtitles/bilingual_subtitle_N.png list, N is segment number starting from 1
-10. Generate background, a solid color image with short video title and slogans
-   * Input: title.txt
-   * Output: background.jpg
-11. Composite complete video
-   * Input: All previous file information
-   * Output: final_video.mp4
-12. Human feedback
-
-## Adjustable Parameters
-
-Most adjustable parameters are in agent.yaml. Before running, you can modify this file for customization.
-
-Some important parameters are listed below:
-
-- llm: This group of parameters controls the LLM's url, apikey, etc.
-- generation_config: This group of parameters controls LLM generation parameters
-- prompt.system: Controls the system for script generation stage
-  - If you want to modify the system for storyboard generation, you can modify the system in step2_segment/agent.py
-- text2image: Text-to-image model parameters, including url, model id, etc.
-- t2i_transition: Background image effect, default is ken-burns effect
-- t2i_style: Image style, you can set your desired text-to-image style
-- t2i_num_parallel: Text-to-image call parallelism. Default is 1 to prevent rate limiting
-- llm_num_parallel: LLM call parallelism, default is 10
-- video: Video generation bitrate and other parameters
-- voice/voices: edge_tts voice settings, if you have other voice options, you can add them here
-- subtitle_translate: Multilingual subtitle language, if not set, no translation is performed
-- slogan: Displayed on the right side of the screen, generally shows producer name and short video collection
-- fonts: The recommended fonts list
+   - Input: `segments.txt`
+   - Output: `illustration_prompts/segment_N.txt` (N starts from 1)
+8. Text-to-image generation
+   - Input: list of `illustration_prompts/segment_N.txt`
+   - Output: list of `images/illustration_N.png` (N starts from 1)
+9. Generate a background image (solid color) with the short-video title and slogans
+   - Input: `title.txt`
+   - Output: `background.jpg`
+10. Compose the final video
+    - Input: all files from previous steps. This step may take a long time with no logs and does not consume tokens.
+    - Output: `final_video.mp4`
+
+---
+
+## Tunable Parameters (Overview)
+
+Most parameters are in the default `agent.yaml`. Recommended practice: **do not modify the default YAML**; override what you need via command-line arguments.
+
+Common examples:
+- LLM/MLLM: `--openai_base_url`, `--openai_api_key`, `--llm.model`, `--mllm_model`, etc.
+- Text-to-image / Text-to-video:
+  - `--image_generator.type`, `--image_generator.model`, `--image_generator.api_key`
+  - `--video_generator.type`, `--video_generator.model`, `--video_generator.api_key`
+- Parallelism: `--t2i_num_parallel`, `--t2v_num_parallel`, `--llm_num_parallel`
+- Video params: `--video.fps`, `--video.bitrate`, etc.
+- Toggles: `--use_subtitle`, `--use_text2video`, `--use_doc_image`, etc.
