Skip to content

[Docs] Add LLM contribution guide#10411

Merged
davidzollo merged 10 commits intoapache:devfrom
corgy-w:llm-agent-md
Feb 9, 2026
Merged

[Docs] Add LLM contribution guide#10411
davidzollo merged 10 commits intoapache:devfrom
corgy-w:llm-agent-md

Conversation

@corgy-w
Copy link
Contributor

@corgy-w corgy-w commented Jan 28, 2026

Purpose of this pull request

This PR introduces AGENTS.md to the repository root.

Motivation:
As AI coding assistants (like Cursor, Windsurf, Trae, etc.) are becoming more popular, having a dedicated context file helps them understand the project structure, coding standards, and testing procedures of Apache SeaTunnel. This file mirrors the AGENTS.md practice found in other mature Apache projects (e.g., Apache Superset) but adapts it specifically to SeaTunnel's engineering conventions.

Key contents of AGENTS.md:

  • Critical Requirements: Validation steps (Spotless, Verify, Test) before pushing.
  • Git Convention: Standardized commit message format.
  • Project Structure: Overview of key modules like seatunnel-connectors-v2 and seatunnel-engine.
  • Tech Stack: Rules for dependencies (e.g., using shaded packages) and code style.
  • Testing Strategy: Guidelines for Unit Tests and E2E Tests.
  • Documentation: Requirement for bilingual (EN/ZH) updates.

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

Comment on lines +119 to +121
Any incompatible change MUST:

* Be explicitly documented
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incompatible changes should be documented in docs/en/introduction/concepts/incompatible-changes.md

@davidzollo
Copy link
Contributor

Good practice for AI coding

davidzollo
davidzollo previously approved these changes Jan 28, 2026
Copy link
Contributor

@davidzollo davidzollo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
LGTM
It's better to add soft links for three LLMs

Image

@chl-wxp
Copy link
Collaborator

chl-wxp commented Jan 29, 2026

Isn’t Chinese documentation in the plan?

@corgy-w
Copy link
Contributor Author

corgy-w commented Jan 29, 2026

Isn’t Chinese documentation in the plan?

@chl-wxp I took a look at the loading logic of commonly used tools. It's all unnecessary to load Chinese items like AGENT/GPT in the root directory

@DanielCarter-stack
Copy link

Issue 1: Missing Chinese version of documentation AGENTS_ZH.md

Location: Root directory

# Files promised but not created in PR
AGENTS_ZH.md

Related Context:

  • PR description explicitly states: "This PR introduces AGENTS.md (and its Chinese version AGENTS_ZH.md)"
  • Project documentation standards require: "Any user-visible change MUST update: docs/en AND docs/zh"
  • Existing Chinese documentation directory: docs/zh/ exists with complete structure

Issue Description:
The PR description explicitly promises to create a Chinese version AGENTS_ZH.md, but the actual changes only create an English version AGENTS.md and three symlinks. This violates:

  1. The explicit promise in the PR description
  2. The project's own bilingual documentation standards (also mentioned in AGENTS.md lines 170-177)

Potential Risks:

  • Risk 1: Chinese contributors using AI assistants may not receive optimal guidance
  • Risk 2: Violates the project's bilingual documentation convention, causing inconsistent documentation structure
  • Risk 3: May set a wrong example for other contributors to "ignore Chinese documentation"

Impact Scope:

  • Direct impact: Chinese contributors and developers using Chinese AI models
  • Indirect impact: Project internationalization strategy and documentation consistency
  • Impact level: Project-level (documentation standards)

Severity: MAJOR

Improvement Suggestions:

# Option 1: Create AGENTS_ZH.md (recommended)
# Need to translate AGENTS.md to Chinese, keeping structure consistent

# Option 2: Create symlink to English version (temporary solution)
ln -s AGENTS.md AGENTS_ZH.md
# But this is not a good long-term solution

# Option 3: Defer to follow-up PR
# Clearly state in PR description that Chinese version will be added in follow-up PR
# Need to create Issue to track

Rationale: The project has clear bilingual documentation standards, and the PR description promised a Chinese version. Suggest supplementing before merge, or explicitly stating the plan for the Chinese version in the PR description.


Issue 2: Startup script path error

Location: AGENTS.md:233-237

### Run Job (Zeta)
```bash
sh bin/seatunnel.sh --config config/v2.batch.config.template -e local

** **Related Context:
- 实际 seatunnel.sh 位置:`seatunnel-core/seatunnel-starter/src/main/bin/seatunnel.sh`
- config 模板位置:`config/v2.batch.config.template` ✓ 存在
- 文档声称的路径:`bin/seatunnel.sh` ✗ 不存在

** **Problem Description:
文档中提供的运行命令使用了错误的路径。`bin/` 目录下只有 `install-plugin.sh`,没有 `seatunnel.sh`。实际的 `seatunnel.sh` 位于 `seatunnel-core/seatunnel-starter/src/main/bin/`。

这会导致:
1. 新手直接复制命令会失败
2. AI 助手可能生成错误的运行脚本
3. 降低文档的可信度

** **Potential Risks:
- 风险 1:开发者尝试运行命令时会报错 "No such file or directory"
- 风险 2:浪费开发者时间排查路径问题
- 风险 3:AI 助手可能学习到错误的路径模式

** **Impact Scope:
- 直接影响:尝试运行 SeaTunnel 的开发者
- 间接影响:文档可信度
- 影响面:使用 Zeta 引擎的用户

** **Severity:** MAJOR

** **Improvement Suggestions:
```markdown
### Run Job (Zeta)

**After building from source**, you can run a job using:

```bash

# ## Issue 3: Incomplete directory structure description

** **Location:** `AGENTS.md:60-74`
```markdown
## Repository Structure

```text
seatunnel/
├── seatunnel-api/              # Core API definitions
├── seatunnel-connectors-v2/    # Source & Sink connectors (main contribution area)
├── seatunnel-transforms-v2/    # Transform plugins (including LLM)
├── seatunnel-engine/           # Zeta engine & Web UI
├── seatunnel-core/             # Job submission & CLI entry points
├── seatunnel-translation/      # Flink & Spark adapters
├── seatunnel-formats/          # Data formats (JSON, Avro, etc.)
├── seatunnel-e2e/              # End-to-End integration tests
├── docs/                       # Documentation (en & zh)
└── config/                     # Default configurations

** **Related Context:
- 实际存在的模块(通过 `ls -d seatunnel-*` 确认):
  - `seatunnel-api` ✓
  - `seatunnel-connectors-v2` ✓
  - `seatunnel-transforms-v2` ✓
  - `seatunnel-engine` ✓
  - `seatunnel-core` ✓
  - `seatunnel-translation` ✓
  - `seatunnel-formats` ✓
  - `seatunnel-e2e` ✓
  - `seatunnel-common` ✗(未列出)
  - `seatunnel-config` ✗(未列出)
  - `seatunnel-dist` ✗(未列出)
  - `seatunnel-examples` ✗(未列出)
  - `seatunnel-plugin-discovery` ✗(未列出)
  - `seatunnel-shade` ✗(未列出)
  - `seatunnel-ci-tools` ✗(未列出)

** **Problem Description:
文档中列出的目录结构不完整,遗漏了 7 个子模块。虽然文档说明是"Overview of key modules",但:
1. 没有明确说明这是"非完整列表"
2. `seatunnel-common` 是一个核心模块(公共工具类),被遗漏不太合理
3. `seatunnel-dist` 与打包分发相关,也可能被贡献者接触到

** **Potential Risks:
- 风险 1:AI 助手可能认为只有这 9 个模块,在实际修改时遗漏其他模块
- 风险 2:贡献者在需要修改 `seatunnel-common` 时,可能不知道这个模块的存在
- 风险 3:可能误导 AI 助手进行模块推荐

** **Impact Scope:
- 直接影响:需要修改被遗漏模块的贡献者
- 间接影响:AI 助手的模块推荐准确性
- 影响面:多模块

** **Severity:** MINOR

** **Improvement Suggestions:
```markdown
## Repository Structure

```text
seatunnel/
├── seatunnel-api/              # Core API definitions
├── seatunnel-common/           # Common utilities and shared code
├── seatunnel-config/           # Configuration module
├── seatunnel-connectors-v2/    # Source & Sink connectors (main contribution area)
├── seatunnel-transforms-v2/    # Transform plugins (including LLM)
├── seatunnel-engine/           # Zeta engine & Web UI
├── seatunnel-core/             # Job submission & CLI entry points
├── seatunnel-translation/      # Flink & Spark adapters
├── seatunnel-formats/          # Data formats (JSON, Avro, etc.)
├── seatunnel-e2e/              # End-to-End integration tests
├── seatunnel-dist/             # Distribution packaging
├── seatunnel-shade/            # Shaded dependencies
├── seatunnel-plugin-discovery/ # Plugin discovery mechanism
├── seatunnel-ci-tools/         # CI/CD tools
├── seatunnel-examples/         # Example configurations
├── docs/                       # Documentation (en & zh)
└── config/                     # Default configurations

Key modules for contributors:

  • Connector-V2: Most contributions happen here
  • Transform-V2: Data transformation plugins
  • Zeta (seatunnel-engine): Engine development
  • E2E: Integration tests

** **Reasoning:
1. 列出所有模块,提供完整的视图
2. 添加简短描述说明每个模块的用途
3. 强调"Key modules for contributors",帮助贡献者聚焦
4. 保持简洁,但提供完整信息

---

# ## Issue 4: Git commit format differs from actual history

** **Location:** `AGENTS.md:22-58`

** **Related Context:
- 文档声称的格式:`[Type][Module] Description`
- 实际历史中的变体:
  - `[Docs] Add LLM contribution guide`(无模块)✓ 符合
  - `[Fix][Connector-V2][Hive] ...`(三级标签)⚠️ 变体
  - `[Fix] [connector-hbase] ...`(小写+空格)⚠️ 变体
  - `[Feature] [Connector-file] ...`(空格)⚠️ 变体
  - `[Bug][Connector-V2] ...`(使用 Bug)⚠️ 类型不一致

** **Problem Description:
文档中定义的提交格式是"理想格式",但项目实际历史中存在多种变体:
1. 模块标签有时使用小写(如 `connector-hbase` vs `Connector-V2`)
2. 有时使用三级标签(如 `[Fix][Connector-V2][Hive]`)
3. 有时标签之间有空格(`[Fix] [connector-hbase]`)
4. 类型有时使用 `Bug` 而非 `Fix`
5. 有些提交没有模块标签(如 `[Docs] ...`)

文档没有说明这些变体是否可接受,可能导致 AI 助手或贡献者困惑。

** **Potential Risks:
- 风险 1:AI 助手可能拒绝接受历史中的合法提交格式
- 风险 2:贡献者可能认为历史提交格式"错误"
- 风险 3:缺乏明确的规范说明,导致新贡献者不知道该遵循哪个格式

** **Impact Scope:
- 直接影响:Git 提交信息格式
- 间接影响:提交历史的一致性
- 影响面:项目级

** **Severity:** MINOR

** **Improvement Suggestions:
```markdown
## Git Commit Message Convention

SeaTunnel follows a **standardized commit message format** to maintain a clean and searchable history.

**Recommended Format**:

[Type][Module] Description


**Allowed Variations**:
- `[Type] Description` (without module tag, for general changes)
- `[Type][Module][Sub-module] Description` (for specific sub-components)

**Type Values** (case-sensitive, prefer these):
* `Feature`  – New features
* `Fix`      – Bug fixes (also `Bug` is accepted but discouraged)
* `Improve`  – Improvements to existing behavior
* `Docs`     – Documentation-only changes
* `Test`     – Test cases or test framework changes
* `Chore`    – Build, dependency, or maintenance tasks

**Module Values** (use consistent casing):
* `Connector-V2`  – seatunnel-connectors-v2
* `Zeta`          – seatunnel-engine (Zeta engine)
* `Core`          – seatunnel-core
* `API`           – seatunnel-api
* `Transform-V2`  – seatunnel-transforms-v2
* `Format`        – seatunnel-formats
* `Translation`   – seatunnel-translation
* `E2E`           – seatunnel-e2e

**Examples**:
* `[Fix][Connector-V2] Fix MySQL source split enumeration bug`
* `[Feature][Transform-V2] Add LLM transform plugin`
* `[Docs] Update quick start guide`
* `[Improve][Core] Optimize jar package loading speed`

**Historical Note**: Some historical commits use variations like lowercase module names (`connector-hbase`), extra spaces (`[Fix] [module]`), or `Bug` instead of `Fix`. New commits should follow the recommended format above.

** **Reasoning:

  1. 明确区分"推荐格式"和"允许变体"
  2. 说明历史差异,避免混淆
  3. 提供清晰的指导,告诉新贡献者应该怎么做
  4. 保持向后兼容性,承认历史格式的合理性

## Issue 5: Documentation location may not be prominent enough

** Location: Root directory AGENTS.md

** **Related Context:

  • 文件放置在项目根目录
  • README.md 同级
  • 没有在 README.md 中引用或提及
  • 没有在 docs/ 目录中建立索引

** **Problem Description:
虽然 AGENTS.md 放在根目录有助于 LLM 工具发现,但对于人类贡献者来说:

  1. 没有在 README 中提及,可能不容易被发现
  2. 不在 docs/ 目录下,可能不被视为"正式文档"
  3. 没有在文档导航中集成

** **Potential Risks:

  • 风险 1:人类贡献者可能不知道这个文档的存在
  • 风险 2:文档可能长期不更新,因为没人知道要更新它
  • 风险 3:与 docs/ 目录的隔离可能导致内容重复或冲突

** **Impact Scope:

  • 直接影响:文档的可发现性
  • 间接影响:文档的维护和更新
  • 影响面:项目级

** Severity: MINOR

** **Improvement Suggestions:

** **Option 1: Add reference in README.md (recommended)

# Add to the "Contributing" section in README.md

## Contributing

We welcome contributions! Please check out:
- [Contribution Guide](docs/en/contribution/...) (for humans)
- [LLM Context Guide](AGENTS.md) (for AI assistants like Claude, GPT, etc.)

** **Option 2: Add link in docs/ as well

# Create docs/en/AGENTS.md (symlink or redirect)
ln -s ../../AGENTS.md docs/en/AGENTS.md
# Create docs/zh/AGENTS_ZH.md
ln -s ../../AGENTS_ZH.md docs/zh/AGENTS_ZH.md

** **Option 3: Add to documentation sidebar
docs/sidebars.js 中添加:

{
  type: 'link',
  label: 'For AI Assistants',
  href: '/en/AGENTS',
}

** **Reasoning:

  1. 提高文档的可发现性
  2. 让人类贡献者也知道这个资源
  3. 建立与正式文档体系的连接
  4. 便于未来维护和更新

## Issue 6: Missing CI/CD workflow documentation

** Location: AGENTS.md (overall)

** **Related Context:

  • 文档提到了本地验证命令(./mvnw spotless:apply, ./mvnw verify, ./mvnw test
  • 但没有提及 CI/CD 流程
  • 项目有 .github/workflows/ 目录(GitHub Actions)

** **Problem Description:
现代开源项目通常有 CI/CD 流程自动检查 PR。文档中:

  1. 没有说明 PR 提交后 CI 会自动运行哪些检查
  2. 没有说明如何查看 CI 结果
  3. 没有说明 CI 失败后如何处理
  4. 没有说明如何在本地模拟 CI 检查

** **Potential Risks:

  • 风险 1:贡献者可能不知道 CI 会自动检查
  • 风险 2:贡献者可能不知道如何理解 CI 错误
  • 风险 3:AI 助手可能生成无法通过 CI 的代码

** **Impact Scope:

  • 直接影响:PR 提交流程
  • 间接影响:PR 合并效率
  • 影响面:所有贡献者

** Severity: MINOR

** **Improvement Suggestions:

## CI/CD Process

**Automated Checks**:
When you open a PR, GitHub Actions will automatically run:
- Spotless format check
- Compilation (verify)
- Unit tests
- E2E tests (for some modules)

**Viewing CI Results**:
1. Go to your PR page
2. Scroll to the "Checks" section at the bottom
3. Click on the failed job to see detailed logs

**Common CI Failures**:
- **Spotless check failed**: Run `./mvnw spotless:apply` and push again
- **Test failed**: Check the test logs, fix the issue, and push
- **Compilation failed**: Check the compiler error logs

**Running CI Locally** (optional):
```bash

# ## Issue 7: E2E test command may be inaccurate

** **Location:** `AGENTS.md:193-203`

** **Related Context:
- 文档声称:
  ```bash
  ./mvnw -DskipUT -DskipIT=false verify
  • 这个命令会跳过单元测试(skipUT),但运行集成测试

** **Problem Description:

  1. 命令名称叫 "E2E Tests",但实际使用 -DskipIT=false,这是运行集成测试,不一定是 E2E 测试
  2. E2E 测试可能位于 seatunnel-e2e 模块,需要特定的命令运行
  3. 文档没有说明如何运行特定模块的 E2E 测试

** **Potential Risks:

  • 风险 1:AI 助手可能无法正确区分 E2E 测试和集成测试
  • 风险 2:贡献者可能无法运行特定的 E2E 测试
  • 风险 3:测试运行失败时不知道如何排查

** **Impact Scope:

  • 直接影响:测试运行
  • 间接影响:PR 质量保证
  • 影响面:Connector-V2 和其他需要 E2E 测试的模块

** Severity: MINOR

** **Improvement Suggestions:

### E2E Tests

* Located in `seatunnel-e2e`
* Uses Testcontainers for real environment testing
* Extend `TestSuiteBase` for consistent test setup

**Run all E2E tests**:
```bash
./mvnw -DskipUT -DskipIT=false verify

Run specific E2E test:

@davidzollo
Copy link
Contributor

Can you remind the AI to add comments for important methods? I find out that the comments in the current code are rare.

* `[Fix][Zeta] Fix checkpoint timeout under heavy backpressure`
* `[Feature][Transform-V2] Add LLM transform plugin`
* `[Improve][Core] Optimize jar package loading speed`
* `[Docs] Update quick start guide`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small detail: the document states the strict format is [Type][Module], but the example has [Docs] ... (no Module). Let's standardize this part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DOC doesn't have [Module], so I think this is fine. I feel there's no need to add anything here

Copy link
Collaborator

@yzeng1618 yzeng1618 Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could add DOC doesn't have [Module] to this description, but it’s totally fine to leave it out—this isn’t something we need to overthink.
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I think it’s better to use Doc as [Module] as before.

@yzeng1618
Copy link
Collaborator

Isn’t Chinese documentation in the plan?

@chl-wxp I took a look at the loading logic of commonly used tools. It's all unnecessary to load Chinese items like AGENT/GPT in the root directory

image This can be removed.

Copy link
Collaborator

@yzeng1618 yzeng1618 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +202 to +204
```bash
./mvnw -DskipUT -DskipIT=false verify
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# Run specific connector E2E tests
./mvnw test -pl seatunnel-e2e/seatunnel-connector-v2-e2e/connector-xxx-e2e 

# Run all E2E tests (warning: time-consuming)
./mvnw -DskipUT -DskipIT=false verify

Comment on lines +16 to +18
# Unit tests (strongly recommended)
./mvnw test
```
Copy link
Contributor

@davidzollo davidzollo Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Unit tests (strongly recommended)
./mvnw test
```
# Unit tests (strongly recommended)
./mvnw test
# Check for compilation errors in affected modules
./mvnw clean package -pl <affected-module> -am

@davidzollo
Copy link
Contributor

Suggested addition

## Adding a New Connector

When contributing a new connector, you MUST update:

1. `plugin-mapping.properties` - Add connector mapping
2. `seatunnel-dist/pom.xml` - Include in distribution
3. `seatunnel-e2e/` - Add E2E test cases
4. `config/plugin_config` - Add plugin configuration
5. `docs/en/connector-v2/` - Add English documentation
6. `docs/zh/connector-v2/` - Add Chinese documentation

Suggested new section:

## Common Pitfalls to Avoid

1. **Don't use wildcard imports** - Spotless will fail
2. **Don't forget license headers** - All new files need ASF headers
3. **Don't skip `-nsu` flag** - Ensures build reproducibility
4. **Don't modify existing Option names** - Breaking change for users
5. **Don't commit without running spotless:apply** - CI will fail
6. **Don't use `System.out.println`** - Use proper logging framework

What Works Well

  1. Clear structure - Well-organized sections
  2. Good examples - Commit message format examples are helpful
  3. Backward compatibility emphasis - Critical for production systems
  4. Repository structure - Accurate and helpful
  5. License header template - Essential for Apache projects

Copy link
Member

@zhangshenghang zhangshenghang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

@davidzollo davidzollo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@davidzollo davidzollo merged commit 2df5480 into apache:dev Feb 9, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants