# AGENTS.md

## Repository-specific guidance

### Main code vs experimental code

The repository is separated into **main code** and **experimental code**.

* **Main code** should remain stable, consistent, and well-tested.
* **Experimental code** may be less stable and may contain inconsistent patterns or limited testing.

Small, non-invasive improvements that make experimental code more consistent with the main codebase are encouraged, but avoid large refactors.

### Paper implementations

If a PR implements a method, algorithm, or training approach from a research paper, it must also add a corresponding subsection to `paper_index.md`.

When reviewing such PRs, ensure that `paper_index.md` was updated.

### Code duplication and consistency

Trainers in this repository are **self-contained by design**. Shared logic (generation, reward computation, metric logging, weight syncing, etc.) is deliberately duplicated across trainers rather than abstracted into a shared base class.

This is intentional: each trainer must be readable, modifiable, and evolvable in isolation. The base class (`_BaseTrainer`) provides only minimal utilities (model card generation). Everything else — vLLM generation paths, `_get_per_token_logps_and_entropies`, `_calculate_rewards`, `_prepare_inputs`, metric logging — is copied in full.

**The tradeoff**: duplication is accepted, but **consistency is mandatory**. When the same logic appears in multiple trainers, the duplicated blocks must stay aligned:

- Same variable names (`self._last_loaded_step`, `self._metrics[mode]`, …)
- Same control flow structure (if/elif/else branches in the same order)
- Same comments (word-for-word when the logic is identical)
- Divergences only where the trainer's semantics require it (e.g., GRPO extracts logprobs from vLLM, RLOO discards them)
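
To make the alignment rule concrete, here is a minimal, hypothetical sketch of what "duplicated but aligned" looks like. The class and method names (`GRPOTrainer`, `RLOOTrainer`, `_maybe_sync_weights`) and the simplified state fields are illustrative only, not the repository's actual code; the point is that the copies share variable names, control flow, and comments verbatim:

```python
# Hypothetical sketch: the same weight-sync guard duplicated verbatim in two
# trainers. Same variable names, same branch structure, same comments.


class GRPOTrainer:
    def __init__(self):
        self._last_loaded_step = -1  # step at which weights were last synced
        self.state_global_step = 0
        self.synced_steps = []

    def _maybe_sync_weights(self):
        # Sync weights to the generation engine only if the policy was
        # updated since the last sync.
        if self.state_global_step != self._last_loaded_step:
            self.synced_steps.append(self.state_global_step)
            self._last_loaded_step = self.state_global_step


class RLOOTrainer:
    # Duplicated verbatim from GRPOTrainer: do not let the copies drift.
    def __init__(self):
        self._last_loaded_step = -1  # step at which weights were last synced
        self.state_global_step = 0
        self.synced_steps = []

    def _maybe_sync_weights(self):
        # Sync weights to the generation engine only if the policy was
        # updated since the last sync.
        if self.state_global_step != self._last_loaded_step:
            self.synced_steps.append(self.state_global_step)
            self._last_loaded_step = self.state_global_step
```

A reviewer should be able to diff the two blocks and see either zero differences or only the semantically required ones.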

**Consistency over correctness**: this is a strong requirement. When duplicating code, reproduce it exactly — even if you believe the original has a bug. Do not silently fix the issue in your copy. Instead, keep your copy consistent with the source and report the problem so it can be fixed across all trainers in a dedicated PR. A correct-but-inconsistent codebase is harder to maintain than a consistently-wrong one that can be fixed in a single sweep.

**When modifying duplicated code**: if you change a pattern that exists in multiple trainers (e.g., the vLLM generation path in `_generate_single_turn`), apply the same change to all other trainers. A fix in GRPO often implies the same fix in RLOO, and vice versa. Not propagating a change is a bug.

**When reviewing**: if a PR touches duplicated logic, verify that all copies are updated consistently. A common mistake is fixing one trainer and forgetting the others.

### Simplicity

This codebase values **leanness and simplicity above all**. Prefer straightforward, inline code over abstractions, helpers, or utilities — even at the cost of some robustness or generality.

Concretely:

- Do not add layers of indirection (registries, factory patterns, plugin systems). A contributor should be able to read a trainer top to bottom and understand the full flow.
- Prefer a simple implementation that covers 90% of cases over a complex one that covers 100%. A function that handles the common path in 20 lines is better than a catch-all that handles every edge case in 80.
- Do not add defensive code, fallback paths, or configuration options "just in case". Only handle cases that actually exist today.
- When in doubt, prefer less code. Every new function, parameter, or branch is maintenance burden. The best abstraction is often no abstraction.

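As an illustration of the preferred style, here is a hypothetical sketch: a plain function with explicit branches that a reader can follow top to bottom, instead of a registry or plugin system. The function name (`normalize_rewards`) and the mode names are made up for this example and are not repository API:

```python
# Hypothetical sketch: handle only the cases that exist today, inline,
# with no registry, no factory, and no "just in case" fallbacks.


def normalize_rewards(rewards: list[float], mode: str = "std") -> list[float]:
    mean = sum(rewards) / len(rewards)
    if mode == "std":
        # Standardize: subtract the mean, divide by the (epsilon-padded) std.
        var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
        std = var**0.5
        return [(r - mean) / (std + 1e-8) for r in rewards]
    elif mode == "center":
        # Only subtract the mean.
        return [r - mean for r in rewards]
    else:
        raise ValueError(f"Unknown mode: {mode}")
```

Two branches, one error path, no extensibility hooks: if a third mode is ever needed, it gets added as a third branch at that point, not anticipated now.
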
## Documentation

### Docstrings

Docstrings must follow the repository format below. Do **not** convert docstrings to other styles (Google, NumPy, etc.).

Rules:

* Types appear in backticks inside parentheses: (`str`)
* Optional parameters are marked with `*optional*`
* Defaults are written as: `defaults to <value>`
* When the default is `None`, prefer ```(`str`, *optional*)``` instead of ```(`str` or `None`, *optional*, defaults to `None`)```
* Union types use `or`: `str` or `None`
* References to classes use the format: [`~transformers.PreTrainedModel`]
* Class docstrings may group parameters using headers such as: `> Parameters for X:`

Example:

````python
def method(self, param1: str, param2: int = 1, param3: float | None = None):
    """
    Brief one-line description of what this does.

    Args:
        param1 (`str`):
            Description of required param.
        param2 (`int`, *optional*, defaults to `1`):
            Description of optional param with default.
        param3 (`float`, *optional*):
            Description of optional param without explicit default.

    Returns:
        `dict` with keys:
        - `key1` (`list[int]`):
            Description of this key.

    Examples:

    ```python
    >>> my_func("hello")
    ```
    """
````

### Links to papers

When linking to papers, use `https://huggingface.co/papers/<id>` instead of `https://arxiv.org/abs/<id>` (same ID suffix system).
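
The rule amounts to a prefix swap with the paper ID kept intact. A minimal sketch of the transformation (the helper name and the example paper ID are illustrative, not part of the repository):

```python
# Hypothetical helper: rewrite an arxiv.org abstract URL to the preferred
# huggingface.co/papers form, preserving the paper ID suffix.


def to_hf_paper_url(url: str) -> str:
    prefix = "https://arxiv.org/abs/"
    if url.startswith(prefix):
        paper_id = url[len(prefix):]
        return f"https://huggingface.co/papers/{paper_id}"
    return url  # already in the preferred form, or not an arXiv link
```

For example, `to_hf_paper_url("https://arxiv.org/abs/2402.03300")` returns `"https://huggingface.co/papers/2402.03300"`.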