synthetic datasets for benchmarking and testing#3518
Conversation
⚠️ Important: Review skipped. Auto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configuration.
📝 Walkthrough

This change introduces a new synthetic dataset generation feature that enables creating random token-based datasets for training without external data sources. It includes a new strategy module and updated configuration schemas.

Changes
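The walkthrough describes generating random token-based datasets for training without external data. As a rough illustration only, here is a minimal sketch of what such a generator could produce; the function name and the `input_ids`/`attention_mask`/`labels` field names follow common Hugging Face conventions and are hypothetical, not taken from this PR's actual implementation.

```python
# Hypothetical sketch of a random token dataset; names are illustrative,
# not the PR's real API.
import random
from typing import Dict, List


def build_synthetic_dataset(
    length: int, sequence_length: int, vocab_size: int
) -> Dict[str, List[List[int]]]:
    # One row of random token ids per sample.
    input_ids = [
        [random.randrange(vocab_size) for _ in range(sequence_length)]
        for _ in range(length)
    ]
    return {
        "input_ids": input_ids,
        # A fresh, independent mask row per sample.
        "attention_mask": [[1] * sequence_length for _ in range(length)],
        # Causal-LM style: labels mirror the inputs (copied, not aliased).
        "labels": [row[:] for row in input_ids],
    }


data = build_synthetic_dataset(length=4, sequence_length=8, vocab_size=32000)
print(len(data["input_ids"]), len(data["input_ids"][0]))  # 4 8
```

A dict shaped like this could then be handed to something like `datasets.Dataset.from_dict` for benchmarking without touching any external data source.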
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~15 minutes

🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
🧹 Nitpick comments (1)
src/axolotl/prompt_strategies/synthetic.py (1)
68-68: Avoid list multiplication for nested lists; use a comprehension instead.

`[[1] * self.sequence_length] * self.length` creates `self.length` references to the same inner list. While this works correctly here because `Dataset.from_dict` copies the data into Arrow format, it's a subtle footgun that could cause issues if the code is refactored or the list is used before conversion.

♻️ Safer approach using a list comprehension:

```diff
- attention_mask = [[1] * self.sequence_length] * self.length
+ attention_mask = [[1] * self.sequence_length for _ in range(self.length)]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/axolotl/prompt_strategies/synthetic.py` at line 68, the attention_mask is built with nested list multiplication, which creates shared inner-list references; update the construction of attention_mask in synthetic.py (the assignment to attention_mask that uses self.sequence_length and self.length) to use a list comprehension that creates independent inner lists (e.g., iterate over range(self.length) and create [1] * self.sequence_length for each) so each row is a distinct list before passing to Dataset.from_dict.
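The aliasing pitfall the review comment describes can be reproduced in a few lines of plain Python; the variable names below are illustrative.

```python
rows, cols = 3, 4

# List multiplication repeats the SAME inner list object `rows` times.
shared = [[1] * cols] * rows
shared[0][0] = 0  # mutating one row mutates every row
print(shared)  # [[0, 1, 1, 1], [0, 1, 1, 1], [0, 1, 1, 1]]

# A comprehension builds a fresh inner list on each iteration.
independent = [[1] * cols for _ in range(rows)]
independent[0][0] = 0  # only the first row changes
print(independent)  # [[0, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
```

Note that `[1] * cols` itself is safe, since the elements are immutable ints; the problem only arises when the repeated element is itself mutable.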
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: aa26e1be-b85c-4a3e-85cf-668ec4e28446
📒 Files selected for processing (6)
- src/axolotl/prompt_strategies/synthetic.py
- src/axolotl/utils/config/__init__.py
- src/axolotl/utils/data/sft.py
- src/axolotl/utils/schemas/config.py
- src/axolotl/utils/schemas/datasets.py
- tests/prompt_strategies/test_synthetic.py
📖 Documentation Preview: https://69bd40c71a433c223a77aa76--resonant-treacle-0fd729.netlify.app (deployed on Netlify from commit 2a50620)
Codecov Report: ❌ Patch coverage is
Summary by CodeRabbit
Release Notes