

dagrayvid (Collaborator)

Summary

Refactor how synthetic data is generated so that it runs faster for large datasets of thousands of prompts, each with thousands of tokens.

Details

Basic outline of how it works now:

  • Tokenize the whole dataset and save it locally in .npy format as an array of token IDs under XDG_CACHE_HOME.
  • The cache file name is unique to the tokenizer and the source file.
  • Sample randomly from this array of token IDs; there is no need to search for text of the right length, since the data is already tokenized.
  • Convert the selected token ID slices back to text (a minimal sketch of this flow is shown below).
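
A minimal sketch of this flow, not the PR's exact implementation: it assumes a Hugging Face tokenizer and a plain-text source file, and the helper names (`_cache_path`, `build_token_cache`, `sample_prompts`) are hypothetical.

```python
import hashlib
import os
from pathlib import Path

import numpy as np


def _cache_path(tokenizer_name: str, source_file: Path) -> Path:
    """Cache file under XDG_CACHE_HOME, keyed by the tokenizer and source file."""
    cache_root = Path(os.environ.get("XDG_CACHE_HOME", Path.home() / ".cache"))
    key = hashlib.sha256(f"{tokenizer_name}:{source_file.resolve()}".encode()).hexdigest()[:16]
    return cache_root / "synthetic_data" / f"{key}.npy"


def build_token_cache(tokenizer, tokenizer_name: str, source_file: Path) -> np.ndarray:
    """Tokenize the whole source file once and cache the token IDs as .npy."""
    path = _cache_path(tokenizer_name, source_file)
    if path.exists():
        return np.load(path)
    token_ids = np.array(
        tokenizer.encode(source_file.read_text(), add_special_tokens=False),
        dtype=np.int32,
    )
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, token_ids)
    return token_ids


def sample_prompts(
    tokenizer, token_ids: np.ndarray, num_prompts: int, prompt_tokens: int, seed: int = 0
) -> list[str]:
    """Slice random fixed-length windows of token IDs and decode them back to text."""
    # Assumes the cached corpus is longer than prompt_tokens.
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, len(token_ids) - prompt_tokens, size=num_prompts)
    return [tokenizer.decode(token_ids[s : s + prompt_tokens]) for s in starts]
```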

Test Plan

We can now generate 10,000 prompts with 5,000 tokens each in under 30 seconds, whereas previously this took more than 10 minutes.
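
As a rough illustration, a timing check along these lines could reproduce a measurement like the one above; it reuses the hypothetical helpers from the sketch in the Details section, and the model name and file path are placeholders.

```python
import time
from pathlib import Path

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
token_ids = build_token_cache(tokenizer, "gpt2", Path("source_text.txt"))

start = time.perf_counter()
prompts = sample_prompts(tokenizer, token_ids, num_prompts=10_000, prompt_tokens=5_000)
print(f"Generated {len(prompts)} prompts in {time.perf_counter() - start:.1f}s")
```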

Related Issues

None?

  • "I certify that all code in this PR is my own, except as noted below."

Use of AI

  • Includes AI-assisted code completion
  • Includes code generated by an AI application
  • Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes ## WRITTEN BY AI ##)

Signed-off-by: David Whyte-Gray <[email protected]>
markurtz (Collaborator) commented Oct 1, 2025

@dagrayvid can you take a look at the new refactor, specifically the data pipelines rework, and see what is missing / what it would take to adapt this on top of it? #384
