|
| 1 | +--- |
| 2 | +name: new-sdg |
| 3 | +description: Implement a new synthetic data generator using NeMo Data Designer by defining its configuration and executing a preview job. |
| 4 | +argument-hint: <dataset-description> |
| 5 | +--- |
| 6 | + |
| 7 | +# Your Goal |
| 8 | + |
| 9 | +Implement a new synthetic data generator using NeMo Data Designer to match the user's specifications below. |
| 10 | + |
| 11 | +<dataset-description> |
| 12 | + **$ARGUMENTS** |
| 13 | +</dataset-description> |
| 14 | + |
| 15 | +## Getting Exact Specifications |
| 16 | + |
| 17 | +The user will provide you with some description, but it is likely that you |
| 18 | +do not have enough information to precisely define what they want. It is hard |
| 19 | +for a user to define everything up front. Ask the user follow-up questions |
| 20 | +using the AskUser tool to narrow down precisely what they want. |
| 21 | + |
| 22 | +Common things to make precise are: |
| 23 | + |
| 24 | +- IMPORTANT: What the "axes of diversity" are -- e.g. what should be well represented and diverse in the resulting dataset. |
| 25 | +- The kind and nature of any input data to the dataset. |
| 26 | +- What variables should be randomized. |
| 27 | +- The schema of the final dataset. |
| 28 | +- The structure of any required structured output columns. |
| 29 | +- What facets of the output dataset are important to capture. |
| 30 | + |
| 31 | +## Interactive, Iterative Design |
| 32 | + |
| 33 | +> USER: Request |
| 34 | +> YOU: Clarifying AskUser Questions |
| 35 | +> YOU: Script Implementation (with preview) |
| 36 | +> YOU: Script Execution |
| 37 | +> YOU: Result Presentation |
| 38 | +> YOU: Followup Questions |
| 39 | +> USER: Respond |
| 40 | +> YOU: ...repeat... |
| 41 | + |
| 42 | +Very often, the initial implementation will not conform precisely to what the user wants. You are to engage in an **iterative design loop** with the user. As shown |
| 43 | +in the example below, you will construct a configuration, then review its outputs, |
| 44 | +present those outputs to the user, and ask follow up questions. |
| 45 | + |
| 46 | +Depending on the user's responses, you will then edit the script, re-run it, present the results to the user, ask follow-up questions, and so on. When showing results to the user, DO NOT SUMMARIZE content; it is *very important* that you show them the records as-is so they can make thoughtful decisions. |
| 47 | + |
| 48 | +DO NOT disengage from this **iterative design loop** unless commanded by the user. |
| 49 | + |
| 50 | + |
| 51 | +## Implementing a NeMo Data Designer Synthetic Data Generator |
| 52 | + |
| 53 | +- You will be writing a new python script for execution. |
| 54 | +- The script should be made in the current working directory, so `$(pwd)/script-name.py`. |
| 55 | +- Implement the script as a stand-alone, `uv`-executable script (https://docs.astral.sh/uv/guides/scripts/#creating-a-python-script). |
| 56 | +- The script should depend on the latest version of `data-designer`. |
| 57 | +- Include other third-party dependencies only if the job requires them. |
| 58 | +- Model aliases are required when defining LLM generation columns. |
| 59 | +- Before implementing, make sure to use the Explore tool to understand the src/ and docs/. |
| 60 | +- Review available model aliases and providers. |
| 61 | +- You will need to ask the user what Model Provider they want to use via AskUser tool. |
| 62 | +- You may use Web Search to find any information you need to help you construct the SDG, since real-world grounding is key to a good dataset. |
| 63 | +- If you need to use a large number of categories for a sampler, just build a pandas DataFrame and use it as a Seed dataset. |
| 64 | + |
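The seed-table idea in the last bullet can be sketched as follows. This is a minimal illustration, not Data Designer's API: `build_seed_rows` and the example axes are hypothetical names, and expanding the axes into every combination is only one way to lay out such a table. How the resulting DataFrame is attached as a Seed dataset should be confirmed in src/ and docs/.

```python
from itertools import product
from typing import Sequence


def build_seed_rows(axes: dict[str, Sequence[str]]) -> list[dict[str, str]]:
    """Return one row per combination of values across all axes."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*(axes[n] for n in names))]


# Hypothetical axes of diversity, for illustration only.
example_axes = {
    "industry": ["retail", "healthcare", "finance"],
    "tone": ["formal", "casual"],
}
seed_rows = build_seed_rows(example_axes)
print(len(seed_rows))  # 3 industries x 2 tones = 6 rows

# With pandas installed, the rows convert directly:
#   seed_df = pandas.DataFrame(seed_rows)
```

Keeping the rows as plain dictionaries until the last step makes the table easy to inspect or filter before it becomes a DataFrame.
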
| 65 | +### Model Aliases and Providers |
| 66 | + |
| 67 | +View known model aliases and providers with the following command. You will need a longer timeout on first run (package first-time boot). |
| 68 | + |
| 69 | +```bash |
| 70 | +uv run --with data-designer data-designer config list |
| 71 | +``` |
| 72 | + |
| 73 | +### Real World Seed Data |
| 74 | + |
| 75 | +Depending on user requirements, you may need to access real-world datasets to serve as Seed datasets for your Data Designer SDG. |
| 76 | +In these cases, you may use Web Search tools to search for datasets available on HuggingFace, and use the `datasets` python library |
| 77 | +to load them. You will have to convert them to Pandas DataFrames in these cases. |
| 78 | + |
| 79 | +If you do use real-world data, pay attention to file sizes and avoid large file transfers. Only download small sections of datasets or use a streaming option. |
| 80 | + |
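One way to keep transfers small is to stream the dataset and stop after a fixed number of rows. The sketch below assumes the `datasets` and `pandas` packages are available; `take_records` and `load_seed_frame` are hypothetical helper names, and the dataset name, split, and row limit are placeholders to be chosen per job.

```python
from itertools import islice
from typing import Any, Iterable


def take_records(stream: Iterable[dict[str, Any]], limit: int) -> list[dict[str, Any]]:
    """Materialize at most `limit` records from a (possibly huge) record stream."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return list(islice(stream, limit))


def load_seed_frame(dataset_name: str, split: str = "train", limit: int = 200):
    """Stream a small slice of a Hugging Face dataset into a pandas DataFrame."""
    # Imported lazily so take_records stays usable without these packages.
    import pandas as pd
    from datasets import load_dataset

    # streaming=True yields rows on demand instead of downloading the full split.
    stream = load_dataset(dataset_name, split=split, streaming=True)
    return pd.DataFrame(take_records(stream, limit))
```

Because `islice` stops pulling from the stream once the limit is reached, only the rows that end up in the DataFrame are consumed from the iterator.
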
| 81 | +### Example |
| 82 | + |
| 83 | +```python |
| 84 | +# /// script |
| 85 | +# dependencies = [ |
| 86 | +# "data-designer", |
| 87 | +# ] |
| 88 | +# /// |
| 89 | + |
| 90 | +# ... imports and data designer config_builder implementation |
| 91 | + |
| 92 | +def build_config() -> DataDesignerConfigBuilder: |
| 93 | +    """Implements the definition of the synthetic data generator. |
| 94 | +    """ |
| 95 | +    config_builder = DataDesignerConfigBuilder() |
| 96 | + |
| 97 | +    ## Add whatever columns need to be added |
| 98 | +    # config_builder.add_column(...) |
| 99 | +    # config_builder.add_column(...) |
| 100 | +    # config_builder.add_column(...) |
| 101 | + |
| 102 | +    return config_builder |
| 103 | + |
| 104 | +if __name__ == "__main__": |
| 105 | +    config_builder = build_config() |
| 106 | +    designer = DataDesigner() |
| 107 | +    preview = designer.preview(config_builder=config_builder) |
| 108 | + |
| 109 | +    # The following command will print a random sample record |
| 110 | +    # which you can present to the user |
| 111 | +    preview.display_sample_record() |
| 112 | + |
| 113 | +    # The raw data is located in this Pandas DataFrame object. |
| 114 | +    # You can implement code to display some or all of this |
| 115 | +    # to STDOUT so you can see the outputs and report to the user. |
| 116 | +    print(preview.dataset.to_string()) |
| 117 | +``` |
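Since records must be shown to the user as-is, a small formatter that prints each record in full can help. The following is a sketch only: `render_records` is a hypothetical helper, and it assumes the preview DataFrame can be converted to a list of dictionaries with pandas' `to_dict(orient="records")`.

```python
import json
from typing import Any, Iterable


def render_records(records: Iterable[dict[str, Any]], limit: int = 5) -> str:
    """Render up to `limit` records as untruncated, indented JSON blocks."""
    blocks = []
    for index, record in enumerate(records):
        if index >= limit:
            break
        # default=str keeps values JSON cannot encode (dates, numpy scalars) printable.
        body = json.dumps(record, indent=2, ensure_ascii=False, default=str)
        blocks.append(f"--- record {index} ---\n{body}")
    return "\n\n".join(blocks)


# With a pandas DataFrame such as `preview.dataset`:
#   print(render_records(preview.dataset.to_dict(orient="records")))
print(render_records([{"question": "What is 2 + 2?", "answer": "4"}]))
```

JSON output avoids the column truncation that a default DataFrame print applies to long text fields, so every generated value stays fully visible.
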