feat: add metadata fields for synthetic data traceability (#2389)
## Issue Link / Problem Description
- Fixes #2385 - Testset generator not preserving persona and scenario metadata
- Improves synthetic data generation traceability by adding metadata fields to track query generation parameters
- Currently there's no way to trace which persona, style, and length settings were used for synthetic queries
## Changes Made
- Added metadata fields to `dataset_schema.py`:
  - `persona_name: Optional[str]`
  - `query_style: Optional[str]`
  - `query_length: Optional[str]`
- Updated `single_hop/base.py` to populate these fields during synthetic data generation:
```python
return SingleTurnSample(
    user_input=response.query,
    reference=response.answer,
    reference_contexts=[reference_context],
    persona_name=getattr(scenario.persona, "name", None),
    query_style=getattr(scenario.style, "name", None),
    query_length=getattr(scenario.length, "name", None),
)
```
- Updated class documentation with descriptions for new fields
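
The `getattr(..., "name", None)` pattern above can be illustrated in isolation. This is a minimal sketch, not the library's code: `PersonaStub`, `ScenarioStub`, `SampleStub`, `QueryStyle`, `QueryLength`, and `build_sample` are hypothetical stand-ins, and the assumption that style and length are enums whose `.name` is recorded is inferred from the `"POOR_GRAMMAR"` / `"MEDIUM"` values in the example output.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


# Hypothetical stand-ins for the real scenario types (assumption).
class QueryStyle(Enum):
    POOR_GRAMMAR = "poor grammar"


class QueryLength(Enum):
    MEDIUM = "medium"


@dataclass
class PersonaStub:
    name: str


@dataclass
class ScenarioStub:
    persona: Optional[PersonaStub] = None
    style: Optional[QueryStyle] = None
    length: Optional[QueryLength] = None


# Hypothetical stand-in for the sample schema: the three new fields are
# optional and default to None, so existing callers keep working.
@dataclass
class SampleStub:
    user_input: str
    reference: str
    persona_name: Optional[str] = None
    query_style: Optional[str] = None
    query_length: Optional[str] = None


def build_sample(scenario: ScenarioStub, query: str, answer: str) -> SampleStub:
    # getattr with a default returns None when the attribute holder is None,
    # instead of raising AttributeError.
    return SampleStub(
        user_input=query,
        reference=answer,
        persona_name=getattr(scenario.persona, "name", None),
        query_style=getattr(scenario.style, "name", None),
        query_length=getattr(scenario.length, "name", None),
    )


full = ScenarioStub(PersonaStub("Student"), QueryStyle.POOR_GRAMMAR, QueryLength.MEDIUM)
sample = build_sample(full, "What are the key features of Python?", "Python is ...")
print(sample.persona_name, sample.query_style, sample.query_length)
# prints: Student POOR_GRAMMAR MEDIUM

bare = build_sample(ScenarioStub(), "q", "a")
print(bare.persona_name, bare.query_style, bare.query_length)
# prints: None None None
```

Because every new field defaults to `None`, a scenario that lacks a persona, style, or length still produces a valid sample.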
## Testing
### How to Test
- [x] Manual testing steps:
  1. Run synthetic data generation using `SingleHopQuerySynthesizer`
  2. Verify metadata fields are properly populated in generated samples
  3. Confirm values match the scenario settings (persona, style, length)
  4. Check backwards compatibility with existing code
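
Steps 2 and 3 could also be checked programmatically. The helper below is a sketch under assumptions: `metadata_mismatches` and `METADATA_FIELDS` are hypothetical names, and the sample and scenario are modeled with `SimpleNamespace` objects rather than the real classes. It only assumes that each scenario attribute exposes a `.name`, as in the `getattr` calls in `single_hop/base.py`.

```python
from types import SimpleNamespace

# Sample field -> scenario attribute it is expected to mirror.
METADATA_FIELDS = {
    "persona_name": "persona",
    "query_style": "style",
    "query_length": "length",
}


def metadata_mismatches(sample, scenario):
    """Return the sample fields whose value differs from the scenario setting."""
    mismatches = []
    for sample_field, scenario_attr in METADATA_FIELDS.items():
        setting = getattr(scenario, scenario_attr, None)
        expected = getattr(setting, "name", None)
        actual = getattr(sample, sample_field, None)
        if actual != expected:
            mismatches.append(sample_field)
    return mismatches


# Hypothetical scenario and sample objects for illustration only.
scenario = SimpleNamespace(
    persona=SimpleNamespace(name="Student"),
    style=SimpleNamespace(name="POOR_GRAMMAR"),
    length=SimpleNamespace(name="MEDIUM"),
)
good = SimpleNamespace(
    persona_name="Student", query_style="POOR_GRAMMAR", query_length="MEDIUM"
)
bad = SimpleNamespace(
    persona_name="Student", query_style="POOR_GRAMMAR", query_length=None
)

print(metadata_mismatches(good, scenario))  # prints: []
print(metadata_mismatches(bad, scenario))   # prints: ['query_length']
```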
## References
- Fixes Issue: #2385
- Documentation: Updated in `dataset_schema.py` docstring
- Implementation: Updated in `single_hop/base.py` for field population
## Screenshots/Examples
```python
# Example of generated sample with metadata:
{
    "user_input": "What are the key features of Python?",
    "reference": "Python is a versatile programming language...",
    "persona_name": "Student",
    "query_style": "POOR_GRAMMAR",
    "query_length": "MEDIUM"
}
```
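
One way the metadata could be used for traceability is to tally generated samples by their generation parameters. This is a sketch, assuming samples are available as plain dicts shaped like the example above; `count_by_metadata` is a hypothetical helper, not part of the library. Samples created before this change lack the new keys and are grouped under `None`.

```python
from collections import Counter


def count_by_metadata(samples):
    """Count samples per (persona_name, query_style, query_length) triple."""
    return Counter(
        (s.get("persona_name"), s.get("query_style"), s.get("query_length"))
        for s in samples
    )


samples = [
    {
        "user_input": "What are the key features of Python?",
        "reference": "Python is a versatile programming language...",
        "persona_name": "Student",
        "query_style": "POOR_GRAMMAR",
        "query_length": "MEDIUM",
    },
    # A sample without the new fields (e.g. generated before this change).
    {"user_input": "What is Python?", "reference": "Python is ..."},
]

counts = count_by_metadata(samples)
print(counts[("Student", "POOR_GRAMMAR", "MEDIUM")])  # prints: 1
print(counts[(None, None, None)])                     # prints: 1
```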
    -    f"Size ratio: {size_ratio:.2f}, (Scaled: {scaled_size_ratio:.2f}), Time ratio: {time_ratio:.2f}"
    +    f"Size ratio: {size_ratio:.2f}, (Scaled: {scaled_size_ratio:.2f}), Time ratio: {time_ratio:.2f}, Tolerance: {tolerance_threshold:.2f}"
     )

    -assert time_ratio < scaled_size_ratio, (
    -    f"Time complexity growing faster than expected: size {results[i]['size']} vs {results[i-1]['size']}, time ratio {time_ratio:.2f} vs {scaled_size_ratio:.2f}"
    +assert time_ratio < tolerance_threshold, (
    +    f"Time complexity growing faster than expected: size {results[i]['size']} vs {results[i-1]['size']}, time ratio {time_ratio:.2f} vs {tolerance_threshold:.2f}"