
Commit 328e935

added starter notebook
1 parent 1aa520e commit 328e935

25 files changed (+1472 −65 lines)

README.md

Lines changed: 69 additions & 0 deletions

````diff
@@ -35,6 +35,8 @@ export GOOGLE_API_KEY="your-google-key" # optional
 
 ## Quick Start
 
+For a comprehensive tutorial with detailed explanations, see [starter_notebook.ipynb](starter_notebook.ipynb).
+
 ### 1. Extract and Cluster Properties with `explain()`
 
 ```python
@@ -90,6 +92,63 @@ clustered_df, model_stats = explain(
 )
 ```
 
+### Using Custom Column Names
+
+If your dataframe uses different column names, map them with the column-mapping parameters:
+
+```python
+# Your dataframe has custom column names
+df = pd.DataFrame({
+    "input": ["What is ML?", "Explain QC"],
+    "llm_name": ["gpt-4", "gpt-4"],
+    "output": ["ML is...", "QC uses..."],
+    "accuracy": [0.95, 0.88],
+    "helpfulness": [4.2, 4.5]
+})
+
+# Map custom column names to the expected StringSight names
+clustered_df, model_stats = explain(
+    df,
+    prompt_column="input",           # Map "input" → "prompt"
+    model_column="llm_name",         # Map "llm_name" → "model"
+    model_response_column="output",  # Map "output" → "model_response"
+    score_columns=["accuracy", "helpfulness"],
+    output_dir="results/test"
+)
+```
+
+For side-by-side comparisons with custom column names:
+
+```python
+df = pd.DataFrame({
+    "query": ["What is ML?", "Explain QC"],
+    "model_1": ["gpt-4", "gpt-4"],
+    "model_2": ["claude-3", "claude-3"],
+    "response_1": ["ML is...", "QC uses..."],
+    "response_2": ["ML involves...", "QC leverages..."],
+    "accuracy_a": [0.95, 0.88],
+    "accuracy_b": [0.92, 0.85]
+})
+
+clustered_df, model_stats = explain(
+    df,
+    method="side_by_side",
+    prompt_column="query",                 # Map "query" → "prompt"
+    model_a_column="model_1",              # Map "model_1" → "model_a"
+    model_b_column="model_2",              # Map "model_2" → "model_b"
+    model_a_response_column="response_1",  # Map "response_1" → "model_a_response"
+    model_b_response_column="response_2",  # Map "response_2" → "model_b_response"
+    score_columns=["accuracy_a", "accuracy_b"],  # Note: score columns need _a/_b suffixes
+    output_dir="results/test"
+)
+```
+
+**Note:** The default column names are:
+- `prompt`, `model`, `model_response`, `question_id` (optional) for single_model
+- `prompt`, `model_a`, `model_b`, `model_a_response`, `model_b_response`, `question_id` (optional) for side_by_side
+
+If your columns already match these names, you don't need to specify any mapping parameters.
+
 ### 2. Fixed Taxonomy Labeling with `label()`
 
 When you know exactly which behavioral axes you care about:
@@ -140,6 +199,10 @@ Use the React frontend or other visualization tools to explore your results.
 |--------|-------------|---------|
 | `score` | Evaluation metrics dictionary | `{"accuracy": 0.85, "helpfulness": 4.2}` |
 | `score_columns` | Alternative: separate columns for each metric (e.g., `accuracy`, `helpfulness`) instead of a dict | `score_columns=["accuracy", "helpfulness"]` |
+| `prompt_column` | Name of the prompt column in your dataframe (default: `"prompt"`) | `prompt_column="input"` |
+| `model_column` | Name of the model column for single_model (default: `"model"`) | `model_column="llm_name"` |
+| `model_response_column` | Name of the model response column for single_model (default: `"model_response"`) | `model_response_column="output"` |
+| `question_id_column` | Name of the question_id column (default: `"question_id"` if the column exists) | `question_id_column="qid"` |
 
 ### Side-by-Side Comparisons
 
@@ -159,6 +222,12 @@ Use the React frontend or other visualization tools to explore your results.
 |--------|-------------|---------|
 | `score` | Winner and metrics | `{"winner": "model_a", "helpfulness_a": 4.2, "helpfulness_b": 3.8}` |
 | `score_columns` | Alternative: separate columns for each metric with `_a` and `_b` suffixes (e.g., `accuracy_a`, `accuracy_b`) | `score_columns=["accuracy_a", "accuracy_b", "helpfulness_a", "helpfulness_b"]` |
+| `prompt_column` | Name of the prompt column in your dataframe (default: `"prompt"`) | `prompt_column="query"` |
+| `model_a_column` | Name of the model_a column (default: `"model_a"`) | `model_a_column="model_1"` |
+| `model_b_column` | Name of the model_b column (default: `"model_b"`) | `model_b_column="model_2"` |
+| `model_a_response_column` | Name of the model_a_response column (default: `"model_a_response"`) | `model_a_response_column="response_1"` |
+| `model_b_response_column` | Name of the model_b_response column (default: `"model_b_response"`) | `model_b_response_column="response_2"` |
+| `question_id_column` | Name of the question_id column (default: `"question_id"` if the column exists) | `question_id_column="qid"` |
 
 **Option 2: Tidy Data (Auto-pairing)**
 
````
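The `score` / `score_columns` rows in the tables above describe two interchangeable metric layouts: one dict per row, or one column per metric. A minimal stdlib-only sketch of the conversion between them, using plain dicts as stand-ins for dataframe rows (the helper `to_score_dicts` is illustrative, not part of StringSight):

```python
# Convert per-metric columns (score_columns style) into a single
# "score" dict per row (score style). Plain dicts stand in for
# dataframe rows; explain() itself is not called here.
rows = [
    {"prompt": "What is ML?", "model": "gpt-4", "accuracy": 0.95, "helpfulness": 4.2},
    {"prompt": "Explain QC", "model": "gpt-4", "accuracy": 0.88, "helpfulness": 4.5},
]
score_columns = ["accuracy", "helpfulness"]

def to_score_dicts(rows, score_columns):
    """Move the metric fields of each row into one 'score' dict."""
    out = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are untouched
        row["score"] = {c: row.pop(c) for c in score_columns}
        out.append(row)
    return out

converted = to_score_dicts(rows, score_columns)
print(converted[0]["score"])  # {'accuracy': 0.95, 'helpfulness': 4.2}
```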
pyproject.toml

Lines changed: 1 addition & 1 deletion

```diff
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "stringsight"
-version = "0.3.1"
+version = "0.3.2"
 authors = [
     {name = "Lisa Dunlap", email = "[email protected]"},
 ]
```
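The bump from 0.3.1 to 0.3.2 is a patch release. A small aside on comparing such dotted version strings: they order correctly only as numeric tuples, not as plain strings (the `version_key` helper below is illustrative, not part of stringsight):

```python
def version_key(version: str) -> tuple:
    """Split a dotted version string into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split("."))

# Numeric tuple comparison gives the expected ordering for this release.
assert version_key("0.3.2") > version_key("0.3.1")

# String comparison breaks once a component reaches two digits:
assert "0.10.0" < "0.9.0"                            # lexicographic: wrong order
assert version_key("0.10.0") > version_key("0.9.0")  # numeric: correct
```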

scripts/dataset_configs/aci_bench.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -3,7 +3,7 @@ output_dir: results/aci_bench
 method: single_model
 min_cluster_size: 12
 embedding_model: text-embedding-3-large
-max_workers: 64
+max_workers: 16
 groupby_column: behavior_type
 assign_outliers: false
 task_description: |
```

scripts/dataset_configs/instructeval.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -6,6 +6,6 @@ model_b: openai/gpt-5-nano-2025-08-07
 min_cluster_size: 5
 sample_size: 50
 embedding_model: text-embedding-3-small
-max_workers: 64
+max_workers: 16
 groupby_column: behavior_type
 assign_outliers: false
```

scripts/dataset_configs/koala.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -3,7 +3,7 @@ output_dir: results/koala
 method: single_model
 min_cluster_size: 5
 embedding_model: text-embedding-3-small
-max_workers: 64
+max_workers: 16
 groupby_column: behavior_type
 assign_outliers: false
 
```

scripts/dataset_configs/medi_qa.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -3,7 +3,7 @@ output_dir: results/medi_qa
 method: single_model
 min_cluster_size: 5
 embedding_model: text-embedding-3-small
-max_workers: 64
+max_workers: 16
 groupby_column: behavior_type
 assign_outliers: false
 models:
```

scripts/dataset_configs/omni_math_gpt.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -3,7 +3,7 @@ output_dir: results/omni_math_gpt
 method: single_model
 min_cluster_size: 8
 embedding_model: text-embedding-3-large
-max_workers: 64
+max_workers: 16
 groupby_column: behavior_type
 assign_outliers: false
 sample_size: 50
```

scripts/dataset_configs/omni_math_top_models.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -3,7 +3,7 @@ output_dir: results/omni_math_top_models
 method: single_model
 min_cluster_size: 8
 embedding_model: text-embedding-3-large
-max_workers: 64
+max_workers: 16
 groupby_column: behavior_type
 assign_outliers: false
 models:
```

scripts/dataset_configs/safety.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -3,7 +3,7 @@ output_dir: results/safety
 method: single_model
 min_cluster_size: 5
 embedding_model: text-embedding-3-small
-max_workers: 64
+max_workers: 16
 groupby_column: behavior_type
 assign_outliers: false
 task_description: |
```

scripts/dataset_configs/taubench_airline.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -3,7 +3,7 @@ output_dir: results/taubench_airline_data
 method: single_model
 min_cluster_size: 5
 embedding_model: text-embedding-3-small
-max_workers: 64
+max_workers: 16
 groupby_column: behavior_type
 assign_outliers: false
 system_prompt: agent_system_prompt
```
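Every config change in this commit lowers `max_workers` from 64 to 16. A minimal sketch of how such a cap is typically consumed, using the stdlib's `ThreadPoolExecutor`; the `config` dict and `process` function are illustrative stand-ins, not StringSight internals:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-in for a parsed dataset config.
config = {"max_workers": 16, "min_cluster_size": 5}

def process(item: int) -> int:
    """Stand-in for per-item work (e.g., an embedding or LLM call)."""
    return item * item

items = list(range(100))

# max_workers bounds concurrency: at most 16 tasks are in flight at
# once, which helps stay under API rate limits and memory budgets.
with ThreadPoolExecutor(max_workers=config["max_workers"]) as pool:
    results = list(pool.map(process, items))

print(results[:5])  # [0, 1, 4, 9, 16]
```

Lowering the cap trades throughput for fewer simultaneous requests; `pool.map` still returns results in input order regardless of worker count.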
