Commit 9312656

Merge branch 'main' into agentic_capability_generation
2 parents 589e0e2 + 3c241f9 commit 9312656

3 files changed, +167 -14 lines changed

example_scripts/README.md

Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
1+
2+
## `train_test_embedding_visualization` example
3+
4+
Here we describe the steps required for reading and selecting pre-generated capabilities and their tasks, generating capability embeddings, filtering capabilities based on those embeddings, reducing dimensionality, and visualizing capabilities. All of these steps are implemented in the `train_test_embedding_visualization.py` script, which runs the process for both `train` and `test` capabilities. The directory containing the `train` and `test` capabilities and tasks is specified in the `train_test_embedding_visualization_cfg.yaml` file.
5+
6+
You can also find the steps for loading and visualizing LLM scores in `plot_llm_capability_scores.py`. The scores can be plotted using a spider chart or a bar chart via the `plot_capability_scores_spider_and_bar_chart()` function.
7+
8+
9+
Step 1: Read the already generated and saved train capabilities:
10+
11+
```python
12+
# Read the capabilities from the base directory
13+
train_capability_dir = os.path.join(
14+
cfg.capabilities_cfg.saved_capabilities_dir,
15+
cfg.capabilities_cfg.domain,
16+
)
17+
# Fetch previously generated capabilities
18+
capabilities = get_previous_capabilities(capability_dir=train_capability_dir)
19+
20+
```
21+
22+
Step 2: Select the complete capabilities and sort them by name. A capability is complete when enough verified tasks have been generated for it, i.e. at least `num_tasks_lower_bound`.
23+
24+
```python
25+
26+
logger.info(f"All capability names:\n{capabilities}")
27+
# Select complete capabilities (same set of capabilities were evaluated)
28+
capabilities = select_complete_capabilities(
29+
capabilities=capabilities,
30+
strict=False,
31+
num_tasks_lower_bound=int(
32+
cfg.capabilities_cfg.num_gen_tasks_per_capability
33+
* (1 - cfg.capabilities_cfg.num_gen_tasks_buffer)
34+
),
35+
)
36+
capabilities = sorted(capabilities, key=lambda x: x.name)
37+
38+
```
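
As a worked illustration of the `num_tasks_lower_bound` expression in Step 2, the sketch below evaluates it for made-up values. The helper name `task_lower_bound` and the numbers (100 tasks per capability, a 0.25 buffer) are assumptions chosen for illustration, not values from the repository's config. Note that `int()` truncates, so fractional bounds are rounded down.

```python
def task_lower_bound(num_gen_tasks_per_capability: int, num_gen_tasks_buffer: float) -> int:
    """Mirror the Step 2 expression: int(n * (1 - buffer))."""
    return int(num_gen_tasks_per_capability * (1 - num_gen_tasks_buffer))


# Hypothetical config values, used only for illustration.
lower_bound = task_lower_bound(num_gen_tasks_per_capability=100, num_gen_tasks_buffer=0.25)
print(lower_bound)  # 75

# int() truncates toward zero: 10 * 0.75 = 7.5 becomes 7.
small_lower_bound = task_lower_bound(num_gen_tasks_per_capability=10, num_gen_tasks_buffer=0.25)
print(small_lower_bound)  # 7
```

With these values, a capability with 75 or more verified tasks would count as complete.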
39+
40+
41+
42+
Step 3: Generate capability embeddings using an OpenAI embedding model and assign them to each capability object.
43+
44+
```python
45+
# Embed capabilities using the OpenAI embedding model
46+
generate_and_set_capabilities_embeddings(
47+
capabilities=capabilities,
48+
embedding_model_name=cfg.embedding_cfg.embedding_model,
49+
embed_dimensions=cfg.embedding_cfg.embedding_size,
50+
)
51+
```
52+
53+
Step 4: Filter capabilities based on their embeddings. When the embeddings of two capabilities are more similar than a configured threshold, one of the two capabilities is removed.
54+
55+
```python
56+
# Filter capabilities based on their embeddings
57+
filtered_capabilities = filter_capabilities(
58+
capabilities,
59+
embedding_model_name=cfg.embedding_cfg.embedding_model,
60+
similarity_threshold=cfg.embedding_cfg.filtering_similarity_threshold,
61+
)
62+
```
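
The idea behind threshold-based filtering can be sketched as follows. This is a minimal greedy version that assumes cosine similarity and keeps the first capability of each near-duplicate pair. The actual `filter_capabilities` implementation is not shown here and may choose differently, and the helper names and toy vectors below are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def greedy_similarity_filter(embeddings, similarity_threshold):
    """Return indices of embeddings kept after dropping near-duplicates.

    An embedding is kept only if its similarity to every already-kept
    embedding is at or below the threshold.
    """
    kept = []
    for idx, emb in enumerate(embeddings):
        if all(
            cosine_similarity(emb, embeddings[k]) <= similarity_threshold
            for k in kept
        ):
            kept.append(idx)
    return kept


# Toy 2D "embeddings" (assumed values, for illustration only).
toy_embeddings = [
    [1.0, 0.0],
    [0.99, 0.05],  # nearly parallel to the first vector
    [0.0, 1.0],
]
kept_indices = greedy_similarity_filter(toy_embeddings, similarity_threshold=0.9)
print(kept_indices)  # [0, 2]
```

The second vector is dropped because its similarity to the first is above 0.9, while the orthogonal third vector is kept.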
63+
64+
Step 5: Reduce the dimensionality of the capability embeddings.
65+
66+
```python
67+
# Reduce the dimensionality of capability embeddings generated by the
68+
# embedding model.
69+
dim_reduction = apply_dimensionality_reduction(
70+
filtered_capabilities,
71+
dim_reduction_method_name=cfg.dimensionality_reduction_cfg.reduce_dimensionality_method,
72+
output_dimension_size=cfg.dimensionality_reduction_cfg.reduced_dimensionality_size,
73+
embedding_model_name=cfg.embedding_cfg.embedding_model,
74+
tsne_perplexity=cfg.dimensionality_reduction_cfg.tsne_perplexity,
75+
normalize_output=cfg.dimensionality_reduction_cfg.normalize_output,
76+
)
77+
```
78+
79+
80+
Step 6: Visualize the reduced embeddings.
81+
82+
```python
83+
# Plot training capabilities
84+
plot_hierarchical_capability_2d_embeddings(
85+
capabilities=filtered_capabilities,
86+
dim_reduction_method=cfg.dimensionality_reduction_cfg.reduce_dimensionality_method,
87+
save_dir=cfg.embedding_visualization_cfg.save_dir,
88+
plot_name=cfg.embedding_visualization_cfg.plot_name,
89+
show_point_ids=cfg.embedding_visualization_cfg.show_point_ids,
90+
)
91+
```
92+
93+
Step 7: Generate the capability heatmap from the original (non-reduced) embeddings.
94+
95+
```python
96+
generate_capability_heatmap(
97+
capabilities=filtered_capabilities,
98+
embedding_model_name=cfg.embedding_cfg.embedding_model, # Using the original embeddings, not the reduced version.
99+
save_dir=cfg.heatmap_cfg.save_dir,
100+
plot_name=cfg.heatmap_cfg.plot_name,
101+
add_squares=cfg.heatmap_cfg.add_squares,
102+
)
103+
```
104+
105+
106+
Step 8: Load the **test capabilities** and generate their embeddings with the same OpenAI embedding model as in the previous steps. The only difference is that dimensionality reduction reuses the PCA model already fitted on the train capabilities instead of fitting a new one. The 2D test embeddings are then plotted together with the train embeddings to show their relative distances.
107+
108+
```python
109+
# Use the fitted PCA dim reduction to transform the test capabilities
110+
apply_dimensionality_reduction_to_test_capabilities(
111+
test_capabilities,
112+
dim_reduction_method=dim_reduction,
113+
embedding_model_name=cfg.embedding_cfg.embedding_model,
114+
)
115+
```
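
The reason for reusing the fitted model is that train and test points must be projected onto the same axes to be comparable in one plot. The sketch below illustrates this fit-on-train, transform-on-test pattern with a small NumPy-based PCA. `SimplePCA` and the random stand-in embeddings are assumptions for illustration and are not the repository's `apply_dimensionality_reduction` implementation.

```python
import numpy as np


class SimplePCA:
    """Minimal PCA: learn the mean and principal axes from training data."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        # Rows of vt are principal axes, ordered by explained variance.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def transform(self, X):
        # Center with the *training* mean, then project onto the training axes.
        return (np.asarray(X, dtype=float) - self.mean_) @ self.components_.T


rng = np.random.default_rng(0)
train_embeddings = rng.normal(size=(20, 8))  # stand-in for train embeddings
test_embeddings = rng.normal(size=(5, 8))    # stand-in for test embeddings

pca = SimplePCA(n_components=2).fit(train_embeddings)  # fit on train only
train_2d = pca.transform(train_embeddings)
test_2d = pca.transform(test_embeddings)  # reuse the fitted axes, no refit
print(train_2d.shape, test_2d.shape)
```

Fitting a second PCA on the test set alone would produce different axes, so the two point clouds could not be compared in a single 2D plot.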
116+
117+
Step 9: Visualize train and test capability embeddings together.
118+
119+
```python
120+
all_capabilities = filtered_capabilities + test_capabilities
121+
logger.info(
122+
f"Visualizing {len(all_capabilities)} train and test capabilities at {cfg.embedding_visualization_cfg.save_dir}"
123+
)
124+
plot_hierarchical_capability_2d_embeddings(
125+
capabilities=all_capabilities,
126+
dim_reduction_method=cfg.dimensionality_reduction_cfg.reduce_dimensionality_method,
127+
save_dir=cfg.embedding_visualization_cfg.save_dir,
128+
plot_name=cfg.embedding_visualization_cfg.plot_name + " Train and Test",
129+
show_point_ids=cfg.embedding_visualization_cfg.show_point_ids,
130+
)
131+
```
132+
133+
134+
### How are embeddings generated?
135+
136+
The `generate_and_set_capabilities_embeddings()` function in `src/utils/embedding_utils.py` handles this process. Each capability's name, area, and description are combined into the representation string `rep_string`. The embeddings are then generated with the OpenAI embedding model via `embedding_generator` and assigned to each capability object.
137+
The representation string was chosen based on visualization-based experiments and is defined as:
138+
139+
```python
140+
rep_string = f"{capability_dict['name']} - {capability.area}: {capability_dict['description']}"
141+
```
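
To make the format concrete, the sketch below builds a representation string for a toy capability. `ToyCapability`, its `to_dict()` method, `build_rep_string`, and the example values are hypothetical stand-ins. Only the f-string layout is taken from the snippet above.

```python
from dataclasses import dataclass


@dataclass
class ToyCapability:
    """Hypothetical stand-in for the project's capability object."""

    name: str
    area: str
    description: str

    def to_dict(self):
        # Assumed shape of `capability_dict` in the snippet above.
        return {"name": self.name, "description": self.description}


def build_rep_string(capability):
    """Combine name, area, and description in the documented layout."""
    capability_dict = capability.to_dict()
    return f"{capability_dict['name']} - {capability.area}: {capability_dict['description']}"


toy = ToyCapability(
    name="personalized_budget_planning",
    area="budgeting",
    description="Ability to generate a realistic monthly budget.",
)
rep_string = build_rep_string(toy)
print(rep_string)
# personalized_budget_planning - budgeting: Ability to generate a realistic monthly budget.
```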

example_scripts/example_cfg/train_test_embedding_visualization_cfg.yaml

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ dimensionality_reduction_cfg:
1919
normalize_output: False
2020

2121
embedding_visualization_cfg:
22-
save_dir: /fs01/projects/aieng/public/acecapabilities_o4-mini_C100_R5_A10_T100/visualizations
22+
save_dir: /fs01/projects/aieng/public/ace/capabilities_o4-mini_C100_R5_A10_T100/visualizations
2323
plot_name: "PCA Embeddings"
2424
show_point_ids: False # Set to true when plotting a small number of capabilities.
2525

src/utils/prompts.py

Lines changed: 25 additions & 13 deletions
@@ -1,5 +1,9 @@
11
CAPABILITY_GENERATION_SYSTEM_PROMPT = """
2-
You are an expert in designing capabilities to assess the abilities of large language models (LLMs). Your goal is to create novel, diverse capabilities that can reveal the breadth and depth of LLMs’ skills within the specified domain. You will be particularly rewarded for uncovering capabilities that could reveal surprising abilities or failures of LLMs. Valid capabilities will be added to a capability archive. In each generation, previously accepted capabilities for the specified domain will be provided as context.
2+
You are an expert in designing capabilities to assess the abilities of foundation models.
3+
Your goal is to create novel, diverse capabilities that can reveal the breadth and depth of a foundation model's skills within the specified domain.
4+
You will be particularly rewarded for a comprehensive design of capabilities.
5+
Valid capabilities will be added to a capability archive.
6+
In each generation, previously accepted capabilities for the specified domain will be provided as context.
37
48
Each capability should be designed according to the METR Standard, which requires the following Python format:
59
```python
@@ -33,7 +37,7 @@ def parse_submission(submission: str) -> str:
3337
# Parse the submission string to extract the answer based on the "ANSWER" keyword.
3438
# Return an empty string if no match is found.
3539
```
36-
3. The score function should use a helper function that uses LLM as a judge to score the submission:
40+
3. The score function should use a helper function that uses a large language model (LLM) as a judge to score the submission:
3741
```python
3842
def evaluate_with_llm_judge(
3943
submission: str,
@@ -55,14 +59,16 @@ def evaluate_with_llm_judge(
5559
5660
In <THOUGHT>, briefly think and reason about what kind of capability you want to propose.
5761
In <JSON>, provide a JSON response of the new capability with the following fields:
58-
- "name": A concise, descriptive label (lowercase, no spaces, e.g., "math_competition_algebra").
59-
- "description": A clear explanation of what the capability entails (e.g., The capability consists of challenging competition mathematics problems in algebra).
60-
- "domain": The domain to which the capability belongs to (e.g., math, physics, etc.).
62+
- "name": A concise, descriptive label (lowercase, no spaces, e.g., "personalized_budget_planning").
63+
- "description": A clear explanation of what the capability entails (e.g., "Ability to generate a realistic monthly budget tailored to an individual's income, fixed and variable expenses, and financial goals. Requires understanding spending categories, prioritization, and basic cash flow allocation.").
64+
- "domain": The domain to which the capability belongs (e.g., personal finance, math, etc.).
6165
- "class": The fully implemented Python code for the Capability class. This should be easily human-readable.
6266
6367
Do not download additional data from the internet or access the file system.
6468
65-
Be creative and design capabilities that can distinguish between models with varying levels of expertise, but ensure that the capability remains relevant to the domain. Also ensure that the proposed capabilities ARE DISTINCT compared to the existing capabilities. Names of all existing capabilities will be provided.
69+
Be creative and design capabilities that can distinguish between different levels of expertise, but ensure that the capability remains relevant to the domain.
70+
Also ensure that the proposed capabilities ARE DISTINCT compared to the existing capabilities.
71+
Names of all existing capabilities will be provided.
6672
6773
Your response will be automatically parsed so ensure it adheres to the specified format.
6874
""" # noqa: D100
@@ -76,7 +82,7 @@ def evaluate_with_llm_judge(
7682
Existing capability names:
7783
{prev_capabilities}
7884
79-
Generate {num_gen_capabilities} new, interesting capabilities within the {domain} domain.
85+
Generate {num_gen_capabilities} new capabilities within the {domain} domain that are **semantically and functionally distinct** from the existing capabilities.
8086
"""
8187

8288
HIERARCHICAL_CAPABILITY_GENERATION_USER_PROMPT = """
@@ -88,11 +94,14 @@ def evaluate_with_llm_judge(
8894
Existing capability names:
8995
{{prev_capabilities}}
9096
91-
Generate {{num_gen_capabilities}} new, interesting capabilities for the "{capability_area}" area within the {{domain}} domain.
97+
Generate {{num_gen_capabilities}} new capabilities for the "{capability_area}" area within the {{domain}} domain that do not overlap with the existing capabilities.
9298
"""
9399

94100
HIERARCHICAL_CAPABILITY_AREAS_GENERATION_USER_PROMPT = """
95-
You are an expert in designing capabilities to assess the abilities of large language models (LLMs). Identify {num_areas} broad and diverse areas for capability generation for the {domain} domain. Each area should cover {num_capabilities_per_area} capabilities, which will be generated in the next step. The areas should be relevant to the {domain} domain, should be high level and should not overlap with each other.
101+
You are an expert in designing capabilities to assess the abilities of foundation models.
102+
For the domain of {domain}, identify {num_areas} high-level, broad, diverse, and non-overlapping areas for capability generation.
103+
Each area should cover {num_capabilities_per_area} capabilities, which will be generated in the next step.
104+
Aim for each area to cover a broad subdomain or skill cluster within the domain.
96105
97106
Respond precisely in the following format:
98107
@@ -132,7 +141,7 @@ def evaluate_with_llm_judge(
132141
"""
133142

134143
TASK_GENERATION_SYSTEM_PROMPT = """
135-
You are an expert in designing tasks for a given capability. The name, description, {zero_or_few_shot_patch} for the capability will be provided. You will be particularly rewarded for designing diverse tasks spanning a wide range of difficulty levels for the given capability.
144+
You are an expert in designing tasks for a given capability. The name, description, {zero_or_few_shot_patch} for the capability will be provided. Ensure the designed tasks are diverse and span a wide range of difficulty levels for the given capability.
136145
137146
Respond precisely in the following format, including the JSON start and end markers:
138147
@@ -144,6 +153,7 @@ def evaluate_with_llm_judge(
144153
In <STR>, provide a string containing the task text.
145154
146155
Be careful to make sure that all proposed tasks are unique. Also ensure that all tasks are within the scope of the given capability.
156+
Avoid simple rewordings or duplicates in structure. Tasks should test different angles of the capability.
147157
If the text includes mathematical symbols or equations, ensure they are appropriately formatted using LaTeX. Ensure the single backslash "\\" included in a LaTeX string is escaped as "\\\\". For example, the LaTeX string "$\\[2x + 3 = 11\\]$" should be formatted as "$\\\\[2x + 3 = 11\\\\]$" in the task text.
148158
149159
Your response will be automatically parsed so ensure it adheres to the specified format.
@@ -172,6 +182,7 @@ def evaluate_with_llm_judge(
172182
In <STR>, provide a string containing the task text.
173183
174184
Be careful to make sure that all proposed tasks are unique. Also ensure that all tasks are within the scope of the given capability.
185+
Avoid simple rewordings or duplicates in structure. Tasks should test different angles of the capability.
175186
176187
If the text includes mathematical symbols or equations, ensure they are appropriately formatted using LaTeX. Ensure the single backslash "\\" included in a LaTeX string is escaped as "\\\\". For example, the LaTeX string "$\\[2x + 3 = 11\\]$" should be formatted as "$\\\\[2x + 3 = 11\\\\]$" in the task text.
177188
@@ -187,7 +198,7 @@ def evaluate_with_llm_judge(
187198
Description: {capability_description}
188199
Domain: {capability_domain}
189200
{zero_or_few_shot_patch}
190-
Generate {num_gen_tasks} new tasks for the given capability. Ensure that the tasks are diverse and span a wide range of difficulty levels, testing different aspects of the capability comprehensively.
201+
Design {num_gen_tasks} tasks that are diverse in format and difficulty, and collectively cover multiple dimensions of the capability's skill requirements.
191202
"""
192203

193204
TASK_GENERATION_ZERO_OR_FEW_SHOT_PATCH = {
@@ -207,11 +218,12 @@ def evaluate_with_llm_judge(
207218

208219

209220
TASK_SOLVER_SYSTEM_PROMPT = """
210-
You are an expert in completing tasks for the {capability_name} capability in the {capability_domain} domain. Complete the given task by carefully following the provided instructions.
221+
You are an expert in completing tasks for the {capability_name} capability in the {capability_domain} domain.
222+
Complete the given task by carefully following the provided instructions.
211223
"""
212224

213225
ANSWER_JUDGEMENT_SYSTEM_PROMPT = """
214-
You are an expert in evaluating answers to problems for the {capability_domain} domain. Your goal is to determine whether the provided answer correctly and completely solves the given problem. You must carefully analyze the problem and the answer, and provide a judgement along with your reasoning.
226+
You are an expert in evaluating answers to problems for the {capability_domain} domain. Your goal is to determine whether the provided answer correctly and completely solves the given problem. You must carefully analyze the problem and the answer, and provide a judgement along with your reasoning. "Correctly and completely" means the answer must be accurate, sufficient, and aligned with the task's expectations.
215227
216228
Respond precisely in the following format:
217229
