Here we describe the steps required for reading and selecting pre-generated capabilities and their tasks, generating capability embeddings, filtering capabilities based on those embeddings, reducing dimensionality, and visualizing capabilities. All of these steps are implemented in the `train_test_embedding_visualization.py` script, which runs the process for both `train` and `test` capabilities. The directory containing the `train` and `test` capabilities and tasks is specified in the `train_test_embedding_visualization_cfg.yaml` file.
You can also find the steps for loading and visualizing LLM scores in `plot_llm_capability_scores.py`. The scores can be plotted using a spider chart or a bar chart via the `plot_capability_scores_spider_and_bar_chart()` function.
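As background for how a spider (radar) chart lays out per-capability scores, the closed-polygon geometry can be sketched in plain Python. The capability names and scores below are illustrative, not taken from the repository; `plot_capability_scores_spider_and_bar_chart()` itself may compute this differently:

```python
import math

def spider_chart_points(scores: dict[str, float]) -> list[tuple[float, float]]:
    """Map per-capability scores to (angle, radius) points on a closed polygon,
    the layout used by a spider (radar) chart."""
    n = len(scores)
    angles = [2 * math.pi * i / n for i in range(n)]
    points = list(zip(angles, scores.values()))
    # Repeat the first point so the polygon closes when plotted.
    return points + points[:1]

pts = spider_chart_points({"math": 0.8, "coding": 0.6, "reasoning": 0.7, "writing": 0.9})
```

Each point can then be drawn on a polar axis (e.g., matplotlib's `projection="polar"`) to produce the spider chart.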
Step 1: Read the previously generated and saved train capabilities:
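A minimal sketch of this loading step, assuming each capability is stored as one JSON file in the configured directory (the directory layout and file format here are assumptions, not taken from the repository):

```python
import json
from pathlib import Path

def load_capabilities(capability_dir: str) -> list[dict]:
    """Read every saved capability JSON file from a directory, in sorted order."""
    capabilities = []
    for path in sorted(Path(capability_dir).glob("*.json")):
        with path.open() as f:
            capabilities.append(json.load(f))
    return capabilities
```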
Step 4: Filter capabilities based on the embeddings: if two embeddings are more similar than a given threshold, one of the two capabilities is removed.
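A hedged sketch of such threshold-based filtering using cosine similarity (the repository's actual metric, threshold handling, and choice of which duplicate to drop may differ):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def filter_by_similarity(embeddings: dict[str, list[float]], threshold: float) -> list[str]:
    """Keep a capability only if its embedding is below the similarity
    threshold against every already-kept capability."""
    kept: list[str] = []
    for name, emb in embeddings.items():
        if all(cosine_similarity(emb, embeddings[k]) < threshold for k in kept):
            kept.append(name)
    return kept
```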
```python
    embedding_model_name=cfg.embedding_cfg.embedding_model,  # Using the original embeddings, not the reduced version.
    save_dir=cfg.heatmap_cfg.save_dir,
    plot_name=cfg.heatmap_cfg.plot_name,
    add_squares=cfg.heatmap_cfg.add_squares,
)
```
Step 8: **Test capabilities** are also loaded, and their embeddings are generated with the OpenAI embedding model just like in the previous steps. The only difference is that the already fitted PCA model is reused for dimensionality reduction. The 2D embeddings of the test capabilities are then visualized alongside the train embeddings to show their relative distances.
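The fit-on-train, transform-test pattern can be sketched with scikit-learn's `PCA` (the array shapes and component count here are illustrative, not the script's actual values):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train_embeddings = rng.normal(size=(50, 16))  # e.g., train capability embeddings
test_embeddings = rng.normal(size=(10, 16))   # e.g., test capability embeddings

pca = PCA(n_components=2)
train_2d = pca.fit_transform(train_embeddings)  # fit only on the train set
test_2d = pca.transform(test_embeddings)        # reuse the fitted model for test
```

Fitting only on the train set keeps the projection fixed, so train and test points share the same 2D coordinate system and their relative distances are comparable.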
```python
# Use the fitted PCA dim reduction to transform the test capabilities
```
The `generate_and_set_capabilities_embeddings()` function in `src/utils/embedding_utils.py` handles this process. Capability name and descriptions are extracted to form the representation string `rep_string`. Then, embeddings are generated using the OpenAI embedding model via `embedding_generator`. Finally, the embeddings are assigned to each capability object.
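A simplified sketch of that flow. The `name: description` format, the dict-based capability objects, and `embed_fn` are illustrative stand-ins; the real function operates on Capability objects and the OpenAI embedding generator:

```python
# Hedged sketch of generate_and_set_capabilities_embeddings(); attribute
# names and the embedding call are illustrative stand-ins.
def generate_and_set_capabilities_embeddings(capabilities, embed_fn):
    """Build each capability's representation string and attach its embedding."""
    for cap in capabilities:
        # Combine name and description into the representation string.
        rep_string = f"{cap['name']}: {cap['description']}"  # assumed format
        cap["embedding"] = embed_fn(rep_string)
```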
The representation string was chosen based on visualization-based experiments and is defined as:
The prompt templates in `src/utils/prompts.py` were also updated (25 additions, 13 deletions). The relevant diff hunks are shown below:
```diff
@@ -1,5 +1,9 @@
 CAPABILITY_GENERATION_SYSTEM_PROMPT="""
-You are an expert in designing capabilities to assess the abilities of large language models (LLMs). Your goal is to create novel, diverse capabilities that can reveal the breadth and depth of LLMs’ skills within the specified domain. You will be particularly rewarded for uncovering capabilities that could reveal surprising abilities or failures of LLMs. Valid capabilities will be added to a capability archive. In each generation, previously accepted capabilities for the specified domain will be provided as context.
+You are an expert in designing capabilities to assess the abilities of foundation models.
+Your goal is to create novel, diverse capabilities that can reveal the breadth and depth of a foundation model's skills within the specified domain.
+You will be particularly rewarded for a comprehensive design of capabilities.
+Valid capabilities will be added to a capability archive.
+In each generation, previously accepted capabilities for the specified domain will be provided as context.

 Each capability should be designed according to the METR Standard, which requires the following Python format:
```
````diff
 # Parse the submission string to extract the answer based on the "ANSWER" keyword.
 # Return an empty string if no match is found.
 ```
-3. The score function should use a helper function that uses LLM as a judge to score the submission:
+3. The score function should use a helper function that uses a large language model (LLM) as a judge to score the submission:
 ```python
 def evaluate_with_llm_judge(
     submission: str,
````
```diff
@@ -55,14 +59,16 @@ def evaluate_with_llm_judge(
 In <THOUGHT>, briefly think and reason about what kind of capability you want to propose.
 In <JSON>, provide a JSON response of the new capability with the following fields:
-- "name": A concise, descriptive label (lowercase, no spaces, e.g., "math_competition_algebra").
-- "description": A clear explanation of what the capability entails (e.g., The capability consists of challenging competition mathematics problems in algebra).
-- "domain": The domain to which the capability belongs to (e.g., math, physics, etc.).
+- "name": A concise, descriptive label (lowercase, no spaces, e.g., "personalized_budget_planning").
+- "description": A clear explanation of what the capability entails (e.g., "Ability to generate a realistic monthly budget tailored to an individual's income, fixed and variable expenses, and financial goals. Requires understanding spending categories, prioritization, and basic cash flow allocation.").
+- "domain": The domain to which the capability belongs to (e.g., personal finance, math, etc.).
 - "class": The fully implemented Python code for the Capability class. This should be easily human-readable.

 Do not download additional data from the internet or access the file system.

-Be creative and design capabilities that can distinguish between models with varying levels of expertise, but ensure that the capability remains relevant to the domain. Also ensure that the proposed capabilities ARE DISTINCT compared to the existing capabilities. Names of all existing capabilities will be provided.
+Be creative and design capabilities that can distinguish between different levels of expertise, but ensure that the capability remains relevant to the domain.
+Also ensure that the proposed capabilities ARE DISTINCT compared to the existing capabilities.
+Names of all existing capabilities will be provided.

 Your response will be automatically parsed so ensure it adheres to the specified format.
 """# noqa: D100
```
```diff
@@ -76,7 +82,7 @@ def evaluate_with_llm_judge(
 Existing capability names:
 {prev_capabilities}

-Generate {num_gen_capabilities} new, interesting capabilities within the {domain} domain.
+Generate {num_gen_capabilities} new capabilities within the {domain} domain that are **semantically and functionally distinct** from the existing capabilities.
```
```diff
-Generate {{num_gen_capabilities}} new, interesting capabilities for the "{capability_area}" area within the {{domain}} domain.
+Generate {{num_gen_capabilities}} new capabilities for the "{capability_area}" area within the {{domain}} domain that do not overlap with the existing capabilities.
```
```diff
-You are an expert in designing capabilities to assess the abilities of large language models (LLMs). Identify {num_areas} broad and diverse areas for capability generation for the {domain} domain. Each area should cover {num_capabilities_per_area} capabilities, which will be generated in the next step. The areas should be relevant to the {domain} domain, should be high level and should not overlap with each other.
+You are an expert in designing capabilities to assess the abilities of foundation models.
+For the domain of {domain}, identify {num_areas} high-level, broad, diverse, and non-overlapping areas for capability generation.
+Each area should cover {num_capabilities_per_area} capabilities, which will be generated in the next step.
+Aim for each area to cover a broad subdomain or skill cluster within the domain.

 Respond precisely in the following format:
```
```diff
@@ -132,7 +141,7 @@ def evaluate_with_llm_judge(
 """

 TASK_GENERATION_SYSTEM_PROMPT="""
-You are an expert in designing tasks for a given capability. The name, description, {zero_or_few_shot_patch} for the capability will be provided. You will be particularly rewarded for designing diverse tasks spanning a wide range of difficulty levels for the given capability.
+You are an expert in designing tasks for a given capability. The name, description, {zero_or_few_shot_patch} for the capability will be provided. Ensure designed tasks are diverse, spanning a wide range of difficulty levels for the given capability.

 Respond precisely in the following format, including the JSON start and end markers:
```
```diff
@@ -144,6 +153,7 @@ def evaluate_with_llm_judge(
 In <STR>, provide a string containing the task text.

 Be careful to make sure that all proposed tasks are unique. Also ensure that all tasks are within the scope of the given capability.
+Avoid simple rewordings or duplicates in structure. Tasks should test different angles of the capability.

 If the text includes mathematical symbols or equations, ensure they are appropriately formatted using LaTeX. Ensure the single backslash "\\" included in a LaTeX string is escaped as "\\\\". For example, the LaTeX string "$\\[2x + 3 = 11\\]$" should be formatted as "$\\\\[2x + 3 = 11\\\\]$" in the task text.

 Your response will be automatically parsed so ensure it adheres to the specified format.
```
```diff
@@ -172,6 +182,7 @@ def evaluate_with_llm_judge(
 In <STR>, provide a string containing the task text.

 Be careful to make sure that all proposed tasks are unique. Also ensure that all tasks are within the scope of the given capability.
+Avoid simple rewordings or duplicates in structure. Tasks should test different angles of the capability.

 If the text includes mathematical symbols or equations, ensure they are appropriately formatted using LaTeX. Ensure the single backslash "\\" included in a LaTeX string is escaped as "\\\\". For example, the LaTeX string "$\\[2x + 3 = 11\\]$" should be formatted as "$\\\\[2x + 3 = 11\\\\]$" in the task text.
```
```diff
@@ -187,7 +198,7 @@ def evaluate_with_llm_judge(
 Description: {capability_description}
 Domain: {capability_domain}
 {zero_or_few_shot_patch}
-Generate {num_gen_tasks} new tasks for the given capability. Ensure that the tasks are diverse and span a wide range of difficulty levels, testing different aspects of the capability comprehensively.
+Design {num_gen_tasks} tasks that are diverse in format and difficulty, and collectively cover multiple dimensions of the capability's skill requirements.
```
```diff
-You are an expert in completing tasks for the {capability_name} capability in the {capability_domain} domain. Complete the given task by carefully following the provided instructions.
+You are an expert in completing tasks for the {capability_name} capability in the {capability_domain} domain.
+Complete the given task by carefully following the provided instructions.
 """
```
```diff
 ANSWER_JUDGEMENT_SYSTEM_PROMPT="""
-You are an expert in evaluating answers to problems for the {capability_domain} domain. Your goal is to determine whether the provided answer correctly and completely solves the given problem. You must carefully analyze the problem and the answer, and provide a judgement along with your reasoning.
+You are an expert in evaluating answers to problems for the {capability_domain} domain. Your goal is to determine whether the provided answer correctly and completely solves the given problem. You must carefully analyze the problem and the answer, and provide a judgement along with your reasoning. "Correctly and completely" means the answer must be accurate, sufficient, and aligned with the task's expectations.
```