Here we describe the steps required for reading and selecting pre-generated capabilities and their tasks, generating capability embeddings, filtering capabilities based on those embeddings, reducing dimensionality, and visualizing capabilities. All of these steps are implemented in the `train_test_embedding_visualization.py` script, which runs the process for both `train` and `test` capabilities. The directory containing the `train` and `test` capabilities and tasks is specified in the `train_test_embedding_visualization_cfg.yaml` file.
You can also find the steps for loading and visualizing LLM scores in `plot_llm_capability_scores.py`. The scores can be plotted using a spider chart or a bar chart via the `plot_capability_scores_spider_and_bar_chart()` function.
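As background for how a spider (radar) chart lays out per-capability scores, the closed-polygon geometry can be sketched in plain Python. The capability names and scores below are illustrative, not taken from the repository; `plot_capability_scores_spider_and_bar_chart()` itself may compute this differently:

```python
import math

def spider_chart_points(scores: dict[str, float]) -> list[tuple[float, float]]:
    """Map per-capability scores to (angle, radius) points on a closed polygon,
    the layout used by a spider (radar) chart."""
    n = len(scores)
    angles = [2 * math.pi * i / n for i in range(n)]
    points = list(zip(angles, scores.values()))
    # Repeat the first point so the polygon closes when plotted.
    return points + points[:1]

pts = spider_chart_points({"math": 0.8, "coding": 0.6, "reasoning": 0.7, "writing": 0.9})
```

Each point can then be drawn on a polar axis (e.g., matplotlib's `projection="polar"`) to produce the spider chart.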
Step 1: Read the previously generated and saved train capabilities:
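A minimal sketch of this loading step, assuming each capability is stored as one JSON file in the configured directory (the directory layout and file format here are assumptions, not taken from the repository):

```python
import json
from pathlib import Path

def load_capabilities(capability_dir: str) -> list[dict]:
    """Read every saved capability JSON file from a directory, in sorted order."""
    capabilities = []
    for path in sorted(Path(capability_dir).glob("*.json")):
        with path.open() as f:
            capabilities.append(json.load(f))
    return capabilities
```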
Step 4: Filter capabilities based on the embeddings: if two embeddings are more similar than a given threshold, one of the two capabilities is removed.
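A hedged sketch of such threshold-based filtering using cosine similarity (the repository's actual metric, threshold handling, and choice of which duplicate to drop may differ):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def filter_by_similarity(embeddings: dict[str, list[float]], threshold: float) -> list[str]:
    """Keep a capability only if its embedding is below the similarity
    threshold against every already-kept capability."""
    kept: list[str] = []
    for name, emb in embeddings.items():
        if all(cosine_similarity(emb, embeddings[k]) < threshold for k in kept):
            kept.append(name)
    return kept
```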
```python
    embedding_model_name=cfg.embedding_cfg.embedding_model,  # Using the original embeddings, not the reduced version.
    save_dir=cfg.heatmap_cfg.save_dir,
    plot_name=cfg.heatmap_cfg.plot_name,
    add_squares=cfg.heatmap_cfg.add_squares,
)
```
Step 8: **Test capabilities** are also loaded, and their embeddings are generated with the OpenAI embedding model just like in the previous steps. The only difference is that the already fitted PCA model is reused for dimensionality reduction. The 2D embeddings of the test capabilities are then visualized alongside the train embeddings to show their relative distances.
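The fit-on-train, transform-test pattern can be sketched with scikit-learn's `PCA` (the array shapes and component count here are illustrative, not the script's actual values):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train_embeddings = rng.normal(size=(50, 16))  # e.g., train capability embeddings
test_embeddings = rng.normal(size=(10, 16))   # e.g., test capability embeddings

pca = PCA(n_components=2)
train_2d = pca.fit_transform(train_embeddings)  # fit only on the train set
test_2d = pca.transform(test_embeddings)        # reuse the fitted model for test
```

Fitting only on the train set keeps the projection fixed, so train and test points share the same 2D coordinate system and their relative distances are comparable.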
```python
# Use the fitted PCA dim reduction to transform the test capabilities
```
The `generate_and_set_capabilities_embeddings()` function in `src/utils/embedding_utils.py` handles this process. Capability name and descriptions are extracted to form the representation string `rep_string`. Then, embeddings are generated using the OpenAI embedding model via `embedding_generator`. Finally, the embeddings are assigned to each capability object.
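A simplified sketch of that flow. The `name: description` format, the dict-based capability objects, and `embed_fn` are illustrative stand-ins; the real function operates on Capability objects and the OpenAI embedding generator:

```python
# Hedged sketch of generate_and_set_capabilities_embeddings(); attribute
# names and the embedding call are illustrative stand-ins.
def generate_and_set_capabilities_embeddings(capabilities, embed_fn):
    """Build each capability's representation string and attach its embedding."""
    for cap in capabilities:
        # Combine name and description into the representation string.
        rep_string = f"{cap['name']}: {cap['description']}"  # assumed format
        cap["embedding"] = embed_fn(rep_string)
```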
The representation string was chosen based on visualization-based experiments and is defined as:
The prompt templates in `src/utils/prompts.py` were also updated (25 additions, 13 deletions). The relevant diff hunks are shown below:
```diff
@@ -1,5 +1,9 @@
 CAPABILITY_GENERATION_SYSTEM_PROMPT="""
-You are an expert in designing capabilities to assess the abilities of large language models (LLMs). Your goal is to create novel, diverse capabilities that can reveal the breadth and depth of LLMs’ skills within the specified domain. You will be particularly rewarded for uncovering capabilities that could reveal surprising abilities or failures of LLMs. Valid capabilities will be added to a capability archive. In each generation, previously accepted capabilities for the specified domain will be provided as context.
+You are an expert in designing capabilities to assess the abilities of foundation models.
+Your goal is to create novel, diverse capabilities that can reveal the breadth and depth of a foundation model's skills within the specified domain.
+You will be particularly rewarded for a comprehensive design of capabilities.
+Valid capabilities will be added to a capability archive.
+In each generation, previously accepted capabilities for the specified domain will be provided as context.

 Each capability should be designed according to the METR Standard, which requires the following Python format:
```
````diff
 # Parse the submission string to extract the answer based on the "ANSWER" keyword.
 # Return an empty string if no match is found.
 ```
-3. The score function should use a helper function that uses LLM as a judge to score the submission:
+3. The score function should use a helper function that uses a large language model (LLM) as a judge to score the submission:
 ```python
 def evaluate_with_llm_judge(
     submission: str,
````
```diff
@@ -55,14 +59,16 @@ def evaluate_with_llm_judge(
 In <THOUGHT>, briefly think and reason about what kind of capability you want to propose.
 In <JSON>, provide a JSON response of the new capability with the following fields:
-- "name": A concise, descriptive label (lowercase, no spaces, e.g., "math_competition_algebra").
-- "description": A clear explanation of what the capability entails (e.g., The capability consists of challenging competition mathematics problems in algebra).
-- "domain": The domain to which the capability belongs to (e.g., math, physics, etc.).
+- "name": A concise, descriptive label (lowercase, no spaces, e.g., "personalized_budget_planning").
+- "description": A clear explanation of what the capability entails (e.g., "Ability to generate a realistic monthly budget tailored to an individual's income, fixed and variable expenses, and financial goals. Requires understanding spending categories, prioritization, and basic cash flow allocation.").
+- "domain": The domain to which the capability belongs to (e.g., personal finance, math, etc.).
 - "class": The fully implemented Python code for the Capability class. This should be easily human-readable.

 Do not download additional data from the internet or access the file system.

-Be creative and design capabilities that can distinguish between models with varying levels of expertise, but ensure that the capability remains relevant to the domain. Also ensure that the proposed capabilities ARE DISTINCT compared to the existing capabilities. Names of all existing capabilities will be provided.
+Be creative and design capabilities that can distinguish between different levels of expertise, but ensure that the capability remains relevant to the domain.
+Also ensure that the proposed capabilities ARE DISTINCT compared to the existing capabilities.
+Names of all existing capabilities will be provided.

 Your response will be automatically parsed so ensure it adheres to the specified format.
 """# noqa: D100
```
```diff
@@ -76,7 +82,7 @@ def evaluate_with_llm_judge(
 Existing capability names:
 {prev_capabilities}

-Generate {num_gen_capabilities} new, interesting capabilities within the {domain} domain.
+Generate {num_gen_capabilities} new capabilities within the {domain} domain that are **semantically and functionally distinct** from the existing capabilities.
```
```diff
-Generate {{num_gen_capabilities}} new, interesting capabilities for the "{capability_area}" area within the {{domain}} domain.
+Generate {{num_gen_capabilities}} new capabilities for the "{capability_area}" area within the {{domain}} domain that do not overlap with the existing capabilities.
```
```diff
-You are an expert in designing capabilities to assess the abilities of large language models (LLMs). Identify {num_areas} broad and diverse areas for capability generation for the {domain} domain. Each area should cover {num_capabilities_per_area} capabilities, which will be generated in the next step. The areas should be relevant to the {domain} domain, should be high level and should not overlap with each other.
+You are an expert in designing capabilities to assess the abilities of foundation models.
+For the domain of {domain}, identify {num_areas} high-level, broad, diverse, and non-overlapping areas for capability generation.
+Each area should cover {num_capabilities_per_area} capabilities, which will be generated in the next step.
+Aim for each area to cover a broad subdomain or skill cluster within the domain.

 Respond precisely in the following format:
```
```diff
@@ -132,7 +141,7 @@ def evaluate_with_llm_judge(
 """

 TASK_GENERATION_SYSTEM_PROMPT="""
-You are an expert in designing tasks for a given capability. The name, description, {zero_or_few_shot_patch} for the capability will be provided. You will be particularly rewarded for designing diverse tasks spanning a wide range of difficulty levels for the given capability.
+You are an expert in designing tasks for a given capability. The name, description, {zero_or_few_shot_patch} for the capability will be provided. Ensure designed tasks are diverse, spanning a wide range of difficulty levels for the given capability.

 Respond precisely in the following format, including the JSON start and end markers:
```
```diff
@@ -144,6 +153,7 @@ def evaluate_with_llm_judge(
 In <STR>, provide a string containing the task text.

 Be careful to make sure that all proposed tasks are unique. Also ensure that all tasks are within the scope of the given capability.
+Avoid simple rewordings or duplicates in structure. Tasks should test different angles of the capability.

 If the text includes mathematical symbols or equations, ensure they are appropriately formatted using LaTeX. Ensure the single backslash "\\" included in a LaTeX string is escaped as "\\\\". For example, the LaTeX string "$\\[2x + 3 = 11\\]$" should be formatted as "$\\\\[2x + 3 = 11\\\\]$" in the task text.

 Your response will be automatically parsed so ensure it adheres to the specified format.
```
```diff
@@ -172,6 +182,7 @@ def evaluate_with_llm_judge(
 In <STR>, provide a string containing the task text.

 Be careful to make sure that all proposed tasks are unique. Also ensure that all tasks are within the scope of the given capability.
+Avoid simple rewordings or duplicates in structure. Tasks should test different angles of the capability.

 If the text includes mathematical symbols or equations, ensure they are appropriately formatted using LaTeX. Ensure the single backslash "\\" included in a LaTeX string is escaped as "\\\\". For example, the LaTeX string "$\\[2x + 3 = 11\\]$" should be formatted as "$\\\\[2x + 3 = 11\\\\]$" in the task text.
```
```diff
@@ -187,7 +198,7 @@ def evaluate_with_llm_judge(
 Description: {capability_description}
 Domain: {capability_domain}
 {zero_or_few_shot_patch}
-Generate {num_gen_tasks} new tasks for the given capability. Ensure that the tasks are diverse and span a wide range of difficulty levels, testing different aspects of the capability comprehensively.
+Design {num_gen_tasks} tasks that are diverse in format and difficulty, and collectively cover multiple dimensions of the capability's skill requirements.
```
```diff
-You are an expert in completing tasks for the {capability_name} capability in the {capability_domain} domain. Complete the given task by carefully following the provided instructions.
+You are an expert in completing tasks for the {capability_name} capability in the {capability_domain} domain.
+Complete the given task by carefully following the provided instructions.
 """
```
```diff
 ANSWER_JUDGEMENT_SYSTEM_PROMPT="""
-You are an expert in evaluating answers to problems for the {capability_domain} domain. Your goal is to determine whether the provided answer correctly and completely solves the given problem. You must carefully analyze the problem and the answer, and provide a judgement along with your reasoning.
+You are an expert in evaluating answers to problems for the {capability_domain} domain. Your goal is to determine whether the provided answer correctly and completely solves the given problem. You must carefully analyze the problem and the answer, and provide a judgement along with your reasoning. "Correctly and completely" means the answer must be accurate, sufficient, and aligned with the task's expectations.
```