Commit 3c241f9

Embedding generation and visualization README (#36)

* Added example script README
* small fix

1 parent b393b97 commit 3c241f9

2 files changed: +142 -1 lines changed

example_scripts/README.md

Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
## `train_test_embedding_visualization` example

This example describes the steps required for reading and selecting pre-generated capabilities and their tasks, generating capability embeddings, filtering capabilities based on those embeddings, reducing dimensionality, and visualizing the capabilities. All of these steps are implemented in the `train_test_embedding_visualization.py` script, which runs the process for both `train` and `test` capabilities. The directory containing the `train` and `test` capabilities and tasks is specified in the `train_test_embedding_visualization_cfg.yaml` file.

You can also find the steps for loading and visualizing LLM scores in `plot_llm_capability_scores.py`. The scores can be plotted as a spider chart or a bar chart via the `plot_capability_scores_spider_and_bar_chart()` function.

Step 1: Read the previously generated and saved train capabilities:

```python
# Read the capabilities from the base directory
train_capability_dir = os.path.join(
    cfg.capabilities_cfg.saved_capabilities_dir,
    cfg.capabilities_cfg.domain,
)
# Fetch previously generated capabilities
capabilities = get_previous_capabilities(capability_dir=train_capability_dir)
```

Step 2: Sort the capabilities and keep only the complete ones. A capability is complete when enough verified tasks have been generated for it.

```python
logger.info(f"All capability names:\n{capabilities}")
# Select complete capabilities (the same set of capabilities was evaluated)
capabilities = select_complete_capabilities(
    capabilities=capabilities,
    strict=False,
    num_tasks_lower_bound=int(
        cfg.capabilities_cfg.num_gen_tasks_per_capability
        * (1 - cfg.capabilities_cfg.num_gen_tasks_buffer)
    ),
)
capabilities = sorted(capabilities, key=lambda x: x.name)
```

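To make the `num_tasks_lower_bound` computation concrete, here is the arithmetic with hypothetical config values (the real values come from the YAML config):

```python
# Hypothetical config values, for illustration only
num_gen_tasks_per_capability = 100  # cfg.capabilities_cfg.num_gen_tasks_per_capability
num_gen_tasks_buffer = 0.1          # cfg.capabilities_cfg.num_gen_tasks_buffer

# A capability counts as complete if it has at least this many verified tasks
num_tasks_lower_bound = int(num_gen_tasks_per_capability * (1 - num_gen_tasks_buffer))
print(num_tasks_lower_bound)  # → 90
```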
Step 3: Generate capability embeddings with an OpenAI embedding model and assign an embedding to each capability object.

```python
# Embed capabilities using an OpenAI embedding model
generate_and_set_capabilities_embeddings(
    capabilities=capabilities,
    embedding_model_name=cfg.embedding_cfg.embedding_model,
    embed_dimensions=cfg.embedding_cfg.embedding_size,
)
```

Step 4: Filter capabilities based on their embeddings: if two embeddings are more similar than a configured threshold, one of the corresponding capabilities is removed.

```python
# Filter capabilities based on their embeddings
filtered_capabilities = filter_capabilities(
    capabilities,
    embedding_model_name=cfg.embedding_cfg.embedding_model,
    similarity_threshold=cfg.embedding_cfg.filtering_similarity_threshold,
)
```

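The filtering idea can be sketched with plain NumPy: compute pairwise cosine similarities, then greedily drop one member of any pair above the threshold. This is a minimal illustration of the concept, not the actual `filter_capabilities` implementation:

```python
import numpy as np

def filter_by_similarity(embeddings: np.ndarray, threshold: float) -> list[int]:
    """Return indices of embeddings to keep, greedily dropping near-duplicates."""
    # Normalize rows so the dot product equals cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = normed @ normed.T
    kept: list[int] = []
    for i in range(len(embeddings)):
        # Keep i only if it is not too similar to any already-kept embedding
        if all(similarity[i, j] < threshold for j in kept):
            kept.append(i)
    return kept

# Two near-identical vectors and one distinct vector
emb = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
print(filter_by_similarity(emb, threshold=0.95))  # → [0, 2]
```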
Step 5: Reduce the dimensionality of the capability embeddings.

```python
# Reduce the dimensionality of capability embeddings generated by the
# embedding model.
dim_reduction = apply_dimensionality_reduction(
    filtered_capabilities,
    dim_reduction_method_name=cfg.dimensionality_reduction_cfg.reduce_dimensionality_method,
    output_dimension_size=cfg.dimensionality_reduction_cfg.reduced_dimensionality_size,
    embedding_model_name=cfg.embedding_cfg.embedding_model,
    tsne_perplexity=cfg.dimensionality_reduction_cfg.tsne_perplexity,
    normalize_output=cfg.dimensionality_reduction_cfg.normalize_output,
)
```

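For intuition, a PCA-style reduction to 2D can be sketched with NumPy's SVD. This is only a conceptual sketch; the script delegates the real work to `apply_dimensionality_reduction`, which also supports methods such as t-SNE:

```python
import numpy as np

def pca_reduce(embeddings: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project embeddings onto their top principal components."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
high_dim = rng.normal(size=(10, 256))  # e.g. 256-dim capability embeddings
low_dim = pca_reduce(high_dim, n_components=2)
print(low_dim.shape)  # → (10, 2)
```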
Step 6: Visualize the reduced embeddings.

```python
# Plot training capabilities
plot_hierarchical_capability_2d_embeddings(
    capabilities=filtered_capabilities,
    dim_reduction_method=cfg.dimensionality_reduction_cfg.reduce_dimensionality_method,
    save_dir=cfg.embedding_visualization_cfg.save_dir,
    plot_name=cfg.embedding_visualization_cfg.plot_name,
    show_point_ids=cfg.embedding_visualization_cfg.show_point_ids,
)
```

Step 7: Generate a capability heatmap.

```python
generate_capability_heatmap(
    capabilities=filtered_capabilities,
    # Uses the original embeddings, not the reduced version
    embedding_model_name=cfg.embedding_cfg.embedding_model,
    save_dir=cfg.heatmap_cfg.save_dir,
    plot_name=cfg.heatmap_cfg.plot_name,
    add_squares=cfg.heatmap_cfg.add_squares,
)
```

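Conceptually, the matrix behind such a heatmap is just the cosine similarity of every pair of original (non-reduced) embeddings. A minimal NumPy sketch of that matrix, independent of the actual `generate_capability_heatmap` implementation:

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities; entry [i, j] compares embeddings i and j."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

emb = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
sim = cosine_similarity_matrix(emb)
# Diagonal entries are 1.0: every embedding is identical to itself
print(np.allclose(np.diag(sim), 1.0))  # → True
```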
Step 8: **Test capabilities** are loaded and embedded with the OpenAI embedding model, just as in the previous steps. The only difference is that the already fitted PCA model is reused for dimensionality reduction. The 2D embeddings of the test capabilities are then visualized alongside the train embeddings to show their relative distances.

```python
# Use the fitted PCA dim reduction to transform the test capabilities
apply_dimensionality_reduction_to_test_capabilities(
    test_capabilities,
    dim_reduction_method=dim_reduction,
    embedding_model_name=cfg.embedding_cfg.embedding_model,
)
```

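The key detail in this step is reusing the train-time fit: test embeddings are centered with the train mean and projected with the train components, never re-fit. A minimal NumPy sketch of that idea (not the project's actual PCA object):

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(size=(20, 64))  # stand-ins for train embeddings
test = rng.normal(size=(5, 64))   # stand-ins for test embeddings

# "Fit" on train: store the mean and top-2 principal directions
train_mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - train_mean, full_matrices=False)
components = vt[:2]

# "Transform" test: reuse the train mean and components, never re-fit
test_2d = (test - train_mean) @ components.T
print(test_2d.shape)  # → (5, 2)
```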
Step 9: Visualize the train and test capability embeddings together.

```python
all_capabilities = filtered_capabilities + test_capabilities
logger.info(
    f"Visualizing {len(all_capabilities)} train and test capabilities at {cfg.embedding_visualization_cfg.save_dir}"
)
plot_hierarchical_capability_2d_embeddings(
    capabilities=all_capabilities,
    dim_reduction_method=cfg.dimensionality_reduction_cfg.reduce_dimensionality_method,
    save_dir=cfg.embedding_visualization_cfg.save_dir,
    plot_name=cfg.embedding_visualization_cfg.plot_name + " Train and Test",
    show_point_ids=cfg.embedding_visualization_cfg.show_point_ids,
)
```

### How are embeddings generated?

The `generate_and_set_capabilities_embeddings()` function in `src/utils/embedding_utils.py` handles this process. The capability name and description are extracted to form the representation string `rep_string`. Embeddings are then generated with the OpenAI embedding model via `embedding_generator`, and finally assigned to each capability object.

The representation string was chosen based on visualization experiments and is defined as:

```python
rep_string = f"{capability_dict['name']} - {capability.area}: {capability_dict['description']}"
```
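With a hypothetical capability (the field values below are invented for illustration; only the string format follows the definition above), the representation string looks like this:

```python
# Hypothetical capability fields, for illustration only
capability_dict = {
    "name": "arithmetic_reasoning",
    "description": "Solve multi-step arithmetic word problems.",
}
area = "math"  # stands in for capability.area

rep_string = f"{capability_dict['name']} - {area}: {capability_dict['description']}"
print(rep_string)
# → arithmetic_reasoning - math: Solve multi-step arithmetic word problems.
```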

example_scripts/example_cfg/train_test_embedding_visualization_cfg.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -19,7 +19,7 @@ dimensionality_reduction_cfg:
   normalize_output: False

 embedding_visualization_cfg:
-  save_dir: /fs01/projects/aieng/public/acecapabilities_o4-mini_C100_R5_A10_T100/visualizations
+  save_dir: /fs01/projects/aieng/public/ace/capabilities_o4-mini_C100_R5_A10_T100/visualizations
   plot_name: "PCA Embeddings"
   show_point_ids: False # Set to true when plotting a small number of capabilities.
```
