Commit 17617c6

Merge pull request #275 from allenai/favyen/20260211-embedding-eval
Embeddings evaluation updates
2 parents e5386fa + 20b6f14 commit 17617c6

31 files changed (+1779, -434 lines)

data/landsat_vessels/config_detector.yaml

Lines changed: 2 additions & 2 deletions

```diff
@@ -82,15 +82,15 @@ data:
         init_args:
           mean: 0
           std: 255
-      - class_path: rslp.transforms.mask.Mask
+      - class_path: rslearn.train.transforms.mask.Mask
   train_config:
     patch_size: 512
     transforms:
       - class_path: rslearn.train.transforms.normalize.Normalize
         init_args:
           mean: 0
           std: 255
-      - class_path: rslp.transforms.mask.Mask
+      - class_path: rslearn.train.transforms.mask.Mask
      - class_path: rslearn.train.transforms.flip.Flip
        init_args:
          image_selectors: ["image"]
```

data/sentinel1_vessels/config.yaml

Lines changed: 2 additions & 2 deletions

```diff
@@ -111,7 +111,7 @@ data:
           mean: 0
           std: 250
           valid_range: [0, 4]
-      - class_path: rslp.transforms.mask.Mask
+      - class_path: rslearn.train.transforms.mask.Mask
   train_config:
     patch_size: 512
     transforms:
@@ -127,7 +127,7 @@ data:
           mean: 0
           std: 250
           valid_range: [0, 4]
-      - class_path: rslp.transforms.mask.Mask
+      - class_path: rslearn.train.transforms.mask.Mask
       - class_path: rslearn.train.transforms.flip.Flip
         init_args:
           image_selectors: ["image"]
```

one_off_projects/2025_08_01_alphaearth_eval_to_rslearn/README.md

Lines changed: 0 additions & 58 deletions
This file was deleted.

rslp/embedding_eval/README.md

Lines changed: 191 additions & 0 deletions (new file)

## Embedding Evaluations

This module provides utilities to evaluate OlmoEarth embeddings in different settings
and against AlphaEarth embeddings.

Compared to rslearn (training with a frozen encoder), the evaluation here is much
faster since we precompute the embeddings.

Compared to olmoearth_pretrain (kNN and linear probe evaluations), the datasets we
test on here are more consistent (window size of at least 128x128 with the label
applying only to the center of the window, and 12 Sentinel-2 images), allowing
experimentation with settings like the overlap ratio and different input sizes.
## Datasets

Datasets should have a "sentinel2" layer with the Sentinel-2 L2A image time series.

The window options (in the window metadata) should have a key containing the label
category. For example, in the AlphaEarth evals the key is "label", while for the AWF
and Nandi datasets it is "category". This label should correspond to the center pixel
of the window.

The dataset should be split into train and test in one of two ways (see the sketch
after this list):

1. Using groups. There should be a group named "train" and a group named "test".
2. Using another key in the window options. For example, Nandi and AWF use a key
   called "split". The value should be "train" or "test" (other values are ignored).
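Below is a minimal sketch of reading the label and split for one window. It assumes
each window directory stores its options under an "options" key in a `metadata.json`
file; the key names vary per dataset as described above, and the exact metadata
layout should be checked against rslearn.

```python
import json
from pathlib import Path

def get_label_and_split(window_dir, label_key="label", split_key="split"):
    # Assumed layout: <window_dir>/metadata.json with an "options" dict.
    metadata = json.loads((Path(window_dir) / "metadata.json").read_text())
    options = metadata.get("options", {})
    return options.get(label_key), options.get(split_key)
```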

### AlphaEarth Supplemental Evaluations

We test on several of the AlphaEarth supplemental evaluation datasets, which we
download from https://zenodo.org/records/16585402. For internal use, the WEKA paths
are:

- Raw format: `/weka/dfive-default/rslearn-eai/artifacts/deepmind_alphaearth_supplemental_evaluation_datasets/`
- Converted to rslearn format: `/weka/dfive-default/rslearn-eai/datasets/alphaearth_supplemental_evaluations/`

We test on these datasets (the others may not have Sentinel-2 images materialized, or
may be regression tasks that we don't currently support):

- africa_crop_mask
- canada_crops_fine
- descals
- glance
- lcmap_lu
- us_trees

Below we document how the datasets are converted to rslearn format.
#### Convert to rslearn dataset format

The `convert_alphaearth_supplemental_to_rslearn.py` script converts them to rslearn
dataset format. It has the WEKA paths hardcoded, so simply run the script:

```bash
python -m rslp.embedding_eval.convert_alphaearth_supplemental_to_rslearn
```

Then the data needs to be prepared and materialized:

```bash
rslearn dataset prepare --root /weka/dfive-default/rslearn-eai/datasets/alphaearth_supplemental_evaluations/aster_ged/ --workers 128 --jobs-per-process 16 --retry-max-attempts 10 --retry-backoff-seconds 5 --disabled-layers landsat
rslearn dataset materialize --root /weka/dfive-default/rslearn-eai/datasets/alphaearth_supplemental_evaluations/aster_ged/ --workers 128 --retry-max-attempts 10 --retry-backoff-seconds 5 --disabled-layers landsat --ignore-errors
```
### Other Datasets

Here are other datasets we can evaluate embeddings on:

- `/weka/dfive-default/rslearn-eai/datasets/awf/`
- `/weka/dfive-default/rslearn-eai/datasets/nandi/`

## Obtain AlphaEarth Embeddings

If you are setting up a new dataset for embedding evaluation, you can use
`config.json`, which includes a "gse" layer that obtains AlphaEarth embeddings using
the `rslearn.data_sources.aws_google_satellite_embedding_v1.GoogleSatelliteEmbeddingV1`
data source.

Alternatively, you can copy just that "gse" layer into your existing dataset config
file.
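For orientation, here is a hedged sketch of what such a layer entry might look like in
an rslearn dataset configuration. Only the class path is taken from this document; the
surrounding structure follows general rslearn config conventions and omits details
like band sets, so copy the actual "gse" entry from `config.json` rather than this
sketch.

```json
{
  "layers": {
    "gse": {
      "type": "raster",
      "data_source": {
        "name": "rslearn.data_sources.aws_google_satellite_embedding_v1.GoogleSatelliteEmbeddingV1"
      }
    }
  }
}
```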
## Explicitly Compute Embeddings and Evaluate

We can manually run one command to compute and cache embeddings, and another command
to evaluate the embeddings.

### Compute Embeddings

First, compute the embeddings. The dataset must have a "sentinel2" layer with a
Sentinel-2 L2A image time series (the datasets mentioned above have this).

```bash
python -m rslp.embedding_eval.compute_olmoearth_embeddings \
    --ds_path /weka/dfive-default/rslearn-eai/datasets/alphaearth_supplemental_evaluations/africa_crop_mask/ \
    --patch_size 1 \
    --model_id OlmoEarth-v1-Base \
    --input_size 32 \
    --embed_fname embeddings.h5
```

`embeddings.h5` will contain two datasets in the H5 file: "embeddings", an
`(N, embed_dim)` tensor of embeddings, and "window_names", a corresponding list of
group and window names.
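As a quick sanity check, the cached file can be inspected with h5py. This is a minimal
sketch; the assumption that "window_names" is stored as byte strings (the h5py default
for strings) may not match the script exactly.

```python
import h5py

# Open the cached embeddings and print their shape alongside a few window names.
with h5py.File("embeddings.h5", "r") as f:
    embeddings = f["embeddings"][:]  # (N, embed_dim) array
    window_names = [
        n.decode() if isinstance(n, bytes) else n for n in f["window_names"][:]
    ]

print(embeddings.shape, window_names[:3])
```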
You can specify an OlmoEarth checkpoint directory instead of the model ID:

```bash
python -m rslp.embedding_eval.compute_olmoearth_embeddings \
    --checkpoint_dir /weka/dfive-default/helios/checkpoints/favyen/favyen_decode_gse_worldcover_osm_srtm_titan/step370000 \
    # ...
```

By default, the images are center cropped based on `--input_size`, and we save the
embedding corresponding to the center patch. Center cropping means the label (which
we always assume corresponds to the center of the window) is in the center of the
input. We can pass `--label_position` to put the label at a different position in the
input, e.g. to test the impact of different overlap ratios and how the model performs
with less spatial context.

```bash
python -m rslp.embedding_eval.compute_olmoearth_embeddings \
    --patch_size 4 \
    --input_size 32 \
    # Have the script crop the window such that the center pixel of the window appears
    # at the bottom right of the crop.
    --label_position 31 31 \
    # ...
```
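To make the geometry concrete, here is a hedged sketch of the cropping arithmetic this
implies. It assumes a square window of size `window_size` and that `label_position`
gives the (row, col) of the crop where the window's center pixel should land; the
script's actual conventions (e.g. axis order) may differ.

```python
def crop_bounds(window_size, input_size, label_position):
    center = window_size // 2
    row, col = label_position
    # Shift the crop so the window center lands at (row, col) within the crop.
    top = center - row
    left = center - col
    return top, left, top + input_size, left + input_size

# Example: 128x128 window, input_size 32, label at the bottom right of the crop.
print(crop_bounds(128, 32, (31, 31)))  # (33, 33, 65, 65); window center (64, 64) maps to crop pixel (31, 31)
```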
### Evaluate

Run an evaluation with kNN:

```bash
python -m rslp.embedding_eval.get_balanced_accuracy \
    --ds_path /weka/dfive-default/rslearn-eai/datasets/alphaearth_supplemental_evaluations/africa_crop_mask/ \
    # How many evaluation runs to average metrics over. If set > 1, then --samples
    # should be set > 0, otherwise each run would use the same train set.
    --repeats 1 \
    # How many examples to sample per category on each run. 0 means to use all of
    # the training data. The --repeats and --samples options are mainly used for
    # consistency with the AlphaEarth evaluation; for internal comparisons we can
    # leave them disabled.
    --samples 0 \
    # K for the kNN evaluation method.
    --k 3 \
    # The filename containing the embeddings, or "gse" to load AlphaEarth embeddings
    # from a "gse" layer in the dataset. There are a couple of other options too; see
    # --help for details.
    --embed_fname embeddings.h5 \
    # Either knn or linear_probe.
    --method knn \
    # The key in the window options containing the label category.
    --label_key label \
    # The key in the window options containing the split. "group" means the dataset
    # has "train" and "test" groups instead.
    --split_key group
```
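The reported metric is balanced accuracy. As a rough sketch of the core computation
(not the script itself), the kNN method amounts to the following, assuming the
train/test embeddings and labels have already been loaded:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score
from sklearn.neighbors import KNeighborsClassifier

def knn_balanced_accuracy(train_x, train_y, test_x, test_y, k=3):
    # Fit a kNN classifier on the train embeddings and score the test split.
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(train_x, train_y)
    return balanced_accuracy_score(test_y, clf.predict(test_x))
```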
The linear probe evaluation has a few additional options; these are the defaults:

```bash
python -m rslp.embedding_eval.get_balanced_accuracy \
    --method linear_probe \
    # Learning rate for training the linear probe.
    --lr 0.001 \
    # Number of epochs to train for.
    --epochs 100 \
    # The batch size.
    --batch_size 32 \
    # ...
```
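For intuition, a linear probe with these defaults corresponds roughly to the PyTorch
sketch below. The optimizer choice (Adam) and other training details are assumptions;
the actual script may differ.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train_linear_probe(train_x, train_y, num_classes,
                       lr=0.001, epochs=100, batch_size=32):
    # A single linear layer mapping embeddings to class logits.
    probe = nn.Linear(train_x.shape[1], num_classes)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    # train_y should be a 1D tensor of integer class indices (dtype long).
    loader = DataLoader(TensorDataset(train_x, train_y),
                        batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = nn.functional.cross_entropy(probe(x), y)
            loss.backward()
            optimizer.step()
    return probe
```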
### Automated Evaluation

We can use `run_crop_experiments.py` to evaluate many settings together and create a
table.

There are example JSON files configuring the settings to evaluate in
`rslp/embedding_eval/crop_experiment_configs/`. The script tests each crop config
combined with each method, while the patch size and model ID (or checkpoint directory)
are fixed. It will try to evaluate AlphaEarth embeddings as well.

```bash
python -m rslp.embedding_eval.run_crop_experiments --experiment_config rslp/embedding_eval/crop_experiment_configs/crop_experiment_results.json
```

The script is designed to run correctly when executed in parallel across multiple
GPUs: it shuffles the experiments specified by the experiment config and iterates over
them, so different executions process different experiments and skip ones that were
previously completed, based on the results JSON file.
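
A hedged sketch of this shuffle-and-skip pattern is below. The results file format (a
JSON object keyed by experiment name) and the helper names are assumptions for
illustration; the actual script may additionally need locking or atomic writes to be
fully race-free.

```python
import json
import random
from pathlib import Path

def run_pending(experiments, results_path, run_fn):
    # Shuffle so parallel processes start on different experiments.
    random.shuffle(experiments)
    results_file = Path(results_path)
    for exp in experiments:
        # Re-read the results file so work finished by other processes is seen.
        results = json.loads(results_file.read_text()) if results_file.exists() else {}
        if exp["name"] in results:
            continue  # Already completed by this or another process.
        results[exp["name"]] = run_fn(exp)
        results_file.write_text(json.dumps(results, indent=2))
```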
