Commit 0ad47fd

sdenton4 authored and copybara-github committed
Read and insert Deployment and Recording metadata from CSV files during embedding process.
PiperOrigin-RevId: 874748549
1 parent e196040 commit 0ad47fd

5 files changed: +649 −8 lines changed

perch_hoplite/agile/README.md

Lines changed: 149 additions & 0 deletions

# Agile Modeling with Perch-Hoplite

This directory contains tools for Agile bird song modeling with Perch-Hoplite.
These tools are intended to support embedding large audio datasets, adding
labels and metadata, and training and evaluating audio classifiers.

## Data Organization

The embedding pipeline assumes that audio files are organized into directories,
where each top-level directory within the `base_path` represents a
**deployment**. For example, with a directory structure like:

```
my_dataset/
├── deployment_A/
│   ├── recording01.wav
│   └── recording02.wav
├── deployment_B/
│   └── recording03.wav
...
```

`deployment_A` and `deployment_B` will be treated as deployment names.

**Recordings** are identified by their relative path from the `base_path`,
including the deployment directory (e.g., `deployment_A/recording01.wav`).
This relative path serves as the `file_id` for recordings when linking
metadata or annotations.

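As a rough illustration of this naming convention (the pipeline performs the mapping internally; the helper name below is hypothetical):

```python
import os


def deployment_and_file_id(audio_path: str, base_path: str) -> tuple[str, str]:
  """Derives the deployment name and file_id for one audio file.

  Illustrative sketch only, not part of the Perch-Hoplite API.
  """
  # The file_id is the path relative to base_path, e.g.
  # 'deployment_A/recording01.wav'.
  file_id = os.path.relpath(audio_path, base_path)
  # The deployment name is the first path component of the file_id.
  deployment_name = file_id.split(os.sep)[0]
  return deployment_name, file_id
```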
## Adding Metadata to the Hoplite Database

The Agile embedding pipeline supports adding metadata to deployments and
recordings in the Hoplite database. Metadata is loaded from CSV files located
in the `base_path` of each `AudioSourceConfig`.

To add metadata, create the following three files in the root of your dataset
directory:

1.  **`metadata_description.csv`**: Describes the metadata fields you want to
    add. It should contain the following columns:
    *   `field_name`: The name of the metadata field (e.g., `habitat`).
    *   `metadata_level`: The level at which the metadata applies, either
        `deployment` or `recording`.
    *   `type`: The data type of the field. Supported types are `str`,
        `float`, `int`, and `bytes`.
    *   `description`: An optional description of the field.

2.  **`deployments_metadata.csv`**: Contains metadata for each deployment. The
    first column must be the deployment identifier (the directory name when
    audio files follow the `deployment/recording.wav` layout), and subsequent
    columns should match `field_name`s from `metadata_description.csv` where
    `metadata_level` is `deployment`.

3.  **`recordings_metadata.csv`**: Contains metadata for each recording. The
    first column must be the recording identifier (e.g.,
    `deployment/recording.wav`), and subsequent columns should match
    `field_name`s from `metadata_description.csv` where `metadata_level` is
    `recording`.

### Example

**`metadata_description.csv`**

```csv
field_name,metadata_level,type,description
deployment_name,deployment,str,Deployment identifier.
habitat,deployment,str,Habitat type.
latitude,deployment,float,Deployment latitude.
file_id,recording,str,Recording identifier.
mic_type,recording,str,Microphone type.
```

**`deployments_metadata.csv`**

```csv
deployment_name,habitat,latitude
DEP01,"forest",47.6
DEP02,"grassland",45.1
```

**`recordings_metadata.csv`**

```csv
file_id,mic_type
DEP01/rec001.wav,"MicA"
DEP01/rec002.wav,"MicB"
DEP02/rec001.wav,"MicA"
```


When `EmbedWorker.process_all()` is run, it will detect these files, load the
metadata, and insert it into the database alongside new deployments and
recordings. Metadata fields can then be accessed as attributes on `Deployment`
and `Recording` objects returned by the database interface (e.g.,
`deployment.habitat`, `recording.mic_type`).

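The loading and type-casting is handled internally by Perch-Hoplite; as a rough, hypothetical sketch of the casting logic implied by the `type` column (the function name and return shape here are illustrative, not the library's actual loader):

```python
import csv
import io

# Maps the 'type' column of metadata_description.csv to Python casts.
_CASTS = {'str': str, 'float': float, 'int': int, 'bytes': str.encode}


def load_typed_metadata(description_text: str, values_text: str, level: str):
  """Parses metadata rows, casting each field per the description file.

  Returns a dict mapping each identifier (the first CSV column) to a dict
  of typed metadata values. Illustrative sketch only.
  """
  casts = {
      row['field_name']: _CASTS[row['type']]
      for row in csv.DictReader(io.StringIO(description_text))
      if row['metadata_level'] == level
  }
  records = {}
  for row in csv.DictReader(io.StringIO(values_text)):
    # The first column holds the deployment or recording identifier.
    id_column = list(row)[0]
    records[row[id_column]] = {
        name: casts[name](value)
        for name, value in row.items()
        if name != id_column and name in casts
    }
  return records
```

Applied to the example files above, this would yield entries like `{'DEP01': {'habitat': 'forest', 'latitude': 47.6}, ...}` at the `deployment` level.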
## Adding Annotations

If you have existing annotations for your audio data, Hoplite can ingest them
during the embedding process. Annotations should be stored in CSV files named
`annotations.csv` alongside your audio data. Each `annotations.csv` should
contain columns for `recording` (the filename or `file_id` of the audio),
`start_offset_s`, `end_offset_s`, `label`, and `label_type` (`positive`,
`negative`, or `uncertain`). When embeddings are generated, Hoplite will find
any relevant annotations and add them to the database, associating them with
the appropriate time windows.

### Example

**`annotations.csv`**

```csv
recording,start_offset_s,end_offset_s,label,label_type
DEP01/rec001.wav,10.0,15.0,MyBird,positive
DEP01/rec001.wav,20.0,25.0,OtherBird,negative
DEP02/rec001.wav,5.0,10.0,MyBird,positive
```

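A small standard-library sketch of how such a file can be grouped by recording before insertion; the actual ingestion is handled internally by Hoplite, so the function and field names below are illustrative:

```python
import csv
import io
from collections import defaultdict


def group_annotations(annotations_text: str):
  """Groups annotation rows by recording, converting offsets to floats."""
  by_recording = defaultdict(list)
  for row in csv.DictReader(io.StringIO(annotations_text)):
    by_recording[row['recording']].append({
        'offsets': (float(row['start_offset_s']), float(row['end_offset_s'])),
        'label': row['label'],
        'label_type': row['label_type'],
    })
  return dict(by_recording)
```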
## Colab Notebooks

This directory includes Colab notebooks to guide users through embedding audio,
adding annotations, and training agile classifiers.

These notebooks are designed for use in Google Colab and make use of
interactive forms (e.g., dropdowns and text fields) via cell parameters
(`#@param`). While developed for Colab, the notebooks are also compatible with
standard Jupyter environments, although the interactive form elements will not
be rendered.

The notebooks provided are:

*   **`01_embed_audio.ipynb`**: Guides you through embedding audio files from
    a dataset using a specified pre-trained model (e.g., Perch v2, BirdNet)
    and saving the resulting embeddings into a Hoplite database. It handles
    dataset configuration, database initialization, and running the embedding
    process.
*   **`02_agile_modeling.ipynb`**: Focuses on the interactive modeling
    process. It allows you to search the embedding database using example
    audio, display search results, label data as positive or negative, and
    then train or retrain a simple linear classifier based on these labels.
    You can also use the trained classifier to run inference or perform
    margin-based sampling to find examples for further annotation.
*   **`03_call_density.ipynb`**: Shows how to use Hoplite to compute aggregate
    call density statistics, which can act as an indicator of species
    abundance in many cases (as described in
    https://arxiv.org/abs/2402.15360).
*   **`99_migrate_db.ipynb`**: A utility notebook for migrating Hoplite
    databases created with `perch-hoplite < 1.0` to the format used by
    `perch-hoplite >= 1.0`.

perch_hoplite/agile/embed.py

Lines changed: 57 additions & 8 deletions

```diff
@@ -19,13 +19,15 @@
 import dataclasses
 import functools
 import itertools
+import os
 import threading

 from absl import logging
 import audioread
 from ml_collections import config_dict
 import numpy as np
 from perch_hoplite import audio_io
+from perch_hoplite.agile import metadata
 from perch_hoplite.agile import source_info
 from perch_hoplite.db import interface as hoplite_interface
 from perch_hoplite.zoo import model_configs
@@ -142,6 +144,10 @@ def __init__(
     self.audio_globs = {
         g.dataset_name: g for g in self.audio_sources.audio_globs
     }
+    self.metadata = {
+        g.dataset_name: metadata.AgileMetadata.from_directory(g.base_path)
+        for g in self.audio_sources.audio_globs
+    }

   def _log_error(self, source_id, exception, counter_name):
     logging.warning(
@@ -213,8 +219,11 @@ def _get_or_insert_deployment_id(
         config_dict.create(eq=dict(name=deployment_name, project=project_name))
     )
     if not deployments:
+      md = self.metadata[project_name].get_deployment_metadata(deployment_name)
       return self.db.insert_deployment(
-          name=deployment_name, project=project_name
+          name=deployment_name,
+          project=project_name,
+          **md,
       )
     else:
       return deployments[0].id
@@ -224,6 +233,7 @@ def _get_or_insert_recording_id(
       self,
       filename: str,
       deployment_id: int,
+      dataset_name: str,
   ) -> tuple[int, bool]:
     """Get the recording ID, and indicate whether it was newly inserted."""
     recordings = self.db.get_all_recordings(
@@ -232,9 +242,12 @@ def _get_or_insert_recording_id(
         )
     )
     if not recordings:
+      md = self.metadata[dataset_name].get_recording_metadata(filename)
       return (
           self.db.insert_recording(
-              filename=filename, deployment_id=deployment_id
+              filename=filename,
+              deployment_id=deployment_id,
+              **md,
           ),
           True,
       )
@@ -245,8 +258,9 @@ def add_deployments(self, target_dataset_name: str | None = None):
     """Add deployments to db and create a source ID to deployment ID mapping."""
     # Create missing deployments in the database.
     for source in self.audio_sources.iterate_all_sources(target_dataset_name):
+      deployment_name = source.deployment_name_from_file_id()
       self._get_or_insert_deployment_id(
-          deployment_name=source.deployment_name_from_file_id(),
+          deployment_name=deployment_name,
           project_name=source.dataset_name,
       )
     self.db.commit()
@@ -256,18 +270,53 @@ def add_recordings(self, target_dataset_name: str | None = None) -> set[int]:
     new_recordings = set([])
     for source in self.audio_sources.iterate_all_sources(target_dataset_name):
       deployment_id = self._get_or_insert_deployment_id(
-          source.deployment_name_from_file_id(), source.dataset_name
+          deployment_name=source.deployment_name_from_file_id(),
+          project_name=source.dataset_name,
       )
       recording_id, is_new = self._get_or_insert_recording_id(
-          source.file_id, deployment_id
+          source.file_id,
+          deployment_id,
+          source.dataset_name,
       )
       if is_new:
         new_recordings.add(recording_id)
     self.db.commit()
     return new_recordings

-  def add_annotations(self):
-    pass
+  def add_annotations(self, target_dataset_name: str | None = None):
+    """Add annotations from metadata to db."""
+    dataset_names = self.metadata.keys()
+    if target_dataset_name is not None:
+      dataset_names = [target_dataset_name]
+
+    for dataset_name in dataset_names:
+      if dataset_name not in self.metadata:
+        continue
+      agile_md = self.metadata[dataset_name]
+      if not agile_md.annotations:
+        continue
+      for file_id, annotation_list in agile_md.annotations.items():
+        depl_name = os.path.split(file_id)[0]
+        if not depl_name:
+          logging.warning(
+              'Could not get deployment name from file_id %s, skipping.',
+              file_id,
+          )
+          continue
+        depl_id = self._get_or_insert_deployment_id(depl_name, dataset_name)
+        rec_id, _ = self._get_or_insert_recording_id(
+            file_id, depl_id, dataset_name
+        )
+        for annotation in annotation_list:
+          self.db.insert_annotation(
+              rec_id,
+              annotation.offsets,
+              annotation.label,
+              annotation.label_type,
+              provenance=annotation.provenance,
+              handle_duplicates='skip',
+          )
+    self.db.commit()

   def embed_dataset(
       self,
@@ -307,7 +356,7 @@ def embed_dataset(
           s.deployment_name_from_file_id(), s.dataset_name
       )
       recording_id, _ = self._get_or_insert_recording_id(
-          s.file_id, deployment_id
+          s.file_id, deployment_id, s.dataset_name
       )
       recording_ids.append(recording_id)
       if all(r in new_recordings for r in recording_ids):
```
