Commit ab58e37

Merge pull request #77 from orcasound/partitioned-file-accessor
Partitioned file accessor
2 parents ce4e5d4 + 97c9bb3 commit ab58e37

25 files changed, +302 -64 lines changed

pages/Broadband_Comparison.py

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
 from plotly.subplots import make_subplots
 from scipy import signal

-from src.orcasound_noise.analysis import accessor
+from src.orcasound_noise.analysis.legacy import accessor
 from src.orcasound_noise.utils.hydrophone import Hydrophone
 from src.orcasound_noise.pipeline import pipeline
 from src.orcasound_noise.pipeline import acoustic_util

pages/Daily_Trends.py

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 import streamlit as st
 import pandas as pd

-from src.orcasound_noise.analysis import DailyNoiseAnalysis
+from src.orcasound_noise.analysis.legacy.daily_noise import DailyNoiseAnalysis
 from src.orcasound_noise.utils import Hydrophone
 from src.orcasound_noise.pipeline import pipeline
 from src.orcasound_noise.pipeline import acoustic_util

pages/Spectrograms.py

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
 import streamlit as st
 import plotly.graph_objects as go

-from src.orcasound_noise.analysis import accessor
+from src.orcasound_noise.analysis.legacy import accessor
 from src.orcasound_noise.utils import Hydrophone
 from src.orcasound_noise.pipeline import pipeline
 from src.orcasound_noise.pipeline import acoustic_util

requirements.txt

Lines changed: 5 additions & 3 deletions
@@ -5,14 +5,16 @@ matplotlib-inline==0.2.1
 scipy==1.16.3
 streamlit==1.17.0
 librosa==0.11.0
-scikit-image==0.19.3
+scikit-image==0.26
 scikit-learn==1.7.2
 ipykernel==6.17.1
-numpy==1.25.2
+numpy==2.4.2
 pandas==2.3.3
 boto3>=1.26.65
 python-dotenv==1.2.1
 plotly==6.5.0
 altair<5
 pytest
-pytest-asyncio
+pytest-asyncio
+polars==1.38.0
+pyarrow==23.0.0
Lines changed: 71 additions & 28 deletions
@@ -1,49 +1,92 @@
-# Noise Accessor
+# Power Spectral Density Parquet File Retrieval and Analysis Functionality

-The accessor is the toolkit used for accessing the stored files. This is done by initializing a NoiseAccessor object for a specific hydrophone, and then requesting a time range and optional time and frequency resolution (or granularity). The accessor scans the generated archive files, loads the correct ones, concatenates the data into a single dataframe, and then trims any data outside of the requested range.
+These modules retrieve parquet files of hydrophone power spectral density (PSD) and broadband sound level stored on AWS S3,
+and include functionality to analyze that sound data.

-Example:
+## partitioned_accessor
+
+The accessor uses the Python Polars library to retrieve partitioned parquet files, relying on lazy loading for fast on-demand data retrieval.
+
+Current partition structure:
+
+*psd/hydrophone=###/year=####/month=##/day=##/*
+*broadband/hydrophone=###/year=####/month=##/day=##/*
+
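As an illustration of the partition layout listed above, the key=value prefix for a single day can be assembled from a date. This is only a hedged sketch: `partition_prefix` and the hydrophone identifier `"example_node"` are hypothetical names, and the zero-padded widths are inferred from the `##` placeholders rather than confirmed by the commit.

```python
import datetime as dt


def partition_prefix(dataset: str, hydrophone: str, day: dt.date) -> str:
    """Build a key=value partition prefix for one day of data.

    `dataset` is "psd" or "broadband"; `hydrophone` is a placeholder
    identifier, since the real naming scheme is not shown in the diff.
    """
    return (
        f"{dataset}/hydrophone={hydrophone}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


# Hypothetical identifier, used purely for illustration.
print(partition_prefix("psd", "example_node", dt.date(2026, 2, 5)))
```

Directory names of this key=value form are what Polars' `scan_parquet` can treat as partition columns through its `hive_partitioning` option, which is presumably how a lazy query avoids reading days outside the requested range.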
+### Dependencies
+
+* Requires the AWS CLI on PATH (external install)
+
+### Current analytical metrics
+
+* Broadband sound level for a given frequency range
+  * use 500-15000 Hz for the orca communication band
+  * use >15000 Hz for the orca echolocation band
+* 0.05, 0.25, 0.75, 0.95 broadband quantiles for a given range
+* Quantile vs dB range of broadband
+
+### Example

 ```python
-from src.orcasound_noise.analysis import NoiseAccessor
+import datetime as dt
+from orcasound_noise.analysis.partitioned_accessor import PartitionedAccessor
+from orcasound_noise.utils import Hydrophone
+
+# start and end time for time range of dataset
+start = dt.datetime(2026, 2, 5, 0, 0, 0)
+end = dt.datetime(2026, 2, 6, 0, 0, 0)

-ac = NoiseAccessor(Hydrophone.ORCASOUND_LAB)
-df = ac.create_df(dt.datetime(2023, 2, 1), dt.datetime(2023, 2, 2), delta_t=10, delta_f="3oct")
-print(df.shape) # (8638, 26)
+pa_orcalab = PartitionedAccessor(Hydrophone.ORCASOUND_LAB, start, end)
+
+# start and end time of a specific ship passage, or other event of interest
+start_ship = dt.datetime(2026, 2, 5, 12, 30, 0)
+end_ship = dt.datetime(2026, 2, 5, 12, 55, 0)
+
+quantiles = pa_orcalab.get_quantiles(start_ship, end_ship)
 ```

-where the parameters `delta_t=10` and `delta_f="3oct"` specify computation of 1/3-octave band levels over 10-second time intervals.
+### Overview of Broadband sound level calculation from PSD
+
+Assume broadband $SPL$ is represented as follows:
+
+$$
+SPL = 10\log_{10}\frac{p^2(t)}{p^2_{ambient}} \; \text{or} \; SPL = 10\log_{10}\frac{V^2(t)}{V^2_{ambient}}
+$$
+
+where:
+
+$p^2(t) = V^2(t)/sensitivity^2$
+
+$p^2(t)$ has units of pascals squared ($Pa^2$) and is the mean square of the pressure waveform over a given windowing time, $t$
+
+$V^2(t)$ is the mean square of the voltage waveform generated by the hydrophone

-# Usage
+$sensitivity$ has units of $V/Pa$ and characterizes the sensitivity of the hydrophone

-To initialize a NoiseAccessor object, all that is needed is a Hydrophone enum instance. This instance contains all needed connection info.
+$p^2_{ambient}$ or $V^2_{ambient}$ is the mean square of the waveform over a period of time that is assumed to reflect the ancient ambient noise of Puget Sound.

-## Create a Dataframe
+Since the sound level is a ratio, the sensitivity cancels out and the sound pressure level can be computed directly from the voltage waveform.

-The NoiseAccessor object has a create_df method that can be used to generate dataframes of requested ranges. It needs the following arguments:
+#### PSD to broadband sound level

-- start: datetime object representing start of range
-- end: datetime object representing end of range
-- delta_t: Int, Time interval to find
-- delta_f: Str, Hz frequency to find. Use format '50hz' for linear hz bands or '3oct' for octave bands
-- round_timestamps: Bool, default False. Set to True to round timestamps to the delta_t frequency. Good for when grouping by time.
+$$
+p^2 = \sum_{k=f_1}^{f_2} PSD(k) \times \Delta f
+$$

-Currently, only 1 second 3rd octave files (`delta_t=1, delta_f="3oct"`) are periodically generated and available in AWS: anything else must be manually created and uploaded first using the [NoiseAnalysisPipeline](../pipeline/README.md).
+Where $PSD(k)$ has units of $Pa^2/Hz$

-## delta_f
+Our PSD data is reported in dB re $Pa^2/Hz$, so the values need to be converted back to linear with:

-This argument is a string to allow different frequency banding methods. Note that only frequency bands that have been pre-compiled are available to access.
+$$
+PSD(f) = p_{ambient}^2 \times 10^{PSD(f)_{dB}/10}
+$$

-- To access linear frequency bands, use the "hz" suffix. For example, a "50hz" would return frequency bounds in columns like [0, 50, 100, 150...]
-- To access (fractions of) octave bands, use the "oct" suffix. "3oct" will return the 1/3 octave bands, starting with [63, 80, 100, 125, 160...]
-- To access broadband noise, use the "broadband" suffix. This returns a single column representing the total noise level across all frequencies sensed by the hydrophone recording system.
+#### $\Delta f$ given 1/12 octave bands

-## round_timestamps
+Take $n = 12$ for 1/12 octaves and $f_c$ as the center frequency reported in the PSD:

-Due to the nature of Orcasound's source data (see the [orcanode repo](https://github.com/orcasound/orcanode)), timestamps can experience some drift in the nanosecond precision. A dataframe may start with 00:00:00.010 but may end with 00:00:00.020 or a larger gap.
+$f_{i,low} = \frac{f_c}{2^{1/(2n)}}$ and $f_{i,high} = f_c \times 2^{1/(2n)}$

-If you want to do time-based analysis across multiple days, this can cause mis-alignment. To correct, set the _round_timestamps_ argument to true. This will round the timestamps to the delta_t value's precision, dropping nanosecond values. For example, at delta_t=10 and round_timestamps=True, every timestamp will be a multiple of 10 seconds from the minute.
+$\Delta f_i = f_c \left( 2^{1/(2n)} - \frac{1}{2^{1/(2n)}} \right)$

-_*Warning*_ Rounding is only available when delta_t is a divisor of 60.
+Then:

-# Structure
+$\Delta f_i \approx 0.0578 \times f_c$
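
The broadband calculation described in the README diff above can be sketched in a few lines of Python. This is a minimal illustration rather than the code from this commit: the function names `band_width` and `broadband_level_db` and the sample values are hypothetical, and it assumes the PSD levels are given in dB relative to the same reference as the desired output, so the reference term drops out of the ratio.

```python
import math


def band_width(f_c: float, n: int = 12) -> float:
    """Width in Hz of a 1/n-octave band centered on f_c."""
    half_step = 2 ** (1 / (2 * n))  # ratio from center to upper band edge
    return f_c * (half_step - 1 / half_step)


def broadband_level_db(centers_hz, psd_db, f_low, f_high, n=12):
    """Integrate PSD over the bands whose center lies in [f_low, f_high].

    Each dB level is converted to linear units, multiplied by its band
    width, summed, and converted back to dB.
    """
    total = 0.0
    for f_c, level_db in zip(centers_hz, psd_db):
        if f_low <= f_c <= f_high:
            total += 10 ** (level_db / 10) * band_width(f_c, n)
    if total == 0.0:
        raise ValueError("no band centers fall inside the requested range")
    return 10 * math.log10(total)


# Made-up band centers and levels, for illustration only.
centers = [1000.0, 2000.0, 20000.0]
levels = [60.0, 60.0, 40.0]
print(broadband_level_db(centers, levels, 500, 15000))
```

Because the band width grows in proportion to the center frequency, a flat PSD in dB still contributes more energy from higher bands, which is why the sum weights each band by its own $\Delta f_i$ instead of a single constant.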
Lines changed: 2 additions & 2 deletions
@@ -1,2 +1,2 @@
-from .daily_noise import DailyNoiseAnalysis
-from .accessor import NoiseAccessor
+from .legacy.daily_noise import DailyNoiseAnalysis
+from .legacy.accessor import NoiseAccessor
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+# Noise Accessor
+
+The accessor is the toolkit used for accessing the stored files. This is done by initializing a NoiseAccessor object for a specific hydrophone, and then requesting a time range and optional time and frequency resolution (or granularity). The accessor scans the generated archive files, loads the correct ones, concatenates the data into a single dataframe, and then trims any data outside of the requested range.
+
+Example:
+
+```python
+from src.orcasound_noise.analysis import NoiseAccessor
+
+ac = NoiseAccessor(Hydrophone.ORCASOUND_LAB)
+df = ac.create_df(dt.datetime(2023, 2, 1), dt.datetime(2023, 2, 2), delta_t=10, delta_f="3oct")
+print(df.shape) # (8638, 26)
+```
+
+where the parameters `delta_t=10` and `delta_f="3oct"` specify computation of 1/3-octave band levels over 10-second time intervals.
+
+# Usage
+
+To initialize a NoiseAccessor object, all that is needed is a Hydrophone enum instance. This instance contains all needed connection info.
+
+## Create a Dataframe
+
+The NoiseAccessor object has a create_df method that can be used to generate dataframes of requested ranges. It needs the following arguments:
+
+- start: datetime object representing start of range
+- end: datetime object representing end of range
+- delta_t: Int, Time interval to find
+- delta_f: Str, Hz frequency to find. Use format '50hz' for linear hz bands or '3oct' for octave bands
+- round_timestamps: Bool, default False. Set to True to round timestamps to the delta_t frequency. Good for when grouping by time.
+
+Currently, only 1 second 3rd octave files (`delta_t=1, delta_f="3oct"`) are periodically generated and available in AWS: anything else must be manually created and uploaded first using the [NoiseAnalysisPipeline](../pipeline/README.md).
+
+## delta_f
+
+This argument is a string to allow different frequency banding methods. Note that only frequency bands that have been pre-compiled are available to access.
+
+- To access linear frequency bands, use the "hz" suffix. For example, a "50hz" would return frequency bounds in columns like [0, 50, 100, 150...]
+- To access (fractions of) octave bands, use the "oct" suffix. "3oct" will return the 1/3 octave bands, starting with [63, 80, 100, 125, 160...]
+- To access broadband noise, use the "broadband" suffix. This returns a single column representing the total noise level across all frequencies sensed by the hydrophone recording system.
+
+## round_timestamps
+
+Due to the nature of Orcasound's source data (see the [orcanode repo](https://github.com/orcasound/orcanode)), timestamps can experience some drift in the nanosecond precision. A dataframe may start with 00:00:00.010 but may end with 00:00:00.020 or a larger gap.
+
+If you want to do time-based analysis across multiple days, this can cause mis-alignment. To correct, set the _round_timestamps_ argument to true. This will round the timestamps to the delta_t value's precision, dropping nanosecond values. For example, at delta_t=10 and round_timestamps=True, every timestamp will be a multiple of 10 seconds from the minute.
+
+_*Warning*_ Rounding is only available when delta_t is a divisor of 60.
+
+# Structure
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+# Legacy analysis modules

src/orcasound_noise/analysis/accessor.py renamed to src/orcasound_noise/analysis/legacy/accessor.py

Lines changed: 2 additions & 2 deletions
@@ -6,8 +6,8 @@

 import pandas as pd

-from ..utils.file_connector import S3FileConnector
-from ..utils import Hydrophone
+from ...utils.file_connector import S3FileConnector
+from ...utils import Hydrophone

 class NoiseAccessor:

File renamed without changes.
