This sub-project is responsible for scraping wind data from various sources and storing it in the database.
The scraper fetches wind observation data from one or more configured stations in a continuous loop. For each station, it performs the following steps:
- Fetch: It makes an HTTP request to the station's configured URL to retrieve the latest wind data.
- Parse: It parses the raw data (JSON is currently supported) to extract the wind direction, speed, gust, and timestamp.
- Store: It stores the parsed data in a PostgreSQL database.
The scraper is designed to be resilient, with a retry mechanism for network errors and the ability to handle stale data. It also tracks the status of each station and can report errors to the database.
The scraper is configured through a TOML file (e.g., windburglr.toml.example). The configuration file is divided into two sections: general and stations.
The general section contains the following parameters:
log_level: The logging level (e.g.,INFO,DEBUG).refresh_rate: The number of seconds to wait between scraping cycles.db_url: The URL of the PostgreSQL database. This can also be provided via the--database-urlcommand-line argument or theDATABASE_URLenvironment variable.output_mode: The output mode. Can bepostgres(to store data in a PostgreSQL database) orstdout(to print data to the console).
The stations section is a list of one or more stations to scrape. Each station has the following parameters:
name: The name of the station (e.g.,CYTZ).url: The URL of the station's wind data feed.timeout: The timeout in seconds for the HTTP request (default:15).headers: A table of HTTP headers to include in the request.parser: The parser to use for the data feed. Currently, onlyjsonis supported (default:json).direction_path: Dot-separated path to the wind direction value in the JSON data (e.g.,sensor_data.wind_direction). Default:direction.speed_path: Dot-separated path to the wind speed value in the JSON data (e.g.,sensor_data.wind_speed). Default:speed.gust_path: Dot-separated path to the wind gust value in the JSON data (e.g.,sensor_data.wind_gust). Default:gust.timestamp_path: Dot-separated path to the timestamp value in the JSON data (e.g.,sensor_data.updated). Default:timestamp.timestamp_format: The format of the timestamp in the data feed using Python'sstrptimeformat codes (e.g.,%Y-%m-%d %H:%M). See the Python datetime documentation for available format codes. Default:%Y-%m-%d %H:%M.timezone: The timezone of the timestamp in the data feed (default:UTC).local_timezone: The local timezone of the station.stale_data_timeout: The minimum elapsed time in seconds before data is considered stale (default:300).
The path fields (direction_path, speed_path, gust_path, timestamp_path) use dot notation to traverse nested JSON structures. For example, given this JSON response:
{
"sensor_data": {
"SENSOR_ID": {
"wind_direction": "180",
"wind_speed": "15.5",
"wind_gust": "20.0",
"updated": "2024-01-01 12:00"
}
}
}You would configure the paths as:
direction_path = "sensor_data.SENSOR_ID.wind_direction"
speed_path = "sensor_data.SENSOR_ID.wind_speed"
gust_path = "sensor_data.SENSOR_ID.wind_gust"
timestamp_path = "sensor_data.SENSOR_ID.updated"This allows the scraper to work with different JSON structures without code changes.
The scraper has robust error handling and can recover from transient errors. It catches and logs various exceptions, including:
- HTTP errors
- Parsing errors
- Network timeouts
- Stale data
The scraper also uses Sentry for error reporting. To enable Sentry, create a .env.sentry file with the following environment variables:
SENTRY_DSN: Your Sentry DSN.SENTRY_ENVIRONMENT: The Sentry environment (e.g.,production,development).SENTRY_RELEASE: The Sentry release.
- Python 3.13 or later
- UV
- Postgres with TimescaleDB (for storage - optional)
For the best development experience, use Nix to set up a complete development environment. This is the recommended approach as it automatically provides a PostgreSQL database for running integration tests.
-
Install Nix: Follow the instructions at https://nixos.org/download.html to install Nix.
-
Enable Flakes: If you haven't already, enable Nix flakes:
echo "experimental-features = nix-command flakes" >> ~/.config/nix/nix.conf
-
Enter the Development Environment:
nix develop
Or, if you use
direnv:direnv allow
The Nix environment will automatically:
- Set up Python 3.13 with all dependencies
- Initialize a PostgreSQL 15 database
- Configure development tools (pyright, ruff, etc.)
- Create a Python virtual environment
- Install all Python dependencies
# Install dependencies
uv pip install -e .
# Run the scraper with default configuration from ./windburglr.toml
scraper
# Run the scraper with a custom configuration file
scraper --config-file custom.windburglr.tomlYou can also use the --help flag to see a list of all available command-line arguments:
scraper --helpTo run the tests, you need to have pytest installed. You can then run the tests using the following command:
pytestThis will run all the unit and integration tests. You can also run specific tests by providing a path to a test file or directory. For example:
# Install dev dependencies
uv sync --dev
# Run all tests in the tests/unit directory
pytest tests/unit
# Run a specific test file
pytest tests/unit/test_scraper.pyNote: The integration tests require a running PostgreSQL database. The recommended way to run the tests is within the Nix development environment, which automatically provides a PostgreSQL database.