NCAR/convergence_landscape

Convergence Landscape: Masked Autoencoder for RAF Flight Data

A self-supervised Masked Autoencoder (MAE) that learns physical state representations from NCAR Research Aviation Facility (RAF) aircraft timeseries data. The model produces per-window embeddings ("physical state fingerprints") useful for flight segment clustering, anomaly detection, and understanding how atmospheric conditions evolve over flights.

Why MAE Over Sequence Models (LSTM, etc.)?

This project uses a transformer-based MAE rather than an LSTM or other sequence model. The key reasons:

  1. Variable missingness is the core challenge. RAF data has irregular variable availability across projects and within flights. LSTMs expect a fixed input vector at each timestep — missing variables must be imputed or zero-filled, losing the distinction between "measured zero" and "not measured." The MAE's token-per-variable architecture handles this natively: each (timestep, variable) pair is an independent token, and missing ones get a learned missing_token embedding.

  2. Self-supervised learning without labels. There are no labels for flight segments or anomalies. An LSTM needs a prediction target (next timestep? classification?). The MAE's reconstruction objective is label-free — it learns representations by reconstructing masked variables from visible ones, effectively learning "what physical state explains these co-occurring measurements?"

  3. Cross-variable relationships matter more than temporal dynamics. Within a 30s window at 1Hz, atmospheric state barely changes — temperature, pressure, and moisture are nearly constant. The interesting signal is across variables: "given this temperature and altitude, what should wind and moisture look like?" That's a reconstruction/association task, not a sequence prediction task.

  4. Symmetric attention produces better embeddings. The MAE's attention mechanism pools information across all timesteps and variables symmetrically. An LSTM's hidden state would be biased toward the end of the window.

Where an LSTM would be better: forecasting future atmospheric state, modeling full-flight regime transitions over minutes/hours, or if variable availability were uniform. A natural extension is using MAE embeddings as input features to a flight-level LSTM for longer-range temporal modeling.

Architecture

Overview

Input Window [B, 30, N]          Context [B, 30, 8]
        │                              │
   var_embedding (Linear 1→d)    context_proj (Linear 8→d)
        │                              │
   + temporal_pos [30, d]              │
   + variable_pos [N, d]               │
        │                              │
        └──────────── + ───────────────┘
                      │
         Replace missing → missing_token
         Replace masked  → mask_token
                      │
              Flatten to [B, 30×N, d]
                      │
              ┌───────┴───────┐
              │   Encoder     │  (4-6 layers)
              │  Transformer  │
              │  EncoderLayer │
              └───────┬───────┘
                      │
              ┌───────┴───────┐
              │   Decoder     │  (2-4 layers)
              │  Transformer  │
              │  DecoderLayer │
              └───────┬───────┘
                      │
              Reshape to [B, 30, N, d]
                      │
              output_proj (Linear d→1)
                      │
              Predictions [B, 30, N]

Token Structure

Each input window is a [window_size, n_vars] matrix of normalized atmospheric measurements at 1Hz. The model treats every (timestep, variable) pair as a separate token, giving 30 × N tokens per window (e.g., 30 × 19 = 570 tokens in lite mode).

Each token embedding is the sum of:

  • Value embedding: Linear(1 → d_model) applied to the scalar measurement
  • Temporal position: learned Embedding(30, d_model) — encodes position within the window
  • Variable position: learned Embedding(n_vars, d_model) — encodes which variable
  • Spatiotemporal context: Linear(8 → d_model) — encodes real-world location and time (shared across all variables at each timestep, always visible)
  • Project embedding (optional): Embedding(n_projects, d_model) — for domain adversarial training
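The embedding sum above can be sketched as a small PyTorch module. This is an illustrative reconstruction, not the actual source: the module and argument names (TokenEmbedding, n_context) are assumptions; defaults follow the lite configuration.

```python
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Sketch: sum of value, temporal-position, variable-position,
    and spatiotemporal-context embeddings for each (timestep, variable) token."""
    def __init__(self, window_size=30, n_vars=19, d_model=128, n_context=8):
        super().__init__()
        self.value_emb = nn.Linear(1, d_model)              # scalar value -> d
        self.temporal_pos = nn.Embedding(window_size, d_model)
        self.variable_pos = nn.Embedding(n_vars, d_model)
        self.context_proj = nn.Linear(n_context, d_model)

    def forward(self, x, context):
        # x: [B, W, N] measurements; context: [B, W, 8] spatiotemporal features
        tok = self.value_emb(x.unsqueeze(-1))                     # [B, W, N, d]
        tok = tok + self.temporal_pos.weight[None, :, None, :]    # per timestep
        tok = tok + self.variable_pos.weight[None, None, :, :]    # per variable
        tok = tok + self.context_proj(context).unsqueeze(2)       # shared across vars
        return tok                                                # [B, W, N, d]

emb = TokenEmbedding()
out = emb(torch.randn(2, 30, 19), torch.randn(2, 30, 8))
print(out.shape)  # torch.Size([2, 30, 19, 128])
```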

Handling Missing Data

Unavailable variables receive a learned missing_token embedding instead of being hidden via attention padding masks. This prevents variable availability patterns from acting as a project fingerprint (which previously caused >90% project classification leakage). The transformer sees all 30 × N tokens uniformly — missing positions participate in attention but carry a learned "I'm missing" signal rather than a hard binary mask.

Separately, positions selected for reconstruction masking receive a mask_token embedding.
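The two substitutions can be sketched with broadcasting. In the model, missing_token and mask_token are learned parameters; constant fills are used here so the effect is checkable, and all names are illustrative.

```python
import torch

B, W, N, d = 2, 30, 19, 128
tokens = torch.randn(B, W, N, d)
available = torch.rand(B, W, N) > 0.3       # True where the variable was measured
recon_mask = torch.rand(B, W, N) < 0.25     # positions chosen for reconstruction

missing_token = torch.full((d,), -1.0)      # a learned embedding in the model
mask_token = torch.full((d,), -2.0)         # a learned embedding in the model

# Missing positions get the missing token; masked (and measured) ones the mask token.
tokens = torch.where(available.unsqueeze(-1), tokens, missing_token)
tokens = torch.where((recon_mask & available).unsqueeze(-1), mask_token, tokens)
# All 30*N tokens still attend to each other; no padding mask is needed.
```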

Spatiotemporal Context

Context features are computed per-timestep and never masked — they condition every token so the model knows where and when each measurement was taken:

Channel    Encoding                  Purpose
lat_norm   lat / 90                  Latitude, scaled to [-1, 1]
lon_sin    sin(lon × π/180)          Longitude (wraparound-safe)
lon_cos    cos(lon × π/180)          Longitude (wraparound-safe)
alt_norm   alt / 15000               Altitude, normalized by ~aircraft ceiling
tod_sin    sin(2π × hour/24)         Time-of-day cycle
tod_cos    cos(2π × hour/24)         Time-of-day cycle
doy_sin    sin(2π × day/365)         Seasonal cycle
doy_cos    cos(2π × day/365)         Seasonal cycle

Masking Strategy

The model uses group-level masking that respects instrument relationships:

  • Variable groups: atmospheric_state (backbone), wind, chemistry, aerosol, cloud_and_precip, navigation
  • Backbone group (atmospheric_state): never fully masked — always provides context
  • Other groups: masked with probability MASK_GROUP_PROB (entire group) or individual variables masked with MASK_VAR_PROB

Only positions where data was actually measured (or imputed) contribute to the reconstruction loss.
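The group-level scheme can be sketched as follows. Group memberships and index layout are illustrative (the aerosol group is omitted in this 19-variable sketch); MASK_GROUP_PROB and MASK_VAR_PROB mirror the config names above.

```python
import numpy as np

rng = np.random.default_rng(0)
groups = {
    "atmospheric_state": [0, 1, 2, 3, 4, 5, 6],  # backbone: never fully masked
    "wind": [7, 8, 9, 10, 11],
    "chemistry": [12, 13],
    "cloud_and_precip": [14],
    "navigation": [15, 16, 17, 18],
}
MASK_GROUP_PROB, MASK_VAR_PROB = 0.3, 0.15

mask = np.zeros(19, dtype=bool)
for name, idx in groups.items():
    if name == "atmospheric_state":
        continue                                  # backbone always provides context
    if rng.random() < MASK_GROUP_PROB:
        mask[idx] = True                          # mask the entire group
    else:
        mask[idx] = rng.random(len(idx)) < MASK_VAR_PROB  # per-variable masking
```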

Model Configurations

Parameter          Full   Lite
d_model            256    128
n_encoder_layers   6      4
n_decoder_layers   4      2
n_heads            8      4
feedforward_dim    1024   512
dropout            0.1    0.1

Data Pipeline

Source

PostgreSQL database containing 1Hz timeseries from RAF flight campaigns. Each project (e.g., CGWAVES, GOTHAAM, WECAN) contains multiple flights with ~1300 instrument variables.

Export (export_all_cache.py)

Converts raw database records into training-ready NPZ files:

  1. Variable canonicalization: Strips instrument suffixes (e.g., CO_PICCO → CO) and merges equivalent variables across projects into a unified vocabulary
  2. Light mode (--light): Curates ~19 key variables across atmospheric state, wind, chemistry, aerosol, and cloud groups
  3. Ground filtering: Excludes ground data using weight-on-wheels (WOW=0 → airborne) with TAS > 50 m/s fallback
  4. Sliding windows: 30s windows at 5s stride, requiring ≥30% data availability
  5. Normalization: Global or per-project z-score using percentile-clipped (1st–99th) statistics; clipped to [-10, 10]σ to prevent extreme outliers (e.g., cloud liquid water spikes) from destabilizing training. Global normalization (--global-norm) is recommended for cross-project clustering — ensures normalized values are physically comparable across campaigns.
  6. Imputation: Partial gaps filled with window mean; fully absent variables set to 0 with availability=False
  7. Context array: 8-channel spatiotemporal features computed from GPS coordinates and timestamps
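Step 5's percentile-clipped z-score can be sketched as below. This is an illustrative reconstruction (function name and exact clipping order are assumptions); `x` holds one variable's raw values.

```python
import numpy as np

def clipped_zscore(x, p_lo=1, p_hi=99, sigma_clip=10.0):
    """Z-score using statistics from percentile-clipped data, then cap at ±10σ."""
    lo, hi = np.nanpercentile(x, [p_lo, p_hi])
    clipped = np.clip(x, lo, hi)                  # robust mean/std estimates
    mean, std = np.nanmean(clipped), np.nanstd(clipped)
    z = (x - mean) / max(std, 1e-8)
    return np.clip(z, -sigma_clip, sigma_clip)    # tame e.g. cloud LWC spikes
```

An extreme spike (say, a single 1000 in otherwise standard-normal data) would dominate a naive std estimate; here it is clipped before the statistics are computed and then capped at 10σ in the output.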

Output per project:

cache_light/projects/{PROJECT}/
├── windows.npz          # data, availability_mask, imputed_mask, context,
│                        # flight_ids, flight_numbers, project_ids, t_starts
├── norm_stats.json      # per-variable mean/std
└── project_info.json    # metadata

Variables (Lite Mode)

Group               Variables
Atmospheric State   ATX, THETA, THETAV, PSXC, DPXC, MR, RHUM
Wind                UI, VI, WI, WS, WD
Chemistry           CO, O3
Cloud/Precip        PLWCC
Navigation          GGALT, GGLAT, GGLON, TASX
Aerosol             CONCU (when available)

Loss Function

Masked MSE computed only at positions that were both masked and had available data:

loss = MSE(predictions[mask & available], targets[mask & available])

Three imputation-aware modes:

  • standard: Treats imputed values as real measurements
  • strict: Loss only on originally-measured positions (excludes imputed)
  • weighted: Full weight on measured, reduced weight (IMPUTATION_WEIGHT) on imputed
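The three modes can be sketched in one function. Argument names are illustrative; imputation_weight stands in for the IMPUTATION_WEIGHT config value.

```python
import torch

def masked_mse(pred, target, mask, available, imputed,
               mode="weighted", imputation_weight=0.5):
    """Masked MSE over positions that were both masked and had data."""
    valid = mask & available
    if mode == "strict":
        valid = valid & ~imputed                  # originally-measured only
    err = (pred - target) ** 2
    if mode == "weighted":
        w = torch.where(imputed, torch.full_like(err, imputation_weight),
                        torch.ones_like(err))
        return (err * w)[valid].sum() / w[valid].sum().clamp(min=1e-8)
    return err[valid].mean()                      # "standard" and "strict"
```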

Training

  • Optimizer: AdamW (lr=1e-4, weight_decay=0.01)
  • Scheduler: OneCycleLR (5% linear warmup + cosine decay), stepped per batch
  • Gradient clipping: max norm 1.0
  • Early stopping: patience=10 epochs, min_delta=1e-4
  • Train/val split: Per-flight (80/20) to prevent temporal leakage
  • Project balancing (optional): Weighted sampling to upsample smaller projects
  • Hyperparameter search: Built-in random search cell (12 configs × 5 epochs)
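The optimizer and scheduler setup above maps to standard PyTorch calls. A minimal sketch, with a stand-in model and illustrative step counts:

```python
import torch

model = torch.nn.Linear(128, 128)                # stand-in for the MAE
epochs, steps_per_epoch = 50, 200                # illustrative values

opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=1e-4, total_steps=epochs * steps_per_epoch,
    pct_start=0.05, anneal_strategy="cos")       # 5% warmup + cosine decay

# Per batch:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   opt.step(); sched.step(); opt.zero_grad()    # scheduler steps per batch
```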

Embedding Extraction

After training, the encoder produces embeddings via mean-pooling over available positions:

emb = model.get_embedding(data, availability_mask, context=ctx)  # [B, d_model]

Each embedding is a d_model-dimensional "physical state fingerprint" for a 30s flight window, conditioned on its spatial location, altitude, and time.
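A plausible implementation of the pooling inside get_embedding (not the actual source) is an availability-weighted mean over encoder outputs:

```python
import torch

def pool_embedding(encoded, available):
    # encoded: [B, W, N, d] encoder outputs; available: [B, W, N] bool
    m = available.unsqueeze(-1).float()
    # Sum over timesteps and variables, divide by the count of available tokens.
    return (encoded * m).sum(dim=(1, 2)) / m.sum(dim=(1, 2)).clamp(min=1)

emb = pool_embedding(torch.randn(4, 30, 19, 128), torch.rand(4, 30, 19) > 0.3)
print(emb.shape)  # torch.Size([4, 128])
```

Pooling only over available positions keeps the missing_token embeddings from diluting the fingerprint.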

Applications

Immediate (current model)

  • Flight segment clustering: Group 30s windows by learned physical similarity (e.g., boundary layer, free troposphere, cloud penetration, pollution plume). Track how cluster labels evolve over a flight to see regime transitions.
  • Anomaly detection: Windows with high reconstruction error represent atmospheric conditions the model hasn't learned to explain — unusual aerosol events, instrument malfunctions, rare meteorological phenomena, or air mass boundaries.
  • Cross-project similarity search: Given a window from one campaign, find the most similar conditions observed in other campaigns via embedding nearest neighbors. ("When did WECAN see conditions most like this GOTHAAM cloud event?")
  • Instrument QC: Variables with consistently high per-variable reconstruction error across a flight may indicate instrument drift or calibration issues. The model learns what a variable should look like given all other measurements — deviations flag suspect data.
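The similarity-search application reduces to cosine nearest neighbors over embeddings. A minimal sketch (function and array names are illustrative):

```python
import numpy as np

def most_similar(query, bank, k=5):
    """Top-k cosine similarity between one embedding and a bank of embeddings."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q                                  # cosine similarity per row
    top = np.argsort(-sims)[:k]
    return top, sims[top]
```

For large banks, the same normalized dot product can be served by an approximate-nearest-neighbor index, but exact search is fine at this scale.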

Near-term extensions

  • Flight-level trajectory analysis: Feed sequences of window embeddings into a lightweight LSTM or temporal transformer to model how atmospheric state evolves over an entire flight. Detect flight legs, holding patterns, and deliberate science maneuvers from the embedding trajectory.
  • Probing classifiers: Train simple linear classifiers on embeddings to predict derived quantities (cloud type, air mass origin, pollution influence) without retraining the MAE. The model already achieves 94% altitude classification and 96% boundary layer detection from linear probes.
  • Transfer to new campaigns: When a new RAF project is flown, export its data through the same pipeline and extract embeddings with the frozen model — no retraining needed. The global normalization and missing token mechanism ensure new projects with different instrument suites produce comparable embeddings.
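A linear probe of the kind described above can be sketched in pure NumPy. The embeddings and labels here are synthetic placeholders (ridge regression stands in for whatever classifier the probes actually use):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 128))                   # frozen MAE embeddings (synthetic)
y = X[:, 0] + 0.1 * rng.normal(size=500) > 0      # e.g. a boundary-layer flag

# Ridge-regression linear probe: fit weights on ±1 targets, classify by sign.
w = np.linalg.solve(X.T @ X + 1e-2 * np.eye(128), X.T @ (2.0 * y - 1.0))
acc = ((X @ w > 0) == y).mean()
```

Because the MAE stays frozen, probes like this are cheap to train and directly measure what information the embeddings already contain.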

Longer-term possibilities

  • Multi-scale representation: Stack a flight-level model on top of window embeddings to capture phenomena at different timescales — turbulence (seconds), cloud processes (minutes), synoptic weather (hours).
  • Conditional generation: Use the decoder to ask "what would variable X look like if the atmospheric state were Y?" by manipulating the encoder output. This enables virtual instrument simulation — predicting what an unmeasured variable would have shown.
  • Foundation model for atmospheric science: Scale to the full ~1300 variable set and all RAF campaigns to build a general-purpose representation of atmospheric state. Downstream tasks (parameterization development, model evaluation, field campaign planning) could fine-tune or probe these representations.

Projects Trained On

ACCLIP, CAESAR, CGWAVES, GOTHAAM, MAIRE24, MethaneAIR, SOCRATES, SPICULE, TI3GER, WECAN

(ACES held out for zero-shot evaluation)

Usage

# Export cache with global normalization (run once, or with --force to regenerate)
python export_all_cache.py --light --global-norm --cache-dir ./cache_light

# Train in Google Colab
# 1. Upload cache_light/ to Google Drive
# 2. Open Convergence_MAE.ipynb in Colab
# 3. Set MULTI_PROJECT = True, LITE_MODEL = True
# 4. Run hyperparameter search cell first, then training
