A self-supervised Masked Autoencoder (MAE) that learns physical state representations from NCAR Research Aviation Facility (RAF) aircraft timeseries data. The model produces per-window embeddings ("physical state fingerprints") useful for flight segment clustering, anomaly detection, and understanding how atmospheric conditions evolve over flights.
This project uses a transformer-based MAE rather than an LSTM or other sequence model. The key reasons:
- Variable missingness is the core challenge. RAF data has irregular variable availability across projects and within flights. LSTMs expect a fixed input vector at each timestep — missing variables must be imputed or zero-filled, losing the distinction between "measured zero" and "not measured." The MAE's token-per-variable architecture handles this natively: each (timestep, variable) pair is an independent token, and missing ones get a learned `missing_token` embedding.
- Self-supervised learning without labels. There are no labels for flight segments or anomalies. An LSTM needs a prediction target (next timestep? classification?). The MAE's reconstruction objective is label-free — it learns representations by reconstructing masked variables from visible ones, effectively learning "what physical state explains these co-occurring measurements?"
- Cross-variable relationships matter more than temporal dynamics. Within a 30s window at 1Hz, atmospheric state barely changes — temperature, pressure, and moisture are nearly constant. The interesting signal is across variables: "given this temperature and altitude, what should wind and moisture look like?" That's a reconstruction/association task, not a sequence prediction task.
- Symmetric attention produces better embeddings. The MAE's attention mechanism pools information across all timesteps and variables symmetrically. An LSTM's hidden state would be biased toward the end of the window.
Where an LSTM would be better: forecasting future atmospheric state, modeling full-flight regime transitions over minutes/hours, or if variable availability were uniform. A natural extension is using MAE embeddings as input features to a flight-level LSTM for longer-range temporal modeling.
```
Input Window [B, 30, N]          Context [B, 30, 8]
        │                                │
var_embedding (Linear 1→d)      context_proj (Linear 8→d)
        │                                │
+ temporal_pos [30, d]                   │
+ variable_pos [N, d]                    │
        │                                │
        └───────────── + ────────────────┘
                       │
        Replace missing → missing_token
        Replace masked  → mask_token
                       │
           Flatten to [B, 30×N, d]
                       │
               ┌───────┴───────┐
               │    Encoder    │  (4-6 layers)
               │  Transformer  │
               │  EncoderLayer │
               └───────┬───────┘
                       │
               ┌───────┴───────┐
               │    Decoder    │  (2-4 layers)
               │  Transformer  │
               │  DecoderLayer │
               └───────┬───────┘
                       │
           Reshape to [B, 30, N, d]
                       │
            output_proj (Linear d→1)
                       │
           Predictions [B, 30, N]
```
Each input window is a [window_size, n_vars] matrix of normalized atmospheric measurements at 1Hz. The model treats every (timestep, variable) pair as a separate token, giving 30 × N tokens per window (e.g., 30 × 19 = 570 tokens in lite mode).
Each token embedding is the sum of:
- Value embedding: `Linear(1 → d_model)` applied to the scalar measurement
- Temporal position: learned `Embedding(30, d_model)` — encodes position within the window
- Variable position: learned `Embedding(n_vars, d_model)` — encodes which variable
- Spatiotemporal context: `Linear(8 → d_model)` — encodes real-world location and time (shared across all variables at each timestep, always visible)
- Project embedding (optional): `Embedding(n_projects, d_model)` — for domain adversarial training
Unavailable variables receive a learned missing_token embedding instead of being hidden via attention padding masks. This prevents variable availability patterns from acting as a project fingerprint (which was causing >90% project classification leakage). The transformer sees all W×N tokens uniformly — missing positions participate in attention but carry a learned "I'm missing" signal rather than a hard binary mask.
Separately, positions selected for reconstruction masking receive a mask_token embedding.
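A minimal sketch of how these tokens could be assembled. Module and argument names here are illustrative, not the project's actual API; shapes follow the architecture diagram above:

```python
import torch
import torch.nn as nn

class TokenEmbedder(nn.Module):
    """Sketch: build per-(timestep, variable) tokens with missing/mask handling.

    data [B, W, N] -> tokens [B, W*N, d_model]. Names are illustrative.
    """

    def __init__(self, n_vars: int, window: int = 30, d_model: int = 128, n_context: int = 8):
        super().__init__()
        self.value_emb = nn.Linear(1, d_model)            # scalar value -> d_model
        self.temporal_pos = nn.Embedding(window, d_model)
        self.variable_pos = nn.Embedding(n_vars, d_model)
        self.context_proj = nn.Linear(n_context, d_model)
        self.missing_token = nn.Parameter(torch.zeros(d_model))  # learned "not measured"
        self.mask_token = nn.Parameter(torch.zeros(d_model))     # learned "hidden for reconstruction"

    def forward(self, data, availability, recon_mask, context):
        B, W, N = data.shape
        tok = self.value_emb(data.unsqueeze(-1))                 # [B, W, N, d]
        # Missing positions carry a learned signal instead of their (imputed) value.
        tok = torch.where(availability.unsqueeze(-1), tok,
                          self.missing_token.expand_as(tok))
        # Positions selected for reconstruction are hidden behind mask_token.
        tok = torch.where(recon_mask.unsqueeze(-1),
                          self.mask_token.expand_as(tok), tok)
        t_idx = torch.arange(W, device=data.device)
        v_idx = torch.arange(N, device=data.device)
        tok = tok + self.temporal_pos(t_idx)[None, :, None, :]   # position in window
        tok = tok + self.variable_pos(v_idx)[None, None, :, :]   # which variable
        tok = tok + self.context_proj(context).unsqueeze(2)      # context: never masked
        return tok.reshape(B, W * N, -1)                         # flatten for the transformer
```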
Context features are computed per-timestep and never masked — they condition every token so the model knows where and when each measurement was taken:
| Channel | Encoding | Purpose |
|---|---|---|
| `lat_norm` | `lat / 90` | Latitude in [-1, 1] |
| `lon_sin` | `sin(lon × π/180)` | Longitude (wraparound-safe) |
| `lon_cos` | `cos(lon × π/180)` | Longitude (wraparound-safe) |
| `alt_norm` | `alt / 15000` | Altitude normalized by ~aircraft ceiling |
| `tod_sin` | `sin(2π × hour/24)` | Time-of-day cycle |
| `tod_cos` | `cos(2π × hour/24)` | Time-of-day cycle |
| `doy_sin` | `sin(2π × day/365)` | Seasonal cycle |
| `doy_cos` | `cos(2π × day/365)` | Seasonal cycle |
The model uses group-level masking that respects instrument relationships:
- Variable groups: `atmospheric_state` (backbone), `wind`, `chemistry`, `aerosol`, `cloud_and_precip`, `navigation`
- Backbone group (`atmospheric_state`): never fully masked — always provides context
- Other groups: masked with probability `MASK_GROUP_PROB` (entire group), or individual variables masked with `MASK_VAR_PROB`
Only positions where data was actually measured (or imputed) contribute to the reconstruction loss.
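The group-level masking strategy can be sketched as follows. The group definitions, variable indices, and probability values below are placeholders, not the project's actual configuration:

```python
import numpy as np

# Illustrative group -> column-index mapping and masking probabilities.
VAR_GROUPS = {
    "atmospheric_state": [0, 1, 2],   # backbone: never fully masked
    "wind": [3, 4, 5],
    "chemistry": [6, 7],
}
MASK_GROUP_PROB = 0.3
MASK_VAR_PROB = 0.15

def sample_mask(window: int, n_vars: int, rng: np.random.Generator) -> np.ndarray:
    """Return a boolean [window, n_vars] mask; True = hidden for reconstruction."""
    mask = np.zeros((window, n_vars), dtype=bool)
    for group, var_idx in VAR_GROUPS.items():
        if group == "atmospheric_state":
            continue                          # backbone always stays visible
        if rng.random() < MASK_GROUP_PROB:
            mask[:, var_idx] = True           # mask the entire group
        else:
            for v in var_idx:                 # otherwise mask variables independently
                if rng.random() < MASK_VAR_PROB:
                    mask[:, v] = True
    return mask
```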
| Parameter | Full | Lite |
|---|---|---|
| `d_model` | 256 | 128 |
| `n_encoder_layers` | 6 | 4 |
| `n_decoder_layers` | 4 | 2 |
| `n_heads` | 8 | 4 |
| `feedforward_dim` | 1024 | 512 |
| `dropout` | 0.1 | 0.1 |
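The two presets could be captured in a small config object; this is a sketch, and the class and field names are assumptions rather than the project's actual config:

```python
from dataclasses import dataclass

@dataclass
class MAEConfig:
    """Full/Lite presets from the table above (names are illustrative)."""
    d_model: int = 256
    n_encoder_layers: int = 6
    n_decoder_layers: int = 4
    n_heads: int = 8
    feedforward_dim: int = 1024
    dropout: float = 0.1

FULL = MAEConfig()
LITE = MAEConfig(d_model=128, n_encoder_layers=4, n_decoder_layers=2,
                 n_heads=4, feedforward_dim=512)
```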
PostgreSQL database containing 1Hz timeseries from RAF flight campaigns. Each project (e.g., CGWAVES, GOTHAAM, WECAN) contains multiple flights with ~1300+ instrument variables.
Converts raw database records into training-ready NPZ files:
- Variable canonicalization: Strips instrument suffixes (e.g., `CO_PIC` → `CO`) and merges equivalent variables across projects into a unified vocabulary
- Light mode (`--light`): Curates ~19 key variables across atmospheric state, wind, chemistry, aerosol, and cloud groups
- Ground filtering: Excludes ground data using weight-on-wheels (WOW=0 → airborne) with a TAS > 50 m/s fallback
- Sliding windows: 30s windows at 5s stride, requiring ≥30% data availability
- Normalization: Global or per-project z-score using percentile-clipped (1st–99th) statistics; clipped to [-10, 10]σ to prevent extreme outliers (e.g., cloud liquid water spikes) from destabilizing training. Global normalization (`--global-norm`) is recommended for cross-project clustering — it ensures normalized values are physically comparable across campaigns.
- Imputation: Partial gaps filled with the window mean; fully absent variables set to 0 with `availability=False`
- Context array: 8-channel spatiotemporal features computed from GPS coordinates and timestamps
Output per project:
```
cache_light/projects/{PROJECT}/
├── windows.npz         # data, availability_mask, imputed_mask, context,
│                       # flight_ids, flight_numbers, project_ids, t_starts
├── norm_stats.json     # per-variable mean/std
└── project_info.json   # metadata
```
| Group | Variables |
|---|---|
| Atmospheric State | ATX, THETA, THETAV, PSXC, DPXC, MR, RHUM |
| Wind | UI, VI, WI, WS, WD |
| Chemistry | CO, O3 |
| Cloud/Precip | PLWCC |
| Navigation | GGALT, GGLAT, GGLON, TASX |
| Aerosol | CONCU (when available) |
Masked MSE computed only at positions that were both masked and had available data:
```python
loss = MSE(predictions[mask & available], targets[mask & available])
```
Three imputation-aware modes:
- `standard`: Treats imputed values as real measurements
- `strict`: Loss only on originally-measured positions (excludes imputed)
- `weighted`: Full weight on measured, reduced weight (`IMPUTATION_WEIGHT`) on imputed
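The three modes can be sketched in one function; the signature and the default imputation weight are assumptions, not the project's actual API:

```python
import torch

def masked_mse(pred, target, mask, available, imputed,
               mode="standard", imputation_weight=0.2):
    """Sketch of the imputation-aware masked MSE (imputation_weight is illustrative)."""
    valid = mask & available                  # only masked positions with data
    if mode == "strict":
        valid = valid & ~imputed              # drop imputed positions entirely
    err = (pred - target) ** 2
    if mode == "weighted":
        w = torch.where(imputed, torch.full_like(err, imputation_weight),
                        torch.ones_like(err))
        err = err * w
    return err[valid].mean() if valid.any() else err.sum() * 0.0
```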
- Optimizer: AdamW (lr=1e-4, weight_decay=0.01)
- Scheduler: OneCycleLR (5% linear warmup + cosine decay), stepped per batch
- Gradient clipping: max norm 1.0
- Early stopping: patience=10 epochs, min_delta=1e-4
- Train/val split: Per-flight (80/20) to prevent temporal leakage
- Project balancing (optional): Weighted sampling to upsample smaller projects
- Hyperparameter search: Built-in random search cell (12 configs × 5 epochs)
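The optimizer and scheduler setup above translates to a short training loop. This is a sketch: `model`, `train_loader`, and `compute_loss` are placeholders, and PyTorch's `OneCycleLR` applies its annealing strategy to the warmup phase as well:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

def train(model, train_loader, compute_loss, epochs=50, lr=1e-4):
    opt = AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    sched = OneCycleLR(opt, max_lr=lr, pct_start=0.05,          # 5% warmup
                       total_steps=epochs * len(train_loader))  # then anneal to ~0
    for epoch in range(epochs):
        for batch in train_loader:
            loss = compute_loss(model, batch)
            opt.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # max norm 1.0
            opt.step()
            sched.step()                                        # stepped per batch
```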
After training, the encoder produces embeddings via mean-pooling over available positions:
```python
emb = model.get_embedding(data, availability_mask, context=ctx)  # [B, d_model]
```

Each embedding is a `d_model`-dimensional "physical state fingerprint" for a 30s flight window, conditioned on its spatial location, altitude, and time.
- Flight segment clustering: Group 30s windows by learned physical similarity (e.g., boundary layer, free troposphere, cloud penetration, pollution plume). Track how cluster labels evolve over a flight to see regime transitions.
- Anomaly detection: Windows with high reconstruction error represent atmospheric conditions the model hasn't learned to explain — unusual aerosol events, instrument malfunctions, rare meteorological phenomena, or air mass boundaries.
- Cross-project similarity search: Given a window from one campaign, find the most similar conditions observed in other campaigns via embedding nearest neighbors. ("When did WECAN see conditions most like this GOTHAAM cloud event?")
- Instrument QC: Variables with consistently high per-variable reconstruction error across a flight may indicate instrument drift or calibration issues. The model learns what a variable should look like given all other measurements — deviations flag suspect data.
- Flight-level trajectory analysis: Feed sequences of window embeddings into a lightweight LSTM or temporal transformer to model how atmospheric state evolves over an entire flight. Detect flight legs, holding patterns, and deliberate science maneuvers from the embedding trajectory.
- Probing classifiers: Train simple linear classifiers on embeddings to predict derived quantities (cloud type, air mass origin, pollution influence) without retraining the MAE. The model already achieves 94% altitude classification and 96% boundary layer detection from linear probes.
- Transfer to new campaigns: When a new RAF project is flown, export its data through the same pipeline and extract embeddings with the frozen model — no retraining needed. The global normalization and missing token mechanism ensure new projects with different instrument suites produce comparable embeddings.
- Multi-scale representation: Stack a flight-level model on top of window embeddings to capture phenomena at different timescales — turbulence (seconds), cloud processes (minutes), synoptic weather (hours).
- Conditional generation: Use the decoder to ask "what would variable X look like if the atmospheric state were Y?" by manipulating the encoder output. This enables virtual instrument simulation — predicting what an unmeasured variable would have shown.
- Foundation model for atmospheric science: Scale to the full ~1300 variable set and all RAF campaigns to build a general-purpose representation of atmospheric state. Downstream tasks (parameterization development, model evaluation, field campaign planning) could fine-tune or probe these representations.
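As a concrete starting point for the clustering and similarity-search use cases above, window embeddings can be fed straight into scikit-learn. The `embs` array below is a random placeholder standing in for real encoder output, and the cluster count is arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

# Placeholder for per-window embeddings extracted by the frozen encoder.
rng = np.random.default_rng(0)
embs = rng.normal(size=(500, 128)).astype(np.float32)   # [n_windows, d_model]

# Flight segment clustering: label each 30s window by learned physical similarity.
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(embs)

# Similarity search: the 5 windows most like window 0 (works across campaigns
# when embeddings from multiple projects are stacked into one array).
nn = NearestNeighbors(n_neighbors=5).fit(embs)
dist, idx = nn.kneighbors(embs[:1])
```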
ACCLIP, CAESAR, CGWAVES, GOTHAAM, MAIRE24, MethaneAIR, SOCRATES, SPICULE, TI3GER, WECAN
(ACES held out for zero-shot evaluation)
```bash
# Export cache with global normalization (run once, or with --force to regenerate)
python export_all_cache.py --light --global-norm --cache-dir ./cache_light

# Train in Google Colab:
# 1. Upload cache_light/ to Google Drive
# 2. Open Convergence_MAE.ipynb in Colab
# 3. Set MULTI_PROJECT = True, LITE_MODEL = True
# 4. Run hyperparameter search cell first, then training
```