A self-supervised Masked Autoencoder (MAE) that learns physical state representations from NCAR Research Aviation Facility (RAF) aircraft timeseries data. The model produces per-window embeddings ("physical state fingerprints") useful for flight segment clustering, anomaly detection, and understanding how atmospheric conditions evolve over flights.
This project uses a transformer-based MAE rather than an LSTM or other sequence model. The key reasons:
- Variable missingness is the core challenge. RAF data has irregular variable availability across projects and within flights. LSTMs expect a fixed input vector at each timestep — missing variables must be imputed or zero-filled, losing the distinction between "measured zero" and "not measured." The MAE's token-per-variable architecture handles this natively: each (timestep, variable) pair is an independent token, and missing ones get a learned `missing_token` embedding.
- Self-supervised learning without labels. There are no labels for flight segments or anomalies. An LSTM needs a prediction target (next timestep? classification?). The MAE's reconstruction objective is label-free — it learns representations by reconstructing masked variables from visible ones, effectively learning "what physical state explains these co-occurring measurements?"
- Cross-variable relationships matter more than temporal dynamics. Within a 30s window at 1Hz, atmospheric state barely changes — temperature, pressure, and moisture are nearly constant. The interesting signal is across variables: "given this temperature and altitude, what should wind and moisture look like?" That's a reconstruction/association task, not a sequence prediction task.
- Symmetric attention produces better embeddings. The MAE's attention mechanism pools information across all timesteps and variables symmetrically. An LSTM's hidden state would be biased toward the end of the window.
Where an LSTM would be better: forecasting future atmospheric state, modeling full-flight regime transitions over minutes/hours, or if variable availability were uniform. A natural extension is using MAE embeddings as input features to a flight-level LSTM for longer-range temporal modeling.
```
Input Window [B, 30, N]          Context [B, 30, 8]
        │                                │
var_embedding (Linear 1→d)      context_proj (Linear 8→d)
        │                                │
+ temporal_pos [30, d]                   │
+ variable_pos [N, d]                    │
        │                                │
        └───────────── + ────────────────┘
                       │
        Replace missing → missing_token
        Replace masked  → mask_token
                       │
           Flatten to [B, 30×N, d]
                       │
               ┌───────┴───────┐
               │    Encoder    │  (4-6 layers)
               │  Transformer  │
               │  EncoderLayer │
               └───────┬───────┘
                       │
               ┌───────┴───────┐
               │    Decoder    │  (2-4 layers)
               │  Transformer  │
               │  DecoderLayer │
               └───────┬───────┘
                       │
           Reshape to [B, 30, N, d]
                       │
            output_proj (Linear d→1)
                       │
           Predictions [B, 30, N]
```
Each input window is a [window_size, n_vars] matrix of normalized atmospheric measurements at 1Hz. The model treats every (timestep, variable) pair as a separate token, giving 30 × N tokens per window (e.g., 30 × 19 = 570 tokens in lite mode).
Each token embedding is the sum of:
- Value embedding: `Linear(1 → d_model)` applied to the scalar measurement
- Temporal position: learned `Embedding(30, d_model)` — encodes position within the window
- Variable position: learned `Embedding(n_vars, d_model)` — encodes which variable
- Spatiotemporal context: `Linear(8 → d_model)` — encodes real-world location and time (shared across all variables at each timestep, always visible)
- Project embedding (optional): `Embedding(n_projects, d_model)` — for domain adversarial training
Unavailable variables receive a learned missing_token embedding instead of being hidden via attention padding masks. This prevents variable availability patterns from acting as a project fingerprint (which was causing >90% project classification leakage). The transformer sees all W×N tokens uniformly — missing positions participate in attention but carry a learned "I'm missing" signal rather than a hard binary mask.
Separately, positions selected for reconstruction masking receive a mask_token embedding.
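A minimal sketch of how these tokens could be assembled. Module and argument names here are illustrative, not the project's actual API; shapes follow the architecture diagram above:

```python
import torch
import torch.nn as nn

class TokenEmbedder(nn.Module):
    """Sketch: build per-(timestep, variable) tokens with missing/mask handling.

    data [B, W, N] -> tokens [B, W*N, d_model]. Names are illustrative.
    """

    def __init__(self, n_vars: int, window: int = 30, d_model: int = 128, n_context: int = 8):
        super().__init__()
        self.value_emb = nn.Linear(1, d_model)            # scalar value -> d_model
        self.temporal_pos = nn.Embedding(window, d_model)
        self.variable_pos = nn.Embedding(n_vars, d_model)
        self.context_proj = nn.Linear(n_context, d_model)
        self.missing_token = nn.Parameter(torch.zeros(d_model))  # learned "not measured"
        self.mask_token = nn.Parameter(torch.zeros(d_model))     # learned "hidden for reconstruction"

    def forward(self, data, availability, recon_mask, context):
        B, W, N = data.shape
        tok = self.value_emb(data.unsqueeze(-1))                 # [B, W, N, d]
        # Missing positions carry a learned signal instead of their (imputed) value.
        tok = torch.where(availability.unsqueeze(-1), tok,
                          self.missing_token.expand_as(tok))
        # Positions selected for reconstruction are hidden behind mask_token.
        tok = torch.where(recon_mask.unsqueeze(-1),
                          self.mask_token.expand_as(tok), tok)
        t_idx = torch.arange(W, device=data.device)
        v_idx = torch.arange(N, device=data.device)
        tok = tok + self.temporal_pos(t_idx)[None, :, None, :]   # position in window
        tok = tok + self.variable_pos(v_idx)[None, None, :, :]   # which variable
        tok = tok + self.context_proj(context).unsqueeze(2)      # context: never masked
        return tok.reshape(B, W * N, -1)                         # flatten for the transformer
```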
Context features are computed per-timestep and never masked — they condition every token so the model knows where and when each measurement was taken:
| Channel | Encoding | Purpose |
|---|---|---|
| `lat_norm` | `lat / 90` | Latitude in [-1, 1] |
| `lon_sin` | `sin(lon × π/180)` | Longitude (wraparound-safe) |
| `lon_cos` | `cos(lon × π/180)` | Longitude (wraparound-safe) |
| `alt_norm` | `alt / 15000` | Altitude normalized by ~aircraft ceiling |
| `tod_sin` | `sin(2π × hour/24)` | Time-of-day cycle |
| `tod_cos` | `cos(2π × hour/24)` | Time-of-day cycle |
| `doy_sin` | `sin(2π × day/365)` | Seasonal cycle |
| `doy_cos` | `cos(2π × day/365)` | Seasonal cycle |
The model uses group-level masking that respects instrument relationships:
- Variable groups: `atmospheric_state` (backbone), `wind`, `chemistry`, `aerosol`, `cloud_and_precip`, `navigation`
- Backbone group (`atmospheric_state`): never fully masked — always provides context
- Other groups: masked with probability `MASK_GROUP_PROB` (entire group), or individual variables masked with `MASK_VAR_PROB`
Only positions where data was actually measured (or imputed) contribute to the reconstruction loss.
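The group-level masking strategy can be sketched as follows. The group definitions, variable indices, and probability values below are placeholders, not the project's actual configuration:

```python
import numpy as np

# Illustrative group -> column-index mapping and masking probabilities.
VAR_GROUPS = {
    "atmospheric_state": [0, 1, 2],   # backbone: never fully masked
    "wind": [3, 4, 5],
    "chemistry": [6, 7],
}
MASK_GROUP_PROB = 0.3
MASK_VAR_PROB = 0.15

def sample_mask(window: int, n_vars: int, rng: np.random.Generator) -> np.ndarray:
    """Return a boolean [window, n_vars] mask; True = hidden for reconstruction."""
    mask = np.zeros((window, n_vars), dtype=bool)
    for group, var_idx in VAR_GROUPS.items():
        if group == "atmospheric_state":
            continue                          # backbone always stays visible
        if rng.random() < MASK_GROUP_PROB:
            mask[:, var_idx] = True           # mask the entire group
        else:
            for v in var_idx:                 # otherwise mask variables independently
                if rng.random() < MASK_VAR_PROB:
                    mask[:, v] = True
    return mask
```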
| Parameter | Full | Lite |
|---|---|---|
| `d_model` | 256 | 128 |
| `n_encoder_layers` | 6 | 4 |
| `n_decoder_layers` | 4 | 2 |
| `n_heads` | 8 | 4 |
| `feedforward_dim` | 1024 | 512 |
| `dropout` | 0.1 | 0.1 |
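The two presets could be captured in a small config object; this is a sketch, and the class and field names are assumptions rather than the project's actual config:

```python
from dataclasses import dataclass

@dataclass
class MAEConfig:
    """Full/Lite presets from the table above (names are illustrative)."""
    d_model: int = 256
    n_encoder_layers: int = 6
    n_decoder_layers: int = 4
    n_heads: int = 8
    feedforward_dim: int = 1024
    dropout: float = 0.1

FULL = MAEConfig()
LITE = MAEConfig(d_model=128, n_encoder_layers=4, n_decoder_layers=2,
                 n_heads=4, feedforward_dim=512)
```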
PostgreSQL database containing 1Hz timeseries from RAF flight campaigns. Each project (e.g., CGWAVES, GOTHAAM, WECAN) contains multiple flights with ~1300+ instrument variables.
Converts raw database records into training-ready NPZ files:
- Variable canonicalization: Strips instrument suffixes (e.g., `CO_PIC` → `CO`) and merges equivalent variables across projects into a unified vocabulary
- Light mode (`--light`): Curates ~19 key variables across atmospheric state, wind, chemistry, aerosol, and cloud groups
- Ground filtering: Excludes ground data using weight-on-wheels (WOW=0 → airborne) with a TAS > 50 m/s fallback
- Sliding windows: 30s windows at 5s stride, requiring ≥30% data availability
- Normalization: Global or per-project z-score using percentile-clipped (1st–99th) statistics; clipped to [-10, 10]σ to prevent extreme outliers (e.g., cloud liquid water spikes) from destabilizing training. Global normalization (`--global-norm`) is recommended for cross-project clustering — it ensures normalized values are physically comparable across campaigns.
- Imputation: Partial gaps filled with the window mean; fully absent variables set to 0 with `availability=False`
- Context array: 8-channel spatiotemporal features computed from GPS coordinates and timestamps
Output per project:
```
cache_light/projects/{PROJECT}/
├── windows.npz         # data, availability_mask, imputed_mask, context,
│                       # flight_ids, flight_numbers, project_ids, t_starts
├── norm_stats.json     # per-variable mean/std
└── project_info.json   # metadata
```
| Group | Variables |
|---|---|
| Atmospheric State | ATX, THETA, THETAV, PSXC, DPXC, MR, RHUM |
| Wind | UI, VI, WI, WS, WD |
| Chemistry | CO, O3 |
| Cloud/Precip | PLWCC |
| Navigation | GGALT, GGLAT, GGLON, TASX |
| Aerosol | CONCU (when available) |
Masked MSE computed only at positions that were both masked and had available data:
```python
loss = MSE(predictions[mask & available], targets[mask & available])
```
Three imputation-aware modes:
- `standard`: Treats imputed values as real measurements
- `strict`: Loss only on originally-measured positions (excludes imputed)
- `weighted`: Full weight on measured, reduced weight (`IMPUTATION_WEIGHT`) on imputed
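The three modes can be sketched in one function; the signature and the default imputation weight are assumptions, not the project's actual API:

```python
import torch

def masked_mse(pred, target, mask, available, imputed,
               mode="standard", imputation_weight=0.2):
    """Sketch of the imputation-aware masked MSE (imputation_weight is illustrative)."""
    valid = mask & available                  # only masked positions with data
    if mode == "strict":
        valid = valid & ~imputed              # drop imputed positions entirely
    err = (pred - target) ** 2
    if mode == "weighted":
        w = torch.where(imputed, torch.full_like(err, imputation_weight),
                        torch.ones_like(err))
        err = err * w
    return err[valid].mean() if valid.any() else err.sum() * 0.0
```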
- Optimizer: AdamW (lr=1e-4, weight_decay=0.01)
- Scheduler: OneCycleLR (5% linear warmup + cosine decay), stepped per batch
- Gradient clipping: max norm 1.0
- Early stopping: patience=10 epochs, min_delta=1e-4
- Train/val split: Per-flight (80/20) to prevent temporal leakage
- Project balancing (optional): Weighted sampling to upsample smaller projects
- Hyperparameter search: Built-in random search cell (12 configs × 5 epochs)
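The optimizer and scheduler setup above translates to a short training loop. This is a sketch: `model`, `train_loader`, and `compute_loss` are placeholders, and PyTorch's `OneCycleLR` applies its annealing strategy to the warmup phase as well:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

def train(model, train_loader, compute_loss, epochs=50, lr=1e-4):
    opt = AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    sched = OneCycleLR(opt, max_lr=lr, pct_start=0.05,          # 5% warmup
                       total_steps=epochs * len(train_loader))  # then anneal to ~0
    for epoch in range(epochs):
        for batch in train_loader:
            loss = compute_loss(model, batch)
            opt.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # max norm 1.0
            opt.step()
            sched.step()                                        # stepped per batch
```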
After training, the encoder produces embeddings via mean-pooling over available positions:
```python
emb = model.get_embedding(data, availability_mask, context=ctx)  # [B, d_model]
```

Each embedding is a `d_model`-dimensional "physical state fingerprint" for a 30s flight window, conditioned on its spatial location, altitude, and time.
- Flight segment clustering: Group 30s windows by learned physical similarity (e.g., boundary layer, free troposphere, cloud penetration, pollution plume). Track how cluster labels evolve over a flight to see regime transitions.
- Anomaly detection: Windows with high reconstruction error represent atmospheric conditions the model hasn't learned to explain — unusual aerosol events, instrument malfunctions, rare meteorological phenomena, or air mass boundaries.
- Cross-project similarity search: Given a window from one campaign, find the most similar conditions observed in other campaigns via embedding nearest neighbors. ("When did WECAN see conditions most like this GOTHAAM cloud event?")
- Instrument QC: Variables with consistently high per-variable reconstruction error across a flight may indicate instrument drift or calibration issues. The model learns what a variable should look like given all other measurements — deviations flag suspect data.
- Flight-level trajectory analysis: Feed sequences of window embeddings into a lightweight LSTM or temporal transformer to model how atmospheric state evolves over an entire flight. Detect flight legs, holding patterns, and deliberate science maneuvers from the embedding trajectory.
- Probing classifiers: Train simple linear classifiers on embeddings to predict derived quantities (cloud type, air mass origin, pollution influence) without retraining the MAE. The model already achieves 94% altitude classification and 96% boundary layer detection from linear probes.
- Transfer to new campaigns: When a new RAF project is flown, export its data through the same pipeline and extract embeddings with the frozen model — no retraining needed. The global normalization and missing token mechanism ensure new projects with different instrument suites produce comparable embeddings.
- Multi-scale representation: Stack a flight-level model on top of window embeddings to capture phenomena at different timescales — turbulence (seconds), cloud processes (minutes), synoptic weather (hours).
- Conditional generation: Use the decoder to ask "what would variable X look like if the atmospheric state were Y?" by manipulating the encoder output. This enables virtual instrument simulation — predicting what an unmeasured variable would have shown.
- Foundation model for atmospheric science: Scale to the full ~1300 variable set and all RAF campaigns to build a general-purpose representation of atmospheric state. Downstream tasks (parameterization development, model evaluation, field campaign planning) could fine-tune or probe these representations.
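As a concrete starting point for the clustering and similarity-search use cases above, window embeddings can be fed straight into scikit-learn. The `embs` array below is a random placeholder standing in for real encoder output, and the cluster count is arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

# Placeholder for per-window embeddings extracted by the frozen encoder.
rng = np.random.default_rng(0)
embs = rng.normal(size=(500, 128)).astype(np.float32)   # [n_windows, d_model]

# Flight segment clustering: label each 30s window by learned physical similarity.
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(embs)

# Similarity search: the 5 windows most like window 0 (works across campaigns
# when embeddings from multiple projects are stacked into one array).
nn = NearestNeighbors(n_neighbors=5).fit(embs)
dist, idx = nn.kneighbors(embs[:1])
```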
ACCLIP, CAESAR, CGWAVES, GOTHAAM, MAIRE24, MethaneAIR, SOCRATES, SPICULE, TI3GER, WECAN
(ACES held out for zero-shot evaluation)
```bash
# Export cache with global normalization (run once, or with --force to regenerate)
python export_all_cache.py --light --global-norm --cache-dir ./cache_light

# Train in Google Colab:
# 1. Upload cache_light/ to Google Drive
# 2. Open Convergence_MAE.ipynb in Colab
# 3. Set MULTI_PROJECT = True, LITE_MODEL = True
# 4. Run hyperparameter search cell first, then training
```