
PAST-TIDE: Prototype-Anchored Statement Tuning with Topic-Invariant Normalization for Stance Detection

Paper · License: MIT · Python 3.8+ · PyTorch 2.0+

The Blackwell Collective at StanceNakba 2026
Md. Shakhoyat Rahman Shujon, MD Jahid Hasan Jim, Md. Milon Islam

Official implementation of our system paper at the StanceNakba 2026 Shared Task (NakbaNLP Workshop @ LREC-COLING 2026).



Overview

PAST-TIDE is a unified stance detection system that reformulates classification as cloze-style masked language modeling via statement tuning, achieving 0.79 macro-F1 on both subtasks with a single architecture.

Instead of adding a randomly-initialized classification head on top of [CLS], we convert each input into a natural-language statement with a [MASK] slot and let the pre-trained MLM head predict stance-indicative words. This closes the pre-training/fine-tuning gap and yields strong performance with only ~1,000 labelled examples — adding zero new classifier parameters.

On top of this, we introduce Prototypical Contrastive Learning (PCL) for batch-size-agnostic representation shaping, and Topic-Conditional Layer Normalization (T-CLN) for cross-topic distribution alignment — both novel components designed specifically for this low-resource, multilingual stance detection challenge.

Innovation Summary

| Component | What it does | Why it matters | Source |
| --- | --- | --- | --- |
| Statement Tuning | Reformulates stance as [MASK] prediction with verbalizer mapping | Reuses the pre-trained MLM head — zero new classifier parameters, closes the pre-training/fine-tuning gap | src/model.py |
| Prototypical Contrastive Learning (PCL) | Contrasts samples against K=3 learnable class prototypes instead of in-batch negatives | Stable contrastive gradients at any batch size; only 2,304 new parameters | src/modules/pcl.py |
| Topic-Conditional LayerNorm (T-CLN) | Dynamically generates normalization γ/β from topic embeddings | Produces topic-invariant representations for cross-topic transfer (Subtask B) | src/modules/t_cln.py |
| R-Drop | Symmetric KL-divergence between two dropout-masked forward passes | Smooths decision boundary; +1.5 F1 in low-resource setting | src/modules/losses.py |
| Two-Stage LLRD Training | Freeze backbone → unfreeze with layer-wise LR decay | Prevents catastrophic forgetting of pre-trained knowledge | src/train.py |

Key Results

| System | Subtask A (EN) | Subtask B (AR) |
| --- | --- | --- |
| PAST-TIDE | 0.79 | 0.79 |

Parameter efficiency: Only 2,304 new parameters (3 prototypes × 768 dimensions, from PCL) are added for Subtask A. The MLM head and verbalizer contribute zero new parameters — they are fully reused from pre-training.


Novel Architectural Choices — Design Deep-Dive

This section provides detailed technical explanations of every novel component, including mathematical formulations, design rationale, and pointers to the exact source files.

1. Statement Tuning (Cloze-Style MLM Classification)

Source: src/model.py (class PASTIDEv23, method forward and set_verbalizer)
Config: src/config.py (fields statement_template_a, statement_template_b, verbalizer)

The Problem: Standard fine-tuning adds a randomly-initialized linear head on [CLS] embeddings, discarding the pre-trained MLM head entirely. With only ~1,000 training samples this creates two issues: (1) the new head starts from scratch, wasting pre-trained knowledge; (2) there is an objective mismatch between pre-training (MLM) and fine-tuning (classification).

Our Solution — Statement Tuning: We convert stance detection into a cloze task by appending a natural-language template to each input:

Input:  [CLS] {text} [SEP] Regarding {target}, this author's stance is [MASK]. [SEP]

The model predicts a word at the [MASK] position using its pre-trained MLM head (zero new parameters). A verbalizer maps predicted words to stance classes:

| Class | Verbalizer Words |
| --- | --- |
| Pro-Palestine / Favor (0) | support, favor, pro, yes |
| Pro-Israel / Against (1) | oppose, against, anti, no |
| Neutral (2) | neutral, unclear, none |

Mathematical Formulation: Given verbalizer token set $V_c$ for class $c$, the class logit is:

$$\text{logit}(c) = \frac{1}{|V_c|} \sum_{v \in V_c} \log P_{\text{MLM}}([\text{MASK}] = v \mid \mathbf{x})$$

where $P_{\text{MLM}}$ is the pre-trained MLM head output (softmax over vocabulary). Multi-word verbalizers are handled by taking the first subword token of each word (which carries the primary semantic content in SentencePiece).

Why This Design:

  • Zero new parameters — the MLM head's weight matrix is already tied to the input embeddings, so it has rich vocabulary-level knowledge from pre-training.
  • Closes the objective gap — both pre-training and fine-tuning use the same MLM objective, preserving learned representations.
  • Naturally handles multilingual input — mDeBERTa's MLM head was trained on 100+ languages; the verbalizer words exploit this cross-lingual knowledge.
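The log-probability averaging in the equation above can be sketched in a few lines of PyTorch. The token ids below are illustrative stand-ins over a toy vocabulary, not real mDeBERTa vocabulary ids:

```python
import torch

def class_logits_from_vocab(vocab_logits, verbalizer_ids):
    """vocab_logits: (batch, vocab_size) MLM logits at the [MASK] position.
    verbalizer_ids: one LongTensor per class, holding the first subword id
    of each verbalizer word."""
    log_probs = torch.log_softmax(vocab_logits, dim=-1)
    # Average log-probability over each class's verbalizer tokens (Eq. above).
    return torch.stack(
        [log_probs[:, ids].mean(dim=-1) for ids in verbalizer_ids], dim=-1
    )

vocab_logits = torch.randn(2, 100)                # toy vocabulary of 100 tokens
verbalizer_ids = [torch.tensor([3, 7, 12, 20]),   # e.g. support / favor / pro / yes
                  torch.tensor([5, 9, 15, 21]),   # e.g. oppose / against / anti / no
                  torch.tensor([2, 8, 11])]       # e.g. neutral / unclear / none
class_logits = class_logits_from_vocab(vocab_logits, verbalizer_ids)  # (2, 3)
```

The resulting class logits can be fed directly into cross-entropy (or focal) loss, with no new classifier weights involved.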

2. Prototypical Contrastive Learning (PCL)

Source: src/modules/pcl.py (class PrototypicalContrastiveHead)
Config: src/config.py (fields use_pcl, pcl_weight, pcl_temperature)

The Problem: Supervised Contrastive Loss (SupCon) requires large batch sizes to provide enough in-batch negatives for stable gradients. With our effective batch size of 32, there are only ~5.3 expected negatives per anchor — leading to noisy, high-variance gradient estimates that destabilize training.

Our Solution — Prototypical Contrastive Learning: Instead of contrasting against other samples in the batch, we maintain K=3 learnable class prototypes (one per stance class) and contrast each sample against these fixed anchors. The denominator always has exactly K=3 terms, regardless of batch size.

Mathematical Formulation:

$$\mathcal{L}_{\text{PCL}} = -\log \frac{\exp\bigl(\text{sim}(\mathbf{h}, \mathbf{p}_{y}) / \tau\bigr)}{\sum_{j=1}^{K} \exp\bigl(\text{sim}(\mathbf{h}, \mathbf{p}_{j}) / \tau\bigr)}$$

where:

  • $\mathbf{h} = \text{L2-normalize}(\text{hidden}_{[\text{MASK}]})$ — the normalized [MASK] hidden state
  • $\mathbf{p}_j = \text{L2-normalize}(\text{prototype}_j)$ — the normalized learnable prototype for class $j$
  • $\text{sim}(\cdot, \cdot)$ — cosine similarity
  • $\tau = 0.1$ — temperature scalar
  • $y$ — ground-truth class

This is mathematically equivalent to a temperature-scaled cosine-similarity cross-entropy, but with prototypes serving as virtual class anchors trained end-to-end alongside the encoder.

Implementation Detail: Prototypes are initialized with small random values (torch.randn * 0.02) and are nn.Parameters. During Stage 1 of training the backbone is frozen, so only the prototypes and T-CLN parameters receive gradient updates — this lets the prototypes find reasonable class centroids before the encoder starts shifting.

Parameter Cost: 3 prototypes × 768 dimensions = 2,304 parameters — the only new parameters in the entire Subtask A system.

Why This Over SupCon:

  • Batch-size invariant: Works at batch_size=1 because the contrastive denominator is determined by K (not batch size).
  • Stable gradients: No sampling noise from in-batch negatives.
  • Memory efficient: No need for large memory banks or momentum encoders.

Reference: Li et al. (2021) "Prototypical Contrastive Learning of Unsupervised Representations" (ICLR)
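A minimal sketch of the PCL head described above; the class name matches the text, but the internals are an assumption, not copied from src/modules/pcl.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypicalContrastiveHead(nn.Module):
    """Contrast [MASK] hidden states against K learnable class prototypes."""
    def __init__(self, hidden_dim=768, num_classes=3, tau=0.1):
        super().__init__()
        # K learnable prototypes with small random init, as described above.
        self.prototypes = nn.Parameter(torch.randn(num_classes, hidden_dim) * 0.02)
        self.tau = tau

    def forward(self, mask_hidden, labels):
        h = F.normalize(mask_hidden, dim=-1)       # L2-normalize [MASK] states
        p = F.normalize(self.prototypes, dim=-1)   # L2-normalize prototypes
        logits = h @ p.t() / self.tau              # cosine similarities / temperature
        return F.cross_entropy(logits, labels)     # exactly L_PCL from the equation

head = PrototypicalContrastiveHead()
loss = head(torch.randn(4, 768), torch.tensor([0, 1, 2, 0]))
```

Note that cross-entropy over temperature-scaled cosine similarities is term-by-term identical to the $\mathcal{L}_{\text{PCL}}$ formula, which is why no bespoke loss code is needed.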


3. Topic-Conditional Layer Normalization (T-CLN)

Source: src/modules/t_cln.py (class TopicConditionalLayerNorm)
Config: src/config.py (fields use_t_cln, t_cln_topic_dim, num_topics)

The Problem: Subtask B requires cross-topic stance detection on Arabic text, where two topics — "Normalization with Israel" (political vocabulary) and "Refugees in Jordan" (humanitarian vocabulary) — have dramatically different lexical distributions. A standard model conflates topic-specific features with stance-indicative features, hurting generalization.

Our Solution — Topic-Conditional Layer Normalization: We replace a static LayerNorm with a dynamic one whose affine parameters $\gamma$ and $\beta$ are generated from topic embeddings via MLPs. This "whitens" topic-specific style so the classifier sees a topic-invariant stance representation.

Mathematical Formulation:

Standard LayerNorm (static parameters): $$\text{LN}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sigma} + \beta$$

T-CLN (dynamic parameters from topic $t$): $$\mathbf{e}_t = \text{TopicEmbedding}(t) \in \mathbb{R}^{64}$$ $$\gamma_t = \text{MLP}_\gamma(\mathbf{e}_t) \in \mathbb{R}^{768}$$ $$\beta_t = \text{MLP}_\beta(\mathbf{e}_t) \in \mathbb{R}^{768}$$ $$\text{T-CLN}(\mathbf{x}, t) = \gamma_t \odot \frac{\mathbf{x} - \mu}{\sigma} + \beta_t$$

Each MLP is a two-layer network: Linear(64→768) → Tanh → Linear(768→768).

Initialization: The final γ MLP layer has its bias initialized to 1.0, and the β MLP bias to 0.0, so T-CLN starts as a near-identity transformation (equivalent to standard LayerNorm). This ensures the model doesn't diverge at the start of training.

Placement: Applied to encoder hidden states after the last transformer layer and before the MLM head / verbalizer. This is critical — normalizing after the encoder but before classification ensures the topic shift is removed at the representation level, not at the token level.

Why Conditional Normalization Over Alternatives:

  • Topic-adversarial training (gradient reversal) requires careful λ scheduling and can be unstable in low-resource settings.
  • Multi-task with topic classification adds parameters and doesn't directly align distributions.
  • T-CLN is lightweight — the topic embeddings + two MLPs add minimal parameters, and the identity initialization makes it safe to train from scratch.

Reference: Su et al. (2021) "Enhancing Content Preservation in Text Style Transfer via Learnable Normalization" (ACL)
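A compact sketch of T-CLN following the formulation and initialization described above; the module layout is an assumption, not copied from src/modules/t_cln.py:

```python
import torch
import torch.nn as nn

class TopicConditionalLayerNorm(nn.Module):
    """LayerNorm whose affine parameters are generated from a topic embedding."""
    def __init__(self, hidden_dim=768, topic_dim=64, num_topics=2):
        super().__init__()
        self.topic_emb = nn.Embedding(num_topics, topic_dim)
        # Two-layer MLPs: Linear(64→768) → Tanh → Linear(768→768).
        self.gamma_mlp = nn.Sequential(nn.Linear(topic_dim, hidden_dim),
                                       nn.Tanh(),
                                       nn.Linear(hidden_dim, hidden_dim))
        self.beta_mlp = nn.Sequential(nn.Linear(topic_dim, hidden_dim),
                                      nn.Tanh(),
                                      nn.Linear(hidden_dim, hidden_dim))
        # Identity init: γ ≈ 1, β ≈ 0 so T-CLN starts as standard LayerNorm.
        nn.init.zeros_(self.gamma_mlp[-1].weight)
        nn.init.ones_(self.gamma_mlp[-1].bias)
        nn.init.zeros_(self.beta_mlp[-1].weight)
        nn.init.zeros_(self.beta_mlp[-1].bias)

    def forward(self, x, topic_ids):
        e = self.topic_emb(topic_ids)             # (B, 64)
        gamma = self.gamma_mlp(e).unsqueeze(1)    # (B, 1, 768)
        beta = self.beta_mlp(e).unsqueeze(1)      # (B, 1, 768)
        mu = x.mean(-1, keepdim=True)
        sigma = x.std(-1, keepdim=True, unbiased=False)
        return gamma * (x - mu) / (sigma + 1e-6) + beta

tcln = TopicConditionalLayerNorm()
out = tcln(torch.randn(2, 5, 768), torch.tensor([0, 1]))  # (2, 5, 768)
```

Because the final linear weights are zeroed, the γ/β outputs at initialization equal their biases exactly, so the module is a plain LayerNorm on the first forward pass.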


4. R-Drop Regularization

Source: src/modules/losses.py (class RDropLoss)
Config: src/config.py (fields use_rdrop, rdrop_weight)

The Problem: With ~1,000 training samples and a 280M-parameter backbone, overfitting is severe. Standard dropout helps but leaves room for the model to produce inconsistent predictions for the same input under different dropout masks.

Our Solution: R-Drop forces the model to produce consistent predictions across two forward passes of the same input with different dropout masks, using symmetric KL-divergence:

$$\mathcal{L}_{\text{RDrop}} = \frac{1}{2} \bigl[ D_{\text{KL}}(P_1 \,\|\, P_2) + D_{\text{KL}}(P_2 \,\|\, P_1) \bigr]$$

where $P_1, P_2$ are the softmax outputs from two forward passes with independently sampled dropout masks.

Why It Works: This smooths the decision boundary by constraining the model's prediction surface to be locally consistent — any two random subnetworks (induced by dropout) must agree on the output. On our benchmark this yields approximately +1.5 macro-F1 over standard dropout alone.

Reference: Liang et al. (2021) "R-Drop: Regularized Dropout for Neural Networks" (NeurIPS)
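The symmetric KL above can be sketched as follows; the actual RDropLoss in src/modules/losses.py may differ in reduction details:

```python
import torch
import torch.nn.functional as F

def rdrop_loss(logits1, logits2):
    """0.5 * [KL(P1 || P2) + KL(P2 || P1)] over two dropout-masked passes."""
    p1 = F.log_softmax(logits1, dim=-1)
    p2 = F.log_softmax(logits2, dim=-1)
    # F.kl_div(input, target) computes KL(target || input) when both are log-probs.
    kl12 = F.kl_div(p2, p1, log_target=True, reduction="batchmean")  # KL(P1 || P2)
    kl21 = F.kl_div(p1, p2, log_target=True, reduction="batchmean")  # KL(P2 || P1)
    return 0.5 * (kl12 + kl21)

loss = rdrop_loss(torch.randn(4, 3), torch.randn(4, 3))
```

In training, the same batch is simply forwarded twice; dropout draws independent masks automatically, so no extra plumbing is needed.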


5. Two-Stage Training with LLRD

Source: src/train.py (class KFoldTrainer), src/model.py (methods freeze_backbone, unfreeze_backbone, get_param_groups)

The Problem: Immediately fine-tuning all 280M parameters with only ~1,000 samples risks catastrophic forgetting of pre-trained representations.

Our Solution — Two-Stage Training:

| Stage | Epochs | What's Trained | Learning Rate |
| --- | --- | --- | --- |
| Stage 1 (Warm-up) | 0–1 | PCL prototypes + T-CLN parameters only | base_lr × 5.0 |
| Stage 2 (Full) | 2–9 | All parameters | LLRD (see below) |

Layer-wise Learning Rate Decay (LLRD): In Stage 2, lower transformer layers (which capture general linguistic knowledge) receive smaller learning rates, while upper layers (which are more task-specific) receive larger ones:

$$\text{lr}_{\text{layer } i} = \text{base\_lr} \times 0.95^{\,(11 - i)}$$

  • Layer 0 (bottom): 2e-5 × 0.95^11 ≈ 1.14e-5
  • Layer 11 (top): 2e-5 × 0.95^0 = 2e-5
  • Embeddings: 2e-5 × 0.95^12 ≈ 1.08e-5
  • PCL / T-CLN / LM head: 2e-5 × 5.0 = 1e-4
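One way to build the corresponding optimizer parameter groups, assuming a 12-layer encoder with Hugging Face-style parameter names containing `encoder.layer.{i}.` (the helper name is hypothetical, not taken from src/model.py):

```python
def llrd_param_groups(model, base_lr=2e-5, decay=0.95, head_mult=5.0):
    """Assign per-parameter learning rates: decayed for lower layers,
    boosted for new components (PCL, T-CLN, LM head)."""
    groups = []
    for name, param in model.named_parameters():
        if "embeddings" in name:
            lr = base_lr * decay ** 12            # below layer 0
        elif "encoder.layer." in name:
            layer = int(name.split("encoder.layer.")[1].split(".")[0])
            lr = base_lr * decay ** (11 - layer)  # layer 11 keeps base_lr
        else:
            lr = base_lr * head_mult              # new components: 1e-4
        groups.append({"params": [param], "lr": lr})
    return groups

# Usage: torch.optim.AdamW(llrd_param_groups(model), weight_decay=0.01)
```

Per-parameter-group learning rates are natively supported by every PyTorch optimizer, so no custom scheduler logic is required.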

Component Interaction Summary

The full forward pass and loss computation:

1. Encode:     [CLS] text [SEP] template_with_[MASK] [SEP]  →  mDeBERTa  →  hidden_states
2. T-CLN:      hidden_states + topic_id  →  topic-normalized hidden_states     (Subtask B only)
3. Extract:    hidden_states[MASK_position]  →  mask_hidden  ∈ R^768
4. MLM Head:   mask_hidden  →  vocab_logits  ∈ R^250002
5. Verbalizer: vocab_logits[verbalizer_tokens]  →  class_logits  ∈ R^3
6. Loss:       L = L_focal(class_logits, y)
                 + 0.1 · L_PCL(mask_hidden, y)
                 + 1.0 · L_RDrop(class_logits_pass1, class_logits_pass2)

The focal loss handles class imbalance by down-weighting easy examples. PCL shapes the [MASK] embedding space into well-separated clusters. R-Drop regularizes the overall prediction surface. Together, these three losses address the three core challenges: imbalance, representation quality, and overfitting.


Architecture

Input: [CLS] text [SEP] Regarding {target}, this author's stance is [MASK]. [SEP]
                                         │
                                  ┌──────┴──────┐
                                  │  mDeBERTa   │
                                  │  v3-base    │
                                  │  (280M)     │
                                  └──────┬──────┘
                                         │
                              ┌──────────┤ (optional)
                              │    ┌─────┴─────┐
                              │    │   T-CLN   │ ← Topic Embedding
                              │    │(Subtask B)│
                              │    └─────┬─────┘
                              │          │
                        ┌─────┴────┐ ┌───┴───┐
                        │ MLM Head │ │  PCL  │
                        │(pre-trn) │ │K=3 pr.│
                        └─────┬────┘ └───┬───┘
                              │          │
                        ┌─────┴────┐     │
                        │Verbalizer│     │
                        └─────┬────┘     │
                              │          │
                    L_focal ──┘    L_PCL─┘  + L_RDrop

Quick Start

A minimal example to get predictions with a trained model:

# 1. Clone and install
git clone https://github.com/Shakhoyat/PAST-TIDE.git
cd PAST-TIDE
pip install -r requirements.txt

# 2. Train on Subtask A (5-fold CV, ~3 hrs on T4 GPU)
python src/train.py \
    --data_dir /path/to/stancenakba/ \
    --task A \
    --output_dir checkpoints/subtask_a \
    --num_epochs 10 \
    --num_folds 5

# 3. Generate submission
python src/inference.py \
    --data_dir /path/to/stancenakba/ \
    --task A \
    --checkpoint_dir checkpoints/subtask_a \
    --output_path submission_a.csv \
    --strategy prob_avg

Installation

git clone https://github.com/Shakhoyat/PAST-TIDE.git
cd PAST-TIDE
pip install -r requirements.txt

Requirements

| Package | Version |
| --- | --- |
| Python | ≥ 3.8 |
| PyTorch | ≥ 2.0 |
| Transformers | ≥ 4.30 |
| sentencepiece | ≥ 0.1.99 |
| protobuf | ≥ 3.20 |
| NumPy | ≥ 1.24 |
| Pandas | ≥ 2.0 |
| scikit-learn | ≥ 1.2 |

Hardware: CUDA-compatible GPU required. Tested on NVIDIA T4 (16 GB). Gradient checkpointing and FP16 mixed precision are enabled by default to fit within 16 GB VRAM.


Data Format

The system expects StanceNakba 2026 Shared Task data in CSV format:

Subtask A (Subtask_A/Subtask_A_train.csv):

| Column | Description |
| --- | --- |
| text | English text expressing a stance |
| label | One of Pro-Palestine, Pro-Israel, Neutral |

Subtask B (Subtask_B/Subtask_B_train.csv):

| Column | Description |
| --- | --- |
| text | Arabic text expressing a stance |
| target | Topic string (e.g., "Normalization with Israel", "Refugees in Jordan") |
| label | One of favor, against, neither |

Place data files under a single root directory and pass it via --data_dir.
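For illustration, a minimal label-mapping sketch for the Subtask A layout; the inline CSV is a toy stand-in for the real file, which is read from `--data_dir`:

```python
import io
import pandas as pd

# Two toy rows mimicking Subtask_A_train.csv.
csv = io.StringIO("text,label\nFree Palestine,Pro-Palestine\nNo comment,Neutral\n")
df = pd.read_csv(csv)

# String labels → integer class ids, matching the verbalizer class order.
label_map_a = {"Pro-Palestine": 0, "Pro-Israel": 1, "Neutral": 2}
df["label_id"] = df["label"].map(label_map_a)
print(df["label_id"].tolist())  # → [0, 2]
```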


Usage

Training

The system uses 5-fold stratified cross-validation with two-stage training (frozen backbone → LLRD unfreezing).

Subtask A (English stance detection):

python src/train.py \
    --data_dir /path/to/stancenakba/ \
    --task A \
    --output_dir checkpoints/subtask_a \
    --num_epochs 10 \
    --num_folds 5

Subtask B (Arabic cross-topic stance detection):

python src/train.py \
    --data_dir /path/to/stancenakba/ \
    --task B \
    --output_dir checkpoints/subtask_b \
    --num_epochs 10 \
    --num_folds 5

Note: T-CLN is automatically enabled for Subtask B. Back-translation augmentation (EN↔DE for A, AR↔EN for B) runs by default. Use --no_backtrans to disable.

Inference

python src/inference.py \
    --data_dir /path/to/stancenakba/ \
    --task A \
    --checkpoint_dir checkpoints/subtask_a \
    --output_path submission_a.csv \
    --strategy prob_avg

Ensemble strategies: prob_avg (default, averages class probabilities across all K folds) or majority_vote.


Project Structure

PAST-TIDE/
├── README.md                        # This file
├── LICENSE                          # MIT License
├── requirements.txt                 # Python dependencies
├── GITHUB_SETUP.md                  # Repository setup instructions
├── src/
│   ├── __init__.py                  # Package docstring
│   ├── config.py                    # All hyperparameters, verbalizer maps, label maps
│   ├── model.py                     # PASTIDEv23: Statement Tuning + PCL + T-CLN
│   ├── train.py                     # KFoldTrainer: 2-stage training, R-Drop, back-translation
│   ├── inference.py                 # Ensemble inference (prob avg / majority vote)
│   └── modules/
│       ├── __init__.py              # Module exports
│       ├── pcl.py                   # PrototypicalContrastiveHead (K=3 learnable prototypes)
│       ├── t_cln.py                 # TopicConditionalLayerNorm (dynamic γ/β from topic)
│       └── losses.py                # FocalLoss (class-weighted) + RDropLoss (symmetric KL)
└── notebooks/
    ├── subtask_a.ipynb              # Kaggle notebook for Subtask A
    └── subtask_b.ipynb              # Kaggle notebook for Subtask B

Hyperparameters

All hyperparameters are defined in src/config.py (TIDEv23Config dataclass) and can be overridden via CLI.

| Parameter | Value | Rationale |
| --- | --- | --- |
| Backbone | mDeBERTa-v3-base (280M) | Best multilingual encoder with disentangled attention |
| Max sequence length | 256 | Covers 99% of inputs including appended template |
| Effective batch size | 8 × 4 = 32 | Physical=8 (GPU memory), accumulation=4 |
| Learning rate (backbone) | 2 × 10⁻⁵ | Conservative to prevent catastrophic forgetting |
| Head LR multiplier | 5.0 (→ 1 × 10⁻⁴) | New components learn faster |
| LLRD decay | 0.95 per layer | Lower layers (general knowledge) updated more slowly |
| Focal loss γ | 2.0 | Standard value; down-weights confident predictions |
| Label smoothing | 0.1 | Prevents overconfident predictions |
| PCL temperature τ | 0.1 | Sharpens cosine similarity distribution |
| PCL weight λ_PCL | 0.1 | Auxiliary loss; kept small to not dominate focal |
| R-Drop weight λ_RDrop | 1.0 | Standard from Liang et al. (2021) |
| T-CLN topic dim | 64 | Compact topic embedding; 768 would overparameterize |
| Freeze epochs (Stage 1) | 2 | Lets prototypes/T-CLN converge before encoder shifts |
| Total epochs | 10 | Sufficient with early stopping |
| Early stopping patience | 3 | Based on validation macro-F1 |
| K-fold splits | 5 | Standard stratified CV |

Training Details

  • Hardware: Dual NVIDIA T4 GPUs (2×16 GB), Kaggle free tier
  • Training time: ~3 hours for 5-fold CV
  • Mixed precision: FP16 with gradient checkpointing (fits in 16 GB VRAM)
  • Augmentation: Back-translation via Helsinki-NLP/opus-mt models (EN↔DE for Subtask A, AR↔EN for Subtask B; 100% augmentation ratio doubles the dataset)
  • Optimizer: AdamW with weight decay 0.01, linear warmup (10% of steps)
  • Early stopping: Based on validation macro-F1 with patience of 3 epochs

Two-Stage Training Strategy

  1. Stage 1 (epochs 0–1): Encoder and MLM head frozen. Only PCL prototypes and T-CLN parameters receive gradient updates. This lets the new components find reasonable initializations without disturbing pre-trained representations.
  2. Stage 2 (epochs 2–9): All parameters unfrozen with layer-wise learning rate decay (LLRD, factor 0.95/layer). Lower transformer layers get smaller LRs, preserving general linguistic knowledge while upper layers adapt to the task.

Composite Loss Function

L = L_focal + 0.1 · L_PCL + 1.0 · L_RDrop
| Loss Term | Purpose | Formula |
| --- | --- | --- |
| L_focal | Handles class imbalance; focuses on hard examples | FL(p_t) = −α_t (1 − p_t)^γ log(p_t) |
| L_PCL | Shapes [MASK] embedding space into separated clusters | −log [exp(sim(h, p_y)/τ) / Σ_j exp(sim(h, p_j)/τ)] |
| L_RDrop | Regularizes prediction surface for consistency | 0.5 · (KL(P₁‖P₂) + KL(P₂‖P₁)) |
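A minimal sketch of the focal-loss term from the table above; the real FocalLoss in src/modules/losses.py may differ in its class-weighting and smoothing details:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), averaged over the batch."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t per sample
    pt = log_pt.exp()
    weight = (1 - pt) ** gamma          # down-weight easy (high-p_t) examples
    if alpha is not None:               # optional per-class weights α_t, shape (C,)
        weight = weight * alpha[targets]
    return -(weight * log_pt).mean()
```

With γ = 0 and no α, this reduces exactly to standard cross-entropy, which is a convenient sanity check.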

Reproducibility

To reproduce our results exactly:

  1. Environment: Python 3.8+, PyTorch 2.0+, Transformers 4.30+ (see requirements.txt for pinned versions)
  2. Data: StanceNakba 2026 Shared Task data from the official competition page
  3. Hardware: NVIDIA T4 GPU (16 GB) with CUDA — results were obtained on Kaggle's free GPU tier
  4. Commands:
    # Subtask A
    python src/train.py --data_dir /path/to/data/ --task A --output_dir checkpoints/subtask_a
    python src/inference.py --data_dir /path/to/data/ --task A --checkpoint_dir checkpoints/subtask_a --output_path submission_a.csv
    
    # Subtask B
    python src/train.py --data_dir /path/to/data/ --task B --output_dir checkpoints/subtask_b
    python src/inference.py --data_dir /path/to/data/ --task B --checkpoint_dir checkpoints/subtask_b --output_path submission_b.csv
  5. Expected output: Macro-F1 ≈ 0.79 on both subtasks (minor variance from dropout/GPU non-determinism)

Kaggle notebooks (notebooks/subtask_a.ipynb, notebooks/subtask_b.ipynb) contain the exact execution environments used for our competition submissions and can be run directly on Kaggle with no modifications.


Citation

If you use this code, please cite our paper:

@inproceedings{shujon-etal-2026-blackwell-stancenakba,
    title     = "{T}he {B}lackwell {C}ollective at {S}tance{N}akba 2026:
                 {PAST}-{TIDE} -- Prototype-Anchored Statement Tuning with
                 Topic-Invariant Normalization for Stance Detection",
    author    = "Shujon, Md. Shakhoyat Rahman and
                 Jim, MD Jahid Hasan and
                 Islam, Md. Milon",
    booktitle = "Proceedings of the 15th International Conference on
                 Language Resources and Evaluation (LREC'26)",
    month     = may,
    year      = "2026",
    address   = "Palma, Spain",
}

License

This project is licensed under the MIT License — see LICENSE for details.


Acknowledgements

We thank the StanceNakba 2026 Shared Task organizers for providing the datasets and evaluation infrastructure. All experiments were conducted on Kaggle's free GPU tier. We also acknowledge the creators of mDeBERTa and Helsinki-NLP/opus-mt for the pre-trained models used in this work.


Contact

For questions or discussions about this work:

  • Md. Shakhoyat Rahman Shujon (GitHub: Shakhoyat)
  • MD Jahid Hasan Jim
  • Md. Milon Islam

Feel free to open an issue for bug reports or feature requests.
