PAST-TIDE: Prototype-Anchored Statement Tuning with Topic-Invariant Normalization for Stance Detection
The Blackwell Collective at StanceNakba 2026
Md. Shakhoyat Rahman Shujon, MD Jahid Hasan Jim, Md. Milon Islam
Official implementation of our system paper at the StanceNakba 2026 Shared Task (NakbaNLP Workshop @ LREC-COLING 2026).
- Overview
- Key Results
- Novel Architectural Choices — Design Deep-Dive
- Architecture
- Quick Start
- Installation
- Data Format
- Usage
- Project Structure
- Hyperparameters
- Training Details
- Reproducibility
- Citation
- License
- Acknowledgements
- Contact
PAST-TIDE is a unified stance detection system that reformulates classification as cloze-style masked language modeling via statement tuning, achieving 0.79 macro-F1 on both subtasks with a single architecture.
Instead of adding a randomly-initialized classification head on top of [CLS], we convert each input into a natural-language statement with a [MASK] slot and let the pre-trained MLM head predict stance-indicative words. This closes the pre-training/fine-tuning gap and yields strong performance with only ~1,000 labelled examples — adding zero new classifier parameters.
On top of this, we introduce Prototypical Contrastive Learning (PCL) for batch-size-agnostic representation shaping, and Topic-Conditional Layer Normalization (T-CLN) for cross-topic distribution alignment — both novel components designed specifically for this low-resource, multilingual stance detection challenge.
| Component | What it does | Why it matters | Source |
|---|---|---|---|
| Statement Tuning | Reformulates stance as [MASK] prediction with verbalizer mapping | Reuses the pre-trained MLM head — zero new classifier parameters, closes the pre-training/fine-tuning gap | src/model.py |
| Prototypical Contrastive Learning (PCL) | Contrasts samples against K=3 learnable class prototypes instead of in-batch negatives | Stable contrastive gradients at any batch size; only 2,304 new parameters | src/modules/pcl.py |
| Topic-Conditional LayerNorm (T-CLN) | Dynamically generates normalization γ/β from topic embeddings | Produces topic-invariant representations for cross-topic transfer (Subtask B) | src/modules/t_cln.py |
| R-Drop | Symmetric KL-divergence between two dropout-masked forward passes | Smooths decision boundary; +1.5 F1 in low-resource setting | src/modules/losses.py |
| Two-Stage LLRD Training | Freeze backbone → unfreeze with layer-wise LR decay | Prevents catastrophic forgetting of pre-trained knowledge | src/train.py |
| System | Subtask A (EN) | Subtask B (AR) |
|---|---|---|
| PAST-TIDE | 0.79 | 0.79 |
Parameter efficiency: Only 2,304 new parameters (3 PCL prototypes × 768 dimensions) are added for Subtask A. The MLM head and verbalizer contribute zero new parameters — they are fully reused from pre-training.
This section provides detailed technical explanations of every novel component, including mathematical formulations, design rationale, and pointers to the exact source files.
Source: `src/model.py` (class `PASTIDEv23`, methods `forward` and `set_verbalizer`)
Config: `src/config.py` (fields `statement_template_a`, `statement_template_b`, `verbalizer`)
The Problem: Standard fine-tuning adds a randomly-initialized linear head on [CLS] embeddings, discarding the pre-trained MLM head entirely. With only ~1,000 training samples this creates two issues: (1) the new head starts from scratch, wasting pre-trained knowledge; (2) there is an objective mismatch between pre-training (MLM) and fine-tuning (classification).
Our Solution — Statement Tuning: We convert stance detection into a cloze task by appending a natural-language template to each input:
Input: [CLS] {text} [SEP] Regarding {target}, this author's stance is [MASK]. [SEP]
The model predicts a word at the [MASK] position using its pre-trained MLM head (zero new parameters). A verbalizer maps predicted words to stance classes:
| Class | Verbalizer Words |
|---|---|
| Pro-Palestine / Favor (0) | support, favor, pro, yes |
| Pro-Israel / Against (1) | oppose, against, anti, no |
| Neutral (2) | neutral, unclear, none |
Mathematical Formulation: Given a verbalizer token set $\mathcal{V}_c$ for each class $c$, the class probability is obtained by pooling the MLM distribution at the [MASK] position:

$$P(c \mid x) = \frac{\sum_{w \in \mathcal{V}_c} P_{\text{MLM}}(w \mid x)}{\sum_{c'} \sum_{w \in \mathcal{V}_{c'}} P_{\text{MLM}}(w \mid x)}$$

where $P_{\text{MLM}}(w \mid x)$ is the pre-trained MLM head's probability of token $w$ at the [MASK] position.
Why This Design:
- Zero new parameters — the MLM head's weight matrix is already tied to the input embeddings, so it has rich vocabulary-level knowledge from pre-training.
- Closes the objective gap — both pre-training and fine-tuning use the same MLM objective, preserving learned representations.
- Naturally handles multilingual input — mDeBERTa's MLM head was trained on 100+ languages; the verbalizer words exploit this cross-lingual knowledge.
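As a minimal illustration of the verbalizer step, the sketch below collapses MLM vocabulary logits at the [MASK] position into class logits. The function name and the logsumexp pooling over each class's verbalizer tokens are assumptions for illustration; the repository's actual pooling lives in `src/model.py` and may differ.

```python
import torch

def verbalizer_class_logits(vocab_logits, verbalizer_ids):
    """Collapse MLM vocabulary logits at [MASK] into class logits (sketch).

    vocab_logits: (batch, vocab_size) MLM-head logits at the [MASK] position.
    verbalizer_ids: one list of token ids per class, e.g. ids for
                    {support, favor, ...}, {oppose, against, ...}, {neutral, ...}.
    """
    class_logits = []
    for ids in verbalizer_ids:
        # logsumexp pools evidence over all verbalizer words of one class
        class_logits.append(torch.logsumexp(vocab_logits[:, ids], dim=-1))
    return torch.stack(class_logits, dim=-1)  # (batch, num_classes)
```

Because logsumexp is a soft maximum, a single strongly activated verbalizer word (e.g. "oppose") is enough to dominate its class score.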
Source: `src/modules/pcl.py` (class `PrototypicalContrastiveHead`)
Config: `src/config.py` (fields `use_pcl`, `pcl_weight`, `pcl_temperature`)
The Problem: Supervised Contrastive Loss (SupCon) requires large batch sizes to provide enough in-batch negatives for stable gradients. With our effective batch size of 32, there are only ~5.3 expected negatives per anchor — leading to noisy, high-variance gradient estimates that destabilize training.
Our Solution — Prototypical Contrastive Learning: Instead of contrasting against other samples in the batch, we maintain K=3 learnable class prototypes (one per stance class) and contrast each sample against these fixed anchors. The denominator always has exactly K=3 terms, regardless of batch size.
Mathematical Formulation:

$$\mathcal{L}_{\text{PCL}} = -\log \frac{\exp(\text{sim}(\mathbf{h}, \mathbf{p}_y)/\tau)}{\sum_{j=1}^{K} \exp(\text{sim}(\mathbf{h}, \mathbf{p}_j)/\tau)}$$

where:
- $\mathbf{h} = \text{L2-normalize}(\text{hidden}_{[\text{MASK}]})$ — the normalized [MASK] hidden state
- $\mathbf{p}_j = \text{L2-normalize}(\text{prototype}_j)$ — the normalized learnable prototype for class $j$
- $\text{sim}(\cdot, \cdot)$ — cosine similarity
- $\tau = 0.1$ — temperature scalar
- $y$ — the ground-truth class
This is mathematically equivalent to a temperature-scaled cosine-similarity cross-entropy, but with prototypes serving as virtual class anchors trained end-to-end alongside the encoder.
Implementation Detail: Prototypes are initialized with small random values (torch.randn * 0.02) and are nn.Parameters. During Stage 1 of training the backbone is frozen, so only the prototypes and T-CLN parameters receive gradient updates — this lets the prototypes find reasonable class centroids before the encoder starts shifting.
Parameter Cost: 3 prototypes × 768 dimensions = 2,304 parameters — the only new parameters in the entire Subtask A system.
Why This Over SupCon:
- Batch-size invariant: Works at batch_size=1 because the contrastive denominator is determined by K (not batch size).
- Stable gradients: No sampling noise from in-batch negatives.
- Memory efficient: No need for large memory banks or momentum encoders.
Reference: Li et al. (2021) "Prototypical Contrastive Learning of Unsupervised Representations" (ICLR)
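The mechanics above can be sketched in a few lines, following the described design (K learnable prototypes, small random init, temperature-scaled cosine cross-entropy). This is a minimal re-implementation for illustration, not the exact code in `src/modules/pcl.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypicalContrastiveHead(nn.Module):
    """Contrast each [MASK] embedding against K learnable class prototypes."""

    def __init__(self, num_classes=3, hidden_dim=768, temperature=0.1):
        super().__init__()
        # K x d learnable prototypes: 3 x 768 = 2,304 parameters, small init
        self.prototypes = nn.Parameter(torch.randn(num_classes, hidden_dim) * 0.02)
        self.temperature = temperature

    def forward(self, h, labels):
        h = F.normalize(h, dim=-1)                # (B, d) normalized [MASK] states
        p = F.normalize(self.prototypes, dim=-1)  # (K, d) normalized prototypes
        sims = h @ p.t() / self.temperature       # cosine similarities, (B, K)
        # equivalent to the PCL loss: cross-entropy over K prototype similarities
        return F.cross_entropy(sims, labels)
```

Note that the denominator of the implied softmax always has exactly K terms, so the loss is well-defined even at batch size 1.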
Source: `src/modules/t_cln.py` (class `TopicConditionalLayerNorm`)
Config: `src/config.py` (fields `use_t_cln`, `t_cln_topic_dim`, `num_topics`)
The Problem: Subtask B requires cross-topic stance detection on Arabic text, where two topics — "Normalization with Israel" (political vocabulary) and "Refugees in Jordan" (humanitarian vocabulary) — have dramatically different lexical distributions. A standard model conflates topic-specific features with stance-indicative features, hurting generalization.
Our Solution — Topic-Conditional Layer Normalization: We replace a static LayerNorm with a dynamic one whose affine parameters $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are generated from a learned embedding of the topic.
Mathematical Formulation:

Standard LayerNorm (static parameters):

$$\text{LN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \mu}{\sigma} + \boldsymbol{\beta}$$

T-CLN (dynamic parameters from topic embedding $\mathbf{e}_t$):

$$\boldsymbol{\gamma}(t) = \text{MLP}_\gamma(\mathbf{e}_t), \qquad \boldsymbol{\beta}(t) = \text{MLP}_\beta(\mathbf{e}_t)$$

$$\text{T-CLN}(\mathbf{x}, t) = \boldsymbol{\gamma}(t) \odot \frac{\mathbf{x} - \mu}{\sigma} + \boldsymbol{\beta}(t)$$

Each MLP is a two-layer network: Linear(64→768) → Tanh → Linear(768→768).
Initialization: The final γ MLP layer has its bias initialized to 1.0, and the β MLP bias to 0.0, so T-CLN starts as a near-identity transformation (equivalent to standard LayerNorm). This ensures the model doesn't diverge at the start of training.
Placement: Applied to encoder hidden states after the last transformer layer and before the MLM head / verbalizer. This is critical — normalizing after the encoder but before classification ensures the topic shift is removed at the representation level, not at the token level.
Why Conditional Normalization Over Alternatives:
- Topic-adversarial training (gradient reversal) requires careful λ scheduling and can be unstable in low-resource settings.
- Multi-task with topic classification adds parameters and doesn't directly align distributions.
- T-CLN is lightweight — the topic embeddings + two MLPs add minimal parameters, and the identity initialization makes it safe to train from scratch.
Reference: Su et al. (2021) "Enhancing Content Preservation in Text Style Transfer via Learnable Normalization" (ACL)
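A minimal sketch of the module, under the stated design (topic embedding of dimension 64, two Linear(64→768)→Tanh→Linear(768→768) MLPs, final biases initialized to 1.0 for γ and 0.0 for β so training starts near identity). Class and argument names mirror the description but the real code in `src/modules/t_cln.py` may differ in detail.

```python
import torch
import torch.nn as nn

class TopicConditionalLayerNorm(nn.Module):
    """LayerNorm whose gamma/beta are generated from a topic embedding."""

    def __init__(self, hidden_dim=768, num_topics=2, topic_dim=64, eps=1e-6):
        super().__init__()
        self.topic_emb = nn.Embedding(num_topics, topic_dim)
        self.gamma_mlp = nn.Sequential(
            nn.Linear(topic_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, hidden_dim))
        self.beta_mlp = nn.Sequential(
            nn.Linear(topic_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, hidden_dim))
        # near-identity init: gamma starts around 1, beta around 0
        nn.init.constant_(self.gamma_mlp[-1].bias, 1.0)
        nn.init.constant_(self.beta_mlp[-1].bias, 0.0)
        self.eps = eps

    def forward(self, x, topic_id):
        t = self.topic_emb(topic_id)            # (B, topic_dim)
        gamma = self.gamma_mlp(t).unsqueeze(1)  # (B, 1, d) broadcast over tokens
        beta = self.beta_mlp(t).unsqueeze(1)
        mu = x.mean(-1, keepdim=True)
        sigma = x.std(-1, keepdim=True, unbiased=False)
        return gamma * (x - mu) / (sigma + self.eps) + beta
```

Because γ and β are shared across all token positions of a sequence, the topic shift is removed at the representation level, consistent with the placement described above.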
Source: `src/modules/losses.py` (class `RDropLoss`)
Config: `src/config.py` (fields `use_rdrop`, `rdrop_weight`)
The Problem: With ~1,000 training samples and a 280M-parameter backbone, overfitting is severe. Standard dropout helps but leaves room for the model to produce inconsistent predictions for the same input under different dropout masks.
Our Solution: R-Drop forces the model to produce consistent predictions across two forward passes of the same input with different dropout masks, using symmetric KL-divergence:

$$\mathcal{L}_{\text{RDrop}} = \frac{1}{2}\left(\text{KL}(P_1 \,\|\, P_2) + \text{KL}(P_2 \,\|\, P_1)\right)$$

where $P_1$ and $P_2$ are the predicted class distributions from the two dropout-masked forward passes.
Why It Works: This smooths the decision boundary by constraining the model's prediction surface to be locally consistent — any two random subnetworks (induced by dropout) must agree on the output. On our benchmark this yields approximately +1.5 macro-F1 over standard dropout alone.
Reference: Liang et al. (2021) "R-Drop: Regularized Dropout for Neural Networks" (NeurIPS)
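The symmetric KL term can be sketched directly with `torch.nn.functional.kl_div`; this is an illustrative stand-in for the `RDropLoss` class in `src/modules/losses.py`.

```python
import torch
import torch.nn.functional as F

def rdrop_loss(logits1, logits2):
    """Symmetric KL between two dropout-masked forward passes (R-Drop)."""
    log_p1 = F.log_softmax(logits1, dim=-1)
    log_p2 = F.log_softmax(logits2, dim=-1)
    # F.kl_div with log_target=True takes log-probabilities for both arguments
    kl_a = F.kl_div(log_p1, log_p2, log_target=True, reduction="batchmean")
    kl_b = F.kl_div(log_p2, log_p1, log_target=True, reduction="batchmean")
    return 0.5 * (kl_a + kl_b)
```

In training, the same batch is passed through the model twice (dropout active), and this term is added to the task loss with weight λ_RDrop = 1.0.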
Source: `src/train.py` (class `KFoldTrainer`), `src/model.py` (methods `freeze_backbone`, `unfreeze_backbone`, `get_param_groups`)
The Problem: Immediately fine-tuning all 280M parameters with only ~1,000 samples risks catastrophic forgetting of pre-trained representations.
Our Solution — Two-Stage Training:
| Stage | Epochs | What's Trained | Learning Rate |
|---|---|---|---|
| Stage 1 (Warm-up) | 0–1 | PCL prototypes + T-CLN parameters only | base_lr × 5.0 |
| Stage 2 (Full) | 2–9 | All parameters | LLRD (see below) |
Layer-wise Learning Rate Decay (LLRD): In Stage 2, lower transformer layers (which capture general linguistic knowledge) receive smaller learning rates, while upper layers (which are more task-specific) receive larger ones:
- Layer 0 (bottom):
2e-5 × 0.95^11 ≈ 1.14e-5 - Layer 11 (top):
2e-5 × 0.95^0 = 2e-5 - Embeddings:
2e-5 × 0.95^12 ≈ 1.08e-5 - PCL / T-CLN / LM head:
2e-5 × 5.0 = 1e-4
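The LLRD schedule above can be reproduced with a small helper; the function name and dict-based output are illustrative (the repository builds actual optimizer parameter groups in `get_param_groups`), but the arithmetic matches the listed rates.

```python
def llrd_learning_rates(num_layers=12, base_lr=2e-5, decay=0.95, head_mult=5.0):
    """Per-component learning rates under layer-wise LR decay (sketch)."""
    rates = {"embeddings": base_lr * decay ** num_layers}
    for layer in range(num_layers):
        # top layer gets the full base LR; each step down multiplies by `decay`
        rates[f"layer_{layer}"] = base_lr * decay ** (num_layers - 1 - layer)
    # new components (PCL prototypes, T-CLN, LM head) learn faster
    rates["head"] = base_lr * head_mult
    return rates
```

For the default arguments this yields layer_11 = 2e-5, layer_0 ≈ 1.14e-5, embeddings ≈ 1.08e-5, and head = 1e-4, matching the schedule above.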
The full forward pass and loss computation:
1. Encode: [CLS] text [SEP] template_with_[MASK] [SEP] → mDeBERTa → hidden_states
2. T-CLN: hidden_states + topic_id → topic-normalized hidden_states (Subtask B only)
3. Extract: hidden_states[MASK_position] → mask_hidden ∈ R^768
4. MLM Head: mask_hidden → vocab_logits ∈ R^250002
5. Verbalizer: vocab_logits[verbalizer_tokens] → class_logits ∈ R^3
6. Loss: L = L_focal(class_logits, y)
+ 0.1 · L_PCL(mask_hidden, y)
+ 1.0 · L_RDrop(class_logits_pass1, class_logits_pass2)
The focal loss handles class imbalance by down-weighting easy examples. PCL shapes the [MASK] embedding space into well-separated clusters. R-Drop regularizes the overall prediction surface. Together, these three losses address the three core challenges: imbalance, representation quality, and overfitting.
Input: [CLS] text [SEP] Regarding {target}, this author's stance is [MASK]. [SEP]
│
┌──────┴──────┐
│ mDeBERTa │
│ v3-base │
│ (280M) │
└──────┬──────┘
│
┌──────────┤ (optional)
│ ┌─────┴─────┐
│ │ T-CLN │ ← Topic Embedding
│ │(Subtask B)│
│ └─────┬─────┘
│ │
┌─────┴────┐ ┌───┴───┐
│ MLM Head │ │ PCL │
│(pre-trn) │ │K=3 pr.│
└─────┬────┘ └───┬───┘
│ │
┌─────┴────┐ │
│Verbalizer│ │
└─────┬────┘ │
│ │
L_focal ──┘ L_PCL─┘ + L_RDrop
A minimal example to get predictions with a trained model:
# 1. Clone and install
git clone https://github.com/Shakhoyat/PAST-TIDE.git
cd PAST-TIDE
pip install -r requirements.txt
# 2. Train on Subtask A (5-fold CV, ~3 hrs on T4 GPU)
python src/train.py \
--data_dir /path/to/stancenakba/ \
--task A \
--output_dir checkpoints/subtask_a \
--num_epochs 10 \
--num_folds 5
# 3. Generate submission
python src/inference.py \
--data_dir /path/to/stancenakba/ \
--task A \
--checkpoint_dir checkpoints/subtask_a \
--output_path submission_a.csv \
    --strategy prob_avg

git clone https://github.com/Shakhoyat/PAST-TIDE.git
cd PAST-TIDE
pip install -r requirements.txt

| Package | Version |
|---|---|
| Python | ≥ 3.8 |
| PyTorch | ≥ 2.0 |
| Transformers | ≥ 4.30 |
| sentencepiece | ≥ 0.1.99 |
| protobuf | ≥ 3.20 |
| NumPy | ≥ 1.24 |
| Pandas | ≥ 2.0 |
| scikit-learn | ≥ 1.2 |
Hardware: CUDA-compatible GPU required. Tested on NVIDIA T4 (16 GB). Gradient checkpointing and FP16 mixed precision are enabled by default to fit within 16 GB VRAM.
The system expects StanceNakba 2026 Shared Task data in CSV format:
Subtask A (Subtask_A/Subtask_A_train.csv):
| Column | Description |
|---|---|
| `text` | English text expressing a stance |
| `label` | One of Pro-Palestine, Pro-Israel, Neutral |
Subtask B (Subtask_B/Subtask_B_train.csv):
| Column | Description |
|---|---|
| `text` | Arabic text expressing a stance |
| `target` | Topic string (e.g., "Normalization with Israel", "Refugees in Jordan") |
| `label` | One of favor, against, neither |
Place data files under a single root directory and pass it via --data_dir.
The system uses 5-fold stratified cross-validation with two-stage training (frozen backbone → LLRD unfreezing).
Subtask A (English stance detection):
python src/train.py \
--data_dir /path/to/stancenakba/ \
--task A \
--output_dir checkpoints/subtask_a \
--num_epochs 10 \
    --num_folds 5

Subtask B (Arabic cross-topic stance detection):
python src/train.py \
--data_dir /path/to/stancenakba/ \
--task B \
--output_dir checkpoints/subtask_b \
--num_epochs 10 \
    --num_folds 5

Note: T-CLN is automatically enabled for Subtask B. Back-translation augmentation (EN↔DE for A, AR↔EN for B) runs by default. Use `--no_backtrans` to disable.
python src/inference.py \
--data_dir /path/to/stancenakba/ \
--task A \
--checkpoint_dir checkpoints/subtask_a \
--output_path submission_a.csv \
    --strategy prob_avg

Ensemble strategies: `prob_avg` (default, averages class probabilities across all K folds) or `majority_vote`.
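The default `prob_avg` strategy amounts to a mean over fold-wise class probability matrices followed by an argmax; this NumPy sketch (function name assumed for illustration, not taken from `src/inference.py`) shows the idea.

```python
import numpy as np

def prob_avg(fold_probs):
    """prob_avg ensembling: mean class probabilities across folds, then argmax.

    fold_probs: sequence of (num_samples, num_classes) probability arrays,
                one per cross-validation fold.
    """
    probs = np.stack(fold_probs)          # (num_folds, num_samples, num_classes)
    return probs.mean(axis=0).argmax(axis=-1)
```

Unlike majority voting, averaging probabilities lets a fold's confidence (not just its hard prediction) influence the ensemble decision.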
PAST-TIDE/
├── README.md # This file
├── LICENSE # MIT License
├── requirements.txt # Python dependencies
├── GITHUB_SETUP.md # Repository setup instructions
├── src/
│ ├── __init__.py # Package docstring
│ ├── config.py # All hyperparameters, verbalizer maps, label maps
│ ├── model.py # PASTIDEv23: Statement Tuning + PCL + T-CLN
│ ├── train.py # KFoldTrainer: 2-stage training, R-Drop, back-translation
│ ├── inference.py # Ensemble inference (prob avg / majority vote)
│ └── modules/
│ ├── __init__.py # Module exports
│ ├── pcl.py # PrototypicalContrastiveHead (K=3 learnable prototypes)
│ ├── t_cln.py # TopicConditionalLayerNorm (dynamic γ/β from topic)
│ └── losses.py # FocalLoss (class-weighted) + RDropLoss (symmetric KL)
└── notebooks/
├── subtask_a.ipynb # Kaggle notebook for Subtask A
└── subtask_b.ipynb # Kaggle notebook for Subtask B
All hyperparameters are defined in src/config.py (TIDEv23Config dataclass) and can be overridden via CLI.
| Parameter | Value | Rationale |
|---|---|---|
| Backbone | mDeBERTa-v3-base (280M) | Best multilingual encoder with disentangled attention |
| Max sequence length | 256 | Covers 99% of inputs including appended template |
| Effective batch size | 8 × 4 = 32 | Physical=8 (GPU memory), accumulation=4 |
| Learning rate (backbone) | 2 × 10⁻⁵ | Conservative to prevent catastrophic forgetting |
| Head LR multiplier | 5× | New components learn faster: 1 × 10⁻⁴ |
| LLRD decay | 0.95 per layer | Lower layers (general knowledge) updated more slowly |
| Focal loss γ | 2.0 | Standard value; down-weights confident predictions |
| Label smoothing | 0.1 | Prevents overconfident predictions |
| PCL temperature τ | 0.1 | Sharpens cosine similarity distribution |
| PCL weight λ_PCL | 0.1 | Auxiliary loss; kept small to not dominate focal |
| R-Drop weight λ_RDrop | 1.0 | Standard from Liang et al. (2021) |
| T-CLN topic dim | 64 | Compact topic embedding; 768 would overparameterize |
| Freeze epochs (Stage 1) | 2 | Lets prototypes/T-CLN converge before encoder shifts |
| Total epochs | 10 | Sufficient with early stopping |
| Early stopping patience | 3 | Based on validation macro-F1 |
| K-fold splits | 5 | Standard stratified CV |
- Hardware: Dual NVIDIA T4 GPUs (2×16 GB), Kaggle free tier
- Training time: ~3 hours for 5-fold CV
- Mixed precision: FP16 with gradient checkpointing (fits in 16 GB VRAM)
- Augmentation: Back-translation via Helsinki-NLP/opus-mt models (EN↔DE for Subtask A, AR↔EN for Subtask B; 100% augmentation ratio doubles the dataset)
- Optimizer: AdamW with weight decay 0.01, linear warmup (10% of steps)
- Early stopping: Based on validation macro-F1 with patience of 3 epochs
- Stage 1 (epochs 0–1): Encoder and MLM head frozen. Only PCL prototypes and T-CLN parameters receive gradient updates. This lets the new components find reasonable initializations without disturbing pre-trained representations.
- Stage 2 (epochs 2–9): All parameters unfrozen with layer-wise learning rate decay (LLRD, factor 0.95/layer). Lower transformer layers get smaller LRs, preserving general linguistic knowledge while upper layers adapt to the task.
L = L_focal + 0.1 · L_PCL + 1.0 · L_RDrop
| Loss Term | Purpose | Formula |
|---|---|---|
| L_focal | Handles class imbalance; focuses on hard examples | FL(p_t) = −α_t (1 − p_t)^γ log(p_t) |
| L_PCL | Shapes [MASK] embedding space into separated clusters | −log exp(sim(h, p_y)/τ) / Σ exp(sim(h, p_j)/τ) |
| L_RDrop | Regularizes prediction surface for consistency | 0.5 · (KL(P₁‖P₂) + KL(P₂‖P₁)) |
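As a concrete reference for the first row of the table, here is a minimal focal loss matching FL(p_t) = −α_t (1 − p_t)^γ log(p_t); it is a generic sketch, not the class-weighted `FocalLoss` from `src/modules/losses.py`.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), averaged over the batch."""
    log_pt = F.log_softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()                        # p_t: probability of the true class
    weight = (1.0 - pt) ** gamma             # down-weights easy (high-p_t) examples
    if alpha is not None:                    # optional per-class weights alpha_t
        weight = weight * alpha[targets]
    return -(weight * log_pt).mean()
```

With γ = 0 and no α, this reduces exactly to standard cross-entropy, which is a useful sanity check when tuning γ.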
To reproduce our results exactly:
- Environment: Python 3.8+, PyTorch 2.0+, Transformers 4.30+ (see `requirements.txt` for pinned versions)
- Data: StanceNakba 2026 Shared Task data from the official competition page
- Hardware: NVIDIA T4 GPU (16 GB) with CUDA — results were obtained on Kaggle's free GPU tier
- Commands:
# Subtask A
python src/train.py --data_dir /path/to/data/ --task A --output_dir checkpoints/subtask_a
python src/inference.py --data_dir /path/to/data/ --task A --checkpoint_dir checkpoints/subtask_a --output_path submission_a.csv

# Subtask B
python src/train.py --data_dir /path/to/data/ --task B --output_dir checkpoints/subtask_b
python src/inference.py --data_dir /path/to/data/ --task B --checkpoint_dir checkpoints/subtask_b --output_path submission_b.csv
- Expected output: Macro-F1 ≈ 0.79 on both subtasks (minor variance from dropout/GPU non-determinism)
Kaggle notebooks (`notebooks/subtask_a.ipynb`, `notebooks/subtask_b.ipynb`) contain the exact execution environments used for our competition submissions and can be run directly on Kaggle with no modifications.
If you use this code, please cite our paper:
@inproceedings{shujon-etal-2026-blackwell-stancenakba,
title = "{T}he {B}lackwell {C}ollective at {S}tance{N}akba 2026:
{PAST}-{TIDE} -- Prototype-Anchored Statement Tuning with
Topic-Invariant Normalization for Stance Detection",
author = "Shujon, Md. Shakhoyat Rahman and
Jim, MD Jahid Hasan and
Islam, Md. Milon",
booktitle = "Proceedings of the 15th International Conference on
Language Resources and Evaluation (LREC'26)",
month = may,
year = "2026",
address = "Palma, Spain",
}

This project is licensed under the MIT License — see LICENSE for details.
We thank the StanceNakba 2026 Shared Task organizers for providing the datasets and evaluation infrastructure. All experiments were conducted on Kaggle's free GPU tier. We also acknowledge the creators of mDeBERTa and Helsinki-NLP/opus-mt for the pre-trained models used in this work.
For questions or discussions about this work:
- Md. Shakhoyat Rahman Shujon — GitHub
- MD Jahid Hasan Jim
- Md. Milon Islam
Feel free to open an issue for bug reports or feature requests.