startkit/challenge_2.py at main · eeg2025/startkit · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
# %% [markdown]
# <a target="_blank" href="https://colab.research.google.com/github/eeg2025/startkit/blob/main/challenge_2.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# %% [markdown]
# # Challenge 2: Predicting the externalizing factor from EEG

# The externalizing factor is a widely recognized construct in mental health research, representing a common underlying dimension of psychopathology across various disorders.
# Currently, the externalizing factor is often assessed using self-report questionnaires or clinician ratings, which can be subjective, prone to bias, and time-consuming.
# **The Challenge 2** consists of developing a model to predict the externalizing factor from EEG recordings.

# The challenge encourages learning physiologically meaningful signal representations and discovery of reproducible biomarkers.
# Models of any size should emphasize robust, interpretable features that generalize across subjects,
# sessions, and acquisition sites.

# Unlike a standard in-distribution classification task, this regression problem stresses out-of-distribution robustness
# and extrapolation. The goal is not only to minimize error on seen subjects, but also to transfer effectively to unseen data.

# Ensure the dataset is available locally. If not, see the [dataset download guide](https://eeg2025.github.io/data/#downloading-the-data)

# %% [markdown]
# ## Contents of this start kit

# 0. Understand the externalizing factor regression task.
# 1. Loading the data.
# 2. Wrap the data into a PyTorch-compatible dataset.
# 3. Define, train and save a model.

# > **Note:** If you need additional explanations on the [`EEGChallengeDataset`](https://eeglab.org/EEGDash/api/eegdash.html#eegdash.EEGChallengeDataset) class, dataloading, [`braindecode`](https://braindecode.org/stable/models/models_table.html)'s deep learning models, or brain decoding in general, please refer to the start-kit of challenge 1 which delves deeper into these topics.

# More contents will be released during the competition inside the [`eegdash`](https://eeglab.org/EEGDash/overview.html) [examples webpage](https://eeglab.org/EEGDash/generated/auto_examples/index.html).

# We will establish direct communication channels with you over the next two months, resolve platform issues, and discuss other details.

# We are going to do weekly one-hour support time to help teams that may have difficulties with any technical aspects that may arise.

# We will have closer communication starting next week, 08/09/2025.

# Before we begin, I just want to make a deal with you, ok?

# This is a community competition with a strong open-source foundation. When I say open-source, I mean volunteer work.

# So, if you see something that does not work or could be improved, first, **please be kind**, and we will fix it together on GitHub, okay?

# The entire decoding community will only go further when we stop solving the same problems over and over again, and start working together.

# %% [markdown]
# ----
# | The tutorial assumes some prior knowledge of deep learning concepts and EEG concepts:|
# | --- |
# |* Standard neural network architectures, e.g., convolutional neural networks|
# |* Optimization by batch gradient descent and backpropagation|
# |* Overfitting, early stopping, regularisation |
# |* Some knowledge of pytorch and, optionally, of the pytorch Lightning framework|
# |* That you know what EEG is |
# |* That you have basic familiarity with EEG preprocessing |
# |* Like and support open-source :) |
# ----

# %% Install dependencies on colab [code] tags=["hide-input"]
#!pip install braindecode
#!pip install eegdash

# %% Imports [code] tags=["hide-input"]
from pathlib import Path
import math
import os
import random
from joblib import Parallel, delayed

import torch
from torch.utils.data import DataLoader
from torch import optim
from torch.nn.functional import l1_loss
from braindecode.preprocessing import create_fixed_length_windows
from braindecode.datasets.base import EEGWindowsDataset, BaseConcatDataset, BaseDataset
from braindecode.models import EEGNeX
from eegdash import EEGChallengeDataset

# %% [markdown]
# ⚠️ **In case of colab, before starting, make sure you're on a GPU instance for faster training!** ⚠️

# > If running on Google Colab, please request a GPU runtime by clicking `Runtime/Change runtime type` in the top bar menu, then selecting 'T4 GPU' under 'Hardware accelerator'.


# %% Identify whether a CUDA-enabled GPU is available [code] tags=["hide-input"]
device = "cuda" if torch.cuda.is_available() else "cpu"
if device == "cuda":
    msg = "CUDA-enabled GPU found. Training should be faster."
else:
    msg = (
        "No GPU found. Training will be carried out on CPU, which might be "
        "slower.\n\nIf running on Google Colab, you can request a GPU runtime by"
        " clicking\n`Runtime/Change runtime type` in the top bar menu, then "
        "selecting 'T4 GPU'\nunder 'Hardware accelerator'."
    )
print(msg)

# %% [markdown]
# ## 0. Understanding the externalizing factor regression task.
#
# The externalizing factor is a widely recognized construct in mental health research, representing a common underlying dimension of psychopathology across various disorders.
# Currently, the externalizing factor is often assessed using self-report questionnaires or clinician ratings, which can be subjective, prone to bias, and time-consuming.
# In the dataset of this challenge, the externalizing factor was assessed using the Child Behavior Checklist (CBCL) [McElroy et al., (2017)](https://doi.org/10.1111/jcpp.12849).

# The goal of Challenge 2 is to develop a model to predict the externalizing factor from EEG recordings.
# **The feasibility of using EEG data for this purpose is still an open question**.
# The solution may involve finding meaningful representations of the EEG data that correlate with the externalizing factor scores.
# The challenge encourages learning physiologically meaningful signal representations and discovery of reproducible biomarkers.
# If contestants are successful in this task, it could pave the way for more objective and efficient assessments of the externalizing factor in clinical settings.

# %% [markdown]
# ## 1. Loading the data

# ### A. Define local path and (down)load the data

# In this challenge 2 example, we load the EEG 2025 release using EEG Dash.

# > **Note:** in this example notebook, we load the contrast change detection task from one mini release only as an example. Naturally, you are encouraged to train your models on all complete releases, using data from all the tasks you deem relevant.

# %%
# The first step is to define the cache folder!
DATA_DIR = Path("~/mne_data/eeg2025_competition")

# Creating the path if it does not exist
DATA_DIR.mkdir(parents=True, exist_ok=True)

# We define the list of releases to load.
# Here, only release 5 is loaded.
release_list = ["R5"]

all_datasets_list = [
    EEGChallengeDataset(
        release=release,
        task="contrastChangeDetection",
        mini=True,
        description_fields=[
            "subject",
            "session",
            "run",
            "task",
            "age",
            "gender",
            "sex",
            "p_factor",
        ],
        cache_dir=DATA_DIR,
    )
    for release in release_list
]
print("Datasets loaded")
sub_rm = ["NDARWV769JM7", "NDARME789TD2", "NDARUA442ZVF", "NDARJP304NK1",
          "NDARTY128YLU", "NDARDW550GU6", "NDARLD243KRE", "NDARUJ292JXV", "NDARBA381JGH"]
# %% [markdown]
# ### B. Combine the datasets into a single one

# Here, we combine the datasets from the different releases into a single
# ``BaseConcatDataset`` object.

# %%
all_datasets = BaseConcatDataset(all_datasets_list)
print(all_datasets.description)

raws = Parallel(n_jobs=os.cpu_count())(
    delayed(lambda d: d.raw)(d) for d in all_datasets.datasets
)

# %% [markdown]
# ### C. Inspect your data

# We can check what is inside the dataset consuming the
# MNE-object inside the Braindecode dataset.
#
# The following snippet, if uncommented, will show the first 10 seconds of the raw EEG signal.
# We can also inspect the data further by looking at the events and annotations.
# We strongly recommend you to take a look into the details and check how the events are structured.

# %%
# raw = all_datasets.datasets[0].raw  # mne.io.Raw object
# print(raw.info)

# raw.plot(duration=10, scalings="auto", show=True)

# print(raw.annotations)

SFREQ = 100

# %% [markdown]
# ## 2. Wrap the data into a PyTorch-compatible dataset

# The class below defines a dataset wrapper that will extract 2-second windows,
# uniformly sampled over the whole signal. In addition, it will add useful information
# about the extracted windows, such as the externalizing factor, the subject or the task.


# %%
class DatasetWrapper(BaseDataset):
    def __init__(
        self,
        dataset: EEGWindowsDataset,
        crop_size_samples: int,
        target_name: str = "externalizing",
        seed=None,
    ):
        self.dataset = dataset
        self.crop_size_samples = crop_size_samples
        self.target_name = target_name
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        X, _, crop_inds = self.dataset[index]

        target = self.dataset.description[self.target_name]
        target = float(target)

        # Additional information:
        infos = {
            "subject": self.dataset.description["subject"],
            "sex": self.dataset.description["sex"],
            "age": float(self.dataset.description["age"]),
            "task": self.dataset.description["task"],
            "session": self.dataset.description.get("session", None) or "",
            "run": self.dataset.description.get("run", None) or "",
        }

        # Randomly crop the signal to the desired length:
        i_window_in_trial, i_start, i_stop = crop_inds
        assert i_stop - i_start >= self.crop_size_samples, f"{i_stop=} {i_start=}"
        start_offset = self.rng.randint(0, i_stop - i_start - self.crop_size_samples)
        i_start = i_start + start_offset
        i_stop = i_start + self.crop_size_samples
        X = X[:, start_offset : start_offset + self.crop_size_samples]

        return X, target, (i_window_in_trial, i_start, i_stop), infos


# %% [markdown]
# We filter out certain recordings, create fixed length windows and finally make use of our `DatasetWrapper`.

# %%
# Filter out recordings that are too short
all_datasets = BaseConcatDataset(
    [
        ds
        for ds in all_datasets.datasets
        if not ds.description.subject in sub_rm
        and ds.raw.n_times >= 4 * SFREQ
        and len(ds.raw.ch_names) == 129
        and not math.isnan(ds.description["p_factor"])
    ]
)

# Create 4-seconds windows with 2-seconds stride
windows_ds = create_fixed_length_windows(
    all_datasets,
    window_size_samples=4 * SFREQ,
    window_stride_samples=2 * SFREQ,
    drop_last_window=True,
)

# Wrap each sub-dataset in the windows_ds
windows_ds = BaseConcatDataset(
    [DatasetWrapper(ds, crop_size_samples=2 * SFREQ) for ds in windows_ds.datasets]
)


# %% [markdown]
# ## 3. Define, train and save a model

# Now we have our pytorch dataset necessary for the training!
#
# Below, we define a simple EEGNeX model from Braindecode.
# All the braindecode models expect the input to be of shape (batch_size, n_channels, n_times)
# and have a test coverage about the behavior of the model.
# However, you can use any pytorch model you want.

# %%
# Initialize model
model = EEGNeX(n_chans=129, n_outputs=1, n_times=2 * SFREQ).to(device)

# Specify optimizer
optimizer = optim.Adamax(params=model.parameters(), lr=0.002)

print(model)


# %% [markdown]
# Finally, we can train our model. Here we define a simple training loop using pure PyTorch.

# > In this example, we only train for a single epoch. Feel free to increase the number of epochs.

# %% [code]
# Create PyTorch Dataloader

num_workers = 1  # We are using a single worker, but you can increase this for faster data loading
dataloader = DataLoader(windows_ds, batch_size=128, shuffle=True, num_workers=num_workers)

n_epochs = 1

# Train model for 1 epoch
for epoch in range(n_epochs):

    for idx, batch in enumerate(dataloader):
        # Reset gradients
        optimizer.zero_grad()

        # Unpack the batch
        X, y, crop_inds, infos = batch
        X = X.to(dtype=torch.float32, device=device)
        y = y.to(dtype=torch.float32, device=device).unsqueeze(1)

        # Forward pass
        y_pred = model(X)

        # Compute loss
        loss = l1_loss(y_pred, y)
        print(f"Epoch {0} - step {idx}, loss: {loss.item()}")

        # Gradient backpropagation
        loss.backward()
        optimizer.step()

# Finally, we can save the model for later use
torch.save(model.state_dict(), "weights_challenge_2.pt")
print("Model saved as 'weights_challenge_2.pt'")