ADHD and sex classifier

🎯 Project Highlights

Built a multioutput classification model using XGBoost to identify gender-specific and ADHD-related brain connectivity patterns
Achieved an F1 score of 0.6962 and ranked 310th out of 540 participating teams on the final Kaggle Leaderboard
Implemented data balancing using SMOTE, ADASYN, and SMOTEENN to optimize results within compute constraints

🔗 WiDS Datathon 2025 | Kaggle Competition Page

👩🏽‍💻 Setup & Execution

Clone the repo by pasting this into the terminal:

git clone git@github.com:kulieshova/ADHD-prediction.git

Using the terminal, cd into the ADHD-prediction/ folder

cd ADHD-prediction/

Download the data from the WiDS Kaggle page

Run the cells inside XGB model.ipynb

🏗️ Project Overview

The WiDS Datathon 2025 on Kaggle is a core component of Break Through Tech’s Spring AI Studio. In this initiative, teams of four BTT students apply the advanced machine learning techniques they mastered during the program’s ML course to a real-world challenge. The Kaggle challenge focuses on building robust predictive models for ADHD diagnosis while simultaneously uncovering subtle, sex-specific neural patterns by analyzing differences in brain connectivity and activation between males and females.

The project has significant implications for ADHD research and clinical care. By integrating diverse data modalities and uncovering nuanced, sex-specific neural patterns, the developed models could fundamentally transform how ADHD is diagnosed and managed. In particular, identifying distinct fMRI signatures that differentiate between male and female patients offers a pathway to address the persistent underdiagnosis of ADHD in women, paving the way for more personalized and effective treatment strategies.

📊 Data Exploration

Structure of the Data:
- Train Categorical: (1213, 10)
- Train Connectome: (1213, 19901)
- Train Quantitative: (1213, 19)
- Train Solutions: (1213, 3)
- Test Categorical: (304, 10)
- Test Connectome: (304, 19901)
- Test Quantitative: (304, 19)
Handling Missing Values:
- For numerical features, missing values were imputed with the mean.
- For categorical features, missing values were replaced with the string "unknown".
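The two imputation rules above can be sketched without any dependencies. This is a minimal illustration rather than the notebook's code: the helper names `impute_mean` and `impute_unknown` and the sample values are hypothetical, and missing entries are assumed to be represented as `None`.

```python
from statistics import mean

def impute_mean(values):
    """Replace missing numeric entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_unknown(values):
    """Replace missing categorical entries (None) with the string "unknown"."""
    return ["unknown" if v is None else v for v in values]

# Hypothetical sample columns, for illustration only.
numeric_column = [4.0, None, 8.0, 6.0]
categorical_column = ["a", None, "b"]

numeric_filled = impute_mean(numeric_column)
categorical_filled = impute_unknown(categorical_column)
```

With pandas, the same rules correspond to calling `fillna` with the column mean or with the literal "unknown".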
Data Exploration and Preprocessing Approaches:
- Verified the sizes of all three datasets to ensure that participant IDs were matching across splits.
- Conducted exploratory data analysis to assess the distribution, range, and anomalies within each feature set.
- Examined the connectome data for consistency and sparsity given its high dimensionality.
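The participant ID check can be sketched as a set comparison. This is only an assumed shape for that check: the function name `find_missing_ids`, the table names, and the sample IDs are hypothetical, and each table is reduced to its list of participant IDs.

```python
def find_missing_ids(tables):
    """For each table, list the participant IDs that appear elsewhere but not in it."""
    all_ids = set().union(*tables.values())
    return {name: sorted(all_ids - set(ids)) for name, ids in tables.items()}

# Hypothetical ID lists standing in for the three training tables.
tables = {
    "categorical": ["p1", "p2", "p3"],
    "quantitative": ["p1", "p2", "p3"],
    "connectome": ["p1", "p3"],
}

missing = find_missing_ids(tables)
# Tables are aligned only when no table is missing any ID.
aligned = all(len(ids) == 0 for ids in missing.values())
```

An empty list for every table means the IDs line up and the tables can be joined on the ID column without dropping participants.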
Data Balancing Techniques:
- To address class imbalance across the four combinations of ADHD diagnosis (ADHD/no ADHD) and sex (male/female), we undersampled the overpopulated class and applied SMOTE to the underpopulated classes so that every combination ends up with 250 samples. Going by the counts below, the undersampled class is male with ADHD (581 → 250), the oversampled classes are male with no ADHD (216 → 250) and female with no ADHD (166 → 250), and female with ADHD stays at 250. The resulting balanced distribution contributed to improved model performance.
- ADHD/non-ADHD × male/female class counts (keys read as ADHD outcome, then Sex_F):
  - Original distribution: {'1_0': 581, '1_1': 250, '0_0': 216, '0_1': 166}
  - After balancing: {'0_0': 250, '0_1': 250, '1_0': 250, '1_1': 250}
- We also explored class weighting, ADASYN, and SMOTEENN; SMOTE proved most effective for this particular dataset.
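The balancing targets can be derived from the class counts with a few lines of standard-library Python. This is a sketch under stated assumptions: the helper `split_sampling_targets` is hypothetical, and the target of 250 per class is read off the "After balancing" counts above. Dictionaries of this shape are what imbalanced-learn's `RandomUnderSampler` and `SMOTE` accept as `sampling_strategy`; whether the notebook passes them this way is not confirmed here.

```python
from collections import Counter

def split_sampling_targets(counts, target):
    """Split classes into those to undersample and those to oversample to `target`."""
    under = {label: target for label, n in counts.items() if n > target}
    over = {label: target for label, n in counts.items() if n < target}
    return under, over

# Class counts reported in the original distribution above.
original = Counter({"1_0": 581, "1_1": 250, "0_0": 216, "0_1": 166})

under, over = split_sampling_targets(original, target=250)

# Expected counts once both steps have been applied.
balanced = {label: 250 for label in original}
```

Classes already at the target appear in neither dictionary, so they are left untouched by both steps.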
Challenges and Assumptions:
- Imbalance in Data Examples:
  The dataset exhibited an imbalance between male and female participants (with almost twice as much data representing male participants), which contributed to lower accuracy in sex prediction models.
- Data Integration:
  Aligning participant IDs across multiple data types required careful preprocessing to maintain data integrity.
- Computational Constraints:
  The large size of the connectome data (around 20,000 features) posed challenges in terms of computation and memory usage during initial exploration and model training.

Feature distribution in training data	Visualization
Gender distribution
ADHD distribution

🧠 Model Development

- XGBoost with SMOTE oversampling and linear objective: Kaggle score 0.69326
- LightGBM: Kaggle score 0.41
- CatBoost: Kaggle score 0.34
Feature selection and hyperparameter tuning via XGBClassifier with parameters:
- random_state = 9
- n_estimators = 100
- learning_rate = 0.5
- max_depth = 10
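The parameters listed above could be collected in one place and passed to `XGBClassifier` as keyword arguments. This is a minimal sketch, not the notebook's code: the names `XGB_PARAMS` and `build_classifier` are hypothetical, and the import is done lazily so the parameter dictionary can be inspected without xgboost installed.

```python
# Hyperparameters as listed in this README.
XGB_PARAMS = {
    "random_state": 9,
    "n_estimators": 100,
    "learning_rate": 0.5,
    "max_depth": 10,
}

def build_classifier(params=XGB_PARAMS):
    """Create an XGBClassifier from the given parameters (requires xgboost)."""
    from xgboost import XGBClassifier  # imported here so the dict is usable without xgboost
    return XGBClassifier(**params)
```

Keeping the values in a dictionary makes it straightforward to reuse the same settings for each target or to override a single value during tuning.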

The Right Model, the Right Method

Our initial modeling strategy involved predicting both ADHD_Outcome and Sex_F simultaneously using a multi-output classifier.

This approach underperformed, as it failed to capture the distinct statistical structure of each target and often conflated their signals — particularly problematic given the correlation between sex and ADHD status in the dataset.

Midway through the project, we pivoted to training separate models for each target, which markedly improved performance for the imbalanced and complex ADHD classification task.

Why Separating the Targets Helped

- Distinct feature dependencies: ADHD and sex are predicted by different feature subsets; joint modeling diluted the signal for ADHD.
- Reduced overfitting: The joint model overfit on the easier Sex_F prediction, compromising ADHD accuracy.
- Task-specific optimization: Separate models allowed for tuning loss functions, metrics, and class balancing strategies tailored to each target.
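The one-model-per-target pattern can be sketched as follows. Everything here is illustrative: `fit_per_target` is a hypothetical helper, the data is made up, and `MajorityClassifier` is a stand-in that only predicts the most common training label so the example runs without extra dependencies. In practice the factory would return a separately configured classifier for each target.

```python
from collections import Counter

class MajorityClassifier:
    """Stand-in model: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

def fit_per_target(make_model, X, targets):
    """Train one independent model per target column."""
    return {name: make_model().fit(X, y) for name, y in targets.items()}

# Made-up features and labels, for illustration only.
X = [[0.1], [0.4], [0.35], [0.8]]
targets = {"ADHD_Outcome": [1, 1, 0, 1], "Sex_F": [0, 1, 0, 0]}

models = fit_per_target(MajorityClassifier, X, targets)
predictions = {name: model.predict(X[:2]) for name, model in models.items()}
```

Because each model is built by its own call to the factory, nothing is shared between targets, which is what allows per-target balancing, metrics, and tuning.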

📈 Results & Key Findings

For XGB model:

Overall Kaggle score: 0.69326
Cross-validation accuracy:

Findings in training data	Visualization
25 most predictive features for ADHD
25 most predictive features for gender

🖼️ Impact Narrative

1. Brain Activity Patterns and Gender Differences

Our project integrated fMRI data with behavioral and socio-demographic measures to pinpoint the neural signatures of ADHD—and how these signatures differ between males and females. Here are some specifics from our feature analysis:

🧠 ADHD-Related fMRI Markers:

Features like MRI_Track_Scan_Location_1 (coefficient ≈ 0.2001) and MRI_Track_Scan_Location_3 (≈ 0.1853) indicate that spatial brain activation patterns in regions tied to attention and executive function are predictive of ADHD. These markers suggest disruptions in networks such as the default mode and prefrontal circuits.

🧑‍🧑‍🧒‍🧒 Behavioral and Socio-Demographic Influences:

The top predictor for ADHD outcome was SDQ_SDQ_Hyperactivity (≈ 0.6567), highlighting the strong link between hyperactivity symptoms and ADHD. Other behavioral measures like SDQ_SDQ_Conduct_Problems (≈ 0.3615) also emerged as key predictors. Socio-demographic features (e.g., Basic_Demos_Enroll_Year_2016 and Basic_Demos_Enroll_Year_2018) further underline how contextual factors might modulate ADHD expression.

👧 Gender-Specific Differences:

When predicting sex, our model identified distinct features. For instance, Barratt_Barratt_P1_Edu_21 (≈ 0.2661) suggests that educational or related socio-economic factors differ significantly between genders. Additionally, several fMRI-derived features—such as 158throw_191thcolumn (≈ 0.2478)—imply that subtle variations in brain connectivity or activation patterns are instrumental in differentiating males from females. Behavioral markers like SDQ_SDQ_Emotional_Problems (≈ 0.2381) and SDQ_SDQ_Prosocial (≈ 0.2322) further suggest that emotional regulation and social behavior may manifest differently, hinting that female ADHD patients might exhibit unique neural and behavioral profiles compared to males.

2. Contributions to ADHD Research and Clinical Care

By combining detailed fMRI analyses with behavioral and socio-demographic data, our work advances ADHD research and clinical practice in several meaningful ways:

🧑‍⚕️ Refined Diagnostic Models:

The identification of both shared and gender-specific predictors (e.g., hyperactivity and fMRI scan locations for ADHD, versus education and nuanced fMRI “throw” features for sex) supports the development of more precise, individualized diagnostic criteria. This is particularly crucial for recognizing ADHD in females, who are frequently underdiagnosed due to subtler or different symptom presentations.

💊 Personalized Intervention Strategies:

With clear evidence that brain activity patterns and behavioral traits vary by gender, clinicians can tailor intervention strategies. For example, treatments for female patients might focus more on emotional regulation and cognitive control if their neuroimaging data suggest distinct activation in relevant brain regions.

🚀 Next Steps & Future Improvements

Limitations

One of the primary challenges we encountered was the limitation in computational and technical resources. The large dataset significantly slowed down the hyperparameter tuning process, meaning we couldn't explore the full range of parameter configurations efficiently. This constraint forced us to compromise on the depth of hyperparameter optimization, potentially leaving some performance gains on the table.

Future Improvements with More Resources:

With additional time and better computational infrastructure (e.g., access to high-performance computing clusters or GPUs), we would:

🛠️ Expand Hyperparameter Tuning: Implement a more extensive grid search or advanced optimization techniques such as Bayesian optimization. This would allow for a more thorough exploration of the hyperparameter space, potentially yielding a more optimal model.
🙌 Enhance Model Scalability: Run more complex and resource-intensive models, such as end-to-end deep learning approaches that work directly on raw fMRI data, capturing intricate spatial and temporal patterns that might be missed with our current feature-engineered approach.
📊 Optimize Data Processing Pipelines: Invest in more efficient data preprocessing and model training pipelines to better handle large datasets, reducing training times and enabling more iterative experimentation.
