Automatic tagging of EEG datasets using LLM-based predictions. This tool scrapes the EEGDash website, extracts metadata from BIDS-formatted datasets, and uses large language models to predict pathology, modality, and experiment-type tags.
- Dataset Scraping: Automatically discovers datasets from EEGDash
- Metadata Extraction: Parses BIDS-formatted datasets from GitHub/OpenNeuro
- Paper Abstract Fetching: Extracts DOIs from references and fetches abstracts from CrossRef, Semantic Scholar, and PubMed APIs with persistent caching
- Few-Shot Learning: Uses labeled examples for in-context learning to improve classification accuracy
- LLM-based Tagging: Uses language models (GPT-4, Claude) via OpenRouter.ai with structured reasoning
- CSV Updates: Automatically updates dataset catalogs with predictions
- Python 3.11 or higher
- Git
- GitHub Personal Access Token (for API access)
- Clone the repository:

```bash
git clone https://github.com/yourusername/eegdash-llm-tagger.git
cd eegdash-llm-tagger
```

- Install the package in development mode:

```bash
pip install -e .
```

- Set up environment variables:

```bash
# Copy the example environment file
cp .env.example .env

# Edit .env with your actual API keys
# NEVER commit the .env file to git - it's already in .gitignore
```

Edit `.env` and add your actual API keys:

- `OPENROUTER_API_KEY`: Get from https://openrouter.ai/
- `GITHUB_TOKEN`: Generate at https://github.com/settings/tokens
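Once the keys are set, a script can fail fast if any of them is missing instead of erroring mid-run. The repository does not ship such a helper; this is a minimal sketch (the `require_env` name is hypothetical):

```python
import os


def require_env(*names: str) -> dict:
    """Return the requested environment variables, failing fast when any is missing."""
    missing = [n for n in names if not os.environ.get(n)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {n: os.environ[n] for n in names}


# Example (hypothetical usage for this project's keys):
# keys = require_env("OPENROUTER_API_KEY", "GITHUB_TOKEN")
```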
- Load environment variables (required for each terminal session):

```bash
# Option 1: Source the .env file (bash/zsh)
export $(cat .env | xargs)

# Option 2: Use python-dotenv (automatic loading)
pip install python-dotenv
```

Find and process datasets with missing tags:

```bash
python scripts/fetch_incomplete_datasets.py \
    --output-json data/processed/incomplete_metadata.json \
    --verbose
```

Process datasets that already have complete tags (for training data):
```bash
python scripts/fetch_complete_datasets.py \
    --output-json data/processed/complete_metadata.json \
    --limit 10 \
    --verbose
```

Apply LLM predictions to update a dataset CSV:
```bash
python scripts/update_csv.py \
    --llm-json data/processed/llm_output.json \
    --csv dataset_summary.csv \
    --confidence-threshold 0.5 \
    --verbose
```

- Get an API key from https://openrouter.ai/
- Add it to your `.env` file (see the Installation section above)
- Load environment variables:

```bash
export $(cat .env | xargs)
```
Test the API integration with one dataset first:

```bash
python scripts/test_llm_tagger.py
```

This will:

- Load the first dataset from `data/processed/incomplete_metadata.json`
- Call the OpenRouter API using GPT-4 Turbo
- Display tagging results with confidence scores and reasoning
- Save output to `data/processed/test_llm_output.json`
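Under the hood, OpenRouter exposes an OpenAI-compatible chat-completions endpoint. A hedged sketch of the request shape such a tagger would assemble; the helper name and prompt contents are illustrative, not the project's actual code:

```python
import os

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_openrouter_request(model: str, system_prompt: str, dataset_metadata: str) -> dict:
    """Assemble headers and body for OpenRouter's OpenAI-compatible chat endpoint."""
    return {
        "url": OPENROUTER_URL,
        "headers": {
            "Authorization": f"Bearer {os.environ.get('OPENROUTER_API_KEY', '')}",
            "Content-Type": "application/json",
        },
        "body": {
            "model": model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": dataset_metadata},
            ],
        },
    }
```

The body would then be POSTed with any HTTP client; in this project, the system prompt corresponds to `prompt.md`.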
Tag all incomplete datasets (or a limited subset for testing):

```bash
# Process the first 5 datasets (recommended for testing)
python scripts/tag_with_llm.py \
    --input data/processed/incomplete_metadata.json \
    --output data/processed/llm_output.json \
    --model openai/gpt-4-turbo \
    --limit 5 \
    --verbose

# Process all datasets (may incur significant API costs)
python scripts/tag_with_llm.py \
    --input data/processed/incomplete_metadata.json \
    --output data/processed/llm_output.json \
    --model openai/gpt-4-turbo \
    --verbose
```

After generating predictions, update your CSV:
```bash
python scripts/update_csv.py \
    --llm-json data/processed/llm_output.json \
    --csv dataset_summary.csv \
    --confidence-threshold 0.8 \
    --verbose
```

Supported models:

- `openai/gpt-4-turbo` - GPT-4 Turbo (recommended, ~$0.13/dataset)
- `openai/gpt-4` - GPT-4 (more expensive, ~$0.33/dataset)
- `anthropic/claude-3-opus` - Claude 3 Opus (~$0.23/dataset)
- `anthropic/claude-3-sonnet` - Claude 3 Sonnet (faster, cheaper)
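The `--confidence-threshold` flag gates which predictions reach the CSV. A minimal sketch of that kind of filtering, assuming each prediction record carries a top-level `confidence` field (the record shape is an assumption, not the project's actual schema):

```python
def filter_predictions(predictions: list, threshold: float = 0.8) -> list:
    """Keep only records whose confidence meets the threshold; records
    without a confidence field are treated as 0.0 and dropped."""
    return [p for p in predictions if p.get("confidence", 0.0) >= threshold]


# Illustrative records: only the first survives a 0.8 threshold
sample = [
    {"dataset": "a", "confidence": 0.92},
    {"dataset": "b", "confidence": 0.55},
]
kept = filter_predictions(sample, threshold=0.8)
```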
Based on ~9,500 tokens per dataset (8K input + 1.5K output):

- GPT-4 Turbo: $0.13 per dataset ($38 for all 295)
- GPT-4: $0.33 per dataset ($97 for all 295)
- Claude 3 Opus: $0.23 per dataset ($68 for all 295)
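These per-dataset figures follow from straightforward token arithmetic. A worked sketch; the per-million-token prices used below are assumptions based on published GPT-4 Turbo list prices, not values from this project:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float,
                  n_datasets: int = 1) -> float:
    """Linear token pricing: tokens / 1e6 * price-per-million, summed over input and output."""
    per_dataset = (input_tokens / 1e6 * usd_per_m_input
                   + output_tokens / 1e6 * usd_per_m_output)
    return per_dataset * n_datasets


# Assumed GPT-4 Turbo prices: $10/M input tokens, $30/M output tokens.
# 8,000 input + 1,500 output tokens -> $0.125 per dataset (~$0.13),
# or roughly $37 across 295 datasets.
total = estimate_cost(8_000, 1_500, 10.0, 30.0, n_datasets=295)
```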
Recommendation: Start with `--limit 5` to verify results before processing all datasets.
The LLM tagger uses in-context learning with curated labeled examples to improve classification accuracy.
- Few-shot examples (`data/processed/few_shot_examples.json`) contain labeled datasets showing EEGDash's classification patterns
- When tagging a new dataset, these examples are included in the prompt as reference
- The LLM learns the labeling conventions from the examples and applies them consistently
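In-context learning of this kind amounts to concatenating labeled examples ahead of the new, unlabeled dataset. A simplified sketch; the real prompt format is defined in `prompt.md`, and the layout and field names here are illustrative:

```python
import json


def build_few_shot_prompt(examples: list, new_dataset: dict) -> str:
    """Concatenate labeled examples, then the unlabeled dataset, ending at 'Tags:'
    so the model completes the missing labels in the same format."""
    parts = []
    for ex in examples:
        parts.append(
            f"Dataset: {json.dumps(ex['metadata'])}\nTags: {json.dumps(ex['tags'])}"
        )
    parts.append(f"Dataset: {json.dumps(new_dataset)}\nTags:")
    return "\n\n".join(parts)
```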
The LLM follows a priority-based reasoning process (defined in `prompt.md`):
- Few-shot analysis: Compare with similar labeled examples
- Metadata analysis: Extract relevant info from BIDS metadata fields
- Paper abstract analysis: Use paper abstracts to disambiguate unclear cases
- Decision summary: Justify final labels with confidence scores
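A record produced by such a process can be sanity-checked before use. A hedged sketch, assuming each category arrives as a `label`/`confidence` pair; the field names are assumptions, not the schema defined in `prompt.md`:

```python
REQUIRED_TAGS = ("pathology", "modality", "type")


def validate_decision(record: dict) -> bool:
    """Require each tag category as a label/confidence pair with confidence in [0, 1]."""
    for field in REQUIRED_TAGS:
        entry = record.get(field)
        if not isinstance(entry, dict) or "label" not in entry:
            return False
        conf = entry.get("confidence")
        if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
            return False
    return True
```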
The tagger classifies datasets into three categories:
- Pathology: Healthy, Epilepsy, Depression, Parkinson's Disease, etc.
- Modality: Visual, Auditory, Motor, Resting State, Sleep, etc.
- Type: Perception, Memory, Attention, Motor, Clinical/Intervention, etc.
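Free-text LLM output can drift from the catalog's vocabulary, so a normalization step helps before writing tags back. A sketch using the vocabularies listed above (incomplete by design, per the "etc."):

```python
# Known tag vocabularies from the README's category lists (partial; "etc." omitted)
CANONICAL = {
    "pathology": {"healthy", "epilepsy", "depression", "parkinson's disease"},
    "modality": {"visual", "auditory", "motor", "resting state", "sleep"},
    "type": {"perception", "memory", "attention", "motor", "clinical/intervention"},
}


def normalize_tag(category: str, label: str):
    """Case-insensitive match against the known vocabulary; None if out of vocabulary."""
    value = label.strip().lower()
    return value if value in CANONICAL.get(category, set()) else None
```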
The tool automatically fetches paper abstracts to provide additional context for classification.
Abstracts are fetched from multiple APIs (in order):
- CrossRef API - Best for general academic papers
- Semantic Scholar API - Good for CS/AI papers
- PubMed API - Best for biomedical papers
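A fallback chain like this is naturally expressed as trying providers in order and returning the first usable result. A minimal sketch, with provider callables standing in for the real CrossRef/Semantic Scholar/PubMed clients:

```python
def fetch_abstract(doi: str, providers: list):
    """Try each provider callable in order; return the first non-empty abstract, else None.

    Provider errors and empty results both fall through to the next API.
    """
    for fetch in providers:
        try:
            abstract = fetch(doi)
        except Exception:
            continue  # provider unavailable or errored: try the next one
        if abstract:
            return abstract
    return None
```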
- Abstracts are cached in `data/processed/abstract_cache.json`
- The cache is persistent across runs to minimize API calls
- Both successful fetches and failures are cached
DOIs are automatically extracted from the References field in dataset descriptions. The tool filters out OpenNeuro dataset DOIs and only fetches paper abstracts.
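A sketch of that extraction: a generic DOI pattern plus a prefix filter for OpenNeuro dataset DOIs. Both the `10.18112` OpenNeuro prefix and the sample DOI in the test are illustrative assumptions, not values taken from the project's code:

```python
import re

# Matches the common "10.<registrant>/<suffix>" DOI shape in free text
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")

# OpenNeuro dataset DOIs use this registrant prefix (assumption)
OPENNEURO_PREFIX = "10.18112/"


def extract_paper_dois(references: str) -> list:
    """Pull DOIs out of free-text references, skipping OpenNeuro dataset DOIs."""
    dois = [d.rstrip(".,;") for d in DOI_RE.findall(references)]
    return [d for d in dois if not d.startswith(OPENNEURO_PREFIX)]
```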
```
eegdash-llm-tagger/
├── src/eegdash_tagger/            # Main package
│   ├── metadata/                  # BIDS metadata parsing
│   │   ├── parser.py              # BIDS file parsing
│   │   └── providers.py           # GitHub/OpenNeuro data providers
│   ├── scraping/                  # Web scraping and data collection
│   │   ├── scraper.py             # EEGDash website scraper
│   │   ├── enrichment.py          # Dataset metadata enrichment
│   │   ├── abstract_fetcher.py    # Paper abstract fetching (CrossRef/Semantic Scholar/PubMed)
│   │   └── dataset_filters.py     # Dataset filtering utilities
│   ├── tagging/                   # LLM-based tagging
│   │   ├── tagger.py              # Tagger protocol and types
│   │   └── llm_tagger.py          # OpenRouter.ai LLM implementation
│   └── utils/                     # CSV updates and helpers
│       └── csv_updater.py         # CSV update utilities
├── scripts/                       # CLI entry points
│   ├── fetch_incomplete_datasets.py
│   ├── fetch_complete_datasets.py
│   ├── tag_with_llm.py
│   ├── test_llm_tagger.py
│   └── update_csv.py
├── tests/                         # Test files
├── tools/                         # Utility scripts
├── data/                          # Data files (gitignored)
│   ├── processed/                 # Generated metadata
│   └── test/                      # Test datasets
├── prompt.md                      # LLM system prompt with reasoning framework
├── environment.yml                # Conda environment
└── setup.py                       # Package configuration
```
The `data/` directory contains generated metadata files and is excluded from git:

- `data/processed/`: Production metadata files
  - `complete_metadata.json` - Datasets with full tags (used for few-shot examples)
  - `incomplete_metadata.json` - Datasets needing tags
  - `few_shot_examples.json` - Curated labeled examples for in-context learning
  - `llm_output.json` - LLM prediction results
  - `abstract_cache.json` - Cached paper abstracts (auto-generated)
- `data/test/`: Test datasets for development
```bash
# Test metadata extraction
python tests/test_metadata.py

# Test GitHub API integration
export GITHUB_TOKEN="your_token"
python tests/test_github_api.py
```

```bash
# Show metadata for a dataset
python tools/show_metadata.py /path/to/dataset

# Create test/training datasets
python tools/create_test_set.py
```

[Add your license here]
Contributions are welcome! Please open an issue or submit a pull request.
- EEGDash project for the dataset catalog
- OpenNeuro for hosting EEG datasets
- BIDS format for standardized neuroimaging data