Automatic tagging of EEG datasets using LLM-based predictions. This tool scrapes the EEGDash website, extracts metadata from BIDS-formatted datasets, and uses large language models to predict pathology, modality, and experiment-type tags.
- Dataset Scraping: Automatically discovers datasets from EEGDash
- Metadata Extraction: Parses BIDS-formatted datasets from GitHub/OpenNeuro
- Paper Abstract Fetching: Extracts DOIs from references and fetches abstracts from CrossRef, Semantic Scholar, and PubMed APIs with persistent caching
- Few-Shot Learning: Uses labeled examples for in-context learning to improve classification accuracy
- LLM-based Tagging: Uses language models (GPT-4, Claude) via OpenRouter.ai with structured reasoning
- CSV Updates: Automatically updates dataset catalogs with predictions
- Python 3.11 or higher
- Git
- GitHub Personal Access Token (for API access)
- Clone the repository:

```bash
git clone https://github.com/yourusername/eegdash-llm-tagger.git
cd eegdash-llm-tagger
```

- Install the package in development mode:

```bash
pip install -e .
```

- Set up environment variables:

```bash
# Copy the example environment file
cp .env.example .env

# Edit .env with your actual API keys
# NEVER commit the .env file to git - it's already in .gitignore
```

Edit `.env` and add your actual API keys:

- `OPENROUTER_API_KEY`: Get from https://openrouter.ai/
- `GITHUB_TOKEN`: Generate at https://github.com/settings/tokens
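Once the keys are set, a script can fail fast if any of them is missing instead of erroring mid-run. The repository does not ship such a helper; this is a minimal sketch (the `require_env` name is hypothetical):

```python
import os


def require_env(*names: str) -> dict:
    """Return the requested environment variables, failing fast when any is missing."""
    missing = [n for n in names if not os.environ.get(n)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {n: os.environ[n] for n in names}


# Example (hypothetical usage for this project's keys):
# keys = require_env("OPENROUTER_API_KEY", "GITHUB_TOKEN")
```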
- Load environment variables (required for each terminal session):

```bash
# Option 1: Source the .env file (bash/zsh)
export $(cat .env | xargs)

# Option 2: Use python-dotenv (automatic loading)
pip install python-dotenv
```

Find and process datasets with missing tags:

```bash
python scripts/fetch_incomplete_datasets.py \
    --output-json data/processed/incomplete_metadata.json \
    --verbose
```

Process datasets that already have complete tags (for training data):
```bash
python scripts/fetch_complete_datasets.py \
    --output-json data/processed/complete_metadata.json \
    --limit 10 \
    --verbose
```

Apply LLM predictions to update a dataset CSV:
```bash
python scripts/update_csv.py \
    --llm-json data/processed/llm_output.json \
    --csv dataset_summary.csv \
    --confidence-threshold 0.5 \
    --verbose
```

- Get an API key from https://openrouter.ai/
- Add it to your `.env` file (see the Installation section above)
- Load environment variables:

```bash
export $(cat .env | xargs)
```
Test the API integration with one dataset first:

```bash
python scripts/test_llm_tagger.py
```

This will:

- Load the first dataset from `data/processed/incomplete_metadata.json`
- Call the OpenRouter API using GPT-4 Turbo
- Display tagging results with confidence scores and reasoning
- Save output to `data/processed/test_llm_output.json`
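Under the hood, OpenRouter exposes an OpenAI-compatible chat-completions endpoint. A hedged sketch of the request shape such a tagger would assemble; the helper name and prompt contents are illustrative, not the project's actual code:

```python
import os

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_openrouter_request(model: str, system_prompt: str, dataset_metadata: str) -> dict:
    """Assemble headers and body for OpenRouter's OpenAI-compatible chat endpoint."""
    return {
        "url": OPENROUTER_URL,
        "headers": {
            "Authorization": f"Bearer {os.environ.get('OPENROUTER_API_KEY', '')}",
            "Content-Type": "application/json",
        },
        "body": {
            "model": model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": dataset_metadata},
            ],
        },
    }
```

The body would then be POSTed with any HTTP client; in this project, the system prompt corresponds to `prompt.md`.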
Tag all incomplete datasets (or a limited subset for testing):

```bash
# Process the first 5 datasets (recommended for testing)
python scripts/tag_with_llm.py \
    --input data/processed/incomplete_metadata.json \
    --output data/processed/llm_output.json \
    --model openai/gpt-4-turbo \
    --limit 5 \
    --verbose

# Process all datasets (may incur significant API costs)
python scripts/tag_with_llm.py \
    --input data/processed/incomplete_metadata.json \
    --output data/processed/llm_output.json \
    --model openai/gpt-4-turbo \
    --verbose
```

After generating predictions, update your CSV:
```bash
python scripts/update_csv.py \
    --llm-json data/processed/llm_output.json \
    --csv dataset_summary.csv \
    --confidence-threshold 0.8 \
    --verbose
```

Supported models:

- `openai/gpt-4-turbo` - GPT-4 Turbo (recommended, ~$0.13/dataset)
- `openai/gpt-4` - GPT-4 (more expensive, ~$0.33/dataset)
- `anthropic/claude-3-opus` - Claude 3 Opus (~$0.23/dataset)
- `anthropic/claude-3-sonnet` - Claude 3 Sonnet (faster, cheaper)
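The `--confidence-threshold` flag gates which predictions reach the CSV. A minimal sketch of that kind of filtering, assuming each prediction record carries a top-level `confidence` field (the record shape is an assumption, not the project's actual schema):

```python
def filter_predictions(predictions: list, threshold: float = 0.8) -> list:
    """Keep only records whose confidence meets the threshold; records
    without a confidence field are treated as 0.0 and dropped."""
    return [p for p in predictions if p.get("confidence", 0.0) >= threshold]


# Illustrative records: only the first survives a 0.8 threshold
sample = [
    {"dataset": "a", "confidence": 0.92},
    {"dataset": "b", "confidence": 0.55},
]
kept = filter_predictions(sample, threshold=0.8)
```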
Based on ~9,500 tokens per dataset (8K input + 1.5K output):

- GPT-4 Turbo: $0.13 per dataset ($38 for all 295)
- GPT-4: $0.33 per dataset ($97 for all 295)
- Claude 3 Opus: $0.23 per dataset ($68 for all 295)
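These per-dataset figures follow from straightforward token arithmetic. A worked sketch; the per-million-token prices used below are assumptions based on published GPT-4 Turbo list prices, not values from this project:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float,
                  n_datasets: int = 1) -> float:
    """Linear token pricing: tokens / 1e6 * price-per-million, summed over input and output."""
    per_dataset = (input_tokens / 1e6 * usd_per_m_input
                   + output_tokens / 1e6 * usd_per_m_output)
    return per_dataset * n_datasets


# Assumed GPT-4 Turbo prices: $10/M input tokens, $30/M output tokens.
# 8,000 input + 1,500 output tokens -> $0.125 per dataset (~$0.13),
# or roughly $37 across 295 datasets.
total = estimate_cost(8_000, 1_500, 10.0, 30.0, n_datasets=295)
```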
Recommendation: Start with `--limit 5` to verify results before processing all datasets.
The LLM tagger uses in-context learning with curated labeled examples to improve classification accuracy.
- Few-shot examples (`data/processed/few_shot_examples.json`) contain labeled datasets showing EEGDash's classification patterns
- When tagging a new dataset, these examples are included in the prompt as reference
- The LLM learns the labeling conventions from the examples and applies them consistently
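In-context learning of this kind amounts to concatenating labeled examples ahead of the new, unlabeled dataset. A simplified sketch; the real prompt format is defined in `prompt.md`, and the layout and field names here are illustrative:

```python
import json


def build_few_shot_prompt(examples: list, new_dataset: dict) -> str:
    """Concatenate labeled examples, then the unlabeled dataset, ending at 'Tags:'
    so the model completes the missing labels in the same format."""
    parts = []
    for ex in examples:
        parts.append(
            f"Dataset: {json.dumps(ex['metadata'])}\nTags: {json.dumps(ex['tags'])}"
        )
    parts.append(f"Dataset: {json.dumps(new_dataset)}\nTags:")
    return "\n\n".join(parts)
```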
The LLM follows a priority-based reasoning process (defined in `prompt.md`):
- Few-shot analysis: Compare with similar labeled examples
- Metadata analysis: Extract relevant info from BIDS metadata fields
- Paper abstract analysis: Use paper abstracts to disambiguate unclear cases
- Decision summary: Justify final labels with confidence scores
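A record produced by such a process can be sanity-checked before use. A hedged sketch, assuming each category arrives as a `label`/`confidence` pair; the field names are assumptions, not the schema defined in `prompt.md`:

```python
REQUIRED_TAGS = ("pathology", "modality", "type")


def validate_decision(record: dict) -> bool:
    """Require each tag category as a label/confidence pair with confidence in [0, 1]."""
    for field in REQUIRED_TAGS:
        entry = record.get(field)
        if not isinstance(entry, dict) or "label" not in entry:
            return False
        conf = entry.get("confidence")
        if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
            return False
    return True
```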
The tagger classifies datasets into three categories:
- Pathology: Healthy, Epilepsy, Depression, Parkinson's Disease, etc.
- Modality: Visual, Auditory, Motor, Resting State, Sleep, etc.
- Type: Perception, Memory, Attention, Motor, Clinical/Intervention, etc.
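Free-text LLM output can drift from the catalog's vocabulary, so a normalization step helps before writing tags back. A sketch using the vocabularies listed above (incomplete by design, per the "etc."):

```python
# Known tag vocabularies from the README's category lists (partial; "etc." omitted)
CANONICAL = {
    "pathology": {"healthy", "epilepsy", "depression", "parkinson's disease"},
    "modality": {"visual", "auditory", "motor", "resting state", "sleep"},
    "type": {"perception", "memory", "attention", "motor", "clinical/intervention"},
}


def normalize_tag(category: str, label: str):
    """Case-insensitive match against the known vocabulary; None if out of vocabulary."""
    value = label.strip().lower()
    return value if value in CANONICAL.get(category, set()) else None
```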
The tool automatically fetches paper abstracts to provide additional context for classification.
Abstracts are fetched from multiple APIs (in order):
- CrossRef API - Best for general academic papers
- Semantic Scholar API - Good for CS/AI papers
- PubMed API - Best for biomedical papers
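A fallback chain like this is naturally expressed as trying providers in order and returning the first usable result. A minimal sketch, with provider callables standing in for the real CrossRef/Semantic Scholar/PubMed clients:

```python
def fetch_abstract(doi: str, providers: list):
    """Try each provider callable in order; return the first non-empty abstract, else None.

    Provider errors and empty results both fall through to the next API.
    """
    for fetch in providers:
        try:
            abstract = fetch(doi)
        except Exception:
            continue  # provider unavailable or errored: try the next one
        if abstract:
            return abstract
    return None
```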
- Abstracts are cached in `data/processed/abstract_cache.json`
- The cache is persistent across runs to minimize API calls
- Both successful fetches and failures are cached
DOIs are automatically extracted from the References field in dataset descriptions. The tool filters out OpenNeuro dataset DOIs and only fetches paper abstracts.
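A sketch of that extraction: a generic DOI pattern plus a prefix filter for OpenNeuro dataset DOIs. Both the `10.18112` OpenNeuro prefix and the sample DOI in the test are illustrative assumptions, not values taken from the project's code:

```python
import re

# Matches the common "10.<registrant>/<suffix>" DOI shape in free text
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")

# OpenNeuro dataset DOIs use this registrant prefix (assumption)
OPENNEURO_PREFIX = "10.18112/"


def extract_paper_dois(references: str) -> list:
    """Pull DOIs out of free-text references, skipping OpenNeuro dataset DOIs."""
    dois = [d.rstrip(".,;") for d in DOI_RE.findall(references)]
    return [d for d in dois if not d.startswith(OPENNEURO_PREFIX)]
```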
```
eegdash-llm-tagger/
├── src/eegdash_tagger/            # Main package
│   ├── metadata/                  # BIDS metadata parsing
│   │   ├── parser.py              # BIDS file parsing
│   │   └── providers.py           # GitHub/OpenNeuro data providers
│   ├── scraping/                  # Web scraping and data collection
│   │   ├── scraper.py             # EEGDash website scraper
│   │   ├── enrichment.py          # Dataset metadata enrichment
│   │   ├── abstract_fetcher.py    # Paper abstract fetching (CrossRef/Semantic Scholar/PubMed)
│   │   └── dataset_filters.py     # Dataset filtering utilities
│   ├── tagging/                   # LLM-based tagging
│   │   ├── tagger.py              # Tagger protocol and types
│   │   └── llm_tagger.py          # OpenRouter.ai LLM implementation
│   └── utils/                     # CSV updates and helpers
│       └── csv_updater.py         # CSV update utilities
├── scripts/                       # CLI entry points
│   ├── fetch_incomplete_datasets.py
│   ├── fetch_complete_datasets.py
│   ├── tag_with_llm.py
│   ├── test_llm_tagger.py
│   └── update_csv.py
├── tests/                         # Test files
├── tools/                         # Utility scripts
├── data/                          # Data files (gitignored)
│   ├── processed/                 # Generated metadata
│   └── test/                      # Test datasets
├── prompt.md                      # LLM system prompt with reasoning framework
├── environment.yml                # Conda environment
└── setup.py                       # Package configuration
```
The `data/` directory contains generated metadata files and is excluded from git:

- `data/processed/`: Production metadata files
  - `complete_metadata.json` - Datasets with full tags (used for few-shot examples)
  - `incomplete_metadata.json` - Datasets needing tags
  - `few_shot_examples.json` - Curated labeled examples for in-context learning
  - `llm_output.json` - LLM prediction results
  - `abstract_cache.json` - Cached paper abstracts (auto-generated)
- `data/test/`: Test datasets for development
```bash
# Test metadata extraction
python tests/test_metadata.py

# Test GitHub API integration
export GITHUB_TOKEN="your_token"
python tests/test_github_api.py
```

```bash
# Show metadata for a dataset
python tools/show_metadata.py /path/to/dataset

# Create test/training datasets
python tools/create_test_set.py
```

[Add your license here]
Contributions are welcome! Please open an issue or submit a pull request.
- EEGDash project for the dataset catalog
- OpenNeuro for hosting EEG datasets
- BIDS format for standardized neuroimaging data