
EEGDash LLM Tagger

Automatic tagging of EEG datasets using LLM-based predictions. This tool scrapes the EEGDash website, extracts metadata from BIDS-formatted datasets, and uses large language models to predict pathology, modality, and experiment-type tags.

Features

  • Dataset Scraping: Automatically discovers datasets from EEGDash
  • Metadata Extraction: Parses BIDS-formatted datasets from GitHub/OpenNeuro
  • Paper Abstract Fetching: Extracts DOIs from references and fetches abstracts from CrossRef, Semantic Scholar, and PubMed APIs with persistent caching
  • Few-Shot Learning: Uses labeled examples for in-context learning to improve classification accuracy
  • LLM-based Tagging: Uses language models (GPT-4, Claude) via OpenRouter.ai with structured reasoning
  • CSV Updates: Automatically updates dataset catalogs with predictions

Installation

Requirements

  • Python 3.11 or higher
  • Git
  • GitHub Personal Access Token (for API access)

Setup

  1. Clone the repository:
git clone https://github.com/yourusername/eegdash-llm-tagger.git
cd eegdash-llm-tagger
  2. Install the package in development mode:
pip install -e .
  3. Set up environment variables:
# Copy the example environment file
cp .env.example .env

# Edit .env with your actual API keys
# NEVER commit the .env file to git - it's already in .gitignore

  4. Load environment variables (required for each terminal session):
# Option 1: Source the .env file (bash/zsh)
export $(cat .env | xargs)

# Option 2: Use python-dotenv (automatic loading)
pip install python-dotenv
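If you go the python-dotenv route, its `load_dotenv()` call reads `.env` into the process environment at startup. The behavior amounts to something like this minimal sketch (a simplified stand-in, not the library's actual implementation):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: one KEY=VALUE per line, '#' comments and
    blank lines ignored. Existing environment variables win."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

With the real library you would instead call `from dotenv import load_dotenv; load_dotenv()` at the top of a script.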

Usage

Fetch Incomplete Datasets

Find and process datasets with missing tags:

python scripts/fetch_incomplete_datasets.py \
    --output-json data/processed/incomplete_metadata.json \
    --verbose

Fetch Complete Datasets

Process datasets that already have complete tags (for training data):

python scripts/fetch_complete_datasets.py \
    --output-json data/processed/complete_metadata.json \
    --limit 10 \
    --verbose

Update CSV with LLM Predictions

Apply LLM predictions to update a dataset CSV:

python scripts/update_csv.py \
    --llm-json data/processed/llm_output.json \
    --csv dataset_summary.csv \
    --confidence-threshold 0.5 \
    --verbose

LLM-Based Tagging with OpenRouter.ai

Setup

  1. Get an API key from https://openrouter.ai/
  2. Add it to your .env file (see Installation section above)
  3. Load environment variables:
    export $(cat .env | xargs)

Usage

Test with Single Dataset

Test the API integration with one dataset first:

python scripts/test_llm_tagger.py

This will:

  • Load the first dataset from data/processed/incomplete_metadata.json
  • Call the OpenRouter API using GPT-4 Turbo
  • Display tagging results with confidence scores and reasoning
  • Save output to data/processed/test_llm_output.json

Process All Incomplete Datasets

Tag all incomplete datasets (or a limited subset for testing):

# Process first 5 datasets (recommended for testing)
python scripts/tag_with_llm.py \
    --input data/processed/incomplete_metadata.json \
    --output data/processed/llm_output.json \
    --model openai/gpt-4-turbo \
    --limit 5 \
    --verbose

# Process all datasets (may incur significant API costs)
python scripts/tag_with_llm.py \
    --input data/processed/incomplete_metadata.json \
    --output data/processed/llm_output.json \
    --model openai/gpt-4-turbo \
    --verbose

Update CSV with LLM Results

After generating predictions, update your CSV:

python scripts/update_csv.py \
    --llm-json data/processed/llm_output.json \
    --csv dataset_summary.csv \
    --confidence-threshold 0.8 \
    --verbose
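The `--confidence-threshold` flag keeps only predictions at or above the cutoff; conceptually the filter looks like this (the `confidence` field name is an assumption about the `llm_output.json` schema):

```python
def filter_by_confidence(predictions, threshold=0.8):
    """Keep predictions whose confidence meets the cutoff.

    `predictions` is a list of dicts; entries without a confidence
    score are treated as 0.0 and dropped.
    """
    return [p for p in predictions if p.get("confidence", 0.0) >= threshold]
```

A higher threshold writes fewer, more reliable tags into the CSV; lower thresholds fill more gaps at the cost of accuracy.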

Supported Models

  • openai/gpt-4-turbo - GPT-4 Turbo (recommended, ~$0.13/dataset)
  • openai/gpt-4 - GPT-4 (more expensive, ~$0.33/dataset)
  • anthropic/claude-3-opus - Claude 3 Opus (~$0.23/dataset)
  • anthropic/claude-3-sonnet - Claude 3 Sonnet (faster, cheaper)

Cost Estimation

Based on ~9,500 tokens per dataset (8K input + 1.5K output):

  • GPT-4 Turbo: $0.13 per dataset ($38 for all 295)
  • GPT-4: $0.33 per dataset ($97 for all 295)
  • Claude 3 Opus: $0.23 per dataset ($68 for all 295)
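The per-dataset figures follow from the token counts above times per-token rates. As a sanity check, the $/1K-token prices below (assumptions based on the providers' commonly published rates, which change over time) reproduce the table:

```python
def dataset_cost(input_tokens, output_tokens, in_price_per_1k, out_price_per_1k):
    """Cost of one dataset: tokens (in thousands) times the per-1K-token rate."""
    return input_tokens / 1000 * in_price_per_1k + output_tokens / 1000 * out_price_per_1k

cost_turbo = dataset_cost(8000, 1500, 0.01, 0.03)    # 0.125  -> ~$0.13
cost_gpt4  = dataset_cost(8000, 1500, 0.03, 0.06)    # 0.33
cost_opus  = dataset_cost(8000, 1500, 0.015, 0.075)  # 0.2325 -> ~$0.23
```

Multiply by 295 datasets for the full-run totals quoted above.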

Recommendation: Start with --limit 5 to verify results before processing all datasets.

Few-Shot Learning

The LLM tagger uses in-context learning with curated labeled examples to improve classification accuracy.

How It Works

  1. Few-shot examples (data/processed/few_shot_examples.json) contain labeled datasets showing EEGDash's classification patterns
  2. When tagging a new dataset, these examples are included in the prompt as reference
  3. The LLM learns the labeling conventions from examples and applies them consistently
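The three steps above amount to concatenating the labeled examples and the unlabeled dataset into one prompt. A rough sketch of that assembly (the section headers and JSON layout are illustrative, not the tool's exact prompt format):

```python
import json

def build_prompt(system_prompt, examples, new_dataset):
    """Assemble an in-context-learning prompt: the system instructions,
    then labeled examples, then the unlabeled dataset to tag."""
    parts = [system_prompt, "\n## Labeled examples\n"]
    for ex in examples:
        parts.append(json.dumps(ex))
    parts.append("\n## Dataset to tag\n")
    parts.append(json.dumps(new_dataset))
    return "\n".join(parts)
```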

Structured Reasoning

The LLM follows a priority-based reasoning process (defined in prompt.md):

  1. Few-shot analysis: Compare with similar labeled examples
  2. Metadata analysis: Extract relevant info from BIDS metadata fields
  3. Paper abstract analysis: Use paper abstracts to disambiguate unclear cases
  4. Decision summary: Justify final labels with confidence scores

Labels

The tagger classifies datasets into three categories:

  • Pathology: Healthy, Epilepsy, Depression, Parkinson's Disease, etc.
  • Modality: Visual, Auditory, Motor, Resting State, Sleep, etc.
  • Type: Perception, Memory, Attention, Motor, Clinical/Intervention, etc.
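One prediction therefore carries a value for each of the three categories plus a confidence score. A sketch of that record as a dataclass (field names are assumptions mirroring the categories above, not the tool's actual types):

```python
from dataclasses import dataclass

@dataclass
class TagPrediction:
    """One dataset's predicted tags, one value per label category."""
    pathology: str     # e.g. "Healthy", "Epilepsy"
    modality: str      # e.g. "Visual", "Resting State"
    type: str          # e.g. "Perception", "Memory"
    confidence: float  # 0.0-1.0, compared against --confidence-threshold
```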

Paper Abstract Fetching

The tool automatically fetches paper abstracts to provide additional context for classification.

Data Sources

Abstracts are fetched from multiple APIs (in order):

  1. CrossRef API - Best for general academic papers
  2. Semantic Scholar API - Good for CS/AI papers
  3. PubMed API - Best for biomedical papers
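The ordered-fallback behavior can be sketched as a chain that stops at the first source returning a non-empty abstract (the callables stand in for the CrossRef, Semantic Scholar, and PubMed clients; this is illustrative logic, not the tool's code):

```python
def fetch_abstract(doi, fetchers):
    """Try each fetcher in order; return the first non-empty abstract,
    or None if every source fails or returns nothing."""
    for fetch in fetchers:
        try:
            abstract = fetch(doi)
        except Exception:
            continue  # a failing API falls through to the next source
        if abstract:
            return abstract
    return None
```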

Caching

  • Abstracts are cached in data/processed/abstract_cache.json
  • Cache is persistent across runs to minimize API calls
  • Both successful fetches and failures are cached
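The cache-both-outcomes behavior can be sketched as a small JSON-backed map where a failure is stored as `None` so the DOI is not retried (a simplified stand-in for the tool's actual cache):

```python
import json
import os

class AbstractCache:
    """Persistent DOI -> abstract map backed by a JSON file."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def get(self, doi):
        return self.data.get(doi)

    def put(self, doi, abstract):
        # abstract may be None to record a failed fetch
        self.data[doi] = abstract
        with open(self.path, "w") as f:
            json.dump(self.data, f)
```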

DOI Extraction

DOIs are automatically extracted from the References field in dataset descriptions. The tool filters out OpenNeuro dataset DOIs so that abstracts are fetched only for actual papers.
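A sketch of that extraction step: match DOIs in free text with a regular expression and drop those under OpenNeuro's registrant prefix (`10.18112` is OpenNeuro's DOI prefix to the best of our knowledge; the regex is illustrative, not the tool's exact pattern):

```python
import re

DOI_RE = re.compile(r'10\.\d{4,9}/[^\s"<>]+')
OPENNEURO_PREFIX = "10.18112/"  # assumed OpenNeuro DOI prefix

def extract_paper_dois(references_text):
    """Find DOIs in free text, skipping OpenNeuro dataset DOIs."""
    return [d for d in DOI_RE.findall(references_text)
            if not d.startswith(OPENNEURO_PREFIX)]
```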

Project Structure

eegdash-llm-tagger/
├── src/eegdash_tagger/      # Main package
│   ├── metadata/            # BIDS metadata parsing
│   │   ├── parser.py        # BIDS file parsing
│   │   └── providers.py     # GitHub/OpenNeuro data providers
│   ├── scraping/            # Web scraping and data collection
│   │   ├── scraper.py       # EEGDash website scraper
│   │   ├── enrichment.py    # Dataset metadata enrichment
│   │   ├── abstract_fetcher.py  # Paper abstract fetching (CrossRef/Semantic Scholar/PubMed)
│   │   └── dataset_filters.py   # Dataset filtering utilities
│   ├── tagging/             # LLM-based tagging
│   │   ├── tagger.py        # Tagger protocol and types
│   │   └── llm_tagger.py    # OpenRouter.ai LLM implementation
│   └── utils/               # CSV updates and helpers
│       └── csv_updater.py   # CSV update utilities
├── scripts/                 # CLI entry points
│   ├── fetch_incomplete_datasets.py
│   ├── fetch_complete_datasets.py
│   ├── tag_with_llm.py
│   ├── test_llm_tagger.py
│   └── update_csv.py
├── tests/                   # Test files
├── tools/                   # Utility scripts
├── data/                    # Data files (gitignored)
│   ├── processed/           # Generated metadata
│   └── test/                # Test datasets
├── prompt.md                # LLM system prompt with reasoning framework
├── environment.yml          # Conda environment
└── setup.py                 # Package configuration

Data Directory

The data/ directory contains generated metadata files and is excluded from git:

  • data/processed/: Production metadata files

    • complete_metadata.json - Datasets with full tags (used for few-shot examples)
    • incomplete_metadata.json - Datasets needing tags
    • few_shot_examples.json - Curated labeled examples for in-context learning
    • llm_output.json - LLM prediction results
    • abstract_cache.json - Cached paper abstracts (auto-generated)
  • data/test/: Test datasets for development

Development

Running Tests

# Test metadata extraction
python tests/test_metadata.py

# Test GitHub API integration
export GITHUB_TOKEN="your_token"
python tests/test_github_api.py

Utility Tools

# Show metadata for a dataset
python tools/show_metadata.py /path/to/dataset

# Create test/training datasets
python tools/create_test_set.py

License

[Add your license here]

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

Acknowledgments

  • EEGDash project for the dataset catalog
  • OpenNeuro for hosting EEG datasets
  • BIDS format for standardized neuroimaging data
