
An example of how to retrieve descriptions and markers from the CELLxGENE public API


MaximilianLombardo/cell-guide-extractor

Cell Guide Scraper

A Python package to extract cell type information from the CELLxGENE Cell Guide portal (https://cellxgene.cziscience.com/cellguide/).

Repository Structure

cell-guide-scripts/
├── config.json                # Configuration for API endpoints
├── README.md                  # This documentation
├── src/                       # Source code directory
│   ├── cell_guide_scraper.py  # Main scraper module
│   ├── test_scraper.py        # Test script
│   └── debug_api.py           # API debugging utility
├── data/                      # Data directory
│   ├── examples/              # Example data files
│   └── output/                # Directory for generated output (gitignored)
├── logs/                      # Log files (gitignored)
└── docs/                      # Documentation
    └── api_endpoints.md       # API endpoints reference

Features

  • Extract cell descriptions through direct API endpoints
    • Primary source: Validated descriptions from official API
    • Secondary source: GPT-generated descriptions from official API
    • Fallback to HTML scraping when APIs are unavailable
  • Extract marker gene information (computational and canonical markers)
  • Process multiple cell types in a single run with parallel processing
  • Configurable API endpoints through config.json
  • JSON output for easy integration with other tools
  • Robust error handling with graceful fallbacks
  • Detailed logging system
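The source priority listed above (validated descriptions, then GPT-generated descriptions, then HTML scraping) can be sketched as a simple fallback chain. This is a minimal illustration, not the code in cell_guide_scraper.py: the function `first_available_description` and the three stand-in fetchers are hypothetical names, and the fetchers return canned values instead of making network calls.

```python
from typing import Callable, Iterable, Optional, Tuple

# A source is a (label, fetcher) pair; the fetcher returns text or None.
Source = Tuple[str, Callable[[str], Optional[str]]]


def first_available_description(
    cell_id: str, sources: Iterable[Source]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each source in priority order and return (label, description)."""
    for label, fetch in sources:
        try:
            text = fetch(cell_id)
        except Exception:
            # Treat any failure as "source unavailable" and try the next one.
            continue
        if text:
            return label, text
    return None, None


# Stand-in fetchers for illustration only (hypothetical, no network access).
def fetch_validated(cell_id: str) -> Optional[str]:
    raise ConnectionError("validated endpoint unavailable")


def fetch_gpt(cell_id: str) -> Optional[str]:
    return f"Generated description for {cell_id}"


def fetch_html(cell_id: str) -> Optional[str]:
    return f"Scraped description for {cell_id}"


SOURCES = [
    ("validated", fetch_validated),
    ("gpt", fetch_gpt),
    ("html", fetch_html),
]

label, description = first_available_description("CL_0000084", SOURCES)
print(label, description)
```

Returning the label alongside the text lets the caller record which source a description came from, which matters when validated and generated text are mixed in one output file.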

Installation

Prerequisites

  • Python 3.6+
  • Required packages:
    • requests
    • beautifulsoup4

Setup

  1. Clone this repository:
git clone https://github.com/yourusername/cell-guide-scripts.git
cd cell-guide-scripts
  2. Install required packages:
pip install -r requirements.txt

Or with conda:

conda create -n cell-guide-env python=3.10 requests beautifulsoup4 -y
conda activate cell-guide-env

Usage

Basic Commands

All scripts should be run from the repository root directory.

Extract data for a single cell type:

python src/cell_guide_scraper.py CL_0000084 --output data/output/t_cell_data.json

Process multiple cell types in parallel:

python src/cell_guide_scraper.py CL_0000084 CL_0000236 CL_0000094 --threads 3 --output data/output/multiple_cells.json
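One way a `--threads` option like the one above could be implemented is with a thread pool from the standard library. The sketch below is an assumption about the approach, not the scraper's actual code: `scrape_many` and `fake_scrape` are hypothetical names, and `fake_scrape` stands in for a real per-cell-type scraping function.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_many(cell_ids, scrape_one, threads=3):
    """Run scrape_one over cell_ids in a thread pool.

    Returns a dict mapping each cell ID to its result, or to an
    {"error": ...} entry if that cell type failed.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(scrape_one, cid): cid for cid in cell_ids}
        for future in as_completed(futures):
            cid = futures[future]
            try:
                results[cid] = future.result()
            except Exception as exc:
                # One failing cell type should not abort the whole batch.
                results[cid] = {"error": str(exc)}
    return results


def fake_scrape(cell_id):
    """Hypothetical stand-in for a real scraping function."""
    if cell_id == "CL_0000094":
        raise ValueError("no data")
    return {"cell_id": cell_id}


batch = scrape_many(["CL_0000084", "CL_0000236", "CL_0000094"], fake_scrape)
print(batch)
```

Threads suit this workload because each task spends most of its time waiting on HTTP responses instead of using the CPU.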

Run the test script, which prints results in a readable format:

# Test with a specific cell type
python src/test_scraper.py --cell-id CL_0000084

# Test extracting data for multiple cell types
python src/test_scraper.py --multiple

Debug API endpoints:

# Display API responses
python src/debug_api.py CL_0000084

# Save API responses to files
python src/debug_api.py CL_0000084 --save

Using as a Module

from src.cell_guide_scraper import scrape_cell_data

# Extract data for a specific cell type
t_cell_data = scrape_cell_data("CL_0000084")

# Access the data
description = t_cell_data["description"]
computational_markers = t_cell_data["markers"]["computational"]
canonical_markers = t_cell_data["markers"]["canonical"]
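Because descriptions or markers may be missing for some cell types, it can help to read the result defensively. The sketch below assumes the result has the shape shown above (a `description` key and a `markers` dict with `computational` and `canonical` entries) and additionally assumes the marker entries are lists; `summarize` and the sample record are hypothetical and contain placeholder values only.

```python
import json


def summarize(cell_data):
    """Summarize a scrape result without raising on missing keys."""
    markers = cell_data.get("markers") or {}
    return {
        "has_description": bool(cell_data.get("description")),
        "n_computational": len(markers.get("computational") or []),
        "n_canonical": len(markers.get("canonical") or []),
    }


# Placeholder record for illustration; not real Cell Guide data.
sample = {
    "description": "placeholder text",
    "markers": {"computational": [{"symbol": "GENE_A"}], "canonical": []},
}

print(json.dumps(summarize(sample), indent=2))
```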

API Endpoints

The Cell Guide website provides several API endpoints that can be used to extract data:

  1. Validated Descriptions:

    https://cellguide.cellxgene.cziscience.com/validated_descriptions/CL_0000084.json
    
  2. GPT-Generated Descriptions:

    https://cellguide.cellxgene.cziscience.com/gpt_descriptions/CL_0000084.json
    
  3. Computational Marker Genes:

    https://cellguide.cellxgene.cziscience.com/1743611056/computational_marker_genes/CL_0000084.json
    
  4. Canonical Marker Genes:

    https://cellguide.cellxgene.cziscience.com/1743611056/canonical_marker_genes/CL_0000084.json
    

See docs/api_endpoints.md for detailed information about these endpoints.
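The four URLs above follow two patterns: description endpoints have no version segment, while marker gene endpoints include a numeric version ID in the path. A small helper can build either form. This is a sketch based only on the URL shapes listed above; `endpoint_url` is a hypothetical name and is not taken from the scraper module.

```python
BASE_URL = "https://cellguide.cellxgene.cziscience.com"

# Endpoint kinds as they appear in the URLs listed above.
UNVERSIONED = {"validated_descriptions", "gpt_descriptions"}
VERSIONED = {"computational_marker_genes", "canonical_marker_genes"}


def endpoint_url(kind, cell_id, version=None):
    """Build the JSON URL for one endpoint kind and one cell type ID."""
    if kind in UNVERSIONED:
        return f"{BASE_URL}/{kind}/{cell_id}.json"
    if kind in VERSIONED:
        if version is None:
            raise ValueError(f"{kind} requires a version ID")
        return f"{BASE_URL}/{version}/{kind}/{cell_id}.json"
    raise ValueError(f"unknown endpoint kind: {kind}")


print(endpoint_url("validated_descriptions", "CL_0000084"))
print(endpoint_url("canonical_marker_genes", "CL_0000084", version="1743611056"))
```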

Configuration

The config.json file holds the API endpoint configuration, including the current version ID used in the marker gene endpoint URLs (1743611056 in the examples above). If the marker endpoints stop responding, this version ID may need to be updated.
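The sketch below shows one way such a configuration could be parsed and its version ID swapped. The key names `base_url` and `marker_version` are assumptions made for illustration; check the actual config.json for the real schema. `NEW_VERSION_ID` is a placeholder, not a real version.

```python
import json

# Hypothetical config contents; the real config.json keys may differ.
EXAMPLE_CONFIG = """
{
  "base_url": "https://cellguide.cellxgene.cziscience.com",
  "marker_version": "1743611056"
}
"""


def parse_config(text):
    """Parse config JSON and check that the assumed keys are present."""
    config = json.loads(text)
    missing = {"base_url", "marker_version"} - config.keys()
    if missing:
        raise KeyError(f"config is missing keys: {sorted(missing)}")
    return config


def with_marker_version(config, new_version):
    """Return a copy with a new version ID, leaving the original untouched."""
    updated = dict(config)
    updated["marker_version"] = str(new_version)
    return updated


config = parse_config(EXAMPLE_CONFIG)
updated = with_marker_version(config, "NEW_VERSION_ID")
print(updated["marker_version"])
```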

Example Cell Types

  • CL_0000084: T cell
  • CL_0000236: B cell
  • CL_0000094: Granulocyte
  • CL_0000928: Activated CD4-negative, CD8-negative type I NK T cell

License

This project is open source and available under the MIT License.
