A Python package to extract cell type information from the CELLxGENE Cell Guide portal (https://cellxgene.cziscience.com/cellguide/).
cell-guide-scripts/
├── config.json # Configuration for API endpoints
├── README.md # This documentation
├── src/ # Source code directory
│ ├── cell_guide_scraper.py # Main scraper module
│ ├── test_scraper.py # Test script
│ └── debug_api.py # API debugging utility
├── data/ # Data directory
│ ├── examples/ # Example data files
│ └── output/ # Directory for generated output (gitignored)
├── logs/ # Log files (gitignored)
└── docs/ # Documentation
    └── api_endpoints.md # API endpoints reference
- Extract cell descriptions through direct API endpoints
  - Primary source: Validated descriptions from official API
  - Secondary source: GPT-generated descriptions from official API
  - Fallback to HTML scraping when APIs are unavailable
- Extract marker gene information (computational and canonical markers)
- Process multiple cell types in a single run with parallel processing
- Configurable API endpoints through config.json
- JSON output for easy integration with other tools
- Robust error handling with graceful fallbacks
- Detailed logging system
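
The description lookup order listed above (validated, then GPT-generated, then HTML scraping) can be sketched as a simple fallback chain. This is a minimal illustration, not the package's actual implementation: `first_available` and the three stand-in source functions are hypothetical names, and real sources would issue HTTP requests instead of returning fixed strings.

```python
from typing import Callable, Iterable, Optional, Tuple

# A source is a (label, fetch function) pair; fetch returns text or None.
Source = Tuple[str, Callable[[str], Optional[str]]]


def first_available(
    cell_id: str, sources: Iterable[Source]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each source in order; return (label, text) from the first that succeeds."""
    for label, fetch in sources:
        try:
            text = fetch(cell_id)
        except Exception:
            # A failing source must not abort the run; fall through to the next one.
            continue
        if text:
            return label, text
    return None, None


# Hypothetical stand-ins for the three sources, for illustration only.
def validated(cell_id: str) -> Optional[str]:
    raise ConnectionError("validated endpoint unavailable")


def gpt_generated(cell_id: str) -> Optional[str]:
    return f"GPT-generated description for {cell_id}"


def html_scrape(cell_id: str) -> Optional[str]:
    return f"Scraped description for {cell_id}"


SOURCES = [("validated", validated), ("gpt", gpt_generated), ("html", html_scrape)]
label, description = first_available("CL_0000084", SOURCES)
print(label, description)
```

Returning the label alongside the text lets the caller record which source a description came from, which is useful when validated and GPT-generated text need to be told apart downstream.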
- Python 3.6+
- Required packages:
  - requests
  - beautifulsoup4
- Clone this repository:
  git clone https://github.com/yourusername/cell-guide-scripts.git
  cd cell-guide-scripts
- Install required packages:
  pip install -r requirements.txt
  Or with conda:
  conda create -n cell-guide-env python=3.10 requests beautifulsoup4 -y
  conda activate cell-guide-env

All scripts should be run from the repository root directory.
python src/cell_guide_scraper.py CL_0000084 --output data/output/t_cell_data.json

python src/cell_guide_scraper.py CL_0000084 CL_0000236 CL_0000094 --threads 3 --output data/output/multiple_cells.json

# Test with a specific cell type
python src/test_scraper.py --cell-id CL_0000084
# Test extracting data for multiple cell types
python src/test_scraper.py --multiple

# Display API responses
python src/debug_api.py CL_0000084
# Save API responses to files
python src/debug_api.py CL_0000084 --save

from src.cell_guide_scraper import scrape_cell_data
# Extract data for a specific cell type
t_cell_data = scrape_cell_data("CL_0000084")
# Access the data
description = t_cell_data["description"]
computational_markers = t_cell_data["markers"]["computational"]
canonical_markers = t_cell_data["markers"]["canonical"]

The Cell Guide website provides several API endpoints that can be used to extract data:
- Validated Descriptions: https://cellguide.cellxgene.cziscience.com/validated_descriptions/CL_0000084.json
- GPT-Generated Descriptions: https://cellguide.cellxgene.cziscience.com/gpt_descriptions/CL_0000084.json
- Computational Marker Genes: https://cellguide.cellxgene.cziscience.com/1743611056/computational_marker_genes/CL_0000084.json
- Canonical Marker Genes: https://cellguide.cellxgene.cziscience.com/1743611056/canonical_marker_genes/CL_0000084.json

See the API endpoints reference (docs/api_endpoints.md) for detailed information about these endpoints.
The config.json file contains configuration for the API endpoints, including the current version ID for marker gene endpoints. If the markers API stops working, this version number may need to be updated.
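
To illustrate why the version ID matters, the sketch below builds the four endpoint URLs from a cell type ID and a version ID. The URL shapes are taken from the endpoint list above; the function name `build_endpoint_urls` and the `marker_version` key are assumptions for illustration and may not match the actual layout of config.json.

```python
import json

BASE_URL = "https://cellguide.cellxgene.cziscience.com"


def build_endpoint_urls(cell_id: str, marker_version: str) -> dict:
    """Return the four endpoint URLs for one cell type ID."""
    return {
        # Description endpoints carry no version segment.
        "validated_description": f"{BASE_URL}/validated_descriptions/{cell_id}.json",
        "gpt_description": f"{BASE_URL}/gpt_descriptions/{cell_id}.json",
        # Marker gene endpoints are prefixed with the version ID.
        "computational_markers": f"{BASE_URL}/{marker_version}/computational_marker_genes/{cell_id}.json",
        "canonical_markers": f"{BASE_URL}/{marker_version}/canonical_marker_genes/{cell_id}.json",
    }


# Hypothetical config content; the real config.json may use different keys.
config = json.loads('{"marker_version": "1743611056"}')
urls = build_endpoint_urls("CL_0000084", config["marker_version"])
print(urls["canonical_markers"])
```

Only the two marker gene URLs depend on the version ID, so a stale value would be expected to break marker extraction while description lookups keep working.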
- CL_0000084: T cell
- CL_0000236: B cell
- CL_0000094: Granulocyte
- CL_0000928: Activated CD4-negative, CD8-negative type I NK T cell
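
Processing several of these IDs in one run, as the `--threads` option does on the command line, could look roughly like the following when done from Python. This is a sketch built on the standard library's `ThreadPoolExecutor`, not the scraper's own code: `scrape_many` is a hypothetical helper, and `fake_fetch` stands in for `scrape_cell_data` so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def scrape_many(
    cell_ids: Iterable[str], fetch: Callable[[str], dict], threads: int = 3
) -> Dict[str, dict]:
    """Run fetch(cell_id) for each ID on a thread pool; record errors per ID."""
    results: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Map each future back to the cell ID it was submitted for.
        futures = {pool.submit(fetch, cid): cid for cid in cell_ids}
        for future in as_completed(futures):
            cid = futures[future]
            try:
                results[cid] = future.result()
            except Exception as exc:
                # One failing cell type should not discard the others.
                results[cid] = {"error": str(exc)}
    return results


# Stand-in for scrape_cell_data, used here to avoid network access.
def fake_fetch(cell_id: str) -> dict:
    if cell_id == "CL_0000094":
        raise RuntimeError("simulated failure")
    return {"description": f"description of {cell_id}"}


data = scrape_many(["CL_0000084", "CL_0000236", "CL_0000094"], fake_fetch)
print(sorted(data))
```

Passing `scrape_cell_data` as the `fetch` argument in place of `fake_fetch` would apply the same pattern to real requests, assuming it accepts a single cell ID and returns a dictionary as in the usage example.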
This project is open source and available under the MIT License.