CSA is a Python framework for extracting, processing, and analyzing molecular crystal structures from the Cambridge Structural Database (CSD).
- High-Performance Pipeline: GPU-accelerated batch processing with PyTorch
- CSD Integration: Direct interface to the Cambridge Structural Database
- Advanced Analytics: Fragment analysis, intermolecular contacts, and geometric descriptors
- Efficient Storage: HDF5-based data management with variable-length datasets
- Scalable Architecture: Parallel processing for large datasets
The full documentation includes:
- Getting Started Guide - Installation and quickstart
- User Guide - Core concepts and workflows
- API Reference - Complete API documentation
- Tutorials - Step-by-step guides
- Examples - Ready-to-run code
# Install CSA
pip install -e .
# Run analysis
python src/csa_main.py --config your_config.json
For detailed installation instructions and requirements, see the Installation Guide.
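Because the run command takes a JSON config file, it can help to confirm that the file parses before starting a long job. The following is a minimal sketch, not part of CSA: `build_command` is a hypothetical helper, the assumption that the config is a top-level JSON object is ours, and `example_key` is a placeholder rather than a real CSA setting. Only the script path and the `--config` flag come from the command above.

```python
import json
import sys
import tempfile
from pathlib import Path


def build_command(config_path, script="src/csa_main.py"):
    """Check that the config is parseable JSON, then return the argv list.

    Hypothetical helper for illustration; CSA itself may validate differently.
    """
    path = Path(config_path)
    with path.open(encoding="utf-8") as handle:
        config = json.load(handle)  # raises json.JSONDecodeError if malformed
    # Assumption: the config is a JSON object (key/value mapping) at top level.
    if not isinstance(config, dict):
        raise ValueError(f"{path} must contain a JSON object at the top level")
    return [sys.executable, script, "--config", str(path)]


# Demo with a throwaway file; "example_key" is a placeholder, not a CSA option.
with tempfile.TemporaryDirectory() as tmp:
    demo_config = Path(tmp) / "your_config.json"
    demo_config.write_text(json.dumps({"example_key": "example_value"}), encoding="utf-8")
    command = build_command(demo_config)
    print(command[1:3])
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository root once the config is in place.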
CSA transforms raw crystallographic data into analysis-ready datasets through a five-stage pipeline:
- Family Extraction - Query and organize CSD structures by chemical families
- Similarity Clustering - Group structures by 3D packing similarity
- Representative Selection - Choose optimal structures using statistical metrics
- Data Extraction - Extract atomic coordinates, bonds, and intermolecular contacts
- Feature Engineering - Compute advanced geometric and topological descriptors
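The way the five stages feed into one another can be sketched as a simple chain in which each stage reads the output of the previous one. This is an illustration of the data flow only: every function and class name below is hypothetical, the stage bodies are toy placeholders, and the identifiers are made up rather than real CSD refcodes. It is not CSA's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class PipelineState:
    """Holds each stage's output, keyed by stage name (hypothetical container)."""
    results: Dict[str, object] = field(default_factory=dict)
    completed: List[str] = field(default_factory=list)


Stage = Callable[[PipelineState], object]


def run_pipeline(stages: List[Tuple[str, Stage]], state: PipelineState) -> PipelineState:
    """Run the stages in order, storing every result for later stages to read."""
    for name, stage in stages:
        state.results[name] = stage(state)
        state.completed.append(name)
    return state


def extract_families(state):
    # Toy stand-in for a CSD query: made-up identifiers grouped by family.
    return {"family_A": ["REF001", "REF002", "REF003"], "family_B": ["REF004"]}


def cluster_by_similarity(state):
    # Toy stand-in: place all members of a family in a single cluster.
    families = state.results["family_extraction"]
    return {family: [members] for family, members in families.items()}


def select_representatives(state):
    # Toy stand-in: take the first member of each cluster.
    clusters = state.results["similarity_clustering"]
    return [cluster[0] for groups in clusters.values() for cluster in groups]


def extract_data(state):
    # Toy stand-in: empty records where coordinates, bonds, and contacts would go.
    representatives = state.results["representative_selection"]
    return {ref: {"coordinates": [], "bonds": [], "contacts": []} for ref in representatives}


def engineer_features(state):
    # Toy stand-in: one trivial descriptor per structure.
    data = state.results["data_extraction"]
    return {ref: {"n_atoms": len(record["coordinates"])} for ref, record in data.items()}


STAGES = [
    ("family_extraction", extract_families),
    ("similarity_clustering", cluster_by_similarity),
    ("representative_selection", select_representatives),
    ("data_extraction", extract_data),
    ("feature_engineering", engineer_features),
]

state = run_pipeline(STAGES, PipelineState())
print(state.completed)
```

Keeping every intermediate result in one state object means a later stage can be rerun or inspected without repeating the earlier ones, which matters when the early stages involve expensive database queries.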
- Python 3.9 (Required for CSD Python API)
- PyTorch (GPU recommended)
- Valid CCDC license for CSD access
- HDF5 and related dependencies
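A quick way to check these requirements before installing is a small script that inspects the interpreter version and looks for the expected packages. This is a sketch under stated assumptions: `check_environment` is a hypothetical helper, `h5py` is assumed to be the HDF5 binding in use (the list above does not name one), and the version test treats "Python 3.9" as an exact minor-version match. The script only tests whether the `ccdc` module can be found; it cannot verify that a CCDC license is valid.

```python
import importlib.util
import sys

REQUIRED_PYTHON = (3, 9)
# Module name -> human-readable label. "h5py" is an assumption, see above.
REQUIRED_MODULES = {
    "ccdc": "CSD Python API",
    "torch": "PyTorch",
    "h5py": "HDF5 bindings",
}


def check_environment(version_info=sys.version_info, find_spec=importlib.util.find_spec):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    major, minor = version_info[0], version_info[1]
    if (major, minor) != REQUIRED_PYTHON:
        problems.append(
            f"Python {REQUIRED_PYTHON[0]}.{REQUIRED_PYTHON[1]} expected, found {major}.{minor}"
        )
    for module, label in REQUIRED_MODULES.items():
        # find_spec returns None when a top-level module is not installed.
        if find_spec(module) is None:
            problems.append(f"{label} (module '{module}') not found")
    return problems


for problem in check_environment():
    print(problem)
```

The `version_info` and `find_spec` parameters exist only so the check can be exercised without changing the real environment.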
See the full requirements in the documentation.
Contributions are welcome! Please see our contributing guidelines for details.
This project is licensed under the MIT License - see the LICENSE file for details.
Note: CSA requires a valid Cambridge Crystallographic Data Centre (CCDC) license for full functionality.