Welcome to the STARR-ONCOLOGY Data Lake ARPA-H project. This repository contains resources and documentation for our oncology data initiatives, aimed at facilitating cancer research and improving patient outcomes.
The Oncology Data Lake project provides comprehensive datasets and tools for analyzing and sharing healthcare data. Our primary focus is on extracting structured information from unstructured clinical data, ensuring patient privacy, and creating high-quality training data for machine learning models.
Core Documentation (src/)
- Primary Documents
  - about.qmd - Project overview and dataset descriptions
  - data_labeling.qmd - Data labeling guidelines and methods
  - data_metrics.qmd - Dataset statistics and quality analysis
  - data_dictionary.qmd - OMOP-CDM schema reference
  - metadata.qmd - Metadata standards and documentation
- Configuration Files
  - _quarto.yml - Quarto project settings
  - sql_params.yml - Database parameters
  - styles.css - Custom styling
- Source Code & Assets
  - R/ - R analysis scripts and utilities
  - fonts/ - Typography resources
  - images/ - Documentation images and diagrams
- Genomic Data
  - bed_files/
    - Heme-STAMP_v1_APR2018.bed
    - Heme-STAMP_v2_AUG2018.bed
    - STAMP1_OCT2014.bed
    - STAMP2_OCT2015.bed
    - STAMP3_SEP2018.bed
- Data Dictionaries
  - cap_forms_data_dict/ - CAP Forms reference
  - dicom_data_dict/ - DICOM metadata schema
  - neuralframe_data_dict/ - NeuralFrame documentation
  - omop_data_dict/ - OMOP CDM specifications
  - philips_ispm_data_dict/ - Philips ISPM reference
  - stamp_data_dict/ - STAMP assay documentation
  - starr_deid_data_dict/ - De-identification protocols
- Data Releases
  - aug_2025/ (Current)
    - Clinical metrics and analyses
    - Tumor board documentation
    - Imaging data analysis
    - SQL queries and datasets
  - may_2025/ - Previous release
  - feb_2025/ - Initial release
    - Demographic studies
    - PHI labeling protocols
    - Release documentation
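The files under bed_files/ carry the .bed extension, which normally means the BED interval format: whitespace-separated text whose first three columns are chromosome, 0-based start, and exclusive end. As a minimal sketch of how one of these panel definitions might be inspected, the Python below reads intervals and totals the bases they cover. The names BedInterval, read_bed, and total_bases and the sample rows are illustrative and not part of this repository; the sketch assumes the files follow the standard BED layout and that intervals do not overlap.

```python
import io
from typing import Iterable, Iterator, NamedTuple, TextIO


class BedInterval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def read_bed(handle: TextIO) -> Iterator[BedInterval]:
    """Yield intervals from a BED stream, skipping blank and header lines."""
    for line in handle:
        line = line.strip()
        # Comment, "track", and "browser" lines are headers, not intervals.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"expected at least 3 columns, got: {line!r}")
        yield BedInterval(fields[0], int(fields[1]), int(fields[2]))


def total_bases(intervals: Iterable[BedInterval]) -> int:
    """Sum interval lengths (assumes intervals do not overlap)."""
    return sum(iv.end - iv.start for iv in intervals)


if __name__ == "__main__":
    # Made-up rows for illustration; not taken from the repository's files.
    sample = io.StringIO("chr1\t100\t200\tGENE_A\nchr2\t50\t75\tGENE_B\n")
    print(total_bases(read_bed(sample)))  # 125
```

To try it on a real file, pass an open file handle to read_bed, for example one opened on STAMP3_SEP2018.bed. The exact location of bed_files/ within a clone is not stated here, so adjust the path accordingly.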
Development Setup
- .devcontainer/ - Development environment configuration
- .github/ - GitHub workflows and templates
- .vscode/ - VS Code workspace settings
- .venv/ - Python virtual environment
- .cspell/ - Spell check rules
Build & Configuration
- Dockerfile - Container definition
- pyproject.toml - Python dependencies
- install.R - R package installer
- post_create.sh - Setup script
- uv.lock - Python dependency lock
- README.md - Main documentation
- Install Visual Studio Code.
- Install Docker.
- Clone the Repository:
    git clone https://github.com/your-username/starr-oncology-data-lake-arpah.git
    cd starr-oncology-data-lake-arpah
- Open the Repository in VS Code:
  - Launch Visual Studio Code.
  - Open the cloned repository folder.
- Open in Devcontainer:
  - When you open the repository in VS Code, you should see a prompt to reopen the folder in a devcontainer. Click "Reopen in Container".
  - If you don't see the prompt, you can manually reopen in a container by clicking on the green button in the bottom-left corner of the VS Code window and selecting "Reopen in Container".
- Run Quarto Documents:
  - Open any .qmd file (e.g., data_labeling.qmd).
  - Click the "Render" button in the top-right corner of the editor to render the document.
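Before working through the steps above, it can help to confirm that the command-line tools they rely on are available. The sketch below is a hypothetical helper, not a script shipped with this repository. It assumes the tools are exposed on PATH as git, docker, and code (the VS Code launcher, which is only present if the VS Code shell command has been installed).

```python
import shutil
from typing import Callable, Dict, Iterable, List, Optional

# Executable names assumed for the prerequisites in the setup steps.
REQUIRED_TOOLS = ("git", "docker", "code")


def find_tools(
    names: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: which(name) for name in names}


def missing_tools(found: Dict[str, Optional[str]]) -> List[str]:
    """Return the names of tools that could not be located."""
    return [name for name, path in found.items() if path is None]


if __name__ == "__main__":
    found = find_tools()
    for name, path in found.items():
        print(f"{name}: {path or 'not found'}")
    missing = missing_tools(found)
    if missing:
        print("Missing:", ", ".join(missing))
```

A missing code entry does not necessarily mean VS Code is absent, only that its launcher is not on PATH.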
Example: Rendering data_labeling.qmd
- Open data_labeling.qmd in VS Code.
- Click the "Render" button to generate the HTML output.
- View the rendered document in the browser to see the visualizations and tables.
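The same render can also be started without the editor button, through the Quarto command line (quarto render <file> --to html). The following is a minimal sketch, assuming the quarto executable is on PATH inside the dev container and that data_labeling.qmd is reachable from the working directory. The helper names build_render_command and render are illustrative only.

```python
import shutil
import subprocess
from typing import List


def build_render_command(qmd_path: str, output_format: str = "html") -> List[str]:
    """Build the argument list for `quarto render <file> --to <format>`."""
    return ["quarto", "render", qmd_path, "--to", output_format]


def render(qmd_path: str, output_format: str = "html") -> int:
    """Run the render and return Quarto's exit code (0 means success)."""
    if shutil.which("quarto") is None:
        raise FileNotFoundError("quarto executable not found on PATH")
    result = subprocess.run(build_render_command(qmd_path, output_format), check=False)
    return result.returncode


if __name__ == "__main__":
    # Only print the command here; call render("data_labeling.qmd") to execute it.
    print(" ".join(build_render_command("data_labeling.qmd")))
```

Building the argument list separately from running it keeps the command easy to inspect or log before anything is executed.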