update README with metadata description

stemangiola · stemangiola · commit 9f3c57e415ab · 2023-02-10T09:12:50.000+11:00
diff --git a/README.Rmd b/README.Rmd
@@ -3,6 +3,10 @@ title: "CuratedAtlasQueryR"
 output: github_document
 ---
 
+`CuratedAtlasQueryR` is a query interface that allows the programmatic exploration and retrieval of the harmonised, curated and reannotated CELLxGENE single-cell human cell atlas. Data can be retrieved at cell, sample, or dataset levels based on filtering criteria.
+
+# Query interface
+
 ```{r, include = FALSE}
 # Note: knit this to the repo readme file using:
 # rmarkdown::render("vignettes/readme.Rmd", output_format = "github_document", output_dir = getwd())
@@ -167,3 +171,36 @@ get_metadata() |>
 knitr::include_graphics("inst/NCAM1_figure.png")
 ```
 
+# Cell metadata
+
+Dataset-specific columns (definitions available at cellxgene.cziscience.com)
+
+`cell_count`, `collection_id`, `created_at.x`, `created_at.y`, `dataset_deployments`, `dataset_id`, `file_id`, `filename`, `filetype`, `is_primary_data.y`, `is_valid`, `linked_genesets`, `mean_genes_per_cell`, `name`, `published`, `published_at`, `revised_at`, `revision`, `s3_uri`, `schema_version`, `tombstone`, `updated_at.x`, `updated_at.y`, `user_submitted`, `x_normalization`
+
+Sample-specific columns (definitions available at cellxgene.cziscience.com)
+
+`.sample`, `.sample_name`, `age_days`, `assay`, `assay_ontology_term_id`, `development_stage`, `development_stage_ontology_term_id`, `ethnicity`, `ethnicity_ontology_term_id`, `experiment___`, `organism`, `organism_ontology_term_id`, `sample_placeholder`, `sex`, `sex_ontology_term_id`, `tissue`, `tissue_harmonised`, `tissue_ontology_term_id`, `disease`, `disease_ontology_term_id`, `is_primary_data.x`
+
+Cell-specific columns (definitions available at cellxgene.cziscience.com)
+
+`.cell`, `cell_type`, `cell_type_ontology_term_id`, `cell_type_harmonised`, `confidence_class`, `cell_annotation_azimuth_l2`, `cell_annotation_blueprint_singler`
+
+Through harmonisation and curation we introduced custom columns that are not present in the original CELLxGENE metadata:
+
+- `tissue_harmonised`: a coarser tissue name for better filtering
+- `age_days`: the number of days corresponding to the age
+- `cell_type_harmonised`: the consensus cell identity call (for immune cells), derived from the original annotation and three novel annotations produced with Seurat Azimuth and SingleR
+- `confidence_class`: an ordinal class indicating how confident the `cell_type_harmonised` call is: 1 is complete consensus, 2 is agreement among three of the four annotations, and so on.
+- `cell_annotation_azimuth_l2`: Azimuth cell annotation
+- `cell_annotation_blueprint_singler`: SingleR cell annotation using Blueprint reference
+- `cell_annotation_blueprint_monaco`: SingleR cell annotation using Monaco reference
+- `sample_id_db`: Sample subdivision for internal use
+- `file_id_db`: File subdivision for internal use
+- `.sample`: Sample ID
+- `.sample_name`: How samples were defined
+
+# RNA abundance
+
+The `raw` assay includes RNA abundance on the positive real scale (not transformed with non-linear functions, e.g. log or sqrt). The original CELLxGENE data include a mix of scales and transformations, specified in the `x_normalization` column.
+
+The `cpm` assay includes counts per million.
diff --git a/README.md b/README.md
@@ -1,6 +1,13 @@
 CuratedAtlasQueryR
 ================
 
+`CuratedAtlasQueryR` is a query interface that allows the programmatic
+exploration and retrieval of the harmonised, curated and reannotated
+CELLxGENE single-cell human cell atlas. Data can be retrieved at cell,
+sample, or dataset levels based on filtering criteria.
+
+# Query interface
+
 <img src="inst/logo.png" width="120px" height="139px" />
 
 ## Load the package
@@ -233,3 +240,64 @@ get_metadata() |>
 ```
 
 <img src="inst/NCAM1_figure.png" width="629" />
+
+# Cell metadata
+
+Dataset-specific columns (definitions available at
+cellxgene.cziscience.com)
+
+`cell_count`, `collection_id`, `created_at.x`, `created_at.y`,
+`dataset_deployments`, `dataset_id`, `file_id`, `filename`, `filetype`,
+`is_primary_data.y`, `is_valid`, `linked_genesets`,
+`mean_genes_per_cell`, `name`, `published`, `published_at`,
+`revised_at`, `revision`, `s3_uri`, `schema_version`, `tombstone`,
+`updated_at.x`, `updated_at.y`, `user_submitted`, `x_normalization`
+
+Sample-specific columns (definitions available at
+cellxgene.cziscience.com)
+
+`.sample`, `.sample_name`, `age_days`, `assay`,
+`assay_ontology_term_id`, `development_stage`,
+`development_stage_ontology_term_id`, `ethnicity`,
+`ethnicity_ontology_term_id`, `experiment___`, `organism`,
+`organism_ontology_term_id`, `sample_placeholder`, `sex`,
+`sex_ontology_term_id`, `tissue`, `tissue_harmonised`,
+`tissue_ontology_term_id`, `disease`, `disease_ontology_term_id`,
+`is_primary_data.x`
+
+Cell-specific columns (definitions available at
+cellxgene.cziscience.com)
+
+`.cell`, `cell_type`, `cell_type_ontology_term_id`,
+`cell_type_harmonised`, `confidence_class`,
+`cell_annotation_azimuth_l2`, `cell_annotation_blueprint_singler`
+
+Through harmonisation and curation we introduced custom columns that are
+not present in the original CELLxGENE metadata:
+
+- `tissue_harmonised`: a coarser tissue name for better filtering
+- `age_days`: the number of days corresponding to the age
+- `cell_type_harmonised`: the consensus cell identity call (for immune
+  cells), derived from the original annotation and three novel
+  annotations produced with Seurat Azimuth and SingleR
+- `confidence_class`: an ordinal class indicating how confident the
+  `cell_type_harmonised` call is: 1 is complete consensus, 2 is
+  agreement among three of the four annotations, and so on.
+- `cell_annotation_azimuth_l2`: Azimuth cell annotation
+- `cell_annotation_blueprint_singler`: SingleR cell annotation using
+  Blueprint reference
+- `cell_annotation_blueprint_monaco`: SingleR cell annotation using
+  Monaco reference
+- `sample_id_db`: Sample subdivision for internal use
+- `file_id_db`: File subdivision for internal use
+- `.sample`: Sample ID
+- `.sample_name`: How samples were defined
+
+# RNA abundance
+
+The `raw` assay includes RNA abundance on the positive real scale (not
+transformed with non-linear functions, e.g. log or sqrt). The original
+CELLxGENE data include a mix of scales and transformations, specified in
+the `x_normalization` column.
+
+The `cpm` assay includes counts per million.