
Commit 2bbc0a3

Merge remote-tracking branch 'origin/master'
2 parents: 7692187 + 49eb8a9


44 files changed: +1441, -192 lines

.github/workflows/publish.yaml

Lines changed: 2 additions & 2 deletions
```diff
@@ -18,13 +18,13 @@ jobs:
           git config user.email 41898282+github-actions[bot]@users.noreply.github.com
       - uses: actions/setup-python@v4
         with:
-          python-version: 3.x
+          python-version: "3.12"
       - run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV
       - uses: actions/cache@v3
         with:
          key: mkdocs-material-${{ env.cache_id }}
          path: .cache
          restore-keys: |
            mkdocs-material-
-      - run: pip install mkdocs-material
+      - run: pip install -r requirements-docs.txt
       - run: mkdocs gh-deploy --force
```

.gitignore

Lines changed: 1 addition & 0 deletions
```diff
@@ -7,6 +7,7 @@ __pycache__
 */#*
 .idea
 *.DS_Store*
+site/
 
 # Ignore Jupyter notebook checkpoints
 .ipynb_checkpoints
```

autodoc.py

Lines changed: 0 additions & 102 deletions
This file was deleted.
Lines changed: 174 additions & 0 deletions
@@ -0,0 +1,174 @@

## BEDMS

BEDMS (BED Metadata Standardizer) is a tool designed to standardize genomics and epigenomics metadata attributes according to user-selected or user-trained schemas. BEDMS ensures consistency and FAIRness of metadata across different platforms.

Users can interact with BEDMS either through Python or via [PEPhub](https://pephub.databio.org/), choosing from predefined schemas provided by the tool. Additionally, BEDMS allows users to create and train custom schemas as per their project requirements. For detailed information on the available schemas, please visit [HuggingFace](https://huggingface.co/databio/attribute-standardizer-model6).

### Installation

To install bedms, use this command:

```
pip install bedms
```

or install the latest version from the GitHub repository:

```
pip install git+https://github.com/databio/bedms.git
```

### Usage

BEDMS can standardize metadata attributes based on the available schemas, train models on custom schemas, and standardize attributes using those custom schema models.

### Standardizing based on available schemas

If you want to standardize the attributes in your PEP based on our available schemas, you can either visit [PEPhub](https://pephub.databio.org/) or use Python:

```python
from bedms import AttrStandardizer

model = AttrStandardizer(
    repo_id="databio/attribute-standardizer-model6", model_name="encode"
)
results = model.standardize(pep="geo/gse228634:default")

print(results)  # Dictionary of suggested predictions with their confidence: {'attr_1': {'prediction_1': 0.70, 'prediction_2': 0.30}}
```

In the above example, `repo_id` is the path to the repository that holds the models on HuggingFace, and the `model_name` selection varies based on your choice of schema. You can view the schemas on PEPhub for [encode](https://pephub.databio.org/schemas/databio/bedms_encode), [fairtracks](https://pephub.databio.org/schemas/databio/bedms_fairtracks), and [bedbase](https://pephub.databio.org/schemas/databio/bedms_bedbase).

For standardization, you also need to provide the path to your PEP, which in the above example is `pep="geo/gse228634:default"`.
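The returned dictionary maps each attribute header to candidate standardized names with confidence scores. A minimal sketch of keeping only the top suggestion per attribute; the `results` literal below is a made-up example in that shape, not real model output:

```python
# Hypothetical output in the shape BEDMS returns:
# {attribute: {suggested_name: confidence}}
results = {
    "tissue_type": {"tissue": 0.82, "cell_line": 0.18},
    "genome": {"assembly": 0.70, "species_name": 0.30},
}

# Keep the highest-confidence suggestion for each attribute
best = {attr: max(preds, key=preds.get) for attr, preds in results.items()}
print(best)  # {'tissue_type': 'tissue', 'genome': 'assembly'}
```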

### Training custom schemas

If you want to train your own schema-based models, you need two things to get started:

1. Training sets
2. A HuggingFace model and associated files

#### Training sets

To develop training sets, follow the step-by-step protocol below:

1. Select the attributes that would be most suitable for your project metadata. For example, here are some attributes that you might choose:
```
sample_name: Name of the sample
assembly: Genome assembly (e.g. hg38)
species_name: Scientific name of the species
```

2. Fetch training data from ontologies, publications, and other available sources to build two directories: `values_directory` and `headers_directory`. `values_directory` holds all the values associated with each attribute, while `headers_directory` holds various synonyms for the attribute names.
The directory structure would look like this:
```
values_directory/
    values_1.csv
    values_2.csv
    values_3.csv
    .
    .
    values_1000.csv

headers_directory/
    headers_1.csv
    headers_2.csv
    headers_3.csv
    .
    .
    headers_1000.csv
```
To see an example of what a `values_*.csv` and `headers_*.csv` might look like, you can check our sample csv files on PEPhub: [sample_bedms_values_1.csv](https://pephub.databio.org/databio/sample_bedms_values_1?tag=default) and [sample_bedms_headers_1.csv](https://pephub.databio.org/databio/sample_bedms_headers_1?tag=default).
While these are only samples and are not information dense, we recommend a large vocabulary in the training files for both `values_directory` and `headers_directory`. To get a better understanding of the training data that we trained BEDMS on, you can visit this [link](https://big.databio.org/bedms/).
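The layout above can be scaffolded programmatically. Below is a minimal sketch using only the standard library; the attribute vocabulary is a hypothetical placeholder, and the one-entry-per-row CSV layout is an assumption, so check the sample files linked above for the exact format:

```python
import csv
from pathlib import Path

# Hypothetical vocabulary: one headers file and one values file per attribute index
attributes = {
    1: {"headers": ["sample_name", "sample", "sample_id"],
        "values": ["K562_rep1", "HepG2_input"]},
    2: {"headers": ["assembly", "genome", "genome_build"],
        "values": ["hg38", "mm10"]},
}

for idx, entry in attributes.items():
    for kind in ("values", "headers"):
        directory = Path(f"{kind}_directory")
        directory.mkdir(exist_ok=True)
        with open(directory / f"{kind}_{idx}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            for item in entry[kind]:
                writer.writerow([item])  # one entry per row
```

In practice you would populate many more files with a far larger vocabulary, as recommended above.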
79+
80+
3. Once your training sets are ready, you can make a directory for your schema in your HuggingFace repository. If the name of your schema is `new_schema` and the name of your repository is `new_repo`, this is what the directory structure will look like:
81+
```
82+
new_repo/
83+
new_schema/
84+
new_schema_design.yaml #This has the schema design defining the attributes with their data types and descriptions
85+
86+
```
87+
88+
4. You can now start training your model using the `AttrStandardizerTrainer` module. For this, you would need a `training_config.yaml`. Please follow the config file schema given [here](https://github.com/databio/bedms/blob/saanika/training_config.yaml).
89+
90+
To instantiate `AttrStandardizerTrainer` class:
91+
92+
```python
93+
from bedms.train import TrainStandardizer
94+
95+
trainer = TrainStandardizer("training_config.yaml")
96+
97+
```
98+
To load the datasets and encode them:
99+
100+
```python
101+
train_data, val_data, test_data, label_encoder, vectorizer = trainer.load_data()
102+
```
103+
104+
To train the custom model:
105+
106+
```python
107+
trainer.train()
108+
```
109+
110+
To test the custom model:
111+
112+
```python
113+
test_results_dict = trainer.test() #Dictionary with Precision, Recall, and F1 values
114+
```
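The F1 value reported there is the harmonic mean of precision and recall; a quick standalone illustration with made-up numbers:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.8, 0.6), 4))  # 0.6857
```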

To generate visualizations such as learning curves, confusion matrices, and ROC curves:

```python
acc_fig, loss_fig, conf_fig, roc_fig = trainer.plot_visualizations()
```

Here `acc_fig` is the accuracy curve figure object, `loss_fig` the loss curve figure object, `conf_fig` the confusion matrix figure object, and `roc_fig` the ROC curve figure object.

5. After your model is trained, you will have three files, at the paths you set in `training_config.yaml`:

i. `model_pth`: Path to your model. Let us assume it is named `model_new_schema.pth`.
ii. `label_encoder_pth`: Path to the label encoder. Let us assume it is named `label_encoder_new_schema.pkl`.
iii. `vectorizer_pth`: Path to the vectorizer. Let us assume it is named `vectorizer_new_schema.pkl`.

Upload these files to your HuggingFace repository in the directory you made earlier, `new_repo/new_schema/`. Your HuggingFace repository would now look something like this:

```
new_repo/
    new_schema/
        new_schema_design.yaml  # The schema design defining the attributes with their data types and descriptions
        model_new_schema.pth
        label_encoder_new_schema.pkl
        vectorizer_new_schema.pkl
```

6. You're just one step away from standardizing metadata according to your custom schema. You need to add a config file with the parameters you trained your model on to the `new_schema/` directory. Name this config file `config_new_schema.yaml`. It should have the following keys:
```
params:
    input_size_bow: int
    embedding_size: int
    hidden_size: int
    output_size: int
    dropout_prob: float
```
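Before uploading, the `params` block can be sanity-checked for missing keys and wrong types. A minimal standalone sketch; the parsed config is represented as a plain dict (so no YAML library is assumed) and the numbers are placeholders, not recommended values:

```python
# Expected types for the `params` block of config_new_schema.yaml
EXPECTED_TYPES = {
    "input_size_bow": int,
    "embedding_size": int,
    "hidden_size": int,
    "output_size": int,
    "dropout_prob": float,
}

def validate_params(params: dict) -> list:
    """Return a list of problems; an empty list means the block looks valid."""
    problems = [f"missing key: {k}" for k in EXPECTED_TYPES if k not in params]
    problems += [
        f"wrong type for {k}: expected {t.__name__}"
        for k, t in EXPECTED_TYPES.items()
        if k in params and not isinstance(params[k], t)
    ]
    return problems

# Placeholder values, not real training parameters
config = {"params": {"input_size_bow": 1024, "embedding_size": 128,
                     "hidden_size": 64, "output_size": 18, "dropout_prob": 0.4}}
print(validate_params(config["params"]))  # []
```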
Provide the values that you trained your model on. The completely trained repository should then have the following structure:

```
new_repo/
    new_schema/
        new_schema_design.yaml  # The schema design defining the attributes with their data types and descriptions
        model_new_schema.pth
        label_encoder_new_schema.pkl
        vectorizer_new_schema.pkl
        config_new_schema.yaml
```
Before moving on to standardization, confirm that all the above files are present in your repository.
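That check can be automated for a local copy of the schema directory. A minimal sketch with the standard library, using the file names from the hypothetical `new_schema` example above:

```python
from pathlib import Path

# Files the steps above place in new_repo/new_schema/
REQUIRED_FILES = [
    "new_schema_design.yaml",
    "model_new_schema.pth",
    "label_encoder_new_schema.pkl",
    "vectorizer_new_schema.pkl",
    "config_new_schema.yaml",
]

def missing_files(schema_dir: str) -> list:
    """Return the required files that are absent from a local schema directory."""
    root = Path(schema_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]

# With no local directory present, everything is reported as missing
print(missing_files("new_repo/new_schema"))
```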

#### Standardizing on custom schema models

To standardize with a custom schema model, instantiate `AttrStandardizer` and provide your `repo_id` and `model_name`:

```python
from bedms import AttrStandardizer

model = AttrStandardizer(
    repo_id="new_repo", model_name="new_schema"
)
results = model.standardize(pep="geo/gse228634:default")

print(results)  # Dictionary of suggested predictions with their confidence: {'attr_1': {'prediction_1': 0.70, 'prediction_2': 0.30}}
```

docs/citations.md

Lines changed: 22 additions & 16 deletions
```diff
@@ -18,21 +18,27 @@ Thanks for citing us! If you use BEDbase, geniml, or their components in your re
 | `bedbase` database | Unpublished |
 
-## Full citation information for manuscripts
-
-<li><b>Gharavi et al. (2024). </b><i>Joint representation learning for retrieval and annotation of genomic interval sets</i>
-<br><i>Bioengineering</i>. <span class="doi">DOI: <a href="http://dx.doi.org/10.3390/bioengineering11030263">10.3390/bioengineering11030263</a></span></li>
-<li><b>Zheng et al. (2024). </b><i>Methods for evaluating unsupervised vector representations of genomic regions</i>
-<br><i>Nucleic Acids Research Genomics and Bioinformatics</i>. <span class="doi">DOI: <a href="https://doi.org/10.1093/nargab/lqae086">10.1093/nargab/lqae086</a></span></li>
-<li><b>Xue et al. (2023). </b><i>Opportunities and challenges in sharing and reusing genomic interval data</i>
-<br><i>Frontiers in Genetics</i>. <span class="doi">DOI: <a href="http://dx.doi.org/10.3389/fgene.2023.1155809">10.3389/fgene.2023.1155809</a></span></li>
-<li><b>Rymuza et al. (2024). </b><i>Methods for constructing and evaluating consensus genomic interval sets</i>
-<br><i>Nucleic Acids Research</i>. <span class="doi">DOI: <a href="https://doi.org/10.1093/nar/gkae685">10.1093/nar/gkae685</a></span></li>
-<li><b>LeRoy et al. (2024). </b><i>Fast clustering and cell-type annotation of scATACdata with pre-trained embeddings</i>
-<br><i>Nucleic Acids Research Genomics and Bioinformatics</i>. <span class="doi">DOI: <a href="http://dx.doi.org/10.1093/nargab/lqae073">10.1093/nargab/lqae073</a></span></li>
-<li><b>Gu et al. (2021). </b><i>Bedshift: perturbation of genomic interval sets</i>
-<br><i>Genome Biology</i>. <span class="doi">DOI: <a href="http://dx.doi.org/10.1186/s13059-021-02440-w">10.1186/s13059-021-02440-w</a></span></li>
-<li><b>Gharavi et al. (2021). </b><i>Embeddings of genomic region sets capture rich biological associations in low dimensions</i>
-<br><i>Bioinformatics</i>. <span class="doi">DOI: <a href="http://dx.doi.org/10.1093/bioinformatics/btab439">10.1093/bioinformatics/btab439</a></span></li>
+## Full citation information for manuscripts
+
+- **Gharavi et al. (2024).** *Joint representation learning for retrieval and annotation of genomic interval sets*.
+  *Bioengineering*. [10.3390/bioengineering11030263](http://dx.doi.org/10.3390/bioengineering11030263)
+
+- **Zheng et al. (2024).** *Methods for evaluating unsupervised vector representations of genomic regions*.
+  *Nucleic Acids Research Genomics and Bioinformatics*. [10.1093/nargab/lqae086](https://doi.org/10.1093/nargab/lqae086)
+
+- **Xue et al. (2023).** *Opportunities and challenges in sharing and reusing genomic interval data*.
+  *Frontiers in Genetics*. [10.3389/fgene.2023.1155809](http://dx.doi.org/10.3389/fgene.2023.1155809)
+
+- **Rymuza et al. (2024).** *Methods for constructing and evaluating consensus genomic interval sets*.
+  *Nucleic Acids Research*. [10.1093/nar/gkae685](https://doi.org/10.1093/nar/gkae685)
+
+- **LeRoy et al. (2024).** *Fast clustering and cell-type annotation of scATAC data with pre-trained embeddings*.
+  *Nucleic Acids Research Genomics and Bioinformatics*. [10.1093/nargab/lqae073](http://dx.doi.org/10.1093/nargab/lqae073)
+
+- **Gu et al. (2021).** *Bedshift: perturbation of genomic interval sets*.
+  *Genome Biology*. [10.1186/s13059-021-02440-w](http://dx.doi.org/10.1186/s13059-021-02440-w)
+
+- **Gharavi et al. (2021).** *Embeddings of genomic region sets capture rich biological associations in low dimensions*.
+  *Bioinformatics*. [10.1093/bioinformatics/btab439](http://dx.doi.org/10.1093/bioinformatics/btab439)
 
 
```
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@

# Assess Module

::: geniml.assess
    options:
      docstring_style: sphinx
      heading_level: 2
      show_source: false
      show_root_heading: true
      show_root_full_path: false
      show_submodules: true
      show_category_heading: true
      members_order: source
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@

# Atacformer Module

::: geniml.atacformer
    options:
      docstring_style: sphinx
      heading_level: 2
      show_source: false
      show_root_heading: true
      show_root_full_path: false
      show_submodules: false
      show_category_heading: true
      members_order: source
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@

# BBClient Module

::: geniml.bbclient
    options:
      docstring_style: sphinx
      heading_level: 2
      show_source: false
      show_root_heading: true
      show_root_full_path: false
      show_submodules: false
      show_category_heading: true
      members_order: source
