| 1 | +# 🔄 BEDbase loader |
| 2 | + |
| 3 | +BEDbase loader is an automated tool and cron job that continuously fetches, processes, and integrates new BED files from public repositories into the BEDbase database. This ensures that BEDbase remains up-to-date with the latest genomic data available. |
| 4 | + |
| 5 | +BEDbase loader repository: [https://github.com/databio/bedbase-loader](https://github.com/databio/bedbase-loader) |
| 6 | + |
| 7 | +## Key Features |
| 8 | +- **Automated GEO Retrieval** |
| 9 | +- **Automated BED heavy processing** |
| 10 | +- **Automated Genomes Updater** |
| 11 | +- **Umap creator** |
| 12 | + |
| 13 | +### Automated GEO Retrieval |
| 14 | + |
| 15 | +The core function of bedbase-loader is the automated retrieval of GEO data. |
| 16 | +**Steps**: |
| 17 | +1. First, metadata is fetched through the PEPhub API from the bedbase repository: [https://pephub.databio.org/bedbase](https://pephub.databio.org/bedbase). GSE projects uploaded within a given time window (e.g. the last 2 days) are selected. |
| 18 | +2. Once the list of GSE projects is fetched, BEDboss checks whether each project has already been processed. Projects that have not been processed move on to the next step. |
| 19 | +3. Then, the metadata for each of these projects, including the URLs of the files, is fetched from PEPhub. |
| 20 | +4. Next, the files are downloaded and their metadata is inserted into the BEDbase database. |
| 21 | +5. Finally, the status flag is updated to "downloaded", and the project is ready for the next step: heavy processing (see the next section). |
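The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual bedbase-loader code: the `Project` class, `select_recent`, `run_retrieval`, the in-memory `statuses` dictionary (standing in for the database), and the injected `download` callable are all hypothetical names introduced here. Only the "downloaded" status value and the 2-day window come from the description.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable


@dataclass
class Project:
    """Hypothetical stand-in for a GSE project record listed on PEPhub."""

    accession: str
    uploaded: datetime
    file_urls: list[str] = field(default_factory=list)


def select_recent(projects: list[Project], now: datetime, days: int = 2) -> list[Project]:
    """Step 1: keep only projects uploaded within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return [p for p in projects if p.uploaded >= cutoff]


def run_retrieval(
    projects: list[Project],
    statuses: dict[str, str],
    download: Callable[[str], None],
    now: datetime,
    days: int = 2,
) -> list[str]:
    """Run steps 1-5 and return the accessions handled in this run.

    `statuses` maps accession -> status flag and stands in for the database;
    `download` is whatever function fetches a single file URL.
    """
    handled = []
    for project in select_recent(projects, now, days):
        # Step 2: skip projects that already have a status recorded.
        if project.accession in statuses:
            continue
        # Steps 3-4: walk the file URLs from the project metadata and download each.
        for url in project.file_urls:
            download(url)
        # Step 5: mark the project as ready for heavy processing.
        statuses[project.accession] = "downloaded"
        handled.append(project.accession)
    return handled
```

Passing the downloader in as an argument keeps the sketch testable without network access; a real run would supply a function that fetches the file and inserts its metadata.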
| 22 | + |
| 23 | + |
| 24 | +### Automated BED heavy processing |
| 25 | + |
| 26 | +Automated GEO retrieval downloads a large number of files from GEO. |
| 27 | +To keep downloading and database insertion fast, heavy processing is skipped during that initial step. |
| 28 | +Heavy processing runs later on AWS, using AWS Fargate and an automated cron job, after the files have been downloaded and inserted into the database. |
| 29 | +Docker image for heavy processing: [https://github.com/databio/bedboss/blob/main/Dockerfile](https://github.com/databio/bedboss/blob/main/Dockerfile) |
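The deferred pattern described above, where a later job picks up everything still flagged "downloaded", might look like the sketch below. It is an assumption-laden illustration rather than the BEDboss implementation: `process_pending`, the `heavy_process` callable, and the "processed" and "failed" status values are hypothetical. Only the "downloaded" flag is taken from the description.

```python
from __future__ import annotations

from typing import Callable, Optional


def process_pending(
    statuses: dict[str, str],
    heavy_process: Callable[[str], None],
    limit: Optional[int] = None,
) -> tuple[list[str], list[str]]:
    """Run heavy processing on every record still flagged "downloaded".

    `statuses` maps a record identifier to its status flag (a stand-in for
    the database). Returns the identifiers that succeeded and those that failed.
    """
    pending = [key for key, status in statuses.items() if status == "downloaded"]
    if limit is not None:
        # Cap the batch so a single scheduled run stays bounded.
        pending = pending[:limit]

    succeeded, failed = [], []
    for key in pending:
        try:
            heavy_process(key)
        except Exception:
            # Hypothetical flag: record the failure and keep going with the rest.
            statuses[key] = "failed"
            failed.append(key)
        else:
            # Hypothetical flag: the record will not be picked up again.
            statuses[key] = "processed"
            succeeded.append(key)
    return succeeded, failed
```

Because the job selects work by status flag, it can be rerun on a schedule without reprocessing records that were already handled.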
| 30 | + |
| 31 | + |
| 32 | +### Automated Genomes Updater |
| 33 | + |
| 34 | +BEDbase loader includes an automated genomes updater that fetches genomes from the Refgenie server. |
| 35 | +Information about all genomes available on the Refgenie server is stored so that each BED file in BEDbase can be linked |
| 36 | +to the exact reference genome used to create it. |
| 37 | +Genomes are updated automatically by a cron job defined here: https://github.com/databio/bedbase-loader/blob/master/.github/workflows/update_genomes.yml |
| 38 | + |
| 39 | + |
| 40 | +### Umap Creator |
| 41 | + |
| 42 | +Embeddings of BED files are an important part of BEDbase. |
| 43 | +Visualizing the embeddings provides insight into the data stored in BEDbase. |
| 44 | +The embeddings are created with the BEDbase package, which automatically produces a UMAP file that is then visualized here: [https://bedbase.org/umap](https://bedbase.org/umap) |
| 45 | +The UMAP is kept up to date by a cron job defined here: https://github.com/databio/bedbase-loader/blob/master/.github/workflows/update_umap.yml |