Commit 628b8bd

refactor: reconstruct examples/search file
1 parent 658397d commit 628b8bd

20 files changed (+961 / -375 lines)

examples/search/build_db/build_protein_blast_db.sh

Lines changed: 0 additions & 56 deletions
This file was deleted.

examples/search/build_db/build_rna_blast_db.sh

Lines changed: 0 additions & 219 deletions
This file was deleted.

examples/search/search_dna.sh

Lines changed: 0 additions & 2 deletions
This file was deleted.
Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
# Search DNA Sequences

This example demonstrates how to search DNA sequences against the NCBI RefSeq database using BLAST.

## Overview

The DNA search pipeline reads DNA sequence queries, searches them against the NCBI RefSeq database to find similar sequences, and retrieves the associated metadata.

## Quick Start

### 1. Build Local BLAST Database (Optional)

If you want to use local BLAST for faster searches, first build the database:

```bash
./build_db.sh [human_mouse_drosophila_yeast|representative|complete|all]
```

Options:
- `human_mouse_drosophila_yeast`: Download only Homo sapiens, Mus musculus, Drosophila melanogaster, and Saccharomyces cerevisiae sequences (minimal, smallest)
- `representative`: Download genomic sequences from major categories (recommended, smaller)
- `complete`: Download all complete genomic sequences from the `complete/` directory (very large)
- `all`: Download all genomic sequences from all categories (very large)

The script creates a BLAST database in the `refseq_${RELEASE}/` directory.

### 2. Configure Search Parameters

Edit `search_dna_config.yaml` to set:

- **Input file path**: Set the path to your DNA sequence queries
- **NCBI parameters**:
  - `email`: Your email address (required by NCBI)
  - `tool`: Tool name for NCBI API
  - `use_local_blast`: Set to `true` if you have a local BLAST database
  - `local_blast_db`: Path to your local BLAST database (without .nhr extension)

Example configuration:
```yaml
input_path:
  - examples/input_examples/search_dna_demo.jsonl

data_sources: [ncbi]
ncbi_params:
  email: your_email@example.com # Required! Placeholder, replace with your own address
  tool: GraphGen
  use_local_blast: true
  local_blast_db: refseq_release/refseq_release
```
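Because `local_blast_db` is a path prefix rather than a single file, a typo in it is easy to miss. The sketch below is one way to sanity-check the prefix before enabling `use_local_blast`. It is not part of GraphGen; the helper name `blast_db_exists` is hypothetical, and it assumes a nucleotide database as produced by `makeblastdb`: `.nhr`/`.nin`/`.nsq` files for a single volume, or a `.nal` alias file for a multi-volume database.

```python
from pathlib import Path

# Core files that makeblastdb writes for a single-volume nucleotide database.
NUCLEOTIDE_EXTS = (".nhr", ".nin", ".nsq")


def blast_db_exists(db_prefix: str) -> bool:
    """Return True if a nucleotide BLAST database appears to exist at db_prefix.

    db_prefix is the database path without any extension, i.e. the same value
    you would put in `local_blast_db`. (Hypothetical helper, not a GraphGen API.)
    """
    # Single-volume database: <prefix>.nhr, <prefix>.nin, <prefix>.nsq
    if all(Path(db_prefix + ext).is_file() for ext in NUCLEOTIDE_EXTS):
        return True
    # Multi-volume database: an alias file <prefix>.nal points at the volumes
    return Path(db_prefix + ".nal").is_file()


if __name__ == "__main__":
    prefix = "refseq_release/refseq_release"  # value from the example config
    status = "found" if blast_db_exists(prefix) else "not found"
    print(f"BLAST database at {prefix}: {status}")
```

If the check fails, either rebuild the database or set `use_local_blast: false`.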

### 3. Run the Search

```bash
./search_dna.sh
```

Or run directly with Python:

```bash
python3 -m graphgen.run \
  --config_file examples/search/search_dna/search_dna_config.yaml \
  --output_dir cache/
```

## Input Format

The input file should be in JSONL format with DNA sequence queries:

```jsonl
{"type": "text", "content": "BRCA1"}
{"type": "text", "content": ">query\nATGCGATCG..."}
{"type": "text", "content": "ATGCGATCG..."}
```
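If your queries live in a script rather than a file, the JSONL input can be generated programmatically. The following is a minimal sketch using only the standard library; `write_queries` is a hypothetical helper, and the record shape (`type` and `content` keys) is copied from the example above. The three sample queries appear to be a gene symbol, a FASTA record, and a raw sequence, and the sequences are illustrative placeholders.

```python
import json
from pathlib import Path


def write_queries(queries, path):
    """Write one {"type": "text", "content": ...} JSON object per line.

    Hypothetical helper, not a GraphGen API. json.dumps escapes the newline
    inside a FASTA record, so every query stays on its own line.
    """
    path = Path(path)
    with path.open("w", encoding="utf-8") as fh:
        for query in queries:
            record = {"type": "text", "content": query}
            fh.write(json.dumps(record) + "\n")
    return path


if __name__ == "__main__":
    out = write_queries(
        [
            "BRCA1",              # looks like a gene symbol
            ">query\nATGCGATCG",  # FASTA record (placeholder sequence)
            "ATGCGATCG",          # raw sequence (placeholder)
        ],
        "my_dna_queries.jsonl",   # hypothetical output file name
    )
    print(f"wrote {out}")
```

Point `input_path` in `search_dna_config.yaml` at the generated file.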

## Output

Search results, including the matched sequences and their metadata from NCBI RefSeq, are saved in the output directory.

## Notes

- **NCBI requires an email address** - Make sure to set `email` in `ncbi_params`
- **Local BLAST** provides faster searches and doesn't require an internet connection during the search
- The local BLAST database can be very large (from several GB up to TB scale, depending on the download option)
- Adjust `max_concurrent` based on your system resources and API rate limits
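
To illustrate what a concurrency cap such as `max_concurrent` typically controls, here is a generic sketch built on `asyncio.Semaphore`. It is not GraphGen's implementation, which this example does not show; `run_limited` and `fake_search` are hypothetical names, and the sleep stands in for a remote BLAST request.

```python
import asyncio


async def run_limited(items, worker, max_concurrent=4):
    """Run worker(item) for every item with at most max_concurrent in flight.

    Generic illustration only; results come back in input order.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item):
        # Each task waits here until one of the max_concurrent slots frees up.
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_search(query):
    """Stand-in for a network search call (hypothetical)."""
    await asyncio.sleep(0.01)
    return f"result for {query}"


if __name__ == "__main__":
    queries = ["BRCA1", "TP53", "EGFR"]
    for line in asyncio.run(run_limited(queries, fake_search, max_concurrent=2)):
        print(line)
```

A lower cap makes it less likely that remote searches hit NCBI rate limits, while a higher cap mainly helps when searches run against a local database and the machine has spare CPU and memory.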
