- frontend tst: https://hivelab.tst.biochemistry.gwu.edu/biomuta
- frontend prd: https://hivelab.biochemistry.gwu.edu/biomuta
- BioMuta data release repo:
/data/shared/repos/biomuta
cd preprocessing
- Run `id_mapper.py` on `biomuta.csv` → gets `transcriptId`, `peptideId`, `refseqAc` per `canonicalAc` → outputs `$root/generated/uniprot_mapped_identifiers.csv`
- Run `codon_mapper.py` on your CSV → gets `refCodon`, `altCodon`, `posInCds`, `posInCodon`
- Join the two outputs on `canonicalAc` + `aa_pos` to build complete `biomuta_mutation_eff` records
- Write upsert script for `biomuta_mutation` and `biomuta_mutation_freq` (the collections built directly from `biomuta.csv`)
- Write upsert script for `biomuta_mutation_eff` with the joined data
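The join step above could look roughly like the sketch below. It uses only the standard library and assumes that both CSV outputs carry `canonicalAc` and `aa_pos` columns under exactly those names. The helper names `join_rows` and `read_csv` are made up for illustration and are not part of the repo.

```python
import csv
from typing import Dict, Iterable, List, Sequence, Tuple

Row = Dict[str, str]


def read_csv(path: str) -> List[Row]:
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def join_rows(left: Iterable[Row], right: Iterable[Row], keys: Sequence[str]) -> List[Row]:
    """Inner-join two sets of rows on the given key columns."""
    # Index the right-hand rows by key tuple; several rows may share one key.
    index: Dict[Tuple[str, ...], List[Row]] = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    joined: List[Row] = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            merged = dict(match)
            merged.update(row)  # left-hand values win on any shared column
            joined.append(merged)
    return joined


# Hypothetical usage (paths assumed to be relative to the repo root):
#   ids = read_csv("generated/uniprot_mapped_identifiers.csv")
#   codons = read_csv("generated/mapped_codons.csv")
#   records = join_rows(ids, codons, ["canonicalAc", "aa_pos"])
```

Rows without a partner on the other side are dropped (inner join). If unmatched mutations should be kept, that choice needs to be made explicitly.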
# Step 2
python codon_mapper.py -c config.json -m /data/shared/repos/biomuta-old/generated_datasets/compiled/biomuta_v6.1_toy.csv -o /data/shared/repos/biomuta/generated/mapped_codons.csv
/data/shared/repos/biomuta/json_exports/
Note to self: implement positional argument that takes the server instead of having to edit the script
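One possible shape for that positional argument, as a sketch only. It assumes the script in question is the loader that reads `config['dbinfo']['port'][...]`, and that `tst` and `prd` are the only valid servers. The function names are hypothetical.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the target server from the command line."""
    parser = argparse.ArgumentParser(description="Load JSON exports into MongoDB.")
    # Positional argument; argparse rejects anything other than tst or prd.
    parser.add_argument("server", choices=["tst", "prd"], help="target environment")
    return parser.parse_args(argv)


def mongo_port_for(config: dict, server: str):
    """Look up the MongoDB port for the chosen server in the loaded config."""
    return config["dbinfo"]["port"][server]
```

The script would then be invoked as, for example, `python misc_scripts/json_to_MongodbCollections.py tst`, with no edits to the source needed.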
Next time and always:
- Always run `create_indexes.py` after loading data; your loader script should call it automatically
- A 502 from Apache usually means the backend timed out, not crashed; check app logs first
- `docker logs <container> -f` is your best friend for real-time debugging
- When queries are slow despite indexes existing, check that the index field name exactly matches what the code queries (we chased a `gene_name` vs `geneName` red herring)
- `getIndexes()` and `explain("executionStats")` in mongosh are the fastest way to verify query performance
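The field-name check can also be scripted from Python. This sketch assumes pymongo, whose `Collection.index_information()` returns a dict mapping index names to specs with a `key` list of `(field, direction)` pairs. The two helper functions are illustrative and not part of the repo.

```python
from typing import Dict, List, Set


def leading_index_fields(index_info: Dict[str, dict]) -> Set[str]:
    """Return the first field of every index in index_information() output."""
    return {spec["key"][0][0] for spec in index_info.values()}


def unindexed_fields(index_info: Dict[str, dict], queried_fields: List[str]) -> List[str]:
    """List the queried fields that no index starts with (exact, case-sensitive match)."""
    indexed = leading_index_fields(index_info)
    return [field for field in queried_fields if field not in indexed]


# Hypothetical usage with a live pymongo database handle `db`:
#   info = db["C_biomuta_mutation"].index_information()
#   print(unindexed_fields(info, ["id"]))
```

Because the comparison is an exact string match, a `gene_name` index is reported as missing when the code queries `geneName`.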
cd /data/shared/repos/biomuta
senv # alias for source env/bin/activate
# If needed, first set the environment ('tst' or 'prd') in the script line: mongo_port = config['dbinfo']['port']['tst']
python misc_scripts/json_to_MongodbCollections.py
Connect to MongoDB via the Docker container
# tst
docker exec -it running_biomuta_mongo_tst mongosh
# prd
docker exec -it running_biomuta_mongo_prd mongosh
Once inside the container's mongosh shell:
use admin
db.auth("your_admin_user", "your_admin_password")
use your_db_name
db.<collection_name>.findOne() // e.g. db.C_biomuta_mutation.findOne()
docker ps -a
Look for the BioMuta app container (not the mongo one). Check its status and how long ago it was running.
docker logs <biomuta_app_container> --tail 100
ss -tlnp | grep <expected_port>
# or
netstat -tlnp | grep <expected_port>
Sometimes containers go down due to OOM (out of memory).
free -h
df -h
cat /etc/apache2/sites-enabled/*.conf
# or for nginx
cat /etc/nginx/sites-enabled/*
If every query in the MongoDB log shows "planSummary":"COLLSCAN", MongoDB is scanning the entire collection (millions of documents) for every lookup by id. Each of those queries scans up to 3.28 million documents and takes 1-2 seconds, so a single page load that triggers many of them easily exceeds the Apache/Nginx proxy timeout, and the proxy returns the 502. The collection has no index on the id field, so as it grew over time the queries got progressively slower until they started timing out.
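To see how widespread the collection scans are, the log lines can be tallied per collection. This is a sketch under two assumptions: the MongoDB log is in the structured JSON format (one JSON object per line), and slow-query entries carry `attr.ns` and `attr.planSummary`. The function name is made up.

```python
import json
from collections import Counter
from typing import Iterable


def count_collscans(log_lines: Iterable[str]) -> Counter:
    """Count COLLSCAN queries per namespace in MongoDB JSON log lines."""
    counts: Counter = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        attr = entry.get("attr", {})
        if isinstance(attr, dict) and attr.get("planSummary") == "COLLSCAN":
            counts[attr.get("ns", "unknown")] += 1
    return counts
```

Feed it the lines from `docker logs <container>` (for example via `sys.stdin`). Namespaces with high counts are the collections that need an index first.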
The fix is to add an index inside the Docker container:
docker exec -it running_biomuta_mongo_prd mongosh
# Once inside the container's mongosh shell
use biomuta_db
db.auth("your_db_user", "your_db_password")
db.C_biomuta_mutation.createIndex({ id: 1 })
The index takes a few minutes to build on a large collection. Once it is built, the lookups that currently take 500-1700 ms should drop to under 1 ms. Also check whether other collections have the same problem:
db.C_biomuta_protein.createIndex({ id: 1 })
db.C_biomuta_cancer.createIndex({ id: 1 })
// etc. for all collections queried by id
Indexing is handled by the json_to_MongodbCollections.py script, which creates indexes after loading. See this line after insert_many:
collection.create_index("id")
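The above command doesn't create all the indexes we need, likely because it only indexes the id field while several collections are queried by other fields. Run these inside the mongosh shell in the container: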
db.C_biomuta_mutation_eff.createIndex({ canonicalAc: 1 })
db.C_biomuta_mutation_eff.createIndex({ mutationId: 1 })
db.C_biomuta_protein_ann.createIndex({ canonicalAc: 1 })
db.C_biomuta_mutation_freq.createIndex({ mutationId: 1 })
db.C_biomuta_mutation_pmid.createIndex({ mutationId: 1 })
db.C_biomuta_do2uberon.createIndex({ doId: 1 })
db.C_biomuta_cancer.createIndex({ id: 1 })
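To avoid repeating these by hand after every load, the loader could create them itself. The sketch below assumes pymongo, where `db[name].create_index(field)` builds a single-field ascending index and returns its name. The collection-to-field mapping is copied from the mongosh commands above, and `ensure_indexes` is a hypothetical helper, not something in the repo.

```python
from typing import Dict, List

# Collection -> fields to index, taken from the mongosh commands above.
INDEX_FIELDS: Dict[str, List[str]] = {
    "C_biomuta_mutation": ["id"],
    "C_biomuta_protein": ["id"],
    "C_biomuta_cancer": ["id"],
    "C_biomuta_mutation_eff": ["canonicalAc", "mutationId"],
    "C_biomuta_protein_ann": ["canonicalAc"],
    "C_biomuta_mutation_freq": ["mutationId"],
    "C_biomuta_mutation_pmid": ["mutationId"],
    "C_biomuta_do2uberon": ["doId"],
}


def ensure_indexes(db, index_fields: Dict[str, List[str]] = INDEX_FIELDS) -> List[str]:
    """Create a single-field index for every listed field; return the index names."""
    created: List[str] = []
    for collection_name, fields in index_fields.items():
        for field in fields:
            # Re-creating an identical existing index is a no-op in MongoDB.
            created.append(db[collection_name].create_index(field))
    return created
```

Calling `ensure_indexes(db)` at the end of the load keeps the index list in one place, so a newly queried field only needs a one-line addition to the mapping.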