PubTrends is an interactive scientific literature exploration tool that helps researchers analyze topics, visualize research trends, and discover related works.
Available online at: https://pubtrends.info/
With PubTrends, you can:
- Gain a concise overview of your research area.
- Explore popular trends and impactful publications.
- Discover new and promising research directions.
See an example of the analysis at: https://pubtrends.info/about.html
Supported data sources:
- PubMed: ~40 million papers and 450 million citations
- Semantic Scholar: ~170 million papers and 600 million citations
PubTrends is a Python / Kotlin + JavaScript web service with a PostgreSQL backend. It uses:
- Languages: Python + Kotlin + JavaScript
- Backend: Nginx + Flask + Gunicorn
- Task Queue: Celery + Redis
- Database: Postgres + pgvector + Psycopg2 + Kotlin ORM
- Data Analysis: Pandas, NumPy, Scikit-learn
- Semantic Search: Sentence-Transformers + Faiss
- NLP: NLTK, spaCy, Gensim, fastText
- Visualization: Bokeh, Holoviews, Seaborn, Matplotlib
- Frontend: Bootstrap, jQuery, Cytoscape.js
- Deployment: Docker Compose
- Testing: PyTest + Flake8 + JUnit + TeamCity
See pyproject.toml for the full list of libraries used in the project.
- Copy and modify `config.properties` to `~/.pubtrends/config.properties`.
  Ensure that the file contains correct information about the database(s): URL, port, DB name, username, and password.
- The Python environment `pubtrends` can be easily created using uv for launching Jupyter Notebook and the web service:

  ```bash
  uv venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  uv pip install -r pyproject.toml
  ```

- Build the base Docker image `biolabs/pubtrends` and the nested image `biolabs/pubtrends-test` for testing:

  ```bash
  docker build -f resources/docker/main/Dockerfile -t biolabs/pubtrends --platform linux/amd64 .
  docker build -f resources/docker/test/Dockerfile -t biolabs/pubtrends-test --platform linux/amd64 .
  ```
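For reference, `config.properties` uses plain `key=value` lines. Here is a minimal, stdlib-only sketch of reading such a file from Python; the key names shown in the usage comment are purely illustrative, not the actual PubTrends settings:

```python
from pathlib import Path

def load_properties(path: str) -> dict:
    """Parse simple key=value properties, skipping blank lines and # comments."""
    props = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

# Hypothetical usage:
# load_properties("config.properties") -> {"url": "localhost", "port": "5432", ...}
```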
Init the Postgres database.

- Launch the Docker image:

  ```bash
  docker run --rm --name pubtrends-postgres \
    -e POSTGRES_USER=biolabs -e POSTGRES_PASSWORD=mysecretpassword \
    -v ~/postgres/:/var/lib/postgresql/data \
    -e PGDATA=/var/lib/postgresql/data/pgdata \
    -p 5432:5432 \
    -d postgres:17
  ```

- Create a database (once the database is created, use the `-d pubtrends` argument):

  ```bash
  psql -h localhost -p 5432 -U biolabs
  ```

  ```sql
  ALTER ROLE biolabs WITH LOGIN;
  CREATE DATABASE pubtrends OWNER biolabs;
  ```

- Configure memory params in `~/postgres/pgdata/postgresql.conf`:

  ```
  # Memory settings
  effective_cache_size = 8GB    # ~50 to 75% of RAM (can be set precisely by referring to "top" free+cached)
  shared_buffers = 2GB          # ~1/4 to 1/3 of total system RAM
  work_mem = 1GB                # For sorting, ordering, etc.
  max_connections = 4           # Total memory is work_mem * connections
  maintenance_work_mem = 1GB    # Memory for indexes, etc.

  # Write performance
  checkpoint_timeout = 10min
  checkpoint_completion_target = 0.8
  synchronous_commit = off
  ```

  You can check current settings with the `SHOW ALL;` command in the psql console.
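As a rough cross-check of the ratios in the comments above, a small hypothetical helper (not part of PubTrends) that derives these settings from total system RAM:

```python
def suggest_memory_settings(total_ram_gb: int, max_connections: int = 4) -> dict:
    """Rough postgresql.conf memory suggestions, following the ratios above:
    shared_buffers ~1/4 of RAM, effective_cache_size ~3/4 of RAM, and
    work_mem sized so that work_mem * max_connections stays bounded."""
    return {
        "shared_buffers": f"{max(1, total_ram_gb // 4)}GB",
        "effective_cache_size": f"{max(1, total_ram_gb * 3 // 4)}GB",
        "work_mem": f"{max(1, total_ram_gb // (4 * max_connections))}GB",
        "max_connections": max_connections,
    }

print(suggest_memory_settings(16))
```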
Use the following command to test and build the JAR package:

```bash
./gradlew clean test shadowJar
```

PostgreSQL should be configured and running.

Launch the crawler to download and keep a PubMed database up to date:

```bash
java -cp build/libs/pubtrends-dev.jar org.jetbrains.bio.pubtrends.pm.PubmedLoader --fillDatabase
```

Supported command line options:

- `resetDatabase` - clear the current contents of the database (for development)
- `fillDatabase` - fill the database with PubMed data; can be interrupted at any moment
- `lastId` - force downloading from the given id, starting with the articles pack `pubmed20n{lastId+1}.xml`

Updates - add the following line to crontab (`crontab -e`):

```
0 22 * * * java -cp pubtrends-<version>.jar org.jetbrains.bio.pubtrends.pm.PubmedLoader --fillDatabase | \
  tee -a crontab_update.log
```
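To make the `lastId` semantics concrete: downloading resumes with pack number `lastId + 1`. A hypothetical sketch (the 4-digit zero padding of pack numbers is an assumption here, not taken from the loader):

```python
def next_pack(last_id: int) -> str:
    """Name of the first articles pack the crawler fetches after lastId
    (zero padding assumed for illustration)."""
    return f"pubmed20n{last_id + 1:04d}.xml"

print(next_pack(41))  # pubmed20n0042.xml
```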
Download a sample from Semantic Scholar or the full archive. See Open Corpus.
The latest release can be found at: https://api.semanticscholar.org/api-docs/datasets#tag/Release-Data

```bash
curl https://api.semanticscholar.org/datasets/v1/release/
```
- Linux & macOS

  ```bash
  # Fail on errors
  set -euox pipefail

  DATE="2022-05-01"
  PUBTRENDS_JAR=

  wget https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/$DATE/manifest.txt
  echo "" > complete.txt
  N=$(cat manifest.txt | grep corpus | wc -l)
  cat manifest.txt | grep corpus | while read -r file; do
    if [[ -z $(grep "$file" complete.txt) ]]; then
      echo "Processing $file / $N"
      wget https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/$DATE/$file
      java -cp $PUBTRENDS_JAR org.jetbrains.bio.pubtrends.ss.SemanticScholarLoader --fillDatabase $(pwd)/$file
      rm $file
      echo "$file" >> complete.txt
    fi
  done
  java -cp $PUBTRENDS_JAR org.jetbrains.bio.pubtrends.ss.SemanticScholarLoader --index --finish
  ```

- Windows 10 PowerShell

  ```powershell
  $DATE = "2023-03-14"
  $PUBTRENDS_JAR =

  curl.exe -o .\manifest.txt https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/$DATE/manifest.txt
  echo "" > .\complete.txt
  foreach ($file in Get-Content .\manifest.txt) {
    $sel = Select-String -Path .\complete.txt -Pattern $file
    if ($sel -eq $null) {
      echo "Processing $file"
      curl.exe -o .\$file https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/$DATE/$file
      java -cp $PUBTRENDS_JAR org.jetbrains.bio.pubtrends.ss.SemanticScholarLoader --fillDatabase .\$file
      del .\$file
      echo $file >> .\complete.txt
    }
  }
  java -cp $PUBTRENDS_JAR org.jetbrains.bio.pubtrends.ss.SemanticScholarLoader --index --finish
  ```
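The same resumable loop can be written once in Python for both platforms. This is only a stdlib sketch mirroring the shell scripts above; the base URL and loader invocation are taken from them, everything else is illustrative:

```python
import pathlib
import subprocess
import urllib.request

BASE = "https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus"

def remaining(manifest_lines, completed):
    """Corpus files from the manifest that are not yet marked as complete."""
    return [f for f in manifest_lines if "corpus" in f and f not in completed]

def download_all(date: str, jar: str) -> None:
    manifest = urllib.request.urlopen(f"{BASE}/{date}/manifest.txt").read().decode().split()
    done = pathlib.Path("complete.txt")
    done.touch()
    for name in remaining(manifest, set(done.read_text().split())):
        # Download one pack, load it into the DB, then record it as complete
        urllib.request.urlretrieve(f"{BASE}/{date}/{name}", name)
        subprocess.run(
            ["java", "-cp", jar, "org.jetbrains.bio.pubtrends.ss.SemanticScholarLoader",
             "--fillDatabase", str(pathlib.Path(name).resolve())],
            check=True)
        pathlib.Path(name).unlink()
        with done.open("a") as fh:
            fh.write(name + "\n")
    subprocess.run(
        ["java", "-cp", jar, "org.jetbrains.bio.pubtrends.ss.SemanticScholarLoader",
         "--index", "--finish"],
        check=True)
```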
Please ensure that the embeddings Postgres DB with the pgvector extension is up and running:

```bash
docker run --rm --name pgvector -p 5430:5432 \
    -m 32G \
    -e POSTGRES_USER=biolabs -e POSTGRES_PASSWORD=mysecretpassword \
    -e POSTGRES_DB=pubtrends \
    -v ~/pgvector/:/var/lib/postgresql/data \
    -e PGDATA=/var/lib/postgresql/data/pgdata \
    -d pgvector/pgvector:pg17
```

Then you'll be able to update embeddings with the command lines below. They compute embeddings, store them in the vector DB, and update the Faiss index for fast search.
```bash
docker build -f pysrc/preprocess/embeddings/Dockerfile -t update_embeddings --platform linux/amd64 .
docker run -v ~/.pubtrends:/config:ro \
    -v ~/.pubtrends/logs:/logs \
    -v ~/.pubtrends/sentence-transformers:/sentence-transformers \
    -v ~/.pubtrends/nltk_data:/home/user/nltk_data \
    -v ~/.pubtrends/faiss:/faiss \
    -it update_embeddings /bin/bash
```

Then, inside the container:

```bash
uv pip install --no-cache torch --index-url https://download.pytorch.org/whl/cpu
uv pip install --no-cache sentence-transformers faiss-cpu
export PYTHONPATH=$PYTHONPATH:$(pwd)
/bin/bash scripts/nlp.sh
python pysrc/preprocess/update_embeddings.py
```
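Conceptually, semantic search over those stored embeddings is nearest-neighbor retrieval. A stdlib-only toy illustration of the idea follows; the real service uses Sentence-Transformers vectors and a Faiss index instead of this brute-force scan:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query, corpus, k=2):
    """Return the ids of the k corpus embeddings most similar to the query."""
    ranked = sorted(corpus, key=lambda pid: cosine(query, corpus[pid]), reverse=True)
    return ranked[:k]

# Toy 2-dimensional "embeddings" keyed by paper id
corpus = {"p1": [1.0, 0.0], "p2": [0.9, 0.1], "p3": [0.0, 1.0]}
print(top_k([1.0, 0.05], corpus))  # p1 and p2 rank above p3
```

A Faiss index replaces the linear scan in `top_k` with an approximate search structure, which is what makes querying millions of paper embeddings fast.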
Please ensure that you have a database configured, up, and running.
Then launch the web service, or use Jupyter Notebook for development.
Two Docker images are used for testing, development and deployment:
- biolabs/pubtrends - production
- biolabs/pubtrends-test - testing
We use Docker Hub to store built images.
- Create the necessary folders with script `scripts/init.sh` and download prerequisites:

  ```bash
  bash scripts/init.sh
  bash scripts/nlp.sh
  ```

- Start Redis:

  ```bash
  docker run -p 6379:6379 redis:7.4.2
  ```

- Configure the Python environment with uv:

  ```bash
  uv venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  uv pip install -r pyproject.toml
  pip install --no-cache torch --index-url https://download.pytorch.org/whl/cpu
  pip install --no-cache sentence-transformers faiss-cpu jupyter notebook
  ```
- Start the Celery worker queue:

  ```bash
  source .venv/bin/activate
  export PYTHONPATH=$PYTHONPATH:$(pwd)
  celery -A pysrc.celery.tasks worker -c 1 --loglevel=debug
  ```

- Start the Flask server at http://localhost:5000/:

  ```bash
  source .venv/bin/activate
  export PYTHONPATH=$PYTHONPATH:$(pwd)
  python -m pysrc.app.pubtrends_app
  ```

- Start the text embeddings service (based on either a pretrained fastText model or a sentence transformer) at http://localhost:5001/:

  ```bash
  source .venv/bin/activate
  export PYTHONPATH=$PYTHONPATH:$(pwd)
  python -m pysrc.endpoints.embeddings.fasttext.fasttext_app
  ```

  or

  ```bash
  source .venv/bin/activate
  export PYTHONPATH=$PYTHONPATH:$(pwd)
  python -m pysrc.endpoints.embeddings.sentence_transformer.sentence_transformer_app
  ```

- Optionally, start the semantic search service at http://localhost:5002/:

  ```bash
  source .venv/bin/activate
  export PYTHONPATH=$PYTHONPATH:$(pwd)
  python -m pysrc.endpoints.semantic_search.semantic_search_app
  ```
Notebooks are located under the `/notebooks` folder. Please configure `PYTHONPATH` before using Jupyter:

```bash
source .venv/bin/activate
export PYTHONPATH=$PYTHONPATH:$(pwd)
jupyter notebook
```
- Start a Docker image with a Postgres environment for tests (Kotlin and Python development):

  ```bash
  docker run --rm --platform linux/amd64 --name pubtrends-test \
    --publish=5433:5432 --volume=$(pwd):/pubtrends -i -t biolabs/pubtrends-test
  ```

  NOTE: remember to stop the container afterward.
- Kotlin tests:

  ```bash
  ./gradlew clean test
  ```

- Python tests with code style check for development (including integration with Kotlin DB writers):

  ```bash
  source .venv/bin/activate; pytest pysrc
  ```

- Python tests within Docker (ensure that the `./build/libs/pubtrends-dev.jar` file is present):

  ```bash
  docker run --rm --platform linux/amd64 --volume=$(pwd):/pubtrends -t biolabs/pubtrends-test /bin/bash -c \
    "/usr/lib/postgresql/17/bin/pg_ctl -D /home/user/postgres start; \
    cd /pubtrends; cp config.properties /home/user/.pubtrends/; \
    pytest pysrc"
  ```
Deployment is done with docker-compose:
- Gunicorn serving the main PubTrends Flask app
- Redis as a message broker
- Celery worker queue
Please ensure that you have configured and prepared the database(s).
- Modify file `config.properties` with information about the database(s). The file from the project folder is used in this case.
- Start the Postgres server:

  ```bash
  docker run --rm --name pubtrends-postgres -p 5432:5432 \
    -m 32G \
    -e POSTGRES_USER=biolabs -e POSTGRES_PASSWORD=mysecretpassword \
    -e POSTGRES_DB=pubtrends \
    -v ~/postgres/:/var/lib/postgresql/data \
    -e PGDATA=/var/lib/postgresql/data/pgdata \
    -d postgres:17
  ```

  NOTE: stop the Postgres Docker image with timeout `--time=300` to avoid DB recovery.

  NOTE 2: for speed reasons we use materialized views, which are updated upon a successful database update. In case of an emergency stop, the view should be refreshed manually to ensure that sorting by citations works correctly:

  ```bash
  psql -h localhost -p 5432 -U biolabs -d pubtrends
  ```

  ```sql
  refresh materialized view matview_pmcitations;
  ```
- Build a ready-for-deployment package with script `scripts/dist.sh`:

  ```bash
  scripts/dist.sh build=build-number ga=google-analytics-id
  ```
- Launch PubTrends with docker-compose (one of the options):

  ```bash
  # Start with local word2vec tf-idf tokens embeddings
  docker-compose -f docker-compose/word2vec.yml up --build
  # Start with BioWord2Vec tokens embeddings
  docker-compose -f docker-compose/fasttext.yml up --build
  # Start with Sentence Transformer for text embeddings
  docker-compose -f docker-compose/sentence-transformer.yml up --build
  # Start with Semantic Search based on Sentence Transformer
  docker-compose -f docker-compose/semantic-search.yml up --build
  ```

  Use these commands to stop the compose build and check logs:

  ```bash
  # Stop
  docker-compose -f docker-compose/semantic-search.yml down --remove-orphans
  # Inspect logs
  docker-compose -f docker-compose/semantic-search.yml logs
  ```

  PubTrends will be served on port 5000.
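Once the compose stack is up, a quick smoke check can poll the service until it answers. This is a hypothetical helper, not part of the repo, and it assumes the root endpoint returns HTTP 200 when healthy:

```python
import time
import urllib.error
import urllib.request

def wait_until_up(url: str = "http://localhost:5000/", timeout_s: int = 60) -> bool:
    """Poll url until it returns HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Service not up yet; retry shortly
            time.sleep(2)
    return False
```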
- Update nginx timeouts:

  ```nginx
  # Increase timeouts
  proxy_connect_timeout 60s;
  proxy_send_timeout 600s;
  proxy_read_timeout 600s;
  send_timeout 600s;
  ```

- Use a simple placeholder during maintenance:

  ```bash
  cd pysrc/app; python -m http.server 5000
  ```
- Update `CHANGES.md`
- Update version in `scripts/dist.sh`
- Launch `scripts/dist.sh`; `pubtrends-XXX.tar.gz` will be created in the `dist` directory.
See AUTHORS.md for a list of authors and contributors.
Shpynov, O. and Kapralov, N., 2021, August. PubTrends: a scientific literature explorer. In Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 1-1). https://doi.org/10.1145/3459930.3469501
Here’s how you can help:
- ⭐ Star this repo, help others to discover it
- 🐛 Found a bug? Open an issue
- 💡 Have an idea? Feel free to submit a feature request or a PR
- 👍 Upvote issues you care about, help us prioritize
