PubTrends is an interactive scientific literature exploration tool that helps researchers analyze topics, visualize research trends, and discover related works.
Available online at: https://pubtrends.info/
With PubTrends, you can:
- Gain a concise overview of your research area.
- Explore popular trends and impactful publications.
- Discover new and promising research directions.
See an example of the analysis at: https://pubtrends.info/about.html
Supported data sources:
- PubMed: ~40 million papers and 450 million citations
- Semantic Scholar: ~170 million papers and 600 million citations
PubTrends is a Python / Kotlin + JavaScript web service with a PostgreSQL backend. It uses:
- Languages: Python + Kotlin + JavaScript
- Backend: Nginx + Flask + Gunicorn
- Task Queue: Celery + Redis
- Database: Postgres + pgvector + Psycopg2 + Kotlin ORM
- Data Analysis: Pandas, NumPy, Scikit-learn
- Semantic Search: Sentence-Transformers + Faiss
- NLP: NLTK, spaCy, Gensim, fastText
- Visualization: Bokeh, Holoviews, Seaborn, Matplotlib
- Frontend: Bootstrap, jQuery, Cytoscape.js
- Deployment: Docker Compose
- Testing: PyTest + Flake8 + JUnit + TeamCity
See pyproject.toml for the full list of libraries used in the project.
- Copy and modify `config.properties` to `~/.pubtrends/config.properties`.
  Ensure that the file contains correct information about the database(s): URL, port, DB name, username, and password.
- The Python environment `pubtrends` can be easily created using uv for launching Jupyter Notebook and the web service:

  ```bash
  uv venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  uv pip install -r pyproject.toml
  ```

- Build the base Docker image `biolabs/pubtrends` and the nested image `biolabs/pubtrends-test` for testing:

  ```bash
  docker build -f resources/docker/main/Dockerfile -t biolabs/pubtrends --platform linux/amd64 .
  docker build -f resources/docker/test/Dockerfile -t biolabs/pubtrends-test --platform linux/amd64 .
  ```
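For reference, `config.properties` uses plain `key=value` lines. Here is a minimal, stdlib-only sketch of reading such a file from Python; the key names shown in the usage comment are purely illustrative, not the actual PubTrends settings:

```python
from pathlib import Path

def load_properties(path: str) -> dict:
    """Parse simple key=value properties, skipping blank lines and # comments."""
    props = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

# Hypothetical usage:
# load_properties("config.properties") -> {"url": "localhost", "port": "5432", ...}
```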
Init the Postgres database.

- Launch the Docker image:

  ```bash
  docker run --rm --name pubtrends-postgres \
    -e POSTGRES_USER=biolabs -e POSTGRES_PASSWORD=mysecretpassword \
    -v ~/postgres/:/var/lib/postgresql/data \
    -e PGDATA=/var/lib/postgresql/data/pgdata \
    -p 5432:5432 \
    -d postgres:17
  ```

- Create a database (once the database is created, use the `-d pubtrends` argument):

  ```bash
  psql -h localhost -p 5432 -U biolabs
  ```

  ```sql
  ALTER ROLE biolabs WITH LOGIN;
  CREATE DATABASE pubtrends OWNER biolabs;
  ```

- Configure memory params in `~/postgres/pgdata/postgresql.conf`:

  ```
  # Memory settings
  effective_cache_size = 8GB    # ~50 to 75% of RAM (can be set precisely by referring to "top" free+cached)
  shared_buffers = 2GB          # ~1/4 to 1/3 of total system RAM
  work_mem = 1GB                # For sorting, ordering, etc.
  max_connections = 4           # Total memory is work_mem * connections
  maintenance_work_mem = 1GB    # Memory for indexes, etc.

  # Write performance
  checkpoint_timeout = 10min
  checkpoint_completion_target = 0.8
  synchronous_commit = off
  ```

  You can check current settings with the `SHOW ALL;` command in the psql console.
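As a rough cross-check of the ratios in the comments above, a small hypothetical helper (not part of PubTrends) that derives these settings from total system RAM:

```python
def suggest_memory_settings(total_ram_gb: int, max_connections: int = 4) -> dict:
    """Rough postgresql.conf memory suggestions, following the ratios above:
    shared_buffers ~1/4 of RAM, effective_cache_size ~3/4 of RAM, and
    work_mem sized so that work_mem * max_connections stays bounded."""
    return {
        "shared_buffers": f"{max(1, total_ram_gb // 4)}GB",
        "effective_cache_size": f"{max(1, total_ram_gb * 3 // 4)}GB",
        "work_mem": f"{max(1, total_ram_gb // (4 * max_connections))}GB",
        "max_connections": max_connections,
    }

print(suggest_memory_settings(16))
```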
Use the following command to test and build the JAR package:

```bash
./gradlew clean test shadowJar
```

PostgreSQL should be configured and running.

Launch the crawler to download and keep a PubMed database up to date:

```bash
java -cp build/libs/pubtrends-dev.jar org.jetbrains.bio.pubtrends.pm.PubmedLoader --fillDatabase
```

Supported command line options:

- `resetDatabase` - clear the current contents of the database (for development)
- `fillDatabase` - fill the database with PubMed data; can be interrupted at any moment
- `lastId` - force downloading from the given id, starting with the articles pack `pubmed20n{lastId+1}.xml`

Updates - add the following line to crontab (`crontab -e`):

```
0 22 * * * java -cp pubtrends-<version>.jar org.jetbrains.bio.pubtrends.pm.PubmedLoader --fillDatabase | \
  tee -a crontab_update.log
```
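To make the `lastId` semantics concrete: downloading resumes with pack number `lastId + 1`. A hypothetical sketch (the 4-digit zero padding of pack numbers is an assumption here, not taken from the loader):

```python
def next_pack(last_id: int) -> str:
    """Name of the first articles pack the crawler fetches after lastId
    (zero padding assumed for illustration)."""
    return f"pubmed20n{last_id + 1:04d}.xml"

print(next_pack(41))  # pubmed20n0042.xml
```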
Download a sample from Semantic Scholar or the full archive. See Open Corpus.
The latest release can be found at: https://api.semanticscholar.org/api-docs/datasets#tag/Release-Data

```bash
curl https://api.semanticscholar.org/datasets/v1/release/
```
- Linux & macOS

  ```bash
  # Fail on errors
  set -euox pipefail

  DATE="2022-05-01"
  PUBTRENDS_JAR=

  wget https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/$DATE/manifest.txt
  echo "" > complete.txt
  N=$(cat manifest.txt | grep corpus | wc -l)
  cat manifest.txt | grep corpus | while read -r file; do
    if [[ -z $(grep "$file" complete.txt) ]]; then
      echo "Processing $file / $N"
      wget https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/$DATE/$file
      java -cp $PUBTRENDS_JAR org.jetbrains.bio.pubtrends.ss.SemanticScholarLoader --fillDatabase $(pwd)/$file
      rm $file
      echo "$file" >> complete.txt
    fi
  done
  java -cp $PUBTRENDS_JAR org.jetbrains.bio.pubtrends.ss.SemanticScholarLoader --index --finish
  ```

- Windows 10 PowerShell

  ```powershell
  $DATE = "2023-03-14"
  $PUBTRENDS_JAR =

  curl.exe -o .\manifest.txt https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/$DATE/manifest.txt
  echo "" > .\complete.txt
  foreach ($file in Get-Content .\manifest.txt) {
    $sel = Select-String -Path .\complete.txt -Pattern $file
    if ($sel -eq $null) {
      echo "Processing $file"
      curl.exe -o .\$file https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/$DATE/$file
      java -cp $PUBTRENDS_JAR org.jetbrains.bio.pubtrends.ss.SemanticScholarLoader --fillDatabase .\$file
      del .\$file
      echo $file >> .\complete.txt
    }
  }
  java -cp $PUBTRENDS_JAR org.jetbrains.bio.pubtrends.ss.SemanticScholarLoader --index --finish
  ```
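The same resumable loop can be written once in Python for both platforms. This is only a stdlib sketch mirroring the shell scripts above; the base URL and loader invocation are taken from them, everything else is illustrative:

```python
import pathlib
import subprocess
import urllib.request

BASE = "https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus"

def remaining(manifest_lines, completed):
    """Corpus files from the manifest that are not yet marked as complete."""
    return [f for f in manifest_lines if "corpus" in f and f not in completed]

def download_all(date: str, jar: str) -> None:
    manifest = urllib.request.urlopen(f"{BASE}/{date}/manifest.txt").read().decode().split()
    done = pathlib.Path("complete.txt")
    done.touch()
    for name in remaining(manifest, set(done.read_text().split())):
        # Download one pack, load it into the DB, then record it as complete
        urllib.request.urlretrieve(f"{BASE}/{date}/{name}", name)
        subprocess.run(
            ["java", "-cp", jar, "org.jetbrains.bio.pubtrends.ss.SemanticScholarLoader",
             "--fillDatabase", str(pathlib.Path(name).resolve())],
            check=True)
        pathlib.Path(name).unlink()
        with done.open("a") as fh:
            fh.write(name + "\n")
    subprocess.run(
        ["java", "-cp", jar, "org.jetbrains.bio.pubtrends.ss.SemanticScholarLoader",
         "--index", "--finish"],
        check=True)
```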
Please ensure that the embeddings Postgres DB with the pgvector extension is up and running:

```bash
docker run --rm --name pgvector -p 5430:5432 \
    -m 32G \
    -e POSTGRES_USER=biolabs -e POSTGRES_PASSWORD=mysecretpassword \
    -e POSTGRES_DB=pubtrends \
    -v ~/pgvector/:/var/lib/postgresql/data \
    -e PGDATA=/var/lib/postgresql/data/pgdata \
    -d pgvector/pgvector:pg17
```

Then you'll be able to update embeddings with the command lines below. They compute embeddings, store them in the vector DB, and update the Faiss index for fast search.
```bash
docker build -f pysrc/preprocess/embeddings/Dockerfile -t update_embeddings --platform linux/amd64 .
docker run -v ~/.pubtrends:/config:ro \
    -v ~/.pubtrends/logs:/logs \
    -v ~/.pubtrends/sentence-transformers:/sentence-transformers \
    -v ~/.pubtrends/nltk_data:/home/user/nltk_data \
    -v ~/.pubtrends/faiss:/faiss \
    -it update_embeddings /bin/bash
```

Then, inside the container:

```bash
uv pip install --no-cache torch --index-url https://download.pytorch.org/whl/cpu
uv pip install --no-cache sentence-transformers faiss-cpu
export PYTHONPATH=$PYTHONPATH:$(pwd)
/bin/bash scripts/nlp.sh
python pysrc/preprocess/update_embeddings.py
```
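Conceptually, semantic search over those stored embeddings is nearest-neighbor retrieval. A stdlib-only toy illustration of the idea follows; the real service uses Sentence-Transformers vectors and a Faiss index instead of this brute-force scan:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query, corpus, k=2):
    """Return the ids of the k corpus embeddings most similar to the query."""
    ranked = sorted(corpus, key=lambda pid: cosine(query, corpus[pid]), reverse=True)
    return ranked[:k]

# Toy 2-dimensional "embeddings" keyed by paper id
corpus = {"p1": [1.0, 0.0], "p2": [0.9, 0.1], "p3": [0.0, 1.0]}
print(top_k([1.0, 0.05], corpus))  # p1 and p2 rank above p3
```

A Faiss index replaces the linear scan in `top_k` with an approximate search structure, which is what makes querying millions of paper embeddings fast.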
Please ensure that you have a database configured, up, and running.
Then launch the web service, or use Jupyter Notebook for development.
Two Docker images are used for testing, development and deployment:
- biolabs/pubtrends - production
- biolabs/pubtrends-test - testing
We use Docker Hub to store built images.
- Create the necessary folders with script `scripts/init.sh` and download prerequisites:

  ```bash
  bash scripts/init.sh
  bash scripts/nlp.sh
  ```

- Start Redis:

  ```bash
  docker run -p 6379:6379 redis:7.4.2
  ```

- Configure the Python environment with uv:

  ```bash
  uv venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  uv pip install -r pyproject.toml
  pip install --no-cache torch --index-url https://download.pytorch.org/whl/cpu
  pip install --no-cache sentence-transformers faiss-cpu jupyter notebook
  ```
- Start the Celery worker queue:

  ```bash
  source .venv/bin/activate
  export PYTHONPATH=$PYTHONPATH:$(pwd)
  celery -A pysrc.celery.tasks worker -c 1 --loglevel=debug
  ```

- Start the Flask server at http://localhost:5000/:

  ```bash
  source .venv/bin/activate
  export PYTHONPATH=$PYTHONPATH:$(pwd)
  python -m pysrc.app.pubtrends_app
  ```

- Start the text embeddings service (based on either a pretrained fastText model or a sentence transformer) at http://localhost:5001/:

  ```bash
  source .venv/bin/activate
  export PYTHONPATH=$PYTHONPATH:$(pwd)
  python -m pysrc.endpoints.embeddings.fasttext.fasttext_app
  ```

  or

  ```bash
  source .venv/bin/activate
  export PYTHONPATH=$PYTHONPATH:$(pwd)
  python -m pysrc.endpoints.embeddings.sentence_transformer.sentence_transformer_app
  ```

- Optionally, start the semantic search service at http://localhost:5002/:

  ```bash
  source .venv/bin/activate
  export PYTHONPATH=$PYTHONPATH:$(pwd)
  python -m pysrc.endpoints.semantic_search.semantic_search_app
  ```
Notebooks are located under the `/notebooks` folder. Please configure `PYTHONPATH` before using Jupyter:

```bash
source .venv/bin/activate
export PYTHONPATH=$PYTHONPATH:$(pwd)
jupyter notebook
```
- Start a Docker image with a Postgres environment for tests (Kotlin and Python development):

  ```bash
  docker run --rm --platform linux/amd64 --name pubtrends-test \
    --publish=5433:5432 --volume=$(pwd):/pubtrends -i -t biolabs/pubtrends-test
  ```

  NOTE: remember to stop the container afterward.
- Kotlin tests:

  ```bash
  ./gradlew clean test
  ```

- Python tests with code style check for development (including integration with Kotlin DB writers):

  ```bash
  source .venv/bin/activate; pytest pysrc
  ```

- Python tests within Docker (ensure that the `./build/libs/pubtrends-dev.jar` file is present):

  ```bash
  docker run --rm --platform linux/amd64 --volume=$(pwd):/pubtrends -t biolabs/pubtrends-test /bin/bash -c \
    "/usr/lib/postgresql/17/bin/pg_ctl -D /home/user/postgres start; \
    cd /pubtrends; cp config.properties /home/user/.pubtrends/; \
    pytest pysrc"
  ```
Deployment is done with docker-compose:
- Gunicorn serving the main PubTrends Flask app
- Redis as a message broker
- Celery worker queue
Please ensure that you have configured and prepared the database(s).
- Modify file `config.properties` with information about the database(s). The file from the project folder is used in this case.
- Start the Postgres server:

  ```bash
  docker run --rm --name pubtrends-postgres -p 5432:5432 \
    -m 32G \
    -e POSTGRES_USER=biolabs -e POSTGRES_PASSWORD=mysecretpassword \
    -e POSTGRES_DB=pubtrends \
    -v ~/postgres/:/var/lib/postgresql/data \
    -e PGDATA=/var/lib/postgresql/data/pgdata \
    -d postgres:17
  ```

  NOTE: stop the Postgres Docker image with timeout `--time=300` to avoid DB recovery.

  NOTE 2: for speed reasons we use materialized views, which are updated upon a successful database update. In case of an emergency stop, the view should be refreshed manually to ensure that sorting by citations works correctly:

  ```bash
  psql -h localhost -p 5432 -U biolabs -d pubtrends
  ```

  ```sql
  refresh materialized view matview_pmcitations;
  ```
- Build a ready-for-deployment package with script `scripts/dist.sh`:

  ```bash
  scripts/dist.sh build=build-number ga=google-analytics-id
  ```
- Launch PubTrends with docker-compose (one of the options):

  ```bash
  # Start with local word2vec tf-idf tokens embeddings
  docker-compose -f docker-compose/word2vec.yml up --build
  # Start with BioWord2Vec tokens embeddings
  docker-compose -f docker-compose/fasttext.yml up --build
  # Start with Sentence Transformer for text embeddings
  docker-compose -f docker-compose/sentence-transformer.yml up --build
  # Start with Semantic Search based on Sentence Transformer
  docker-compose -f docker-compose/semantic-search.yml up --build
  ```

  Use these commands to stop the compose build and check logs:

  ```bash
  # Stop
  docker-compose -f docker-compose/semantic-search.yml down --remove-orphans
  # Inspect logs
  docker-compose -f docker-compose/semantic-search.yml logs
  ```

  PubTrends will be served on port 5000.
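Once the compose stack is up, a quick smoke check can poll the service until it answers. This is a hypothetical helper, not part of the repo, and it assumes the root endpoint returns HTTP 200 when healthy:

```python
import time
import urllib.error
import urllib.request

def wait_until_up(url: str = "http://localhost:5000/", timeout_s: int = 60) -> bool:
    """Poll url until it returns HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Service not up yet; retry shortly
            time.sleep(2)
    return False
```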
- Update nginx timeouts:

  ```nginx
  # Increase timeouts
  proxy_connect_timeout 60s;
  proxy_send_timeout 600s;
  proxy_read_timeout 600s;
  send_timeout 600s;
  ```

- Use a simple placeholder during maintenance:

  ```bash
  cd pysrc/app; python -m http.server 5000
  ```
- Update `CHANGES.md`
- Update version in `scripts/dist.sh`
- Launch `scripts/dist.sh`; `pubtrends-XXX.tar.gz` will be created in the `dist` directory.
See AUTHORS.md for a list of authors and contributors.
Shpynov, O. and Kapralov, N., 2021, August. PubTrends: a scientific literature explorer. In Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 1-1). https://doi.org/10.1145/3459930.3469501
Here’s how you can help:
- ⭐ Star this repo, help others to discover it
- 🐛 Found a bug? Open an issue
- 💡 Have an idea? Feel free to submit a feature request or a PR
- 👍 Upvote issues you care about, help us prioritize
