commoncrawl/cc-citations-paper-explorer

CC-Citations: Paper Explorer

A visual tool for exploring research papers that cite Common Crawl, based on embedding similarity. The tool is deployed as a Hugging Face Space. This folder contains all code for generating paper embeddings, topic modeling, and the web app.

Setup

  • Python 3.12 (recommended)
# install dependencies via pip
pip install -r requirements.txt

Merge citation exports and OpenAlex data

# download citations
wget -O data/citations.jsonl https://raw.githubusercontent.com/commoncrawl/cc-citations/refs/heads/main/gscholar_alerts/citations.jsonl

# merge datasets from Google Scholar and OpenAlex
python merge_openalex_data.py \
    --citations data/citations.jsonl \
    --openalex /path/to/citations.2024-2025.openalex.sorted.jsonl \
    --output data/merged_citations.jsonl
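
The internals of merge_openalex_data.py are not shown here. As a rough illustration of what such a merge can look like, the sketch below joins two lists of paper records on a normalized title. The matching key, the function names, and the rule that OpenAlex values win on conflicts are assumptions made for this example, not the script's documented behavior.

```python
import json
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", (title or "").lower()).strip()


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def merge_records(citations, openalex):
    """Merge two iterables of paper dicts on their normalized title (assumed key).

    OpenAlex values override citation values for shared fields. Citations
    without a match, or without a title, are kept unchanged.
    """
    by_title = {}
    for record in openalex:
        key = normalize_title(record.get("title"))
        if key:
            by_title[key] = record
    merged = []
    for record in citations:
        key = normalize_title(record.get("title"))
        match = by_title.get(key, {}) if key else {}
        merged.append({**record, **match})
    return merged


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Matching on titles is only a heuristic; if both sources share a stable identifier such as a DOI, that would be the safer key.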

Generate paper embeddings

The paper explorer requires 2D representations of the paper embeddings. To obtain those, we use the title and abstract of each paper to generate embeddings and then apply dimensionality reduction.

# embeddings for OpenAlex papers
python embed_papers.py --input_path=<path to OpenAlex JSONL> \
    --json_output_path=papers.json \
    --js_output_path=hf_space/papers.js \
    --model_name_or_path=malteos/scincl

# embeddings for Google Scholar papers
python embed_papers.py --input_path=../gscholar_alerts/citations.jsonl \
    --json_output_path=papers_full.json \
    --js_output_path=papers_full.js \
    --model_name_or_path=malteos/scincl \
    --batch_size=12 \
    --title_field=title \
    --url_field=url \
    --authors_field=authors \
    --abstract_field=snippet \
    --embedding_fields title

# embeddings for the merged dataset (both sources)
python embed_papers.py --input_path=./merged_citations.jsonl \
    --json_output_path=papers_merged.json \
    --js_output_path=papers_merged.js \
    --model_name_or_path=malteos/scincl \
    --batch_size=12 \
    --title_field=title \
    --url_field=url \
    --authors_field=authors \
    --abstract_field=abstract \
    --id_field=openalex_id \
    --embedding_fields title abstract
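
The `--embedding_fields` option in the commands above selects which fields feed the embedding model (title only for the Google Scholar set, title plus abstract for the merged set). A minimal sketch of that composition follows. It assumes the selected fields are joined with a separator token, which is how the SciNCL model card combines title and abstract; the function names and the `[SEP]` default are illustrative, and embed_papers.py may do this differently.

```python
def build_embedding_text(paper, embedding_fields=("title", "abstract"), sep=" [SEP] "):
    """Join the selected, non-empty string fields of one paper into one input string."""
    parts = []
    for field in embedding_fields:
        value = paper.get(field)
        if isinstance(value, str) and value.strip():
            parts.append(value.strip())
    return sep.join(parts)


def batched(items, batch_size=12):
    """Yield consecutive slices of at most batch_size items (cf. --batch_size)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Skipping empty fields means a paper without an abstract is embedded from its title alone instead of from a string that ends in a dangling separator.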

Topic detection with LDA

To assign a topic to each paper, we run LDA on the titles and, optionally, the abstracts. The commands below show the hyperparameter settings we experimented with, such as the number of topics.

# 12 topics
python classify_paper_topics.py \
    --input_path=papers.json \
    --topics_path=topics.json \
    --paper_to_topic_path=paper_topics.json \
    --n_topics=12 --n_words=20 --max_iter=100 --use_abstracts

# 30 topics
python classify_paper_topics.py \
    --input_path=papers_full.json \
    --topics_path=topics_full.json \
    --paper_to_topic_path=paper_topics_full.json \
    --n_topics=30 --n_words=20 --max_iter=100

# 50 topics
python classify_paper_topics.py \
    --input_path=papers_merged.json \
    --topics_path=topics_merged.json \
    --paper_to_topic_path=paper_topics_merged.json \
    --n_topics=50 --n_words=20 --max_iter=100  --use_abstracts
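
The two output files map onto the two matrices that LDA fits: topic-word weights (the keywords per topic) and document-topic weights (the topic per paper). The sketch below shows how the top `n_words` keywords and the dominant topic per paper can be read off such matrices. The toy matrices and function names are made up for illustration, and the actual JSON schema written by classify_paper_topics.py is not shown here.

```python
def top_keywords(topic_word, vocabulary, n_words=20):
    """Return, for each topic, the n_words highest-weighted vocabulary terms."""
    keywords = []
    for weights in topic_word:
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        keywords.append([vocabulary[i] for i in ranked[:n_words]])
    return keywords


def dominant_topics(doc_topic):
    """Return the index of the highest-weighted topic for each document."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in doc_topic]


# Toy example: 2 topics over a 4-word vocabulary, 3 documents.
vocabulary = ["crawl", "web", "language", "model"]
topic_word = [[0.5, 0.4, 0.05, 0.05], [0.1, 0.1, 0.3, 0.5]]
doc_topic = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]

print(top_keywords(topic_word, vocabulary, n_words=2))
print(dominant_topics(doc_topic))
```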

Since LDA produces keyword lists rather than topic titles, we use an LLM to assign titles and colors (e.g., with the following prompt for Claude Code):

Assign a topic_title to each topic in topics.json based on the provided keywords (LDA output) and assign colors such that the color reflects topic similarity. If no meaningful title can be assigned use "Other" as a topic title.

Or use a CLI command:

# Call Claude Code via the CLI with different prompts to update the topics JSON
claude --permission-mode acceptEdits --allowedTools Read,Edit,Glob -p "In the file topics_full.json, assign a 'topic_title' field to each topic in the JSON list of topics based on the provided keywords (LDA output) and assign colors such that the color reflects topic similarity. If no meaningful title can be assigned, use 'Other' as a topic title."

# other prompt ...
claude --permission-mode acceptEdits --allowedTools Read,Edit,Glob -p "In the file topics_merged.json, assign a 'topic_title' field to each topic in the JSON list of topics based on the provided keywords (LDA output) and assign colors such that the color reflects topic similarity. If no meaningful title can be assigned, use 'Other' as a topic title."

# other prompt - part 2 ...
claude --permission-mode acceptEdits --allowedTools Read,Edit,Glob -p "The file topics_merged.json holds a list of many topics with titles and keywords (LDA output). Group these topics into 15 meaningful main topics. Assign a new field 'main_topic_title' to each topic. If certain topics cannot be meaningfully grouped, assign them the 'Other' main topic title. Save the output into a new file with the '_grouped' suffix."
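
Because the LLM edits the topics file in place, it can be worth checking the result before building the web page. The sketch below assumes the file is a JSON list of topic objects (as the prompts describe) and verifies that every topic carries a non-empty `topic_title`, falling back to "Other" where it does not. The function names are hypothetical and not part of this repository.

```python
import json


def fill_missing_titles(topics, field="topic_title", fallback="Other"):
    """Return a new topic list in which every topic has a non-empty title.

    Also returns the indices of the topics that needed the fallback.
    The input list and its dicts are left unmodified.
    """
    fixed, missing = [], []
    for index, topic in enumerate(topics):
        title = topic.get(field)
        if not isinstance(title, str) or not title.strip():
            missing.append(index)
            topic = {**topic, field: fallback}
        fixed.append(topic)
    return fixed, missing


def check_topics_file(path, field="topic_title"):
    """Load a topics JSON file and report which entries lack a title."""
    with open(path, encoding="utf-8") as f:
        topics = json.load(f)
    if not isinstance(topics, list):
        raise ValueError(f"{path}: expected a JSON list of topics")
    return fill_missing_titles(topics, field=field)
```

The same check works for the grouped file by passing `field="main_topic_title"`.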

JavaScript data

To load all results in a web page, all pieces need to be combined into a single JavaScript file:

# using the full set
python create_papers_js.py \
  --papers papers_full.json \
  --topics topics_full.json \
  --paper-topics paper_topics_full.json \
  --output hf_space/papers.js

# using the merged set
python create_papers_js.py \
  --papers data/papers_merged.json \
  --topics data/topics_merged.json \
  --paper-topics data/paper_topics_merged.json \
  --output hf_space/papers.js
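
One common way to produce such a file is to serialize the data as JSON and wrap it in a variable declaration that the page can read as a global. The sketch below illustrates the idea; the variable name `PAPERS_DATA`, the key names in the bundle, and the function names are hypothetical, and create_papers_js.py may structure its output differently.

```python
import json


def to_js_assignment(data, variable="PAPERS_DATA"):
    """Serialize data as JSON and wrap it in a JavaScript const declaration."""
    # Default ensure_ascii=True escapes all non-ASCII characters as \uXXXX,
    # which is valid inside JavaScript string literals.
    payload = json.dumps(data)
    # Escape "</" so the payload cannot close an inline <script> tag early.
    payload = payload.replace("</", "<\\/")
    return f"const {variable} = {payload};\n"


def write_js(papers, topics, paper_topics, output_path, variable="PAPERS_DATA"):
    """Bundle the three inputs into one object and write it as a .js file."""
    bundle = {"papers": papers, "topics": topics, "paperTopics": paper_topics}
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(to_js_assignment(bundle, variable))
```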

View Web page

The resulting web app is a single HTML file with JavaScript and can be viewed in a browser.

cd hf_space

# from the local file system (macOS; use xdg-open on Linux)
open index.html

# via local web server at http://localhost
python -m http.server 80
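
Binding to port 80 typically requires elevated privileges on Unix-like systems. As an alternative, the following sketch serves the folder on an unprivileged port using only the Python standard library. The function name and the port 8000 default are illustrative choices, not part of this repository.

```python
import functools
import http.server
import threading


def start_static_server(directory="hf_space", port=8000):
    """Serve a directory over HTTP on localhost in a background thread.

    Pass port=0 to let the OS pick a free port. Returns the server object;
    call server.shutdown() and server.server_close() to stop it.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    # Daemon thread: the server stops when the calling process exits.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

For example, `server = start_static_server()` followed by opening http://localhost:8000 in a browser; the calling process has to stay alive for as long as the page should be served.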

Push to HF space

To deploy the web app to Hugging Face, you can upload the relevant files as follows:

# authenticate first (e.g., huggingface-cli login)
huggingface-cli upload commoncrawl/cc-citations ./hf_space --repo-type space --commit-message "Uploading paper explorer"

License

Apache 2.0