We present a comprehensive approach to democratizing access to scientific knowledge through large-scale, **structured summarization** of academic literature. We retrieved and processed ~**100 million** research papers from the public internet, leveraging existing datasets from **bethgelab**, **PeS2o**, **Hugging Face**, and **Common Pile**. We designed a standardized **JSON schema** for scientific paper summaries and **post-trained two models**—**Qwen 3 14B** and **Nemotron 12B**—to produce summaries in this format. Our evaluation combines **LLM-as-a-Judge** and a **QA dataset**. Fine-tuned models achieve performance on our evals comparable to leading closed models (e.g., GPT-5, Claude 4.5). **Nemotron 12B** offers ~**2.25×** higher throughput than Qwen 3 14B, making it attractive for large-scale processing.
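To make the idea of a standardized JSON schema concrete, here is a minimal sketch of what such a summary record could look like, expressed as a Pydantic model. The field names below are illustrative placeholders, not the schema actually released with this post:

```python
# Hypothetical sketch of a structured-summary schema; field names are
# illustrative placeholders, not the schema released with this post.
from pydantic import BaseModel

class PaperSummary(BaseModel):
    title: str
    authors: list[str]
    research_question: str   # what the paper asks
    key_findings: list[str]  # factual claims, one per entry
    methods: list[str]       # techniques and datasets used
    limitations: list[str]   # caveats stated by the authors

# Validating model output against the schema keeps the corpus parseable:
raw = '{"title": "...", "authors": [], "research_question": "...", "key_findings": [], "methods": [], "limitations": []}'
summary = PaperSummary.model_validate_json(raw)
```

Pinning every summary to one schema is what makes claims comparable across 100M heterogeneous papers.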
With this preliminary blog post, we release **fine-tuned models and 100k paper summaries**.
A live **visualization tool** at [https://laion.inference.net/](https://laion.inference.net/) demonstrates the utility of structured summaries. We plan to release structured summaries for the full **100M** paper corpus.
Access to scientific knowledge remains constrained by paywalls, licensing, and copyright, slowing research and education. Our **Project Alexandria** ([arXiv:2502.19413](https://arxiv.org/abs/2502.19413)) showed that it is legally and technically feasible to **extract factual knowledge** while respecting copyright via **Knowledge Units**: structured, style-agnostic representations of content. However, research-paper corpora vary in format and structure, making it hard to compare similar claims or retrieve knowledge efficiently. Building on Alexandria, we introduce a **pipeline** to collect, process, and summarize papers into **structured outputs** consumable by humans and AI systems alike (a sketch follows the list below). Our aims:

* **Create** a massive, openly accessible, well-structured summary dataset of scientific literature
* **Develop** models capable of generating **structured, factual** summaries
* **Demonstrate** the utility of these summaries for scientific tasks
* **Explore** decentralized computing to process at global scale

This brief outlines **methodology**, **results**, and **implications** for the scientific community and for humanity at large.
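As a hedged illustration of the summarization step, the sketch below calls an OpenAI-compatible inference endpoint and validates the output against the hypothetical `PaperSummary` schema from earlier. The endpoint URL, model name, and prompt are assumptions for illustration, not the project's actual setup:

```python
# Minimal pipeline sketch (fetch -> summarize -> validate); the endpoint,
# model name, and prompt are placeholders, not the project's actual setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def summarize(paper_text: str) -> "PaperSummary":
    # PaperSummary is the hypothetical Pydantic schema sketched earlier.
    resp = client.chat.completions.create(
        model="qwen3-14b-summarizer",  # placeholder name for a fine-tune
        messages=[
            {"role": "system", "content": "Summarize the paper as JSON matching the schema."},
            {"role": "user", "content": paper_text[:30000]},  # truncate very long papers
        ],
        response_format={"type": "json_object"},  # request parseable JSON
    )
    return PaperSummary.model_validate_json(resp.choices[0].message.content)
```

Validation failures can be retried or dropped, which keeps the released corpus uniformly machine-readable.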
We used **two complementary approaches**:

1. **LLM-as-a-Judge**: an ensemble of judge models
2. A **QA dataset**
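A hedged sketch of how ensemble judging might work is below; the rubric, scoring scale, and judge model names are assumptions for illustration, not the evaluation actually run here:

```python
# Sketch of ensemble LLM-as-a-Judge scoring; the rubric, scale, and
# judge model names are illustrative, not the evaluation actually run.
from statistics import mean
from openai import OpenAI

client = OpenAI()
JUDGES = ["judge-model-a", "judge-model-b"]  # placeholder judge names

RUBRIC = (
    "Rate the summary from 1 (poor) to 10 (excellent) for factual "
    "faithfulness to the paper. Reply with the number only."
)

def judge_score(paper: str, summary: str) -> float:
    scores = []
    for model in JUDGES:
        resp = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": f"{RUBRIC}\n\nPAPER:\n{paper}\n\nSUMMARY:\n{summary}",
            }],
        )
        scores.append(float(resp.choices[0].message.content.strip()))
    return mean(scores)  # average across the judge ensemble
```

Averaging over several judges reduces the bias any single model introduces into the scores.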