Commit d09466e

Refine abstract and update release information
Removed duplicate abstract section and retained the refined version. Added information about the release of fine-tuned models and a visualization tool.
1 parent ef080e7 commit d09466e

File tree

1 file changed: +7 −5 lines changed


notes/summaries.md

Lines changed: 7 additions & 5 deletions
@@ -4,17 +4,19 @@ author: "Christoph Schuhmann, Amarjot Singh, Andrej Radonjic, Sean Smith, and Sa
 date: "November 07 2025"
 previewImg: "/images/blog/sci3.jpg"
 ---
+
+
+
+## Abstract
+
+We present a comprehensive approach to democratizing access to scientific knowledge through large-scale, **structured summarization** of academic literature.
 <p align="center">
   <img src="/images/blog/sci5.png"
        alt="LLM-as-a-Judge scores chart"
        style="width:90%; height:auto;">
 </p>
 
-
-
-## Abstract
-
-We present a comprehensive approach to democratizing access to scientific knowledge through large-scale, **structured summarization** of academic literature. We retrieved and processed ~**100 million** research papers from the public internet, leveraging existing datasets from **bethgelab**, **PeS2o**, **Hugging Face**, and **Common Pile**. We designed a standardized **JSON schema** for scientific paper summaries and **post-trained two models**, **Qwen 3 14B** and **Nemotron 12B**, to produce summaries in this format. Our evaluation combines **LLM-as-a-Judge** and a **QA dataset**. Fine-tuned models achieve performance on our evals comparable to leading closed models (e.g., GPT-5, Claude 4.5). **Nemotron 12B** offers ~**2.25×** higher throughput than Qwen 3 14B, making it attractive for large-scale processing.
+We retrieved and processed ~**100 million** research papers from the public internet, leveraging existing datasets from **bethgelab**, **PeS2o**, **Hugging Face**, and **Common Pile**. We designed a standardized **JSON schema** for scientific paper summaries and **post-trained two models**, **Qwen 3 14B** and **Nemotron 12B**, to produce summaries in this format. Our evaluation combines **LLM-as-a-Judge** and a **QA dataset**. Fine-tuned models achieve performance on our evals comparable to leading closed models (e.g., GPT-5, Claude 4.5). **Nemotron 12B** offers ~**2.25×** higher throughput than Qwen 3 14B, making it attractive for large-scale processing.
 
 With this preliminary blog post, we **release two fine-tuned models and 100k paper summaries**.
 A live **visualization tool** at [https://laion.inference.net/](https://laion.inference.net/) demonstrates the utility of structured summaries. We plan to release structured summaries for the full **100M** paper corpus.
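The abstract mentions a standardized JSON schema for paper summaries, but the diff does not show the schema itself. The sketch below is a minimal, hypothetical illustration of what such a structured summary record and a validation check might look like; all field names (`title`, `authors`, `abstract_summary`, `key_findings`, `methods`) are assumptions, not the released schema.

```python
import json

# Hypothetical required fields -- illustrative assumptions only,
# not the actual schema released with the blog post.
REQUIRED_FIELDS = {"title", "authors", "abstract_summary", "key_findings", "methods"}

def validate_summary(record: dict) -> bool:
    """Check that a summary record carries every required top-level field."""
    return REQUIRED_FIELDS.issubset(record)

# Example of a single structured-summary record in this hypothetical format.
example = {
    "title": "An Example Paper",
    "authors": ["A. Author"],
    "abstract_summary": "One-paragraph plain-language summary of the paper.",
    "key_findings": ["Finding 1", "Finding 2"],
    "methods": "Short description of the experimental setup.",
}

print(validate_summary(example))   # True
print(json.dumps(example, indent=2))
```

A fixed schema like this is what lets fine-tuned models emit machine-readable output that downstream tools (such as the visualization demo) can parse without per-paper heuristics.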
