Commit 829c28f

Refactor abstract for clarity and conciseness
1 parent db5dd28 commit 829c28f

File tree: 1 file changed (+2 −1 lines changed)


notes/summaries.md

Lines changed: 2 additions & 1 deletion
@@ -10,13 +10,14 @@ previewImg: "/images/blog/sci3.jpg"
  ## Abstract

  We present a comprehensive approach to democratizing access to scientific knowledge through large-scale, **structured summarization** of academic literature.
+ We retrieved and processed ~**100 million** research papers from the public internet, leveraging existing datasets from **bethgelab**, **PeS2o**, **Hugging Face**, and **Common Pile**.

  <p align="center">
    <img src="/images/blog/sci5.png"
         alt="LLM-as-a-Judge scores chart"
         style="width:90%; height:auto;">
  </p>

- We retrieved and processed ~**100 million** research papers from the public internet, leveraging existing datasets from **bethgelab**, **PeS2o**, **Hugging Face**, and **Common Pile**. We designed a standardized **JSON schema** for scientific paper summaries and **post-trained two models** (**Qwen 3 14B** and **Nemotron 12B**) to produce summaries in this format. Our evaluation combines **LLM-as-a-Judge** and a **QA dataset**. Fine-tuned models achieve performance on our evals comparable to leading closed models (e.g., GPT-5, Claude 4.5). **Nemotron 12B** offers ~**2.25×** higher throughput than Qwen 3 14B, making it attractive for large-scale processing.
+ We designed a standardized **JSON schema** for scientific paper summaries and **post-trained two models** (**Qwen 3 14B** and **Nemotron 12B**) to produce summaries in this format. Our evaluation combines **LLM-as-a-Judge** and a **QA dataset**. Fine-tuned models achieve performance on our evals comparable to leading closed models (e.g., GPT-5, Claude 4.5). **Nemotron 12B** offers ~**2.25×** higher throughput than Qwen 3 14B, making it attractive for large-scale processing.

  With this preliminary blog post, we **release the fine-tuned models and 100k paper summaries**.
  A live **visualization tool** at [https://laion.inference.net/](https://laion.inference.net/) demonstrates the utility of structured summaries. We plan to release structured summaries for the full **100M** paper corpus.
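The abstract refers to a standardized **JSON schema** for paper summaries but does not reproduce it. As a rough illustration only, a summary record and a minimal conformance check might look like the sketch below; every field name here is hypothetical and not taken from the released schema:

```python
import json

# Hypothetical required fields for a structured paper summary.
# The actual released JSON schema may differ entirely.
REQUIRED_FIELDS = {
    "title": str,
    "abstract_summary": str,
    "key_findings": list,
    "methods": str,
    "limitations": str,
}

def validate_summary(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"wrong type for {field}: {type(record[field]).__name__}"
            )
    return problems

# A toy summary record, parsed from JSON as a model's output would be.
example = json.loads("""
{
  "title": "Example Paper",
  "abstract_summary": "One-paragraph summary of the paper.",
  "key_findings": ["finding A", "finding B"],
  "methods": "Fine-tuned transformer models.",
  "limitations": "Evaluated on a small QA set."
}
""")

print(validate_summary(example))  # → []
```

In a large-scale pipeline like the one described, a check of this kind would typically run on every model output before a summary is accepted into the corpus, so malformed generations can be retried or discarded.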
