This project collects high-quality e-book repositories from GitHub and uses MinerU 2.0 to convert their PDF content to Markdown.
Each directory represents a repository originally hosted on GitHub.
If you have high-quality e-book resources that need conversion, you're welcome to submit the links in an issue, and we will help with the PDF-to-Markdown extraction.
Our goal is to convert more high-quality knowledge data into AI-ready data.
| Repo URL | Download |
|---|---|
| ChinaTextbook | opendatalab/awesome-markdown-ebooks/ChinaTextbook |
Output File Structure Documentation (based on the MinerU2 VLM Output File Structure)
- File Type: `.md` file + `images/` folder
  - Description: Final result of the PDF-to-Markdown conversion
  - Content: Document text content and image references
- Filename: `model_output.txt`
  - Description: Intermediate inference data from the VLM model
  - Content: The model's visual understanding results for each page
- Filename: `middle.json`
  - Description: Processed result from `model_output.txt`
  - Content: Position information for text, images, formulas, tables, etc. in the PDF
- Filename: `content_list.json`
  - Description: Final result converted from `middle.json`
  - Content: Document conversion results segmented by element, including page information
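
As a rough illustration of how the element-level output might be consumed, the sketch below groups the entries of a `content_list.json` file by page. It assumes the file is a JSON array of objects and that each object carries a `type` field (the element kind) and a `page_idx` field (the page number). These field names are assumptions here, not something this README specifies, so check them against an actual output file before relying on them. The function name `group_elements_by_page` is hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_elements_by_page(content_list_path):
    """Return a dict mapping page index -> list of element types on that page."""
    # Assumption: the file holds a JSON array with one object per element.
    elements = json.loads(Path(content_list_path).read_text(encoding="utf-8"))

    pages = defaultdict(list)
    for element in elements:
        # Assumption: "page_idx" is the page number and "type" is the element
        # kind (for example text, image, or table). Missing keys fall back to
        # placeholder values instead of raising an error.
        page = element.get("page_idx", -1)
        pages[page].append(element.get("type", "unknown"))
    return dict(pages)
```

Calling `group_elements_by_page("content_list.json")` on a converted book would then give a quick per-page overview of which element kinds were extracted, which can help when spot-checking conversion quality.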