Jaswant-2525/Information-Extraction-From-Research-Papers

📄 Automatic Extraction of Key Information from Research Papers using NLP & LayoutLM

This project automatically extracts key information from research papers in PDF format using OCR, NLP, and document layout modeling. The system converts each page to an image, recognizes the text, models the page layout, and returns structured information such as:

  1. Title
  2. Abstract
  3. Keywords
  4. Named Entities
  5. Token labels from LayoutLM

It aims to automate literature analysis and reduce manual reading effort for researchers, students, and analysts.
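As an illustration of one of the outputs listed above, named entities can be obtained from OCR text with spaCy's `en_core_web_sm` pipeline, the model this README installs. The sketch below is an assumption about how that step might look, not the project's actual code; `group_entities` and `extract_entities` are hypothetical names.

```python
from typing import Dict, Iterable, List, Tuple


def group_entities(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (text, label) pairs by label, keeping first-seen order and dropping duplicates."""
    grouped: Dict[str, List[str]] = {}
    for text, label in pairs:
        bucket = grouped.setdefault(label, [])
        if text not in bucket:
            bucket.append(text)
    return grouped


def extract_entities(text: str) -> Dict[str, List[str]]:
    """Hypothetical helper: run spaCy NER over text and group the results by label."""
    import spacy  # imported lazily so group_entities works without spaCy installed

    nlp = spacy.load("en_core_web_sm")  # requires the model downloaded during installation
    doc = nlp(text)
    return group_entities((ent.text, ent.label_) for ent in doc.ents)
```

Grouping by label (for example `ORG`, `DATE`, `GPE`) keeps the result compact when the same name appears on many pages.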

🎯 Project Objective

  • Automate extraction of key data from research PDFs
  • Reduce manual reading time
  • Convert unstructured PDF text → structured JSON
  • Assist research scholars in faster literature review
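The README does not show the JSON schema, so the following is only a sketch of what a structured record could look like, assuming one field per extracted item (title, abstract, keywords, named entities, LayoutLM token labels). `PaperInfo` and its field names are hypothetical, and the sample values are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class PaperInfo:
    """Hypothetical container for the fields the project extracts."""
    title: str = ""
    abstract: str = ""
    keywords: List[str] = field(default_factory=list)
    entities: Dict[str, List[str]] = field(default_factory=dict)      # label -> entity texts
    token_labels: List[Dict[str, str]] = field(default_factory=list)  # one dict per token

    def to_json(self) -> str:
        # ensure_ascii=False keeps accented author names and symbols readable
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


# Placeholder values, for illustration only
info = PaperInfo(title="An Example Paper", keywords=["OCR", "LayoutLM"])
print(info.to_json())
```

A dataclass keeps the field set explicit, and `asdict` converts it to plain dictionaries and lists that `json.dumps` can serialize directly.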

🔥 Features

| Feature | Description |
|---|---|
| 📄 PDF to Image Conversion | Converts PDF pages to images using pdf2image |
| 🔍 OCR Text Extraction | Uses pytesseract to extract raw text from the rendered page images |
| 🧠 Layout-aware Processing | Uses the LayoutLM model to understand document structure |
| 🏷 Named Entity Recognition | Detects people, places, dates, organizations, etc. |
| 📦 Structured Output | Returns extracted data in JSON format |
| ⚙️ Page-wise Processing | Processes pages one at a time to avoid exhausting RAM |

🧠 Technology Stack

| Category | Tools Used |
|---|---|
| OCR | pytesseract |
| NLP | spaCy, Transformers |
| Layout Model | LayoutLM |
| Preprocessing | pdf2image, PIL |
| Backend / Processing | Python |
| Optional UI | Streamlit |
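To connect the OCR and layout-model entries above: LayoutLM takes each word together with a bounding box scaled to a 0 to 1000 range in `(x0, y0, x1, y1)` order. The sketch below shows how word boxes in the `left`/`top`/`width`/`height` form returned by `pytesseract.image_to_data` could be normalized. The helper names `normalize_box` and `words_and_boxes` are hypothetical, and how this project actually prepares its inputs is not shown in the README.

```python
from typing import Dict, List, Sequence, Tuple


def normalize_box(
    left: int, top: int, width: int, height: int,
    page_width: int, page_height: int,
) -> List[int]:
    """Convert a pixel box to [x0, y0, x1, y1] on LayoutLM's 0-1000 scale."""
    x0 = int(1000 * left / page_width)
    y0 = int(1000 * top / page_height)
    x1 = int(1000 * (left + width) / page_width)
    y1 = int(1000 * (top + height) / page_height)
    # Clamp in case OCR reports a box slightly outside the page
    return [max(0, min(1000, v)) for v in (x0, y0, x1, y1)]


def words_and_boxes(
    ocr: Dict[str, Sequence], page_width: int, page_height: int
) -> Tuple[List[str], List[List[int]]]:
    """Pair each non-empty OCR word with its normalized box.

    `ocr` is assumed to have the layout of
    pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT).
    """
    words: List[str] = []
    boxes: List[List[int]] = []
    for text, left, top, width, height in zip(
        ocr["text"], ocr["left"], ocr["top"], ocr["width"], ocr["height"]
    ):
        if not str(text).strip():
            continue  # skip the empty rows Tesseract emits for blocks and lines
        words.append(str(text).strip())
        boxes.append(normalize_box(left, top, width, height, page_width, page_height))
    return words, boxes
```

With a PIL image, `page_width, page_height = image.size` supplies the page dimensions.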

🚀 Installation (Google Colab Recommended)

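  1. Install required libraries (in a Colab cell, prefix each command with !)

    pip install pytesseract pdf2image transformers spacy Pillow

    python -m spacy download en_core_web_sm

  2. Install Poppler (required for pdf2image) and the Tesseract engine (required for pytesseract)

    apt-get install -y poppler-utils tesseract-ocr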
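Both pdf2image and pytesseract are wrappers around external programs (Poppler's `pdftoppm` and the `tesseract` engine), so a quick check that those binaries are on the `PATH` can save a confusing error later. The following is an optional sketch; `check_binaries` is a hypothetical helper, not part of the project.

```python
import shutil
from typing import Dict, Iterable


def check_binaries(names: Iterable[str] = ("pdftoppm", "tesseract")) -> Dict[str, bool]:
    """Report whether each external program can be found on the PATH."""
    return {name: shutil.which(name) is not None for name in names}


for name, found in check_binaries().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```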

🔧 How to Run in Google Colab

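import json

from google.colab import files

# extract_info is not defined in this README; define or import it in the notebook first.
uploaded = files.upload()             # opens a file picker; returns {filename: file bytes}
pdf_path = list(uploaded.keys())[0]   # name of the first uploaded file
result = extract_info(pdf_path)
print(json.dumps(result, indent=2))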
