LLM-Powered Literature Mining System for MARVEL is an intelligent literature mining system based on large language models, specifically designed for the MARVEL (Measured Active Rotational-Vibrational Energy Levels) algorithm. The system provides an automated workflow for literature retrieval and analysis, starting from natural language queries and progressing through metadata filtering and content classification.
For data extraction, since the original literature can only be mannualy fetched, this project offers carefully designed prompts to guide the LLM in extracting high quality experimental molecular spectroscopy data.
- Intelligent Literature Retrieval: Automatically extracts keywords from natural language queries and searches relevant academic literature
- Deep Content Analysis: Utilizes large language models for in-depth content analysis of the relevance between MARVEL research and the literature
- High Quality Data Extraction: Provides optimized prompts contribute to the extraction of high-quality data in a format compatible with MARVEL
- Quality Control: Intelligently filters theoretical calculation data, old data and imcomplete data, ensuring new experimental observations are extracted
- Molecular spectroscopy database construction and maintenance
- Data source expansion for MARVEL algorithm
- Python: 3.7 or higher
- Operating System: Windows, macOS, Linux
- Network: Stable internet connection (for API calls)
- Clone the Repository
git clone https://github.com/Songzx07/LLM_MARVEL.git
cd "LLM_MARVEL_Mining"- Install Dependencies
pip install -r requirements.txtBefore running the system, users need to configure the following API keys:
-
Edit Configuration Files
- Navigate to
llm_literature_search/config/settings.py - Navigate to
llm_literature_analysis/config/settings.py
- Navigate to
-
Set API Keys
# LLM API Configuration
GROQ_API_KEY = "your_groq_api_key_here"
GEMINI_API_KEY = "your_gemini_api_key_here"
# Elsevier API Configuration (for literature retrieval)
ELSEVIER_API_KEY = "your_elsevier_api_key_here"- Obtain API Keys
- Groq API: Visit Groq Console for free access
- Gemini API: Visit Google AI Studio for free access
- Elsevier API: Visit Elsevier Developer Portal to apply for academic license
Retrieve relevant academic literature using natural language queries:
cd llm_literature_search
python search.py- IUPAC Name of the Molecule
- Molecular Formula
- Search Year Range (10-year range per search is recommended)
- Isotopic Molecular Formula
- Additional Keywords
- Citation Count
Interface Example of CH4-literature Retrieval:
======================================================================
Please enter the name and chemical formula of the molecule you want to search for:
Enter the molecule name (e.g., 'methane'): methane
Enter the molecular formula (e.g., 'CH4'): CH4
If you want to search for a specific isotope of the molecule, please specify it as well. Only one isotope can be specified at a time.
Enter the isotope (e.g., '12CH4', press Enter to skip): 12CH4
======================================================================
Any specifically required keywords:
(Specify keywords that can help search for more relvant papers)
Please enter required keywords (comma-separated, press Enter to skip):
======================================================================
Set year range for searching (minimum valid year is 1900):
Please enter year range (e.g., 2000-2023, press Enter for no restriction): 2010-2019
Year range set: 2010-2019
======================================================================
Restriction on citation count:
1. No restriction on citation count (0 citations allowed)
2. At least 1 citation (low impact)
3. At least 5 citations (medium impact)
4. At least 20 citations (high impact)
5. At least 50 citations (very high impact)
Please choose (1-5, defaults to 3): 3
======================================================================
The output files containing the analysis results are stored in a folder named with the search timestamp inside the retrieval_results directory, which is located under the llm_literature_search project folder.
The fetched Elsevier XML files are saved stored in a folder named with the analysis timestamp inside the article_xmls directory, which is placed at the same directory level as llm_literature_search.
Perform in-depth analysis and data extraction on retrieved literature:
cd llm_literature_analysis
python analysis.py- Directory of folder containing XML file under article_xmls
- Output folder name
Interface Example of Benchmark Analysis:
Enter the folder path containing XML files (or press Enter to use default '../article_xmls'): Test_files_F
======================================================================
Output settings:
Output filename (defaults to paper_analysis_results_20250817_162927.json):
======================================================================
ANALYSIS CONFIGURATION:
LLM Service: Gemini API using model gemini-2.5-flash
Target Folder: ../article_xmls/Test_files_F
XML Files Found: 19
Output File: paper_analysis_results_20250817_162927.json
======================================================================
Starting analysis...
LLM_MARVEL_Mining/
├── llm_literature_search/ # Literature Retrieval Module
│ ├── search.py # Main retrieval program entry
│ ├── src/ # Retrieval functionality source code
│ │ └── core/
│ │ ├── llm_processor.py # LLM processing logic
│ │ ├── elsevier_article_retrieval.py # Elsevier API interface
│ │ ├── keyword_extractor.py # Keyword extraction
│ │ ├── literature_searcher.py # Literature search engine
│ │ └── paper_filter.py # Paper filtering logic
│ ├── config/
│ │ └── settings.py # API keys and parameter configuration
│ └── retrieval_results/ # Retrieval results output directory
│
├── llm_literature_analysis/ # Literature Analysis Module
│ ├── analysis.py # Main analysis program entry
│ ├── src/ # Analysis functionality source code
│ │ └── core/
│ │ ├── llm_client.py # LLM-powered analysis tool
│ │ ├── paper_analyzer.py # Paper analysis engine
│ │ └── xml_processor.py # XML document processor
│ ├── config/
│ │ └── settings.py # LLM model configuration
│ └── analysis_results/ # Analysis results output directory
│
├── prompts_for_content_extraction/ # LLM Prompt Templates
│ ├── USE.md # Prompt usage instructions
│ ├── General_Prompts/ # General data extraction templates
│ └── CH4-Prompts/ # Methane-specific templates
│
├── article_xmls/ # Fetched XML Files
│
│
├── project_results/ # Final Results Output Directory
│ ├── Retrieval/ # Statistical results of retrieved literature
│ ├── Analysis/ # Statistical results of analysed literature
│ ├── Content_Extraction/ # Extracted molecular data(The specific content is not displayed for copyright reasons)
│ └── Benchmark/ # Performance evaluation data
│
├── requirements.txt # Python dependencies list
└── README.md # Project documentation
-
Automatic Keyword Extraction: Intelligently identifies scientific keywords from natural language queries
-
Result Filtering and Ranking: Filters literature based on relevance and quality metrics
-
XML Document Parsing: Processes academic literature XML files in various formats
-
Semantic Content Understanding: Uses large language models to understand literature content structure
- MARVEL-Compatible Prompt Templates: Specially designed data extraction prompts
- Molecule-Specific Versions: Templates optimized for different molecular types
- Quality Control Instructions: Ensures accuracy and completeness of extracted data
- Results of Filtered Papers (Sample of a paper)
{
"title": "Improved line list of 12CH4 in the 8850–9180 cm−1 region",
"authors": [
"A.V. Nikitin",
"A.E. Protasevich",
"M. Rey",
"......"
],
"year": 2019,
"venue": "Journal of Quantitative Spectroscopy and Radiative Transfer",
"doi": "10.1016/j.jqsrt.2019.106646",
"abstract": "",
"publisher": "Elsevier BV",
"citation_count": 15,
"url": "https://doi.org/10.1016/j.jqsrt.2019.106646",
"type": "journal-article",
"page": "106646",
"volume": "239",
"issue": "",
"source": "crossref",
"llm_analysis": {
"relevance_score": 1.0,
"reasoning": "The title explicitly states......",
"is_relevant": true,
"analysis_type": "title",
"llm_service": "gemini"
}
},- BiBTeX Results (Sample of a paper)
@article{19NiPrRe,
title = {{Improved line list of 12CH4 in the 8850–9180 cm−1 region}},
author = {Nikitin, A.V. and Protasevich, A.E. and Rey, M. and Serdyukov, V.I. and Sinitsa, L.N. and Lugovskoy, A. and Tyuterev, V.I.G.},
year = {2019},
journal = {Journal of Quantitative Spectroscopy and Radiative Transfer},
doi = {10.1016/j.jqsrt.2019.106646},
publisher = {Elsevier BV},
volume = {239},
pages = {106646}
}- CSV Results (Sample)
| title | doi |
|---|---|
| Improved line list of 12CH4 in the 8850–9180 cm−1 region | 10.1016/j.jqsrt.2019.106646 |
| Improved line list of 12CH4 in the 3760–4100 cm−1 region | 10.1016/j.jqsrt.2018.12.034 |
- Analysis Summary (Sample of a paper)
{
"file_path": "..\\article_xmls\\Test_files_T\\10.1016_j.jqsrt.2019.106646.xml",
"llm_analysis": {
"success": true,
"analysis": {
"marvel_relevance": {
"is_relevant": true,
"explanation": "The paper reports ......"
},
"experimental_data": {
"has_data": true,
"data_format": "The paper explicitly states that ......",
"need_pdf": true,
"has_uncertainty": true,
"uncertainty_description": "The paper indicates ......",
"uncertainty_value": "not available",
"table_info": {
"table_title": [
"Table 3: ......"
],
"description": "This table ......"
},
"has_supplementary_data": true
},
"summary": {
"Evaluation": "This paper......"
}
},
"model": "gemini-2.5-flash"
}
},# General Template Output Format:
WaveNumber_cm-1 Uncertainty Upper_State Lower_State Source_tag.rownumber
8736.2987 0.005 (6,5,A2,239) (0,6,A1,1) 19NiPrRe.1
# CH₄-Specific Template Output Format (for comparison):
WaveNumber_cm-1 Uncertainty Upper_P Upper_J Upper_C Upper_α Lower_P Lower_J Lower_C Lower_α Source_tag.rownumber
8736.2987 0.005 6 5 A2 239 0 6 A1 1 19NiPrRe.1
# config/settings.py
GROQ_MODEL = "gemma2-9b-it" # Select different models
GEMINI_MODEL = "gemini-2.5-flash" # Adjust model version- Users who want to adjust the prompts used for keyword generation need to access
llm_literature_search/src/core/keyword_extractor.pyand edit the following section
user_prompt = f"Extract academic search keywords from this description:\n\nI am looking for research papers that provide high-quality experimental spectroscopic data suitable for input into the MARVEL......"- Users who want to adjust the prompts used for paper filtering need to access
llm_literature_search/src/core/paper_filter.pyand edit the following section
system_prompt = f"""You are an expert academic literature reviewer. Analyze if research papers are......- Users who want to adjust the prompts used for in-depth analysis need to access
llm_literature_analysis/src/core/llm_client.py
user_prompt = f"""MARVEL is an algorithm that reconstructs molecular energy level structures......- Network Connection: Requires stable internet connection to access LLM API services
- API Limitations: Pay attention to call frequency and quota limits for each API service
- Elsevier Access: Ensure that your network has permission to access Elsevier content.
- Literature Data: Ensure compliance with usage terms and copyright regulations of various data
- Citation Standards: Please cite original literature appropriately when using extracted data
- Auto-generated Content: Contents in
retrieval_results/andanalysis results/are automatically generated by the system - Manual Review: Recommend manual verification of critical data, especially data for scientific publication
-
API Key Errors
Error: Gemini API key not found! # Solution: Check API key configuration in config/settings.py
-
Network Connection Issues
ConnectionError: Failed to connect to API endpoint # Solution: Check network connection, consider using proxy or VPN -
File Path Issues
FileNotFoundError: XML file not found # Solution: Ensure there are processable files in article_xmls/ directory
Community contributions are welcomed! If you want to contribute code or improve the system:
- Fork this repository
- Create a feature branch (
git checkout -b Feature/feature) - Commit your changes (
git commit -m 'Add some Feature') - Push to the branch (
git push origin Feature/feature) - Open a Pull Request
- New molecular type prompt templates
- Additional literature database interfaces
- Data quality assessment algorithm improvements
- User interface optimization
- Project Maintainer: Songzx07
- GitHub: LLM_MARVEL_Mining
- Issue Reports: GitHub Issues
Last updated: August 20, 2025