renjieli08/QuantumChem-200K

QuantumChem-200K: A Large-Scale Open Organic Molecular Dataset for Quantum-Chemistry Property Screening and Language Model Benchmarking

We introduce QuantumChem-200K, a large-scale dataset of over 200,000 organic molecules annotated with eleven quantum-chemical properties, including two-photon absorption (TPA) cross sections, TPA spectral ranges, singlet–triplet intersystem crossing (ISC) energies, toxicity and synthetic accessibility scores, hydrophilicity, solubility, boiling point, molecular weight, and aromaticity. These values are computed using a hybrid workflow that integrates density functional theory (DFT), semi-empirical excited-state methods, atomistic quantum solvers, and neural-network predictors. The data are available at https://huggingface.co/datasets/YinqiZeng704/200k_monomer_properties, and the paper is available at https://arxiv.org/abs/2511.21747
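As a starting point, the dataset can probably be pulled with the Hugging Face `datasets` library. The sketch below is not part of this repository. The helper names `repo_id_from_url` and `load_quantumchem` are hypothetical, and the split and column names are not stated here, so the code does not assume any.

```python
from urllib.parse import urlparse

# Dataset URL as given in this README.
DATASET_URL = "https://huggingface.co/datasets/YinqiZeng704/200k_monomer_properties"


def repo_id_from_url(url: str) -> str:
    """Extract the '<owner>/<name>' repo id from a Hugging Face dataset URL."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected path layout: datasets/<owner>/<name>
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return "/".join(parts[1:])


def load_quantumchem(split=None):
    """Hypothetical helper: download the dataset (requires `pip install datasets`)."""
    # Imported lazily so the URL helper above works without the dependency.
    from datasets import load_dataset

    return load_dataset(repo_id_from_url(DATASET_URL), split=split)


# Example usage (needs network access):
#   ds = load_quantumchem()
#   print(ds)  # inspect the available splits and column names
```

Printing the loaded object is the quickest way to see which splits and property columns are actually published before writing any downstream code.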

Two-photon absorption (TPA) lithography to fabricate advanced micro-optical devices:

Screenshot 2025-11-30 at 18 30 53

Data curation process:

Screenshot 2025-11-30 at 18 25 57

Fine-tuning and evaluation:

Using QuantumChem-200K, we fine-tuned the open-source Qwen2.5-32B LLM to create a chemistry AI assistant capable of forward polymer property prediction from SMILES. The results show that domain-specific fine-tuning significantly improves prediction accuracy over baselines such as GPT-4o, Llama-3.1-70B, and the base Qwen2.5-32B model. The evaluation metric is the wMAE:

Screenshot 2025-11-30 at 18 31 14
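This README does not spell out the wMAE formula. The sketch below assumes it is a weighted mean absolute error across properties: each property's MAE is divided by the range of its reference values so that properties on different scales are comparable, then the results are averaged with per-property weights. The normalization, the default equal weights, and the property names in the example are illustrative assumptions; see the paper for the exact definition.

```python
from typing import Dict, Optional, Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Plain mean absolute error for one property."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def wmae(
    y_true: Dict[str, Sequence[float]],
    y_pred: Dict[str, Sequence[float]],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted MAE over properties (assumed form, not the paper's exact one)."""
    props = list(y_true)
    if weights is None:
        # Assumption: equal weights when none are supplied.
        weights = {p: 1.0 for p in props}
    total_weight = sum(weights[p] for p in props)
    score = 0.0
    for p in props:
        # Assumption: scale each property's error by its reference range.
        span = max(y_true[p]) - min(y_true[p])
        scale = span if span > 0 else 1.0
        score += weights[p] * mae(y_true[p], y_pred[p]) / scale
    return score / total_weight


# Hypothetical property names and toy values, for illustration only.
reference = {"boiling_point": [0.0, 10.0], "solubility": [0.0, 1.0]}
predicted = {"boiling_point": [1.0, 9.0], "solubility": [0.5, 0.5]}
print(wmae(reference, predicted))
```

Lower values indicate better agreement with the reference properties, which is how a comparison between the fine-tuned model and the baselines would be read under this assumed definition.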

Benchmarking:

Screenshot 2025-11-30 at 18 35 53
