QuantumChem-200K: A Large-Scale Open Organic Molecular Dataset for Quantum-Chemistry Property Screening and Language Model Benchmarking

We introduce QuantumChem-200K, a large-scale dataset of over 200,000 organic molecules annotated with eleven quantum-chemical properties, including two-photon absorption (TPA) cross sections, TPA spectral ranges, singlet–triplet intersystem crossing (ISC) energies, toxicity and synthetic accessibility scores, hydrophilicity, solubility, boiling point, molecular weight, and aromaticity. These values are computed using a hybrid workflow that integrates density functional theory (DFT), semi-empirical excited-state methods, atomistic quantum solvers, and neural-network predictors. The dataset is available at https://huggingface.co/datasets/YinqiZeng704/200k_monomer_properties, and the paper is available at https://arxiv.org/abs/2511.21747
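As a starting point for working with the data, the following is a minimal sketch of loading it with the Hugging Face `datasets` library. It assumes the repository can be read directly by `load_dataset` without extra configuration; the split and column names are not stated here, so the sketch only prints what it finds. The helper name `load_quantumchem` is hypothetical.

```python
# Minimal sketch: load QuantumChem-200K from the Hugging Face Hub.
# Assumption: the dataset repo loads with default settings; splits and
# column names are not documented here, so we inspect them at runtime.

DATASET_ID = "YinqiZeng704/200k_monomer_properties"


def load_quantumchem(dataset_id=DATASET_ID):
    """Download the dataset and return whatever splits the repo defines."""
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset

    return load_dataset(dataset_id)


if __name__ == "__main__":
    ds = load_quantumchem()
    # Printing the returned object lists the available splits and columns.
    print(ds)
```

Running the script requires network access and `pip install datasets`.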

Two-photon absorption (TPA) lithography to fabricate advanced micro-optical devices:

Data curation process:

Fine-tuning and evaluation:

Using QuantumChem-200K, we fine-tuned the open-source Qwen2.5-32B LLM to create a chemistry AI assistant capable of forward polymer property prediction from SMILES. The results show that domain-specific fine-tuning significantly improves prediction accuracy over baselines such as GPT-4o, Llama-3.1-70B, and the base Qwen2.5-32B model. The evaluation metric is the weighted mean absolute error (wMAE).
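To make the metric concrete, here is a hedged sketch of one common way to compute a weighted MAE across several properties. The exact weighting used in this project is not reproduced here; the `range_sqrt_weights` helper (range normalization combined with inverse square-root label counts) is an assumption chosen for illustration, and both function names are hypothetical. See `wmae_eval_sqrt.ipynb` for the evaluation actually used.

```python
import numpy as np


def range_sqrt_weights(y_true):
    """Illustrative per-property weights (an assumption, not the project's
    confirmed formula): scale each property by 1/range and by the inverse
    square root of its label count, normalized so the count factors sum to K.
    Assumes every property has at least two distinct labeled values."""
    y_true = np.asarray(y_true, dtype=float)
    n = (~np.isnan(y_true)).sum(axis=0)                      # labels per property
    r = np.nanmax(y_true, axis=0) - np.nanmin(y_true, axis=0)  # value range
    s = np.sqrt(1.0 / n)
    k = y_true.shape[1]
    return (k * s / s.sum()) / r


def wmae(y_true, y_pred, weights):
    """Weighted MAE: sum of weighted absolute errors over all labeled
    entries, averaged over samples. NaN entries in y_true are skipped."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = ~np.isnan(y_true)
    abs_err = np.where(mask, np.abs(y_pred - y_true), 0.0)
    return float((abs_err * np.asarray(weights)).sum() / y_true.shape[0])


if __name__ == "__main__":
    # Rows are molecules, columns are properties.
    y_true = [[1.0, 10.0], [3.0, 30.0]]
    y_pred = [[2.0, 10.0], [3.0, 20.0]]
    print(wmae(y_true, y_pred, range_sqrt_weights(y_true)))
```

Weighting by the inverse range keeps properties with large numeric scales (for example boiling point) from dominating the score over properties with small scales.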

Benchmarking:
