QuantumChem-200K: A Large-Scale Open Organic Molecular Dataset for Quantum-Chemistry Property Screening and Language Model Benchmarking
We introduce QuantumChem-200K, a large-scale dataset of over 200,000 organic molecules annotated with eleven quantum-chemical properties, including two-photon absorption (TPA) cross sections, TPA spectral ranges, singlet–triplet intersystem crossing (ISC) energies, toxicity and synthetic accessibility scores, hydrophilicity, solubility, boiling point, molecular weight, and aromaticity. These values are computed using a hybrid workflow that integrates density functional theory (DFT), semi-empirical excited-state methods, atomistic quantum solvers, and neural-network predictors. The dataset is available at https://huggingface.co/datasets/YinqiZeng704/200k_monomer_properties, and the paper is available at https://arxiv.org/abs/2511.21747.
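As a minimal sketch of how the records might be inspected, the snippet below streams a few rows from the Hugging Face Hub using the `datasets` library. The repository id comes from the URL above; the helper names `dataset_url` and `peek` are hypothetical, and the code assumes the repository is stored in a layout that `datasets` can auto-detect. No split or column names are assumed, since they are not stated here.

```python
from itertools import islice

# Repository id taken from the dataset URL given above.
DATASET_ID = "YinqiZeng704/200k_monomer_properties"


def dataset_url(repo_id: str = DATASET_ID) -> str:
    """Build the Hugging Face Hub URL for a dataset repository id."""
    return f"https://huggingface.co/datasets/{repo_id}"


def peek(repo_id: str = DATASET_ID, n: int = 3) -> list:
    """Stream the first n records without downloading the full dataset.

    Assumption: the repository layout is one that `datasets` can load
    automatically; split and column names are not assumed.
    """
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # With streaming=True and no split given, this returns a dict-like
    # mapping from split name to an iterable of records.
    splits = load_dataset(repo_id, streaming=True)
    first_split = next(iter(splits.values()))
    return list(islice(first_split, n))
```

Calling `peek()` requires network access and returns a list of dictionaries, one per molecule, keyed by whatever column names the dataset defines.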
Using QuantumChem-200K, we fine-tuned the open-source Qwen2.5-32B LLM to create a chemistry AI assistant capable of forward polymer property prediction from SMILES. The results demonstrate that domain-specific fine-tuning significantly improves prediction accuracy over baselines such as GPT-4o, Llama-3.1-70B, and the base Qwen2.5-32B model. The evaluation metric is the wMAE: