@Wei-jie-Wu

✅ Description

📘 Overview

This PR contributes the Feature-Guided Inverse Design (LSTMDoubleFit) model for the inverse design of organic A-site cations in low-dimensional perovskites.
The project integrates descriptor calculation, LSTM-based generative learning, and feature-constrained molecular optimization into a unified Paddle-based workflow.

This work reproduces and extends the study:

Feature-Guided Inverse Design of Organic A-Site Cations for Perovskite Dimensional Engineering, Wei-jie Wu et al., 2025.


🧠 Model Workflow

  1. Descriptor Calculation (Cal_ATSC1pe_MATS2c.py, Cal_SlogP_VSA2.py)

    • Calculates molecular descriptors (e.g., ATSC1pe, MATS2c, SlogP_VSA2) from input SMILES.
    • Results are stored in CSV files under Modeldata/.
  2. Dataset Preparation

    • Before training, merge all CSV files under the Modeldata/ directory into a single dataset:
      cat Modeldata/*.csv > Modeldata.csv
      
      The merged file Modeldata.csv will serve as the unified training dataset.
  3. Model Training and Generation (Best_Seq2seq.py)

    • Implements an LSTM-based sequence-to-sequence model for SMILES reconstruction and generation.
    • Inputs: one-hot encoded SMILES sequences + three physicochemical descriptors.
    • Outputs: property-conditioned SMILES sequences (new organic cations).
  4. Feature-Guided DoubleFit Model (MolecularDoubleFitting.py)

    • Performs secondary regression to enforce property–structure consistency.
    • Refines generated molecules according to target perovskite dimensional features.
  5. Postprocessing

    • Generated molecules are filtered, ranked, and optionally validated through structural optimization workflows.
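
The model input described in step 3 (one-hot encoded SMILES plus three descriptors) can be illustrated with a minimal sketch. It is not taken from Best_Seq2seq.py: the helper names, the padding symbol, the character-level tokenization, and the choice to append the descriptors to every time step are all assumptions made for illustration, and the descriptor values are placeholders.

```python
# Illustrative only: helper names, PAD symbol and per-step conditioning
# are assumptions, not the encoding used in Best_Seq2seq.py.
PAD = " "  # assumed padding character


def build_vocab(smiles_list):
    """Map every character seen in the SMILES strings (plus PAD) to an index."""
    chars = sorted({ch for s in smiles_list for ch in s} | {PAD})
    return {ch: i for i, ch in enumerate(chars)}


def one_hot_smiles(smiles, vocab, max_len):
    """Return a max_len x len(vocab) one-hot matrix, right-padded with PAD."""
    if len(smiles) > max_len:
        raise ValueError("SMILES string is longer than max_len")
    rows = []
    for ch in smiles.ljust(max_len, PAD):
        row = [0.0] * len(vocab)
        row[vocab[ch]] = 1.0
        rows.append(row)
    return rows


def condition_on_descriptors(one_hot, descriptors):
    """Append the same descriptor vector to every time step."""
    return [row + list(descriptors) for row in one_hot]


smiles_list = ["CCN", "NCCCN"]
vocab = build_vocab(smiles_list)
max_len = max(len(s) for s in smiles_list)

x = one_hot_smiles("CCN", vocab, max_len)
# Placeholder values standing in for ATSC1pe, MATS2c and SlogP_VSA2.
x_cond = condition_on_descriptors(x, [0.12, -0.34, 5.6])
print(len(x_cond), len(x_cond[0]))  # time steps, vocab size + 3
```

Character-level tokenization splits two-letter atoms such as Cl or Br into separate symbols; a real pipeline may need a SMILES-aware tokenizer instead.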

📁 Directory Structure

project/
└── Feature-Guided Inverse Design of LDPs/
    ├── Best_Seq2seq.py            # Main LSTM model: training & molecular generation
    ├── Cal_ATSC1pe_MATS2c.py      # Descriptor calculator (ATSC1pe, MATS2c)
    ├── Cal_SlogP_VSA2.py          # Descriptor calculator (SlogP_VSA2)
    ├── MolecularDoubleFitting.py  # Feature-guided molecular fitting model
    ├── MSEcalculation.py          # Evaluation metrics
    ├── ModelandDataAnalysis.py    # Dataset statistics & analysis
    ├── Modeldata/                 # Folder containing split CSV datasets
    ├── GreatMolecular.xlsx        # High-quality generated molecules
    ├── NewMolecules.xlsx          # Newly generated candidates
    ├── README.md                  # Project documentation
    └── data_parts/                # (Optional) Split dataset parts (<100 MB each)


⚙️ How to Run

1. Environment

      pip install paddlepaddle scikit-learn pandas numpy tqdm rdkit

2. Prepare dataset

   Merge CSV files in Modeldata/ into a single file:

      cat Modeldata/*.csv > Modeldata.csv

3. Train and generate molecules

      python Best_Seq2seq.py

4. Feature-guided molecular refinement

      python MolecularDoubleFitting.py
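
As a rough illustration of what feature-guided filtering and ranking could look like, the sketch below scores each candidate by the mean squared error between its descriptor vector and a target vector, then sorts by that score. This is an assumption about the general idea only, not the logic in MolecularDoubleFitting.py or MSEcalculation.py; the function names, the threshold, and all numeric values are hypothetical.

```python
# Illustrative only: not the PR's implementation. Function names,
# threshold and descriptor values are hypothetical placeholders.
def descriptor_mse(predicted, target):
    """Mean squared error between two descriptor vectors of equal length."""
    if len(predicted) != len(target):
        raise ValueError("descriptor vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def rank_candidates(candidates, target, max_mse=None):
    """Score candidates (SMILES -> descriptors) and sort by ascending MSE.

    Candidates whose MSE exceeds max_mse are dropped when max_mse is given.
    """
    scored = [(smi, descriptor_mse(desc, target)) for smi, desc in candidates.items()]
    if max_mse is not None:
        scored = [item for item in scored if item[1] <= max_mse]
    return sorted(scored, key=lambda item: item[1])


# Placeholder target and candidate descriptors (ATSC1pe, MATS2c, SlogP_VSA2).
target = [0.0, 0.5, 10.0]
candidates = {
    "NCCN": [0.1, 0.4, 9.0],
    "NCCCN": [0.0, 0.5, 10.5],
    "NCCCCN": [1.0, 1.5, 14.0],
}

ranked = rank_candidates(candidates, target, max_mse=1.0)
for smiles, mse in ranked:
    print(f"{smiles}\t{mse:.4f}")
```

Descriptors on very different scales would dominate an unweighted MSE, so standardizing each descriptor before scoring is probably advisable.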

📊 Dataset Note

The full dataset (~200 MB) was split into smaller CSV files under Modeldata/ to comply with GitHub's 100 MB per-file limit.
They must be merged before training as described above.
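
If every split file carries its own header row, plain `cat` would copy those header lines into the middle of the merged data. Whether the parts in Modeldata/ have headers is not stated in this PR, so the sketch below is a hedged alternative that assumes they do: it writes the header once and checks that all parts agree on it. The function name `merge_csv_parts` is hypothetical.

```python
import csv
import glob
import os


def merge_csv_parts(parts_dir, output_path):
    """Concatenate all *.csv files in parts_dir, writing the header only once.

    Assumes each part starts with the same header row. Returns the number
    of data rows written.
    """
    out_abs = os.path.abspath(output_path)
    part_paths = [
        p for p in sorted(glob.glob(os.path.join(parts_dir, "*.csv")))
        if os.path.abspath(p) != out_abs  # never read the output file itself
    ]
    if not part_paths:
        raise FileNotFoundError(f"no CSV parts found in {parts_dir!r}")

    header = None
    n_rows = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in part_paths:
            with open(path, newline="", encoding="utf-8") as part:
                reader = csv.reader(part)
                part_header = next(reader, None)
                if part_header is None:
                    continue  # skip empty files
                if header is None:
                    header = part_header
                    writer.writerow(header)
                elif part_header != header:
                    raise ValueError(f"header mismatch in {path!r}")
                for row in reader:
                    writer.writerow(row)
                    n_rows += 1
    return n_rows
```

Calling `merge_csv_parts("Modeldata", "Modeldata.csv")` would produce the same output file name as the `cat` command. If the parts were instead cut on line boundaries so that only the first one has a header, the `cat` command is already correct and this function would raise a header-mismatch error.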

🚀 Results

• LSTM reconstruction accuracy: >95%
• Enhanced novelty and property diversity in generated cations
• Generated organic A-site cations exhibit favorable dimensional preferences for RP- and DJ-type perovskites.
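
The PR does not say how reconstruction accuracy is measured. Two common string-level definitions are sketched below for reference: the fraction of SMILES reproduced exactly, and the fraction of matching characters. Both function names and the example strings are hypothetical, and neither is claimed to be the metric behind the >95% figure.

```python
# Illustrative only: the PR does not define its accuracy metric.
def exact_match_accuracy(originals, reconstructions):
    """Fraction of SMILES strings reproduced character for character."""
    if len(originals) != len(reconstructions):
        raise ValueError("inputs must have the same length")
    if not originals:
        raise ValueError("inputs must not be empty")
    matches = sum(o == r for o, r in zip(originals, reconstructions))
    return matches / len(originals)


def character_accuracy(originals, reconstructions):
    """Fraction of aligned positions with matching characters.

    Each pair is normalized by the longer string, so missing or extra
    characters count as errors.
    """
    if len(originals) != len(reconstructions):
        raise ValueError("inputs must have the same length")
    correct = 0
    total = 0
    for o, r in zip(originals, reconstructions):
        total += max(len(o), len(r))
        correct += sum(a == b for a, b in zip(o, r))
    if total == 0:
        raise ValueError("inputs contain no characters")
    return correct / total


# Made-up example strings, not data from the PR.
originals = ["NCCN", "NCCCN", "CC[NH3+]"]
reconstructions = ["NCCN", "NCCCN", "CC[NH2+]"]

print(exact_match_accuracy(originals, reconstructions))
print(character_accuracy(originals, reconstructions))
```

Plain string comparison counts two different SMILES spellings of the same molecule as a mismatch; a chemistry-aware comparison would canonicalize both strings first.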

💡 Key Contributions

• DoubleFit Learning Mechanism: Joint optimization of molecular structure and descriptor features.
• Feature-Constrained Generation: Enables directionally controlled molecular design.
• Descriptor-Integrated Workflow: Fully compatible with PaddlePaddle for training and inference.

🧑‍💻 Author

Weijie Wu
South China Normal University

leeleolay and others added 23 commits July 5, 2025 20:06
* fix: fix chgnet model download link

* fix: set nan to 0
* feat: add task readme

* fix error

* update logo
* fix: update reshape

* fix: fix
* feat: add task readme

* fix error

* update logo

* Add files via upload

* Update README.md

* Add files via upload

* Update README.md
* feat: add task readme

* fix error

* update logo

* Add files via upload

* Update README.md

* Add files via upload

* Update README.md

* Add files via upload

* Update README.md

* Add files via upload

* Update README.md

* Delete docs/paddlematerial_overview_en.png

* Delete docs/paddlematerial_overview_ch.png
* feat: add task readme

* fix error

* update logo

* Add files via upload

* Update README.md

* Add files via upload

* Update README.md

* Add files via upload

* Update README.md

* Add files via upload

* Update README.md

* Delete docs/paddlematerial_overview_en.png

* Delete docs/paddlematerial_overview_ch.png

* Delete docs/logo_ppmat.png

* Delete docs/ppmat_overview_en.png

* Add files via upload

* Update README.md

* Update README.md

* Update README.md

* fix conflict
* feat: add task readme

* fix error

* update logo

* Add files via upload

* Update README.md

* Add files via upload

* Update README.md

* Add files via upload

* Update README.md

* Add files via upload

* Update README.md

* Delete docs/paddlematerial_overview_en.png

* Delete docs/paddlematerial_overview_ch.png

* Delete docs/logo_ppmat.png

* Delete docs/ppmat_overview_en.png

* Add files via upload

* Update README.md

* Update README.md

* Update README.md

* fix conflict

* fix words error
* Update README.md

* Update README.md
* matbench_dataset

* training files

* Delete megnet_matbench_bulk_modulus_t_20250731_041800_s_42 directory

* Delete megnet_matbench_shear_modulus_t_20250731_041740_s_42 directory

* adapt matbench dataset

* revise PR

* adapt jarvis dataset

* update megnet_readme

* update requirements, update jarvis_dataset
* add DiffNMR

* fix bugs

* fix bugs

* fix bugs

* fix bugs

* fix bugs of diffprior

* fix bug

* fix bugs
…set name=alex_mp_20 for mattergen training with alex_mp20 dataset. (PaddlePaddle#200)

* fix diffnmr model and config.

* fix AlexMP20MatterGenDataset name=alex_mp_20 for mattergen training with alex_mp20 dataset.
@paddle-bot

paddle-bot bot commented Nov 13, 2025

Thanks for your contribution!

@paddle-bot paddle-bot bot added the contributor External developers label Nov 13, 2025
@CLAassistant

CLAassistant commented Nov 13, 2025

CLA assistant check
All committers have signed the CLA.

@leeleolay
Collaborator

leeleolay commented Nov 18, 2025

Thanks for your contribution!
Please fetch the latest version of the repository and update your code against it. We recommend adapting your model to the ppmat architecture. If you run into any problems with the adaptation, please contact us!

@leeleolay leeleolay added the non-compeleted need to revise label Dec 7, 2025
Collaborator

@leeleolay leeleolay left a comment

please revise this PR

Collaborator

@leeleolay leeleolay left a comment

please revise this PR
