BOGDA: A Deep Learning Framework for Pest Resistance Prediction

License: MIT

Authors: Ziyan Zhuang, Qiao Sheng, Jingang Xie

Abstract: BOGDA is a framework designed to predict the evolution of pesticide resistance in agricultural pests using deep learning techniques. It employs a Temporal Convolutional Network (TCN) combined with an Attention mechanism to model the complex temporal patterns of resistance development. The framework also includes a Probit analysis module for processing bioassay data.

Origin of the Name: BOGDA (Mount Bogda) is the highest peak in the eastern Tianshan Mountains. It symbolizes resilience and adaptation in harsh environments, mirroring the goal of this model to understand and predict how pests adapt (develop resistance) to environmental pressures (pesticides).

Features

  1. Advanced Model Architecture: Utilizes a Temporal Convolutional Network (TCN) with an Attention mechanism to effectively capture long-term dependencies in resistance data.
  2. Integrated Bioassay Analysis: Includes an LC50Processor class that estimates the median lethal concentration (LC50) from raw mortality data using the Probit model.
  3. Long-Term Forecasting: Predicts resistance levels (Resistance Ratio, RR) for multiple future generations based on historical data.
  4. Biological Constraints: Incorporates regularization during training to encourage predictions that align with typical biological resistance trends (e.g., resistance generally increasing or plateauing over time).
  5. Synthetic Data Generation: Features a dedicated script (synth_data.py) that can either generate an enhanced synthetic dataset mimicking the distribution of user-provided raw experimental data (A.xlsx, B.xlsx) while preserving privacy, or automatically generate a random demonstration (demo) dataset when raw experimental data is not available. This ensures the main prediction script (pest_predic_sys.py) can always be run and tested.
  6. Ease of Use & Evaluation: Provides a complete workflow for training, prediction, and evaluation, outputting visualizations, performance reports, and prediction files.
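The Probit-based LC50 estimation behind the LC50Processor (feature 2) can be sketched as follows. This is a minimal illustration, not the repository's actual implementation: the bioassay numbers are made up, and the fit assumes mortality follows a cumulative normal in log10 concentration.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def probit(log_c, mu, sigma):
    # Probit dose-response: mortality = Phi((log10(conc) - mu) / sigma)
    return norm.cdf((log_c - mu) / sigma)

# Hypothetical bioassay: concentrations (mg/L) and observed mortality fractions
conc = np.array([0.01, 0.03, 0.1, 0.3, 1.0, 3.0])
mortality = np.array([0.02, 0.10, 0.45, 0.80, 0.97, 1.00])

(mu, sigma), _ = curve_fit(probit, np.log10(conc), mortality, p0=[0.0, 1.0])
lc50 = 10 ** mu  # LC50 is the concentration where predicted mortality = 0.5
print(f"LC50 ~ {lc50:.3f} mg/L")
```

The Resistance Ratio (RR) then follows by dividing each generation's LC50 by the LC50 of the susceptible reference generation.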

Data Handling & Privacy

Raw bioassay data (like LC50 results and mortality observations at different concentrations) can be sensitive or have usage restrictions. Therefore, no original experimental data is included in this open-source repository.

To allow researchers and developers to run, test, and understand this framework without accessing sensitive original data, we provide:

  1. The synth_data.py script: This is a key component, with two modes.
     Mode 1 (based on real data): If you place A.xlsx (LC50 and confidence-interval data) and B.xlsx (concentration, number of insects tested, and number dead) in the project's root directory, synth_data.py reads these files, uses an optimization procedure (combining mean squared error with the Kolmogorov-Smirnov test) to learn the distribution of the real data, and generates a synthetic dataset (output/synthetic_dataset.csv) that resembles it. This yields usable training data while protecting the original data's privacy.
     Mode 2 (demonstration mode): If A.xlsx and B.xlsx are not provided, synth_data.py automatically runs the generate_demo_data function, which creates a purely random dataset for demonstration purposes only. The script then runs the same optimization on this demo data and generates output/synthetic_dataset.csv. Please note: synthetic data generated in this mode does not reflect any real biological process and is intended only to verify that the code executes successfully.
  2. Input for pest_predic_sys.py: The main prediction script (pest_predic_sys.py) requires an input file named data.xlsx containing columns such as generations, concentration(mg/l), dead number, and insects. You need to convert the output/synthetic_dataset.csv generated by synth_data.py into an Excel file named data.xlsx and place it in the project's root directory for the prediction model to use.
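One way to produce the required data.xlsx is a short pandas conversion; in practice the frame would come from pd.read_csv("output/synthetic_dataset.csv"), but the placeholder rows below (matching the expected column names) keep the sketch self-contained. Writing .xlsx requires openpyxl.

```python
import pandas as pd

# Placeholder rows matching the expected schema; in practice use:
#   df = pd.read_csv("output/synthetic_dataset.csv")
df = pd.DataFrame({
    "generations": [1, 1, 2],
    "concentration(mg/l)": [0.1, 0.3, 0.1],
    "dead number": [5, 12, 3],
    "insects": [30, 30, 30],
})
df.to_excel("data.xlsx", index=False)  # requires openpyxl

check = pd.read_excel("data.xlsx")
print(check.shape)
```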

Model Architecture (pest_predic_sys.py)

The main prediction model (pest_predic_sys.py) consists of:

  1. LC50Processor: Processes the input data.xlsx. It groups data by generation, fits dose-response curves using scipy.optimize.curve_fit with the Probit function (scipy.stats.norm.cdf), calculates the LC50 for each generation, and then computes the Resistance Ratio (RR).
  2. TCN_Attention Model: Uses multiple 1D convolutional layers with varying dilation factors (TCN) to capture temporal dependencies at different scales. Applies a Multi-Head Attention mechanism (nn.MultiheadAttention) to weigh the features extracted by the TCN, focusing on more important time steps for prediction. Outputs the predicted normalized Resistance Ratio for the next generation via a final fully connected layer (nn.Linear).
  3. Training System (PestResistanceSystem): Manages data loading, preprocessing (min-max scaling), model initialization, the training loop (AdamW optimizer, MSE loss plus slope regularization), validation, learning-rate scheduling (ReduceLROnPlateau), and early stopping. It also provides a predict method for multi-step future forecasts, an evaluate method that reports model performance (MSE, R²) on the training and testing sets, and a visualize method that plots historical and predicted resistance-ratio curves.
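The TCN-plus-attention architecture described above can be sketched in a few lines of PyTorch. Layer sizes, dilations, and the pooling choice here are illustrative assumptions, not the repository's actual configuration.

```python
import torch
import torch.nn as nn

class TCNAttention(nn.Module):
    """Sketch of a TCN + multi-head attention regressor.
    Hyperparameters are illustrative, not the repository's CONFIG values."""

    def __init__(self, in_dim=1, channels=32, heads=4):
        super().__init__()
        # Dilated 1D convolutions capture temporal patterns at growing scales.
        self.tcn = nn.Sequential(
            nn.Conv1d(in_dim, channels, kernel_size=3, padding=1, dilation=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.fc = nn.Linear(channels, 1)

    def forward(self, x):                                # x: (batch, seq, in_dim)
        h = self.tcn(x.transpose(1, 2)).transpose(1, 2)  # (batch, seq, channels)
        a, _ = self.attn(h, h, h)                        # self-attention over time steps
        return self.fc(a[:, -1, :])                      # next-generation normalized RR

model = TCNAttention()
out = model(torch.randn(8, 10, 1))  # batch of 8 sequences, 10 generations each
print(out.shape)
```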

Installation

  1. Clone the repository:

    git clone https://github.com/ZiyanZhuang/BOGDA.git
    cd BOGDA
  2. Create a virtual environment (Recommended):

    python -m venv venv
    source venv/bin/activate  # Linux/macOS
    # venv\Scripts\activate  # Windows
  3. Install dependencies: Based on the code, you will need at least:

    pandas
    numpy
    scipy
    scikit-learn
    torch
    matplotlib
    openpyxl # For reading/writing Excel files
    seaborn # For visualization in synth_data.py (if kept)
    

    After creating the requirements.txt file, run:

    pip install -r requirements.txt

    (Please ensure requirements.txt includes the correct libraries and versions used in your environment.)

Usage

Step 1: Prepare/Generate Data

Option A (using your own data; recommended for meaningful analysis):

  1. Prepare an A.xlsx file with at least Generations and LC50(CI95%)(mg/L) columns (format example: 0.123(0.098-0.150)), and optionally a Slope±SE column.
  2. Prepare a B.xlsx file with generations, concentration(mg/l), insects, and dead number columns.
  3. Place A.xlsx and B.xlsx in the project root directory.
  4. Run the synthetic data generation script:

    python synth_data.py

This will create synthetic_dataset.csv and optimized_parameters.csv in the output/ directory.
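The combined LC50/CI cell format in A.xlsx (e.g. 0.123(0.098-0.150)) can be split with a small parser. The regex below is an assumption based on the format example in this README and may need adjusting to the actual files.

```python
import re

def parse_lc50(cell: str):
    """Split 'LC50(lower-upper)' cells, e.g. '0.123(0.098-0.150)'.
    The pattern is a guess from the README's format example."""
    m = re.match(r"([\d.]+)\(([\d.]+)-([\d.]+)\)", cell.strip())
    if m is None:
        raise ValueError(f"unexpected LC50 cell format: {cell!r}")
    lc50, lo, hi = map(float, m.groups())
    return lc50, lo, hi

print(parse_lc50("0.123(0.098-0.150)"))
```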

Option B (demonstration mode; for testing code execution only):

  1. Ensure no A.xlsx or B.xlsx files are present in the project root directory.
  2. Run the synthetic data generation script:

    python synth_data.py

The script will report that data loading failed, generate demo data automatically, continue processing, and finally create synthetic_dataset.csv in the output/ directory based on the demo data.

Step 2: Prepare Input for Prediction Model

  1. Convert output/synthetic_dataset.csv to an Excel file named data.xlsx (for example, by re-saving it with pandas or a spreadsheet tool; simply renaming the .csv will generally not produce a valid Excel file).
  2. Place data.xlsx in the project root directory.

Step 3: Configure the Prediction Model (Optional)

You can modify parameters in the CONFIG dictionary at the top of pest_predic_sys.py as needed, for example:

  seq_length: number of time steps in the input sequence.
  batch_size: batch size during training.
  epochs: maximum number of training epochs.
  pred_steps: number of future generations to predict.
  num_channels: number of channels in each TCN layer.
  slope_reg_weight: weight of the resistance-slope regularization term.
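As an illustration, such a dictionary might look like the sketch below. The key names come from this README; every value is a made-up placeholder, not the script's actual default.

```python
# Illustrative values only; the defaults in pest_predic_sys.py may differ.
CONFIG = {
    "seq_length": 5,               # input window: generations per training sample
    "batch_size": 16,              # samples per gradient step
    "epochs": 200,                 # upper bound; early stopping may halt sooner
    "pred_steps": 3,               # future generations to forecast
    "num_channels": [32, 32, 32],  # channels per TCN layer
    "slope_reg_weight": 0.1,       # strength of the resistance-slope penalty
}
print(sorted(CONFIG))
```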

Step 4: Train and Predict

Run the main prediction model script:

    python pest_predic_sys.py

The script will:

  1. Load and process data.xlsx.
  2. Split the data into training and testing sets.
  3. Initialize the model, optimizer, and loss function.
  4. Train the model, monitor validation loss, and apply early stopping; the best model weights are saved to best_model.pth, and training progress is logged to training_log.txt.
  5. Load the best model to predict the next pred_steps generations.
  6. Evaluate performance on the training and testing sets, writing results to model_report.txt.
  7. Save the numerical predictions to predictions.csv.
  8. Generate a plot comparing historical and predicted resistance ratios (prediction.png).
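The slope regularization mentioned above can be illustrated with a toy penalty that discourages decreasing resistance between consecutive predicted steps, in line with the "increasing or plateauing" biological prior. This is a sketch of the idea, not the repository's exact loss term.

```python
import torch

def slope_penalty(pred: torch.Tensor, weight: float = 0.1) -> torch.Tensor:
    # Penalize negative generation-to-generation differences in predicted RR,
    # nudging forecasts toward rising or plateauing resistance trends.
    # (Illustrative formulation; pest_predic_sys.py may define this differently.)
    diffs = pred[1:] - pred[:-1]
    return weight * torch.relu(-diffs).mean()

pred = torch.tensor([1.0, 0.8, 1.2])  # the dip from 1.0 to 0.8 is penalized
print(slope_penalty(pred))
```

During training, a term like this would simply be added to the MSE loss before backpropagation.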

Step 5: Check Outputs

Review the following files generated in the project root directory:

best_model.pth: saved weights of the best-performing model.
training_log.txt: log of the training process (loss per epoch, etc.).
model_report.txt: evaluation metrics (MSE, R²) on the training/testing sets, plus prediction info.
predictions.csv: predicted resistance ratios for future generations.
prediction.png: visualization of historical vs. predicted RR.

In the output/ directory (from synth_data.py):

synthetic_dataset.csv: the generated synthetic data.
optimized_parameters.csv: optimized distribution parameters (Mode 1 only).

Project Structure

    .
    ├── pest_predic_sys.py        # Main prediction model and training script
    ├── synth_data.py             # Synthetic/demo data generation script
    ├── requirements.txt          # Python dependency list (user needs to create/update)
    ├── data.xlsx                 # Input data for pest_predic_sys.py (converted from synthetic_dataset.csv)
    ├── A.xlsx                    # (Optional) User's raw LC50 data for synth_data.py
    ├── B.xlsx                    # (Optional) User's raw mortality data for synth_data.py
    ├── output/                   # Output directory for synth_data.py
    │   ├── synthetic_dataset.csv
    │   └── optimized_parameters.csv
    ├── prediction.png            # Output visualization plot
    ├── predictions.csv           # Output prediction data
    ├── model_report.txt          # Output evaluation report
    ├── training_log.txt          # Output training log
    ├── best_model.pth            # Saved model weights
    └── README.md                 # This file
