BOGDA: A Deep Learning Framework for Pest Resistance Prediction
Authors: Ziyan Zhuang, Qiao Sheng, Jingang Xie
Abstract: BOGDA is a framework designed to predict the evolution of pesticide resistance in agricultural pests using deep learning techniques. It employs a Temporal Convolutional Network (TCN) combined with an Attention mechanism to model the complex temporal patterns of resistance development. The framework also includes a Probit analysis module for processing bioassay data.
Origin of the Name: BOGDA (Mount Bogda) is the highest peak in the eastern Tianshan Mountains. It symbolizes resilience and adaptation in harsh environments, mirroring the goal of this model to understand and predict how pests adapt (develop resistance) to environmental pressures (pesticides).
Features
- Advanced Model Architecture: Utilizes a Temporal Convolutional Network (TCN) with an Attention mechanism to effectively capture long-term dependencies in resistance data.
- Integrated Bioassay Analysis: Includes an LC50Processor class that estimates the median lethal concentration (LC50) from raw mortality data using the Probit model.
- Long-Term Forecasting: Predicts resistance levels (Resistance Ratio, RR) for multiple future generations based on historical data.
- Biological Constraints: Incorporates regularization during training to encourage predictions that align with typical biological resistance trends (e.g., resistance generally increasing or plateauing over time).
- Synthetic Data Generation: Features a dedicated script (synth_data.py) that can:
  - Generate an enhanced synthetic dataset that mimics the distribution of user-provided raw experimental data (A.xlsx, B.xlsx) while preserving privacy.
  - Automatically generate a random demonstration (Demo) dataset when raw experimental data is not available. This ensures the main prediction code (pest_predic_sys.py) can be run and tested.
- Ease of Use & Evaluation: Provides a complete workflow for training, prediction, and evaluation, outputting visualizations, performance reports, and prediction files.
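The "Biological Constraints" feature can be pictured as an extra penalty added to the training loss. The following is only a minimal sketch, assuming the penalty targets drops between consecutive predicted resistance ratios. The function names slope_penalty and total_loss are hypothetical, and the regularizer actually used in pest_predic_sys.py may be formulated differently.

```python
def slope_penalty(predicted_rr):
    """Mean squared size of the downward steps in a predicted RR sequence.

    Hypothetical illustration: increasing or flat sequences cost nothing,
    while any drop between consecutive generations is penalized.
    """
    if len(predicted_rr) < 2:
        return 0.0
    drops = [
        max(0.0, earlier - later)
        for earlier, later in zip(predicted_rr, predicted_rr[1:])
    ]
    return sum(d * d for d in drops) / len(drops)


def total_loss(mse, predicted_rr, slope_reg_weight):
    """Combine a data-fit term with the weighted slope penalty."""
    return mse + slope_reg_weight * slope_penalty(predicted_rr)
```

With this form, a larger slope_reg_weight pushes the model harder toward non-decreasing resistance curves, at the cost of fitting the observed data less closely.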
Data Handling & Privacy
Raw bioassay data (like LC50 results and mortality observations at different concentrations) can be sensitive or have usage restrictions. Therefore, no original experimental data is included in this open-source repository.
To allow researchers and developers to run, test, and understand this framework without accessing sensitive original data, we provide:
- The synth_data.py script: This is a key component.
  - Mode 1 (Based on Real Data): If you provide A.xlsx (containing LC50 and confidence interval data) and B.xlsx (containing concentration, tested insects, dead insects data) in the project's root directory, synth_data.py will read these files. It uses an optimization algorithm (combining Mean Squared Error and Kolmogorov-Smirnov test) to learn the characteristics of the real data's distribution. It then generates a synthetic dataset (output/synthetic_dataset.csv) that resembles the real data. This helps generate usable alternative data for model training while protecting the original data's privacy.
  - Mode 2 (Demonstration Mode): If A.xlsx and B.xlsx are not provided, synth_data.py will automatically run the generate_demo_data function. This creates a purely random dataset for demonstration purposes only. The script then runs its optimization process based on this demo data and generates output/synthetic_dataset.csv. Please note: The synthetic data generated in this mode does not reflect any real biological process and is primarily intended to verify that the code executes successfully.
- Input for pest_predic_sys.py: The main prediction script (pest_predic_sys.py) requires an input file named data.xlsx. This file should contain columns like generations, concentration(mg/l), dead number, and insects. You need to rename or copy the output/synthetic_dataset.csv generated by synth_data.py to data.xlsx and place it in the project's root directory for the prediction model to use.
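To make the "Mean Squared Error and Kolmogorov-Smirnov" idea in Mode 1 more concrete, here is a rough sketch of how a real sample and a synthetic sample could be compared. How synth_data.py actually combines the two measures is not stated here, so the functions ks_statistic, quantile_mse, and mismatch_score, as well as the ks_weight parameter, are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def quantile_mse(sample_a, sample_b):
    """MSE between sorted samples of equal length (a crude quantile match)."""
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must have the same length")
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum((x - y) ** 2 for x, y in pairs) / len(sample_a)


def mismatch_score(real, synthetic, ks_weight=1.0):
    """Hypothetical objective: lower means the synthetic sample resembles the real one more."""
    return quantile_mse(real, synthetic) + ks_weight * ks_statistic(real, synthetic)
```

An optimizer would then adjust the generator's distribution parameters to drive a score of this kind toward zero.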
Model Architecture (pest_predic_sys.py)
The main prediction model (pest_predic_sys.py) consists of:
- LC50Processor: Processes the input data.xlsx. It groups data by generation, fits dose-response curves using scipy.optimize.curve_fit with the Probit function (scipy.stats.norm.cdf), calculates the LC50 for each generation, and then computes the Resistance Ratio (RR).
- TCN_AttentionModel:
  - Uses multiple 1D convolutional layers with varying dilation factors (TCN) to capture temporal dependencies at different scales.
  - Applies a Multi-Head Attention mechanism (nn.MultiheadAttention) to weigh the features extracted by the TCN, focusing on more important time steps for prediction.
  - Outputs the predicted normalized Resistance Ratio for the next generation via a final fully connected layer (nn.Linear).
- Training System (PestResistanceSystem):
  - Manages data loading, preprocessing (MinMax Scaling), model initialization, the training loop (AdamW optimizer, MSE Loss + slope regularization), validation, learning rate scheduling (ReduceLROnPlateau), and early stopping.
  - Includes a predict method for generating multi-step future predictions.
  - Includes an evaluate method to assess model performance on training and testing sets (MSE, R²).
  - Includes a visualize method to plot historical and predicted resistance ratio curves.
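The LC50 and RR calculation can be illustrated with a small standalone sketch. LC50Processor fits the curve with scipy.optimize.curve_fit; the version below deliberately swaps in a simpler stand-in that uses only the standard library: it converts observed mortality to probits and fits a straight line by least squares. It assumes the dose enters as log10(concentration), clips 0% and 100% mortality with a hypothetical eps, and takes the earliest generation as the RR baseline. None of these choices is confirmed by the source.

```python
import math
from statistics import NormalDist


def estimate_lc50(concentrations, dead, insects, eps=0.005):
    """Estimate LC50 from one generation's bioassay rows.

    Assumed model: mortality = Phi(a + b * log10(concentration)).
    """
    std_normal = NormalDist()
    xs, zs = [], []
    for conc, n_dead, n_total in zip(concentrations, dead, insects):
        # Clip proportions so 0% and 100% mortality have finite probits.
        p = min(max(n_dead / n_total, eps), 1.0 - eps)
        xs.append(math.log10(conc))
        zs.append(std_normal.inv_cdf(p))
    x_mean = sum(xs) / len(xs)
    z_mean = sum(zs) / len(zs)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxz = sum((x - x_mean) * (z - z_mean) for x, z in zip(xs, zs))
    slope = sxz / sxx
    intercept = z_mean - slope * x_mean
    # Mortality is 50% where the probit line crosses zero.
    return 10 ** (-intercept / slope)


def resistance_ratios(lc50_by_generation):
    """RR of each generation relative to the earliest one (an assumed baseline)."""
    baseline = lc50_by_generation[min(lc50_by_generation)]
    return {gen: lc50 / baseline for gen, lc50 in sorted(lc50_by_generation.items())}
```

For example, resistance_ratios({0: 2.0, 1: 3.0, 2: 8.0}) returns {0: 1.0, 1: 1.5, 2: 4.0}, which is the kind of per-generation series the TCN model is then trained on.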
Installation
- Clone the repository:

      git clone https://github.com/ZiyanZhuang/BOGDA.git
      cd BOGDA

- Create a virtual environment (Recommended):

      python -m venv venv
      source venv/bin/activate    # Linux/macOS
      # venv\Scripts\activate     # Windows

- Install dependencies: Based on the code, you will need at least:

      pandas
      numpy
      scipy
      scikit-learn
      torch
      matplotlib
      openpyxl    # For reading/writing Excel files
      seaborn     # For visualization in synth_data.py (if kept)

  After creating the requirements.txt file, run:

      pip install -r requirements.txt

  (Please ensure requirements.txt includes the correct libraries and versions used in your environment.)
Usage
Step 1: Prepare/Generate Data
Option A (Using your own data - recommended for meaningful analysis):
1. Prepare an A.xlsx file with at least Generations and LC50(CI95% )(mg/L) columns (format example: 0.123(0.098-0.150)), and optionally a Slope±SE column.
2. Prepare a B.xlsx file with generations, concentration(mg/l), insects, and dead number columns.
3. Place A.xlsx and B.xlsx in the project root directory.
4. Run the synthetic data generation script:
    python synth_data.py
This will create synthetic_dataset.csv and optimized_parameters.csv in the output/ directory.
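The LC50(CI95% )(mg/L) cell format from step 1, for example 0.123(0.098-0.150), packs three numbers into one string. The sketch below shows one way such a cell could be split. The helper parse_lc50_ci is hypothetical and is not taken from synth_data.py, which may parse the column differently.

```python
import re

# One number, then "(lower-upper)", with optional spaces around each part.
_LC50_CI = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*\(\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*\)\s*$"
)


def parse_lc50_ci(cell):
    """Split a cell such as '0.123(0.098-0.150)' into (lc50, lower, upper)."""
    match = _LC50_CI.match(str(cell))
    if match is None:
        raise ValueError(f"unrecognized LC50(CI95%) value: {cell!r}")
    lc50, lower, upper = (float(part) for part in match.groups())
    return lc50, lower, upper
```

Checking a few cells of A.xlsx against a parser like this before running synth_data.py can catch formatting slips such as a missing parenthesis.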
Option B (Demonstration mode - for testing code execution only):
1. Ensure no A.xlsx or B.xlsx files are present in the project root directory.
2. Run the synthetic data generation script:
    python synth_data.py
The script will report data loading failure, generate demo data automatically, continue processing, and finally create synthetic_dataset.csv in the output/ directory based on the demo data.
Step 2: Prepare Input for Prediction Model
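- Copy or rename the file output/synthetic_dataset.csv to data.xlsx.
- Place data.xlsx in the project root directory.
Note that renaming alone does not change the file format. If pest_predic_sys.py reads data.xlsx with an Excel reader, open the CSV in a spreadsheet tool or pandas and save it as a genuine .xlsx workbook instead.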
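Before handing the file to the prediction script, it can help to confirm that the synthetic dataset has the expected columns. This is a small sketch using only the standard library. It assumes the header names match exactly the ones listed for pest_predic_sys.py (generations, concentration(mg/l), dead number, insects), and missing_columns is a hypothetical helper, not part of the repository.

```python
import csv

REQUIRED_COLUMNS = ("generations", "concentration(mg/l)", "dead number", "insects")


def missing_columns(csv_path, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    # utf-8-sig also accepts files that start with a byte-order mark.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]
```

An empty list from missing_columns("output/synthetic_dataset.csv") means every expected column is present.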
Step 3: Configure the Prediction Model (Optional)
You can modify parameters in the CONFIG dictionary at the top of pest_predic_sys.py as needed, for example:
- seq_length: Number of time steps in the input sequence.
- batch_size: Batch size during training.
- epochs: Maximum number of training epochs.
- pred_steps: Number of future generations to predict.
- num_channels: Number of channels in each TCN layer.
- slope_reg_weight: Weight for the resistance slope regularization term.
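To show what seq_length controls, the sketch below scales a resistance ratio series to [0, 1] and cuts it into input windows, each paired with the value that follows it. It is an illustration under assumptions: min_max_scale and make_windows are hypothetical helpers, and the script's own preprocessing may differ in detail.

```python
def min_max_scale(values):
    """Scale values to [0, 1]; also return (low, high) so the scaling can be undone."""
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid dividing by zero on a constant series
    return [(v - low) / span for v in values], (low, high)


def make_windows(series, seq_length):
    """Pair each run of seq_length values with the value that follows it."""
    return [
        (series[i:i + seq_length], series[i + seq_length])
        for i in range(len(series) - seq_length)
    ]
```

A series of N generations yields N - seq_length training samples, so a large seq_length on a short bioassay history leaves very little data to train on.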
Step 4: Train and Predict
Run the main prediction model script:
    python pest_predic_sys.py
The script will:
1. Load and process data.xlsx.
2. Split data into training and testing sets.
3. Initialize the model, optimizer, and loss function.
4. Train the model, monitor validation loss, and apply early stopping. The best model weights are saved to best_model.pth. Training progress is logged to training_log.txt.
5. Load the best model to predict the next pred_steps generations.
6. Evaluate the model's performance on the training and testing sets, writing results to model_report.txt.
7. Save the numerical predictions to predictions.csv.
8. Generate a plot comparing historical and predicted resistance ratios (prediction.png).
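Predicting the next pred_steps generations with a model that outputs one generation at a time is commonly done by feeding each prediction back in as input. The sketch below shows that pattern. It assumes the predict method works recursively, which the description above does not confirm, and forecast and predict_next are hypothetical names, with predict_next standing in for the trained model.

```python
def forecast(predict_next, history, seq_length, pred_steps):
    """Roll a one-step predictor forward pred_steps times.

    predict_next is any callable mapping a list of seq_length values to the
    next value; here it stands in for the trained model.
    """
    if len(history) < seq_length:
        raise ValueError("history is shorter than seq_length")
    window = list(history[-seq_length:])
    predictions = []
    for _ in range(pred_steps):
        next_value = predict_next(window)
        predictions.append(next_value)
        # Slide the window: drop the oldest value, append the new prediction.
        window = window[1:] + [next_value]
    return predictions
```

Because later steps are built on earlier predictions, errors accumulate, so forecasts many generations ahead should be read with more caution than the first few.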
Step 5: Check Outputs
Review the following files generated in the project root directory:
- best_model.pth: Saved weights of the best-performing model.
- training_log.txt: Log of the training process (loss per epoch, etc.).
- model_report.txt: Model evaluation metrics (MSE, R²) on training/testing sets and prediction info.
- predictions.csv: Predicted resistance ratios for future generations.
- prediction.png: Visualization plot of historical vs. predicted RR.
- output/ (from synth_data.py):
  - synthetic_dataset.csv: The generated synthetic data.
  - optimized_parameters.csv: (If Mode 1 was used) Optimized distribution parameters.
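For reading model_report.txt, it may help to recall how the two reported metrics are defined. The sketch below computes MSE and R² in plain Python using the standard definitions; the function names are illustrative, and the script itself may compute these values through a library.

```python
def mse(actual, predicted):
    """Mean squared error between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```

An R² of 1.0 means a perfect fit, 0.0 means the model does no better than always predicting the mean, and negative values on the testing set indicate it does worse than that.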
Project Structure

    .
    ├── pest_predic_sys.py           # Main prediction model and training script
    ├── synth_data.py                # Synthetic/Demo data generation script
    ├── requirements.txt             # Python dependency list (User needs to create/update)
    ├── data.xlsx                    # Input data for pest_predic_sys.py (renamed from synthetic_dataset.csv)
    ├── A.xlsx                       # (Optional) User's raw LC50 data for synth_data.py
    ├── B.xlsx                       # (Optional) User's raw mortality data for synth_data.py
    ├── output/                      # Output directory for synth_data.py
    │   ├── synthetic_dataset.csv
    │   └── optimized_parameters.csv
    ├── prediction.png               # Output visualization plot
    ├── predictions.csv              # Output prediction data
    ├── model_report.txt             # Output evaluation report
    ├── training_log.txt             # Output training log
    ├── best_model.pth               # Saved model weights
    └── README.md                    # This file