Image2PlantArchitecture v2

License · Python 3.10+ · arXiv

Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning

A novel framework that leverages state-of-the-art Vision-Language Models (VLMs) to automatically generate 3D plant simulation configurations in JSON format directly from drone-based remote sensing imagery. This is the first study to utilize VLMs for generating structural JSON configurations required for functional-structural plant models (FSPMs), providing a scalable approach for reconstructing 3D plots for digital twin applications in agriculture.

Citation

If you use this work, please cite:

@article{yun2026image2plant,
  title={Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning},
  author={Yun, Heesup and Uyehara, Isaac Kazuo and Ranario, Earl and Lundqvist, Lars and Diepenbrock, Christine H. and Bailey, Brian N. and Earles, J. Mason},
  journal={arXiv preprint arXiv:2603.08930},
  year={2026},
  institution={University of California, Davis}
}

Overview

This project introduces a synthetic benchmark to evaluate Vision-Language Models (VLMs) in generating plant simulation configurations for digital twins. While functional-structural plant models (FSPMs) are powerful tools for simulating biophysical processes in agricultural environments, their high complexity and low throughput create bottlenecks for deployment at scale.

Key Features

  • VLM-Based Simulation Configuration: Automatically generate Helios 3D plant simulator JSON configurations from drone imagery using Gemma 3 and Qwen3-VL models
  • Five In-Context Learning Methods: Progressive context augmentation from zero-shot to grounding-information-enhanced few-shot learning
  • Comprehensive Evaluation Framework: Three evaluation categories covering JSON integrity, geometric accuracy, and biophysical parameters
  • Synthetic Benchmark Dataset: 1,120 synthetic cowpea plot images (224 plots × 5 growth stages) generated via Helios 3D
  • Real-World Validation: Tested on drone orthophoto dataset from California cowpea breeding experiments
  • LoRA Fine-Tuning: Parameter-efficient fine-tuning (PEFT) with 0.65% trainable parameters using Unsloth
  • High-Throughput Inference: Asynchronous batch processing powered by vLLM with continuous batching and FP8 quantization

Research Highlights

Our results demonstrate that:

  • VLMs can interpret structural metadata and estimate parameters like plant count and sun azimuth
  • Larger models and richer context generally improve performance, but exhibit non-linear trends
  • Models often rely on contextual priors when visual cues are insufficient (contextual bias)
  • Grounding information (plant count, locations, sun position) significantly reduces errors across all metrics
  • Qwen3-VL models generally outperform Gemma3 models on geometric evaluations
  • Fine-tuning improves JSON integrity and early-stage plant localization accuracy
  • Sim-to-real gap remains a challenge, with higher errors on real images compared to synthetic data

🧪 Methodology

Dataset

Synthetic Cowpea Plot Dataset:

  • 1,120 synthetic images (224 plots × 5 growth stages: 10, 30, 50, 70, 90 DAP)
  • Generated using Helios 3D procedural plant generation library
  • Resolution: 381×1080 pixels
  • Plant species: Cowpea (Vigna unguiculata)
  • Data-driven parameter sampling from real-world field measurements

Real Drone Orthophoto Dataset:

  • 2025 California cowpea breeding experiment
  • Plot dimensions: 1.5m × 3.0m with 1.5m alleys
  • 15 beds, 12 plots per bed, 60 genotypes, 3 replicate blocks
  • Processed with OpenDroneMap, annotated in QGIS
  • Manual annotations for plant count and locations (10 DAP)

In-Context Learning Methods

We tested five progressively enriched in-context learning methods:

| Method | Description | Context Enhancement |
|--------|-------------|---------------------|
| Method 1 | Zero-shot JSON generation | Role definition + task description + JSON format restriction instruction (FRI) |
| Method 2 | Method 1 + JSON schema | Adds structured schema with variable types and key definitions |
| Method 3 | Method 2 + few-shot JSONs | Includes example JSON outputs with reasoning texts |
| Method 4 | Method 3 + few-shot images | Adds paired image-JSON examples as chat history |
| Method 5 | Method 4 + grounding info | Provides plant count, locations, sun position as hints |
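The progressive context augmentation above can be sketched as a single message-builder. This is a minimal illustration, not the repository's actual prompt code (which lives in `scripts/ollama_batch_inference.py`); the prompt strings, schema, and example structure here are hypothetical stand-ins.

```python
import json

# Hypothetical placeholders for the real role/task prompt, format restriction
# instruction (FRI), and JSON schema used by the repository.
ROLE_AND_TASK = "You generate Helios plant-simulation configs from plot imagery."
FORMAT_RESTRICTION = "Respond with a single valid JSON object and nothing else."
JSON_SCHEMA = {"type": "object", "properties": {"random_seed": {"type": "integer"}}}

def build_messages(method, target_image, examples=(), grounding=None):
    """Assemble a chat-message list for in-context learning Methods 1-5."""
    system = f"{ROLE_AND_TASK} {FORMAT_RESTRICTION}"        # Method 1: role + task + FRI
    if method >= 2:                                          # Method 2: + JSON schema
        system += "\nSchema:\n" + json.dumps(JSON_SCHEMA)
    messages = [{"role": "system", "content": system}]
    if method >= 3:                                          # Method 3: + example JSON outputs
        for ex in examples:
            user = {"role": "user", "content": "Example plot."}
            if method >= 4:                                  # Method 4: + paired example images
                user["images"] = [ex["image"]]
            messages.append(user)
            messages.append({"role": "assistant", "content": json.dumps(ex["json"])})
    request = "Generate the configuration for this plot."
    if method >= 5 and grounding:                            # Method 5: + grounding hints
        request += " Hints: " + json.dumps(grounding)
    messages.append({"role": "user", "content": request, "images": [target_image]})
    return messages
```

Each method strictly extends the previous one, so a single `method` integer selects how much context reaches the model.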

JSON Configuration Structure

The Helios 3D simulator accepts JSON files with six top-level metadata fields:

  • random_seed: Controls procedural generation randomness
  • metadata: year, location, plant_type, days_after_planting
  • environment: soil spectral data, sun elevation/azimuth angles
  • field: plot size, bed layout, plant locations (x, y coordinates)
  • plant_properties: PROSPECT model parameters (chlorophyll, carotenoids, anthocyanin, water content, dry matter, leaf structure)
  • camera: sensor parameters, position, resolution
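A minimal configuration with the six top-level fields might look like the sketch below. The top-level keys follow the list above, but the nested keys and values are hypothetical stand-ins; the actual Helios schema differs in detail.

```python
import json

# Illustrative configuration only: top-level sections match the README's list,
# while nested keys/values are hypothetical examples, not the real Helios schema.
config = {
    "random_seed": 42,
    "metadata": {"year": 2025, "location": "Davis, CA",
                 "plant_type": "cowpea", "days_after_planting": 30},
    "environment": {"sun_elevation": 55.0, "sun_azimuth": 180.0},
    "field": {"plot_size": [1.5, 3.0],
              "plant_locations": [[0.3, 0.5], [0.3, 1.0], [0.3, 1.5]]},
    "plant_properties": {"chlorophyll": 40.0, "carotenoids": 8.0,
                         "anthocyanin": 1.0, "water_content": 0.012,
                         "dry_matter": 0.005, "leaf_structure": 1.5},
    "camera": {"resolution": [381, 1080], "position": [0.75, 1.5, 10.0]},
}

print(json.dumps(config, indent=2))
```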

Evaluation Metrics

JSON Integrity Metrics:

  • JSON syntax error rate (parsing failures)
  • JSON key-missing rate (incomplete structure)
  • BLEU-4 score (similarity to ground truth)
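The first two integrity metrics can be computed with a short check over the raw model outputs (BLEU-4 requires a reference text and is omitted here). This sketch assumes one reasonable convention: both rates are taken over all outputs, and a syntactically invalid output is not also counted as key-missing.

```python
import json

# The six required top-level keys from the Helios JSON configuration structure.
REQUIRED_KEYS = {"random_seed", "metadata", "environment",
                 "field", "plant_properties", "camera"}

def integrity_metrics(outputs):
    """Return (syntax_error_rate, key_missing_rate) over raw model outputs.
    Both rates use the total output count as the denominator; an unparsable
    output is counted once, as a syntax error."""
    syntax_errors = 0
    key_missing = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            syntax_errors += 1
            continue
        if not REQUIRED_KEYS <= set(obj):
            key_missing += 1
    n = len(outputs)
    return syntax_errors / n, key_missing / n
```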

Geometric Evaluations:

  • Days After Planting (DAP) - Mean Absolute Error (MAE)
  • Plant count - MAE
  • Plant locations - Chamfer Distance: d_CD(S1, S2) = d(S1, S2) + d(S2, S1)
  • Sun elevation and azimuth angles - MAE
  • Leaf pitch angle - MAE
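The scalar metrics above (and the biophysical metrics below) reduce to mean absolute error, while plant localization uses the symmetric Chamfer distance d_CD(S1, S2) = d(S1, S2) + d(S2, S1). A pure-Python sketch follows; it assumes d(A, B) averages each point's nearest-neighbor Euclidean distance to the other set, which is one common convention.

```python
import math

def mae(pred, true):
    """Mean absolute error between paired scalar predictions."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def chamfer(s1, s2):
    """Symmetric Chamfer distance d_CD(S1, S2) = d(S1, S2) + d(S2, S1),
    where d(A, B) is the mean nearest-neighbor distance from A to B.
    Points are (x, y) tuples, e.g. predicted vs. ground-truth plant locations."""
    def one_way(a, b):
        return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)
    return one_way(s1, s2) + one_way(s2, s1)
```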

Biophysical Evaluations:

  • Chlorophyll content (μg/cm²) - MAE
  • Carotenoid content (μg/cm²) - MAE
  • Anthocyanin content (μg/cm²) - MAE
  • Leaf water mass (g/cm²) - MAE
  • Leaf dry matter (g/cm²) - MAE
  • Leaf structure parameter (N) - MAE

Models Tested

  • Qwen3-VL: 4B, 8B, 30B parameters (released December 2025)
  • Qwen3-VL Fine-tuned: 32B with LoRA (r=16, α=16, 141.9M trainable params, 0.65% of model capacity)
    • Training: 1,788 synthetic images, 3 epochs, batch size 64, 4× NVIDIA A100 GPUs (~3 hours)
    • Framework: Unsloth for accelerated training and reduced memory overhead
  • Gemma 3: 4B, 12B, 27B parameters (released March 2025)
  • Hosting: Self-hosted Ollama server with 32K context window

Project Structure


Image2PlantArchitecture_v2/
├── scripts/
│   ├── ollama_batch_inference.py        # Prompt structure and tools for Ollama inference
│   ├── vllm_batch_inference.py          # Asynchronous high-throughput inference script
│   ├── finetune_qwen3_vl_unsloth.py     # Fine-tune the Qwen3-VL 32B model using Unsloth
│   └── render_real_test_image.py        # Render Ground Truth vs. prediction comparisons
│
├── notebooks/
│   ├── calculate_evaluation_metrics.py  # Comprehensive evaluation metrics pipeline
│   ├── calculate_evaluation_metrics.ipynb # Notebook version of the metrics pipeline
│   └── check_integrity_data.py          # JSON output syntax/key analysis
│
├── data/
│   └── raw/2025_Davis/                  # Dataset directory (HELIOS synthetic & Real)
│
├── unsloth_env.yml                      # Conda environment for training
├── vllm_environment.yml                 # Conda environment for inference
├── requirements.txt
└── README.md

Key Results

Synthetic Dataset Performance

On the synthetic cowpea dataset with Method 5 (grounding information):

  • JSON Integrity: <6.6% syntax error, <8.5% key-missing rate
  • DAP Estimation: Fine-tuned Qwen3-VL 32B achieved lowest MAE
  • Plant Count: Qwen3-VL 4B outperformed Gemma3 27B in some cases
  • Plant Localization: Qwen3-VL models showed significantly lower Chamfer distances than Gemma3
  • Grounding Info Impact: Reduced errors across all metrics by providing plant count/location hints
  • Model Size Effect: Non-linear trends - larger models don't always guarantee better performance

Real Dataset Performance (Sim-to-Real Gap)

  • Higher Error Rates: Real images showed 4-5× higher syntax errors compared to synthetic
  • DAP MAE: Up to 4.7 days error on real images (vs. lower on synthetic)
  • Plant Count: Higher MAE (~5.3 plants) on real images
  • Plant Locations: Surprisingly lower MAE (~0.1m) on real images
  • Fine-Tuning Benefit: Reduced JSON errors and improved plant count estimation

Ablation Study (Blind Baseline)

Testing without target images revealed:

  • Models rely heavily on contextual priors when visual cues are weak
  • Blind baseline sometimes achieved lower errors than image-based inference (when close to mean-guess)
  • Indicates contextual bias - models default to few-shot example distributions
  • Adding images can introduce noise when models fail to extract reliable visual signals

Installation

1. Training Environment (Unsloth)

For model finetuning, an environment with Unsloth and PyTorch is required:

conda env create -f unsloth_env.yml
conda activate unsloth

2. Inference Environment (vLLM)

For high-throughput inference, a separate environment configured for vLLM is recommended:

conda env create -f vllm_environment.yml
# Or manually install vllm
pip install vllm

Quick Start

1. Finetuning Qwen3-VL

Finetune Qwen3-VL models using the optimized Unsloth pipeline across single or multiple GPUs (via DDP):

# Edit configurations (epochs, model_id) inside the script
sbatch scripts/finetune_qwen3_vl_unsloth.sh

2. Batch Inference with vLLM

Run inference on a dataset using the highly optimized vLLM server:

Terminal 1 (Start vLLM Server):

export VLLM_V1_ENABLED=0 
mamba run -n vllm-runtime python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-VL-8B-Instruct --port 8000 \
    --limit-mm-per-prompt '{"image": 4}' --max-model-len 32768

Terminal 2 (Run Inference Script):

mamba run -n digital-crops python scripts/vllm_batch_inference.py \
    --model Qwen/Qwen3-VL-8B-Instruct --vllm-url http://localhost:8000/v1 \
    --dataset-dir data/raw/2025_Davis/HELIOS_20260215 \
    --start 0 --end 100 --concurrency 32 --method 5
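Under the hood, the client talks to vLLM's OpenAI-compatible `/v1/chat/completions` endpoint, sending each image inline as a base64 data URL. The helper below sketches the request payload only (no network call); the prompt text and model name are illustrative, and the actual script batches many such requests asynchronously.

```python
import base64
import json

def build_chat_payload(model, prompt, image_path):
    """Build an OpenAI-compatible /v1/chat/completions payload with an
    inline base64-encoded image, as accepted by a vLLM server."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        # Deterministic decoding suits structured JSON generation.
        "temperature": 0.0,
    }
```

POSTing this payload (e.g. with any HTTP client) to `http://localhost:8000/v1/chat/completions` returns the model's JSON configuration in the usual chat-completion response format.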

3. Evaluation Metrics

Calculate performance metrics (JSON Integrity + Biophysical Accuracy) on the output directories generated by inference:

# Run integrity checks
python notebooks/check_integrity_data.py

# Full evaluation pipeline
python notebooks/calculate_evaluation_metrics.py

4. Visualizations

Generate comparison grids combining Ground Truth images alongside method predictions:

python scripts/render_real_test_image.py
python test_synthetic_vs_real_plot.py

Related Work & References

Agricultural VLM Benchmarks

This work builds on recent advances in agricultural vision-language models:

  • AgEval (Arshad et al., 2025): Plant disease identification and quantification
  • AgroBench (Shinoda et al.): Expert-annotated multiple-choice QA benchmark for 7 agricultural tasks
  • AgriGPT-VL (Yang et al., 2025): Trained on Agri-3M-VL dataset, evaluated on AgriBench-VL-4K
  • AgroGPT (Awais et al., 2025): Multi-turn multimodal dialogue for agricultural concepts

Our work is the first to apply VLMs to 3D plant simulation configuration generation for digital twins.

Key Dependencies

  • Helios 3D: 3D plant and environmental biophysical modeling framework (Bailey, 2019)
  • Unsloth: Fast and memory-efficient LLM/VLM fine-tuning
  • vLLM: High-throughput and memory-efficient LLM inference engine
  • Qwen3-VL: State-of-the-art open-source vision-language models (2B-235B parameters)
  • Ollama: Self-hosted LLM serving platform
  • OpenDroneMap: Drone imagery processing
  • QGIS: Plot boundary annotation

Structured Output Generation Literature

  • Structured Decoding: Grammar-constrained decoding (Geng et al., 2023), XGrammar (Dong et al., 2024)
  • Document Generation: Pix2Struct (Lee et al., 2022), DePlot (Liu et al., 2023), MatCha (Liu et al., 2023)
  • Format Restriction Impact: Tam et al. (2024) showed structured output can harm complex reasoning tasks

Discussion & Future Directions

Key Findings

  1. Contextual Bias: Models often rely on few-shot example distributions rather than visual cues when signals are ambiguous
  2. Non-Linear Scaling: Larger models don't always outperform smaller ones - context quality matters more than model size
  3. Grounding Information: Providing basic spatial hints (plant count, locations) dramatically improves all metrics
  4. Sim-to-Real Gap: Synthetic training doesn't fully transfer to real images; fine-tuning helps but gap remains significant

Limitations

  • Biophysical parameters (chlorophyll, carotenoids) show high error rates
  • Models struggle with late-stage growth (70-90 DAP) due to canopy occlusion
  • JSON integrity issues persist (~6-8% error rates even with best methods)
  • Real-world performance lags synthetic benchmarks

Future Research Directions

  1. Extended Context: Test 128K token windows with 50+ few-shot examples
  2. Specialized Fine-Tuning: Train on 10K+ diverse synthetic images
  3. Hybrid Approaches: Combine computer vision preprocessing with VLM reasoning
  4. Leaf Color References: Provide pigment-indexed color calibration charts
  5. Multi-Species: Extend to wheat, maize, soybean, and other crops
  6. Domain Adaptation: Bridge sim-to-real gap with adversarial training
  7. Structured Decoding: Force valid JSON schemas during generation
  8. Temporal Modeling: Multi-temporal predictions from time-series imagery

🤝 Contributing

We welcome contributions! Areas for improvement include:

  • Extending to other crop species (wheat, soybean, maize, etc.)
  • Improving biophysical parameter estimation (chlorophyll, carotenoids)
  • Reducing sim-to-real gap through domain adaptation
  • Testing larger context windows (128K tokens) with more few-shot examples
  • Incorporating structured decoding methods
  • Multi-temporal analysis (time-series predictions)
  • Integration with other FSPMs (e.g., OpenAlea, GroIMP)

Please open an issue or pull request if you'd like to contribute!

📧 Contact

Authors:

Institution: University of California, Davis

📜 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🙏 Acknowledgements

This research leverages the outstanding open-source contributions of the Helios, Unsloth, vLLM, and Qwen teams. We thank the Gates Foundation for research support.


Last Updated: March 2026
