Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning
A novel framework that leverages state-of-the-art Vision-Language Models (VLMs) to automatically generate 3D plant simulation configurations in JSON format directly from drone-based remote sensing imagery. This is the first study to use VLMs to generate the structural JSON configurations required by functional-structural plant models (FSPMs), providing a scalable approach to reconstructing 3D plots for digital-twin applications in agriculture.
If you use this work, please cite:
```bibtex
@article{yun2026image2plant,
  title={Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning},
  author={Yun, Heesup and Uyehara, Isaac Kazuo and Ranario, Earl and Lundqvist, Lars and Diepenbrock, Christine H. and Bailey, Brian N. and Earles, J. Mason},
  journal={arXiv preprint arXiv:2603.08930},
  year={2026},
  institution={University of California, Davis}
}
```

This project introduces a synthetic benchmark for evaluating Vision-Language Models (VLMs) on generating plant simulation configurations for digital twins. While functional-structural plant models (FSPMs) are powerful tools for simulating biophysical processes in agricultural environments, their high complexity and low throughput create bottlenecks for deployment at scale.
- VLM-Based Simulation Configuration: Automatically generate Helios 3D plant simulator JSON configurations from drone imagery using Gemma 3 and Qwen3-VL models
- Five In-Context Learning Methods: Progressive context augmentation from zero-shot to grounding-information-enhanced few-shot learning
- Comprehensive Evaluation Framework: Three evaluation categories covering JSON integrity, geometric accuracy, and biophysical parameters
- Synthetic Benchmark Dataset: 1,120 synthetic cowpea plot images (224 plots × 5 growth stages) generated via Helios 3D
- Real-World Validation: Tested on drone orthophoto dataset from California cowpea breeding experiments
- LoRA Fine-Tuning: Parameter-efficient fine-tuning (PEFT) with 0.65% trainable parameters using Unsloth
- High-Throughput Inference: Asynchronous batch processing powered by vLLM with continuous batching and FP8 quantization
Our results demonstrate that:
- VLMs can interpret structural metadata and estimate parameters like plant count and sun azimuth
- Larger models and richer context generally improve performance, but exhibit non-linear trends
- Models often rely on contextual priors when visual cues are insufficient (contextual bias)
- Grounding information (plant count, locations, sun position) significantly reduces errors across all metrics
- Qwen3-VL models generally outperform Gemma3 models on geometric evaluations
- Fine-tuning improves JSON integrity and early-stage plant localization accuracy
- Sim-to-real gap remains a challenge, with higher errors on real images compared to synthetic data
Synthetic Cowpea Plot Dataset:
- 1,120 synthetic images (224 plots × 5 growth stages: 10, 30, 50, 70, 90 DAP)
- Generated using Helios 3D procedural plant generation library
- Resolution: 381×1080 pixels
- Plant species: Cowpea (Vigna unguiculata)
- Data-driven parameter sampling from real-world field measurements
Real Drone Orthophoto Dataset:
- 2025 California cowpea breeding experiment
- Plot dimensions: 1.5m × 3.0m with 1.5m alleys
- 15 beds, 12 plots per bed, 60 genotypes, 3 replicate blocks
- Processed with OpenDroneMap, annotated in QGIS
- Manual annotations for plant count and locations (10 DAP)
We tested five progressively enriched in-context learning methods:
| Method | Description | Context Enhancement |
|---|---|---|
| Method 1 | Zero-shot JSON generation | Role definition + task description + JSON format restriction instruction (FRI) |
| Method 2 | Method 1 + JSON schema | Adds structured schema with variable types and key definitions |
| Method 3 | Method 2 + few-shot JSONs | Includes example JSON outputs with reasoning texts |
| Method 4 | Method 3 + few-shot images | Adds paired image-JSON examples as chat history |
| Method 5 | Method 4 + grounding info | Provides plant count, locations, sun position as hints |
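Conceptually, each method stacks one more context block onto the previous one. The sketch below shows how such prompts could be assembled as chat messages; the `build_messages` helper and all prompt strings are illustrative placeholders, not the study's actual prompts:

```python
def build_messages(method, image, schema="", examples=(), grounding=""):
    """Assemble a chat message list whose context grows with the method number (1-5).

    All strings here are illustrative placeholders, not the study's prompts.
    """
    system = ("You are an expert in plant phenotyping. "          # role definition
              "Respond with a single JSON configuration only.")   # format restriction (Method 1)
    messages = [{"role": "system", "content": system}]

    # Methods 3-4: prior examples injected as chat history
    # (Method 3 adds reasoning + JSON; Method 4 pairs them with images).
    for ex in (examples if method >= 3 else ()):
        user = [ex["image"], ex["reasoning"]] if method >= 4 else ex["reasoning"]
        messages.append({"role": "user", "content": user})
        messages.append({"role": "assistant", "content": ex["json"]})

    parts = ["Generate the plot simulation configuration for this image."]
    if method >= 2 and schema:      # Method 2: structured schema with key definitions
        parts.append("Schema:\n" + schema)
    if method >= 5 and grounding:   # Method 5: plant count / locations / sun position hints
        parts.append("Grounding hints:\n" + grounding)
    messages.append({"role": "user", "content": [image, "\n\n".join(parts)]})
    return messages
```

The chat-history pattern for few-shot examples mirrors how the OpenAI-style message format used by Ollama and vLLM expects multi-turn context to be supplied.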
The Helios 3D simulator accepts JSON files with six top-level metadata fields:
- random_seed: Controls procedural generation randomness
- metadata: year, location, plant_type, days_after_planting
- environment: soil spectral data, sun elevation/azimuth angles
- field: plot size, bed layout, plant locations (x, y coordinates)
- plant_properties: PROSPECT model parameters (chlorophyll, carotenoids, anthocyanin, water content, dry matter, leaf structure)
- camera: sensor parameters, position, resolution
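For illustration, a minimal configuration touching all six top-level fields might look like the sketch below. Key names and values here are invented placeholders; the actual Helios schema defines the real keys:

```python
import json

# Illustrative minimal configuration covering the six top-level fields.
# All key names and values are placeholders, not the exact Helios schema.
config = {
    "random_seed": 42,
    "metadata": {"year": 2025, "location": "Davis, CA",
                 "plant_type": "cowpea", "days_after_planting": 30},
    "environment": {"soil_spectra": "soil_default",
                    "sun_elevation_deg": 55.0, "sun_azimuth_deg": 180.0},
    "field": {"plot_size_m": [1.5, 3.0], "bed_count": 1,
              "plant_locations": [[0.3, 0.5], [0.3, 1.5], [0.3, 2.5]]},
    "plant_properties": {"chlorophyll_ug_cm2": 40.0, "carotenoid_ug_cm2": 8.0,
                         "anthocyanin_ug_cm2": 1.0, "water_g_cm2": 0.012,
                         "dry_matter_g_cm2": 0.005, "leaf_structure_N": 1.5},
    "camera": {"position_m": [0.75, 1.5, 10.0],
               "resolution_px": [381, 1080]},
}
print(json.dumps(config, indent=2))
```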
JSON Integrity Metrics:
- JSON syntax error rate (parsing failures)
- JSON key-missing rate (incomplete structure)
- BLEU-4 score (similarity to ground truth)
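The two integrity rates can be sketched as follows. This is not the repo's `check_integrity_data.py`; the required-key set is assumed from the six top-level metadata fields described below:

```python
import json

# Top-level keys assumed from the Helios configuration description.
REQUIRED_KEYS = {"random_seed", "metadata", "environment",
                 "field", "plant_properties", "camera"}

def integrity_rates(outputs):
    """Return (syntax error rate over all outputs,
    key-missing rate over the outputs that parsed)."""
    syntax_errors, key_missing, parsed = 0, 0, 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            syntax_errors += 1
            continue
        parsed += 1
        if not REQUIRED_KEYS <= set(obj):  # at least one required key absent
            key_missing += 1
    n = len(outputs)
    return syntax_errors / n, (key_missing / parsed) if parsed else 0.0
```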
Geometric Evaluations:
- Days After Planting (DAP) - Mean Absolute Error (MAE)
- Plant count - MAE
- Plant locations - Chamfer Distance: d_CD(S1, S2) = d(S1, S2) + d(S2, S1)
- Sun elevation and azimuth angles - MAE
- Leaf pitch angle - MAE
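The Chamfer distance above can be sketched in a few lines, assuming the common mean nearest-neighbor convention for the one-sided term d(S1, S2); the paper's exact normalization may differ:

```python
import math

def one_sided(A, B):
    """d(A, B): mean distance from each point in A to its nearest neighbor in B."""
    return sum(min(math.dist(a, b) for b in B) for a in A) / len(A)

def chamfer(S1, S2):
    """Symmetric Chamfer distance d_CD(S1, S2) = d(S1, S2) + d(S2, S1)."""
    return one_sided(S1, S2) + one_sided(S2, S1)
```

Here `S1` and `S2` would be the predicted and ground-truth plant-location point sets (x, y coordinates in meters).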
Biophysical Evaluations:
- Chlorophyll content (μg/cm²) - MAE
- Carotenoid content (μg/cm²) - MAE
- Anthocyanin content (μg/cm²) - MAE
- Leaf water mass (g/cm²) - MAE
- Leaf dry matter (g/cm²) - MAE
- Leaf structure parameter (N) - MAE
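All of the MAE-based metrics above reduce to the same computation, applied per parameter across plots. A minimal sketch (the parameter key names are illustrative):

```python
def mae(pred, true):
    """Mean Absolute Error over paired scalar values."""
    assert len(pred) == len(true) and pred
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def per_param_mae(pred_dicts, true_dicts, keys):
    """MAE per biophysical parameter over paired predicted/reference property dicts."""
    return {k: mae([d[k] for d in pred_dicts],
                   [d[k] for d in true_dicts]) for k in keys}
```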
- Qwen3-VL: 4B, 8B, 30B parameters (released December 2025)
- Qwen3-VL Fine-tuned: 32B with LoRA (r=16, α=16, 141.9M trainable params, 0.65% of model capacity)
- Training: 1,788 synthetic images, 3 epochs, batch size 64, 4× NVIDIA A100 GPUs (~3 hours)
- Framework: Unsloth for accelerated training and reduced memory overhead
- Gemma 3: 4B, 12B, 27B parameters (released March 2025)
- Hosting: Self-hosted Ollama server with 32K context window
```
Image2PlantArchitecture_v2/
├── scripts/
│   ├── ollama_batch_inference.py           # Prompt construction and tool setup for Ollama inference
│   ├── vllm_batch_inference.py             # Asynchronous high-throughput inference script
│   ├── finetune_qwen3_vl_unsloth.py        # Fine-tuning the Qwen3-VL 32B model with Unsloth
│   └── render_real_test_image.py           # Rendering comparisons
│
├── notebooks/
│   ├── calculate_evaluation_metrics.py     # Comprehensive evaluation metrics pipeline
│   ├── calculate_evaluation_metrics.ipynb  # Notebook version for interactive testing
│   └── check_integrity_data.py             # JSON output syntax/key analysis
│
├── data/
│   └── raw/2025_Davis/                     # Dataset directory (HELIOS synthetic & real)
│
├── unsloth_env.yml                         # Conda environment for training
├── vllm_environment.yml                    # Conda environment for inference
├── requirements.txt
└── README.md
```
On the synthetic cowpea dataset with Method 5 (grounding information):
- JSON Integrity: <6.6% syntax error, <8.5% key-missing rate
- DAP Estimation: Fine-tuned Qwen3-VL 32B achieved lowest MAE
- Plant Count: Qwen3-VL 4B outperformed Gemma3 27B in some cases
- Plant Localization: Qwen3-VL models showed significantly lower Chamfer distances than Gemma3
- Grounding Info Impact: Reduced errors across all metrics by providing plant count/location hints
- Model Size Effect: Non-linear trends - larger models don't always guarantee better performance
- Higher Error Rates: Real images showed 4-5× higher syntax errors compared to synthetic
- DAP MAE: Up to 4.7 days error on real images (vs. lower on synthetic)
- Plant Count: Higher MAE (~5.3 plants) on real images
- Plant Locations: Surprisingly lower MAE (~0.1m) on real images
- Fine-Tuning Benefit: Reduced JSON errors and improved plant count estimation
Testing without target images revealed:
- Models rely heavily on contextual priors when visual cues are weak
- Blind baseline sometimes achieved lower errors than image-based inference (when close to mean-guess)
- Indicates contextual bias - models default to few-shot example distributions
- Adding images can introduce noise when models fail to extract reliable visual signals
For model finetuning, an environment with Unsloth and PyTorch is required:

```bash
conda env create -f unsloth_env.yml
conda activate unsloth
```

For high-throughput inference, a separate environment configured for vLLM is recommended:
```bash
conda env create -f vllm_environment.yml
# Or manually install vLLM
pip install vllm
```

Finetune Qwen3-VL models using the optimized Unsloth pipeline across single or multiple GPUs (via DDP):
```bash
# Edit configurations (epochs, model_id) inside the script
sbatch scripts/finetune_qwen3_vl_unsloth.sh
```

Run inference on a dataset using the highly optimized vLLM server:
Terminal 1 (Start vLLM Server):

```bash
export VLLM_V1_ENABLED=0
mamba run -n vllm-runtime python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-VL-8B-Instruct --port 8000 \
    --limit-mm-per-prompt '{"image": 4}' --max-model-len 32768
```

Terminal 2 (Run Inference Script):
```bash
mamba run -n digital-crops python scripts/vllm_batch_inference.py \
    --model Qwen/Qwen3-VL-8B-Instruct --vllm-url http://localhost:8000/v1 \
    --dataset-dir data/raw/2025_Davis/HELIOS_20260215 \
    --start 0 --end 100 --concurrency 32 --method 5
```

Calculate performance metrics (JSON Integrity + Biophysical Accuracy) on the output directories generated by inference:
```bash
# Run integrity checks
python notebooks/check_integrity_data.py
# Full evaluation pipeline
python notebooks/calculate_evaluation_metrics.py
```

Generate comparison grids combining Ground Truth images alongside method predictions:
```bash
python scripts/render_real_test_image.py
python test_synthetic_vs_real_plot.py
```

This work builds on recent advances in agricultural vision-language models:
- AgEval (Arshad et al., 2025): Plant disease identification and quantification
- AgroBench (Shinoda et al.): Expert-annotated multiple-choice QA benchmark for 7 agricultural tasks
- AgriGPT-VL (Yang et al., 2025): Trained on Agri-3M-VL dataset, evaluated on AgriBench-VL-4K
- AgroGPT (Awais et al., 2025): Multi-turn multimodal dialogue for agricultural concepts
Our work is the first to apply VLMs to 3D plant simulation configuration generation for digital twins.
- Helios 3D: 3D plant and environmental biophysical modeling framework (Bailey, 2019)
- Unsloth: Fast and memory-efficient LLM/VLM fine-tuning
- vLLM: High-throughput and memory-efficient LLM inference engine
- Qwen3-VL: State-of-the-art open-source vision-language models (2B-235B parameters)
- Ollama: Self-hosted LLM serving platform
- OpenDroneMap: Drone imagery processing
- QGIS: Plot boundary annotation
- Structured Decoding: Grammar-constrained decoding (Geng et al., 2023), XGrammar (Dong et al., 2024)
- Document Generation: Pix2Struct (Lee et al., 2022), DePlot (Liu et al., 2023), MatCha (Liu et al., 2023)
- Format Restriction Impact: Tam et al. (2024) showed structured output can harm complex reasoning tasks
- Contextual Bias: Models often rely on few-shot example distributions rather than visual cues when signals are ambiguous
- Non-Linear Scaling: Larger models don't always outperform smaller ones - context quality matters more than model size
- Grounding Information: Providing basic spatial hints (plant count, locations) dramatically improves all metrics
- Sim-to-Real Gap: Synthetic training doesn't fully transfer to real images; fine-tuning helps but gap remains significant
- Biophysical parameters (chlorophyll, carotenoids) show high error rates
- Models struggle with late-stage growth (70-90 DAP) due to canopy occlusion
- JSON integrity issues persist (~6-8% error rates even with best methods)
- Real-world performance lags synthetic benchmarks
- Extended Context: Test 128K token windows with 50+ few-shot examples
- Specialized Fine-Tuning: Train on 10K+ diverse synthetic images
- Hybrid Approaches: Combine computer vision preprocessing with VLM reasoning
- Leaf Color References: Provide pigment-indexed color calibration charts
- Multi-Species: Extend to wheat, maize, soybean, and other crops
- Domain Adaptation: Bridge sim-to-real gap with adversarial training
- Structured Decoding: Force valid JSON schemas during generation
- Temporal Modeling: Multi-temporal predictions from time-series imagery
We welcome contributions! Areas for improvement include:
- Extending to other crop species (wheat, soybean, maize, etc.)
- Improving biophysical parameter estimation (chlorophyll, carotenoids)
- Reducing sim-to-real gap through domain adaptation
- Testing larger context windows (128K tokens) with more few-shot examples
- Incorporating structured decoding methods
- Multi-temporal analysis (time-series predictions)
- Integration with other FSPMs (e.g., OpenAlea, GroIMP)
Please open an issue or pull request if you'd like to contribute!
Authors:
- Heesup Yun - hspyun@ucdavis.edu
- Isaac Kazuo Uyehara - ikuyehara@ucdavis.edu
- Earl Ranario - ewranario@ucdavis.edu
- Lars Lundqvist - llund@ucdavis.edu
- Christine H. Diepenbrock - chdiepenbrock@ucdavis.edu
- Brian N. Bailey - bnbailey@ucdavis.edu
- J. Mason Earles - jmearles@ucdavis.edu
Institution: University of California, Davis
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
This research leverages the outstanding open-source contributions of the Helios, Unsloth, vLLM, and Qwen teams. We thank the Gates Foundation for research support.
Last Updated: March 2026