Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning
A novel framework that leverages state-of-the-art Vision-Language Models (VLMs) to automatically generate 3D plant simulation configurations in JSON format directly from drone-based remote sensing imagery. This is the first study to use VLMs to generate the structural JSON configurations required by functional-structural plant models (FSPMs), providing a scalable approach to reconstructing 3D plots for digital-twin applications in agriculture.
If you use this work, please cite:
```bibtex
@article{yun2026image2plant,
  title={Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning},
  author={Yun, Heesup and Uyehara, Isaac Kazuo and Ranario, Earl and Lundqvist, Lars and Diepenbrock, Christine H. and Bailey, Brian N. and Earles, J. Mason},
  journal={arXiv preprint arXiv:2603.08930},
  year={2026},
  institution={University of California, Davis}
}
```

This project introduces a synthetic benchmark for evaluating Vision-Language Models (VLMs) on generating plant simulation configurations for digital twins. While functional-structural plant models (FSPMs) are powerful tools for simulating biophysical processes in agricultural environments, their high complexity and low throughput create bottlenecks for deployment at scale.
- VLM-Based Simulation Configuration: Automatically generate Helios 3D plant simulator JSON configurations from drone imagery using Gemma 3 and Qwen3-VL models
- Five In-Context Learning Methods: Progressive context augmentation from zero-shot to grounding-information-enhanced few-shot learning
- Comprehensive Evaluation Framework: Three evaluation categories covering JSON integrity, geometric accuracy, and biophysical parameters
- Synthetic Benchmark Dataset: 1,120 synthetic cowpea plot images (224 plots × 5 growth stages) generated via Helios 3D
- Real-World Validation: Tested on drone orthophoto dataset from California cowpea breeding experiments
- LoRA Fine-Tuning: Parameter-efficient fine-tuning (PEFT) with 0.65% trainable parameters using Unsloth
- High-Throughput Inference: Asynchronous batch processing powered by vLLM with continuous batching and FP8 quantization
Our results demonstrate that:
- VLMs can interpret structural metadata and estimate parameters like plant count and sun azimuth
- Larger models and richer context generally improve performance, but exhibit non-linear trends
- Models often rely on contextual priors when visual cues are insufficient (contextual bias)
- Grounding information (plant count, locations, sun position) significantly reduces errors across all metrics
- Qwen3-VL models generally outperform Gemma3 models on geometric evaluations
- Fine-tuning improves JSON integrity and early-stage plant localization accuracy
- Sim-to-real gap remains a challenge, with higher errors on real images compared to synthetic data
Synthetic Cowpea Plot Dataset:
- 1,120 synthetic images (224 plots × 5 growth stages: 10, 30, 50, 70, 90 DAP)
- Generated using Helios 3D procedural plant generation library
- Resolution: 381×1080 pixels
- Plant species: Cowpea (Vigna unguiculata)
- Data-driven parameter sampling from real-world field measurements
Real Drone Orthophoto Dataset:
- 2025 California cowpea breeding experiment
- Plot dimensions: 1.5m × 3.0m with 1.5m alleys
- 15 beds, 12 plots per bed, 60 genotypes, 3 replicate blocks
- Processed with OpenDroneMap, annotated in QGIS
- Manual annotations for plant count and locations (10 DAP)
We tested five progressively enriched in-context learning methods:
| Method | Description | Context Enhancement |
|---|---|---|
| Method 1 | Zero-shot JSON generation | Role definition + task description + JSON format restriction instruction (FRI) |
| Method 2 | Method 1 + JSON schema | Adds structured schema with variable types and key definitions |
| Method 3 | Method 2 + few-shot JSONs | Includes example JSON outputs with reasoning texts |
| Method 4 | Method 3 + few-shot images | Adds paired image-JSON examples as chat history |
| Method 5 | Method 4 + grounding info | Provides plant count, locations, sun position as hints |
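Conceptually, each method stacks one more context block onto the previous one. The sketch below shows how such prompts could be assembled as chat messages; the `build_messages` helper and all prompt strings are illustrative placeholders, not the study's actual prompts:

```python
def build_messages(method, image, schema="", examples=(), grounding=""):
    """Assemble a chat message list whose context grows with the method number (1-5).

    All strings here are illustrative placeholders, not the study's prompts.
    """
    system = ("You are an expert in plant phenotyping. "          # role definition
              "Respond with a single JSON configuration only.")   # format restriction (Method 1)
    messages = [{"role": "system", "content": system}]

    # Methods 3-4: prior examples injected as chat history
    # (Method 3 adds reasoning + JSON; Method 4 pairs them with images).
    for ex in (examples if method >= 3 else ()):
        user = [ex["image"], ex["reasoning"]] if method >= 4 else ex["reasoning"]
        messages.append({"role": "user", "content": user})
        messages.append({"role": "assistant", "content": ex["json"]})

    parts = ["Generate the plot simulation configuration for this image."]
    if method >= 2 and schema:      # Method 2: structured schema with key definitions
        parts.append("Schema:\n" + schema)
    if method >= 5 and grounding:   # Method 5: plant count / locations / sun position hints
        parts.append("Grounding hints:\n" + grounding)
    messages.append({"role": "user", "content": [image, "\n\n".join(parts)]})
    return messages
```

The chat-history pattern for few-shot examples mirrors how the OpenAI-style message format used by Ollama and vLLM expects multi-turn context to be supplied.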
The Helios 3D simulator accepts JSON files with six top-level metadata fields:
- random_seed: Controls procedural generation randomness
- metadata: year, location, plant_type, days_after_planting
- environment: soil spectral data, sun elevation/azimuth angles
- field: plot size, bed layout, plant locations (x, y coordinates)
- plant_properties: PROSPECT model parameters (chlorophyll, carotenoids, anthocyanin, water content, dry matter, leaf structure)
- camera: sensor parameters, position, resolution
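For illustration, a minimal configuration touching all six top-level fields might look like the sketch below. Key names and values here are invented placeholders; the actual Helios schema defines the real keys:

```python
import json

# Illustrative minimal configuration covering the six top-level fields.
# All key names and values are placeholders, not the exact Helios schema.
config = {
    "random_seed": 42,
    "metadata": {"year": 2025, "location": "Davis, CA",
                 "plant_type": "cowpea", "days_after_planting": 30},
    "environment": {"soil_spectra": "soil_default",
                    "sun_elevation_deg": 55.0, "sun_azimuth_deg": 180.0},
    "field": {"plot_size_m": [1.5, 3.0], "bed_count": 1,
              "plant_locations": [[0.3, 0.5], [0.3, 1.5], [0.3, 2.5]]},
    "plant_properties": {"chlorophyll_ug_cm2": 40.0, "carotenoid_ug_cm2": 8.0,
                         "anthocyanin_ug_cm2": 1.0, "water_g_cm2": 0.012,
                         "dry_matter_g_cm2": 0.005, "leaf_structure_N": 1.5},
    "camera": {"position_m": [0.75, 1.5, 10.0],
               "resolution_px": [381, 1080]},
}
print(json.dumps(config, indent=2))
```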
JSON Integrity Metrics:
- JSON syntax error rate (parsing failures)
- JSON key-missing rate (incomplete structure)
- BLEU-4 score (similarity to ground truth)
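The two integrity rates can be sketched as follows. This is not the repo's `check_integrity_data.py`; the required-key set is assumed from the six top-level metadata fields described below:

```python
import json

# Top-level keys assumed from the Helios configuration description.
REQUIRED_KEYS = {"random_seed", "metadata", "environment",
                 "field", "plant_properties", "camera"}

def integrity_rates(outputs):
    """Return (syntax error rate over all outputs,
    key-missing rate over the outputs that parsed)."""
    syntax_errors, key_missing, parsed = 0, 0, 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            syntax_errors += 1
            continue
        parsed += 1
        if not REQUIRED_KEYS <= set(obj):  # at least one required key absent
            key_missing += 1
    n = len(outputs)
    return syntax_errors / n, (key_missing / parsed) if parsed else 0.0
```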
Geometric Evaluations:
- Days After Planting (DAP) - Mean Absolute Error (MAE)
- Plant count - MAE
- Plant locations - Chamfer Distance: d_CD(S1, S2) = d(S1, S2) + d(S2, S1)
- Sun elevation and azimuth angles - MAE
- Leaf pitch angle - MAE
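The Chamfer distance above can be sketched in a few lines, assuming the common mean nearest-neighbor convention for the one-sided term d(S1, S2); the paper's exact normalization may differ:

```python
import math

def one_sided(A, B):
    """d(A, B): mean distance from each point in A to its nearest neighbor in B."""
    return sum(min(math.dist(a, b) for b in B) for a in A) / len(A)

def chamfer(S1, S2):
    """Symmetric Chamfer distance d_CD(S1, S2) = d(S1, S2) + d(S2, S1)."""
    return one_sided(S1, S2) + one_sided(S2, S1)
```

Here `S1` and `S2` would be the predicted and ground-truth plant-location point sets (x, y coordinates in meters).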
Biophysical Evaluations:
- Chlorophyll content (μg/cm²) - MAE
- Carotenoid content (μg/cm²) - MAE
- Anthocyanin content (μg/cm²) - MAE
- Leaf water mass (g/cm²) - MAE
- Leaf dry matter (g/cm²) - MAE
- Leaf structure parameter (N) - MAE
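All of the MAE-based metrics above reduce to the same computation, applied per parameter across plots. A minimal sketch (the parameter key names are illustrative):

```python
def mae(pred, true):
    """Mean Absolute Error over paired scalar values."""
    assert len(pred) == len(true) and pred
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def per_param_mae(pred_dicts, true_dicts, keys):
    """MAE per biophysical parameter over paired predicted/reference property dicts."""
    return {k: mae([d[k] for d in pred_dicts],
                   [d[k] for d in true_dicts]) for k in keys}
```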
- Qwen3-VL: 4B, 8B, 30B parameters (released December 2025)
- Qwen3-VL Fine-tuned: 32B with LoRA (r=16, α=16, 141.9M trainable params, 0.65% of model capacity)
- Training: 1,788 synthetic images, 3 epochs, batch size 64, 4× NVIDIA A100 GPUs (~3 hours)
- Framework: Unsloth for accelerated training and reduced memory overhead
- Gemma 3: 4B, 12B, 27B parameters (released March 2025)
- Hosting: Self-hosted Ollama server with 32K context window
```
Image2PlantArchitecture_v2/
├── scripts/
│   ├── ollama_batch_inference.py           # Prompt construction and tool setup for Ollama inference
│   ├── vllm_batch_inference.py             # Asynchronous high-throughput inference script
│   ├── finetune_qwen3_vl_unsloth.py        # Fine-tuning the Qwen3-VL 32B model with Unsloth
│   └── render_real_test_image.py           # Rendering comparisons
│
├── notebooks/
│   ├── calculate_evaluation_metrics.py     # Comprehensive evaluation metrics pipeline
│   ├── calculate_evaluation_metrics.ipynb  # Notebook version for interactive testing
│   └── check_integrity_data.py             # JSON output syntax/key analysis
│
├── data/
│   └── raw/2025_Davis/                     # Dataset directory (HELIOS synthetic & real)
│
├── unsloth_env.yml                         # Conda environment for training
├── vllm_environment.yml                    # Conda environment for inference
├── requirements.txt
└── README.md
```
On the synthetic cowpea dataset with Method 5 (grounding information):
- JSON Integrity: <6.6% syntax error, <8.5% key-missing rate
- DAP Estimation: Fine-tuned Qwen3-VL 32B achieved lowest MAE
- Plant Count: Qwen3-VL 4B outperformed Gemma3 27B in some cases
- Plant Localization: Qwen3-VL models showed significantly lower Chamfer distances than Gemma3
- Grounding Info Impact: Reduced errors across all metrics by providing plant count/location hints
- Model Size Effect: Non-linear trends - larger models don't always guarantee better performance
- Higher Error Rates: Real images showed 4-5× higher syntax errors compared to synthetic
- DAP MAE: Up to 4.7 days error on real images (vs. lower on synthetic)
- Plant Count: Higher MAE (~5.3 plants) on real images
- Plant Locations: Surprisingly lower MAE (~0.1m) on real images
- Fine-Tuning Benefit: Reduced JSON errors and improved plant count estimation
Testing without target images revealed:
- Models rely heavily on contextual priors when visual cues are weak
- Blind baseline sometimes achieved lower errors than image-based inference (when close to mean-guess)
- Indicates contextual bias - models default to few-shot example distributions
- Adding images can introduce noise when models fail to extract reliable visual signals
For model finetuning, an environment with Unsloth and PyTorch is required:

```bash
conda env create -f unsloth_env.yml
conda activate unsloth
```

For high-throughput inference, a separate environment configured for vLLM is recommended:
```bash
conda env create -f vllm_environment.yml
# Or manually install vLLM
pip install vllm
```

Finetune Qwen3-VL models using the optimized Unsloth pipeline across single or multiple GPUs (via DDP):
```bash
# Edit configurations (epochs, model_id) inside the script
sbatch scripts/finetune_qwen3_vl_unsloth.sh
```

Run inference on a dataset using the highly optimized vLLM server:
Terminal 1 (Start vLLM Server):

```bash
export VLLM_V1_ENABLED=0
mamba run -n vllm-runtime python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-VL-8B-Instruct --port 8000 \
    --limit-mm-per-prompt '{"image": 4}' --max-model-len 32768
```

Terminal 2 (Run Inference Script):
```bash
mamba run -n digital-crops python scripts/vllm_batch_inference.py \
    --model Qwen/Qwen3-VL-8B-Instruct --vllm-url http://localhost:8000/v1 \
    --dataset-dir data/raw/2025_Davis/HELIOS_20260215 \
    --start 0 --end 100 --concurrency 32 --method 5
```

Calculate performance metrics (JSON Integrity + Biophysical Accuracy) on the output directories generated by inference:
```bash
# Run integrity checks
python notebooks/check_integrity_data.py
# Full evaluation pipeline
python notebooks/calculate_evaluation_metrics.py
```

Generate comparison grids combining Ground Truth images alongside method predictions:
```bash
python scripts/render_real_test_image.py
python test_synthetic_vs_real_plot.py
```

This work builds on recent advances in agricultural vision-language models:
- AgEval (Arshad et al., 2025): Plant disease identification and quantification
- AgroBench (Shinoda et al.): Expert-annotated multiple-choice QA benchmark for 7 agricultural tasks
- AgriGPT-VL (Yang et al., 2025): Trained on Agri-3M-VL dataset, evaluated on AgriBench-VL-4K
- AgroGPT (Awais et al., 2025): Multi-turn multimodal dialogue for agricultural concepts
Our work is the first to apply VLMs to 3D plant simulation configuration generation for digital twins.
- Helios 3D: 3D plant and environmental biophysical modeling framework (Bailey, 2019)
- Unsloth: Fast and memory-efficient LLM/VLM fine-tuning
- vLLM: High-throughput and memory-efficient LLM inference engine
- Qwen3-VL: State-of-the-art open-source vision-language models (2B-235B parameters)
- Ollama: Self-hosted LLM serving platform
- OpenDroneMap: Drone imagery processing
- QGIS: Plot boundary annotation
- Structured Decoding: Grammar-constrained decoding (Geng et al., 2023), XGrammar (Dong et al., 2024)
- Document Generation: Pix2Struct (Lee et al., 2022), DePlot (Liu et al., 2023), MatCha (Liu et al., 2023)
- Format Restriction Impact: Tam et al. (2024) showed structured output can harm complex reasoning tasks
- Contextual Bias: Models often rely on few-shot example distributions rather than visual cues when signals are ambiguous
- Non-Linear Scaling: Larger models don't always outperform smaller ones - context quality matters more than model size
- Grounding Information: Providing basic spatial hints (plant count, locations) dramatically improves all metrics
- Sim-to-Real Gap: Synthetic training doesn't fully transfer to real images; fine-tuning helps but gap remains significant
- Biophysical parameters (chlorophyll, carotenoids) show high error rates
- Models struggle with late-stage growth (70-90 DAP) due to canopy occlusion
- JSON integrity issues persist (~6-8% error rates even with best methods)
- Real-world performance lags synthetic benchmarks
- Extended Context: Test 128K token windows with 50+ few-shot examples
- Specialized Fine-Tuning: Train on 10K+ diverse synthetic images
- Hybrid Approaches: Combine computer vision preprocessing with VLM reasoning
- Leaf Color References: Provide pigment-indexed color calibration charts
- Multi-Species: Extend to wheat, maize, soybean, and other crops
- Domain Adaptation: Bridge sim-to-real gap with adversarial training
- Structured Decoding: Force valid JSON schemas during generation
- Temporal Modeling: Multi-temporal predictions from time-series imagery
We welcome contributions! Areas for improvement include:
- Extending to other crop species (wheat, soybean, maize, etc.)
- Improving biophysical parameter estimation (chlorophyll, carotenoids)
- Reducing sim-to-real gap through domain adaptation
- Testing larger context windows (128K tokens) with more few-shot examples
- Incorporating structured decoding methods
- Multi-temporal analysis (time-series predictions)
- Integration with other FSPMs (e.g., OpenAlea, GroIMP)
Please open an issue or pull request if you'd like to contribute!
Authors:
- Heesup Yun - hspyun@ucdavis.edu
- Isaac Kazuo Uyehara - ikuyehara@ucdavis.edu
- Earl Ranario - ewranario@ucdavis.edu
- Lars Lundqvist - llund@ucdavis.edu
- Christine H. Diepenbrock - chdiepenbrock@ucdavis.edu
- Brian N. Bailey - bnbailey@ucdavis.edu
- J. Mason Earles - jmearles@ucdavis.edu
Institution: University of California, Davis
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
This research leverages the outstanding open-source contributions of the Helios, Unsloth, vLLM, and Qwen teams. We thank the Gates Foundation for research support.
Last Updated: March 2026