Unity-MetaXR-AI-ZeroShot

Unity project integrating Zero-Shot Object Detection models like Grounding DINO and Florence-2 (Deprecated) via NVIDIA’s AI API, with an end-to-end controller and UI to run image understanding tasks in XR.

Important

We recommend using Grounding DINO as Florence-2 has been temporarily removed from the NVIDIA build platform.

🔎 Overview

This project calls Zero-Shot models through NVIDIA’s hosted endpoint and parses the response to draw 2D bounding boxes or spawn 3D anchors in the scene.
Grounding DINO: A powerful open-vocabulary object detector that can identify a wide range of objects based on text prompts.
[Deprecated] Florence-2: A multi-task vision-language model by Microsoft that supports captioning, detection, OCR, and more. (Currently unavailable on NVIDIA's platform).

📁 Key Paths

Scene: Assets/XR-AI-ZeroShot/Scenes/XR-AI-ZeroShot.unity
Controllers:
- Assets/XR-AI-ZeroShot/Scripts/GroundingDinoController.cs
- [Deprecated] Assets/XR-AI-ZeroShot/Scripts/Florence2Controller.cs
API Config asset class: Assets/XR-AI-ZeroShot/Scripts/ApiConfig.cs

✅ What’s Implemented

Grounding DINO:
- OpenVocabularyDetection: Detects objects based on a text prompt.
[Deprecated] Florence-2 Tasks (enumerated in Florence2Task):
- Caption, DetailedCaption, MoreDetailedCaption
- ObjectDetection
- DenseRegionCaption, RegionProposal
- CaptionToPhraseGrounding, OpenVocabularyDetection
- ReferringExpressionSegmentation, RegionToSegmentation
- RegionToCategory, RegionToDescription
- OCR, OCRWithRegion
Visuals currently implemented for Object Detection: draws 2D UI boxes and/or places 3D labels per detection.
Other tasks return text/entities; basic display is included in resultText, with room to extend visuals if desired.

⚙️ Requirements

Unity 6 LTS recommended.
Meta XR Core and MRUK packages (or the Meta XR All-in-One package, which includes both).
NVIDIA API key with access to the desired model endpoint (Grounding DINO or Florence-2).

☁️ NVIDIA Endpoint

Grounding DINO URL: https://ai.api.nvidia.com/v1/vlm/grounding-dino
[Deprecated] Florence-2 URL: https://ai.api.nvidia.com/v1/vlm/microsoft/florence-2
Auth: Bearer token in Authorization header.
Content-Type: application/json
Accept: application/json for Grounding DINO, application/zip for Florence-2.
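
As an illustration of the headers listed above, here is a minimal sketch that builds (but does not send) a request using plain .NET `HttpRequestMessage`. It is not the project's controller code, which may well use Unity's own networking classes instead. The class name `NvidiaRequestSketch`, the method `Build`, and the `jsonBody` parameter are hypothetical; the payload schema is not described in this README, so it is passed in as an opaque string.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

// Hypothetical helper, for illustration only.
public static class NvidiaRequestSketch
{
    // URL as listed in this README.
    public const string GroundingDinoUrl = "https://ai.api.nvidia.com/v1/vlm/grounding-dino";

    // Builds a POST request with the Authorization, Accept, and Content-Type headers.
    // jsonBody is assumed to already match whatever schema the endpoint expects.
    public static HttpRequestMessage Build(
        string url, string apiKey, string jsonBody, string accept = "application/json")
    {
        if (string.IsNullOrWhiteSpace(apiKey))
            throw new ArgumentException("API key is missing.", nameof(apiKey));

        var request = new HttpRequestMessage(HttpMethod.Post, url);

        // Authorization: Bearer <apiKey>
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", apiKey);

        // Accept: application/json (Grounding DINO) or application/zip (Florence-2)
        request.Headers.Accept.Add(new MediaTypeWithQualityHeaderValue(accept));

        // Content-Type: application/json (StringContent also appends charset=utf-8)
        request.Content = new StringContent(jsonBody, Encoding.UTF8, "application/json");
        return request;
    }
}
```

For example, `NvidiaRequestSketch.Build(NvidiaRequestSketch.GroundingDinoUrl, "YOUR_API_KEY", "{}")` yields a request that could be handed to an `HttpClient`. For the deprecated Florence-2 endpoint, pass `accept: "application/zip"`.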

⚡ Setup: 5 Minutes

Get an NVIDIA API Key
- Obtain a key from NVIDIA’s AI API portal and ensure you have access to the model you want to use. https://build.nvidia.com/
Create an API Config asset
- In the Project window, go to the XR-AI-ZeroShot/Data folder, right-click, and choose Create → API → API Configuration.
- Name it ApiConfig.asset so that it is ignored by version control, which keeps your API key safe.
- Paste your API key into the apiKey field.
Open the sample scene
- Assets/XR-AI-ZeroShot/Scenes/XR-AI-ZeroShot.unity.
Assign the Controller fields:
- Select the GroundingDinoController or [Deprecated] Florence2Controller in the scene hierarchy.
- Api Configuration: assign the ScriptableObject you created.
- Optional
  - Anchor Mode: BoundingBox2D, SpatialAnchor3D, or Both.

Other field descriptions that are already assigned:

Source Texture (RawImage): the image to analyze. By default it is assigned to a RawImage that is fed by the Passthrough Camera of the Quest 3.
Task (Florence-2 only): choose a task from the dropdown.
Text Prompt (Grounding DINO): Enter the objects you want to detect, separated by commas (e.g., "a cat, a dog, the tallest person").
Region Of Interest: used by region-based tasks. Coordinates are normalized (0–1) as a Rect (x, y, width, height).
UI
- Result Text (TMP_Text): summary and counts.
- Result Image (RawImage): where overlay or source is shown.
- Bounding Box Container (RectTransform): parent for box UI.
- Bounding Box Prefab: prefab containing a root RectTransform and a TextMeshProUGUI child for the label.
- Status Text (TMP_Text): request status and errors.
- Loading Icon (GameObject): optional spinner shown during requests.
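
To make the normalized Region Of Interest convention concrete, the sketch below converts a normalized (x, y, width, height) rectangle into pixel units for a given image size. It is plain .NET and only illustrative: `RegionOfInterestSketch` and `ToPixels` are hypothetical names, and whether the controller performs this conversion at all, or which corner it treats as the origin, is not stated in this README.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class RegionOfInterestSketch
{
    // Converts a normalized (0-1) region into pixel units for an image of the given size.
    // Both edges are clamped to [0, 1] first, so the result never extends past the image.
    public static (int x, int y, int width, int height) ToPixels(
        float nx, float ny, float nWidth, float nHeight, int imageWidth, int imageHeight)
    {
        float x0 = Math.Clamp(nx, 0f, 1f);
        float y0 = Math.Clamp(ny, 0f, 1f);
        float x1 = Math.Clamp(nx + nWidth, 0f, 1f);
        float y1 = Math.Clamp(ny + nHeight, 0f, 1f);

        int px = (int)MathF.Round(x0 * imageWidth);
        int py = (int)MathF.Round(y0 * imageHeight);

        // Width and height are derived from the rounded far edges to avoid off-by-one drift.
        int pw = (int)MathF.Round(x1 * imageWidth) - px;
        int ph = (int)MathF.Round(y1 * imageHeight) - py;
        return (px, py, pw, ph);
    }
}
```

With a 1280×960 image, `ToPixels(0.25f, 0.5f, 0.5f, 0.25f, 1280, 960)` returns `(320, 480, 640, 240)`.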

Run a request / Build to device
- In Play Mode, click the SendRequest() button shown in the Inspector (NaughtyAttributes adds the button to the component). In the editor, only the BoundingBox2D anchor mode works.
- Or call it from a script if you have a reference to the controller: controller.SendRequest();
- To test the SpatialAnchor3D anchor mode (the one shown in the video above), you must build the scene to your Quest 3 device.

🛠️ How It Works (Under the Hood)

Image encoding
- EncodeTextureToJPG(Texture) converts the sourceTexture.texture into JPEG bytes and base64-embeds it.
Prompt construction
- Grounding DINO: The Text Prompt is sent directly. The model excels at open-vocabulary detection, allowing for descriptive and flexible prompts. You can specify multiple items to detect by separating them with commas (e.g., "car, bike, person"). It also understands relative descriptions, such as "the tallest cat" or "the person on the left."
- [Deprecated] Florence-2: Florence2Task maps to Florence-2 tags, e.g. <OD> for Object Detection. For text-conditional tasks, your Text Prompt is appended after the tag.
Request/Response
- HTTP POST to the corresponding NVIDIA endpoint with Authorization: Bearer <apiKey>.
- Grounding DINO: The response is a JSON object with bboxes and labels.
- [Deprecated] Florence-2: The response is a ZIP containing *.response JSON and possibly overlay.png. The JSON is deserialized into Florence2Response → Choices[0].Message.Entities.
Visuals
- 2D: Converts each detection's box coordinates into a width and height, spawns the bounding box prefab under BoundingBoxContainer, and scales it to the Result Image size.
- 3D: Projects box center to a world-space ray and uses EnvironmentRaycastManager.Raycast to place an anchor prefab at the hit point, labeled with the detection class.
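
The 2D step above can be sketched as a small coordinate mapping. This is a hedged illustration in plain .NET, not the controller's actual code: it assumes the model returns each box as corner coordinates (x1, y1, x2, y2) in source-image pixels, which this README does not confirm, and the names `BoxScalingSketch` and `ToDisplay` are hypothetical.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class BoxScalingSketch
{
    // Maps a box given as corners (x1, y1, x2, y2) in source-image pixels (assumed format)
    // to a position and size inside a display area of a different resolution.
    public static (float x, float y, float width, float height) ToDisplay(
        float x1, float y1, float x2, float y2,
        int imageWidth, int imageHeight,
        float displayWidth, float displayHeight)
    {
        // A texture with zero width or height would make the scale factors meaningless.
        if (imageWidth <= 0 || imageHeight <= 0)
            throw new ArgumentException("Image dimensions must be positive.");

        float scaleX = displayWidth / imageWidth;
        float scaleY = displayHeight / imageHeight;

        // Tolerate corners given in either order.
        float left = Math.Min(x1, x2);
        float top = Math.Min(y1, y2);
        float width = Math.Abs(x2 - x1);
        float height = Math.Abs(y2 - y1);

        return (left * scaleX, top * scaleY, width * scaleX, height * scaleY);
    }
}
```

For a 1280×960 source shown in a 640×480 area, `ToDisplay(100, 50, 300, 250, 1280, 960, 640, 480)` returns `(50, 25, 100, 100)`. Depending on how the box prefab's RectTransform is anchored, the y value may also need to be flipped, because image coordinates usually grow downward while Unity UI coordinates grow upward.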

⚠️ Limitations

Because requests are network-bound, latency can cause pose drift relative to the original capture. If you move, the raycast from the detected 2D box center may no longer intersect the same real-world surface.
Tips:
- Prefer testing while stationary, or on a tripod/stand when possible.

🧯 Troubleshooting

"API Key or Source Image is missing": Ensure the ApiConfig asset is assigned and sourceTexture.texture is valid.
HTTP 4xx with error JSON in Console: Verify your key, model access, and request payload format.
No boxes drawn:
- Make sure Result Image has a texture with correct dimensions; scaling uses resultImage.texture.width/height.
- Confirm Bounding Box Container and Bounding Box Prefab are assigned.
3D anchors not appearing: Ensure an EnvironmentRaycastManager is in the scene and spatialAnchorPrefab is set. Also confirm that the passthrough/camera utilities are available.

🔐 Security

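Do not commit your API key. Keep the ApiConfig asset out of version control or remove the key before committing. The project's .gitignore excludes /Assets/XR-AI-Florence2/Data/ApiConfig.asset; confirm that this path matches the location of your asset, since the setup steps create it under the XR-AI-ZeroShot/Data folder.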

📄 License

MIT – Free to use, modify and learn from.
