Unity project integrating Microsoft Florence-2 (Vision-Language Model) via NVIDIA’s AI API, with an end-to-end controller and UI to run image understanding tasks in XR.
- Florence-2 is a multi-task vision-language model by Microsoft that supports captioning, detection, OCR, segmentation, region descriptions, and more using a single, tag-driven prompt format.
- This project calls Florence-2 through NVIDIA’s hosted endpoint and parses the response to draw 2D bounding boxes or spawn 3D anchors in the scene.
- Scene: `Assets/XR-AI-Florence2/Scenes/XR-AI-Florence2.unity`
- Controller: `Assets/XR-AI-Florence2/Scripts/Florence2Controller.cs`
- API Config asset class: `Assets/XR-AI-Florence2/Scripts/ApiConfig.cs`
- Tasks enumerated in `Florence2Task`:
  - Caption, DetailedCaption, MoreDetailedCaption
  - ObjectDetection
  - DenseRegionCaption, RegionProposal
  - CaptionToPhraseGrounding, OpenVocabularyDetection
  - ReferringExpressionSegmentation, RegionToSegmentation
  - RegionToCategory, RegionToDescription
  - OCR, OCRWithRegion
- Visuals currently implemented for Object Detection: draws 2D UI boxes and/or places 3D labels per detection.
- Other tasks return text/entities; basic display is included in `resultText`, with room to extend visuals if desired.
- Unity 6 LTS recommended.
- Meta XR Core SDK and MRUK packages (or the Meta XR All-in-One SDK).
- NVIDIA API key with access to the Florence-2 endpoint.
- URL used by the controller: `https://ai.api.nvidia.com/v1/vlm/microsoft/florence-2`
- Auth: Bearer token in the `Authorization` header.
- Content-Type: `application/json`
- Accept: `application/zip` (the response is a ZIP containing a `.response` JSON file and optionally `overlay.png`); a request sketch follows.
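A minimal sketch of how these headers fit together in a `UnityWebRequest`-based call; the names `Post`, `jsonBody`, and `OnZipReceived` are illustrative, not the controller's actual API:

```csharp
using System.Collections;
using UnityEngine;
using UnityEngine.Networking;

public class Florence2RequestSketch : MonoBehaviour
{
    const string Url = "https://ai.api.nvidia.com/v1/vlm/microsoft/florence-2";

    // Posts the JSON body and receives the ZIP archive as raw bytes.
    IEnumerator Post(string apiKey, string jsonBody)
    {
        using var request = new UnityWebRequest(Url, UnityWebRequest.kHttpVerbPOST);
        request.uploadHandler = new UploadHandlerRaw(System.Text.Encoding.UTF8.GetBytes(jsonBody));
        request.downloadHandler = new DownloadHandlerBuffer();
        request.SetRequestHeader("Authorization", $"Bearer {apiKey}");
        request.SetRequestHeader("Content-Type", "application/json");
        request.SetRequestHeader("Accept", "application/zip"); // server replies with a ZIP

        yield return request.SendWebRequest();

        if (request.result != UnityWebRequest.Result.Success)
            Debug.LogError($"{request.responseCode}: {request.error}");
        else
            OnZipReceived(request.downloadHandler.data);        // raw ZIP bytes
    }

    void OnZipReceived(byte[] zipBytes) { /* unzip + deserialize; see Request/Response below */ }
}
```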
- Get an NVIDIA API Key
  - Obtain a key from NVIDIA's AI API portal and ensure access to the Florence-2 VLM endpoint: https://build.nvidia.com/
- Create an API Config asset
  - In the Project window, go to the `XR-AI-Florence2/Data` folder, right-click, and choose Create → API → API Configuration.
  - Name it `ApiConfig.asset` so it is properly gitignored, keeping your API key safe.
  - Paste your API key into the `apiKey` field.
- Open the sample scene
  - `Assets/XR-AI-Florence2/Scenes/XR-AI-Florence2.unity`
- Assign the `Florence2Controller` fields:
  - `Api Configuration`: assign the ScriptableObject you created.
  - `Anchor Mode` (optional): BoundingBox2D, SpatialAnchor3D, or Both.
  - Other fields that are already assigned:
    - `Source Texture` (RawImage): the image to analyze; by default it is assigned to a RawImage fed by the Passthrough Camera of the Quest 3.
    - `Task`: choose a task from the dropdown. For now, the only fully implemented task is Object Detection.
    - `Region Of Interest`: used by region-based tasks. Coordinates are normalized (0–1) as a Rect (x, y, width, height).
  - UI:
    - `Result Text` (TMP_Text): summary and counts.
    - `Result Image` (RawImage): where the overlay or source image is shown.
    - `Bounding Box Container` (RectTransform): parent for box UI.
    - `Bounding Box Prefab`: prefab containing a root RectTransform and a TextMeshProUGUI child for the label.
    - `Status Text` (TMP_Text): request status and errors.
    - `Loading Icon` (GameObject): optional spinner shown during requests.
- Run a request / Build to device
  - In Play Mode, click the `SendRequest()` button shown in the Inspector (NaughtyAttributes adds the button to the component). In the Editor, only the "Bounding Box 2D" Anchor Mode will work.
  - Or call it via script if you have a reference: `controller.SendRequest();`
  - To test the "Spatial Label 3D" anchor mode (the one shown in the video above), you must build the scene to your Quest 3 device.
- Image encoding
  - `EncodeTextureToJPG(Texture)` converts `sourceTexture.texture` into JPEG bytes and base64-embeds it in an HTML `<img src="data:image/jpeg;base64,..." />` tag; a sketch follows.
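A minimal sketch of that step, assuming the source is a readable `Texture2D` (a RenderTexture, such as a passthrough feed, would first need to be copied into one); `BuildImageTag` is an illustrative name:

```csharp
using System;
using UnityEngine;

static class ImageEncodingSketch
{
    // Encodes a readable Texture2D as JPEG, then base64, and wraps it in the
    // HTML <img> fragment embedded in the prompt text.
    public static string BuildImageTag(Texture2D source)
    {
        byte[] jpg = ImageConversion.EncodeToJPG(source); // raw JPEG bytes
        string base64 = Convert.ToBase64String(jpg);      // base64 payload
        return $"<img src=\"data:image/jpeg;base64,{base64}\" />";
    }
}
```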
- Prompt construction
  - `Florence2Task` maps to Florence-2 tags, e.g. `<OD>` for Object Detection, `<CAPTION>`, etc.
  - For text-conditional tasks (e.g., `<CAPTION_TO_PHRASE_GROUNDING>`, `<OPEN_VOCABULARY_DETECTION>`), your `Text Prompt` is appended after the tag, as sketched below.
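A hedged sketch of the mapping, using the `Florence2Task` values listed earlier and the tag names from Florence-2's model card; exactly which tasks the controller treats as text-conditional is an assumption beyond the two examples above:

```csharp
// Mirrors the Florence2Task values listed above.
enum Florence2Task
{
    Caption, DetailedCaption, MoreDetailedCaption,
    ObjectDetection, DenseRegionCaption, RegionProposal,
    CaptionToPhraseGrounding, OpenVocabularyDetection,
    ReferringExpressionSegmentation, RegionToSegmentation,
    RegionToCategory, RegionToDescription,
    OCR, OCRWithRegion,
}

static class PromptSketch
{
    public static string Build(Florence2Task task, string textPrompt = "")
    {
        string tag = task switch
        {
            Florence2Task.Caption                  => "<CAPTION>",
            Florence2Task.DetailedCaption          => "<DETAILED_CAPTION>",
            Florence2Task.MoreDetailedCaption      => "<MORE_DETAILED_CAPTION>",
            Florence2Task.ObjectDetection          => "<OD>",
            Florence2Task.DenseRegionCaption       => "<DENSE_REGION_CAPTION>",
            Florence2Task.RegionProposal           => "<REGION_PROPOSAL>",
            Florence2Task.CaptionToPhraseGrounding => "<CAPTION_TO_PHRASE_GROUNDING>",
            Florence2Task.OpenVocabularyDetection  => "<OPEN_VOCABULARY_DETECTION>",
            Florence2Task.OCR                      => "<OCR>",
            Florence2Task.OCRWithRegion            => "<OCR_WITH_REGION>",
            _                                      => "<OD>",
        };

        // Text-conditional tasks carry the user's Text Prompt after the tag.
        bool textConditional = task == Florence2Task.CaptionToPhraseGrounding
                            || task == Florence2Task.OpenVocabularyDetection
                            || task == Florence2Task.ReferringExpressionSegmentation;
        return textConditional ? tag + textPrompt : tag;
    }
}
```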
- Request/Response
  - HTTP POST to the NVIDIA endpoint with `Authorization: Bearer <apiKey>` and `Accept: application/zip`.
  - The ZIP contains a `*.response` JSON file and possibly `overlay.png`.
  - The JSON is deserialized into `Florence2Response` → `Choices[0].Message.Entities` with `bboxes` and `labels` (when applicable). A parsing sketch follows.
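A sketch of unpacking that ZIP with `System.IO.Compression`. The serializable stand-in types only mirror the field names mentioned in this README (the project's real `Florence2Response` classes may differ), and Newtonsoft Json (`com.unity.nuget.newtonsoft-json`) is assumed because `JsonUtility` cannot parse jagged arrays like `bboxes`:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using Newtonsoft.Json;
using UnityEngine;

// Stand-ins guessed from the field names above; the project's own types may differ.
[Serializable] public class Florence2Response { public Choice[] Choices; }
[Serializable] public class Choice { public Message Message; }
[Serializable] public class Message { public string Content; public Entities Entities; }
[Serializable] public class Entities { public float[][] bboxes; public string[] labels; }

static class ZipResponseSketch
{
    public static Florence2Response Parse(byte[] zipBytes, out Texture2D overlay)
    {
        overlay = null;
        Florence2Response response = null;
        using var archive = new ZipArchive(new MemoryStream(zipBytes), ZipArchiveMode.Read);
        foreach (var entry in archive.Entries)
        {
            if (entry.Name.EndsWith(".response"))            // the JSON payload
            {
                using var reader = new StreamReader(entry.Open());
                response = JsonConvert.DeserializeObject<Florence2Response>(reader.ReadToEnd());
            }
            else if (entry.Name == "overlay.png")            // optional rendered overlay
            {
                using var ms = new MemoryStream();
                entry.Open().CopyTo(ms);
                overlay = new Texture2D(2, 2);
                overlay.LoadImage(ms.ToArray());             // decode PNG bytes
            }
        }
        return response;
    }
}
```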
- Visuals
  - 2D: converts Florence coordinates `[x1, y1, x2, y2]` to width/height and spawns the bounding box prefab under `Bounding Box Container`, scaled to the `Result Image` size (see the sketch after this list).
  - 3D: projects the box center to a world-space ray and uses `EnvironmentRaycastManager.Raycast` to place an anchor prefab at the hit point, labeled with the detection class.
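A sketch of the 2D path under stated assumptions: the box arrives in source-image pixel coordinates, the container matches the `Result Image` in size, and the prefab is positioned by its top-left corner; `Draw` is an illustrative name:

```csharp
using TMPro;
using UnityEngine;

static class BoundingBoxSketch
{
    // box = [x1, y1, x2, y2] in source-image pixels.
    public static void Draw(float[] box, string label, Texture sourceTex,
                            RectTransform container, RectTransform boxPrefab)
    {
        // Scale factors from image pixels to UI units.
        float sx = container.rect.width  / sourceTex.width;
        float sy = container.rect.height / sourceTex.height;

        RectTransform rt = Object.Instantiate(boxPrefab, container);
        rt.anchorMin = rt.anchorMax = new Vector2(0f, 1f);  // anchor to the container's top-left
        rt.pivot = new Vector2(0f, 1f);                     // position the box by its own top-left
        rt.sizeDelta = new Vector2((box[2] - box[0]) * sx, (box[3] - box[1]) * sy);
        rt.anchoredPosition = new Vector2(box[0] * sx, -box[1] * sy); // UI y grows downward
        rt.GetComponentInChildren<TextMeshProUGUI>().text = label;    // label child from the prefab
    }
}
```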
- Because requests are network-bound, latency can cause pose drift relative to the original capture. If you move, the raycast from the detected 2D box center may no longer intersect the same real-world surface.
- Tips:
- Prefer testing while stationary, or on a tripod/stand when possible.
- Segmentation: use `overlay.png` (if returned) or the `Entities` segmentation data to render masks or outlines.
- OCR: display `Message.Content`/entities in the UI and draw text regions.
- Region tasks: use `regionOfInterest` in prompts and visualize per-task outputs; a region-token sketch follows.
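For region-based tasks, Florence-2's Hugging Face model card encodes regions as `<loc_*>` tokens quantized to a 0–999 grid; whether NVIDIA's endpoint accepts the same encoding is an assumption worth verifying. A sketch converting the controller's normalized `regionOfInterest` Rect:

```csharp
using UnityEngine;

static class RegionPromptSketch
{
    // rect follows the controller's normalized (0-1) Rect(x, y, width, height) convention.
    public static string ToLocTokens(Rect rect)
    {
        int Quantize(float v) => Mathf.Clamp(Mathf.RoundToInt(v * 999f), 0, 999);
        return $"<loc_{Quantize(rect.xMin)}><loc_{Quantize(rect.yMin)}>" +
               $"<loc_{Quantize(rect.xMax)}><loc_{Quantize(rect.yMax)}>";
    }
}

// e.g. "<REGION_TO_DESCRIPTION>" + RegionPromptSketch.ToLocTokens(regionOfInterest)
```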
- "API Key or Source Image is missing": Ensure the ApiConfig asset is assigned and
sourceTexture.textureis valid. - HTTP 4xx with error JSON in Console: Verify your key, model access, and request payload format.
- No boxes drawn:
  - Make sure `Result Image` has a texture with correct dimensions; scaling uses `resultImage.texture.width/height`.
  - Confirm `Bounding Box Container` and `Bounding Box Prefab` are assigned.
- 3D anchors not appearing: ensure an `EnvironmentRaycastManager` is in the scene and `spatialAnchorPrefab` is set. Also confirm passthrough/camera utilities are available.
- Do not commit your API key. Keep the `ApiConfig` asset out of version control or remove the key before committing. The project's `.gitignore` already excludes `/Assets/XR-AI-Florence2/Data/ApiConfig.asset`, as shown below.
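For reference, the relevant ignore entry looks like this:

```gitignore
# keeps the API key asset out of commits
/Assets/XR-AI-Florence2/Data/ApiConfig.asset
```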
- Microsoft Florence-2: https://huggingface.co/microsoft/Florence-2-large
- NVIDIA AI API (VLM Florence-2): https://build.nvidia.com/
MIT – Free to use, modify and learn from.
