
Commit 4797086

Demo update + single inference instance retrieval
- Refactoring change + single inference args fix
- Demo cleanup
- Adding single inference instance retrieval
- README update
1 parent 35f9c5b commit 4797086

File tree: 9 files changed, +670 −178 lines changed


DATA.md

Lines changed: 2 additions & 2 deletions
@@ -46,8 +46,8 @@ File structure below:
 ## :wrench: Data Preprocessing
 To speed up training and inference, we preprocess the 1D (referral), 2D (RGB + floorplan), and 3D (point cloud + CAD) data for both object instances and scenes. Note that since the 3RScan dataset does not provide frame-wise RGB segmentations, we project the 3D data to 2D and store it in `.npz` format for every scan; we provide the scripts for this projection. Here's an overview of which data features are precomputed:
 
-- Object Instance: Referral, Multi-view RGB images, Point Cloud & CAD (only for ScanNet)
-- Scene: Referral, Multi-view RGB images, Floorplan (only for ScanNet) Point Cloud
+- Object Instance: Referral, Multi-view RGB images, Point Cloud, & CAD (only for ScanNet)
+- Scene: Referral, Multi-view RGB images, Floorplan (only for ScanNet), & Point Cloud
 
 We provide the preprocessing scripts, which should be easily customizable for new datasets. Further instructions below.

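As a quick sanity check after preprocessing, one can list the arrays stored in a per-scan `.npz` archive. The file path below is a placeholder (the actual naming and layout come from the preprocessing scripts), so treat this as a sketch rather than a documented command:

```bash
# Hypothetical path; prints the array keys stored for one projected scan.
$ python -c "import numpy as np; d = np.load('path/to/<scan_id>.npz', allow_pickle=True); print(list(d.keys()))"
```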
TRAIN.md

Lines changed: 28 additions & 3 deletions
@@ -55,7 +55,31 @@ We provide all available checkpoints on huggingface 👉 [here](https://huggingf
 
 
 # :shield: Single Inference
-We release script to perform inference (generate scene-level embeddings) on a single scan of 3RScan/Scannet. Detailed usage in the file. Quick instructions below:
+
+## Instance Inference
+We provide a script to perform instance-level cross-modal retrieval on a single scan; it reports retrieval metrics and the matched objects within the scene across all available modality pairs. Detailed usage in the file. Quick instructions below:
+
+```bash
+$ python single_inference/instance_inference.py
+```
+
+Various configurable parameters:
+
+- `--dataset`: Dataset name - Options: `scannet`, `scan3r`, `arkitscenes`, `multiscan`
+- `--process_dir`: Path to the processed features directory containing the preprocessed object data
+- `--ckpt`: Path to the pre-trained instance crossover model checkpoint (details [here](TRAIN.md#checkpoint-inventory)), example path: `./checkpoints/instance_crossover_scannet+scan3r+multiscan+arkitscenes.pth`
+- `--scan_id`: Scan ID to run inference on (e.g., `scene_00004_00`)
+- `--modalities`: List of modalities to use (default: `['rgb', 'point', 'cad', 'referral']`)
+- `--input_dim_3d`: Input dimension for 3D features (default: 384)
+- `--input_dim_2d`: Input dimension for 2D features (default: 1536)
+- `--input_dim_1d`: Input dimension for 1D features (default: 768)
+- `--out_dim`: Output embedding dimension (default: 768)
+
+
+> **Note**: This script requires preprocessed object data for the target scene, namely the `objectsDataMultimodal.npz` files generated during data preprocessing as described in [DATA.md](DATA.md/#wrench-data-preprocessing). The scan must have valid object instances across the specified modalities.
+
+## Scene Inference
+We release a script to perform inference (generate scene-level embeddings) on a single scan of any supported dataset. Detailed usage in the file. Quick instructions below:
 
 ```bash
 $ python single_inference/scene_inference.py
@@ -65,12 +89,13 @@ Various configurable parameters:
 
 - `--dataset`: dataset name, Scannet/Scan3R
 - `--data_dir`: data directory (e.g., `./datasets/Scannet`; assumes a similar structure as in `preprocess.md`)
-- `--floorplan_dir`: directory consisting of the rasterized floorplans (this can point to the downloaded preprocessed directory), only for Scannet
-- `--ckpt`: Path to the pre-trained scene crossover model checkpoint (details [here](TRAIN.md#checkpoint-inventory)), example_path: `./checkpoints/scene_crossover_scannet+scan3r.pth/`).
+- `--process_dir`: preprocessed data directory (this can point to the downloaded preprocessed directory)
+- `--ckpt`: Path to the pre-trained scene crossover model checkpoint (details [here](TRAIN.md#checkpoint-inventory)), example path: `./checkpoints/scene_crossover_scannet+scan3r.pth`
 - `--scan_id`: the scan ID from the dataset you'd like to calculate embeddings for (if not provided, embeddings for all scans are calculated)
 
 The script will output embeddings in the same format as provided [here](DATA.md/#generated-embedding-data).
 
+
 # :bar_chart: Evaluation
 #### Cross-Modal Object Retrieval
 Run the following script (refer to the script to run the instance baseline/instance crossover) for object instance + scene retrieval results using the instance-based methods. Detailed usage inside the script.
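
For orientation, a hypothetical end-to-end invocation of the two new scripts could look like the sketch below. The paths, scan IDs, and the space-separated form of `--modalities` are assumptions based on the flag descriptions above, not verbatim repository commands; adjust them to your local setup.

```bash
# Instance-level cross-modal retrieval on one scan (illustrative paths).
$ python single_inference/instance_inference.py \
    --dataset scannet \
    --process_dir ./preprocess_feats/Scannet \
    --ckpt ./checkpoints/instance_crossover_scannet+scan3r+multiscan+arkitscenes.pth \
    --scan_id scene_00004_00 \
    --modalities rgb point referral

# Scene-level embedding generation for a single scan (illustrative paths).
$ python single_inference/scene_inference.py \
    --dataset Scannet \
    --data_dir ./datasets/Scannet \
    --process_dir ./preprocess_feats/Scannet \
    --ckpt ./checkpoints/scene_crossover_scannet+scan3r.pth \
    --scan_id scene0568_00
```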

data/datasets/__init__.py

Lines changed: 1 addition & 2 deletions
@@ -1,5 +1,4 @@
 from .scannet import *
 from .scan3r import *
 from .arkit import *
-from .multiscan import *
-from .structured3d import *
+from .multiscan import *

demo/demo_instance_retrieval.py

Lines changed: 15 additions & 32 deletions
@@ -26,13 +26,13 @@
 
 DEFAULT_CONFIG = {
     'dataset': 'scannet', # scannet, scan3r, arkitscenes, multiscan
-    'data_dir': '/drive/datasets/Scannet', # Update this with your data path
-    'process_dir': '/drive/dumps/multimodal-spaces/preprocess_feats/Scannet', # Update this with your processed data path
-    'ckpt': '/drive/dumps/multimodal-spaces/runs/new_runs/instance_crossover_scannet+scan3r+multiscan+arkitscenes.pth', # Update this with your model checkpoint
-    'scan_id': 'scene0568_00', # Default scan to search in
-    'query_modality': 'point', # point, rgb, referral
-    'target_modality': 'referral', # point, rgb, referral, cad
-    'query_path': './demo_data/kitchen/scene.ply', # Path to your query file
+    'data_dir': '/drive/datasets/Scannet',
+    'process_dir': '/drive/dumps/multimodal-spaces/preprocess_feats/Scannet',
+    'ckpt': '/drive/dumps/multimodal-spaces/runs/new_runs/instance_crossover_scannet+scan3r+multiscan+arkitscenes.pth',
+    'scan_id': 'scene0568_00',
+    'query_modality': 'point',
+    'target_modality': 'point',
+    'query_path': './demo_data/kitchen/scene.ply', # Path to your query file - refers to query object PCL
     'top_k': 5
 }
 # =============================================================================
@@ -49,7 +49,6 @@ def __init__(self, args):
         self.args = args
         self.setup_model()
 
-        # Setup image transforms
         self.image_transform = tvf.Compose([
             tvf.ToTensor(),
             tvf.Normalize(mean=[0.485, 0.456, 0.406],
@@ -62,7 +61,6 @@ def setup_model(self):
         kwargs = [init_kwargs]
         self.accelerator = Accelerator(kwargs_handlers=kwargs)
 
-        # Convert args to DictConfig format expected by model
         model_args = DictConfig({
             'out_dim': self.args.out_dim,
             'input_dim_3d': self.args.input_dim_3d,
@@ -111,7 +109,7 @@ def _encode_point_query(self, path: str) -> torch.Tensor:
         points = np.asarray(pcd.points)
 
         # Send raw point cloud as list (like datasets) - model will handle sampling
-        point_clouds = [points] # List of raw point clouds
+        point_clouds = [points]
         point_masks = torch.ones(1, 1).bool() # (1, 1)
 
         data_dict = {
@@ -132,11 +130,9 @@ def _encode_rgb_query(self, path: str) -> torch.Tensor:
 
         image = Image.open(path)
         image = image.resize((224, 224), Image.BICUBIC)
-        image_pt = self.image_transform(image).unsqueeze(0) # (1, C, H, W)
-
-        # Convert to model expected format: (batch_size, num_objects, num_views, C, H, W)
-        rgb_data = image_pt.unsqueeze(0).unsqueeze(0) # (1, 1, 1, C, H, W)
-        rgb_masks = torch.ones(1, 1).bool() # (1, 1)
+        image_pt = self.image_transform(image).unsqueeze(0)
+        rgb_data = image_pt.unsqueeze(0).unsqueeze(0)
+        rgb_masks = torch.ones(1, 1).bool()
 
         data_dict = {
             'objects': {
@@ -152,11 +148,9 @@ def _encode_rgb_query(self, path: str) -> torch.Tensor:
 
     def _encode_referral_query(self, path: str) -> torch.Tensor:
         """Encode text referral query"""
-        if os.isfile(path):
-            with open(path, 'r') as f:
-                text = f.read().strip()
-        else:
-            text = path # Assume path is the text itself
+        assert os.path.isfile(path), 'Referral Path should be a text file'
+        with open(path, 'r') as f:
+            text = f.read().strip()
 
         data_dict = {'referral_texts': [[[text]]]}
 
@@ -168,29 +162,23 @@
     def encode_scene(self, scan_id: str) -> Dict[str, torch.Tensor]:
         """Encode all objects in the scene and return embeddings by modality"""
 
-        # Setup dataset for this specific scan
         self.setup_dataset(scan_id)
-
-        # Get the data for this scan
         data_dict = self.dataset.get_data()
 
 
         with torch.no_grad():
             output = self.model(data_dict)
 
-        # Extract embeddings and masks for each modality
         scene_embeddings = {}
         for modality in output['embeddings']:
             embeddings = output['embeddings'][modality].cpu()
             masks = data_dict['masks'][modality].cpu()
 
-            # Remove batch dimension
             if len(embeddings.shape) == 3:
                 embeddings = embeddings.squeeze(0)
             if len(masks.shape) == 2:
                 masks = masks.squeeze(0)
 
-            # Store embeddings and masks
             scene_embeddings[modality] = {
                 'embeddings': embeddings,
                 'masks': masks,
@@ -229,7 +217,6 @@ def retrieve(
         target_embeddings = scene_data[target_modality]['embeddings']
         target_masks = scene_data[target_modality]['masks']
 
-        # Filter valid objects only
         valid_mask = target_masks.bool()
         if valid_mask.sum() == 0:
             log.warning("No valid objects found in target modality")
@@ -278,7 +265,6 @@ def main():
                         choices=['point', 'rgb', 'referral', 'cad'],
                         help=f'Target modality to match against - default: {DEFAULT_CONFIG["target_modality"]}')
 
-    # Dataset arguments with defaults from config
     parser.add_argument('--dataset', type=str, default=DEFAULT_CONFIG['dataset'],
                         choices=['scannet', 'scan3r', 'arkitscenes', 'multiscan'],
                         help=f'Dataset name - default: {DEFAULT_CONFIG["dataset"]}')
@@ -289,7 +275,6 @@
     parser.add_argument('--ckpt', type=str, default=DEFAULT_CONFIG['ckpt'],
                         help=f'Path to model checkpoint - default: {DEFAULT_CONFIG["ckpt"]}')
 
-    # Optional arguments
     parser.add_argument('--top_k', type=int, default=DEFAULT_CONFIG['top_k'],
                         help=f'Number of top results to return - default: {DEFAULT_CONFIG["top_k"]}')
 
@@ -301,7 +286,6 @@
 
     args = parser.parse_args()
 
-    # Print configuration being used
     log.info("=== Instance Retrieval Configuration ===")
     log.info(f"Dataset: {args.dataset}")
     log.info(f"Data directory: {args.data_dir}")
@@ -329,15 +313,14 @@
 
     # Run retrieval
     retriever = InstanceRetrieval(args)
-    results = retriever.retrieve(
+    retriever.retrieve(
         args.query_path,
         args.query_modality,
         args.scan_id,
         args.target_modality,
         args.top_k
     )
 
-    return results
 
 
 if __name__ == '__main__':
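
A hypothetical invocation of the updated demo, overriding the `DEFAULT_CONFIG` values via CLI flags (the `--query_path`, `--query_modality`, `--target_modality`, and `--scan_id` flags are inferred from `DEFAULT_CONFIG` and the argparse lines visible above, and may differ slightly in the actual script):

```bash
# Match a query object point cloud against point-cloud instances in one scan.
$ python demo/demo_instance_retrieval.py \
    --query_path ./demo_data/kitchen/scene.ply \
    --query_modality point \
    --target_modality point \
    --scan_id scene0568_00 \
    --top_k 5
```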
