Cross modality framework

Description

This project wants to provide a flexible framework for cross-modality learning, enabling the integration and training of models across different data modalities (e.g., images and events). It supports unimodal and multimodal architectures, domain adaptation, and tasks such as detection and segmentation(WIP). The framework is designed for extensibility, allowing easy configuration, modular backbone and head selection. It is suitable for research and development in multi-domain and multi-task machine learning scenarios.

Dependencies Installation

You can set up the environment using the following commands:

conda create -n CMF python=3.13.2  pip # Create a new conda environment (other Python 3.x versions should work)

To install all required dependencies, use the provided script:

conda activate CMF
sh install_req.sh

The script will install PyTorch 2.7.0 with CUDA 12.6. Adjust versions as needed for your system.

Usage

This guide provides instructions to set up and run the framework:

Preparing the Data:
- Organize your datasets according to the required modalities (e.g., images, events).
- Update configuration files with dataset paths and parameters.

Configuration: Edit the configuration file (typically located in config/) to specify:

Modality-specific parameters.
Model details (e.g. backbone and head settings).
Training hyperparameters.

Example snippet:

model:
name: 'resnet50_yolox'
head: 
    name: 'yolox_head' # redundant here, but useful for clarity
    num_classes: 8
    losses_weights: [5.0, 1.0, 1.0, 1.0]  # [iou, obj, cls, l1]
backbone:
    name: '' # if name is empty -> stack of the two backbone will be used
    rgb_backbone: resnet50 # from timm
    pretrained: True # if not specified, it will be set to True
    pretrained_weights: #'../resnet50_backbone_from_detr.pth' # path to pretrained weights if needed
    embed_dim: 256
    input_size: 512
    output_indices: [3, 4] # indices of the output layers to be used

Use existing templates as a reference.

⚠️⚠️⚠️ Warning ⚠️⚠️⚠️

The config .yaml file must include all the parameters defined in the argparse for them to take effect. Otherwise, the argument parsing WILL NOT WORK. You can think of the .yaml file as containing the default arguments for your specific training run.

The losses, two types:
- Unimodal Tasks: Losses are computed within each head of the respective models, such as the "YoloXHead." A base class will be implemented soon to ensure a consistent interface across all heads.
- Multimodal Tasks: The loss needs to be specified in the .yaml configuration file. A factory method will handle the building process. If a loss is not yet implemented, add it to the builder.
Preparing the Dataset The DSEC-Night and Cityscapes datasets are currently supported. To prepare them for training:
1. Ensure the root directory of each dataset (or a symlink to it) is placed within the data/ folder.
2. Run the appropriate script to generate the train and validation split files.
  - For Cityscapes:
```
python dataset/create_cs_txt.py
```
  - For Cityscapes Events:
```
python dataset/create_cs_events_txt.py
```
  - For DSEC-Night:
```
python dataset/create_dataset_txt.py
```
For DSEC-Night, the script create_dataset_vg.py is also available. This script will create caches for voxel grids inside your dataset folder. This was added due to the high computational cost of creating voxel grids at runtime.
(⚠️⚠️⚠️ Currently under mantainance ⚠️⚠️⚠️)
Training the Model:
- Run the training script with your configuration file:
```
python train_from_config.py configs/your_config.yaml
```
  You can add any arguments you want (at least, the ones specified in utils/argparser.py). Arguments are parsed as follows:
- Use _ to separate words in argument names (e.g., --batch_size).
- Use - to specify a key within a sub-dictionary (e.g., --logger-name, where name is a key inside the logger sub-dictionary in the config).
- To monitor the process, the framework is fully integrated with wandb.
- Use the DEBUG environment variable to monitor the internal processes. Higher values (>=1) will increase the verbosity of the output, Most used are:
  - DEBUG=1: Provides basic information such as real-time loss for each batch and setup details.
  - DEBUG=3: e.g. saves and allows inspection of ground truth bounding box images (just one, >4 for all of them).
- For testing purposes, use the EVAL_ONLY environment variable to skip the training loop and run only the evaluation pipeline.

Evaluating the Model:

Run the evaluation script:

python detect_from_config.py --config config/your_config.yaml --checkpoint path/to/your/checkpoint.pth --input_image path_to_image

TODO

Implement new model -> thinking about rf-detr
Make the logger uniform for all the framework (probably the one in dsec_evaluator)
in custom, get the output frame dims in input (for now it only works with 512x512)
check for validity of VGs
Add segmentation task (low prio)
Add same build from config as mmcv.
(per la proposta di metodo) considerare di fare la loss di contrastive solo sulla bbox e tutto il resto considerarlo come negative

Model Output Format

All models in this framework follow a standardized output format to ensure consistency and easy integration with the training pipeline.

Standard Model Output Structure

Models must return a dictionary with the following structure:

{
    'backbone_features': {
        'preflatten_feat': [...],    # Multi-scale features (list of tensors)
        'flattened_feat': tensor,    # Flattened features (optional)
        # ... other backbone-specific outputs
    },
    'head_outputs': tensor,          # Task-specific predictions
    'total_loss': scalar,            # Total weighted loss (training only)
    'losses': {                      # Individual loss components (training only)
        'iou_loss': scalar,
        'obj_loss': scalar,
        'cls_loss': scalar,
        'l1_loss': scalar,           # Optional, task-dependent
        # ... other task-specific losses
    }
}

YOLOXHead Input and Output Specifications

Input Format (Training)

During training, the YOLOXHead expects ground truth labels in the following format:

Labels tensor shape: [batch_size, max_objects, 5]
Label format: [class_id, x_center, y_center, width, height]
- class_id: Integer class identifier (0-based indexing)
- x_center, y_center: Center coordinates of the bounding box (absolute pixel coordinates)
- width, height: Width and height of the bounding box (absolute pixel values)
Coordinate system: Center-based format with absolute pixel coordinates
Padding: Unused label slots should be filled with negative values (e.g., -1)

Output Format (Inference)

By default, the model outputs bounding boxes in the format: [x_center, y_center, width, height, objectness_score, class_confidence_0, class_confidence_1, ...]

Detailed breakdown:

Bounding box coordinates: [x_center, y_center, width, height] (center coordinates with width/height)
Objectness score: Confidence that the box contains an object
Class confidences: Per-class confidence scores (one for each class)
Coordinate system: Center-based format with absolute pixel coordinates
Output tensor shape: [batch_size, num_detections, 5 + num_classes]

Internal Processing

The YOLOXHead uses center-based coordinates throughout its internal processing:

Loss computation uses [x_center, y_center, width, height] format
Assignment algorithms expect center-based ground truth
IoU calculations support both coordinate formats via the xyxy parameter
Multi-scale feature processing maintains center-based representation

Name		Name	Last commit message	Last commit date
Latest commit History 401 Commits
configs		configs
dataset		dataset
evaluator		evaluator
model		model
scripts		scripts
training		training
utils		utils
.gitignore		.gitignore
README.md		README.md
detect_from_config.py		detect_from_config.py
eval.py		eval.py
extract_resnet50_from_detr.py		extract_resnet50_from_detr.py
requirements.txt		requirements.txt
train_from_config.py		train_from_config.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Cross modality framework

Description

Table of Contents

Dependencies Installation

Usage

⚠️⚠️⚠️ Warning ⚠️⚠️⚠️

TODO

Model Output Format

Standard Model Output Structure

YOLOXHead Input and Output Specifications

Input Format (Training)

Output Format (Inference)

Internal Processing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Cross modality framework

Description

Table of Contents

Dependencies Installation

Usage

⚠️⚠️⚠️ Warning ⚠️⚠️⚠️

TODO

Model Output Format

Standard Model Output Structure

YOLOXHead Input and Output Specifications

Input Format (Training)

Output Format (Inference)

Internal Processing

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages