This project provides a flexible framework for cross-modality learning, enabling the integration and training of models across different data modalities (e.g., images and events). It supports unimodal and multimodal architectures, domain adaptation, and tasks such as detection and segmentation (WIP). The framework is designed for extensibility, with easy configuration and modular backbone and head selection. It is suitable for research and development in multi-domain and multi-task machine learning scenarios.
You can set up the environment using the following commands:
conda create -n CMF python=3.13.2 pip # Create a new conda environment (other Python 3.x versions should work)
conda activate CMF
To install all required dependencies, use the provided script:
sh install_req.sh
The script will install PyTorch 2.7.0 with CUDA 12.6. Adjust versions as needed for your system.
This guide provides instructions to set up and run the framework:
- Preparing the Data:
  - Organize your datasets according to the required modalities (e.g., images, events).
  - Update configuration files with dataset paths and parameters.
- Configuration: Edit the configuration file (typically located in config/) to specify:
  - Modality-specific parameters.
  - Model details (e.g., backbone and head settings).
  - Training hyperparameters.
Example snippet:
model:
  name: 'resnet50_yolox'
  head:
    name: 'yolox_head' # redundant here, but useful for clarity
    num_classes: 8
    losses_weights: [5.0, 1.0, 1.0, 1.0] # [iou, obj, cls, l1]
  backbone:
    name: '' # if name is empty -> a stack of the two backbones will be used
    rgb_backbone: resnet50 # from timm
    pretrained: True # if not specified, it will be set to True
    pretrained_weights: #'../resnet50_backbone_from_detr.pth' # path to pretrained weights if needed
    embed_dim: 256
    input_size: 512
    output_indices: [3, 4] # indices of the output layers to be used
Use existing templates as a reference.
The config .yaml file must include all the parameters defined in the argparse for them to take effect; otherwise, argument parsing WILL NOT WORK. You can think of the .yaml file as containing the default arguments for your specific training run.
- The losses come in two types:
  - Unimodal Tasks: Losses are computed within each head of the respective models, such as the "YoloXHead". A base class will be implemented soon to ensure a consistent interface across all heads.
  - Multimodal Tasks: The loss needs to be specified in the .yaml configuration file. A factory method will handle the building process. If a loss is not yet implemented, add it to the builder.
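The factory approach for multimodal losses can be pictured with a minimal sketch. Everything named here (`LOSS_REGISTRY`, `register_loss`, `build_loss`, and the toy `l1` loss) is hypothetical and only illustrates the pattern; the framework's actual builder may look different.

```python
# Hypothetical registry: maps the loss name used in the .yaml file to a builder.
LOSS_REGISTRY = {}


def register_loss(name):
    """Decorator that adds a loss builder to the registry under `name`."""
    def decorator(builder):
        LOSS_REGISTRY[name] = builder
        return builder
    return decorator


@register_loss("l1")
def build_l1(weight=1.0):
    """Toy weighted mean absolute error on plain Python sequences."""
    def loss(pred, target):
        return weight * sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
    return loss


def build_loss(cfg):
    """Build a loss from a config dict such as {'name': 'l1', 'weight': 2.0}."""
    cfg = dict(cfg)  # copy so the caller's config is not modified
    name = cfg.pop("name")
    if name not in LOSS_REGISTRY:
        raise NotImplementedError(f"Loss '{name}' is not implemented; add it to the builder.")
    return LOSS_REGISTRY[name](**cfg)


loss_fn = build_loss({"name": "l1", "weight": 2.0})
value = loss_fn([1.0, 3.0], [0.0, 1.0])
```

With this pattern, supporting a new loss only requires registering one more builder; the config file keeps referring to losses by name.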
- Preparing the Dataset: The DSEC-Night and Cityscapes datasets are currently supported. To prepare them for training:
  - Ensure the root directory of each dataset (or a symlink to it) is placed within the data/ folder.
  - Run the appropriate script to generate the train and validation split files.
    - For Cityscapes:
      python dataset/create_cs_txt.py
    - For Cityscapes Events:
      python dataset/create_cs_events_txt.py
    - For DSEC-Night:
      python dataset/create_dataset_txt.py
  - For DSEC-Night, the script create_dataset_vg.py is also available. It creates voxel grid caches inside your dataset folder, which avoids the high computational cost of building voxel grids at runtime.
    (⚠️ Currently under maintenance ⚠️)
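To illustrate why caching voxel grids helps, here is a minimal sketch of the idea. It is not the logic of create_dataset_vg.py: the functions `events_to_voxel_grid` and `load_or_build`, the `(t, x, y, polarity)` event layout, the nearest-bin accumulation, and the pickle cache format are all assumptions made for the example.

```python
import os
import pickle


def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate event polarities into `num_bins` temporal slices.

    `events` is assumed to be a time-sorted list of (t, x, y, polarity) tuples.
    Each event is assigned to a single temporal bin (nearest-bin variant).
    """
    grid = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    if not events:
        return grid
    t0 = events[0][0]
    duration = max(events[-1][0] - t0, 1e-9)  # avoid division by zero
    for t, x, y, p in events:
        b = min(int((t - t0) / duration * num_bins), num_bins - 1)
        grid[b][y][x] += 1.0 if p > 0 else -1.0
    return grid


def load_or_build(cache_path, build_fn):
    """Return the cached object if present; otherwise build it and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = build_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result
```

The expensive conversion then runs once per sample, and later epochs only pay the cost of reading the cached file.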
- Training the Model:
  - Run the training script with your configuration file:
    python train_from_config.py configs/your_config.yaml
    You can add any arguments you want (at least, the ones specified in utils/argparser.py). Arguments are parsed as follows:
    - Use _ to separate words in argument names (e.g., --batch_size).
    - Use - to specify a key within a sub-dictionary (e.g., --logger-name, where name is a key inside the logger sub-dictionary in the config).
  - To monitor the process, the framework is fully integrated with wandb.
  - Use the DEBUG environment variable to monitor the internal processes. Higher values (>=1) increase the verbosity of the output. The most used are:
    - DEBUG=1: Provides basic information such as real-time loss for each batch and setup details.
    - DEBUG=3: For example, saves ground truth bounding box images for inspection (just one; use a value >4 for all of them).
  - For testing purposes, use the EVAL_ONLY environment variable to skip the training loop and run only the evaluation pipeline.
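The argument naming convention (`_` inside a name, `-` between a sub-dictionary and its key) can be sketched as follows. This is not the code in utils/argparser.py; the helpers `flatten`, `unflatten`, and `parse_with_defaults`, as well as the sample config keys, are hypothetical and only show how config values can act as defaults that command-line arguments override.

```python
import argparse


def flatten(cfg, prefix=""):
    """Flatten nested dicts, joining nested keys with '-' (e.g. 'logger-name')."""
    flat = {}
    for key, value in cfg.items():
        name = f"{prefix}-{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def unflatten(flat):
    """Rebuild the nested config from '-'-joined keys."""
    cfg = {}
    for name, value in flat.items():
        *parents, leaf = name.split("-")
        node = cfg
        for parent in parents:
            node = node.setdefault(parent, {})
        node[leaf] = value
    return cfg


def parse_with_defaults(defaults, argv):
    """Use config values as argparse defaults, then apply CLI overrides.

    Only int, float, and str values are handled here; booleans and lists
    would need dedicated handling.
    """
    parser = argparse.ArgumentParser()
    for name, value in flatten(defaults).items():
        parser.add_argument(f"--{name}", dest=name, type=type(value), default=value)
    return unflatten(vars(parser.parse_args(argv)))


# Values that would normally come from the .yaml file (illustrative only).
defaults = {"batch_size": 4, "lr": 0.001, "logger": {"name": "baseline", "project": "cmf"}}
cfg = parse_with_defaults(defaults, ["--batch_size", "8", "--logger-name", "run1"])
```

In this sketch an argument that is missing from the defaults is rejected by the parser, which mirrors the requirement that every parameter be present in the .yaml file.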
- Evaluating the Model:
  - Run the evaluation script:
    python detect_from_config.py --config config/your_config.yaml --checkpoint path/to/your/checkpoint.pth --input_image path_to_image
To do:
- Implement a new model (considering RF-DETR).
- Make the logger uniform across the framework (probably the one in dsec_evaluator).
- In custom, take the output frame dimensions as input (for now it only works with 512x512).
- Check the validity of the voxel grids (VGs).
- Add the segmentation task (low priority).
- Add the same build-from-config mechanism as mmcv.
- (For the method proposal) Consider computing the contrastive loss only on the bounding box and treating everything else as negatives.
All models in this framework follow a standardized output format to ensure consistency and easy integration with the training pipeline.
Models must return a dictionary with the following structure:
{
'backbone_features': {
'preflatten_feat': [...], # Multi-scale features (list of tensors)
'flattened_feat': tensor, # Flattened features (optional)
# ... other backbone-specific outputs
},
'head_outputs': tensor, # Task-specific predictions
'total_loss': scalar, # Total weighted loss (training only)
'losses': { # Individual loss components (training only)
'iou_loss': scalar,
'obj_loss': scalar,
'cls_loss': scalar,
'l1_loss': scalar, # Optional, task-dependent
# ... other task-specific losses
}
}
During training, the YOLOXHead expects ground truth labels in the following format:
- Labels tensor shape: [batch_size, max_objects, 5]
- Label format: [class_id, x_center, y_center, width, height]
  - class_id: Integer class identifier (0-based indexing)
  - x_center, y_center: Center coordinates of the bounding box (absolute pixel coordinates)
  - width, height: Width and height of the bounding box (absolute pixel values)
- Coordinate system: Center-based format with absolute pixel coordinates
- Padding: Unused label slots should be filled with negative values (e.g., -1)
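As a rough illustration of the label layout above, the sketch below converts corner-format boxes to the center-based format and pads each image to `max_objects` rows. It uses nested Python lists in place of tensors to stay dependency-free, and the helper names `xyxy_to_label` and `pad_labels` are hypothetical, not part of the framework.

```python
def xyxy_to_label(class_id, x1, y1, x2, y2):
    """Convert a corner-format box to [class_id, x_center, y_center, width, height]."""
    return [float(class_id), (x1 + x2) / 2.0, (y1 + y2) / 2.0, float(x2 - x1), float(y2 - y1)]


def pad_labels(per_image_labels, max_objects, pad_value=-1.0):
    """Pad each image's label list to `max_objects` rows filled with `pad_value`.

    The result has shape [batch_size, max_objects, 5] as nested lists.
    """
    batch = []
    for labels in per_image_labels:
        if len(labels) > max_objects:
            raise ValueError("more objects than max_objects")
        rows = [list(label) for label in labels]
        rows += [[pad_value] * 5 for _ in range(max_objects - len(rows))]
        batch.append(rows)
    return batch


# One image with a single box, one image with no objects.
label = xyxy_to_label(2, 100, 50, 300, 150)
batch = pad_labels([[label], []], max_objects=3)
```

The nested list can then be turned into a tensor of the expected shape by whatever data pipeline feeds the head.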
By default, the model outputs bounding boxes in the format: [x_center, y_center, width, height, objectness_score, class_confidence_0, class_confidence_1, ...]
Detailed breakdown:
- Bounding box coordinates: [x_center, y_center, width, height] (center coordinates with width/height)
- Objectness score: Confidence that the box contains an object
- Class confidences: Per-class confidence scores (one for each class)
- Coordinate system: Center-based format with absolute pixel coordinates
- Output tensor shape: [batch_size, num_detections, 5 + num_classes]
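A single output row can be unpacked as in the sketch below. The function `decode_detection` is hypothetical, and combining objectness with the best class confidence into one score is an assumption borrowed from common YOLOX-style post-processing; the framework's own post-processing may differ.

```python
def decode_detection(row, score_threshold=0.5):
    """Decode one row [cx, cy, w, h, objectness, class_conf_0, ...].

    Returns None if the combined score is below `score_threshold`.
    Assumes the row contains at least one class confidence.
    """
    cx, cy, w, h, objectness = row[:5]
    class_scores = row[5:]
    class_id = max(range(len(class_scores)), key=class_scores.__getitem__)
    # Assumed convention: final score = objectness * best class confidence.
    score = objectness * class_scores[class_id]
    if score < score_threshold:
        return None
    # Convert the center-based box to corner format for drawing or NMS.
    box_xyxy = (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)
    return {"box_xyxy": box_xyxy, "class_id": class_id, "score": score}


detection = decode_detection([200.0, 100.0, 50.0, 20.0, 0.9, 0.1, 0.8, 0.1])
```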
The YOLOXHead uses center-based coordinates throughout its internal processing:
- Loss computation uses [x_center, y_center, width, height] format
- Assignment algorithms expect center-based ground truth
- IoU calculations support both coordinate formats via the xyxy parameter
- Multi-scale feature processing maintains center-based representation
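The role of an `xyxy` switch in an IoU routine can be sketched as follows. This is a minimal scalar version written for illustration; the function name `box_iou` and its signature are assumptions and do not reproduce the head's actual implementation.

```python
def box_iou(box_a, box_b, xyxy=True):
    """Intersection over union of two boxes.

    With xyxy=True boxes are (x1, y1, x2, y2); with xyxy=False they are
    (x_center, y_center, width, height) and are converted to corners first.
    """
    if not xyxy:
        box_a = (box_a[0] - box_a[2] / 2.0, box_a[1] - box_a[3] / 2.0,
                 box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0)
        box_b = (box_b[0] - box_b[2] / 2.0, box_b[1] - box_b[3] / 2.0,
                 box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0)
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# The same pair of boxes expressed in both formats gives the same IoU.
iou_corners = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0), xyxy=True)
iou_centers = box_iou((1.0, 1.0, 2.0, 2.0), (2.0, 2.0, 2.0, 2.0), xyxy=False)
```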