# mmlearn

[![code checks](https://github.com/VectorInstitute/mmlearn/actions/workflows/code_checks.yml/badge.svg)](https://github.com/VectorInstitute/mmlearn/actions/workflows/code_checks.yml)
[![integration tests](https://github.com/VectorInstitute/mmlearn/actions/workflows/integration_tests.yml/badge.svg)](https://github.com/VectorInstitute/mmlearn/actions/workflows/integration_tests.yml)
[![license](https://img.shields.io/github/license/VectorInstitute/mmlearn.svg)](https://github.com/VectorInstitute/mmlearn/blob/main/LICENSE)

*mmlearn* aims to enable the evaluation of existing multimodal representation learning methods and to facilitate
experimentation and research on new techniques.

## Quick Start

### Installation

#### Prerequisites

The library requires Python 3.10 or later. We recommend using a virtual environment to manage dependencies. You can create
and activate a virtual environment with the following commands:

```bash
python3 -m venv /path/to/new/virtual/environment
source /path/to/new/virtual/environment/bin/activate
```

#### Installing binaries

To install the pre-built binaries, run:

```bash
python3 -m pip install mmlearn
```

The library also provides optional extras for modality-specific and other add-on dependencies. For example, to install
the library with the `vision` and `audio` extras, run:

```bash
python3 -m pip install mmlearn[vision,audio]
```

#### Building from source

To install the library from source, run:

```bash
git clone https://github.com/VectorInstitute/mmlearn.git
cd mmlearn
python3 -m pip install -e .
```

### Running Experiments

We use [Hydra](https://hydra.cc/docs/intro/) and [hydra-zen](https://mit-ll-responsible-ai.github.io/hydra-zen/) to manage configurations
in the library.

The directory containing your project's configurations must have an `__init__.py` file to make it a Python package and an
`experiment` folder for the experiment configuration files.
This format allows the use of `.yaml` configuration files as well as Python modules (using [structured configs](https://hydra.cc/docs/tutorials/structured_config/intro/) or [hydra-zen](https://mit-ll-responsible-ai.github.io/hydra-zen/)) to define the experiment configurations.

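As a rough illustration of the Python route, the hypothetical `__init__.py` below uses hydra-zen to register a structured
config for a user-defined class. The class, config group, and option names are placeholders, and the exact config groups
and schema expected by mmlearn are not shown here; treat this as a sketch rather than the library's prescribed layout.

```python
# path/to/config/directory/__init__.py -- hypothetical user config package
from hydra_zen import builds, store

from my_project.encoders import MyEncoder  # hypothetical user-defined class

# Register a structured config for the encoder; the group and option names
# below are placeholders, not mmlearn's actual config groups.
store(
    builds(MyEncoder, embed_dim=512, populate_full_signature=True),
    group="modules/encoders",
    name="my_encoder",
)

# Add the registered configs to Hydra's config store so they can be composed
# by name from experiment configs.
store.add_to_hydra_store()
```
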
To run an experiment, use the following command:

```bash
mmlearn_run 'hydra.searchpath=[pkg://path.to.config.directory]' +experiment=<name_of_experiment_yaml_file> experiment_name=your_experiment_name
```

Hydra will compose the experiment configuration from all the configurations in the specified directory as well as all the
configurations in the `mmlearn` package. *Note the dot-separated path to the directory containing the experiment configuration
files.*

One can add a path to `hydra.searchpath` either as a package (`pkg://path.to.config.directory`) or as a file system path
(`file:///path/to/config/directory`); however, configurations defined as Python modules are only picked up when the directory
is added as a package. Hence, please refrain from using the `file://` notation.

Hydra also allows for overriding configuration parameters from the command line. To see the available options and other information, run:

```bash
mmlearn_run 'hydra.searchpath=[pkg://path.to.config.directory]' +experiment=<name_of_experiment_yaml_file> --help
```

By default, the `mmlearn_run` command will run the experiment locally. To run the experiment on a SLURM cluster, we use
the [submitit launcher](https://hydra.cc/docs/plugins/submitit_launcher/) plugin built into Hydra. The following is an example
of how to run an experiment on a SLURM cluster:

```bash
mmlearn_run --multirun \
    hydra.launcher.mem_per_cpu=5G \
    hydra.launcher.qos=your_qos \
    hydra.launcher.partition=your_partition \
    hydra.launcher.gres=gpu:4 \
    hydra.launcher.cpus_per_task=8 \
    hydra.launcher.tasks_per_node=4 \
    hydra.launcher.nodes=1 \
    hydra.launcher.stderr_to_stdout=true \
    hydra.launcher.timeout_min=720 \
    'hydra.searchpath=[pkg://path.to.my_project.configs]' \
    +experiment=my_experiment \
    experiment_name=my_experiment_name
```

This will submit a job to the SLURM cluster with the specified resources.

**Note**: After the job is submitted, it is okay to cancel the program with `Ctrl+C`. The job will continue running on
the cluster. You can also add `&` at the end of the command to run it in the background.

## Summary of Implemented Methods

<table>
<tr>
<th style="text-align: left; width: 250px">Pretraining Methods</th>
</tr>
</table>

## Components

### Datasets

Every dataset object must return an instance of `Example` with one or more keys/attributes corresponding to a modality name
as specified in the `Modalities` registry. The `Example` object must also include an `example_index` attribute/key, which
is used, in addition to the dataset index, to uniquely identify the example.

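For illustration, here is a minimal sketch of a map-style dataset that yields `Example` objects. The import path for
`Example`, the modality key names, and the toy data are assumptions, so check them against your installed version and
the `Modalities` registry.

```python
import torch
from torch.utils.data import Dataset

from mmlearn.datasets.core import Example  # assumed import path; verify in your version


class ToyImageTextDataset(Dataset):
    """Toy dataset that returns image-text pairs as `Example` objects."""

    def __init__(self, num_samples: int = 10):
        self.images = torch.randn(num_samples, 3, 224, 224)
        self.captions = [f"a caption for sample {i}" for i in range(num_samples)]

    def __len__(self) -> int:
        return len(self.captions)

    def __getitem__(self, idx: int) -> Example:
        return Example(
            {
                # Keys should match modality names registered in the `Modalities`
                # registry; "rgb" and "text" are assumed names here.
                "rgb": self.images[idx],
                "text": self.captions[idx],
                # Required: used together with the dataset index to uniquely
                # identify the example.
                "example_index": idx,
            }
        )
```
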
<details>
<summary><b>CombinedDataset</b></summary>

The `CombinedDataset` object is used to combine multiple datasets into one. It accepts an iterable of `torch.utils.data.Dataset`
and/or `torch.utils.data.IterableDataset` objects and returns an `Example` object from one of the datasets, given an index.
Conceptually, the `CombinedDataset` object is a concatenation of the datasets in the input iterable, so the given index
can be mapped to a specific dataset based on the sizes of the datasets. As iterable-style datasets do not support random access,
the examples from these datasets are returned in the order in which they are iterated over.

The `CombinedDataset` object also adds a `dataset_index` attribute to the `Example` object, corresponding to the index of
the dataset in the input iterable. Every example returned by the `CombinedDataset` will have an `example_ids` attribute,
which is an instance of `Example` containing the same keys/attributes as the original example (except `example_index` and
`dataset_index`), with each value being a tensor of the `dataset_index` and `example_index`.
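
A rough usage sketch, assuming `CombinedDataset` is importable from `mmlearn.datasets.core` and reusing the toy dataset
sketched in the Datasets section above:

```python
from mmlearn.datasets.core import CombinedDataset  # assumed import path; verify in your version

# Concatenate two toy datasets of sizes 10 and 4.
combined = CombinedDataset([ToyImageTextDataset(num_samples=10), ToyImageTextDataset(num_samples=4)])

example = combined[12]        # maps to index 2 of the second dataset
print(example.dataset_index)  # 1, i.e. the index of the source dataset in the input iterable
```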
</details>

### Dataloading

When dealing with multiple datasets with different modalities, the default `collate_fn` of `torch.utils.data.DataLoader`
may not work, as it assumes that all examples have the same keys/attributes. In that case, the `collate_example_list`
function can be used as the `collate_fn` argument of `torch.utils.data.DataLoader`. This function takes a list of `Example`
objects and returns a dictionary of tensors containing all the keys/attributes of the `Example` objects.

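For instance, assuming `collate_example_list` can be imported from `mmlearn.datasets.core` (verify the path in your
installed version):

```python
from torch.utils.data import DataLoader

from mmlearn.datasets.core import collate_example_list  # assumed import path; verify in your version

# `combined` is any dataset whose items are `Example` objects, e.g. the
# `CombinedDataset` sketched above.
loader = DataLoader(combined, batch_size=4, collate_fn=collate_example_list)

batch = next(iter(loader))
print(batch.keys())  # union of the keys/attributes across the collated examples
```
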
## Contributing

If you are interested in contributing to the library, please see [CONTRIBUTING.MD](CONTRIBUTING.MD). This file contains