Dissertation for the MSc in Artificial Intelligence of the University of Surrey (2023/2024)
Semantic image segmentation is particularly compelling because it requires models to approach a human-level understanding of images. As a foretaste of a potential artificial general intelligence, that is, a model capable of performing or learning any cognitive task that humans can, so-called foundation models have recently pushed the limits of machine understanding of the world around us. This dissertation explores the zero-shot capabilities of multimodal foundation models on the semantic image segmentation task. The work builds on the ZegCLIP architecture of Ziqin Zhou et al., a baseline that is improved throughout this research. Two key ideas drive the improvements: the integration of multi-scale techniques to bridge the gap between the whole-image scale at which CLIP operates and the pixel scale required by segmentation, and the use of intermediate layers of the image encoder. This work contributes to the advancement of semantic segmentation techniques by identifying the limits of current approaches and proposing architectures that address them. Extensive experiments confirm the relevance of both ideas, substantially improving the segmentation performance of the ZegCLIP baseline, in particular its generalization capacity.

