
Machine Learning Institute - Week 4 - Multimodal architectures / Image Captioning

This week, we are experimenting with multimodal transformer/LLM architectures.

We will try two approaches: a self-built and self-trained decoder, and a fine-tuned Qwen base model. Each is paired with a contrastively trained vision transformer that produces image embedding tokens.
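As a rough illustration of this pairing, the vision transformer's image embeddings can be projected into the decoder's embedding dimension and prepended as prefix tokens. A minimal PyTorch sketch; the class name and dimensions are illustrative assumptions, not taken from this repo:

import torch
import torch.nn as nn

class ImagePrefixProjector(nn.Module):
    """Maps vision-transformer patch embeddings into the decoder's
    token-embedding space so they can be prepended as image tokens."""

    def __init__(self, vision_dim: int = 768, decoder_dim: int = 896):
        super().__init__()
        self.proj = nn.Linear(vision_dim, decoder_dim)

    def forward(self, image_embeddings: torch.Tensor) -> torch.Tensor:
        # image_embeddings: (batch, num_patches, vision_dim) from the vision encoder
        return self.proj(image_embeddings)  # (batch, num_patches, decoder_dim)

# Example: prepend the projected image tokens to the caption token embeddings
# before running the decoder.
# decoder_input = torch.cat([projector(image_embeddings), caption_embeddings], dim=1)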

Set-up

Ensure uv is installed, then install dependencies with:

uv sync --all-packages --dev

Model Training

Run the following, optionally passing a --model "model_name" parameter:

uv run -m model.start_train
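For example, to select a specific model (the name below is illustrative, not necessarily one registered in this repo):

uv run -m model.start_train --model "custom_decoder"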

Run the Streamlit app

uv run streamlit run streamlit/app.py

TODOs

  • Add positional encoding (see the sketch below)
  • Offset the output
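For the positional-encoding TODO, a minimal sketch of the standard sinusoidal encoding in PyTorch; the class name and shapes are assumptions, not the repo's current code:

import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Adds fixed sinusoidal position information to token embeddings."""

    def __init__(self, dim: int, max_len: int = 512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                               # (max_len, 1)
        div_term = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))  # (dim / 2,)
        pe = torch.zeros(max_len, dim)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -- add the encoding for the first seq_len positions
        return x + self.pe[: x.size(1)]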
