
Machine Learning Institute - Week 4 - Multimodal architectures / Image Captioning

This week, we are experimenting with multimodal transformer/LLM architectures.

We will try two approaches: a self-built and self-trained decoder, and a fine-tuned Qwen base model. Each is paired with a contrastively trained vision transformer that produces image embedding tokens.
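As a rough illustration of this pairing, the vision transformer's image embeddings can be projected into the decoder's embedding dimension and prepended as prefix tokens. A minimal PyTorch sketch; the class name and dimensions are illustrative assumptions, not taken from this repo:

import torch
import torch.nn as nn

class ImagePrefixProjector(nn.Module):
    """Maps vision-transformer patch embeddings into the decoder's
    token-embedding space so they can be prepended as image tokens."""

    def __init__(self, vision_dim: int = 768, decoder_dim: int = 896):
        super().__init__()
        self.proj = nn.Linear(vision_dim, decoder_dim)

    def forward(self, image_embeddings: torch.Tensor) -> torch.Tensor:
        # image_embeddings: (batch, num_patches, vision_dim) from the vision encoder
        return self.proj(image_embeddings)  # (batch, num_patches, decoder_dim)

# Example: prepend the projected image tokens to the caption token embeddings
# before running the decoder.
# decoder_input = torch.cat([projector(image_embeddings), caption_embeddings], dim=1)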

Set-up

Ensure uv is installed, then install dependencies with:

uv sync --all-packages --dev

Model Training

Run the following, optionally passing a --model "model_name" parameter:

uv run -m model.start_train
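For example, to select a specific model (the name below is illustrative, not necessarily one registered in this repo):

uv run -m model.start_train --model "custom_decoder"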

Run the Streamlit app

uv run streamlit run streamlit/app.py

TODOs

  • Add positional encoding (see the sketch below)
  • Offset the output
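For the positional-encoding TODO, a minimal sketch of the standard sinusoidal encoding in PyTorch; the class name and shapes are assumptions, not the repo's current code:

import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Adds fixed sinusoidal position information to token embeddings."""

    def __init__(self, dim: int, max_len: int = 512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                               # (max_len, 1)
        div_term = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))  # (dim / 2,)
        pe = torch.zeros(max_len, dim)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -- add the encoding for the first seq_len positions
        return x + self.pe[: x.size(1)]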
