
SeeSay: A Multimodal Language Model

Hey there! Welcome to SeeSay, my project where I'm building a multimodal language model from scratch using PyTorch. This thing is pretty cool - it can understand both pictures and words!

What's SeeSay All About?

So, I recently watched this awesome video about coding a multimodal language model, and I just had to give it a shot myself. That's how SeeSay was born! It's been a wild ride learning about all these fancy concepts, but I think I've got something pretty neat going on here.

The Cool Stuff

Vision Transformer (ViT)

I'm using a Vision Transformer as my visual encoder. It's pretty neat - it splits each image into fixed-size patches, projects every patch into an embedding vector, adds positional embeddings so the model knows where each patch sits, and then feeds the whole sequence through a Transformer encoder. It's like giving my model superpowers for understanding pictures!
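
If you're curious what that looks like in code, here's a rough PyTorch sketch of the patch-embedding step. The class name and sizes are just illustrative, not the exact code in this repo:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and project each one to an embedding vector."""
    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A strided convolution is the standard trick: one kernel application per patch.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learned position embeddings, one per patch.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, images):                      # (B, 3, H, W)
        x = self.proj(images)                       # (B, embed_dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)            # (B, num_patches, embed_dim)
        return x + self.pos_embed                   # add the position info

patches = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(patches.shape)  # torch.Size([2, 196, 768])
# These patch embeddings then go through a stack of Transformer encoder layers.
```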

Contrastive Learning

I'm training my vision encoder using contrastive learning. It's a self-supervised technique: the encoder learns to pull the embeddings of matching pairs together and push mismatched ones apart, so it picks up useful visual features without needing hand-labeled data. It's like showing a kid lots of pictures and saying "Hey, figure out what's important!"
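
Here's a minimal sketch of the kind of contrastive (InfoNCE-style) loss I mean. The exact loss in this repo may differ, and what counts as a "pair" (two augmented views of an image, or a matched image/text pair) depends on your dataset:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """InfoNCE-style loss: row i of emb_a should match row i of emb_b."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    # Matching pairs sit on the diagonal; every other entry acts as a negative.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```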

Language Model

My text part is built around a standard Transformer architecture: it reads a sequence of tokens and predicts the next one, which is what lets it generate human-like text.
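
To give a feel for the shape of things, here's a tiny self-contained next-token Transformer built from PyTorch's stock layers. It's just a sketch, not SeeSay's actual language model:

```python
import torch
import torch.nn as nn

class TinyLanguageModel(nn.Module):
    """Minimal LM: token embeddings -> Transformer layers with a causal mask -> vocab logits."""
    def __init__(self, vocab_size=32000, dim=512, n_layers=4, n_heads=8):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)

    def forward(self, token_ids):                                    # (B, T)
        T = token_ids.size(1)
        # Causal mask so each position can only attend to earlier tokens.
        causal_mask = nn.Transformer.generate_square_subsequent_mask(T)
        x = self.tok_embed(token_ids)                                # (B, T, dim)
        x = self.blocks(x, mask=causal_mask)
        return self.lm_head(x)                                       # (B, T, vocab_size)

logits = TinyLanguageModel()(torch.randint(0, 32000, (2, 16)))
```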

Putting it All Together

The magic happens when I combine the visual and textual embeddings into one sequence the language model can attend over. I'm also using a KV cache to hold the attention keys and values of tokens that have already been processed, so generation doesn't redo that work for every new token. It's like having a super-smart librarian who knows exactly where everything is!
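
Here's a toy illustration of both ideas - prepending projected image tokens to the text tokens, and keeping a KV cache around. All the shapes and names are made up for the example:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: 196 patch embeddings from the vision encoder plus a tokenised prompt.
image_feats = torch.randn(1, 196, 768)        # output of the ViT
text_ids = torch.randint(0, 32000, (1, 12))   # tokenised text prompt

proj = nn.Linear(768, 512)                    # map vision features into the LM's width
tok_embed = nn.Embedding(32000, 512)

# Fuse by prepending the projected image tokens to the text tokens along the sequence axis.
fused = torch.cat([proj(image_feats), tok_embed(text_ids)], dim=1)   # (1, 196 + 12, 512)

# A KV cache is just the per-layer keys/values of tokens already processed,
# so each newly generated token attends to cached K/V instead of recomputing them.
kv_cache = {"k": torch.empty(1, 0, 512), "v": torch.empty(1, 0, 512)}

def append_to_cache(cache, new_k, new_v):
    cache["k"] = torch.cat([cache["k"], new_k], dim=1)
    cache["v"] = torch.cat([cache["v"], new_v], dim=1)
    return cache
```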

Implementation Details

Rotary Positional Encodings

I'm using rotary positional encodings (RoPE). Instead of adding a position vector to each token, they rotate pairs of embedding dimensions by an angle that depends on the position, which bakes relative positions right into the attention scores and helps my model handle long sequences really well.
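
If you haven't met RoPE before, here's a compact, simplified version of the idea (real implementations usually interleave the channel pairs and cache the angles):

```python
import torch

def rotary_embed(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (batch, seq_len, dim).

    Pairs of channels are rotated by an angle that grows with the position,
    so relative offsets between tokens show up directly in attention dot products.
    """
    B, T, D = x.shape
    half = D // 2
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]   # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = rotary_embed(torch.randn(2, 16, 64))   # queries (and keys) get rotated before attention
```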

Normalization Layers

I've got layer normalization throughout. It rescales activations to a consistent range, which keeps training stable and helps everything converge more smoothly.
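
For reference, the usual pre-norm placement inside a residual block looks like this (the exact placement in this repo may differ):

```python
import torch
import torch.nn as nn

dim = 512
norm = nn.LayerNorm(dim)
mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

x = torch.randn(2, 16, dim)
# Pre-norm residual: normalize first, transform, then add the result back to the input.
x = x + mlp(norm(x))
```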

Attention Mechanism

My model uses standard multi-head Transformer attention. It's like having a spotlight that shines on the important bits of both the image and the text.
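
Stripped of the multi-head bookkeeping, that spotlight is just scaled dot-product attention:

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, causal=False):
    """Plain scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (..., T_q, T_k)
    if causal:
        T_q, T_k = scores.shape[-2:]
        mask = torch.ones(T_q, T_k, dtype=torch.bool).tril()
        scores = scores.masked_fill(~mask, float("-inf"))  # hide future positions
    return F.softmax(scores, dim=-1) @ v

out = attention(torch.randn(2, 8, 64), torch.randn(2, 8, 64), torch.randn(2, 8, 64), causal=True)
```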

Training Time!

Training involves updating the vision encoder and the language model together. It's a bit tricky to balance, but I've got it working pretty well. You'll need to set up your own dataset, though - I don't have one ready to go yet.
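
To make that concrete, here's roughly what a single training step could look like. `vision_encoder`, `language_model`, and the batch format are placeholders for whatever you wire up around your own dataset - this isn't the repo's actual training loop, and you could fold the contrastive loss from earlier into the same total loss:

```python
import torch
import torch.nn.functional as F

def train_step(vision_encoder, language_model, batch, optimizer):
    images, token_ids = batch                       # paired images and token ids (B, T)
    image_feats = vision_encoder(images)            # visual features for this batch
    # Hypothetical interface: the LM takes image features plus the text prefix and
    # returns logits aligned with the text positions.
    logits = language_model(image_feats, token_ids[:, :-1])
    # Standard next-token prediction loss on the text part.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           token_ids[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```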

What Can You Do With It?

Once trained, you can do some pretty cool stuff:

  1. Describe pictures (image captioning)
  2. Answer questions about images (visual Q&A)
  3. Even generate images from text (though that part isn't perfect yet!)

There are example scripts in the examples folder to show you how to do these things.

Want to Help Out?

Contributions are totally welcome! Just check out the CONTRIBUTING.md file for details on how to join in.

License Stuff

This whole shebang is under the MIT License. Check LICENSE.md for the boring legal bits.

That's it! Thanks for checking out my project. Let me know if you try anything cool with it!
