The Llama 3.2 90B-Vision-Instruct model is a vision-based version of the Llama 3.2 model, designed to be highly capable with visual reasoning and instruction following abilities. This model is ideal for building personalized, on-device agentic applications with strong privacy, where data never leaves the device.
- Highly capable with visual reasoning and instruction following abilities
- Supports image understanding and visual grounding tasks
- Optimized for edge and mobile devices
- Supports context length of 128K tokens
- Available for fine-tuning and deployment on a variety of platforms
- Part of the Llama 3.2 ecosystem, providing seamless integration with other Llama models
- Model size: 90B parameters
- Context length: 128K tokens
- Input type: Text and image
- Output type: Text and image
- Pre-trained on: Large-scale noisy (text, image) pair data
- Fine-tuned on: Medium-scale high-quality in-domain and knowledge-enhanced (text, image) pair data
- Weights: Based on BFloat16 numerics
- Quantized variants: Currently in development
- Competitive with leading foundation models on image recognition and visual understanding tasks
- Outperforms Gemma 2 2.6B and Phi 3.5-mini models on tasks such as following instructions, visual grounding, and image captioning
- Competitive with Gemma 2 2.6B model on tasks such as visual reasoning and image captioning
- Personalized on-device agentic applications with strong privacy
- Visual reasoning and instruction following
- Image understanding and visual grounding
- Image captioning and generation
- Multimodal text and image generation
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels
vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support NVIDIA GPUs and AMD GPUs
- (Experimental) Prefix caching support
- (Experimental) Multi-lora support
vLLM seamlessly supports many Hugging Face models, including the following architectures:
- Aquila & Aquila2 (
BAAI/AquilaChat2-7B,BAAI/AquilaChat2-34B,BAAI/Aquila-7B,BAAI/AquilaChat-7B, etc.) - Baichuan & Baichuan2 (
baichuan-inc/Baichuan2-13B-Chat,baichuan-inc/Baichuan-7B, etc.) - BLOOM (
bigscience/bloom,bigscience/bloomz, etc.) - ChatGLM (
THUDM/chatglm2-6b,THUDM/chatglm3-6b, etc.) - Command-R (
CohereForAI/c4ai-command-r-v01, etc.) - DBRX (
databricks/dbrx-base,databricks/dbrx-instructetc.) - DeciLM (
Deci/DeciLM-7B,Deci/DeciLM-7B-instruct, etc.) - Falcon (
tiiuae/falcon-7b,tiiuae/falcon-40b,tiiuae/falcon-rw-7b, etc.) - Gemma (
google/gemma-2b,google/gemma-7b, etc.) - GPT-2 (
gpt2,gpt2-xl, etc.) - GPT BigCode (
bigcode/starcoder,bigcode/gpt_bigcode-santacoder, etc.) - GPT-J (
EleutherAI/gpt-j-6b,nomic-ai/gpt4all-j, etc.) - GPT-NeoX (
EleutherAI/gpt-neox-20b,databricks/dolly-v2-12b,stabilityai/stablelm-tuned-alpha-7b, etc.) - InternLM (
internlm/internlm-7b,internlm/internlm-chat-7b, etc.) - InternLM2 (
internlm/internlm2-7b,internlm/internlm2-chat-7b, etc.) - Jais (
core42/jais-13b,core42/jais-13b-chat,core42/jais-30b-v3,core42/jais-30b-chat-v3, etc.) - LLaMA, Llama 2, and Meta Llama 3 (
meta-llama/Meta-Llama-3-8B-Instruct,meta-llama/Meta-Llama-3-70B-Instruct,meta-llama/Llama-2-70b-hf,lmsys/vicuna-13b-v1.3,young-geng/koala,openlm-research/open_llama_13b, etc.) - MiniCPM (
openbmb/MiniCPM-2B-sft-bf16,openbmb/MiniCPM-2B-dpo-bf16, etc.) - Mistral (
mistralai/Mistral-7B-v0.1,mistralai/Mistral-7B-Instruct-v0.1, etc.) - Mixtral (
mistralai/Mixtral-8x7B-v0.1,mistralai/Mixtral-8x7B-Instruct-v0.1,mistral-community/Mixtral-8x22B-v0.1, etc.) - MPT (
mosaicml/mpt-7b,mosaicml/mpt-30b, etc.) - OLMo (
allenai/OLMo-1B-hf,allenai/OLMo-7B-hf, etc.) - OPT (
facebook/opt-66b,facebook/opt-iml-max-30b, etc.) - Orion (
OrionStarAI/Orion-14B-Base,OrionStarAI/Orion-14B-Chat, etc.) - Phi (
microsoft/phi-1_5,microsoft/phi-2, etc.) - Phi-3 (
microsoft/Phi-3-mini-4k-instruct,microsoft/Phi-3-mini-128k-instruct, etc.) - Qwen (
Qwen/Qwen-7B,Qwen/Qwen-7B-Chat, etc.) - Qwen2 (
Qwen/Qwen1.5-7B,Qwen/Qwen1.5-7B-Chat, etc.) - Qwen2MoE (
Qwen/Qwen1.5-MoE-A2.7B,Qwen/Qwen1.5-MoE-A2.7B-Chat, etc.) - StableLM(
stabilityai/stablelm-3b-4e1t,stabilityai/stablelm-base-alpha-7b-v2, etc.) - Starcoder2(
bigcode/starcoder2-3b,bigcode/starcoder2-7b,bigcode/starcoder2-15b, etc.) - Xverse (
xverse/XVERSE-7B-Chat,xverse/XVERSE-13B-Chat,xverse/XVERSE-65B-Chat, etc.) - Yi (
01-ai/Yi-6B,01-ai/Yi-34B, etc.)
Visit our documentation to get started.