Molmo represents a landmark achievement in open-source multimodal AI: a family of vision-language models that are completely open in every aspect - training data, model weights, training code, and evaluation methodology. Developed by the Allen Institute for AI (AI2), Molmo challenges the prevailing paradigm where even "open-source" models often rely on proprietary training data or closed methodologies.
Key Achievement: First state-of-the-art vision-language model with fully open training data, enabling true reproducibility and transparency in multimodal AI research.
The Openness Gap in Multimodal AI:
Most "open-source" multimodal models have hidden components:
- LLaVA: Uses GPT-4 generated instruction data (closed)
- Qwen-VL: Training data composition not fully disclosed
- InstructBLIP: Relies on proprietary BLIP-2 data curation
Molmo's Complete Openness:
- PixMo Dataset: Fully open, high-quality multimodal training data
- Model Weights: All model sizes freely available
- Training Code: Complete training pipeline released
- Evaluation Data: Transparent benchmarking methodology
- Data Collection Process: Detailed documentation of data creation
Why Full Openness Matters:
- Reproducibility: Researchers can replicate results exactly
- Understanding: Transparency reveals what makes models work
- Fairness: Ensures no hidden proprietary advantages
- Innovation: Enables community to build and improve freely
- Trust: Open data allows bias and quality auditing
Model Sizes:
- Molmo-7B: 7 billion parameters (competitive with much larger models)
- Molmo-72B: 72 billion parameters (state-of-the-art performance)
- MolmoE-1B: 1 billion active parameters (mixture-of-experts; efficient for edge deployment)
Unique Capabilities:
- Pointing: Can indicate specific locations in images with (x,y) coordinates
- Fine-grained Grounding: Links descriptions to precise image regions
- Spatial Reasoning: Strong understanding of object relationships and locations
- Document Understanding: Effective on charts, diagrams, and text-rich images
Performance Highlights:
- Matches or exceeds GPT-4V on many benchmarks despite smaller size
- Best open-source model on spatial reasoning tasks
- Strong zero-shot generalization to new domains
Molmo's design philosophy starts with a fundamental question: Can we build state-of-the-art multimodal models using only open, auditable training data?
Challenges with Proprietary Data:
Traditional Approach:
Pretrain on web-scale data (unknown quality)
→ Instruction-tune with GPT-4 outputs (closed)
→ Fine-tune on proprietary datasets (hidden)
→ Publish "open" weights
Problem: Cannot reproduce, cannot verify quality, cannot understand biases
Molmo's Solution:
Open Approach:
Curate PixMo dataset (fully documented)
→ Human-annotated instructions (open process)
→ Transparent quality filters (published criteria)
→ Release everything
Benefit: Complete reproducibility and transparency
PixMo (Pixels for Molmo) is Molmo's training data foundation, consisting of multiple carefully curated subsets:
PixMo-Cap: Image Captioning
- 1M high-quality image-caption pairs
- Human-written detailed descriptions
- Focus on spatial relationships and fine details
- Quality control through multi-stage validation
PixMo-Point: Spatial Grounding
- 500K image-point annotations
- Natural language queries with (x,y) answers
- "Where is the red car?" → (x: 0.45, y: 0.67)
- Enables precise spatial reasoning (a hypothetical record sketch appears after this list)
PixMo-Ask: Visual Question Answering
- 750K question-answer pairs
- Diverse question types (what, where, how many, why)
- Human-verified answers
- Balanced difficulty levels
PixMo-VQA: Instruction Following
- 300K instruction-response pairs
- Complex multi-step reasoning
- Compositional understanding
- Chain-of-thought annotations
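To make these subset formats concrete, here is a sketch of what a single PixMo-Point record might look like. The field names are illustrative assumptions for this walkthrough, not the released PixMo schema:

```python
# Hypothetical PixMo-Point record; field names are illustrative,
# not the released PixMo schema.
pixmo_point_example = {
    "image_id": "example_000123",
    "query": "Where is the red car?",
    # Coordinates normalized to [0, 1] relative to image width and height
    "points": [{"x": 0.45, "y": 0.67}],
    "human_verified": True,  # PixMo emphasizes human annotation and verification
}
```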
Data Quality Principles:
- Human-centric: Prefer human annotations over synthetic
- Diversity: Balance across domains, tasks, and difficulty
- Spatial detail: Emphasize location and relationship information
- Verification: Multi-stage quality checks
- Documentation: Full provenance and creation process
Molmo uses an efficient architecture optimized for the open data regime:
Components:
Input Image
↓
[Vision Encoder: ViT-L/14]
↓
Visual Features (spatial grid)
↓
[Multimodal Connector: MLP]
↓
Projected Features
↓
[Language Model: OLMo-7B/72B]
↓
Text + Pointing Outputs
Design Choices:
- Vision Encoder: Standard ViT (not proprietary CLIP)
  - Trained on open image datasets
  - No dependency on web-scale proprietary data
- Language Model: OLMo (AI2's open LLM)
  - Fully open pretraining data
  - Transparent training process
  - No hidden proprietary components
- Connector: Simple MLP projection
  - Efficient parameter usage
  - Easy to analyze and modify
  - Proven sufficient with quality data
Key Insight: With high-quality open data, simple architectures can match complex proprietary systems.
A distinguishing feature of Molmo is its ability to generate pointing outputs:
Pointing as a Token:
Question: "Where is the dog in this image?"
Traditional VLM: "The dog is in the lower left portion of the image."
Molmo: "The dog is at coordinates (0.23, 0.71) [points to dog]"
Output: Text + (x, y) coordinates
Training for Pointing:
- Spatial tokens: <point> followed by normalized coordinates
- Joint training: Alternate between text and pointing tasks
- Spatial attention: Vision features retain spatial structure
- Ground truth: Human-annotated point locations
Applications:
- Visual grounding: "Point to the largest building"
- Interactive AI: User can ask "What's that?" with pointing
- Robotics: Specify target locations for manipulation
- Medical imaging: Indicate regions of interest
Molmo's training follows a principled three-stage approach:
Stage 1: Vision-Language Alignment
Data: PixMo-Cap (image-caption pairs)
Objective: Align visual and language representations
Duration: 100K steps
Frozen: Vision encoder
Trainable: Connector, language model
Loss: Autoregressive language modeling on captions
Stage 2: Multimodal Instruction Tuning
Data: PixMo-Ask + PixMo-VQA (mixed)
Objective: Teach instruction following and reasoning
Duration: 50K steps
Frozen: Vision encoder
Trainable: Connector, language model
Loss: Cross-entropy on answer tokens only
Stage 3: Pointing Fine-tuning
Data: PixMo-Point + all previous data
Objective: Add spatial pointing capability
Duration: 20K steps
Frozen: Vision encoder (optional fine-tuning)
Trainable: Full model
Loss: Combined text + coordinate prediction loss
Critical Success Factors:
- High data quality compensates for smaller scale
- Spatial detail in annotations improves grounding
- Human verification ensures answer correctness
- Balanced curriculum prevents catastrophic forgetting
Vision Encoding:
Given an input image I:
Step 1: Patch Embedding
Divide image into patches: I → {p₁, p₂, ..., p_N}
where N = (H/patch_size) × (W/patch_size)
For standard 224×224 image with patch_size=14:
N = 16 × 16 = 256 patches
Embed each patch:
x_i = Linear(Flatten(p_i)) ∈ ℝ^d_v
where d_v = vision encoder dimension (e.g., 1024)
Step 2: Positional Encoding
Add 2D positional embeddings:
x_k = x_k + PE[i, j], where patch p_k lies at grid position (i, j)
Step 3: Vision Transformer
h = ViT_Encoder([CLS, x₁, x₂, ..., x_N])
h ∈ ℝ^{(N+1) × d_v}
Output: Visual features with spatial structure preserved
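As a quick sanity check of the patch arithmetic above, the shapes work out as follows (a minimal sketch using the 224×224, patch_size=14, d_v=1024 example):

```python
H = W = 224
patch_size = 14
d_v = 1024  # ViT-L hidden dimension

num_patches = (H // patch_size) * (W // patch_size)
assert num_patches == 16 * 16 == 256

# One [CLS] token plus 256 patch tokens, each of dimension d_v
encoder_output_shape = (num_patches + 1, d_v)  # (257, 1024)
```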
Multimodal Projection:
Transform visual features to language model space:
z = MLP(h)
= W₂ · GELU(W₁ · h + b₁) + b₂
where:
W₁ ∈ ℝ^{d_hidden × d_v}
W₂ ∈ ℝ^{d_lang × d_hidden}
z ∈ ℝ^{(N+1) × d_lang}
d_lang = language model dimension (e.g., 4096 for Molmo-7B)
Language Model Integration:
Concatenate visual and text tokens:
Input sequence:
X = [<vision_start>, z₁, z₂, ..., z_N, <vision_end>, t₁, t₂, ..., t_M]
where t_i are text token embeddings
Process through language model:
H = OLMo(X, causal_mask)
Generate outputs:
logits = H @ W_vocab^T
Next token: argmax(logits[-1]) or sample(logits[-1], temperature)
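To make the decoding step concrete, here is a minimal next-token selection sketch matching the argmax/sample rule above (tensor names are assumptions, not Molmo's API):

```python
import torch

def next_token(logits: torch.Tensor, temperature: float = 0.0) -> int:
    """Select the next token from the final position's logits.

    logits: [seq_len, vocab_size] output of the language model head.
    temperature == 0 -> greedy argmax; > 0 -> temperature-scaled sampling.
    """
    last = logits[-1]  # logits at the final sequence position
    if temperature <= 0.0:
        return int(last.argmax())
    probs = torch.softmax(last / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```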
Coordinate Tokenization:
Two ways to represent coordinate outputs:
Method 1: Direct Regression
point_head = Linear(H_last) → (x, y) ∈ [0, 1]²
Loss: L_point = MSE(pred_point, target_point)
Method 2: Discretized Tokens (Molmo's approach)
Quantize space: x, y → discrete bins
x_bin = ⌊x × num_bins⌋
y_bin = ⌊y × num_bins⌋
Tokens: <point> <x_{bin}> <y_{bin}>
Loss: L_point = CrossEntropy(x_bin) + CrossEntropy(y_bin)
Spatial Attention Preservation:
To maintain spatial structure for pointing:
In vision encoder:
- Keep patch-level features (don't pool too aggressively)
- Maintain 2D positional information
- Use grid-aware attention patterns
In projection:
- Preserve spatial organization: z[i] corresponds to patch position
- Don't permute or heavily compress features
- Allow language model to attend spatially
Stage 1 - Caption Generation Loss:
L_caption = -∑ᵢ log P(caption_i | image, caption_{<i})
Standard autoregressive language modeling on caption tokens
Stage 2 - Instruction Following Loss:
L_instruct = -∑ⱼ log P(answer_j | image, question, answer_{<j})
Only compute loss on answer tokens (not question tokens)
Prevents model from learning to predict instructions
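One common way to implement answer-only supervision is to mask the question positions with an ignore index before computing cross-entropy. A minimal sketch, assuming PyTorch's standard -100 ignore convention (not necessarily Molmo's exact implementation):

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label contribute no loss

def answer_only_labels(input_ids: torch.Tensor, answer_start: int) -> torch.Tensor:
    """Copy token ids as labels, but ignore everything before the answer."""
    labels = input_ids.clone()
    labels[:answer_start] = IGNORE_INDEX
    return labels

# Then: loss = F.cross_entropy(logits.view(-1, vocab_size),
#                              labels.view(-1), ignore_index=IGNORE_INDEX)
```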
Stage 3 - Combined Text + Pointing Loss:
For samples with text answers:
L_text = -∑ₖ log P(text_k | image, query, text_{<k})
For samples with point answers:
L_point = CrossEntropy(x_bin_pred, x_bin_true) +
CrossEntropy(y_bin_pred, y_bin_true)
Combined loss:
L_total = α · L_text + β · L_point + γ · L_spatial
where:
L_spatial = auxiliary loss to preserve spatial reasoning
α, β, γ = task-specific weights (typically α=0.7, β=0.2, γ=0.1)
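Putting the three terms together, here is a minimal sketch of the Stage 3 objective (shapes and weights follow the definitions above; the auxiliary spatial term is passed in precomputed, since its exact form is model-specific):

```python
import torch
import torch.nn.functional as F

def stage3_loss(
    text_logits,       # [N_text, vocab_size] logits at answer-token positions
    text_targets,      # [N_text] target token ids
    x_logits, x_bins,  # [N_pt, num_bins], [N_pt] for x-coordinate bins
    y_logits, y_bins,  # [N_pt, num_bins], [N_pt] for y-coordinate bins
    spatial_loss,      # scalar auxiliary loss, computed elsewhere
    alpha=0.7, beta=0.2, gamma=0.1,
):
    l_text = F.cross_entropy(text_logits, text_targets)
    l_point = F.cross_entropy(x_logits, x_bins) + F.cross_entropy(y_logits, y_bins)
    return alpha * l_text + beta * l_point + gamma * spatial_loss
```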
Bin Assignment:
Given continuous coordinates (x, y) ∈ [0, 1]²:
num_bins = 100 # typical value
Quantization:
x_bin = ⌊x × num_bins⌋
y_bin = ⌊y × num_bins⌋
Special tokens:
vocab_ext = vocab_base ∪ {<point>, <x_0>, <x_1>, ..., <x_99>, <y_0>, <y_1>, ..., <y_99>}
Output format:
"The object is located at <point> <x_45> <y_67>"
Dequantization (at inference):
Extract bins from generated tokens:
x_bin, y_bin = parse_point_tokens(output)
Continuous coordinates:
x = (x_bin + 0.5) / num_bins
y = (y_bin + 0.5) / num_bins
(+0.5 for bin center, better than bin start)
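The bin arithmetic above translates directly into code. A self-contained sketch of quantization, token formatting, and dequantization (num_bins=100, matching the scheme described here):

```python
NUM_BINS = 100

def quantize_point(x: float, y: float, num_bins: int = NUM_BINS):
    """Map continuous (x, y) in [0, 1]^2 to discrete bin indices."""
    x_bin = min(int(x * num_bins), num_bins - 1)  # clamp the x == 1.0 edge case
    y_bin = min(int(y * num_bins), num_bins - 1)
    return x_bin, y_bin

def point_to_tokens(x: float, y: float) -> str:
    x_bin, y_bin = quantize_point(x, y)
    return f"<point> <x_{x_bin}> <y_{y_bin}>"

def dequantize_point(x_bin: int, y_bin: int, num_bins: int = NUM_BINS):
    """Recover continuous coordinates at bin centers."""
    return (x_bin + 0.5) / num_bins, (y_bin + 0.5) / num_bins

# Example: (0.45, 0.67) -> "<point> <x_45> <y_67>";
# dequantizing bins (45, 67) gives (0.455, 0.675), within half a bin.
assert point_to_tokens(0.45, 0.67) == "<point> <x_45> <y_67>"
```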
Standard causal attention with visual prefix:
Attention scores:
A[i, j] = softmax((Q[i] · K[j]^T) / √d_k)
Causal mask for text tokens:
M[i, j] = 1 if j ≤ i (can attend to past)
= 0 if j > i (cannot attend to future)
Visual tokens always visible:
M[i, v] = 1 for all i, for visual tokens v
Output:
O[i] = ∑ⱼ A[i, j] · V[j]
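The masking rule above can be materialized as an explicit boolean mask. A minimal sketch, assuming num_visual prefix tokens followed by num_text text tokens:

```python
import torch

def prefix_causal_mask(num_visual: int, num_text: int) -> torch.Tensor:
    """Boolean mask M where M[i, j] = True means position i may attend to j.

    Visual prefix tokens are visible to every position; text tokens use
    standard causal masking among themselves.
    """
    total = num_visual + num_text
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:, :num_visual] = True  # every position sees the visual prefix
    causal = torch.tril(torch.ones(num_text, num_text, dtype=torch.bool))
    mask[num_visual:, num_visual:] = causal  # causal attention among text tokens
    return mask
```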
Think of Molmo vs. proprietary models like public libraries vs. exclusive private collections:
Proprietary Model (Private Collection):
- You can borrow books (use the model)
- But can't see how the collection was built
- Can't verify the sources
- Can't contribute your own books
- Limited by curator's choices
Molmo (Public Library):
- Complete catalog available (training data)
- Acquisition process transparent (data collection methodology)
- Anyone can verify quality (open inspection)
- Community can contribute (open development)
- Reproducible by others (build your own library)
Key Insight: Open access to "how the collection was built" is as important as access to the collection itself.
Traditional VLM:
User: "Where is the cat?"
Model: "The cat is in the upper right corner, near a potted plant."
User: *has to interpret vague description*
Molmo with Pointing:
User: "Where is the cat?"
Model: "The cat is at (0.78, 0.23) [precise coordinates]"
User: *can directly use coordinates for any purpose*
Applications:
- Robotics: "Grab that cup" → Robot gets exact coordinates
- Image Editing: "Select that object" → Auto-selection at point
- Medical: "Where is the abnormality?" → Precise location
- Navigation: "Where is the exit sign?" → Direction guidance
Molmo proves an important principle in modern AI:
Traditional Belief:
More data = Better model
100M samples > 1M samples (always)
Molmo's Demonstration:
Quality matters more than quantity
1M high-quality samples > 10M low-quality samples
Key factors:
- Human verification vs. automatic scraping
- Spatial detail vs. generic captions
- Diverse tasks vs. single task
- Balanced difficulty vs. random selection
Visual Comparison:

| Low Quality, High Volume | High Quality, Lower Volume |
|---|---|
| 10M samples | 1M samples |
| Noisy captions | Human-verified captions |
| Generic descriptions | Spatial details |
| Web scraping | Curated collection |
| Mediocre performance | Strong performance |
| Hidden biases | Transparent quality |
Complex Architectures (many recent models):
- Custom vision encoders
- Novel attention mechanisms
- Proprietary training tricks
- Hard to reproduce
Molmo's Simplicity:
- Standard ViT encoder
- Simple MLP connector
- Conventional language model
- Transparent training
Lesson: With quality data, simple and reproducible architectures suffice.
Looking at nexus/models/multimodal/molmo.py:
1. Vision Encoder (PixelShuffleVisionEncoder):
class PixelShuffleVisionEncoder(NexusModule):
    """Efficient vision encoder using standard ViT architecture."""
    def __init__(
        self,
        in_channels: int = 3,
        hidden_dim: int = 768,
        num_layers: int = 12,
        patch_size: int = 16,
    ):
        super().__init__()
        # Patch embedding via convolution
        self.patch_embed = nn.Conv2d(
            in_channels, hidden_dim,
            kernel_size=patch_size,
            stride=patch_size
        )
        # Transformer encoder layers
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=hidden_dim,
                nhead=12,
                dim_feedforward=hidden_dim * 4,
                batch_first=True,
                norm_first=True  # Pre-norm for stability
            )
            for _ in range(num_layers)
        ])

Key Design Choices:
- Convolutional patch embedding (efficient GPU usage)
- Pre-norm transformer (better training stability)
- Standard ViT configuration (reproducible)
2. Pointing Module:
class PointingModule(NexusModule):
    """Module for generating spatial pointing outputs."""
    def __init__(
        self,
        hidden_dim: int = 2048,
        num_points: int = 10,  # Can predict multiple points
    ):
        super().__init__()
        self.num_points = num_points
        # Point prediction head
        self.point_predictor = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_points * 2)  # (x, y) per point
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        """Predict pointing coordinates."""
        # Use last token for prediction
        pooled = hidden_states[:, -1, :]
        # Predict points
        points = self.point_predictor(pooled)
        points = points.view(-1, self.num_points, 2)
        # Normalize to [0, 1]
        points = torch.sigmoid(points)
        return points

3. Multimodal Projector:
class MultimodalProjector(NexusModule):
    """Simple MLP projection (2 layers suffice)."""
    def __init__(
        self,
        visual_dim: int = 768,
        text_dim: int = 2048,
        num_layers: int = 2,
    ):
        super().__init__()
        layers = []
        for i in range(num_layers):
            in_dim = visual_dim if i == 0 else text_dim
            layers.extend([
                nn.Linear(in_dim, text_dim),
                nn.GELU() if i < num_layers - 1 else nn.Identity()
            ])
        self.projector = nn.Sequential(*layers)

Basic Image Understanding:
from nexus.models.multimodal import Molmo
# Initialize model
model = Molmo(
visual_encoder_dim=768,
language_model_dim=2048,
num_visual_layers=12,
enable_pointing=True
)
# Load image
image = load_and_preprocess("photo.jpg") # [1, 3, 336, 336]
# Create text query
text = "Describe this image in detail."
text_embeds = tokenize_and_embed(text)
# Forward pass
output = model(images=image, text_embeds=text_embeds)
# Extract features
visual_features = output['visual_features']
multimodal_embeds = output['multimodal_embeds']

Spatial Pointing Query:
# Question requiring pointing
question = "Where is the red car in this image?"
question_embeds = tokenize_and_embed(question)
# Enable pointing output
output = model(
images=image,
text_embeds=question_embeds,
return_pointing=True
)
# Get pointing coordinates
pointing_coords = output['pointing_coords'] # [batch, num_points, 2]
print(f"Red car location: x={pointing_coords[0,0,0]:.2f}, y={pointing_coords[0,0,1]:.2f}")
# Also get text description
text_output = generate_text(output['multimodal_embeds'])
print(f"Description: {text_output}")Multiple Object Pointing:
# Query for multiple objects
query = "Point to all the people in this image."
query_embeds = tokenize_and_embed(query)
output = model(
images=image,
text_embeds=query_embeds,
return_pointing=True
)
# Multiple points returned
points = output['pointing_coords'] # [1, num_people, 2]
for i, (x, y) in enumerate(points[0]):
print(f"Person {i+1}: ({x:.2f}, {y:.2f})")
# Can visualize on image:
# draw_point(image, x, y, color='red')

Stage 1: Vision-Language Alignment:
config_alignment = {
'data': 'pixmo_cap',
'batch_size': 256,
'learning_rate': 1e-3,
'warmup_steps': 2000,
'max_steps': 100000,
'optimizer': 'AdamW',
'weight_decay': 0.01,
'gradient_clip_norm': 1.0,
# Model components to train
'freeze_vision_encoder': True,
'train_connector': True,
'train_language_model': True,
}

Stage 2: Instruction Tuning:
config_instruction = {
'data': ['pixmo_ask', 'pixmo_vqa'],
'data_mix': [0.6, 0.4], # Weight mixture
'batch_size': 128,
'learning_rate': 2e-5,
'warmup_ratio': 0.03,
'max_steps': 50000,
'optimizer': 'AdamW',
'weight_decay': 0.01,
# Loss computation
'loss_on_answer_only': True, # Don't compute loss on questions
'label_smoothing': 0.1,
}

Stage 3: Pointing Fine-tuning:
config_pointing = {
'data': ['pixmo_point', 'pixmo_cap', 'pixmo_ask', 'pixmo_vqa'],
'data_mix': [0.4, 0.2, 0.2, 0.2],
'batch_size': 64,
'learning_rate': 1e-5,
'max_steps': 20000,
# Pointing-specific
'enable_pointing': True,
'point_loss_weight': 0.2,
'text_loss_weight': 0.7,
'spatial_loss_weight': 0.1,
# Fine-tune vision encoder
'freeze_vision_encoder': False,
'vision_lr_multiplier': 0.1, # Lower LR for vision
}

Molmo-7B (Balanced performance):
config_7b = {
'visual_encoder_dim': 1024,
'language_model_dim': 4096,
'num_visual_layers': 12,
'num_language_layers': 32,
'num_attention_heads': 32,
'intermediate_size': 11008,
'vocab_size': 50280,
}

Molmo-72B (Maximum performance):
config_72b = {
'visual_encoder_dim': 1280,
'language_model_dim': 8192,
'num_visual_layers': 18,
'num_language_layers': 80,
'num_attention_heads': 64,
'intermediate_size': 28672,
'vocab_size': 50280,
}

MolmoE-1B (Efficient variant):
config_1b = {
'visual_encoder_dim': 768,
'language_model_dim': 2048,
'num_visual_layers': 12,
'num_language_layers': 24,
'num_attention_heads': 16,
'intermediate_size': 5504,
'vocab_size': 50280,
'use_moe': True, # Mixture of Experts for efficiency
'num_experts': 8,
'experts_per_token': 2,
}

def encode_images(self, images: torch.Tensor) -> torch.Tensor:
"""Encode images to visual features.
Args:
images: [batch_size, 3, H, W]
Returns:
Visual features: [batch_size, num_patches, language_model_dim]
"""
# Step 1: Extract visual features through ViT
# - Patch embedding
# - Positional encoding
# - Transformer layers
visual_features = self.vision_encoder(images)
# Shape: [batch_size, num_patches, visual_encoder_dim]
# Example: [4, 256, 768] for 224x224 images with patch_size=14
# Step 2: Project to language model space
# - 2-layer MLP with GELU
# - Maps visual_dim → language_dim
projected_features = self.projector(visual_features)
# Shape: [batch_size, num_patches, language_model_dim]
# Example: [4, 256, 2048]
    return projected_features

def forward(
self,
images: torch.Tensor,
text_embeds: Optional[torch.Tensor] = None,
attention_mask: Optional[torch.Tensor] = None,
return_pointing: bool = False
) -> Dict[str, torch.Tensor]:
"""Complete forward pass."""
outputs = {}
B = images.shape[0]
# 1. Encode images
visual_features = self.encode_images(images)
outputs['visual_features'] = visual_features
# 2. Add vision boundary tokens
# These tokens mark where visual content starts/ends
vision_start = self.vision_start_token.expand(B, -1, -1)
vision_end = self.vision_end_token.expand(B, -1, -1)
# 3. Combine with text if provided
if text_embeds is not None:
# Concatenate: [START] [VISUAL] [END] [TEXT]
multimodal_embeds = torch.cat([
vision_start, # [B, 1, dim]
visual_features, # [B, 256, dim]
vision_end, # [B, 1, dim]
text_embeds # [B, text_len, dim]
], dim=1)
outputs['multimodal_embeds'] = multimodal_embeds
# 4. Extend attention mask for visual tokens
if attention_mask is not None:
visual_mask = torch.ones(
B,
visual_features.shape[1] + 2, # +2 for start/end
device=attention_mask.device
)
extended_mask = torch.cat([visual_mask, attention_mask], dim=1)
outputs['attention_mask'] = extended_mask
# 5. Compute pointing if requested
if return_pointing and self.enable_pointing:
# Use final hidden states to predict point coordinates
pointing_coords = self.pointing_module(multimodal_embeds)
outputs['pointing_coords'] = pointing_coords
else:
# Vision-only mode (no text query)
multimodal_embeds = torch.cat([
vision_start,
visual_features,
vision_end
], dim=1)
outputs['multimodal_embeds'] = multimodal_embeds
    return outputs

class PointingModule(NexusModule):
"""Predicts spatial coordinates for pointing."""
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
"""
Args:
hidden_states: [batch_size, seq_len, hidden_dim]
Returns:
points: [batch_size, num_points, 2]
"""
# Strategy: Use last token representation (after full context)
pooled = hidden_states[:, -1, :] # [B, hidden_dim]
# MLP prediction head
# hidden_dim → hidden_dim → num_points * 2
points = self.point_predictor(pooled) # [B, num_points * 2]
# Reshape to (x, y) pairs
points = points.view(-1, self.num_points, 2) # [B, num_points, 2]
# Sigmoid to normalize to [0, 1] range
points = torch.sigmoid(points)
        return points

Why this design?:
- Last token: Has full context from image and question
- MLP head: Simple and effective for regression
- Sigmoid: Ensures coordinates in valid range [0, 1]
- Multiple points: Can indicate several locations at once
Balanced Sampling:
import numpy as np

class BalancedDataLoader:
    """Sample from multiple datasets with specified weights."""
    def __init__(self, datasets, weights, batch_size):
        self.datasets = datasets
        # Normalize to a probability vector (accepts list or array)
        self.weights = np.asarray(weights, dtype=float) / np.sum(weights)
self.batch_size = batch_size
def __iter__(self):
while True:
# Choose dataset according to weights
dataset_idx = np.random.choice(
len(self.datasets),
p=self.weights
)
# Sample batch from chosen dataset
batch = self.datasets[dataset_idx].sample(self.batch_size)
yield batch
# Usage
loader = BalancedDataLoader(
datasets=[pixmo_cap, pixmo_ask, pixmo_point],
weights=[0.4, 0.3, 0.3],
batch_size=64
)

# Enable for vision encoder
model.vision_encoder.gradient_checkpointing = True
# Enable for language model
if hasattr(model, 'language_model'):
model.language_model.gradient_checkpointing_enable()
# Trade-off: ~30% slower training, but 40-50% less memory

from torch.optim import AdamW
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
optimizer = AdamW(model.parameters(), lr=2e-5)
for batch in dataloader:
with autocast(): # Automatic mixed precision
outputs = model(**batch)
loss = compute_loss(outputs, batch['labels'])
# Scale gradients for mixed precision
scaler.scale(loss).backward()
scaler.unscale_(optimizer)
# Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)
scaler.update()
    optimizer.zero_grad()

Strategy: Freeze earlier layers, fine-tune later layers
def freeze_early_layers(model, freeze_fraction=0.5):
"""Freeze first half of vision encoder layers."""
num_layers = len(model.vision_encoder.layers)
freeze_until = int(num_layers * freeze_fraction)
for i, layer in enumerate(model.vision_encoder.layers):
if i < freeze_until:
for param in layer.parameters():
param.requires_grad = False
else:
for param in layer.parameters():
param.requires_grad = True
# Benefits:
# - Reduces memory usage
# - Speeds up training
# - Prevents overfitting early layers
# - Still allows adaptation of later layers

Cosine schedule with warmup:
import math
from torch.optim.lr_scheduler import LambdaLR

def get_cosine_schedule_with_warmup(
optimizer,
num_warmup_steps,
num_training_steps,
min_lr_ratio=0.1
):
def lr_lambda(current_step):
if current_step < num_warmup_steps:
# Linear warmup
return current_step / max(1, num_warmup_steps)
# Cosine decay
progress = (current_step - num_warmup_steps) / \
(num_training_steps - num_warmup_steps)
cosine_decay = 0.5 * (1 + math.cos(math.pi * progress))
# Don't decay below min_lr_ratio
return max(min_lr_ratio, cosine_decay)
return LambdaLR(optimizer, lr_lambda)
# Usage
scheduler = get_cosine_schedule_with_warmup(
optimizer,
num_warmup_steps=2000,
num_training_steps=100000,
min_lr_ratio=0.1
)

Adaptive weighting based on difficulty:
import torch.nn.functional as F

def compute_pointing_loss(pred_points, target_points, difficulty):
"""
Args:
pred_points: [B, num_points, 2]
target_points: [B, num_points, 2]
difficulty: [B] - difficulty score per sample
"""
# Base MSE loss
mse_loss = F.mse_loss(pred_points, target_points, reduction='none')
mse_loss = mse_loss.mean(dim=[1, 2]) # [B]
# Weight by difficulty (harder samples get more weight)
weights = 1.0 + difficulty # [B]
weighted_loss = (mse_loss * weights).mean()
    return weighted_loss

def augment_image_and_points(image, points):
"""Augment image while updating point coordinates."""
# Random crop (update coordinates accordingly)
if random.random() < 0.5:
crop_size = random.uniform(0.8, 1.0)
crop_x = random.uniform(0, 1 - crop_size)
crop_y = random.uniform(0, 1 - crop_size)
# Crop image
image = crop(image, crop_x, crop_y, crop_size)
# Update points (normalize to new crop)
points[:, 0] = (points[:, 0] - crop_x) / crop_size
points[:, 1] = (points[:, 1] - crop_y) / crop_size
# Filter out-of-bounds points
valid = (points[:, 0] >= 0) & (points[:, 0] <= 1) & \
(points[:, 1] >= 0) & (points[:, 1] <= 1)
points = points[valid]
# Color jitter (doesn't affect points)
if random.random() < 0.5:
image = color_jitter(image, brightness=0.2, contrast=0.2)
# Horizontal flip (update x-coordinates)
if random.random() < 0.5:
image = hflip(image)
points[:, 0] = 1.0 - points[:, 0]
    return image, points

Visual Question Answering:
| Benchmark | Molmo-7B | Molmo-72B | GPT-4V | LLaVA-NeXT-34B |
|---|---|---|---|---|
| VQAv2 | 79.3% | 82.1% | 77.2% | 82.0% |
| GQA | 63.5% | 66.8% | 63.3% | 64.2% |
| TextVQA | 68.2% | 72.4% | 78.0% | 67.1% |
| VizWiz | 56.3% | 61.2% | 58.4% | 58.7% |
Spatial Reasoning (Molmo's strength):
| Benchmark | Molmo-7B | Molmo-72B | GPT-4V | LLaVA-NeXT |
|---|---|---|---|---|
| RefCOCO | 82.1% | 86.3% | 78.5% | 80.2% |
| RefCOCO+ | 76.4% | 81.2% | 73.1% | 75.8% |
| PointQA | 71.8% | 78.6% | 70.2% | 68.4% |
General Multimodal Understanding:
| Benchmark | Molmo-7B | Molmo-72B | GPT-4V | Open-Source Leader |
|---|---|---|---|---|
| MMBench | 68.9% | 74.2% | 75.1% | 70.5% (LLaVA-NeXT) |
| MM-Vet | 35.2% | 42.8% | 49.6% | 36.3% (LLaVA-NeXT) |
| POPE | 87.3% | 88.1% | 85.8% | 86.5% (LLaVA-NeXT) |
Key Observations:
- Molmo-72B competitive with GPT-4V despite full openness
- Exceptional performance on spatial reasoning tasks
- Molmo-7B best-in-class for its size
- Open data achieves state-of-the-art results
Impact of PixMo Data Quality:
| Training Data | VQAv2 | RefCOCO | TextVQA |
|---|---|---|---|
| Web-scraped (10M) | 76.2% | 78.5% | 64.3% |
| PixMo (1M) | 79.3% | 82.1% | 68.2% |
| PixMo + Human verify | 79.3% | 82.1% | 68.2% |
Conclusion: 1M high-quality samples outperform 10M noisy samples
Pointing Module Effectiveness:
| Configuration | RefCOCO | PointQA | Inference Speed |
|---|---|---|---|
| No pointing | 75.3% | N/A | 42 tok/s |
| With pointing | 82.1% | 71.8% | 39 tok/s |
| Pointing only | 83.2% | 74.1% | 45 tok/s |
Observation: 3 tok/s overhead, but significant accuracy gain on spatial tasks
Training Stage Importance:
| Stages Completed | VQAv2 | RefCOCO | MM-Vet |
|---|---|---|---|
| Stage 1 only | 71.2% | 68.4% | 24.1% |
| Stages 1+2 | 78.5% | 78.8% | 33.7% |
| Stages 1+2+3 | 79.3% | 82.1% | 35.2% |
Insight: Each stage contributes; pointing fine-tuning crucial for spatial tasks
Molmo (Open Data) vs. LLaVA (Partially Closed):
| Aspect | Molmo | LLaVA |
|---|---|---|
| Base captions | PixMo-Cap (open) | CC3M + LAION (open) |
| Instructions | PixMo-VQA (human-annotated) | GPT-4 generated (closed) |
| Reproducibility | Fully reproducible | Partially reproducible |
| Data inspection | All data available | Instructions not available |
| Performance | Comparable or better | State-of-the-art |
Conclusion: Open data can match or exceed closed data with careful curation.
Novel Domains (not in training data):
| Domain | Molmo-7B | GPT-4V | Training Data Overlap |
|---|---|---|---|
| Medical imaging | 62.4% | 68.7% | 0% (zero-shot) |
| Satellite imagery | 58.1% | 61.3% | 0% (zero-shot) |
| Scientific diagrams | 71.2% | 74.5% | <1% |
| Memes/Internet culture | 54.7% | 63.2% | <5% |
Observation: Strong zero-shot transfer despite training on general images only
Problem: Predicted points don't align with image coordinates.
Symptoms:
# Image at 640x480, but points assume 224x224
pred_point = (0.5, 0.5) # Center in model space
actual_image_size = (640, 480)
# Points need rescaling to actual image dimensions

Solution:
def scale_points_to_image(points, model_size, actual_size):
"""Scale normalized points to actual image coordinates.
Args:
points: [num_points, 2] in [0, 1] range
model_size: (H, W) model was trained on (e.g., 336x336)
actual_size: (H, W) of actual image
Returns:
scaled_points: [num_points, 2] in actual image coordinates
"""
# Points are in normalized [0, 1] space
# Scale to actual image dimensions
scaled_points = points.clone()
scaled_points[:, 0] *= actual_size[1] # x * width
scaled_points[:, 1] *= actual_size[0] # y * height
return scaled_points
# Usage
model_points = model(image)['pointing_coords'][0] # [10, 2]
actual_points = scale_points_to_image(
model_points,
model_size=(336, 336),
actual_size=image.shape[-2:]
)

Problem: Model overfits to dominant task type.
Example:
Dataset composition:
- 70% captioning
- 20% VQA
- 10% pointing
Result: Model great at captions, poor at pointing
Solution: Balanced sampling
# Don't sample proportionally to dataset size
# Instead, use uniform sampling or custom weights
config = {
'pixmo_cap': {'size': 1000000, 'weight': 0.33},
'pixmo_ask': {'size': 750000, 'weight': 0.33},
'pixmo_point': {'size': 500000, 'weight': 0.34},
}
# Effective sampling rates:
# cap: 0.33 / 1M = 0.33e-6 per sample
# ask: 0.33 / 750K = 0.44e-6 per sample (upsampled)
# point: 0.34 / 500K = 0.68e-6 per sample (heavily upsampled)

Problem: Visual features don't properly align with text space.
Symptoms:
Loss decreases but model generates gibberish or ignores images
Causes:
- Too-short alignment stage (Stage 1)
- Learning rate too high/low for connector
- Vision encoder frozen with poor initialization
Solutions:
# Ensure sufficient alignment stage
min_alignment_steps = 100000 # Don't rush this stage
# Separate learning rates
optimizer = AdamW([
{'params': model.vision_encoder.parameters(), 'lr': 1e-5},
{'params': model.projector.parameters(), 'lr': 1e-3}, # Higher for connector
{'params': model.language_model.parameters(), 'lr': 2e-5},
])
# Monitor alignment quality
def measure_alignment(visual_feats, text_feats):
"""Compute cosine similarity between modalities."""
visual_feats = F.normalize(visual_feats, dim=-1)
text_feats = F.normalize(text_feats, dim=-1)
similarity = (visual_feats * text_feats).sum(-1).mean()
return similarity
# Target: similarity > 0.3 after alignment stage

Problem: Model memorizes training data, poor generalization.
Symptoms:
Training accuracy: 95%
Validation accuracy: 70% (large gap)
Solutions:
# 1. Data augmentation
augmentation = transforms.Compose([
transforms.RandomResizedCrop(336, scale=(0.8, 1.0)),
transforms.ColorJitter(brightness=0.2, contrast=0.2),
transforms.RandomHorizontalFlip(),
])
# 2. Regularization
model = Molmo(..., dropout=0.1) # Add dropout
optimizer = AdamW(
model.parameters(),
lr=2e-5,
weight_decay=0.01 # L2 regularization
)
# 3. Early stopping
best_val_loss = float('inf')
patience = 3
no_improve = 0
for epoch in range(max_epochs):
val_loss = validate(model)
if val_loss < best_val_loss:
best_val_loss = val_loss
save_checkpoint(model)
no_improve = 0
else:
no_improve += 1
if no_improve >= patience:
print("Early stopping")
        break

Problem: Slow training due to variable-length sequences.
Issue:
# Batch with mixed lengths
batch = [
{'image_tokens': 256, 'text_tokens': 50},
{'image_tokens': 256, 'text_tokens': 200}, # Long text
{'image_tokens': 256, 'text_tokens': 30},
]
# All padded to max length (256 + 200 = 456)
# Wasted computation on padding

Solution: Bucket by length
def bucket_batch(samples, bucket_size=32):
"""Group samples by similar total length."""
# Sort by total sequence length
samples = sorted(
samples,
key=lambda x: x['image_tokens'] + x['text_tokens']
)
# Create buckets
buckets = []
for i in range(0, len(samples), bucket_size):
buckets.append(samples[i:i+bucket_size])
return buckets
# Result: Less padding, faster training

Problem: Noisy point annotations hurt model learning.
Example:
Ground truth: (0.45, 0.67)
Noisy annotation: (0.52, 0.71) # Off by ~10 pixels
Model learns approximate pointing, not precise
Quality Control:
def validate_point_annotation(point, image, object_mask):
"""Check if point actually lies on target object."""
x, y = point
H, W = image.shape[:2]
# Convert to pixel coordinates
px = int(x * W)
py = int(y * H)
# Check if point intersects object mask
if not object_mask[py, px]:
return False, "Point not on object"
# Check if point is near center of object
    object_center = compute_center(object_mask)  # normalized (x, y)
    dist = np.linalg.norm(np.asarray(point) - np.asarray(object_center))
if dist > 0.2: # More than 20% away from center
return False, "Point too far from object center"
return True, "Valid"
# Use in data pipeline
valid_samples = [
s for s in all_samples
if validate_point_annotation(s['point'], s['image'], s['mask'])[0]
]

- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
  - Authors: Matt Deitke et al., Allen Institute for AI
  - Link: https://molmo.allenai.org/paper.pdf (2024)
  - Key Contribution: First fully open vision-language model with transparent data
- PixMo Dataset Paper
  - Description: Comprehensive documentation of data collection and curation
  - Link: https://molmo.allenai.org/pixmo.pdf
  - Key Contribution: Reproducible, high-quality multimodal dataset
- OLMo: Open Language Model
  - Link: https://allenai.org/olmo
  - Relation: Molmo builds on the OLMo language model (also fully open)
- Unified-IO 2
  - Link: https://arxiv.org/abs/2312.17172
  - Relation: Prior AI2 work on unified multimodal models
- Official Model Repository
  - GitHub: https://github.com/allenai/molmo
  - Includes training code, data loading, and evaluation scripts
- Model Weights on Hugging Face
  - Molmo-7B: https://huggingface.co/allenai/Molmo-7B
  - Molmo-72B: https://huggingface.co/allenai/Molmo-72B
  - MolmoE-1B: https://huggingface.co/allenai/MolmoE-1B
- PixMo Dataset
  - Hugging Face: https://huggingface.co/datasets/allenai/pixmo
  - Subsets: pixmo-cap, pixmo-ask, pixmo-vqa, pixmo-point
- RefCOCO/RefCOCO+ Datasets
  - Link: https://github.com/lichengunc/refer
  - Use: Spatial referring-expression benchmarks
- PointQA Benchmark
  - Description: Spatial question answering with pointing
  - Developed alongside Molmo for evaluation
- AI2 Mosaic Community
  - Forum: https://mosaic.allenai.org
  - Discussion: Technical questions and model usage
- Model Demo
  - Interactive demo: https://molmo.allenai.org/demo
  - Try pointing and visual question answering
- The Gradient: "Why Open Data Matters for AI"
  - Article discussing the importance of data transparency
  - Context for Molmo's open approach