TerraMind and the Ghost in the Machine: A Quest to Make a Vision-Language Model Speak #1018

@restupi

The promise of multimodal AI is breathtaking. Imagine showing a model a satellite image and simply asking, "What is the state of the crops in this area?" and receiving a detailed, natural language answer. This is the promise of models like TerraMind, a foundation model for Earth Observation.

Our quest was to make this promise a reality using a local setup with the terramind_v1_large_generate model. What followed was not a simple model(input) call, but a deep dive into the internals of PyTorch and TerraTorch, a battle against cryptic error messages, and a face-off with what we can only describe as a "ghost in the machine."

This is the story of our struggle.

The Villains of the Piece: Three Unwavering Ghosts

Every hero's journey has its villains. Ours appeared as three cryptic warnings that haunted our every attempt:

  1. UserWarning: Tokenizer for output modality caption not found.
  2. UserWarning: Cannot find EOS token in caption input, ignoring input.
  3. IndexError: too many indices for tensor of dimension 1

These messages weren't just errors; they were a riddle. The model was telling us it couldn't find the tool it needed to understand text, and when it tried to proceed anyway, it failed catastrophically. Our goal became clear: we had to give TerraMind its voice.

The Chronicle of Attempts: A Descent into the Model's Soul

Our journey to solve this riddle took us through several layers of the PyTorch and TerraTorch framework.

Attempt 1: The Naive Bridge. Our first idea was to build a wrapper around a standard Hugging Face Llama 3.1 tokenizer. This wrapper would have the encode() method TerraMind expected. The result? An AttributeError, because the wrapper wasn't a PyTorch nn.Module.
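The failure mode of Attempt 1 can be reproduced in isolation. PyTorch's `nn.Module.__setattr__` only adds `nn.Module` instances to the module tree; a plain Python wrapper is stored as an ordinary attribute and stays invisible to anything that walks the registered submodules. A minimal sketch (the class names here are ours, not TerraTorch's):

```python
import torch.nn as nn

class Host(nn.Module):
    pass

class PlainWrapper:
    """Plain Python class -- PyTorch will not track it as a submodule."""
    def encode(self, text):
        return [101, 2023, 128009]

class ModuleWrapper(nn.Module):
    """Same interface, but a real nn.Module."""
    def encode(self, text):
        return [101, 2023, 128009]

host = Host()
host.plain = PlainWrapper()      # stored as an ordinary attribute only
host.wrapped = ModuleWrapper()   # registered via nn.Module.__setattr__

names = dict(host.named_modules())
print("plain" in names, "wrapped" in names)  # False True
```

Anything in the framework that discovers components by traversing the module tree will therefore never see the plain wrapper.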

Attempt 2: The __dict__ Hack. We learned we needed to wrap our tokenizer in an nn.ModuleDict. We tried to assign this directly to the model's internal dictionary with model.__dict__['tokenizer'] = .... The result? Progress! The AttributeError was gone, but we were met with the IndexError: too many indices. We were giving the model a 1D tensor when it expected a 2D one.
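The `IndexError` from Attempt 2 is a plain shape problem and reproduces in three lines. (The `[:, -1]` indexing is our guess at what TerraMind does internally with a batched sequence; any batch-axis indexing on a 1D tensor fails the same way.)

```python
import torch

ids = torch.tensor([101, 2023, 128009])  # shape (3,): one sequence, no batch dimension
try:
    ids[:, -1]                           # batched indexing on a 1D tensor
except IndexError as e:
    print(e)                             # too many indices for tensor of dimension 1

batched = ids.unsqueeze(0)               # shape (1, 3): add the batch dimension
print(batched[:, -1])                    # tensor([128009])
```

This is why the wrapper's `encode()` must end with `.unsqueeze(0)`.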

Attempt 3: The add_module() Back Door. We discovered PyTorch's official method for adding submodules and used model.add_module(). The result? The IndexError was fixed, but the first ghost returned: Tokenizer for output modality caption not found. The model still refused to see our registered component.
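`add_module()` is PyTorch's documented way to attach a submodule after construction, and it genuinely places the component into the module tree. A minimal sketch with a placeholder class (our stand-in, not TerraTorch code):

```python
import torch.nn as nn

class DummyTokenizer(nn.Module):
    """Stand-in for a real tokenizer wrapper."""
    pass

model = nn.Module()
model.add_module("tokenizer", nn.ModuleDict({"caption": DummyTokenizer()}))

# The tokenizer is now a real, named part of the module tree ...
print("tokenizer.caption" in dict(model.named_modules()))  # True
# ... but nothing forces the forward pass to *look it up* through this path.
```

Registration makes the component discoverable; it does not rewire code that already holds a different reference.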

Attempt 4: The modality_info Deep Dive. Debugging showed the model was checking for an eos_token_id in a nested configuration object. We found it, updated it with the correct ID from our tokenizer (128009), and thought we had won. The result? The ghost remained. The warnings persisted.

Attempt 5: The Factory Bypass. We suspected the FullModelFactory was the source of our problems. We tried to get the build function directly from the registry and call it manually. The result? A TypeError: 'NoneType' object is not callable. The factory's internal structure was more complex than we thought.

Attempt 6: The Final Synthesis. We returned to the FullModelFactory, but this time we passed our tokenizer during the build process using the tokenizer_dict argument. This caused a TypeError about multiple values for the keyword argument. It seemed the factory was trying to set a default tokenizer and use ours.
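The `TypeError` from Attempt 6 is the classic "multiple values for keyword argument" collision. A pure-Python sketch of the mechanism (this is an illustration, not the factory's actual code): a wrapper that unconditionally injects its own default *and* forwards `**kwargs` passes the same keyword twice.

```python
def _build(name, tokenizer_dict=None):
    return tokenizer_dict

def factory_build(name, **kwargs):
    # The wrapper injects its own default tokenizer_dict ...
    return _build(name, tokenizer_dict={"caption": "default"}, **kwargs)

# ... so passing our own via kwargs collides with the injected one.
try:
    factory_build("terramind_v1_large_generate", tokenizer_dict={"caption": "ours"})
except TypeError as e:
    print(e)  # _build() got multiple values for keyword argument 'tokenizer_dict'
```

If this matches the factory's behavior, the default tokenizer wins the build, which is consistent with our post-build hacks never taking effect.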

This led us to our current, most robust approach: build the model with the factory, then manually apply our post-build hacks.

The Unsolvable Mystery: A Ghost That Sees Nothing

And this is where we are now. Our final script is a masterpiece of accumulated knowledge. It does everything right:

  1. It builds the model with the FullModelFactory.
  2. It creates a perfect CaptionTokenizerWrapper that returns a 2D tensor with the correct EOS token.
  3. It registers the tokenizer in the model's __dict__.
  4. It finds the nested modality_info and updates the eos_token_id to match our tokenizer.

The debug logs confirm our success at every step:

✅ Tokenizer of 'caption' added to model.tokenizer.
✅ modality_info['caption'].eos_token_id updated to: 128009

The tensor we pass to the model contains the EOS token. The model's configuration knows the correct EOS token ID.

And yet, the ghosts remain. The forward pass still throws the same warnings and the same IndexError. It's as if the forward method is looking at a completely different, "ghost" version of the tokenizer and modality_info—one that we can see but cannot touch.

Our final hypothesis is a deep-seated initialization or caching issue within the TerraTorch framework, where the forward pass uses a reference to the tokenizer that is determined during the model's build and cannot be altered afterward.
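That hypothesis, a reference captured at build time that later attribute surgery cannot reach, is easy to demonstrate in plain Python (the `Pipeline` class below is a toy stand-in, not TerraTorch code):

```python
class OldTokenizer:
    def encode(self, text):
        return f"old:{text}"

class NewTokenizer:
    def encode(self, text):
        return f"new:{text}"

class Pipeline:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        # Bound method captured at build time -- the future "ghost".
        self._encode = tokenizer.encode

    def forward(self, text):
        return self._encode(text)   # uses the captured reference, not self.tokenizer

p = Pipeline(OldTokenizer())
p.tokenizer = NewTokenizer()        # post-build swap, like our __dict__ hack
print(p.tokenizer.encode("x"))      # new:x -- the attribute we can see
print(p.forward("x"))               # old:x -- the reference the forward pass uses
```

If TerraTorch's generation path holds any such captured reference (a bound method, a closure, or a copied config dict), every post-build mutation we make is cosmetic.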

The Code of the Quest

Here is the state-of-the-art script that brings us to this final impasse. It combines all the hacks we've learned, yet the mystery remains unsolved.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import torch
import rioxarray as rxr
import torch.nn.functional as F
from pathlib import Path
from terratorch.models import FullModelFactory
from transformers import AutoTokenizer
import torch.nn as nn

# --- CONFIGURATION ---
TIF_PATH = "data/imagery/sentinel2_allbands_utm.tif"
MODEL_PATH = Path("data/models/TerraMind_v1_large.pt")
LLAMA3_TOKENIZER_DIR = "data/models/Meta-Llama-3-8B"

# --- Wrapper that creates the correct tensor ---
class CaptionTokenizerWrapper(nn.Module):
    def __init__(self, path_dir):
        super().__init__()
        self.text_tokenizer = AutoTokenizer.from_pretrained(path_dir, local_files_only=True, trust_remote_code=True)
        self.eos_token_id = self.text_tokenizer.eos_token_id
        self.bos_token_id = self.text_tokenizer.bos_token_id
        self.pad_token_id = self.text_tokenizer.pad_token_id

    def encode(self, text, device):
        ids = self.text_tokenizer.encode(text, add_special_tokens=False)
        ids.append(self.eos_token_id)
        return torch.tensor(ids, device=device).unsqueeze(0)

    def decode_text(self, out_dict, skip_special_tokens=True):
        ids = out_dict[0].cpu().tolist()
        return self.text_tokenizer.decode(ids, skip_special_tokens=skip_special_tokens)

def main():
    # --- Build the model ---
    factory = FullModelFactory()
    model = factory.build_model(
        "terramind_v1_large_generate",
        modalities=["S2L2A"],
        output_modalities=["caption"],
        pretrained=False,
        img_size=224,
        standardize=True
    )
    state_dict = torch.load(str(MODEL_PATH), map_location="cpu")
    if isinstance(state_dict, dict) and "state_dict" in state_dict:
        state_dict = state_dict["state_dict"]
    model.load_state_dict(state_dict, strict=False)
    model.eval()
    print("✅ TerraMind Large model loaded.")

    # --- Create and register our tokenizer ---
    caption_tok = CaptionTokenizerWrapper(LLAMA3_TOKENIZER_DIR)
    # NOTE: writing into __dict__ bypasses nn.Module.__setattr__, so this is
    # never registered as a submodule -- deliberate, per Attempt 2.
    model.__dict__['tokenizer'] = nn.ModuleDict({"caption": caption_tok})
    print("✅ Custom 'caption' tokenizer registered.")

    # --- Update the nested modality_info ---
    for name, module in model.named_modules():
        if hasattr(module, 'modality_info'):
            module.modality_info['caption']['eos_token_id'] = caption_tok.eos_token_id
            print(f"✅ modality_info['caption'].eos_token_id updated in {name}.")
            break

    # --- Load and preprocess image ---
    arr = rxr.open_rasterio(TIF_PATH).values
    tensor = torch.tensor(arr, dtype=torch.float32).unsqueeze(0)
    tensor = F.interpolate(tensor, size=(224, 224), mode="bilinear", align_corners=False)

    # --- The fateful call ---
    pregunta = "Describe the state of the crops in this area."
    with torch.no_grad():
        outputs = model({"S2L2A": tensor, "caption": pregunta})

    # --- Decode the answer ---
    if "caption" in outputs:
        respuesta = caption_tok.decode_text(outputs["caption"])
        print("\n🧠 Model's Response:")
        print(respuesta)
    else:
        print("\n⚠️ No caption was generated.")

if __name__ == "__main__":
    main()

A Call to All AI Wizards

This is where we turn to you, the community. We have exhausted the standard approaches and delved deep into the framework's internals.

Has anyone else faced this behavior when trying to register a custom component post-build?
Is there a canonical, documented way to inject a custom tokenizer into a pre-built TerraTorch model?
Are we missing something fundamental about PyTorch's module registration that causes this "ghost" reference issue?
Is there a way to force a re-initialization of the model's internal references?
