TerraMind and the Ghost in the Machine: A Quest to Make a Vision-Language Model Speak #1018

@restupi

The promise of multimodal AI is breathtaking. Imagine showing a model a satellite image and simply asking, "What is the state of the crops in this area?" and receiving a detailed, natural language answer. This is the promise of models like TerraMind, a foundation model for Earth Observation.

Our quest was to make this promise a reality using a local setup with the terramind_v1_large_generate model. What followed was not a simple model(input) call, but a deep dive into the internals of PyTorch and TerraTorch, a battle against cryptic error messages, and a face-off with what we can only describe as a "ghost in the machine."

This is the story of our struggle.

The Villains of the Piece: Three Unwavering Ghosts

Every hero's journey has its villains. Ours appeared as three cryptic warnings that haunted our every attempt:

  1. UserWarning: Tokenizer for output modality caption not found.
  2. UserWarning: Cannot find EOS token in caption input, ignoring input.
  3. IndexError: too many indices for tensor of dimension 1

These messages weren't just errors; they were a riddle. The model was telling us it couldn't find the tool it needed to understand text, and when it tried to proceed anyway, it failed catastrophically. Our goal became clear: we had to give TerraMind its voice.

The Chronicle of Attempts: A Descent into the Model's Soul

Our journey to solve this riddle took us through several layers of the PyTorch and TerraTorch framework.

Attempt 1: The Naive Bridge. Our first idea was to build a wrapper around a standard Hugging Face Llama 3.1 tokenizer. This wrapper would have the encode() method TerraMind expected. The result? An AttributeError, because the wrapper wasn't a PyTorch nn.Module.
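The failure mode of Attempt 1 can be reproduced in isolation. PyTorch's `nn.Module.__setattr__` only adds `nn.Module` instances to the module tree; a plain Python wrapper is stored as an ordinary attribute and stays invisible to anything that walks the registered submodules. A minimal sketch (the class names here are ours, not TerraTorch's):

```python
import torch.nn as nn

class Host(nn.Module):
    pass

class PlainWrapper:
    """Plain Python class -- PyTorch will not track it as a submodule."""
    def encode(self, text):
        return [101, 2023, 128009]

class ModuleWrapper(nn.Module):
    """Same interface, but a real nn.Module."""
    def encode(self, text):
        return [101, 2023, 128009]

host = Host()
host.plain = PlainWrapper()      # stored as an ordinary attribute only
host.wrapped = ModuleWrapper()   # registered via nn.Module.__setattr__

names = dict(host.named_modules())
print("plain" in names, "wrapped" in names)  # False True
```

Anything in the framework that discovers components by traversing the module tree will therefore never see the plain wrapper.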

Attempt 2: The __dict__ Hack. We learned we needed to wrap our tokenizer in an nn.ModuleDict. We tried to assign this directly to the model's internal dictionary with model.__dict__['tokenizer'] = .... The result? Progress! The AttributeError was gone, but we were met with the IndexError: too many indices. We were giving the model a 1D tensor when it expected a 2D one.
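The `IndexError` from Attempt 2 is a plain shape problem and reproduces in three lines. (The `[:, -1]` indexing is our guess at what TerraMind does internally with a batched sequence; any batch-axis indexing on a 1D tensor fails the same way.)

```python
import torch

ids = torch.tensor([101, 2023, 128009])  # shape (3,): one sequence, no batch dimension
try:
    ids[:, -1]                           # batched indexing on a 1D tensor
except IndexError as e:
    print(e)                             # too many indices for tensor of dimension 1

batched = ids.unsqueeze(0)               # shape (1, 3): add the batch dimension
print(batched[:, -1])                    # tensor([128009])
```

This is why the wrapper's `encode()` must end with `.unsqueeze(0)`.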

Attempt 3: The add_module() Back Door. We discovered PyTorch's official method for adding submodules and used model.add_module(). The result? The IndexError was fixed, but the first ghost returned: Tokenizer for output modality caption not found. The model still refused to see our registered component.
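`add_module()` is PyTorch's documented way to attach a submodule after construction, and it genuinely places the component into the module tree. A minimal sketch with a placeholder class (our stand-in, not TerraTorch code):

```python
import torch.nn as nn

class DummyTokenizer(nn.Module):
    """Stand-in for a real tokenizer wrapper."""
    pass

model = nn.Module()
model.add_module("tokenizer", nn.ModuleDict({"caption": DummyTokenizer()}))

# The tokenizer is now a real, named part of the module tree ...
print("tokenizer.caption" in dict(model.named_modules()))  # True
# ... but nothing forces the forward pass to *look it up* through this path.
```

Registration makes the component discoverable; it does not rewire code that already holds a different reference.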

Attempt 4: The modality_info Deep Dive. Debugging showed the model was checking for an eos_token_id in a nested configuration object. We found it, updated it with the correct ID from our tokenizer (128009), and thought we had won. The result? The ghost remained. The warnings persisted.

Attempt 5: The Factory Bypass. We suspected the FullModelFactory was the source of our problems. We tried to get the build function directly from the registry and call it manually. The result? A TypeError: 'NoneType' object is not callable. The factory's internal structure was more complex than we thought.

Attempt 6: The Final Synthesis. We returned to the FullModelFactory, but this time we passed our tokenizer during the build process using the tokenizer_dict argument. This caused a TypeError about multiple values for the keyword argument. It seemed the factory was trying to set a default tokenizer and use ours.
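The `TypeError` from Attempt 6 is the classic "multiple values for keyword argument" collision. A pure-Python sketch of the mechanism (this is an illustration, not the factory's actual code): a wrapper that unconditionally injects its own default *and* forwards `**kwargs` passes the same keyword twice.

```python
def _build(name, tokenizer_dict=None):
    return tokenizer_dict

def factory_build(name, **kwargs):
    # The wrapper injects its own default tokenizer_dict ...
    return _build(name, tokenizer_dict={"caption": "default"}, **kwargs)

# ... so passing our own via kwargs collides with the injected one.
try:
    factory_build("terramind_v1_large_generate", tokenizer_dict={"caption": "ours"})
except TypeError as e:
    print(e)  # _build() got multiple values for keyword argument 'tokenizer_dict'
```

If this matches the factory's behavior, the default tokenizer wins the build, which is consistent with our post-build hacks never taking effect.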

This led us to our current, most robust approach: build the model with the factory, then manually apply our post-build hacks.

The Unsolvable Mystery: A Ghost That Sees Nothing

And this is where we are now. Our final script is a masterpiece of accumulated knowledge. It does everything right:

  1. It builds the model with the FullModelFactory.
  2. It creates a perfect CaptionTokenizerWrapper that returns a 2D tensor with the correct EOS token.
  3. It registers the tokenizer in the model's __dict__.
  4. It finds the nested modality_info and updates the eos_token_id to match our tokenizer.

The debug logs confirm our success at every step:

✅ Tokenizer of 'caption' added to model.tokenizer.
✅ modality_info['caption'].eos_token_id updated to: 128009

The tensor we pass to the model contains the EOS token. The model's configuration knows the correct EOS token ID.

And yet, the ghosts remain. The forward pass still throws the same warnings and the same IndexError. It's as if the forward method is looking at a completely different, "ghost" version of the tokenizer and modality_info—one that we can see but cannot touch.

Our final hypothesis is a deep-seated initialization or caching issue within the TerraTorch framework, where the forward pass uses a reference to the tokenizer that is determined during the model's build and cannot be altered afterward.
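That hypothesis, a reference captured at build time that later attribute surgery cannot reach, is easy to demonstrate in plain Python (the `Pipeline` class below is a toy stand-in, not TerraTorch code):

```python
class OldTokenizer:
    def encode(self, text):
        return f"old:{text}"

class NewTokenizer:
    def encode(self, text):
        return f"new:{text}"

class Pipeline:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        # Bound method captured at build time -- the future "ghost".
        self._encode = tokenizer.encode

    def forward(self, text):
        return self._encode(text)   # uses the captured reference, not self.tokenizer

p = Pipeline(OldTokenizer())
p.tokenizer = NewTokenizer()        # post-build swap, like our __dict__ hack
print(p.tokenizer.encode("x"))      # new:x -- the attribute we can see
print(p.forward("x"))               # old:x -- the reference the forward pass uses
```

If TerraTorch's generation path holds any such captured reference (a bound method, a closure, or a copied config dict), every post-build mutation we make is cosmetic.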

The Code of the Quest

Here is the state-of-the-art script that brings us to this final impasse. It combines all the hacks we've learned, yet the mystery remains unsolved.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import torch
import rioxarray as rxr
import torch.nn.functional as F
from pathlib import Path
from terratorch.models import FullModelFactory
from transformers import AutoTokenizer
import torch.nn as nn

# --- CONFIGURATION ---
TIF_PATH = "data/imagery/sentinel2_allbands_utm.tif"
MODEL_PATH = Path("data/models/TerraMind_v1_large.pt")
LLAMA3_TOKENIZER_DIR = "data/models/Meta-Llama-3-8B"

# --- Wrapper that creates the correct tensor ---
class CaptionTokenizerWrapper(nn.Module):
    def __init__(self, path_dir):
        super().__init__()
        self.text_tokenizer = AutoTokenizer.from_pretrained(path_dir, local_files_only=True, trust_remote_code=True)
        self.eos_token_id = self.text_tokenizer.eos_token_id
        self.bos_token_id = self.text_tokenizer.bos_token_id
        self.pad_token_id = self.text_tokenizer.pad_token_id

    def encode(self, text, device):
        ids = self.text_tokenizer.encode(text, add_special_tokens=False)
        ids.append(self.eos_token_id)
        return torch.tensor(ids, device=device).unsqueeze(0)

    def decode_text(self, out_dict, skip_special_tokens=True):
        ids = out_dict[0].cpu().tolist()
        return self.text_tokenizer.decode(ids, skip_special_tokens=skip_special_tokens)

def main():
    # --- Build the model ---
    factory = FullModelFactory()
    model = factory.build_model(
        "terramind_v1_large_generate",
        modalities=["S2L2A"],
        output_modalities=["caption"],
        pretrained=False,
        img_size=224,
        standardize=True
    )
    state_dict = torch.load(str(MODEL_PATH), map_location="cpu")
    if isinstance(state_dict, dict) and "state_dict" in state_dict:
        state_dict = state_dict["state_dict"]
    model.load_state_dict(state_dict, strict=False)
    model.eval()
    print("✅ TerraMind Large model loaded.")

    # --- Create and register our tokenizer ---
    caption_tok = CaptionTokenizerWrapper(LLAMA3_TOKENIZER_DIR)
    # NOTE: writing into __dict__ bypasses nn.Module.__setattr__, so this is
    # never registered as a submodule -- deliberate, per Attempt 2.
    model.__dict__['tokenizer'] = nn.ModuleDict({"caption": caption_tok})
    print("✅ Custom 'caption' tokenizer registered.")

    # --- Update the nested modality_info ---
    for name, module in model.named_modules():
        if hasattr(module, 'modality_info'):
            module.modality_info['caption']['eos_token_id'] = caption_tok.eos_token_id
            print(f"✅ modality_info['caption'].eos_token_id updated in {name}.")
            break

    # --- Load and preprocess image ---
    arr = rxr.open_rasterio(TIF_PATH).values
    tensor = torch.tensor(arr, dtype=torch.float32).unsqueeze(0)
    tensor = F.interpolate(tensor, size=(224, 224), mode="bilinear", align_corners=False)

    # --- The fateful call ---
    pregunta = "Describe the state of the crops in this area."
    with torch.no_grad():
        outputs = model({"S2L2A": tensor, "caption": pregunta})

    # --- Decode the answer ---
    if "caption" in outputs:
        respuesta = caption_tok.decode_text(outputs["caption"])
        print("\n🧠 Model's Response:")
        print(respuesta)
    else:
        print("\n⚠️ No caption was generated.")

if __name__ == "__main__":
    main()

A Call to All AI Wizards

This is where we turn to you, the community. We have exhausted the standard approaches and delved deep into the framework's internals.

Has anyone else faced this behavior when trying to register a custom component post-build?
Is there a canonical, documented way to inject a custom tokenizer into a pre-built TerraTorch model?
Are we missing something fundamental about PyTorch's module registration that causes this "ghost" reference issue?
Is there a way to force a re-initialization of the model's internal references?
