Add raw text dataset support to ActivationsStore by HowardHsuuu · Pull Request #11 · LLM-Interp/CLT-Forge

HowardHsuuu · 2026-06-17T14:17:49Z

Summary

This PR adds support for using either pre-tokenized datasets or raw text datasets in ActivationsStore.

The existing is_dataset_tokenized config option is now respected:

is_dataset_tokenized=True: preserves the current behavior and reads token IDs from tokens or input_ids
is_dataset_tokenized=False: reads raw text from dataset_text_column and tokenizes lazily with model.tokenizer

Both paths are normalized into the same token stream before batching and activation generation, so the downstream activation/training pipeline does not need to know whether the source dataset was raw text or pre-tokenized.
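As a rough illustration of the normalization described above (not the code in this PR), the two paths could be folded into one lazy token stream along these lines. The helper name `iter_token_ids`, the stand-in `toy_tokenize` callable, and the `"text"` default for `dataset_text_column` are assumptions made for this sketch. A real run would tokenize with `model.tokenizer` instead.

```python
from typing import Any, Callable, Iterable, Iterator, Mapping, Sequence


def iter_token_ids(
    rows: Iterable[Mapping[str, Any]],
    is_dataset_tokenized: bool,
    tokenize: Callable[[str], Sequence[int]],
    dataset_text_column: str = "text",  # assumed default, not confirmed by the PR
) -> Iterator[list[int]]:
    """Yield one list of token IDs per dataset row, whatever the source format."""
    for row in rows:
        if is_dataset_tokenized:
            # Pre-tokenized path: prefer `tokens`, otherwise use `input_ids`.
            if "tokens" in row:
                ids = row["tokens"]
            elif "input_ids" in row:
                ids = row["input_ids"]
            else:
                raise KeyError("expected a 'tokens' or 'input_ids' column")
        else:
            # Raw-text path: tokenize only when the row is actually consumed.
            ids = tokenize(row[dataset_text_column])
        yield list(ids)


def toy_tokenize(text: str) -> list[int]:
    """Hypothetical stand-in for model.tokenizer: one ID per character."""
    return [ord(ch) for ch in text]


pretokenized = list(iter_token_ids([{"input_ids": [4, 5]}], True, toy_tokenize))
raw = list(iter_token_ids([{"text": "hi"}], False, toy_tokenize))
```

Because the function is a generator, raw text is tokenized row by row as the stream is consumed, and downstream code only ever sees lists of token IDs.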

Also included:

Adds dataset_text_column to CLTTrainingRunnerConfig and AutoInterpConfig
Updates generate_and_save_activations(number_of_tokens=None) to work with raw text datasets
Adds a short README note for raw-text dataset usage
Adds a CHANGELOG entry

Tests

Added focused tests in tests/training/test_activation_store_dataset_tokens.py.

These cover:

tokenized datasets using tokens
tokenized datasets that fall back to input_ids when tokens is absent
raw text datasets tokenized through model.tokenizer
custom dataset_text_column
short raw text sequences
ActivationsStore iteration for both raw and pre-tokenized datasets
raw text activation cache generation through generate_and_save_activations(number_of_tokens=None)
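To show why short raw text sequences deserve their own test, here is a minimal sketch of one common way to batch a token stream: concatenating rows and cutting fixed-size contexts. Whether ActivationsStore packs tokens this way is an assumption. `pack_into_contexts` and `context_size` are hypothetical names for this example only.

```python
from typing import Iterable, Iterator


def pack_into_contexts(
    sequences: Iterable[list[int]], context_size: int
) -> Iterator[list[int]]:
    """Concatenate token sequences and emit fixed-length contexts."""
    buffer: list[int] = []
    for seq in sequences:
        buffer.extend(seq)
        # Emit as many full contexts as the buffer currently holds.
        while len(buffer) >= context_size:
            yield buffer[:context_size]
            buffer = buffer[context_size:]
    # Any trailing partial context is dropped in this sketch.


# Rows shorter than the context size are merged with their neighbours.
short_rows = [[1, 2], [3], [4, 5, 6], [7]]
contexts = list(pack_into_contexts(short_rows, context_size=3))
```

With this approach a row shorter than `context_size` never yields a context by itself, which is the kind of edge case a short-sequence test would exercise.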

Add: support raw text datasets in activation store

ca20095

HowardHsuuu wants to merge 1 commit into LLM-Interp:master from HowardHsuuu:add-raw-dataset-tokenization
