Add raw text dataset support to ActivationsStore #11

Open: HowardHsuuu wants to merge 1 commit into LLM-Interp:master from HowardHsuuu:add-raw-dataset-tokenization

@HowardHsuuu

Summary

This PR adds support for using either pre-tokenized datasets or raw text datasets in ActivationsStore.

The existing is_dataset_tokenized config option is now respected:

  • is_dataset_tokenized=True: preserves the current behavior and reads token IDs from tokens or input_ids
  • is_dataset_tokenized=False: reads raw text from dataset_text_column and tokenizes lazily with model.tokenizer

Both paths are normalized into the same token stream before batching and activation generation, so the downstream activation/training pipeline does not need to know whether the source dataset was raw text or pre-tokenized.
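The normalization described above can be illustrated with a minimal sketch. This is not the PR's actual code: `iter_token_ids`, `toy_tokenize`, the `"text"` default for `dataset_text_column`, and the choice to check `tokens` before `input_ids` are all assumptions made for illustration, and `toy_tokenize` merely stands in for `model.tokenizer`.

```python
from typing import Any, Callable, Iterable, Iterator


def iter_token_ids(
    rows: Iterable[dict[str, Any]],
    is_dataset_tokenized: bool,
    tokenize: Callable[[str], list[int]],
    dataset_text_column: str = "text",  # assumed default, not confirmed by the PR
) -> Iterator[list[int]]:
    """Yield one list of token IDs per dataset row, whatever the source format."""
    for row in rows:
        if is_dataset_tokenized:
            # Pre-tokenized path: read IDs from "tokens" or "input_ids".
            # Checking "tokens" first is an assumption of this sketch.
            if "tokens" in row:
                ids = row["tokens"]
            elif "input_ids" in row:
                ids = row["input_ids"]
            else:
                raise KeyError("expected a 'tokens' or 'input_ids' column")
            yield list(ids)
        else:
            # Raw-text path: tokenize lazily, one row at a time.
            yield tokenize(row[dataset_text_column])


def toy_tokenize(text: str) -> list[int]:
    # Hypothetical stand-in for model.tokenizer: one fake ID per word.
    return [len(word) for word in text.split()]


raw_rows = [{"content": "raw text dataset"}, {"content": "short"}]
tokenized_rows = [{"tokens": [3, 4, 7]}, {"input_ids": [5]}]

raw_stream = list(
    iter_token_ids(raw_rows, False, toy_tokenize, dataset_text_column="content")
)
tokenized_stream = list(iter_token_ids(tokenized_rows, True, toy_tokenize))
```

Because both calls return the same kind of iterator, any batching code that consumes it would not need to branch on the dataset format, which is the property the summary describes.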

Also included:

  • Adds dataset_text_column to CLTTrainingRunnerConfig and AutoInterpConfig
  • Updates generate_and_save_activations(number_of_tokens=None) to work with raw text datasets
  • Adds a short README note for raw-text dataset usage
  • Adds a CHANGELOG entry

Tests

Added focused tests in tests/training/test_activation_store_dataset_tokens.py.

These cover:

  • tokenized datasets using tokens
  • fallback from input_ids to tokens
  • raw text datasets tokenized through model.tokenizer
  • custom dataset_text_column
  • short raw text sequences
  • ActivationsStore iteration for both raw and pre-tokenized datasets
  • raw text activation cache generation through generate_and_save_activations(number_of_tokens=None)
