Let's take a book with long, engaging chapters (e.g., hyperion_djvu.txt) and divide it into chapters/sections (Prologue, Chapter 1, Chapter 2, ..., Chapter 6, Epilogue).
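A minimal sketch of the chapter split, assuming the section headings appear on their own lines; the regex is a guess at how headings are formatted in hyperion_djvu.txt and will likely need adjusting:

```python
import re

def split_into_chapters(text):
    """Split raw book text into {section name: section body}.

    Assumes headings like PROLOGUE / CHAPTER 1 / EPILOGUE sit alone on a
    line; adapt the pattern to the actual file's formatting.
    """
    pattern = re.compile(r"^(PROLOGUE|CHAPTER [A-Z0-9 ]+|EPILOGUE)\s*$",
                         re.MULTILINE | re.IGNORECASE)
    # re.split with a capturing group yields [preamble, heading1, body1, ...]
    pieces = pattern.split(text)
    return {pieces[i].strip().title(): pieces[i + 1].strip()
            for i in range(1, len(pieces) - 1, 2)}
```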
For each chapter (can be parallelized in different threads):
- Use TinyLlama-1.1B to embed each token, producing a number-of-tokens by number-of-embedding-dimensions matrix for this chapter.
- For each of 100 "particles":
- Project it forward by iteratively predicting next tokens (until we reach a stop token). If we run out of context, slide the window forward to keep only the last <length-of-context-window - 1> tokens.
- Store the embeddings of the predicted token sequences (as a length number-of-particles list of number-of-tokens-for-that-particle by number-of-embedding-dimensions matrices).
Save everything as a pkl file (one per chapter).
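The per-chapter steps above might be sketched as follows. This is not a tested implementation: the checkpoint id, the choice of last-layer hidden states as token embeddings, multinomial sampling, the 512-token generation cap, and the pickle layout are all assumptions, not decisions from this issue:

```python
import pickle

def slide_window(ids, max_ctx):
    # When the context fills up, keep only the last max_ctx - 1 tokens.
    return ids if len(ids) < max_ctx else ids[-(max_ctx - 1):]

def process_chapter(chapter_text, out_path, n_particles=100, max_new_tokens=512):
    # Heavy imports deferred so slide_window stays dependency-free.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed checkpoint id
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()
    max_ctx = model.config.max_position_embeddings

    ids = tok(chapter_text, return_tensors="pt").input_ids[0]

    def embed(token_ids):
        # Last-layer hidden state of each token, chunked to fit the context.
        chunks = [token_ids[i:i + max_ctx]
                  for i in range(0, len(token_ids), max_ctx)]
        with torch.no_grad():
            outs = [model(c[None], output_hidden_states=True).hidden_states[-1][0]
                    for c in chunks]
        return torch.cat(outs).numpy()  # (n_tokens, embed_dim)

    chapter_embeddings = embed(ids)

    particles = []
    for _ in range(n_particles):
        ctx, generated = ids.clone(), []
        for _ in range(max_new_tokens):
            with torch.no_grad():
                logits = model(slide_window(ctx, max_ctx)[None]).logits[0, -1]
            nxt = torch.multinomial(torch.softmax(logits, -1), 1)  # sample
            if nxt.item() == tok.eos_token_id:
                break
            generated.append(nxt.item())
            ctx = torch.cat([ctx, nxt])
        particles.append(embed(torch.tensor(generated)) if generated else None)

    with open(out_path, "wb") as f:
        pickle.dump({"chapter": chapter_embeddings, "particles": particles}, f)
```

Sampling (rather than greedy argmax) is what makes the 100 particles diverge; with greedy decoding every particle would trace the same path.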
Then, once all chapters' pkl files are saved out:
- Concatenate all of the embedded tokens into an enormous total-number-of-tokens by number-of-embedding-dimensions matrix (across all chapters and particles)
- Project into 2D using UMAP
- Split the concatenated matrix back out into separate chapters/particles
- Save out a pkl file with the 2D projections
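One way to sketch the concatenate/project/split-back steps; the pickle layout matches the per-chapter format assumed earlier, and the `umap-learn` usage and file paths are illustrative:

```python
import pickle
import numpy as np

def split_rows(stacked, lengths):
    # Undo np.vstack: return consecutive row blocks of the given lengths.
    out, start = [], 0
    for n in lengths:
        out.append(stacked[start:start + n])
        start += n
    return out

def project_all(pkl_paths, out_path="umap_2d.pkl"):
    import umap  # from the umap-learn package; deferred import

    blocks, keys = [], []
    for path in sorted(pkl_paths):
        with open(path, "rb") as f:
            data = pickle.load(f)
        blocks.append(data["chapter"])
        keys.append((path, "chapter"))
        for i, p in enumerate(data["particles"]):
            if p is not None and len(p):
                blocks.append(p)
                keys.append((path, f"particle_{i}"))

    # One enormous (total_tokens, embed_dim) matrix across all chapters
    # and particles, projected to 2D in a single shared UMAP space.
    xy = umap.UMAP(n_components=2).fit_transform(np.vstack(blocks))

    out = {}
    for (path, key), rows in zip(keys, split_rows(xy, [len(b) for b in blocks])):
        out.setdefault(path, {})[key] = rows
    with open(out_path, "wb") as f:
        pickle.dump(out, f)
```

Fitting one UMAP on the full concatenation (rather than per chapter) is what keeps all panels in a common coordinate system.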
Then make a plot with one panel per chapter. The blue lines are chapter trajectories; the red lines (projecting forward from the end of each chapter) are the particles' predictions; the blue dots mark the start and end of each chapter.
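A plotting sketch following those conventions, assuming the 2D-projection pickle layout used above; figure size, file names, and styling are placeholders:

```python
import pickle
import matplotlib
matplotlib.use("Agg")  # headless backend for batch runs
import matplotlib.pyplot as plt

def plot_trajectories(proj_path="umap_2d.pkl", fig_path="trajectories.png"):
    """One panel per chapter: blue chapter trajectory, red particle
    rollouts projecting forward from the chapter's end, blue dots at
    the chapter's start and end."""
    with open(proj_path, "rb") as f:
        proj = pickle.load(f)
    n = len(proj)
    fig, axes = plt.subplots(1, n, figsize=(4 * n, 4), squeeze=False)
    for ax, (name, mats) in zip(axes.ravel(), sorted(proj.items())):
        traj = mats["chapter"]
        ax.plot(traj[:, 0], traj[:, 1], color="blue", lw=0.5)
        for key, m in mats.items():
            if key.startswith("particle") and len(m):
                # Prepend the chapter's final point so each prediction
                # visibly projects forward from the end of the chapter.
                ax.plot([traj[-1, 0], *m[:, 0]], [traj[-1, 1], *m[:, 1]],
                        color="red", lw=0.3, alpha=0.5)
        ax.scatter(*traj[[0, -1]].T, color="blue", zorder=3)  # start/end dots
        ax.set_title(name)
        ax.set_xticks([])
        ax.set_yticks([])
    fig.savefig(fig_path, bbox_inches="tight")
```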