bonniesimon/markov-chain-js
letter-markov-js

A character-level Markov chain text generator trained on Shakespeare's complete works. Uses probabilistic selection based on weighted character transitions to generate Shakespeare-style text.

Quick Start

import Markov from "./markov.js";

const markov = new Markov();
await markov.load();        // Load cached model or process source
markov.run(1000);           // Generate 1000 characters

Sample Output:

The sorrow that I have, the sun of the common cries
Where thou shalt thou speak to him

How It Works

The generator builds a character-level state machine from the source text, tracking which characters follow each character and how often. During generation, it uses weighted random selection to choose the next character based on learned probabilities.

stateDiagram-v2
    [*] --> T: START (100%)
    T --> h: 75%
    T --> r: 20%
    T --> newline: 5%
    h --> e: 85%
    h --> a: 10%
    h --> i: 5%
    e --> space: 40%
    e --> n: 25%
    e --> r: 20%
    e --> newline: 15%
    space --> T: 30%
    space --> a: 25%
    space --> w: 20%
    space --> newline: 25%
    newline --> [*]

Edge weights represent transition probabilities learned from Shakespeare's text. Higher percentages indicate more frequent character sequences.

Technical Highlights

Architecture:

  • Markov class: State machine orchestrator with async I/O
  • CharBucket class: Hash-based frequency storage {char: {nextChar: count}}
  • Private fields for encapsulation (#storage, #FILE, #JSON_CACHE)

Core Algorithm (markov.js:61-78):

#findNextChar(current) {
  const transitions = this.charBucket.get(current);
  const entries = Object.entries(transitions);

  // Sum all transition frequencies
  const total = entries.reduce((sum, [_, freq]) => sum + freq, 0);

  // Weighted random selection
  const random = Math.random() * total;
  let running = 0;

  for (const [char, freq] of entries) {
    running += freq;
    if (random < running) return char;  // O(n) selection
  }
}

Data Structure:

{
  "T": { "h": 8234, "r": 1523, "o": 2341 },
  "h": { "e": 9821, "a": 1234, "i": 876 },
  "e": { " ": 12456, "n": 3421, "r": 2134 }
}
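The frequency store above can be modeled with a thin wrapper over a plain object. This is a minimal sketch of that shape; the method names `add` and `get` are illustrative assumptions, not necessarily the repo's exact CharBucket API:

```javascript
// Minimal sketch of a frequency store matching the shape above.
// Method names (add, get) are illustrative, not the repo's exact API.
class CharBucket {
  #storage = {};

  // Record one observation of `next` following `char`.
  add(char, next) {
    const bucket = (this.#storage[char] ??= {});
    bucket[next] = (bucket[next] ?? 0) + 1;
  }

  // Return the {nextChar: count} map for `char` (empty if unseen).
  get(char) {
    return this.#storage[char] ?? {};
  }
}

const b = new CharBucket();
b.add("T", "h");
b.add("T", "h");
b.add("T", "r");
console.log(b.get("T")); // { h: 2, r: 1 }
```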

Installation & Usage

Requirements: Bun.js (or Node.js with ES module support)

git clone <repo-url>
cd letter-markov-js
bun run index.js

API:

  • await markov.load() - Loads cached model or processes source file (first run only)
  • markov.run(steps) - Generates steps characters with 50ms typing delay
  • markov.debug(char) - Inspect transition data for a specific character

Example:

const markov = new Markov();
await markov.load();

// Generate 500 characters
markov.run(500);

// Debug: see what follows 'T'
console.log(markov.debug('T'));
// Output: { h: 8234, r: 1523, o: 2341, ... }

Algorithm Deep Dive

Training Phase (markov.js:93-113)

  1. Stream file line-by-line using createReadStream + readline
  2. Process each character pair: For each character at position i, record line[i+1] as a successor
  3. Special markers:
    • START → firstChar for line beginnings
    • lastChar → \n for line endings
    • \n → firstChar to connect lines
  4. Increment frequencies in CharBucket hash map
  5. Cache to JSON (data/shakespeare.json) after processing
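The per-line bookkeeping in steps 2–3 can be sketched as a single pass over each line. The helper names here (`trainLine`, `record`) are assumptions for illustration, and the `\n → firstChar` link is recorded against the current line's first character as an approximation of connecting consecutive lines:

```javascript
// Sketch of the training pass over one line of text.
// `model` uses the same {char: {nextChar: count}} shape as above;
// the helper name `trainLine` is illustrative, not the repo's API.
function trainLine(model, line) {
  const record = (from, to) => {
    const bucket = (model[from] ??= {});
    bucket[to] = (bucket[to] ?? 0) + 1;
  };

  if (line.length === 0) return;
  record("START", line[0]);            // START -> firstChar
  for (let i = 0; i < line.length - 1; i++) {
    record(line[i], line[i + 1]);      // char -> successor
  }
  record(line[line.length - 1], "\n"); // lastChar -> newline
  record("\n", line[0]);               // newline -> firstChar
}

const model = {};
trainLine(model, "The");
console.log(model);
// { START: { T: 1 }, T: { h: 1 }, h: { e: 1 }, e: { "\n": 1 }, "\n": { T: 1 } }
```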

Generation Phase (markov.js:61-78)

  1. Start with current = "START"
  2. Get all possible next characters from CharBucket
  3. Sum their frequencies (e.g., {h: 8234, r: 1523} → total = 9757)
  4. Generate random number in [0, total)
  5. Walk through transitions, accumulating frequencies until random number is reached
  6. Return selected character and repeat

Example: If current = "T" and transitions = {h: 8234, r: 1523, o: 2341}:

  • Total = 12098
  • Random = 7500
  • Walk: 0 + 8234 = 8234 > 7500 → Return "h" (68% probability)
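The walk above can be reproduced with a deterministic variant of the selection loop, where the random draw is passed in as a parameter instead of coming from Math.random() (the function name is illustrative):

```javascript
// Deterministic version of the weighted walk: `draw` is passed in
// instead of calling Math.random(), so the example is reproducible.
function selectChar(transitions, draw) {
  let running = 0;
  for (const [char, freq] of Object.entries(transitions)) {
    running += freq;
    if (draw < running) return char;
  }
}

const transitions = { h: 8234, r: 1523, o: 2341 }; // total = 12098
console.log(selectChar(transitions, 7500));  // "h" (draw falls in [0, 8234))
console.log(selectChar(transitions, 9000));  // "r" (falls in [8234, 9757))
console.log(selectChar(transitions, 10000)); // "o" (falls in [9757, 12098))
```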

This creates text that statistically resembles the source while maintaining readable character-level patterns.

Design Decisions

Why character-level? Simpler than word-level Markov chains, demonstrates core probability concepts, handles punctuation and capitalization naturally.

Why probabilistic selection? Recent evolution (see commit 68b5a9c) from popularity-based selection. Reflects actual source distribution—if "e" follows "th" 450 times vs "a" 150 times, "e" should appear 3× more often in generated text.

Why 50ms typing delay? Creates readable output stream, demonstrates async patterns, simulates human reading speed.
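The paced output loop might look like the following sketch, assuming a step function in the style of #findNextChar; the `sleep` helper and the returned string are illustrative additions, not the repo's code:

```javascript
// Sketch of a typing-style output loop with a per-character delay.
// `nextChar` stands in for the chain's #findNextChar step.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function run(nextChar, steps, delayMs = 50) {
  let current = "START";
  let out = "";
  for (let i = 0; i < steps; i++) {
    current = nextChar(current);
    out += current;
    process.stdout.write(current);
    await sleep(delayMs); // paced output, ~20 chars/second at 50 ms
  }
  return out;
}

// Toy chain: always emits "a" so only the pacing is demonstrated.
const text = await run(() => "a", 5, 10);
```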

Built with Bun.js • Zero dependencies • ~3.5 KB of code
