bonniesimon/markov-chain-js
letter-markov-js

A character-level Markov chain text generator trained on Shakespeare's complete works. Uses probabilistic selection based on weighted character transitions to generate Shakespeare-style text.

Quick Start

import Markov from "./markov.js";

const markov = new Markov();
await markov.load();        // Load cached model or process source
markov.run(1000);           // Generate 1000 characters

Sample Output:

The sorrow that I have, the sun of the common cries
Where thou shalt thou speak to him

How It Works

The generator builds a character-level state machine from the source text, tracking which characters follow each character and how often. During generation, it uses weighted random selection to choose the next character based on learned probabilities.

stateDiagram-v2
    [*] --> T: START (100%)
    T --> h: 75%
    T --> r: 20%
    T --> newline: 5%
    h --> e: 85%
    h --> a: 10%
    h --> i: 5%
    e --> space: 40%
    e --> n: 25%
    e --> r: 20%
    e --> newline: 15%
    space --> T: 30%
    space --> a: 25%
    space --> w: 20%
    space --> newline: 25%
    newline --> [*]

Edge weights represent transition probabilities learned from Shakespeare's text. Higher percentages indicate more frequent character sequences.

Technical Highlights

Architecture:

  • Markov class: State machine orchestrator with async I/O
  • CharBucket class: Hash-based frequency storage {char: {nextChar: count}}
  • Private fields for encapsulation (#storage, #FILE, #JSON_CACHE)

Core Algorithm (markov.js:61-78):

#findNextChar(current) {
  const transitions = this.charBucket.get(current);
  const entries = Object.entries(transitions);

  // Sum all transition frequencies
  const total = entries.reduce((sum, [_, freq]) => sum + freq, 0);

  // Weighted random selection
  const random = Math.random() * total;
  let running = 0;

  for (const [char, freq] of entries) {
    running += freq;
    if (random < running) return char;  // O(n) selection
  }
}

Data Structure:

{
  "T": { "h": 8234, "r": 1523, "o": 2341 },
  "h": { "e": 9821, "a": 1234, "i": 876 },
  "e": { " ": 12456, "n": 3421, "r": 2134 }
}
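The frequency store above can be modeled with a thin wrapper over a plain object. This is a minimal sketch of that shape; the method names `add` and `get` are illustrative assumptions, not necessarily the repo's exact CharBucket API:

```javascript
// Minimal sketch of a frequency store matching the shape above.
// Method names (add, get) are illustrative, not the repo's exact API.
class CharBucket {
  #storage = {};

  // Record one observation of `next` following `char`.
  add(char, next) {
    const bucket = (this.#storage[char] ??= {});
    bucket[next] = (bucket[next] ?? 0) + 1;
  }

  // Return the {nextChar: count} map for `char` (empty if unseen).
  get(char) {
    return this.#storage[char] ?? {};
  }
}

const b = new CharBucket();
b.add("T", "h");
b.add("T", "h");
b.add("T", "r");
console.log(b.get("T")); // { h: 2, r: 1 }
```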

Installation & Usage

Requirements: Bun.js (or Node.js with ES module support)

git clone <repo-url>
cd letter-markov-js
bun run index.js

API:

  • await markov.load() - Loads cached model or processes source file (first run only)
  • markov.run(steps) - Generates steps characters with 50ms typing delay
  • markov.debug(char) - Inspect transition data for a specific character

Example:

const markov = new Markov();
await markov.load();

// Generate 500 characters
markov.run(500);

// Debug: see what follows 'T'
console.log(markov.debug('T'));
// Output: { h: 8234, r: 1523, o: 2341, ... }

Algorithm Deep Dive

Training Phase (markov.js:93-113)

  1. Stream file line-by-line using createReadStream + readline
  2. Process each character pair: For each character at position i, record line[i+1] as a successor
  3. Special markers:
    • START → firstChar for line beginnings
    • lastChar → \n for line endings
    • \n → firstChar to connect lines
  4. Increment frequencies in CharBucket hash map
  5. Cache to JSON (data/shakespeare.json) after processing
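The per-line bookkeeping in steps 2–3 can be sketched as a single pass over each line. The helper names here (`trainLine`, `record`) are assumptions for illustration, and the `\n → firstChar` link is recorded against the current line's first character as an approximation of connecting consecutive lines:

```javascript
// Sketch of the training pass over one line of text.
// `model` uses the same {char: {nextChar: count}} shape as above;
// the helper name `trainLine` is illustrative, not the repo's API.
function trainLine(model, line) {
  const record = (from, to) => {
    const bucket = (model[from] ??= {});
    bucket[to] = (bucket[to] ?? 0) + 1;
  };

  if (line.length === 0) return;
  record("START", line[0]);            // START -> firstChar
  for (let i = 0; i < line.length - 1; i++) {
    record(line[i], line[i + 1]);      // char -> successor
  }
  record(line[line.length - 1], "\n"); // lastChar -> newline
  record("\n", line[0]);               // newline -> firstChar
}

const model = {};
trainLine(model, "The");
console.log(model);
// { START: { T: 1 }, T: { h: 1 }, h: { e: 1 }, e: { "\n": 1 }, "\n": { T: 1 } }
```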

Generation Phase (markov.js:61-78)

  1. Start with current = "START"
  2. Get all possible next characters from CharBucket
  3. Sum their frequencies (e.g., {h: 8234, r: 1523} → total = 9757)
  4. Generate random number in [0, total)
  5. Walk through transitions, accumulating frequencies until random number is reached
  6. Return selected character and repeat

Example: If current = "T" and transitions = {h: 8234, r: 1523, o: 2341}:

  • Total = 12098
  • Random = 7500
  • Walk: 0 + 8234 = 8234 > 7500 → Return "h" (68% probability)
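The walk above can be reproduced with a deterministic variant of the selection loop, where the random draw is passed in as a parameter instead of coming from Math.random() (the function name is illustrative):

```javascript
// Deterministic version of the weighted walk: `draw` is passed in
// instead of calling Math.random(), so the example is reproducible.
function selectChar(transitions, draw) {
  let running = 0;
  for (const [char, freq] of Object.entries(transitions)) {
    running += freq;
    if (draw < running) return char;
  }
}

const transitions = { h: 8234, r: 1523, o: 2341 }; // total = 12098
console.log(selectChar(transitions, 7500));  // "h" (draw falls in [0, 8234))
console.log(selectChar(transitions, 9000));  // "r" (falls in [8234, 9757))
console.log(selectChar(transitions, 10000)); // "o" (falls in [9757, 12098))
```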

This creates text that statistically resembles the source while maintaining readable character-level patterns.

Design Decisions

Why character-level? Simpler than word-level Markov chains, demonstrates core probability concepts, handles punctuation and capitalization naturally.

Why probabilistic selection? Recent evolution (see commit 68b5a9c) from popularity-based selection. Reflects actual source distribution—if "e" follows "th" 450 times vs "a" 150 times, "e" should appear 3× more often in generated text.

Why 50ms typing delay? Creates readable output stream, demonstrates async patterns, simulates human reading speed.
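The paced output loop might look like the following sketch, assuming a step function in the style of #findNextChar; the `sleep` helper and the returned string are illustrative additions, not the repo's code:

```javascript
// Sketch of a typing-style output loop with a per-character delay.
// `nextChar` stands in for the chain's #findNextChar step.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function run(nextChar, steps, delayMs = 50) {
  let current = "START";
  let out = "";
  for (let i = 0; i < steps; i++) {
    current = nextChar(current);
    out += current;
    process.stdout.write(current);
    await sleep(delayMs); // paced output, ~20 chars/second at 50 ms
  }
  return out;
}

// Toy chain: always emits "a" so only the pacing is demonstrated.
const text = await run(() => "a", 5, 10);
```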

Built with Bun.js • Zero dependencies • ~3.5 KB of code
