Tracking the history of Chilean humor.
You can visit the previous version of this project (2024) here.
- The raw dataset is available on Hugging Face: astroza/chilean-humor-raw-transcripts
- The processed version of the dataset is available at: astroza/chilean-humor-jokes
The dataset includes 135 comedy routines performed at the Festival de Viña del Mar between 1960 and 2025. All material is publicly available on YouTube.
The audio was automatically transcribed using the chirp_2 speech-to-text model. The resulting segments were then cleaned, merged, and refined using gemini-3-flash-preview.
Using uv:
Windows (PowerShell):
uv venv
.venv\Scripts\activate
uv pip install -r .\requirements.txt

macOS/Linux:
uv venv
source .venv/bin/activate
uv pip install -r ./requirements.txt

If you want Jina embeddings to run on GPU, install CUDA-enabled PyTorch wheels after the base requirements.
Windows (PowerShell):
uv venv
.venv\Scripts\activate
uv pip install -r .\requirements.txt
uv pip uninstall torch torchvision
uv pip install --index-url https://download.pytorch.org/whl/cu124 torch torchvision
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.version.cuda)"

macOS/Linux:
uv venv
source .venv/bin/activate
uv pip install -r ./requirements.txt
uv pip uninstall torch torchvision
uv pip install --index-url https://download.pytorch.org/whl/cu124 torch torchvision
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.version.cuda)"

The check must print True for torch.cuda.is_available().
Generate the two Hugging Face configs from data/2026:
.venv\Scripts\python scripts/build_hf_dataset.py --input-root data/2026 --output-root hf_dataset

This creates:
- hf_dataset/routines/train.parquet
- hf_dataset/routines/train.jsonl
- hf_dataset/segments/train.parquet
- hf_dataset/segments/train.jsonl
- Set your token in .env (or export HF_TOKEN in your shell):

HF_TOKEN=hf_xxx_your_token

- Push both configs (routines and segments) to the same dataset repo:
Windows (PowerShell):
python scripts/push_hf_dataset.py --repo-id <your_username>/chilean-humor-raw-transcripts --dataset-root hf_dataset

macOS/Linux:

python scripts/push_hf_dataset.py --repo-id <your_username>/chilean-humor-raw-transcripts --dataset-root hf_dataset

Optional: add --private if you want the dataset repo to be private when it is first created.
You can precompute Jina embeddings and reuse them across BERTopic runs.
Local provider (transformers + optional GPU):
.venv\Scripts\python scripts/run_topic_modeling.py `
--use-jina-embeddings `
--jina-provider local `
--jina-model-name jinaai/jina-embeddings-v4 `
--jina-task text-matching `
--jina-truncate-dim 128 `
--jina-device auto

Jina API provider (useful when local inference is too slow):

JINA_API_TOKEN=jina_xxx_your_token

.venv\Scripts\python scripts/run_topic_modeling.py `
--use-jina-embeddings `
--jina-provider api `
--jina-model-name jina-embeddings-v4 `
--jina-task text-matching

Notes:
- Embeddings are cached under outputs/topic_modeling/embeddings_cache by default.
- Use --jina-cache-dir <path> to keep cache files somewhere else.
- Set --jina-truncate-dim 0 to disable truncation.
- API mode reads the bearer token from JINA_API_TOKEN by default (or another variable via --jina-api-token-env).
- The API endpoint can be overridden with --jina-api-url, and the timeout with --jina-api-timeout-seconds.
Main outputs for topic analysis are written to outputs/topic_modeling:
- tables/segments_topics.csv: row-level table linked to the original segments dataset (original columns + cleaned text + decade + initial/final topic assignment + outlier flags + max topic probability).
- tables/topic_info.csv: topic metadata and representative documents.
- tables/topics_over_time.csv: topic trends over time.
- tables/hierarchical_topics.csv: hierarchy produced by BERTopic.
- figures/topic_hierarchy.html and figures/topic_hierarchy.png: hierarchical clustering visualization.
- figures/topics_over_time_top_n.html and figures/topics_over_time_top_n.png: temporal trend visualization.
This repository implements two complementary structured-output pipelines:
- Joke extraction pipeline
- Mentalizing (intentionality) analysis pipeline
Both pipelines use structured JSON output enforced by schema validation, ensuring deterministic and reproducible results suitable for downstream analysis.
This work builds upon the cognitive framework proposed by Dunbar et al. (2016), who demonstrated that verbal jokes rely on recursive mentalizing—the ability to represent nested mindstates such as “A thinks that B thinks…”. Their analysis showed that a conversational exchange minimally requires three levels of intentionality, and that most jokes involve approximately three to five levels, with humor effectiveness peaking within this range. Each additional embedded mindstate increases cognitive load, establishing a natural limit on joke complexity based on human mentalizing capacity.
The system processes comedy transcripts in two stages:
Transcript → Joke Extraction → Individual Jokes → Mentalizing Analysis → Intentionality Tree / Graph
Each stage uses:
- deterministic preprocessing (heuristics)
- structured output (JSON schema)
- post-processing validation (Pydantic models)
Extract clean, standalone jokes from noisy transcripts of comedy performances.
This stage converts transcript segments into a list of normalized joke texts.
Transcript in structured segment format:
{
"segments": [
{ "text": "..." },
{ "text": "..." }
]
}

Segments may be incomplete or split across boundaries.
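As a sketch, the segment format above can be parsed with the standard library. The function name and the filtering of empty segments are illustrative assumptions, not the repo's actual loader:

```python
import json

def load_segment_texts(raw_json: str) -> list[str]:
    """Parse the transcript JSON and return non-empty segment texts in order."""
    data = json.loads(raw_json)
    return [
        seg["text"].strip()
        for seg in data.get("segments", [])
        if seg.get("text", "").strip()
    ]

sample = '{"segments": [{"text": "Buenas noches"}, {"text": "  "}, {"text": "y el perro"}]}'
print(load_segment_texts(sample))  # ['Buenas noches', 'y el perro']
```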
List[str]

Example:
[
"Mi amigo cree que su polola piensa que quiero robarle el perro.",
"El médico dijo: Toro parado, venid."
]

The pipeline uses a hybrid approach:
Segments are merged into windows with overlap to prevent punchlines from being cut:
segment N + segment N+1 overlap → complete joke context
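A minimal sketch of that windowing step. The window size, the one-segment overlap, and joining with spaces are assumptions; the repo may use different parameters:

```python
def build_windows(segments: list[str], window_size: int = 3, overlap: int = 1) -> list[str]:
    """Merge consecutive segments into overlapping windows so a punchline
    that straddles a segment boundary appears whole in at least one window."""
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")
    step = window_size - overlap
    windows = []
    for start in range(0, len(segments), step):
        windows.append(" ".join(segments[start:start + window_size]))
        if start + window_size >= len(segments):
            break
    return windows

print(build_windows(["s1", "s2", "s3", "s4", "s5"]))  # ['s1 s2 s3', 's3 s4 s5']
```

Note that "s3" appears in both windows: that shared segment is what keeps a joke intact when it crosses a boundary.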
Gemini is instructed to classify content into:
kind = "joke" | "non_joke"

Non-joke content includes:
- comedian introductions
- awards
- applause segments
- announcer speech
- promos or transitions
Only "joke" entries are returned.
The post-processing step:
- fixes punctuation
- removes duplicates
- returns normalized joke text

Key properties:
- deterministic output format
- robustness to transcript noise
- reliable joke boundaries
- language-agnostic (configured via the prompt)
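The normalization and de-duplication behaviors above can be sketched as follows. The exact punctuation rules and case-insensitive matching are assumptions; only the listed behaviors come from the pipeline description:

```python
import unicodedata

def normalize_joke(text: str) -> str:
    """Collapse whitespace, normalize Unicode, and ensure the joke ends in punctuation."""
    text = unicodedata.normalize("NFC", " ".join(text.split()))
    if text and text[-1] not in ".!?":
        text += "."
    return text

def dedupe_jokes(jokes: list[str]) -> list[str]:
    """Drop case-insensitive duplicates after normalization, keeping first-seen order."""
    seen: set[str] = set()
    result = []
    for joke in jokes:
        norm = normalize_joke(joke)
        key = norm.casefold()
        if key not in seen:
            seen.add(key)
            result.append(norm)
    return result

print(dedupe_jokes(["Mi perro  habla", "mi perro habla", "¡Otro chiste!"]))
# ['Mi perro habla.', '¡Otro chiste!']
```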
Compute the mentalizing complexity of a joke using explicit recursive mindstate modeling.
This follows the cognitive framework described in Dunbar et al.
Key concept:
Intentionality level = depth of recursively embedded mental states
Example:
comedian intends
audience understands
doctor believes
Toro Sentado is now Toro Parado
Intentionality depth = 4
Structured nested tree:
{
"root": {
"holder": "comedian",
"verb": "intends",
"content": "...",
"embeds": {
"holder": "audience",
"verb": "understands",
"content": "...",
"embeds": {
...
}
}
}
}

This structure explicitly encodes mental state nesting.
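The nested shape can be modeled directly in code. The repo validates this structure with Pydantic models (per the pipeline overview); the self-referencing dataclass below is a dependency-free sketch of the same shape, with illustrative field values drawn from the Toro Parado example:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NestedMentalState:
    """One level of the mentalizing tree; `embeds` holds the next deeper mindstate."""
    holder: str
    verb: str
    content: str
    embeds: Optional["NestedMentalState"] = None

tree = NestedMentalState(
    holder="comedian", verb="intends", content="the audience gets the pun",
    embeds=NestedMentalState(
        holder="audience", verb="understands", content="what the doctor believes",
        embeds=NestedMentalState(
            holder="doctor", verb="believes",
            content="Toro Sentado is now Toro Parado",
        ),
    ),
)
print(tree.embeds.embeds.holder)  # doctor
```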
From this tree, the system can compute:
intentionality_depth = compute_intentionality_depth(root)

Example:
Depth = 5
This pipeline combines deterministic heuristics and structured LLM reasoning.
The system scans the joke text for mental state indicators:
Examples:
creo que...
piensa que...
quiere...
espera...
These generate candidate mental states:
Candidate(
span="mi amigo cree que...",
holder_hint="mi amigo",
verb_hint="believes",
estimated_depth=2
)

This step improves reproducibility and reduces hallucination.
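The scanning step can be sketched as a regex pass. The marker table below is hypothetical (the repo's real marker list is not shown here), and the two-words-before holder heuristic and the depth estimate of one level per successive marker are illustrative assumptions:

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical marker table: Spanish mental-state cues mapped to English verb labels.
MENTAL_STATE_MARKERS = {
    r"\bcreo que\b": "believes",
    r"\bcree que\b": "believes",
    r"\bpiensa que\b": "thinks",
    r"\bquiere\b": "wants",
    r"\bespera\b": "hopes",
}

@dataclass
class Candidate:
    span: str
    holder_hint: Optional[str]
    verb_hint: str
    estimated_depth: int

def find_candidates(text: str) -> list[Candidate]:
    """Scan for markers left to right; each successive marker suggests one
    more level of embedding (the speaker contributes the first level)."""
    hits = []
    for pattern, verb in MENTAL_STATE_MARKERS.items():
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.group(0), verb))
    hits.sort()
    candidates = []
    for i, (start, span, verb) in enumerate(hits):
        words_before = text[:start].split()
        holder = " ".join(words_before[-2:]) if words_before else None  # crude holder hint
        candidates.append(Candidate(span, holder, verb, estimated_depth=i + 2))
    return candidates

for c in find_candidates("Mi amigo cree que su polola piensa que quiero robarle el perro."):
    print(c)
```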
Gemini receives:
- joke text
- heuristic candidates
- strict recursive schema
Gemini outputs a minimal nested mental state structure.
The schema enforces explicit embedding:
NestedMentalState
embeds NestedMentalState
embeds NestedMentalState

Depth is computed deterministically:
def compute_intentionality_depth(node):
if node.embeds is None:
return 1
    return 1 + compute_intentionality_depth(node.embeds)

This produces the intentionality level.
The nested structure can be converted into Graphviz DOT format:
ms1 → ms2 → ms3 → ms4 → ms5

This allows visualization of the mentalizing chain.
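A sketch of that conversion. The `ms1`, `ms2`, … node names follow the chain shown above; the graph name, label format, and the minimal node type standing in for the pipeline's tree class are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MindState:
    holder: str
    verb: str
    embeds: Optional["MindState"] = None

def tree_to_dot(root: MindState) -> str:
    """Walk the embedding chain and emit one Graphviz DOT node per mental
    state, linked in order: ms1 -> ms2 -> ms3 ..."""
    lines = ["digraph mentalizing {"]
    node, prev, i = root, None, 1
    while node is not None:
        name = f"ms{i}"
        lines.append(f'  {name} [label="{node.holder} {node.verb}"];')
        if prev is not None:
            lines.append(f"  {prev} -> {name};")
        prev, node, i = name, node.embeds, i + 1
    lines.append("}")
    return "\n".join(lines)

chain = MindState("comedian", "intends",
                  MindState("audience", "understands",
                            MindState("doctor", "believes")))
print(tree_to_dot(chain))
```

The resulting text can be rendered with any Graphviz tool (e.g. `dot -Tpng`).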
The pipelines solve different but complementary problems:
- Joke extraction: isolates jokes from transcript noise
- Mentalizing analysis: measures the cognitive complexity of each joke
Together, they enable large-scale cognitive analysis of humor.
transcript → extract_jokes()
for joke in jokes:
tree = analyze_joke_mentalizing_tree(joke)
    depth = compute_intentionality_depth(tree.root)

Output:
[
{
"joke": "...",
"intentionality_depth": 5
}
]