Tracking the history of Chilean humor.
You can visit the previous version of this project (2024) here.
- The raw dataset is available on Hugging Face: astroza/chilean-humor-raw-transcripts
- The processed version of the dataset is available at: astroza/chilean-humor-jokes
The dataset includes 135 comedy routines performed at the Festival de Viña del Mar between 1960 and 2025. All material is publicly available on YouTube.
The audio was automatically transcribed using the chirp_2 speech-to-text model. The resulting segments were then cleaned, merged, and refined using gemini-3-flash-preview.
Using uv:
Windows (PowerShell):
uv venv
.venv\Scripts\activate
uv pip install -r .\requirements.txt

macOS/Linux:
uv venv
source .venv/bin/activate
uv pip install -r ./requirements.txt

If you want Jina embeddings to run on GPU, install CUDA-enabled PyTorch wheels after the base requirements.
Windows (PowerShell):
uv venv
.venv\Scripts\activate
uv pip install -r .\requirements.txt
uv pip uninstall torch torchvision
uv pip install --index-url https://download.pytorch.org/whl/cu124 torch torchvision
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.version.cuda)"

macOS/Linux:
uv venv
source .venv/bin/activate
uv pip install -r ./requirements.txt
uv pip uninstall torch torchvision
uv pip install --index-url https://download.pytorch.org/whl/cu124 torch torchvision
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.version.cuda)"

The check must print True for torch.cuda.is_available().
Generate the two Hugging Face configs from data/2026:
.venv\Scripts\python scripts/build_hf_dataset.py --input-root data/2026 --output-root hf_dataset

This creates:
- hf_dataset/routines/train.parquet
- hf_dataset/routines/train.jsonl
- hf_dataset/segments/train.parquet
- hf_dataset/segments/train.jsonl
- Set your token in .env (or export HF_TOKEN in your shell):

HF_TOKEN=hf_xxx_your_token

- Push both configs (routines and segments) to the same dataset repo:
Windows (PowerShell):
python scripts/push_hf_dataset.py --repo-id <your_username>/chilean-humor-raw-transcripts --dataset-root hf_dataset

macOS/Linux:

python scripts/push_hf_dataset.py --repo-id <your_username>/chilean-humor-raw-transcripts --dataset-root hf_dataset

Optional: add --private if you want the dataset repo to be private when it is first created.
You can precompute Jina embeddings and reuse them across BERTopic runs.
Local provider (transformers + optional GPU):
.venv\Scripts\python scripts/run_topic_modeling.py `
--use-jina-embeddings `
--jina-provider local `
--jina-model-name jinaai/jina-embeddings-v4 `
--jina-task text-matching `
--jina-truncate-dim 128 `
--jina-device auto

Jina API provider (useful when local inference is too slow):

JINA_API_TOKEN=jina_xxx_your_token

.venv\Scripts\python scripts/run_topic_modeling.py `
--use-jina-embeddings `
--jina-provider api `
--jina-model-name jina-embeddings-v4 `
--jina-task text-matching

Notes:
- Embeddings are cached under outputs/topic_modeling/embeddings_cache by default.
- Use --jina-cache-dir <path> to keep cache files somewhere else.
- Set --jina-truncate-dim 0 to disable truncation.
- API mode reads the bearer token from JINA_API_TOKEN by default (or another variable via --jina-api-token-env).
- The API endpoint can be overridden with --jina-api-url, and the timeout with --jina-api-timeout-seconds.
Main outputs for topic analysis are written to outputs/topic_modeling:
- tables/segments_topics.csv: row-level table linked to the original segments dataset (original columns + cleaned text + decade + initial/final topic assignment + outlier flags + max topic probability).
- tables/topic_info.csv: topic metadata and representative documents.
- tables/topics_over_time.csv: topic trends over time.
- tables/hierarchical_topics.csv: hierarchy produced by BERTopic.
- figures/topic_hierarchy.html and figures/topic_hierarchy.png: hierarchical clustering visualization.
- figures/topics_over_time_top_n.html and figures/topics_over_time_top_n.png: temporal trend visualization.
This repository implements two complementary structured-output pipelines:
- Joke extraction pipeline
- Mentalizing (intentionality) analysis pipeline
Both pipelines use structured JSON output enforced by schema validation, ensuring deterministic and reproducible results suitable for downstream analysis.
This work builds upon the cognitive framework proposed by Dunbar et al. (2016), who demonstrated that verbal jokes rely on recursive mentalizing—the ability to represent nested mindstates such as “A thinks that B thinks…”. Their analysis showed that a conversational exchange minimally requires three levels of intentionality, and that most jokes involve approximately three to five levels, with humor effectiveness peaking within this range. Each additional embedded mindstate increases cognitive load, establishing a natural limit on joke complexity based on human mentalizing capacity.
The system processes comedy transcripts in two stages:
Transcript → Joke Extraction → Individual Jokes → Mentalizing Analysis → Intentionality Tree / Graph
Each stage uses:
- deterministic preprocessing (heuristics)
- structured output (JSON schema)
- post-processing validation (Pydantic models)
Extract clean, standalone jokes from noisy transcripts of comedy performances.
This stage converts transcript segments into a list of normalized joke texts.
Transcript in structured segment format:
{
"segments": [
{ "text": "..." },
{ "text": "..." }
]
}

Segments may be incomplete or split across boundaries.
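As a sketch, the segment format above can be parsed with the standard library. The function name and the filtering of empty segments are illustrative assumptions, not the repo's actual loader:

```python
import json

def load_segment_texts(raw_json: str) -> list[str]:
    """Parse the transcript JSON and return non-empty segment texts in order."""
    data = json.loads(raw_json)
    return [
        seg["text"].strip()
        for seg in data.get("segments", [])
        if seg.get("text", "").strip()
    ]

sample = '{"segments": [{"text": "Buenas noches"}, {"text": "  "}, {"text": "y el perro"}]}'
print(load_segment_texts(sample))  # ['Buenas noches', 'y el perro']
```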
List[str]

Example:
[
"Mi amigo cree que su polola piensa que quiero robarle el perro.",
"El médico dijo: Toro parado, venid."
]

The pipeline uses a hybrid approach:
Segments are merged into windows with overlap to prevent punchlines from being cut:
segment N + segment N+1 overlap → complete joke context
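A minimal sketch of that windowing step. The window size, the one-segment overlap, and joining with spaces are assumptions; the repo may use different parameters:

```python
def build_windows(segments: list[str], window_size: int = 3, overlap: int = 1) -> list[str]:
    """Merge consecutive segments into overlapping windows so a punchline
    that straddles a segment boundary appears whole in at least one window."""
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")
    step = window_size - overlap
    windows = []
    for start in range(0, len(segments), step):
        windows.append(" ".join(segments[start:start + window_size]))
        if start + window_size >= len(segments):
            break
    return windows

print(build_windows(["s1", "s2", "s3", "s4", "s5"]))  # ['s1 s2 s3', 's3 s4 s5']
```

Note that "s3" appears in both windows: that shared segment is what keeps a joke intact when it crosses a boundary.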
Gemini is instructed to classify content into:
kind = "joke" | "non_joke"

Non-joke content includes:
- comedian introductions
- awards
- applause segments
- announcer speech
- promos or transitions
Only "joke" entries are returned.
The post-processing step:
- fixes punctuation
- removes duplicates
- returns normalized joke text

Key properties:
- deterministic output format
- robustness to transcript noise
- reliable joke boundaries
- language-agnostic (configured via the prompt)
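The normalization and de-duplication behaviors above can be sketched as follows. The exact punctuation rules and case-insensitive matching are assumptions; only the listed behaviors come from the pipeline description:

```python
import unicodedata

def normalize_joke(text: str) -> str:
    """Collapse whitespace, normalize Unicode, and ensure the joke ends in punctuation."""
    text = unicodedata.normalize("NFC", " ".join(text.split()))
    if text and text[-1] not in ".!?":
        text += "."
    return text

def dedupe_jokes(jokes: list[str]) -> list[str]:
    """Drop case-insensitive duplicates after normalization, keeping first-seen order."""
    seen: set[str] = set()
    result = []
    for joke in jokes:
        norm = normalize_joke(joke)
        key = norm.casefold()
        if key not in seen:
            seen.add(key)
            result.append(norm)
    return result

print(dedupe_jokes(["Mi perro  habla", "mi perro habla", "¡Otro chiste!"]))
# ['Mi perro habla.', '¡Otro chiste!']
```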
Compute the mentalizing complexity of a joke using explicit recursive mindstate modeling.
This follows the cognitive framework described in Dunbar et al.
Key concept:
Intentionality level = depth of recursively embedded mental states
Example:
comedian intends
audience understands
doctor believes
Toro Sentado is now Toro Parado
Intentionality depth = 4
Structured nested tree:
{
"root": {
"holder": "comedian",
"verb": "intends",
"content": "...",
"embeds": {
"holder": "audience",
"verb": "understands",
"content": "...",
"embeds": {
...
}
}
}
}

This structure explicitly encodes mental state nesting.
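The nested shape can be modeled directly in code. The repo validates this structure with Pydantic models (per the pipeline overview); the self-referencing dataclass below is a dependency-free sketch of the same shape, with illustrative field values drawn from the Toro Parado example:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NestedMentalState:
    """One level of the mentalizing tree; `embeds` holds the next deeper mindstate."""
    holder: str
    verb: str
    content: str
    embeds: Optional["NestedMentalState"] = None

tree = NestedMentalState(
    holder="comedian", verb="intends", content="the audience gets the pun",
    embeds=NestedMentalState(
        holder="audience", verb="understands", content="what the doctor believes",
        embeds=NestedMentalState(
            holder="doctor", verb="believes",
            content="Toro Sentado is now Toro Parado",
        ),
    ),
)
print(tree.embeds.embeds.holder)  # doctor
```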
From this tree, the system can compute:
intentionality_depth = compute_intentionality_depth(root)

Example:
Depth = 5
This pipeline combines deterministic heuristics and structured LLM reasoning.
The system scans the joke text for mental state indicators:
Examples:
creo que...
piensa que...
quiere...
espera...
These generate candidate mental states:
Candidate(
span="mi amigo cree que...",
holder_hint="mi amigo",
verb_hint="believes",
estimated_depth=2
)

This step improves reproducibility and reduces hallucination.
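The scanning step can be sketched as a regex pass. The marker table below is hypothetical (the repo's real marker list is not shown here), and the two-words-before holder heuristic and the depth estimate of one level per successive marker are illustrative assumptions:

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical marker table: Spanish mental-state cues mapped to English verb labels.
MENTAL_STATE_MARKERS = {
    r"\bcreo que\b": "believes",
    r"\bcree que\b": "believes",
    r"\bpiensa que\b": "thinks",
    r"\bquiere\b": "wants",
    r"\bespera\b": "hopes",
}

@dataclass
class Candidate:
    span: str
    holder_hint: Optional[str]
    verb_hint: str
    estimated_depth: int

def find_candidates(text: str) -> list[Candidate]:
    """Scan for markers left to right; each successive marker suggests one
    more level of embedding (the speaker contributes the first level)."""
    hits = []
    for pattern, verb in MENTAL_STATE_MARKERS.items():
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.group(0), verb))
    hits.sort()
    candidates = []
    for i, (start, span, verb) in enumerate(hits):
        words_before = text[:start].split()
        holder = " ".join(words_before[-2:]) if words_before else None  # crude holder hint
        candidates.append(Candidate(span, holder, verb, estimated_depth=i + 2))
    return candidates

for c in find_candidates("Mi amigo cree que su polola piensa que quiero robarle el perro."):
    print(c)
```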
Gemini receives:
- joke text
- heuristic candidates
- strict recursive schema
Gemini outputs a minimal nested mental state structure.
The schema enforces explicit embedding:
NestedMentalState
embeds NestedMentalState
embeds NestedMentalState

Depth is computed deterministically:
def compute_intentionality_depth(node):
if node.embeds is None:
return 1
    return 1 + compute_intentionality_depth(node.embeds)

This produces the intentionality level.
The nested structure can be converted into Graphviz DOT format:
ms1 → ms2 → ms3 → ms4 → ms5

This allows visualization of the mentalizing chain.
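A sketch of that conversion. The `ms1`, `ms2`, … node names follow the chain shown above; the graph name, label format, and the minimal node type standing in for the pipeline's tree class are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MindState:
    holder: str
    verb: str
    embeds: Optional["MindState"] = None

def tree_to_dot(root: MindState) -> str:
    """Walk the embedding chain and emit one Graphviz DOT node per mental
    state, linked in order: ms1 -> ms2 -> ms3 ..."""
    lines = ["digraph mentalizing {"]
    node, prev, i = root, None, 1
    while node is not None:
        name = f"ms{i}"
        lines.append(f'  {name} [label="{node.holder} {node.verb}"];')
        if prev is not None:
            lines.append(f"  {prev} -> {name};")
        prev, node, i = name, node.embeds, i + 1
    lines.append("}")
    return "\n".join(lines)

chain = MindState("comedian", "intends",
                  MindState("audience", "understands",
                            MindState("doctor", "believes")))
print(tree_to_dot(chain))
```

The resulting text can be rendered with any Graphviz tool (e.g. `dot -Tpng`).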
The pipelines solve different but complementary problems:
- Joke extraction: isolates jokes from transcript noise
- Mentalizing analysis: measures the cognitive complexity of each joke
Together, they enable large-scale cognitive analysis of humor.
transcript → extract_jokes()
for joke in jokes:
tree = analyze_joke_mentalizing_tree(joke)
    depth = compute_intentionality_depth(tree.root)

Output:
[
{
"joke": "...",
"intentionality_depth": 5
}
]