1+ {
2+ "cells" : [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Financial PhraseBank — Dataset Exploration\n",
    "\n",
    "Quick look at the dataset before we start fine-tuning anything. Using the `sentences_allagree` split since those are the ones where all annotators agreed on the label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "from collections import Counter\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds = load_dataset(\"financial_phrasebank\", \"sentences_allagree\", trust_remote_code=True)\n",
    "data = ds[\"train\"]  # only has a train split\n",
    "\n",
    "print(f\"Number of samples: {len(data)}\")\n",
    "print(f\"Features: {data.features}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OK, so ~2264 samples total. That's pretty small for fine-tuning, but it should be enough for QLoRA since we're not updating that many params."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# label mapping — 0=negative, 1=neutral, 2=positive\n",
    "label_map = {0: \"negative\", 1: \"neutral\", 2: \"positive\"}\n",
    "\n",
    "label_counts = Counter(data[\"label\"])\n",
    "for label_id, count in sorted(label_counts.items()):\n",
    "    pct = count / len(data) * 100\n",
    "    print(f\"{label_map[label_id]:>10}: {count:>5} ({pct:.1f}%)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# label distribution\n",
    "names = [label_map[i] for i in sorted(label_counts.keys())]\n",
    "counts = [label_counts[i] for i in sorted(label_counts.keys())]\n",
    "colors = [\"#e74c3c\", \"#95a5a6\", \"#2ecc71\"]\n",
    "\n",
    "plt.figure(figsize=(7, 4))\n",
    "plt.bar(names, counts, color=colors, edgecolor=\"black\", linewidth=0.5)\n",
    "plt.title(\"Label Distribution (sentences_allagree)\")\n",
    "plt.ylabel(\"Count\")\n",
    "for i, c in enumerate(counts):\n",
    "    plt.text(i, c + 15, str(c), ha=\"center\", fontsize=10)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Super imbalanced — neutral dominates and negative is tiny. We'll need to think about this during training (class weights, oversampling, or something similar)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# examples from each class\n",
    "for label_id in sorted(label_map.keys()):\n",
    "    print(f\"\\n--- {label_map[label_id].upper()} ---\")\n",
    "    examples = [s for s, l in zip(data[\"sentence\"], data[\"label\"]) if l == label_id][:3]\n",
    "    for ex in examples:\n",
    "        print(f\"  • {ex[:120]}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# sentence length stats (just splitting on spaces, nothing fancy)\n",
    "word_counts = [len(s.split()) for s in data[\"sentence\"]]\n",
    "char_counts = [len(s) for s in data[\"sentence\"]]\n",
    "\n",
    "print(f\"Word counts — mean: {np.mean(word_counts):.1f}, median: {np.median(word_counts):.0f}, \"\n",
    "      f\"min: {min(word_counts)}, max: {max(word_counts)}\")\n",
    "print(f\"Char counts — mean: {np.mean(char_counts):.1f}, median: {np.median(char_counts):.0f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n",
    "\n",
    "axes[0].hist(word_counts, bins=30, edgecolor=\"black\", alpha=0.7, color=\"steelblue\")\n",
    "axes[0].set_title(\"Word Count Distribution\")\n",
    "axes[0].set_xlabel(\"# words\")\n",
    "axes[0].set_ylabel(\"Frequency\")\n",
    "\n",
    "axes[1].hist(char_counts, bins=30, edgecolor=\"black\", alpha=0.7, color=\"coral\")\n",
    "axes[1].set_title(\"Character Count Distribution\")\n",
    "axes[1].set_xlabel(\"# characters\")\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most sentences are pretty short, around 15–30 words. Good — we won't need huge context windows for this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# word frequency per class\n",
    "import re\n",
    "from collections import defaultdict\n",
    "\n",
    "stopwords = {\"the\", \"a\", \"an\", \"in\", \"of\", \"to\", \"and\", \"for\", \"is\", \"was\", \"its\",\n",
    "             \"it\", \"on\", \"by\", \"with\", \"from\", \"at\", \"as\", \"has\", \"had\", \"that\",\n",
    "             \"this\", \"are\", \"were\", \"be\", \"been\", \"will\", \"or\", \"which\", \"also\",\n",
    "             \"than\", \"have\", \"not\", \"but\", \"s\", \"said\", \"would\", \"their\", \"about\"}\n",
    "\n",
    "class_words = defaultdict(list)\n",
    "for sentence, label in zip(data[\"sentence\"], data[\"label\"]):\n",
    "    tokens = re.findall(r\"\\b[a-z]+\\b\", sentence.lower())\n",
    "    tokens = [t for t in tokens if t not in stopwords and len(t) > 2]\n",
    "    class_words[label].extend(tokens)\n",
    "\n",
    "for label_id in sorted(label_map.keys()):\n",
    "    top = Counter(class_words[label_id]).most_common(15)\n",
    "    print(f\"\\n{label_map[label_id].upper()} — top 15 words:\")\n",
    "    print(\", \".join(f\"{w} ({c})\" for w, c in top))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# quick keyword pattern check\n",
    "keywords = {\n",
    "    \"positive\": [\"profit\", \"growth\", \"increased\", \"rose\", \"improved\", \"gains\"],\n",
    "    \"negative\": [\"loss\", \"declined\", \"fell\", \"dropped\", \"decreased\", \"lower\"],\n",
    "    \"neutral\": [\"reported\", \"announced\", \"according\", \"expects\", \"company\", \"shares\"]\n",
    "}\n",
    "\n",
    "print(\"Keyword hit rates per class:\\n\")\n",
    "for sentiment, kws in keywords.items():\n",
    "    label_id = {v: k for k, v in label_map.items()}[sentiment]\n",
    "    sents = [s.lower() for s, l in zip(data[\"sentence\"], data[\"label\"]) if l == label_id]\n",
    "    total = len(sents)\n",
    "    for kw in kws:\n",
    "        hits = sum(1 for s in sents if kw in s)\n",
    "        print(f\"  {sentiment:>8} | '{kw}': {hits}/{total} ({hits/total*100:.1f}%)\")\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Makes sense that words like \"profit\" and \"growth\" show up more in positive, and \"loss\"/\"declined\" in negative. The neutral class is more about reporting language (\"announced\", \"reported\"). These patterns are actually pretty strong — which explains why even bag-of-words models can get decent accuracy on this."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Takeaways\n",
    "\n",
    "The dataset is small (~2264 samples) but seems well-curated — the `sentences_allagree` subset means all annotators agreed, so labels should be clean. The class imbalance is real, though, especially the tiny negative set. The keyword patterns are pretty clear, which is why even simple models do OK on this. Let's see if QLoRA can push it further by understanding the actual financial context beyond just keywords."
   ]
  }
211+ ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}