
Commit 3407f0d

Marcin Kardas committed: Add notebooks
1 parent eda6a36 commit 3407f0d

File tree

4 files changed: +1403 −0 lines changed


notebooks/datasets.ipynb

Lines changed: 238 additions & 0 deletions
@@ -0,0 +1,238 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Datasets\n",
    "\n",
    "We published four datasets for training and evaluating the extraction of performance results from machine learning papers. In this notebook we describe their format and show how to use our Python API to conveniently work with them. Due to licensing, the datasets consist of metadata and annotations, but do not include the papers or the data extracted from them. However, we made a special effort in our extraction pipeline to keep the results reproducible."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Simple functions to load the datasets:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from axcell.helpers.datasets import read_arxiv_papers\n",
    "from pathlib import Path\n",
    "\n",
    "V1_URL = 'https://github.com/paperswithcode/axcell/releases/download/v1.0/'\n",
    "ARXIV_PAPERS_URL = V1_URL + 'arxiv-papers.csv.xz'\n",
    "SEGMENTED_TABLES_URL = V1_URL + 'segmented-tables.json.xz'\n",
    "PWC_LEADERBOARDS_URL = V1_URL + 'pwc-leaderboards.json.xz'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ArxivPapers\n",
    "\n",
    "The **ArxivPapers** dataset is a corpus of over 100,000 scientific papers related to machine learning. In our work we use the corpus for self-supervised training of the ULMFiT language model (see the lm_training notebook) and for extraction of common abbreviations. The dataset is a CSV file with one row per paper and the following fields:\n",
    "* arxiv_id - arXiv identifier with version\n",
    "* archive_size - the file size in bytes of the e-print archive\n",
    "* sha256 - SHA-256 hash of the e-print archive\n",
    "* title - paper's title\n",
    "* status - the text and tables extraction status for this paper, one of:\n",
    "  + success,\n",
    "  + no-tex - LaTeX source is unavailable,\n",
    "  + processing-error - extraction issues,\n",
    "  + withdrawn - the paper is withdrawn from arXiv\n",
    "* sections - number of extracted sections and subsections\n",
    "* tables - number of extracted tables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of papers: 104710\n",
      "└── with LaTeX source: 93811\n",
      "Number of extracted tables: 277946\n"
     ]
    }
   ],
   "source": [
    "arxiv_papers = read_arxiv_papers(ARXIV_PAPERS_URL)\n",
    "\n",
    "print(f'Number of papers: {len(arxiv_papers):8}')\n",
    "print(f'└── with LaTeX source: {(~arxiv_papers.status.isin([\"no-tex\", \"withdrawn\"])).sum():8}')\n",
    "print(f'Number of extracted tables: {arxiv_papers.tables.sum():8}')"
   ]
  },
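  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check, the sketch below breaks the papers down by extraction status and keeps only the successfully processed ones. It is a minimal example that assumes `read_arxiv_papers` returns a pandas `DataFrame` (as used above), with columns named after the fields listed earlier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: per-status breakdown and a filtered view of successfully processed papers.\n",
    "# Assumes arxiv_papers is the pandas DataFrame loaded above.\n",
    "print(arxiv_papers.status.value_counts())\n",
    "\n",
    "processed = arxiv_papers[arxiv_papers.status == 'success']\n",
    "processed[['arxiv_id', 'title', 'sections', 'tables']].head()"
   ]
  },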
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The arXiv id can be used to generate links to e-prints. Please read https://arxiv.org/help/bulk_data and play nice."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "104705 http://export.arxiv.org/e-print/2002.08204v1\n",
       "104706 http://export.arxiv.org/e-print/2002.08253v1\n",
       "104707 http://export.arxiv.org/e-print/2002.08264v1\n",
       "104708 http://export.arxiv.org/e-print/2002.08301v1\n",
       "104709 http://export.arxiv.org/e-print/2002.08325v1\n",
       "dtype: object"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def get_eprint_link(paper):\n",
    "    return f'http://export.arxiv.org/e-print/{paper.arxiv_id}'\n",
    "\n",
    "links = arxiv_papers.apply(get_eprint_link, axis=1)\n",
    "links.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## SegmentedTables & LinkedResults\n",
    "\n",
    "The **SegmentedTables** dataset contains annotations of almost 2,000 tables. The dataset is a JSON array with one item per paper and the following fields:\n",
    "* arxiv_id - arXiv identifier with version. The version can be different from the one in **ArxivTables**,\n",
    "* sha256 - SHA-256 hash of the e-print archive,\n",
    "* fold - one of 11 folds, e.g., img_class or speech_rec. Each paper has exactly one fold, even if it is related to more than one task,\n",
    "* tables - array of table annotations:\n",
    "  + index - 0-based index of the table among the tables extracted from the paper,\n",
    "  + leaderboard - a boolean denoting whether this table is a leaderboard table,\n",
    "  + ablation - a boolean denoting whether this table is an ablation table (a table can be both a leaderboard and an ablation table),\n",
    "  + dataset_text - datasets mentioned in the table's caption, not normalized,\n",
    "  + segmentation - for leaderboard tables, a 2D array (list of lists) with one label per cell.\n",
    "\n",
    "Additionally, we annotated a subset of the tables with performance results; these annotations form the **LinkedResults** dataset. Each table contains a 'records' array whose items contain:\n",
    "* task, dataset, metric - task, dataset and metric names, normalized across all papers from the **LinkedResults** dataset,\n",
    "* value - normalized metric value,\n",
    "* model - model name,\n",
    "* row, column - 0-based location of the cell with this result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of papers: 352\n",
      "Number of tables: 1994\n",
      "├── leaderboards: 796\n",
      "└── ablations: 468\n",
      "Linked results: 1591\n"
     ]
    }
   ],
   "source": [
    "from axcell.helpers.datasets import read_tables_annotations\n",
    "\n",
    "segmented_tables_annotations = read_tables_annotations(SEGMENTED_TABLES_URL)\n",
    "\n",
    "leaderboards = (segmented_tables_annotations.tables.apply(\n",
    "    lambda tables: len([t for t in tables if t['leaderboard']])\n",
    ").sum())\n",
    "ablations = (segmented_tables_annotations.tables.apply(\n",
    "    lambda tables: len([t for t in tables if t['ablation']])\n",
    ").sum())\n",
    "records = (segmented_tables_annotations.tables.apply(\n",
    "    lambda tables: sum([len(t['records']) for t in tables])\n",
    ").sum())\n",
    "\n",
    "print(f'Number of papers: {len(segmented_tables_annotations):8}')\n",
    "print(f'Number of tables: {segmented_tables_annotations.tables.apply(len).sum():8}')\n",
    "print(f'├── leaderboards: {leaderboards:8}')\n",
    "print(f'└── ablations: {ablations:8}')\n",
    "print(f'Linked results: {records:8}')"
   ]
  },
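  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the per-table annotations look like, the sketch below picks the first leaderboard table of the first paper and prints its linked records and the shape of its segmentation. This is a minimal example assuming `read_tables_annotations` returns a pandas `DataFrame` (as used above) whose `tables` column holds the per-paper arrays described earlier; the choice of the first paper is arbitrary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: inspect the annotations of a single leaderboard table.\n",
    "# Assumes segmented_tables_annotations is the DataFrame loaded above;\n",
    "# picking the first paper / first leaderboard table is arbitrary.\n",
    "paper = segmented_tables_annotations.iloc[0]\n",
    "leaderboard_tables = [t for t in paper.tables if t['leaderboard']]\n",
    "\n",
    "if leaderboard_tables:\n",
    "    table = leaderboard_tables[0]\n",
    "    print('arxiv_id:   ', paper.arxiv_id)\n",
    "    print('table index:', table['index'])\n",
    "    print('records:    ', table['records'])\n",
    "    seg = table['segmentation']\n",
    "    print('segmentation size:', len(seg), 'rows x', len(seg[0]) if seg else 0, 'columns')"
   ]
  },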
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## PWCLeaderboards\n",
    "\n",
    "The **PWCLeaderboards** dataset is similar in structure to the **LinkedResults** dataset. It is a JSON array with one item per paper, containing:\n",
    "* arxiv_id - arXiv identifier with version. The version corresponds to the version in **ArxivTables**,\n",
    "* tables - array of table annotations:\n",
    "  + index - 0-based table index,\n",
    "  + records - as in **LinkedResults**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of papers: 731\n",
      "Number of tables: 1278\n",
      "Linked results: 5393\n"
     ]
    }
   ],
   "source": [
    "pwc_leaderboards = read_tables_annotations(PWC_LEADERBOARDS_URL)\n",
    "\n",
    "records = (pwc_leaderboards.tables.apply(\n",
    "    lambda tables: sum([len(t['records']) for t in tables])\n",
    ").sum())\n",
    "\n",
    "print(f'Number of papers: {len(pwc_leaderboards):8}')\n",
    "print(f'Number of tables: {pwc_leaderboards.tables.apply(len).sum():8}')\n",
    "print(f'Linked results: {records:8}')"
   ]
  },
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
