1 | 1 | {
2 | 2 | "cells": [
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Papers with Code ML papers dataset" |
| 8 | + ] |
| 9 | + }, |
3 | 10 | {
4 | 11 | "cell_type": "code",
5 | 12 | "execution_count": 1,
24 | 31 | "cell_type": "code",
25 | 32 | "execution_count": 3,
26 | 33 | "metadata": {},
27 | | - "outputs": [ |
28 | | - { |
29 | | - "data": { |
30 | | - "text/html": [ |
31 | | - "\n", |
32 | | - "    <div>\n", |
33 | | - "        <style>\n", |
34 | | - "            /* Turns off some styling */\n", |
35 | | - "            progress {\n", |
36 | | - "                /* gets rid of default border in Firefox and Opera. */\n", |
37 | | - "                border: none;\n", |
38 | | - "                /* Needs to be in here for Safari polyfill so background images work as expected. */\n", |
39 | | - "                background-size: auto;\n", |
40 | | - "            }\n", |
41 | | - "            .progress-bar-interrupted, .progress-bar-interrupted::-webkit-progress-bar {\n", |
42 | | - "                background: #F44336;\n", |
43 | | - "            }\n", |
44 | | - "        </style>\n", |
45 | | - "      <progress value='952' class='' max='952', style='width:300px; height:20px; vertical-align: middle;'></progress>\n", |
46 | | - "      100.00% [952/952 00:03<00:00]\n", |
47 | | - "    </div>\n", |
48 | | - "    " |
49 | | - ], |
50 | | - "text/plain": [ |
51 | | - "<IPython.core.display.HTML object>" |
52 | | - ] |
53 | | - }, |
54 | | - "metadata": {}, |
55 | | - "output_type": "display_data" |
56 | | - }, |
57 | | - { |
58 | | - "data": { |
59 | | - "text/html": [ |
60 | | - "\n", |
61 | | - "    <div>\n", |
62 | | - "        <style>\n", |
63 | | - "            /* Turns off some styling */\n", |
64 | | - "            progress {\n", |
65 | | - "                /* gets rid of default border in Firefox and Opera. */\n", |
66 | | - "                border: none;\n", |
67 | | - "                /* Needs to be in here for Safari polyfill so background images work as expected. */\n", |
68 | | - "                background-size: auto;\n", |
69 | | - "            }\n", |
70 | | - "            .progress-bar-interrupted, .progress-bar-interrupted::-webkit-progress-bar {\n", |
71 | | - "                background: #F44336;\n", |
72 | | - "            }\n", |
73 | | - "        </style>\n", |
74 | | - "      <progress value='948' class='' max='948', style='width:300px; height:20px; vertical-align: middle;'></progress>\n", |
75 | | - "      100.00% [948/948 00:23<00:00]\n", |
76 | | - "    </div>\n", |
77 | | - "    " |
78 | | - ], |
79 | | - "text/plain": [ |
80 | | - "<IPython.core.display.HTML object>" |
81 | | - ] |
82 | | - }, |
83 | | - "metadata": {}, |
84 | | - "output_type": "display_data" |
85 | | - } |
86 | | - ], |
| 34 | + "outputs": [], |
87 | 35 | "source": [
88 | 36 | "from sota_extractor2.data.paper_collection import PaperCollection\n",
89 | 37 | "from pathlib import Path\n",
90 | 38 | "\n",
91 | | - "#DATA_PATH = Path(\"/home/ubuntu/pwc/arxiv-s3/arxiv\")\n", |
92 | | - "DATA_PATH = Path(\"/home/ubuntu/pwc/arxiv-pwc/arxiv\")\n", |
| 39 | + "DATA_PATH = Path(\"/home/ubuntu/pwc/arxiv-s3/arxiv\")\n", |
| 40 | + "PICKLE_PATH = Path(\"/home/ubuntu/pwc/pc-pickle.pkl\")\n", |
| 41 | + "#DATA_PATH = Path(\"/home/ubuntu/pwc/arxiv-pwc/arxiv\")\n", |
| 42 | + "#PICKLE_PATH = Path(\"/home/ubuntu/pwc/pc-pickle-fast.pkl\")" |
| 43 | + ] |
| 44 | + }, |
| 45 | + { |
| 46 | + "cell_type": "markdown", |
| 47 | + "metadata": {}, |
| 48 | + "source": [ |
| 49 | + "## Dataset\n", |
| 50 | + "The dataset was created by parsing 75K arXiv papers related to machine learning. Because of parsing errors, it contains texts and tables successfully extracted from 56K of those papers.\n", |
| 51 | + "```\n", |
| 52 | + ".\n", |
| 53 | + "└── arxiv\n", |
| 54 | + " ├── texts\n", |
| 55 | + " │ └── 0709\n", |
| 56 | + " │ ├── 0709.1667.json\n", |
| 57 | + " │ ...\n", |
| 58 | + " │ ...\n", |
| 59 | + " ├── tables\n", |
| 60 | + " │ └── 0709\n", |
| 61 | + " │ ├── 0709.1667\n", |
| 62 | + " │ │ ├── metadata.json\n", |
| 63 | + " │ │ ├── table_01.csv\n", |
| 64 | + " │ │ ...\n", |
| 65 | + " │ ...\n", |
| 66 | + " │ ...\n", |
| 67 | + " └── structure-annotations.json\n", |
| 68 | + "```\n", |
93 | 69 | "\n",
94 | | - "pc = PaperCollection(DATA_PATH)" |
| 70 | + "The `texts` directory contains `.json` files with each paper's content organized into sections. `metadata.json` lists the tables found in a given paper, together with their captions. Each `table_xx.csv` file contains the data of one table (nested tables are flattened). We provide a simple API to load and access the dataset. Because of the large number of papers, it is recommended to load the dataset in parallel (by default, the number of processes equals the number of CPU cores) and cache it in a pickle file. Set `jobs=1` to disable multiprocessing." |
95 | 71 | ]
96 | 72 | },
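The layout above can also be inspected without the collection API. A minimal sketch of reading one paper's files directly; the internal schema of `metadata.json` (beyond listing tables and captions) and the CSV header layout are assumptions:

```python
import json
from pathlib import Path

import pandas as pd

# One paper from the directory tree shown above.
paper_dir = Path("/home/ubuntu/pwc/arxiv-s3/arxiv/tables/0709/0709.1667")

# Table listing and captions for this paper.
metadata = json.loads((paper_dir / "metadata.json").read_text())
print(metadata)

# Raw data of the first table (nested tables are already flattened).
# header=None because the CSV header layout is an assumption.
df = pd.read_csv(paper_dir / "table_01.csv", header=None)
print(df.head())
```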
97 | 73 | {
98 | 74 | "cell_type": "code",
99 | 75 | "execution_count": 4,
100 | 76 | "metadata": {},
| 77 | + "outputs": [ |
| 78 | + { |
| 79 | + "name": "stdout", |
| 80 | + "output_type": "stream", |
| 81 | + "text": [ |
| 82 | + "CPU times: user 3min 10s, sys: 10.4 s, total: 3min 20s\n", |
| 83 | + "Wall time: 7min 16s\n" |
| 84 | + ] |
| 85 | + } |
| 86 | + ], |
| 87 | + "source": [ |
| 88 | + "%time pc = PaperCollection.from_files(DATA_PATH)" |
| 89 | + ] |
| 90 | + }, |
| 91 | + { |
| 92 | + "cell_type": "code", |
| 93 | + "execution_count": 5, |
| 94 | + "metadata": {}, |
101 | 95 | "outputs": [],
102 | 96 | "source": [
103 | | - "PICKLE_PATH = Path(\"/home/ubuntu/pwc/pc-pickle-fast.pkl\")\n", |
104 | 97 | "pc.to_pickle(PICKLE_PATH)"
105 | 98 | ]
106 | 99 | },
107 | 100 | {
108 | 101 | "cell_type": "code",
109 | | - "execution_count": 5, |
| 102 | + "execution_count": 6, |
110 | 103 | "metadata": {},
111 | 104 | "outputs": [
112 | 105 | {
113 | 106 | "name": "stdout",
114 | 107 | "output_type": "stream",
115 | 108 | "text": [
116 | | - "CPU times: user 2.78 s, sys: 149 ms, total: 2.93 s\n", |
117 | | - "Wall time: 2.9 s\n" |
| 109 | + "CPU times: user 3.48 s, sys: 144 ms, total: 3.63 s\n", |
| 110 | + "Wall time: 3.58 s\n" |
118 | 111 | ]
119 | 112 | }
120 | 113 | ],
121 | 114 | "source": [
122 | | - "%time pc2 = PaperCollection.from_pickle(PICKLE_PATH)" |
| 115 | + "%time pc = PaperCollection.from_pickle(PICKLE_PATH)" |
| 116 | + ] |
| 117 | + }, |
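The slow `from_files` parse and the fast pickle load combine naturally into a caching helper. A minimal sketch, assuming only the calls shown above plus the `jobs` keyword mentioned in the dataset notes:

```python
from pathlib import Path

from sota_extractor2.data.paper_collection import PaperCollection

def load_collection(data_path: Path, pickle_path: Path, jobs=None):
    """Load the paper collection, caching the parsed result as a pickle."""
    if pickle_path.exists():
        # Unpickling takes seconds instead of minutes of re-parsing.
        return PaperCollection.from_pickle(pickle_path)
    # jobs=1 disables multiprocessing; by default one process per CPU core.
    kwargs = {} if jobs is None else {"jobs": jobs}
    pc = PaperCollection.from_files(data_path, **kwargs)
    pc.to_pickle(pickle_path)
    return pc

pc = load_collection(DATA_PATH, PICKLE_PATH)
```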
| 118 | + { |
| 119 | + "cell_type": "markdown", |
| 120 | + "metadata": {}, |
| 121 | + "source": [ |
| 122 | + "`PaperCollection` is a thin wrapper around a `list` of papers, with additional convenience functions." |
123 | 123 | ]
124 | 124 | },
125 | 125 | {
126 | 126 | "cell_type": "code",
127 | | - "execution_count": 6, |
| 127 | + "execution_count": 7, |
| 128 | + "metadata": {}, |
| 129 | + "outputs": [ |
| 130 | + { |
| 131 | + "data": { |
| 132 | + "text/plain": [ |
| 133 | + "56696" |
| 134 | + ] |
| 135 | + }, |
| 136 | + "execution_count": 7, |
| 137 | + "metadata": {}, |
| 138 | + "output_type": "execute_result" |
| 139 | + } |
| 140 | + ], |
| 141 | + "source": [ |
| 142 | + "len(pc)" |
| 143 | + ] |
| 144 | + }, |
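Because the collection behaves like a `list`, standard sequence operations work directly on it. A short sketch; indexing and slicing semantics are assumed to match a plain `list`:

```python
first = pc[0]        # positional indexing
sample = pc[:100]    # slicing
# Count papers that had at least one table extracted.
with_tables = sum(1 for paper in pc if len(paper.tables) > 0)
print(f"{with_tables} of {len(pc)} papers contain tables")
```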
| 145 | + { |
| 146 | + "cell_type": "markdown", |
| 147 | + "metadata": {}, |
| 148 | + "source": [ |
| 149 | + "## Tables\n", |
| 150 | + "Each `Paper` contains `text` and `tables` fields. Tables can be displayed with color-coded labels." |
| 151 | + ] |
| 152 | + }, |
| 153 | + { |
| 154 | + "cell_type": "code", |
| 155 | + "execution_count": 7, |
128 | 156 | "metadata": {},
129 | 157 | "outputs": [
130 | 158 | {
245 | 273 | }
246 | 274 | ],
247 | 275 | "source": [
248 | | - "paper = pc.papers['1607.04315']\n", |
| 276 | + "paper = pc.get_by_id('1607.04315')\n", |
249 | 277 | "table = paper.tables[0]\n",
250 | 278 | "table.display()"
251 | 279 | ]
252 | 280 | },
253 | | - { |
254 | | - "cell_type": "code", |
255 | | - "execution_count": 7, |
| 288 | + { |
| 289 | + "cell_type": "markdown", |
| 290 | + "metadata": {}, |
| 291 | + "source": [ |
| 292 | + "A table's data is stored in its `.df` pandas `DataFrame`. Each cell holds its content (`value`), annotated `gold_tags`, and references to other papers (`refs`). Most references are normalized across all papers." |
| 293 | + ] |
| 294 | + }, |
| 295 | + { |
| 296 | + "cell_type": "code", |
| 297 | + "execution_count": 8, |
256 | 298 | "metadata": {},
257 | 299 | "outputs": [
258 | 300 | {
259 | 301 | "data": {
260 | 302 | "text/plain": [
261 | | - "Cell(value='Classifier with handcrafted features [12]', gold_tags='model-competing', refs=['xxref-Xbowman:15'])" |
| 303 | + "Cell(value='SPINN-PI encoders [14]', gold_tags='model-competing', refs=['xxref-XBowmanGRGMP16'])" |
262 | 304 | ]
263 | 305 | },
264 | | - "execution_count": 7, |
| 306 | + "execution_count": 8, |
265 | 307 | "metadata": {},
266 | 308 | "output_type": "execute_result"
267 | 309 | }
268 | 310 | ],
269 | 311 | "source": [
270 | | - "table.df.iloc[1,0]" |
| 312 | + "table.df.iloc[4,0]" |
| 313 | + ] |
| 314 | + }, |
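Since every entry of `table.df` is a `Cell` object rather than a plain string, grid-wide views can be derived by mapping over the frame. A sketch that relies only on the `value`, `gold_tags`, and `refs` attributes visible above:

```python
# Plain-text view of the table: keep only each cell's string content.
values = table.df.applymap(lambda cell: cell.value)

# Matching grid of annotations, handy for eyeballing the labelling.
tags = table.df.applymap(lambda cell: cell.gold_tags)

# Every normalized reference mentioned anywhere in this table.
refs = sorted({ref for cell in table.df.values.flatten() for ref in cell.refs})
```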
| 315 | + { |
| 316 | + "cell_type": "markdown", |
| 317 | + "metadata": {}, |
| 318 | + "source": [ |
| 319 | + "Additionally, each table has its own `gold_tags` describing the content of the table as a whole." |
271 | 320 | ]
272 | 321 | },
273 | 322 | {
276 | 325 | "metadata": {},
277 | 326 | "outputs": [
278 | 327 | {
279 | 328 | "data": {
280 | 329 | "text/plain": [
281 | 330 | "'sota'"
282 | 331 | ]
283 | 332 | },
284 | | - "execution_count": 8, |
| 333 | + "execution_count": 9, |
285 | 334 | "metadata": {},
286 | 335 | "output_type": "execute_result"
287 | 337 | }
288 | 338 | ],
289 | 339 | "source": [
290 | 340 | "table.gold_tags"
291 | 341 | ]
292 | 342 | },
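Table-level tags make it straightforward to pull out, for example, every table annotated as a state-of-the-art results table. A minimal sketch; `'sota'` is the only tag value confirmed in this notebook, the rest of the vocabulary is an assumption:

```python
# Collect (paper, table) pairs for all tables tagged as sota results.
sota_tables = [
    (paper, table)
    for paper in pc
    for table in paper.tables
    if table.gold_tags == 'sota'
]
print(f"found {len(sota_tables)} sota tables")
```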
| 343 | + { |
| 344 | + "cell_type": "markdown", |
| 345 | + "metadata": {}, |
| 346 | + "source": [ |
| 347 | + "## Text Content\n", |
| 348 | + "Papers' content is represented using Elasticsearch document classes, so it can easily be `save()`'d to an existing Elasticsearch instance. Each `text` contains `title`, `abstract`, and `authors` fields. A paper's body is split into `fragments`." |
| 349 | + ] |
| 350 | + }, |
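The fields called out above can be read directly off `paper.text`. A short sketch; the per-fragment attributes used here (`header`) are assumptions inferred from the section-oriented structure, while `order` is confirmed by the fragment repr shown further down:

```python
text = paper.text
print(text.title)
print(text.abstract[:200])

# Walk the body section by section; the header attribute is an assumption.
for fragment in list(text.fragments)[:5]:
    print(fragment.order, fragment.header)
```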
293 | 351 | {
294 | 352 | "cell_type": "code",
295 | | - "execution_count": 9, |
| 353 | + "execution_count": 10, |
296 | 354 | "metadata": {},
297 | 355 | "outputs": [
298 | 356 | {
301 | 359 | "'Abstract We present a memory augmented neural network for natural language understanding: Neural Semantic Encoders. NSE is equipped with a novel memory update rule and has a variable sized encoding memory that evolves over time and maintains the understanding of input sequences through read , compose and write operations. NSE can also access 1 xxanchor-x1-2f1 multiple and shared memories. In this paper, we demonstrated the effectiveness and the flexibility of NSE on five different natural language tasks: natural language inference, question answering, sentence classification, document sentiment analysis and machine translation where NSE achieved state-of-the-art performance when evaluated on publically available benchmarks. For example, our shared-memory model showed an encouraging result on neural machine translation, improving an attention-based baseline by approximately 1.0 BLEU.'"
302 | 360 | ]
303 | 361 | },
304 | | - "execution_count": 9, |
| 362 | + "execution_count": 10, |
305 | 363 | "metadata": {},
306 | 364 | "output_type": "execute_result"
307 | 365 | }
312 | 370 | },
313 | 371 | {
314 | 372 | "cell_type": "code",
315 | | - "execution_count": 10, |
| 373 | + "execution_count": 11, |
316 | 374 | "metadata": {},
317 | 375 | "outputs": [
318 | 376 | {
514 | 572 | },
515 | 573 | {
516 | 574 | "cell_type": "code",
517 | | - "execution_count": 11, |
| 575 | + "execution_count": 12, |
518 | 576 | "metadata": {},
519 | 577 | "outputs": [
520 | 578 | {
524 | 582 | "Recently several studies have explored ways of extending the neural networks with an external memory [ xxref-Xgraves2014neural – xxref-Xgrefenstette2015learning ]. Unlike LSTM, the short term memories and the training parameters of such a neural network are no longer coupled and can be adapted. In this paper we propose a novel class of memory augmented neural networks called Neural Semantic Encoders (NSE) for natural language understanding. NSE offers several desirable properties. NSE has a variable sized encoding memory which allows the model to access entire input sequence during the reading process; therefore efficiently delivering long-term dependencies over time. The encoding memory evolves over time and maintains the memory of the input sequence through read , compose and write operations. NSE sequentially processes the input and supports word compositionality inheriting both temporal and hierarchical nature of human language. NSE can read from and write to a set of relevant encoding memories simultaneously or multiple NSEs can access a shared encoding memory effectively supporting knowledge and representation sharing. NSE is flexible, robust and suitable for practical NLU tasks and can be trained easily by any gradient descent optimizer.<Fragment(meta.id=1607.04315_1001, order=1001)>"
525 | 583 | ]
526 | 584 | },
527 | | - "execution_count": 11, |
| 585 | + "execution_count": 12, |
528 | 586 | "metadata": {},
529 | 587 | "output_type": "execute_result"
530 | 588 | }
538 | 596 | "execution_count": null,
539 | 597 | "metadata": {},
540 | 598 | "outputs": [],
541 | | - "source": [] |
| 599 | + "source": [ |
| 600 | + "paper.text.print_section(\"Machine Translation\")" |
| 601 | + ] |
542 | 602 | }
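Since the documents are Elasticsearch document classes, persisting a paper to a running instance should reduce to opening a connection and calling `save()`. A hypothetical sketch, assuming the classes follow `elasticsearch_dsl` conventions and an instance is reachable on localhost:

```python
from elasticsearch_dsl import connections

# Point elasticsearch_dsl at an existing instance (host is an assumption).
connections.create_connection(hosts=['localhost'], timeout=20)

# Persist one paper's text document; save() comes from the document class.
paper.text.save()
```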
543 | 603 | ],
544 | 604 | "metadata": {