|
2 | 2 | "cells": [
|
3 | 3 | {
|
4 | 4 | "cell_type": "markdown",
|
5 |
| - "id": "accessory-allen", |
| 5 | + "id": "registered-bahrain", |
6 | 6 | "metadata": {},
|
7 | 7 | "source": [
|
8 |
| - "# xbatcher Demo Notebook \n", |
| 8 | + "# xbatcher demo\n", |
9 | 9 | "\n",
|
10 | 10 | "Author: Cindy Chiao\n",
|
11 | 11 | "Last Modified: Nov 16, 2021\n",
|
|
21 | 21 | {
|
22 | 22 | "cell_type": "code",
|
23 | 23 | "execution_count": 1,
|
24 |
| - "id": "forward-tyler", |
| 24 | + "id": "engaged-nicaragua", |
25 | 25 | "metadata": {},
|
26 | 26 | "outputs": [],
|
27 | 27 | "source": [
|
|
32 | 32 | },
|
33 | 33 | {
|
34 | 34 | "cell_type": "markdown",
|
35 |
| - "id": "collect-medline", |
| 35 | + "id": "violent-walker", |
36 | 36 | "metadata": {},
|
37 | 37 | "source": [
|
38 |
| - "## Example Data\n", |
| 38 | + "## Example data\n", |
39 | 39 | "\n",
|
40 | 40 | "Here we will load an example dataset from a global climate model. The data is from the _historical_ experiment from CMIP6 and represents 60 days of daily max air temperature. "
|
41 | 41 | ]
|
42 | 42 | },
|
43 | 43 | {
|
44 | 44 | "cell_type": "code",
|
45 | 45 | "execution_count": 2,
|
46 |
| - "id": "stretch-greece", |
| 46 | + "id": "furnished-vanilla", |
47 | 47 | "metadata": {},
|
48 | 48 | "outputs": [
|
49 | 49 | {
|
|
566 | 566 | {
|
567 | 567 | "cell_type": "code",
|
568 | 568 | "execution_count": 3,
|
569 |
| - "id": "hollywood-battery", |
| 569 | + "id": "motivated-sustainability", |
570 | 570 | "metadata": {},
|
571 | 571 | "outputs": [
|
572 | 572 | {
|
|
599 | 599 | },
|
600 | 600 | {
|
601 | 601 | "cell_type": "markdown",
|
602 |
| - "id": "abroad-tours", |
| 602 | + "id": "muslim-policy", |
603 | 603 | "metadata": {},
|
604 | 604 | "source": [
|
605 |
| - "## Batch Generation\n", |
| 605 | + "## Batch generation\n", |
606 | 606 | "\n",
|
607 | 607 | "Xbatcher's `BatchGenerator` can be used to generate batches with several arguments controlling the exact behavior.\n",
|
608 | 608 | "\n",
|
|
614 | 614 | {
|
615 | 615 | "cell_type": "code",
|
616 | 616 | "execution_count": 4,
|
617 |
| - "id": "saving-equivalent", |
| 617 | + "id": "brown-danish", |
618 | 618 | "metadata": {},
|
619 | 619 | "outputs": [
|
620 | 620 | {
|
|
1040 | 1040 | },
|
1041 | 1041 | {
|
1042 | 1042 | "cell_type": "markdown",
|
1043 |
| - "id": "ruled-tonight", |
| 1043 | + "id": "supported-equipment", |
1044 | 1044 | "metadata": {},
|
1045 | 1045 | "source": [
|
1046 | 1046 | "We can verify that the outputs have the expected shapes. \n",
|
|
1051 | 1051 | {
|
1052 | 1052 | "cell_type": "code",
|
1053 | 1053 | "execution_count": 5,
|
1054 |
| - "id": "royal-consortium", |
| 1054 | + "id": "equal-profile", |
1055 | 1055 | "metadata": {},
|
1056 | 1056 | "outputs": [
|
1057 | 1057 | {
|
|
1069 | 1069 | },
|
1070 | 1070 | {
|
1071 | 1071 | "cell_type": "markdown",
|
1072 |
| - "id": "international-priest", |
| 1072 | + "id": "approximate-hurricane", |
1073 | 1073 | "metadata": {},
|
1074 | 1074 | "source": [
|
1075 | 1075 | "There are 145 lat points and 192 lon points, thus we're expecting 145 * 192 = 27840 samples in a batch."
|
|
1078 | 1078 | {
|
1079 | 1079 | "cell_type": "code",
|
1080 | 1080 | "execution_count": 6,
|
1081 |
| - "id": "center-lightweight", |
| 1081 | + "id": "identified-prototype", |
1082 | 1082 | "metadata": {},
|
1083 | 1083 | "outputs": [
|
1084 | 1084 | {
|
|
1096 | 1096 | },
|
1097 | 1097 | {
|
1098 | 1098 | "cell_type": "markdown",
|
1099 |
| - "id": "entire-renewal", |
| 1099 | + "id": "tropical-danish", |
1100 | 1100 | "metadata": {},
|
1101 | 1101 | "source": [
|
1102 | 1102 | "## Controlling the size/shape of batches\n",
|
|
1107 | 1107 | {
|
1108 | 1108 | "cell_type": "code",
|
1109 | 1109 | "execution_count": 7,
|
1110 |
| - "id": "aware-layer", |
| 1110 | + "id": "circular-array", |
1111 | 1111 | "metadata": {},
|
1112 | 1112 | "outputs": [
|
1113 | 1113 | {
|
|
1559 | 1559 | },
|
1560 | 1560 | {
|
1561 | 1561 | "cell_type": "markdown",
|
1562 |
| - "id": "returning-ozone", |
| 1562 | + "id": "broadband-romance", |
1563 | 1563 | "metadata": {},
|
1564 | 1564 | "source": [
|
1565 | 1565 | "## Last batch behavior\n",
|
|
1570 | 1570 | {
|
1571 | 1571 | "cell_type": "code",
|
1572 | 1572 | "execution_count": 8,
|
1573 |
| - "id": "configured-european", |
| 1573 | + "id": "funny-garbage", |
1574 | 1574 | "metadata": {},
|
1575 | 1575 | "outputs": [
|
1576 | 1576 | {
|
|
2005 | 2005 | },
|
2006 | 2006 | {
|
2007 | 2007 | "cell_type": "markdown",
|
2008 |
| - "id": "continued-boring", |
| 2008 | + "id": "affecting-preview", |
2009 | 2009 | "metadata": {},
|
2010 | 2010 | "source": [
|
2011 | 2011 | "## Overlapping inputs\n",
|
|
2017 | 2017 | {
|
2018 | 2018 | "cell_type": "code",
|
2019 | 2019 | "execution_count": 9,
|
2020 |
| - "id": "partial-emergency", |
| 2020 | + "id": "improved-coating", |
2021 | 2021 | "metadata": {},
|
2022 | 2022 | "outputs": [
|
2023 | 2023 | {
|
|
2473 | 2473 | },
|
2474 | 2474 | {
|
2475 | 2475 | "cell_type": "markdown",
|
2476 |
| - "id": "directed-punch", |
| 2476 | + "id": "contrary-throat", |
2477 | 2477 | "metadata": {},
|
2478 | 2478 | "source": [
|
2479 | 2479 | "We can inspect the samples in a batch for a single lat/lon pixel, noting that the overlap applies only within a batch and not across batches. Thus, within the 20 time points in a batch, we get 11 samples, each with 10 time points, where consecutive samples are allowed to share 9 time points."
|
|
2482 | 2482 | {
|
2483 | 2483 | "cell_type": "code",
|
2484 | 2484 | "execution_count": 10,
|
2485 |
| - "id": "prime-imaging", |
| 2485 | + "id": "accepting-hundred", |
2486 | 2486 | "metadata": {},
|
2487 | 2487 | "outputs": [
|
2488 | 2488 | {
|
|
2944 | 2944 | },
|
2945 | 2945 | {
|
2946 | 2946 | "cell_type": "markdown",
|
2947 |
| - "id": "graduate-kingdom", |
| 2947 | + "id": "enclosed-investing", |
2948 | 2948 | "metadata": {},
|
2949 | 2949 | "source": [
|
2950 | 2950 | "## Example applications\n",
|
|
2957 | 2957 | {
|
2958 | 2958 | "cell_type": "code",
|
2959 | 2959 | "execution_count": 11,
|
2960 |
| - "id": "difficult-recipe", |
| 2960 | + "id": "vulnerable-terminology", |
2961 | 2961 | "metadata": {},
|
2962 | 2962 | "outputs": [
|
2963 | 2963 | {
|
|
3005 | 3005 | },
|
3006 | 3006 | {
|
3007 | 3007 | "cell_type": "markdown",
|
3008 |
| - "id": "adopted-pound", |
| 3008 | + "id": "primary-dance", |
3009 | 3009 | "metadata": {},
|
3010 | 3010 | "source": [
|
3011 | 3011 | "We can also use Xarray's \"stack\" method to transform these into 2D inputs (n_samples, n_features) suitable for other machine learning algorithms implemented in libraries such as [sklearn](https://scikit-learn.org/stable/) and [xgboost](https://xgboost.readthedocs.io/en/stable/). In this case, we are expecting 9 x 9 x 9 = 729 features total."
|
|
3014 | 3014 | {
|
3015 | 3015 | "cell_type": "code",
|
3016 | 3016 | "execution_count": 12,
|
3017 |
| - "id": "injured-passage", |
| 3017 | + "id": "numerous-computer", |
3018 | 3018 | "metadata": {},
|
3019 | 3019 | "outputs": [
|
3020 | 3020 | {
|
|
3055 | 3055 | },
|
3056 | 3056 | {
|
3057 | 3057 | "cell_type": "markdown",
|
3058 |
| - "id": "assumed-characterization", |
| 3058 | + "id": "addressed-collapse", |
3059 | 3059 | "metadata": {},
|
3060 | 3060 | "source": [
|
3061 |
| - "## What's Next?\n", |
| 3061 | + "## What's next?\n", |
3062 | 3062 | "\n",
|
3063 | 3063 | "Many additional useful features have yet to be implemented in the context of batch generation for downstream machine learning model training. One of the current efforts is adding a set of data loaders (see [working PR for PyTorch data loader](https://github.com/pangeo-data/xbatcher/pull/25)). \n",
|
3064 | 3064 | "\n",
|
3065 | 3065 | "Additional features of interest can include: \n",
|
3066 | 3066 | "1. Handling overlaps across batches. The common use case of batching in machine learning training involves generating all samples, then grouping them into batches. When overlap is enabled, this yields different results compared to first generating batches and then creating the possible samples within each batch. \n",
|
| 3067 | + "\n", |
3067 | 3068 | "2. Shuffling/randomization of samples across batches. It is often desirable for each batch to be grouped randomly instead of along a specific dimension. \n",
|
| 3069 | + "\n", |
3068 | 3070 | "3. Being efficient in terms of memory usage. When overlap is enabled, each sample consists mostly of values repeated from adjacent samples. It would be beneficial if each batch/sample were generated lazily to avoid storing these duplicate values. \n",
|
| 3071 | + "\n", |
3069 | 3072 | "4. Handling preprocessing steps. For example, data augmentation, scaling/normalization, outlier detection, etc. \n",
|
3070 | 3073 | "\n",
|
| 3074 | + "\n", |
3071 | 3075 | "Interested users are welcome to open an issue on GitHub. "
|
3072 | 3076 | ]
|
| 3077 | + }, |
| 3078 | + { |
| 3079 | + "cell_type": "code", |
| 3080 | + "execution_count": null, |
| 3081 | + "id": "continent-property", |
| 3082 | + "metadata": {}, |
| 3083 | + "outputs": [], |
| 3084 | + "source": [] |
3073 | 3085 | }
|
3074 | 3086 | ],
|
3075 | 3087 | "metadata": {
|
|
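The sample counts quoted in the notebook text can be checked with a little arithmetic. The sketch below is not part of xbatcher; `n_windows` is a hypothetical helper that uses the usual sliding-window count, and it assumes the 60-day series is split into three non-overlapping 20-day batches. Under those assumptions it reproduces the 11 samples per batch mentioned in the "Overlapping inputs" text, and it illustrates why item 1 under "What's next?" matters: sampling within batches yields fewer samples than sampling the whole series first.

```python
def n_windows(dim_size: int, input_size: int, overlap: int = 0) -> int:
    """Count windows of length `input_size` that fit along a dimension of
    length `dim_size` when consecutive windows share `overlap` points.

    Hypothetical helper for illustration only; not an xbatcher function.
    """
    stride = input_size - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than input_size")
    if dim_size < input_size:
        return 0
    return (dim_size - input_size) // stride + 1


# Figures quoted in the notebook text: 20 time points per batch,
# samples of 10 time points, 9 of which may overlap.
samples_per_batch = n_windows(20, 10, overlap=9)

# Assumption: the 60-day series is cut into non-overlapping 20-day batches.
n_batches = n_windows(60, 20)

# Batch first, then sample within each batch (overlap never crosses batches).
batch_then_sample = n_batches * samples_per_batch

# Sample the whole series first, then group the samples into batches.
sample_then_batch = n_windows(60, 10, overlap=9)

print(samples_per_batch, n_batches, batch_then_sample, sample_then_batch)
```

The gap between the last two numbers comes from the windows that would straddle a batch boundary, which the batch-first approach never produces.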