# Data for your package

In this section we talk about data for your scientific Python package:
when you would need it, and how you can access it and provide it to your users.

```{admonition} Attribution
:class: note
Some of this material is adapted from:

* <https://www.dampfkraft.com/code/distributing-large-files-with-pypi.html>
* <https://learn.scientific-python.org/development/patterns/data-files/>
```

## When and why you might need data

Basically, there are two cases where your package needs data:
for examples, and for tests.
We'll talk through both in the next couple of sections.

### Data for example usage

It's very common for scientific Python packages to need data that helps their users understand how to use the library.
Often the package provides functionality to access this data,
either by loading it from inside the source code, or by downloading it from a remote host.
In fact, the latter approach is so common that libraries like pooch have been developed just to "fetch" data.
We will show you how to use both methods for providing access to data below,
but first we present some examples.

#### Examples in pyOpenSci packages
* movingpandas: <https://movingpandas.github.io/movingpandas-website/2-analysis-examples/bird-migration.html>

#### Examples in core scientific Python packages
* scikit-image: <https://github.com/scikit-image/scikit-image/tree/main/skimage/data>
* scikit-learn: <https://github.com/scikit-learn/scikit-learn/tree/main/sklearn/datasets/data>

### Data for tests

It is common to design your code and tests in such a way that you can quickly test on fake data,
ranging from something as simple as a NumPy array of zeros,
to something much more complex like a test suite that "mocks" data for a specific domain.
This lets you make sure the core logic of your code works, *without* needing real data.
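Here is a minimal sketch of a test that runs on fake data; the `normalize` function and the test name are hypothetical stand-ins for your own code.

```python
import numpy as np


def normalize(signal: np.ndarray) -> np.ndarray:
    """Hypothetical library function: scale a signal to the range [0, 1]."""
    span = signal.max() - signal.min()
    if span == 0:
        # A constant signal has no range to scale; return zeros of the same shape.
        return np.zeros_like(signal)
    return (signal - signal.min()) / span


def test_normalize_handles_constant_signal():
    # Fake data: an array of zeros exercises an edge case without any real files.
    fake_data = np.zeros(10)
    assert np.array_equal(normalize(fake_data), fake_data)
```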
At the end of the day, though, you do want to make sure your code works on real data,
especially if it is scientific code that may work with very specific data formats.
That's why you will often want at least a small amount of real-world test data.
A good rule of thumb is to have a handful of small files,
say no more than 10 files that are a maximum of 50 MB each.
Anything more than that, you will probably want to store online and download,
for reasons we describe in the next section.

## Why you should prefer to download data: size limits

Below we will introduce places you can store data online, and show you tools you can use to download that data.
We suggest you prefer this approach when possible.
The main reason is that there are limits on file and project sizes for forges, like GitHub and GitLab,
and for package indexes--most importantly, PyPI.
Especially with scientific datasets, which can be quite large,
we want to be good citizens of the ecosystem and not place unnecessary demands on the common infrastructure.

### Forges (GitHub, GitLab, Bitbucket, etc.)

Forges for hosting source code have maximum sizes for both files and projects.
For example, on GitHub, a single file cannot be more than 100 MB.
You would be surprised how quickly you can make a CSV file this big!
You also want to avoid committing larger binary files (like images or audio)
to a version control system like git, because it is hard to go back and remove them later,
and they can really slow down the speed with which you can clone the project.
More importantly, they slow down the speed with which potential contributors can clone your project!

### Data size and PyPI

The Python Package Index (PyPI) places a limit on the size of each individual file uploaded--where a "file" is either
an sdist or a wheel--and also a limit on the total size of the project (the sum of all its "files").
These limits are not clearly documented as far as we can tell,
but most estimates are around 100 MB per file and 1 GB for the total project.
Files this large place a real strain on the resources supporting PyPI.
For this reason, as a good citizen of the Python ecosystem you should do everything you can to minimize your impact.
Don't worry, we're here to help you do that!
You can request increases for both the file size limit and the project size limit
(see [here](https://pypi.org/help/#file-size-limit)
and [here](https://pypi.org/help/#project-size-limit)),
but we strongly suggest you read about the other options described below first.

## Where to store your data

Alright, we're strongly suggesting you don't try to cram your data into your code--so where should you store it?
Here we provide several options.

### Within the repository

As stated above, there *are* cases where relatively small datasets
can be included in a package.
If this data consists of examples for usage,
then you would likely put it inside your source code,
so that it will be included in the sdist and wheel.
If the data is meant only for tests,
and you have a separate test directory outside your package (as we suggest),
then the data can simply live in that directory in your repository,
and it does not need to be included in the distributions you upload to PyPI.

* Strengths and weaknesses
  * Strengths
    * Easy to access
    * Can be very doable for smaller files, e.g., text files used in bioinformatics
  * Weaknesses
    * Maximum file sizes on forges like GitHub and on PyPI
    * You want to avoid adding these files to your version control history (git) and draining the resources of PyPI
* Examples:
  * pyOpenSci packages:
    * OpenOmics: <https://github.com/JonnyTran/OpenOmics/tree/master/tests/data/TCGA_LUAD>
    * jointly: <https://github.com/hpi-dhc/jointly/tree/master/test-data>
  * Core scientific Python packages:
    * scikit-learn: <https://github.com/scikit-learn/scikit-learn/tree/main/sklearn/datasets/data>

### In the cloud

#### Scientific data repositories
* Strengths and weaknesses
  * Strengths: free, guaranteed lifetime of the dataset, often appropriate for pyOpenSci packages
  * Weaknesses: may be hard to automate for data that changes frequently
* Examples
  * Zenodo
  * OSF
  * FigShare
  * Dryad (charges a data publication fee)

#### Private cloud
* Strengths and weaknesses
  * Strengths: robust, and tooling exists to more easily automate updating the data, but this requires more technical know-how
  * Weaknesses: not free
* Examples
  * AWS
  * Google Cloud
  * Linode

```{admonition} Data version control
:class: tip

Did you know that tools exist that let you track changes to datasets, in the same way version control systems like git
let you track changes to code? Although you don't strictly need data versioning to include data with your package,
you probably want to be aware that such tools exist if you are reading this section.
Such tools could be particularly important if your package focuses mainly on providing access to datasets.
Within science, tools such as [DataLad](https://www.datalad.org/) have been developed to provide distributed access to versioned datasets.
Related tools that are used for data engineering in industry include Git LFS, DVC, and Pachyderm.
```

```{admonition} Field-specific standards and metadata
:class: tip

It's important to be aware of field-specific standards for data and metadata,
for example the FITS format in astronomy, or the NWB format and the DANDI archive in neuroscience.
Many pyOpenSci packages exist to address these standards,
or to provide interoperability where such standards don't exist.
See also: the FAIR data principles.
```

## How to access your data

Last but definitely not least, it's important to understand how you *and* your users
can access this data.

### For examples: in documentation, tutorials, docstrings, etc.

For example data, there are two common approaches:
loading files that you include inside your package, and downloading files from a remote host.
We describe each approach below.

#### Accessing local files with importlib.resources

If you have included data files in your source code, then you can provide access to them with the standard-library module
[`importlib.resources`](https://docs.python.org/3/library/importlib.resources.html),
which loads files from inside an installed package regardless of how or where it was installed
(for background, see Barry Warsaw's PyCon talk on `importlib.resources`).
The `files()` API was added in Python 3.9; if you support older Python versions,
the [importlib_resources](https://pypi.org/project/importlib_resources/) backport on PyPI provides the same API.
A minimal sketch is shown below, followed by real-world examples.
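In this sketch, the sub-package `mypackage.data` and the file `example.csv` are hypothetical names; substitute your own package layout.

```python
from importlib import resources  # on Python < 3.9, use the importlib_resources backport


def load_example_data() -> str:
    """Return the contents of the example data file that ships inside the package."""
    # "mypackage.data" and "example.csv" are placeholder names --
    # replace them with your own sub-package and file.
    data_file = resources.files("mypackage.data").joinpath("example.csv")
    return data_file.read_text(encoding="utf-8")
```

Because `files()` works however the package was installed (wheel, sdist, or editable install), it is more robust than building paths from `__file__` by hand.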
* Examples:
  * pyOpenSci packages:
    * crowsetta: <https://github.com/vocalpy/crowsetta/blob/main/src/crowsetta/data/data.py>
  * Core scientific Python packages:
    * scikit-learn: <https://github.com/scikit-learn/scikit-learn/blob/f86f41d80bff882689fc16bd7da1fef4a805b464/sklearn/datasets/_base.py#L297>

#### Accessing files that are hosted remotely with pooch

[pooch](https://github.com/fatiando/pooch) is a library built just for fetching data files used by scientific packages.
You declare a registry of remote files along with their hashes;
the first time a file is requested, pooch downloads it to a local cache and verifies the hash,
and on later requests it simply returns the cached copy.
A minimal sketch of using pooch is shown below.
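In this sketch, the `base_url`, file name, and hash are hypothetical placeholders; point them at wherever your data is actually hosted and use the file's real checksum.

```python
import pooch

# The base_url, file name, and hash below are placeholders for illustration only.
EXAMPLE_DATA = pooch.create(
    # Store downloads in an OS-appropriate cache directory for "mypackage"
    path=pooch.os_cache("mypackage"),
    base_url="https://github.com/yourname/mypackage-data/raw/main/",
    registry={
        "example.csv": "sha256:0123456789abcdef",  # replace with the real checksum
    },
)


def fetch_example_data() -> str:
    """Download the example dataset if needed and return the local file path.

    ``fetch`` downloads on first use, checks the hash, and caches the file;
    later calls just return the cached path.
    """
    return EXAMPLE_DATA.fetch("example.csv")
```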

### For tests

Many of the same tools apply to test data.
If the data lives online, you can download it as a set-up step in your continuous integration workflow,
or fetch it lazily with a tool like pooch the first time the test suite runs.
Either way, a common pattern is to write pytest fixtures that give your tests access to the data,
as in the sketch below.
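Here is a minimal sketch of such fixtures in a `conftest.py`; the `tests/data` directory and the file name are assumptions about your repository layout.

```python
# tests/conftest.py
from pathlib import Path

import pytest

# Assumes your small, real-world test files are committed under tests/data/
TEST_DATA_DIR = Path(__file__).parent / "data"


@pytest.fixture
def data_dir() -> Path:
    """The directory holding the test data files."""
    return TEST_DATA_DIR


@pytest.fixture
def example_csv(data_dir) -> Path:
    """Path to one specific test file, for tests that need it by name."""
    return data_dir / "example.csv"
```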