# Data for your package

In this section we talk about data for your scientific Python package:
when you might need it, how you can access it, and how you can provide it to your users.

```{admonition} Attribution
:class: note
Some material in this section is adapted from:

* <https://www.dampfkraft.com/code/distributing-large-files-with-pypi.html>
* <https://learn.scientific-python.org/development/patterns/data-files/>
```

## When and why you might need data

First, when and why might you need data?
Basically, there are two cases: data for examples, and data for tests.
We'll talk through both in the next couple of sections.

### Data for example usage

It's very common for scientific Python packages to include data that helps users understand how to use the package.
Often the package provides functionality to access this data,
either by loading it from inside the installed package, or by downloading it from a remote host.
In fact, the latter approach is so common that libraries such as pooch have been developed just to "fetch" data.
We will show you how to use both methods below; first, some examples.

#### Examples in pyOpenSci packages

* movingpandas: <https://movingpandas.github.io/movingpandas-website/2-analysis-examples/bird-migration.html>

#### Examples in core scientific Python packages

* scikit-image: <https://github.com/scikit-image/scikit-image/tree/main/skimage/data>
* scikit-learn: <https://github.com/scikit-learn/scikit-learn/tree/main/sklearn/datasets/data>

### Data for tests

It is common to design your code and tests so that you can quickly test on fake data,
ranging from something as simple as a NumPy array of zeros
to something much more complex, like a test suite that "mocks" data for a specific domain.
This lets you make sure the core logic of your code works, *without* needing real data.
At the end of the day, though, you do want to make sure your code works on real data,
especially if it is scientific code that works with very specific data formats.
That's why you will often want at least a small amount of real-world test data.
A good rule of thumb is to have a handful of small files:
say, no more than 10 files that are a maximum of 50 MB each.
Anything more than that, you will probably want to store online and download,
for reasons we describe in the next section.
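To make the "fake data" idea concrete, here is a minimal sketch. The function `spectral_power` is a hypothetical stand-in for real library code; the point is that a NumPy array of zeros is enough to check the core logic (shapes, types, edge-case values) without any real data.

```python
import numpy as np


def spectral_power(signal: np.ndarray) -> np.ndarray:
    """Hypothetical library function: power spectrum of a 1-D real signal."""
    return np.abs(np.fft.rfft(signal)) ** 2


def test_spectral_power_on_fake_data():
    # Fake data: an all-zeros signal must have zero power at every frequency,
    # and rfft of 1024 samples yields 1024 // 2 + 1 = 513 frequency bins.
    signal = np.zeros(1024)
    power = spectral_power(signal)
    assert power.shape == (513,)
    assert np.all(power == 0.0)
```

Tests like this run fast and need nothing checked in to the repository; the small real-world files described above are then reserved for the handful of tests that exercise actual data formats.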

## Why you should prefer to download data: size limits

Below we introduce places where you can store data online, and show you tools you can use to download that data.
We suggest you prefer this approach when possible.
The main reason is that forges, like GitHub and GitLab,
and package indexes--most importantly, PyPI--place limits on file and project sizes.
Scientific datasets in particular can be quite large,
and we want to be good citizens of the ecosystem who do not place unnecessary demands on its common infrastructure.

### Forges (GitHub, GitLab, BitBucket, etc.)

Forges for hosting source code have maximum sizes for both files and projects.
For example, on GitHub a single file cannot be more than 100 MB.
You would be surprised how quickly a CSV file can get this big!
You also want to avoid committing larger binary files (like images or audio)
to a version control system like git, because it is hard to go back and remove them later,
and they can really slow down cloning the project.
More importantly, they slow down cloning for potential contributors!
### Data size and PyPI

The Python Package Index (PyPI) places a limit on the size of each individual file uploaded--where a "file" is either
an sdist or a wheel--and also a limit on the total size of the project (the sum of all its "files").
These limits are not documented as far as we can tell,
but most estimates are around 100 MB per file and 1 GB for the total project.
Files this large place a real strain on the resources supporting PyPI.
For this reason, as a good citizen of the Python ecosystem, you should do everything you can to minimize your impact.
Don't worry, we're here to help you do that!
You can request increases for both the [file size limit](https://pypi.org/help/#file-size-limit)
and the [project size limit](https://pypi.org/help/#project-size-limit),
but we strongly suggest you first read about the other options described below.
## Where to store your data

Alright--we're strongly suggesting that you not try to cram all your data into your package. So where should you store it?
Here we present several options.

### Within the repository

As stated above, there *are* cases where relatively small datasets
can be included in a package.
If the data consists of examples for usage,
then you would likely put it inside your source code,
so that it will be included in the sdist and wheel.
If the data is meant only for tests,
and you have a separate test directory (as we suggest),
then you can keep the data in that directory, outside of the package that gets built and installed.
* Strengths
  * Easy to access
  * Can be very doable for smaller files, e.g., the text files used in bioinformatics
* Weaknesses
  * Forges like GitHub and package indexes like PyPI impose maximum file sizes
  * You want to avoid adding these files to your version control (git) history and draining the resources of PyPI
* Examples
  * pyOpenSci packages:
    * OpenOmics: <https://github.com/JonnyTran/OpenOmics/tree/master/tests/data/TCGA_LUAD>
    * jointly: <https://github.com/hpi-dhc/jointly/tree/master/test-data>
  * Core scientific Python packages:
    * scikit-learn: <https://github.com/scikit-learn/scikit-learn/tree/main/sklearn/datasets/data>
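If you do ship data inside your package, your build backend has to be told to include it in wheels. As a minimal sketch for a setuptools-based `pyproject.toml`--where the package name `mypkg` and the `data/*.csv` path are assumptions about your layout, not a fixed convention:

```toml
# pyproject.toml (fragment)
[tool.setuptools.package-data]
# include CSV files that live at src/mypkg/data/ in built wheels
mypkg = ["data/*.csv"]
```

Other build backends (hatchling, flit, etc.) have their own, similar settings for包including package data; check your backend's documentation for the equivalent.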
### In the cloud

#### Scientific data repositories

* Strengths
  * Free; datasets are given a guaranteed lifetime; often appropriate for pyOpenSci packages
* Weaknesses
  * May be hard to automate for data that changes frequently
* Examples
  * Zenodo
  * OSF
  * FigShare
  * Dryad (charges a data publication fee)

#### Private cloud

* Strengths
  * Robust; tooling exists to make it easier to automate updating your data, though this requires more technical know-how
* Weaknesses
  * Not free
* Examples
  * AWS
  * Google Cloud
  * Linode

```{admonition} Data version control
:class: tip

Did you know that there are tools that let you track changes to datasets, the same way version control systems like git
let you track changes to code? Although you don't strictly need data versioning to include data with your package,
you probably want to be aware that such tools exist if you are reading this section.
They can be particularly important if your package focuses mainly on providing access to datasets.
Within science, tools like [DataLad](https://www.datalad.org/) have been developed to provide distributed access to versioned datasets.
Related tools used in data engineering and industry include Git LFS, DVC, and Pachyderm.
```
```{admonition} Field-specific standards and metadata
:class: tip

It's important to be aware of field-specific standards for data formats and metadata:
for example, the FITS format in astronomy, or the NWB (Neurodata Without Borders) standard and the DANDI archive in neuroscience.
Many pyOpenSci packages exist either to support such standards, or to provide interoperability where standards are lacking.
See also the FAIR (Findable, Accessible, Interoperable, Reusable) data principles.
```

## How to access your data

Last but definitely not least, it's important to understand how you *and* your users
can access your data, wherever it is stored.

### For examples: in documentation, tutorials, docstrings, etc.

### Accessing local files with importlib.resources

If you have included data files in your source code, you can provide access to them through the standard-library module `importlib.resources`.
Its `files()` API was added in Python 3.9; on older Python versions you can use the `importlib_resources` backport from PyPI.
For background on the module, see Barry Warsaw's PyCon talk on `importlib.resources`.
* Examples:
  * pyOpenSci packages:
    * crowsetta: <https://github.com/vocalpy/crowsetta/blob/main/src/crowsetta/data/data.py>
  * Core scientific Python packages:
    * scikit-learn: <https://github.com/scikit-learn/scikit-learn/blob/f86f41d80bff882689fc16bd7da1fef4a805b464/sklearn/datasets/_base.py#L297>
### Accessing files that are hosted remotely with pooch

[pooch](https://github.com/fatiando/pooch) is a library built just for this purpose:
it downloads a data file the first time you ask for it, caches it locally,
and verifies the download against a known hash so you can trust the file's contents.
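The core mechanism can be sketched with the standard library. This is an illustration of what pooch automates, not pooch's actual implementation; `fetch` is a hypothetical helper, and real code should simply call `pooch.retrieve(url, known_hash=...)`:

```python
import hashlib
import os
import tempfile
import urllib.request


def fetch(url, known_hash, cache_dir=None):
    """Download ``url`` into a local cache and verify its SHA-256 hash.

    A stripped-down sketch of what ``pooch.retrieve(url, known_hash=...)``
    handles for you (plus progress bars, registries of many files, and
    post-download processing like unpacking archives).
    """
    cache_dir = cache_dir or os.path.join(tempfile.gettempdir(), "my-data-cache")
    os.makedirs(cache_dir, exist_ok=True)
    dest = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(dest):
        # urlretrieve handles http(s):// as well as file:// URLs
        urllib.request.urlretrieve(url, dest)
    with open(dest, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != known_hash:
        raise ValueError(f"hash mismatch for {dest!r}: got {digest}")
    return dest
```

The hash check is what makes downloaded data trustworthy and reproducible: if the file on the remote host changes, the download fails loudly instead of silently changing your examples or tests.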

### For tests

Many of the same tools apply to test data.
You can download test data as a set-up step in your continuous integration (CI) workflow,
and you can use pytest fixtures to give your tests access to the data files.
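A fixture for test data might be sketched like this in a `conftest.py`; the `tests/data/` location and the file name `example.csv` are assumptions about your layout:

```python
# conftest.py -- pytest discovers fixtures defined here automatically
from pathlib import Path

import pytest

# Assumption: small real-world files are kept in tests/data/, next to this file
DATA_DIR = Path(__file__).parent / "data"


@pytest.fixture
def example_csv():
    """Path to a small real-world CSV file shared across tests."""
    path = DATA_DIR / "example.csv"
    if not path.exists():
        pytest.skip("example.csv not present; download it first (e.g., in CI)")
    return path
```

A test then requests the fixture by name, `def test_load(example_csv): ...`, and pytest passes in the path; the `skip` keeps the suite usable even before the data has been downloaded.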
