-
Notifications
You must be signed in to change notification settings - Fork 72
[review this in October 2025] Add: data section to guide #110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,186 @@ | ||
# Data for your package | ||
|
||
In this section we talk about data for your scientific Python package: | ||
when you would need it, and how you can access it and provide it to your users. | ||
|
||
```{admonition} | ||
:class: note | ||
Some material adapted from: | ||
https://www.dampfkraft.com/code/distributing-large-files-with-pypi.html | ||
https://learn.scientific-python.org/development/patterns/data-files/ | ||
``` | ||
|
||
## When and why you might need data | ||
|
||
First we describe when and why you might need data. | ||
Basically there are two cases: | ||
for examples, and for tests. | ||
We'll talk through both in the next couple of sections. | ||
|
||
### Data for example usage | ||
|
||
It's very common for scientific Python packages to need data that helps their users understand how the library is to be used. | ||
Often the package provides functionality to access this data, | ||
either by loading it from inside the source code, or by downloading it off of a remote host. | ||
In fact, the latter approach is so common that libraries have been developed just to "fetch" data, | ||
like pooch. | ||
We will show you how to use both methods for providing access to data below, | ||
but here we present some examples. | ||
|
||
#### Examples in pyOpenSci packages | ||
* movingpandas: <https://movingpandas.github.io/movingpandas-website/2-analysis-examples/bird-migration.html> | ||
|
||
#### Examples in core scientific Python packages | ||
* scikit-image: <https://github.com/scikit-image/scikit-image/tree/main/skimage/data> | ||
* scikit-learn: <https://github.com/scikit-learn/scikit-learn/tree/main/sklearn/datasets/data> | ||
|
||
### Data for tests | ||
It is common to design your code and tests in such a way that you can quickly test on fake data, | ||
ranging from something as simple as a NumPy array of zeros, | ||
to something much more complex like a test suite that "mocks" data for a specific domain. | ||
This lets you make sure the core logic of your code works, *without* needing real data. | ||
At the end of the day though, you do want to make sure your code works on real data, | ||
especially if it is scientific code that may work with very specific data formats. | ||
That's why you will often want at least a small amount of real-world test data. | ||
A good rule of thumb is to have a handful of small files, | ||
say no more than 10 files that are a maximum of 50 MB each. | ||
Anything more than that you will probably want to store on-line and download, | ||
for reasons we describe in the next section. | ||
|
||
## Why you should prefer to download data: size limits | ||
|
||
Below we will introduce places you can store data on-line, and show you tools you can use to download that data. | ||
We suggest you prefer this approach when possible, | ||
The main reason for this is that there are limits on file and project sizes for forges, like GitHub and GitLab, | ||
and for package indexes--most importantly, PyPI. | ||
Especially with scientific datasets that can be quite large, | ||
we want to be good citizens of the ecosystem and not place unneccesarry demands on the common infrastructure. | ||
|
||
### Forges (GitHub, GitLab, BitBucket, etc.) | ||
|
||
Forges for hosting source code have maximum sizes for both files and projects. | ||
For example, on GitHub, a single file cannot be more than 100 MB. | ||
You would be surprised how quickly you can make a csv file this big! | ||
You also want to avoid committing larger binary files (like images or audio) | ||
to a version control system like git, because it is hard to go back and remove them later, | ||
and it can really slow down the speed with which you can clone the project. | ||
More importantly, it slows down the speed with which potential contributors can clone your project! | ||
|
||
### Data size and PyPI | ||
|
||
The Python Package Index (PyPI) places a limit on the size of individual files uploaded--where a "file" is either | ||
a sdist or a wheel--and also a limit on the total size of the project (the sum of all the "files"). | ||
These limits are not documented as far as we can tell, | ||
but most estimates are around 100 MB per file and 1 GB for the total project. | ||
Files this large place a real strain on the resources supporting PyPI, as discussed here. | ||
For this reason, as a good citizen in the Python ecosystem you should do everything you can to minimize your impact. | ||
Don't worry, we're here to help you do that! | ||
You can request increases for both file size and project size | ||
(see [here](https://pypi.org/help/#file-size-limit) | ||
and [here](https://pypi.org/help/#project-size-limit)) | ||
but we strongly suggest you read about other options here first. | ||
|
||
## Where to store your data | ||
|
||
Alright, we're strongly suggesting you don't try to cram your data into your code--where should you store it? | ||
Here we provide several options. | ||
|
||
### Within the repository | ||
|
||
As stated above, there *are* cases where relatively small datasets | ||
can be included in a package. | ||
If this data consists of examples for usage, | ||
then you would likely put it inside your source code | ||
so that it will be included in the sdist and wheel. | ||
If the data is meant only for tests, | ||
and you have a separate test directory (as we suggest) | ||
|
||
* strengths and weaknesses | ||
* Strengths | ||
* easy to access | ||
* can be very do-able for smaller files, e.g. text files used in bioinformatics | ||
* Weaknesses | ||
* maximum file sizes on forges like GitHub and on PyPI | ||
* You want to avoid adding these files to your version control history (git) and draining the resources of PyPI | ||
* examples: | ||
* pyOpenSci: | ||
* opengenomics: <https://github.com/JonnyTran/OpenOmics/tree/master/tests/data/TCGA_LUAD> | ||
* jointly: <https://github.com/hpi-dhc/jointly/tree/master/test-data> | ||
* core scientific-python packages | ||
* scikit-learn: <https://github.com/scikit-learn/scikit-learn/tree/main/sklearn/datasets/data> | ||
|
||
### In the cloud | ||
|
||
#### scientific data repositories | ||
* strengths and weaknesses | ||
* strength: free, guaranteed lifetime of dataset, often appropriate for pyOpenSci packages | ||
* weaknesses: may be hard to automate for data that changes frequently | ||
* examples | ||
* Zenodo | ||
* OSF | ||
* FigShare | ||
* Dryad (paid?) | ||
|
||
#### private cloud | ||
* strengths and weaknesses | ||
* strengths: robust; tooling exists to more easily automate updating of packages, but this requires more technical know-how | ||
* weaknesses: not free | ||
* examples | ||
* AWS | ||
* Google Cloud | ||
* Linode | ||
|
||
```{admonition} Data version control | ||
:class: tip | ||
|
||
Did you know that tools exist that let you track changes to datasets in the same way version control systems like git | ||
lets you track changes to code? Although you don't strictly need data versioning to include data with your package, | ||
you probably want to be aware that such tools exist if you are reading this section. | ||
Such tools could be particularly important if your package focuses mainly on providing access to datasets. | ||
Within science, tools have been developed to provide distributed access to datasets. These tools | ||
general | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. incomplete sentence |
||
DataLad https://www.datalad.org/ | ||
related tools that are used for data engineering and industry (maybe breakout?) | ||
Git-LFS | ||
DVC | ||
Pachyderm (I think it's called?) | ||
``` | ||
|
||
```{admonition} Field specific standards + metadata | ||
:class: tip | ||
|
||
It's important to be aware of field-specific standards | ||
eg astronomy | ||
neuroscience: DANDI, NWB | ||
many pyOpenSci tools exist to address these standards or to provide interoperability because these standards don't exist | ||
see also: FAIR data | ||
``` | ||
|
||
## How to access your data | ||
|
||
Last but definitely not least, it's important to understand how you *and* your users | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. incomplete sentence |
||
|
||
### For examples: in documentation, tutorials, docstrings, etc. | ||
|
||
### Accessing local files with importlib-resources | ||
|
||
If you have included data files in your source code, then you can provide access to these through importlib-resources. | ||
|
||
link to PyCon talk w/Barry Warsaw | ||
code snippet example here | ||
mention python-3.9 backport | ||
* examples: | ||
* pyOpenSci packages: | ||
* crowsetta: <https://github.com/vocalpy/crowsetta/blob/main/src/crowsetta/data/data.py> | ||
* core scientific Python packages: | ||
* scikit-learn: <https://github.com/scikit-learn/scikit-learn/blob/f86f41d80bff882689fc16bd7da1fef4a805b464/sklearn/datasets/_base.py#L297> | ||
|
||
### Accessing files that are hosted remotely with pooch | ||
pooch: | ||
https://github.com/fatiando/pooch | ||
code snippet example of using pooch | ||
|
||
### For tests | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I get the impression is is much less useful to scientific programming, but do we want to mention that the data could instead be fully generated, with something like faker or hypothesis, and no data files need to be checked in at all? At least for tests. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @ucodery, this has been open for so long! Essentially, David wrote a draft years ago, and we were talking about it in Slack. At some point, I think I suggested to David that we put it online and let folks add to it. I think some parts of it are still in bullet format. We could merge, OR we could also add to the content by suggesting inline edits to sections that are in bullet format. I'm super open to an approach, or if you want me to decide a strategy, I would lean into - let's all work on sections and get it to a point where we can merge it. Just a thought!! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am okay with either approach. But I think unless we actively bring up the conversation again on slack, or make this part of a documentation sprint, there won't be much if any added content. I think the best thing is to get this PR pushed as soon as we can while still looking professional and not including any incorrect information. We can still discuss or sprint on it after a push too. My thinking is that having some information up online is likely to attract more contribution than a long-standing PR. People online are more likely to come to us with corrections or additional parts if they find it on our guide, and probably aren't idly looking through our github. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. cool. I just updated the branch, so our preview builds. I'll have a look at it, and then let's plan on a sprint of some kind. MAYBE we start with a hackmd sprint to get it over the finish line to "good enough" to merge? IF i pull this down, create a new branch and repush i can link it to our hackmd for easy editing on that branch. Let me give this a try and we can go from there! |
||
Many of the same tools apply. | ||
You can download test data as a set-up step in your CI. | ||
Pytest fixtures for accessing test data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
incomplete sentence.