From 277c5769d215d407c56746d3856d1822a560d605 Mon Sep 17 00:00:00 2001
From: David Nicholson
Date: Fri, 27 Oct 2023 09:17:23 -0400
Subject: [PATCH 1/2] WIP: Add data/intro.md

---
 data/intro.md | 186 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 186 insertions(+)
 create mode 100644 data/intro.md

diff --git a/data/intro.md b/data/intro.md
new file mode 100644
index 000000000..9b100ecf2
--- /dev/null
+++ b/data/intro.md
@@ -0,0 +1,186 @@
# Data for your package

In this section we talk about data for your scientific Python package:
when you need it, and how you can access it and provide it to your users.

```{admonition} Sources
:class: note
Some material on this page is adapted from:

* https://www.dampfkraft.com/code/distributing-large-files-with-pypi.html
* https://learn.scientific-python.org/development/patterns/data-files/
```

## When and why you might need data

There are two main cases in which a package needs data:
for examples and for tests.
We'll talk through both in the next couple of sections.

### Data for example usage

It's very common for scientific Python packages to ship data that helps users understand how the library is used.
Often the package provides functionality to access this data,
either by loading it from inside the source code or by downloading it from a remote host.
The latter approach is so common that libraries have been developed just to "fetch" data,
like Pooch.
We will show you both methods for providing access to data below,
but first we present some examples.

#### Examples in pyOpenSci packages

* movingpandas:

#### Examples in core scientific Python packages

* scikit-image:
* scikit-learn:

### Data for tests

It is common to design your code and tests so that you can quickly test on fake data,
ranging from something as simple as a NumPy array of zeros
to something much more complex, like a test suite that "mocks" data for a specific domain.
This lets you make sure the core logic of your code works *without* needing real data.
At the end of the day, though, you do want to make sure your code works on real data,
especially if it is scientific code that works with very specific data formats.
That's why you will often want at least a small amount of real-world test data.
A good rule of thumb is to keep a handful of small files,
say no more than 10 files that are at most 50 MB each.
Anything more than that, you will probably want to store online and download,
for reasons we describe in the next section.

## Why you should prefer to download data: size limits

Below we introduce places where you can store data online, and tools you can use to download that data.
We suggest you prefer this approach when possible.
The main reason is that there are limits on file and project sizes both for forges, like GitHub and GitLab,
and for package indexes--most importantly, PyPI.
Scientific datasets in particular can be quite large,
and we want to be good citizens of the ecosystem and not place unnecessary demands on the common infrastructure.

### Forges (GitHub, GitLab, Bitbucket, etc.)

Forges for hosting source code have maximum sizes for both files and projects.
For example, on GitHub a single file cannot be more than 100 MB.
You would be surprised how quickly you can make a CSV file this big!
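If you want to convince yourself, here is a quick, self-contained sketch (the file name, row count, and column count are arbitrary) that writes a million rows of fake measurements and sails past GitHub's 100 MB limit:

```python
import csv
import os
import random

# Write 1,000,000 rows x 10 columns of floating-point "measurements".
# At roughly 18 bytes per value, this comes out to about 180 MB.
with open("simulated.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([f"var_{i}" for i in range(10)])
    for _ in range(1_000_000):
        writer.writerow([f"{random.random():.15f}" for _ in range(10)])

print(f"{os.path.getsize('simulated.csv') / 1e6:.0f} MB")
```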
You also want to avoid committing larger binary files (like images or audio)
to a version control system like git, because it is hard to go back and remove them later,
and they can really slow down cloning the project.
More importantly, they slow down the speed with which potential contributors can clone your project!

### Data size and PyPI

The Python Package Index (PyPI) places a limit on the size of each individual file uploaded--where a "file" is either
an sdist or a wheel--and also a limit on the total size of the project (the sum of all the "files").
These limits are not officially documented as far as we can tell,
but most estimates are around 100 MB per file and 1 GB for the total project.
Files this large place a real strain on the resources supporting PyPI.
For this reason, as a good citizen of the Python ecosystem, you should do everything you can to minimize your impact.
Don't worry, we're here to help you do that!
You can request increases for both file size and project size
(see [here](https://pypi.org/help/#file-size-limit)
and [here](https://pypi.org/help/#project-size-limit)),
but we strongly suggest you read about the other options below first.

## Where to store your data

Alright, we're strongly suggesting you don't try to cram your data into your package--so where should you store it?
Here we provide several options.

### Within the repository

As stated above, there *are* cases where relatively small datasets
can be included in a package.
If the data consists of usage examples,
then you would likely put it inside your source code
so that it is included in the sdist and wheel.
If the data is meant only for tests,
and you have a separate test directory (as we suggest),
then you can put the data in that directory instead.

* Strengths and weaknesses
  * Strengths
    * Easy to access
    * Very do-able for smaller files, e.g., text files used in bioinformatics
  * Weaknesses
    * Maximum file sizes on forges like GitHub and on PyPI
    * You want to avoid adding these files to your version control history (git) and draining the resources of PyPI
* Examples:
  * pyOpenSci packages:
    * opengenomics:
    * jointly:
  * Core scientific Python packages:
    * scikit-learn:

### In the cloud

#### Scientific data repositories

* Strengths and weaknesses
  * Strengths: free; guaranteed lifetime of the dataset; often appropriate for pyOpenSci packages
  * Weaknesses: may be hard to automate for data that changes frequently
* Examples
  * Zenodo
  * OSF
  * Figshare
  * Dryad (paid for some features)

#### Private cloud

* Strengths and weaknesses
  * Strengths: robust; tooling exists to more easily automate updating of data, though this requires more technical know-how
  * Weaknesses: not free
* Examples
  * AWS
  * Google Cloud
  * Linode

```{admonition} Data version control
:class: tip

Did you know that tools exist that let you track changes to datasets, the same way version control systems like git
let you track changes to code? Although you don't strictly need data versioning to include data with your package,
you should be aware that such tools exist if you are reading this section.
They can be particularly important if your package focuses mainly on providing access to datasets.

Within science, tools have been developed to provide distributed access to versioned datasets:

* DataLad: https://www.datalad.org/

Related tools are used for data engineering in industry:

* Git LFS
* DVC
* Pachyderm
```

```{admonition} Field-specific standards + metadata
:class: tip

It's important to be aware of field-specific standards for data formats and metadata.
For example, in neuroscience, DANDI and NWB are widely used.
Many pyOpenSci packages exist to support such standards, or to provide interoperability where standards are lacking.
See also: FAIR data principles.
```

## How to access your data

Last but definitely not least, it's important to understand how you *and* your users can access the data.

### For examples: in documentation, tutorials, docstrings, etc.

### Accessing local files with importlib-resources

If you have included data files in your source code, then you can provide access to them through importlib-resources.

* Link to PyCon talk with Barry Warsaw
* Code snippet example here
* Mention the backport for Python versions before 3.9
* Examples:
  * pyOpenSci packages:
    * crowsetta:
  * Core scientific Python packages:
    * scikit-learn:

### Accessing files that are hosted remotely with Pooch

* Pooch: https://github.com/fatiando/pooch
* Code snippet example of using Pooch

### For tests

Many of the same tools apply.
You can download test data as a set-up step in your CI,
and use pytest fixtures to access test data.

From ec668a4d621456aeea86cb70e9505a1e3aba2978 Mon Sep 17 00:00:00 2001
From: Leah Wasser
Date: Thu, 18 Sep 2025 19:15:46 -0600
Subject: [PATCH 2/2] enh: add data page to tests section

---
 data/intro.md         | 186 ----------------------
 index.md              |  15 +-
 tests/index.md        |  11 +-
 tests/package-data.md | 359 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 372 insertions(+), 199 deletions(-)
 delete mode 100644 data/intro.md
 create mode 100644 tests/package-data.md

diff --git a/data/intro.md b/data/intro.md
deleted file mode 100644
index 9b100ecf2..000000000
--- a/data/intro.md
+++ /dev/null
diff --git a/index.md b/index.md
index 0a6edc9ef..1b2843590 100644
--- a/index.md
+++ b/index.md
@@ -42,7 +42,6 @@ This guide will help you:
 You will also find best practice recommendations and curated lists of community resources surrounding packaging and package documentation.

 ::::
-
 ```{todo}
 TODO: change the navigation of docs to have a
@@ -56,11 +55,11 @@
 Community docs
 Publish your docs
 ```
+
 ## _new_ Tutorial Series: Create a Python Package

 The first round of our community-developed tutorial series on how to create a Python package for scientists is complete! Join our community review process, or watch development of future tutorials, in our [GitHub repo](https://github.com/pyOpenSci/python-package-guide).
-
 :::::{grid} 1 1 2 2
 :class-container: text-center
 :gutter: 3
@@ -107,7 +106,6 @@
 :::::
-
 ## Python Packaging for Scientists

 Learn about Python packaging best practices. You will also get to know the
@@ -195,7 +193,7 @@ Learn about best practices for:

 ## Tests

-*We are actively working on this section. [Follow development here.](https://github.com/pyOpenSci/python-package-guide)*
+_We are actively working on this section. [Follow development here.](https://github.com/pyOpenSci/python-package-guide)_

 :::::{grid} 1 1 2 2
 :class-container: text-center
@@ -227,7 +225,6 @@
 :class-container: text-center
 :gutter: 3
-
 ::::{grid-item}
 :::{card} ✨ Code style & Format ✨
 :class-card: left-aligned
@@ -249,8 +246,7 @@ contribute.

 :::::
-
 :::{figure} https://www.pyopensci.org/images/people-building-blocks.jpg
 :align: right
 :width: 350
 :alt: xkcd comic showing a stick figure on the ground and one in the air. The one on the ground asks, "You're flying! How?" The person in the air replies, "Python!" Below is a three-panel comic. Panel 1: "I learned it last night. Everything is so simple. Hello world is just print hello world." Panel 2, the person on the ground: "Come join us! Programming is fun again. It's a whole new world. But how are you flying?" Panel 3, the person flying: "I just typed import antigravity. I also sampled everything in the medicine cabinet. But I think this is the Python." The person on the ground: "That's it?"
@@ -286,7 +282,6 @@ If you have questions about our peer review process or packaging in general, you
 This living Python packaging guide is updated as tools and best practices evolve in the Python packaging ecosystem. We will be adding new content over the next year.
-
 :::{toctree}
 :hidden:
 :caption: Tutorials
@@ -310,16 +305,14 @@
 Documentation
 :::
-
 :::{toctree}
 :hidden:
-:caption: Testing
+:caption: Tests & Data

 Tests <tests/index>
 :::
-
 :::{toctree}
 :hidden:
 :caption: Continuous Integration
diff --git a/tests/index.md b/tests/index.md
index dc4666d5b..c7db35e5d 100644
--- a/tests/index.md
+++ b/tests/index.md
@@ -1,4 +1,5 @@
 (tests-intro)=
+
 # Tests and data for your Python package

 Tests are an important part of your Python package because they
@@ -9,7 +10,6 @@ In this section, you will learn more about the importance of writing
 tests for your Python package and how you can set up infrastructure
 to run your tests both locally and on GitHub.
-
 :::::{grid} 1 1 3 2
 :class-container: text-center
 :gutter: 3
@@ -62,7 +62,6 @@
 and different operating systems. Learn about setting up tests to run in Continuous Integration.
 :::
 ::::
 :::::
-
 :::{figure-md} fig-target
@@ -82,3 +81,11 @@
 Run tests locally
 Run tests online (using CI)
 Code coverage
 ```
+
+```{toctree}
+:hidden:
+:maxdepth: 2
+:caption: Data for Your Package
+
+Package Data <package-data>
+```
diff --git a/tests/package-data.md b/tests/package-data.md
new file mode 100644
index 000000000..6b0893bad
--- /dev/null
+++ b/tests/package-data.md
@@ -0,0 +1,359 @@
# Data for your package

Here, you will learn about working with data for your scientific Python package.
::::{admonition} What you will learn
:class: tip

* When and why you might need data
* Where you can store your package's data
* How you can access data from within your package, and how to download it when your package is used or when tests run

:::{note}
We adapted some of the material on this page from:

* <https://www.dampfkraft.com/code/distributing-large-files-with-pypi.html>
* <https://learn.scientific-python.org/development/patterns/data-files/>
:::
::::

## When and why you might need data

There are two main reasons you might need data when maintaining and using your package:

1. Data for example usage: This data helps your users understand how to use your package. It is often used in documentation, tutorials, and docstrings.
2. Data for tests: This data helps you and your contributors make sure your package works as expected. It is often used in unit, integration, and end-to-end tests.

We'll talk through both use cases next.

### Data for example package usage and tutorials

It's common for scientific Python packages to use example datasets in tutorials and examples that help users understand how to use the library.
Often, the package provides functionality to access this data,
either by loading it from inside the package itself or by downloading it from a remote host such as Figshare or another open repository.

This use case is so common that libraries exist just to "fetch" data,
like [Pooch](https://www.fatiando.org/pooch/latest/).

Below, we will show you how to implement both approaches: including data in your package and downloading it on demand.

#### Examples of how scientific Python packages use data

* **movingpandas** is a pyOpenSci-accepted package that uses data to support some of its tutorials. [Here is a tutorial that shows how to use MovingPandas to process bird migration data.](https://movingpandas.github.io/movingpandas-website/2-analysis-examples/bird-migration.html)
* [scikit-image](https://github.com/scikit-image/scikit-image/tree/main/skimage/data) stores data within the package itself for use in package examples.
* [scikit-learn](https://github.com/scikit-learn/scikit-learn/tree/main/sklearn/datasets/data) bundles several small datasets in its source code for the same purpose.

### Data for tests

It is common to design your code and tests so that you can quickly test on data that you create yourself. You can store created data in any format. It can be simple (for example, a NumPy array of zeros) or more complex (a test suite that "mocks" data for a specific domain, or data returned through an API call).

Including created data in your package lets you ensure the core logic of your code works without needing to download or use real data, which is often large.

However, you should still make sure that your code runs properly on real data formats and structures. This is why you should consider including a small amount of real-world test data, small enough that it doesn't make your package too large.

A good rule of thumb is to keep a handful of small files,
say no more than 10 files that are at most 50 MB each.
Anything more than that, you will probably want to store online and download.

## Why you should download data: size limits

As a general rule, you should download data from an online source rather than including it in your package.

There are several reasons to store your data online (a minimal sketch of the download-on-demand pattern follows this list):

* Forges like GitHub and GitLab, and package indexes like PyPI and Conda, place limits on file and project sizes.
* Downloading data as needed keeps your package smaller and faster to install.
* Downloading data as needed means you can update the data without re-releasing your package.
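Here is a minimal sketch of that pattern, assuming a hypothetical placeholder URL and cache location. A more robust, checksum-verified version of this idea is exactly what Pooch (covered below) provides:

```python
import urllib.request
from pathlib import Path

# Hypothetical dataset URL -- replace with wherever your data is hosted.
DATA_URL = "https://example.org/datasets/demo.csv"
CACHE_DIR = Path.home() / ".cache" / "my_package"  # illustrative cache location

def fetch_demo_data() -> Path:
    """Download the example dataset on first use, then reuse the cached copy."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    target = CACHE_DIR / "demo.csv"
    if not target.exists():  # only touch the network once
        urllib.request.urlretrieve(DATA_URL, target)
    return target
```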
It's essential to avoid placing unnecessary demands on our shared open source infrastructure, whether that's the infrastructure hosting the data (GitHub, PyPI) or the computer of a user downloading your package. Keep your example and test datasets as small as possible.

### Forges (GitHub, GitLab, Bitbucket, etc.)

Services like GitHub and GitLab, which host source code, have maximum sizes for both files and projects. On GitHub, a single file cannot exceed 100 MB. You might be surprised how quickly you can make a .csv file this big!

You also want to avoid committing larger binary files (like images or audio)
to a version control system like git, because it is hard to go back and remove them later, and they slow down the speed at which you can clone the project.
More importantly, they slow down the speed at which potential contributors can clone your project.

### Data size and PyPI

The Python Package Index (PyPI) places a limit on the size of each individual file uploaded, where a "file" is either
an sdist or a wheel. It also limits the total size of the project (the sum of all of its "files").

While these limits are not clearly documented,
most estimates are around 100 MB per file and 1 GB for the total project.
Files this large place a real strain on the resources supporting PyPI, so you should try your best to minimize the size of your package.

The pyOpenSci community is here to help you do just that!

:::{tip}
You can request increases for both the file size limit
(see [here](https://pypi.org/help/#file-size-limit))
and the project size limit
(see [here](https://pypi.org/help/#project-size-limit)),
but we strongly suggest you read about other ways to store your data first.
:::

## Where to store your data

There are several options for storing your data.
We discuss the pros and cons of each option below.

### Within the repository

There *are* cases where relatively small datasets can safely be included in a package.
If the data are small and used in package tutorials or examples, you can include them in your package. Data included in your package
are shipped in the sdist and wheel, and so are available for users to run your tutorials or examples after they install your package.

If the data are meant to be used for tests, and you have a separate test directory (as we suggest), then you can include the data in the tests directory.

There are pros and cons to including data in your package or repository.

The pros include:

* The data are easy to access.
* If the data are small (for example, small text snippets, small images, or small CSV files), they won't strain PyPI or a user downloading your package.

The cons include:

* Data can bloat your repository and make it hard to clone.
* Data can bloat your package and make it slow to install.
* You may run into maximum file size limits on forges like GitHub and on PyPI.
* If you update the data periodically, the bloat in your repository will grow over time through the commit history.

In short, whenever you can, avoid adding large data files to your version control history and to the archives you upload to PyPI.
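If you do keep data in the repository, a small helper like the following sketch can help you keep an eye on how much you've accumulated before you commit (the `tests/data` path is a hypothetical layout with a separate tests directory):

```python
from pathlib import Path

# Hypothetical layout: test data lives in a separate tests/data/ directory.
data_dir = Path("tests/data")

# Sum the sizes of every file under the data directory.
total_bytes = sum(p.stat().st_size for p in data_dir.rglob("*") if p.is_file())
print(f"{total_bytes / 1e6:.1f} MB of test data")
```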
Examples of packages that keep data in the repository:

* pyOpenSci packages:
  * opengenomics:
  * jointly:
* Core scientific Python packages:
  * scikit-learn:

### Store your data in a scientific repository

Scientific data repositories offer reliable, long-term storage for research datasets. These platforms are typically free and provide essential features like DOIs and versioning.

Some popular places to store data include:

* **[Zenodo](https://zenodo.org)** - General-purpose repository with excellent GitHub integration
* **[Open Science Framework (OSF)](https://osf.io)** - Comprehensive research platform
* **[Figshare](https://figshare.com)** - User-friendly with good visualization tools
* **[Dryad](https://datadryad.org)** - Focused on research data (subscription model for some features)

#### Pros and cons

**Strengths:**

* **Free to use** - No cost for most repositories, within their storage limits
* **Guaranteed long-term preservation** - Data remains accessible indefinitely
* **DOI assignment** - Permanent identifiers make datasets citable in publications
* **Versioning** - Track changes and maintain multiple dataset versions
* **Community standards** - Well-suited for pyOpenSci packages and research workflows

**Weaknesses:**

* **Limited automation** - Difficult to set up automated uploads for dynamic datasets
* **Static nature** - Not designed for frequently changing or real-time data

### Private cloud storage

Private cloud platforms offer scalable, enterprise-grade storage with extensive automation capabilities. These services provide robust infrastructure and comprehensive tooling for data management workflows.

#### Pros and cons

There are some pros to using private cloud storage for your data:

* **Robust infrastructure** - Enterprise-grade reliability and uptime guarantees
* **Automation-friendly** - Rich APIs and tooling for automated data pipelines
* **Scalability** - Handles datasets from gigabytes to petabytes
* **Integration capabilities** - Connects with CI/CD pipelines and package management systems
* **Advanced features** - Access controls, encryption, backup, and disaster recovery

And also some cons:

* **Cost** - Pay-per-use pricing can become expensive for large datasets
* **Technical complexity** - Requires cloud infrastructure knowledge and setup
* **Vendor lock-in** - May create dependencies on a specific cloud ecosystem

#### Popular platforms

* **[Amazon Web Services (AWS)](https://aws.amazon.com/s3/)** - S3 object storage with extensive ecosystem integration
* **[Google Cloud Platform](https://cloud.google.com/storage)** - Cloud Storage with strong AI/ML tool integration
* **[Linode](https://www.linode.com/products/object-storage/)** - Object Storage with straightforward pricing and developer-friendly tools

```{admonition} Version control platforms for your data
:class: tip

Platforms exist that let you track changes to datasets the same way version control systems like git let you track changes to code.

Such tools are essential if your package primarily focuses on providing data access.

Some tools, developed within science, provide distributed access to larger datasets:
- **[DataLad](https://www.datalad.org/)** - Distributed data management for scientific datasets

Related tools are used for data engineering in industry:

- **[Git LFS](https://git-lfs.github.io/)** - Git extension for versioning large files
- **[DVC](https://dvc.org/)** - Data version control for machine learning projects
- **[Pachyderm](https://www.pachyderm.com/)** - Data pipeline platform with version control
```

```{admonition} Field-specific standards + metadata
:class: tip

When creating data for your package, be aware of field-specific standards and formats. For example, in neuroscience, DANDI and NWB are common formats, so consider whether your package can or should support them if you expect users in that space.

Many pyOpenSci packages exist to support such standards, or to provide interoperability where standards are still emerging.
See also: FAIR data principles.
```

## How to access your data

Last but not least, it's essential to understand how you and your users can access the data your package provides.

### For examples: in documentation, tutorials, docstrings, etc.

When writing documentation and examples, consider how users will access your data in different contexts--running code locally, in notebooks, in IDEs like VS Code, in GitHub Codespaces, or in other environments.

Also think about whether you intend to use the data in docstring examples that will be run with doctest, or in tutorials and example snippets in your documentation.

### Accessing local files with importlib.resources

If you have included data files in the source code of your package, then you can provide access to them using `importlib.resources`. This is the recommended approach for accessing package data files in modern Python (3.9 and later).

```python
from importlib import resources

# "my_package.data" is a hypothetical subpackage that contains example.csv;
# resources.files() returns a Traversable pointing at its installed location.
data_file = resources.files("my_package.data").joinpath("example.csv")

with data_file.open("r") as f:
    data = f.read()
```

See Barry Warsaw's [PyCon talk on accessing package data](https://www.youtube.com/watch?v=ZsGFU2qh73E) for a comprehensive overview of modern approaches to accessing package data.

:::{tip}
For Python versions before 3.9, install the `importlib-resources` backport with `pip install importlib-resources`; it provides the same API.

See the [importlib-resources documentation](https://importlib-resources.readthedocs.io/en/latest/) for more on using that package to import data.
:::

Examples:

* pyOpenSci packages:
  * crowsetta:
* Core scientific Python packages:
  * scikit-learn:

### Accessing files that are hosted remotely with Pooch

[Pooch](https://github.com/fatiando/pooch) is a Python library designed for downloading and caching data files from remote sources. It's particularly useful for scientific packages that need to access datasets hosted online while providing a smooth user experience.

Pooch's features include:

* **Automatic downloads:** Files are downloaded on first access
* **Local caching:** Subsequent calls use the cached versions to minimize redundant downloads
* **Integrity verification:** Checksums ensure data hasn't been corrupted
* **Version management:** Pooch can handle different versions of the same dataset
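If you just need a single file and don't want to set up a full registry, `pooch.retrieve()` is the simplest entry point. A minimal sketch, with an illustrative URL:

```python
import pooch

# One-off download: fetch a file and cache it locally.
# known_hash=None skips integrity checking -- fine for experimenting,
# but a real package should pin the file's checksum here.
file_path = pooch.retrieve(
    url="https://example.org/datasets/demo.csv",  # illustrative URL
    known_hash=None,
)
print(file_path)  # path to the locally cached copy
```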
For a package with several data files, you will usually want a registry instead. Here is an example of using Pooch to download and cache remote data files:

```python
import pandas as pd
import pooch
import xarray as xr

# Define your data registry: where files live and how to verify them.
# The base_url and file hashes below are placeholders.
data_registry = pooch.create(
    path=pooch.os_cache("my_package"),
    base_url="https://github.com/my_org/my_data/raw/main/",
    registry={
        "sample_data.csv": "sha256:abc123...",  # file hash for verification
        "large_dataset.nc": "sha256:def456...",
    },
)

# Access files -- each one is downloaded automatically if it isn't cached yet.
def load_sample_data():
    file_path = data_registry.fetch("sample_data.csv")
    return pd.read_csv(file_path)

def load_large_dataset():
    file_path = data_registry.fetch("large_dataset.nc")
    return xr.open_dataset(file_path)
```

The same testing principles apply when you use Pooch for remote data access. You can download test datasets as a setup step in your CI pipeline to make sure they're available during testing, and you can use pytest fixtures to provide consistent access to test data across your test suite, whether the data is cached locally or needs to be downloaded fresh.

### Use pytest fixtures for data access

Pytest fixtures provide a clean way to set up and share data across your test suite. They're especially useful for scientific packages where you need consistent access to test datasets.

Here is a basic data fixture:

```python
import pytest
import pandas as pd
from pathlib import Path

from my_package import my_function  # hypothetical function under test

@pytest.fixture
def sample_data():
    """Load a sample dataset stored next to the tests."""
    data_path = Path(__file__).parent / "data" / "sample.csv"
    return pd.read_csv(data_path)

def test_data_processing(sample_data):
    """Pytest passes the fixture's return value in automatically."""
    result = my_function(sample_data)
    assert len(result) > 0
```

And here is a session-scoped fixture that fetches remote data with Pooch once per test run:

```python
import pytest
import pandas as pd

# Assuming data_registry is defined in your package, as in the example above
from my_package.data import data_registry

@pytest.fixture(scope="session")
def remote_dataset():
    """Download and cache remote data once per test session."""
    file_path = data_registry.fetch("sample_data.csv")
    return pd.read_csv(file_path)

def test_remote_data_analysis(remote_dataset):
    """Test using the remote dataset."""
    result = analyze_dataset(remote_dataset)  # hypothetical analysis function
    assert result is not None
```

Key benefits:

* **Reusable** - Define data loading once, use it in multiple tests
* **Automatic cleanup** - Fixtures handle setup and teardown
* **Scoped caching** - Use `scope="session"` for expensive data operations
* **Parameterization** - Easily run test functions against multiple datasets

This approach keeps your tests clean and ensures consistent data access patterns across your test suite.
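The parameterization benefit deserves a quick illustration. Here is a sketch, assuming two hypothetical CSV files shipped in the tests' `data/` directory:

```python
import pytest
import pandas as pd
from pathlib import Path

DATA_DIR = Path(__file__).parent / "data"

# Each test that uses this fixture runs once per file listed in params.
@pytest.fixture(params=["small.csv", "edge_cases.csv"])
def dataset(request):
    return pd.read_csv(DATA_DIR / request.param)

def test_no_missing_values(dataset):
    assert not dataset.isnull().values.any()
```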