
Conversation

@lwasser (Member) commented Sep 19, 2025

This PR is a rework of #110. Let's plan to run a sprint on this PR in the next few weeks to see if we can get it to a good place.

NOTE: I added some code examples that I literally found online and DID NOT TEST, so we will definitely want to test what is there before merging.

ALSO - because this topic is not my area of expertise, in some cases I ran with a section after a Google search and tried to flesh it out, but it could also be off. Any feedback is welcome on this!!

@lwasser added the `help wanted` and `ready-for-review` labels on Sep 19, 2025
* **[Open Science Framework (OSF)](https://osf.io)** - Comprehensive research platform
* **[Figshare](https://figshare.com)** - User-friendly with good visualization tools
* **[Dryad](https://datadryad.org)** - Focused on research data (subscription model for some features)


Would Hugging Face fit here, or is this list more for academic repositories? Perhaps https://docs.source.coop?

Contributor

It's good you suggested these -- sorry for a big, wordy response to a single suggestion, but I think it shows that we could say a little more about why we have what we have here.

I'll say what I think we might add, but first, re: your suggestions:
Hugging Face is free and convenient.
I have a couple of worries about it, though. It will only exist (1) at most as long as the company exists, which might not be long if the AI bubble pops, and (2) maybe not even that long, if the company decides that the cost of maintaining all the data is not worth how many customers it gets them. I don't think this is just an academic thing, since a company could choose to be structured in such a way that it can't pull the rug on users whenever it wants. Forgive me for being pedantic about it; I guess I'm sensitive as a failed academic 😇

Source Co-op looks cool! Maybe we should include them! I need to read about it more.

Re: the places we have listed now, I think there are a couple of things we could do here:

  • Say a little bit more in the blurb about what our criteria are: e.g., is it free, is there some sort of guarantee about how long the data will be stored, and is there a good UI / API / existing software tool that makes it easy to work with?
  • Add some sort of table that indicates how each place to store data stacks up with respect to those criteria.

Again, sorry for being wordy -- I already say some of this in the actual text, but you're making me see how this could maybe be better organized.

* **[Google Cloud Platform](https://cloud.google.com/storage)** - Cloud Storage with strong AI/ML tool integration
* **[Linode](https://www.linode.com/products/object-storage/)** - Object Storage with straightforward pricing and developer-friendly tools

<!-- I don't understand how these platforms are different from things like figshare - can we clarify that? and how / why someone would pick these vs figshare / dryad? -->


This section is moving into data versioning, which is important for scientific studies and reproducibility but may be out of scope for a package. Having a small subset of testable data for a package makes a lot of sense.

- **[Pachyderm](https://www.pachyderm.com/)** - Data pipeline platform with version control

<!-- I am not sure how this section relates to data stored in a package. I understand it's important, but does it belong in a page focused on how and where to store data for your Python package? It might be that I just don't understand as written!-->


+1 to removing

Contributor

I hear you -- I put this in thinking, "I wish I knew about this earlier, so we should mention it," but it is probably out of scope for the specifics of including data with your package.


### Use Pytest fixtures for data access

Pytest fixtures provide a clean way to set up and share data across your test suite. They're especially useful for scientific packages where you need consistent access to test datasets.
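
For example, a session-scoped fixture might look something like the sketch below (untested; the data directory and file name are placeholders you would adapt to your package):

```python
import pathlib

import pytest

# Placeholder location: assumes a small data file lives in tests/data/
TEST_DATA_DIR = pathlib.Path(__file__).parent / "data"


@pytest.fixture(scope="session")
def sample_data_path():
    """Return the path to a small dataset, shared once across the test session."""
    return TEST_DATA_DIR / "sample.csv"


def test_sample_data_exists(sample_data_path):
    # Any test that lists ``sample_data_path`` as an argument reuses the same fixture.
    assert sample_data_path.exists()
```

Because the fixture is session-scoped, the path is resolved once and shared by every test that requests it.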


Maybe this is just personal preference, but I think we should move pytest up and Pooch down.
Pytest fits better with the whole 'testing your data' thing, along with testing your code. Pooch seems like an additional tool or service to download data from an external source; I've not heard of it before.

Contributor

👍 We definitely should talk about pytest fixtures sooner if this page is about using data in tests.

@lwasser (at the risk of being even more annoying on this review 😅) looking at this again, I'm wondering if it would make sense to break this into two pages.

One on "data for tests" and another on "example data in your package", that would not live in the section on tests, but instead live in ... some section that doesn't exist yet.
I guess it would be some section that's "packaging", but more about, like library structure I guess? As oppossed to the nitty-gritty of publishing a package

I know a lot of the time test data and example data overlap, but it strikes me as odd to have content about example data in the tests/ section.

ok I'll stop being noisy on the review 🤐


@NickleDave (Contributor) commented Oct 7, 2025


One last thought: if I were writing a page that was only "data for tests", I'd write something like:

  1. as much as possible, use `pytest.fixture` with "fake data" you generate for tests: this lets you avoid adding real data to your project and helps you avoid writing code and tests that are tightly coupled, encouraging you to test the interface instead of implementation details (see the sketch after this list)
  2. for functionality where you absolutely need to test on real data, e.g. parsing specific file formats, include those in version control if possible, but avoid pushing the data to PyPI because it increases package size and strains the service
  3. in some cases you may include example data in your package (link to section on example data), and you can re-use this for tests, using fixtures, here's how...
  4. to test some functionality, e.g. fitting statistical models, you may not be able to avoid downloading relatively large amounts of data for your tests. In these cases you may find it convenient to publish your dataset to an open data repository and then download it as part of a set-up step for your project, which you should outline in your contributing.md. Tools exist for accessing these datasets [link to example data page again]
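
To make point 1 concrete, here's a rough, untested sketch of what I mean (`mypackage.normalize` is just a stand-in for whatever function you're actually testing):

```python
import numpy as np
import pytest


@pytest.fixture
def fake_signal():
    """A small synthetic array standing in for a real recording."""
    rng = np.random.default_rng(seed=42)
    return rng.normal(size=1_000)


def test_normalize_returns_zero_mean(fake_signal):
    # ``normalize`` is a placeholder for a public function in your package.
    from mypackage import normalize

    assert abs(normalize(fake_signal).mean()) < 1e-8
```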

<!-- I am not sure about this statement in terms of what it means, and whether we have tools that consider standards or not. We might also want to link to FAIR. -->
:::

```{admonition} Field specific standards + metadata


A link to FAIR would be good. I haven't heard of the ones mentioned 😓

Contributor

+1 for saying that in general datasets should be FAIR.

I could add a link to the neuro formats.

I know some of the astro ones too, e.g., FITS, and the loftily-named [Advanced Scientific Data Format](https://proceedings.scipy.org/articles/majora-212e5952-000). We could ask astro people.
Would geo formats be worth a mention here too?
Or is all that info overload?
