
Conversation

@lwasser (Member) commented Sep 19, 2025

This PR is a rework of #110. Let's plan to run a sprint on this PR in the next few weeks to see if we can get it to a good place.

NOTE: I added some code examples that I literally found online and DID NOT TEST, so we will definitely want to test what is there before merging.

ALSO - because this topic is not my area of expertise, in some cases I ran with a section after a Google search and tried to flesh it out, but it could also be off. Any feedback is welcome on this!!

@lwasser added the `help wanted` and `ready-for-review` labels on Sep 19, 2025
* **[Open Science Framework (OSF)](https://osf.io)** - Comprehensive research platform
* **[Figshare](https://figshare.com)** - User-friendly with good visualization tools
* **[Dryad](https://datadryad.org)** - Focused on research data (subscription model for some features)


Would Hugging Face fit here, or is this list more for academic repositories? Perhaps https://docs.source.coop?

Contributor

It's good you suggested these -- sorry for a big, wordy response to a single suggestion, but I think it shows that we could say a little more about why we have what we have here.

I'll say what I think we might add, but first, re: your suggestions:
Hugging Face is free and convenient.
I have a couple of worries about it, though. It will only exist (1) at most as long as the company exists, which might not be long if the AI bubble pops, and (2) maybe not even that long, if the company decides that the cost of maintaining all the data is not worth how many customers it gets them. I don't think this is just an academic thing, since a company could choose to be structured in such a way that it can't pull the rug on users whenever it wants. Forgive me for being pedantic about it; I guess I'm sensitive as a failed academic 😇

Source Co-op looks cool! Maybe we should include them! I need to read about it more.

Re: the places we have listed now, I think there are a couple of things we could do here:

  • Say a little bit more in the blurb about what our criteria are: e.g., is it free, is there some sort of guarantee about how long the data will be stored, and is there a good UI / API / existing software tool that makes it easy to work with?
  • Add some sort of table that indicates how each place to store data stacks up with respect to those criteria.

Again, sorry for being wordy -- I already say some of this in the actual text, but you're making me see how this could maybe be better organized.

* **[Google Cloud Platform](https://cloud.google.com/storage)** - Cloud Storage with strong AI/ML tool integration
* **[Linode](https://www.linode.com/products/object-storage/)** - Object Storage with straightforward pricing and developer-friendly tools

<!-- I don't understand how these platforms are different from things like figshare - can we clarify that? and how / why someone would pick these vs figshare / dryad? -->


This section is moving into data versioning, which is important for scientific studies and reproducibility but may be out of scope for a package. Having a small subset of testable data for a package makes a lot of sense.

- **[Pachyderm](https://www.pachyderm.com/)** - Data pipeline platform with version control

<!-- I am not sure how this section relates to data stored in a package. I understand it's important, but does it belong in a page focused on how and where to store data for your Python package? It might be that I just don't understand as written!-->


+1 to removing

Contributor

I hear you -- I put this in thinking, "I wish I knew about this earlier, so we should mention it," but it is probably out of scope for the specifics of including data with your package.


### Use Pytest fixtures for data access

Pytest fixtures provide a clean way to set up and share data across your test suite. They're especially useful for scientific packages where you need consistent access to test datasets.
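
For example, a session-scoped fixture might look something like the sketch below (untested; the data directory and file name are placeholders you would adapt to your package):

```python
import pathlib

import pytest

# Placeholder location: assumes a small data file lives in tests/data/
TEST_DATA_DIR = pathlib.Path(__file__).parent / "data"


@pytest.fixture(scope="session")
def sample_data_path():
    """Return the path to a small dataset, shared once across the test session."""
    return TEST_DATA_DIR / "sample.csv"


def test_sample_data_exists(sample_data_path):
    # Any test that lists ``sample_data_path`` as an argument reuses the same fixture.
    assert sample_data_path.exists()
```

Because the fixture is session-scoped, the path is resolved once and shared by every test that requests it.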


Maybe this is just personal preference, but I think we should move pytest up and Pooch down.
Pytest fits better with the whole 'testing your data' thing, along with testing your code. Pooch seems like an additional tool or service to download data from an external source; I've not heard of it before.

Contributor

👍 We definitely should talk about pytest fixtures sooner if this page is about using data in tests.

@lwasser (at the risk of being even more annoying on this review 😅) looking at this again, I'm wondering if it would make sense to break this into two pages.

One on "data for tests" and another on "example data in your package", that would not live in the section on tests, but instead live in ... some section that doesn't exist yet.
I guess it would be some section that's "packaging", but more about, like library structure I guess? As oppossed to the nitty-gritty of publishing a package

I know a lot of the time test data and example data overlap, but it strikes me as odd to have content about example data in the tests/ section.

ok I'll stop being noisy on the review 🤐


@NickleDave (Contributor) commented Oct 7, 2025


One last thought: if I were writing a page that was only "data for tests", I'd write something like:

  1. as much as possible, use `pytest.fixture` with "fake data" you generate for tests: this lets you avoid adding real data to your project and helps you avoid writing code and tests that are tightly coupled, encouraging you to test the interface instead of implementation details (see the sketch after this list)
  2. for functionality where you absolutely need to test on real data, e.g. parsing specific file formats, include those in version control if possible, but avoid pushing the data to PyPI because it increases package size and strains the service
  3. in some cases you may include example data in your package (link to section on example data), and you can re-use this for tests, using fixtures, here's how...
  4. to test some functionality, e.g. fitting statistical models, you may not be able to avoid downloading relatively large amounts of data for your tests. In these cases you may find it convenient to publish your dataset to an open data repository and then download it as part of a set-up step for your project, which you should outline in your contributing.md. Tools exist for accessing these datasets [link to example data page again]
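
To make point 1 concrete, here's a rough, untested sketch of what I mean (`mypackage.normalize` is just a stand-in for whatever function you're actually testing):

```python
import numpy as np
import pytest


@pytest.fixture
def fake_signal():
    """A small synthetic array standing in for a real recording."""
    rng = np.random.default_rng(seed=42)
    return rng.normal(size=1_000)


def test_normalize_returns_zero_mean(fake_signal):
    # ``normalize`` is a placeholder for a public function in your package.
    from mypackage import normalize

    assert abs(normalize(fake_signal).mean()) < 1e-8
```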

<!-- I am not sure about this statement in terms of what it means, and whether we have tools that consider standards or not. We might also want to link to FAIR. -->
:::

```{admonition} Field specific standards + metadata


A link to FAIR would be good. I haven't heard of the ones mentioned 😓

Contributor

+1 for saying that in general datasets should be FAIR.

I could add a link to the neuro formats.

I know some of the astro ones too, e.g., FITS, and the loftily-named [Advanced Scientific Data Format](https://proceedings.scipy.org/articles/majora-212e5952-000). We could ask astro people.
Would geo formats be worth a mention here too?
Or is all that info overload?
