
contribute_data

Daniel van Strien edited this page Jul 1, 2022 · 20 revisions

How to add data

For this sprint, we are adding datasets in two parts:

  • suggesting candidate datasets via GitHub issues
  • making those datasets available via the Hugging Face Hub

You don't need to be involved in both parts for each dataset. For example, you can suggest a dataset without adding it yourself, or you can make available a dataset you didn't initially suggest.

Suggesting a dataset

The first way in which you can help is to suggest datasets that you think would be good candidates for inclusion in the Hugging Face Hub.

To do this, the overall steps are:

  • Checking that the dataset isn't already in the Hub
  • Checking that it isn't already being tracked as part of this sprint
  • Deciding whether you think it's suitable for sharing
  • Creating an issue for the dataset

What kind of data is suitable for sharing via the Hub?

The primary focus of the Hugging Face Hub is machine learning datasets. This means that data should have some existing or potential relevance to machine learning.

There are a few different categories of data that would be particularly useful for sharing:

  • Annotated datasets. These datasets will have some labels that can be used to evaluate or train machine learning models. For example, a dataset consisting of newspaper articles with a label for the sentiment. These don't necessarily need to have been made explicitly for the purpose of training machine learning models. For example, datasets annotated through crowdsourcing tasks may be usefully adapted for machine learning.

  • Large text datasets that could be used for training large language models.

  • Large image collections that could be used for pre-training computer vision models.

Not all potential datasets will fall neatly into one of these categories. If you have a dataset that doesn't fit one of these boxes but could still be of interest, feel free to suggest it.

How to submit a candidate dataset

For this sprint, we are using GitHub issues to propose new datasets. To open an issue, you will need a GitHub account; if you don't already have one, please sign up first.

To suggest a dataset for inclusion in the sprint, use the add dataset issue template. This issue template includes a basic form which captures some important information about the dataset.

(Screenshot: the "add dataset" issue template form, 2022-07-01)

When you submit a candidate dataset through this issue template, it will be assigned the tag candidate-dataset. One of the organizers will check whether this dataset is suitable for sharing and clarify any issues before moving it to the project board we use to track datasets that we want to make available via the Hub.

Need some inspiration?

If you need some inspiration on where to look for suitable datasets, you can check out this discussion thread which points to some possible places you might discover potential datasets for inclusion.

Uploading data to the Hugging Face Hub

You can also contribute to making data available via the Hugging Face Hub. The overall steps for doing this are:

  • claim the dataset from the project board
  • make the data available via the Hub. The two main ways of doing this are:
    • directly sharing the files via the Hub
    • writing a dataset script
  • closing the issue tracking the addition of that dataset

The following sections discuss these steps in more detail.

Become a contributor to the LAM sprint datasets org

To add datasets to the LAM org, you will need to become a member:

  • Make sure you have a Hugging Face account. If you don't, you will need to join first
  • Use this link to add yourself to the biglam organization

Choose a dataset to work on

  • Decide on which dataset you want to work on from the datasets overview tab
  • If there isn't yet an issue for this dataset, you should first create one so we can track datasets
  • Assign yourself to this dataset. This lets other people know you are working on this dataset. You can do this by commenting #selfassign in the issue for that dataset.

Making datasets available via the Hugging Face Hub

There are a few different ways of making datasets available via the Hugging Face Hub. The best approach will depend on the dataset you are adding.

Upload files directly to the Hugging Face Hub

The datasets library has support for many file types out of the box. If your dataset is one of these formats, the easiest way to share this data is to upload the files directly. The supported formats include CSV, JSON, JSON lines, text lines, and Parquet.

Creating a dataset loading script

There are some situations where you may not want to share files directly inside the Hugging Face Hub. These include:

  • When data is not in a standard format and needs some processing to prepare it
  • When you want to provide different 'configurations' for loading your data, for example, allowing users to load only a subset of the data.
  • When the data is already available in a different repository, and you want to use that as the source for loading data.

Writing a dataset script is outlined in the datasets library documentation. For the datasets for this sprint, we will create a repository under the sprint organization and upload our scripts there.

The steps for doing this are as follows:

    1. Make sure you have joined the biglam organization on the Hub
    2. Create your data loading script. It is usually easier to develop this on a local machine so you can test the script as you work on it.
    3. Once you have a script ready, create a repository for that script. Try to choose a descriptive name for the dataset. If the dataset already has an existing name, it makes sense to match it closely.
    4. Upload your dataset script to the repository you just created. You should make sure the dataset script has the same name as the repository you created, so that the datasets library will correctly load data from this repository. For example, if you have a repository biglam/metadata_quality, you should name the dataset loading script metadata_quality.py
    5. Once you have made your repository public, it's a good idea to test that everything loads correctly

It can also be helpful to look at other scripts to understand better how to structure your dataset script. Some examples are listed below:

Example data loading scripts

It can be beneficial to make large datasets streamable. This allows someone to work with the data even if it is too large to fit onto their machine. For some datasets, it is pretty easy to make them streamable, but for others, it can be a bit more tricky. If you get stuck with this, feel free to ask for help in the issue tracking the dataset you are working on.
