Welcome to the AI Alliance Open Trusted Data Initiative (OTDI).
OTDI is building a high-quality, trusted, and open catalog of datasets for AI LLM pre-training, fine-tuning, and domain-specific applications. These datasets are amenable to a wide variety of use cases in enterprises, governments, regulated industries, and wherever high trust in the data foundations of AI is essential.
The initiative consists of several projects:
- Define Openness Criteria: What has to be true about a dataset in order for it to be considered truly open for use? This project defines those criteria. See the Dataset Specification page for our evolving thinking on the minimally-sufficient criteria.
- Find Diverse Datasets: We seek a very broad range of datasets, including:
  - Text (especially in under-served languages)
  - Multimedia (audio, video, and images)
  - Time series (from any domain or application)
  - Science (molecular discovery, drug discovery, geospatial, physics, etc.)
  - Specific domains and use cases (industry-specific and use-case-specific data)
  - Synthetic (datasets for all of the above can be synthetic or "real")
- Data Pipelines: Data pipelines implemented using tools like DPK are used both to validate datasets proposed for inclusion in our catalog and, eventually, to derive new datasets specialized for particular purposes. See the How We Process Datasets page for more information.
- Open Dataset Catalog: A catalog of datasets from many sources that meet our criteria for openness. See the Dataset Catalog page for more information.
Each of these projects welcomes enthusiastic participants! Please join us!
This repo contains the "code" for the OTDI website, as well as the code that implements the OTDI projects.
The website is published using GitHub Pages, where the pages are written in Markdown/HTML and served using Jekyll. We use the Just the Docs Jekyll theme.
See GITHUB_PAGES.md for more information, especially for instructions on previewing changes locally using Jekyll.
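As a sketch of what local previewing typically involves for a Jekyll site (GITHUB_PAGES.md is the authoritative reference; the exact steps depend on your Ruby setup and the repo's Gemfile):

```shell
# One-time setup: install Bundler, then the gems the site declares
# in its Gemfile (Jekyll and the Just the Docs theme).
gem install bundler
bundle install

# Build the site and serve it locally; by default Jekyll
# watches for changes and serves at http://localhost:4000.
bundle exec jekyll serve
```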
See the static-catalog/README.md for details about building the current "static" catalog.
Note
All documentation is licensed under Creative Commons Attribution 4.0 International. See LICENSE.CC-BY-4.0.
This repo will also host the code for the projects that are part of OTDI, listed above. Eventually, as these projects grow, we may move them out to separate repos.
Miscellaneous other documentation, not in the website, is also captured here:
- tools-notes - Notes on potential tool choices.
- data-processing-notes - Notes on requirements and data-specific tool choices.
We welcome contributions as PRs. Please see our Alliance community repo for general information about contributing to any of our projects. This section provides some specific details you need to know.
In particular, see the AI Alliance CONTRIBUTING instructions. You will need to agree to the AI Alliance Code of Conduct.
All code contributions are licensed under the Apache 2.0 LICENSE (which is also in this repo, LICENSE.Apache-2.0).
All documentation contributions are licensed under the Creative Commons Attribution 4.0 International license (which is also in this repo, LICENSE.CC-BY-4.0).
All data contributions are licensed under the Community Data License Agreement - Permissive - Version 2.0 (which is also in this repo, LICENSE.CDLA-2.0).
Warning
Before you make any git commits with changes, understand what's required for DCO.
See the Alliance contributing guide section on DCO for details. In practical terms, supporting this requirement means you must use the `-s` flag with your `git commit` commands.
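To illustrate what the sign-off does (using a throwaway repository and placeholder identity, not real project settings):

```shell
# Create a throwaway repository to demonstrate the DCO sign-off.
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "example" > file.txt
git add file.txt

# The -s flag appends a "Signed-off-by: Example User <user@example.com>"
# trailer to the commit message, which is what DCO checks look for.
git commit -q -s -m "Add example file"

# Inspect the full commit message, including the trailer.
git log -1 --format=%B
```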