Welcome to the AI Alliance Open Trusted Data Initiative (OTDI).
OTDI is building a high-quality, trusted, and open catalog of datasets for AI LLM pre-training, fine-tuning, and domain-specific applications. These datasets are amenable to a wide variety of use cases in enterprises, governments, regulated industries, and wherever high trust in the data foundations of AI is essential.
The initiative consists of several projects:
- Define Openness Criteria: What has to be true about a dataset in order for it to be considered truly open for use? This project defines those criteria. See the Dataset Specification page for our evolving thinking on the minimally-sufficient criteria.
- Find Diverse Datasets: We seek a very broad range of datasets, including:
  - Text (especially in under-served languages)
  - Multimedia (audio, video, and images)
  - Time series (from any domain or application)
  - Science (molecular discovery, drug discovery, geospatial, physics, etc.)
  - Specific domains and use cases (industry-specific and use-case-specific data)
  - Synthetic (datasets for all of the above can be synthetic or "real")
- Data Pipelines: Data pipelines implemented using tools like DPK are used both to validate datasets proposed for inclusion in our catalog and, eventually, to derive new datasets specialized for particular purposes. See the How We Process Datasets page for more information.
- Open Dataset Catalog: A catalog of datasets from many sources that meet our criteria for openness. See the Dataset Catalog page for more information.
Each of these projects welcomes enthusiastic participants! Please join us!
This repo contains the "code" for the OTDI website, as well as the code that implements the OTDI projects.
The website is published using GitHub Pages, where the pages are written in Markdown/HTML and served using Jekyll. We use the Just the Docs Jekyll theme.
See GITHUB_PAGES.md for more information, especially for instructions on previewing changes locally using Jekyll.
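As a sketch of what local previewing typically involves for a Jekyll site (GITHUB_PAGES.md is the authoritative reference; the exact steps depend on your Ruby setup and the repo's Gemfile):

```shell
# One-time setup: install Bundler, then the gems the site declares
# in its Gemfile (Jekyll and the Just the Docs theme).
gem install bundler
bundle install

# Build the site and serve it locally; by default Jekyll
# watches for changes and serves at http://localhost:4000.
bundle exec jekyll serve
```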
See the static-catalog/README.md for details about building the current "static" catalog.
Note
All documentation is licensed under Creative Commons Attribution 4.0 International. See LICENSE.CC-BY-4.0.
This repo will also host the code for the projects that are part of OTDI, listed above. Eventually, as these projects grow, we may move them out to separate repos.
Miscellaneous other documentation, not in the website, is also captured here:
- tools-notes - Notes on potential tool choices.
- data-processing-notes - Notes on requirements and data-specific tool choices.
We welcome contributions as PRs. Please see our Alliance community repo for general information about contributing to any of our projects. This section provides some specific details you need to know.
In particular, see the AI Alliance CONTRIBUTING instructions. You will need to agree to the AI Alliance Code of Conduct.
All code contributions are licensed under the Apache 2.0 LICENSE (which is also in this repo, LICENSE.Apache-2.0).
All documentation contributions are licensed under the Creative Commons Attribution 4.0 International license (which is also in this repo, LICENSE.CC-BY-4.0).
All data contributions are licensed under the Community Data License Agreement - Permissive - Version 2.0 (which is also in this repo, LICENSE.CDLA-2.0).
Warning
Before you make any git commits with changes, understand what's required for DCO.
See the Alliance contributing guide section on DCO for details. In practical terms, supporting this requirement means you must use the `-s` flag with your `git commit` commands.
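To illustrate what the sign-off does (using a throwaway repository and placeholder identity, not real project settings):

```shell
# Create a throwaway repository to demonstrate the DCO sign-off.
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "example" > file.txt
git add file.txt

# The -s flag appends a "Signed-off-by: Example User <user@example.com>"
# trailer to the commit message, which is what DCO checks look for.
git commit -q -s -m "Add example file"

# Inspect the full commit message, including the trailer.
git log -1 --format=%B
```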