glossAPI Text Dataset Standard

This repository defines the standard used by the glossAPI team for creating AI-ready textual datasets. It is based on our internal study of modern data engineering practices for large-scale text pipelines.

The standard covers:

  • File formats (JSONL for ingestion, Parquet for storage and training)
  • Pipeline architecture (normalization, heuristic filtering, deduplication, PII handling, sharding)
  • Metadata requirements (Data Card fields)
  • Safety considerations

πŸ”§ How to Contribute

We keep this project intentionally simple and offer multiple ways to contribute:

πŸ’¬ Discussions (Recommended for questions & ideas)

  • Use GitHub Discussions for general questions, brainstorming, and community engagement
  • Great for exploring ideas before formal proposals

πŸ“ Issues (For tracked proposals)

  1. Open a new Issue on GitHub
  2. Add one of the following labels:
    • proposal β†’ for new ideas or enhancements
    • change-request β†’ for modifications to existing standards
    • question β†’ for clarifications

πŸ”€ Pull Requests (For direct contributions)

  1. Fork this repository
  2. Make your changes to files in /standards
  3. Submit a Pull Request with a clear description
  4. Maintainers will review and provide feedback

βœ”οΈ Scope of accepted contributions:

We only review proposals related to:

  • file formats (JSONL / Parquet)
  • pipeline stages (normalization, filtering, deduplication, safety, sharding)
  • metadata specification
  • safety and redaction practices

βœ”οΈ Review Process

  • The glossAPI maintainers review each Discussion, Issue, and Pull Request
  • Accepted changes are integrated into the /standards directory
  • Changes are documented in commit history (no RFC process, no complex governance)

For more details, see CONTRIBUTING.md.


Version: v1.0 (initial release).