glossAPI Text Dataset Standard

This repository defines the standard used by the glossAPI team for creating AI-ready textual datasets. It is based on our internal study of modern data engineering practices for large-scale text pipelines.

The standard covers:

  • File formats (JSONL for ingestion, Parquet for storage and training)
  • Pipeline architecture (normalization, heuristic filtering, deduplication, PII handling, sharding)
  • Metadata requirements (Data Card fields)
  • Safety considerations

πŸ”§ How to Contribute

We keep this project intentionally simple and offer multiple ways to contribute:

πŸ’¬ Discussions (Recommended for questions & ideas)

  • Use GitHub Discussions for general questions, brainstorming, and community engagement
  • Great for exploring ideas before formal proposals

πŸ“ Issues (For tracked proposals)

  1. Open a new Issue on GitHub
  2. Add one of the following labels:
    • proposal β†’ for new ideas or enhancements
    • change-request β†’ for modifications to existing standards
    • question β†’ for clarifications

πŸ”€ Pull Requests (For direct contributions)

  1. Fork this repository
  2. Make your changes to files in /standards
  3. Submit a Pull Request with a clear description
  4. Maintainers will review and provide feedback

βœ”οΈ Scope of accepted contributions:

We only review proposals related to:

  • file formats (JSONL / Parquet)
  • pipeline stages (normalization, filtering, deduplication, safety, sharding)
  • metadata specification
  • safety and redaction practices

βœ”οΈ Review Process

  • The glossAPI maintainers review each Discussion, Issue, and Pull Request
  • Accepted changes are integrated into the /standards directory
  • Changes are documented in commit history (no RFC process, no complex governance)

For more details, see CONTRIBUTING.md.


Version: v1.0 (initial release).