🌐 Data Collection and Source Tracking

This document summarizes key notes on text acquisition, language-specific sourcing variability, and metadata management across the Multilingual Segmentation Corpus.


🌍 Data Collection Variability Across Languages

Text acquisition varied considerably from language to language, which required a flexible, language-sensitive approach to both sourcing and preprocessing.


🧾 Source Tracking & Metadata

To ensure transparency, consistency, and reproducibility, we created an internal form that standardizes metadata collection for each text in the segmentation corpus.

This form captured key details such as:

  • 📌 Source type (digital edition, manuscript, OCR, etc.)
  • 📚 Edition or manuscript reference (bibliographic citation)
  • 🌍 Linguistic variety and chronological range
  • 🗂️ Format and structure of the original file
  • 📝 Reuse/licensing conditions

Although the form itself is not public, we provide access to the processed metadata and the scripts used during compilation.
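Since the form is not public, the following is only a minimal sketch of how one record with the fields listed above might be modeled. The class name `TextMetadata`, every field name, and the `missing_fields` helper are hypothetical and are not taken from the project's scripts.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple


@dataclass
class TextMetadata:
    """Hypothetical record mirroring the form fields; names are assumptions."""
    source_type: str            # e.g. "digital edition", "manuscript", "OCR"
    edition_reference: str      # bibliographic citation
    linguistic_variety: str
    date_range: Tuple[Optional[int], Optional[int]]  # (earliest, latest) year
    file_format: str            # format and structure of the original file
    license: str                # reuse/licensing conditions

    def missing_fields(self) -> List[str]:
        """Return the names of text fields that were left blank."""
        blank = []
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, str) and not value.strip():
                blank.append(f.name)
        return blank
```

A record with an empty `license` would then report `["license"]` from `missing_fields()`, which is one way a submission could be flagged before it enters the corpus.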


🔗 Resources

  • 📂 Data Processing Repository – Corpus Temporis App
    Streamlit app and scripts used to structure, validate, and convert incoming texts and metadata.

  • 📊 Compiled Metadata Table – data.csv
    A centralized CSV listing all processed texts with:

    • Language
    • Title
    • Edition or source
    • Format
    • License/reuse status
    • File location references
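As an illustration, a table like this could be checked for completeness before use. The sketch below assumes column headers such as `language` and `file_path`; the actual header names in data.csv may differ, and the sample row is made up for the example.

```python
import csv
import io

# Assumed header names; adjust to match the real data.csv.
REQUIRED_COLUMNS = ["language", "title", "edition", "format", "license", "file_path"]


def check_metadata_table(handle):
    """Return (missing_columns, incomplete_row_numbers) for a CSV file handle."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing_columns = [c for c in REQUIRED_COLUMNS if c not in header]
    present = [c for c in REQUIRED_COLUMNS if c in header]

    incomplete_rows = []
    # Row 1 is the header, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        if any(not (row.get(c) or "").strip() for c in present):
            incomplete_rows.append(row_number)
    return missing_columns, incomplete_rows


# Made-up sample with a blank license cell, standing in for the real file.
sample = (
    "language,title,edition,format,license,file_path\n"
    "Latin,Example Text,Example Edition,txt,,texts/example.txt\n"
)
result = check_metadata_table(io.StringIO(sample))
```

To run the same check on the real file, open it with `open("data.csv", newline="", encoding="utf-8")` and pass the handle to `check_metadata_table`.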

📬 For contributions or metadata corrections, feel free to open an issue or pull request.