feat(C1): add distributed training data collection endpoint security control (1.2.8) #632
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
@@ -13,7 +13,7 @@ Training data must be sourced, handled, and maintained in a way that preserves o

Maintain a verifiable inventory of all datasets, accept only trusted sources, and log every change for auditability.

| # | Description | Level |
-|:--------:|---------------------------------------------------------------------------------------------------------------------|:---:|
+| :--------: | ------------------------------------------------------------------------------------------------------------------- | :---: |
| **1.1.1** | **Verify that** an up-to-date inventory of every training-data source (origin, responsible party, license, collection method, intended use constraints, and processing history) is maintained. | 1 |
| **1.1.2** | **Verify that** training data processes exclude unnecessary features, attributes, or fields (e.g., unused metadata, sensitive PII, leaked test data). | 1 |
| **1.1.3** | **Verify that** all dataset changes are subject to a logged approval workflow. | 1 |
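To make 1.1.1 concrete, here is a minimal sketch of what one inventory record could capture; the field names and values are illustrative, not mandated by the requirement:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class DatasetSource:
    """One inventory entry per training-data source (cf. 1.1.1)."""
    origin: str                 # where the data comes from
    responsible_party: str      # who owns / maintains it
    license: str                # usage license
    collection_method: str      # how it was gathered
    use_constraints: list       # intended-use restrictions
    processing_history: list = field(default_factory=list)

# Hypothetical inventory keyed by dataset identifier.
inventory = {
    "corpus-v3": DatasetSource(
        origin="https://example.org/corpus",
        responsible_party="data-team@example.org",
        license="CC-BY-4.0",
        collection_method="web crawl",
        use_constraints=["no PII", "research only"],
    )
}
```

A frozen dataclass keeps entries immutable once recorded, which pairs naturally with the logged approval workflow in 1.1.3.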
@@ -26,14 +26,15 @@ Maintain a verifiable inventory of all datasets, accept only trusted sources, an

Restrict access to training data, encrypt it at rest and in transit, and validate its integrity to prevent tampering, theft, or data poisoning.

| # | Description | Level |
-|:--------:|---------------------------------------------------------------------------------------------------------------------|:---:|
+| :--------: | ------------------------------------------------------------------------------------------------------------------- | :---: |
| **1.2.1** | **Verify that** access controls protect training data storage and pipelines. | 1 |
| **1.2.2** | **Verify that** all access to training data is logged, including user, time, and action. | 1 |
| **1.2.3** | **Verify that** training datasets are encrypted in transit and at rest, using current recommended cryptographic algorithms and key management practices. | 1 |
| **1.2.4** | **Verify that** cryptographic hashes or digital signatures are used to ensure data integrity during training data storage and transfer. | 2 |
| **1.2.5** | **Verify that** automated integrity monitoring is applied to guard against unauthorized modifications or corruption of training data. | 2 |
| **1.2.6** | **Verify that** obsolete training data is securely purged or anonymized. | 1 |
| **1.2.7** | **Verify that** all training dataset versions are uniquely identified, stored immutably, and auditable to support rollback and forensic analysis. | 3 |
+| **1.2.8** | **Verify that** distributed training data collection endpoints authenticate to the central aggregation system using mutual authentication, and that data received from those endpoints is integrity-verified (e.g., via cryptographic checksums or digital signatures generated at source) before being accepted into training pipelines. | 2 |
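The pre-ingestion integrity check that 1.2.4 and the proposed 1.2.8 call for can be sketched with a content digest recorded at the source and re-verified before acceptance; the payload and digest values here are illustrative:

```python
import hashlib
import hmac

def sha256_digest(data: bytes) -> str:
    # Content digest recorded at the source (cf. 1.2.4).
    return hashlib.sha256(data).hexdigest()

def verify_before_ingest(data: bytes, expected_digest: str) -> bool:
    # Reject any payload whose digest does not match the value
    # published by the source, before it reaches the training
    # pipeline (cf. 1.2.8).  compare_digest avoids timing leaks.
    return hmac.compare_digest(sha256_digest(data), expected_digest)

payload = b"label,text\n1,hello\n"
digest = sha256_digest(payload)
assert verify_before_ingest(payload, digest)
assert not verify_before_ingest(payload + b"poison", digest)
```

Digital signatures (rather than plain hashes) would additionally bind the data to a specific endpoint identity.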
Collaborator

Also, I think the requirement bundles two independent, separately testable things:

- mutual authentication of the collection endpoints to the aggregation system
- integrity verification of the data received from those endpoints

Of course, in both a completely bespoke end-to-end system and with a framework these might be implemented in a common codebase, but in practice mTLS is often implemented using infrastructure services (e.g. a service mesh or similar, which provides key rotation and related facilities) and integrity verification in the application code.
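To illustrate the split described above: transport authentication (mTLS) would live in infrastructure, while per-record integrity verification sits in application code and holds end-to-end regardless of transport. A minimal sketch assuming a shared per-endpoint key (the endpoint id and key store are hypothetical):

```python
import hashlib
import hmac

# Illustrative key store; in practice keys come from a secrets manager.
ENDPOINT_KEYS = {"edge-7": b"per-endpoint-secret"}

def sign_at_source(endpoint_id: str, record: bytes) -> str:
    # MAC generated where the data is collected, so integrity is
    # verifiable independently of how transport auth is provided.
    return hmac.new(ENDPOINT_KEYS[endpoint_id], record, hashlib.sha256).hexdigest()

def verify_in_pipeline(endpoint_id: str, record: bytes, sig: str) -> bool:
    # Application-level check at the aggregation system, before ingest.
    expected = hmac.new(ENDPOINT_KEYS[endpoint_id], record, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```

Each half is separately testable, which supports splitting the requirement as suggested.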
---
@@ -42,7 +43,7 @@ Restrict access to training data, encrypt it at rest and in transit, and validat

Ensure labeling and annotation processes are access-controlled, auditable, and protect sensitive information.

| # | Description | Level |
-|:--------:|---------------------------------------------------------------------------------------------------------------------|:---:|
+| :--------: | ------------------------------------------------------------------------------------------------------------------- | :---: |
| **1.3.1** | **Verify that** labeling interfaces and platforms enforce access controls that restrict who can create, modify, or approve annotations. | 1 |
| **1.3.2** | **Verify that** all labeling activities are recorded in audit logs, including the annotator identity, timestamp, and action performed. | 1 |
| **1.3.3** | **Verify that** annotator identity metadata is exported and retained alongside the dataset so that every annotation or preference pair can be attributed to a specific, verified human annotator throughout the training pipeline. | 1 |
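The audit entry shape implied by 1.3.2 and 1.3.3 can be sketched as an append-only log of who did what, when; field names are illustrative:

```python
import time

def record_annotation(log: list, annotator_id: str, action: str, item_id: str) -> dict:
    # Append-only audit entry: who, when, what (cf. 1.3.2).
    entry = {
        "annotator": annotator_id,   # verified human identity, retained
        "timestamp": time.time(),    # with the dataset (cf. 1.3.3)
        "action": action,
        "item": item_id,
    }
    log.append(entry)
    return entry

audit_log: list = []
record_annotation(audit_log, "annotator-42", "label:positive", "sample-001")
```

Exporting this log alongside the dataset is what keeps annotations attributable through the rest of the pipeline.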
@@ -57,7 +58,7 @@ Ensure labeling and annotation processes are access-controlled, auditable, and p

Combine automated validation, manual spot-checks, and logged remediation to guarantee dataset reliability.

| # | Description | Level |
-|:--------:|---------------------------------------------------------------------------------------------------------------------|:---:|
+| :--------: | ------------------------------------------------------------------------------------------------------------------- | :---: |
| **1.4.1** | **Verify that** automated tests catch format errors and nulls on every ingest or significant data transformation. | 1 |
| **1.4.2** | **Verify that** training and fine-tuning pipelines implement data integrity validation and poisoning detection techniques (e.g., statistical analysis, outlier detection, embedding analysis) to identify potential data poisoning or unintentional corruption in training data. | 2 |
| **1.4.3** | **Verify that** automatically generated labels (e.g., via models or weak supervision) are subject to confidence thresholds and consistency checks to detect misleading or low-confidence labels. | 2 |
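A toy sketch of the checks named in 1.4.2 and 1.4.3: a z-score outlier screen standing in for the statistical/embedding analysis, and a confidence gate on auto-generated labels; thresholds and function names are illustrative:

```python
from statistics import mean, stdev

def flag_outliers(values, z_thresh=3.0):
    # Simple z-score screen for anomalous samples (cf. 1.4.2);
    # real pipelines would operate on embeddings or richer features.
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if sigma and abs(v - mu) / sigma > z_thresh]

def accept_auto_label(label, confidence, threshold=0.9):
    # Auto-generated labels below the confidence threshold are
    # routed to human review, not the training set (cf. 1.4.3).
    return confidence >= threshold

assert flag_outliers([1.0] * 20 + [50.0]) == [50.0]
assert accept_auto_label("spam", 0.95)
assert not accept_auto_label("spam", 0.40)
```

Consistency checks (e.g. agreement across multiple labelers or models) would complement the confidence gate.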
@@ -72,7 +73,7 @@ Combine automated validation, manual spot-checks, and logged remediation to guar

Track the full journey of each dataset from source to model input for auditability and incident response.

| # | Description | Level |
-|:--------:|---------------------------------------------------------------------------------------------------------------------|:---:|
+| :--------: | ------------------------------------------------------------------------------------------------------------------- | :---: |
| **1.5.1** | **Verify that** the lineage of each dataset and its components, including all transformations, augmentations, and merges, is recorded and can be reconstructed. | 1 |
| **1.5.2** | **Verify that** lineage records are immutable, securely stored, and accessible for audits. | 2 |
| **1.5.3** | **Verify that** lineage tracking covers synthetic data generated via augmentation, synthesis, or privacy-preserving techniques. | 2 |
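One way to get the immutability 1.5.2 asks for is to hash-chain lineage records, so tampering with any earlier entry is detectable; this is a sketch under that assumption, with illustrative event fields:

```python
import hashlib
import json

def append_lineage(chain: list, event: dict) -> dict:
    # Each record commits to its predecessor, so modifying an
    # earlier lineage entry breaks the chain (cf. 1.5.2).
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {"event": event, "prev": prev_hash}
    body_hash = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    record = {**body, "hash": body_hash}
    chain.append(record)
    return record

chain: list = []
append_lineage(chain, {"op": "ingest", "dataset": "corpus-v3"})
append_lineage(chain, {"op": "dedupe", "dataset": "corpus-v3"})
assert chain[1]["prev"] == chain[0]["hash"]
```

Synthetic-data generation steps (1.5.3) would simply be further events in the same chain.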
Hmm, I think this requirement combines two concerns already covered elsewhere:

- endpoint authentication
- pre-ingestion integrity verification

You wrote in the description that the existing C1.2 controls don't address endpoint authentication or pre-ingestion integrity, but I think 4.8.1 and 4.3.4 cover the former, and 1.2.4's "during transfer" scope covers the latter.

What specific gap remains after applying these four controls together?

If the gap is specifically about combining these in the federated learning context, that might be better addressed by adding federated/distributed training to the scope note in C12.6 (which already has a dedicated federated learning section) or C4.8 (Edge & Distributed AI Security), where the related controls already live, rather than in C1.2.