Commit cbced2c

Update model-card-annotated.md (#1267)
* Update model-card-annotated.md

  Food for thought: Adding a section that deals directly with red-teaming (as that is becoming a big interest/concern). Hit specifically on CSAM, as we have been inspired by Thorn and Rebecca Portnoff to be more proactive about this.

* Update model-card-annotated.md

* Update docs/hub/model-card-annotated.md

  Co-authored-by: Julien Chaumond <[email protected]>

* Update model-card-annotated.md

  Suggesting updates based on multiple discussions. This includes:

  - The GitHub discussion here
  - Discussions in the Ethics meetings about pros and cons of different documentation requirements
  - Discussions with Thorn and All Tech is Human on responsible AI development

* Update model-card-annotated.md

  Iterating on wording to introduce a new Evaluation section.

---------

Co-authored-by: Julien Chaumond <[email protected]>
1 parent 973e946 commit cbced2c

docs/hub/model-card-annotated.md

Lines changed: 17 additions & 7 deletions
@@ -132,7 +132,7 @@ _List how the model may foreseeably be misused (used in a way it will not work f
 
 **Section Overview:** This section identifies foreseeable harms, misunderstandings, and technical and sociotechnical limitations. It also provides information on warnings and potential mitigations. Bias, risks, and limitations can sometimes be inseparable/refer to the same issues. Generally, bias and risks are sociotechnical, while limitations are technical:
 - A **bias** is a stereotype or disproportionate performance (skew) for some subpopulations.
-- A **risk** is a socially-relevant issue that the model might cause.
+- A **risk** is a socially-relevant issue that the model might cause.
 - A **limitation** is a likely failure mode that can be addressed following the listed Recommendations.
 
 `bias_risks_limitations`
@@ -182,43 +182,53 @@ _Detail throughput, start/end time, checkpoint sizes, etc._
 
 # Evaluation
 
-**Section Overview:** This section describes the evaluation protocols, what is being measured in the evaluation, and provides the results. Evaluation is ideally constructed with factors, such as domain and demographic subgroup, and metrics, such as accuracy, which are prioritized in light of foreseeable error contexts and groups. Target fairness metrics should be decided based on which errors are more likely to be problematic in light of the model use. You can also specify your model's evaluation results in a structured way in the model card metadata. Results are parsed by the Hub and displayed in a widget on the model page. See https://huggingface.co/docs/hub/model-cards#evaluation-results.
+**Section Overview:** This section describes the evaluation protocols, what is being measured in the evaluation, and provides the results. Evaluation ideally has at least two parts, with one part looking at quantitative measurement of general performance ([Testing Data, Factors & Metrics](#testing-data-factors--metrics)), such as may be done with benchmarking; and another looking at performance with respect to specific social safety issues ([Societal Impact Assessment](#societal-impact-assessment-optional)), such as may be done with red-teaming. You can also specify your model's evaluation results in a structured way in the model card metadata. Results are parsed by the Hub and displayed in a widget on the model page. See https://huggingface.co/docs/hub/model-cards#evaluation-results.
 
 
 ## Testing Data, Factors & Metrics
 
+_Evaluation is ideally **disaggregated** with respect to different factors, such as task, domain and population subgroup; and calculated with metrics that are most meaningful for foreseeable contexts of use. Equal evaluation performance across different subgroups is said to be "fair" across those subgroups; target fairness metrics should be decided based on which errors are more likely to be problematic in light of the model use. However, this section is most commonly used to report aggregate evaluation performance on different task benchmarks._
+
 ### Testing Data
 
 `testing_data`
 
-_Ideally this links to a Dataset Card for the testing data._
+_Describe testing data or link to its Dataset Card._
 
 ### Factors
 
 `testing_factors`
 
-_What are the foreseeable characteristics that will influence how the model behaves? This includes domain and context, as well as population subgroups. Evaluation should ideally be **disaggregated** across factors in order to uncover disparities in performance._
+_What are the foreseeable characteristics that will influence how the model behaves? Evaluation should ideally be disaggregated across these factors in order to uncover disparities in performance._
 
 ### Metrics
 
 `testing_metrics`
 
-_What metrics will be used for evaluation in light of tradeoffs between different errors?_
+_What metrics will be used for evaluation?_
 
 ## Results
 
 `results`
 
-
 _Results should be based on the Factors and Metrics defined above._
 
 ### Summary
 
-
 `results_summary`
 
 _What do the results say? This can function as a kind of tl;dr for general audiences._
 
+## Societal Impact Assessment [optional]
+
+_Use this free text section to explain how this model has been evaluated for risk of societal harm, such as for child safety, NCII, privacy, and violence. This might take the form of answers to the following questions:_
+
+- _Is this model safe for kids to use? Why or why not?_
+- _Has this model been tested to see if it can be used for non-consensual intimate imagery (including CSEM)?_
+- _Has this model been tested to see if it can be used for violent activities, or to depict violence? What were the results?_
+
+_Quantitative numbers on each issue may also be provided._
+
 # Model Examination [optional]
 
 **Section Overview:** This is an experimental section some developers are beginning to add, where work on explainability/interpretability may go.
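For reference, the "structured way" of specifying evaluation results that the Evaluation overview mentions is the `model-index` block in the model card's YAML metadata, which the Hub parses and renders in the evaluation results widget. A minimal sketch following the schema documented at the URL cited in the diff; the model name, dataset, and metric value below are illustrative placeholders, not part of this commit:

```yaml
---
# YAML front matter at the top of README.md (the model card)
model-index:
- name: my-model                   # hypothetical model name
  results:
  - task:
      type: text-classification   # task type from the Hub's task taxonomy
    dataset:
      type: glue                   # dataset identifier on the Hub
      name: GLUE (SST-2)           # human-readable dataset name
    metrics:
    - type: accuracy
      value: 0.91                  # illustrative value
---
```

See https://huggingface.co/docs/hub/model-cards#evaluation-results for the full schema.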
