articles/machine-learning/concept-sourcing-human-data.md
Lines changed: 24 additions & 23 deletions
@@ -1,6 +1,6 @@
 ---
 title: Manually sourcing human data for AI development
-description: Manually sourcing human data can be important to building AI systems that work for all users. But certain practices should be avoided, especially ones that can cause physical and psychological harm to data contributors, as well as flawed datasets.
+description: Learn best practices for mitigating potential harm to people—especially in vulnerable groups—and building balanced datasets when collecting human data manually.
 author: nhu-do-1
 ms.author: nhudo
 ms.service: machine-learning
@@ -18,7 +18,8 @@ The best practices in this article will help you conduct manual data collection

 - People contributing data are not coerced or exploited in any way, and they have control over what personal data is collected.
 - People collecting and labeling data have adequate training
-- These practices can also help ensure more-balanced and higher-quality datasets and better stewardship of human data.
+
+These practices can also help ensure more-balanced and higher-quality datasets and better stewardship of human data.

 These are emerging practices, and we are continually learning. The best practices below are a starting point as you begin your own responsible human data collections. These best practices are provided for informational purposes only and should not be treated as legal advice. All human data collections should undergo specific privacy and legal reviews.

@@ -28,18 +29,18 @@ We suggest the following best practices for manually collecting human data direc


 |**Best Practice**|**Why**|
-|:--------------------|----------|
+|:--------------------|:----------|
 |**Obtain voluntary informed consent.**| <ul><li>Participants should understand and consent to data collection and how their data will be used.<li>Data should only be stored, processed, and used for purposes that are part of the original documented informed consent. <li>Consent documentation should be properly stored and associated with the collected data. <ul> |
 |**Compensate data contributors appropriately.**| <ul><li>Data contributors should not be pressured or coerced into data collections and should be fairly compensated for their time and data. <li>Inappropriate compensation can be exploitative or coercive.<ul> |
-|**Let contributors self-identify demographic information**| <ul><li>Demographic information that is not self-reported by data contributors but assigned by data collectors may 1) result in inaccurate metadata and 2) be disrespectful to data contributors<ul> |
+|**Let contributors self-identify demographic information.**| <ul><li>Demographic information that is not self-reported by data contributors but assigned by data collectors may 1) result in inaccurate metadata and 2) be disrespectful to data contributors.<ul> |
 |**Anticipate harms when recruiting vulnerable groups.**| <ul><li>Collecting data from vulnerable population groups introduces risk to data contributors and your organization.<ul> |
 |**Treat data contributors with respect.**| <ul><li>Improper interactions with data contributors at any phase of the data collection can negatively impact data quality, as well as the overall data collection experience for data contributors and data collectors.<ul> |
 |**Qualify external suppliers carefully.**| <ul><li>Data collections with unqualified suppliers may result in low quality data, poor data management, unprofessional practices, and potentially harmful outcomes for data contributors and data collectors (including violations of human rights). <li> Annotation or labeling work (e.g., audio transcription, image tagging) with unqualified suppliers may result in low quality or biased datasets, insecure data management, unprofessional practices, and potentially harmful outcomes for data contributors (including violations of human rights).<ul> |
 |**Communicate expectations clearly in the Statement of Work (SOW) with suppliers.**| <ul><li>An SOW which lacks requirements for responsible data collection work may result in low-quality or poorly collected data.<ul> |
 |**Qualify geographies carefully.**| <ul><li> When applicable, collecting data in restricted and/or unfamiliar geographies may result in unusable or low-quality data and may impact the safety of involved parties.<ul> |
 |**Be a good steward of your datasets.**| <ul><li>Improper data management and poor documentation can result in data misuse.<ul> |

->[!TIP]
+>[!NOTE]
 >This article focuses on recommendations for human data, including personal data and sensitive data such as biometric data, health data, racial or ethnic data, data collected manually from the general public or company employees, as well as metadata relating to human characteristics, such as age, ancestry, and gender identity, that may be created via annotation or labeling.


@@ -60,12 +61,12 @@ To enable people to self-identify, consider using the following survey questions
 *Select your age range*

 [*Include appropriate age ranges as defined by project purpose, geographical region, and guidance from domain experts*]
-
-<ul><li># to # </li>
-<li># to # </li>
-<li># to # </li>
-<li>Prefer not to answer </li></ul>
-
+<ul>
+<li># to #
+<li># to #
+<li># to #
+<li>Prefer not to answer
+</ul>

 ### Ancestry

@@ -75,12 +76,12 @@ To enable people to self-identify, consider using the following survey questions

 [*Include appropriate categories as defined by project purpose, geographical region, and guidance from domain experts*]

-<ul><li>ancestry group </li>
-<li>ancestry group </li>
-<li>ancestry group </li>
-<li>Multiple (multiracial, mixed ancestry) </li>
-<li>Not listed, I describe myself as: _________________ </li>
-<li>Prefer not to answer </li></ul>
+- Ancestry group
+- Ancestry group
+- Ancestry group
+-Multiple (multiracial, mixed Ancestry)
+-Not listed, I describe myself as: _________________
+-Prefer not to answer


 ### Gender identity
@@ -91,14 +92,14 @@ To enable people to self-identify, consider using the following survey questions

 [*Include appropriate gender identities as defined by project purpose, geographical region, and guidance from domain experts*]

-<ul><li>gender identity </li>
-<li>gender identity </li>
-<li>gender identity </li>
-<li>Prefer to self-describe: _________________ </li>
-<li>Prefer not to answer </li></ul>
+- Gender identity
+- Gender identity
+- Gender identity
+-Prefer to self-describe: _________________
+-Prefer not to answer


->[!CAUTION]
+>[!CAUTION]
 >In some parts of the world, there are laws that criminalize specific gender categories, so it may be dangerous for data contributors to answer this question honestly. Always give people a way to opt out. And work with regional experts and attorneys to conduct a careful review of the laws and cultural norms of each place where you plan to collect data, and if needed, avoid asking this question entirely.