articles/machine-learning/concept-sourcing-human-data.md
13 additions & 44 deletions
@@ -1,35 +1,16 @@
 ---
 title: Manually sourcing human data for AI development
-description: Manually sourcing human data can be important to building AI systems that work for all users. But certain practices should be avoided, especially ones that can cause physical and psychological harm to data contributors, as well as flawed datasets. #Required; article description that is displayed in search results.
+description: Manually sourcing human data can be important to building AI systems that work for all users. But certain practices should be avoided, especially ones that can cause physical and psychological harm to data contributors, as well as flawed datasets.
 author: nhu-do-1
 ms.author: nhudo
 ms.service: machine-learning
-ms.topic: conceptual #Required; leave this attribute/value as-is.
-ms.date: 05/05/2022 #Required; mm/dd/yyyy format.
-ms.custom: responsible-ml #Required; leave this attribute/value as-is.
+ms.topic: conceptual
+ms.date: 05/05/2022
+ms.custom: responsible-ml
 ---
-
-<!--Remove all the comments in this template before you sign-off or merge to the
-main branch.
--->
-
-<!--
-This template provides the basic structure of a concept article.
-See the [concept guidance](contribute-how-write-concept.md) in the contributor guide.
-Required. Set expectations for what the content covers, so customers know the
-content meets their needs. Should NOT begin with a verb.
--->
-
-<!-- first level heading -->
 # What is "human data" and why is it important to source responsibly?
 
-Human data is data collected directly from or about people. Human data may include personal data such as names, age, images, or voice clips and sensitive data such as genetic data, biometric data, gender identity, religious beliefs or political affiliations.
+Human data is data collected directly from, or about, people. Human data may include personal data such as names, age, images, or voice clips and sensitive data such as genetic data, biometric data, gender identity, religious beliefs, or political affiliations.
 
 Collecting this data can be important to building AI systems that work for all users. But certain practices should be avoided, especially ones that can cause physical and psychological harm to data contributors, as well as flawed datasets.
 
@@ -41,28 +22,26 @@ The best practices in this article will help you conduct manual data collection
 
 These are emerging practices, and we are continually learning. The best practices below are a starting point as you begin your own responsible human data collections. These best practices are provided for informational purposes only and should not be treated as legal advice. All human data collections should undergo specific privacy and legal reviews.
 
-<!--second header -->
 ## General best practices
 
 We suggest the following best practices for manually collecting human data directly from people.
-|**Obtain voluntary informed consent.**| <ul><li>Participants should understand and consent to data collection and how their data will be used.<li>Data should only be stored, processed, and used for purposes that are part of the original documented informed consent. <li>Consent documentation should be properly stored and associated with the collected data. <ul> |
-|**Compensate data contributors appropriately.**| <ul><li>Data contributors should not be pressured or coerced into data collections and should be fairly compensated for their time and data. <li>Inappropriate compensation can be exploitative or coercive.<ul> |
+
+|**Best Practice**|**Why**|
+|:--------------------|----------|
+|**Obtain voluntary informed consent.**| <ul><li>Participants should understand and consent to data collection and how their data will be used.<li>Data should only be stored, processed, and used for purposes that are part of the original documented informed consent. <li>Consent documentation should be properly stored and associated with the collected data. <ul> |
+|**Compensate data contributors appropriately.**| <ul><li>Data contributors should not be pressured or coerced into data collections and should be fairly compensated for their time and data. <li>Inappropriate compensation can be exploitative or coercive.<ul> |
 |**Let contributors self-identify demographic information**| <ul><li>Demographic information that is not self-reported by data contributors but assigned by data collectors may 1) result in inaccurate metadata and 2) be disrespectful to data contributors<ul> |
 |**Anticipate harms when recruiting vulnerable groups.**| <ul><li>Collecting data from vulnerable population groups introduces risk to data contributors and your organization.<ul> |
-|**Treat data contributors with respect.**| <ul><li>Improper interactions with data contributors at any phase of the data collection can negatively impact data quality, as well as the overall data collection experience for data contributors and data collectors.<ul> |
+|**Treat data contributors with respect.**| <ul><li>Improper interactions with data contributors at any phase of the data collection can negatively impact data quality, as well as the overall data collection experience for data contributors and data collectors.<ul> |
 |**Qualify external suppliers carefully.**| <ul><li>Data collections with unqualified suppliers may result in low quality data, poor data management, unprofessional practices, and potentially harmful outcomes for data contributors and data collectors (including violations of human rights). <li> Annotation or labeling work (e.g., audio transcription, image tagging) with unqualified suppliers may result in low quality or biased datasets, insecure data management, unprofessional practices, and potentially harmful outcomes for data contributors (including violations of human rights).<ul> |
 |**Communicate expectations clearly in the Statement of Work (SOW) with suppliers.**| <ul><li>An SOW which lacks requirements for responsible data collection work may result in low-quality or poorly collected data.<ul> |
 |**Qualify geographies carefully.**| <ul><li> When applicable, collecting data in restricted and/or unfamiliar geographies may result in unusable or low-quality data and may impact the safety of involved parties.<ul> |
 |**Be a good steward of your datasets.**| <ul><li>Improper data management and poor documentation can result in data misuse.<ul> |
 
 >[!TIP]
-This article focuses on recommendations for human data, including personal data and sensitive data such as biometric data, health data, racial or ethnic data, data collected manually from the general public or company employees, as well as metadata relating to human characteristics, such as age, ancestry, and gender identity, that may be created via annotation or labeling.
+>This article focuses on recommendations for human data, including personal data and sensitive data such as biometric data, health data, racial or ethnic data, data collected manually from the general public or company employees, as well as metadata relating to human characteristics, such as age, ancestry, and gender identity, that may be created via annotation or labeling.
 
-<!--INSERT DOWNLOAD LINK TO FULL SOURCING HUMAN DATA DOC-->
-<!--Download the full set of best practices here.-->
 
 ## Best practices for collecting age, ancestry, and gender identity
 
@@ -120,15 +99,10 @@ To enable people to self-identify, consider using the following survey questions
 
 
 >[!CAUTION]
-In some parts of the world, there are laws that criminalize specific gender categories, so it may be dangerous for data contributors to answer this question honestly. Always give people a way to opt out. And work with regional experts and attorneys to conduct a careful review of the laws and cultural norms of each place where you plan to collect data, and if needed, avoid asking this question entirely.
-
-<!-- INCLUDE LINK TO FULL DOWNLOAD-->
-<!-- Download the full set of best practices here.-->
-
+>In some parts of the world, there are laws that criminalize specific gender categories, so it may be dangerous for data contributors to answer this question honestly. Always give people a way to opt out. And work with regional experts and attorneys to conduct a careful review of the laws and cultural norms of each place where you plan to collect data, and if needed, avoid asking this question entirely.
 
 
 ## Next steps
-<!-- Add a context sentence for the following links -->
 For more information on how to work with your data:
 
 -[Secure data access in Azure Machine Learning](concept-data.md)
@@ -141,8 +115,3 @@ Follow these how-to guides to work with your data after you've collected it:
 -[Set up image labeling](how-to-create-image-labeling-projects.md)
 -[Label images and text](how-to-label-data.md)
 
-
-<!--
-Remove all the comments in this template before you sign-off or merge to the