Commit 3d6a8ad

committed
formatting and description updates
1 parent e119041 commit 3d6a8ad

1 file changed

+24
-23
lines changed

articles/machine-learning/concept-sourcing-human-data.md

Lines changed: 24 additions & 23 deletions
@@ -1,6 +1,6 @@
 ---
 title: Manually sourcing human data for AI development
-description: Manually sourcing human data can be important to building AI systems that work for all users. But certain practices should be avoided, especially ones that can cause physical and psychological harm to data contributors, as well as flawed datasets.
+description: Learn best practices for mitigating potential harm to people—especially in vulnerable groups—and building balanced datasets when collecting human data manually.
 author: nhu-do-1
 ms.author: nhudo
 ms.service: machine-learning
@@ -18,7 +18,8 @@ The best practices in this article will help you conduct manual data collection

 - People contributing data are not coerced or exploited in any way, and they have control over what personal data is collected.
 - People collecting and labeling data have adequate training
-- These practices can also help ensure more-balanced and higher-quality datasets and better stewardship of human data.
+
+These practices can also help ensure more-balanced and higher-quality datasets and better stewardship of human data.

 These are emerging practices, and we are continually learning. The best practices below are a starting point as you begin your own responsible human data collections. These best practices are provided for informational purposes only and should not be treated as legal advice. All human data collections should undergo specific privacy and legal reviews.

@@ -28,18 +29,18 @@ We suggest the following best practices for manually collecting human data direc


 | **Best Practice** | **Why** |
-|:--------------------|----------|
+|:--------------------|:----------|
 | **Obtain voluntary informed consent.** | <ul><li>Participants should understand and consent to data collection and how their data will be used.<li>Data should only be stored, processed, and used for purposes that are part of the original documented informed consent. <li>Consent documentation should be properly stored and associated with the collected data. <ul> |
 | **Compensate data contributors appropriately.** | <ul><li>Data contributors should not be pressured or coerced into data collections and should be fairly compensated for their time and data. <li>Inappropriate compensation can be exploitative or coercive.<ul> |
-| **Let contributors self-identify demographic information** | <ul><li>Demographic information that is not self-reported by data contributors but assigned by data collectors may 1) result in inaccurate metadata and 2) be disrespectful to data contributors<ul> |
+| **Let contributors self-identify demographic information.** | <ul><li>Demographic information that is not self-reported by data contributors but assigned by data collectors may 1) result in inaccurate metadata and 2) be disrespectful to data contributors.<ul> |
 | **Anticipate harms when recruiting vulnerable groups.** | <ul><li>Collecting data from vulnerable population groups introduces risk to data contributors and your organization.<ul> |
 | **Treat data contributors with respect.** | <ul><li>Improper interactions with data contributors at any phase of the data collection can negatively impact data quality, as well as the overall data collection experience for data contributors and data collectors.<ul> |
 | **Qualify external suppliers carefully.** | <ul><li>Data collections with unqualified suppliers may result in low quality data, poor data management, unprofessional practices, and potentially harmful outcomes for data contributors and data collectors (including violations of human rights). <li> Annotation or labeling work (e.g., audio transcription, image tagging) with unqualified suppliers may result in low quality or biased datasets, insecure data management, unprofessional practices, and potentially harmful outcomes for data contributors (including violations of human rights).<ul> |
 | **Communicate expectations clearly in the Statement of Work (SOW) with suppliers.** | <ul><li>An SOW which lacks requirements for responsible data collection work may result in low-quality or poorly collected data.<ul> |
 | **Qualify geographies carefully.** | <ul><li> When applicable, collecting data in restricted and/or unfamiliar geographies may result in unusable or low-quality data and may impact the safety of involved parties.<ul> |
 | **Be a good steward of your datasets.** | <ul><li>Improper data management and poor documentation can result in data misuse.<ul> |

->[!TIP]
+>[!NOTE]
 >This article focuses on recommendations for human data, including personal data and sensitive data such as biometric data, health data, racial or ethnic data, data collected manually from the general public or company employees, as well as metadata relating to human characteristics, such as age, ancestry, and gender identity, that may be created via annotation or labeling.


@@ -60,12 +61,12 @@ To enable people to self-identify, consider using the following survey questions
 *Select your age range*

 [*Include appropriate age ranges as defined by project purpose, geographical region, and guidance from domain experts*]
-
-<ul><li># to # </li>
-<li># to # </li>
-<li># to # </li>
-<li>Prefer not to answer </li></ul>
-
+<ul>
+<li># to #
+<li># to #
+<li> # to #
+<li> Prefer not to answer
+</ul>

 ### Ancestry

@@ -75,12 +76,12 @@ To enable people to self-identify, consider using the following survey questions

 [*Include appropriate categories as defined by project purpose, geographical region, and guidance from domain experts*]

-<ul><li>ancestry group </li>
-<li>ancestry group </li>
-<li>ancestry group </li>
-<li>Multiple (multiracial, mixed ancestry) </li>
-<li>Not listed, I describe myself as: _________________ </li>
-<li>Prefer not to answer </li></ul>
+- Ancestry group
+- Ancestry group
+- Ancestry group
+- Multiple (multiracial, mixed Ancestry)
+- Not listed, I describe myself as: _________________
+- Prefer not to answer


 ### Gender identity
@@ -91,14 +92,14 @@ To enable people to self-identify, consider using the following survey questions

 [*Include appropriate gender identities as defined by project purpose, geographical region, and guidance from domain experts*]

-<ul><li>gender identity </li>
-<li>gender identity </li>
-<li>gender identity </li>
-<li>Prefer to self-describe: _________________ </li>
-<li>Prefer not to answer </li></ul>
+- Gender identity
+- Gender identity
+- Gender identity
+- Prefer to self-describe: _________________
+- Prefer not to answer


->[!CAUTION]
+>[!CAUTION]
 >In some parts of the world, there are laws that criminalize specific gender categories, so it may be dangerous for data contributors to answer this question honestly. Always give people a way to opt out. And work with regional experts and attorneys to conduct a careful review of the laws and cultural norms of each place where you plan to collect data, and if needed, avoid asking this question entirely.

