Commit ba3e941

update data cleaning episode
1 parent cab2b76 commit ba3e941

1 file changed

+41
-50
lines changed

episodes/clean-data.Rmd

Lines changed: 41 additions & 50 deletions
@@ -82,9 +82,9 @@ the `filter()` function from the `{dplyr}` package.
8282
:::::::::::::::::::
8383

8484

85-
The first step is to import the dataset into working environment, which can be done by following the guidelines
86-
outlined in the [Read case data](../episodes/read-cases.Rmd) episode. This involves loading
87-
the dataset into `R` environment and view its structure and content.
85+
The first step is to import the dataset into the working environment. This can be
86+
done by following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode. It involves loading the dataset into
87+
the `R` environment and viewing its structure and content.
8888

8989
```{r,eval=FALSE,echo=TRUE,message=FALSE}
9090
# Read data
@@ -120,65 +120,62 @@ Are any of those characteristics familiar from any previous data analysis you ha
120120

121121
Lead a short discussion to relate the diagnosed characteristics with required cleaning operations.
122122

123-
You can use these terms to **diagnose characteristics**:
123+
You can use the following terms to **diagnose characteristics**:
124124

125-
- *Codification*, like sex and age entries using numbers, letters, and words. Also dates in different arrangement
126-
("dd/mm/yyyy" or "yyyy/mm/dd") and formats. Less visible, but also the column names.
127-
- *Missing*, how to interpret an entry like "" in status or "-99" in another column? do we have a data dictionary from
128-
the data collection process?
125+
- *Codification*, like values in columns such as 'gender' and 'age' being coded using numbers, letters, and words. Also the presence of multiple date
126+
formats ("dd/mm/yyyy", "yyyy/mm/dd", etc) in the same column like in
127+
'date_onset'. Less visible, but also the column names.
128+
- *Missing*, how to interpret an entry like "" in the 'status' column or "-99"
129+
in other circumstances? Do we have a data dictionary from the data collection process?
129130
- *Inconsistencies*, like having a date of sample before the date of onset.
130-
- *Non-plausible values*, like outlier observations with dates outside of an expected timeframe.
131+
- *Non-plausible values*, like observations where some date values are outside of the expected timeframe.
131132
- *Duplicates*, are all observations unique?
132133

133134
You can use these terms to relate to **cleaning operations**:
134135

135136
- Standardize column names
136-
- Standardize categorical variables like sex/gender
137+
- Standardize categorical variables like 'gender'
137138
- Standardize date columns
138-
- Convert from character to numeric values
139+
- Convert character values into numeric
139140
- Check the sequence of dated events
140141

141142
::::::::::::::::::::::::::::::
142143

143144
## A quick inspection
144145

145-
Quick exploration and inspection of the dataset are crucial to identify potential data issues before
146-
diving into any analysis tasks. The `{cleanepi}`
146+
Quick exploration and inspection of the dataset are crucial to identify
147+
potential data issues before diving into any analysis tasks. The `{cleanepi}`
147148
package simplifies this process with the `scan_data()` function. Let's take a look at how you can use it:
148149

149150
```{r}
150151
cleanepi::scan_data(raw_ebola_data)
151152
```
152153

153154

154-
The results provide an overview of the content of every column, including column names, and the percent of some data
155-
types per column.
156-
You can see that the column names in the dataset are descriptive but lack consistency, as some they are composed of
157-
multiple words separated by white spaces. Additionally, some columns contain more than one data type, and there are
158-
missing values in others.
155+
The results provide an overview of the content of all character columns, including column names, and the percent of some data types within them.
156+
You can see that the column names in the dataset are descriptive but lack consistency. Some are composed of multiple words separated by white spaces. Additionally, some columns contain more than one data type, and there are
157+
missing values in the form of an empty string in others.
159158

160159
## Common operations
161160

162161
This section demonstrates how to perform some common data cleaning operations using the `{cleanepi}` package.
163162

164163
### Standardizing column names
165164

166-
For this example dataset, standardizing column names typically involves removing spaces and connecting different words
167-
with “_”. This practice helps maintain consistency and readability in the dataset. However, the function used for
168-
standardizing column names offers more options. Type `?cleanepi::standardize_column_names` for more details.
165+
For this example dataset, standardizing column names typically involves removing spaces and connecting different words with “_”. This practice helps
166+
maintain consistency and readability in the dataset. However, the function used for standardizing column names offers more options. Type `?cleanepi::standardize_column_names` in the console for more details.
169167

170168
```{r}
171169
sim_ebola_data <- cleanepi::standardize_column_names(raw_ebola_data)
172170
names(sim_ebola_data)
173171
```
174172

175-
If you want to maintain certain column names without subjecting them to the standardization process, you can utilize
176-
the `keep` argument of the function `cleanepi::standardize_column_names()`. This argument accepts a vector of column
177-
names that are intended to be kept unchanged.
173+
If you want to maintain certain column names without subjecting them to the standardization process, you can utilize the `keep` argument of the function `cleanepi::standardize_column_names()`. This argument accepts a vector of
174+
column names that are intended to be kept unchanged.
178175

179176
::::::::::::::::::::::::::::::::::::: challenge
180177

181-
- What differences you can observe in the column names?
178+
- What differences can you observe in the column names?
182179

183180
- Standardize the column names of the input dataset, but keep the first column name as it is.
184181

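As a rough illustration of what standardizing column names involves, the base R sketch below lowercases the names, replaces runs of non-alphanumeric characters with "_", and leaves any name listed in `keep` untouched. The helper `standardize_names_sketch()` and the example column names are hypothetical. This is not how `cleanepi::standardize_column_names()` is implemented, and its output may differ.

```r
# Hypothetical helper: a simplified stand-in for column name standardization
standardize_names_sketch <- function(x, keep = character()) {
  cleaned <- tolower(trimws(x))
  # Replace each run of non-alphanumeric characters with one underscore
  cleaned <- gsub("[^a-z0-9]+", "_", cleaned)
  # Drop underscores left at the start or end of a name
  cleaned <- gsub("^_+|_+$", "", cleaned)
  # Names listed in `keep` are returned unchanged
  ifelse(x %in% keep, x, cleaned)
}

# Made-up column names, not the ones in the episode's dataset
raw_names <- c("Case ID", "date onset", "Date  Sample", "STATUS")
new_names <- standardize_names_sketch(raw_names, keep = "STATUS")
new_names
```

The `keep` argument here mirrors the idea described in the episode: names passed to it skip the cleaning step entirely.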
@@ -192,9 +189,8 @@ You can try `cleanepi::standardize_column_names(data = raw_ebola_data, keep = "V
192189

193190
### Removing irregularities
194191

195-
Raw data may contain irregularities such as **duplicated** rows, **empty** rows and columns, or **constant** columns
196-
(where all entries have the same value.) Functions from `{cleanepi}` like `remove_duplicates()` and `remove_constants()`
197-
remove such irregularities as demonstrated in the below code chunk.
192+
Raw data may contain fields that don't add any variability to the data such as **empty** rows and columns, or **constant** columns (where all entries have the same value). It can also contain **duplicated** rows. Functions from
193+
`{cleanepi}` like `remove_duplicates()` and `remove_constants()` remove such irregularities, as demonstrated in the code chunk below.
198194

199195
```{r}
200196
# Remove constants
@@ -208,14 +204,15 @@ Now, print the output to identify what constant column you removed!
208204
sim_ebola_data <- cleanepi::remove_duplicates(sim_ebola_data)
209205
```
210206

211-
<!-- Note that, our simulated Ebola does not contain duplicated nor constant rows or columns. -->
207+
<!-- Note that our simulated Ebola data contains a few duplicates and a few constant
208+
columns. -->
212209

213210
::::::::::::::::::::: spoiler
214211

215212
#### How many rows did you remove? Which rows were removed?
216213

217-
You can get the number and location of the duplicated rows that where found. Run `cleanepi::print_report()`,
218-
wait for the report to open in your browser, and find the "Duplicates" tab.
214+
You can get the number and location of the duplicated rows that were found. Run `cleanepi::print_report()`, wait for the report to open in your browser, and
215+
find the "Duplicates" tab.
219216

220217
```{r,eval=FALSE,echo=TRUE}
221218
# Print a report
@@ -238,7 +235,7 @@ df <- tibble(
238235
) %>%
239236
mutate(col3 = rep("a", nrow(.))) %>%
240237
mutate(col4 = rep("b", nrow(.))) %>%
241-
mutate(col5 = rep(NA_Date_, nrow(.))) %>%
238+
mutate(col5 = rep(lubridate::NA_Date_, nrow(.))) %>%
242239
add_row(col1 = NA_integer_, col3 = "a") %>%
243240
add_row(col1 = NA_integer_, col3 = "a") %>%
244241
add_row(col1 = NA_integer_, col3 = "a") %>%
@@ -247,38 +244,32 @@ df <- tibble(
247244
df
248245
```
249246

250-
What columns or rows are:
247+
Which rows or columns correspond to:
251248

252-
- duplicates?
253-
- empty?
254-
- constant?
249+
- constant data?
250+
- duplicated rows?
255251

256252
::::::::::::::: hint
257253

258-
Duplicates mostly refers to replicated rows. Empty rows or columns can be a subset within the set of constant rows
259-
or columns.
254+
Constant data mostly refers to empty rows or columns as well as constant columns.
260255

261256
:::::::::::::::
262257

263258
:::::::::::::::::::::
264259

265260
::::::::::::::: instructor
266261

267-
- duplicated rows: 3, 4, 5
268-
- empty rows: 6
269-
- empty cols: 5
270-
- constant rows: 6
271-
- constant cols: 5
262+
Make sure they start by removing duplicates before removing constant data.
272263

273-
Point out to learners that the user can create new constant columns or rows after removing some initial ones.
264+
- indices of duplicated rows: 3, 4, 5
265+
- indices of empty rows: 4 (from the first iteration); 3 (from the second iteration)
266+
- empty cols: "col5"
267+
- constant cols: "col3" and "col4"
274268

275-
```{r}
276-
df %>%
277-
cleanepi::remove_constants()
269+
Point out to learners that they can obtain a different set of constant data to remove by varying the value of the `cutoff` argument.
278270

279-
df %>%
280-
cleanepi::remove_constants() %>%
281-
cleanepi::remove_constants()
271+
```{r}
272+
df <- df %>% cleanepi::remove_constants(cutoff = 0.5)
282273
```
283274

284275

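To make the idea of "constant data" in the instructor note more concrete, here is a minimal base R sketch that flags empty rows, empty columns, and constant columns in a toy data frame. It assumes `cutoff` is the proportion of missing values above which a row or column is treated as empty. That is an assumption about how `remove_constants()` interprets the argument, so check `?cleanepi::remove_constants` before relying on it. The data frame `toy` is hypothetical and is not the `df` used in the episode.

```r
# Hypothetical toy data, not the episode's `df`
toy <- data.frame(
  id    = c(1L, 2L, 3L, NA),
  group = c("a", "a", "a", "a"),
  date  = as.Date(c(NA, NA, NA, NA)),
  value = c(10, 20, NA, NA),
  stringsAsFactors = FALSE
)

# Assumed meaning: rows/columns with a larger share of NA than this are "empty"
cutoff <- 0.5

# Proportion of missing values per column and per row
col_missing <- colMeans(is.na(toy))
row_missing <- rowMeans(is.na(toy))

empty_cols <- names(toy)[col_missing > cutoff]
empty_rows <- unname(which(row_missing > cutoff))

# A column is constant if it has exactly one distinct non-missing value
is_constant_col <- vapply(
  toy,
  function(x) length(unique(x[!is.na(x)])) == 1L,
  logical(1)
)
constant_cols <- names(toy)[is_constant_col]

empty_cols
empty_rows
constant_cols
```

Lowering `cutoff` in this sketch flags more rows and columns as empty, which is one way to show learners why the result depends on the chosen value.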