:::::::::::::::::::
The first step is to import the dataset into the working environment. This can be done by following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode, which involves loading the dataset into the `R` environment and viewing its structure and content.
```{r,eval=FALSE,echo=TRUE,message=FALSE}
# Read data
# (illustrative sketch: the file name below is an assumption; point it to your
# local copy of the simulated Ebola dataset)
raw_ebola_data <- rio::import("data/raw_ebola_data.csv")
```
Are any of those characteristics familiar from any previous data analysis you have done?

Lead a short discussion to relate the diagnosed characteristics to the required cleaning operations.
You can use the following terms to **diagnose characteristics**:
- *Codification*, like values in columns such as 'gender' and 'age' being recorded with a mix of numbers, letters, and words, or multiple date formats ("dd/mm/yyyy", "yyyy/mm/dd", etc.) appearing in the same column, such as 'date_onset'. Less visible, but also the column names themselves.
- *Missing*: how do we interpret an entry like "" in the 'status' column, or "-99" in another column? Do we have a data dictionary from the data collection process?
- *Inconsistencies*, like having a date of sample before the date of onset.
- *Non-plausible values*, like observations whose dates fall outside of the expected timeframe.
- *Duplicates*: are all observations unique?
You can use these terms to relate to **cleaning operations**:

- Standardize column names
- Standardize categorical variables like 'gender'
- Standardize date columns
- Convert character values into numeric
- Check the sequence of dated events
::::::::::::::::::::::::::::::
## A quick inspection
Quick exploration and inspection of the dataset are crucial to identify potential data issues before diving into any analysis tasks. The `{cleanepi}` package simplifies this process with the `scan_data()` function. Let's take a look at how you can use it:
```{r}
cleanepi::scan_data(raw_ebola_data)
```
The results provide an overview of the content of every character column, including the column names and the percentage of the different data types found within each column. You can see that the column names in the dataset are descriptive but lack consistency: some are composed of multiple words separated by white spaces. Additionally, some columns contain more than one data type, and others hold missing values in the form of an empty string.
## Common operations
This section demonstrates how to perform some common data cleaning operations using the `{cleanepi}` package.
163
162
164
163
### Standardizing column names
For this example dataset, standardizing column names typically involves removing spaces and connecting different words with "_". This practice helps maintain consistency and readability in the dataset. However, the function used for standardizing column names offers more options. Type `?cleanepi::standardize_column_names` in the console for more details.
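As a minimal sketch, assuming the imported data is stored in `raw_ebola_data` as in the earlier chunks, the basic call looks like this:

```{r, eval=FALSE}
# Standardize all column names: spaces removed, words connected with "_"
cleanepi::standardize_column_names(data = raw_ebola_data)
```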
If you want to maintain certain column names without subjecting them to the standardization process, you can utilize the `keep` argument of the function `cleanepi::standardize_column_names()`. This argument accepts a vector of column names that are intended to be kept unchanged.
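For example, a sketch of the call shape (the column name below is hypothetical, not from this dataset):

```{r, eval=FALSE}
# Keep the hypothetical column "case id" unchanged while standardizing the rest
cleanepi::standardize_column_names(data = raw_ebola_data, keep = "case id")
```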
::::::::::::::::::::::::::::::::::::: challenge
- What differences can you observe in the column names?

- Standardize the column names of the input dataset, but keep the name of the first column as it is.
You can try `cleanepi::standardize_column_names(data = raw_ebola_data, keep = "V
### Removing irregularities
Raw data may contain fields that don't add any variability to the data, such as **empty** rows and columns, or **constant** columns (where all entries have the same value). It can also contain **duplicated** rows. Functions from `{cleanepi}` like `remove_duplicates()` and `remove_constants()` remove such irregularities, as demonstrated in the code chunk below.
```{r}
# Remove constants
# (the object name "sim_ebola_data" is illustrative)
sim_ebola_data <- cleanepi::remove_constants(data = raw_ebola_data)

# Remove duplicates
sim_ebola_data <- cleanepi::remove_duplicates(data = sim_ebola_data)
```

Now, print the output to identify which constant columns you removed!

<!-- Note that our simulated Ebola dataset contains a few duplicates and a few constant columns. -->
::::::::::::::::::::: spoiler
#### How many rows did you remove? Which rows were removed?
You can get the number and location of the duplicated rows that were found. Run `cleanepi::print_report()`, wait for the report to open in your browser, and find the "Duplicates" tab.
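A sketch of that call, assuming `sim_ebola_data` holds the object returned by the cleaning functions above:

```{r, eval=FALSE}
# Open the data cleaning report in the browser
cleanepi::print_report(sim_ebola_data)
```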