forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathPA1_template.Rmd
More file actions
104 lines (84 loc) · 3.42 KB
/
PA1_template.Rmd
File metadata and controls
104 lines (84 loc) · 3.42 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
---
title: "Reproducible Research: Peer Assessment 1"
output:
html_document:
keep_md: true
---
## Loading and preprocessing the data
I started by unzipping the data and reading the csv file:
```{r}
d <- read.csv('activity.csv')
```
Checking the summary and the 'head' of the data, everything seems fine:
```{r}
summary(d)
head(d)
```
One could convert the date variable (which is numeric) to a Date variable, but it does not really matter.
## What is mean total number of steps taken per day?
I start by summarizing the data frame by day with the ddply function.
```{r}
library(plyr)
d_day <- ddply(d, .(date), summarize, total_steps = sum(steps))
```
Then, I plot the histogram:
```{r}
hist(d_day$total_steps, main='Histogram of number of steps per day', xlab='steps per day')
```
Mean and median are (ignoring NA values):
```{r}
mean(d_day$total_steps, na.rm=T)
median(d_day$total_steps, na.rm=T)
```
## What is the average daily activity pattern?
I again apply the ddply function for averaging over all days for each interval, and then plot the result.
```{r}
d_avgday <- ddply(d, .(interval), summarize, avg_steps = mean(steps, na.rm=T))
plot(d_avgday$interval, d_avgday$avg_steps, type='l', main='Average daily activity pattern', xlab='time of day', ylab='steps averaged over all days')
```
The maximum number of steps (on average over all days) occurs at: `r d_avgday$interval[which.max(d_avgday$avg_steps)]`.
```{r}
d_avgday$interval[which.max(d_avgday$avg_steps)]
```
## Imputing missing values
There are `r nrow(d) - nrow(na.omit(d))` rows with missing values:
```{r}
colSums(is.na(d))
```
I will follow the strategy of using the average number of steps over all days for intervals with missing values. I am using a good-old for loop, there must be more elegant ways of dealing with this (like apply for vectors), but it works...
```{r}
steps_filled <- numeric()
for(i in 1:nrow(d)) {
s <- d[i,]$steps
if(!is.na(s)) {
steps_filled <- c(steps_filled, s)
}
else {
steps_filled <- c(steps_filled, subset(d_avgday, interval==d[i,]$interval)$avg_steps)
}
}
d_filled <- d
d_filled$steps <- steps_filled
```
I redo the ddply calculation like before, but now for the dataset with filled-in values.
```{r}
d_day_filled <- ddply(d_filled, .(date), summarize, total_steps = sum(steps))
```
I again plot the histogram:
```{r}
hist(d_day_filled$total_steps, main='Histogram of number of steps per day (imputed NA values)', xlab='steps per day')
```
Mean and median are:
```{r}
mean(d_day_filled$total_steps, na.rm=T)
median(d_day_filled$total_steps, na.rm=T)
```
The differences in median and mean values are very small.
## Are there differences in activity patterns between weekdays and weekends?
The plot indicates some differences in activity patterns between weekdays and weekends.Weekdays show a marked peak in activity between 8 and 9 AM, and less activity throughout the day. Weekends also show a morning peak, but it is less marked and builds up later (almost no activity until 8 AM) and more activity is visible throughout the day.
```{r}
library(ggplot2)
d$weekday <- revalue(weekdays(as.Date(d$date, "%Y-%m-%d")), c('Saturday' = 'Weekend', 'Sunday' = 'Weekend', 'Monday' = 'Weekday', 'Tuesday' = 'Weekday', 'Wednesday' = 'Weekday', 'Thursday' = 'Weekday', 'Friday' = 'Weekday'))
d_avgday_2 <- ddply(d, .(interval, weekday), summarize, avg_steps = mean(steps, na.rm=T))
qplot(interval, avg_steps, data=d_avgday_2, geom="line", facets=weekday~.)
```