RepData_PeerAssessment1/PA1_template.Rmd at master · MathiasVersichele/RepData_PeerAssessment1 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---

## Loading and preprocessing the data
I started by unzipping the data and reading the csv file:
```{r}
d <- read.csv('activity.csv')
```

Checking the summary and the 'head' of the data, everything seems fine:
```{r}
summary(d)
head(d)
```

One could convert the date variable (which is numeric) to a Date variable, but it does not really matter.

## What is mean total number of steps taken per day?
I start by summarizing the data frame by day with the ddply function.
```{r}
library(plyr)
d_day <- ddply(d, .(date), summarize, total_steps = sum(steps))
```

Then, I plot the histogram:
```{r}
hist(d_day$total_steps, main='Histogram of number of steps per day', xlab='steps per day')
```

Mean and median are (ignoring NA values):
```{r}
mean(d_day$total_steps, na.rm=T)
median(d_day$total_steps, na.rm=T)
```


## What is the average daily activity pattern?
I again apply the ddply function for averaging over all days for each interval, and then plot the result.
```{r}
d_avgday <- ddply(d, .(interval), summarize, avg_steps = mean(steps, na.rm=T))
plot(d_avgday$interval, d_avgday$avg_steps, type='l', main='Average daily activity pattern', xlab='time of day', ylab='steps averaged over all days')
```

The maximum number of steps (on average over all days) occurs at: `r d_avgday$interval[which.max(d_avgday$avg_steps)]`.
```{r}
d_avgday$interval[which.max(d_avgday$avg_steps)]
```


## Imputing missing values
There are `r nrow(d) - nrow(na.omit(d))` rows with missing values:
```{r}
colSums(is.na(d))
```

I will follow the strategy of using the average number of steps over all days for intervals with missing values. I am using a good-old for loop, there must be more elegant ways of dealing with this (like apply for vectors), but it works...
```{r}
steps_filled <- numeric()
for(i in 1:nrow(d)) {
  s <- d[i,]$steps
  if(!is.na(s)) {
    steps_filled <- c(steps_filled, s)
  }
  else {
    steps_filled <- c(steps_filled, subset(d_avgday, interval==d[i,]$interval)$avg_steps)
  }
}

d_filled <- d
d_filled$steps <- steps_filled
```

I redo the ddply calculation like before, but now for the dataset with filled-in values.
```{r}
d_day_filled <- ddply(d_filled, .(date), summarize, total_steps = sum(steps))
```

I again plot the histogram:
```{r}
hist(d_day_filled$total_steps, main='Histogram of number of steps per day (imputed NA values)', xlab='steps per day')
```

Mean and median are:
```{r}
mean(d_day_filled$total_steps, na.rm=T)
median(d_day_filled$total_steps, na.rm=T)
```

The differences in median and mean values are very small.

## Are there differences in activity patterns between weekdays and weekends?
The plot indicates some differences in activity patterns between weekdays and weekends.Weekdays show a marked peak in activity between 8 and 9 AM, and less activity throughout the day. Weekends also show a morning peak, but it is less marked and builds up later (almost no activity until 8 AM) and more activity is visible throughout the day.

```{r}
library(ggplot2)
d$weekday <- revalue(weekdays(as.Date(d$date, "%Y-%m-%d")), c('Saturday' = 'Weekend', 'Sunday' = 'Weekend', 'Monday' = 'Weekday', 'Tuesday' = 'Weekday', 'Wednesday' = 'Weekday', 'Thursday' = 'Weekday', 'Friday' = 'Weekday'))
d_avgday_2 <- ddply(d, .(interval, weekday), summarize, avg_steps = mean(steps, na.rm=T))
qplot(interval, avg_steps, data=d_avgday_2, geom="line", facets=weekday~.)
```